# stringids: map strings to a numbers and use those in string heavy apps
suppose you are working on a compiler. you might be doing a lot of short of string operations. each variable has a string identifier after all. if you keep looking up variables via strings, you will have the following problems:
if these strings don't grow unbounded, you could simply map each string to a unique number and use those in your data structures. once you work with integers, all the above issues go away:
now the question is, how do you implement mapping strings to integers and back? in this post i'll explore implementing this in go. in @/dynmem i mentioned pointers should be avoided in performant code so i'll focus on a pointerless solution.
# interface
let's compare several implementations of the problem. for simplicity let's go with this interface:
type ID int64 // Mapper is an interface for mapping strings to integers and vice versa. // As a convenience, the empty string is always mapped to 0 and vice versa. type Mapper interface { // Map maps a string to an ID. // if s was seen previously, the previously allocated ID will be returned. // otherwise a new ID is allocated. Map(s string) ID // Unmap returns the string that represents ID. // it is a panic-level error to provide an ID not returned from Map. Unmap(id ID) string }
# motivational example: acl system
linux users and groups is an example where a similar idea is used. each user is assigned a number in /etc/passwd and each group is assigned a number in /etc/group. then functions like geteuid() or stat() only return those id numbers rather than strings.
but let's look at how you could implement an optimized acl system with the above interface. you might start out like this:
type SlowACL struct { members []string } type SlowACLs struct { acls map[string]SlowACL } func (acls *SlowACLs) IsMember(acl, member string) bool { for _, m := range acls.acls[acl].members { if m == member { return true } } return false }
if you have millions of users, this can get pretty rough. now the gc must visit millions of nodes every gc cycle. and the member lookup needs to do a lot of cache unfriendly string comparisons.
now compare that solution with a stringid based one:
type ACL struct { members []stringid.ID } type ACLs struct { acls map[stringid.ID]ACL idmapper stringid.HashMapper } func (a *ACLs) IsMember(acl, member string) bool { query := a.idmapper.Map(member) for _, id := range a.acls[a.idmapper.Map(acl)].members { if id == query { return true } } return false }
assuming the idmapper has a constant gc load, this solution creates much less load on the gc. there's only one pointer per acl group. there are tricks to avoid even that but that's outside the scope of this post.
and also note the query loop. it's just a cache friendly linear scan. it's pretty performant compared to the previous solution.
# testing
i'll implement 3 variations of mappers. to doublecheck my implementations, i'd run them through this function:
func test() error { for _, m := range []Mapper{&SimpleMapper{}, &IndexMapper{}, &HashMapper{}} { fmt.Println("testing a mapper:") for _, w := range []string{ "apple", "pear", "orange", "orange", "apple", "peach", "each", "orange", "kiwi", "apple", } { id := m.Map(w) fmt.Printf(" %8s %8s %16d\n", w, m.Unmap(id), id) } } return nil } func main() { if err := test(); err != nil { log.Fatal(err) } }
and then i'd spotcheck the output to confirm it makes sense. i could create proper testcases too but meh, this should be enough for the demo purposes.
# simple implementation
the simple implementation could be done via maps:
type SimpleMapper struct { idmap map[ID]string strmap map[string]ID } func (m *SimpleMapper) Map(s string) ID { // initialize if needed. if m.idmap == nil { m.idmap, m.strmap = make(map[ID]string), make(map[string]ID) m.idmap[0], m.strmap[""] = "", 0 } if id, ok := m.strmap[s]; ok { return id } id := ID(len(m.strmap)) m.strmap[s], m.idmap[id] = id, s return id } func (m *SimpleMapper) Unmap(id ID) string { return m.idmap[id] }
and the test output looks as expected:
apple apple 1 pear pear 2 orange orange 3 orange orange 3 apple apple 1 peach peach 4 each each 5 orange orange 3 kiwi kiwi 6 apple apple 1
straightforward and fast but it comes with two problems.
so let's explore different alternatives.
# all strings in one buffer
what if we had all strings concatenated in one long string and the IDs would be just offsets? with such an implementation there would be zero fragmentation. and there would be no memory pressure either because there would be only one pointer to be stored, the long string.
the problem is that every time we append a string to the long string, it might need a reallocation. but if you double the buffer every time you run out of capacity, the running time will be still amortized O(n). so this string appending shouldn't be a problem.
strings are immutable in go but fortunately strings.Builder is willing to hack around this limit. with the Builder struct we can have efficient string appending.
# IDs are offsets
we had this definition of the ID:
type ID int64
for simplicity, let's assume the total length of all the strings is less than a gigabyte so it fits into a 31 bit integer. then the lower 32 bits of the ID is the start offset, the upper 32 bits are the end offset.
an Unmap operation could look like this:
type IndexMapper struct { s strings.Builder } func (m *IndexMapper) Unmap(id int) string { if id == 0 { return "" } start, end := int(id&0xffffffff), int(id>>32) return m.s.String()[start:end] }
# index mapping
suppose all you have is that long string buffer and you want to map a string to an ID. there are two cases:
implementation-wise this would look like this:
func (m *IndexMapper) Map(s string) ID { // return from buffer if possible. if start := strings.Index(m.s.String(), s); start != -1 { return ID(int64(start) | int64(start+len(s))<<32) } // append string to the buffer. if m.s.Len()+len(s) > 1e9 { panic("stringid.IndexMapper grew too big") } m.s.WriteString(s) // must re-search here to guarantee the first entry is returned. start := strings.Index(m.s.String(), s) id := ID(int64(start) | int64(start+len(s))<<32) return id }
and the test confirms this works:
apple apple 500000000 pear pear 900000005 orange orange f00000009 orange orange f00000009 apple apple 500000000 peach peach 140000000f each each 1400000010 orange orange f00000009 kiwi kiwi 1800000014 apple apple 500000000
the only problem of this approach is that this is that mapping gets very slow as the buffer size grows.
# hashing IDs
to address this, let's introduce a hashmap for finding strings quickly:
type HashMapper struct { len, cap uint64 htab []ID s strings.Builder }
we must implement our own hashmap because map[string]ID would create gc pressure.
a Map operation would consist of these operations:
for simplicity let's keep our hashtable power of two sized. we need a string hasher:
func strhash(s string) uint64 { var h uint64 = 5381 for i := 0; i < len(s); i++ { h = 33*h + uint64(s[i]) } return h }
and the above steps could be implemented like this:
func (m *HashMapper) Map(s string) ID { if s == "" { return 0 } // initialize if needed. if m.s.Len() == 0 { // initialize. m.cap, m.htab = 4, make([]ID, 4) } // find existing entry in the hashtable. mask := m.cap - 1 h := strhash(s) slot := h & mask for m.htab[slot] != 0 { if id := m.htab[slot]; s == m.Unmap(id) { return id } slot = (slot + 1) & mask } // append to s and add the resulting ID to the hashtable. start, end := m.s.Len(), m.s.Len()+len(s) if end > 1e9 { panic("stringid.HashMapper grew too big") } m.s.WriteString(s) id := ID(int64(start) | int64(end)<<32) m.htab[slot] = id m.len++ // resize the hashtable if needed. if m.len >= m.cap/2 { newcap := m.cap * 4 mask = newcap - 1 newhtab := make([]ID, newcap) for i := uint64(0); i < m.cap; i++ { id := m.htab[i] if id == 0 { continue } h := strhash(m.Unmap(id)) & mask for j := h; true; j = (j + 1) & mask { if newhtab[j] == 0 { newhtab[j] = id break } } } m.cap, m.htab = newcap, newhtab } return id }
the Unmap function would remain the same as in the IndexMapper. and the test function confirms that this works too:
apple apple 500000000 pear pear 900000005 orange orange f00000009 orange orange f00000009 apple apple 500000000 peach peach 140000000f each each 1400000010 orange orange f00000009 kiwi kiwi 1800000014 apple apple 500000000
there are couple downsides to this approach.
but even if the mapper is a bit inefficient, it's hopefully offset by the more efficient data structures this allows having.
# caveats
this sort of optimization could come at a loss of code readability. the readers now have to understand the stringid abstraction over simple strings. only do such optimizations if it was deemed necessary.
another potential optimization avenue would be to add "Lookup(s string) ID" function. it would return -1 if the mapper doesn't contain the requested string. basically it's the same as the Map() function but without the "add to the hashmap" logic. the example acl system could use this instead of Map(member). non-existent users would then not grow the mapper's data. this can be important if you don't fully control your callers.
i have to admit i never used this structure before because i never worked on a large go program where this actually would have mattered at all. but it's a nice example how could one think about removing pointers from their go code. maybe one day i'll have a chance to use this and then i can report back on its usefulness.
# edit on 2023-08-12
btw, i learned that java had a similar idea about optimizing strings too. here are some docs describing the feature:
# edit on 2024-02-04
i just noticed that go is getting a new "unique" package: https://github.com/golang/go/issues/62483. the newly proposed handles can make the string comparisons fast. but they still contain a pointer so the gc pressure remains high. my points in this post still stand.
# edit on 2024-05-09
i came across a cool blog post better explaining the pointer problem: https://blog.gopheracademy.com/advent-2018/avoid-gc-overhead-large-heaps/. the code to demonstrate the problem is pretty short:
func main() { a := make([]*int, 1e9) for i := 0; i < 10; i++ { start := time.Now() runtime.GC() fmt.Printf("GC took %s\n", time.Since(start)) } runtime.KeepAlive(a) }
now change the `*int` to `int`. on my machine the gc cycle speeds up from 300ms to 0.2ms. that's because the gc has a billion pointers less to traverse in the latter version.
and then at the end of the article it links to some go libraries implementing pretty much what i was arguing for here:
i haven't looked too deeply but they seem pretty cool! i recommend checking them if you need something like this.
published on 2023-05-06, last modified on 2024-05-09
new comment
see @/comments for the mechanics and ratelimits of commenting.