Expand description

String interning

A string interner is a data structure that enables strings to be represented as integers in a computer program. Interning strings is often an optimization because only one copy of each distinct string is stored, the string type is smaller and more cache friendly, and string operations like comparisons are faster. The cost of string interning (at least as implemented here) is that once a string is interned, it is never deallocated.

When using the Interner in this module, strings are interned using the get_or_intern method. This method returns a key. If the same string is interned twice, the same key is returned. The type of the key is fixed for each instance of the interner, and can be any type that implements the Key trait. By default the interner uses std::num::NonZeroU32, which is a 32-bit integer.

Given a key, the original string value can be recovered using the resolve method.

let mut interner: Interner = Default::default();
let hello_1 = interner.get_or_intern("hello");
let world_1 = interner.get_or_intern("world");
let hello_2 = interner.get_or_intern("hello");
assert_eq!(hello_1, hello_2);
assert_ne!(hello_1, world_1);

assert_eq!(interner.resolve(hello_1), Some("hello"));
assert_eq!(interner.resolve(world_1), Some("world"));

assert_eq!(interner.get("hello"), Some(hello_1));
assert_eq!(interner.get("other"), None);

The code in the interner is written from scratch, but all aspects of it (the algorithm, the API, even some variable names) are based on Robin Freyler’s string-interner crate. For this reason the code is jointly copyrighted between Robin Freyler and the Texcraft contributors.

The implementation

The algorithm is based on the string-interner crate. This algorithm is also separately discovered and discussed in a post by Mat Klad.

The interner maintains a String buffer, and each time a new string is interned it’s appended to the buffer. A vector of indices is used to record the position of each string in the buffer. When a new string is added to the buffer, the current length of the buffer (which is the end index of the string in the buffer) is appended to the vector. The key of the string is then the length of the vector when the index is appended. Thus using the key, we can easily find the end index.

To recover a string using its key, we get the end index from the vector. We get the start index by getting the end index of the previous string that was interned. Given the process described above, the start index is stored in the vector just before the end index. The recovered string is then the substring of the buffer between these two indices.

This handles adding new strings. A key property of the interner is that it also deduplicates strings. The naive way to do this is to maintain a map from strings to keys, and first search in this map for the string. The problem with this approach is that it requires a costly second allocation of each interned string in this map.

The string-interner crate and this module use different approaches to fix this. In the crate, the map is keyed on the interned string’s integer key. When performing operations on the map, hash and equality of keys is based on the underlying string.

In this module, the map is keyed on the u64 hash of each string, which is computed outside of the map. There is an edge case in which the hashes of two strings collide. For this reason the value of the map is a linked list of all string keys that have the corresponding hash. When checking if a string exists, we walk the linked list and check if the resolved string for each key matches. If not, we intern the string and append to the linked list. In the worst case this can result in O(n) lookups, but in reality hash collisions are rare.

Structs

Traits

  • Types implementing this trait can be used as keys in the Interner.