Jackson Allan · Published 29 May 2024 · Updated 2 June 2024
Although thorough hash-table benchmarks have already been published by others in recent years, I have decided to contribute another benchmarking suite for several reasons. Firstly, the existing publications concentrate on C++ tables, whereas there is relatively little coverage of C tables. Consequently, C programmers tend to be directed towards older, well-known libraries despite the existence of newer libraries that may offer superior performance. Secondly, the results reported by the existing benchmarks are mostly cumulative, so they do not clearly show how the tables tested perform under specific conditions such as the table’s current load factor. Moreover, some existing benchmarks test only one key type or bucket size, which makes it hard to draw general conclusions about a given table’s performance. The final motivation is a personal interest: I recently released a novel generic hash table in C (via Verstable and CC) and would like to showcase how it measures up against both the leading C++ tables and other tables available in the C landscape.
Click here to skip to the results.
The benchmarks include three hash table configurations designed to measure the tables’ performance under diverse conditions:
The 32-bit integer key, 32-bit value benchmarks test how the tables perform when the key hash and comparison functions are cheap, traversing buckets is cheap (i.e. does not cause many cache misses), and moving key-value pairs is cheap. These benchmarks disadvantage tables that store metadata in a separate array because doing so necessarily causes at least one extra cache miss per lookup.
The 64-bit integer key, 448-bit value benchmarks test how the tables perform when the hash and comparison functions are cheap, traversing buckets is expensive (one cache miss per bucket on architectures with 64-byte cache lines), and moving key-value pairs is expensive. These benchmarks disadvantage tables that do not store metadata in a separate array (or do so but access the buckets array with every probe anyway to check the key) and that move key-value pairs around often (e.g. Robin Hood hash tables).
The 16-char c-string key, 64-bit value benchmarks test how the tables perform when the hash and comparison functions are expensive. The keys that the tables store are pointers into an array of contiguously allocated strings. These benchmarks disadvantage tables that lack a metadata mechanism to limit key comparisons or that rehash keys often.
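For concreteness, the three configurations correspond roughly to the following key and value types (a sketch only; the type names are illustrative and not taken from the benchmark code):

#include <stdint.h>

typedef uint32_t small_key;                          /* 32-bit integer key                */
typedef uint32_t small_value;                        /* 32-bit value                      */

typedef uint64_t large_key;                          /* 64-bit integer key                */
typedef struct { uint64_t words[ 7 ]; } large_value; /* 448-bit (56-byte) value           */

typedef const char *string_key;                      /* pointer into contiguous c-strings */
typedef uint64_t string_value;                       /* 64-bit value                      */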
All the tables benchmarked are configured to use the same hash functions, namely the MurmurHash3 finalizer for integer keys and FNV-1a for string keys.
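For reference, these functions look roughly like the following (a sketch; the benchmark code may differ in minor details):

#include <stdint.h>

/* MurmurHash3 64-bit finalizer, used as the integer hash function. */
static uint64_t murmur3_fmix64( uint64_t key )
{
    key ^= key >> 33;
    key *= 0xFF51AFD7ED558CCDULL;
    key ^= key >> 33;
    key *= 0xC4CEB9FE1A85EC53ULL;
    key ^= key >> 33;
    return key;
}

/* FNV-1a (64-bit), used as the string hash function. */
static uint64_t fnv1a( const char *str )
{
    uint64_t hash = 0xCBF29CE484222325ULL;
    while( *str )
    {
        hash ^= (unsigned char)*str++;
        hash *= 0x100000001B3ULL;
    }
    return hash;
}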
The maximum load factor for all tables is set to 0.875. Hence, and because all the tables—except for std::unordered_map under GCC—grow by doubling the bucket count once the maximum load factor is reached, the results show the performance of each table across the load-factor range of approximately 0.44 (the troughs in the graphs) to 0.875 (the peaks). I chose this limit because it is relatively high and several tables with hard-coded, non-customizable maximum load factors (namely absl::flat_hash_map and boost::unordered_flat_map) use it.
The integer-key data set consists of sequential integers shuffled randomly. The string-key data set consists of sequential NULL-terminated strings shuffled randomly.
The code was compiled in C++ using GCC 13.2.0, via MinGW-w64, with -O3 and -DNDEBUG. The benchmarks were run on an AMD Ryzen 7 5800H with the frequency locked at 90%.
As for the benchmarks themselves, they consist of the following:
Total time to insert N nonexisting keys: In this benchmark, N unique keys are inserted into the table. At even intervals during this process, the total time elapsed since the beginning of the benchmark is recorded. Hence, the results of this benchmark are cumulative, unlike those of all the other benchmarks below.
Time to erase 1,000 existing keys with N keys in the table: In this benchmark, N unique keys are inserted into the table. At even intervals during this process, 1,000 keys are erased, and the time taken to complete this operation is recorded. These keys consist of a randomly selected 1,000-key sequence within the sequence of keys already inserted. After the keys are erased, they are reinserted into the table before the process of inserting unique keys continues.
Time to replace 1,000 existing keys with N keys in the table: In this benchmark, N unique keys are inserted into the table. At even intervals during this process, 1,000 keys are reinserted, and the time taken to complete this operation is recorded. These keys consist of a randomly selected 1,000-key sequence within the sequence of keys already inserted.
Time to erase 1,000 nonexisting keys with N keys in the table: In this benchmark, N unique keys are inserted into the table. At even intervals during this process, the erase function is called for 1,000 keys that have not been inserted, and the time taken to complete this operation is recorded. These keys consist of a randomly selected 1,000-key sequence within a separate sequence of keys that are never inserted.
Time to look up 1,000 existing keys with N keys in the table: In this benchmark, N unique keys are inserted into the table. At even intervals during this process, 1,000 keys are looked up, and the time taken to complete this operation is recorded. These keys consist of a randomly selected 1,000-key sequence within the sequence of keys already inserted.
Time to look up 1,000 nonexisting keys with N keys in the table: In this benchmark, N unique keys are inserted into the table. At even intervals during this process, the lookup function is called for 1,000 keys that have not been inserted, and the time taken to complete this operation is recorded. These keys consist of a randomly selected 1,000-key sequence within a separate sequence of keys that are never inserted.
Time to iterate over 5,000 keys with N keys in the table: In this benchmark, N unique keys are inserted into the table. At even intervals during this process, 5,000 keys in the table are iterated over, and the time taken to complete this operation is recorded. The position where the iteration begins is found by looking up a random key from the sequence of keys already inserted.
Of the aforementioned benchmarks, Time to replace 1,000 existing keys with N keys in the table, Time to erase 1,000 nonexisting keys with N keys in the table, Time to look up 1,000 existing keys with N keys in the table, Time to look up 1,000 nonexisting keys with N keys in the table, and Time to iterate over 5,000 keys with N keys in the table are performed side-by-side as part of the same process of inserting N unique keys.
This benchmarking setup has some limitations. Firstly, the range of load factors tested is—as previously mentioned—approximately 0.44 to 0.875. Hence, the results do not show the performance at lower load factors. As some of the tables have default maximum load factors as low as 0.5, they are being measured mostly outside the load-factor range in which their authors intended them to operate. Secondly, the benchmarks do not measure memory usage. However, the memory usage of each open-addressing table, at least, can be estimated based on the maximum load factor it can reasonably tolerate and its memory overhead per bucket. Thirdly, although the choice of the MurmurHash3 finalizer as the integer hash function ensures that all the tables compete on equal footing and none are impaired by a weak hash function, it may disadvantage tables that can tolerate a fast but weak hash function (e.g. the default integer hash function for std::unordered_map is usually an identity function, whereas ankerl::unordered_dense, absl::flat_hash_map, boost::unordered_flat_map, CC’s cc_map, and Verstable all require or benefit from a hash function that provides entropy across all bits of the hash code). Finally, the benchmarks do little to show the cumulative impact of tombstones on the performance of the tables that rely on them when erasures are frequent.
The complete code of the benchmarks is available here.
absl::flat_hash_map v20240116.2:
Developed by Google, this table is thoroughly documented on the Abseil website and via two presentations. It is an open-addressing table that stores a 7-bit fragment of each key’s hash code in a separate array and uses SIMD instructions to scan this array for potential key matches 16 buckets at a time. It relies on tombstones for erasure.
This table’s approximate memory overhead is one byte, plus key-value padding, per bucket, in addition to the size of a key-value pair per vacant bucket.
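The core of this SIMD probing can be sketched as follows (an illustration of the general SwissTable-style technique, not Abseil’s actual code): 16 control bytes, each holding a 7-bit hash fragment or a sentinel value, are compared against the lookup key’s fragment in a single SSE2 instruction, yielding a bitmask of candidate buckets whose keys must then be compared directly.

#include <emmintrin.h>
#include <stdint.h>

static uint16_t match_group( const int8_t *control_bytes, int8_t hash_fragment )
{
    __m128i group = _mm_loadu_si128( (const __m128i *)control_bytes );
    __m128i matches = _mm_cmpeq_epi8( group, _mm_set1_epi8( hash_fragment ) );
    return (uint16_t)_mm_movemask_epi8( matches ); /* one bit per candidate bucket */
}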
ankerl::unordered_dense v4.1.2:
This table employs Robin Hood ordering—which is an open-addressing variant that moves key-value pairs around to keep their displacements from the buckets to which they hash as constant as possible—in conjunction with linear probing. However, it makes two additions to the Robin Hood design. Firstly, rather than storing key-value pairs inside the table’s buckets, it stores them contiguously in a separate array. The table buckets store indices into this array. Secondly, it stores an 8-bit fragment of each key’s hash code to limit the need to compare keys directly. This table is the successor to an earlier table from the same author, which I have excluded from this article because it is deprecated and proved to be inferior in my earlier testing.
By default, this table can only accommodate 2³² (4.29 billion) key-value pairs. Support for a higher number can be enabled at the cost of additional memory usage and less cache friendliness. For these benchmarks, I used the default limit.
On x86-64, this table’s approximate memory overhead is eight bytes (or 12 bytes, for tables that can accommodate more than 2³² key-value pairs), plus key-value padding, per bucket, in addition to the size of a key-value pair per vacant bucket.
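The bucket layout described above can be sketched as follows (an illustration only, not the library’s actual definitions): each bucket packs a probe distance with an 8-bit hash fragment and stores a 32-bit index into the dense key-value array, for eight bytes per bucket in the default configuration.

#include <stdint.h>

typedef struct
{
    uint32_t dist_and_fragment; /* probe distance packed with an 8-bit hash fragment          */
    uint32_t dense_index;       /* index into the contiguous key-value array; 32 bits gives   */
                                /* the default 2³² limit on the number of pairs               */
} index_bucket;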
boost::unordered_flat_map v1.85.0:
This table, too, is an open-addressing table that stores hash-code fragments in a separate array and uses SIMD instructions to scan them for potential key matches multiple buckets at a time. However, it differs from absl::flat_hash_map in several important ways. Firstly, keys are hashed not to individual buckets but to 15-bucket groups, which fill up contiguously from one end to the other:
[Diagram omitted: Clustering of key-value pairs in boost::unordered_flat_map vs absl::flat_hash_map at a load factor of 0.44.]
Secondly, the hash-code fragments consist of 7.99 bits rather than 7 bits. Thirdly, instead of tombstones, the table uses a group-level bloom filter that its authors call an “overflow byte”, which also accelerates lookups of nonexisting keys (below, I consider this mechanism “tombstone-like” because erasures still leave a residual impact on the table’s performance and cause more frequent full-table rehashing). The above details and more are documented via a presentation.
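The overflow-byte check can be sketched as follows (an illustration only; the exact bit-selection rule is an assumption rather than a detail taken from Boost’s code): when a key cannot be placed in its home group, a bit derived from its hash is set in that group’s overflow byte, and a lookup of a nonexisting key can terminate at the first group whose corresponding bit is clear.

#include <stdbool.h>
#include <stdint.h>

static bool group_may_have_overflowed( uint8_t overflow_byte, uint64_t hash )
{
    return ( overflow_byte >> ( hash % 8 ) ) & 1; /* clear bit: the key cannot be further along */
}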
This table’s approximate memory overhead is 1.07 bytes, plus key-value padding, per bucket, in addition to the size of a key-value pair per vacant bucket.
For the sake of easy distribution with this benchmarking project, I have opted to use an amalgamated single-header version of this table provided by a third party. However, Boost.Unordered, the Boost module that contains this table, can also be obtained and installed separately from the rest of the library straight from the official sources.
emilib2::HashMap:
This table is yet another open-addressing table that uses SIMD instructions to scan multiple hash-code fragments at once for potential key matches. The library is still under development, and details about its implementation are sparse and liable to change. Its approximate memory overhead is 1.25 bytes, plus key-value padding, per bucket, in addition to the size of a key-value pair per vacant bucket.
The version benchmarked is emilib2o.hpp 434a205, which I modified to set the maximum load factor to 0.875.
ska::bytell_hash_map:
This table is the culmination of its author’s experiments with various hash-table designs, including Robin Hood and SIMD-accelerated tables. It is documented most thoroughly by a presentation he delivered. The table is a hybrid of open addressing and separate chaining. Key-value pairs overflowing from one bucket are stored in otherwise vacant buckets in the same array and chained together using 7-bit “jump distances” (indices into a hard-coded array of possible distances, in terms of buckets, to the next key-value pair in the chain). Groups of this metadata—i.e. 7-bit jump distances coupled with 1-bit flags—are stored interspersed with groups of buckets (every 16 bytes of metadata is followed by the 16 corresponding buckets). This design shares similarities with an older technique called coalesced hashing, except that the chains do not coalesce.
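The one-byte metadata described above can be sketched as follows (an illustration only; the bit assignment and distance values are assumptions, not details taken from the library’s source):

#include <stdint.h>

static const uint16_t jump_distances[ 128 ] =
{
    /* hard-coded, roughly increasing distances, in buckets */
    1, 2, 3, 4 /* ...and so on */
};

static uint16_t distance_to_next_in_chain( uint8_t metadata )
{
    return jump_distances[ metadata & 0x7F ]; /* low seven bits index the distance table */
}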
This table’s approximate memory overhead is one byte, plus key-value padding, per bucket, in addition to the size of a key-value pair per vacant bucket.
Unfortunately, the library appears to be unmaintained.
std::unordered_map from GCC 13.2.0:
This is the go-to hash table for many developers because it is part of C++’s standard library. While implementations may differ in their minor details, the constraints imposed by the C++ Standard effectively dictate that this table uses node-based separate chaining, rather than open addressing.
On x86-64, this table’s approximate memory overhead is eight bytes per bucket and eight bytes, plus pointer-key-value padding and malloc header and padding, per key-value pair. However, because of the fragmentation of available memory caused by many tiny allocations, its actual memory impact may be greater.
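The node-based layout responsible for these overheads can be sketched as follows (an illustration of the general design, not libstdc++’s actual code): the table is an array of bucket pointers, and every key-value pair lives in its own heap allocation alongside a next pointer.

#include <stddef.h>
#include <stdint.h>

typedef struct chain_node
{
    struct chain_node *next; /* eight bytes of per-node overhead on x86-64 */
    uint64_t key;            /* key and value follow, padded as necessary  */
    uint64_t value;
} chain_node;

typedef struct
{
    chain_node **buckets;    /* eight bytes per bucket on x86-64           */
    size_t bucket_count;
} chained_table;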
tsl::robin_map v1.3.0:
This is another popular Robin Hood table. Unlike ankerl::unordered_dense, it stores key-value pairs, displacements from their ideal buckets, and (under some conditions) hash codes together inside the buckets array. It is therefore a more conventional implementation of the Robin Hood design.
On x86-64, this table’s approximate memory overhead is three bytes, plus uint16_t-bool-key-value padding, per bucket, in addition to the size of a key-value pair per vacant bucket.
cc_map from CC v1.1.1:
This table implements the same hash-table design as Verstable within the constraints of CC’s API. See below for more details.
khash from klib v0.2.8:
This is a very popular open-addressing table that uses quadratic probing by default. It stores keys and values in two separate arrays (in addition to a third metadata array), rather than interspersed together in one buckets array. This choice conserves memory by eliminating padding bytes, but it means that lookups of existing keys necessarily entail extra cache misses. The table relies on tombstones for erasure.
This table’s approximate memory overhead is only 0.25 bytes per bucket, in addition to the size of a key and size of a value per vacant bucket.
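The parallel-array layout described above can be sketched as follows (an illustration only; khash’s actual code is macro-generated): keys, values, and two-bit bucket states live in three separate allocations, which eliminates key-value padding but can cost an extra cache miss when a matching key’s value is read.

#include <stddef.h>
#include <stdint.h>

typedef struct
{
    uint32_t *flags; /* two bits per bucket (empty/deleted), hence ~0.25 bytes per bucket */
    uint64_t *keys;  /* keys, stored contiguously                                         */
    uint64_t *vals;  /* values, stored contiguously                                       */
    size_t n_buckets;
} parallel_array_table;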
Note that klib also includes a newer hash table that is called khashl and uses linear probing without tombstones. This table is not included in these benchmarks.
DICT from M*LIB v0.7.3:
This is another open-addressing table that uses quadratic probing by default. Like ankerl::unordered_dense and stb_ds’ hm and sh (described below), it stores key-value pairs in an array separate from the buckets array. The buckets array stores indices into the key-value pairs array, along with hash codes. Unlike those two other tables, it does not store key-value pairs contiguously. This is because when erasing, it does not move the last key-value pair in the array backward to fill the gap created (a process that requires another lookup to update the index of the moved pair stored in the buckets array). Erasures rely on tombstones.
This table, too, can only accommodate 2³² key-value pairs by default, and enabling support for a higher number impacts memory usage and cache efficiency. For these benchmarks, I used the default limit.
This table’s approximate memory overhead is eight bytes (or 16 bytes, for tables that can accommodate more than 2³² key-value pairs), plus key-value padding, per bucket, in addition to the size of a key-value pair per vacant bucket.
DICT_OA from M*LIB v0.7.3:
Like DICT, DICT_OA is an open-addressing table using quadratic probing by default. However, it is more conventional in that it stores keys and values together inside the table buckets. Its standout feature is that instead of storing per-bucket metadata, it requires users to reserve two keys to mark empty buckets and tombstones. Hence, the table can often store data more densely and, therefore, in a more cache-friendly manner. For the benchmarks involving integer keys, I have opted to reserve two integer values to act as these sentinels (rather than manually coupling each key with an extra flag) so that the table can take advantage of this feature. However, when configured thusly, this table cannot technically accommodate the full range of keys that the other tables can.
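The sentinel-based configuration described above can be sketched as follows (an illustration only, not M*LIB’s actual API): two key values are reserved to mark empty buckets and tombstones, so no per-bucket metadata is needed, at the cost of slightly narrowing the range of storable keys.

#include <stdint.h>

#define EMPTY_SENTINEL   UINT64_MAX         /* reserved key marking a never-used bucket */
#define DELETED_SENTINEL ( UINT64_MAX - 1 ) /* reserved key marking an erased bucket    */

typedef struct
{
    uint64_t key; /* EMPTY_SENTINEL, DELETED_SENTINEL, or a genuine key */
    uint64_t value;
} sentinel_bucket;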
This table’s approximate memory overhead is only the key-value padding per bucket and the size of a key-value pair per vacant bucket.
Note that the _OA suffix, as a means of distinguishing this table from M*LIB’s similarly named DICT, is a misnomer that exists for API backwards compatibility with earlier versions of the library, in which DICT used separate chaining. Now, both DICT and DICT_OA are open-addressing tables.
hm and sh from stb_ds v0.67:
This open-addressing table appears to use linear probing in conjunction with quadratic probing at the bucket-group level. Its remarkable feature, when it comes to performance, is that—like ankerl::unordered_dense—it stores key-value pairs not directly in the hash-table buckets but contiguously in a separate array. This table is also split between one implementation for string keys and another implementation for keys of all other data types. It relies on tombstones for erasure.
Unfortunately, this table offers little flexibility: to customize the hash functions, comparison functions, and maximum load factor, I had to modify the library header. Nevertheless, I decided to include this library in the benchmarks because of its immense popularity.
On x86-64, this table’s approximate memory overhead is 16 bytes, plus key-value padding, per bucket, in addition to the size of a key-value pair per vacant bucket.
hmap from STC v5.0 beta 4:
This table is an open-addressing table using linear probing. However, it stores a 7-bit fragment of each key’s hash code in a separate array to limit direct key comparisons. One unusual feature of this table is that instead of relying on tombstones, an erasure at a given bucket shifts subsequent key-value pairs whose probe sequences include that bucket backward to fill the gap. This technique is commonly used by Robin Hood tables but not conventional linear-probing tables.
This table’s approximate memory overhead is one byte, plus key-value padding, per bucket, in addition to the size of a key-value pair per vacant bucket.
uthash v2.3.0:
This is the oldest and perhaps most popular C hash table included in the benchmarks. Like std::unordered_map, it is a node-based, separate-chaining table. Compared to the other tables, the functionality it provides is rudimentary, with users having to write much of the scaffolding themselves. It is also the only intrusive table included in these benchmarks.
On x86-64, this table’s approximate memory overhead is 16 bytes per bucket and 56 bytes, plus malloc header and padding (assuming that the user allocates key-value pairs individually) and key-value padding, per key-value pair.
For reference, the 56-byte struct that uthash requires to be embedded with every key-value pair looks roughly as follows (based on uthash’s UT_hash_handle; the comments here are added, and the exact definition should be checked against the version in use):
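typedef struct UT_hash_handle
{
    struct UT_hash_table *tbl;      /* table to which this handle belongs    */
    void *prev;                     /* previous element in insertion order   */
    void *next;                     /* next element in insertion order       */
    struct UT_hash_handle *hh_prev; /* previous handle in the same bucket    */
    struct UT_hash_handle *hh_next; /* next handle in the same bucket        */
    const void *key;                /* pointer to the enclosing struct’s key */
    unsigned keylen;                /* key length, in bytes                  */
    unsigned hashv;                 /* cached hash of the key                */
} UT_hash_handle;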
Verstable v2.1.0:
Like ska::bytell_hash_map, this table is a hybrid of open addressing and separate chaining that stores key-value pairs overflowing from one bucket in otherwise vacant buckets of the flat buckets array. However, rather than chaining key-value pairs using a 7-bit index into an array of “jump distances”, it does so using an 11-bit integer denoting quadratic displacement. It also stores a 4-bit fragment of each key’s hash code to limit key comparisons. Hence, it uses two bytes of metadata per bucket, rather than the one byte used by ska::bytell_hash_map. The metadata is stored in a separate array, rather than interspersed with groups of buckets.
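Extracting the fields from this two-byte metadata can be sketched as follows (the exact bit layout here is an assumption for illustration, not a detail taken from Verstable’s source):

#include <stdint.h>

#define HASH_FRAGMENT( md ) ( ( md ) & 0x000F )          /* 4-bit hash-code fragment      */
#define DISPLACEMENT( md )  ( ( ( md ) >> 4 ) & 0x07FF ) /* 11-bit quadratic displacement */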
This table’s approximate memory overhead is two bytes, plus key-value padding, per bucket, in addition to the size of a key-value pair per vacant bucket.
Since the horizontal scale of the resulting graphs (each corresponding to one of the aforementioned benchmarks) is linear, not logarithmic, I ran the benchmarks for three different total key counts. Generally, the lower the key count is, the hotter the tables are in the cache. Here are the results:
As I ran the benchmarks 14 times for every key count, each data point shown in the graphs represents the average of 10 recordings (I excluded the two highest and two lowest recordings to limit the effect of outliers and any background processing).
The graphs are interactive. Hover over a table’s label to highlight the associated plot, and click it to toggle the plot’s visibility. The graphs automatically scale to the visible plots.
Displayed below are the graphs for 0 to 20,000,000 keys. Click here to skip to the heatmap summarizing the data shown in these graphs.
Below is my analysis of the results, especially the 0-to-20,000,000-key graphs displayed above. Naturally, this (and the next) section is somewhat subjective. Hence, I encourage readers seeking a hash-table library to also examine the graphs and heatmaps themselves, bearing in mind their specific needs.
absl::flat_hash_map: This table has very fast insertions unless the buckets are large. Its erasure of existing keys is fast but not among the top contenders, especially when we consider its reliance on tombstones. Its lookups and erasures of nonexisting keys are very fast. However, its lookups and replacements of existing keys are only moderately fast for large buckets and string keys and relatively slow for small buckets with integer keys (although they are far more competitive in the 0-200,000-key benchmarks). Its iteration is fast, albeit not as fast as the other SIMD tables, Verstable, or—of course—the tables that store key-value pairs contiguously.
ankerl::unordered_dense: This table has fast lookups, particularly when its load factor is lower than its default limit of 0.8. Its insertions, however, are only moderately fast for large buckets and string keys and slow for integer keys with small buckets. For lookups and erasures of nonexisting keys, it is faster than the other Robin Hood table (tsl::robin_map) and the conventional linear- and quadratic-probing open-addressing tables. Its performance in most benchmarks is load-factor dependent but less so than the performance of those other tables. Predictably, its iteration is extremely fast: because key-value pairs are stored contiguously, iteration is tantamount to iterating over an array.
boost::unordered_flat_map: This table has very fast insertions and erasures of existing keys (although, as previously mentioned, erasures leave a residual performance impact via the “overflow byte” mechanism). For looking up and erasing nonexisting keys, it is among the fastest tables tested. It is rather fast for looking up existing integer keys (but not the top contender, especially for large buckets). It is also very fast for looking up string keys. Its iteration speed is also excellent thanks to the clustering of key-value pairs within each bucket group, with only the tables that store key-value pairs contiguously being significantly faster. This table’s edge over most of the other tables is notably larger in the low-key-count (0 to 200,000 keys) benchmarks, which suggests that it benefits from being in the cache more than the other tables do.
emilib2::HashMap: This table’s performance is very close to boost::unordered_flat_map’s performance in most benchmarks. The most significant difference is that lookups and erasures of nonexisting keys are considerably slower. Like boost::unordered_flat_map, this table’s overall speed relative to the other tables is at its fastest in the low-key-count benchmarks.
ska::bytell_hash_map: This table’s lookups are relatively fast when it comes to integer keys (but not as competitive when it comes to string keys, probably due to the absence of a mechanism to limit key comparisons based on hash-code fragments). Its erasure of existing keys is also fast (albeit not as fast as the top contenders). However, its insertions are slow. Its iteration is also slow even though it uses a separate array of metadata. The other tables tested that use a similar mechanism—including Verstable, which must iterate over twice as much metadata—are significantly faster in this regard. This difference exists because during iteration, ska::bytell_hash_map does not harness compiler intrinsics to scan for the next occupied bucket multiple buckets at a time. Moreover, how this table’s reliance on an array of constant “jump distances” affects real-world performance when enough other work is being done between hash-table operations to flush it out of the cache remains an open question.
std::unordered_map: This table is far slower than most of the open-addressing tables in most of the benchmarks, except in the case of looking up existing keys, wherein its performance is comparable to that of the slowest open-addressing tables. This poor performance is due to pointer chasing and the need to allocate and free nodes individually. Particularly alarming is the iteration performance, which is at least an order of magnitude slower than most of the open-addressing tables.
tsl::robin_map: This table is fast for inserting string keys. However, its performance in most of the other benchmarks ranges from average to slow and is very dependent on the load factor. Of course, this table would be much more competitive were its maximum load factor kept at its very low default of 0.5.
cc_map from CC: Disclaimer: I am the author of this table. Since CC implements the same hash-table design as Verstable, its results are similar to Verstable’s results, and the same comments apply. However, its iteration—though fast—is somewhat slower than Verstable’s iteration. This is because iterators in CC are raw pointers to buckets, rather than a struct containing both a pointer to a bucket and a pointer to its corresponding metadata. Hence, cc_map must perform an extra calculation per iteration to locate the current bucket’s metadata.
khash from klib: This table has reasonably good insertion speed, particularly for integer keys. Its erasure of existing keys is fast (but relies on tombstones). However, its lookups of existing keys are relatively slow, as are its lookups and erasures of nonexisting keys. Its iteration is also rather slow for an open-addressing table.
DICT from M*LIB: This table has fast insertions (especially for string keys and large buckets) and extremely fast lookups of existing keys. It is similarly fast for erasing existing keys (although it relies on tombstones). However, it is rather slow for looking up or erasing nonexisting keys. Its iteration is also 2-3x slower than the slowest among the other open-addressing tables, despite the storage of key-value pairs in a separate array. This is because the potential presence of gaps in the key-value-pairs array precludes iterating over it directly and without reference to the buckets array, which stores eight or 16 bytes per bucket and is therefore not especially cache-friendly.
DICT_OA from M*LIB: This table has very fast lookups of integer keys—the fastest among all the tables in the 0-to-20,000,000-key benchmarks when the buckets are small. In this regard, the table is probably benefiting from the better cache performance that results from the choice to have users reserve sentinel values instead of storing metadata. However, it is relatively slow for looking up or erasing nonexisting keys, and its insertion and iteration speeds are average. Its erasures of integer keys are very fast but rely on tombstones.
hm and sh from stb_ds: hm is slow—the slowest among the open-addressing tables—across most benchmarks. In particular, its lookups of existing and nonexisting keys and erasures of nonexisting keys are in the same poor-performance ballpark as the node-based tables. sh fares somewhat better, with fast insertions, replacements, and lookups of existing keys, but still displays average performance when it comes to erasing and looking up nonexisting keys. Like ankerl::unordered_dense, these tables offer near-perfect iteration speed because they store key-value pairs contiguously.
hmap from STC: This table has very fast insertions of integer keys, irrespective of bucket size. It is also rather fast for looking up existing keys, and its iteration performance is on the faster side of average. However, it is relatively slow on lookups and erasures of nonexisting keys. Its erasure of existing keys is also rather slow for integer keys and extremely slow for string keys. This is due to the backward-shifting erasure mechanism. While this mechanism eliminates the need for tombstones, it necessitates both moving many key-value pairs and rehashing every key that may need to be moved to determine the bucket to which it hashed (this problem does not exist for Robin Hood tables, which must store displacements, ideal bucket indices, or hash codes). Hence, when erasures are frequent and the hash function is expensive, I suggest that users manually calculate and store each key’s hash code as part of a struct and supply the table with a hash function that simply returns the precalculated hash code.
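A sketch of this workaround might look as follows (the type and function names are illustrative, not STC’s API): each string key’s hash is computed once and stored beside it, and the table is supplied with a hash function that merely returns the cached value.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef struct
{
    const char *str;
    uint64_t hash; /* precomputed once, e.g. with FNV-1a, before the key is used */
} cached_key;

static uint64_t cached_key_hash( cached_key key ) { return key.hash; }

static bool cached_key_cmpr( cached_key key_1, cached_key key_2 )
{
    return strcmp( key_1.str, key_2.str ) == 0;
}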
uthash: My comments about the poor performance of std::unordered_map also apply to uthash. However, uthash’s iteration is about twice as fast as std::unordered_map’s iteration (albeit still extremely slow). Its author seems aware that its performance is no longer competitive as its documentation includes the following note:
When uthash was written, there were fewer options for doing generic hash tables in C than exist today. There are faster hash tables, more memory-efficient hash tables, with very different API’s today. But, like driving a minivan, uthash is convenient, and gets the job done for many purposes.
Verstable: Disclaimer: I am the author of this table. This table’s lookups of existing and nonexisting keys are all fast (indeed, it has some of the fastest lookups of integer keys when the buckets are large). It erases existing keys almost as quickly as most of the tables that rely on tombstones or a tombstone-like mechanism, even though it offers true erasures. Its lookups and erasures of nonexisting keys are extremely fast. Its iteration performance is also good, beaten only by boost::unordered_flat_map, emilib2::HashMap, and the tables that store key-value pairs contiguously. However, its insertion speed is only average.
I would like to finish with a few general observations about hash-table designs. Firstly, these benchmarks show once again that node-based separate-chaining tables are slow. Secondly, conventional open-addressing tables that use linear or quadratic probing are very susceptible to load factor. The same goes for Robin Hood tables, despite the oft-repeated claim that a reasonable default maximum load factor for such tables is 0.9. Thirdly, SIMD and hybrid open-addressing/separate-chaining tables are mostly immune to high load factors. These two designs achieve this load-factor resilience in very different ways: the former maximizes the number of buckets probed at once, while the latter limits the buckets probed to the few containing a key that hashed to the lookup key’s ideal bucket. Finally, SIMD, hybrid, and Robin Hood tables all limit variance in the time needed to complete an operation. This is reflected by the steadiness of their graph plots compared to the plots for the other tables.
In C++, boost::unordered_flat_map appears to be the all-around best performer, especially when hot in the cache. Therefore, it makes an excellent default choice. However, ankerl::unordered_dense offers the best iteration speed, performs reasonably well in the other benchmarks, and is a standalone header. Hence, it is a good choice if iteration speed is the factor most important to you or you prefer a single-header solution. It also has faster lookups, especially when buckets are large and when the load factor is kept low. emilib2 shows great promise and could be a good alternative to boost::unordered_flat_map once it is ready for public consumption.
In C, the results show that Verstable and CC’s cc_map are all-around excellent performers, with the former being faster for iteration. However, they are not the best performers in every scenario. If your use case is skewed towards a particular operation or pair of data types, then it is worth also considering other options, especially if you are content to use a low maximum load factor. In particular, STC’s hmap is faster for inserting integer keys, M*LIB’s DICT has the fastest lookups of existing integer keys, and stb_ds’ hm and sh have the fastest iteration. Although not shown in the benchmarks, klib’s khash generally has the lowest memory overhead among the tables tested, so it—or khashl, which consumes even less memory—may be the best choice when memory is limited.
STC’s hmap, klib’s khash, and M*LIB’s DICT and DICT_OA are each part of a large, comprehensive library. The usefulness of these libraries’ other features to you is also a compelling reason to choose one of them.
uthash, on the other hand, should be avoided due to all-around poor performance, unless you need the specific qualities of an intrusive or node-based table.
Finally, I must warn that some of the C tables raise portability concerns. uthash and stb_ds’ hm hash and compare keys byte-by-byte by default. This poses an issue not only for structs containing padding bytes but also—at least theoretically—for fundamental integer types, which may contain padding bits (6.2.6.2). The problem is especially apparent for stb_ds because it offers no way for users to provide their own hash and comparison functions (other than modifying the library).
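A key type like the following illustrates the problem (the struct is hypothetical): byte-by-byte hashing and comparison also read the four unspecified padding bytes inserted after id, so two logically equal keys can hash or compare unequal.

#include <stdint.h>

typedef struct
{
    uint32_t id;    /* four bytes                                      */
                    /* four bytes of padding on typical 64-bit targets */
    uint64_t group; /* eight bytes                                     */
} padded_key;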
The above conclusion may leave readers wondering about the differences between Verstable and CC’s cc_map, which both use the same hybrid design, outside the context of performance. In short, Verstable is a standalone hash-table library. It requires C99 or later, and it requires users to instantiate a pseudo-template for every combination of key and value data types (as is common among high-performance C container libraries). It also offers full support for custom memory allocators. CC, on the other hand, is a larger library that includes a range of generic containers. It uses novel techniques to provide a simpler API and avoid the need for boilerplate from users. However, it requires C11 or later and compiler support for typeof (which is standard as of C23), and its support for custom memory allocators is more limited.
I would like to thank Joaquín M López Muñoz, Martin Leitner-Ankerl, Patrick Pelissier, and Attractive Chaos for their valuable feedback during the drafting of this article.
Comments on this article belong here.
The article was also discussed on Reddit at r/C_Programming and r/cpp.