Benchmarks¶
These are benchmark results for the gen_tries library, focusing on various operations such as building tries, checking key existence, prefix matching, and updating tries.
Platform¶
Hardware¶
Mac Studio, Apple M2 Ultra, 24-Core CPU (approx 3.5 GHz), 128GB RAM
Python Version¶
Python 3.13.7
Benchmark Configuration¶
gen-tries Version: 0.9.4
Single Threaded
Symbol Set: ‘0123’
Key Generation: All possible combinations of symbols up to a certain length, starting with all ‘0’s and incrementing the last symbol until all combinations were generated.
Trie Depths: 2 to 9
Runtime Validation: Enabled and Disabled
Benchmarking Tool: pytest-benchmark
Benchmarking Code: bench/benchmark_performance.py
Running the Benchmarks¶
To run all benchmarks with pytest-benchmark:
```shell
cd bench
pytest benchmark_performance.py \
    --benchmark-sort=name \
    --benchmark-group-by=func \
    --benchmark-histogram=histogram/benchmark \
    --benchmark-time-unit=ns
```
pytest-benchmark Configuration¶
Benchmarks are configured with:
min_rounds: 100
min_time: 1s
max_time: 10s
timer: perf_counter_ns
See the source for details.
What is Measured¶
Add: Time to add keys to the trie
Update: Time to update existing keys
Lookup: Time to check key presence (hit/miss)
Remove: Time to remove keys
Traversal: Time for prefixes() and prefixed_by()
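As a rough illustration of what each of these operations does, here is a minimal dict-based toy trie. This is NOT the gentrie implementation, just a sketch of the underlying idea; the class and method names are illustrative only:

```python
class ToyTrie:
    """A minimal dict-of-dicts trie sketch, for illustration only."""

    def __init__(self) -> None:
        self.root: dict = {}

    def add(self, key: str) -> None:
        """Add a key, creating one node per symbol (the 'Add' benchmark)."""
        node = self.root
        for symbol in key:
            node = node.setdefault(symbol, {})
        node['$'] = True  # end-of-key marker

    def __contains__(self, key: str) -> bool:
        """Check key presence (the 'Lookup' hit/miss benchmark)."""
        node = self.root
        for symbol in key:
            if symbol not in node:
                return False
            node = node[symbol]
        return '$' in node

    def remove(self, key: str) -> None:
        """Remove a key and prune now-empty intermediate nodes
        (the 'Remove' benchmark)."""
        path = []
        node = self.root
        for symbol in key:
            path.append((node, symbol))
            node = node[symbol]
        del node['$']
        for parent, symbol in reversed(path):
            if parent[symbol]:
                break  # node still has children or a key marker
            del parent[symbol]

trie = ToyTrie()
trie.add('012')
assert '012' in trie and '01' not in trie
trie.remove('012')
assert '012' not in trie
```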
Interpreting Results¶
ops: Operations per second (higher is better)
Scaling: As trie depth increases, expect some slowdown; large jumps may indicate inefficiency.
Advanced Usage¶
Compare runs:
```shell
pytest ... --benchmark-compare=previous.json
```
Export results:
```shell
pytest ... --benchmark-json=results.json
```
View histograms:
```shell
pytest ... --benchmark-histogram=histogram/benchmark
```
Discussion¶
The benchmarks are designed to evaluate performance across different trie depths and configurations.
Comparison of runtime validation settings reveals that disabling runtime validation can lead to small performance improvements in some scenarios, particularly for deeper tries. However, the performance gains are often minimal and may not justify the trade-offs in terms of safety and correctness. Runtime validation helps catch errors and ensures correctness by validating the structure and content of the trie during operations.
Turning it off may be appropriate in extremely performance-critical applications where the data is well-defined and unlikely to change, allowing for optimizations that bypass some of the safety checks.
HOWEVER, it is crucial to thoroughly test and validate the trie implementation in such scenarios to avoid potential issues. Absolutely NO assumptions should be made about the data or its structure. The behavior of the trie with invalid data is undefined and may lead to catastrophic failures.
In Summary: If you choose to disable runtime validation, do so with extreme caution and only after thorough testing. The performance increase is relatively small, and the risks may outweigh the benefits.
For the tests, a symbol set of ‘0123’ was used. Keys were generated by creating all possible combinations of these symbols up to a certain length, starting with all ‘0’s and incrementing the last symbol until all combinations were generated.
So for a trie depth of 3, the keys would be ‘000’, ‘001’, ‘002’, …, ‘333’, covering all possible keys of length 3. This approach ensures a comprehensive evaluation of trie performance across different key patterns and depths, and the branching factor of 4 allows efficient exploration of the search space to a substantial depth (9).
Depth 1 is generally omitted from the benchmarks due to its limited key space and minimal impact on performance.
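The key-generation scheme described above can be sketched with itertools.product. This is a hypothetical reconstruction, not the benchmark's actual code; `generate_keys` is an illustrative name:

```python
from itertools import product

def generate_keys(symbols: str, depth: int) -> list[str]:
    """All possible keys of exactly `depth` symbols, in counting order:
    '000', '001', ..., '333' for symbols='0123' and depth=3."""
    return [''.join(combo) for combo in product(symbols, repeat=depth)]

keys = generate_keys('0123', 3)
print(len(keys), keys[0], keys[-1])  # 64 000 333
```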
The tests themselves are ‘micro-benchmarks’ designed to measure specific operations in isolation. They measure the performance of the SPECIFIC call being tested and do not include the overhead incurred by other operations or the overall process.
This means they measure the performance of something like ‘trie.add(key)’ or ‘key in trie’ in isolation, excluding any overhead from, say, a containing ‘for’ loop or other surrounding operations.
It is measuring the performance of code that looks like this:
```python
from gentrie import GeneralizedTrie

trie = GeneralizedTrie()
trie['example_key'] = 'example_value'
trie['different_key'] = 'different_value'
trie['yet_another_key'] = 'yet_another_value'
```
NOT code that looks like this:
```python
from gentrie import GeneralizedTrie

trie = GeneralizedTrie()
data = [('example_key', 'example_value'),
        ('different_key', 'different_value'),
        ('yet_another_key', 'yet_another_value')]
for key, value in data:
    trie[key] = value
```
The reason for this is that the overhead of the loop and other operations is an implementation detail that cannot easily be separated from the operation being measured, and including it would skew the results with unrelated overhead from code NOT being targeted for measurement.
In the example above, the for loop would add significant overhead that would distort the results for the specific operation being measured: the item assignment (__setitem__) method in this case.
As such, the benchmarks focus on specific operations in isolation to provide a clearer picture of performance BUT tend to over-estimate real-world performance due to the extremely controlled nature of the tests (which are designed to eliminate as many variables as possible).
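The distinction between the two measurement styles can be sketched with perf_counter_ns. The real benchmarks use pytest-benchmark; here a plain dict stands in for the trie, purely for illustration:

```python
from time import perf_counter_ns

store: dict[str, str] = {}  # stand-in for the trie
data = [(f'key{i}', f'value{i}') for i in range(10_000)]

# Macro-style timing: includes the for-loop and tuple-unpacking overhead.
start = perf_counter_ns()
for key, value in data:
    store[key] = value
macro_ns = perf_counter_ns() - start

# Micro-style timing: accumulate only the __setitem__ calls themselves.
store.clear()
micro_ns = 0
for key, value in data:
    t0 = perf_counter_ns()
    store[key] = value
    micro_ns += perf_counter_ns() - t0

# The micro total excludes the loop overhead (though per-call timer
# overhead inflates it somewhat), so the two totals can differ noticeably.
print(macro_ns, micro_ns)
```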
In practice, the performance characteristics of a trie may vary significantly based on the specific use case, data distribution, and access patterns. As such, while synthetic benchmarks provide valuable insights, they should be interpreted with caution and supplemented with real-world testing whenever possible.
Synthetic Benchmarks¶
The following benchmarks were conducted using synthetically generated keys.
Synthetic keys were created by generating all possible combinations of the symbols ‘0123’ for a specific depth (length) of keys. This approach ensures a comprehensive evaluation of trie performance across different key patterns and depths, and has a branching factor (4) that allows for efficient exploration of the search space to a substantial depth (9).
This allows for a thorough examination of the trie’s performance characteristics across various scenarios.
However, it is important to note that synthetic benchmarks may not fully capture real-world performance characteristics, as they do not account for factors such as data distribution, key collisions, and other complexities that arise in practical applications. The synthetic approach is designed to provide a controlled environment for evaluating the trie implementation’s baseline performance, uninfluenced by the quirks of real-world data patterns.
Building trie using add()¶
The table shows thousands of operations per second (Kops/sec) for different trie depths and runtime key validation settings while building the trie with the add() method, using strings as keys. Trie Depth is equivalent to the number of symbols in each key being added to the trie.
The operations being measured are the individual add() method calls adding keys to the trie.
| Trie Depth | Runtime Validation | Kops / sec |
|---|---|---|
| 2 | False | 266.7 |
| 2 | True | 266.7 |
| 3 | False | 609.5 |
| 3 | True | 587.2 |
| 4 | False | 612.4 |
| 4 | True | 587.2 |
| 5 | False | 644.8 |
| 5 | True | 612.4 |
| 6 | False | 634.4 |
| 6 | True | 601.8 |
| 7 | False | 618.5 |
| 7 | True | 591.6 |
| 8 | False | 622.9 |
| 8 | True | 600.3 |
| 9 | False | 602.6 |
| 9 | True | 584.5 |
Building trie using update()¶
This is the same benchmark as above, but building the trie with the update() method instead of add().
| Trie Depth | Runtime Validation | Kops / sec |
|---|---|---|
| 2 | False | 400.0 |
| 2 | True | 444.4 |
| 3 | False | 547.0 |
| 3 | True | 561.4 |
| 4 | False | 651.4 |
| 4 | True | 635.2 |
| 5 | False | 592.6 |
| 5 | True | 635.2 |
| 6 | False | 599.2 |
| 6 | True | 600.5 |
| 7 | False | 602.3 |
| 7 | True | 585.9 |
| 8 | False | 608.8 |
| 8 | True | 588.8 |
| 9 | False | 598.1 |
| 9 | True | 573.8 |
key in trie¶
| Key Depth | Runtime Validation | Kops / sec |
|---|---|---|
| 2 | False | 4327.1 |
| 2 | True | 3388.7 |
| 3 | False | 3617.9 |
| 3 | True | 2959.5 |
| 4 | False | 3167.6 |
| 4 | True | 2657.5 |
| 5 | False | 2834.5 |
| 5 | True | 2399.2 |
| 6 | False | 2555.6 |
| 6 | True | 2193.9 |
| 7 | False | 2344.7 |
| 7 | True | 2045.4 |
| 8 | False | 2156.1 |
| 8 | True | 1736.1 |
| 9 | False | 1946.3 |
| 9 | True | 1690.6 |
key not in trie¶
| Key Depth | Runtime Validation | Kops / sec |
|---|---|---|
| 2 | False | 5243.8 |
| 2 | True | 3636.4 |
| 3 | False | 4024.1 |
| 3 | True | 3200.0 |
| 4 | False | 3403.7 |
| 4 | True | 2849.0 |
| 5 | False | 3027.6 |
| 5 | True | 2548.4 |
| 6 | False | 2642.7 |
| 6 | True | 2325.0 |
| 7 | False | 2403.3 |
| 7 | True | 2143.2 |
| 8 | False | 2268.6 |
| 8 | True | 1992.0 |
| 9 | False | 2072.1 |
| 9 | True | 1806.7 |
get()¶
The get() operation retrieves the value (the TrieEntry) associated with a key in the trie.
This benchmark measures the performance of the get() operation for existing keys at various depths in the trie, with and without runtime validation enabled.
| Key Depth | Runtime Validation | Kops / sec |
|---|---|---|
| 2 | False | 1460.3 |
| 2 | True | 1460.3 |
| 3 | False | 2103.8 |
| 3 | True | 1852.3 |
| 4 | False | 2299.3 |
| 4 | True | 1992.7 |
| 5 | False | 2151.9 |
| 5 | True | 1929.5 |
| 6 | False | 1987.7 |
| 6 | True | 1778.5 |
| 7 | False | 1881.2 |
| 7 | True | 1667.6 |
| 8 | False | 1698.2 |
| 8 | True | 1552.3 |
| 9 | False | 1640.2 |
| 9 | True | 1495.8 |
update()¶
This benchmark measures the performance of updating values for existing keys in the trie.
The results show the number of values updated per second (Kops/sec) for different key depths and runtime validation settings.
remove()¶
This benchmark is designed as a ‘semi-worst-case’ scenario for the remove() operation.
By generating only keys at the maximum depth and then removing them, we can observe the performance impact on the trie structure as intermediate nodes are frequently removed all the way to the root node (requiring the most work).
A true worst case scenario for the remove() operation would involve a degenerate trie, where all keys share the same prefix, effectively behaving like a linked list. While this is not the primary focus of the benchmark, it is worth noting that such a structure would exhibit significantly different performance characteristics than typical trie use cases.
In practice, the performance impact of the remove() operation is usually mitigated by the structure of the trie and the distribution of keys within it. A degenerate trie is an edge case and is not representative of typical usage patterns (and probably indicative of a poorly chosen data structure for the task at hand).
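For contrast, the two removal scenarios can be sketched as key sets (illustrative only; the variable names are not from the benchmark code):

```python
from itertools import product

# Semi-worst case used by the benchmark: every key sits at the maximum
# depth, so removals prune intermediate nodes all the way back to the root.
semi_worst_keys = [''.join(c) for c in product('0123', repeat=3)]

# True worst case: a degenerate trie where every key shares one prefix
# chain ('0', '00', '000', ...), behaving like a linked list.
degenerate_keys = ['0' * length for length in range(1, 10)]

print(len(semi_worst_keys), len(degenerate_keys))  # 64 9
```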
| Key Depth | Runtime Validation | Kops / sec |
|---|---|---|
| 2 | False | 533.3 |
| 2 | True | 457.1 |
| 3 | False | 820.5 |
| 3 | True | 633.7 |
| 4 | False | 870.7 |
| 4 | True | 795.0 |
| 5 | False | 819.2 |
| 5 | True | 782.3 |
| 6 | False | 836.8 |
| 6 | True | 751.1 |
| 7 | False | 807.7 |
| 7 | True | 727.4 |
| 8 | False | 780.4 |
| 8 | True | 709.3 |
| 9 | False | 749.9 |
| 9 | True | 685.8 |
prefixed_by()¶
Performance of the prefixed_by() method is more complex due to the nature of prefix matching.
Rather than a simple hit/miss scenario, prefixed_by() can return multiple, potentially thousands or more (!), matches depending on the prefix length, the search depth, and the number of keys present in the trie.
For this test, a completely filled trie 7 levels deep is used, built from the same keys as before. Matches start from key depths of 2 to 4 symbols, with search depths of 1 to 3 symbols from that starting point.
A search for the prefix “01” (a key depth of 2) with a search depth of 1 would match the keys “01”, “010”, “011”, “012” and “013” (5 keys).
The number of matches increases exponentially with the search depth in the completely filled trie. In this case, a search depth of 2 would match 21 keys (the key itself, 4 at the first level below it, and 16 more at the second level). A search depth of 3 would match 85 keys (1 + 4 + 16 + 64).
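The match counts above follow the geometric series 1 + 4 + 4² + … for a fanout of 4. A quick sketch (illustrative, not benchmark code):

```python
def match_count(fanout: int, search_depth: int) -> int:
    """Keys matched by a prefixed_by()-style search in a fully populated
    trie: the prefix key itself plus every descendant key down to
    search_depth levels below it."""
    return sum(fanout ** level for level in range(search_depth + 1))

for depth in (1, 2, 3):
    print(depth, match_count(4, depth))  # 5, 21, 85
```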
Near the bottom of a trie, the number of matches can be smaller because the maximum depth of the trie truncates the search (it effectively limits the search depth), so the performance impact of returning multiple matches is less pronounced. In this benchmark, we made sure that the trie is fully populated at all levels up to the maximum depth being searched (4 + 3) to avoid that issue.
As can be seen, while the performance impact of looking up keys is slightly more pronounced at higher key depths, the performance impact of returning multiple matches is far more significant as the search depth increases, as the number of potential matches grows exponentially.
Turning off runtime validation has a noticeable positive effect on performance at low search depths, but its relative impact shrinks as the search depth increases, where the cost of returning multiple matches dominates.
Note that the performance numbers here are for returning all matches as a list, rather than stopping at the first match found. This is done to provide a more comprehensive view of the performance characteristics of the prefixed_by() method.
Note that although, for example, a search depth of 3 from a key depth of 4 in a trie of depth 7 has a Kops/sec of 77.8 (Runtime Validation == False) or 85.5 (Runtime Validation == True), it is returning 85 matches each time and so the number of prefixed keys matched per second is actually 85 times higher (6.6 Mkeys/sec or 7.3 Mkeys/sec respectively).
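The conversion from call throughput to matched keys per second is straightforward arithmetic (`keys_per_sec` is an illustrative helper, not a library function):

```python
def keys_per_sec(kops_per_sec: float, matches_per_call: int) -> float:
    """Convert call throughput (Kops/sec) to matched keys per second."""
    return kops_per_sec * 1_000 * matches_per_call

print(keys_per_sec(77.8, 85) / 1e6)  # ~6.6 Mkeys/sec
print(keys_per_sec(85.5, 85) / 1e6)  # ~7.3 Mkeys/sec
```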
Real world performance will vary based on the specific use case, data distribution, and other factors.
| Search Depth | Key Depth | Trie Depth | Returned Matches | Runtime Validation | Kops / sec |
|---|---|---|---|---|---|
| 1 | 2 | 7 | 5 | False | 902.5 |
| 1 | 2 | 7 | 5 | True | 825.8 |
| 1 | 3 | 7 | 5 | False | 883.4 |
| 1 | 3 | 7 | 5 | True | 824.4 |
| 1 | 4 | 7 | 5 | False | 873.4 |
| 1 | 4 | 7 | 5 | True | 834.0 |
| 2 | 2 | 7 | 21 | False | 308.5 |
| 2 | 2 | 7 | 21 | True | 303.1 |
| 2 | 3 | 7 | 21 | False | 316.8 |
| 2 | 3 | 7 | 21 | True | 304.2 |
| 2 | 4 | 7 | 21 | False | 307.3 |
| 2 | 4 | 7 | 21 | True | 300.6 |
| 3 | 2 | 7 | 85 | False | 86.1 |
| 3 | 2 | 7 | 85 | True | 84.9 |
| 3 | 3 | 7 | 85 | False | 85.8 |
| 3 | 3 | 7 | 85 | True | 83.4 |
| 3 | 4 | 7 | 85 | False | 77.8 |
| 3 | 4 | 7 | 85 | True | 85.5 |
prefixes()¶
prefixes() performance is generally better than prefixed_by() as it stops searching as soon as it finds a match and only returns the match and its parent keys. As the number of matches is limited to a maximum of the key depth, and key depth generally grows logarithmically (for non-degenerate tries), the performance impact is far less pronounced at greater key depths.
A fully populated trie with a fanout of 4 and a depth of 9 contains 349524 nodes, but prefixes() will never return more than 9 matches for it. With a fanout of 32, the number of nodes in a fully populated 9-level trie increases exponentially (to more than 36 trillion nodes), but the maximum number of prefixes() matches returned remains unchanged at 9.
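The node counts quoted above can be checked with a short calculation (`node_count` is an illustrative helper, not part of gentrie):

```python
def node_count(fanout: int, depth: int) -> int:
    """Nodes in a fully populated trie (excluding the root):
    fanout + fanout**2 + ... + fanout**depth."""
    return sum(fanout ** level for level in range(1, depth + 1))

print(node_count(4, 9))   # 349524, yet prefixes() returns at most 9 matches
print(node_count(32, 9))  # over 36 trillion
```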
The exception here is degenerate tries (tries where all keys share the same prefix). In these cases, the number of returned prefixes() matches can be much larger, as the trie may effectively behave like a linked list.
| Key Depth | Returned Matches | Runtime Validation | Kops / sec |
|---|---|---|---|
| 3 | 3 | False | 1927.9 |
| 3 | 3 | True | 1693.5 |
| 4 | 4 | False | 1722.7 |
| 4 | 4 | True | 1545.5 |
| 5 | 5 | False | 1503.1 |
| 5 | 5 | True | 1379.9 |
| 6 | 6 | False | 1248.9 |
| 6 | 6 | True | 1226.5 |
| 7 | 7 | False | 1231.5 |
| 7 | 7 | True | 1135.4 |
| 8 | 8 | False | 1098.0 |
| 8 | 8 | True | 993.4 |
| 9 | 9 | False | 966.9 |
| 9 | 9 | True | 947.1 |
Organic Benchmarks¶
Organic benchmarks use real-world data to evaluate trie performance in practical scenarios.
For example, measuring the time it takes to create a trie populated from a large dataset of English words using a for loop.
As such, they provide a more realistic assessment of trie performance in typical use cases.
Creating a Trie Of English Words¶
The following benchmarks were conducted on a trie populated by a list of 274137 English words.
It is functionally equivalent to benchmarking the performance of this code:
```python
from typing import Sequence

from gentrie import GeneralizedTrie

def create_dictionary(words: Sequence[str], runtime_validation: bool) -> GeneralizedTrie:
    trie = GeneralizedTrie(runtime_validation=runtime_validation)
    for word in words:
        trie.update(word)
    return trie
```
As can be seen, the performance impact of runtime validation is minimal in this scenario - it has less than a 1% impact on net performance.
A micro-benchmark of the update() method for the same real-world dataset shows the performance of just the update() calls, rather than benchmarking the entire process.
The loop in the organic benchmark adds