Counting Distinct Elements with Judy Arrays
2017-05-24

Daniel Lemire has written an interesting blog post entitled Counting exactly the number of distinct elements: sorted arrays vs. hash sets?. I've used Judy arrays for similar tasks in the past and was curious how they would stack up, so I've taken Daniel's code and modified it slightly to measure this.
One of the interesting aspects of Daniel's benchmark is that the counting methods are implemented in simple, straightforward ways using generic data-structure implementations; no special performance tuning was done. This is one of the aspects of Judy arrays that appeals to me: fairly good performance across a variety of workloads without tuning.
Judy arrays are essentially trie-like data-structures, although the implementation uses a bunch of different compression techniques.
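As a rough mental model (my own sketch, not Judy's actual implementation), you can picture a 64-bit key being split into its 8 bytes and walked down a 256-way trie; Judy's contribution is aggressively compressing the nodes of such a structure. An uncompressed version of that idea looks like this:

```cpp
#include <array>
#include <bitset>
#include <cstdint>
#include <memory>

// Naive 256-way trie over the 8 bytes of a 64-bit key -- conceptually
// what a Judy1 array does, minus all of Judy's node compression.
struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 256> child; // levels for bytes 7..1
    std::bitset<256> leaf;                            // membership bits for byte 0
};

// Insert a key; returns true if it was not already present.
bool trie_insert(TrieNode &root, uint64_t key) {
    TrieNode *node = &root;
    for (int shift = 56; shift >= 8; shift -= 8) {
        uint8_t b = (key >> shift) & 0xff;
        if (!node->child[b])
            node->child[b] = std::make_unique<TrieNode>();
        node = node->child[b].get();
    }
    uint8_t last = key & 0xff;
    bool was_present = node->leaf[last];
    node->leaf[last] = true;
    return !was_present;
}
```

Dense keys share trie paths, which is why this kind of structure gets cheaper per element as the input gets denser; sparse 64-bit keys, by contrast, each pay for a full 8-level walk.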
Let's see if they make sense in this scenario.
The code
For the entire benchmark code, please see my GitHub repo. Note that this is just a minor modification of Daniel Lemire's code.
As a quick reference, here are the three contenders (the C++ interfaces are naturally a bit cleaner since Judy is a C library, not C++):
```cpp
size_t distinct_count_hash(const uint64_t *values, size_t howmany) {
    std::unordered_set<uint64_t> hash(values, values + howmany);
    return hash.size();
}

size_t distinct_count_sort(const uint64_t *values, size_t howmany) {
    std::vector<uint64_t> array(values, values + howmany);
    std::sort(array.begin(), array.end());
    return std::unique(array.begin(), array.end()) - array.begin();
}

size_t distinct_count_judy(const uint64_t *values, size_t howmany) {
    void *j_array = nullptr;
    for (; howmany; values++, howmany--) {
        int ret;
        J1S(ret, j_array, *values); // set the bit for *values
    }
    size_t count, bytes_freed;
    J1C(count, j_array, 0, -1);  // count set bits over the full index range
    J1FA(bytes_freed, j_array);  // free the entire Judy1 array
    return count;
}
```
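Not part of the benchmark, but handy for sanity-checking: the two standard-library versions are self-contained and easy to cross-check against each other (the Judy version additionally needs libJudy installed and linked with -lJudy, so it's omitted here):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <unordered_set>
#include <vector>

// The same two standard-library counters as above, with the headers they need.
size_t distinct_count_hash(const uint64_t *values, size_t howmany) {
    std::unordered_set<uint64_t> hash(values, values + howmany);
    return hash.size();
}

size_t distinct_count_sort(const uint64_t *values, size_t howmany) {
    std::vector<uint64_t> array(values, values + howmany);
    std::sort(array.begin(), array.end());
    return std::unique(array.begin(), array.end()) - array.begin();
}
```

On any input the two must agree, which makes a quick assertion like `distinct_count_hash(v.data(), v.size()) == distinct_count_sort(v.data(), v.size())` a useful guard when modifying the benchmark.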
Benchmark results
The "original" mode is the same as Daniel's test, randomly chosen integers in the whole 64-bit range:
Original
N | hash | sort | judy |
---|---|---|---|
10 | 799.900 | 2091.000 | 5423.900 |
100 | 476.770 | 240.300 | 731.010 |
1000 | 587.062 | 335.139 | 664.785 |
10000 | 687.179 | 319.718 | 700.199 |
100000 | 481.774 | 131.039 | 240.177 |
1000000 | 702.065 | 176.030 | 376.434 |
10000000 | 830.487 | 184.250 | 509.867 |
Once we start getting into the larger values of N, Judy arrays seem to perform better than the hash but worse than the sort.
I had a feeling that the distribution of the input values might have an impact on the results, so I modified the benchmark to accept a "mode" parameter.
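The real generator lives in the repo; as a sketch (function name and structure are my own, not necessarily the repo's), the modes described below amount to:

```cpp
#include <algorithm>
#include <cstdint>
#include <random>
#include <string>
#include <vector>

// Sketch of the benchmark's input modes (my naming; see the repo for
// the actual generator code).
std::vector<uint64_t> make_input(const std::string &mode, size_t n,
                                 std::mt19937_64 &rng) {
    std::vector<uint64_t> v(n);
    if (mode == "original") {              // uniform over the whole 64-bit range
        for (auto &x : v) x = rng();
    } else if (mode == "dense_random") {   // uniform in [0, N)
        std::uniform_int_distribution<uint64_t> d(0, n - 1);
        for (auto &x : v) x = d(rng);
    } else if (mode == "really_dense_random") { // uniform in [0, N/10)
        std::uniform_int_distribution<uint64_t> d(0, n / 10 ? n / 10 - 1 : 0);
        for (auto &x : v) x = d(rng);
    } else if (mode == "sequential") {     // 0, 1, ..., N-1
        for (size_t i = 0; i < n; i++) v[i] = i;
    } else if (mode == "rev_sequential") { // N-1, ..., 0, plus a large offset
        const uint64_t offset = 10000000000ULL; // 10 billion, see conclusion
        for (size_t i = 0; i < n; i++) v[i] = offset + (n - 1 - i);
    } else if (mode == "shuffle") {        // 0..N-1, shuffled
        for (size_t i = 0; i < n; i++) v[i] = i;
        std::shuffle(v.begin(), v.end(), rng);
    }
    return v;
}
```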
Here is the "dense_random" mode where the range of the numbers is [0, N):
dense_random
N | hash | sort | judy |
---|---|---|---|
10 | 791.000 | 2174.200 | 4626.600 |
100 | 249.840 | 195.480 | 681.300 |
1000 | 281.552 | 255.968 | 348.256 |
10000 | 466.012 | 322.200 | 332.084 |
100000 | 238.186 | 221.104 | 117.947 |
1000000 | 249.519 | 179.998 | 114.024 |
10000000 | 275.373 | 301.188 | 159.435 |
Aha! Now Judy arrays are starting to hit their sweet spot, outperforming both hash and sort.
Here is a test where the range of numbers is [0, N/10):
really_dense_random
N | hash | sort | judy |
---|---|---|---|
10 | 552.500 | 334.900 | 6878.200 |
100 | 179.700 | 504.310 | 259.250 |
1000 | 162.291 | 320.816 | 599.769 |
10000 | 140.936 | 363.566 | 275.752 |
100000 | 89.402 | 245.404 | 122.145 |
1000000 | 84.090 | 150.125 | 85.825 |
10000000 | 223.785 | 273.394 | 91.641 |
And just out of curiosity, here are modes where the numbers are ascending from 0 to N-1, and descending from N-1 to 0 (plus a large offset, see conclusion):
sequential
N | hash | sort | judy |
---|---|---|---|
10 | 222.300 | 113.900 | 2290.800 |
100 | 129.580 | 158.380 | 321.770 |
1000 | 157.004 | 132.265 | 116.448 |
10000 | 530.956 | 222.064 | 234.421 |
100000 | 186.679 | 199.344 | 91.489 |
1000000 | 169.161 | 397.652 | 106.706 |
10000000 | 169.059 | 406.508 | 109.828 |
rev_sequential
N | hash | sort | judy |
---|---|---|---|
10 | 582.000 | 1953.000 | 3850.700 |
100 | 341.080 | 107.700 | 1024.080 |
1000 | 412.915 | 103.454 | 340.089 |
10000 | 466.905 | 82.989 | 303.928 |
100000 | 166.437 | 32.308 | 142.295 |
1000000 | 168.156 | 62.047 | 133.119 |
10000000 | 165.229 | 47.946 | 110.284 |
The sort method's performance varies considerably with the order of the input (which makes sense).
Finally, here is a mode where the input is the numbers 0 to N-1, but shuffled prior to the count:
shuffle
N | hash | sort | judy |
---|---|---|---|
10 | 649.300 | 2122.400 | 4101.200 |
100 | 382.830 | 229.240 | 1025.010 |
1000 | 464.091 | 312.579 | 432.148 |
10000 | 466.443 | 371.991 | 279.502 |
100000 | 316.911 | 230.575 | 180.805 |
1000000 | 274.971 | 165.339 | 115.582 |
10000000 | 287.619 | 295.054 | 172.413 |
Conclusion
Depending on your input data, Judy arrays may be a good fit for this task. For large enough N, Judy arrays outperform the hash. Similarly, they outperform the sort in cases where the input is very dense (and not already sorted).
After I ran the benchmarks, it occurred to me that for the modes other than original, due to the input ranges selected, you could use a smaller integer type (such as uint32_t), which would likely improve the sort and hash results by quite a bit. However, the performance of Judy arrays doesn't depend on the input values being small. To demonstrate this, I added a large (10 billion) offset to each input value in the rev_sequential mode and, as you can see, this doesn't change the results of the test. Although not demonstrated in this benchmark, I believe Judy arrays will also perform well if you have multiple dense clusters scattered around the input range.
So to sum up, Judy arrays seem to be a good choice for the following scenarios:
- Dense data sets, when you don't know in advance where the data will cluster.
- Applications you don't want to performance-tune, either because it's difficult or because the input distribution isn't known in advance.
- Inputs with many repeated numbers that you don't want to keep in memory as you collect them. Since that rules out the sort method, this benchmark suggests Judy arrays are preferable to the obvious hash implementation (although a tuned hash implementation could very likely do better for particular input distributions).