Counting Distinct Elements with Judy Arrays
2017-05-24

Daniel Lemire has written an interesting blog post entitled Counting exactly the number of distinct elements: sorted arrays vs. hash sets?. I've used Judy arrays for similar tasks in the past and was curious how they would stack up, so I've taken Daniel's code and modified it slightly to measure this.
One of the interesting aspects of Daniel's benchmark is that the counting methods are implemented in simple, straightforward ways using generic data-structure implementations; no special performance tuning was done. This is one of the aspects of Judy arrays that appeals to me: fairly good performance across a variety of workloads without tuning.
Judy arrays are essentially trie-like data-structures, although the implementation uses a bunch of different compression techniques.
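As a rough mental model (my own sketch, not Judy's actual implementation), you can picture a 64-bit key being split into its 8 bytes and walked down a 256-way trie; Judy's contribution is aggressively compressing the nodes of such a structure. An uncompressed version of that idea looks like this:

```cpp
#include <array>
#include <bitset>
#include <cstdint>
#include <memory>

// Naive 256-way trie over the 8 bytes of a 64-bit key -- conceptually
// what a Judy1 array does, minus all of Judy's node compression.
struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 256> child; // levels for bytes 7..1
    std::bitset<256> leaf;                            // membership bits for byte 0
};

// Insert a key; returns true if it was not already present.
bool trie_insert(TrieNode &root, uint64_t key) {
    TrieNode *node = &root;
    for (int shift = 56; shift >= 8; shift -= 8) {
        uint8_t b = (key >> shift) & 0xff;
        if (!node->child[b])
            node->child[b] = std::make_unique<TrieNode>();
        node = node->child[b].get();
    }
    uint8_t last = key & 0xff;
    bool was_present = node->leaf[last];
    node->leaf[last] = true;
    return !was_present;
}
```

Dense keys share trie paths, which is why this kind of structure gets cheaper per element as the input gets denser; sparse 64-bit keys, by contrast, each pay for a full 8-level walk.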
Let's see if they make sense in this scenario.
The code
For the entire benchmark code, please see my GitHub repo. Note that this is just a minor modification of Daniel Lemire's code.
As a quick reference, here are the three contenders (the C++ interfaces are naturally a bit cleaner since Judy is a C library, not C++):
```cpp
size_t distinct_count_hash(const uint64_t *values, size_t howmany) {
    std::unordered_set<uint64_t> hash(values, values + howmany);
    return hash.size();
}

size_t distinct_count_sort(const uint64_t *values, size_t howmany) {
    std::vector<uint64_t> array(values, values + howmany);
    std::sort(array.begin(), array.end());
    return std::unique(array.begin(), array.end()) - array.begin();
}

size_t distinct_count_judy(const uint64_t *values, size_t howmany) {
    void *j_array = nullptr;
    for (; howmany; values++, howmany--) {
        int ret;
        J1S(ret, j_array, *values); // set the bit for *values
    }
    size_t count, bytes_freed;
    J1C(count, j_array, 0, -1);  // count set bits over the full index range
    J1FA(bytes_freed, j_array);  // free the entire Judy1 array
    return count;
}
```
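Not part of the benchmark, but handy for sanity-checking: the two standard-library versions are self-contained and easy to cross-check against each other (the Judy version additionally needs libJudy installed and linked with -lJudy, so it's omitted here):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <unordered_set>
#include <vector>

// The same two standard-library counters as above, with the headers they need.
size_t distinct_count_hash(const uint64_t *values, size_t howmany) {
    std::unordered_set<uint64_t> hash(values, values + howmany);
    return hash.size();
}

size_t distinct_count_sort(const uint64_t *values, size_t howmany) {
    std::vector<uint64_t> array(values, values + howmany);
    std::sort(array.begin(), array.end());
    return std::unique(array.begin(), array.end()) - array.begin();
}
```

On any input the two must agree, which makes a quick assertion like `distinct_count_hash(v.data(), v.size()) == distinct_count_sort(v.data(), v.size())` a useful guard when modifying the benchmark.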
Benchmark results
The "original" mode is the same as Daniel's test, randomly chosen integers in the whole 64-bit range:
Original
N | hash | sort | judy |
---|---|---|---|
10 | 799.900 | 2091.000 | 5423.900 |
100 | 476.770 | 240.300 | 731.010 |
1000 | 587.062 | 335.139 | 664.785 |
10000 | 687.179 | 319.718 | 700.199 |
100000 | 481.774 | 131.039 | 240.177 |
1000000 | 702.065 | 176.030 | 376.434 |
10000000 | 830.487 | 184.250 | 509.867 |
Once we start getting into the larger values of N, Judy arrays seem to perform better than the hash but worse than the sort.
I had a feeling that the distribution of the input values might have an impact on the results, so I modified the benchmark to accept a "mode" parameter.
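The real generator lives in the repo; as a sketch (function name and structure are my own, not necessarily the repo's), the modes described below amount to:

```cpp
#include <algorithm>
#include <cstdint>
#include <random>
#include <string>
#include <vector>

// Sketch of the benchmark's input modes (my naming; see the repo for
// the actual generator code).
std::vector<uint64_t> make_input(const std::string &mode, size_t n,
                                 std::mt19937_64 &rng) {
    std::vector<uint64_t> v(n);
    if (mode == "original") {              // uniform over the whole 64-bit range
        for (auto &x : v) x = rng();
    } else if (mode == "dense_random") {   // uniform in [0, N)
        std::uniform_int_distribution<uint64_t> d(0, n - 1);
        for (auto &x : v) x = d(rng);
    } else if (mode == "really_dense_random") { // uniform in [0, N/10)
        std::uniform_int_distribution<uint64_t> d(0, n / 10 ? n / 10 - 1 : 0);
        for (auto &x : v) x = d(rng);
    } else if (mode == "sequential") {     // 0, 1, ..., N-1
        for (size_t i = 0; i < n; i++) v[i] = i;
    } else if (mode == "rev_sequential") { // N-1, ..., 0, plus a large offset
        const uint64_t offset = 10000000000ULL; // 10 billion, see conclusion
        for (size_t i = 0; i < n; i++) v[i] = offset + (n - 1 - i);
    } else if (mode == "shuffle") {        // 0..N-1, shuffled
        for (size_t i = 0; i < n; i++) v[i] = i;
        std::shuffle(v.begin(), v.end(), rng);
    }
    return v;
}
```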
Here is the "dense_random" mode where the range of the numbers is [0, N):
dense_random
N | hash | sort | judy |
---|---|---|---|
10 | 791.000 | 2174.200 | 4626.600 |
100 | 249.840 | 195.480 | 681.300 |
1000 | 281.552 | 255.968 | 348.256 |
10000 | 466.012 | 322.200 | 332.084 |
100000 | 238.186 | 221.104 | 117.947 |
1000000 | 249.519 | 179.998 | 114.024 |
10000000 | 275.373 | 301.188 | 159.435 |
Aha! Now Judy arrays are starting to hit their sweet spot, outperforming both hash and sort.
Here is a test where the range of numbers is [0, N/10):
really_dense_random
N | hash | sort | judy |
---|---|---|---|
10 | 552.500 | 334.900 | 6878.200 |
100 | 179.700 | 504.310 | 259.250 |
1000 | 162.291 | 320.816 | 599.769 |
10000 | 140.936 | 363.566 | 275.752 |
100000 | 89.402 | 245.404 | 122.145 |
1000000 | 84.090 | 150.125 | 85.825 |
10000000 | 223.785 | 273.394 | 91.641 |
And just out of curiosity, here are modes where the numbers are ascending from 0 to N-1, and descending from N-1 to 0 (plus a large offset, see conclusion):
sequential
N | hash | sort | judy |
---|---|---|---|
10 | 222.300 | 113.900 | 2290.800 |
100 | 129.580 | 158.380 | 321.770 |
1000 | 157.004 | 132.265 | 116.448 |
10000 | 530.956 | 222.064 | 234.421 |
100000 | 186.679 | 199.344 | 91.489 |
1000000 | 169.161 | 397.652 | 106.706 |
10000000 | 169.059 | 406.508 | 109.828 |
rev_sequential
N | hash | sort | judy |
---|---|---|---|
10 | 582.000 | 1953.000 | 3850.700 |
100 | 341.080 | 107.700 | 1024.080 |
1000 | 412.915 | 103.454 | 340.089 |
10000 | 466.905 | 82.989 | 303.928 |
100000 | 166.437 | 32.308 | 142.295 |
1000000 | 168.156 | 62.047 | 133.119 |
10000000 | 165.229 | 47.946 | 110.284 |
The sort method's performance varies considerably with the order of the input (which makes sense).
Finally, here is a mode where the input is the numbers 0 to N-1, but shuffled prior to the count:
shuffle
N | hash | sort | judy |
---|---|---|---|
10 | 649.300 | 2122.400 | 4101.200 |
100 | 382.830 | 229.240 | 1025.010 |
1000 | 464.091 | 312.579 | 432.148 |
10000 | 466.443 | 371.991 | 279.502 |
100000 | 316.911 | 230.575 | 180.805 |
1000000 | 274.971 | 165.339 | 115.582 |
10000000 | 287.619 | 295.054 | 172.413 |
Conclusion
Depending on your input data, Judy arrays may be a good fit for this task. For large enough N, Judy arrays outperform the hash. Similarly, they outperform the sort in cases where the input is very dense (and not already sorted).
After I ran the benchmarks, it occurred to me that for the modes other than original, due to the input ranges selected, you could use a smaller integer type (such as uint32_t), which would likely improve the sort and hash results by quite a bit. However, the performance of Judy arrays doesn't depend on the input values being small. To demonstrate this, I added a large (10 billion) offset to each input value in the rev_sequential mode and, as you can see, this doesn't change the results of the test. Although not demonstrated in this benchmark, I believe Judy arrays will also perform well if you have multiple dense clusters scattered around the input range.
So to sum up, Judy arrays seem to be a good choice for the following scenarios:
- Dense data sets, when you don't know in advance where the data will cluster.
- Applications you don't want to performance-tune, either because it's difficult or because the input distribution isn't known in advance.
- Inputs with many repeated numbers that you don't want to keep in memory as you collect them. Since that rules out the sort method, this benchmark suggests Judy arrays are preferable to the obvious hash implementation (although a tuned hash implementation could very likely do better for particular input distributions).