Counting Distinct Elements with Judy Arrays

2017-05-24

Daniel Lemire has written an interesting blog post entitled "Counting exactly the number of distinct elements: sorted arrays vs. hash sets?". I've used Judy arrays for similar tasks in the past and was curious how they would stack up, so I've taken Daniel's code and modified it slightly to measure this.

One of the interesting aspects of Daniel's benchmark is that the counting methods are implemented in simple and straightforward ways using generic data-structure implementations. No special performance tuning was done. This is one of the aspects of Judy arrays that appeals to me: fairly good performance across a variety of workloads without tuning.

Judy arrays are essentially trie-like data structures, although the implementation uses a bunch of different compression techniques.

Let's see if they make sense in this scenario.

The code

For the entire benchmark code, please see my GitHub repo. Note that this is just a minor modification of Daniel Lemire's code.

As a quick reference, here are the three contenders (the C++ interfaces are naturally a bit cleaner since Judy is a C library, not C++):

size_t distinct_count_hash(const uint64_t * values, size_t howmany) {
  std::unordered_set<uint64_t> hash(values, values + howmany);
  return hash.size();
}

size_t distinct_count_sort(const uint64_t * values, size_t howmany) {
  std::vector<uint64_t> array(values, values + howmany);
  std::sort(array.begin(), array.end());
  return std::unique(array.begin(), array.end()) - array.begin();
}

size_t distinct_count_judy(const uint64_t * values, size_t howmany) {
  void *j_array = nullptr;

  for (; howmany; values++, howmany--) {
    int ret;
    J1S(ret, j_array, *values);   // set the bit for this value in the Judy1 array
  }

  size_t count, bytes_freed;

  J1C(count, j_array, 0, -1);     // count the set bits over the full index range
  J1FA(bytes_freed, j_array);     // free the entire array

  return count;
}
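
These snippets assume the usual includes (<algorithm>, <unordered_set>, <vector>, and Judy.h) and that the binary is linked against libJudy (-lJudy). Just to show how the functions are called, here is a tiny driver with made-up input; it is not part of the benchmark:

#include <cstdint>
#include <cstdio>

int main() {
  // Made-up sample input: 10 values, 7 of them distinct.
  const uint64_t values[] = {3, 1, 4, 1, 5, 9, 2, 6, 5, 3};
  const size_t howmany = sizeof(values) / sizeof(values[0]);

  // Each of these should print 7.
  printf("hash: %zu\n", distinct_count_hash(values, howmany));
  printf("sort: %zu\n", distinct_count_sort(values, howmany));
  printf("judy: %zu\n", distinct_count_judy(values, howmany));
  return 0;
}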

Benchmark results

The "original" mode is the same as Daniel's test, randomly chosen integers in the whole 64-bit range:

Original

       N       hash       sort       judy
      10    799.900   2091.000   5423.900
     100    476.770    240.300    731.010
    1000    587.062    335.139    664.785
   10000    687.179    319.718    700.199
  100000    481.774    131.039    240.177
 1000000    702.065    176.030    376.434
10000000    830.487    184.250    509.867

Once we start getting into the larger values of N, Judy arrays seem to perform better than the hash but worse than the sort.

I had a feeling that the distribution of the input values might have an impact on the results, so I modified the benchmark to accept a "mode" parameter.
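
For reference, the input generation per mode looks roughly like the sketch below (a paraphrase rather than the exact code from the repo; the function name, RNG choice, and seed are mine):

#include <algorithm>
#include <cstdint>
#include <random>
#include <string>
#include <vector>

// Rough sketch of how each mode fills the input array; details such as the
// RNG and seeding differ from the actual benchmark code.
std::vector<uint64_t> make_input(const std::string &mode, size_t n) {
  std::mt19937_64 rng(42);
  std::vector<uint64_t> values(n);
  if (mode == "original") {
    // uniformly random over the whole 64-bit range
    for (auto &v : values) v = rng();
  } else if (mode == "dense_random") {
    // random values restricted to [0, N)
    std::uniform_int_distribution<uint64_t> dist(0, n - 1);
    for (auto &v : values) v = dist(rng);
  } else if (mode == "really_dense_random") {
    // random values restricted to [0, N/10)
    std::uniform_int_distribution<uint64_t> dist(0, std::max<uint64_t>(n / 10, 1) - 1);
    for (auto &v : values) v = dist(rng);
  } else if (mode == "sequential") {
    // ascending 0 .. N-1
    for (size_t i = 0; i < n; i++) values[i] = i;
  } else if (mode == "rev_sequential") {
    // descending N-1 .. 0, plus a large offset (see conclusion)
    const uint64_t offset = 10000000000ULL;
    for (size_t i = 0; i < n; i++) values[i] = offset + (n - 1 - i);
  } else {
    // "shuffle": 0 .. N-1, shuffled
    for (size_t i = 0; i < n; i++) values[i] = i;
    std::shuffle(values.begin(), values.end(), rng);
  }
  return values;
}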

Here is the "dense_random" mode where the range of the numbers is [0, N):

dense_random

       N       hash       sort       judy
      10    791.000   2174.200   4626.600
     100    249.840    195.480    681.300
    1000    281.552    255.968    348.256
   10000    466.012    322.200    332.084
  100000    238.186    221.104    117.947
 1000000    249.519    179.998    114.024
10000000    275.373    301.188    159.435

Aha! Now Judy arrays are starting to hit their sweet spot, outperforming both hash and sort.

Here is a test where the range of numbers is [0, N/10):

really_dense_random

       N       hash       sort       judy
      10    552.500    334.900   6878.200
     100    179.700    504.310    259.250
    1000    162.291    320.816    599.769
   10000    140.936    363.566    275.752
  100000     89.402    245.404    122.145
 1000000     84.090    150.125     85.825
10000000    223.785    273.394     91.641

And just out of curiosity, here are modes where the numbers are ascending from 0 to N-1, and descending from N-1 to 0 (plus a large offset, see conclusion):

sequential

       N       hash       sort       judy
      10    222.300    113.900   2290.800
     100    129.580    158.380    321.770
    1000    157.004    132.265    116.448
   10000    530.956    222.064    234.421
  100000    186.679    199.344     91.489
 1000000    169.161    397.652    106.706
10000000    169.059    406.508    109.828

rev_sequential

       N       hash       sort       judy
      10    582.000   1953.000   3850.700
     100    341.080    107.700   1024.080
    1000    412.915    103.454    340.089
   10000    466.905     82.989    303.928
  100000    166.437     32.308    142.295
 1000000    168.156     62.047    133.119
10000000    165.229     47.946    110.284

The performance of the sort method varies considerably depending on the order of the input (which makes sense).

Finally, here is a mode where the input is the numbers 0 to N-1, but shuffled prior to the count:

shuffle

       N       hash       sort       judy
      10    649.300   2122.400   4101.200
     100    382.830    229.240   1025.010
    1000    464.091    312.579    432.148
   10000    466.443    371.991    279.502
  100000    316.911    230.575    180.805
 1000000    274.971    165.339    115.582
10000000    287.619    295.054    172.413

Conclusion

Depending on your input data, Judy arrays may be a good fit for this task. For large enough N, Judy arrays outperform the hash. Similarly, they outperform the sort in cases where the input is very dense (and not already sorted).

After I ran the benchmarks, it occurred to me that for the modes other than "original", due to the input ranges selected, you could use a smaller integer type (such as uint32_t), which would likely improve the sort and hash by quite a bit. However, the performance of Judy arrays doesn't depend on the input values being small. To demonstrate this, I've added a large (10 billion) offset to each input value in the rev_sequential mode and, as you can see, this doesn't change the results of the test. Although not demonstrated in this benchmark, I believe Judy arrays will also perform well if you have multiple dense clusters scattered around the input range.
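
As an illustration of that first point, a 32-bit variant of the sort-based counter might look like the sketch below (hypothetical; the benchmark only uses the 64-bit versions shown earlier):

#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical 32-bit variant: halving the element width halves the memory
// traffic for the copy and the sort, which is why the sort (and hash) would
// likely speed up when the values are known to fit in 32 bits.
size_t distinct_count_sort32(const uint32_t * values, size_t howmany) {
  std::vector<uint32_t> array(values, values + howmany);
  std::sort(array.begin(), array.end());
  return std::unique(array.begin(), array.end()) - array.begin();
}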

So to sum up, Judy arrays seem to be a good choice for the following scenarios: