Range-Based Set Reconciliation
Doug Hoyte, December 2023
Introduction
Range-Based Set Reconciliation (RBSR) is a method for synchronising data-sets, either because they were created independently or because they have drifted apart due to downtime, network partitions, misconfigurations, etc. RBSR works well with very large data-sets, and can be safely used in open networks of un-coordinated peers. It is complementary to Conflict-free Replicated Data Types (CRDTs).
RBSR is simple to explain and implement. Compared to alternatives such as bloom filters, it is more efficient and requires less tuning of parameters. Unlike systems based on merkle-trees, it does not require storing per-connection state and is resistant to DoS attacks.
This article describes the theory behind RBSR and introduces Log Periodic's Negentropy project, which specifies a wire-protocol and includes protocol conformance tests and reference implementations in several languages.
This work is based on Aljoscha Meyer's research (overview / paper / master's thesis).
- Introduction
- Case-study: rsync
- The Protocol
- Fingerprints
- Comparison with Alternatives
- Negentropy
- Applications
- Conclusion
- Acknowledgements
Case-study: rsync
rsync is a popular command-line application for synchronising the contents of directories between machines. When the directories are mostly similar, it does this by transmitting less data than simply copying the whole directory. Because rsync is run frequently to keep mirror servers up to date, it has saved a lot of bandwidth since its introduction in 1996.
When invoked, rsync runs through two phases:
Phase 1: Figure out which files were changed
The full list of files in a directory, along with each file's modification time, is transferred to the remote party. If the corresponding remote file's modification time is identical, the files are assumed to be equivalent.
Phase 2: Figure out which parts of a file were changed
For each file that Phase 1 determined was different, break it up into equal sized chunks, compute a checksum for each chunk, and send this list of checksums to the remote party. By comparing these checksums against the checksums of its own file, the remote party can determine which chunks are new/modified and queue them for download.
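As a rough illustration, the checksum list of phase 2 amounts to something like the following Python sketch (illustrative only: real rsync also computes a cheap rolling checksum alongside the strong one, and scales the chunk size with file size):

import hashlib

CHUNK_SIZE = 4096  # illustrative; rsync scales this with file size

def chunk_checksums(data):
    # split the file into fixed-size chunks and hash each one
    return [hashlib.md5(data[i:i + CHUNK_SIZE]).hexdigest()
            for i in range(0, len(data), CHUNK_SIZE)]

# the sender transmits this list; the receiver compares it against the
# checksums of its own copy to determine which chunks to request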
rsync also has a mechanism to detect chunks that have moved around within a file (more on that below).
Content-Addressing
In phase 1, rsync compares a local modification timestamp against a remote timestamp. Although this works well for the mirroring use-case, meta-data timestamps should generally be avoided in distributed systems.
Consider the case where you first connect to an unknown server. Comparing this server's timestamps against your own provides no useful information: even if you trust that its clock has always been set correctly, the time at which it saved a piece of data tells you nothing about whether that data is older or newer than your version.
A common protocol for replication is for every server to keep a log of updates it has processed. Every time a client connects, it asks the server for data that has come in since it last checked. While this scheme doesn't depend on clocks to be correct (and in fact, sequence IDs can be used instead of timestamps), it is inefficient in open networks where a client synchronises with many different servers. This is because the same updates will be downloaded from many servers, and when adding a new server, all updates or the full state must be transferred.
In dynamic/chaotic distributed environments, timestamps can become unreliable for many other reasons. Servers can unexpectedly re-build their databases, resetting timestamp meta-data. Multiple servers can be put behind a load balancer and not present consistent timestamps. Due to bugs or operator intervention, modifications can bypass the log and directly update the DB. Servers can be restored from backups, rewinding the logs.
Take-away 1: In a distributed environment, it is preferable for synchronisation methods to use the contents of data elements (or their hashes) to determine equivalence, and not meta-data such as timestamps.
Divide-and-Conquer
In phase 2, rsync transmits the list of checksums for each chunk in the file. This means that every time a file is synchronised, an amount of data linear in the full file-size must be transferred. To reduce this overhead, rsync uses a larger chunk size as the file size increases. However, chunks should not be too large, since whenever a difference is detected an entire chunk must be transferred.
In the rsync protocol, the full list of chunks is constructed and transferred in one communication round. If we are willing to relax this, and allow back-and-forth communication, then we can improve on the overhead described above.
Imagine taking the checksum of the full file and transferring that. If it matches, great, we are done. Otherwise, split the file into two chunks, and send a checksum for each. If either of those matches the corresponding remote chunk, then we have confirmed that that entire half of the file is identical. Whenever there is a difference, recurse until we have an arbitrarily small chunk that can be transferred directly. (This algorithm was in fact proposed in 2001 by John Langford as Multiround rsync, but was never incorporated into the official rsync program.)
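A minimal Python sketch of this recursive scheme (hypothetical function names; the actual Multiround rsync proposal differs in its details, and here both files are accessible locally rather than across a network):

import hashlib

def diff_ranges(local, remote, lo, hi, min_chunk=16):
    # compare checksums of local[lo:hi] and remote[lo:hi]; in a real
    # protocol the remote checksum would arrive over the network
    if hashlib.sha256(local[lo:hi]).digest() == \
       hashlib.sha256(remote[lo:hi]).digest():
        return []  # this entire region is identical: stop recursing
    if hi - lo <= min_chunk:
        return [(lo, hi)]  # small enough to transfer directly
    mid = (lo + hi) // 2
    return (diff_ranges(local, remote, lo, mid, min_chunk) +
            diff_ranges(local, remote, mid, hi, min_chunk))

# example: two 1 MB buffers differing in a single byte
a = bytearray(1_000_000)
b = bytearray(a); b[123456] = 1
print(diff_ranges(bytes(a), bytes(b), 0, len(a)))  # one small range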
The divide-and-conquer technique is well known to computer scientists as the basis of binary search, and works well in this case because the bandwidth overhead grows logarithmically (very slowly) with the total file size.
Take-away 2: Using divide-and-conquer, shared sub-sets of records can be identified early-on so that no further synchronisation work needs to be performed on them. If the databases are large, this can save substantial resources such as bandwidth, CPU, and IO.
The Protocol
Alternating Messages
In order to avoid transferring a potentially large list of checksums as rsync does, RBSR uses back-and-forth communication rounds between the protocol participants. Each round tries to find common sub-sets of elements so that processing of these sub-sets can be halted. When one side has determined that the other needs an element, it can be sent immediately (without waiting for the sync to complete).
Ranges and Fingerprints
With RBSR, all the elements to be synchronised are sorted into a particular order. Various ordering criteria can be used, but for now let's just suppose that each element has a unique timestamp and this is what we sort by.
A contiguous (no gaps) sequence of these sorted elements is called a range. Whenever we communicate about our ranges with the outside world, we cannot use the indices of the elements in our sorted list because other protocol participants may have entirely different sets. So, ranges must be specified using lower and upper bounds, which are values that can be compared against elements using the sorting criteria (ie, timestamps).
Let's walk through an example of how RBSR works.
The initiator begins by computing a fingerprint over its full range of elements (from timestamp 0 to timestamp infinity), and sending it to the remote side. We'll talk about fingerprints in more detail soon, but for now assume they are cryptographic hashes of all the sorted elements within a range.
In our example the two parties share most of the elements in common, except that the initiator (top) is missing one element that the remote side (bottom) has, and the remote side is missing two elements the initiator has:
Once the remote side receives the fingerprint, it computes the fingerprint for its own full range. If they match, then it immediately knows that both sides have identical sets, so it can reply to the initiator that nothing needs to be done, and the protocol can stop.
But in our example this is not the case, since the two sides have different sets. To make further progress, the remote side splits its range into two smaller sub-ranges, computes the fingerprints for each, and sends them along with corresponding bounds back to the initiator:
After receiving these fingerprints, the initiator computes its own fingerprints for each sub-range specified by the remote side. It is important that the receiver uses the exact same range bounds the other side sent, rather than (for instance) blindly splitting its own range in half.
In our example, the initiator sees that the first fingerprint matches its own for this range. This is great news, since it means that we have now determined half of the elements are identical and we no longer need to do any processing for them (illustrated by colouring these elements green). However, the second fingerprint indicates that the other sub-range has one or more differences, so we need to recurse further.
Next, the initiator splits the sub-range into two sub-sub-ranges, computes their fingerprints, and sends them to the remote side:
Similar to the previous step, the remote side computes its own fingerprints and determines that one of them is identical: nothing more needs to be done for that range.
Although not strictly required, most RBSR implementations will want to stop splitting ranges once the number of elements they contain becomes sufficiently small. In our example, the remote side decides that it doesn't want to split the 6 elements in the right-most range, so it sends those elements (or their hashes), along with the range boundaries, back to the initiator:
After receiving the set of elements in the range, the initiator immediately knows which elements in this range it has that the other side needs, and vice-versa, simply by comparing these received elements against the elements in its own corresponding range.
The initiator notices that it was missing one element, and adds it to its set (illustrated by colouring blue). It also notices that it has two elements that were not sent by the remote side, so it sends these elements and the remote side adds them to its set:
At this point, the two sets are fully reconciled and the synchronisation is complete.
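To make the walk-through concrete, here is a minimal Python sketch of how either party might process a received batch of ranges (simplifying assumptions: elements are byte strings sorted by value, fingerprints are plain non-incremental hashes, and messages are Python tuples; the actual Negentropy wire format described later differs in all of these respects):

import hashlib

BUCKET_LIMIT = 6  # stop splitting once a range holds this few elements

def fingerprint(elems):
    # naive fingerprint: a hash over the hashes of the sorted elements
    h = hashlib.sha256()
    for e in elems:
        h.update(hashlib.sha256(e).digest())
    return h.digest()

def handle_message(my_elems, ranges):
    # my_elems: our full sorted list; ranges: (lo, hi, fp) tuples
    reply = []
    for lo, hi, fp in ranges:
        ours = [e for e in my_elems if lo <= e < hi]
        if fingerprint(ours) == fp:
            continue  # fingerprints match: this whole range is synced
        if len(ours) <= BUCKET_LIMIT:
            # small range: send our elements verbatim so the other
            # side can diff them and reply with anything we lack
            reply.append(('elems', lo, hi, ours))
        else:
            # otherwise, split in two and send each half's fingerprint
            mid = ours[len(ours) // 2]
            reply.append(('fp', lo, mid,
                          fingerprint([e for e in ours if e < mid])))
            reply.append(('fp', mid, hi,
                          fingerprint([e for e in ours if e >= mid])))
    return reply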
Batched Recursion
In the previous diagrams, each pair of fingerprints had one matching and one non-matching fingerprint. But there is a possibility that both sub-ranges will have a difference, and therefore both should be recursed into. For this reason, each message transmission may include multiple ranges and their fingerprints:
This "batching" of multiple ranges has a significant implication: The number of rounds of the protocol only depends on the sizes of the data-sets being synced, and not on their number of differences.
For example, suppose we have a data-set of size 1 million. The number of messages required to reconcile this data-set is:
log(1_000_000)/log(2) = 19.9316
This is an upper bound: in practice fewer messages are needed, because small sub-ranges are sent in their entirety, stopping the protocol early (or the sets may simply be equal, of course).
In our example above, we started with a single range and split non-matching ranges into two sub-ranges. However, if we want to minimise the number of messages needed, we can use a higher branching factor at the expense of increasing the size of each message. For instance, if we used a branching factor of 16, we would then require this many messages:
log(1_000_000)/log(16) = 4.98289
Another thing you may have noticed about the protocol is that messages in both directions (initiator to remote and vice versa) help narrow down the ranges. If instead of counting messages we want to count communication round-trips, we should divide this by 2:
log(1_000_000)/log(16)/2 = 2.49145
In summary, to reconcile a set of 1 million, we expect to need 3 round-trips (rounding up because the initiator must always get a final response before terminating the protocol). Suppose our data-set was of size 1 billion:
log(1_000_000_000)/log(16)/2 = 3.73717
In this case, it would take 4 round-trips. This very slow growth is characteristic of logarithmic functions, and is precisely the reason why the divide and conquer strategy is so effective.
So far we haven't even considered how many differences there are in the sets, because it would not affect our analysis. However, if there are many differences distributed throughout the set, then the messages can grow quite large and we may wish to fragment them. This may increase the number of round-trips.
Range Choice
Implementations have considerable flexibility when choosing their ranges. Although the descriptions above assumed splitting into evenly sized groups of 2 (or 16), there are various reasons why implementations may choose different divisions:
- Partial syncing: If a client is only interested in syncing particular portions of the data-set, it can create ranges only for those portions. The un-sent ranges will be considered already synced by the protocol and will be ignored. For instance, if records are sorted by timestamp, clients could use this to sync just the latest week's worth of records. Conversely, clients could choose to not synchronise the most recent few seconds to avoid redundant double-downloads if records are also being streamed in real-time.
- Variable sizing: Ranges can be split into more or fewer sub-ranges as the sync progresses. This could be directed by simple heuristics, or dynamically derived according to measured network conditions in order to optimise for a particular point in the bandwidth/latency trade-off space.
- Weighted splitting: In some situations, clients may have reason to believe that certain ranges are more or less likely to be in sync than others. For example, clients who sync with each other relatively often may rely on the fact that old data ranges are likely to be in sync, and therefore that most of the differences will be with recent elements. In this case, ranges that cover recent time periods can be made smaller in order to speculatively pre-partition them and reduce latency without increasing bandwidth usage.
- In order to protect against some attacks on fingerprints, implementations may choose to randomise their range boundaries.
- In order to keep messages acceptably small, ranges can be abbreviated to stay within a frame size limit, postponing their resolution until a later round of the protocol.
In all the cases described above, clients can implement their desired range policies without any coordination with the other side of the sync. These abilities, and others, emerge naturally from the protocol itself.
Fingerprints
The purpose of a fingerprint is to provide a compact and reliable summary of a set so that it can be used to check for equality with a remote set.
Previously we suggested that fingerprints could be the cryptographic hash of the sorted elements within a range. (Practically, you'd hash each element and use the hash of these hashes, if elements are variable in size.) Although RBSR will work with such fingerprints, it becomes much more efficient when fingerprints can be computed incrementally.
Incremental Hashing
Suppose you have the elements A, B, and C
and you computed a cryptographic hash (eg, SHA-256) of the concatenation of these elements:
h = sha256(A + B + C)
Now you receive a new element D
and you'd like to update h
to include it. Unfortunately, the only way to do this is to re-compute the hash from scratch using the entire new set:
h' = sha256(A + B + C + D)
With incremental hashing, we can compute the new hash value without having to start again from scratch. A simple incremental hash can be implemented by hashing each individual element and then combining them with bit-wise XOR (⊕):
h = sha256(A) ⊕ sha256(B) ⊕ sha256(C)
Since XOR is associative, we can incrementally add in the new value:
h' = h ⊕ sha256(D)
Most incremental hashes allow you to subtract elements out of the hash too. With XOR, every element is its own inverse so it is as simple as:
h' ⊕ (-sha256(D)) = h' ⊕ sha256(D) = h
Hashing and then combining with XOR is an example of the randomise then combine paradigm for creating incremental hash functions.
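A minimal Python sketch of this construction (illustrative only; as discussed below, raw XOR is not a safe choice against adversarial inputs):

import hashlib

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def incremental_hash(elems):
    # "randomise then combine": hash each element, XOR the hashes
    h = bytes(32)
    for e in elems:
        h = xor_bytes(h, hashlib.sha256(e).digest())
    return h

h = incremental_hash([b'A', b'B', b'C'])
h2 = xor_bytes(h, hashlib.sha256(b'D').digest())            # add D
assert h2 == incremental_hash([b'A', b'B', b'C', b'D'])
assert xor_bytes(h2, hashlib.sha256(b'D').digest()) == h    # remove D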
Tree-Friendly Functions
The advantage of using incremental hash functions is that we can use them to reduce the amount of work needed to compute fingerprints by storing elements in a tree data-structure. If an incremental hash function has the properties that make this possible, it is called "tree-friendly" (see Meyer's paper for the exact algebraic requirements).
The most obvious type of tree to use is the B-tree. These trees have large branching factors (many children) because this encourages related data to be stored together, exploiting the locality of reference benefits inherent to most types of storage.
As well as pointers to its children, each node of the tree contains the incremental hash of all elements below it in the tree:
To find an item in the tree, we start at the root node, and traverse down through the levels until we get to a leaf node (path indicated in red below).
If we wanted to compute the hash of all elements preceding the item we have found (indicated in blue), we could iterate over all the leaf nodes and incrementally hash the elements. However, this would involve loading and processing a number of elements linear in the DB size, which could be very large:
Fortunately, because nodes store the incremental hashes of all their sub-elements, this is unnecessary. Instead, as we traverse the tree, we can accumulate the tree nodes to the left of the path we traversed, and the elements in the leaf node that precede the element we found:
The result is that only a logarithmic number of hashes need to be incrementally combined.
This operation (computing the incremental hash of preceding elements) can be used to find the incremental hash of any range of elements. Suppose we have the elements 1 through 10, and we want to find the incremental hash of elements in the range [3,5) (in computer science, ranges conventionally include the begin index and exclude the end index). To do so, use the operation two times -- once for each end of the range -- and subtract the results:
(1 + 2 + 3 + 4) - (1 + 2) = (3 + 4)
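Expressed with the XOR hash from earlier, a flat-list Python sketch of this computation looks as follows (a B-tree performs the same combination while touching only a logarithmic number of nodes):

import hashlib

def h(e):
    return hashlib.sha256(e).digest()

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def prefix_hash(elems, n):
    # incremental hash of the first n elements
    out = bytes(32)
    for e in elems[:n]:
        out = xor_bytes(out, h(e))
    return out

elems = [str(i).encode() for i in range(1, 11)]  # elements 1..10

# fingerprint of [3,5): combine the prefix hashes for each end of the
# range (with XOR, "subtraction" is just another XOR)
fp = xor_bytes(prefix_hash(elems, 4), prefix_hash(elems, 2))
assert fp == xor_bytes(h(b'3'), h(b'4'))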
In fact, being able to subtract incremental hashes is not strictly necessary for an RBSR tree (see Meyer's paper).
Because tree-friendly hash functions can be combined and separated, combinations of multiple trees or sub-trees can be synced in the same protocol-round. This gives implementations considerable freedom in how they store their data and allows parties that use different configurations to remain sync-compatible.
Fingerprint Threat Model
Two parties synchronising must necessarily trust one another to some degree: At the very least enough to download data from each other. Regardless of the sync protocol, if one side wants to it could always pretend that it doesn't have some data, or insert some bogus data into its data-set. The validity of the data, and the rules about its acceptance, are outside the scope of a sync protocol.
However, with RBSR there is another threat model to worry about: If a malicious third-party is able to insert specially-crafted data elements into one of the data-sets, it could choose elements that cause a range to have an equal fingerprint as some different range. During the sync, if the two parties compared the fingerprints of these two sets, they would incorrectly consider them equivalent and would stop trying to sync them.
If the specially-crafted malicious data elements themselves are not synced, nobody is likely to care. However, if the malicious elements prevent innocent victim elements from syncing, an attacker could use this to censor users by preventing their events from propagating throughout the network.
In the case of XOR, consider what would happen if you found a set of malicious elements that when hashed and XORed together were equivalent to the hash of a victim element. When computing the fingerprint of a range that includes the malicious elements and the victim element, they would "cancel out" and the two ranges would incorrectly have the same fingerprint.
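As a toy demonstration with fingerprints truncated to 16 bits, a meet-in-the-middle search finds such a cancelling pair almost instantly (at full 256-bit width this naive search is hopeless, but as described below, for XOR a Gaussian elimination attack scales up easily):

import hashlib

def h16(e):
    # toy 16-bit fingerprint (real fingerprints are much wider)
    return int.from_bytes(hashlib.sha256(e).digest()[:2], 'big')

victim = b'victim element'
target = h16(victim)

# find two "malicious" elements whose hashes XOR to the victim's,
# so that all three cancel out of any range fingerprint
seen, m1, m2 = {}, None, None
for i in range(1 << 20):
    m = b'malicious %d' % i
    if target ^ h16(m) in seen:
        m1, m2 = seen[target ^ h16(m)], m
        break
    seen[h16(m)] = m

assert h16(m1) ^ h16(m2) ^ h16(victim) == 0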
Note that there are many obstacles to actually performing this censorship attack:
- The two fingerprints actually need to be compared in a sync. If one side of the sync is mostly empty, then the recursion may terminate early, and the elements will be transferred.
- The malicious elements need to be adjacent to, or at least close by, the victim element. If a range boundary ever lands in the middle of these malicious elements, the attack will fail.
- If the ranges being compared also differ in any other elements, the ranges will be synced.
- If any future elements are added to the range, they may correctly sync up at that point.
Security of Incremental Hash Functions
How difficult is it to find a set of elements that when hashed and then XORed together equal a specific target value? Surprisingly easy. We built a Gaussian elimination script that does it in 2 seconds for 256-bit fingerprints. This is unfortunate, because XOR can be computed extremely efficiently.
Another attractively efficient alternative is addition mod 2^N: for instance, hashing each element with SHA-256 and then adding the hashes together (mod 2^256). Finding collisions in this scheme is significantly harder than with XOR, because of the propagating carry bit. Still, it is feasible: we built a k-dimensional birthday-problem solver that can find arbitrary 256-bit collisions in about 28 hours using 8 cores, 60 GB of RAM, and 1.5 TB of storage.
There are other candidates, most notably Elliptic Curve Multiset Hash. As far as we know, finding ECMH collisions is computationally infeasible. However, it is significantly slower than other methods, even when using optimisations which would not be available in many production environments. Furthermore, there only seems to be a single reference implementation, and requiring it as a dependency would go against our goals of building a simple, general-purpose RBSR protocol.
Research into finding an optimal incremental hash function is on-going.
Extra Hardening
Because of the many difficulties in executing this attack, a cryptographically secure incremental hash function is not necessarily even required.
rsync is theoretically vulnerable to the same problem, because it uses hash functions with known weaknesses such as MD4/MD5/xxhash, yet to our knowledge this has never been a problem in practice (the analogy here is two trusted servers rsyncing a tarball containing attacker-created files).
Additionally, there are hardening steps that we can take to make attacks even less reliable:
- Range Randomisation: Collision attacks involve creating several hundred malicious records. If a range boundary ever falls in the middle of these records, then the attack will fail. Because of this, implementations may choose to randomise their sub-ranges when recursing: the number of ranges, their exact offsets, or both can be randomised.
- Incorporating Set Size: In addition to the incremental hash, fingerprints can also incorporate the number of elements within the range. In cases where the attacker can only write to one side of a network-split, this can categorically prevent certain attack classes, even when a collision in the incremental hash function can be rapidly found. For example, this provides protection in the important scenario where mostly-offline clients periodically sync up with a wider network.
Comparison with Alternatives
There are many alternative methods of set reconciliation. Under specific circumstances, some of them may have advantages over RBSR. However, in our opinion, all of them have serious down-sides which make them inferior to RBSR as general-purpose synchronisation protocols.
Bloom filters
Bloom filters are probabilistic data-structures that encode membership information into hash-derived bit-fields. These bit-fields are transferred to peers who then compare them against their own data-sets.
The fundamental problem with bloom filters is that they have false positives. These manifest as records that one side believes the other side to have, but in fact does not, and which therefore are not reconciled. To deal with this, false positives must somehow be detected outside of the protocol, or the filter must be parameterised to make them very unlikely (which implies impracticably large filters).
Similar to RBSR's fingerprints, bloom filter designs are susceptible to censorship attacks where specially selected inputs can prevent a target record from being synced.
If a protocol has a mechanism to detect false positives, it can be vulnerable to a DoS attack where large amounts of colliding inputs are selected. To demonstrate this, we built a proof-of-concept that causes Automerge's sync system to degrade to a 97% false-positive rate. Fortunately, Automerge's data is stored in a hash-linked DAG format which can be relied on to reduce the impact of this DoS attack.
There is a variation of bloom filters called invertible bloom filters. These suffer from a related problem: when the filter occupancy gets too high, the reconciliation fails entirely and you have to start over with a larger filter.
Previously, our opinion was that bloom filters were non-ideal for set reconciliation and RBSR represented an entirely superior approach. However, the paper Practical Rateless Set Reconciliation and the algorithm it describes, RIBLT, has changed our perception. Although RBSR still has several advantages, rateless invertible bloom filters also represent a compelling solution to set reconciliation. To learn more about this, Log Periodic has built an implementation of this algorithm called riblet.
CPI Sketches
There is a class of algorithms called "sketches" based on characteristic polynomial interpolation (CPI), of which the pinnacle implementation is probably Minisketch.
These algorithms are extremely impressive in that they can encode sets of elements into packets so small that they approach the information-theoretic limits.
Unfortunately, they do not seem to scale well to even modest-sized data-sets. According to minisketch's docs, the largest size currently of interest to the authors is 4096. As the size grows, the CPU requirements grow rapidly. In addition, you need to predict the "capacity" of the sketch in advance, which is the largest number of elements that can differ, otherwise the reconciliation will fail. Finally, the sizes of the elements themselves are limited and larger sizes mean more CPU usage. Minisketch only supports up to 64-bit elements.
We believe that sketches have their niche, but compared to RBSR they are in an entirely different category.
Rigid Structures
A rigid data-structure uses properties of its stored data to organise its storage layout. If the history of insertions and deletions is irrelevant to the resulting organisation, these data-structures have a "path-independence" that makes them useful for set reconciliation.
For instance, a rigid tree structure can compare the higher-level nodes between the two sides of the sync and eliminate common nodes early. Unfortunately, this means that implementations have limited freedom to re-arrange or re-balance their trees.
gzip --rsyncable
To better understand the consequences of rigid data-structures, let's return to our rsync case study.
In addition to the "strong" MD5 checksums that rsync uses for content-addressing, rsync also uses a special rolling checksum to detect when blocks of a file have been moved around. For example, if you insert a few bytes at the beginning of a large file, rsync will be able to detect that all chunks of the file have been "shifted", and will not need to retransmit them.
However, suppose you would like to store this large file in a compressed format. In this case, modifying a few bytes at the beginning of the uncompressed file could cause all the later data to be changed, meaning that rsync would not find any shifted blocks and would degrade to transferring the entire file.
To avoid this, gzip supports an --rsyncable flag. When provided, gzip will flush its output stream periodically, resetting the compressor's state and preserving chunks of compressed data so rsync can sync it more effectively.
But how does gzip know when to flush? It can't be at fixed offsets, because of the shifting issue. It can't look for patterns in the compressed stream, because that is what it's trying to control. Therefore, it must look for patterns in its uncompressed input data. Specifically, gzip queues a flush whenever the last 4096 bytes of its input data sum to 0 (mod 4096). There is nothing inherently special about this pattern, except that it is easy to test for and you can expect it to come around relatively often. In cryptanalysis, these arbitrary break-locations are called distinguished points.
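Based on that description, the flush trigger can be detected with a rolling window sum, roughly as follows (a sketch of the pattern as described here, not gzip's actual source code):

def find_flush_points(data, window=4096):
    # report positions where the last `window` bytes sum to 0 (mod window)
    points, s = [], 0
    for i, b in enumerate(data):
        s += b
        if i >= window:
            s -= data[i - window]  # slide the window forward
        if i + 1 >= window and s % window == 0:
            points.append(i)
    return points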
When the input data is random, this works quite well. On the other hand, if malicious parties have some level of control over the input data, they can specially compute some data to have an undesired effect.
To demonstrate this, we built a script called gzip-unrsyncable. This will copy data from its input to its output, while looking for the same patterns that gzip --rsyncable does. When it detects one, it modifies the last byte to break the pattern, preventing gzip --rsyncable from ever flushing its stream.
This attack is possible and 100% reliable across all rsync implementations because the pattern used to determine break-points is unchanging and publicly known. In fact, it must be unchanging and publicly known, since the protocol depends on this same rigid structure being computed on both sides of the sync.
Merkle-Search/Prolly Trees
Merkle Search and Prolly trees are rigid, probabilistic data-structures roughly based on Merkle trees. In order to determine the structure of their trees, they use unchanging, publicly known patterns in the cryptographic hashes of their elements.
When used for syncing, these rigid trees have average-case behaviour similar to RBSR. However, unlike RBSR, their worst-case behaviour degrades into transferring the entire data-set (or arbitrarily large portions of it). With random elements, this worst-case behaviour is overwhelmingly unlikely to occur. However, in an open-network environment, malicious data can be submitted at any time, and this is something that rigid data-structures are fundamentally vulnerable to.
Also, in many designs, rigid data-structures will require a busy server to implement copy-on-write behaviour to preserve a previous tree structure for ongoing sync sessions while the main version of the tree is being updated with new data. Because it is not rigid, RBSR can freely modify its single source of truth without invalidating sync sessions started in the past, and servers can be entirely stateless.
For these reasons, we believe that RBSR is a superior approach to syncing, and rigid data structures like merkle trees should only be used when membership proofs are required.
Negentropy
Negentropy is a Log Periodic project created to accelerate the adoption of RBSR in high-performance production applications.
Negentropy consists of a wire-protocol specification, reference implementations, and conformance tests.
All of the protocol descriptions above apply to Negentropy and its implementations. This section will explain at a high-level the design decisions that went into the protocol itself. For specifics, see the Negentropy project page.
Data Requirements
Elements must contain a numeric "timestamp" field for sorting purposes. This can be in any desired unit (seconds, microseconds, etc), as long as it fits into a uint64. In fact, it doesn't necessarily need to be a timestamp at all.
Different elements can have identical timestamps. In order to provide a consistent ordering in this case, elements must also have a 32-byte ID. Typically this will be a cryptographic hash of the entire element (post-canonicalisation, if necessary). Elements with identical timestamps are sorted lexically by ID.
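In code, this ordering is a plain (timestamp, ID) tuple comparison. A minimal Python sketch:

# each element is reduced to its uint64 timestamp and 32-byte ID
elements = [
    (1002, bytes.fromhex('35' * 32)),
    (1000, bytes.fromhex('4a' * 32)),
    (1002, bytes.fromhex('34' * 32)),
]

# sorting tuples orders by timestamp first, then lexically by ID
elements.sort()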
Timestamps must be an inherent part of the element itself, and not metadata about when an item was downloaded, for example.
It is highly beneficial if elements with similar timestamps are likely to be generated/downloaded/stored together. This is because if several elements with similar timestamps are missing from one side of the sync, then there will be fewer sub-ranges created compared to a situation where the elements are distributed throughout the ID-space.
Range Encoding
In computer science, ranges are conventionally referred to by inclusive lower bounds and exclusive upper bounds. This has the nice property that for two adjacent ranges, the upper bound of the first is the same as the lower bound of the second.
The Negentropy protocol takes advantage of this to avoid double-transmitting redundant range bounds: Each range only needs to send its upper bound. When an implementation would like to not process a given range, there is a special "Skip" range type.
Because sorted element timestamps are monotonically increasing (remember, duplicates are allowed), they can be delta encoded using unsigned integers. This means that if two elements differ by a small number of seconds, the second element's timestamp is sent as this small difference instead of a larger absolute timestamp.
Furthermore, although they are inclusive, lower bounds don't have to begin at an existing element. Negentropy takes advantage of this by truncating the element ID of the first element in a range to the smallest length necessary to separate it from its preceding element: the truncated bytes are implicitly filled in with 0 bytes. If two adjacent bounds differ by a non-zero timestamp, the ID is truncated to 0 bytes, and only the (delta-encoded) timestamp is sent.
As an example, if these are the upper bound elements of our ranges...
(1000, "4a8a769a...")
(1002, "351c5e86...")
(1002, "3560d9c4...")
(1003, "beabef25...")
... then the following will be actually used to construct the wire-level message:
(1000, "")
(2, "")
(0, "3560")
(1, "")
Fingerprint function
For the fingerprint function, Negentropy uses the following:
- Compute the addition mod 2^256 of the element IDs (interpreted as 32-byte little-endian unsigned integers)
- Concatenate with the number of elements, encoded as a variable-length integer
- Hash with SHA-256
- Take the first 16 bytes
This is a tree-friendly function, reasonably secure, and very efficient.
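In Python, that computation looks roughly like this (a sketch following the steps above; the varint is assumed here to be a big-endian base-128 encoding -- consult the Negentropy spec for the normative details):

import hashlib

def varint(n):
    # base-128 varint, big-endian, high bit set on all but the last byte
    out = [n & 0x7f]
    n >>= 7
    while n:
        out.append((n & 0x7f) | 0x80)
        n >>= 7
    return bytes(reversed(out))

def fingerprint(ids):
    # ids: the 32-byte element IDs within the range
    total = sum(int.from_bytes(i, 'little') for i in ids) % (2 ** 256)
    preimage = total.to_bytes(32, 'little') + varint(len(ids))
    return hashlib.sha256(preimage).digest()[:16]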
Frame Size Limits
The Negentropy reference implementations return and accept arbitrary byte blobs as input and output, and leave the actual transfer of messages to the application. Since some network transports may have message-size restrictions, Negentropy supports a frame-size limit parameter, and it will strictly ensure that all messages are less than or equal to this size. This saves applications from having to fragment and reconstruct messages.
Additionally, because with RBSR you will often discover some elements are missing from one of the sides before the sync has completed, the processing of these missing elements can be performed concurrently with the remainder of the sync. By setting a frame size limit, the API will be able to notify you of these missing elements more rapidly.
The trade-off with setting a frame-size limit is that it can significantly increase the number of round-trip communication rounds required for a sync, and will somewhat increase the total bandwidth used.
Negentropy assumes a reliable transport (messages are not dropped or corrupted). However, it does not require an ordered transport because at most one message per sync is in-flight at any time.
Storage
Negentropy does not mandate a particular storage mechanism, and implementations can support different storage back-ends as they see fit.
The reference implementations have a pluggable storage interface that makes it relatively easy to add new back-ends. The C++ reference implementation includes the following:
- Vector: All the elements are put into a contiguous vector in memory, and are then sorted. This can be useful for syncing the results of a dynamic query, since it can be constructed rapidly and consumes a minimal amount of memory. However, modifying it by adding or removing elements is expensive (linear in the size of the data-set).
- BTreeMem: This keeps the elements in an in-memory B-Tree. Computing fingerprints, adding, and removing elements are all logarithmic in data-set size. However, the elements will not be persisted to disk, and the data-structure is not thread-safe.
- BTreeLMDB: This uses the same implementation as BTreeMem, except that it uses LMDB to save the data-set to persistent storage. Because the database is memory mapped, its read-performance is identical to the "in-memory" version (it is also in-memory; the memory just happens to reside in the page cache). Additionally, the tree can be concurrently accessed by multiple threads/processes using ACID transactions.
- SubRange: This is an adaptor that lets you create a storage instance from a sub-range of another storage instance. This is useful for partially syncing DBs.
All reference implementations include at least the Vector back-end. See the Negentropy project page for the current list of implementations and their status.
Applications
Stateless Sync
To store data-sets, trees are the obvious choice because all operations (insert, erase, find lower bound, compute fingerprint, etc) are logarithmic in data-set size. Because RBSR does not require a rigid data-structure, different implementations can use different branching factors or rebalancing strategies and still inter-operate with each other.
In addition to protecting against DoS attacks, the fact that the protocol does not rely on a particular tree configuration has another advantage: the tree structure can be modified in between rounds of the same synchronisation session. A busy server may be constantly ingesting new data, and having to wait until all sync sessions complete before modifying its tree would result in unacceptable delays or live-locks. Similarly, a client may be syncing with many servers simultaneously, and/or want to immediately update its stored data-set without waiting for the syncs to complete.
With rigid data-structures, an in-progress synchronisation session may be invalidated if the underlying tree changes. This can be solved with Copy-on-Write (CoW), which involves taking a temporary snapshot of the tree that will not be affected by subsequent modifications. To make it efficient, the snapshot and the main tree share the majority of their structure, and nodes are only copied when they need to be modified. Although effective, this has disadvantages:
- The server needs to store session-specific sync data: at minimum, a reference to the snapshot
- Additional intermediate memory is consumed to hold the snapshots (a potentially unbounded amount, unless write volume is fixed and/or sync sessions time-out)
- Periodic garbage collection (or reference counting) needs to be performed to delete old snapshots
- Long-running sync sessions will not sync data that was added after the snapshot was taken, even if its corresponding range has not yet been processed
By contrast, with RBSR there is no extra memory or garbage collection overhead, and servers can be entirely stateless.
RBSR is resilient to tree modifications because the tree structure is not exposed in the protocol. Consider the two cases:
- A modification is made to a range before it has been reconciled. In this case, it will still be recursed into as usual. The fingerprints could change from one round to another, but this does not impact the protocol.
- A modification is made to a range after it has been reconciled. In this case, the modification is missed, but this can be modeled as though the modification was made after the sync was performed (which is identical to CoW semantics)
CRDTs
Conflict-free Replicated Data Types (CRDTs) and RBSR are related and complementary technologies. As explained in Ink & Switch's Local-first software article:
"CRDT algorithms provide only for the merging of data, but say nothing about how different users’ edits arrive on the same physical computer."
After performing a sync with RBSR, one or both of the sides has all the information they need to make the sets consistent. If they choose to insert all missing elements into their own data-sets, then they will converge. Because elements are referenced according to their content hashes, there can never be conflicts on write, and the sync operation is therefore a G-Set CRDT.
More advanced "replacement" rules can be implemented at a higher layer. For example, if the inclusion of a particular item implies the deletion of another, then the data-sets will also converge: this is a more useful CRDT.
In general, RBSR can be a building block for many CRDT designs. It can be used as the mechanism to replicate operations in operation-based CRDTs, or to efficiently reconcile state history in state-based CRDTs.
Nostr
Nostr is a simple decentralised social media protocol. Users post signed JSON messages to relays (servers), and download messages they are interested in. Because relays are un-coordinated, and users frequently cross-post subsets of their own (and others') notes, it is a somewhat chaotic environment where different users occasionally see different views of the network.
Although client software is generally tolerant of data inconsistencies, there are many situations where users would like to synchronise their data-sets with various relays. Users may want to ensure a relay has a full back-up of their own events, or boost the visibility of all comments in a particularly important discussion thread. Local-first clients may like to periodically connect to a relay and download any events they missed while offline. Clients might also want to find the union of two sets without having to download all their elements in order to reliably compute like or reaction counts.
In all cases, especially because many clients run on under-powered devices/networks like mobile phones, this should be done using the smallest possible amount of resources.
Since Nostr is an excellent protocol test-bed, to prove the concept of RBSR we have integrated Negentropy with the strfry relay software (also by Log Periodic). Although the protocol extension is still considered experimental, it has been successful so far, and is routinely used to synchronise data-sets with sizes in the tens of millions of elements.
Conclusion
Range-Based Set Reconciliation has a number of nice properties that make it uniquely suited to efficiently synchronising large amounts of data in diverse open networks. Decoupling the protocol from the underlying storage structure gives implementations important flexibility, and provides other benefits like not having to use Copy-on-Write, and DoS attack resistance.
Although the search for an optimal fingerprint function is on-going, reasonable compromises between security and efficiency are possible, and effective counter-measures have been developed.
Negentropy is an effort by Log Periodic to accelerate the adoption of RBSR, and has resulted in production-tested open-source software that can be used to enhance existing distributed protocols, or enable entirely new classes of application, today.
Acknowledgements
Aljoscha Meyer's work on RBSR was the inspiration for Negentropy, and almost all of the ideas in this article are described more formally in his writings on this topic. I also want to thank him for his private correspondence on this subject, and comments on an early version of this article. All errors are mine.
@Nuhvi first forwarded me Meyer's RBSR paper, and helped with the Negentropy Javascript implementation.
Yuki Kishimoto wrote and maintains the Negentropy Rust implementation.