Geo sharding a DAG

By Olaf van Wijk Last updated Dec 19, 2019

For true horizontal scalability.

Not surprisingly is this article about sharding a DAG, specifically a DAG in the distributed setting of DLT’s and blockchains. To add to the disclaimer is that this article focusses on the sharding the data-structure, the DAG itself, and not the financial ledger! With that said, this is not a scientific paper but a general description of some key elements for sharding a DAG geographically. I nevertheless believe that if you can’t shard/scale the data-structure you can’t scale the ledger.

if you can’t scale the data-structure you can’t scale the ledger.

The basis of sharding

In the database world and scaling, in general, consistent hashing is one of the main concepts to deterministically assign any data to a shard, a shard is a ‘location’ where your data ends up. Now say you want 16 shards in your system then you simply do Hash(data) mod 16 (Take a number, subtract 16 and take what is left, aka modulo) to identify where you want to store information. In order to know where to find it, you simply find a node/machine/server that matches hash(IP) mod 16 == Hash(data) mod 16. This way you can in a simple deterministic way know where to find your data. There is, of course, a bit more to it and if you like to get into detail please read this article.

The difficult part with this approach comes when there is an unknown amount of nodes that will want to divide the data in more and more shards. Or in other words, do dynamic sharding. Dynamically determining your shards would cause a so-called rebalance of the network because it needs to send data all around and not something you want to do too often. For this reason, DLT solutions like Radix pick a VERY VERY large number of shards with the idea that shard space never runs out. So now we just mod with a very large number but have the problem that there will be not enough devices to fill all this shard-space 1 on 1 like we potentially could with only 16 shards. This situation is handled by so-called hash-rings and Distributed Hash Tables, in this way we can discover what our neighbors have without knowing the entire network topology.

Now, this works fine with key-value pairs but not all data is just and only key-values. Most real-world data, including distributed ledgers, is linked data in some form or another. It is contextual and only means something within that context, like a balance of an address is the result of all transactions sending something to it. With just a distributed key-value store it will have to pop from location to location in your distributed hash table and consistent hash rings to retrieve all of these grouped and related data. If everyone needs to do this all the time then that would clog up all the network bandwidth and then the entire point of sharding is lost.

What I am trying to scetch here are some basic requirements for a distributed linked/related data structure.