Blockchain Sync Algorithms
How incremental changes can have a substantial impact on network performance
The purpose of this article is primarily informative explaining the functionality behind how the Aion kernel synchronizes with the network. Emphasis is put on the enhancements introduced in version 0.3.2 which speed up the time in which a new node can catch up to the network by more than a factor of 6x.
My secondary objective lies in the hope to illustrate how incremental changes can have a substantial impact on performance, and that it is important to not overlook optimizing existing functionality in favor of flashy equivocal features.
The base Aion sync algorithm
The sync algorithm is the process through which a node on the network communicates with other nodes in order to request and import the full history of blocks. It involves multiple components of the kernel, notable among them being:
- the peer-to-peer protocol that enables communication; and
- the consensus algorithm that ensures all nodes agree on how the information in each block is interpreted.
In the following, I illustrate the general ideas behind the algorithms in the sync process. This information can be useful to both developers and anyone running a kernel in understanding certain log messages frequently encountered during execution. For full implementation details, please review the sync package in the Aion public repository.
The Aion kernel imports blocks through two different paths:
- First, there is a long-running dedicated thread for importing distant past blocks named sync-ib (short for “import blocks”). This thread makes sure that if the local top block is far from the network’s best block the node catches up to the network by requesting and processing batches of blocks. A new node connecting to the network will typically see the output from this thread immediately as it logs information for its processed blocks.
- The second path for importing blocks is through short-lived, numbered, worker threads that process status responses from the network. These threads are called p2p-worker-N (where N varies depending on how many such workers are active at the time). A status message contains a peer’s topmost block at the time the response was sent. While the local node is far from the topmost block of the network these status blocks are used to update local information about peers and then discarded. Once a node has caught up with the network, i.e. its best block is the same as the network’s top block, it will continue importing blocks coming from these status messages.
Sample logs for the different import cases are shown below. Note that initially, the p2p workers receive status blocks that cannot be imported either because the block is ahead of the local chain (highlighted in black) or behind it (in white). Then as the import blocks process (in pink) makes progress in advancing the chain it reaches the topmost network block. At that point, the p2p workers take the primary role in importing blocks (in blue), while the sync-ib thread waits to pop back into action if the node falls behind the network again.
A fully synced node is one that receives its new blocks from its peers’ status messages, i.e. imported by the p2p-worker threads. Similarly, a node is still syncing when its blocks are imported mainly by the sync-ib thread.
The time for a new node to sync with the network grows proportionally with the number of blocks on the chain. Our objective, however, is to facilitate the path for new nodes to join the network. The following sections offer details on how the sync process was performed before version 0.3.2. This knowledge is a prerequisite in understanding the enhancements applied in version 0.3.2 described subsequently.
Normal sync progression
The image above outlines the way blocks from the network are handled by the Aion kernel. When working on catching up with the best block of the network the kernel will request sets of blocks from its peers. The requests are made randomly to one of the current active peers and ask for ranges of blocks that partially overlap with the local chain. The responses that come in from the network are appended to the end of a queue, as illustrated by the right side of the figure.
The sync-ib thread mentioned above is the one processing the received blocks. As the left side of the image shows, the blocks are processed in batches, in the order that they were received from the network. As each batch is retrieved from the queue, the blocks in it are filtered to exclude the already imported blocks and what is left is added to the current state of the blockchain. The requests continue in parallel to the import process, resulting in a good deal of overlap between the ranges of blocks received from peers.
The process above is in the NORMAL state of sync. The overlap between responses can be useful. If there were multiple chains of similar height on the network, the local node would be able to import all chains and eventually choose for itself the one with the highest total difficulty. However, the blockchain by definition aims towards a state where all peers agree on a single main chain. This means that in practice the overlap in the normal sync process is highly redundant. This redundancy constitutes an opportunity for optimization which is described in the later sections.
Backward and forward sync modes
Short side chains occur fairly often in practice. As new blocks are mined, multiple options for the same height can be created. As these blocks get communicated through the network only the chain with the highest total difficulty will prevail. This means that some of the nodes may need to adjust their chain to reflect the correct main chain if previously they were on a side chain. Changing the chain is done by finding the common ancestor block and updating what was initially a side chain into the new main chain.
If a node gets disconnected from its peers for a longer duration and continues mining it may end up in a situation where the common ancestor is outside the block import range. This means that its active range of blocks does not allow finding the common ancestor required for updating to the new main chain. For these cases there exist two special states during sync that allow adjusting the active range of blocks to reflect the needs of the local node.
The figure above illustrates how the chain is adjusted back to the main chain once a peer that got disconnected from the network reestablishes connections with its peers. The steps are as follow:
1. Upon making a connection with a peer, the node discovers that it is behind the main chain. The node makes a request for the range of blocks starting at the height of block X which is a few blocks behind its current best block (3 or 15 blocks in practice depending on the height of the main chain).
2. When receiving blocks from the network it discovers that it has in fact received block Y and its children and that there is no known parent for block Y in its database. Since X.parent != Y.parent the node is facing the situation of not being able to switch to what it knows to be a higher total difficulty chain, i.e. the main chain.
3. This is the point where the BACKWARD sync state is used. Given that the blocks it received from its peer do not have a parent in the database, the next request will be made from an earlier block, jumping back a number of blocks to see if the common ancestor can be found. The requested height gets repeatedly pushed back until a common ancestor is found.
4. Once a common block has been found for the two chains the sync state gets switched to FORWARD mode. This signals to the kernel to perform a fast forward of requests upward without range overlap. When a block from the peer in FORWARD mode is imported with the IMPORTED_BEST result, signaling that the node has converted to the main chain, the state is switched back to NORMAL sync and imports proceed with the usual range overlap.
Note that different peers may be in different sync modes at any point in time. The kernel keeps a copy of all chains it receives from the network as long as the blocks on those chains are all valid.
Enhancements introduced in version 0.3.2
The optimization introduced in Aion version 0.3.2 redesigns the sync algorithm by introducing two new states for requesting blocks, namely LIGHTNING and THUNDER. The inspiration for the algorithm lies in the behavior of torrents where data is requested according to its availability. Instead of waiting for a piece of data to be processed before making further requests, the kernel can anticipate a successful outcome and request blocks further from the current best block. Due to the nature of the blockchain, all peers reliably have all blocks up to their current best block, so getting the blocks in ascending order is more efficient than random requests.
The new protocol performs more efficient block requests by:
- reducing the number of times the same request gets sent to the network,
- avoiding some overlap between requests made to different peers,
- and anticipating future needed blocks.
This change effectively leverages the number of active peers to speed up the sync process. Consequently, the speed of full network sync increases proportionally with the number of active peers. Smart temporary storage allows handling more blocks than would normally fit in memory and enables fast retrieval for import. Parallelization is also used for improved performance.
Lightning sync mode
Knowing that a peer’s topmost block is at height 2.6 million (approximately the current height of the Aion main network), we can request any subset of blocks from the local best block to block 2.6 million with reasonable certainty that there will come a time for those blocks to be processed. Given that the fully synced peers on the network agree on the topmost block (by definition of the blockchain), all fully synced nodes will reliably return the same blocks for the same range. It makes sense, therefore, to try to request these blocks only once.
The LIGHTNING sync mode is a new local state for a peer that:
- indicates that requests will be made to that peer anticipating a future need for blocks, and
- allocates a range of blocks to the peer to avoid overlap with other requests.
The figure above illustrates a switch to theLIGHNING state for Peer-1 (in pink). After the start of the kernel, a connection is established withPeer-1 which is initially assigned a NORMAL sync state. As the peer responds to the node’s requests, several other connections are established with peers 2 to 5, which also return blocks based on the requests they received and their individual lag in communication. When the fourth response from Peer-1 is processed the kernel notices that the response duplicates the previously processed one. At that point, the kernel verifies that it already has more than four peers working in NORMAL mode. The redundant work already done by Peer-1 in returning the known block range of 61 to 84, together with the expectation of more redundant work due to the other four active connections constitutes a basis for switching to the LIGHTNING state and requesting a different subset of blocks. This jump can be seen in the returned subset of the next request which starts from block 324, instead of 81 which is what the peers continuing in NORMAL mode return.
Two other things differentiate the LIGHTNING state from a NORMAL one. First, the request size is larger, 40 instead of 24. Second, there is no range overlap for the consecutive requests made to the same peer in LIGHTNING mode. The reasoning behind both of these changes lies in the fact that there is a strong expectation that the blocks returned will be imported, so we maximize the size of the requests without triggering the peer’s cooldown period (when too many requests are made to a peer it may choose to stop answering). Moreover, knowing that we’re making the request to a single peer for the allocated range, there is the strong expectation that the blocks will be connected and no need to verify this through block range overlap.
In the unlucky (and unlikely) case when a peer on a side chain is assigned a LIGHTNING state there is no risk added to the security of the blockchain since the kernel would typically import all chains and asses for itself which has the highest total difficulty. The peers moving forward in NORMAL mode would then import the actual main chain blocks.
Thunder sync mode
A node continues making requests in the LIGHTNING state until it exhausts the range of blocks it was allocated when it switched to this state. At that point, the peer state gets reassessed by checking in with the progress of the local main chain and the block ranges assigned to other LIGHTNING peers. This is done by assigning the peer the THUNDER state, which behaves fairly similar to the NORMAL states. This state is necessary to make a distinction between late responses to LIGHTNING requests and side chain blocks that trigger a BACKWARD sync. In a NORMAL state blocks received without an existing parent indicate a side chain and the need to sync backward as explained the section above. In theTHUNDER state, the expectation is that blocks had previously been requested ahead of time, therefore the blocks are delegated to storage.
The figure above illustrates a switch to the THUNDER state for Peer-1. Continuing the previous example, Peer-1 makes subsequent lightning requests until the allocated range is exhausted (after 6 requests of 40 blocks). At that point, it switches to THUNDER mode requesting blocks according to the current height of the chain. When the response is received, the kernel notes that the peer is again doing redundant work due to the overlap with the previous response. Consequently, it allocates to Peer-1 the next block range, which in this case picks up from where the peer left off since none of the other peers had been switched to LIGHTNING states yet.
Note that the requests made in a THUNDER state are also for 40 blocks instead of 24. The reasoning behind this is that in the likely scenario the peer will still be doing redundant work, and the larger request will prove more useful, i.e. the extra 16 blocks will be imported. The switch away from LIGHNING is also motivated if one considers the less likely scenario where the peers importing blocks in NORMAL mode get disconnected. Without the THUNDER state checking back with the main chain, the chain progression would stop at block 201, while Peer-1 would only return blocks that cannot currently be imported.
Temporary storage and parallelization
The blocks received from LIGHTNING mode requests cannot be imported to the blockchain and cannot be re-added to the queue. Therefore they are temporarily stored in a new data store created for pending blocks. To be able to efficiently store and retrieve these blocks when they can be imported, the pending block storage is designed using three key-value databases as follows:
- The blocks received from LIGHTNING requests are stored in a database called queue where a queue identifier points to a list of consecutive blocks. When the kernel can import the first block in the range, it will retrieve and import the full range of blocks, optimizing, therefore, the required calls to disk.
- Next, a database called level records the height at the start of each known queue pointing to the queue identifier, or a list of queue identifiers in the (unlikely) case of blocks belonging to multiple chains retrieved for the same level. The kernel determines if it can import any of the blocks in storage by checking for known heights in the level database and retrieving the adequate queues according to its current block height.
- Finally, the index database points block hashes to queue identifiers to allow expanding known queues when storing data.
Access to these databases is also optimized for increased performance. Separate threads are used to store received status blocks, which are no longer discarded by the p2p workers, but saved for import when reaching the required height. The block ranges from LIGHTNING requests are also stored by different threads created for this purpose. The blocks from the pending block store that get processed by the blockchain are deleted in parallel, when not performing higher priority work, in order to keep the temporary storage size manageable.
Can we do better?
The answer is “Of course! And we will!”. The high cost of IO is slowing down sync as the blockchain grows and the account states expand. Redundant calls to disk can be eliminated and through the use of caches, we will improve the way data is accessed.
Moreover, additional optimizations to sync have already been started. Geth offers a different type of sync enhancement in its implementation of fast sync. The idea is based on the fact that most nodes are mainly interested in the current state of the blockchain and care less about past states. With this in mind, sync can be expanded to request the state of accounts and transaction receipts to rapidly bring a new node up to the top.
These enhancements are something to look out for in future kernel releases. Our take on other interesting implementation details will be presented in subsequent blog posts.