or: Notes on indexing blockchains

A copy of this post can be found on rust-bitcoin-indexer wiki.

Abstract: I have ventured on the quest of discovering the best way to index a blockchain. Surprisingly little has been written about it, considering that it is a common task for any organization handling blockchain-based interactions. It seems that it is an area dominated by in-house solutions, with very little knowledge sharing. In this post, I go over the problem, ideas, discoveries, and experiences. I also describe a Bitcoin Indexer implementation that I have worked on while researching the subject: rust-bitcoin-indexer.

I expect the reader to be familiar with at least the basics of Bitcoin.

Please note that this is not scientific research, only a spare-time project. Feel free to contact me with comments: pointing out any mistakes or sharing ideas for improvements.

Introduction to Indexing

Why is Indexing needed? In short: most business systems interacting with a blockchain need it to know what is going on on the blockchain, both historically and in real-time.

A full node, like Bitcoin Core (aka bitcoind), exposes a JSON-RPC interface, which allows querying for the state of the network, blocks, transactions, etc.

While useful, a JSON-RPC interface quickly becomes insufficient. First, the node is mostly concerned with maintaining the data necessary for handling network consensus, and storing or processing anything else would be "bloat". Because of this, the range of supported queries is narrow. There is often no practical event system: no way to register for a particularly interesting event and receive a callback when it happens. Whatever indices and event notifications are available are typically designed for a personal-use scenario, while most businesses have multiple customers, need to manage multitudes of addresses, perform many transactions and so on. Whatever built-in wallet, indexing or notification mechanisms are available quickly fall short of servicing the needs of a business user, both in performance and functionality.

The idea behind indexing is to periodically pull the latest data from the node, and throw it into a database and/or event system. This shifts the burden of storing the data and answering queries from the node to a more suitable platform and potentially enables:

  • control over format of all the data,
  • scaling according to business needs,
  • execution of complex queries,
  • flexible and reliable processing of interesting events,

and other functionality which would otherwise be difficult or impossible to achieve.

Challenges for Indexing

Eventual consistency

The core problem with Indexing is that a blockchain is almost like a ledger. Almost, because the latest history is not final. At least on Bitcoin-like blockchains, one or more of the most recent blocks can become extinct (a.k.a. orphaned, discarded, stale) during a reorganization (also called simply a "reorg"), when an alternative recent history (a fork of the chain) accumulates more PoW. Previously extinct blocks can potentially become live again, theoretically even multiple times. While reorganizations are not common, short reorgs happen on a daily basis, and in some rare circumstances reorgs can be even hundreds of blocks deep.

In short: Unlike eg. traditional financial systems, event history on a blockchain is not immediately final. That's the reason for "waiting for N confirmations".

Making indexing, APIs and other systems around the blockchain support even deep reorgs correctly 100% of the time is quite challenging.

Performance

Blockchains consist of the ever-growing, complete history of every event that has ever happened on them. For Bitcoin, this means around 190GB after 10 years. Some blockchains grow slower, some much faster. One way or another, indexing must be able to keep up. In particular, indexing the whole blockchain from the start, for the first time, is going to take quite a bit of time. While this might be a one-time event, it is much more practical if it is a matter of hours and not weeks.

Complexity and errors

Especially for blockchains supporting smart-contracts, the number of classes of events that can happen on the blockchain and be of interest to the user is endless and ever-growing. New contracts (or types of contracts!) are created, Hard Forks are introduced to change rules and data formats and so on.

The main challenges in practice are:

  • re-indexing historical data to include information that was not indexed before
  • fixing data that was indexed incorrectly

Two approaches to interacting with a blockchain

Fundamentally, approaches to interacting with the blockchain can be divided into two groups. I call them caching and streaming, but they can be described using many adjectives. The first one: caching, querying, polling, passive. The other: streaming, reactive, push, active.

A real system/indexer can be a combination of both approaches.

Caching indexing

I believe a caching approach is the most common one. The idea is to passively translate data from the node and dump it to a database, and then let other software/services/components query the database for anything of interest.

The biggest advantage of this approach is its simplicity. Translating data returned from the node into a database format suitable for querying is relatively straightforward. There is little room for mistakes and programming errors.

Another advantage is development simplicity. Databases are well understood, and almost everyone (developers, devops, managers and even users) is familiar with this mode of operation. Building APIs around database queries, and writing the queries themselves, is the bread and butter of software engineering.

The third advantage is robustness. Since every complex operation/query is recalculated from scratch, there are not many moving pieces to worry about. If the original data was wrong, it can simply be fixed. If the query was wrong, it can be rewritten. As soon as the problem is addressed, the system returns to normal operation.

The main disadvantage is passiveness. With the caching approach it is much harder to react to events on the blockchain. If the system is waiting for incoming transactions, it has to continuously poll for updates using the same query. What if the system is monitoring millions of addresses? The cost is latency and load on the database.

Because of this load, the approach often has to be supplemented with a fair amount of caching in different layers, and/or with scaling the database. Both increase complexity, canceling out part of the above advantages.

As this approach is not event-based, it is quite difficult to handle reorgs gracefully, other than by just updating the data in the database to represent the current blockchain state. For any downstream consumers this means possible sudden changes in the upstream state, without any warning or notification. For some applications this is acceptable, for others it might be a problem.

The final problem is handling complex blockchain interactions, which is generally very limited or even impossible. Especially for blockchains supporting Turing-complete smart contracts, not every blockchain event is easy to represent in a database form suitable for querying.

Streaming indexing

This approach comes down to actively scanning blockchain data as it arrives, and reacting to interesting events as they are observed. Downstream systems can register what they are interested in, and simply wait for notifications through callbacks, queues, websocket events, etc.

The first major advantage of this approach is flexibility. Even the most complex blockchain interactions can be detected and processed. As everything is already event-based, handling reorgs is possible and even quite easy.

The second is scalability. With this approach it is generally unnecessary to poll by periodically issuing the same query. Every block on the blockchain can be scanned once, triggering events and changes internally or in the downstream systems. This increases system throughput and lowers latency. This is especially beneficial for systems with a lot of users, addresses, etc.: the amount of work done by the system is related more closely to the amount of blockchain data, and less to the size of the business data itself. The bigger the business system, the more appealing this trade-off becomes.

The main disadvantage is potential complexity and fragility. This approach can involve a lot of moving pieces, with errors cascading to the downstream systems. If those systems are stateful, they might retain the errors even after their source was fixed upstream. It's worth noting that the root of the problem lies mostly in the statefulness made possible by streaming, and not so much in the streaming approach itself.

Evolving a system using this approach can be more difficult, too. Not only does it have to be done more carefully, because of the generally lower robustness to mistakes, but the whole process is more complex and less familiar. What was basically a common database schema/API migration process in the caching approach turns into evolving a microservice or microservice-like event-based system.

Random thoughts about the blockchain

Stack of blocks

When explaining blockchain to newcomers, quite often it is compared to a linked list, because each new block contains the hash of the previous one.

While this is one way to look at it, I think a blockchain is more usefully viewed as a stack. In a stack, all the action happens at the top. Only two operations are supported: pop and push. Each push buries the previous elements deeper and deeper.

A blockchain can be thought of as a stack, with only two operations:

  • push - just like on a normal stack
  • replace - replacing the last N elements with different ones; this can be implemented as a combination of N pop and N push operations on a plain stack. It corresponds to a reorg, where the popped elements become extinct and are replaced with the new ones.

Although the pop operation is internally still there, it is not available on its own. There is no standalone pop on the blockchain, because it can only grow. This might be an important point when consistency is considered. A database should use atomic transactions to implement the replace operation, to avoid presenting an inconsistent state of the blockchain.
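
A minimal sketch of this stack-like view in Rust (the Block and Chain types here are purely illustrative and not taken from rust-bitcoin-indexer):

// Illustrative sketch of the stack-like view of a blockchain; `Block` and
// `Chain` are hypothetical types, not taken from rust-bitcoin-indexer.
struct Block {
    height: u64,
    hash: String,
}

struct Chain {
    blocks: Vec<Block>,
}

impl Chain {
    /// Extend the chain with a new block: a plain stack push.
    fn push(&mut self, block: Block) {
        self.blocks.push(block);
    }

    /// Reorg: atomically replace the last `n` blocks with a new branch.
    /// Internally this is `n` pops followed by pushes, but it is never
    /// exposed as a standalone pop.
    fn replace(&mut self, n: usize, new_branch: Vec<Block>) {
        let keep = self.blocks.len().saturating_sub(n);
        self.blocks.truncate(keep);
        self.blocks.extend(new_branch);
    }
}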

Stream of events

One idea I started with was that a blockchain can be thought of in terms of operations on a stack, or rather on the previously described stack-like data structure. It can be encoded as a stream of "block events": the first event pushes the genesis block, and each event thereafter either adds a new block (push) or replaces previous ones (replace).

This is a very convenient way to think about it. It maps nicely onto a lot of software architecture patterns: iterators, message passing, event sourcing, etc.

I have found that an indexer can be designed as a pipeline processing block events:

Full node -> Fetcher -> DB -> API -> rest of the system -> potentially chained

The Full node is the Bitcoin Core instance, and the Fetcher is a component polling the node for updates. The DB persists the content of the blocks for consumption, and potentially exposes it via an API to a bigger system. Variations of this pipeline are obviously possible.

In such an architecture any time a change happens on the blockchain, it propagates through the pipeline.

A single Block Event is an atomic unit of change and can usually be represented as a tuple of (block height, block id/hash, associated data). The associated data contains the block content in some form, typically all the transactions in the block.
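
For illustration, a block event could be modeled roughly like this (a sketch; the actual types used in rust-bitcoin-indexer differ):

// Sketch of a block event as described above; the names are illustrative.
type BlockHeight = u64;
type BlockHash = [u8; 32];

/// Atomic unit of change flowing through the pipeline:
/// (block height, block id/hash, associated data).
struct BlockEvent<D> {
    height: BlockHeight,
    id: BlockHash,
    /// Block content in some form, typically all the transactions in the block.
    data: D,
}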

I have found two alternative ways to actually encode a stream of "block events". I have named them: truncating protocol and canceling protocol.

Truncating protocol

The following sequence of block heights is an example of the "truncating protocol":

1, 2, 3, 4, 5, 3, 4, 5, 6, 4, 5, 6, 7, ...

Can you see where exactly the reorgs occurred? After the 5th block, there was a 3-block-deep reorg, replacing the previous blocks at heights 3, 4 and 5. After the block at height 6, there was a reorg replacing the blocks at heights 4, 5 and 6. The items initiating a reorg are the 3 immediately following the first 5, and the 4 immediately following the 6.

I named this the truncating protocol because, if one were to store the blockchain state in an array/vector representing the blockchain, any time a block event is received the array is truncated to the length equal to the height of the block, and the new block is appended. In other words: any time a block event arrives with a height lower than or equal to the previously recorded one, the blocks with the same or higher height are canceled.
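
A minimal sketch of a consumer applying the truncating protocol to an in-memory vector (keeping heights explicit in an illustrative Block type rather than relying on array indices):

// Sketch of a truncating-protocol consumer; the `Block` type is illustrative.
struct Block {
    height: u64,
    hash: String,
}

fn apply_truncating_event(chain: &mut Vec<Block>, block: Block) {
    // Everything stored at the same or a higher height is now extinct.
    while chain.last().map_or(false, |top| top.height >= block.height) {
        chain.pop();
    }
    chain.push(block);
}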

Canceling protocol

The same sequence of reorgs from the previous section can be alternatively encoded as follows:

1, 2, 3, 4, 5, 5, 4, 3, 3, 4, 5, 6, 6, 5, 4, 4, 5, 6, 7, ...

In this protocol, each element of the sequence maps closely to push and pop operations on a stack. If the height is equal to the height of the top element on the stack, this block is being canceled (hence the name) and should be popped from the stack. If it is equal to the height of the top element increased by 1, the new block extends the blockchain and should be pushed onto the stack. This time, the items canceling existing blocks are the descending runs 5, 4, 3 and 6, 5, 4.
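
The corresponding consumer sketch for the canceling protocol (reusing the illustrative Block type from the previous sketch):

// Sketch of a canceling-protocol consumer.
fn apply_canceling_event(chain: &mut Vec<Block>, block: Block) {
    match chain.last() {
        // Same height as the top of the stack: the block is being canceled.
        Some(top) if top.height == block.height => {
            chain.pop();
        }
        // Otherwise it is the next block, extending the chain.
        _ => chain.push(block),
    }
}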

Comparing canceling and truncating protocols

These two protocols can be employed in different situations. One application is storing the blockchain as a ledger in a database. Another is communication between two parts of the system: sender and receiver.

In communication

In communication, the truncating protocol is potentially more convenient for the sender, and more demanding for the receiver. The canceling protocol is the opposite: more difficult for the sender, but easier for the receiver. This is true for scenarios where the receivers only calculate some aggregates from the blocks and discard them afterwards.

To make it more concrete: imagine two parts of the system: a sender making block events available to a receiver. The receiver is responsible for calculating the balance of one address it is interested in, by scanning each block event for updates.

In the truncating protocol, the sender does not have to maintain any memory of previous blocks. It just generates new elements of the sequence as new blocks become available. When a new block event extends the blockchain, the job of the receiver is easy: scan the block and potentially update its own state accordingly. However, when a reorg is signaled, the receiver needs to undo whatever balance changes came from the extinct blocks. To do so, it needs to either keep a history of previous blocks or be able to fetch them from somewhere.

In the canceling protocol, the situation is the opposite: during a reorg, it is the sender's job to send the content of each extinct block to the receiver. So, it has to be able to get them from somewhere: its own or an external store. The job of the receiver is always easy: the block content (extending or being canceled) is always available.
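
For illustration, here is a toy sketch of such a receiver under the canceling protocol. The BlockData type with a precomputed balance_delta is a stand-in for real block content, and whether an event cancels or extends is assumed to have been determined from its height, as in the earlier sketch:

// Toy sketch of a canceling-protocol receiver tracking one address.
// `BlockData` with a precomputed `balance_delta` is a stand-in for real
// block content; a real receiver would scan the block's transactions itself.
struct BlockData {
    /// Net balance change for the watched address in this block.
    balance_delta: i64,
}

struct Receiver {
    watched_balance: i64,
}

impl Receiver {
    /// `canceled` is derived from the event's height, as in the earlier sketch.
    fn on_event(&mut self, block: &BlockData, canceled: bool) {
        if canceled {
            // The extinct block's content is delivered to us, so undoing its
            // effect is trivial: no history needs to be kept on this side.
            self.watched_balance -= block.balance_delta;
        } else {
            self.watched_balance += block.balance_delta;
        }
    }
}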

Because of this asymmetry, I believe it is better to use the truncating protocol when the events are streamed towards persistent storage (DB) and the canceling protocol when events are streamed from the persistent storage, towards more reactive and limited parts of the pipeline.

In the DB

When storing the block data in some form of database, it would be great to utilize the kind-of-immutable, append-only nature of blockchains to its full potential. Using exclusively read and append operations can enable a performance increase (eg. by utilizing an append-only DB), easier backups and a generally simpler mental model.

To leverage the benefits of immutable, append-only storage, the canceling protocol seems perfect as it is: entirely append-only. Eg. when storing transactions (or rather: transaction outputs), the value of a canceling record can be stored as negative, making aggregate operations like SUM(value) work without any changes.
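
A tiny Rust illustration of that idea, with an in-memory vector standing in for the append-only table (types and values are made up for the example):

// Illustration: a canceling event is itself an append, with the value
// negated, so a plain sum over all rows still yields the right aggregate.
struct OutputRow {
    address: &'static str,
    value: i64, // negative when the row cancels an earlier one
}

fn balance(rows: &[OutputRow], address: &str) -> i64 {
    rows.iter()
        .filter(|row| row.address == address)
        .map(|row| row.value)
        .sum()
}

fn main() {
    let rows = vec![
        OutputRow { address: "addr1", value: 50 },  // output created in a block
        OutputRow { address: "addr1", value: -50 }, // that block became extinct
        OutputRow { address: "addr1", value: 50 },  // re-created by the replacing block
    ];
    assert_eq!(balance(&rows, "addr1"), 50); // SUM(value) still works
}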

The truncating protocol, used as a way to store entries in a database, fundamentally requires a mutation to mark extinct/orphaned blocks, or deleting them completely. However, I have found it better at handling the UTXO set, because such markers make it easier to filter out records from extinct blocks.

If the outputs (creation of an unspent output) and inputs (spending an output) are stored as follows:

-- outputs: creation of (initially unspent) transaction outputs
CREATE TABLE IF NOT EXISTS outputs (
  id BIGSERIAL NOT NULL UNIQUE PRIMARY KEY,
  tx_id BIGINT NOT NULL,  -- transaction that created the output
  tx_idx INT NOT NULL,    -- position of the output within that transaction
  value BIGINT NOT NULL,
  address TEXT,
  coinbase BOOLEAN NOT NULL
);

-- inputs: spending of previously created outputs
CREATE TABLE IF NOT EXISTS inputs (
  output_id BIGINT NOT NULL PRIMARY KEY,  -- the output being spent
  tx_id BIGINT NOT NULL                   -- transaction that spends it
);

then, in the truncating protocol, it is possible to query eg. address balances with:

SELECT address, SUM(
    CASE WHEN inputs.output_id IS NULL THEN value ELSE 0 END
  ) AS value
  FROM outputs
  JOIN txs AS output_txs ON outputs.tx_id = output_txs.id
  JOIN blocks AS output_blocks ON output_txs.block_id = output_blocks.id
  LEFT JOIN inputs
    JOIN txs AS input_txs ON inputs.tx_id = input_txs.id
    JOIN blocks AS input_blocks ON input_txs.block_id = input_blocks.id
  ON outputs.id = inputs.output_id AND input_blocks.orphaned = false
  WHERE
    output_blocks.orphaned = false
  GROUP BY outputs.address;

To explain the query: the balance of an address is the sum of all its outputs that don't have a corresponding input (i.e. were not spent). Both inputs and outputs are simply ignored if they come from an extinct block.

I was unable to find a similar query, or a way to organize inputs/outputs, when using the canceling protocol. If one exists, it would make the canceling protocol a clear winner. Without it, the truncating protocol can be better if the database might be used for querying the UTXO set state.

rust-bitcoin-indexer

rust-bitcoin-indexer is my experiment in implementing a simple, but featureful and reliable Bitcoin Indexer.

The core indexer is just a very tiny bit of glue between two parts: a Prefetcher fetching blocks from the node, and a DataStore, persisting the data inside them for later consumption.

A lot of thought was put into making the database format usable for both the caching indexing and the streaming indexing approaches that I have described before.

Note that design and implementation of rust-bitcoin-indexer are probably going to evolve, and I have used permalinks to link to parts at the time I was writing about them.

Prefetcher & Rpc

Prefetcher is the beginning of the indexing pipeline. It uses a thread-pool to rapidly query blocks (with a big look-ahead) and returns them as an Iterator yielding RpcBlock items using the truncating protocol.

Prefetcher takes an object implementing the Rpc trait:

pub trait Rpc: Send + Sync {
    type Data: Send;
    type Id: Send + Eq + PartialEq + Display + Debug + Clone;
    const RECOMMENDED_HEAD_RETRY_DELAY_MS: u64;

    /// Current height of the chain the node is on.
    fn get_block_count(&self) -> Result<u64>;

    /// Id (block hash) of the block at the given height, if any.
    fn get_block_id_by_height(&self, height: prelude::BlockHeight) -> Result<Option<Self::Id>>;

    /// Content of the block with the given id, plus the id of the previous block.
    fn get_block_by_id(&self, hash: &Self::Id) -> Result<Option<(Self::Data, Self::Id)>>;
}

I have found the above abstraction to be the simplest and most straightforward interface for interacting with a node from the perspective of an indexer. All that is needed is the ability to ask for the current chain height (get_block_count), query the block id (block hash) at a given height, and then query the block content by an id.

Both Id and the returned Data are associated types chosen by the implementation, so they can be arbitrary. This makes the code easier to test, and potentially reusable for other blockchains.

get_block_by_id returns both the data of the requested block and the Id of the previous block (just as Bitcoin blocks do), so it is possible for the Prefetcher to detect reorgs. For each returned block, that Id is compared with the Id of the previously processed block, and if they differ it means there was a reorg; the Prefetcher resets its work-pool and tries indexing again from a height one less than the current one. If the reorg is deeper, this might repeat multiple times.
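
A simplified sketch of that detection step (not the actual Prefetcher code; it assumes BlockHeight is a plain u64-like type and Result is the crate's usual result alias):

// Simplified sketch of the reorg detection described above; not the actual
// Prefetcher implementation.
enum Fetched<R: Rpc> {
    /// No new block at this height yet (or it vanished mid-fetch); retry later.
    NothingNew,
    /// The next block, building on the one we processed last.
    Extends { id: R::Id, data: R::Data },
    /// The new block does not build on our previous block: a reorg happened.
    /// The Prefetcher resets its work-pool and retries from one height lower,
    /// repeating if the reorg turns out to be deeper.
    Reorg,
}

fn fetch_next<R: Rpc>(rpc: &R, next_height: u64, prev_id: &R::Id) -> Result<Fetched<R>> {
    let id = match rpc.get_block_id_by_height(next_height)? {
        Some(id) => id,
        None => return Ok(Fetched::NothingNew),
    };
    let (data, parent_id) = match rpc.get_block_by_id(&id)? {
        Some(block) => block,
        None => return Ok(Fetched::NothingNew),
    };
    if &parent_id != prev_id {
        return Ok(Fetched::Reorg);
    }
    Ok(Fetched::Extends { id, data })
}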

The Prefetcher logic is tested using a randomized property-based test. It basically generates a random sequence of reorgs, and at the end compares the expected and actual state of the blockchain.

Postgresql DataStore

I picked Postgresql as a store for the indexed data. However, I'm trying to keep the design db-agnostic, and not rely on any Postgresql-specific features.

The whole DataStore is abstracted away behind a simple DataStore trait/interface.

pub trait DataStore {
    // (...)

    /// Height of the highest block already stored, if any.
    fn get_head_height(&mut self) -> Result<Option<BlockHeight>>;

    /// Hash of the stored block at the given height, if present.
    fn get_hash_by_height(&mut self, height: BlockHeight) -> Result<Option<BlockHash>>;

    /// Persist the data of one block.
    fn insert(&mut self, info: crate::BlockCore) -> Result<()>;
}

This should allow extending the indexer to support any other form of database and/or event systems.

get_head_height and get_hash_by_height are called on start, to determine at which height the Prefetcher should start fetching blocks. Afterward, all blocks returned by the Prefetcher are passed to DataStore::insert.
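
In rough pseudocode, the glue between the Prefetcher and the DataStore looks something like the sketch below. It is only an illustration: run_indexer, the Prefetcher::new signature and the to_block_core conversion helper are hypothetical, and BlockHeight is assumed to be a plain number.

// Rough sketch of the indexing glue; not the actual implementation.
fn run_indexer<R: Rpc, D: DataStore>(rpc: R, mut store: D) -> Result<()> {
    // Resume from wherever the store left off, or from the genesis block.
    let start_height = match store.get_head_height()? {
        Some(height) => height + 1,
        None => 0,
    };
    // The real Prefetcher is presumably also given the hash previously stored
    // at the resume point (via `get_hash_by_height`), so it can detect a reorg
    // that happened while the indexer was down.
    let prefetcher = Prefetcher::new(rpc, start_height); // hypothetical constructor
    for rpc_block in prefetcher {
        // Convert the fetched block into the DataStore's input format and
        // persist it. Blocks arrive using the truncating protocol, so on a
        // reorg `insert` sees a height it has already stored and is expected
        // to orphan the old blocks first.
        store.insert(to_block_core(rpc_block))?; // hypothetical conversion helper
    }
    Ok(())
}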

Inside the Postgresql struct implementing the DataStore trait, blocks are gathered in batches, parsed into an internal DB format, and handed to a thread-pool that inserts them into the DB.

Quite a bit of code (and thus complexity) and effort was dedicated to optimizing insertion performance. While in normal operation Bitcoin produces a relatively tiny amount of data every 10 minutes, the problem lies in having to index 10 years of Bitcoin history. A first naive implementation turned out to be too slow to be practical: initial indexing would take a week or even more to complete. The current improved implementation takes around 12 hours on my desktop. I won't be getting into the details of the performance optimizations here; they are listed in the README file.

The database structure can be found in the db initialization file. In short: 4 tables (blocks, txs, outputs and inputs) are used to keep the blockchain data. Everything is normalized to simplify the write pipeline and optimize the space used. The only mutable data is the blocks.orphaned column. This organization comes at the cost of having to join everything all the way to the blocks table to filter out data from orphaned blocks. That doesn't seem to be a problem in practice, and I am able to write and execute many useful queries on a fully-indexed Bitcoin blockchain dataset.

This database organization could be changed relatively easily to achieve different trade-offs. Right now, I believe it is much more important for the data organization to support efficient event-based (streaming) APIs. These will eliminate needless database querying by the downstream systems, making read performance less of an issue.

Event streaming support

I'd like to implement both a REST API and a library for downstream services, allowing reliable, easy and efficient tracking of updates in the database itself.

The basic idea is as follows: each downstream system would maintain an opaque cursor, which basically corresponds to a position in the main indexer database itself. This allows periodically querying for updates since the last position. The indexer DB API would respond with a list of blocks using the canceling protocol, plus a new cursor. Downstream systems can then query the new blocks (and the data in them) using the list, and react to any interesting events.
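
A sketch of what such an API could look like (purely illustrative; none of these names exist in the current code, and Result is again assumed to be the crate's result alias):

// Purely illustrative sketch of a cursor-based streaming API.
/// Opaque position in the indexer's history of block events.
pub struct Cursor(u64);

/// One element of the canceling-protocol stream handed to a consumer.
pub enum BlockEvent {
    /// A new block extending the chain.
    Extend { height: u64, hash: String },
    /// A previously delivered block that became extinct in a reorg.
    Cancel { height: u64, hash: String },
}

pub trait EventSource {
    /// Return all block events that happened after `cursor` (in order),
    /// together with a new cursor to use for the next poll.
    /// `None` means "from the very beginning".
    fn events_since(&self, cursor: Option<Cursor>) -> Result<(Vec<BlockEvent>, Cursor)>;
}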

Some initial code adding event streaming support is ready, but it is still at quite an early stage.

Future plans

I'd like to implement mempool indexing, which I think should use a separate set of columns and an entirely separate pipeline that could even run independently, on a different host. Transactions in the mempool behave differently, and the data organization should be built around that.

Summary

rust-bitcoin-indexer and this article were inspired by work and discussions with my colleagues at BitGo. I'm really grateful for all the help I've received.

Due to time constraints, this article is not as elaborate or polished as I would like, but I hope the information presented here will be useful anyway, especially for people designing their own systems interacting with blockchains, potential users/developers of rust-bitcoin-indexer itself, or just casual readers trying to better understand the nature of working with blockchains.

Please don't hesitate to contact me and share your feedback.