@RubenSomsen
Last active February 12, 2026 16:52

Transaction Output Bitfield Compression

Open research/dev problem for SwiftSync

You can represent the unspent transaction output (UTXO) set as a set of bits over the ordered set of all transaction outputs (TXOs), signifying whether each output is spent or unspent. The majority of the TXOs will be spent, resulting in many 0 bits and relatively few 1 bits (e.g. 00100001). The number of unspent outputs (1 bits) increases as we get closer to the tip. These predictable patterns are well-suited for compression.

We currently use the TXO bitfield in SwiftSync inside the so-called "hints file" to speed up IBD and intend to implement this into Bitcoin Core, but the distribution of the hints file is made harder by its size. For the initial release we will likely forgo optimal encoding, but it would be nice to eventually ship it in an optimally compressed state.

Our current most concise known method is to essentially count the number of 0s between each 1. So for 00100001 that would be [2, 4]. This gets us a ~170MB file that goes down further to ~80MB using generic zip compression. This implies that our current encoding still has plenty of room for optimization.
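As a minimal sketch of this gap-counting method (assuming the bitfield fits in memory and a simple list-of-integers output; the actual hints-file layout is not specified here):

```cpp
#include <cstdint>
#include <vector>

// Compress: bitfield -> list of zero-run lengths preceding each 1 bit.
// For 00100001 this yields [2, 4].
std::vector<uint32_t> compress(const std::vector<bool>& bits) {
    std::vector<uint32_t> gaps;
    uint32_t zeros = 0;
    for (bool unspent : bits) {
        if (unspent) {
            gaps.push_back(zeros);
            zeros = 0;
        } else {
            ++zeros;
        }
    }
    return gaps;  // any trailing 0s are implied by the total TXO count
}

// Unpack: reverse the operation, given the original bitfield length.
std::vector<bool> unpack(const std::vector<uint32_t>& gaps, size_t total_bits) {
    std::vector<bool> bits;
    bits.reserve(total_bits);
    for (uint32_t gap : gaps) {
        bits.insert(bits.end(), gap, false);  // the spent run
        bits.push_back(true);                 // the unspent output
    }
    bits.resize(total_bits, false);  // restore trailing spent outputs
    return bits;
}
```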

A possible area for optimization is the number of bits used to represent the number of 0s between each 1 (something that fits most values, handling the exceptions in a more expensive way). The number of bits can probably go down as we get closer to the end of the chain (fewer 0s between each 1), and as the distribution of 1s and 0s approaches 50% it may make sense to simply start encoding the bitfield as-is.
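This "fixed width for typical values, escape hatch for outliers" idea is essentially Golomb-Rice coding, which Bitcoin already uses for compact block filters (BIP 158). A sketch of a Rice encoder for a single gap, assuming a vector<bool> as a stand-in bit stream:

```cpp
#include <cstdint>
#include <vector>

// Rice-code a single gap with parameter k: the quotient goes out in unary,
// the remainder in k fixed bits. Small gaps cost few bits; rare huge gaps
// degrade gracefully via a longer unary prefix.
void rice_encode(std::vector<bool>& out, uint64_t gap, unsigned k) {
    uint64_t q = gap >> k;
    for (uint64_t i = 0; i < q; ++i) out.push_back(true);  // unary quotient
    out.push_back(false);                                   // terminator
    for (unsigned i = k; i-- > 0;)
        out.push_back((gap >> i) & 1);                      // k remainder bits
}
```

The parameter k would be tuned to the gap distribution, and could shrink toward the tip as the gaps get smaller.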

A basic functional implementation would have a compress and unpack function. The former takes the bitfield and turns it into an efficiently encoded version, while the latter reverses the operation. Code complexity does matter - try not to introduce complex code for minimal gain.

All of the above are just suggestions on how to tackle the problem. There may be better ways none of us have thought of, and it's plausible that more patterns exist which can be taken advantage of (maybe empty early blocks with a single unspent coinbase output). Hopefully some of you will find this an interesting problem to work on.

@rustaceanrob

To elaborate on the two encoding styles, in the case of using only 0/1, we may use a technique called bit packing. This encoding allows for 8 outputs to be encoded using a single byte. In the example above 00100001 is the binary representation of a byte. To decode this value from the file, we walk along the binary representation of the u8, and if the bit is 0 we consider the output spent.
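A sketch of that decode step, assuming MSB-first bit order within each byte:

```cpp
#include <cstdint>
#include <array>

// Decode one packed byte into eight spent/unspent flags, walking the bits
// from most to least significant: 0b00100001 -> {0,0,1,0,0,0,0,1}.
std::array<bool, 8> unpack_byte(uint8_t packed) {
    std::array<bool, 8> unspent{};
    for (int i = 0; i < 8; ++i) {
        unspent[i] = (packed >> (7 - i)) & 1;  // 0 bit means spent
    }
    return unspent;
}
```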

When using counting, also commonly known as run-length encoding, notice that at least one byte is used for every value. If the gaps between unspent outputs in a block are generally small (i.e. 2 and 4, as in the example), then counting is wasteful. However, when the gaps between unspent outputs are very large, this technique is superior. The threshold is around the point where 1/8th of the outputs in a block remain unspent: counting spends at least one byte per unspent output, while bit packing spends one byte per eight outputs regardless of their values.

@exd02

exd02 commented Jan 24, 2026

> Our current method is to essentially count the number of 0s between each 1. So for 00100001 that would be [2, 4]. This gets us a ~170MB file that goes down further to ~80MB using generic zip compression. This implies that our current encoding still has plenty of room for optimization.

Is this comment outdated? The current implementation appears to use bit packing (storing boolean hints as individual bits in bytes), not run-length encoding (counting zeros between ones).

@rustaceanrob

There are multiple current implementations; for instance, there is a Rust project that, until this commit, was using counts instead of bit-packing. "Current method" perhaps could be edited to "most concise known method"; however, the optimal encoding is likely a hybrid of both.

@exd02

exd02 commented Jan 26, 2026

Before starting the implementation, I performed a statistical analysis to support the decision on the most suitable compression strategy for this case.

I want to sanity-check whether the encoding approach below matches what is being discussed, i.e. whether a hybrid RLE + literal bit-packing format like this would be a reasonable baseline.

Below are concrete examples of a deterministic encoding format I had in mind.

Encoding Rules (Summary)

RLE

RLE tag (1 byte):
- bit7 = 1 → RLE
- bit6 = value → bit value (0 = spent, 1 = unspent)
- bit5..0 = 0 → reserved

Followed by:
- varint(run_length): number of consecutive bits

Bit Packing

Literal (bit packing) tag (1 byte):
- bit7 = 0 → literal
- bit6..0 = L → number of valid bits = L + 1 (range 1..128)

Followed by:
- ceil((L + 1) / 8) raw bytes (MSB-first within each byte, matching the examples below)
- unused bits in the last byte are padding
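
To make the format concrete, here is a minimal sketch of the two record emitters. It assumes a LEB128-style varint (consistent with varint(32) = 0x20 in the examples) and MSB-first bit order within literal payload bytes, as in Example 3:

```cpp
#include <cstdint>
#include <vector>

// LEB128-style varint (an assumption; the exact varint flavor is not pinned
// down above, but this matches varint(32) = 00100000).
void write_varint(std::vector<uint8_t>& out, uint64_t v) {
    while (v >= 0x80) {
        out.push_back(static_cast<uint8_t>(v) | 0x80);
        v >>= 7;
    }
    out.push_back(static_cast<uint8_t>(v));
}

// RLE record: tag byte (bit7 = 1, bit6 = value), then varint(run_length).
void write_rle(std::vector<uint8_t>& out, bool value, uint64_t run_length) {
    out.push_back(0x80 | (value ? 0x40 : 0x00));
    write_varint(out, run_length);
}

// Literal record: tag byte (bit7 = 0, bit6..0 = L, meaning L + 1 valid bits),
// then ceil((L + 1) / 8) payload bytes; unused low bits of the last byte
// are padding. Requires 1 <= bits.size() <= 128.
void write_literal(std::vector<uint8_t>& out, const std::vector<bool>& bits) {
    out.push_back(static_cast<uint8_t>(bits.size() - 1));
    uint8_t cur = 0;
    for (size_t i = 0; i < bits.size(); ++i) {
        if (bits[i]) cur |= 1 << (7 - (i % 8));  // MSB-first within each byte
        if (i % 8 == 7) {
            out.push_back(cur);
            cur = 0;
        }
    }
    if (bits.size() % 8 != 0) out.push_back(cur);  // flush padded final byte
}
```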

Examples

Example 1 - 32 consecutive unspents

Value: 11111111 11111111 11111111 11111111 (32 × unspent)

Compact form: 11000000 00100000

Explanation  
- 11000000  
  - bit7 = 1 → RLE  
  - bit6 = 1 → unspent  
  - bit5..0 = 0 → reserved  
- 00100000  
  - varint(32) → run length

Example 2 - 32 consecutive spents

Value: 00000000 00000000 00000000 00000000 (32 × spent)

Compact form: 10000000 00100000

Explanation  
- 10000000  
  - bit7 = 1 → RLE  
  - bit6 = 0 → spent  
  - bit5..0 = 0 → reserved  
- 00100000  
  - varint(32) → run length

Example 3 - 20 mixed bits (literal)

Value: 01100011 11111111 0111 (20 bits)

Compact form: 00010011 01100011 11111111 01110000

Explanation  
- 00010011
  - bit7 = 0 → literal (bit-packed)
  - bit6..0 = 19 → 20 valid bits
- Payload bytes
  - 01100011
  - 11111111
  - 01110000
    - the last 4 bits are padding and must be ignored by the decoder

Just to validate: do the examples above represent a valid compression model for this problem?

And regarding RLE: in my examples I intentionally left bit5..0 reserved for potential future updates, while in the literal bit-packing case I used those bits to encode the length directly (mainly for demonstration purposes). Would it be better to keep this symmetry and leave padding in both headers for extensibility, or should the available bits be used in both cases to inline small lengths, falling back to extra bytes only when necessary (using variable-length encoding)?

@rustaceanrob

rustaceanrob commented Feb 6, 2026

I believe you have the right idea here on encoding. Compatibility in data representation between file versions is not necessary, however, so the reserved bits can be omitted. I have implemented the hybrid scheme on my branch where you can see the diff here. The encoding first computes the size of each representation, and encodes the smaller one for that block. The flag for which encoding was used is included in the block height field. This achieves a size of 161MB for block height 930,000. I encourage you to build the branch yourself and run the RPC if you are interested.
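
Schematically, the per-block selection looks something like the sketch below (not the code from the branch; the codecs here are simplified stand-ins):

```cpp
#include <cstdint>
#include <vector>

// Simplified stand-in codecs; the branch's real encodings may differ.
static void push_varint(std::vector<uint8_t>& out, uint64_t v) {
    while (v >= 0x80) { out.push_back(static_cast<uint8_t>(v) | 0x80); v >>= 7; }
    out.push_back(static_cast<uint8_t>(v));
}

// Counting: one varint per zero-run before each 1 bit.
static std::vector<uint8_t> encode_counts(const std::vector<bool>& bits) {
    std::vector<uint8_t> out;
    uint64_t zeros = 0;
    for (bool b : bits) {
        if (b) { push_varint(out, zeros); zeros = 0; }
        else   { ++zeros; }
    }
    return out;
}

// Bit packing: eight outputs per byte, MSB-first.
static std::vector<uint8_t> encode_packed(const std::vector<bool>& bits) {
    std::vector<uint8_t> out((bits.size() + 7) / 8, 0);
    for (size_t i = 0; i < bits.size(); ++i)
        if (bits[i]) out[i / 8] |= 1 << (7 - (i % 8));
    return out;
}

// Per block: encode both ways, keep whichever is smaller, and record a flag
// (folded into the block height field in the branch) saying which was used.
std::vector<uint8_t> encode_block(const std::vector<bool>& bits, bool& used_counts) {
    std::vector<uint8_t> counts = encode_counts(bits);
    std::vector<uint8_t> packed = encode_packed(bits);
    used_counts = counts.size() < packed.size();
    return used_counts ? counts : packed;
}
```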

For you (or anyone else interested in this problem), there is a data structure called a Roaring bitmap which has been developed for efficient bitmap encodings. I would be interested to see if we can achieve better than 161MB using this technique. There is a C++ library that should be usable off-the-shelf, which could replace my current implementation. Note that we will want to preserve the properties of the current hintsfile if possible, in that it can be accessed concurrently and is indexed by block height.
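
For illustration, a minimal sketch using the CRoaring C++ wrapper (assuming the roaring.hh header from the amalgamation and its roaring::Roaring class): store the indices of the unspent TXOs in one bitmap per block and serialize it.

```cpp
#include <cassert>
#include <vector>
#include "roaring.hh"  // C++ wrapper from the CRoaring amalgamation

int main() {
    // One bitmap per block: add the indices of the 1 bits (unspent TXOs).
    roaring::Roaring unspent;
    unspent.add(2);  // e.g. 00100001 -> unspent indices {2, 7}
    unspent.add(7);
    unspent.runOptimize();  // let the library pick run containers

    // Serialize to a byte buffer in the portable format.
    std::vector<char> buf(unspent.getSizeInBytes());
    unspent.write(buf.data());

    // Deserialize and query: a missing index means the output is spent.
    roaring::Roaring restored = roaring::Roaring::read(buf.data());
    assert(restored.contains(7) && !restored.contains(3));
    return 0;
}
```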

Also note that with a generic compression algorithm like zip, the file is reduced to 98MB, so this is a good baseline for what is possible, though matching it is certainly not required to settle on a format. If possible, I would like to strike a balance between size and ease of maintenance.

@exd02

exd02 commented Feb 6, 2026

Thanks for sharing your implementation. That clarifies the per-block adaptive approach very well.

I explored a slightly different direction and wanted to share it for comparison.

Key difference:

  • Instead of selecting a single encoding per block, the encoder switches between RLE and bit-packing within the same block based on local patterns.

  • Encoding is done in a single streaming pass (WriteBit-style), without precomputing representation sizes.

Trade-off:
Your approach minimizes marker overhead by fixing one encoding per block.
Mine may reduce size on blocks with heterogeneous structure, at the cost of additional intra-block markers.

Benchmarks:

```json
{
  "height": 909208,
  "path": "/home/exd/teste/main_compact.hints",
  "duration": "179 minutes"
}
```

Result: 209MB. This technique was worse (yours was 161MB).

Notes:
This is still early (writer only, no reader/tests yet), but I wanted to share the approach since it was developed independently before seeing the block-level selection strategy.

I will also take a look at Roaring bitmaps.

@rustaceanrob

rustaceanrob commented Feb 8, 2026

Thanks for trying a novel approach. I look forward to viewing your results if you implement the Roaring bitmap approach or a different scheme. For transparency, I recently ran my 161MB file through the zstd library with the most aggressive compression settings and got a compression of 94MB, which seems to be about the smallest possible representation for block 930,000.
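
As a hedged illustration of maximum-level zstd compression with libzstd (the exact settings used above aren't specified; this is roughly what `zstd --ultra -22` does on the command line):

```cpp
#include <vector>
#include <zstd.h>

// One-shot compression of an in-memory buffer at libzstd's maximum level.
std::vector<char> zstd_max(const std::vector<char>& input) {
    std::vector<char> out(ZSTD_compressBound(input.size()));
    size_t n = ZSTD_compress(out.data(), out.size(),
                             input.data(), input.size(),
                             ZSTD_maxCLevel());  // the highest supported level
    if (ZSTD_isError(n)) return {};
    out.resize(n);
    return out;
}
```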

@exd02

exd02 commented Feb 12, 2026

> Thanks for trying a novel approach. I look forward to viewing your results if you implement the Roaring bitmap approach or a different scheme. For transparency, I recently ran my 161MB file through the zstd library with the most aggressive compression settings and got a compression of 94MB, which seems to be about the smallest possible representation for block 930,000.

Thanks for the suggestion.

I implemented the Roaring bitmap approach here. The resulting size at block height 930,000 was 324MB.

I used the CRoaring amalgamation script, which generates a single C implementation file and its corresponding headers, plus the C++ wrapper header.
