In this post I will describe how I refactored quite
complicated Rust codebase (
rdedup) to optimize
performance and utilize 100% of CPU cores.
This will serve as a documentation of
Other reasons it might be interesting:
- I explain some details of deduplication in rdedup.
- I show interesting approach of zero-copy data stream processing in Rust.
- I show how to optimize
- I share tips working on performance oriented Rust codebase.
It was created for my simple usecase of deduplicating my backups between many machines, and synchronizing the deduplicated archive using syncthing or any other Dropbox-like software.
You can read more about
rdedup on its github page. In this
post I’m going to focus on data storing processing pipeline.
Deduplication and initial performance
When storing data through
rdedup, a stream of binary,
non-compressed, non-encrypted data should be fed to it. For
backup purposes that stream could be output of
I personally prefer
rdup since it really
Given such a stream
A: reads it into buffers
B: finds data-deterministic “edges” using rolling hash function and breaks the streams into logical chunks
C: for every chunk: calculates sha256 digest, compresses and encrypts it
D: writes every chunk under its sha256 digest
Deduplication comes from the fact, that same pieces of data, will get split into same chunks, which will reuse the existing chunk files.
Initially all of it was implemented as a naive pipeline:
A -> B -> C -> D
where every step runs in a separate thread, and arrows are
channels from Rust standard library (
Actually, that’s a bit of oversimplification. Every stream of data that is being deduplicated generates and index: a list of sha256 digests that is used to restore the original data.
This index is a stream of data and is handled similarity: chunked and stored. This will will generate index of an index data… and so on… recursively until it collapses to a single, final sha256 digest, that is used as an ID of stored stream of data.
It might seem a bit complicated, but maybe an illustration will make it clear:
A -> B -> C ----------------------> D \ ^ ^ ^ B -> C -/ / / \ / / (potentially more levels)
This was working quite well initially for my personal purposes, but
users with big datasets asked if it could be improved. It was clear that it is
Channels and threads
Before diving into optimizations, I’d like to mention two crates that I’ve used to help with concurrency.
Since Rust standard library features only single-consumer channel, I’ve picked two-lock-queue for channel communication.
For threading, I’ve used well known crossbeam’s scoped threads.
Bottleneck 1: Fsync
The initial problem was with step
D which uses
fsync on every chunk
to make sure no data can be corrupted.
This step is necessary.
Data corruption or loss is not something any backup software
should risk. Some will argue that fsync is not that
Unfortunately for our purposes it really is. It might not
matter for an app that writes handful of files, but
fsync serially repeatedly thousands of
times is just way too slow.
The obvious, and surprisingly well working, solution is to
parallelize it. Step
D was split into
num_cpus * 4
To illustrate the change:
/--> D A -> B -> C -----------------*---> . \ ^ ^ ^ \--> D B -> C -/ / / \ / / (potentially more levels)
Bottleneck 2: Compression and encryption
After the above change, the bottleneck moved from IO at
to CPU at
C. The plan is to support multiple algorithms in
rdedup and hashing, compression and encryption can be
quite demanding. It make sense to parallelize this part as
Cw. . .Cw ------ ^ ^ ^ \ \|/ \ * <-------- \ /--> D A -> B -> C \ \ --*---> . \ / / \--> D B -> C - / \ / (potentially more levels)
C now distributes the work to a pool
Cw. The pool is shared between all
Cws submit processed chunks to a pool of
and respond (via callback channel
Cs send with each chunk)
with an digest of a chunk so
C can assemble the index
(list of digests).
While it was not a bottleneck, there was some needless copying involved in some of these pipelines.
A is reading the data into fixed size
B finds “edges” in them forming logical
chunks in the stream.
How to map the underlying framed data into logical chunks?
I call the method I’ve used a scatter gather buffer
Data buffers become
multiple places can keep them alive, while every chunk
Vec<ArcRef<Vec<u8>, [u8]>>: a list of
ArcRef is a type from very useful
ArcRef can hold
ownership to a reference counted (
Arc) buffer along a
slice to a part of it (
buffers: [ Buf ][ Buf ][ Buf ][ Buf ] edges: | | | | | | | | | | \ /\ / \ / \ /\ /\ / \ / \ / || || | | \ / \ / \ / chunks C1 C1 C2 C3 C3 C3 C4
This eliminates unnecessary copying, and since all data is immutable it can be worked in parallel
The first step of optimizing should be measuring. Without it, it is hard to tell what is going on and what to do.
All the measuring efforts eventually settled on a simple
struct collecting time spent on: input, processing and output
of every thread and forwarding cumulative time to
slog logger on
drop. Spending time on
output means respectively: previous or next
piece of the pipeline is not keeping up.
rdedup with debug output enabled, the
following will be printed:
Mar 27 00:50:16.881 DEBG total time, name: input-reader, input: 0.275241757, inside: 0, output: 4.99729611 Mar 27 00:50:16.898 DEBG total time, name: chunker, input: 0, inside: 3.789896676, output: 1.499951845 Mar 27 00:50:16.913 DEBG total time, name: chunker, input: 0, inside: 0.001270074, output: 0.00000522 Mar 27 00:50:16.915 DEBG total time, name: chunker, input: 0, inside: 0.000003259, output: 0.000002716 Mar 27 00:50:16.915 DEBG total time, name: chunk-processing, input: 0.292956768, inside: 3.399154403, output: 0.169168692 Mar 27 00:50:16.915 DEBG total time, name: chunk-processing, input: 0.319576917, inside: 3.313578766, output: 0.200596487 Mar 27 00:50:16.915 DEBG total time, name: chunk-processing, input: 0.361151046, inside: 3.355589657, output: 0.19483244 Mar 27 00:50:16.915 DEBG total time, name: chunk-processing, input: 0.303446365, inside: 3.36682324, output: 0.196335971 Mar 27 00:50:16.915 DEBG total time, name: chunk-processing, input: 0.280932044, inside: 3.384652279, output: 0.175253461 Mar 27 00:50:16.915 DEBG total time, name: chunk-processing, input: 0.346162129, inside: 3.319742591, output: 0.202955268 Mar 27 00:50:16.915 DEBG total time, name: chunk-processing, input: 0.371646506, inside: 3.268694323, output: 0.191890475 Mar 27 00:50:16.915 DEBG total time, name: chunk-processing, input: 0.373532858, inside: 3.303381288, output: 0.206956405 Mar 27 00:50:16.916 DEBG total time, name: chunk-writer, input: 2.047794657, inside: 0.007015684, output: 3.022296424 Mar 27 00:50:16.916 DEBG total time, name: chunk-writer, input: 2.017267334, inside: 0.008629837, output: 3.035579557 (.. removed a lot of chunk-writers for brevity ..) Mar 27 00:50:16.923 DEBG total time, name: chunk-writer, input: 2.055406248, inside: 0.010434317, output: 3.018590681 Mar 27 00:50:16.923 DEBG total time, name: chunk-writer, input: 1.989486888, inside: 0.010648274, output: 3.081730459 Mar 27 00:50:16.928 DEBG total time, name: chunk-writer, input: 2.040859497, inside: 0.013481347, output: 3.055903415
We can see that
A is spending it’s time waiting
B spends most of it’s time processing. It’s a new
bottleneck, at least in this workload.
C does a lot of
processing, but at least on my 8 core machine, it can keep
up. The multitude of
D threads, spends most of the time on
fsync, but because of their numbers, they still have some
spare capacity (waiting on input).
rdedup has this measurement system built-in
and available using a command line flag, anyone should be
able to check and understand the performance on a given
machine, with given data.
I think this pipeline performance measurement system is so universal and useful, that it would make sense to tidy it up, and release as a reusable crate.
Edit: I’ve published improved version of that code as slog-perf
Fearless concurrency with Rust
rdedup codebase is not the neatest one, and already fairly
complex. While I expected it, I’m still impressed how casual
this refactoring felt.
There has been not a single crash during the whole process. Which is somewhat obvious considering the code is in Rust… but somehow still fells good.
Actually, I don’t think there has been a single time that the program stopped working. The set of sanity checks was passing every time the code compiled.