I've been optimizing a lot of projects over the years, and here are some tips that I'd like to share with you

Remember to optimize the whole system, and not individual parts

This one is a big one and it's counter-intuitive so it goes first: always, always optimize the whole system!

When optimizing any not-trivial system that involves more than just a single thread, optimizing any part at random might make the overall system performance get worse.

Don't believe me?

Imagine a simple process consisting of two threads: A and B.

A ---> B

A is reading some data from the network, passes them to B which writes them in some way to disk.

An eager but naive developer decides to "optimize" and makes A run on multiple threads so it can ingest more data faster, and introduce an in-memory cache to avoid fetching some data twice.

Surprisingly, it turns out that the whole system is slightly slower now. After more detailed investigation it turns out that B was a bottleneck of the system all along, and additional threads and memory used by A are making B (and thus the whole system) slower.

Another example.

Service D is handling requests from two other services: A and B. A developer optimizes A to batch requests to D for better throughput. Unfortunately, the system bottleneck was always on the B which is sensitive to the latency of responses from D, which went down after D has to handle much larger requests from D.

It is also worth noting that while optimizing for performance (latency, throughput) is usually what we have in mind when talking about "optimizing", one can optimize for other things as well: resource usage, cost, energy consumption, simplicity, deployment time, etc. With such optimizations, it's even easier to achieve counter-productive results. Saving on the cost of running your system's main database obviously can lead to performance degradation of the system as a whole, and that might lead to less happy users, which might lead to less income and altogether cost your system money.

The rule of thumb is: when optimizing a larger system always identify the current bottleneck first and focus on optimizing that part only until there is no more room for improvement or the bottleneck moved somewhere else. At the very least it will save you time working on things that don't matter. When optimizing other parts, especially for things other than performance, be mindful of the effects on the rest of the system.

I highly recommend The Goal by Eliyahu M. Goldratt. While the book was written to help management optimize business processes, absolutely everything it describes applies to larger/distributed software system optimization. And it will possibly open your eyes to other counter-productive "optimizations" in the software engineering industry, e.g. the fact that trying to optimize "velocity" by assigning as many Jira tickets to as many developers as possible to optimize the overall throughput, leads to increased latency between dependent parts (e.g. waiting longer to get help or code review), more wasted time, more work in progress and lowering overall velocity. But anyway... just read the book. It's worth it.

Measure before you optimize

In your code and systems, invest in profiling, measuring, and observability early. Measuring is important for performance optimization at least in two ways.

First, as per the previous point - you should focus on optimizing the bottleneck of your system. And how you are going to find that bottleneck if you don't know which parts of your systems are spending time waiting for what?

For this purpose, it's usually important to monitor inputs and outputs: sizes of queues, latencies of outgoing and incoming requests, etc.

Second, after you changed anything - how do you know how well it works now?

In this case, it's generally useful to monitor the times things take, and all sorts of throughput rates.

Focus on IO first

Most systems are IO-bound. Since most systems are IO-bound, usually optimizing the IO is the most effective (if not the only) way to get better overall performance.

IO is generally characterized by high per-request latency and overheads. When you need to lookup an order by id in a database or some remote server, a large part of the overall time it takes is spent by pure overheads: serializing and deserializing data, transmitting them over network and disk IO, and so on.

Remember that from the perspective of a CPU, pretty much everything is more or less slow IO, so the techniques described below apply to a variety of use cases.

Amortize latency and increase throughput with pipelining

Let's say you have a code like:

for order_id in order_ids {
    let order = lookup_order(order_id);
    let more_order_data = fetch_more_order_data(&order), order);
    write_out_order(order, more_order_data);
}

which means for every order_id, sequentially perform three steps to process it. One of the simplest ways to make it run faster is to pipeline it.

First, express it as a sequence of steps in an iterator chain:

order_ids
  .into_iter()
  .map(|id| lookup_order(order_id))
  .map(|order| (fetch_more_order_data(&order), order))
  .for_each(|(more_order_data, order)| write_out_order(order, more_order_data))

meaning every step takes as input the output of the previous step.

Then make each step run in parallel. While last step is writting out the data about the first order, the middle one can already fetch more data about the second one, and first one lookup third order. That's basically what pipelining means.

At least in Rust, pipelining an existing iterator chain is as simple as:

use dpc_pariter::IteratorExt;

order_ids
  .into_iter()
  .map(|id| lookup_order(order_id))
  .readahead(0)
  .map(|order| (fetch_more_order_data(&order), order))
  .readahead(0)
  .for_each(|(more_order_data, order)| write_out_order(order, more_order_data))

Each .readahead(0) will put the previous part of the iterator chain in a separate thread and feed the output to next step via blocking in-memory buffer (aka "channel").

Generally use small in-memory buffers

This is an important side-note about pipelining which is applicable more broadly. As you can see we're using .readahead(0) which means the buffer size between threads will be 0.

Over and over I see developers using large sizes for buffers used like this, (wrongly) assuming that larger buffers somehow mean better performance.

Let's think about a simple pipeline consisting of 5 steps:

A -----> B -----> C -----> D -----> E

Each - in the arrow between pipeline step represents a space in the channel.

Here is a simple but important observation: every pipeline like this has a bottleneck somewhere. For vast majorities that bottleneck will be persistent. One of the steps will be consistently taking the longest by its very nature. Let's assume that step C is the slowest and o represents a unit of work:

A ooooo> B ooooo> C -----> D -----> E

In a steady state, all channels before C will always be full, and all channels after D will be empty. That's because, since C is the slowest, A and B can always produce more work for C faster than it can consume it and D and E can consume work from C faster than it can produce it.

The throughput of the overall pipeline is entirely defined by the throughput of C, and no amount of additional buffering will change it. Larger buffers only waste resources, potentially making things slower, not faster.

Larger in-memory buffers increase performance in a pipeline only if the time to process each unit of work is non-uniform. If each o on the above graph could greatly vary in how long it takes to process and/or each of the steps could take less or more time due to some other reasons, some amount of buffering helps smoothen the variance. Even then small buffer sizes are usually enough.

Amortize latency and increase the throughput with multi-threading

The previous version of our code:

use dpc_pariter::IteratorExt;

order_ids
  .into_iter()
  .map(|id| lookup_order(order_id))
  .readahead(0)
  .map(|order| (fetch_more_order_data(&order), order))
  .readahead(0)
  .for_each(|(more_order_data, order)| write_out_order(order, more_order_data))

by definition must have a slowest step. Assuming "fetching more order data" is the slowest step and thus overal bottleneck, it's possible to optimize it by spreading the work between many threads:

use dpc_pariter::IteratorExt;

order_ids
  .into_iter()
  .map(|id| lookup_order(order_id))
  .readahead(0)
  .parallel_map(|order| (fetch_more_order_data(&order), order))
  .readahead(0)
  .for_each(|(more_order_data, order)| write_out_order(order, more_order_data))

parallel_map will use a thread-pool to process multiple orders at the time and call fetch_more_order_data in parallel for each of them.

This only works in cases where the order of side-effects for each unit of work is not important. Which is quite common.

Amortize latency with batching

If you request 10 orders by id at once, it generally will take much less time than requesting a single order 10 times.

Depending on domain and exact application, there might be more or less room for batching the IO, but it's worth keeping in mind. Too many times the whole codebase, the whole system, all the interfaces, all the abstractions assume working one entity at the time, even when it would make sense to work on batches.

It's worth noting that going over the board with the size of batches can be counter-productive. Batches of 10 or 100 are often large enough to completely amortize per-item latency, while large batches can lead to increased resource consumption and undesirable latencies.

Let's adjust the previous code:

use dpc_pariter::IteratorExt;
use itertools::Itertools;

order_ids
  .into_iter()
  .chunks(10)
  .map(|ids| lookup_orders(ids.collect()))
  .readahead(0)
  .parallel_map(|orders| fetch_more_order_data(&orders))
  .readahead(0)
  .for_each(|orders_with_more_order_data| write_out_order(orders_with_more_order_data))

Now order_ids will be processed 10 at the time, accross the whole pipeline and threadpool. Notably, the APIs had to be altered to accomodate working on sets & maps.

Focus on the data architecture

You can't just start implementing classes and modules without a bigger plan and hope they will eventually come up together into good software. It's especially important for non-trivial applications and when good performance is desired.

Before starting to write any code, you need to have a plan for how to organize your data - both in memory and in the persistence layer. That's what I call the "data architecture". The data architecture determines the performance characteristics of your code.

One of the most popular types of software being created nowadays, at least from my perspective, is CRUD apps. CRUD backends are typically organized very much alike and their data architecture differs only in the way they store their domain-specific business data. They often don't have much to organize in memory altogether.

There’s a sharp increase in the difficulty of coming up with good data architecture when the application stops being just a CRUD backend and has to do some more complex or heavy-duty data processing, possibly involving asynchronous communication, etc.

It seems to me that the fact that we so often just write look-alike CRUD apps, means we don't get a lot of practice in data architecture. I highly recommend reading Designing Data-Intensive Applications to get some insights and perspectives on data organization.

This is how you optimize writes to an SQL database

Very specific, but probably not widely-known.

For one of my projects, I've spent a lot of time researching how to insert tons of data into Postgres as fast as possible, entirely from the application itself.

Batching a lot of multi-row value insert statements is the fastest approach.
- Multiple INSERTs inside single large database transactions are much faster, as they delay/avoid fsync syscalls used to commit data to the disk.
- Multi-row value insert statements help a bit too, as they amortize each INSERT overhead.
Indexes slow down INSERTs. Avoid creating them until you have to (after you pumped all the data).
Cache whatever you can and avoid reading back anything you've already had.

As far as I know, the general approach here should apply to other databases.

Know your algorithms and data structures

I think it's generally pointless to try to remember the implementation details of all the important data structures and algorithms (you can always just look them up). What is practical and very important is knowing what is generally possible: what useful algorithms and data structures there are and what can they give you (and why).

Why are linked lists generally bad? What's the state of the art in sorting? Why are databases and filesystems using B-Tree so much? What are probabilistic data structures like Cuckoo hashing, Bloom filters, Invertible Bloom Filter , etc?

Unless you dedicate your life to studying algorithms, the amount of information about them might feel overwhelming. Because it is. But remember that your goal here is not to know them in-depth, but only to be aware of their existence. Treat them as a workshop tool: as long as you know it exists and what it is good for, you can look deeper into it when you need it.

Have a decent understanding of how modern CPUs work

While most typical business code will be IO-bound, there are times when it's important to optimize a pure CPU run time performance.

If you want to optimize how fast the CPU executes your code, you need to have a good understanding of how it works. While you'll hear a lot that modern CPUs are very complex and unpredictable, here are some things I'd like you to know, which should be good rules of thumb and starting points to read more.

Use your caches wisely

While CPUs can execute instructions extremely fast, accessing data from memory is so much slower than CPU waiting for data to be loaded or stored would waste all of its performance. One way CPU design mitigates this problem problem is utilizing large and complex multi-level cache hierarchies, between CPU cores and main memory. The cache level closest to the CPU core is the fastest but most expensive and smallest, and the further away they are from the core they get larger, but slower and shared between more cores.

What you want to remember here is that your code will run faster if you can make it access only a smaller subset of your data at the time, so it doesn't have to shuffle things in and out of cache needlessly.

Remember about cache-lines

CPU cores access data one cache-line at a time. A cache-line size can vary but is typically 64 bytes in contemporary CPUs. It's important to realize that even if a CPU is accessing one bit (like a boolean variable) it will by necessity bring the whole 64B into its cache. That's why fast code needs to generally keep data (variables) used together next to each other.

Know that CPUs are distributed systems

Another thing to know about caches is that in a multi-core CPU all these cores and caches are so away from each other that the CPU turns into a distributed system. CPUs/caches track which cache-line "belongs" to which core using a special coherence protocol. This means that multiple CPUs modifying the same cache-line can keep waiting for it to bounce around the cache hierarchy between the cores. And since the cache-lines are somewhat large this can happen when two different variables are being accessed.

Keep the branch predictor happy

I've already explained how pipelining can improve the performance of an otherwise sequential process. It shouldn't come as a surprise that CPU cores also use pipelining when executing instructions. It works well but has at least one major problem: when the CPU core needs to make a conditional jump (branching). Branching corresponds to if a then b else c statement. The CPU core needs to start processing either a or b, but it does not know which one yet, as a is still in the execution pipeline. So it picks one and hopes for the best. If it guessed right - things go smoothly and fast. If it guessed incorrectly, it has to go back and redo a lot of stuff and things go slow. If you want your code to be fast you need to make the branch predictor always right by making your if statement well.. predictable (e.g. consistently taking the same side). Or even better - avoiding conditional statements in a tight high-performance loop altogether.

Don't overuse atomics and be mindful about synchronization overhead

I've already mentioned that multi-core CPU is a lot like distributed systems. To help avoid slowing the CPU down with needless communication between the cores, most CPU architectures leave CPU cores a great deal of freedom onhow exactly they are going to execute their memory accesses, and when exactly they have to tell other cores about it.

Thanks to that and to help improve the memory access performance, the CPUs will utilize write buffers, combine writes, reorder execution of otherwise independent instructions, and other techniques.

When synchronization between multiple threads is required programming languages tend to offer "atomic variables" (aka "atomics"), that utilize atomic CPU instructions. Atomic instructions by their very nature, can't benefit from the same optimizations.

On top of it, the usual synchronization primitives utilize memory barrier instructions that force the CPU core to make sure that any pending writes are visible to other cores, and any pending writes by other cores are visible to the current core. This means flushing all these fancy buffers and optimization techniques, potentially slowing down the execution considerably.

Avoid thrashing your TLB

Another low-level detail that can sometimes be significant for performance is TLB cache. The way programs see memory that they use and not see memory that they are not supposed to see, is implemented on mainstream operating systems using a technique called Virtual Memory. OS maintains bunch of a Page tables. Anytime a program tries to access a given (virtual) memory address, the hardware CPU does the translation by performing a "page table walk". A page table walk requires additional memory accesses (from the page table), to figure out which physical memory address to use. Having to perform such a translation for every memory access would be slow, so the results of the translations are cached in the TLB.

Unfortunately, unlike normal memory caches, TLBs are more difficult to implement in hardware, and because of that, they are generally somewhat small.

Depending on the exact hardware, OS, presence of virtualization, hardware mitigations against speculative execution attacks like Meltdown and Spectre the exact details will vary greatly, but as a rule of thumb it's important to remember that multiple processes being scheduled on the same CPU core will possibly compete for TLB potentially leading to Thrashing

In most cases, it comes down to avoiding running multiple CPU-heavy workloads competing for CPU cores, especially if they don't share the address space (they are not different threads, but different processes).

When you can - access memory in a linear fashion

If you can organize your data in memory in a linear sequence and then access it from start to end, you can expect to get very good performance.

First - your cache-line utilization will be good, second CPUs can use hardware cache prefetching which can detect common access patterns and start loading caches with data even before the CPU core requests it.

If you want to push it - learn and use SIMD

Consider investing time in coming up with a SIMD implementation. Modern CPUs usually provide a set of special instructions that can perform operations on a vector of data at the time.

Oftentimes, compilers can optimize well-structured loops and iterators in a normal code to compile into SIMD implementations, possibly greatly improving the performance. For non-trivial use-cases coming up with a SIMD code is somewhat involving, but can yield large speedups.

I have little experience with SIMD, so I'll leave it at that: it's there, and it can be powerful.

Use a profiler

You might already know about profilers like perf.

Once you've invested some time learning and understanding some of the above low-level details, you should have a much easier time reading the hardware performance counters which profilers often collect and conveniently display, among all the other useful things.

Use a disassembler

Compilers usually do a great job translating higher-level code into machine code, but you can't ever be sure. Sometimes they get confused or need you to write your code slightly differently to come up with a better machine code.

If you identified a performance-critical section of code, it's useful to check the actual instructions it compiles to.

You don't necessarily have to be a low-level wizard to read and reason about disassembled code. "Fewer instructions good, more instructions bad" is a good first-order approximation.

Check the well-known Godbolt Compiler Explorer to get a feel for what it's about.

Look into eBPF

For profiling and debugging complex workloads, involving Linux kernel, check out eBPF. Among other impressive capabilities, it's offers powerful and flexible facilities for profiling Linux applications.

Summary

That's it. I think that's all I wanted to tell you about right now. I hope you've found something useful. Now go and make something run fast.

Remember to optimize the whole system, and not individual parts#

Measure before you optimize#

Focus on IO first#

Amortize latency and increase throughput with pipelining#

Generally use small in-memory buffers#

Amortize latency and increase the throughput with multi-threading#

Amortize latency with batching#

Focus on the data architecture#

This is how you optimize writes to an SQL database#

Know your algorithms and data structures#

Have a decent understanding of how modern CPUs work#

Use your caches wisely#

Remember about cache-lines#

Know that CPUs are distributed systems#

Keep the branch predictor happy#

Don't overuse atomics and be mindful about synchronization overhead#

Avoid thrashing your TLB#

When you can - access memory in a linear fashion#

If you want to push it - learn and use SIMD#

Use a profiler#

Use a disassembler#

Look into eBPF#

Summary#