Introduction
It has been some time since I published my last article. As often happens with side projects and hobby interests, I got a little sidetracked. So this blog post won’t be a continuation of my machine learning posts. I’m sure I will pick up the machine learning journey again at some point though 🙃.
In my job as a software engineer at Amazon, I recently had to think a lot about the tradeoffs of different database technologies for various data models. Obviously, this was the perfect time to pick up Martin Kleppmann’s fantastic book Designing Data-Intensive Applications again. I have written a short summary about this book in my blog post Book List 2022.
Also, I wanted to get my feet wet with Rust. I started learning Rust during a hackathon project last year and continued a little in my free time, but haven’t done any side projects with it since then. I’m inspired by the very informative videos that Jon Gjengset shares on his YouTube channel (they’re really worth watching! His technical expertise is astonishing).
To sum up, the motivation for this blog post was to
- Re-read parts of ‘Designing Data-Intensive Applications’
- Implement some techniques discussed in the book to gain a deeper understanding and
- Get some hands-on experience with Rust.
So today, we’ll look a little into how storage engines work and how we can build a simple key-value store ourselves. We start out with a very simple approach and then extend it with a technique called ‘indexing’ to speed up read performance.
Building the world’s simplest database
The most fundamental thing that a database must be able to do is “I give you this piece of information now. Store it for me and let me retrieve it at some time in the future”.
Turns out, it’s quite easy to do this using two bash functions leveraging the file system abstraction that UNIX operating systems offer us.
So here is the simplest key-value store in the world (note: this example is taken from ‘Designing Data-Intensive Applications’, p. 70).
db_set() {
    echo "$1,$2" >> /tmp/simpledatabase.log
}

db_get() {
    grep "^$1," /tmp/simpledatabase.log | sed -e "s/^$1,//" | tail -n 1
}
We can use this key-value store like this:
db_set hello world
db_set foo bar
db_set bar baz
db_get hello
> world
db_set hello there
db_get hello
> there
So as you can see, this database supports a GET and a SET operation. When we set a value for the same key twice, only the latest value will be returned, as we would expect from a database.
So what does our database file look like after executing the commands above?
cat /tmp/simpledatabase.log
> hello,world
> foo,bar
> bar,baz
> hello,there
Storage engines
We just built a storage engine using two bash functions. Isn’t that neat? Storage engines are at the heart of databases. They are responsible for persisting the data that you put into your database to disk. It turns out that there are not that many conceptually different storage engines. Both relational and NoSQL databases build on two families of storage engines: log-structured storage engines and page-oriented storage engines.
The storage engine that our simple database uses is an instance of a so-called ‘log-structured storage engine’. Why is it called that? Because we simply use a log (in this context, an append-only file) to store our data.
Analyzing the performance of the world’s simplest database
The write performance of our simple database is fairly good, as a SET operation simply involves appending to a file. More importantly, this operation does not slow down when the amount of data grows. Appending to a 1kB file is just as fast as appending to a 1MB file. In other words, our SET operation has a runtime complexity of O(1), i.e. it’s constant. That’s very good news!
Things look different when analyzing the read performance. We use grep to search for the key in the file, and grep has to scan the entire file to do so. The output of grep is then piped into sed. Depending on how often the searched key has been updated (via the SET command), sed has to process anywhere from a single line up to, in the worst case, almost the entire file again. So the bigger the file gets, the worse our read performance gets. Put more precisely, the amount of work we have to do for a read operation grows linearly with the amount of data we have stored in our file. In Big O notation, this translates to O(n). As you can imagine, that’s not good news.
But enough theory. Let’s benchmark our simple database a bit.
Benchmarking the world’s simplest database
For our benchmark, we use hyperfine to repeatedly run our bash functions db_get and db_set. To get some sense of the statistical distribution, I let hyperfine run the commands 1000 times (except in the 1GB file case, for time reasons 🙃). I have prepared an empty database file, a database file with 1MB of foo,bar lines and one with 1GB of foo,bar lines.
Of course, this is a very rudimentary benchmarking setup, but it suffices for our analysis. Also note that you shouldn’t trust the absolute numbers here, as hyperfine invokes a shell for each execution which adds to the time spent per command. However, our benchmark is suited to analyze the trends in the numbers depending on the database file size.
Benchmarking with an empty database file
ls -lh simpledatabase.log
> -rw-r--r-- 1 christian wheel 0B Oct 26 16:48 simpledatabase.log
hyperfine --shell=bash --runs 1000 'db_set hello world'
> Benchmark 1: db_set hello world
> Time (mean ± σ): 0.6 µs ± 19.5 µs [User: 0.0 µs, System: 0.0 µs]
> Range (min … max): 0.0 µs … 615.2 µs 1000 runs
hyperfine --shell=bash --runs 1000 'db_get hello'
> Benchmark 1: db_get hello
> Time (mean ± σ): 891.9 µs ± 100.0 µs [User: 578.9 µs, System: 1677.1 µs]
> Range (min … max): 772.3 µs … 2842.8 µs 1000 runs
Benchmarking with a 1MB database file
ls -lh simpledatabase.log
> -rw-r--r-- 1 christian wheel 1.0M Oct 26 16:47 simpledatabase.log
hyperfine --shell=bash --runs 1000 'db_set hello world'
> Benchmark 1: db_set hello world
> Time (mean ± σ): 2.4 µs ± 75.2 µs [User: 0.0 µs, System: 0.0 µs]
> Range (min … max): 0.0 µs … 2378.8 µs 1000 runs
hyperfine --shell=bash --runs 1000 'db_get hello'
> Benchmark 1: db_get hello
> Time (mean ± σ): 10.4 ms ± 0.4 ms [User: 10.0 ms, System: 1.9 ms]
> Range (min … max): 9.9 ms … 22.9 ms 1000 runs
Benchmarking with a 1GB database file
ls -lh simpledatabase.log
> -rw-r--r-- 1 christian wheel 1.0G Oct 26 16:46 simpledatabase.log
hyperfine --shell=bash --runs 1000 'db_set hello world'
> Benchmark 1: db_set hello world
> Time (mean ± σ): 1.8 µs ± 57.1 µs [User: 0.0 µs, System: 0.0 µs]
> Range (min … max): 0.0 µs … 1805.3 µs 1000 runs
hyperfine --shell=bash --runs 20 'db_get hello'
> Benchmark 1: db_get hello
> Time (mean ± σ): 9.130 s ± 0.262 s [User: 8.960 s, System: 0.117 s]
> Range (min … max): 8.920 s … 9.860 s 20 runs
Results
As you can see, db_set performance does not deteriorate with growing database file size. db_get, however, shows a dramatic decline in performance as the database file grows. While it’s below 1ms for an almost empty database file (the db_set runs of course inserted a few entries), it’s already at around 10ms for a 1MB database file. With a 1GB database file, each db_get operation takes 9.13 seconds on average.
In summary, db_set performance is already very good. It’s O(1), i.e. constant time, and it’s also trivial to implement. However, db_get performance becomes prohibitively bad with a growing database file size. We have to do something about that.
Solving the problem of slow reads
The root cause of the terrible read performance is that the db_get operation has to go through the entire database file to retrieve the latest value for a given key. However, we can keep additional metadata about the entries in the database file. This metadata is called an index. We can use an in-memory hash table to store mappings from keys to the locations of the corresponding entries in the database file. That way, we first look up the key in the in-memory hash table. If an entry exists, it tells us the location of the entry in the database file, and we then only have to read the file at that location.
To implement this, we’ll switch gears a little and use Rust.
High-level design for kv-store
I’m not a big fan of designing and planning out every detail before starting to code. At the same time, I do think it’s a good idea to roughly lay out the high-level design before starting to code. After all, it’s a lot faster to refactor a high-level design than it is to refactor already written code.
For this project, I’ve collected the following high-level design decisions:
- The key-value store uses a custom application protocol on top of TCP (a rough sketch of the command parsing follows after this list)
- The storage engine uses an append-only file to store data
- A custom binary format is used to store data in the append-only file
- An in-memory hash map is used to maintain an index
- The index will not be persisted to disk
- The supported operations will be GET, SET and DELETE
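As a side note on the first point: the application protocol is deliberately simple. Clients send plain-text commands such as SET hello world or GET hello over a TCP connection (you’ll see this in the benchmarks below, where nc is used as the client). The following is only a rough sketch of how such a line-based protocol could be parsed on the server side; the names (Command, parse_command) and the OK/ERR responses are illustrative and not taken from the actual kv-store code.

use std::io::{BufRead, BufReader, Write};
use std::net::TcpListener;

// Illustrative only: the real kv-store structures this differently.
enum Command {
    Get(String),
    Set(String, String),
    Delete(String),
}

fn parse_command(line: &str) -> Option<Command> {
    let mut parts = line.trim_end().splitn(3, ' ');
    match (parts.next(), parts.next(), parts.next()) {
        (Some("SET"), Some(key), Some(value)) => Some(Command::Set(key.into(), value.into())),
        (Some("GET"), Some(key), _) => Some(Command::Get(key.into())),
        (Some("DELETE"), Some(key), _) => Some(Command::Delete(key.into())),
        _ => None,
    }
}

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:7878")?;
    for stream in listener.incoming() {
        let mut stream = stream?;
        let mut line = String::new();
        BufReader::new(&stream).read_line(&mut line)?;
        match parse_command(&line) {
            // This is where the command would be handed to the storage engine.
            Some(_command) => writeln!(stream, "OK")?,
            None => writeln!(stream, "ERR unknown command")?,
        }
    }
    Ok(())
}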
Implementing kv-store
When writing this blog post, I realized that it might be overkill to walk through the complete kv-store implementation. So instead, I decided to just go through the SET operation as an example and point out the most important things.
For those interested: you can find the complete code of kv-store at https://github.com/ChristianSchleifer/kv-store.
So let’s look at the SET operation. At the storage engine level of kv-store, the implementation looks like this:
pub(crate) fn set(&mut self, key: String, value: String) -> io::Result<()> {
    let entry_metadata = self.file_store.set(key.clone(), value)?;
    self.index.set(key, entry_metadata);
    Ok(())
}
As you can see, the key-value pair is first appended to the database file. Given that this was successful, metadata about the entry is saved in the variable entry_metadata. But how exactly does this appending to the database file work, and what kind of metadata are we talking about here? Let’s look at the EntryMetadata struct and the set method implementation of the FileStore next.
#[derive(Debug, Clone, Copy)]
pub struct EntryMetadata {
    pub start: u64,
    pub key_length: u64,
    pub value_length: u64,
}
pub fn set(&mut self, key: String, value: String) -> io::Result<EntryMetadata> {
    let mut entry = Entry::new(key, value.as_bytes());
    let bytes = entry.serialize();
    self.append_handle.write_all(bytes.as_slice())?;
    let position = self.append_handle.seek(SeekFrom::Current(0))?;
    Ok(EntryMetadata::new(
        position - bytes.len() as u64,
        entry.key.len() as u64,
        entry.value.len() as u64,
    ))
}
The FileStore set method first serializes the key length, the value length, the key and the value into a slice of bytes. These bytes are then written to the database file. The file offset after writing the entry is then stored in the variable position.
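The Entry type and its serialize method aren’t shown in this post. As a rough sketch, assuming a layout of key length (8 bytes), value length (8 bytes), key bytes, value bytes (the actual on-disk format of kv-store may differ in its details), serialization could look roughly like this:

pub struct Entry {
    pub key: String,
    pub value: Vec<u8>,
}

impl Entry {
    pub fn new(key: String, value: &[u8]) -> Self {
        Entry { key, value: value.to_vec() }
    }

    // Assumed layout: [key length: u64 LE][value length: u64 LE][key bytes][value bytes]
    pub fn serialize(&self) -> Vec<u8> {
        let mut bytes = Vec::with_capacity(16 + self.key.len() + self.value.len());
        bytes.extend_from_slice(&(self.key.len() as u64).to_le_bytes());
        bytes.extend_from_slice(&(self.value.len() as u64).to_le_bytes());
        bytes.extend_from_slice(self.key.as_bytes());
        bytes.extend_from_slice(&self.value);
        bytes
    }
}

Storing the lengths up front means a reader never has to parse delimiters; it can jump straight to the interesting bytes, which is exactly what EntryMetadata enables.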
The EntryMetadata struct is constructed by passing it this position minus the length of the bytes that were just written to the file (i.e. the offset at which the entry starts), as well as the key length and the value length. This metadata is sufficient to perform read operations efficiently. Take a look at the GET operation if you are curious how this works.
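To give a rough idea of why this metadata is enough (again assuming the byte layout sketched above, and using a hypothetical standalone function rather than the actual FileStore method from the repository): a read only needs a single seek followed by reading exactly value_length bytes, no matter how large the database file has grown.

use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};

// Hypothetical sketch of a read based on EntryMetadata (see the struct above).
pub fn read_value(read_handle: &mut File, metadata: &EntryMetadata) -> io::Result<String> {
    // Skip the two 8-byte length fields and the key itself.
    let value_offset = metadata.start + 16 + metadata.key_length;
    read_handle.seek(SeekFrom::Start(value_offset))?;

    let mut value = vec![0u8; metadata.value_length as usize];
    read_handle.read_exact(&mut value)?;

    // Values are treated as UTF-8 strings in this sketch.
    Ok(String::from_utf8_lossy(&value).into_owned())
}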
Let’s go up one call in the method call chain again. You know this method already:
pub(crate) fn set(&mut self, key: String, value: String) -> io::Result<()> {
    let entry_metadata = self.file_store.set(key.clone(), value)?;
    self.index.set(key, entry_metadata);
    Ok(())
}
Now that we know what the set method on the FileStore does and what data it returns, we can proceed. Next, the metadata is stored in the Index. Time to look at the Index set method.
pub struct Index {
    hashmap: HashMap<String, EntryMetadata>,
}

pub fn set(&mut self, key: String, entry_metadata: EntryMetadata) {
    self.hashmap.insert(key, entry_metadata);
}
This method is quite trivial. The only thing it does is store the metadata for a particular key in a Rust HashMap. In the average case, a Rust HashMap allows for constant-time reads and writes (O(1)).
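With that, both halves of the read path are in place: the index answers ‘where does the entry live?’ and the file store answers ‘what is its value?’. At the storage engine level, a GET then boils down to something like the following sketch (the helper methods index.get and file_store.get are assumptions mirroring the set path above; the actual kv-store code may be shaped differently):

pub(crate) fn get(&mut self, key: &str) -> io::Result<Option<String>> {
    // O(1) average-case lookup of the entry's location in the in-memory index.
    match self.index.get(key) {
        // A single seek and read at that location, independent of the file size.
        Some(entry_metadata) => Ok(Some(self.file_store.get(entry_metadata)?)),
        None => Ok(None),
    }
}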
Benchmarking kv-store
Time to benchmark kv-store. Once again, we use hyperfine to repeatedly run our commands (e.g. echo "GET hello" | nc 127.0.0.1 7878) and generate some statistics for us. We’ll invoke each command 1000 times and run our test against an empty database file, a 1MB database file (filled with distinct entries) and a 1GB database file (filled with distinct entries).
Let me point out again that this benchmark is not suited for drawing conclusions about absolute numbers. After all, we’ll measure a lot more than just the StorageEngine operations. In this case, we’ll also measure the time it takes to establish the TCP connection and to transmit the data over the TCP socket. This involves several calls into the kernel network stack, which adds latency.
However, we can use the benchmark to analyze trends in the numbers depending on the database file size.
Benchmarking with an empty database file
ls -lh log.db
> -rw-r--r-- 1 christian wheel 0B Oct 27 08:20 log.db
hyperfine --runs 1000 "echo 'SET hello world' | nc 127.0.0.1 7878"
> Benchmark 1: echo 'SET hello world' | nc 127.0.0.1 7878
> Time (mean ± σ): 1.6 ms ± 0.2 ms [User: 0.4 ms, System: 0.8 ms]
> Range (min … max): 1.2 ms … 4.2 ms 1000 runs
hyperfine --runs 1000 "echo 'GET hello world' | nc 127.0.0.1 7878"
> Benchmark 1: echo 'GET hello world' | nc 127.0.0.1 7878
> Time (mean ± σ): 1.6 ms ± 0.1 ms [User: 0.4 ms, System: 0.8 ms]
> Range (min … max): 1.3 ms … 2.9 ms 1000 runs
Benchmarking with a 1MB database file
ls -lh log.db
> -rw-r--r-- 1 christian wheel 1.0M Oct 27 09:37 log.db
hyperfine --runs 1000 "echo 'SET hello world' | nc 127.0.0.1 7878"
> Benchmark 1: echo 'SET hello world' | nc 127.0.0.1 7878
> Time (mean ± σ): 1.6 ms ± 0.2 ms [User: 0.5 ms, System: 0.8 ms]
> Range (min … max): 1.3 ms … 3.8 ms 1000 runs
hyperfine --runs 1000 "echo 'GET hello world' | nc 127.0.0.1 7878"
> Benchmark 1: echo 'GET hello world' | nc 127.0.0.1 7878
> Time (mean ± σ): 1.6 ms ± 0.4 ms [User: 0.4 ms, System: 0.8 ms]
> Range (min … max): 1.2 ms … 7.6 ms 1000 runs
Benchmarking with a 1GB database file
ls -lh log.db
> -rw-r--r-- 1 christian wheel 1.0G Oct 27 09:46 log.db
hyperfine --runs 1000 "echo 'SET hello world' | nc 127.0.0.1 7878"
> Benchmark 1: echo 'SET hello world' | nc 127.0.0.1 7878
> Time (mean ± σ): 1.5 ms ± 0.2 ms [User: 0.4 ms, System: 0.8 ms]
> Range (min … max): 1.2 ms … 4.0 ms 1000 runs
hyperfine --runs 1000 "echo 'GET hello world' | nc 127.0.0.1 7878"
> Benchmark 1: echo 'GET hello world' | nc 127.0.0.1 7878
> Time (mean ± σ): 1.5 ms ± 0.1 ms [User: 0.4 ms, System: 0.8 ms]
> Range (min … max): 1.2 ms … 2.3 ms 1000 runs
Results
As expected, this time neither read nor write performance deteriorates with growing database file size; both stay constant. In Big O notation, this means that both our reads and writes are served in O(1) time.
Tradeoffs
We might be thinking now: ‘Great, this massively improved our read performance without any disadvantages.’ We’d be wrong. If there is one key takeaway from this blog post, it’s this:
For every solution/approach, there are tradeoffs. Make sure to analyze them. Then evaluate the tradeoffs and decide whether they are a good fit for your problem.
What does this mean in our case? Let’s look at the Pros and Cons of introducing the index data structure.
Pros
- Performing reads in constant time, i.e. O(1), instead of linear time, i.e. O(n).
Cons
- Increased memory consumption of the process because of the need to hold the index in memory. When running kv-store on the 1GB database file, the memory consumption of the process is at 1.9GB.
- The index must fit into memory. For example, my machine wouldn’t be able to hold the index for a 1TB file consisting only of distinct key-value pairs.
- Writes get a tiny bit slower as the hashmap has to be updated in addition to writing to the file.
- Loading the index from the database file at startup takes time, as the entire database file has to be scanned to rebuild the index (a rough sketch of such a rebuild follows below). This could be worked around by also persisting the index to disk. However, this would add additional complexity.
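For illustration, here is a rough sketch of such a startup rebuild, assuming the byte layout from the serialization sketch earlier (key length, value length, key bytes, value bytes) and reusing the EntryMetadata struct from above. Tombstones for DELETE operations are ignored here, and the actual kv-store code may structure this differently.

use std::collections::HashMap;
use std::fs::File;
use std::io::{self, BufReader, Read, Seek, SeekFrom};

// Hypothetical sketch: scan the whole database file once and index the latest entry per key.
pub fn rebuild_index(path: &str) -> io::Result<HashMap<String, EntryMetadata>> {
    let mut index = HashMap::new();
    let mut reader = BufReader::new(File::open(path)?);
    let mut position = 0u64;

    loop {
        let mut len_buf = [0u8; 8];
        // A failed read here is treated as end-of-file for simplicity.
        if reader.read_exact(&mut len_buf).is_err() {
            break;
        }
        let key_length = u64::from_le_bytes(len_buf);
        reader.read_exact(&mut len_buf)?;
        let value_length = u64::from_le_bytes(len_buf);

        let mut key = vec![0u8; key_length as usize];
        reader.read_exact(&mut key)?;
        // Skip over the value; only its location is needed, not its contents.
        reader.seek(SeekFrom::Current(value_length as i64))?;

        // Later entries for the same key overwrite earlier ones, so the last write wins.
        index.insert(
            String::from_utf8_lossy(&key).into_owned(),
            EntryMetadata { start: position, key_length, value_length },
        );
        position += 16 + key_length + value_length;
    }

    Ok(index)
}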
Summary
Writing a storage engine seems pretty straightforward. But as always, the devil is in the details. There are numerous features our kv-store is missing for it to be a production-grade key-value store, just to mention a few: proper error handling, multithreading, buffered I/O, log compaction, index persistence, handling of crashes during writes to the database file, distribution across multiple machines, etc.
However, this small project was very helpful to me for grasping the basic principles that underlie a simplistic storage engine. And I hope that this blog post was helpful to you as well 🙂.