A Tiny QuestDB: How Writes, Partitions, and Transactions Work

Jun 24, 2026

This is a follow-up on The Anatomy of a QuestDB Table on Disk. There we looked at how data is natively stored on disk for some data types. In this post, we look at what is approximately happening at the storage layer when writes occur. For that, we implement a simplified version of QuestDB’s storage engine. We’re particularly focused on implementing just what is necessary for the storage engine to be able to store the data we’ve been working with so far:

CREATE TABLE trades (ts TIMESTAMP, price DOUBLE, sym SYMBOL, note VARCHAR) TIMESTAMP(ts) PARTITION BY DAY WAL;
INSERT INTO trades VALUES ('2026-06-10T10:00:00Z', 1.0, 'BTC', 'hi');
INSERT INTO trades VALUES ('2026-06-10T10:00:01Z', 2.5, 'ETH', 'this note is too long to inline');
INSERT INTO trades VALUES ('2026-06-10T10:00:02Z', -3.0, 'BTC', 'x');

We break the implementation into 3 parts:

Note that our implementation is minimal and mostly for educational purposes with the goal of understanding some underlying principles of a real database. We try to capture most of QuestDB’s essence into our implementation and try to be as accurate as possible; however, don’t treat it as an exact representation of QuestDB. We try to make the right parallels when relevant. Also, we won’t go through every line of code, just what I think it is worth mentioning. You can find the complete code at brunocalza/tiny-quest to follow along. I’ve broken the code in parts as well, so it is easier to tell the difference between parts.

Part I - Writing data to disk

In Part I, our goal is to write a storage engine with an API that would allow a layer on top of the engine to execute those SQL statements. Besides that, we aim to be 100% compatible with what is actually written to disk (except for _meta and _txn), meaning the bits stored on disk by our engine should match QuestDB’s bits.

The engine

QuestDB’s storage engine is called cairo. We start by creating a similar object to CairoEngine. We call ours, simply, Engine:

public class Engine {
    Path root;

    public Engine(String root) {
        this.root = Path.of(root);
    }

    public void createTable(Schema schema) throws Exception {
        Files.createDirectories(root);

        Path tableDir = root.resolve(schema.tableName());
        if (Files.exists(tableDir)) {
            throw new IOException("table already exists: " + schema.tableName());
        }
        Files.createDirectory(tableDir);

        Schema.save(root, schema);
    }

    public TableWriter getWriter(String table) throws IOException {
        Schema schema = Schema.load(root, table);
        return TableWriter.open(root, schema);
    }
}

Our engine, which points to a root directory, can create a table given a schema and provide a TableWriter for a given table. It is through the TableWriter that we’ll do our writes.

Our Schema is represented as:


// Schema record
public record Schema(String tableName, List<Column> columns, int timestampIdx) {
}


// Column record
public record Column(String name, ColumnType type) {
}

// ColumnType
public enum ColumnType {
    DOUBLE,
    TIMESTAMP,
    VARCHAR,
    SYMBOL;

    public int width() {
        return switch (this) {
            case DOUBLE, TIMESTAMP, VARCHAR -> 8;
            case SYMBOL -> 4;
        };
    }
}

We can go ahead and look at the part1.Demo, to see how the engine is used:

Path root = args.length > 0 ? Path.of(args[0]) : Files.createTempDirectory("tinyquest-demo");

Schema schema = new Schema(
        "trades",
        List.of(
                new Column("ts", ColumnType.TIMESTAMP),
                new Column("price", ColumnType.DOUBLE),
                new Column("sym", ColumnType.SYMBOL),
                new Column("note", ColumnType.VARCHAR)),
        0);

Engine db = new Engine(root.toString());
db.createTable(schema);

try (TableWriter w = db.getWriter("trades")) {
    w.row(micros("2026-06-10T10:00:00Z")).putDouble(1, 1.0).putSymbol(2, "BTC").putVarchar(3, "hi").append();
    w.row(micros("2026-06-10T10:00:01Z")).putDouble(1, 2.5).putSymbol(2, "ETH").putVarchar(3, "this note is too long to inline").append();
    w.row(micros("2026-06-10T10:00:02Z")).putDouble(1, -3.0).putSymbol(2, "BTC").putVarchar(3, "x").append();
}

You can see that most of the job is being done by the TableWriter. We’ll discuss it in the next section. For now, let’s focus on the other responsibility of our engine: creating a table. If we were to execute our demo without writes, we’d see something like this in our root directory:

ls -R
.:
trades

./trades:
_meta

A trades directory, for our trades table, and a _meta file.

In our previous post, The Anatomy of a QuestDB Table on Disk, we owe an explanation for _meta. The purpose of _meta is to store metadata about the table, for example its schema. We implemented our _meta with a simple layout just enough for our purposes; it is not compatible with QuestDB. We simply serialize our schema and write it to disk:

ByteArrayOutputStream buf = new ByteArrayOutputStream();
try (DataOutputStream out = new DataOutputStream(buf)) {
    out.writeUTF(schema.tableName());
    out.writeInt(schema.timestampIdx());
    out.writeInt(schema.columnCount());
    for (Column c : schema.columns()) {
        out.writeUTF(c.name());
        out.writeUTF(c.type().name());
    }
}
byte[] payload = buf.toByteArray();

Path metaPath = root.resolve(schema.tableName()).resolve("_meta");
try (DataOutputStream out = new DataOutputStream(
        Files.newOutputStream(metaPath, StandardOpenOption.CREATE_NEW))) {
    out.write(payload);
}

So, creating a table means creating the table’s directory and writing the _meta file to disk. If you look at CairoEngine you’ll see a similar API to ours, with some interesting stuff, such as:

The TableWriter

The TableWriter is the object responsible for writing bytes to disk. It writes bytes exactly how we explained in The Anatomy of a QuestDB Table on Disk. If you run part1.Demo, and look at the files and data written to disk, you’ll get the same as you would by running those statements in QuestDB (as seen in that post).

The TableWriter API is:

try (TableWriter w = db.getWriter("trades")) {
    w
        .row(micros("2026-06-10T10:00:00Z"))
        .putDouble(1, 1.0)
        .putSymbol(2, "BTC")
        .putVarchar(3, "hi")
        .append();
}

Every time we want to write a new row, we call row passing the timestamp, then, call putXXX methods to write the data for each of the columns. When we’re done we call append. The append is just a guard to mark the row as complete. It does not append the data or offer any durability (this will come later with commit() in Part III). The data is appended on the row() and putXXX calls.

In our TableWriter implementation we rely on a generic ColumnWriter that hides the logic of each of the types we support. This is not ideal in terms of performance because it implies a dynamic dispatch on a hot code path, but we opted for this for clarity.

public class TableWriter implements Closeable {
    private static final String DEFAULT_PARTITION = "default";

    private final Schema schema;
    private final ColumnWriter[] cols;
    private final SymbolDictionary[] dicts;
    private long lastTs = Long.MIN_VALUE;
    private boolean rowOpen = false;


    // ...
}

We have three implementations for ColumnWriter: FixedColumnWriter, SymbolColumnWriter, and VarcharColumnWriter. Each one of those implementations runs exactly the logic we described in the previous post. We won’t go through that again. So, just imagine the TableWriter as an orchestrator holding a ColumnWriter implementation for each of the columns of the schema. And every time we call a putXXX method it delegates the write logic to the correct ColumnWriter.

Also, the column writers hold references to mmapped files (represented by MappedAppendBuffer object) backing the data that column writer is responsible for:

// price.d, ts.d
final class FixedColumnWriter implements ColumnWriter {
    private final MappedAppendBuffer d;
    private final int width;

    // ...
}

// sym.d
final class SymbolColumnWriter implements ColumnWriter {
    private final MappedAppendBuffer d;
    private final SymbolDictionary dictionary;

    // ...
}


// note.d, note.i
final class VarcharColumnWriter implements ColumnWriter {
    //...
    
    private final MappedAppendBuffer d;
    private final MappedAppendBuffer i;

    // ...
}


SymbolDictionary is the object responsible for managing the symbol dictionary backed by the *.c and *.o files. That’s why SymbolColumnWriter has the references MappedAppendBuffer d and SymbolDictionary dictionary; it needs to append data to sym.d and manage entries on sym.c and sym.o files.

The last thing to discuss about the TableWriter is that it only supports in-order data. That’s how we keep the invariant of always having data sorted by timestamp. The TableWriter keeps track of the last timestamp inserted, and if new data arrives whose timestamp is less than the last, it throws an error:

if (ts < lastTs) {
    throw new IllegalStateException("out-of-order ts: row=" + ts + " last=" + lastTs);
}

In QuestDB, an out-of-order row would go through the out-of-order (O3) path, which we won’t discuss here. So, we assume our data is always in-order.

Part II - Introducing partitions

In Part II, we introduce partitions. One of the features that make QuestDB a very performant time-series database is its ability to physically separate a table’s data based on its designated timestamp. Whenever you create a table, you can decide whether it will be partitioned and the time unit of the partition:

CREATE TABLE market_data (
    ts TIMESTAMP,        -- Will be the designated timestamp
    symbol SYMBOL,
    price DOUBLE
) TIMESTAMP(ts) PARTITION BY DAY;

The writer will then split your table into time-based chunks. It’s this partitioning behavior that we implement in Part II.

Firstly, we need to expand our schema to support partitions:

public enum PartitionUnit {
    HOUR,
    DAY,
}

record Schema(String tableName, List<Column> columns, int timestampIdx, PartitionUnit partitionBy)

Also, _meta gains one extra writeUTF(partitionBy) field.

Then, we need a representation for our partitions (Partition.java):

record Partition(PartitionUnit unit, long lo) implements Comparable<Partition>  {
  // ...
}

Given a timestamp, we map it to a partition (Partition.of(ts, unit) floors to the day/hour boundary); we can also produce the directory name (2026-06-10) and parse it back.

Lastly, we need to adjust the TableWriter to be partition-aware, and create new partitions as new data crosses the partition boundary. That is done by keeping track of the current partition the writer is writing into and if the boundary is crossed switching to the new partition:

public TableWriter row(long ts) throws IOException {
    // ...
    Partition p = Partition.of(ts, schema.partitionBy());
    if (cols == null || !p.equals(currentPartition)) {
        switchPartition(p);
    }
    
    // ...
    
    return this;
}

private void switchPartition(Partition p) throws IOException {
    closeColumns();
    Path partitionDir = tableDir.resolve(p.name());
    Files.createDirectories(partitionDir);
    ColumnWriter[] next = new ColumnWriter[schema.columnCount()];
    for (int i = 0; i < schema.columnCount(); i++) {
        next[i] = ColumnWriter.openForAppend(partitionDir, schema.column(i), dicts[i]);
    }
    cols = next;
    currentPartition = p;
}

The switchPartition method closes the column writers for the current partition, creates the new partition directory, and opens the column writer for the new partition.

That’s all for Part II. You can play with part2.Demo and the tests to understand how that behaves.

Part III - Introducing transactions

Although we’re able to write data to disk in the same format as QuestDB, there are a couple of problems with our implementation. When we are calling those putXXX methods, we are already appending data to disk. What if one putXXX works but another one for a different column fails? We would have a mismatch across our columnar files. Also, let’s say there’s a query running at the same time as we call putXXX, that query should not be able to see that value, because the entire row is still writing. And furthermore, what if we need to roll back those writes? We need to introduce the concept of a transaction.

Transactions are a central theme in databases. The idea is that the storage engine needs somehow to be able to indicate that a set of writes are durable and visible on disk; ready to be consumed by readers. It is important that while those writes are happening there’s no way for readers to partially see them. We introduce a commit method to our API to allow the engine to indicate that.

The commit is just another method on our API. It does not really tell us how to achieve the transactional behavior. That’s where the _txn file (that we also ignored in the previous post) enters the scene. We use _txn as the source of truth of what is exactly visible in our table. Readers rely on _txn to know which bytes can be read; writers rely on _txn to know the last position of the data files and the correct symbol information so they can append correctly. So, calling commit essentially means updating the information on _txn to match what’s already on disk.

Before discussing what information we need to store on _txn, let’s discuss another tricky detail: that information needs to be updated atomically. We cannot allow partial writes to happen. There are a couple of strategies to make that happen. A very common one is writing the new information in a temporary _txn.tmp file, not yet visible, and using a filesystem rename operation, which is considered atomic in some operating systems, to do the switch. As soon as the rename is done, new data becomes visible. Because readers may have cached the information of what is visible, they may have to be reopened. Another approach, the one used by QuestDB, and implemented by us, is to have two regions inside _txn. One with the live information (let’s say region A) and another with the outdated information (let’s say region B). And a version field inside the file that can be used to indicate which one is the active one. For example, if the version is even, we use region A, if the version is odd, we use region B. So a commit needs to write the new information into one of the regions, keeping both the old and the new at the same time, then bump the version.

For our implementation, there’s just a handful of information that needs to get stored on _txn:

With that in mind, we have designed the following layout for our _txn file:

graphviz-b4241e1d5089a61340c1a31c1741a96c digraph txn { rankdir=TB; nodesep=0.5; ranksep=0.6; node [shape=plaintext, fontsize=12, fontname="JetBrains Mono,SFMono-Regular,Consolas,Menlo,monospace"]; header [label=< <TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="4"> <TR> <TD BORDER="0"></TD> <TD FIXEDSIZE="TRUE" WIDTH="28" HEIGHT="24">0</TD><TD FIXEDSIZE="TRUE" WIDTH="28" HEIGHT="24">1</TD><TD FIXEDSIZE="TRUE" WIDTH="28" HEIGHT="24">2</TD><TD FIXEDSIZE="TRUE" WIDTH="28" HEIGHT="24">3</TD><TD FIXEDSIZE="TRUE" WIDTH="28" HEIGHT="24">4</TD><TD FIXEDSIZE="TRUE" WIDTH="28" HEIGHT="24">5</TD><TD FIXEDSIZE="TRUE" WIDTH="28" HEIGHT="24">6</TD><TD FIXEDSIZE="TRUE" WIDTH="28" HEIGHT="24">7</TD><TD FIXEDSIZE="TRUE" WIDTH="28" HEIGHT="24">8</TD><TD FIXEDSIZE="TRUE" WIDTH="28" HEIGHT="24">9</TD><TD FIXEDSIZE="TRUE" WIDTH="28" HEIGHT="24">10</TD><TD FIXEDSIZE="TRUE" WIDTH="28" HEIGHT="24">11</TD><TD FIXEDSIZE="TRUE" WIDTH="28" HEIGHT="24">12</TD><TD FIXEDSIZE="TRUE" WIDTH="28" HEIGHT="24">13</TD><TD FIXEDSIZE="TRUE" WIDTH="28" HEIGHT="24">14</TD><TD FIXEDSIZE="TRUE" WIDTH="28" HEIGHT="24">15</TD><TD FIXEDSIZE="TRUE" WIDTH="28" HEIGHT="24">16</TD><TD FIXEDSIZE="TRUE" WIDTH="28" HEIGHT="24">17</TD><TD FIXEDSIZE="TRUE" WIDTH="28" HEIGHT="24">18</TD><TD FIXEDSIZE="TRUE" WIDTH="28" HEIGHT="24">19</TD><TD FIXEDSIZE="TRUE" WIDTH="28" HEIGHT="24">20</TD><TD FIXEDSIZE="TRUE" WIDTH="28" HEIGHT="24">21</TD><TD FIXEDSIZE="TRUE" WIDTH="28" HEIGHT="24">22</TD><TD FIXEDSIZE="TRUE" WIDTH="28" HEIGHT="24">23</TD><TD FIXEDSIZE="TRUE" WIDTH="28" HEIGHT="24">24</TD><TD FIXEDSIZE="TRUE" WIDTH="28" HEIGHT="24">25</TD><TD FIXEDSIZE="TRUE" WIDTH="28" HEIGHT="24">26</TD><TD FIXEDSIZE="TRUE" WIDTH="28" HEIGHT="24">27</TD><TD FIXEDSIZE="TRUE" WIDTH="28" HEIGHT="24">28</TD><TD FIXEDSIZE="TRUE" WIDTH="28" HEIGHT="24">29</TD><TD FIXEDSIZE="TRUE" WIDTH="28" HEIGHT="24">30</TD><TD FIXEDSIZE="TRUE" WIDTH="28" HEIGHT="24">31</TD> </TR> <TR> <TD BORDER="0" ALIGN="LEFT"></TD> <TD ROWSPAN="2" COLSPAN="8" BGCOLOR="#f7c9c9">version</TD> <TD COLSPAN="12" BGCOLOR="#eef0f2">slot A</TD> <TD COLSPAN="12" BGCOLOR="#eef0f2">slot B</TD> </TR> <TR> <TD BORDER="0"></TD> <TD COLSPAN="4" PORT="aoff" BGCOLOR="#cdd9f3">offset</TD><TD COLSPAN="4" BGCOLOR="#cdebcd">symSize</TD><TD COLSPAN="4" BGCOLOR="#f3e3c2">partSize</TD> <TD COLSPAN="4" PORT="boff" BGCOLOR="#cdd9f3">offset</TD><TD COLSPAN="4" BGCOLOR="#cdebcd">symSize</TD><TD COLSPAN="4" BGCOLOR="#f3e3c2">partSize</TD> </TR> </TABLE>>]; record [label=< <TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0" CELLPADDING="4"> <TR> <TD BORDER="0" ALIGN="RIGHT"></TD> <TD PORT="rec" WIDTH="108" BGCOLOR="#f7c9c9">version</TD> <TD WIDTH="108" BGCOLOR="#e6d6f7">lastTs</TD> <TD WIDTH="148" BGCOLOR="#cdebcd">symBytes</TD> <TD WIDTH="156" BGCOLOR="#f3e3c2">partBytes</TD> </TR> <TR> <TD BORDER="0"></TD> <TD BORDER="0"><FONT POINT-SIZE="10">0&#8211;7</FONT></TD> <TD BORDER="0"><FONT POINT-SIZE="10">8&#8211;15</FONT></TD> <TD BORDER="0"><FONT POINT-SIZE="10">16 &#8230; 16+symSize</FONT></TD> <TD BORDER="0"><FONT POINT-SIZE="10">&#8230; +partSize</FONT></TD> </TR> </TABLE>>]; header:aoff:s -> record:rec:n [label=" live slot\l (version &amp; 1)\l"]; header:boff:s -> record:rec:n [style=dashed, label=" other slot =\l previous record\l"]; } txn header 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 version slot A slot B offset symSize partSize offset symSize partSize record version lastTs symBytes partBytes 0–7 8–15 16 … 16+symSize … +partSize header:s->record:n  live slot  (version & 1) header:s->record:n  other slot =  previous record

The first 32 bytes make up the header:

Right after the header is the area where we store the information we need, composed of (the four bullets above): version, last timestamp, symbol counts, and partition’s committed row counts.

Let’s unpack this with some code. We add a TxWriter object that is responsible for writing to the _txn file. TxWriter exposes a commit method that is used by TableWriter to commit. The file format and the right offsets are found at TxFormat.

Let’s start with the version. The version is used to determine which slot is active:

static int slot(long version) {
    return (version & 1L) == 0L ? SLOT_A : SLOT_B;
}

If the version is even we rely on slot A metadata, otherwise, we rely on slot B metadata. Each slot stores the offset (the byte offset where the information can be found), the size of the symBytes area, and the size of the partBytes area. Let’s see what that really means by looking at the commit implementation. First the signature:

public long commit(long lastTs, SymbolDictionary[] dicts, Map<Long, Long> partitions) {
    //...
    
}

We pass in exactly what needs to be stored in _txn. The new version is calculated by bumping the current version, so we don’t need to pass that in. Then we calculate symSize, symBytes, partSize, and partBytes:

int symCols = 0;
for (SymbolDictionary d : dicts) if (d != null) symCols++;
ByteBuffer sb = ByteBuffer.allocate(symCols * SYM_ENTRY_SIZE).order(ByteOrder.LITTLE_ENDIAN);
for (SymbolDictionary d : dicts) if (d != null) sb.putInt(d.count());
byte[] symBytes = sb.array();
int symSize = symBytes.length;

ByteBuffer pb = ByteBuffer.allocate(partitions.size() * PART_ENTRY_SIZE).order(ByteOrder.LITTLE_ENDIAN);
for (Map.Entry<Long, Long> e : partitions.entrySet()) {
    pb.putLong(e.getKey());
    pb.putLong(e.getValue());
}
byte[] partBytes = pb.array();
int partSize = partBytes.length;

Then we need to figure out where we’re going to write that information. That is done with calculateWriteOffset:

private static int calculateWriteOffset(MappedByteBuffer m, long version, int symSize, int partSize) {
  // find the active slot
  int curSlot   = slot(version);                 
  int curOffset = m.getInt(curSlot + SLOT_OFFSET);
  
  // size of the NEW record
  int needed    = REC_HEAD_SIZE + symSize + partSize;   

  // Option A: fits in the gap before the live record
  if (HEADER_SIZE + needed <= curOffset) return HEADER_SIZE;   

  // Option B: just after the live record
  int curSym  = m.getInt(curSlot + SLOT_SYM_SIZE);
  int curPart = m.getInt(curSlot + SLOT_PART_SIZE);
  return curOffset + REC_HEAD_SIZE + curSym + curPart;          
}

Depending on the size of the new record and where the live record is written, it decides if the new record will be written right after the header (option A), or right after the live record (option B).

Then we write the information to the correct offset and update the new slot metadata:

long newVersion = version + 1;
int newSlot = slot(newVersion);

m.putLong(writeOffset + REC_OFF_TXN, newVersion);
m.putLong(writeOffset + REC_OFF_LAST_TS, lastTs);
m.position(writeOffset + REC_HEAD_SIZE);
m.put(symBytes);
m.put(partBytes);

m.putInt(newSlot + SLOT_OFFSET, writeOffset);
m.putInt(newSlot + SLOT_SYM_SIZE, symSize);
m.putInt(newSlot + SLOT_PART_SIZE, partSize);
m.force();

Up to this point we have a _txn file with both the live and new information (still not live) written in different places of the file. The real commit happens when we bump the version on disk:

m.putLong(OFF_VERSION, newVersion);
m.force();

And that’s how commit is implemented.

Besides that, commit needs to be exposed by our TableWriter so clients can choose when they want to commit. Here’s the TableWriter’s implementation:

public void commit() throws IOException {
    flushData();
    syncCurrentRows();
    txWriter.commit(lastTs, dicts, partitionRows);
}

Furthermore, whenever TableWriter is opened it is actually relying on information from _txn (read back through TxReader) to tell it what is true:

private TableWriter(Schema schema, Path tableDir, TxReader txn, TxWriter txWriter, SymbolDictionary[] dicts) {
    this.schema = schema;
    this.tableDir = tableDir;
    this.txWriter = txWriter;
    this.lastTs = txn.lastTs();
    partitionRows.putAll(txn.partitions());
    this.dicts = dicts;
}

public static TableWriter open(Path root, Schema schema) throws IOException {
    Path tableDir = root.resolve(schema.tableName());
    TxReader txn = TxReader.read(tableDir);
    TxWriter txWriter = TxWriter.open(tableDir);
    return new TableWriter(schema, tableDir, txn, txWriter, openDictionaries(tableDir, schema, txn.symbolCounts()));
}

Lastly, we don’t have any reader implementations in our engine, only in tests. In Part III tests, you can see an example of a reader implementation that cannot see uncommitted writes. We cap how much of the file can be read by relying on the committed row count of _txn:

MappedByteBuffer m = mapSlice(partitionDir.resolve(fileName), 0, (long) rowCount * width);

We don’t explore rollbacks in this post; that is left as an exercise for the future.

Wrapping Up

That is what I wanted to explore in this blog post: how writes happen, how partitions are created on the fly, and how transactions are implemented. Of course, QuestDB is much more complex than this. It supports things such as out-of-order writes and WAL mode with background writes. Not only that, their implementation is much more performance-oriented, and that can lead to a very different implementation of similar concepts.

I’ve added a cairo.Demo using CairoEngine directly. With a debugger, we can easily compare both implementations.

Up next, I’d like to explore out-of-order writes and WAL mode. Stay tuned!

#databases #questdb #storage-engine #partitioning #transactions #mmap #java