A Tiny QuestDB: How Writes, Partitions, and Transactions Work
Jun 24, 2026This is a follow-up on The Anatomy of a QuestDB Table on Disk. There we looked at how data is natively stored on disk for some data types. In this post, we look at what is approximately happening at the storage layer when writes occur. For that, we implement a simplified version of QuestDB’s storage engine. We’re particularly focused on implementing just what is necessary for the storage engine to be able to store the data we’ve been working with so far:
CREATE TABLE trades (ts TIMESTAMP, price DOUBLE, sym SYMBOL, note VARCHAR) TIMESTAMP(ts) PARTITION BY DAY WAL;
INSERT INTO trades VALUES ('2026-06-10T10:00:00Z', 1.0, 'BTC', 'hi');
INSERT INTO trades VALUES ('2026-06-10T10:00:01Z', 2.5, 'ETH', 'this note is too long to inline');
INSERT INTO trades VALUES ('2026-06-10T10:00:02Z', -3.0, 'BTC', 'x');
We break the implementation into 3 parts:
- Part I: be able to create a table and write rows to disk
- Part II: be able to switch partitions when a partition boundary is crossed
- Part III: be able to write the data within a transaction
Note that our implementation is minimal and mostly for educational purposes with the goal of understanding some underlying principles of a real database. We try to capture most of QuestDB’s essence into our implementation and try to be as accurate as possible; however, don’t treat it as an exact representation of QuestDB. We try to make the right parallels when relevant. Also, we won’t go through every line of code, just what I think it is worth mentioning. You can find the complete code at brunocalza/tiny-quest to follow along. I’ve broken the code in parts as well, so it is easier to tell the difference between parts.
Part I - Writing data to disk
In Part I, our goal is to write a storage engine with an API that would allow a layer on top of the engine to execute those SQL statements. Besides that, we aim to be 100% compatible with what is actually written to disk (except for _meta and _txn), meaning the bits stored on disk by our engine should match QuestDB’s bits.
The engine
QuestDB’s storage engine is called cairo. We start by creating a similar object to CairoEngine. We call ours, simply, Engine:
public class Engine {
Path root;
public Engine(String root) {
this.root = Path.of(root);
}
public void createTable(Schema schema) throws Exception {
Files.createDirectories(root);
Path tableDir = root.resolve(schema.tableName());
if (Files.exists(tableDir)) {
throw new IOException("table already exists: " + schema.tableName());
}
Files.createDirectory(tableDir);
Schema.save(root, schema);
}
public TableWriter getWriter(String table) throws IOException {
Schema schema = Schema.load(root, table);
return TableWriter.open(root, schema);
}
}
Our engine, which points to a root directory, can create a table given a schema and provide a TableWriter for a given table. It is through the TableWriter that we’ll do our writes.
Our Schema is represented as:
// Schema record
public record Schema(String tableName, List<Column> columns, int timestampIdx) {
}
// Column record
public record Column(String name, ColumnType type) {
}
// ColumnType
public enum ColumnType {
DOUBLE,
TIMESTAMP,
VARCHAR,
SYMBOL;
public int width() {
return switch (this) {
case DOUBLE, TIMESTAMP, VARCHAR -> 8;
case SYMBOL -> 4;
};
}
}
We can go ahead and look at the part1.Demo, to see how the engine is used:
Path root = args.length > 0 ? Path.of(args[0]) : Files.createTempDirectory("tinyquest-demo");
Schema schema = new Schema(
"trades",
List.of(
new Column("ts", ColumnType.TIMESTAMP),
new Column("price", ColumnType.DOUBLE),
new Column("sym", ColumnType.SYMBOL),
new Column("note", ColumnType.VARCHAR)),
0);
Engine db = new Engine(root.toString());
db.createTable(schema);
try (TableWriter w = db.getWriter("trades")) {
w.row(micros("2026-06-10T10:00:00Z")).putDouble(1, 1.0).putSymbol(2, "BTC").putVarchar(3, "hi").append();
w.row(micros("2026-06-10T10:00:01Z")).putDouble(1, 2.5).putSymbol(2, "ETH").putVarchar(3, "this note is too long to inline").append();
w.row(micros("2026-06-10T10:00:02Z")).putDouble(1, -3.0).putSymbol(2, "BTC").putVarchar(3, "x").append();
}
You can see that most of the job is being done by the TableWriter. We’ll discuss it in the next section. For now, let’s focus on the other responsibility of our engine: creating a table. If we were to execute our demo without writes, we’d see something like this in our root directory:
ls -R
.:
trades
./trades:
_meta
A trades directory, for our trades table, and a _meta file.
In our previous post, The Anatomy of a QuestDB Table on Disk, we owe an explanation for _meta. The purpose of _meta is to store metadata about the table, for example its schema. We implemented our _meta with a simple layout just enough for our purposes; it is not compatible with QuestDB. We simply serialize our schema and write it to disk:
ByteArrayOutputStream buf = new ByteArrayOutputStream();
try (DataOutputStream out = new DataOutputStream(buf)) {
out.writeUTF(schema.tableName());
out.writeInt(schema.timestampIdx());
out.writeInt(schema.columnCount());
for (Column c : schema.columns()) {
out.writeUTF(c.name());
out.writeUTF(c.type().name());
}
}
byte[] payload = buf.toByteArray();
Path metaPath = root.resolve(schema.tableName()).resolve("_meta");
try (DataOutputStream out = new DataOutputStream(
Files.newOutputStream(metaPath, StandardOpenOption.CREATE_NEW))) {
out.write(payload);
}
So, creating a table means creating the table’s directory and writing the _meta file to disk. If you look at CairoEngine you’ll see a similar API to ours, with some interesting stuff, such as:
- We rely on
IOExceptionto indicate a table already exists. They rely on error codes. Also, they have their own exception implementation calledCairoException. It is reused over and over throughout the system to avoid new allocations and expensive stack traces. - We rely on
java.nio.file.Filesto access disk operations. Their access is through theFilesFacadeinterface. One of the trickiest things in databases is to test certain errors and faults. So it is common to see external access being done through an interface, so faults can be injected and tested deterministically. - They rely on a writer pool. The table writer comes from a pool. As you’ll see, there’s a lot of overhead when opening a table writer. They keep a pool of writers cached to avoid that overhead.
- They support both non-WAL and WAL modes
The TableWriter
The TableWriter is the object responsible for writing bytes to disk. It writes bytes exactly how we explained in The Anatomy of a QuestDB Table on Disk. If you run part1.Demo, and look at the files and data written to disk, you’ll get the same as you would by running those statements in QuestDB (as seen in that post).
The TableWriter API is:
try (TableWriter w = db.getWriter("trades")) {
w
.row(micros("2026-06-10T10:00:00Z"))
.putDouble(1, 1.0)
.putSymbol(2, "BTC")
.putVarchar(3, "hi")
.append();
}
Every time we want to write a new row, we call row passing the timestamp, then, call putXXX methods to write the data for each of the columns. When we’re done we call append. The append is just a guard to mark the row as complete. It does not append the data or offer any durability (this will come later with commit() in Part III). The data is appended on the row() and putXXX calls.
In our TableWriter implementation we rely on a generic ColumnWriter that hides the logic of each of the types we support. This is not ideal in terms of performance because it implies a dynamic dispatch on a hot code path, but we opted for this for clarity.
public class TableWriter implements Closeable {
private static final String DEFAULT_PARTITION = "default";
private final Schema schema;
private final ColumnWriter[] cols;
private final SymbolDictionary[] dicts;
private long lastTs = Long.MIN_VALUE;
private boolean rowOpen = false;
// ...
}
We have three implementations for ColumnWriter: FixedColumnWriter, SymbolColumnWriter, and VarcharColumnWriter. Each one of those implementations runs exactly the logic we described in the previous post. We won’t go through that again. So, just imagine the TableWriter as an orchestrator holding a ColumnWriter implementation for each of the columns of the schema. And every time we call a putXXX method it delegates the write logic to the correct ColumnWriter.
Also, the column writers hold references to mmapped files (represented by MappedAppendBuffer object) backing the data that column writer is responsible for:
// price.d, ts.d
final class FixedColumnWriter implements ColumnWriter {
private final MappedAppendBuffer d;
private final int width;
// ...
}
// sym.d
final class SymbolColumnWriter implements ColumnWriter {
private final MappedAppendBuffer d;
private final SymbolDictionary dictionary;
// ...
}
// note.d, note.i
final class VarcharColumnWriter implements ColumnWriter {
//...
private final MappedAppendBuffer d;
private final MappedAppendBuffer i;
// ...
}
SymbolDictionary is the object responsible for managing the symbol dictionary backed by the *.c and *.o files. That’s why SymbolColumnWriter has the references MappedAppendBuffer d and SymbolDictionary dictionary; it needs to append data to sym.d and manage entries on sym.c and sym.o files.
The last thing to discuss about the TableWriter is that it only supports in-order data. That’s how we keep the invariant of always having data sorted by timestamp. The TableWriter keeps track of the last timestamp inserted, and if new data arrives whose timestamp is less than the last, it throws an error:
if (ts < lastTs) {
throw new IllegalStateException("out-of-order ts: row=" + ts + " last=" + lastTs);
}
In QuestDB, an out-of-order row would go through the out-of-order (O3) path, which we won’t discuss here. So, we assume our data is always in-order.
Part II - Introducing partitions
In Part II, we introduce partitions. One of the features that make QuestDB a very performant time-series database is its ability to physically separate a table’s data based on its designated timestamp. Whenever you create a table, you can decide whether it will be partitioned and the time unit of the partition:
CREATE TABLE market_data (
ts TIMESTAMP, -- Will be the designated timestamp
symbol SYMBOL,
price DOUBLE
) TIMESTAMP(ts) PARTITION BY DAY;
The writer will then split your table into time-based chunks. It’s this partitioning behavior that we implement in Part II.
Firstly, we need to expand our schema to support partitions:
public enum PartitionUnit {
HOUR,
DAY,
}
record Schema(String tableName, List<Column> columns, int timestampIdx, PartitionUnit partitionBy)
Also, _meta gains one extra writeUTF(partitionBy) field.
Then, we need a representation for our partitions (Partition.java):
record Partition(PartitionUnit unit, long lo) implements Comparable<Partition> {
// ...
}
Given a timestamp, we map it to a partition (Partition.of(ts, unit) floors to the day/hour boundary); we can also produce the directory name (2026-06-10) and parse it back.
Lastly, we need to adjust the TableWriter to be partition-aware, and create new partitions as new data crosses the partition boundary. That is done by keeping track of the current partition the writer is writing into and if the boundary is crossed switching to the new partition:
public TableWriter row(long ts) throws IOException {
// ...
Partition p = Partition.of(ts, schema.partitionBy());
if (cols == null || !p.equals(currentPartition)) {
switchPartition(p);
}
// ...
return this;
}
private void switchPartition(Partition p) throws IOException {
closeColumns();
Path partitionDir = tableDir.resolve(p.name());
Files.createDirectories(partitionDir);
ColumnWriter[] next = new ColumnWriter[schema.columnCount()];
for (int i = 0; i < schema.columnCount(); i++) {
next[i] = ColumnWriter.openForAppend(partitionDir, schema.column(i), dicts[i]);
}
cols = next;
currentPartition = p;
}
The switchPartition method closes the column writers for the current partition, creates the new partition directory, and opens the column writer for the new partition.
That’s all for Part II. You can play with part2.Demo and the tests to understand how that behaves.
Part III - Introducing transactions
Although we’re able to write data to disk in the same format as QuestDB, there are a couple of problems with our implementation. When we are calling those putXXX methods, we are already appending data to disk. What if one putXXX works but another one for a different column fails? We would have a mismatch across our columnar files. Also, let’s say there’s a query running at the same time as we call putXXX, that query should not be able to see that value, because the entire row is still writing. And furthermore, what if we need to roll back those writes? We need to introduce the concept of a transaction.
Transactions are a central theme in databases. The idea is that the storage engine needs somehow to be able to indicate that a set of writes are durable and visible on disk; ready to be consumed by readers. It is important that while those writes are happening there’s no way for readers to partially see them. We introduce a commit method to our API to allow the engine to indicate that.
The commit is just another method on our API. It does not really tell us how to achieve the transactional behavior. That’s where the _txn file (that we also ignored in the previous post) enters the scene. We use _txn as the source of truth of what is exactly visible in our table. Readers rely on _txn to know which bytes can be read; writers rely on _txn to know the last position of the data files and the correct symbol information so they can append correctly. So, calling commit essentially means updating the information on _txn to match what’s already on disk.
Before discussing what information we need to store on _txn, let’s discuss another tricky detail: that information needs to be updated atomically. We cannot allow partial writes to happen. There are a couple of strategies to make that happen. A very common one is writing the new information in a temporary _txn.tmp file, not yet visible, and using a filesystem rename operation, which is considered atomic in some operating systems, to do the switch. As soon as the rename is done, new data becomes visible. Because readers may have cached the information of what is visible, they may have to be reopened. Another approach, the one used by QuestDB, and implemented by us, is to have two regions inside _txn. One with the live information (let’s say region A) and another with the outdated information (let’s say region B). And a version field inside the file that can be used to indicate which one is the active one. For example, if the version is even, we use region A, if the version is odd, we use region B. So a commit needs to write the new information into one of the regions, keeping both the old and the new at the same time, then bump the version.
For our implementation, there’s just a handful of information that needs to get stored on _txn:
- We need a version field
- For each partition, we have to indicate how many rows are committed
- For each symbol dictionary, we need something that tells which symbols are committed. That can be done with a single count. Because our
*.cfile is append-only, with a single number we can tell that all symbols from top to bottom are committed. So let’s say, our symbol dictionary contains 3 symbols, and all of our data is committed;_txnshould have 3 stored somewhere to indicate that. Whenever a new symbol arrives, we update the*.cfile immediately via column writer, but that new symbol is not yet committed. It will be considered committed when we store 4 on_txn. - Lastly, we’ll also store
lastTs, so we can recover ourTableWriter’s last timestamp without reading that value from disk.
With that in mind, we have designed the following layout for our _txn file:
The first 32 bytes make up the header:
- 8 bytes to store the version
- 12 bytes for slot A metadata
- 12 bytes for slot B metadata
Right after the header is the area where we store the information we need, composed of (the four bullets above): version, last timestamp, symbol counts, and partition’s committed row counts.
Let’s unpack this with some code. We add a TxWriter object that is responsible for writing to the _txn file. TxWriter exposes a commit method that is used by TableWriter to commit. The file format and the right offsets are found at TxFormat.
Let’s start with the version. The version is used to determine which slot is active:
static int slot(long version) {
return (version & 1L) == 0L ? SLOT_A : SLOT_B;
}
If the version is even we rely on slot A metadata, otherwise, we rely on slot B metadata. Each slot stores the offset (the byte offset where the information can be found), the size of the symBytes area, and the size of the partBytes area. Let’s see what that really means by looking at the commit implementation. First the signature:
public long commit(long lastTs, SymbolDictionary[] dicts, Map<Long, Long> partitions) {
//...
}
We pass in exactly what needs to be stored in _txn. The new version is calculated by bumping the current version, so we don’t need to pass that in. Then we calculate symSize, symBytes, partSize, and partBytes:
int symCols = 0;
for (SymbolDictionary d : dicts) if (d != null) symCols++;
ByteBuffer sb = ByteBuffer.allocate(symCols * SYM_ENTRY_SIZE).order(ByteOrder.LITTLE_ENDIAN);
for (SymbolDictionary d : dicts) if (d != null) sb.putInt(d.count());
byte[] symBytes = sb.array();
int symSize = symBytes.length;
ByteBuffer pb = ByteBuffer.allocate(partitions.size() * PART_ENTRY_SIZE).order(ByteOrder.LITTLE_ENDIAN);
for (Map.Entry<Long, Long> e : partitions.entrySet()) {
pb.putLong(e.getKey());
pb.putLong(e.getValue());
}
byte[] partBytes = pb.array();
int partSize = partBytes.length;
Then we need to figure out where we’re going to write that information. That is done with calculateWriteOffset:
private static int calculateWriteOffset(MappedByteBuffer m, long version, int symSize, int partSize) {
// find the active slot
int curSlot = slot(version);
int curOffset = m.getInt(curSlot + SLOT_OFFSET);
// size of the NEW record
int needed = REC_HEAD_SIZE + symSize + partSize;
// Option A: fits in the gap before the live record
if (HEADER_SIZE + needed <= curOffset) return HEADER_SIZE;
// Option B: just after the live record
int curSym = m.getInt(curSlot + SLOT_SYM_SIZE);
int curPart = m.getInt(curSlot + SLOT_PART_SIZE);
return curOffset + REC_HEAD_SIZE + curSym + curPart;
}
Depending on the size of the new record and where the live record is written, it decides if the new record will be written right after the header (option A), or right after the live record (option B).
Then we write the information to the correct offset and update the new slot metadata:
long newVersion = version + 1;
int newSlot = slot(newVersion);
m.putLong(writeOffset + REC_OFF_TXN, newVersion);
m.putLong(writeOffset + REC_OFF_LAST_TS, lastTs);
m.position(writeOffset + REC_HEAD_SIZE);
m.put(symBytes);
m.put(partBytes);
m.putInt(newSlot + SLOT_OFFSET, writeOffset);
m.putInt(newSlot + SLOT_SYM_SIZE, symSize);
m.putInt(newSlot + SLOT_PART_SIZE, partSize);
m.force();
Up to this point we have a _txn file with both the live and new information (still not live) written in different places of the file. The real commit happens when we bump the version on disk:
m.putLong(OFF_VERSION, newVersion);
m.force();
And that’s how commit is implemented.
Besides that, commit needs to be exposed by our TableWriter so clients can choose when they want to commit. Here’s the TableWriter’s implementation:
public void commit() throws IOException {
flushData();
syncCurrentRows();
txWriter.commit(lastTs, dicts, partitionRows);
}
Furthermore, whenever TableWriter is opened it is actually relying on information from _txn (read back through TxReader) to tell it what is true:
private TableWriter(Schema schema, Path tableDir, TxReader txn, TxWriter txWriter, SymbolDictionary[] dicts) {
this.schema = schema;
this.tableDir = tableDir;
this.txWriter = txWriter;
this.lastTs = txn.lastTs();
partitionRows.putAll(txn.partitions());
this.dicts = dicts;
}
public static TableWriter open(Path root, Schema schema) throws IOException {
Path tableDir = root.resolve(schema.tableName());
TxReader txn = TxReader.read(tableDir);
TxWriter txWriter = TxWriter.open(tableDir);
return new TableWriter(schema, tableDir, txn, txWriter, openDictionaries(tableDir, schema, txn.symbolCounts()));
}
Lastly, we don’t have any reader implementations in our engine, only in tests. In Part III tests, you can see an example of a reader implementation that cannot see uncommitted writes. We cap how much of the file can be read by relying on the committed row count of _txn:
MappedByteBuffer m = mapSlice(partitionDir.resolve(fileName), 0, (long) rowCount * width);
We don’t explore rollbacks in this post; that is left as an exercise for the future.
Wrapping Up
That is what I wanted to explore in this blog post: how writes happen, how partitions are created on the fly, and how transactions are implemented. Of course, QuestDB is much more complex than this. It supports things such as out-of-order writes and WAL mode with background writes. Not only that, their implementation is much more performance-oriented, and that can lead to a very different implementation of similar concepts.
I’ve added a cairo.Demo using CairoEngine directly. With a debugger, we can easily compare both implementations.
Up next, I’d like to explore out-of-order writes and WAL mode. Stay tuned!
#databases #questdb #storage-engine #partitioning #transactions #mmap #java