Building a Grow-Only Counter on a Sequentially Consistent KV Store

Apr 13, 2026

This post is part of a series on Fly.io’s distributed systems challenges:

We’re going to discuss Challenge #4: Grow-Only Counter. This challenge is particularly tricky. I wouldn’t say it’s hard, but if the goal is to learn, there’s a lot to unpack.

The task is to build a grow-only counter. Nothing strange so far. However, the specification says to build the counter on top of Maelstrom’s SeqKV built-in service. And that’s where things get weird. In this post I try to explore that weirdness to the best of my ability. I also briefly touch on CRDTs (Conflict-free Replicated Data Types).

First, let’s understand what is going on. The following test needs to pass:

./maelstrom test -w g-counter \
    --bin ~/go/bin/maelstrom-counter \
    --node-count 3 \
    --rate 100 \
    --time-limit 20 \
    --nemesis partition

The g-counter workload sends two types of requests, add and read, which our nodes need to accept.

# add request
{
  "type": "add",
  "delta": 123
}
# add response
{
  "type": "add_ok"
}

# read request
{
  "type": "read"
}
# read response
{
  "type": "read_ok",
  "value": 1234
}

Essentially, the test checks that after all adds are done, every node’s final read sees the full sum.

SeqKV is a key-value store that the node can use to build the algorithm. In the Go library, SeqKV offers the following API (the same calls we’ll use below):

- Read / ReadInt: read the value stored at a key
- Write: write a value to a key
- CompareAndSwap: atomically swap a key from one expected value to a new one, failing if the current value doesn’t match

And that’s all we need to get started.

Starting simple

Let’s use a key named counter in our key-value store to hold the counter’s value. We initialize it to 0 when our nodes start:

n.Handle("init", func(msg maelstrom.Message) error {
    return kv.Write(context.Background(), "counter", 0)
})

Then whenever an add request comes in, we read the value, update it, and write it back:

n.Handle("add", func(msg maelstrom.Message) error {
    var req struct {
        Type  string `json:"type"`
        Delta int    `json:"delta"`
    }
    if err := json.Unmarshal(msg.Body, &req); err != nil {
        return err
    }

    old, err := kv.ReadInt(context.Background(), "counter")
    if err != nil {
        return err
    }
    updated := old + req.Delta
    if err := kv.Write(context.Background(), "counter", updated); err != nil {
        return err
    }

    type response struct {
        Type string `json:"type"`
    }

    return n.Reply(msg, response{Type: "add_ok"})
})

And when a read request arrives:

n.Handle("read", func(msg maelstrom.Message) error {
    value, err := kv.ReadInt(context.Background(), "counter")
    if err != nil {
        return err
    }

    type response struct {
        Type  string `json:"type"`
        Value int    `json:"value"`
    }

    return n.Reply(msg, response{Type: "read_ok", Value: value})
})

If you’re experienced with this kind of thing, you probably know this won’t work, and it really doesn’t. Still, it’s a good starting point and a way of validating that Maelstrom’s checks are solid. The reason it doesn’t work is that the read-modify-write inside add is executed by concurrent nodes: if two nodes both read counter = 5 at the same time, they’ll both write 5 + delta, and one of the adds will be lost. This is the classic lost-update race.

There are two solutions to this problem: one is making the operation atomic; the other is making sure the nodes’ writes don’t conflict with one another. With the latter approach, we enter the CRDT world.

Solving read-modify-write with CAS

We saw that our key-value store offers a CompareAndSwap method. With that, we can make our operation atomic:

for {
    old, err := kv.ReadInt(context.Background(), "counter")
    if err != nil {
        continue // read failed, retry from the top
    }
    if err := kv.CompareAndSwap(context.Background(), "counter", old, old+req.Delta, true); err != nil {
        continue // CAS failed: another node got there first, retry
    }
    break
}

I’m being a bit sloppy here with this infinite loop. Ideally, we’d have a timeout and would also check for specific CompareAndSwap errors, so we don’t retry all of them. But you get the idea.

If you run the test, you may get a valid result. If you’re not curious to understand why, you’ll probably move on to the next challenge and miss a lot of extra learning, because you may just as well get an invalid result from this very solution. And that’s where the weirdness starts: this solution is not deterministic.

Before exploring that weirdness, I’d like to discuss an alternative solution that touches a bit on what a CRDT looks like.

CRDT-like solution

There are different kinds of CRDTs, and G-Counter is one of them. If we look at the mathematical definition of the G-Counter CRDT on Wikipedia (link), you’ll see:

payload integer[n] P
    initial [0,0,...,0]

update increment()
    let g = myId()
    P[g] := P[g] + 1

query value() : integer v
    let v = Σi P[i]

compare (X, Y) : boolean b
    let b = (∀i ∈ [0, n - 1] : X.P[i] ≤ Y.P[i])

merge (X, Y) : payload Z
    let ∀i ∈ [0, n - 1] : Z.P[i] = max(X.P[i], Y.P[i])

Let’s ignore compare and merge for a moment. If we squint and think a bit, the payload integer, update increment and query value kind of map to the responsibilities of our node: init, add, and read.

What it roughly says is that the initial state is a vector [0,0,...,0]. When a node needs to increment the counter, it updates only its own entry (Wikipedia shows +1; we generalize to +delta). See the myId() call and the P[g] := P[g] + 1 operation? And when we read the counter value, we actually sum all the values in the vector. Interesting.

Let’s map this to what we have. Our state is stored on a key-value store, and it does not offer a vector data structure. However, if a node is only updating its part of the vector, we can create a key-value counter for each node and consider that to be a vector. Using the node ID as the key, our init becomes:

n.Handle("init", func(msg maelstrom.Message) error {
    return kv.Write(context.Background(), n.ID(), 0)
})

Our read-modify-write becomes:

old, err := kv.ReadInt(context.Background(), n.ID())
if err != nil {
    return err
}
updated := old + req.Delta
if err := kv.Write(context.Background(), n.ID(), updated); err != nil {
    return err
}

Our counter is now spread into different counters, and the solution we saw on Wikipedia says that we should sum them to get the real value. Let’s do that:

var sum = 0
for _, nodeID := range n.NodeIDs() {
    value, err := kv.ReadInt(context.Background(), nodeID)
    if err != nil {
        return err
    }
    sum += value
}

As before, running the test may produce either a valid or an invalid result.

What happened here is that by making each node work only on its own counter, writes no longer conflict with one another, eliminating the race condition we had.

So is this a CRDT? Not quite. The merge and compare pieces, the parts of the definition we ignored, are what would make this a CRDT. But what are they for? A CRDT is a decentralized algorithm, and the fact that we’re using a centralized key-value store means we have eliminated the need for merge and compare. If we removed the key-value store, nodes would need another way to learn which values the other nodes have. Currently, the key-value store is doing that job: a node can always read from it to find out what another node has, which is exactly how our read works. In a real CRDT algorithm, the key-value store would be replaced by a gossip protocol that broadcasts each node’s local state, with compare and merge used to reconcile it with the others’.

I did not explore this approach in this challenge, but I wanted to share it to make the learnings more complete. In the future, I’ll see if I can reuse the approaches discussed in Fly.io’s Broadcast Challenges, add merge and compare, and pass the test.

On consistency models

Sequential consistency

So, we discussed two solutions, but neither of them passes the test deterministically. We need to figure out why and fix our code. The fact that the problem specification suggests SeqKV hints at where our issue may lie, and perhaps at the author’s pedagogical intentions.

SeqKV is a sequentially consistent key-value store. To understand what that means, we have to understand what a consistency model is, because sequential consistency is one of many. One way of thinking about the consistency model of a data store is as a contract between clients and the store: if clients follow certain rules, the store promises to behave in a certain way. You might be wondering what these behaviors are and why all data stores don’t behave the same way. Well, in a distributed system with concurrent clients, where nodes might fail, messages can get delayed, and clocks are unreliable, weird things happen, and different kinds of behaviors emerge. We can call these odd behaviors anomalies. These anomalies have been categorized, and a consistency model is essentially defined by the set of anomalies it allows. The stronger the model, the more restrictive it is about which anomalies can occur. And the reason every data store doesn’t simply pick the strongest model is that stronger models cost availability and performance, a compromise you might not be willing to make. Consistency Models is a good reference on the different consistency models.

Now we need to understand the behavior that a sequentially consistent key-value store promises its clients.

In the sequentially consistent model, there must exist a total order of all operations that is consistent with each client’s program order. However, that total order may not match the real-time order in which operations actually happened. This is the anomaly it allows.

I’ll use Maelstrom’s example to explain further. Suppose two clients execute the following operations in real-time order:

1. client1: write x = 1
2. client2: CAS x (1 -> 2)
3. client1: write x = 1
4. client2: read x = 2

You might be thinking that there’s something wrong with the fact that client2 read x = 2. But that anomaly is totally valid under sequential consistency. Maybe the third operation got delayed, and client2 is doing a read that is stale relative to real time. Or the CAS got delayed and took effect after the second write. The following two total orders are both valid:

1. client1: write x = 1
3. client1: write x = 1
2. client2: CAS x (1 -> 2)
4. client2: read x = 2

1. client1: write x = 1
2. client2: CAS x (1 -> 2)
4. client2: read x = 2
3. client1: write x = 1

We now understand sequential consistency. Let’s see if that is enough to understand our results.

Does sequential consistency explain our results?

Here’s one of the results I got in one of my runs:

:workload {:valid? false,
    :errors (#jepsen.history.Op{:index 3938,
                                :time 30006761427,
                                :type :ok,
                                :process 0,
                                :f :read,
                                :value 1343,
                                :final? true}),
    :final-reads (1345 1343 1345),
    :acceptable ([1345 1345])},

Process 0 read 1343, but 1345 was expected. Okay, given our understanding of sequential consistency, it might be that, in this run, the final read was ordered before the last write, so we got a stale value. Reasonable: it’s a valid anomaly. And in some runs the anomaly doesn’t happen, so the test passes.

That kind of explains our results. But there’s something deeper going on. To see it, we need to look at how the final results are checked. If we look at the Maelstrom source code of the test run, we’ll see that there’s a cooldown period of 10 seconds before the final reads happen. That is strange: our read stays stale even after 10 seconds. You might ask, how long do we have to wait for the writes to converge?

I asked that question. And the answer leads to a very important realization about consistency models. Maelstrom is a platform built for learning purposes, so SeqKV was built to illustrate what sequential consistency might look like in practice: a key-value store that simulates sequential-consistency-ish behavior. But sequential consistency only defines which anomalies are allowed, not which anomalies must occur or how they manifest. Our final stale read that never seems to catch up, although a valid anomaly, is a behavior that has nothing to do with sequential consistency itself: different databases under the same consistency model might manifest the anomaly in completely different ways.

The hacky solution

So, the kind of unfortunate conclusion is that to solve the challenge, we really need to understand some implementation details of SeqKV. Just understanding sequential consistency is not enough.

By reading Compare-and-swap on seq-kv and the SeqKV source code, you get a sense of what is going on. Two parts are crucial: SeqKV has some randomness on reads that lets it serve a client stale state from its history, and a write that actually changes state forces the client to the most up-to-date view of the data store.

And with that we can come up with a solution for g-counter that uses SeqKV. If clients are forced to have the most up-to-date view when a change of state occurs, we can simply do a write of a unique value before reads:

kv.Write(context.Background(), "rand", time.Now().UnixMilli())

Any read that comes after the write will no longer be stale. It is important that the value be unique: SeqKV only forces clients to the latest state when the write actually changes state. If the value matches what’s already there, nothing changes and the reads stay stale.

The “write forces clients to the latest state” behavior is just how Maelstrom’s SeqKV works internally to keep the total order well defined. Don’t take this extra write as a workaround for the anomalies of any sequentially consistent database: you’re not fixing the anomaly, and there’s no fixing to be done. Your data store operates under a certain consistency model, and you have to deal with that. Of course, in this case, knowing how SeqKV works internally let us change our program to behave the way we want. That’s why it can feel hacky.

The non-hacky solutions

If the hack bothers you, there are a couple of other approaches to the g-counter workload. The most appropriate solution is to implement a CRDT with a gossip mechanism, as we discussed: a G-Counter is a CRDT, and the workload is made for it. But if you’d rather stay on the consistency-model path, you can switch from SeqKV to LinKV to get a linearizable key-value store. Under linearizability, the total order observed by clients must match the real-time order, so we don’t get stale reads, and the tests pass deterministically. If you’re wondering what it takes to implement linearizability yourself, you may be able to reuse the solution from Generating Unique IDs with Raft Consensus to add consensus to the nodes of your G-Counter. Exploring the alternatives makes the trade-offs easier to understand.

This challenge was brutal for me. Not in terms of coding; it just took a while to understand all the nuances of what was going on. In any case, it was fun, and I learned a lot. That’s all for this one.

Solution can be found at maelstrom-g-counter.

#distributed-systems #crdt #consistency #sequential-consistency #go #maelstrom