Generating Unique IDs with Raft Consensus

Apr 08, 2026

This blog post is a follow-up to Implementing Snowflake Unique ID Generation. In that post, I explain an implementation of Snowflake IDs that can be used to solve the distributed systems Challenge #2: Unique ID Generation.

Now we'll discuss an alternative: using Raft to generate globally unique IDs. I assume a general understanding of how Raft works; if you need a refresher, I recommend The Secret Lives of Data. The idea is to see whether we can use the etcd-io/raft library to build a program that passes the challenge.

Here’s the full code we’ll discuss: maelstrom-unique-ids-raft.

Thinking about the system architecture

One challenge in building a Raft application is the architecture of the code. Because you have to wait for nodes to reach consensus, it can be tricky to decide where that waiting should live. The architecture also depends on the API the Raft library provides.

Here's one way to architect the code: a distributed counter built on an embedded Raft library that communicates with maelstrom clients. In essence, we can do something very similar to what we did with Snowflake: have some kind of object, and call a method (e.g., Next) on it to get the next number in the sequence whenever a generate message arrives from maelstrom.

We’ll call this object DistributedCounter.

type Result int

type DistributedCounter struct {
    // state under consensus
    seq int

    // deps
    node    raft.Node
    storage *raft.MemoryStorage

    // ...
}

func (dc *DistributedCounter) Next(ctx context.Context, reqID string) (Result, error) {

    // request/propose a new value to raft nodes and wait for it
}

This object will act as a Raft node, communicate with other Raft nodes to achieve consensus, and store the Raft log. That’s why we see a raft.Node and *raft.MemoryStorage as dependencies. Also, seq is the state for our globally unique ID, and it is incremented whenever consensus is achieved.

And this is how everything fits together:

// in main.go

n := maelstrom.NewNode()

var counter *DistributedCounter

n.Handle("init", func(msg maelstrom.Message) error {
    // initialize raft nodes and distributed counter

    return nil
})

n.Handle("generate", func(msg maelstrom.Message) error {
    next, err := counter.Next(context.Background(), generateRequestID(16))

    // ...

})

You can see the similarities with the Snowflake solution. There, we had a Generator that needed to be initialized inside init and exposed a Next method. Here we have a DistributedCounter that needs to be initialized in init and exposes a Next method. In essence, we need to implement three pieces: the init handler, the generate handler, and the Next method. There's more to it, actually, but we'll get there.

Generate implementation

Let's start with generate because it's the simplest handler and no Raft is explicitly involved (we treat it as a black box for now). We simply return the value we got from our counter back to the maelstrom client.

n.Handle("generate", func(msg maelstrom.Message) error {
    next, err := counter.Next(context.Background(), generateRequestID(16))
    if err != nil {
        return maelstrom.NewRPCError(maelstrom.TemporarilyUnavailable, err.Error())
    }

    type response struct {
        Type string `json:"type"`
        Id   string `json:"id"`
    }

    return n.Reply(msg, response{
        Type: "generate_ok",
        Id:   fmt.Sprintf("%d", next),
    })
})

We add a “random” request ID for every generate request, because in our DistributedCounter implementation, when a node proposes a next ID, it needs to wait for the nodes to achieve consensus before getting the next ID. And that waiting happens through a channel. So we use a request ID to map results to requests. Whenever a result is ready, we associate it with its request, and the node waiting on that request ID can respond to its client.
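The post doesn't show generateRequestID, so here's a minimal sketch of what it could look like (the name and the byte-count argument come from the handler above; the hex encoding is my assumption):

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// generateRequestID returns a random hex string built from n random
// bytes (so 2n hex characters), which makes collisions between
// in-flight requests vanishingly unlikely.
func generateRequestID(n int) string {
	b := make([]byte, n)
	if _, err := rand.Read(b); err != nil {
		// crypto/rand should not fail on supported platforms
		panic(err)
	}
	return hex.EncodeToString(b)
}

func main() {
	fmt.Println(generateRequestID(16)) // 32 hex characters, e.g. "3f9a..."
}
```

Anything with enough entropy works here; the IDs only need to be unique among requests that are in flight at the same time.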

Getting the next value

Next can be implemented simply using Propose from raft.Node.

func (dc *DistributedCounter) Next(ctx context.Context, reqID string) (Result, error) {
    ch := make(chan Result, 1)
    dc.mu.Lock()
    dc.waiters[reqID] = ch
    dc.mu.Unlock()

    // If this node is the leader, it appends the entry and sends it to the
    // followers, waiting for a majority.
    // If not, the proposal is forwarded internally to the leader.
    // We retry until raft accepts the proposal (e.g., when there is no leader yet).
    for {
        if err := dc.node.Propose(ctx, []byte(reqID)); err == nil {
            break
        }
        time.Sleep(100 * time.Millisecond)
    }

    // blocked waiting for a result
    return <-ch, nil
}

A couple of things to discuss here. We add a waiters map[string]chan Result to our object. For every request, we create a channel, and our node waits on that channel:

// blocked waiting for a result
return <-ch, nil

To get the next number in the sequence, we need to propose an increment to our cluster, hence node.Propose. The Raft protocol is agnostic about the data it replicates; it only cares that nodes execute the same operations in the same order. In our case, there's only one operation, and it is a simple increment (i.e., seq++). In more complex applications, it could be a command, for example set x = 2, or a SQL statement. I'm saying this because, given the simplicity of our application, there's not much we need to propose to our nodes; we just need to signal them that a new ID was requested, and they will reach consensus on that signal. That's why we just send a reqID. Whenever consensus is achieved, nodes use this value to associate the new ID with reqID via the waiters map.
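To make the contrast concrete, here's a hypothetical sketch (not part of this post's code) of what a richer application might pass to Propose: a serialized command instead of a bare request ID. The command type and JSON encoding are my assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// command is a hypothetical "set x = 2"-style operation; raft would
// replicate these bytes verbatim and every node would apply them in
// the same order.
type command struct {
	Op  string `json:"op"`
	Key string `json:"key"`
	Val int    `json:"val"`
}

func encodeProposal(c command) ([]byte, error) {
	return json.Marshal(c)
}

func main() {
	data, err := encodeProposal(command{Op: "set", Key: "x", Val: 2})
	if err != nil {
		panic(err)
	}
	// These bytes would be handed to node.Propose instead of a reqID.
	fmt.Println(string(data))
}
```

Our application skips all of this: the mere presence of a committed entry is enough to say "increment seq".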

The other thing to discuss is how Propose works. In Raft, only the leader can append a new entry to the log. The recommended approach when a proposal arrives at a follower is to deny the request and send information about the leader back to the client. If we did that, our maelstrom test would fail, because maelstrom clients don't expect that kind of interaction. However, Propose internally detects whether the node is the leader, and if not, it forwards the proposal to the leader. So we don't have to worry about that.

One last thing. The proposal can fail. Maybe the network is going through a partition and is having a hard time electing a new leader, so no new proposals can be made. In this case, we naively retry until it works. A more appropriate solution would be to add a timeout and return an error to the client. I’ll come back to this later.

Initializing our distributed counter

There are two parts to init: we need to initialize the Raft nodes, and we need to start a background routine in our counter that runs the logic the Raft library expects our application to execute.

Initializing Raft nodes

Here, we need to figure out the node ID and all the peers, and start the Raft node with the right configuration. Maelstrom node IDs look like n0, n1, and so on, so we parse the numeric part. We add 1 because etcd raft reserves ID 0 (raft.None), so Raft node IDs must be nonzero.

nodeNum, err := strconv.ParseUint(n.ID()[1:], 10, 64)
if err != nil {
    return fmt.Errorf("failed to parse node id: %v", err)
}
raftID := nodeNum + 1

var peers []raft.Peer
for _, nid := range n.NodeIDs() {
    num, err := strconv.ParseUint(nid[1:], 10, 64)
    if err != nil {
        return fmt.Errorf("failed to parse peer id: %v", err)
    }
    peers = append(peers, raft.Peer{ID: num + 1})
}

storage := raft.NewMemoryStorage()
cfg := &raft.Config{
    ID:              raftID,
    ElectionTick:    10,
    HeartbeatTick:   1,
    Storage:         storage,
    MaxSizePerMsg:   4096, 
    MaxInflightMsgs: 256,
}

raftNode := raft.StartNode(cfg, peers)
counter = NewDistributedCounter(raftNode, storage, n.Send)
go counter.run()

For this application, we're using in-memory storage, but we can swap that out if we want. A HeartbeatTick of 1 means a heartbeat is sent every clock tick, and an ElectionTick of 10 means that if a node doesn't hear from the leader within 10 clock ticks, a leader election starts.

Background routine

The background routine is what runs when we call go counter.run(). The etcd-io/raft README says that the Raft client application has certain responsibilities. Let's start with the two high-level ones: driving the clock ticks, and being ready to act on a batch of updates:

ticker := time.NewTicker(100 * time.Millisecond)
defer ticker.Stop()

for {
    select {
    case <-ticker.C:
        dc.node.Tick()

    case rd := <-dc.node.Ready():
        // ...
    }
}

Now we know that heartbeats will happen every 100ms and the election timeout is 1s. node.Ready is a channel that tells us a new batch of updates is available and the application needs to act on it. Inside Ready, there are several more responsibilities. In our case, there are five:

// 1. store updated term/vote/commit
if !raft.IsEmptyHardState(rd.HardState) {
    if err := dc.storage.SetHardState(rd.HardState); err != nil {
        log.Printf("failed to set hard state: %v", err)
    }
}

// 2. append log entries to local storage
if len(rd.Entries) > 0 {
    if err := dc.storage.Append(rd.Entries); err != nil {
        log.Printf("failed to append entries: %v", err)
    }
}

// 3. if we have new messages to send to other nodes, send them
for _, msg := range rd.Messages {
    dc.send(msg)
}

// 4. if we have log entries that have achieved consensus,
//    apply your application logic to them
for _, ent := range rd.CommittedEntries {
    dc.applyEntry(ent)
}

// 5. signal raft that we're done processing this batch
dc.node.Advance()

This is what the Raft library provides as an API so we can make our application work with the consensus engine it uses. I wasn’t familiar with this library, so it was a bit hard to grasp at first. But if you’re familiar with Raft, you can get a sense of what is going on. Other libraries will probably offer a different API. I won’t spend too much time on the details of each of these responsibilities. The most important for our discussion are 3 and 4. We’ll skip 3 for now and focus on 4.

So, what happens in 4 is that we have received all new log entries that achieved consensus. We’re in a state where the Raft log is replicated across nodes, and our application’s state needs to reflect that. The library is telling you to apply these new entries to your application so that its state is consistent across nodes. If the application is a decentralized database, the entry could be a SQL statement, for example. In our case, it is just a request ID signaling that the sequence must be incremented.

Here’s applyEntry implementation:

func (dc *DistributedCounter) applyEntry(ent raftpb.Entry) {
    if ent.Type == raftpb.EntryConfChange {
        var cc raftpb.ConfChange
        if err := cc.Unmarshal(ent.Data); err != nil {
            log.Printf("failed to unmarshal conf change: %v", err)
            return
        }
        dc.node.ApplyConfChange(cc)
        return
    }

    if ent.Type != raftpb.EntryNormal || len(ent.Data) == 0 {
        return
    }

    reqID := string(ent.Data)

    dc.mu.Lock()
    dc.seq++
    res := Result(dc.seq)

    ch := dc.waiters[reqID]
    if ch != nil {
        delete(dc.waiters, reqID)
    }
    dc.mu.Unlock()

    if ch != nil {
        ch <- res
    }
}

It starts by handling configuration-change entries, one more responsibility I'll mostly ignore. The important thing is that this is where seq is incremented and the value is put on waiters[reqID] so that the node waiting for that value can respond to the client. There's a very subtle thing happening here that might not be obvious:

Note that all nodes increment the sequence to ensure the value is consistent across the cluster.
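A simplified illustration of why (no raft types, names are mine): replicas that apply the same committed entries in the same order end up with identical sequences, even though each entry was proposed by only one of them. If only the proposer incremented, the replicas' counters would diverge.

```go
package main

import "fmt"

// replica is a stripped-down stand-in for DistributedCounter's state.
type replica struct {
	seq int
	ids map[string]int // reqID -> the ID assigned for that entry
}

// apply mirrors the seq++ in applyEntry: every replica runs it for
// every committed entry, regardless of who proposed it.
func (r *replica) apply(reqID string) {
	r.seq++
	r.ids[reqID] = r.seq
}

func main() {
	committed := []string{"req-a", "req-b", "req-c"} // the replicated raft log
	a := &replica{ids: map[string]int{}}
	b := &replica{ids: map[string]int{}}
	for _, e := range committed {
		a.apply(e)
		b.apply(e)
	}
	// Both replicas agree: req-b got ID 2 on every node.
	fmt.Println(a.seq == b.seq, a.ids["req-b"] == b.ids["req-b"])
}
```

Only the node holding the waiter for a given reqID replies to the client, but every node advances seq so the next committed entry gets the same ID everywhere.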

Communication among the nodes

We've implemented the core of our application, but there's one supporting piece missing: nodes need to be able to communicate with each other. We said that if a follower receives a proposal, it forwards it to the leader. Also, the leader needs to send entries to followers. There are a bunch of messages being exchanged, and we haven't set that up yet. That's where the third responsibility above comes in:

for _, msg := range rd.Messages {
    dc.send(msg)
}

Whenever the protocol indicates that new messages need to be sent to other nodes, we send them. Here's how we implement that send, hooking it up with maelstrom:

func (dc *DistributedCounter) send(msg raftpb.Message) {
    destID := fmt.Sprintf("n%d", msg.To-1)

    data, err := msg.Marshal()
    if err != nil {
        log.Printf("failed to marshal raft message: %v", err)
        return
    }

    if err := dc.sendFn(destID, map[string]any{
        "type": "raft",
        "data": data,
    }); err != nil {
        log.Printf("failed to send raft message to %s: %v", destID, err)
    }
}

And sendFn is a dependency that we add to DistributedCounter:

type DistributedCounter struct {
    // state under consensus
    seq int

    // deps
    sendFn  func(dest string, body any) error
    node    raft.Node
    storage *raft.MemoryStorage

    // control
    mu      sync.Mutex
    waiters map[string]chan Result
}

// in main.go

counter = NewDistributedCounter(raftNode, storage, n.Send)

You see that we created a new kind of maelstrom message called raft. We need to handle that as well:

// maelstrom handler
n.Handle("raft", func(msg maelstrom.Message) error {
    var body struct {
        Data []byte `json:"data"`
    }
    if err := json.Unmarshal(msg.Body, &body); err != nil {
        log.Printf("failed to unmarshal raft envelope: %v", err)
        return nil
    }

    if err := counter.Sync(context.Background(), body.Data); err != nil {
        log.Printf("failed to sync raft: %v", err)
    }
    return nil
})

// a new method on DistributedCounter
func (dc *DistributedCounter) Sync(ctx context.Context, data []byte) error {
    var msg raftpb.Message
    if err := msg.Unmarshal(data); err != nil {
        return err
    }
    return dc.node.Step(ctx, msg)
}

We can see that whenever a new internal message is received, we eventually call node.Step. That is the method that the library provides to be called whenever new messages arrive.

And with that, our solution is complete.

Running the test

So I ran the same test I ran for Snowflake:

./maelstrom test -w unique-ids \
    --bin ~/go/bin/maelstrom-unique-ids-raft \
    --node-count 5 \
    --time-limit 30 \
    --rate 1000 \
    --availability total \
    --concurrency 10 \
    --nemesis partition

And the test failed. All that work to get a failing test!

But if we think about it and look at the test we’re running, it kind of makes sense. We’re saying we want total availability even with a network partition. However, Raft is not a totally available system. Depending on how the network partition occurs, the system may become unavailable while it tries to elect a new leader. We can either reduce our availability requirement or remove the requirement that it needs to be fault-tolerant to network partitions. The following tests pass:

# availability 0.999
./maelstrom test -w unique-ids \
    --bin ~/go/bin/maelstrom-unique-ids-raft \
    --node-count 5 \
    --time-limit 30 \
    --rate 1000 \
    --availability 0.999 \
    --concurrency 10 \
    --nemesis partition
# no network partition
./maelstrom test -w unique-ids \
    --bin ~/go/bin/maelstrom-unique-ids-raft \
    --node-count 5 \
    --time-limit 30 \
    --rate 1000 \
    --availability total \
    --concurrency 10 

That is evidence of Raft's CP nature (in CAP-theorem terms) and of the trade-off between availability and partition tolerance.

One thing that bugged me was that the application was implemented so that Next should never fail. Remember that ugly retry loop? So why don't we get 100% availability? It turns out maelstrom has a 5-second client timeout: if the client doesn't hear a response within 5s, the request counts as a failure. By default, partitions last 10s in tests, so that makes sense. But even if you set partitions to last 2s, you still don't get 100% availability. My hypothesis is that there's a network partition near the end of the test run, so requests made close to the end never get a response, and those failures are counted.

Comparing with Snowflake

It is interesting to compare the consensus approach with the Snowflake approach. You clearly see the trade-offs of the CAP theorem in action. The Snowflake approach is totally available even with network partitions, which means it must compromise on consistency somewhere. That makes sense: you cannot get a strictly increasing sequence with no gaps (e.g., 1, 2, 3, 4, …) with Snowflake. Snowflake IDs are only roughly sortable, not strictly ordered.

Another comparison is the amount of overhead coordination adds. The table below compares the latency of both approaches.

Metric    Raft      Snowflake
Ops       23,269    27,965
msgs/op   11.2      0.0
Min       0.3 ms    0.1 ms
Mean      3.7 ms    0.4 ms
Median    0.7 ms    0.4 ms
p95       2.4 ms    0.8 ms
p99       5.5 ms    1.1 ms
Max       4,250 ms  10.6 ms

Of course, this is all running locally and doesn't reflect real-world scenarios, but it's interesting to see the latency increase, the effect of the network partition on Max, and the introduction of internal messages reflected in msgs/op.

That was a fun challenge. I hope you have enjoyed reading as much as I have enjoyed writing. And that’s all for this blog post.

#distributed-systems #raft #consensus #unique-id-generation #go #maelstrom