I was reading about serialization formats the other day and came across the last column of the comparison table, "Supports Zero-copy operations". I had no idea what it meant. Moments before landing on that Wikipedia page, I had been looking into how to serialize a struct in Go without using any specific format, just raw serialization (I don't even know if the term "raw serialization" means anything). While searching for a way to do it, I stumbled upon this Stack Overflow answer:

... if you consent to unsafety and actually need to read struct as bytes, then relying on byte array memory representation might be a bit better than relying on byte slice internal structure.

type Struct struct {
	Src     int32
	Dst     int32
	SrcPort uint16
	DstPort uint16
}

const sz = int(unsafe.Sizeof(Struct{}))

// struct_value is an existing variable of type Struct.
var asByteSlice []byte = (*(*[sz]byte)(unsafe.Pointer(&struct_value)))[:]

Works and provides read-write view into struct, zero-copy.
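To convince myself that this really amounts to "raw serialization", I put together a small self-contained sketch building on the answer above (the Packet name and the file path are my own placeholders): it writes the struct's bytes straight to a file and reads them back into the byte view of another struct value. Keep in mind this only round-trips between programs built for the same architecture, since field padding and endianness are whatever the compiler and CPU chose.

package main

import (
	"fmt"
	"os"
	"unsafe"
)

type Packet struct {
	Src     int32
	Dst     int32
	SrcPort uint16
	DstPort uint16
}

const packetSize = int(unsafe.Sizeof(Packet{}))

// bytesOf reinterprets the struct's memory as a byte slice: a read-write
// view, no copying.
func bytesOf(p *Packet) []byte {
	return (*(*[packetSize]byte)(unsafe.Pointer(p)))[:]
}

func main() {
	in := Packet{Src: 1, Dst: 2, SrcPort: 80, DstPort: 8080}

	// "Serialize": the bytes already exist, just hand them to the OS.
	if err := os.WriteFile("packet.bin", bytesOf(&in), 0o644); err != nil {
		panic(err)
	}

	// "Deserialize": read the file into the byte view of another struct.
	// The copy below is the unavoidable I/O read, not an encode/decode step.
	data, err := os.ReadFile("packet.bin")
	if err != nil {
		panic(err)
	}
	var out Packet
	copy(bytesOf(&out), data)

	fmt.Printf("%+v\n", out) // {Src:1 Dst:2 SrcPort:80 DstPort:8080}
}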

I became intrigued by the term and decided to do a little research. The first results were about the operating-system zero-copy strategy: moving data without redundant copies between kernel space and user space, avoiding the associated context switches. I had no clue this existed and was very excited to learn about it. I learned about the sendfile system call and found an awesome blog post about it by @b0rk. But it seemed to me that this had nothing to do with the zero-copy meant in the serialization context. So I favorited a bunch of web pages about operating-system zero-copy and resumed my research on serialization zero-copy.
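To give that OS-level zero-copy a concrete shape, here is a minimal sketch of my own (the port and file name are placeholders): on Linux, io.Copy from an *os.File into a *net.TCPConn delegates to the connection's ReadFrom, which uses the sendfile system call, so the file's bytes go from the page cache to the socket without passing through user space.

package main

import (
	"io"
	"net"
	"os"
)

// serveFile streams a file to a TCP client. On Linux, io.Copy ends up calling
// sendfile under the hood for this pairing of source and destination, so the
// kernel moves the data directly; our process never touches the bytes.
func serveFile(conn net.Conn, path string) error {
	defer conn.Close()

	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	_, err = io.Copy(conn, f) // sendfile(2) when supported, regular copy otherwise
	return err
}

func main() {
	ln, err := net.Listen("tcp", "localhost:9000")
	if err != nil {
		panic(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			panic(err)
		}
		go serveFile(conn, "bigfile.dat")
	}
}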

For some reason, I could only find discussions about zero-copy tied to the Cap'n Proto serialization protocol. Cap'n Proto is a serialization protocol (after v0.4 it also became an RPC protocol) created by Kenton Varda, who worked on version 2 of Protocol Buffers. According to him, Cap'n Proto is the result of years of experience working on Protobufs, listening to user feedback, and thinking about how things could be done better.

The marvel of Cap'n Proto is that it has no serialization/deserialization cost, thanks to its zero-copy implementation. But what does zero-copy mean? The first good clarification for me came from the comparison of zero-copy protocols to Protocol Buffers in Cap'n Proto, FlatBuffers, and SBE:

Zero-copy

The central thesis of all three competitors is that data should be structured the same way in-memory and on the wire, thus avoiding costly encode/decode steps.

Protobufs represents the old way of thinking.

On the Cap'n Proto home page, it says:

The Cap'n Proto encoding is appropriate both as a data interchange format and an in-memory representation, so once your structure is built, you can simply write the bytes straight out to disk!

Things became a lot clearer to me. It is as if the "serialization" implicitly happens when the object is built, and you already have the bytes in your hands. I found an interesting question on the Cap'n Proto forum: What does zero-copy mean?. The discussion revolves around the fact that zero-copy protocols are not well suited for objects that mutate state, as pointed out in Cap'n Proto, FlatBuffers, and SBE:

Usable as mutable state

Protobuf generated classes have often been (ab)used as a convenient way to store an application’s mutable internal state. There’s mostly no problem with modifying a message gradually over time and then serializing it when needed.

This usage pattern does not work well with any zero-copy serialization format because these formats must use arena-style allocation to make sure the message is built in contiguous memory. Arena allocation has the property that you cannot free any object unless you free the entire arena. Therefore, when objects are discarded, the memory ends up leaked until the message as a whole is destroyed. A long-lived message that is modified many times will thus leak memory.
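To make the arena point concrete, here is a toy bump-pointer allocator in Go (my own sketch, not Cap'n Proto's actual allocator): every object is carved out of one contiguous buffer, there is no per-object free, and replacing an object just leaks its old bytes until the whole arena is reset.

package main

import "fmt"

// Arena is a toy bump-pointer arena: allocations come from one contiguous
// buffer, and the only way to reclaim memory is to reset the whole thing.
type Arena struct {
	buf []byte
	off int
}

func NewArena(size int) *Arena { return &Arena{buf: make([]byte, size)} }

// Alloc hands out the next n bytes. Objects that are later discarded are
// never reclaimed individually; their bytes stay "leaked" inside the arena.
func (a *Arena) Alloc(n int) []byte {
	if a.off+n > len(a.buf) {
		panic("arena: out of space")
	}
	p := a.buf[a.off : a.off+n]
	a.off += n
	return p
}

// Reset frees everything at once by rewinding the bump pointer.
func (a *Arena) Reset() { a.off = 0 }

func main() {
	a := NewArena(64)
	old := a.Alloc(16)         // a field of the "message"
	replacement := a.Alloc(16) // the field is mutated: a new copy is allocated
	_, _ = old, replacement    // old's 16 bytes cannot be freed on their own

	fmt.Println("bytes used:", a.off) // 32: the discarded field still takes space
	a.Reset()                         // only the message as a whole can be freed
}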

So, if it is not recommended to mutate the state of a Cap'n Proto message, doesn't that imply that a copy is needed to perform the mutation in another structure? You can check Kenton's answer in the thread.

I have also come across an interesting and intense discussion on Hacker News about the definition of zero-copy. Most of the discussion was around the usefulness and the trade-offs of the definition. I am just going to highlight Kenton's definition:

Some people use the term "zero-copy" to mean only that when the message contains a string or byte array, the parsed representation of those specific fields will point back into the original message buffer, rather than having to allocate a copy of the bytes at parse time.

Cap'n Proto and FlatBuffers implement a much stronger form of zero-copy. With them, it's not just strings and byte buffers that are zero-copy, it's the entire data structure. With these systems, once you have the bytes of a message mapped into memory, you do not need to do any "parse" step at all before you start using the message.
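Here is a sketch of the weaker form described in the first paragraph, using a toy line-based format of my own (nothing to do with Cap'n Proto's actual encoding): the parser still has to walk the message, but the string field it returns is a sub-slice of the input buffer rather than a copy.

package main

import (
	"bytes"
	"fmt"
)

// Record is the parsed view; Name points into the original message buffer.
type Record struct {
	Name []byte
}

// parse expects a message of the form "name\n<payload>" and returns the name
// without copying its bytes: the "weak" form of zero-copy.
func parse(msg []byte) (Record, error) {
	i := bytes.IndexByte(msg, '\n')
	if i < 0 {
		return Record{}, fmt.Errorf("malformed message")
	}
	return Record{Name: msg[:i]}, nil // shares memory with msg
}

func main() {
	msg := []byte("alice\npayload...")
	rec, _ := parse(msg)
	fmt.Printf("%s\n", rec.Name) // alice

	msg[0] = 'A' // mutating the buffer is visible through the parsed view
	fmt.Printf("%s\n", rec.Name) // Alice
}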

All this made things much clearer to me. But I still wondered about the claim made in that Stack Overflow post: isn't that assignment a copy of the data? The answer I found in Go Slices: usage and internals:

Slicing does not copy the slice's data. It creates a new slice value that points to the original array. This makes slice operations as efficient as manipulating array indices.
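A quick check of my own confirms it: the slice produced by [:] shares its backing array with the original data, which is why the Stack Overflow trick gives a view rather than a copy.

package main

import "fmt"

func main() {
	data := [4]byte{1, 2, 3, 4}
	view := data[:] // new slice header, same underlying array: no data is copied

	view[0] = 42
	fmt.Println(data) // [42 2 3 4]: the write through the slice is visible in the array
}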

I stumbled upon a lot of new things that I will surely explore further, and I learned a lot from this quest. I was recently working on an implementation of database storage using the concept of slotted pages; I was basically doing raw manipulation of bytes and pointers in order to insert a tuple into a table. I wonder if one can make use of a protocol such as Cap'n Proto to make database storage more reliable and faster. That's it for this post.