
Posted by b-man 9/5/2025

Protobuffers Are Wrong (2018) (reasonablypolymorphic.com)
244 points | 307 comments
jeffbee 9/5/2025|
Type system fans are so irritating. The author doesn't engage with the point of protocol buffers, which is that they are thin adapters between the union of things that common languages can represent with their type systems and a reasonably efficient marshaling scheme that can be compact on the wire.
cryptonector 9/6/2025||
I've written several screeds in the comments here on HN about protobufs being terrible over the past few years. Basically the creators of PB ignored ASN.1 and built a bad version of mid-1980s ASN.1 and DER.

Tag-length-value (TLV) encodings are just overly verbose for no good reason. They are _NOT_ "self-describing", and one does not need everything tagged to support extensibility. Even where one does need tags, tag assignments can be fully automatic and need not be exposed to the module designer. Anyone with a modicum of time spent researching how ASN.1 handles extensibility with non-TLV encoding rules knows these things. The entire arc of ASN.1's evolution over two plus decades was all about extensibility and non-TLV encoding rules!

And yes, ASN.1 started with the same premise as PB, but 40 years ago. Thus it's terribly egregious that PB's designers did not learn any lessons at all from ASN.1!

Near as I can tell, PB's designers thought they knew about encodings but didn't, and they refused to look at ASN.1 and such because of the lack of tooling for ASN.1; but of course there was even less tooling for PB, since PB didn't exist yet.

It's all exasperating.

dinobones 9/5/2025||
lols, the weird protobuf initialization semantics have caused so many OMGs. Even on my team they led to various hard-to-debug bugs.

It's a lesson most people learn the hard way after using PBs for a few months.

sylware 9/6/2025||
I don't recall exactly (I've shelved my mapping projects for the moment), but isn't OpenStreetMap's core data distribution format based on protobuffers?
mkl95 9/5/2025||
If you mostly write software in Go you'll likely enjoy working with protocol buffers. If you use the Python or Ruby wrappers you'll wish you had picked another tech.
jonathrg 9/5/2025|
The generated types in go are horrible to work with. You can't store instances of them anywhere, or pass them by value, because they contain a bunch of state and pointers (including a [0]sync.Mutex just to explicitly prohibit copying). So you have to pass around pointers at all times, making ownership and lifetime much more complicated than it needs to be. A message definition like this

    message Example {
        sint32 Value1 = 1;
        double Value2 = 2;
    }
becomes

    type Example struct {
        state                    protoimpl.MessageState 
        xxx_hidden_Value1        int32                  
        xxx_hidden_Value2        float64                  
        xxx_hidden_unknownFields protoimpl.UnknownFields
        sizeCache                protoimpl.SizeCache
    }
For [place of work], where we use protobuf, I ended up making a plugin to generate structs that don't do any of this nonsense (essentially automating Option 1 in the article):

    type ExamplePOD struct {
        Value1 int32
        Value2 float64
    }
with converters between the two versions.
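A minimal sketch of what such converters can look like. The `Example` stand-in type, its getters, and the `FromProto`/`ToProto` names are hypothetical (the real generated type carries `protoimpl` state and unexported fields, so it can't be reproduced standalone); only `ExamplePOD` matches the comment above.

```go
package main

import "fmt"

// Example is a stand-in for a protoc-generated message (hypothetical;
// the real one embeds protoimpl.MessageState, a size cache, etc.).
type Example struct {
	value1 int32
	value2 float64
}

func (e *Example) GetValue1() int32   { return e.value1 }
func (e *Example) GetValue2() float64 { return e.value2 }

// ExamplePOD is the plain-old-data mirror: safe to copy, embed,
// and pass by value.
type ExamplePOD struct {
	Value1 int32
	Value2 float64
}

// FromProto copies the generated message into the POD form.
func FromProto(e *Example) ExamplePOD {
	return ExamplePOD{Value1: e.GetValue1(), Value2: e.GetValue2()}
}

// ToProto rebuilds a generated-style message from the POD form.
func (p ExamplePOD) ToProto() *Example {
	return &Example{value1: p.Value1, value2: p.Value2}
}

func main() {
	pod := FromProto(&Example{value1: 7, value2: 2.5})
	fmt.Println(pod) // {7 2.5}
}
```

The POD type keeps ownership simple at module boundaries, and the proto type only appears at the (de)serialization edge.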
shdh 9/5/2025||
I just wish protobuf had proper delta compression out of the box
cenamus 9/6/2025||
I really liked the typography/layout of the page, reminds me of gwern.net. But people will probably complain about serif fonts regardless
BobbyTables2 9/6/2025||
Even the low level implementation of protobuffers is pretty uninspiring.

It adds a lot of space overhead, especially for structs only used once, yet it's not self-describing either.

Doesn’t solve a lot of the problems related to schema changes either.

Quite frankly, too many are caught up in it because it came from Google and is supposed to be some sort of divinely inspired thing.

JSON, ASN.1, and even rigid C structs start to look a lot better.

co_dh 9/6/2025|
But why do you need serialization? Because the data structure on disk is not the same as in memory. Arthur Whitney's k/q/kdb+ solved this problem by making them the same. An array has the same format in memory and on disk, so there is no serialization, and even better, you can mmap files into memory, so you don't need cache!
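The same-layout-on-disk-and-in-memory idea can be sketched in Go (illustrative, Unix-only via `syscall.Mmap`; the `writeAndMmap` helper is an assumption, not kdb+ code): a fixed-width array is dumped in its raw native layout, then the file is mmapped and the bytes are simply reinterpreted, with no serialization step in either direction.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
	"unsafe"
)

// writeAndMmap dumps vals to a temp file in its raw in-memory layout
// (native endianness), then mmaps the file back and reinterprets the
// mapped bytes as []int64: no encode/decode step in either direction.
func writeAndMmap(vals []int64) []int64 {
	raw := unsafe.Slice((*byte)(unsafe.Pointer(&vals[0])), len(vals)*8)

	f, err := os.CreateTemp("", "arr")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	f.Write(raw)
	f.Close()

	r, err := os.Open(f.Name())
	if err != nil {
		panic(err)
	}
	defer r.Close()
	data, err := syscall.Mmap(int(r.Fd()), 0, len(vals)*8,
		syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	// The mapped bytes ARE the array; just reinterpret them.
	return unsafe.Slice((*int64)(unsafe.Pointer(&data[0])), len(vals))
}

func main() {
	fmt.Println(writeAndMmap([]int64{1, 2, 3, 4})) // [1 2 3 4]
}
```

The trade-off is that the format is only "free" between machines that share endianness and word size, which is exactly the portability problem serialization formats exist to solve.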

He also removed the ability to define a structure, forcing you to use a dictionary (structure) of arrays instead of an array of structures.

RossBencina 9/6/2025||
Forget on-disk. Different CPUs represent basic data types with different in-memory representations (endianness). Furthermore, different CPUs have different capabilities with respect to how data must be aligned in memory in order to read or write it (aligned/unaligned access); at least historically, unaligned access could fault your process. Then there's the problem, which you allude to, that different programming languages use different data layouts (often non-standardised ones). If you want communication within a system comprising heterogeneous CPUs and/or languages, you need to standardise a wire format and/or provide a translation layer, aka serialisation.
throwaway127482 9/6/2025|||
> But why do you need serialization? Because the data structure on disk is not the same as in memory.

Not always - in browser applications for example, there is no way to directly access the disk, nevermind mmap().
