Posted by chaokunyang 2 days ago
Technical approach: compile-time codegen (no reflection), compact binary protocol with meta-packing, little-endian layout optimized for modern CPUs.
Unique features that other fast serializers don't have:
- Cross-language without IDL files (Rust ↔ Python/Java/Go)
- Trait object serialization (Box<dyn Trait>)
- Automatic circular reference handling
- Schema evolution without coordination
Happy to discuss design trade-offs.
Benchmarks: https://fory.apache.org/docs/benchmarks/rust
Fory's format was designed from the ground up to handle those cases efficiently, while still enabling cross-language compatibility and schema evolution.
https://github.com/apache/fory/blob/fd1d53bd0fbbc5e0ce6d53ef...
It seems that if the serialization object is not a "Fory" struct, it is forced to go through a to/from conversion as part of the measured serialization work:
https://github.com/apache/fory/blob/fd1d53bd0fbbc5e0ce6d53ef...
This to/from conversion work includes cloning Strings:
https://github.com/apache/fory/blob/fd1d53bd0fbbc5e0ce6d53ef...
and reallocating growing arrays with collect:
https://github.com/apache/fory/blob/fd1d53bd0fbbc5e0ce6d53ef...
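To illustrate the pattern (hypothetical `Order`/`OrderProto` types, not the actual bench code): the conversion clones every String and allocates a fresh Vec before a single byte is serialized.

```rust
// Hypothetical domain type and generated message type, for illustration.
struct Order {
    id: String,
    item_ids: Vec<u64>,
}

struct OrderProto {
    id: String,
    item_ids: Vec<u64>,
}

// The boundary conversion that ends up inside the measured loop:
fn to_proto(order: &Order) -> OrderProto {
    OrderProto {
        id: order.id.clone(),                               // heap-allocating String clone
        item_ids: order.item_ids.iter().copied().collect(), // fresh Vec allocation
    }
}
```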
I'd think the to/from conversion into Fory types shouldn't be part of the tests.
Also, when used in an actual system, tonic would provide an 8 KB buffer to write into, not just a Vec::default() that may need to be resized multiple times:
https://github.com/hyperium/tonic/blob/147c94cd661c0015af2e5...
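With prost, the difference is roughly this (the helper name is mine; the 8 KB figure mirrors tonic's default):

```rust
use prost::Message;

// Pre-sizing the output buffer the way tonic does avoids the repeated
// grow-and-copy cycles an empty Vec::default() goes through while encoding.
fn encode_presized<M: Message>(msg: &M) -> Vec<u8> {
    let mut buf = Vec::with_capacity(8 * 1024); // one up-front allocation
    msg.encode(&mut buf)
        .expect("encoding into a Vec cannot fail for lack of capacity");
    buf
}
```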
I can see the source of a 10x improvement on an Intel(R) Xeon(R) Gold 6136 CPU @ 3.00GHz, but it drops to a 3x improvement when I remove the to/from conversion that clones Strings or collects Vecs, and always allocate an 8K Vec instead of a ::default() for the writable buffer.
If anything, the benches should be updated in a tower service / codec generics style, where other formats like protobuf do not use any Fory-related code at all.
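Something like this shape (a sketch of the idea, not existing bench code) would keep each format on its own types:

```rust
// Hypothetical harness trait: each format implements it over its own
// native message type, so the protobuf path never touches Fory types and
// vice versa. Any domain-model conversion becomes a separate, explicitly
// measured step instead of hidden serializer overhead.
trait BenchCodec {
    type Msg;
    fn serialize(&self, msg: &Self::Msg, buf: &mut Vec<u8>);
    fn deserialize(&self, buf: &[u8]) -> Self::Msg;
}
```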
Note also that Fory has some writer pool that is utilized during the tests:
https://github.com/apache/fory/blob/fd1d53bd0fbbc5e0ce6d53ef...
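For context, a generic sketch of what buffer pooling buys (not Fory's actual implementation): allocation drops out of the measured loop, an advantage the other formats in the bench aren't given.

```rust
use std::cell::RefCell;

thread_local! {
    // One reusable buffer per thread; its capacity survives across calls.
    static WRITER: RefCell<Vec<u8>> = RefCell::new(Vec::with_capacity(8 * 1024));
}

// Run an encode closure against the pooled buffer: clear() drops the
// contents but keeps the allocation, so steady-state iterations never
// touch the allocator.
fn with_pooled_writer<R>(encode: impl FnOnce(&mut Vec<u8>) -> R) -> R {
    WRITER.with(|w| {
        let mut buf = w.borrow_mut();
        buf.clear();
        encode(&mut buf)
    })
}
```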
Original bench selection for Fory:
Benchmarking ecommerce_data/fory_serialize/medium: Collecting 100 samples in estimated 5.0494 s (197k it
ecommerce_data/fory_serialize/medium
time: [25.373 µs 25.605 µs 25.916 µs]
change: [-2.0973% -0.9263% +0.2852%] (p = 0.15 > 0.05)
No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) high mild
2 (2.00%) high severe
Compared to the original bench for Protobuf/Prost:
Benchmarking ecommerce_data/protobuf_serialize/medium: Collecting 100 samples in estimated 5.0419 s (20k
ecommerce_data/protobuf_serialize/medium
time: [248.85 µs 251.04 µs 253.86 µs]
Found 18 outliers among 100 measurements (18.00%)
8 (8.00%) high mild
10 (10.00%) high severe
However, after allocating 8K instead of ::default() and removing the to/from conversion, the updated protobuf bench shows:
fair_ecommerce_data/protobuf_serialize/medium
time: [73.114 µs 73.885 µs 74.911 µs]
change: [-1.8410% -0.6702% +0.5190%] (p = 0.30 > 0.05)
No change in performance detected.
Found 14 outliers among 100 measurements (14.00%)
2 (2.00%) high mild
12 (12.00%) high severe
Protobuf is very much a DOP (data‑oriented programming) approach — which is great for some systems. But in many complex applications, especially those using polymorphism, teams don't want to couple Protobuf‑generated message structs directly into their domain models. Generated types are harder to extend, and if you embed them everywhere (fields, parameters, return types), switching to another serialization framework later becomes almost impossible without touching huge parts of the codebase.
In large systems, it’s common to define independent domain model structs used throughout the codebase, and only convert to/from the Protobuf messages at the serialization boundary. That conversion step is exactly what’s represented in our benchmarks — because it’s what happens in many real deployments.
There’s also the type‑system gap: for example, if your Rust struct has a Box<dyn Trait> field, representing that cleanly in Protobuf is tricky. You might fall back to a oneof, but that essentially generates an enum variant, which often isn’t what users actually want for polymorphic behavior.
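To make the gap concrete (illustrative types only):

```rust
// Open polymorphism in Rust: any type implementing the trait can appear,
// including types defined in downstream crates.
trait Shape {
    fn area(&self) -> f64;
}

struct Document {
    shapes: Vec<Box<dyn Shape>>,
}

// The closest Protobuf equivalent is a closed oneof:
//
//   message Shape {
//     oneof kind {
//       Circle circle = 1;
//       Rect   rect   = 2;  // every variant must be known up front
//     }
//   }
//
// which is an enum in disguise: adding a new shape means editing the
// .proto and regenerating, rather than just implementing the trait.
```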
So, yes — we include the conversion in our measurements intentionally, to reflect real‑world practice in large systems.
So, to reflect real-world practice, shouldn't the benchmark code then allocate and give the protobuf serializer an 8K Vec like tonic does, rather than an empty one that may require multiple re-allocations?
In my experience, while starting from a language and deriving the serialization often feels more ergonomic at the start (e.g. RPC style), it hides too much of what's going on from users, and over time it suffers greatly from programming language / runtime changes - the latter multiplied by the number of languages or frameworks supported.
The way I think about it is:
- Single‑language projects often work best without an IDL — it keeps things simple and avoids extra steps.
- Two languages — both IDL and no‑IDL approaches can work, depending on the team's habits.
- Three or more — an IDL can be really useful as a single source of truth and to avoid manually writing struct definitions in every language.
For Apache Fory, my plan is to add optional IDL support, so teams who want that “single truth” can generate definitions automatically, and others can continue with language‑first development. My hope is to give people flexibility to choose what fits their situation best.
Otherwise, the schema seems to be derived from the class being serialized in typed languages, or otherwise annotated in code. The serializer and deserializer code must be manually written to be compatible, instead of both sides being codegen'd to match from a schema file. Here's the example I found for Python: https://fory.apache.org/docs/docs/guide/python_serialization...
When running in compatible mode, Fory automatically derives a compact schema from those definitions at runtime and sends it along to peers with the first serialization. That way, both sides know the structure without needing a separate schema file.
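Conceptually it looks like this (a simplified sketch; the real wire format is more compact):

```rust
// Simplified sketch of "the schema travels with the data"; this is not
// Fory's actual wire layout.
enum Frame {
    // First message to a given peer: schema description plus payload.
    WithSchema {
        fingerprint: u64, // identifies this schema version
        schema: Vec<u8>,  // compact schema derived at runtime
        payload: Vec<u8>,
    },
    // Later messages: the peer has cached the schema, so only the
    // fingerprint and payload are sent.
    SchemaRef {
        fingerprint: u64,
        payload: Vec<u8>,
    },
}
```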
The idea is to make cross‑language exchange work out‑of‑the‑box, while still allowing teams to add an explicit IDL later if they want a single source of truth.
It's not clear to me how to achieve the same with Fory.
But once you’re dealing with three or more languages, I agree an IDL becomes valuable as a single source of truth. That’s work we’ve started: adding optional IDL support so teams can generate data structures in each language from one shared definition.
It'd be helpful to see a plot of serialization costs vs data size. If you only display serialization TPS, you're always going to lose to the "do nothing" option of just writing your C structs directly to the wire, which is essentially zero cost.
| data type | data size | fory | protobuf |
| --------------- | --------- | ------- | -------- |
| simple-struct | small | 21 | 19 |
| simple-struct | medium | 70 | 66 |
| simple-struct | large | 220 | 216 |
| simple-list | small | 36 | 16 |
| simple-list | medium | 802 | 543 |
| simple-list | large | 14512 | 12876 |
| simple-map | small | 33 | 36 |
| simple-map | medium | 795 | 1182 |
| simple-map | large | 17893 | 21746 |
| person | small | 122 | 118 |
| person | medium | 873 | 948 |
| person | large | 7531 | 7865 |
| company | small | 191 | 182 |
| company | medium | 9118 | 9950 |
| company | large | 748105 | 782485 |
| e-commerce-data | small | 750 | 737 |
| e-commerce-data | medium | 53275 | 58025 |
| e-commerce-data | large | 1079358 | 1166878 |
| system-data | small | 311 | 315 |
| system-data | medium | 24301 | 26161 |
| system-data | large | 450031 | 479988 |
https://github.com/apache/fory/blob/fd1d53bd0fbbc5e0ce6d53ef...
I’m curious though: what’s an example scenario you’ve seen that requires so many distinct types? I haven’t personally come across a case with 4,096+ protocol messages defined.
git clone https://github.com/googleapis/googleapis.git
cd googleapis
find . -name '*.proto' -and -not -name '*test*' -and -not -name '*example*' -exec grep '^message' {} \; | wc -l
I think this speaks more to the tradeoff of not having an IDL, where the deserializer only knows what type to expect if it was built with the IDL file version that defined it; e.g., this recent issue: https://github.com/apache/fory/issues/2818
But now I do see that the 4096 is just arbitrary:
If schema consistent mode is enabled globally when creating fory, type meta will be written as a fory unsigned varint of type_id. Schema evolution related meta will be ignored.
Have we learned nothing? Endian swap on platforms that need it is faster than conditionals, and simpler.
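In Rust terms (a sketch):

```rust
// Fixed little-endian with a swap where needed: a plain load on LE CPUs,
// one byte-swap instruction on BE CPUs, and no branch either way.
fn read_u32_le(bytes: &[u8; 4]) -> u32 {
    u32::from_le_bytes(*bytes)
}

// The alternative being criticized: carrying an endianness flag and
// branching on it at runtime for every field read.
fn read_u32_flagged(bytes: &[u8; 4], wire_is_le: bool) -> u32 {
    let native = u32::from_ne_bytes(*bytes);
    if wire_is_le == cfg!(target_endian = "little") {
        native
    } else {
        native.swap_bytes()
    }
}
```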
You can browse https://fory.apache.org/docs/, but I didn't find any benchmarks directory.