Dumping the Nagle algorithm (by setting TCP_NODELAY) almost always makes sense and should be enabled by default.
I’ll be nice and not attack the feature. But making that the default is one of the biggest mistakes in the history of networking (second only to TCP’s boneheaded congestion control that was designed imagining 56kbit links)
Upgraded our DC switches to new ones around 2014 and needed to keep a few old ones because the new ones didn't support 10Mbit half duplex.
Yeah, many enterprise switches don't even support 100Base-T or 10Base-T anymore. I've had to daisy chain an old switch that supports 100Base-T onto a modern one a few times myself. If you drop 10/100 support, you can also drop HD (simplex) support. In my junk drawer, I still have a few old 10/100 hubs (not switches), which are by definition always HD.
Every modern language has buffers in their stdlib. Anyone writing character at a time to the wire lazily or unintentionally should fix their application.
TCP_NODELAY can also make fingerprinting easier in various ways which is a reason to make it something you have to ask for.
Yes, as I mentioned, it should be kept around for this but off by default. Make it a sysctl param, done.
> TCP_NODELAY can also make fingerprinting easier in various ways which is a reason to make it something you have to ask for
Only because it's on by default for no real reason. I'm saying the default should be off.
A smarter implementation would have been to call it TCP_MAX_DELAY_MS, and have it take an integer value with a well-documented (and reasonably low) default.
"CSMA is no longer necessary on Ethernet today because all modern connections are point-to-point with only two "hosts" per channel."
Ethernet really isn't ptp. You will have a switch at home (perhaps in your router) with more than two ports on it. At layer 1 or 2 how do you mediate your traffic, without CSMA? Take a single switch with n ports on it, where n>2. How do you mediate ethernet traffic without CSMA - its how the actual electrical signals are mediated?
"Ethernet connections have both ends both transmitting and receiving AT THE SAME TIME ON THE SAME WIRES."
That's full duplex as opposed to half duplex.
Nagle's algo has nothing to do with all that messy layer 1/2 stuff but is at the TCP layer and is an attempt to batch small packets into fewer larger ones for a small gain in efficiency. It is one of many optimisations at the TCP layer, such as Jumbo Frames and mini Jumbo Frames and much more.
CSMA/CD is specifically for a shared medium (shared collision domain in Ethernet terminology), putting a switch in it makes every port its own collision domain that are (in practice these days) always point-to-point. Especially for gigabit Ethernet, there was some info in the spec allowing for half-duplex operation with hubs but it was basically abandoned.
As others have said, different mechanisms are used to manage trying to send more data than a switch port can handle but not CSMA (because it's not doing any of it using Carrier Sense, and it's technically not Multiple Access on the individual segment, so CSMA isn't the mechanism being used).
> That's full duplex as opposed to half duplex.
No actually they're talking about something more complex, 100Mbps Ethernet had full duplex with separate transmit and receive pairs, but with 1000Base-T (and 10GBase-T etc.) the four pairs all simultaneously transmit and receive 250 Mbps (to add up to 1Gbps in each direction). Not that it's really relevant to the discussion but it is really cool and much more interesting than just being full duplex.
Usually, full duplex requires two separate channels. The introduction of a hybrid on each end allows the use of the same channel at the same time.
Some progress has been made in doing the same thing with radio links, but it's harder.
Nagle's algorithm is somewhat intertwined with the backoff timer in the sense that it prevents transmitting a packet until some condition is met. IIRC, setting the TCP_NODELAY flag will also disable the backoff timer, at least this is true in the case of TCP/IP over AX25.
Only in the sense that the L1 "peer" is the switch. As soon as the switch goes to forward the packet, if ports 2 and 3 are both sending to port 1 at 1Gbps and port 1 is a 1Gbps port, 2Gbps won't fit and something's got to give.
Ethernet has had the concept of full duplex for several decades and I have no idea what you mean by: "hybrid on each end allows the use of the same channel at the same time."
The physical electrical connections between a series of ethernet network ports (switch or end point - it doesn't matter) are mediated by CSMA.
No idea why you are mentioning radios. That's another medium.
Gigabit (and faster) is able to do full duplex without needing separate wires in each direction. That's the distinction they're making.
> The physical electrical connections between a series of ethernet network ports (switch or end point - it doesn't matter) are mediated by CSMA.
Not in a modern network, where there's no such thing as a wired collision.
> Take a single switch with n ports on it, where n>2. How do you mediate ethernet traffic without CSMA - its how the actual electrical signals are mediated?
Switches are not hubs. Switches have a separate receiver for each port, and each receiver is attached to one sender.
Too many switches will get a PAUSE frame from port X and send it to all the ports that send packets destined for port X. Then those ports stop sending all traffic for a while.
About the only useful thing is if you can see PAUSE counters from your switch, you can tell a host is unhealthy from the switch whereas inbound packet overflows on the host might not be monitored... or whatever is making the host slow to handle packets might also delay monitoring.
It turns out that in my case it wasn't TCP_NODELAY - my backend is written in go, and go sets TCP_NODELAY by default!
But I still found the article - and in particular Nagle's acknowledgement of the issues! - to be interesting.
There's a discussion from two years ago here: https://news.ycombinator.com/item?id=40310896 - but I figured it'd been long enough that others might be interested in giving this a read too.
[0]: https://jvns.ca/blog/2015/11/21/why-you-should-understand-a-...
I mostly use go these days for the backend for my multiplayer games, and in this case there's also some good tooling for terminal rendering and SSH stuff in go, so it's a nice choice.
(my games are often pretty weird, I understand that "high framerate multiplayer game over SSH" is a not a uhhh good idea, that's the point!)
"Golang disables Nagle's Algorithm by default"
oxide and friends episode on it! It's quite good
If userspace applications want to make latency/throughput tradeoffs they can already do that with full awareness and control using their own buffers, which will also often mean fewer syscalls too.
With that said, I'm pretty sure it is a feature of the TCP stack only because the TCP stack is the layer they were trying to solve this problem at, and it isn't clear at all that "unacked data" is particularly better than a timer -- and of course if you actually do want to implement application layer Nagle directly, delayed acks mean that application level acking is a lot less likely to require an extra packet.
BTW, Hardware based TCP offloads engine exists... Don't think they are widely used nowadays though
Widely used in low latency fields like trading
For stuff where no answer is required, Nagel's algorithm works very well for me, but many TCP channels are mixed use these days. They send messages that expect a fast answer and other that are more asynchronous (from a users point of view, not a programmers).
Wouldn't it be nice if all operating systems, (home-)routers, firewalls and programming languages would have high quality implementations of something like SCTP...
The API should have been message oriented from the start. This would avoid having the network stack try to compensate for the behavior of the application layer. Then Nagel’s or something like it would just be a library available for applications that might need it.
The stream API is as annoying on the receiving end especially when wrapping (like TLS) is involved. Basically you have to code your layers as if the underlying network is handing you a byte at a time - and the application has to try to figure out where the message boundaries are - adding a great deal of complexity.
Very well said. I think there is enormous complexity in many layers because we don't have that building block easily available.
TCP_CORK is a rather kludgey alternative.
The same issue exists with file IO. Writing via an in-process buffer (default behavior or stdio and quite a few programming languages) is not interchangeable with unbuffered writes — with a buffer, it’s okay to do many small writes, but you cannot assume that the data will ever actually be written until you flush.
I’m a bit disappointed that Zig’s fancy new IO system pretends that buffered and unbuffered IO are two implementations of the same thing.
Seems like there's been a disconnect between users and kernel developers here?
Nagle’s algorithm is just a special case solution of the generic problem of choosing when and how long to batch. We want to batch because batching usually allows for more efficient batched algorithms, locality, less overhead etc. You do not want to batch because that increases latency, both when collecting enough data to batch and because you need to process the whole batch.
One class of solution is “Work or Time”. You batch up to a certain amount of work or up to a certain amount of time, whichever comes first. You choose your amount of time as your desired worst case latency. You choose your amount of work as your efficient batch size (it should be less than max throughput * latency, otherwise you will always hit your timer first).
Nagle’s algorithm is “Work” being one packet (~1.5 KB) with “Time” being the time until all data gets a ack (you might already see how this degree of dynamism in your timeout might pose a problem already) which results in the fallback timer of 500 ms when delayed ack is on. It should be obvious that is a terrible set of parameters for modern connections. The problem is that Nagle’s algorithm only deals with the “Work” component, but punts on the “Time” component allowing for nonsense like delayed ack helpfully “configuring” your effective “Time” component to a eternity resulting in “stuck” buffers which is what the timeout is supposed to avoid. I will decline to discuss the other aspect which is choosing when to buffer and how much of which Nagle’s algorithm is again a special case.
Delayed ack is, funnily enough, basically the exact same problem but done on the receive side. So both sides set timeouts based on the other side going first which is obviously a recipe for disaster. They both set fixed “Work”, but no fixed “Time” resulting in the situation where both drivers are too polite to go first.
What should be done is use the generic solutions that are parameterized by your system and channel properties which holistically solve these problems which would take too long to describe in depth here.