Dumping the Nagle algorithm (by setting TCP_NODELAY) almost always makes sense and should be enabled by default.
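For anyone who hasn't done it before, here is a minimal Python sketch of what "setting TCP_NODELAY" means in practice. The helper names make_nodelay_socket and nodelay_enabled are made up for illustration; the setsockopt/getsockopt calls are the standard socket API.

```python
import socket


def make_nodelay_socket() -> socket.socket:
    """Create an unconnected TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY = 1 turns Nagle's algorithm off for this socket only.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def nodelay_enabled(sock: socket.socket) -> bool:
    """Report whether TCP_NODELAY is currently set on the socket."""
    # The kernel returns a non-zero value when the option is on.
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
```

A freshly created TCP socket normally has the option off, which is the default being argued about here.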
I’ll be nice and not attack the feature. But making that the default is one of the biggest mistakes in the history of networking (second only to TCP’s boneheaded congestion control that was designed imagining 56kbit links).
Unless you have some kind of special circumstance you can leverage it's hard to beat TCP. You would not be the first to try.
The fundamental congestion control issue is that after you drop to half, the window is increased by /one packet/ per round trip, which for all sorts of artificial reasons is about 1500 bytes. This means the performance gets worse and worse the greater the bandwidth-delay product (which has increased by many orders of magnitude). Not to mention head-of-line blocking etc.
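To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes classic Reno-style additive increase (after being halved, the window regrows by one 1500-byte packet per round trip) and ignores slow start, CUBIC, BBR, and everything else real stacks do, so treat it as an illustration of the scaling, not a measurement. The function name and the example link parameters are my own.

```python
def recovery_time_seconds(bandwidth_bps: float, rtt_s: float,
                          packet_bytes: int = 1500) -> float:
    """Time for one-packet-per-RTT growth to regain a halved window."""
    # Bandwidth-delay product: bytes that must be in flight to fill the pipe.
    bdp_bytes = bandwidth_bps / 8 * rtt_s
    window_packets = bdp_bytes / packet_bytes
    # The window was cut in half, and each RTT wins back one packet.
    rtts_needed = window_packets / 2
    return rtts_needed * rtt_s
```

For a 56 kbit/s link with a 100 ms RTT this comes out well under a second. For a 10 Gbit/s link with the same RTT, recovery_time_seconds(10e9, 0.1) is about 4167 seconds, i.e. roughly 69 minutes to climb back after a single loss.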
The reason for QUIC's silent success was the brilliant move of sidestepping the political quagmire around TCP congestion control, so they could solve the problems in peace.
QUIC is real and works great, and they sidestepped all of that and just built it and tuned it, and it has basically won. As for QUIC "sending more parts of the page in parallel", yes, that's what I referred to re head-of-line blocking in TCP.
Unlike TLS over TCP, QUIC is still not able to be offloaded to NICs. And most stacks are in userspace. So it is horrifically expensive in terms of watts/byte or cycles/byte sent for a CDN workload (something like 8x as expensive the last time I looked), and it's primarily used and advocated for by people who have metrics for latency, but not server side costs.
That's not quite true. You can offload QUIC connection steering just fine, as long as your NICs can do hardware encryption. It's actually _easier_ because you can never get a QUIC datagram split across multiple physical packets (barring the IP-level fragmentation).
The only real difference from TCP is the encryption for ACKs.
Some NICs, like Broadcom's newer ones, support crypto offloads, but this is not enough to be competitive with TCP / TLS. Especially since support for those offloads is not in any mainline kernel in Linux or BSD.
What would you change here?
Upgraded our DC switches to new ones around 2014 and needed to keep a few old ones because the new ones didn't support 10Mbit half duplex.
One co-op job at a manufacturing plant I worked at ~20 years ago involved replacing the backend core networking equipment with more modern ethernet kit, but we had to set up media converters (in that case token ring to ethernet) as close as possible to the manufacturing equipment (so that token ring only ran between the equipment and the media converter for a few meters at most).
They were "lucky" in that:
1) the networking protocol that was supported by the manufacturing equipment was IPX/SPX, so at least that worked cleanly on ethernet and newer upstream control software running on an OS (HP-UX at the time)
2) there were no lives at stake (eg nuclear safety/hospital), so they had minimal regulatory issues.
Was an old ISP/mobile carrier, so you could find all kinds of old stuff. Even the first SMSC from the 80s (also DEC, 386 or similar CPU?) was still in its racks because they didn't need the rack space, as 2 modern racks used up all the power for that room. It was also far down in a mountain, so removing equipment was annoying.
Old CNC equipment.
Older Zebra label printers.
Some older Motorola radio stuff.
That SGI Indy we keep around for Jurassic Park jokes.
The LaserJet 5 that's still going after 30 years or something.
Some modern embedded stuff that does not have enough chooch to deal with 100mbit.
Yeah, many enterprise switches don't even support 100Base-T or 10Base-T anymore. I've had to daisy chain an old switch that supports 100Base-T onto a modern one a few times myself. If you drop 10/100 support, you can also drop HD (half-duplex) support. In my junk drawer, I still have a few old 10/100 hubs (not switches), which are by definition always HD.
(My other post in this thread mentions it.) https://news.ycombinator.com/item?id=46360209#46361580
Every modern language has buffers in their stdlib. Anyone writing character at a time to the wire lazily or unintentionally should fix their application.
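As a sketch of what "use the buffers in your stdlib" can look like in Python: wrap the socket in a buffered writer, do as many tiny writes as you like, and flush once. The function name send_buffered and the 64 KiB buffer size are arbitrary choices of mine, not anything prescribed.

```python
import socket


def send_buffered(sock: socket.socket, chunks) -> None:
    """Write many small chunks through a userspace buffer, then flush once."""
    # makefile("wb", ...) returns a buffered writer on top of the socket.
    # Closing that wrapper at the end of the block does not close the socket.
    with sock.makefile("wb", buffering=64 * 1024) as out:
        for chunk in chunks:
            out.write(chunk)  # lands in the buffer, no send() syscall yet
        out.flush()           # the buffered bytes are handed to the kernel here
```

With this, writing b"h", b"e", b"l", b"l", b"o" one byte at a time reaches the kernel as a single 5-byte write, so neither Nagle nor TCP_NODELAY has much to do.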
TCP_NODELAY can also make fingerprinting easier in various ways which is a reason to make it something you have to ask for.
Yes, as I mentioned, it should be kept around for this but off by default. Make it a sysctl param, done.
> TCP_NODELAY can also make fingerprinting easier in various ways which is a reason to make it something you have to ask for
Only because it's on by default for no real reason. I'm saying the default should be off.
> Only because it's on by default for no real reason. I'm saying the default should be off.
This is wrong.
I'm assuming here that you mean that Nagle's algorithm is on by default, i.e. TCP_NODELAY is off by default. It seems you think the only extra fingerprinting info TCP_NODELAY gives you is the single bit "TCP_NODELAY is on vs off". But it's more than that.
In a world where every application's traffic goes through Nagle's algorithm, lots of applications will just be seen to transmit a packet every 300ms or whatever as their transmissions are buffered up by the kernel to be sent in large packets. In a world where Nagle's algorithm is off by default, those applications could have very different packet sizes and timings.
With something like Telnet or SSH, you might even be able to detect who exactly is typing at the keyboard by analyzing their key press rhythm!
To be clear, this is not an argument in favor of Nagle's algorithm being on by default. I'm relatively neutral on that matter.
Correct, I wrote that backwards, good callout.
RE: fingerprinting, I'd concede the point in a sufficiently lazy implementation. I'd fully expect the application layer to handle this, especially in cases where this matters.
Applications also don't know the MTU (the size of packets) on the interface they're using. Hell, they probably don't even know which interface they're using! This is all abstracted away. So, if you're on a network with a 14xx MTU (such as a VPN), assuming an MTU of 1500 means you'll send one full packet and then a tiny little packet after that. For every one packet you think you're sending!
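The arithmetic behind that "one full packet and then a tiny little packet" can be sketched like this. It assumes 40 bytes of headers per packet (20 for IPv4 plus 20 for TCP, no options), which is a simplification, and the function name is mine.

```python
from typing import List


def segment_sizes(payload_bytes: int, mtu: int,
                  header_bytes: int = 40) -> List[int]:
    """Split one application write into TCP payload sizes for a given MTU."""
    mss = mtu - header_bytes  # largest payload that fits in one packet
    full, rest = divmod(payload_bytes, mss)
    return [mss] * full + ([rest] if rest else [])
```

An application that assumes a 1500 MTU and carefully writes 1460 bytes gets segment_sizes(1460, 1500) == [1460] on plain Ethernet, but segment_sizes(1460, 1420) == [1380, 80] on a 1420-MTU VPN: one full packet followed by an 80-byte straggler, every time.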
Nagle's algorithm lets you just send data; no problem. Let the kernel batch up packets. If you control the protocol, just use a design that prevents Delayed ACK from causing the latency. E.g., the "OK" from Redis.
If we need them, and they’re not being maintained, then maybe that’s the kind of “scream test” wake up we need for them to either be properly deprecated, or updated.
Given how often issues can be traced back to open source projects barely scraping along? Yes and they are probably doing something important. Hell, if you create enough pointless busywork you can probably get a few more "helpful" hackers into projects like xz.
> If nobody is maintaining them, do we really need them?
Software can have value even when not maintained.
You used AI to write this didn't you? Your sentence structure is not just tedious - it's a dead give-away.
A smarter implementation would have been to call it TCP_MAX_DELAY_MS, and have it take an integer value with a well-documented (and reasonably low) default.
I was testing some low-bandwidth voice chat code using two unloaded PCs sitting on the same desk. I nearly jumped out of my skin when "HELLO, HELLO?" came through a few seconds late, at high volume, after I had already concluded it wasn't working. After ruling out latency on the audio side, TCP_NODELAY solved the problem.
All respect to Animats, but whoever thought this should be the default behavior of TCP/IP had rocks in their head, and/or were solving a problem that had a better solution that they just didn't think of at the time.
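TCP_MAX_DELAY_MS is not a real socket option, but the idea is easy to approximate in the application. Below is a rough sketch of a coalescing buffer that never holds data longer than a fixed delay. Everything here is hypothetical: the class name, the 20 ms and 1460-byte defaults, and the send callable (anything that accepts bytes, such as sock.sendall). The clock is injectable only so the behaviour can be checked without sleeping.

```python
import time


class MaxDelayBuffer:
    """Coalesce small writes, but never hold data longer than max_delay_ms."""

    def __init__(self, send, max_delay_ms=20, max_bytes=1460,
                 clock=time.monotonic):
        self._send = send                    # callable taking bytes
        self._max_delay = max_delay_ms / 1000.0
        self._max_bytes = max_bytes
        self._clock = clock
        self._buf = bytearray()
        self._oldest = None                  # time the oldest pending byte arrived

    def write(self, data: bytes) -> None:
        if not self._buf:
            self._oldest = self._clock()
        self._buf += data
        if len(self._buf) >= self._max_bytes:
            self.flush()                     # enough for a full packet, send now

    def poll(self) -> None:
        """Call periodically (e.g. from an event loop) to flush overdue data."""
        if self._buf and self._clock() - self._oldest >= self._max_delay:
            self.flush()

    def flush(self) -> None:
        if self._buf:
            self._send(bytes(self._buf))
            self._buf.clear()
```

Used with TCP_NODELAY set on the underlying socket, this gives a bounded, documented delay instead of one that depends on when the peer's ACK happens to arrive.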
I would even argue that NODELAY for a VoIP solution makes no sense - why are you even using TCP instead of UDP in the first place?
Send exactly one 205 byte packet. How do you really know? I can see it go out on a scope. And the other end receives a packet with bytes 0-56. Then another packet with bytes 142-204. Finally a packet 200ms later with bytes 57-141.
FfffFFFFffff You!
However, malicious middleboxes insert themselves into your TCP connections, terminating a separate TCP connection on each side of the spyware and therefore completely rewriting TCP segment boundaries.
In less common scenarios, the same may be done by non malicious middleboxes - but it's almost always malicious ones. The party that attacked xmpp.is/jabber.ru terminated not only TCP but also TLS and issued itself a Let's Encrypt certificate.
The same is true of those who do understand it.
"CSMA is no longer necessary on Ethernet today because all modern connections are point-to-point with only two "hosts" per channel."
Ethernet really isn't ptp. You will have a switch at home (perhaps in your router) with more than two ports on it. At layer 1 or 2 how do you mediate your traffic, without CSMA? Take a single switch with n ports on it, where n>2. How do you mediate ethernet traffic without CSMA - it's how the actual electrical signals are mediated?
"Ethernet connections have both ends both transmitting and receiving AT THE SAME TIME ON THE SAME WIRES."
That's full duplex as opposed to half duplex.
Nagle's algo has nothing to do with all that messy layer 1/2 stuff but is at the TCP layer and is an attempt to batch small packets into fewer larger ones for a small gain in efficiency. It is one of many optimisations at the TCP layer, such as Jumbo Frames and mini Jumbo Frames and much more.
CSMA/CD is specifically for a shared medium (shared collision domain in Ethernet terminology), putting a switch in it makes every port its own collision domain that are (in practice these days) always point-to-point. Especially for gigabit Ethernet, there was some info in the spec allowing for half-duplex operation with hubs but it was basically abandoned.
As others have said, different mechanisms are used to manage trying to send more data than a switch port can handle but not CSMA (because it's not doing any of it using Carrier Sense, and it's technically not Multiple Access on the individual segment, so CSMA isn't the mechanism being used).
> That's full duplex as opposed to half duplex.
No actually they're talking about something more complex, 100Mbps Ethernet had full duplex with separate transmit and receive pairs, but with 1000Base-T (and 10GBase-T etc.) the four pairs all simultaneously transmit and receive 250 Mbps (to add up to 1Gbps in each direction). Not that it's really relevant to the discussion but it is really cool and much more interesting than just being full duplex.
Usually, full duplex requires two separate channels. The introduction of a hybrid on each end allows the use of the same channel at the same time.
Some progress has been made in doing the same thing with radio links, but it's harder.
Nagle's algorithm is somewhat intertwined with the backoff timer in the sense that it prevents transmitting a packet until some condition is met. IIRC, setting the TCP_NODELAY flag will also disable the backoff timer; at least this is true in the case of TCP/IP over AX.25.
Only in the sense that the L1 "peer" is the switch. As soon as the switch goes to forward the packet, if ports 2 and 3 are both sending to port 1 at 1Gbps and port 1 is a 1Gbps port, 2Gbps won't fit and something's got to give.
Ethernet has had the concept of full duplex for several decades and I have no idea what you mean by: "hybrid on each end allows the use of the same channel at the same time."
The physical electrical connections between a series of ethernet network ports (switch or end point - it doesn't matter) are mediated by CSMA.
No idea why you are mentioning radios. That's another medium.
this is absolutely hilarious.
Admittedly, I’m no networking expert but it was my understanding that most installs now use switches almost exclusively. Are you suggesting otherwise?
A quick search would seem to indicate I’m right. Do you mind elaborating on your snark?
Gigabit (and faster) is able to do full duplex without needing separate wires in each direction. That's the distinction they're making.
> The physical electrical connections between a series of ethernet network ports (switch or end point - it doesn't matter) are mediated by CSMA.
Not in a modern network, where there's no such thing as a wired collision.
> Take a single switch with n ports on it, where n>2. How do you mediate ethernet traffic without CSMA - it's how the actual electrical signals are mediated?
Switches are not hubs. Switches have a separate receiver for each port, and each receiver is attached to one sender.
Too many switches will get a PAUSE frame from port X and send it to all the ports that send packets destined for port X. Then those ports stop sending all traffic for a while.
About the only useful thing is if you can see PAUSE counters from your switch, you can tell a host is unhealthy from the switch whereas inbound packet overflows on the host might not be monitored... or whatever is making the host slow to handle packets might also delay monitoring.
Things like back pressure and flow control are very powerful systems concepts, but intrinsically need there to be an identifiable flow to control! Our systems abstractions that multiplex and obfuscate flows are going to be unable to differentiate which application flow is the one that needs back pressure, and paint too-wide brush.
In my view, the fundamental problem is we're all trying to "have our cake and eat it". We expect our network core to be unaware of the edge device and application goals. We expect to be able to saturate an imaginary channel between two edge devices without any prearrangement, as if we're the only network users. We also expect our sparse and async background traffic to somehow get through promptly. We expect fault tolerance and graceful degradation. We expect fairness.
We don't really define or agree what is saturation, what is prompt, what is graceful, or what is fair... I think we often have selfish answers to these questions, and this yields a tragedy of the commons.
At the same time, we have so many layers of abstraction where useful flow information is effectively hidden from the layers beneath. That is even before you consider adversarial situations where the application is trying to confuse the issue.
It turns out that in my case it wasn't TCP_NODELAY - my backend is written in go, and go sets TCP_NODELAY by default!
But I still found the article - and in particular Nagle's acknowledgement of the issues! - to be interesting.
There's a discussion from two years ago here: https://news.ycombinator.com/item?id=40310896 - but I figured it'd been long enough that others might be interested in giving this a read too.
[0]: https://jvns.ca/blog/2015/11/21/why-you-should-understand-a-...
I mostly use go these days for the backend for my multiplayer games, and in this case there's also some good tooling for terminal rendering and SSH stuff in go, so it's a nice choice.
(my games are often pretty weird, I understand that "high framerate multiplayer game over SSH" is not a uhhh good idea, that's the point!)
Maybe that will be useful for thinking about workarounds or maybe you can just use hpn-ssh.
I’m no expert by any means, but this makes sense to me. Plus, I can’t come up with many modern workloads where delayed ACK would result in significant improvement. That said, I feel the same about Nagle’s algorithm - if most packets are big, it seems to me that both features solve problems that hardly exist anymore.
Wouldn't the modern http-dominated best practice be to turn both off?
> Unfortunately, it’s not just delayed ACK. Even without delayed ack and that stupid fixed timer, the behavior of Nagle’s algorithm probably isn’t what we want in distributed systems. A single in-datacenter RTT is typically around 500μs, then a couple of milliseconds between datacenters in the same region, and up to hundreds of milliseconds going around the globe. Given the vast amount of work a modern server can do in even a few hundred microseconds, delaying sending data for even one RTT isn’t clearly a win.
For stuff where no answer is required, Nagle's algorithm works very well for me, but many TCP channels are mixed use these days. They send messages that expect a fast answer and others that are more asynchronous (from a user's point of view, not a programmer's).
Wouldn't it be nice if all operating systems, (home-)routers, firewalls and programming languages would have high quality implementations of something like SCTP...
I never thought about that but I think you're absolutely right! In hindsight it's a glaring oversight to offer a stream API without the ability to flush the buffer.
The API should have been message oriented from the start. This would avoid having the network stack try to compensate for the behavior of the application layer. Then Nagle’s or something like it would just be a library available for applications that might need it.
The stream API is as annoying on the receiving end especially when wrapping (like TLS) is involved. Basically you have to code your layers as if the underlying network is handing you a byte at a time - and the application has to try to figure out where the message boundaries are - adding a great deal of complexity.
The problem is that this is not in practice quite what most applications need, but the Internet evolved towards UDP and TCP only.
So you can have message-based if you want, but then you have to do sequencing, gap filling or flow control yourself, or you can have the overkill reliable byte stream with limited control or visibility at the application level.
I’m not suggesting exposing retransmission, fragmentation, etc to the API user.
The sender provides n bytes of data (a message) to the network stack. The receiver API provides the user with the block of n bytes (the message) as part of an atomic operation. Optionally the sender can be provided with notification when the n-bytes have been delivered to the receiver.
Because TCP, by design, is a stream-oriented protocol, and the only out-of-band signal I'm aware of that's intended to be exposed to applications is the urgent flag/pointer, but a quick Google search suggests that many firewalls clear these by default, so compatibility would almost certainly be an issue if your API tried to use the urgent pointer as a message separator.
I suppose you could implement a sort of "raw TCP" API to allow application control of segment boundaries, and force retransmission to respect them, but this would implicitly expose applications to fragmentation issues that would require additional API complexity to address.
Your API is constrained by the actual TCP protocol. Even if the sender uses this message-oriented TCP API, the receiver can't make any guarantees that a packet they receive lines up with a message boundary, contains N messages, etc etc, due to how TCP actually works in the event of dropped packets and retransmissions. The receiver literally doesn't have the information needed to do that, and it's impossible for the receiver to reconstruct the original message sequence from the sender. You could probably re-implement TCP with retransmission behaviour that gives you what you're looking for, but that's not really TCP anymore.
This is part of the motivation for protocols like QUIC. Most people agree that some hybrid of TCP and UDP with stateful connections, guaranteed delivery and discrete messages is very useful. But no matter how much you fiddle with your code, neither TCP nor UDP is going to give you this, which is why we end up with new protocols that add TCP-ish behaviour on top of UDP.
Very well said. I think there is enormous complexity in many layers because we don't have that building block easily available.
But yeah, where that's unnecessary, it's probably just as easy to have a 4-byte length prefix, since TCP handles the checksum and retransmit and everything for you.
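A minimal sketch of that 4-byte length prefix, written so the receiver copes with the stream arriving in arbitrary pieces (even a byte at a time, as described upthread). The names encode_frame and FrameDecoder are made up for illustration, and the prefix is assumed to be an unsigned 32-bit integer in network byte order.

```python
import struct


def encode_frame(payload: bytes) -> bytes:
    """Prefix a message with its length as a 4-byte big-endian integer."""
    return struct.pack("!I", len(payload)) + payload


class FrameDecoder:
    """Reassemble length-prefixed messages from arbitrary stream chunks."""

    def __init__(self):
        self._buf = bytearray()

    def feed(self, data: bytes):
        """Add received bytes; return the list of messages completed so far."""
        self._buf += data
        messages = []
        while len(self._buf) >= 4:
            (length,) = struct.unpack_from("!I", self._buf, 0)
            if len(self._buf) < 4 + length:
                break  # body not fully received yet, wait for more bytes
            messages.append(bytes(self._buf[4:4 + length]))
            del self._buf[:4 + length]
        return messages
```

The sender calls sock.sendall(encode_frame(msg)); the receiver passes whatever recv() returns into feed() and gets whole messages back, however TCP (or a middlebox) chose to split the segments. A real protocol would also want an upper bound on the length field.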
You should ideally design your messages to fit within a single Ethernet packet, so 2 bytes is more than enough for the size. Though I have sadly seen an increasing number of developers send arbitrarily large network messages and not care about proper design.
TCP_CORK is a rather kludgey alternative.
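For reference, a sketch of how TCP_CORK tends to get used. It is Linux-specific, so this version looks the constant up with getattr and silently does nothing where Python's socket module does not expose it. The context-manager name corked is my own.

```python
import contextlib
import socket


@contextlib.contextmanager
def corked(sock: socket.socket):
    """Hold back partial segments while the block runs (Linux TCP_CORK)."""
    cork = getattr(socket, "TCP_CORK", None)  # None on non-Linux platforms
    if cork is not None:
        sock.setsockopt(socket.IPPROTO_TCP, cork, 1)
    try:
        # Yields True when corking is actually in effect.
        yield cork is not None
    finally:
        if cork is not None:
            # Clearing the option pushes out whatever was held back.
            sock.setsockopt(socket.IPPROTO_TCP, cork, 0)
```

Typical use would be `with corked(conn): conn.sendall(header); conn.sendall(body)`, so the header and body can share a segment. The kludgey part is visible: it is per-socket state that you have to remember to undo, rather than a flush call.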
The same issue exists with file IO. Writing via an in-process buffer (default behavior of stdio and quite a few programming languages) is not interchangeable with unbuffered writes: with a buffer, it’s okay to do many small writes, but you cannot assume that the data will ever actually be written until you flush.
I’m a bit disappointed that Zig’s fancy new IO system pretends that buffered and unbuffered IO are two implementations of the same thing.
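The file IO version of this is easy to see from Python, whose open() is buffered by default in binary mode. The sketch below writes a few bytes and compares the on-disk size before and after flush(); the function name is made up, and it assumes the payload is smaller than the default buffer.

```python
import os
import tempfile


def sizes_before_and_after_flush(payload: bytes):
    """Return the on-disk file size before and after flushing a small write."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, "wb") as f:          # buffered by default
            f.write(payload)                 # sits in the in-process buffer
            before = os.path.getsize(path)   # nothing has reached the file yet
            f.flush()                        # hand the bytes to the OS
            after = os.path.getsize(path)
        return before, after
    finally:
        os.remove(path)
```

For a 5-byte payload this returns (0, 5): until the flush, the write has not happened as far as anything outside the process can tell.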
Seems like there's been a disconnect between users and kernel developers here?
Well, of course not; it tries to reduce the problem of your kernel hanging on to an ack (or generating an ack) longer than you would like. That pertains to received data. If the remote end is sending you data, and is paused due to filling its buffers due to not getting an ack from you, it behooves you to send an ack ASAP.
The original Berkeley Unix implementation of TCP/IP, I seem to recall, had a single global 500 ms timer for sending out acks. So when your TCP connection received new data eligible for acking, it could be as long as 500 ms before the ack was sent. If we reframe that in modern realities, we can imagine every other delay is negligible, and data is coming at the line rate of a multi gigabit connection, 500 ms represents a lot of unacknowledged bits.
Delayed acks are similar to Nagle in spirit in that they promote coalescing at the possible cost of performance. Under the assumption that the TCP connection is bidirectional and "chatty" (so that even when the bulk of the data transfer is happening in one direction, there are application-level messages in the other direction) the delayed ack creates opportunities for the TCP ACK to be piggy backed on a data transfer. A TCP segment carrying no data, only an ACK, is prevented.
As far as portability of TCP_QUICKACK goes, in C code it is as simple as #ifdef TCP_QUICKACK. If the constant exists, use it. Otherwise out of luck. If you're in another language, you have to go through some hoops depending on whether the network-related runtime exposes nonportable options in a way you can test, or whether you are on your own.
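In Python, the rough equivalent of the #ifdef is checking whether the socket module exposes the constant. A sketch, with a helper name of my own choosing:

```python
import socket


def try_enable_quickack(sock: socket.socket) -> bool:
    """Set TCP_QUICKACK if this platform has it; report whether it was set."""
    opt = getattr(socket, "TCP_QUICKACK", None)  # absent on non-Linux builds
    if opt is None:
        return False
    try:
        sock.setsockopt(socket.IPPROTO_TCP, opt, 1)
    except OSError:
        return False  # constant exists but the kernel refused it
    return True
```

Note that on Linux the tcp(7) man page describes this flag as not permanent, so code that relies on it usually sets it again after reads instead of once at connection setup.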
(io_uring is another method that helps a lot here, and it can be combined with MSG_MORE or with preallocated buffers shared with the kernel.)
Also, if you're doing asynchronous writes you typically can only have one write in flight at any time, so you should aggregate all other buffers while that happens.
Though arguably asynchronous writes are often undesired due to the complexity of doing flow-control with them.
Whether that's really useful or not depends on whether you do the associated buffer management work.
oxide and friends episode on it! It's quite good
If userspace applications want to make latency/throughput tradeoffs they can already do that with full awareness and control using their own buffers, which will also often mean fewer syscalls too.
With that said, I'm pretty sure it is a feature of the TCP stack only because the TCP stack is the layer they were trying to solve this problem at, and it isn't clear at all that "unacked data" is particularly better than a timer -- and of course if you actually do want to implement application layer Nagle directly, delayed acks mean that application level acking is a lot less likely to require an extra packet.
BTW, hardware-based TCP offload engines exist... Don't think they are widely used nowadays though
Widely used in low latency fields like trading
And it would be the right choice if it worked. Hell, a simple 20ms flush timer would've made it work just fine.