Dunno if it helps but it helps me feel better.
What speeds up browsing the most, though, IMO, is running your own DNS resolver, null-routing a big part of the Internet, firewalling off entire countries (no really, I don't need anything from North Korea, China or Russia, for example), and then on top of that running dnsmasq locally.
I run the Unbound DNS resolver (on a little Pi so it's on 24/7) with gigantic killfiles, then I use 1.1.1.3 as the upstream on top of that (Cloudflare's DNS that filters out known porn and known malware: yes, it's Cloudflare and, yes, I own shares of NET).
Some sites complain I use an "ad blocker" but it's really just null routing a big chunk of the interwebz.
That, and LD_PRELOAD-ing a lib that sets TCP_NODELAY: life is fast and good. Very low latency.
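For the curious, here's a minimal sketch of that LD_PRELOAD trick (the file name and build line are mine, not part of the setup described above): it interposes connect() and flips TCP_NODELAY on for every outgoing stream socket.

    /* nodelay_preload.c: minimal LD_PRELOAD shim that forces TCP_NODELAY on.
     * Build:  gcc -shared -fPIC -o libnodelay.so nodelay_preload.c -ldl
     * Use:    LD_PRELOAD=./libnodelay.so some_program
     */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    int connect(int fd, const struct sockaddr *addr, socklen_t len)
    {
        static int (*real_connect)(int, const struct sockaddr *, socklen_t);
        if (!real_connect)
            real_connect = (int (*)(int, const struct sockaddr *, socklen_t))
                               dlsym(RTLD_NEXT, "connect");

        int type = 0;
        socklen_t optlen = sizeof(type);
        /* Only touch stream sockets; TCP_NODELAY is meaningless elsewhere. */
        if (getsockopt(fd, SOL_SOCKET, SO_TYPE, &type, &optlen) == 0 &&
            type == SOCK_STREAM) {
            int one = 1;
            setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
        }
        return real_connect(fd, addr, len);
    }

This only covers outgoing connections; a fuller shim would also interpose accept()/accept4() for server-side sockets.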
You never want the combination of TCP_NODELAY off (Nagle's algorithm on) at the sending end and delayed ACKs on at the receiving end. But there's no way to prevent that combination from one end alone. Hence the problem.
Is Nagle's algorithm (TCP_NODELAY off) still necessary? Try doing one-byte TCP sends in a tight loop and see what it does to other traffic on the same path, say a cellular link. Today's links may be able to tolerate the roughly 40x extra traffic (each one-byte payload rides under about 40 bytes of TCP/IP headers). The algorithm was originally put in as a protection against badly behaved senders.
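For anyone who wants to actually run that experiment, here is a rough sketch (the destination address is a placeholder; point it at a discard/echo server you control and compare the packet counts in tcpdump with and without the TCP_NODELAY lines):

    /* tinysend.c: one-byte application writes in a tight loop.
     * With TCP_NODELAY set, each byte can leave as its own tiny packet;
     * with Nagle left on, writes get coalesced while an ACK is outstanding.
     */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in dst = { .sin_family = AF_INET, .sin_port = htons(9) };
        inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);   /* placeholder address */

        if (connect(fd, (struct sockaddr *)&dst, sizeof(dst)) != 0) {
            perror("connect");
            return 1;
        }

        int one = 1;                    /* comment these two lines out to leave Nagle on */
        setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));

        for (int i = 0; i < 100000; i++)
            send(fd, "x", 1, 0);        /* one-byte application writes */

        close(fd);
        return 0;
    }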
A delayed ACK should be thought of as a bet on the behavior of the listening application. If the listening application usually responds fast, within the ACK delay interval, the delayed ACK is coalesced into the reply and you save a packet. If the listening application does not respond immediately, a delayed ACK has to actually be sent, and nothing was gained by delaying it. It would be useful for TCP implementations to tally, for each socket, the number of delayed ACKs actually sent vs. the number coalesced. If many delayed ACKs are being sent, ACK delay should be turned off, rather than repeating a losing bet.
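A rough user-space sketch of that bookkeeping idea; this is not how any real stack implements it, just the bet-tracking logic spelled out:

    /* Hypothetical per-socket delayed-ACK accounting: stop delaying once the
     * bet is mostly losing, i.e. most delayed ACKs end up being sent alone. */
    #include <stdbool.h>

    struct ack_stats {
        unsigned long coalesced;   /* delayed ACK piggybacked on a reply      */
        unsigned long standalone;  /* delay timer expired, ACK sent by itself */
    };

    bool keep_delaying_acks(const struct ack_stats *s)
    {
        unsigned long total = s->coalesced + s->standalone;
        if (total < 16)                    /* not enough evidence yet */
            return true;
        return s->standalone * 4 < total;  /* give up if >25% were sent standalone */
    }

The 16-sample warm-up and the 25% threshold are arbitrary choices for illustration.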
This should have been fixed forty years ago. But I was out of networking by the time this conflict appeared. I worked for an aerospace company, and they wanted to move all networking work from Palo Alto to Colorado Springs, Colorado. Colorado Springs was building a router based on the Zilog Z8000, purely for military applications. That turned out to be a dead end. The other people in networking in Palo Alto went off to form a startup to make a "PC LAN" (a forgotten 1980s concept), and for about six months, they led that industry. I ended up leaving and doing things for Autodesk, which worked out well.
Disabling Nagle's algorithm should be done as a matter of principle; there's simply no modern network configuration where it's beneficial.
Nagle’s algorithm is just a special-case solution to the generic problem of choosing when and how long to batch. You want to batch because batching usually allows for more efficient batched algorithms, better locality, less overhead, etc. You do not want to batch because it increases latency, both while collecting enough data to fill a batch and because you then need to process the whole batch.
One class of solution is “Work or Time”. You batch up to a certain amount of work or up to a certain amount of time, whichever comes first. You choose your amount of time as your desired worst case latency. You choose your amount of work as your efficient batch size (it should be less than max throughput * latency, otherwise you will always hit your timer first).
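As a concrete sketch of “Work or Time” (the constants and names are invented for illustration): flush when the batch is full or when the deadline since the first queued item has passed, whichever comes first.

    /* batcher.c: minimal "Work or Time" batcher. */
    #include <stdio.h>
    #include <time.h>

    #define MAX_ITEMS 64       /* "Work": efficient batch size          */
    #define FLUSH_MS  5        /* "Time": acceptable worst-case latency */

    static int    batch[MAX_ITEMS];
    static int    count;
    static struct timespec first;  /* arrival time of the oldest unflushed item */

    static long ms_since(const struct timespec *t)
    {
        struct timespec now;
        clock_gettime(CLOCK_MONOTONIC, &now);
        return (now.tv_sec - t->tv_sec) * 1000 + (now.tv_nsec - t->tv_nsec) / 1000000;
    }

    static void flush(void)
    {
        if (count == 0)
            return;
        printf("flushing %d items\n", count);  /* stand-in for the real batched work */
        count = 0;
    }

    void submit(int item)
    {
        if (count == 0)
            clock_gettime(CLOCK_MONOTONIC, &first);
        batch[count++] = item;
        if (count == MAX_ITEMS || ms_since(&first) >= FLUSH_MS)
            flush();                           /* whichever bound trips first wins */
    }

    int main(void)
    {
        for (int i = 0; i < 1000; i++)
            submit(i);
        flush();                               /* drain whatever is left */
        return 0;
    }

Note that this version only checks the time bound when a new item arrives; a real implementation also needs an independent timer so a partial batch can’t sit forever, which is exactly the “Time” component Nagle’s algorithm punts on.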
Nagle’s algorithm is “Work” being one packet (~1.5 KB) with “Time” being the time until all outstanding data gets an ACK (you might already see how that degree of dynamism in your timeout could pose a problem), which results in a fallback timer of up to 500 ms when delayed ACK is on. It should be obvious that this is a terrible set of parameters for modern connections. The problem is that Nagle’s algorithm only deals with the “Work” component but punts on the “Time” component, allowing for nonsense like delayed ACK helpfully “configuring” your effective “Time” component to an eternity, resulting in the “stuck” buffers the timeout is supposed to avoid. I will decline to discuss the other aspect, choosing when to buffer and how much, of which Nagle’s algorithm is again a special case.
Delayed ACK is, funnily enough, basically the exact same problem, but on the receive side. So both sides set timeouts based on the other side going first, which is obviously a recipe for disaster. They both fix “Work” but not “Time”, resulting in the situation where both drivers are too polite to go first.
What should be done is to use the generic solutions, parameterized by your system and channel properties, that solve these problems holistically; they would take too long to describe in depth here.
1. Perhaps on more modern hardware the thing to do with badly behaved senders is not ‘hang on to unfull packets for 40 ms’; another policy could still work, e.g. eagerly send the underfilled packet, but wait the amount of time it would take to send a full packet (and prioritize sending other flows) before letting the next underfull packet out (see the sketch after this list).
2. In Linux there are packets and then there are (jumbo) packets. The networking stack has some per-packet overhead, so much work is done to have it operate on bigger batches and then let the hardware (or a last step in the OS) do segmentation. It’s always been pretty unclear to me how all these packet-oriented things (Nagle’s algorithm, tc, pacing) interact with those jumbo packets and the various hardware offload capabilities.
3. This kind of article comes up a lot (mystery 40 ms latency -> set TCP_NODELAY). In the past I’ve tried to write little test programs in a high-level language that listen on TCP and respond quickly, and in some cases (depending on response size) I’ve seen strange ~40 ms latencies despite TCP_NODELAY being set. I didn’t bother looking in huge detail (e.g. I took a strace and a tcpdump but didn’t try to see non-jumbo packets) and failed to debug the cause. I’m still curious what may have caused it.
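Regarding point 1 above, here is a rough sketch of that pacing policy (the link rate, MTU, names and thresholds are all assumptions): send an underfull packet right away, but don’t let the next underfull one out until roughly one full-packet serialization time has passed.

    /* Hypothetical pacing wrapper for point 1: at most one underfull send
     * per full-MTU serialization time on an assumed 100 Mbit/s link. */
    #include <stdint.h>
    #include <sys/socket.h>
    #include <time.h>

    #define MTU_BYTES   1500
    #define LINK_BPS    100000000ULL    /* assumed link rate: 100 Mbit/s */
    #define FULL_PKT_NS (MTU_BYTES * 8ULL * 1000000000ULL / LINK_BPS)

    static struct timespec last_underfull;

    static uint64_t ns_since(const struct timespec *t)
    {
        struct timespec now;
        clock_gettime(CLOCK_MONOTONIC, &now);
        return (uint64_t)(now.tv_sec - t->tv_sec) * 1000000000ULL
             + (uint64_t)(now.tv_nsec - t->tv_nsec);
    }

    ssize_t paced_send(int fd, const void *buf, size_t len)
    {
        if (len < MTU_BYTES) {
            uint64_t elapsed = ns_since(&last_underfull);
            if (elapsed < FULL_PKT_NS) {
                /* Stand-in for "serve other flows first": just wait it out. */
                struct timespec wait = { 0, (long)(FULL_PKT_NS - elapsed) };
                nanosleep(&wait, NULL);
            }
            clock_gettime(CLOCK_MONOTONIC, &last_underfull);
        }
        return send(fd, buf, len, 0);
    }

In a real stack this logic would live next to the queueing discipline rather than in the application, of course.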