Dunno if it helps but it helps me feel better.
What speeds up browsing the most, though, IMO, is running your own DNS resolver, null-routing a big part of the Internet, firewalling off entire countries (no really, I don't need anything from North Korea, China or Russia, for example), and then on top of that running dnsmasq locally.
I run the Unbound DNS resolver (on a little Pi so it's on 24/7) with gigantic killfiles, then I use 1.1.1.3 as the upstream on top of that (Cloudflare's DNS that filters out known porn and known malware: yes, it's Cloudflare and, yes, I own shares of NET).
Some sites complain I use an "ad blocker" but it's really just null routing a big chunk of the interwebz.
That, and LD_PRELOAD-ing a lib that sets TCP_NODELAY: life is fast and good. Very low latency.
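For the curious, here's a minimal sketch of that LD_PRELOAD trick (the file name and build line are mine, not part of the setup described above): it interposes connect() and flips TCP_NODELAY on for every outgoing stream socket.

    /* nodelay_preload.c: minimal LD_PRELOAD shim that forces TCP_NODELAY on.
     * Build:  gcc -shared -fPIC -o libnodelay.so nodelay_preload.c -ldl
     * Use:    LD_PRELOAD=./libnodelay.so some_program
     */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    int connect(int fd, const struct sockaddr *addr, socklen_t len)
    {
        static int (*real_connect)(int, const struct sockaddr *, socklen_t);
        if (!real_connect)
            real_connect = (int (*)(int, const struct sockaddr *, socklen_t))
                               dlsym(RTLD_NEXT, "connect");

        int type = 0;
        socklen_t optlen = sizeof(type);
        /* Only touch stream sockets; TCP_NODELAY is meaningless elsewhere. */
        if (getsockopt(fd, SOL_SOCKET, SO_TYPE, &type, &optlen) == 0 &&
            type == SOCK_STREAM) {
            int one = 1;
            setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
        }
        return real_connect(fd, addr, len);
    }

This only covers outgoing connections; a fuller shim would also interpose accept()/accept4() for server-side sockets.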
You never want the combination of TCP_NODELAY off (Nagle's algorithm on) at the sending end and delayed ACKs on at the receiving end. But there's no way to prevent that combination from one end alone. Hence the problem.
Is Nagle's algorithm (TCP_NODELAY off) still necessary? Try doing one-byte TCP sends in a tight loop and see what it does to other traffic on the same path, say a cellular link. Today's links may be able to tolerate the roughly 40x extra traffic (each one-byte payload rides under about 40 bytes of TCP/IP headers). The algorithm was originally put in as a protection against badly behaved senders.
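For anyone who wants to actually run that experiment, here is a rough sketch (the destination address is a placeholder; point it at a discard/echo server you control and compare the packet counts in tcpdump with and without the TCP_NODELAY lines):

    /* tinysend.c: one-byte application writes in a tight loop.
     * With TCP_NODELAY set, each byte can leave as its own tiny packet;
     * with Nagle left on, writes get coalesced while an ACK is outstanding.
     */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in dst = { .sin_family = AF_INET, .sin_port = htons(9) };
        inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);   /* placeholder address */

        if (connect(fd, (struct sockaddr *)&dst, sizeof(dst)) != 0) {
            perror("connect");
            return 1;
        }

        int one = 1;                    /* comment these two lines out to leave Nagle on */
        setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));

        for (int i = 0; i < 100000; i++)
            send(fd, "x", 1, 0);        /* one-byte application writes */

        close(fd);
        return 0;
    }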
A delayed ACK should be thought of as a bet on the behavior of the listening application. If the listening application usually responds fast, within the ACK delay interval, the delayed ACK is coalesced into the reply and you save a packet. If the listening application does not respond immediately, a delayed ACK has to actually be sent, and nothing was gained by delaying it. It would be useful for TCP implementations to tally, for each socket, the number of delayed ACKs actually sent vs. the number coalesced. If many delayed ACKs are being sent, ACK delay should be turned off, rather than repeating a losing bet.
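A rough user-space sketch of that bookkeeping idea; this is not how any real stack implements it, just the bet-tracking logic spelled out:

    /* Hypothetical per-socket delayed-ACK accounting: stop delaying once the
     * bet is mostly losing, i.e. most delayed ACKs end up being sent alone. */
    #include <stdbool.h>

    struct ack_stats {
        unsigned long coalesced;   /* delayed ACK piggybacked on a reply      */
        unsigned long standalone;  /* delay timer expired, ACK sent by itself */
    };

    bool keep_delaying_acks(const struct ack_stats *s)
    {
        unsigned long total = s->coalesced + s->standalone;
        if (total < 16)                    /* not enough evidence yet */
            return true;
        return s->standalone * 4 < total;  /* give up if >25% were sent standalone */
    }

The 16-sample warm-up and the 25% threshold are arbitrary choices for illustration.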
This should have been fixed forty years ago. But I was out of networking by the time this conflict appeared. I worked for an aerospace company, and they wanted to move all networking work from Palo Alto to Colorado Springs, Colorado. Colorado Springs was building a router based on the Zilog Z8000, purely for military applications. That turned out to be a dead end. The other people in networking in Palo Alto went off to form a startup to make a "PC LAN" (a forgotten 1980s concept), and for about six months, they led that industry. I ended up leaving and doing things for Autodesk, which worked out well.
Disabling Nagle's algorithm should be done as a matter of principle; there's simply no modern network configuration where it's beneficial.
Nagle’s algorithm is just a special-case solution to the generic problem of choosing when and how long to batch. You want to batch because batching usually allows for more efficient batched algorithms, better locality, less overhead, etc. You do not want to batch because it increases latency, both while collecting enough data to fill a batch and because you then need to process the whole batch.
One class of solution is “Work or Time”. You batch up to a certain amount of work or up to a certain amount of time, whichever comes first. You choose your amount of time as your desired worst case latency. You choose your amount of work as your efficient batch size (it should be less than max throughput * latency, otherwise you will always hit your timer first).
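As a concrete sketch of “Work or Time” (the constants and names are invented for illustration): flush when the batch is full or when the deadline since the first queued item has passed, whichever comes first.

    /* batcher.c: minimal "Work or Time" batcher. */
    #include <stdio.h>
    #include <time.h>

    #define MAX_ITEMS 64       /* "Work": efficient batch size          */
    #define FLUSH_MS  5        /* "Time": acceptable worst-case latency */

    static int    batch[MAX_ITEMS];
    static int    count;
    static struct timespec first;  /* arrival time of the oldest unflushed item */

    static long ms_since(const struct timespec *t)
    {
        struct timespec now;
        clock_gettime(CLOCK_MONOTONIC, &now);
        return (now.tv_sec - t->tv_sec) * 1000 + (now.tv_nsec - t->tv_nsec) / 1000000;
    }

    static void flush(void)
    {
        if (count == 0)
            return;
        printf("flushing %d items\n", count);  /* stand-in for the real batched work */
        count = 0;
    }

    void submit(int item)
    {
        if (count == 0)
            clock_gettime(CLOCK_MONOTONIC, &first);
        batch[count++] = item;
        if (count == MAX_ITEMS || ms_since(&first) >= FLUSH_MS)
            flush();                           /* whichever bound trips first wins */
    }

    int main(void)
    {
        for (int i = 0; i < 1000; i++)
            submit(i);
        flush();                               /* drain whatever is left */
        return 0;
    }

Note that this version only checks the time bound when a new item arrives; a real implementation also needs an independent timer so a partial batch can’t sit forever, which is exactly the “Time” component Nagle’s algorithm punts on.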
Nagle’s algorithm is “Work” being one packet (~1.5 KB) with “Time” being the time until all outstanding data gets an ACK (you might already see how that degree of dynamism in your timeout could pose a problem), which results in a fallback timer of up to 500 ms when delayed ACK is on. It should be obvious that this is a terrible set of parameters for modern connections. The problem is that Nagle’s algorithm only deals with the “Work” component but punts on the “Time” component, allowing for nonsense like delayed ACK helpfully “configuring” your effective “Time” component to an eternity, resulting in the “stuck” buffers the timeout is supposed to avoid. I will decline to discuss the other aspect, choosing when to buffer and how much, of which Nagle’s algorithm is again a special case.
Delayed ACK is, funnily enough, basically the exact same problem, but on the receive side. So both sides set timeouts based on the other side going first, which is obviously a recipe for disaster. They both fix “Work” but not “Time”, resulting in the situation where both drivers are too polite to go first.
What should be done is to use the generic solutions, parameterized by your system and channel properties, that solve these problems holistically; they would take too long to describe in depth here.
1. Perhaps on more modern hardware the thing to do with badly behaved senders is not ‘hang on to unfull packets for 40 ms’; another policy could still work, e.g. eagerly send the underfilled packet, but wait the amount of time it would take to send a full packet (and prioritize sending other flows) before letting the next underfull packet out (see the sketch after this list).
2. In Linux there are packets and then there are (jumbo) packets. The networking stack has some per-packet overhead, so much work is done to have it operate on bigger batches and then let the hardware (or a last step in the OS) do segmentation. It’s always been pretty unclear to me how all these packet-oriented things (Nagle’s algorithm, tc, pacing) interact with those jumbo packets and the various hardware offload capabilities.
3. This kind of article comes up a lot (mystery 40 ms latency -> set TCP_NODELAY). In the past I’ve tried to write little test programs in a high-level language that listen on TCP and respond quickly, and in some cases (depending on response size) I’ve seen strange ~40 ms latencies despite TCP_NODELAY being set. I didn’t bother looking in huge detail (e.g. I took a strace and a tcpdump but didn’t try to see non-jumbo packets) and failed to debug the cause. I’m still curious what may have caused it.
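Regarding point 1 above, here is a rough sketch of that pacing policy (the link rate, MTU, names and thresholds are all assumptions): send an underfull packet right away, but don’t let the next underfull one out until roughly one full-packet serialization time has passed.

    /* Hypothetical pacing wrapper for point 1: at most one underfull send
     * per full-MTU serialization time on an assumed 100 Mbit/s link. */
    #include <stdint.h>
    #include <sys/socket.h>
    #include <time.h>

    #define MTU_BYTES   1500
    #define LINK_BPS    100000000ULL    /* assumed link rate: 100 Mbit/s */
    #define FULL_PKT_NS (MTU_BYTES * 8ULL * 1000000000ULL / LINK_BPS)

    static struct timespec last_underfull;

    static uint64_t ns_since(const struct timespec *t)
    {
        struct timespec now;
        clock_gettime(CLOCK_MONOTONIC, &now);
        return (uint64_t)(now.tv_sec - t->tv_sec) * 1000000000ULL
             + (uint64_t)(now.tv_nsec - t->tv_nsec);
    }

    ssize_t paced_send(int fd, const void *buf, size_t len)
    {
        if (len < MTU_BYTES) {
            uint64_t elapsed = ns_since(&last_underfull);
            if (elapsed < FULL_PKT_NS) {
                /* Stand-in for "serve other flows first": just wait it out. */
                struct timespec wait = { 0, (long)(FULL_PKT_NS - elapsed) };
                nanosleep(&wait, NULL);
            }
            clock_gettime(CLOCK_MONOTONIC, &last_underfull);
        }
        return send(fd, buf, len, 0);
    }

In a real stack this logic would live next to the queueing discipline rather than in the application, of course.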