Top
Best
New

Posted by linolevan 4 hours ago

What came first: the CNAME or the A record?(blog.cloudflare.com)
130 points | 48 commentspage 2
ShroudedNight 3 hours ago|
I'm not an IETF process expert. Would this be worth filing errata against the original RFC in addition to their new proposed update?

Also, what's the right mental framework behind deciding when to release a patch RFC vs obsoleting the old standard for a comprehensive update?

hdjrudni 2 hours ago|
I don't know the official process, but as a human that sometimes reads and implements IETF RFCs, I'd appreciate updates to the original doc rather than replacing it with something brand new. Probably with some dated version history.

Otherwise I might go to consult my favorite RFC and not even know its been superseded. And if it has been superseded with a brand new doc, now I have to start from scratch again instead of reading the diff or patch notes to figure out what needs updating.

And if we must supersede, I humbly request a warning be put at the top, linking the new standard.

ShroudedNight 2 hours ago||
At one point I could have sworn they were sticking obsoletion notices in the header, but now I can only find them in the right side-bar:

https://datatracker.ietf.org/doc/html/rfc5245

I agree, that it would be much more helpful if made obvious in the document itself.

It's not obvious that "updated by" notices are treated in any more of a helpful manner than "obsoletes"

paulddraper 3 hours ago||
> RFC 1034, published in 1987, defines much of the behavior of the DNS protocol, and should give us an answer on whether the order of CNAME records matters. Section 4.3.1 contains the following text:

> If recursive service is requested and available, the recursive response to a query will be one of the following:

> - The answer to the query, possibly preface by one or more CNAME RRs that specify aliases encountered on the way to an answer.

> While "possibly preface" can be interpreted as a requirement for CNAME records to appear before everything else, it does not use normative key words, such as MUST and SHOULD that modern RFCs use to express requirements. This isn’t a flaw in RFC 1034, but simply a result of its age. RFC 2119, which standardized these key words, was published in 1997, 10 years after RFC 1034.

It's pretty clear that CNAME is at the beginning.

The "possibly" does not refer to the order but rather to the presence.

If they are present, they are are first.

urbandw311er 2 minutes ago|
The whole world knows this except Cloudflare who actually did know it but are now trying to pretend that they didn’t.
kayson 3 hours ago||
> However, we did not have any tests asserting the behavior remains consistent due to the ambiguous language in the RFC.

Maybe I'm being overly-cynical but I have a hard time believing that they deliberately omitted a test specifically because they reviewed the RFC and found the ambiguous language. I would've expected to see some dialog with IETF beforehand if that were the case. Or some review of the behavior of common DNS clients.

It seems like an oversight, and that's totally fine.

bombcar 3 hours ago||
I took it as being "we wrote the tests to the standard" and then built the code, and whoever was writing the tests didn't read that line as a testable aspect.
kayson 3 hours ago||
Fair enough.
supriyo-biswas 3 hours ago||
My reading of that statement is their test, assuming they had one, looked something like this:

    rrs = resolver.resolve('www.example.test')
    assert Record("cname1.example.test", type="CNAME") in rrs
    assert Record("192.168.0.1", type="A") in rrs
Which wouldn't have caught the ordering problem.
hdjrudni 2 hours ago||
It's implied that they intentionally tested it that way, without any assertions on the order. Not by oversight of incompetence, but because they didn't want to bake the requirement in due to uncertainty.
therein 3 hours ago||
After the release got reverted, it took an 1hr28min for the deployment to propagate. You'd think that would be a very long time for CloudFlare infrastructure.
rhplus 2 hours ago||
We should probably all be glad that CloudFlare doesn't have the ability to update its entire global fleet any faster than 1h 28m, even if it’s a rollback operation.

Any change to a global service like that, even a rollback (or data deployment or config change), should be released to a subset of the fleet first, monitored, and then rolled out progressively.

tuetuopay 2 hours ago|||
Given the seriousness of outages they make with instant worldwide deploys, I’m glad they took it calmly.
steve1977 2 hours ago||
They had to update all the down detectors first.
frumplestlatz 3 hours ago||
Given my years of experience with Cisco "quality", I'm not surprised by this:

> Another notable affected implementation was the DNSC process in three models of Cisco ethernet switches. In the case where switches had been configured to use 1.1.1.1 these switches experienced spontaneous reboot loops when they received a response containing the reordered CNAMEs.

... but I am surprised by this:

> One such implementation that broke is the getaddrinfo function in glibc, which is commonly used on Linux for DNS resolution.

Not that glibc did anything wrong -- I'm just surprised that anyone is implementing an internet-scale caching resolver without a comprehensive test suite that includes one of the most common client implementations on the planet.

renewiltord 3 hours ago||
Nice analysis. Boy I can’t imagine having to work at Cloudflare on this stuff. A month to get your “small in code” change out only to find some bums somewhere have written code that will make it not work.
urbandw311er 30 seconds ago||
Or — hot take — to find out that you made some silly misinterpretation of the RFC that you then felt the need to retrospectively justify.
stackskipton 3 hours ago|||
Or when working on massive infrastructure like this, you write plenty of tests that would have saved you a month worth of work.

They write reordering, push it and glibc tester fires, fails and you quickly discover "Crap, tests are failing and dependency (glibc) doesn't work way I thought it would."

renewiltord 2 hours ago||
I suspect that if you could save them this time, they'd gladly pay you for it. It'll be a bit of a sell, but they seem like a fairly sensible org.
rjh29 1 hour ago||
It was glibc's resolver that failed - not exactly obscure. It wasn't properly tested or rolled out, plain and simple.
charcircuit 3 hours ago|
Random DNS servers and clients being broken in weird ways is such a common problem and will probably never go away unless DNS is abandoned altogether.

It's surprising how something so simple can be so broken.