I'm going to be honest: I was pretty geared up to have a contrarian opinion until I looked at the standards, but they're actually pretty clear that a 404 could be a proper response to an unexpected query string [0]. The query string is as much part of the URL API as the path is, and I think pretty much everyone can acknowledge that just tacking random stuff onto the path would be ill-advised and undefined behavior.
[0]: https://url.spec.whatwg.org/#application/x-www-form-urlencod...
In fact lots of sites still work like that; they just hide it behind a couple of rewrite rules in Apache/nginx for SEO reasons.
On the other hand, if it's a CRUD app and you're filtering a list of entities by various field values? Returning that no items matched your selection (or an empty list, if an API) makes more sense than a 404, which would be more appropriate for an attempt to pull up a nonexistent entity URI.
204 No Content for nothing found is not an error (it's a 2xx code), but it still indicates that nothing matched the request. If it's an API, a 200 with an empty JSON object or array in the body is legitimate as well, but a 204 is explicit.
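As a concrete sketch (Go's net/http, with made-up handler and route names), either convention might look like:

package main

import (
	"encoding/json"
	"net/http"
)

// findBooks stands in for whatever filtering the API actually does.
func findBooks(author string) []string { return nil }

// listBooks answers a filter query: 204 when nothing matched (explicit,
// still a success), otherwise 200 with the JSON list. Returning 200 with
// an empty array for the no-match case would be just as legitimate.
func listBooks(w http.ResponseWriter, r *http.Request) {
	books := findBooks(r.URL.Query().Get("author"))
	if len(books) == 0 {
		w.WriteHeader(http.StatusNoContent)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(books)
}

func main() {
	http.HandleFunc("/books", listBooks)
	http.ListenAndServe(":8080", nil)
}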
/books/1 could return 200 or 404 depending on whether book #1 exists; that makes sense because if /books/1 does not exist the API must say so explicitly. However, 404 belongs to the 4xx family, which means "client error". Is it an error to ask for a nonexistent book? If you enter a bookshop and ask for a book they don't have, you did not "make a mistake". It's not as if you asked for a chainsaw. But in an API, especially with hypermedia, you are not supposed to request a resource that does not exist (unless the API provided a link to a resource that was deleted before the caller tried to reach it).
If you ask for a book they don't have it's a different matter.
In any case, when you ask for a book in a library you are using their "search" endpoint. The equivalent of opening a books/1 URL would be asking for a specific copy of a book by serial number or so. Then it's clear that you made a mistake if you do that for a nonexistent serial number...
/users/ returning a 404 in an API means that this resource does not exist. As in, this is not a part of the API.
/users/123 returning a 404 means this user record does not exist.
Yes this means that a 404 is context dependent but in a way that makes it easier for a human to think of and reason about.
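To make the two flavours concrete (a minimal net/http sketch; names and data are invented):

package main

import (
	"fmt"
	"net/http"
	"strings"
)

var users = map[string]string{"123": "Alice"}

// /users/123 is a real route; whether it 404s depends on the record.
func getUser(w http.ResponseWriter, r *http.Request) {
	id := strings.TrimPrefix(r.URL.Path, "/users/")
	name, ok := users[id]
	if !ok {
		http.NotFound(w, r) // the route exists, this user record does not
		return
	}
	fmt.Fprintln(w, name)
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/users/", getUser)
	// Paths the mux doesn't know about (/accounts/, /user/, ...) get the
	// mux's own 404: not part of the API at all.
	http.ListenAndServe(":8080", mux)
}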
Lots of REST libraries that I've used treat any 400-range response as an error, so generating a 404 for an empty list would just create more headaches.
Responses with status codes in the 400 range are client errors, so the client shouldn't retry the same request. So a 404 is appropriate despite how annoying a library might be at handling it. Depending on which language/ecosystem you are using, there are likely more sane alternatives.
Although I do feel like I've seen too many instances of a 404 being used for an empty collection where it would make more sense to return `[]` and treat it as an expected (successful) state.
It would have been nice if there were an actual grouping of retriable and non-retriable codes, but in reality it's a complete mess.
But at a minimum, beware of 429. That's not a permanent failure, and it's a frequent one that needs a careful retry.
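For example (a rough sketch, not any particular library's behaviour), a client that treats 429 as the one retriable 4xx might look like:

package main

import (
	"net/http"
	"strconv"
	"time"
)

// get retries only on 429, honouring a delta-seconds Retry-After header
// when present; every other 4xx is treated as a permanent client error.
func get(url string) (*http.Response, error) {
	for attempt := 0; ; attempt++ {
		resp, err := http.Get(url)
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != http.StatusTooManyRequests || attempt == 3 {
			return resp, nil
		}
		resp.Body.Close()

		delay := time.Second << attempt // fallback: simple exponential backoff
		if s := resp.Header.Get("Retry-After"); s != "" {
			if secs, err := strconv.Atoi(s); err == nil {
				delay = time.Duration(secs) * time.Second
			}
		}
		time.Sleep(delay)
	}
}

func main() {
	if resp, err := get("https://example.com/api/items"); err == nil {
		resp.Body.Close()
	}
}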
400 is the general “bad request” client area, indicating something is wrong with the request but not being specific about what.
404 is simply a more specific client error: it means the client asked for a resource that couldn’t be found.
That's not obvious at all. If I receive JSON data that contains a property I'm not aware of, I don't reject the entire document for that reason. In the case of query strings, extra query parameters might be used by other parts of the stack besides yours, so rejecting the entire request because someone somewhere else is trying to pass information to itself is the wrong approach.
As a web developer, you're like the guy standing with a clipboard outside a fancy club, checking whether people requesting entry are allowed in or not. Basically, level 1 security.
If someone is not on the list, your job is to default to declining them access, not granting them access assuming level 2 security will handle them at a deeper layer.
It’s possible that the teams you work with expect fuzzy behaviour from the website but that’s a choice, not a practice.
This is how the vast majority of websites work. The practical reason is obvious: when we model the behaviour our code depends on, we want to create the simplest possible model that allows our code to work as expected. Placing requirements on it that our code doesn't actually depend on is useless, unneeded complexity.
> As a web developer, you're like the guy standing with a clipboard outside a fancy club, checking whether people requesting entry are allowed in or not. Basically, level 1 security.
there is no security benefit to filtering out unneeded url parameters.
there is - security in depth.
If a url parameter would've been a vulnerability because something lower down the stack misinterprets it (and the param wasn't necessary for your app in the first place), then you've just left a window open for the exploit.
If the set of URL params is known ahead of time (which I claim it should be), then you could make adding unknown params an error.
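A sketch of what that can look like as middleware (whitelist, status code, and route are all just illustrative choices):

package main

import (
	"fmt"
	"net/http"
)

// allowOnlyParams wraps a handler and rejects any request that carries a
// query parameter outside the known set.
func allowOnlyParams(next http.Handler, allowed ...string) http.Handler {
	ok := make(map[string]bool, len(allowed))
	for _, p := range allowed {
		ok[p] = true
	}
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		for name := range r.URL.Query() {
			if !ok[name] {
				http.Error(w, "unknown query parameter: "+name, http.StatusBadRequest)
				return
			}
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	list := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "filtered list")
	})
	http.Handle("/items", allowOnlyParams(list, "page", "sort"))
	http.ListenAndServe(":8080", nil)
}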
What about passing extra data to fill the server's memory with either known junk or a script/executable to be used with a zero-day in an internal component, or something.
To misuse the nightclub analogy: it's like checking that bags are no larger than A4 and disallowing knives and other weapons.
Oh yeah? I remember a lot of semicolons from Perl and other CGI stuff where we would now use ampersands, back in the day, both in the path and in the query. (Sometimes the ? itself would be written ;.)
The really funny thing about this is that, when I was worrying about possible side effects if I responded 404, I somehow completely forgot how much of the web’s history the path has been useless for. Paths have won. No one really starts new things with URLs like /item?id=… any more. Yay!
So en.wikipedia.org/wiki/// is the article about C++ style comments
Though there are “smart” CDNs that will resize images etc.; all bets are off for those.
https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/...
effectively lets you specify what parts of a query are relevant. So for example
url?a=b&c=d matches url?c=d&a=b in terms of caching
This feels like a “technically correct is the best kind of correct” situation. Like, technically, yeah, web servers may respond 404 if they don't understand a query parameter, but in practice that is not how URLs are conceptualized normally.
Seems a lot better than the other potential world we could have lived in, where paths were a black box and every web server/framework invented its own structure for them.
It’s your website. Have fun with it! Do dumb things! :-)
MII//epi is converted to MII/epi
- user gritzko,
- project beagle,
- view blob,
- commit a7e17290a39250092055fcda5ae7015868dabdb4,
- file path VERBS.md
... all concatenated indiscriminately.

Grouping data by user is common and normal in computing: /home laid precedent decades ago.
Project directories are an extremely common grouping within a user’s work sets. Yeah, some of us just dump random files in $HOME, but this is still a sensible tier two path component.
The choice to make ‘view metadata-wrapped content in browser HTML output’ the default rather than ‘view raw file contents’ the default is legitimate for their usage. One could argue that using custom http headers would be preferable to a path element (to the exclusion of JavaScript being able to access them, iirc?) or that the path element blob should be moved into the domain component or should prefix rather than suffix the operands; all valid choices, but none implicitly better or worse here.
Object hash is obviously mandatory for git permalinks, and is perhaps the only mandatory component here. (But notably, that’s not the same as a commit hash.) However, such paths could arguably be interpreted as maximally user-hostile.
File path, interestingly enough, is completely disposable if one refers to a specific result object hash within a commit, but if the prior object hash was required to be a commit, then this is a valid unique identifier for the filesystem-tree contents of that commit. You could use the object hash instead of the full path within the commit hash, but that’s a pretty user-hostile way to go about this.
So, then, which part of the ordering and path selections do you consider indiscriminate, and why?
Query strings are more verbose as they force you to give each param a name.
edit: for instance, that specific VERBS.md is represented by the blob 3b9a46854589abb305ea33360f6f6d8634649108.
https://github.com/gritzko/beagle/a7e17290a39250092055fcda5ae7015868dabdb4/VERBS.md
this should be sufficient to represent the file.

"blob" is like a descriptor of the value that follows. It would be like doing this:
https://github.com/user/gritzko/project/beagle/blob/a7e17290a39250092055fcda5ae7015868dabdb4/file/VERBS.md
this actually irks me every time I see it in a github url

Except it's not, because the oid can be a short hash (https://github.com/gritzko/beagle/blob/a7e172/VERBS.md), and that means you're at risk of colliding with every other top-level entry in the repository, so you're restricting the naming of those top-level entries for no reason.
So namespacing git object lookups is perfectly sensible, and doing so with the type you're looking for (rather than e.g. `git` to indicate traversal of the git db) probably simplifies routing, and to the extent that it is any use makes the destination clearer for people reading the link.
Back when GitHub URLs were kind of cool, github.com/user/gritzko/project/beagle would have been much less cool than just github.com/gritzko/beagle.
They are not. There's just a routing layer below the repository.
Of course there's nothing to stop you using URIs like this (I think Angular does, or did at one point?) but I don't think the rules for relative matrix URIs were ever figured out and standardised, so browsers don't do anything useful with them.
For sites without Javascript, it's great for things like search boxes, tables with sorting/filtering, etc. instead of POST, since it preserves your query in the URL.
https://developer.mozilla.org/en-US/docs/Web/HTML/Reference/...
Or you could accept that you're probably going to need a round trip to the server and use a normal URL and it's fine.
For all but the absolute biggest websites in the world, anyhow. At Facebook or Google scale yeah it's needed.
So yes, query parameters existed before CGI, but to use them you had to hack your server to do something with them (IIRC NCSA web servers had some magic hacks for queries). CGI drove standardization.
import (
	"fmt"
	"net/http"
	"time"
)

// Respond 404 on Tuesdays, otherwise answer normally.
func specialHandler(w http.ResponseWriter, r *http.Request) {
	if time.Now().Weekday() == time.Tuesday {
		http.NotFound(w, r)
		return
	}
	fmt.Fprintln(w, "server made a decision")
}
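(Wiring it up is just the usual mux registration; the path and port here are arbitrary, only for the sketch:)

func main() {
	http.HandleFunc("/special", specialHandler)
	http.ListenAndServe(":8080", nil)
}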
Your server can make decisions however you program it to, you know? It's just software. Forgive the phone-posting.
Paths are hierarchical; query strings are name/value.
(Note I speak of common usage.)
You can create a different convention, but that one is pretty dang useful.
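A quick net/url illustration of that split (the URL itself is made up):

package main

import (
	"fmt"
	"net/url"
	"strings"
)

func main() {
	u, _ := url.Parse("https://example.com/books/fiction/dune?page=2&sort=year")

	// The path reads as a hierarchy: each segment narrows the one before it.
	fmt.Println(strings.Split(strings.Trim(u.Path, "/"), "/")) // [books fiction dune]

	// The query string is a flat bag of name/value pairs; order carries no meaning.
	fmt.Println(u.Query()) // map[page:[2] sort:[year]]
}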
How does this benefit the other website? How does this hurt the author's website?
I am completely confused about the behavior of both sides here.
I get that when I run an ad campaign I want Google to add a UTM query string, so I can track which campaign users arrived from, but then the origin and the destination are working together. Here the origin just adds stuff for no reason. Why?
Honestly, it is quite useful for niche/startup sites. I have been on both ends of conversations that began from seeing these in web analytics (as someone that saw incoming traffic from a site and reached out, and as someone that received contact from a site I linked to) - and both times it ended in a mutually beneficial partnership.
I can understand the privacy argument to some degree, but it provides no more information than the standard Referer header (and if you use analytics like Simple Analytics/Plausible, it is a lot more visible).
Why? Already getting traffic for free.
Query string additions are commonly used to track things. You can see that lots of people don't want that from the existence of Firefox features like “copy clean link” and Enhanced Tracking Protection, which proactively strips some of them, like UTM parameters.
Some sites happily participate in what I will glibly call the tracking economy. They may benefit because the recipient will see in their logs that lots of people are coming from their site, and do something that helps their site because of that.
My rejecting query strings is a simple form of protest against that system.
Some web pages don't send referrers by making all links rel="noreferrer". Mastodon used to do this by default, though now they've changed their stance.
Links opened from non-browser apps don't have any referrer information either. E.g. if somebody shares your link on iMessage, WhatsApp, or Telegram.
Email clients may also strip out referrers, but I'm not entirely sure about this one.
If people read your work via RSS readers, you'll almost certainly not get any referrers. Unless it's a web-based reader like Feedly.
My website gets a lot of traffic marked as "Direct / None" by Plausible. I suspect this is traffic from RSS readers or Mastodon, but I can't be sure. A few times I've considered adding a "?ref=RSS" to all URLs served to RSS readers and "?ref=Mastodon" to everything I post on Mastodon. But like the author of this post, I feel uncomfortable tracking my readers like this.
Back in the Stone Age, we called these “Webrings,” but they weren’t as fancy.
One of the issues that I faced while developing an open-source application framework was that hosting that used FastCGI would not honor Auth headers, so I was forced to pass the tokens in the query. It sucked, because that makes copy/paste of the web address a real problem: it would often contain tokens. I guess maybe this has been fixed?
In the backends that I control, and aren’t required to make available to any and all, I use headers.
So you were writing your application as a fcgi-app, and (e.g.) Apache was bungling Auth headers? Can you expand on this? Curious about the technical detail of (I guess) PARAM records not actually giving you what you expect?
I just remember the auth headers never showing up in the $_SERVER global (it was a PHP app). This was what I was told was the issue. They made it sound like it was well-known.
[1]: https://httpd.apache.org/docs/2.4/en/mod/core.html#cgipassau...
His site returns (I think incorrectly) a 414 if a request includes a query string. If this protest is meant to advocate for the user, who presumably wasn't able to manage that string in the first place, why would you penalize them for it being there?
Why not just use it as a cue to tell users how they can make this decision themselves (e.g. through browser tools)?
400 Bad Request, the generic client error code, which is correct but boring;
402 Payment Required, and honestly if you want to pay me to make a particular URL with query string work, I’m open to it;
404 Not Found, but it’s too likely to have side effects, and it doesn’t convey the idea that the request was malformed, which is what I’m going for; and
303 See Other with no Location header, which is extremely uncommon these days but legitimate. Or at least it was in RFC 2616 (“The different URI SHOULD be given by the Location field in the response”), but it was reworded in 7231 and 9110 in a way that assumes the presence of a Location header (“… as indicated by a URI in the Location header field”), while 301, 302, 307 and 308 say “the server SHOULD generate a Location header field”. Well, I reckon See Other with no Location header is fair enough. But URI Too Long was funnier.
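(Not how the site actually implements it, which is Caddy config, but a minimal net/http sketch of the behaviour described above, with a placeholder document root, might look like:)

package main

import (
	"net/http"
)

// rejectQueryStrings answers 414 URI Too Long for any request arriving
// with a query string (or even a bare trailing "?"), before the real
// handler sees it.
func rejectQueryStrings(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.RawQuery != "" || r.URL.ForceQuery {
			http.Error(w, "This site does not take query strings.", http.StatusRequestURITooLong)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	site := http.FileServer(http.Dir("./site")) // "./site" is a placeholder document root
	http.Handle("/", rejectQueryStrings(site))
	http.ListenAndServe(":8080", nil)
}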
https://chrismorgan.info/no-query-strings?foo

Obviously it's against the spirit of the thing, but I don't think it's wrong per se.
>Complain to whoever gave you the bad link, and ask them to stop modifying URLs, because it’s bad manners.
It's ironic that an error response so blatantly violating the robustness principle is throwing shade about bad manners.
In our modern world, the robustness principle has become an invitation to security bugs and vendor lock-in. Edge cases sneak through one system thanks to robustness, then trigger unfortunate behavior when they hit a different system. Two systems each try to do something reasonable with an ambiguous case, but do it differently, leading to software that works on one and fails on the other.
That said, we are paying a huge complexity cost due to our efforts to allow nonconforming pages. This complexity is widely abused by malicious actors. See, for instance, https://cheatsheetseries.owasp.org/cheatsheets/XSS_Filter_Ev... for ways in which attackers try to bypass security filters. A lot of it is only possible because of this unnecessary complexity.
Another option to consider is "418 I'm a teapot": teapots usually also don't support query strings
Several options which seem like they might be appropriate aren't on close examination:
- "406" ("Not Acceptable") which is based on content-negotiation headers.
- "409" ("Conflict") which is largely for WebDAV requests.
- Others such as 411, 422, and 431 are also for specific conditions which aren't relevant here.
- 300 or 500 errors are inappropriate as this isn't a relocation or server-side failure, it's a client-side request problem.
Teapot or too long seem best bets.
I’ve always used them in API servers when a client was POSTing to create a duplicate of a unique item.
I'm not making this up, btw. An old NOC I worked at emitted every error as 200 OK, with the real error in the body message. They were a real shitshow.
The technical purist: you’re modifying a URL in a way that, while in line with accepted custom, is technically incorrect. URLs should (the least effective type of should) basically be treated as opaque.
Social: it’s tracking stuff, sibling comment trees are good, I won’t reiterate.
Clutter: it’s getting in the way of the bit you should care about, and contributing to normal people no longer caring about URLs because they’re too hard, too complex.
There are a lot of reasons I might not want a site to know where I came from to get to their site. It is basically sharing your browsing history with the site you are visiting.
Because of this, there have been a lot of updates to the http referer header, with restrictions on when it is sent, and an ability to opt out of the feature entirely.
Adding a url parameter with the same information bypasses any of these existing rules and ability to opt out. They should just use the standard.
This is talking about links to third party sites, not your own.
Isn't this functionally the exact same?
You could simply throw the information away.
It's a ridiculously extreme stance and lacks proper explanation how this will lead to a better web.
They aren't saying the concept of query strings is bad. They're saying unsolicited query strings added during referral are the issue.
On a more personal note, I hate it when I go to copy a link to send via a message, and the tracking code glued onto it is twice as long as original URL... I either have to fiddle around with it to clean it up or leave the person I sent it to to wonder wtf am I on about with a screenful of random characters...
So it's violating users' privacy, it's shit UX, and on top of that, nobody asked for it...
Query strings are useful for way more than just tracking. Saving and servicing search queries is a way more common use case. So assuming it's only useful for tracking is very misleading.
Query strings are probably the least invasive tracking. They are transparent, obvious, and anonymous. Users are free to strip out and edit query strings if they don't want them.
More to the point, I can essentially do the same thing with HTTP routing - create an infinite number of unique URLs for tracking purposes. In that regard calling out query strings specifically for essentially the same thing but more transparently seems like splitting hairs.
Filters especially make sense as query params, as they are non-sequential but still visually readable as to what they do.
URL slugs make sense for sequential pages that are hierarchical, but make no sense for non-hierarchical data/routes.
Services can force tracking into links by encoding the whole URL into a shortlink, which makes it impossible to remove just the tracking, since everything is encoded into a shorter, non-editable string.
If I am handing out maps to your address, letting people know who is publishing the map is generally a good thing.
This is like saying having a return to sender address on mail is an invasion of privacy.
Instead of responding with an error, give a page that states “The link you followed to get here appears to have had some tracking gubbins added, in case you are a bot following arbitrary links, and/or using random URL additions to look like a more organic visit, please wait while we run a little PoW automaton deterrent before passing you on to the page you are looking for.” then do a little busy work (perhaps a real PoW thingy) before redirecting. Or maybe don't redirect directly, just output the unadorned URL for the user to click (and pass on to others). This won't stop the extra gubbins being added of course, but neither will the error and this inconveniences potential readers less.
Both are good but it seems fair to give priority to the original.
Facebook: no.
Pinterest: ?utm_source=Pinterest&utm_medium=organic.
ChatGPT: ?utm_source=chatgpt.com. (Aside: wow it’s confidently and atrociously wrong if you ask it about me. Ask it just vaguely enough, and it hallucinates someone clearly inspired by me, but who has done a whole lot of stuff that I haven’t. Ask it more precisely about me, and it gets all kinds of details wrong still. I feel further vindicated in hating this stuff. You made me use ChatGPT for the very first time.)
LinkedIn: no.
Twitter: no.
Reddit: no.
YouTube: no.
> if you get enough traffic that you can pick which sources you want to allow, that's a good problem to have.
Nah, I just don’t care about them. It’s my place, I’m doing things on my own terms. Should I discover it to be causing me problems, I’ll burn that bridge when I come to it.
Edit: Perhaps it only mangles links for logged-in users? That raises the possibility that some of the others may also only affect logged-in users.
(Trying with other ones I'm logged in on: Reddit doesn't mangle (obviously), Twitter doesn't mangle.)
https://chrismorgan.info/no-query-strings?
Never have I seen such a sassy web server
I noticed that his server also doesn't accept URLs ending in a single `/`: https://chrismorgan.info/no-query-strings/
But instead of the banned query strings message, it just returns a very sassy not-a-404 page. Once again, this is violating a common convention, but there's nothing in the HTTP spec that requires treating these URLs the same. Similarly the site also 404s when you add extra slashes like https://chrismorgan.info///no-query-strings
digression: I love trying "domain.com//" on various sites. Occasionally it'll trigger weird errors like a 502 or 500.
When dealing with static file servers:
For URLs that are supposed to include a trailing slash, where the server will find that directory and serve its index.html: it's customary, though not ubiquitous, to redirect from no-slash to slash. (Some, including popular commercial services, serve the index.html file instead of redirecting to add the slash. This is extremely wrong because it changes the meaning of relative URLs: a relative link like style.css resolves to /dir/style.css from /dir/ but to /style.css from /dir.)
But the other way round is not common.
My URLs don’t include a file extension, and I think that’s influencing your perception into thinking no-query-strings is logically a directory name. But it’s not, it’s logically a file name, just with the .html removed as unnecessary.
Take https://susam.net/no-query-strings.html as an example; Susam is more clearly just serving from a file system than I am, and leaves the “.html” file extension in the URL. Do you expect https://susam.net/no-query-strings.html/ to work? I hope not. It’s a 404, just as I’d expect, because there is no directory with that name.
> not-a-404 page
No, that’s a 404, just a plain old boring 404, same as any other. In fact, it’s the same 404 page I’ve been using since 2019, just with dark mode support added.
> extra slashes
Ah, now for that I had to go out of my way, because Caddy misbehaves out of the box: https://chrismorgan.info/Caddyfile#:~:text=%40has%5Fmultiple...
> digression: I love trying "domain.com//" on various sites.
Closely related is adding the trailing dot of a fully-qualified domain name: https://example.com./. I didn’t remember to try this on my new site, but it turns out Caddy won’t talk at https://chrismorgan.info./, so that’s probably good.