Posted by robin_reala 7 days ago
I worked on projects which refused to use anything more modern than XSLT & XPATH 1.0 because of lack of support in the non-Java/.NET world (1.0 = tech from 1999). Kudos to Saxon though, it was and is great, but I wish there were more implementations of XSLT 2.0 & XPATH 2.0 and beyond in the open source world... both are so much more fun and easier to use in 2.0+ versions. For that reason I've never touched XSLT 3.0 (because I stuck to Saxon B 9.1 from 2009). I have no doubt it's a great spec but there should be other ways than only Saxon HE to run it in an open source way.
It's like we have an amazing modern spec but only one browser engine to run it ;)
I will do some experiments with using newer XPATH on JSON... that could be interesting.
Nothing could compel me to like XSLT. I admire certain elements of its design, but in practice, it just seems needlessly verbose. But I really love XPath, though.
If your data is essentially a long piece of text, with annotations associated with certain parts of that text, this is where XML shines.
When you try to use XML to represent something like an ecommerce order, financial transaction, instant message and so on, this is where you start to see problems. Trying to shove some extremely convoluted representation of text ranges and their attributes into JSON is just as bad.
A good "rule of thumb" would be "does this document still make sense if all the tags are stripped, and only the text nodes remain?" If yes, choose XML, if not, choose JSON.
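A rough way to try that rule of thumb, as a minimal Python sketch using only the standard library (both sample documents are made up for illustration):

```python
import xml.etree.ElementTree as ET

def strip_tags(xml_text):
    """Return only the text nodes of a document, in document order."""
    return "".join(ET.fromstring(xml_text).itertext())

# Hypothetical samples: one is marked-up prose, one is a record.
prose = strip_tags("<p>XML <em>shines</em> for annotated text.</p>")
order = strip_tags("<order><id>1042</id><sku>A-7</sku><qty>3</qty></order>")

print(prose)
print(order)
```

The first result still reads as a sentence. The second collapses into a run of values with nothing to tell the fields apart, which suggests the order is data rather than marked-up text.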
<tag type="tag" class="tag" purpose="tag" tag_subtype="empty" description="this is a emptytag, a subtype of tag" empty="true"></tag>
Now, that's not perfect, I would even describe it as minimalist, but I hope it sets you in the right direction!
Well it can't: JSON has no processing instructions, no references, no comments, JSON "numbers" are problematic, and JSON arrays can't have attributes, so you're stuck with some kind of additional protocol that maps the two.
For something that is basically text (like an HTML document) or a list of dictionaries (like RSS) it may not seem obvious what the value of these things are (or even what they mean, if you have little exposure to XML), so I'll try and explain some of that.
1. Processing instructions are like <?xml?> and <?xml-stylesheet?> -- these let your application embed linear processing instructions that you know are for the implementation, and so you know what your implementation needs to do with the information: If it doesn't need to do anything, you can ignore them easily, because they are (parsewise) distinct.
2. References (called entities) are created with <!ENTITY x ...> and then you use them as &x; maybe you are familiar with &lt; representing < but this is not mere string replacement: you can work with the pre-parsed entity object (for example, if it's an image), or treat it as a reference (which can make circular objects possible to represent in XML) neither of which is possible in JSON. Entities can be behind an external URI as well.
3. Comments are for humans. Lots of people put special {"comment":"xxx"} objects in their JSON, so you need to understand that protocol and filter it. They are obvious (like the processing instructions) in XML.
4. JSON numbers fold into floats of different sizes in different implementations, so you have to avoid them in interchange protocols. This is annoying and bug-prone.
5. Attributes are the things on xml tags <foo bar="42">...</foo> - Some people map this in JSON as {"bar":"42","children":[...],"tag":"foo"} and others like ["foo",{"bar":"42"},...] but you have to make a decision -- the former may be difficult to parse in a streaming way, but the latter creates additional nesting levels.
None of this is insurmountable: You can obviously encapsulate almost anything in almost anything else, but think about all the extra work you're doing, and how much risk there is in that code working forever!
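To make point 5 concrete, here is a small Python sketch of the two mappings mentioned there. The function names `as_object` and `as_array` are my own, and the sample element is made up; neither shape is a standard.

```python
import xml.etree.ElementTree as ET

def _children(elem, convert):
    """Collect text and child elements in document order."""
    out = [elem.text] if elem.text else []
    for child in elem:
        out.append(convert(child))
        if child.tail:
            out.append(child.tail)
    return out

def as_object(elem):
    """Shape 1: attributes sit next to reserved "tag" and "children" keys."""
    return {**elem.attrib, "tag": elem.tag, "children": _children(elem, as_object)}

def as_array(elem):
    """Shape 2: [tag, attribute-dict, child, child, ...]."""
    return [elem.tag, dict(elem.attrib)] + _children(elem, as_array)

elem = ET.fromstring('<foo bar="42">hi<baz/></foo>')
print(as_object(elem))
print(as_array(elem))
```

Note that in the first shape an XML attribute that happens to be called `tag` or `children` would collide with the reserved keys, which is exactly the kind of side protocol both parties would have to agree on.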
For me: I process financial/business data mostly in XML, so it is very important I am confident my implementation is correct, because shit happens as the result of that document getting to me. Having the vendor provide a spec any XML software can understand helps us have a machine-readable contract, but I am getting a number of new vendors who want to use JSON, and I will tell you their APIs never work: They will give me openapi and swagger "templates" that just don't validate, and type-coding always requires extra parsing of the strings the JSON parsing comes back with. If there's a pager interface: I have to implement special logic for that (this is built-in to XML). If they implement dates, sometimes it's unix-time, sometimes it's 1000x off from that, sometimes it's a ISO8601-inspired string, and fuck sometimes I just get an HTTP date. And so on.
So I am always finding JSON that I wish were XML, because (in my use-cases) XML is just plain better than JSON, but if you do a lot in languages with poor XML support (like JavaScript, Python, etc) all of these things will seem hard enough you might think json+xyz is a good alternative (especially if you like JSON), so I understand the need for stuff like "xee" to make XML more accessible so that people stop doing so much with JSON. I don't know rust well enough to know if xee does that, but I understand fully the need.
Okay. This is syntactically painful, APL or J tier. C++ just uses "&" to indicate a reference. That's a lot of people's issue with XML, you get the syntactic pain of APL with the verbosity pain of Java.
> I have to implement special logic for that (this is built-in to XML). If they implement dates, sometimes it's unix-time, sometimes it's 1000x off from that, sometimes it's a ISO8601-inspired string, and fuck sometimes I just get an HTTP date. And so on.
Special logic is built into every real-world programming scenario ever. It just means the programmer had to diverge from ideal to make something work. Unpleasant but vanilla and common. I don't see how XML magically solved the date issue forever. For example, I could just toss in <date>UNIXtime</date> or <date time=microseconds since 1997>324234234</date> or <datecontainer><measurement units="femtoseconds since 1776"><value>3234234234234</value></measurement></datecontainer>. The argument seems to be "ah yes, but if everyone uses this XML date feature it's solved!" but not so. It's a special case of "if everyone did the same thing, it would be solved". But nobody does the same thing.
Most protocols are used by exactly two parties; I meet someone who wants to have their computer talk to mine and so we have to agree on a protocol for doing so.
When we agree to use XML, they use that exact date format because I just ask for it. If someone wanted me to produce some weird timestamp-format, I'd ask for whatever xslt they want to include in the payload.
When we agree to use JSON, the schema says integers, the email says "unix time", in integration testing we discover it's "whatever Date.now() says" and a few months later I discover their computer doesn't know the difference between UTC and GMT.
Also: I like APL.
You complain about dates in JSON (really a specific case of parsing text in JSON):
> If they implement dates, sometimes it's unix-time, sometimes it's 1000x off from that, sometimes it's a ISO8601-inspired string, and fuck sometimes I just get an HTTP date. And so on.
Sure, but doesn't XML have the exact same problem, because everything is just text?
No, you can specify what type an attribute (or element) is in the XSD (for example, xs:dateTime or xs:date). And there is only one way to specify a date in XML, and it's ISO8601. Of course JSON schema does exist, but it's mostly an afterthought.
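A small Python sketch of the difference, under the assumption that the sender uses an xs:dateTime-style string with an explicit offset. Python's `datetime.fromisoformat` is not a full xs:dateTime parser, so treat this as an illustration only; the timestamp value is made up.

```python
from datetime import datetime, timezone

# This value spells out its UTC offset, so there is only one way to read it.
stamp = datetime.fromisoformat("2025-03-30T00:00:00+00:00")

# A bare integer does not say whether it counts seconds or milliseconds.
raw = 1743292800
as_seconds = datetime.fromtimestamp(raw, tz=timezone.utc)
as_millis = datetime.fromtimestamp(raw / 1000, tz=timezone.utc)

print(stamp == as_seconds)  # the seconds reading matches the ISO string
print(as_millis.year)       # the milliseconds reading lands in 1970
```

Both integer readings are valid dates, so nothing fails loudly when the two sides disagree about the unit.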
But to my mind, whether they have a well-defined schema and follow proper datatypes really has very little to do with the choice of XML or JSON.
XML can really shine in the markup role. It got such a bad rap because people used it as a pure data format, something it isn't very suited for.
or:
`{paragraphs: [{spans: [{text: "How would you represent "}, {bold: true, italic: true, text: "mixed content"}, {text: " in JSON?"}]}]}`
I'll fully grant, I don't want to write a document by hand in either of the JSON formats I suggested. Although, given the choice, I'd rather receive it in that format to be parsed than in any XML format.
["p", "How would you represent ", ["b", ["i", "mixed content"]], " in JSON?"]
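For what it's worth, the nested-array form above can be produced mechanically from the XML. This is a minimal Python sketch with a function name of my own choosing (`to_nested`); it ignores attributes entirely, which a real mapping would have to handle somehow.

```python
import xml.etree.ElementTree as ET

def to_nested(elem):
    """Convert an element to ["tag", text-or-child, ...], keeping text order."""
    node = [elem.tag]
    if elem.text:
        node.append(elem.text)
    for child in elem:
        node.append(to_nested(child))
        if child.tail:  # text that follows the child inside the parent
            node.append(child.tail)
    return node

doc = ET.fromstring(
    "<p>How would you represent <b><i>mixed content</i></b> in JSON?</p>"
)
nested = to_nested(doc)
print(nested)
```

The `tail` handling is the part that is easy to get wrong: the text after `</b>` belongs to the paragraph, not to the bold element.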
```js
import * as fastXmlParser from 'fast-xml-parser';
const xmlParser = new fastXmlParser.XMLParser({ ignoreAttributes: false });
```
Validate input as required with jschema.
For some things that may just be down to how uses are specified. For YANG, the spec calls out XPath 1.0 as the form in which constraints (must and when statements) must be expressed.
So one is forced to learn and use XPath 1.0.
The obvious solution is streaming, but streaming appears to not be supported, though is listed under Challenging Future Ideas: https://github.com/Paligo/xee/blob/main/ideas.md
How hard is it to implement XML/XSLT/XPATH streaming?
There's a potential alternative to streaming, though - succinct storage of XML in memory:
https://blog.startifact.com/posts/succinct/
I've built a succinct XML library named Xoz (not integrated into Xee yet):
The parsed in memory overhead goes down to 20% of the original XML text in my small experiments.
There's a lot of questions on how this functions in the real world, but this library also has very interesting properties like "jump to the descendant with this tag without going through intermediaries".
May I ask why? I used to do a lot of XSLT in 2007-2012 and stuck with XSLT 2.0. I don't know what's in 3.0 as I've never actually tried it, but I never felt there was some feature missing from 2.0 that prevented me from doing something.
As for streaming, an intermediary step would be the ability to cut up a big XML file in smaller ones. A big XML document is almost always the concatenation of smaller files (that's certainly the case for Wikipedia for example). If one can output smaller files, transform each of them, and then reconstruct the initial big file without ever loading it in full in memory, that should cover a huge proportion of "streaming" needs.
I suspect with the right FM-Index Xoz might be able to store huge documents in a smaller size than the original, but that's an experiment for the future.
With modern SSDs and disk cache, that's likely enough to be plenty performant without having to store the whole document in memory at once.
It's actually quite annoying in the general case. It is completely possible to write an XPath expression that says to match a super early tag on an arbitrarily-distant further tag.
In another post in this thread I mention how I think it's better to think of it as a multicursor, and this is part of why. XPath doesn't limit itself to just "descending", you can freely cursor up and down in the document as you write your expression.
So it is easy to write expressions where you literally can't match the first tag, or be sure you shouldn't return it, until the whole document has been parsed.
Saxon's paid edition supports it. I've done it a few times, but you have to write your XSLT in a completely different way to make it work.
There are some fast databases that store prefix trees, which might be suitable for such a task actually (something like infinitydb). But building this database will basically take a while (it will require parsing the entire document). But I suppose if reading/querying is going to happen many times, it's worth it?
If the payloads in question are in that range, the time spent to support streaming doesn't feel justified compared to just using a machine with more memory. Maybe reducing the size of the parsed representation would be worth it though, since that benefits nearly every use case
In any case I don't have $1500 to blow on a new computer with 100GB of ram in the unsubstantiated hope that it happens to fit, just so I can play with the Wikipedia data dump. And I don't think that's a reasonable floor for every person that wants to mess with big xml files.
You can read the document as a streaming text source and split it into chunks based on matching pairs of "<page>" and "</page>" with a simple state machine. Then you can stream those single-page documents to an XML parser without worrying about document size. This of course doesn't apply in the general case where you are processing arbitrary huge XML documents.
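A related approach, sketched in Python: instead of the hand-written state machine described above, this uses the standard library's incremental parser (`xml.etree.ElementTree.iterparse`) and throws away each page once it has been handled. The helper name `iter_pages` and the tiny sample are made up; in a real dump the tag carries an XML namespace, so the comparison would need the qualified name.

```python
import io
import xml.etree.ElementTree as ET

def iter_pages(stream, tag="page"):
    """Yield each completed <page> element, then release what was parsed."""
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem
            root.clear()  # drop already-processed children from the root

# Stand-in for a large dump file opened in binary mode.
sample = io.BytesIO(
    b"<mediawiki><page><title>A</title></page>"
    b"<page><title>B</title></page></mediawiki>"
)
titles = [page.findtext("title") for page in iter_pages(sample)]
print(titles)
```

Memory use stays roughly proportional to the largest single page rather than to the whole file, as long as the caller does not hold on to the yielded elements.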
I have processed Wikipedia many times with less than 8 GB of RAM.
In the English Wikipedia the wikitext accounts for about 80% of the bytes of the decompressed XML dump.
Not sure how common 100GB files are but I can certainly imagine that being the norm in certain niches.
I hated it the minute I learned about it, because it missed something I knew I cared about, but didn’t have a word for in the 90s - developer ergonomics. XML sucks shit for someone who wants to think tersely and code by hand. Seriously, I hate it with a fiery passion.
Happily to my mind the economics of easier-for-creators -> make web browsers and rendering engines either just DEAL with weird HTML, or else force people to use terse data specs like JSON won out. And we have a better and more interesting internet because of it.
However, I’m old enough now to appreciate there is a place for very long-standing standards in the data and data transformation space, and if the XML folks want to pick up that banner, I’m for it. I guess another way to say it is that XML has always seemed to be a data standard which is intended to be what computers prefer, not people. I’m old enough to welcome both, finally.
On one hand, you aren't wrong: XML has in fact been used for machine-to-machine communication mostly. OTOH, XML was just introduced as a subset of SGML doing away with the need of vocabulary-specific markup declarations for mere parsing in favor of always requiring explicit start- and end-element tags. Whereas HTML is chock full of SGMLisms such as tag inference (for example inferring paragraph ends on block elements), empty ("self-closing") elements and enumerated ("boolean") attributes driven by per-element declarations.
One can argue to death whether the web should work as a mere document delivery network with rigid markup a la XML, or that browsers should also directly support SGML authoring idioms such as the above shortform mechanisms. SGML also has text macros/shared fragments (entities) and even allows defining own parsing tokens for markdown, math, CSV, or custom syntaxes. HTML leans towards SGML in that its documentation portrays HTML as an authoring language, but browsers are lacking even in basic SGML features such as entities.
I do wonder what web application markup would look like today if designed from scratch. It is kind of amazing that HTML and CSS can be used for creating beautiful documents viewable on pretty much any device with a screen AND also for creating dynamic applications with pixel-perfect rendering, special effects, integrations with the device’s hardware, and even external peripherals.
If there was ever scope creep in a project this would be it. And given the recent discussion on here of curses based interfaces it reminded me just how primitive other GUI application layout tools can be while still achieving amazing results. Even something like GTK does not need the intense level of layout engine support and yet is somehow considered richer in some ways and probably more performant for a lot of stuff that’s done with it.
So I am curious what web application development would look like today if it wasn’t for HTML being “good enough”.
We just couldn't keep apps' hands out of the cookie jar back then.
I will remind you that Swing sucked, hard in the 1990s. And HTML was pretty easy to write.
Suggesting something like HTML would have you laughed out of the room.
This is only the case because the BigTech view is one of an application platform.
I wish someone would write "XML - The Good Parts".
Others might argue that this is JSON but I'd disagree:
- No comments is a non-starter
- No proper integers
- No date format
- Schema validation is a primitive toy compared to what we had for XML
- Lack of allowed trailing commas
YAML ain't better. I hated whitespace handling in XML, it's a miracle how YAML could make it even worse.
XML is from era long past and I certainly don't want to go back there, but it had its good parts and I feel we have not really learned a lot from its mistakes.
In the end maybe it is just that developer ergonomics is largely a matter of taste and no language will ever please everyone.
I know it's passé in the web dev world, but in my work we still work with XML all the time. We even have work in our queue to add support for new data sources built on XML (specifically QIF https://qifstandards.org/).
It's fine with me... I've come to like XML. It's nice to have a standard, easy way to do schemas, validators, processors, queries, etc. It can be overdone and it's not for every use case, but it's pretty good at what it does.
In my military work, I've heard the senior project managers refer to a modern battleship as a floating XML document.
That is because the web dev world is unfortunately obsessed with the current thing. They chase trends like their lives depend on it.
Again not great for bigger documents.
what I really dread in XML though is that XML only has idref/id standardized, and no path references. so without tool support you can't navigate to a reference target.
which turns XML into the "binary" format for GUI tools.
Maybe, but XML tools are also just superior to JSON counterparts. XPath is fantastic, and so is XSD and XSLT. I also quite like the integration with .NET.
My general experience with JSON as a configuration language has been sad. It's a step back from XML in a lot of ways.
however most oss builds on libxml which is stuck in xslt 1 and xpath 1 iirc...
I recently stumbled upon James Clark's youngest brainchild the ballerina language.
https://en.wikipedia.org/wiki/James_Clark_(programmer)#Caree...
When was the last time you had an editor that wouldn't just auto close the current tag with "</" ? I mean it's a god-send for knowing where you are at in large structure. You aren't scrolling to the top to find which tag you are in.
Interesting take, but I'm always a little hesitant to accept any anthropomorphizing of computer systems.
Isn't it always about what we can reason and extrapolate about what the computer is doing? Obviously computers have no preference so it seems like you're really saying
"XML is a poor abstraction for what it's trying to accomplish" or something like that.
Before jQuery, chrome, and web 2.0, I was building xslt driven web pages that transformed XML in an early nosql doc store into html and it worked quite beautifully and allowed us to skip a lot of schema work that we definitely weren't ready or knowledgeable enough to do.
EDIT: It was the perfect abstraction and tool for that job. However the application was very niche and I've never found a person or team who did anything similar (and never had the opportunity to do anything similar myself again)
In fact the RSS reader I built still uses XSLT to transform the output to HTML as it’s just the easiest way to do so (and can now be done directly in the browser).
Names withheld to protect the guilty. :)
That was a huge reason JSON took over.
Another reason was the overall XML ecosystem grew unwieldy and difficult to navigate: XPath, XSLT, SOAP, WSDL, XPointer, XLink, XForms... They all made sense in their own way, but it was difficult to master them all. That complexity, plus the poor ergonomics, is what paved the way for JSON to become preferred.
I suspect it was SOAP and WSDL that killed it for a lot of people though. That was a typical example of a technical solution looking for a problem and complete overkill for most people.
The whole namespace thing was probably a step too far as well.
<greeting attr="val" href="#">Hello <thing>world</thing></greeting>
(greeting ((attr "val") (href "#")) "Hello " (thing "world"))
{:tag :greeting
:attrs {:href "#" :attr "val"}
:content ["Hello" {:tag :thing :content ["world"]}]}
Some people use namespaced keywords (e.g. :xml/tag) to help disambiguate keys in the map. This kind of data structure tends to be more convenient than dealing with plain sexps or so-called "Hiccup syntax". i.e. [:greeting {:href "#" :attr "val"} "Hello" [:thing "world"]]
The above syntax is convenient to write, but it's tedious to manipulate. For instance, one needs to dispatch on types to determine whether an element at some index is an attribute map or a child. By using the former data structure, one simply looks up the :attrs or :content key. Additionally, the map structure is easier to depth-first search; it's a one-liner with the tree-seq function.
I've written a rudimentary EPUB parser in Clojure and found it easier to work with zippers than any other data structure to e.g. look for <rootfile> elements with a <container> ancestor.
Zippers are available in most programming languages, thankfully, so this advantage is not really unique to Clojure (or another Lisp). However, I will agree that something like sexps (or Hiccup) is more convenient than e.g. JSX, since you are dealing with the native syntax of the language rather than introducing a compilation step and non-standard syntax.
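For readers who don't use Clojure, a rough Python analogue of the map-shaped tree and a depth-first walk over it. This is only a sketch: `tree_seq` here is my own generator that mimics the idea, not a port of Clojure's function, and the dictionary keys simply mirror the map shown earlier.

```python
def tree_seq(node):
    """Yield a node and then all of its descendants, depth-first."""
    yield node
    if isinstance(node, dict):
        for child in node.get("content", []):
            yield from tree_seq(child)

tree = {
    "tag": "greeting",
    "attrs": {"href": "#", "attr": "val"},
    "content": ["Hello ", {"tag": "thing", "content": ["world"]}],
}

# Text nodes are plain strings, so filter on dicts to get the elements.
tags = [n["tag"] for n in tree_seq(tree) if isinstance(n, dict)]
print(tags)
```

Because attributes and children live under fixed keys, the walk never has to guess whether a given position holds an attribute map or a child.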
Racket has helper libraries like TxExpr (https://docs.racket-lang.org/txexpr/index.html) that make it pretty easy to manipulate S-expressions of this kind.
As in, I don't see a difference between `(attr "val")` which expresses an attribute key/value pair and `(thing "world")` which expresses a tag/content relationship. Even if I thought the rule might be "if the first element of the list is a list itself then it should be interpreted as a set of attribute key value pairs" then I would still be ambiguous with:
(foo (bar "baz") "content")
which could serialize to either: <foo bar="baz">content</foo>
or: <foo><bar>baz</bar>content</foo>
In fact, this ambiguity between attributes and children has always been one of the head scratching things for me about XML. Well, the thing I've always disliked the most is namespaces but that is another matter.
See a grammar for the representation at https://docs.racket-lang.org/xml/index.html#%28def._%28%28li...
Most Scheme tools for working with XML use a different layout where a list starting with the symbol @ indicates attributes. See https://en.wikipedia.org/wiki/SXML for it.
(foo (bar "baz") "content")
vs (foo ((bar "baz")) "content")
Where the first one would be the nested tags and the second one would be a single `bar="baz"` attribute.
I would prefer the differentiation to be more explicit than the position and/or structure of the list, so the @ symbol modifier for the attribute list in other tools makes sense.
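The positional convention can be sketched in Python with nested lists standing in for S-expressions. The serializer below (`to_xml`, a name I made up) treats a leading list whose items are all lists as the attribute list; it is an illustration of the rule being discussed, not Racket's actual implementation.

```python
from xml.sax.saxutils import escape, quoteattr

def to_xml(node):
    """Serialize [tag, optional-attribute-list, children...] to XML text."""
    if isinstance(node, str):
        return escape(node)
    tag, *rest = node
    attrs = ""
    # A leading list made up only of lists is read as (name value) pairs.
    if rest and isinstance(rest[0], list) and all(isinstance(p, list) for p in rest[0]):
        attrs = "".join(f" {name}={quoteattr(value)}" for name, value in rest[0])
        rest = rest[1:]
    return f"<{tag}{attrs}>" + "".join(to_xml(child) for child in rest) + f"</{tag}>"

nested_tags = to_xml(["foo", ["bar", "baz"], "content"])
one_attribute = to_xml(["foo", [["bar", "baz"]], "content"])
print(nested_tags)
print(one_attribute)
```

The only thing separating the two readings is one extra level of brackets, which is why an explicit marker such as `@` is easier to live with.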
The sibling comment with a map with a :attrs key feels even better. I don't work in languages with pattern matching or that kind of thing very often, but if I was wanting to know if a particular element had 1 or more attributes then being able to check a dictionary key just feels like a nicer kind of anchor point to match against.
Just remember that it's a markup language, and then it's not head-scratching at all: the text is the text being marked up, and the attribute values are the attribute of the markup - things like colour and font.
When it was co-opted to store structured data, those people didn't obey this rule (which would make everything attributes).
Namespaces had a very cool use in XHTML: you could just embed an SVG or MathML directly in your HTML and the browser would render it. This feature was copied into HTML5.
I mean, if I'm modeling a <Person> node in some structured format, making a decision about "what is the attribute of the person node" vs "what is a property of the specific Person" isn't an easy call to make in all cases. And then there are cases where an attribute itself ought to have some kind of hierarchy. Even the text example works here: I have a set of font properties and it would make sense to maybe have:
<font>
<color>...</color>
<family>...</family>
</font>
Rather than a series of `fontFamily`, `fontSize`, etc. attributes. This is true when those attributes are complex objects that ended up having nesting at several levels. You end up in the circumstance where you are forced to make things that ought to be attributes into children because you want to model the nested structure of the attributes themselves. Then you end up with some kind of wrapper structure where you might have a section for meta-data and a section for the real content.
I just don't think the distinction works well for an extensible markup language where the nesting of elements is more or less the entire point.
It is much easier to write out though, which is why you often see `<Element content=" ... " />` patterns all over the place.
There's something very zen-like with this language; you put a document in a kind of sieve and out comes a "better" document. It cannot fail; it can be wrong, full of errors, of course (although if you're validating the result against a schema it cannot be very wrong); but it will almost never explode in your face.
And then XSLT work kind of disappeared; I miss it a lot.
There, I said it.
At the risk of glibly missing the main point of your comment, take a look at KDL. Unlike JSON/TOML/YAML, it features XML-style node semantics. Unlike XML, it's intended to be human-readable and writeable by hand. It has specifications for both a query language and a schema language as well as implementations in a bunch of languages. https://kdl.dev/
XML gives you an object soup where text objects can be anywhere and data can be randomly stored in tags or attributes.
It just doesn't at all match the object model used by basically all programming languages.
I think that's a big reason JSON is so successful. It's literally the object model used by JavaScript. There's no weird impedance mismatch between the data represented on disk and in your program.
Then someone had to go and screw things up with YAML...
JSON5 is the way.
A couple years ago, I stumbled on a discussion considering deprecation/removal of XSLT support in Chrome. At some point in the discussion, they mentioned observing a notable uptick in usage—enough of an uptick (from a baseline of approximately zero) that they backed out.
The timing was closely correlated with work I’d done to adapt a library, which originally used XSLT via native Node extensions, to browser XSLT APIs. The project isn’t especially “popular” in the colloquial sense of the term, but it does have a substantial niche user base. I’m not sure how much uptake the browser adaptation of this library has had since, but some quick napkin math suggested it was at least plausible that the uptick in usage they saw might have been the onslaught of automated testing I used to validate the change while I was working on it.
Also, is it up to browser implementations, or does WHATWG expect browsers to stay at version XSLT 1?
XSLT 2 and 3 are W3C standards written by the sole commercial provider of an XSLT 2 or 3 processor, which is problematic not only because it reduces W3C to a moniker for pushing sales, but also because it undermines W3C's own policy of at least two interworking implementations for a spec to get "recommendation" status.
XSLT is of course a competent language for manipulating XML. It would be a good fit if your processing requires lots of XML literals/fragments to be copied into your target document since XSLT is an XML language itself. Though OTOH it uses XPath embedded in strings excessively, thereby distrusting XML syntax for a core part of its language itself, and coding XPath in XML attributes can be awkward due to restrictive contextual encoding rules for special characters and such.
XSLT can be a maintenance burden if used casually/rarely, since revisiting XSLT requires substantial relearning and time investment due to its somewhat idiosyncratic nature. IDE support for discovery, refactoring, and test automation etc. is lacking.
I bought into and still believe in the separation of data and its presentation. This a place where XML/XSLT was very awesome despite some poor ergonomics.
An RSS XML document could live at an endpoint and contain in-line comments, extra data in separate namespaces, and generally be really useful structured data for any user agent or tool to ingest. An RSS reader or web spider could process the data directly, an XSLT stylesheet could let a web browser display a nice HTML/CSS output, and any other tools could use the data as well. Even better any user agent ingesting the XML could use in-built tools to validate the document.
XSLT to convert an XML feed to pretty HTML is a great example of the utility. Browsers have fast built-in conversion engines and the resulting HTML produced has all the normal capabilities of HTML including CSS and JavaScript. To the uninitiated: the XML feed just links to an external XSL stylesheet, when a web browser fetches the XML it grabs the stylesheet and transforms the XML to an HTML (or XHTML) representation that's then fed back into the browser.
A feed reader will fetch the XML and process it directly as RSS data and ignore the stylesheet. Some other user agent could fetch the XML and ignore its linked stylesheet but provide its own to process the RSS data. Since the feed has a declared schema pretty much any stylesheet written to understand that schema will work. For instance you could turn an RSS feed into a PDF with XSLT.
Why the "as if"? Isn't it a web page by definition?
I was following the W3C XSLT mailing list for quite some time back when they were doing 3.x, and this does not strike me as accurate.
The big place I've successfully used XSLT was in TEI, which nobody outside digital humanities uses. Even then, the XSLT processing is usually minimal, and Javascript is going to do a lot of work that XSL could have done.
It's another proof that working on fundamental tools is a good thing.
In my very opinionated opinion, XPath is about 99% of the value of XSLT, and XSLT itself is a misfire. Embedding an XML language in XML, rather than being an amazing value proposition, is actually a huge and really annoying mistake, in much the same way and for much the same reason as anyone who has spent much time around shell scripting has found trying to embed shell strings in shell strings (and, if the situation is particularly dire, another third or fourth level of such nesting) is quite unpleasant. Imagine trying to deal with bash, except you have to first quote all the command lines as bash strings like you're using bash -c, all the time. I think "XPath + your favorite language" has all the power of XSLT and, generally, better ergonomics and comprehensibility. Once you've got the selection of nodes in hand, a general-purpose programming language is a better way to deal with their contents than what XSLT provides. Hence why it has always languished.
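As one illustration of "XPath + your favorite language", here is a Python sketch that turns a feed into an HTML list. It assumes a made-up RSS 2.0 sample, and it uses the standard library's ElementTree, which only supports a limited subset of XPath; the function name `feed_to_html` is hypothetical.

```python
import html
import xml.etree.ElementTree as ET

# Hypothetical feed, kept tiny for the example.
RSS = """<rss version="2.0"><channel><title>Example feed</title>
<item><title>First &amp; foremost</title><link>https://example.com/1</link></item>
<item><title>Second</title><link>https://example.com/2</link></item>
</channel></rss>"""

def feed_to_html(rss_text):
    """Select the items with a path expression, then build HTML in plain code."""
    root = ET.fromstring(rss_text)
    rows = [
        f'<li><a href="{html.escape(item.findtext("link", ""), quote=True)}">'
        f'{html.escape(item.findtext("title", ""))}</a></li>'
        for item in root.findall("./channel/item")
    ]
    return "<ul>" + "".join(rows) + "</ul>"

page = feed_to_html(RSS)
print(page)
```

The selection is one path expression; everything after that is ordinary string handling, which is roughly the trade-off being argued for here.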
Basically the only thing it's missing in XQuery vs XSLT is template rules and their application; but IMO simple ones are just as easy to write explicitly, and complex rulesets are hard to reason about and maintain anyway.
For cases where a host system wants to execute user-defined data transformations safely, XSLT seems like it might be useful. When they mature, maybe WASM and WASI will fill the same niche with better developer ergonomics?
Using an XML library to iterate through an entire XML document without XPATH is like looping through entire database tables without a JOIN filter or a WHERE clause.
XSLT is the SELECT, transforming XML output with a new level of crazy for recursion.
Let's look at JSON by comparison. Hmm, let's see: JSONPath, JMESPath, jq, jsonql, ...
The disadvantage is that it's not easily embeddable in your own programs - so programs use JSONPath / Go templates often.
I too am probably going to embed jmespath in my app. I need it to allow users to fill CLI flags from config files, and it'll replace my crappy homegrown version ( https://github.com/bbkane/warg/blob/740663eeeb5e87c9225fb627... )
- the error messages are night and day better than C-jq
- the magick of $(gojq --yaml-input), although I deeply abhor that it is 10 characters longer than "-y"
It's worth mentioning https://github.com/01mf02/jaq (MIT) because it actually strives to be an implementation of the specification versus just "execute better" as gojq does
I said goodbye to it a few weeks ago, personally (https://world-playground-deceit.net/blog/2025/03/a-common-li... https://world-playground-deceit.net/blog/2025/03/speeding-up...)
Oh, yeah, I 100% want to type this 15 times a day
# I'll grant you the imports, in the spirit of fairness
aws ec2 describe-instances | python -c '
for r in json.load(sys.stdin)["Reservations"]:
    print("\n".join(i["PrivateIpAddress"] for i in r["Instances"]))
'
because that is undoubtedly better than aws ec2 describe-instances | jq -r '.Reservations[].Instances[].PrivateIpAddress'
I mean, seriously, who can read that terrible DSL with all of its line noise
> The query part isn't even that well done, compared to XPath/JSONPath.
XPath I'll grant you, because it's actually very strong but putting JSONPath near jq in a "could be improved" debate tells me you're just not serious. JSONPath is a straight up farce
2) I didn't say "replace jq with Python", but that the language part (not the functions) of jq is horrible and didn't need to be invented. Same as Avisynth vs Vapoursynth.
3) I only mentioned Python as an example, I wouldn't choose a language that needs newlines in this specific case.
For example, in my own case, this expression is `cljq '(? $ "Reservations" * "Instances" * "PrivateIpAddress")'` and if I need to do more complicated stuff (map, filter, group, etc...), I use CL instead of a bespoke, ad-hoc DSL that I never remember.
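For anyone who wants to see what a wildcard path like that amounts to in a general-purpose language, here is a rough Python sketch. It is not how cljq works (I have not read its source); the query function, the "*" convention, and the sample data are assumptions made up for illustration.

```python
import json

WILDCARD = "*"

# Hypothetical, heavily trimmed stand-in for `aws ec2 describe-instances` output.
SAMPLE = json.loads("""
{"Reservations": [
  {"Instances": [{"PrivateIpAddress": "10.0.0.1"},
                 {"PrivateIpAddress": "10.0.0.2"}]},
  {"Instances": [{"PrivateIpAddress": "10.0.1.7"}]}
]}
""")


def query(doc, *path):
    """Walk `path` through nested dicts/lists; "*" fans out over every child."""
    current = [doc]
    for step in path:
        matched = []
        for node in current:
            if step == WILDCARD:
                # Fan out: all values of a dict, or all items of a list.
                if isinstance(node, dict):
                    matched.extend(node.values())
                elif isinstance(node, list):
                    matched.extend(node)
            elif isinstance(node, dict) and step in node:
                matched.append(node[step])
            elif (isinstance(node, list) and isinstance(step, int)
                  and -len(node) <= step < len(node)):
                matched.append(node[step])
            # Anything else simply does not match and is dropped.
        current = matched
    return current


if __name__ == "__main__":
    ips = query(SAMPLE, "Reservations", "*", "Instances", "*", "PrivateIpAddress")
    print("\n".join(ips))
```

Once the result is a plain list, map, filter, and grouping are just the host language's own tools, which is the argument being made above.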
> JSONPath is a straight up farce
Why? At least it's specified (though you might say that jq is too, through jaq).
Another example: if you have very large XML documents that don't even fit into memory, you can still stream-process them with XSLT.
It makes you the master of XML transformations and fetching information out of complex XML ;)
response.xpath('//div[contains(@data-foo, "foo")]').css(".some-class").re(r"[a-z][a-zA-Z]+")
The .css() flavor gets compiled down into .xpath() but there is no contest about their expressivity: https://github.com/scrapy/parsel/blob/v1.9.1/parsel/csstrans...
(Aside: A long time ago, I had written an alternate XPath 1.1 implementation for Wine during GSoC, but rather shamefully, I never actually got it merged. Life became very hectic for me during that time period and I never really looped back to it. Still feel pretty bad about it all these years later.)
- XPath literally didn't exist when CSS selectors were introduced
- XPath's flexibility makes it a lot more challenging to implement efficiently, even more so when there are thousands of rules which need to be dynamically reevaluated at each document update
- XPath is lacking conveniences dedicated to HTML semantics, and handrolling them in xpath 1.0 was absolutely heinous (go try and implement a class predicate in xpath 1.0 without extensions)
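For reference, the usual hand-rolled XPath 1.0 workaround for a class test pads the normalized attribute with spaces and looks for the padded name. Below is a small Python sketch that builds that expression and mimics what it computes; class_predicate, xpath_for_class, and matches_class are hypothetical helper names, the name is assumed to contain no quotes, and the emulation only approximates normalize-space().

```python
def class_predicate(name: str) -> str:
    """Build an XPath 1.0 predicate testing for `name` in the class list.

    Assumes `name` contains no quote characters (no escaping is done).
    """
    return f"contains(concat(' ', normalize-space(@class), ' '), ' {name} ')"


def xpath_for_class(tag: str, name: str) -> str:
    """Rough XPath 1.0 counterpart of the CSS selector `tag.name`."""
    return f"//{tag}[{class_predicate(name)}]"


def matches_class(class_attr: str, name: str) -> bool:
    """Pure-Python approximation of what the predicate above evaluates."""
    # normalize-space(): trim and collapse whitespace runs to single spaces.
    normalized = " ".join(class_attr.split())
    # Pad both sides so only whole, space-delimited tokens can match.
    return f" {name} " in f" {normalized} "


if __name__ == "__main__":
    print(xpath_for_class("div", "foo"))
    # The naive contains(@class, 'foo') would wrongly accept this one:
    print(matches_class("foobar", "foo"))
    print(matches_class("bar  foo\tbaz", "foo"))
```

Compare that expression with `div.foo` and the "absolutely heinous" verdict is easy to understand.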
[citation required]
https://www.w3.org/TR/1999/REC-xpath-19991116/
https://www.w3.org/TR/REC-CSS1-961217
> W3C Recommendation 17 Dec 1996, revised 11 Jan 1999
There are various drafts and statuses, so it's always open to hair-splitting, but based only on the publication date, CSS does appear to win.
YES! This is so true! And ridiculous! It's a mystery why we didn't simply reuse XPath for selectors... it's all in there!!
It's not really a mystery:
> CSS was first proposed by Håkon Wium Lie on 10 October 1994. [...] discussions on public mailing lists and inside World Wide Web Consortium resulted in the first W3C CSS Recommendation (CSS1) being released in 1996
> XPath 1.0 was published in 1999
CSS2 was released before XPath 1.0.
- the "descendant" combinator (whitespace)
- the "class" selector (".foo")
The 1998 CSS2 introduced "child", "following sibling", and attribute selectors. This state of things then remained unchanged forever (I see that Selectors Level 3 became a recommendation only in 2018?).
On the other hand, in 1999, XPath already specified all those basic ways to navigate the DOM, and CSS still doesn't have them all as of 2025.
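As one small illustration of the navigation XPath 1.0 already specified, here is the parent step (`..`), sketched with Python's standard xml.etree.ElementTree, which implements only a subset of XPath. The markup and the items_with_links helper are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup, made up for this example.
DOC = """
<ul>
  <li><a href="/docs">Docs</a></li>
  <li><span>No link here</span></li>
  <li><a href="/blog">Blog</a></li>
</ul>
"""


def items_with_links(xml_text: str):
    """Return the <li> elements that directly contain an <a> child."""
    root = ET.fromstring(xml_text)
    # Select every <a>, then step *up* to its parent with `..`.
    return root.findall(".//a/..")


if __name__ == "__main__":
    for li in items_with_links(DOC):
        print(li.tag, li.find("a").text)
```

The selection starts from the child and walks upward, which is the kind of axis movement the comment above is talking about.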