Posted by robin_reala 3/28/2025
I have a service that extracts <meta> tags from webpages, and to do that I currently use (and need) three different dependencies: html5ever, markup5ever_rcdom, and markup5ever. I don't like them, to be honest: the documentation is quite bad, and it was difficult to understand how I was supposed to use the libraries to achieve such a simple task.
XPath, on the other hand, makes this extremely easy in comparison. I wonder how it will perform compared to my current solution.
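For a rough idea of what the XPath approach looks like, here is a minimal sketch in Python using only the standard library's ElementTree, which supports a limited XPath subset. It is not Xee or html5ever: the sample page, `extract_meta`, and `meta_content` are made up for illustration, and it assumes the input is well-formed XHTML, which real-world HTML often is not.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional

# Hypothetical sample input. It is well-formed XHTML on purpose: ElementTree
# is an XML parser and will reject typical tag-soup HTML.
PAGE = """<html>
  <head>
    <meta charset="utf-8"/>
    <meta name="description" content="An example page"/>
    <meta property="og:title" content="Example"/>
    <title>Example</title>
  </head>
  <body><p>Hello</p></body>
</html>"""


def extract_meta(markup: str) -> List[Dict[str, str]]:
    """Return the attributes of every <meta> element in the document."""
    root = ET.fromstring(markup)
    # ".//meta" is how ElementTree spells the XPath expression "//meta".
    return [dict(el.attrib) for el in root.findall(".//meta")]


def meta_content(markup: str, name: str) -> Optional[str]:
    """Return the content attribute of <meta name="...">, or None if absent."""
    root = ET.fromstring(markup)
    # The name is interpolated directly into the path, so this sketch assumes
    # it contains no quote characters.
    el = root.find(f".//meta[@name='{name}']")
    return el.get("content") if el is not None else None
```

Calling `extract_meta(PAGE)` gives one attribute dict per tag, and `meta_content(PAGE, "description")` picks out a single value. How this compares in speed to an html5ever-based solution is not something the sketch can answer.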
Unfortunately at this point there's no HTML parser frontend for Xee (and its underlying library Xot) yet (HTML 5 parser serialization is supported at least in code). It shouldn't be too hard to add at least HTML 5 support using something like html5ever.
https://github.com/Paligo/xee/blob/xee-v0.1.5/COPYRIGHT
And that goes double for when there is a separate LICENSE file in the repo https://github.com/Paligo/xee/blob/xee-v0.1.5/LICENSE-MIT
The fact that many project maintainers forget about vendored content and haphazardly slap the MIT license (or whatever) verbatim into a LICENSE file doesn’t actually give you a get-out-of-paying-lawyers-free card! If anything, Xee’s COPYRIGHT file gives me more confidence in my legal footing than an unadulterated LICENSE file would. It indicates the maintainer at least has a basic understanding of how copyright applies to their project.
Another pain point with XML is the lack of an inline schema, so the languages around it, like XPath, have to work with arbitrary structures, unlike, say, JSON, where you at least have basic primitives like map/dict, numbers, bool, etc.
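A small sketch of that difference, assuming Python's standard `json` and `xml.etree.ElementTree` modules and two made-up documents carrying the same data. The helper names are hypothetical. Without a schema, every XML leaf comes back as text, while JSON values arrive already typed.

```python
import json
import xml.etree.ElementTree as ET
from typing import Dict

# Hypothetical documents holding the same three values.
JSON_DOC = '{"count": 3, "ratio": 0.5, "active": true}'
XML_DOC = "<item><count>3</count><ratio>0.5</ratio><active>true</active></item>"


def json_types(doc: str) -> Dict[str, str]:
    """Map each top-level JSON key to the Python type name of its value."""
    return {key: type(value).__name__ for key, value in json.loads(doc).items()}


def xml_types(doc: str) -> Dict[str, str]:
    """Map each child element's tag to the Python type name of its text."""
    root = ET.fromstring(doc)
    return {child.tag: type(child.text).__name__ for child in root}
```

Here `json_types(JSON_DOC)` reports `int`, `float`, and `bool`, while `xml_types(XML_DOC)` reports `str` for all three, so any typing on the XML side has to come from an external schema or from the consumer.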
This could become a great foundation for a typed, (mostly) etree-compatible Python library. I've used lxml for years and it's still my go-to, but there are lots of places where it could be modernized.