Posted by jamesponddotco 15 hours ago
Show HN: Librario, a book metadata API that aggregates Google Books, ISBNDB, and more
My wife and I have a personal library with around 1,800 books. I started working on a library management tool for us, but I quickly realized I needed a source of data for book information, and none of the solutions available provided all the data I needed. One might provide the series, the other might provide genres, and another might provide a good cover, but none provided everything.
So I started working on Librario, a book metadata aggregation API written in Go. It fetches information about books from multiple sources (Google Books, ISBNDB, and Hardcover so far; Goodreads and Anna's Archive are next), merges everything, and saves it all to a PostgreSQL database for future lookups. The idea is that the database gets stronger over time as more books are queried.
You can see an example response here[1], or try it yourself:
curl -s -H 'Authorization: Bearer librario_ARbmrp1fjBpDywzhvrQcByA4sZ9pn7D5HEk0kmS34eqRcaujyt0enCZ' \
'https://api.librario.dev/v1/book/9781328879943' | jq .
This is pre-alpha and runs on a small VPS, so keep that in mind. I never hit the rate limits of the third-party services, so depending on how this post goes, I may or may not find out how well the code handles that.
The merger is the heart of the service, and figuring out how to combine conflicting data from different sources was the hardest part. In the end I settled on field-specific strategies, which are quite naive but work for now.
Each extractor has a priority, and results are sorted by that priority before merging. But priority alone isn't enough, so different fields need different treatment.
For example:
- Titles use a scoring system. I penalize titles containing parentheses or brackets because sources sometimes shove subtitles into the main title field. Overly long titles (80+ chars) also get penalized since they often contain edition information or other metadata that belongs elsewhere.
- Covers collect all candidate URLs, then a separate fetcher downloads and scores them by dimensions and quality. The best one gets stored locally and served from the server.
For most other fields (publisher, language, page count), I just take the first non-empty value by priority. Simple, but it works.
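To make the strategies above concrete, here's a minimal sketch of a priority-sorted merge with a title scorer and first-non-empty fallback. The type and function names are mine, not Librario's, and the real merger is certainly more involved:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// record is one extractor's result; lower Priority means a more trusted
// source. Hypothetical type, not Librario's actual schema.
type record struct {
	Priority  int
	Title     string
	Publisher string
}

// scoreTitle penalizes bracketed subtitles and overlong titles, per the
// heuristics described in the post.
func scoreTitle(t string) int {
	score := 100
	if strings.ContainsAny(t, "()[]") {
		score -= 30 // sources sometimes shove subtitles into the title
	}
	if len(t) >= 80 {
		score -= 20 // likely edition info or other stray metadata
	}
	return score
}

// merge sorts by priority, keeps the best-scoring title, and takes the
// first non-empty value for simpler fields like publisher.
func merge(records []record) record {
	sort.Slice(records, func(i, j int) bool {
		return records[i].Priority < records[j].Priority
	})
	var out record
	best := -1
	for _, r := range records {
		if r.Title != "" {
			if s := scoreTitle(r.Title); s > best {
				best = s
				out.Title = r.Title
			}
		}
		if out.Publisher == "" {
			out.Publisher = r.Publisher
		}
	}
	return out
}

func main() {
	m := merge([]record{
		{Priority: 2, Title: "Circe (A Novel)"},
		{Priority: 1, Title: "Circe", Publisher: "Little, Brown"},
	})
	fmt.Println(m.Title, "/", m.Publisher)
}
```

Because ties in title score fall back to priority order (the slice is sorted first), a high-priority source still wins when two candidates score the same.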
I recently added a caching layer[2], which sped things up nicely. I considered migrating from net/http to Fiber at some point[3], but decided against it. Going outside the standard library felt wrong, and the migration didn't provide much in the end.
The database layer is being rewritten before v1.0[4]. I'll be honest: the original schema was written by AI, and while I tried to guide it in the right direction with SQLC[5] and good documentation, database design isn't my strong suit and I couldn't confidently vouch for the code. Rather than ship something I don't fully understand, I hired the developers from SourceHut[6] to rewrite it properly.
I've got a 5-month-old and we're still adjusting to their schedule, so development is slow. I've mentioned this project in a few HN threads before[7], so I’m pretty happy to finally have something people can try.
Code is AGPL and on SourceHut[8].
Feedback and patches[9] are very welcome :)
[1]: https://paste.sr.ht/~jamesponddotco/a6c3b1130133f384cffd25b3...
[2]: https://todo.sr.ht/~pagina394/librario/16
[3]: https://todo.sr.ht/~pagina394/librario/13
[4]: https://todo.sr.ht/~pagina394/librario/14
[5]: https://sqlc.dev
[6]: https://sourcehut.org/consultancy/
[7]: https://news.ycombinator.com/item?id=45419234
[8]: https://sr.ht/~pagina394/librario/
[9]: https://git.sr.ht/~pagina394/librario/tree/trunk/item/CONTRI...
I haven't tried it for books. I imagine it's not sufficiently complete to serve as a backbone, but a quick look at an example book gives me the IDs for OpenLibrary, LibraryThing, Goodreads, Bing, and even niche stuff like the National Library of Poland MMS ID.
I recently (a year ago... wow) dipped my toe into the world of library science through Wikidata, and was shocked at just how complex it is. OP's work looks really solid, but I hope they're aware of how mature the field is!
For illustration, here are just the book-relevant ID sources I focused on from Wikidata:
ARCHIVERS:
Library of Congress Control Number `P1144` (173M)
Open Library `P648` (39M)
Online Computer Library Center `P10832` (10M)
German National Library `P227` (44M)
Smithsonian Institution `P7851` (155M)
Smithsonian Digital Ark `P9473` (3M)
U.S. Office of Sci. & Tech. Info. `P3894`
PUBLISHERS:
Google Books `P675` (1M)
Project Gutenberg `P2034` (70K)
Amazon `P5749`
CATALOGUERS:
International Standard Book Number `P212`
Wikidata `P8379` (115B)
EU Knowledge Graph `P11012`
Factgrid Database `P10787` (0.4M)
Google Knowledge Graph `P2671` (500B)
Thanks for letting me know!
I’ve recently acquired some photo books that don’t appear to have any ISBN but are listed on WorldCat with OCLC Numbers and are catalogued in the Japanese National Diet Library. Not sure if they actually don't have ISBNs or if I just haven't been able to find them, but from what I gathered in some research, it's quite common for self-published books.
Merging on the fly kind of works for the future too: for when the data changes, or for when the merging process itself changes.
No idea what the future will hold. The idea is to pre-warm the database after the schema has been refactored, and once we have thousands of books from that, I’ll know for sure what to do next.
TLDR, there is a lot of “think and learn” as I go here, haha.
https://newbooksnetwork.com/subscribe
It's definitely biased towards academia, which I personally see as a pro, not a con.
After v1.0.0 is out I plan to add the ability to add books manually to the database, at which point we'll be able to start improving the database without relying on third-party services.
I'm hoping Goodreads and Anna's Archive will help fill in the gaps, especially since Anna's Archive has gigantic database dumps available[1].
In fact, now that I think about it, you could also contribute your work to WikiData. I don't see ISBNdb ids on WikiData so you could write a script to make those contributions. Then anyone else using WikiData for this sort of thing can benefit from your work
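Before writing anything back to Wikidata, such a script would first need to find the item that matches a given book. A hedged sketch of the read side in Go, querying the public SPARQL endpoint by ISBN-13 (`P212`); note that Wikidata stores `P212` values hyphenated, so the lookup value must be hyphenated too. The write side needs authenticated API calls and is omitted here:

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// sparqlRequest builds a query against the public Wikidata Query
// Service for items whose ISBN-13 (P212) matches the given value.
func sparqlRequest(isbn string) (*http.Request, error) {
	query := fmt.Sprintf(`SELECT ?item WHERE { ?item wdt:P212 %q . }`, isbn)
	u := "https://query.wikidata.org/sparql?" + url.Values{
		"query":  {query},
		"format": {"json"},
	}.Encode()
	req, err := http.NewRequest("GET", u, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "application/sparql-results+json")
	return req, nil
}

func main() {
	req, err := sparqlRequest("978-1-328-87994-3")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Host)
	// To actually run it: resp, err := http.DefaultClient.Do(req)
	// then decode the SPARQL JSON results from resp.Body.
}
```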
I’d love to help improve other services. I plan on charging for Librario at some point, but I’ll keep a free tier and offer free API keys for projects like Calibre and others.
At least that’s the plan.
But once the database refactor is done, I wouldn’t say no to a patch that made the service database agnostic.
(Only the first 4 or so were JSON errors; the rest were HTML from nginx, if that matters.)
Right now, I use node-isbn https://www.npmjs.com/package/node-isbn which mostly works well but is getting long in the tooth.
I wrote a Go SDK[1] for the service, maybe I'll try writing one in TypeScript tomorrow.
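For anyone who wants to call the API without the SDK, the curl example in the post implies the whole contract: GET `/v1/book/{isbn}` with a Bearer token. A minimal hand-rolled client (this is not the actual SDK, and the token below is a placeholder):

```go
package main

import (
	"fmt"
	"net/http"
)

// newBookRequest mirrors the curl example from the post:
// GET https://api.librario.dev/v1/book/{isbn} with Bearer auth.
func newBookRequest(token, isbn string) (*http.Request, error) {
	req, err := http.NewRequest("GET", "https://api.librario.dev/v1/book/"+isbn, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	return req, nil
}

func main() {
	req, err := newBookRequest("librario_your-token-here", "9781328879943")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL)
	// resp, err := http.DefaultClient.Do(req)
	// then json.Decode the body into whatever struct you need.
}
```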