
Posted by jamesponddotco 20 hours ago

Show HN: Librario, a book metadata API that aggregates G Books, ISBNDB, and more

TLDR: Librario is a book metadata API that aggregates data from Google Books, ISBNDB, and Hardcover into a single response, solving the problem of no single source having complete book information. It's currently pre-alpha, AGPL-licensed, and available to try now[0].

My wife and I have a personal library with around 1,800 books. I started working on a library management tool for us, but I quickly realized I needed a source of data for book information, and none of the solutions available provided all the data I needed. One might provide the series, the other might provide genres, and another might provide a good cover, but none provided everything.

So I started working on Librario, a book metadata aggregation API written in Go. It fetches information about books from multiple sources (Google Books, ISBNDB, and Hardcover; Goodreads and Anna's Archive are next), merges everything, and saves it all to a PostgreSQL database for future lookups. The idea is that the database gets stronger over time as more books are queried.

You can see an example response here[1], or try it yourself:

  curl -s -H 'Authorization: Bearer librario_ARbmrp1fjBpDywzhvrQcByA4sZ9pn7D5HEk0kmS34eqRcaujyt0enCZ' \
  'https://api.librario.dev/v1/book/9781328879943' | jq .
  
This is pre-alpha and runs on a small VPS, so keep that in mind. I've never hit the rate limits of the third-party services, so depending on how this post goes, I may or may not find out how well the code handles that.

The merger is the heart of the service, and figuring out how to combine conflicting data from different sources was the hardest part. In the end I decided to use field-specific strategies, which are quite naive but work for now.

Each extractor has a priority, and results are sorted by that priority before merging. But priority alone isn't enough, so different fields need different treatment.

For example:

- Titles use a scoring system. I penalize titles containing parentheses or brackets because sources sometimes shove subtitles into the main title field. Overly long titles (80+ chars) also get penalized since they often contain edition information or other metadata that belongs elsewhere.

- Covers collect all candidate URLs, then a separate fetcher downloads and scores them by dimensions and quality. The best one gets stored locally and served from the server.

For most other fields (publisher, language, page count), I just take the first non-empty value by priority. Simple, but it works.
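To make the strategies above concrete, here's a minimal sketch of what a priority-sorted merge with title scoring and first-non-empty fallback could look like. The types and field names (`Result`, `Priority`, the exact penalty values) are my assumptions for illustration, not Librario's actual code:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Result is a hypothetical per-extractor result. Lower Priority
// means a more trusted source.
type Result struct {
	Priority  int
	Title     string
	Publisher string
}

// scoreTitle penalizes titles containing parentheses or brackets
// (subtitles shoved into the title field) and overly long titles
// (80+ chars, often edition metadata).
func scoreTitle(t string) int {
	score := 100
	if strings.ContainsAny(t, "()[]") {
		score -= 30
	}
	if len(t) >= 80 {
		score -= 20
	}
	return score
}

// merge sorts results by priority, then picks the best-scoring
// title and the first non-empty publisher.
func merge(results []Result) Result {
	sort.Slice(results, func(i, j int) bool {
		return results[i].Priority < results[j].Priority
	})
	var out Result
	best := -1
	for _, r := range results {
		if r.Title != "" && scoreTitle(r.Title) > best {
			best = scoreTitle(r.Title)
			out.Title = r.Title
		}
		if out.Publisher == "" && r.Publisher != "" {
			out.Publisher = r.Publisher
		}
	}
	return out
}

func main() {
	merged := merge([]Result{
		{Priority: 2, Title: "Dune (Dune Chronicles, Book 1)", Publisher: "Ace"},
		{Priority: 1, Title: "Dune"},
	})
	fmt.Println(merged.Title, "/", merged.Publisher) // Dune / Ace
}
```

Note how the clean title wins on score even though the publisher still falls through to the lower-priority source.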

I recently added a caching layer[2], which sped things up nicely. I considered migrating from net/http to fiber at some point[3], but decided against it. Going outside the standard library felt wrong, and the migration wouldn't have provided much in the end.

The database layer is being rewritten before v1.0[4]. I'll be honest: the original schema was written by AI, and while I tried to guide it in the right direction with SQLC[5] and good documentation, database design isn't my strong suit and I couldn't confidently vouch for the code. Rather than ship something I don't fully understand, I hired the developers from SourceHut[6] to rewrite it properly.

I've got a 5-month-old and we're still adjusting to their schedule, so development is slow. I've mentioned this project in a few HN threads before[7], so I’m pretty happy to finally have something people can try.

Code is AGPL and on SourceHut[8].

Feedback and patches[9] are very welcome :)

[0]: https://sr.ht/~pagina394/librario/

[1]: https://paste.sr.ht/~jamesponddotco/a6c3b1130133f384cffd25b3...

[2]: https://todo.sr.ht/~pagina394/librario/16

[3]: https://todo.sr.ht/~pagina394/librario/13

[4]: https://todo.sr.ht/~pagina394/librario/14

[5]: https://sqlc.dev

[6]: https://sourcehut.org/consultancy/

[7]: https://news.ycombinator.com/item?id=45419234

[8]: https://sr.ht/~pagina394/librario/

[9]: https://git.sr.ht/~pagina394/librario/tree/trunk/item/CONTRI...

120 points | 44 comments
moritzruth 19 hours ago|
What do you think about BookBrainz?

https://bookbrainz.org/

jamesponddotco 19 hours ago||
First time I'm seeing it, to be honest, but it looks interesting. I do plan on having a UI for Librario (built a few mockups yesterday[1][2][3]), and I think the idea is similar, but BookBrainz looks bigger in scope.

I could add them as an extractor, I suppose :thinking:

[1]: https://i.cpimg.sh/pexvlwybvbkzuuk8.png

[2]: https://i.cpimg.sh/eypej9bshk2udtqd.png

[3]: https://i.cpimg.sh/6iw3z0jtrhfytn2u.png

nmstoker 19 hours ago||
This is great - the service and that you're extending it and considering a UI.

Personally I would go with option 2: the colour from the covers beats the anaemic feel of 1, and it seems more original than option 3's search with a grid below.

jamesponddotco 19 hours ago||
Glad you liked the idea!

Number two is what my wife and I prefer too, and likely what's going to be chosen in the end.

WillAdams 17 hours ago||
Doesn't seem to have a very complete dataset --- the first book I thought to look for, Hal Clement's _Space Lash_ (originally published as _Small Changes_), is absent, and I didn't see the later collection _Music of Many Spheres_ either:

https://www.goodreads.com/book/show/939760.Music_of_Many_Sph...

wizzwizz4 18 hours ago||
Please ensure that your database keeps track of whence data was obtained, and when. It's exceptionally frustrating when automated data ingesting systems overwrite manually-corrected data with automatically-generated wrong data: keeping track of provenance is a vital step towards keeping track of authoritativeness.
jamesponddotco 18 hours ago|
We don't support POST, PATCH, and whatnot yet, so I haven't taken that into account, but it's in the plans.

Still need to figure out how this will work, though.

fc417fc802 17 hours ago||
Since you support merging fields you likely would want to track provenance (including timestamp) on a per-field basis. Perhaps via an ID for the originating request.

Although I would suggest that rather than merge (and discard) on initial lookup it might be better to remember each individual request. That way when you inevitably decide to fix or improve things later you could also regenerate all the existing records. If the excess data becomes an issue you can always throw it out later.

I say all this because I've been frustrated by the quantity of subtle inaccuracies encountered when looking things up with these services in the past. Depending on the work sometimes the entries feel less like authoritative records and more like best effort educated guesses.

jamesponddotco 4 hours ago||
I’ll definitely discuss this with Drew, as he’s the one working on the database refactor. Thank you for the feedback!
wizzwizz4 4 hours ago||
In my experience, designing a database schema capable of being really pedantic about where everything comes from is a pain, but not having done so is worse. As a compromise, storing a semi-structured audit log can work: it'll be slow to consult, but that's miles better than having nothing to consult, and you can always create cached views later.
omederos 15 hours ago||
502 Bad Gateway :|
jamesponddotco 9 hours ago|
It seems someone found a bug that triggered a panic, and systemd failed to restart the service because the PID file wasn't removed. Fixed now, should be back online :)
ocdtrekkie 17 hours ago||
Library of Congress data seems like a huge omission especially for something named after a librarian. ;) It is a very easy API to consume too.
jamesponddotco 9 hours ago|
I didn't look into it yet because I assumed the current extractors had the information from them, but it's in my list of future extractors!
sijirama 16 hours ago|
hella hella cool

good luck