History LLMs: Models trained exclusively on pre-1913 texts

Posted by iamwil 12/18/2025

History LLMs: Models trained exclusively on pre-1913 texts (github.com)

897 points | 421 comments | page 5

Myrmornis 12/19/2025|

It would be interesting to have LLMs trained purely on one language (with the ability to translate their input/output appropriately from/to a language that the reader understands). I can see that being rather revealing about cultural differences that are mostly kept hidden behind the language barriers.

Sprotch 12/20/2025||

This is a brilliant idea. We have lots of erroneous ideas about the views and thoughts people had in the past. This will show we are still, actually, largely similar. Hopefully more and more of these historical LLMs appear.

elestor 12/19/2025||

Excuse me if it's obvious, but how could I run this? I have run local LLMs before, but only have very minimal experience using ollama run and that's about it. This seems very interesting so I'd like to try it.

shireboy 12/19/2025||

Fascinating LLM use case I never really thought about until now. I’d love to converse with different eras and also do gap analysis against the present: what modern advances could have come earlier, happened differently, etc.

casey2 12/19/2025||

I'd be very surprised if this is clean of post-1913 text. Overall I'm very interested in talking to this thing and seeing how much difference writing in a modern style vs. an older one makes to its responses.

Agraillo 12/19/2025||

> Modern LLMs suffer from hindsight contamination. GPT-5 knows how the story ends—WWI, the League's failure, the Spanish flu. This knowledge inevitably shapes responses, even when instructed to "forget."

> Our data comes from more than 20 open-source datasets of historical books and newspapers. ... We currently do not deduplicate the data. The reason is that if documents show up in multiple datasets, they also had greater circulation historically. By leaving these duplicates in the data, we expect the model will be more strongly influenced by documents of greater historical importance.

I find these claims contradictory. Many books that modern readers consider historically significant had only niche circulation when they were published. A quick inquiry points to examples such as Nietzsche's later works and Marx's Das Kapital. Because such books are likely to show up in multiple datasets, the duplication will likely influence the model's responses as if they had been widely known at the time.

tedtimbrell 12/19/2025||

This is so cool. Props for doing the work to actually build the dataset and make it somewhat usable.

I’d love to use this as a base for a math model. Let’s see how far it can get through the last 100 years of solved problems.

arikrak 12/19/2025||

I wouldn't have expected there to be enough text from before 1913 to properly train a model. Didn't it take an internet's worth of text to train the first successful LLMs?

alansaber 12/19/2025|

This model is more comparable to GPT-2 than anything we use now.

awesomeusername 12/19/2025|

I've always liked the idea of retiring to the 19th century.

Can't wait to use this so I can double-check, before I hit 88 miles per hour, that it's really what I want to do.
