Posted by iamwil 4 days ago

History LLMs: Models trained exclusively on pre-1913 texts (github.com)
886 points | 417 comments | page 5
Sprotch 3 days ago|
This is a brilliant idea. We have lots of erroneous ideas about the views and thoughts people had in the past. This will show we are still, actually, largely similar. Hopefully more and more of these historical LLMs appear.
Myrmornis 4 days ago||
It would be interesting to have LLMs trained purely on one language (with the ability to translate their input/output appropriately from/to a language that the reader understands). I can see that being rather revealing about cultural differences that are mostly kept hidden behind the language barriers.
elestor 4 days ago||
Excuse me if it's obvious, but how could I run this? I have run local LLMs before, but only have very minimal experience using ollama run and that's about it. This seems very interesting so I'd like to try it.
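In case it helps anyone in elestor's position, here is a minimal sketch of how one might run a model like this locally with Hugging Face transformers, assuming the project publishes standard checkpoints on the Hub. The repo id below is a placeholder, not the project's actual name, so check the README for what is really released.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Placeholder repo id; substitute whatever the History LLMs project actually publishes.
    model_id = "example-org/history-llm-pre1913"

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # Prompt in period style and sample a continuation.
    prompt = "Sir, the question now before the public is"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=120, do_sample=True, temperature=0.8)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

If the project instead ships quantized GGUF weights, running it through ollama would also be an option, but the snippet above assumes ordinary Hugging Face checkpoints.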
shireboy 4 days ago||
Fascinating LLM use case I never really thought about till now. I’d love to converse with different eras and also do gap analysis with the present: what modern advances could have come earlier, or happened differently, etc.
casey2 4 days ago||
I'd be very surprised if this is clean of post-1913 text. Overall I'm very interested in talking to this thing and seeing how much difference writing in a modern style vs. an older one makes to its responses.
Agraillo 4 days ago||
> Modern LLMs suffer from hindsight contamination. GPT-5 knows how the story ends—WWI, the League's failure, the Spanish flu. This knowledge inevitably shapes responses, even when instructed to "forget."

> Our data comes from more than 20 open-source datasets of historical books and newspapers. ... We currently do not deduplicate the data. The reason is that if documents show up in multiple datasets, they also had greater circulation historically. By leaving these duplicates in the data, we expect the model will be more strongly influenced by documents of greater historical importance.

I found these claims contradictory. Many books that modern readers consider historically significant had only niche circulation at the time of publication. A quick inquiry points to, for example, Nietzsche's later works and Marx's Das Kapital: likely candidates for the kind of duplication that would influence the model's responses as if those works had been widely known at the time.
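To make the weighting effect concrete, here is a toy sketch (dataset names and document counts invented purely for illustration) of how leaving duplicates in place upweights a document in proportion to how many collections carried it, which is the mechanism at issue for works that were obscure when published but are heavily reprinted in modern archives.

    from collections import Counter

    # Toy corpora: the same document appearing in several source datasets.
    datasets = {
        "newspapers": ["doc_a", "doc_b", "doc_c"],
        "books": ["doc_a", "doc_d"],
        "pamphlets": ["doc_a", "doc_b"],
    }

    # Without deduplication, a document's share of the training mixture is simply
    # its total count across datasets divided by the corpus size.
    counts = Counter(doc for docs in datasets.values() for doc in docs)
    total = sum(counts.values())
    for doc, n in counts.most_common():
        print(f"{doc}: {n} copies -> effective sampling weight {n / total:.2f}")
    # doc_a is sampled 3/7 of the time and doc_d only 1/7, regardless of how widely
    # either actually circulated in its own day.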

arikrak 4 days ago||
I wouldn't have expected there to be enough text from before 1913 to properly train a model; it seemed like an internet's worth of text was needed to train the first successful LLMs.
alansaber 4 days ago|
This model is more comparable to GPT-2 than anything we use now.
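A rough back-of-envelope, with the corpus size purely assumed for illustration (no token count is given in the thread), suggests why GPT-2 scale is the natural comparison under the common rule of thumb of roughly 20 training tokens per parameter:

    # Back-of-envelope only: corpus_tokens is an assumption, not a figure from the project.
    corpus_tokens = 30e9      # suppose ~30B tokens of digitized pre-1913 text
    tokens_per_param = 20     # Chinchilla-style rule of thumb
    params = corpus_tokens / tokens_per_param
    print(f"compute-optimal model size: ~{params / 1e9:.1f}B parameters")  # ~1.5B, i.e. GPT-2 scale

Whatever the real number is, it plausibly sits far below the multi-trillion-token corpora behind current frontier models, which fits the GPT-2 comparison.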
tedtimbrell 4 days ago||
This is so cool. Props for doing the work to actually build the dataset and make it somewhat usable.

I’d love to use this as a base for a math model. Let’s see how far it can get through the last 100 years of solved problems.

Muskwalker 3 days ago|
So, could this be an example of an LLM trained fully on public domain, copyright-expired data? Or is this not intended to be the case?
DGoettlich 3 days ago|
data is 100% public domain.