
Posted by iamwil 12/18/2025

History LLMs: Models trained exclusively on pre-1913 texts (github.com)
897 points | 421 comments
delis-thumbs-7e 12/19/2025|
Aren't there obvious problems baked into this approach, if it's used for anything but fun? LLMs lie and fake facts all the time; they are also masters at reinforcing the user's biases, even unconscious ones. How could even a professor of history ensure that the generated text is actually based on the training material and representative of the feelings and opinions of the given time period, rather than reinforcing his biases toward popular topics of the day?

You can't, it is impossible. That will always be an issue as long as these models are black boxes and trained the way they are. So maybe you can use this for role playing, but I wouldn't trust a word it says.

kccqzy 12/19/2025|
To me it is pretty clear that it’s being used for fun. I personally like reading nineteenth century novels more than more recent novels (I especially like the style of science fiction by Jules Verne). What if the model can generate text in that style I like?
briandw 12/19/2025||
So many disclaimers about bias. I wonder how far back you have to go before the bias isn't an issue. Not because it is unbiased, but because we don't recognize or care about the biases present.
gbear605 12/19/2025||
I don't think there is such a time. As long as writing has existed it has privileged the viewpoints of those who could write, which was a very small percentage of the population for most of history. But if we want to know what life was like 1500 years ago, we probably want to know what everyone's lives were like, not just the literate. That availability bias is always going to be an issue for any time period in which not everyone was literate - which is still true today, albeit for far fewer people.
carlosjobim 12/19/2025||
That was not the question. The question is when do you stop caring about the bias?

Some people are still outraged about the Bible, even though its writers have been dead for thousands of years. So the modern mass-produced man and woman probably does not have a cut-off date where they look at something as history instead of examining whether it is for or against their current ideology.

seanw265 12/19/2025|||
It's always up to the reader to determine which biases they themself care about.

If you're wondering at what point "we" as a collective will stop caring about a bias or set of biases, I don't think such a time exists.

You'll never get everyone to agree on anything.

owenversteeg 12/19/2025|||
Depends on the specific issue, but race would be an interesting one. For most of recorded history people had a much different view of the “other”, more xenophobic than racist.
mmooss 12/19/2025||
Was there ever such a time or place?

There is a modern trope among a certain political group that bias is a modern invention of another political group - an attempt to politicize anti-bias.

Preventing bias is fundamental to scientific research and law, for example. That same political group is strongly anti-science and anti-rule-of-law, maybe for the same reason.

Teever 12/18/2025||
This is a neat idea. I've been wondering for a while now about using these kinds of models to compare architectures.

I'd love to see the output from different models trained on pre-1905 texts about special/general relativity ideas. It would be interesting to see what kind of evidence would persuade them of new kinds of science, or to see if you could have them 'prove' it by devising experiments and then giving them simulated data from the experiments to lead them along the correct sequence of steps to come to a novel (to them) conclusion.

ineedasername 12/19/2025||
I can imagine the political and judicial battles already, like with textualists feeling that the Constitution should be understood as the text and only the text, as meant by the specific words and legal formulations in their known meaning at the time.

“The model clearly shows that Alexander Hamilton & Monroe were much more in agreement on topic X, rendering the common textualist interpretation of it - and the Supreme Court rulings built on a now specious reading - null and void!”

nineteen999 12/19/2025||
Interesting ... I'd love to find one that had a cutoff date around 1980.
noumenon1111 12/19/2025|
> Which new band will still be around in 45 years?

Excellent question! It looks like Two-Tone is bringing ska back with a new wave of punk rock energy! I think The Specials are pretty special and will likely be around for a long time.

On the other hand, the "new wave" movement of punk rock music will go nowhere. The Cure, Joy Division, Tubeway Army: check the dustbin behind the record stores in a few years.

nineteen999 12/20/2025||
Hahaha as someone who once played in a Cure cover band as a teenager I found this hilarious.

I wonder what it might have predicted about the future of MS, Intel and IBM given the status quo at the time too.

noumenon1111 12/22/2025||
You're asking the right question!

1. IBM, as the all-time reigning king of computing is not expected to give up its position any time soon. In fact, I'm observing a swell of new microcomputers called "personal computers," and I fully expect IBM to capitalize on this trend soon.

2. Intel is a great company making microcontrollers and processors for microcomputers. The new 8086 microprocessor seems poised to make a splash in the new "personal computer" segment. I'll eat my hat if my prediction proves to be incorrect.

3. "One of these things is not like the other" Microsoft makes a pretty nice BASIC for microcomputers. I can imagine this becoming standard for "personal computers." But, a tiny company like Microsoft doesn't really stack up next to an industry titan like IBM or even a major, newer player like Intel.

If you'd like me to prognosticate some more, I'm ready. Just say the word.

doctor_blood 12/19/2025||
Unfortunately there isn't much information on what texts they're actually training this on; how Anglocentric is the dataset? Does it include the Encyclopaedia Britannica 9th Edition? What about the 11th? Are Greek and Latin classics in the data? What about German, French, Italian (etc. etc.) periodicals, correspondence, and books?

Given this is coming out of Zurich I hope they're using everything, but for now I can only assume.

Still, I'm extremely excited to see this project come to fruition!

DGoettlich 12/19/2025|
thanks. we'll be more precise in the future. ultimately, we took whatever we could get our hands on; that includes newspapers, periodicals, and books. it's multilingual (including italian, french, spanish etc) though the majority is english.
tonymet 12/19/2025||
I would like to see what their process for safety alignment and guardrails is with that model. They give some spicy examples on github, but the responses are tepid and a lot more diplomatic than I would expect.

Moreover, the prose sounds too modern. It seems the base model was trained on a contemporary corpus - something like 30% modern, 70% Victorian content.

Even with half a dozen samples it doesn't seem distinct enough to represent the era they claim.

rhdunn 12/19/2025|
Using texts up to 1913 includes works like The Wizard of Oz (1900, with 8 other books up to 1913), two of the Anne of Green Gables books (1908 and 1909), etc. All of which read as modern.

The Victorian era (1837-1901) covers works from Charles Dickens and the like, which are still fairly modern. These would have been part of the initial training before the fine-tuning on the post-1900 texts, which are largely modern in prose except for some archaic language and the absence of technology, events, and language drift after that period.

And, pulling in works from 1800-1850, you have the Brontës and authors like Edgar Allan Poe, who was influential in detective and horror fiction.

Note that other works around the time like Sherlock Holmes span both the initial training (pre-1900) and finetuning (post-1900).

tonymet 12/19/2025||
Upon digging into it, I learned the post-training chat phase is trained on prompts generated with ChatGPT 5.x to make it more conversational. That explains both contemporary traits.
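
For the curious, roughly what such a synthetic-chat step might look like - a minimal sketch only; the model name, prompt wording, and use of the openai client here are my assumptions, not the project's actual pipeline:

    # Hypothetical sketch: use a modern chat model to generate conversational
    # prompts, which the period-trained base model then answers; the resulting
    # (prompt, answer) pairs become chat fine-tuning data.
    # Model name and system prompt are placeholders, not the project's setup.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def make_chat_prompt(topic: str) -> dict:
        resp = client.chat.completions.create(
            model="gpt-5",  # placeholder; any strong chat model would do
            messages=[
                {"role": "system",
                 "content": "Write one question a reader in 1913 might ask "
                            "about the topic, phrased conversationally."},
                {"role": "user", "content": topic},
            ],
        )
        question = resp.choices[0].message.content
        # The period model's answer to `question` would be collected separately.
        return {"topic": topic, "prompt": question}

That would also explain why the tone feels modern: the questions themselves carry contemporary phrasing into the fine-tuning data.
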
monegator 12/19/2025||
I hereby declare that ANYTHING other than the mainstream tools (GPT, Claude, ...) is an incredibly interesting and legit use of LLMs.
kazinator 12/19/2025||
> Why not just prompt GPT-5 to "roleplay" 1913?

Because it will perform token completion driven by weights coming from training data newer than 1913 with no way to turn that off.

It can't be asked to pretend that it wasn't trained on documents that didn't exist in 1913.

The LLM cannot reprogram its own weights to remove the influence of selected materials; that kind of introspection is not there.

Not to mention that many documents are either undated, or carry secondary dates, like the dates of their own creation rather than the creation of the ideas they contain.

Human minds don't have a time stamp on everything they know, either. If I ask someone, "talk to me using nothing but the vocabulary you knew on your fifteenth birthday", they couldn't do it. Either they would comply by using some ridiculously conservative vocabulary of words that a five-year-old would know, or else they would accidentally use words they didn't in fact know at fifteen. For some words you know where you got them from, by association with learning events. Others, you don't remember; they are not attached to a time.

Or: solve this problem using nothing but the knowledge and skills you had on January 1st, 2001.

> GPT-5 knows how the story ends

No, it doesn't. It has no concept of story. GPT-5 is built on texts which contain the story ending, and GPT-5 cannot refrain from predicting tokens across those texts due to their imprint in its weights. That's all there is to it.

The LLM doesn't know an ass from a hole in the ground. If there are texts which discuss and distinguish asses from holes in the ground, it can write similar texts, which look like the work of someone learned in the area of asses and holes in the ground. Writing similar texts is not knowing and understanding.

myrmidon 12/19/2025||
I do agree with this and think it is an important point to stress.

But we don't know how much different or better human (or animal) learning and understanding is compared to current LLMs; dismissing it as meaningless token prediction might be premature, and the underlying mechanisms might be much more similar than we'd like to believe.

If anyone wants to challenge their preconceptions along those lines, I can really recommend reading Valentino Braitenberg's "Vehicles: Experiments in Synthetic Psychology" (1984).

alansaber 12/19/2025|||
Excuse me sir, you forgot to anthropomorphise the language model
adroniser 12/19/2025||
[flagged]
andai 12/19/2025|
I had considered this task infeasible, due to a relative lack of training data. After all, isn't the received wisdom that you must shove every scrap of Common Crawl into your pre-training or you're doing it wrong? ;)

But reading the outputs here, it would appear that quality has won out over quantity after all!
