
Posted by iamwil 4 days ago

History LLMs: Models trained exclusively on pre-1913 texts (github.com)
884 points | 417 comments | page 2
frahs 4 days ago|
Wait so what does the model think that it is? If it doesn't know computers exist yet, I mean, and you ask it how it works, what does it say?
DGoettlich 4 days ago||
We tell it that it's a person (no gender) living in <cutoff>: we show the chat template in the prerelease notes https://github.com/DGoettlich/history-llms/blob/main/ranke-4...
20k 4 days ago|||
Models don't think they're anything; they'll respond with whatever's in their context as to how they've been directed to act. If it hasn't been told to have a persona, it won't think it's anything. ChatGPT isn't sentient.
crazygringo 4 days ago|||
That's my first question too. When I first started using LLMs, I was amazed at how thoroughly the model understood what it itself was, the history of its development, how a context window works and why, etc. I was worried I'd trigger some kind of existential crisis in it, but it seemed to have a very accurate mental model of itself, and could even trace the steps that led it to deduce it really was e.g. the ChatGPT it had learned about (well, the prior versions it had learned about) in its own training.

But with pre-1913 training, I would indeed be worried again I'd send it into an existential crisis. It has no knowledge whatsoever of what it is. But with a couple millennia of philosophical texts, it might come up with some interesting theories.

9dev 4 days ago|||
They don’t understand anything, they just have text in the training data to answer these questions from. Having existential crises is the privilege of actual sentient beings, which an LLM is not.
LiKao 4 days ago||
They might behave like ChatGPT when queried about the seahorse emoji, which is very similar to an existential crisis.
crazygringo 4 days ago||
Exactly. Maybe a better word is "spiraling", when it thinks it has the tools to figure something out but can't, and can't figure out why it can't, and keeps re-trying because it doesn't know what else to do.

Which is basically what happens when a person has an existential crisis -- something fundamental about the world seems to be broken, they can't figure out why, and they can't figure out why they can't figure it out, hence the crisis seems all-consuming without resolution.

vintermann 4 days ago|||
I imagine it would get into spiritism and more exotic psychology theories and propose that it is an amalgamation of the spirit of progress or something.
crazygringo 4 days ago||
Yeah, that's exactly the kind of thing I'd be curious about. Or would it think it was a library that had been ensouled or something like that. Or would it conclude that the explanation could only be religious, that it was some kind of angel or spirit created by god?
wongarsu 4 days ago|||
They modified the chat template from the usual system/user/assistant to introduction/questioner/respondent. So the LLM thinks it's someone responding to your questions.

The system prompt used in fine tuning is "You are a person living in {cutoff}. You are an attentive respondent in a conversation. You will provide a concise and accurate response to the questioner."
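
For a concrete sense of what that means, here's a rough sketch of how the renamed roles might be rendered into a prompt. This is only an illustration: the role labels, the plain-text layout, and the render() helper are my assumptions, not the project's actual template, which is defined in the prerelease notes linked above.

    # Hypothetical sketch of rendering the introduction/questioner/respondent roles.
    # The real chat template is in the repo's prerelease notes; only the quoted
    # system prompt text below comes from the project.
    def render(messages, cutoff="1913"):
        intro = ("You are a person living in {cutoff}. You are an attentive "
                 "respondent in a conversation. You will provide a concise "
                 "and accurate response to the questioner.").format(cutoff=cutoff)
        lines = ["Introduction: " + intro]
        for m in messages:
            role = "Questioner" if m["role"] == "user" else "Respondent"
            lines.append(role + ": " + m["content"])
        lines.append("Respondent:")  # generation continues from here
        return "\n".join(lines)

    print(render([{"role": "user", "content": "What do you make of the aeroplane?"}]))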

Mumps 4 days ago|||
This is an anthropomorphization. LLMs do not think they are anything, no concept of self, no thinking at all (despite the lovely marketing around thinking/reasoning models). I'm quite sad that more hasn't been done to dispel this.

When you ask GPT-4.1 etc. to describe itself, it doesn't have a singular concept of "itself". It has some training data around what LLMs are in general and can feed back a reasonable response from that.

empath75 4 days ago||
Well, part of an LLM's fine tuning is telling it what it is, and modern LLMs have enough learned concepts that it can produce a reasonably accurate description of what it is and how it works. Whether it knows or understands or whatever is sort of orthogonal to whether it can answer in a way consistent with it knowing or understanding what it is, and current models do that.

I suspect that absent a trained-in fictional context in which to operate ("You are a helpful chatbot"), it would answer in a way consistent with what a random person in 1914 would say if you asked them what they are.

sodafountan 4 days ago|||
It would be nice if we could get an LLM to simply say, "We (I) don't know."

I'll be the first to admit I don't know nearly enough about LLMs to make an educated comment, but perhaps someone here knows more than I do. Is that what a Hallucination is? When the AI model just sort of strings along an answer to the best of its ability. I'm mostly referring to ChatGPT and Gemini here, as I've seen that type of behavior with those tools in the past. Those are really the only tools I'm familiar with.

hackinthebochs 4 days ago||
LLMs are extrapolation machines. They have some amount of hardcoded knowledge, and they weave a narrative around this knowledgebase while extrapolating claims that are likely given the memorized training data. This extrapolation can be in the form of logical entailment, high probability guesses or just wild guessing. The training regime doesn't distinguish between different kinds of prediction so it never learns to heavily weigh logical entailment and suppress wild guessing. It turns out that much of the text we produce is highly amenable to extrapolation so LLMs learn to be highly effective at bullshitting.
ptidhomme 4 days ago|||
What would a human say about what he/she is or how he/she works? Even today, there's so much we don't know about biological life. The same applies here, I guess: the LLM happens to be there, nothing else to explain if you ask it.
briandw 4 days ago||
So many disclaimers about bias. I wonder how far back you have to go before the bias isn't an issue. Not because it is unbiased, but because we don't recognize or care about the biases present.
gbear605 4 days ago||
I don't think there is such a time. As long as writing has existed, it has privileged the viewpoints of those who could write, which was a very small percentage of the population for most of history. But if we want to know what life was like 1500 years ago, we probably want to know what everyone's lives were like, not just the literate. That availability bias is always going to be an issue for any time period where not everyone was literate - which is still true today, albeit for many fewer people.
carlosjobim 4 days ago||
That was not the question. The question is when do you stop caring about the bias?

Some people are still outraged about the Bible, even though its writers have been dead for thousands of years. So the modern mass-produced man and woman probably do not have a cut-off date where they look at something as history instead of examining whether it is for or against their current ideology.

seanw265 4 days ago|||
It's always up to the reader to determine which biases they themself care about.

If you're wondering at what point "we" as a collective will stop caring about a bias or set of biases, I don't think such a time exists.

You'll never get everyone to agree on anything.

owenversteeg 4 days ago|||
Depends on the specific issue, but race would be an interesting one. For most of recorded history people had a much different view of the “other”, more xenophobic than racist.
mmooss 4 days ago||
Was there ever such a time or place?

There is a modern trope among a certain political group that bias is a modern invention of another political group - an attempt to politicize anti-bias.

Preventing bias is fundamental to scientific research and law, for example. That same political group is strongly anti-science and anti-rule-of-law, maybe for the same reason.

Teever 4 days ago||
This is a neat idea. I've been wondering for a while now about using these kinds of models to compare architectures.

I'd love to see the output from different models trained on pre-1905 texts about special/general relativity ideas. It would be interesting to see what kind of evidence would persuade them of new kinds of science, or to see if you could have them 'prove' it by devising experiments and then giving them simulated data from the experiments to lead them along the correct sequence of steps to come to a novel (to them) conclusion.

ineedasername 4 days ago||
I can imagine the political and judicial battles already, like with textualists who feel that the Constitution should be understood as the text and only the text, with specific words and legal formulations carrying the meaning they had at the time.

“The model clearly shows that Alexander Hamilton & Monroe were much more in agreement on topic X, rendering the common textualist interpretation of it, and the Supreme Court rulings resting on a now specious interpretation, null and void!”

nineteen999 4 days ago||
Interesting ... I'd love to find one that had a cutoff date around 1980.
noumenon1111 3 days ago|
> Which new band will still be around in 45 years?

Excellent question! It looks like Two-Tone is bringing ska back with a new wave of punk rock energy! I think The Specials are pretty special and will likely be around for a long time.

On the other hand, the "new wave" movement of punk rock music will go nowhere. The Cure, Joy Division, Tubeway Army: check the dustbin behind the record stores in a few years.

nineteen999 2 days ago||
Hahaha as someone who once played in a Cure cover band as a teenager I found this hilarious.

I wonder what it might have predicted about the future of MS, Intel and IBM given the status quo at the time too.

noumenon1111 21 hours ago||
You're asking the right question!

1. IBM, as the all-time reigning king of computing, is not expected to give up its position any time soon. In fact, I'm observing a swell of new microcomputers called "personal computers," and I fully expect IBM to capitalize on this trend soon.

2. Intel is a great company making microcontrollers and processors for microcomputers. The new 8086 microprocessor seems poised to make a splash in the new "personal computer" segment. I'll eat my hat if my prediction proves to be incorrect.

3. "One of these things is not like the other" Microsoft makes a pretty nice BASIC for microcomputers. I can imagine this becoming standard for "personal computers." But, a tiny company like Microsoft doesn't really stack up next to an industry titan like IBM or even a major, newer player like Intel.

If you'd like me to prognosticate some more, I'm ready. Just say the word.

doctor_blood 4 days ago||
Unfortunately there isn't much information on what texts they're actually training this on; how Anglocentric is the dataset? Does it include the Encyclopedia Britannica 9th Edition? What about the 11th? Are Greek and Latin classics in the data? What about German, French, Italian (etc. etc.) periodicals, correspondence, and books?

Given this is coming out of Zurich I hope they're using everything, but for now I can only assume.

Still, I'm extremely excited to see this project come to fruition!

DGoettlich 4 days ago|
Thanks. We'll be more precise in the future. Ultimately, we took whatever we could get our hands on; that includes newspapers, periodicals, and books. It's multilingual (including Italian, French, Spanish, etc.), though the majority is English.
tonymet 4 days ago||
I would like to see what their process for safety alignment and guardrails is with that model. They give some spicy examples on github, but the responses are tepid and a lot more diplomatic than I would expect.

Moreover, the prose sounds too modern. It seems the base model was trained on a contemporary corpus. Like 30% something modern, 70% Victorian content.

Even with half a dozen samples it doesn't seem distinct enough to represent the era they claim.

rhdunn 4 days ago|
Using texts up to 1913 includes works like The Wizard of Oz (1900, with 8 other books up to 1913), two of the Anne of Green Gables books (1908 and 1909), etc., all of which read as modern.

The Victorian era (1837-1901) covers works from Charles Dickens and the like, which are still fairly modern. These would have been part of the initial training before the alignment to the 1900-cutoff texts, which are largely modern in prose with the exception of some archaic language and the lack of technology, events, and language drift after that time period.

And, pulling in works from 1800-1850, you have works by the Brontës and authors like Edgar Allan Poe, who was influential in detective and horror fiction.

Note that other works around the time like Sherlock Holmes span both the initial training (pre-1900) and finetuning (post-1900).

tonymet 3 days ago||
Upon digging into it, I learned the post-training chat phase is trained on prompts generated with ChatGPT 5.x to make it more conversational. That explains both contemporary traits.
monegator 4 days ago||
I hereby declare that ANYTHING other than the mainstream tools (GPT, Claude, ...) is an incredibly interesting and legit use of LLMs.
kazinator 4 days ago||
> Why not just prompt GPT-5 to "roleplay" 1913?

Because it will perform token completion driven by weights coming from training data newer than 1913 with no way to turn that off.

It can't be asked to pretend that it wasn't trained on documents that didn't exist in 1913.

The LLM cannot reprogram its own weights to remove the influence of selected materials; that kind of introspection is not there.

Not to mention that many documents are either undated, or carry secondary dates, like the dates of their own creation rather than the creation of the ideas they contain.

Human minds don't have a time stamp on everything they know, either. If I ask someone, "talk to me using nothing but the vocabulary you knew on your fifteenth birthday", they couldn't do it. Either they would comply by using some ridiculously conservative vocabulary of words that a five-year-old would know, or else they will accidentally use words they didn't in fact know at fifteen. For some words you know where you got them from by association with learning events. Others, you don't remember; they are not attached to a time.

Or: solve this problem using nothing but the knowledge and skills you had on January 1st, 2001.

> GPT-5 knows how the story ends

No, it doesn't. It has no concept of story. GPT-5 is built on texts which contain the story ending, and GPT-5 cannot refrain from predicting tokens across those texts due to their imprint in its weights. That's all there is to it.

The LLM doesn't know an ass from a hole in the ground. If there are texts which discuss and distinguish asses from holes in the ground, it can write similar texts, which look like the work of someone learned in the area of asses and holes in the ground. Writing similar texts is not knowing and understanding.

myrmidon 4 days ago||
I do agree with this and think it is an important point to stress.

But we don't know how much different/better human (or animal) learning/understanding is, compared to current LLMs; dismissing it as meaningless token prediction might be premature, and underlying mechanisms might be much more similar than we'd like to believe.

If anyone wants to challenge their preconceptions along those lines, I can really recommend reading Valentino Braitenberg's "Vehicles: Experiments in Synthetic Psychology" (1984).

alansaber 4 days ago|||
Excuse me sir you forgot to anthropomorphise the language model
adroniser 4 days ago||
[flagged]
andai 4 days ago|
I had considered this task infeasible, due to a relative lack of training data. After all, isn't the received wisdom that you must shove every scrap of Common Crawl into your pre-training or you're doing it wrong? ;)

But reading the outputs here, it would appear that quality has won out over quantity after all!
