Posted by iamwil 5 days ago

History LLMs: Models trained exclusively on pre-1913 texts (github.com)
887 points | 417 comments | page 6
moffkalast 4 days ago|
> trained from scratch on 80B tokens of historical data

How can this thing possibly be even remotely coherent with just fine-tuning amounts of data used for pretraining?

dr_dshiv 4 days ago||
Everyone learns that the Renaissance was sparked by the translation of Ancient Greek works.

But few know that the Renaissance was written in Latin, and has barely been translated. Fewer than 3% of pre-1700 books have been translated, and fewer than 30% have ever been scanned.

I’m working on a project to change that. Research blog at www.SecondRenaissance.ai — we are starting by scanning and translating thousands of books at the Embassy of the Free Mind in Amsterdam, a UNESCO-recognized rare book library.

We want to make ancient texts accessible to people and AI.

If this work resonates with you, please do reach out: Derek@ancientwisdomtrust.org

carlosjobim 4 days ago||
Amazing project!

May I ask you, why are you publishing the translations as PDF files, instead of the more accessible ePub format?

dr_dshiv 3 days ago||
Will add, great point.
j-bos 4 days ago||
This is very cool but should go in a Show HN post as per HN rules. All the best!
dr_dshiv 4 days ago||
Just read the rules again. Was something inappropriate? Seemed relevant.
j-bos 4 days ago||
I can see you being right; I didn't make the connection with 19th- and 20th-century documents, and the comment felt disconnected from the thread. Either way, very cool project, worth a Show HN post.
awesomeusername 4 days ago||
I've always liked the idea of retiring to the 19th century.

Can't wait to use this so I can double-check before I hit 88 miles per hour that it's really what I want to do.

smugtrain 3 days ago||
This would actually be a wonderful way to learn physics, before GR and quantum mechanics
why-o-why 4 days ago||
It sounds like a fascinating idea, but I'd be curious whether prompting a better-known foundation model to limit itself to 1913 and earlier would be similar.
satisfice 4 days ago||
I assume this is a collaboration between the History Channel and Pornhub.

“You are a literary rake. Write a story about an unchaperoned lady whose ankle you glimpse.”

TZubiri 4 days ago||
Hi, can I have a Latin-only LLM? It can be Latin plus translations (source and destination).

May be too small a corpus, but I would like that very much anyhow.

jimmy76615 4 days ago||
> We're developing a responsible access framework that makes models available to researchers for scholarly purposes while preventing misuse.

The idea of training such a model is really a great one, but not releasing it because someone might be offended by the output is just stupid beyond belief.

nine_k 4 days ago||
Public access, triggering a few racist responses from the model, a viral post on Xitter, the usual outrage, a scandal, the project gets publicly vilified, financing ceases. The researchers carry the tail of negative publicity throughout their remaining careers.

Why risk all this?

vintermann 4 days ago|||
Because the problem of bad faith attacks can only get worse if you fold every time.

Sooner or later society has to come emotionally to terms with the fact that other times and places value things completely differently from us, hold as important things we don't care about, and are indifferent to things we do care about.

Intellectually I'm sure we already know, but e.g. banning old books because they have reprehensible values (or even just use nasty words) - or indeed, refusing to release a model trained on historic texts "because it could be abused" is a sign that emotionally we haven't.

It's not that it's a small deal, or should be expected to be easy. It's basically what Popper called "the strain of civilization" and posited as explanation for the totalitarianism which was rising in his time. But our values can't be so brittle that we can't even talk or think about other value systems.

cj 4 days ago||||
Because there are easy workarounds. If it becomes an issue, you can quickly add large disclaimers informing people that there might be offensive output because, well, it's trained on texts written during the age of racism.

People typically get outraged when they see something they weren't expecting. If you tell them ahead of time, the user typically won't blame you (they'll blame themselves for choosing to ignore the disclaimer).

And if disclaimers don't work, rebrand and relaunch it under a different name.

nine_k 4 days ago||
I wonder if you're being ironic here.

You speak as if the people who play to an outrage wave are interested in achieving truth, peace, and understanding. Instead the rage-mongers are there to increase their (perceived) importance, and for lulz. The latter factor should not be underappreciated; remember "meme stocks".

The risk is not large, but very real: the attack is very easy, and the potential downside, quite large. So not giving away access, but having the interested parties ask for it is prudent.

cj 4 days ago||
While I agree we live in a time of outrage, that also works in your favor.

When there’s so much “outrage” every day, it’s very easy to blend into the background. You might have a 5-minute moment of outrage fame, but it fades away quickly.

If you truly have good intentions with your project, you’re not going to get “canceled” and your career won’t be ruined.

Not being ironic. Not working on an LLM project because you’re worried about getting canceled by the outrage machine is an overreaction IMO.

Are you able to name any developer or researcher who has been canceled because of their technical project or had their careers ruined? The only ones I can think of are clearly criminal and not just controversial (SBF, Snowden, etc)

kurtis_reed 4 days ago||||
If people start standing up to the outrage it will lose its power
Forgeties79 4 days ago||||
> triggering a few racist responses from the model

I feel like, ironically, it would be folks less concerned with political correctness/not being offensive that would abuse this opportunity to slander the project. But that’s just my gut.

dingnuts 4 days ago||
[dead]
NuclearPM 4 days ago||||
That’s ridiculous. There is no risk.
nofriend 4 days ago||||
People know that models can be racist now. It's old hat. "LLM gets prompted into saying vile shit" hasn't been notable for years.
Alex2037 4 days ago||||
nobody gives a shit about the journos and the terminally online. the smear campaign against AI is a cacophony, background noise that most people have learned to ignore, even here.

consider this: https://news.ycombinator.com/from?site=nytimes.com

HN's most beloved shitrag. day after day, they attack AI from every angle. how many of those submissions get traction at this point?

why-o-why 4 days ago||||
I think you are confusing research with commodification.

This is a research project: it is clear how it was trained, and it is targeted at experts, enthusiasts, and historians. Like if I was studying racism, the reference books explicitly written to dissect racism wouldn't be racist agents with a racist agenda. And as a result, no one is banning these books (except conservatives that want to retcon American history).

Foundational models spewing racist white supremacist content when the trillion-dollar company forces it in your face is a vastly different scenario.

There's a clear difference.

aidenn0 4 days ago|||
> And as a result, no one is banning these books (except conservatives that want to retcon American history).

My (very liberal) local school district banned English teachers from teaching any book that contained the n-word, even at a high-school level, and even when the author was a black person talking about real events that happened to them.

FWIW, this was after complaints involving Of Mice and Men being on the curriculum.

zoky 4 days ago|||
Banning Huckleberry Finn from a school district should be grounds for immediate dismissal.
somenameforme 4 days ago|||
Even more so as the lesson of that story is perhaps the single most important one for people to learn in modern times.

Almost everybody in that book is an awful person, especially the most 'upstanding' of types. Even the protagonist is an awful person. The one and only exception is 'N* Jim' who is the only kind-hearted and genuinely decent person in the book. It's an entire story about how the appearances of people, and the reality of those people, are two very different things.

It being banned for using foul language, as educational outcomes continue to deteriorate, is just so perfectly ironic.

why-o-why 4 days ago|||
I don't support banning the book, but I think it is a hard book to teach because it needs SO much context and a mature audience (lol good luck). Also, there are hundreds of other books from that era that are relevant, even from Mark Twain's corpus, so being obstinate about that book is a questionable position. I'm ambivalent honestly, but definitely not willing to die on that hill. (I graduated high school in 1989 from a middle class suburb, we never read it.)
zoky 4 days ago||
I mean, you gotta read it. I’m not normally a huge fan of the classics; I find Steinbeck dry and tedious, and Hemingway to be self-indulgent and repetitious. Even Twain’s other work isn’t exactly to my taste. But I’ve read Huckleberry Finn three times—in elementary school just for fun, in high school because it was assigned, and I recently listened to it on audiobook—and enjoyed the hell out of each time. Banning it simply because it uses a word that the entire book simply couldn’t exist without is a crime, and does a huge disservice to the very students they are supposedly trying to protect.
why-o-why 4 days ago||
I have read it. I spent my 20s guiltily reading all of the books I was supposed to have read in high school but used Cliff's Notes instead. From my 20's perspective I found Finn insipid and hokey, but that's because pop culture had recycled it hundreds of times since its first publication; when I consider it from the period perspective, I can see the satire and the pointed allegories that made Twain so formidable. (Funny you mention Hemingway. I loved his writing in my 20's, then went back and read some again in my 40's and was like "huh, this is irritating and immature, no wonder I loved it in my 20's.")
Forgeties79 4 days ago|||
It’s a big country of roughly a third of a billion people; you’ll always find examples if you look hard enough. It’s ridiculous/wrong that your district did this, but frankly it’s the exception in liberal/progressive communities. It’s a very one-sided problem:

* https://abcnews.go.com/US/conservative-liberal-book-bans-dif...

* https://www.commondreams.org/news/book-banning-2023

* https://en.wikipedia.org/wiki/Book_banning_in_the_United_Sta...

aidenn0 4 days ago|||
I agree that the coordinated (particularly at a state level) restrictions[1] on books sit largely with the political Right in the US.

However, from around 2010, there has been an increasingly illiberal movement from the political Left in the US, which plays out at a more local level. My "vibe" is that it's not to the degree that it is on the Right, but bigger than the numbers suggest because librarians are more likely to stock e.g. It's Perfectly Normal at a middle school than something offensive to the left.

1: I'm up for suggestions for a better term; there is a scale here between putting absurd restrictions on school librarians and banning books outright. Fortunately the latter is still relatively rare in the US, despite the mistitling on the Wikipedia page you linked.

somenameforme 4 days ago|||
A practical issue is the sort of books being banned. Your first link offers examples of one side trying to ban Of Mice and Men, Adventures of Huckleberry Finn, and Dr. Seuss, with the other side trying to ban many books along the lines of Gender Queer. [1] That link is to the book, which is illustrated and quite NSFW.

There is a bizarrely large number of books similar to Gender Queer being published, which creates the numeric discrepancy. The irony is that if there were an equal but opposite book about straight sex, sexuality, associated kinks, and so forth, then I think both liberals and conservatives would probably be all for keeping it away from schools. It's solely focused on sexuality, is quite crude, illustrated, targeted towards young children, and there's no moral beyond the most surface level writing which is about coming to terms with one's sexuality.

And obviously coming to terms with one's sexuality is very important, but I really don't think books like that are doing much to aid in that - especially when it's targeted at an age demographic that's still going to be extremely confused, and even more so in a day and age when being different, if only for the sake of being different, is highly desirable. And given the nature of social media and the internet, decisions made today may stay with you for the rest of your life.

So for instance about 30% of Gen Z now declare themselves LGBT. [2] We seem to have entered into an equal but opposite problem of the past when those of deviant sexuality pretended to be straight to fit into societal expectations. And in many ways this modern twist is an even more damaging form of the problem from a variety of perspectives - fertility, STDs, stuff staying with you for the rest of your life, and so on. Let alone extreme cases where e.g. somebody engages in transition surgery or 1-way chemically induced changes which they end up later regretting.

[1] - https://archive.org/details/gender-queer-a-memoir-by-maia-ko...

[2] - https://www.nbcnews.com/nbc-out/out-news/nearly-30-gen-z-adu...

Forgeties79 4 days ago|||
From your NBC piece

> About half of the Gen Z adults who identify as LGBTQ identify as bisexual,

So that means ~15% of those surveyed are not attracted to the opposite sex (there’s more nuance to this statement but I imagine this needs to stay boilerplate), more or less, which is a big distinction. That’s hardly alarming and definitely not a major shift. We have also seen many cultures throughout history ebb and flow in their expression of bisexuality in particular.

> There is a bizarrely large number of books similar to Gender Queer being published, which creates the numeric discrepancy.

This really needs a source. And what makes it “bizarrely large”? How does it stack up against, say, the number of heterosexual romance novels?

> We seem to have entered into an equal but opposite problem of the past when those of deviant sexuality pretended to be straight to fit into societal expectations.

I really tried to give your comment a fair shake but I stopped here. We are not going to have a productive conversation. “Deviant sexuality”? Come on, man.

Anyway it doesn’t change the fact that the book banning movement is largely a Republican/conservative endeavor in the US. The numbers clearly bear it out.

somenameforme 4 days ago||
I'll get back to what you said, but first let me ask you something if you would. Imagine Gender Queer was made into a movie that remained 100% faithful to the source content. What do you think it would be rated? To me it seems obvious that it would, at the absolute bare minimum, be R rated. And of course screening R-rated films at a school is prohibited without explicit parental permission. Imagine books were given a rating and indeed it ended up with an R rating. Would your perspective on it being unavailable at a school library then be any different? I think this is relevant since a standardized content rating system for books will be the long-term outcome of this all if efforts to introduce such material to children continue to persist.

------

Okay, back to what you said. 30% being attracted to the same sex in any way, including bisexuality, is a large shift. People tend to have a mistaken perception of these things due to media misrepresentation. The percent of all people attracted to the same sex, in any way, is around 7% for men, and 15% for women [1], across a study of numerous Western cultures from 2016. And those numbers themselves are significantly higher than the past as well where the numbers tended to be in the ~4% range, though it's probably fair to say that cultural pressures were driving those older numbers to artificially low levels in the same way that I'm arguing that cultural pressures are now driving them to artificially high levels.

Your second source discusses the reason for the bans. It's overwhelmingly due to sexually explicit content, often in the form of a picture book, targeted at children. As for "sexual deviance", I'm certainly not going General Ripper on you, Mandrake. It is the most precise term [2] for what we are discussing as I'm suggesting that the main goal driving this change is simply to be significantly 'not normal.' That is essentially deviance by definition.

[1] - https://www.researchgate.net/publication/301639075_Sexual_Or...

[2] - https://dictionary.apa.org/sexual-deviance

Forgeties79 3 days ago|||
> any sexual behavior, such as a paraphilia, that is regarded as significantly different from the standards established by a culture or subculture. Deviant forms of sexual behavior may include voyeurism, fetishism, bestiality, necrophilia, sadism, and exhibitionism

I don’t see Lesbian, Gay, Bisexual, or Transgender in here, which would absolutely be explicitly included in the list if it applied. Stop saying “sexual deviants” when talking about LGBT people. You know what you’re doing, it’s an incredibly loaded and inaccurate term. To continue calling them “sexual deviants” is a hostile and openly bigoted act. Bestiality and homosexuality are not in the same category and you are wrong to assert otherwise - all while masking it by misrepresenting the APA’s stance at that.

I am not discussing this further. Enjoy the rest of your weekend.

somenameforme 2 days ago||
I'm not at all bigoted. If somebody genuinely is sexually attracted to the same sex, more power to them. Homosexuality also exists within nature and there are obviously people who simply have never been attracted to anything except the same sex since their first days. It's completely unreasonable to expect these people to try to change who they are on such a fundamental level, and so I think society, at large, should absolutely be tolerant of such.

But there is a major difference between tolerating something and endorsing it. I think this is especially true in modern times. 30% of people are obviously not LGB. So you have people acting out sexually in a way that's probably not only 'unnatural' for them, but may end up harming them long term. It's not a great situation. Because of this I do not indulge language policing, which I believe leans much more towards endorsing than tolerating. Yes, you are obviously right that I'm aware of what I'm doing, but I also assure you if we met and had a coffee you'd find me anything but bigoted or hostile. We just have different worldviews.

aidenn0 1 day ago|||
In the 1940s, Kinsey et al. found that 37% of adult males had had at least one homosexual experience, and 10% were "more or less exclusively homosexual for at least 3 years."

The numbers for women were much lower, but 30% doesn't seem crazy high if you consider the reduced stigma of the bisexual label would allow people who are primarily heterosexual, but are open to homosexual experiences to label themselves as bi.

somenameforme 11 hours ago||
Kinsey's work was poor quality and suffered from irreconcilable volunteer bias. It was completely based on people willing to be interviewed, in excessive detail and in his uniquely invasive 1 on 1 fashion, about their most intimate sexual experiences. Even today that is not something which 'normal' people agree to, and this was done during the 40s and 50s! On top of that he made 0 effort whatsoever to obtain a representative sample of society, so it's a biased sample of a biased sample, which drives an exponential deviation from reality due to multiplicative biasing.

This is where you get his conclusions such as 37% of men having had a homosexual experience, or 69% of men having purchased a prostitute. It's plainly ridiculous.

aidenn0 4 days ago|||
[dead]
andsoitis 4 days ago|||
> no one is banning these books

No books should ever be banned. Doesn’t matter how vile it is.

gnarbarian 4 days ago||||
this is FUD.
teaearlgraycold 4 days ago|||
Sure but Grok already exists.
dash2 4 days ago|||
You have to understand that while the rest of the world has moved on from 2020, academics are still living there. There are many strong leftists, many of whom are deeply censorious; there are many more timeservers and cowards, who are terrified of falling foul of the first group.

And there are force multipliers for all of this. Even if you yourself are a sensible and courageous person, you want to protect your project. What if your manager, ethics committee or funder comes under pressure?

fkdk 4 days ago||
Maybe the authors are overly careful. Maybe not publishing aspects of their work gives them an edge over academic competitors. Maybe both.

In my experience "data available upon request" doesn't always mean what you'd think it does.

davidpfarrell 4 days ago||
Can't wait for all the syncopated "Thou dost well to question that" responses!
PeterStuer 4 days ago|
How does it do on Python coding? Not 100% troll, cross-domain coherence is a thing.