Posted by ahamez 3 days ago

Why do AI models use so many em-dashes? (www.seangoedecke.com)
94 points | 96 comments
qubex 16 hours ago|
I’m amongst those who used to use em-dashes and now actively avoid them.
redheadednomad 2 days ago||
"If AI labs wanted to go beyond that, they’d have to go and buy older books, which would probably have more em-dashes."

Actually, they wouldn't have to go and buy these old books: the texts are already available copyright-free. Copyright generally expires 70 years after the author's death, and any book published in the USA before 1923 is in the public domain, so the full texts of old books are easy to find on the internet!

iainctduncan 2 days ago||
This has always seemed intuitively obvious to me. I use a lot of em dashes... because I read a lot. Including a lot of older, academic, or more formally written books. And the amount used in AI prose has never struck me as odd for the same reason. (Ditto for semicolons.)

The truth is ... most people don't read much. So it's not too surprising they think it looks weird if all they read is posts on the internet, where the average writer has never even learned how to make one on the keyboard.

"Delve", on the other hand, that shit looks weird. That is waaay over-represented.

0xbadc0de5 3 days ago||
My first thought was watermarking. Same for its affinity for using emojis in bullet lists.
shadowvoxing 2 days ago||
This episode of Big Technology Podcast goes into the reason why: https://pca.st/episode/4090833a-2abd-42b2-a31d-ebb2b4348007
atoav 2 days ago||
As someone who used em-dashes extensively before LLMs I can only hope (?) some of myself is in there. I really liked em-dashes, but now I have to actively avoid them, because many people use them as a marker to recognize text that has been invented by the stochastic machine.
spidersouris 3 days ago||
What we also learned after GPT-3.5 is that, to circumvent the need for new training data, we could simply resort to existing LLMs to generate new, synthetic data. I would not be surprised if the em dash is the product of synthetically generated data (perhaps forced to be present in this data) used for the training of newer models.
keiferski 3 days ago||
I am no grammarian, but I feel like em-dashes are an easy way to tie together two different concepts without rewriting the entire sentence to flow more elegantly. (Not to say that em-dashes are inelegant, I like them a lot myself.)

And so AI models are prone to using them because they require less computation than rewriting a sentence.

bitshiftfaced 3 days ago|
This is sort of my thinking too. The model predicts the next token given the ones already generated. Dashes are an efficient way to continue a thought once you've already written a nearly complete sentence, without creating a run-on sentence. They're efficient in the sense that they allow more grammatically correct future options even when you've committed to the previous tokens.
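A toy sketch of that intuition (the punctuation states and continuation types below are invented for illustration, not taken from any real model's grammar): after a nearly complete sentence, a dash leaves more kinds of grammatically valid continuations open than a period or comma does.

```python
# Toy illustration: which continuation "types" each punctuation state allows.
# These sets are made up for the sketch; a real LM works over token logits,
# not hand-written grammar categories.
CONTINUATIONS = {
    "period":  {"new_sentence"},
    "comma":   {"clause", "list_item"},
    "em_dash": {"new_sentence", "clause", "list_item", "aside"},
}

def option_count(state: str) -> int:
    """Number of continuation types still available from this state."""
    return len(CONTINUATIONS[state])

# The em dash keeps the most options open after a nearly complete sentence.
most_flexible = max(CONTINUATIONS, key=option_count)
print(most_flexible)  # em_dash
```

Under this (hand-waved) framing, a model that has already committed to most of a sentence loses the fewest options by emitting a dash, which is one way to read the commenter's "efficiency" point.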
AbstractH24 2 days ago|
My question is: given their satirical association with AI, why haven't the models been manually tuned not to use them?
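One mechanism a lab (or an API user) could use for this is decoding-time suppression: push the unwanted token's logit to negative infinity before sampling, so it can never be emitted. Here is a minimal pure-Python sketch of the idea; the vocabulary and logit values are invented for illustration, though real serving stacks expose similar knobs (e.g. per-token logit biases or bad-words lists).

```python
import math

def softmax(logits: dict) -> dict:
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def suppress(logits: dict, banned: set) -> dict:
    """Set banned tokens' logits to -inf so they get zero probability."""
    return {t: (float("-inf") if t in banned else v)
            for t, v in logits.items()}

# Invented next-token logits for illustration only.
logits = {"\u2014": 2.0, ",": 1.5, ";": 0.5, ".": 1.0}
probs = softmax(suppress(logits, banned={"\u2014"}))

assert probs["\u2014"] == 0.0          # the em dash can never be sampled
assert abs(sum(probs.values()) - 1.0) < 1e-9  # mass renormalizes elsewhere
```

Whether labs actually apply this to the em dash is an open question, as the commenter notes; the sketch only shows that the mechanism is cheap to apply at inference time.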