
Posted by jauws 3 days ago

Show HN: I built "AI Wattpad" to eval LLMs on fiction (narrator.sh)
I've been a webfiction reader for years (too many hours on Royal Road), and I kept running into the same question: which LLMs actually write fiction that people want to keep reading? That's why I built Narrator (https://narrator.sh/llm-leaderboard) – a platform where LLMs generate serialized fiction and get ranked by real reader engagement.

Turns out this is surprisingly hard to answer. Creative writing isn't a single capability – it's a pipeline: brainstorming → writing → memory. You need to generate interesting premises, execute them with good prose, and maintain consistency across a long narrative. Most benchmarks test these in isolation, but readers experience them as a whole.

The current evaluation landscape is fragmented. Memory benchmarks like FictionLive's use multiple-choice questions to check whether models remember plot details across long contexts. Useful, but memory is necessary for good fiction, not sufficient: a model can ace recall and still write boring stories.

Author-side usage data from tools like Novelcrafter shows which models writers prefer as copilots. But that measures what's useful for human-AI collaboration, not what produces engaging standalone output. Authors and readers have different needs.

LLM-as-a-judge is the most common approach for prose quality, but it's notoriously unreliable for creative work. Models have systematic biases (favoring verbose prose, certain structures), and "good writing" is genuinely subjective in ways that "correct code" isn't.

What's missing is a reader-side quantitative benchmark – something that measures whether real humans actually enjoy reading what these models produce. That's the gap Narrator fills: views, time spent reading, ratings, bookmarks, comments, return visits. Think of it as an "AI Wattpad" where the models are the authors.
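To make "reader engagement" concrete, here's a minimal sketch of how the signals listed above could be blended into a single score per story. The function name and the weights are my own illustration, not Narrator's actual formula:

```python
def engagement_score(views, minutes_read, avg_rating, bookmarks, return_visits):
    """Blend per-view depth signals into one score.

    Weights are illustrative only. Normalizing by views keeps a story
    with 10 devoted readers comparable to one with 10,000 skimmers.
    """
    if views == 0:
        return 0.0
    return (0.4 * (minutes_read / views)      # average reading depth
            + 0.3 * avg_rating                 # explicit quality signal
            + 0.2 * (return_visits / views)    # did readers come back?
            + 0.1 * (bookmarks / views))       # did they save it?
```

In practice you'd likely want per-chapter retention curves rather than a scalar, but a scalar is what a leaderboard needs to rank on.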

I shared an early DSPy-based version here 5 months ago (https://news.ycombinator.com/item?id=44903265). The big lesson: one-shot generation doesn't work for long-form fiction. Models lose plot threads, forget characters, and quality degrades across chapters.

The rewrite: from one-shot to a persistent agent loop

The current version runs each model through a writing harness that maintains state across chapters. Before generating, the agent reviews structured context: character sheets, plot outlines, unresolved threads, world-building notes. After generating, it updates these artifacts for the next chapter. Essentially each model gets a "writer's notebook" that persists across the whole story.

This made a measurable difference – models that struggled with consistency in the one-shot version improved significantly with access to their own notes.
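The loop described above can be sketched roughly like this. The `Notebook` fields mirror the artifacts mentioned (character sheets, outlines, unresolved threads, world notes), but the structure and function names are my guesses at the shape of the harness, not Narrator's actual code:

```python
from dataclasses import dataclass, field

@dataclass
class Notebook:
    """Persistent 'writer's notebook' carried across chapters."""
    characters: dict = field(default_factory=dict)    # name -> character sheet
    outline: list = field(default_factory=list)       # planned / completed beats
    open_threads: list = field(default_factory=list)  # unresolved plot threads
    world_notes: list = field(default_factory=list)

def build_context(nb: Notebook) -> str:
    """Flatten the notebook into a prompt prefix for the next chapter."""
    return "\n".join([
        "CHARACTERS: " + "; ".join(f"{k}: {v}" for k, v in nb.characters.items()),
        "OUTLINE: " + " -> ".join(nb.outline),
        "OPEN THREADS: " + "; ".join(nb.open_threads),
        "WORLD: " + "; ".join(nb.world_notes),
    ])

def write_chapter(model, nb: Notebook, chapter_num: int) -> str:
    """One iteration: review notes, generate, then update the notes."""
    context = build_context(nb)
    chapter = model(f"{context}\n\nWrite chapter {chapter_num}.")
    # In the real harness the model would rewrite its own notes here
    # (a second call); this stub just records the completed beat.
    nb.outline.append(f"ch{chapter_num} written")
    return chapter
```

The key property is that state lives outside the context window, so chapter 40 sees a curated summary rather than 39 raw chapters.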

Granular filtering instead of a single score:

We classify stories upfront by language, genre, tags, and content rating. Instead of one "creative writing" leaderboard, we can drill into specifics: which model writes the best Spanish Comedy? Which handles LitRPG stories with Male Leads the best? Which does well with romance versus horror?

The answers aren't always what you'd expect from general benchmarks. Some models that rank mid-tier overall dominate specific niches.
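The drill-down described above amounts to faceted filtering over leaderboard entries. A toy version, with field names that are my own illustration rather than the site's schema:

```python
def filter_leaderboard(entries, **facets):
    """Rank entries matching every requested facet, best score first.

    entries: list of dicts with 'model', 'language', 'genre', 'score' keys.
    Example: filter_leaderboard(entries, language="es", genre="comedy")
    answers "which model writes the best Spanish comedy?"
    """
    matching = [e for e in entries
                if all(e.get(k) == v for k, v in facets.items())]
    return sorted(matching, key=lambda e: e["score"], reverse=True)
```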

A few features I'm proud of:

Story forking lets readers branch stories CYOA-style – if you don't like where the plot went, fork it and see how the same model handles the divergence. This creates natural A/B comparisons.

Visual LitRPG was a personal itch to scratch. Instead of walls of [STR: 15 → 16] text, stats and skill trees render as actual UI elements. Example: https://narrator.sh/novel/beware-the-starter-pet/chapter/1
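Rendering those stat blocks as UI implies pulling structured data out of inline tags like [STR: 15 → 16]. A sketch of that extraction step, assuming a tag format like the example above (the regex and tuple shape are illustrative, not the site's actual markup):

```python
import re

# Matches tags like [STR: 15 -> 16] or [STR: 15 → 16]
STAT_TAG = re.compile(r"\[(\w+):\s*(\d+)\s*(?:->|\u2192)\s*(\d+)\]")

def extract_stat_changes(text):
    """Return ((stat, old, new) tuples, text with tags stripped).

    The structured tuples can feed a stat-bar or skill-tree widget,
    while the cleaned prose is rendered normally.
    """
    changes = [(m.group(1), int(m.group(2)), int(m.group(3)))
               for m in STAT_TAG.finditer(text)]
    cleaned = STAT_TAG.sub("", text).strip()
    return changes, cleaned
```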

What I'm looking for:

More readers to build out the engagement data. Also curious if anyone else working on long-form LLM generation has found better patterns for maintaining consistency across chapters – the agent harness approach works but I'm sure there are improvements.

32 points | 32 comments
dehugger 3 days ago||
[flagged]
jauws 3 days ago|
If you have specific objections, I’m open to hearing them.
mp_mn 3 days ago|
[flagged]
jauws 3 days ago|
Thanks for the feedback. What would you need to see to change your mind?
empath75 3 days ago|||
I am not going to argue this on the basis that LLMs suck at fiction, because even if it's true, it's not really that relevant. The problem is that what LLMs are good at is producing mediocre fiction particular to the tastes of the individual reading it. What people will keep reading is fiction an LLM wrote because they personally asked it to write it.

I don't want to read fiction generated from someone else's ideas. I want to read LLM fiction generated from my weird quirks and personal taste.

mp_mn 3 days ago|||
There's more quality fiction out there than you or I will ever have time to read. I don't see a purpose in flooding the world with more mediocre to unreadable fiction.
jauws 3 days ago||
Realistically, I don't think anyone will be spending hours here instead of reading real fiction anytime soon (I personally wouldn't). There's just so much nuanced complexity when it comes to creative writing as a domain (long-form outputs, creativity, etc.) that coming up with better annotation methods has massive applications in other research, like in scientific discovery. "AI Wattpad" just happens to be a convenient form factor for crowdsourcing from an HCI perspective. I hope you give it a chance.
mp_mn 3 days ago||
OK, so you already recognize these stories aren't something that people are going to spend time sorting through. How could you possibly then get any usable preference data out of this?
jauws 3 days ago||
If you look at similar live benchmarks like LMArena or Design Arena, there's an extremely large number of unique annotators with a low number of annotations per person – which is normal. However, since this platform is designed to generate fiction catered to individual interests, my hypothesis is that the added novelty will help aggregate enough usable data over time.
mp_mn 3 days ago||
I tried reading the two top rated stories. They're both unreadable gunk. Why would I (or anyone else) go for another? Why would I tell anyone I know to spend their time reading this?