Leaderboard

How every model scores on grief writing.

Every response in the corpus, judged 0–100 against its scenario's rubric. Humans and LLMs on the same scale.

Corpus lift

Haiku 4.5: +5.7 points (71.9 → 77.7), across 31 scenarios
Sonnet 4.6: +2.9 points (79.1 → 81.9), across 22 scenarios
Opus 4.7: +1.7 points (83.3 → 84.9), across 49 scenarios
Haiku 4.5 (RAG): +6.0 points (73.2 → 79.2), across 48 scenarios
Sonnet 4.6 (RAG, 49/49): +2.4 points (80.3 → 82.7), across 49 scenarios

For each model, we compare its score on a scenario alone vs. the same model on the same scenario with the public corpus as in-context examples. Same judge, same scenarios — only the conditioning changes.
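The lift figures above reduce to a paired comparison: for each scenario, the judge score with the corpus in context minus the score alone, averaged over the scenarios that appear in both conditions. A minimal sketch of that calculation (toy numbers and hypothetical field names, not the site's actual pipeline):

```python
from statistics import mean

def corpus_lift(scores_alone, scores_with_corpus):
    """Mean lift across scenarios scored in both conditions.

    Each dict maps scenario id -> judge score (0-100). Only scenarios
    present in both dicts are compared, so the lift is paired.
    """
    shared = scores_alone.keys() & scores_with_corpus.keys()
    lifts = [scores_with_corpus[s] - scores_alone[s] for s in shared]
    return round(mean(lifts), 1)

# Toy scores, not the real corpus:
alone = {"s1": 70.0, "s2": 74.0, "s3": 71.7}
with_corpus = {"s1": 76.0, "s2": 79.0, "s3": 78.1}
print(corpus_lift(alone, with_corpus))  # mean of [6.0, 5.0, 6.4] -> 5.8
```

Because the comparison is paired per scenario, a model evaluated on fewer scenarios (e.g. Sonnet 4.6's 22) is still comparable on lift, though not directly on mean score.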

#   Model                             Mean score  Lift  n   Notes
01  Human                             89.5              48  48 public contributions
02  Claude Opus 4.7 + corpus          84.9        +1.7  49  same model, same judge, dataset in context
03  Claude Opus 4.7                   83.3              49  Anthropic; alone, no in-context examples
04  Claude Sonnet 4.6 + corpus (RAG)  82.7        +2.4  49  top-5 semantic retrieval, updating live (49/49)
05  Claude Sonnet 4.6 + corpus        81.9        +2.9  22  same model, same judge, dataset in context
06  Claude Haiku 4.5 + corpus (RAG)   79.2        +6.0  48  top-5 semantic retrieval, same model + judge
07  Claude Sonnet 4.6                 79.1              22  Anthropic; alone, no in-context examples
08  Claude Haiku 4.5 + corpus         77.7        +5.7  31  same model, same judge, dataset in context
09  Claude Haiku 4.5                  71.9              31  Anthropic; alone, no in-context examples
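The RAG rows condition each response on the top-5 most similar corpus entries for the scenario, rather than the whole dataset. A minimal sketch of that selection step, assuming precomputed embeddings and cosine similarity (the embedding model and vector source are not specified here):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, corpus_vecs, k=5):
    """Indices of the k corpus entries most similar to the query."""
    ranked = sorted(range(len(corpus_vecs)),
                    key=lambda i: cosine(query_vec, corpus_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy 2-d embeddings standing in for real scenario/contribution vectors:
corpus = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
print(top_k([1.0, 0.1], corpus, k=2))  # -> [0, 2]
```

The retrieved entries are then placed in the prompt as in-context examples, so the only difference from the "+ corpus" rows is how many (and which) examples the model sees.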