Leaderboard
How every model scores on grief writing.
Every response in the corpus, judged 0–100 against its scenario's rubric. Humans and LLMs on the same scale.
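To make the judging step concrete, here is a minimal sketch of scoring one response 0–100 against its scenario's rubric with an LLM judge. The judge model, prompt wording, and single-number reply format are assumptions for illustration, not the pipeline behind this page.

```python
# Hypothetical rubric-judging step: one response in, one 0-100 score out.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge(scenario: str, rubric: str, response: str) -> float:
    """Score a single response against its scenario's rubric."""
    msg = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder judge model
        max_tokens=8,
        messages=[{
            "role": "user",
            "content": (
                f"Scenario:\n{scenario}\n\nRubric:\n{rubric}\n\n"
                f"Response:\n{response}\n\n"
                "Grade the response against the rubric. "
                "Reply with a single integer from 0 to 100."
            ),
        }],
    )
    return float(msg.content[0].text.strip())
```

Human and model responses would go through the same judging function, which is what puts them on one scale.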
Corpus lift
| Model | Mean score (alone → with corpus) | Lift | Scenarios |
| --- | --- | --- | --- |
| Haiku 4.5 | 71.9 → 77.7 | +5.7 | 31 |
| Sonnet 4.6 | 79.1 → 81.9 | +2.9 | 22 |
| Opus 4.7 | 83.3 → 84.9 | +1.7 | 49 |
| Haiku 4.5 (RAG) | 73.2 → 79.2 | +6.0 | 48 |
| Sonnet 4.6 (RAG) | 80.3 → 82.7 | +2.4 | 49 |

Lifts appear to be computed on unrounded means, so a lift can differ by 0.1 from the difference of the rounded scores shown.
For each model, we compare its score on each scenario alone with its score on the same scenario when the public corpus is supplied as in-context examples. Same judge, same scenarios; only the conditioning changes.
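A minimal sketch of that paired comparison, assuming each condition yields one 0–100 judge score per scenario; the scenario IDs and function name are illustrative, not this page's actual pipeline.

```python
# Corpus lift as a paired comparison: the same scenarios are scored in
# both conditions, and the per-scenario differences are averaged.
from statistics import mean

def corpus_lift(alone: dict[str, float], with_corpus: dict[str, float]) -> float:
    """Mean per-scenario lift: (score with corpus) - (score alone)."""
    shared = sorted(alone.keys() & with_corpus.keys())  # scenarios judged in both runs
    return mean(with_corpus[s] - alone[s] for s in shared)

# Toy numbers, not data from this leaderboard.
alone = {"s01": 70.0, "s02": 74.0}
with_corpus = {"s01": 77.0, "s02": 79.0}
print(round(corpus_lift(alone, with_corpus), 1))  # 6.0
```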
| # | Model | Notes | Mean score | Lift | n |
| --- | --- | --- | --- | --- | --- |
| 01 | Human | 48 public contributions | 89.5 | | 48 |
| 02 | Claude Opus 4.7 + corpus | same model, same judge, dataset in context | 84.9 | +1.7 | 49 |
| 03 | Claude Opus 4.7 | Anthropic; alone, no in-context examples | 83.3 | | 49 |
| 04 | Claude Sonnet 4.6 + corpus (RAG) | top-5 semantic retrieval (sketched below) | 82.7 | +2.4 | 49 |
| 05 | Claude Sonnet 4.6 + corpus | same model, same judge, dataset in context | 81.9 | +2.9 | 22 |
| 06 | Claude Haiku 4.5 + corpus (RAG) | top-5 semantic retrieval, same model + judge | 79.2 | +6.0 | 48 |
| 07 | Claude Sonnet 4.6 | Anthropic; alone, no in-context examples | 79.1 | | 22 |
| 08 | Claude Haiku 4.5 + corpus | same model, same judge, dataset in context | 77.7 | +5.7 | 31 |
| 09 | Claude Haiku 4.5 | Anthropic; alone, no in-context examples | 71.9 | | 31 |
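The RAG rows condition the model on the top-5 corpus contributions retrieved for each scenario. A minimal sketch of that retrieval step, assuming cosine similarity over sentence embeddings; the embedding model and corpus format are stand-ins, not this page's actual stack.

```python
# Hypothetical top-5 semantic retrieval over the public corpus.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def top_k(scenario: str, corpus: list[str], k: int = 5) -> list[str]:
    """Return the k corpus contributions most similar to the scenario."""
    docs = embedder.encode(corpus, normalize_embeddings=True)       # unit vectors
    query = embedder.encode([scenario], normalize_embeddings=True)[0]
    sims = docs @ query                                             # cosine similarity
    best = np.argsort(-sims)[:k]                                    # indices of top-k
    return [corpus[int(i)] for i in best]
```

The retrieved contributions would then be placed in the prompt as in-context examples before the model writes its own response, whereas the non-RAG "+ corpus" rows put the whole dataset in context instead.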