Leaderboard

How every model scores on grief writing.

Every response in the corpus, judged 0–100 against its scenario's rubric. Humans and LLMs on the same scale.

Corpus lift

Haiku 4.5: +5.7 points (71.9 → 77.7), across 31 scenarios
Sonnet 4.6: +2.9 points (79.1 → 81.9), across 22 scenarios
Opus 4.7: +1.7 points (83.3 → 84.9), across 49 scenarios
Haiku 4.5 (RAG): +6.0 points (73.2 → 79.2), across 48 scenarios
Sonnet 4.6 (RAG, 49/49): +2.4 points (80.3 → 82.7), across 49 scenarios

For each model, we compare its score on a scenario alone vs. the same model on the same scenario with the public corpus as in-context examples. Same judge, same scenarios — only the conditioning changes.
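The lift figures above reduce to a paired comparison: for each scenario, the judge score with the corpus in context minus the score alone, averaged over the scenarios that appear in both conditions. A minimal sketch of that calculation (toy numbers and hypothetical field names, not the site's actual pipeline):

```python
from statistics import mean

def corpus_lift(scores_alone, scores_with_corpus):
    """Mean lift across scenarios scored in both conditions.

    Each dict maps scenario id -> judge score (0-100). Only scenarios
    present in both dicts are compared, so the lift is paired.
    """
    shared = scores_alone.keys() & scores_with_corpus.keys()
    lifts = [scores_with_corpus[s] - scores_alone[s] for s in shared]
    return round(mean(lifts), 1)

# Toy scores, not the real corpus:
alone = {"s1": 70.0, "s2": 74.0, "s3": 71.7}
with_corpus = {"s1": 76.0, "s2": 79.0, "s3": 78.1}
print(corpus_lift(alone, with_corpus))  # mean of [6.0, 5.0, 6.4] -> 5.8
```

Because the comparison is paired per scenario, a model evaluated on fewer scenarios (e.g. Sonnet 4.6's 22) is still comparable on lift, though not directly on mean score.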

#   Model                             Mean score  Lift  n   Notes
01  Human                             89.5              48  48 public contributions
02  Claude Opus 4.7 + corpus          84.9        +1.7  49  same model, same judge, dataset in context
03  Claude Opus 4.7                   83.3              49  Anthropic; alone, no in-context examples
04  Claude Sonnet 4.6 + corpus (RAG)  82.7        +2.4  49  top-5 semantic retrieval, updating live (49/49)
05  Claude Sonnet 4.6 + corpus        81.9        +2.9  22  same model, same judge, dataset in context
06  Claude Haiku 4.5 + corpus (RAG)   79.2        +6.0  48  top-5 semantic retrieval, same model + judge
07  Claude Sonnet 4.6                 79.1              22  Anthropic; alone, no in-context examples
08  Claude Haiku 4.5 + corpus         77.7        +5.7  31  same model, same judge, dataset in context
09  Claude Haiku 4.5                  71.9              31  Anthropic; alone, no in-context examples
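The RAG rows condition each response on the top-5 most similar corpus entries for the scenario, rather than the whole dataset. A minimal sketch of that selection step, assuming precomputed embeddings and cosine similarity (the embedding model and vector source are not specified here):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, corpus_vecs, k=5):
    """Indices of the k corpus entries most similar to the query."""
    ranked = sorted(range(len(corpus_vecs)),
                    key=lambda i: cosine(query_vec, corpus_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy 2-d embeddings standing in for real scenario/contribution vectors:
corpus = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
print(top_k([1.0, 0.1], corpus, k=2))  # -> [0, 2]
```

The retrieved entries are then placed in the prompt as in-context examples, so the only difference from the "+ corpus" rows is how many (and which) examples the model sees.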