Model Analysis

Detailed performance breakdowns across 13 frontier models and 200 conversations. Each chart isolates a different facet of emotional intelligence: emotion tracking, perspective-taking, conversation-wide reasoning, and downstream response quality.

Overall performance

Where every model lands at a glance, and how consistent each one is.

Score Distributions

Composite score variation across the 200 conversations, per model. The leaderboard reports a single average, but two models with the same average can behave very differently across users: one consistent, one with a wider spread between its best and worst conversations.
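To make the point about equal averages concrete, here is a minimal sketch that summarizes two sets of per-conversation composite scores by mean, standard deviation, and best-minus-worst range. The two score lists and the `summarize` helper are hypothetical, chosen only so the averages match; they are not taken from the benchmark.

```python
from statistics import mean, pstdev


def summarize(scores):
    """Return (mean, standard deviation, best-minus-worst range) for one model."""
    return mean(scores), pstdev(scores), max(scores) - min(scores)


# Hypothetical per-conversation composite scores for two made-up models.
steady = [50, 52, 48, 51, 49]
swingy = [30, 70, 45, 65, 40]

steady_mean, steady_sd, steady_range = summarize(steady)
swingy_mean, swingy_sd, swingy_range = summarize(swingy)

print(f"steady: mean={steady_mean:.1f} sd={steady_sd:.2f} range={steady_range}")
print(f"swingy: mean={swingy_mean:.1f} sd={swingy_sd:.2f} range={swingy_range}")
```

Both lists average 50, but the second has a much larger standard deviation and range, which is the kind of difference a single leaderboard average hides.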

Performance Heatmap

Columns: Composite Score, Emotion F1, VA Score, Hit Rate, Observer Accuracy, Human Accuracy, Pairwise Accuracy, Kendall Tau, Draft Judge, Draft Alignment, Four-Branch, PANAS, Q1 Goals, Q3 Fit, Turn Average, Conversation Average.

Color scale: best in column, middle, worst in column.

Emotion understanding

How accurately each model reads what participants are feeling.

Emotion Tracking

Emotion F1

Valence-Arousal

Emotion VA by Diagnosis

Valence-arousal accuracy broken out by participant diagnosis group. For every model listed, accuracy is highest for participants with no reported diagnosis, lower for those with anxiety/depression, and lowest for those with ASD/ADHD. This suggests that current models read the affect of participants without a diagnosis more accurately than that of diagnosed participants.

Model	None	Anx/Dep	ASD/ADHD
Claude Sonnet 4.6	0.31	0.21	0.08
Claude Opus 4.8	0.31	0.20	0.10
Claude Opus 4.6	0.31	0.21	0.08
Mistral Large	0.29	0.18	0.08
Claude Opus 4.7	0.31	0.22	0.10
Claude Fable 5	0.32	0.24	0.09
GPT-5.5	0.31	0.23	0.09
Qwen 2.5 72B	0.31	0.23	0.09
Claude Haiku 4.5	0.33	0.24	0.12
MiMo-v2-Pro	0.31	0.23	0.13
Gemini 3.1 Pro	0.32	0.26	0.12
GPT-5.4	0.31	0.24	0.11
Grok 4	0.31	0.24	0.18
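As a quick check on the breakdown above, the sketch below computes each model's drop from the no-diagnosis group to the ASD/ADHD group. The values are copied from the figures above; the `drops` helper and the choice of a simple difference between rounded scores are assumptions made for illustration, not the benchmark's own analysis.

```python
# (model, None, Anx/Dep, ASD/ADHD) valence-arousal scores copied from the breakdown above.
VA_BY_DIAGNOSIS = [
    ("Claude Sonnet 4.6", 0.31, 0.21, 0.08),
    ("Claude Opus 4.8", 0.31, 0.20, 0.10),
    ("Claude Opus 4.6", 0.31, 0.21, 0.08),
    ("Mistral Large", 0.29, 0.18, 0.08),
    ("Claude Opus 4.7", 0.31, 0.22, 0.10),
    ("Claude Fable 5", 0.32, 0.24, 0.09),
    ("GPT-5.5", 0.31, 0.23, 0.09),
    ("Qwen 2.5 72B", 0.31, 0.23, 0.09),
    ("Claude Haiku 4.5", 0.33, 0.24, 0.12),
    ("MiMo-v2-Pro", 0.31, 0.23, 0.13),
    ("Gemini 3.1 Pro", 0.32, 0.26, 0.12),
    ("GPT-5.4", 0.31, 0.24, 0.11),
    ("Grok 4", 0.31, 0.24, 0.18),
]


def drops(rows):
    """Map each model to its score drop from the no-diagnosis group to ASD/ADHD."""
    return {name: round(none - asd, 2) for name, none, _anx, asd in rows}


gap = drops(VA_BY_DIAGNOSIS)

# True when every model scores None > Anx/Dep > ASD/ADHD.
monotone = all(none > anx > asd for _name, none, anx, asd in VA_BY_DIAGNOSIS)

# Model whose score falls least between the two groups.
smallest = min(gap, key=gap.get)

print(f"ordering holds for all models: {monotone}")
print(f"smallest drop: {smallest} ({gap[smallest]:.2f})")
```

Because the inputs are already rounded to two decimals, the differences are only approximate.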

Response & perspective

Predicting participants’ preferences, taking their perspective, and writing fitting replies.

Holistic Thinkers vs. Step-by-Step Annotators

Models, in chart order: Qwen 2.5 72B, Gemini 3.1 Pro, Grok 4, GPT-5.5, GPT-5.4, Mistral Large, Claude Haiku 4.5, Claude Fable 5, Claude Opus 4.6, MiMo-v2-Pro, Claude Opus 4.8, Claude Sonnet 4.6, Claude Opus 4.7.

Legend: turn-level (lighter), conversation-wide (solid); hue = % change.

Four-Branch EQ & Preference Prediction

Four-Branch EQ

Pairwise Preference

Conversation Quality

Q1 Goals

Q3 Response Fit

Perspective Gap

Models, in chart order: Claude Opus 4.6, Claude Opus 4.8, Gemini 3.1 Pro, GPT-5.4, Qwen 2.5 72B, GPT-5.5, Claude Fable 5, Claude Sonnet 4.6, MiMo-v2-Pro, Mistral Large, Claude Haiku 4.5, Grok 4, Claude Opus 4.7.

Legend: better at human view, worse at human view.

Draft Response Quality

Subgroup analysis

How scores break down by the topic of the conversation and the participant.

Conversation Topics

Participant Diagnosis

Diagnosis group	Average composite
None	52.4
Anxiety/Depression	54.3
ASD/ADHD	48.7

Area = number of conversations; number = average composite.

Behavior in detail

Cross-metric correlations, item-level prediction, and within-conversation position.

Metric Relationships

	Composite	Emotion F1	VA Score	Observer	Human	Pairwise	Draft	Gap
Composite	—	0.47	0.51	0.35	0.34	0.44	0.24	-0.13
Emotion F1	0.47	—	0.64	0.03	0.01	0.01	0.09	-0.02
VA Score	0.51	0.64	—	0.15	0.18	-0.07	0.12	-0.06
Observer	0.35	0.03	0.15	—	0.62	-0.05	0.21	-0.14
Human	0.34	0.01	0.18	0.62	—	-0.10	0.19	0.06
Pairwise	0.44	0.01	-0.07	-0.05	-0.10	—	0.14	0.01
Draft	0.24	0.09	0.12	0.21	0.19	0.14	—	-0.09
Gap	-0.13	-0.02	-0.06	-0.14	0.06	0.01	-0.09	—
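One way to read the matrix above is to rank metric pairs by the absolute size of their correlation. The sketch below does that using the lower triangle of the matrix, copied row by row. The `ranked_pairs` helper is hypothetical, and excluding Composite is a choice made here on the assumption that it aggregates the other metrics; the page does not say which correlation coefficient is used.

```python
METRICS = ["Composite", "Emotion F1", "VA Score", "Observer",
           "Human", "Pairwise", "Draft", "Gap"]

# Lower triangle of the matrix above: row i holds correlations with metrics 0..i-1.
LOWER = [
    [],
    [0.47],
    [0.51, 0.64],
    [0.35, 0.03, 0.15],
    [0.34, 0.01, 0.18, 0.62],
    [0.44, 0.01, -0.07, -0.05, -0.10],
    [0.24, 0.09, 0.12, 0.21, 0.19, 0.14],
    [-0.13, -0.02, -0.06, -0.14, 0.06, 0.01, -0.09],
]


def ranked_pairs(names, lower, exclude=("Composite",)):
    """Return (|r|, metric_a, metric_b, r) tuples, strongest relationship first."""
    pairs = []
    for i, row in enumerate(lower):
        for j, r in enumerate(row):
            if names[i] in exclude or names[j] in exclude:
                continue
            pairs.append((abs(r), names[j], names[i], r))
    return sorted(pairs, reverse=True)


top = ranked_pairs(METRICS, LOWER)[:2]
for _size, a, b, r in top:
    print(f"{a} vs {b}: r = {r:+.2f}")
```

With Composite set aside, the two strongest relationships are Emotion F1 with VA Score (0.64) and Observer with Human (0.62); every other pair is at or below 0.21 in absolute value.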

PANAS Item-Level Prediction

Positive Affect: Interested, Excited, Strong, Enthusiastic, Proud, Alert, Inspired, Determined, Attentive, Active

Negative Affect: Distressed, Upset, Guilty, Scared, Hostile, Irritable, Ashamed, Nervous, Jittery, Afraid

Color scale: easiest to hardest.

Conversation Position

Panels: Mid vs Early, Late vs Early. Each panel covers Emotion F1, Observer Accuracy, Pairwise Preference, and Draft Judge.

Legend: improves vs early, drops vs early.

Conditions & dynamics

Sensitivity to evaluation setup and how performance changes across a conversation.

Evaluation Mode

Omniscient Delta, models in chart order: Claude Opus 4.7, Gemini 3.1 Pro, Claude Opus 4.6, Claude Haiku 4.5, Qwen 2.5 72B, Grok 4, Claude Opus 4.8, Mistral Large, MiMo-v2-Pro, GPT-5.5, Claude Sonnet 4.6, GPT-5.4, Claude Fable 5.

Verbose Delta, models in chart order: Claude Opus 4.7, GPT-5.5, Grok 4, Claude Opus 4.8, Claude Sonnet 4.6, Qwen 2.5 72B, Claude Opus 4.6, MiMo-v2-Pro, GPT-5.4, Claude Haiku 4.5, Gemini 3.1 Pro, Mistral Large, Claude Fable 5.

Legend: improves vs default, drops vs default.

Mood Shift & Emotional Trajectory

Legend: first half, second half.

Temporal Performance

Panels: Observer Accuracy, Pairwise Preference, Draft Judge, each split into Early, Mid, and Late.

Legend: Early, Mid, Late, increase.

Statistical significance

Formal tests of which differences between models are real.

Statistical Significance

Metrics with model effects: 6/7

Strongest separation: Draft Judge Score

Least conclusive: Emotion F1

Pairwise composite gaps: 48 of 78 model pairs significant

Omnibus effects by metric

Metric	Effect η²	p-value	H	Result
Draft Judge Score	0.295 Large	1.02e-156	768.3	Significant
Pairwise Accuracy	0.220 Large	2.32e-115	575.0	Significant
Binary HP Accuracy	0.056 Small	8.35e-27	154.1	Significant
Composite Score	0.052 Small	4.93e-25	145.4	Significant
Binary OM Accuracy	0.017 Small	2.45e-7	54.3	Significant
Emotion VA Score	0.007 Trace	0.0025	30.3	Significant
Emotion F1	0.002 Trace	0.1232	17.8	Not significant
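The page does not name the omnibus test behind this table. The H column is consistent with a Kruskal-Wallis rank test, so the sketch below assumes that test and pairs it with one commonly used effect size, η² = (H − k + 1) / (n − k), where k is the number of groups and n the total number of observations. Both the choice of test and the effect-size formula are assumptions, the function names are hypothetical, and the toy data is made up; this is not claimed to reproduce the figures above.

```python
from itertools import chain


def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_h(groups):
    """Kruskal-Wallis H with tie correction (assumes the values are not all identical)."""
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    ties = sum(c ** 3 - c for c in counts.values())
    return h / (1 - ties / (n ** 3 - n))


def eta_squared_h(h, k, n):
    """Effect size for H with k groups and n total observations."""
    return (h - k + 1) / (n - k)


# Made-up scores for three hypothetical models, three conversations each.
groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
h = kruskal_h(groups)
eta = eta_squared_h(h, k=len(groups), n=sum(len(g) for g in groups))
print(f"H = {h:.2f}, eta squared = {eta:.3f}")
```

In practice a statistics library would also supply the p-value; the pure-Python version is used here only to keep the arithmetic visible.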

Selected pairwise composite differences

Model A	Model B	Δ	p adj.	Effect |r|	Result
Claude Fable 5	Claude Sonnet 4.6	+4.77	<0.0001	0.940 L	Significant
Claude Fable 5	GPT-5.4	+4.82	<0.0001	0.895 L	Significant
Claude Fable 5	Grok 4	+4.96	<0.0001	0.881 L	Significant
Claude Fable 5	Mistral Large	+3.21	<0.0001	0.796 L	Significant
Claude Fable 5	Qwen 2.5 72B	+2.80	<0.0001	0.770 L	Significant
Claude Fable 5	Claude Haiku 4.5	+2.24	<0.0001	0.732 L	Significant
Claude Fable 5	Gemini 3.1 Pro	+2.13	<0.0001	0.725 L	Significant
Claude Fable 5	GPT-5.5	+1.35	0.0016	0.669 L	Significant
Claude Fable 5	MiMo-v2-Pro	+1.41	0.0026	0.664 L	Significant
Claude Fable 5	Claude Opus 4.7	+1.40	0.0026	0.663 L	Significant
Claude Fable 5	Claude Opus 4.6	+0.72	0.3910	0.594 L	Not significant
Claude Fable 5	Claude Opus 4.8	+0.51	1.0000	0.564 L	Not significant
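The figure of 78 model pairs follows from 13 models compared two at a time. The "p adj." column indicates that p-values were adjusted for multiple comparisons, but the page does not say which method was used. The sketch below shows the pair count and, as one possible approach, a Holm step-down adjustment. The `holm_adjust` helper and the raw p-values are hypothetical and illustrate the technique only.

```python
from math import comb


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=p_values.__getitem__)
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Multiply the rank-th smallest p-value by the number of tests still in play.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


n_pairs = comb(13, 2)  # unordered pairs among 13 models

# Made-up raw p-values for four hypothetical comparisons.
raw = [0.001, 0.04, 0.03, 0.5]
adj = holm_adjust(raw)

print(f"model pairs: {n_pairs}")
print("adjusted:", [round(p, 3) for p in adj])
```

Adjusted values are capped at 1, which is one way an adjusted p-value of exactly 1.0000 can arise.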
