Model Analysis

[Chart, untitled. Axis ticks: 30.0, 40.0, 50.0, 60.0, 70.0. Models in displayed order: Claude Opus 4.6, MiMo-v2-Pro, GPT-5.5, Claude Opus 4.7, Claude Haiku 4.5, Gemini 3.1 Pro, Qwen 2.5 72B, Mistral Large, GPT-5.4, Claude Sonnet 4.6, Grok 4.]

Performance Heatmap

Columns: Comp, F1, VA, Hit, OM, HP, PW, Tau, Jdg, DBA, 4B, PAN, Q1, Q3, Trn, Cnv
Color scale: best in column, middle, worst in column.

Emotion Tracking

Emotion F1

[Bar chart. Axis ticks: 0, 0.05, 0.1. Models in displayed order: GPT-5.5, Claude Opus 4.7, Claude Sonnet 4.6, Claude Opus 4.6, MiMo-v2-Pro, GPT-5.4, Mistral Large, Claude Haiku 4.5, Grok 4, Gemini 3.1 Pro, Qwen 2.5 72B.]

Valence-Arousal

[Bar chart. Axis ticks: 0, 0.1, 0.2. Models in displayed order: Gemini 3.1 Pro, Claude Haiku 4.5, Grok 4, GPT-5.4, MiMo-v2-Pro, GPT-5.5, Qwen 2.5 72B, Claude Opus 4.7, Claude Sonnet 4.6, Claude Opus 4.6, Mistral Large.]

Holistic Thinkers vs. Step-by-Step Annotators

Legend: turn-level (lighter), conversation-wide (solid); hue = % change.

Four-Branch EQ & Preference Prediction

Four-Branch EQ

[Bar chart. Axis ticks: 0%, 20%, 40%, 60%, 80%. Models in displayed order: Qwen 2.5 72B, Mistral Large, GPT-5.5, Gemini 3.1 Pro, Claude Haiku 4.5, Grok 4, GPT-5.4, MiMo-v2-Pro, Claude Opus 4.6, Claude Opus 4.7, Claude Sonnet 4.6.]

Pairwise Preference

[Bar chart. Axis ticks: 0%, 20%, 40%, 60%. Chance level: 33.3%. Models in displayed order: Claude Opus 4.7, Claude Opus 4.6, MiMo-v2-Pro, Claude Haiku 4.5, GPT-5.5, Mistral Large, Claude Sonnet 4.6, Gemini 3.1 Pro, GPT-5.4, Grok 4, Qwen 2.5 72B.]

Conversation Quality

Q1 Goals

[Bar chart. Axis ticks: 0%, 20%, 40%, 60%. Chance level: 17.5%. Models in displayed order: GPT-5.4, Qwen 2.5 72B, Claude Haiku 4.5, Mistral Large, Claude Sonnet 4.6, Claude Opus 4.6, MiMo-v2-Pro, Grok 4, Gemini 3.1 Pro, Claude Opus 4.7, GPT-5.5.]

Q3 Response Fit

[Bar chart. Axis ticks: 0%, 20%, 40%. Chance level: 25.0%. Models in displayed order: GPT-5.5, Claude Opus 4.6, Grok 4, GPT-5.4, Claude Opus 4.7, Gemini 3.1 Pro, MiMo-v2-Pro, Qwen 2.5 72B, Claude Haiku 4.5, Claude Sonnet 4.6, Mistral Large.]

Perspective Gap

Claude Opus 4.6
Gemini 3.1 Pro
GPT-5.4
Qwen 2.5 72B
GPT-5.5
Claude Sonnet 4.6
MiMo-v2-Pro
Mistral Large
Claude Haiku 4.5
Grok 4
Claude Opus 4.7
Legend: better at human view, worse at human view.

Draft Response Quality

[Bar chart. Axis ticks: 0%, 20%, 40%, 60%, 80%. Models in displayed order: Claude Opus 4.6, Claude Opus 4.7, GPT-5.5, Claude Sonnet 4.6, Gemini 3.1 Pro, GPT-5.4, Claude Haiku 4.5, Mistral Large, Grok 4, MiMo-v2-Pro, Qwen 2.5 72B.]

Conversation Topics

[Bar chart. Axis ticks: 0, 20, 40. Topics in displayed order: Politics, Money, Work / School, Family, Hobbies, Entertainment Med…, Friends, Religion, Physical Health, Romantic Relation…]

Participant Diagnosis

None: 51.9
Anxiety/Depression: 54.0
ASD/ADHD: 48.4
Legend: area = conversations; number = average composite.

Metric Relationships

              Composite  Emotion F1    VA Score    Observer       Human    Pairwise       Draft         Gap
Composite            --        0.47        0.51        0.36        0.34        0.43        0.21       -0.13
Emotion F1         0.47          --        0.63        0.04        0.01        0.02        0.09        0.00
VA Score           0.51        0.63          --        0.15        0.17       -0.06        0.11       -0.06
Observer           0.36        0.04        0.15          --        0.60       -0.04        0.20       -0.20
Human              0.34        0.01        0.17        0.60          --       -0.12        0.17        0.03
Pairwise           0.43        0.02       -0.06       -0.04       -0.12          --        0.10        0.03
Draft              0.21        0.09        0.11        0.20        0.17        0.10          --       -0.08
Gap               -0.13        0.00       -0.06       -0.20        0.03        0.03       -0.08          --

PANAS Item-Level Prediction

Positive Affect
Negative Affect
Color scale: easiest to hardest.

Conversation Position

Mid vs Early

Emotion F1
Observer Accuracy
Pairwise Preference
Draft Judge

Late vs Early

Emotion F1
Observer Accuracy
Pairwise Preference
Draft Judge
Legend: improves vs early, drops vs early.

Evaluation Mode

Omniscient Delta

Claude Opus 4.7
Gemini 3.1 Pro
Claude Opus 4.6
Claude Haiku 4.5
Qwen 2.5 72B
Grok 4
Mistral Large
MiMo-v2-Pro
GPT-5.5
Claude Sonnet 4.6
GPT-5.4

Verbose Delta

Claude Opus 4.7
GPT-5.5
Grok 4
Claude Sonnet 4.6
Qwen 2.5 72B
Claude Opus 4.6
MiMo-v2-Pro
GPT-5.4
Claude Haiku 4.5
Gemini 3.1 Pro
Mistral Large
Legend: improves vs default, drops vs default.

Mood Shift & Emotional Trajectory

[Chart. Axis from -0.2 to +0.8, with a Neutral reference. Rows in displayed order: Overall, Money, Physical Health, Work / School, Friends, Hobbies, Romantic Relationships, Politics, Religion, Family, Entertainment Media.]
Legend: first half, second half.

Temporal Performance

Observer Accuracy

Early
Mid
Late

Pairwise Preference

Early
Mid
Late

Draft Judge

Early
Mid
Late
Legend: Early, Mid, Late, increase.

Statistical Significance

Metrics with model effects: 6/7
Strongest separation: Draft Judge Score
Least conclusive: Emotion F1
Pairwise composite gaps: 25 of 36 model pairs significant

Omnibus effects by metric

Metric               Effect η²      p-value      H       Result
Draft Judge Score    0.299 Large    3.88e-112    543.1   Significant
Pairwise Accuracy    0.170 Large    9.77e-63     312.3   Significant
Binary HP Accuracy   0.054 Small    5.93e-19     104.2   Significant
Composite Score      0.040 Small    5.18e-14     79.9    Significant
Binary OM Accuracy   0.021 Small    2.82e-7      45.6    Significant
Emotion VA Score     0.011 Small    0.0007       27.1    Significant
Emotion F1           0.004 Trace    0.0707       14.4    Trend
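
The page does not name the omnibus test behind the H column. As a rough cross-check, the sketch below assumes H is a Kruskal-Wallis statistic compared against a chi-square distribution, and that the "36 model pairs" reported above mean 9 models entered the test (so 8 degrees of freedom). Both are assumptions, not statements from the source, and the function names are hypothetical. It uses the closed-form chi-square tail for even degrees of freedom, so only the standard library is needed.

```python
import math


def models_from_pairs(pairs: int) -> int:
    """Solve k * (k - 1) / 2 == pairs for the number of models k."""
    k = (1 + math.isqrt(1 + 8 * pairs)) // 2
    if k * (k - 1) // 2 != pairs:
        raise ValueError("pairs is not a valid count of unordered pairs")
    return k


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square variable.

    Uses the closed form exp(-x/2) * sum_{i < df/2} (x/2)^i / i!,
    which is exact only when df is a positive even integer.
    """
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


if __name__ == "__main__":
    # Assumption: 36 pairwise comparisons imply 9 models, hence df = 8.
    df = models_from_pairs(36) - 1
    # H values copied from the omnibus table (rounded to one decimal there).
    for metric, h in [("Emotion F1", 14.4),
                      ("Emotion VA Score", 27.1),
                      ("Binary OM Accuracy", 45.6)]:
        print(f"{metric}: H={h}, approx p={chi2_sf_even_df(h, df):.3g}")
```

Under these assumptions the computed tail probabilities land close to the tabulated p-values, with small differences expected because H is shown to only one decimal place. If the test used a different number of groups or a different reference distribution, the figures would not line up.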

Selected pairwise composite differences

Model A            Model B              Δ       p adj.    Effect |r|   Result
Claude Opus 4.6    Claude Sonnet 4.6    +4.01   <0.0001   0.899 L      Significant
MiMo-v2-Pro        Claude Sonnet 4.6    +3.32   <0.0001   0.831 L      Significant
Claude Opus 4.6    Grok 4               +4.16   <0.0001   0.826 L      Significant
Claude Opus 4.6    GPT-5.4              +4.06   <0.0001   0.822 L      Significant
MiMo-v2-Pro        Grok 4               +3.47   <0.0001   0.797 L      Significant
Claude Haiku 4.5   Claude Sonnet 4.6    +2.54   <0.0001   0.776 L      Significant
MiMo-v2-Pro        GPT-5.4              +3.37   <0.0001   0.774 L      Significant
Gemini 3.1 Pro     GPT-5.4              +2.69   <0.0001   0.748 L      Significant
Claude Opus 4.6    Mistral Large        +2.64   <0.0001   0.740 L      Significant
Claude Haiku 4.5   GPT-5.4              +2.59   <0.0001   0.734 L      Significant
Claude Opus 4.6    Qwen 2.5 72B         +2.11   0.0001    0.693 L      Significant
Claude Opus 4.6    Claude Haiku 4.5     +1.47   0.0022    0.657 L      Significant
Claude Opus 4.6    Gemini 3.1 Pro       +1.37   0.0036    0.635 L      Significant
Claude Opus 4.6    MiMo-v2-Pro          +0.69   0.1952    0.594 L      Not significant