agi.score

Cross-benchmark aggregate rankings for frontier LLMs. Multiple sources averaged element-wise into one comparable score.

114 models ranked · 7 in both sources · 106 LiveBench only · 1 MiMo image only

Combined ranking

Normalised 0–1 within each source, then averaged where the same model appears in both.

# | Model | Org | Aggregate | Sources
1 | GPT-5.5 Thinking xHigh Effort | OpenAI | 1.000 | LiveBench (80.71)
2 | GPT-5.4 Thinking xHigh Effort | OpenAI | 0.924 | both (MiMo 0.856 / LB 0.991)
3 | Claude 4.7 Opus Thinking xHigh Effort | Anthropic | 0.921 | LiveBench (76.91)
4 | GPT-5.5 Thinking High Effort | OpenAI | 0.907 | LiveBench (76.24)
5 | Claude 4.5 Opus Thinking High Effort | Anthropic | 0.902 | LiveBench (75.96)
6 | Claude 4.6 Sonnet Thinking Medium Effort | Anthropic | 0.892 | LiveBench (75.47)
7 | Claude 4.6 Sonnet Thinking High Effort | Anthropic | 0.888 | LiveBench (75.32)
8 | GPT-5.4 Thinking High Effort | OpenAI | 0.883 | LiveBench (75.07)
9 | Claude 4.7 Opus Thinking High Effort | Anthropic | 0.880 | LiveBench (74.89)
10 | GPT-5.2 High | OpenAI | 0.879 | LiveBench (74.84)
11 | GPT-5.2 Codex | OpenAI | 0.867 | LiveBench (74.30)
12 | Claude 4.6 Opus Thinking High Effort | Anthropic | 0.866 | both (MiMo 0.822 / LB 0.909)
13 | GPT-5.1 Codex Max High | OpenAI | 0.861 | LiveBench (73.98)
14 | Claude 4.5 Opus Thinking Medium Effort | Anthropic | 0.857 | LiveBench (73.78)
15 | Gemini 3 Pro Preview High | Google | 0.849 | LiveBench (73.39)
16 | GPT-5.3 Codex High | OpenAI | 0.835 | LiveBench (72.76)
17 | Gemini 3 Flash Preview High | Google | 0.828 | LiveBench (72.40)
18 | Claude 4.7 Opus Thinking Medium Effort | Anthropic | 0.826 | LiveBench (72.31)
19 | GPT-5.1 High | OpenAI | 0.821 | LiveBench (72.04)
20 | GPT-5.1 Codex Max | OpenAI | 0.819 | LiveBench (71.95)
21 | GPT-5.2 Medium | OpenAI | 0.816 | LiveBench (71.84)
22 | GPT-5.3 Codex xHigh | OpenAI | 0.812 | LiveBench (71.64)
23 | Qwen 3.6 Plus | Alibaba | 0.796 | LiveBench (70.85)
24 | GPT-5 Pro | OpenAI | 0.788 | LiveBench (70.48)
25 | Claude 4.6 Sonnet Thinking Low Effort | Anthropic | 0.787 | LiveBench (70.44)
26 | GPT-5.4 Nano xHigh | OpenAI | 0.781 | LiveBench (70.13)
27 | GPT-5.1 Medium | OpenAI | 0.761 | LiveBench (69.17)
28 | Claude 4.7 Opus Thinking Low Effort | Anthropic | 0.760 | LiveBench (69.13)
29 | Kimi K2.5 Thinking | Moonshot AI | 0.759 | LiveBench (69.07)
30 | GLM 5 | Z.AI | 0.755 | LiveBench (68.85)
31 | GPT-5.5 Thinking Medium Effort | OpenAI | 0.751 | LiveBench (68.66)
32 | Kimi K2.6 Thinking | Moonshot AI | 0.750 | both (MiMo 0.677 / LB 0.823)
33 | GPT-5.1 Codex | OpenAI | 0.750 | LiveBench (68.61)
34 | Claude Sonnet 4.5 Thinking | Anthropic | 0.741 | LiveBench (68.19)
35 | Grok 4.20 Beta | xAI | 0.736 | LiveBench (67.96)
36 | GPT-5.4 Mini xHigh | OpenAI | 0.727 | LiveBench (67.54)
37 | GLM 5.1 | Z.AI | 0.725 | both (MiMo 0.667 / LB 0.782)
38 | DeepSeek V4 Flash | DeepSeek | 0.721 | LiveBench (67.25)
39 | Grok 4.3 | xAI | 0.711 | LiveBench (66.74)
40 | DeepSeek V4 Pro | DeepSeek | 0.709 | both (MiMo 0.566 / LB 0.852)
41 | GPT-5 Mini High | OpenAI | 0.694 | LiveBench (65.91)
42 | Claude 4.5 Opus Thinking Low Effort | Anthropic | 0.687 | LiveBench (65.59)
43 | Qwen 3.6 27B | Alibaba | 0.686 | LiveBench (65.56)
44 | GPT-5.2 Low | OpenAI | 0.682 | LiveBench (65.33)
45 | Gemini 3 Pro Preview Low | Google | 0.652 | LiveBench (63.90)
46 | GPT-5.4 Mini High | OpenAI | 0.645 | LiveBench (63.57)
47 | Minimax M2.7 | Minimax | 0.644 | LiveBench (63.49)
48 | MiMo-V2.5-Pro | Xiaomi | 0.632 | MiMo only
49 | GPT-5.4 Nano High | OpenAI | 0.628 | LiveBench (62.75)
50 | DeepSeek V3.2 Thinking | DeepSeek | 0.617 | LiveBench (62.20)
51 | Grok 4 | xAI | 0.613 | LiveBench (62.02)
52 | Gemini 3.1 Pro Preview High | Google | 0.611 | both (MiMo 0.239 / LB 0.984)
53 | Claude 4.1 Opus Thinking | Anthropic | 0.609 | LiveBench (61.81)
54 | Gemini 3.1 Flash Lite Preview High | Google | 0.606 | LiveBench (61.68)
55 | Gemma 4 31B | Google | 0.605 | LiveBench (61.62)
56 | Kimi K2 Thinking | Moonshot AI | 0.604 | LiveBench (61.59)
57 | Claude Haiku 4.5 Thinking | Anthropic | 0.599 | LiveBench (61.32)
58 | Claude 4 Sonnet Thinking | Anthropic | 0.598 | LiveBench (61.27)
59 | GPT-5 Mini | OpenAI | 0.592 | LiveBench (61.01)
60 | GPT-5.1 Codex Mini | OpenAI | 0.579 | LiveBench (60.38)
61 | Qwen 3.6 Flash | Alibaba | 0.579 | LiveBench (60.37)
62 | Minimax M2.5 | Minimax | 0.574 | LiveBench (60.14)
63 | GPT-5.3 Instant | OpenAI | 0.571 | LiveBench (59.99)
64 | Grok 4.1 Fast | xAI | 0.571 | LiveBench (59.99)
65 | GPT-5.1 Low | OpenAI | 0.570 | LiveBench (59.95)
66 | Claude 4.5 Opus Medium Effort | Anthropic | 0.553 | LiveBench (59.10)
67 | DeepSeek V3.2 Exp Thinking | DeepSeek | 0.549 | LiveBench (58.90)
68 | Claude 4.5 Opus High Effort | Anthropic | 0.542 | LiveBench (58.59)
69 | GPT-5.4 Nano Medium | OpenAI | 0.540 | LiveBench (58.46)
70 | Gemini 2.5 Pro (Max Thinking) | Google | 0.537 | LiveBench (58.33)
71 | GPT-5.4 Mini Medium | OpenAI | 0.537 | LiveBench (58.33)
72 | GLM 4.7 | Z.AI | 0.532 | LiveBench (58.09)
73 | Gemini 3 Flash Preview Minimal | Google | 0.496 | LiveBench (56.35)
74 | Claude 4.5 Opus Low Effort | Anthropic | 0.484 | LiveBench (55.77)
75 | GLM 4.6 | Z.AI | 0.472 | LiveBench (55.19)
76 | Claude 4.1 Opus | Anthropic | 0.457 | LiveBench (54.45)
77 | Claude Sonnet 4.5 | Anthropic | 0.441 | LiveBench (53.69)
78 | Gemini 2.5 Flash (Max Thinking) (2025-09-25) | Google | 0.428 | LiveBench (53.09)
79 | GPT-5 Mini Low | OpenAI | 0.428 | LiveBench (53.07)
80 | Qwen 3 235B A22B Thinking 2507 | Alibaba | 0.426 | LiveBench (52.97)
81 | DeepSeek V3.2 | DeepSeek | 0.403 | LiveBench (51.84)
82 | Claude 4 Sonnet | Anthropic | 0.385 | LiveBench (50.98)
83 | Qwen 3 Next 80B A3B Thinking | Alibaba | 0.373 | LiveBench (50.41)
84 | DeepSeek V3.2 Exp | DeepSeek | 0.361 | LiveBench (49.85)
85 | GLM 5V Turbo | Z.AI | 0.357 | LiveBench (49.62)
86 | GPT-5.4 Mini Low | OpenAI | 0.355 | LiveBench (49.54)
87 | GPT-5.2 No Thinking | OpenAI | 0.342 | LiveBench (48.91)
88 | Qwen 3 235B A22B Instruct 2507 | Alibaba | 0.340 | LiveBench (48.84)
89 | GPT-5.4 Nano Low | OpenAI | 0.337 | LiveBench (48.67)
90 | GPT-5 Nano High | OpenAI | 0.336 | LiveBench (48.62)
91 | GPT-5 Nano | OpenAI | 0.335 | LiveBench (48.56)
92 | Qwen 3 Next 80B A3B Instruct | Alibaba | 0.330 | LiveBench (48.35)
93 | Kimi K2 Instruct | Moonshot AI | 0.325 | LiveBench (48.10)
94 | MiMo V2 Pro | Xiaomi | 0.321 | both (MiMo 0.110 / LB 0.533)
95 | Gemini 2.5 Flash (Max Thinking) (2025-06-05) | Google | 0.318 | LiveBench (47.74)
96 | GPT OSS 120b | OpenAI | 0.284 | LiveBench (46.09)
97 | Claude Haiku 4.5 | Anthropic | 0.268 | LiveBench (45.33)
98 | Grok Code Fast | xAI | 0.264 | LiveBench (45.13)
99 | Qwen 3 32B | Alibaba | 0.231 | LiveBench (43.56)
100 | GPT-5.1 No Thinking | OpenAI | 0.212 | LiveBench (42.65)
101 | Gemini 2.5 Flash Lite (Max Thinking) (2025-06-17) | Google | 0.210 | LiveBench (42.56)
102 | Gemini 2.5 Flash Lite (Max Thinking) (2025-09-25) | Google | 0.207 | LiveBench (42.39)
103 | Devstral 2 | Mistral | 0.183 | LiveBench (41.24)
104 | GLM 4.6V | Z.AI | 0.159 | LiveBench (40.07)
105 | GPT-5 Mini Minimal | OpenAI | 0.155 | LiveBench (39.90)
106 | Grok 4.20 Beta (Non-Reasoning) | xAI | 0.151 | LiveBench (39.70)
107 | Qwen 3 30B A3B | Alibaba | 0.137 | LiveBench (39.01)
108 | GPT-5.4 Mini | OpenAI | 0.094 | LiveBench (36.95)
109 | Elephant Alpha | OpenRouter | 0.074 | LiveBench (35.97)
110 | GPT-5 Nano Low | OpenAI | 0.040 | LiveBench (34.34)
111 | Grok 4.1 Fast (Non-Reasoning) | xAI | 0.022 | LiveBench (33.45)
112 | Trinity Large Preview | Arcee | 0.007 | LiveBench (32.74)
113 | Nemotron 3 Super 120B A12B | NVIDIA | 0.002 | LiveBench (32.51)
114 | GPT-5.4 Nano | OpenAI | 0.000 | LiveBench (32.39)

Source 1 — MiMo-V2.5-Pro release comparison (May 2026)

8 frontier models × 8 benchmarks · higher is better unless marked as rank (#N, lower is better).

[Image: benchmark results across 8 models on 8 benchmarks. Source: comparison sheet released alongside MiMo-V2.5-Pro; "best open-source" highlighted in orange, "best overall" underlined.]

Source 2 — LiveBench 2026-01-08 (full leaderboard)

113 models across 7 task categories (Reasoning, Coding, Agentic Coding, Mathematics, Data Analysis, Language, Instruction Following). Top score: GPT-5.5 Thinking xHigh Effort, 80.71 (livebench.ai).

Methodology

Per-source normalisation. Each source is normalised 0–1 within itself: the highest reported score maps to 1.0, the lowest to 0.0, and intermediate values are linearly interpolated. For the MiMo data this is done per benchmark column and then averaged across the 8 columns to produce a MiMo aggregate per model. For LiveBench, the Global Average (itself an aggregate of the 7 category averages) is normalised directly across all 113 models.
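As a sketch, the normalisation step looks like the following (a minimal min-max implementation; the function name is mine, and only three of the 113 LiveBench rows are shown, chosen so that the pool's min and max match the full leaderboard's endpoints):

```python
def minmax_normalise(scores):
    """Min-max normalise: highest score -> 1.0, lowest -> 0.0, linear in between."""
    lo, hi = min(scores.values()), max(scores.values())
    return {model: (s - lo) / (hi - lo) for model, s in scores.items()}

# Three LiveBench Global Averages from the combined ranking above. The top and
# bottom entries pin the 0-1 endpoints, so the middle row normalises exactly
# as it does on the full 113-model leaderboard.
livebench = {
    "GPT-5.5 Thinking xHigh Effort": 80.71,  # leaderboard max
    "GPT-5.2 High": 74.84,
    "GPT-5.4 Nano": 32.39,                   # leaderboard min
}

normalised = minmax_normalise(livebench)
print(round(normalised["GPT-5.2 High"], 3))  # -> 0.879, matching rank 10 above
```

Note that only the pool's endpoints affect a given model's normalised value, which is why the two leaderboards must be normalised separately before they can be compared.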

Element-wise combination. Where the same model appears in both sources (matched by exact name or by a clear variant alias — e.g. Kimi K2.6 → Kimi K2.6 Thinking, GPT-5.4 → GPT-5.4 Thinking xHigh Effort), the two normalised aggregates are averaged. Otherwise the single-source aggregate is used as-is.
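A sketch of the combination step, assuming plain dict lookups (the `combine` function and `ALIASES` map are illustrative names, not from any published pipeline; the scores are the normalised aggregates quoted in the ranking above):

```python
# Variant aliases mapping MiMo sheet names to LiveBench names
# (judgement calls, per the caveats below).
ALIASES = {
    "Kimi K2.6": "Kimi K2.6 Thinking",
    "GPT-5.4": "GPT-5.4 Thinking xHigh Effort",
}

def combine(mimo_norm, lb_norm):
    """Average the two normalised aggregates where a model appears in both
    sources; otherwise keep the single-source aggregate."""
    combined = {}
    for name, score in mimo_norm.items():
        lb_name = ALIASES.get(name, name)
        if lb_name in lb_norm:
            combined[lb_name] = (score + lb_norm[lb_name]) / 2
        else:
            combined[name] = score  # MiMo-only model
    for name, score in lb_norm.items():
        combined.setdefault(name, score)  # LiveBench-only models
    return combined

mimo = {"DeepSeek V4 Pro": 0.566, "MiMo-V2.5-Pro": 0.632}
lb = {"DeepSeek V4 Pro": 0.852}
result = combine(mimo, lb)
print(round(result["DeepSeek V4 Pro"], 3))  # -> 0.709, as at rank 40 above
```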

Matched pairs (7). DeepSeek V4 Pro, GLM 5.1, MiMo V2 Pro, Kimi K2.6, Gemini 3.1 Pro, Claude 4.6 Opus, GPT-5.4. The one remaining MiMo-only model (MiMo-V2.5-Pro) has no LiveBench equivalent yet.

Caveats. The peer-group size differs (8 vs 113), so a 1.0 in LiveBench is competing against 113 peers while a 1.0 in MiMo was against 7. Models present only in MiMo can therefore look inflated relative to LB-only models — interpret the combined ranking as a coarse summary, not a precision instrument. Variant-name matching is also a judgement call; aggressive matching (e.g. GPT-5.4 ↔ GPT-5.4 Thinking xHigh) merges high-scoring LiveBench entries into MiMo positions and can shift the top-of-table substantially.