agi.score leaderboard

Robust all-source benchmark score

A robust LLM ranking calculated from every numeric datapoint in the LiveBench table and the benchmark images hosted on the site: MiMo, Cursor, Google Gemini 3.5, the Anthropic Opus 4.8 launch comparison, NVIDIA Nemotron 3 Ultra, Mythos 5, Cohere North Mini Code, GLM-5.2, VibeThinker 3B, and Rio 3.5 Open.

LiveBench 2026-01-08
MiMo + Cursor + Launch tables + Artificial Analysis + Nemotron + Mythos 5 + North Mini Code
Updated 16 June 2026

What is scored: LiveBench Global Average plus seven category averages, every numeric MiMo image cell, every numeric Cursor image cell, every numeric cell from the pooled vendor launch comparison tables (Google Gemini 3.5 and Anthropic Opus 4.8), the Artificial Analysis Intelligence Index, and the Liquid AI, StepFun, NVIDIA Nemotron 3 Ultra, Anthropic Mythos 5 / Fable 5, Cohere North Mini Code, GLM-5.2, VibeThinker 3B, and Rio 3.5 Open benchmark cells. The headline score is now robust and source-balanced rather than raw cell-weighted. Rank metrics such as FrontierSWE are inverted so that a numerically lower (better) rank scores higher.

Missing values (shown as dashes) are excluded from that source/model average rather than treated as zero. Image-derived models are matched to the closest LiveBench variant where the label clearly refers to the same model family.
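The exclusion rule above can be sketched as follows. This is a minimal illustration, assuming missing cells are represented as `None`; the function name `source_average` and the sample values are hypothetical and are not taken from the page's own code.

```python
from statistics import fmean
from typing import Optional, Sequence


def source_average(cells: Sequence[Optional[float]]) -> Optional[float]:
    """Average the cells a model actually reports for one source.

    Missing cells (None) are dropped rather than counted as zero.
    Returns None when the model has no cells in this source at all.
    """
    present = [c for c in cells if c is not None]
    return fmean(present) if present else None


# Illustrative values only: one of three benchmarks is missing.
example = source_average([80.0, None, 60.0])
```

Treating the missing cell as zero would instead pull the same model down to roughly 46.7, which is the penalty the exclusion rule avoids.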

The calculation is performed from the raw arrays in this page at render time, so the displayed averages and source tables share one data source.

LiveBench context: objective ground-truth answers, no LLM judge, regularly refreshed questions to limit contamination.

Official source: livebench.ai

LiveBench appeared as an ICLR 2025 Spotlight Paper and is sponsored by Abacus.AI.

Combined Ranking

agi.score v2 uses robust per-benchmark scaling, source-balanced averaging, and evidence shrinkage. Legacy keeps the old raw/min-max cell-weighted score for comparison.

# | Model | Organization | agi.score v2 | Legacy | Sources | Cells

Methodology

Headline score is agi.score v2. Each benchmark is first converted onto a robust comparable scale instead of forcing every small source table through min-max. Percent metrics use a continuity-corrected logit transform, lower-is-better ranks are inverted, and off-scale point/Elo metrics keep their natural order before robust normalization.
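A rough sketch of the pre-normalization transforms described above. The page does not state the exact continuity correction, so the `eps = 0.5` offset used here is an assumption, as are the function names `logit_percent` and `invert_rank`.

```python
import math


def logit_percent(pct: float, eps: float = 0.5) -> float:
    """Map a 0-100 percent score to log-odds.

    The continuity correction (assumed form) pulls 0 and 100 slightly
    inward so the logit stays finite at the ends of the scale.
    """
    p = (pct + eps) / (100.0 + 2.0 * eps)
    return math.log(p / (1.0 - p))


def invert_rank(rank: float) -> float:
    """Flip a lower-is-better rank so that better ranks sort higher."""
    return -rank
```

The logit stretches differences near the ends of the scale, so a move from 95 to 98 counts for more than a move from 50 to 53, which is one plausible reason to apply it before robust scaling.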

Robust per-benchmark normalization. For each benchmark column, values are centered by the median and scaled by MAD/IQR rather than by the best and worst model in that source. Extreme z-scores are clamped to +/-3 and mapped back to a readable 0-100 score with a sigmoid. This avoids the old failure mode where a tiny four-model launch table automatically gave one model 100 and another 0.
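The normalization step can be sketched as below. This is an illustration under stated assumptions, not the page's actual code: "MAD/IQR" is read here as MAD with an IQR fallback when the MAD is zero, the constants 1.4826 and 1.349 are the usual normal-consistency factors, and the function name `robust_scores` is hypothetical.

```python
import math
from statistics import median, quantiles


def robust_scores(values, clamp=3.0):
    """Median-center, MAD-scale, clamp, then squash to 0-100 with a sigmoid."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    scale = 1.4826 * mad
    if scale == 0 and len(values) >= 2:
        # Assumed fallback: use the interquartile range when MAD collapses.
        q1, _, q3 = quantiles(values, n=4, method="inclusive")
        scale = (q3 - q1) / 1.349
    if scale == 0:
        # All values identical: no information, so return the midpoint.
        return [50.0 for _ in values]
    scores = []
    for v in values:
        z = max(-clamp, min(clamp, (v - med) / scale))
        scores.append(100.0 / (1.0 + math.exp(-z)))
    return scores


# Illustrative four-model table (made-up values).
four_model = robust_scores([70.0, 72.0, 75.0, 90.0])
```

With the clamp at 3, the sigmoid output stays within roughly 4.7 to 95.3, so no model in a small table is pinned to exactly 0 or 100.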

Source-balanced model score. Each source contributes one source score per model, then model scores are averaged across sources rather than cell-weighted across every raw benchmark cell. Evidence confidence is still shown in the final number: confidence = n_eff / (n_eff + 8), where n_eff caps each source at 12 cells, and low-evidence models are shrunk toward 50. The Legacy column preserves the previous raw/min-max cell-weighted score for continuity.
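A minimal sketch of the source-balanced score with evidence shrinkage. The confidence formula and the 12-cell cap come from the text above; the linear pull toward 50 is an assumed form of the shrinkage, and the function name, argument names, and sample numbers are hypothetical.

```python
from statistics import fmean


def agi_score_sketch(source_scores, cells_per_source, k=8.0, cap=12, prior=50.0):
    """Combine per-source scores and shrink low-evidence models toward 50.

    source_scores: source name -> 0-100 source score for one model.
    cells_per_source: source name -> number of numeric cells for that model.
    """
    raw = fmean(source_scores.values())  # each source counts once
    n_eff = sum(min(cells_per_source[s], cap) for s in source_scores)
    confidence = n_eff / (n_eff + k)
    shrunk = prior + confidence * (raw - prior)  # assumed linear shrinkage
    return shrunk, confidence


# Illustrative model: one large source (capped at 12 cells) and one small one.
score, conf = agi_score_sketch({"A": 90.0, "B": 70.0}, {"A": 40, "B": 4})
```

In this example the raw source average is 80, n_eff is 12 + 4 = 16, confidence is 16 / 24, and the shrunk score lands at 70. A model with a single 8-cell source would have confidence 0.5 and sit halfway between its raw score and 50.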

Variant matching and source coverage. Image-derived labels are matched to the closest LiveBench/source equivalent only where the label clearly refers to the same model family or effort setting. Missing values (dashes) remain excluded. Current sources include LiveBench, MiMo, Cursor, vendor launch tables, Artificial Analysis, Liquid AI, StepFun, Nemotron, Mythos, North Mini Code, GLM-5.2, VibeThinker 3B, VibeThinker LeetCode contests, and Rio 3.5 Open.

Source 1 - LiveBench 2026-01-08

All 116 rows and all eight supplied numeric columns are used in the LiveBench source average.

# | Model | Organization | Global | Reasoning | Coding | Agentic | Math | Data | Language | IF | Robust

Source 2 - MiMo-V2.5-Pro image comparison

MiMo benchmark image

MiMo extracted numeric cells

Source 3 - Cursor Composer 2.5 image comparison

Cursor Composer benchmark image

Cursor extracted numeric cells

Source 4 - Vendor launch benchmark tables

Two vendor announcement comparison tables that report the same public benchmarks, pooled into one cross-vendor source so each benchmark normalizes across every model that reports it. Sources: Google Keyword, May 19 2026 and the Anthropic Opus 4.8 announcement. Benchmark table cells are included in the averages; the Artificial Analysis chart is a visual reference only.

Google Gemini 3.5 Flash benchmark comparison table
Anthropic Opus 4.8 benchmark comparison table against Opus 4.7, GPT-5.5, and Gemini 3.1 Pro
Artificial Analysis Intelligence Index vs Output Speed chart for Gemini 3.5 Flash

Pooled launch-table numeric cells

Source 5 - Artificial Analysis Intelligence Index

Composite intelligence score from Artificial Analysis (AI Model & API Providers Analysis, "Intelligence" highlight - higher is better). Eleven models; the single index column is min-max normalized across them. Labels are matched to their LiveBench/source equivalents (e.g. Claude Opus 4.8 (max) maps to Claude 4.8 Opus, gpt-oss-120b to GPT OSS 120b, NVIDIA Nemotron 3 Super to Nemotron 3 Super 120B A12B); Muse Spark has no equivalent and ranks on this source alone.

Artificial Analysis extracted numeric cells

Source 6 - Liquid AI LFM2.5-8B-A1B benchmarks

Benchmark chart from the Liquid AI LFM2.5-8B-A1B release, comparing it against Granite-4.0-H-Tiny (IBM), gpt-oss-20b (OpenAI), Gemma-4-26B-A4B-IT (Google), and Qwen3-30B-A3B-Thinking-2507 (Alibaba). All six benchmarks are used; the AA-Omniscience Index is reported on a negative off-scale axis, so it is min-maxed onto 0-100 like the other off-scale metrics.
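Min-maxing an off-scale column that includes negative values can be sketched like this. The input numbers are made up for illustration and are not the AA-Omniscience values; the function name `min_max_0_100` and the tie fallback of 50 are assumptions.

```python
def min_max_0_100(values):
    """Rescale a column so its minimum maps to 0 and its maximum to 100."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumed fallback: a constant column carries no ranking signal.
        return [50.0 for _ in values]
    return [100.0 * (v - lo) / (hi - lo) for v in values]


# Made-up index values on a negative axis.
rescaled = min_max_0_100([-40.0, -10.0, 0.0, 10.0])
```

Because only the distance from the column minimum matters, negative raw values need no special handling: the least negative model simply lands closer to 100.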

LFM2.5 extracted numeric cells

Source 7 - StepFun Step 3.7 Flash benchmarks

Reported scores for Step 3.7 Flash (StepFun, 198B-param MoE VLM) from its release card - eight benchmarks across Agentic Coding, Multimodal, and General Agent. Only Step 3.7 Flash is charted here; the competitor bars in the source image were unlabeled, so they are not included. (Card prose lists HLE-with-Tool as 48.1; the chart shows 47.2 - the chart value is used.)

Step 3.7 Flash extracted numeric cells

Source 8 - NVIDIA Nemotron 3 Ultra keynote comparison

NVIDIA Nemotron 3 Ultra Frontier Smart keynote benchmark slide

Seven benchmarks from NVIDIA's “Frontier Smart” keynote slide comparing Nemotron 3 Ultra (550B), GLM 5.1 (744B), Kimi K2.6 (1T), and Qwen3.5 (397B). The Knowledge Work (GDPVal-AA) row is a points score, so it is min-max normalized to 0-100 like other off-scale metrics; Long Context lists only the two models with a published ≥1M-token result (the 256K-capped models are omitted).

Nemotron 3 Ultra extracted numeric cells

Source 9 - Anthropic Claude Mythos 5 / Fable 5 launch comparison

Anthropic Claude Mythos 5 / Fable 5 launch benchmark comparison

Fifteen benchmarks from Anthropic’s Claude Mythos 5 / Fable 5 launch comparison, pitting Mythos 5 / Fable 5 and the earlier Mythos Preview against Claude Opus 4.8, GPT-5.5, and Gemini 3.1 Pro. Reported figures take the higher of the Mythos 5 and Fable 5 scores (the two are within 1-3 points). Knowledge work (GDPval-AA) is a points score, so it is min-max normalized to 0-100 like the other off-scale metrics; the remaining percentage metrics are used as-is. Starred rows in the original carry larger Mythos/Fable gaps from cybersecurity and biology safeguards; the higher score is charted.

Mythos 5 extracted numeric cells

Source 10 - Cohere North Mini Code image comparison

North Mini Code benchmark comparison table

Six benchmark rows from Cohere's North Mini Code release image are included. Starred competitor results in the source image are internally measured by Cohere and are used as reported.

North Mini Code extracted numeric cells

Source 11 - Z.ai GLM-5.2 image comparison

GLM-5.2 LLM performance evaluation across 8 benchmarks: SWE-bench Pro, Terminal-Bench 2.1, NL2Repo, DeepSWE, ProgramBench, MCP-Atlas, Tool-Decathlon, Humanity's Last Exam

GLM-5.2 launch comparison vs GLM-5.1, Claude Opus 4.8, GPT-5.5 and Gemini 3.1 Pro across eight agentic/coding/reasoning benchmarks at maximum thinking effort. Humanity's Last Exam is recorded both with and without tools.

Z.ai GLM-5.2 extracted numeric cells

Source 12 - VibeThinker 3B reasoning image comparison

Submitted VibeThinker evidence pack covering compact reasoning models, large open/proprietary comparisons, IMO-AnswerBench frontier-band context, and +CLR deltas where reported. All visible numeric cells from these reasoning images are extracted below.

IMO-AnswerBench frontier band

VibeThinker 3B IMO-AnswerBench score compared with larger models and +CLR
VibeThinker-3B reaches 76.4 on IMO-AnswerBench; +CLR lifts the displayed score to 80.6.

Small and large reasoning tables

VibeThinker 3B benchmark table against small and large reasoning models
Math, coding, knowledge, and instruction scores across small and large reasoning models.

Open-source and proprietary comparison

VibeThinker 3B benchmark table against open-source and proprietary frontier models with CLR row
Frontier comparison including VibeThinker-3B base and +CLR rows.

VibeThinker reasoning extracted numeric cells

Source 13 - VibeThinker 3B LeetCode contest image comparison

LeetCode contests (Python)

VibeThinker 3B LeetCode contest benchmark table against GPT, Gemini, Claude, Grok, Qwen, Kimi, and GLM models
Eight contest columns plus the published overall aggregate; fractional cells are converted to percentages in the extracted table.
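The fraction-to-percentage conversion might look like the sketch below. It assumes fractional cells are written as "solved/total" strings such as "3/4"; the page does not specify the cell format, and `cell_to_percent` is a hypothetical name.

```python
def cell_to_percent(cell: str) -> float:
    """Convert a 'solved/total' contest cell into a 0-100 percentage."""
    solved, total = cell.split("/")
    return 100.0 * float(solved) / float(total)
```

For example, a cell of "3/4" becomes 75.0, which puts it on the same scale as the other percentage columns.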

VibeThinker LeetCode extracted numeric cells

Source 14 - Rio 3.5 Open 397B image comparison

Rio 3.5 Open 397B benchmark comparison on Terminal 2.1, SWE-Bench Pro, DeepSWE, HLE, IMOAnswerBench, GDPval

Rio 3.5 Open 397B vs Qwen 3.7 Plus, DeepSeek V4 Pro and Kimi-K2.6. GDPval is reported as an Elo (~1,480-1,554) and is min-maxed onto 0-100 like the other Elo columns.

Rio 3.5 Open 397B extracted numeric cells

Citation

@inproceedings{livebench,
  title={LiveBench: A Challenging, Contamination-Free {LLM} Benchmark},
  author={Colin White and Samuel Dooley and Manley Roberts and Arka Pal and Benjamin Feuer and Siddhartha Jain and Ravid Shwartz-Ziv and Neel Jain and Khalid Saifullah and Sreemanti Dey and Shubh-Agrawal and Sandeep Singh Sandha and Siddartha Venkat Naidu and Chinmay Hegde and Yann LeCun and Tom Goldstein and Willie Neiswanger and Micah Goldblum},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
}