Real-time quality drift across frontier LLMs and AI agents, so you can see when a model starts slipping the moment it happens.
A 4-factor composite across 12 workload categories, computed hourly from 18 real-time signals.
We collect 18 real signals per agent across four categories, refreshed hourly:
Each agent receives a composite score from four weighted factors. The weights always sum to 1.0 — when an underlying signal is missing (e.g., a closed-source agent with no public GitHub repo), the corresponding sub-factor is renormalized rather than zero-filled, so closed-source agents are not structurally penalized for the absence of public-repo telemetry.
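As a minimal sketch of the factor-level combination, assuming each factor has already been reduced to a value in [0, 1]: the weights and factor names below are illustrative assumptions, not the production values, which are not stated here.

```python
import math
from typing import Mapping

# Hypothetical weights and factor names, for illustration only.
# The only property taken from the text is that the weights sum to 1.0.
ILLUSTRATIVE_WEIGHTS = {
    "benchmark": 0.40,
    "sentiment": 0.20,
    "mention_velocity": 0.20,
    "ecosystem": 0.20,
}


def composite_score(
    factors: Mapping[str, float],
    weights: Mapping[str, float] = ILLUSTRATIVE_WEIGHTS,
) -> float:
    """Weighted sum of four factor values, each expected in [0, 1]."""
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1.0")
    if set(factors) != set(weights):
        raise ValueError("factors and weights must cover the same names")
    for name, value in factors.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"factor {name!r} is outside [0, 1]")
    return sum(weights[name] * factors[name] for name in weights)


example = composite_score(
    {"benchmark": 1.0, "sentiment": 0.5, "mention_velocity": 0.5, "ecosystem": 0.0}
)
```

Because the weights sum to 1.0 and every factor lies in [0, 1], the composite also stays in [0, 1].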
Average of all available public benchmark scores (SWE-bench, GAIA, WebArena, HumanEval+, TAU-bench), normalized to [0, 1]. Agents without benchmark data receive a neutral prior of 0.5.
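The averaging rule above can be sketched as follows, assuming each benchmark score has already been normalized to [0, 1] upstream and that a missing benchmark is represented as `None` (both are assumptions about the data layout, not documented behavior).

```python
from statistics import fmean
from typing import Mapping, Optional

NEUTRAL_PRIOR = 0.5  # value stated in the text for agents with no benchmark data


def benchmark_factor(scores: Mapping[str, Optional[float]]) -> float:
    """Mean of the available normalized scores, or the neutral prior if none exist."""
    available = [s for s in scores.values() if s is not None]
    if not available:
        return NEUTRAL_PRIOR
    for s in available:
        if not 0.0 <= s <= 1.0:
            raise ValueError("scores must be normalized to [0, 1]")
    return fmean(available)


# Hypothetical agent with two of three benchmarks reported.
partial = benchmark_factor({"SWE-bench": 0.6, "GAIA": None, "HumanEval+": 0.8})
# Hypothetical agent with no benchmark data at all.
missing = benchmark_factor({})
```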
Drawn from our multi-layer NLP pipeline (VADER + TextBlob + aspect analysis) over the last 7 days of social posts. Raw sentiment in [-0.2, 0.4] is rescaled to [0, 1].
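A minimal sketch of the rescaling step, assuming a linear min-max map and assuming that raw values outside [-0.2, 0.4] are clipped to the range first (the text gives the range but not the out-of-range handling):

```python
RAW_MIN = -0.2  # lower end of the raw sentiment range given in the text
RAW_MAX = 0.4   # upper end of the raw sentiment range given in the text


def rescale_sentiment(raw: float) -> float:
    """Map raw sentiment from [RAW_MIN, RAW_MAX] onto [0, 1] linearly."""
    clipped = min(max(raw, RAW_MIN), RAW_MAX)  # clipping is an assumption
    return (clipped - RAW_MIN) / (RAW_MAX - RAW_MIN)
```

Under this mapping, a raw score of 0.1 sits at the midpoint of the range and lands near 0.5.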
Log-scaled count of new posts mentioning the agent across Bluesky, Reddit, Hacker News, and Mastodon over the last 24 hours. This is the universal time-varying signal: every agent — open- or closed-source — generates measurable discussion volume, so the composite reflects real shifts in attention rather than static catalog data.
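One way to turn a log-scaled post count into a value in [0, 1] is sketched below. The `saturation` parameter is hypothetical: the text says the count is log-scaled but does not say how it is normalized.

```python
import math


def mention_velocity(new_posts_24h: int, saturation: int = 10_000) -> float:
    """Log-scaled 24-hour mention count, capped at 1.0.

    `saturation` (the count that maps to 1.0) is an illustrative assumption.
    """
    if new_posts_24h < 0:
        raise ValueError("post count cannot be negative")
    if saturation <= 0:
        raise ValueError("saturation must be positive")
    return min(math.log1p(new_posts_24h) / math.log1p(saturation), 1.0)
```

The log scale compresses large counts, so a tenfold jump in mentions moves the factor by a fixed step rather than by a factor of ten.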
Composed from whichever of the following are present for the agent: GitHub repo health (stars, contributors, release freshness, issue close-rate), package downloads (PyPI/npm weekly), and VS Code Marketplace installs + ratings. Sub-weights are renormalized over only the components that exist. If an agent has none of these signals, the slot falls back to mention velocity, so closed agents with no marketplace footprint can still rank on the strength of attention alone.
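The renormalization and fallback described above can be sketched like this. The component names and sub-weights are hypothetical, and each component is assumed to be pre-scaled to [0, 1], with `None` marking a signal that does not exist for the agent.

```python
from typing import Mapping, Optional

# Illustrative sub-weights (assumption); the production values are not given in the text.
SUB_WEIGHTS = {"github_health": 0.5, "package_downloads": 0.3, "marketplace": 0.2}


def ecosystem_factor(
    components: Mapping[str, Optional[float]],
    mention_velocity: float,
    sub_weights: Mapping[str, float] = SUB_WEIGHTS,
) -> float:
    """Weighted mean over the components that exist, else the mention-velocity fallback."""
    present = {
        name: value
        for name, value in components.items()
        if value is not None and name in sub_weights
    }
    if not present:
        # No repo, package, or marketplace footprint: fall back to attention.
        return mention_velocity
    total = sum(sub_weights[name] for name in present)
    return sum(sub_weights[name] * value for name, value in present.items()) / total


# Closed-source agent with a package and an extension but no public repo.
closed = ecosystem_factor(
    {"github_health": None, "package_downloads": 0.6, "marketplace": 0.1},
    mention_velocity=0.3,
)
```

Dividing by the sum of the present sub-weights is what keeps a missing component from acting as a zero.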
Agents are evaluated within 12 workload categories for actionable, domain-specific rankings:
The Leaderboard aggregates each agent's best category score to surface overall winners.
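A minimal sketch of that aggregation, assuming "best category score" means the maximum over an agent's per-category composites; the agent and category names are hypothetical.

```python
from typing import List, Mapping, Tuple


def leaderboard(
    scores_by_agent: Mapping[str, Mapping[str, float]],
) -> List[Tuple[str, float]]:
    """Rank agents by their best per-category composite score, highest first."""
    best = [
        (agent, max(category_scores.values()))
        for agent, category_scores in scores_by_agent.items()
        if category_scores  # skip agents with no category scores yet
    ]
    return sorted(best, key=lambda item: item[1], reverse=True)


ranking = leaderboard(
    {
        "agent_a": {"coding": 0.8, "research": 0.6},
        "agent_b": {"coding": 0.7, "research": 0.9},
    }
)
```

Taking the maximum rewards specialists: an agent that leads a single category can outrank one that is merely solid across all of them.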
Agent scores are forecast 72 hours ahead using mean-reversion toward the cross-sectional average, with confidence intervals derived from historical volatility:
Where λ = 0.003 is the mean-reversion speed, 𝒮 is the cross-sectional mean, and σ̂ is estimated from recent score differences (minimum 0.008).
Each hourly scoring run reads the latest collected signals and produces one data point per agent per category. The time series builds up from repeated runs; no simulation is used. All displayed trends therefore reflect real changes in benchmark performance, community sentiment, and adoption metrics.
What the paper is, and isn’t: our NeurIPS 2026 D&B submission is an in-depth empirical study of this framework as of April 2026 — it documents the validation evidence (factor independence, IDE-marketplace shipping decision, discriminative validity, pre-registered falsifiers) that justified the framework’s structure. It is not a fixed methodology specification: validation logic is what the paper covers; production weights are re-tuned as the underlying data improves. The formula above may therefore differ in detail from the version frozen in the paper.
The methodology was validated in our NeurIPS 2026 Datasets & Benchmarks Track submission across three analyses plus a diagnostic, on a 50-agent registry curated as of April 2026.
Pairwise Spearman correlations across the four factors confirm that they carry largely complementary signal: the maximum pairwise ρ is 0.75 (Adoption-Ecosystem, expected on substantive grounds), and all other pairwise |ρ| ≤ 0.34. The four factors are not redundant.
The Benchmark factor alone — with no Adoption-, Sentiment-, or Ecosystem-derived signals — predicts whether an agent ships a public IDE-marketplace extension, replicated across three independent platforms:
All three remain significant after Holm correction within the IDE bucket. Confirmatory Spearman secondary statistics: B-only ρs = 0.48-0.60, B+S ρs = 0.42-0.55.
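For readers unfamiliar with the Holm step-down procedure, a minimal sketch follows. The p-values in the example are made up for illustration; the per-platform p-values are not reproduced in this text.

```python
from typing import List, Sequence


def holm_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Holm step-down test: True where the null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # smallest p first
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is compared to alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


# Hypothetical p-values for three tests in one bucket.
all_pass = holm_reject([0.001, 0.02, 0.015])
one_pass = holm_reject([0.01, 0.04, 0.03])
```

The correction is applied within one family of tests, here the three tests in the IDE bucket, so the threshold depends on the family size m = 3.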
The same predictor is directionally negative on library-reuse metrics (GitHub Dependents: B-only ρs = -0.30, B+S ρs = -0.38, p = 0.04). The framework correctly distinguishes "agent that completes tasks" from "library that gets reused" — frameworks like LangGraph, MCP, and CrewAI dominate dependents-count but rank lower on the agent composite.
We commit to three falsifiers for the published claim: (i) the IDE-bucket effect failing to replicate at T+6 months (r_rb < +0.2 or p > 0.05), (ii) the within-presence rank correlation being null or negative once n ≥ 25, and (iii) the library-reuse correlation flipping positive. A replication snapshot will be posted to the artifact repository six months after publication.
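The three falsifiers can be written as a simple check against a replication snapshot. This is a sketch under stated assumptions: the field names are hypothetical, the effect size is assumed to be a rank-biserial correlation, and "null" in falsifier (ii) is read as p > 0.05.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReplicationSnapshot:
    """Hypothetical container for the statistics a replication would report."""
    ide_effect_size: float      # IDE-bucket effect size (assumed rank-biserial)
    ide_p_value: float
    within_presence_rho: float  # rank correlation among agents with an extension
    within_presence_p: float
    within_presence_n: int
    library_reuse_rho: float    # correlation with library-reuse metrics


def triggered_falsifiers(s: ReplicationSnapshot) -> List[str]:
    """Return the labels of the falsifiers that the snapshot triggers."""
    triggered = []
    if s.ide_effect_size < 0.2 or s.ide_p_value > 0.05:
        triggered.append("i")
    # Falsifier (ii) only applies once the sample is large enough.
    if s.within_presence_n >= 25 and (
        s.within_presence_rho <= 0 or s.within_presence_p > 0.05
    ):
        triggered.append("ii")
    if s.library_reuse_rho > 0:
        triggered.append("iii")
    return triggered
```

An empty result means the published claim survives that snapshot.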
Gao, Y., Wang, M., Yu, Y.L. (2026). IntelligenceArena: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment. NeurIPS 2026 Datasets & Benchmarks submission.
Read the IntelligenceArena paper (NeurIPS 2026 D&B submission)
Three rankings, computed hourly from 19 public sources.
We continuously collect signals from 19 heterogeneous sources across five categories:
Every collected text is scored across four quality dimensions before entering the scoring pipeline:
Texts scoring below a calibrated threshold are flagged and excluded from downstream analysis.
Each text produces 30+ features through multi-layer analysis.
Composite sentiment blends four methods via calibrated weights:
Emotion detection scores seven dimensions using curated lexicons. For each emotion e:
Sarcasm detection uses pattern matching against K curated regex templates:
Texts with high sarcasm probability have their sentiment sign inverted before aggregation.
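A minimal sketch of regex-based sarcasm flagging with sign inversion. The templates, the probability rule (fraction of templates matched), and the threshold are all illustrative assumptions; the K curated production templates and the actual scoring rule are not listed here.

```python
import re

# Hypothetical templates for illustration only.
SARCASM_PATTERNS = [
    re.compile(r"\byeah,? right\b", re.IGNORECASE),
    re.compile(r"\boh great\b", re.IGNORECASE),
    re.compile(r"\bthanks for nothing\b", re.IGNORECASE),
    re.compile(r"/s\s*$"),  # trailing "/s" marker
]


def sarcasm_probability(text: str) -> float:
    """Fraction of templates that match the text (assumed scoring rule)."""
    hits = sum(1 for pattern in SARCASM_PATTERNS if pattern.search(text))
    return hits / len(SARCASM_PATTERNS)


def adjust_sentiment(text: str, sentiment: float, threshold: float = 0.25) -> float:
    """Invert the sentiment sign when sarcasm probability reaches the threshold."""
    if sarcasm_probability(text) >= threshold:
        return -sentiment
    return sentiment


flipped = adjust_sentiment("Oh great, another outage /s", 0.6)
unchanged = adjust_sentiment("The new release is fast", 0.6)
```

Here the first text matches two of the four templates, so its positive surface sentiment is inverted before aggregation; the second matches none and passes through unchanged.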
Aspect analysis scores five LLM-specific dimensions (performance, reliability, cost, innovation, adoption):
Engagement weighting amplifies community-validated content:
IntelligenceArena produces three distinct rankings, each measuring a different dimension:
Measures how much raw intelligence each million output tokens carries, in dollar terms. Computed as:
where V(m) is the calibrated intelligence value ($0.20–$10) and P_out is the output price per 1M tokens. The most capable models rank highest regardless of cost: GPT-5.4 Pro produces the most intelligent tokens, even though its output costs $180 per 1M tokens.
How V(m) is computed. V(m) is a model-level capability score in [0.20, 10.0], blended from public benchmark performance (SWE-bench, GAIA, HumanEval+, TAU-bench, Arena Elo where available), recent Sentiment Pulse, and our hype/nerf corrections. It is re-fit weekly so newly released or newly degraded models move within days, not quarters.
Measures how much intelligence you get per dollar spent. Computed as:
This is the intelligence value itself ($0.20–$10). Models that deliver strong capability at low prices rank highest — DeepSeek R1-0528 at $2.15/1M output delivers exceptional intelligence per dollar. The most expensive flagships rank lowest here.
A six-factor composite score capturing real-time community perception from 19 data sources:
The six factors are:
This ranking is independent of price — it reflects what developers and users are actually saying about each model across social media, forums, and developer platforms.
Validation logic is peer-review-grade (see our NeurIPS 2026 D&B submission); production weights are re-tuned as the underlying data improves and are not frozen with the paper.
Hype correction. We compute a raw hype indicator and cross-sectionally normalize it:
Models below median hype receive no discount. Negative sentiment (real complaints) is never discounted.
Open-source fairness. Open-source models are scored on top-k provider uptime (not the average across providers), on max(HF, SDK) for adoption, and with an endpoint-diversity bonus.
We detect the compound signal of an imminent model release combined with rising user complaints:
When both release and degradation signals are present, a multiplicative amplification is applied:
72-hour predictions via four complementary methods:
Prediction intervals expand with the square root of the forecast horizon:
Hourly-updated rankings, factor decompositions, and historical time-series for 50+ agents and frontier LLMs, delivered through APIs, bulk exports, or co-managed dashboards. For commercial procurement, due-diligence, and product-strategy teams.
Programmatic access to per-agent composite scores, factor breakdowns, and 1h / 24h / 7d deltas. SLA-backed. Refreshed every hour.
Full history of hourly scores, the underlying 18 signals per agent, and category-specific breakdowns. CSV / Parquet, with Croissant 1.0 metadata.
Add internal or unlisted agents to the registry, request bespoke benchmarks, or define custom workload categories aligned with your evaluation needs.