How IntelligenceArena scores AI agents
A 4-factor composite across 12 workload categories, computed hourly from 18 real-time signals.
- 4 weighted factors — Benchmarks (30%), Community Sentiment (20%), Mention Velocity 24h (25%), Adoption (25%, adaptive). Sub-weights renormalize when signals are missing, so closed-source agents aren't penalized for lacking public-repo telemetry.
- 12 workload categories — coding, research, browser, multi-agent, enterprise, etc. — each agent ranked within its specialization, then aggregated for "Best Overall."
- Falsifiable validation. The Benchmark factor alone predicts public IDE-marketplace shipping decisions, replicated on three independent platforms (VS Code, Open VSX, JetBrains). All three pre-registered falsifiers are in the paper.
Full technical breakdown — signals, factor formulas, validation analyses, falsifiers
1. Signal Collection (18 Signals)
We collect 18 real signals per agent across four categories, refreshed hourly:
- Benchmarks (5): SWE-bench resolve rate, GAIA accuracy, WebArena success rate, HumanEval+ pass rate, TAU-bench score — sourced from public leaderboards
- Adoption (6): GitHub stars, GitHub stars velocity (Δ/week), PyPI/npm weekly downloads, VS Code Marketplace installs, VS Code rating, Docker pulls
- Community (4): Social media sentiment (from NLP pipeline across Bluesky, Reddit, HN, Mastodon), Stack Overflow question count, GitHub issue count + close rate, GitHub contributor count
- Ecosystem (3): Days since last release, documentation quality proxy, enterprise readiness signals
2. Composite Scoring Model
Each agent receives a composite score from four weighted factors. The weights always sum to 1.0 — when an underlying signal is missing (e.g., a closed-source agent with no public GitHub repo), the corresponding sub-factor is renormalized rather than zero-filled, so closed-source agents are not structurally penalized for the absence of public-repo telemetry.
Factor 1: Benchmark Performance (30%)
Average of all available public benchmark scores (SWE-bench, GAIA, WebArena, HumanEval+, TAU-bench), normalized to [0, 1]. Agents without benchmark data receive a neutral prior of 0.5.
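A minimal sketch of this averaging rule, assuming each benchmark score has already been expressed as a fraction in [0, 1] and that a missing benchmark is represented as `None`. The function and key names are illustrative, not taken from the production code.

```python
from statistics import mean
from typing import Mapping, Optional

NEUTRAL_PRIOR = 0.5  # from the text: used when no benchmark data exists


def benchmark_factor(scores: Mapping[str, Optional[float]]) -> float:
    """Average the available benchmark scores (assumed already in [0, 1])."""
    available = [s for s in scores.values() if s is not None]
    if not available:
        # No public benchmark results at all: fall back to the neutral prior.
        return NEUTRAL_PRIOR
    return mean(available)


# Hypothetical agent with two of the five benchmarks reported.
example = benchmark_factor({"swe_bench": 0.2, "gaia": 0.6, "webarena": None})
```

Averaging only over the benchmarks that exist means an agent is not pulled toward zero for leaderboards it never entered.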
Factor 2: Community Sentiment (20%)
Drawn from our multi-layer NLP pipeline (VADER + TextBlob + aspect analysis) over the last 7 days of social posts. Raw sentiment in [-0.2, 0.4] is rescaled to [0, 1].
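The rescaling step could look like the sketch below. It assumes a linear min-max mapping and assumes that raw values outside [-0.2, 0.4] are clipped to the nearest bound; the text states only the input and output ranges, so both choices are assumptions.

```python
RAW_MIN, RAW_MAX = -0.2, 0.4  # raw sentiment range given in the text


def rescale_sentiment(raw: float) -> float:
    """Map raw sentiment linearly onto [0, 1], clipping out-of-range values."""
    clipped = min(max(raw, RAW_MIN), RAW_MAX)  # clipping is an assumption
    return (clipped - RAW_MIN) / (RAW_MAX - RAW_MIN)
```

Under this mapping a raw sentiment of 0.1, the midpoint of the raw range, lands at about 0.5.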
Factor 3: Mention Velocity, 24h (25%)
Log-scaled count of new posts mentioning the agent across Bluesky, Reddit, Hacker News, and Mastodon over the last 24 hours. This is the universal time-varying signal: every agent — open- or closed-source — generates measurable discussion volume, so the composite reflects real shifts in attention rather than static catalog data.
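One way to log-scale a post count onto [0, 1] is sketched here. The use of `log1p` and the saturation cap of 10,000 posts are hypothetical choices made for illustration; the text does not specify the log base or the normalization.

```python
import math
from typing import Mapping

SATURATION_CAP = 10_000  # hypothetical: post count at which the factor reaches 1.0


def mention_velocity(posts_24h: Mapping[str, int], cap: int = SATURATION_CAP) -> float:
    """Log-scale the total 24h post count across platforms onto [0, 1]."""
    total = sum(posts_24h.values())
    # log1p keeps zero posts at exactly 0 and compresses large counts.
    return min(1.0, math.log1p(total) / math.log1p(cap))


# Hypothetical counts for one agent over the last 24 hours.
example = mention_velocity({"bluesky": 40, "reddit": 120, "hn": 15, "mastodon": 5})
```

The log scale means that going from 10 to 100 posts moves the factor about as much as going from 100 to 1,000, so a single viral thread does not swamp the composite.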
Factor 4: Adoption (25%, adaptive)
Composed from whichever of the following are present for the agent: GitHub repo health (stars, contributors, release freshness, issue close-rate), package downloads (PyPI/npm weekly), and VS Code Marketplace installs + ratings. Sub-weights are renormalized over only the components that exist. If an agent has none of these signals, the slot falls back to mention velocity, so closed-source agents with no marketplace footprint can still rank on the strength of attention alone.
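The renormalize-then-fall-back logic can be sketched as follows. The base sub-weights below are hypothetical placeholders (the actual sub-weights are not given in the text), and each component is assumed to be pre-normalized to [0, 1] with `None` marking a missing signal.

```python
from typing import Mapping, Optional

# Hypothetical base sub-weights; the production values are not published here.
BASE_SUB_WEIGHTS = {"repo_health": 0.4, "package_downloads": 0.3, "marketplace": 0.3}


def adoption_factor(
    components: Mapping[str, Optional[float]], mention_velocity: float
) -> float:
    """Weighted mean over the components that exist, else mention velocity."""
    present = {
        name: value
        for name, value in components.items()
        if value is not None and name in BASE_SUB_WEIGHTS
    }
    if not present:
        # No adoption signals at all: the slot falls back to mention velocity.
        return mention_velocity
    total_weight = sum(BASE_SUB_WEIGHTS[name] for name in present)
    # Dividing by total_weight renormalizes the surviving sub-weights to sum to 1.
    return sum(BASE_SUB_WEIGHTS[n] / total_weight * v for n, v in present.items())
```

With only a marketplace score of 0.8 present, the factor is 0.8 rather than 0.3 × 0.8, which is the difference between renormalizing and zero-filling.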
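Composite Score
The composite is the weighted sum of the four factor scores at the weights listed above: Composite = 0.30 × Benchmark + 0.20 × Sentiment + 0.25 × Velocity + 0.25 × Adoption.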
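A worked example, assuming the composite is a plain weighted sum of the four factor scores at the stated weights and that every factor has already been normalized to [0, 1]. The factor values are invented for illustration.

```python
from typing import Mapping

# Factor weights as stated in the text; they sum to 1.0.
WEIGHTS = {"benchmark": 0.30, "sentiment": 0.20, "velocity": 0.25, "adoption": 0.25}


def composite_score(factors: Mapping[str, float]) -> float:
    """Weighted sum of the four factor scores (assumed form of the composite)."""
    return sum(weight * factors[name] for name, weight in WEIGHTS.items())


# Hypothetical agent: no benchmark data (neutral prior 0.5), neutral sentiment,
# high 24h mention velocity, moderate adoption.
score = composite_score(
    {"benchmark": 0.5, "sentiment": 0.5, "velocity": 0.8, "adoption": 0.6}
)
```

Here the terms are 0.15 + 0.10 + 0.20 + 0.15, so the composite comes out at about 0.60.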
3. Category-Based Evaluation
Agents are evaluated within 12 workload categories for actionable, domain-specific rankings:
- Development: Coding Agents, Coding Copilots, Autonomous SWE Agents
- Research & Analysis: Research Agents, Data Analysis
- Interaction: Browser Agents, General-Purpose, Consumer Productivity
- Infrastructure: Tool-Use, Multi-Agent Systems, Enterprise
- Specialized: Memory & Long-Horizon
The Leaderboard aggregates each agent's best category score to surface overall winners.
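The aggregation step might be sketched like this, assuming "best category score" means the maximum of an agent's per-category composites. Agent and category names are placeholders.

```python
from typing import Dict, List, Tuple


def best_overall(category_scores: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Rank agents by their best per-category score, highest first.

    category_scores maps agent -> {category -> composite score}.
    """
    best = {
        agent: max(scores.values())
        for agent, scores in category_scores.items()
        if scores  # skip agents with no category scores
    }
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores for two agents.
ranking = best_overall(
    {
        "agent_a": {"coding_agents": 0.71, "research_agents": 0.55},
        "agent_b": {"browser_agents": 0.64},
    }
)
```

Taking the maximum rather than the mean lets a specialist compete with generalists on the overall board.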
4. Forecasting
Agent scores are forecast 72 hours ahead using mean-reversion toward the cross-sectional average, with confidence intervals derived from historical volatility:
Where λ = 0.003 is the mean-reversion speed, 𝒮 is the cross-sectional mean, and σ̂ is estimated from recent score differences (minimum 0.008).
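One common way to write a mean-reverting forecast with these parameters is sketched below. The exponential-decay form, the hourly time step, the square-root-of-time widening of the interval, and the 1.96 multiplier are all assumptions made for illustration; only λ = 0.003, the 0.008 volatility floor, and the 72-hour horizon come from the text.

```python
import math
from statistics import pstdev
from typing import Sequence, Tuple

LAMBDA = 0.003       # mean-reversion speed, from the text (assumed per hour)
SIGMA_FLOOR = 0.008  # minimum volatility, from the text


def forecast(
    history: Sequence[float],
    cross_sectional_mean: float,
    horizon_hours: int = 72,
    z: float = 1.96,  # assumed multiplier for a roughly 95% interval
) -> Tuple[float, float, float]:
    """Return (point, lower, upper) for the score horizon_hours ahead."""
    current = history[-1]
    # Volatility estimated from recent hour-to-hour score differences.
    diffs = [b - a for a, b in zip(history, history[1:])]
    sigma = max(pstdev(diffs) if diffs else 0.0, SIGMA_FLOOR)
    # The gap to the cross-sectional mean decays exponentially at speed LAMBDA.
    decay = math.exp(-LAMBDA * horizon_hours)
    point = cross_sectional_mean + (current - cross_sectional_mean) * decay
    # Interval widens with the square root of the horizon (random-walk assumption).
    half_width = z * sigma * math.sqrt(horizon_hours)
    return point, point - half_width, point + half_width


# Hypothetical flat history at 0.70 with a cross-sectional mean of 0.50.
point, lower, upper = forecast([0.70] * 5, cross_sectional_mean=0.50)
```

With λ this small, about 80% of the gap to the mean survives after 72 hours, so the forecast drifts only gently toward the average.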
5. Time-Series Construction
Each hourly scoring run reads the latest collected signals and produces one data point per agent per category. The time-series builds naturally from repeated runs — no simulation is used. This ensures all displayed trends reflect real changes in benchmark performance, community sentiment, and adoption metrics.
6. Empirical Validation (NeurIPS 2026)
What the paper is, and isn’t: our NeurIPS 2026 D&B submission is an in-depth empirical study of this framework as of April 2026 — it documents the validation evidence (factor independence, IDE-marketplace shipping decision, discriminative validity, pre-registered falsifiers) that justified the framework’s structure. It is not a fixed methodology specification: validation logic is what the paper covers; production weights are re-tuned as the underlying data improves. The formula above may therefore differ in detail from the version frozen in the paper.
The methodology was validated in our NeurIPS 2026 Datasets & Benchmarks Track submission across three analyses plus a diagnostic, on a 50-agent registry curated as of April 2026.
Factor independence (n = 50)
Pairwise Spearman correlations across the four factors confirm largely complementary signal: max pairwise ρ = 0.75 (Adoption-Ecosystem, expected on substantive grounds), all other pairwise |ρ| ≤ 0.34. The four factors are not redundant.
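For readers who want to run the same kind of check on their own factor table, here is a dependency-free sketch of pairwise Spearman correlation (Pearson correlation on average ranks). The factor names and values a caller passes in are placeholders, not the paper's data, and the function assumes no factor is constant across agents.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def pairwise_spearman(
    factors: Dict[str, Sequence[float]]
) -> Dict[Tuple[str, str], float]:
    """Spearman's rho for every unordered pair of factor columns."""
    return {
        (a, b): spearman(factors[a], factors[b])
        for a, b in combinations(factors, 2)
    }
```

The largest absolute value in the returned dictionary is the quantity to compare against the 0.75 and 0.34 figures quoted above.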
Headline result: IDE-marketplace shipping decision (n = 34 public-repo agents)
The Benchmark factor alone — with no Adoption-, Sentiment-, or Ecosystem-derived signals — predicts whether an agent ships a public IDE-marketplace extension, replicated across three independent platforms:
- VS Code Marketplace: Mann-Whitney U = 42, p = 0.011, rank-biserial r_rb = +0.60 (pre-specified)
- Open VSX: U = 12, p = 0.0011, r_rb = +0.86 (post-hoc replication)
- JetBrains Marketplace: U = 27, p = 0.004, r_rb = +0.71 (post-hoc replication)
All three remain significant after Holm correction within the IDE bucket. Confirmatory Spearman secondary statistics: B-only ρ_s = 0.48 to 0.60, B+S ρ_s = 0.42 to 0.55.
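The two statistics used here can be sketched without external libraries. The rank-biserial effect size below uses the pairwise "wins minus losses" definition, and the Holm step assumes the IDE bucket contains exactly the three tests listed above. The p-values are the ones reported in the list; any group data a caller passes to `rank_biserial` is their own.

```python
from typing import List, Sequence


def rank_biserial(shipped: Sequence[float], not_shipped: Sequence[float]) -> float:
    """Rank-biserial correlation: share of pairs won minus share of pairs lost."""
    wins = 0.0
    for a in shipped:
        for b in not_shipped:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5  # ties are split evenly between the groups
    n_pairs = len(shipped) * len(not_shipped)
    losses = n_pairs - wins
    return (wins - losses) / n_pairs


def holm_adjust(p_values: Sequence[float]) -> List[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Smallest p is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


# Reported p-values: VS Code Marketplace, Open VSX, JetBrains Marketplace.
adjusted = holm_adjust([0.011, 0.0011, 0.004])
all_significant = all(p < 0.05 for p in adjusted)
```

Under that three-test assumption the adjusted values are roughly 0.011, 0.0033, and 0.008, all below 0.05, which agrees with the statement that the three results survive correction.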
Discriminative validity (n = 30)
The same predictor is directionally negative on library-reuse metrics (GitHub Dependents: B-only ρ_s = -0.30, B+S ρ_s = -0.38, p = 0.04). The framework correctly distinguishes "agent that completes tasks" from "library that gets reused": frameworks like LangGraph, MCP, and CrewAI dominate dependents-count but rank lower on the agent composite.
Pre-registered falsifiers
We commit to three falsifiers for the published claim: (i) IDE-bucket effect failing to replicate at T+6 months (r_rb < +0.2 or p > 0.05), (ii) within-presence rank correlation null/negative once n ≥ 25, (iii) library-reuse correlation flipping positive. A replication snapshot will be posted to the artifact repository six months after publication.
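The three conditions can be written as a simple check over a replication snapshot. The `ReplicationSnapshot` fields are hypothetical names, and reading "null/negative" as "ρ ≤ 0 or p above 0.05" is an interpretation made here, not something the text spells out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReplicationSnapshot:
    """Hypothetical container for the statistics re-measured at T+6 months."""
    ide_r_rb: float
    ide_p: float
    within_presence_rho: float
    within_presence_p: float
    within_presence_n: int
    library_reuse_rho: float


def triggered_falsifiers(s: ReplicationSnapshot, alpha: float = 0.05) -> List[str]:
    """Return the labels of the falsifiers that the snapshot triggers."""
    hits = []
    # (i) IDE-bucket effect fails to replicate.
    if s.ide_r_rb < 0.2 or s.ide_p > alpha:
        hits.append("i")
    # (ii) Only evaluated once the within-presence sample reaches n >= 25.
    if s.within_presence_n >= 25 and (
        s.within_presence_rho <= 0 or s.within_presence_p > alpha
    ):
        hits.append("ii")
    # (iii) Library-reuse correlation flips positive.
    if s.library_reuse_rho > 0:
        hits.append("iii")
    return hits
```

An empty list means the published claim survives that snapshot; any non-empty result names the condition that failed.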
Reference
Gao, Y., Wang, M., Yu, Y.L. (2026). IntelligenceArena: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment. NeurIPS 2026 Datasets & Benchmarks submission.
Read the IntelligenceArena paper (NeurIPS 2026 D&B submission)