IntelligenceArena — Agents Methodology

How IntelligenceArena scores AI agents

A 4-factor composite across 12 workload categories, computed hourly from 18 real-time signals.

TL;DR

4 weighted factors — Benchmarks (30%), Community Sentiment (20%), Mention Velocity 24h (25%), Adoption (25%, adaptive). Sub-weights renormalize when signals are missing, so closed-source agents aren't penalized for lacking public-repo telemetry.
12 workload categories — coding, research, browser, multi-agent, enterprise, etc. — each agent ranked within its specialization, then aggregated for "Best Overall."
Falsifiable validation. The Benchmark factor alone predicts public IDE-marketplace shipping decisions, replicated on three independent platforms (VS Code, Open VSX, JetBrains). All three pre-registered falsifiers are in the paper.

Read the full validation evidence (NeurIPS 2026 D&B submission) →

Full technical breakdown — signals, factor formulas, validation analyses, falsifiers

1. Signal Collection (18 Signals)

We collect 18 real signals per agent across four categories, refreshed hourly:

Benchmarks (5): SWE-bench resolve rate, GAIA accuracy, WebArena success rate, HumanEval+ pass rate, TAU-bench score — sourced from public leaderboards
Adoption (6): GitHub stars, GitHub stars velocity (Δ/week), PyPI/npm weekly downloads, VS Code Marketplace installs, VS Code rating, Docker pulls
Community (4): Social media sentiment (from NLP pipeline across Bluesky, Reddit, HN, Mastodon), Stack Overflow question count, GitHub issue count + close rate, GitHub contributor count
Ecosystem (3): Days since last release, documentation quality proxy, enterprise readiness signals

2. Composite Scoring Model

Each agent receives a composite score from four weighted factors. The weights always sum to 1.0 — when an underlying signal is missing (e.g., a closed-source agent with no public GitHub repo), the corresponding sub-factor is renormalized rather than zero-filled, so closed-source agents are not structurally penalized for the absence of public-repo telemetry.

Factor 1: Benchmark Performance (30%)

Average of all available public benchmark scores (SWE-bench, GAIA, WebArena, HumanEval+, TAU-bench), normalized to [0, 1]. Agents without benchmark data receive a neutral prior of 0.5.
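
As a minimal sketch of this factor, assuming each benchmark score has already been normalized to [0, 1] and that missing benchmarks are represented as `None` (the function name and dictionary keys are illustrative, not taken from the production code):

```python
from statistics import mean
from typing import Mapping, Optional

NEUTRAL_PRIOR = 0.5  # value stated in the text for agents with no benchmark data


def benchmark_factor(scores: Mapping[str, Optional[float]]) -> float:
    """Average the available benchmark scores; fall back to the neutral prior."""
    available = [s for s in scores.values() if s is not None]
    if not available:
        return NEUTRAL_PRIOR
    return mean(available)


# Hypothetical agent with two of the five benchmarks reported.
partial = {"swe_bench": 0.40, "gaia": None, "webarena": 0.70}
print(benchmark_factor(partial))
print(benchmark_factor({"swe_bench": None}))
```

Averaging only over available scores means an agent is not pulled toward zero by benchmarks it was never run on.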

Factor 2: Community Sentiment (20%)

Drawn from our multi-layer NLP pipeline (VADER + TextBlob + aspect analysis) over the last 7 days of social posts. Raw sentiment in [-0.2, 0.4] is rescaled to [0, 1].
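
The rescaling step could look like the sketch below. It assumes a linear map with clipping at the stated bounds; the text gives the input and output ranges but not the exact transform, so both the linearity and the clipping are assumptions.

```python
def rescale_sentiment(raw: float, lo: float = -0.2, hi: float = 0.4) -> float:
    """Map raw sentiment from [lo, hi] to [0, 1], clipping values outside the range."""
    clipped = min(max(raw, lo), hi)  # assumption: out-of-range values are clipped
    return (clipped - lo) / (hi - lo)


for raw in (-0.5, -0.2, 0.1, 0.4, 0.9):
    print(raw, round(rescale_sentiment(raw), 3))
```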

Factor 3: Mention Velocity, 24h (25%)

Log-scaled count of new posts mentioning the agent across Bluesky, Reddit, Hacker News, and Mastodon over the last 24 hours. This is the universal time-varying signal: every agent — open- or closed-source — generates measurable discussion volume, so the composite reflects real shifts in attention rather than static catalog data.
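
One way to log-scale a mention count into [0, 1] is sketched below. The text only says the count is log-scaled; the use of `log1p` and the saturation point `cap` are hypothetical choices made for illustration.

```python
import math


def mention_velocity(count_24h: int, cap: int = 10_000) -> float:
    """Log-scale a 24h mention count into [0, 1].

    `cap` is a hypothetical count at which the score saturates at 1.0.
    """
    if count_24h < 0:
        raise ValueError("mention count cannot be negative")
    return min(math.log1p(count_24h) / math.log1p(cap), 1.0)


for n in (0, 10, 100, 1_000, 10_000):
    print(n, round(mention_velocity(n), 3))
```

Log scaling compresses the gap between very large counts, so a tenfold jump in mentions moves the score by a roughly constant amount rather than tenfold.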

Factor 4: Adoption (25%, adaptive)

Composed from whichever of the following are present for the agent: GitHub repo health (stars, contributors, release freshness, issue close-rate), package downloads (PyPI/npm weekly), and VS Code Marketplace installs + ratings. Sub-weights are renormalized over only the components that exist. If an agent has none of these signals, the slot falls back to mention velocity, so closed agents with no marketplace footprint can still rank on the strength of attention alone.
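
The renormalization and fallback described above can be sketched as follows. The sub-weights are hypothetical (the text does not publish them), and each component is assumed to be pre-normalized to [0, 1] with `None` marking a missing signal.

```python
from typing import Mapping, Optional

# Hypothetical sub-weights; the actual production values are not given in the text.
SUB_WEIGHTS = {"repo_health": 0.4, "package_downloads": 0.3, "marketplace": 0.3}


def adoption_factor(components: Mapping[str, Optional[float]],
                    mention_velocity: float) -> float:
    """Weighted mean over the adoption components that exist for this agent."""
    present = {k: v for k, v in components.items()
               if k in SUB_WEIGHTS and v is not None}
    if not present:
        # No adoption signals at all: the slot falls back to mention velocity.
        return mention_velocity
    total = sum(SUB_WEIGHTS[k] for k in present)  # renormalize over present parts
    return sum(SUB_WEIGHTS[k] * v for k, v in present.items()) / total


closed_source = {"repo_health": None, "package_downloads": None, "marketplace": 0.8}
print(adoption_factor(closed_source, mention_velocity=0.3))
print(adoption_factor({"repo_health": None}, mention_velocity=0.3))
```

With only the marketplace component present, its weight renormalizes to 1.0, so the factor equals that component instead of being diluted by zero-filled slots.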

Composite Score
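
The composite formula itself does not appear in this text. Reading it off the factor weights stated above gives the weighted sum below. The symbols B, C, V, and A are labels chosen here for the Benchmark, Community Sentiment, Mention Velocity, and Adoption factor scores, each assumed to lie in [0, 1]; per the Adoption description, A takes the value of V when an agent has no adoption signals.

```latex
S_{\text{composite}} = 0.30\,B + 0.20\,C + 0.25\,V + 0.25\,A,
\qquad B,\, C,\, V,\, A \in [0, 1]
```

As production weights are re-tuned over time, the coefficients should be read as the values quoted on this page rather than as fixed constants.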

3. Category-Based Evaluation

Agents are evaluated within 12 workload categories for actionable, domain-specific rankings:

Development: Coding Agents, Coding Copilots, Autonomous SWE Agents
Research & Analysis: Research Agents, Data Analysis
Interaction: Browser Agents, General-Purpose, Consumer Productivity
Infrastructure: Tool-Use, Multi-Agent Systems, Enterprise
Specialized: Memory & Long-Horizon

The Leaderboard aggregates each agent's best category score to surface overall winners.
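
A minimal sketch of that aggregation, assuming per-category composite scores are held in a nested dictionary (the agent names and data layout are hypothetical):

```python
def best_overall(category_scores: dict) -> list:
    """Rank agents by their single best category score, highest first."""
    rows = []
    for agent, scores in category_scores.items():
        if not scores:
            continue  # agent has not been scored in any category yet
        category, score = max(scores.items(), key=lambda kv: kv[1])
        rows.append((agent, category, score))
    return sorted(rows, key=lambda row: row[2], reverse=True)


scores = {
    "agent_a": {"Coding Agents": 0.71, "Browser Agents": 0.55},
    "agent_b": {"Research Agents": 0.78},
    "agent_c": {},
}
for agent, category, score in best_overall(scores):
    print(agent, category, score)
```

Taking the maximum rather than the mean keeps a specialist from being penalized for categories outside its focus.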

4. Forecasting

Agent scores are forecast 72 hours ahead using mean-reversion toward the cross-sectional average, with confidence intervals derived from historical volatility:

Here λ = 0.003 is the mean-reversion speed, 𝒮 is the cross-sectional mean score, and σ̂ is the volatility estimated from recent score differences, floored at 0.008.
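
The forecast equation is not reproduced in this text, so the sketch below is only one plausible discrete-time reading of the description. It assumes that λ applies per hourly step, that the gap to the cross-sectional mean decays geometrically, and that the interval half-width grows with the square root of the horizon (a random-walk assumption) at z = 1.96. None of these choices are confirmed by the source, and the function names are illustrative.

```python
import math
from statistics import stdev

LAMBDA = 0.003       # mean-reversion speed from the text (assumed to be per hour)
SIGMA_FLOOR = 0.008  # minimum volatility from the text


def estimate_sigma(history: list) -> float:
    """Standard deviation of recent hourly score differences, floored."""
    diffs = [b - a for a, b in zip(history, history[1:])]
    if len(diffs) < 2:
        return SIGMA_FLOOR  # not enough data for a sample standard deviation
    return max(stdev(diffs), SIGMA_FLOOR)


def forecast(current: float, cross_mean: float, sigma: float,
             horizon_h: int = 72, z: float = 1.96) -> tuple:
    """Return (point, lower, upper) for the score `horizon_h` hours ahead."""
    decay = (1.0 - LAMBDA) ** horizon_h          # share of the gap that remains
    point = cross_mean + (current - cross_mean) * decay
    half_width = z * sigma * math.sqrt(horizon_h)  # assumed sqrt-of-time scaling
    return point, point - half_width, point + half_width


history = [0.80, 0.81, 0.79, 0.80, 0.82]
sigma = estimate_sigma(history)
print(forecast(current=history[-1], cross_mean=0.50, sigma=sigma))
```

With a reversion speed this small, the point forecast stays close to the current score over 72 hours, and most of the forecast's information is in the interval width.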

5. Time-Series Construction

Each hourly scoring run reads the latest collected signals and produces one data point per agent per category. The time series accumulates from these repeated runs; no simulation is used, so all displayed trends reflect real changes in benchmark performance, community sentiment, and adoption metrics.

6. Empirical Validation (NeurIPS 2026)

What the paper is, and isn’t: our NeurIPS 2026 D&B submission is an in-depth empirical study of this framework as of April 2026. It documents the validation evidence (factor independence, the IDE-marketplace shipping decision, discriminative validity, pre-registered falsifiers) that justified the framework’s structure. It is not a fixed methodology specification: the paper covers the validation logic, while production weights are re-tuned as the underlying data improves. The formula above may therefore differ in detail from the version frozen in the paper.

The methodology was validated in our NeurIPS 2026 Datasets & Benchmarks Track submission across three analyses plus a diagnostic, on a 50-agent registry curated as of April 2026.

Factor independence (n = 50)

Pairwise Spearman correlations across the four factors indicate largely complementary signals: the maximum pairwise ρ is 0.75 (Adoption-Ecosystem, expected on substantive grounds), and all other pairwise |ρ| ≤ 0.34. The four factors are not redundant.
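
For readers who want to run this kind of check on their own factor table, here is a dependency-free sketch of pairwise Spearman correlation (Pearson correlation on average ranks). The factor values below are made-up toy numbers, not the paper's data, and the helper names are illustrative.

```python
from itertools import combinations
from statistics import mean


def average_ranks(values: list) -> list:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: list, y: list) -> float:
    """Spearman rho; undefined (division by zero) if either input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy factor scores for five hypothetical agents.
factors = {
    "benchmark": [0.9, 0.4, 0.6, 0.7, 0.2],
    "sentiment": [0.5, 0.6, 0.4, 0.7, 0.3],
    "adoption": [0.3, 0.8, 0.5, 0.2, 0.9],
}
for a, b in combinations(factors, 2):
    print(a, b, round(spearman(factors[a], factors[b]), 2))
```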

Headline result: IDE-marketplace shipping decision (n = 34 public-repo agents)

The Benchmark factor alone — with no Adoption-, Sentiment-, or Ecosystem-derived signals — predicts whether an agent ships a public IDE-marketplace extension, replicated across three independent platforms:

VS Code Marketplace: Mann-Whitney U = 42, p = 0.011, rank-biserial r_rb = +0.60 (pre-specified)
Open VSX: U = 12, p = 0.0011, r_rb = +0.86 (post-hoc replication)
JetBrains Marketplace: U = 27, p = 0.004, r_rb = +0.71 (post-hoc replication)
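
To make the reported statistics concrete, the sketch below computes a Mann-Whitney U statistic and the rank-biserial effect size by direct pairwise comparison. The scores are toy values rather than the study's data, the p-value computation is omitted, and which of the two U values a paper reports (often the smaller one) is a convention this sketch does not assume.

```python
def mann_whitney_u(group_a: list, group_b: list) -> float:
    """U for group_a: each (a, b) pair counts 1 if a > b and 0.5 if tied."""
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def rank_biserial(group_a: list, group_b: list) -> float:
    """Share of pairs favouring group_a minus share favouring group_b."""
    n_pairs = len(group_a) * len(group_b)
    return 2.0 * mann_whitney_u(group_a, group_b) / n_pairs - 1.0


# Toy Benchmark-factor scores: agents that ship an extension vs. those that do not.
ships = [0.9, 0.8, 0.7]
does_not_ship = [0.75, 0.5, 0.4, 0.3]

u_ships = mann_whitney_u(ships, does_not_ship)
u_other = len(ships) * len(does_not_ship) - u_ships  # the two U values sum to n1 * n2
print(u_ships, u_other, round(rank_biserial(ships, does_not_ship), 3))
```

A positive rank-biserial value means that, in a randomly drawn pair, the shipping agent tends to have the higher Benchmark score.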

All three remain significant after Holm correction within the IDE bucket. Confirmatory Spearman secondary statistics: B-only ρ_s = 0.48-0.60, B+S ρ_s = 0.42-0.55.
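
The Holm step can be checked against the three p-values reported above. The sketch assumes the correction family is exactly these three tests, which is how "within the IDE bucket" is read here.

```python
def holm_adjust(p_values: list) -> list:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min((m - rank) * p_values[idx], 1.0)
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


reported = {
    "VS Code Marketplace": 0.011,
    "Open VSX": 0.0011,
    "JetBrains Marketplace": 0.004,
}
adjusted = dict(zip(reported, holm_adjust(list(reported.values()))))
for platform, p_adj in adjusted.items():
    print(platform, round(p_adj, 4), p_adj < 0.05)
```

Under that assumption the adjusted values are roughly 0.011, 0.0033, and 0.008, all below 0.05, which agrees with the statement that all three survive correction.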

Discriminative validity (n = 30)

The same predictor is directionally negative on library-reuse metrics (GitHub Dependents: B-only ρ_s = -0.30, B+S ρ_s = -0.38, p = 0.04). The framework correctly distinguishes "agent that completes tasks" from "library that gets reused" — frameworks like LangGraph, MCP, and CrewAI dominate dependents-count but rank lower on the agent composite.

Pre-registered falsifiers

We commit to three falsifiers for the published claim: (i) the IDE-bucket effect failing to replicate at T+6 months (r_rb < +0.2 or p > 0.05), (ii) the within-presence rank correlation being null or negative once n ≥ 25, and (iii) the library-reuse correlation flipping positive. A replication snapshot will be posted to the artifact repository six months after publication.
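
Falsifiers (i) and (iii) have explicit thresholds and can be written as simple predicates, as in the sketch below. Falsifier (ii) is left out because the text does not define a numeric criterion for "null". The function names are illustrative.

```python
def ide_effect_falsified(r_rb: float, p_value: float) -> bool:
    """Falsifier (i): replication fails if r_rb < +0.2 or p > 0.05."""
    return r_rb < 0.2 or p_value > 0.05


def library_reuse_falsified(rho: float) -> bool:
    """Falsifier (iii): the library-reuse correlation flips positive."""
    return rho > 0.0


# Applied to the currently reported values, neither falsifier triggers.
print(ide_effect_falsified(r_rb=0.60, p_value=0.011))
print(library_reuse_falsified(rho=-0.30))
```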

Reference

Gao, Y., Wang, M., Yu, Y.L. (2026). IntelligenceArena: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment. NeurIPS 2026 Datasets & Benchmarks submission.
