IntelligenceArena

Sentiment Pulse — Community Perception

# | Model | Provider | Score | Score (rel.) | Price | Sentiment | Infra

Sentiment Pulse — Top 8 Models

Higher = better. Dashed segment is the 72h forecast.

Arena — Model Intelligence Breakdown

Sorted by Intel/Token score (same order as the Leaderboard's default view). Click any card to see sub-scores and community discussion.

⚡ Top Intel/Token

💬 Top Sentiment

Select a model to chat or get its API endpoint.
Chat with the Smartest Models per Dollar
Select a top-ranked model from the sidebar. OpenMesh routes your messages to the fastest available provider, scored in real-time by our infrastructure engine.

Benchmarks say the model is great. Your gut says it got dumber. IntelligenceArena tracks the gap.

Real-time quality drift across frontier LLMs and AI agents — so you can see when a model starts slipping the moment it happens. How we measure it →

Agent Leaderboard

Ranked by composite score within the selected category. Switch categories using the dropdown.

Leaderboard: Best across all categories

Development
  • Coding Agents: Code generation, debugging, refactoring
  • Coding Copilots: IDE-integrated completion & assistance
  • Autonomous SWE: End-to-end issue resolution
Research & Analysis
  • Research Agents: Deep search, synthesis, citations
  • Data Analysis: Data exploration, visualization, insights
Interaction
  • Browser Agents: Web navigation, form filling, scraping
  • General-Purpose: Broad capability, multi-domain tasks
  • Consumer Productivity: Shopping, scheduling, personal tasks
Infrastructure
  • Tool-Use Agents: API calls, function chaining, MCP
  • Multi-Agent Systems: Orchestration, delegation, coordination
  • Enterprise Agents: Compliance, security, scale
Specialized
  • Memory & Long-Horizon: Persistent context, multi-session tasks

Agent Arena — Detailed Breakdown

Sorted by best-overall score (same order as the Leaderboard's meta view). Click any card to see sub-scores and community discussion.

How IntelligenceArena scores AI agents

A 4-factor composite across 12 workload categories, computed hourly from 18 real-time signals.

TL;DR
  • 4 weighted factors — Benchmarks (30%), Community Sentiment (20%), Mention Velocity 24h (25%), Adoption (25%, adaptive). Sub-weights renormalize when signals are missing, so closed-source agents aren't penalized for lacking public-repo telemetry.
  • 12 workload categories — coding, research, browser, multi-agent, enterprise, etc. — each agent ranked within its specialization, then aggregated for "Best Overall."
  • Falsifiable validation. The Benchmark factor alone predicts public IDE-marketplace shipping decisions, replicated on three independent platforms (VS Code, Open VSX, JetBrains). All three pre-registered falsifiers are in the paper.
Read the full validation evidence (NeurIPS 2026 D&B submission) →
Full technical breakdown — signals, factor formulas, validation analyses, falsifiers

1. Signal Collection (18 Signals)

We collect 18 real signals per agent across four categories, refreshed hourly:

  • Benchmarks (5): SWE-bench resolve rate, GAIA accuracy, WebArena success rate, HumanEval+ pass rate, TAU-bench score — sourced from public leaderboards
  • Adoption (6): GitHub stars, GitHub stars velocity (Δ/week), PyPI/npm weekly downloads, VS Code Marketplace installs, VS Code rating, Docker pulls
  • Community (4): Social media sentiment (from NLP pipeline across Bluesky, Reddit, HN, Mastodon), Stack Overflow question count, GitHub issue count + close rate, GitHub contributor count
  • Ecosystem (3): Days since last release, documentation quality proxy, enterprise readiness signals

2. Composite Scoring Model

Each agent receives a composite score from four weighted factors. The weights always sum to 1.0 — when an underlying signal is missing (e.g., a closed-source agent with no public GitHub repo), the corresponding sub-factor is renormalized rather than zero-filled, so closed-source agents are not structurally penalized for the absence of public-repo telemetry.

Factor 1: Benchmark Performance (30%)

Average of all available public benchmark scores (SWE-bench, GAIA, WebArena, HumanEval+, TAU-bench), normalized to [0, 1]. Agents without benchmark data receive a neutral prior of 0.5.

Factor 2: Community Sentiment (20%)

Drawn from our multi-layer NLP pipeline (VADER + TextBlob + aspect analysis) over the last 7 days of social posts. Raw sentiment in [-0.2, 0.4] is rescaled to [0, 1].
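A minimal sketch of that rescaling, assuming a plain linear map and assuming that values outside the stated range are clipped to its endpoints (the clipping rule and the function name are illustrative, not taken from the pipeline):

```python
def rescale_sentiment(raw: float, lo: float = -0.2, hi: float = 0.4) -> float:
    """Linearly map raw sentiment from [lo, hi] onto [0, 1].

    Values outside [lo, hi] are clipped first (an assumption for this sketch).
    """
    clipped = min(max(raw, lo), hi)
    return (clipped - lo) / (hi - lo)


if __name__ == "__main__":
    for raw in (-0.5, -0.2, 0.1, 0.4, 0.9):
        print(f"raw={raw:+.2f} -> factor={rescale_sentiment(raw):.3f}")
```

Under this mapping a raw sentiment of 0.1 sits at the midpoint of the range and lands at about 0.5.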

Factor 3: Mention Velocity, 24h (25%)

Log-scaled count of new posts mentioning the agent across Bluesky, Reddit, Hacker News, and Mastodon over the last 24 hours. This is the universal time-varying signal: every agent — open- or closed-source — generates measurable discussion volume, so the composite reflects real shifts in attention rather than static catalog data.

Factor 4: Adoption (25%, adaptive)

Composed from whichever of the following are present for the agent: GitHub repo health (stars, contributors, release freshness, issue close-rate), package downloads (PyPI/npm weekly), and VS Code Marketplace installs + ratings. Sub-weights are renormalized over only the components that exist. If an agent has none of these signals, the slot falls back to mention velocity, so closed agents with no marketplace footprint can still rank on the strength of attention alone.

Composite Score
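The four factors and their stated weights can be combined as in the sketch below. The 0.5 benchmark prior, the renormalization over available adoption components, and the fallback to mention velocity follow the factor descriptions above. The adoption sub-weights (equal here), the component names, and the function names are assumptions made for illustration, and production weights may differ.

```python
from typing import Mapping, Optional

# Factor weights as stated in the text (they sum to 1.0).
WEIGHTS = {"benchmark": 0.30, "sentiment": 0.20, "velocity": 0.25, "adoption": 0.25}

# Hypothetical equal sub-weights; the real sub-weights are not given in the text.
ADOPTION_SUBWEIGHTS = {"github_health": 1.0, "package_downloads": 1.0, "marketplace": 1.0}


def adoption_factor(components: Mapping[str, Optional[float]], velocity: float) -> float:
    """Weighted mean over the adoption components that exist, each in [0, 1].

    Missing components (None) are dropped and the remaining sub-weights are
    renormalized. With no components at all, fall back to mention velocity.
    """
    present = {
        name: value
        for name, value in components.items()
        if value is not None and name in ADOPTION_SUBWEIGHTS
    }
    if not present:
        return velocity
    total = sum(ADOPTION_SUBWEIGHTS[name] for name in present)
    return sum(ADOPTION_SUBWEIGHTS[name] * value for name, value in present.items()) / total


def composite_score(
    benchmark: Optional[float],
    sentiment: float,
    velocity: float,
    adoption_components: Mapping[str, Optional[float]],
) -> float:
    """Combine the four factors (all normalized to [0, 1]) into one score."""
    b = 0.5 if benchmark is None else benchmark  # neutral prior from Factor 1
    a = adoption_factor(adoption_components, velocity)
    return (
        WEIGHTS["benchmark"] * b
        + WEIGHTS["sentiment"] * sentiment
        + WEIGHTS["velocity"] * velocity
        + WEIGHTS["adoption"] * a
    )


if __name__ == "__main__":
    # Closed-source agent: no benchmark data, no adoption telemetry.
    print(composite_score(None, 0.6, 0.8, {}))
    # Open-source agent with partial adoption telemetry.
    print(composite_score(0.7, 0.6, 0.4, {"github_health": 0.9, "package_downloads": None, "marketplace": 0.5}))
```

In the first call the adoption slot inherits the velocity value, so a closed agent is scored on benchmarks (at the prior), sentiment, and attention only.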

3. Category-Based Evaluation

Agents are evaluated within 12 workload categories for actionable, domain-specific rankings:

  • Development: Coding Agents, Coding Copilots, Autonomous SWE Agents
  • Research & Analysis: Research Agents, Data Analysis
  • Interaction: Browser Agents, General-Purpose, Consumer Productivity
  • Infrastructure: Tool-Use, Multi-Agent Systems, Enterprise
  • Specialized: Memory & Long-Horizon

The Leaderboard aggregates each agent's best category score to surface overall winners.

4. Forecasting

Agent scores are forecast 72 hours ahead using mean-reversion toward the cross-sectional average, with confidence intervals derived from historical volatility:

Where λ = 0.003 is the mean-reversion speed, 𝒮 is the cross-sectional mean, and σ̂ is estimated from recent score differences (minimum 0.008).
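One way to read these definitions is as a discrete hourly mean-reversion step. The sketch below assumes that form, together with an interval whose half-width grows with the square root of the horizon (borrowed from the LLM methodology further down) and a normal z-value of 1.96. The exact production formula is not reproduced here, so the functional form, the use of a population standard deviation, and the function name are all assumptions.

```python
import math
from statistics import pstdev
from typing import Sequence, Tuple

LAMBDA = 0.003       # mean-reversion speed per hour, from the text
SIGMA_FLOOR = 0.008  # minimum volatility, from the text


def forecast(
    history: Sequence[float],
    cross_mean: float,
    horizon_h: int = 72,
    z: float = 1.96,
) -> Tuple[float, float, float]:
    """Return (point, lower, upper) for a score `horizon_h` hours ahead.

    history    -- hourly scores for one agent, oldest first (at least one value)
    cross_mean -- cross-sectional mean score across agents
    """
    current = history[-1]
    # Volatility from recent hour-to-hour differences, floored at 0.008.
    diffs = [later - earlier for earlier, later in zip(history, history[1:])]
    sigma = max(pstdev(diffs) if diffs else 0.0, SIGMA_FLOOR)
    # Assumed form: the gap to the mean decays by a factor (1 - lambda) each hour.
    point = cross_mean + (current - cross_mean) * (1.0 - LAMBDA) ** horizon_h
    half_width = z * sigma * math.sqrt(horizon_h)
    return point, point - half_width, point + half_width


if __name__ == "__main__":
    point, lower, upper = forecast([0.68, 0.70, 0.69, 0.70], cross_mean=0.50)
    print(f"72h forecast: {point:.3f} (interval {lower:.3f} to {upper:.3f})")
```

With λ = 0.003 the pull toward the mean is gentle: over 72 hours roughly a fifth of the gap between an agent's score and the cross-sectional mean closes under this form.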

5. Time-Series Construction

Each hourly scoring run reads the latest collected signals and produces one data point per agent per category. The time-series builds naturally from repeated runs — no simulation is used. This ensures all displayed trends reflect real changes in benchmark performance, community sentiment, and adoption metrics.

6. Empirical Validation (NeurIPS 2026)

What the paper is, and isn’t: our NeurIPS 2026 D&B submission is an in-depth empirical study of this framework as of April 2026 — it documents the validation evidence (factor independence, IDE-marketplace shipping decision, discriminative validity, pre-registered falsifiers) that justified the framework’s structure. It is not a fixed methodology specification: validation logic is what the paper covers; production weights are re-tuned as the underlying data improves. The formula above may therefore differ in detail from the version frozen in the paper.

The methodology was validated in our NeurIPS 2026 Datasets & Benchmarks Track submission across three analyses plus a diagnostic, on a 50-agent registry curated as of April 2026.

Factor independence (n = 50)

Pairwise Spearman correlations across the four factors confirm largely complementary signal: max pairwise ρ = 0.75 (Adoption-Ecosystem, expected on substantive grounds), all other pairwise |ρ| ≤ 0.34. The four factors are not redundant.

Headline result: IDE-marketplace shipping decision (n = 34 public-repo agents)

The Benchmark factor alone — with no Adoption-, Sentiment-, or Ecosystem-derived signals — predicts whether an agent ships a public IDE-marketplace extension, replicated across three independent platforms:

  • VS Code Marketplace: Mann-Whitney U = 42, p = 0.011, rank-biserial r_rb = +0.60 (pre-specified)
  • Open VSX: U = 12, p = 0.0011, r_rb = +0.86 (post-hoc replication)
  • JetBrains Marketplace: U = 27, p = 0.004, r_rb = +0.71 (post-hoc replication)
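For readers unfamiliar with the effect size reported above, the sketch below shows how a rank-biserial correlation relates to the Mann-Whitney U statistic, using the standard identity r = 2U / (n1 · n2) − 1, where U counts the pairs in which the first group outranks the second. The benchmark values are made-up toy data, not registry data, and no p-value is computed here.

```python
from typing import Sequence


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> float:
    """U statistic for sample x: pairs (x_i, y_j) with x_i > y_j, ties count 0.5."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u


def rank_biserial(x: Sequence[float], y: Sequence[float]) -> float:
    """Rank-biserial correlation in [-1, 1]; positive when x tends to exceed y."""
    return 2.0 * mann_whitney_u(x, y) / (len(x) * len(y)) - 1.0


if __name__ == "__main__":
    # Toy benchmark-factor values (hypothetical).
    ships_extension = [0.8, 0.7, 0.9]
    no_extension = [0.5, 0.6, 0.75, 0.4]
    print(mann_whitney_u(ships_extension, no_extension))
    print(rank_biserial(ships_extension, no_extension))
```

A value near +1 means almost every agent that ships an extension has a higher benchmark factor than almost every agent that does not; 0 means no separation.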

All three remain significant after Holm correction within the IDE bucket. Confirmatory Spearman secondary statistics: B-only ρ_s = 0.48-0.60, B+S ρ_s = 0.42-0.55.
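The Holm claim can be checked directly from the three reported p-values. The sketch below implements the standard Holm step-down adjustment; treating the three IDE platforms as a family of exactly three tests is an assumption about how "within the IDE bucket" is defined.

```python
from typing import List, Sequence


def holm_adjust(p_values: Sequence[float]) -> List[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease along the order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


if __name__ == "__main__":
    reported = {
        "VS Code Marketplace": 0.011,
        "Open VSX": 0.0011,
        "JetBrains Marketplace": 0.004,
    }
    for name, p_adj in zip(reported, holm_adjust(list(reported.values()))):
        print(f"{name}: adjusted p = {p_adj:.4f}, significant at 0.05: {p_adj < 0.05}")
```

With these inputs the adjusted values come to roughly 0.011, 0.0033, and 0.008, all below 0.05, which is consistent with the statement above.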

Discriminative validity (n = 30)

The same predictor is directionally negative on library-reuse metrics (GitHub Dependents: B-only ρ_s = -0.30, B+S ρ_s = -0.38, p = 0.04). The framework correctly distinguishes "agent that completes tasks" from "library that gets reused": frameworks like LangGraph, MCP, and CrewAI dominate dependents-count but rank lower on the agent composite.

Pre-registered falsifiers

We commit to three falsifiers for the published claim: (i) IDE-bucket effect failing to replicate at T+6 months (r_rb < +0.2 or p > 0.05), (ii) within-presence rank correlation null/negative once n ≥ 25, (iii) library-reuse correlation flipping positive. Replication snapshot will be posted to the artifact repository six months after publication.

Reference

Gao, Y., Wang, M., Yu, Y.L. (2026). IntelligenceArena: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment. NeurIPS 2026 Datasets & Benchmarks submission.

Read the IntelligenceArena paper (NeurIPS 2026 D&B submission)

How IntelligenceArena scores frontier LLMs

Three rankings, computed hourly from 19 public sources.

TL;DR
  • 19 real-time signals — social, developer platforms, infrastructure telemetry, package distribution, research feeds — refreshed hourly.
  • Three lenses: Sentiment Pulse (community perception), Intelligence / Dollar (best value), Intelligence / M Token (raw capability).
  • Bias-corrected — hype is normalized cross-sectionally; pre-release degradation is detected and amplified; closed-source models aren't penalized for missing public signals.
Read the full methodology (NeurIPS 2026 D&B submission) →
Full technical breakdown — data sources, NLP scoring, factor decomposition, bias corrections, forecasting

1. Data Collection (19 Sources)

We continuously collect signals from 19 heterogeneous sources across five categories:

  • Social media: Bluesky, Reddit, Mastodon, Lemmy — community perception and engagement
  • Developer platforms: Hacker News, Stack Overflow, GitHub Discussions, Dev.to, V2EX — technical discourse and adoption friction
  • Infrastructure: OpenRouter API (pricing), endpoint telemetry (uptime per provider), GitHub (stars/commits), HuggingFace (downloads)
  • Distribution: PyPI/npm SDK downloads, VS Code Marketplace, DockerHub container pulls
  • Research & Trends: arXiv papers, Google Trends search interest, LMSYS Chatbot Arena Elo

2. Data Quality Filtering

Every collected text is scored across four quality dimensions before entering the scoring pipeline:

  • Uniqueness: Exact-hash dedup + near-duplicate detection via Jaccard similarity over character trigram sets
  • Bot detection: Filtering templated, generic, and high-frequency automated content
  • Source credibility: Platform-weighted base scores reflecting moderation rigor
  • Specificity: Does the text discuss specific model capabilities, pricing, or benchmarks?

Texts scoring below a calibrated threshold are flagged and excluded from downstream analysis.
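The uniqueness check can be sketched as exact-hash deduplication followed by a Jaccard comparison over character trigram sets. The 0.8 threshold, the whitespace and case normalization, and the function names are assumptions for illustration; the calibrated production threshold is not stated.

```python
import hashlib
from typing import List, Set


def char_trigrams(text: str) -> Set[str]:
    """Set of character trigrams after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + 3] for i in range(len(normalized) - 2)}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| over the two trigram sets."""
    ta, tb = char_trigrams(a), char_trigrams(b)
    if not ta and not tb:
        return 1.0  # two texts too short to yield trigrams are treated as identical
    return len(ta & tb) / len(ta | tb)


def dedupe(texts: List[str], threshold: float = 0.8) -> List[str]:
    """Keep the first copy of each text, dropping exact and near duplicates."""
    seen_hashes: Set[str] = set()
    kept: List[str] = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate
        seen_hashes.add(digest)
        if any(jaccard(text, earlier) >= threshold for earlier in kept):
            continue  # near duplicate of something already kept
        kept.append(text)
    return kept


if __name__ == "__main__":
    posts = [
        "Model X got slower this week",
        "Model X got slower this week",
        "Model X got slower this week!",
        "Pricing dropped for Model Y",
    ]
    print(dedupe(posts))
```

The pairwise comparison is quadratic in the number of kept texts, which is fine for a sketch; a larger corpus would usually call for MinHash or a similar index.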

3. NLP Sentiment & Emotion Scoring

Each text produces 30+ features through multi-layer analysis.

Composite sentiment blends four methods via calibrated weights:

Emotion detection scores seven dimensions using curated lexicons. For each emotion e:

Sarcasm detection uses pattern matching against K curated regex templates:

Texts with high sarcasm probability have their sentiment sign inverted before aggregation.
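As a rough illustration of template-based sarcasm handling, the sketch below scores a text by the fraction of K regex templates it matches and flips the sentiment sign above a cutoff. The four templates, the fraction-of-matches scoring rule, and the 0.25 cutoff are hypothetical stand-ins; the curated templates and the actual probability formula are not reproduced here.

```python
import re
from typing import List, Pattern

# Hypothetical templates for illustration only (K = 4).
SARCASM_PATTERNS: List[Pattern[str]] = [
    re.compile(r"\boh great\b", re.IGNORECASE),
    re.compile(r"\byeah,? right\b", re.IGNORECASE),
    re.compile(r"\bthanks for nothing\b", re.IGNORECASE),
    re.compile(r"/s\s*$"),  # trailing "/s" sarcasm marker
]


def sarcasm_probability(text: str) -> float:
    """Fraction of the K templates that match the text (an assumed scoring rule)."""
    hits = sum(1 for pattern in SARCASM_PATTERNS if pattern.search(text))
    return hits / len(SARCASM_PATTERNS)


def adjust_sentiment(text: str, sentiment: float, cutoff: float = 0.25) -> float:
    """Invert the sentiment sign when the sarcasm score reaches the cutoff."""
    if sarcasm_probability(text) >= cutoff:
        return -sentiment
    return sentiment


if __name__ == "__main__":
    sarcastic = "Oh great, another update that broke everything /s"
    plain = "The new release is noticeably faster"
    print(adjust_sentiment(sarcastic, 0.6))
    print(adjust_sentiment(plain, 0.4))
```

The inversion runs before aggregation, so a lexicon-positive but sarcastic post contributes as a complaint rather than as praise.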

Aspect analysis scores five LLM-specific dimensions (performance, reliability, cost, innovation, adoption):

Engagement weighting amplifies community-validated content:

4. Three Leaderboard Rankings

IntelligenceArena produces three distinct rankings, each measuring a different dimension:

Intelligence / M Token — Raw Intelligence

Measures how much raw intelligence each million output tokens carries, in dollar terms. Computed as:

where V(m) is the calibrated intelligence value ($0.20–$10) and P_out is the output price per 1M tokens. The most capable models rank highest regardless of cost: GPT-5.4 Pro produces the most intelligent tokens, even though each costs $180/1M.

How V(m) is computed. V(m) is a model-level capability score in [0.20, 10.0], blended from public benchmark performance (SWE-bench, GAIA, HumanEval+, TAU-bench, Arena Elo where available), recent Sentiment Pulse, and our hype/nerf corrections. It is re-fit weekly so newly released or newly degraded models move within days, not quarters.

Intelligence / Dollar — Best Value

Measures how much intelligence you get per dollar spent. Computed as:

This is the intelligence value itself ($0.20–$10). Models that deliver strong capability at low prices rank highest — DeepSeek R1-0528 at $2.15/1M output delivers exceptional intelligence per dollar. The most expensive flagships rank lowest here.

Sentiment Pulse — Community Perception

A six-factor composite score capturing real-time community perception from 19 data sources:

The six factors are:

This ranking is independent of price — it reflects what developers and users are actually saying about each model across social media, forums, and developer platforms.

Validation logic is peer-review-grade (see our NeurIPS 2026 D&B submission); production weights are re-tuned as the underlying data improves and are not frozen with the paper.

5. Bias Corrections

Hype correction. We compute a raw hype indicator and cross-sectionally normalize it:

Models below median hype receive no discount. Negative sentiment (real complaints) is never discounted.
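The two rules above (no discount at or below median hype, and no discount on negative sentiment) can be sketched as follows. The normalization used here, excess hype above the median scaled by the distance from the median to the maximum, and the 30% maximum discount are assumptions for illustration; the actual normalization formula is not reproduced here. Model names are placeholders.

```python
from statistics import median
from typing import Dict, Mapping


def hype_multipliers(raw_hype: Mapping[str, float], max_discount: float = 0.3) -> Dict[str, float]:
    """Per-model multiplier in [1 - max_discount, 1] from cross-sectional hype.

    Models at or below the median hype get a multiplier of exactly 1.0.
    """
    med = median(raw_hype.values())
    spread = max(raw_hype.values()) - med
    multipliers: Dict[str, float] = {}
    for model, hype in raw_hype.items():
        excess = (hype - med) / spread if spread > 0 and hype > med else 0.0
        multipliers[model] = 1.0 - max_discount * excess
    return multipliers


def corrected_sentiment(sentiment: float, multiplier: float) -> float:
    """Discount positive sentiment only; complaints pass through unchanged."""
    return sentiment * multiplier if sentiment > 0 else sentiment


if __name__ == "__main__":
    raw = {"model-a": 1.0, "model-b": 2.0, "model-c": 5.0}  # placeholder values
    mult = hype_multipliers(raw)
    print(mult)
    print(corrected_sentiment(0.5, mult["model-c"]))
    print(corrected_sentiment(-0.4, mult["model-c"]))
```

Because the normalization is cross-sectional, a model's discount depends on how hyped it is relative to its peers at that hour, not on an absolute hype level.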

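Open-source fairness. Uptime is measured over the top-k providers rather than averaged across all of them, adoption uses the larger of HuggingFace downloads and SDK downloads, max(HF, SDK), and models served from multiple endpoints receive an endpoint diversity bonus.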

6. Pre-Release Degradation Detection

We detect the compound signal of imminent model releases + rising user complaints:

When both release and degradation signals are present, a multiplicative amplification is applied:

7. Ensemble Forecasting

72-hour predictions via four complementary methods:

Prediction intervals expand with the square root of the forecast horizon:

Reference

Gao, Y., Wang, M., Yu, Y.L. (2026). IntelligenceArena: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment. NeurIPS 2026 Datasets & Benchmarks submission.

Read the IntelligenceArena paper (NeurIPS 2026 D&B submission)
Enterprise Data Access

Real-Time AI Agent & LLM Ranking Data

Hourly-updated rankings, factor decompositions, and historical time-series for 50+ agents and frontier LLMs, delivered through APIs, bulk exports, or co-managed dashboards. For commercial procurement, due-diligence, and product-strategy teams.

Real-Time API
Hourly REST + WebSocket

Programmatic access to per-agent composite scores, factor breakdowns, and 1h / 24h / 7d deltas. SLA-backed. Refreshed every hour.

Historical Bulk Export
Time-series & raw signals

Full history of hourly scores, the underlying 18 signals per agent, and category-specific breakdowns. CSV / Parquet, with Croissant 1.0 metadata.

Custom Coverage
Private agents & categories

Add internal or unlisted agents to the registry, request bespoke benchmarks, or define custom workload categories aligned with your evaluation needs.

Contact

Tell us about your use case and we'll respond within one business day.

By submitting, you agree to be contacted about IntelligenceArena data products. Your information is not shared with third parties. Academic researchers: please cite our NeurIPS 2026 D&B submission and email michael@openmesh.ai for non-commercial access.

Methodology & provenance: All rankings are computed from public signals across GitHub, package registries, IDE marketplaces, social platforms, and benchmark leaderboards. Methodology is documented in our NeurIPS 2026 D&B submission. Validation logic, falsifiers, and headline result are peer-review-grade; production weights are re-tuned as the underlying data improves. Implementation, registry curation, and operational pipeline are proprietary.