
How IntelligenceArena scores frontier LLMs

Three rankings, computed hourly from 19 public sources.

TL;DR
  • 19 real-time signals — social, developer platforms, infrastructure telemetry, package distribution, research feeds — refreshed hourly.
  • Three lenses: Sentiment Pulse (community perception), Intelligence / Dollar (best value), Intelligence / M Token (raw capability).
  • Bias-corrected — hype is normalized cross-sectionally; pre-release degradation is detected and amplified; closed-source models aren't penalized for missing public signals.
Read the full methodology (NeurIPS 2026 D&B submission)
Full technical breakdown — data sources, NLP scoring, factor decomposition, bias corrections, forecasting

1. Data Collection (19 Sources)

We continuously collect signals from 19 heterogeneous sources across five categories:

  • Social media: Bluesky, Reddit, Mastodon, Lemmy — community perception and engagement
  • Developer platforms: Hacker News, Stack Overflow, GitHub Discussions, Dev.to, V2EX — technical discourse and adoption friction
  • Infrastructure: OpenRouter API (pricing), endpoint telemetry (uptime per provider), GitHub (stars/commits), HuggingFace (downloads)
  • Distribution: PyPI/npm SDK downloads, VS Code Marketplace, DockerHub container pulls
  • Research & Trends: arXiv papers, Google Trends search interest, LMSYS Chatbot Arena Elo

2. Data Quality Filtering

Every collected text is scored across four quality dimensions before entering the scoring pipeline:

  • Uniqueness: Exact-hash dedup + near-duplicate detection via Jaccard similarity over character trigram sets
  • Bot detection: Filtering templated, generic, and high-frequency automated content
  • Source credibility: Platform-weighted base scores reflecting moderation rigor
  • Specificity: Does the text discuss specific model capabilities, pricing, or benchmarks?

Texts scoring below a calibrated threshold are flagged and excluded from downstream analysis.
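
The uniqueness check can be illustrated with a minimal sketch. It assumes SHA-256 for the exact-hash step, lowercasing and whitespace normalization before trigram extraction, and a hypothetical similarity cutoff of 0.8; none of these specifics are the production values, and the function names are illustrative.

```python
import hashlib


def char_trigrams(text: str) -> set[str]:
    """Return the set of character trigrams of a lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + 3] for i in range(len(normalized) - 2)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A & B| / |A | B|.

    Returns 0.0 if either set is empty, so texts too short to have
    trigrams are never treated as near-duplicates of each other.
    """
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def deduplicate(texts: list[str], threshold: float = 0.8) -> list[str]:
    """Keep the first occurrence of each text, dropping exact and near duplicates.

    `threshold` is a hypothetical cutoff, not the calibrated production value.
    """
    seen_hashes: set[str] = set()
    kept: list[tuple[str, set[str]]] = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate
        seen_hashes.add(digest)
        grams = char_trigrams(text)
        if any(jaccard(grams, other) >= threshold for _, other in kept):
            continue  # near duplicate of a text already kept
        kept.append((text, grams))
    return [text for text, _ in kept]
```

This pairwise comparison is quadratic in the number of kept texts, which is fine for a sketch; at hourly collection volumes a sketching scheme such as MinHash would be the usual substitute.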

3. NLP Sentiment & Emotion Scoring

Each text produces 30+ features through multi-layer analysis.

Composite sentiment blends four methods via calibrated weights:

Emotion detection scores seven dimensions using curated lexicons. For each emotion e:

Sarcasm detection uses pattern matching against K curated regex templates:

Texts with high sarcasm probability have their sentiment sign inverted before aggregation.
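
A minimal sketch of template-based sarcasm detection followed by sign inversion. The four regex templates, the use of the matched fraction as a probability proxy, and the 0.25 threshold are all hypothetical stand-ins chosen for illustration; the curated production templates and calibration are not shown here.

```python
import re

# Hypothetical templates for illustration only; not the curated production set.
SARCASM_PATTERNS = [
    re.compile(r"\byeah,? right\b", re.IGNORECASE),
    re.compile(r"\boh great\b", re.IGNORECASE),
    re.compile(r"\bthanks for nothing\b", re.IGNORECASE),
    re.compile(r"\bwhat could possibly go wrong\b", re.IGNORECASE),
]


def sarcasm_probability(text: str) -> float:
    """Fraction of the K templates that match; a crude proxy (assumption)."""
    hits = sum(1 for pattern in SARCASM_PATTERNS if pattern.search(text))
    return hits / len(SARCASM_PATTERNS)


def adjust_sentiment(text: str, sentiment: float, threshold: float = 0.25) -> float:
    """Invert the sign of `sentiment` when the sarcasm proxy reaches `threshold`."""
    if sarcasm_probability(text) >= threshold:
        return -sentiment
    return sentiment
```

For example, `adjust_sentiment("Oh great, another rate limit.", 0.6)` flips a naively positive score to negative, while a text matching no template passes through unchanged.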

Aspect analysis scores five LLM-specific dimensions (performance, reliability, cost, innovation, adoption):

Engagement weighting amplifies community-validated content:

4. Three Leaderboard Rankings

IntelligenceArena produces three distinct rankings, each measuring a different dimension:

Intelligence / M Token — Raw Intelligence

Measures how much raw intelligence each million output tokens carries, in dollar terms. Computed as:

where V(m) is the calibrated intelligence value ($0.20–$10) and P_out is the output price per 1M tokens. The most capable models rank highest regardless of cost: GPT-5.4 Pro produces the most intelligent tokens, even though it costs $180 per 1M output tokens.

How V(m) is computed. V(m) is a model-level capability score in [0.20, 10.0], blended from public benchmark performance (SWE-bench, GAIA, HumanEval+, TAU-bench, Arena Elo where available), recent Sentiment Pulse, and our hype/nerf corrections. It is re-fit weekly so newly released or newly degraded models move within days, not quarters.

Intelligence / Dollar — Best Value

Measures how much intelligence you get per dollar spent. Computed as:

This is the intelligence value itself ($0.20–$10). Models that deliver strong capability at low prices rank highest — DeepSeek R1-0528 at $2.15/1M output delivers exceptional intelligence per dollar. The most expensive flagships rank lowest here.

Sentiment Pulse — Community Perception

A six-factor composite score capturing real-time community perception from 19 data sources:

The six factors are:

This ranking is independent of price — it reflects what developers and users are actually saying about each model across social media, forums, and developer platforms.

The validation logic is documented in our NeurIPS 2026 D&B submission; production weights are re-tuned as the underlying data improves and are not frozen with the paper.

5. Bias Corrections

Hype correction. We compute a raw hype indicator and cross-sectionally normalize it:

Models below median hype receive no discount. Negative sentiment (real complaints) is never discounted.
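
A minimal sketch of the two rules just stated: no discount at or below the cross-sectional median, and no discount on negative sentiment. The scaling of excess hype between the median and the maximum, the 30% cap, and the function names are illustrative assumptions, not the production formula.

```python
from statistics import median


def hype_discounts(raw_hype: dict[str, float], max_discount: float = 0.3) -> dict[str, float]:
    """Map each model to a multiplier in [1 - max_discount, 1].

    Hype is normalized across all models at one point in time:
    models at or below the median get 1.0 (no discount), and the
    most-hyped model gets the full `max_discount` (hypothetical cap).
    """
    values = list(raw_hype.values())
    med = median(values)
    spread = max(values) - med
    discounts: dict[str, float] = {}
    for model, hype in raw_hype.items():
        excess = max(0.0, hype - med) / spread if spread > 0 else 0.0
        discounts[model] = 1.0 - max_discount * excess
    return discounts


def apply_discount(sentiment: float, multiplier: float) -> float:
    """Discount positive sentiment only; negative sentiment passes through."""
    return sentiment * multiplier if sentiment > 0 else sentiment
```

With raw hype values of 1.0, 2.0, and 5.0 for three models, only the third model sits above the median of 2.0, so only its positive sentiment is scaled down.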

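Open-source fairness. Provider uptime is taken over the top-k providers rather than averaged across all of them, adoption uses max(HF, SDK) so that a missing signal on one channel is not penalized, and models served from more endpoints receive a diversity bonus.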
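
A minimal sketch of the first two fairness adjustments, assuming uptimes are fractions in [0, 1], the two adoption inputs are already normalized to a common scale, and k = 3; these choices and the function names are assumptions for illustration. The endpoint diversity bonus is omitted because its functional form is not specified here.

```python
def top_k_uptime(uptimes: list[float], k: int = 3) -> float:
    """Mean uptime of the k most reliable providers serving a model.

    Using the best k rather than all providers keeps one flaky
    third-party host from dragging down an open-weights model.
    """
    if not uptimes or k < 1:
        raise ValueError("need at least one uptime value and k >= 1")
    best = sorted(uptimes, reverse=True)[:k]
    return sum(best) / len(best)


def adoption_signal(hf_score: float, sdk_score: float) -> float:
    """Take the stronger of the two normalized adoption channels."""
    return max(hf_score, sdk_score)
```

For a model hosted by five providers with uptimes 0.999, 0.99, 0.98, 0.5, and 0.2, the top-3 mean is about 0.990, whereas the plain average would be about 0.734.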

6. Pre-Release Degradation Detection

We detect the compound signal of an imminent model release combined with rising user complaints:

When both release and degradation signals are present, a multiplicative amplification is applied:
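
As an illustration only, the gating behavior can be sketched as follows. The linear form `1 + gain * release_signal`, the default gain of 0.5, and the function name are hypothetical; the sketch shows only that amplification is multiplicative and applies when both signals are present.

```python
def amplified_degradation(
    release_signal: float,
    degradation_signal: float,
    gain: float = 0.5,
) -> float:
    """Amplify degradation multiplicatively when a release signal co-occurs.

    Both inputs are assumed to be non-negative scores; `gain` is a
    hypothetical amplification strength, not the production value.
    """
    if release_signal > 0 and degradation_signal > 0:
        return degradation_signal * (1.0 + gain * release_signal)
    return degradation_signal
```

Complaints without a release signal, or a release signal without complaints, leave the degradation score unchanged.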

7. Ensemble Forecasting

We produce 72-hour predictions using four complementary methods:

Prediction intervals expand with the square root of the forecast horizon:
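
A minimal sketch of square-root-of-horizon scaling, assuming a symmetric interval around the point forecast, a one-hour forecast standard deviation `sigma_1h`, and a normal-quantile multiplier of 1.96 for roughly 95% coverage; all three are assumptions for illustration, not the production settings.

```python
import math


def prediction_interval(
    point: float,
    sigma_1h: float,
    horizon_hours: float,
    z: float = 1.96,
) -> tuple[float, float]:
    """Return (lower, upper) with half-width z * sigma_1h * sqrt(horizon_hours)."""
    if horizon_hours < 0:
        raise ValueError("horizon_hours must be non-negative")
    half_width = z * sigma_1h * math.sqrt(horizon_hours)
    return point - half_width, point + half_width
```

Because the width grows with the square root of the horizon, quadrupling the horizon (for example from 18 to 72 hours) doubles the interval width rather than quadrupling it.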

Reference

Gao, Y., Wang, M., Yu, Y.L. (2026). IntelligenceArena: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment. NeurIPS 2026 Datasets & Benchmarks submission.
