How IntelligenceArena scores frontier LLMs
Three rankings, computed hourly from 19 public sources.
- 19 real-time signals — social, developer platforms, infrastructure telemetry, package distribution, research feeds — refreshed hourly.
- Three lenses — Sentiment Pulse (community perception), Intelligence / Dollar (best value), Intelligence / M Token (raw capability).
- Bias-corrected — hype is normalized cross-sectionally; pre-release degradation is detected and amplified; closed-source models aren't penalized for missing public signals.
Full technical breakdown — data sources, NLP scoring, factor decomposition, bias corrections, forecasting
1. Data Collection (19 Sources)
We continuously collect signals from 19 heterogeneous sources across five categories:
- Social media: Bluesky, Reddit, Mastodon, Lemmy — community perception and engagement
- Developer platforms: Hacker News, Stack Overflow, GitHub Discussions, Dev.to, V2EX — technical discourse and adoption friction
- Infrastructure: OpenRouter API (pricing), endpoint telemetry (uptime per provider), GitHub (stars/commits), HuggingFace (downloads)
- Distribution: PyPI/npm SDK downloads, VS Code Marketplace, DockerHub container pulls
- Research & Trends: arXiv papers, Google Trends search interest, LMSYS Chatbot Arena Elo
2. Data Quality Filtering
Every collected text is scored across four quality dimensions before entering the scoring pipeline:
- Uniqueness: Exact-hash dedup + near-duplicate detection via Jaccard similarity over character trigram sets
- Bot detection: Filtering templated, generic, and high-frequency automated content
- Source credibility: Platform-weighted base scores reflecting moderation rigor
- Specificity: Does the text discuss specific model capabilities, pricing, or benchmarks?
Texts scoring below a calibrated threshold are flagged and excluded from downstream analysis.
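The uniqueness check above can be sketched as follows. This is a minimal illustration, not the production filter: the whitespace and case normalization, the SHA-256 hash, and the 0.8 similarity threshold are all assumptions, and the function names are hypothetical.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization step)."""
    return " ".join(text.lower().split())


def trigrams(text: str) -> set[str]:
    """Set of character trigrams of the normalized text."""
    norm = normalize(text)
    return {norm[i:i + 3] for i in range(len(norm) - 2)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A & B| / |A | B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(texts: list[str], threshold: float = 0.8) -> list[str]:
    """Keep the first occurrence of each text; drop exact and near duplicates."""
    seen_hashes: set[str] = set()
    kept: list[tuple[str, set[str]]] = []
    for text in texts:
        # Stage 1: exact match on the hash of the normalized text.
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        # Stage 2: near-duplicate check against every text kept so far.
        grams = trigrams(text)
        if any(jaccard(grams, other) >= threshold for _, other in kept):
            continue
        kept.append((text, grams))
    return [text for text, _ in kept]
```

The pairwise comparison is quadratic in the number of kept texts, which is fine for a sketch; at scale a locality-sensitive scheme such as MinHash would typically replace it.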
3. NLP Sentiment & Emotion Scoring
Each text produces 30+ features through multi-layer analysis.
Composite sentiment blends four methods via calibrated weights:
Emotion detection scores seven dimensions using curated lexicons. For each emotion e:
Sarcasm detection uses pattern matching against K curated regex templates:
Texts with high sarcasm probability have their sentiment sign inverted before aggregation.
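As a rough sketch of this step, the snippet below matches a handful of regex templates and flips the sentiment sign when enough of them fire. The four patterns, the "fraction of templates matched" estimator, and the 0.25 threshold are hypothetical stand-ins; the curated template set and its calibration are not listed on this page.

```python
import re

# Hypothetical templates for illustration only, not the curated production set.
SARCASM_PATTERNS = [
    re.compile(r"\byeah,? right\b", re.IGNORECASE),
    re.compile(r"\boh great\b", re.IGNORECASE),
    re.compile(r"\bwhat could possibly go wrong\b", re.IGNORECASE),
    re.compile(r"/s\s*$"),  # explicit "/s" sarcasm tag at the end of a post
]


def sarcasm_probability(text: str) -> float:
    """Fraction of templates that match (an assumed, deliberately simple estimator)."""
    hits = sum(1 for pattern in SARCASM_PATTERNS if pattern.search(text))
    return hits / len(SARCASM_PATTERNS)


def adjusted_sentiment(text: str, sentiment: float, threshold: float = 0.25) -> float:
    """Invert the sentiment sign when the sarcasm probability reaches the threshold."""
    if sarcasm_probability(text) >= threshold:
        return -sentiment
    return sentiment
```

For example, `adjusted_sentiment("Oh great, another outage /s", 0.6)` matches two of the four templates and returns `-0.6`, while a plain statement keeps its original score.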
Aspect analysis scores five LLM-specific dimensions (performance, reliability, cost, innovation, adoption):
Engagement weighting amplifies community-validated content:
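One plausible shape for such a weight, shown purely as an assumption, is a log-dampened count of upvotes and replies, so that heavily endorsed posts count more without a single viral post dominating the aggregate. The function names and the `1 + log1p(...)` form are illustrative, not the production weighting.

```python
import math


def engagement_weight(upvotes: int, replies: int) -> float:
    """Assumed weight: 1 for a post with no engagement, growing logarithmically."""
    return 1.0 + math.log1p(max(upvotes, 0) + max(replies, 0))


def weighted_sentiment(items: list[tuple[float, int, int]]) -> float:
    """Engagement-weighted mean of (sentiment, upvotes, replies) tuples."""
    if not items:
        return 0.0
    total_weight = sum(engagement_weight(u, r) for _, u, r in items)
    weighted_sum = sum(s * engagement_weight(u, r) for s, u, r in items)
    return weighted_sum / total_weight
```

With one positive post at 100 upvotes and one negative post with none, the weighted mean is positive, whereas an unweighted mean would be exactly zero.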
4. Three Leaderboard Rankings
IntelligenceArena produces three distinct rankings, each measuring a different dimension:
Intelligence / M Token — Raw Intelligence
Measures how much raw intelligence each million output tokens carries, in dollar terms. Computed as:
where V(m) is the calibrated intelligence value ($0.20–$10) and P_out is the output price per 1M tokens. The most capable models rank highest regardless of cost: GPT-5.4 Pro produces the most intelligent tokens, even though its output is priced at $180 per 1M tokens.
How V(m) is computed. V(m) is a model-level capability score in [0.20, 10.0], blended from public benchmark performance (SWE-bench, GAIA, HumanEval+, TAU-bench, Arena Elo where available), recent Sentiment Pulse, and our hype/nerf corrections. It is re-fit weekly so newly released or newly degraded models move within days, not quarters.
Intelligence / Dollar — Best Value
Measures how much intelligence you get per dollar spent. Computed as:
This is the intelligence value itself ($0.20–$10). Models that deliver strong capability at low prices rank highest — DeepSeek R1-0528 at $2.15/1M output delivers exceptional intelligence per dollar. The most expensive flagships rank lowest here.
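To make the contrast between the two price-related lenses concrete, here is a small sketch. It assumes the raw-capability lens sorts by V(m) alone and the best-value lens sorts by V(m) divided by the output price; that reading follows the descriptions above but is an assumption about the exact formulas. The V(m) values in the usage example are placeholders, and only the two prices come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    value: float      # V(m), calibrated intelligence value
    price_out: float  # USD per 1M output tokens


def rank_raw(models: list[Model]) -> list[Model]:
    """Raw-capability lens: highest V(m) first, price ignored (assumed form)."""
    return sorted(models, key=lambda m: m.value, reverse=True)


def rank_value(models: list[Model]) -> list[Model]:
    """Best-value lens: highest V(m) per output dollar first (assumed form)."""
    for m in models:
        if m.price_out <= 0:
            raise ValueError(f"{m.name}: output price must be positive")
    return sorted(models, key=lambda m: m.value / m.price_out, reverse=True)


# Placeholder V(m) values; the prices are the ones quoted in the text.
models = [
    Model("GPT-5.4 Pro", value=10.0, price_out=180.0),
    Model("DeepSeek R1-0528", value=6.0, price_out=2.15),
]
```

Under these placeholder values, `rank_raw(models)` puts GPT-5.4 Pro first and `rank_value(models)` puts DeepSeek R1-0528 first, matching the qualitative behavior described for the two rankings.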
Sentiment Pulse — Community Perception
A six-factor composite score capturing real-time community perception from 19 data sources:
The six factors are:
This ranking is independent of price — it reflects what developers and users are actually saying about each model across social media, forums, and developer platforms.
The validation methodology is described in our NeurIPS 2026 Datasets & Benchmarks submission; production weights are re-tuned as the underlying data improves and are not frozen with the paper.
5. Bias Corrections
Hype correction. We compute a raw hype indicator and cross-sectionally normalize it:
Models below median hype receive no discount. Negative sentiment (real complaints) is never discounted.
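The two rules above (no discount at or below the median, and no discount on negative sentiment) can be sketched as follows. The robust z-score built from the median absolute deviation, the cap at three deviations, and the 30% maximum discount are assumptions chosen for illustration; only the two rules themselves come from the text.

```python
import statistics


def hype_discounts(raw_hype: dict[str, float], max_discount: float = 0.3) -> dict[str, float]:
    """Map each model to a discount in [0, max_discount] from cross-sectional hype."""
    values = list(raw_hype.values())
    median = statistics.median(values)
    # Median absolute deviation as a robust cross-sectional scale (assumed choice).
    mad = statistics.median(abs(v - median) for v in values)
    discounts: dict[str, float] = {}
    for name, hype in raw_hype.items():
        if hype <= median or mad == 0:
            discounts[name] = 0.0  # at or below median hype: no discount
        else:
            z = (hype - median) / mad
            discounts[name] = max_discount * min(z / 3.0, 1.0)
    return discounts


def corrected_sentiment(sentiment: float, discount: float) -> float:
    """Shrink positive sentiment only; negative sentiment passes through unchanged."""
    if sentiment <= 0:
        return sentiment
    return sentiment * (1.0 - discount)
```

Because the discount applies only to positive sentiment, a heavily hyped model with real complaints keeps the full weight of those complaints.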
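Open-source fairness. Uptime is measured over a model's top-k providers rather than averaged across all of them; adoption uses max(HF, SDK), the larger of the HuggingFace download signal and the SDK download signal; and models served from multiple endpoints receive an endpoint diversity bonus.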
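A short sketch of the first two adjustments, under assumed conventions: uptimes are fractions in [0, 1], adoption signals are already normalized to a common scale, a missing signal is represented as `None`, and k defaults to 3. The endpoint diversity bonus is left out because its form is not described here.

```python
from typing import Optional


def top_k_uptime(uptimes: list[float], k: int = 3) -> float:
    """Mean uptime of the k best providers, so a few flaky hosts do not drag a model down."""
    if not uptimes:
        return 0.0
    best = sorted(uptimes, reverse=True)[:k]
    return sum(best) / len(best)


def adoption_signal(hf_score: Optional[float], sdk_score: Optional[float]) -> float:
    """Larger of the HuggingFace and SDK adoption signals; a missing signal is ignored."""
    available = [s for s in (hf_score, sdk_score) if s is not None]
    return max(available, default=0.0)
```

Taking the maximum means a closed model with no HuggingFace downloads is scored on its SDK adoption alone, and an open model hosted by many providers is judged by its best-run endpoints.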
6. Pre-Release Degradation Detection
We detect the compound signal of an imminent model release combined with rising user complaints:
When both release and degradation signals are present, a multiplicative amplification is applied:
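As an illustration only, the gating logic might look like the sketch below. The 0.5 presence threshold, the `gamma` amplification strength, and the `1 + gamma * release` factor are assumptions, not the production formula; both signals are taken to lie in [0, 1].

```python
def degradation_penalty(
    release_signal: float,
    complaint_signal: float,
    threshold: float = 0.5,
    gamma: float = 0.5,
) -> float:
    """Complaint-driven penalty, amplified only when a release also looks imminent."""
    both_present = release_signal >= threshold and complaint_signal >= threshold
    if both_present:
        # Multiplicative amplification: stronger release evidence scales the penalty up.
        return complaint_signal * (1.0 + gamma * release_signal)
    return complaint_signal
```

With these defaults, `degradation_penalty(0.8, 0.6)` amplifies the complaint signal by a factor of 1.4, while the same complaints with a weak release signal of 0.1 are left at 0.6.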
7. Ensemble Forecasting
72-hour predictions via four complementary methods:
Prediction intervals expand with the square root of the forecast horizon:
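As a worked illustration of square-root scaling, the sketch below assumes roughly normal forecast errors, a hypothetical one-hour standard error `sigma_1h`, and a normal quantile `z` (1.96 for a nominal 95% interval). None of these values is specified on this page.

```python
import math


def prediction_interval(
    point: float,
    sigma_1h: float,
    horizon_hours: float,
    z: float = 1.96,
) -> tuple[float, float]:
    """Interval whose half-width is z * sigma_1h * sqrt(horizon_hours)."""
    if horizon_hours < 0:
        raise ValueError("horizon_hours must be non-negative")
    half_width = z * sigma_1h * math.sqrt(horizon_hours)
    return point - half_width, point + half_width
```

Under this scaling, quadrupling the horizon doubles the interval width, so a 72-hour interval is twice as wide as an 18-hour one.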
Reference
Gao, Y., Wang, M., Yu, Y.L. (2026). IntelligenceArena: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment. NeurIPS 2026 Datasets & Benchmarks submission.