Artificial Analysis Intelligence Index Rankings – March 2026 (Full Breakdown)
Pulkit Porwal
Mar 31, 2026 · 8 min read

If you have been trying to figure out which AI model is actually the smartest right now, you are not alone. I have spent a lot of time digging into leaderboards, and the Artificial Analysis Intelligence Index is the most thorough one I have found. It does not pick one test and crown a winner; it runs 10 different evaluations covering math, science, coding, and reasoning, then combines everything into a single score. As of March 2026, the top score is 57, and two models share it. Here is everything you need to know about the rankings, the method behind them, and what they actually mean for you.
What Is the Artificial Analysis Intelligence Index?
The Artificial Analysis Intelligence Index is a composite score that combines the results of 10 independent evaluations into one number you can use to compare AI models side by side. I think of it like a school report card: instead of grading a student on just one subject, it looks at everything together. The 10 tests are: GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, LCR (Long Context Reasoning), AA-Omniscience, IFBench, Humanity's Last Exam (HLE), GPQA Diamond, and CritPt. These feed into four ability groups (reasoning, knowledge, math, and coding), and each group contributes 25% to the final score. The key thing that separates this index from other rankings is that Artificial Analysis runs every evaluation independently; they do not rely on numbers that AI companies self-report. Performance is measured 8 times per day for individual requests and twice per day for parallel requests, over a 72-hour rolling window. That means the rankings reflect how models are actually performing through live APIs today, not how they performed on the day they were released.
For anyone wondering how this compares to other leaderboards, I have written a full guide to the LMSYS Chatbot Arena leaderboard — which uses human preference votes rather than automated benchmarks. Both approaches have value, but the Artificial Analysis method is more repeatable and harder for labs to game by training specifically on any one test.
The Top 5 AI Models on the March 2026 Leaderboard
As of March 2026, 316 models sit on the Artificial Analysis LLM leaderboard. The gap between the very top models is tiny — most of the frontier models are clustered within 5 points of each other. Here is the current top five:
- Gemini 3.1 Pro Preview (Google) – Score: 57. The highest-ranked model on the full index right now. Released on February 19, 2026, it generates output at 109.5 tokens per second and is priced at $2.00 per million input tokens and $12.00 per million output tokens. It leads on factual accuracy and agentic tasks.
- GPT-5.4 (xhigh) (OpenAI) – Score: 57. Tied at the top but tends to cost more per evaluation run. It performs especially well on reasoning-heavy tasks and is the go-to choice for users who need maximum logical depth and are less price-sensitive.
- GPT-5.3 Codex (xhigh) (OpenAI) – Score: 54. A strong third place that excels on coding benchmarks. For developers writing or reviewing complex code, this variant is often the practical first choice.
- Claude Opus 4.6 (Adaptive Reasoning, Max Effort) (Anthropic) – Score: 53. The top-ranked Anthropic model. It performs particularly well in real-world agentic and scientific tasks, and was notably the first model to top the full index when it was first measured in early 2026.
- Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort) (Anthropic) – Score: 52. A fast and capable model that balances intelligence with lower cost than Opus, making it the most popular Anthropic model for production use cases.
Want to know which of these models is considered the most advanced overall? I dug into that question in detail in this post: What is the most advanced AI in the world?
How the Benchmarks Actually Work – A Simple Explanation
I know benchmark names like "GPQA Diamond" and "τ²-Bench Telecom" sound confusing at first, so here is what the headline tests actually measure, in plain language:
- GDPval-AA: gives models real work tasks drawn from 44 occupations and 9 industries. The model gets shell access and web browsing, and has to actually complete the task like a real employee would.
- AA-Omniscience: tests factual knowledge across 6,000 questions covering 42 topics, including law, business, science, and software engineering. It penalizes wrong guesses, so models need to actually know the answer.
- GPQA Diamond: graduate-level science questions so hard that even domain experts only answer about 65% correctly.
- Humanity's Last Exam (HLE): exactly what it sounds like. Some of the hardest questions humans have ever written, designed to test whether AI is approaching expert-level reasoning.
- SciCode: scientific coding problems.
- Terminal-Bench Hard: checks whether a model can use a real computer terminal to solve difficult tasks.
Together, these 10 tests paint a much more complete picture than any single benchmark could, and that is exactly why I trust this index more than most others.
"The index is calculated as a weighted average across four categories — reasoning, knowledge, math, and coding — each contributing 25% to the overall score. All evaluations are conducted independently by Artificial Analysis under identical conditions." — Artificial Analysis Intelligence Benchmarking Methodology
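The weighting described in that methodology quote can be sketched in a few lines of Python. To be clear, the category scores below are made-up numbers for illustration, and the real index applies its own normalization to the 10 underlying evaluations before grouping them; this only shows how the equal 25% weights combine.

```python
# Hypothetical category scores for one model (illustrative only).
category_scores = {
    "reasoning": 58.0,
    "knowledge": 55.0,
    "math": 60.0,
    "coding": 55.0,
}

# Each of the four categories carries an equal 25% weight.
WEIGHTS = {name: 0.25 for name in category_scores}

index = sum(category_scores[c] * WEIGHTS[c] for c in category_scores)
print(round(index, 1))  # with equal weights this is just the plain average: 57.0
```

Because the weights are equal, the composite is simply the mean of the four category scores; the weighting scheme only matters if the categories are ever weighted differently.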
Full Score Comparison Table – March 2026
Here is the full picture across the top tier, including some important mid-range models you should know about. I have included cost and speed figures because a model that scores 57 but costs 10x more than a model scoring 52 is not always the right choice for every use case.
| Model | Intelligence Index Score | Open Weights? | Notes |
| --- | --- | --- | --- |
| Gemini 3.1 Pro Preview | 57 | No | Top-ranked; $2.00/$12.00 per 1M tokens |
| GPT-5.4 (xhigh) | 57 | No | Tied top; higher cost per evaluation |
| GPT-5.3 Codex (xhigh) | 54 | No | Best for coding tasks |
| Claude Opus 4.6 (Adaptive, Max) | 53 | No | Best Anthropic model; strong in science and agentic tasks |
| Claude Sonnet 4.6 (Adaptive, Max) | 52 | No | Best price-to-intelligence ratio among top-5 |
| GLM-5 (Reasoning) | 50 | Yes | Top open-weights model on the leaderboard |
| DeepSeek-R1 / Qwen3-series | Scores in the 40s | Yes | Strong open-source alternatives; improving fast |
| Qwen3.5 0.8B (Reasoning) | Lower tier | Yes | Cheapest model: $0.02 per 1M tokens |
| Mercury 2 | Mid range | No | Fastest model: 766.1 tokens per second |
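The per-million-token prices in the table are easier to reason about as per-request costs. Here is a minimal sketch using the $2.00 input / $12.00 output prices quoted above for Gemini 3.1 Pro Preview; the request size is a made-up example, not a measured figure:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one API request at per-million-token prices."""
    return (input_tokens / 1_000_000) * input_price_per_m \
         + (output_tokens / 1_000_000) * output_price_per_m

# Hypothetical request: a 2,000-token prompt with a 1,000-token reply,
# at Gemini 3.1 Pro Preview's quoted $2.00 / $12.00 per 1M token prices.
cost = request_cost(2_000, 1_000, 2.00, 12.00)
print(f"${cost:.4f}")  # $0.0160
```

Note that output tokens dominate the bill at these prices, so a chatty model configuration can cost several times more than a terse one for the same prompts.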
If you are looking to use these models through a free or low-cost API, I have put together a full list of the best options here: 8 Best Free LLM API Providers.
What the Rankings Mean for You – Which Model Should You Pick?
The number one mistake I see people make is looking only at the intelligence score and ignoring cost, speed, and their actual use case. Here is how I think about picking a model based on the March 2026 rankings. If you need the absolute best reasoning for complex research or scientific work, go with Gemini 3.1 Pro Preview or GPT-5.4 (xhigh) — they are genuinely neck and neck at the top. If you are building a product on Anthropic's API and want a strong balance between performance and cost, Claude Sonnet 4.6 (Adaptive Reasoning) at a score of 52 gives you near-frontier intelligence without the cost of Opus. If you are a developer and your main need is writing or reviewing code, GPT-5.3 Codex (xhigh) at 54 is purpose-built for that. And if you want to avoid API costs entirely and run a model locally, GLM-5 (Reasoning) at 50 is the best open-weights option on the index right now — quite remarkable for a free model. For most everyday users, I honestly think the mid-tier models scoring in the 40s and 50s are more than good enough. The difference between a 53 and a 57 is small in practice for most writing, summarizing, or question-answering tasks. The difference in your monthly bill, on the other hand, can be very large.
- Best overall intelligence: Gemini 3.1 Pro Preview or GPT-5.4 (xhigh) — score 57
- Best for coding: GPT-5.3 Codex (xhigh) — score 54
- Best for science and agentic tasks: Claude Opus 4.6 — score 53
- Best value (paid): Claude Sonnet 4.6 — score 52
- Best open-source / free to run: GLM-5 (Reasoning) — score 50
- Cheapest API model: Qwen3.5 0.8B at $0.02 per 1M tokens
- Fastest model: Mercury 2 at 766.1 tokens per second
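The throughput figures above translate directly into wait time. This quick sketch compares the two speeds quoted in this post (109.5 tokens per second for Gemini 3.1 Pro Preview, 766.1 for Mercury 2) on a hypothetical 1,500-token response; it ignores time-to-first-token, which also affects perceived latency:

```python
# Throughput figures quoted in this post; the response length is a
# made-up example for illustration.
speeds = {"Gemini 3.1 Pro Preview": 109.5, "Mercury 2": 766.1}
response_tokens = 1_500

for model, tokens_per_second in speeds.items():
    seconds = response_tokens / tokens_per_second
    print(f"{model}: {seconds:.1f}s to generate {response_tokens} tokens")
```

Roughly 14 seconds versus 2 seconds for the same response: for interactive chat or high-volume pipelines, that 7x speed gap can matter more than a few index points.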
Key Takeaways
| Topic | Quick Answer |
| --- | --- |
| Top model (March 2026) | Gemini 3.1 Pro Preview – score of 57 |
| Second place (tied) | GPT-5.4 (xhigh) – also scores 57, but costs more per run |
| Best Anthropic model | Claude Opus 4.6 (Adaptive Reasoning, Max Effort) – score of 53 |
| Total models ranked | 316 models on the leaderboard |
| What the index measures | 10 benchmarks: math, coding, science, reasoning, agentic tasks |
| Best open-weights model | GLM-5 (Reasoning) – score of 50 |
| Most affordable model | Qwen3.5 0.8B (Reasoning) at $0.02 per 1M tokens |
| Fastest model | Mercury 2 at 766.1 tokens per second |