Artificial Analysis Intelligence Index Rankings – March 2026 (Full Breakdown)

Pulkit Porwal
Mar 31, 2026 · 8 min read

If you have been trying to figure out which AI model is actually the smartest right now, you are not alone. I have spent a lot of time digging into leaderboards, and the Artificial Analysis Intelligence Index is the most thorough one I have found. It does not just pick one test and crown a winner: it runs 10 different evaluations covering math, science, coding, and reasoning, then combines everything into a single score. As of March 2026, the highest score on the leaderboard is 57, and two models share it. Here is everything you need to know about the rankings, the method behind them, and what they actually mean for you.

What Is the Artificial Analysis Intelligence Index?

The Artificial Analysis Intelligence Index is a composite score that combines the results from 10 independent evaluations into one number you can use to compare AI models side by side. I think of it like a school report card: instead of grading a student on just one subject, it looks at everything together. The 10 tests are GDPval-AA, τ²-Bench Telecom, Terminal-Bench Hard, SciCode, LCR (Long Context Reasoning), AA-Omniscience, IFBench, Humanity's Last Exam (HLE), GPQA Diamond, and CritPt. The results are grouped into four ability categories (reasoning, knowledge, math, and coding), and each category contributes 25% to the final score.

The key thing that separates this from other rankings is that Artificial Analysis runs every evaluation independently and does not rely on numbers that AI companies self-report. For performance metrics, they measure 8 times per day for individual requests and twice per day for parallel requests, using a 72-hour rolling window. That means the figures reflect how models are actually performing through live APIs today, not on the day they were released.
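To make the 72-hour rolling window concrete, here is a minimal sketch of how such a window could be maintained. The `RollingWindow` class, the sample values, and the use of a plain mean are my own assumptions for illustration; Artificial Analysis does not publish its implementation in the material quoted here.

```python
from collections import deque
from datetime import datetime, timedelta

WINDOW = timedelta(hours=72)  # window length stated in the article


class RollingWindow:
    """Keeps (timestamp, value) samples no older than the window (hypothetical helper)."""

    def __init__(self, window=WINDOW):
        self.window = window
        self.samples = deque()  # oldest sample on the left

    def add(self, when, value):
        # Samples are assumed to arrive in chronological order.
        self.samples.append((when, value))
        cutoff = when - self.window
        while self.samples and self.samples[0][0] < cutoff:
            self.samples.popleft()

    def mean(self):
        # Assumption: a plain mean; the source does not say which statistic is used.
        return sum(value for _, value in self.samples) / len(self.samples)


# Eight made-up measurements per day (one every 3 hours) over four days.
start = datetime(2026, 3, 1)
rw = RollingWindow()
for i in range(32):
    rw.add(start + timedelta(hours=3 * i), 100.0 + i)

print(len(rw.samples), rw.mean())
```

With one sample every three hours, the window settles at 25 samples (72 hours inclusive of both ends), so older readings drop out as new ones arrive.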
For anyone wondering how this compares to other leaderboards, I have written a full guide to the LMSYS Chatbot Arena leaderboard — which uses human preference votes rather than automated benchmarks. Both approaches have value, but the Artificial Analysis method is more repeatable and harder for labs to game by training specifically on any one test.
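The 25%-per-category weighting described in this section is easy to sketch. The category scores below are invented purely for illustration, and `composite_index` is a hypothetical helper, not Artificial Analysis code.

```python
# Equal weights for the four categories, as described in the methodology quote.
WEIGHTS = {"reasoning": 0.25, "knowledge": 0.25, "math": 0.25, "coding": 0.25}


def composite_index(category_scores):
    """Weighted average of per-category scores (hypothetical helper)."""
    return sum(WEIGHTS[name] * category_scores[name] for name in WEIGHTS)


# Made-up category scores for an imaginary model.
example = {"reasoning": 60.0, "knowledge": 50.0, "math": 58.0, "coding": 52.0}
print(composite_index(example))  # 55.0
```

Because the weights are equal, a model cannot reach a high composite by excelling in one category alone; a weak category pulls the whole number down.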

The Top 5 AI Models on the March 2026 Leaderboard

As of March 2026, 316 models sit on the Artificial Analysis LLM leaderboard. The gap between the very top models is tiny — most of the frontier models are clustered within 5 points of each other. Here is the current top five:
  1. Gemini 3.1 Pro Preview (Google) – Score: 57. The highest-ranked model on the full index right now. Released on February 19, 2026, it generates output at 109.5 tokens per second and is priced at $2.00 per million input tokens and $12.00 per million output tokens. It leads on factual accuracy and agentic tasks.
  2. GPT-5.4 (xhigh) (OpenAI) – Score: 57. Tied at the top but tends to cost more per evaluation run. It performs especially well on reasoning-heavy tasks and is the go-to choice for users who need maximum logical depth and are less price-sensitive.
  3. GPT-5.3 Codex (xhigh) (OpenAI) – Score: 54. A strong third place that excels on coding benchmarks. For developers writing or reviewing complex code, this variant is often the practical first choice.
  4. Claude Opus 4.6 (Adaptive Reasoning, Max Effort) (Anthropic) – Score: 53. The top-ranked Anthropic model. It performs particularly well in real-world agentic and scientific tasks, and it topped the full index when it was first measured in early 2026.
  5. Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort) (Anthropic) – Score: 52. A fast and capable model that balances intelligence with lower cost than Opus, making it the most popular Anthropic model for production use cases.
Want to know which of these models is considered the most advanced overall? I dug into that question in detail in this post: What is the most advanced AI in the world?
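Per-million-token pricing like the Gemini 3.1 Pro Preview figures above translates into a per-request cost with simple arithmetic. This is a rough sketch; the function name and the token counts are my own example values, and real bills may include other charges.

```python
def request_cost_usd(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost of one request given prices in USD per 1M tokens (hypothetical helper)."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# Prices quoted above for Gemini 3.1 Pro Preview: $2.00 input, $12.00 output.
# The 10,000 input and 2,000 output tokens are made-up example sizes.
cost = request_cost_usd(10_000, 2_000, 2.00, 12.00)
print(f"${cost:.3f}")  # $0.044
```

Note that output tokens dominate here even though there are five times fewer of them, which is why reasoning-heavy configurations can get expensive quickly.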

How the Benchmarks Actually Work – A Simple Explanation

I know benchmark names like "GPQA Diamond" and "τ²-Bench Telecom" sound confusing at first. Let me break down what several of them are actually testing in plain language.
  • GDPval-AA gives models real work tasks from 44 occupations and 9 industries. The model gets shell access and web browsing, and has to actually complete the task like a real employee would.
  • AA-Omniscience tests factual knowledge across 6,000 questions covering 42 topics including law, business, science, and software engineering. It penalizes models for guessing wrong, so they need to actually know the answer.
  • GPQA Diamond is a set of graduate-level science questions so hard that even domain experts only answer about 65% correctly.
  • Humanity's Last Exam (HLE) is exactly what it sounds like: some of the hardest questions humans have ever written, designed to test whether AI is approaching expert-level reasoning.
  • SciCode tests scientific coding.
  • Terminal-Bench Hard checks whether a model can use a real computer terminal to solve difficult tasks.
Together, these 10 tests paint a much more complete picture than any single benchmark ever could, and that is exactly why I trust this index more than most others.
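The penalty for wrong guesses that AA-Omniscience applies is worth illustrating. The sketch below assumes a simple scheme of +1 for a correct answer, -1 for an incorrect one, and 0 for declining to answer, scaled to a -100 to 100 range. That scheme is my assumption for illustration and may not match the benchmark's exact formula.

```python
# Assumed scoring: correct = +1, incorrect = -1, abstain = 0.
POINTS = {"correct": 1, "incorrect": -1, "abstain": 0}


def penalized_score(outcomes):
    """Score on a -100..100 scale where wrong answers cost points (hypothetical)."""
    return 100 * sum(POINTS[o] for o in outcomes) / len(outcomes)


# Two imaginary models that both know 6 of 10 answers.
guesser = ["correct"] * 6 + ["incorrect"] * 4   # guesses on the rest and gets them wrong
cautious = ["correct"] * 6 + ["abstain"] * 4    # declines to answer the rest

print(penalized_score(guesser))   # 20.0
print(penalized_score(cautious))  # 60.0
```

Under plain accuracy both models would score 60%, so a penalty of this kind is what separates a model that knows its limits from one that bluffs.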

"The index is calculated as a weighted average across four categories — reasoning, knowledge, math, and coding — each contributing 25% to the overall score. All evaluations are conducted independently by Artificial Analysis under identical conditions." — Artificial Analysis Intelligence Benchmarking Methodology

Full Score Comparison Table – March 2026

Here is the full picture across the top tier, including some important mid-range models you should know about. I have included cost and speed figures because a model that scores 57 but costs 10x more than a model scoring 52 is not always the right choice for every use case.
| Model | Intelligence Index Score | Open Weights? | Notes |
|---|---|---|---|
| Gemini 3.1 Pro Preview | 57 | No | Top-ranked; $2.00 input / $12.00 output per 1M tokens |
| GPT-5.4 (xhigh) | 57 | No | Tied top; higher cost per evaluation |
| GPT-5.3 Codex (xhigh) | 54 | No | Best for coding tasks |
| Claude Opus 4.6 (Adaptive, Max) | 53 | No | Best Anthropic model; strong in science and agentic tasks |
| Claude Sonnet 4.6 (Adaptive, Max) | 52 | No | Best price-to-intelligence ratio among top-5 |
| GLM-5 (Reasoning) | 50 | Yes | Top open-weights model on the leaderboard |
| DeepSeek-R1 / Qwen3-series | ~40s | Yes | Strong open-source alternatives; fast improving |
| Qwen3.5 0.8B (Reasoning) | Lower tier | Yes | Cheapest model: $0.02 per 1M tokens |
| Mercury 2 | Mid range | No | Fastest model: 766.1 tokens per second |
If you are looking to use these models through a free or low-cost API, I have put together a full list of the best options here: 8 Best Free LLM API Providers.
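The speed figures are easier to compare once you turn them into waiting time. Here is a quick sketch using the two output speeds quoted in this article; it ignores time to first token and network latency, and the 1,000-token response size is just an example.

```python
def generation_seconds(output_tokens, tokens_per_second):
    """Approximate time to stream a response, ignoring time to first token."""
    return output_tokens / tokens_per_second


# Output speeds quoted in this article (tokens per second).
MERCURY_2_TPS = 766.1
GEMINI_31_PRO_TPS = 109.5

for name, tps in [("Mercury 2", MERCURY_2_TPS), ("Gemini 3.1 Pro Preview", GEMINI_31_PRO_TPS)]:
    print(f"{name}: {generation_seconds(1_000, tps):.1f} s for 1,000 tokens")
```

That works out to roughly 1.3 seconds versus 9.1 seconds for the same response length, a gap that matters far more in an interactive product than in a batch job.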

What the Rankings Mean for You – Which Model Should You Pick?

The number one mistake I see people make is looking only at the intelligence score and ignoring cost, speed, and their actual use case. Here is how I think about picking a model based on the March 2026 rankings.

If you need the absolute best reasoning for complex research or scientific work, go with Gemini 3.1 Pro Preview or GPT-5.4 (xhigh); they are genuinely neck and neck at the top. If you are building a product on Anthropic's API and want a strong balance between performance and cost, Claude Sonnet 4.6 (Adaptive Reasoning) at a score of 52 gives you near-frontier intelligence without the cost of Opus. If you are a developer and your main need is writing or reviewing code, GPT-5.3 Codex (xhigh) at 54 is purpose-built for that. And if you want to avoid API costs entirely and run a model locally, GLM-5 (Reasoning) at 50 is the best open-weights option on the index right now, which is quite remarkable for a free model.

For most everyday users, I honestly think the mid-tier models scoring in the 40s and low 50s are more than good enough. The difference between a 53 and a 57 is small in practice for most writing, summarizing, or question-answering tasks. The difference in your monthly bill, on the other hand, can be very large.
  • Best overall intelligence: Gemini 3.1 Pro Preview or GPT-5.4 (xhigh) — score 57
  • Best for coding: GPT-5.3 Codex (xhigh) — score 54
  • Best for science and agentic tasks: Claude Opus 4.6 — score 53
  • Best value (paid): Claude Sonnet 4.6 — score 52
  • Best open-source / free to run: GLM-5 (Reasoning) — score 50
  • Cheapest API model: Qwen3.5 0.8B at $0.02 per 1M tokens
  • Fastest model: Mercury 2 at 766.1 tokens per second
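If you want to apply this kind of filtering yourself, the recommendations above boil down to a shortlist over the leaderboard rows. The sketch below hard-codes the scores from this article; the `Model` class and `shortlist` function are hypothetical helpers of mine, not part of any Artificial Analysis API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    score: int          # Intelligence Index score from this article
    open_weights: bool


MODELS = [
    Model("Gemini 3.1 Pro Preview", 57, False),
    Model("GPT-5.4 (xhigh)", 57, False),
    Model("GPT-5.3 Codex (xhigh)", 54, False),
    Model("Claude Opus 4.6 (Adaptive, Max)", 53, False),
    Model("Claude Sonnet 4.6 (Adaptive, Max)", 52, False),
    Model("GLM-5 (Reasoning)", 50, True),
]


def shortlist(models, min_score=0, open_weights_only=False):
    """Models meeting the constraints, highest score first (ties keep input order)."""
    picked = [
        m for m in models
        if m.score >= min_score and (m.open_weights or not open_weights_only)
    ]
    return sorted(picked, key=lambda m: m.score, reverse=True)


print([m.name for m in shortlist(MODELS, open_weights_only=True)])
print([m.name for m in shortlist(MODELS, min_score=54)])
```

In practice you would add price and speed columns and filter on those too, since the score alone rarely decides the choice.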

Key Takeaways

| Topic | Quick Answer |
|---|---|
| Top model (March 2026) | Gemini 3.1 Pro Preview – score of 57 |
| Tied for first | GPT-5.4 (xhigh) – also scores 57, but costs more per run |
| Best Anthropic model | Claude Opus 4.6 (Adaptive Reasoning, Max Effort) – score of 53 |
| Total models ranked | 316 models on the leaderboard |
| What the index measures | 10 benchmarks: math, coding, science, reasoning, agentic tasks |
| Best open-weights model | GLM-5 (Reasoning) – score of 50 |
| Most affordable model | Qwen3.5 0.8B (Reasoning) at $0.02 per 1M tokens |
| Fastest model | Mercury 2 at 766.1 tokens per second |

Frequently Asked Questions

1. What is the Artificial Analysis Intelligence Index?

It is a composite AI benchmark score that combines 10 independent evaluations covering math, coding, science, and reasoning into one number. It is published by Artificial Analysis and currently ranks 316 AI models. The highest score currently on the leaderboard is 57.

2. Which AI model has the highest score in March 2026?

Gemini 3.1 Pro Preview from Google and GPT-5.4 (xhigh) from OpenAI are tied at 57, the highest score currently on the index. Gemini 3.1 Pro Preview is typically listed first due to slightly better cost efficiency at the top tier.

3. Is Claude Opus 4.6 better than GPT-5.4?

On the Artificial Analysis Intelligence Index, GPT-5.4 (xhigh) scores 57 while Claude Opus 4.6 (Adaptive Reasoning, Max Effort) scores 53. GPT-5.4 has a higher composite score, but Claude Opus 4.6 leads on several individual sub-benchmarks including real-world agentic tasks and scientific evaluations. The better choice depends on your specific use case.

4. What is the best open-source AI model in March 2026?

GLM-5 (Reasoning) is the highest-ranked open-weights model on the Artificial Analysis leaderboard with a score of 50. Models from the Qwen3-series (Alibaba) and DeepSeek-R1 also perform well in the open-source category, scoring in the 40s on the index.