AI LLM Leaderboard
Compare the latest Large Language Models across multiple benchmarks and performance metrics
Total Models: 71 (latest models included)
Benchmarks: 8 (comprehensive tests)
Top Score: 99.0%, Kimi K2.5 (HumanEval)
Fastest Model: 250 TPS (GPT-4o mini Realtime)
Model Rankings - 71 Models
| # | Model | Released | Badges | Organization | Category | Score | MMLU | GPQA | MMMU | HellaSwag | HumanEval | BBHard | GSM8K | MATH | Cost/1K | TPS | Context |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Gemini 3.1 Pro | 2026-02 | Top GPQA, Best ARC-AGI-2, Best Value Frontier | Google | multimodal | 92.2% | 93.1% | 94.3% | 85.2% | N/A | N/A | N/A | N/A | 96.1% | $0.002 | 60 | 2M |
| 2 | GPT-5.4 | 2026-03 | Best Computer Use, Top OSWorld | OpenAI | reasoning | 91.6% | 93% | 92.8% | 83.5% | N/A | N/A | N/A | N/A | 97% | $0.0025 | 70 | 1M |
| 3 | Claude Opus 4.6 | 2026-02 | Best SWE-Bench, Top Coding, 128K Output | Anthropic | coding | 91.3% | 92.4% | 91.3% | 85.1% | N/A | N/A | N/A | N/A | 95.2% | $0.005 | 45 | 1M |
| 4 | Grok 4 | 2026-01 | Best HLE, Multi-Agent, Real-time X Data | xAI | reasoning | 90.2% | 92.7% | 84.6% | N/A | N/A | N/A | N/A | N/A | 93.3% | $0.002 | 75 | 128K |
| 5 | Kimi K2.5 | 2026-01 | Open Source, Best HumanEval Open, Top SWE-Bench Open | Moonshot AI | coding | 89.5% | 92% | 87.6% | N/A | N/A | 99% | N/A | N/A | 98% | $0.0015 | 85 | 262K |
| 6 | Claude Sonnet 4.6 | 2026-02 | Best GDPval-AA, Near-Opus Performance | Anthropic | reasoning | 88.8% | 91% | 88.5% | 82% | N/A | N/A | N/A | N/A | 93.5% | $0.003 | 80 | 1M |
| 7 | o1 | 2024-09 | Top MMLU, Premium | OpenAI | reasoning | 88.4% | 92.3% | 78% | N/A | N/A | N/A | N/A | N/A | 94.8% | $0.06 | 15 | 128K |
| 8 | GLM-5 | 2026-01 | Open Source, MIT License, Best Chatbot Arena Open | Zhipu AI | coding | 88% | 92% | 86% | N/A | N/A | 96.5% | N/A | N/A | 94.5% | $0.001 | 80 | 200K |
| 9 | Qwen 3.5 397B | 2026-02 | Open Source, Apache 2.0, Top Open GPQA | Alibaba | reasoning | 87.7% | 91.8% | 88.4% | N/A | N/A | 94.2% | N/A | N/A | 96.5% | $0.001 | 90 | 256K |
| 10 | MiniMax M2.5 | 2026-01 | Open Source, Best SWE-Bench Open | MiniMax | coding | 87.5% | 90.8% | 85.2% | N/A | N/A | 94.5% | N/A | N/A | 96% | $0.0012 | 75 | 128K |
| 11 | DeepSeek R1 | 2025-01 | Best MATH, Best Reasoning | DeepSeek | reasoning | 86.5% | 90.8% | 71.5% | N/A | N/A | N/A | N/A | N/A | 97.3% | $0.008 | 65 | 64K |
| 12 | o3-mini | 2024-12 | Fastest TPS, Best HumanEval | OpenAI | coding | 86% | 86% | 75% | N/A | N/A | 97% | N/A | N/A | N/A | $0.02 | 189 | 128K |
| 13 | Claude 3.7 Sonnet | 2024-11 | Best MATH | Anthropic | reasoning | 85.5% | 86.1% | 84.8% | 75% | N/A | N/A | N/A | N/A | 96.2% | $0.02 | 65 | 200K |
| 14 | o4-mini | 2024-12 | Best MATH | OpenAI | reasoning | 85.2% | N/A | 81.4% | 81.6% | N/A | N/A | N/A | N/A | 92.7% | $0.02 | 85 | 128K |
| 15 | Gemini 2.5 Pro | 2024-12 | Latest, High GPQA | Google | multimodal | 85.2% | 89.8% | 84% | 81.7% | N/A | N/A | N/A | N/A | N/A | $0.02 | 55 | 1M |
| 16 | o3 | 2024-12 | Top MATH | OpenAI | reasoning | 85% | N/A | 83.3% | 82.9% | N/A | N/A | N/A | N/A | 88.9% | $0.06 | 18 | 128K |
| 17 | o1-preview | 2024-09 | Preview | OpenAI | reasoning | 84.9% | 90.8% | 78.3% | N/A | N/A | N/A | N/A | N/A | 85.5% | $0.045 | 20 | 128K |
| 18 | DeepSeek V3.2 | 2025-12 | Open Source, MIT License, Best Budget | DeepSeek | coding | 84.2% | 90.5% | 72.1% | N/A | N/A | 91.5% | N/A | N/A | 95% | $0.00014 | 95 | 64K |
| 19 | DeepSeek V3 (0324) | 2025-03 | Open Source, Updated, MIT License | DeepSeek | coding | 84.2% | 89% | 68.4% | N/A | N/A | 87.5% | N/A | N/A | 92% | $0.004 | 95 | 64K |
| 20 | DeepSeek R1 Zero | 2025-01 | Open Source, RL-Only Training, MIT License | DeepSeek | reasoning | 83.8% | 88.4% | 67% | N/A | N/A | N/A | N/A | N/A | 95.9% | $0.005 | 60 | 64K |
| 21 | Llama 4 Behemoth | 2025-07 | Open Source, Largest Llama, Top MMLU Open | Meta | reasoning | 83.4% | 91.5% | 74.2% | 78.3% | N/A | N/A | N/A | N/A | 89.5% | $0.01 | 25 | 256K |
| 22 | Claude 3.5 Sonnet | 2024-06 | User's Choice, Best GSM8K | Anthropic | reasoning | 82.3% | 88.7% | 59.4% | 68.3% | 89% | 92% | 93.1% | 96.4% | 71.1% | $0.015 | 170 | 200K |
| 23 | GPT-4o | 2024-05 | Least Latency, Multimodal | OpenAI | multimodal | 82.2% | 88.7% | 53.6% | 69.1% | 94.2% | 90.2% | 91.3% | 89.8% | 76.6% | $0.015 | 85 | 128K |
| 24 | o1-mini | 2024-09 | Good Coding | OpenAI | coding | 81.9% | 85.2% | 60% | N/A | N/A | 92.4% | N/A | N/A | 90% | $0.025 | 45 | 128K |
| 25 | Gemini 2.0 Flash | 2024-12 | Fast, Good Performance | Google | multimodal | 81.8% | 87% | 59% | N/A | N/A | 91% | N/A | N/A | 90% | $0.01 | 110 | 1M |
| 26 | Claude Opus 4 | 2024-12 | Latest, Premium | Anthropic | reasoning | 81% | 88.8% | 83.3% | 76.5% | N/A | N/A | N/A | N/A | 75.5% | $0.045 | 45 | 200K |
| 27 | Gemini 2.0 Flash Thinking | 2025-02 | Extended Thinking, Best Budget Reasoning | Google | reasoning | 80.7% | 85% | 70.3% | 73.8% | N/A | N/A | N/A | N/A | 93.5% | $0.0035 | 50 | 1M |
| 28 | Grok 3 Mini | 2025-02 | Extended Thinking, Cost-Effective | xAI | reasoning | 80.7% | 83% | 69.7% | N/A | N/A | N/A | N/A | N/A | 89.5% | $0.003 | 100 | 128K |
| 29 | DeepSeek V3 | 2024-12 | Open Source, Good MATH | DeepSeek | coding | 80.1% | 88.5% | 59.1% | N/A | N/A | 82.6% | N/A | N/A | 90.2% | $0.004 | 95 | 64K |
| 30 | Claude 3.5 Sonnet v2 | 2025-02 | Computer Use, Top SWE-Bench, Upgraded | Anthropic | reasoning | 79.3% | 88.7% | 65% | 70.7% | N/A | 93.7% | N/A | N/A | 78.3% | $0.015 | 75 | 200K |
| 31 | Llama 3.1 405B | 2024-07 | Open Source, Largest Open | Meta | reasoning | 78.9% | 88.6% | 51.1% | 64.5% | 87% | 89% | 81.3% | 96.8% | 73.8% | $0.015 | 35 | 128K |
| 32 | Claude Sonnet 4 | 2024-12 | Latest, High GPQA | Anthropic | reasoning | 78.8% | 86.5% | 83.8% | 74.4% | N/A | N/A | N/A | N/A | 70.5% | $0.025 | 60 | 200K |
| 33 | Qwen 2.5-Max | 2025-02 | MoE Architecture, Top Chinese Open | Alibaba | reasoning | 78.1% | 87% | 52.5% | N/A | N/A | 88% | N/A | N/A | 85% | $0.0016 | 95 | 128K |
| 34 | GPT-4 Turbo | 2024-04 | Highly Preferred, Balanced | OpenAI | reasoning | 77.6% | 86.5% | 48% | 63.1% | 94.2% | 90.2% | 87.6% | 91% | 72.2% | $0.03 | 45 | 128K |
| 35 | Llama 3.1 Nemotron 70B | 2025-01 | Open Source, RLHF Tuned, Top Arena | NVIDIA | reasoning | 77.5% | 85% | 55.8% | N/A | N/A | 90% | N/A | N/A | 79% | $0.0035 | 70 | 128K |
| 36 | GPT-4o (2025) | 2025-05 | Updated, Best Voice, Image Gen | OpenAI | multimodal | 77.2% | 89.5% | 55% | 70.2% | N/A | 91.5% | N/A | N/A | 80% | $0.0125 | 90 | 128K |
| 37 | Claude 3 Opus | 2024-03 | Premium, Complete Benchmarks | Anthropic | reasoning | 77.2% | 86.8% | 50.4% | 59.4% | 95.4% | 84.9% | 86.8% | 95% | 60.1% | $0.045 | 45 | 200K |
| 38 | GPT-4.1 | 2024-11 | Latest GPT | OpenAI | reasoning | 77.1% | 90.2% | 66.3% | 74.8% | N/A | N/A | N/A | N/A | N/A | $0.04 | 50 | 128K |
| 39 | Gemini 2.0 Pro Experimental | 2024-12 | Experimental, Good MATH | Google | multimodal | 77.1% | 79.1% | 64.7% | 72.7% | N/A | N/A | N/A | N/A | 91.8% | $0.015 | 60 | 1M |
| 40 | Qwen 2.5 72B | 2025-01 | Open Source, Apache 2.0, Best Open Coding | Alibaba | coding | 76.4% | 86.1% | 49% | N/A | N/A | 87.2% | N/A | N/A | 83.1% | $0.0009 | 88 | 128K |
| 41 | Claude 3.7 Sonnet (Normal) | 2024-11 | Balanced | Anthropic | reasoning | 76.3% | 83.2% | 68% | 71.8% | N/A | N/A | N/A | N/A | 82.2% | $0.015 | 85 | 200K |
| 42 | Phi-4 | 2025-01 | Open Source, Best-in-Class 14B, STEM Strong | Microsoft | reasoning | 75.9% | 84.8% | 56.1% | N/A | N/A | 82.6% | N/A | N/A | 80.4% | $0.0007 | 120 | 16K |
| 43 | Llama 4 Maverick | 2024-12 | Open Source | Meta | reasoning | 75.9% | 84.6% | 69.8% | 73.4% | N/A | N/A | N/A | N/A | N/A | $0.005 | 85 | 128K |
| 44 | Llama 3.3 70B | 2024-10 | Open Source, Good Coding | Meta | coding | 75.5% | 86% | 50.5% | N/A | N/A | 88.4% | N/A | N/A | 77% | $0.006 | 90 | 128K |
| 45 | Mistral Large 2 | 2025-01 | Open Weights, Multilingual, Function Calling | Mistral AI | reasoning | 75.4% | 84% | 49.6% | N/A | N/A | 92% | N/A | N/A | 76% | $0.006 | 80 | 128K |
| 46 | Mistral Medium 3 | 2025-05 | Enterprise, Multilingual, New | Mistral AI | reasoning | 75.2% | 83.5% | 51% | N/A | N/A | 90% | N/A | N/A | 76.5% | $0.004 | 95 | 128K |
| 47 | GPT-4.1 mini | 2024-11 | Cost-Effective | OpenAI | reasoning | 75.1% | 87.5% | 65% | 72.7% | N/A | N/A | N/A | N/A | N/A | $0.015 | 95 | 128K |
| 48 | Grok-2 | 2024-08 | Good Coding | xAI | coding | 74.8% | 87.5% | 56% | 66.1% | N/A | 88.4% | N/A | N/A | 76.1% | $0.01 | 75 | 128K |
| 49 | Grok 3 | 2024-12 | Latest | xAI | reasoning | 74.3% | N/A | 75.4% | 73.2% | N/A | N/A | N/A | N/A | N/A | $0.012 | 70 | 128K |
| 50 | Gemini 1.5 Pro | 2024-02 | Largest Context, Complete Benchmarks | Google | multimodal | 73.6% | 81.9% | 46.2% | 62.2% | 92.5% | 71.9% | 84% | 91.7% | 58.5% | $0.0125 | 38 | 2M |
| 51 | Gemini 2.5 Flash Lite | 2024-12 | Latest | Google | multimodal | 71.8% | 84.5% | 66.7% | 72.9% | N/A | N/A | N/A | N/A | 63.1% | $0.01 | 75 | 1M |
| 52 | GPT-4 | 2023-03 | Most Expensive, Classic | OpenAI | reasoning | 71.4% | 86.4% | 35.7% | 56.8% | 95.3% | 67% | 83.1% | 92% | 52.9% | $0.18 | 25 | 8K |
| 53 | Claude 3.5 Haiku (2025) | 2025-04 | Updated, Fastest Claude, Computer Use | Anthropic | conversation | 70.3% | 73.5% | 43.2% | N/A | N/A | 90.5% | N/A | N/A | 74% | $0.004 | 150 | 200K |
| 54 | Llama 3.2 90B | 2024-09 | Open Source | Meta | reasoning | 69.6% | 86% | 46.7% | 60.3% | N/A | N/A | N/A | 86.9% | 68% | $0.008 | 80 | 128K |
| 55 | Command R+ (2025) | 2025-03 | RAG Optimized, Tool Use, Enterprise | Cohere | reasoning | 69.4% | 82.3% | 46% | N/A | N/A | 80.5% | N/A | N/A | 68.9% | $0.0025 | 85 | 128K |
| 56 | Claude 3 Sonnet | 2024-03 | Balanced, Complete Benchmarks | Anthropic | reasoning | 69.1% | 79% | 40.4% | 53.1% | 89% | 73% | 82.9% | 92.3% | 43.1% | $0.012 | 90 | 200K |
| 57 | Gemini 1.5 Flash | 2024-05 | Fast, Complete Benchmarks | Google | multimodal | 68.6% | 78.9% | 39.5% | 56.1% | 81.3% | 67.5% | 89.2% | 68.8% | 67.7% | $0.008 | 95 | 1M |
| 58 | Mistral Small 3 | 2025-03 | Open Weights, Ultra Efficient, Apache 2.0 | Mistral AI | conversation | 68.1% | 81.5% | 42% | N/A | N/A | 83% | N/A | N/A | 66% | $0.001 | 130 | 32K |
| 59 | Gemma 3 27B | 2025-03 | Open Source, Multimodal, Apache 2.0 | Google | multimodal | 67.9% | 78.4% | 42% | 68.5% | N/A | 79% | N/A | N/A | 71.5% | $0.0003 | 110 | 128K |
| 60 | GPT-4o mini | 2024-07 | Cost-Effective | OpenAI | conversation | 67.8% | 82% | 40.2% | 59.4% | N/A | 87.2% | N/A | N/A | 70.2% | $0.007 | 120 | 128K |
| 61 | Llama 4 Scout | 2024-12 | Least Expensive, Open Source | Meta | conversation | 67% | 74.3% | 57.2% | 69.4% | N/A | N/A | N/A | N/A | N/A | $0.0003 | 120 | 128K |
| 62 | Amazon Nova Pro | 2025-01 | AWS Native, Multimodal, Cost-Effective | Amazon | multimodal | 66.9% | 80% | 44% | 63.5% | N/A | 79% | N/A | N/A | 68% | $0.0008 | 100 | 300K |
| 63 | Claude 3.5 Haiku | 2024-11 | Fast, Cost-Effective | Anthropic | conversation | 66% | 65% | 41.6% | N/A | N/A | 88.1% | N/A | N/A | 69.2% | $0.005 | 140 | 200K |
| 64 | Claude 3 Haiku | 2024-03 | Fast, Complete Benchmarks | Anthropic | conversation | 65.3% | 75.2% | 33.3% | 50.2% | 85.9% | 75.9% | 73.7% | 88.9% | 38.9% | $0.004 | 160 | 200K |
| 65 | Phi-4 Mini | 2025-04 | Open Source, Edge Deployable, 3.8B Params | Microsoft | conversation | 63.4% | 75.6% | 37.3% | N/A | N/A | 73% | N/A | N/A | 67.5% | $0.0001 | 200 | 16K |
| 66 | GPT-4.1 nano | 2024-11 | Ultra Fast | OpenAI | conversation | 61.9% | 80.1% | 50.3% | 55.4% | N/A | N/A | N/A | N/A | N/A | $0.005 | 150 | 32K |
| 67 | Gemma 3 12B | 2025-03 | Open Source, Lightweight, On-Device | Google | conversation | 60.2% | 74.2% | 37% | 59.8% | N/A | 72% | N/A | N/A | 58% | $0.0001 | 160 | 128K |
| 68 | Amazon Nova Lite | 2025-01 | AWS Native, Ultra Fast, Cheapest Multimodal | Amazon | conversation | 56.6% | 73% | 33% | 55% | N/A | 68% | N/A | N/A | 54% | $0.00006 | 180 | 300K |
| 69 | o3-pro | 2025-01 | Upcoming | OpenAI | reasoning | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | $0.08 | 20 | 128K |
| 70 | GPT-4o Realtime | 2024-10 | Realtime, Voice | OpenAI | conversation | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | $0.02 | 200 | 128K |
| 71 | GPT-4o mini Realtime | 2024-10 | Realtime, Voice, Cost-Effective | OpenAI | conversation | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | $0.008 | 250 | 128K |
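
The Top Score and Fastest Model figures at the top of the page can be reproduced from the table. The sketch below is illustrative only: it uses a hand-copied subset of five rows rather than all 71, it assumes that HumanEval is the column holding Kimi K2.5's 99%, and the names `Row`, `ROWS`, and `top_by` are hypothetical helpers, not part of the leaderboard site.

```python
from typing import Callable, NamedTuple, Optional


class Row(NamedTuple):
    model: str
    humaneval: Optional[float]  # percent; None where the table shows N/A
    tps: int                    # tokens per second, from the TPS column
    cost_per_1k: float          # USD, from the Cost/1K column


# Hand-copied subset of the table above (assumption: not the full 71 rows).
ROWS = [
    Row("Kimi K2.5", 99.0, 85, 0.0015),
    Row("o3-mini", 97.0, 189, 0.02),
    Row("GLM-5", 96.5, 80, 0.001),
    Row("Phi-4 Mini", 73.0, 200, 0.0001),
    Row("GPT-4o mini Realtime", None, 250, 0.008),
]


def top_by(rows: list, key: Callable) -> Row:
    """Return the row with the largest value for `key`, skipping N/A (None)."""
    scored = [r for r in rows if key(r) is not None]
    return max(scored, key=key)


best_humaneval = top_by(ROWS, lambda r: r.humaneval)
fastest = top_by(ROWS, lambda r: r.tps)

print(f"Top HumanEval: {best_humaneval.model} ({best_humaneval.humaneval:.1f}%)")
print(f"Fastest: {fastest.model} ({fastest.tps} TPS)")
```

Rows with N/A are skipped rather than treated as zero, so a model with no reported benchmark is left out of that ranking instead of being placed last.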