Compare the latest Large Language Models across multiple benchmarks and performance metrics

Total Models: 71 (latest models included)

Benchmarks: 8 (comprehensive tests)

Top Score: 99.0%, Kimi K2.5 (HumanEval)

Fastest Model: 250 TPS (GPT-4o mini Realtime)

Model Rankings - 71 Models

#	Model	Organization	Category	Score	MMLU	GPQA	MMMU	HellaSwag	HumanEval	BBHard	GSM8K	MATH	Cost/1K	TPS	Context
1	Gemini 3.1 Pro, 2026-02 [Top GPQA, Best ARC-AGI-2, Best Value Frontier, New]	Google	multimodal	92.2%	93.1%	94.3%	85.2%	N/A	N/A	N/A	N/A	96.1%	$0.002	60	2M
2	GPT-5.4, 2026-03 [Best Computer Use, Top OSWorld, New]	OpenAI	reasoning	91.6%	93%	92.8%	83.5%	N/A	N/A	N/A	N/A	97%	$0.0025	70	1M
3	Claude Opus 4.6, 2026-02 [Best SWE-Bench, Top Coding, 128K Output, New]	Anthropic	coding	91.3%	92.4%	91.3%	85.1%	N/A	N/A	N/A	N/A	95.2%	$0.005	45	1M
4	Grok 4, 2026-01 [Best HLE, Multi-Agent, Real-time X Data, New]	xAI	reasoning	90.2%	92.7%	84.6%	N/A	N/A	N/A	N/A	N/A	93.3%	$0.002	75	128K
5	Kimi K2.5, 2026-01 [Open Source, Best HumanEval Open, Top SWE-Bench Open, New]	Moonshot AI	coding	89.5%	92%	87.6%	N/A	N/A	99%	N/A	N/A	98%	$0.0015	85	262K
6	Claude Sonnet 4.6, 2026-02 [Best GDPval-AA, Near-Opus Performance, New]	Anthropic	reasoning	88.8%	91%	88.5%	82%	N/A	N/A	N/A	N/A	93.5%	$0.003	80	1M
7	o1, 2024-09 [Top MMLU, Premium]	OpenAI	reasoning	88.4%	92.3%	78%	N/A	N/A	N/A	N/A	N/A	94.8%	$0.06	15	128K
8	GLM-5, 2026-01 [Open Source, MIT License, Best Chatbot Arena Open, New]	Zhipu AI	coding	88%	92%	86%	N/A	N/A	96.5%	N/A	N/A	94.5%	$0.001	80	200K
9	Qwen 3.5 397B, 2026-02 [Open Source, Apache 2.0, Top Open GPQA, New]	Alibaba	reasoning	87.7%	91.8%	88.4%	N/A	N/A	94.2%	N/A	N/A	96.5%	$0.001	90	256K
10	MiniMax M2.5, 2026-01 [Open Source, Best SWE-Bench Open, New]	MiniMax	coding	87.5%	90.8%	85.2%	N/A	N/A	94.5%	N/A	N/A	96%	$0.0012	75	128K
11	DeepSeek R1, 2025-01 [Best MATH, Best Reasoning, New]	DeepSeek	reasoning	86.5%	90.8%	71.5%	N/A	N/A	N/A	N/A	N/A	97.3%	$0.008	65	64K
12	o3-mini, 2024-12 [Fastest TPS, Best HumanEval, New]	OpenAI	coding	86%	86%	75%	N/A	N/A	97%	N/A	N/A	N/A	$0.02	189	128K
13	Claude 3.7 Sonnet, 2024-11 [Best MATH, New]	Anthropic	reasoning	85.5%	86.1%	84.8%	75%	N/A	N/A	N/A	N/A	96.2%	$0.02	65	200K
14	o4-mini, 2024-12 [Best MATH, New]	OpenAI	reasoning	85.2%	N/A	81.4%	81.6%	N/A	N/A	N/A	N/A	92.7%	$0.02	85	128K
15	Gemini 2.5 Pro, 2024-12 [Latest, High GPQA, New]	Google	multimodal	85.2%	89.8%	84%	81.7%	N/A	N/A	N/A	N/A	N/A	$0.02	55	1M
16	o3, 2024-12 [Top MATH, New]	OpenAI	reasoning	85%	N/A	83.3%	82.9%	N/A	N/A	N/A	N/A	88.9%	$0.06	18	128K
17	o1-preview, 2024-09 [Preview]	OpenAI	reasoning	84.9%	90.8%	78.3%	N/A	N/A	N/A	N/A	N/A	85.5%	$0.045	20	128K
18	DeepSeek V3.2, 2025-12 [Open Source, MIT License, Best Budget, New]	DeepSeek	coding	84.2%	90.5%	72.1%	N/A	N/A	91.5%	N/A	N/A	95%	$0.00014	95	64K
19	DeepSeek V3 (0324), 2025-03 [Open Source, Updated, MIT License, New]	DeepSeek	coding	84.2%	89%	68.4%	N/A	N/A	87.5%	N/A	N/A	92%	$0.004	95	64K
20	DeepSeek R1 Zero, 2025-01 [Open Source, RL-Only Training, MIT License, New]	DeepSeek	reasoning	83.8%	88.4%	67%	N/A	N/A	N/A	N/A	N/A	95.9%	$0.005	60	64K
21	Llama 4 Behemoth, 2025-07 [Open Source, Largest Llama, Top MMLU Open, New]	Meta	reasoning	83.4%	91.5%	74.2%	78.3%	N/A	N/A	N/A	N/A	89.5%	$0.01	25	256K
22	Claude 3.5 Sonnet, 2024-06 [User's Choice, Best GSM8K]	Anthropic	reasoning	82.3%	88.7%	59.4%	68.3%	89%	92%	93.1%	96.4%	71.1%	$0.015	170	200K
23	GPT-4o, 2024-05 [Least Latency, Multimodal]	OpenAI	multimodal	82.2%	88.7%	53.6%	69.1%	94.2%	90.2%	91.3%	89.8%	76.6%	$0.015	85	128K
24	o1-mini, 2024-09 [Good Coding]	OpenAI	coding	81.9%	85.2%	60%	N/A	N/A	92.4%	N/A	N/A	90%	$0.025	45	128K
25	Gemini 2.0 Flash, 2024-12 [Fast, Good Performance, New]	Google	multimodal	81.8%	87%	59%	N/A	N/A	91%	N/A	N/A	90%	$0.01	110	1M
26	Claude Opus 4, 2024-12 [Latest, Premium, New]	Anthropic	reasoning	81%	88.8%	83.3%	76.5%	N/A	N/A	N/A	N/A	75.5%	$0.045	45	200K
27	Gemini 2.0 Flash Thinking, 2025-02 [Extended Thinking, Best Budget Reasoning, New]	Google	reasoning	80.7%	85%	70.3%	73.8%	N/A	N/A	N/A	N/A	93.5%	$0.0035	50	1M
28	Grok 3 Mini, 2025-02 [Extended Thinking, Cost-Effective, New]	xAI	reasoning	80.7%	83%	69.7%	N/A	N/A	N/A	N/A	N/A	89.5%	$0.003	100	128K
29	DeepSeek V3, 2024-12 [Open Source, Good MATH, New]	DeepSeek	coding	80.1%	88.5%	59.1%	N/A	N/A	82.6%	N/A	N/A	90.2%	$0.004	95	64K
30	Claude 3.5 Sonnet v2, 2025-02 [Computer Use, Top SWE-Bench, Upgraded, New]	Anthropic	reasoning	79.3%	88.7%	65%	70.7%	N/A	93.7%	N/A	N/A	78.3%	$0.015	75	200K
31	Llama 3.1 405B, 2024-07 [Open Source, Largest Open]	Meta	reasoning	78.9%	88.6%	51.1%	64.5%	87%	89%	81.3%	96.8%	73.8%	$0.015	35	128K
32	Claude Sonnet 4, 2024-12 [Latest, High GPQA, New]	Anthropic	reasoning	78.8%	86.5%	83.8%	74.4%	N/A	N/A	N/A	N/A	70.5%	$0.025	60	200K
33	Qwen 2.5-Max, 2025-02 [MoE Architecture, Top Chinese Open, New]	Alibaba	reasoning	78.1%	87%	52.5%	N/A	N/A	88%	N/A	N/A	85%	$0.0016	95	128K
34	GPT-4 Turbo, 2024-04 [Highly Preferred, Balanced]	OpenAI	reasoning	77.6%	86.5%	48%	63.1%	94.2%	90.2%	87.6%	91%	72.2%	$0.03	45	128K
35	Llama 3.1 Nemotron 70B, 2025-01 [Open Source, RLHF Tuned, Top Arena, New]	NVIDIA	reasoning	77.5%	85%	55.8%	N/A	N/A	90%	N/A	N/A	79%	$0.0035	70	128K
36	GPT-4o (2025), 2025-05 [Updated, Best Voice, Image Gen, New]	OpenAI	multimodal	77.2%	89.5%	55%	70.2%	N/A	91.5%	N/A	N/A	80%	$0.0125	90	128K
37	Claude 3 Opus, 2024-03 [Premium, Complete Benchmarks]	Anthropic	reasoning	77.2%	86.8%	50.4%	59.4%	95.4%	84.9%	86.8%	95%	60.1%	$0.045	45	200K
38	GPT-4.1, 2024-11 [Latest GPT, New]	OpenAI	reasoning	77.1%	90.2%	66.3%	74.8%	N/A	N/A	N/A	N/A	N/A	$0.04	50	128K
39	Gemini 2.0 Pro Experimental, 2024-12 [Experimental, Good MATH, New]	Google	multimodal	77.1%	79.1%	64.7%	72.7%	N/A	N/A	N/A	N/A	91.8%	$0.015	60	1M
40	Qwen 2.5 72B, 2025-01 [Open Source, Apache 2.0, Best Open Coding, New]	Alibaba	coding	76.4%	86.1%	49%	N/A	N/A	87.2%	N/A	N/A	83.1%	$0.0009	88	128K
41	Claude 3.7 Sonnet (Normal), 2024-11 [Balanced, New]	Anthropic	reasoning	76.3%	83.2%	68%	71.8%	N/A	N/A	N/A	N/A	82.2%	$0.015	85	200K
42	Phi-4, 2025-01 [Open Source, Best-in-Class 14B, STEM Strong, New]	Microsoft	reasoning	75.9%	84.8%	56.1%	N/A	N/A	82.6%	N/A	N/A	80.4%	$0.0007	120	16K
43	Llama 4 Maverick, 2024-12 [Open Source, New]	Meta	reasoning	75.9%	84.6%	69.8%	73.4%	N/A	N/A	N/A	N/A	N/A	$0.005	85	128K
44	Llama 3.3 70B, 2024-10 [Open Source, Good Coding]	Meta	coding	75.5%	86%	50.5%	N/A	N/A	88.4%	N/A	N/A	77%	$0.006	90	128K
45	Mistral Large 2, 2025-01 [Open Weights, Multilingual, Function Calling, New]	Mistral AI	reasoning	75.4%	84%	49.6%	N/A	N/A	92%	N/A	N/A	76%	$0.006	80	128K
46	Mistral Medium 3, 2025-05 [Enterprise, Multilingual, New]	Mistral AI	reasoning	75.2%	83.5%	51%	N/A	N/A	90%	N/A	N/A	76.5%	$0.004	95	128K
47	GPT-4.1 mini, 2024-11 [Cost-Effective, New]	OpenAI	reasoning	75.1%	87.5%	65%	72.7%	N/A	N/A	N/A	N/A	N/A	$0.015	95	128K
48	Grok-2, 2024-08 [Good Coding]	xAI	coding	74.8%	87.5%	56%	66.1%	N/A	88.4%	N/A	N/A	76.1%	$0.01	75	128K
49	Grok 3, 2024-12 [Latest, New]	xAI	reasoning	74.3%	N/A	75.4%	73.2%	N/A	N/A	N/A	N/A	N/A	$0.012	70	128K
50	Gemini 1.5 Pro, 2024-02 [Largest Context, Complete Benchmarks]	Google	multimodal	73.6%	81.9%	46.2%	62.2%	92.5%	71.9%	84%	91.7%	58.5%	$0.0125	38	2M
51	Gemini 2.5 Flash Lite, 2024-12 [Latest, New]	Google	multimodal	71.8%	84.5%	66.7%	72.9%	N/A	N/A	N/A	N/A	63.1%	$0.01	75	1M
52	GPT-4, 2023-03 [Most Expensive, Classic]	OpenAI	reasoning	71.4%	86.4%	35.7%	56.8%	95.3%	67%	83.1%	92%	52.9%	$0.18	25	8K
53	Claude 3.5 Haiku (2025), 2025-04 [Updated, Fastest Claude, Computer Use, New]	Anthropic	conversation	70.3%	73.5%	43.2%	N/A	N/A	90.5%	N/A	N/A	74%	$0.004	150	200K
54	Llama 3.2 90B, 2024-09 [Open Source]	Meta	reasoning	69.6%	86%	46.7%	60.3%	N/A	N/A	N/A	86.9%	68%	$0.008	80	128K
55	Command R+ (2025), 2025-03 [RAG Optimized, Tool Use, Enterprise, New]	Cohere	reasoning	69.4%	82.3%	46%	N/A	N/A	80.5%	N/A	N/A	68.9%	$0.0025	85	128K
56	Claude 3 Sonnet, 2024-03 [Balanced, Complete Benchmarks]	Anthropic	reasoning	69.1%	79%	40.4%	53.1%	89%	73%	82.9%	92.3%	43.1%	$0.012	90	200K
57	Gemini 1.5 Flash, 2024-05 [Fast, Complete Benchmarks]	Google	multimodal	68.6%	78.9%	39.5%	56.1%	81.3%	67.5%	89.2%	68.8%	67.7%	$0.008	95	1M
58	Mistral Small 3, 2025-03 [Open Weights, Ultra Efficient, Apache 2.0, New]	Mistral AI	conversation	68.1%	81.5%	42%	N/A	N/A	83%	N/A	N/A	66%	$0.001	130	32K
59	Gemma 3 27B, 2025-03 [Open Source, Multimodal, Apache 2.0, New]	Google	multimodal	67.9%	78.4%	42%	68.5%	N/A	79%	N/A	N/A	71.5%	$0.0003	110	128K
60	GPT-4o mini, 2024-07 [Cost-Effective]	OpenAI	conversation	67.8%	82%	40.2%	59.4%	N/A	87.2%	N/A	N/A	70.2%	$0.007	120	128K
61	Llama 4 Scout, 2024-12 [Least Expensive, Open Source, New]	Meta	conversation	67%	74.3%	57.2%	69.4%	N/A	N/A	N/A	N/A	N/A	$0.0003	120	128K
62	Amazon Nova Pro, 2025-01 [AWS Native, Multimodal, Cost-Effective, New]	Amazon	multimodal	66.9%	80%	44%	63.5%	N/A	79%	N/A	N/A	68%	$0.0008	100	300K
63	Claude 3.5 Haiku, 2024-11 [Fast, Cost-Effective, New]	Anthropic	conversation	66%	65%	41.6%	N/A	N/A	88.1%	N/A	N/A	69.2%	$0.005	140	200K
64	Claude 3 Haiku, 2024-03 [Fast, Complete Benchmarks]	Anthropic	conversation	65.3%	75.2%	33.3%	50.2%	85.9%	75.9%	73.7%	88.9%	38.9%	$0.004	160	200K
65	Phi-4 Mini, 2025-04 [Open Source, Edge Deployable, 3.8B Params, New]	Microsoft	conversation	63.4%	75.6%	37.3%	N/A	N/A	73%	N/A	N/A	67.5%	$0.0001	200	16K
66	GPT-4.1 nano, 2024-11 [Ultra Fast, New]	OpenAI	conversation	61.9%	80.1%	50.3%	55.4%	N/A	N/A	N/A	N/A	N/A	$0.005	150	32K
67	Gemma 3 12B, 2025-03 [Open Source, Lightweight, On-Device, New]	Google	conversation	60.2%	74.2%	37%	59.8%	N/A	72%	N/A	N/A	58%	$0.0001	160	128K
68	Amazon Nova Lite, 2025-01 [AWS Native, Ultra Fast, Cheapest Multimodal, New]	Amazon	conversation	56.6%	73%	33%	55%	N/A	68%	N/A	N/A	54%	$0.00006	180	300K
69	o3-pro, 2025-01 [Upcoming, New]	OpenAI	reasoning	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	$0.08	20	128K
70	GPT-4o Realtime, 2024-10 [Realtime, Voice]	OpenAI	conversation	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	$0.02	200	128K
71	GPT-4o mini Realtime, 2024-10 [Realtime, Voice, Cost-Effective]	OpenAI	conversation	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	$0.008	250	128K

How to Use the LLM Leaderboard

Choosing the right AI model matters because cost, speed, and accuracy vary significantly across providers. This leaderboard compares 71 large language models across eight industry-standard benchmarks (MMLU, GPQA, MMMU, HellaSwag, HumanEval, BBHard, GSM8K, and MATH) so you can make evidence-based decisions.

Step 1: Browse the main leaderboard to review overall rankings by composite score.

Step 2: Open the Performance Charts tab to visualize strengths across benchmarks like MMLU, GPQA, and HumanEval.

Step 3: Use Model Comparison to evaluate 2-3 models side by side on metrics relevant to your use case.

Step 4: Review Benchmark Details for scoring methodology and context.
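
For Step 3, one way to compare a few models on the metrics that matter for a given use case is a weighted average over selected benchmarks. The sketch below is only an illustration, not how the leaderboard itself scores models: the weights are hypothetical values for a coding assistant, the name `weighted_score` is made up for this example, and the scores are copied from the rankings table with `None` standing in for N/A.

```python
from typing import Optional

# Benchmark scores in percent, copied from the rankings table.
# None stands in for a benchmark the table lists as N/A.
SCORES: dict[str, dict[str, Optional[float]]] = {
    "Claude 3.5 Sonnet": {"MMLU": 88.7, "GPQA": 59.4, "HumanEval": 92.0, "MATH": 71.1},
    "GPT-4o": {"MMLU": 88.7, "GPQA": 53.6, "HumanEval": 90.2, "MATH": 76.6},
    "DeepSeek R1": {"MMLU": 90.8, "GPQA": 71.5, "HumanEval": None, "MATH": 97.3},
}

# Hypothetical weights for a coding-assistant use case (an assumption,
# not something the leaderboard prescribes).
CODING_WEIGHTS: dict[str, float] = {
    "HumanEval": 0.5,
    "MATH": 0.2,
    "MMLU": 0.2,
    "GPQA": 0.1,
}


def weighted_score(
    scores: dict[str, Optional[float]], weights: dict[str, float]
) -> Optional[float]:
    """Weighted average over the benchmarks that have a score.

    Missing benchmarks are skipped and the remaining weights are
    renormalized. Returns None if no weighted benchmark has a score.
    """
    total = 0.0
    weight_used = 0.0
    for benchmark, weight in weights.items():
        value = scores.get(benchmark)
        if value is None:
            continue
        total += weight * value
        weight_used += weight
    if weight_used == 0:
        return None
    return total / weight_used


def missing_benchmarks(
    scores: dict[str, Optional[float]], weights: dict[str, float]
) -> list[str]:
    """Names of weighted benchmarks that have no score for this model."""
    return [b for b in weights if scores.get(b) is None]


for name, scores in SCORES.items():
    result = weighted_score(scores, CODING_WEIGHTS)
    label = "n/a" if result is None else f"{result:.1f}"
    print(f"{name}: {label} (missing: {missing_benchmarks(scores, CODING_WEIGHTS)})")
```

Renormalizing over the available benchmarks keeps models with N/A entries in the comparison, but it can flatter them: DeepSeek R1 is ranked here without its most heavily weighted benchmark. Treat any score with missing inputs as provisional.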

Whether you are building customer support automation, coding assistants, or content generation workflows, selecting the right model can save substantial API spend. Use the Cost/1K and TPS (tokens per second) columns to find the balance of quality, price, and speed that fits your budget. A model that scores 5% lower but costs 80% less may be the practical winner.
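
The trade-off described above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming the first numeric column of each table row is the composite score that the ranking follows. The `Model` class and the `shortlist` and `tradeoff` helpers are hypothetical names for this example; the figures are copied from the rankings table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    score: float        # overall score in percent (assumed composite score)
    cost_per_1k: float  # USD, from the Cost/1K column
    tps: int            # throughput, from the TPS column


# A few rows copied from the rankings table.
MODELS = [
    Model("Gemini 3.1 Pro", 92.2, 0.002, 60),
    Model("o1", 88.4, 0.06, 15),
    Model("DeepSeek V3.2", 84.2, 0.00014, 95),
    Model("GPT-4o", 82.2, 0.015, 85),
    Model("Phi-4 Mini", 63.4, 0.0001, 200),
]


def shortlist(models: list[Model], max_cost: float, min_tps: int) -> list[Model]:
    """Models within the cost and speed limits, highest score first."""
    eligible = [m for m in models if m.cost_per_1k <= max_cost and m.tps >= min_tps]
    return sorted(eligible, key=lambda m: m.score, reverse=True)


def tradeoff(baseline: Model, candidate: Model) -> tuple[float, float]:
    """Relative score drop and cost saving of candidate vs. baseline, in percent."""
    score_drop = (baseline.score - candidate.score) / baseline.score * 100
    cost_saving = (baseline.cost_per_1k - candidate.cost_per_1k) / baseline.cost_per_1k * 100
    return score_drop, cost_saving


# Which models cost at most $0.01 per 1K and deliver at least 80 TPS?
for m in shortlist(MODELS, max_cost=0.01, min_tps=80):
    print(f"{m.name}: score {m.score}%, ${m.cost_per_1k}/1K, {m.tps} TPS")

# How much score is given up, and how much cost saved, by moving from o1
# to DeepSeek V3.2?
drop, saving = tradeoff(MODELS[1], MODELS[2])
print(f"score drop {drop:.1f}%, cost saving {saving:.1f}%")
```

With the table's figures, moving from o1 to DeepSeek V3.2 gives up a little under 5% of the score while cutting the listed cost by more than 99%. This is the kind of gap the Cost/1K column is meant to surface.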

The leaderboard is updated regularly as new models release and benchmarks evolve. Recheck it before major infrastructure decisions so your stack reflects current capabilities and pricing realities.

For practical evaluation, shortlist models from the leaderboard and run your own task-specific test set before rollout. Benchmarks provide directional guidance, but domain prompts, latency expectations, and compliance constraints can change final selection. Combining public rankings with internal testing gives the most reliable model choice.
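
A task-specific test set can start very small. The following is a minimal sketch, assuming each shortlisted model can be wrapped in a function that takes a prompt and returns text. The `accuracy` helper, the sample cases, and the `fake_model` stand-in are all hypothetical; a real run would replace `fake_model` with a call to the provider's API and would likely need a stricter check than substring matching.

```python
from typing import Callable, Iterable


def accuracy(ask: Callable[[str], str], cases: Iterable[tuple[str, str]]) -> float:
    """Fraction of cases whose answer contains the expected text.

    Matching is a case-insensitive substring check, which is crude but
    enough for a first pass over short factual answers.
    """
    case_list = list(cases)
    if not case_list:
        raise ValueError("test set is empty")
    hits = sum(
        expected.lower() in ask(prompt).lower() for prompt, expected in case_list
    )
    return hits / len(case_list)


# Hypothetical domain prompts with the text a correct answer must contain.
CASES = [
    ("What is your refund window?", "30 days"),
    ("Do you ship to Canada?", "yes"),
    ("How do I reset my password?", "reset link"),
]


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call; answers two of the three prompts."""
    canned = {
        "What is your refund window?": "Refunds are accepted within 30 days of purchase.",
        "How do I reset my password?": "We email you a reset link.",
    }
    return canned.get(prompt, "I am not sure.")


print(f"accuracy: {accuracy(fake_model, CASES):.2f}")
```

Running the same cases against each shortlisted model gives a like-for-like number to set beside the public benchmark scores.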
