Which benchmark will o3 improve most by end of 2025?
ARC-AGI • 25%
Frontier Math • 25%
SWE-Bench • 25%

ARC-AGI Semi-Private Evaluation • 25%
SWE-Bench Verified test • 25%
AIME • 25%
GPQA-Diamond benchmark • 25%
Official benchmark results published by OpenAI or relevant benchmark authorities
OpenAI Unveils o3 and o3-mini, Surpasses Human-Level ARC-AGI Performance, Sets New AI Benchmarks
Dec 20, 2024, 06:23 PM
OpenAI has unveiled o3 and o3-mini, its next-generation reasoning models, designed to improve AI's ability to adapt to novel tasks through a 'private chain of thought' and self-fact-checking. Announced on December 20, 2024, o3 surpasses previous models across a range of benchmarks, scoring 87.5% on the ARC-AGI Semi-Private Evaluation in high-compute mode, above the 85% human-performance threshold, with high-compute tasks costing roughly $3,500 each. In low-compute mode, o3 scored 75.7% on the same evaluation.

The model also set records on other technical benchmarks: 71.7% on the SWE-Bench Verified test, a 2727 rating on Codeforces, 96.7% on the American Invitational Mathematics Examination (AIME), and 87.7% on the GPQA-Diamond benchmark. It achieved 25.2% on Epoch AI's FrontierMath problems, a significant jump from the previous best of 2%.

François Chollet, a prominent AI researcher, said that o3 represents "not merely incremental improvement, but a genuine breakthrough" in AI's ability to adapt to novel tasks. OpenAI plans to release o3-mini to the public by the end of January 2025, with o3 following shortly after. The name skips over 'o2' to avoid a potential trademark conflict with the British telecommunications firm O2.
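For scale, here is a minimal back-of-the-envelope sketch in Python using only the figures reported above; the 100-task size assumed for the semi-private evaluation is a hypothetical round number, not something stated in the story.

```python
# Rough arithmetic on the reported o3 figures.
# Assumption (not from the story): the ARC-AGI Semi-Private
# Evaluation is treated as ~100 tasks, purely for illustration.

COST_PER_HIGH_COMPUTE_TASK = 3_500  # USD per task, as reported
ASSUMED_TASK_COUNT = 100            # hypothetical round figure

frontier_math_new = 25.2  # % solved by o3, as reported
frontier_math_old = 2.0   # % solved by the previous best, as reported

implied_eval_cost = COST_PER_HIGH_COMPUTE_TASK * ASSUMED_TASK_COUNT
improvement = frontier_math_new / frontier_math_old

print(f"Implied high-compute eval cost: ~${implied_eval_cost:,}")      # ~$350,000
print(f"FrontierMath improvement: {improvement:.1f}x over prior best")  # 12.6x
```

On those assumptions, a single high-compute pass over the evaluation would run on the order of $350,000, which puts the much cheaper low-compute 75.7% result in context.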
Below 70% • 25%
70% to 75% • 25%
75% to 80% • 25%
Above 80% • 25%

Below 85% • 25%
85% to 90% • 25%
90% to 95% • 25%
Above 95% • 25%

0-1 benchmarks • 25%
2-3 benchmarks • 25%
4-5 benchmarks • 25%
More than 5 benchmarks • 25%

GPT-4o • 33%
Gemini 1.5 • 33%
Claude 3.5 Sonnet • 34%

GPQA Diamond (CoT) • 25%
Math (CoT) • 25%
Other AI Benchmark • 25%
None • 25%

Top 100 • 25%
Top 200 • 25%
Top 300 • 25%
Below Top 300 • 25%

Chatbots • 25%
Virtual assistants • 25%
Content generation • 25%
Coding • 25%

Less than 5% • 25%
5% to 10% • 25%
10% to 20% • 25%
More than 20% • 25%

OpenAI o1 • 33%
Anthropic • 33%
Other • 33%

o3 remains the top performer • 25%
Another model surpasses o3 • 25%
o3 ties with another model • 25%
No new models tested • 25%

Cosine • 25%
Amazon • 25%
Cognition • 25%
Other • 25%

Microsoft • 25%
Other • 25%
Amazon • 25%
Google • 25%