Which benchmark will o3 improve most by end of 2025?
ARC-AGI • 25%
Frontier Math • 25%
SWE-Bench • 25%

ARC-AGI Semi-Private Evaluation • 25%
SWE-Bench Verified test • 25%
AIME • 25%
GPQA-Diamond benchmark • 25%
Official benchmark results published by OpenAI or relevant benchmark authorities
OpenAI Unveils o3 and o3-mini, Surpasses Human-Level ARC-AGI Performance, Sets New AI Benchmarks
Dec 20, 2024, 06:23 PM
OpenAI has unveiled o3 and o3-mini, its next-generation reasoning models, designed to improve AI's ability to adapt to novel tasks through a 'private chain of thought' and self-fact-checking. Announced on December 20, 2024, o3 surpasses previous models across a range of benchmarks, scoring 87.5% on the ARC-AGI Semi-Private Evaluation in high-compute mode, above the 85% human-performance threshold, with high-compute tasks costing roughly $3,500 each. In low-compute mode, o3 scored 75.7% on the same evaluation.

The model also set records on other technical benchmarks: 71.7% on the SWE-Bench Verified test, a 2727 rating on Codeforces, 96.7% on the American Invitational Mathematics Examination (AIME), and 87.7% on the GPQA-Diamond benchmark. It achieved 25.2% on Epoch AI's FrontierMath problems, a significant jump from the previous best of 2%.

François Chollet, a prominent AI researcher, said that o3 represents "not merely incremental improvement, but a genuine breakthrough" in AI's ability to adapt to novel tasks. OpenAI plans to release o3-mini to the public by the end of January 2025, with o3 following shortly after. The name skips over 'o2' to avoid a potential trademark conflict with the British telecommunications firm O2.
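For scale, here is a minimal back-of-the-envelope sketch in Python using only the figures reported above; the 100-task size assumed for the semi-private evaluation is a hypothetical round number, not something stated in the story.

```python
# Rough arithmetic on the reported o3 figures.
# Assumption (not from the story): the ARC-AGI Semi-Private
# Evaluation is treated as ~100 tasks, purely for illustration.

COST_PER_HIGH_COMPUTE_TASK = 3_500  # USD per task, as reported
ASSUMED_TASK_COUNT = 100            # hypothetical round figure

frontier_math_new = 25.2  # % solved by o3, as reported
frontier_math_old = 2.0   # % solved by the previous best, as reported

implied_eval_cost = COST_PER_HIGH_COMPUTE_TASK * ASSUMED_TASK_COUNT
improvement = frontier_math_new / frontier_math_old

print(f"Implied high-compute eval cost: ~${implied_eval_cost:,}")      # ~$350,000
print(f"FrontierMath improvement: {improvement:.1f}x over prior best")  # 12.6x
```

On those assumptions, a single high-compute pass over the evaluation would run on the order of $350,000, which puts the much cheaper low-compute 75.7% result in context.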
Below 70% • 25%
70% to 75% • 25%
75% to 80% • 25%
Above 80% • 25%

Below 85% • 25%
85% to 90% • 25%
90% to 95% • 25%
Above 95% • 25%

0-1 benchmarks • 25%
2-3 benchmarks • 25%
4-5 benchmarks • 25%
More than 5 benchmarks • 25%

GPT-4o • 33%
Gemini 1.5 • 33%
Claude 3.5 Sonnet • 34%

GPQA Diamond (CoT) • 25%
Math (CoT) • 25%
Other AI Benchmark • 25%
None • 25%

Top 100 • 25%
Top 200 • 25%
Top 300 • 25%
Below Top 300 • 25%

Chatbots • 25%
Virtual assistants • 25%
Content generation • 25%
Coding • 25%

Less than 5% • 25%
5% to 10% • 25%
10% to 20% • 25%
More than 20% • 25%

OpenAI o1 • 33%
Anthropic • 33%
Other • 33%

o3 remains the top performer • 25%
Another model surpasses o3 • 25%
o3 ties with another model • 25%
No new models tested • 25%

Cosine • 25%
Amazon • 25%
Cognition • 25%
Other • 25%

Microsoft • 25%
Other • 25%
Amazon • 25%
Google • 25%