What will be the ARC-AGI high-compute performance of 'o3' by end of 2025?
Below 85% • 25%
85% to 90% • 25%
90% to 95% • 25%
Above 95% • 25%
Resolution source: published results on OpenAI's official channels or in peer-reviewed publications
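As a worked illustration of how the four answer buckets above partition the score range, here is a minimal Python sketch that maps a reported ARC-AGI high-compute score to one of the options. The handling of scores that land exactly on a boundary (85%, 90%, 95%) is an assumption; the market's resolution fine print would govern those edge cases.

```python
# Minimal sketch: map a reported ARC-AGI high-compute score (in percent)
# to this market's answer buckets. Boundary handling at exactly 85%, 90%,
# and 95% is an assumption, not taken from the market's rules.

def resolve_bucket(score_pct: float) -> str:
    """Return the answer bucket for a given ARC-AGI score in percent."""
    if score_pct < 85.0:
        return "Below 85%"
    elif score_pct <= 90.0:
        return "85% to 90%"
    elif score_pct <= 95.0:
        return "90% to 95%"
    else:
        return "Above 95%"

# o3's reported high-compute score of 87.5% falls in the second bucket.
print(resolve_bucket(87.5))  # -> "85% to 90%"
```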
OpenAI's 'o3' Models with Breakthrough AI Reasoning Surpass Human Performance on ARC-AGI
Dec 20, 2024, 06:10 PM
OpenAI has announced its latest AI reasoning models, 'o3' and 'o3-mini', marking a significant advance in AI capabilities. The 'o3' model, successor to 'o1', skips the name 'o2' because of potential trademark conflicts with the telecommunications company O2. Like its predecessor, 'o3' is designed to produce more thoughtful, contextual responses by 'thinking' before it answers via a 'private chain of thought'.

OpenAI collaborated with the team behind the ARC Prize to evaluate 'o3' on ARC-AGI, and testers describe the result as a qualitative shift in AI capabilities relative to the known limitations of large language models. 'o3' achieved state-of-the-art performance across several benchmarks: it scored 87.5% in high-compute mode on the ARC-AGI semi-private evaluation, surpassing the estimated human baseline of 85%, and 75.7% in low-compute mode. On the FrontierMath benchmark, 'o3' solved 25.2% of the problems, a substantial increase from the previous best of about 2%. It also scored 71.7% on SWE-Bench Verified, more than 20 percentage points above 'o1', and reached a Codeforces rating of 2727, roughly the level of the 175th-best human competitive programmer.

The models are currently available to a limited group of outside researchers for safety testing. 'o3-mini' is expected to launch publicly by the end of January 2025, with 'o3' to follow shortly after.
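To make the reported deltas concrete, here is a small Python sketch that tabulates the scores cited above and computes the FrontierMath improvement factor. All numbers come from the announcement summarized here; the dictionary name and layout are illustrative.

```python
# Reported o3 results from OpenAI's December 2024 announcement, as
# summarized above. Values are percentages, except the Codeforces
# entry, which is an Elo-style rating.
o3_results = {
    "ARC-AGI (high-compute)": 87.5,
    "ARC-AGI (low-compute)": 75.7,
    "FrontierMath": 25.2,
    "SWE-Bench Verified": 71.7,
    "Codeforces rating": 2727,
}

# FrontierMath: 25.2% against a previous best of roughly 2%
# works out to about a 12.6x improvement.
previous_best_frontiermath = 2.0
factor = o3_results["FrontierMath"] / previous_best_frontiermath
print(f"FrontierMath improvement: {factor:.1f}x")  # -> 12.6x
```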
Google DeepMind • 25%
Anthropic • 25%
Meta AI • 25%
Other • 25%

o3 remains the top performer • 25%
Another model surpasses o3 • 25%
o3 ties with another model • 25%
No new models tested • 25%

ARC-AGI • 25%
FrontierMath • 25%
SWE-Bench • 25%

ARC-AGI Semi-Private Evaluation • 25%
SWE-Bench Verified test • 25%
AIME • 25%
GPQA-Diamond benchmark • 25%

Less than 5% • 25%
5% to 10% • 25%
10% to 20% • 25%
More than 20% • 25%

0-1 benchmarks • 25%
2-3 benchmarks • 25%
4-5 benchmarks • 25%
More than 5 benchmarks • 25%

1st • 25%
2nd • 25%
3rd • 25%
Below 3rd • 25%

790-793 • 25%
794-796 • 25%
797-799 • 25%
800 • 25%

Better than NVIDIA H100 • 25%
Comparable to NVIDIA H100 • 25%
Worse than NVIDIA H100 • 25%
Other • 25%

Below 70% • 25%
70% to 75% • 25%
75% to 80% • 25%
Above 80% • 25%