DeepNewz Markets

Market

Which large language model will achieve the highest pass@200 score on LiveCodeBench using PlanSearch by the end of 2024?

AI • California Institute of Technology • Northeastern University • Cursor AI • PlanSearch • LiveCodeBench

Resolution / Starting Odds

Claude 3.5 • 25%

GPT-4 • 25%

Bard • 25%

Other • 25%

Resolution source: LiveCodeBench official results or credible AI research publications

Story

Scale AI's PlanSearch Boosts Claude 3.5 to 77.0% pass@200 on LiveCodeBench

Sep 8, 2024, 03:06 PM

Scale AI, in collaboration with the California Institute of Technology, Northeastern University, and Cursor AI, has introduced PlanSearch, a test-time compute method presented as a new state of the art (SOTA). The algorithm generates high-level plans in natural language and uses them to guide code generation, which is reported to improve the diversity and efficiency of large language model (LLM) code generation as well as the quality of the resulting code. Notably, Claude 3.5 with PlanSearch achieved a pass@200 of 77.0% on LiveCodeBench, compared with the best score achieved without the search algorithm (pass@1 = 41.4%). The two figures use different sample budgets: pass@200 counts a problem as solved if any of 200 generated solutions passes, whereas pass@1 allows a single attempt. The optillm library, an optimizing inference proxy, implements the core idea of PlanSearch. The work involves researchers E Wang, F Cassano, C Wu, and Y Bai.
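The story does not say how the pass@k figures were computed. As a rough illustration, the sketch below uses the unbiased pass@k estimator that is commonly used for code-generation benchmarks: given n sampled solutions of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). Whether the reported 77.0% and 41.4% were computed exactly this way is an assumption, and the function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem (illustrative sketch).

    n: number of solutions sampled for the problem
    c: number of those solutions that pass all tests
    k: sample budget being scored (e.g. 1 or 200)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset contains no passing solution.
    all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - all_fail


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With 200 samples per problem, pass@200 reduces to "at least one sample passed", while pass@1 is the fraction of samples that pass, which is why the two headline numbers are not directly comparable.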

Similar markets

Which speech recognition model will rank highest in performance benchmarks by end of 2024?

aiOla's Whisper-Medusa • 25%

OpenAI's Whisper • 25%

Google's Speech-to-Text • 25%

Other • 25%

Which model will achieve the highest vision-language performance by December 31, 2024?

GPT-4o • 25%

InternVL 2 • 25%

NVLM 1.0 • 25%

Other • 25%

Which AI model will be ranked first on LiveBench AI on December 31, 2024?

OpenAI o1-preview • 25%

Anthropic Claude 3.5 Sonnet • 25%

OpenAI o1 mini • 25%

Other • 25%

Will OpenAI's o1-preview model maintain the top spot on LiveBench AI by end of 2024?

Yes • 50%

No • 50%

Which open-source AI model will have the highest MMLU score by December 31, 2024?

Apple's 7B AI model • 25%

Mistral 7B • 25%

Llama 3 8B • 25%

Google's Gemma • 25%

Which model will be predominantly used for GitHub Copilot completions by the end of 2025?

Anthropic's Claude 3.5 Sonnet • 25%

OpenAI's GPT-4o • 25%

A new model • 25%

Other existing model • 25%

Which AI model will achieve the highest performance benchmark by December 31, 2024?

Meta's Llama 3.1-70B • 25%

OpenAI's GPT-4 • 25%

Google's Bard • 25%

Other • 25%

Which LiquidAI model will achieve the highest score on SuperGLUE by end of 2024?

1B parameter model • 25%

3B parameter model • 25%

40B parameter model • 25%

None of the models achieve the highest score • 25%

First company to release a new AI language model after August 2024?

OpenAI • 25%

Google • 25%

Microsoft • 25%

Other • 25%

Which AI model will have the highest multilingual support by end of 2024?

Llama 3.1 405B • 25%

GPT-4o • 25%

Claude Sonnet 3.5 • 25%

Other • 25%

Top AI model on SWE-bench by March 2025?

Llama 3.1 8B • 25%

GPT-4 • 25%

Llama 3.1 70B • 25%

Other • 25%

Which AI model will achieve the highest score in the MATH benchmark by end of 2024?

DeepSeek-R1-Lite-Preview • 25%

OpenAI's o1-preview • 25%

Google DeepMind's model • 25%

Other • 25%

Markets based on same story

Will another major AI company adopt PlanSearch by the end of 2024?

No • 50%

Yes • 50%

Will Claude 3.5 achieve a pass@200 score of 80% or higher on LiveCodeBench by the end of 2024?

No • 50%

Yes • 50%

Will PlanSearch be integrated into a publicly available open-source LLM library by mid-2024?

Yes • 50%

No • 50%

Which major AI company will adopt PlanSearch first by the end of 2024?

Other • 25%

OpenAI • 25%

Google DeepMind • 25%

Microsoft • 25%