DeepNewz Markets

Market

Will Scale AI launch a new LLM evaluation by Nov 2024?

Scale AI•Grade School Math•Phi•Mistral•SEAL•Claude•Gemini

Resolution / Starting Odds

Yes • 50%

No • 50%

Resolution source: official announcements from Scale AI or credible tech news sources

Story

Scale AI Finds Up to 13% Accuracy Drops in LLMs with New GSM8K Test

May 2, 2024, 02:49 AM

Scale AI has launched a new evaluation for large language models (LLMs) built on GSM1k, an updated, private version of the Grade School Math (GSM8K) test set. The evaluation, led by Scale AI's SEAL team, was designed specifically to assess overfitting on public benchmarks. It revealed accuracy drops of up to 13% in popular models such as Phi and Mistral, which showed evidence of systematic overfitting. Other models tested, including GPT-4, Claude, and Gemini, showed no signs of overfitting. The effort aims to address the ongoing problems of data contamination and manipulated benchmark results in LLM evaluations.


Similar markets

Scale AI to launch a new AGI product by end of 2024?

Scale AI to secure additional funding in 2024?

Will Scale AI acquire another company by 2024?

Scale AI Breakthrough in AI Data Processing by 2024?

Scale AI partners with Fortune 500 company by end of 2024?

Will Scale AI SEAL Leaderboards include more than 50 LLMs by September 30, 2024?

Will Scale AI reach $1.4 billion in annual recurring revenue by 2024?

Will Scale AI achieve profitability by end of 2024?

Scale AI Leaderboards Show 50% Improvement in Model Performance by November 2024?

Additional investment in Scale AI in 2024?

Will Scale AI have a more diverse workforce by end of 2024?

Scale AI Client Expansion by Mid-2025?

Will other AI companies adopt GSM1k for evaluations by May 2025?

Will Phi and Mistral improve on benchmarks by Nov 2024?

Best Performing LLM on GSM1k by End of 2024

First Company to Announce New LLM Benchmark Post-Scale AI
