Scale AI Finds Up to 13% Accuracy Drops in LLMs with New GSM8K Test
May 2, 2024, 02:49 AM
Scale AI has evaluated large language models (LLMs) on GSM1k, a new, private counterpart to the public Grade School Math (GSM8K) test set. Popular models such as Phi and Mistral showed accuracy drops of up to 13% on the new set, which is evidence of systematic overfitting to the public benchmark. Scale AI's SEAL team led the initiative and created GSM1k specifically to assess that overfitting. Other models tested, including GPT-4, Claude, and Gemini, showed no signs of it. The effort aims to address two persistent problems in LLM evaluation: data contamination and the manipulation of benchmark results.