Which AI model will be the top performer in RewardBench by end of 2024?
Llama 3-70B • 25%
GPT-4 • 25%
Claude 2.0 • 25%
Other • 25%
Resolves based on official RewardBench results published by recognized institutions.
Meta FAIR's Self-Taught Evaluators Boost Llama 3-70B, Surpass GPT-4 in AI Evaluation
Aug 6, 2024, 03:47 PM
Meta's FAIR division has introduced 'Self-Taught Evaluators', an approach that improves the evaluation of language models without human annotations. The method relies on synthetic training data and an iterative self-improvement scheme: for a given input, it generates contrasting model outputs, then trains a language model to act as a judge, producing a reasoning trace followed by a final judgment. Judges trained this way outperform commonly used language-model judges such as GPT-4 and match top reward models trained on human-labeled examples. Notably, the approach raised Llama 3-70B's RewardBench score to 88.3, or 88.7 with majority voting, surpassing larger models and judges trained on human labels.
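To make the loop concrete, here is a minimal Python sketch of the scheme described above: generate a contrasting response pair, judge the pair with the current model, keep only judgments that agree with the known construction as synthetic training data, and use majority voting at inference. All helper names (llm_generate, make_contrasting_pair, judge) are hypothetical stand-ins, not Meta's actual code or API; the stub returns canned verdicts so the sketch runs on its own, and the fine-tuning step between iterations is elided.

```python
import random
from collections import Counter

def llm_generate(prompt: str, seed: int = 0) -> str:
    """Hypothetical stand-in for a language-model call; a real pipeline
    would query Llama 3-70B (or similar) here."""
    random.seed(hash(prompt) + seed)
    verdict = random.choice(["A", "B"])
    return f"Reasoning trace for: {prompt[:40]}...\nVerdict: {verdict}"

def make_contrasting_pair(instruction: str) -> tuple[str, str]:
    """Generate a preferred response and a deliberately inferior one.
    The reported scheme obtains the inferior response by answering a
    subtly modified version of the instruction; this stub just tags
    the two roles."""
    good = llm_generate(f"Answer well: {instruction}")
    bad = llm_generate(f"Answer a subtly different question: {instruction}")
    return good, bad

def judge(instruction: str, resp_a: str, resp_b: str, seed: int) -> str:
    """Ask the current model to act as judge, emitting a reasoning
    trace and a final verdict ('A' or 'B')."""
    prompt = (f"Instruction: {instruction}\n"
              f"A: {resp_a}\nB: {resp_b}\nWhich is better?")
    output = llm_generate(prompt, seed)
    return output.rsplit("Verdict:", 1)[-1].strip()

def self_taught_iteration(instructions: list[str]) -> list[dict]:
    """One round of self-improvement: build synthetic preference data
    from judgments that agree with the known construction (the good
    response is placed in slot 'A'). The kept (instruction, verdict)
    records would form the fine-tuning set for the next iteration."""
    train_data = []
    for instr in instructions:
        good, bad = make_contrasting_pair(instr)
        # Sample several judgments; keep one that picks the
        # known-better response.
        for seed in range(4):
            verdict = judge(instr, good, bad, seed)
            if verdict == "A":
                train_data.append({"instruction": instr, "verdict": verdict})
                break
    return train_data

def majority_vote(instruction: str, resp_a: str, resp_b: str, n: int = 5) -> str:
    """Inference-time trick behind the 88.7 figure in the story:
    sample several judgments and take the most common verdict."""
    votes = Counter(judge(instruction, resp_a, resp_b, seed=i) for i in range(n))
    return votes.most_common(1)[0][0]

if __name__ == "__main__":
    data = self_taught_iteration(["Summarize the causes of WWI."])
    print(f"kept {len(data)} synthetic judgments")
    print("majority verdict:", majority_vote("Summarize...", "resp A", "resp B"))
```

In the full method, each round fine-tunes the judge on the synthetic data it just produced and then repeats the loop with the improved model, which is how the evaluator bootstraps past its starting point without any human labels.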