Top Performing Model Trained on FineWeb by 2025
LLaMA-3 • 25%
BERT-Enhanced • 25%
GPT-4 • 25%
Transformer-XL • 25%
Resolution criteria: benchmarking results published in peer-reviewed journals or major AI conferences.
HuggingFace Launches FineWeb, a 15T Token Dataset Outperforming Others
Apr 21, 2024, 08:02 AM
HuggingFace has released FineWeb, a new dataset comprising 15 trillion tokens of high-quality, deduplicated web data sourced from CommonCrawl crawls spanning 2013 to 2024. The 275GB dataset, available on HuggingFace under an Open Data Commons license, has been shown to outperform existing datasets such as RefinedWeb, C4, DolmaV1.6, The Pile, and SlimPajama on various benchmarks, including 350B-token ablation runs. FineWeb has been used to train models such as LLaMA-3, with significant improvements attributed to the dataset's scale and quality. Given its size, FineWeb is expected to support extensive training runs, and it has been open-sourced for broad accessibility.
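For readers who want to inspect the data, here is a minimal sketch of streaming FineWeb with the HuggingFace `datasets` library. The repo id `HuggingFaceFW/fineweb`, the `sample-10BT` sample config, and the `text` field are assumptions drawn from the Hub dataset card rather than details given in this story; streaming is used because downloading the full corpus up front is impractical at this scale.

```python
# Sketch: stream a few FineWeb records without downloading the full dataset.
# Assumed: repo id "HuggingFaceFW/fineweb", sample config "sample-10BT",
# and a "text" field per record -- verify against the dataset card.
from datasets import load_dataset

fineweb = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",   # assumed 10B-token sample config
    split="train",
    streaming=True,       # iterate lazily; no full download
)

# Each record carries extracted web text plus CommonCrawl metadata.
for i, record in enumerate(fineweb):
    print(record["text"][:200])  # preview the first 200 characters
    if i >= 2:
        break
```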