HuggingFace Launches FineWeb, a 15T Token Dataset Outperforming Others
Apr 21, 2024, 08:02 AM
HuggingFace has released a new dataset named FineWeb, available on the HuggingFace Hub, comprising 15 trillion tokens of high-quality, deduplicated web data sourced from CommonCrawl crawls spanning 2013 to 2024. The 275GB dataset, released under an Open Data Commons license, has been shown to outperform existing datasets such as RefinedWeb, C4, DolmaV1.6, The Pile, and SlimPajama across benchmarks, including 350B-token ablations. FineWeb has been used to train models like LLaMA-3, demonstrating significant improvements attributable to its scale and quality. Given its size, the dataset is expected to support extensive training runs, and it has been made open-source for broad accessibility.
Markets
No • 50%
Yes • 50%
Resolution source: Official announcements from HuggingFace or documented updates in their repositories.
No • 50%
Yes • 50%
Resolution source: Press releases, product updates, or commercial licensing information from HuggingFace.
Yes • 50%
No • 50%
Resolution source: Market analysis reports and usage statistics from HuggingFace and other relevant dataset platforms.
FineWeb • 20%
The Pile • 20%
DolmaV1.6 • 20%
C4 • 20%
RefinedWeb • 20%
Resolution source: Industry reports, usage data published by dataset providers, and market analysis.
Telecommunications • 25%
Healthcare • 25%
Finance • 25%
Automotive • 25%
Resolution source: Industry uptake reports and sectoral analyses of technology adoption.
GPT-4 • 25%
Transformer-XL • 25%
LLaMA-3 • 25%
BERT-Enhanced • 25%
Resolution source: Benchmarking results published in peer-reviewed journals or at major AI conferences.