Most Popular AI Training Dataset by End of 2024
FineWeb • 20%
RefinedWeb • 20%
C4 • 20%
Dolma v1.6 • 20%
The Pile • 20%
Industry reports, usage data published by dataset providers, and market analysis.
HuggingFace Launches FineWeb, a 15T Token Dataset Outperforming Others
Apr 21, 2024, 08:02 AM
HuggingFace has released FineWeb, a new dataset available on the HuggingFace Hub comprising 15 trillion tokens of high-quality, deduplicated web data drawn from CommonCrawl crawls spanning 2013 to 2024. Published under an Open Data Commons license, this 275GB dataset has been shown to outperform existing datasets such as RefinedWeb, C4, Dolma v1.6, The Pile, and SlimPajama across various benchmarks, including ablations on 350B-token subsets. At 15 trillion tokens, FineWeb matches the training scale reported for recent models such as Llama 3, and its size is expected to support extensive training runs. The dataset has been made open source for broad accessibility.
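The story highlights that FineWeb's web data is deduplicated; a common approach for near-duplicate detection at web scale is MinHash, which compares compact set signatures instead of full documents. A minimal standard-library sketch of the idea (the shingle size and signature length below are illustrative choices, not FineWeb's actual pipeline parameters):

```python
import hashlib

def shingles(text, k=5):
    # Split text into overlapping word k-grams ("shingles").
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash(text, num_hashes=64, k=5):
    # For each of num_hashes seeded hash functions, keep the minimum
    # hash value over all shingles; the result is the MinHash signature.
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text, k)))
    return sig

def est_jaccard(sig_a, sig_b):
    # The fraction of matching signature slots estimates the Jaccard
    # similarity of the two shingle sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two documents whose estimated Jaccard similarity exceeds a chosen threshold are treated as near-duplicates, and one copy is dropped; because signatures are small and fixed-length, this scales to billions of pages far better than pairwise text comparison.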