FineWeb: 15 trillion tokens of high-quality web data.
The 🍷 FineWeb dataset consists of more than 15T tokens of cleaned and deduplicated English web data from CommonCrawl.