Apr 21, 2024

HuggingFaceFW/fineweb · Datasets at Hugging Face

Posted in categories: internet, robotics/AI

FineWeb: 15 trillion tokens of the highest-quality data the web has to offer.

The 🍷 FineWeb dataset consists of more than 15T tokens of cleaned and deduplicated English web data from CommonCrawl.

