15. Datasets

Common Crawl

The Common Crawl corpus contains petabytes of data collected since 2008.

It includes raw web page data (WARC files), extracted metadata (WAT files), and plain-text extractions (WET files).

Example Projects