Large language model data pipelines and Common Crawl (christianperone.com)