Hacker News with Generative AI: Datasets

Harvard Is Releasing a Free AI Training Dataset (wired.com)
Harvard University announced Thursday it’s releasing a high-quality dataset of nearly 1 million public-domain books that could be used by anyone to train large language models and other AI tools.
Datasaurus dozen – Different datasets with the same descriptive statistics (wikipedia.org)
The Datasaurus dozen comprises thirteen data sets that have nearly identical simple descriptive statistics to two decimal places, yet have very different distributions and appear very different when graphed.[1] It was inspired by the smaller Anscombe's quartet that was created in
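To make the idea concrete, here is a minimal sketch of two small samples whose mean and standard deviation agree to two decimal places even though their shapes differ. The values and the names `summary`, `spread_out`, and `two_clusters` are invented for illustration; they are not taken from the Datasaurus dozen itself.

```python
import statistics


def summary(values):
    """Return (mean, population standard deviation), rounded to two decimals."""
    return (
        round(statistics.mean(values), 2),
        round(statistics.pstdev(values), 2),
    )


# Hypothetical sample 1: values spread unevenly around the center.
spread_out = [2, 4, 4, 4, 5, 5, 7, 9]

# Hypothetical sample 2: two tight clusters and nothing in between.
two_clusters = [3, 3, 3, 3, 7, 7, 7, 7]

print(summary(spread_out))
print(summary(two_clusters))

# The summaries match, yet the samples contain different values.
print(sorted(set(spread_out)))
print(sorted(set(two_clusters)))
```

Both samples share the same summary statistics, but one has six distinct values and the other only two, which is the kind of difference a plot reveals and a summary table hides.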
Foursquare Open Source Places: A new foundational dataset (simonwillison.net)
Foursquare Open Source Places: A new foundational dataset for the geospatial community. I did not expect this!
Seeing faces in things: A model and dataset for pareidolia (mhamilton.net)
The human visual system is well-tuned to detect faces of all shapes and sizes. While this brings obvious survival advantages, such as a better chance of spotting unknown predators in the bush, it also leads to spurious face detections. "Face pareidolia" describes the perception of face-like structure among otherwise random stimuli: seeing faces in coffee stains or clouds in the sky. In this paper, we study face pareidolia from a computer vision perspective.
Show HN: What's HN Working On – A structured dataset (github.com/getomni-ai)
Trailer Faces HQ Dataset (justinpinkney.com)
Show HN: Fineweb-Edu-Fortified dataset: Fineweb-Edu deduped, embeddings included (huggingface.co)
A multimodal dataset with one trillion tokens (github.com/mlfoundations)
How to think about creating a dataset for LLM fine-tuning evaluation (mlops.systems)
Sharing new research, models, and datasets from Meta FAIR (meta.com)
AI Books4 Dataset for training LLMs further (reddit.com)
Sakuga-42M Dataset: Scaling Up Cartoon Research (arxiv.org)
FineWeb: 15T tokens of the finest data the web has to offer (huggingface.co)
GitHub: Awesome-reasoning, a curated list of datasets for reasoning AIs (github.com/neurallambda)