Hacker News with Generative AI: Datasets

FUTO open-sources 1M row keyboard swipe dataset (huggingface.co)
Open-sourcing 5,000hrs of self-driving dataset (huggingface.co)
To unlock the potential for robotics AI, Yaak teamed up with the LeRobot team at 🤗 and is excited to announce Learning to Drive (L2D) to the robotics AI community. L2D is the world’s largest multimodal dataset aimed at building an open-sourced spatial intelligence for the automotive domain with first-class support for 🤗’s LeRobot training pipeline and models.
Hugging Face datasets and models for cybersecurity/software vulnerabilities (huggingface.co)
CIRCL is the CERT (Computer Emergency Response Team/Computer Security Incident Response Team) for the private sector, communes and non-governmental entities in Luxembourg.
Some critical issues with the SWE-bench dataset (arxiv.org)
To facilitate a rigorous evaluation of LLMs in practical coding contexts, Carlos et al. introduced the SWE-bench dataset, which comprises 2,294 real-world GitHub issues and their corresponding pull requests, collected from 12 widely used Python repositories.
Harvard Is Releasing a Free AI Training Dataset (wired.com)
Harvard University announced Thursday it’s releasing a high-quality dataset of nearly 1 million public-domain books that could be used by anyone to train large language models and other AI tools.
Datasaurus dozen – Different datasets with the same descriptive statistics (wikipedia.org)
The Datasaurus dozen comprises thirteen data sets that have nearly identical simple descriptive statistics to two decimal places, yet have very different distributions and appear very different when graphed.[1] It was inspired by the smaller Anscombe's quartet that was created in
Foursquare Open Source Places: A new foundational dataset (simonwillison.net)
Foursquare Open Source Places: A new foundational dataset for the geospatial community. I did not expect this!
Seeing faces in things: A model and dataset for pareidolia (mhamilton.net)
The human visual system is well-tuned to detect faces of all shapes and sizes. While this brings obvious survival advantages, such as a better chance of spotting unknown predators in the bush, it also leads to spurious face detections. "Face pareidolia" describes the perception of face-like structure among otherwise random stimuli: seeing faces in coffee stains or clouds in the sky. In this paper, we study face pareidolia from a computer vision perspective.
Show HN: What's HN Working On – A structured dataset (github.com/getomni-ai)
Trailer Faces HQ Dataset (justinpinkney.com)
Show HN: Fineweb-Edu-Fortified dataset: Fineweb-Edu deduped, embeddings included (huggingface.co)
A multimodal dataset with one trillion tokens (github.com/mlfoundations)
How to think about creating a dataset for LLM fine-tuning evaluation (mlops.systems)
Sharing new research, models, and datasets from Meta FAIR (meta.com)
AI Books4 Dataset for training LLMs further (reddit.com)
Sakuga-42M Dataset: Scaling Up Cartoon Research (arxiv.org)
FineWeb: 15T tokens of the finest data the web has to offer (huggingface.co)
GitHub: Awesome-reasoning, a curated list of datasets for reasoning AIs (github.com/neurallambda)