Hacker News with Generative AI: Datasets

Simple Classification Rules Perform Well on Commonly Used Datasets (1993) [pdf] (ualberta.ca)

Machine Learning, Data Science, Classification, Datasets, Research Papers

4 points by Tomte 429 days ago | 1 comment

Show HN: TheorIA – An Open Curated Physics Dataset (Equations, Explanations, JSON) (theoria-dataset.github.io)

Open Source, Physics, Datasets, Machine Learning

9 points by ManuelSH 433 days ago | 6 comments

FUTO open-sources 1M row keyboard swipe dataset (huggingface.co)

Data Science, Machine Learning, Datasets, Open Source, Keyboard Input

12 points by ReadEvalPost 470 days ago | 1 comment

Open-sourcing 5,000hrs of self-driving dataset (huggingface.co)
To unlock the potential for robotics AI, Yaak teamed up with the LeRobot team at 🤗 and is excited to announce Learning to Drive (L2D) to the robotics AI community. L2D is the world’s largest multimodal dataset aimed at building an open-sourced spatial intelligence for the automotive domain with first class support for 🤗’s LeRobot training pipeline and models.

Robotics, Open Source, Artificial Intelligence, Datasets

63 points by SnYaak 494 days ago | 11 comments

Hugging Face datasets and models for cybersecurity/software vulnerabilities (huggingface.co)
CIRCL is the CERT (Computer Emergency Response Team/Computer Security Incident Response Team) for the private sector, communes and non-governmental entities in Luxembourg.

Cybersecurity, Software Vulnerabilities, Machine Learning, Datasets

7 points by cedricbonhomme 496 days ago | 1 comment

Some critical issues with the SWE-bench dataset (arxiv.org)
To facilitate a rigorous evaluation of LLMs in practical coding contexts, Carlos et al. introduced the SWE-bench dataset, which comprises 2,294 real-world GitHub issues and their corresponding pull requests, collected from 12 widely used Python repositories.

Software Engineering, Datasets, Artificial Intelligence

350 points by joshwa 512 days ago | 116 comments

Harvard Is Releasing a Free AI Training Dataset (wired.com)
Harvard University announced Thursday it’s releasing a high-quality dataset of nearly 1 million public-domain books that could be used by anyone to train large language models and other AI tools.

Artificial Intelligence, Education, Datasets

75 points by ilamont 577 days ago | 11 comments

Datasaurus dozen – Different datasets with the same descriptive statistics (wikipedia.org)
The Datasaurus dozen comprises thirteen data sets that have nearly identical simple descriptive statistics to two decimal places, yet have very different distributions and appear very different when graphed.[1] It was inspired by the smaller Anscombe's quartet that was created in

Data Visualization, Statistics, Datasets, Data Science

14 points by yread 581 days ago | 1 comment

Foursquare Open Source Places: A new foundational dataset (simonwillison.net)
Foursquare Open Source Places: A new foundational dataset for the geospatial community. I did not expect this!

Open Source, Geospatial Data, Datasets, Foursquare, Location Data

154 points by po 605 days ago | 28 comments

Seeing faces in things: A model and dataset for pareidolia (mhamilton.net)
The human visual system is well-tuned to detect faces of all shapes and sizes. While this brings obvious survival advantages, such as a better chance of spotting unknown predators in the bush, it also leads to spurious face detections. "Face pareidolia" describes the perception of face-like structure among otherwise random stimuli: seeing faces in coffee stains or clouds in the sky. In this paper, we study face pareidolia from a computer vision perspective.

Computer Vision, Psychology, Artificial Intelligence, Datasets

61 points by sebg 634 days ago | 9 comments

Show HN: What's HN Working On – A structured dataset (github.com/getomni-ai)

Hacker News, Datasets, Data Analysis, Artificial Intelligence

7 points by themanmaran 692 days ago | 0 comments

Trailer Faces HQ Dataset (justinpinkney.com)

Computer Vision, Datasets, Machine Learning

46 points by surprisetalk 699 days ago | 9 comments

Show HN: Fineweb-Edu-Fortified dataset: Fineweb-Edu deduped, embeddings included (huggingface.co)

Machine Learning, Datasets

22 points by neutralino1 703 days ago | 0 comments

A multimodal dataset with one trillion tokens (github.com/mlfoundations)