Hacker News with Generative AI: Data Sets

Bluesky Social Dataset (235M posts from 4M users) (zenodo.org)
Pollution of online social spaces caused by rampant dis- and misinformation is a growing societal concern. However, recent decisions to reduce access to social media APIs are causing a shortage of recent, publicly available social media data, hindering the advancement of computational social science as a whole. To address this pressing issue, we present a large, high-coverage dataset of social interactions and user-generated content from Bluesky Social.
Foursquare Open Source Places: Foundational dataset for the geospatial community (foursquare.com)
In an effort to change that dynamic, we are announcing today the general availability of a foundational open data set, Foursquare Open Source Places (“FSQ OS Places”).
Show HN: 4B+ DNS Records Dataset (merklemap.com)
Introducing the world's most comprehensive and extensive DNS (Domain Name System) records database with more than 4 billion records.
Call of Duty: Warzone Caldera Data Set for Academic Use (activision.com)
OpenAI destroyed a trove of books used to train AI models (businessinsider.com)
Building a Large Japanese Web Corpus for Large Language Models (arxiv.org)