Hacker News with Generative AI: Data Sets

Meta torrented & seeded 81.7 TB dataset containing copyrighted data (arstechnica.com)
Newly unsealed emails allegedly provide the "most damning evidence" yet against Meta in a copyright case raised by book authors alleging that Meta illegally trained its AI models on pirated books.
Bluesky Social Dataset (235M posts from 4M users) (zenodo.org)
Pollution of online social spaces caused by rampant dis- and misinformation is a growing societal concern. However, recent decisions to reduce access to social media APIs are causing a shortage of publicly available, recent social media data, thus hindering the advancement of computational social science as a whole. To address this pressing issue, we present a large, high-coverage dataset of social interactions and user-generated content from Bluesky Social.
Foursquare Open Source Places: Foundational dataset for the geospatial community (foursquare.com)
In an effort to change that dynamic, we are announcing today the general availability of a foundational open data set, Foursquare Open Source Places (“FSQ OS Places”).
Show HN: 4B+ DNS Records Dataset (merklemap.com)
Introducing the world's most comprehensive and extensive DNS (Domain Name System) records database with more than 4 billion records.
Call of Duty: Warzone Caldera Data Set for Academic Use (activision.com)
OpenAI destroyed a trove of books used to train AI models (businessinsider.com)
Building a Large Japanese Web Corpus for Large Language Models (arxiv.org)