Hacker News with Generative AI: Data

AI bots strain Wikimedia as bandwidth surges 50% (arstechnica.com)
Automated AI bots seeking training data threaten Wikipedia project stability, foundation says.
Crawlers impact the operations of the Wikimedia projects (wikimedia.org)
Since the beginning of 2024, the demand for the content created by the Wikimedia volunteer community – especially for the 144 million images, videos, and other files on Wikimedia Commons – has grown significantly. In this post, we’ll discuss the reasons for this trend and its impact.
Show HN: XYMake – Turn Your Posts into LLM-Ready Data (xymake.com)
Unlock your X value by letting your MCP agents and other APIs access all your X posts.
Nonprofit Software Heritage maintains largest public collection of source code (softwareheritage.org)
Search LibGen, the Pirated-Books Database That Meta Used to Train AI (theatlantic.com)
Millions of books and scientific papers are captured in the collection’s current iteration.
Euclid opens data treasure trove, offers glimpse of deep fields (esa.int)
On 19 March 2025, the European Space Agency’s Euclid mission released its first batch of survey data, including a preview of its deep fields.
Nvidia Bets Big on Synthetic Data (wired.com)
Nvidia has acquired synthetic data firm Gretel for nine figures, according to two people with direct knowledge of the deal.
Dead People Database: DOGE deletes names of 3.2M individuals aged 120+ (indiatimes.com)
AI crawlers haven't learned to play nice with websites (theregister.com)
SourceHut, an open source git-hosting service, says web crawlers for AI companies are slowing down services through their excessive demands for data.
Show HN: OpenTimes – Free travel times between U.S. Census geographies (opentimes.org)
Data Broker Brags About Having Detailed Personal Info on Nearly All Net Users (gizmodo.com)
The owner of a data brokerage business recently put out a creepy-ass video in which he bragged about the degree to which his industry could collect and analyze data on the habits of billions of people.
Experts warn about the 'crumbling infrastructure' of federal government data (npr.org)
Unstable funding for federal statistical agencies such as the Census Bureau and the Bureau of Economic Analysis, both based in Suitland, Md., is putting at risk the government statistics the U.S. uses to track changes in the country's economy and population, officials and data users warn.
Doge Makes Its Latest Errors Harder to Find (nytimes.com)
Elon Musk’s Department of Government Efficiency has repeatedly posted error-filled data that inflated its success at saving taxpayer money. But after a series of news reports called out those mistakes, the group changed its tactics.
Mozilla Has Likely Been Sharing Aggregated Firefox Data with Advertisers Since 2017 (quippd.com)
TL;DR: With Firefox 56, Mozilla combined Firefox Health Report and Telemetry data into a single setting called “technical and interaction data”, which was then enabled by default. This included data about advertising within Firefox’s New Tab page, along with a lot of other technical information about the installation of Firefox. The Firefox preferences UI makes no mention of usage of this technical data for advertising purposes.
Torrenting to seed US Government scientific datasets (aus.social)
The 200+ Sites an ICE Surveillance Contractor is Monitoring (404media.co)
A contractor for Immigration and Customs Enforcement (ICE) and many other U.S. government agencies has developed a tool that lets analysts more easily pull a target individual’s publicly available data from a wide array of sites, social networks, apps, and services across the web at once, including Bluesky, OnlyFans, and various Meta platforms, according to a leaked list of the sites obtained by 404 Media.
Why data on the economy doesn't match our feelings (marketplace.org)
Leading up to the election, economic figures said the economy was doing pretty well and inflation was slowing down significantly. Yet a lot of people just didn’t feel it.
Knowledge retention and generational succession in OpenStreetMap (imagico.de)
I wrote a comment today on the OSM-Carto issue tracker that i think should get some broader exposure:
Public health data disappeared. RestoredCDC.org is bringing it back (RestoredCDC.org)
RestoredCDC.org is an independent project and is not affiliated with, endorsed by, or associated with the Centers for Disease Control and Prevention (CDC) or any government entity.
Mozilla flamed by Firefox fans after reneging on promises to not sell their data (theregister.com)
Mozilla this week asked Firefox users to abide by new Terms of Use, and updated its Privacy Notice as well as an FAQ – only to quickly issue a clarification that it isn’t actually claiming ownership of user data.
Farmers depend on climate data. They're suing the USDA for deleting it (grist.org)
In late January, the director of digital communications at the U.S. Department of Agriculture sent an email to staff instructing them to remove agency web pages related to climate change by the end of the following day.
Firefox deletes promise to never sell personal data, asks users not to panic (arstechnica.com)
Firefox maker Mozilla deleted a promise to never sell its users' personal data and is trying to assure worried users that its approach to privacy hasn't fundamentally changed.
Lawsuit Alleges GM Illegally Sold Arkansans' Driving Data to Insurance Companies (insurancejournal.com)
Arkansas Attorney General Tim Griffin sued General Motors and its subsidiary OnStar this week, alleging the car manufacturer deceived Arkansans by collecting and selling driver information to third parties, who then sold the data to insurance companies.
Mozilla deletes promise not to sell Firefox users' data (osnews.com)
The hits just keep on coming. Mozilla not only changed its Privacy Notice and introduced a Terms of Use for Firefox for the first time with some pretty onerous terms, they also removed a rather specific question and answer pair from their page with frequently asked questions about Firefox, as discovered by David Gerard.
Farmers sue over purge of climate data needed for agricultural decisions (thehill.com)
Farmers and green groups sued the U.S. Department of Agriculture on Monday for an “unlawful purge” of climate data from its website.
Show HN: SQL Premier League – Learn SQL with Sports Data (sqlpremierleague.com)
How Many School Shootings? All Incidents from 1966-Present (k12ssdb.org)
How many school shootings this year? Unlike other data sources, this information includes gang shootings, domestic violence, shootings at sports games and afterhours school events, suicides, fights that escalate into shootings, and accidents.
Financialdata.net – Stock Market and Financial Data API (financialdata.net)
Here you can find a list of all API endpoints, along with their descriptions, required or optional query parameters, and sample responses.
Every .gov Domain (flatgithub.com)