Hacker News with Generative AI: Data

GM parks claims driver location data was given to insurers, pushing up premiums (theregister.com)
General Motors on Thursday said that it has reached a settlement with the FTC "to address privacy concerns about our now-discontinued Smart Driver program."
General Motors Is Banned from Selling Driving Behavior Data for 5 Years (nytimes.com)
The Federal Trade Commission said on Thursday that it had reached a settlement with General Motors that would ban the automaker from providing drivers’ behavior and geolocation data to consumer reporting agencies.
43K fewer drivers on Manhattan roads after congestion pricing turned on (gothamist.com)
Meta Confesses to Training Llama with Pirated LibGen Data [pdf] (courtlistener.com)
Levels.fyi's annual compensation report 2024 (levels.fyi)
Levels.fyi's annual compensation report. View top paying companies, cities, titles & other trends.
Small Data [video] (youtube.com)
25% of the top websites are blocking OpenAI from crawling (originality.ai)
Bots such as OpenAI’s GPTBot, the Applebot, CCBot, Google-Extended, and Bytespider analyze, store, or scrape your website’s data in order to provide data to train more advanced LLMs.
Brief Introduction to FIX and FIX JSON (fixparser.dev)
The FIX Protocol (Financial Information Exchange) is a standardized messaging system for real-time electronic communication of trade-related information in financial markets.
New Data Reveal Climate Change-Driven Insurance Crisis Is Spreading (senate.gov)
Washington, D.C. - Today, Senator Sheldon Whitehouse (D-RI), Chairman of the Senate Budget Committee, released a first-of-its-kind public dataset and accompanying staff report that expose the scale of the climate change-driven crisis in homeowners’ insurance.
Show HN: The all-in-one fake API (fooapi.com)
Dummy data for your projects, fast and simple. Users, products, posts, comments and more!
Maps that show time instead of space (youtube.com)
Plasticlist Report – Data on plastic chemicals in Bay Area foods (plasticlist.org)
A data table thousands of years old (2020) (datafix.com.au)
I knew that data tables had been around a long time, but I didn't appreciate how long until I read recently about account-keeping in ancient Mesopotamia.
Show HN: Gribstream.com – Historical Weather Forecast API (gribstream.com)
Leverage The National Blend of Models (NBM) & The Global Forecast System (GFS)
This is where the data to build AI comes from (technologyreview.com)
New findings show how the sources of data are concentrating power in the hands of the most powerful tech companies.
The Work Number (W2 salary service) (theworknumber.com)
Utilizing The Work Number® database, we’ll provide you differentiated and proprietary data that can give you a more holistic view of applicants.
The First 50M Prime Numbers (1975) [pdf] (mpim-bonn.mpg.de)
Reclaim Your Data: Freeing a Wi-Fi Sensor from the Cloud (embeddedartistry.com)
In this article we’ll investigate how a particular Wi-Fi connected sensor (in this case a radon sensor) communicates with “the cloud” and how we can use that knowledge to reduce our reliance on third-party servers.
Surfer: Open-Source Personal Data Warehouse (github.com/Surfer-Org)
Surfer Protocol is an open-source framework for exporting and building applications off of your personal data.
Google Timeline location purge causes collateral damage (theregister.com)
A year ago, Google announced plans to save people's Location History, which it now calls Timeline, locally on devices rather than on its servers.
Show HN: DataFuel.dev – Turn websites into LLM-ready data (datafuel.dev)
DataFuel API scrapes entire websites and knowledge bases in a single query. Get clean, markdown-structured web data instantly for your RAG systems and AI models. No complex scraping code needed.
All Text in NYC (alltext.nyc)
Palantir, Anduril to save data from battlefield to train AI models (businesstimes.com.sg)
Software company Palantir Technologies and weapons maker Anduril Industries plan to accelerate the use of artificial intelligence (AI) in the US military and are inviting other companies to join the effort.
Show HN: Data Connector – Chat with Your Database and APIs (github.com/inferablehq)
The limitations of data and the fracturing of opinion (based.science)
From within, science appears as usual: parsimonious, slow, pedantic. But on the periphery something has changed.
PSA: Microsoft may be training on your private data without your knowledge (garymarcus.substack.com)
This just in: Microsoft is apparently, at least for some Office subscribers, maybe most, scraping private documents.
Foursquare's 104M Points of Interest (marksblogg.com)
Point of Interest (POI) datasets of any strong quality have rarely been published freely. Overture and OpenStreetMap (OSM) have been making inroads but even in 2021, I could only find half of Starbucks' locations in OSM.
The Fastest Way to Download Foursquare's New 100M+ POI Dataset (fused.io)
Foursquare just released an open dataset of over 100M global places of interest.
China's Surveillance State Is Selling Citizen Data as a Side Hustle (wired.com)
China has long been a billion-plus-person experiment in total state surveillance, with virtually no legal checks on the government's ability to physically and digitally monitor its citizens.