Hacker News with Generative AI: Data Analysis

How effective and safe are measles vaccines? (ourworldindata.org)
Data from large meta-analyses show that measles vaccination is highly effective and safe, reducing the chances of getting measles by 95%.
Web Browser telemetry – 2025 edition (sizeof.cat)
This is a re-release of my “world-renowned” Web Browser Telemetry - 2021 edition article, updated for 2025.
Show HN: Comparelists.org – Instantly Compare Two Lists, Find Differences (comparelists.org)
The easiest way to compare two lists online. Free tool to find matches, differences, duplicates and unique items between lists instantly. Support for TXT, CSV and Excel files.
How to Run Python in Production (ashishb.net)
My previous article recommended that one should reconsider using Python in production. However, there’s one category of use case where Python is the dominant option for running production workloads. And that’s data analysis and machine learning.
I analyzed chord progressions in 680k songs (cantgetmuchhigher.com)
ICE Hands Palantir Millions for Comprehensive Analysis of Known Groups (404media.co)
Last week Immigration and Customs Enforcement (ICE) paid contracting giant Palantir tens of millions of dollars to make modifications to a powerful ICE database and search tool to allow “complete target analysis of known populations” and to update the tool’s targeting and enforcement priorities, according to procurement records reviewed by 404 Media.
Reproducing Hacker News writing style fingerprinting (antirez.com)
About three years ago I saw a quite curious and interesting post on Hacker News. A student, Christopher Tarry, was able to use cosine similarity against a vector of top words frequencies in comments, in order to detect similar HN accounts — and, sometimes, even accounts actually controlled by the same user, that is, fake accounts used to uncover the identity of the writer.
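The excerpt above mentions cosine similarity over vectors of top-word frequencies. A minimal sketch of that idea follows; the function names, the whitespace tokenizer, and the tiny vocabulary are illustrative assumptions, not the method used in the linked post.

```python
import math
from collections import Counter


def word_frequencies(text, vocabulary):
    """Relative frequency of each vocabulary word in text (naive whitespace split)."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero on empty text
    return [counts[word] / total for word in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical vocabulary of "top words"; a real analysis would use many more.
vocabulary = ["the", "a", "of", "and"]
first = word_frequencies("the cat and the dog", vocabulary)
second = word_frequencies("a tale of the city and the sea", vocabulary)
similarity = cosine_similarity(first, second)
```

A score near 1 means the two texts use the vocabulary words in similar proportions; on its own it is weak evidence and would need many words and many comments to be meaningful.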
A puzzle of two unreliable sensors (wordpress.com)
Suppose you are trying to measure a value P and you have two unreliable sensors. Sensor A returns 0.5P + 0.5U, where U is uniform random noise over the same domain as P. Sensor B will return either P or U with 50% likelihood. In other words, sensor A is a noisy measurement of your variable, and B is sometimes the correct value and sometimes pure noise.
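The two sensor models described above can be simulated directly. This is a rough sketch that assumes P and U are independent and uniform on [0, 1] (the excerpt does not fix the domain) and that compares the sensors by mean absolute error, which is only one possible criterion; the function name and parameters are hypothetical.

```python
import random


def simulate(trials=100_000, seed=0):
    """Return (mean |A - P|, mean |B - P|) for the two sensor models."""
    rng = random.Random(seed)
    error_a = 0.0
    error_b = 0.0
    for _ in range(trials):
        p = rng.random()  # assumed: true value is uniform on [0, 1]
        # Sensor A: equal blend of the true value and fresh uniform noise.
        a = 0.5 * p + 0.5 * rng.random()
        # Sensor B: the true value half the time, pure noise otherwise.
        b = p if rng.random() < 0.5 else rng.random()
        error_a += abs(a - p)
        error_b += abs(b - p)
    return error_a / trials, error_b / trials


mean_error_a, mean_error_b = simulate()
```

Under these assumptions both sensors have the same expected absolute error, 1/6, because the expected distance between two independent uniform values is 1/3 and each sensor contributes half of it. The interesting part of the puzzle is therefore how to combine the two readings, which this sketch does not attempt.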
Understanding Aggregate Trends for Apple Intelligence Using Differential Privacy (apple.com)
At Apple, we believe privacy is a fundamental human right. And we believe in giving our users a great experience while protecting their privacy.
Show HN: I made a free tool that analyzes SEC filings and posts detailed reports (signalbloom.ai)
Wilson Bank Holding Company (WBHC) announced robust first-quarter 2025 results, highlighted by significant year-over-year earnings per share growth and continued expansion of its balance sheet and ...
Benefits of Apache Iceberg for geospatial data analysis (wherobots.com)
Apache Iceberg v3 supports the geometry type which allows the geospatial community to take advantage of amazing Iceberg features like reliable transactions, DML operations (deletes & upserts), time travel, versioned data, schema enforcement, schema evolution, and much more.
Hand-counted images of rallies yield significantly smaller numbers (cbc.ca)
Political campaigns may be significantly off-base when it comes to the number of people they say are present at campaign rallies across the country, a CBC News investigation shows.
OLAP Hierarchical Aggregation with DuckDB SQL Recursive Common Table Expressions (medium.com)
Aggregation for dimensional hierarchies doesn’t require costly Business Intelligence (BI) tools. You can use recursive SQL techniques to express your hierarchical data in relational form, allowing for easy and fast aggregation along multiple levels and dimensions.
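As an illustration of the recursive-CTE idea in the excerpt above, the sketch below rolls leaf amounts up a small parent-child hierarchy. It uses Python's built-in sqlite3 module rather than DuckDB so that it runs without extra dependencies; the table, column names, and sample rows are made up for the example and are not taken from the linked article.

```python
import sqlite3


def rollup_totals(rows):
    """Sum each node's amount together with the amounts of all its descendants."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE nodes (name TEXT PRIMARY KEY, parent TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO nodes VALUES (?, ?, ?)", rows)
    query = """
        WITH RECURSIVE ancestry(ancestor, descendant) AS (
            -- Base case: every node is its own ancestor.
            SELECT name, name FROM nodes
            UNION ALL
            -- Recursive step: extend each pair to the descendant's children.
            SELECT a.ancestor, n.name
            FROM ancestry AS a
            JOIN nodes AS n ON n.parent = a.descendant
        )
        SELECT a.ancestor, SUM(n.amount)
        FROM ancestry AS a
        JOIN nodes AS n ON n.name = a.descendant
        GROUP BY a.ancestor
        ORDER BY a.ancestor
    """
    result = dict(conn.execute(query).fetchall())
    conn.close()
    return result


# Hypothetical region hierarchy: (name, parent, amount).
hierarchy = [
    ("World", None, 0.0),
    ("Europe", "World", 0.0),
    ("France", "Europe", 10.0),
    ("Spain", "Europe", 5.0),
    ("Asia", "World", 0.0),
    ("Japan", "Asia", 7.0),
]
totals = rollup_totals(hierarchy)
```

The recursive part flattens the tree into (ancestor, descendant) pairs, after which an ordinary GROUP BY aggregates at every level at once. The same WITH RECURSIVE shape should carry over to DuckDB, though details may differ.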
Visualizing a Million Time Series with the Density Line Chart (arxiv.org)
Data analysts often need to work with multiple series of data (conventionally shown as line charts) at once. Few visual representations allow analysts to view many lines simultaneously without becoming overwhelming or cluttered. In this paper, we introduce the DenseLines technique to calculate a discrete density representation of time series.
Ask HN: Is There a Crypto Equivalent to Tracking Politician's Transactions? (ycombinator.com)
I'm curious if there's a platform similar to Capitol Trades (capitoltrades.com) or AutoPilot but for the cryptocurrency market.
Announcing Think Stats 3e (allendowney.com)
The third edition of Think Stats is on its way to the printer! You can preorder now from Bookshop.org and Amazon (those are affiliate links), or if you can’t wait to get a paper copy, you can read the free, online version here.
Applying Pandas' Group_by on Videos (mixpeek.com)
My Browser WASM't Prepared for This. Using DuckDB, Apache Arrow and Web Workers (medium.com)
At Motif Analytics, we are building a highly-interactive analytics tool, which allows finding insights in relatively large datasets, fully in-browser.
Cursed Excel: "1/2"+1=45660 (quadratichq.com)
Reverse engineering of the formula used to generate the "reciprocal tariffs" (twitter.com)
Are tariffs bad for growth? Yes, say 5 decades of data from 150 countries (2020) (sciencedirect.com)
Using an annual panel of macroeconomic data for 151 countries over 1963–2014, we find that tariff increases are associated with an economically and statistically sizeable and persistent decline in output growth.
DEDA – Tracking Dots Extraction, Decoding and Anonymisation Toolkit (github.com/dfd-tud)
Document Colour Tracking Dots, or yellow dots, are small systematic dots which encode information about the printer and/or the printout itself. This process is integrated in almost every commercial colour laser printer. This means that almost every printout contains coded information about the source device, such as the serial number.
How Airbnb measures listing lifetime value (medium.com)
A deep dive on the framework that lets us identify the most valuable listings for our guests.
Sample Size [in Baseball] (fangraphs.com)
A baseball season is the amalgamation of a lot of little events. Each pitch fits into a plate appearance which fits into an inning which fits into a game which fits into a series which fits into a season. That’s a lot of little data points flowing into an overall end result. We care a lot about which players will have good seasons and careers.
The R Inferno (2011) [pdf] (burns-stat.com)
Most promoted and blocked domains on Kagi (kagi.com)
Kagi Search Stats
CNCF Git Data Miner (github.com/cncf)
This is the Cloud Native Computing Foundation's fork of Jon Corbet and Greg KH's gitdm tool for calculating contributions based on developers and their companies.
Matrix Profiles (aneksteind.github.io)
Lately I’ve been thinking about time series analysis to aid in Reflect’s insights features. Towards this end, I’ve had a Hacker News thread about anomaly detection bookmarked in Later. I finally got to looking at it and there was a comment that mentioned the article left out matrix profiles, which I had never heard of, so I decided to look into them.
Some Reflections After a Month of Tracking My Own Online Activity (mcwhittemore.com)
Since 8:38 PM on February 22nd, I’ve been recording all my browsing activity in a database I manage using a custom-built browser extension and a wrapper around @rosskevin/ifvisible. The result? I now have a clear picture of just how much time I’ve spent on the web this past month. And, well… I spend a lot of time reading email. Go figure.
The persistent mischaracterization of Google and Facebook A/B tests (sciencedirect.com)
Marketing research has increasingly relied on online platform studies, which are studies conducted in a naturalistic online environment and which leverage the A/B testing tool provided by platforms such as Facebook or Google Ads.