Hacker News with Generative AI: Data Analysis

How Russian Data on Sanctioned Companies is disappearing and how we put it back (opensanctions.org)
Since Russia's full-scale invasion of Ukraine in 2022, the dictatorship has increasingly hidden and falsified data in its trade registry. We’ve churned through 100GB of compressed XML files, so that we can now present both: the current, memory-impaired version as well as the pre-war data.
Bluesky user activity has declined by 23% over the past three months (bluefacts.app)
This page shows how activity and usage of Bluesky has been growing over time. You can see how many users are actively posting on Bluesky and how many posts are added to Bluesky on a daily basis.
Python Pandas Ditches NumPy for Speedier PyArrow (thenewstack.io)
Frequentism and Bayesianism: A Practical Introduction (2014) (jakevdp.github.io)
One of the first things a scientist hears about statistics is that there are two different approaches: frequentism and Bayesianism. Despite their importance, many scientific researchers never have the opportunity to learn the distinctions between them and the different practical approaches that result. The purpose of this post is to synthesize the philosophical and pragmatic aspects of the frequentist and Bayesian approaches, so that scientists like myself might be better prepared to understand the types of data analysis people do.
Calculating Oil Storage Tank Occupancy with Help of Satellite Imagery (2017) (medium.com)
At TankerTrackers.com, our mission statement is to present a bird’s eye view of the physical oil market with help of tanker-tracking, storage changes and official government statistics.
New DSL "MassQL" lets scientists query mass spectrometry data (news.ucr.edu)
Biologists and chemists have a new programming language to uncover previously unknown environmental pollutants at breakneck speed – without requiring them to code.
Show HN: High-resolution surface analysis with Lidar data (github.com/r-follador)
High-resolution surface analysis with LiDAR data
Discord Unveiled: A Comprehensive Dataset of Public Communication (2015-2024) (arxiv.org)
Discord has evolved from a gaming-focused communication tool into a versatile platform supporting diverse online communities.
Capalyze – Natural language data analysis (capalyze.ai)
Upload spreadsheets, ask questions, get smart answers, and generate insights.
What Is This Thing Called Swing? (ds.mpg.de)
Jazz must swing - jazz musicians agree on that. However, even a century after the beginnings of jazz, there is still no general agreement what exactly constitutes the swing feel. With a dedicated experiment and data analyses on more than 450 well-known jazz solos, we have tried to unravel some secrets of swing.
Show HN: Buckaroo – Data table UI for Notebooks (github.com/paddymul)
Buckaroo is a modern data table for Jupyter that expedites the most common exploratory data analysis tasks.
Show HN: Fahmatrix – A Lightweight, Pandas-Like DataFrame Library for Java (github.com/moustafa-nasr)
Fahmatrix is a lightweight, modern Java library for working with tabular data, inspired by Python's Pandas and rooted in the idea of making data understanding (fahm) easy on the JVM.
Show HN: CSV GB+ by Data.olllo – Open and Process CSVs Locally (microsoft.com)
Backblaze Drive Stats for Q1 2025 (backblaze.com)
Welcome to the first Drive Stats of 2025. In case you missed it, the 2024 Drive Stats report was the last for long-time Drive Stats guru, Andy Klein, who is happily retired—off putting the “green” in greener pastures by working on his golf game. We–being Backblaze staff writer Stephanie Doyle and Chief Technical Evangelist Pat Patterson–are picking up where Andy left off, bringing you the metrics and analysis you know and love. Now, on to the numbers! 
Gmail to SQLite (github.com/marcboeker)
This is a script to download emails from Gmail and store them in a SQLite database for further analysis.
Launch HN: Nao Labs (YC X25) – Cursor for Data (ycombinator.com)
Hey HN, we’re Claire and Christophe from nao Labs (https://getnao.io/). We just launched nao, an AI code editor to work with data: a local editor, directly connected with your data warehouse, and powered by an AI copilot with built-in context of your data schema and data-specific tools.
Show HN: Using eBPF to see through encryption without a proxy (github.com/qpoint-io)
Qtap: An eBPF agent that captures pre-encrypted network traffic, providing rich context about egress connections and their originating processes.
Show HN: YouTube Time Machine – browser extension to find forgotten videos (frankmeeuwsen.com)
Did you know that the average YouTube video is viewed about 41 times? For a platform that seemingly features professional content creators and is the second most visited site after Google, this view count might seem disappointing. Or is something else going on? I’ll explain how this fact inspired me to create a browser extension that makes visible the videos that the YouTube algorithm keeps hidden.
Energy efficiency of heat pumps in residential buildings using operation data (nature.com)
As heat pumps become more prevalent in residential buildings, effective performance monitoring is essential.
Show HN: TextQuery – Query CSV, JSON, XLSX Files with SQL (textquery.app)
TextQuery is an all-in-one desktop app to import, query, modify, and visualize your raw data with SQL.
Internet usage pattern during power outage in Spain and Portugal (akamai-mpulse.com)
On Monday this week, the Iberian Peninsula suffered a major power outage that disabled many services across these countries. In this post I'll look at the patterns we saw in mPulse data during this time.
DuckDB is probably the most important geospatial software of the last decade (dbreunig.com)
What happens when you embed geospatial capabilities in generalist data tools? More people engaging with geo data.
Determining favorite t-shirt color using science (ostwilkens.se)
I'm looking to simplify my wardrobe, and the t-shirt is a staple. I like solid color t-shirts, and so the main differentiating factor is the color. But what color? There is only one way to find out. That is: create images of myself with different colored t-shirts, and evaluate them in an Elo-based arena.
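For readers unfamiliar with rating arenas, here is a minimal sketch of a standard Elo update after one pairwise comparison. The excerpt does not say how the author implemented the arena, so the function names, the K-factor of 32, and the 400-point scale are conventional assumptions, not the post's actual code.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update(rating_a, rating_b, score_a, k=32):
    """Return new (rating_a, rating_b) after one comparison.

    score_a is 1 if A won, 0 if B won, 0.5 for a tie.
    k=32 is an assumed, commonly used K-factor.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    # B's score and expectation are the complements of A's.
    new_b = rating_b + k * ((1 - score_a) - (1 - exp_a))
    return new_a, new_b


# Two hypothetical shirt colors start level; the first wins one comparison.
print(update(1500, 1500, 1))
```

With equal starting ratings, the winner gains exactly what the loser gives up, so the total rating in the pool stays constant.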
Backstory to the Survivorship Bias Plane (yuxi-liu-wired.github.io)
I discover the exact backstory to that picture of an airplane with red dots on top of it.
The PCAP (weberblog.net)
For the last couple of years, I captured many different network and upper-layer protocols and published the pcaps along with some information and Wireshark screenshots on this blog. However, it always takes me some time to find the correct pcap when I am searching for a concrete protocol example. There are way too many pcaps out there.
Zipf's Law (wikipedia.org)
Zipf's law (/zɪf/; German pronunciation: [tsɪpf]) is an empirical law stating that when a list of measured values is sorted in decreasing order, the value of the n-th entry is often approximately inversely proportional to n.
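The inverse-proportionality statement can be illustrated with a short sketch that ranks token counts and compares each one with the idealized value top_count / rank. The token list and helper names here are made up for illustration; real data only follows the law approximately.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, count) pairs sorted by decreasing count."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(counts, start=1))


def zipf_prediction(top_count, rank):
    """Idealized Zipf value at a given rank: top_count / rank."""
    return top_count / rank


# Hypothetical toy data constructed to follow the law exactly.
tokens = ["a"] * 60 + ["b"] * 30 + ["c"] * 20 + ["d"] * 15
pairs = rank_frequency(tokens)
for rank, count in pairs:
    print(rank, count, zipf_prediction(pairs[0][1], rank))
```

On a real corpus, plotting count against rank on log-log axes and checking for a slope near -1 is the usual way to eyeball the fit.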
Normalizing Ratings (blogspot.com)
You Wouldn't Download a Hacker News (jasonthorsness.com)
And now I can analyze it with DuckDB. Behold the fraction of total comments and stories referencing key topics over time!
US Tariff Flow Analyzer (tradeflows.us)
We Found Insurance Fraud in Our Crash Data (levs.fyi)
When we set out to build geospatial risk scores for vehicle crashes at Matrisk AI, we never expected that a side by side look at Vehicle Identification Numbers and crash timelines would hint at possible insurance fraud. But data sometimes surprises you. Below, I’ll walk through how we stumbled upon this discovery, what we found, and why it might matter for anyone insuring vehicles.