Hacker News with Generative AI: Data Analysis

Does the Internet Route Around Damage? – Baltic Sea Cable Cuts (ripe.net)
This week's Internet cable cuts in the Baltic sea have been widely reported, even as attempts to understand their cause and impact are ongoing. We turn to RIPE Atlas to provide a preliminary analysis of these events and examine to what extent the Internet in the region is resilient to these events.
Statistical Rethinking (2024 Edition) (github.com/rmcelreath)
This course teaches data analysis, but it focuses on scientific models.
Consuming the Bluesky firehose for less than $2.50/mo (bad-example.com)
It's fun to play with data[citation needed]. All data on Bluesky is extremely public, and with 15 million users (as of today and with mind-boggling growth), there's a lot of public data to play with.
Non-elementary group-by aggregations in Polars vs pandas (quansight.org)
I attended PyData Berlin 2024 in April, and it was a blast! I met so many colleagues, collaborators, and friends. There was quite some talk of Polars - some people even gathered together for a Polars-themed dinner! It's certainly nice to see people talking about it, and the focus tends to be on features such as:
Tesla has the highest fatal accident rate of all auto brands, study finds (roadandtrack.com)
Tesla vehicles suffer fatal accidents at a rate that's twice the industry average, according to a new report.
Car software patches are over 20% of recalls, study finds (arstechnica.com)
Software fixes are now responsible for more than 1 in 5 automotive recalls. That's the key finding from a decade's worth of National Highway Traffic Safety Administration recall data, according to an analysis from the law firm DeMayo Law.
FireDucks: Pandas but Faster (bearblog.dev)
My background is mainly as a hedge fund professional, so I deal with finance data all the time. So far, the Pandas library has been an indispensable tool in my workflow and my most-used Python library.
Trust no one: why we can't trust most stats about the cybersecurity industry (ventureinsecurity.net)
There is a problem in cybersecurity: solid industry analysis is hard to come by.
Beating the bookies with their own numbers (arxiv.org)
The online sports gambling industry employs teams of data analysts to build forecast models that turn the odds at sports games in their favour.
Show HN: Visprex – Open-source, in-browser data visualisation tool for CSV files (visprex.com)
Visprex is a lightweight data visualisation tool that helps you speed up your statistical modelling and analytics workflows.
Title drops in movies (titledrops.net)
A title drop is when a character in a movie says the title of the movie they're in. Here's a large-scale analysis of 73,921 movies from the last 80 years on how often, when and maybe even why that happens.
I love Rust for tokenising and parsing (xnacly.me)
I am currently writing an analysis tool for SQL: sqleibniz, specifically for the SQLite dialect.
Show HN: IMDb SQL Best Movie Finder (imdb-sql.com)
Hacker News Data Map [180MB] (lmcinnes.github.io)
A map of stories on Hacker News using UMAP and nomic-embed
DuckDB over Pandas/Polars (pgrs.net)
Since my previous post on DuckDB (DuckDB as the New jq), I’ve been continuing to use and enjoy DuckDB.
ClickPy – Python Package Analytics (clickhouse.com)
Browse through 706,881 Python packages from PyPI and over 1.17 trillion downloads, updated daily
Every number in Google Analytics is wrong (plausible.io)
“Every number in your Google Analytics account is wrong.” That is exactly what an independent study recently done by Orbit Media found.
The Blowout No One Sees Coming (vantagedatahouse.com)
Pollsters are expected to be fortune tellers. We’re often asked, “what’s going to happen in the election?” Credible pollsters’ predictions are grounded in reliable data and an understanding of voting behavior, not wishful thinking or reinforcing currently held perceptions. The current prevailing narrative about the U.S. Presidential race is that it’s tight–too close to call. The reality is this race is breaking for the Harris-Walz ticket.
Please show me lots of digits (dynomight.substack.com)
Hi there. It’s me, the person who stares very hard at the numbers in the papers you write. I’ve brought you here today to ask a favor.
A Deep Dive into German Strings (cedardb.com)
“Strings are Everywhere”! At least according to a 2018 DBTest Paper from the Hyper team at Tableau. In fact, strings make up nearly half of the data processed at Tableau. This high prevalence undoubtedly applies to many other companies as well, as the paper’s dataset consists of data analyzed by Tableau’s users. The string-heavy nature of the data makes string processing one of the most important tasks of a database system.
Sorry, but the ROI on enterprise AI is abysmal (theregister.com)
The deployment of AI projects and associated return on investment (ROI) have declined, according to a large survey of IT decision-makers.
Sampling with SQL (moertel.com)
Sampling is one of the most powerful tools you can wield to extract meaning from large datasets.
Evaluation quirks, metric pitfalls and some recommendations (juriopitz.com)
I’ve been observing system evaluation practice for close to 10 years. I thought I'd share a few funny and intriguing things that I noted.
Show HN: Automated smooth Nth order derivatives of noisy data (github.com/hugohadfield)
kalmangrad is a Python package that calculates automated smooth N'th order derivatives of non-uniformly sampled time series data. The approach leverages Bayesian filtering techniques to compute derivatives up to any specified order, offering a robust alternative to traditional numerical differentiation methods that are sensitive to noise. This package is built on top of the underlying bayesfilter package.
Show HN: Andre v1.4 – Revamped User Dashboard to Democratize Data Analysis (ycombinator.com)
Hey HN! We just released ANDRE v1.4 with a huge upgrade to the user dashboard, designed to make data analysis smoother and more accessible for everyone, not just data experts.
Testosterone Peaks at Series B (ycombinator.com)
We tested 139 founders for 17 biomarkers at the YC founders reunion last year and built a dashboard with some cool (very much anonymized) stats
The Low Stability of High Income (ofdollarsanddata.com)
Write your first MapReduce program in 20 minutes (deepnote.com)
This notebook focuses on MapReduce, which processes large datasets. It is designed to find the maximum transaction value by store using sales data. For those who want to learn MapReduce from scratch, this notebook covers the basics. For more information, here is a detailed article.
Unix for Poets: Basic NLP Tasks Using Unix Tools (medium.com)
Often, we become so captivated by complexity and sophistication that we overlook the profound effectiveness of simple, fundamental methods.
You can track changes someone makes to their Instagram account (github.com/ibnaleem)
📸 an Instagram tracker that logs any changes to an Instagram account (followers, following, posts, and bio)