Hacker News with Generative AI: Data Science

Non-elementary group-by aggregations in Polars vs pandas (quansight.org)
I attended PyData Berlin 2024 in April, and it was a blast! I met so many colleagues, collaborators, and friends. There was quite some talk of Polars - some people even gathered together for a Polars-themed dinner! It's certainly nice to see people talking about it, and the focus tends to be on features such as:
Image-Text Curation for 1B+ Data: Faster, Better, Smaller CLIP Models (datologyai.com)
The Corpus of United States State Statutes–Design, Construction and Use (ssrn.com)
There is a need for more publicly available corpora of legal language.
The geometry of data: the missing metric tensor and the Stein score [Part II] (christianperone.com)
I’m writing this second part of the series because I couldn’t find any formalisation of this metric tensor that naturally arises from the Stein score (especially when used with learned models), much less any blog posts or articles about it, which is surprising given its deep connection to score-based generative models, diffusion models and the geometry of the data manifold.
New textbook teaches students about matrix methods and their real-world apps (engin.umich.edu)
A new textbook, Linear Algebra for Data Science, Machine Learning, and Signal Processing, is being unveiled for use in classes that explore the many applications of matrix methods to real-world data.
45% of U.S. Data Scientists Work at Just 20 Companies (index42.com)
For most companies, data scientists are indispensable. These specialized professionals are at the core of artificial intelligence (AI) and data-driven projects, helping companies uncover insights, optimize operations, and unlock new business opportunities.
Timing-Sensitive Analysis in Python (deepnote.com)
Time consistency is critical in many fields, especially in sensitive applications like cryptography.
Segmenting Credit Card Customers with K-Means (medium.com)
Ever wondered how credit card companies categorize their clients? Here’s a behind-the-scenes look at how they do it using data science! In this project, we’ll explore how to segment credit card customers with K-Means clustering, one of the most popular machine learning techniques. Let’s dive into the key steps and make sense of the data!
RD-TableBench – Accurately evaluating table extraction (reducto.ai)
RD-TableBench is an open benchmark to help teams evaluate extraction performance for complex tables.
DataChain: DBT for Unstructured Data (github.com/iterative)
DataChain is a modern Pythonic data-frame library designed for artificial intelligence.
131M American Buildings (marksblogg.com)
In May, Nature published an article detailing Oak Ridge National Laboratory's (ORNL) new, AI-generated US Building Dataset.
Python has overtaken JavaScript on GitHub (infoworld.com)
Python has overtaken JavaScript as the most popular language on GitHub, while the use of Jupyter Notebooks has also skyrocketed on the site. The rise of both underscores the surge in data science, artificial intelligence, and machine learning on the code-sharing platform, according to GitHub’s just-released Octoverse 2024 report.
The overlooked GenAI use case: cleaning, processing, and analyzing data (sumble.com)
Job post data reveals what companies plan to do with GenAI. The biggest use case is cleaning, processing, and analyzing data.
Data Commons (datacommons.org)
Probability-generating functions (entropicthoughts.com)
I have long struggled with understanding what probability-generating functions are and how to intuit them. There were two pieces of the puzzle missing for me, and we’ll go through both in this article.
Mistakes from building a model to scalp concert tickets (datastream.substack.com)
In November 2022, Taylor Swift fans crashed Ticketmaster. Not just the website - they crashed the entire company's facade of competence. Over 14 million people tried to buy tickets meant for 1.5 million fans. The servers melted down, the presale collapsed, and Ticketmaster had to cancel the general public sale entirely. Swift herself said watching her fans struggle to get tickets was "excruciating."
Using Survival Analysis to estimate product lifetime (jumpdata.co.uk)
We needed to estimate how many years after purchase various consumer products broke down, on average, using a large dataset containing:
ModelKit: Transforming AI/ML artifact sharing and management across lifecycles (kitops.ml)
ModelKit revolutionizes the way AI/ML artifacts are shared and managed throughout the lifecycle of AI/ML projects.
Goodhart’s law isn’t as useful as you might think (2023) (commoncog.com)
Goodhart’s Law is a famous adage that goes “when a measure becomes a target, it ceases to be a good measure.”
KitOps: Only Standards-Based Packaging and Versioning Tool for AI/ML Projects (kitops.ml)
KitOps is an innovative open-source project designed to enhance collaboration among data scientists, application developers, and SREs working on integrating or managing self-hosted AI/ML models.
Data viz project that maps all earthquakes by magnitude (concord.org)
Fast and scalable dataset preparation and curation tool from Nvidia (github.com/NVIDIA)
🚀 The GPU-Accelerated Open Source Framework for Efficient Large Language Model Data Curation 🚀
BMI-type measure for a place's "goodness of weather" (ycombinator.com)
My NumPy year: Creating a DType for the next generation of scientific computing (quansight.com)
From no CPython C API experience to shipping a new DType in NumPy 2.0.
Is It Pokémon or Big Data? (pixelastic.github.io)
Made by @pixelastic, inspired by a Google Form. Source code is available on GitHub.
Free Python Course – A friendly, modern hands-on introduction to Python (fabridamicelli.github.io)
This is an open, free short introduction to the Python programming language, emphasizing practical over theoretical aspects and with a focus on data-related tasks.
Understanding Gaussians (gestalt.ink)
The Gaussian distribution, or normal distribution, is a key subject in statistics, machine learning, physics, and pretty much any other field that deals with data and probability. It’s one of those subjects, like $\pi$ or Bayes’ rule, that is so fundamental that people treat it like an icon.
Why Rust is an increasingly beloved part of my programming toolbox (idyll.org)
Hi, I'm Titus. I'm a prof at UC Davis who writes scientific software for biological data analysis of very, very large data sets. And I really like Rust, especially in combination with Python. This blog post is about why.
Machine Learning Applications to Computational Plasma Physics and Reduced-Order Plasma Modeling (arxiv.org)
Machine learning (ML) provides a broad spectrum of tools and architectures that enable the transformation of data from simulations and experiments into useful and explainable science, thereby augmenting domain knowledge.
Waking up science's sleeping beauties (2023) (worksinprogress.co)
Many scientific papers receive little attention initially but become highly cited years later. What groundbreaking discoveries might have already been made, and how can we uncover them faster?