Hacker News with Generative AI: Data Science

Show HN: Hyperparam: OSS tools for exploring datasets locally in the browser (hyperparam.app)
Hyperparam was founded to address a critical gap in the machine learning ecosystem: the lack of a user-friendly, scalable UI for exploring and curating massive datasets.
Dataframely: A polars-native data frame validation library (quantco.com)
At QuantCo, we are constantly trying to improve the quality of our code bases to ensure that they remain easily maintainable. More recently, this often involved migrating data pipelines from pandas to polars in order to achieve significant performance gains.
Relational Graph Transformers (kumo.ai)
In the world of enterprise data, the most valuable insights often lie not in individual tables, but in the complex relationships between them. Customer interactions, product hierarchies, transaction histories—these interconnected data points tell rich stories that traditional machine learning approaches struggle to fully capture. Enter Relational Graph Transformers: a breakthrough architecture that's transforming how we extract intelligence from relational databases.
Stuffed-Na(a)N: stuff your NaNs (github.com/si14)
Have you ever done this by mistake?
Are polynomial features the root of all evil? (2024) (alexshtf.github.io)
It turns out that it’s just a MYTH. There’s nothing inherently wrong with high-degree polynomials, and in contrast to what is typically taught, high-degree polynomials are easily controlled using standard ML tools, like regularization. The myth stems mainly from two misconceptions about polynomials that we will explore here. In fact, not only are they great non-linear features, certain representations also provide us with powerful control over the shape of the function we wish to learn.
Introduction to Graph Transformers (kumo.ai)
Graphs are everywhere. From modeling molecular interactions and social networks to detecting financial fraud, learning from graph data is powerful—but inherently challenging.
The value of a dedicated data science approach in HR (gorelik.net)
This document outlines why HR departments in large organizations benefit from a dedicated data science approach, highlighting impacts beyond recruitment. In short, my thesis is as follows: as organizations scale, so does the complexity of understanding their internal dynamics. Data tools become essential to analyzing large organizations, as they enable HR to identify patterns and insights that can drive strategic improvements across key areas.
Show HN: LLM Based Spark Profiler (datasre.ai)
Functional connectomics spanning multiple areas of mouse visual cortex (nature.com)
Understanding the brain requires understanding neurons’ functional responses to the circuit architecture shaping them. Here we introduce the MICrONS functional connectomics dataset with dense calcium imaging of around 75,000 neurons in primary visual cortex (VISp) and higher visual areas (VISrl, VISal and VISlm) in an awake mouse that is viewing natural and synthetic stimuli.
FUTO open-sources 1M row keyboard swipe dataset (huggingface.co)
Nvidia's latest AI PC boxes sound great – for data scientists with $3k to spare (theregister.com)
UK govt data people not 'technical,' says ex-Downing St data science head (theregister.com)
A former director of data science at the UK prime minister's office has told MPs that people working with data in government are not typically technical and would be unlikely to get a similar job in the private sector.
Compress Better, Compute Bigger (ironarray.io)
Have you ever experienced the frustration of not being able to analyze a dataset because it's too large to fit in memory? Or perhaps you've encountered the memory wall, where computation is hindered by slow memory access? These are common challenges in data science and high-performance computing. The developers of Blosc and Blosc2 have consistently focused on achieving compression and decompression speeds that approach or even exceed memory bandwidth limits.
Study of Lyft rideshare data confirms minorities get more tickets (arstechnica.com)
It's no secret that "driving while black" is a real phenomenon. Study after study has shown that minority drivers are ticketed at a higher rate, and data from speed cameras suggests that it's not because they commit traffic violations more frequently. But this leaves open the question of why. Bias is an obvious answer, but it's hard to eliminate an alternative explanation: Minority groups may engage in more unsafe driving, and the police are trying to deter that.
Show HN: Xorq – open-source Python-first Pandas-style pipelines (github.com/xorq-labs)
xorq is a deferred computational framework that brings the replicability and performance of declarative pipelines to the Python ML ecosystem.
(yet another) two kinds of data scientists (artofdatascience.substack.com)
We’re hiring data scientists. I mean, we hired one last week, but we need more data scientists here at Babbage.
Stop using the elbow criterion for k-means (arxiv.org)
A major challenge when using k-means clustering is often how to choose the parameter k, the number of clusters.
Notebooks as reusable Python programs (marimo.io)
AI developers, data engineers, data scientists, and researchers do a lot of their work in Python notebooks like Jupyter.
The clustering behavior of sliding windows (arxiv.org)
Things can go spectacularly wrong when clustering timeseries data that has been preprocessed with a sliding window.
Why extracting data from PDFs is still a nightmare for data experts (arstechnica.com)
For years, businesses, governments, and researchers have struggled with a persistent problem: How to extract usable data from Portable Document Format (PDF) files.
Predictive Data Selection: The Data That Predicts Is the Data That Teaches (arxiv.org)
Language model pretraining involves training on extensive corpora, where data quality plays a pivotal role.
Changing Forecasts for Python Questions on Stack Overflow (win-vector.com)
I recently conducted a small time series workshop session for AI+ training hosted by ODSC. It went really well, and I’d be happy to offer longer interactive workshops going forward (please reach out if your team would like one!).
Building Deep Research Agent from Scratch (swirlai.com)
👋 I am Aurimas. I write the SwirlAI Newsletter with the goal of presenting complicated Data related concepts in a simple and easy-to-digest way. My mission is to help You UpSkill and keep You updated on the latest news in GenAI, MLOps, Data Engineering, Machine Learning and overall Data space.
Nexar Dashcam Crash Prediction Challenge (kaggle.com)
Extracting time series features: a powerful method from an obscure paper [pdf] (rcin.org.pl)
McKinsey's Analysis of the Gender Pay Gap: Data Exploration Gone Wrong (substack.com)
The power of interning: making a time series database smaller (gendignoux.com)
This weekend project started by browsing the open-data repository of Paris’ public transport network, which contains various APIs to query real-time departures, current disruptions, etc.
Data Incest: When AI Breeds with Itself (defragzone.substack.com)
Today, we’re diving into a problem that’s as bizarre as it is concerning: data incest. No, this isn’t some dystopian sci-fi horror plot. It’s a real issue creeping into AI and machine learning. And trust me, it’s as messy as it sounds.
Show HN: NumPy+Jax Except with Named Axes (github.com/justindomke)
NumPy+Jax with named axes and an uncompromising attitude