Hacker News with Generative AI: Data Science

Nvidia's latest AI PC boxes sound great – for data scientists with $3k to spare (theregister.com)
Nvidia's latest AI PC boxes sound great – if you're a data scientist with $3,000 to spare
UK govt data people not 'technical,' says ex-Downing St data science head (theregister.com)
A former director of data science at the UK prime minister's office has told MPs that people working with data in government are not typically technical and would be unlikely to get a similar job in the private sector.
Compress Better, Compute Bigger (ironarray.io)
Have you ever experienced the frustration of not being able to analyze a dataset because it's too large to fit in memory? Or perhaps you've encountered the memory wall, where computation is hindered by slow memory access? These are common challenges in data science and high-performance computing. The developers of Blosc and Blosc2 have consistently focused on achieving compression and decompression speeds that approach or even exceed memory bandwidth limits.
Study of Lyft rideshare data confirms minorities get more tickets (arstechnica.com)
It's no secret that "driving while black" is a real phenomenon. Study after study has shown that minority drivers are ticketed at a higher rate, and data from speed cameras suggests that it's not because they commit traffic violations more frequently. But this leaves open the question of why. Bias is an obvious answer, but it's hard to eliminate an alternative explanation: Minority groups may engage in more unsafe driving, and the police are trying to deter that.
Show HN: Xorq – open-source Python-first Pandas-style pipelines (github.com/xorq-labs)
xorq is a deferred computational framework that brings the replicability and performance of declarative pipelines to the Python ML ecosystem.
(yet another) two kinds of data scientists (artofdatascience.substack.com)
We’re hiring data scientists. I mean, we hired one last week, but we need more data scientists here at Babbage.
Stop using the elbow criterion for k-means (arxiv.org)
A major challenge when using k-means clustering often is how to choose the parameter k, the number of clusters.
Notebooks as reusable Python programs (marimo.io)
AI developers, data engineers, data scientists, and researchers do a lot of their work in Python notebooks like Jupyter.
The clustering behavior of sliding windows (arxiv.org)
Things can go spectacularly wrong when clustering timeseries data that has been preprocessed with a sliding window.
Why extracting data from PDFs is still a nightmare for data experts (arstechnica.com)
For years, businesses, governments, and researchers have struggled with a persistent problem: How to extract usable data from Portable Document Format (PDF) files.
Predictive Data Selection: The Data That Predicts Is the Data That Teaches (arxiv.org)
Language model pretraining involves training on extensive corpora, where data quality plays a pivotal role.
Changing Forecasts for Python Questions on Stack Overflow (win-vector.com)
I recently conducted a small time series workshop session for AI+ training hosted by ODSC. It went really well, and I’d be happy to offer longer interactive workshops going forward (please reach out if your team would like one!).
Building Deep Research Agent from Scratch (swirlai.com)
👋 I am Aurimas. I write the SwirlAI Newsletter with the goal of presenting complicated Data related concepts in a simple and easy-to-digest way. My mission is to help You UpSkill and keep You updated on the latest news in GenAI, MLOps, Data Engineering, Machine Learning and overall Data space.
Nexar Dashcam Crash Prediction Challenge (kaggle.com)
Extracting time series features: a powerful method from an obscure paper [pdf] (rcin.org.pl)
McKinsey's Analysis of the Gender Pay Gap: Data Exploration Gone Wrong (substack.com)
The power of interning: making a time series database smaller (gendignoux.com)
This week-end project started by browsing the open-data repository of Paris’ public transport network, which contains various APIs to query real-time departures, current disruptions, etc.
Data Incest: When AI Breeds with Itself (defragzone.substack.com)
Today, we’re diving into a problem that’s as bizarre as it is concerning: data incest. No, this isn’t some dystopian sci-fi horror plot. It’s a real issue creeping into AI and machine learning. And trust me, it’s as messy as it sounds.
Show HN: NumPy+Jax Except with Named Axes (github.com/justindomke)
NumPy+Jax with named axes and an uncompromising attitude
The best way to use text embeddings portably is with Parquet and Polars (minimaxir.com)
Text embeddings, particularly modern embeddings generated from large language models, are one of the most useful applications coming from the generative AI boom.
You can’t build a moat with AI (redux) (frontierai.substack.com)
Last spring, we wrote an article called You can’t build a moat with AI. That post argued that prompt engineering, while important, would be difficult to defend over time given how easy it is to experiment with LLMs. As a result, you have to focus on the quality of data your application has access to and your use of that data to differentiate yourself.
USDS Engineering Director Resigns: 'This Is Not the Mission I Came to Serve' (wired.com)
The director of data science and engineering for the United States Digital Service—which Elon Musk rebranded as the US DOGE Service—has resigned from her position.
Weird Kaggle and the Superiority of Books (griffens.net)
I recently entered a Kaggle competition to brush up on some modeling skills. The analysis problem is a pretty typical clinical prediction question. My final product is, well, "final product" is an extremely charitable description of it. But despite this it was a fascinating and worthwhile experience, full of interesting questions to ponder, such as "what is the fundamental difference between statistics and machine learning?" and "do they realize their evaluation metric is pretty silly?"
Tristan Davey's Punch Card Archive (tristandavey.com)
Punched cards were once a ubiquitous part of accounting, data collection and early computing.
Show HN: Bag of words – Build and share smart data apps using AI (github.com/bagofwords1)
Bag of words enables users to create comprehensive dashboards with a single prompt and refine them iteratively.
Physics Informed Neural Networks (pages.dev)
I mentioned in a previous post that as part of my position as junior DS for IKEA, I also get the opportunity to take part in their training program (called AI accelerator program). It’s a six months long program in which we are trained on all aspects of data science and AI by both professionals but also through various courses.
Basketball has evolved into a game of calculated decision-making (nabraj.com)
Basketball has evolved from a game of unpredictability into a game of calculated decision-making with the use of data and analytics.
Classic Data science pipelines built with LLMs (github.com/Pravko-Solutions)
Below you’ll find an overview of each file, what it showcases, and links to the actual code. You can run these examples (assuming you’ve installed FlashLearn and set your “OPENAI_API_KEY”) simply by using:
Stop using zip codes for geospatial analysis (2019) (carto.com)
Uncover deeper insights beyond ZIP codes with geospatial analysis. Explore the limitations of ZIP codes and discover alternatives for spatial understanding.