Hacker News with Generative AI: Data Science

PyViz – Overview of the Python visualization landscape (pyviz.org)
The Python visualization landscape can seem daunting at first. These overviews attempt to shine light on common patterns and use cases, comparing or discussing multiple plotting libraries. Note that some of the projects discussed in the overviews are no longer maintained, so be sure to check the list of dormant projects before choosing a library.
Directory of job boards by category (remote only, data science, etc.) (jobsearchdb.com)
Find an active job board in your career niche
Don't use cosine similarity carelessly (migdal.pl)
Midas turned everything he touched into gold. Data scientists turn everything into vectors. We do it for a reason: as gold is the language of merchants, vectors are the language of AI.
dbt Labs acquires SDF Labs (getdbt.com)
The TL;DR: today, I have the pleasure of announcing that dbt Labs has acquired SDF Labs. The two teams are already working side-by-side to bring SDF’s SQL comprehension technology into the hands of dbt users everywhere. SDF will be a massive upgrade to the very heart of the dbt user experience moving forward.
Show HN: SemHash – Fast Semantic Text Deduplication for Cleaner Datasets (github.com/MinishLab)
How to become a Data Scientist? My journey, overview of skill set, practice tips (mljar.com)
In recent years, many people have been drawn to the field of data science, often believing it to be a fast track to wealth.
Small Data [video] (youtube.com)
Time-Series Anomaly Detection: A Decade Review (arxiv.org)
Recent advances in data collection technology, accompanied by the ever-rising volume and velocity of streaming data, underscore the vital need for time series analytics.
JPL Horizons on-line solar system data and ephemeris computation service (nasa.gov)
The JPL Horizons on-line solar system data and ephemeris computation service provides access to key solar system data and flexible production of highly accurate ephemerides for solar system objects.
Name Collision of the Year: Vector (crunchydata.com)
I can’t get through a zoom call, a conference talk, or an afternoon scroll through LinkedIn without hearing about vectors. Do you feel like the term vector is everywhere this year? It is.
Optimality of Frequency Moment Estimation (weizmann.ac.il)
Model Evaluation with RandomForest and AdaBoost (deepnote.com)
This notebook demonstrates how to train and evaluate machine learning models, specifically focusing on the Random Forest Classifier and AdaBoost Classifier. We will use synthetic data and visualize key metrics, including accuracy, precision, recall, and F1-score.
Maximum likelihood estimation and loss functions (rish-01.github.io)
When I started learning about loss functions, I could always understand the intuition behind them. For example, the mean squared error (MSE) for regression seemed logical—penalizing large deviations from the ground-truth makes sense. But one thing always bothered me: I could never come up with those loss functions on my own. Where did they come from? Why do we use these specific formulas and not something else?
Ask HN: Why is Ilya saying data is limited when the whole world is data? (ycombinator.com)
In his recent talk Ilya S. said that the data running out is a fundamental constraint on the scaling laws.
Array Languages for Clojurians (2020) (appliedscience.studio)
A discussion in the Clojure data science Zulip led me to Slobodan Blazeski’s enlightening article Array languages for Lisp programmers in the journal of the British APL Association. He quotes Alan Perlis:
Datasaurus dozen – Different datasets with the same descriptive statistics (wikipedia.org)
The Datasaurus dozen comprises thirteen data sets that have nearly identical simple descriptive statistics to two decimal places, yet have very different distributions and appear very different when graphed.[1] It was inspired by the smaller Anscombe's quartet that was created in
Soylent Green Is People (xeiaso.net)
Recently a group of data scientists at Hugging Face created a dataset of curated Bluesky posts. The publication of this data has made a lot of people very angry and has been widely regarded as a bad move. The dataset contained one million posts from the Bluesky firehose with the intent that this could be a standard dataset to evaluate the effectiveness of various moderation tooling.
1,600 days of a failed hobby data science project (lellep.xyz)
⭐ I spent >1,600 days working on a data science project that then failed because I lost interest. This article is to cope with the failure and maybe help you (and me) to finish successful data science projects by summarising a few learnings into a checklist, see below.
It's not enough to be right as an engineer (pedramnavid.com)
Early in my career, I was rewarded for being right. As a data scientist, I worked hard to understand some business processes, make predictions, model some outcomes, and help forecast some future states. As a data engineer, my ability to properly forecast future usage of a system and build systems that are resilient and scalable to that future state while being easy to maintain helped me grow in my career.
PyMyFlySpy: Track your flight using its headrest data (robertheaton.com)
“Where are we daddy?” asked my five-year-old.
Ask HN: Freelancer? Seeking freelancer? (December 2024) (ycombinator.com)
I'm a data scientist and I've been absolutely slaughtered by this recession. I lost a big project in July and haven't picked anything up since. Everything is "on hold until new year". So I'm looking for a project or it's coal for Christmas.
From PDFs to AI-ready structured data: a deep dive (explosion.ai)
PDFs are ubiquitous in industry and daily life. Paper is scanned, documents are sent and received as PDF, and they’re often kept as the archival copy. Unfortunately, processing PDFs is hard. In this blog post, I’ll present a new modular workflow for converting PDFs and similar documents to structured data and show how to build end-to-end document understanding and information extraction pipelines for industry use cases.
Vesuvius Challenge: First letters found in new scroll (scrollprize.substack.com)
Today we are thrilled to share a new scroll dataset that is already full of exciting findings. In collaboration with our partners at the Bodleian Library, the University of Oxford, Diamond Light Source, and EduceLab, we present: PHerc. 172!
A Non-Technical Guide to Interpreting SHAP Analyses (aidancooper.co.uk)
With interpretability becoming an increasingly important requirement for machine learning projects, there's a growing need to communicate the complex outputs of model interpretation techniques to non-technical stakeholders.
Heuristics that almost always work (deepnote.com)
Two Parameter Model for Running Performance (normalizingconstant.com)
Human running performance from real-world big data is a super cool paper from 2020 that analyzes 14,000 people’s running activities.
Why can't we separate YAML from ML? (ethanrosenthal.com)
I’m coming up on 10 years, and half as many jobs, in data science and machine learning. No matter what, in every role, I find myself reinventing a programming language on top of YAML in order to train machine learning models.
Non-elementary group-by aggregations in Polars vs pandas (quansight.org)
I attended PyData Berlin 2024 in April, and it was a blast! I met so many colleagues, collaborators, and friends. There was quite some talk of Polars - some people even gathered together for a Polars-themed dinner! It's certainly nice to see people talking about it, and the focus tends to be on features such as:
Image-Text Curation for 1B+ Data: Faster, Better, Smaller CLIP Models (datologyai.com)
The Corpus of United States State Statutes–Design, Construction and Use (ssrn.com)
There is a need for more publicly available corpora of legal language.