Hacker News with Generative AI: Data Science

Squiggle: A simple programming language for intuitive probabilistic estimation (squiggle-language.com)
A simple programming language for intuitive probabilistic estimation

Programming Languages, Probabilistic Modeling, Data Science

6 points by fanf2 45 days ago | 0 comments

Word Tour: 1d word embeddings (data-processing.club)
In the field of Natural Language Processing (NLP), a central theme has always been “how to make computers understand the meaning of words.”

Word Embeddings, AI, Data Science

5 points by atrudeau 48 days ago | 0 comments

DumPy: NumPy except it's OK if you're dum (dynomight.net)
I don’t like NumPy

Python, Libraries, Data Science, Software

28 points by antimatter15 49 days ago | 1 comment

DumPy: NumPy except it's OK if you're dum (dynomight.net)
They say you can’t truly hate someone unless you loved them first. I don’t know if that’s true as a general principle, but it certainly describes my relationship with NumPy. NumPy, by the way, is...

Python, Data Science, NumPy, Libraries

7 points by jjar 50 days ago | 2 comments

DumPy: NumPy except it's OK if you're dum (dynomight.substack.com)

Python, NumPy, Data Science, Programming, Humor

7 points by crescit_eundo 51 days ago | 2 comments

DumPy: NumPy except it's OK if you're dum (dynomight.net)
I don’t like NumPy

Python, Libraries, Data Science, Humor

23 points by mbforbes 51 days ago | 4 comments

Show HN: Juvio – UV Kernel for Jupyter (github.com/OKUA1)
Juvio: reproducible, dependency-aware, and Git-friendly Jupyter Notebooks.

Jupyter Notebooks, Open Source, Python, Data Science, Reproducibility

118 points by okost1 53 days ago | 27 comments

Climbing trees 1: what are decision trees? (mathpn.com)

Machine Learning, Data Science, Algorithms, Decision Trees

45 points by SchwKatze 55 days ago | 4 comments

Will AI systems perform poorly due to AI-generated material in training data? (cacm.acm.org)
Will future artificial intelligence systems perform increasingly poorly due to AI-generated material in their training data?

Artificial Intelligence, Generative AI, Machine Learning, Data Science

115 points by pseudolus 56 days ago | 142 comments

Simple Classification Rules Perform Well on Commonly Used Datasets (1993) [pdf] (ualberta.ca)

Machine Learning, Data Science, Classification, Datasets, Research Papers

4 points by Tomte 57 days ago | 1 comment

I don't like NumPy (dynomight.net)
They say you can’t truly hate someone unless you loved them first. I don’t know if that’s true as a general principle, but it certainly describes my relationship with NumPy.

Python, Programming Languages, Data Science

488 points by MinimalAction 58 days ago | 206 comments

Databricks acquires Neon (databricks.com)

Acquisitions, Cloud Computing, Data Science, Software

389 points by davidgomes 59 days ago | 226 comments

Data preparation for function tooling is boring (thehyperplane.substack.com)
🚨 Alert. This article is dangerously practical; no AI buzzword bingo here. Oh, and did I mention? We have CODE. 🧑‍💻

Data Science, Software Development, Machine Learning

9 points by andreeamiclaus 59 days ago | 4 comments

OpenTelemetry protocol with Apache Arrow (opentelemetry.io)
We are excited to announce the next phase of the OpenTelemetry Protocol with Apache Arrow project (OTel-Arrow). We began this project several years ago with the goal of bridging between OpenTelemetry data and the Apache Arrow ecosystem.

OpenTelemetry, Apache Arrow, Protocol, Data Transmission, Data Science

108 points by tanelpoder 60 days ago | 18 comments

Offline vs. online ML pipelines (decodingml.substack.com)
If you don't separate offline and online pipelines now...

Machine Learning, Data Science, Pipelines

13 points by rbanffy 64 days ago | 2 comments

National Snow and Ice Data Center changes service level to key sea ice datasets (nsidc.org)
Effective May 5, 2025, NOAA’s National Centers for Environmental Information (NCEI) will decommission its snow and ice data products from the Coasts, Oceans, and Geophysics Science Division (COGS).

Climate Change, Data Science, Environmental Science, Research

5 points by waterthrowaway 64 days ago | 1 comment

Adventures in Imbalanced Learning and Class Weight (andersource.dev)
A few months ago I was working on an image classification problem with severe class imbalance - the positive class was much rarer than the negative class.

Machine Learning, Data Science, Classification, Class Imbalance

49 points by andersource 65 days ago | 8 comments

Dreariness Index (2015) (blogspot.com)
How do you define dreary weather? Is it the amount of rain/snow? How about the frequency of precipitation? Many people feel that cloudy weather is dreary. Of course dreary does not have a scientific definition so some arbitrary measure must be developed.

Weather, Climate, Data Science, Opinion

32 points by skupig 67 days ago | 30 comments

ProbOnto – The Ontology and Knowledge Base of Probability Distributions (sites.google.com)
Welcome to ProbOnto - the Ontology and Knowledge Base of Probability Distributions, v2.5

Knowledge Graphs, Machine Learning, Data Science, Probability Distributions, Ontology

9 points by klysm 70 days ago | 0 comments

Show HN: Hyperparam: OSS tools for exploring datasets locally in the browser (hyperparam.app)
Hyperparam was founded to address a critical gap in the machine learning ecosystem: the lack of a user-friendly, scalable UI for exploring and curating massive datasets.

Machine Learning, Data Science, Open Source, Web Development, Tools

77 points by platypii 72 days ago | 21 comments

Dataframely: A polars-native data frame validation library (quantco.com)
At QuantCo, we are constantly trying to improve the quality of our code bases to ensure that they remain easily maintainable. More recently, this often involved migrating data pipelines from pandas to polars in order to achieve significant performance gains.

Data Science, Libraries, Performance Optimization

39 points by sito42 73 days ago | 8 comments

Relational Graph Transformers (kumo.ai)
In the world of enterprise data, the most valuable insights often lie not in individual tables, but in the complex relationships between them. Customer interactions, product hierarchies, transaction histories—these interconnected data points tell rich stories that traditional machine learning approaches struggle to fully capture. Enter Relational Graph Transformers: a breakthrough architecture that's transforming how we extract intelligence from relational databases.

Machine Learning, Data Science, Enterprise Data, Relational Databases

74 points by gk1 74 days ago | 7 comments

Stuffed-Na(a)N: stuff your NaNs (github.com/si14)
Have you ever done this by mistake?

Programming, Software, Software Development, Data Science, Python

156 points by dgroshev 77 days ago | 58 comments

Are polynomial features the root of all evil? (2024) (alexshtf.github.io)
It turns out that it’s just a MYTH. There’s nothing inherently wrong with high degree polynomials, and in contrast to what is typically taught, high degree polynomials are easily controlled using standard ML tools, like regularization. The source of the myth stems mainly from two misconceptions about polynomials that we will explore here. In fact, not only are they great non-linear features, certain representations also provide us with powerful control over the shape of the function we wish to learn.

Machine Learning, Data Science, Myths, Mathematics, Polynomial Regression

188 points by Areibman 81 days ago | 77 comments

Introduction to Graph Transformers (kumo.ai)
Graphs are everywhere. From modeling molecular interactions and social networks to detecting financial fraud, learning from graph data is powerful—but inherently challenging.

Machine Learning, Data Science

44 points by gk1 81 days ago | 0 comments

The value of a dedicated data science approach in HR (gorelik.net)
This document outlines why HR departments in large organizations benefit from a dedicated data science approach, highlighting impacts beyond recruitment. In short, my thesis is as follows: as organizations scale, so does the complexity of understanding their internal dynamics. Data tools become essential to analyzing large organizations, as they enable HR to identify patterns and insights that can drive strategic improvements across key areas.

Data Science, Human Resources, Business Intelligence, Organizational Dynamics, Strategy

13 points by luu 90 days ago | 3 comments

Show HN: LLM Based Spark Profiler (datasre.ai)

Spark, Data Science, Profiling, Show HN

27 points by ambrood 93 days ago | 5 comments

Functional connectomics spanning multiple areas of mouse visual cortex (nature.com)
Understanding the brain requires understanding neurons’ functional responses to the circuit architecture shaping them. Here we introduce the MICrONS functional connectomics dataset with dense calcium imaging of around 75,000 neurons in primary visual cortex (VISp) and higher visual areas (VISrl, VISal and VISlm) in an awake mouse that is viewing natural and synthetic stimuli.

Neuroscience, Brain Research, Visual Cortex, Mice, Data Science

6 points by rntn 94 days ago | 0 comments

FUTO open-sources 1M row keyboard swipe dataset (huggingface.co)

Data Science, Machine Learning, Datasets, Open Source, Keyboard Input

12 points by ReadEvalPost 98 days ago | 1 comment

Nvidia's latest AI PC boxes sound great – for data scientists with $3k to spare (theregister.com)
Nvidia's latest AI PC boxes sound great – if you're a data scientist with $3,000 to spare

Nvidia, Artificial Intelligence, Hardware, Data Science, Finance

44 points by rntn 103 days ago | 55 comments