Show HN: Juvio – UV Kernel for Jupyter
(github.com/OKUA1)
Juvio: reproducible, dependency-aware, and Git-friendly Jupyter Notebooks.
Will AI systems perform poorly due to AI-generated material in training data?
(cacm.acm.org)
Will future artificial intelligence systems perform increasingly poorly due to AI-generated material in their training data?
I don't like NumPy
(dynomight.net)
They say you can’t truly hate someone unless you loved them first. I don’t know if that’s true as a general principle, but it certainly describes my relationship with NumPy.
Data preparation for function tooling is boring
(thehyperplane.substack.com)
🚨 Alert. This article is dangerously practical; no AI buzzword bingo here. Oh, and did I mention? We have CODE. 🧑‍💻
OpenTelemetry protocol with Apache Arrow
(opentelemetry.io)
We are excited to announce the next phase of the OpenTelemetry Protocol with Apache Arrow project (OTel-Arrow). We began this project several years ago with the goal of bridging between OpenTelemetry data and the Apache Arrow ecosystem.
Offline vs. online ML pipelines
(decodingml.substack.com)
If you don't separate offline and online pipelines now....
National Snow and Ice Data Center changes service level to key sea ice datasets
(nsidc.org)
Effective May 5, 2025, NOAA’s National Centers for Environmental Information (NCEI) will decommission its snow and ice data products from the Coasts, Oceans, and Geophysics Science Division (COGS).
Adventures in Imbalanced Learning and Class Weight
(andersource.dev)
A few months ago I was working on an image classification problem with severe class imbalance - the positive class was much rarer than the negative class.
Dreariness Index (2015)
(blogspot.com)
How do you define dreary weather? Is it the amount of rain/snow? How about the frequency of precipitation? Many people feel that cloudy weather is dreary. Of course dreary does not have a scientific definition so some arbitrary measure must be developed.
ProbOnto – The Ontology and Knowledge Base of Probability Distributions
(sites.google.com)
Welcome to ProbOnto - the Ontology and Knowledge Base of Probability Distributions, v2.5
Show HN: Hyperparam: OSS tools for exploring datasets locally in the browser
(hyperparam.app)
Hyperparam was founded to address a critical gap in the machine learning ecosystem: the lack of a user-friendly, scalable UI for exploring and curating massive datasets.
Dataframely: A polars-native data frame validation library
(quantco.com)
At QuantCo, we are constantly trying to improve the quality of our code bases to ensure that they remain easily maintainable. More recently, this often involved migrating data pipelines from pandas to polars in order to achieve significant performance gains.
Relational Graph Transformers
(kumo.ai)
In the world of enterprise data, the most valuable insights often lie not in individual tables, but in the complex relationships between them. Customer interactions, product hierarchies, transaction histories—these interconnected data points tell rich stories that traditional machine learning approaches struggle to fully capture. Enter Relational Graph Transformers: a breakthrough architecture that's transforming how we extract intelligence from relational databases.
Are polynomial features the root of all evil? (2024)
(alexshtf.github.io)
It turns out that it’s just a MYTH. There’s nothing inherently wrong with high degree polynomials, and in contrast to what is typically taught, high degree polynomials are easily controlled using standard ML tools, like regularization. The source of the myth stems mainly from two misconceptions about polynomials that we will explore here. In fact, not only they are great non-linear features, certain representations also provide us with powerful control over the shape of the function we wish to learn.
Introduction to Graph Transformers
(kumo.ai)
Graphs are everywhere. From modeling molecular interactions and social networks to detecting financial fraud, learning from graph data is powerful—but inherently challenging.
The value of a dedicated data science approach in HR
(gorelik.net)
This document outlines why HR departments in large organizations benefit from a dedicated data science approach, highlighting impacts beyond recruitment. In short, my thesis is as follows: as organizations scale, so does the complexity of understanding their internal dynamics. Data tools become essential to analyzing large organizations, as they enable HR to identify patterns and insights that can drive strategic improvements across key areas.
Functional connectomics spanning multiple areas of mouse visual cortex
(nature.com)
Understanding the brain requires understanding neurons’ functional responses to the circuit architecture shaping them. Here we introduce the MICrONS functional connectomics dataset with dense calcium imaging of around 75,000 neurons in primary visual cortex (VISp) and higher visual areas (VISrl, VISal and VISlm) in an awake mouse that is viewing natural and synthetic stimuli.
Nvidia's latest AI PC boxes sound great – for data scientists with $3k to spare
(theregister.com)
Nvidia's latest AI PC boxes sound great – if you're a data scientist with $3,000 to spare
UK govt data people not 'technical,' says ex-Downing St data science head
(theregister.com)
A former director of data science at the UK prime minister's office has told MPs that people working with data in government are not typically technical and would be unlikely to get a similar job in the private sector.
Compress Better, Compute Bigger
(ironarray.io)
Have you ever experienced the frustration of not being able to analyze a dataset because it's too large to fit in memory? Or perhaps you've encountered the memory wall, where computation is hindered by slow memory access? These are common challenges in data science and high-performance computing. The developers of Blosc and Blosc2 have consistently focused on achieving compression and decompression speeds that approach or even exceed memory bandwidth limits.
Study of Lyft rideshare data confirms minorities get more tickets
(arstechnica.com)
It's no secret that "driving while black" is a real phenomenon. Study after study has shown that minority drivers are ticketed at a higher rate, and data from speed cameras suggests that it's not because they commit traffic violations more frequently. But this leaves open the question of why. Bias is an obvious answer, but it's hard to eliminate an alternative explanation: Minority groups may engage in more unsafe driving, and the police are trying to deter that.
Show HN: Xorq – open-source Python-first Pandas-style pipelines
(github.com/xorq-labs)
xorq is a deferred computational framework that brings the replicability and performance of declarative pipelines to the Python ML ecosystem.