Hacker News with Generative AI: Data Science

USDS Engineering Director Resigns: 'This Is Not the Mission I Came to Serve' (wired.com)
The director of data science and engineering for the United States Digital Service—which Elon Musk rebranded as the US DOGE Service—has resigned from her position.
Show HN: Bag of words – Build and share smart data apps using AI (github.com/bagofwords1)
Bag of words enables users to create comprehensive dashboards with a single prompt and refine them iteratively.
Physics Informed Neural Networks (pages.dev)
I mentioned in a previous post that as part of my position as junior DS for IKEA, I also get the opportunity to take part in their training program (called the AI accelerator program). It’s a six-month program in which we are trained on all aspects of data science and AI, both by professionals and through various courses.
Basketball has evolved into a game of calculated decision-making (nabraj.com)
Basketball has evolved from a game of unpredictability into a game of calculated decision-making with the use of data and analytics.
Classic Data science pipelines built with LLMs (github.com/Pravko-Solutions)
Below you’ll find an overview of each file, what it showcases, and links to the actual code. You can run these examples (assuming you’ve installed FlashLearn and set your “OPENAI_API_KEY”) simply by using:
Stop using zip codes for geospatial analysis (2019) (carto.com)
Uncover deeper insights beyond ZIP codes with geospatial analysis. Explore the limitations of ZIP codes and discover alternatives for spatial understanding.
AI by Hand Exercises in Excel (github.com/ImagineAILab)
AI by Hand ✍️ Exercises in Excel
Clinical Trials Are Drowning in Data but Starving for Patients (seangeiger.substack.com)
The first clinical trial happened on a damp naval ship in 1747. James Lind, a Scottish naval surgeon, divided twelve scurvy-stricken sailors into pairs to test different treatments: oranges and lemons, cider, vinegar, sulfuric acid, garlic paste, and seawater.
Ask HN: Learning PySpark and Related Tools (ycombinator.com)
Hey HN, I have been working in the data-science and machine-learning domain for the past 8 years or so. I have not been exposed to tools such as PySpark, which are frequently asked for in job descriptions. What resource or certification can I use to get up to par on PySpark? Thanks!
Ask HN: Is There Any Pattern Matching Bible or Set of Essential Readings? (ycombinator.com)
Publication analyzing/unifying topics like Dynamic Time Warping, spell checking, Knuth-Morris-Pratt and other string search, Hidden Markov Chains, and many other text/image/signal processing algorithms/metrics in a coherent form?
Ask HN: How to passively prepare for a job interview? (ycombinator.com)
Hey there fellow HNers! I have been working at a tech consulting company for 5-6 years. My plan is to prepare well over the next 5-6 months and then possibly start actively interviewing for jobs. How can I search and interview passively in the meantime so that I am best set up after a few months? I work in the data-science and machine-learning domain.
Two Bites of Data Science in K (zdsmith.com)
“no bowler with as many wickets has a better average”.
Modern Polars – A side-by-side comparison of the Polars and Pandas libraries (kevinheavey.github.io)
This is a side-by-side comparison of the Polars and Pandas dataframe libraries, based on Modern Pandas by Tom Augspurger.
Cosine Similarity: Not the Silver Bullet We Thought It Was (shaped.ai)
In the world of machine learning and data science, cosine similarity has long been a go-to metric for measuring the semantic similarity between high-dimensional objects. However, a new study by researchers at Netflix and Cornell University challenges our understanding of this popular technique, exposing the underlying issues that could lead to arbitrary and meaningless results.
Hallucination is a problem we'll have to live with for a long time (theregister.com)
A notable flaw of AI is its habit of "hallucinating," making up plausible answers that have no basis in real-world data. AWS is trying to tackle this by introducing Amazon Bedrock Automated Reasoning checks.
PyViz – Overview of the Python visualization landscape (pyviz.org)
The Python visualization landscape can seem daunting at first. These overviews attempt to shine light on common patterns and use cases, comparing or discussing multiple plotting libraries. Note that some of the projects discussed in the overviews are no longer maintained, so be sure to check the list of dormant projects before choosing that library.
Directory of job boards by category (remote only, data science, etc.) (jobsearchdb.com)
Find an active job board in your career niche
Don't use cosine similarity carelessly (migdal.pl)
Midas turned everything he touched into gold. Data scientists turn everything into vectors. We do it for a reason: as gold is the language of merchants, vectors are the language of AI.
Dbt Labs acquires SDF Labs (getdbt.com)
The TL;DR: today, I have the pleasure of announcing that dbt Labs has acquired SDF Labs. The two teams are already working side-by-side to bring SDF’s SQL comprehension technology into the hands of dbt users everywhere. SDF will be a massive upgrade to the very heart of the dbt user experience moving forward.
Show HN: SemHash – Fast Semantic Text Deduplication for Cleaner Datasets (github.com/MinishLab)
How to become a Data Scientist? My journey, overview of skill set, practice tips (mljar.com)
In recent years, many people have been drawn to the field of data science, often believing it to be a fast track to wealth.
Small Data [video] (youtube.com)
Time-Series Anomaly Detection: A Decade Review (arxiv.org)
Recent advances in data collection technology, accompanied by the ever-rising volume and velocity of streaming data, underscore the vital need for time series analytics.
JPL Horizons on-line solar system data and ephemeris computation service (nasa.gov)
The JPL Horizons on-line solar system data and ephemeris computation service provides access to key solar system data and flexible production of highly accurate ephemerides for solar system objects.
Name Collision of the Year: Vector (crunchydata.com)
I can’t get through a zoom call, a conference talk, or an afternoon scroll through LinkedIn without hearing about vectors. Do you feel like the term vector is everywhere this year? It is.
Optimality of Frequency Moment Estimation (weizmann.ac.il)
Model Evaluation with RandomForest and AdaBoost (deepnote.com)
This notebook demonstrates how to train and evaluate machine learning models, specifically focusing on the Random Forest Classifier and AdaBoost Classifier. We will use synthetic data and visualize key metrics, including accuracy, precision, recall, and F1-score.
Maximum likelihood estimation and loss functions (rish-01.github.io)
When I started learning about loss functions, I could always understand the intuition behind them. For example, the mean squared error (MSE) for regression seemed logical—penalizing large deviations from the ground-truth makes sense. But one thing always bothered me: I could never come up with those loss functions on my own. Where did they come from? Why do we use these specific formulas and not something else?
Ask HN: Why is Ilya saying data is limited when the whole world is data? (ycombinator.com)
In his recent talk Ilya S. said that the data running out is a fundamental constraint on the scaling laws.
Array Languages for Clojurians (2020) (appliedscience.studio)
A discussion in the Clojure data science Zulip led me to Slobodan Blazeski’s enlightening article Array languages for Lisp programmers in the journal of the British APL Association. He quotes Alan Perlis: