Hacker News with Generative AI: Big Data

The CDC, Palantir and the AI-Healthcare Revolution (unlimitedhangout.com)
The CDC’s Center for Forecasting and Outbreak Analytics (CFA) has partnered with the CIA-linked Palantir to cement the public-private model of invasive surveillance in “public health,” all while pushing the U.S. national security state and Silicon Valley even closer together.
Parquet and ORC's many shortfalls for machine learning, and what to do about it? (starburst.io)
At the turn of the century (around a quarter of a century ago), over 99% of the data management industry used row-oriented storage to store data for all workloads involving structured data — including transactional and analytical workloads.
Apache DataFusion: Fast, Embeddable, Modular Analytic Query Engine [pdf] (nerdnetworks.org)
Apache Hudi 1.0 released with secondary indexes for data lakehouses (apache.org)
We are thrilled to announce the release of Apache Hudi 1.0, a landmark achievement for our vibrant community that defines what the next generation of data lakehouses should achieve.
How big data created the modern dairy cow (worksinprogress.co)
What do cryogenics, butterfat tests, and genetic data have in common? They’re some of the reasons behind the world’s most productive dairy cows. Here’s how it all started.
New Amazon S3 Tables: Storage optimized for analytics workloads (amazon.com)
Amazon S3 Tables give you storage that is optimized for tabular data such as daily purchase transactions, streaming sensor data, and ad impressions in Apache Iceberg format, for easy queries using popular query engines like Amazon Athena, Amazon EMR, and Apache Spark.
Sail 0.2: Spark replacement in Rust, runs 4x faster, drop-in PySpark compatible (lakesail.com)
LakeSail is thrilled to unveil a preview release of Sail 0.2, our latest milestone in the journey to redefine distributed data processing.
Differential Dataflow for the Masses (github.com/brurucy)
This library provides an implementation of the DBSP language for incremental streaming computations.
Write your first MapReduce program in 20 minutes (deepnote.com)
This notebook introduces MapReduce for processing large datasets. As a worked example, it finds the maximum transaction value per store from sales data. It covers the basics for those learning MapReduce from scratch; a detailed article is linked for further reading.
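The example the notebook describes (maximum transaction value per store) can be sketched as a plain-Python map/shuffle/reduce, assuming hypothetical `(store, value)` sales records:

```python
from collections import defaultdict

# Hypothetical sales records: (store, transaction_value) pairs.
sales = [
    ("north", 120.0), ("south", 75.5),
    ("north", 310.0), ("south", 42.0),
]

# Map phase: emit (store, value) key/value pairs.
mapped = [(store, value) for store, value in sales]

# Shuffle phase: group values by store key.
groups = defaultdict(list)
for store, value in mapped:
    groups[store].append(value)

# Reduce phase: take the maximum value per store.
result = {store: max(values) for store, values in groups.items()}
print(result)  # {'north': 310.0, 'south': 75.5}
```

In a real MapReduce framework the shuffle is performed by the runtime across machines; this sketch only illustrates the three phases.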
Ask HN: How big of a problem is unstructured data for companies? (ycombinator.com)
I read somewhere that 90% of companies hold unstructured data (documents, PDFs, videos, images, audio clips, and other content), and that this will be a big obstacle for AI.
COBOL's Map Reduce (2022) (ztoz.blog)
COBOL is for Big Data. Well, sort of. A while back, I noticed that the COBOL SORT verb was overpowered. Rather than sorting an array of items or even sorting a file, it included a generalized ability to stream in arbitrary inputs — () => Stream[T] —, where T is a key/value pair, and process the outputs in order — SortedStream[T] => (). This power is useful if you are writing a map-reduce program, but excessive for sorting.
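The shape the excerpt describes — an input procedure streaming in key/value pairs, a sort by key, and an output procedure consuming the ordered stream — is exactly a map-reduce. A minimal Python sketch (function names and the word-count data are illustrative, not from the article):

```python
from itertools import groupby
from operator import itemgetter

def input_procedure():
    # "Map": stream in arbitrary key/value pairs,
    # here (word, 1) for each word in some lines of text.
    for line in ["big data", "big iron", "data flow"]:
        for word in line.split():
            yield (word, 1)

def output_procedure(sorted_stream):
    # "Reduce": process outputs in key order, summing counts
    # over each run of equal keys.
    for key, group in groupby(sorted_stream, key=itemgetter(0)):
        yield key, sum(count for _, count in group)

# sorted() plays the role of COBOL's SORT between the two procedures.
counts = dict(output_procedure(sorted(input_procedure())))
print(counts)  # {'big': 2, 'data': 2, 'flow': 1, 'iron': 1}
```

Sorting guarantees that equal keys are adjacent, so the reducer can process each key's group in a single streaming pass — the same property a map-reduce shuffle provides.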
6 Powerful Databricks Alternatives for Data Lakes and Lakehouses (definite.app)
Databricks has established itself as a leader in the data lake and lakehouse space, offering a powerful platform for big data processing and analytics.
Sail – Unify stream processing, batch processing and compute-intensive workloads (github.com/lakehq)
LakeSail's computation framework with a mission to unify stream processing, batch processing, and compute-intensive (AI) workloads.
Apache Zeppelin (apache.org)
The Imperial Origins of Big Data (yalebooks.yale.edu)
Amazon's exabyte-scale migration from Apache Spark to Ray on EC2 (amazon.com)
Memory Efficient Data Streaming to Parquet Files (estuary.dev)
Binance built a 100PB log service with Quickwit (quickwit.io)
Lessons Learned from Scaling to Multi-Terabyte Datasets (v2thegreat.com)
DataFusion Comet: Apache Spark Accelerator (github.com/apache)
Big data is dead (2023) (motherduck.com)
Buckets of Parquet Files Are Awful (scratchdata.com)
Estimating Pi with Kafka streams (fredrikmeyer.net)
Kafka storage architecture evolution in one image (twitter.com)
Using DuckDB to seamlessly query a large parquet file over HTTP (marginalia.nu)
10M records per second, on-premises to the cloud (spectralcore.com)