Hacker News with Generative AI: Big Data

Akvorado: Flow Collector, Enricher and Visualizer (github.com/akvorado)
This program receives flows (currently Netflow/IPFIX and sFlow), enriches them with interface names (using SNMP), geo information (using IPinfo.io), and exports them to Kafka, then ClickHouse. It also exposes a web interface to browse the collected data.
Ask HN: Learning PySpark and Related Tools (ycombinator.com)
Hey HN, I have been working in the data-science and machine-learning domain for the past 8 years or so. I have not been exposed to tools such as PySpark, which are frequently asked for in job descriptions. What resource or certification can I use to get up to par on PySpark? Thanks!
From BigQuery to Lakehouse: How We Built a Petabyte-Scale Data Analytics Platform (trmlabs.com)
At TRM Labs, we provide blockchain intelligence tools to help financial institutions, crypto businesses, and government agencies detect and investigate crypto-related financial crime and fraud.
Datawave: Open source data fusion across structured and unstructured datasets (code.nsa.gov)
DataWave is a Java-based ingest and query framework that leverages Apache Accumulo to provide fast, secure access to your data.
Apache Accumulo 4.0 Feature Preview (apache.org)
Apache Accumulo® is a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
Apache Iceberg (apache.org)
Iceberg is a high-performance format for huge analytic tables.
Apache Hudi: an open data lakehouse platform (github.com/apache)
Apache Hudi is an open data lakehouse platform, built on a high-performance open table format to ingest, index, store, serve, transform and manage your data across multiple cloud data environments.
The CDC, Palantir and the AI-Healthcare Revolution (unlimitedhangout.com)
The CDC’s Center for Forecasting and Outbreak Analytics (CFA) has partnered with the CIA-linked Palantir to cement the public-private model of invasive surveillance in “public health,” all while pushing the U.S. national security state and Silicon Valley even closer together.
Parquet and ORC's many shortfalls for machine learning, and what to do about it? (starburst.io)
At the turn of the century (around a quarter of a century ago), over 99% of the data management industry used row-oriented storage for all workloads involving structured data, including transactional and analytical workloads.
Apache DataFusion: Fast, Embeddable, Modular Analytic Query Engine [pdf] (nerdnetworks.org)
Apache Hudi 1.0 released with secondary indexes for data lakehouses (apache.org)
We are thrilled to announce the release of Apache Hudi 1.0, a landmark achievement for our vibrant community that defines what the next generation of data lakehouses should achieve.
How big data created the modern dairy cow (worksinprogress.co)
What do cryogenics, butterfat tests, and genetic data have in common? They’re some of the reasons behind the world’s most productive dairy cows. Here’s how it all started.
New Amazon S3 Tables: Storage optimized for analytics workloads (amazon.com)
Amazon S3 Tables give you storage that is optimized for tabular data such as daily purchase transactions, streaming sensor data, and ad impressions in Apache Iceberg format, for easy queries using popular query engines like Amazon Athena, Amazon EMR, and Apache Spark.
Sail 0.2: Spark replacement in Rust, runs 4x faster, drop-in PySpark compatible (lakesail.com)
LakeSail is thrilled to unveil a preview release of Sail 0.2, our latest milestone in the journey to redefine distributed data processing.
Differential Dataflow for the Masses (github.com/brurucy)
This library provides an implementation of the DBSP language for incremental streaming computations.
Write your first MapReduce program in 20 minutes (deepnote.com)
This notebook introduces MapReduce, a model for processing large datasets, using sales data to find the maximum transaction value by store. It covers the basics for those who want to learn MapReduce from scratch; a detailed article is linked for more information.
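The notebook's task (maximum transaction value by store) can be sketched in plain Python. This is a minimal single-process illustration with made-up sales records; a real MapReduce framework runs the map, shuffle, and reduce phases in parallel across many machines:

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit (store, amount) key/value pairs from raw sales records.
    for store, amount in records:
        yield store, amount

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_phase(grouped):
    # Reduce: for each store, keep only the maximum transaction value.
    for store, amounts in grouped:
        yield store, max(amounts)

sales = [("north", 120.0), ("south", 75.5), ("north", 310.0), ("south", 99.9)]
result = dict(reduce_phase(shuffle(map_phase(sales))))
# result == {"north": 310.0, "south": 99.9}
```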
Ask HN: How big of a problem is unstructured data for companies? (ycombinator.com)
I read somewhere that 90% of companies have unstructured data, such as documents, PDFs, videos, images, and audio clips, and that this will be a big obstacle for AI.
COBOL's Map Reduce (2022) (ztoz.blog)
COBOL is for Big Data. Well, sort of. A while back, I noticed that the COBOL SORT verb was overpowered. Rather than sorting an array of items or even sorting a file, it included a generalized ability to stream in arbitrary inputs (() => Stream[T], where T is a key/value pair) and process the outputs in order (SortedStream[T] => ()). This power is useful if you are writing a map-reduce program, but excessive for sorting.
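That sort-centric shape is exactly a toy map-reduce job. A minimal Python sketch of the same pattern (a hypothetical word count; the in-memory sort stands in for COBOL's SORT verb, with the generator playing the input procedure and the grouped consumption the output procedure):

```python
from itertools import groupby
from operator import itemgetter

def word_count(lines):
    # "Input procedure" (map): stream in arbitrary (key, value) pairs.
    pairs = ((word, 1) for line in lines for word in line.split())
    # SORT: the framework orders the pairs by key.
    ordered = sorted(pairs, key=itemgetter(0))
    # "Output procedure" (reduce): consume the sorted stream in key order.
    return {word: sum(count for _, count in group)
            for word, group in groupby(ordered, key=itemgetter(0))}

print(word_count(["big data big", "data sort"]))
# {'big': 2, 'data': 2, 'sort': 1}
```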
6 Powerful Databricks Alternatives for Data Lakes and Lakehouses (definite.app)
Databricks has established itself as a leader in the data lake and lakehouse space, offering a powerful platform for big data processing and analytics.
Sail – Unify stream processing, batch processing and compute-intensive workloads (github.com/lakehq)
LakeSail's computation framework with a mission to unify stream processing, batch processing, and compute-intensive (AI) workloads.
Apache Zeppelin (apache.org)
The Imperial Origins of Big Data (yalebooks.yale.edu)
Amazon's exabyte-scale migration from Apache Spark to Ray on EC2 (amazon.com)
Memory Efficient Data Streaming to Parquet Files (estuary.dev)
Binance built a 100PB log service with Quickwit (quickwit.io)
Lessons Learned from Scaling to Multi-Terabyte Datasets (v2thegreat.com)
DataFusion Comet: Apache Spark Accelerator (github.com/apache)
Big data is dead (2023) (motherduck.com)
Buckets of Parquet Files Are Awful (scratchdata.com)
Estimating Pi with Kafka streams (fredrikmeyer.net)