Hacker News with Generative AI: Big Data

Show HN: I simulated Bigtable in BigQuery as a Type 2 SCD (statsig.com)
How do you handle high-throughput, schema-less updates and make that same data queryable at scale?
The Lost Decade of Small Data? (duckdb.org)
TL;DR: We benchmark DuckDB on a 2012 MacBook Pro to decide: did we lose a decade chasing distributed architectures for data analytics?
Entrepreneurial Spawning from Remote Work (nber.org)
Using a novel firm-level remote work measure created from big data on Internet activity, we show that firms with higher remote work during the pandemic are more likely to see their employees becoming entrepreneurs.
“Streaming vs. Batch” Is a Wrong Dichotomy, and I Think It's Confusing (morling.dev)
Launch HN: ParaQuery (YC X25) – GPU Accelerated Spark/SQL (ycombinator.com)
Hey HN! I'm Win, founder of ParaQuery (https://paraquery.com), a fully-managed, GPU-accelerated Spark + SQL solution. We deliver BigQuery's ease of use (or easier) while being significantly more cost-efficient and performant.
Lessons learned operating petabyte-scale ClickHouse clusters: Part II (tinybird.co)
This is the second part of the series. Here's more of what I've learned from operating petabyte-scale ClickHouse clusters for the last 5+ years.
Erlang Solutions' Blog round-up (erlang-solutions.com)
The tech world doesn’t slow down, and neither do we. From the power of big data in healthcare to keeping you up-to-date about fintech compliance, our latest blog posts explore the important topics shaping today’s digital world.
Your Mouse Is a Database (2012) (queue.acm.org)
Web and mobile applications are increasingly composed of asynchronous and realtime streaming services and push notifications, a particular form of big data where the data has positive velocity.
Apache Flink 2.0.0 Released: A New Era of Real-Time Data Processing (apache.org)
Today, the Flink PMC is proud to announce the official release of Apache Flink 2.0.0! This marks the first release in the Flink 2.x series and is the first major release since Flink 1.0 launched nine years ago. This version is the culmination of two years of meticulous preparation and collaboration, signifying a new chapter in the evolution of Flink.
Conflict-Free Distributed Architecture for Append-Only Writes to Apache Iceberg (e6data.com)
Apache Iceberg is a cornerstone table format in modern data lakehouse systems. It is renowned for its ability to deliver transactional consistency, schema evolution, and snapshot isolation through a metadata-driven architecture.
Data Broker Brags About Having Highly Detailed Information on Nearly All Users (gizmodo.com)
The owner of a data brokerage business recently put out a creepy-ass video in which he bragged about the degree to which his industry could collect and analyze data on the habits of billions of people.
Polars Cloud: The Distributed Cloud Architecture to Run Polars Anywhere (pola.rs)
Our goal is to enable scalable data processing with all the flexibility and expressiveness of Polars’ API.
Apache Iceberg the Hadoop of the modern-data-stack? (det.life)
In the early 2010s, Apache Hadoop dominated the big data conversation. Organizations raced to adopt it, seeing it as the cornerstone for scalable, distributed storage and processing. Today, Apache Iceberg is emerging as a cornerstone for data lakes and lakehouses in the modern data stack.
Common Crawl maintains a free, open repository of web crawl data (commoncrawl.org)
Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.
Akvorado: Flow Collector, Enricher and Visualizer (github.com/akvorado)
This program receives flows (currently Netflow/IPFIX and sFlow), enriches them with interface names (using SNMP), geo information (using IPinfo.io), and exports them to Kafka, then ClickHouse. It also exposes a web interface to browse the collected data.
Ask HN: Learning PySpark and Related Tools (ycombinator.com)
Hey HN, I have been working in the data-science and machine-learning domain for the past 8 years or so. I have not been exposed to tools such as PySpark, which are frequently asked for in job descriptions. What resource or certification can I use to get up to par on PySpark? Thanks!
From BigQuery to Lakehouse: How We Built a Petabyte-Scale Data Analytics Platform (trmlabs.com)
At TRM Labs, we provide blockchain intelligence tools to help financial institutions, crypto businesses, and government agencies detect and investigate crypto-related financial crime and fraud.
Datawave: Open source data fusion across structured and unstructured datasets (code.nsa.gov)
DataWave is a Java-based ingest and query framework that leverages Apache Accumulo to provide fast, secure access to your data.
Apache Accumulo 4.0 Feature Preview (apache.org)
Apache Accumulo® is a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
Apache Iceberg (apache.org)
Iceberg is a high-performance format for huge analytic tables.
Apache Hudi: an open data lakehouse platform (github.com/apache)
Apache Hudi is an open data lakehouse platform, built on a high-performance open table format to ingest, index, store, serve, transform and manage your data across multiple cloud data environments.
The CDC, Palantir and the AI-Healthcare Revolution (unlimitedhangout.com)
The CDC’s Center for Forecasting and Outbreak Analytics (CFA) has partnered with the CIA-linked Palantir to cement the public-private model of invasive surveillance in “public health,” all while pushing the U.S. national security state and Silicon Valley even closer together.
Parquet and ORC's many shortfalls for machine learning, and what to do about it? (starburst.io)
At the turn of the century (around a quarter of a century ago), over 99% of the data management industry used row-oriented storage to store data for all workloads involving structured data — including transactional and analytical workloads.
Apache DataFusion: Fast, Embeddable, Modular Analytic Query Engine [pdf] (nerdnetworks.org)
Apache Hudi 1.0 released with secondary indexes for data lakehouses (apache.org)
We are thrilled to announce the release of Apache Hudi 1.0, a landmark achievement for our vibrant community that defines what the next generation of data lakehouses should achieve.
How big data created the modern dairy cow (worksinprogress.co)
What do cryogenics, butterfat tests, and genetic data have in common? They’re some of the reasons behind the world’s most productive dairy cows. Here’s how it all started.
New Amazon S3 Tables: Storage optimized for analytics workloads (amazon.com)
Amazon S3 Tables give you storage that is optimized for tabular data such as daily purchase transactions, streaming sensor data, and ad impressions in Apache Iceberg format, for easy queries using popular query engines like Amazon Athena, Amazon EMR, and Apache Spark.
Sail 0.2: Spark replacement in Rust, runs 4x faster, drop-in PySpark compatible (lakesail.com)
LakeSail is thrilled to unveil a preview release of Sail 0.2, our latest milestone in the journey to redefine distributed data processing.
Differential Dataflow for the Masses (github.com/brurucy)
This library provides an implementation of the DBSP language for incremental streaming computations.
Write your first MapReduce program in 20 minutes (deepnote.com)
This notebook focuses on MapReduce, which processes large datasets. It is designed to find the maximum transaction value by store using sales data. For those who want to learn MapReduce from scratch, this notebook covers the basics. For more information, here is a detailed article.
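The computation that notebook describes, maximum transaction value per store, can be sketched in plain Python. This is a minimal single-process illustration of the map, shuffle, and reduce phases, not the notebook's own code; the SALES records, store names, and function names below are hypothetical.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical sales records as (store, transaction_value) pairs.
SALES = [
    ("north", 120.0),
    ("south", 75.5),
    ("north", 310.25),
    ("east", 42.0),
    ("south", 99.9),
]


def map_phase(records):
    """Emit one (key, value) pair per record: the store and its transaction value."""
    for store, value in records:
        yield store, value


def shuffle_phase(pairs):
    """Group all values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each store's list of values to its maximum."""
    return {store: reduce(max, values) for store, values in groups.items()}


def max_transaction_by_store(records):
    """Run the three phases in order on an in-memory list of records."""
    return reduce_phase(shuffle_phase(map_phase(records)))


if __name__ == "__main__":
    for store, top in sorted(max_transaction_by_store(SALES).items()):
        print(store, top)
```

In a real MapReduce framework the map and reduce functions run in parallel across machines and the shuffle is handled by the runtime; here all three run in one process so the data flow is easy to follow.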