Hacker News with Generative AI: Big Data

Akvorado: Flow Collector, Enricher and Visualizer (github.com/akvorado)
This program receives flows (currently Netflow/IPFIX and sFlow), enriches them with interface names (using SNMP), geo information (using IPinfo.io), and exports them to Kafka, then ClickHouse. It also exposes a web interface to browse the collected data.
Ask HN: Learning PySpark and Related Tools (ycombinator.com)
Hey HN, I have been working in the data-science and machine-learning domain for the past 8 years or so. I have not been exposed to tools such as PySpark, which are frequently asked for in job descriptions. What resource or certification can I use to get up to par on PySpark? Thanks!
From BigQuery to Lakehouse: How We Built a Petabyte-Scale Data Analytics Platform (trmlabs.com)
At TRM Labs, we provide blockchain intelligence tools to help financial institutions, crypto businesses, and government agencies detect and investigate crypto-related financial crime and fraud.
Datawave: Open source data fusion across structured and unstructured datasets (code.nsa.gov)
DataWave is a Java-based ingest and query framework that leverages Apache Accumulo to provide fast, secure access to your data.
Apache Accumulo 4.0 Feature Preview (apache.org)
Apache Accumulo® is a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
Apache Iceberg (apache.org)
Iceberg is a high-performance format for huge analytic tables.
Apache Hudi: an open data lakehouse platform (github.com/apache)
Apache Hudi is an open data lakehouse platform, built on a high-performance open table format to ingest, index, store, serve, transform and manage your data across multiple cloud data environments.
The CDC, Palantir and the AI-Healthcare Revolution (unlimitedhangout.com)
The CDC’s Center for Forecasting and Outbreak Analytics (CFA) has partnered with the CIA-linked Palantir to cement the public-private model of invasive surveillance in “public health,” all while pushing the U.S. national security state and Silicon Valley even closer together.
Parquet and ORC's many shortfalls for machine learning, and what to do about it? (starburst.io)
At the turn of the century (around a quarter of a century ago), over 99% of the data management industry used row-oriented storage for all workloads involving structured data, including transactional and analytical workloads.
Apache DataFusion: Fast, Embeddable, Modular Analytic Query Engine [pdf] (nerdnetworks.org)
Apache Hudi 1.0 released with secondary indexes for data lakehouses (apache.org)
We are thrilled to announce the release of Apache Hudi 1.0, a landmark achievement for our vibrant community that defines what the next generation of data lakehouses should achieve.
How big data created the modern dairy cow (worksinprogress.co)
What do cryogenics, butterfat tests, and genetic data have in common? They’re some of the reasons behind the world’s most productive dairy cows. Here’s how it all started.
New Amazon S3 Tables: Storage optimized for analytics workloads (amazon.com)
Amazon S3 Tables give you storage that is optimized for tabular data such as daily purchase transactions, streaming sensor data, and ad impressions in Apache Iceberg format, for easy queries using popular query engines like Amazon Athena, Amazon EMR, and Apache Spark.
Sail 0.2: Spark replacement in Rust, runs 4x faster, drop-in PySpark compatible (lakesail.com)
LakeSail is thrilled to unveil a preview release of Sail 0.2, our latest milestone in the journey to redefine distributed data processing.
Differential Dataflow for the Masses (github.com/brurucy)
This library provides an implementation of the DBSP language for incremental streaming computations.
Write your first MapReduce program in 20 minutes (deepnote.com)
This notebook introduces MapReduce, a model for processing large datasets, using sales data to find the maximum transaction value by store. It covers the basics for those who want to learn MapReduce from scratch; a detailed article is linked for more information.
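The notebook's task (maximum transaction value by store) can be sketched in plain Python. This is a minimal single-process illustration with made-up sales records; a real MapReduce framework runs the map, shuffle, and reduce phases in parallel across many machines:

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit (store, amount) key/value pairs from raw sales records.
    for store, amount in records:
        yield store, amount

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups.items()

def reduce_phase(grouped):
    # Reduce: for each store, keep only the maximum transaction value.
    for store, amounts in grouped:
        yield store, max(amounts)

sales = [("north", 120.0), ("south", 75.5), ("north", 310.0), ("south", 99.9)]
result = dict(reduce_phase(shuffle(map_phase(sales))))
# result == {"north": 310.0, "south": 99.9}
```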
Ask HN: How big of a problem is unstructured data for companies? (ycombinator.com)
I read somewhere that 90% of companies have unstructured data, such as documents, PDFs, videos, images, and audio clips, and that this will be a big obstacle for AI.
COBOL's Map Reduce (2022) (ztoz.blog)
COBOL is for Big Data. Well, sort of. A while back, I noticed that the COBOL SORT verb was overpowered. Rather than sorting an array of items or even sorting a file, it included a generalized ability to stream in arbitrary inputs (() => Stream[T], where T is a key/value pair) and process the outputs in order (SortedStream[T] => ()). This power is useful if you are writing a map-reduce program, but excessive for sorting.
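That sort-centric shape is exactly a toy map-reduce job. A minimal Python sketch of the same pattern (a hypothetical word count; the in-memory sort stands in for COBOL's SORT verb, with the generator playing the input procedure and the grouped consumption the output procedure):

```python
from itertools import groupby
from operator import itemgetter

def word_count(lines):
    # "Input procedure" (map): stream in arbitrary (key, value) pairs.
    pairs = ((word, 1) for line in lines for word in line.split())
    # SORT: the framework orders the pairs by key.
    ordered = sorted(pairs, key=itemgetter(0))
    # "Output procedure" (reduce): consume the sorted stream in key order.
    return {word: sum(count for _, count in group)
            for word, group in groupby(ordered, key=itemgetter(0))}

print(word_count(["big data big", "data sort"]))
# {'big': 2, 'data': 2, 'sort': 1}
```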
6 Powerful Databricks Alternatives for Data Lakes and Lakehouses (definite.app)
Databricks has established itself as a leader in the data lake and lakehouse space, offering a powerful platform for big data processing and analytics.
Sail – Unify stream processing, batch processing and compute-intensive workloads (github.com/lakehq)
LakeSail's computation framework with a mission to unify stream processing, batch processing, and compute-intensive (AI) workloads.
Apache Zeppelin (apache.org)
The Imperial Origins of Big Data (yalebooks.yale.edu)
Amazon's exabyte-scale migration from Apache Spark to Ray on EC2 (amazon.com)
Memory Efficient Data Streaming to Parquet Files (estuary.dev)
Binance built a 100PB log service with Quickwit (quickwit.io)
Lessons Learned from Scaling to Multi-Terabyte Datasets (v2thegreat.com)
DataFusion Comet: Apache Spark Accelerator (github.com/apache)
Big data is dead (2023) (motherduck.com)
Buckets of Parquet Files Are Awful (scratchdata.com)
Estimating Pi with Kafka streams (fredrikmeyer.net)