Hacker News with Generative AI: Data Processing

Apache Flink 2.0.0 Released: A New Era of Real-Time Data Processing (apache.org)
Today, the Flink PMC is proud to announce the official release of Apache Flink 2.0.0! This marks the first release in the Flink 2.x series and is the first major release since Flink 1.0 launched nine years ago. This version is the culmination of two years of meticulous preparation and collaboration, signifying a new chapter in the evolution of Flink.
Fast columnar JSON decoding with arrow-rs (arroyo.dev)
JSON is the most common serialization format used in streaming pipelines, so it pays to be able to deserialize it fast. This post covers in detail how the arrow-json library works to perform very efficient columnar JSON decoding, and the additions we've made for streaming use cases.
DeepSeek smallpond, 3FS and data processing for AI (getdaft.io)
Let’s talk about smallpond and 3FS, two open-source projects released by the DeepSeek team last week.
Destructive Updates – A Stitch in Time (icicle-lang.github.io)
Icicle is a high-level streaming query language, which gives new capabilities to its users, allowing them to combine and fuse hundreds of rich, individual, queries into a combined plan for safe and efficient execution.
ArkFlow – High-performance Rust stream processing engine (github.com/chenquan)
High-performance Rust stream processing engine, providing powerful data stream processing capabilities, supporting multiple input/output sources and processors.
Polars Cloud: The Distributed Cloud Architecture to Run Polars Anywhere (pola.rs)
Our goal is to enable scalable data processing with all the flexibility and expressiveness of Polars’ API.
Smallpond – A lightweight data processing framework built on DuckDB and 3FS (github.com/deepseek-ai)
A lightweight data processing framework built on DuckDB and 3FS.
Show HN: Interactive jq, but it's a bash script using fzf (github.com)
Fast JSON Processing in Real-Time Systems: Simdjson and Zero-Copy Design (estuary.dev)
Discover how Estuary Flow handles massive data volumes by leveraging simdjson and a unique Combiner to optimize real-time JSON parsing and document merging.
Show HN: Open-Source Document Extraction Tool (github.com/harishdeivanayagam)
Rowfill helps extract, analyze, and process data from complex documents, images, PDFs and more with advanced AI capabilities.
Apache DataFusion (apache.org)
DataFusion is an extensible query engine written in Rust that uses Apache Arrow as its in-memory format.
Show HN: Pyper – Concurrent Python Made Simple (github.com/pyper-dev)
Pyper is a flexible framework for concurrent and parallel data-processing, based on functional programming patterns. Used for 🔀 ETL Systems, ⚙️ Data Microservices, and 🌐 Data Collection
Apache DataFusion: Fast, Embeddable, Modular Analytic Query Engine [pdf] (nerdnetworks.org)
Show HN: Bodo – high-performance compute engine for Python data processing (github.com/bodo-ai)
Bodo is a cutting-edge compute engine for large-scale Python data processing. Powered by an innovative auto-parallelizing just-in-time compiler, Bodo transforms Python programs into highly optimized, parallel binaries without requiring code rewrites, which makes Bodo 20x to 240x faster compared to alternatives!
Sail 0.2: Spark replacement in Rust, runs 4x faster, drop-in PySpark compatible (lakesail.com)
LakeSail is thrilled to unveil a preview release of Sail 0.2, our latest milestone in the journey to redefine distributed data processing.
Differential Dataflow for the Masses (github.com/brurucy)
This library provides an implementation of the DBSP language for incremental streaming computations.
How to Flatten nested JSON arrays (datazip.io)
Flattening nested JSON or MongoDB’s BSON, or normalizing semi-structured data and writing queries on it for analytics or regular queries, is a common challenge in data processing.
pg_flo – Stream, transform, and re-route PostgreSQL data in real-time (pgflo.io)
The easiest way to move and transform data between PostgreSQL databases
Drasi: Microsoft's open source data processing platform for event-driven systems (github.com/drasi-project)
Drasi is a data processing platform that simplifies detecting changes in data and taking immediate action.
No such thing as exactly-once delivery (sequinstream.com)
We say Sequin is a system with "at-least-once delivery" and "exactly-once processing" guarantees.
Parsing Gigabytes of JSON per Second (arxiv.org)
JavaScript Object Notation or JSON is a ubiquitous data exchange format on the Web. Ingesting JSON documents can become a performance bottleneck due to the sheer volume of data. We are thus motivated to make JSON parsing as fast as possible.
Sail – Unify stream processing, batch processing and compute-intensive workloads (github.com/lakehq)
LakeSail's computation framework with a mission to unify stream processing, batch processing, and compute-intensive (AI) workloads.
I use Nim instead of Python for data processing (2021) (benjamindlee.com)
Lazy programmers often prefer to substitute computing effort for programming effort. I am just such a programmer. For my research, I often need to design and run algorithms over large datasets ranging into the scale of terabytes. As a fellow at the NIH, I have access to Biowulf, a 100,000+ processor cluster, so it’s usually not worth spending a ton of time optimizing single-threaded performance for a single experiment when I can just perform a big [MapReduce](https://en.wikipedia.org/wiki/MapReduce).
Reflection-based JSON in C++ at Gigabytes per Second (lemire.me)
Pipes: A spiritual successor to Yahoo Pipes (pipes.digital)
Show HN: Qq: like jq, but can transcode between many formats (github.com/JFryy)
TSV – Alternative to CSV (wikipedia.org)
DataFusion Comet: Apache Spark Accelerator (github.com/apache)
Ask HN: How would you chunk a large Excel file? (ycombinator.com)
Radient – vectorize many data types, not just text (github.com/fzliu)