Hacker News with Generative AI: Data Processing

Sqawk: A fusion of SQL and Awk: Applying SQL to text-based data files (github.com/jgarzik)
Sqawk is an SQL-based command-line tool for processing delimiter-separated files (CSV, TSV, etc.), inspired by the classic awk command. It loads data into in-memory tables, executes SQL queries against these tables, and writes the results back to the console or files.

SQL, Command-line tools, Data Processing, Text Files

50 points by ossusermivami 180 days ago | 10 comments
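
The workflow in the Sqawk summary above (load delimited data into in-memory tables, run SQL, emit the rows) can be illustrated with a minimal Python sketch. This is not Sqawk's own code or command-line interface: it substitutes Python's standard `csv` and `sqlite3` modules, and the helper `query_delimited`, the table name `people`, and the sample data are all hypothetical.

```python
import csv
import io
import sqlite3


def query_delimited(text, table, sql, delimiter=","):
    """Load delimiter-separated text into an in-memory SQLite table, then run one query.

    Hypothetical helper for illustration only; the first row is treated as the header.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    conn = sqlite3.connect(":memory:")
    try:
        columns = ", ".join(f'"{name}"' for name in header)
        placeholders = ", ".join("?" for _ in header)
        conn.execute(f'CREATE TABLE "{table}" ({columns})')
        # Every value is stored as text, so numeric comparisons need an explicit CAST.
        conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


SAMPLE = "name,age\nalice,30\nbob,25\ncarol,41\n"
rows = query_delimited(
    SAMPLE,
    "people",
    "SELECT name FROM people WHERE CAST(age AS INTEGER) > 28 ORDER BY name",
)
for row in rows:
    print(",".join(row))
```

Passing `delimiter="\t"` would handle TSV input in the same way.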

Loading Pydantic models from JSON without running out of memory (pythonspeed.com)
You have a large JSON file, and you want to load the data into Pydantic. Unfortunately, this uses a lot of memory, to the point where large JSON files are very difficult to read. What to do?

Python, Data Processing, Memory Management, Optimization, Pydantic

134 points by itamarst 184 days ago | 45 comments

“Streaming vs. Batch” Is a Wrong Dichotomy, and I Think It's Confusing (morling.dev)

Software Development, Data Processing, Big Data

70 points by ingve 193 days ago | 42 comments

ArkFlow: High-performance Rust stream processing engine (github.com/arkflow-rs)
High-performance Rust stream processing engine, providing powerful data stream processing capabilities, supporting multiple input/output sources and processors.

Rust, Stream Processing, Data Processing, High-performance Computing

170 points by klaussilveira 207 days ago | 34 comments

Apache Flink 2.0.0 Released: A New Era of Real-Time Data Processing (apache.org)
Today, the Flink PMC is proud to announce the official release of Apache Flink 2.0.0! This marks the first release in the Flink 2.x series and is the first major release since Flink 1.0 launched nine years ago. This version is the culmination of two years of meticulous preparation and collaboration, signifying a new chapter in the evolution of Flink.

Apache Flink, Data Processing, Big Data, Software Releases, Open Source

5 points by dockerd 241 days ago | 1 comment

Fast columnar JSON decoding with arrow-rs (arroyo.dev)
JSON is the most common serialization format used in streaming pipelines, so it pays to be able to deserialize it fast. This post covers in detail how the arrow-json library works to perform very efficient columnar JSON decoding, and the additions we've made for streaming use cases.

JSON, Data Processing, Performance Optimization, Streaming, Libraries

56 points by necubi 244 days ago | 7 comments

DeepSeek smallpond, 3FS and data processing for AI (getdaft.io)
Let’s talk about smallpond and 3FS, two open-source projects released by the DeepSeek team last week.

Open Source, Artificial Intelligence, Data Processing

10 points by sammysidhu 249 days ago | 0 comments

Destructive Updates – A Stitch in Time (icicle-lang.github.io)
Icicle is a high-level streaming query language that gives its users new capabilities, allowing them to combine and fuse hundreds of rich, individual queries into a single plan for safe and efficient execution.

Programming Languages, Query Languages, Data Processing, Stream Processing, Software

7 points by g0xA52A2A 252 days ago | 0 comments

ArkFlow – High-performance Rust stream processing engine (github.com/chenquan)
High-performance Rust stream processing engine, providing powerful data stream processing capabilities, supporting multiple input/output sources and processors.

Rust, Stream Processing, Data Processing, Software, Performance

107 points by chenquan 254 days ago | 59 comments

Polars Cloud: The Distributed Cloud Architecture to Run Polars Anywhere (pola.rs)
Our goal is to enable scalable data processing with all the flexibility and expressiveness of Polars’ API.

Cloud Computing, Data Processing, Software, Big Data

261 points by neilfrndes 260 days ago | 87 comments

Smallpond – A lightweight data processing framework built on DuckDB and 3FS (github.com/deepseek-ai)
A lightweight data processing framework built on DuckDB and 3FS.

Data Processing, Databases, Frameworks, Open Source

322 points by overflowcat 268 days ago | 72 comments

Show HN: Interactive jq, but it's a bash script using fzf (github.com)

Bash, Data Processing, Software

10 points by thomascountz 296 days ago | 1 comment

Fast JSON Processing in Real-Time Systems: Simdjson and Zero-Copy Design (estuary.dev)
Discover how Estuary Flow handles massive data volumes by leveraging simdjson and a unique Combiner to optimize real-time JSON parsing and document merging.

Real-Time Systems, JSON Processing, Data Processing, Optimization, Software

22 points by danthelion 302 days ago | 2 comments

Show HN: Open-Source Document Extraction Tool (github.com/harishdeivanayagam)
Rowfill helps extract, analyze, and process data from complex documents, images, PDFs and more with advanced AI capabilities.

Open Source, Document Extraction, AI, Data Processing

13 points by harishd30 312 days ago | 2 comments

Apache DataFusion (apache.org)
DataFusion is an extensible query engine written in Rust that uses Apache Arrow as its in-memory format.

Query Engines, Apache, Rust, Data Processing

156 points by thebuilderjr 314 days ago | 47 comments

Show HN: Pyper – Concurrent Python Made Simple (github.com/pyper-dev)
Pyper is a flexible framework for concurrent and parallel data processing, based on functional programming patterns. Used for 🔀 ETL Systems, ⚙️ Data Microservices, and 🌐 Data Collection.

Python, Data Processing, Functional Programming, Concurrency

156 points by pyper-dev 314 days ago | 35 comments

Apache DataFusion: Fast, Embeddable, Modular Analytic Query Engine [pdf] (nerdnetworks.org)

Apache DataFusion, Big Data, Query Engines, Data Processing, Open Source

12 points by hambandit 319 days ago | 0 comments

Show HN: Bodo – high-performance compute engine for Python data processing (github.com/bodo-ai)
Bodo is a cutting-edge compute engine for large-scale Python data processing. Powered by an innovative auto-parallelizing just-in-time compiler, Bodo transforms Python programs into highly optimized, parallel binaries without requiring code rewrites, which makes Bodo 20x to 240x faster compared to alternatives!

Python, Data Processing, Performance Optimization, Open Source, Software

11 points by ehsantn 339 days ago | 2 comments

Sail 0.2: Spark replacement in Rust, runs 4x faster, drop-in PySpark compatible (lakesail.com)
LakeSail is thrilled to unveil a preview release of Sail 0.2, our latest milestone in the journey to redefine distributed data processing.

Rust, Distributed Computing, Data Processing, Big Data, Open Source

14 points by chenxi9649 366 days ago | 2 comments

Differential Dataflow for the Masses (github.com/brurucy)
This library provides an implementation of the DBSP language for incremental streaming computations.

Data Processing, Programming, Software, Big Data, Streaming

42 points by rebanevapustus 379 days ago | 11 comments

How to Flatten nested JSON arrays (datazip.io)
Flattening nested JSON or MongoDB’s BSON, or otherwise normalizing semi-structured data so that it can be queried for analytics or regular workloads, is a common challenge in data processing.

Data Processing, JSON, Data Structures, Data Analytics

14 points by pkhodiyar 382 days ago | 2 comments
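
As a generic illustration of the flattening problem named in the entry above, here is a minimal Python sketch that turns nested objects and arrays into a single-level mapping keyed by dotted paths. The linked article's own method is not described in the summary, so this is only one common convention; the function name `flatten`, the dotted-path key format, and the sample document are assumptions.

```python
def flatten(value, prefix=""):
    """Flatten nested dicts and lists into one dict keyed by dotted paths.

    Hypothetical helper: list positions become numeric path segments,
    and empty dicts or lists contribute no keys at all.
    """
    items = {}
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            items.update(flatten(child, path))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            path = f"{prefix}.{index}" if prefix else str(index)
            items.update(flatten(child, path))
    else:
        # Scalar leaf: record it under the path accumulated so far.
        items[prefix] = value
    return items


doc = {
    "id": 1,
    "tags": ["a", "b"],
    "user": {"name": "x", "address": {"city": "Pune"}},
}
flat = flatten(doc)
for key, val in flat.items():
    print(key, val)
```

Keys that already contain a dot would become ambiguous under this scheme, so a real pipeline would need an escaping rule or a different separator.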

pg_flo – Stream, transform, and re-route PostgreSQL data in real-time (pgflo.io)
The easiest way to move and transform data between PostgreSQL databases.

PostgreSQL, Data Processing, Real-time Data, Databases, Software

239 points by shayonj 384 days ago | 54 comments

Drasi: Microsoft's open source data processing platform for event-driven systems (github.com/drasi-project)
Drasi is a data processing platform that simplifies detecting changes in data and taking immediate action.

Data Processing, Open Source, Microsoft, Event-Driven Systems, Data Streaming

331 points by benocodes 398 days ago | 67 comments

No such thing as exactly-once delivery (sequinstream.com)
We say Sequin is a system with "at-least-once delivery" and "exactly-once processing" guarantees.

Distributed Systems, Data Processing, Guarantees, Software Architecture

99 points by todsacerdoti 418 days ago | 114 comments

Parsing Gigabytes of JSON per Second (arxiv.org)
JavaScript Object Notation or JSON is a ubiquitous data exchange format on the Web. Ingesting JSON documents can become a performance bottleneck due to the sheer volume of data. We are thus motivated to make JSON parsing as fast as possible.

Performance Optimization, Data Processing, JSON, Web Development, Algorithms

5 points by ibobev 437 days ago | 1 comment

Sail – Unify stream processing, batch processing and compute-intensive workloads (github.com/lakehq)
LakeSail's computation framework with a mission to unify stream processing, batch processing, and compute-intensive (AI) workloads.

Data Processing, Stream Processing, Batch Processing, Big Data, Artificial Intelligence

79 points by chenxi9649 439 days ago | 15 comments

I use Nim instead of Python for data processing (2021) (benjamindlee.com)
Lazy programmers often prefer to substitute computing effort for programming effort. I am just such a programmer. For my research, I often need to design and run algorithms over large datasets ranging into the scale of terabytes. As a fellow at the NIH, I have access to Biowulf, a 100,000+ processor cluster, so it’s usually not worth spending a ton of time optimizing single-threaded performance for a single experiment when I can just perform a big [MapReduce](https://en.wikipedia.org/wiki/MapReduce).