Hacker News with Generative AI: Data Processing

Show HN: Open-Source Document Extraction Tool (github.com/harishdeivanayagam)
Rowfill helps extract, analyze, and process data from complex documents, images, PDFs and more with advanced AI capabilities.
Apache DataFusion (apache.org)
DataFusion is an extensible query engine written in Rust that uses Apache Arrow as its in-memory format.
Show HN: Pyper – Concurrent Python Made Simple (github.com/pyper-dev)
Pyper is a flexible framework for concurrent and parallel data-processing, based on functional programming patterns. Used for 🔀 ETL Systems, ⚙️ Data Microservices, and 🌐 Data Collection
Apache DataFusion: Fast, Embeddable, Modular Analytic Query Engine [pdf] (nerdnetworks.org)
Show HN: Bodo – high-performance compute engine for Python data processing (github.com/bodo-ai)
Bodo is a cutting-edge compute engine for large-scale Python data processing. Powered by an innovative auto-parallelizing just-in-time compiler, Bodo transforms Python programs into highly optimized, parallel binaries without requiring code rewrites, which makes Bodo 20x to 240x faster compared to alternatives!
Sail 0.2: Spark replacement in Rust, runs 4x faster, drop-in PySpark compatible (lakesail.com)
LakeSail is thrilled to unveil a preview release of Sail 0.2, our latest milestone in the journey to redefine distributed data processing.
Differential Dataflow for the Masses (github.com/brurucy)
This library provides an implementation of the DBSP language for incremental streaming computations.
How to Flatten nested JSON arrays (datazip.io)
Flattening nested JSON or MongoDB's BSON, or normalizing semi-structured data so that it can be queried for analytics or regular workloads, is a common challenge in data processing.
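As a rough illustration of what flattening means here, the sketch below turns nested objects and arrays into a single level of dotted keys. It is not taken from the linked article; the function name `flatten`, the `.` separator, and the sample document are all assumptions made for the example.

```python
import json


def flatten(value, prefix="", sep="."):
    """Flatten nested dicts/lists into one dict keyed by dotted paths.

    Hypothetical helper for illustration: list elements are keyed by
    their index, and empty dicts/lists produce no keys at all.
    """
    items = {}
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}{sep}{key}" if prefix else str(key)
            items.update(flatten(child, path, sep))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            path = f"{prefix}{sep}{index}" if prefix else str(index)
            items.update(flatten(child, path, sep))
    else:
        # Scalar leaf: store it under the path built so far.
        items[prefix] = value
    return items


# Made-up sample document, not data from the article.
doc = json.loads(
    '{"user": {"name": "Ada", "tags": ["a", "b"]},'
    ' "orders": [{"id": 1}, {"id": 2}]}'
)
flat = flatten(doc)
for key, val in flat.items():
    print(key, "=", val)
```

Index-based keys keep every value addressable as a column, at the cost of a schema that grows with array length; exploding arrays into separate rows is the usual alternative when arrays are long or variable.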
pg_flo – Stream, transform, and re-route PostgreSQL data in real-time (pgflo.io)
The easiest way to move and transform data between PostgreSQL databases
Drasi: Microsoft's open source data processing platform for event-driven systems (github.com/drasi-project)
Drasi is a data processing platform that simplifies detecting changes in data and taking immediate action.
No such thing as exactly-once delivery (sequinstream.com)
We say Sequin is a system with "at-least-once delivery" and "exactly-once processing" guarantees.
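One common way to get an "exactly-once processing" effect on top of at-least-once delivery is to make the consumer idempotent by remembering which message IDs it has already handled. The sketch below shows that general idea only; it is not Sequin's implementation, and the class name `IdempotentProcessor`, the message IDs, and the in-memory set are assumptions for illustration.

```python
class IdempotentProcessor:
    """Skip messages whose ID was already processed (hypothetical sketch).

    The seen-ID set lives in memory here. A real system would persist it,
    ideally in the same transaction as the handler's side effect, because
    a crash between the two steps below would cause a reprocess.
    """

    def __init__(self, handler):
        self.handler = handler
        self.seen = set()

    def process(self, message_id, payload):
        if message_id in self.seen:
            return False  # duplicate delivery: ignore it
        self.handler(payload)
        self.seen.add(message_id)
        return True


totals = []
processor = IdempotentProcessor(totals.append)

# At-least-once delivery: message "m1" arrives twice.
deliveries = [("m1", 10), ("m2", 5), ("m1", 10)]
results = [processor.process(mid, payload) for mid, payload in deliveries]

print(results)
print(totals)
```

The redelivered message is acknowledged but has no second effect, which is the sense in which processing happens once even though delivery does not.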
Parsing Gigabytes of JSON per Second (arxiv.org)
JavaScript Object Notation or JSON is a ubiquitous data exchange format on the Web. Ingesting JSON documents can become a performance bottleneck due to the sheer volume of data. We are thus motivated to make JSON parsing as fast as possible.
Sail – Unify stream processing, batch processing and compute-intensive workloads (github.com/lakehq)
LakeSail's computation framework with a mission to unify stream processing, batch processing, and compute-intensive (AI) workloads.
I use Nim instead of Python for data processing (2021) (benjamindlee.com)
Lazy programmers often prefer to substitute computing effort for programming effort. I am just such a programmer. For my research, I often need to design and run algorithms over large datasets ranging into the scale of terabytes. As a fellow at the NIH, I have access to Biowulf, a 100,000+ processor cluster, so it's usually not worth spending a ton of time optimizing single-threaded performance for a single experiment when I can just perform a big [MapReduce](https://en.wikipedia.org/wiki/MapReduce).
Reflection-based JSON in C++ at Gigabytes per Second (lemire.me)
Pipes: A spiritual successor to Yahoo Pipes (pipes.digital)
Show HN: Qq: like jq, but can transcode between many formats (github.com/JFryy)
TSV โ€“ Alternative to CSV (wikipedia.org)
DataFusion Comet: Apache Spark Accelerator (github.com/apache)
Ask HN: How would you chunk a large Excel file? (ycombinator.com)
Radient โ€“ vectorize many data types, not just text (github.com/fzliu)
The One Billion Row Challenge in CUDA (tspeterkim.github.io)