Hacker News with Generative AI: Data Formats

Reverse Engineering Apple's typedstream Format (chrissardegna.com)
imessage-exporter’s goal is to provide the most comprehensive representation of iMessage data available. Message data is stored in a legacy format that appears to be a stream that represents objects.
Sparrow, a modern C++ implementation of the Apache Arrow columnar format (medium.com)
We are thrilled to introduce Sparrow, a new library designed to simplify the integration of Apache Arrow’s columnar format into C++ applications.
Query Engines: Gatekeepers of the Parquet File Format (duckdb.org)
TL;DR: Mainstream query engines do not support reading newer Parquet encodings, forcing systems like DuckDB to default to writing older encodings, thereby sacrificing compression.
Preserves: An Expressive Data Language (preserves.dev)
This repository contains a definition and various implementations of Preserves, a data model with associated serialization formats in many ways comparable to JSON, XML, S-expressions, CBOR, ASN.1 BER, and so on.
Cramming Scrapscript into Msgpack (taylor.town)
msgpack is a lovely little serialization format. As a JSON replacement, it saves bandwidth while preserving native language features (e.g. tuples, records, objects, dates).
Nobody gets fired for picking JSON, but maybe they should? (mcyoung.xyz)
JSON is extremely popular but deeply flawed. This article discusses the details of JSON’s design, how it’s used (and misused), and how seemingly helpful “human readability” features cause headaches instead. Crucially, you rarely find JSON-based tools (except dedicated tools like jq) that can safely handle arbitrary JSON documents without a schema—common corner cases can lead to data corruption!
Pg_parquet – Postgres to Parquet Interoperability (i-programmer.info)
ASCII Delimited Text – Not CSV or Tab Delimited Text (wordpress.com)
Unfortunately, a quick Google search on “ASCII Delimited Text” shows that IBM and Oracle failed to read the ASCII specification and both define ASCII Delimited Text as a CSV format. ASCII Delimited Text should use the record separators defined as ASCII 28-31.
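As a rough illustration of the idea in this entry, the sketch below writes and reads rows using ASCII 31 (unit separator) between fields and ASCII 30 (record separator) after each record. The function names `dumps` and `loads` are hypothetical, chosen only for this example, and the sketch assumes field values never contain the separator characters themselves.

```python
# Minimal sketch of ASCII delimited text, not a reference implementation.
US = "\x1f"  # ASCII 31, unit separator: placed between fields
RS = "\x1e"  # ASCII 30, record separator: placed after each record


def dumps(rows):
    """Serialize a list of rows (lists of strings) to ASCII delimited text."""
    for row in rows:
        for field in row:
            # Assumption: separators are reserved and may not appear in data.
            if US in field or RS in field:
                raise ValueError("field contains a separator character")
    return "".join(US.join(row) + RS for row in rows)


def loads(text):
    """Parse ASCII delimited text back into a list of rows."""
    records = text.split(RS)
    # Every record ends with RS, so the final split piece is always empty.
    if records and records[-1] == "":
        records.pop()
    return [record.split(US) for record in records]


rows = [
    ["id", "comment"],
    ["1", 'commas, "quotes" and\nnewlines need no escaping'],
    ["2", ""],
]
assert loads(dumps(rows)) == rows
```

Because commas, quotes, and newlines are ordinary data here, no quoting or escaping layer is needed, which is the main argument for this layout over CSV. One limitation of this sketch: a row with zero fields reads back as a row with one empty field.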
New better alternative to XML, JSON and YAML (xenondata.org)
Xenon is the best way to represent information: Terse. Readable multiple-line indented text. Native support for arrays. Native support for a graph structure; elements may have multiple parents. Native support for types used in serialization. Unambiguous choice of data structure. Efficient to write by hand. Can be implemented to be blazingly fast or using a mode-less tokenizer/parser. The xenon document is named.
JSON Patch (zuplo.com)
JSON Patch is a standardized format defined in RFC 6902 for describing how to modify a JSON document.
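To make the format concrete, here is a minimal sketch of a patch document and a tiny applier. A patch is a JSON array of operations, each with an `op` and a JSON Pointer `path`. The helper names `apply_patch` and `_tokens` are hypothetical, and this sketch covers only the `add`, `replace`, and `remove` operations; `move`, `copy`, `test`, and patches that target the whole document (empty path) are deliberately left out.

```python
import copy


def _tokens(pointer):
    """Split a JSON Pointer such as '/tags/0' into unescaped tokens."""
    if not pointer.startswith("/"):
        raise ValueError("this sketch only handles non-empty pointers")
    # Pointer escaping: '~1' means '/', '~0' means '~' (decode in that order).
    return [t.replace("~1", "/").replace("~0", "~") for t in pointer[1:].split("/")]


def apply_patch(doc, patch):
    """Return a patched copy of doc; supports add, replace, remove only."""
    doc = copy.deepcopy(doc)  # leave the caller's document untouched
    for op in patch:
        *parents, last = _tokens(op["path"])
        target = doc
        for tok in parents:  # walk down to the container being modified
            target = target[int(tok)] if isinstance(target, list) else target[tok]
        kind = op["op"]
        if isinstance(target, list):
            # '-' refers to the position just past the end of the array.
            index = len(target) if last == "-" else int(last)
            if kind == "add":
                target.insert(index, op["value"])
            elif kind == "replace":
                target[index] = op["value"]
            elif kind == "remove":
                del target[index]
            else:
                raise NotImplementedError(kind)
        else:
            if kind == "add":
                target[last] = op["value"]
            elif kind == "replace":
                if last not in target:
                    raise KeyError(last)  # replace requires an existing member
                target[last] = op["value"]
            elif kind == "remove":
                del target[last]
            else:
                raise NotImplementedError(kind)
    return doc


doc = {"name": "Ada", "tags": ["a"]}
patch = [
    {"op": "replace", "path": "/name", "value": "Grace"},
    {"op": "add", "path": "/tags/-", "value": "b"},
    {"op": "remove", "path": "/tags/0"},
]
assert apply_patch(doc, patch) == {"name": "Grace", "tags": ["b"]}
```

Operations are applied in order, so the `remove` above sees the array after the `add` has already appended to it.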
JSON is usually the least bad option for machine-readable output formats (utoronto.ca)
CSVs Are Kinda Bad. DSVs Are Kinda Good (matthodges.com)
Recommended Formats Statement (loc.gov)
Why CSV is still king (konbert.com)
Data Formats: 3D, Audio, Image (paulbourke.net)
TSV – Alternative to CSV (wikipedia.org)
LSON: JSON with binary in 260 lines of public domain Lua (github.com/civboot)
Demystifying the protobuf wire format (kreya.app)
Buckets of Parquet Files Are Awful (scratchdata.com)
The Birth of Parquet (sympathetic.ink)
Object Linking and Embedding (wikipedia.org)
Show HN: ZSV (Zip Separated Values) columnar data format (github.com/Hafthor)