Hacker News with Generative AI: Data Storage

Parquet and ORC's many shortfalls for machine learning, and what to do about it? (starburst.io)
At the turn of the century (around a quarter of a century ago), over 99% of the data management industry used row-oriented storage to store data for all workloads involving structured data, including transactional and analytical workloads.
37signals Dev – Monitoring 10 Petabytes of Data in Pure Storage (37signals.com)
How we use Prometheus to have metrics and alerts for Pure Storage.
I Track My Health Data in Markdown: Lessons in Digital Longevity (ycombinator.com)
I’ve spent years tracking my sleep, diet, and exercise with apps and wearables. But here’s the problem: when an app gets discontinued or stops syncing, the data—and all the insights—disappear.
Century-Scale Storage (law.harvard.edu)
This piece looks at a single question. If you, right now, had the goal of digitally storing something for 100 years, how should you even begin to think about making that happen? How should the bits in your stewardship be stored with such a target in mind? How do our methods and platforms look when considered under the harsh unknowns of a century? There are plenty of worthy related subjects and discourses that this piece does not touch at all.
Internet Object – New Age Data Serialization After JSON (internetobject.org)
Revolutionize your data exchange and storage with a format that's built for efficiency, clarity and reliability. A Text Based Data Serialization and Structured Storage Format Beyond JSON!
Terabit-scale high-fidelity diamond data storage (nature.com)
In the era of digital information, realizing efficient and durable data storage solutions is paramount.
Hetzner Object Storage (hetzner.com)
Object Storage is the S3-compatible storage solution that grows with your data requirements: highly available, secure and flexible.
Big Endian's Guide to SQLite Storage (jabid.in)
I wanted to learn how databases like SQLite store data under the hood, so I decided to write some code to inspect the database file. SQLite famously stores the entire database in a single file, and the file format is very well documented. Here is one diagram to get started instead of the roughly 13,848 words in that document.
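As a rough illustration of the kind of inspection the post describes, the Python sketch below reads two fields from the 100-byte header of a SQLite database file: the page size (a big-endian 16-bit integer at offset 16) and the database size in pages (a big-endian 32-bit integer at offset 28). The helper names read_sqlite_header and make_demo_db are made up for this example and are not taken from the linked article.

```python
import os
import sqlite3
import struct
import tempfile

SQLITE_MAGIC = b"SQLite format 3\x00"  # first 16 bytes of every SQLite 3 file


def read_sqlite_header(path):
    """Return the page size and page count stored in the file header."""
    with open(path, "rb") as f:
        header = f.read(100)
    if len(header) < 100 or header[:16] != SQLITE_MAGIC:
        raise ValueError("not a SQLite 3 database file")
    # Multi-byte header fields are big-endian.
    (page_size,) = struct.unpack(">H", header[16:18])
    if page_size == 1:
        page_size = 65536  # the value 1 encodes a 64 KiB page
    (page_count,) = struct.unpack(">I", header[28:32])
    return {"page_size": page_size, "page_count": page_count}


def make_demo_db():
    """Create a small throwaway database (hypothetical helper for the demo)."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO t (name) VALUES ('hello')")
    conn.commit()
    conn.close()
    return path


if __name__ == "__main__":
    demo_path = make_demo_db()
    print(read_sqlite_header(demo_path))
    os.remove(demo_path)
```

The page count multiplied by the page size should match the file size on disk, which gives a quick sanity check that the header was parsed correctly.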
Hide Photos on Floppies with a Flux Imager (github.com/dbalsom)
Chinese researchers indicate diamonds can store data for millions of years (readwrite.com)
Research has suggested that diamond-based storage technology could preserve vast amounts of information for up to millions of years.
Amazon S3 Adds Put-If-Match (Compare-and-Swap) (amazon.com)
Amazon S3 can now perform conditional writes that evaluate if an object is unmodified before updating it.
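To make the compare-and-swap idea concrete, here is a minimal in-memory sketch in Python. It does not call Amazon S3 or any AWS SDK; ToyObjectStore, PreconditionFailed and the if_match parameter are hypothetical names chosen for illustration, loosely modeled on an If-Match precondition checked against an object's ETag.

```python
import hashlib


class PreconditionFailed(Exception):
    """Raised when the stored object no longer matches the expected tag."""


class ToyObjectStore:
    """In-memory stand-in for an object store with conditional writes."""

    def __init__(self):
        self._objects = {}

    @staticmethod
    def _etag(data: bytes) -> str:
        # A content hash plays the role of the entity tag in this toy model.
        return hashlib.sha256(data).hexdigest()

    def get(self, key: str):
        data = self._objects[key]
        return data, self._etag(data)

    def put(self, key: str, data: bytes, if_match=None) -> str:
        # With if_match set, the write succeeds only if the object exists
        # and is unchanged since the caller last read it.
        if if_match is not None:
            current = self._objects.get(key)
            if current is None or self._etag(current) != if_match:
                raise PreconditionFailed(key)
        self._objects[key] = data
        return self._etag(data)


if __name__ == "__main__":
    store = ToyObjectStore()
    store.put("config", b"v1")
    _, tag = store.get("config")
    store.put("config", b"v2", if_match=tag)      # succeeds: tag is current
    try:
        store.put("config", b"v3", if_match=tag)  # fails: tag is now stale
    except PreconditionFailed:
        print("write rejected, re-read and retry")
```

A writer that loses the race re-reads the object, reapplies its change, and tries again with the fresh tag, which is the usual optimistic-concurrency loop.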
Transposing Tensor Files (mmapped.blog)
The safetensors library from Hugging Face is popular for representing tensors on disk, and its data layout is fully compatible with the ONNX raw tensor data format.
Amazon S3 now supports the ability to append data to an object (amazon.com)
Amazon S3 Express One Zone now supports the ability to append data to an object.
Huawei developing SSD-tape hybrid amid US tech restrictions (blocksandfiles.com)
Huawei’s in-house development of Magneto-Electric Disk (MED) archive storage technology combines an SSD with a Huawei-developed tape drive to provide warm (nearline) and cold data storage.
Transactional Object Storage? (mbrt.dev)
I was frustrated by the gap between stateless and stateful applications in the cloud. While I could easily spin up a stateless application as a “serverless” function in any major cloud provider and pretty much forget about it, persisting data between requests was a game of pick two among three: cheap, strongly consistent, portable.
Upspin: A framework for naming everyone's everything (upspin.io)
Upspin is an attempt to address problems like these, and many more.
Backblaze Drive Stats for Q3 2024 (backblaze.com)
As of the end of Q3 2024, Backblaze was monitoring 292,647 hard disk drives (HDDs) and solid state drives (SSDs) in our cloud storage servers located in our data centers around the world.
Floppy Disk Storage (history) (ibm.com)
The once-ubiquitous data storage device gave rise to the modern software industry
Show HN: OpenDAL, one API to access all the storages (S3, Azblob, HDFS, etc.) (github.com/apache)
Apache OpenDAL™: Access Data Freely
LocalStorage vs. IndexedDB vs. Cookies vs. OPFS vs. WASM-SQLite (rxdb.info)
So you are building that web application and you want to store data inside of your user's browser. Maybe you just need to store some small flags, or you even need a fully fledged database.
Consider adding warnings against using ZFS native encryption (github.com/openzfs)
Among experienced ZFS users and developers, it seems to be conventional wisdom that ZFS native encryption is not suitable for production usage, particularly when combined with snapshotting and zfs send/recv. There is a long-standing data corruption issue with many firsthand user reports.
19th-century photography technique employed in novel data storage method (ieee.org)
DNA stores data in bits after epigenetic upgrade (nature.com)
Icechunk: An open-source, cloud-native transactional tensor storage engine (earthmover.io)
Icechunk is a brand new open-source transactional storage engine for tensor / ND-array data designed for use on cloud object storage.
Show HN: Vortex – a high-performance columnar file format (github.com/spiraldb)
Vortex is a toolkit for working with compressed Apache Arrow arrays in-memory, on-disk, and over-the-wire.
Windows 11 24H2 hoards 8.63 GB of junk you can't delete (theregister.com)
Windows 11 24H2 users are finding there is undeletable data that remains on their devices after installing the recently released feature update.
Improving Parquet Dedupe on Hugging Face Hub (huggingface.co)
The Xet team at Hugging Face is working on improving the efficiency of the Hub's storage architecture to make it easier and quicker for users to store and update data and models.