Hacker News with Generative AI: Data Storage

ClickHouse gets lazier and faster: Introducing lazy materialization (clickhouse.com)
Imagine if you could skip packing your bags for a trip because you find out at the airport you’re not going. That’s what ClickHouse is doing with data now.
(All) Databases Are Just Files. Postgres Too (tselai.com)
Dear reader: If you’re feeling an urge to comment solely based on the title, just be warned that too many have done so already.
Unpowered SSD endurance investigation finds data loss, performance issues (tomshardware.com)
How Long Can SSD Store Data Unpowered? Year 2 Update (2024) [video] (youtube.com)
Colossus for Rapid Storage (cloud.google.com)
As an object storage service, Google Cloud Storage is popular for its simplicity and scale, a big part of which is due to the stateless REST protocols that you can use to read and write data. But with the rise of AI and as more customers look to run data-intensive workloads, two major obstacles to using object storage are its higher latency and lack of file-oriented semantics.
SpacetimeDB (spacetimedb.com)
U.S. Gov't eliminates tape data storage at the GSA to save $1M per year (tomshardware.com)
Reducing Cloud Spend: Migrating Logs from CloudWatch to Iceberg with Postgres (crunchydata.com)
As a database service provider, we store a number of logs internally to audit and oversee what is happening within our systems.
Scoping a Local-First Image Archive (scottishstoater.com)
For years, I’ve been thinking about how we store and access our digital files, especially photos.
Preview: Amazon S3 Tables and Lakehouse in DuckDB (duckdb.org)
TL;DR: We are happy to announce a new preview feature that adds support for Apache Iceberg REST Catalogs, enabling DuckDB users to connect to Amazon S3 Tables and Amazon SageMaker Lakehouse with ease.
The real failure rate of EBS (planetscale.com)
PlanetScale has deployed millions of Amazon Elastic Block Store (EBS) volumes across the world. We create and destroy tens of thousands of them every day as we stand up databases for customers, take backups, and test our systems end-to-end. Through this experience, we have a unique viewpoint into the failure rates and mechanisms of EBS, and have spent a lot of time working on how to mitigate them.
Archival Storage (dshr.org)
I'm honored to appear in what I believe is the final series of these seminars. Most of my previous appearances have focused on debunking some conventional wisdom, and this one is no exception. My parting gift to you is to stop you wasting time and resources on yet another seductive but impractical idea — that the solution to storing archival data is quasi-immortal media. As usual, you don't have to take notes.
Theory crafting a system for 1000 simultaneous micro SD card ingests (level1techs.com)
Ask HN: What do you think of BDXL (100GB disks)? (ycombinator.com)
I still have a need to archive data and I'm thinking about getting a BDXL writer and some disks. Is this a dumb thing to do in 2025?
Put a data center on the moon? (ieee.org)
Lonestar Data Holdings is sending a test mission, aiming to safeguard valuable data
Hard Drive Graveyard (benjdd.com)
What 5 Megabytes of Data Looked Like in 1966 (62,500 punched cards) (vintag.es)
In 1966, computing was in its infancy, and the concept of data storage and processing looked drastically different from today’s instant access to vast amounts of information.
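The 62,500-card figure follows from simple arithmetic, assuming the standard 80-column punched card storing one character (one byte) per column:

```python
CARD_BYTES = 80  # a standard 80-column card stores one character per column

def cards_for(total_bytes: int) -> int:
    """Number of punched cards needed to hold total_bytes of data."""
    return total_bytes // CARD_BYTES

# 5 megabytes (5,000,000 bytes) at 80 bytes per card:
print(cards_for(5_000_000))  # 62500
```

At roughly 2 grams per card, that stack of 5 MB weighed well over a hundred kilograms.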
Are SSDs more reliable than hard drives? (2021) (backblaze.com)
Solid-state drives (SSDs) continue to become more and more a part of the data storage landscape. And while our SSD 101 series has covered topics like upgrading, troubleshooting, and recycling your SSDs, we’d like to test one of the more popular declarations from SSD proponents: that SSDs fail much less often than our old friend, the hard disk drive (HDD).
12 years of Backblaze data center storage drives, visualized (benjdd.com)
1 small node -> 100 drives
Backblaze Drive Stats for 2024 (backblaze.com)
As of December 31, 2024, we had 305,180 drives under management. Of that number, there were 4,060 boot drives and 301,120 data drives. This report will focus on those data drives as we review the Q4 2024 annualized failure rates (AFR), the 2024 failure rates, and the lifetime failure rates for the drive models in service as of the end of 2024.
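The annualized failure rate (AFR) in reports like this is conventionally computed from drive-days of exposure rather than a raw drive count, so drives deployed for only part of the year are weighted correctly. A minimal sketch (the function name is mine, not Backblaze's):

```python
def annualized_failure_rate(failures: int, drive_days: int) -> float:
    """AFR (%) = failures / (drive_days / 365) * 100.

    Using drive-days means a drive that ran for half the year
    contributes half as much exposure as one that ran all year.
    """
    return failures / (drive_days / 365) * 100

# Example: 1,000 drives running a full year with 10 failures -> 1.0% AFR
print(annualized_failure_rate(10, 1_000 * 365))
```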
Seagate's HDD scandal deepens as clues point at Chinese Chia mining farms (tomshardware.com)
Cloudflare R2 Incident on February 6, 2025 (cloudflare.com)
Multiple Cloudflare services, including our R2 object storage, were unavailable for 59 minutes on Thursday, February 6th. This caused all operations against R2 to fail for the duration of the incident, and caused a number of other Cloudflare services that depend on R2 — including Stream, Images, Cache Reserve, Vectorize and Log Delivery — to suffer significant failures.
For privacy: Change of our refund policy from 30 to 14 days (mullvad.net)
As part of our ongoing commitment to storing less user data and protecting your privacy, we're updating our refund policy.
Husky: Efficient Compaction at Datadog Scale (datadoghq.com)
In a previous blog post, we introduced our Husky event store system. Husky is a distributed storage system that is layered over object storage (e.g., Amazon S3, Google Cloud Storage, Azure Blob Storage, etc.), with the query system acting as a cache over this storage. We also did a deep dive into Husky’s ingestion pipelines that we built to handle the scale of our customer data. In this post, we’ll cover how we designed Husky’s underlying data storage layer.
Apache Accumulo 4.0 Feature Preview (apache.org)
Apache Accumulo® is a sorted, distributed key/value store that provides robust, scalable data storage and retrieval.
Storage is cheap, but not thinking about logging is expensive (counting-stuff.com)
The bad habits of data over-collection run deep.
Seagate smashes largest HDD world record with 36TB hard drive (techradar.com)
Parquet and ORC's many shortfalls for machine learning, and what to do about it? (starburst.io)
At the turn of the century (around a quarter of a century ago), over 99% of the data management industry used row-oriented storage for all workloads involving structured data — including transactional and analytical workloads.
37signals Dev – Monitoring 10 Petabytes of Data in Pure Storage (37signals.com)
How we use Prometheus to collect metrics and trigger alerts for Pure Storage.
I Track My Health Data in Markdown: Lessons in Digital Longevity (ycombinator.com)
I’ve spent years tracking my sleep, diet, and exercise with apps and wearables. But here’s the problem: when an app gets discontinued or stops syncing, the data—and all the insights—disappear.