Hacker News with Generative AI: Distributed Systems

What If We Could Rebuild Kafka from Scratch? (morling.dev)
The last few days I spent some time digging into the recently announced KIP-1150 ("Diskless Kafka"), as well as AutoMQ’s Kafka fork, tightly integrating Apache Kafka and object storage, such as S3. Following the example set by WarpStream, these projects aim to substantially improve the experience of using Kafka in cloud environments, providing better elasticity, drastically reducing cost, and paving the way towards native lakehouse integration.
Ask HN: Has anyone used Riak? Thoughts? (ycombinator.com)
I’ve just stumbled upon Riak. It seems like a very cool technology. Almost like an alternative to Kubernetes. Has anyone used it in production? Why isn’t it more well known? It seems like an awesome solution.
Decomposing Transactional Systems (transactional.blog)
Consistent Hash Ring (selfboot.cn)
Consistent Hashing Ring is a special hashing algorithm primarily used for data distribution and load balancing in distributed systems.
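As a rough illustration of the idea (a minimal sketch, not taken from the linked article), a hash ring can be built from a sorted list of hashed positions. The names `HashRing`, `vnodes`, and `lookup` are assumptions made for this example, as is the choice of SHA-256 and 100 virtual nodes per server.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a 64-bit position on the ring."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Hypothetical consistent hash ring with virtual nodes."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes   # virtual nodes per physical node (assumed default)
        self._points = []      # sorted ring positions
        self._owner = {}       # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Place several virtual nodes per server to even out the load.
        for i in range(self.vnodes):
            point = _hash(f"{node}#{i}")
            if point in self._owner:
                continue  # position already taken; skip the rare collision
            bisect.insort(self._points, point)
            self._owner[point] = node

    def remove(self, node: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key: str) -> str:
        """Return the node owning the first position clockwise from the key."""
        if not self._points:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]
```

With this layout, adding a server only takes over keys from its new neighbours on the ring; keys that do not land on the new server stay where they were.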
Graham: Synchronizing Clocks by Leveraging Local Clock Properties (usenix.org)
High performance, strongly consistent applications are beginning to require scalable sub-microsecond clock synchronization.
KIP-1150: Diskless Kafka Topics (apache.org)
Erlang's not about lightweight processes and message passing (2023) (stevana.github.io)
I used to think that the big idea of Erlang is its lightweight processes and message passing. Over the last couple of years I’ve realised that there’s a bigger insight to be had, and in this post I’d like to share it with you.
Engineering a Trace Details Page That Handles a Million Spans (signoz.io)
Building a modern durable execution engine from first principles (restate.dev)
We dive into the architecture details of Restate, a Durable Execution engine we built from the ground up. Restate requires no database/log or other system, but implements a full stack that competes with the best logs in terms of durability and operations.
Colossus: How we deliver SSD performance at HDD prices (cloud.google.com)
From YouTube and Gmail to BigQuery and Cloud Storage, almost all of Google’s products depend on Colossus, our foundational distributed storage system.
The Synchrony Budget (morling.dev)
For building a system of distributed services, one concept I think is very valuable to keep in mind is what I call the synchrony budget: as much as possible, a service should minimize the number of synchronous requests which it makes to other services.
Conflict-Free Distributed Architecture for Append-Only Writes to Apache Iceberg (e6data.com)
Apache Iceberg is a cornerstone table format in modern data lakehouse systems. It is renowned for its ability to deliver transactional consistency, schema evolution, and snapshot isolation through a metadata-driven architecture.
Nvidia Dynamo: A Datacenter Scale Distributed Inference Serving Framework (github.com/ai-dynamo)
NVIDIA Dynamo is a high-throughput low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments.
When Kafka is not the right Move (rejot.dev)
When designing distributed systems, event streaming platforms such as Kafka are the preferred solution for asynchronous communication. In fact, on Kafka’s official website, the first use-case listed is messaging. My belief is that this default choice can lead to problems and unnecessary complexity. The main reason for this is the conversion between state and events, which we’ll look at in this article through a game of chess.
Handling Database Failures in a Distributed System with RabbitMQ Workers (ycombinator.com)
I have a worker that processes tasks from RabbitMQ and inserts data into a database. The system operates at high scale, handling thousands of messages per second, which makes proper failure handling crucial to avoid overwhelming the system.
Distributed systems programming has stalled (shadaj.me)
Over the last decade, we’ve seen great advancements in distributed systems, but the way we program them has seen few fundamental improvements.
The Anatomy of a Durable Execution Stack from First Principles (restate.dev)
We dive into the architecture details of Restate, a Durable Execution engine we built from the ground up. Restate requires no database/log or other system, but implements a full stack that competes with the best logs in terms of durability and operations.
The Pijul Manual (pijul.org)
Welcome to the Pijul book, an introduction to Pijul, a distributed version control system that is at the same time theoretically sound, fast and easy to learn and use.
What Is the Byzantine Generals Problem in Distributed Systems? (scalablethread.com)
The Byzantine Generals Problem is a thought experiment in distributed computing to understand the challenges of reaching a consensus when some nodes may be untrustworthy or unreliable (node behavior). It shows how difficult it can be to coordinate actions when some system members may act dishonestly.
Chrono: A Peer-to-Peer Network with Verifiable Causality (arxiv.org)
Logical clocks are a fundamental tool to establish causal ordering of events in a distributed system.
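For readers new to the term, the classic Lamport clock is the simplest logical clock. The sketch below is a generic illustration under that assumption, not the scheme from the linked paper; the class and method names are hypothetical.

```python
class LamportClock:
    """Hypothetical minimal Lamport logical clock for one process."""

    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """Advance the clock for a local event."""
        self.time += 1
        return self.time

    def send(self) -> int:
        """Sending is an event; the returned timestamp travels with the message."""
        return self.tick()

    def receive(self, timestamp: int) -> int:
        """Merge the sender's timestamp, then count the receive as an event."""
        self.time = max(self.time, timestamp) + 1
        return self.time
```

If event A causally precedes event B, A's timestamp is smaller than B's. The converse does not hold: a smaller timestamp alone does not prove causality.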
Hydro: Distributed Programming Framework for Rust (hydro.run)
Hydro is a high-level distributed programming framework for Rust.
Durable execution should be lightweight (dbos.dev)
Everyone knows serious programs must make data durable. You persist data on disk or in a database so it doesn’t disappear the second your program crashes or your server is restarted. But we also take it for granted that programs themselves aren’t durable. When you restart your server, your data might be safe in the database, but any programs you were running are gone, and if you want them back, you have to restart them yourself.
Husky: Efficient Compaction at Datadog Scale (datadoghq.com)
In a previous blog post, we introduced our Husky event store system. Husky is a distributed storage system that is layered over object storage (e.g., Amazon S3, Google Cloud Storage, Azure Blob Storage, etc.), with the query system acting as a cache over this storage. We also did a deep dive into Husky’s ingestion pipelines that we built to handle the scale of our customer data. In this post, we’ll cover how we designed Husky’s underlying data storage layer.
Turning the database inside-out (2015) (kleppmann.com)
Databases are global, shared, mutable state. That’s the way it has been since the 1960s, and no amount of NoSQL has changed that. However, most self-respecting developers have got rid of mutable global variables in their code long ago. So why do we tolerate databases as they are?
Fault Tolerance in Tandem Computer Systems (1986) [pdf] (azurewebsites.net)
Every System is a Log: Avoiding coordination in distributed applications (restate.dev)
Building resilient distributed applications remains a tough challenge.
Nation-scale Matrix deployments will fail using the community version of Synapse (matrix.org)
Distributed Transactions at Scale in Amazon DynamoDB (2023) (blogspot.com)
This paper appeared in July at USENIX ATC 2023. If you haven't read about the architecture and operation of DynamoDB, please first read my summary of the DynamoDB ATC 2022 paper. The big omission in that paper was discussion about transactions. This paper amends that. It is great to see DynamoDB, and AWS in general, is publishing/sharing more widely than before.
Misty: A secure distributed actor language (mistysystem.com)
Public Domain 2025 Douglas Crockford