Hacker News with Generative AI: Distributed Systems

The Anatomy of a Durable Execution Stack from First Principles (restate.dev)
We dive into the architecture details of Restate, a Durable Execution engine we built from the ground up. Restate requires no database/log or other system, but implements a full stack that competes with the best logs in terms of durability and operations.
The Pijul Manual (pijul.org)
Welcome to the Pijul book, an introduction to Pijul, a distributed version control system that is at the same time theoretically sound, fast and easy to learn and use.
What Is the Byzantine Generals Problem in Distributed Systems? (scalablethread.com)
The Byzantine Generals Problem is a thought experiment in distributed computing that illustrates the challenge of reaching consensus when some nodes may be untrustworthy or unreliable. It shows how difficult it can be to coordinate actions when some system members may act dishonestly.
Chrono: A Peer-to-Peer Network with Verifiable Causality (arxiv.org)
Logical clocks are a fundamental tool to establish causal ordering of events in a distributed system.
Hydro: Distributed Programming Framework for Rust (hydro.run)
Hydro is a high-level distributed programming framework for Rust.
Durable execution should be lightweight (dbos.dev)
Everyone knows serious programs must make data durable. You persist data on disk or in a database so it doesn’t disappear the second your program crashes or your server is restarted. But we also take it for granted that programs themselves aren’t durable. When you restart your server, your data might be safe in the database, but any programs you were running are gone, and if you want them back, you have to restart them yourself.
Husky: Efficient Compaction at Datadog Scale (datadoghq.com)
In a previous blog post, we introduced our Husky event store system. Husky is a distributed storage system that is layered over object storage (e.g., Amazon S3, Google Cloud Storage, Azure Blob Storage), with the query system acting as a cache over this storage. We also did a deep dive into Husky’s ingestion pipelines that we built to handle the scale of our customer data. In this post, we’ll cover how we designed Husky’s underlying data storage layer.
Turning the database inside-out (2015) (kleppmann.com)
Databases are global, shared, mutable state. That’s the way it has been since the 1960s, and no amount of NoSQL has changed that. However, most self-respecting developers have got rid of mutable global variables in their code long ago. So why do we tolerate databases as they are?
Fault Tolerance in Tandem Computer Systems (1986) [pdf] (azurewebsites.net)
Every System is a Log: Avoiding coordination in distributed applications (restate.dev)
Building resilient distributed applications remains a tough challenge.
Nation-scale Matrix deployments will fail using the community version of Synapse (matrix.org)
Distributed Transactions at Scale in Amazon DynamoDB (2023) (blogspot.com)
This paper appeared in July at USENIX ATC 2023. If you haven't read about the architecture and operation of DynamoDB, please first read my summary of the DynamoDB ATC 2022 paper. The big omission in that paper was discussion about transactions. This paper amends that. It is great to see DynamoDB, and AWS in general, is publishing/sharing more widely than before.
Misty: A secure distributed actor language (mistysystem.com)
Public Domain 2025 Douglas Crockford
Encore – Back end framework for type-safe distributed systems (encore.dev)
Did we miss P In CAP? Partial Progress Conjecture under Asynchrony (arxiv.org)
Each application developer desires to provide its users with consistent results and an always-available system despite failures. Boldly, the CAP theorem disagrees. It states that it is hard to design a system that is both consistent and available under network partitions; select at most two out of these three properties.
We Have Google Drive at Home: Musings on Merkle-Tree Based File Sharing (dolthub.com)
Suppose you have a directory of files that you want to sync with your friends. When the files change, you want your friends to be able to download just the changes without needing to re-download the entire directory again. And you want this to scale, no matter how many or how large the files are. What's the best way to do this?
The end-to-end principle in distributed systems (tedinski.com)
The theme for the last couple weeks has been basic design considerations for distributed systems.
Beyond Gradient Averaging in Parallel Optimization (arxiv.org)
We introduce Gradient Agreement Filtering (GAF) to improve on gradient averaging in distributed deep learning optimization.
Show HN: Jido – Run 10k agents at 25KB each (Elixir) (github.com/agentjido)
Jido is a foundational framework for building autonomous, distributed agent systems in Elixir.
Exploring Alternatives to UUIDv4; Enter ULIDs (jirevwe.github.io)
UUIDv4 is a standardized unique identifier format widely used in distributed systems. Recently there have been attempts to introduce new identifier formats that are shorter, URL-friendly, lexicographically sortable, and collision-safe during generation.
Use of Logical Clocks in Databases (blogspot.com)
Sometimes I cache: implementing lock-free probabilistic caching (cloudflare.com)
HTTP caching is conceptually simple: if the response to a request is in the cache, serve it, and if not, pull it from your origin, put it in the cache, and return it.
400TB Single Cluster: OceanBase Powers Kwai's Core Business (oceanbase.github.io)
Kwai is a short video app boasting more than 10 million daily active users. How does it efficiently process highly concurrent user requests? Kwai once deployed multiple MySQL clusters in the backend to support high traffic with large data storage and satisfactory performance. What are the weak points of this conventional sharding solution? What pushed Kwai to select distributed databases and eventually deploy OceanBase Database?
Show HN: Rivet Actors – Durable Objects built with Rust, FoundationDB, Isolates (github.com/rivet-gg)
🔩 Run and scale realtime applications with Rivet Actors
CRDTs and Collaborative Playground (cerbos.dev)
At Cerbos, we specialize in simplifying complex authorization logic to empower developers with the tools to implement secure, scalable, and maintainable access control systems.
Load is not what you should balance: Introducing Prequal (usenix.org)
We present PReQuaL (Probing to Reduce Queuing and Latency), a load balancer for distributed multi-tenant systems.
Designing a distributed circuit breaker in Golang (getconvoy.io)
One of the major problems of designing a webhook delivery system is designing around bad/zombie endpoints. Zombie endpoints are dead endpoints that fail continuously and, over time, clog up your queues, create back pressure, and delay event delivery to legitimate webhook endpoints. Circuit breakers are the best-known mechanism for dealing with unreliable HTTP API endpoints, preventing failures from upstream services from cascading into our system.
Eventual Consistency Is Tricky (systemdesigncodex.com)
The concept of eventual consistency refers to a system condition where all parts of the system eventually reach the same state, even though they may be temporarily inconsistent due to delays or failures.
The Acton Programming Language (acton-lang.org)
Acton is a general purpose programming language, designed to be useful for a wide range of applications, from desktop applications to embedded and distributed systems.
Distributed Erlang (vereis.com)