We built a self-healing system to survive a concurrency bug at Netflix
(pushtoprod.substack.com)
Our CPUs were dying, the bug was temporarily un-fixable, and we had no viable path forward. Here's how we managed to survive.
Our CPUs were dying, the bug was temporarily un-fixable, and we had no viable path forward. Here's how we managed to survive.