What is Chaos Engineering?
Chaos Engineering is a disciplined approach to identifying potential failures before they become outages. Ultimately, the goal of Chaos Engineering is to improve the stability and resiliency of our systems.
Creating reliable software is a fundamental necessity for modern cloud applications and architectures. As systems are increasingly distributed by design, the potential for unplanned failures and unexpected outages grows significantly. Thankfully, Chaos and Reliability Engineering techniques are quickly gaining traction within the community. Many organizations, both big and small, have embraced Chaos Engineering over the last few years.
NOTE: The articles in this guide use the term “team” to mean the single group responsible for an application that you are testing, or considering testing, with chaos experiments like those described in this guide.
Contents
There are two main sections to our Chaos Engineering content. First, we start with a series outlining the stages of reliability within an organization as you progress from first learning about Chaos Engineering to preparing to implement and on to increasing adoption. Second, we have a series of individual posts which each illustrate how one might choose to implement Chaos Engineering with specific technologies, architectures, and approaches that may be a part of your stack or something you are considering for the future.
Chaos Engineering Through Staged Reliability
In his ChaosConf 2018 talk titled Practicing Chaos Engineering at Walmart, Walmart’s Director of Engineering Vilas Veeraraghavan outlines how he and the hundreds of engineering teams at Walmart have implemented Resilience Engineering (which we will refer to as the pursuit of reliability within SRE). By creating a robust series of “levels” or “stages” that each engineering team can work through, Walmart is able to progressively improve system reliability while dramatically reducing support costs.
This series expands on this model by diving deep into the five Stages of Reliability. Each post examines the necessary components of a stage, describes how those components are evaluated and assembled, and outlines the step-by-step process necessary to move from one stage to the next.
This series also digs into the specific implementation of each stage by progressing through the entire process with a real-world, fully functional API application hosted on AWS. We’ll go through everything from defining and executing disaster recovery playbook scenarios to improving system architecture and reducing the recovery time objective (RTO), recovery point objective (RPO), and applicable support costs for this example app.
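To make the two recovery metrics concrete, here is a minimal sketch of how a team might score a disaster recovery drill against them. RTO (recovery time objective) is the maximum downtime you are willing to accept, and RPO (recovery point objective) is the maximum window of data loss you are willing to accept. The Incident class, the function names, and the timestamps below are hypothetical illustrations, not part of the example app or of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical timeline recorded during a drill or a real outage."""
    last_good_backup: datetime   # most recent recoverable data point
    failure_start: datetime      # when the outage began
    service_restored: datetime   # when the service was usable again


def recovery_time(incident: Incident) -> timedelta:
    """Downtime actually experienced; compare this against the RTO."""
    return incident.service_restored - incident.failure_start


def data_loss_window(incident: Incident) -> timedelta:
    """Span of data written after the last backup; compare against the RPO."""
    return incident.failure_start - incident.last_good_backup


def meets_objectives(incident: Incident, rto: timedelta, rpo: timedelta) -> bool:
    """True only if both the downtime and the data loss are within target."""
    return recovery_time(incident) <= rto and data_loss_window(incident) <= rpo


# Made-up drill: backup at 11:45, failure at 12:00, recovery at 12:40.
drill = Incident(
    last_good_backup=datetime(2024, 1, 1, 11, 45),
    failure_start=datetime(2024, 1, 1, 12, 0),
    service_restored=datetime(2024, 1, 1, 12, 40),
)

if __name__ == "__main__":
    print("recovery time:", recovery_time(drill))
    print("data loss window:", data_loss_window(drill))
    print("meets 30 min RTO / 1 h RPO:",
          meets_objectives(drill, timedelta(minutes=30), timedelta(hours=1)))
```

In this made-up drill the data loss stays inside a one-hour RPO, but the 40 minutes of downtime misses a 30-minute RTO, which is the kind of gap that architecture improvements aim to close.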
With some adjustment for your own organizational needs, you and your team can implement similar practices to add Chaos Engineering to your own systems. After working through all five stages, your system and its deployment will be almost entirely automated and will feature significant resiliency testing and robust disaster recovery failover.
An additional tool to help you get started is Gremlin's reliability calculator.