Automating Chaos Internally: Stage 3

As we progress through Reliability Stage 3, it’s time to fully embrace automation: every reliability test in non-production environments that can be automated should be, requiring little to no manual interaction. After completing this Reliability Stage, your application will be quite reliable and, at least outside of production, should require minimal supervision and support costs. Let’s get to it!

Last Updated June 10, 2019

In Chaos Engineering Through Staged Reliability - Stage 2 we examined partial automation by implementing a handful of automated reliability tests. We even mentioned automating our first tests in Chaos Engineering Through Staged Reliability - Stage 1. Now we advocate automating every test you think will be helpful, as long as doing so doesn't introduce unacceptably high risk or complexity. No one expects daily automated testing of advanced disaster recovery schemes or region failover. But we can all think of low-risk, simple tests that could run frequently and that we haven't yet automated. This stage advocates scheduling the work of automating those tests now, for the time being only in a staging or testing environment.

Prerequisites

Creation of, and agreement on, Disaster Recovery and Dependency Failover Playbooks.
Completion of the most vital and applicable parts of Reliability Stage 0.
Completion of the most vital and applicable parts of Reliability Stage 1.
Completion of the most vital and applicable parts of Reliability Stage 2.

Automate Reliability Testing in Non-Production

After progressing through Reliability Stage 2, your team has implemented at least some semi-automated reliability testing. This fourth stage is where we advocate integrating the remaining (appropriate) non-automated reliability tests into your automated testing suite. If your application has a development or other non-production environment, you can run these automated resilience tests there, as full-blown production testing isn’t typical until Reliability Stage 4, and even then not all situations permit it. However, the earlier the team starts thinking about and practicing implementation within production systems, the smoother the transition will be, and the sooner you’ll see the dramatic increase in reliability and drop in support costs that staged reliability aims to provide.
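As a rough illustration of what an automated, non-production-only suite might look like, here is a minimal Python sketch. Everything in it is hypothetical: the `ReliabilityTest` type, the `run_suite` helper, the environment names, and the placeholder checks are assumptions made for this example, not part of Gremlin or the Bookstore application. The point it shows is a hard guard that refuses to run outside non-production, plus a flag that keeps not-yet-automated tests out of the scheduled run.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: these are the names your team uses for non-production environments.
NON_PRODUCTION = {"development", "testing", "staging"}


@dataclass
class ReliabilityTest:
    name: str
    run: Callable[[], bool]  # returns True when the system behaved as expected
    automated: bool = True   # False for tests that are still run by hand


def run_suite(tests: List[ReliabilityTest], environment: str) -> Dict[str, bool]:
    """Run every automated test, refusing to run outside non-production."""
    if environment not in NON_PRODUCTION:
        raise RuntimeError(f"refusing to run reliability tests in {environment!r}")
    return {test.name: test.run() for test in tests if test.automated}


# Hypothetical placeholder checks; real ones would trigger a failure
# and then verify that the service stayed healthy.
def cdn_failure_check() -> bool:
    return True


def db_failure_check() -> bool:
    return True


SUITE = [
    ReliabilityTest("cdn-failure", cdn_failure_check),
    ReliabilityTest("db-failure", db_failure_check),
    # Too risky or complex to automate for now, so it stays manual.
    ReliabilityTest("region-failover", lambda: True, automated=False),
]

print(run_suite(SUITE, "staging"))
```

A scheduler such as cron or a CI job could call `run_suite` on whatever cadence the team agrees on. Flipping `automated` to `True` is then the only change needed when a manual test graduates into the suite.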

Semi-Automate Disaster Recovery Failover

In spite of everyone’s best efforts, not all disasters can be avoided, so it’s critical that the team implement at least a semi-automated disaster recovery failover script to assist when the unexpected happens. As with reliability testing, it’s best to automate as much of the disaster recovery failover process as possible, requiring as little human intervention as feasible. However, depending on the breadth of the system and initial planning throughout the earlier Reliability Stages, it’s entirely possible your disaster recovery failover will require at least a modicum of human supervision.

As the team progresses through this stage, make sure you follow the playbooks that have been previously established. If something needs to be changed in a process or playbook, this is the time to suss that out and make those updates.
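To make "semi-automated" concrete, here is one possible shape for a failover script, sketched in Python. The step names, the `FailoverStep` type, and the `run_failover` helper are all hypothetical and stand in for whatever your own playbook prescribes. The idea is that each playbook step is automated, but steps marked as needing approval pause for a human decision before they run.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FailoverStep:
    description: str
    action: Callable[[], None]    # the automated work for this step
    needs_approval: bool = False  # True when the playbook calls for human sign-off


def run_failover(steps: List[FailoverStep], approve: Callable[[str], bool]) -> List[str]:
    """Run steps in order; stop at the first gated step that is not approved."""
    log: List[str] = []
    for step in steps:
        if step.needs_approval and not approve(step.description):
            log.append(f"halted before: {step.description}")
            break
        step.action()
        log.append(f"done: {step.description}")
    return log


def noop() -> None:
    """Placeholder for real automation (hypothetical)."""


# Hypothetical playbook; real steps would come from your agreed playbooks.
PLAYBOOK = [
    FailoverStep("verify standby database is healthy", noop),
    FailoverStep("promote standby database", noop, needs_approval=True),
    FailoverStep("repoint bookstore-api at new primary", noop),
]

# Auto-approve here so the sketch runs unattended; in practice the approver
# would prompt an on-call engineer or wait for a chat-ops confirmation.
for line in run_failover(PLAYBOOK, approve=lambda description: True):
    print(line)
```

Keeping the approval decision behind a callable means the same script can run fully automated in a staging drill and gated during a real incident. The returned log also gives the team a record to compare against the playbook afterwards.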

Reliability Stage 3: Implementation Example

As will sometimes be the case when your own team is working through each Stage of Reliability, the Bookstore application has already been configured to automatically perform reliability testing in non-production environments. In Reliability Stage 2 we explored Performing a CDN Failure Simulation Test and Performing a DB Failure Simulation Test, which together handle the major reliability tests for the system by creating Gremlin attacks that sever the connection between the `bookstore-api` instances and the respective CDN/DB endpoints.

To ensure these tests are performed automatically, we can use the Gremlin API or web front-end to schedule attacks according to our testing schedule. Similarly, we’d want to schedule an automatic disaster recovery failover test using a Gremlin Shutdown Attack, as illustrated in Verifying Automated Instance Failover in Reliability Stage 2. Check out the Gremlin documentation for more details on creating attacks with Gremlin.
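For teams that drive this from their own scheduler (a cron job or CI pipeline, say), the sketch below shows how such a job might build the HTTP request for an attack. Treat the details as assumptions to check against the current Gremlin API documentation: the endpoint URL, the `Key` authorization scheme, the payload shape, and the `--delay` argument are written from memory of the public API and are not confirmed by this article. The `build_attack_request` helper is hypothetical. The request is only constructed and printed, never sent.

```python
import json
import urllib.request

# Assumption: endpoint as recalled from Gremlin's public API docs; verify before use.
GREMLIN_ATTACKS_URL = "https://api.gremlin.com/v1/attacks/new"


def build_attack_request(api_key, command_type, args, target_type="Random"):
    """Build (but do not send) a POST request describing one attack."""
    # Assumption: payload shape with "command" and "target" objects.
    payload = {
        "command": {"type": command_type, "args": list(args)},
        "target": {"type": target_type},
    }
    return urllib.request.Request(
        GREMLIN_ATTACKS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Key {api_key}",  # assumption: "Key" auth scheme
        },
        method="POST",
    )


# Illustrative values only: a placeholder key and an assumed shutdown argument.
request = build_attack_request("EXAMPLE_KEY", "shutdown", ["--delay", "1"])
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# Sending it would be urllib.request.urlopen(request); deliberately left out here.
```

Separating request construction from sending keeps the job easy to dry-run in a pipeline before it is pointed at a real non-production team and API key.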

Reliability Stage 3 Completion

You’ve automated reliability testing in a non-production environment (and, ideally, even a bit in production). Your team has also semi-automated its disaster recovery failover procedures, so your service can largely recover from a failure with minimal human intervention. In the last chapter of this series, Chaos Engineering Through Staged Reliability - Stage 4, we’ll explore the final steps of fully automating reliability testing in production, along with CI/CD integration to ensure your service maintains stability throughout every step of the software development life cycle.
