Injected Automated Chaos Everywhere: Stage 4

Once your team has progressed through this final stage, your system will be demonstrably more reliable and will require far fewer support hours, with lower support costs when a failure does occur.


Last Updated June 10, 2019

We’ve finally made it to the last Reliability Stage, where your team works toward full automation of both reliability testing and disaster recovery failover procedures. Chaos Engineering Through Staged Reliability - Stage 3 emphasized the importance of looking toward the production environment for automation, but Reliability Stage 4 is where every remaining manual test and process that can reasonably be automated should be automated within production. That’s not to say no human interaction can be involved, but rather that a given reliability testing or disaster recovery process must not rely solely on human intervention to succeed.

Prerequisites

Creation of, and agreement on, Disaster Recovery and Dependency Failover Playbooks.
Completion of the most vital and applicable parts of Reliability Stage 0.
Completion of the most vital and applicable parts of Reliability Stage 1.
Completion of the most vital and applicable parts of Reliability Stage 2.
Completion of the most vital and applicable parts of Reliability Stage 3.

Integrate Automatic Reliability Testing in CI/CD

For systems relying on continuous integration and continuous deployment, this final stage is when you should integrate automatic reliability testing within the CI/CD process.

A failed reliability test should also result in a build and/or deployment failure. Just as with application-level unit testing, reliability tests should pass 100% of the time before a given build is released, so this requirement helps maintain system stability.
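
The "failed test fails the build" rule can be sketched as a small gate that a CI step runs last. This is a minimal illustration, not part of the Bookstore project: the function names and the shape of the results mapping are assumptions, and a real pipeline would collect results from its own test steps.

```python
from typing import List, Mapping


def failed_tests(results: Mapping[str, bool]) -> List[str]:
    """Return the names of reliability tests that did not pass, sorted."""
    return sorted(name for name, passed in results.items() if not passed)


def exit_code(results: Mapping[str, bool]) -> int:
    """Return 0 only when every reliability test passed.

    An empty result set is treated as a failure, on the assumption that
    "no tests ran" should never let a build through.
    """
    if not results:
        return 1
    return 1 if failed_tests(results) else 0
```

A CI step could pass the value of `exit_code(...)` to `sys.exit`, since most CI servers, Jenkins included, mark a shell build step as failed when it exits with a non-zero status.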

Automate Reliability and Disaster Recovery Failover Testing in Production

The final and most critical requirement for Reliability Stage 4 is the nearly-total automation of both reliability and disaster recovery failover testing in the production environment. An SRE can be involved, but everything should be so automated that running tests merely requires a minimal amount of support time to get the process started.

Reliability Stage 4: Implementation Example

To keep things simple, the Bookstore application uses Jenkins to handle CI/CD. We’ve provisioned a new Amazon EC2 instance onto which Jenkins is installed. An Amazon Route53 DNS record points the jenkins.bookstore.pingpublications.com endpoint to the EC2 instance, so the Jenkins front end is accessed through http://jenkins.bookstore.pingpublications.com:8080.

The steps for installing and running Jenkins on an Amazon EC2 instance are beyond the scope of this article, but a traditional deployment is easy to accomplish.

This is now what the final Bookstore architecture looks like.

Jenkins Configuration

With Jenkins installed we need to create a new Project for the Bookstore application. We’re using the Publish Over SSH plugin to simplify deployment to both the blue/green Amazon EC2 instances. Once the plugin is installed it must be configured by adding each SSH server we’ll be interacting with.

Navigate to Manage Jenkins > Configure System and scroll down to the Publish over SSH section.
Enter the appropriate private key allowing connection to both Bookstore API Amazon EC2 instances.
Under SSH Servers click Add.
Enter the appropriate details for the bookstore-api-blue server.
- Name: bookstore-api-blue
- Hostname: 54.213.54.171
- Username: ubuntu
- Remote Directory: /home/ubuntu/apps
Add another entry for the bookstore-api-green server.
- Name: bookstore-api-green
- Hostname: 52.11.79.9
- Username: ubuntu
- Remote Directory: /home/ubuntu/apps
Click Save.

With the SSH connections configured it’s time to create the Project to handle actual deployment.

Click New Item.
Input bookstore-deploy in the Item name field.
Select Freestyle Project type and click OK.
Under Source Code Management select Git and enter the GitHub endpoint (e.g. git@github.com:GabeStah/bookstore_api.git).
Add the appropriate GitHub credentials.

Deploying the Bookstore application code is just a matter of creating an SSH Publisher for each environment.

1. Under Build click Add build step and select Send files or execute commands over SSH.
2. Under SSH Server > Name select bookstore-api-blue.
3. In the Source files field input **/*. This field uses the Apache Ant pattern format, so here we’re merely ensuring all files in the project directory are published.
4. For Remote directory input bookstore_api.
5. Since the application uses Gunicorn for the web server we need to add a command to restart it after deployment. Input sudo systemctl restart gunicorn in the Exec command field.
6. Click Add Server and repeat steps 2 through 5 for the bookstore-api-green server.
7. Click Save to finalize the settings. The bookstore-deploy Project will now deploy new builds of the Bookstore to each production environment!
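
To see roughly which files the **/* pattern selects, the sketch below lists every file under a directory, at any depth. It is an approximation written for illustration: the function name is hypothetical, and it ignores Ant's default excludes, which a real publisher may apply.

```python
from pathlib import Path
from typing import List


def files_matching_all(root: Path) -> List[str]:
    """List every file under root, relative to root, similar to the Ant pattern **/*.

    Directories themselves are skipped; only files are returned.
    """
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file()
    )
```

Running this against a local checkout before a deployment is one way to sanity-check what is about to be published.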

Automating Reliability Testing in Jenkins

To meet the requirements of Reliability Stage 4 we need to integrate automated reliability testing into the CI/CD pipeline. For the Bookstore application, we’ll use a Gremlin Blackhole Attack. This attack blocks all network communication between the targeted instance and specified endpoints. Just as we saw in Resiliency Stage 2, we’ll be using this attack to perform a reliability test that interrupts communication and triggers an Amazon CloudWatch alarm. Check out the Gremlin documentation for more details on creating attacks with Gremlin’s Chaos Engineering tools.

Within Jenkins, we can initiate the reliability test by executing over SSH.

1. Navigate to the Configure section of the bookstore-deploy Project.
2. Under Build click Add build step and select Send files or execute commands over SSH.
3. Under SSH Server > Name select bookstore-api-blue.
4. In the Exec command field input the following:
```
gremlin attack blackhole -l 240 -h ^api.gremlin.com,cdn.bookstore.pingpublications.com,db.bookstore.pingpublications.com
```
Tip
This Gremlin Blackhole attack tests reliability against the simulated failure of our critical dependencies (database and CDN). Consequently, this will trigger the respective bookstore-api-{blue/green}-{db/cdn}-connectivity-failed Amazon CloudWatch Alarms that we configured during Reliability Stage 2. This causes a DNS failover, which can be confirmed just as it was in Resiliency Stage 3.
5. Click Add Server and repeat steps 3 and 4 for the bookstore-api-green server.
6. Click Save.
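
The blackhole command is easier to maintain when it is assembled from host lists rather than typed by hand. The sketch below rebuilds the exact command used above. The function and parameter names are hypothetical. Reading the ^ prefix as "exclude this host from the attack" is an interpretation based on the build's console output, where the Gremlin API addresses are whitelisted, so confirm it against the Gremlin documentation.

```python
from typing import List, Sequence


def blackhole_command(length_seconds: int,
                      blocked_hosts: Sequence[str],
                      allowed_hosts: Sequence[str] = ()) -> List[str]:
    """Build the argument list for a Gremlin blackhole attack.

    Hosts in allowed_hosts get a ^ prefix (assumed to mean "exclude");
    hosts in blocked_hosts are passed through unchanged.
    """
    if length_seconds <= 0:
        raise ValueError("length_seconds must be positive")
    if not blocked_hosts:
        raise ValueError("at least one blocked host is required")
    hosts = [f"^{host}" for host in allowed_hosts] + list(blocked_hosts)
    return ["gremlin", "attack", "blackhole",
            "-l", str(length_seconds),
            "-h", ",".join(hosts)]
```

Joining the returned list with spaces gives a string suitable for the Exec command field.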

Automating a Disaster Recovery Failover Test in Jenkins

We also want to automate disaster recovery failover testing within the CI/CD pipeline. This time we’ll use a Gremlin Shutdown Attack, which shuts down the targeted instance.

Navigate to the Configure section of the bookstore-deploy Project.
Under Build click Add build step and select Send files or execute commands over SSH.
Under SSH Server > Name select bookstore-api-blue.
In the Exec command field input the following: gremlin attack shutdown -d 1.

Tip
This Gremlin Shutdown attack performs a disaster recovery failover test by shutting down the bookstore-api-blue instance. We can optionally add the -r flag to have the server restart itself after shutdown. The shutdown triggers the bookstore-api-blue-StatusCheckFailed Amazon CloudWatch alarm that was created in Resiliency Stage 2, which in turn triggers Amazon Route53 DNS failover to reroute the bookstore.pingpublications.com endpoint to the bookstore-api-green environment.
Click Save.
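
Confirming the failover can itself be automated. The sketch below polls DNS until the public endpoint resolves to the expected address. It rests on assumptions that are not in the original setup: that the endpoint resolves directly to the green instance's address after failover, and that local DNS caching does not mask the change. The function name and its parameters are illustrative.

```python
import socket
import time
from typing import Callable


def wait_for_failover(hostname: str,
                      expected_ip: str,
                      timeout_seconds: float = 300.0,
                      interval_seconds: float = 5.0,
                      resolve: Callable[[str], str] = socket.gethostbyname,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll DNS until hostname resolves to expected_ip or the timeout elapses.

    resolve and sleep are injectable so the loop can be exercised without
    touching the network or waiting in real time.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        try:
            if resolve(hostname) == expected_ip:
                return True
        except OSError:
            pass  # A failed lookup is treated as "not failed over yet".
        if time.monotonic() >= deadline:
            return False
        sleep(interval_seconds)
```

A call such as `wait_for_failover("bookstore.pingpublications.com", "52.11.79.9")` would return True once the endpoint points at the green address, or False after five minutes.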

Performing a Jenkins Build

Everything is now configured for our simple Bookstore application to be automatically deployed via Jenkins, during which reliability testing and disaster recovery failover testing are performed. It may be ideal to further automate the build process by configuring a Jenkins Build Trigger, but we can also manually perform a build to confirm it works properly.

Navigate to the bookstore-deploy Project.
Click Build Now.

Click Console Output to view the output generated by Jenkins. It will look something like the following.

Started by user Gabe
Building in workspace /var/lib/jenkins/workspace/bookstore-deploy
> git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
> git config remote.origin.url https://github.com/GabeStah/bookstore_api # timeout=10
Fetching upstream changes from https://github.com/GabeStah/bookstore_api
> git --version # timeout=10
using GIT_ASKPASS to set credentials gabestah@github.com
> git fetch --tags --progress https://github.com/GabeStah/bookstore_api +refs/heads/*:refs/remotes/origin/*
> git rev-parse refs/remotes/origin/master^{commit} # timeout=10
> git rev-parse refs/remotes/origin/origin/master^{commit} # timeout=10
Checking out Revision d7172265ec23ab20d9eaabc6f3dd37a6c741f2cc (refs/remotes/origin/master)
> git config core.sparsecheckout # timeout=10
> git checkout -f d7172265ec23ab20d9eaabc6f3dd37a6c741f2cc
Commit message: "bumped"
> git rev-list --no-walk d7172265ec23ab20d9eaabc6f3dd37a6c741f2cc # timeout=10
SSH: Connecting from host [ip-172-31-43-207.us-west-2.compute.internal]
SSH: Connecting with configuration [bookstore-api-blue] ...
SSH: EXEC: STDOUT/STDERR from command [sudo systemctl restart gunicorn] ...
SSH: EXEC: completed after 401 ms
SSH: Disconnecting configuration [bookstore-api-blue] ...
SSH: Transferred 20 file(s)
SSH: Connecting from host [ip-172-31-43-207.us-west-2.compute.internal]
SSH: Connecting with configuration [bookstore-api-green] ...
SSH: EXEC: STDOUT/STDERR from command [sudo systemctl restart gunicorn] ...
SSH: EXEC: completed after 200 ms
SSH: Disconnecting configuration [bookstore-api-green] ...
SSH: Transferred 20 file(s)
Build step 'Send files or execute commands over SSH' changed build result to SUCCESS
SSH: Connecting from host [ip-172-31-43-207.us-west-2.compute.internal]
SSH: Connecting with configuration [bookstore-api-blue] ...
SSH: EXEC: STDOUT/STDERR from command [gremlin attack blackhole -l 240 -h ^api.gremlin.com,cdn.bookstore.pingpublications.com,db.bookstore.pingpublications.com] ...
Setting up blackhole gremlin with guid '0367cfe2-f9fc-11e8-acea-0242db4d1180' for 240 seconds
Setup successfully completed
Running blackhole gremlin with guid '0367cfe2-f9fc-11e8-acea-0242db4d1180' for 240 seconds
Whitelisting all egress traffic to 54.186.219.32
Whitelisting all egress traffic to 54.68.250.40
Dropping all egress traffic to 52.84.25.207
Dropping all egress traffic to 52.84.25.199
Dropping all egress traffic to 52.84.25.64
Dropping all egress traffic to 52.84.25.16
Dropping all egress traffic to 172.31.22.69
Whitelisting all ingress traffic from 54.186.219.32
Whitelisting all ingress traffic from 54.68.250.40
Dropping all ingress traffic from 52.84.25.207
Dropping all ingress traffic from 52.84.25.199
Dropping all ingress traffic from 52.84.25.64
Dropping all ingress traffic from 52.84.25.16
Dropping all ingress traffic from 172.31.22.69
Dropping all egress traffic to 13.33.147.150
Dropping all egress traffic to 13.33.147.73
Dropping all egress traffic to 13.33.147.19
Dropping all egress traffic to 13.33.147.18
Dropping all ingress traffic from 13.33.147.150
Dropping all ingress traffic from 13.33.147.73
Dropping all ingress traffic from 13.33.147.19
Dropping all ingress traffic from 13.33.147.18
Dropping all egress traffic to 13.32.253.5
Dropping all egress traffic to 13.32.253.244
Dropping all egress traffic to 13.32.253.105
Dropping all egress traffic to 13.32.253.26
Dropping all ingress traffic from 13.32.253.5
Dropping all ingress traffic from 13.32.253.244
Dropping all ingress traffic from 13.32.253.105
Dropping all ingress traffic from 13.32.253.26
Reverting impact!
SSH: EXEC: completed after 240,397 ms
SSH: Disconnecting configuration [bookstore-api-blue] ...
SSH: Transferred 0 file(s)
SSH: Connecting from host [ip-172-31-43-207.us-west-2.compute.internal]
SSH: Connecting with configuration [bookstore-api-green] ...
SSH: EXEC: STDOUT/STDERR from command [gremlin attack blackhole -l 240 -h ^api.gremlin.com,cdn.bookstore.pingpublications.com,db.bookstore.pingpublications.com] ...
Setting up blackhole gremlin with guid '932c78b3-f9fc-11e8-9244-024280df5a87' for 240 seconds
Setup successfully completed
Running blackhole gremlin with guid '932c78b3-f9fc-11e8-9244-024280df5a87' for 240 seconds
Whitelisting all egress traffic to 54.186.219.32
Whitelisting all egress traffic to 54.68.250.40
Dropping all egress traffic to 13.33.147.150
Dropping all egress traffic to 13.33.147.73
Dropping all egress traffic to 13.33.147.19
Dropping all egress traffic to 13.33.147.18
Dropping all egress traffic to 172.31.22.69
Whitelisting all ingress traffic from 54.186.219.32
Whitelisting all ingress traffic from 54.68.250.40
Dropping all ingress traffic from 13.33.147.150
Dropping all ingress traffic from 13.33.147.73
Dropping all ingress traffic from 13.33.147.19
Dropping all ingress traffic from 13.33.147.18
Dropping all ingress traffic from 172.31.22.69
Dropping all egress traffic to 52.84.25.199
Dropping all egress traffic to 52.84.25.207
Dropping all egress traffic to 52.84.25.16
Dropping all egress traffic to 52.84.25.64
Dropping all ingress traffic from 52.84.25.64
Dropping all ingress traffic from 52.84.25.16
Dropping all ingress traffic from 52.84.25.207
Dropping all ingress traffic from 52.84.25.199
Reverting impact!
SSH: EXEC: completed after 240,395 ms
SSH: Disconnecting configuration [bookstore-api-green] ...
SSH: Transferred 0 file(s)
SSH: Connecting from host [ip-172-31-43-207.us-west-2.compute.internal]
SSH: Connecting with configuration [bookstore-api-blue] ...
SSH: EXEC: STDOUT/STDERR from command [gremlin attack shutdown -d 1] ...
Setting up shutdown gremlin with guid '22cdba53-f9fd-11e8-8292-024230fbb3f0' after 1 minute
Setup successfully completed
Running shutdown gremlin with guid '22cdba53-f9fd-11e8-8292-024230fbb3f0' after 1 minute
SSH: Disconnecting configuration [bookstore-api-blue] ...
SSH: EXEC: completed after 60,450 ms
Finished: SUCCESS
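
A log like the one above can also be checked mechanically rather than by eye. The following sketch pulls the duration of each remote command out of the console text and checks the final status line. The helper names are hypothetical, and the patterns are based only on the sample output shown here, so other plugin versions may format these lines differently.

```python
import re
from typing import List

# Matches lines such as "SSH: EXEC: completed after 240,397 ms".
_EXEC_DONE = re.compile(r"SSH: EXEC: completed after ([\d,]+) ms")


def exec_durations_ms(console_output: str) -> List[int]:
    """Return the duration in milliseconds of each completed SSH command, in order."""
    return [int(match.group(1).replace(",", ""))
            for match in _EXEC_DONE.finditer(console_output)]


def build_succeeded(console_output: str) -> bool:
    """Return True when the last non-empty line is 'Finished: SUCCESS'."""
    lines = [line.strip() for line in console_output.splitlines() if line.strip()]
    return bool(lines) and lines[-1] == "Finished: SUCCESS"
```

For the build above, the extracted durations make it easy to confirm that each blackhole attack ran for roughly its configured 240 seconds.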

The initial build steps grab the latest version via Git, deploy the new app version to both the blue/green environments via SSH, then perform the resiliency and disaster recovery tests via Gremlin. As expected, all relevant bookstore-api-{blue/green}-{db/cdn}-connectivity-failed Amazon CloudWatch Alarms are triggered, forcing Amazon Route53 and Amazon RDS to engage the automated failover policies that we established in previous Resiliency Stages.

Warning
The above Jenkins deployment configuration for the Bookstore example app is merely a functional proof of concept. For an actual production configuration, we’d want to add many more safeguards, such as automating the alternation of deployment between blue/green environments, to ensure one environment is always ready in the event of a rollback.
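
One of the safeguards mentioned above, alternating deployments between environments, can be sketched in a few lines. This is only an illustration: the function name is hypothetical, and how the pipeline learns which environment is currently live (for example, by checking where DNS points) is left as an assumption.

```python
ENVIRONMENTS = ("bookstore-api-blue", "bookstore-api-green")


def next_deploy_target(live: str) -> str:
    """Return the environment that is not currently serving traffic.

    Deploying only to the idle environment leaves the live one untouched,
    so it stays available as a rollback target.
    """
    if live not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {live!r}")
    return ENVIRONMENTS[1] if live == ENVIRONMENTS[0] else ENVIRONMENTS[0]
```

With this in place, a deployment step would publish to `next_deploy_target(live)` and switch traffic only after the reliability tests pass there.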

Reliability Stage 4 Completion

With Reliability Stage 4 finished, you and your team have reached the end of the current journey, but site reliability engineering is an ongoing process. Your system should now be fully monitored, provide high observability, and automatically perform reliability and disaster recovery testing in the production environment at regular intervals.

Working through all the Stages of Reliability takes time and will invariably be more difficult for some teams, so do not be discouraged. For organizations with multiple teams, the integration of this staged process allows for teams, as well as individuals within said teams, to be empowered to make improvements and be responsible for the services under their purview. As teams progress further through the stages, overall support costs will drop dramatically, while system stability and reliability will increase.
