We’ve finally made it to the last Reliability Stage, where your team will be working toward full automation of both reliability testing and disaster recovery failover procedures. Chaos Engineering Through Staged Reliability - Stage 3 emphasized the importance of looking toward the production environment for automation, and Reliability Stage 4 is where every remaining test and process that can reasonably be automated should be automated within production. That’s not to say humans cannot be involved, but rather that no reliability testing or disaster recovery process may rely solely on human intervention to succeed.
Prerequisites
- Creation and agreement on Disaster Recovery and Dependency Failover Playbooks.
- Completion of the most vital and applicable parts of Reliability Stage 0.
- Completion of the most vital and applicable parts of Reliability Stage 1.
- Completion of the most vital and applicable parts of Reliability Stage 2.
- Completion of the most vital and applicable parts of Reliability Stage 3.
Integrate Automatic Reliability Testing in CI/CD
For systems relying on continuous integration and continuous deployment, this final stage is when you should integrate automatic reliability testing within the CI/CD process.
A failed reliability test should also result in a build and/or deployment failure. Just as with application-level unit testing, every reliability test should pass before a given build is released, and this requirement helps maintain system stability.
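As a rough illustration of gating a build on reliability tests, the Python sketch below runs a set of checks and produces an exit code a CI job could return. The check names and the lambdas standing in for real tests are hypothetical placeholders, not part of the Bookstore project.

```python
from typing import Callable, Dict


def run_reliability_gate(checks: Dict[str, Callable[[], bool]]) -> int:
    """Run every check and return 0 only if all of them pass, else 1."""
    failures = []
    for name, check in checks.items():
        try:
            passed = bool(check())
        except Exception as exc:
            # A check that crashes is treated the same as a failed check.
            print(f"ERROR in {name}: {exc}")
            passed = False
        if not passed:
            failures.append(name)
    for name in failures:
        print(f"FAILED: {name}")
    return 1 if failures else 0


# Hypothetical checks; real ones would inspect alarms, health endpoints, etc.
exit_code = run_reliability_gate(
    {
        "db-connectivity-failover": lambda: True,
        "cdn-connectivity-failover": lambda: False,
    }
)
print("exit code:", exit_code)
```

A build step that ends with `sys.exit(exit_code)` would fail the build whenever any check fails, since Jenkins treats a non-zero exit status from a shell step as a failure.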
Automate Reliability and Disaster Recovery Failover Testing in Production
The final and most critical requirement for Reliability Stage 4 is the near-total automation of both reliability and disaster recovery failover testing in the production environment. An SRE can be involved, but the process should be automated to the point that running the tests requires only a minimal amount of support time to get started.
Reliability Stage 4: Implementation Example
To keep things simple, the Bookstore application uses Jenkins to handle CI/CD. We’ve provisioned a new Amazon EC2 instance onto which Jenkins is installed. An Amazon Route53 DNS record points the jenkins.bookstore.pingpublications.com endpoint to the EC2 instance, so the Jenkins front end is accessed through http://jenkins.bookstore.pingpublications.com:8080.
The steps for installing and running Jenkins on an Amazon EC2 instance are beyond the scope of this article, but a traditional deployment is easy to accomplish.
This is now what the final Bookstore architecture looks like.
Jenkins Configuration
With Jenkins installed, we need to create a new Project for the Bookstore application. We’re using the Publish Over SSH plugin to simplify deployment to both the blue and green Amazon EC2 instances. Once the plugin is installed, it must be configured by adding each SSH server we’ll be interacting with.
- Navigate to Manage Jenkins > Configure System and scroll down to the Publish over SSH section.
- Enter the appropriate private key allowing connection to both Bookstore API Amazon EC2 instances.
- Under SSH Servers click Add.
- Enter the appropriate details for the bookstore-api-blue server.
  - Name: bookstore-api-blue
  - Hostname: 54.213.54.171
  - Username: ubuntu
  - Remote Directory: /home/ubuntu/apps
- Add another entry for the bookstore-api-green server.
  - Name: bookstore-api-green
  - Hostname: 52.11.79.9
  - Username: ubuntu
  - Remote Directory: /home/ubuntu/apps
- Click Save.
With the SSH connections configured it’s time to create the Project to handle actual deployment.
- Click New Item.
- Input bookstore-deploy in the Item name field.
- Select Freestyle Project type and click OK.
- Under Source Code Management select Git and enter the GitHub endpoint (e.g. git@github.com:GabeStah/bookstore_api.git).
- Add the appropriate GitHub credentials.
Deploying the Bookstore application code is just a matter of creating an SSH Publisher for each environment.
- Under Build click Add build step and select Send files or execute commands over SSH.
- Under SSH Server > Name select bookstore-api-blue.
- In the Source files field input **/*. This field uses the Apache Ant pattern format, so here we’re merely ensuring all files in the project directory are published.
- For Remote directory input bookstore_api.
- Since the application uses Gunicorn for the web server we need to add a command to restart it after deployment. Input sudo systemctl restart gunicorn in the Exec command field.
- Click Add Server and repeat steps 2 through 5 for the bookstore-api-green server.
- Click Save to finalize the settings. The bookstore-deploy Project will now deploy new builds of the Bookstore to each production environment!
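To get a feel for what the **/* pattern in the Source files field selects, here is a small Python sketch that lists every regular file under a workspace at any depth. It only approximates the Ant pattern with pathlib's similar `**` semantics and does not model Ant's default excludes. The file names in the demo are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List


def files_to_publish(workspace: Path) -> List[str]:
    """Return workspace-relative paths of every regular file, at any depth."""
    return sorted(
        path.relative_to(workspace).as_posix()
        for path in workspace.glob("**/*")
        if path.is_file()  # directories match the pattern too, so skip them
    )


# Demo against a throwaway workspace with hypothetical project files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "api").mkdir()
    (root / "manage.py").write_text("")
    (root / "api" / "views.py").write_text("")
    print(files_to_publish(root))  # ['api/views.py', 'manage.py']
```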
Automating Reliability Testing in Jenkins
To meet the requirements of Reliability Stage 4 we need to integrate automated reliability testing into the CI/CD pipeline. For the Bookstore application, we’ll use a Gremlin Blackhole Attack. This attack blocks all network communication between the targeted instance and specified endpoints. Just as we saw in Reliability Stage 2, we’ll be using this attack to perform a reliability test that interrupts communication and triggers an Amazon CloudWatch alarm. Check out the Gremlin documentation for more details on creating attacks with Gremlin’s Chaos Engineering tools.
Within Jenkins, we can initiate the reliability test by executing over SSH.
- Navigate to the Configure section of the bookstore-deploy Project.
- Under Build click Add build step and select Send files or execute commands over SSH.
- Under SSH Server > Name select bookstore-api-blue.
- In the Exec command field input the following:
  gremlin attack blackhole -l 240 -h ^api.gremlin.com,cdn.bookstore.pingpublications.com,db.bookstore.pingpublications.com
  Tip: This Gremlin Blackhole attack tests reliability against the simulated failure of our critical dependencies (database and CDN). Consequently, this will trigger the respective bookstore-api-{blue/green}-{db/cdn}-connectivity-failed Amazon CloudWatch Alarms that we configured during Reliability Stage 2. This causes a DNS failover, which can be confirmed just as it was in Reliability Stage 3.
- Click Add Server and repeat steps 3 and 4 for the bookstore-api-green server.
- Click Save.
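If the same attack needs to be generated for several environments, the command string can be assembled programmatically. The sketch below is one possible helper, using only the flags that appear in the command above (-l for length in seconds, -h for hostnames). It assumes that a ^ prefix excludes a host from the attack, which matches the "Whitelisting" lines Gremlin prints for api.gremlin.com. The function name and parameters are hypothetical.

```python
from typing import List, Sequence


def blackhole_command(
    length_seconds: int,
    blocked_hosts: Sequence[str],
    excluded_hosts: Sequence[str] = (),
) -> List[str]:
    """Build the argument list for a Gremlin blackhole attack."""
    if length_seconds <= 0:
        raise ValueError("length_seconds must be positive")
    if not blocked_hosts:
        raise ValueError("at least one blocked host is required")
    # Excluded hosts carry a '^' prefix and are listed first, as in the article.
    hosts = [f"^{host}" for host in excluded_hosts] + list(blocked_hosts)
    return [
        "gremlin", "attack", "blackhole",
        "-l", str(length_seconds),
        "-h", ",".join(hosts),
    ]


command = blackhole_command(
    240,
    ["cdn.bookstore.pingpublications.com", "db.bookstore.pingpublications.com"],
    excluded_hosts=["api.gremlin.com"],
)
print(" ".join(command))
```

The printed string is the same command entered in the Exec command field, so the helper could feed a templated job definition instead of a hand-typed field.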
Automating a Disaster Recovery Failover Test in Jenkins
We also want to automate disaster recovery failover testing within the CI/CD pipeline. This time we’ll use a Gremlin Shutdown Attack, which shuts down the targeted instance.
- Navigate to the Configure section of the bookstore-deploy Project.
- Under Build click Add build step and select Send files or execute commands over SSH.
- Under SSH Server > Name select bookstore-api-blue.
- In the Exec command field input the following:
  gremlin attack shutdown -d 1
  Tip: This Gremlin Shutdown attack performs a disaster recovery failover test by shutting down the bookstore-api-blue instance. We can optionally add the -r flag to have the server restart itself after shutdown. The shutdown triggers the bookstore-api-blue-StatusCheckFailed Amazon CloudWatch alarm that was created in Reliability Stage 2, which in turn triggers Amazon Route53 DNS failover to reroute the bookstore.pingpublications.com endpoint to the bookstore-api-green environment.
- Click Save.
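One way to confirm the failover without a human watching the console is to poll DNS until the endpoint resolves to the green instance. The sketch below assumes the bookstore.pingpublications.com record resolves directly to the instance address (52.11.79.9 for green, per the SSH server settings earlier). Local DNS caching and record TTLs can delay what the poller sees. The function name, timeout, and interval are hypothetical.

```python
import socket
import time
from typing import Callable


def wait_for_failover(
    hostname: str,
    expected_ip: str,
    timeout_seconds: float = 300.0,
    interval_seconds: float = 10.0,
    resolve: Callable[[str], str] = socket.gethostbyname,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll DNS until hostname resolves to expected_ip or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        try:
            if resolve(hostname) == expected_ip:
                return True
        except OSError:
            # Resolution errors are treated as "not failed over yet".
            pass
        if time.monotonic() >= deadline:
            return False
        sleep(interval_seconds)


# Example call (performs real DNS lookups, so it is left commented out):
# ok = wait_for_failover("bookstore.pingpublications.com", "52.11.79.9")
```

The resolver and sleep function are injectable so the polling logic can be exercised in tests without touching the network.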
Performing a Jenkins Build
Everything is now configured for our simple Bookstore application to be automatically deployed via Jenkins, during which reliability testing and disaster recovery failover testing are performed. It may be ideal to further automate the build process with Jenkins by configuring a Build Trigger, but we can also manually perform a build to confirm it works properly.
- Navigate to the bookstore-deploy Project.
- Click Build Now.
- Click Console Output to view the output generated by Jenkins. It will look something like the following.
    Started by user Gabe
    Building in workspace /var/lib/jenkins/workspace/bookstore-deploy
    > git rev-parse --is-inside-work-tree # timeout=10
    Fetching changes from the remote Git repository
    > git config remote.origin.url https://github.com/GabeStah/bookstore_api # timeout=10
    Fetching upstream changes from https://github.com/GabeStah/bookstore_api
    > git --version # timeout=10
    using GIT_ASKPASS to set credentials gabestah@github.com
    > git fetch --tags --progress https://github.com/GabeStah/bookstore_api +refs/heads/*:refs/remotes/origin/*
    > git rev-parse refs/remotes/origin/master^{commit} # timeout=10
    > git rev-parse refs/remotes/origin/origin/master^{commit} # timeout=10
    Checking out Revision d7172265ec23ab20d9eaabc6f3dd37a6c741f2cc (refs/remotes/origin/master)
    > git config core.sparsecheckout # timeout=10
    > git checkout -f d7172265ec23ab20d9eaabc6f3dd37a6c741f2cc
    Commit message: "bumped"
    > git rev-list --no-walk d7172265ec23ab20d9eaabc6f3dd37a6c741f2cc # timeout=10
    SSH: Connecting from host [ip-172-31-43-207.us-west-2.compute.internal]
    SSH: Connecting with configuration [bookstore-api-blue] ...
    SSH: EXEC: STDOUT/STDERR from command [sudo systemctl restart gunicorn] ...
    SSH: EXEC: completed after 401 ms
    SSH: Disconnecting configuration [bookstore-api-blue] ...
    SSH: Transferred 20 file(s)
    SSH: Connecting from host [ip-172-31-43-207.us-west-2.compute.internal]
    SSH: Connecting with configuration [bookstore-api-green] ...
    SSH: EXEC: STDOUT/STDERR from command [sudo systemctl restart gunicorn] ...
    SSH: EXEC: completed after 200 ms
    SSH: Disconnecting configuration [bookstore-api-green] ...
    SSH: Transferred 20 file(s)
    Build step 'Send files or execute commands over SSH' changed build result to SUCCESS
    SSH: Connecting from host [ip-172-31-43-207.us-west-2.compute.internal]
    SSH: Connecting with configuration [bookstore-api-blue] ...
    SSH: EXEC: STDOUT/STDERR from command [gremlin attack blackhole -l 240 -h ^api.gremlin.com,cdn.bookstore.pingpublications.com,db.bookstore.pingpublications.com] ...
    Setting up blackhole gremlin with guid '0367cfe2-f9fc-11e8-acea-0242db4d1180' for 240 seconds
    Setup successfully completed
    Running blackhole gremlin with guid '0367cfe2-f9fc-11e8-acea-0242db4d1180' for 240 seconds
    Whitelisting all egress traffic to 54.186.219.32
    Whitelisting all egress traffic to 54.68.250.40
    Dropping all egress traffic to 52.84.25.207
    Dropping all egress traffic to 52.84.25.199
    Dropping all egress traffic to 52.84.25.64
    Dropping all egress traffic to 52.84.25.16
    Dropping all egress traffic to 172.31.22.69
    Whitelisting all ingress traffic from 54.186.219.32
    Whitelisting all ingress traffic from 54.68.250.40
    Dropping all ingress traffic from 52.84.25.207
    Dropping all ingress traffic from 52.84.25.199
    Dropping all ingress traffic from 52.84.25.64
    Dropping all ingress traffic from 52.84.25.16
    Dropping all ingress traffic from 172.31.22.69
    Dropping all egress traffic to 13.33.147.150
    Dropping all egress traffic to 13.33.147.73
    Dropping all egress traffic to 13.33.147.19
    Dropping all egress traffic to 13.33.147.18
    Dropping all ingress traffic from 13.33.147.150
    Dropping all ingress traffic from 13.33.147.73
    Dropping all ingress traffic from 13.33.147.19
    Dropping all ingress traffic from 13.33.147.18
    Dropping all egress traffic to 13.32.253.5
    Dropping all egress traffic to 13.32.253.244
    Dropping all egress traffic to 13.32.253.105
    Dropping all egress traffic to 13.32.253.26
    Dropping all ingress traffic from 13.32.253.5
    Dropping all ingress traffic from 13.32.253.244
    Dropping all ingress traffic from 13.32.253.105
    Dropping all ingress traffic from 13.32.253.26
    Reverting impact!
    SSH: EXEC: completed after 240,397 ms
    SSH: Disconnecting configuration [bookstore-api-blue] ...
    SSH: Transferred 0 file(s)
    SSH: Connecting from host [ip-172-31-43-207.us-west-2.compute.internal]
    SSH: Connecting with configuration [bookstore-api-green] ...
    SSH: EXEC: STDOUT/STDERR from command [gremlin attack blackhole -l 240 -h ^api.gremlin.com,cdn.bookstore.pingpublications.com,db.bookstore.pingpublications.com] ...
    Setting up blackhole gremlin with guid '932c78b3-f9fc-11e8-9244-024280df5a87' for 240 seconds
    Setup successfully completed
    Running blackhole gremlin with guid '932c78b3-f9fc-11e8-9244-024280df5a87' for 240 seconds
    Whitelisting all egress traffic to 54.186.219.32
    Whitelisting all egress traffic to 54.68.250.40
    Dropping all egress traffic to 13.33.147.150
    Dropping all egress traffic to 13.33.147.73
    Dropping all egress traffic to 13.33.147.19
    Dropping all egress traffic to 13.33.147.18
    Dropping all egress traffic to 172.31.22.69
    Whitelisting all ingress traffic from 54.186.219.32
    Whitelisting all ingress traffic from 54.68.250.40
    Dropping all ingress traffic from 13.33.147.150
    Dropping all ingress traffic from 13.33.147.73
    Dropping all ingress traffic from 13.33.147.19
    Dropping all ingress traffic from 13.33.147.18
    Dropping all ingress traffic from 172.31.22.69
    Dropping all egress traffic to 52.84.25.199
    Dropping all egress traffic to 52.84.25.207
    Dropping all egress traffic to 52.84.25.16
    Dropping all egress traffic to 52.84.25.64
    Dropping all ingress traffic from 52.84.25.64
    Dropping all ingress traffic from 52.84.25.16
    Dropping all ingress traffic from 52.84.25.207
    Dropping all ingress traffic from 52.84.25.199
    Reverting impact!
    SSH: EXEC: completed after 240,395 ms
    SSH: Disconnecting configuration [bookstore-api-green] ...
    SSH: Transferred 0 file(s)
    SSH: Connecting from host [ip-172-31-43-207.us-west-2.compute.internal]
    SSH: Connecting with configuration [bookstore-api-blue] ...
    SSH: EXEC: STDOUT/STDERR from command [gremlin attack shutdown -d 1] ...
    Setting up shutdown gremlin with guid '22cdba53-f9fd-11e8-8292-024230fbb3f0' after 1 minute
    Setup successfully completed
    Running shutdown gremlin with guid '22cdba53-f9fd-11e8-8292-024230fbb3f0' after 1 minute
    SSH: Disconnecting configuration [bookstore-api-blue] ...
    SSH: EXEC: completed after 60,450 ms
    Finished: SUCCESS
The initial build steps grab the latest version via Git, deploy the new app version to both the blue and green environments via SSH, then perform the reliability and disaster recovery tests via Gremlin. As expected, all relevant bookstore-api-{blue/green}-{db/cdn}-connectivity-failed Amazon CloudWatch Alarms are triggered, forcing Amazon Route53 and Amazon RDS to engage the automated failover policies that we established in previous Reliability Stages.
Warning: The above Jenkins deployment configuration for the Bookstore example app is merely a functional proof of concept. For an actual production configuration, we’d want to add many more safeguards, such as automating the alternation of deployment between blue/green environments, to ensure one environment is always ready in the event of a rollback.
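Reading a long console log by eye does not scale, so a post-build step could summarize it instead. The following Python sketch extracts the attacks that ran, the addresses whose egress traffic was dropped, and the final build status. The regular expressions are based only on the log lines shown above and are assumptions about their format, not a documented Gremlin or Jenkins output contract.

```python
import re
from typing import Dict

# Patterns inferred from the sample console output; treat them as assumptions.
ATTACK_RE = re.compile(r"Running (\w+) gremlin with guid '([0-9a-f-]+)'")
DROP_RE = re.compile(r"Dropping all egress traffic to (\d+\.\d+\.\d+\.\d+)")
STATUS_RE = re.compile(r"Finished: (\w+)")


def summarize_console(text: str) -> Dict[str, object]:
    """Summarize Gremlin attacks and the build result from Jenkins console text."""
    attacks = [
        {"type": attack_type, "guid": guid}
        for attack_type, guid in ATTACK_RE.findall(text)
    ]
    status = STATUS_RE.search(text)
    return {
        "attacks": attacks,
        "dropped_egress_ips": sorted(set(DROP_RE.findall(text))),
        "status": status.group(1) if status else None,
    }


# A short excerpt of the output above, used here as sample input.
sample = (
    "Running blackhole gremlin with guid "
    "'0367cfe2-f9fc-11e8-acea-0242db4d1180' for 240 seconds\n"
    "Dropping all egress traffic to 52.84.25.207\n"
    "Dropping all egress traffic to 172.31.22.69\n"
    "Reverting impact!\n"
    "Finished: SUCCESS\n"
)
print(summarize_console(sample))
```

Because the patterns use search rather than line anchors, the function also works when the console text arrives as a single unbroken line.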
Reliability Stage 4 Completion
With Reliability Stage 4 finished you and your team have reached the end of the current journey, but site reliability engineering is an ongoing process. Your system should now be fully monitored, provide high observability, and automatically perform reliability and disaster recovery testing in the production environment at regular intervals.
Working through all the Stages of Reliability takes time and will invariably be more difficult for some teams, so do not be discouraged. For organizations with multiple teams, the integration of this staged process allows for teams, as well as individuals within said teams, to be empowered to make improvements and be responsible for the services under their purview. As teams progress further through the stages, overall support costs will drop dramatically, while system stability and reliability will increase.