Chaos Engineering

Chaos Engineering is a disciplined approach to identifying potential failures before they become outages.

Preparing for Disaster

Stage 0 is all about implementing good site reliability engineering practices and laying the groundwork for Chaos Engineering. The steps outlined in this post aren't necessarily prerequisites; they will evolve naturally alongside your Chaos Engineering practice.

What to do in Stage 0

  • Establish observability
  • Define the critical dependencies
  • Define the non-critical dependencies
  • Create a disaster recovery failover playbook
  • Create a critical dependency failover playbook
  • Create a non-critical dependency failover playbook
  • Publish the above and get team-wide agreement
  • Manually execute a failover exercise
  • Implementation example
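
As a rough illustration of the dependency and playbook items above, the sketch below keeps a small inventory of dependencies, marks each one as critical or non-critical, and reports which ones still lack a failover playbook. The `Dependency` class, the service names, and the runbook path are all hypothetical and exist only for this example; a real inventory would more likely live in a service catalog or configuration file.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Dependency:
    name: str
    critical: bool                   # True if the service cannot function without it
    playbook: Optional[str] = None   # location of the failover playbook, if one exists


def missing_playbooks(deps: list[Dependency]) -> list[str]:
    """Return names of dependencies without a playbook, critical ones first."""
    missing = [d for d in deps if not d.playbook]
    ordered = sorted(missing, key=lambda d: (not d.critical, d.name))
    return [d.name for d in ordered]


# Hypothetical inventory, for illustration only.
INVENTORY = [
    Dependency("orders-db", critical=True, playbook="runbooks/orders-db-failover.md"),
    Dependency("payment-gateway", critical=True),
    Dependency("recommendations", critical=False),
]

# Dependencies that still need a playbook before the team can sign off.
TODO = missing_playbooks(INVENTORY)
```

Listing critical dependencies first is a design choice for this sketch: it puts the gaps that are most likely to cause an outage at the top of the list.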

Injecting Chaos Internally

Stage 1 describes the early stages of implementing Chaos Engineering, where you begin to inject failure into non-production systems and establish good practices for documenting what you learn.

What to do in Stage 1

  • Perform critical dependency failure tests in non-production
  • Publish test results
  • Implementation example
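
A critical dependency failure test can be sketched in a few lines, assuming the service has a fallback to verify. In the example below, a stand-in pricing dependency is replaced with a version that always fails, and the test checks that the caller falls back to a cached value. Every name here (`get_price`, `fetch_price_live`, `DependencyDown`, the SKU) is hypothetical, and the fault is injected in-process rather than with a real chaos tool.

```python
class DependencyDown(Exception):
    """Raised by the injected fault to simulate an unavailable dependency."""


def fetch_price_live(sku):
    # Stand-in for a real network call to a critical pricing dependency.
    return 19.99


def failing_dependency(sku):
    # Injected fault: the dependency is unreachable.
    raise DependencyDown(f"pricing service unavailable for {sku}")


def get_price(sku, fetch=fetch_price_live, cache=None):
    """Return (price, source), falling back to the cached price on failure."""
    cache = cache if cache is not None else {}
    try:
        price = fetch(sku)
        cache[sku] = price
        return price, "live"
    except DependencyDown:
        if sku in cache:
            return cache[sku], "cached"
        raise  # no fallback available, so surface the failure


def run_failure_test():
    """Run one experiment and return a record that can be published."""
    cache = {}
    baseline = get_price("sku-1", cache=cache)  # steady state, warms the cache
    degraded = get_price("sku-1", fetch=failing_dependency, cache=cache)
    return {
        "experiment": "pricing dependency unavailable",
        "environment": "non-production",
        "baseline": baseline,
        "under_failure": degraded,
        "passed": degraded == (19.99, "cached"),
    }
```

Returning a plain record rather than just asserting makes it easier to publish the outcome of each test, whether it passed or not.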

Pushing the Envelope Forward

Stage 2 helps you take your first steps into automation and testing in production.

What to do in Stage 2

  • Perform frequent, semi-automated tests
  • Execute a reliability experiment in production
  • Publish test results
  • Implementation example
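
One way to picture a semi-automated experiment is a loop that widens the blast radius step by step, asks a human to approve each step, and aborts as soon as an error-rate threshold is crossed. The sketch below is an assumption about how such a loop might look, not a description of any particular tool; `run_ramped_experiment`, its parameters, and the simulated error-rate function are all hypothetical.

```python
def run_ramped_experiment(steps, measure_error_rate, abort_threshold,
                          approve=lambda pct: True):
    """Ramp an experiment through `steps` (percent of traffic affected).

    `approve` is the manual gate that makes this semi-automated.
    `measure_error_rate` returns the observed error rate at a given step.
    The run halts at the first step that is declined or exceeds the threshold.
    """
    log = []
    for pct in steps:
        if not approve(pct):
            log.append((pct, None, "declined"))
            break
        rate = measure_error_rate(pct)
        if rate > abort_threshold:
            log.append((pct, rate, "aborted"))
            break
        log.append((pct, rate, "ok"))
    return log


def simulated_error_rate(pct):
    # Hypothetical stand-in for querying a monitoring system.
    return pct / 1000


LOG = run_ramped_experiment(
    steps=[5, 25, 50, 100],
    measure_error_rate=simulated_error_rate,
    abort_threshold=0.03,
)
```

With the simulated measurements above, the run completes the 5% and 25% steps, aborts at 50%, and never reaches 100%.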

Automating Chaos Internally

Stage 3 is where you implement fully automated testing in your non-production systems and begin figuring out how to automate disaster recovery failover.

What to do in Stage 3

  • Automate resiliency testing in non-production
  • Semi-automate disaster recovery failover
  • Implementation example

Injecting Automated Chaos Everywhere

Stage 4 is a fully mature implementation of Chaos Engineering, where you begin adding ideas of your own to expand your testing plan.

What to do in Stage 4

  • Integrate reliability testing in CI/CD
  • Automate reliability and disaster recovery failover testing in production
  • Implementation example
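
Integrating reliability testing into CI/CD usually comes down to turning experiment results into a pass or fail signal for the pipeline. The following is a minimal sketch under that assumption: `gate` is a hypothetical function that takes a list of result records (each with an `experiment` name and a `passed` flag) and returns a process exit code.

```python
import sys


def gate(results):
    """Return 0 if every experiment passed, 1 otherwise.

    An empty result list is treated as a failure so that a misconfigured
    pipeline that ran no experiments does not pass silently.
    """
    if not results:
        print("FAILED: no reliability experiments were run", file=sys.stderr)
        return 1
    failures = [r["experiment"] for r in results if not r["passed"]]
    for name in failures:
        print(f"FAILED: {name}", file=sys.stderr)
    return 1 if failures else 0


# Hypothetical results, for illustration only.
RESULTS = [
    {"experiment": "pricing dependency unavailable", "passed": True},
    {"experiment": "primary database failover", "passed": False},
]

EXIT_CODE = gate(RESULTS)
# In a pipeline step, the script would end with: sys.exit(EXIT_CODE)
```

A non-zero exit code is how most CI systems decide that a step failed, so the deployment stops when any experiment does not pass.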

Chaos Engineering and Technology Options

This is a series covering interesting technologies, architectures, and approaches that companies are using today or considering for the future.

  • Chaos Engineering Article

    Chaos Engineering for Serverless Infrastructure

    Serverless deployments are becoming an important facet of many companies' overall application architecture and must also be tested with Chaos Engineering experiments to enhance reliability. Here is how.

  • Chaos Engineering Article

    Chaos Engineering Tools Comparison

    This article describes some of the common tools that the Chaos Engineering community considers when starting to implement the practice in an organization. The goal is to give a high-level introduction to some frequently mentioned options and list some of the strengths of each, using a brief table followed by an annotated list.

  • Chaos Engineering Article

    Chaos Engineering for Istio Service Mesh

    Istio is a popular open source, cloud-native service mesh. This article demonstrates how to perform a few Chaos Engineering experiments using features already available in Istio.

  • Community Tutorial

    Chaos Engineering: the history, principles, and practice

    With the rise of microservices and distributed cloud architectures, the web has grown increasingly complex. We all…

  • Blog Post

    What is Chaos Engineering? SREs and Leaders Define the Practice & Where It's Going

    Chaos Engineering is a practice that is growing in implementation and interest. What is it and why are some of the most…
