March 11, 2019
Migrating to the Cloud Is Chaotic. Embrace It.
Why organizations planning to migrate to the cloud should embrace Chaos Engineering as a thoughtful strategy to avoid pain down the road.
Migrating to the cloud is an intimidating prospect, and understandably so: a lot will change in your systems as you move from on-prem to the cloud, and those changes can introduce instability.
How can you ensure your software will be safe after migrating to the cloud? How do you combat the cloud's chaotic nature while providing a reliable and stable system? By intentionally inducing Chaos well before migration begins.
It sounds counter-intuitive to perform Chaos Engineering while your team is actively migrating to the cloud. Wouldn't that add failure and slow down an already challenging process? In reality, Chaos Engineering is a great way to test how your new system will behave once you switch traffic over. By performing Chaos Experiments on the environment you are migrating into, you will identify previously unknown weaknesses while you still have time to mitigate them.
This blog post will discuss a number of ways that things can go wrong and provide tutorials to run Chaos Experiments to proactively identify potential issues before they turn into production outages.
Managing Heavy CPU Load
An overloaded CPU can quickly create bottlenecks and cause failures within most architectures. In a distributed cloud environment, instability in a single system can quickly cascade into problems elsewhere down the chain. Proper CPU reliability testing helps determine which existing systems remain reliable despite a CPU failure, and which need to be prioritized for upgrade or migration in order to maintain a stable stack.
Performing a CPU Attack with Gremlin
A Gremlin CPU Attack consumes 100% of the specified CPU cores on the target system. The CPU Attack is a great way to test the stability of the targeted machine, along with its critical dependencies, when the CPU is overloaded.
Prerequisites
- Install Gremlin on the target machine.
- Retrieve your Gremlin API Token.
A CPU Attack accepts the following arguments.
| Short Flag | Long Flag | Purpose |
|---|---|---|
| `-c` | `--cores` | Number of CPU cores to attack. |
| `-l` | `--length` | Attack duration (in seconds). |
Most Gremlin API calls accept a JSON body payload, which specifies critical arguments. In all the following examples you'll be creating a local `attacks/<attack-name>.json` file to store the API attack arguments. You'll then pass those arguments along to the API request.
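If you plan to run several of the attacks below, it can help to wrap that pattern in a small shell helper so you only type the `curl` call once. The following is a minimal sketch, assuming your API token is exported as `GREMLIN_API_TOKEN` (as in the prerequisites) and your attack files live in a local `attacks/` directory:

```bash
#!/usr/bin/env bash
# run_attack.sh -- post a local attack definition to the Gremlin API.
# Usage: ./run_attack.sh cpu   (posts attacks/cpu.json)
set -euo pipefail

ATTACK_NAME="$1"
ATTACK_FILE="attacks/${ATTACK_NAME}.json"

if [[ ! -f "$ATTACK_FILE" ]]; then
  echo "No such attack file: $ATTACK_FILE" >&2
  exit 1
fi

curl -H "Content-Type: application/json" \
     -H "Authorization: $GREMLIN_API_TOKEN" \
     https://api.gremlin.com/v1/attacks/new \
     -d "@${ATTACK_FILE}"
```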
- On your local machine, start by creating the `attacks/cpu.json` file and paste the following JSON into it. This will attack a single core for 30 seconds.

  ```json
  {
    "command": { "type": "cpu", "args": ["-c", "1", "-l", "30"] },
    "target": { "type": "Random" }
  }
  ```

- Create the new Attack by passing the JSON from `attacks/cpu.json` to the `https://api.gremlin.com/v1/attacks/new` API endpoint.

  ```bash
  curl -H "Content-Type: application/json" \
       -H "Authorization: $GREMLIN_API_TOKEN" \
       https://api.gremlin.com/v1/attacks/new \
       -d "@attacks/cpu.json"
  ```

- On the targeted machine, run `htop` and you'll see that one CPU core is maxed out.

- You can also create, run, and view the Attack on the Gremlin Web UI.
If you wish to attack a specific Client, just change the `target.type` argument value to `"Exact"` and add the `target.exact` field with a list of target Clients. A Client is identified on Gremlin as the `GREMLIN_IDENTIFIER` for the instance, which can also be specified in a local environment variable when running the `gremlin init` command.
```json
{
  "command": {
    "type": "cpu",
    "args": ["-c", "1", "-l", "30"]
  },
  "target": {
    "type": "Exact",
    "exact": ["aws-nginx"]
  }
}
```
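If you prefer a plain terminal check over `htop`, a per-core view on the targeted machine makes the pinned core easy to spot. A minimal sketch, assuming the `sysstat` package (which provides `mpstat`) is installed on the target:

```bash
# Print per-core CPU utilization once per second for the 30-second attack.
# The attacked core should sit near 100% while the other cores stay idle.
mpstat -P ALL 1 30
```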
Handling Storage Disk Limitations
Migrating to a new system frequently requires moving volumes across disks and to other cloud-based storage layers. It is vital to determine whether your new storage system can handle the increase in volume that the migration will require. You will also want to test how the system reacts when volumes become overburdened or unavailable.
Performing a Disk Attack with Gremlin
Gremlin's Disk Attack rapidly consumes disk space on the targeted machine, allowing you to test the reliability of that machine and other related systems when unexpected disk failures occur.
Prerequisites
- Install Gremlin on the target machine.
- Retrieve your Gremlin API Token.
A Gremlin API Disk Attack accepts the following arguments.
| Short Flag | Long Flag | Purpose |
|---|---|---|
| `-b` | `--block-size` | The size (in kilobytes) of the blocks that are written. |
| `-d` | `--dir` | The directory that temporary files will be written to. |
| `-l` | `--length` | Attack duration (in seconds). |
| `-p` | `--percent` | The percentage of the volume to fill. |
| `-w` | `--workers` | The number of disk-write workers to run concurrently. |
- On your local machine, start by creating the `attacks/disk.json` file and paste the following JSON into it. Be sure to change your target Client. This attack will fill 95% of the volume over the course of a 60-second attack using 2 workers.

  ```json
  {
    "command": {
      "type": "disk",
      "args": ["-d", "/tmp", "-l", "60", "-w", "2", "-b", "4", "-p", "95"]
    },
    "target": { "type": "Exact", "exact": ["aws-nginx"] }
  }
  ```

- (Optional) Check the current disk usage on the target machine.

  ```bash
  df -H

  # OUTPUT
  Filesystem      Size  Used Avail Use% Mounted on
  /dev/xvda1      8.3G  1.4G  6.9G  17% /
  ```

- Create the new Disk Attack by passing the JSON from `attacks/disk.json` to the `https://api.gremlin.com/v1/attacks/new` API endpoint.

  ```bash
  curl -H "Content-Type: application/json" \
       -H "Authorization: $GREMLIN_API_TOKEN" \
       https://api.gremlin.com/v1/attacks/new \
       -d "@attacks/disk.json"
  ```

- Check the attack target's current disk space, which will soon reach the specified percentage before Gremlin rolls back and returns the disk to its original state.

  ```bash
  df -H

  # OUTPUT
  Filesystem      Size  Used Avail Use% Mounted on
  /dev/xvda1      8.3G  7.9G  396M  96% /
  ```

- You can also create, run, and view the Attack on the Gremlin Web UI.
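To watch both the fill and the automatic rollback from the target machine, you can simply poll `df` while the attack runs. A minimal sketch, assuming `/tmp` lives on the root volume as in the example output above:

```bash
# Sample disk usage every 5 seconds; usage should climb toward ~95%
# during the attack and drop back once Gremlin removes its temporary files.
watch -n 5 "df -H /"
```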
Evaluating Network Reliability
Network problems are a common cause of service outages. Even architectures designed with network redundancies can experience multiple, cumulative network failures. Moreover, most modern software relies on external networks to some degree, which means a network outage completely outside of your control could cause a failure to propagate throughout your system.
Performing a Black Hole Attack with Gremlin
A Black Hole Attack temporarily drops all traffic based on the parameters of the attack. You can use a Black Hole Attack to test routing protocols, loss of communication to specific hosts, port-based traffic, network device failure, and much more.
Prerequisites
- Install Gremlin on the target machine.
- Retrieve your Gremlin API Token.
A Gremlin API Black Hole Attack accepts the following arguments.
| Short Flag | Long Flag | Purpose |
|---|---|---|
| `-d` | `--device` | Network device through which traffic should be affected. Defaults to the first device found. |
| `-h` | `--hostname` | Outgoing hostnames to affect. Optionally, you can prefix a hostname with a caret (`^`) to whitelist it. It is recommended to include `^api.gremlin.com` in the whitelist. |
| `-i` | `--ipaddress` | Outgoing IP addresses to affect. Optionally, you can prefix an IP with a caret (`^`) to whitelist it. |
| `-l` | `--length` | Attack duration (in seconds). |
| `-n` | `--ingress_port` | Only affect ingress traffic to these destination ports. Ranges can also be specified (e.g. `8080-8085`). |
| `-p` | `--egress_port` | Only affect egress traffic to these destination ports. Ranges can also be specified (e.g. `8080-8085`). |
| `-P` | `--ipprotocol` | Only affect traffic using this protocol. |
- Start by performing a test to establish a baseline. The following command tests the response time of a request to `example.com` (which has an IP address of `93.184.216.34`).

  ```bash
  $ time curl -o /dev/null 93.184.216.34

  # OUTPUT
  real    0m0.025s
  user    0m0.009s
  sys     0m0.000s
  ```

- On your local machine, create the `attacks/blackhole.json` file and paste the following JSON into it. Set your target Client as necessary. This attack creates a 30-second black hole that drops traffic to the `93.184.216.34` IP address.

  ```json
  {
    "command": {
      "type": "blackhole",
      "args": ["-l", "30", "-i", "93.184.216.34", "-h", "^api.gremlin.com"]
    },
    "target": { "type": "Exact", "exact": ["aws-nginx"] }
  }
  ```

- Execute the Black Hole Attack by passing the JSON from `attacks/blackhole.json` to the `https://api.gremlin.com/v1/attacks/new` API endpoint.

  ```bash
  curl -H "Content-Type: application/json" \
       -H "Authorization: $GREMLIN_API_TOKEN" \
       https://api.gremlin.com/v1/attacks/new \
       -d "@attacks/blackhole.json"
  ```

- On the target machine, run the same timed `curl` test as before. It now hangs for approximately 30 seconds until the black hole has been terminated and a response is finally received.

  ```bash
  $ time curl -o /dev/null 93.184.216.34

  # OUTPUT
  real    0m31.623s
  user    0m0.013s
  sys     0m0.000s
  ```

- You can also create, run, and view the Attack on the Gremlin Web UI.
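Rather than issuing a single timed request, you can also probe continuously from the target machine to see exactly when traffic starts being dropped and when it recovers. A minimal sketch, using the same `93.184.216.34` address as above and a 5-second timeout so each failed probe returns quickly:

```bash
# Print a timestamp and total request time every 2 seconds.
# Requests should start timing out once the Black Hole Attack begins
# and succeed again after the 30-second attack window ends.
while true; do
  printf '%s  ' "$(date +%T)"
  curl -o /dev/null -s --max-time 5 -w 'time_total=%{time_total}s\n' 93.184.216.34 \
    || echo "request timed out"
  sleep 2
done
```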
Proper Memory Management
While most cloud platforms provide auto-balancing and scaling services, it is unwise to rely on these technologies alone to keep your system stable and responsive. Memory management is a crucial part of maintaining a healthy and inexpensive cloud stack. An improper configuration or a poorly tested system may not necessarily cause a system failure or outage, but even a tiny memory issue can add up to thousands of dollars in extra support costs.
Performing Chaos Engineering before, during, and after cloud migration lets you test system failures when instances, containers, or nodes run out of memory. This testing helps you keep your stack active and functional when an unexpected memory leak occurs.
Performing a Memory Attack with Gremlin
A Gremlin Memory Attack consumes memory on the targeted machine, making it easy to test how that system and other dependencies behave when memory is unavailable.
Prerequisites
- Install Gremlin on the target machine.
- Retrieve your Gremlin API Token.
A Gremlin API Memory Attack accepts the following arguments.
| Short Flag | Long Flag | Purpose |
|---|---|---|
| `-g` | `--gigabytes` | The amount of memory (in GB) to allocate. |
| `-l` | `--length` | Attack duration (in seconds). |
| `-m` | `--megabytes` | The amount of memory (in MB) to allocate. |
- (Optional) On the target machine, run `htop` to check the current memory usage and establish a baseline prior to executing the attack.

- On your local machine, create an `attacks/memory.json` file and paste the following JSON into it, ensuring you change your target Client. This attack will consume up to 0.75 GB of memory for a total of 30 seconds.

  ```json
  {
    "command": { "type": "memory", "args": ["-l", "30", "-g", "0.75"] },
    "target": { "type": "Exact", "exact": ["aws-nginx"] }
  }
  ```

- Launch the Memory Attack by passing the JSON from `attacks/memory.json` to the `https://api.gremlin.com/v1/attacks/new` API endpoint.

  ```bash
  curl -H "Content-Type: application/json" \
       -H "Authorization: $GREMLIN_API_TOKEN" \
       https://api.gremlin.com/v1/attacks/new \
       -d "@attacks/memory.json"
  ```

- Run `htop` again and you'll see that the additional memory is now consumed on the target machine.

- As always, you can view the Attack within the Gremlin Web UI.
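You can also capture the memory curve from a plain shell on the target machine instead of watching `htop` interactively. A minimal sketch that samples usage once per second for slightly longer than the attack:

```bash
# Print used/free memory (in MB) once per second for ~35 seconds.
# Used memory should rise by roughly 0.75 GB during the attack and
# return to the baseline once the 30-second attack completes.
for i in $(seq 1 35); do
  free -m | awk 'NR==2 {printf "used=%sMB free=%sMB\n", $3, $4}'
  sleep 1
done
```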
Troubleshooting I/O Bottlenecks
Due to the proliferation of automatic monitoring and elastic scaling, I/O failure may seem like an unlikely problem within a cloud architecture. However, even when an I/O failure isn't the root cause of an outage, it is often the result of another issue, and it can trigger a negative cascading effect throughout other dependent systems. Moreover, because I/O failure is considered unlikely, it is often overlooked as a test subject. It should not be.
Performing an I/O Attack with Gremlin
Gremlin's IO Attack performs rapid read and/or write actions on the targeted system volume.
Prerequisites
- Install Gremlin on the target machine.
- Retrieve your Gremlin API Token.
A Gremlin API IO Attack accepts the following arguments.
| Short Flag | Long Flag | Purpose |
|---|---|---|
| `-c` | `--block-count` | The number of blocks read or written by workers. |
| `-d` | `--dir` | The directory that temporary files will be written to. |
| `-l` | `--length` | Attack duration (in seconds). |
| `-m` | `--mode` | Specifies if workers are in read (`r`), write (`w`), or read+write (`rw`) mode. |
| `-s` | `--block-size` | Size of blocks (in KB) that are read or written by workers. |
| `-w` | `--workers` | The number of concurrent workers. |
- On your local machine, create an `attacks/io.json` file and paste the following JSON into it. Change the target Client as necessary. This IO Attack creates two workers that will perform both reads and writes during the 45-second attack.

  ```json
  {
    "command": {
      "type": "io",
      "args": ["-l", "45", "-d", "/tmp", "-w", "2", "-m", "rw", "-s", "4", "-c", "1"]
    },
    "target": { "type": "Exact", "exact": ["aws-nginx"] }
  }
  ```

- Launch the IO Attack by passing the JSON from `attacks/io.json` to the `https://api.gremlin.com/v1/attacks/new` API endpoint.

  ```bash
  curl -H "Content-Type: application/json" \
       -H "Authorization: $GREMLIN_API_TOKEN" \
       https://api.gremlin.com/v1/attacks/new \
       -d "@attacks/io.json"
  ```

- On the target machine, verify that the attack is running and that I/O is currently overloaded.

  ```bash
  $ sudo iotop -aoP

  # OUTPUT
  Total DISK READ :       0.00 B/s | Total DISK WRITE :       3.92 M/s
  Actual DISK READ:       0.00 B/s | Actual DISK WRITE:      15.77 M/s
    PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
    323  be/3  root       0.00 B    68.00 K   0.00 %  71.28 %  [jbd2/xvda1-8]
  20030  be/4  gremlin    0.00 B   112.15 M   0.00 %  17.11 %  gremlin attack io -l 45 -d /tmp -w 2 -m rw -s 4 -c 1
  ```

- You can also create, run, and view the Attack on the Gremlin Web UI.
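If `iotop` is not available on the target, extended device statistics from `iostat` (part of the `sysstat` package) give a similar picture. A minimal sketch:

```bash
# Print per-device throughput and utilization once per second for 45 samples.
# Expect write throughput and %util on the attacked volume to spike
# for the duration of the 45-second attack.
iostat -dx 1 45
```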
What Comes Next?
This article explored a number of common issues and outages related to failed migrations and upgrade procedures. As impactful and expensive as those outages may have been, their existence should not dissuade you from making the move to the cloud. A distributed architecture allows you to enjoy faster release cycles and, in general, increased developer productivity.
Instead, the fact that even the biggest organizations in the industry run into migration issues illustrates the necessity of proper reliability testing, and Chaos Engineering is a critical piece of that puzzle. Planning ahead and running Chaos Experiments on your systems, both prior to and during migration, will help ensure you are building the most stable, robust, and reliable system possible.