March 11, 2019
Migrating to the Cloud Is Chaotic. Embrace It.
Why organizations planning to migrate to the cloud should embrace Chaos Engineering as a thoughtful strategy to avoid pain down the road.
Migrating to the cloud is an intimidating prospect, and understandably so: a lot will change in your systems as you move from on-prem to the cloud, and those changes can introduce instability.
How can you ensure your software will be safe after migrating to the cloud? How do you combat the cloud's chaotic nature while providing a reliable and stable system? By intentionally inducing Chaos well before migration begins.
It sounds counter-intuitive to perform Chaos Engineering while your team is actively migrating to the cloud. Wouldn't that add failure and slow down an already challenging process? In reality, Chaos Engineering is a great way to test how your new system will behave once you switch traffic over. By performing Chaos Experiments on the environment you are migrating into, you will identify previously unknown weaknesses while you still have time to mitigate them.
This blog post will discuss a number of ways that things can go wrong and provide tutorials to run Chaos Experiments to proactively identify potential issues before they turn into production outages.
Managing Heavy CPU Load
An overloaded CPU can quickly create bottlenecks and cause failures within most architectures. In a distributed cloud environment, instability in a single system can quickly cascade into problems elsewhere down the chain. Proper CPU reliability testing helps determine which existing systems remain reliable despite a CPU failure, and which need to be prioritized for upgrade or migration in order to maintain a stable stack.
Performing a CPU Attack with Gremlin
A Gremlin CPU Attack consumes 100% of the specified CPU cores on the target system. The CPU Attack is a great way to test the stability of the targeted machine, along with its critical dependencies, when the CPU is overloaded.
Prerequisites
- Install Gremlin on the target machine.
- Retrieve your Gremlin API Token.
A CPU Attack accepts the following arguments.
| Short Flag | Long Flag | Purpose |
|---|---|---|
| `-c` | `--cores` | Number of CPU cores to attack. |
| `-l` | `--length` | Attack duration (in seconds). |
Most Gremlin API calls accept a JSON body payload, which specifies critical arguments. In all the following examples you'll be creating a local `attacks/<attack-name>.json` file to store the API attack arguments. You'll then pass those arguments along to the API request.
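If you plan to run several of the attacks below, it can help to wrap that pattern in a small shell helper so you only type the `curl` call once. The following is a minimal sketch, assuming your API token is exported as `GREMLIN_API_TOKEN` (as in the prerequisites) and your attack files live in a local `attacks/` directory:

```bash
#!/usr/bin/env bash
# run_attack.sh -- post a local attack definition to the Gremlin API.
# Usage: ./run_attack.sh cpu   (posts attacks/cpu.json)
set -euo pipefail

ATTACK_NAME="$1"
ATTACK_FILE="attacks/${ATTACK_NAME}.json"

if [[ ! -f "$ATTACK_FILE" ]]; then
  echo "No such attack file: $ATTACK_FILE" >&2
  exit 1
fi

curl -H "Content-Type: application/json" \
     -H "Authorization: $GREMLIN_API_TOKEN" \
     https://api.gremlin.com/v1/attacks/new \
     -d "@${ATTACK_FILE}"
```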
- On your local machine, start by creating the `attacks/cpu.json` file and paste the following JSON into it. This will attack a single core for 30 seconds.

  ```json
  {
    "command": { "type": "cpu", "args": ["-c", "1", "-l", "30"] },
    "target": { "type": "Random" }
  }
  ```

- Create the new Attack by passing the JSON from `attacks/cpu.json` to the `https://api.gremlin.com/v1/attacks/new` API endpoint.

  ```bash
  curl -H "Content-Type: application/json" \
       -H "Authorization: $GREMLIN_API_TOKEN" \
       https://api.gremlin.com/v1/attacks/new \
       -d "@attacks/cpu.json"
  ```

- On the targeted machine, run `htop` and you'll see that one CPU core is maxed out.

- You can also create, run, and view the Attack on the Gremlin Web UI.
If you wish to attack a specific Client, just change the `target.type` argument value to `"Exact"` and add the `target.exact` field with a list of target Clients. A Client is identified on Gremlin as the `GREMLIN_IDENTIFIER` for the instance, which can also be specified in a local environment variable when running the `gremlin init` command.
```json
{
  "command": {
    "type": "cpu",
    "args": ["-c", "1", "-l", "30"]
  },
  "target": {
    "type": "Exact",
    "exact": ["aws-nginx"]
  }
}
```
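If you prefer a plain terminal check over `htop`, a per-core view on the targeted machine makes the pinned core easy to spot. A minimal sketch, assuming the `sysstat` package (which provides `mpstat`) is installed on the target:

```bash
# Print per-core CPU utilization once per second for the 30-second attack.
# The attacked core should sit near 100% while the other cores stay idle.
mpstat -P ALL 1 30
```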
Handling Storage Disk Limitations
Migrating to a new system frequently requires moving volumes across disks and to other cloud-based storage layers. It is vital to determine whether your new storage system can handle the increase in volume that the migration will require. You will also want to test how the system reacts when volumes become overburdened or unavailable.
Performing a Disk Attack with Gremlin
Gremlin's Disk Attack rapidly consumes disk space on the targeted machine, allowing you to test the reliability of that machine and other related systems when unexpected disk failures occur.
Prerequisites
- Install Gremlin on the target machine.
- Retrieve your Gremlin API Token.
A Gremlin API Disk Attack accepts the following arguments.
| Short Flag | Long Flag | Purpose |
|---|---|---|
| `-b` | `--block-size` | The size (in kilobytes) of the blocks that are written. |
| `-d` | `--dir` | The directory that temporary files will be written to. |
| `-l` | `--length` | Attack duration (in seconds). |
| `-p` | `--percent` | The percentage of the volume to fill. |
| `-w` | `--workers` | The number of disk-write workers to run concurrently. |
- On your local machine, start by creating the `attacks/disk.json` file and paste the following JSON into it. Be sure to change your target Client. This attack will fill 95% of the volume over the course of a 60-second attack using 2 workers.

  ```json
  {
    "command": {
      "type": "disk",
      "args": ["-d", "/tmp", "-l", "60", "-w", "2", "-b", "4", "-p", "95"]
    },
    "target": { "type": "Exact", "exact": ["aws-nginx"] }
  }
  ```

- (Optional) Check the current disk usage on the target machine.

  ```bash
  df -H

  # OUTPUT
  Filesystem      Size  Used Avail Use% Mounted on
  /dev/xvda1      8.3G  1.4G  6.9G  17% /
  ```

- Create the new Disk Attack by passing the JSON from `attacks/disk.json` to the `https://api.gremlin.com/v1/attacks/new` API endpoint.

  ```bash
  curl -H "Content-Type: application/json" \
       -H "Authorization: $GREMLIN_API_TOKEN" \
       https://api.gremlin.com/v1/attacks/new \
       -d "@attacks/disk.json"
  ```

- Check the attack target's current disk space, which will soon reach the specified percentage before Gremlin rolls back and returns the disk to its original state.

  ```bash
  df -H

  # OUTPUT
  Filesystem      Size  Used Avail Use% Mounted on
  /dev/xvda1      8.3G  7.9G  396M  96% /
  ```

- You can also create, run, and view the Attack on the Gremlin Web UI.
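To watch both the fill and the automatic rollback from the target machine, you can simply poll `df` while the attack runs. A minimal sketch, assuming `/tmp` lives on the root volume as in the example output above:

```bash
# Sample disk usage every 5 seconds; usage should climb toward ~95%
# during the attack and drop back once Gremlin removes its temporary files.
watch -n 5 "df -H /"
```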
Evaluating Network Reliability
Network problems are a common cause of service outages. Even architectures designed with network redundancies can experience multiple, cumulative network failures. Moreover, most modern software relies on external networks to some degree, which means a network outage completely outside of your control could cause a failure to propagate throughout your system.
Performing a Black Hole Attack with Gremlin
A Black Hole Attack temporarily drops all traffic based on the parameters of the attack. You can use a Black Hole Attack to test routing protocols, loss of communication to specific hosts, port-based traffic, network device failure, and much more.
Prerequisites
- Install Gremlin on the target machine.
- Retrieve your Gremlin API Token.
A Gremlin API Black Hole Attack accepts the following arguments.
| Short Flag | Long Flag | Purpose |
|---|---|---|
| `-d` | `--device` | Network device through which traffic should be affected. Defaults to the first device found. |
| `-h` | `--hostname` | Outgoing hostnames to affect. Optionally, you can prefix a hostname with a caret (`^`) to whitelist it. It is recommended to include `^api.gremlin.com` in the whitelist. |
| `-i` | `--ipaddress` | Outgoing IP addresses to affect. Optionally, you can prefix an IP with a caret (`^`) to whitelist it. |
| `-l` | `--length` | Attack duration (in seconds). |
| `-n` | `--ingress_port` | Only affect ingress traffic to these destination ports. Ranges can also be specified (e.g. `8080-8085`). |
| `-p` | `--egress_port` | Only affect egress traffic to these destination ports. Ranges can also be specified (e.g. `8080-8085`). |
| `-P` | `--ipprotocol` | Only affect traffic using this protocol. |
- Start by performing a test to establish a baseline. The following command tests the response time of a request to `example.com` (which has an IP address of `93.184.216.34`).

  ```bash
  $ time curl -o /dev/null 93.184.216.34

  # OUTPUT
  real    0m0.025s
  user    0m0.009s
  sys     0m0.000s
  ```

- On your local machine, create the `attacks/blackhole.json` file and paste the following JSON into it. Set your target Client as necessary. This attack creates a 30-second black hole that drops traffic to the `93.184.216.34` IP address.

  ```json
  {
    "command": {
      "type": "blackhole",
      "args": ["-l", "30", "-i", "93.184.216.34", "-h", "^api.gremlin.com"]
    },
    "target": { "type": "Exact", "exact": ["aws-nginx"] }
  }
  ```

- Execute the Black Hole Attack by passing the JSON from `attacks/blackhole.json` to the `https://api.gremlin.com/v1/attacks/new` API endpoint.

  ```bash
  curl -H "Content-Type: application/json" \
       -H "Authorization: $GREMLIN_API_TOKEN" \
       https://api.gremlin.com/v1/attacks/new \
       -d "@attacks/blackhole.json"
  ```

- On the target machine, run the same timed `curl` test as before. It now hangs for approximately 30 seconds until the black hole has been terminated and a response is finally received.

  ```bash
  $ time curl -o /dev/null 93.184.216.34

  # OUTPUT
  real    0m31.623s
  user    0m0.013s
  sys     0m0.000s
  ```

- You can also create, run, and view the Attack on the Gremlin Web UI.
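Rather than issuing a single timed request, you can also probe continuously from the target machine to see exactly when traffic starts being dropped and when it recovers. A minimal sketch, using the same `93.184.216.34` address as above and a 5-second timeout so each failed probe returns quickly:

```bash
# Print a timestamp and total request time every 2 seconds.
# Requests should start timing out once the Black Hole Attack begins
# and succeed again after the 30-second attack window ends.
while true; do
  printf '%s  ' "$(date +%T)"
  curl -o /dev/null -s --max-time 5 -w 'time_total=%{time_total}s\n' 93.184.216.34 \
    || echo "request timed out"
  sleep 2
done
```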
Proper Memory Management
While most cloud platforms provide auto-balancing and scaling services, it is unwise to rely on these technologies alone to keep your system stable and responsive. Memory management is a crucial part of maintaining a healthy and inexpensive cloud stack. An improper configuration or a poorly tested system may not necessarily cause a system failure or outage, but even a tiny memory issue can add up to thousands of dollars in extra support costs.
Performing Chaos Engineering before, during, and after cloud migration lets you test system failures when instances, containers, or nodes run out of memory. This testing helps you keep your stack active and functional when an unexpected memory leak occurs.
Performing a Memory Attack with Gremlin
A Gremlin Memory Attack consumes memory on the targeted machine, making it easy to test how that system and other dependencies behave when memory is unavailable.
Prerequisites
- Install Gremlin on the target machine.
- Retrieve your Gremlin API Token.
A Gremlin API Memory Attack accepts the following arguments.
| Short Flag | Long Flag | Purpose |
|---|---|---|
| `-g` | `--gigabytes` | The amount of memory (in GB) to allocate. |
| `-l` | `--length` | Attack duration (in seconds). |
| `-m` | `--megabytes` | The amount of memory (in MB) to allocate. |
- (Optional) On the target machine, run `htop` to check the current memory usage and establish a baseline prior to executing the attack.

- On your local machine, create an `attacks/memory.json` file and paste the following JSON into it, ensuring you change your target Client. This attack will consume up to 0.75 GB of memory for a total of 30 seconds.

  ```json
  {
    "command": { "type": "memory", "args": ["-l", "30", "-g", "0.75"] },
    "target": { "type": "Exact", "exact": ["aws-nginx"] }
  }
  ```

- Launch the Memory Attack by passing the JSON from `attacks/memory.json` to the `https://api.gremlin.com/v1/attacks/new` API endpoint.

  ```bash
  curl -H "Content-Type: application/json" \
       -H "Authorization: $GREMLIN_API_TOKEN" \
       https://api.gremlin.com/v1/attacks/new \
       -d "@attacks/memory.json"
  ```

- Run `htop` again and you'll see that the additional memory is now consumed on the target machine.

- As always, you can view the Attack within the Gremlin Web UI.
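You can also capture the memory curve from a plain shell on the target machine instead of watching `htop` interactively. A minimal sketch that samples usage once per second for slightly longer than the attack:

```bash
# Print used/free memory (in MB) once per second for ~35 seconds.
# Used memory should rise by roughly 0.75 GB during the attack and
# return to the baseline once the 30-second attack completes.
for i in $(seq 1 35); do
  free -m | awk 'NR==2 {printf "used=%sMB free=%sMB\n", $3, $4}'
  sleep 1
done
```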
Troubleshooting I/O Bottlenecks
Due to the proliferation of automatic monitoring and elastic scaling, I/O failure may seem like an unlikely problem within a cloud architecture. However, even when an I/O failure isn't the root cause of an outage, it is often the result of another issue, and it can trigger a negative cascading effect throughout other dependent systems. Moreover, because I/O failure is considered unlikely, it is often overlooked as a test subject. It should not be.
Performing an I/O Attack with Gremlin
Gremlin's IO Attack performs rapid read and/or write actions on the targeted system volume.
Prerequisites
- Install Gremlin on the target machine.
- Retrieve your Gremlin API Token.
A Gremlin API IO Attack accepts the following arguments.
| Short Flag | Long Flag | Purpose |
|---|---|---|
| `-c` | `--block-count` | The number of blocks read or written by workers. |
| `-d` | `--dir` | The directory that temporary files will be written to. |
| `-l` | `--length` | Attack duration (in seconds). |
| `-m` | `--mode` | Specifies if workers are in read (`r`), write (`w`), or read+write (`rw`) mode. |
| `-s` | `--block-size` | Size of blocks (in KB) that are read or written by workers. |
| `-w` | `--workers` | The number of concurrent workers. |
- On your local machine, create an `attacks/io.json` file and paste the following JSON into it. Change the target Client as necessary. This IO Attack creates two workers that will perform both reads and writes during the 45-second attack.

  ```json
  {
    "command": {
      "type": "io",
      "args": ["-l", "45", "-d", "/tmp", "-w", "2", "-m", "rw", "-s", "4", "-c", "1"]
    },
    "target": { "type": "Exact", "exact": ["aws-nginx"] }
  }
  ```

- Launch the IO Attack by passing the JSON from `attacks/io.json` to the `https://api.gremlin.com/v1/attacks/new` API endpoint.

  ```bash
  curl -H "Content-Type: application/json" \
       -H "Authorization: $GREMLIN_API_TOKEN" \
       https://api.gremlin.com/v1/attacks/new \
       -d "@attacks/io.json"
  ```

- On the target machine, verify that the attack is running and that I/O is currently overloaded.

  ```bash
  $ sudo iotop -aoP

  # OUTPUT
  Total DISK READ :       0.00 B/s | Total DISK WRITE :       3.92 M/s
  Actual DISK READ:       0.00 B/s | Actual DISK WRITE:      15.77 M/s
    PID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
    323  be/3  root       0.00 B    68.00 K   0.00 %  71.28 %  [jbd2/xvda1-8]
  20030  be/4  gremlin    0.00 B   112.15 M   0.00 %  17.11 %  gremlin attack io -l 45 -d /tmp -w 2 -m rw -s 4 -c 1
  ```

- You can also create, run, and view the Attack on the Gremlin Web UI.
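If `iotop` is not available on the target, extended device statistics from `iostat` (part of the `sysstat` package) give a similar picture. A minimal sketch:

```bash
# Print per-device throughput and utilization once per second for 45 samples.
# Expect write throughput and %util on the attacked volume to spike
# for the duration of the 45-second attack.
iostat -dx 1 45
```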
What Comes Next?
This article explored a number of common issues and outages related to failed migrations and upgrade procedures. As impactful and expensive as those outages may have been, their existence should not dissuade you from making the move to the cloud. A distributed architecture allows you to enjoy faster release cycles and, in general, increased developer productivity.
Instead, the fact that even the biggest organizations in the industry run into migration issues illustrates the necessity of proper reliability testing, and Chaos Engineering is a critical piece of that puzzle. Planning ahead and running Chaos Experiments on your systems, both prior to and during migration, will help ensure you are building the most stable, robust, and reliable system possible.