
Failure in a Service Mesh

Chaos Engineering for Istio Service Mesh

In this post we demonstrate how you can implement some types of Chaos Engineering experiments using the Istio service mesh on Kubernetes.

10 min read
Last Updated June 10, 2019

Istio (https://istio.io) is a popular, open source service mesh for cloud-native applications. The term service mesh refers to the network of microservices that make up modern applications and the communications between them. Whereas API gateways are typically used to manage public traffic in and out of an application or service, a service mesh manages communication internal to the infrastructure, specifically between services.

Istio provides insights into those communications as well as a variety of means to integrate, control, and secure the endpoints of each microservice. This helps DevOps engineers manage the growing complexity of modern distributed applications, such as those deployed in containers on Kubernetes, whether on Amazon Web Services or another cloud provider. Depending on its configuration, Istio also provides service discovery and load balancing.

In a Kubernetes setting, Istio runs a sidecar proxy in each microservice’s pod as part of its data plane, and uses its control plane to configure those proxies to route traffic according to the rules you specify, while Kubernetes handles container orchestration. Istio can also create a mesh that spans multiple Kubernetes clusters. This architecture makes Istio a great candidate for running Chaos Engineering experiments. Knowing what will happen if communications through Istio are disrupted is a valuable addition to your knowledge base and, possibly, your to-do list. Mitigating any problems you anticipate or discover will help you improve your application’s reliability and uptime.
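Istio can add the sidecar either by injecting it into each deployment’s manifest or, when the automatic sidecar injector is installed, by labeling the namespace so that every new pod gets the proxy. Here is a minimal sketch of the namespace-label approach (an illustration and an assumption on our part; the tutorial referenced below walks you through its own injection steps):

# Hypothetical namespace manifest: the istio-injection label tells Istio's
# automatic sidecar injector to add an istio-proxy container to every pod
# created in this namespace.
apiVersion: v1
kind: Namespace
metadata:
  name: tutorial
  labels:
    istio-injection: enabled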

When you have any complex system, it is inevitable that some part of it will fail. The objective is to keep the failure of one part from becoming a catastrophic failure of the entire system, and instead to degrade gracefully so that the impact on the end user experience is limited. This article describes some ways to inject helpful chaos into your application, whether or not you use Istio. The intent is to help you learn how your system responds to failure and find ways to make the entire system more reliable. Most of what this article describes can already be done using Gremlin’s Application-Level Fault Injection (ALFI).

With Istio, you can inject chaos into networking easily, because the istio-proxy is already intercepting all network traffic. That means the proxy can be used to change the responses or delay responses to simulate latency. To give proper credit where credit is due, much of the following content comes from or is based on material in Red Hat’s Istio Tutorial for Java Microservices on GitHub. Developers following that tutorial deploy three simple microservices to Kubernetes and chain them together, like this, using Istio to control network communication:

customer > preference > recommendation

Eventually you deploy two versions of one of the microservices, recommendation v1 and recommendation v2, and run them side by side. Requests are split between the two by Kubernetes’ default load balancing, which effectively picks one of the pods at random.
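For illustration, here is roughly how the second version runs alongside the first (a simplified sketch, not the tutorial’s actual manifest): both Deployments carry the app: recommendation label, so the Kubernetes Service for recommendation selects pods from either version, while the version label is what Istio’s route rules use to tell them apart.

# Simplified, hypothetical Deployment for recommendation v2; v1 is identical
# apart from the version label and the container image tag.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: recommendation-v2
  namespace: tutorial
spec:
  replicas: 1
  selector:
    matchLabels:
      app: recommendation
      version: v2
  template:
    metadata:
      labels:
        app: recommendation      # shared by v1 and v2; used by the Service
        version: v2              # distinguishes the versions for Istio routing
    spec:
      containers:
      - name: recommendation
        image: recommendation:v2   # placeholder; the tutorial builds its own image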

With this as our foundation, we can now use Istio to inject several types of faults, including HTTP error codes and network delays.

HTTP 503 Errors

Here we see the deployed pods running the two versions of the recommendation service, in the tutorial namespace that the tutorial uses when creating the sample application:

$ kubectl get pods -l app=recommendation -n tutorial
NAME                                 READY     STATUS    RESTARTS   AGE
recommendation-v1-no7area1ID-t35ts   2/2       Running   12         22h
recommendation-v2-00i8toomuc-74the   2/2       Running   2          6h

It is easy to create a rule that will inject 503 errors for approximately 50% of the requests. We do that by creating a YAML file called route-rule-recommendation-503.yml with the following contents:

apiVersion: config.istio.io/v1alpha2
kind: RouteRule
metadata:
  name: recommendation-503
spec:
  destination:
    namespace: tutorial
    name: recommendation
  precedence: 2
  route:
  - labels:
      app: recommendation
  httpFault:
    abort:
      percent: 50
      httpStatus: 503

This uses Istio’s RouteRule to inject our desired chaos. Use the istioctl command-line tool to apply the rule:

$ istioctl create -f route-rule-recommendation-503.yml -n tutorial
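As an aside, RouteRule and the config.istio.io/v1alpha2 API belong to pre-1.0 releases of Istio. On Istio 1.x the same abort fault is expressed as a VirtualService; a rough sketch of the equivalent follows (the host name and values are carried over from the example above, but verify the syntax against your Istio version’s documentation and apply it with kubectl apply or istioctl create):

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: recommendation-503
  namespace: tutorial
spec:
  hosts:
  - recommendation
  http:
  - fault:
      abort:
        percentage:
          value: 50        # abort roughly half of the requests
        httpStatus: 503    # respond with HTTP 503
    route:
    - destination:
        host: recommendation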

To test whether it is working, check the output from curl when hitting the exposed endpoint of the customer microservice:

$ curl {endpoint of customer}
customer => preference => recommendation v2 from '00i8toomuc-74the': 138
customer => Error: 503 - preference => Error: 503 - fault filter abort
customer => preference => recommendation v1 from 'no7area1ID-t35ts': 139
customer => Error: 503 - preference => Error: 503 - fault filter abort
customer => preference => recommendation v2 from '00i8toomuc-74the': 140
customer => preference => recommendation v1 from 'no7area1ID-t35ts': 141
customer => preference => recommendation v2 from '00i8toomuc-74the': 142
customer => preference => recommendation v1 from 'no7area1ID-t35ts': 143
customer => Error: 503 - preference => Error: 503 - fault filter abort
customer => Error: 503 - preference => Error: 503 - fault filter abort
customer => preference => recommendation v2 from '00i8toomuc-74the': 144

Notice the 503 errors appear approximately 50% of the time as the load balancer submits the request to one or the other recommendation service.

To restore normal operation, delete the RouteRule, referencing its YAML file:

$ istioctl delete -f route-rule-recommendation-503.yml -n tutorial

Retry

Sometimes random problems cause a site to throw a 503 error, and then the site works normally when you retry the connection moments later. You can configure Istio to retry a failed connection, adding some fault tolerance that mitigates the problem (at least until you do the homework to find out why the random error is occurring in the first place).

To try this out, re-enable the rule from the previous example that throws HTTP 503 errors 50% of the time. Then create a second RouteRule file called route-rule-recommendation-v2-retry.yml containing the following:

apiVersion: config.istio.io/v1alpha2
kind: RouteRule
metadata:
  name: recommendation-v2-retry
spec:
  destination:
    namespace: tutorial
    name: recommendation
  precedence: 3
  route:
  - labels:
      version: v2
  httpReqRetries:
    simpleRetry:
      perTryTimeout: 2s
      attempts: 3

Enable it as you did for the 503 route rule:

$ istioctl create -f route-rule-recommendation-v2-retry.yml -n tutorial
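If you are on Istio 1.x, the same retry policy can be sketched as a VirtualService instead (again an approximation; the version-v2 subset below assumes a corresponding DestinationRule, which is not shown here):

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: recommendation-v2-retry
  namespace: tutorial
spec:
  hosts:
  - recommendation
  http:
  - route:
    - destination:
        host: recommendation
        subset: version-v2   # assumes a DestinationRule defining this subset
    retries:
      attempts: 3            # retry a failed request up to three times
      perTryTimeout: 2s      # give each attempt two seconds to respond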

The retry rule will retry any failed request up to three times, with a two second timeout for each attempt. Run the curl command again to see what happens.

$ curl {endpoint of customer}
customer => preference => recommendation v2 from '00i8toomuc-74the': 147
customer => preference => recommendation v1 from 'no7area1ID-t35ts': 148
customer => preference => recommendation v2 from '00i8toomuc-74the': 149
customer => preference => recommendation v1 from 'no7area1ID-t35ts': 150
customer => preference => recommendation v2 from '00i8toomuc-74the': 151
customer => preference => recommendation v1 from 'no7area1ID-t35ts': 152
customer => preference => recommendation v2 from '00i8toomuc-74the': 153
customer => preference => recommendation v1 from 'no7area1ID-t35ts': 154
customer => preference => recommendation v2 from '00i8toomuc-74the': 155
customer => preference => recommendation v1 from 'no7area1ID-t35ts': 156
customer => preference => recommendation v2 from '00i8toomuc-74the': 157

Here Istio is still throwing 503 errors 50% of the time, but it is also retrying each failed request, so no errors reach the caller. An interesting result.

To restore normal operation, delete the rule, referencing its YAML file:

$ istioctl delete -f route-rule-recommendation-v2-retry.yml -n tutorial

Delays

To inject network latency as a chaos experiment, you can create yet another route rule. Remember that what we are doing here is simulating what happens to a calling service when responses from a service it depends on get slower. That is not the end goal, but merely a step on the path to reliability. Once we have set up our latency experiment, we want to learn how our application responds and find ways to improve that response.

Once again, we create a YAML file, this time named route-rule-recommendation-delay.yml:

apiVersion: config.istio.io/v1alpha2
kind: RouteRule
metadata:
  name: recommendation-delay
spec:
  destination:
    namespace: tutorial
    name: recommendation
  precedence: 2
  route:
  - labels:
      app: recommendation
  httpFault:
    delay:
      percent: 50
      fixedDelay: 7s

Enable it as you did for the other route rules:

$ istioctl create -f route-rule-recommendation-delay.yml -n tutorial
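For Istio 1.x, the equivalent delay fault would look roughly like this as a VirtualService (same caveats as before):

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: recommendation-delay
  namespace: tutorial
spec:
  hosts:
  - recommendation
  http:
  - fault:
      delay:
        percentage:
          value: 50       # delay roughly half of the requests
        fixedDelay: 7s    # hold each affected request for seven seconds
    route:
    - destination:
        host: recommendation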

The delay rule injects a seven second delay into 50% of the requests. A monitoring tool will give you the most accurate picture of what happens and where. For a quick proof of concept, however, you can use a simple bash script like this one, which outputs the elapsed time for each curl request:

#!/bin/bash
while true
do
  time curl {endpoint of customer}
  sleep .1
done

Notice that many requests to the customer endpoint now have a delay. If you are monitoring the logs for recommendation v1 and v2, you will also see that the delay happens before the recommendation service is actually called. The delay is injected by the Istio proxy (Envoy), not by the endpoint itself.

To restore normal operation, delete the rule, referencing its YAML file:

$ istioctl delete -f route-rule-recommendation-delay.yml -n tutorial

Timeout Limits

What if, instead of forcing everything in our application to wait when one microservice is slow, we simply canceled the request after a certain amount of time and moved on? Figuring out what to do instead is up to you; there is likely some fallback you can use to mitigate the failure. While you think about that, here is how to create timeout limits for requests in Istio so that, when you are ready, you can test your mitigation idea.

Once again, we create a YAML file, this time named route-rule-recommendation-timeout.yml:

apiVersion: config.istio.io/v1alpha2
kind: RouteRule
metadata:
  name: recommendation-timeout
spec:
  destination:
    namespace: tutorial
    name: recommendation
  precedence: 1
  route:
  - labels:
      app: recommendation
  httpReqTimeout:
    simpleTimeout:
      timeout: 1s

Here we are only allowing the recommendation service one second to reply before Istio returns a 504 error.

Enable it as you did for the other route rules:

$ istioctl create -f route-rule-recommendation-timeout.yml -n tutorial
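And, for completeness, a sketch of the same one second timeout as an Istio 1.x VirtualService (same caveats as before):

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: recommendation-timeout
  namespace: tutorial
spec:
  hosts:
  - recommendation
  http:
  - route:
    - destination:
        host: recommendation
    timeout: 1s   # give the upstream one second to reply before returning 504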

If the recommendation service is slow to respond, for example because the seven second delay rule from the previous section is still enabled, the output from curl will look something like this:

$ curl {endpoint of customer}
customer => preference => recommendation v1 from '00i8toomuc-74the': 160
customer => preference => recommendation v1 from 'no7area1ID-t35ts': 161
customer => preference => 504 upstream request timeout
customer => preference => recommendation v1 from 'no7area1ID-t35ts': 162
customer => preference => 504 upstream request timeout
customer => preference => recommendation v1 from 'no7area1ID-t35ts': 163
customer => preference => 504 upstream request timeout
customer => preference => recommendation v1 from 'no7area1ID-t35ts': 164
customer => preference => 504 upstream request timeout
customer => preference => recommendation v1 from 'no7area1ID-t35ts': 165
customer => preference => 504 upstream request timeout

To restore normal operation, delete the rule, referencing its YAML file:

$ istioctl delete -f route-rule-recommendation-timeout.yml -n tutorial

In Closing

Istio makes it pretty easy to inject some chaos into an application’s networking. Why would a company like Gremlin publish a post explaining how to do the sorts of things that Gremlin specializes in and sells? We have a few reasons.

First, we believe Chaos Engineering is a vital part of reliable systems engineering. Site Reliability Engineers around the world are learning to test their systems by simulating, in controlled experiments, the types of failures that can cause major problems in real-world deployments if they are not found and mitigated. We believe in the cause.

Second, most of what this article describes can already be done using Gremlin’s Application-Level Fault Injection (ALFI), which allows you to include a fault-injection mechanism within your application and test regardless of your platform, even in serverless environments. Adding a service mesh to your stack is a large undertaking. In the short term, ALFI can help you get started quickly with Chaos Engineering, and it lets teams with no plans to implement a service mesh run the same types of chaos experiments described here. That is on top of the much broader set of attacks Gremlin offers. In addition, we provide safety features such as the ability to halt a running experiment and roll back to the previous state, and you don’t have to mess with YAML files.

We hope this article has helped to illuminate some of the ways you can test your application’s communication and networking. We used Istio on Kubernetes as an example, but the attacks we describe here can be applied much more broadly and our hope is that these examples help you begin thinking about your application, how it is deployed, and where you may have potential spots for failure that you can test.

