
Shashank Srivastava

last updated on April 27, 2023


The world of server infrastructure has become complex. Microservices and distributed cloud architectures have contributed to this complexity. As complexity has risen, so has the number of failures, and operating in such environments is challenging. Production failures impact both businesses and customers.

Modern organizations have implemented Site Reliability Engineering (SRE) to ensure these outages and downtimes are less frequent and shorter in duration. The cost of downtime for large organizations is estimated at over $100,000 per hour. At Netflix, the SREs needed a proper solution to tackle these unforeseen and volatile failures, and they came up with something called Chaos Engineering.

What is Chaos Engineering?

“Chaos engineering is the discipline of experimenting on a software system in production to build confidence in the system’s capability to withstand turbulent and unexpected conditions.”

Chaos engineering is a controlled experiment to test a system’s resiliency and ability to survive unexpected situations. Such tests simulate real-world failure scenarios to help uncover weaknesses. Randomly self-inflicting outages and disruptive events just to find faults is not the right approach.

We have seen that teams often cannot find blind spots and bottlenecks in a value chain even with good monitoring solutions. With distributed systems, such blind spots are more common. Chaos Engineering has proven effective at preventing downtime and production outages by locating these blind spots before they cause failures.

What is Chaos Engineering not?

People often confuse chaos engineering with breaking stuff in production and antifragility.

“Breaking stuff in production” is not Chaos Engineering. Though it sounds cool, it does not make a lot of sense. From a glass-half-empty angle, breaking things that are not broken is counterintuitive, so we have to look at it from a glass-half-full angle: “Fixing stuff in production” is a better characterization of Chaos Engineering. The whole point is proactively improving the security and availability of a complex system.

Antifragility is also not Chaos Engineering. The distinction is that Chaos Engineering educates human operators about the chaos already inherent in the system so that they become a more resilient team. Antifragility, by contrast, adds chaos to a system in the hope that it will grow stronger in response rather than succumb to it.

The Five Principles of Chaos Engineering

1. Build a hypothesis around steady-state behavior

Build a hypothesis around the steady state of the current system. The steady state is how the system is expected to behave and the standard metrics it must generate.

For example, a simple hypothesis can be:

H0: Under X conditions, end customers will have business as usual.
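
A hypothesis like H0 becomes testable once "business as usual" is tied to measurable metrics. The following is a minimal sketch in Python, assuming two illustrative metrics (request success rate and 99th percentile latency); the class name, field names, and threshold values are hypothetical and not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """Hypothetical steady-state definition: bounds on customer-facing metrics."""

    min_success_rate: float    # fraction of requests that must succeed
    max_p99_latency_ms: float  # upper bound on 99th percentile latency

    def holds(self, success_rate: float, p99_latency_ms: float) -> bool:
        """Return True when the observed metrics stay within the bounds."""
        return (
            success_rate >= self.min_success_rate
            and p99_latency_ms <= self.max_p99_latency_ms
        )


# H0: under X conditions, end customers still see business as usual.
# The thresholds are assumed values chosen for illustration.
h0 = SteadyState(min_success_rate=0.999, max_p99_latency_ms=300.0)

print(h0.holds(success_rate=0.9995, p99_latency_ms=220.0))
print(h0.holds(success_rate=0.9900, p99_latency_ms=220.0))
```

The first observation stays inside both bounds, so the hypothesis holds; the second drops below the success-rate bound, so it does not.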

2. Vary real-world events

To simulate real-world events, one must go beyond easy routes such as terminating an instance, filling up disk space, or turning off the network. The SREs must look at variables from an end user’s perspective.

3. Run experiments in production

Experimenting only in staging will not instill confidence in the production servers. This doesn’t mean one must begin experimenting in production; it is wise to start with a staging system and gradually move to production.

4. Automate experiments to run continuously

Humans can only cover so much. Hence automated bots must run experiments continuously to cover as many scopes and possibilities as they can.

5. Minimize blast radius

Chaos engineering is a controlled experiment, so experiments must be constructed in a way that the impact of a disproved hypothesis on customer traffic in production is minimal.
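
One way to keep the blast radius small is to target only a small share of instances and to stop as soon as customer impact crosses a limit. The sketch below illustrates that idea; the function names, the 5 percent share, and the 1 percent abort threshold are assumptions made for the example, not recommendations from the original article.

```python
import random


def pick_targets(instances, percent, seed=0):
    """Choose a small subset of instances to experiment on.

    `percent` is an integer share of the fleet (1 to 100). At least one
    instance is always selected so the experiment can run at all.
    """
    if not 1 <= percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    count = max(1, len(instances) * percent // 100)
    return random.Random(seed).sample(instances, count)


def should_abort(error_rate, abort_threshold=0.01):
    """Stop the experiment once the observed error rate exceeds the threshold."""
    return error_rate > abort_threshold


# Hypothetical fleet of 40 instances; only 5 percent of them are targeted.
fleet = [f"instance-{i}" for i in range(40)]
targets = pick_targets(fleet, percent=5)

print(len(targets), "of", len(fleet), "instances targeted")
print("abort:", should_abort(error_rate=0.02))
```

Using a fixed seed keeps the selection reproducible, which makes it easier to compare repeated runs of the same experiment.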

Steps to perform Chaos Engineering

There are four basic steps that an organization can use for each chaos experiment.

    1. Start by defining “steady state” as some measurable output of a system that indicates normal behavior.

    2. Hypothesize that this steady state will continue in both the control and experimental groups.

    3. Introduce variables that reflect real-world events like servers that crash, hard drives that malfunction, network connections that are severed, etc.

    4. Try to disprove the hypothesis by looking for a steady-state difference between the control and experimental groups.
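
The four steps can be sketched as a small simulation. This is only an illustration under stated assumptions: the service is simulated with random numbers, and the baseline failure rate, the effect of the injected fault, and the tolerance are all made-up values. A real experiment would read these metrics from monitoring instead.

```python
import random


def success_rate(rng, requests, failure_rate):
    """Simulate `requests` calls and return the fraction that succeed."""
    successes = sum(1 for _ in range(requests) if rng.random() >= failure_rate)
    return successes / requests


def run_experiment(added_failure_rate, requests=10_000, tolerance=0.02, seed=42):
    """Compare a control group with an experimental group that has a fault injected."""
    rng = random.Random(seed)
    baseline_failure_rate = 0.01  # assumed normal failure rate

    # Step 1: steady state is measured as the request success rate.
    # Step 2: the hypothesis is that both groups keep that steady state.
    control = success_rate(rng, requests, baseline_failure_rate)

    # Step 3: introduce a variable (simulated here as extra failures).
    experimental = success_rate(
        rng, requests, baseline_failure_rate + added_failure_rate
    )

    # Step 4: look for a steady-state difference between the two groups.
    difference = control - experimental
    return {
        "control": control,
        "experimental": experimental,
        "hypothesis_holds": difference <= tolerance,
    }


print(run_experiment(added_failure_rate=0.05))
print(run_experiment(added_failure_rate=0.0))
```

When the simulated fault adds noticeably more failures than the tolerance allows, the hypothesis is disproved, which is exactly the kind of finding an experiment is meant to surface.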

Why Chaos Engineering?

Chaos Engineering is like the science of vaccination. A controlled form of the disease-causing virus is injected into the body so that it develops resistance and an immune response ahead of an actual infection. In the same way, we use chaos engineering to build resilience in our servers by inflicting controlled harm like latency, CPU failure, or network black holes to find and mitigate weaknesses and blind spots in our systems.

As with a fire drill or a disaster recovery drill, teams become aware of the first aid actions and are trained to implement fixes at the earliest.

According to the 2021 State of Chaos Engineering report, the most common outcomes are increased availability, lower mean time to resolution (MTTR), lower mean time to detection (MTTD), fewer bugs shipped to product, and fewer outages. Teams that frequently run Chaos Engineering experiments are more likely to have >99.9% availability.
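
To put an availability figure such as 99.9% in perspective, it helps to convert it into the downtime it allows. The snippet below is plain arithmetic over a 365-day year; the function name is just for illustration.

```python
HOURS_PER_YEAR = 365 * 24


def allowed_downtime_hours(availability_percent, period_hours=HOURS_PER_YEAR):
    """Return the downtime, in hours, that an availability target permits."""
    return period_hours * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(target)
    print(f"{target}% availability allows about {hours:.2f} hours of downtime per year")
```

At 99.9% availability, the yearly downtime budget is roughly 8.76 hours, so even a handful of long outages can use it up.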

Evaluating the ROI of Chaos Engineering

Chaos Engineering is a practical approach that focuses on adding value to businesses, but that value is also one of the most challenging things to prove. Mitigating an unforeseen issue is proactive and visionary, yet quantifying the business benefit of a successful Chaos Engineering project is difficult.

Netflix uses the Kirkpatrick Model as one way to evaluate ROI. The model has been around since the 1950s, and the most common iteration can be deconstructed into the following four levels:

    • Level 1: Reaction
    • Level 2: Learning
    • Level 3: Transfer
    • Level 4: Results

The levels are a progression, where Level 1 is relatively simple and low value, while Level 4 is often difficult to implement but high value. 

Level 1: Reaction

At the first level, the stakeholders are asked objectively whether the chaos projects were beneficial. They are also asked if they will be happy to expand the current project, and positive answers show the basic effectiveness of the project.

Level 2: Learning 

At this level, we need to establish proof that the stakeholders have learned something. Their discoveries must be listed. For example, “team learning that production has an unanticipated dependency on a particular Kubernetes cluster in staging.”

Level 3: Transfer

With an exhaustive list of discoveries, we can prioritize them and perform hypothesis testing.

Level 4: Results

Assess the hypothesis with actual data to support it or reject it. A supported hypothesis shows that unforeseen incidents have been thwarted and demonstrates the experiment’s success.

Next, we can correlate the assessment with a business outcome to estimate the value of the experiment.

Subjective Benefits of Chaos Engineering

  • Increase resiliency and reliability. 

Chaos testing enriches the organization’s intelligence about how the software performs under stress and how to make it more resilient.

  • Accelerate innovation

Intelligence from chaos testing funnels back to developers who can implement design changes that make the software more durable and improve production quality.

  • Advance collaboration. 

Developers aren’t the only group to see advantages. The insights engineers gain from their experiments elevate the expertise of the technical group, leading to faster response times and better collaboration.

  • Speed incident response. 

Teams can speed up troubleshooting, repairs, and incident response by learning what failure scenarios are possible.

  • Improve customer satisfaction. 

Increased resilience and faster response times mean less downtime. Greater innovation and collaboration from development and SRE teams mean better software that meets new customer demands quickly with efficiency and high performance.

  • Boost business outcomes. 

Chaos testing can extend an organization’s competitive advantage through faster time-to-value, saving time, money, and resources, and producing a better bottom line.

Chaos Engineering Tools to get you started

  • Chaos Monkey: The epicenter of chaos engineering. Chaos Monkey is still maintained by Netflix. It is integrated into Spinnaker, which helps release software changes rapidly and reliably.

Take a look at how it is done:

Automate Application Reliability Assessment with Chaos Monkey

  • Mangle: Lets you run chaos engineering experiments against applications and infrastructure components to quickly assess resiliency and fault tolerance. It is designed to introduce faults with minimal pre-configuration and supports a wide range of tooling, including K8S, Docker, vCenter, or any remote machine with SSH enabled.
  • Gremlin: Founded by former Netflix and Amazon engineers who productized Chaos as a Service (CaaS). Gremlin is a paid service that gives you a command-line interface, an agent, and an intuitive web interface for setting up chaos experiments in no time. Don’t worry: a big red HALT button makes it simple for Gremlin users to roll back experiments if an attack negatively impacts the customer experience.
  • Chaos Toolkit: An open-source project that tries to make chaos experiments easier by creating an open API and a standard JSON format to describe experiments. There are many drivers for running experiments against AWS, Azure, Kubernetes, PCF, and Google Cloud. It also includes integrations for monitoring systems and chat, such as Prometheus and Slack.
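
To give a feel for the JSON format Chaos Toolkit uses, the sketch below builds a small experiment in Python and prints it as JSON. The top-level keys (steady-state-hypothesis, method, rollbacks) and the probe and action layout are written from memory of the Chaos Toolkit experiment format and should be checked against its reference documentation. The health-check URL, the script path, and the instance name are hypothetical placeholders.

```python
import json

# Hypothetical experiment: every URL, path, and name below is a placeholder.
experiment = {
    "version": "1.0.0",
    "title": "Service stays available when one instance is stopped",
    "description": "Illustrative experiment with placeholder values.",
    "steady-state-hypothesis": {
        "title": "Health endpoint answers with HTTP 200",
        "probes": [
            {
                "type": "probe",
                "name": "health-endpoint-responds",
                "tolerance": 200,  # expected HTTP status code
                "provider": {
                    "type": "http",
                    "url": "http://localhost:8080/health",
                },
            }
        ],
    },
    "method": [
        {
            "type": "action",
            "name": "stop-one-instance",
            "provider": {
                "type": "process",
                "path": "./stop-instance.sh",  # hypothetical script
                "arguments": "instance-1",
            },
        }
    ],
    "rollbacks": [],
}

print(json.dumps(experiment, indent=2))
```

Saved to a file such as experiment.json, a document like this would typically be executed with the toolkit's `chaos run` command.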


Production systems have grown complex with the advent of distributed infrastructure and microservices. This has made detecting system failure much more difficult. So to prevent failures from happening, we all need to be proactive in our efforts to learn from failure.

Chaos Engineering is not for everyone. Enterprises with a smaller footprint may steer away from engaging in Chaos activities. The risk-to-reward ratio is not favorable when your infrastructure is not spread across geographies and every second of uptime counts. It is advisable mainly for large tech companies with distributed systems and microservices architectures to rely on Chaos Engineering to improve the stability of their systems.

Shashank Srivastava

As a Country Manager, Sales & Marketing (ROW) at OpsMx, Shashank is responsible for revenue for Europe, Middle East and Asia Pacific. He is also responsible for Product Marketing and Strategic Partnerships. Shashank brings in over 20 years of experience in selling and marketing technology / software solutions. Over these years he has led teams for marketing, sales, business development and field operations. He has successfully driven several strategic initiatives within startup environments.


