Public cloud adoption has become the norm in enterprises, alongside the growing trend of building container-based, cloud-native applications. Given this disruption to IT infrastructure, developers need to ensure that their applications stay reliable during unplanned outages in the cloud infrastructure. In this blog, we will talk about how to set up and analyze application reliability using Chaos Monkey and the Spinnaker platform.
Chaos Monkey, a tool built by the Netflix OSS team, is best known for creating random disruptions in your application to help you test the reliability of your services. It is an example of a tool that follows the Principles of Chaos Engineering.
Chaos Monkey is fully integrated with Spinnaker, the continuous delivery platform that is increasingly used by enterprises such as Intuit, Target, and Waze. Chaos Monkey works with any backend that Spinnaker supports (AWS, GCP, Azure, Kubernetes, etc.).
Enabling Chaos Monkey in Spinnaker
To enable Chaos Monkey in Spinnaker, issue the following hal command. (If you need help setting up Chaos Monkey itself, check this documentation.)
Ubuntu# hal config features edit --chaos true
Enabling Chaos Monkey for an Application
Once Chaos Monkey is enabled for the Spinnaker instance, it is enabled for all new applications by default. You can still enable or disable Chaos Monkey for any individual application on the application configuration page by clicking the Chaos Monkey radio button.
The following figure shows the detailed Chaos Monkey configuration for an application.
Termination Frequency lets the application owner specify how often instances are terminated (the terminations themselves are scheduled at random times). Currently, it is not possible to schedule more than one termination per day per grouping. If you need to test multiple terminations across the application's services, choose a narrower grouping (cluster).
The Grouping configuration lets the owner select the scope within which Chaos Monkey picks an instance to terminate.
- Application grouping results in one instance termination (max 1 per day) for the entire application (including all pipelines and stacks), per region if “Regions are Independent” is checked.
- Stack grouping results in one instance termination (max 1 per day) for each stack, per region. (Stacks are vertical stacks of dependent services for integration testing; a stack can be assigned when a cluster is created.)
- Cluster grouping results in the most terminations, because Chaos Monkey terminates an instance for each cluster configured across all the pipelines of the application.
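To make the difference between the three groupings concrete, here is a minimal Python sketch that buckets a made-up fleet into termination groups. It is only an illustration of the rules described above, not Chaos Monkey's actual implementation: the `Instance` record, the `termination_groups` function, and the sample cluster names are all hypothetical, and the sketch assumes the region setting applies uniformly to every grouping.

```python
from collections import defaultdict
from typing import NamedTuple


class Instance(NamedTuple):
    # Hypothetical record; field names are illustrative, not Spinnaker's schema.
    instance_id: str
    stack: str
    cluster: str
    region: str


def termination_groups(instances, grouping, regions_independent=True):
    """Bucket instances into groups; at most one instance per group
    is eligible for termination per day."""
    groups = defaultdict(list)
    for inst in instances:
        if grouping == "app":
            key = ("app",)
        elif grouping == "stack":
            key = ("stack", inst.stack)
        elif grouping == "cluster":
            key = ("cluster", inst.cluster)
        else:
            raise ValueError(f"unknown grouping: {grouping}")
        # Assumption: independent regions split every group by region.
        if regions_independent:
            key += (inst.region,)
        groups[key].append(inst.instance_id)
    return dict(groups)


# Made-up fleet for one application.
fleet = [
    Instance("i-1", "prod", "shop-prod-api", "us-west-2"),
    Instance("i-2", "prod", "shop-prod-web", "us-west-2"),
    Instance("i-3", "staging", "shop-staging-api", "us-west-2"),
    Instance("i-4", "prod", "shop-prod-api", "us-east-1"),
]

for grouping in ("app", "stack", "cluster"):
    count = len(termination_groups(fleet, grouping))
    print(f"{grouping}: up to {count} terminations per day")
```

For this sample fleet the number of groups grows from application to stack to cluster, which is why the cluster grouping produces the most terminations.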
It is also possible to configure exceptions that prevent Chaos Monkey from terminating instances in specific regions or stacks for business reasons.
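As a rough illustration of how such exceptions behave, the Python sketch below checks a region and stack against a list of exclusion rules. The `ExceptionRule` shape, the `is_excluded` helper, the sample region and stack names, and the use of `"*"` as a match-anything wildcard are assumptions made for this example. Refer to the Chaos Monkey documentation for the real configuration fields.

```python
from typing import NamedTuple


class ExceptionRule(NamedTuple):
    # Hypothetical shape: "*" is assumed to match any value.
    region: str = "*"
    stack: str = "*"


def is_excluded(region, stack, rules):
    """Return True if any rule matches both the region and the stack."""
    def matches(pattern, value):
        return pattern == "*" or pattern == value

    return any(
        matches(rule.region, region) and matches(rule.stack, stack)
        for rule in rules
    )


rules = [
    ExceptionRule(region="eu-west-1"),                   # protect a whole region
    ExceptionRule(region="us-west-2", stack="billing"),  # protect one stack in one region
]

print(is_excluded("eu-west-1", "prod", rules))     # matched by the region-only rule
print(is_excluded("us-west-2", "billing", rules))  # matched by the region + stack rule
print(is_excluded("us-west-2", "prod", rules))     # no rule matches
```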
For information about configuration options of Chaos Monkey, check out the documentation.
To check the terminations that Chaos Monkey has scheduled, inspect its cron file as shown below.
ubuntu:/apps/chaosmonkey$ cat /etc/cron.d/chaosmonkey-daily-terminations
2 17 7 12 4 root /apps/chaosmonkey/chaosmonkey-terminate.sh openshiftapp my-k8s-chaos-account --cluster=openshiftapp-chaos-demo --region=default
42 21 7 12 4 root /apps/chaosmonkey/chaosmonkey-terminate.sh chaosmonkeyapp my-aws-account --cluster=chaosmonkeyapp --region=us-west-2
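Each entry above uses the standard cron layout: minute, hour, day of month, month, and day of week, followed by the user and the command. If you want to list the schedule in a friendlier form, a small Python sketch like the following can parse one entry. The `ScheduledTermination` type and the `parse_termination_line` helper are hypothetical names for this example, and the sketch assumes every entry follows the same layout as the ones shown here, with the application and account as the first two script arguments.

```python
from typing import NamedTuple


class ScheduledTermination(NamedTuple):
    minute: int
    hour: int
    day: int
    month: int
    app: str
    account: str
    options: dict


def parse_termination_line(line):
    """Parse one cron entry: five time fields, the user, the script path,
    the app, the account, then --key=value flags."""
    fields = line.split()
    minute, hour, day, month = (int(f) for f in fields[:4])
    # fields[4] is the day of week, fields[5] the user, fields[6] the script.
    app, account = fields[7], fields[8]
    options = {}
    for flag in fields[9:]:
        key, _, value = flag.lstrip("-").partition("=")
        options[key] = value
    return ScheduledTermination(minute, hour, day, month, app, account, options)


sample = (
    "42 21 7 12 4 root /apps/chaosmonkey/chaosmonkey-terminate.sh "
    "chaosmonkeyapp my-aws-account --cluster=chaosmonkeyapp --region=us-west-2"
)
entry = parse_termination_line(sample)
print(
    f"{entry.app} ({entry.account}): month {entry.month}, day {entry.day}, "
    f"{entry.hour:02d}:{entry.minute:02d}, "
    f"cluster={entry.options['cluster']}, region={entry.options['region']}"
)
```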
Enabling Automatic Application Reliability Analysis
Now that Chaos Monkey is enabled and application instances are getting terminated, it is essential to analyze and measure the reliability of your applications. You can analyze the failures in the pre-production and the production environments. Since the terminations are set to occur randomly, it is critical to automatically analyze the application during those times.
The OpsMx Continuous Risk Assessment platform integrates with Spinnaker and Chaos Monkey to trigger an automatic application risk assessment as soon as a Chaos Monkey event occurs, and it provides a detailed evaluation of the application's reliability and behavior each time.