Improve Release Safety and Diagnostics Through Automated Canary Analysis for Spinnaker

Vardhan NS

Jun 16, 2017

last updated on November 29, 2021

Introduction

Spinnaker is a continuous delivery platform that helps enterprises release software faster and reach a release velocity that was previously out of reach. The key to increasing that velocity is the ability to determine with confidence that a new release can be promoted across testing stages and eventually to production through Canary, Red/Black (also known as blue-green), or rolling update release strategies.

Leading enterprises such as Netflix, which deploys more than 4,000 updates a day, have proprietary decision engines that let them promote builds to production with confidence. Most enterprises, however, still depend on manual analysis and judgment to promote builds.

Manual judgment is error-prone because decisions are based on incomplete analysis, and it is time-consuming because the analysis is laborious. Bad builds in production introduce significant risk through business disruption and brand damage.

OpsMx Enterprise for Spinnaker is a real-time analytics platform for CI/CD pipelines, designed to aid the manual decision to promote a build across test stages and into production. The OpsMx solution helps reduce errors and diagnostic time through complete, consistent, real-time automated analysis for Spinnaker.

Practices for Promoting Builds To Production Today

Before we look at the challenges and risks, let us review some common enterprise practices for promoting builds to production:

- Checking key service performance metrics (e.g., orders or users served by the new instances, or latency and error rates being consistent with the baseline version).
- Ensuring that no system SLA violation alerts occur with the new release of the service.
- Running additional checks with custom scripts created for each service.
- Rolling out the new release during a slow time of day (at night or on holidays) or to a less critical customer base, such as overseas countries, low-volume sites, or backup sites.
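The first practice above, checking that latency and error rates stay consistent with the baseline, can be sketched in a few lines of Python. This is only an illustration of the kind of hand-written check teams maintain today, not how OpsMx or Spinnaker performs the comparison. The function name `within_tolerance`, the 10% tolerance, and the sample values are all assumptions made for the example.

```python
from statistics import mean

def within_tolerance(baseline, canary, max_relative_increase=0.10):
    """Return True if the canary mean is no more than the allowed
    fraction above the baseline mean (10% by default, an assumed value)."""
    base_avg = mean(baseline)
    canary_avg = mean(canary)
    if base_avg == 0:
        # Avoid dividing by zero: only an equally clean canary passes.
        return canary_avg == 0
    return (canary_avg - base_avg) / base_avg <= max_relative_increase

# Hypothetical samples collected from baseline and new instances.
baseline_latency_ms = [120, 118, 125, 121]
canary_latency_ms = [124, 126, 122, 128]
baseline_error_rate = [0.010, 0.012, 0.011]
canary_error_rate = [0.020, 0.024, 0.022]

latency_ok = within_tolerance(baseline_latency_ms, canary_latency_ms)
errors_ok = within_tolerance(baseline_error_rate, canary_error_rate)

print("latency consistent with baseline:", latency_ok)
print("error rate consistent with baseline:", errors_ok)
```

With these sample values the latency check passes (about 3% higher than the baseline) while the error-rate check fails (the rate has doubled). A check like this has to be written and tuned per metric and per service, which is where the manual approach starts to strain.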

Figure 1: Challenging Environment for Ops Teams to Validate Builds

The core philosophy of these practices is to reduce the impact of bad deployments: how soon can one find out that a new service update is bad, and how soon can one roll back without causing too much business disruption? However, Ops teams face significant challenges in reliably validating builds, as shown in Figure 1.

Risks with Incomplete Analysis With Current Manual Judgment Process

Relying on manual judgment to decide which new releases are deployed into production introduces considerable business risk. Manual analysis is inherently incomplete because of the complexity of the services, their interactions, the changes in this build compared to the previous one, and the sheer volume of metrics collected for any build during the various pipeline stages. The incomplete analysis can be viewed along three specific dimensions, as shown in Figure 2.

Figure 2: Manual Analysis is Not Scalable and is Incomplete

- Metrics: Manual analysis, as indicated earlier, can look at crucial system metrics, but a typical service emits 1000+ metrics for a build. It is humanly impossible to consistently detect deviations and trends across all of them and to judge how relevant each metric is to your business requirements.
- Application Complexity: Applications built from multiple components (in-house, open source, or third-party software) are challenging to analyze manually. Application behavior is also unique and continually evolving. It takes experts or application architects to understand the nuances of each service and of the application overall. For open source or third-party services, it is hard to find an in-house expert who understands the expected behavior of the application across versions.
- The Rate of Change: Manual analysis may be sufficient initially, but as the rate of application and service changes increases, it tends not to keep up with changes in application behavior. The analysis becomes less reliable over time, and eventually bad builds are likely to be promoted to production, causing disruptions. Enterprises are increasingly moving toward multiple updates in a single day, and even less dynamic organizations need weekly updates to their applications.

Such incomplete analysis leads to the following issues with release validation:

- It is error-prone.
- It is time-consuming.
- It is expensive to create a custom analysis for every new service.
- It is difficult to debug the root cause of the issues that are found.

These issues can cause significant business loss. Recent examples of improperly approved software versions with bugs include the grounding of American Airlines flights for 6 hours and the case of Starbucks losing millions in revenue. There is a need for a more reliable, data-driven approach that ensures consistency and reduces errors, improving safety and ease of diagnostics for Spinnaker and minimizing the business risk of new builds.

Improving Safety with Automated Canary Analysis

OpsMx Enterprise for Spinnaker is a CI/CD analytics platform that gives DevOps engineers an automated, real-time, actionable risk assessment so they can make a reliable judgment about deploying a new release to production. OpsMx validates the new release of a service by comparing it to the baseline or production release. It uses machine learning and artificial intelligence (AI) techniques to analyze thousands of metrics and to perform in-depth analysis of architectural regressions, performance, scalability, and security violations in new releases, in a way that scales for enterprises. OpsMx integrates with Spinnaker through the existing canary analysis service APIs. OpsMx addresses three prevalent use cases with Spinnaker:

Automated Canary Analysis

Figure 3: Automated Canary Analysis in Spinnaker

Automated Canary Analysis is the best-known use case for enterprises that are interested in canary deployment to reduce risk. If the canary deployment analysis fails (Figure 3), the pipeline execution terminates so that the release can be diagnosed further.

Red/Black Deployment Analysis

Figure 4: Automated Red/Black Analysis in Spinnaker

Red/Black is the most traditional deployment option in Spinnaker. In this case, OpsMx compares the new release with the production or baseline release. If the Red/Black deployment analysis fails (Figure 4), the release is rolled back, either manually or automatically.

Staging or Testing Deployments Analysis

Figure 5: Automated Staging Deployment Analysis for Spinnaker

In many cases, it is safest to avoid exposing a bad release in production even for a few hours, so performing the analysis in the staging environment is preferred. If the release passes, it is promoted to production via Red/Black deployment. If it fails in staging (Figure 5), the bad deployment is averted without any exposure to production traffic. OpsMx provides detailed diagnostics for each analyzed stage to help diagnose issues in the release further, as shown in Figure 6.

Figure 6: OpsMx Risk Assessment Report and Diagnostics

OpsMx Automated Canary Analysis Benefits

Validate and approve builds with low risk to production: With the OpsMx build risk assessment report for a new version of a service, the Ops team has an accurate automated report on the safety and readiness of the build. If the safety score is above the pass threshold, the Ops team can promote the build for further deployment. OpsMx compares the current build to the characteristics of the production baseline, so the score reflects the risks of the new build. OpsMx can be configured to perform real-time canary, Red/Black, or staging/testing analysis and provide safety scores.
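To make the idea of a safety score and a pass threshold concrete, here is a minimal Python sketch. It assumes, purely for illustration, that the score is the percentage of analyzed metrics whose behavior matched the baseline. The article does not describe how OpsMx actually computes its score, and the names `safety_score` and `decide`, the threshold of 75, and the metric names are hypothetical.

```python
def safety_score(metric_passed):
    """Percentage (0-100) of metrics whose canary behavior matched the baseline.

    This simple ratio is an assumed scoring rule for illustration only.
    """
    if not metric_passed:
        raise ValueError("no metrics were analyzed")
    passed = sum(1 for ok in metric_passed.values() if ok)
    return 100.0 * passed / len(metric_passed)

def decide(score, pass_threshold=75.0):
    """Promote only when the score is above the pass threshold."""
    return "promote" if score > pass_threshold else "stop"

# Hypothetical per-metric verdicts for one build.
verdicts = {
    "latency_p99": True,
    "error_rate": True,
    "cpu_percent": True,
    "heap_mb": False,
    "gc_pause_ms": True,
}

score = safety_score(verdicts)
decision = decide(score)
print(f"safety score: {score:.1f}, decision: {decision}")
```

Here four of the five metrics pass, so the score is 80.0 and the build is promoted. In a pipeline, a "stop" decision would correspond to the stage failing and the execution terminating for further diagnostics.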

Identify the root cause of issues with the build: The OpsMx risk assessment report provides a detailed sub-score for the components of each build across various metric groups. If any significant deviations or issues are found between the current build and the baseline version, OpsMx automatically flags them and provides root cause analysis, including the offending code commit. OpsMx performs in-depth analysis, including interactions between services and transactions, to narrow down the problematic service. This fully automated issue and root cause identification saves the Ops team time.

Automated, scalable, and less error-prone: The OpsMx risk assessment report is fully automated and can analyze thousands of metrics for every build through integration with existing monitoring and data collection tools. OpsMx can analyze known services as well as new, unknown services, and it learns each service's characteristics in order to evaluate new builds of that service. Because it is automated, the OpsMx solution is scalable, consistent, and less error-prone, giving Ops engineers a reliable method to assist their judgment of new builds.
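The sub-scores per metric group mentioned above can be illustrated with a short Python sketch that groups per-metric verdicts and points at the weakest group as a starting place for diagnosis. The grouping, the scoring rule, and the names `group_sub_scores` and `weakest_group` are assumptions for the example and do not describe the actual OpsMx report format.

```python
from collections import defaultdict

def group_sub_scores(results):
    """Turn (group, metric_name, passed) tuples into {group: score 0-100}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for group, _metric, passed in results:
        totals[group] += 1
        if passed:
            passes[group] += 1
    return {group: 100.0 * passes[group] / totals[group] for group in totals}

def weakest_group(sub_scores):
    """Return the group with the lowest sub-score."""
    return min(sub_scores, key=sub_scores.get)

# Hypothetical verdicts, one per metric, tagged with a metric group.
results = [
    ("latency", "p50_ms", True),
    ("latency", "p99_ms", False),
    ("errors", "http_5xx_rate", True),
    ("resources", "cpu_percent", True),
    ("resources", "heap_mb", True),
]

sub_scores = group_sub_scores(results)
worst = weakest_group(sub_scores)
print(sub_scores)
print("investigate first:", worst)
```

With this sample input the latency group scores 50.0 while the other groups score 100.0, so latency is the group to investigate first. Because the same routine runs over every metric without per-service scripting, it also hints at why automated analysis scales where manual review does not.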

Summary

OpsMx provides an effective, data-driven solution that automates the real-time judgment of new software releases for Ops teams using Spinnaker in an enterprise. The OpsMx solution integrates with Spinnaker for analysis during Canary, Red/Black, or Staging/Test deployment stages. With it, Ops teams can reliably validate and approve builds with low risk for deployment, scale to validate multiple deployments a day, reduce the time spent on analysis and debugging, and reduce human error in release decisions. Overall, the OpsMx solution lowers the business risk of bad deployments and makes release judgment safer. For more information about the OpsMx solution for Spinnaker or a free trial, fill out the form below or email us at info@opsmx.com.

Tags : Automated Canary Analysis, Continuous Verification

Vardhan NS

Vardhan is a technologist and a marketing professional, currently working as a Sr. PMM at OpsMx. His strength lies in understanding complex technologies and explaining them in uncomplicated ways. Vardhan is a passionate product marketer with a keen focus on content, helping brands position themselves uniquely with clear messaging and competitive differentiation. Outside of work, he is an athlete who is passionate about football, swimming, and surfing.
