This blog continues the Autopilot story and discusses how you can reduce the risk of releases by augmenting an existing monitoring platform like Datadog. Autopilot provides real-time risk assessment of releases before code is deployed into production, and it can deny releases that fail a minimum threshold.
Once Autopilot is configured, it automatically fetches logs and metrics from applications and pipelines. During the execution of a pipeline, it can compare the risk score of a new release against a baseline run to assert the quality of the release. Autopilot then determines whether to promote the new update fully to production or push it back to the developer for debugging. The log analysis and risk assessment are processed in a matter of seconds and provide automated decisions during the pipeline run.
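The baseline comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `RunScore` type, the `decide` function, and the `min_score` and `max_drop` values are hypothetical and are not part of Autopilot's API.

```python
from dataclasses import dataclass


@dataclass
class RunScore:
    """Risk score for one pipeline run (0 = worst, 100 = best). Hypothetical shape."""
    release: str
    score: float


def decide(baseline: RunScore, candidate: RunScore,
           min_score: float = 80.0, max_drop: float = 5.0) -> str:
    """Return "promote" or "push_back" for the candidate release.

    min_score and max_drop are illustrative values, not Autopilot defaults.
    """
    # Push back if the candidate misses the minimum threshold.
    if candidate.score < min_score:
        return "push_back"
    # Push back if the candidate regresses too far against the baseline run.
    if baseline.score - candidate.score > max_drop:
        return "push_back"
    return "promote"


baseline = RunScore("v1.4.0", 91.0)
print(decide(baseline, RunScore("v1.5.0", 89.0)))  # promote
print(decide(baseline, RunScore("v1.5.1", 83.0)))  # push_back (drop of 8 exceeds 5)
```

Checking both an absolute floor and a relative drop means a release can be rejected for regressing against the baseline even when its own score looks acceptable.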
The AI/ML-enabled intelligence layer in Autopilot uses supervised learning to improve its judgment over time. As SREs evaluate the confidence score of a release, they can change Autopilot’s assessment of the impact of errors and warnings. These inputs act as feedback that helps Autopilot develop a contextual understanding of specific applications and pipelines.
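To make the feedback loop concrete, here is a minimal sketch in which an SRE reclassifies a log pattern and the score changes on the next evaluation. It is a plain lookup table standing in for the supervised model, not Autopilot's actual learning algorithm; the `FeedbackScorer` class, the severity names, and the weights are all assumptions made for illustration.

```python
# Penalty per finding, by severity. Hypothetical values chosen for illustration.
DEFAULT_WEIGHTS = {"critical": 10.0, "error": 5.0, "warning": 1.0, "ignore": 0.0}


class FeedbackScorer:
    """Scores a release from log findings and remembers SRE corrections."""

    def __init__(self):
        # Maps a log pattern to the severity an SRE assigned to it.
        self.overrides = {}

    def record_feedback(self, pattern, severity):
        """Store the SRE's judgment of how much a log pattern matters."""
        if severity not in DEFAULT_WEIGHTS:
            raise ValueError(f"unknown severity: {severity}")
        self.overrides[pattern] = severity

    def score(self, findings):
        """findings: list of (pattern, detected_severity). Returns 0-100, higher is safer."""
        penalty = 0.0
        for pattern, detected in findings:
            # An SRE override, if present, wins over the detected severity.
            severity = self.overrides.get(pattern, detected)
            penalty += DEFAULT_WEIGHTS[severity]
        return max(0.0, 100.0 - penalty)


scorer = FeedbackScorer()
findings = [("ConnectionTimeout", "error"),
            ("DeprecatedConfig", "warning"),
            ("CacheMiss", "error")]
print(scorer.score(findings))  # 89.0

# The SRE decides that CacheMiss is harmless for this application.
scorer.record_feedback("CacheMiss", "ignore")
print(scorer.score(findings))  # 94.0
```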
Why should I consider Autopilot on Datadog?
- Unsuspected log errors may go undetected and cause failed deployments in production. Logs typically comprise 10% of the overall analysis in a software delivery architecture, because they are extracted only for the time frame in which an anomaly occurred. This rests on the assumption that most errors happen only when an anomaly is detected. But exceptions and errors that cause failures in production may have happened at any stage in the pipeline, and they will not appear in the log reports pulled for that anomaly time frame. Autopilot can analyze these logs and metrics in real time to prevent errors in production.
- Autopilot takes actionable metrics and performs coordinated tasks automatically, avoiding delays and human error. Simultaneously, the SRE can be notified to start a rollback that reverts the production server to normal. With Autopilot, we can automate the entire process. Supervised learning can train Autopilot to take recommended actions on different anomalies encountered in the future.
- No need to replace your existing Datadog ecosystem. The beauty of Autopilot is that it can act as a data aggregator and make its decisions without interfering with your existing Datadog architecture. With a simple wizard and an API key, you can pull data from Datadog into the Autopilot engine. This maintains the steady state of your CI/CD pipeline with the added benefit of intelligence from machine learning algorithms.
- Autopilot saves your valuable time by running an unsupervised learning algorithm that identifies which types of logs relate to metrics for a failed release. This gives SREs a quick way to begin their risk assessment journey with no initial data. It is achieved by correlating metrics and logs for a more accurate risk assessment.
- If you are using a different APM tool, Autopilot can aggregate data from it, and from other sources, simultaneously with Datadog.
- 360-degree pipeline visibility in real time.
How does OES extend Datadog to reduce risk in a CI/CD pipeline?
Risk Assessment Example
The following is an actual screenshot of Autopilot’s Risk Assessment Dashboard, where logs from the deployment updates were analyzed for their risk. As defined by the SRE, a success score of 83 causes Autopilot to automatically trigger the pipeline to move forward with deployment.
In case of a critical error, when the risk assessment fails, Autopilot automatically blocks the deployment pipeline from executing and informs the SRE so that they can act on the issue. All of this happens in real time with Autopilot.
In exception cases where Autopilot encounters a metric score or log it has never seen before, the SRE can perform due diligence on the case and assign the relevant action item back to Autopilot so that it is handled automatically in the future.
Continuous Risk Verification with Autopilot
Autopilot introduces “Approval Gates,” which place control over your CI/CD pipeline at critical points. Autopilot collects and presents all data that is relevant to the approval decision, including the assessment of the confidence level. The SRE can evaluate the information and make a decision. Alternatively, the SRE can automate the approval decision when the confidence score is above a configurable threshold. This frees them from the repetitive task of analyzing updates that are clearly either failures or successes.
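A threshold-based gate of this kind might look like the sketch below. The names `GateDecision` and `approval_gate` and both default thresholds are hypothetical. The lower `reject_below` cutoff is an added assumption for handling clear failures, since the text above only mentions a single configurable approval threshold.

```python
from enum import Enum


class GateDecision(Enum):
    AUTO_APPROVE = "auto_approve"
    AUTO_REJECT = "auto_reject"
    MANUAL_REVIEW = "manual_review"


def approval_gate(confidence: float, approve_at: float = 80.0,
                  reject_below: float = 50.0) -> GateDecision:
    """Map a confidence score (0-100) to a gate decision.

    approve_at and reject_below are illustrative, configurable thresholds.
    """
    if not 0.0 <= confidence <= 100.0:
        raise ValueError("confidence must be between 0 and 100")
    if confidence >= approve_at:
        return GateDecision.AUTO_APPROVE   # clear success: no human needed
    if confidence < reject_below:
        return GateDecision.AUTO_REJECT    # clear failure: no human needed
    return GateDecision.MANUAL_REVIEW      # ambiguous: the SRE decides


print(approval_gate(83.0).value)  # auto_approve
print(approval_gate(65.0).value)  # manual_review
print(approval_gate(30.0).value)  # auto_reject
```

Only the middle band reaches a human, which is where the SRE's evaluation of the presented data adds the most value.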
Observability of enterprise-wide software verification
Autopilot provides an enterprise-wide historical analysis of risk scores for past deployments. For example, it provides an application-centric time-series view of risk scores along with their respective canary IDs. It also provides a service-centric, in-depth analysis of each risk, reporting the number of critical errors, warnings, and exceptions in logs across the chosen time period.
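The two views described above, an application-centric time series and per-service totals over a time period, can be sketched as follows. The `CanaryRecord` fields, the function names, and the sample data are all made up for illustration and do not reflect Autopilot's data model.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass
class CanaryRecord:
    """One past deployment analysis. Hypothetical shape."""
    canary_id: int
    application: str
    service: str
    finished_at: datetime
    risk_score: float
    critical: int
    warnings: int
    exceptions: int


def application_series(records, application):
    """Time-ordered (finished_at, canary_id, risk_score) rows for one application."""
    rows = sorted((r for r in records if r.application == application),
                  key=lambda r: r.finished_at)
    return [(r.finished_at, r.canary_id, r.risk_score) for r in rows]


def service_totals(records, start, end):
    """Sum critical errors, warnings, and exceptions per service within [start, end]."""
    totals = defaultdict(Counter)
    for r in records:
        if start <= r.finished_at <= end:
            totals[r.service].update(critical=r.critical,
                                     warnings=r.warnings,
                                     exceptions=r.exceptions)
    return {service: dict(counts) for service, counts in totals.items()}


# Made-up sample data.
records = [
    CanaryRecord(101, "checkout", "cart-api", datetime(2021, 3, 1, 10), 91.0, 0, 4, 1),
    CanaryRecord(102, "checkout", "payment-api", datetime(2021, 3, 2, 10), 62.0, 3, 7, 2),
    CanaryRecord(103, "checkout", "cart-api", datetime(2021, 3, 3, 10), 83.0, 1, 2, 0),
]

for _, canary_id, score in application_series(records, "checkout"):
    print(canary_id, score)

print(service_totals(records, datetime(2021, 3, 1), datetime(2021, 3, 31)))
```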
Real-time approvals with policy automation
With the increasing scale and speed of deployment, it is crucial that all updates moving through a CI/CD pipeline pass your organization's policy checks. For example, suppose we have gathered logs of a new release from Datadog after deploying it into the staging environment. Before the release moves into production, the CD pipeline can be configured to perform a runtime policy check, such as whether the deployment falls inside a blackout window or whether the release has proper approvals from the right stakeholders. Autopilot empowers policy managers to define such policies and enforce them in the deployment pipeline through policy gates.
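The two example policies above, a blackout window and required approvals, can be sketched as a single gate function. This is an illustration only: `BlackoutWindow`, `Release`, `policy_gate`, and the approver names are hypothetical and are not how Autopilot defines policies.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class BlackoutWindow:
    """A period during which production deployments are not allowed."""
    start: datetime
    end: datetime

    def contains(self, moment: datetime) -> bool:
        return self.start <= moment < self.end


@dataclass
class Release:
    name: str
    approvals: set = field(default_factory=set)  # who has signed off so far


def policy_gate(release, now, blackouts, required_approvers):
    """Return (allowed, reasons). The release passes only if no policy is violated."""
    reasons = []
    if any(window.contains(now) for window in blackouts):
        reasons.append("deployment falls inside a blackout window")
    missing = sorted(required_approvers - release.approvals)
    if missing:
        reasons.append("missing approvals: " + ", ".join(missing))
    return (not reasons, reasons)


holiday_freeze = BlackoutWindow(datetime(2021, 12, 24), datetime(2021, 12, 27))
release = Release("v2.0.0", approvals={"qa-lead"})
required = {"qa-lead", "release-manager"}

print(policy_gate(release, datetime(2021, 12, 25, 9), [holiday_freeze], required))
# (False, ['deployment falls inside a blackout window', 'missing approvals: release-manager'])

release.approvals.add("release-manager")
print(policy_gate(release, datetime(2021, 12, 28, 9), [holiday_freeze], required))
# (True, [])
```

Returning the list of reasons alongside the verdict lets the pipeline tell the developer exactly which policy blocked the release.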
Learn more in the compliance and audit video here.
Risk Verification Audit Trail with Autopilot
Besides release verification, Autopilot provides detailed audit capabilities. These allow SREs and SecOps teams to view all related deployment activities. This reduces the time and cost of required audits and speeds up troubleshooting of any deployment-related issue by showing the who, what, where, and when of each deployment step.
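As a rough picture of what a who, what, where, and when record could look like, here is a minimal in-memory sketch. The `AuditEvent` and `AuditTrail` classes and the sample entries are hypothetical; Autopilot's actual audit storage and schema are not described in this post.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEvent:
    who: str    # user or system that acted
    what: str   # action taken
    where: str  # pipeline, stage, or environment
    when: str   # ISO 8601 timestamp


class AuditTrail:
    """Append-only list of deployment events (in memory, for illustration)."""

    def __init__(self):
        self._events = []

    def record(self, who, what, where, when=None):
        # Default to the current UTC time when no timestamp is supplied.
        stamp = (when or datetime.now(timezone.utc)).isoformat()
        event = AuditEvent(who, what, where, stamp)
        self._events.append(event)
        return event

    def by_user(self, who):
        """All events performed by one user, in recorded order."""
        return [e for e in self._events if e.who == who]

    def export(self):
        """Serialize the trail as JSON for an auditor."""
        return json.dumps([asdict(e) for e in self._events], indent=2)


trail = AuditTrail()
trail.record("alice", "approved release v2.0.0", "checkout/prod-gate")
trail.record("autopilot", "blocked release v2.0.1", "checkout/staging")
print(len(trail.by_user("alice")))  # 1
print(trail.export())
```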