Our customer is one of the most important networking and cybersecurity companies in the world. It prides itself on its ability to help customers turn complex problems into leadership opportunities by engineering simplicity into all its products.
Challenge: Remaking an Error-Prone Build Process
A crucial part of our customer’s ability to satisfy its own customers is the innovation inherent in its software. The software teams iterate frequently, so software quality and velocity must be continually improved.
The build process uses Jenkins, and there are thousands of builds every day. Inevitably a small percentage fail, and quickly determining the root cause of the failure is critical to maintaining velocity.
As at many companies, each development team was responsible for its own build environment. The company decided to unleash its SRE team and create a central build service that would support all developers.
This central “Build as a Service” improved efficiency somewhat, but the build error rate was still too high – there were often several dozen build failures each day.
Additionally, diagnosing the root cause of errors took too long – errors typically took several hours to diagnose and correct. For each error, expert engineers needed to gather data on the failure, collaborate with others, diagnose its root cause, correct the error, and coordinate with the teams to re-execute the build.
“We were spending the equivalent of more than five full-time engineers fixing build errors.”
The company decided to tackle the situation head-on and chose OpsMx to assist. They set a goal of improving developer productivity and reducing the triage time for build failures by more than 75%. “Reducing the time that developers spend in the build process is critical for every software organization,” said the director of software engineering. “Any improvement we make will translate directly to higher productivity and improved job satisfaction.”
Solution: Autopilot Resolves Build Errors Fast
OpsMx Autopilot is a machine-learning-based solution designed to reduce the time developers spend resolving build errors.
The customer deployed one of the key modules of OpsMx Autopilot for Jenkins: Build Verification. (Autopilot also includes other capabilities, such as verifying deployments, automating the deployment approval process, governance to ensure that all deployment policies are followed, and enhanced visibility.)
In build verification, Autopilot is invoked automatically after every build (refer to the image below). It gathers logs and uses its natural language processing and machine learning algorithms to determine which builds should be considered failures. Sometimes a failure is due to a short-term infrastructure issue; other times it is due to a problem in the code or an incorrectly specified dependency.
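To make the failure categories above concrete, here is a minimal sketch of a post-build check. It is not how Autopilot works internally: Autopilot is described as using natural language processing and machine learning, whereas this stand-in uses plain keyword rules. The names FailureKind, PATTERNS, and classify_build, and every log pattern, are hypothetical and chosen only for illustration.

```python
import re
from enum import Enum


class FailureKind(Enum):
    NONE = "none"
    INFRASTRUCTURE = "infrastructure"
    DEPENDENCY = "dependency"
    CODE = "code"
    UNKNOWN = "unknown"


# Hypothetical patterns, checked in order; real build logs would need
# patterns derived from the team's own failure history.
PATTERNS = [
    (FailureKind.INFRASTRUCTURE,
     re.compile(r"connection (timed out|refused|reset)|no space left on device", re.I)),
    (FailureKind.DEPENDENCY,
     re.compile(r"could not resolve dependenc|module not found", re.I)),
    (FailureKind.CODE,
     re.compile(r"compilation (error|failed)|syntax error|tests? failed", re.I)),
]


def classify_build(log_text: str, exit_code: int) -> FailureKind:
    """Return a coarse failure category for one finished build."""
    # Assumption: a zero exit code means the build succeeded.
    if exit_code == 0:
        return FailureKind.NONE
    # The first matching rule wins, so infrastructure issues take priority.
    for kind, pattern in PATTERNS:
        if pattern.search(log_text):
            return kind
    return FailureKind.UNKNOWN
```

A rule list like this is easy to audit but only recognizes failures someone has already seen and written a pattern for, which is the gap a learning-based approach is meant to close.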
Simply validating that a build has been completed successfully improves developer productivity. “When our engineers know that they’ll be correctly and instantly notified if there is an issue, it is much easier for them to concentrate on their next tasks,” said the director of software engineering.
In the case of an error, Autopilot identifies the source of the error, dramatically reducing the amount of time required for triage. Autopilot also recommends the corrective action that should be taken, and communicates these instantly to the team through PagerDuty or Microsoft Teams.
As Autopilot continues to be used, it learns through supervised and unsupervised learning algorithms, becoming smarter every day. This increases the time savings that Autopilot generates.
Results: Autopilot Navigates to Success
The company was pleased with the results that they achieved. Autopilot provided significant improvement in triage time, putting the team on track to meet its goal of 75% time savings.
“One of the best outcomes of the project is the amount of time saved by expert team members,” said the director of software engineering.
Because Autopilot determines whether the failure was caused by an anomaly in the infrastructure, engineers aren’t required to troubleshoot these issues. This previously was a real challenge for the team, as infrastructure problems can be fatal yet transient. Autopilot’s ability to identify infrastructure issues creates the possibility of automatically re-executing the build in these situations.
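The idea of automatically re-executing a build after a transient infrastructure failure can be sketched as follows. This is an illustrative assumption about how such a retry loop might look, not Autopilot's implementation; run_with_retries, run_build, and is_transient are hypothetical names, and the caller supplies both the build step and the transient-failure check.

```python
import time
from typing import Callable, Tuple


def run_with_retries(
    run_build: Callable[[], Tuple[bool, str]],
    is_transient: Callable[[str], bool],
    max_attempts: int = 3,
    delay_seconds: float = 0.0,
) -> Tuple[bool, int]:
    """Run a build, retrying only when the failure looks transient.

    run_build returns (succeeded, log_text).
    Returns (succeeded, number_of_attempts_made).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        succeeded, log_text = run_build()
        if succeeded:
            return True, attempt
        # Stop on failures that a retry cannot fix, or when out of attempts.
        if not is_transient(log_text) or attempt == max_attempts:
            return False, attempt
        # Wait a little longer before each successive retry.
        time.sleep(delay_seconds * attempt)
    return False, max_attempts
```

Capping the attempts matters: a failure misread as transient would otherwise re-run indefinitely and hide a real problem from the engineers who need to see it.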
Even in situations where Autopilot is unable to identify the exact root cause, it shortens the triage and troubleshooting process. Using its natural language processing and clustering algorithms, Autopilot makes root cause identification much easier by sifting through large amounts of data and identifying the probable root cause. Further, since Autopilot is continually learning, future instances of any specific build error are tagged with the resolution process and are therefore resolved much more quickly.
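As a rough illustration of how grouping similar log lines narrows down a probable root cause, the sketch below masks variable parts of error lines (numbers and hex addresses) and counts the resulting templates. This is simple template grouping, a much cruder technique than the clustering Autopilot is described as using; normalize and top_error_templates are hypothetical names, and treating any line containing "error" as relevant is an assumption.

```python
import re
from collections import Counter

# Mask hex addresses before plain numbers so "0x1f" is not split apart.
_MASKS = [
    (re.compile(r"0x[0-9a-f]+"), "<HEX>"),
    (re.compile(r"\d+"), "<NUM>"),
]


def normalize(line: str) -> str:
    """Reduce a log line to a template by masking its variable parts."""
    line = line.strip().lower()
    for pattern, token in _MASKS:
        line = pattern.sub(token, line)
    return line


def top_error_templates(log_lines, limit=3):
    """Return the most frequent error templates with their counts."""
    counts = Counter(
        normalize(line) for line in log_lines if "error" in line.lower()
    )
    return counts.most_common(limit)
```

For example, two lines that differ only in a worker number collapse into one template with a count of 2, so a recurring failure stands out from one-off noise.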
Toward Automated Software Delivery, with No Human Intervention
The Autopilot automated approach to build failure analysis and resolution can become a huge component of improving developer and SRE productivity, increasing the rate of innovation. Even though Autopilot usage is relatively new, the system has shown its promise.
“Our software engineers want to spend their time creating new features to help our customers, not diagnosing build problems,” said the software engineering director. “Autopilot is exactly the type of solution we need to reach our developer productivity goals.”
Longer term, the approach has even more possibilities. The company wants to expand machine learning to include a full automatic approval process. “Learning from our experience in build analysis, we believe that we can apply the same approach to production rollout. We would like to get to the point where we don’t need any manual intervention in the deployment process.”
Read more Autopilot user stories:
- Telecom Leader Accelerates Time to Market with OpsMx
- Online Leader Accelerates Software Delivery
- How Customers Improve CI/CD Velocity Using Autopilot
If you want to know more about Autopilot or request a demonstration, please book a meeting with us for an Autopilot demo.
OpsMx is a leading provider of a Continuous Delivery platform that helps enterprises safely deliver software at scale and without any human intervention. We help engineering teams take the risk and manual effort out of releasing innovations at the speed of modern business. For additional information, contact us.