I’ve been working on log analysis at OpsMx for our Autopilot analysis engine, which helps take the guesswork out of releasing new code and other operational changes by analyzing logs and metrics. I am frequently asked how to use Autopilot log analysis most effectively.
Here are my top recommendations for generating logs from an application so that Autopilot (as well as future you and your coworkers) can work with them more effectively.
Autopilot uses several techniques to group log lines into clusters of similar content, and then detects increases in the occurrence of errors, warnings, and other log types. While much of this is automatic and relies on the application’s logging levels, following some best practices when generating log lines reduces the need to guide Autopilot’s decision-making process and allows a more generic model to be applied to the logs.
Here is a shortlist of the top recommendations I’ve collected:
1. Log at appropriate severity level
Logging levels typically include error, warning, info, and debug, and there may be others. These levels are usually described as severity.
Errors should be an indication that the application itself is broken in some way — either a required upstream component is not responding, or some other operational or code error is occurring.
Log errors at the error level, warnings at the warning level, and informational messages at info. Exceptions should be logged at the error level if they genuinely represent an error.
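As a minimal sketch of this idea in Python’s standard `logging` module (the service name and `charge` function are hypothetical):

```python
import logging

logger = logging.getLogger("payment-service")  # hypothetical service name

def charge(order_id: str, gateway_up: bool) -> bool:
    """Log each outcome at the level that matches its severity."""
    if not gateway_up:
        # A required upstream component is unresponsive: the application
        # itself is broken, so this is a genuine error.
        logger.error("payment gateway unreachable, order=%s", order_id)
        return False
    # Routine success is informational, not an error or warning.
    logger.info("charge succeeded, order=%s", order_id)
    return True
```

The point is simply that the level is chosen by the nature of the event, not by how much the developer happens to care about it while debugging.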
2. Differentiate between application and transactional errors
Often, an error for a specific transaction may be normal for the application.
Let’s consider an example of parsing incoming JSON in a web request. If it fails to parse, that is an error for the transaction, but it likely is not an error for the application, which will just process the next incoming request as usual.
Logging at the “error” level should generally mean the application itself is experiencing an error, and functionality is compromised. Failure in transactions should be logged at a lower level, and ideally tracked by metrics rather than logs.
3. Exceptions should be exceptional
We will often write a try/catch block that tries some operation and, if it fails, logs details about the failure so it can be debugged later. This is good (and probably worthy of the error level), but other times exceptions are just not exceptional. Those may be better caught and logged at a lower level.
For example, attempting to remove a file that isn’t present may throw an exception. This should likely be converted into a warning, since functionality is probably unaffected, especially if the removal is part of a cleanup process. Logging it at the error level makes automatic detection of real errors more difficult.
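The file-removal case can be sketched like this (the function name is illustrative):

```python
import logging
import os

logger = logging.getLogger("cleanup")

def remove_if_present(path: str) -> bool:
    """Delete a file during cleanup; a missing file is expected, not an error."""
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        # The end state we wanted (file gone) is already true, so warn at most.
        logger.warning("cleanup: %s was already gone", path)
        return False
```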
4. Put variable parts last
While Autopilot can handle variable components in log lines, such as “user mark logged in” vs. “user mary logged in,” including a long variable phrase in the middle of a log message makes the analysis less accurate.
Avoid: “Parser: unable to parse ‘this is a very long string that doesn’t parse’ into an object.”

Prefer: “objectParser: unable to parse: ‘this is a very long string that doesn’t parse.’”
And of course, log at the correct level.
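One way to keep the stable text first, sketched in Python (the logger name follows the example above):

```python
import logging

logger = logging.getLogger("objectParser")

def report_parse_failure(payload: str) -> str:
    # The constant text comes first so clustering sees one stable message
    # shape; the variable payload is appended last.
    message = "unable to parse input: %r" % payload
    logger.warning(message)
    return message
```

Every failure then shares the same prefix, and the variable part sits in a predictable position at the end of the line.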
5. Generally, be silent
In operational situations, it is often good to log sparingly or to disable ‘info’ and below. Informational messages help with debugging, so there is a temptation to include them and log all kinds of things. However, this not only affects application speed, it also makes detecting important messages more difficult. It may be better to move some of those less critical ‘info’-level messages to ‘debug.’
Additionally, storage of logs can be costly. Keeping unnecessary logs that do not help operationally can cost real money.
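A sketch of suppressing ‘info’ and below in production, using Python’s standard logging configuration:

```python
import logging

# In production, raise the threshold so 'info' and 'debug' messages are
# dropped before any handler formats or ships them.
logging.basicConfig(level=logging.WARNING)

logger = logging.getLogger("app")
logger.debug("cache warm-up details")     # suppressed
logger.info("request handled")            # suppressed
logger.warning("retrying upstream call")  # emitted
```

Because the check happens before formatting, the suppressed calls are also cheap at runtime, which addresses the speed concern as well as the storage one.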
I’ve presented some logging best practices, which will make your application’s logs more useful to you, your coworkers, and Autopilot’s log analysis. Feel free to send me any other logging tips and tricks, and I’ll include them in a future post.