When managing cloud-native applications, it’s essential to have end-to-end visibility into what’s happening at any given time. This is especially true because of the distributed and dynamic nature of cloud-native apps, which are often deployed using ephemeral technologies like containers and serverless functions.
With so much flux and complexity across a cloud-native system, it’s important to have robust monitoring and logging in place to control and manage the inevitable chaos. This post discusses what we consider to be some of the best practices and standards to follow when logging and monitoring cloud-native applications.
1. Use a Managed Log Management Tool vs building your own infrastructure
First off, logging should reflect your applications. In a world of cloud-native applications, logging solutions should be built on the same principles as high availability, distributed processing, and intelligent failover that consequently lay the foundation for the applications themselves. This is what differentiates modern cloud-native apps from legacy monolithic apps.
The tools to implement this approach include Elasticsearch, Fluentd, Kibana (which, together, are often called the EFK stack), and others. They are architected to handle large-scale data analysis and deliver results in real time. They facilitate complex search queries over data and enable open API-based integration with other tools. However, though the raw materials are available, bringing it all together and making sure it meets your purposes is a whole other challenge.
Rather than build out this system on your own, it makes sense to use a managed logging solution that is built and scaled by a vendor. We go over that in detail in 5 Reasons to Run Elastic Stack in the Cloud. With ready-made integrations, all you need to do is connect your sources and destinations, and you’re all set to analyze application logs the easy way. This leaves you free to spend more time monitoring and logging your application rather than building out logging infrastructure.
Read more about when you should build or buy your DevOps toolchain, including log analysis and monitoring tools.
2. Know What Logs to Monitor, and What Not to Monitor
Know what not to log. Just because you can log something doesn’t mean you should — and logging too much data can make it harder to find the data that actually matters. It also adds complexity to your log storage and management processes because it gives you more logs to manage.
Thus, consider carefully what you actually need to log. Any types of production-environment data that are critical for compliance or auditing purposes should certainly be logged. So should data that helps you troubleshoot performance problems, solve user-experience issues or monitor security-related events.
On the other hand, there are categories of data that you do not need to log, such as data from test environments that are not an essential part of your software delivery pipeline. There are also some kinds of data that you should not log for compliance or security reasons. For example, if a user has enabled a do-not-track setting, you should not log data associated with that user. Similarly, you should avoid logging highly sensitive data, such as credit card numbers, unless you are certain that your logging and storage processes meet the security requirements for that data.
3. Implement a Log Security and Retention Policy
Logs contain sensitive data. A log security policy should review sensitive data – like personal data of your clients or internal access keys for APIs. Make sure that sensitive data gets anonymized or encrypted before you ship logs to any third party. GDPR log management best practices teach you about good practices for data protection of sensitive data and personal data in web server logs. The secure transport of log data to log management servers requires the setup of encrypted endpoints for TLS or HTTPS on client and server side.
Logs from different sources might require different retention times. Some applications are only relevant for troubleshooting for a few days. Security-related or business transaction logs require longer retention times. Therefore, a retention policy should be flexible, depending on the log source.
4. Design a Scalable and Reliable Log Storage
Planning the capacity for log storage should consider high load peaks. When systems run well, the amount of data produced per day is nearly constant and depends mainly on the system utilization and amount of transactions per day. In the case of critical system errors, we typically see accelerated growth in the log volume. If the log storage hits storage limits, you lose the latest logs, which are essential to fix system errors. The log storage must work as a cyclic buffer, which deletes the oldest data first before any storage limit is applied.
Design your log storage so that it’s scalable and reliable – there is nothing worse than having system downtimes and a lack of information for troubleshooting, which in turn, can elongate downtime.
Log storage should have a separate security policy. Every attacker will try to avoid or delete his traces in log files. Therefore you should ship logs in real-time to the central log storage. If the attacker has access to your infrastructure, sending logs off-site, e.g., using a log management toll will help keep evidence untampered.
5. Use the Proper Log Level
Among others, one important piece of information each log entry should include is the appropriate log level since it helps differentiate severe and important events from irregular or even regular events. Log levels will depend on the framework, but ERROR, WARNING, INFO, DEBUG are usually available. You need to be consistent in assigning severity to log messages. Overthewise filtering by severity will not work as expected.
Here are some rules of thumb:
This is an unexpected event and could impact the functionality of your app. It doesn’t necessarily need immediate intervention, but it better be sooner rather than later.
For example, a connection failure between the two services would be an ERROR.
WARN / WARNING
This level includes logs that may become errors and that you detected unusual and unexpected app behavior. You don’t know yet if the problem will persist or not. Still, while they don’t harm or stop the application execution, WARN logs should be investigated.
For example, if login takes too long that could be a WARNING.
These logs reflect normal application behavior. INFO should have long-term value, in that they provide details about what and when something happened.
For example, an item added to the cart could be an INFO message.
As their name suggests, DEBUG logs are particularly useful during debugging. This type of logs are pretty “noisy”, providing more information than what you’d need during development. We recommend you make sure to log only meaningful entries.
For example, a connection successfully established between two services could be a DEBUG message.
6. Create meaningful log messages
Readable and useful log messages are key for faster troubleshooting. If logs contain only some error codes or ‘cryptic’ error messages it can be difficult to understand. As a Developer, you can save your organization a lot of time by providing a meaningful log message.
7. Use structured log formats
The log format should be structured (e.g., JSON or key/value format) having various fields like timestamp, severity, message and any other relevant data fields like process ID, transaction ID, etc. If you don’t use a unique log format for all your applications, normalize the logs in the log shipper. Parse logs and store logs in a structured format.
8. Make log level configurable
Some applications logs are too verbose and other application logs don’t provide enough information about the activities. Adjustable log levels are the key to configure the verbosity of logs. Another topic for log reviews is the challenge to balance between logging relevant information and not exposing personal data or security-related information. If so, make sure that those messages can be anonymized or encrypted.
9. Inspect audit logs frequently
Acting on security issues is crucial – so you should always have an eye on audit logs. Setup security tools such as auditd or OSSEC agents. The tools implement real-time log analysis and generate alert logs pointing to potential security issues. On top of such audit logs, you should define alerts on logs in order to be notified quickly on any suspicious activity. For more details, check out a quick tutorial on using auditd, plus you’ll find some complementary frameworks too.
10. Review & Maintain Your Logs Constantly
Unmaintained log data could lead to longer troubleshooting times, risks of exposing sensitive data or higher costs for log storage. Review the log output of your applications and adjust it to your needs. Reviews should cover usability, operational and security aspects. We recommend you use a checklist for log reviews:
- Is the log message meaningful for users?
- Does the log message include context for troubleshooting?
- Are the log message structured and include
- severity/log level
- additional troubleshooting information in separate fields
- Are 3rd party logs parsed and structured (configure log shipper)?
- Are log levels configurable?
- Does the log message include personal data or security-related data?
- Inspect audit logs and adjust log alert rules
- Setup alerts on logs
11. Don’t do log analysis in a silo: Correlate all data sources
Connect the dots. Logging is one part of an entire monitoring strategy. To practice truly effective monitoring, you need to complement your logging with other types of monitoring like monitoring based on events, alerts and tracing. This is the only way to get the whole story of what’s happening at any point in time. Logs are great for giving you high-definition details on issues, but this is useful only once you’ve seen the forest and are ready to zoom into the trees. Metrics and events at an aggregate level may be more effective, especially when starting to troubleshoot an issue.
Don’t look at logs in a silo — Complement them with other types of monitoring like APM, network monitoring, infrastructure monitoring, and more. See APM vs. Log Management for more details. This also means that the monitoring solution you use should be comprehensive enough to provide all your monitoring information in one place, or flexible enough to easily integrate with other tools that provide this information. This way, as a user, you have a single-pane view of your entire stack.
12. View logging as an enabler of GitOps
For busy DevOps teams, it’s easy to view logging as a nice-to-have, or an add-on that you can embrace once you’ve figured out automated CI/CD pipelines and are releasing more frequently. However, another way to look at logging is to see it as an enabler of DevOps and CI/CD. To practice automation at every step of the development pipeline, you need the visibility to know where issues are introduced, and what the main sources of these issues are — faulty code, dependency issues, external attacks, insufficient resources, or something else. The causes can be innumerable, but logging gives you the insight you need to find and fix these issues.
As continuous integration increasingly becomes about enabling GitOps at the very start of the pipeline, there’s a need to not compromise on quality and security authentication in the name of automation and speed.
13. Get Real-time Feedback on Any Type of Events
Automated testing and new approaches like headless testing are making it possible to get real-time feedback on every single code change in a developer environment, even before a commit. As testing shifts left, and there is an increasing focus on the start of the pipeline, logging is essential to gain visibility and enable GitOps. Without the appropriate testing and logging, you’ll be left with runaway releases and deployment hell.
14. Use Logging to Identify Automation Opportunities & Trends
Logging helps to catch issues early on in the pipeline and saves your team valuable time and energy. It also helps you find opportunities for automation. You can set up custom alerts to trigger when something breaks, and even set up automated actions to be initiated when these alerts are triggered. Whether it’s through Slack, a custom script, or a Jenkins automation plugin, you can drive automation in your GitOps process using logs. For all these reasons, you need to view logging as an enabler and driver of GitOps rather than an add-on.
Conclusion & Next Steps
In conclusion, logging is an essential part of building and managing cloud-native apps. For logging to be successful, it should reflect the state of your applications and be able to scale along with them. Logging should never be done in a silo. This is why a monitoring solution for cloud-native applications should consider other types of monitoring and metrics. Logging can often be viewed as an afterthought, but teams that want to go all the way with GitOps see logging as a driver and enabler of observability, and hence, as indispensable.
Looking for a full-stack monitoring solution? Try Sematext Cloud free for 14 days. Sematext bridges the gap between logs, metrics, real user monitoring and tracing allowing you to benefit from faster actionable insights.
- Java logging tutorial: Basic concepts to get started
- Java logging best practices: 10+ Tips you should know to get the most of your logs
About the Authors
Twain Taylor | Contributor
Twain is a Fixate IO Contributor and began his career at Google, where, among other things, he was involved in technical support for the AdWords team. His work involved reviewing stack traces, and resolving issues affecting both customers and the Support team, and handling escalations. Later, he built branded social media applications, and automation scripts to help startups better manage their marketing operations. Today, as a technology journalist he helps IT magazines, and startups change the way teams build and ship applications.
Stefan Thies | DevOps Evangelist | Sematext
10+ years of work experience as a product manager and pre-sales engineer in the telecommunications industry. Passionate about new software technologies and scalable system architectures. Likes NodeJS for POCs.