In this article I’ll be talking about some techniques for debugging and fault finding in microservices architectures.
One of the issues we’ve had for a long time, in fact ever since distributed computing became a thing, is debugging issues with a single business process that is run across multiple machines at different times. When software was monolithic we always had access to the full execution stack trace, and we knew the machine that we were running on. When we encountered an error we could write all the information we needed to a log and later inspect it to see what went wrong.
When applications started running over small farms we no longer knew exactly which machine we were running on, but a look over the whole farm was a highly inefficient but often-taken route. With the advent of cloud computing, where compute instances are ephemeral, we cannot rely on machine logs to be there when we come to analysing log data. We need to get serious about how we manage our logging so that we have a decent chance of tracking down and fixing issues.
So, here are my best practices for tracing and debugging your microservices:
1. Externalise and centralise the storage of your logs
The first thing you need to do is to treat the data integrity of your logs seriously. You cannot rely on retrieving information from a physical machine at a later date, especially in an auto-scaled virtual environment. You wouldn’t allow a design where important customer data was stored on a single virtual machine with no backup copy, so you shouldn’t do this for logs.
There’s no absolute right answer to how you store your log information, whether it’s in a database, on disk or in an S3 bucket, but you need to be sure the storage used is durable and available.
Tip: If you’re on AWS you can always direct your output to CloudWatch. On Azure consider using Application Insights. We will be going into more detail on logging in future articles, as getting this right is fundamental to success in making operationally supportable services.
2. Log structured data
Many of the logging components we get now will output JSON documents for each log instead of outputting flat rows in a text file. This type of data structure is important because log processing and collation tools can easily process each record, and the information you provide in each record can be richer.
3. Create and pass a correlation identifier through all of your requests
When you receive an initial request that kicks off some processing you need to create an identifier that you can trace all the way through from the initial request and through all subsequent processing. If you call out to other components or microservices in order to complete your processing then you should pass the same correlation identifier each time.
Each of your subsequent components and microservices need to use this identifier in their own logging, so you can collate a complete history of all of the work done to process a request.
4. Return your identifier back to your client
It’s all very interesting if you can trace the flow through your system for a given request, but not much fun if you get a support call and you have no idea which request you need to be looking at.
When your API client has made a request, and initiated a process, you should return a transaction reference in the response headers. When your ops team needs to investigate the issue they should be able to provide the transaction reference with a support call and you can find the relevant information in the logs.
5. Make your logs searchable
It’s a good start to capture your log data, but you’re still left with the haystack of information even if you have the identifier of your needle. If you’re going to be serious about support then you should be able to search and retrieve a collated and filtered set of information for a single request.
Even if your data isn’t initially being written into a data warehouse you might want to look at offline processing that loads it into one. You’ve got plenty of options for this, such as the ELK stack or AWS Redshift. If you’re on the Azure stack you could choose Application Insights, as mentioned above.
6. Allow your logging level to be changed dynamically
Most logging frameworks support multiple levels of details. Typically you’ll have error, warning, info, debug and verbose as the default logging levels available to you, and your code should be instrumented appropriately. In production you’ll probably log info and above, but if you have problems in specific components then you should be able to change the tracing to debug or verbose in order to capture the required diagnostic information.
If you’re getting problems you should be able to make this change to logging levels on the fly while your systems are running, diagnose the issue, then return the logging back to normal afterwards.
Fault finding in distributed microservices can be difficult if you don’t readily have access to the logs from all of the machines your code runs on. Virtualized cloud instances, that can disappear at any time, exacerbate the problem because you cannot go back to a machine instance later to see what has happened.
Planning the management of logs and how you conduct fault finding needs to be done at the design phase and executed with the appropriate tooling and technique. Once you’ve cracked this, supporting your systems will be a whole lot easier.