Episode 1: Tracing and Debugging Microservices in Kubernetes

This is the inaugural episode of the 345 Tech Talks podcast. In this episode, Andrew and Paul discuss tracing and debugging microservices in Kubernetes. This is a technical deep dive into a subject that can make or break your ability to build, test and operate a large production system.


A while back we wrote an article, “Best Practices for Tracing and Debugging Microservices”, that has turned out to be our most viewed web page ever on the 345 site. The original article is a brief look at some of the main considerations, so it was an ideal candidate when we were looking for a subject for our first podcast episode.

Some of the main points from the episode:

  • Building applications in Kubernetes helps with 3 of the 7 Outcomes for Success: Rapid Delivery, Availability & Scalability and Cost Optimised.
  • The ability to read detailed diagnostic information is essential if you are going to build large scale distributed applications. This is especially important if a single process involves calls to many different microservices.
  • You need to ensure you can piece together the diagnostic information from every component involved in fulfilling an operation. Those components can span many machines and services. Paul describes the best way of doing this: passing a correlation identifier through all the services.
  • We have a tech stack for containerized microservices running in Kubernetes that includes FluentBit, ElasticSearch, Kibana, Prometheus and Grafana. This stack is explained in detail, with descriptions of why each part of the stack is chosen.
  • We discuss what information you need to trace and how best to structure that data. FluentBit and FluentD are data collectors that feed into ElasticSearch for storage. You also need an interface to view, search and filter the log information, and that’s where we use Kibana.
  • Performance data is handled differently. We store this in Prometheus because it’s better at handling real-time time-series data. We also move this into ElasticSearch for long-term storage.
  • We discuss how you need an archiving strategy. It’s important to understand how much data you need on fast storage, how much on slow storage, when you can put data into cold storage and when you can purge it. This helps you keep a good balance of performance and cost, whilst meeting any regulatory data retention requirements.
  • If you’ve ever been the guy who needs to fix a system when it’s down, you know the value of good diagnostic information!
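
To make the correlation identifier point concrete, here is a minimal sketch in Python of how it might look. It is not taken from the episode. The header name X-Correlation-ID, the JSON field names, and the two toy services (orders and inventory) are all assumptions made for illustration. Each service reuses the identifier it receives, or creates one if it is the first in the chain, writes it into every structured log line, and passes it on with every downstream call.

```python
import contextvars
import io
import json
import logging
import sys
import uuid

# Assumed header name; the episode does not prescribe one.
CORRELATION_HEADER = "X-Correlation-ID"

# Holds the correlation ID of the request currently being handled.
_correlation_id = contextvars.ContextVar("correlation_id", default=None)


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
            "correlation_id": _correlation_id.get(),
        })


def make_logger(service, stream):
    """Build a logger that writes structured lines to the given stream."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def begin_request(headers):
    """Reuse the caller's correlation ID, or create one at the edge."""
    cid = headers.get(CORRELATION_HEADER) or str(uuid.uuid4())
    _correlation_id.set(cid)
    return cid


def outgoing_headers():
    """Headers to attach to every call made to another service."""
    return {CORRELATION_HEADER: _correlation_id.get()}


def inventory_service(headers, log):
    """Hypothetical downstream service."""
    begin_request(headers)
    log.info("stock checked")


def order_service(headers, order_log, inventory_log):
    """Hypothetical entry-point service that calls the inventory service."""
    begin_request(headers)
    order_log.info("order received")
    inventory_service(outgoing_headers(), inventory_log)


if __name__ == "__main__":
    # Both services log to stdout here; an empty header dict simulates
    # a request arriving from outside with no correlation ID yet.
    order_service({}, make_logger("orders", sys.stdout),
                  make_logger("inventory", sys.stdout))
```

Running this prints one JSON line per service, and both lines carry the same correlation_id value. In a container these lines would typically go to stdout, where a collector such as FluentBit can pick them up, and filtering on that one field in Kibana would then pull together every log entry for a single operation.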

You can watch the podcast here:

The audio version is here: