Monitoring Distributed Systems

Authors
  • Skim

The following summarizes the fundamental principles and best practices that Google's Site Reliability Engineering (SRE) teams use to build effective monitoring and alerting systems.

Terminology

  • Monitoring: The process of collecting, processing, aggregating, and displaying real-time quantitative data about a system, such as query counts, error rates, and response times.
  • White-box Monitoring: Monitoring based on metrics exposed by the internals of the system, including logs, interfaces like the Java Virtual Machine Profiling Interface, or other internal statistics.
  • Black-box Monitoring: Testing the externally visible behavior of a system as a user would experience it.
  • Dashboard: A web-based application that offers a summary view of core service metrics. Dashboards help in answering basic questions about a service and can display information like ticket queue length, high-priority bugs, and the current on-call engineer.
  • Alerts: Notifications intended for humans that are pushed to systems like bug or ticket queues, email aliases, or pagers. These alerts can be classified as tickets, email alerts, or pages.
  • Root Cause: A defect in a system or human process that, when fixed, instills confidence that a similar issue won't occur in the same way.
  • Node and Machine: These terms are used interchangeably to refer to a single instance of a running kernel, whether on a physical server, virtual machine, or container.
  • Push: Any change to a service's running software or configuration.

Why Monitoring Matters

Monitoring does several things at once. You use it to spot long-term trends (how fast is the database growing? are we gaining users?), to compare the performance of different configurations or experiments, and to get notified when something breaks so you can respond quickly. Dashboards give teams a quick read on service health, and when something does go wrong, historical monitoring data is often the fastest way to figure out what changed.

Good monitoring and alerting can surface problems before users notice them, but getting there takes care. With too many alerts, people start ignoring them; with too few, you miss real incidents. Every alert should have a clear reason to exist.

Setting Realistic Expectations

Monitoring a complex application is a significant engineering endeavor in its own right. Even with mature infrastructure, teams often need dedicated people to build and maintain it. Google's SRE teams have moved toward simpler, faster monitoring systems and avoid overly complex "magic" systems that try to learn thresholds or detect causality automatically.

Symptoms vs. Causes

Monitoring has to distinguish symptoms from causes. A symptom is what is broken; a cause is why it is broken. For example, "the service is returning HTTP 500s" is a symptom, while "the database is refusing connections" is one possible cause. Alert on symptoms so that problems are caught quickly, then use cause-level data to debug once you know something is wrong.

Balancing White-Box and Black-Box Monitoring

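Google combines white-box monitoring (inspecting internal system metrics) with black-box monitoring (testing external system behavior). Black-box checks confirm what users are experiencing right now, while white-box metrics expose internal state that can reveal a problem before it becomes user-visible. Which one to rely on depends on the context and on the information needed to assess system health.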
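To make the contrast concrete, here is a minimal Python sketch of the two views. Every name in it (`InternalStats`, `ProbeResult`, `black_box_probe`, the `send_request` callable, the one-second latency limit) is a hypothetical choice for illustration, not part of any Google system or library. The white-box side is a set of counters the service updates itself; the black-box side sends one request the way a user would and judges only the response.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class InternalStats:
    """White-box view: counters the service updates from inside its own code."""
    requests: int = 0
    errors: int = 0
    queue_depth: int = 0

    def error_ratio(self) -> float:
        # Guard against division by zero before any traffic has arrived.
        return self.errors / self.requests if self.requests else 0.0


@dataclass
class ProbeResult:
    ok: bool
    latency_seconds: float
    detail: str


def black_box_probe(send_request: Callable[[], int],
                    max_latency_seconds: float = 1.0) -> ProbeResult:
    """Black-box view: issue one request and judge only what comes back.

    `send_request` is assumed to return an HTTP status code and to raise
    an exception if the request cannot be completed at all.
    """
    start = time.monotonic()
    try:
        status = send_request()
    except Exception as exc:
        return ProbeResult(False, time.monotonic() - start, f"request failed: {exc}")
    elapsed = time.monotonic() - start
    if status >= 500:
        return ProbeResult(False, elapsed, f"server error {status}")
    if elapsed > max_latency_seconds:
        return ProbeResult(False, elapsed, "too slow")
    return ProbeResult(True, elapsed, "ok")
```

Passing the request function in as a parameter keeps the probe independent of any particular HTTP client. In this sketch only 5xx responses and exceptions count as failures; a real probe might also validate the response body.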

The Four Golden Signals

Google emphasizes four golden signals: latency, traffic, errors, and saturation. If you can measure only four things about a user-facing system, these four give the broadest view of its health.

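  • Latency: Measures the time it takes to process requests. Track the latency of successful and failed requests separately, because errors that fail fast can make overall latency look better than it is, and slow errors deserve attention of their own.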
  • Traffic: Measures the demand placed on your system, typically in requests per second.
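  • Errors: Tracks the rate of failed requests, whether explicit (e.g., HTTP 500 errors), implicit (e.g., an HTTP 200 response that carries the wrong content), or by policy (e.g., any request slower than a committed response-time target counts as an error).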
  • Saturation: Reflects how "full" your service is, focusing on constrained resources like CPU, memory, or I/O.
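As a rough illustration of how the four signals might be derived from raw data, the Python sketch below computes them over one window of request records. The `Request` record, the `golden_signals` function, the choice of the 99th percentile, and the use of CPU cores as the constrained resource are all assumptions made for this example; real systems usually aggregate these in a metrics pipeline rather than from in-memory lists.

```python
import math
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Request:
    duration_seconds: float
    status: int  # HTTP status code


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile; returns 0.0 for an empty sample."""
    if not values:
        return 0.0
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests: Sequence[Request],
                   window_seconds: float,
                   cpu_used_cores: float,
                   cpu_capacity_cores: float) -> Dict[str, float]:
    """Summarize one observation window as the four golden signals."""
    # Only explicit failures (5xx) are counted as errors in this sketch.
    ok = [r.duration_seconds for r in requests if r.status < 500]
    failed = [r.duration_seconds for r in requests if r.status >= 500]
    total = len(requests)
    return {
        # Latency, kept separate for successful and failed requests.
        "latency_ok_p99_seconds": percentile(ok, 99),
        "latency_failed_p99_seconds": percentile(failed, 99),
        # Traffic as requests per second over the window.
        "traffic_rps": total / window_seconds,
        # Errors as a fraction of all requests.
        "error_ratio": len(failed) / total if total else 0.0,
        # Saturation as the used fraction of the constrained resource.
        "saturation": cpu_used_cores / cpu_capacity_cores,
    }
```

Implicit and policy-based errors would need extra inputs (response validation, a latency target) that this sketch leaves out.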

Addressing the Long Term

Monitoring isn't just about detecting immediate issues. You also need to think about where you're headed. Sometimes that means accepting a short-term hit to availability or performance in exchange for long-term stability.

A Monitoring Philosophy

Google's SRE teams page on symptoms, not causes. If a user-facing signal is broken, that warrants waking someone up. The cause can be investigated after the page fires. Keep the monitoring setup simple, make sure every alert leads to a concrete action, and resist the temptation to add alerts "just in case."
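One way to encode this philosophy is as a small routing function, sketched below in Python. The `AlertRule` fields, the `Route` values, and the policy itself are assumptions for illustration: here a firing symptom pages someone, a firing cause-level condition becomes a ticket, and any rule with no documented action is never delivered, which is a blunt way to enforce that every alert leads to something a human can do.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    PAGE = "page"
    TICKET = "ticket"
    NONE = "none"


@dataclass(frozen=True)
class AlertRule:
    name: str
    is_symptom: bool  # True if the condition is visible to users
    action: str       # what the responder should do; empty if unknown


def route_alert(rule: AlertRule, firing: bool) -> Route:
    """Decide where a rule's notification goes (hypothetical policy)."""
    # A rule that is not firing, or that has no concrete action, notifies no one.
    if not firing or not rule.action:
        return Route.NONE
    # User-facing symptoms page a human; cause-level signals wait in a queue.
    return Route.PAGE if rule.is_symptom else Route.TICKET
```

For example, a firing `AlertRule("high_error_ratio", True, "roll back the last push")` routes to `Route.PAGE`, while a "just in case" rule with an empty `action` routes to `Route.NONE` no matter how often it fires.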