Observability is the ability to understand the internal state of a system by examining its external outputs. It is a broader capability than logging, and the gap between the two becomes obvious the first time a system reaches a state nobody designed for. This is the first post in a series on observability. It covers what the term means, why logs alone leave you debugging in production, the commercial case for investing in observability, and the bar an application has to clear before you can say it has it.
Systems reach states you did not predict
Any engineer who has operated a complex system long enough has watched it reach a state nobody planned for: networking failures, a process killed for running out of memory, a disk filling up, CPU starvation, a database deadlock, a connection pool hitting its limit, an external API or database that is simply unavailable. The list is open-ended because the interactions that produce these states are not written down anywhere at design time. They emerge from the system running in production under real load.
A system with minimal logging, or none, gives you almost nothing to work with when this happens. Troubleshooting becomes a loop: form a hypothesis, add log lines where you think the problem is, ship the change to production, wait for the issue to reproduce, read the new output, repeat. Each turn of that loop is a full deploy, and while it runs the system is still degraded or down. An incident that should take only as long as reading output you already have instead spans hours. The cost is lost revenue and customer trust, and it is paid because the information needed to diagnose the problem was never emitted.
What observability actually means
The term did not originate in software. The engineer Rudolf E. Kálmán coined it in 1960, in a paper on mathematical control systems, where observability is a measure of how well a system’s internal states can be inferred from its external outputs. The book Observability Engineering, by Charity Majors, Liz Fong-Jones, and George Miranda, carries that definition into software engineering: the ability to understand the internal state of a system by examining its external outputs.
Those external outputs are signals: logs, metrics, and traces. Taken together they let an engineer reconstruct not just what the system did but why it did it. The useful distinction the book draws is between known unknowns, the failure modes you anticipated and instrumented for, and unknown unknowns, the states you did not predict and so could not have written a specific check for. Monitoring, built on predefined dashboards and alert thresholds, handles the first. Observability is aimed at the second: enabling teams to answer questions about the system that they did not think to ask before it misbehaved.
A test for whether you have it
Suppose one of your services starts emitting error log lines. Can you answer the following quickly, from outputs you already have, without shipping new code?
- What inputs caused the requests to fail?
- Is the failure specific to a particular host, region, or release version?
- What sequence of events led to the failing call, and was it part of a larger chain of calls?
- When exactly did it start, and is it a single occurrence or recurring?
- Does it correlate with a specific shape of input, or with a time of day or day of the week?
- Were any specific feature flags enabled for the affected requests?
If the honest answer to most of these is “I would have to add logging and redeploy to find out,” the system has monitoring but not observability.
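One way to make those questions answerable is to emit a single structured event per request that already carries the context each question asks about. The following is a minimal sketch, not a prescribed format: the function names `build_request_event` and `emit`, the field names, and the sample values are all assumptions made for illustration, and a real service would likely send these events to whatever backend it already uses rather than printing them.

```python
import json
import socket
import time
import uuid


def build_request_event(inputs, status, duration_ms, *, region, release,
                        feature_flags, trace_id=None, parent_span_id=None,
                        error=None):
    """Return one structured event describing a single request.

    Every field name here is hypothetical; each one maps to a question
    from the checklist above.
    """
    return {
        "timestamp": time.time(),                 # when did it start, does it recur?
        "host": socket.gethostname(),             # is it specific to one host?
        "region": region,                         # ...or one region?
        "release": release,                       # ...or one release version?
        "trace_id": trace_id or uuid.uuid4().hex, # which chain of calls was it part of?
        "parent_span_id": parent_span_id,         # what called this service?
        "inputs": inputs,                         # what inputs caused the failure?
        "feature_flags": sorted(feature_flags),   # which flags were enabled?
        "status": status,
        "duration_ms": duration_ms,
        "error": error,
    }


def emit(event, stream=None):
    """Write the event as one JSON line (stdout by default) and return the line."""
    line = json.dumps(event, sort_keys=True)
    print(line, file=stream)
    return line


# Example: a failed request, recorded with its full context.
emit(build_request_event(
    {"order_id": 17, "currency": "EUR"},
    status=500,
    duration_ms=812.4,
    region="eu-west",
    release="v42",
    feature_flags={"new_checkout"},
    error="TimeoutError",
))
```

The point of the sketch is that the context is attached at the time the request runs, so none of the questions above requires a new deploy to answer.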
The business case
Observability is an engineering practice, but the case for funding it is commercial. Observability Engineering groups the return into four categories:
- Higher incremental revenue. Better uptime and performance improve the product directly, and that shows up as revenue not lost to degradation.
- Cheaper incident response. Lower mean time to detect (MTTD) and mean time to resolve (MTTR) translate into fewer engineer-hours per incident: less time locating bottlenecks, less time on call, fewer rollbacks.
- Fewer incidents. Causes get found while they are still small, before they escalate into long, critical outages.
- Lower employee churn. Less alert fatigue and on-call burnout mean experienced engineers stay, which matters because replacing them is expensive.
The bar for having observability
The book sets a specific bar for when an application can be said to have observability. You must be able to:
- understand the inner workings of the application;
- understand any state the application has gotten itself into, including states you have never seen and could not have predicted;
- do both solely by observing and interrogating the system with external tools;
- do both without shipping new custom code to handle the case, because needing new code implies you would have needed prior knowledge of the failure in order to explain it.
The last point connects back to the redeploy-to-debug loop. If diagnosing a novel failure requires a new deployment, you do not have observability yet. You have logging that happened to cover the cases you thought of in advance.
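To illustrate what interrogating the system without shipping new code can look like, here is a small sketch that slices events which have already been emitted by whichever field turns out to matter. It assumes events are available as dictionaries with a `status` field and treats any status of 500 or above as an error; the function name `error_rate_by` and the sample data are hypothetical.

```python
from collections import Counter


def error_rate_by(events, dimension):
    """Group already-emitted events by any field and return the error rate per value."""
    totals = Counter()
    errors = Counter()
    for event in events:
        key = event.get(dimension)
        if isinstance(key, list):
            key = tuple(key)  # lists are unhashable, so use a tuple as the group key
        totals[key] += 1
        if event.get("status", 0) >= 500:  # assumption: 5xx means the request failed
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


# Hypothetical events, as they might have been recorded before the incident.
events = [
    {"release": "v41", "region": "eu-west", "status": 200},
    {"release": "v42", "region": "eu-west", "status": 500},
    {"release": "v42", "region": "us-east", "status": 500},
    {"release": "v41", "region": "us-east", "status": 200},
]

print(error_rate_by(events, "release"))  # {'v41': 0.0, 'v42': 1.0}
print(error_rate_by(events, "region"))   # {'eu-west': 0.5, 'us-east': 0.5}
```

In this made-up data the failures line up with the release and not with the region, and finding that out took a new question rather than a new deployment. The dimension is an argument, so the same query works for a field nobody thought to chart in advance, provided the field was recorded on the event.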
Links
- Observability Engineering (O’Reilly): Charity Majors, Liz Fong-Jones, and George Miranda; the source for the definitions and business case cited here
- Observability Engineering, free copy from Honeycomb: the full book at no cost