The Three Pillars — Revisited
Is this once-foundational observability concept crumbling?
Observability is a dynamic field. Projects like OpenTelemetry have matured, grown their communities, and are now supported by more platforms than ever. What may have felt like a dream five years ago, an enterprise-scale, composable, cross-platform observability solution built on a standard, open wire format, is now a reality. The increasing prevalence of microservices, Kubernetes, IoT, streaming, and cloud-native development patterns continues to spur innovation in the observability ecosystem, and to present it with new challenges, as it works to meet the unique demands of these technologies. Yet, despite this evolution, a lot of the practical, nuts-and-bolts work in the field is still being accomplished with the o11y OGs: logs, metrics, and traces.
These forms of telemetry have long been referred to as "The Three Pillars of Observability" for the critical role they play both as individual signals and, more powerfully, as a trifecta supporting the foundation of a holistic observability system. While criticism of the pillars goes back years (maybe even to when the term entered the lexicon, so perhaps it was always a controversial analogy?), I've noticed a recent uptick in proclamations calling it: dead, outdated, a trope, bullshit. So, is this concept a relic of a simpler time? Are the pillars cracking under the weight of what is becoming an increasingly complex domain, one accelerating alongside the broader technology environment it serves?
How the Pillars break down
One of the main criticisms of the pillars is the tendency to use them in isolation, instead of as part of a holistic observability system. Indeed, much of the potential of the Three Pillars as originally conceived is only fully realized if each of the signals can be cross-correlated against the others in real time. But often this isn't what happens in practice. The process usually goes something like this:
An engineer receives an alert, let's say about a high-error threshold being crossed. The alert includes a link to a dashboard, which the engineer reviews. The dashboard includes a metric plotted out on a graph which indicates a clear spike at a particular point in time (likely some period just before the alert was triggered). The engineer then logs into their log analysis tool, uses a saved search (or hunts through internal documentation/wikis for one) to sift through the logs for that app and approximate time period until they find something that might have caused the alert to fire.
This process of discovering "what's going on?" is often sequential, intentional, and occurs across different systems—or at least in different tabs 🙃. But the magic of observability that we were promised is that these insights can be gained simultaneously, intuitively, and through a single pane of glass. Consider an alternative (yet, all-too-often imaginary) scenario:
An engineer receives the same alert about a high-error threshold being crossed. The alert directs them to their observability tooling. They review a graph representing the same metric clearly indicating a spike. The engineer zooms in on the spike, which automatically pulls up all of the related logs from that time range in the same view. The logs are neatly organized, giving the engineer a high-level overview. Aha—there it is, the root cause! Oh, and this problem only affected a single customer. Cool.
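That kind of pivot only works if the signals share identifiers in the first place. As a rough illustration of the plumbing involved, here is a minimal sketch using the OpenTelemetry Python SDK in which every log line is stamped with the trace and span IDs of the work that produced it. The "checkout" service, the customer attribute, and the console exporter are illustrative stand-ins, not a prescription for any particular backend.

```python
import logging

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# A tracer that exports spans; ConsoleSpanExporter stands in for whatever
# backend a team actually ships traces to.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("checkout")


def handle_checkout(customer_id: str) -> None:
    # Every unit of work runs inside a span...
    with tracer.start_as_current_span("handle_checkout") as span:
        span.set_attribute("customer.id", customer_id)
        ctx = span.get_span_context()
        # ...and every log line carries the same trace/span IDs, so the
        # metric spike -> trace -> logs pivot needs no guesswork. (In
        # practice a structured-logging handler would inject these
        # automatically rather than formatting them by hand.)
        logger.error(
            "payment declined trace_id=%s span_id=%s customer_id=%s",
            format(ctx.trace_id, "032x"),
            format(ctx.span_id, "016x"),
            customer_id,
        )


if __name__ == "__main__":
    handle_checkout("cust-42")
```

With those IDs attached to the logs, the "zoom into the spike" workflow above becomes a join on trace_id rather than a hunt through saved searches.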
My kingdom for a SPOG!
Market realities are at least one driver of this signal-used-in-isolation pattern. Many vendors, maybe themselves originally influenced by the "pillars" terminology, tend to focus their products around one or two pillars at the expense of the others, resulting in a kind of chicken-and-egg scenario that has left teams dependent on multiple tools and disjointed workflows in their quest to implement the pillars. Even top-tier commercial platforms that support all three pillars still fail to seamlessly integrate them throughout the product experience.
Naturally, this situation leads to the second main complaint about the pillars: cost. If you're running (or buying) separate systems for your logs, metrics, and traces, on some level you're paying to store redundant data. It's not necessarily the exact same data, since it is being stored in different forms. But it's all data about the same core event. And when you factor in the explosion of data volumes inherent to today's dominant computing architectures (remember that thing about Kubernetes being popular?), volume-based cost concerns are not trivial—just ask Coinbase about its $65 million (!) Datadog bill. At a large enough scale, cost mitigation tactics like sampling and cardinality monitoring need to be introduced, which reduce the fidelity and utility of the data itself while also diverting the team's time from their core work into cost management.
💸 💸 💸
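To make the sampling trade-off above concrete, here is a minimal sketch, assuming the OpenTelemetry Python SDK, of the head sampling teams typically reach for when trace volume drives the bill; the 10% ratio is an arbitrary number chosen purely for illustration.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep roughly 10% of new traces; child spans follow their parent's
# decision so the traces that are kept stay complete.
sampler = ParentBased(root=TraceIdRatioBased(0.10))
trace.set_tracer_provider(TracerProvider(sampler=sampler))
```

The bill shrinks accordingly, but so does fidelity: roughly nine out of ten traces never leave the process, and if the root cause happens to live in one of them, it is simply gone.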
Ultimately, these factors all too frequently leave teams disappointed: despite their best efforts to implement the pillars and paradigms they'd heard tell of, the next time an alarm triggers, the root cause of the problem is no more apparent than before. True observability, it seems, remains elusive.
Rebuilding the Pillars
Observability is about what you do with the pillars, not just collecting them. —Liz Fong-Jones
From my perspective, the problem isn't the pillars but the promise—the promise that if you instrument your code like this and visualize it in some tool like that, then the next time there is a problem, the root cause will just reveal itself to you like magic, requiring virtually no effort on the system maintainer's part. But the reality is: debugging is hard. The more complicated the system, the harder it is. Observability does not absolve engineers from the responsibility of actually understanding how their systems work. Observability makes it easier to understand them.
I don't think the pillars are dead, but I think what we've been doing with them is no longer viable. There is simply too much data involved today to accept a lack of cross-correlation ability. High-quality, first-class support for logs, metrics, and traces is currently a part of virtually every commercial or open source observability solution on the market. Now, it's time to unify them in a way that truly unlocks their value; quite a lot can be achieved with these three simple signals—when used cohesively. The Three Pillars may not be a perfect model, but it can probably get you 80% of the way to where you need to be. The other 20% is on you: you still have to know how to use them to get there.
References
The following references were used in developing the content of this article:
Sridharan, Cindy. "Chapter 4. The Three Pillars of Observability" from Distributed Systems Observability. O'Reilly.
Majors, Charity. "So You Want To Build An Observability Tool...." Honeycomb.
Mancioppi, Michele. "Observability vs. monitoring debate: an irreverent view." Canonical Ubuntu.
Bryant, Daniel. "Three Pillars with Zero Answers: Rethinking Observability with Ben Sigelman." InfoQ.
Majors, Charity. "The Cost Crisis in Observability Tooling." Honeycomb.
Fong-Jones, Liz. "Is it already time to version observability?" Honeycomb Observability Day, 2 May 2024, Chicago, IL. Lecture.