Alert Fatigue: The (not so) Good, The Bad, and the Ugly

Various notifications and alarms waking a sleeping engineer

The other day, my dog had a longer-than-usual stay at the groomer, much to his dismay. I had accidentally ignored the text message they sent me when he was ready to be picked up. But how did I make this (in his eyes) egregious mistake?

Dog looking at the camera with sad eyes
Dog tax! The sad pup in question...

Well, in recent weeks, my phone has been swamped with text messages seeking political donations (it’s that time of year!), and I have become trained to disregard unexpected messages from phone numbers that aren’t in my contacts. But in an era when text messages are used for appointment reminders, receipts, and other routine notifications, the few “good” messages that I actually want to receive (like, say, a message from the dog groomer) can easily get lost in the noise. This is an example of a phenomenon known as alert fatigue, and its implications extend far beyond a pouty pooch.

Alert fatigue is a condition that occurs when an alarm is triggered so frequently that responders become desensitized to its urgency and the alarm loses its attention-grabbing power. Alerts, or alarms, are signals intended to communicate to a human operator that urgent attention, and possibly action, is needed. However, human attention can only be sustained at an urgent level for so long. Through a psychological process known as desensitization, a repeatedly sounding alarm eventually trains the brain to treat the signal as safe to ignore, and it no longer triggers a prompt response.

The Bad

Isn't more safer?

When designing an alert, it can be tempting to want to notify someone earlier rather than later, catch a potential problem before it becomes an actual problem, err on the side of caution. Yet, despite these good intentions, over-alerting is actually a common monitoring anti-pattern because of the increased risk of alert fatigue. Excess alarms or loose alerting thresholds can increase the likelihood that actual problems might get drowned out by the noise of many useless alerts or be ignored by responders who have become desensitized to frequently unactionable alarms.

Alert design is one of the best applications of the "less is more" philosophy, contrary to our natural inclinations. It is much more effective to have a single, well-crafted monitor that alerts on carefully chosen SLO-based thresholds than a half-dozen monitors that were thrown together "just to be safe." For this reason, resist the urge to add alarms or lower alerting thresholds. Simply adding more alerts rarely increases system safety.
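To make "SLO-based thresholds" a little more concrete, here is a minimal sketch of one common approach: paging on how fast the error budget is being burned, rather than on any single error. Everything named here is an assumption for illustration and not something prescribed above: the `Window` shape, the 99.9% target, and the 14.4 threshold are hypothetical example values. Requiring both a long and a short window to exceed the threshold is one way to keep a brief blip from paging anyone.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed over one lookback window (hypothetical shape)."""
    total: int
    errors: int


def burn_rate(window: Window, slo_target: float) -> float:
    """How fast the error budget is being consumed; about 1.0 means on budget."""
    if window.total == 0:
        return 0.0  # no traffic, so nothing is burning the budget
    error_budget = 1.0 - slo_target  # e.g. roughly 0.001 for a 99.9% SLO
    observed_error_rate = window.errors / window.total
    return observed_error_rate / error_budget


def should_page(long_window: Window, short_window: Window,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only if BOTH windows burn budget faster than the threshold.

    The default target and threshold are illustrative assumptions.
    """
    return (burn_rate(long_window, slo_target) > threshold
            and burn_rate(short_window, slo_target) > threshold)


# Sustained problem: both windows are burning fast, so this would page.
print(should_page(Window(total=10_000, errors=200), Window(total=1_000, errors=25)))

# Brief blip: the short window looks bad, but the long window does not.
print(should_page(Window(total=10_000, errors=20), Window(total=1_000, errors=25)))
```

One monitor like this can stand in for several ad hoc "error count is above N" monitors, which is the trade the paragraph above is arguing for.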

But can't engineers be expected to pay attention?

It's hard to fight against human nature. Every time an alert triggers, gets an engineer’s attention, and then requires no action, the brain takes note and the desensitization process starts. We've likely all experienced a system that had a perpetually failing monitor. Unactionable alerts should simply be removed, as they do not provide value. They are, in fact, doing the opposite by actively contributing to alert fatigue. This is true even for alerts in development environments. When we say things like "Oh, you can ignore that monitor, it's always failing", we're teaching ourselves that it's okay to ignore alerts. It shouldn't be! Rather than trusting ourselves to remember which alerts matter and which can be safely ignored, we should update the monitoring policy itself to reflect the reality of our behavior and eliminate the alerts that we ignore.

What should I do instead?

Monitoring is never static—alerts should be continually evaluated for effectiveness, utility, and fatigue indicators. On a regular basis, and especially after incidents that lacked appropriate monitoring, evaluate each of your alerts and assess them based on the following criteria.

Effectiveness: Does the alert actually fire on the conditions that it should? Are thresholds set at appropriate levels?

Utility: Do we actually care about this alert? Has this monitor ever successfully guarded against an incident? Is it likely to? Is it useful in isolation or only when combined with other signals?

Fatigue indicators:

  • Does the alert trigger frequently?
    Alerts that fire frequently, even when they legitimately require action, still contribute to alert fatigue by adding to the sea of "alert noise." That noise can drown out other, less frequent alerts and make them easier to miss, especially during events like "page storms."
    • Recommended actions:
      • Threshold tuning
      • Alert grouping / threading
  • Is the alert unactionable?
    Alerts that do not require action, regardless of how frequently they fire, most readily contribute to alert fatigue because they perpetuate the "see/disregard" cycle that leads to desensitization.
    • Recommended actions:
      • Eliminate
      • Refactor purely informational alerts into dashboard visualizations
  • Has the alert never triggered?
    An alert that never fires is also not really providing value. It may be misconfigured, or its thresholds may be set so leniently that it cannot realistically trigger. Having too many monitors that never fire can provide a false sense of security.
    • Recommended actions:
      • Eliminate
      • Threshold tuning
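The fatigue-indicator checklist above lends itself to a small periodic audit. The sketch below is only one way to frame it: the `AlertStats` shape, the `review` function, and the cutoff of 20 firings per review period are all hypothetical assumptions, and in practice the counts would come from whatever paging or monitoring history you have available.

```python
from dataclasses import dataclass


@dataclass
class AlertStats:
    """Per-alert history for one review period (hypothetical shape)."""
    name: str
    fired: int     # times the alert triggered during the review period
    actioned: int  # of those, how many actually required action


def review(stats: AlertStats, frequent_threshold: int = 20) -> list[str]:
    """Map an alert's history to the recommended actions in the checklist.

    frequent_threshold is an assumed cutoff for "triggers frequently".
    """
    if stats.fired == 0:
        # Never triggered: possibly misconfigured or not providing value.
        return ["eliminate", "threshold tuning"]

    actions: list[str] = []
    if stats.actioned == 0:
        # Fired but never required action: the "see/disregard" cycle.
        actions += ["eliminate", "refactor into dashboard"]
    if stats.fired >= frequent_threshold:
        # Fires often, whether or not it is legitimate.
        actions += ["threshold tuning", "alert grouping / threading"]
    return actions


for alert in [
    AlertStats("disk-full", fired=0, actioned=0),
    AlertStats("cpu-high", fired=50, actioned=40),
    AlertStats("deploy-finished", fired=3, actioned=0),
    AlertStats("checkout-errors", fired=3, actioned=3),
]:
    print(alert.name, review(alert))
```

An empty list means the alert shows none of the three fatigue indicators. It says nothing about effectiveness or utility, which still need a human judgment call.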

The Ugly

At its best, alert fatigue is a nuisance; at its worst, it can be dangerous and even life-threatening.

In the medical field, alert fatigue is a frequently cited patient safety concern and was ranked by the ECRI Institute, an independent healthcare safety research nonprofit, as a top patient safety hazard.

Various public agencies must concern themselves with the risks of alert fatigue when communicating with the public. When severe weather strikes, officials at the National Weather Service must craft their messaging carefully, balancing the risk of over-warning citizens with emergency alerts against the risk of not adequately informing the public about potentially deadly conditions. SAFECOM, the National Council of Statewide Interoperability Coordinators (NCSWIC), and the Department of Homeland Security Cybersecurity and Infrastructure Security Agency (CISA) specifically call out avoiding alert fatigue in their public safety notification best practices guide. While the guide is targeted at public safety notification systems and practitioners, its “Ten Keys” principles are broadly applicable to SRE and other fields where alerting practices are critical.

Across industries, employees responsible for responding to alerts who are exposed to the negative effects of alert fatigue can suffer from burnout, affecting quality of sleep, on-the-job performance, job turnover rates, and overall quality of life.

Once established, alert fatigue is a very challenging problem to address because human behaviors need to be re-calibrated, which takes time. It's a much easier problem to prevent by implementing strategic alerting policies. This change starts with each of us. The next time you are paged by an alarm that requires no action, challenge its existence. The next time that noisy monitor triggers, try tuning the thresholds. No one should accept a culture of alert fatigue, which has wide-ranging and frequently unanticipated impacts.

Jamie Larson