The promise and pitfalls of machine learning for automated troubleshooting

ML has the potential to revolutionize root cause analysis by mimicking and enhancing human diagnostic capabilities at machine scale. But genuine breakthroughs require overcoming significant data, computational, and interpretability challenges. 

Let’s envision a world where root causes are instantly identified the moment any system degradation occurs:

Maria, a site reliability engineer at an e-commerce company, wakes up to an alert that the site’s checkout success rate has dropped 15% over the last 30 minutes. With traditional monitoring tools, traversing the chain of dependencies would be a challenge, because the services involved communicate indirectly: some exchange messages via Kafka, and some read from a database that other services write to. To determine causality, the engineer must understand every “hop in the chain” – the relationships and data flows between services. Traditionally, this would take hours of manual dashboard analysis and log review.

Instead, within seconds, Maria’s AIOps platform sends a notification showing the root cause: a dependency used by the payment microservice has degraded, slowing transaction processing. The latest version of the payment service couldn’t handle the load that the prior version had sustained, so latency started to increase.

With this insight, Maria immediately knows both the blast radius and scope of the issue. Her AIOps platform then details all impacted components and APIs involved in this event. Armed with this knowledge, she quickly resolves the problem by rolling back the last update made to the payment service. Checkout success rates are restored without any further customer impact. Going from alert to resolution took less than 5 minutes, compared to hours or days through manual troubleshooting.

This level of automated root cause analysis delivers immense benefits:

  • Rapid detection:  Analysis of the “blast radius” – i.e. connecting alert indicators to potential service degradations and outages – is done in seconds. Issues that took hours to surface through dashboard monitoring or support tickets are now instantly flagged.  
  • Alert fatigue reduction: By consolidating alerts into a cohesive picture of a production issue, automated root cause analysis focuses on the core issues that need repair rather than firing dozens of alerts for every downstream symptom and affected system. Additionally, root cause analysis from a holistic, system-wide perspective prioritizes issues with actual user impact over purely internal “IT” issues.
  • Precise targeting: The exact root cause – whether in applications, infrastructure, or network layers – is signaled, along with its probabilistic impact on site reliability and revenue. No more wasted time on false leads.
  • Faster recovery: By understanding root cause and blast radius from the start, teams can precisely mitigate issues rather than reactively firefighting. MTTR drops dramatically.  
  • Proactive prevention: Over time, patterns emerge showing systemic deficiencies (e.g. a Redis cluster needing failover configuration). Teams can make targeted improvements before problems recur.   
  • Measurable ROI: Add up (cost of downtime × MTTR reduction) + engineer time savings + customer loyalty gains + risk/liability reduction. The return on ML automation can be substantial.

Bonus: When ML automation is designed for cloud-native systems from the outset, and boasts advanced data collection technologies, the capital expenditures (CapEx) also decrease.

This promise seems almost too good to be true. And indeed, multiple barriers obstruct the path to production-grade ML pipelines for root cause analysis. 

To understand why, think about your production environment as if it were a car. You’re driving on the freeway when your engine starts rattling, sputtering, and eventually stalling. If you were trying to replace a mechanic with an ML algorithm to identify the root cause, what are some of the challenges that you might encounter? 

1. No wiring diagram: Where is each sensor, actuator, pump located? How do all the systems – electric, exhaust, cooling – fit together? Who manufactured every part of the automobile and where can you source a new component if one breaks? 

Without a multi-dimensional topology mapping all dependencies, ML models have no context for how to traverse interrelated failures. Manually creating this wiring diagram is enormously complex at scale. And if your car is stalling on the freeway, you’re (hopefully) at least able to get to a mechanic and take time to identify the problem. But when you need real-time answers in the middle of a production incident, it only gets harder.

2. Not accounting for blind spots: How many times has the car already been repaired or refurbished, and what off-brand or wholesale parts are kicking around in place of the manufacturing standard? In other words, how accurate is that manual in your glovebox? Relying on user-provided telemetry often leaves blind spots across production layers, and these gaps make automatic troubleshooting and triage almost impossible. Because telemetry must always be assumed to be incomplete, there needs to be another way of collecting “complete” environment data.

3. Can’t reason from past experience: When hearing a rattling engine, intuition and past experience flag some likely root causes (if you’ve had one car for a long time, you might instinctively know whether a failing alternator, loose suspension components, or a broken flywheel is causing your problem). ML models lack this ability to zoom in on probable culprits tied to a business impact (e.g., a drop in checkout conversion) and thereby reduce noise. Instead, they would need to evaluate every sensor across the vehicle with equal weight. Applied to vast production topologies, this brute-force approach produces an overload of noise.

4. Communicating the diagnosis: Once a failing head gasket is identified as the root cause, the diagnosis still needs to be communicated to both engineers and customers in understandable language along with a priority level. Generic ML model correlations do none of this.


Let’s explore these pitfalls inhibiting organizations from realizing automated root cause analysis:

1. No machine-readable system topology

ML models use statistics to find patterns – but can only spot patterns in data they can access. Without an existing topology mapping the thousands of interdependent services, containers, APIs and infrastructure elements, models have no pathway to traverse failures across domains. 

Manually creating this topology is remarkably complex – and sometimes impossible – as production environments dynamically scale across hybrid cloud infrastructure. Engineers would spend all their time just configuring, not innovating.
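To make "machine-readable topology" concrete, here is a minimal sketch of a dependency graph and a blast-radius walk over it. The service names, the edge list, and the `build_dependents` and `blast_radius` helpers are all hypothetical, invented for illustration; a real platform would discover the edges automatically rather than hard-code them. Note that the Kafka topic and the database are modeled as nodes, so indirect communication shows up as ordinary hops.

```python
from collections import defaultdict, deque

# Hypothetical edges: (dependent, depends_on, how they are connected).
# The topic and the database are nodes, so indirect links become hops.
EDGES = [
    ("checkout", "payment", "http"),
    ("payment", "payments-db", "sql"),
    ("orders-topic", "payment", "kafka-produce"),  # topic is fed by payment
    ("fulfillment", "orders-topic", "kafka-consume"),
    ("reporting", "payments-db", "sql-read"),
]

def build_dependents(edges):
    """Map each component to the set of components that depend on it."""
    dependents = defaultdict(set)
    for dependent, depends_on, _via in edges:
        dependents[depends_on].add(dependent)
    return dependents

def blast_radius(dependents, failed):
    """Breadth-first walk from a failed component to everything that
    directly or transitively depends on it (cycles are tolerated)."""
    seen = {failed}
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for dependent in dependents.get(node, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen - {failed}

if __name__ == "__main__":
    graph = build_dependents(EDGES)
    print(sorted(blast_radius(graph, "payments-db")))
```

Even this toy version shows why manual upkeep does not scale: every new service, topic, or table adds edges that someone would have to keep current by hand.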

2. Root cause inference at scale

But even with a topology in place, searching it during an incident poses scalability issues. General-purpose ML libraries are not designed for causality analysis across a production topology.

To diagnose checkout failure, should we evaluate payment APIs or database clusters? Intuitively, an engineer would prioritize services tied to revenue delivery. But generic ML techniques lack this reasoning, forcing an exponential search across all topology layers – like holding a microphone to every inch of a car engine.

Advanced algorithms are needed to traverse topology graphs during incidents, weighing and filtering options based on business criticality. Both simple and intricate failure chains must be unpacked – all before revenue and trust disappear.
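One way to picture such a traversal is a best-first search that expands the most promising dependency first instead of visiting every node. The sketch below is only an illustration under stated assumptions: the topology, the `CRITICALITY` and `ANOMALY` scores, the scoring rule (their product), and the `search_suspects` function are all hypothetical, and a production system would likely use a richer scoring model.

```python
import heapq

# Hypothetical topology: service -> the components it depends on.
DEPENDS_ON = {
    "checkout": ["payment", "catalog"],
    "payment": ["payments-db", "fraud-check"],
    "catalog": ["search-index"],
}
# Hypothetical scores in [0, 1]. In practice these might come from
# business tagging (criticality) and anomaly detection (anomaly).
CRITICALITY = {"payment": 0.9, "catalog": 0.3, "payments-db": 0.8,
               "fraud-check": 0.6, "search-index": 0.2}
ANOMALY = {"payment": 0.7, "catalog": 0.1, "payments-db": 0.95,
           "fraud-check": 0.2, "search-index": 0.05}

def score(node):
    """Assumed scoring rule: business criticality times anomaly level."""
    return CRITICALITY.get(node, 0.0) * ANOMALY.get(node, 0.0)

def search_suspects(symptom, budget=3):
    """Best-first search from the symptomatic service through its
    dependencies, examining at most `budget` components."""
    # heapq is a min-heap, so scores are negated to pop the highest first.
    frontier = [(-score(dep), dep) for dep in DEPENDS_ON.get(symptom, [])]
    heapq.heapify(frontier)
    visited = {symptom}
    examined = []
    while frontier and len(examined) < budget:
        neg_score, node = heapq.heappop(frontier)
        if node in visited:
            continue
        visited.add(node)
        examined.append((node, round(-neg_score, 3)))
        for dep in DEPENDS_ON.get(node, []):
            if dep not in visited:
                heapq.heappush(frontier, (-score(dep), dep))
    # Report the examined components, most suspicious first.
    return sorted(examined, key=lambda item: item[1], reverse=True)

if __name__ == "__main__":
    for name, value in search_suspects("checkout"):
        print(name, value)
```

With a budget of three, the search never reaches the low-priority catalog branch, which is the point: the priority queue spends the limited incident-time budget on revenue-critical, anomalous components.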

3. Interpretability for humans 

Finally, ML troubleshooting creates a new challenge: how to make inferences understandable to humans? Identifying patterns in metrics data reveals statistical correlations between events, but not causal priority chains:

  • Event A (high memory usage) frequently corresponds to Event Z (checkout errors)
  • Therefore, there is a high probability Event A causes Event Z

But this diagnosis lacks critical context on translating model outputs into actionable insights for engineers:

  • What was the blast radius on revenue and reliability?
  • How do we communicate this to decision makers?
  • How do we prioritize fixing Event A versus Event B, which also correlates with Event Z?

Solving this final mile problem requires models that capture and visualize root cause probability, business impact sequencing, risk levels, and mitigation recommendations. Only with this translation can organizations act on algorithmic intelligence to resolve issues.
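As a rough sketch of what that translation layer might look like, the snippet below turns raw model findings into a ranked, plain-language summary. Everything here is an assumption for illustration: the `Finding` fields, the priority rule (probability times revenue at risk), the `render_report` function, and the sample values are hypothetical and not taken from any particular product.

```python
from dataclasses import dataclass, field

@dataclass
class Finding:
    component: str
    probability: float                # model's estimate of being the root cause
    revenue_at_risk_per_min: float    # hypothetical business-impact estimate
    recommendation: str
    affected: list = field(default_factory=list)

def priority(finding):
    """Assumed ranking rule: likelihood weighted by business impact."""
    return finding.probability * finding.revenue_at_risk_per_min

def render_report(findings):
    """Produce one readable line per finding, highest priority first."""
    ranked = sorted(findings, key=priority, reverse=True)
    lines = []
    for rank, f in enumerate(ranked, start=1):
        lines.append(
            f"{rank}. {f.component}: {f.probability:.0%} likely root cause; "
            f"affects {', '.join(f.affected) or 'no other services'}; "
            f"suggested action: {f.recommendation}"
        )
    return "\n".join(lines)

# Illustrative values only.
SAMPLE_FINDINGS = [
    Finding("cache-node", 0.40, 300.0, "restart node", ["catalog"]),
    Finding("payment-service", 0.82, 1200.0, "roll back last deploy",
            ["checkout", "fulfillment"]),
]

if __name__ == "__main__":
    print(render_report(SAMPLE_FINDINGS))
```

The design choice worth noting is that ranking and wording live outside the model: the same findings can be rendered differently for engineers and for decision makers without retraining anything.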

While core machine learning techniques provide immense potential, purpose-built solutions are necessary to address the complexity of causality analysis at production scale. Combining specialized topology inference, heuristic graph search algorithms, and interpretable data science unlocks the power of automated root cause analysis.