Why Generic AI Models Fall Short for Root Cause Analysis

Call it the GenAI gold rush…or land rush.

Industries from healthcare to banking and beyond have been rushing to integrate generative AI (GenAI). And the observability space is no exception. 

There has been a ton of noise around how large language models (LLMs) are set to transform the observability market. The status quo is plainly less than ideal – with alarms, warnings, and a confusing mixture of signals coming in from monitoring software (whether commercial vendors or open-source stacks), site reliability engineers are overwhelmed and suffer from alert fatigue.

If you manage a modern, distributed production system, you might already be looking into ways LLMs can simplify your team’s work. Issue diagnosis and root cause analysis have likely been demanding more of your SREs’ and DevOps pros’ time and energy in recent years – a result of the increasingly intricate web of interdependencies across software systems and infrastructure. 

LLMs will undoubtedly play a role in this space. But as an industry, we’re still learning where they fit best.

This blog specifically explores why generic AI models aren’t a strong fit for a critical element of production troubleshooting: root cause analysis. Then it offers a suggestion for how LLMs could be integrated to enhance your overall observability strategy.

LLMs excel at analyzing unstructured text. They learn from vast amounts of text-based training data to identify patterns and make predictions.  

So in theory, if you could supply an LLM with the right text inputs, it would be able to synthesize vast amounts of information about your environment to create high-quality insights. (An example: this microservice is failing – and it’s likely because of issue X.) 

But now consider: what would the right text inputs look like to generate that kind of insight? Broadly, they would fall into the bucket of “context”:

  • Relational heuristics: a model of the connections between different layers of your environment (e.g., a particular microservice calls a specific set of APIs)
  • Tribal knowledge around previous cause/effect relationships (e.g., user activity always spikes around certain times of the year)

This type of context exists naturally in a graph representation of your environment. A graph (with nodes and edges) is a natural and useful way of representing structured relationships between different application, API, network, and infrastructure layers. It highlights dependencies and captures how the structure of your environment evolves over time. 
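To make that concrete, here’s a minimal sketch (in Python, using the networkx library, with an invented checkout-flow topology – the node names are purely illustrative) of how a graph captures cross-layer dependencies directly:

```python
import networkx as nx

# Illustrative only: a tiny, hypothetical checkout-flow topology.
# Nodes span application, API, and infrastructure layers; edges are dependencies.
topology = nx.DiGraph()
topology.add_edge("checkout-page", "web-pod", layer="application")
topology.add_edge("web-pod", "payments-api", layer="api")
topology.add_edge("web-pod", "cache-service", layer="api")
topology.add_edge("cache-service", "redis-node-1", layer="infrastructure")

# Dependencies are first-class: walking downstream from a degraded service
# immediately surfaces candidate root causes, no text wrangling required.
print(nx.descendants(topology, "checkout-page"))
```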

But the process of converting structured, time-series data into meaningful text inputs for an LLM is far from trivial. In fact, it’s the biggest bottleneck in using LLMs: transforming a map of your environment into specific and relevant training data that GenAI can use to generate insights beyond the generic (i.e., what you would get if you googled “why might a microservice fail?”).  

For example, let’s imagine that you experienced an outage. (For simplicity, let’s assume this scenario – overloaded web pods leading to service issues on the checkout page.) The non-obvious root cause here turned out to be a degradation of the cache service. But in order to get an LLM to reproduce this insight, you’d need to feed it exhaustive, explicit, and up-to-date information on the relationship between your web pods and cache service, and the underlying network configuration governing the cache service. 

For complex environments with a high degree of interdependence between APIs, applications, network, and infrastructure layers, it’s not practical to continuously convert all of these highly structured relationships to text input in a way that an LLM can analyze for root cause analysis. (This, by the way, is exactly the area where graph machine learning shines.)  
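To see why, here’s a deliberately naive sketch (a hypothetical, hard-coded edge list in Python) of what flattening even a handful of those relationships into prompt text looks like:

```python
# Illustrative only: a few edges from a hypothetical environment graph,
# flattened into the kind of prose an LLM prompt would need.
edges = [
    ("web-pod", "cache-service", "api"),
    ("cache-service", "redis-node-1", "infrastructure"),
    ("web-pod", "payments-api", "api"),
]

def edges_to_prompt(edges):
    # Each relationship becomes a sentence; every topology change means
    # re-serializing and re-sending the whole picture.
    return "\n".join(
        f"{src} depends on {dst} (layer: {layer})." for src, dst, layer in edges
    )

print(edges_to_prompt(edges))
```

Even this toy example produces a wall of text. A production graph with thousands of nodes and constantly changing edges quickly becomes unmanageable as prompt input.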

Production systems generate continuous, structured time-series data streams that require real-time visibility during root cause analysis. By design, models like GPT respond to discrete queries; they cannot continuously ingest dynamic, real-time network, system, and application data. Further, as your systems evolve, LLMs will always be a step behind, aware only of the version of your environment on which they were last trained.

For effective root cause analysis, SREs and DevOps professionals need clear, intuitive data visualizations like environment topologies and impact maps alongside a history of recent changes and deployments. GenAI models interface primarily through chat, which makes it difficult to visualize their insights.

Explainability matters in all domains, but the stakes are particularly high in root cause analysis. 

If a model can’t properly explain its reasoning for suggesting a particular mitigation strategy (like upgrading a specific database that is causing a payment service bottleneck), SREs can’t act with confidence. And they may be led to take misguided actions that compound existing issues or create new ones.  

The fact that LLMs perform “black box” operations isn’t a problem in and of itself. (After all, any neural network involves explainability challenges due to the number of parameters and the volume of training data.)

But a major explainability challenge arises from the fact that the underlying data – a structured, time-series graph of your environment – has to be transformed into an unstructured textual format that LLMs can analyze. 

Where a graph representation of your environment makes the relationships across layers visible, flattening that graph into text input for an LLM makes it almost impossible to reason about the output. This can make it particularly hard to debug potential model failures:

  • Is the model analyzing the most up-to-date representation of your environment?
  • Was inadequate context provided to generate an accurate recommendation?
  • Why was a particular root cause identified across multiple “hops in the chain”?

In short, the additional layer required to make LLMs useful for root cause analysis – transforming and annotating graph input into textual training data – introduces explainability challenges that only compound at scale.

We’ve highlighted some of the shortcomings of LLMs for investigation and root cause analysis of production issues.

But it would be a big mistake to discount GenAI for observability altogether. LLMs have a clear and powerful role to play in the troubleshooting process. Specifically, they can complement other forms of AI used in root cause analysis (like graph machine learning) by providing an intuitive, flexible, and shared user interface for investigation. 

Using LLMs as the “interface layer” during incident investigation – a chatbot that enables truly conversational troubleshooting – offers many benefits:

  • It makes complex insights accessible to different cross-functional members of the team (SREs, DevOps, developers) who may have different levels of familiarity with the technical environment 
  • It accelerates the troubleshooting process by enabling questions to be asked and answered iteratively – the way that SREs actually investigate issues in the real world
  • It promotes alignment by providing a common language for issues and root causes across the team

What does this look like in practice? Imagine if, during a service degradation (like an outage on the checkout page for an e-commerce retailer), various teams had access to a chat-based interface for asking questions and exploring hypotheses – not generic answers, but ones tailored to their specific environment. “Why might the web pods be overloaded? What are the potential root causes of our cache service degradation?” 
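As a rough sketch of how that could fit together (every function here is a hypothetical stand-in, not a real API or Senser’s implementation): a purpose-built engine answers from the environment graph, and the LLM turns its findings into conversation.

```python
# Illustrative sketch only. `run_graph_rca` and `call_llm` are hypothetical
# stand-ins: the first for a purpose-built graph-ML engine, the second for
# any chat-capable LLM API.

def run_graph_rca(service: str) -> dict:
    # Stub: a real engine would analyze the live environment graph.
    return {
        "service": service,
        "ranked_root_causes": ["cache-service degradation", "web-pod saturation"],
        "evidence": ["cache hit rate dropped sharply", "web-pod CPU pegged"],
    }

def call_llm(prompt: str) -> str:
    # Stub: a real implementation would call an LLM API here.
    return f"(conversational answer based on a {len(prompt)}-character prompt)"

def answer_question(question: str, service: str) -> str:
    # The graph-ML engine supplies structured, environment-specific findings;
    # the LLM translates them into an answer the whole team can follow.
    findings = run_graph_rca(service)
    prompt = (
        f"Engineer's question: {question}\n"
        f"Structured RCA findings: {findings}\n"
        "Explain the likely root cause and suggest next steps."
    )
    return call_llm(prompt)

print(answer_question("Why might the web pods be overloaded?", "checkout-page"))
```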

In short: a conversational interface for troubleshooting can make root cause insights understandable and actionable for the human teams tasked with investigation and remediation.

There’s no question that generic AI models like GPT will have a groundbreaking impact on observability. As we’ve seen, LLMs can vastly accelerate the troubleshooting process by translating root cause insights for human SREs. 

But truly intelligent observability is not as simple as bolting a chatbot onto an existing platform. It requires a complete system designed by domain experts to collect, structure, and analyze data through the lens of end-users and business impact. And generic AI models fall short in the specialized and demanding task of root cause analysis for distributed production systems.

At Senser, we’ve been building that system from the ground up since day one.

How Senser helps

Senser’s zero-instrumentation AIOps platform uses eBPF-based data collection to provide immediate, low-overhead visibility into your production environment – all of it. 

Senser automatically creates a topology of your environment, dynamically mapping dependencies across layers (application, APIs, network, infrastructure) to provide critical context for troubleshooting. And our graph ML-based approach helps you quickly pinpoint the root cause of service issues in even the most complex environment. 

Bringing together the power of LLMs (for conversational troubleshooting) with machine learning purpose-built for root cause analysis gives your team the best of both worlds: the right tools to vastly reduce mean time to detect (MTTD) and mean time to remediate (MTTR).