How We Built This: Complementing our AIOps Platform with LLMs

Senser has been an AI company from day one. We operate within the AIOps category, using machine learning to graph production cloud and IT environments and provide our customers with deep insights into root cause and change impact. While the buzz around generative AI in the observability market is recent and loud, we’ve spent the past year quietly evaluating the specific roles LLMs can play in bolstering our offering. This blog documents our process and the lessons we learned along the way.

Root cause analysis is one of the most challenging problems we face in observability. Because it requires an abundance of context and domain-specific knowledge, it is extremely difficult to automate. Application-, network-, and infrastructure-level data streams must be reliable and constantly monitored in relation to one another. Typically, the best root cause analysis requires a team of senior site reliability leaders and developers who know the ins and outs of the system, and who probably helped build it.

In an attempt to scale those senior leaders (and hopefully make their lives a little easier), we set out to find the symbiotic middle ground where domain expertise meets algorithmic efficiency.

Early experiments with algorithms

Our early attempts at algorithmic root cause analysis (in the pre-LLM days) went something like this:

1. Issue occurs
2. Data is gathered around the issue (time and proximity)
3. Data is fed into an algorithm
4. Algorithm identifies the issue with ~95% accuracy
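
To make this concrete, here is a deliberately simplified Python sketch of that pipeline; the event format, signature table, and time window are illustrative assumptions, not our production code:

```python
from datetime import datetime, timedelta, timezone

# Step 2: gather telemetry within a time window around the issue.
def gather_context(events, issue_time, window=timedelta(minutes=15)):
    return [e for e in events if abs(e["timestamp"] - issue_time) <= window]

# Steps 3-4: a naive "algorithm" that only matches known failure signatures.
KNOWN_SIGNATURES = {
    "OOMKilled": "pod memory exhaustion",
    "connection refused": "downstream service unavailable",
}

def classify_root_cause(context):
    for event in context:
        for signature, cause in KNOWN_SIGNATURES.items():
            if signature in event["message"]:
                return cause
    return "unknown"  # novel or rare issues fall straight through

now = datetime.now(timezone.utc)
events = [
    {"timestamp": now - timedelta(minutes=3), "message": "pod api-7f9 OOMKilled"},
    {"timestamp": now - timedelta(hours=2), "message": "deploy completed"},
]
print(classify_root_cause(gather_context(events, now)))  # -> pod memory exhaustion
```

Anything outside the known signatures, no matter how severe, comes back as “unknown”, which is exactly the behavior described in the next paragraph.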

Initially, this felt like success. But then we realized something important: the algorithm was only successful at identifying issues that were already known. How could it be useful for identifying the causes of more complex, rare, or previously unknown problems, where domain expertise is critical?

Algorithms follow a simple principle: quality in, quality out. When you feed a system high-quality data, you are more likely to get high-quality results. It is therefore difficult to develop a root cause algorithm that can handle “new” situations; the missing context is crucial.

To ensure an algorithm’s success, it’s essential to have a human domain expert feed it the appropriate data in the correct context. Finding the root cause isn’t a typical data science problem. In classical supervised learning, more high-quality data generally yields better results; the root cause problem space, however, is vast and contains many unknowns. Nor does it fit neatly into unsupervised anomaly detection: there is too much noise, with many insignificant anomalies and crucial patterns that may not look anomalous enough to surface. Domain expertise is vital to narrow the problem space, sift through the noise, and piece together the significant patterns.

So, our data scientists teamed up with our domain experts to build a series of algorithms and data collection tools that could conduct root cause analysis more accurately. By leveraging Graph Machine Learning, Senser maps a network of components (nodes) and their interactions (edges) to gain insight into the dynamics of complex systems. While full automation is the goal, we have already made strides in developing a system that greatly reduces workloads for SREs while improving diagnostic outcomes.
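
As a rough illustration of that node-and-edge view (a toy example, not our production graph model), here is how an environment might be represented and queried, assuming the networkx library is available:

```python
import networkx as nx

# Nodes are components; directed edges are observed interactions (caller -> callee).
g = nx.DiGraph()
g.add_edge("frontend", "checkout-api", p99_latency_ms=12)
g.add_edge("checkout-api", "payments-db", p99_latency_ms=85)
g.add_edge("checkout-api", "inventory-svc", p99_latency_ms=9)

# When "frontend" degrades, walk its downstream dependencies to scope the
# root-cause search to a small, connected neighborhood instead of the whole estate.
suspects = nx.descendants(g, "frontend")
print(suspects)  # {'checkout-api', 'payments-db', 'inventory-svc'}
```

Even this naive traversal shows how much the graph structure narrows the search; the real gains come from learning over the graph rather than simply walking it.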

    Motivation for exploring LLMs

    Along comes generative AI, which has taken the world by storm and is redefining the way many of us work. Each day, people are finding new ways to increase their productivity and performance with LLMs. In observability and site reliability, there is an immense amount of data that must be constantly processed. LLMs, on the surface, show promise in making sense of large swaths of information and communicating in human-understandable language.

We, like many others, recognized immediately that LLMs would play a role in the future of our industry: the only question was how.

    The promise and pitfalls of LLMs

With LLMs having arrived on the scene, we immediately started building a framework for how to incorporate them. First, we needed to consider how exactly generative AI models “learn” and generate their responses. LLMs are experts at word association and at inferring intent. They are great at developing connections between concepts and delivering a reasonable-sounding response to the prompter. However, they can also lie, or hallucinate, especially when the task they are given is too broad in scope.

    We considered what it would look like to deploy LLMs to root cause analysis. Retrieval-augmented generation (RAG), a technique for enhancing the accuracy and reliability of generative AI models with facts fetched from a variety of sources, offered some promise here. Using RAG, the LLM could pull information from logs, metrics, and past incident reports and synthesize that information into a cohesive story.
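
A minimal sketch of what that could look like, assuming a toy keyword-overlap retriever in place of the embeddings and vector store a real RAG system would typically use:

```python
# Retrieve the documents most relevant to the question, then pack them into a prompt.
def retrieve(query, documents, k=2):
    q_terms = set(query.lower().split())
    ranked = sorted(documents,
                    key=lambda d: len(q_terms & set(d.lower().split())),
                    reverse=True)
    return ranked[:k]

incident_docs = [
    "2023-11-02 incident: checkout latency spike traced to payments-db failover",
    "runbook: restart inventory-svc when queue depth exceeds 10k",
    "alert: checkout-api p99 latency above 2s since 14:05 UTC",
]

question = "Why is checkout latency high?"
context = "\n".join(retrieve(question, incident_docs))
prompt = (
    "Using only the context below, answer the question.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
print(prompt)  # this assembled prompt is what would be sent to the LLM
```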

But even with RAG, it was simply impossible to feed the model all of the network, application, and infrastructure data continuously or affordably. And even if it were possible, LLMs are not capable of understanding the highly structured, yet complex, relationships between the various components of an environment. To start, it would take an impractical amount of time and effort to translate the data into a language-based prompt format. Even then, an LLM could tell whether a certain metric is outside of the acceptable range, but it cannot infer which second- or third-order cause might be responsible for the change. Worse, if the model is missing critical information, it is more likely to guess at a wrong answer than to state that it does not know. That’s the intent part: LLMs have a compulsive need to answer the question asked, even if that means answering it incorrectly. In mission-critical situations, we can’t afford hallucinations.

    With all these considerations in mind, we knew that if LLMs were to play a role in our AIOps platform, there would need to be some serious constraints.

Having deeply studied the strengths and limitations of LLMs through the lens of AIOps, we identified two early use cases that LLMs are ready to tackle today.

    Saving time with summarization

One of the superpowers LLMs bring to the table is ultra-efficient summarization. Given a dense block of information, generative AI models can extract the main points and actionable insights. As with our earlier trials in algorithmic root cause analysis, we gathered all the data we could surrounding an observed issue, converted it into text-based prompts, and fed it to an LLM along with guidance on how it should summarize and prioritize the data. The LLM was then able to leverage its broad training and newfound context to summarize the issues and hypothesize about root causes. By constraining the scope of the prompt, giving the LLM the information and context it needs and nothing more, we were able to prevent hallucinations and extract useful insights from the model.
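
In spirit, the prompt construction looks something like the following sketch; the field names and wording are illustrative, not our actual templates:

```python
# Package the issue context into a tightly scoped summarization prompt.
def build_summary_prompt(issue):
    lines = [
        "You are assisting with incident triage.",
        "Summarize the signals below and rank the most likely root causes.",
        "Use only the data provided; if the evidence is insufficient, say so.",
        "",
        f"Service: {issue['service']}",
        f"Symptom: {issue['symptom']}",
        "Recent signals:",
    ]
    lines += [f"- {signal}" for signal in issue["signals"]]
    return "\n".join(lines)

print(build_summary_prompt({
    "service": "checkout-api",
    "symptom": "p99 latency 4x baseline since 14:05 UTC",
    "signals": [
        "CPU saturation on the node pool since 13:58",
        "checkout-api v2.41 deploy completed at 13:55",
        "no change in downstream payments-db latency",
    ],
}))
```

The instruction to admit insufficient evidence is doing a lot of work here; it is one of the simplest guards against the hallucination problem described above.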


It still takes a domain expert to gather and transform the relevant data and context for the LLM, and it still takes a domain expert to hunt down and validate the potential root causes identified by the model. But there are massive efficiency gains in the middle.

    Consider this hypothetical example: A network operations team notices intermittent outages affecting a segment of their servers. They collect logs, performance metrics, and historical incident reports, and feed this data into an LLM with a prompt focused on identifying patterns and potential causes of the outages. The LLM’s summarization highlights a possible correlation between high CPU usage and specific network traffic spikes. It also infers that a newly deployed software update could be causing the issue, as similar patterns are observed in post-deployment logs. The domain experts then investigate this hypothesis, confirming that the software update contains a bug that leads to resource exhaustion under specific conditions.

Combining powerful ML models with LLMs could prove a winning combination for AI-based explainability. Extracting the information surrounding an issue from a graph-based model and then summarizing it in text form will not only make AIOps more approachable, it will also make insights easier to communicate horizontally and vertically within an organization.

    Chatting with data

Summarizing issues and hypothesizing about root causes is already proving to be a promising application of LLMs in AIOps. But there’s another superpower LLMs possess that could be useful, too: chat.

    A huge reason for the widespread adoption of ChatGPT, Claude, Gemini, Llama, and other generalized models is their friendly, intuitive interface. We have all grown accustomed to search engines like Google, where we type our question, and instantly thousands of related links populate the page. LLMs take it a step further, turning your questions into conversations, allowing for iteration, clarification, and memory.

We recognized an opportunity to leverage these chat capabilities in our product. While it is impractical and prohibitively expensive to constantly provide LLMs with entire databases of observability data, it is not unreasonable to use them to “chat” with the data.

Earlier, we talked about how LLMs are great at recognizing intent. Leaning on that ability, we have developed a tool (in beta) that allows end users to ask specific questions, in free text, about their system data. The LLM, recognizing the intent of the question being asked, extracts only the data needed to answer it, without drowning itself in unnecessary context, and uses its generative capabilities to provide a response back to the user in human-readable language. While this is a simple, even reductive, way of using the technology, it is exactly the type of nuanced use case we, as an industry, are tuned to identify and leverage.
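
A highly simplified sketch of that flow, with made-up services and metrics (and the actual model call omitted), might look like this:

```python
# Route a free-text question to the narrow slice of data it actually needs.
METRICS = {
    "checkout-api": {"p99_latency_ms": 2300, "error_rate": 0.04},
    "inventory-svc": {"p99_latency_ms": 110, "error_rate": 0.001},
}

def route_question(question):
    q = question.lower()
    service = next((s for s in METRICS if s in q), None)
    if service and "latency" in q:
        return {service: {"p99_latency_ms": METRICS[service]["p99_latency_ms"]}}
    if service and "error" in q:
        return {service: {"error_rate": METRICS[service]["error_rate"]}}
    return {}  # nothing relevant: better to hand the LLM nothing than everything

question = "What is the latency of checkout-api right now?"
context = route_question(question)
print(context)  # only this small slice, plus the question, goes to the LLM
```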

As LLMs progress even further, we look to extend this capability beyond databases queried with text to all types of data structures, including lists, trees, graphs, or any combination found in relational or non-relational databases. Making the leap with LLMs from text-based data to these other formats will open interesting doors for the observability space.

    Improving model performance

    We continue to refine our models to improve their performance and accuracy in real-world scenarios. This involves ongoing research and experimentation, as well as continually testing new models as they become available.

    Extending the models to new use cases (SLO generation)

    We are exploring how LLMs can be extended to other use cases, such as generating Service Level Objectives (SLOs) and other performance metrics.

    Post-mortem report generation

Another potential application of LLMs is automatically generating post-mortem reports after incidents. Documenting issues and resolutions is not only a best practice; it’s also sometimes a compliance requirement. Rather than scheduling multiple meetings with SREs, developers, and DevOps engineers to collect information, could LLMs extract the necessary information from the Senser platform and generate reports automatically?
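
One way this could look (purely a sketch with hypothetical field names, not a shipped feature) is to assemble the facts the platform already holds into a drafting prompt:

```python
# Turn incident facts into a prompt asking the LLM to draft a post-mortem.
def build_postmortem_prompt(incident):
    timeline = "\n".join(f"- {ts}: {event}" for ts, event in incident["timeline"])
    return (
        "Draft a post-mortem with the sections Summary, Impact, Timeline, "
        "Root Cause, and Action Items. Use only the facts below.\n\n"
        f"Incident: {incident['title']}\n"
        f"Identified root cause: {incident['root_cause']}\n"
        f"Timeline:\n{timeline}\n"
    )

print(build_postmortem_prompt({
    "title": "Checkout latency degradation, 2024-03-12",
    "root_cause": "resource exhaustion after the checkout-api v2.41 deploy",
    "timeline": [
        ("13:55", "checkout-api v2.41 deployed"),
        ("14:05", "p99 latency alert fired"),
        ("14:40", "rollback completed, latency recovered"),
    ],
}))
```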

When we say that LLMs, on their own, fall short at root cause analysis, we mean it. But that doesn’t mean they won’t play a critical role. Our viewpoint is that they’ll serve a more nuanced function as an enabling technology, not a holistic solution.

As innovators in the observability space, we are constantly updating our views and beliefs about emerging technologies. It is our passion and curiosity that lead us to experiment with the latest and greatest, and we love sharing what we have learned with like-minded technologists. We hope you enjoyed this exploration into LLMs’ specific role in observability. Please feel free to reach out with any questions or comments, or to share the results of your own experiments!