How Graph Machine Learning is Changing the Game in Observability

Imagine you’re the coach of Manchester United, prepping for a big match – and your star striker, Marcus Rashford, just went down with an injury. How will this impact your game plan and chances of winning? 

As surprising as it might seem, this is precisely a scenario where graph machine learning (Graph ML) could come in handy. 

The complex interactions on a soccer field can be modeled using Graph ML, and the same approach is changing how we as an industry think about monitoring and analyzing complex distributed systems. (Plus, we’re soccer nerds and just wanted to write about soccer alongside observability and AIOps.)

What can soccer teach us about the impact of Graph ML on generating insights from complex, distributed systems? It turns out: a lot.

Graph machine learning is a branch of machine learning that learns from data structured as graphs, inferring properties of individual nodes, of the connections between them, or of the graph as a whole. That gives insight into the dynamics of any problem that can be expressed as a graph. Think of a graph like a network of points connected by lines. The points, called nodes, represent entities like players in a soccer game or components in a computer system. The lines, called edges, represent the relationships or interactions between these entities.

For instance, in a soccer game, each player is a node, and the passes, tackles, or relative positioning between players are the edges connecting them. In a distributed computer system, each service or application is a node, and the data flows or dependencies between them are the edges.
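
As a minimal sketch of that idea, a graph can be stored as a mapping from each node to the nodes it points at. The service names and the `build_graph` helper below are hypothetical and chosen only for illustration; in practice the edges would come from real telemetry.

```python
from collections import defaultdict

# Hypothetical edges: (caller, callee) pairs, as might be derived from traces.
calls = [
    ("frontend", "checkout"),
    ("frontend", "catalog"),
    ("checkout", "payments"),
    ("checkout", "inventory"),
    ("payments", "payments-db"),
    ("inventory", "inventory-db"),
    ("catalog", "inventory"),
]

def build_graph(edges):
    """Return an adjacency mapping: node -> set of nodes it depends on."""
    graph = defaultdict(set)
    for src, dst in edges:
        graph[src].add(dst)
        graph[dst]  # touch the key so leaf nodes also appear in the mapping
    return dict(graph)

graph = build_graph(calls)
nodes = sorted(graph)
edge_count = sum(len(deps) for deps in graph.values())
print(len(nodes), "nodes,", edge_count, "edges")
```

The same structure works for the soccer case: swap services for players and calls for passes.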

Think of your soccer team as a complex distributed system. Each player is a component, with roles and dependencies on others. An injury is like an outage – it can have a cascading effect. Graph algorithms can model these relationships and interdependencies, allowing you to analyze the potential blast radius and adjust tactics accordingly.
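
One simple way to estimate a blast radius, sketched here under the assumption that the dependency map is already known, is to invert the edges and walk outward from the failed component with a breadth-first search. The `depends_on` map and the `blast_radius` function are illustrative names, not part of any particular tool.

```python
from collections import deque

# Hypothetical dependency map: service -> services it calls.
depends_on = {
    "frontend": {"checkout", "catalog"},
    "checkout": {"payments", "inventory"},
    "catalog": {"inventory"},
    "payments": {"payments-db"},
    "inventory": {"inventory-db"},
    "payments-db": set(),
    "inventory-db": set(),
}

def blast_radius(depends_on, failed):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the edges so we can walk from a dependency to its callers.
    dependents = {node: set() for node in depends_on}
    for caller, callees in depends_on.items():
        for callee in callees:
            dependents.setdefault(callee, set()).add(caller)

    affected, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for caller in dependents.get(current, ()):
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected

print(sorted(blast_radius(depends_on, "inventory-db")))
# ['catalog', 'checkout', 'frontend', 'inventory']
```

This is plain graph traversal rather than machine learning, but it is the kind of structural question a graph model makes cheap to answer.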

Modeling Interactions

Consider Bruno Fernandes, a player rightfully celebrated by fans for orchestrating the midfield with great vision and passing ability. His interactions with teammates can be modeled in a graph, showing how crucial he is in connecting defense and attack. Without him, the team’s dynamics change, potentially leading to more reliance on less effective plays.

Similarly, in a distributed system, services and components interact with each other. Modeling these interactions as a graph shows which elements sit on critical paths and how much dependency and redundancy exists around them, which in turn helps in identifying the root cause of issues more accurately and quickly.
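
To make "critical path" and "redundancy" concrete, the sketch below scores each service by how many other services become unreachable from an entry point once it is removed. It is a deliberately simple stand-in for the centrality measures a real system might use, and the graph, `reachable`, and `criticality` are all hypothetical.

```python
# Hypothetical dependency map: service -> services it calls.
graph = {
    "frontend": {"checkout", "catalog"},
    "checkout": {"payments", "inventory"},
    "catalog": {"inventory"},
    "payments": {"payments-db"},
    "inventory": {"inventory-db"},
    "payments-db": set(),
    "inventory-db": set(),
}

def reachable(graph, start, removed=None):
    """Nodes reachable from `start` by following edges, skipping `removed`."""
    if start == removed:
        return set()
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def criticality(graph, entry):
    """For each node, count the other nodes `entry` loses access to when it is removed."""
    baseline = reachable(graph, entry)
    scores = {}
    for node in baseline - {entry}:
        remaining = reachable(graph, entry, removed=node)
        scores[node] = len(baseline) - len(remaining) - 1  # exclude the node itself
    return scores

scores = criticality(graph, "frontend")
for name in sorted(scores, key=lambda n: (-scores[n], n)):
    print(name, scores[name])
```

In this toy graph, removing `checkout` cuts off two further services, while removing `catalog` cuts off none because `inventory` is still reachable through `checkout`. That difference is what redundancy looks like in graph terms.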

Logs

In the context of Kubernetes observability, logs refer to the detailed records of events and actions occurring within the Kubernetes system and the applications it hosts. These logs provide qualitative insights into the behavior of the cluster’s components, including nodes, pods, and containers, helping identify errors, system state changes, and operational trends. Analyzing these logs is crucial for troubleshooting issues, understanding application performance, and ensuring Kubernetes security and compliance.

A typical Kubernetes observability tool stack for managing and analyzing logs often includes Fluentd and Elasticsearch. Fluentd is an open-source data collector for unified logging. Elasticsearch is part of the ELK Stack (Elasticsearch, Logstash, Kibana) and is used for storing and searching logs.

Reducing Mean Time to Resolve (MTTR)

Graph ML can significantly reduce the mean time to resolve (MTTR) an issue once it has been detected. By continuously learning from observability data, it can quickly infer the potential root causes of anomalies in a production environment. This is similar to how analyzing real-time player tracking data can help a coach make quick decisions to adapt their strategy during a match.

Identifying “Unknown Unknowns”

One of the biggest challenges in observability is identifying “unknown unknowns”: issues that you didn’t even know could occur. Graph ML can uncover these hidden issues by analyzing the complex interactions between services. This mirrors how unexpected patterns in player interactions might reveal new tactics or weaknesses in a soccer team.

To make sense of the complex interactions in a soccer game or a distributed system, a Graph ML workflow typically follows several key steps:

  • Data Collection: In the context of observability, data collection involves gathering telemetry data such as metrics, logs, and traces from various services and components. This is akin to collecting data on all the players and their interactions on the soccer field.
  • Graph Construction: Build a graph where nodes represent entities (players or services) and edges represent their relationships (passes or data flows). This graph forms the basis for analysis.
  • Learning and Analysis: Apply machine learning algorithms to analyze the graph. This might involve looking for patterns, identifying key nodes, or predicting future interactions. For instance, it might identify that a sudden spike in latency in one service could affect several downstream services.
  • Insights and Predictions: Use the analysis to gain insights and make predictions. For example, the model might predict which service is likely to fail next or identify the root cause of a performance issue.
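
The four steps above can be sketched end to end in a few lines. This is a toy illustration, not a production pipeline: the call records are made up, every function name is hypothetical, and the "learning" step is reduced to a single round of unweighted neighbor aggregation, the building block that graph neural networks extend with learned weights.

```python
from collections import defaultdict

# Step 1, data collection: hypothetical call records (caller, callee, succeeded).
calls = [
    ("frontend", "checkout", True),
    ("frontend", "checkout", True),
    ("frontend", "catalog", True),
    ("checkout", "payments", False),
    ("checkout", "payments", True),
    ("checkout", "inventory", True),
    ("catalog", "inventory", True),
    ("payments", "payments-db", False),
    ("payments", "payments-db", False),
]

def build_graph(calls):
    """Step 2, graph construction: adjacency map plus a per-node error rate."""
    depends_on = defaultdict(set)
    total = defaultdict(int)
    failed = defaultdict(int)
    for caller, callee, succeeded in calls:
        depends_on[caller].add(callee)
        depends_on[callee]  # make sure leaf services appear as nodes
        total[callee] += 1
        failed[callee] += 0 if succeeded else 1
    error_rate = {n: (failed[n] / total[n] if total[n] else 0.0) for n in depends_on}
    return dict(depends_on), error_rate

def inherited_risk(depends_on, error_rate):
    """Step 3, analysis: average the error rates of each node's direct dependencies."""
    risk = {}
    for node, deps in depends_on.items():
        risk[node] = sum(error_rate[d] for d in deps) / len(deps) if deps else 0.0
    return risk

def most_at_risk(risk, top=2):
    """Step 4, insights: the services most exposed to failing dependencies."""
    ranked = sorted(risk.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked[:top] if score > 0]

depends_on, error_rate = build_graph(calls)
risk = inherited_risk(depends_on, error_rate)
print(most_at_risk(risk))  # ['payments', 'checkout']
```

Here `payments` ranks first because its only dependency is failing on every call, and `checkout` follows because one of its two dependencies is degraded. A trained model would learn how much weight each neighbor deserves instead of averaging them equally.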

Graph ML is already making a significant impact in the field of observability. Here are some examples:

  • Root Cause Analysis: By modeling service dependencies as a graph, Graph ML can quickly identify the root cause of issues in a distributed system. For instance, if a payment service fails, Graph ML can trace the issue back to a specific database query that is causing a bottleneck, reducing the time and effort required to troubleshoot and resolve problems.
  • Anomaly Detection: Graph ML can detect anomalies in observability data, such as unusual spikes in latency or error rates. For example, it might identify an unexpected increase in response times for a microservice and alert the team before it impacts users.
  • Performance Optimization: By analyzing the interactions between services, Graph ML can provide insights into how to optimize the performance of your IT environment. For example, it might suggest reconfiguring the load balancer to distribute traffic more evenly, ensuring that the system runs smoothly and efficiently.
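
The first two examples can be combined in a short sketch. It assumes a simple z-score test in place of a learned anomaly detector, and a common heuristic for root cause ranking: among the anomalous services, prefer those whose own dependencies look healthy. The latency figures, service names, and function names are all hypothetical.

```python
from statistics import mean, stdev

# Hypothetical dependency map: service -> services it calls.
depends_on = {
    "frontend": {"checkout", "catalog"},
    "checkout": {"payments"},
    "catalog": set(),
    "payments": {"payments-db"},
    "payments-db": set(),
}

# Hypothetical latency history (ms) and the latest observation per service.
history = {
    "frontend": [100, 102, 98, 101, 99],
    "checkout": [50, 52, 49, 51, 48],
    "catalog": [30, 31, 29, 30, 32],
    "payments": [20, 21, 19, 20, 22],
    "payments-db": [5, 6, 5, 4, 5],
}
latest = {"frontend": 180, "checkout": 130, "catalog": 31, "payments": 95, "payments-db": 70}

def is_anomalous(samples, value, threshold=3.0):
    """Flag `value` if it lies more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(samples), stdev(samples)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold

def root_cause_candidates(depends_on, anomalous):
    """Anomalous services none of whose direct dependencies are anomalous."""
    return {svc for svc in anomalous if not (depends_on.get(svc, set()) & anomalous)}

anomalous = {svc for svc in latest if is_anomalous(history[svc], latest[svc])}
print(sorted(root_cause_candidates(depends_on, anomalous)))  # ['payments-db']
```

Four services look slow, but only `payments-db` has no slow dependency of its own, so it is the most likely starting point. That mirrors the payment-service example above, where the failure traces back to the database.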

Graph machine learning is revolutionizing observability by providing deep insights into complex interactions within distributed systems. By leveraging Graph ML, IT professionals can gain a competitive edge in maintaining the health and performance of their environments. Just as a soccer coach uses data to make informed decisions and optimize team performance, Graph ML can help you make informed decisions and optimize your observability strategies.

The Senser Advantage

Senser is an AIOps platform that harnesses the power of machine learning to provide actionable insights and streamline issue resolution. By using graph analytics to model service dependencies and analyze observability data, Senser empowers engineering teams to optimize their infrastructure and application performance, reducing both mean time to detect (MTTD) and MTTR for production issues.

Senser’s advanced algorithms can also uncover blind spots and “unknown unknowns” that traditional monitoring approaches might miss, saving time and resources in identifying, analyzing, and resolving service degradation issues.