In today’s world of complex, interconnected production environments, outages and performance issues have never been costlier. A few minutes of downtime can mean millions in lost revenue and irreparable damage to a company’s reputation.
And the fragility is only getting worse. Trends like cloud-native architectures, hybrid cloud, and microservices introduce flexibility but also exponentially increase complexity. Microservices in particular let teams use different programming languages and tech stacks for individual services, which brings independence but adds yet another layer of complexity. The typical enterprise now runs on a tangled web of interdependent services and infrastructure that makes debugging failures vastly more difficult.
When an issue arises, traditional approaches to troubleshooting fall painfully short. Teams waste countless hours trying to trace error messages and alerts back to a root cause. Is the problem in the network, the database, the load balancer? The culprit constantly seems to shift as teams play whack-a-mole with symptoms.
Manual troubleshooting: a losing battle
Let’s walk through a typical example of how troubleshooting happens today. Say an e-commerce site experiences a surge of 503 errors during peak traffic. The site reliability engineer (SRE) gets paged at 3am:
Step 1: The SRE checks dashboards and sees high memory utilization on the web servers. They restart the web pods and add more instances to handle load (or use an autoscaler). But errors persist.
Step 2: Logging shows timeout errors calling the product catalog microservice. The SRE suspects the catalog service can’t scale, so they provision more instances. But errors continue.
Step 3: Finally, after a war room is convened and multiple teams get pulled in, the root cause emerges: the Kubernetes cluster autoscaler has hit resource quotas, preventing new Pods from launching and services from scaling out. Fixing the quotas eliminates the errors.
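For illustration only, here is a minimal sketch, using the official Kubernetes Python client, of the kind of quota check that would have surfaced the real culprit much sooner. The "production" namespace is hypothetical, and a real check would cover every namespace the autoscaler touches:

```python
# Minimal sketch (hypothetical "production" namespace): compare ResourceQuota
# usage against its limits, then scan recent events for quota-related failures.
from kubernetes import client, config

config.load_kube_config()   # or config.load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()
namespace = "production"    # hypothetical namespace name

# 1. Is any ResourceQuota at or near its limit?
for quota in v1.list_namespaced_resource_quota(namespace).items:
    for resource, hard in (quota.status.hard or {}).items():
        used = (quota.status.used or {}).get(resource, "0")
        print(f"{quota.metadata.name}: {resource} used={used} / hard={hard}")

# 2. Are Pods actually failing to launch because of it?
for event in v1.list_namespaced_event(namespace).items:
    if event.reason in ("FailedCreate", "FailedScheduling") and "quota" in (event.message or "").lower():
        print(f"{event.involved_object.kind}/{event.involved_object.name}: {event.message}")
```

The same information is available interactively via `kubectl describe resourcequota` and `kubectl get events`; the point is that nobody thought to look there until hour four.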
This process took 4+ hours of precious engineering time to find a root cause that spanned multiple layers. And that is just one example – the number of cross-functional dependencies in modern environments makes scenarios like this increasingly common.
The costs add up
In addition to the hard costs of downtime during outages and performance issues, poor root cause analysis drives up costs in multiple ways:
- Increased latency from unresolved performance degradation
- Engineer time wasted chasing false leads
- Delayed innovation as teams fight fires
- Customer trust and loyalty erosion
- Revenue impacts from subpar user experiences
As environments get more complex, companies need ways to radically improve mean time to detect (MTTD) and mean time to resolution (MTTR) for system issues. The solution is to augment human intelligence with machine intelligence – replacing manual troubleshooting with automated root cause analysis.
The challenges of automated root cause analysis
Machine learning (ML) promises a path to automatically trace issues back to their origin. But most organizations struggle to build effective ML pipelines for root cause analysis. A few main challenges arise:
1. Lack of a service topology
Without a graph of all system components and their dependencies, ML models have no context for pinpointing failures. Manually creating an accurate topology is enormously difficult.
2. The need for specialized search algorithms
Even with a topology in hand, searching it to isolate root causes requires specialized graph algorithms, and off-the-shelf ML libraries don't provide them (see the sketch below).
3. Making inferences understandable
ML models identify statistical correlations between events, but translating those correlations into actionable insights for engineers is tough. The root cause needs to be expressed in terms of clear business and user impact.
4. Disparate and dynamic environments
Each environment is unique and behaves differently. On top of that variation sits constant change: new versions, new services, new APIs, new users, and new nodes all contribute to a shifting ecosystem for ML models to parse. And telemetry (logs, metrics, traces, events) enriches every layer, from infrastructure to APIs.
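To make challenges 1 and 2 concrete, here is a minimal sketch (service names are hypothetical, and the graph is hand-written rather than auto-discovered) of a service topology as a dependency graph, plus the kind of graph-aware search a root cause analysis pipeline needs and generic ML libraries don't ship:

```python
# Hypothetical service topology as a directed graph: edge A -> B means "A depends on B".
import networkx as nx

topology = nx.DiGraph()
topology.add_edges_from([
    ("web-frontend", "product-catalog"),
    ("web-frontend", "checkout"),
    ("product-catalog", "catalog-db"),
    ("checkout", "payments-api"),
    ("catalog-db", "node-pool-a"),
    ("payments-api", "node-pool-a"),
])

def candidate_root_causes(graph, alerting_services):
    """Rank downstream components by how many alerting services transitively depend on them.

    A dependency shared by everything that is currently alerting is a strong
    root-cause candidate; one that only a single alerter touches is weaker.
    """
    scores = {}
    for service in alerting_services:
        for dependency in nx.descendants(graph, service):
            scores[dependency] = scores.get(dependency, 0) + 1
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Both alerting services share catalog-db and node-pool-a downstream,
# so those two outrank components that only one alerter depends on.
print(candidate_root_causes(topology, ["web-frontend", "product-catalog"]))
```

In production, the topology has to be discovered and kept current automatically, and the search has to weigh noisy, time-shifted telemetry rather than a clean list of alerting services. But even this toy version shows why the problem is graph-structured rather than something a generic classifier can solve.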
Solving these challenges requires an integrated approach combining automation, machine learning pipelines built for time-series data at scale, and interfaces optimized for human understanding. Most organizations lack these capabilities today.
The bottom line
Production environments are becoming more complex and distributed. And it’s increasingly essential that teams tasked with maintaining service quality and uptime embrace automated root cause analysis to quickly discover and remediate production issues. But that can be difficult to do at scale. Engineers need a data-driven solution to automatically turn production chaos into clarity, illuminating the precise root cause of any issue as soon as it emerges.
Learn more about how Senser helps SRE and DevOps teams go from production chaos to intelligence.