Why is the DevOps team always at fault?

The age-old adage of “always blame the network” has morphed into “always blame the DevOps team.”

It’s tough out there for DevOps. In today’s complex and interconnected production environments, when something goes wrong, the blame almost always lands first on the DevOps team. 

Here’s why that’s the wrong approach – and what to do instead.

The “blame DevOps” reflex

DevOps teams are responsible for ensuring applications and infrastructure are deployed quickly, reliably, and securely. To accomplish this, they use practices like CI/CD, infrastructure as code, and monitoring.

Given their broad responsibilities, DevOps is typically the first team suspected when something goes wrong in production. After all, they touched it last, right? 

But while intuitive, this reflex is problematic and costly for three reasons:

1. Modern architectures make root causes hard to diagnose

Gone are the days of monolithic applications running in isolated silos. Microservices, serverless, managed and third-party services, and hybrid/multi-cloud deployments interact in complex ways across infrastructures, networks, and applications. An issue in one area can quickly cascade into other domains. 

For example, let’s say a Kafka broker becomes overloaded. The symptoms – high CPU or memory usage, high network and I/O usage, increased consumer lag, and consumer or producer failures – are easy enough to observe. But the root cause, particularly in mixed infrastructure and application cases like this, is far from obvious.  
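One of the symptoms above – consumer lag – is simply the gap between how far the broker's log has grown and how far the consumer group has read. A minimal sketch of that calculation (the function names, partition names, and offset numbers are illustrative, not from any real Kafka API):

```python
# Per-partition consumer lag: the broker's log-end offset minus the
# consumer group's committed offset (never negative).
def partition_lag(log_end_offset: int, committed_offset: int) -> int:
    """Messages the consumer still has to catch up on for one partition."""
    return max(log_end_offset - committed_offset, 0)

def total_lag(offsets: dict) -> int:
    """Sum lag across partitions; offsets maps partition -> (log_end, committed)."""
    return sum(partition_lag(end, committed) for end, committed in offsets.values())

# Hypothetical snapshot of a topic's partitions
snapshot = {
    "orders-0": (10_500, 10_450),   # lag 50
    "orders-1": (9_800, 9_800),     # lag 0
    "orders-2": (12_000, 11_300),   # lag 700 – this partition is falling behind
}

print(total_lag(snapshot))  # 750
```

The number is easy to observe; whether the lag is caused by a slow consumer, a saturated broker, or a network issue is exactly the ambiguity the surrounding section describes.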

So when something breaks, it’s not always clear where the root cause originated – it could plausibly be an app bug, a network blip, or an infrastructure config change. But DevOps inevitably shoulders the initial blame regardless.

2. Guilty until proven innocent burdens DevOps

This “guilty until proven innocent” dynamic places a heavy operational burden on DevOps teams. They constantly have to drop everything and go into war room mode whenever an issue crops up – before any proper diagnosis or root cause analysis has taken place. 

For example, let’s say a database server runs out of memory, crashing a key microservice. DevOps gets paged at 2am to urgently investigate. They spend hours proving the memory config wasn’t changed and it’s not their issue. By the time the DBA team wakes up and finds the actual root cause – a memory leak in the app – countless hours have been wasted. 
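The kind of application-level leak described in this anecdote often looks innocuous in code – for instance, a cache that is only ever written to. A hypothetical sketch (not taken from any real incident):

```python
# An unbounded cache: entries are added but never evicted, so memory
# usage climbs steadily until the process is OOM-killed – while the
# symptom (memory exhaustion) surfaces on infrastructure dashboards.
class QueryCache:
    def __init__(self):
        self._cache = {}  # grows without bound – nothing is ever evicted

    def get(self, key, compute):
        if key not in self._cache:
            self._cache[key] = compute(key)
        return self._cache[key]

cache = QueryCache()
# Every unique key (e.g. a timestamped query string) adds an entry forever.
for i in range(10_000):
    cache.get(f"query-{i}", lambda k: k.upper())

print(len(cache._cache))  # 10000
```

Nothing here implicates DevOps – the fix belongs in application code – yet the page at 2am still goes to the team running the infrastructure.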

This interrupts DevOps’ normal work and demoralizes the team over time (more commentary on the topic, and some colorful language, on this Reddit thread).

3. It introduces high costs before a root cause is found

Having DevOps operate in a reactive mode is hugely inefficient. There is a massive opportunity cost anytime engineering resources are pulled away from feature work and into knee-jerk firefighting. And that’s before the root cause is even understood. 

For example, if every week DevOps spends one day doing unplanned work on issues they didn’t cause, that’s 20% of their bandwidth lost. And the associated delays in shipping new features can also be substantial.

A better approach

What’s needed is a new approach that avoids reflexively assigning blame, and instead uses technology to quickly find the true root cause of system issues. With the rise of AIOps platforms leveraging big data, ML, and advanced telemetry, we now have the capability to perform rapid, data-driven fault isolation.

Rather than immediately pull DevOps engineers into war room mode whenever anything goes wrong, companies should first leverage AIOps to automatically map dependencies and changes across their entire production environment – spanning cloud, containers, microservices and more. The AIOps system can then use that mapped topology and advanced telemetry data to intelligently isolate the root cause of the issue.

Only once the initial data-driven diagnosis is complete should engineers be pulled in – but now they can target the area that has been intelligently identified as the likely culprit. No more wasted war room time for DevOps proving they didn’t break anything.

How Senser helps

This paradigm shift can seem impossible with legacy reactive monitoring tools. But modern AIOps solutions like Senser provide the necessary capabilities by leveraging leading-edge technologies such as:

  • eBPF for high-fidelity monitoring without performance overhead
  • Auto-generated service maps across infra/app/network  
  • Noise reduction through ML baselining of all telemetry   
  • Root cause analysis powered by topology-aware ML algorithms
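“ML baselining,” in its simplest form, means learning a baseline for each metric and flagging points far outside it. A toy rolling z-score sketch, assuming nothing about how Senser actually implements it (production baselining handles seasonality, multivariate signals, and much more):

```python
import statistics

def anomalies(series, window=20, threshold=3.0):
    """Indices where a point deviates more than `threshold` standard
    deviations from a baseline built over the preceding `window` points."""
    flagged = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mean = statistics.fmean(baseline)
        stdev = statistics.pstdev(baseline)
        if stdev > 0 and abs(series[i] - mean) > threshold * stdev:
            flagged.append(i)
    return flagged

# A steady CPU metric with one spike at index 25
cpu = [50.0 + (i % 3) for i in range(30)]
cpu[25] = 95.0
print(anomalies(cpu))  # [25]
```

The point of baselining is noise reduction: the routine 50–52% fluctuation never fires an alert, so the one genuine spike stands out.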

With these innovations, the AIOps system itself can shoulder the burden of initial fault isolation and noise reduction. This frees DevOps teams from unfair accusations, while still rapidly identifying the root cause of system issues. The result is reduced costs, less firefighting, and more innovation.  

It’s time to move beyond reflexively assigning blame, and instead empower engineers through unbiased data. New AIOps solutions finally make this possible, leading to happier, more productive teams and lower operational costs. The DevOps team shouldn’t always be the scapegoat – with the right technology, we can replace blame with insights.

Learn more about how Senser helps SRE and DevOps teams go from production chaos to intelligence.