10-Minute Data Strategy Audit: Part 1 - Infrastructure Debt

Solving Infrastructure Toil (Without Deleting Your Cloud)

Feb 11, 2026

In my 13 years in data, I’ve seen teams with brilliant models get completely sidelined by “infrastructure friction.” We often think of infrastructure debt as “using old versions of software.” But the real debt is Toil. Google’s SRE team defines Toil as the manual, repetitive, automatable, “tactical” work that has no lasting value and grows in step with your system. If you have 5 pipelines, you can manage them manually. If you have 50, you’re no longer a Data Scientist; you’re a full-time pipe-fixer.

What Infrastructure Debt Actually Looks Like:

It’s never a single catastrophic failure; it’s a thousand tiny papercuts:

The “Local Machine” Trap: A critical model only runs on one person’s laptop because the production environment is missing “certain libraries” no one can identify.
The Weekend Pipeline: A data load fails every Saturday morning. Instead of fixing the root cause, the team has a rotating “weekend shift” just to hit the restart button.
The Ghost in the Machine: You change a variable in Staging, it works perfectly, you move it to Production, and everything breaks. Why? Because six months ago, someone manually changed a setting in the Prod console and never told anyone.
The “Surprise” Cloud Bill: You realize you’ve been paying $2,000 a month for a GPU instance that was spun up for a “quick test” in 2023 and forgotten.

Why do we fall into this? We do it because we are under pressure. We tell ourselves: “I’ll just fix this manually today so we can hit the deadline, and I’ll automate it properly next week.” But “next week” never comes. Eventually, the team spends 60% of their time just “keeping the lights on,” leaving only 40% for actual innovation.

Three Ways to Start “Getting to Green”

You don’t need to be a DevOps engineer to fix this. You just need to be disciplined about where you spend your team’s energy.

1. Target the “Manual Restart”

Find the one thing your team has to “check on” every day. Is it a server that runs out of memory? A pipeline that needs a manual kick-start?

The Solution: Don’t rebuild the whole architecture. Just add a simple automated “health check” or a script that cleans up the memory. If you save 20 minutes a day, you’ve just bought your team about 100 minutes of deep work every five-day week.
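To make the “manual kick-start” idea concrete, here is a minimal sketch of a wrapper that presses the restart button for you. Everything in it is an assumption for illustration: `run_with_restart`, `max_attempts`, and `wait_seconds` are hypothetical names, and `job` stands in for whatever function kicks off your pipeline. It does not fix the root cause; it only makes sure a human gets involved when the automatic restart did not help.

```python
import time
from typing import Callable


def run_with_restart(
    job: Callable[[], None],
    max_attempts: int = 3,
    wait_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run `job`, restarting it when it raises.

    Returns the number of attempts that were needed. If every attempt
    fails, the last exception is re-raised so the failure stays visible.
    `job` is a hypothetical stand-in for your own pipeline entry point.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(1, max_attempts + 1):
        try:
            job()
            return attempt
        except Exception as exc:
            # Log every failure: a job that "always needs two tries"
            # is a root cause waiting to be found.
            print(f"attempt {attempt}/{max_attempts} failed: {exc}")
            if attempt == max_attempts:
                raise
            sleep(wait_seconds)

    raise AssertionError("unreachable")
```

Wrapping the Saturday load as `run_with_restart(load_weekend_data)` (where `load_weekend_data` is your own function) would replace the rotating weekend shift with a log line. The printed failures are the part worth keeping an eye on, because they tell you how often the restart is really needed.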

2. Declare “Console Freeze”

One of the biggest sources of debt is “Click-Ops”: manually clicking buttons in the AWS or Azure console. It’s fast in the moment, but it’s invisible to the rest of the team.

The Solution: Start a “Source of Truth” document (or a repo). If a setting is changed in the cloud, it must be updated in the doc first. This moves you toward “Infrastructure as Code” without needing to learn complex new tools overnight.
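Once the “Source of Truth” lives in a repo, a small comparison script can catch the “Ghost in the Machine” before it bites. The sketch below assumes you can export both the documented settings and the live settings as plain dictionaries; how you export them depends on your cloud and is out of scope here. The function name `find_drift` and the example keys are hypothetical.

```python
from typing import Any, Dict, List, Tuple

# Marker used when a setting exists on one side only.
MISSING = "<missing>"


def find_drift(
    declared: Dict[str, Any], actual: Dict[str, Any]
) -> List[Tuple[str, Any, Any]]:
    """Compare documented settings with live settings.

    Returns one (key, declared_value, actual_value) tuple for every
    setting that differs or that appears on only one side.
    """
    drift = []
    for key in sorted(set(declared) | set(actual)):
        want = declared.get(key, MISSING)
        have = actual.get(key, MISSING)
        if want != have:
            drift.append((key, want, have))
    return drift


if __name__ == "__main__":
    # Hypothetical example values, not taken from any real environment.
    documented = {"instance_type": "m5.large", "max_connections": 100}
    live = {"instance_type": "m5.large", "max_connections": 250, "debug": True}
    for key, want, have in find_drift(documented, live):
        print(f"{key}: documented={want!r} live={have!r}")
```

An empty result means the doc and the console agree. Anything else is a change someone made by hand and never wrote down.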

3. Set “Budget Guardrails”

Infrastructure debt often manifests as financial waste.

The Solution: Instead of complex cost-optimization, just set up automated alerts. If a specific project’s cost spikes by 20% in a single day, the lead gets an email. Visibility alone usually changes team behavior overnight: people start cleaning up their “zombie” instances when they know the bill is being watched.
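The 20% rule is simple enough to sketch in a few lines. The example below assumes you already have daily cost per project as two dictionaries (for instance, from a billing export); the function name `cost_spikes` and the project names are hypothetical. It only builds the alert messages, so sending the email is left to whatever notification tool you already use.

```python
from typing import Dict, List


def cost_spikes(
    yesterday: Dict[str, float],
    today: Dict[str, float],
    threshold: float = 0.20,
) -> List[str]:
    """Return one alert message per project whose daily cost rose by at
    least `threshold` (0.20 means 20%) compared with the previous day."""
    alerts = []
    for project in sorted(today):
        previous = yesterday.get(project, 0.0)
        current = today[project]
        if previous <= 0:
            # No baseline to compare against: flag brand-new spend so a
            # "quick test" instance does not go unnoticed.
            if current > 0:
                alerts.append(
                    f"{project}: new spend of ${current:.2f} with no baseline"
                )
            continue
        change = (current - previous) / previous
        if change >= threshold:
            alerts.append(
                f"{project}: ${previous:.2f} -> ${current:.2f} (+{change:.0%})"
            )
    return alerts


if __name__ == "__main__":
    # Hypothetical daily costs in dollars.
    for line in cost_spikes(
        {"etl": 100.0, "ml": 50.0},
        {"etl": 130.0, "ml": 55.0, "sandbox": 12.0},
    ):
        print(line)
```

Flagging spend with no baseline is a design choice on my part, not part of the 20% rule itself. It is there because forgotten test instances tend to start as new line items rather than as spikes.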

The Bottom Line

Infrastructure shouldn’t be a “hobby” your team does on the side. It is the foundation of your velocity. Getting to “Green” doesn’t mean your cloud is perfect; it means your cloud is boring. You want an infrastructure that is so stable and automated that you actually forget it’s there.

Does your team have a “Hero” who is the only one who knows how to fix the servers? That’s a major debt signal. Let’s talk about how to break that cycle in the comments.
