Plumbing Maintenance: What Breaks and How Often
Nobody budgets for maintenance. The pitch deck shows the architecture diagram with clean arrows between services. The sprint plan covers building the integrations. The roadmap shows the features those integrations enable. Nowhere in any of these documents is a line item that says "keep this working after we build it." And yet maintenance — not building, maintaining — is where most of the actual cost of AI plumbing lives. The build is a one-time expense. The maintenance is a subscription you never agreed to, billed in engineering hours, payable forever.
What It Actually Does
AI plumbing breaks in six predictable categories. They're predictable because they happen to everyone, on roughly the same timelines, for the same reasons. The unpredictable part is which one hits you first.
Auth expiry is the most common failure and the most preventable. OAuth tokens expire — typically every hour, with refresh tokens lasting days to months depending on the provider. API keys don't expire automatically but get rotated by admins, revoked during security audits, or invalidated when someone changes the associated account. If your system doesn't handle token refresh automatically, this isn't a potential problem. It's a scheduled outage you haven't put on the calendar yet. The typical cadence: OAuth token refresh failures surface within days to weeks, depending on the provider's token lifetime. API key rotations hit whenever your ops team or the service provider triggers one — quarterly if you're disciplined, randomly if you're not.
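The refresh-before-expiry pattern can be sketched in a few lines. This is a minimal illustration, not any provider's real client: `fetch_token` is a hypothetical callable standing in for your provider's token endpoint, and the retry loop is there because a single network blip during refresh shouldn't become the scheduled outage described above.

```python
import time

class TokenManager:
    """Refreshes an OAuth access token before it expires, with retry.

    `fetch_token` is a hypothetical callable standing in for a real
    provider's token endpoint; it returns (access_token, lifetime_seconds).
    """

    def __init__(self, fetch_token, refresh_margin=300):
        self.fetch_token = fetch_token
        self.refresh_margin = refresh_margin  # refresh this many seconds early
        self._token = None
        self._expires_at = 0.0

    def get(self):
        # Refresh inside the safety margin, not after expiry.
        if time.time() >= self._expires_at - self.refresh_margin:
            for attempt in range(3):  # a network blip shouldn't kill the run
                try:
                    token, lifetime = self.fetch_token()
                    self._token = token
                    self._expires_at = time.time() + lifetime
                    break
                except OSError:
                    if attempt == 2:
                        raise
                    time.sleep(2 ** attempt)  # brief pause between retries
        return self._token
```

The point of the margin is that refresh happens while the old token still works, so a transient refresh failure can be retried without any request ever going out unauthenticated.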
API deprecation is slower but more disruptive. Major API versions get deprecated with 6-12 months' notice, usually. Minor endpoint changes — a field renamed, a parameter moved, a response format tweaked — happen more frequently and with less fanfare. Google in particular has a reputation for retiring APIs and products on aggressive timelines. Stripe gives generous notice but still ships breaking changes that require migration work. The pattern: expect at least one meaningful API change per year per service you integrate with. If you integrate with five services, you're doing API migration work at least quarterly.
Rate limit changes are the invisible wall. You build and test at low volume. Everything works. Usage grows. Suddenly your integration is getting throttled because the service tightened its rate limits, or your traffic pattern changed, or you hit a tier boundary you didn't know existed. Rate limit documentation is often incomplete or outdated — the documented limits and the actual limits diverge, and you find out by getting 429 responses. This typically surfaces 1-3 months after launch, when real usage patterns diverge from testing patterns.
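The standard answer to 429s is exponential backoff with jitter, so a batch of throttled requests doesn't retry in lockstep and burn the remaining quota. A minimal sketch, with `request` standing in for any callable that returns a status code, and the sleep function injectable so the behavior is testable:

```python
import random
import time

def call_with_backoff(request, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Retry a rate-limited call with exponential backoff and full jitter.

    `request` is any callable returning (status_code, body); a 429 status
    means throttled. `sleep` is injectable so tests don't actually wait.
    """
    for attempt in range(max_retries + 1):
        status, body = request()
        if status != 429:
            return status, body
        if attempt == max_retries:
            raise RuntimeError("rate limit: retries exhausted")
        # Full jitter spreads retries out so concurrent callers don't
        # re-stampede the endpoint at the same instant.
        delay = random.uniform(0, base_delay * (2 ** attempt))
        sleep(delay)
```

The jitter is not optional decoration: without it, every item in a throttled batch retries at the same moment, which is exactly the runaway-retry failure described in the Week 14 scenario below.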
Schema drift is the subtle one. The API still returns 200. The response still has the fields you expect. But a new field was added, an enum got a new value your code doesn't handle, or a nullable field started returning null where it previously always had a value. Your code processes the response successfully — and produces wrong results. Schema drift is the hardest category to detect because it doesn't trigger errors. It triggers incorrect data. The cadence depends on the provider, but for actively developed APIs, expect minor schema changes every 1-3 months.
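Because drift arrives inside a 200, the only way to see it is to compare each response against what your code actually handles. A hedged sketch of such a check, with illustrative field names rather than any real API's schema:

```python
def check_schema(record, expected_fields, known_enums):
    """Return drift warnings for one API response record.

    A 200 with an unexpected shape is still a failure; this surfaces it.
    `expected_fields` maps field name -> whether None is acceptable;
    `known_enums` maps field name -> the set of values our code handles.
    """
    warnings = []
    for field, nullable in expected_fields.items():
        if field not in record:
            warnings.append(f"missing field: {field}")
        elif record[field] is None and not nullable:
            warnings.append(f"unexpected null: {field}")
    for field in record:
        if field not in expected_fields:
            warnings.append(f"new field: {field}")  # drift, not yet an error
    for field, allowed in known_enums.items():
        value = record.get(field)
        if value is not None and value not in allowed:
            warnings.append(f"unknown enum value: {field}={value!r}")
    return warnings
```

Warnings rather than exceptions is a deliberate choice: a new field is usually benign, an unknown enum value usually isn't, and you want both logged before deciding which ones should halt the pipeline.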
Provider outages are the most visible and the least controllable. AWS goes down. Stripe has a bad day. Google Cloud has a multi-region incident. Your integration's error handling is either good enough to weather this or it isn't, and you won't find out until it happens. Major providers have 99.9%+ uptime SLAs, which sounds great until you calculate that 99.9% uptime still allows for 8.76 hours of downtime per year. Spread across multiple providers, you're looking at your integrations experiencing upstream outages several times annually.
Dependency updates are the maintenance tax on your own code. The SDK you use to call an API releases a new version. The MCP server library has a security patch. The runtime environment (Node.js, Python) ships a breaking release. Each of these requires testing and potentially updating your integration code, even if the upstream service hasn't changed at all. The cadence: major dependency updates every 3-6 months, security patches unpredictably.
What The Demo Makes You Think
The demo shows integrations as pipes — you connect them once and data flows. The architecture diagram shows arrows between boxes. The mental model is plumbing: install it correctly and it works until the building is demolished.
The actual mental model is gardening. Integrations are living things that require regular attention. Skip the watering for long enough and things die. The difference is that dead pipes leak visibly. Dead integrations fail silently.
Here's the failure timeline that nobody shows in a demo. You build an AI pipeline that reads from a Google Sheet, processes data through Claude, and writes results to a Notion database, with Slack notifications along the way. Four integrations. At launch, it works perfectly.
Week 3: The Google OAuth token refresh fails because of a network blip during the refresh attempt. Your pipeline stops reading new data. Nobody notices for two days because the Slack notifications stop too — they run after processing, and there's nothing to process.
Week 8: Notion ships a minor API update. The database property types you're writing to now accept a slightly different format. Your writes start failing with a 400 error. The error message is technically accurate but requires reading Notion's changelog to understand.
Week 14: Your Claude API usage has grown. You hit Anthropic's rate limit during peak hours. Three out of ten items in your batch fail processing. Your error handling retries them, but the retry logic doesn't back off properly, so it uses up more of your rate limit, causing more failures.
Week 20: Google changes something in their Sheets API scoping. Your integration's read permissions still work, but the metadata call you use to check for new rows returns a permissions error. The fix takes two hours once you understand the problem. Understanding the problem takes six hours.
This timeline is not pessimistic. It's the baseline for a four-integration pipeline running in production. Each of these breakages is individually small. Collectively, they represent approximately one engineering day per month of reactive maintenance — and that's if you catch the failures quickly.
The Silent Failure Problem
The most expensive category of integration failure is the one where nothing appears to break. No error logs. No alerts. No crashed processes. The integration runs, returns data, and moves on. The data is just wrong.
This happens most often with schema drift and with AI processing steps. A field that used to contain a customer's full name now contains only their first name because the upstream API changed its default field mapping. Your pipeline processes it, your LLM summarizes it, your output looks plausible. Nobody notices until a human reads the output and realizes all the customer records from the last two weeks are missing last names.
Or the AI processing step starts producing lower-quality outputs because the model version was updated on the provider's side. The pipeline runs. Results appear. They're just subtly worse — more generic summaries, less accurate extractions, different formatting. This kind of degradation doesn't trigger errors. It triggers gradually declining quality that's hard to pinpoint to a specific moment.
The only defense against silent failures is monitoring that checks output quality, not just output existence. This means assertions on the shape and content of your data — not just "did the API return 200" but "does this response contain the fields I expect, with values in the ranges I expect, matching the patterns I expect." This is significantly more work than basic health checks, which is why most pipelines don't have it, which is why silent failures persist for days or weeks.
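A content-level assertion layer can be surprisingly small. This sketch uses hypothetical field names and thresholds; the structure is what matters, which is checking shape and plausibility of values, then alerting on a pattern of bad rows rather than one stray outlier:

```python
def assert_output_quality(rows):
    """Data-quality assertions for a batch of pipeline output rows.

    Checks content, not just existence: a 200 with half-empty names
    should fail loudly here instead of flowing downstream. Field names
    and thresholds are illustrative, not from any real schema.
    """
    problems = []
    if not rows:
        problems.append("empty batch: upstream may have silently stopped")
    for i, row in enumerate(rows):
        name = row.get("customer_name", "")
        # Shape check: a full name should contain at least two words.
        if len(name.split()) < 2:
            problems.append(f"row {i}: suspicious customer_name {name!r}")
        amount = row.get("amount")
        if amount is None or not (0 < amount < 1_000_000):
            problems.append(f"row {i}: amount out of range: {amount!r}")
    # Aggregate check: tolerate a stray bad row, alert on a pattern.
    if len(problems) > max(1, len(rows) // 10):
        raise AssertionError("; ".join(problems))
    return problems
```

This is exactly the check that would have caught the missing-last-names failure above in the first run instead of two weeks later.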
Detection and Monitoring
How you find out something broke determines how much damage it does. There are four tiers of detection, from best to worst.
Automated alerts are the gold standard. Your monitoring system detects the failure within minutes and sends a notification. This requires proactive setup — health checks, error rate monitoring, data quality assertions — and most teams don't invest in it until after they've been burned by a silent failure. Tools that help: Langfuse and LangSmith for AI-specific pipeline monitoring, Datadog or similar for general infrastructure, custom logging with alerting thresholds for anything in between.
Scheduled checks are the realistic standard. Someone — or an automated script — verifies daily or weekly that the pipeline produced expected output. This catches failures within hours to days. It's less sophisticated than real-time monitoring but covers the most common case: something broke overnight and nobody's looked at it yet.
Downstream discovery is when someone using the output of your pipeline notices something is wrong. A team member opens the dashboard and the numbers look stale. A customer reports receiving the wrong data. This is the most common detection method in practice, and it means the failure has already impacted users.
Accident is when you discover a failure while doing something unrelated. You're debugging a new feature and notice the integration log is full of 401 errors dating back three weeks. This is more common than anyone admits.
The goal is to get as much of your failure detection as possible into the first two tiers. The reality is that most teams live in the third tier, with occasional visits to the fourth.
The Realistic Maintenance Budget
Based on patterns reported in SRE literature and developer surveys, here's what maintenance actually costs for AI-connected systems:
Reactive maintenance — fixing things when they break — runs approximately 2-4 hours per month per integration, averaged over a year. Some months you spend zero hours. Some months a major API change eats a full day. The average is what matters for budgeting.
Proactive maintenance — updating dependencies, rotating credentials before they expire, testing against new API versions — runs approximately 1-2 hours per month per integration if you're doing it properly. Most teams skip this and pay the cost in reactive maintenance instead.
Monitoring maintenance — keeping your monitoring and alerting systems accurate as the pipeline evolves — runs approximately 1-2 hours per month for the entire pipeline, not per integration. Alerts need tuning. Thresholds need adjusting. New failure modes need new checks.
For a pipeline with four integrations, that adds up to roughly 13-26 hours per month of total maintenance if you're doing it well, or 8-16 hours of reactive-only maintenance if you're not, with occasional spikes when something big breaks.
The rule of thumb: budget 15-20% of the original build time as ongoing monthly maintenance. If the pipeline took 80 hours to build, budget 12-16 hours per month to keep it running. This number consistently surprises people, and it consistently turns out to be accurate.
Reducing Maintenance: Design Patterns That Help
You can't eliminate integration maintenance. You can reduce it by designing for failure from the start.
Circuit breakers stop calling a failing service after a threshold of errors, preventing cascade failures and giving the upstream service time to recover. Instead of hammering a service returning 500s, your system fails fast and tries again later. This doesn't prevent the failure but contains its blast radius.
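A minimal circuit breaker fits in a few dozen lines. This is a sketch of the pattern, not a production implementation (no half-open probe limiting, no thread safety), with the clock injectable so the open/cooldown behavior is testable:

```python
import time

class CircuitBreaker:
    """Open after N consecutive failures, fail fast while open,
    allow one call through again after a cooldown. A sketch only.
    """

    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock  # injectable for testing
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # cooldown elapsed: probe the service
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

"Fail fast" here is the whole point: while the circuit is open, callers get an immediate error instead of a hanging request, and the upstream service stops receiving traffic it can't serve.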
Fallback data paths provide a degraded-but-functional alternative when an integration is down. If the enrichment API is unavailable, the pipeline continues with unenriched data rather than stopping entirely. Not every workflow supports this, but when it's possible, it's the difference between "the pipeline is down" and "the pipeline is running at reduced quality."
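The degraded-path idea is a small wrapper in code. `enrich` here stands in for a hypothetical enrichment API client; the key detail is that the fallback record is flagged, so downstream consumers can tell degraded output from normal output:

```python
def enrich_with_fallback(record, enrich, log=print):
    """Degrade instead of stopping: if the enrichment call fails,
    pass the record through unenriched and flag it.

    `enrich` stands in for a hypothetical enrichment API client.
    """
    try:
        extra = enrich(record)
    except Exception as exc:
        log(f"enrichment unavailable, continuing degraded: {exc}")
        return {**record, "enriched": False}
    return {**record, **extra, "enriched": True}
```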
Health checks verify each integration is working before you depend on it. A lightweight API call at the start of each pipeline run that confirms auth is valid and the service is responding. This catches failures at the beginning of the process rather than in the middle.
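The pre-flight check is just a loop over per-integration probes. Each probe is a zero-argument callable (in practice, a cheap authenticated no-op call to the service) that raises on failure; collecting the failures lets the run decide whether to abort or degrade before any work starts:

```python
def run_health_checks(checks):
    """Run one lightweight probe per integration before the pipeline starts.

    `checks` maps integration name -> zero-argument callable that raises
    on failure (e.g. an authenticated no-op API call). Returns the names
    that failed, so the run can abort or degrade before doing any work.
    """
    failed = {}
    for name, check in checks.items():
        try:
            check()
        except Exception as exc:
            failed[name] = str(exc)
    return failed
```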
Version pinning locks your dependencies — API versions, SDK versions, model versions — to specific known-working versions. This means you opt into changes deliberately rather than having them forced on you. The tradeoff: you accumulate version debt that needs periodic resolution. The benefit: your pipeline doesn't break because someone else shipped an update.
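In practice, pinning means exact versions in your dependency file and an explicit API version wherever the provider supports one. The package names below are real PyPI packages for the services mentioned in this article; the version numbers are illustrative, not recommendations:

```text
# requirements.txt — exact pins, not ranges (version numbers illustrative)
anthropic==0.34.2
notion-client==2.2.1
google-api-python-client==2.143.0

# Where the provider supports it, also pin the API version per request,
# e.g. Stripe's Stripe-Version header or Anthropic's anthropic-version
# header, so server-side changes are opt-in rather than forced.
```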
Idempotent operations ensure that running the same operation twice produces the same result. This is crucial for retry logic — when a failure occurs mid-pipeline and you restart, idempotent operations mean you don't duplicate data or produce inconsistent state.
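The simplest form of idempotency is keying every write by a stable ID so a replay upserts instead of appending. A sketch, with `store` standing in for whatever sink your pipeline actually writes to:

```python
def process_batch(store, items, key_field="id"):
    """Write a batch idempotently: each item is keyed by a stable ID,
    so replaying the batch after a mid-run failure overwrites the same
    rows instead of appending duplicates.

    `store` is any dict-like sink standing in for the real database.
    """
    for item in items:
        store[item[key_field]] = item
    return store
```

Running the same batch twice, as a retry after a mid-pipeline failure would, leaves the store in exactly the state one run produces, which is the property the retry logic depends on.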
None of these patterns are novel. They're standard reliability engineering practices that have been understood for decades. The problem is that most AI pipelines are built with feature velocity as the priority and reliability as an afterthought. The patterns are easy to implement during the build. They're expensive to retrofit later.
What's Coming
The tooling around AI pipeline maintenance is improving. Langfuse, LangSmith, Braintrust, and similar tools are building the observability layer that AI-connected systems have been missing. These tools make it easier to detect failures, trace them to specific integration points, and measure quality degradation over time.
MCP is also helping by standardizing the integration surface. When every integration uses the same protocol, monitoring tools can provide consistent visibility across all of them. The days of building custom monitoring for every API connection are ending — slowly, but they're ending.
The harder problem — APIs that change under you and auth systems that expire without warning — isn't a tooling problem. It's a design problem with the internet itself. Services will continue to change their APIs. Tokens will continue to expire. The best you can do is detect these changes quickly, handle them gracefully, and budget the time to fix them.
The Verdict
Integration maintenance is the unsexy reality of AI plumbing. Nobody tweets about updating an OAuth refresh handler or migrating to a new API version. But this is where production systems live — not in the launch, but in the months and years after.
The teams that succeed with AI integrations are not the ones that build the most sophisticated pipelines. They're the ones that budget maintenance time from day one, implement monitoring before they need it, and design for failure rather than assuming success.
The build is the fun part. The maintenance is the real work. Budget accordingly, or plan to rebuild from scratch every six months when the accumulated breakage becomes unmanageable. Those are the actual options.
This is part of CustomClanker's MCP & Plumbing series — reality checks on what actually connects.