March 23, 2026
Preparing for AI Outages in CX: What to Do When AI Stops Working
We’ve all felt the frustration that comes from AI having a “bad day” at some point. Like it or not, most CX employees rely on AI for something, whether it’s routing customers, analysing sentiment, or giving them tips on what to do next. As much as we tend to complain about our AI colleagues, we also struggle to survive on the days when they call in sick.
Sometimes it’s not even a big “outage” that causes the biggest problems. Small errors stack up, like a chatbot that keeps looping, an IVR that can’t authenticate, or a summary from a crucial customer call that never gets generated. All of those things stack extra work on human employees already spread far too thin. Then there are all the “breaks” that don’t look like outages at all.
For example, AI assistants can accidentally demonstrate bias in self-service conversations, or forget context so customers have to start from scratch.
The thing is, AI downtime in CX is pretty common, but most leaders still aren’t planning for it. They have ideas of what they can do if their cloud services go down or telephony issues happen, but as soon as they’re hit with AI outages, teams are stuck, refreshing pages and hoping something will eventually work. It’s time we all made a change.
AI Outages in CX: What They Look Like
We’re used to describing outages as dramatic things. When a cloud service or CCaaS platform goes down, everything stops, and there’s nothing a company can do but start sending out apology emails. Sometimes, AI outages are the same; sometimes they’re not. We’ve got a few categories.
Hard AI Outages in CX
This is just like a major cloud outage. Your AI system stops working completely. The chatbot doesn’t load. The IVR freezes mid-flow. Agent assist disappears from the desktop. API calls start returning 503 errors. Customers can’t move forward.
We’ve seen this play out publicly. In a recent OpenAI incident, some components recovered faster than others. Text stabilised while voice didn’t, which caused chaos inside a contact centre. One channel works, another doesn’t, and supervisors scramble.
Sometimes the outage happens at an infrastructure level. When Cloudflare experienced a widespread outage, it disrupted access to multiple AI-powered services that brands rely on.
This is classic outage territory. You need a failover. You need routing overrides. You need clear communications. That’s the baseline for AI outage readiness.
Brownouts: The System Works, Just Poorly
In these cases, the system isn’t “down”; it’s just not reliable. Latency creeps up for voice AI tools, API calls time out, and confidence scores go down. Some features might work fine; others stop functioning altogether.
It’s usually an architecture problem, and it leads to abandonment, higher average handle times, and more frustrated customers. Alibaba, for instance, learned how problematic brownouts could be when its coupon giveaway drove unexpected demand.
The system wasn’t conceptually broken. It was overwhelmed. Traffic spiked beyond what the platform could handle, and operations had to pause.
Brownouts demand a different response. Feature shedding. Throttling. Mode switching. If your AI outage planning only accounts for full shutdowns, brownouts will catch you off guard.
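As a rough illustration of mode switching and feature shedding, here is a minimal Python sketch. Everything in it is an assumption made for the example: the metric names, the thresholds, and the feature names are placeholders, not recommendations or any vendor’s API.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    FULL = "full"          # all AI features enabled
    DEGRADED = "degraded"  # shed non-essential features
    FALLBACK = "fallback"  # hand off to rules-based flows or humans


@dataclass
class HealthSnapshot:
    p95_latency_ms: float  # 95th percentile response latency
    timeout_rate: float    # fraction of API calls that timed out (0.0 to 1.0)
    avg_confidence: float  # mean model confidence score (0.0 to 1.0)


def choose_mode(h: HealthSnapshot) -> Mode:
    """Pick an operating mode from recent health metrics.

    All thresholds below are illustrative placeholders.
    """
    if h.timeout_rate > 0.20 or h.p95_latency_ms > 5000:
        return Mode.FALLBACK
    if h.timeout_rate > 0.05 or h.p95_latency_ms > 2000 or h.avg_confidence < 0.6:
        return Mode.DEGRADED
    return Mode.FULL


# Hypothetical features to switch off first when the system is struggling.
SHEDDABLE_FEATURES = ["sentiment_analysis", "call_summaries", "next_best_action"]


def active_features(mode: Mode, all_features: list[str]) -> list[str]:
    """Return the features that stay on in the given mode."""
    if mode is Mode.FULL:
        return list(all_features)
    if mode is Mode.DEGRADED:
        return [f for f in all_features if f not in SHEDDABLE_FEATURES]
    return []  # FALLBACK: no AI features, non-AI paths take over
```

The point of the sketch is that a brownout gets its own middle state. The system does not have to choose between “everything on” and “everything off”.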
Mistakes: The AI Keeps Working (Badly)
The AI answers. It just answers incorrectly.
A support bot invents a policy. A refund threshold is misquoted. Maybe a routing model deprioritises a complaint. An agent reads a summary that missed the one sentence where the customer said they’d already tried that step.
The Cursor “Sam” incident is a good example. An AI support bot hallucinated a policy that didn’t exist. The response sounded confident. It just wasn’t real. The damage came from trust erosion, not downtime.
Your safeguards here aren’t uptime metrics. They’re sampling. Escalation overrides. Guardrails that allow humans to interrupt the automation instantly.
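Sampling can be as simple as deciding, per conversation, whether a human should check what the AI said. The sketch below is one possible approach under stated assumptions: the intent names, the 5% default rate, and the `should_review` function are all hypothetical. It reviews every high-risk conversation and a stable, hash-based sample of the rest.

```python
import hashlib

# Hypothetical intents where every AI response gets a human check.
HIGH_RISK_INTENTS = {"refund", "account_recovery", "regulated_complaint"}


def should_review(conversation_id: str, intent: str, sample_rate: float = 0.05) -> bool:
    """Decide whether a conversation goes to the human review queue.

    Hashing the conversation ID keeps the decision stable: the same
    conversation always gets the same answer, on any machine.
    """
    if intent in HIGH_RISK_INTENTS:
        return True
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < sample_rate
```

Raising `sample_rate` during or after an incident is one way to tighten the net temporarily without changing anything else.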
Tool-Chain Failures: When the AI Can’t Act
Sometimes the model is fine, but it just can’t do much.
The AI generates the right action, but can’t create the ticket. Or it summarises the call, but the CRM writeback fails. Sometimes it retrieves the correct article, but the payment API won’t execute the refund. It might even stop being able to gather contextual information, which means customers have to repeat themselves.
These outages often live in the integration layer: identity, CRM, payment gateways, and knowledge bases. They’re harder to see because the AI interface looks normal.
Without proper monitoring, teams blame the model. Meanwhile, it’s a token expiry or permission change blocking the workflow.
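If the integrations are HTTP APIs (an assumption, since not every tool chain is), the status code on the failed call often tells you whether to look at the model or at the plumbing. Here is a small sketch. The status code meanings are standard HTTP, while the function name and labels are made up for illustration.

```python
def diagnose_tool_failure(status_code: int) -> str:
    """Map an HTTP status code from a tool call to a likely cause."""
    if 200 <= status_code < 300:
        return "ok"
    if status_code == 401:
        # Unauthorized: commonly an expired or invalid token.
        return "credentials_expired_or_invalid"
    if status_code == 403:
        # Forbidden: credentials are valid but lack permission.
        return "permission_denied"
    if status_code == 429:
        # Too Many Requests: the integration is being throttled.
        return "rate_limited"
    if 500 <= status_code < 600:
        # Server-side error, including 503 Service Unavailable.
        return "downstream_service_error"
    return "unclassified"
```

Logging this label next to each failed action gives supervisors something better to go on than “the AI is broken”.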
What AI Outages in CX Really Cost You
Outages in any kind of tech are annoying and expensive. The problem with AI outages in CX is that the side effects are getting worse. We’ve all become so dependent on these tools that we don’t know how to move forward without them.
Downtime for “essential” tech like this costs the average company well above six figures per hour, and that doesn’t even account for all the extra costs that build up, such as:
- Rising repeat contact rates, because customers need to come back and ask the same question again. That means more AI minutes, or more human labour to pay for.
- Escalation failures, which increase costs through extra transfers, added risk, and improper flagging of vulnerable customers.
- Reduced productivity and employee burnout, because team members can no longer rely on the systems they’ve been pushed to use.
- Reputational damage caused by increased complaints and customers sharing screenshots of biased, incorrect, or discriminatory bots.
- Fines and compliance headaches prompted by bots improperly handling data or making unethical decisions.
Put all that together, and it’s easy to see why companies are getting so worried about “AI uptime” these days.
Handling AI Outages: Preparing for AI Downtime in CX
AI outages are inevitable. They’re never going to disappear completely, no matter how strong models and platforms get. You can’t stop them from happening, but you can prepare for them just like you’d prepare for a major cloud issue.
Step 1: Define What Can’t Break
Start with experience.
When AI outages in CX happen, what really can’t fail? You can probably get by without an AI agent taking notes (we did that for years), but what you can’t go without are four things:
- Reachability. Customers must be able to reach a human when automation stumbles. If “agent” requests don’t route during AI downtime in CX, you’re stuck.
- Safety. High-risk intents can’t run on unstable automation. Payments. Account recovery. Regulated complaints. If confidence scores dip or tool calls fail, those flows need to default to human review.
- Continuity. Context can’t disappear during recovery. Partial system restores are messy. Components come back at different times. Customers shouldn’t have to repeat themselves because one layer lagged behind another.
- Truth. One source of guidance during incidents. If IVR says one thing and chat says another, you’ve extended the outage.
Only after you define those things do you look at your stack: the model provider, orchestration layer, retrieval database, knowledge base, CRM, identity, CCaaS routing, everything. Figure out where you’re going to need a backup. That doesn’t mean “two versions of everything”. It just means intentional strategies for keeping things running.
Step 2: Build a Real Troubleshooting Runbook
When AI outages in CX hit, a lot of teams spend way too much time refreshing pages, usually because they don’t know what else to do. Give them a quick troubleshooting guide. It should cover how to classify the problem first, with prompts like:
- Does it respond at all?
- Do the answers match reliable resources?
- Is it connecting to all of the right APIs?
Then guide agents through what they can do. Do they restart or redeploy the system? Can they check for conflicts between tools? Should they check service status and verify connectivity and credentials? Are there any fixes they can implement themselves (like adjusting settings), or do they need to pass the issue straight on to someone in IT?
Ideally, the vendors you’re using for AI in CX will have tools you can use here, such as troubleshooting apps for orchestrated agents. Make sure your teams know how to use them.
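The three classification prompts above can be turned into a tiny decision helper. This is only a sketch: the category labels borrow from the outage types described earlier, and the next steps are illustrative placeholders that a real runbook would replace. Note that these three questions alone don’t detect brownouts, which would need a latency or reliability check as well.

```python
from dataclasses import dataclass


@dataclass
class TriageAnswers:
    responds: bool               # Does it respond at all?
    answers_match_sources: bool  # Do the answers match reliable resources?
    apis_connected: bool         # Is it connecting to all of the right APIs?


def classify(a: TriageAnswers) -> str:
    """Turn the three triage answers into an outage category."""
    if not a.responds:
        return "hard_outage"
    if not a.apis_connected:
        return "tool_chain_failure"
    if not a.answers_match_sources:
        return "mistakes"
    return "no_fault_found"


# Illustrative first actions per category (placeholders, not a real runbook).
NEXT_STEP = {
    "hard_outage": "Check vendor service status, then switch to fallback and notify IT.",
    "tool_chain_failure": "Verify connectivity and credentials for the affected integration.",
    "mistakes": "Flag the response, hand the conversation to a human, and notify IT.",
    "no_fault_found": "Record what was observed and keep monitoring.",
}
```

For example, `NEXT_STEP[classify(TriageAnswers(True, True, False))]` points the agent at connectivity and credentials rather than at the model.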
Step 3: Design Clear Fallback Paths
Most organisations treat fallback as a last resort in the age of limitless automation. That’s a mistake. With automation scaling fast, the blast radius of AI outages is larger than ever.
Voice AI usage alone has surged dramatically year over year across enterprise platforms. More journeys depend on that layer now.
So design fallback by intent.
- Low risk? Order status, store hours, basic FAQs. A rules-based flow or cached response works.
- Medium risk? Returns. Billing adjustments. Human review with context packet attached.
- High risk? Account security. Vulnerable customers. Regulatory complaints. Direct to human.
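The risk tiers above map naturally onto a lookup table. In this sketch the intent names and fallback labels are hypothetical, and treating unknown intents as high risk is an assumption made here for safety, not something the tiers themselves dictate.

```python
from enum import Enum


class Risk(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical intent names, grouped by the tiers described above.
INTENT_RISK = {
    "order_status": Risk.LOW,
    "store_hours": Risk.LOW,
    "basic_faq": Risk.LOW,
    "returns": Risk.MEDIUM,
    "billing_adjustment": Risk.MEDIUM,
    "account_security": Risk.HIGH,
    "regulatory_complaint": Risk.HIGH,
}

FALLBACK_BY_RISK = {
    Risk.LOW: "rules_based_flow",            # or a cached response
    Risk.MEDIUM: "human_review_with_context",
    Risk.HIGH: "direct_to_human",
}


def fallback_for(intent: str, vulnerable_customer: bool = False) -> str:
    """Choose the fallback path for a contact while the AI is unavailable."""
    if vulnerable_customer:
        return FALLBACK_BY_RISK[Risk.HIGH]
    # Assumption: anything unrecognised is handled as high risk.
    risk = INTENT_RISK.get(intent, Risk.HIGH)
    return FALLBACK_BY_RISK[risk]
```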
Build a fallback ladder:
- Same channel, lower intelligence.
- Same channel, human takeover.
- Channel shift if needed. Chat to voice. Voice to callback.
- Safe stop if necessary.
A fallback strategy won’t prevent issues from happening, but at least it should stop your team from running around in a panic when things go wrong.
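The fallback ladder can be sketched as an ordered list of handlers, tried one at a time until one succeeds. The handlers below are stand-ins with made-up conditions (the `contact` dictionary keys are hypothetical). A real system would call its routing platform instead.

```python
from typing import Callable

# A rung tries to handle the contact and returns True if it succeeded.
Rung = Callable[[dict], bool]


def rules_based_flow(contact: dict) -> bool:
    # Same channel, lower intelligence: only covers simple scripted intents.
    return contact.get("intent") in {"order_status", "store_hours"}


def human_takeover(contact: dict) -> bool:
    # Same channel, human takeover: needs an available agent.
    return contact.get("agents_available", 0) > 0


def channel_shift(contact: dict) -> bool:
    # Channel shift, such as voice to callback: needs a way to reach the customer.
    return bool(contact.get("callback_number"))


LADDER: list[tuple[str, Rung]] = [
    ("same_channel_lower_intelligence", rules_based_flow),
    ("same_channel_human_takeover", human_takeover),
    ("channel_shift", channel_shift),
]


def run_ladder(contact: dict, ladder: list[tuple[str, Rung]] = LADDER) -> str:
    """Walk down the ladder and return the name of the first rung that works."""
    for name, attempt in ladder:
        try:
            if attempt(contact):
                return name
        except Exception:
            # A rung that errors out should not block the rungs below it.
            continue
    return "safe_stop"
```

With these stand-in handlers, `run_ladder({"intent": "refund", "agents_available": 2})` skips the rules-based rung and lands on human takeover, while a contact with no matching rung ends in a safe stop.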
Step 4: Treat Incidents as Customer Events
This is the “incident response” part of the playbook. The last thing you want is to see customers complaining about your AI outages before you have a chance to say anything.
First, define who owns the experience during an outage. Not just engineering. CX operations. Workforce management. Compliance. Communications. Someone must be accountable for what customers are told and how quickly humans can absorb the load.
Second, communicate early and plainly. Acknowledge what’s affected. Explain what still works. Offer the next best action. Set a realistic update time. Then stick to it.
During the AWS outage that rippled across industries, leaders who had backup workflows and clear messaging stabilised faster. Those who waited for full restoration before saying anything created radio silence and churn.
Not to mention, the quicker you tell your customers what’s happening, the faster you stop the issue from compounding. If your AI IVR isn’t working, and you tell your customers straight away, they know they’re probably going to wait longer on hold, so they can choose to call at a different time or switch to another channel.
Step 5: Learn from the AI Outages
Most post-incident reviews obsess over the time to recovery, but that’s the wrong metric to anchor on. Instead, you need to know what customers felt during that time. After AI outages, look at:
- Repeat contact within 24 to 48 hours.
- Escalation success rates.
- Complaints and regulatory flags.
- Sentiment shifts.
- Supervisor overrides.
- Manual rework volume.
Containment might look stable, and closure rates might even improve temporarily, but if customers come back frustrated, your automation masked the damage.
The “closed but not resolved” pattern already shows up in customer support research under normal conditions. Layer instability on top of that, and you amplify the gap between operational success and customer confidence.
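To make the first metric on that list concrete, here is one possible way to compute repeat contact after an incident. The definition used is an assumption: the share of customers who got in touch during the incident window and then contacted you again within 48 hours of that first contact. The input format, a list of `(customer_id, timestamp)` pairs, is also hypothetical.

```python
from datetime import datetime, timedelta


def repeat_contact_rate(
    contacts: list[tuple[str, datetime]],
    incident_start: datetime,
    incident_end: datetime,
    window: timedelta = timedelta(hours=48),
) -> float:
    """Share of incident-affected customers who came back within the window."""
    by_customer: dict[str, list[datetime]] = {}
    for customer_id, ts in contacts:
        by_customer.setdefault(customer_id, []).append(ts)

    affected = 0
    repeated = 0
    for stamps in by_customer.values():
        during = [t for t in stamps if incident_start <= t <= incident_end]
        if not during:
            continue  # this customer did not contact us during the incident
        affected += 1
        first = min(during)
        # Any later contact inside the window counts as a repeat.
        if any(first < t <= first + window for t in stamps):
            repeated += 1

    return repeated / affected if affected else 0.0
```

Comparing this figure with the same calculation on a normal week shows how much of the “resolved” volume during the outage actually came back.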
Regular drills that simulate model failure, tool-chain failure, and context loss are the only way to force your team to switch modes under pressure and expose where the real friction lives. Fix those points before the real thing hits, because technology will fail again, and that’s not the variable. The variable is whether your next AI outage feels chaotic or controlled, and that’s entirely up to you.
The Future of AI Outages in CX: The Stakes Keep Rising
AI outages won’t stop; they’ll just get more impactful, particularly as we continue relying on AI to handle more of the customer experience.
Salesforce has publicly said its AI is now handling customer inquiries with 93% accuracy. That sounds impressive. It also means a massive share of interactions are being touched by automation. When performance dips, the impact spreads fast.
8x8 reported AI interactions growing more than 100% year over year, with voice AI up over 200% in some periods. AI is now a core part of CX teams.
Execution raises the stakes. If a summarisation tool fails, an agent can recover. If an AI system executes the wrong refund or misroutes a compliance case during AI downtime in CX, the consequences escalate. Financial exposure. Regulatory risk. Reputational damage.
This is why AI outage planning has to evolve alongside automation maturity. The more authority AI holds, the tighter your guardrails and fallback controls need to be.
If AI in CX Fails Tomorrow, Will You Be Ready?
Technology is never 100% reliable. Most businesses know this, but they only plan for the big outages for cloud systems and architecture. Often, they forget to think about what they’re going to do if their AI team members take a day off.
Now’s the time to ask, if your automation layer goes unstable tomorrow:
- Can customers still file a claim?
- Can they reach a human quickly?
- Do agents have a manual workflow?
- Does context survive the handoff?
- Is messaging consistent?
If the answer to any of these is “no”, you’re setting yourself up for disaster the minute one of your AI tools breaks, and eventually, it will.
