Voice-First Agentic AI Gains Traction with IBM–ElevenLabs Collaboration

IBM and ElevenLabs have announced a new collaboration that brings advanced voice technology into IBM watsonx Orchestrate, expanding the platform from text-only workflows to natural spoken interactions. According to the announcement, “voice has become a critical medium for customer and employee-facing agentic AI workflows”. The addition could help watsonx Orchestrate stand out in an increasingly crowded agentic AI market.

By adding ElevenLabs text-to-speech and speech-to-text, IBM is giving enterprises access to speech quality that captures nuance, tone and rhythm across 70 languages. This is designed to help organisations replace rigid call flows with AI conversations that feel natural, consistent and human.

A Shift Towards Voice-First Agentic AI

Agentic AI is increasingly used to analyse information, make decisions and take action with limited human involvement. With this integration, IBM is helping to enable the movement towards voice-centred AI, allowing organisations to design agents that interact with customers through one of the most intuitive formats: speaking and listening.

Mati Staniszewski, Co-founder at ElevenLabs, explains the importance of the shift: “AI agents are becoming central to everyday work, and voice is where AI either earns trust or loses it. Together with IBM, we’re helping organisations replace robotic interactions with AI agents that people actually want to talk to, built with the security and compliance controls that enterprises require.”

The collaboration also brings access to more than 10,000 voices, PCI-compliant payment processing, data residency controls and Zero Retention Mode to support sensitive information. This makes voice-led automation suitable for industries such as banking, healthcare and government services, where accuracy and trust are essential.

Voice Becomes a Competitive Differentiator

Most agentic AI systems today, including those from Anthropic, OpenAI, Salesforce, ServiceNow and Mistral, are text-first platforms. While IBM watsonx Orchestrate’s native capabilities would fit into this category too, its collaboration with ElevenLabs means it can now support voice-first interactions. This allows it to compete with vendors that already use voice as a key differentiator.

One business communications enterprise focusing on voice-first agentic AI is RingCentral, which recently launched AIR Pro, an agentic AI platform designed to automate full voice journeys from start to finish. The platform takes a voice-led approach by allowing businesses to build AI agents that can manage entire customer interactions independently, drawing on contextual data from CRM and communications systems. AIR Pro sits within RingCentral’s broader AI ecosystem, alongside AIR (AI Receptionist) for call handling, AVA (AI Virtual Assistant) for real-time note-taking, and ACE (AI Conversation Expert) for post-call coaching. The company says these tools together create a continuous feedback loop across every stage of a customer conversation.

PolyAI is also pursuing voice through its new Agent Studio, which helps organisations build voice agents that handle interruptions, clarify questions and maintain conversational rhythm. The platform aims to make sophisticated conversational behaviours easier to design by offering transparent tooling, scalable deployment options and flexible voice customisation. By positioning voice as a primary interaction channel rather than an add-on, the company is seeking to redefine what enterprise-grade voice automation looks like at scale.

Elsewhere, the CX ecosystem is seeing voice integrated across orchestration, automation and digital channels. One example is the extension of agentic capabilities across CXone and Cognigy, pointing to a broader recognition that customers often move between channels but frequently return to voice when they need reassurance or faster resolution.

A Clear Direction

The collaboration between IBM and ElevenLabs reinforces a clear industry trend: voice is becoming a defining feature of the next generation of AI. As enterprises adopt agentic AI, the ability to deliver natural, expressive and secure spoken interactions is increasingly essential. With more vendors strengthening their voice capabilities and customers showing a consistent preference for natural conversation, voice-first design is set to shape the next era of both customer and employee experience.