Voice Providers & Options

The full voice tech stack — which providers are live, which are optional add-ons, and which are on the roadmap.

Last updated: April 22, 2026

Tags: providers, vapi, bland, elevenlabs, edge tts, whisper, stt, tts, inbound calls

The Voice Stack at a Glance

Sarudo's voice feature is built from three independent layers: telephony (placing and receiving the call), AI voice (real-time conversation with the person on the other end), and speech-to-text (transcribing the recording after the call). Each layer has one primary provider wired today. The AI voice and speech-to-text layers also have alternatives that can be enabled if you need more voice variety, better accuracy on specialized vocabulary, or lower latency; telephony runs on Twilio only. The defaults are chosen to get a fresh client talking to their AI employee over the phone with minimal setup.

The defaults (Twilio telephony + Vapi AI voice + faster-whisper transcription) are enough for most outbound business calling. The alternatives below are opt-in; there is no reason to touch them unless you have a concrete need.

Telephony — Twilio

Telephony, meaning placing the actual phone call and handling call routing, runs on Twilio. You provision a Twilio account and a phone number during onboarding, and your AI employee places calls using your number as caller ID. Twilio is the only telephony provider wired today, and no alternative is planned: Twilio is effectively the industry standard, and swapping it would not buy anything meaningful. Per-minute charges bill to your Twilio account, not through your Sarudo subscription.
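For readers who want to see what "places calls using your number as caller ID" looks like at the API level, here is a minimal sketch using Twilio's Python helper library. It is an illustration under stated assumptions, not Sarudo's actual calling code: the helper names `build_call_params` and `place_call` are hypothetical, as is the `twiml_url` value that tells Twilio what to do once the call connects.

```python
import re

# E.164: a leading "+", then up to 15 digits, first digit non-zero.
E164 = re.compile(r"^\+[1-9]\d{1,14}$")


def build_call_params(to, caller_id, twiml_url):
    """Validate both numbers and return keyword arguments for calls.create()."""
    for label, number in (("to", to), ("caller_id", caller_id)):
        if not E164.match(number):
            raise ValueError(f"{label} must be in E.164 format, got {number!r}")
    # Twilio's Python library names the caller ID parameter `from_`.
    return {"to": to, "from_": caller_id, "url": twiml_url}


def place_call(account_sid, auth_token, **params):
    """Place an outbound call and return its call SID."""
    # Imported lazily so the validation helper works without the package.
    from twilio.rest import Client

    client = Client(account_sid, auth_token)
    call = client.calls.create(**params)
    return call.sid
```

A call would then look like `place_call(sid, token, **build_call_params("+14155550100", "+14155550199", "https://example.com/twiml"))`, where the second number is the one you provisioned during onboarding. Because the call is created under your own Twilio credentials, the per-minute charge lands on your Twilio account.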

AI Voice — Vapi (primary), Bland.ai (installed)

The AI voice layer handles the real-time conversation: the "your AI employee talking to the other person" part. Vapi is the primary AI voice provider wired today. If the `VAPI_API_KEY` env var is configured, Twilio redirects inbound audio to Vapi's inbound endpoint and Vapi generates the natural-sounding voice responses on the fly. Bland.ai is also installed on the Sarudo instance (it ships in the dependency stack) but is not yet the active route. It is an alternative we can swap in if you specifically need Bland's lower-latency cold-call feel or the particular voice personas Bland supports. If neither Vapi nor Bland is configured, calls fall back to Twilio's built-in text-to-speech (Polly voice), which is functional but clearly robotic and not recommended for real client-facing calls.

Bland.ai support is installed but not yet routed. If you want your AI voice to run on Bland instead of Vapi, let your setup team know — it is a configuration change, not a development effort.
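The fallback order described in this section can be sketched in a few lines of Python. This is an illustrative sketch, not Sarudo's routing code. Only `VAPI_API_KEY` comes from this article; `select_voice_route`, `fallback_twiml`, the `BLAND_API_KEY` variable, and the `bland_routed` flag are hypothetical names, and `Polly.Joanna` is just one example of a Twilio Polly voice.

```python
from xml.sax.saxutils import escape


def select_voice_route(env, bland_routed=False):
    """Return which layer voices the call: 'vapi', 'bland', or 'twilio_tts'."""
    if env.get("VAPI_API_KEY"):
        return "vapi"
    # Bland is installed but not routed by default. The flag stands in for
    # the configuration change a setup team would make (hypothetical).
    if bland_routed and env.get("BLAND_API_KEY"):
        return "bland"
    # Neither AI voice provider is usable: fall back to Twilio's built-in TTS.
    return "twilio_tts"


def fallback_twiml(message, voice="Polly.Joanna"):
    """Build TwiML that reads a fixed message with Twilio's built-in TTS."""
    return f'<Response><Say voice="{voice}">{escape(message)}</Say></Response>'
```

Calling `select_voice_route(os.environ)` on a default instance with a Vapi key set returns `"vapi"`. The sketch treats a Bland key without the routing flag the same as no Bland at all, which matches the "installed but not yet routed" status above.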

Speech-to-Text — faster-whisper (primary)

Post-call transcription runs on faster-whisper, a locally run, optimized reimplementation of OpenAI's open-source Whisper model. It runs on your dedicated server with CPU-only inference (int8 quantization), which means audio never leaves your infrastructure. This is the only STT provider wired today, and it is typically the right choice: quality is strong for clear business speech, latency is the processing time alone (no network round-trip), and there is no per-minute transcription bill. Cloud STT can make sense in particular cases, such as extremely long calls where server-side processing is slow, or specialized medical or legal vocabularies that benefit from larger cloud models. Groq and OpenAI cloud STT are both referenced in the architecture but not yet wired, so they would need to be turned on by your setup team.
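As a rough illustration of what CPU-only int8 transcription looks like with the faster-whisper library, consider the sketch below. The `WhisperModel` constructor arguments and the `transcribe` call are the library's own; the helper names, the `"small"` model size, and the transcript layout are assumptions made for this example and do not necessarily match what your instance runs.

```python
def format_transcript(segments):
    """Render segments (objects with .start, .end, .text) as timestamped lines."""
    return "\n".join(
        f"[{seg.start:.2f}s -> {seg.end:.2f}s] {seg.text.strip()}"
        for seg in segments
    )


def transcribe_call(audio_path, model_size="small"):
    """Transcribe a recording locally on CPU with int8 quantization."""
    # Imported lazily so format_transcript works without the package installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, info = model.transcribe(audio_path)
    # `segments` is a generator: the audio is decoded as it is consumed.
    return info.language, format_transcript(segments)
```

Everything here runs in-process on the server, which is why the recording itself never has to be uploaded anywhere. Note that the library may need to download the model weights once if they are not already cached.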

TTS Alternatives (when AI voice is not live)

When Vapi is not handling the conversation (for example, for simple reminder messages where you just want a voice playback of a fixed text rather than an interactive conversation), Sarudo can use Edge TTS, ElevenLabs, or OpenAI TTS for the voice generation. Edge TTS is free and ships as part of the Python stack; ElevenLabs and OpenAI TTS are env-var opt-ins where you provide an API key (`ELEVENLABS_API_KEY` or `OPENAI_API_KEY`) and get access to higher-quality voices and more personas. The current Sarudo voice-call pipeline focuses on the interactive AI-voice path via Vapi, so these TTS alternatives come into play only when you explicitly ask for a fixed-text voice playback — ask your AI employee "send a voice reminder to John saying [text]" and it picks the best available TTS provider for the job.

Order of preference when generating a fixed-text voice message: ElevenLabs (if key set, best quality) → OpenAI TTS (if key set, good quality) → Edge TTS (free, acceptable quality). You can override with "use ElevenLabs for this one" if you want a specific provider.
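The preference order above amounts to a short lookup. The sketch below is one way to express it; `pick_tts_provider` and the provider labels are hypothetical names chosen for illustration, while the two env var names are the ones given in this article.

```python
# Highest-preference first. A key of None means no API key is required.
TTS_PREFERENCE = [
    ("elevenlabs", "ELEVENLABS_API_KEY"),
    ("openai", "OPENAI_API_KEY"),
    ("edge", None),
]


def pick_tts_provider(env, override=None):
    """Pick a TTS provider for a fixed-text message, honoring an override."""
    available = [
        name for name, key in TTS_PREFERENCE
        if key is None or env.get(key)
    ]
    if override is not None:
        # An explicit request ("use ElevenLabs for this one") wins,
        # but only if that provider is actually configured.
        if override not in available:
            raise ValueError(f"TTS provider {override!r} is not configured")
        return override
    # Edge TTS needs no key, so the list is never empty.
    return available[0]
```

With no keys set, `pick_tts_provider({})` returns `"edge"`; adding an OpenAI key moves the choice to `"openai"`, and adding an ElevenLabs key on top moves it to `"elevenlabs"`.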

Inbound Calls

Outbound calling is fully live: your AI employee can place calls on your behalf, and the voice feature is designed around that use case. Inbound calling is partially wired. Twilio can forward inbound calls on your provisioned number to Vapi's inbound webhook, and Vapi can handle a real-time conversation. What is not yet live is the full inbound flow: call routing logic (who should Vapi be when answering, what business context should it know, how should it escalate to you), voicemail handling, and after-call logging into your CRM. If inbound answering is a hard requirement for your use case, it can be enabled manually during onboarding. If you are treating voice primarily as an outbound tool (the most common case), the defaults are already what you want.

If you publish your Twilio number anywhere and expect calls to come in, talk to your setup team before going live. The inbound path is configurable but is not the default, and you will want to decide the behavior explicitly.

Voice Call Setup

How to set up Twilio for phone number provisioning and Vapi integration for AI-powered voice conversations.

AI-Powered Conversations

How Vapi-powered autonomous calls work, their use cases, capabilities, and current limitations.

Call Transcription

How calls are transcribed locally using faster-whisper with complete privacy — no audio data leaves your server.