Automatic Transcription

How transcripts are generated — faster-whisper running locally, typical turnaround, privacy guarantees, and quality tuning.

Last updated: April 22, 2026

transcription, faster-whisper, whisper, privacy, speakers, local processing

How Transcription Works

Transcripts are generated by faster-whisper, an optimized re-implementation of OpenAI's open-source Whisper speech-recognition model. The model runs entirely on your dedicated Sarudo server using CPU-only inference with int8 quantization, so no audio data is ever sent to an external service. The base model is used by default; it balances accuracy and speed well for most business recordings. Turnaround is typically 15 to 30 seconds of processing per minute of audio, so an hour-long meeting takes roughly 15 to 30 minutes to transcribe end to end.
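For readers curious what this configuration looks like in code, here is a minimal sketch. It is an illustration under stated assumptions, not Sarudo's actual implementation. `WhisperModel` and its `transcribe` method come from the faster-whisper Python package; the helper names `transcribe_locally` and `estimate_processing_seconds` are hypothetical, and the estimator simply applies the 15 to 30 seconds per audio minute range quoted above.

```python
def estimate_processing_seconds(audio_seconds, low_rate=15, high_rate=30):
    """Return (low, high) processing time in seconds.

    low_rate and high_rate are seconds of processing per minute of audio,
    taken from the typical range quoted in this article.
    """
    audio_minutes = audio_seconds / 60
    return audio_minutes * low_rate, audio_minutes * high_rate


def transcribe_locally(audio_path):
    """Hypothetical helper: transcribe one file with the settings described above."""
    # Imported lazily so the estimator works even without faster-whisper installed.
    from faster_whisper import WhisperModel

    # Base model, CPU-only inference, int8 quantization.
    model = WhisperModel("base", device="cpu", compute_type="int8")
    segments, info = model.transcribe(audio_path)
    # `segments` is a generator; iterating it is what actually runs the decoding.
    rows = [(seg.start, seg.end, seg.text.strip()) for seg in segments]
    return rows, info


if __name__ == "__main__":
    low, high = estimate_processing_seconds(60 * 60)  # a one-hour meeting
    print(f"{low / 60:.0f} to {high / 60:.0f} minutes")
```

At the stated rate, a 60-minute recording works out to between 900 and 1,800 seconds of processing. Actual times depend on the server's CPU and on the audio itself.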

Privacy guarantee: every byte of audio stays on your dedicated infrastructure. The transcription model is local, the temporary audio file is deleted after processing, and only the final transcript (and its extracted summary, action items, etc.) is persisted in your database.
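The delete-after-processing behavior can be pictured with a short sketch. This is an assumed pattern, not Sarudo's actual code: `transcribe_and_discard` and the `transcribe_fn` callback are hypothetical names, and the sketch only shows how a temporary audio file can be removed whether or not transcription succeeds.

```python
import os
import tempfile


def transcribe_and_discard(audio_bytes, transcribe_fn, suffix=".wav"):
    """Write audio to a temporary file, transcribe it, then always delete the file.

    transcribe_fn is any callable that takes a file path and returns a transcript.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(audio_bytes)
        return transcribe_fn(path)
    finally:
        # Runs on success and on error, so the audio never outlives processing.
        if os.path.exists(path):
            os.remove(path)
```

Putting the cleanup in `finally` means a failed transcription does not leave audio behind on disk.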

Transcript Structure

Every transcript is written with one segment per utterance, time-stamped in [MM:SS - MM:SS] format (or [HH:MM:SS - HH:MM:SS] for recordings over an hour), with the speaker label prepended when speaker detection is on. The full transcript is also saved as a plain-text file under /var/lib/sarudo/exports/ on the server so you can retrieve the raw text later. Ask your AI employee to show the transcript, paste a section, or answer questions like "what did Mark say about the launch date?" and it will search the transcript for the matching quote.
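The timestamp convention described above can be sketched as follows. The function names are hypothetical, and the sketch assumes the longer format is chosen once per recording (when its total duration exceeds one hour) rather than per segment; the article does not say which.

```python
def format_timestamp(seconds, with_hours):
    """Format a time offset as MM:SS, or HH:MM:SS when with_hours is True."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    if with_hours:
        return f"{hours:02d}:{minutes:02d}:{secs:02d}"
    return f"{hours * 60 + minutes:02d}:{secs:02d}"


def format_segment(start, end, text, recording_seconds, speaker=None):
    """Render one utterance as a transcript line, with an optional speaker label."""
    with_hours = recording_seconds > 3600  # assumption: "over an hour" is strict
    stamp = f"[{format_timestamp(start, with_hours)} - {format_timestamp(end, with_hours)}]"
    label = f"{speaker}: " if speaker else ""
    return f"{stamp} {label}{text}"
```

For example, `format_segment(862, 869, "Hello", 1800, "Speaker 2")` yields `[14:22 - 14:29] Speaker 2: Hello`, while the same call without a speaker omits the label.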

A short transcript excerpt

What a transcript looks like with speaker detection on.

You say:

Show me the part of today's kickoff where we talked about the launch date.

Sarudo responds:

[14:22 - 14:29] Speaker 2: "The soft launch target is June 15, but we can slide if Phase 1 needs two more weeks."
[14:30 - 14:42] Speaker 1: "June 15 is workable. Let's commit to that on the SOW and keep two weeks as a buffer."
[14:43 - 14:48] Speaker 3: "Agreed, I'll add the buffer note."

Want me to pull any more context around this section?

Quality and Edge Cases

Whisper-base handles clear speech in major languages well. Things that hurt quality: heavy background noise, very quiet audio, strong regional accents combined with specialized jargon, multiple speakers talking over each other, or bad compression on the source recording. If a transcript looks off, ask your AI employee for the audio stats — it can report the duration, language probability score, and speaker count it detected, which usually points at the issue (wrong language, single-speaker recording mis-detected as multiple speakers, etc.). For critical transcripts, listen to the audio while reading the transcript and correct the handful of inevitable mis-transcriptions before approving.
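As a rough illustration of how the audio stats point at a problem, the sketch below turns them into plain-language hints. Everything here is an assumption made for the example: the `diagnose` function, the dictionary keys, and the 0.5 language-probability threshold are hypothetical and are not values documented by Sarudo.

```python
def diagnose(stats, expected_speakers=None, min_language_probability=0.5):
    """Return a list of hints for a transcript that looks off.

    stats is a dict with the hypothetical keys 'duration_seconds',
    'language_probability' and 'speaker_count'.
    """
    hints = []
    if stats["duration_seconds"] <= 0:
        hints.append("zero-length audio: the file may be empty or unreadable")
    if stats["language_probability"] < min_language_probability:
        hints.append("low language confidence: the detected language may be wrong")
    if expected_speakers is not None and stats["speaker_count"] != expected_speakers:
        hints.append(
            f"detected {stats['speaker_count']} speakers, expected {expected_speakers}"
        )
    return hints
```

A single-speaker recording reported with three speakers and a low language score would return two hints, which matches the two failure modes mentioned above.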

Searching Across Transcripts

Every transcribed meeting is stored in the database so you can search across them. Ask your AI employee things like "what was the last thing Jennifer said about the pricing model?" or "pull every mention of the partnership deal in meetings this month" and it will search the relevant transcripts and return the matching segments with meeting titles and timestamps. This turns your meeting history into a searchable memory rather than a pile of files you forget about.
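To make the idea of cross-meeting search concrete, here is a small sketch using SQLite from the Python standard library. The article does not say which database or schema Sarudo uses, so the `segments` table, its columns, and the `search_segments` function are all hypothetical.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical one-row-per-segment schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE segments ("
        "meeting_title TEXT, start_seconds REAL, speaker TEXT, text TEXT)"
    )
    conn.executemany(
        "INSERT INTO segments VALUES (?, ?, ?, ?)",
        [
            ("Kickoff", 862.0, "Speaker 2", "The soft launch target is June 15."),
            ("Kickoff", 870.0, "Speaker 1", "June 15 is workable."),
            ("Pricing review", 120.0, "Speaker 1", "The pricing model needs another pass."),
        ],
    )
    return conn


def search_segments(conn, phrase, speaker=None):
    """Return matching segments with meeting title and start time, oldest first."""
    sql = (
        "SELECT meeting_title, start_seconds, speaker, text "
        "FROM segments WHERE text LIKE ?"
    )
    params = [f"%{phrase}%"]  # note: % and _ inside phrase act as wildcards
    if speaker:
        sql += " AND speaker = ?"
        params.append(speaker)
    sql += " ORDER BY meeting_title, start_seconds"
    return conn.execute(sql, params).fetchall()
```

Calling `search_segments(build_demo_db(), "june 15")` returns both Kickoff segments, because SQLite's `LIKE` is case-insensitive for ASCII text; adding `speaker="Speaker 1"` narrows the result to one.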
