Automatic Transcription
How transcripts are generated — faster-whisper running locally, typical turnaround, privacy guarantees, and quality tuning.
How Transcription Works
Transcripts are generated by faster-whisper, an optimized re-implementation of OpenAI's open-source Whisper speech-recognition model. The model runs entirely on your dedicated Sarudo server using CPU-only inference with int8 quantization, so no audio data is ever sent to an external service. The base model is used by default, which balances accuracy and speed for most business recordings. Processing typically takes 15 to 30 seconds per minute of audio, so an hour-long meeting takes roughly 15 to 30 minutes to transcribe end to end.
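For readers who want to see what this setup looks like in code, here is a minimal sketch using the public faster-whisper Python API with the base model, CPU inference, and int8 quantization. It is an illustration under stated assumptions, not Sarudo's actual implementation: the function names transcribe_locally and estimate_processing_seconds are hypothetical, and the estimate simply applies the 15 to 30 seconds-per-minute range quoted above.

```python
from typing import List, Tuple


def transcribe_locally(audio_path: str) -> Tuple[List[Tuple[float, float, str]], float]:
    """Transcribe one file on the local CPU with the base model and int8 weights.

    Hypothetical helper: it shows the faster-whisper calls, not Sarudo's own code.
    """
    # Imported lazily so the rest of this module loads without the package installed.
    from faster_whisper import WhisperModel

    model = WhisperModel("base", device="cpu", compute_type="int8")
    segments, info = model.transcribe(audio_path)
    # `segments` is a generator; decoding only runs as it is consumed.
    rows = [(seg.start, seg.end, seg.text.strip()) for seg in segments]
    return rows, info.duration


def estimate_processing_seconds(audio_minutes: float) -> Tuple[float, float]:
    """Apply the 15 to 30 seconds-per-audio-minute range to a recording length."""
    return audio_minutes * 15, audio_minutes * 30


low, high = estimate_processing_seconds(60)
print(f"60 min of audio: about {low / 60:.0f} to {high / 60:.0f} min of processing")
```

Calling transcribe_locally("meeting.wav") (a placeholder file name) returns a list of (start, end, text) tuples and the detected audio duration in seconds. Actual throughput depends on the server's CPU, so the estimate is only a rough planning figure.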
Privacy guarantee: every byte of audio stays on your dedicated infrastructure. The transcription model is local, the temporary audio file is deleted after processing, and only the final transcript (and its extracted summary, action items, etc.) is persisted in your database.
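The "temporary audio file is deleted after processing" step can be sketched as a try/finally pattern. This is an assumption about how such cleanup is commonly written, not a description of Sarudo's internals. The name transcribe_and_discard and the transcribe callback are hypothetical.

```python
import os
import tempfile
from typing import Callable


def transcribe_and_discard(audio_bytes: bytes, transcribe: Callable[[str], str]) -> str:
    """Write audio to a temporary file, transcribe it, and always delete the file.

    `transcribe` is any function that takes a file path and returns transcript text.
    """
    fd, path = tempfile.mkstemp(suffix=".wav")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(audio_bytes)
        return transcribe(path)
    finally:
        # Runs on success and on failure, so no audio is left on disk.
        os.remove(path)
```

Because the cleanup sits in the finally clause, the audio file is removed even when transcription raises an error. Only the returned transcript text is left for the caller to persist.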
Transcript Structure
Every transcript is written with one segment per utterance, time-stamped in [MM:SS - MM:SS] format (or [HH:MM:SS - HH:MM:SS] for recordings over an hour), with the speaker label prepended when speaker detection is on. The full transcript is also saved as a plain-text file under /var/lib/sarudo/exports/ on the server so you can retrieve the raw text later. Ask your AI employee to show the transcript, paste a section, or answer questions like "what did Mark say about the launch date?" and it will search the transcript for the matching quote.
A short transcript excerpt
What a transcript looks like with speaker detection on.
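The timestamp rule described above ([MM:SS - MM:SS], switching to [HH:MM:SS - HH:MM:SS] for recordings over an hour) can be sketched as a small formatter. This is illustrative only. The function names are hypothetical, and the exact placement of the speaker label (after the timestamp, followed by a colon) is an assumption, since the text only says the label is prepended.

```python
from typing import Optional


def format_timestamp(seconds: float, with_hours: bool) -> str:
    """Render a time offset as MM:SS, or HH:MM:SS when with_hours is set."""
    total = int(seconds)  # drop fractional seconds
    if with_hours:
        hours, remainder = divmod(total, 3600)
        minutes, secs = divmod(remainder, 60)
        return f"{hours:02d}:{minutes:02d}:{secs:02d}"
    minutes, secs = divmod(total, 60)
    return f"{minutes:02d}:{secs:02d}"


def format_segment(
    start: float,
    end: float,
    text: str,
    recording_seconds: float,
    speaker: Optional[str] = None,
) -> str:
    """Format one utterance; the hour field appears only for recordings over an hour."""
    with_hours = recording_seconds > 3600
    stamp = f"[{format_timestamp(start, with_hours)} - {format_timestamp(end, with_hours)}]"
    # Assumed layout: the speaker label sits between the timestamp and the text.
    label = f"{speaker}: " if speaker else ""
    return f"{stamp} {label}{text}"


print(format_segment(65, 72.4, "Placeholder utterance.", 600, speaker="Speaker 1"))
# prints: [01:05 - 01:12] Speaker 1: Placeholder utterance.
```

The decision between the two formats is made per recording rather than per segment, so every line in a given transcript uses the same width.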
Quality and Edge Cases
Whisper-base handles clear speech in major languages well. Things that hurt quality: heavy background noise, very quiet audio, strong regional accents combined with specialized jargon, multiple speakers talking over each other, or bad compression on the source recording. If a transcript looks off, ask your AI employee for the audio stats — it can report the duration, language probability score, and speaker count it detected, which usually points at the issue (wrong language, single-speaker recording mis-detected as multiple speakers, etc.). For critical transcripts, listen to the audio while reading the transcript and correct the handful of inevitable mis-transcriptions before approving.
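To make the troubleshooting idea concrete, here is a rough sketch of how the reported audio stats might be checked against what you expected. Everything in it is an assumption for illustration: the AudioStats fields, the diagnose function, and especially the 0.8 probability threshold are hypothetical and not taken from Sarudo.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AudioStats:
    """Hypothetical container for the stats the AI employee reports."""
    duration_seconds: float
    language: str
    language_probability: float
    speaker_count: int


def diagnose(
    stats: AudioStats,
    expected_language: str,
    expected_speakers: int,
    min_language_probability: float = 0.8,  # assumed threshold, tune to taste
) -> List[str]:
    """Return human-readable hints about why a transcript may look off."""
    hints: List[str] = []
    if stats.language != expected_language:
        hints.append(
            f"Detected language '{stats.language}', expected '{expected_language}'."
        )
    elif stats.language_probability < min_language_probability:
        hints.append(
            f"Low language confidence ({stats.language_probability:.2f}); "
            "audio may be noisy or very quiet."
        )
    if stats.speaker_count != expected_speakers:
        hints.append(
            f"Detected {stats.speaker_count} speakers, expected {expected_speakers}."
        )
    return hints


for hint in diagnose(AudioStats(1800.0, "de", 0.55, 3), "en", 1):
    print(hint)
```

An empty list means the stats match expectations, in which case the cause is more likely the recording itself (noise, overlap, or compression) than a detection error.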
Searching Across Transcripts
Every transcribed meeting is stored in the database so you can search across them. Ask your AI employee things like "what was the last thing Jennifer said about the pricing model?" or "pull every mention of the partnership deal in meetings this month" and it will search the relevant transcripts and return the matching segments with meeting titles and timestamps. This turns your meeting history into a searchable memory rather than a pile of files you forget about.
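As a rough illustration of what such a search involves, the sketch below stores segments in an in-memory SQLite database and looks up a phrase, optionally filtered by speaker. The schema, table names, and sample rows are all hypothetical placeholders; the source does not document Sarudo's actual database layout or search method.

```python
import sqlite3
from typing import List, Optional, Tuple


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with made-up placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE meetings (id INTEGER PRIMARY KEY, title TEXT, held_on TEXT);
        CREATE TABLE segments (
            meeting_id INTEGER REFERENCES meetings(id),
            start_s REAL, end_s REAL, speaker TEXT, text TEXT
        );
        """
    )
    conn.executemany(
        "INSERT INTO meetings VALUES (?, ?, ?)",
        [(1, "Weekly sync", "2024-01-08"), (2, "Roadmap review", "2024-01-15")],
    )
    conn.executemany(
        "INSERT INTO segments VALUES (?, ?, ?, ?, ?)",
        [
            (1, 12.0, 18.5, "Jennifer", "The pricing model needs one more pass."),
            (1, 40.0, 44.0, "Mark", "Agreed, let us revisit pricing next week."),
            (2, 5.0, 9.0, "Jennifer", "Pricing model is now final."),
            (2, 30.0, 33.0, "Mark", "The partnership deal is still open."),
        ],
    )
    return conn


def search_segments(
    conn: sqlite3.Connection, phrase: str, speaker: Optional[str] = None
) -> List[Tuple[str, float, float, str, str]]:
    """Return matching segments with meeting title and timestamps, newest first."""
    sql = (
        "SELECT m.title, s.start_s, s.end_s, s.speaker, s.text "
        "FROM segments s JOIN meetings m ON m.id = s.meeting_id "
        "WHERE s.text LIKE ?"
    )
    params: list = [f"%{phrase}%"]  # note: % and _ in phrase act as wildcards
    if speaker is not None:
        sql += " AND s.speaker = ?"
        params.append(speaker)
    sql += " ORDER BY m.held_on DESC, s.start_s DESC"
    return conn.execute(sql, params).fetchall()


conn = build_demo_db()
for row in search_segments(conn, "pricing", speaker="Jennifer"):
    print(row)
```

Ordering by meeting date and then by start time, both descending, puts the most recent matching statement first, which is the shape of a "last thing Jennifer said" question. SQLite's LIKE is case-insensitive for ASCII text by default, so "pricing" also matches "Pricing".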