Reaching 99% accuracy is less about a single breakthrough and more about getting a chain of small things right — from the model architecture to the unglamorous post-processing that makes a transcript readable.
When people ask how transcription "works," they usually picture one model that hears audio and types words. In practice, an accurate transcript is the output of several stages working together, each cleaning up what the previous one left behind.
The pipeline, end to end
Every file you upload moves through four stages before you ever see a word of text:
- Acoustic model — converts the raw waveform into phonemes and words.
- Language model — uses context to resolve homophones and place punctuation.
- Diarization — segments the audio by speaker before labelling each line.
- Post-processing — restores casing, numbers and formatting.
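To make the hand-off between stages concrete, here is a minimal sketch of the four stages chained in order. It is an illustration only, not our production code: the Draft class, the stage functions and the hard-coded words are all hypothetical stand-ins, and a real acoustic or language model would be a trained network rather than a fixed rule.

```python
from dataclasses import dataclass, field
from itertools import groupby
from typing import Callable, List


@dataclass
class Draft:
    """Working state handed from one stage to the next (hypothetical)."""
    samples: List[float]
    words: List[str] = field(default_factory=list)
    speakers: List[str] = field(default_factory=list)
    lines: List[str] = field(default_factory=list)


Stage = Callable[[Draft], Draft]


def acoustic_model(d: Draft) -> Draft:
    # Stand-in: a real acoustic model would decode d.samples.
    d.words = ["hello", "their", "thanks", "for", "joining"]
    return d


def language_model(d: Draft) -> Draft:
    # Stand-in: fix one homophone with a fixed rule instead of real context.
    d.words = ["there" if w == "their" else w for w in d.words]
    return d


def diarization(d: Draft) -> Draft:
    # Stand-in: give the first two words to one speaker, the rest to another.
    d.speakers = ["Speaker 1"] * 2 + ["Speaker 2"] * (len(d.words) - 2)
    return d


def post_processing(d: Draft) -> Draft:
    # Group consecutive words by speaker, then restore sentence casing.
    d.lines = []
    for speaker, group in groupby(zip(d.speakers, d.words), key=lambda p: p[0]):
        sentence = " ".join(word for _, word in group)
        d.lines.append(f"{speaker}: {sentence.capitalize()}.")
    return d


# Order matters: each stage cleans up what the previous one left behind.
PIPELINE: List[Stage] = [acoustic_model, language_model, diarization, post_processing]


def transcribe(samples: List[float]) -> List[str]:
    draft = Draft(samples=samples)
    for stage in PIPELINE:
        draft = stage(draft)
    return draft.lines
```

Calling transcribe on any list of samples returns the same two labelled lines, because the stand-in acoustic model ignores its input. The point is the shape of the chain: each stage reads what the previous one wrote and adds its own layer on top.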
Why speaker labels matter
A wall of text is hard to use. Splitting dialogue by speaker turns a raw transcript into something you can actually skim, quote and edit. Diarization runs before labelling, so each line carries both a timestamp and a speaker tag.
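One common way to attach speaker tags is to match each timestamped word to the diarization segment it overlaps most, then start a new line whenever the speaker changes. The sketch below assumes that approach. The Word and Turn classes, the label_lines function and the [mm:ss] timestamp format are hypothetical names chosen for the example, not a description of our internal implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the recording
    end: float


@dataclass
class Turn:
    speaker: str
    start: float
    end: float


def overlap(word: Word, turn: Turn) -> float:
    """Length in seconds of the time shared by a word and a speaker turn."""
    return max(0.0, min(word.end, turn.end) - max(word.start, turn.start))


def label_lines(words: List[Word], turns: List[Turn]) -> List[str]:
    """Assign each word to its best-overlapping turn and group by speaker.

    Assumes turns is non-empty and words are in chronological order.
    """
    grouped: List[Tuple[str, float, List[str]]] = []
    for word in words:
        best = max(turns, key=lambda t: overlap(word, t))
        if grouped and grouped[-1][0] == best.speaker:
            grouped[-1][2].append(word.text)  # same speaker: extend the line
        else:
            grouped.append((best.speaker, word.start, [word.text]))

    lines = []
    for speaker, start, texts in grouped:
        minutes, seconds = divmod(int(start), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {speaker}: {' '.join(texts)}")
    return lines
```

Given the words "Shall we start" inside a first turn and "Yes please" inside a second, this produces one line per speaker, each opening with the time its first word was spoken. Overlapping speech is exactly where a simple best-overlap rule like this one breaks down, since a word can legitimately belong to two turns at once.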
Accuracy is what gets you in the door. Structure — speakers, timestamps, clean formatting — is what makes a transcript worth keeping.
Where we go next
The remaining gains are in the hard cases: heavy accents, overlapping speech and low-quality recordings. We are training on more diverse audio and improving how the model recovers when two people talk at once — the situations where a transcript is most useful and hardest to get right.