Transcribing lectures using Whisper

Nigel Stanger edited this page on 9 Oct

Recording quality is critical: poor recordings with dropouts can confuse transcript timings and require significant post-editing.

Whisper can often end up creating very short text chunks, leading to a kind of “rapid fire” effect.

Assuming a good quality recording, the following settings seem to do a good job:

  • Medium model (specifically medium.en). This performs better than the large model when transcribing English, because large has no English-specific variant.
  • Enable word timestamps.
  • Explicitly set the maximum number of words per line. 20 seems slightly too short, but realistically any value will cause some awkward line breaks.
  • VTT output. EchoVideo supports other formats, but VTT is supported by everything and is easy to work with. (Needs a better VS Code extension, though.)
  • Disable FP16 if running on Apple M1 or M2, as they don’t support it. Should be fine on everything else.
  • An initial prompt may improve accuracy, e.g., “This is a postgraduate lecture about ethical issues in big data. The main topics are ethics, law, privacy, data dredging, and statistics.”
  • Normalising the audio beforehand may or may not help. The speechnorm filter in FFmpeg seems quite effective, e.g., ffmpeg -i <input> -filter:a speechnorm=e=12.5 <output>.

For example:

whisper --model medium.en --language English --output_format vtt --fp16 False --word_timestamps True --max_words_per_line 30 --initial_prompt "<prompt>" <input-file>

Offline transcription is roughly real-time (i.e., 1 hour of audio takes about 1 hour to transcribe). Models are automatically downloaded to ~/.cache/whisper.

whisper-cpp is actually the one we want, as it’s written in C++ and supports Core ML. Annoyingly, the CLI options are different, but it seems to have more of them. Uses the same models as Vibe below. Only supports 16 kHz WAV as input 🙁 (ffmpeg -i <input> -vn -ar 16000 <output> works for any input).

Useful options:

  • --offset-t and maybe --duration to specify the start and end. Helpful to synchronise timestamps? --offset-t also helps if Whisper gets confused by lack of audio/speech at the start of a recording.
  • --print-colors to show confidence level? Hmm, really designed for dark theme…
  • --tinydiarize to identify speaker changes? (Requires a tdrz model.) Unclear how this relates to --diarize.

For example:

whisper-cpp --model ~/Library/Application\ Support/github.com.thewh1teagle.vibe/ggml-medium.en.bin --language en --output-vtt --max-len 150 --prompt "<prompt>" <input file>

Much faster: about 8×, e.g., 1 hour 45 minutes takes about 12 minutes.

Vibe seems to be a useful cross-platform GUI implementation; internally it is whisper-cpp ported to Rust. It claims to have a CLI, but I can’t figure out how to make it work. It produces malformed VTT: no WEBVTT header, and no blank lines between entries.
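Until that’s fixed upstream, the malformed output can be patched mechanically. A minimal sketch (assuming each cue is a timestamp line followed by its text, which matches what I’ve seen from Vibe but may not cover every case; the function name is mine):

```python
# Sketch: repair Vibe's malformed VTT by adding the missing WEBVTT
# header and the blank lines between cues.
import re

# Cue timing line, e.g. "00:00:02.000 --> 00:00:04.000"
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}")

def repair_vtt(text: str) -> str:
    lines = [ln for ln in text.splitlines() if ln.strip()]
    out = ["WEBVTT", ""]
    for ln in lines:
        if ln.strip() == "WEBVTT":
            continue  # don't duplicate a header if one is somehow present
        if TIMESTAMP.match(ln) and out[-1] != "":
            out.append("")  # blank line before each new cue
        out.append(ln)
    return "\n".join(out) + "\n"
```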

VS Code extension issues:

  • Missing feature: merge subtitles, i.e., merge the selected subtitles into one and adjust the timestamps accordingly.
  • Bug: sometimes adjusting timing produces timestamps like 00:40:48.1000, which should actually be 00:40:49.000. Clearly the millisecond arithmetic doesn’t carry over into the seconds field.
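The missing merge operation itself is simple enough to sketch. Assuming cues are (start, end, text) tuples (parsing and serialisation omitted, and the function name is mine), merging keeps the first start time, the last end time, and joins the text:

```python
# Sketch: merge a run of consecutive VTT cues into a single cue.
def merge_cues(cues):
    """Merge a list of (start, end, text) cues into one cue."""
    start = cues[0][0]          # start of the first cue
    end = cues[-1][1]           # end of the last cue
    text = " ".join(c[2] for c in cues)
    return (start, end, text)
```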

Standard workflow

  1. Download the (low quality) primary audio/video recorded by Echo360 (primary).

  2. Download alternative audio from Zoom H2 recorder (alt-original).

  3. Normalise alt-original audio and upload to Echo360 (alt-normalised):

    # normalise alt-original to alt-normalised (single part)
    ffmpeg -i input.wav -filter:a 'speechnorm=e=12.5' DATE-normalised.wav
    
    # normalise alt-original to alt-normalised (multiple parts)
    ffmpeg -i input1.wav -i input2.wav -filter_complex 'concat=n=2:v=0:a=1,speechnorm=e=12.5' DATE-normalised.wav

  4. Extract 16 kHz audio from primary and alt-normalised, as whisper-cpp only accepts 16 kHz WAV:

    # 16 kHz alt-normalised
    ffmpeg -i DATE-normalised.wav -vn -ar 16000 DATE-normalised-16khz.wav
    
    # 16 kHz primary
    ffmpeg -i INFO\ 408\ S2\ 2024\ Lec-s1-low.mp4 -vn -ar 16000 INFO\ 408\ S2\ 2024\ Lec-s1-low.wav

  5. Generate VTT from 16 kHz primary and alt-normalised:

    # from primary
    whisper-cpp --model ~/Library/Application\ Support/github.com.thewh1teagle.vibe/ggml-medium.en.bin --language en --output-vtt --max-len 150 INFO\ 408\ S2\ 2024\ Lec-s1-low.wav
    
    # from alt-normalised, used for corrections and context
    whisper-cpp --model ~/Library/Application\ Support/github.com.thewh1teagle.vibe/ggml-medium.en.bin --language en --output-vtt --max-len 150 DATE-normalised-16khz.wav

  6. Clean up the primary VTT.

  7. Download Echo360 encoded version of alt-normalised (alt-echo).

  8. Copy primary VTT to alt-echo VTT and adjust timings.

  9. Upload primary and alt-echo VTTs to Echo360.

  10. Delete everything except: alt-original audio, primary VTT, alt-echo VTT, Zoom meeting chat (if any).
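The timing adjustment in step 8 and a pre-upload sanity check for step 9 can both be sketched in a few lines of Python. The offset value and the notion of “well formed” here are my assumptions, not documented Echo360 requirements, and the helper names are mine:

```python
# Sketches for steps 8-9: shift every timestamp in the primary VTT by a
# constant offset (to produce the alt-echo VTT), then sanity-check the
# result before upload.
import re

TS = re.compile(r"\d{2}:\d{2}:\d{2}\.\d{3}")

def shift(ts: str, delta_ms: int) -> str:
    # Work in total milliseconds so the carry is handled correctly
    # (48.999 s + 1 ms becomes 49.000, not the 48.1000 bug above).
    h, m, s = ts.split(":")
    s, ms = s.split(".")
    total = max(0, ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms) + delta_ms)
    h, rem = divmod(total, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

def shift_vtt(text: str, delta_ms: int) -> str:
    """Apply a constant offset to every timestamp in a VTT file."""
    return TS.sub(lambda mo: shift(mo.group(0), delta_ms), text)

def check_vtt(text: str) -> list:
    """Return a list of problems found (empty means it looks OK)."""
    def to_ms(ts):
        h, m, s = ts.split(":")
        s, ms = s.split(".")
        return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)

    problems = []
    if not text.lstrip().startswith("WEBVTT"):
        problems.append("missing WEBVTT header")
    last_start = -1
    for m in re.finditer(r"(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})", text):
        start, end = to_ms(m.group(1)), to_ms(m.group(2))
        if end <= start:
            problems.append("cue ends before it starts: " + m.group(0))
        if start < last_start:
            problems.append("cues out of order at " + m.group(1))
        last_start = start
    return problems
```

The shift amount is whatever lag you measure between the primary and alt-echo recordings (--offset-t in whisper-cpp can help pin this down).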