Recording quality is critical. Poor recordings with dropouts can lead to confused transcript timings requiring significant post-editing.
Whisper can often end up creating very short text chunks, leading to a kind of "rapid-fire" effect.
Assuming a good quality recording, the following settings seem to do a good job:

- Use the `medium.en` model (`--model medium.en`). This performs better than the large model when transcribing English because the large doesn't have an English-specific model.
- For normalising the audio, the `speechnorm` filter in FFmpeg seems quite effective, e.g., `ffmpeg -i <input> -filter:a speechnorm=e=12.5 <output>`.

For example:

```
whisper --model medium.en --language English --output_format vtt --fp16 False --word_timestamps True --max_words_per_line 30 --initial_prompt "<prompt>" <input-file>
```
Offline transcription is roughly real-time (i.e., 1 hour of audio takes about 1 hour to transcribe). Models are automatically downloaded to `~/.cache/whisper`.
`whisper-cpp` is actually the one we want, as it's written in C++ and supports Core ML. Annoyingly, the CLI options are different, but it seems to have more of them. It uses the same models as Vibe (below). It only supports 16 kHz WAV as input 🙁, but `ffmpeg -i <input> -vn -ar 16000 <output>` works for any input.
Useful options:

- `--offset-t` and maybe `--duration` to specify the start and end. Helpful to synchronise timestamps? `--offset-t` also helps if Whisper gets confused by a lack of audio/speech at the start of a recording.
- `--print-colors` to show confidence levels? Hmm, really designed for a dark theme…
- `--tinydiarize` to identify speaker changes? (Requires a `tdrz` model.) Unclear how this relates to `--diarize`.

For example:

```
whisper-cpp --model ~/Library/Application\ Support/github.com.thewh1teagle.vibe/ggml-medium.en.bin --language en --output-vtt --max-len 150 --prompt "<prompt>" <input file>
```
Much faster: roughly 8× real-time, e.g., 1 hour 45 minutes of audio takes about 12 minutes to transcribe.
Vibe seems to be a useful cross-platform GUI implementation. Internally it is whisper-cpp ported to Rust. It claims to have a CLI, but I can't figure out how to make it work. It produces malformed VTT: no WEBVTT header, and no blank lines between entries.
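A short `awk` pass can patch up output of that malformed shape. This is only a sketch under assumptions: it presumes each cue is a timing line followed by its text with nothing else in between, and the filenames (and the sample input it generates) are hypothetical.

```shell
# Generate a sample with the assumed malformed shape: no WEBVTT header,
# no blank lines between cues (stand-in for Vibe's actual output).
printf '%s\n' \
  '00:00:00.000 --> 00:00:02.000' 'Hello.' \
  '00:00:02.000 --> 00:00:04.000' 'World.' > broken.vtt

# Prepend the WEBVTT header, then insert a blank line before every
# timing line except the first.
awk 'BEGIN { print "WEBVTT"; print "" }
     /-->/ && NR > 1 { print "" }
     { print }' broken.vtt > fixed.vtt
```

The result should then parse as ordinary WebVTT; real Vibe output may need extra handling if cues ever span multiple text lines in a different shape.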
VS Code extension issues: it can produce timestamps like `00:40:48.1000`, which should actually be `00:40:49.000`. Clearly there is something slightly wonky with the arithmetic.

Download the (low quality) primary audio/video recorded from Echo 360 (primary).
Download alternative audio from Zoom H2 recorder (alt-original).
Normalise alt-original audio and upload to Echo360 (alt-normalised):
```
# normalise alt-original to alt-normalised (single part)
ffmpeg -i input.wav -filter:a 'speechnorm=e=12.5' DATE-normalised.wav

# normalise alt-original to alt-normalised (multiple parts)
ffmpeg -i input1.wav -i input2.wav -filter_complex 'concat=n=2:v=0:a=1,speechnorm=e=12.5' DATE-normalised.wav
```
Extract 16 kHz audio from primary and alt-normalised, as whisper-cpp requires 16 kHz input:
```
# 16 kHz alt-normalised
ffmpeg -i DATE-normalised.wav -vn -ar 16000 DATE-normalised-16khz.wav

# 16 kHz primary
ffmpeg -i INFO\ 408\ S2\ 2024\ Lec-s1-low.mp4 -vn -ar 16000 INFO\ 408\ S2\ 2024\ Lec-s1-low.wav
```
Generate VTT from 16 kHz primary and alt-normalised:
```
# from primary
whisper-cpp --model ~/Library/Application\ Support/github.com.thewh1teagle.vibe/ggml-medium.en.bin --language en --output-vtt --max-len 150 INFO\ 408\ S2\ 2024\ Lec-s1-low.wav

# from alt-normalised, used for corrections and context
whisper-cpp --model ~/Library/Application\ Support/github.com.thewh1teagle.vibe/ggml-medium.en.bin --language en --output-vtt --max-len 150 DATE-normalised-16khz.wav
```
Clean up the primary VTT.
Download Echo360 encoded version of alt-normalised (alt-echo).
Copy primary VTT to alt-echo VTT and adjust timings.
Upload primary and alt-echo VTTs to Echo360.
Delete everything except: alt-original audio, primary VTT, alt-echo VTT, Zoom meeting chat (if any).
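The "adjust timings" step above can be sketched with `awk`: shift every timestamp in the primary VTT by a fixed offset so it lines up with the alt-echo recording. This is a hedged sketch, not the actual procedure: the 3-second offset, the filenames, and the sample input are all hypothetical, and it assumes simple `HH:MM:SS.mmm --> HH:MM:SS.mmm` timing lines (no cue settings, no negative offsets).

```shell
# Hypothetical stand-in for the cleaned-up primary VTT.
printf '%s\n' 'WEBVTT' '' \
  '00:40:48.000 --> 00:40:50.500' 'Hello.' > primary.vtt

# Shift both timestamps on every timing line by OFFSET whole seconds.
OFFSET=3
awk -v off="$OFFSET" '
  function shift(t,    p, s) {                  # t is "HH:MM:SS.mmm"
    split(t, p, /[:.]/)
    s = p[1] * 3600 + p[2] * 60 + p[3] + off    # total seconds, shifted
    return sprintf("%02d:%02d:%02d.%03d",
                   int(s / 3600), int(s % 3600 / 60), s % 60, p[4])
  }
  /-->/ { $1 = shift($1); $3 = shift($3) }
  { print }' primary.vtt > alt-echo.vtt
```

Here `00:40:48.000 --> 00:40:50.500` becomes `00:40:51.000 --> 00:40:53.500`; in practice the offset would come from comparing a landmark (e.g. the first word) in the two recordings.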