ffmpeg -y -i scene01.mov -i scene02.mov -i scene03.mov -i scene04.mov -i scene05.mov -i scene06.mov -i scene07.mov -i scene08.mov -i scene09.mov -i scene10.mov -i scene11.mov -i scene12.mov -i scene13.mov -i scene14.mov -i scene15.mov -filter_complex '[0:a] dynaudnorm=r=0.25:f=10:b=y [a0]; [1:a] dynaudnorm=r=0.25:f=10:b=y [a1]; [2:a] dynaudnorm=r=0.25:f=10:b=y [a2]; [3:a] dynaudnorm=r=0.25:f=10:b=y [a3]; [4:a] dynaudnorm=r=0.25:f=10:b=y [a4]; [5:a] dynaudnorm=r=0.25:f=10:b=y [a5]; [6:a] dynaudnorm=r=0.25:f=10:b=y [a6]; [7:a] dynaudnorm=r=0.25:f=10:b=y [a7]; [8:a] dynaudnorm=r=0.25:f=10:b=y [a8]; [9:a] dynaudnorm=r=0.25:f=10:b=y [a9]; [10:a] dynaudnorm=r=0.25:f=10:b=y [a10]; [11:a] dynaudnorm=r=0.25:f=10:b=y [a11]; [12:a] dynaudnorm=r=0.25:f=10:b=y [a12]; [13:a] dynaudnorm=r=0.25:f=10:b=y [a13]; [14:a] dynaudnorm=r=0.25:f=10:b=y [a14]; [0:v] [a0] [1:v] [a1] [2:v] [a2] [3:v] [a3] [4:v] [a4] [5:v] [a5] [6:v] [a6] [7:v] [a7] [8:v] [a8] [9:v] [a9] [10:v] [a10] [11:v] [a11] [12:v] [a12] [13:v] [a13] [14:v] [a14] concat=n=15:v=1:a=1 [v] [a]' -codec:a pcm_s16le -ac 1 -codec:v h264 -pix_fmt yuv420p -map '[v]' -map '[a]' INFO321_2016-07-29.mov ffmpeg -y -i scene01.mov -i scene02.mov -i scene03.mov -i scene04.mov -i scene05.mov -i scene06.mov -i scene07.mov -i scene08.mov -i scene09.mov -i scene10.mov -i scene11.mov -i scene12.mov -i scene13.mov -filter_complex '[0:v] [1:v] [2:v] [3:v] [4:v] [5:v] [6:v] [7:v] [8:v] [9:v] [10:v] [11:v] [12:v] concat=n=13:v=1 [v]' -codec:v h264 -pix_fmt yuv420p -map '[v]' test.mov ffmpeg -y -i 20160729_105610.wav -i scene01.mov -i scene02.mov -i scene03.mov -i scene04.mov -i scene05.mov -i scene06.mov -i scene07.mov -i scene08.mov -i scene09.mov -i scene10.mov -i scene11.mov -i scene12.mov -i scene13.mov -i scene14.mov -i scene15.mov -filter_complex '[1:v] [2:v] [3:v] [4:v] [5:v] [6:v] [7:v] [8:v] [9:v] [10:v] [11:v] [12:v] [13:v] concat=n=13:v=1:a=0 [v1]; [0:a] atrim=start=479:duration=1131 [a1]; [v1] [a1] [14:v] [14:a] [15:v] [15:a] dynaudnorm=r=0.25:f=10:b=y,concat=n=6:v=1:a=1 [v] [a]' -codec:a pcm_s16le -ac 1 -codec:v h264 -pix_fmt yuv420p -map '[v]' -map '[a]' test.mov The secret to trimming is the (a)setpts filter, set it to PTS - STARTPTS ffmpeg -y -i 20160729_105610.wav -i scene01.mov -i scene02.mov -i scene03.mov -i scene04.mov -i scene05.mov -i scene06.mov -i scene07.mov -i scene08.mov -i scene09.mov -i scene10.mov -i scene11.mov -i scene12.mov -i scene13.mov -filter_complex '[0:a] atrim=start=479:duration=1131,asetpts=PTS-STARTPTS,dynaudnorm=r=0.25:f=10:b=y [a1]; [1:v] [2:v] [3:v] [4:v] [5:v] [6:v] [7:v] [8:v] [9:v] [10:v] [11:v] [12:v] [13:v] concat=n=13:v=1:a=0 [v1]' -codec:a pcm_s16le -ac 1 -codec:v h264 -pix_fmt yuv420p -map '[v1]' -map '[a1]' test.mov Works: ffmpeg -y -i 20160729_105610.wav -filter_complex '[0:a] atrim=start=479:duration=1131,dynaudnorm=r=0.25:f=10:b=y [a]' -c pcm_s16le -ac 1 -map '[a]' test.wav Works (filter audio segments independently, concatenate frame-based video segments, concatenate all video and audio): ffmpeg -y -i 20160729_105610.wav -i scene01.mov -i scene02.mov -i scene03.mov -i scene04.mov -i scene05.mov -i scene06.mov -i scene07.mov -i scene08.mov -i scene09.mov -i scene10.mov -i scene11.mov -i scene12.mov -i scene13.mov -i scene14.mov -i scene15.mov -filter_complex '[0:a] atrim=start=479:duration=1131,asetpts=PTS-STARTPTS,dynaudnorm=r=0.25:f=10:b=y [a1]; [14:a] dynaudnorm=r=0.25:f=10:b=y [a2]; [15:a] dynaudnorm=r=0.25:f=10:b=y [a3]; [1:v] [2:v] [3:v] [4:v] [5:v] [6:v] [7:v] [8:v] [9:v] [10:v] [11:v] [12:v] [13:v] concat=n=13 [v1]; [v1] [a1] [14:v] [a2] [15:v] [a3] concat=n=3:v=1:a=1 [v] [a]' -codec:a pcm_s16le -ac 1 -codec:v h264 -pix_fmt yuv420p -map '[v]' -map '[a]' test01.mov Works (concatenate all audio segments, filter audio, concatenate all video segments). • Generate video segments v1, …, vn from original source (either JPEG frames or source .mov). • Generate joiner video segments vk, …, vl. • Get total duration of each run of non-joiner video segments (m runs). • Split original source audio into segments corresponding to runs of non-joiners -> a1, …, am. • Concatenate v* in correct sequence -> [vout]. • Concatenate a* in correct sequence -> [aconcat]. • Normalise [aconcat] -> [aout]. • Encode [vout] and [aout] to final. ffmpeg -y -i 20160729_105610.wav -i scene01.mov -i scene02.mov -i scene03.mov -i scene04.mov -i scene05.mov -i scene06.mov -i scene07.mov -i scene08.mov -i scene09.mov -i scene10.mov -i scene11.mov -i scene12.mov -i scene13.mov -i scene14.mov -i scene15.mov -filter_complex '[0:a] atrim=start=479:duration=1131,asetpts=PTS-STARTPTS [a1]; [a1] [14:a] [15:a] concat=n=3:v=0:a=1 [aconcat]; [aconcat] dynaudnorm=r=0.25:f=10:b=y [aout]; [1:v] [2:v] [3:v] [4:v] [5:v] [6:v] [7:v] [8:v] [9:v] [10:v] [11:v] [12:v] [13:v] [14:v] [15:v] concat=n=15 [vout]' -codec:a pcm_s16le -ac 1 -codec:v h264 -pix_fmt yuv420p -map '[vout]' -map '[aout]' test02.mov File format? (video splits): [f:slide-000.jpg] 00:07:59.000 00:08:46.000 [f:slide-001.jpg] 00:08:46.000 00:10:28.000 [f:slide-002.jpg] 00:10:28.000 00:12:19.000 [f:slide-003.jpg] 00:12:19.000 00:13:53.000 [f:slide-004.jpg] 00:13:53.000 00:14:26.000 [f:slide-005.jpg] 00:14:26.000 00:16:22.000 [f:slide-006.jpg] 00:16:22.000 00:20:16.000 [f:slide-007.jpg] 00:20:16.000 00:20:50.000 [f:slide-008.jpg] 00:20:50.000 00:22:32.000 [f:slide-009.jpg] 00:22:32.000 00:22:49.000 [f:slide-010.jpg] 00:22:49.000 00:25:59.000 [f:slide-011.jpg] 00:25:59.000 00:26:25.000 [f:slide-012.jpg] 00:26:25.000 00:26:50.000 [f:slide-012.jpg] 00:00:00.000 00:00:05.000 [f:slide-013.jpg] 00:53:00.000 00:53:27.000 File format? (audio splits): [20160729_105610.wav] 00:07:59.000 00:25:50.000 [joiner.wav] 00:00:00.000 00:00:05.000 [20160729_105610.wav] 00:53:00.000 00:53:27.000 Simple example: [a:20160729_105610.wav] 00:07:59.000 00:25:50.000 [a:joiner.wav] [a:20160729_105610.wav] 00:53:00.000 00:53:27.000 [v:input.mov] 00:07:59.000 00:25:50.000 [f:^:last]@joiner.wav [v:input.mov] 00:53:00.000 00:53:27.000 ignore empty lines and comments (#) ignore leading whitespace list segments in order of processing (no out-of-order segments!) number of segments across different types doesn't have to be the same (but it's simpler if they are); do the total durations for each type need to be the same? (ffmpeg can truncate to the shortest input, can it pad to the longest?) inputspecs are of the form [type:filename:num] type is mandatory, filename and num are optional, so: [type] (e.g., [video]) [type:filename] (e.g., [audio:foo.wav]) [type:filename:num] (e.g., [frame:foo.pdf:10]) [type::num] (uses default input from -v, -a, -f, e.g., [video::1]) filename: can be a relative (prepended with --prefix) or absolute path could be a printf style pattern? (e.g., slide-%03d.jpg, like ImageMagick) could be - for stdin? [a,v]: num is ffmpeg stream number (0-indexed) to allow for inputs with multiple streams [f] filename = "^" means use previous segment (invalid for input formats that don't have frames, e.g., audio, .mkv; ffprobe will return "N/A" for nb_frames) num is 0-indexed frame number to use from the input num -1 or "last" means use last frame of input e.g., [f:foo.pdf:last], [f:^:-1] [video] / [v] read from the default video input (from --video / -v) [audio] / [a] read from the default audio input (from --audio / -a) [frame] / [f] read from the default frame input (from --frame / -f) [video:filename] read from video input filename normal input is anything with a video stream (.mp4, .mkv, .mov, .ogv, ...) [audio:filename] read from audio input filename normal input is anything with an audio stream (.mp3, .aif, .wav, ...) [frame:filename] read from frame input filename normal input is anything that can be split into frames (.mp4, .pdf, ...) each inputspec is followed by zero or more whitespace-delimited timespecs (line breaks OK) timepecs are: a timestamp in the format HH:MM:SS.sss (can skip leading and trailing zeros) @filename means use the duration D of the file filename @filename should either be the only timespec (implying punch in at 0, punch out at D), or preceded by exactly one timestamp timespec T (implying punch in at T, punch out at T + D); anything other than that doesn't make sense [a,v]: none => [punch in at 0], [punch out at end] odd number => punch in, out, in, ..., in, [punch out at end] even number => punch in, out, in, ..., in, out each pair of timespecs defines a new segment if we support input filename patterns, then each pair of timespecs applies to the next filename iteration? [f]: none => duration = 0 (warning) one => duration is 0 .. t two => duration t1 .. t2 >1 => use the nth time as the duration of the nth frame (really only makes sense if no frame number specified) if we support filename patterns, then timespec is the duration of the next filename iteration? config ::= streamspec, {streamspec} ; streamspec ::= inputspec, [timespec_list] ; inputspec ::= "[", (audio_or_video_input | frame_input), "]" ; audio_or_video_input ::= audio_input | video_input ; audio_input ::= audio_type, [input_file, [stream_number]] ; audio_type ::= "audio" | "a" ; video_input ::= video_type, [input_file, [stream_number]] ; video_type ::= "video" | "v" ; frame_input ::= frame_type, [frame_input_file, [frame_number]] ; frame_type ::= "frame" | "f" ; frame_input_file ::= input_file | previous_segment ; previous_segment ::= ":", "^" ; input_file ::= empty_file | named_file ; empty_file ::= ":" ; named_file ::= ":" filename ; stream_number ::= ":", zero_index ; frame_number ::= ":", (zero_index | last_frame) ; last_frame ::= "-1" | "last" ; timespec_list ::= timestamp, (duration_file | {timestamp}) ; duration_file ::= "@", filename ; timestamp ::= hours, ":", minutes, ":", seconds, [second_fraction] ; hours ::= zero_index ; minutes ::= zero_index ; seconds ::= zero_index ; second_fraction ::= ".", zero_index ; filename ::= ... zero_index ::= digit, {digit} ; digit ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ; [a] [a:] [a::0] [a:file] [a:file:0] [v] [v:] [v::0] [v:file] [v:file:0] [f] [f:] [f::0] [f::-1] [f::last] [f:file] [f:file:0] [f:file:-1] [f:file:last] [f:^:0] [f:^:-1] [f:^:last] To read the time: datetime.time(*[int(v) for v in re.split(r"[:.]", "0:8:47.230000")]) or more readably: (hh, mm, ss, ms) = re.split(r"[:.]", "0:8:47.230000") datetime.time(hh, mm, ss, ms) or even better (hh, mm, ss, ms) = re.split(r"[:.]", "0:8:46.230") t1 = datetime.timedelta(hours=hh, minutes=mm, seconds=ss, milliseconds=ms) (hh, mm, ss, ms) = re.split(r"[:.]", "0:10:28.560") t2 = datetime.timedelta(hours=hh, minutes=mm, seconds=ss, milliseconds=ms) duration = t2 - t1 Yeah, baby, yeah! (do frame loops, filtering, and concatenation, all in one command) Note JPEG is faster all round than PNG — go figure. convert -scale 2048x1536 -density 600 Lectorial_slides.pdf slide-%03d.jpg ffmpeg -y -i 20160729_105610.wav -loop 1 -t 47 -i slide-000.jpg -loop 1 -t 1:42 -i slide-001.jpg -loop 1 -t 1:51 -i slide-002.jpg -loop 1 -t 1:34 -i slide-003.jpg -loop 1 -t 33 -i slide-004.jpg -loop 1 -t 1:56 -i slide-005.jpg -loop 1 -t 3:54 -i slide-006.jpg -loop 1 -t 34 -i slide-007.jpg -loop 1 -t 1:42 -i slide-008.jpg -loop 1 -t 17 -i slide-009.jpg -loop 1 -t 3:11 -i slide-010.jpg -loop 1 -t 26 -i slide-011.jpg -loop 1 -t 25 -i slide-012.jpg -loop 1 -t 5 -i slide-012.jpg -loop 1 -t 27 -i slide-013.jpg -i joiner.wav -filter_complex '[0:a] atrim=start=479:duration=1131,asetpts=PTS-STARTPTS [a1]; [0:a] atrim=start=3180:duration=27,asetpts=PTS-STARTPTS [a3]; [a1] [16:a] [a3] concat=n=3:v=0:a=1 [ac]; [ac] dynaudnorm=r=0.25:f=10:b=y [ad]; [1:v] [2:v] [3:v] [4:v] [5:v] [6:v] [7:v] [8:v] [9:v] [10:v] [11:v] [12:v] [13:v] [14:v] [15:v] concat=n=15 [vc]' -codec:a pcm_s16le -ac 1 -codec:v h264 -pix_fmt yuv420p -map '[vc]' -map '[ad]' test03j.mov Use cases Audio with or without segments Video with or without segments (unlikely) Frame with or without segments (unlikely) Audio + [video or frame], same input file, with or without segments Audio + [video or frame], separate input files, with or without segments Input type (video, audio, frame) Input file (same file, different files) Input stream (single or multiple)