How to add word-by-word highlighted subtitles (karaoke-style) to my automated video pipeline?

https://gist.github.com/8ullyMaguire/923eb7da24fc273ac12ba88932fa984e

I've built a fully automated bilingual video pipeline (EN/ES) that takes YouTube links, EPUBs, or text files and produces analysis videos with TTS audio, static background renders, and cross-platform publishing (PeerTube, Bluesky, Mastodon, PieFed). It uses ffmpeg, edge-tts, ImageMagick, and Python.

One thing I haven't cracked yet is adding those popular subtitle effects where each word is highlighted in sync as it's spoken: bold white text, with the current word glowing in yellow/red, appearing word by word with the audio.

My pipeline already has the TTS audio and the full script text, so the timing info should be extractable. I'm looking for:

Which tool/approach works best? Aegisub karaoke timing? FFmpeg ASS/SSA subtitles? Forced alignment tools?
Any open-source projects that do this well headlessly?
How to integrate into an automated pipeline (no GUI)?
Examples of word-highlighted subs on long-form content (13-15 min)?
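
To make the ASS/SSA option concrete, here is a minimal sketch of what a headless version could look like, assuming word timings are already available as (text, start, end) in seconds. As far as I know edge-tts can emit word boundary metadata during synthesis (check the units of its offsets before converting to seconds), and a forced aligner would be the fallback. The names Word, ass_time, build_events and write_ass are made up for illustration, and the style values (Arial, 72 pt, 1920x1080, yellow highlight) are placeholders. It writes one Dialogue event per spoken word, showing a short group of words with only the current one recolored, rather than using \k karaoke tags.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the audio
    end: float

# ASS colours are &HBBGGRR&, so yellow is 00FFFF. {\r} resets to the style.
HIGHLIGHT_ON = r"{\1c&H00FFFF&}"
HIGHLIGHT_OFF = r"{\r}"

HEADER = """[Script Info]
ScriptType: v4.00+
PlayResX: 1920
PlayResY: 1080

[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour, SecondaryColour, OutlineColour, BackColour, Bold, Italic, Underline, StrikeOut, ScaleX, ScaleY, Spacing, Angle, BorderStyle, Outline, Shadow, Alignment, MarginL, MarginR, MarginV, Encoding
Style: Default,Arial,72,&H00FFFFFF,&H00FFFFFF,&H00000000,&H80000000,-1,0,0,0,100,100,0,0,1,4,0,2,60,60,120,1

[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
"""

def ass_time(seconds: float) -> str:
    """Format seconds as an ASS timestamp, H:MM:SS.cc (centiseconds)."""
    cs = int(round(seconds * 100))
    h, rem = divmod(cs, 360000)
    m, rem = divmod(rem, 6000)
    s, c = divmod(rem, 100)
    return f"{h}:{m:02d}:{s:02d}.{c:02d}"

def build_events(words, per_line=5):
    """Return one Dialogue line per word, with that word highlighted."""
    events = []
    for i in range(0, len(words), per_line):
        group = words[i:i + per_line]
        for j, w in enumerate(group):
            # Hold the highlight until the next word in the group starts,
            # so the line does not flicker off during short pauses.
            end = group[j + 1].start if j + 1 < len(group) else w.end
            text = " ".join(
                HIGHLIGHT_ON + g.text + HIGHLIGHT_OFF if k == j else g.text
                for k, g in enumerate(group)
            )
            events.append(
                f"Dialogue: 0,{ass_time(w.start)},{ass_time(end)},"
                f"Default,,0,0,0,,{text}"
            )
    return events

def write_ass(words, path, per_line=5):
    """Write a complete .ass file for the given word timings."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(HEADER)
        f.write("\n".join(build_events(words, per_line)) + "\n")

if __name__ == "__main__":
    demo = [Word("Hello", 0.0, 0.4), Word("big", 0.5, 0.8), Word("world", 0.9, 1.5)]
    write_ass(demo, "subs.ass", per_line=3)
```

The resulting file can then be burned in with ffmpeg's ass filter (it needs a build with libass), along the lines of: ffmpeg -i in.mp4 -vf ass=subs.ass -c:a copy out.mp4. For a 13-15 minute video this is only a few thousand Dialogue lines, so file size should not be a concern. Words containing curly braces would need escaping, which the sketch does not handle.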

I'd also welcome any other feedback on the pipeline approach: things you'd add or change, or pitfalls to watch out for.

Thanks!


Comments