Previous fix used segment boundaries — required segments <3s for type 1
or <4s for type 2. But Žena was in a 4.3s segment ('saj še doma mi več
noč'jo verjet'. Žena me'), so the condition wasn't met and clip start
stayed at 77.7s, exactly at end of word 'Žena' (76.88-77.70s).
New approach: scan word-level timestamps directly:
1. If clip start falls MID-WORD → extend back to word start - 0.15s
2. If a word ends 0-0.5s BEFORE clip start AND next word is at clip start
→ that word is suspect (may be first word of chorus that Scribe put
in previous segment), extend back to its start - 0.15s
Word-level timestamps are always available from Scribe (timestamps_granularity=word).
Falls back to segment-level for local Whisper without word timing.
This handles arbitrary segment lengths and is universal — works for any
language where the chorus starts on a word that the STT placed in the
previous segment.