reels-app/scripts
Sebastjan Artič df6011c3cf Detect Scribe hallucinations + filter from SRT + auto-retry
Bug found in the third re-test of Žena ME TEPE:
- Scribe transcribed only verse 1 (0-33s) properly
- Then returned a single 98s segment [34.7-133.2] containing just one word, 'sam'
- This is a known Scribe hallucination on instrumental sections
- Result: SRT showed 'SAM SAM SAM SAM...' 14 times across the chorus
- The output looked completely wrong: the chorus audio was correct, but
  the subtitles showed 'SAM' repeatedly

Three-part fix:

1. SRT GENERATOR: skip segments longer than 15s that contain fewer than 5 words.
   These are treated as hallucinations and have no real transcription value.

2. SCRIBE TRANSCRIBE: detect hallucinations in returned segments.
   - Mark segments > 15s with < 5 words as hallucinations
   - Compute true coverage % (excluding hallucinations)
   - Add _hallucination_count and _coverage_pct to result

3. TRANSCRIBE_FULL: auto-retry Scribe if quality is poor.
   - If hallucinations detected OR coverage < 50%, retry once
   - Keep retry result only if it has better stats
   - Otherwise fall back to first attempt (still better than nothing)
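
Parts 1 and 2 could be sketched roughly as follows. This is a minimal illustration, not the code in analyze.py. It assumes each segment is a dict with `start`, `end` (seconds) and `text` keys, and that coverage is the summed duration of non-hallucinated segments divided by the audio length. Only the 15s / 5-word thresholds and the `_hallucination_count` / `_coverage_pct` keys come from the description above; the function names are hypothetical.

```python
from typing import Dict, List

# Thresholds taken from the commit message.
MAX_SPARSE_DURATION_S = 15.0
MIN_WORDS = 5


def is_hallucination(segment: Dict) -> bool:
    """Flag a segment that is long but contains almost no words."""
    duration = segment["end"] - segment["start"]
    word_count = len(segment["text"].split())
    return duration > MAX_SPARSE_DURATION_S and word_count < MIN_WORDS


def annotate_result(result: Dict, audio_duration: float) -> Dict:
    """Return a copy of result with hallucination count and coverage added.

    Assumption: coverage is the share of the audio covered by segments
    that are NOT flagged as hallucinations.
    """
    segments: List[Dict] = result.get("segments", [])
    real = [s for s in segments if not is_hallucination(s)]
    covered = sum(s["end"] - s["start"] for s in real)
    annotated = dict(result)
    annotated["_hallucination_count"] = len(segments) - len(real)
    annotated["_coverage_pct"] = (
        100.0 * covered / audio_duration if audio_duration > 0 else 0.0
    )
    return annotated


def srt_segments(segments: List[Dict]) -> List[Dict]:
    """Segments that should reach the SRT generator (part 1)."""
    return [s for s in segments if not is_hallucination(s)]


# Illustrative data shaped like the bug report: verse 1 is fine,
# then one long segment with a single word. The verse text is made up.
example = {
    "segments": [
        {"start": 0.0, "end": 16.0, "text": "a b c d e f g h"},
        {"start": 16.0, "end": 33.0, "text": "i j k l m n o p"},
        {"start": 34.7, "end": 133.2, "text": "sam"},
    ]
}
annotated = annotate_result(example, audio_duration=133.2)
```

With the example data, the 98s 'sam' segment is the only one flagged, so it never reaches the SRT output and coverage is computed from the two verse segments alone.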

This makes the pipeline robust against Scribe's occasional bad transcripts
on songs with long instrumental breaks. Most second attempts succeed
where the first failed, because Scribe output varies randomly between runs.
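
The retry policy from part 3 could look something like the sketch below. It is an assumption-laden illustration rather than the actual TRANSCRIBE_FULL code: `transcribe` stands in for whatever callable performs one Scribe request, and "better stats" is interpreted here as fewer hallucinations, with higher coverage as the tie-breaker. The 50% threshold and the single retry come from the description above.

```python
from typing import Callable, Dict

MIN_COVERAGE_PCT = 50.0  # threshold from the commit message


def is_poor(result: Dict) -> bool:
    """Quality check: any hallucination, or coverage below 50%."""
    return (
        result.get("_hallucination_count", 0) > 0
        or result.get("_coverage_pct", 0.0) < MIN_COVERAGE_PCT
    )


def is_better(candidate: Dict, baseline: Dict) -> bool:
    """Assumed ordering: fewer hallucinations first, then higher coverage."""
    def key(r: Dict):
        return (-r.get("_hallucination_count", 0), r.get("_coverage_pct", 0.0))
    return key(candidate) > key(baseline)


def transcribe_with_retry(transcribe: Callable[[], Dict]) -> Dict:
    """Run transcribe once, and retry a single time if the result is poor."""
    first = transcribe()
    if not is_poor(first):
        return first
    second = transcribe()
    # Keep the retry only if its stats improved; otherwise fall back
    # to the first attempt.
    return second if is_better(second, first) else first
```

A caller would pass a zero-argument function wrapping the real request, for example `transcribe_with_retry(lambda: scribe_transcribe(path))`, where `scribe_transcribe` is a hypothetical name for the function that returns the annotated result.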
2026-04-29 18:08:35 +00:00
acr_recognize.py MXF/MPG broadcast format support: handle multichannel audio properly 2026-04-29 14:38:48 +00:00
analyze.py Detect Scribe hallucinations + filter from SRT + auto-retry 2026-04-29 18:08:35 +00:00
clip.py Upgrade default Whisper model: small/medium → large-v3 for much better Slovenian/Slavic transcription accuracy 2026-04-29 08:20:18 +00:00
find_chorus.py Find chorus: weight repetitive short phrases (like 'Ohne dich x5') as strong chorus signal 2026-04-28 16:57:45 +00:00
reframe.py MXF/MPG broadcast format support: handle multichannel audio properly 2026-04-29 14:38:48 +00:00
subtitle.py Upgrade default Whisper model: small/medium → large-v3 for much better Slovenian/Slavic transcription accuracy 2026-04-29 08:20:18 +00:00
yt_download.py Add cookies support to yt_download.py for YouTube bot detection bypass 2026-04-28 15:47:59 +00:00