Bug: the 'Vključi pre-chorus' ("Include pre-chorus") checkbox in the UI was
sent to the backend but ignored by the Claude/Gemini analysis prompt. Both
modes used the same lenient rules saying 'pre-chorus is optional', so Claude
often included the pre-chorus even when the user wanted just the chorus.
Real-world failure: for Lady Gaga 'Abracadabra' the picked clip was
54.7-84.6s, but the actual chorus 'Abracadabra, amor, ooh-na-na' starts at
85.2s. Claude included the entire pre-chorus block ('Hold me in your heart
tonight', 'Like a poem said by a lady in red', 'With a haunting dance') and
missed the actual chorus completely.
Fix: include_prebuild parameter now flows all the way to the prompt:
- main.py → analyze.py CLI args → analyze_with_llm() → prompt builder
- Two distinct prompt rule sets:
  CHORUS ONLY (default, include_prebuild=False):
    - Strict: 'clip starts on FIRST WORD of chorus, never before'
    - Length: 12-25s typically
    - Explicit examples for pop songs (Abracadabra, Despacito, Shape of You)
    - List of common mistakes to avoid
  CHORUS + PRE-CHORUS (include_prebuild=True):
    - Optional pre-chorus before chorus, 4-10s
    - Length: 18-35s
This fixes the most common failure mode where Claude rationalizes
including verse/pre-chorus content even when user explicitly wants
just the chorus.
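A minimal sketch of how the flag could select one of the two rule sets in the prompt builder. The constant names and `build_selection_rules` are hypothetical, and the rule text is abbreviated from the bullets above rather than copied from the real prompt.

```python
# Hypothetical, abbreviated rule texts; the real prompt is longer.
CHORUS_ONLY_RULES = """\
MODE: CHORUS ONLY
- The clip starts on the FIRST WORD of the chorus, never before.
- Typical length: 12-25s.
"""

CHORUS_WITH_PREBUILD_RULES = """\
MODE: CHORUS + PRE-CHORUS
- An optional pre-chorus of 4-10s may precede the chorus.
- Typical length: 18-35s.
"""


def build_selection_rules(include_prebuild: bool = False) -> str:
    """Return the rule block matching the UI checkbox state."""
    if include_prebuild:
        return CHORUS_WITH_PREBUILD_RULES
    return CHORUS_ONLY_RULES
```

Keeping the two rule sets as separate strings, instead of one text with an "optional" clause, is what removes the ambiguity the model was exploiting.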
User feedback: 'The CHORUS is mandatory, the pre-chorus is optional' + 'the
system must be stable for all languages, including Spanish and Romanian'.
Two changes:
1. Web search is now MANDATORY first step (was: optional fallback):
- Even if Claude thinks it knows the song, must search lyrics first
- Universal lyrics sources by language:
    SLO:       besedila.com, lyricstranslate.com
    DE:        songtexte.com
    HR/SR/BS:  tekstovi.net
    ES:        letras.com, musica.com
    RO:        versuri.ro
    IT:        angolotesti.it
    FR:        paroles.net
    EN:        genius.com, azlyrics.com
    Universal: lyricstranslate.com (any language)
- Search strategy: artist+title first, then transcript snippet fallback
- Without lyrics, Claude cannot reliably identify chorus boundaries
2. Simplified selection rules - chorus is THE priority:
- Chorus (full first occurrence) = MANDATORY
- Pre-chorus = ONLY if 1-2 verse lines tightly connected to chorus
- In doubt: just take chorus alone (12-25s)
- Outro fillers explicitly multi-language:
    SLO 'aj ja ja' / 'ej ej ej'
    EN  'yeah' / 'oh oh'
    ES  'ay ay ay'
    RO  'hei hei'
    JA  'la la la'
- 12-35s total range (was 15-35s, now allows shorter chorus-only clips)
This makes the system language-agnostic: works the same way for Slovenian
narodno-zabavna, Spanish reggaeton, Romanian manele, German Schlager, etc.
The lyrics lookup is what makes it stable across languages.
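The source list and search order can be sketched as a lookup table plus a query builder. The two-letter language keys, `lyrics_sources` and `search_queries` are assumptions made for this illustration; only the site names and the "artist+title first, then snippet" order come from the notes above.

```python
# Site names taken from the list above; the language keys are an assumption.
LYRICS_SOURCES = {
    "sl": ["besedila.com", "lyricstranslate.com"],
    "de": ["songtexte.com"],
    "hr": ["tekstovi.net"],
    "sr": ["tekstovi.net"],
    "bs": ["tekstovi.net"],
    "es": ["letras.com", "musica.com"],
    "ro": ["versuri.ro"],
    "it": ["angolotesti.it"],
    "fr": ["paroles.net"],
    "en": ["genius.com", "azlyrics.com"],
}
UNIVERSAL_SOURCE = "lyricstranslate.com"


def lyrics_sources(lang: str) -> list[str]:
    """Language-specific sites first, the universal site always last."""
    sites = list(LYRICS_SOURCES.get(lang, []))
    if UNIVERSAL_SOURCE not in sites:
        sites.append(UNIVERSAL_SOURCE)
    return sites


def search_queries(artist: str, title: str, transcript_snippet: str) -> list[str]:
    """Artist + title first, then a quoted transcript snippet as fallback."""
    queries = []
    if artist and title:
        queries.append(f"{artist} {title} lyrics")
    if transcript_snippet:
        queries.append(f'"{transcript_snippet}" lyrics')
    return queries
```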
Problem: Claude was cutting clip exactly at last transcribed word of chorus,
but in real songs:
- Singer holds last note 1-3s longer (still meaningful)
- Outro 'ej-ej-ej' / 'oh' / 'yeah' may not be transcribed as words
- Result felt like 'incomplete chorus' even though SRT was correct
Fix has two parts:
1. Prompt enhancement:
- Ask Claude to add 1-2s padding AFTER last chorus word
- Explicit example with timing math
- Mention outro fillers (ej-ej-ej, oh, yeah)
2. Post-LLM extension logic:
- After Claude returns clip range, scan corrected_segments for
segments overlapping or starting just after current end
- If next segment is within 1s pause and ends within max_duration+5s,
extend clip to include it (with 0.3s breathing room)
- Hard cap at max_duration + 5s to prevent unbounded extension
This ensures chorus naturally trails off rather than being cut mid-emotional-peak.
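The post-LLM extension can be sketched roughly as below. This assumes segments are dicts with `start`/`end` in seconds, reads "ends within max_duration+5s" as measured from the clip start, and applies the rule to each following segment in turn; `extend_clip_end` and its parameter names are hypothetical.

```python
def extend_clip_end(start, end, segments, max_duration,
                    max_pause=1.0, breathing_room=0.3, overshoot=5.0):
    """Extend the clip end over trailing segments that follow closely."""
    hard_cap = start + max_duration + overshoot  # never extend past this
    new_end = end
    for seg in sorted(segments, key=lambda s: s["start"]):
        if seg["end"] <= new_end:
            continue  # already fully inside the clip
        if seg["start"] - new_end > max_pause:
            break  # pause too long: the phrase has ended
        if seg["end"] > hard_cap:
            break  # would exceed the hard cap
        new_end = seg["end"]
    if new_end > end:
        # Add breathing room only when something was actually appended.
        new_end = min(new_end + breathing_room, hard_cap)
    return new_end
```

With a clip of 10-20s and a trailing segment at 20.4-22.0s, the end moves to about 22.3s; a segment starting at 30s is left out because the pause exceeds 1s.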
ROOT CAUSE FOUND: tag_audio_events=false caused Scribe to stop transcribing
when instrumental music dominates (polka accordion taking over from vocals).
Real-world test on Avseniki - Ena bolha za pomoč (186s polka):
- tag_audio_events=false: 20% coverage (37s only) — fails
- tag_audio_events=true: 100% coverage (186s full) — works
When tag_audio_events=true, Scribe inserts placeholder markers like
'(glasba)' / '(plesalna glasba)' for instrumental sections instead of
giving up. We then filter these out so they don't appear in subtitles.
Filtering logic:
- Skip word.type != 'word' (audio_event types)
- Skip parenthesized text such as '(music)', '(applause)' (legacy fallback)
This is the core fix — no longer reliant on filename for transcription
completeness. Even untitled files like '12345.mp4' now get full coverage.
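A sketch of the two filtering rules, assuming each Scribe word arrives as a dict with `text` and `type` keys. That shape and the name `keep_spoken_words` are assumptions, not the real response object.

```python
import re

# Matches tokens that consist only of a parenthesized marker, e.g. "(music)".
_PAREN_ONLY = re.compile(r"^\s*\(.*\)\s*$")


def keep_spoken_words(words):
    """Drop audio-event markers so they never reach the subtitles."""
    kept = []
    for w in words:
        if w.get("type", "word") != "word":
            continue  # audio_event, spacing, etc.
        if _PAREN_ONLY.match(w.get("text", "")):
            continue  # legacy fallback: "(music)", "(applause)"
        kept.append(w)
    return kept
```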
Problem: Scribe was failing on Slovenian narodno-zabavna songs (Avseniki,
Modrijani) because:
- User doesn't manually pick language (everything is auto)
- Scribe auto-detect had low confidence (0.58) on accordion-heavy polka
- Result: only 37s transcribed instead of full 186s song
Solution: detect_language_from_filename() function:
- Recognizes 60+ Slovenian artists (Avseniki, Modrijani, Veseli Dolenjci, ...)
- Recognizes 30+ German artists (Ben Zucker, Helene Fischer, ...)
- Recognizes 20+ Croatian/Serbian artists (Thompson, Severina, Lepa Brena, ...)
- Falls back to keyword matching (volim, liebe, srce, herz, ...)
- Detects character set (č/ž/š → SL, ä/ö/ü/ß → DE, đ → HR)
- Score-based: 5pts for artist match, 1-2pts for keywords/chars
When detected, sends language_code to Scribe explicitly:
- Avseniki → 'slv' lock → no more half-transcribed songs
- Ben Zucker → 'deu' lock → consistent German transcription
- User still doesn't need to manually pick anything
filename_hint flows: main.py → analyze.py CLI → transcribe_full → Scribe
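The scoring idea can be sketched with tiny lookup tables. The real function knows far more artists; the keyword-to-language mapping, the 2/1 point split between keywords and characters, the `min_score` threshold and the 'hrv' code are all assumptions made for this example.

```python
from collections import Counter

# Tiny illustrative tables; the real lists are much longer.
ARTIST_LANG = {
    "avseniki": "slv", "modrijani": "slv", "veseli dolenjci": "slv",
    "ben zucker": "deu", "helene fischer": "deu",
    "severina": "hrv",
}
KEYWORD_LANG = {"liebe": "deu", "herz": "deu", "volim": "hrv"}  # assumed mapping
CHARSET_LANG = {"čžš": "slv", "äöüß": "deu", "đ": "hrv"}


def detect_language_from_filename(filename, min_score=2):
    """Return an ISO 639-3 code, or None when the filename gives no strong hint."""
    name = filename.lower()
    scores = Counter()
    for artist, lang in ARTIST_LANG.items():
        if artist in name:
            scores[lang] += 5  # artist match is the strongest signal
    for keyword, lang in KEYWORD_LANG.items():
        if keyword in name:
            scores[lang] += 2  # assumed weight
    for chars, lang in CHARSET_LANG.items():
        if any(c in name for c in chars):
            scores[lang] += 1  # assumed weight
    if not scores:
        return None
    lang, score = scores.most_common(1)[0]
    return lang if score >= min_score else None
```

Returning `None` for weak or missing hints leaves Scribe auto-detect in charge, so untitled files behave as before.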
Previous rules were ambiguous and Claude was sometimes picking:
- Just the chorus (no build-up)
- Second chorus instance (lower energy than first)
- Random verse + later chorus combinations
New explicit priority order:
1. PRIMARY: pre-chorus verse (build-up) + first chorus (~20-35s total)
2. FALLBACK: just first chorus alone
3. LAST RESORT: dramatic peak section
Strict rules:
- ALWAYS first chorus (highest energy/recognition)
- NEVER second/third chorus instances
- NEVER skip between verses
- NEVER extend over 35 seconds
- Concrete example given: chorus at 32s, 16s long → pick 20-48s
This fixes Veseli Dolenjci picking second chorus + post-chorus verse
instead of natural pre-chorus build-up + first chorus.
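The timing math behind the concrete example, as a sketch. The 12s build-up is inferred from the example (32s - 20s) and is not a documented constant; `clip_with_buildup` is a hypothetical helper.

```python
def clip_with_buildup(chorus_start, chorus_length, buildup=12.0, max_total=35.0):
    """Pre-chorus build-up + first chorus, capped at max_total seconds."""
    end = chorus_start + chorus_length
    start = max(0.0, chorus_start - buildup)
    if end - start > max_total:
        # Shrink the build-up, never the chorus itself. A chorus longer than
        # max_total is not handled by this sketch.
        start = min(chorus_start, end - max_total)
    return start, end
```

For a chorus at 32s lasting 16s this yields 20-48s, a 28s clip inside the 20-35s target.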
1. Claude API web_search tool integration:
- Claude can now search web for actual lyrics when STT text is wrong
- Especially useful for SLO/HR/BS/SR songs (Modrijani, Veseli Dolenjci)
where Claude doesn't know lyrics from training data
- Agentic loop: tool_use → server-side search → continuation → final text
- Max 3 searches per job ($0.03 cost limit)
- Hint sources: besedila.com, lyricstranslate.com, tekstovi.net, songtexte.com
2. Tighter subtitle segmentation from Scribe word timestamps:
- Phrase boundaries on shorter pauses (0.4s vs 0.6s)
- Sentence-ending punctuation triggers segment break
- Max segment 4s (was 6s) for natural readable subtitles
- Hard cap at 5.5s to prevent very long lines
This fixes 'ples to noč' → 'ples pojoč' for Modrijani songs that
Scribe transcribed phonetically wrong but Claude can fix via web lookup.
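The segmentation rules from item 2 can be sketched as follows. Treating 4s as a soft target (close the segment once it is reached) and 5.5s as a hard limit checked before adding a word is one interpretation of the notes above; the word dict shape and `words_to_segments` are assumptions.

```python
def words_to_segments(words, pause=0.4, target=4.0, hard_cap=5.5):
    """Group word timestamps into subtitle segments."""
    segments, current = [], []

    def flush():
        if current:
            segments.append({
                "start": current[0]["start"],
                "end": current[-1]["end"],
                "text": " ".join(w["text"] for w in current),
            })
            current.clear()

    for w in words:
        if current:
            gap = w["start"] - current[-1]["end"]
            would_span = w["end"] - current[0]["start"]
            if gap >= pause or would_span > hard_cap:
                flush()  # phrase boundary, or the word would break the hard cap
        current.append(w)
        span = current[-1]["end"] - current[0]["start"]
        if w["text"].rstrip().endswith((".", "!", "?")) or span >= target:
            flush()  # sentence end, or the soft target length was reached
    flush()
    return segments
```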
ElevenLabs Scribe replaces local Whisper as default transcription:
- 96.7% accuracy English, 2.4% WER Indonesian (vs Whisper 7.7%)
- 18x faster (200s song = 11s vs 3-5 min on CPU)
- No hallucinations on songs (Whisper invented 'Pony und Kleid' for 'Bonnie und Clyde')
- 99 languages supported, including SLO/HR/BS/SR
- $0.40/h pricing, ~$0.022 per 200s song
Implementation:
- transcribe_with_elevenlabs() function uses Scribe v1
- ISO 639-1 ↔ 639-3 mapping (Scribe needs 'deu' not 'de')
- Word-level timestamps converted to pseudo-segments (close on 0.6s pause or 6s duration)
- 24MB upload limit guard with auto-fallback to local
Default whisper_provider='auto':
- If ELEVENLABS_API_KEY set → use Scribe
- Otherwise → fallback to local faster-whisper
- 'elevenlabs' strict mode: no fallback
- 'local' strict mode: skip Scribe entirely
Tested on Ben Zucker - Ohne dich: Scribe correctly transcribed
'Wir sind Bonnie und Clyde, zu allem bereit' where local Whisper hallucinated.
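A sketch of the provider resolution and the language-code mapping. `resolve_provider` and the `env` parameter are hypothetical, and the mapping table shows only a few codes; the real table would cover every supported language.

```python
import os

# Partial ISO 639-1 -> 639-3 table (Scribe expects the three-letter form).
ISO1_TO_ISO3 = {"de": "deu", "sl": "slv", "en": "eng", "hr": "hrv"}
ISO3_TO_ISO1 = {v: k for k, v in ISO1_TO_ISO3.items()}


def resolve_provider(whisper_provider="auto", env=None):
    """Return (provider, may_fall_back_to_local)."""
    env = os.environ if env is None else env
    if whisper_provider == "auto":
        if env.get("ELEVENLABS_API_KEY"):
            return "elevenlabs", True   # Scribe first, local as safety net
        return "local", False
    if whisper_provider in ("elevenlabs", "local"):
        return whisper_provider, False  # strict modes never fall back
    raise ValueError(f"unknown whisper_provider: {whisper_provider!r}")
```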
Problem: When a song has a chorus → la-la-la bridge → chorus structure,
Claude was including the whole 40s+ block, with 18 seconds of la-la-la
making the reel feel artificially extended.
Fix:
1. Prompt enhancement: explicitly tell Claude NEVER to include
la-la-la / ooh ooh / yeah yeah / instrumental fillers
2. Post-LLM detection: scan corrected_segments for repetitive content
(>70% repeated words) and trim clip before that segment
3. Max duration guidance reduced from 45s → 35s in prompt
This means: clip will end at the first chorus, not extend through fillers.
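The post-LLM filler detection might look like this. ">70% repeated words" is read here as the share of tokens whose word occurs more than once within the segment, which is an assumption; `repeated_ratio` and `trim_before_filler` are hypothetical names, and the segments in the usage note are made up.

```python
import re
from collections import Counter


def repeated_ratio(text):
    """Share of tokens whose word occurs more than once in the text."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    counts = Counter(words)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(words)


def trim_before_filler(start, end, segments, threshold=0.7):
    """End the clip where the first repetitive segment inside it begins."""
    for seg in sorted(segments, key=lambda s: s["start"]):
        if seg["end"] <= start or seg["start"] >= end:
            continue  # segment lies outside the clip
        if seg["start"] > start and repeated_ratio(seg["text"]) > threshold:
            return start, seg["start"]
    return start, end
```

For a clip of 30-80s containing a "la la la" segment at 46-64s, the clip is trimmed to 30-46s.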
Pipeline:
- New transcribe_with_groq() function uses Groq's whisper-large-v3-turbo
- 30s audio transcribed in ~0.5s (vs 30s+ on CPU)
- Same quality as local Whisper (it's the same OpenAI model)
- Cloudflare bypass via custom User-Agent header
- 24MB upload limit guard with auto-fallback to local
- Language auto-detect works (Groq returns full lang name, mapped to ISO codes)
Default whisper_provider='auto':
- If GROQ_API_KEY is set → use Groq (200x faster)
- Otherwise → fallback to local faster-whisper
- Strict 'groq' mode: no fallback (returns empty if Groq fails)
- Strict 'local' mode: skip Groq entirely
CLI: --whisper-provider {auto,groq,local}
API: whisper_provider field in StartJobIn
Cost: $0.04/h with whisper-large-v3-turbo ($0.002 per 200s song)
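The per-song figure follows directly from the hourly price. A small sketch of the arithmetic, where `cost_per_song` is a hypothetical helper:

```python
def cost_per_song(price_per_hour_usd, duration_s):
    """Pro-rate an hourly transcription price to one song."""
    return price_per_hour_usd * duration_s / 3600
```

At $0.04/h a 200s song costs 0.04 * 200 / 3600, roughly $0.0022, which matches the rounded $0.002 above.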
When Whisper hallucinates (generates fake lyrics not matching the audio),
LLM can now use the original filename as a hint to recognize the song
and override the false transcript with the actual lyrics.
Pipeline:
1. Pass filename (e.g. 'Ben Zucker - Bonnie und Clyde') as hint
2. Whisper transcribes (may hallucinate)
3. Claude/Gemini reads filename + transcript:
- Recognizes song from filename hint
- Compares Whisper output to known lyrics
- Replaces hallucinated text with real lyrics (preserves timestamps)
- If can't fix, removes segment (better silent than wrong)
Also added Whisper anti-hallucination params:
- beam_size=5 (more careful decoding vs greedy)
- hallucination_silence_threshold=2.0 (skip text in long silences)
Diagnosis:
- analyze.py historically had Claude support only
- when Gemini was added, clip_range.source stayed hardcoded as 'claude'
- likewise the logs 'Whisper segmenti zamenjani s Claude' ("Whisper segments replaced with Claude") and 'Generated SRT from Claude'
- the API result in the job showed source='claude' even when Gemini was actually used
- this is only a COSMETIC bug; functionally everything worked correctly
- Gemini WAS actually called (confirmed: '🤖 Gemini (gemini-3.1-pro-preview) izbral: 172.5-201.8s')
and returned the correct result; only the logging reported the wrong name
Fixes:
1. clip_range['source'] = claude_result['source'] (actually 'gemini:...' or 'claude:...')
2. clip_range['reason'] prefix changed from hardcoded 'claude_llm:' to dynamic '{source}:'
3. Log 'Whisper segmenti zamenjani s Claude' → 'z {llm_label}'
4. Log 'Claude je popravil jezik' ("Claude corrected the language") → 'LLM je popravil'
5. main.py 'Generated SRT from Claude' → 'from {llm_src}'
Test (Zlati Muzikanti - Le prijatelja bodiva, waltz, 246s):
✓ Gemini actually picks the chorus (172.5-201.8s)
✓ Whisper detects sl (p=0.97 across 3 samples)
✓ All 18 segments corrected
✓ Pipeline works end-to-end
Backward compat:
- transcript['claude_corrected'] and the srt_from_claude variable name are kept
because they already exist in old job state files
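A sketch of fixes 1 and 2: the source label is taken from the LLM result instead of a hardcoded string. The `start`, `end` and `reason` keys on the result dict, and the helper name `build_clip_range`, are assumptions.

```python
def build_clip_range(llm_result):
    """Carry the real provider label into both source and reason."""
    source = llm_result["source"]  # e.g. 'gemini:<model>' or 'claude:<model>'
    return {
        "start": llm_result["start"],
        "end": llm_result["end"],
        "source": source,
        # Dynamic prefix replaces the old hardcoded 'claude_llm:'.
        "reason": f"{source}:{llm_result.get('reason', '')}",
    }
```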
- Refactored analyze_with_claude into shared _build_analysis_prompt + _parse_llm_response helpers
- New analyze_with_gemini() using Gemini 3.1 Pro ($2/M in, MMMLU 92.6% — best multilingual)
- Unified analyze_with_llm(provider) dispatcher with auto-fallback (Claude → Gemini)
- API endpoint accepts llm_provider in StartJobIn (claude/gemini/auto)
- Frontend dropdown to pick LLM
- Default model is now Sonnet 4.6 (was Haiku 4.5) — 3x quality at 3x price (~3 cents/video)
- Gemini support is opt-in: needs GEMINI_API_KEY env var to activate
- Claude now corrects transcription errors (Slavic languages, dialects, mixed langs)
- Returns corrected_segments with same timestamps but cleaner text
- Pipeline generates SRT from Claude-corrected transcript and passes to subtitle.py via --srt
- subtitle.py supports --srt to skip Whisper re-transcription on the trimmed clip
- clip.py propagates --srt through to subtitle.py
- Whisper still runs once (in analyze.py); subtitle.py reuses corrected output instead of re-running
- This means: Whisper's mistakes (mixed langs, hallucinations, wrong words) are fixed by Claude before becoming visible subtitles
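Generating the SRT from corrected segments can be sketched as below. Shifting timestamps by the clip start (so subtitles line up with the trimmed clip) is an assumption about how the pipeline handles offsets, and the function names are hypothetical; the caller is expected to pass only segments inside the clip.

```python
def format_srt_time(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def segments_to_srt(segments, clip_start=0.0):
    """Render corrected segments as SRT text, relative to the clip start."""
    blocks = []
    for index, seg in enumerate(segments, 1):
        start = max(0.0, seg["start"] - clip_start)
        end = max(start, seg["end"] - clip_start)
        blocks.append(
            f"{index}\n"
            f"{format_srt_time(start)} --> {format_srt_time(end)}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)
```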
- 3-sample voting for auto-detect (start/middle/end of song) prevents lang switching mid-song
- Lock detected language for full transcription
- Anti-hallucination: condition_on_previous_text=False, temperature=0.0
- compression_ratio_threshold=2.4 (rejects repetitive hallucinations)
- log_prob_threshold=-1.0 (rejects low-confidence segments)
- no_speech_threshold=0.6 (more aggressive silence detection)
- Default Whisper model changed: small → medium (better for all langs incl. Slavic)
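The 3-sample vote can be sketched as a probability-weighted tally. Weighting by detection probability is an assumption (the notes only say "3-sample voting"), and `vote_language` is a hypothetical name.

```python
from collections import Counter


def vote_language(samples):
    """Pick one language from (lang, probability) pairs.

    The samples are meant to come from the start, middle and end of the song.
    """
    if not samples:
        raise ValueError("need at least one language sample")
    totals = Counter()
    for lang, prob in samples:
        totals[lang] += prob  # weight each vote by its confidence
    lang, _ = totals.most_common(1)[0]
    return lang
```

The winning code is then locked for the full transcription, so a single misdetected passage cannot switch the language mid-song.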