Bug: Claude picked clip start at 78.19s (0.3s before segment 'tepe' at
78.4s). Word-level extension then found word 'me' (77.88-78.16s) right
before clip start, extended to 77.73s. But the FULL phrase was 'Žena me'
where 'Žena' [76.88-77.74] precedes 'me' [77.88-78.16] in the same
breath/speech burst (gap 0.14s, not a real pause).
Fix: when extending back via word-level, do a lookback through earlier
words. Stop only when finding a real pause (gap >= 0.5s between words).
This captures the entire connected phrase before clip start.
Now: clip start 78.19s → finds 'me' ending at 78.16s → looks back: 'Žena'
ends at 77.74s (gap to 'me' = 0.14s, < 0.5s) → continue. The earlier
'verjet.' ends at 76.78s (gap to 'Žena' = 0.10s), so it is also captured
even though it belongs to the previous verse; the lookback stops only at
the first pause >= 0.5s.
For the Žena case, the anchor will be at 'Žena' (or earlier if no big pause).
This makes the extension MUCH more robust for cases where multiple words
of the chorus opening fall in the previous transcript segment.
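A minimal sketch of the lookback described above, assuming each word is a dict with 'start' and 'end' in seconds. The function name and constants are illustrative, not the names used in the codebase.

```python
from typing import Dict, List

PAUSE_GAP = 0.5   # seconds; a gap this long counts as a real pause
BUFFER = 0.15     # seconds of lead-in kept before the anchor word


def extend_start_back(words: List[Dict], clip_start: float) -> float:
    """Walk back through connected words that end before clip_start."""
    # Words that end at or before the clip start, in chronological order.
    before = [w for w in words if w["end"] <= clip_start]
    if not before or clip_start - before[-1]["end"] >= PAUSE_GAP:
        return clip_start  # no word close enough to extend to
    anchor = len(before) - 1
    # Step back while the gap to the previous word is shorter than a pause.
    while anchor > 0 and before[anchor]["start"] - before[anchor - 1]["end"] < PAUSE_GAP:
        anchor -= 1
    return max(0.0, before[anchor]["start"] - BUFFER)
```

With a hypothetical earlier word ending at 76.20s (a 0.68s pause), the anchor lands on 'Žena' and the start becomes 76.88 - 0.15 = 76.73s.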
Bug found in Žena ME TEPE re-test:
- Final clip start was 77.2s but word 'Žena' starts at 76.88s
- Word-level extension would have correctly chosen 76.73s
- Why didn't it? Because corrected_segs (Claude output) doesn't contain
word-level timestamps, only segment start/end. all_words array was empty,
triggering segment-level fallback (-0.5s) which produced 77.2s instead.
Fix: always use transcript['segments'] (original Scribe output with word
timestamps) for word-level boundary detection, not Claude corrected_segments.
Now: 'Žena' word at 76.88-77.74s will trigger word-level extension to
76.73s (76.88 - 0.15s buffer), capturing the full word.
Žena problem persists: even after word-level extension, some cases where
Scribe doesn't transcribe the very first word still result in clip cutting
the vocal start.
Layer 3 defense: after word-level start extension, probe the FIRST 150ms
of audio at clip start with ffmpeg volumedetect. If mean_volume > -35 dB
(threshold for vocal/music vs silence), extend clip start back 0.5s as a
safety buffer.
This catches cases where:
- Scribe missed the word entirely (no word-level timestamp to extend to)
- LLM picked a start that's already inside vocal energy
- Word-level extension didn't trigger because no nearby word matched
The check is fast (<100ms) and conservative (only triggers if audio is
clearly NOT silent). If it's a true musical break (silence before chorus),
mean_volume will be < -40 dB and extension is skipped.
Three layers of defense now:
1. Claude prompt: 'start ~0.3s before first chorus word'
2. Word-level boundary detection (Scribe word timestamps)
3. Audio amplitude check (catches cases 1-2 missed)
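A hedged sketch of the layer 3 probe. It assumes ffmpeg is on PATH and that the volumedetect filter prints a 'mean_volume: X dB' line on stderr; the helper names are hypothetical, and only the -35 dB threshold, the 150ms window and the 0.5s buffer come from the note above.

```python
import re
import subprocess
from typing import Optional

VOCAL_THRESHOLD_DB = -35.0  # louder than this is treated as vocal/music
SAFETY_BUFFER = 0.5         # seconds to pull the clip start back


def parse_mean_volume(ffmpeg_stderr: str) -> Optional[float]:
    """Extract the mean_volume value (dB) from volumedetect output."""
    m = re.search(r"mean_volume:\s*(-?\d+(?:\.\d+)?)\s*dB", ffmpeg_stderr)
    return float(m.group(1)) if m else None


def probe_mean_volume(path: str, start: float, window: float = 0.15) -> Optional[float]:
    """Run volumedetect on a short window beginning at the clip start."""
    cmd = ["ffmpeg", "-hide_banner", "-nostats",
           "-ss", f"{start:.3f}", "-t", f"{window:.3f}", "-i", path,
           "-vn", "-af", "volumedetect", "-f", "null", "-"]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return parse_mean_volume(proc.stderr)


def adjust_start(clip_start: float, mean_volume: Optional[float]) -> float:
    """Extend back only when the probe clearly found audio energy."""
    if mean_volume is not None and mean_volume > VOCAL_THRESHOLD_DB:
        return max(0.0, clip_start - SAFETY_BUFFER)
    return clip_start
```

If the probe fails or returns nothing parseable, the sketch leaves the start unchanged, which matches the conservative intent.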
Changes:
1. Frontend multi-upload:
- File input now has 'multiple' attribute, drag-drop accepts multiple
- File queue list with per-file artist/title preview + remove button
- 'Pošlji vse' uploads sequentially (one at a time to avoid network saturation)
- Each file gets same batch_id for Telegram batch summary
- After upload, queue clears, jobs appear in right sidebar
2. Backend queue worker:
- New _queue_worker() background thread processes 'queued' jobs sequentially
- Only 1 job at a time to keep openclaw stable (avoid CPU/RAM thrash)
- FIFO order by created_at
- Auto-starts on app startup after job resume
3. Job submission flow change:
- /api/process and /api/youtube no longer call background.add_task directly
- Just mark status='queued', queue worker picks up
- This means upload completes fast, processing happens in background
- User can close browser, jobs continue
4. Telegram notifications (FOLX Alerts bot):
- Per-job: 'Reel pripravljen: Lady Gaga - Abracadabra (29s, 30 MB)'
- Per-job failed: 'Reel ni uspel: <name> + error message'
- Batch summary: 'Batch končan: 10/10 reels pripravljeni' (only if >1 in batch)
- Uses existing TELEGRAM_TOKEN + TELEGRAM_CHAT_ID env vars
- app/telegram.py module with notify_job_done(), notify_job_failed(),
notify_batch_complete()
5. batch_id field:
- Added to Job model + StartJobIn pydantic
- Saved during upload + process
- Used to count batch progress and trigger summary notification
User experience:
- Drag 20 videos at once
- Click 'Pošlji'
- Close browser, go grab coffee
- Telegram sends 'Reel pripravljen' for each
- After all done: 'Batch končan: 20/20 reels pripravljeni' summary
- Open app to download all
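The FIFO pick and the batch summary trigger can be sketched as below. The Job dataclass here is a hypothetical in-memory stand-in for the real Job model, and the status names other than 'queued' are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Job:
    id: str
    status: str                 # assumed: 'queued', 'running', 'done', 'failed'
    created_at: float
    batch_id: Optional[str] = None


def next_queued(jobs: List[Job]) -> Optional[Job]:
    """Return the oldest queued job (FIFO by created_at), or None."""
    queued = [j for j in jobs if j.status == "queued"]
    return min(queued, key=lambda j: j.created_at) if queued else None


def batch_summary(jobs: List[Job], batch_id: str) -> Optional[Tuple[int, int]]:
    """Return (done, total) once every job in a multi-file batch has finished."""
    batch = [j for j in jobs if j.batch_id == batch_id]
    if len(batch) <= 1:
        return None  # summary is only sent for batches with more than one file
    if any(j.status not in ("done", "failed") for j in batch):
        return None  # batch still in progress
    return sum(j.status == "done" for j in batch), len(batch)
```

A worker loop would call next_queued() repeatedly, process one job at a time, and check batch_summary() after each job to decide whether to send the summary notification.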
Previous fix used segment boundaries — required segments <3s for type 1
or <4s for type 2. But Žena was in a 4.3s segment ('saj še doma mi več
noč'jo verjet'. Žena me'), so the condition wasn't met and clip start
stayed at 77.7s, exactly at end of word 'Žena' (76.88-77.70s).
New approach: scan word-level timestamps directly:
1. If clip start falls MID-WORD → extend back to word start - 0.15s
2. If a word ends 0-0.5s BEFORE clip start AND next word is at clip start
→ that word is suspect (may be first word of chorus that Scribe put
in previous segment), extend back to its start - 0.15s
Word-level timestamps are always available from Scribe (timestamps_granularity=word).
Falls back to segment-level for local Whisper without word timing.
This handles arbitrary segment lengths and is universal — works for any
language where the chorus starts on a word that the STT placed in the
previous segment.
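The two word-level rules might look like this in code. This is a sketch only: 'next word is at clip start' is interpreted here as 'starts within a tolerance of clip start', and that tolerance (0.3s) is an assumption, as are the function and parameter names.

```python
from typing import Dict, List


def snap_start_to_words(words: List[Dict], clip_start: float,
                        buffer: float = 0.15, lookback: float = 0.5,
                        tol: float = 0.3) -> float:
    """Apply the mid-word and suspect-previous-word rules to clip_start."""
    for i, w in enumerate(words):
        # Rule 1: clip start falls inside a word.
        if w["start"] < clip_start < w["end"]:
            return max(0.0, w["start"] - buffer)
        # Rule 2: a word ends 0-0.5s before clip start and the next word
        # begins near clip start (tolerance is an assumption).
        gap = clip_start - w["end"]
        nxt = words[i + 1] if i + 1 < len(words) else None
        if 0 <= gap <= lookback and nxt is not None \
                and abs(nxt["start"] - clip_start) <= tol:
            return max(0.0, w["start"] - buffer)
    return clip_start
```

For the case above (clip start 77.7s, 'Žena' at 76.88-77.70s, 'me' starting at 77.88s) rule 2 fires and the start moves to 76.73s.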
Real-world failure: 'Ansambel Saša Avsenika - ŽENA ME TEPE'
- Chorus starts with 'Žena me tepe' at 78.0s
- Scribe's segment boundary: word 'Žena' was end of previous segment (73.9-78.2s)
while new segment 'tepe, mi prazni žepe' started at 78.3s
- Claude picked clip start = 78.3s (segment boundary)
- Fade-in 0.4s on vocal start = inaudible 'Že-'
- User hears: '...na me tepe' (cut)
Three-part fix:
1. PROMPT: instruct Claude to start clip ~0.3s BEFORE first chorus word
(not exactly at it). Concrete example with timing math.
2. POST-LLM EXTENSION: scan corrected_segments for boundary cases:
- If clip start falls MID-segment → extend back to segment start - 0.2s
- If a previous segment ended within 0.5s of clip start → check if its
last word might actually be the first chorus word, extend back to it
- Uses word-level timestamps when available (Scribe provides these)
3. FADE-IN: was 0.4s when starting on vocal — too long, audibly cuts first
word. Reduced to 0.05s (just click prevention, not audible). Still 0.2s
for instrumental intros where fade is musically appropriate.
Now 'Žena' will be heard fully — clip starts at ~77.5-77.7s, word starts
at 78.0s, plenty of buffer.
Problem: MXF and MPG files (TV broadcast formats) often contain:
- Multiple audio streams (4-8 streams for different language tracks)
- Multichannel layouts (5.1, 7.1) instead of stereo
- Default ffmpeg behavior was -c:a aac without channel limit, which
meant multichannel got transcoded as multichannel AAC, overwriting
what should have been clean stereo
Solution:
1. get_audio_streams() helper probes all audio streams with ffprobe
- Returns codec, channels, sample_rate, language, layout for each
2. build_audio_args() picks best stream + downmix:
- Prefers first 2-channel stereo stream (usually main mix)
- Falls back to first stream if none are 2-ch
- Always: -ac 2 (force stereo downmix), -ar 48000, -c:a aac, -b:a 192k
- Bitrate raised from 128k to 192k for music quality
3. Smart trim path now detects broadcast formats:
- .mxf, .mpg, .mpeg, .ts, .m2ts, .mts → transcode (not stream copy)
- Standard MP4/MOV → stream copy (faster, lossless)
4. Pre-conversion step for broadcast files without trim:
- Even without --start/--duration, MXF/MPG get converted to MP4
- Same audio handling as trim path
5. Main render adds explicit -map 0:v:0 -map 0:a:0? -ac 2 to ensure
only first video and first audio stream get encoded, with stereo
6. ACR recognize also gets -map 0:a:0 -ac 2 for MXF compatibility
7. UI accepts: video/*,.mxf,.mpg,.mpeg,.ts,.m2ts,.mts
8. Upload limit raised: 2GB → 10GB (MXF files are large)
This means a TV broadcast MXF with [SLO/EN/DE language tracks] now
correctly outputs stereo MP4 with the main language track preserved.
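A rough sketch of the stream selection and the broadcast-format check. The function names are hypothetical (not the real build_audio_args), the input is assumed to be a list of dicts with a 'channels' key in ffprobe order, and selecting the stream via -map 0:a:N is an assumption about how the choice is applied.

```python
from pathlib import Path
from typing import Dict, List

BROADCAST_EXTS = {".mxf", ".mpg", ".mpeg", ".ts", ".m2ts", ".mts"}


def needs_transcode(path: str) -> bool:
    """Broadcast containers are transcoded; MP4/MOV can be stream-copied."""
    return Path(path).suffix.lower() in BROADCAST_EXTS


def pick_audio_stream(streams: List[Dict]) -> int:
    """Prefer the first 2-channel stream, else fall back to the first stream."""
    for i, s in enumerate(streams):
        if s.get("channels") == 2:
            return i
    return 0


def audio_args_sketch(streams: List[Dict]) -> List[str]:
    """Build ffmpeg audio args: chosen stream, stereo downmix, 48 kHz AAC 192k."""
    if not streams:
        return ["-an"]  # assumption: no audio stream means no audio output
    idx = pick_audio_stream(streams)
    return ["-map", f"0:a:{idx}", "-ac", "2", "-ar", "48000",
            "-c:a", "aac", "-b:a", "192k"]
```

Keeping -ac 2 unconditional means a 5.1 fallback stream is still downmixed to stereo.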
Changes:
1. UI: removed blocking prompt() that asked for artist+title on filename
that didn't match 'Artist - Title' pattern. Upload always proceeds.
Instead shows yellow warning saying 'server will try to recognize'.
2. Backend: added scripts/acr_recognize.py — extracts 20s audio sample
from video (at 15s and 60s offsets for robustness), computes ACRCloud
fingerprint via native binary (3KB payload), sends to identify API.
3. Pipeline: process_job() now runs ACR recognition step before analysis
IF parsed_artist or parsed_title is missing. Result is saved to job
metadata and used for download filename + Scribe/Claude filename hint.
4. Credentials: ACR_HOST + ACR_ACCESS_KEY + ACR_SECRET_KEY env vars
added to Coolify (using existing keys from openclaw fb-agent metka).
5. requirements.txt: added pyacrcloud==1.0.11 for native fingerprinting.
This unblocks future automation/cron upload pipelines — files don't need
to be perfectly named, ACRCloud will identify them automatically.
Fallback chain:
1. Filename parsing (Artist - Title.mp4)
2. ACRCloud audio fingerprint (works even for '12345.mp4', 'IMG_001.mp4')
3. If both fail: download filename uses 'reel_<id>.mp4' (still works)
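The fallback chain can be sketched as follows. The recognize argument is a hypothetical callable standing in for the ACRCloud step, and the regex is one plausible way to match the 'Artist - Title' pattern, not necessarily the one in the codebase.

```python
import re
from pathlib import Path
from typing import Callable, Optional, Tuple

Parsed = Optional[Tuple[str, str]]


def parse_artist_title(filename: str) -> Parsed:
    """Split 'Artist - Title.ext' on the first spaced hyphen."""
    stem = Path(filename).stem
    m = re.match(r"^\s*(.+?)\s+-\s+(.+?)\s*$", stem)
    return (m.group(1), m.group(2)) if m else None


def resolve_download_name(filename: str, job_id: str,
                          recognize: Optional[Callable[[str], Parsed]] = None) -> str:
    """Filename parsing first, then recognition, then a generic reel name."""
    parsed = parse_artist_title(filename)
    if parsed is None and recognize is not None:
        parsed = recognize(filename)  # stand-in for audio fingerprinting
    if parsed:
        artist, title = parsed
        return f"{artist} - {title}.mp4"
    return f"reel_{job_id}.mp4"
```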
Bug: 'Vključi pre-chorus' checkbox in UI was sent to backend but ignored
by Claude/Gemini analysis prompt. Both modes used same lenient rules
saying 'pre-chorus is optional' — Claude often included pre-chorus even
when user wanted just chorus.
Real-world failure: Lady Gaga 'Abracadabra' picked 54.7-84.6s, but actual
chorus 'Abracadabra, amor, ooh-na-na' starts at 85.2s. Claude included
the entire pre-chorus block ('Hold me in your heart tonight', 'Like a
poem said by a lady in red', 'With a haunting dance') and missed the
actual chorus completely.
Fix: include_prebuild parameter now flows all the way to the prompt:
- main.py → analyze.py CLI args → analyze_with_llm() → prompt builder
- Two distinct prompt rule sets:
CHORUS ONLY (default, include_prebuild=False):
- Strict: 'clip starts on FIRST WORD of chorus, never before'
- Length: 12-25s typically
- Explicit examples for pop songs (Abracadabra, Despacito, Shape of You)
- List of common mistakes to avoid
CHORUS + PRE-CHORUS (include_prebuild=True):
- Optional pre-chorus before chorus, 4-10s
- Length: 18-35s
This fixes the most common failure mode where Claude rationalizes
including verse/pre-chorus content even when user explicitly wants
just the chorus.
User feedback: 'The CHORUS is mandatory, pre-chorus optional' + 'the system
must be stable for all languages, including Spanish and Romanian'.
Two changes:
1. Web search is now MANDATORY first step (was: optional fallback):
- Even if Claude thinks it knows the song, must search lyrics first
- Universal lyrics sources by language:
SLO: besedila.com, lyricstranslate.com
DE: songtexte.com
HR/SR/BS: tekstovi.net
ES: letras.com, musica.com
RO: versuri.ro
IT: angolotesti.it
FR: paroles.net
EN: genius.com, azlyrics.com
Universal: lyricstranslate.com (any language)
- Search strategy: artist+title first, then transcript snippet fallback
- Without lyrics, Claude cannot reliably identify chorus boundaries
2. Simplified selection rules - chorus is THE priority:
- Chorus (full first occurrence) = MANDATORY
- Pre-chorus = ONLY if 1-2 verse lines tightly connected to chorus
- In doubt: just take chorus alone (12-25s)
- Outro fillers explicitly multi-language:
SLO 'aj ja ja' / 'ej ej ej'
EN 'yeah' / 'oh oh'
ES 'ay ay ay'
RO 'hei hei'
JA 'la la la'
- 12-35s total range (was 15-35s, now allows shorter chorus-only clips)
This makes the system language-agnostic: works the same way for Slovenian
narodno-zabavna, Spanish reggaeton, Romanian manele, German Schlager, etc.
The lyrics lookup is what makes it stable across languages.
Problem: Claude was cutting clip exactly at last transcribed word of chorus,
but in real songs:
- Singer holds last note 1-3s longer (still meaningful)
- Outro 'ej-ej-ej' / 'oh' / 'yeah' may not be transcribed as words
- Result felt like 'incomplete chorus' even though SRT was correct
Fix has two parts:
1. Prompt enhancement:
- Ask Claude to add 1-2s padding AFTER last chorus word
- Explicit example with timing math
- Mention outro fillers (ej-ej-ej, oh, yeah)
2. Post-LLM extension logic:
- After Claude returns clip range, scan corrected_segments for
segments overlapping or starting just after current end
- If next segment is within 1s pause and ends within max_duration+5s,
extend clip to include it (with 0.3s breathing room)
- Hard cap at max_duration + 5s to prevent unbounded extension
This ensures chorus naturally trails off rather than being cut mid-emotional-peak.
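A sketch of the post-LLM end extension under the stated limits (1s pause, 0.3s breathing room, hard cap at max_duration + 5s). It assumes segments are dicts with 'start' and 'end', and that several consecutive segments may be chained; the source only says 'next segment', so the chaining is an assumption.

```python
from typing import Dict, List


def extend_clip_end(segments: List[Dict], start: float, end: float,
                    max_duration: float, max_pause: float = 1.0,
                    tail: float = 0.3, slack: float = 5.0) -> float:
    """Extend the clip end over segments that follow without a real pause."""
    hard_end = start + max_duration + slack
    last_end = end
    extended = False
    for seg in sorted(segments, key=lambda s: s["start"]):
        if seg["end"] <= last_end:
            continue  # already inside the clip
        if seg["start"] - last_end > max_pause:
            break     # a real pause: stop extending
        if seg["end"] + tail > hard_end:
            break     # would exceed the hard cap
        last_end = seg["end"]
        extended = True
    # Breathing room is added only when something was actually included.
    return last_end + tail if extended else end
```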
ROOT CAUSE FOUND: tag_audio_events=false caused Scribe to stop transcribing
when instrumental music dominates (polka harmonica taking over from vocals).
Real-world test on Avseniki - Ena bolha za pomoč (186s polka):
- tag_audio_events=false: 20% coverage (37s only) — fails
- tag_audio_events=true: 100% coverage (186s full) — works
When tag_audio_events=true, Scribe inserts placeholder markers like
'(glasba)' / '(plesalna glasba)' for instrumental sections instead of
giving up. We then filter these out so they don't appear in subtitles.
Filtering logic:
- Skip word.type != 'word' (audio_event types)
- Skip parenthesized text legacy fallback like '(music)', '(applause)'
This is the core fix — no longer reliant on filename for transcription
completeness. Even untitled files like '12345.mp4' now get full coverage.
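The filtering logic could be sketched as below, assuming each Scribe word entry is a dict with 'text' and 'type' keys. The function name is illustrative.

```python
import re
from typing import Dict, List

# Matches text that is entirely wrapped in parentheses, e.g. '(glasba)'.
PAREN_ONLY = re.compile(r"^\s*\(.*\)\s*$")


def keep_spoken_words(words: List[Dict]) -> List[Dict]:
    """Drop audio-event entries and parenthesized placeholder text."""
    kept = []
    for w in words:
        if w.get("type", "word") != "word":
            continue  # audio_event, spacing, and other non-word types
        if PAREN_ONLY.match(w.get("text", "")):
            continue  # legacy fallback: '(music)', '(applause)'
        kept.append(w)
    return kept
```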
Problem: Scribe was failing on Slovenian narodno-zabavna songs (Avseniki,
Modrijani) because:
- User doesn't manually pick language (everything is auto)
- Scribe auto-detect had low confidence (0.58) on harmonika-heavy polka
- Result: only 37s transcribed instead of full 186s song
Solution: detect_language_from_filename() function:
- Recognizes 60+ Slovenian artists (Avseniki, Modrijani, Veseli Dolenjci, ...)
- Recognizes 30+ German artists (Ben Zucker, Helene Fischer, ...)
- Recognizes 20+ Croatian/Serbian artists (Thompson, Severina, Lepa Brena, ...)
- Falls back to keyword matching (volim, liebe, srce, herz, ...)
- Detects character set (č/ž/š → SL, ä/ö/ü/ß → DE, đ → HR)
- Score-based: 5pts for artist match, 1-2pts for keywords/chars
When detected, sends language_code to Scribe explicitly:
- Avseniki → 'slv' lock → no more half-transcribed songs
- Ben Zucker → 'deu' lock → consistent German transcription
- User still doesn't need to manually pick anything
filename_hint flows: main.py → analyze.py CLI → transcribe_full → Scribe
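A scaled-down sketch of the scoring idea. The artist and keyword tables are tiny illustrative samples rather than the real lists, the exact point values for keywords and characters (2 and 1) are assumptions within the 1-2pt range given above, and 'hrv' is assumed as the code for the Croatian bucket.

```python
import re
from typing import Dict, Optional

ARTISTS = {
    "slv": ["avseniki", "modrijani", "veseli dolenjci"],
    "deu": ["ben zucker", "helene fischer"],
    "hrv": ["thompson", "severina"],
}
KEYWORDS = {"deu": ["liebe", "herz"], "hrv": ["volim"]}
CHARS = {"slv": "čžš", "deu": "äöüß", "hrv": "đ"}


def detect_language_sketch(filename: str) -> Optional[str]:
    """Score each language from artist, keyword and character evidence."""
    name = filename.lower()
    tokens = set(re.findall(r"\w+", name))
    scores: Dict[str, int] = {}
    for lang, artists in ARTISTS.items():
        if any(a in name for a in artists):
            scores[lang] = scores.get(lang, 0) + 5   # artist match
    for lang, kws in KEYWORDS.items():
        if any(k in tokens for k in kws):
            scores[lang] = scores.get(lang, 0) + 2   # keyword match
    for lang, chars in CHARS.items():
        if any(c in name for c in chars):
            scores[lang] = scores.get(lang, 0) + 1   # character-set hint
    return max(scores, key=scores.get) if scores else None
```

Returning None when nothing matches leaves Scribe auto-detect in charge, so unnamed files behave as before.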
Previous rules were ambiguous and Claude was sometimes picking:
- Just the chorus (no build-up)
- Second chorus instance (lower energy than first)
- Random verse + later chorus combinations
New explicit priority order:
1. PRIMARY: pre-chorus verse (build-up) + first chorus (~20-35s total)
2. FALLBACK: just first chorus alone
3. LAST RESORT: dramatic peak section
Strict rules:
- ALWAYS first chorus (highest energy/recognition)
- NEVER second/third chorus instances
- NEVER skip between verses
- NEVER extend over 35 seconds
- Concrete example given: chorus@32s,16s long → pick 20-48s
This fixes Veseli Dolenjci picking second chorus + post-chorus verse
instead of natural pre-chorus build-up + first chorus.
1. Claude API web_search tool integration:
- Claude can now search web for actual lyrics when STT text is wrong
- Especially useful for SLO/HR/BS/SR songs (Modrijani, Veseli Dolenjci)
where Claude doesn't know lyrics from training data
- Agentic loop: tool_use → server-side search → continuation → final text
- Max 3 searches per job ($0.03 cost limit)
- Hint sources: besedila.com, lyricstranslate.com, tekstovi.net, songtexte.com
2. Tighter subtitle segmentation from Scribe word timestamps:
- Phrase boundaries on shorter pauses (0.4s vs 0.6s)
- Sentence-ending punctuation triggers segment break
- Max segment 4s (was 6s) for natural readable subtitles
- Hard cap at 5.5s to prevent very long lines
This fixes 'ples to noč' → 'ples pojoč' for Modrijani songs that
Scribe transcribed phonetically wrong but Claude can fix via web lookup.
ElevenLabs Scribe replaces local Whisper as default transcription:
- 96.7% accuracy English, 2.4% WER Indonesian (vs Whisper 7.7%)
- 18x faster (200s song = 11s vs 3-5 min on CPU)
- No hallucinations on songs (Whisper invented 'Pony und Kleid' for 'Bonnie und Clyde')
- 99 languages supported, including SLO/HR/BS/SR
- $0.40/h pricing, ~$0.022 per 200s song
Implementation:
- transcribe_with_elevenlabs() function uses Scribe v1
- ISO 639-1 ↔ 639-3 mapping (Scribe needs 'deu' not 'de')
- Word-level timestamps converted to pseudo-segments (close on 0.6s pause or 6s duration)
- 24MB upload limit guard with auto-fallback to local
Default whisper_provider='auto':
- If ELEVENLABS_API_KEY set → use Scribe
- Otherwise → fallback to local faster-whisper
- 'elevenlabs' strict mode: no fallback
- 'local' strict mode: skip Scribe entirely
Tested on Ben Zucker - Ohne dich: Scribe correctly transcribed
'Wir sind Bonnie und Clyde, zu allem bereit' where local Whisper hallucinated.
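The word-to-pseudo-segment conversion can be sketched like this, assuming words are dicts with 'text', 'start' and 'end'. It closes a segment on a 0.6s pause or when the next word would push it past 6s; the function names are illustrative.

```python
from typing import Dict, List


def _close(ws: List[Dict]) -> Dict:
    """Collapse a run of words into one segment dict."""
    return {"start": ws[0]["start"], "end": ws[-1]["end"],
            "text": " ".join(w["text"] for w in ws)}


def words_to_segments(words: List[Dict], max_pause: float = 0.6,
                      max_len: float = 6.0) -> List[Dict]:
    """Group word timestamps into pseudo-segments."""
    segments: List[Dict] = []
    current: List[Dict] = []
    for w in words:
        if current:
            gap = w["start"] - current[-1]["end"]
            too_long = w["end"] - current[0]["start"] > max_len
            if gap >= max_pause or too_long:
                segments.append(_close(current))
                current = []
        current.append(w)
    if current:
        segments.append(_close(current))
    return segments
```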
Problem: When a song has a chorus → la-la-la bridge → chorus structure,
Claude was including the whole 40s+ block, with 18 seconds of la-la-la
making the reel feel artificially extended.
Fix:
1. Prompt enhancement: explicitly tell Claude NEVER to include
la-la-la / ooh ooh / yeah yeah / instrumental fillers
2. Post-LLM detection: scan corrected_segments for repetitive content
(>70% repeated words) and trim clip before that segment
3. Max duration guidance reduced from 45s → 35s in prompt
This means: clip will end at the first chorus, not extend through fillers.
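One way to read the '>70% repeated words' check is the share of tokens whose word occurs more than once in the segment. That definition, the minimum word count and the function names below are assumptions for illustration.

```python
import re
from collections import Counter
from typing import Dict, List


def repetition_ratio(text: str) -> float:
    """Fraction of tokens whose word appears more than once in the text."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return sum(c for c in counts.values() if c > 1) / len(tokens)


def trim_before_filler(segments: List[Dict], start: float, end: float,
                       threshold: float = 0.7, min_words: int = 3) -> float:
    """Return a new clip end just before the first filler segment, if any."""
    for seg in sorted(segments, key=lambda s: s["start"]):
        if not (start < seg["start"] < end):
            continue  # only segments that begin inside the clip
        n_words = len(re.findall(r"\w+", seg["text"]))
        if n_words >= min_words and repetition_ratio(seg["text"]) > threshold:
            return seg["start"]
    return end
```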
Pipeline:
- New transcribe_with_groq() function uses Groq's whisper-large-v3-turbo
- 30s audio transcribed in ~0.5s (vs 30s+ on CPU)
- Same quality as local Whisper (it's the same OpenAI model)
- Cloudflare bypass via custom User-Agent header
- 24MB upload limit guard with auto-fallback to local
- Language auto-detect works (Groq returns full lang name, mapped to ISO codes)
Default whisper_provider='auto':
- If GROQ_API_KEY is set → use Groq (200x faster)
- Otherwise → fallback to local faster-whisper
- Strict 'groq' mode: no fallback (returns empty if Groq fails)
- Strict 'local' mode: skip Groq entirely
CLI: --whisper-provider {auto,groq,local}
API: whisper_provider field in StartJobIn
Cost: $0.04/h with whisper-large-v3-turbo ($0.002 per 200s song)
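The provider selection could be sketched as below. The function name and the env-as-dict argument are illustrative; the size guard is applied only in 'auto' mode here, since the behaviour of the guard in strict 'groq' mode is not spelled out above.

```python
from typing import Mapping

MAX_UPLOAD_BYTES = 24 * 1024 * 1024  # 24MB upload limit guard


def choose_provider(mode: str, env: Mapping[str, str], size_bytes: int) -> str:
    """Resolve whisper_provider ('auto', 'groq', 'local') to a concrete backend."""
    if mode == "auto":
        if env.get("GROQ_API_KEY") and size_bytes <= MAX_UPLOAD_BYTES:
            return "groq"
        return "local"  # no key, or file too large for upload
    if mode in ("groq", "local"):
        return mode     # strict modes: no fallback
    raise ValueError(f"unknown whisper_provider: {mode!r}")
```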
When Whisper hallucinates (generates fake lyrics not matching the audio),
LLM can now use the original filename as a hint to recognize the song
and override the false transcript with the actual lyrics.
Pipeline:
1. Pass filename (e.g. 'Ben Zucker - Bonnie und Clyde') as hint
2. Whisper transcribes (may hallucinate)
3. Claude/Gemini reads filename + transcript:
- Recognizes song from filename hint
- Compares Whisper output to known lyrics
- Replaces hallucinated text with real lyrics (preserves timestamps)
- If can't fix, removes segment (better silent than wrong)
Also added Whisper anti-hallucination params:
- beam_size=5 (more careful decoding vs greedy)
- hallucination_silence_threshold=2.0 (skip text in long silences)
Diagnosis:
- analyze.py historically had Claude support only
- when Gemini was added, clip_range.source stayed hardcoded as 'claude'
- likewise the logs 'Whisper segmenti zamenjani s Claude' and 'Generated SRT from Claude'
- the API result in the job showed source='claude' even when Gemini was actually used
- this is a COSMETIC bug only: functionally everything worked correctly
- Gemini WAS actually called (confirmed: '🤖 Gemini (gemini-3.1-pro-preview) izbral: 172.5-201.8s')
and returned the correct result; only the logging was wrong
Fixes:
1. clip_range['source'] = claude_result['source'] (the actual 'gemini:...' or 'claude:...')
2. clip_range['reason'] prefix changed from hardcoded 'claude_llm:' to dynamic '{source}:'
3. Log 'Whisper segmenti zamenjani s Claude' → 'z {llm_label}'
4. Log 'Claude je popravil jezik' → 'LLM je popravil'
5. main.py 'Generated SRT from Claude' → 'from {llm_src}'
Test (Zlati Muzikanti - Le prijatelja bodiva, waltz, 246s):
✓ Gemini actually picks the chorus (172.5-201.8s)
✓ Whisper detects sl (p=0.97 across 3 samples)
✓ All 18 segments corrected
✓ Pipeline works end-to-end
Backward compat:
- transcript['claude_corrected'] and the srt_from_claude variable name are kept
because they already exist in old job state files
- Refactored analyze_with_claude into shared _build_analysis_prompt + _parse_llm_response helpers
- New analyze_with_gemini() using Gemini 3.1 Pro ($2/M in, MMMLU 92.6% — best multilingual)
- Unified analyze_with_llm(provider) dispatcher with auto-fallback (Claude → Gemini)
- API endpoint accepts llm_provider in StartJobIn (claude/gemini/auto)
- Frontend dropdown to pick LLM
- Default model is now Sonnet 4.6 (was Haiku 4.5) — 3x quality at 3x price (~3 cents/video)
- Gemini support is opt-in: needs GEMINI_API_KEY env var to activate
- Claude now corrects transcription errors (Slavic languages, dialects, mixed langs)
- Returns corrected_segments with same timestamps but cleaner text
- Pipeline generates SRT from Claude-corrected transcript and passes to subtitle.py via --srt
- subtitle.py supports --srt to skip Whisper re-transcription on the trimmed clip
- clip.py propagates --srt through to subtitle.py
- Whisper still runs once (in analyze.py); subtitle.py reuses corrected output instead of re-running
- This means: Whisper's mistakes (mixed langs, hallucinations, wrong words) are fixed by Claude before becoming visible subtitles
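Generating SRT from corrected segments can be sketched as follows. The offset parameter (shifting timestamps so they are relative to the trimmed clip) is an assumption about how the SRT lines up with the clip, and the function names are illustrative.

```python
from typing import Dict, List


def fmt_srt_time(t: float) -> str:
    """Format seconds as an SRT timestamp HH:MM:SS,mmm."""
    ms = int(round(max(0.0, t) * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def segments_to_srt(segments: List[Dict], offset: float = 0.0) -> str:
    """Render segments (start, end, text) as numbered SRT cues."""
    blocks = []
    for i, seg in enumerate(segments, 1):
        start = fmt_srt_time(seg["start"] - offset)
        end = fmt_srt_time(seg["end"] - offset)
        blocks.append(f"{i}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)
```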
- 3-sample voting for auto-detect (start/middle/end of song) prevents lang switching mid-song
- Lock detected language for full transcription
- Anti-hallucination: condition_on_previous_text=False, temperature=0.0
- compression_ratio_threshold=2.4 (rejects repetitive hallucinations)
- log_prob_threshold=-1.0 (rejects low-confidence segments)
- no_speech_threshold=0.6 (more aggressive silence detection)
- Default Whisper model changed: small → medium (better for all langs incl. Slavic)
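The 3-sample vote can be sketched as a majority count with a probability tie-break. The tie-break rule and the function name are assumptions; the input is a list of (language, probability) pairs, one per sample.

```python
from collections import Counter
from typing import List, Optional, Tuple


def vote_language(samples: List[Tuple[str, float]]) -> Optional[str]:
    """Pick the language detected in most samples; break ties by summed probability."""
    if not samples:
        return None
    counts = Counter(lang for lang, _ in samples)
    top = max(counts.values())
    tied = [lang for lang, c in counts.items() if c == top]
    return max(tied, key=lambda l: sum(p for lang, p in samples if lang == l))
```

The winning code would then be passed as the fixed language for the full transcription, so a noisy middle sample cannot switch languages mid-song.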