Commit Graph

47 Commits

Author SHA1 Message Date
823eb3e91e Use original Scribe transcript for word-level (Claude doesn't return words)
Bug found in Žena ME TEPE re-test:
- Final clip start was 77.2s but word 'Žena' starts at 76.88s
- Word-level extension would have correctly chosen 76.73s
- Why didn't it? Because corrected_segs (Claude output) doesn't contain
  word-level timestamps, only segment start/end. all_words array was empty,
  triggering segment-level fallback (-0.5s) which produced 77.2s instead.

Fix: always use transcript['segments'] (original Scribe output with word
timestamps) for word-level boundary detection, not Claude corrected_segments.

Now: 'Žena' word at 76.88-77.74s will trigger word-level extension to
76.73s (76.88 - 0.15s buffer), capturing the full word.
2026-04-29 16:30:51 +00:00
e06c3efb8e Add audio amplitude defense (Layer 3) for first-word cut prevention
Žena problem persists: even after word-level extension, some cases where
Scribe doesn't transcribe the very first word still result in clip cutting
the vocal start.

Layer 3 defense: after word-level start extension, probe the FIRST 150ms
of audio at clip start with ffmpeg volumedetect. If mean_volume > -35 dB
(threshold for vocal/music vs silence), extend clip start back 0.5s as a
safety buffer.

This catches cases where:
- Scribe missed the word entirely (no word-level timestamp to extend to)
- LLM picked a start that's already inside vocal energy
- Word-level extension didn't trigger because no nearby word matched

The check is fast (<100ms) and conservative (only triggers if audio is
clearly NOT silent). If it's a true musical break (silence before chorus),
mean_volume will be < -40 dB and extension is skipped.

Three layers of defense now:
1. Claude prompt: 'start ~0.3s before first chorus word'
2. Word-level boundary detection (Scribe word timestamps)
3. Audio amplitude check (catches cases that layers 1-2 missed)
2026-04-29 15:23:37 +00:00
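The Layer 3 check from e06c3efb8e could look roughly like the sketch below. It is not the committed code: the function names are hypothetical, and only the 150 ms window, the -35 dB threshold and the 0.5 s buffer come from the message above. Parsing is kept separate from the ffmpeg call so the decision logic can be exercised without media files.

```python
import re
import subprocess
from typing import Optional

SILENCE_THRESHOLD_DB = -35.0  # from the commit message
PROBE_SECONDS = 0.15          # first 150 ms of the clip
SAFETY_BUFFER = 0.5           # seconds to move the start back


def parse_mean_volume(ffmpeg_stderr: str) -> Optional[float]:
    """Extract mean_volume (dB) from ffmpeg volumedetect output, if present."""
    match = re.search(r"mean_volume:\s*(-?\d+(?:\.\d+)?) dB", ffmpeg_stderr)
    return float(match.group(1)) if match else None


def probe_mean_volume(path: str, start: float) -> Optional[float]:
    """Run volumedetect on a short window beginning at `start`."""
    cmd = ["ffmpeg", "-hide_banner", "-ss", f"{start:.3f}", "-t", str(PROBE_SECONDS),
           "-i", path, "-vn", "-af", "volumedetect", "-f", "null", "-"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return parse_mean_volume(result.stderr)  # volumedetect reports on stderr


def adjust_start(start: float, mean_volume: Optional[float]) -> float:
    """Extend the start back only when the probe is clearly not silent."""
    if mean_volume is not None and mean_volume > SILENCE_THRESHOLD_DB:
        return max(0.0, start - SAFETY_BUFFER)
    return start
```

A caller would chain the two steps, for example adjust_start(start, probe_mean_volume(path, start)). If the probe yields no reading, the start is left unchanged.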
91cc03658d Multi-upload batch queue + Telegram notifications
Changes:

1. Frontend multi-upload:
   - File input now has 'multiple' attribute, drag-drop accepts multiple
   - File queue list with per-file artist/title preview + remove button
   - 'Pošlji vse' uploads sequentially (one at a time to avoid network saturation)
   - Each file gets same batch_id for Telegram batch summary
   - After upload, queue clears, jobs appear in right sidebar

2. Backend queue worker:
   - New _queue_worker() background thread processes 'queued' jobs sequentially
   - Only 1 job at a time to keep openclaw stable (avoid CPU/RAM thrash)
   - FIFO order by created_at
   - Auto-starts on app startup after job resume

3. Job submission flow change:
   - /api/process and /api/youtube no longer call background.add_task directly
   - Just mark status='queued', queue worker picks up
   - This means upload completes fast, processing happens in background
   - User can close browser, jobs continue

4. Telegram notifications (FOLX Alerts bot):
   - Per-job: 'Reel pripravljen: Lady Gaga - Abracadabra (29s, 30 MB)'
   - Per-job failed: 'Reel ni uspel: <name> + error message'
   - Batch summary: 'Batch končan: 10/10 reels pripravljeni' (only if >1 in batch)
   - Uses existing TELEGRAM_TOKEN + TELEGRAM_CHAT_ID env vars
   - app/telegram.py module with notify_job_done(), notify_job_failed(),
     notify_batch_complete()

5. batch_id field:
   - Added to Job model + StartJobIn pydantic
   - Saved during upload + process
   - Used to count batch progress and trigger summary notification

User experience:
- Drag 20 videos at once
- Click 'Pošlji'
- Close browser, go grab coffee
- Telegram sends 'Reel pripravljen' for each
- After all done: 'Batch končan: 20/20 reels pripravljeni' summary
- Open app to download all
2026-04-29 15:12:38 +00:00
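The queue rules in 91cc03658d (one job at a time, FIFO by created_at, summary only for batches with more than one file) can be sketched as follows. The Job dataclass here is a stand-in, not the app's actual Job model, and the status names other than 'queued' are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    id: str
    status: str              # assumed values: 'queued', 'running', 'done', 'failed'
    created_at: float        # Unix timestamp
    batch_id: Optional[str] = None


def pick_next_job(jobs: List[Job]) -> Optional[Job]:
    """Return the oldest queued job, or None while another job is running."""
    if any(j.status == "running" for j in jobs):
        return None  # only one job at a time
    queued = [j for j in jobs if j.status == "queued"]
    return min(queued, key=lambda j: j.created_at) if queued else None


def batch_summary(jobs: List[Job], batch_id: str) -> Optional[str]:
    """Build the summary text once every job of a multi-file batch has finished."""
    batch = [j for j in jobs if j.batch_id == batch_id]
    if len(batch) <= 1:
        return None  # single uploads get no batch summary
    if any(j.status not in ("done", "failed") for j in batch):
        return None  # batch still in progress
    done = sum(j.status == "done" for j in batch)
    return f"Batch končan: {done}/{len(batch)} reels pripravljeni"
```

A background thread would call pick_next_job() in a loop, process the returned job, then check batch_summary() for that job's batch_id.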
157e6b781e Fix 'Žena' word still cut: word-level start extension instead of segment-level
Previous fix used segment boundaries — required segments <3s for type 1
or <4s for type 2. But Žena was in a 4.3s segment ('saj še doma mi več
noč'jo verjet'. Žena me'), so the condition wasn't met and clip start
stayed at 77.7s, exactly at end of word 'Žena' (76.88-77.70s).

New approach: scan word-level timestamps directly:

1. If clip start falls MID-WORD → extend back to word start - 0.15s
2. If a word ends 0-0.5s BEFORE clip start AND next word is at clip start
   → that word is suspect (may be first word of chorus that Scribe put
   in previous segment), extend back to its start - 0.15s

Word-level timestamps are always available from Scribe (timestamps_granularity=word).
Falls back to segment-level for local Whisper without word timing.

This handles arbitrary segment lengths and is universal — works for any
language where the chorus starts on a word that the STT placed in the
previous segment.
2026-04-29 15:04:18 +00:00
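A minimal sketch of the two word-level rules from 157e6b781e, assuming words arrive as (start, end, text) tuples. The 0.15 s buffer and 0.5 s lookback come from the message. AT_START_TOLERANCE is an assumption, because the message does not define how close the next word has to be to count as "at clip start".

```python
from typing import List, Tuple

Word = Tuple[float, float, str]  # (start, end, text)

WORD_BUFFER = 0.15        # seconds kept before the word start
LOOKBACK = 0.5            # a word ending this close before the clip start is suspect
AT_START_TOLERANCE = 0.1  # assumed: how close the next word must be to the clip start


def extend_start(clip_start: float, words: List[Word]) -> float:
    """Move the clip start back so it never cuts the first chorus word."""
    ordered = sorted(words)
    for i, (w_start, w_end, _text) in enumerate(ordered):
        # Rule 1: the clip start falls inside a word.
        if w_start < clip_start < w_end:
            return max(0.0, w_start - WORD_BUFFER)
        # Rule 2: a word ends just before the clip start and the next word
        # begins at the clip start, so the earlier word may belong to the chorus.
        gap = clip_start - w_end
        if 0.0 <= gap <= LOOKBACK and i + 1 < len(ordered):
            next_start = ordered[i + 1][0]
            if abs(next_start - clip_start) <= AT_START_TOLERANCE:
                return max(0.0, w_start - WORD_BUFFER)
    return clip_start
```

With 'Žena' at 76.88-77.70s and a clip start of 77.7s, rule 2 fires and the start becomes 76.73s.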
a5097c5acc Fix first word being cut at clip start ('Žena' problem)
Real-world failure: 'Ansambel Saša Avsenika - ŽENA ME TEPE'
- Chorus starts with 'Žena me tepe' at 78.0s
- Scribe's segment boundary: word 'Žena' was end of previous segment (73.9-78.2s)
  while new segment 'tepe, mi prazni žepe' started at 78.3s
- Claude picked clip start = 78.3s (segment boundary)
- Fade-in 0.4s on vocal start = inaudible 'Že-'
- User hears: '...na me tepe' (cut)

Three-part fix:

1. PROMPT: instruct Claude to start clip ~0.3s BEFORE first chorus word
   (not exactly at it). Concrete example with timing math.

2. POST-LLM EXTENSION: scan corrected_segments for boundary cases:
   - If clip start falls MID-segment → extend back to segment start - 0.2s
   - If a previous segment ended within 0.5s of clip start → check if its
     last word might actually be the first chorus word, extend back to it
   - Uses word-level timestamps when available (Scribe provides these)

3. FADE-IN: was 0.4s when starting on vocal — too long, audibly cuts first
   word. Reduced to 0.05s (just click prevention, not audible). Still 0.2s
   for instrumental intros where fade is musically appropriate.

Now 'Žena' will be heard fully — clip starts at ~77.5-77.7s, word starts
at 78.0s, plenty of buffer.
2026-04-29 14:47:07 +00:00
1cc8e8be35 MXF/MPG broadcast format support: handle multichannel audio properly
Problem: MXF and MPG files (TV broadcast formats) often contain:
- Multiple audio streams (4-8 streams for different language tracks)
- Multichannel layouts (5.1, 7.1) instead of stereo
- Default ffmpeg behavior was -c:a aac with no channel limit, so
  multichannel input was transcoded to multichannel AAC instead of the
  clean stereo the output needs

Solution:

1. get_audio_streams() helper probes all audio streams with ffprobe
   - Returns codec, channels, sample_rate, language, layout for each

2. build_audio_args() picks best stream + downmix:
   - Prefers first 2-channel stereo stream (usually main mix)
   - Falls back to first stream if none are 2-ch
   - Always: -ac 2 (force stereo downmix), -ar 48000, -c:a aac, -b:a 192k
   - Bitrate raised from 128k to 192k for music quality

3. Smart trim path now detects broadcast formats:
   - .mxf, .mpg, .mpeg, .ts, .m2ts, .mts → transcode (not stream copy)
   - Standard MP4/MOV → stream copy (faster, lossless)

4. Pre-conversion step for broadcast files without trim:
   - Even without --start/--duration, MXF/MPG get converted to MP4
   - Same audio handling as trim path

5. Main render adds explicit -map 0:v:0 -map 0:a:0? -ac 2 to ensure
   only first video and first audio stream get encoded, with stereo

6. ACR recognize also gets -map 0:a:0 -ac 2 for MXF compatibility

7. UI accepts: video/*,.mxf,.mpg,.mpeg,.ts,.m2ts,.mts

8. Upload limit raised: 2GB → 10GB (MXF files are large)

This means a TV broadcast MXF with [SLO/EN/DE language tracks] now
correctly outputs stereo MP4 with the main language track preserved.
2026-04-29 14:38:48 +00:00
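The stream selection and format detection from 1cc8e8be35 might be sketched like this. The input shape (a list of dicts with a 'channels' key, one per audio stream in ffprobe order) is an assumption about what get_audio_streams() returns, and the -an fallback for files without audio is not mentioned in the commit.

```python
from pathlib import Path
from typing import Dict, List

BROADCAST_EXTS = {".mxf", ".mpg", ".mpeg", ".ts", ".m2ts", ".mts"}


def needs_transcode(filename: str) -> bool:
    """Broadcast containers are transcoded; standard MP4/MOV can be stream-copied."""
    return Path(filename).suffix.lower() in BROADCAST_EXTS


def build_audio_args(streams: List[Dict]) -> List[str]:
    """Pick the first 2-channel stream (else the first stream) and force stereo AAC."""
    if not streams:
        return ["-an"]  # assumption: no audio stream means no audio output
    index = next((i for i, s in enumerate(streams) if s.get("channels") == 2), 0)
    return ["-map", f"0:a:{index}", "-ac", "2", "-ar", "48000",
            "-c:a", "aac", "-b:a", "192k"]
```

For a file whose first stream is 5.1 and whose second is stereo, this yields -map 0:a:1 followed by the downmix and encoder options.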
b543057cee ACRCloud auto-recognition: never block uploads, fall back to fingerprinting
Changes:

1. UI: removed blocking prompt() that asked for artist+title on filename
   that didn't match 'Artist - Title' pattern. Upload always proceeds.
   Instead shows yellow warning saying 'server will try to recognize'.

2. Backend: added scripts/acr_recognize.py — extracts 20s audio sample
   from video (at 15s and 60s offsets for robustness), computes ACRCloud
   fingerprint via native binary (3KB payload), sends to identify API.

3. Pipeline: process_job() now runs ACR recognition step before analysis
   IF parsed_artist or parsed_title is missing. Result is saved to job
   metadata and used for download filename + Scribe/Claude filename hint.

4. Credentials: ACR_HOST + ACR_ACCESS_KEY + ACR_SECRET_KEY env vars
   added to Coolify (using existing keys from openclaw fb-agent metka).

5. requirements.txt: added pyacrcloud==1.0.11 for native fingerprinting.

This unblocks future automation/cron upload pipelines — files don't need
to be perfectly named, ACRCloud will identify them automatically.

Fallback chain:
1. Filename parsing (Artist - Title.mp4)
2. ACRCloud audio fingerprint (works even for '12345.mp4', 'IMG_001.mp4')
3. If both fail: download filename uses 'reel_<id>.mp4' (still works)
2026-04-29 14:24:53 +00:00
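The fallback chain in b543057cee can be illustrated with a small sketch. Both function names are hypothetical, and the ACRCloud result is assumed to be already reduced to an (artist, title) tuple or None; the actual fingerprinting call is not shown.

```python
import re
from typing import Optional, Tuple


def parse_filename(filename: str) -> Optional[Tuple[str, str]]:
    """Step 1: accept names that follow the 'Artist - Title.ext' pattern."""
    stem = re.sub(r"\.[A-Za-z0-9]+$", "", filename)  # drop the extension
    artist, sep, title = stem.partition(" - ")
    if sep and artist.strip() and title.strip():
        return artist.strip(), title.strip()
    return None


def resolve_download_name(filename: str,
                          acr_result: Optional[Tuple[str, str]],
                          job_id: str) -> str:
    """Steps 1-3: filename, then ACR result, then a generic reel_<id> name."""
    parsed = parse_filename(filename) or acr_result
    if parsed:
        artist, title = parsed
        return f"{artist} - {title}.mp4"
    return f"reel_{job_id}.mp4"
```

Because the generic name is always available, a failed recognition never blocks the upload.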
a30137f1f2 Strict 'chorus only' mode: respect include_prebuild in LLM prompt
Bug: 'Vključi pre-chorus' checkbox in UI was sent to backend but ignored
by Claude/Gemini analysis prompt. Both modes used same lenient rules
saying 'pre-chorus is optional' — Claude often included pre-chorus even
when user wanted just chorus.

Real-world failure: Lady Gaga 'Abracadabra' picked 54.7-84.6s, but actual
chorus 'Abracadabra, amor, ooh-na-na' starts at 85.2s. Claude included
the entire pre-chorus block ('Hold me in your heart tonight', 'Like a
poem said by a lady in red', 'With a haunting dance') and missed the
actual chorus completely.

Fix: include_prebuild parameter now flows all the way to the prompt:
- main.py → analyze.py CLI args → analyze_with_llm() → prompt builder
- Two distinct prompt rule sets:

  CHORUS ONLY (default, include_prebuild=False):
  - Strict: 'clip starts on FIRST WORD of chorus, never before'
  - Length: 12-25s typically
  - Explicit examples for pop songs (Abracadabra, Despacito, Shape of You)
  - List of common mistakes to avoid

  CHORUS + PRE-CHORUS (include_prebuild=True):
  - Optional pre-chorus before chorus, 4-10s
  - Length: 18-35s

This fixes the most common failure mode where Claude rationalizes
including verse/pre-chorus content even when user explicitly wants
just the chorus.
2026-04-29 14:03:40 +00:00
90cdad516b Universal chorus selection: chorus mandatory, pre-chorus only natural extension
User feedback: 'The CHORUS is mandatory, pre-chorus optional' + 'the system must be
stable for all languages, including Spanish and Romanian'.

Two changes:

1. Web search is now MANDATORY first step (was: optional fallback):
   - Even if Claude thinks it knows the song, must search lyrics first
   - Universal lyrics sources by language:
     SLO: besedila.com, lyricstranslate.com
     DE: songtexte.com
     HR/SR/BS: tekstovi.net
     ES: letras.com, musica.com
     RO: versuri.ro
     IT: angolotesti.it
     FR: paroles.net
     EN: genius.com, azlyrics.com
     Universal: lyricstranslate.com (any language)
   - Search strategy: artist+title first, then transcript snippet fallback
   - Without lyrics, Claude cannot reliably identify chorus boundaries

2. Simplified selection rules - chorus is THE priority:
   - Chorus (full first occurrence) = MANDATORY
   - Pre-chorus = ONLY if 1-2 verse lines tightly connected to chorus
   - In doubt: just take chorus alone (12-25s)
   - Outro fillers explicitly multi-language:
     SLO 'aj ja ja' / 'ej ej ej'
     EN 'yeah' / 'oh oh'
     ES 'ay ay ay'
     RO 'hei hei'
     JA 'la la la'
   - 12-35s total range (was 15-35s, now allows shorter chorus-only clips)

This makes the system language-agnostic: works the same way for Slovenian
narodno-zabavna, Spanish reggaeton, Romanian manele, German Schlager, etc.
The lyrics lookup is what makes it stable across languages.
2026-04-29 13:36:34 +00:00
4efd726176 Extend clip end past chorus to capture outro/sustained notes
Problem: Claude was cutting clip exactly at last transcribed word of chorus,
but in real songs:
- Singer holds last note 1-3s longer (still meaningful)
- Outro 'ej-ej-ej' / 'oh' / 'yeah' may not be transcribed as words
- Result felt like 'incomplete chorus' even though SRT was correct

Fix has two parts:

1. Prompt enhancement:
   - Ask Claude to add 1-2s padding AFTER last chorus word
   - Explicit example with timing math
   - Mention outro fillers (ej-ej-ej, oh, yeah)

2. Post-LLM extension logic:
   - After Claude returns clip range, scan corrected_segments for
     segments overlapping or starting just after current end
   - If next segment is within 1s pause and ends within max_duration+5s,
     extend clip to include it (with 0.3s breathing room)
   - Hard cap at max_duration + 5s to prevent unbounded extension

This ensures chorus naturally trails off rather than being cut mid-emotional-peak.
2026-04-29 13:12:28 +00:00
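One possible reading of the post-LLM end extension in 4efd726176, as a sketch. Segments are assumed to be (start, end) tuples. Repeating the rule across several consecutive segments is an interpretation; the message only says the next segment is included.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end)


def extend_end(clip_start: float, clip_end: float, segments: List[Segment],
               max_duration: float, max_pause: float = 1.0,
               breathing_room: float = 0.3, overflow: float = 5.0) -> float:
    """Extend the clip end over segments that follow within a short pause."""
    hard_cap = clip_start + max_duration + overflow
    new_end = clip_end
    for seg_start, seg_end in sorted(segments):
        if seg_end <= new_end:
            continue  # segment already lies inside the clip
        if seg_start - new_end > max_pause:
            break     # pause too long, stop extending
        if seg_end > hard_cap:
            break     # including it would break the hard cap
        new_end = seg_end
    if new_end > clip_end:
        new_end = min(new_end + breathing_room, hard_cap)
    return new_end
```

With a clip of 60-80s, max_duration=35 and segments ending at 82.0s and 84.0s that follow closely, the end moves to 84.3s. A segment starting at 90s is ignored.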
81bae81401 Fix Scribe stopping mid-song: enable tag_audio_events=true + filter events out
ROOT CAUSE FOUND: tag_audio_events=false caused Scribe to stop transcribing
when instrumental music dominates (polka harmonica taking over from vocals).

Real-world test on Avseniki - Ena bolha za pomoč (186s polka):
- tag_audio_events=false: 20% coverage (37s only) — fails
- tag_audio_events=true:  100% coverage (186s full) — works

When tag_audio_events=true, Scribe inserts placeholder markers like
'(glasba)' / '(plesalna glasba)' for instrumental sections instead of
giving up. We then filter these out so they don't appear in subtitles.

Filtering logic:
- Skip word.type != 'word' (audio_event types)
- Skip parenthesized text legacy fallback like '(music)', '(applause)'

This is the core fix — no longer reliant on filename for transcription
completeness. Even untitled files like '12345.mp4' now get full coverage.
2026-04-29 13:04:19 +00:00
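The filtering described in 81bae81401 is small enough to sketch directly. The word dicts are assumed to carry 'text' and 'type' keys, as the message implies, and the function name is hypothetical.

```python
import re
from typing import Dict, List

# Legacy fallback: text that is entirely wrapped in parentheses, e.g. '(music)'.
PAREN_EVENT = re.compile(r"^\s*\(.*\)\s*$")


def keep_spoken_words(words: List[Dict]) -> List[Dict]:
    """Drop audio-event markers so they never reach the subtitles."""
    kept = []
    for word in words:
        if word.get("type", "word") != "word":
            continue  # audio_event and any other non-word entries
        if PAREN_EVENT.match(word.get("text", "")):
            continue  # parenthesized marker labelled as a word
        kept.append(word)
    return kept
```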
7d00730051 Auto-detect language from filename for Scribe (no manual UI selection needed)
Problem: Scribe was failing on Slovenian narodno-zabavna songs (Avseniki,
Modrijani) because:
- User doesn't manually pick language (everything is auto)
- Scribe auto-detect had low confidence (0.58) on harmonika-heavy polka
- Result: only 37s transcribed instead of full 186s song

Solution: detect_language_from_filename() function:
- Recognizes 60+ Slovenian artists (Avseniki, Modrijani, Veseli Dolenjci, ...)
- Recognizes 30+ German artists (Ben Zucker, Helene Fischer, ...)
- Recognizes 20+ Croatian/Serbian artists (Thompson, Severina, Lepa Brena, ...)
- Falls back to keyword matching (volim, liebe, srce, herz, ...)
- Detects character set (č/ž/š → SL, ä/ö/ü/ß → DE, đ → HR)
- Score-based: 5pts for artist match, 1-2pts for keywords/chars

When detected, sends language_code to Scribe explicitly:
- Avseniki → 'slv' lock → no more half-transcribed songs
- Ben Zucker → 'deu' lock → consistent German transcription
- User still doesn't need to manually pick anything

filename_hint flows: main.py → analyze.py CLI → transcribe_full → Scribe
2026-04-29 12:57:19 +00:00
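The score-based detection in 7d00730051 could work along these lines. This is a reduced sketch: the tables hold only a few of the names listed above, the function name guess_language is hypothetical, and splitting "1-2pts" into 2 for keywords and 1 for characters is an assumption.

```python
import re
from typing import Dict, Optional

ARTISTS = {"avseniki": "slv", "modrijani": "slv", "veseli dolenjci": "slv",
           "ben zucker": "deu", "helene fischer": "deu",
           "thompson": "hrv", "severina": "hrv"}
KEYWORDS = {"liebe": "deu", "herz": "deu", "volim": "hrv"}
CHAR_HINTS = {"slv": "čžš", "deu": "äöüß", "hrv": "đ"}


def guess_language(filename: str) -> Optional[str]:
    """Score each language from artist, keyword and character hints."""
    name = filename.lower()
    tokens = set(re.findall(r"\w+", name))
    scores: Dict[str, int] = {}
    for artist, lang in ARTISTS.items():
        if artist in name:
            scores[lang] = scores.get(lang, 0) + 5  # artist match
    for keyword, lang in KEYWORDS.items():
        if keyword in tokens:
            scores[lang] = scores.get(lang, 0) + 2  # keyword match (assumed weight)
    for lang, chars in CHAR_HINTS.items():
        if any(ch in name for ch in chars):
            scores[lang] = scores.get(lang, 0) + 1  # character hint (assumed weight)
    return max(scores, key=scores.get) if scores else None
```

A None result means no hint was found, in which case Scribe's own auto-detection would presumably still apply.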
40acad26f3 Crystal-clear chorus selection rules: pre-chorus build-up + FIRST chorus
Previous rules were ambiguous and Claude was sometimes picking:
- Just the chorus (no build-up)
- Second chorus instance (lower energy than first)
- Random verse + later chorus combinations

New explicit priority order:
1. PRIMARY: pre-chorus verse (build-up) + first chorus (~20-35s total)
2. FALLBACK: just first chorus alone
3. LAST RESORT: dramatic peak section

Strict rules:
- ALWAYS first chorus (highest energy/recognition)
- NEVER second/third chorus instances
- NEVER skip between verses
- NEVER extend over 35 seconds
- Concrete example given: chorus@32s,16s long → pick 20-48s

This fixes Veseli Dolenjci picking second chorus + post-chorus verse
instead of natural pre-chorus build-up + first chorus.
2026-04-29 12:42:54 +00:00
5f90085981 Add Claude web_search tool for lyrics lookup + tighter subtitle timing
1. Claude API web_search tool integration:
   - Claude can now search web for actual lyrics when STT text is wrong
   - Especially useful for SLO/HR/BS/SR songs (Modrijani, Veseli Dolenjci)
     where Claude doesn't know lyrics from training data
   - Agentic loop: tool_use → server-side search → continuation → final text
   - Max 3 searches per job ($0.03 cost limit)
   - Hint sources: besedila.com, lyricstranslate.com, tekstovi.net, songtexte.com

2. Tighter subtitle segmentation from Scribe word timestamps:
   - Phrase boundaries on shorter pauses (0.4s vs 0.6s)
   - Sentence-ending punctuation triggers segment break
   - Max segment 4s (was 6s) for natural readable subtitles
   - Hard cap at 5.5s to prevent very long lines

This fixes 'ples to noč' → 'ples pojoč' for Modrijani songs that
Scribe transcribed phonetically wrong but Claude can fix via web lookup.
2026-04-29 12:24:17 +00:00
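The segmentation rules in 5f90085981 (0.4s pause, sentence-ending punctuation, 4s maximum, 5.5s hard cap) leave room for interpretation. The sketch below treats 4s as a soft limit checked after each word and 5.5s as a limit checked before adding a word; both readings are assumptions.

```python
from typing import List, Tuple

Word = Tuple[float, float, str]     # (start, end, text)
Segment = Tuple[float, float, str]  # (start, end, joined text)


def segment_words(words: List[Word], pause: float = 0.4,
                  soft_max: float = 4.0, hard_cap: float = 5.5) -> List[Segment]:
    """Group word timestamps into subtitle segments."""
    segments: List[Segment] = []
    current: List[Word] = []

    def flush() -> None:
        if current:
            text = " ".join(w[2] for w in current)
            segments.append((current[0][0], current[-1][1], text))
            current.clear()

    for i, word in enumerate(words):
        start, end, text = word
        # Hard cap: close the segment before a word that would overrun it.
        if current and end - current[0][0] > hard_cap:
            flush()
        current.append(word)
        next_start = words[i + 1][0] if i + 1 < len(words) else None
        long_pause = next_start is not None and next_start - end >= pause
        sentence_end = text.rstrip().endswith((".", "!", "?"))
        too_long = end - current[0][0] >= soft_max
        if long_pause or sentence_end or too_long:
            flush()
    flush()  # emit whatever is left
    return segments
```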
68247bb84c Integrate ElevenLabs Scribe (best multilingual STT 2026)
ElevenLabs Scribe replaces local Whisper as default transcription:
- 96.7% accuracy English, 2.4% WER Indonesian (vs Whisper 7.7%)
- 18x faster (200s song = 11s vs 3-5 min on CPU)
- No hallucinations on songs (Whisper invented 'Pony und Kleid' for 'Bonnie und Clyde')
- 99 languages supported, including SLO/HR/BS/SR
- $0.40/h pricing, ~$0.022 per 200s song

Implementation:
- transcribe_with_elevenlabs() function uses Scribe v1
- ISO 639-1 ↔ 639-3 mapping (Scribe needs 'deu' not 'de')
- Word-level timestamps converted to pseudo-segments (close on 0.6s pause or 6s duration)
- 24MB upload limit guard with auto-fallback to local

Default whisper_provider='auto':
- If ELEVENLABS_API_KEY set → use Scribe
- Otherwise → fallback to local faster-whisper
- 'elevenlabs' strict mode: no fallback
- 'local' strict mode: skip Scribe entirely

Tested on Ben Zucker - Ohne dich: Scribe correctly transcribed
'Wir sind Bonnie und Clyde, zu allem bereit' where local Whisper hallucinated.
2026-04-29 12:03:40 +00:00
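The provider selection and language-code mapping from 68247bb84c might be sketched as below. The function names are hypothetical, and raising an error in strict 'elevenlabs' mode without an API key is an assumption; the message only says that mode has no fallback.

```python
import os
from typing import Mapping, Optional

# ISO 639-1 to ISO 639-3 for the languages named in this log.
ISO_639_1_TO_3 = {"de": "deu", "sl": "slv", "hr": "hrv", "bs": "bos", "sr": "srp",
                  "en": "eng", "it": "ita", "es": "spa", "fr": "fra"}


def to_scribe_language(code: Optional[str]) -> Optional[str]:
    """Map a 2-letter code to the 3-letter form; pass other values through."""
    if not code:
        return None
    code = code.lower()
    return ISO_639_1_TO_3.get(code, code)


def choose_provider(whisper_provider: str,
                    env: Optional[Mapping[str, str]] = None) -> str:
    """Resolve 'auto' / 'elevenlabs' / 'local' to the provider actually used."""
    env = os.environ if env is None else env
    has_key = bool(env.get("ELEVENLABS_API_KEY"))
    if whisper_provider == "local":
        return "local"
    if whisper_provider == "elevenlabs":
        if not has_key:
            raise RuntimeError("strict 'elevenlabs' mode needs ELEVENLABS_API_KEY")
        return "elevenlabs"
    if whisper_provider == "auto":
        return "elevenlabs" if has_key else "local"
    raise ValueError(f"unknown whisper_provider: {whisper_provider!r}")
```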
3ffa9740f0 Revert "Add Groq Whisper API integration (200x faster than local CPU)"
This reverts commit 5c53a27d33.
2026-04-29 11:19:31 +00:00
6a8f87b4a2 Revert "Filler detection: trim clip before la-la-la / instrumental medbridge"
This reverts commit 4488717f6f.
2026-04-29 11:19:31 +00:00
4488717f6f Filler detection: trim clip before la-la-la / instrumental medbridge
Problem: When a song has chorus → la-la-la medbridge → chorus structure,
Claude was including the whole 40s+ block, with 18 seconds of la-la-la
making the reel feel artificially extended.

Fix:
1. Prompt enhancement: explicitly tell Claude NEVER to include
   la-la-la / ooh ooh / yeah yeah / instrumental fillers
2. Post-LLM detection: scan corrected_segments for repetitive content
   (>70% repeated words) and trim clip before that segment
3. Max duration guidance reduced from 45s → 35s in prompt

This means: clip will end at the first chorus, not extend through fillers.
2026-04-29 11:17:16 +00:00
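For reference, the post-LLM filler check from 4488717f6f (later reverted by 6a8f87b4a2) can be sketched as follows. How ">70% repeated words" was measured is not stated; here it is assumed to mean the share of tokens whose word occurs more than once in the segment.

```python
import re
from collections import Counter
from typing import List, Tuple

Segment = Tuple[float, float, str]  # (start, end, text)


def repeated_ratio(text: str) -> float:
    """Share of tokens that belong to a word used more than once."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(tokens)


def trim_before_filler(clip_start: float, clip_end: float,
                       segments: List[Segment], threshold: float = 0.7) -> float:
    """Return a new clip end placed at the first filler segment inside the clip."""
    for start, _end, text in sorted(segments):
        if start >= clip_end:
            break
        if start > clip_start and repeated_ratio(text) > threshold:
            return start
    return clip_end
```

A chorus built on one repeated phrase would also exceed the threshold under this measure.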
5c53a27d33 Add Groq Whisper API integration (200x faster than local CPU)
Pipeline:
- New transcribe_with_groq() function uses Groq's whisper-large-v3-turbo
- 30s audio transcribed in ~0.5s (vs 30s+ on CPU)
- Same quality as local Whisper (it's the same OpenAI model)
- Cloudflare bypass via custom User-Agent header
- 24MB upload limit guard with auto-fallback to local
- Language auto-detect works (Groq returns full lang name, mapped to ISO codes)

Default whisper_provider='auto':
- If GROQ_API_KEY is set → use Groq (200x faster)
- Otherwise → fallback to local faster-whisper
- Strict 'groq' mode: no fallback (returns empty if Groq fails)
- Strict 'local' mode: skip Groq entirely

CLI: --whisper-provider {auto,groq,local}
API: whisper_provider field in StartJobIn

Cost: $0.04/h with whisper-large-v3-turbo ($0.002 per 200s song)
2026-04-29 11:08:15 +00:00
60765ad84c Anti-hallucination: filename hint to LLM + beam search + silence threshold
When Whisper hallucinates (generates fake lyrics not matching the audio),
LLM can now use the original filename as a hint to recognize the song
and override the false transcript with the actual lyrics.

Pipeline:
1. Pass filename (e.g. 'Ben Zucker - Bonnie und Clyde') as hint
2. Whisper transcribes (may hallucinate)
3. Claude/Gemini reads filename + transcript:
   - Recognizes song from filename hint
   - Compares Whisper output to known lyrics
   - Replaces hallucinated text with real lyrics (preserves timestamps)
   - If can't fix, removes segment (better silent than wrong)

Also added Whisper anti-hallucination params:
- beam_size=5 (more careful decoding vs greedy)
- hallucination_silence_threshold=2.0 (skip text in long silences)
2026-04-29 10:48:55 +00:00
OpenClaw Agent
0ca33be6ac Fix: clip_range source dynamic from LLM result instead of hardcoded 'claude'
Diagnosis:
- analyze.py historically supported only Claude
- when Gemini was added, clip_range.source stayed hardcoded as 'claude'
- the same applies to the log messages 'Whisper segmenti zamenjani s Claude' and 'Generated SRT from Claude'
- the API result in the job showed source='claude' even when Gemini was actually used
- this is only a COSMETIC bug; functionally everything worked correctly
- Gemini WAS actually called (confirmed: '🤖 Gemini (gemini-3.1-pro-preview) izbral: 172.5-201.8s')
  and returned a correct result; only the logging reported the wrong provider

Fixes:
1. clip_range['source'] = claude_result['source'] (the actual 'gemini:...' or 'claude:...')
2. clip_range['reason'] prefix changed from hardcoded 'claude_llm:' to dynamic '{source}:'
3. Log 'Whisper segmenti zamenjani s Claude' → 'z {llm_label}'
4. Log 'Claude je popravil jezik' → 'LLM je popravil'
5. main.py 'Generated SRT from Claude' → 'from {llm_src}'

Test (Zlati Muzikanti - Le prijatelja bodiva, waltz, 246s):
✓ Gemini actually picks the chorus (172.5-201.8s)
✓ Whisper detects sl (p=0.97 across 3 samples)
✓ All 18 segments corrected
✓ Pipeline works end-to-end

Backward compat:
- transcript['claude_corrected'] and the srt_from_claude variable name are kept
  because they already exist in old job state files
2026-04-29 09:49:58 +00:00
OpenClaw Agent
e350352883 Fix: Gemini 3.1 Pro thinking model needs 32k maxOutputTokens (was 4096 → MAX_TOKENS truncation)
Diagnosis:
- Gemini 3.x Pro is a thinking model (it has internal reasoning, thoughtsTokenCount)
- For large transcripts (songs with 60+ segments):
  * thoughts ~ 1500-3000 tokens
  * output JSON with corrected_segments ~ 3000-7000 tokens
  * total ~ 4500-10000 tokens
- With maxOutputTokens=4096 the response was cut off (finishReason: MAX_TOKENS),
  the JSON was truncated halfway, and _parse_llm_response threw json.JSONDecodeError
- Result: 'Gemini vrnil prazen string' in the logs

Fixes:
1. Gemini maxOutputTokens 4096 → 32768 (enough for thinking + long JSON)
2. Diagnostics for finishReason==MAX_TOKENS and usage tokens in the logs
3. Detection of empty text (not just an empty parts array)
4. Claude max_tokens 4096 → 8192 (headroom for long songs)
5. Claude detection of stop_reason==max_tokens

Test (60 segments, 5631 char prompt):
- 4096 → finishReason=MAX_TOKENS, thoughts=2594, output=1488, JSON truncated
- 16384 → finishReason=STOP, thoughts=1445, output=3040, JSON complete
- 32768 → safe default
2026-04-29 09:03:53 +00:00
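The truncation diagnostics from e350352883 could be sketched as follows, operating on the already-decoded JSON body of a Gemini generateContent response. The function name is hypothetical, and raising an exception (rather than logging and returning) is an assumption.

```python
from typing import Dict, Optional


def extract_gemini_text(response: Dict) -> Optional[str]:
    """Return the candidate text, or None if it is empty; fail loudly on truncation."""
    candidates = response.get("candidates") or []
    if not candidates:
        return None
    candidate = candidates[0]
    usage = response.get("usageMetadata", {})
    if candidate.get("finishReason") == "MAX_TOKENS":
        raise RuntimeError(
            "Gemini output truncated "
            f"(thoughts={usage.get('thoughtsTokenCount')}, "
            f"output={usage.get('candidatesTokenCount')}); raise maxOutputTokens"
        )
    parts = candidate.get("content", {}).get("parts", [])
    text = "".join(p.get("text", "") for p in parts).strip()
    return text or None  # empty text counts as no answer, not only empty parts
```

Checking finishReason before parsing keeps a truncated JSON body from reaching the JSON decoder and surfacing as a misleading decode error.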
ec71c54570 Upgrade to Sonnet 4.6 + add Gemini 3.1 Pro support
- Refactored analyze_with_claude into shared _build_analysis_prompt + _parse_llm_response helpers
- New analyze_with_gemini() using Gemini 3.1 Pro ($2/M in, MMMLU 92.6% — best multilingual)
- Unified analyze_with_llm(provider) dispatcher with auto-fallback (Claude → Gemini)
- API endpoint accepts llm_provider in StartJobIn (claude/gemini/auto)
- Frontend dropdown to pick LLM
- Default model is now Sonnet 4.6 (was Haiku 4.5) — 3x quality at 3x price (~3 cents/video)
- Gemini support is opt-in: needs GEMINI_API_KEY env var to activate
2026-04-29 08:26:27 +00:00
9faa224885 Upgrade Claude model: Haiku 4.5 → Sonnet 4.6 for better Slavic language transcript correction 2026-04-29 08:22:10 +00:00
69fb2f5ce8 Upgrade default Whisper model: small/medium → large-v3 for much better Slovenian/Slavic transcription accuracy 2026-04-29 08:20:18 +00:00
4bc5ac6756 Major: Claude post-processing of Whisper transcript
- Claude now corrects transcription errors (Slavic languages, dialects, mixed langs)
- Returns corrected_segments with same timestamps but cleaner text
- Pipeline generates SRT from Claude-corrected transcript and passes to subtitle.py via --srt
- subtitle.py supports --srt to skip Whisper re-transcription on the trimmed clip
- clip.py propagates --srt through to subtitle.py
- Whisper still runs once (in analyze.py); subtitle.py reuses corrected output instead of re-running
- This means: Whisper's mistakes (mixed langs, hallucinations, wrong words) are fixed by Claude before becoming visible subtitles
2026-04-29 08:13:33 +00:00
af3c933c78 Robust language detection + anti-hallucination
- 3-sample voting for auto-detect (start/middle/end of song) prevents lang switching mid-song
- Lock detected language for full transcription
- Anti-hallucination: condition_on_previous_text=False, temperature=0.0
- compression_ratio_threshold=2.4 (rejects repetitive hallucinations)
- log_prob_threshold=-1.0 (rejects low-confidence segments)
- no_speech_threshold=0.6 (more aggressive silence detection)
- Default Whisper model changed: small → medium (better for all langs incl. Slavic)
2026-04-29 07:59:20 +00:00
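The 3-sample voting and the decoding settings from af3c933c78 can be sketched like this. The tie-break on summed probability is an assumption, and the dict simply collects the listed values under the keyword names used by faster-whisper's transcribe(); passing them that way is likewise an assumption about how the pipeline is wired.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Decoding settings listed in the commit, gathered as keyword arguments.
ANTI_HALLUCINATION_KWARGS = {
    "condition_on_previous_text": False,
    "temperature": 0.0,
    "compression_ratio_threshold": 2.4,
    "log_prob_threshold": -1.0,
    "no_speech_threshold": 0.6,
}


def vote_language(samples: List[Tuple[str, float]]) -> Optional[str]:
    """Pick the language detected most often; break ties by summed probability."""
    if not samples:
        return None
    votes = Counter(lang for lang, _prob in samples)
    confidence: Dict[str, float] = {}
    for lang, prob in samples:
        confidence[lang] = confidence.get(lang, 0.0) + prob
    return max(votes, key=lambda lang: (votes[lang], confidence[lang]))
```

The winning code would then be passed as the fixed language for the full transcription, so the language cannot switch mid-song.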
c870d80726 Fix: extend clip if ends mid-vocal (no chorus cut-off), DejaVu Sans font (supports SLO/HR/BS chars), auto-upgrade to medium Whisper model for Slavic languages 2026-04-29 07:35:00 +00:00
5d5e169f9d Disable Whisper VAD filter — was dropping vocal segments in songs creating gaps in subtitles 2026-04-29 07:07:29 +00:00
a04811bdc9 Add Claude LLM analysis: sends full transcript to Claude API for true song structure understanding (refrain detection across all repetitions, not just local heuristic) 2026-04-29 06:55:41 +00:00
e072eec362 Fix: handle Whisper transcribe failure for instrumental-only audio (fallback to empty transcript) 2026-04-29 06:33:52 +00:00
33a138af9e Fix: force native Python bool/float for JSON serialization (numpy types) 2026-04-29 06:23:41 +00:00
8512076b91 Major: smart selection pipeline (analyze.py) + audio fade + multi-lang auto-detect
- New analyze.py: full transcript + energy + structural analysis
- Smart clip range: includes pre-chorus, can exceed 30s up to max_duration (default 45s)
- Audio fade in/out: auto-detected from vocal boundaries
- Instrumental detection: auto-disables subs if vocals < 10% of duration
- Multi-language: auto-detect via Whisper or explicit (DE/SL/HR/BS/SR/EN/IT/ES/FR)
- Frontend: cleaner UX, added bs language, smart selection description
- reframe.py: --fade-in --fade-out args
- clip.py: propagates fade params
- app/main.py: replaces find_chorus.py call with analyze.py
2026-04-29 06:21:35 +00:00
81edd24ca3 Subtitles: smaller font 56px (was 84), higher position MarginV=400, side margins 80px for safe zone 2026-04-29 06:09:26 +00:00
ba787744a6 Subtitles: cap chunk duration at 2.5s, split long lines into multiple time slices for faster reels pacing 2026-04-29 05:59:36 +00:00
e001387a89 Subtitles: convert SRT to ASS directly with PlayResY=1920 for predictable scaling instead of unreliable force_style 2026-04-28 18:09:53 +00:00
28d933c916 Subtitles: UPPERCASE + position lower (MarginV=320 for 1080x1920) + bigger font 2026-04-28 17:40:48 +00:00
15ef4888a1 Debug: log exact clip.py cmd in job + clip.py logs run_clip args 2026-04-28 17:28:10 +00:00
bc3fe1f9d4 Add explicit FFmpeg trim command logging + duration verification 2026-04-28 17:17:11 +00:00
8eaef029e2 Find chorus: weight repetitive short phrases (like 'Ohne dich x5') as strong chorus signal 2026-04-28 16:57:45 +00:00
c17578521a Fix find_chorus: RMS energy parser was broken (no pts_time available), now synthesizes timestamps; energy weight x10 (the chorus is louder) 2026-04-28 16:55:51 +00:00
64e8854cea Track mode: more sensitive face detection + longer smoothing window 2026-04-28 16:45:13 +00:00
400f6dbb6d Fix: limit FFmpeg crop expression to 20 sample points (was overflowing 4KB limit) 2026-04-28 16:32:26 +00:00
2e337ff079 Fix: shutil import was inside finally block, causing NameError when shutil.move was called 2026-04-28 16:22:39 +00:00
6e2a13d8a3 Fix cross-device link error: use shutil.move instead of os.replace 2026-04-28 16:15:20 +00:00
47509b4f06 Add cookies support to yt_download.py for YouTube bot detection bypass 2026-04-28 15:47:59 +00:00
30b969e4b8 Initial: reels clipper app
- FastAPI backend (auth, jobs, SSE, download)
- Frontend: drag&drop + YouTube URL + jobs panel
- Pipeline: yt_download → find_chorus → reframe → subtitle
- Modes: track (face follow), center, blur
- Whisper for SI/DE/EN subtitles
- Auto-chorus detection via Whisper + RMS energy
- Docker + Coolify ready
2026-04-28 15:28:22 +00:00