144 Commits

Author SHA1 Message Date
68247bb84c Integrate ElevenLabs Scribe (best multilingual STT 2026)
ElevenLabs Scribe replaces local Whisper as default transcription:
- 96.7% accuracy English, 2.4% WER Indonesian (vs Whisper 7.7%)
- 18x faster (200s song = 11s vs 3-5 min on CPU)
- No hallucinations on songs (Whisper invented 'Pony und Kleid' for 'Bonnie und Clyde')
- 99 languages supported, including SLO/HR/BS/SR
- $0.40/h pricing, ~$0.022 per 200s song
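
The per-song cost and the speed-up quoted above follow from simple arithmetic. A minimal sketch, assuming a flat hourly price; the helper name is illustrative and not taken from the codebase:

```python
def cost_per_clip(price_per_hour: float, seconds: float) -> float:
    """Cost of transcribing `seconds` of audio at a flat hourly price."""
    return price_per_hour * seconds / 3600.0


# Figures from the commit message: $0.40/h, a 200 s song, 11 s processing time.
scribe_cost = cost_per_clip(0.40, 200)  # roughly 0.022 USD
speedup = 200 / 11                      # roughly 18x faster than realtime
```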

Implementation:
- transcribe_with_elevenlabs() function uses Scribe v1
- ISO 639-1 ↔ 639-3 mapping (Scribe needs 'deu' not 'de')
- Word-level timestamps converted to pseudo-segments (close on 0.6s pause or 6s duration)
- 24MB upload limit guard with auto-fallback to local
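
The language mapping and the pseudo-segment rule above could look roughly like this. This is a sketch under assumptions: the `Word` type, the function names, and the exact closing rule (close before a word that follows a pause of at least 0.6 s, or that would stretch the segment past 6 s) are illustrative and not the actual transcribe_with_elevenlabs() code.

```python
from dataclasses import dataclass

# ISO 639-1 -> ISO 639-3 for some of the languages mentioned in this log.
ISO_639_1_TO_3 = {"de": "deu", "en": "eng", "sl": "slv",
                  "hr": "hrv", "bs": "bos", "sr": "srp"}
ISO_639_3_TO_1 = {three: two for two, three in ISO_639_1_TO_3.items()}


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def _close(words: list) -> dict:
    """Collapse a run of words into one segment dict."""
    return {"start": words[0].start, "end": words[-1].end,
            "text": " ".join(w.text for w in words)}


def words_to_segments(words, max_pause=0.6, max_duration=6.0):
    """Group word-level timestamps into subtitle-sized pseudo-segments."""
    segments, current = [], []
    for word in words:
        if current:
            pause = word.start - current[-1].end
            duration = word.end - current[0].start
            # Close the running segment on a long pause or when it would get too long.
            if pause >= max_pause or duration > max_duration:
                segments.append(_close(current))
                current = []
        current.append(word)
    if current:
        segments.append(_close(current))
    return segments
```

For example, `words_to_segments([Word("Wir", 0.0, 0.3), Word("sind", 0.4, 0.7), Word("Bonnie", 1.5, 2.0)])` yields two segments, because the 0.8 s gap before "Bonnie" exceeds the pause limit.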

Default whisper_provider='auto':
- If ELEVENLABS_API_KEY set → use Scribe
- Otherwise → fallback to local faster-whisper
- 'elevenlabs' strict mode: no fallback
- 'local' strict mode: skip Scribe entirely
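
The provider rules above can be read as a small decision function. A sketch, assuming the function name and the `(backend, allow_fallback)` return shape, neither of which comes from the codebase:

```python
import os


def resolve_provider(whisper_provider: str = "auto", env=None):
    """Return (backend, allow_fallback_to_local) for a whisper_provider value."""
    env = os.environ if env is None else env
    if whisper_provider == "auto":
        # Prefer Scribe when a key is configured; local stays available as a fallback.
        if env.get("ELEVENLABS_API_KEY"):
            return ("elevenlabs", True)
        return ("local", False)
    if whisper_provider == "elevenlabs":
        return ("elevenlabs", False)  # strict mode: no fallback
    if whisper_provider == "local":
        return ("local", False)       # strict mode: Scribe is skipped entirely
    raise ValueError(f"unknown whisper_provider: {whisper_provider!r}")
```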

Tested on Ben Zucker - Ohne dich: Scribe correctly transcribed
'Wir sind Bonnie und Clyde, zu allem bereit' where local Whisper hallucinated.
2026-04-29 12:03:40 +00:00
3ffa9740f0 Revert "Add Groq Whisper API integration (200x faster than local CPU)"
This reverts commit 5c53a27d33.
2026-04-29 11:19:31 +00:00
6a8f87b4a2 Revert "Filler detection: trim clip before la-la-la / instrumental bridge"
This reverts commit 4488717f6f.
2026-04-29 11:19:31 +00:00
4488717f6f Filler detection: trim clip before la-la-la / instrumental bridge
Problem: When a song has a chorus → la-la-la bridge → chorus structure,
Claude was including the whole 40s+ block, with 18 seconds of la-la-la
making the reel feel artificially extended.

Fix:
1. Prompt enhancement: explicitly tell Claude NEVER to include
   la-la-la / ooh ooh / yeah yeah / instrumental fillers
2. Post-LLM detection: scan corrected_segments for repetitive content
   (>70% repeated words) and trim clip before that segment
3. Max duration guidance reduced from 45s → 35s in prompt

This means: clip will end at the first chorus, not extend through fillers.
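
The post-LLM check in step 2 could be sketched as follows. The commit does not define "repeated words" precisely, so this sketch assumes it means the share of words in a segment that occur more than once within that segment; the function names and the segment dict shape are also assumptions.

```python
import re
from collections import Counter


def repeated_word_ratio(text: str) -> float:
    """Share of words in `text` that occur more than once in it."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    counts = Counter(words)
    return sum(1 for w in words if counts[w] > 1) / len(words)


def trim_before_filler(segments, clip_start, clip_end, threshold=0.7):
    """Cut the clip at the first filler-like segment that starts inside it.

    `segments` is assumed to be sorted by start time.
    """
    for seg in segments:
        if seg["start"] <= clip_start:
            continue
        if seg["start"] >= clip_end:
            break
        if repeated_word_ratio(seg["text"]) > threshold:
            return clip_start, seg["start"]
    return clip_start, clip_end
```

A chorus that is itself highly repetitive would also trip this check, so a real implementation would probably need an extra guard for that case.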
2026-04-29 11:17:16 +00:00
5c53a27d33 Add Groq Whisper API integration (200x faster than local CPU)
Pipeline:
- New transcribe_with_groq() function uses Groq's whisper-large-v3-turbo
- 30s audio transcribed in ~0.5s (vs 30s+ on CPU)
- Same quality as local Whisper (it's the same OpenAI model)
- Cloudflare bypass via custom User-Agent header
- 24MB upload limit guard with auto-fallback to local
- Language auto-detect works (Groq returns full lang name, mapped to ISO codes)

Default whisper_provider='auto':
- If GROQ_API_KEY is set → use Groq (200x faster)
- Otherwise → fallback to local faster-whisper
- Strict 'groq' mode: no fallback (returns empty if Groq fails)
- Strict 'local' mode: skip Groq entirely

CLI: --whisper-provider {auto,groq,local}
API: whisper_provider field in StartJobIn

Cost: $0.04/h with whisper-large-v3-turbo ($0.002 per 200s song)
2026-04-29 11:08:15 +00:00
60765ad84c Anti-hallucination: filename hint to LLM + beam search + silence threshold
When Whisper hallucinates (generates fake lyrics not matching the audio),
LLM can now use the original filename as a hint to recognize the song
and override the false transcript with the actual lyrics.

Pipeline:
1. Pass filename (e.g. 'Ben Zucker - Bonnie und Clyde') as hint
2. Whisper transcribes (may hallucinate)
3. Claude/Gemini reads filename + transcript:
   - Recognizes song from filename hint
   - Compares Whisper output to known lyrics
   - Replaces hallucinated text with real lyrics (preserves timestamps)
   - If can't fix, removes segment (better silent than wrong)

Also added Whisper anti-hallucination params:
- beam_size=5 (more careful decoding vs greedy)
- hallucination_silence_threshold=2.0 (skip text in long silences)
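
With faster-whisper, the two decoding parameters above would be passed to `WhisperModel.transcribe`. A sketch only: `word_timestamps=True` is an assumption added here because the silence threshold relies on word timings, and the model size, device, and compute type are placeholders rather than the project's settings.

```python
# Decoding options named in the commit message, plus the word_timestamps assumption.
ANTI_HALLUCINATION_KWARGS = {
    "beam_size": 5,
    "word_timestamps": True,
    "hallucination_silence_threshold": 2.0,
}


def transcribe(audio_path: str, model_size: str = "large-v3") -> list:
    """Transcribe a file with faster-whisper using the options above."""
    from faster_whisper import WhisperModel  # imported lazily: heavy dependency

    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(audio_path, **ANTI_HALLUCINATION_KWARGS)
    return [{"start": s.start, "end": s.end, "text": s.text.strip()}
            for s in segments]
```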
2026-04-29 10:48:55 +00:00
05fb0081c6 Fix preview cutoff + sticky left panel
1. Preview endpoint now supports HTTP Range requests (HTTP 206 Partial)
   - HTML5 video player needs Range support to seek/buffer properly
   - Without it, video would cut off after a few seconds
   - Returns chunks of 64KB on demand

2. Left panel (upload form) is now sticky (position: sticky)
   - Stays in view while right panel (jobs list) scrolls
   - On mobile (<800px) reverts to normal flow
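
The Range handling described in item 1 could be sketched in a framework-independent way as below. The helper names are hypothetical; only single `bytes=` ranges are handled, and the FastAPI endpoint wiring is left out.

```python
import re

CHUNK_SIZE = 64 * 1024  # 64 KB, the chunk size named in the commit message


def parse_range(header: str, file_size: int):
    """Return inclusive (start, end) byte offsets for a single-range header."""
    match = re.fullmatch(r"bytes=(\d*)-(\d*)", header.strip())
    if match is None:
        raise ValueError(f"unsupported Range header: {header!r}")
    first, last = match.groups()
    if not first and not last:
        raise ValueError("empty range")
    if not first:
        # Suffix form "bytes=-N": the last N bytes of the file.
        start, end = max(file_size - int(last), 0), file_size - 1
    else:
        start = int(first)
        end = min(int(last), file_size - 1) if last else file_size - 1
    if start > end or start >= file_size:
        raise ValueError("range not satisfiable")  # caller should answer 416
    return start, end


def partial_headers(start: int, end: int, file_size: int) -> dict:
    """Headers for a 206 Partial Content response."""
    return {
        "Content-Range": f"bytes {start}-{end}/{file_size}",
        "Content-Length": str(end - start + 1),
        "Accept-Ranges": "bytes",
    }


def iter_chunks(path: str, start: int, end: int):
    """Yield the requested byte span in CHUNK_SIZE pieces."""
    remaining = end - start + 1
    with open(path, "rb") as handle:
        handle.seek(start)
        while remaining > 0:
            data = handle.read(min(CHUNK_SIZE, remaining))
            if not data:
                break
            remaining -= len(data)
            yield data
```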
2026-04-29 10:24:32 +00:00
OpenClaw Agent
0ca33be6ac Fix: clip_range source dynamic from LLM result instead of hardcoded 'claude'
Diagnosis:
- analyze.py historically supported only Claude
- when Gemini was added, clip_range.source stayed hardcoded to 'claude'
- the same applied to the log messages 'Whisper segmenti zamenjani s Claude' and 'Generated SRT from Claude'
- the API result in the job showed source='claude' even when Gemini was actually used
- this is a purely COSMETIC bug; functionally everything worked correctly
- Gemini WAS actually called (confirmed: '🤖 Gemini (gemini-3.1-pro-preview) izbral: 172.5-201.8s')
  and returned the correct result; only the logging reported the wrong source

Fixes:
1. clip_range['source'] = claude_result['source'] (the actual value, 'gemini:...' or 'claude:...')
2. clip_range['reason'] prefix changed from hardcoded 'claude_llm:' to dynamic '{source}:'
3. Log 'Whisper segmenti zamenjani s Claude' → 'z {llm_label}'
4. Log 'Claude je popravil jezik' → 'LLM je popravil'
5. main.py 'Generated SRT from Claude' → 'from {llm_src}'

Test (Zlati Muzikanti - Le prijatelja bodiva, waltz, 246s):
✓ Gemini actually picks the chorus (172.5-201.8s)
✓ Whisper detects sl (p=0.97 across 3 samples)
✓ All 18 segments corrected
✓ Pipeline works end-to-end

Backward compat:
- transcript['claude_corrected'] and the srt_from_claude variable name are kept
  because they already exist in old job state files
2026-04-29 09:49:58 +00:00
OpenClaw Agent
e350352883 Fix: Gemini 3.1 Pro thinking model needs 32k maxOutputTokens (was 4096 → MAX_TOKENS truncation)
Diagnosis:
- Gemini 3.x Pro is a thinking model (it has internal reasoning, thoughtsTokenCount)
- For large transcripts (songs with 60+ segments):
  * thoughts ~ 1500-3000 tokens
  * output JSON with corrected_segments ~ 3000-7000 tokens
  * total ~ 4500-10000 tokens
- With maxOutputTokens=4096 the response was truncated (finishReason: MAX_TOKENS),
  the JSON was cut off midway, and _parse_llm_response threw json.JSONDecodeError
- Result: 'Gemini vrnil prazen string' in the logs

Fixes:
1. Gemini maxOutputTokens 4096 → 32768 (enough for thinking + a long JSON)
2. Diagnostics for finishReason==MAX_TOKENS and usage tokens in the logs
3. Detection of empty text (not only of an empty parts array)
4. Claude max_tokens 4096 → 8192 (headroom for long songs)
5. Claude detection of stop_reason==max_tokens

Test (60 segments, 5631-char prompt):
- 4096 → finishReason=MAX_TOKENS, thoughts=2594, output=1488, JSON truncated
- 16384 → finishReason=STOP, thoughts=1445, output=3040, JSON complete
- 32768 → safe default
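
The truncation and empty-text checks from this commit could be sketched against the JSON shape of a Gemini generateContent response. The function name and error messages are assumptions, not the project's actual code:

```python
def extract_gemini_text(response: dict) -> str:
    """Return the response text, raising on truncated or empty output."""
    candidates = response.get("candidates") or []
    if not candidates:
        raise RuntimeError("Gemini returned no candidates")
    candidate = candidates[0]
    usage = response.get("usageMetadata", {})
    if candidate.get("finishReason") == "MAX_TOKENS":
        # Thinking tokens count against maxOutputTokens, so the JSON may be cut off.
        raise RuntimeError(
            "Gemini output truncated (MAX_TOKENS): "
            f"thoughts={usage.get('thoughtsTokenCount')}, "
            f"output={usage.get('candidatesTokenCount')}"
        )
    parts = candidate.get("content", {}).get("parts") or []
    text = "".join(part.get("text", "") for part in parts)
    if not text.strip():
        # Covers both a missing parts array and parts that carry no text.
        raise RuntimeError("Gemini returned empty text")
    return text
```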
2026-04-29 09:03:53 +00:00
534d710e8a Auto-resume jobs interrupted by container restart
When Coolify redeploys, the container is killed mid-job.
Now on FastAPI startup:
- Detect status=processing jobs from JOBS_DIR
- If input file exists and resume_attempts < 3, restart pipeline (status=queued)
- After 3 failed attempts, mark as error
- If input is missing, mark error immediately
- Track resume_attempts and last_resume_at for diagnostics

Run actual process_job in asyncio executor (sync function in thread)
so startup completes quickly and resume happens in background.

Resolves: 'Veseli Dolenci stuck' issue
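
The startup rules above amount to a small per-job decision. A sketch, assuming jobs are plain dicts; the `error` field and the function name are hypothetical, while `status`, `resume_attempts`, and `last_resume_at` come from the commit message:

```python
MAX_RESUME_ATTEMPTS = 3


def plan_resume(job: dict, input_exists: bool, now: str) -> dict:
    """Return an updated copy of a job that was found at startup."""
    updated = dict(job)
    if job.get("status") != "processing":
        return updated  # only interrupted jobs are touched
    attempts = int(job.get("resume_attempts", 0))
    if not input_exists:
        updated.update(status="error", error="input file missing after restart")
    elif attempts >= MAX_RESUME_ATTEMPTS:
        updated.update(status="error", error="gave up after 3 resume attempts")
    else:
        # Re-queue and record the attempt for diagnostics.
        updated.update(status="queued",
                       resume_attempts=attempts + 1,
                       last_resume_at=now)
    return updated
```

Jobs returned with status "queued" would then be handed to the background executor, so that startup itself stays fast.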
2026-04-29 08:52:16 +00:00
32baf9cd45 Auto-resume: cleanup stuck jobs on container startup + GEMINI_API_KEY env
- @app.on_event(startup) marks all status=processing jobs as error after restart
- Process endpoint now clears chorus_error/interrupted_at on retry (retry-friendly)
- GEMINI_API_KEY added to Coolify env (Gemini 3.1 Pro now active)
- User can now choose Gemini in UI dropdown for analysis
2026-04-29 08:43:31 +00:00
ec71c54570 Upgrade to Sonnet 4.6 + add Gemini 3.1 Pro support
- Refactored analyze_with_claude into shared _build_analysis_prompt + _parse_llm_response helpers
- New analyze_with_gemini() using Gemini 3.1 Pro ($2/M in, MMMLU 92.6% — best multilingual)
- Unified analyze_with_llm(provider) dispatcher with auto-fallback (Claude → Gemini)
- API endpoint accepts llm_provider in StartJobIn (claude/gemini/auto)
- Frontend dropdown to pick LLM
- Default model is now Sonnet 4.6 (was Haiku 4.5) — 3x quality at 3x price (~3 cents/video)
- Gemini support is opt-in: needs GEMINI_API_KEY env var to activate
2026-04-29 08:26:27 +00:00
9faa224885 Upgrade Claude model: Haiku 4.5 → Sonnet 4.6 for better Slavic language transcript correction 2026-04-29 08:22:10 +00:00
69fb2f5ce8 Upgrade default Whisper model: small/medium → large-v3 for much better Slovenian/Slavic transcription accuracy 2026-04-29 08:20:18 +00:00
4bc5ac6756 Major: Claude post-processing of Whisper transcript
- Claude now corrects transcription errors (Slavic languages, dialects, mixed langs)
- Returns corrected_segments with same timestamps but cleaner text
- Pipeline generates SRT from Claude-corrected transcript and passes to subtitle.py via --srt
- subtitle.py supports --srt to skip Whisper re-transcription on the trimmed clip
- clip.py propagates --srt through to subtitle.py
- Whisper still runs once (in analyze.py); subtitle.py reuses corrected output instead of re-running
- This means: Whisper's mistakes (mixed langs, hallucinations, wrong words) are fixed by Claude before becoming visible subtitles
2026-04-29 08:13:33 +00:00
4e123bdabc UI: hide lang/model dropdowns — both are fully automatic now (3-sample lang detection + medium default model) 2026-04-29 08:03:22 +00:00
af3c933c78 Robust language detection + anti-hallucination
- 3-sample voting for auto-detect (start/middle/end of song) prevents lang switching mid-song
- Lock detected language for full transcription
- Anti-hallucination: condition_on_previous_text=False, temperature=0.0
- compression_ratio_threshold=2.4 (rejects repetitive hallucinations)
- log_prob_threshold=-1.0 (rejects low-confidence segments)
- no_speech_threshold=0.6 (more aggressive silence detection)
- Default Whisper model changed: small → medium (better for all langs incl. Slavic)
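
The 3-sample vote could be sketched as below. The input shape (one `(language, probability)` pair per sample) and the tie-break on mean probability are assumptions:

```python
from collections import Counter


def vote_language(samples):
    """Pick the language detected most often across the samples.

    Ties are broken by the higher mean detection probability.
    """
    if not samples:
        raise ValueError("no language samples")
    counts = Counter(lang for lang, _prob in samples)
    top = max(counts.values())

    def mean_prob(lang):
        probs = [p for sample_lang, p in samples if sample_lang == lang]
        return sum(probs) / len(probs)

    tied = [lang for lang, count in counts.items() if count == top]
    winner = max(tied, key=mean_prob)
    return winner, mean_prob(winner)
```

The winning language is then locked for the full transcription, which is what prevents mid-song language switches.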
2026-04-29 07:59:20 +00:00
c870d80726 Fix: extend clip if ends mid-vocal (no chorus cut-off), DejaVu Sans font (supports SLO/HR/BS chars), auto-upgrade to medium Whisper model for Slavic languages 2026-04-29 07:35:00 +00:00
5d5e169f9d Disable Whisper VAD filter — was dropping vocal segments in songs creating gaps in subtitles 2026-04-29 07:07:29 +00:00
a04811bdc9 Add Claude LLM analysis: sends full transcript to Claude API for true song structure understanding (refrain detection across all repetitions, not just local heuristic) 2026-04-29 06:55:41 +00:00
e072eec362 Fix: handle Whisper transcribe failure for instrumental-only audio (fallback to empty transcript) 2026-04-29 06:33:52 +00:00
33a138af9e Fix: force native Python bool/float for JSON serialization (numpy types) 2026-04-29 06:23:41 +00:00
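
One way to avoid this class of error is a `default` hook for `json.dumps`. This is a sketch of an alternative, not the commit's approach (which casts values explicitly); it relies on NumPy scalars exposing `.item()`, and the helper names are illustrative:

```python
import json


def to_native(value):
    """json.dumps fallback: unwrap scalar wrappers that expose .item()."""
    item = getattr(value, "item", None)
    if callable(item):
        return item()
    raise TypeError(f"not JSON serializable: {type(value).__name__}")


def dumps_job(payload: dict) -> str:
    """Serialize a job dict, converting NumPy-style scalars on the way."""
    return json.dumps(payload, default=to_native)
```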
8512076b91 Major: smart selection pipeline (analyze.py) + audio fade + multi-lang auto-detect
- New analyze.py: full transcript + energy + structural analysis
- Smart clip range: includes pre-chorus, can exceed 30s up to max_duration (default 45s)
- Audio fade in/out: auto-detected from vocal boundaries
- Instrumental detection: auto-disables subs if vocals < 10% of duration
- Multi-language: auto-detect via Whisper or explicit (DE/SL/HR/BS/SR/EN/IT/ES/FR)
- Frontend: cleaner UX, added bs language, smart selection description
- reframe.py: --fade-in --fade-out args
- clip.py: propagates fade params
- app/main.py: replaces find_chorus.py call with analyze.py
2026-04-29 06:21:35 +00:00
81edd24ca3 Subtitles: smaller font 56px (was 84), higher position MarginV=400, side margins 80px for safe zone 2026-04-29 06:09:26 +00:00
ba787744a6 Subtitles: cap chunk duration at 2.5s, split long lines into multiple time slices for faster reels pacing 2026-04-29 05:59:36 +00:00
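
A sketch of how a long cue might be split into time slices of at most 2.5 s. The even split by word count and equal slice lengths are assumptions; the commit does not say how subtitle.py distributes the words:

```python
import math


def split_cue(start: float, end: float, text: str, max_duration: float = 2.5):
    """Split one subtitle cue into equal time slices no longer than max_duration.

    A cue with fewer words than needed slices keeps one word per slice,
    so its slices can still exceed max_duration.
    """
    duration = end - start
    words = text.split()
    if duration <= max_duration or len(words) < 2:
        return [(start, end, text)]
    slices = min(math.ceil(duration / max_duration), len(words))
    step = duration / slices
    cues = []
    for i in range(slices):
        # Integer partition keeps every slice non-empty.
        chunk = words[i * len(words) // slices:(i + 1) * len(words) // slices]
        cues.append((start + i * step, start + (i + 1) * step, " ".join(chunk)))
    return cues
```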
e001387a89 Subtitles: convert SRT to ASS directly with PlayResY=1920 for predictable scaling instead of unreliable force_style 2026-04-28 18:09:53 +00:00
28d933c916 Subtitles: UPPERCASE + position lower (MarginV=320 for 1080x1920) + bigger font 2026-04-28 17:40:48 +00:00
d36893bf2d FIX CRITICAL: reload job dict after find_chorus update so reframe gets new start/duration values 2026-04-28 17:33:11 +00:00
15ef4888a1 Debug: log exact clip.py cmd in job + clip.py logs run_clip args 2026-04-28 17:28:10 +00:00
bc3fe1f9d4 Add explicit FFmpeg trim command logging + duration verification 2026-04-28 17:17:11 +00:00
8eaef029e2 Find chorus: weight repetitive short phrases (like 'Ohne dich x5') as strong chorus signal 2026-04-28 16:57:45 +00:00
c17578521a Fix find_chorus: RMS energy parser was broken (no pts_time available), now synthesizes timestamps; energy weight x10 (the chorus is louder) 2026-04-28 16:55:51 +00:00
64e8854cea Track mode: more sensitive face detection + longer smoothing window 2026-04-28 16:45:13 +00:00
400f6dbb6d Fix: limit FFmpeg crop expression to 20 sample points (was overflowing 4KB limit) 2026-04-28 16:32:26 +00:00
bf7ced5c7b Reset upload form also after failed jobs (so next upload works) 2026-04-28 16:29:39 +00:00
2e337ff079 Fix: shutil import was inside finally block, causing NameError when shutil.move was called 2026-04-28 16:22:39 +00:00
c34e4aa376 UX: Live progress panel below upload form, stable progress bar, inline preview/download 2026-04-28 16:19:40 +00:00
6e2a13d8a3 Fix cross-device link error: use shutil.move instead of os.replace 2026-04-28 16:15:20 +00:00
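
`os.replace` fails with a cross-device link error when source and destination sit on different filesystems (for example a temp directory and a mounted volume), whereas `shutil.move` falls back to copy-and-delete. A minimal sketch with a hypothetical helper name:

```python
import os
import shutil


def move_into_place(src: str, dst: str) -> str:
    """Move src to dst, even when they live on different filesystems."""
    os.makedirs(os.path.dirname(os.path.abspath(dst)), exist_ok=True)
    # Renames when possible; copies and deletes across devices.
    return shutil.move(src, dst)
```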
02ec6f81f2 Add Deno runtime for yt-dlp YouTube nsig challenge solving 2026-04-28 16:05:09 +00:00
e304b08d7b Add nodejs for yt-dlp JS challenge solver, remove anonymous VOLUME 2026-04-28 15:51:57 +00:00
83734dfdc5 Upgrade yt-dlp to nightly for new YouTube nsig algorithm support 2026-04-28 15:48:39 +00:00
47509b4f06 Add cookies support to yt_download.py for YouTube bot detection bypass 2026-04-28 15:47:59 +00:00
8e41bf21f6 Fix: create empty static/ in container instead of COPY (was empty in git) 2026-04-28 15:34:50 +00:00
30b969e4b8 Initial: reels clipper app
- FastAPI backend (auth, jobs, SSE, download)
- Frontend: drag&drop + YouTube URL + jobs panel
- Pipeline: yt_download → find_chorus → reframe → subtitle
- Modes: track (face follow), center, blur
- Whisper for SI/DE/EN subtitles
- Auto-chorus detection via Whisper + RMS energy
- Docker + Coolify ready
2026-04-28 15:28:22 +00:00