User feedback: 'the fact is it takes long because it has to render first? that takes
a while? we'd need a beefy machine for that, right?'
Solution before GPU upgrade: live preview that renders just the selected
range as a low-quality 480p clip. ~2-3s instead of ~70s for a full reel render.
NEW endpoint: GET /api/preview-clip/{job_id}?start=X&end=Y
- ffmpeg fast extract (no reframe, no subtitles, no face tracking)
- 480p ultrafast x264 preset, CRF 30
- Cached per job+range (re-clicks are instant)
- ~2-3s on CPU
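A minimal sketch of how the preview extract and its cache key could be built. The helper names, the cache file layout and the AAC audio codec are assumptions for illustration; only the 480p / ultrafast / CRF 30 settings come from the notes above.

```python
from pathlib import Path


def preview_cache_path(cache_dir: Path, job_id: str, start: float, end: float) -> Path:
    """One file per job + range, so a repeated click can reuse it."""
    # Millisecond rounding keeps the key stable across float noise (assumed scheme).
    return cache_dir / f"{job_id}_{round(start * 1000)}_{round(end * 1000)}.mp4"


def build_preview_cmd(src: Path, out: Path, start: float, end: float) -> list[str]:
    """ffmpeg arguments for a quick 480p extract (no reframe, no subtitles)."""
    if end <= start:
        raise ValueError("end must be greater than start")
    return [
        "ffmpeg", "-y",
        "-ss", f"{start:.3f}",         # seek before -i for a fast input seek
        "-t", f"{end - start:.3f}",    # length of the selected range
        "-i", str(src),
        "-vf", "scale=-2:480",         # 480p, width rounded to an even number
        "-c:v", "libx264", "-preset", "ultrafast", "-crf", "30",
        "-c:a", "aac",                 # assumed audio codec
        "-movflags", "+faststart",     # lets the browser start playback early
        str(out),
    ]
```

The endpoint would run this command only when the cache path does not exist yet and serve the cached file otherwise.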
Frontend:
- '▶ Predvajaj odsek' ("Play section") button now triggers preview-clip render
- Shows status: '🎬 Renderiram odsek... (~3s)' ("Rendering section...")
- After render: video element switches to preview src
- User sees EXACTLY what reel will contain (just without face track)
- Subsequent clicks on same range are instant (cached)
Workflow:
- Drag handles → click '▶ Predvajaj odsek' → 3s wait → see + hear it
- Iterate fast: drag → preview → drag → preview
- Final '✅ Shrani in re-render' ("Save and re-render") only when satisfied (~70s full render)
Bug from screenshot: trim bar visible but red handles not showing.
Causes:
1. video_duration in job is None for old jobs (was not saved on initial
processing). Without it, fallback was endInit+60 which placed handles
off-screen.
2. videoDuration was const, couldn't be updated when video metadata loads.
3. Handle offset was 9px but handles are now 24px wide (need 12px offset).
Fixes:
- Backend /api/transcript: fallback to last segment end time if
video_duration missing in job
- Frontend: videoDuration is let, updated on loadedmetadata
- Handle offset 9px → 12px for 24px wide handles
- Re-render trim after metadata loads to pick up actual video.duration
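A rough sketch of the two calculations behind these fixes: the backend duration fallback and the handle centering. Function names and the job/segment dict shapes are assumptions; the frontend does the positioning in JavaScript, shown here in Python only to make the arithmetic explicit.

```python
def effective_duration(job: dict, segments: list[dict]) -> float:
    """Use the stored duration, else fall back to the last segment end."""
    duration = job.get("video_duration")
    if duration:  # covers both None and 0
        return float(duration)
    if segments:
        return float(max(seg["end"] for seg in segments))
    return 0.0


def handle_left_px(t: float, duration: float, bar_width: float,
                   handle_width: float = 24.0) -> float:
    """Left edge of a handle so that its centre sits on time t."""
    if duration <= 0:
        return -handle_width / 2
    # Half the handle width is the offset: 12px for a 24px handle.
    return (t / duration) * bar_width - handle_width / 2
```

With a 600px bar and a 60s video, a handle at 30s gets a left edge of 288px, so its centre lands on 300px.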
User insight: 'once the reels are made we need to be able to fix them...
we're going for automation, but I should also be able to fix things by hand'
The automatic pipeline stays (Soniox + Claude + render). After the render, the user
can click the ✏️ Edit button and:
1. **Slider for clip start/end**:
- Shows the original 16:9 video
- Drag the start/end slider with a live preview
- Length shown in real time
- Min 5s, max 60s
2. **Subtitle editing** (collapsed, optional):
- Click a line → input for correcting the text
- The original timestamp stays, only the text is updated
- Useful for corrections like 'doline IZBOR' → 'doline IZPOD'
3. **Re-render**:
- Backend POST /api/jobs/{id}/recut with {start, end, custom_segments}
- Worker skips Soniox + Claude (custom_clip flag)
- Reuses the cached transcript + analysis
- Re-renders only: clip → reframe → subtitle → output
- ~30s instead of 3-5 min
New endpoints:
- GET /api/source-video/{id}: original 16:9 video for the editor preview
- GET /api/transcript/{id}: segments + clip range for the editor
- POST /api/jobs/{id}/recut: re-render with user timestamps
Worker change: if the job has custom_clip=True, skip the auto_chorus
analysis and just reuse the existing clip_range from analysis.json
(updated by the recut endpoint).
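The recut flow can be sketched roughly as below. The function names and step labels are hypothetical; only the 5s/60s limits, the custom_clip flag and the step order come from the notes above.

```python
MIN_CLIP_S = 5.0
MAX_CLIP_S = 60.0

RENDER_STEPS = ["clip", "reframe", "subtitle", "output"]


def validate_recut(start: float, end: float) -> float:
    """Return the clip length, or raise if the range breaks the editor limits."""
    length = end - start
    if start < 0 or length < MIN_CLIP_S or length > MAX_CLIP_S:
        raise ValueError(
            f"clip must start at >= 0 and last {MIN_CLIP_S:.0f}-{MAX_CLIP_S:.0f}s, "
            f"got {start}-{end}"
        )
    return length


def plan_steps(job: dict) -> list[str]:
    """Skip transcription and analysis when the user supplied the range."""
    if job.get("custom_clip"):
        return list(RENDER_STEPS)
    return ["transcribe", "analyze"] + RENDER_STEPS
```

Keeping the limits on the backend as well as in the slider means a hand-crafted request cannot queue an out-of-range render.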
Bug: BRAJDE reel showed subtitles 2-3 seconds out of sync with audio.
Soniox returned correct word timestamps:
- 'Ajmo,' at 41.82s
- 'Janezi!' at 42.18s
- 'Pejd' greva, ajde,' at 43.44-44.40s
But generate_srt_from_segments() ignored word timestamps and split long
segments into evenly-spaced 2.5s chunks based on segment duration:
    chunk_dur = duration / n_parts   # assumes even pacing
    for i in range(n_parts):
        cs = rel_start + i * chunk_dur
This produces wrong timing because singers don't sing evenly. Real audio
had 'Ajmo, Janezi!' in 0.9s and 'Pejd' greva, ajde, na traktorju od Majde'
in 6s — the 2.5s chunks didn't align with vocals.
Fix: when word-level timestamps are available (Soniox/Scribe), group
words into chunks where each chunk's start/end match the actual first/last
word timestamps. Each chunk is at most MAX_CHUNK_DURATION (2.5s) but
respects natural word boundaries.
Before:
00:00.000 → 01.900 AJMO, JANEZI! PEJD' GREVA, AJDE, NA TRAKTORJU OD
00:01.900 → 03.800 MAJDE, NOBEN NAJU NE NAJDE, KO PELJEM TE
After:
00:00.020 → 02.120 AJMO, JANEZI! PEJD' GREVA,
00:02.360 → 04.820 AJDE, NA TRAKTORJU OD MAJDE, NOBEN
Subtitle timing now follows the word timestamps, so it lines up with the vocals.
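The word-based grouping described in the fix can be sketched as follows. This is an illustration, not the actual generate_srt_from_segments() code: the function name and the word dict shape are assumptions, and the word end times in the usage example are made-up sample values.

```python
MAX_CHUNK_DURATION = 2.5  # seconds, as in the fix description


def chunk_words(words: list[dict], clip_start: float = 0.0,
                max_dur: float = MAX_CHUNK_DURATION) -> list[dict]:
    """Group words into subtitle chunks whose timing comes from real word times."""
    groups: list[list[dict]] = []
    current: list[dict] = []
    for word in words:
        # Close the chunk if adding this word would stretch it past max_dur.
        if current and word["end"] - current[0]["start"] > max_dur:
            groups.append(current)
            current = []
        current.append(word)
    if current:
        groups.append(current)
    return [
        {
            "start": round(g[0]["start"] - clip_start, 3),   # first word start
            "end": round(g[-1]["end"] - clip_start, 3),      # last word end
            "text": " ".join(w["text"] for w in g).upper(),
        }
        for g in groups
    ]
```

Usage with illustrative word timings:

```python
words = [
    {"text": "Ajmo,", "start": 41.82, "end": 42.10},
    {"text": "Janezi!", "start": 42.18, "end": 42.70},
    {"text": "Pejd'", "start": 43.44, "end": 43.70},
    {"text": "greva,", "start": 43.72, "end": 44.00},
    {"text": "ajde,", "start": 44.10, "end": 44.40},
]
chunks = chunk_words(words, clip_start=41.8)
```

A single word longer than max_dur still becomes its own chunk, so no text is dropped.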
User feedback: 'Ansambel UNIKAT — PA PA (offiicial video)' shows the
'(offiicial video)' suffix everywhere (titles, downloads, UI). The user
wants only 'Artist - Title' without any video format markers.
Two fixes, plus where they are applied:
1. EXPANDED _NOISE_PATTERNS to handle:
- Typos in 'official': 'offiicial', 'offical', 'oficial' (regex Off[a-z]*icial)
- Variants: '(Official 4K Video)', '(Official HD Video)', '(Official Music Video)'
- More versions: (Live), (Cover), (Acoustic), (Extended Mix), (Radio Edit), (Clean), (Explicit)
- Square brackets: [Official...], [HD], [Lyrics...]
- Bare words without brackets
- Trailing year markers '(2024)'
2. NEW clean_noise() function applied at READ TIME:
Even if a job was saved with 'PA PA (offiicial video)' as parsed_title,
the new code re-cleans it when serving the job to the UI or building
the download filename. This means existing jobs get fixed too without
needing re-processing.
3. Applied to:
- build_download_filename() — clean before formatting
- list_jobs() — strip noise when serving job list
- get_job() — strip noise when serving single job
Result: 'Ansambel UNIKAT - PA PA - REEL.mp4' (no more (offiicial video))
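A simplified sketch of read-time noise cleaning. The patterns below are written independently for illustration and are not the project's _NOISE_PATTERNS; they cover only a subset of the cases listed above, and a bracketed group is treated as noise whenever it starts with a noise word, which is an assumption that could over-strip a real title.

```python
import re

# Tolerates the typos mentioned above: official, offiicial, offical, oficial.
_OFFICIAL = r"of{1,2}i{1,2}ci?al"
_NOISE_WORDS = (
    rf"{_OFFICIAL}|lyrics?|audio|video|hd|4k|live|cover|acoustic|"
    r"extended\s+mix|radio\s+edit|clean|explicit"
)
# (…) or […] that starts with a noise word, or holds only a year like (2024).
_BRACKETED = re.compile(
    rf"\s*[\(\[]\s*(?:(?:{_NOISE_WORDS})\b[^\)\]]*|(?:19|20)\d{{2}})\s*[\)\]]",
    re.IGNORECASE,
)
# Unbracketed marker, stripped only at the very end of the title.
_BARE_TAIL = re.compile(
    rf"\s+{_OFFICIAL}(?:\s+(?:music\s+)?(?:video|audio))?\s*$",
    re.IGNORECASE,
)


def clean_noise(title: str) -> str:
    """Strip video-format markers from an already stored title."""
    cleaned = _BRACKETED.sub("", title)
    cleaned = _BARE_TAIL.sub("", cleaned)
    return re.sub(r"\s{2,}", " ", cleaned).strip(" -")
```

Because the function is pure and idempotent, it can be applied every time a job is served without touching the stored data.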
Bug found in Žena ME TEPE third re-test:
- Scribe transcribed only verse 1 (0-33s) properly
- Then returned a single 98s segment [34.7-133.2] with just 1 word 'sam'
- This is a known Scribe hallucination on instrumental sections
- Result: SRT showed 'SAM SAM SAM SAM...' 14 times across the chorus
- Looked completely wrong because the chorus audio was correct but
subtitles showed 'SAM' repeatedly
Three-part fix:
1. SRT GENERATOR: skip segments > 15s with < 5 words.
These are hallucinations and have no real transcription value.
2. SCRIBE TRANSCRIBE: detect hallucinations in returned segments.
- Mark segments > 15s with < 5 words as hallucinations
- Compute true coverage % (excluding hallucinations)
- Add _hallucination_count and _coverage_pct to result
3. TRANSCRIBE_FULL: auto-retry Scribe if quality is poor.
- If hallucinations detected OR coverage < 50%, retry once
- Keep retry result only if it has better stats
- Otherwise fall back to first attempt (still better than nothing)
This makes the pipeline robust against Scribe's occasional bad transcripts
on songs with long instrumental breaks. Most second attempts succeed
where the first failed (random Scribe variance).
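The detection and retry decision could look roughly like this. The thresholds and the _hallucination_count / _coverage_pct keys come from the notes above; the function names and the exact meaning of "better stats" (fewer hallucinations first, then higher coverage) are assumptions.

```python
MAX_SPARSE_SEGMENT_S = 15.0
MIN_WORDS = 5
MIN_COVERAGE_PCT = 50.0


def is_hallucination(seg: dict) -> bool:
    """A long segment with almost no words is treated as a hallucination."""
    duration = seg["end"] - seg["start"]
    return duration > MAX_SPARSE_SEGMENT_S and len(seg["text"].split()) < MIN_WORDS


def transcript_stats(segments: list[dict], audio_duration: float) -> dict:
    """Count hallucinations and compute coverage from the remaining segments."""
    good = [s for s in segments if not is_hallucination(s)]
    covered = sum(s["end"] - s["start"] for s in good)
    pct = 100.0 * covered / audio_duration if audio_duration > 0 else 0.0
    return {
        "_hallucination_count": len(segments) - len(good),
        "_coverage_pct": round(pct, 1),
    }


def needs_retry(stats: dict) -> bool:
    return stats["_hallucination_count"] > 0 or stats["_coverage_pct"] < MIN_COVERAGE_PCT


def pick_better(first: dict, retry: dict) -> dict:
    """Keep the retry only if its stats are strictly better (assumed ordering)."""
    def score(stats: dict) -> tuple:
        return (-stats["_hallucination_count"], stats["_coverage_pct"])
    return retry if score(retry) > score(first) else first
```

Applied to the case above, the 98s one-word segment is flagged and the true coverage drops to about a quarter of the song, so a retry is triggered.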
Bug found in Žena ME TEPE re-test:
- Clip start: 76.73s (correct, captures full 'Žena' word)
- But SRT subtitle #1 showed: 'SAJ ŠE DOMA MI VEČ NOČJO VERJET.'
- That text is from the PREVIOUS verse, not the chorus!
Why: previous segment (73.9-78.2s) contained 'saj še doma mi več
nočjo verjet. Žena me'. Clip start fell at 76.73s (mid-segment).
Old SRT logic: max(s_start, clip_start) just clipped TIMING but kept
ALL the text from that segment, including text from before the clip.
Fix: when a segment partially falls outside clip range AND has word-level
timestamps (Scribe provides these), reconstruct the segment using only
the words that actually fall within [clip_start, clip_end]. Audio
(clipped at clip_start) only contains those words anyway, so the
subtitle should match.
Result for Žena chorus:
- Old: 'SAJ ŠE DOMA MI VEČ NOČJO VERJET.' (wrong, that text is not heard
in the clip)
- New: 'ŽENA ME' (only words actually heard at 76.73-78.16s)
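A sketch of rebuilding a boundary segment from its words. The function name, the dict shapes and the rule "a word counts if it starts inside the clip" are assumptions; the word timings in the usage example are illustrative, not the real Scribe output.

```python
def text_within_clip(segment: dict, clip_start: float, clip_end: float) -> str:
    """Return only the words of a segment that start inside the clip range."""
    words = segment.get("words")
    if not words:
        # No word-level timing: nothing to reconstruct from, keep the full text.
        return segment["text"]
    kept = [w["text"] for w in words if clip_start <= w["start"] < clip_end]
    return " ".join(kept)
```

Usage with a segment that straddles the clip start:

```python
segment = {
    "start": 73.9, "end": 78.2,
    "text": "saj še doma mi več nočjo verjet. Žena me",
    "words": [
        {"text": "saj", "start": 73.90, "end": 74.20},
        {"text": "še", "start": 74.25, "end": 74.50},
        {"text": "doma", "start": 74.55, "end": 75.00},
        {"text": "mi", "start": 75.05, "end": 75.20},
        {"text": "več", "start": 75.25, "end": 75.60},
        {"text": "nočjo", "start": 75.65, "end": 76.10},
        {"text": "verjet.", "start": 76.15, "end": 76.60},
        {"text": "Žena", "start": 76.75, "end": 77.40},
        {"text": "me", "start": 77.50, "end": 78.16},
    ],
}
subtitle = text_within_clip(segment, 76.73, 105.0).upper()
```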
Changes:
1. Frontend multi-upload:
- File input now has 'multiple' attribute, drag-drop accepts multiple
- File queue list with per-file artist/title preview + remove button
- 'Pošlji vse' ("Send all") uploads sequentially (one at a time to avoid network saturation)
- Each file gets same batch_id for Telegram batch summary
- After upload, queue clears, jobs appear in right sidebar
2. Backend queue worker:
- New _queue_worker() background thread processes 'queued' jobs sequentially
- Only 1 job at a time to keep openclaw stable (avoid CPU/RAM thrash)
- FIFO order by created_at
- Auto-starts on app startup after job resume
3. Job submission flow change:
- /api/process and /api/youtube no longer call background.add_task directly
- Just mark status='queued', queue worker picks up
- This means upload completes fast, processing happens in background
- User can close browser, jobs continue
4. Telegram notifications (FOLX Alerts bot):
- Per-job: 'Reel pripravljen: Lady Gaga - Abracadabra (29s, 30 MB)' ("Reel ready")
- Per-job failed: 'Reel ni uspel: <name> + error message' ("Reel failed")
- Batch summary: 'Batch končan: 10/10 reels pripravljeni' ("Batch finished: 10/10 reels ready"; only if >1 in batch)
- Uses existing TELEGRAM_TOKEN + TELEGRAM_CHAT_ID env vars
- app/telegram.py module with notify_job_done(), notify_job_failed(),
notify_batch_complete()
5. batch_id field:
- Added to Job model + StartJobIn pydantic
- Saved during upload + process
- Used to count batch progress and trigger summary notification
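The queue pick and the batch summary trigger can be sketched as two small pure functions. The names, the in-memory job dicts and the 'done' status value are assumptions ('queued' and 'error' appear in the notes above); the real _queue_worker() runs in a background thread and reads jobs from storage.

```python
from typing import Optional


def next_queued(jobs: list[dict]) -> Optional[dict]:
    """FIFO: the oldest job that is still waiting, or None."""
    queued = [j for j in jobs if j["status"] == "queued"]
    return min(queued, key=lambda j: j["created_at"]) if queued else None


def batch_summary(jobs: list[dict], batch_id: str) -> Optional[str]:
    """Summary text once every job in the batch has finished, else None."""
    batch = [j for j in jobs if j.get("batch_id") == batch_id]
    if len(batch) <= 1:
        return None  # single uploads get no batch summary
    if any(j["status"] not in ("done", "error") for j in batch):
        return None  # something is still queued or processing
    done = sum(1 for j in batch if j["status"] == "done")
    return f"Batch končan: {done}/{len(batch)} reels pripravljeni"
```

The worker loop would call next_queued(), process that one job, then call batch_summary() for its batch_id and send the text when it is not None.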
User experience:
- Drag 20 videos at once
- Click 'Pošlji'
- Close browser, go grab coffee
- Telegram sends 'Reel pripravljen' for each
- After all done: 'Batch končan: 20/20 reels pripravljeni' summary
- Open app to download all
Changes:
1. UI: removed the blocking prompt() that asked for artist + title when the
filename didn't match the 'Artist - Title' pattern. Upload always proceeds.
Instead, a yellow warning says 'server will try to recognize'.
2. Backend: added scripts/acr_recognize.py — extracts 20s audio sample
from video (at 15s and 60s offsets for robustness), computes ACRCloud
fingerprint via native binary (3KB payload), sends to identify API.
3. Pipeline: process_job() now runs ACR recognition step before analysis
IF parsed_artist or parsed_title is missing. Result is saved to job
metadata and used for download filename + Scribe/Claude filename hint.
4. Credentials: ACR_HOST + ACR_ACCESS_KEY + ACR_SECRET_KEY env vars
added to Coolify (using existing keys from openclaw fb-agent metka).
5. requirements.txt: added pyacrcloud==1.0.11 for native fingerprinting.
This unblocks future automation/cron upload pipelines — files don't need
to be perfectly named, ACRCloud will identify them automatically.
Fallback chain:
1. Filename parsing (Artist - Title.mp4)
2. ACRCloud audio fingerprint (works even for '12345.mp4', 'IMG_001.mp4')
3. If both fail: download filename uses 'reel_<id>.mp4' (still works)
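The fallback chain can be expressed as a short function that takes the two lookups as callables. The function name and signature are hypothetical; the real ACRCloud call is left out and would be passed in as recognize.

```python
from typing import Callable, Optional, Tuple

Name = Optional[Tuple[str, str]]  # (artist, title) or None


def resolve_download_name(
    job_id: str,
    filename: str,
    parse: Callable[[str], Name],
    recognize: Callable[[], Name],
) -> str:
    """Filename parsing first, audio fingerprint second, generic name last."""
    # recognize() is only called when parsing fails, so no API call is wasted.
    name = parse(filename) or recognize()
    if name:
        artist, title = name
        return f"{artist} - {title} - REEL.mp4"
    return f"reel_{job_id}.mp4"
```

Passing the lookups in as arguments keeps the chain testable without network access or the native fingerprint binary.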
Two improvements:
1. DOWNLOAD FILENAME: instead of 'reel_<job-id>.mp4' (e.g. reel_25e076af7600.mp4),
downloads now have descriptive names like:
- 'Lady Gaga - Abracadabra - REEL.mp4'
- 'Modrijani - S teboj - REEL.mp4'
- 'Sarah Connor - FICKA - REEL.mp4'
2. PRE-UPLOAD VALIDATION: when filename doesn't follow 'Artist - Title' format,
browser prompts user for both fields. Without them, upload is blocked.
This prevents files with names like '12345.mp4' or 'video_final.mp4' from
being processed without identifying info.
Implementation:
- parse_artist_title() helper handles common formats:
- 'Artist - Title.mp4' / 'Artist – Title' (en dash)
- 'Artist | Title' / 'Artist : Title'
- Strips noise: '(Official Music Video)', '(Audio)', '(HD)', '[Lyric Video]'
- Client-side parser mirrors backend (validation before upload)
- Backend accepts artist + title form fields (override parsed)
- Job stored with parsed_artist + parsed_title + has_clean_name fields
- YouTube jobs auto-fetch title via yt-dlp --info-only and parse it
- Filename hint to Scribe/Claude uses parsed values (cleaner than raw filename)
- Download endpoint uses build_download_filename() for content-disposition
- Jobs list shows 'Artist — Title' instead of raw filename
Result: downloaded reels are auto-named correctly for Facebook/Instagram
upload, no more renaming files manually.
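A minimal sketch of the separator handling in a parser like parse_artist_title(). The extension list and the regex are assumptions, and noise stripping is left out here; the real helper handles more cases.

```python
import re
from typing import Optional, Tuple

# Hyphen, en dash, em dash, pipe or colon, surrounded by whitespace.
_SEPARATOR = re.compile(r"\s+[-\u2013\u2014|:]\s+")
_EXTENSION = re.compile(r"\.(?:mp4|mov|mkv|webm|mp3|m4a|wav)$", re.IGNORECASE)


def parse_artist_title(filename: str) -> Optional[Tuple[str, str]]:
    """Split 'Artist - Title.mp4' into (artist, title); None if it does not fit."""
    name = _EXTENSION.sub("", filename.strip())
    parts = _SEPARATOR.split(name, maxsplit=1)
    if len(parts) != 2:
        return None
    artist, title = parts[0].strip(), parts[1].strip()
    if not artist or not title:
        return None
    return artist, title
```

Requiring whitespace around the separator keeps hyphenated names such as 'Jay-Z' in one piece.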
ElevenLabs Scribe replaces local Whisper as default transcription:
- 96.7% accuracy English, 2.4% WER Indonesian (vs Whisper 7.7%)
- 18x faster (200s song = 11s vs 3-5 min on CPU)
- No hallucinations on songs (Whisper invented 'Pony und Kleid' for 'Bonnie und Clyde')
- 99 languages supported, including SLO/HR/BS/SR
- $0.40/h pricing, ~$0.022 per 200s song
Implementation:
- transcribe_with_elevenlabs() function uses Scribe v1
- ISO 639-1 ↔ 639-3 mapping (Scribe needs 'deu' not 'de')
- Word-level timestamps converted to pseudo-segments (close on 0.6s pause or 6s duration)
- 24MB upload limit guard with auto-fallback to local
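The language mapping and the pseudo-segment rule can be sketched like this. The function names and the word dict shape are assumptions; the 0.6s pause and 6s duration limits come from the list above, and the mapping shows only a few languages.

```python
# ISO 639-1 → ISO 639-3 for a few of the languages mentioned (partial table).
ISO1_TO_ISO3 = {"de": "deu", "en": "eng", "sl": "slv", "hr": "hrv", "bs": "bos", "sr": "srp"}


def _close(words: list[dict]) -> dict:
    return {
        "start": words[0]["start"],
        "end": words[-1]["end"],
        "text": " ".join(w["text"] for w in words),
    }


def words_to_segments(words: list[dict], max_pause: float = 0.6,
                      max_dur: float = 6.0) -> list[dict]:
    """Close a segment on a long pause or when it would exceed max_dur."""
    segments: list[dict] = []
    current: list[dict] = []
    for word in words:
        if current:
            pause = word["start"] - current[-1]["end"]
            too_long = word["end"] - current[0]["start"] > max_dur
            if pause >= max_pause or too_long:
                segments.append(_close(current))
                current = []
        current.append(word)
    if current:
        segments.append(_close(current))
    return segments
```

The resulting segments have the same start/end/text shape as Whisper segments, so later pipeline steps do not need to know which provider produced them.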
Default whisper_provider='auto':
- If ELEVENLABS_API_KEY set → use Scribe
- Otherwise → fallback to local faster-whisper
- 'elevenlabs' strict mode: no fallback
- 'local' strict mode: skip Scribe entirely
Tested on Ben Zucker - Ohne dich: Scribe correctly transcribed
'Wir sind Bonnie und Clyde, zu allem bereit' where local Whisper hallucinated.
Pipeline:
- New transcribe_with_groq() function uses Groq's whisper-large-v3-turbo
- 30s audio transcribed in ~0.5s (vs 30s+ on CPU)
- Same quality as local Whisper (it's the same OpenAI model)
- Cloudflare bypass via custom User-Agent header
- 24MB upload limit guard with auto-fallback to local
- Language auto-detect works (Groq returns full lang name, mapped to ISO codes)
Default whisper_provider='auto':
- If GROQ_API_KEY is set → use Groq (200x faster)
- Otherwise → fallback to local faster-whisper
- Strict 'groq' mode: no fallback (returns empty if Groq fails)
- Strict 'local' mode: skip Groq entirely
CLI: --whisper-provider {auto,groq,local}
API: whisper_provider field in StartJobIn
Cost: $0.04/h with whisper-large-v3-turbo ($0.002 per 200s song)
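A small sketch of the provider selection and the cost arithmetic. The function names are assumptions; the modes, the 24MB limit and the hourly rates are taken from the notes above.

```python
MAX_UPLOAD_BYTES = 24 * 1024 * 1024


def provider_order(requested: str, env: dict) -> list[str]:
    """Providers to try, in order. Strict modes never fall back."""
    if requested == "auto":
        return ["groq", "local"] if env.get("GROQ_API_KEY") else ["local"]
    if requested in ("groq", "local"):
        return [requested]
    raise ValueError(f"unknown whisper_provider: {requested!r}")


def fits_upload(size_bytes: int) -> bool:
    """Files above the limit go to the local model instead of the API."""
    return size_bytes <= MAX_UPLOAD_BYTES


def cost_usd(audio_seconds: float, rate_per_hour: float) -> float:
    return audio_seconds / 3600.0 * rate_per_hour
```

For a 200s song, cost_usd(200, 0.04) comes to roughly $0.002, and the same formula with the $0.40/h Scribe rate gives roughly $0.022.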
When Whisper hallucinates (generates fake lyrics not matching the audio),
LLM can now use the original filename as a hint to recognize the song
and override the false transcript with the actual lyrics.
Pipeline:
1. Pass filename (e.g. 'Ben Zucker - Bonnie und Clyde') as hint
2. Whisper transcribes (may hallucinate)
3. Claude/Gemini reads filename + transcript:
- Recognizes song from filename hint
- Compares Whisper output to known lyrics
- Replaces hallucinated text with real lyrics (preserves timestamps)
- If it can't fix a segment, removes it (better silent than wrong)
Also added Whisper anti-hallucination params:
- beam_size=5 (more careful decoding vs greedy)
- hallucination_silence_threshold=2.0 (skip text in long silences)
1. Preview endpoint now supports HTTP Range requests (HTTP 206 Partial)
- HTML5 video player needs Range support to seek/buffer properly
- Without it, video would cut off after a few seconds
- Returns chunks of 64KB on demand
2. Left panel (upload form) is now sticky (position: sticky)
- Stays in view while right panel (jobs list) scrolls
- On mobile (<800px) reverts to normal flow
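The Range handling from item 1 can be sketched as a header parser plus the response headers for a 206 reply. This is a simplified illustration that supports a single byte range only; the function names are assumptions and the actual endpoint code may differ.

```python
import re
from typing import Optional, Tuple

_RANGE_RE = re.compile(r"^bytes=(\d*)-(\d*)$")


def parse_range(header: str, file_size: int) -> Optional[Tuple[int, int]]:
    """Return (start, end) inclusive, or None if malformed or unsatisfiable."""
    match = _RANGE_RE.match(header.strip())
    if not match or file_size <= 0:
        return None
    first, last = match.groups()
    if not first and not last:
        return None
    if not first:
        # Suffix form 'bytes=-N': the last N bytes of the file.
        length = int(last)
        if length == 0:
            return None
        start, end = max(file_size - length, 0), file_size - 1
    else:
        start = int(first)
        end = min(int(last), file_size - 1) if last else file_size - 1
    if start > end:  # also catches start beyond the end of the file
        return None
    return start, end


def partial_headers(start: int, end: int, file_size: int) -> dict:
    """Headers to send with an HTTP 206 Partial Content response."""
    return {
        "Accept-Ranges": "bytes",
        "Content-Range": f"bytes {start}-{end}/{file_size}",
        "Content-Length": str(end - start + 1),
    }
```

When parse_range() returns None for a header that was present, the usual reply is 416 Range Not Satisfiable; without a Range header the endpoint can send the whole file with status 200.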
Diagnosis:
- analyze.py historically supported only Claude
- when Gemini was added, clip_range.source stayed hardcoded as 'claude'
- likewise the logs 'Whisper segmenti zamenjani s Claude' ("Whisper segments replaced with Claude") and 'Generated SRT from Claude'
- the API result in the job showed source='claude' even when Gemini was actually used
- this is only a COSMETIC bug; functionally everything worked correctly
- Gemini WAS actually called (confirmed: '🤖 Gemini (gemini-3.1-pro-preview) izbral: 172.5-201.8s')
and returned the correct result; only the logging reported it wrong
Fixes:
1. clip_range['source'] = claude_result['source'] (the actual 'gemini:...' or 'claude:...')
2. clip_range['reason'] prefix changed from hardcoded 'claude_llm:' to dynamic '{source}:'
3. Log 'Whisper segmenti zamenjani s Claude' → 'z {llm_label}'
4. Log 'Claude je popravil jezik' ("Claude corrected the language") → 'LLM je popravil'
5. main.py 'Generated SRT from Claude' → 'from {llm_src}'
Test (Zlati Muzikanti - Le prijatelja bodiva, waltz, 246s):
✓ Gemini actually picks the chorus (172.5-201.8s)
✓ Whisper detects sl (p=0.97 across 3 samples)
✓ All 18 segments corrected
✓ Pipeline works end-to-end
Backward compat:
- transcript['claude_corrected'] and the srt_from_claude variable name are kept
because they already exist in old job state files
When Coolify redeploys, the container is killed mid-job.
Now on FastAPI startup:
- Detect status=processing jobs from JOBS_DIR
- If input file exists and resume_attempts < 3, restart pipeline (status=queued)
- After 3 failed attempts, mark as error
- If input is missing, mark error immediately
- Track resume_attempts and last_resume_at for diagnostics
Run actual process_job in asyncio executor (sync function in thread)
so startup completes quickly and resume happens in background.
Resolves: 'Veseli Dolenci stuck' issue
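The per-job resume decision can be sketched as a pure function that the startup hook would apply to each job file. The function name and the 'error' message key are assumptions; the limit of 3 attempts and the resume_attempts / last_resume_at fields come from the notes above.

```python
MAX_RESUME_ATTEMPTS = 3


def resume_decision(job: dict, input_exists: bool, now: str) -> dict:
    """Return the updated job state for a job found after a restart."""
    if job.get("status") != "processing":
        return job  # only interrupted jobs need attention
    updated = dict(job)
    attempts = job.get("resume_attempts", 0)
    if not input_exists:
        updated.update(status="error", error="input file missing after restart")
    elif attempts >= MAX_RESUME_ATTEMPTS:
        updated.update(status="error", error=f"gave up after {attempts} resume attempts")
    else:
        updated.update(
            status="queued",
            resume_attempts=attempts + 1,
            last_resume_at=now,
        )
    return updated
```

Keeping the decision free of I/O means the startup hook only has to load the jobs, apply this function and hand the queued ones to the executor.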
- @app.on_event('startup') handler marks all status=processing jobs as error after restart
- Process endpoint now clears chorus_error/interrupted_at on retry (retry-friendly)
- GEMINI_API_KEY added to Coolify env (Gemini 3.1 Pro now active)
- User can now choose Gemini in UI dropdown for analysis
- Refactored analyze_with_claude into shared _build_analysis_prompt + _parse_llm_response helpers
- New analyze_with_gemini() using Gemini 3.1 Pro ($2/M in, MMMLU 92.6% — best multilingual)
- Unified analyze_with_llm(provider) dispatcher with auto-fallback (Claude → Gemini)
- API endpoint accepts llm_provider in StartJobIn (claude/gemini/auto)
- Frontend dropdown to pick LLM
- Default model is now Sonnet 4.6 (was Haiku 4.5) — 3x quality at 3x price (~3 cents/video)
- Gemini support is opt-in: needs GEMINI_API_KEY env var to activate
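A rough sketch of a dispatcher with Claude → Gemini fallback. The signature is hypothetical: the real analyze_with_llm(provider) calls the project's own analyzers, which are replaced here by callables passed in a dict so the example stays self-contained.

```python
from typing import Callable, Dict, List

Analyzer = Callable[[list], dict]

_ORDER: Dict[str, List[str]] = {
    "claude": ["claude"],
    "gemini": ["gemini"],
    "auto": ["claude", "gemini"],  # fallback order from the notes above
}


def analyze_with_llm(provider: str, transcript: list,
                     analyzers: Dict[str, Analyzer]) -> dict:
    """Try each provider in order and record which one actually answered."""
    if provider not in _ORDER:
        raise ValueError(f"unknown llm_provider: {provider!r}")
    errors = []
    for name in _ORDER[provider]:
        analyzer = analyzers.get(name)
        if analyzer is None:
            errors.append(f"{name}: not configured")  # e.g. API key missing
            continue
        try:
            result = dict(analyzer(transcript))
        except Exception as exc:  # a failed provider should not end the job
            errors.append(f"{name}: {exc}")
            continue
        result.setdefault("source", name)
        return result
    raise RuntimeError("all LLM providers failed: " + "; ".join(errors))
```

Setting source from the provider that actually answered avoids the label being fixed to one provider regardless of which one ran.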
- Claude now corrects transcription errors (Slavic languages, dialects, mixed langs)
- Returns corrected_segments with same timestamps but cleaner text
- Pipeline generates SRT from Claude-corrected transcript and passes to subtitle.py via --srt
- subtitle.py supports --srt to skip Whisper re-transcription on the trimmed clip
- clip.py propagates --srt through to subtitle.py
- Whisper still runs once (in analyze.py); subtitle.py reuses corrected output instead of re-running
- This means: Whisper's mistakes (mixed langs, hallucinations, wrong words) are fixed by Claude before becoming visible subtitles
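Generating the SRT text from corrected segments could look roughly like this. The function names and the segment dict shape are assumptions; the output follows the usual SubRip layout (index, 'HH:MM:SS,mmm --> HH:MM:SS,mmm', text, blank line).

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm."""
    total_ms = max(0, round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments: list[dict], clip_start: float = 0.0) -> str:
    """Build SRT text with times relative to the start of the trimmed clip."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"] - clip_start)
        end = srt_timestamp(seg["end"] - clip_start)
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}")
    return "\n\n".join(blocks) + "\n"
```

The resulting text would be written to a file and handed to subtitle.py through --srt, so the trimmed clip does not need to be transcribed again.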