reels-app

Author	SHA1	Message	Date
Sebastjan Artič	7cb4302dcd	Edit feature: slider + subtitle edit + recut endpoint User insight: 'we need to make it so that once the reels are made we can fix them... we work on automatic, but I can also make corrections' The automatic pipeline stays (Soniox + Claude + render). After rendering, the user can click the ✏️ Edit button and: 1. Slider for clip start/end: - Sees the original 16:9 video - Drags the start/end slider with a live preview - Duration shown in real time - Min 5s, max 60s 2. Subtitle editing (collapsed, optional): - Click a line → input for correcting the text - Original timestamp is kept, only the text is updated - Useful for corrections like 'doline IZBOR' → 'doline IZPOD' 3. Re-render: - Backend POST /api/jobs/{id}/recut with {start, end, custom_segments} - Worker skips Soniox + Claude (custom_clip flag) - Reuses the cached transcript + analysis - Re-renders only: clip → reframe → subtitle → output - ~30s instead of 3-5 min New endpoints: - GET /api/source-video/{id}: original 16:9 video for the editor preview - GET /api/transcript/{id}: segments + clip range for the editor - POST /api/jobs/{id}/recut: re-render with user timestamps Worker change: if a job has custom_clip=True, skip the auto_chorus analysis and just reuse the existing clip_range from analysis.json (updated by recut endpoint).	2026-04-30 10:26:25 +00:00
Sebastjan Artič	f99574daff	Sebastjan's feel ('filing'): 15 reference examples in prompt User insight: 'these are my decisions, my feelings I think, and yes, these are the choruses... how do we teach it my feel' Solution: few-shot learning. Instead of trying to express Sebastjan's feel in abstract rules, give the LLM 15 concrete examples of his choices as a reference table. Each example contains: - Song title - Exact chorus text (Sebastjan's choice) - Feel note (his reasoning: 2x repetition, 1 full chorus, NO outro, intro shouts included, etc.) Coverage: - Fast tracks: BRAJDE, PA PA, PIJAN, FICKA, ABRACADABRA - Slow NZ (narodno-zabavna, Avsenik): CVETELE, ENA BOLHA, ŽENA ME TEPE - Fehtarji: GOJZAR TANC, GORENJSKA, PODEŽELSKI - Schlager/pop: STISN SE K MEN, KO MISLIM NATE, NA LEPO SKUPNO POT - Sentimental NZ: DOMOTOŽJE V POMLADI Plus key feel notes: ✅ DO include: intro shouts (Ajmo Janezi/Slovenci), 2 consecutive choruses (~30s), natural outro filler (short), the singer's held note ❌ do NOT include: verses, pre-chorus, 2 different choruses mixed together, long outro filler (5+s), instrumental break For NEW songs the LLM will imitate this feel from the 15 examples.	2026-04-30 10:01:24 +00:00
Sebastjan Artič	6cf52e1918	Soften 'title must be in chorus' rule + fix CVETELE example User feedback: 'Drop the rule that the chorus starts on the word that is in the title' Problem: Many Slovenian folk-pop songs have the title in the VERSE, not in the chorus: - 'Cvetele so maline' → title is in verse, real chorus is 'Naj veter zdaj...' - 'Domotožje v pomladi' → title is theme, real chorus is 'Bele breze...' Old prompt forced LLM to find title phrase in chorus, leading it to pick verse parts (mid-line, wrong timing) just because they contained the title. Changes: 1. REMOVED forced rule: 'Naslov pesmi = REFREN HOOK (80-90% primerov)' (the song title = the CHORUS HOOK in 80-90% of cases) 2. NEW guidance: 'Naslov pesmi je VČASIH v refrenu, VČASIH v verzu. NE silujte!' (the song title is SOMETIMES in the chorus, SOMETIMES in the verse; do NOT force it) 3. NEW principle: 'Refren je tisti del ki se PONAVLJA 2-3x z ENAKIM besedilom' (the chorus is the part that REPEATS 2-3x with the SAME lyrics) 4. Fixed CVETELE example: chorus is 'Naj veter zdaj ponese...' (not Cvetele) with explicit warning that title is in VERSE 2 at ~125s 5. Added: 'NE izberi outro/3. nastop — izberi PRVI nastop refrena' (do NOT pick the outro or the 3rd occurrence; pick the FIRST occurrence of the chorus) This should let LLM find the actual repeating chorus instead of chasing the title phrase into verses.	2026-04-30 05:39:44 +00:00
Sebastjan Artič	d8c2aae9c1	Prompt: chorus must start on FIRST WORD of FIRST LINE User feedback after re-processing 14 reels: - 8 perfect (BRAJDE 100%, FICKA, Abracadabra, Žena, Stisn, PA PA, Gojzar tanc, GADI Pijan) - 4 problematic patterns identified: 1. CVETELE: clip extends 22s into instrumental on 'nekoč oba' hold 2. GORENJSKA LJUBLJENA: clip starts mid-line at 'obrnem nazaj' instead of 'V Ljubljani se obrnem nazaj' 3. Fantje: clip starts mid-chorus at 'vabijo me' (2nd line) instead of first line 4. PODEŽELSKI: extends into 'o o o' outro filler Common cause: Soniox can group end-of-verse + start-of-chorus into same segment (e.g., '[43.6-47.6] doma. V Ljubljani se'), and Claude picks segment.start (43.6) or next segment.start (48.2) instead of the actual word 'V' boundary inside the segment. Prompt fix: 1. NEW critical rule: 'clip start = TOČNO prva beseda PRVE vrstice' 2. Warning about Soniox merging end-of-verse + start-of-chorus 3. Use word-level timestamps to find chorus start word 4. Warning about long held tones in Soniox segments (15-20s on 'oba', 'doma', 'srca' due to fade-out instrumental) 5. Cut 1-2s after last sung word, don't wait 20s for tone to die 6. Outro filler: include short outros (yeah/aj-aj), but cut before long repeating outros (5+s of 'o o o') as those are fade-out Added concrete examples in PRIMERI: - BRAJDE: 28s (already perfect) - GORENJSKA: explicit warning about 'V Ljubljani se' boundary - CVETELE: explicit warning about 15-20s held tone segments This is a prompt-only change. No code logic modified. LLM still has full autonomy on duration.	2026-04-30 05:19:35 +00:00
Sebastjan Artič	22bb3cfe02	Trust LLM: remove forced extension, content-driven prompt User feedback: 'Here the LLM has to think and have a feel for what to put in'. With Soniox transcript now accurate, LLM has all info to decide content-wise. TWO CHANGES: 1. smart_clip_range(), forced extension logic REMOVED: Before: if duration < min_duration (20s): - extend to next chorus (40% match) ← WRONG! merged with B-chorus - extend symmetrically into VERSE ← WRONG! brought in a verse (kitica) - cap at max_duration After: trust LLM completely. Only safety: clamp to video bounds. 2. Prompt rewrite, content-driven instead of number-driven: Before: 'Skupna dolžina: 12-25 sekund (običajno)' (total length: 12-25 seconds, typically) + conflicting '~30s' '❌ Drugi/tretji nastop refrena — uporabi PRVI' (not the second/third occurrence of the chorus; use the FIRST) After: '~30 sekund (NAJBOLJŠA opcija = dva zaporedna refrena)' (~30 seconds; BEST option = two consecutive choruses) 'Vključi naravne intro klice (Ajmo Janezi! Hey! Pa-pa!)' (include natural intro shouts) 'BRAJDE primer: 41.8-69.8s = 28s (dva refrena z Ajmo Janezi intro)' (BRAJDE example: two choruses with the Ajmo Janezi intro) 'NE meša 2 RAZLIČNA refrena (A + B = napaka)' (do NOT mix 2 DIFFERENT choruses; A + B = error) 'NE razširi v VERZE/KITICE' (do NOT extend into VERSES) For BRAJDE this means: - Old: Claude picked 57.1-69.8s (12.7s, 2nd chorus, no Ajmo) Code forced extension to 57.06-82.5s (mixed with B-chorus + verse) - New: Claude picks 41.8-69.8s (28s, 2 choruses with 'Ajmo Janezi!' intro) Code returns exactly that, with no forced extension.	2026-04-30 04:39:26 +00:00
Sebastjan Artič	865e21fe1a	Integrate Soniox stt-async-v4 as primary STT provider Test results comparing all providers on Slovenian folk-pop: CVETELE SO MALINE: - Scribe: HALLUCINATED ('finančni moduli...') ❌ - Gemini 3 Pro: correct lyrics, ~100s ✅ - Soniox: PERFECT lyrics in 4 seconds ✅✅ PA PA: - Scribe: 'se mu pomahala' (wrong: missing M) ❌ - Soniox: 'sem mu pomahala' ✅ + caught 'pa-pa-ra-pa' fillers ŽENA ME TEPE: - Scribe: hallucinations + word errors - Soniox: PERFECT 'Žena me tepe, mi prazni žepe, da vidi, kje in s kom sem bil' Soniox advantages: - 4x cheaper than Scribe ($0.10/h vs $0.40/h) - 5x faster (4-15s vs 10-15s for 180s audio) - 50x cheaper than Gemini 3 Pro - 25x faster than Gemini - Slovenian native quality matches Gemini - Word-level timestamps + diacritics + punctuation Implementation: 1. transcribe_with_soniox() function: - Multipart upload to /v1/files (no SDK dependency) - Create transcription with stt-async-v4 model - Auto language hint based on filename (NZ → 'sl') - Multilingual fallback ['en', 'sl', 'de', 'hr', 'es', 'fr', 'it'] - Poll status, fetch transcript - Group subword tokens into words → segments - Auto-cleanup files after transcription 2. New 'soniox_chain' provider mode (default for 'auto'): - Soniox primary (fast + cheap + accurate) - Scribe fallback (rare cases when Soniox fails) - Gemini fallback (last resort, slow but bulletproof) - Quality gate: coverage >= 50%, no hallucinations 3. Provider modes: auto, soniox, elevenlabs, gemini, hybrid, local This makes the pipeline reliable for ALL music genres including Slovenian narodno-zabavni glasbi which Scribe consistently failed on.	2026-04-30 03:06:38 +00:00
Sebastjan Artič	ab5424d37b	Clip starts EXACTLY on chorus first word (no buffer) User feedback: 'on the chorus, not before it, at the start of the chorus'; the clip should start right when the chorus begins, not 0.3s before. Changes: 1. Prompt rule: 'Začetek = TOČNO ko prva beseda refrena začne' (start = EXACTLY when the first word of the chorus begins; was: '~0.3s PRED prvo besedo refrena', ~0.3s BEFORE the first word of the chorus) 2. Word-level extension: removed -0.15s buffer when extending back (now lands exactly on word start) Reasoning: with no_subs as default, we don't need buffer to avoid cutting first word during fade-in (fade-in is now 0.05s = imperceptible). Cleaner cuts directly at chorus onset.	2026-04-29 19:41:56 +00:00
Sebastjan Artič	0dd33c16f3	Hybrid transcription: Scribe primary + Gemini 3 Pro fallback Real-world test confirmed Gemini 3 Pro can transcribe Slovenian folk-pop songs accurately where ElevenLabs Scribe hallucinates: Test: FEHTARJI - GORENJSKA LJUBLJENA (120s sample) - Scribe result: 'finančni moduli...' (total hallucination, wrong content) - Gemini 3 Pro: 'Zunaj srečo sem iskal, planet prepotoval' (CORRECT lyrics) Implementation: 1. New transcribe_with_gemini() function: - Uploads audio via Gemini Files API (resumable upload) - Calls gemini-3-pro-preview with structured prompt - Parses JSON response with word-level timestamps - Computes coverage_pct and hallucination_count - Returns same format as Scribe (compatible) 2. New 'hybrid' provider mode (now the default for 'auto'): - Try Scribe first (fast, cheap: 8-10s, $0.013) - If quality OK (coverage >= 50%, no hallucinations) → return Scribe - Else retry Scribe once - If still bad → fallback to Gemini 3 Pro (slow, more expensive: 100s, $0.20) - Compare results, return whichever is better 3. Provider modes: - 'auto' → hybrid if both keys, else elevenlabs, else local - 'hybrid' → explicit Scribe + Gemini fallback - 'elevenlabs'→ Scribe only (with auto-retry) - 'gemini' → Gemini only - 'local' → faster-whisper on CPU Cost analysis (10 reels/day): - Pure Scribe: $0.13/day, ~5-10% reels unusable - Hybrid: ~$0.55/day, 100% usable - Pure Gemini: $2/day Hybrid is the clear winner: +$0.42/day for 100% reliability.	2026-04-29 18:38:27 +00:00
Sebastjan Artič	df6011c3cf	Detect Scribe hallucinations + filter from SRT + auto-retry Bug found in Žena ME TEPE third re-test: - Scribe transcribed only verse 1 (0-33s) properly - Then returned a single 98s segment [34.7-133.2] with just 1 word 'sam' - This is a known Scribe hallucination on instrumental sections - Result: SRT showed 'SAM SAM SAM SAM...' 14 times across the chorus - Looked completely wrong because the chorus audio was correct but subtitles showed 'SAM' repeatedly Three-part fix: 1. SRT GENERATOR: skip segments > 15s with < 5 words. These are hallucinations and have no real transcription value. 2. SCRIBE TRANSCRIBE: detect hallucinations in returned segments. - Mark segments > 15s with < 5 words as hallucinations - Compute true coverage % (excluding hallucinations) - Add _hallucination_count and _coverage_pct to result 3. TRANSCRIBE_FULL: auto-retry Scribe if quality is poor. - If hallucinations detected OR coverage < 50%, retry once - Keep retry result only if it has better stats - Otherwise fall back to first attempt (still better than nothing) This makes the pipeline robust against Scribe's occasional bad transcripts on songs with long instrumental breaks. Most second attempts succeed where the first failed (random Scribe variance).	2026-04-29 18:08:35 +00:00
Sebastjan Artič	d3b71942d2	Word-level extension: 2-word lookback (not full phrase) Refinement of previous lookback fix - limit to MAX 2 words back. Reason: with unlimited lookback, the lookback would chain through words with gaps < 0.5s and keep walking back into the previous verse. For Žena ME TEPE: 'verjet.' [76.78] → 'Žena' [76.88] gap is 0.10s, which means lookback would walk back to verses before chorus. With 2-word limit: - Clip at 78.19s → 'me' [78.16] is closest preceding word (gap 0.03s) - Lookback j=i: 'me' → 'Žena' gap 0.14s → captured (i-1) - Lookback j=i-1: 'Žena' → 'verjet.' gap 0.10s → would be captured but MAX_LOOKBACK_WORDS=2 stops here ✓ Result: anchor = 'Žena' at 76.88s → new_start = 76.73s. Subtitle: 'ŽENA ME TEPE' (full phrase, no verse leakage).	2026-04-29 16:53:29 +00:00
Sebastjan Artič	823eb3e91e	Use original Scribe transcript for word-level (Claude doesn't return words) Bug found in Žena ME TEPE re-test: - Final clip start was 77.2s but word 'Žena' starts at 76.88s - Word-level extension would have correctly chosen 76.73s - Why didn't it? Because corrected_segs (Claude output) doesn't contain word-level timestamps, only segment start/end. all_words array was empty, triggering segment-level fallback (-0.5s) which produced 77.2s instead. Fix: always use transcript['segments'] (original Scribe output with word timestamps) for word-level boundary detection, not Claude corrected_segments. Now: 'Žena' word at 76.88-77.74s will trigger word-level extension to 76.73s (76.88 - 0.15s buffer), capturing the full word.	2026-04-29 16:30:51 +00:00
Sebastjan Artič	e06c3efb8e	Add audio amplitude defense (Layer 3) for first-word cut prevention Žena problem persists: even after word-level extension, some cases where Scribe doesn't transcribe the very first word still result in clip cutting the vocal start. Layer 3 defense: after word-level start extension, probe the FIRST 150ms of audio at clip start with ffmpeg volumedetect. If mean_volume > -35 dB (threshold for vocal/music vs silence), extend clip start back 0.5s as a safety buffer. This catches cases where: - Scribe missed the word entirely (no word-level timestamp to extend to) - LLM picked a start that's already inside vocal energy - Word-level extension didn't trigger because no nearby word matched The check is fast (<100ms) and conservative (only triggers if audio is clearly NOT silent). If it's a true musical break (silence before chorus), mean_volume will be < -40 dB and extension is skipped. Three layers of defense now: 1. Claude prompt: 'start ~0.3s before first chorus word' 2. Word-level boundary detection (Scribe word timestamps) 3. Audio amplitude check (catches cases 1-2 missed)	2026-04-29 15:23:37 +00:00
Sebastjan Artič	91cc03658d	Multi-upload batch queue + Telegram notifications Changes: 1. Frontend multi-upload: - File input now has 'multiple' attribute, drag-drop accepts multiple - File queue list with per-file artist/title preview + remove button - 'Pošlji vse' uploads sequentially (one at a time to avoid network saturation) - Each file gets same batch_id for Telegram batch summary - After upload, queue clears, jobs appear in right sidebar 2. Backend queue worker: - New _queue_worker() background thread processes 'queued' jobs sequentially - Only 1 job at a time to keep openclaw stable (avoid CPU/RAM thrash) - FIFO order by created_at - Auto-starts on app startup after job resume 3. Job submission flow change: - /api/process and /api/youtube no longer call background.add_task directly - Just mark status='queued', queue worker picks up - This means upload completes fast, processing happens in background - User can close browser, jobs continue 4. Telegram notifications (FOLX Alerts bot): - Per-job: 'Reel pripravljen: Lady Gaga - Abracadabra (29s, 30 MB)' - Per-job failed: 'Reel ni uspel: <name> + error message' - Batch summary: 'Batch končan: 10/10 reels pripravljeni' (only if >1 in batch) - Uses existing TELEGRAM_TOKEN + TELEGRAM_CHAT_ID env vars - app/telegram.py module with notify_job_done(), notify_job_failed(), notify_batch_complete() 5. batch_id field: - Added to Job model + StartJobIn pydantic - Saved during upload + process - Used to count batch progress and trigger summary notification User experience: - Drag 20 videos at once - Click 'Pošlji' - Close browser, go grab coffee - Telegram sends 'Reel pripravljen' for each - After all done: 'Batch končan: 20/20 reels pripravljeni' summary - Open app to download all	2026-04-29 15:12:38 +00:00
Sebastjan Artič	157e6b781e	Fix 'Žena' word still cut: word-level start extension instead of segment-level Previous fix used segment boundaries — required segments <3s for type 1 or <4s for type 2. But Žena was in a 4.3s segment ('saj še doma mi več noč'jo verjet'. Žena me'), so the condition wasn't met and clip start stayed at 77.7s, exactly at end of word 'Žena' (76.88-77.70s). New approach: scan word-level timestamps directly: 1. If clip start falls MID-WORD → extend back to word start - 0.15s 2. If a word ends 0-0.5s BEFORE clip start AND next word is at clip start → that word is suspect (may be first word of chorus that Scribe put in previous segment), extend back to its start - 0.15s Word-level timestamps are always available from Scribe (timestamps_granularity=word). Falls back to segment-level for local Whisper without word timing. This handles arbitrary segment lengths and is universal — works for any language where the chorus starts on a word that the STT placed in the previous segment.	2026-04-29 15:04:18 +00:00
Sebastjan Artič	a5097c5acc	Fix first word being cut at clip start ('Žena' problem) Real-world failure: 'Ansambel Saša Avsenika - ŽENA ME TEPE' - Refren starts with 'Žena me tepe' at 78.0s - Scribe's segment boundary: word 'Žena' was end of previous segment (73.9-78.2s) while new segment 'tepe, mi prazni žepe' started at 78.3s - Claude picked clip start = 78.3s (segment boundary) - Fade-in 0.4s on vocal start = inaudible 'Že-' - User hears: '...na me tepe' (cut) Three-part fix: 1. PROMPT: instruct Claude to start clip ~0.3s BEFORE first chorus word (not exactly at it). Concrete example with timing math. 2. POST-LLM EXTENSION: scan corrected_segments for boundary cases: - If clip start falls MID-segment → extend back to segment start - 0.2s - If a previous segment ended within 0.5s of clip start → check if its last word might actually be the first chorus word, extend back to it - Uses word-level timestamps when available (Scribe provides these) 3. FADE-IN: was 0.4s when starting on vocal — too long, audibly cuts first word. Reduced to 0.05s (just click prevention, not audible). Still 0.2s for instrumental intros where fade is musically appropriate. Now 'Žena' will be heard fully — clip starts at ~77.5-77.7s, word starts at 78.0s, plenty of buffer.	2026-04-29 14:47:07 +00:00
Sebastjan Artič	a30137f1f2	Strict 'chorus only' mode: respect include_prebuild in LLM prompt Bug: 'Vključi pre-chorus' checkbox in UI was sent to backend but ignored by Claude/Gemini analysis prompt. Both modes used same lenient rules saying 'pre-chorus is optional' — Claude often included pre-chorus even when user wanted just chorus. Real-world failure: Lady Gaga 'Abracadabra' picked 54.7-84.6s, but actual chorus 'Abracadabra, amor, ooh-na-na' starts at 85.2s. Claude included the entire pre-chorus block ('Hold me in your heart tonight', 'Like a poem said by a lady in red', 'With a haunting dance') and missed the actual chorus completely. Fix: include_prebuild parameter now flows all the way to the prompt: - main.py → analyze.py CLI args → analyze_with_llm() → prompt builder - Two distinct prompt rule sets: CHORUS ONLY (default, include_prebuild=False): - Strict: 'clip starts on FIRST WORD of chorus, never before' - Length: 12-25s typically - Explicit examples for pop songs (Abracadabra, Despacito, Shape of You) - List of common mistakes to avoid CHORUS + PRE-CHORUS (include_prebuild=True): - Optional pre-chorus before chorus, 4-10s - Length: 18-35s This fixes the most common failure mode where Claude rationalizes including verse/pre-chorus content even when user explicitly wants just the chorus.	2026-04-29 14:03:40 +00:00
Sebastjan Artič	90cdad516b	Universal chorus selection: chorus mandatory, pre-chorus only natural extension User feedback: 'the CHORUS is mandatory, the pre-chorus optional' + 'the system must be stable for all languages, including Spanish and Romanian'. Two changes: 1. Web search is now MANDATORY first step (was: optional fallback): - Even if Claude thinks it knows the song, must search lyrics first - Universal lyrics sources by language: SLO: besedila.com, lyricstranslate.com DE: songtexte.com HR/SR/BS: tekstovi.net ES: letras.com, musica.com RO: versuri.ro IT: angolotesti.it FR: paroles.net EN: genius.com, azlyrics.com Universal: lyricstranslate.com (any language) - Search strategy: artist+title first, then transcript snippet fallback - Without lyrics, Claude cannot reliably identify chorus boundaries 2. Simplified selection rules - chorus is THE priority: - Chorus (full first occurrence) = MANDATORY - Pre-chorus = ONLY if 1-2 verse lines tightly connected to chorus - In doubt: just take chorus alone (12-25s) - Outro fillers explicitly multi-language: SLO 'aj ja ja' / 'ej ej ej' EN 'yeah' / 'oh oh' ES 'ay ay ay' RO 'hei hei' JA 'la la la' - 12-35s total range (was 15-35s, now allows shorter chorus-only clips) This makes the system language-agnostic: works the same way for Slovenian narodno-zabavna, Spanish reggaeton, Romanian manele, German Schlager, etc. The lyrics lookup is what makes it stable across languages.	2026-04-29 13:36:34 +00:00
Sebastjan Artič	4efd726176	Extend clip end past chorus to capture outro/sustained notes Problem: Claude was cutting clip exactly at last transcribed word of chorus, but in real songs: - Singer holds last note 1-3s longer (still meaningful) - Outro 'ej-ej-ej' / 'oh' / 'yeah' may not be transcribed as words - Result felt like 'incomplete chorus' even though SRT was correct Fix has two parts: 1. Prompt enhancement: - Ask Claude to add 1-2s padding AFTER last chorus word - Explicit example with timing math - Mention outro fillers (ej-ej-ej, oh, yeah) 2. Post-LLM extension logic: - After Claude returns clip range, scan corrected_segments for segments overlapping or starting just after current end - If next segment is within 1s pause and ends within max_duration+5s, extend clip to include it (with 0.3s breathing room) - Hard cap at max_duration + 5s to prevent unbounded extension This ensures chorus naturally trails off rather than being cut mid-emotional-peak.	2026-04-29 13:12:28 +00:00
Sebastjan Artič	81bae81401	Fix Scribe stopping mid-song: enable tag_audio_events=true + filter events out ROOT CAUSE FOUND: tag_audio_events=false caused Scribe to stop transcribing when instrumental music dominates (polka harmonica taking over from vocals). Real-world test on Avseniki - Ena bolha za pomoč (186s polka): - tag_audio_events=false: 20% coverage (37s only) — fails - tag_audio_events=true: 100% coverage (186s full) — works When tag_audio_events=true, Scribe inserts placeholder markers like '(glasba)' / '(plesalna glasba)' for instrumental sections instead of giving up. We then filter these out so they don't appear in subtitles. Filtering logic: - Skip word.type != 'word' (audio_event types) - Skip parenthesized text legacy fallback like '(music)', '(applause)' This is the core fix — no longer reliant on filename for transcription completeness. Even untitled files like '12345.mp4' now get full coverage.	2026-04-29 13:04:19 +00:00
Sebastjan Artič	7d00730051	Auto-detect language from filename for Scribe (no manual UI selection needed) Problem: Scribe was failing on Slovenian narodno-zabavna songs (Avseniki, Modrijani) because: - User doesn't manually pick language (everything is auto) - Scribe auto-detect had low confidence (0.58) on harmonika-heavy polka - Result: only 37s transcribed instead of full 186s song Solution: detect_language_from_filename() function: - Recognizes 60+ Slovenian artists (Avseniki, Modrijani, Veseli Dolenjci, ...) - Recognizes 30+ German artists (Ben Zucker, Helene Fischer, ...) - Recognizes 20+ Croatian/Serbian artists (Thompson, Severina, Lepa Brena, ...) - Falls back to keyword matching (volim, liebe, srce, herz, ...) - Detects character set (č/ž/š → SL, ä/ö/ü/ß → DE, đ → HR) - Score-based: 5pts for artist match, 1-2pts for keywords/chars When detected, sends language_code to Scribe explicitly: - Avseniki → 'slv' lock → no more half-transcribed songs - Ben Zucker → 'deu' lock → consistent German transcription - User still doesn't need to manually pick anything filename_hint flows: main.py → analyze.py CLI → transcribe_full → Scribe	2026-04-29 12:57:19 +00:00
Sebastjan Artič	40acad26f3	Crystal-clear chorus selection rules: pre-chorus build-up + FIRST chorus Previous rules were ambiguous and Claude was sometimes picking: - Just the chorus (no build-up) - Second chorus instance (lower energy than first) - Random verse + later chorus combinations New explicit priority order: 1. PRIMARY: pre-chorus verse (build-up) + first chorus (~20-35s total) 2. FALLBACK: just first chorus alone 3. LAST RESORT: dramatic peak section Strict rules: - ALWAYS first chorus (highest energy/recognition) - NEVER second/third chorus instances - NEVER skip between verses - NEVER extend over 35 seconds - Concrete example given: chorus@32s,16s long → pick 20-48s This fixes Veseli Dolenjci picking second chorus + post-chorus verse instead of natural pre-chorus build-up + first chorus.	2026-04-29 12:42:54 +00:00
Sebastjan Artič	5f90085981	Add Claude web_search tool for lyrics lookup + tighter subtitle timing 1. Claude API web_search tool integration: - Claude can now search web for actual lyrics when STT text is wrong - Especially useful for SLO/HR/BS/SR songs (Modrijani, Veseli Dolenjci) where Claude doesn't know lyrics from training data - Agentic loop: tool_use → server-side search → continuation → final text - Max 3 searches per job ($0.03 cost limit) - Hint sources: besedila.com, lyricstranslate.com, tekstovi.net, songtexte.com 2. Tighter subtitle segmentation from Scribe word timestamps: - Phrase boundaries on shorter pauses (0.4s vs 0.6s) - Sentence-ending punctuation triggers segment break - Max segment 4s (was 6s) for natural readable subtitles - Hard cap at 5.5s to prevent very long lines This fixes 'ples to noč' → 'ples pojoč' for Modrijani songs that Scribe transcribed phonetically wrong but Claude can fix via web lookup.	2026-04-29 12:24:17 +00:00
Sebastjan Artič	68247bb84c	Integrate ElevenLabs Scribe (best multilingual STT 2026) ElevenLabs Scribe replaces local Whisper as default transcription: - 96.7% accuracy English, 2.4% WER Indonesian (vs Whisper 7.7%) - 18x faster (200s song = 11s vs 3-5 min on CPU) - No hallucinations on songs (Whisper invented 'Pony und Kleid' for 'Bonnie und Clyde') - 99 languages supported, including SLO/HR/BS/SR - $0.40/h pricing, ~$0.022 per 200s song Implementation: - transcribe_with_elevenlabs() function uses Scribe v1 - ISO 639-1 ↔ 639-3 mapping (Scribe needs 'deu' not 'de') - Word-level timestamps converted to pseudo-segments (close on 0.6s pause or 6s duration) - 24MB upload limit guard with auto-fallback to local Default whisper_provider='auto': - If ELEVENLABS_API_KEY set → use Scribe - Otherwise → fallback to local faster-whisper - 'elevenlabs' strict mode: no fallback - 'local' strict mode: skip Scribe entirely Tested on Ben Zucker - Ohne dich: Scribe correctly transcribed 'Wir sind Bonnie und Clyde, zu allem bereit' where local Whisper hallucinated.	2026-04-29 12:03:40 +00:00
Sebastjan Artič	3ffa9740f0	Revert "Add Groq Whisper API integration (200x faster than local CPU)" This reverts commit `5c53a27d33`.	2026-04-29 11:19:31 +00:00
Sebastjan Artič	6a8f87b4a2	Revert "Filler detection: trim clip before la-la-la / instrumental medbridge" This reverts commit `4488717f6f`.	2026-04-29 11:19:31 +00:00
Sebastjan Artič	4488717f6f	Filler detection: trim clip before la-la-la / instrumental medbridge Problem: When a song has chorus → la-la-la medbridge → chorus structure, Claude was including the whole 40s+ block, with 18 seconds of la-la-la making the reel feel artificially extended. Fix: 1. Prompt enhancement: explicitly tell Claude NEVER to include la-la-la / ooh ooh / yeah yeah / instrumental fillers 2. Post-LLM detection: scan corrected_segments for repetitive content (>70% repeated words) and trim clip before that segment 3. Max duration guidance reduced from 45s → 35s in prompt This means: clip will end at the first chorus, not extend through fillers.	2026-04-29 11:17:16 +00:00
Sebastjan Artič	5c53a27d33	Add Groq Whisper API integration (200x faster than local CPU) Pipeline: - New transcribe_with_groq() function uses Groq's whisper-large-v3-turbo - 30s audio transcribed in ~0.5s (vs 30s+ on CPU) - Same quality as local Whisper (it's the same OpenAI model) - Cloudflare bypass via custom User-Agent header - 24MB upload limit guard with auto-fallback to local - Language auto-detect works (Groq returns full lang name, mapped to ISO codes) Default whisper_provider='auto': - If GROQ_API_KEY is set → use Groq (200x faster) - Otherwise → fallback to local faster-whisper - Strict 'groq' mode: no fallback (returns empty if Groq fails) - Strict 'local' mode: skip Groq entirely CLI: --whisper-provider {auto,groq,local} API: whisper_provider field in StartJobIn Cost: $0.04/h with whisper-large-v3-turbo ($0.002 per 200s song)	2026-04-29 11:08:15 +00:00
Sebastjan Artič	60765ad84c	Anti-hallucination: filename hint to LLM + beam search + silence threshold When Whisper hallucinates (generates fake lyrics not matching the audio), LLM can now use the original filename as a hint to recognize the song and override the false transcript with the actual lyrics. Pipeline: 1. Pass filename (e.g. 'Ben Zucker - Bonnie und Clyde') as hint 2. Whisper transcribes (may hallucinate) 3. Claude/Gemini reads filename + transcript: - Recognizes song from filename hint - Compares Whisper output to known lyrics - Replaces hallucinated text with real lyrics (preserves timestamps) - If can't fix, removes segment (better silent than wrong) Also added Whisper anti-hallucination params: - beam_size=5 (more careful decoding vs greedy) - hallucination_silence_threshold=2.0 (skip text in long silences)	2026-04-29 10:48:55 +00:00
OpenClaw Agent	0ca33be6ac	Fix: clip_range source dynamic from LLM result instead of hardcoded 'claude' Diagnosis: - analyze.py historically supported only Claude - when Gemini was added, clip_range.source stayed hardcoded as 'claude' - likewise the logs 'Whisper segmenti zamenjani s Claude' and 'Generated SRT from Claude' - the API result in the job showed source='claude' even when Gemini was actually used - this is only a COSMETIC bug: functionally everything worked correctly - Gemini WAS actually called (confirmed: '🤖 Gemini (gemini-3.1-pro-preview) izbral: 172.5-201.8s') and returned a correct result; only the logging reported it wrongly Fixes: 1. clip_range['source'] = claude_result['source'] (actually 'gemini:...' or 'claude:...') 2. clip_range['reason'] prefix changed from hardcoded 'claude_llm:' to dynamic '{source}:' 3. Log 'Whisper segmenti zamenjani s Claude' → 'z {llm_label}' 4. Log 'Claude je popravil jezik' → 'LLM je popravil' 5. main.py 'Generated SRT from Claude' → 'from {llm_src}' Test (Zlati Muzikanti - Le prijatelja bodiva, waltz, 246s): ✓ Gemini actually picks the chorus (172.5-201.8s) ✓ Whisper detects sl (p=0.97 across 3 samples) ✓ All 18 segments corrected ✓ Pipeline works end to end Backward compat: - transcript['claude_corrected'] and the srt_from_claude variable name are kept because they already exist in old job state files	2026-04-29 09:49:58 +00:00
OpenClaw Agent	e350352883	Fix: Gemini 3.1 Pro thinking model needs 32k maxOutputTokens (was 4096 → MAX_TOKENS truncation) Diagnosis: - Gemini 3.x Pro is a thinking model (it has internal reasoning, thoughtsTokenCount) - For large transcripts (songs with 60+ segments): * thoughts ~ 1500-3000 tokens * output JSON with corrected_segments ~ 3000-7000 tokens * total ~ 4500-10000 tokens - With maxOutputTokens=4096 the response was cut off (finishReason: MAX_TOKENS), the JSON was truncated halfway, and _parse_llm_response threw json.JSONDecodeError - Result: 'Gemini vrnil prazen string' in the logs Fixes: 1. Gemini maxOutputTokens 4096 → 32768 (enough for thinking + long JSON) 2. Diagnostics for finishReason==MAX_TOKENS and usage tokens in the logs 3. Detection of empty text (not just an empty parts array) 4. Claude max_tokens 4096 → 8192 (headroom for long songs) 5. Claude detection of stop_reason==max_tokens Test (60 segments, 5631 char prompt): - 4096 → finishReason=MAX_TOKENS, thoughts=2594, output=1488, JSON truncated ❌ - 16384 → finishReason=STOP, thoughts=1445, output=3040, JSON complete ✅ - 32768 → safe default ✅	2026-04-29 09:03:53 +00:00
Sebastjan Artič	ec71c54570	Upgrade to Sonnet 4.6 + add Gemini 3.1 Pro support - Refactored analyze_with_claude into shared _build_analysis_prompt + _parse_llm_response helpers - New analyze_with_gemini() using Gemini 3.1 Pro ($2/M in, MMMLU 92.6% — best multilingual) - Unified analyze_with_llm(provider) dispatcher with auto-fallback (Claude → Gemini) - API endpoint accepts llm_provider in StartJobIn (claude/gemini/auto) - Frontend dropdown to pick LLM - Default model is now Sonnet 4.6 (was Haiku 4.5) — 3x quality at 3x price (~3 cents/video) - Gemini support is opt-in: needs GEMINI_API_KEY env var to activate	2026-04-29 08:26:27 +00:00
Sebastjan Artič	9faa224885	Upgrade Claude model: Haiku 4.5 → Sonnet 4.6 for better Slavic language transcript correction	2026-04-29 08:22:10 +00:00
Sebastjan Artič	69fb2f5ce8	Upgrade default Whisper model: small/medium → large-v3 for much better Slovenian/Slavic transcription accuracy	2026-04-29 08:20:18 +00:00
Sebastjan Artič	4bc5ac6756	Major: Claude post-processing of Whisper transcript - Claude now corrects transcription errors (Slavic languages, dialects, mixed langs) - Returns corrected_segments with same timestamps but cleaner text - Pipeline generates SRT from Claude-corrected transcript and passes to subtitle.py via --srt - subtitle.py supports --srt to skip Whisper re-transcription on the trimmed clip - clip.py propagates --srt through to subtitle.py - Whisper still runs once (in analyze.py); subtitle.py reuses corrected output instead of re-running - This means: Whisper's mistakes (mixed langs, hallucinations, wrong words) are fixed by Claude before becoming visible subtitles	2026-04-29 08:13:33 +00:00
Sebastjan Artič	af3c933c78	Robust language detection + anti-hallucination - 3-sample voting for auto-detect (start/middle/end of song) prevents lang switching mid-song - Lock detected language for full transcription - Anti-hallucination: condition_on_previous_text=False, temperature=0.0 - compression_ratio_threshold=2.4 (rejects repetitive hallucinations) - log_prob_threshold=-1.0 (rejects low-confidence segments) - no_speech_threshold=0.6 (more aggressive silence detection) - Default Whisper model changed: small → medium (better for all langs incl. Slavic)	2026-04-29 07:59:20 +00:00
Sebastjan Artič	c870d80726	Fix: extend clip if ends mid-vocal (no chorus cut-off), DejaVu Sans font (supports SLO/HR/BS chars), auto-upgrade to medium Whisper model for Slavic languages	2026-04-29 07:35:00 +00:00
Sebastjan Artič	5d5e169f9d	Disable Whisper VAD filter — was dropping vocal segments in songs creating gaps in subtitles	2026-04-29 07:07:29 +00:00
Sebastjan Artič	a04811bdc9	Add Claude LLM analysis: sends full transcript to Claude API for true song structure understanding (refrain detection across all repetitions, not just local heuristic)	2026-04-29 06:55:41 +00:00
Sebastjan Artič	e072eec362	Fix: handle Whisper transcribe failure for instrumental-only audio (fallback to empty transcript)	2026-04-29 06:33:52 +00:00
Sebastjan Artič	33a138af9e	Fix: force native Python bool/float for JSON serialization (numpy types)	2026-04-29 06:23:41 +00:00
Sebastjan Artič	8512076b91	Major: smart selection pipeline (analyze.py) + audio fade + multi-lang auto-detect - New analyze.py: full transcript + energy + structural analysis - Smart clip range: includes pre-chorus, can exceed 30s up to max_duration (default 45s) - Audio fade in/out: auto-detected from vocal boundaries - Instrumental detection: auto-disables subs if vocals < 10% of duration - Multi-language: auto-detect via Whisper or explicit (DE/SL/HR/BS/SR/EN/IT/ES/FR) - Frontend: cleaner UX, added bs language, smart selection description - reframe.py: --fade-in --fade-out args - clip.py: propagates fade params - app/main.py: replaces find_chorus.py call with analyze.py	2026-04-29 06:21:35 +00:00

41 Commits
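
Commit af3c933c78 describes 3-sample voting for language auto-detection (start/middle/end of the song). The repository code is not shown in this log, so the following is only a minimal sketch of how such a vote could work. The function name vote_language, the (language, probability) input shape, and the tie-break rule are assumptions for illustration, not the project's actual implementation.

```python
from collections import Counter


def vote_language(samples):
    """Pick one language from per-sample (language, probability) guesses.

    Hypothetical helper: the language detected in the most samples wins,
    and summed probability breaks ties between equally frequent languages.
    """
    if not samples:
        raise ValueError("need at least one sample")

    # Count how many samples voted for each language.
    counts = Counter(lang for lang, _ in samples)

    # Sum the detection probabilities per language for tie-breaking.
    confidence = {}
    for lang, prob in samples:
        confidence[lang] = confidence.get(lang, 0.0) + prob

    return max(counts, key=lambda lang: (counts[lang], confidence[lang]))


# Made-up detections from the start, middle and end of a song.
samples = [("sl", 0.97), ("hr", 0.55), ("sl", 0.91)]
locked_language = vote_language(samples)
print(locked_language)
```

The winning language would then be passed explicitly to the full transcription run, which is one way to get the "lock detected language" behaviour the commit mentions.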