User: 'The preset is already built anyway, so don't overcomplicate it. Everything is the same for reels.
Make subtitles ON by default (box unchecked). Checked = no subtitles.
Bake in the rest, there is nothing for us to change.'
Hidden defaults (baked-in preset for all reels):
- mode=track
- quality=medium
- llm-provider=claude
- whisper-model=large-v3
- subtitle-style=reels
- auto-chorus=true (data-checked)
- include-prebuild=false (data-checked)
- duration=30
Visible only:
- TV station (tabs): where the reel goes on Nextcloud
- Unchecked = subtitles burned INTO THE VIDEO (default)
- Checked = no subtitles
Side note: 'unchecked = subtitles' is the reverse of the previous default (checked = no subtitles).
The default is now 'with subtitles' (no-subs unchecked). Persistence in
localStorage keeps the choice across reloads.
Bonus: removed the 'Auto-chorus toggle' handler, which looked up #manual-times,
an element that no longer exists.
User feedback: 'when I refresh, the no-subtitles checkbox gets switched back
on, that is going to be a problem'
Saved fields:
- no-subs (checkbox)
- auto-chorus (checkbox)
- include-prebuild (checkbox)
- mode (track/center/blur)
- quality (fast/medium/high)
- llm-provider (claude/gemini/auto)
- tv-station (FOLX SLOVENIJA / ONE DE / ...)
On page load: re-applies saved values
On change: saves to localStorage
TV station: also updates the active tab style
Defaults stay the same if they were never changed.
User feedback: 'and I don't like those emojis you put next to Adria and ZWEI Music.
Now clean up what has already been made and let's start uploading.'
UI changes:
- Tab labels: 'FOLX SLOVENIJA', 'FOLX DE', 'ONE DE', 'ZWEI MUSIC', 'ADRIA'
(without emoji prefixes)
- Job card badge: no emoji prefix
Cleanup (manual via container):
- /data/jobs/*.json (15 → 0)
- /data/outputs/* (132 → 0)
- /data/uploads/* (15 → 0)
The dedup database is kept (15 records), so a user who tries to upload
a song that is already on Nextcloud gets a warning.
User feedback: 'when it is unchecked there are subtitles, so just write: disable
subtitles, and when it is checked they are disabled'
Before: 'No subtitles (default, more reliable)'
After: 'Disable subtitles (checked = no subtitles · unchecked = subtitles burned into the video)'
The logic stays the same (no_subs flag); only the label is clearer.
User feedback: 'add that it checks and stores already processed songs in an SQL database,
so that when we upload a song we have already uploaded, it does not upload it again'
NEW: SQLite dedup database at /data/processed.db
Schema: processed_videos
- normalized_name (PK part 1)
- tv_station (PK part 2): the same song can exist on different stations
- filename_orig
- job_id
- nextcloud_url
- file_size_mb
- uploaded_at
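A minimal sketch of how this table could be declared and queried with the standard sqlite3 module. The column types, the helper names (open_db, record_upload, is_processed) and the use of INSERT OR REPLACE are assumptions for illustration; only the column names and the two-part primary key come from the notes above.

```python
import sqlite3

# Column types are assumed; the notes only list the column names.
SCHEMA = """
CREATE TABLE IF NOT EXISTS processed_videos (
    normalized_name TEXT NOT NULL,
    tv_station      TEXT NOT NULL,
    filename_orig   TEXT,
    job_id          TEXT,
    nextcloud_url   TEXT,
    file_size_mb    REAL,
    uploaded_at     TEXT,
    PRIMARY KEY (normalized_name, tv_station)
)
"""


def open_db(path=":memory:"):
    """Open (or create) the dedup database and ensure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_upload(conn, normalized_name, tv_station, **fields):
    """Insert or overwrite the record for one (song, station) pair."""
    conn.execute(
        "INSERT OR REPLACE INTO processed_videos "
        "(normalized_name, tv_station, filename_orig, job_id, "
        " nextcloud_url, file_size_mb, uploaded_at) "
        "VALUES (?, ?, ?, ?, ?, ?, datetime('now'))",
        (
            normalized_name,
            tv_station,
            fields.get("filename_orig"),
            fields.get("job_id"),
            fields.get("nextcloud_url"),
            fields.get("file_size_mb"),
        ),
    )
    conn.commit()


def is_processed(conn, normalized_name, tv_station):
    """True if this song was already uploaded for this station."""
    row = conn.execute(
        "SELECT 1 FROM processed_videos "
        "WHERE normalized_name = ? AND tv_station = ?",
        (normalized_name, tv_station),
    ).fetchone()
    return row is not None
```

Because the key is the pair (normalized_name, tv_station), the same song can be recorded once per station without conflict.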
Filename normalization removes noise:
'BRAJDE (Official Video).mp4' → 'brajde'
'Brajde (HD).mxf' → 'brajde'
'BRAJDE - LIVE 2024.mp4' → 'brajde'
(strips parentheses, suffixes like Official/HD/4K/Live, extension, lowercase)
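The normalization above could look roughly like the sketch below. The function name normalize_name and the exact noise-word list are assumptions; the real implementation may strip more (or different) suffixes.

```python
import re
from pathlib import PurePath

# Assumed noise words; the notes mention Official/HD/4K/Live as examples.
_SUFFIX_WORDS = r"(?:official|video|hd|4k|live|\d{4})"


def normalize_name(filename):
    """Reduce a filename to a lowercase dedup key without format noise."""
    name = PurePath(filename).stem.lower()             # drop extension
    name = re.sub(r"[\(\[][^\)\]]*[\)\]]", " ", name)  # drop (...) and [...]
    # Drop trailing noise words, each optionally preceded by a dash.
    name = re.sub(rf"(?:\s*-?\s*\b{_SUFFIX_WORDS}\b)+\s*$", "", name)
    return re.sub(r"\s+", " ", name).strip(" -")
```

All three filenames listed above reduce to the same key with this sketch, which is what makes the dedup lookup independent of the file's format markers.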
NEW endpoints:
- POST /api/dedup/check: checks which names have already been processed
- POST /api/dedup/remove: deletes a dedup record (Re-process)
- GET /api/dedup/list: lists all processed entries (optional filter by tv_station)
Integration:
- Nextcloud upload (manual + auto): records the entry in dedup after a successful PUT
- File queue (frontend): checks dedup before adding a file
→ shows a red '⚠ already uploaded to ONE DE (29.4.2026)' warning with a Re-process option
→ opacity 0.6 (visually faded)
→ submit SKIPS them (unless 'Re-process' was clicked)
User feedback: 'for now we will fix and review things manually. When we click
Save, it gets stored in the right folder and disappears.'
Workflow:
1. User picks the TV station (tab)
2. Uploads songs
3. Reel is rendered (auto chorus)
4. User reviews + Edit if needed
5. Save → re-render with the user's corrections
6. After the re-render finishes: AUTO-upload to Nextcloud /folxspeed/REELS/{station}/
7. Reel DISAPPEARS from the list (hidden_after_upload flag)
Backend changes:
- RecutRequest: new field auto_upload (default True)
- update_job: stores auto_upload_to_nextcloud
- process_job done block: if the flag is set + Nextcloud is configured →
upload + nextcloud_status='uploaded' + hidden_after_upload=True
Frontend changes:
- refreshJobs: filter out jobs with hidden_after_upload
- TV station badge on every card (emoji + station name)
- You can see at a glance where the reel will go
Workflow result: after Save the reel disappears and is automatically in the right folder
User feedback: 'Sarah Connor and Abracadabra go to ONE DE, not FOLX SLO.
Make tabs for each TV station (FOLX SLO, FOLX DE, ONE DE, ZWEI MUSIC, ADRIA)
and have the upload go to the matching Nextcloud subfolder.'
Backend changes:
- StartJobIn + YouTubeJobIn: new field 'tv_station' (default 'FOLX SLOVENIJA')
- update_job: stores tv_station in the job JSON
- POST /api/jobs/{id}/upload-nextcloud:
reads tv_station from the job, target_subdir = folxspeed/REELS/{station}
Frontend changes:
- 5 TV station tabs: FOLX SLOVENIJA (active), FOLX DE, ONE DE, ZWEI MUSIC, ADRIA
- Hidden input #tv-station-input holds the current selection
- Clicking a tab activates it (accent color)
- collectSettings() includes tv_station
Manual fix: job.json for Sarah Connor and Abracadabra corrected → tv_station=ONE DE
Reset their nextcloud_* fields so the upload goes to the right folder.
The endpoint /api/jobs/{id}/upload-nextcloud already existed (commit dbb8ab3)
and works. My new variant was a duplicate, so it was removed.
Active endpoint: line 1593 (upload_nextcloud), which uses the
_nextcloud_upload() and _nextcloud_configured() helpers.
We now have:
- _safe_filename_for_nextcloud() helper (kept, may come in handy)
- Frontend button '☁ Nextcloud' with 4 states (default/uploading/uploaded/failed)
- 14 reels already uploaded successfully to /folxspeed/REELS/FOLX SLOVENIJA/
- NEXTCLOUD_FOLDER env updated: folxspeed/REELS → folxspeed/REELS/FOLX SLOVENIJA
- urllib.parse.quote() each segment (handles spaces in folder names)
- e.g. 'FOLX SLOVENIJA' → 'FOLX%20SLOVENIJA' in URL
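The per-segment quoting can be sketched as follows. The function name build_upload_url and the example base URL are hypothetical; only urllib.parse.quote and the folder layout come from the notes.

```python
from urllib.parse import quote


def build_upload_url(base_url, folder, filename):
    """Join folder and filename onto base_url, percent-encoding each segment.

    Quoting segment by segment keeps the '/' separators intact while
    spaces inside a folder or file name become %20.
    """
    segments = [s for s in folder.split("/") if s] + [filename]
    encoded = "/".join(quote(s, safe="") for s in segments)
    return f"{base_url.rstrip('/')}/{encoded}"
```

Quoting the whole path in one call with safe="" would also escape the slashes, which is why each segment is encoded on its own.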
User feedback: 'there are a whole bunch of things here that are not true, how should we word this?'
Old text was misleading:
- 'Whisper, 3-sample voting' → not used since Soniox integration
- 'Model: medium' → irrelevant (Whisper not used)
- 'Whisper + energy → najde refren' ('finds the chorus') → now Soniox + Claude LLM
New text reflects actual stack:
- STT: Soniox (primary) → ElevenLabs Scribe → Gemini fallback
- LLM: Claude Sonnet 4.6
- Energy profile + word-level timestamps + 15 reference examples
- Mention ✏️ Edit button for manual fine-tuning
User feedback: 'when we want to check the end, play only 5s.
The start is not a problem because play starts from the beginning'
NEW button: ▶ Konec (5s) — green
- Seeks to (trimEnd - 5s)
- Plays from there to trimEnd
- Auto-stops at trimEnd (existing logic)
- Quick way to verify if 'OUT' position is correct without
waiting for full clip playback (which can be 30-60s)
Renamed: '▶ Predvajaj odsek' ('Play section') → '▶ Predvajaj cel' ('Play all') for clarity
(plays full clip from start to end)
Workflow now:
- Adjust handles
- '▶ Predvajaj cel' to hear whole clip (when needed)
- '▶ Konec (5s)' to quickly check if end is right
- Iterate handles until perfect
- Save
Bug: triangles positioned at top:-14px were outside trim bar bounds.
Trim bar has overflow:hidden, so triangles were clipped (invisible).
Fix: top:0 (inside trim bar, at the very top edge).
Triangle 14px tall now sits at top of trim bar (overlapping waveform
slightly but visible, with drop-shadow to make them stand out).
User feedback: Workflow is - click + Enter sets a marker triangle, then
button moves the red handle to that triangle. Triangle near LEFT handle
= IN candidate (green), near RIGHT = OUT candidate (red).
Visual:
- Green triangle (▼) above trim bar = IN candidate position
- Red triangle (▼) above trim bar = OUT candidate position
- White line (playhead) = current video position (moves during playback)
- Red handles (existing) = actual clip start/end
Workflow:
1. Click on waveform → white playhead jumps there
2. Press Enter → playhead starts moving (plays)
ALSO: triangle gets placed at current position
- If position closer to LEFT handle → green IN triangle
- If position closer to RIGHT handle → red OUT triangle
3. Listen, decide 'this is the right spot'
4. Click ▼ Postavi IN button → red LEFT handle jumps to green triangle
(or ▼ Postavi OUT for right handle)
5. Now red handle and triangle are aligned = clip boundary committed
Triangles persist until next play press (= next candidate).
Buttons styled with matching color (green for IN, red for OUT).
User feedback: 'it should not start playing immediately. It should start when I press
Enter, and the position should stay marked by the line because we will put the tracker there'
Changes:
- Click on waveform: just seek + render playhead (was: seek + auto-play)
- Click on segment row: just seek + render playhead (was: seek + auto-play)
- Playhead: brighter, with triangle marker on top (tracker placeholder)
- Enter key: play/pause toggle from current position
- Space key: also play/pause (back-compat)
- Hint texts updated to reflect new workflow
Workflow now:
1. Click on waveform/segment → playhead jumps there (no sound)
2. Read transcript, look at waveform around the position
3. Press Enter → plays from there
4. Press Enter again → pauses
5. Click somewhere else → playhead moves there (paused)
6. Press Enter → plays from new position
Allows precise positioning before commit to playback.
User feedback: 'I can't do anything else while a reel is exporting?
Would it be better with bigger machines?'
Without GPU upgrade, optimize CPU usage:
1. PARALLEL WORKERS:
- Was: 1 worker thread, processes 1 job at a time
- Now: NUM_WORKERS=3 parallel threads (configurable via env)
- Each worker locks its job atomically (set instead of single var)
- 3 reels render simultaneously instead of sequentially
- Edit feature usable while other reels render
2. PRE-CACHE EDIT ASSETS:
- On job done, fire-and-forget ffmpeg subprocess.Popen for:
* low-q source video (480p) — used in Edit modal video player
* waveform PNG (2400x72) — used in Edit modal trim bar
- Both run in background, don't block pipeline
- When user later clicks Edit, assets already cached → modal instant
- On-demand fallback still works if precache failed
Result: Edit modal opens instantly even while other reels render.
3 reels can render in parallel = ~3x throughput on multi-core CPU.
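A rough sketch of the set-based claiming described in point 1, using only the standard library. NUM_WORKERS comes from the notes; try_claim, release and run_workers are hypothetical names, and the real worker loop probably reads jobs from disk rather than from a list.

```python
import os
import threading

NUM_WORKERS = int(os.environ.get("NUM_WORKERS", "3"))

_active_jobs = set()              # ids of jobs currently being processed
_active_lock = threading.Lock()   # guards _active_jobs


def try_claim(job_id):
    """Atomically mark a job as taken; False if another worker has it."""
    with _active_lock:
        if job_id in _active_jobs:
            return False
        _active_jobs.add(job_id)
        return True


def release(job_id):
    """Free a job id once its worker is finished with it."""
    with _active_lock:
        _active_jobs.discard(job_id)


def run_workers(job_ids, process):
    """Process job_ids with NUM_WORKERS threads; blocks until all are done."""
    pending = list(job_ids)
    pending_lock = threading.Lock()

    def worker():
        while True:
            with pending_lock:
                if not pending:
                    return
                job_id = pending.pop(0)
            if not try_claim(job_id):
                continue  # another worker is already rendering this job
            try:
                process(job_id)
            finally:
                release(job_id)

    threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Plain threads are enough here under the assumption that the heavy work runs in ffmpeg subprocesses, so the Python threads mostly wait.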
User feedback:
1. 'The waveform is too small, add zoom'
2. 'After we set a position, play starts from the beginning, so we cannot play from there'
NEW Zoom feature:
- 5 zoom levels: 1x, 2x, 5x, 10x, 20x
- Trim bar wrapped in scrollable container
- On zoom: bar width grows to 100*N%, scroll auto-centers on trim region
- Higher zoom = more pixels per second = micro-tuning possible
(1x: 5px/s, 20x: 100px/s for 4min song)
- Active zoom button highlighted accent red
NEW Play-from-position:
- Click on waveform/trim bar = playhead JUMPS THERE + auto-plays
(was: just moved playhead, no play)
- Space key = play/pause toggle from current position
(works anywhere except in input fields)
- '▶ Predvajaj odsek' still does start-to-end of selection
- Cleanup keydown listener on modal close
Waveform now rendered at 2400x72 (higher res) so zoom looks crisp.
User can now:
- Zoom 10x to see exact word boundaries in waveform
- Click anywhere → instant play from there
- Hit Space to toggle while watching
User feedback: 'if we had the waveform at the bottom we could navigate by it,
and make the lyrics run on the right side along the edge of the video'
NEW: backend /api/waveform/{id}?width&height
- ffmpeg showwavespic generates PNG (~10-50KB)
- Cached forever per song
- Red color (#ff6b6b) matching accent
Frontend layout RESTRUCTURED:
- 1200px max-width (was 900px)
- Top section: GRID 1fr / 320px
- LEFT: video (16:9)
- RIGHT: subtitle panel (sticky header, scrollable, 55vh max)
- Bottom: trim bar full-width with WAVEFORM as background image
- Hint text updated: 'Klik na valove ali napise = skoči video' (click the waveform or a subtitle = seek the video)
INTERACTIONS:
- Click segment row → seekToSegment() jumps video to that timestamp
- Live highlight: gold (#ffd700) on currently playing segment
- Auto-scroll panel to keep active segment in view
- Drag handles updates segment row colors (in-clip = red bg, outside = gray)
- Click on trim bar (waveform) still works as seek
User can now:
- See visual audio shape (loud parts = vocals, quiet = instrumental)
- See ALL subtitles at once on the right
- Click any subtitle to jump to it
- Watch live highlight follow the song
- Edit any subtitle text inline
User feedback: 'it plays the section and starts from zero, which is not OK; we cannot
keep moving further and further left... would we need a low-q version for that?'
REPLACED render-on-demand approach with low-q source download:
1. Backend: GET /api/source-video/{id}?quality=low
- 480p re-encode of full source (cached after first request)
- veryfast preset, CRF 28
- First request: ~5-10s (depends on song length)
- Subsequent: instant (cached)
2. Frontend: Edit modal loads ?quality=low
- 'Pripravljam predogled (~5s prvič, potem instant)' status (preparing preview: ~5s the first time, then instant)
- Once loaded: ALL preview is client-side instant
- 'Predvajaj odsek' jumps to trimStart and plays
- Auto-stop at trimEnd (loops back)
- Drag handles DURING playback = instant seek (browser scrubs in 5MB)
- Drag NOT blocked during play (you can fine-tune in/out live)
3. Removed old /api/preview-clip endpoint logic (no longer needed)
Note: kept the route as cache cleanup for old jobs
Workflow now:
- Open Edit → 5s wait first time
- Drag handles freely (instant scrubbing)
- Click Predvajaj → starts at trimStart immediately
- Drag handles WHILE playing → live preview
- Save when satisfied → 70s full render
Bugs from puppeteer inspection:
1. Old buggy renders left 0-byte cache files behind. New code never
re-rendered because cache_path.exists() was True.
Fix: validate cache file is >1KB, otherwise re-render.
2. FastAPI @app.get only handles GET, not HEAD. Frontend's HEAD check
returned 405, then GET re-rendered (correct), but second click also
returned 405 then 200 again — confusing.
Fix: use @app.api_route with methods=['GET', 'HEAD']
3. If ffmpeg fails partway, broken file remains in cache.
Fix: unlink on any failure path.
Also deleted existing empty cache files in container.
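Fixes 1 and 3 can be sketched together as below. The names cache_is_valid, render_cached and MIN_VALID_BYTES are hypothetical, and cmd stands for whatever ffmpeg command writes cache_path; only the >1KB check and the unlink-on-failure rule come from the notes.

```python
import subprocess
from pathlib import Path

MIN_VALID_BYTES = 1024  # anything smaller is treated as a broken render


def cache_is_valid(path):
    """A cache file counts only if it exists and is larger than 1KB."""
    return path.is_file() and path.stat().st_size > MIN_VALID_BYTES


def render_cached(cache_path, cmd):
    """Return cache_path, running cmd to (re)create it when needed.

    cmd is assumed to be a command list that writes cache_path.
    """
    if cache_is_valid(cache_path):
        return cache_path
    cache_path.unlink(missing_ok=True)  # remove stale 0-byte leftovers
    try:
        subprocess.run(cmd, check=True, capture_output=True)
    except (subprocess.CalledProcessError, OSError):
        cache_path.unlink(missing_ok=True)  # never leave a partial file
        raise
    if not cache_is_valid(cache_path):
        cache_path.unlink(missing_ok=True)
        raise RuntimeError(f"render produced no usable file: {cache_path}")
    return cache_path
```

Checking the size instead of mere existence is what lets old 0-byte files trigger a fresh render.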
Bug: 'width not divisible by 2 (853x480)' from screenshot.
libx264 requires even width/height. scale=854:480 + decrease can result
in 853x480 (odd width).
Fix: chain second scale filter that truncates to nearest even number:
scale=trunc(iw/2)*2:trunc(ih/2)*2
Verified locally: 4.4MB clip in 4.8s on CPU.
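To see why the second scale stage helps, the sketch below mirrors the arithmetic in Python. PREVIEW_FILTER is an assumed spelling of the filter chain (using ffmpeg's force_original_aspect_ratio=decrease option), and even_scaled_size only approximates ffmpeg's own rounding.

```python
# Assumed filter chain: fit inside 854x480, then truncate to even numbers.
PREVIEW_FILTER = (
    "scale=854:480:force_original_aspect_ratio=decrease,"
    "scale=trunc(iw/2)*2:trunc(ih/2)*2"
)


def even_scaled_size(src_w, src_h, max_w=854, max_h=480):
    """Approximate the output size of PREVIEW_FILTER for a given source."""
    ratio = min(max_w / src_w, max_h / src_h)      # fit inside the box
    w, h = round(src_w * ratio), round(src_h * ratio)
    return (w // 2) * 2, (h // 2) * 2              # same as trunc(x/2)*2
```

For a 1920x1080 source the first stage lands on 853x480, and the truncation step brings the width down to 852, which libx264 accepts.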
User feedback: 'the fact is it takes long because it first has to render? That takes time?
We would need a seriously powerful machine for that, right?'
Solution before GPU upgrade: live preview that renders just the selected
range as low-quality 480p clip. ~2-3s instead of ~70s full reel render.
NEW endpoint: GET /api/preview-clip/{job_id}?start=X&end=Y
- ffmpeg fast extract (no reframe, no subtitles, no face tracking)
- 480p ultrafast x264 preset, CRF 30
- Cached per job+range (re-clicks are instant)
- ~2-3s on CPU
Frontend:
- '▶ Predvajaj odsek' button now triggers preview-clip render
- Shows status: '🎬 Renderiram odsek... (~3s)'
- After render: video element switches to preview src
- User sees EXACTLY what reel will contain (just without face track)
- Subsequent clicks on same range are instant (cached)
Workflow:
- Drag handles → click '▶ Predvajaj odsek' → 3s wait → see + hear it
- Iterate fast: drag → preview → drag → preview
- Final '✅ Shrani in re-render' only when satisfied (~70s full render)
- Big '▶ Predvajaj odsek' button: plays from trim start
- Auto-stop when video reaches trim end (loops back to trim start)
- iPhone trim preview behavior: see exactly what reel will contain
Screenshot revealed: trim bar element has only 4px width even after
ResizeObserver fires. Likely the parent (.modal-content) is a flex
container that shrinks the trim-bar.
Force trim bar to take full width with width:100% and prevent shrinking
with flex-shrink:0.
Root cause found via puppeteer inspection:
- trimBarWidth was 4px when renderTrim() ran
- That made calc(32.64% - 12px) = ~-10px, putting handles offscreen left
Modal element gets actual width AFTER appendChild + browser layout pass.
Original code called renderTrim() synchronously right after appendChild,
before the modal had real dimensions.
Fix:
1. Use ResizeObserver on trim-bar to re-render whenever it gets actual width
2. Also use double requestAnimationFrame as fallback (waits for layout)
Verified via puppeteer:
Before: leftStyle='calc(32.6443% - 12px)' but trimBarWidth=4
After: handles correctly positioned within visible bar
JS renderTrim() likely failed silently (a TypeError such as "Cannot read properties of undefined (reading 'length')").
Set handle positions inline in HTML template so they show immediately
without waiting for renderTrim() to fire.
Added pctOfStr helper to compute percentage as string for inline style.
Bug from screenshot: trim bar visible but red handles not showing.
Causes:
1. video_duration in job is None for old jobs (was not saved on initial
processing). Without it, fallback was endInit+60 which placed handles
off-screen.
2. videoDuration was const, couldn't be updated when video metadata loads.
3. Handle offset was 9px but handles are now 24px wide (need 12px offset).
Fixes:
- Backend /api/transcript: fallback to last segment end time if
video_duration missing in job
- Frontend: videoDuration is let, updated on loadedmetadata
- Handle offset 9px → 12px for 24px wide handles
- Re-render trim after metadata loads to pick up actual video.duration
Bug from screenshot: trim bar was invisible due to:
- Background rgba(255,255,255,0.05) too transparent
- Handles 18px width with low contrast
- Removed video controls
Fixed:
- Trim bar background #1a1a1a + 2px #444 border (visible)
- Handles 24px width, full red #ff6b6b with strong glow
- Region 35-20% opacity (brighter)
- Playhead 3px white with shadow (visible)
- Restored video controls
- Added hint text below trim bar
User feedback: 'like they have on the iPhone: I drag in from the left and from the
right toward the end... and the reel has to come already positioned'
Replaced 2 separate range sliders with iPhone-style trim bar:
- Single horizontal bar showing full video duration
- 2 draggable handles (left = start, right = end)
- Selected region highlighted in accent color
- Live playhead during playback
- Mouse + touch support
- Click anywhere on bar = seek to that position
- Initial state: handles positioned at auto-selected clip range
(just fine-tune left/right, no need to set from scratch)
formatTime helper for nice m:ss.c display.
User insight: 'we need to make it so that once the reels are made we can
fix them... we run on automatic, but I can also make corrections'
The auto pipeline stays (Soniox + Claude + render). After rendering, the user
can click the ✏️ Edit button and:
1. **Slider for clip start/end**:
- Sees the 16:9 original video
- Drags the start/end slider with a live preview
- Duration shown in real time
- Min 5s, max 60s
2. **Subtitle editing** (collapsed, optional):
- Click a row → input for correcting the text
- The original timestamp stays; only the text is updated
- Useful for corrections like 'doline IZBOR' → 'doline IZPOD'
3. **Re-render**:
- Backend POST /api/jobs/{id}/recut with {start, end, custom_segments}
- Worker skips Soniox + Claude (custom_clip flag)
- Reuses the cached transcript + analysis
- Re-renders only: clip → reframe → subtitle → output
- ~30s instead of 3-5 min
New endpoints:
- GET /api/source-video/{id}: 16:9 original for the editor preview
- GET /api/transcript/{id}: segments + clip range for the editor
- POST /api/jobs/{id}/recut: re-render with user timestamps
Worker change: if the job has custom_clip=True, skip the auto_chorus
analysis and just reuse the existing clip_range from analysis.json
(updated by recut endpoint).
User insight: 'these are my decisions, my feeling I think, and yes, these are choruses
.. how do we teach it my feeling'
Solution: few-shot learning. Instead of trying to express Sebastjan's
feeling in abstract rules, give the LLM 15 concrete examples of his
choices as a reference table.
Each example contains:
- Song title
- Exact chorus text (Sebastjan's choice)
- Feeling note (his reasoning: 2x repetition, 1 full chorus, NO outro,
intro shouts included, etc.)
Coverage:
- Fast songs: BRAJDE, PA PA, PIJAN, FICKA, ABRACADABRA
- Slow NZ (narodno-zabavna, Avsenik): CVETELE, ENA BOLHA, ŽENA ME TEPE
- Fehtarji: GOJZAR TANC, GORENJSKA, PODEŽELSKI
- Schlager/pop: STISN SE K MEN, KO MISLIM NATE, NA LEPO SKUPNO POT
- Sentimental NZ: DOMOTOŽJE V POMLADI
Plus key feeling notes:
✅ DO include: intro shouts (Ajmo Janezi/Slovenci), 2 consecutive
choruses (~30s), natural outro filler (short), the singer's held note
❌ Do NOT include: verses, pre-chorus, 2 different choruses mixed together,
long outro filler (5+s), instrumental break
For NEW songs the LLM will imitate this feeling from the 15 examples.
User feedback: 'Get rid of the rule that the chorus has to start on the word that is in the title'
Problem: Many Slovenian folk-pop songs have the title in the VERSE,
not in the chorus:
- 'Cvetele so maline' → title is in verse, real chorus is 'Naj veter zdaj...'
- 'Domotožje v pomladi' → title is theme, real chorus is 'Bele breze...'
Old prompt forced LLM to find title phrase in chorus, leading it to
pick verse parts (mid-line, wrong timing) just because they contained
the title.
Changes:
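1. REMOVED forced rule: 'Naslov pesmi = REFREN HOOK (80-90% primerov)'
(song title = CHORUS HOOK in 80-90% of cases)
2. NEW guidance: 'Naslov pesmi je VČASIH v refrenu, VČASIH v verzu. NE silujte!'
(the title is SOMETIMES in the chorus, SOMETIMES in the verse; do NOT force it)
3. NEW principle: 'Refren je tisti del ki se PONAVLJA 2-3x z ENAKIM besedilom'
(the chorus is the part that REPEATS 2-3x with the SAME lyrics)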
4. Fixed CVETELE example: chorus is 'Naj veter zdaj ponese...' (not Cvetele)
with explicit warning that title is in VERSE 2 at ~125s
5. Added: 'NE izberi outro/3. nastop — izberi PRVI nastop refrena'
This should let LLM find the actual repeating chorus instead of
chasing the title phrase into verses.
User feedback after re-processing 14 reels:
- 8 perfect (BRAJDE 100%, FICKA, Abracadabra, Žena, Stisn, PA PA,
Gojzar tanc, GADI Pijan)
- 4 problematic patterns identified:
1. CVETELE: clip extends 22s into instrumental on 'nekoč oba' hold
2. GORENJSKA LJUBLJENA: clip starts mid-line at 'obrnem nazaj'
instead of 'V Ljubljani se obrnem nazaj'
3. Fantje: clip starts mid-chorus at 'vabijo me' (2nd line)
instead of first line
4. PODEŽELSKI: extends into 'o o o' outro filler
Common cause: Soniox can group end-of-verse + start-of-chorus into
same segment (e.g., '[43.6-47.6] doma. V Ljubljani se'), and Claude
picks segment.start (43.6) or next segment.start (48.2) instead of
the actual word 'V' boundary inside the segment.
Prompt fix:
1. NEW critical rule: 'clip start = TOČNO prva beseda PRVE vrstice' (EXACTLY the first word of the FIRST line)
2. Warning about Soniox merging end-of-verse + start-of-chorus
3. Use word-level timestamps to find chorus start word
4. Warning about long held tones in Soniox segments (15-20s on
'oba', 'doma', 'srca' due to fade-out instrumental)
5. Cut 1-2s after last sung word, don't wait 20s for tone to die
6. Outro filler: include short outros (yeah/aj-aj), but cut before
long repeating outros (5+s of 'o o o') as those are fade-out
Added concrete examples in PRIMERI:
- BRAJDE: 28s (already perfect)
- GORENJSKA: explicit warning about 'V Ljubljani se' boundary
- CVETELE: explicit warning about 15-20s held tone segments
This is a prompt-only change. No code logic modified.
LLM still has full autonomy on duration.
User feedback: 'Here the LLM has to think and have a feeling for what to put in'.
With Soniox transcript now accurate, LLM has all info to decide content-wise.
TWO CHANGES:
1. smart_clip_range() — REMOVED forced extension logic:
Before: if duration < min_duration (20s):
- extend to next chorus (40% match) ← WRONG! merged with B-chorus
- extend symmetrically into VERSE ← WRONG! brought in kitica
- cap at max_duration
After: trust LLM completely. Only safety: clamp to video bounds.
2. Prompt rewrite — content-driven instead of number-driven:
Before: 'Skupna dolžina: 12-25 sekund (običajno)' + conflicting '~30s'
'❌ Drugi/tretji nastop refrena — uporabi PRVI'
After: '~30 sekund (NAJBOLJŠA opcija = dva zaporedna refrena)'
'Vključi naravne intro klice (Ajmo Janezi! Hey! Pa-pa!)'
'BRAJDE primer: 41.8-69.8s = 28s (dva refrena z Ajmo Janezi intro)'
'NE meša 2 RAZLIČNA refrena (A + B = napaka)'
'NE razširi v VERZE/KITICE'
For BRAJDE this means:
- Old: Claude picked 57.1-69.8s (12.7s, 2nd chorus, no Ajmo)
Code forced extension to 57.06-82.5s (mixed with B-chorus + verse)
- New: Claude picks 41.8-69.8s (28s, 2 choruses with 'Ajmo Janezi!' intro)
Code returns exactly that — no forced extension.
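The remaining safety net after this change is just a clamp. A minimal sketch, with clamp_clip_range as a hypothetical name (the real smart_clip_range() signature is not shown in these notes):

```python
def clamp_clip_range(llm_start, llm_end, video_duration):
    """Trust the LLM's range; only keep it inside [0, video_duration]."""
    start = max(0.0, min(llm_start, video_duration))
    end = max(start, min(llm_end, video_duration))
    return start, end
```

For the BRAJDE range 41.8-69.8s this returns the input unchanged, with no extension toward a minimum duration.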
Bug: BRAJDE reel showed subtitles 2-3 seconds out of sync with audio.
Soniox returned correct word timestamps:
- 'Ajmo,' at 41.82s
- 'Janezi!' at 42.18s
- 'Pejd' greva, ajde,' at 43.44-44.40s
But generate_srt_from_segments() ignored word timestamps and split long
segments into evenly-spaced 2.5s chunks based on segment duration:
    chunk_dur = duration / n_parts   # assumes even pacing
    for i in range(n_parts):
        cs = rel_start + i * chunk_dur
This produces wrong timing because singers don't sing evenly. Real audio
had 'Ajmo, Janezi!' in 0.9s and 'Pejd' greva, ajde, na traktorju od Majde'
in 6s — the 2.5s chunks didn't align with vocals.
Fix: when word-level timestamps are available (Soniox/Scribe), group
words into chunks where each chunk's start/end match the actual first/last
word timestamps. Each chunk is at most MAX_CHUNK_DURATION (2.5s) but
respects natural word boundaries.
Before:
00:00.000 → 01.900 AJMO, JANEZI! PEJD' GREVA, AJDE, NA TRAKTORJU OD
00:01.900 → 03.800 MAJDE, NOBEN NAJU NE NAJDE, KO PELJEM TE
After:
00:00.020 → 02.120 AJMO, JANEZI! PEJD' GREVA,
00:02.360 → 04.820 AJDE, NA TRAKTORJU OD MAJDE, NOBEN
Subtitles now perfectly align with vocals.
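The grouping step can be sketched like this. MAX_CHUNK_DURATION is from the notes; chunk_words, the word dict keys (text/start/end) and the uppercase output are assumptions about the surrounding code.

```python
MAX_CHUNK_DURATION = 2.5  # seconds


def chunk_words(words, clip_start=0.0, max_dur=MAX_CHUNK_DURATION):
    """Group words into subtitle chunks timed by real word timestamps.

    A new chunk starts when adding the next word would stretch the
    current chunk past max_dur. Times are returned relative to clip_start.
    A single word longer than max_dur still gets its own chunk.
    """
    chunks, current = [], []
    for w in words:
        if current and w["end"] - current[0]["start"] > max_dur:
            chunks.append(current)
            current = []
        current.append(w)
    if current:
        chunks.append(current)
    return [
        {
            "start": c[0]["start"] - clip_start,
            "end": c[-1]["end"] - clip_start,
            "text": " ".join(w["text"] for w in c).upper(),
        }
        for c in chunks
    ]
```

Each chunk starts at its first word and ends at its last word, so uneven singing pace no longer shifts the subtitles.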
Test results comparing all providers on Slovenian folk-pop:
CVETELE SO MALINE:
- Scribe: HALLUCINATED ('finančni moduli...') ❌
- Gemini 3 Pro: correct lyrics, ~100s ✅
- Soniox: PERFECT lyrics in 4 seconds ✅✅
PA PA:
- Scribe: 'se mu pomahala' (wrong: missing M) ❌
- Soniox: 'sem mu pomahala' ✅ + caught 'pa-pa-ra-pa' fillers
ŽENA ME TEPE:
- Scribe: hallucinations + word errors
- Soniox: PERFECT 'Žena me tepe, mi prazni žepe, da vidi, kje in s kom sem bil'
Soniox advantages:
- 4x cheaper than Scribe ($0.10/h vs $0.40/h)
- 5x faster (4-15s vs 10-15s for 180s audio)
- 50x cheaper than Gemini 3 Pro
- 25x faster than Gemini
- Slovenian native quality matches Gemini
- Word-level timestamps + diacritics + punctuation
Implementation:
1. transcribe_with_soniox() function:
- Multipart upload to /v1/files (no SDK dependency)
- Create transcription with stt-async-v4 model
- Auto language hint based on filename (NZ → 'sl')
- Multilingual fallback ['en', 'sl', 'de', 'hr', 'es', 'fr', 'it']
- Poll status, fetch transcript
- Group subword tokens into words → segments
- Auto-cleanup files after transcription
2. New 'soniox_chain' provider mode (default for 'auto'):
- Soniox primary (fast + cheap + accurate)
- Scribe fallback (rare cases when Soniox fails)
- Gemini fallback (last resort, slow but bulletproof)
- Quality gate: coverage >= 50%, no hallucinations
3. Provider modes: auto, soniox, elevenlabs, gemini, hybrid, local
This makes the pipeline reliable for ALL music genres, including
Slovenian narodno-zabavna (folk-pop) music, which Scribe consistently failed on.
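A minimal sketch of the fallback chain with its quality gate. The result keys _coverage_pct and _hallucination_count appear elsewhere in this log; transcribe_chain, passes_quality_gate and the (name, function) provider list are hypothetical.

```python
def passes_quality_gate(result, min_coverage=50.0):
    """Accept a transcript only with enough coverage and no hallucinations."""
    return (
        result.get("_coverage_pct", 0.0) >= min_coverage
        and result.get("_hallucination_count", 0) == 0
    )


def transcribe_chain(audio_path, providers):
    """Try providers in order; return the first result that passes the gate.

    providers is a list of (name, callable) pairs, e.g. Soniox first,
    then Scribe, then Gemini.
    """
    last_error = None
    for name, transcribe in providers:
        try:
            result = transcribe(audio_path)
        except Exception as exc:  # provider down, quota, timeout, ...
            last_error = exc
            continue
        if passes_quality_gate(result):
            result["_provider"] = name
            return result
    raise RuntimeError(f"no provider passed the quality gate: {last_error}")
```

The order of the list encodes the preference, so the cheap and fast provider is always tried before the slow ones.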
User feedback: 'on the chorus, not before the start of the chorus': the clip should
start right when the chorus begins, not 0.3s before.
Changes:
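1. Prompt rule: 'Začetek = TOČNO ko prva beseda refrena začne' (start = EXACTLY when the first word of the chorus begins)
(was: '~0.3s PRED prvo besedo refrena', i.e. ~0.3s BEFORE the first word of the chorus)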
2. Word-level extension: removed -0.15s buffer when extending back
(now lands exactly on word start)
Reasoning: with no_subs as default, we don't need buffer to avoid
cutting first word during fade-in (fade-in is now 0.05s = imperceptible).
Cleaner cuts directly at chorus onset.
User feedback: subtitles have been causing problems (wrong text from STT,
chorus selection issues). Better to default to clean reels without
burned-in subtitles - just video + audio at the chorus moment.
Changes:
- 'Brez podnapisov' checkbox now CHECKED by default
- Removed 'Stil podnapisov' dropdown from UI (kept hidden for compat)
- Updated step label: 'Reframe v 9:16 + podnapisi' → 'Reframe v 9:16'
- Backend already supports --no-subs flag, no logic changes needed
Result: reels are simpler and more reliable. Just clean 9:16 with audio.
User can enable subtitles per-job by unchecking the box if needed.
User feedback: 'Ansambel UNIKAT — PA PA (offiicial video)' shows the
'(offiicial video)' suffix everywhere (titles, downloads, UI). The user
wants only 'Artist - Title' without any video format markers.
Two fixes:
1. EXPANDED _NOISE_PATTERNS to handle:
- Typos in 'official': 'offiicial', 'offical', 'oficial' (regex Off[a-z]*icial)
- Variants: '(Official 4K Video)', '(Official HD Video)', '(Official Music Video)'
- More versions: (Live), (Cover), (Acoustic), (Extended Mix), (Radio Edit), (Clean), (Explicit)
- Square brackets: [Official...], [HD], [Lyrics...]
- Bare words without brackets
- Trailing year markers '(2024)'
2. NEW clean_noise() function applied at READ TIME:
Even if a job was saved with 'PA PA (offiicial video)' as parsed_title,
the new code re-cleans it when serving the job to the UI or building
the download filename. This means existing jobs get fixed too without
needing re-processing.
3. Applied to:
- build_download_filename() — clean before formatting
- list_jobs() — strip noise when serving job list
- get_job() — strip noise when serving single job
Result: 'Ansambel UNIKAT - PA PA - REEL.mp4' (no more (offiicial video))
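A rough sketch of read-time cleaning. The function name clean_noise comes from the notes, but the patterns below are illustrative stand-ins, not the project's actual _NOISE_PATTERNS, and bare words without brackets are not handled here.

```python
import re

# Illustrative patterns only (assumed, not the real _NOISE_PATTERNS).
_EXAMPLE_NOISE_PATTERNS = [
    # (Official ...) including typos such as offiicial / offical / oficial
    r"[\(\[]\s*of+[a-z]*ic[a-z]*al\b[^\)\]]*[\)\]]",
    # common version markers in round or square brackets
    r"[\(\[]\s*(?:live|cover|acoustic|extended mix|radio edit|clean|"
    r"explicit|hd|4k|lyrics?)\b[^\)\]]*[\)\]]",
    # trailing year marker such as (2024)
    r"\(\s*(?:19|20)\d{2}\s*\)\s*$",
]
_NOISE_RE = re.compile("|".join(_EXAMPLE_NOISE_PATTERNS), re.IGNORECASE)


def clean_noise(title):
    """Strip video-format markers from a title at read time."""
    cleaned = _NOISE_RE.sub(" ", title)
    return re.sub(r"\s+", " ", cleaned).strip(" -")
```

Running this when a job is served, rather than when it is saved, is what fixes titles of jobs that were stored before the patterns were expanded.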
Real-world test confirmed Gemini 3 Pro can transcribe Slovenian folk-pop
songs accurately where ElevenLabs Scribe hallucinates:
Test: FEHTARJI - GORENJSKA LJUBLJENA (120s sample)
- Scribe result: 'finančni moduli...' (total hallucination, wrong content)
- Gemini 3 Pro: 'Zunaj srečo sem iskal, planet prepotoval' (CORRECT lyrics)
Implementation:
1. New transcribe_with_gemini() function:
- Uploads audio via Gemini Files API (resumable upload)
- Calls gemini-3-pro-preview with structured prompt
- Parses JSON response with word-level timestamps
- Computes coverage_pct and hallucination_count
- Returns same format as Scribe (compatible)
2. New 'hybrid' provider mode (now the default for 'auto'):
- Try Scribe first (fast, cheap: 8-10s, $0.013)
- If quality OK (coverage >= 50%, no hallucinations) → return Scribe
- Else retry Scribe once
- If still bad → fallback to Gemini 3 Pro (slow, more expensive: 100s, $0.20)
- Compare results, return whichever is better
3. Provider modes:
- 'auto' → hybrid if both keys, else elevenlabs, else local
- 'hybrid' → explicit Scribe + Gemini fallback
- 'elevenlabs'→ Scribe only (with auto-retry)
- 'gemini' → Gemini only
- 'local' → faster-whisper on CPU
Cost analysis (10 reels/day):
- Pure Scribe: $0.13/day, ~5-10% reels unusable
- Hybrid: ~$0.55/day, 100% usable
- Pure Gemini: $2/day
Hybrid is the clear winner: +$0.42/day for 100% reliability.
Bug found in ŽENA ME TEPE third re-test:
- Scribe transcribed only verse 1 (0-33s) properly
- Then returned a single 98s segment [34.7-133.2] with just 1 word 'sam'
- This is a known Scribe hallucination on instrumental sections
- Result: SRT showed 'SAM SAM SAM SAM...' 14 times across the chorus
- Looked completely wrong because the chorus audio was correct but
subtitles showed 'SAM' repeatedly
Three-part fix:
1. SRT GENERATOR: skip segments > 15s with < 5 words.
These are hallucinations and have no real transcription value.
2. SCRIBE TRANSCRIBE: detect hallucinations in returned segments.
- Mark segments > 15s with < 5 words as hallucinations
- Compute true coverage % (excluding hallucinations)
- Add _hallucination_count and _coverage_pct to result
3. TRANSCRIBE_FULL: auto-retry Scribe if quality is poor.
- If hallucinations detected OR coverage < 50%, retry once
- Keep retry result only if it has better stats
- Otherwise fall back to first attempt (still better than nothing)
This makes the pipeline robust against Scribe's occasional bad transcripts
on songs with long instrumental breaks. Most second attempts succeed
where the first failed (random Scribe variance).
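The detection and retry decision can be sketched as below. The thresholds (15s, 5 words, 50%) and the _hallucination_count / _coverage_pct keys are from the notes; the function names and the way coverage is computed (seconds of non-hallucinated segments over audio duration) are assumptions.

```python
MAX_SEG_SECONDS = 15.0
MIN_WORDS = 5


def is_hallucination(seg):
    """Long segment with almost no words: treated as a hallucination."""
    duration = seg["end"] - seg["start"]
    return duration > MAX_SEG_SECONDS and len(seg["text"].split()) < MIN_WORDS


def quality_stats(segments, audio_duration):
    """Count hallucinations and compute coverage without them."""
    good = [s for s in segments if not is_hallucination(s)]
    covered = sum(s["end"] - s["start"] for s in good)
    coverage = 100.0 * covered / audio_duration if audio_duration else 0.0
    return {
        "_hallucination_count": len(segments) - len(good),
        "_coverage_pct": coverage,
    }


def should_retry(stats):
    """Retry once if anything was hallucinated or coverage is under 50%."""
    return stats["_hallucination_count"] > 0 or stats["_coverage_pct"] < 50.0
```

The same is_hallucination check can be reused by the SRT generator to skip such segments.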
Refinement of the previous lookback fix: limit it to MAX 2 words back.
Reason: with unlimited lookback, the search chains through every pair of
words with a gap < 0.5s and keeps walking back into the previous verse.
For ŽENA ME TEPE: 'verjet.' [76.78] → 'Žena' [76.88] gap is 0.10s,
so an unlimited lookback would walk back into the verse before the chorus.
With 2-word limit:
- Clip at 78.19s → 'me' [78.16] is closest preceding word (gap 0.03s)
- Lookback j=i: 'me' → 'Žena' gap 0.14s → captured (i-1)
- Lookback j=i-1: 'Žena' → 'verjet.' gap 0.10s → would be captured
but MAX_LOOKBACK_WORDS=2 stops here ✓
Result: anchor = 'Žena' at 76.88s → new_start = 76.73s.
Subtitle: 'ŽENA ME TEPE' (full phrase, no verse leakage).
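A minimal sketch of the bounded lookback, reproducing the numbers above. Assumptions are labeled: the function name and word dict shape are hypothetical, MAX_LOOKBACK_WORDS is read as "the chain holds at most 2 words, including the word closest to the clip start", the start of 'me' (77.88s) is inferred from the 0.14s gap, and the start of 'verjet.' is a placeholder.

```python
MAX_LOOKBACK_WORDS = 2   # chain length, counting the closest preceding word
MAX_GAP = 0.5            # words closer together than this are chained
START_BUFFER = 0.15      # margin before the anchor word


def extend_start(words, clip_start, max_words=MAX_LOOKBACK_WORDS):
    """Move clip_start back to just before the anchor word.

    words: list of {"text", "start", "end"} sorted by time.
    """
    # Closest word that ends at or before the clip start.
    i = None
    for idx, w in enumerate(words):
        if w["end"] <= clip_start:
            i = idx
    if i is None or clip_start - words[i]["end"] >= MAX_GAP:
        return clip_start

    anchor, chain = i, 1
    while anchor > 0 and chain < max_words:
        gap = words[anchor]["start"] - words[anchor - 1]["end"]
        if gap >= MAX_GAP:
            break
        anchor -= 1
        chain += 1
    return words[anchor]["start"] - START_BUFFER


words = [
    {"text": "verjet.", "start": 76.25, "end": 76.78},  # start is a placeholder
    {"text": "Žena", "start": 76.88, "end": 77.74},
    {"text": "me", "start": 77.88, "end": 78.16},       # start inferred from gap
]
new_start = extend_start(words, 78.19)  # anchors on 'Žena'
```

Passing a large `max_words` reproduces the old behavior, where the walk continues through 'verjet.' into the previous verse.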
Bug found in Žena ME TEPE re-test:
- Clip start: 76.73s (correct, captures full 'Žena' word)
- But SRT subtitle #1 showed: 'SAJ ŠE DOMA MI VEČ NOČJO VERJET.'
- That text is from the PREVIOUS verse, not the chorus!
Why: previous segment (73.9-78.2s) contained 'saj še doma mi več
nočjo verjet. Žena me'. Clip start fell at 76.73s (mid-segment).
Old SRT logic: max(s_start, clip_start) just clipped TIMING but kept
ALL the text from that segment, including text from before the clip.
Fix: when a segment partially falls outside clip range AND has word-level
timestamps (Scribe provides these), reconstruct the segment using only
the words that actually fall within [clip_start, clip_end]. Audio
(clipped at clip_start) only contains those words anyway, so the
subtitle should match.
Result for Žena chorus:
- Old: 'SAJ ŠE DOMA MI VEČ NOČJO VERJET.' (wrong, that text is silent
in clip)
- New: 'ŽENA ME' (only words actually heard at 76.73-78.16s)
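The word-level reconstruction can be sketched as below. This is an illustration under assumptions, not the real SRT generator: `clip_segment` and the dict shapes are hypothetical, a word counts as "inside" when its start lies in [clip_start, clip_end), and the word timings for the earlier verse words are placeholders (only the 'verjet.' end and the 'Žena' values come from the notes above; the 'me' start is inferred). The clip end of 106.73s assumes the 30s preset duration.

```python
def clip_segment(seg, clip_start, clip_end):
    """Return the subtitle entry for seg restricted to the clip, or None."""
    s, e = seg["start"], seg["end"]
    if e <= clip_start or s >= clip_end:
        return None                      # entirely outside the clip

    words = seg.get("words") or []
    partially_outside = s < clip_start or e > clip_end
    if partially_outside and words:
        # Keep only the words that are actually heard in the clip.
        kept = [w for w in words if clip_start <= w["start"] < clip_end]
        if not kept:
            return None
        text = " ".join(w["text"] for w in kept)
        end = min(kept[-1]["end"], clip_end)
    else:
        # Old behavior: timing is clipped but the full text is kept.
        text = seg["text"]
        end = min(e, clip_end)
    return {"start": max(s, clip_start), "end": end, "text": text}


segment = {
    "start": 73.9, "end": 78.2,
    "text": "saj še doma mi več nočjo verjet. Žena me",
    "words": [
        {"text": "saj", "start": 73.90, "end": 74.20},      # placeholder timing
        {"text": "še", "start": 74.25, "end": 74.50},       # placeholder timing
        {"text": "doma", "start": 74.55, "end": 75.00},     # placeholder timing
        {"text": "mi", "start": 75.05, "end": 75.25},       # placeholder timing
        {"text": "več", "start": 75.30, "end": 75.65},      # placeholder timing
        {"text": "nočjo", "start": 75.70, "end": 76.20},    # placeholder timing
        {"text": "verjet.", "start": 76.25, "end": 76.78},  # start is a placeholder
        {"text": "Žena", "start": 76.88, "end": 77.74},
        {"text": "me", "start": 77.88, "end": 78.16},       # start inferred
    ],
}
entry = clip_segment(segment, 76.73, 106.73)
```

Uppercasing to 'ŽENA ME' is left to the subtitle style and is not part of this sketch.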
Bug found in Žena ME TEPE re-test:
- Final clip start was 77.2s but word 'Žena' starts at 76.88s
- Word-level extension would have correctly chosen 76.73s
- Why didn't it? Because corrected_segs (Claude output) doesn't contain
word-level timestamps, only segment start/end. all_words array was empty,
triggering segment-level fallback (-0.5s) which produced 77.2s instead.
Fix: always use transcript['segments'] (original Scribe output with word
timestamps) for word-level boundary detection, not Claude corrected_segments.
Now: 'Žena' word at 76.88-77.74s will trigger word-level extension to
76.73s (76.88 - 0.15s buffer), capturing the full word.
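A sketch of why the word source matters. Everything here is illustrative: `flatten_words` and `extend_start` are hypothetical names, the rule "extend to the word that contains the proposed start" is an assumed simplification, and the proposed start of 77.7s is inferred from the 77.2s result plus the 0.5s fallback.

```python
WORD_BUFFER = 0.15       # word-level margin
SEGMENT_FALLBACK = 0.5   # used when no word timestamps are available


def flatten_words(segments):
    """Collect word-level timestamps; segments without 'words' add nothing."""
    return [w for seg in segments for w in seg.get("words", [])]


def extend_start(proposed_start, segments):
    """Extend the start to the beginning of the word it would cut into."""
    for w in flatten_words(segments):
        if w["start"] <= proposed_start < w["end"]:
            return w["start"] - WORD_BUFFER
    return proposed_start - SEGMENT_FALLBACK


# Original Scribe output: has word timestamps.
scribe_segments = [{
    "start": 73.9, "end": 78.2, "text": "... Žena me",
    "words": [{"text": "Žena", "start": 76.88, "end": 77.74}],
}]
# LLM-corrected segments: start/end only, no 'words' key.
corrected_segs = [{"start": 73.9, "end": 78.2, "text": "... Žena me"}]

fixed = extend_start(77.7, scribe_segments)   # word-level path
buggy = extend_start(77.7, corrected_segs)    # falls back to -0.5s
```

With the Scribe segments the start lands just before 'Žena'; with the corrected segments the word list is empty and the fallback cuts into the word.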
Bug: when user clicked Preview in the left 'Analiza pesmi' panel, it
loaded an inline <video> below it. The video element grew to 400px and,
combined with the sticky positioning of the left card and grid layout,
caused the right column (jobs list) preview/download buttons to become
unclickable — the video element was effectively layering over them.
Fix: replace inline preview with the same modal that the right-side
Preview buttons use. Removes the layout conflict entirely:
- Live panel Preview button now opens the centered modal
- Removed inline #live-video element
- Removed liveVideo references from JS (resetLive)
- Job cards now have data-id and data-title attributes so the modal can
pull title for display
Both left-side (live) and right-side (jobs list) preview now use the
same clean modal experience.
Žena problem persists: even after word-level extension, the clip still cuts
into the vocal start in cases where Scribe does not transcribe the very
first word.
Layer 3 defense: after word-level start extension, probe the FIRST 150ms
of audio at clip start with ffmpeg volumedetect. If mean_volume > -35 dB
(threshold for vocal/music vs silence), extend clip start back 0.5s as a
safety buffer.
This catches cases where:
- Scribe missed the word entirely (no word-level timestamp to extend to)
- LLM picked a start that's already inside vocal energy
- Word-level extension didn't trigger because no nearby word matched
The check is fast (<100ms) and conservative (only triggers if audio is
clearly NOT silent). If it's a true musical break (silence before chorus),
mean_volume will be < -40 dB and extension is skipped.
Three layers of defense now:
1. Claude prompt: 'start ~0.3s before first chorus word'
2. Word-level boundary detection (Scribe word timestamps)
3. Audio amplitude check (catches cases 1-2 missed)
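The layer 3 amplitude check could look roughly like this. It is a sketch, not the shipped code: the function names are hypothetical, the exact ffmpeg arguments used by the project are not stated in these notes, and the command below is one common way to run the `volumedetect` filter on a short window (it reports `mean_volume` on stderr).

```python
import re
import subprocess

VOCAL_THRESHOLD_DB = -35.0   # above this the window is treated as not silent
PROBE_SECONDS = 0.15         # first 150ms at clip start
SAFETY_BUFFER = 0.5          # how far to move the start back

_MEAN_RE = re.compile(r"mean_volume:\s*(-?inf|-?\d+(?:\.\d+)?)\s*dB")


def parse_mean_volume(stderr_text):
    """Extract mean_volume (dB) from volumedetect output, or None."""
    m = _MEAN_RE.search(stderr_text)
    return float(m.group(1)) if m else None


def probe_mean_volume(path, start):
    """Run ffmpeg volumedetect on PROBE_SECONDS of audio starting at start."""
    cmd = [
        "ffmpeg", "-hide_banner", "-nostats",
        "-ss", f"{start:.3f}", "-t", f"{PROBE_SECONDS}",
        "-i", path, "-vn", "-af", "volumedetect",
        "-f", "null", "-",
    ]
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return parse_mean_volume(proc.stderr)


def adjust_start(clip_start, mean_db):
    """Move the start back only when the probe window is clearly not silent."""
    if mean_db is not None and mean_db > VOCAL_THRESHOLD_DB:
        return max(0.0, clip_start - SAFETY_BUFFER)
    return clip_start
```

Usage would be `adjust_start(start, probe_mean_volume(path, start))`. When the probe fails and returns None, the start is left unchanged, which keeps the check conservative.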
1. CLIP END EXTENSION TOO AGGRESSIVE (Avsenika problem):
Previous logic extended the clip end to any segment within a 1s pause.
This caused the clip to spill into an instrumental break or the next chorus.
New rules (multi-language):
- Hard cap: original_clip_end + 3s max (prevents long instrumental tails)
- Pause threshold tightened: 0.7s (was 1.0s)
- Length check: skip segments longer than 2.5s (likely new verse/chorus)
- Outro filler regex: only extend if next segment matches
(la|na|oh|ah|eh|ej|aj|ja|hey|yeah|yo|ho|wo|hu|mm|nn|uu|oo|aa|ee|ii)
- Universal across languages (works for SLO 'aj ja ja', EN 'yeah',
ES 'ay ay ay', RO 'hei hei', JP 'la la la')
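The new clip-end rules can be sketched as follows. Assumptions are labeled: the function names and segment shape are hypothetical, and the notes do not say how the filler pattern is applied, so this sketch uses the strictest reading, where every token of the next segment must be one of the listed syllables. Under that reading, spellings such as 'ay' or 'hei' would have to be added to the list explicitly.

```python
import re

HARD_CAP_SECONDS = 3.0      # original_clip_end + 3s max
MAX_PAUSE_SECONDS = 0.7     # was 1.0
MAX_SEGMENT_SECONDS = 2.5   # longer segments are likely a new verse/chorus

FILLER = re.compile(
    r"(la|na|oh|ah|eh|ej|aj|ja|hey|yeah|yo|ho|wo|hu|mm|nn|uu|oo|aa|ee|ii)",
    re.IGNORECASE,
)


def is_outro_filler(text):
    """True when the text consists only of filler syllables."""
    tokens = re.findall(r"\w+", text)
    return bool(tokens) and all(FILLER.fullmatch(t) for t in tokens)


def extend_clip_end(clip_end, following_segments):
    """Extend clip_end over short filler segments, never past the hard cap.

    following_segments: segments after the clip end, sorted by start time.
    """
    cap = clip_end + HARD_CAP_SECONDS
    new_end = clip_end
    for seg in following_segments:
        if seg["start"] - new_end > MAX_PAUSE_SECONDS:
            break                                   # pause too long
        if seg["end"] - seg["start"] > MAX_SEGMENT_SECONDS:
            break                                   # likely a new verse/chorus
        if not is_outro_filler(seg["text"]):
            break                                   # real lyrics, stop here
        new_end = min(seg["end"], cap)
        if new_end >= cap:
            break
    return new_end
```

For example, a clip ending at 100.0s followed by 'aj ja ja' (100.3-101.5s) and 'la la la' (101.8-103.9s) is extended over the first segment and then clamped at the 103.0s cap.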
2. UI CLEANUP:
- Removed dead pendingFile/pendingArtist/pendingTitle references
(multi-upload migration left some single-file resets behind)
- Job watch handler no longer tries to clear single-file state