refactor(skills): consolidate tts/whisper guidance into hyperframes-media

Code review found that the new hyperframes-media skill duplicated
skills/hyperframes/references/tts.md and the "Whisper Model Guide"
section of transcript-guide.md: same voice table, same
.en-translates-non-English warning, same TTS→transcribe chain in
both places. hyperframes-media also had some scope creep (audio/video
HTML snippets that duplicate the canonical track docs in
hyperframes/SKILL.md:265+).

Consolidation:
- hyperframes-media is now the single source of truth for CLI
invocation, voice selection, multilingual phonemization, whisper
model selection, and the .en gotcha. Picked up the multilingual
prefix decoding from the deleted tts.md.
- skills/hyperframes/references/tts.md deleted; the bullet in
hyperframes/SKILL.md is removed (no replacement — agents land on
hyperframes-media via its own description).
- skills/hyperframes/references/transcript-guide.md keeps only the
caption-side concerns: input-format table, mandatory quality
check, cleaning JS, external-API import path, and the
"if no transcript exists" flow. The intro bash recipe and Whisper
Model Guide section both moved to hyperframes-media. Top of the
file now points to hyperframes-media for CLI/model details.
Other tightening in hyperframes-media:
- Dropped WHAT-narration filler and the inline <audio>/<video> HTML
snippets — they duplicate the canonical track-attribute docs in
hyperframes/SKILL.md.
- Added the `id` field (`w0`, `w1`, ...) to the transcript output
shape — the actual Word interface in
packages/cli/src/whisper/normalize.ts includes it (optional for
backwards compat), used by caption override logic.
- Compressed the TTS → Transcribe → Captions chain section.
Net: hyperframes-media 147 → 136 lines, transcript-guide.md 152 →
106 lines, tts.md gone (-75 lines).
-Three CLI commands that produce assets for compositions: `tts` (speech), `transcribe` (timestamps), and `remove-background` (transparent video). Each downloads a model on first run and caches it under `~/.cache/hyperframes/`.
-
-Run them before composing — drop the output file into the project, then reference it from the composition HTML.
+Three CLI commands that produce assets for compositions: `tts` (speech), `transcribe` (timestamps), and `remove-background` (transparent video). Each downloads a model on first run and caches it under `~/.cache/hyperframes/`. Drop the output into the project, then reference it from the composition HTML — see the `hyperframes` skill for the audio/video element conventions.

 ## Text-to-Speech (`tts`)

@@ -31,7 +29,16 @@ Match voice to content. Default is `af_heart`.
 | Documentation | `bf_emma` / `bm_george` | Clear British English, formal |
 | Casual / social | `af_heart` / `af_sky` | Approachable, natural |

-8 languages supported: EN, JP, ZH, KO, FR, DE, IT, PT. Run `--list` for the full set.
+### Multilingual
+
+Voice IDs encode language in the first letter: `a`=American English, `b`=British English, `e`=Spanish, `f`=French, `h`=Hindi, `i`=Italian, `j`=Japanese, `p`=Brazilian Portuguese, `z`=Mandarin. The CLI auto-detects the phonemizer locale from the prefix — no `--lang` needed when the voice matches the text.
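
The prefix rule in the added paragraph can be sketched as a small lookup. This is a minimal TypeScript illustration, not the CLI's actual code: `languageFromVoice` and the `PREFIX_TO_LANGUAGE` table are hypothetical names, and only the letter-to-language pairs come from the text above.

```typescript
// Hypothetical lookup: first letter of a voice ID -> language named in the skill text.
// Not taken from the hyperframes CLI source.
const PREFIX_TO_LANGUAGE: Record<string, string> = {
  a: "American English",
  b: "British English",
  e: "Spanish",
  f: "French",
  h: "Hindi",
  i: "Italian",
  j: "Japanese",
  p: "Brazilian Portuguese",
  z: "Mandarin",
};

// Returns the language for a voice ID such as `af_heart` or `ef_dora`,
// or undefined when the first letter is not in the table.
function languageFromVoice(voiceId: string): string | undefined {
  const prefix = voiceId.charAt(0).toLowerCase();
  return PREFIX_TO_LANGUAGE[prefix];
}

console.log(languageFromVoice("ef_dora"));
console.log(languageFromVoice("bm_george"));
```

A check like this could be used to warn when the chosen voice does not match the language of the input text, which is the case where the auto-detected locale would be wrong.
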
+
+```bash
+npx hyperframes tts "La reunión empieza a las nueve" --voice ef_dora --output es.wav
@@ -44,24 +51,9 @@ Match voice to content. Default is `af_heart`.

 For more than a few paragraphs, write to a `.txt` file and pass the path. Inputs over ~5 minutes of speech may benefit from splitting into segments.

-### Use in a Composition
-
-Reference the generated audio as a standard `<audio>` track:
-
-```html
-<audio
-  id="narration"
-  data-start="0"
-  data-duration="auto"
-  data-track-index="2"
-  src="narration.wav"
-  data-volume="1"
-></audio>
-```
-
 ### Requirements

-Python 3.8+ with `kokoro-onnx` and `soundfile` (`pip install kokoro-onnx soundfile`). Model downloads automatically on first use (~311 MB + ~27 MB voices, cached in `~/.cache/hyperframes/tts/`).
+Python 3.8+ with `kokoro-onnx` and `soundfile` (`pip install kokoro-onnx soundfile`). Model downloads on first use (~311 MB + ~27 MB voices, cached in `~/.cache/hyperframes/tts/`).
+| `base` | 142 MB | Fast | Short clips, clear audio |
+| `small` | 466 MB | Moderate | **Default** — most content |
+| `medium` | 1.5 GB | Slow | Important content, noisy audio, music |
+| `large-v3` | 3.1 GB | Slowest | Production quality |
+
+Music with vocals: start at `medium` minimum; produced tracks often need manual SRT/VTT import. For caption-quality checks (mandatory after every transcription), the cleaning JS, retry rules, and the OpenAI/Groq API import path, see [hyperframes/references/transcript-guide.md](../hyperframes/references/transcript-guide.md).
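
One way to read the selection guidance as code. This is a TypeScript sketch under stated assumptions: `pickWhisperModel` and its option names are hypothetical and not part of the CLI. The `small` default and the `medium` floor for music come from the text above, and the `.en` rule follows the warning this commit moves out of transcript-guide.md (add `.en` only when the user has explicitly said the audio is English). `large-v3` is left without a suffix because Whisper publishes no English-only large model.

```typescript
type WhisperSize = "base" | "small" | "medium" | "large-v3";

// Hypothetical option bag; none of these names come from the hyperframes CLI.
interface AudioHints {
  explicitlyEnglish?: boolean; // true only when the user has said the audio is English
  musicWithVocals?: boolean;   // true for songs or vocals over instrumentation
  size?: WhisperSize;          // requested size; defaults to "small"
}

const SIZE_ORDER: WhisperSize[] = ["base", "small", "medium", "large-v3"];

function pickWhisperModel(hints: AudioHints = {}): string {
  let size: WhisperSize = hints.size ?? "small";
  // Music with vocals: never go below `medium`.
  if (hints.musicWithVocals && SIZE_ORDER.indexOf(size) < SIZE_ORDER.indexOf("medium")) {
    size = "medium";
  }
  // `.en` models translate non-English audio, so the suffix needs explicit confirmation.
  const englishOnly = hints.explicitlyEnglish === true && size !== "large-v3";
  return englishOnly ? `${size}.en` : size;
}

console.log(pickWhisperModel());
console.log(pickWhisperModel({ explicitlyEnglish: true }));
console.log(pickWhisperModel({ musicWithVocals: true, size: "base" }));
```

Treating "unknown language" the same as "not English" keeps the failure mode safe: a multilingual model on English audio loses a little accuracy, while a `.en` model on non-English audio produces a translation instead of a transcript.
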
+
 ### Output Shape

-The composition consumes a flat array of word objects:
+Compositions consume a flat array of word objects. The `id` field (`w0`, `w1`, ...) is added during normalization for stable references in caption overrides; it's optional for backwards compatibility.
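
To make the `id` convention concrete, here is a minimal TypeScript sketch of a normalization step that assigns `w0`, `w1`, ... to words that lack an ID. It is an illustration only: apart from the optional `id`, the field names (`text`, `start`, `end`) and the `assignWordIds` helper are assumptions, not the actual `Word` interface in packages/cli/src/whisper/normalize.ts.

```typescript
// Assumed shape: only the optional `id` is confirmed by the text above.
// `text`, `start`, and `end` (in seconds) are illustrative guesses.
interface Word {
  id?: string;
  text: string;
  start: number;
  end: number;
}

// Hypothetical helper: gives every word a positional ID (`w0`, `w1`, ...)
// while keeping any ID that is already present.
function assignWordIds(words: Word[]): Word[] {
  return words.map((word, index) => ({ ...word, id: word.id ?? `w${index}` }));
}

const normalizedWords = assignWordIds([
  { text: "Hello", start: 0.0, end: 0.4 },
  { text: "world", start: 0.45, end: 0.9 },
]);
console.log(normalizedWords.map((w) => w.id));
```

Keeping `id` optional in the type means transcripts written before the field existed still type-check, and a pass like this can backfill them.
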
Chrome decodes VP9 alpha natively, so the `.webm` plugs into a composition like any other muted-autoplay video — see the `hyperframes` skill for the `<video>` track conventions.

 ## TTS → Transcribe → Captions

-Chain the commands when you don't have a pre-recorded voiceover:
+When there's no pre-recorded voiceover, generate one and transcribe it back to get word-level timestamps for captions:
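
The chain can be sketched as two argument lists, one per command. This TypeScript snippet only builds the argv arrays and does not run anything. `buildCaptionChain` is a hypothetical helper, and it uses only flags that appear elsewhere in this diff (`--voice`, `--output`, `--model`); passing the narration text as the first positional argument mirrors the `tts` example in the Multilingual section.

```typescript
// Hypothetical options for illustration; not a hyperframes API.
interface ChainOptions {
  text: string;      // narration script
  voice: string;     // e.g. "af_heart"
  audioPath: string; // where `tts` writes the audio file
  model: string;     // whisper model for `transcribe`, e.g. "small"
}

// Returns the two commands in the order they should run:
// first synthesize the narration, then transcribe that same file.
function buildCaptionChain(opts: ChainOptions): [string[], string[]] {
  const tts = [
    "npx", "hyperframes", "tts", opts.text,
    "--voice", opts.voice,
    "--output", opts.audioPath,
  ];
  const transcribe = [
    "npx", "hyperframes", "transcribe", opts.audioPath,
    "--model", opts.model,
  ];
  return [tts, transcribe];
}

const [ttsCmd, transcribeCmd] = buildCaptionChain({
  text: "Welcome to the demo",
  voice: "af_heart",
  audioPath: "narration.wav",
  model: "small",
});
console.log(ttsCmd);
console.log(transcribeCmd);
```

Building the arguments as arrays rather than one shell string avoids quoting problems when the narration text contains spaces or quotes; each array can be passed to Node's `child_process.spawnSync(cmd[0], cmd.slice(1))`.
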
skills/hyperframes/SKILL.md (1 addition, 2 deletions)
@@ -467,7 +467,6 @@ Skip on small edits (fixing a color, adjusting one duration). Run on new composi
 ## References (loaded on demand)

 - **[references/captions.md](references/captions.md)** — Captions, subtitles, lyrics, karaoke synced to audio. Tone-adaptive style detection, per-word styling, text overflow prevention, caption exit guarantees, word grouping. Read when adding any text synced to audio timing.
-- **[references/tts.md](references/tts.md)** — Text-to-speech with Kokoro-82M. Voice selection, speed tuning, TTS+captions workflow. Read when generating narration or voiceover.
 - **[references/audio-reactive.md](references/audio-reactive.md)** — Audio-reactive animation: map frequency bands and amplitude to GSAP properties. Read when visuals should respond to music, voice, or sound.
 - **[references/css-patterns.md](references/css-patterns.md)** — CSS+GSAP marker highlighting: highlight, circle, burst, scribble, sketchout. Deterministic, fully seekable. Read when adding visual emphasis to text.
 - **[references/video-composition.md](references/video-composition.md)** — Video-medium rules: density, color presence, scale, frame composition, design.md as brand not layout. **Always read** — these override web instincts.
@@ -481,7 +480,7 @@ Skip on small edits (fixing a color, adjusting one duration). Run on new composi
 - **[house-style.md](house-style.md)** — Default motion, sizing, and color palettes when no design.md is specified.
 - **[patterns.md](patterns.md)** — PiP, title cards, slide show patterns.
 - **[data-in-motion.md](data-in-motion.md)** — Data, stats, and infographic patterns.
+- **[references/transcript-guide.md](references/transcript-guide.md)** — Caption-side transcript handling: input formats, mandatory quality check, cleaning JS, OpenAI/Groq API fallback, "if no transcript exists" flow. (For the `transcribe` CLI invocation, model selection rules, and the `.en` gotcha, see the `hyperframes-media` skill.)
-# Filter to English only (skips non-English speech)
-npx hyperframes transcribe audio.mp3 --language en
-
-# Import an existing transcript from another tool
-npx hyperframes transcribe captions.srt
-npx hyperframes transcribe captions.vtt
-npx hyperframes transcribe openai-response.json
-```
+For the `transcribe` CLI invocation, the `.en`-translates-non-English rule, and whisper model selection, see the `hyperframes-media` skill. This file covers what to do with the resulting transcript when authoring captions: input formats, mandatory quality checks, cleaning code, external-API fallbacks.

 ## Supported Input Formats

@@ -34,32 +16,6 @@ The CLI auto-detects and normalizes these formats:

 **Word-level timestamps produce better captions.** SRT/VTT give phrase-level timing, which works but can't do per-word animation effects.

-## Whisper Model Guide
-
-The default model (`small.en`) balances accuracy and speed. For better results, use a larger model:
-| `base` | 142 MB | Fast | Fair | Short clips, clear audio |
-| `small` | 466 MB | Moderate | Good | **Default** — good for most content |
-| `medium` | 1.5 GB | Slow | Very good | Important content, noisy audio, music |
-| `large-v3` | 3.1 GB | Slowest | Best | Production quality |
-
-**Only add `.en` suffix when the user explicitly says the audio is English.** `.en` models are slightly more accurate for English but will TRANSLATE non-English audio instead of transcribing it.
-
-**Critical: `.en` models translate non-English audio into English** — they don't transcribe it. If the audio might not be English, always use a model without the `.en` suffix and pass `--language` to specify the source language. If you're unsure of the language, use `small` (not `small.en`) without `--language` — whisper will auto-detect.
-
-```bash
-# Spanish audio
-npx hyperframes transcribe audio.mp3 --model small --language es
-
-# Unknown language — let whisper auto-detect
-npx hyperframes transcribe audio.mp3 --model small
-```
-
-**Music and vocals over instrumentation**: `small.en` will misidentify lyrics — use `medium.en` as the minimum, or import lyrics manually. Even `medium.en` struggles with heavily produced tracks; for music videos, providing known lyrics as an SRT/VTT and importing with `hyperframes transcribe lyrics.srt` will always beat automated transcription.
-
 ## Transcript Quality Check (Mandatory)

 After every transcription, **read the transcript and check for quality issues before proceeding.** Bad transcripts produce nonsensical captions. Never skip this step.