
Commit 4413d25

refactor(skills): consolidate tts/whisper guidance into hyperframes-media
Code review found the new hyperframes-media skill was parallel content with skills/hyperframes/references/tts.md and the "Whisper Model Guide" section of transcript-guide.md — same voice table, same .en-translates-non-English warning, same TTS→transcribe chain in both places. Plus some scope creep in hyperframes-media (audio/video HTML snippets that duplicate the canonical track docs in hyperframes/SKILL.md:265+).

Consolidation:

- hyperframes-media is now the single source of truth for CLI invocation, voice selection, multilingual phonemization, whisper model selection, and the .en gotcha. Picked up the multilingual prefix decoding from the deleted tts.md.
- skills/hyperframes/references/tts.md deleted; the bullet in hyperframes/SKILL.md is removed (no replacement — agents land on hyperframes-media via its own description).
- skills/hyperframes/references/transcript-guide.md keeps only the caption-side concerns: input-format table, mandatory quality check, cleaning JS, external-API import path, and the "if no transcript exists" flow. The intro bash recipe and Whisper Model Guide section both moved to hyperframes-media. Top of the file now points to hyperframes-media for CLI/model details.

Other tightening in hyperframes-media:

- Dropped WHAT-narration filler and the inline <audio>/<video> HTML snippets — they duplicate the canonical track-attribute docs in hyperframes/SKILL.md.
- Added the `id` field (`w0`, `w1`, ...) to the transcript output shape — the actual Word interface in packages/cli/src/whisper/normalize.ts includes it (optional for backwards compat), used by caption override logic.
- Compressed the TTS → Transcribe → Captions chain section.

Net: hyperframes-media 147 → 136 lines, transcript-guide.md 152 → 106 lines, tts.md gone (-75 lines).
1 parent 051e985 commit 4413d25

4 files changed

Lines changed: 35 additions & 166 deletions

skills/hyperframes-media/SKILL.md

Lines changed: 33 additions & 44 deletions
@@ -5,9 +5,7 @@ description: Asset preprocessing for HyperFrames compositions — text-to-speech

 # HyperFrames Media Preprocessing

-Three CLI commands that produce assets for compositions: `tts` (speech), `transcribe` (timestamps), and `remove-background` (transparent video). Each downloads a model on first run and caches it under `~/.cache/hyperframes/`.
-
-Run them before composing — drop the output file into the project, then reference it from the composition HTML.
+Three CLI commands that produce assets for compositions: `tts` (speech), `transcribe` (timestamps), and `remove-background` (transparent video). Each downloads a model on first run and caches it under `~/.cache/hyperframes/`. Drop the output into the project, then reference it from the composition HTML — see the `hyperframes` skill for the audio/video element conventions.

 ## Text-to-Speech (`tts`)

@@ -31,7 +29,16 @@ Match voice to content. Default is `af_heart`.
 | Documentation | `bf_emma`/`bm_george` | Clear British English, formal |
 | Casual / social | `af_heart`/`af_sky` | Approachable, natural |

-8 languages supported: EN, JP, ZH, KO, FR, DE, IT, PT. Run `--list` for the full set.
+### Multilingual
+
+Voice IDs encode language in the first letter: `a`=American English, `b`=British English, `e`=Spanish, `f`=French, `h`=Hindi, `i`=Italian, `j`=Japanese, `p`=Brazilian Portuguese, `z`=Mandarin. The CLI auto-detects the phonemizer locale from the prefix — no `--lang` needed when the voice matches the text.
+
+```bash
+npx hyperframes tts "La reunión empieza a las nueve" --voice ef_dora --output es.wav
+npx hyperframes tts "今日はいい天気ですね" --voice jf_alpha --output ja.wav
+```
+
+Use `--lang` only to override auto-detection (stylized accents). Valid codes: `en-us`, `en-gb`, `es`, `fr-fr`, `hi`, `it`, `pt-br`, `ja`, `zh`. Non-English phonemization requires `espeak-ng` system-wide (`brew install espeak-ng` / `apt-get install espeak-ng`).

 ### Speed
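
The prefix convention added in the Multilingual hunk can be sketched as a lookup table. This is a minimal TypeScript sketch, not the CLI's actual code: `PREFIX_TO_LANG` and `langForVoice` are hypothetical names, and the pairing of each prefix with a `--lang` code is inferred from the two lists in the added text, which appear in the same order.

```typescript
// Hypothetical lookup: first letter of a voice ID -> phonemizer locale.
// The pairing is inferred from the diff text, not taken from the CLI source.
const PREFIX_TO_LANG: Record<string, string> = {
  a: "en-us", // American English
  b: "en-gb", // British English
  e: "es", // Spanish
  f: "fr-fr", // French
  h: "hi", // Hindi
  i: "it", // Italian
  j: "ja", // Japanese
  p: "pt-br", // Brazilian Portuguese
  z: "zh", // Mandarin
};

// Returns the inferred locale, or undefined when the prefix is unknown.
function langForVoice(voiceId: string): string | undefined {
  return PREFIX_TO_LANG[voiceId.charAt(0).toLowerCase()];
}

console.log(langForVoice("ef_dora")); // es
console.log(langForVoice("jf_alpha")); // ja
```

A table like this also makes the `--lang` override easy to reason about: the flag only matters when the desired locale differs from what the voice prefix would give.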

@@ -44,24 +51,9 @@ Match voice to content. Default is `af_heart`.

 For more than a few paragraphs, write to a `.txt` file and pass the path. Inputs over ~5 minutes of speech may benefit from splitting into segments.

-### Use in a Composition
-
-Reference the generated audio as a standard `<audio>` track:
-
-```html
-<audio
-  id="narration"
-  data-start="0"
-  data-duration="auto"
-  data-track-index="2"
-  src="narration.wav"
-  data-volume="1"
-></audio>
-```
-
 ### Requirements

-Python 3.8+ with `kokoro-onnx` and `soundfile` (`pip install kokoro-onnx soundfile`). Model downloads automatically on first use (~311 MB + ~27 MB voices, cached in `~/.cache/hyperframes/tts/`).
+Python 3.8+ with `kokoro-onnx` and `soundfile` (`pip install kokoro-onnx soundfile`). Model downloads on first use (~311 MB + ~27 MB voices, cached in `~/.cache/hyperframes/tts/`).

 ## Transcription (`transcribe`)

@@ -72,7 +64,7 @@ npx hyperframes transcribe audio.mp3
 npx hyperframes transcribe video.mp4 --model small --language es
 npx hyperframes transcribe subtitles.srt # import existing
 npx hyperframes transcribe subtitles.vtt
-npx hyperframes transcribe openai-response.json # import OpenAI output
+npx hyperframes transcribe openai-response.json
 ```

 ### Language Rule (Non-Negotiable)
@@ -85,19 +77,29 @@ npx hyperframes transcribe openai-response.json # import OpenAI output

 **Default model is `small`, not `small.en`.**

+### Model Sizes
+
+| Model | Size | Speed | When to use |
+| ---------- | ------ | -------- | ------------------------------------- |
+| `tiny` | 75 MB | Fastest | Quick previews, testing pipeline |
+| `base` | 142 MB | Fast | Short clips, clear audio |
+| `small` | 466 MB | Moderate | **Default** — most content |
+| `medium` | 1.5 GB | Slow | Important content, noisy audio, music |
+| `large-v3` | 3.1 GB | Slowest | Production quality |
+
+Music with vocals: start at `medium` minimum; produced tracks often need manual SRT/VTT import. For caption-quality checks (mandatory after every transcription), the cleaning JS, retry rules, and the OpenAI/Groq API import path, see [hyperframes/references/transcript-guide.md](../hyperframes/references/transcript-guide.md).
+
 ### Output Shape

-The composition consumes a flat array of word objects:
+Compositions consume a flat array of word objects. The `id` field (`w0`, `w1`, ...) is added during normalization for stable references in caption overrides; it's optional for backwards compatibility.

 ```json
 [
-  { "text": "Hello", "start": 0.0, "end": 0.5 },
-  { "text": "world.", "start": 0.6, "end": 1.2 }
+  { "id": "w0", "text": "Hello", "start": 0.0, "end": 0.5 },
+  { "id": "w1", "text": "world.", "start": 0.6, "end": 1.2 }
 ]
 ```

-For caption rendering, styling, and per-word effects, invoke the `hyperframes` skill (composition authoring).
-
 ## Background Removal (`remove-background`)

 Remove the background from a video or image so it can sit as a transparent overlay in a composition (e.g. an avatar floating on a background plate).
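
The `id` field described in the Output Shape change of this hunk could be assigned along these lines. This is a minimal sketch only: the real `Word` interface lives in packages/cli/src/whisper/normalize.ts and is not shown in this diff, so the field types, the `assignWordIds` name, and the choice to keep an already-present `id` are all assumptions.

```typescript
// Assumed shape, based on the JSON example in the diff; `id` is optional.
interface Word {
  id?: string;
  text: string;
  start: number; // presumably seconds
  end: number; // presumably seconds
}

// Hypothetical helper: give every word a positional id (w0, w1, ...),
// keeping any id that is already present.
function assignWordIds(words: Word[]): Word[] {
  return words.map((word, index) => ({ ...word, id: word.id ?? `w${index}` }));
}

const words = assignWordIds([
  { text: "Hello", start: 0.0, end: 0.5 },
  { text: "world.", start: 0.6, end: 1.2 },
]);
console.log(words.map((w) => w.id).join(",")); // w0,w1
```

Keeping `id` optional in the type means transcripts written before this change still type-check, which matches the backwards-compatibility note in the commit message.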
@@ -112,36 +114,23 @@ npx hyperframes remove-background --info # detected pro

 Uses `u2net_human_seg` (MIT). First run downloads ~168 MB of weights to `~/.cache/hyperframes/background-removal/models/`.

-### Output Format Choice
+### Output Format

 | Format | When |
 | --------------------- | ------------------------------------------------------------- |
-| `.webm` (VP9 + alpha) | Default. Compositions consume this directly via `<video>`. |
+| `.webm` (VP9 + alpha) | Default. Compositions play this directly via `<video>`. |
 | `.mov` (ProRes 4444) | Editing in DaVinci/Premiere/FCP. Large files. |
 | `.png` | Single-image cutout (still subject, layered over a backdrop). |

-### Use in a Composition
-
-Drop the `.webm` into the project, then play it like any other video — Chrome decodes VP9 alpha natively:
-
-```html
-<video src="transparent.webm" autoplay muted loop></video>
-```
+Chrome decodes VP9 alpha natively, so the `.webm` plugs into a composition like any other muted-autoplay video — see the `hyperframes` skill for the `<video>` track conventions.

 ## TTS → Transcribe → Captions

-Chain the commands when you don't have a pre-recorded voiceover:
+When there's no pre-recorded voiceover, generate one and transcribe it back to get word-level timestamps for captions:

 ```bash
-# 1. Generate speech
 npx hyperframes tts script.txt --voice af_heart --output narration.wav
-
-# 2. Transcribe back for word-level timestamps
-npx hyperframes transcribe narration.wav
-
-# 3. narration.wav + transcript.json are ready for captions
+npx hyperframes transcribe narration.wav # → transcript.json
 ```

 Whisper extracts precise word boundaries from the generated audio, so caption timing matches delivery without hand-tuning.
-
-For caption rendering, styling, and timeline integration, see the `hyperframes` skill.

skills/hyperframes/SKILL.md

Lines changed: 1 addition & 2 deletions
@@ -467,7 +467,6 @@ Skip on small edits (fixing a color, adjusting one duration). Run on new composi
 ## References (loaded on demand)

 - **[references/captions.md](references/captions.md)** — Captions, subtitles, lyrics, karaoke synced to audio. Tone-adaptive style detection, per-word styling, text overflow prevention, caption exit guarantees, word grouping. Read when adding any text synced to audio timing.
-- **[references/tts.md](references/tts.md)** — Text-to-speech with Kokoro-82M. Voice selection, speed tuning, TTS+captions workflow. Read when generating narration or voiceover.
 - **[references/audio-reactive.md](references/audio-reactive.md)** — Audio-reactive animation: map frequency bands and amplitude to GSAP properties. Read when visuals should respond to music, voice, or sound.
 - **[references/css-patterns.md](references/css-patterns.md)** — CSS+GSAP marker highlighting: highlight, circle, burst, scribble, sketchout. Deterministic, fully seekable. Read when adding visual emphasis to text.
 - **[references/video-composition.md](references/video-composition.md)** — Video-medium rules: density, color presence, scale, frame composition, design.md as brand not layout. **Always read** — these override web instincts.
@@ -481,7 +480,7 @@ Skip on small edits (fixing a color, adjusting one duration). Run on new composi
 - **[house-style.md](house-style.md)** — Default motion, sizing, and color palettes when no design.md is specified.
 - **[patterns.md](patterns.md)** — PiP, title cards, slide show patterns.
 - **[data-in-motion.md](data-in-motion.md)** — Data, stats, and infographic patterns.
-- **[references/transcript-guide.md](references/transcript-guide.md)**Transcription commands, whisper models, external APIs, troubleshooting.
+- **[references/transcript-guide.md](references/transcript-guide.md)**Caption-side transcript handling: input formats, mandatory quality check, cleaning JS, OpenAI/Groq API fallback, "if no transcript exists" flow. (For the `transcribe` CLI invocation, model selection rules, and the `.en` gotcha, see the `hyperframes-media` skill.)
 - **[references/dynamic-techniques.md](references/dynamic-techniques.md)** — Dynamic caption animation techniques (karaoke, clip-path, slam, scatter, elastic, 3D).

 - **[references/transitions.md](references/transitions.md)** — Scene transitions: crossfades, wipes, reveals, shader transitions. Energy/mood selection, CSS vs WebGL guidance. **Always read for multi-scene compositions** — scenes without transitions feel like jump cuts.

skills/hyperframes/references/transcript-guide.md

Lines changed: 1 addition & 45 deletions
@@ -1,24 +1,6 @@
 # Transcript Guide

-## How Transcripts Are Generated
-
-`hyperframes transcribe` handles both transcription and format conversion:
-
-```bash
-# Transcribe audio/video (uses whisper.cpp locally, no API key needed)
-npx hyperframes transcribe audio.mp3
-
-# Use a larger model for better accuracy
-npx hyperframes transcribe audio.mp3 --model medium.en
-
-# Filter to English only (skips non-English speech)
-npx hyperframes transcribe audio.mp3 --language en
-
-# Import an existing transcript from another tool
-npx hyperframes transcribe captions.srt
-npx hyperframes transcribe captions.vtt
-npx hyperframes transcribe openai-response.json
-```
+For the `transcribe` CLI invocation, the `.en`-translates-non-English rule, and whisper model selection, see the `hyperframes-media` skill. This file covers what to do with the resulting transcript when authoring captions: input formats, mandatory quality checks, cleaning code, external-API fallbacks.

 ## Supported Input Formats

@@ -34,32 +16,6 @@ The CLI auto-detects and normalizes these formats:

 **Word-level timestamps produce better captions.** SRT/VTT give phrase-level timing, which works but can't do per-word animation effects.

-## Whisper Model Guide
-
-The default model (`small.en`) balances accuracy and speed. For better results, use a larger model:
-
-| Model | Size | Speed | Accuracy | When to use |
-| ---------- | ------ | -------- | --------- | ------------------------------------- |
-| `tiny` | 75 MB | Fastest | Low | Quick previews, testing pipeline |
-| `base` | 142 MB | Fast | Fair | Short clips, clear audio |
-| `small` | 466 MB | Moderate | Good | **Default** — good for most content |
-| `medium` | 1.5 GB | Slow | Very good | Important content, noisy audio, music |
-| `large-v3` | 3.1 GB | Slowest | Best | Production quality |
-
-**Only add `.en` suffix when the user explicitly says the audio is English.** `.en` models are slightly more accurate for English but will TRANSLATE non-English audio instead of transcribing it.
-
-**Critical: `.en` models translate non-English audio into English** — they don't transcribe it. If the audio might not be English, always use a model without the `.en` suffix and pass `--language` to specify the source language. If you're unsure of the language, use `small` (not `small.en`) without `--language` — whisper will auto-detect.
-
-```bash
-# Spanish audio
-npx hyperframes transcribe audio.mp3 --model small --language es
-
-# Unknown language — let whisper auto-detect
-npx hyperframes transcribe audio.mp3 --model small
-```
-
-**Music and vocals over instrumentation**: `small.en` will misidentify lyrics — use `medium.en` as the minimum, or import lyrics manually. Even `medium.en` struggles with heavily produced tracks; for music videos, providing known lyrics as an SRT/VTT and importing with `hyperframes transcribe lyrics.srt` will always beat automated transcription.
-
 ## Transcript Quality Check (Mandatory)

 After every transcription, **read the transcript and check for quality issues before proceeding.** Bad transcripts produce nonsensical captions. Never skip this step.
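
The `.en` rule that this commit moves out of transcript-guide.md can be expressed as a pre-flight check before calling `transcribe`. This is a hedged sketch: `enModelWarning` is a hypothetical helper, not part of the hyperframes CLI, and it only encodes the rule as worded in the removed text (an `.en` model is safe only when the audio is explicitly known to be English).

```typescript
// Hypothetical guard; returns a warning string, or null when the choice is safe.
function enModelWarning(model: string, language?: string): string | null {
  if (!model.endsWith(".en")) return null; // multilingual model: nothing to flag
  if (language === "en") return null; // audio explicitly declared English
  const multilingual = model.slice(0, -".en".length); // "small.en" -> "small"
  const langHint = language ? ` with --language ${language}` : "";
  return `"${model}" translates non-English audio; use "${multilingual}"${langHint}`;
}

console.log(enModelWarning("small", "es")); // null
console.log(enModelWarning("small.en", "es"));
```

Treating a missing `language` as unsafe mirrors the guidance in the removed section: when the language is unknown, use the multilingual model and let whisper auto-detect.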

skills/hyperframes/references/tts.md

Lines changed: 0 additions & 75 deletions
This file was deleted.
