refactor(skills): consolidate tts/whisper guidance into hyperframes-media

jrusso1020 · jrusso1020 · commit 4413d25b0e11 · 2026-05-04T22:06:02.000Z
Code review found that the new hyperframes-media skill duplicated
content in skills/hyperframes/references/tts.md and the "Whisper
Model Guide" section of transcript-guide.md: the same voice table,
the same .en-translates-non-English warning, and the same
TTS→transcribe chain appeared in both places. hyperframes-media also
had some scope creep: audio/video HTML snippets that duplicate the
canonical track docs in hyperframes/SKILL.md:265+.

Consolidation:

- hyperframes-media is now the single source of truth for CLI
  invocation, voice selection, multilingual phonemization, whisper
  model selection, and the .en gotcha. Picked up the multilingual
  prefix decoding from the deleted tts.md.
- skills/hyperframes/references/tts.md deleted; the bullet in
  hyperframes/SKILL.md is removed (no replacement — agents land on
  hyperframes-media via its own description).
- skills/hyperframes/references/transcript-guide.md keeps only the
  caption-side concerns: input-format table, mandatory quality
  check, cleaning JS, external-API import path, and the
  "if no transcript exists" flow. The intro bash recipe and Whisper
  Model Guide section both moved to hyperframes-media. Top of the
  file now points to hyperframes-media for CLI/model details.
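
For reviewers, the multilingual prefix decoding mentioned in the first
bullet can be sketched as a small lookup. This is an illustration only,
not the CLI's actual code: `PREFIX_TO_LANG` and `langForVoice` are
hypothetical names, and the pairing of each prefix letter with a
`--lang` code is inferred from the two lists in the new "Multilingual"
section of hyperframes-media/SKILL.md.

```typescript
// Hypothetical sketch: the real phonemizer-locale detection in the CLI may differ.
// Prefix letters and --lang codes are taken from the "Multilingual" section;
// pairing them up this way is an assumption.
const PREFIX_TO_LANG: Record<string, string> = {
  a: "en-us", // American English
  b: "en-gb", // British English
  e: "es",    // Spanish
  f: "fr-fr", // French
  h: "hi",    // Hindi
  i: "it",    // Italian
  j: "ja",    // Japanese
  p: "pt-br", // Brazilian Portuguese
  z: "zh",    // Mandarin
};

// Return the assumed --lang code for a voice ID such as "ef_dora",
// or undefined when the first letter is not a known prefix.
function langForVoice(voice: string): string | undefined {
  return PREFIX_TO_LANG[voice.charAt(0).toLowerCase()];
}
```

Under these assumptions, `langForVoice("ef_dora")` yields `"es"`, which
matches the documented behavior that no `--lang` is needed when the
voice matches the text.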

Other tightening in hyperframes-media:

- Dropped WHAT-narration filler and the inline <audio>/<video> HTML
  snippets; they duplicate the canonical track-attribute docs in
  hyperframes/SKILL.md.
- Added the `id` field (`w0`, `w1`, ...) to the transcript output
  shape. The actual Word interface in
  packages/cli/src/whisper/normalize.ts includes it (optional for
  backwards compatibility), and the caption override logic uses it.
- Compressed the TTS → Transcribe → Captions chain section.
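
As a rough picture of the documented output shape, the word object might
look like the sketch below. This is not copied from
packages/cli/src/whisper/normalize.ts: the field types, the assumption
that `start`/`end` are in seconds, and the `withIds` helper are all
hypothetical and only mirror the JSON example in the updated SKILL.md.

```typescript
// Hypothetical sketch of the transcript word shape; the real interface may differ.
interface Word {
  id?: string;   // "w0", "w1", ... (optional for backwards compatibility)
  text: string;
  start: number; // assumed to be seconds
  end: number;   // assumed to be seconds
}

// Illustrative helper (not from the repo): fill in index-based ids
// for words that do not carry one, leaving existing ids untouched.
function withIds(words: Word[]): Word[] {
  return words.map((w, i) => ({ ...w, id: w.id ?? `w${i}` }));
}
```

Keeping `id` optional means older transcripts without ids still
type-check, while consumers that need stable references can normalize
them first.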

Net: hyperframes-media 147 → 136 lines, transcript-guide.md 152 →
106 lines, tts.md gone (-75 lines).
diff --git a/skills/hyperframes-media/SKILL.md b/skills/hyperframes-media/SKILL.md
@@ -5,9 +5,7 @@ description: Asset preprocessing for HyperFrames compositions — text-to-speech
 
 # HyperFrames Media Preprocessing
 
-Three CLI commands that produce assets for compositions: `tts` (speech), `transcribe` (timestamps), and `remove-background` (transparent video). Each downloads a model on first run and caches it under `~/.cache/hyperframes/`.
-
-Run them before composing — drop the output file into the project, then reference it from the composition HTML.
+Three CLI commands that produce assets for compositions: `tts` (speech), `transcribe` (timestamps), and `remove-background` (transparent video). Each downloads a model on first run and caches it under `~/.cache/hyperframes/`. Drop the output into the project, then reference it from the composition HTML — see the `hyperframes` skill for the audio/video element conventions.
 
 ## Text-to-Speech (`tts`)
 
@@ -31,7 +29,16 @@ Match voice to content. Default is `af_heart`.
 | Documentation     | `bf_emma`/`bm_george` | Clear British English, formal |
 | Casual / social   | `af_heart`/`af_sky`   | Approachable, natural         |
 
-8 languages supported: EN, JP, ZH, KO, FR, DE, IT, PT. Run `--list` for the full set.
+### Multilingual
+
+Voice IDs encode language in the first letter: `a`=American English, `b`=British English, `e`=Spanish, `f`=French, `h`=Hindi, `i`=Italian, `j`=Japanese, `p`=Brazilian Portuguese, `z`=Mandarin. The CLI auto-detects the phonemizer locale from the prefix — no `--lang` needed when the voice matches the text.
+
+```bash
+npx hyperframes tts "La reunión empieza a las nueve" --voice ef_dora --output es.wav
+npx hyperframes tts "今日はいい天気ですね" --voice jf_alpha --output ja.wav
+```
+
+Use `--lang` only to override auto-detection (stylized accents). Valid codes: `en-us`, `en-gb`, `es`, `fr-fr`, `hi`, `it`, `pt-br`, `ja`, `zh`. Non-English phonemization requires `espeak-ng` system-wide (`brew install espeak-ng` / `apt-get install espeak-ng`).
 
 ### Speed
 
@@ -44,24 +51,9 @@ Match voice to content. Default is `af_heart`.
 
 For more than a few paragraphs, write to a `.txt` file and pass the path. Inputs over ~5 minutes of speech may benefit from splitting into segments.
 
-### Use in a Composition
-
-Reference the generated audio as a standard `<audio>` track:
-
-```html
-<audio
-  id="narration"
-  data-start="0"
-  data-duration="auto"
-  data-track-index="2"
-  src="narration.wav"
-  data-volume="1"
-></audio>
-```
-
 ### Requirements
 
-Python 3.8+ with `kokoro-onnx` and `soundfile` (`pip install kokoro-onnx soundfile`). Model downloads automatically on first use (~311 MB + ~27 MB voices, cached in `~/.cache/hyperframes/tts/`).
+Python 3.8+ with `kokoro-onnx` and `soundfile` (`pip install kokoro-onnx soundfile`). Model downloads on first use (~311 MB + ~27 MB voices, cached in `~/.cache/hyperframes/tts/`).
 
 ## Transcription (`transcribe`)
 
@@ -72,7 +64,7 @@ npx hyperframes transcribe audio.mp3
 npx hyperframes transcribe video.mp4 --model small --language es
 npx hyperframes transcribe subtitles.srt          # import existing
 npx hyperframes transcribe subtitles.vtt
-npx hyperframes transcribe openai-response.json   # import OpenAI output
+npx hyperframes transcribe openai-response.json
 ```
 
 ### Language Rule (Non-Negotiable)
@@ -85,19 +77,29 @@ npx hyperframes transcribe openai-response.json   # import OpenAI output
 
 **Default model is `small`, not `small.en`.**
 
+### Model Sizes
+
+| Model      | Size   | Speed    | When to use                           |
+| ---------- | ------ | -------- | ------------------------------------- |
+| `tiny`     | 75 MB  | Fastest  | Quick previews, testing pipeline      |
+| `base`     | 142 MB | Fast     | Short clips, clear audio              |
+| `small`    | 466 MB | Moderate | **Default** — most content            |
+| `medium`   | 1.5 GB | Slow     | Important content, noisy audio, music |
+| `large-v3` | 3.1 GB | Slowest  | Production quality                    |
+
+Music with vocals: start at `medium` minimum; produced tracks often need manual SRT/VTT import. For caption-quality checks (mandatory after every transcription), the cleaning JS, retry rules, and the OpenAI/Groq API import path, see [hyperframes/references/transcript-guide.md](../hyperframes/references/transcript-guide.md).
+
 ### Output Shape
 
-The composition consumes a flat array of word objects:
+Compositions consume a flat array of word objects. The `id` field (`w0`, `w1`, ...) is added during normalization for stable references in caption overrides; it's optional for backwards compatibility.
 
 ```json
 [
-  { "text": "Hello", "start": 0.0, "end": 0.5 },
-  { "text": "world.", "start": 0.6, "end": 1.2 }
+  { "id": "w0", "text": "Hello", "start": 0.0, "end": 0.5 },
+  { "id": "w1", "text": "world.", "start": 0.6, "end": 1.2 }
 ]
 ```
 
-For caption rendering, styling, and per-word effects, invoke the `hyperframes` skill (composition authoring).
-
 ## Background Removal (`remove-background`)
 
 Remove the background from a video or image so it can sit as a transparent overlay in a composition (e.g. an avatar floating on a background plate).
@@ -112,36 +114,23 @@ npx hyperframes remove-background --info                          # detected pro
 
 Uses `u2net_human_seg` (MIT). First run downloads ~168 MB of weights to `~/.cache/hyperframes/background-removal/models/`.
 
-### Output Format Choice
+### Output Format
 
 | Format                | When                                                          |
 | --------------------- | ------------------------------------------------------------- |
-| `.webm` (VP9 + alpha) | Default. Compositions consume this directly via `<video>`.    |
+| `.webm` (VP9 + alpha) | Default. Compositions play this directly via `<video>`.       |
 | `.mov` (ProRes 4444)  | Editing in DaVinci/Premiere/FCP. Large files.                 |
 | `.png`                | Single-image cutout (still subject, layered over a backdrop). |
 
-### Use in a Composition
-
-Drop the `.webm` into the project, then play it like any other video — Chrome decodes VP9 alpha natively:
-
-```html
-<video src="transparent.webm" autoplay muted loop></video>
-```
+Chrome decodes VP9 alpha natively, so the `.webm` plugs into a composition like any other muted-autoplay video — see the `hyperframes` skill for the `<video>` track conventions.
 
 ## TTS → Transcribe → Captions
 
-Chain the commands when you don't have a pre-recorded voiceover:
+When there's no pre-recorded voiceover, generate one and transcribe it back to get word-level timestamps for captions:
 
 ```bash
-# 1. Generate speech
 npx hyperframes tts script.txt --voice af_heart --output narration.wav
-
-# 2. Transcribe back for word-level timestamps
-npx hyperframes transcribe narration.wav
-
-# 3. narration.wav + transcript.json are ready for captions
+npx hyperframes transcribe narration.wav   # → transcript.json
 ```
 
 Whisper extracts precise word boundaries from the generated audio, so caption timing matches delivery without hand-tuning.
-
-For caption rendering, styling, and timeline integration, see the `hyperframes` skill.
diff --git a/skills/hyperframes/SKILL.md b/skills/hyperframes/SKILL.md
@@ -467,7 +467,6 @@ Skip on small edits (fixing a color, adjusting one duration). Run on new composi
 ## References (loaded on demand)
 
 - **[references/captions.md](references/captions.md)** — Captions, subtitles, lyrics, karaoke synced to audio. Tone-adaptive style detection, per-word styling, text overflow prevention, caption exit guarantees, word grouping. Read when adding any text synced to audio timing.
-- **[references/tts.md](references/tts.md)** — Text-to-speech with Kokoro-82M. Voice selection, speed tuning, TTS+captions workflow. Read when generating narration or voiceover.
 - **[references/audio-reactive.md](references/audio-reactive.md)** — Audio-reactive animation: map frequency bands and amplitude to GSAP properties. Read when visuals should respond to music, voice, or sound.
 - **[references/css-patterns.md](references/css-patterns.md)** — CSS+GSAP marker highlighting: highlight, circle, burst, scribble, sketchout. Deterministic, fully seekable. Read when adding visual emphasis to text.
 - **[references/video-composition.md](references/video-composition.md)** — Video-medium rules: density, color presence, scale, frame composition, design.md as brand not layout. **Always read** — these override web instincts.
@@ -481,7 +480,7 @@ Skip on small edits (fixing a color, adjusting one duration). Run on new composi
 - **[house-style.md](house-style.md)** — Default motion, sizing, and color palettes when no design.md is specified.
 - **[patterns.md](patterns.md)** — PiP, title cards, slide show patterns.
 - **[data-in-motion.md](data-in-motion.md)** — Data, stats, and infographic patterns.
-- **[references/transcript-guide.md](references/transcript-guide.md)** — Transcription commands, whisper models, external APIs, troubleshooting.
+- **[references/transcript-guide.md](references/transcript-guide.md)** — Caption-side transcript handling: input formats, mandatory quality check, cleaning JS, OpenAI/Groq API fallback, "if no transcript exists" flow. (For the `transcribe` CLI invocation, model selection rules, and the `.en` gotcha, see the `hyperframes-media` skill.)
 - **[references/dynamic-techniques.md](references/dynamic-techniques.md)** — Dynamic caption animation techniques (karaoke, clip-path, slam, scatter, elastic, 3D).
 
 - **[references/transitions.md](references/transitions.md)** — Scene transitions: crossfades, wipes, reveals, shader transitions. Energy/mood selection, CSS vs WebGL guidance. **Always read for multi-scene compositions** — scenes without transitions feel like jump cuts.
diff --git a/skills/hyperframes/references/transcript-guide.md b/skills/hyperframes/references/transcript-guide.md
@@ -1,24 +1,6 @@
 # Transcript Guide
 
-## How Transcripts Are Generated
-
-`hyperframes transcribe` handles both transcription and format conversion:
-
-```bash
-# Transcribe audio/video (uses whisper.cpp locally, no API key needed)
-npx hyperframes transcribe audio.mp3
-
-# Use a larger model for better accuracy
-npx hyperframes transcribe audio.mp3 --model medium.en
-
-# Filter to English only (skips non-English speech)
-npx hyperframes transcribe audio.mp3 --language en
-
-# Import an existing transcript from another tool
-npx hyperframes transcribe captions.srt
-npx hyperframes transcribe captions.vtt
-npx hyperframes transcribe openai-response.json
-```
+For the `transcribe` CLI invocation, the `.en`-translates-non-English rule, and whisper model selection, see the `hyperframes-media` skill. This file covers what to do with the resulting transcript when authoring captions: input formats, mandatory quality checks, cleaning code, external-API fallbacks.
 
 ## Supported Input Formats
 
@@ -34,32 +16,6 @@ The CLI auto-detects and normalizes these formats:
 
 **Word-level timestamps produce better captions.** SRT/VTT give phrase-level timing, which works but can't do per-word animation effects.
 
-## Whisper Model Guide
-
-The default model (`small.en`) balances accuracy and speed. For better results, use a larger model:
-
-| Model      | Size   | Speed    | Accuracy  | When to use                           |
-| ---------- | ------ | -------- | --------- | ------------------------------------- |
-| `tiny`     | 75 MB  | Fastest  | Low       | Quick previews, testing pipeline      |
-| `base`     | 142 MB | Fast     | Fair      | Short clips, clear audio              |
-| `small`    | 466 MB | Moderate | Good      | **Default** — good for most content   |
-| `medium`   | 1.5 GB | Slow     | Very good | Important content, noisy audio, music |
-| `large-v3` | 3.1 GB | Slowest  | Best      | Production quality                    |
-
-**Only add `.en` suffix when the user explicitly says the audio is English.** `.en` models are slightly more accurate for English but will TRANSLATE non-English audio instead of transcribing it.
-
-**Critical: `.en` models translate non-English audio into English** — they don't transcribe it. If the audio might not be English, always use a model without the `.en` suffix and pass `--language` to specify the source language. If you're unsure of the language, use `small` (not `small.en`) without `--language` — whisper will auto-detect.
-
-```bash
-# Spanish audio
-npx hyperframes transcribe audio.mp3 --model small --language es
-
-# Unknown language — let whisper auto-detect
-npx hyperframes transcribe audio.mp3 --model small
-```
-
-**Music and vocals over instrumentation**: `small.en` will misidentify lyrics — use `medium.en` as the minimum, or import lyrics manually. Even `medium.en` struggles with heavily produced tracks; for music videos, providing known lyrics as an SRT/VTT and importing with `hyperframes transcribe lyrics.srt` will always beat automated transcription.
-
 ## Transcript Quality Check (Mandatory)
 
 After every transcription, **read the transcript and check for quality issues before proceeding.** Bad transcripts produce nonsensical captions. Never skip this step.
diff --git a/skills/hyperframes/references/tts.md b/skills/hyperframes/references/tts.md