---
name: normalize-format
description: Normalizes a single inbox file of any supported format (plain markdown, Claude.ai JSON export, Claude Code JSONL session, Readwise markdown/CSV highlight, transcript with timestamps or speaker labels, link capture) into a clean markdown body plus partial frontmatter (id, title, source block, word_count). Handles format-specific failure modes — JSON content-block arrays, timestamp stripping, per-highlight chunking, URL-vs-commentary separation. Use when ingesting any inbox item for the substacker Librarian. Trigger keywords: normalize, convert, parse, transcript, export, JSON, JSONL, highlight, CSV.
---
Related skills: Called by ingest-inbox-item as step 1. Upstream of tag-by-topic, score-intuition-density, dedupe-against-corpus.
| Extension | Format | Notes |
|---|---|---|
.md, .txt |
plain markdown | Default; passes through |
.json |
Claude.ai export | Conversation with messages array |
.jsonl |
Claude Code session | Content-block array per response |
.md (Readwise-shaped) |
Readwise export | Highlights + user notes |
.csv |
Readwise CSV | Per-row highlight |
.vtt, .srt, .md (diarized) |
Transcript | May include timestamps + speaker labels |
.md with URL + commentary |
Link capture | User's framing is the signal |
Normalize one file:
- [ ] Step 1: Detect format by extension + first-line sniff
- [ ] Step 2: Apply format-specific parse
- [ ] Step 3: Split long transcripts at topic boundaries (>3000 words)
- [ ] Step 4: Emit [{body, partial_frontmatter}, ...] list (usually one item)
.jsonlwith"type":"assistant"→ Claude Code session..jsonwith"conversation"/"messages"top-level key → Claude.ai export..mdstarting with#and Readwise boilerplate (**Highlights first synced by Readwise...**) → Readwise..vtt/.srt, or.mdwith[HH:MM:SS]timestamp pattern, or lines prefixed with speaker labels likeMe:→ transcript..mdwith ≤50 words and a prominent URL → link capture.- Else: plain markdown.
Plain markdown: pass body through unchanged. Title = first H1 or filename-derived.
Claude.ai JSON: flatten content blocks to markdown. Preserve user/assistant turn labels (**Me:** / **Claude:**). Strip system-reminder blocks. provenance.author: claude, confidence: paraphrased.
Claude Code JSONL: flatten content-block array. Drop tool_use blocks unless the adjacent user message references the tool output. Strip system reminders.
Readwise: split per-book file into one seed per highlight. Body = highlight + user note. Boilerplate stripped. For bare highlights (no user note), set provenance.confidence: quoted, density capped at 3 downstream. For user-annotated highlights, confidence: owned.
Transcript: strip timestamps. Preserve speaker labels as **Speaker:** prefixes. If >3000 words, split at topic shifts — emit multiple outputs sharing parent_source. Target ~1500 words per chunk.
Link capture: separate URL from commentary. Body = user's commentary. Frontmatter adds source.linked_url. If <50 words of commentary, flag low_commentary: true so the scorer caps density.
Split heuristic: paragraph break + topic-vocabulary shift (measured by tag overlap drop across adjacent paragraphs). Each chunk ~1500 words. Preserve parent_source across chunks.
- A file that looks like a transcript but is actually an email thread (first-line sniff:
From: ...,Date: ...,Subject: ...) — reclassify as plain markdown or link capture. - A Readwise file missing boilerplate — treat as plain markdown.
- A
.jsonfile that isn't a Claude export — treat as plain markdown and wrap in code fences.
Input (inbox/2026-04-21-claude-bnn.json):
{"conversation":{"name":"BNN variational","messages":[
{"role":"user","content":[{"type":"text","text":"help me intuit why variational inference..."}]},
{"role":"assistant","content":[{"type":"text","text":"Think of it as fitting a simple distribution..."}]}
]}}Output:
# BNN variational
**Me:** help me intuit why variational inference...
**Claude:** Think of it as fitting a simple distribution...With partial_frontmatter = {id: 2026-04-21-bnn-variational, title: "BNN variational", source: {type: claude-conversation, ...}, provenance: {author: claude, confidence: paraphrased}}.
- Readwise CSV with malformed rows: skip the row, log
WARN | malformed CSV row in <file> line Nto changelog. - Claude.ai JSON schema drift: if
messageskey missing, fall back to recursive text extraction; markconfidence: paraphrasedregardless. - Transcript that is actually an email thread: reclassify rather than keep as transcript.
- Never OCR images. Image-only inbox items get a seed body of
[image: awaiting user annotation]andstatus: deadwith reasonimage-only. - Files >50k words: refuse and log. User splits manually.
- Never lose the original. Even if parsing fails, the inbox file stays intact (caller moves to
.processed/only on success).
- Seven supported formats, each with a specific parser.
- Returns
[{body, partial_frontmatter}, ...]— always a list, usually of length 1. - Long transcripts split into multiple outputs sharing
parent_source.