# HopDown Tokenizer Design

## Problem

The regex-based inline parser and serializer can't reliably distinguish structural delimiters from literal text characters. This causes:

- `toMarkdown` escaping bugs (over-escaping inside inline tags, under-escaping in text nodes)
- Round-trip failures (`toHTML(toMarkdown(html)) !== html`)
- Fragile interactions between features (underscore normalization + strikethrough, HTML passthrough + escaping)

## Invariants

1. `toHTML` satisfies GFM spec rules 1-15
2. `toMarkdown` always emits the canonical form
3. `toHTML(toMarkdown(html)) === html` (single-pass round-trip)

## Architecture

### Token types

```
text      — literal characters, escaped during serialization
delimiter — structural marker (**, *, ~~, `, etc.)
html      — raw HTML tag passthrough
break     — hard line break (<br>)
```

### Inline tokenizer (markdown → tokens)

Scans left to right, character by character, maintaining a stack of open delimiters, and produces a flat token stream:

```
Input:  "hello **bold *nested*** end"
Tokens: [text "hello "] [open **] [text "bold "] [open *]
        [text "nested"] [close *] [close **] [text " end"]
```

The tokenizer handles:

- Backslash escapes: `\*` → text token containing `*`
- Entity resolution: `&amp;` → text token containing `&`
- Flanking rules: delimiter tokens are emitted only when flanking conditions are met
- Code spans: `` ` `` opens a code span that consumes everything until the matching `` ` ``
- Links: `[text](url)` parsed as a unit
- Autolinks: `<https://example.com>` and bare URLs
- Hard line breaks: trailing spaces or `\` before a newline
- Inline HTML tags (e.g. `<span>`): passed through as html tokens

### Inline parser (tokens → HTML)

Walks the token stream, matching open/close delimiter pairs with a stack, and produces an HTML string. Handles:

- Delimiter pairing with precedence (`***` before `**` before `*`)
- The multiple-of-3 rule
- Nesting validation (no em inside em, no links inside links)

### Serializer (DOM → tokens → markdown)

Walks the DOM tree. For each node:

- Text nodes → text tokens (the serializer knows these need escaping)
- Element nodes → look up the tag, emit delimiter tokens, and recurse into children
- Unknown elements → recurse into children

Then the token stream is serialized to a string:

- Delimiter tokens → emitted verbatim (they're structural)
- Text tokens → characters that would be misread as delimiters are backslash-escaped. The serializer knows exactly which characters are dangerous because it knows which delimiters exist.
- HTML tokens → emitted verbatim

### Why this solves the round-trip problem

The key insight: delimiter tokens and text tokens are distinct types. When serializing `<strong>hello <em>world</em></strong>`, the output is:

```
[delim **] [text "hello "] [delim *] [text "world"] [delim *] [delim **]
```

The `*` around "world" are delimiter tokens (from the nested `<em>`).
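Concretely, the final serialization pass only ever escapes inside text tokens. A minimal sketch in TypeScript (the `Token` shape, the `DELIM_CHARS` set, and the `emit` name are illustrative assumptions, not the actual hopdown code):

```typescript
// Hypothetical token shapes; the real types live in types.ts.
type Token =
  | { kind: "text"; value: string }
  | { kind: "delimiter"; value: string }
  | { kind: "html"; value: string };

// Characters that could be read back as delimiters (illustrative subset).
const DELIM_CHARS = /[*_~`\\\[\]]/g;

function emit(tokens: Token[]): string {
  return tokens
    .map((t) =>
      t.kind === "text"
        ? t.value.replace(DELIM_CHARS, (c) => "\\" + c) // literal text: escape
        : t.value // delimiter and html tokens are structural: verbatim
    )
    .join("");
}

// A literal '*' in a text token is escaped; a delimiter '*' never is.
console.log(emit([
  { kind: "delimiter", value: "**" },
  { kind: "text", value: "hello * world" },
  { kind: "delimiter", value: "**" },
]));
// → **hello \* world**
```

Because the decision is made per token kind rather than by re-scanning the output string, a literal `*` and a structural `*` can never be confused.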
If instead the text contained a literal `*`:

```
<strong>hello * world</strong>
```

the output would be:

```
[delim **] [text "hello * world"] [delim **]
```

Here the `*` is a text token. During serialization, the text-token scanner sees `*` and escapes it to `\*`, because `*` is a known delimiter character. Delimiter tokens are never escaped. No ambiguity.

## Files

- `types.ts` — Token type, updated Tag interface
- `tokenizer.ts` — inline tokenizer (markdown → tokens)
- `serializer.ts` — DOM → tokens → markdown string
- `hopdown.ts` — orchestrator (block parsing, delegates inline parsing to the tokenizer)
- `tags.ts` — tag definitions (simplified: no more regex patterns)

## Migration

The Tag interface changes:

- `pattern` field removed (the tokenizer handles delimiter matching)
- `toMarkdown` returns `Token[]` instead of a string
- `match` stays the same (block-level matching is already clean)
- `toHTML` stays the same

The HopDown public API stays the same:

- `toHTML(markdown)` — unchanged
- `toMarkdown(html)` — unchanged
- `findCompletePair`, `findUnmatchedOpener` — reimplemented on top of the tokenizer
- `getTagForElement`, `getEditableSelector` — unchanged
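As an illustration of the new `toMarkdown` shape, a strong tag definition might change roughly like this (the `Token` layout and the `tagName` field are assumptions for this sketch; the real interfaces live in `types.ts` and `tags.ts`):

```typescript
// Hypothetical shapes for illustration only.
type Token =
  | { kind: "text"; value: string }
  | { kind: "delimiter"; value: string };

interface Tag {
  tagName: string;                            // assumed field name
  toHTML: (inner: string) => string;          // unchanged
  toMarkdown: (children: Token[]) => Token[]; // was: (inner: string) => string
}

// A strong tag wraps its already-tokenized children in ** delimiter tokens,
// leaving any escaping decisions to the later text-token serialization pass.
const strong: Tag = {
  tagName: "strong",
  toHTML: (inner) => `<strong>${inner}</strong>`,
  toMarkdown: (children) => [
    { kind: "delimiter", value: "**" },
    ...children,
    { kind: "delimiter", value: "**" },
  ],
};

const out = strong.toMarkdown([{ kind: "text", value: "bold" }]);
// out: [delim **] [text "bold"] [delim **]
```

Returning tokens instead of a string is what removes the old `pattern` field: the tag no longer needs to know how its delimiters interact with surrounding text.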