# HopDown Tokenizer Design
## Problem

The regex-based inline parser and serializer can't reliably distinguish structural delimiters from literal text characters. This causes:

- `toMarkdown` escaping bugs (over-escaping inside inline tags, under-escaping in text nodes)
- Round-trip failures (`toHTML(toMarkdown(html)) !== html`)
- Fragile interactions between features (underscore normalization + strikethrough, HTML passthrough + escaping)
## Invariants

- `toHTML` satisfies GFM spec rules 1-15
- `toMarkdown` always emits the canonical form
- `toHTML(toMarkdown(html)) === html` (single-pass round-trip)
## Architecture
### Token types

- `text` — literal characters, will be escaped during serialization
- `delimiter` — structural marker (`**`, `*`, `~~`, `` ` ``, etc.)
- `html` — raw HTML tag passthrough
- `break` — hard line break (`<br>`)
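A minimal sketch of how these four token kinds might be declared as a discriminated union (the `Token` shape and field names are illustrative assumptions, not the actual contents of `types.ts`):

```typescript
// Illustrative sketch only: field names are assumptions, not types.ts.
type Token =
  | { kind: "text"; value: string }                      // literal chars, escaped on output
  | { kind: "delimiter"; value: string; open: boolean }  // structural marker: **, *, ~~, `
  | { kind: "html"; value: string }                      // raw HTML passthrough
  | { kind: "break" };                                   // hard line break (<br>)

// Type guard: narrows a Token to the delimiter variant.
function isDelimiter(t: Token): t is Extract<Token, { kind: "delimiter" }> {
  return t.kind === "delimiter";
}
```

A discriminated union keeps the downstream passes honest: the serializer can only reach a token's `value` after checking `kind`, which is exactly the text-vs-delimiter distinction the design hinges on.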
### Inline tokenizer (markdown → tokens)

Scans left to right, character by character, maintaining a stack of open delimiters and producing a flat token stream:

Input: `"hello **bold *nested*** end"`

Tokens: `[text "hello "] [open **] [text "bold "] [open *] [text "nested"] [close *] [close **] [text " end"]`
The tokenizer handles:

- Backslash escapes: `\*` → text token containing `*`
- Entity resolution: `&amp;` → text token containing `&`
- Flanking rules: only emit delimiter tokens when flanking conditions are met
- Code spans: `` ` `` opens a code span that consumes everything until the matching `` ` ``
- Links: `[text](url)` parsed as a unit
- Autolinks: `<url>` and bare URLs
- Hard line breaks: trailing spaces or `\` before a newline
- HTML tags: `<span>` etc. passed through as html tokens
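To make the left-to-right scan concrete, here is a heavily simplified sketch covering only backslash escapes and `*`/`**` delimiters (no flanking rules, code spans, or links — those are the real tokenizer's job, and the token shape here is assumed):

```typescript
type Tok = { kind: "text" | "delimiter"; value: string };

// Simplified scanner: backslash escapes and * / ** runs only.
function tokenizeInline(src: string): Tok[] {
  const tokens: Tok[] = [];
  let text = "";
  const flush = () => {
    if (text) { tokens.push({ kind: "text", value: text }); text = ""; }
  };
  for (let i = 0; i < src.length; i++) {
    const ch = src[i];
    if (ch === "\\" && i + 1 < src.length) {
      text += src[++i];                  // escaped char becomes literal text
    } else if (ch === "*") {
      flush();                           // close the pending text token
      const run = src[i + 1] === "*" ? "**" : "*";
      tokens.push({ kind: "delimiter", value: run });
      i += run.length - 1;
    } else {
      text += ch;
    }
  }
  flush();
  return tokens;
}
```

Note how `\*` never produces a delimiter token: the escape is resolved during scanning, so every later pass can trust that delimiter tokens are structural.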
### Inline parser (tokens → HTML)

Walks the token stream and matches open/close delimiter pairs using a stack, producing an HTML string. Handles:

- Delimiter pairing with precedence (`***` before `**` before `*`)
- The multiple-of-3 rule
- Nesting validation (no em inside em, no links inside links)
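A bare-bones sketch of the stack-based pairing pass (the tag mapping and token shape are illustrative; precedence, the multiple-of-3 rule, and nesting validation are omitted):

```typescript
type Tok = { kind: "text" | "delimiter"; value: string };

// Illustrative delimiter → tag mapping, not HopDown's real tag table.
const TAGS: Record<string, string> = { "**": "strong", "*": "em", "~~": "del" };

function tokensToHTML(tokens: Tok[]): string {
  const stack: string[] = [];   // values of currently open delimiters
  let out = "";
  for (const t of tokens) {
    if (t.kind === "text") {
      out += t.value;
    } else if (stack[stack.length - 1] === t.value) {
      stack.pop();              // matches innermost open delimiter: close it
      out += `</${TAGS[t.value]}>`;
    } else {
      stack.push(t.value);      // otherwise treat it as an opener
      out += `<${TAGS[t.value]}>`;
    }
  }
  return out;
}
```

The stack is what makes nesting cheap to validate: a closer is only legal when it matches the top of the stack, and rules like "no em inside em" reduce to a scan of the open-delimiter stack.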
### Serializer (DOM → tokens → markdown)
Walks the DOM tree. For each node:
- Text nodes → text tokens (the serializer knows these need escaping)
- Element nodes → look up the tag, emit delimiter tokens + recurse into children
- Unknown elements → recurse into children
Then the token stream is serialized to a string:
- Delimiter tokens → emitted verbatim (they're structural)
- Text tokens → characters that would be misinterpreted as delimiters are backslash-escaped. The serializer knows exactly which characters are dangerous because it knows what delimiters exist.
- HTML tokens → emitted verbatim
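The final token → string pass can be sketched as follows (the escaped character set is an assumption for illustration; per the design, the real serializer knows the exact set because it knows what delimiters exist):

```typescript
type Tok =
  | { kind: "text"; value: string }
  | { kind: "delimiter" | "html"; value: string };

// Assumed delimiter character set for this sketch.
const DELIMITER_CHARS = /[*_~`[\]\\]/g;

function emit(tokens: Tok[]): string {
  return tokens
    .map(t =>
      t.kind === "text"
        ? t.value.replace(DELIMITER_CHARS, c => "\\" + c) // escape dangerous chars
        : t.value                                         // structural: verbatim
    )
    .join("");
}
```

Because escaping happens in exactly one place, on exactly one token kind, over-escaping inside inline tags and under-escaping in text nodes can't both sneak in from different code paths.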
## Why this solves the round-trip problem
The key insight: delimiter tokens and text tokens are different types.
When serializing `<strong>hello *world*</strong>`, the output is:

`[delim **] [text "hello "] [delim *] [text "world"] [delim *] [delim **]`
The `*` around "world" are delimiter tokens (from the nested `<em>`).

If instead the text contained a literal `*`:

`<strong>hello * world</strong>`

the output would be:

`[delim **] [text "hello * world"] [delim **]`

The `*` is a text token. During serialization, the text-token scanner sees `*` and escapes it to `\*`, because `*` is a known delimiter character. The delimiter tokens are never escaped. No ambiguity.
## Files

- `types.ts` — Token type, updated Tag interface
- `tokenizer.ts` — Inline tokenizer (markdown → tokens)
- `serializer.ts` — DOM → tokens → markdown string
- `hopdown.ts` — Orchestrator (block parsing, delegates inline to tokenizer)
- `tags.ts` — Tag definitions (simplified: no more regex patterns)
## Migration

The Tag interface changes:

- `pattern` field removed (the tokenizer handles delimiter matching)
- `toMarkdown` returns `Token[]` instead of a string
- `match` stays the same (block-level matching is already clean)
- `toHTML` stays the same
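A hypothetical before/after sketch of that interface change (the `Token` shape, the element parameter type, and all field names beyond those listed above are assumptions, not the real `types.ts`):

```typescript
// Illustrative Token shape; the real union lives in types.ts.
interface Token { kind: "text" | "delimiter" | "html" | "break"; value?: string }

// Before: regex-driven tags.
interface TagV1 {
  pattern: RegExp;                              // removed in the new design
  match(block: string): boolean;
  toHTML(source: string): string;
  toMarkdown(el: { tagName: string }): string;  // returned a raw string
}

// After: tokenizer-driven tags.
interface TagV2 {
  match(block: string): boolean;                // unchanged
  toHTML(source: string): string;               // unchanged
  toMarkdown(el: { tagName: string }): Token[]; // now returns tokens
}

// Hypothetical TagV2 implementation for <em>.
const em: TagV2 = {
  match: (block) => block.includes("*"),
  toHTML: (source) => `<em>${source}</em>`,
  toMarkdown: () => [{ kind: "delimiter", value: "*" }],
};
```

Returning `Token[]` instead of a string is the load-bearing change: a tag can no longer hand the serializer pre-escaped (or unescaped) text, so the escaping decision stays centralized in the token emitter.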
The HopDown public API stays the same:

- `toHTML(markdown)` — unchanged
- `toMarkdown(html)` — unchanged
- `findCompletePair`, `findUnmatchedOpener` — reimplemented on top of the tokenizer
- `getTagForElement`, `getEditableSelector` — unchanged