Convert Hugo mmark LaTeX into Pandoc

blog
Published

December 13, 2022

I’ve recently migrated from Hugo to Quarto and one of the hardest steps was converting the equations in Hugo’s legacy mmark format to Quarto. This notebook shows how I converted the equations without changing equations inside code blocks (see fix_tex.py in my hugo2quarto repository for an executable version of this).

The problem

The (deprecated) version of mmark in Hugo uses an unusual syntax for TeX. It’s not documented (except in the code, e.g. inline math), but some empirical rules for mmark are:

  • $$...$$ inside a paragraph starts inline math (even with whitespace surrounding …)
  • $$...$$ after a paragraph starts a math block (even with whitespace surrounding …)
  • A $ sign not followed by another $ sign is just a normal $ sign (A \$ should also be a $ mode)
  • Math isn’t rendered in inline code/code blocks

In Pandoc the syntax is documented:

Anything between two $ characters will be treated as TeX math. The opening $ must have a non-space character immediately to its right, while the closing $ must have a non-space character immediately to its left, and must not be followed immediately by a digit. Thus, \$20,000 and \$30,000 won’t parse as math. If for some reason you need to enclose text in literal $ characters, backslash-escape them and they won’t be treated as math delimiters. For display math, use $$ delimiters. (In this case, the delimiters may be separated from the formula by whitespace. However, there can be no blank lines between the opening and closing $$ delimiters.)

In summary:

  • $...$ delimits inline TeX (with no space allowed just inside the delimiters)
  • $$...$$ delimits a math block
  • A \$ sign is rendered as a normal $ sign
  • Math isn’t rendered in inline code/code blocks
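
To make the Pandoc inline rule concrete, here is a rough regular-expression approximation of it. This is only a sketch for illustration: the names `PANDOC_INLINE_MATH` and `find_inline_math` are made up here, and Pandoc’s real parser handles more cases (such as code spans) than this pattern does.

```python
import re

# Rough approximation of Pandoc's inline-math rule (an assumption for
# illustration, not Pandoc's actual implementation):
#   - the opening $ is unescaped and followed by a non-space character
#   - the closing $ is unescaped, preceded by a non-space character,
#     and not followed by a digit
PANDOC_INLINE_MATH = re.compile(r"(?<!\\)\$(?=\S)(.+?)(?<=\S)(?<!\\)\$(?!\d)")


def find_inline_math(text: str) -> list:
    """Return the contents of every span the rough pattern treats as math."""
    return PANDOC_INLINE_MATH.findall(text)
```

With this pattern, `$20,000 and $30,000` yields no matches because the second `$` is preceded by a space, which is consistent with the rule quoted above.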

The final script implementing this is in my hugo2quarto repository as fix_tex.py; the rest of this notebook walks through how it works.

Tests

The result should be a function that takes mmark code and returns pandoc code.

Since there is a set of rules, the best way to check the implementation is with some examples. Each Example will have a descriptive name, the mmark input and the expected pandoc output.

from dataclasses import dataclass

@dataclass
class Example:
    name: str
    mmark: str
    pandoc: str

We’ll generate a bunch of examples that satisfy the above rules.

Sometimes there are multiple possibilities, like with $20,000 to $30,000, but we will just pick a simple rule to transform them (escaping every $).

There are a bunch of other cases we won’t check (like indented code blocks and HTML blocks) because they don’t occur in the Skeptric code.

examples = [
    Example("Inline",
            "And $$x=2$$",
            "And $x=2$"),
    
    Example("Inline Space",
            "And $$ x = 2 $$",
            "And $x = 2$"),
    
    Example("Block",
           "And\n\n$$x=2$$\n",
           "And\n\n$$x=2$$\n"),
    
    Example("Block space",
            "And\n\n$$ x = 2 $$\n",
            "And\n\n$$x = 2$$\n"),
    
    Example("Block multiline",
            # Raw strings, so \b and \t in the TeX are not read as escape sequences
            r"""
$$\begin{align}
& \text{maximize}   && \mathbf{c}^\mathrm{T} \mathbf{x}\\
& \text{subject to} && A \mathbf{x} \le \mathbf{b}, \\
&  && \mathbf{x} \ge \mathbf{0}, \\
\end{align}
$$
""",
            r"""
$$\begin{align}
& \text{maximize}   && \mathbf{c}^\mathrm{T} \mathbf{x}\\
& \text{subject to} && A \mathbf{x} \le \mathbf{b}, \\
&  && \mathbf{x} \ge \mathbf{0}, \\
\end{align}
$$
"""),
    
    Example("Literal $", "It costs $20,000", r"It costs \$20,000"),
    
    Example("Two Literal $", "$20,000 to $30,000", r"\$20,000 to \$30,000"),
    
    Example("Inline code", "And `$x+=1`", "And `$x+=1`"),
    
    Example("Inline code double $", "As TeX `$$x=2$$`", "As TeX `$$x=2$$`"),
    
    Example("Inline code with escape", r"And `\$x=2`", r"And `\$x=2`"),
    
    Example("Fenced code",
            """\n```\n$x+=1\n```\n""",
            """\n```\n$x+=1\n```\n"""),
    
    Example("Fenced code double $",
            """\n```latex\n$$x==2$$\n```\n""",
            """\n```latex\n$$x==2$$\n```\n"""),
    
    Example("Indented code blocks",
            "\n" + r"    %>% mutate_if(is.character, function(x) gsub('\\$', '\\\\$', x))",
            "\n" + r"    %>% mutate_if(is.character, function(x) gsub('\\$', '\\\\$', x))"),
    
    Example("After intended code blocks",
            "Like so\n    $x = 2\nfor $30",
            "Like so\n    $x = 2\nfor \\$30"),
            ]

Check the names are unique

assert len(set([e.name for e in examples])) == len(examples)

Now we can test our examples by checking our transformation function and returning the failures.

def test(f, examples=examples):
    for example in examples:
        data = example.mmark
        result = f(data)
        expected = example.pandoc
        if result != expected:
            yield({'name': example.name, 'data': data, 'result': result, 'expected': expected})

If we return the empty string all tests should fail

assert len(list(test(lambda x: ''))) == len(examples)

A lot of the time the input is unchanged; the identity function will only have a few failures

list(test(lambda x: x))
[{'name': 'Inline',
  'data': 'And $$x=2$$',
  'result': 'And $$x=2$$',
  'expected': 'And $x=2$'},
 {'name': 'Inline Space',
  'data': 'And $$ x = 2 $$',
  'result': 'And $$ x = 2 $$',
  'expected': 'And $x = 2$'},
 {'name': 'Block space',
  'data': 'And\n\n$$ x = 2 $$\n',
  'result': 'And\n\n$$ x = 2 $$\n',
  'expected': 'And\n\n$$x = 2$$\n'},
 {'name': 'Literal $',
  'data': 'It costs $20,000',
  'result': 'It costs $20,000',
  'expected': 'It costs \\$20,000'},
 {'name': 'Two Literal $',
  'data': '$20,000 to $30,000',
  'result': '$20,000 to $30,000',
  'expected': '\\$20,000 to \\$30,000'},
 {'name': 'After intended code blocks',
  'data': 'Like so\n    $x = 2\nfor $30',
  'result': 'Like so\n    $x = 2\nfor $30',
  'expected': 'Like so\n    $x = 2\nfor \\$30'}]

Strategy

We will use a simple Deterministic Finite Automaton (DFA) to handle the transitions between the different states:

  • In default state just yield characters, and look for transitions to other states
  • In inline_code or block_code just yield characters until the end of the code
  • In inline_math or block_math transform the delimiters and strip the whitespace just inside them, leaving the math itself unchanged
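
As a toy illustration of this approach (not the implementation developed below), a two-state machine that only distinguishes default text from inline code could look like this. The function name `escape_dollars_outside_code` is made up for the example, and it ignores math and code blocks entirely.

```python
def escape_dollars_outside_code(text: str) -> str:
    """Escape every $ unless it sits between a pair of backticks."""
    in_code = False          # the only piece of state: default vs inline code
    out = []
    for ch in text:
        if ch == "`":
            in_code = not in_code   # a backtick toggles between the two states
            out.append(ch)
        elif ch == "$" and not in_code:
            out.append("\\$")       # default state: escape the dollar sign
        else:
            out.append(ch)          # everything else is copied unchanged
    return "".join(out)
```

The full version needs more states and multi-character delimiters, which is why the transitions are expressed as data rather than as nested conditionals.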

Why not a parser?

A good solution would be to use one of the many Markdown parsers, like Marko, Mistletoe, or even Pandoc itself. They can all produce Markdown and can be extended, which would allow us to parse mmark maths.

The problem is that they are all destructive parsers: they don’t preserve things like whitespace, and even an identity parse changes the Markdown significantly. This makes the git diffs much bigger and the results harder to check (and I caught a lot of bugs by checking the git diffs).

So we’re forced to write our own.

Implementation

States

We will create a Mode for each state

from enum import Enum, auto

class Mode(Enum):
    DEFAULT = auto()             # Default (paragraph mode)
    INLINE_CODE = auto()         # Inside an inline code
    BLOCK_CODE = auto()          # Inside a code block
    INLINE_MATH = auto()         # Inside inline math
    BLOCK_MATH = auto()          # Inside block math
    INDENTED_CODE = auto()       # Inside an indented code block

Transitions

We transition between the states when we hit certain sequences of tokens.

The below diagram shows the main transitions.

Diagram of DFA for parser

We will capture the transitions in an Action object which has:

  • an input_mode where it applies
  • a match_re, a regular expression on which to trigger the action
  • an output_mode to transition to on match
  • an output string to emit on a match, by default the matched string itself

There is also an implicit default action that consumes the next token (a single character), stays in the current mode, and outputs that token unchanged.
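
Each action relies on matching a regular expression at a specific offset in the input. Here is a small sketch of that primitive, using a compiled pattern’s `match` method with a start position (the variable names are only for illustration):

```python
import re

pattern = re.compile(r"\$\$ *")
text = "And $$ x = 2 $$"

# match() only succeeds if the pattern matches starting exactly at pos
no_match = pattern.match(text, 0)   # text[0] is 'A', so this is None
hit = pattern.match(text, 4)        # text[4:6] is '$$'

matched = hit.group(0)              # the delimiter plus trailing spaces
consumed = len(matched)             # how far a parser should advance
```

Unlike `search`, `match` with a position does not scan forward, which is what lets a parser test every action at the current index.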

import re
from typing import Optional

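@dataclass
class Action:
    input_mode: Mode               # mode in which this action applies
    match_re: str                  # regular expression that triggers the action
    output_mode: Mode              # mode to transition to on a match
    output: Optional[str] = None   # text to emit; defaults to the matched text

    def __post_init__(self):
        # Compile once; pattern is not a dataclass field, so it stays out of the repr
        self.pattern = re.compile(self.match_re)

    def match(self, s: str, idx: int = 0) -> Optional[dict]:
        """Try to match at position idx; return the output and size consumed, or None."""
        match = self.pattern.match(s, idx)
        if match:
            match_str = match.group(0)
            len_match_str = len(match_str)
            # A zero-length match would never advance the parser
            assert len_match_str > 0
            return {'output': self.output or match_str, 'size': len_match_str}
        return None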

Now the transitions can be defined as a list of Actions

actions = [
    # Order matters: the first matching action for the current mode wins,
    # so longer delimiters come before their shorter prefixes.
    # Patterns are raw strings so that \$ and \S are not invalid string escapes.
    Action(Mode.DEFAULT, r"\n```", Mode.BLOCK_CODE),
    Action(Mode.DEFAULT, r"`", Mode.INLINE_CODE),
    Action(Mode.DEFAULT, r"\n    ", Mode.INDENTED_CODE),
    Action(Mode.DEFAULT, r"\n\$\$ *", Mode.BLOCK_MATH, "\n$$"),
    Action(Mode.DEFAULT, r"\$\$ *", Mode.INLINE_MATH, "$"),
    Action(Mode.DEFAULT, r"\$", Mode.DEFAULT, r"\$"),

    Action(Mode.BLOCK_CODE, r"```", Mode.DEFAULT),

    Action(Mode.INLINE_CODE, r"`", Mode.DEFAULT),

    Action(Mode.INLINE_MATH, r" *\$\$", Mode.DEFAULT, "$"),
    Action(Mode.BLOCK_MATH, r" *\$\$", Mode.DEFAULT, "$$"),

    Action(Mode.INDENTED_CODE, r"\n {,3}\S", Mode.DEFAULT),
]
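
Because the first action that matches is the one taken, the order of this list matters. The sketch below (with a made-up helper `first_match` and made-up labels) shows how the same input is classified differently when the `$` rule is tried before the `$$` rule:

```python
import re


def first_match(rules, text, idx=0):
    """Return the label of the first rule whose pattern matches at idx."""
    for label, regex in rules:
        if re.compile(regex).match(text, idx):
            return label
    return None


good_order = [("inline_math", r"\$\$ *"), ("literal_dollar", r"\$")]
bad_order = list(reversed(good_order))

text = "$$x=2$$"
good = first_match(good_order, text)   # the longer delimiter is tried first
bad = first_match(bad_order, text)     # the single $ rule shadows the $$ rule
```

Here `good` is `"inline_math"` while `bad` is `"literal_dollar"`, so putting the single `$` rule first would escape both dollar signs instead of opening math.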

Parsing

Now we need to find the matching action and pattern and update the mode and output.

If there is no matching pattern in this mode then we just consume one token and continue.

import logging
    
def parse(s):
    mode = Mode.DEFAULT
    idx = 0
    output = []
    
    while idx < len(s):
        logging.debug('Mode: %s, Last output: %s, Next chars: %s' % (mode, output[-1:], s[idx:idx+5].replace('\n', '\\n')))
        last_idx = idx
        for action in actions:
            if action.input_mode != mode:
                continue
            match = action.match(s, idx)
            if match:
                logging.debug('Match: %s' % action)
                mode = action.output_mode
                idx += match['size']
                output += match['output']
                break
        else:
            output += s[idx]
            idx += 1
        
        assert idx > last_idx, "Infinite loop"
    
    return ''.join(output)
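
The `else` attached to the `for` loop is what implements the implicit default action: it runs only when the loop finishes without hitting `break`, i.e. when no action matched. Here is a standalone sketch of that control flow (the function `first_prefix` is hypothetical and unrelated to the parser itself):

```python
def first_prefix(text, prefixes):
    """Return the first prefix that text starts with, or a fallback marker."""
    for prefix in prefixes:
        if text.startswith(prefix):
            result = prefix
            break            # skips the else branch below
    else:
        # Reached only if the loop ended without break (nothing matched)
        result = "<no match>"
    return result
```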

Example

Let’s run through an example with logging on to see how it works

logging.getLogger().setLevel('DEBUG')
mmark = examples[1].mmark
mmark
'And $$ x = 2 $$'
parse(mmark)
DEBUG:root:Mode: Mode.DEFAULT, Last output: [], Next chars: And $
DEBUG:root:Mode: Mode.DEFAULT, Last output: ['A'], Next chars: nd $$
DEBUG:root:Mode: Mode.DEFAULT, Last output: ['n'], Next chars: d $$ 
DEBUG:root:Mode: Mode.DEFAULT, Last output: ['d'], Next chars:  $$ x
DEBUG:root:Mode: Mode.DEFAULT, Last output: [' '], Next chars: $$ x 
DEBUG:root:Match: Action(input_mode=<Mode.DEFAULT: 1>, match_re='\\$\\$ *', output_mode=<Mode.INLINE_MATH: 4>, output='$')
DEBUG:root:Mode: Mode.INLINE_MATH, Last output: ['$'], Next chars: x = 2
DEBUG:root:Mode: Mode.INLINE_MATH, Last output: ['x'], Next chars:  = 2 
DEBUG:root:Mode: Mode.INLINE_MATH, Last output: [' '], Next chars: = 2 $
DEBUG:root:Mode: Mode.INLINE_MATH, Last output: ['='], Next chars:  2 $$
DEBUG:root:Mode: Mode.INLINE_MATH, Last output: [' '], Next chars: 2 $$
DEBUG:root:Mode: Mode.INLINE_MATH, Last output: ['2'], Next chars:  $$
DEBUG:root:Match: Action(input_mode=<Mode.INLINE_MATH: 4>, match_re=' *\\$\\$', output_mode=<Mode.DEFAULT: 1>, output='$')
'And $x = 2$'
logging.getLogger().setLevel('INFO')

Run tests

All the tests pass

list(test(parse))
[]
assert not list(test(parse))
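
To apply a converter like this across a whole site, something along the following lines could work. This is a sketch under assumptions, not the contents of fix_tex.py: the function `convert_tree`, its parameters, and the `*.md` glob are all invented for this example, and `convert` stands in for a function such as `parse`.

```python
from pathlib import Path
from typing import Callable, List


def convert_tree(root: Path, convert: Callable[[str], str],
                 pattern: str = "*.md") -> List[Path]:
    """Rewrite every file under root matching pattern; return the files changed."""
    changed = []
    for path in sorted(root.rglob(pattern)):
        original = path.read_text(encoding="utf-8")
        converted = convert(original)
        # Only touch files whose content actually changed
        if converted != original:
            path.write_text(converted, encoding="utf-8")
            changed.append(path)
    return changed
```

Writing only the files whose content changed keeps the git diff limited to real conversions, which makes the results easier to review.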