Module texlang::token::lexer

The Texlang lexer

This module contains Texlang’s TeX lexer, which converts TeX source code into TeX tokens.

Just-in-time lexing

A TeX lexer is different to most other lexers because TeX’s lexing rules are dynamic and can be changed from within TeX source code. In particular, TeX has the following primitives which can change the lexing rules:

  • The \catcode primitive is used to change the category code that is applied to each character in the input, and thus the kind of token that each character gets converted to.

  • The \endlinechar primitive is used to change the character that TeX appends to each line in the input file. TeX appends this character after stripping the new line characters (\n or \r\n) and any trailing space characters ( , ASCII code 32) from the end of the line.
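The line handling described in the second bullet can be sketched as a small, self-contained Rust function. This is an illustration only and not Texlang's implementation: the name prepare_line and its end_line_char parameter are hypothetical, and None is used here to model the case in which no character is appended.

```rust
/// Prepares one raw input line as described above: strip the line
/// terminator and any trailing spaces, then append the current
/// end-of-line character.
///
/// `end_line_char` is a hypothetical stand-in for the current value of
/// \endlinechar; `None` models "append nothing".
fn prepare_line(raw: &str, end_line_char: Option<char>) -> String {
    // Remove a trailing "\r\n" or "\n", if present.
    let without_newline = raw
        .strip_suffix("\r\n")
        .or_else(|| raw.strip_suffix('\n'))
        .unwrap_or(raw);
    // Remove trailing space characters (ASCII 32 only, tabs are kept).
    let trimmed = without_newline.trim_end_matches(' ');
    let mut line = String::from(trimmed);
    if let Some(c) = end_line_char {
        line.push(c);
    }
    line
}
```

With this sketch, prepare_line("Hello  \r\n", Some('\r')) drops the two spaces and the line terminator before appending the carriage return, while passing Some('X') would append X instead.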

The most important implication of TeX having dynamic lexing rules is that the lexer must be lazy, or “just in time”. One cannot, in general, run the lexer over an entire input file to obtain a list of tokens. Instead one must request new tokens just as they are needed. Here is an example of TeX source code that relies on this behavior:

\def\Hello{The macro `Hello' was expanded.\par}
\def\HelloWorld{The macro `Hello World' was expanded.\par}
% change the category code of the character W to category code other
\catcode`\W = 12
\HelloWorld

If the lexer were run over the whole source file at once, the last line would be tokenized as the single control sequence \HelloWorld. However, the fourth line (the \catcode assignment) changes the category code of W to other before the last line is lexed. Because non-singleton control sequence names consist only of characters with the letter category code, the last line is actually tokenized as the control sequence \Hello followed by the other token W and then four letter tokens for orld.

Due to this “just in time” behavior, the API for the Texlang lexer looks somewhat like a Rust iterator. The next token is retrieved on demand using the Lexer’s next method.
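To make the idea concrete, here is a toy lexer in Rust that consults the category codes only at the moment a token is requested. It is a sketch under simplifying assumptions and does not reproduce the real Lexer API: ToyLexer, Token, CatCode and default_cat_codes are hypothetical names, only two category codes are modeled, and the backslash is hard-coded as the escape character.

```rust
use std::collections::HashMap;

/// The two category codes this sketch needs; real TeX has sixteen.
#[derive(Clone, Copy, PartialEq, Debug)]
enum CatCode {
    Letter,
    Other,
}

/// Hypothetical token type, for this sketch only.
#[derive(PartialEq, Debug)]
enum Token {
    ControlSequence(String),
    Character(char, CatCode),
}

/// Toy lexer: holds the unread input and produces one token per call.
struct ToyLexer {
    input: Vec<char>,
    pos: usize,
}

impl ToyLexer {
    fn new(source: &str) -> Self {
        ToyLexer { input: source.chars().collect(), pos: 0 }
    }

    /// Returns the next token. The category codes are read at call time,
    /// so changes made between two calls affect the next token.
    fn next(&mut self, cat_codes: &HashMap<char, CatCode>) -> Option<Token> {
        let cat = |c: char| cat_codes.get(&c).copied().unwrap_or(CatCode::Other);
        let c = *self.input.get(self.pos)?;
        self.pos += 1;
        if c != '\\' {
            return Some(Token::Character(c, cat(c)));
        }
        // Control sequence name: consume characters while they are letters.
        let mut name = String::new();
        while let Some(&n) = self.input.get(self.pos) {
            if cat(n) != CatCode::Letter {
                break;
            }
            name.push(n);
            self.pos += 1;
        }
        // No letters: a single-character control sequence such as \%.
        if name.is_empty() {
            if let Some(&n) = self.input.get(self.pos) {
                name.push(n);
                self.pos += 1;
            }
        }
        Some(Token::ControlSequence(name))
    }
}

/// ASCII letters get the letter category code; everything else is other.
fn default_cat_codes() -> HashMap<char, CatCode> {
    ('a'..='z').chain('A'..='Z').map(|c| (c, CatCode::Letter)).collect()
}
```

With the default codes, lexing \HelloWorld yields one control sequence named HelloWorld. If W is first mapped to CatCode::Other, the same input yields the control sequence Hello, then the other token W, then four letter tokens, which mirrors the example above.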

Subtle lexing behavior

Another implication of TeX’s dynamic lexing rules is that the process of lexing is fragile and susceptible to subtle bugs. Consider, for example, the following TeX source code:

A\endlinechar=`\X
B
C

What is the output of this code? One might expect that the end of line character will be <return> on the first line, and X on the subsequent two lines, thus giving A BXCX. However the result is actually A B CX! The exact order of operations here is:

  1. The \endlinechar control sequence is returned from the lexer, and the primitive starts running.
  2. The optional = is parsed from the input.
  3. TeX starts parsing a number. The first character from the lexer is `, which indicates that the number will be provided via a single character control sequence.
  4. The control sequence \X is then returned from the lexer. At this point the lexer is at the end of line 1, and hasn’t started the new line.
  5. Next, following TeX’s rules for scanning numbers of the form `\X, an optional space is parsed from the input. See e.g. section 442 in TeX The Program. Parsing this optional space triggers the lexer to return another token. Because the current line is over, the lexer loads the next line and, crucially, uses the current definition of the end of line character, which is still \r.
  6. In this case it happens that there is no optional space. So at this point the end of line character is changed to X. At the end of the second line, this will be used when loading the third line.
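The timing in steps 4 to 6 can be replayed with a small Rust sketch. It models only the point at which each line is loaded and which end of line character is current at that moment. It does not tokenize anything, and LineReader, load_line and replay are hypothetical names rather than part of Texlang.

```rust
/// Toy line reader. The end-of-line character is applied when a line is
/// loaded, not when its characters are consumed later.
struct LineReader<'a> {
    lines: std::str::Lines<'a>,
}

impl<'a> LineReader<'a> {
    fn new(source: &'a str) -> Self {
        LineReader { lines: source.lines() }
    }

    /// Loads the next line using whichever end-of-line character is
    /// current at the time of the call.
    fn load_line(&mut self, end_line_char: char) -> Option<String> {
        let mut line = self.lines.next()?.to_string();
        line.push(end_line_char);
        Some(line)
    }
}

/// Replays the order of operations from the list above and returns the
/// three lines as the lexer would see them.
fn replay() -> Vec<String> {
    let mut reader = LineReader::new("A\\endlinechar=`\\X\nB\nC\n");
    let mut end_line_char = '\r';
    let mut loaded = Vec::new();
    // Line 1 is loaded with the initial end-of-line character.
    loaded.extend(reader.load_line(end_line_char));
    // Step 5: scanning for the optional space forces line 2 to be loaded
    // while the end-of-line character is still \r.
    loaded.extend(reader.load_line(end_line_char));
    // Step 6: only now does the assignment take effect.
    end_line_char = 'X';
    // Line 3 is loaded afterwards, so it ends with X.
    loaded.extend(reader.load_line(end_line_char));
    loaded
}
```

In this model the second line is loaded before the assignment completes, so it still ends in \r, and only the third line ends in X.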

The lesson from this example is that the output of TeX source code is dependent on the precise order of lexing operations. It is very easy for implementations to get this wrong. To minimize the chances of bugs in the Texlang lexer, its implementation very closely follows the implementation of Knuth’s TeX (see sections 343 - 356 of TeX The Program).

Using the Texlang lexer

The Texlang lexer is used internally in the Texlang VM to read from input files, but may also be used outside the VM in libraries. For example, implementations of the \openin/\read primitives also need to tokenize TeX source code. For this reason the lexer is a public part of Texlang’s API.

Structs

Enums

Traits

  • Configuration for a specific instance of the Lexer.