Texcraft
Texcraft is an experimental project to create a composable, LLVM-style framework for building TeX and other typesetting software. The project manifesto describes the big-picture ideas and goals behind the project.
As of 2025, the project is divided into two main sub-projects:
- Texlang is a framework for building fast and correct TeX language interpreters. It provides APIs for defining TeX primitives and is thus the core of any "TeX engine" built with Texcraft. Texlang's standard library contains implementations of many TeX primitives like \count, \def and \expandafter.
- Boxworks is an implementation of the typesetting engine inside TeX. It is designed to be fully independent of the TeX language. One of the main goals of Boxworks is to support creating new non-TeX typesetting languages that use the engine for typesetting.
The Texcraft playground is an example of what can be built with Texcraft. In the long run the goal is to produce re-implementations of TeX, pdfTeX and other TeX engines using Texcraft.
Design of the documentation
The Texcraft documentation consists of this website along with reference documentation that is generated from the Rust source code.
The documentation is designed with the Divio taxonomy of documentation in mind. In this taxonomy, there are four kinds of documentation: tutorials, how-to guides, references, and explainers. Because Texcraft is a small project so far, we don't have significant documentation of each type. Right now we have:
- The Texlang user guide, which is mostly a ground-up tutorial on how to use Texlang. The goal is for it to be possible to read the user guide from the introduction through to the end. But we also hope that you can jump into arbitrary sections that interest you, without having to slog through the prior chapters.
- Reference documentation that is autogenerated using rustdoc.
- Occasional explainers that are high-level and theoretical. Some parts of the Texlang user guide (such as the manifesto) have this style.
There are currently no how-to guides - we think the tutorials are enough for the moment.
What's in a name?
Texcraft is a contraction of TeX and craft. The verb craft was chosen, in part, to pay homage to Robert Nystrom's book Crafting Interpreters. This book is an inspiration both because of its wonderful exposition of the process of developing a language interpreter and because the book itself is so beautifully typeset. (Unfortunately the methods of the book do not directly apply to TeX because TeX is not context free, has dynamic lexing rules, and has many other problems besides.) We hope Texcraft will eventually enable people to craft their own TeX distributions.
The name Texcraft is written using the letter casing rule for proper nouns shared by most languages that use a Roman alphabet: the first letter is in uppercase, and the remaining letters are in lowercase. To quote Robert Bringhurst, "an increasing number of persons and institutions, from archy and mehitabel, to PostScript and TrueType, come to the typographer in search of special treatment [...] Logotypes and logograms push typography in the direction of hieroglyphics, which tend to be looked at rather than read."
Manifesto; or, why Texcraft?
Texcraft's ultimate goal is to advance the current state of open-source typesetting. There are other large projects that share the same goal. Typst is a brand new typesetting system designed to be more user friendly than TeX. Within the TeX world, there is still significant work being done on LaTeX.
Texcraft is taking a different approach. The project was started after a connection was made between the technical problems with the existing implementations of TeX, and some of the ideas behind the enormously successful LLVM project.
Donald Knuth's original Pascal/WEB implementation of TeX is simultaneously a critical part of the world's research and typesetting infrastructure but also basically impossible to iteratively improve upon because of its legacy software design. The original TeX is 25 thousand lines of extremely monolithic code. It makes extensive use of global state, has a custom memory management system, very few abstractions, and no test coverage. It is very difficult to change. It is impossible to reuse subparts of the code, for example the box-and-glue typesetting engine it contains. (The same observations apply to other engines like pdfTeX, which are forks of TeX with more functionality and code added.)
Texcraft was started with the observation: we've been here before. In the late 1990s the GCC project simultaneously dominated the open-source C/C++ compiler space but also had a software architecture that made it difficult to evolve. When Chris Lattner started the LLVM project, one of his ideas was to implement a compiler as a loosely coupled collection of libraries:
"The world needs better compiler tools, tools which are built as libraries. This design point allows reuse of the tools in new and novel ways. However, building the tools as libraries isn't enough: they must have clean APIs, be as decoupled from each other as possible, and be easy to modify/extend. This requires clean layering, decent design, and avoiding tying the libraries to a specific use. Oh yeah, did I mention that we want the resultant libraries to be as fast as possible?"
Texcraft is based on the belief that the world needs better typesetting tools, tools which are built as libraries. And in a world dominated by TeX, these tools should probably be compatible with TeX.
In concrete terms
Texcraft's concrete goal is to reimplement the existing TeX engines (TeX82 and pdfTeX at least) with a modular library-based software architecture. As part of this, there is an opportunity to improve some of the user experience around TeX, like returning better error messages or being smarter about when a recompilation is needed. However, we think the most promising aspect of the project is how such a base could be built upon. A modular code base would make it possible to:
- Make small improvements to existing TeX engines, or even non-trivial improvements like adding new pagination algorithms.
- Develop new languages that perform typesetting by using the existing box-and-glue engine of TeX as reimplemented in Texcraft. No need to write an engine from scratch (just as Rust's LLVM-based compiler didn't need to implement code generation).
- Some hybrid of these: a TeX engine that can, say, \inputTypst{chapter1.typst} and allow authors to use Typst source files in their projects. This is potentially exciting as it gives a roadmap for migrating off of the TeX language, which has some fundamental usability issues.
Finding the right abstractions
One of the main challenges in Texcraft is determining what parts of the TeX system can be decoupled; i.e., where the "clean APIs" can be introduced.
It's important to recognize straight away that the decoupled multi-pass architecture that is common in programming language compilers cannot work for TeX. In TeX, all stages of the compilation process from lexing through to page building are tied together. In TeX, it's possible to change the lexing rules depending on how many pages have been typeset so far:
% If the page number is odd...
\ifodd \pageno
% ...change the meaning of the letter T to be open brace and X to be close brace
\catcode`\T 1
\catcode`\X 2
\fi
% The lexing rules for the next line depend on which page we're on, and are thus
% a function of the entire document so far.
% If this is an odd page, TeX will tokenize this as {e} and typeset e;
% otherwise, it will typeset TeX.
TeX
In this example, we can't run the lexer on the line TeX until we've fully processed and typeset every single thing that has come before it. This means that we can't run the lexer, or TeX macro expander, or line-breaker, or page-builder in isolation: they must all run concurrently.
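To make the dependency concrete, here is a minimal sketch in plain Rust that mirrors the \catcode example. It does not use any Texlang API: the Lexer, CatCode and Token types and the lex_tex_line function are hypothetical names invented for illustration, and the page number stands in for "everything that has been typeset so far".

```rust
use std::collections::HashMap;

/// A tiny, illustrative subset of TeX's category codes.
#[derive(Clone, Copy, Debug, PartialEq)]
enum CatCode {
    BeginGroup,
    EndGroup,
    Letter,
}

#[derive(Debug, PartialEq)]
enum Token {
    BeginGroup,
    EndGroup,
    Letter(char),
}

/// Hypothetical lexer whose rules live in a table that can change at runtime.
struct Lexer {
    cat_codes: HashMap<char, CatCode>,
}

impl Lexer {
    fn new() -> Self {
        Lexer { cat_codes: HashMap::new() }
    }

    /// Tokenizes one line using the *current* category codes.
    /// Characters without an entry are treated as letters.
    fn lex(&self, line: &str) -> Vec<Token> {
        line.chars()
            .map(|c| match self.cat_codes.get(&c).copied().unwrap_or(CatCode::Letter) {
                CatCode::BeginGroup => Token::BeginGroup,
                CatCode::EndGroup => Token::EndGroup,
                CatCode::Letter => Token::Letter(c),
            })
            .collect()
    }
}

/// Lexes the line "TeX" after applying the conditional from the TeX snippet:
/// on odd pages, T becomes an open brace and X a close brace.
fn lex_tex_line(page_number: u32) -> Vec<Token> {
    let mut lexer = Lexer::new();
    if page_number % 2 == 1 {
        lexer.cat_codes.insert('T', CatCode::BeginGroup);
        lexer.cat_codes.insert('X', CatCode::EndGroup);
    }
    lexer.lex("TeX")
}

fn main() {
    println!("odd page:  {:?}", lex_tex_line(1));
    println!("even page: {:?}", lex_tex_line(2));
}
```

The same three input characters produce different token streams depending on the page number, which is why the lexer cannot be run ahead of the page builder.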
However, after a few years of working on Texcraft we think the TeX source code is extremely amenable to modularization, once all the modules can be made to run together. At the highest possible level, TeX can be divided into two parts following a traditional frontend/backend split:
- Backend: Knuth's box-and-glue typesetting engine. In the Pascal/WEB implementation, this engine mostly (completely?) works on internal data structures that are agnostic to the TeX language. It seems possible to reimplement this without any dependency on TeX. In Texcraft the backend is being implemented as the Boxworks sub-project.
- Frontend: a TeX language interpreter that reads TeX source code and then pushes the correct buttons on the backend. In Texcraft this is the Texlang sub-project.
Within these two halves there is also lots of opportunity for modularization. In Texlang, the implementation of conditional logic is a single Rust source file that's completely decoupled from the rest of the project (except of course for using some "clean" Texlang APIs). Work on Boxworks has just started, but for example it seems that the Knuth-Plass line-breaking algorithm can be put behind a generic "line-breaking" API.
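As a rough illustration of what a generic "line-breaking" API could look like, the sketch below defines a trait and one implementation. Everything here is an assumption made for illustration: the LineBreaker trait, its signature, and the count_lines helper are not Boxworks APIs, and the implementation is a simple first-fit (greedy) breaker, not the Knuth-Plass algorithm.

```rust
/// Hypothetical line-breaking interface: given the width of each word,
/// the width of an inter-word space and the target line width, return the
/// index of the first word on each line.
trait LineBreaker {
    fn break_lines(&self, word_widths: &[u32], space: u32, line_width: u32) -> Vec<usize>;
}

/// A first-fit breaker: keep adding words until the next one does not fit.
/// This is deliberately NOT Knuth-Plass; it only shows the shape of the API.
struct FirstFit;

impl LineBreaker for FirstFit {
    fn break_lines(&self, word_widths: &[u32], space: u32, line_width: u32) -> Vec<usize> {
        let mut starts = Vec::new();
        let mut used = 0;
        for (i, &width) in word_widths.iter().enumerate() {
            if starts.is_empty() || used + space + width > line_width {
                // Start a new line with this word.
                starts.push(i);
                used = width;
            } else {
                used += space + width;
            }
        }
        starts
    }
}

/// Code written against the trait does not care which algorithm is used.
fn count_lines(breaker: &dyn LineBreaker, word_widths: &[u32]) -> usize {
    breaker.break_lines(word_widths, 1, 10).len()
}

fn main() {
    let widths = [4, 4, 4, 4];
    println!("{} lines", count_lines(&FirstFit, &widths));
}
```

With an interface of this kind, a Knuth-Plass implementation and any alternative algorithm could be swapped without touching the calling code.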
Correctness and speed
Texcraft's interesting software architecture is not enough. In order to be viable, Texcraft needs to generate the same results as TeX (be correct) and do it in about the same time (be fast).
For correctness, TeX clearly falls under Hyrum's law. It doesn't matter what Knuth says in the TeXBook: after 45 years in production, every observable behavior of the TeX system is probably relied upon by someone. The Texcraft project's goal is to exactly replicate the output of TeX. This is fairly non-trivial because TeX is, ultimately, a fragile language. To achieve this, Texcraft development generally works by closely examining the Pascal/WEB source code and sometimes translating it by hand.
There is a silver lining though. Once you've committed to replicating a program exactly, you now have access to many test cases to verify your new implementation is correct. In the very long run, we envisage running Texcraft's pdfTeX implementation on papers in the arXiv and automatically verifying correctness.
As for speed, this is also challenging. Knuth optimized the performance of the Pascal/WEB code so that it would run tolerably on early 1980s computers. This means that on today's hardware TeX is really, really fast. Our initial work here has been promising, and our goal of being "about as fast as TeX" seems very achievable. We do have some advantages over Knuth: access to a modern optimizing compiler toolchain and the option of better data structures in hot parts of the code like macro expansion and list building.
Introduction to Texlang
Running the Texlang VM
Before writing any custom TeX primitives it's good to know how to run the Texlang virtual machine (VM). This way you can manually test the primitives you write to ensure they're working as you expect. In the long run you may decide to lean more on unit testing, even when initially developing commands, rather than manually testing things out. But it's still good to know how to run the VM.
If you just want to see the minimal code to do this, jump down to the first code listing.
Running the VM is generally a four-step process:
- Specify the built-in commands/primitives you want the VM to include.
- Initialize the VM using its new constructor. At this point you will need to decide which concrete state type you're using. The state concept was described in high-level terms in the previous section, and we will gain hands-on experience with it in the primitives with state section. For the moment, to keep things simple, we're just going to use a pre-existing state type that exists in the Texlang standard library crate: ::texlang_stdlib::testing::State.
- Load some TeX source code into the VM using the VM's push_source method.
- Call the VM's run method, or some other helper function, to run the VM.
One minor complication at this point is that to call the VM's run method directly one needs to provide so-called "VM handlers". These are Rust functions that tell the VM what to do when it encounters certain kinds of TeX tokens. For example, when a real typesetting VM sees the character a, it typesets the character a. Handlers are described in detail in the VM hooks and handlers section.
For the moment, we're going to get around the handlers problem entirely by instead running the VM using the ::texlang_stdlib::script::run function. This function automatically provides handlers such that when the VM sees a character, it just prints the character to the terminal.
With all of this context, here is a minimal code listing that runs the Texlang VM:
use texlang::vm;
use texlang_stdlib::script;
use texlang_stdlib::StdLibState;

fn main() {
    // 1. Create a new map of built-in commands.
    //    In this book's next section we will add some commands here.
    let built_in_commands = std::collections::HashMap::new();

    // 2. Initialize the VM.
    let mut vm = vm::VM::<StdLibState>::new_with_built_in_commands(built_in_commands);

    // 3. Add some TeX source code to the VM.
    vm.push_source("input.tex", r"Hello, World.");

    // 4. Run the VM and write the results to stdout.
    script::set_io_writer(&mut vm, std::io::stdout());
    script::run(&mut vm).unwrap();
}
When you run this code, you will simply see:
Hello, World.
In this case the VM has essentially nothing to do: it just passes characters from the input to the handler and thus the terminal. To see the VM doing a little work at least, change the source code to this:
use texlang::vm;

fn main() {
    let mut vm = vm::VM::<()>::new_with_built_in_commands(Default::default());

    // 3. Add some TeX source code to the VM.
    vm.push_source("input.tex", r"Hello, {World}.");
}
The output here is the same as before - in particular, the braces {} disappear:
Hello, World.
The braces disappear because they are special characters in the TeX grammar, and are used to denote the beginning and ending of a group. The VM consumes these characters internally when processing the input.
Another thing the VM can do is surface an error if the input contains an undefined control sequence. Because we haven't provided any built-in commands, every control sequence is undefined. Changing the VM setup to the following:
use texlang::vm;
use texlang_stdlib::script;
use texlang_stdlib::StdLibState;

fn main() {
    let mut vm = vm::VM::<StdLibState>::new_with_built_in_commands(Default::default());

    // 3. Add some TeX source code to the VM.
    vm.push_source("input.tex", r"\controlSequence");

    // 4. Run the VM and write the results to stdout.
    script::set_io_writer(&mut vm, std::io::stdout());
    if let Err(err) = script::run(&mut vm) {
        println!("{err}");
    }
}
results in the following output:
Error: undefined control sequence \controlSequence
>>> input.tex:1:1
|
1 | \controlSequence
| ^^^^^^^^^^^^^^^^ control sequence
In general, though, the VM can't do much if no built-in commands are provided. In the next section we will learn how to write some.
Simple expansion and execution primitives
Primitives with state (the component pattern)
Primitive tags
VM hooks and handlers
TeX variables
The documentation so far has covered two of the three types of primitives in TeX: expansion primitives and execution primitives. In this section we discuss the third type: variables.
The TeX language includes typed variables. There are a few possible types which are listed below. The Texlang variables API is the mechanism by which TeX variables are implemented in Texlang. Ultimately the API provides a way for memory in Rust to be referenced in TeX source code.
[History] Variables in the TeXBook
This section can be freely skipped.
The TeXBook talks a lot about the different variables that are available in the original TeX engine, like \year and \count. The impression one sometimes gets from the TeXBook is that these variables are heterogeneous in terms of how they are handled by the interpreter. For example, on page 271 when describing the grammar of the TeX language, we find the following rule:
<internal integer> -> <integer parameter> | <special integer> | \lastpenalty
    | <countdef token> | \count<8-bit number>
    ...
This makes it seem that an "integer parameter" behaves differently to a "special integer", or to a register accessed via \count, even though all of these variables have the same concrete type (a 32-bit integer). To the best of our current knowledge, this is not the case at all! It appears that there is a uniform way to handle all variables of the same concrete type in TeX, and this is what Texlang does. The benefit of this approach is that it makes for a much simpler API, both for adding new variables to a TeX engine and for consuming variables.
Singleton versus array variables
In TeX, for each variable type, there are two categories of variable: singleton variables (like \year) and array variables (like \count N, where N is the index of the variable in the registers array). Both of these cases are handled in the same way in the Texlang API.
In Texlang, the control sequences \year and \count are not considered variables themselves. This is because without reading the N after \count, we don't actually know which memory is being referred to. Instead, \year and \count are variable commands (of type Command). A variable command is resolved to obtain a variable (of type Variable). A variable is an object that points to a specific piece of memory like an i32 in the state.
For singleton variables, resolving a command is a no-op. The command itself has enough information to identify the memory being pointed at.
For array variables, resolving a command involves determining the index of the variable within the array. The index has type Index, which is a wrapper around Rust's usize. The index is determined using the command's index resolver, which has enum type IndexResolver. There are two different ways the index can be resolved, corresponding to different variants in the enum type.
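The command/variable distinction can be modeled in a few lines of plain Rust. The sketch below is a toy model, not the Texlang API: the State, Command, IndexResolver and Variable types here are simplified stand-ins that only borrow names from the text, the input is modeled as an iterator of already-parsed indices, and the variant names Dynamic and Static follow the variants mentioned later in this section.

```rust
/// Toy state: one singleton integer and a small register array.
struct State {
    year: i32,
    count: [i32; 4],
}

/// Simplified stand-in for an index resolver.
enum IndexResolver {
    /// The index is read from the input that follows the command (like \count N).
    Dynamic,
    /// The index was fixed when the command was created.
    Static(usize),
}

/// Simplified stand-in for a variable command.
enum Command {
    Singleton,
    Array(IndexResolver),
}

/// A resolved variable: it identifies one specific i32 in the state.
#[derive(Debug, PartialEq)]
enum Variable {
    Year,
    Count(usize),
}

impl Command {
    /// Resolves the command to a variable. Only the dynamic resolver
    /// consumes anything from the input.
    fn resolve(&self, input: &mut impl Iterator<Item = usize>) -> Option<Variable> {
        match self {
            Command::Singleton => Some(Variable::Year),
            Command::Array(IndexResolver::Static(i)) => Some(Variable::Count(*i)),
            Command::Array(IndexResolver::Dynamic) => input.next().map(Variable::Count),
        }
    }
}

impl Variable {
    /// Returns a reference to the memory the variable points at.
    fn get<'a>(&self, state: &'a State) -> &'a i32 {
        match self {
            Variable::Year => &state.year,
            Variable::Count(i) => &state.count[*i],
        }
    }
}

fn main() {
    let state = State { year: 2025, count: [0, 7, 0, 0] };
    // Models the input "1" following a \count-like command.
    let mut input = vec![1usize].into_iter();
    let variable = Command::Array(IndexResolver::Dynamic)
        .resolve(&mut input)
        .expect("input contains an index");
    println!("{:?} = {}", variable, variable.get(&state));
}
```

In this model the singleton and static cases never touch the input, while the dynamic case consumes exactly one index from it.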
Implementing a singleton variable
Variables require some state in which to store the associated value. We assume that the component pattern is being used. In this case, the variable command is associated with a state struct which will be included in the VM state as a component. The value of a variable is just a Rust field of the correct type in the component:
pub struct MyComponent {
    my_variable_value: i32,
}
To make a Texlang variable out of this i32 we need to provide two things: an immutable getter and a mutable getter. These getters have the signature RefFn and MutRefFn respectively. Both getters accept a reference to the state and an index, and return a reference to the variable. (The index is only for array variables, and is ignored for singleton variables.) For the component and variable above, our getters look like this:
use texlang::variable;
use texlang::vm::HasComponent;

pub struct MyComponent {
    my_variable_value: i32,
}

// The index is ignored because this is a singleton variable.
fn getter<S: HasComponent<MyComponent>>(state: &S, _index: variable::Index) -> &i32 {
    &state.component().my_variable_value
}

fn mut_getter<S: HasComponent<MyComponent>>(state: &mut S, _index: variable::Index) -> &mut i32 {
    &mut state.component_mut().my_variable_value
}
Once we have the getters, we can create the TeX variable command:
use texlang::command;
use texlang::variable;
use texlang::vm::HasComponent;

pub struct MyComponent {
    my_variable_value: i32,
}

fn getter<S: HasComponent<MyComponent>>(state: &S, _index: variable::Index) -> &i32 {
    &state.component().my_variable_value
}

fn mut_getter<S: HasComponent<MyComponent>>(state: &mut S, _index: variable::Index) -> &mut i32 {
    &mut state.component_mut().my_variable_value
}

pub fn my_variable<S: HasComponent<MyComponent>>() -> command::BuiltIn<S> {
    variable::Command::new_singleton(getter, mut_getter).into()
}
The function Command::new_singleton creates a new variable command associated to a singleton variable. We cast the variable command into a generic command using the into method. This command can now be included in the VM's command map and the value can be accessed in TeX scripts!
As usual with the component pattern, the code we write works for any TeX engine whose state contains our component.
Finally, as a matter of style, consider implementing the getters inline as closures. This makes the code a little more compact and readable. With this style, the full code listing is as follows:
use texlang::command;
use texlang::variable;
use texlang::vm::HasComponent;

pub struct MyComponent {
    my_variable_value: i32,
}

pub fn my_variable<S: HasComponent<MyComponent>>() -> command::BuiltIn<S> {
    variable::Command::new_singleton(
        |state: &S, _index: variable::Index| -> &i32 { &state.component().my_variable_value },
        |state: &mut S, _index: variable::Index| -> &mut i32 {
            &mut state.component_mut().my_variable_value
        },
    )
    .into()
}
Implementing an array variable
The main difference between singleton and array variables is that we need to use the index arguments that were ignored above. In this section we will implement an array variable with 10 entries. In the component, we replace the i32 with an array of i32s:
pub struct MyComponent {
    my_array_values: [i32; 10],
}
The getter functions use the provided index argument to determine the index to use for the array:
use texlang::variable;
use texlang::vm::HasComponent;

pub struct MyComponent {
    my_array_values: [i32; 10],
}

fn getter<S: HasComponent<MyComponent>>(state: &S, index: variable::Index) -> &i32 {
    // Panics if the index is out of bounds.
    &state.component().my_array_values[index.0 as usize]
}
The above listing raises an important question: what if the array access is out of bounds? The Rust code here will panic, and in Texlang this is the correct behavior. Texlang always assumes that variable getters are infallible. This is the same as assuming that an instantiated Variable type points to a valid piece of memory and is not (say) dangling.
Next, we construct the command. Unlike the singleton command, this command will need to figure out the index of the variable. As with \count, our command will do this by reading the index from the input token stream. In the variables API, we implement this by providing the following type of function:
use texlang::prelude as txl;
use texlang::traits::*;
use texlang::*;

fn index<S: TexlangState>(
    token: token::Token,
    input: &mut vm::ExpandedStream<S>,
) -> txl::Result<variable::Index> {
    // Parse the index from the input that follows the command.
    let index = parse::Uint::<10>::parse(input)?;
    Ok(index.0.into())
}
Finally we create the command. This is the same as the singleton case, except we pass the index function above as an index resolver with the Dynamic variant:
use texlang::prelude as txl;
use texlang::vm::HasComponent;
use texlang::*;

pub struct MyComponent {
    my_array_values: [i32; 10],
}

// Stubs standing in for the getters and index function defined above.
fn getter<S: HasComponent<MyComponent>>(state: &S, index: variable::Index) -> &i32 {
    panic![""]
}
fn mut_getter<S: HasComponent<MyComponent>>(state: &mut S, index: variable::Index) -> &mut i32 {
    panic![""]
}
fn index_resolver<S>(
    token: token::Token,
    input: &mut vm::ExpandedStream<S>,
) -> txl::Result<variable::Index> {
    panic![""]
}

pub fn my_array<S: HasComponent<MyComponent>>() -> command::BuiltIn<S> {
    variable::Command::new_array(
        getter,
        mut_getter,
        variable::IndexResolver::Dynamic(index_resolver),
    )
    .into()
}
Implementing a \countdef type command
In Knuth's TeX, the \countdef command is an execution command with the following semantics. After executing the TeX code \countdef \A 1, the control sequence \A will be a variable command pointing to the same memory as \count 1. One way of thinking about it is that \A aliases \count 1.
Using the Texlang variables API it is possible to implement the analogous command for the \myArray command implemented above.
The implementation is in three steps:
- The target (e.g. \A) is read using texlang::parse::Command::parse.
- The index (e.g. 1) is read using usize::parse, just like in the previous section.
- A new variable command is then created and added to the commands map. This command is created using Command::new_array just as above, except in the index resolver we use the IndexResolver::Static variant with the index calculated in step 2.
For a full example where this is all worked out, consult the implementation of \countdef in the Texlang standard library.
TeX variable types
Not all variable types have been implemented yet.
TeX Type | Rust type | Register accessor command | Implemented in Texlang? |
---|---|---|---|
Integer | i32 | \count | Yes |
Dimension | core::Scaled | \dimen | Yes |
Glue | core::Glue | \skip | Yes |
Muglue | TBD | \muskip | No |
Box | TBD | \box and \setbox | No |
Category code | CatCode | \catcode | Yes |
Math code | MathCode | \mathcode | Yes |
Delimiter code | TBD | \delcode | No |
Space factor code | TBD | \sfcode | No |
Token list | Rc<Token> | \toks | No |
Parsing the TeX grammar
Error handling
Like every programming language, TeX files can contain errors. The Texlang errors system is used to implement TeX's error handling behavior.
Errors in TeX
Before describing Texlang's error system we will describe the semantics of errors in TeX.
Most errors in TeX are recoverable. Take the following TeX code, which tries to assign a non-integer value to the integer variable \month:
\month = March
Knuth's TeX prints the following error message:
! Missing number, treated as zero.
<to be read again>
                   M
l.1 \month = M
              arch
TeX tried to interpret March as an integer and of course could not. So instead, it recovered from the error by assuming the right hand side is 0, assigned the value 0 to \month, and then continued processing with the next token M. Most TeX errors are like this: the system falls back to some default behavior and then optionally continues processing.
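The recover-and-continue behavior can be sketched in plain Rust without any Texlang types. The function parse_int_or_zero below is a hypothetical helper invented for illustration: it works on string slices rather than TeX tokens, records the error message in a list instead of printing it, and simplifies overflow handling.

```rust
/// Tries to read a non-negative integer from the front of `input`.
/// On failure, records an error, recovers with the value 0 and leaves the
/// input untouched so that processing can continue with the same text.
fn parse_int_or_zero<'a>(input: &'a str, errors: &mut Vec<String>) -> (i32, &'a str) {
    let trimmed = input.trim_start();
    let digits_len = trimmed.chars().take_while(|c| c.is_ascii_digit()).count();
    if digits_len == 0 {
        // Recoverable error: fall back to the default value.
        errors.push("Missing number, treated as zero.".to_string());
        return (0, trimmed);
    }
    let (digits, rest) = trimmed.split_at(digits_len);
    // Overflow handling is simplified in this sketch.
    (digits.parse().unwrap_or(i32::MAX), rest)
}

fn main() {
    let mut errors = Vec::new();
    let (value, rest) = parse_int_or_zero("March", &mut errors);
    println!("value = {value}, rest = {rest:?}, errors = {errors:?}");
}
```

As in the \month example, the failed parse yields 0 and the remaining input still begins with M.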
When it hits a TeX error like the one above, TeX can either continue or abort or even prompt the user to edit the input. This depends on which of the four "interaction modes" is currently enabled:
- \errorstopmode: the default mode in which errors drop the user into an interactive terminal and the user then instructs TeX what to do (abort, or ignore the error and move on, or switch to a different interaction mode, or edit the input inline).
- \scrollmode: all recoverable errors are recovered from, but the error messages are printed to the log file and terminal. However if more than 100 errors are hit, the program aborts.
- \nonstopmode: like \scrollmode, but other forms of terminal interaction like \read are also suppressed.
- \batchmode: like \nonstopmode, except error messages are only printed to the log file and not the terminal.
In Texlang, the decision to continue or abort is made inside the implementation of the TexlangState::recoverable_error_hook.
Finally, some errors like a file-not-found when using \input are fatal errors that can't be recovered from. TeX aborts in such rare cases.
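The decision logic described by the four interaction modes can be sketched as a small state machine. This is an illustrative model only: InteractionMode, Decision and ErrorPolicy are hypothetical types, they do not reflect the signature of TexlangState::recoverable_error_hook, and fatal errors (which always abort) are left out.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum InteractionMode {
    ErrorStop,
    Scroll,
    NonStop,
    Batch,
}

#[derive(Debug, PartialEq)]
enum Decision {
    /// Hand control to the user at an interactive prompt.
    AskUser,
    /// Recover from the error and keep processing.
    Continue,
    /// Stop processing the document.
    Abort,
}

/// Hypothetical policy object tracking the mode and the error count.
struct ErrorPolicy {
    mode: InteractionMode,
    errors_seen: u32,
}

impl ErrorPolicy {
    fn new(mode: InteractionMode) -> Self {
        ErrorPolicy { mode, errors_seen: 0 }
    }

    /// Decides what to do when a recoverable error is hit.
    fn on_recoverable_error(&mut self) -> Decision {
        self.errors_seen += 1;
        match self.mode {
            InteractionMode::ErrorStop => Decision::AskUser,
            // The non-interactive modes give up after more than 100 errors.
            _ if self.errors_seen > 100 => Decision::Abort,
            _ => Decision::Continue,
        }
    }

    /// Batch mode writes error messages to the log file only.
    fn prints_to_terminal(&self) -> bool {
        self.mode != InteractionMode::Batch
    }
}

fn main() {
    let mut policy = ErrorPolicy::new(InteractionMode::Scroll);
    let decision = policy.on_recoverable_error();
    println!("{:?} (terminal output: {})", decision, policy.prints_to_terminal());
}
```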
Handling errors in Texlang
For a given error case you first need to define a Rust type for that case. This type must implement the error::TexError trait. At a minimum you must provide an error kind and a title.
Suppose you're writing a Texlang function to parse a yes/no answer. The function will accept Y or y to mean yes, and N or n to mean no. The error that can occur here is that the user provides a different character like A.
First define a type for this error. This type is a token error because the error occurs at a specific TeX token (in this case the character A).
use texlang::{error, token};

#[derive(Debug)]
struct YesOrNoError {
    token: token::Token,
}

impl error::TexError for YesOrNoError {
    fn kind(&self) -> error::Kind {
        error::Kind::Token(self.token)
    }

    fn title(&self) -> String {
        "invalid response to yes or no; expected Y or N".to_string()
    }
}
Next, in the function that parses the yes/no, on the error path first construct the error, then pass it to the TokenStream::error method of the function input. If that doesn't error out, return the default value.
use texlang::prelude as txl;
use texlang::traits::*;
use texlang::{error, token, vm};

#[derive(Debug)]
struct YesOrNoError {
    token: token::Token,
}

impl error::TexError for YesOrNoError {
    fn kind(&self) -> error::Kind {
        error::Kind::Token(self.token)
    }

    fn title(&self) -> String {
        "invalid response to yes or no; expected Y or N".to_string()
    }
}

/// Parses a yes (character 'Y' or 'y') or no (character 'N' or 'n').
/// Returns true if the parsed value is "yes".
fn parse_yes_or_no<S: TexlangState>(input: &mut vm::ExecutionInput<S>) -> txl::Result<bool> {
    let token = input.next()?.expect("input has not ended");
    let yes = match token.value() {
        token::Value::Letter('Y' | 'y') => true,
        token::Value::Letter('N' | 'n') => false,
        _ => {
            // If the VM treats the error as fatal, a Rust error will be returned
            // here and then propagated using the `?` operator.
            input.error(YesOrNoError { token })?;
            // Otherwise, the VM has treated the error as recoverable and we
            // fall back to the recovery behavior.
            false
        }
    };
    Ok(yes)
}
End of input errors
Unit testing
When writing code that uses Texlang it's generally expected that unit tests will also be written to verify the primitives being implemented work as expected.
Software projects are more likely to have high-quality, extensive unit tests if it is easy to write and maintain such tests. In order to support easy unit-testing of code that uses Texlang, the Texcraft project includes a specific crate for writing unit tests called texlang_testing.
The crate currently supports three kinds of unit tests:
- Expansion equality tests: verifying that two TeX snippets expand to the same output.
- Failure tests: verifying that a TeX snippet fails to run.
- Serde tests: verifying that a Texlang VM can be successfully serialized and deserialized in the middle of executing a TeX snippet.
This crate is used extensively in the Texlang standard library. Browsing some of the Rust source code will give a good sense of how tests are written using this library.
For information on writing unit tests using this crate, consult the crate's documentation.
Format files (a.k.a. serialization and deserialization)
This page describes Texlang's support for serializing and deserializing VMs. This functionality allows Texlang to replicate Knuth's "format file" mechanism.
Background
Knuth's original implementation of TeX includes a feature called "format files". A TeX format is a set of general-purpose macros and other configurations such as category code mappings that are included as a preamble in TeX documents. The plain TeX format was developed by Knuth concurrently with the initial implementation of TeX. Nowadays the LaTeX format is so ubiquitous as to be synonymous with TeX itself.
A format file is created by running TeX, inputting the format definitions (which are in regular .tex files) and then dumping the state of the interpreter using the \dump primitive. The resulting format file has the file extension .fmt. Subsequent runs of the VM can then read the format file and apply the definitions "at high speed".
The format file mechanism is essentially a performance optimization that gets the format into the interpreter faster than parsing the .tex definitions each time. This optimization was probably especially important when TeX was being developed in the early 80s and computers were much slower than today.
A modern perspective on format files is that they are a mechanism for serializing and deserializing the state of a TeX virtual machine. Texlang includes support for such serialization and deserialization of VMs. In fact, Texlang's (de)serialization feature is strictly more powerful than the format files mechanism in Knuth's TeX:
-
Texlang's (de)serialization feature is implemented using the Rust library Serde, and is thus independent of any specific serialization format. Texlang VMs can be (de)serialized to and from any format compatible with Serde. All of the unit tests in the Texlang project are run for three formats: MessagePack, bincode, and JSON.
-
Texlang VMs can be serialized irrespective of their internal state. With format files this is not the case: format files cannot be created when there is a current active group, or when typesetting has already started.
This latter property opens up some exciting use cases for Texlang (de)serialization, especially checkpoint compilation. In theory, after shipping out each PDF page a Texlang VM could checkpoint its progress by serializing itself and persisting the bytes in the filesystem. Then, when the same TeX document is compiled again, the interpreter could check whether the document is unchanged up to a certain checkpoint. If so, instead of recompiling the entire document, the checkpoint could be deserialized and compilation could continue from there. This would make regenerating the Nth page of a TeX document a genuine O(1) operation, independent of N.
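The checkpoint lookup described above can be sketched with the standard library alone. This is only an illustration of the idea, not something Texlang provides: the Checkpoint record, its fields, and the functions below are all hypothetical, and the sketch assumes the document is a single source string whose prefix can be hashed.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical checkpoint record; not part of the Texlang API.
struct Checkpoint {
    /// Number of pages shipped out when the checkpoint was taken.
    pages_shipped: usize,
    /// Number of source bytes the VM had consumed at that point.
    source_offset: usize,
    /// Hash of the source bytes in `0..source_offset`.
    prefix_hash: u64,
    /// Stand-in for the serialized VM (e.g. the output of a Serde serializer).
    vm_bytes: Vec<u8>,
}

/// Hashes the first `offset` bytes of the source, or returns `None` if the
/// source is shorter than that. A real tool would want a stable hash function.
fn hash_prefix(source: &str, offset: usize) -> Option<u64> {
    let prefix = source.as_bytes().get(..offset)?;
    let mut hasher = DefaultHasher::new();
    prefix.hash(&mut hasher);
    Some(hasher.finish())
}

/// Builds one checkpoint per offset, as if a page was shipped at each one.
fn make_checkpoints(source: &str, offsets: &[usize]) -> Vec<Checkpoint> {
    offsets
        .iter()
        .enumerate()
        .filter_map(|(i, &offset)| {
            Some(Checkpoint {
                pages_shipped: i + 1,
                source_offset: offset,
                prefix_hash: hash_prefix(source, offset)?,
                vm_bytes: Vec::new(),
            })
        })
        .collect()
}

/// Returns the latest checkpoint whose source prefix is unchanged.
fn latest_valid_checkpoint<'a>(
    checkpoints: &'a [Checkpoint],
    source: &str,
) -> Option<&'a Checkpoint> {
    checkpoints
        .iter()
        .filter(|c| hash_prefix(source, c.source_offset) == Some(c.prefix_hash))
        .max_by_key(|c| c.pages_shipped)
}

fn main() {
    let original = "Page one.\\newpage Page two.\\newpage Page three.";
    let checkpoints = make_checkpoints(original, &[17, 35]);
    // An edit on the second page invalidates only the second checkpoint.
    let edited = original.replace("two", "2");
    match latest_valid_checkpoint(&checkpoints, &edited) {
        Some(c) => println!(
            "resume after page {} ({} bytes of VM state)",
            c.pages_shipped,
            c.vm_bytes.len()
        ),
        None => println!("no usable checkpoint; recompile from the start"),
    }
}
```

The sketch ignores everything that makes this hard in practice, such as input files read with \input and any other state outside the VM.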
Serializing VMs
Texlang VMs are generic over the state S. Whether or not you can serialize or deserialize the VM depends on properties of the state. We will start by discussing serialization.
If the state S implements serde::Serialize, then the Texlang VM vm::VM<S> implements serde::Serialize too. VMs can thus be serialized using the standard Serde infrastructure. Note that making S serializable with Serde is usually very easy and just involves adding a derive attribute to the type.
Here's a simple example of serializing a VM to JSON:
use texlang::vm;

#[derive(Default, serde::Serialize, serde::Deserialize)]
struct State {
    number: i32,
}

fn main() {
    let built_in_commands = Default::default();
    let vm = vm::VM::<State>::new_with_built_in_commands(built_in_commands);

    let serialized_vm = serde_json::to_string_pretty(&vm).unwrap();
    println!("{serialized_vm}");
}
Deserializing VMs
Deserialization is a little trickier than serialization because the serialized bytes are not enough to fully reconstruct the VM. Specifically, the VM's built-in primitives are missing from the serialized bytes and must be provided again at deserialization time. This is because, fundamentally, Texlang primitives are Rust function pointers and these cannot be serialized and deserialized.
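To make the constraint concrete, here is a toy model that uses only the standard library. It is not Texlang's implementation: ToyVm, its string "format", and the increment command are all hypothetical stand-ins. The point it illustrates is that only the data part of the VM is written out, so the function pointers have to be passed in again when the VM is rebuilt.

```rust
use std::collections::HashMap;

/// Toy stand-in for a Texlang state (hypothetical).
#[derive(Debug, Default, PartialEq)]
struct State {
    number: i32,
}

/// Built-in commands are plain function pointers, which have no
/// meaningful serialized form.
type BuiltIn = fn(&mut State);

/// Toy VM (hypothetical): serializable data plus non-serializable commands.
struct ToyVm {
    state: State,
    built_ins: HashMap<&'static str, BuiltIn>,
}

impl ToyVm {
    /// Writes out only the data; the function pointers are skipped.
    fn serialize(&self) -> String {
        self.state.number.to_string()
    }

    /// Rebuilds a VM from the serialized data plus a fresh set of built-ins.
    fn deserialize(data: &str, built_ins: HashMap<&'static str, BuiltIn>) -> Option<ToyVm> {
        let number = data.parse().ok()?;
        Some(ToyVm { state: State { number }, built_ins })
    }

    /// Runs a built-in by name; returns false if it is not defined.
    fn run(&mut self, name: &str) -> bool {
        if let Some(&f) = self.built_ins.get(name) {
            f(&mut self.state);
            true
        } else {
            false
        }
    }
}

fn increment(state: &mut State) {
    state.number += 1;
}

/// The set of built-ins that both the original and the restored VM use.
fn default_built_ins() -> HashMap<&'static str, BuiltIn> {
    let mut m: HashMap<&'static str, BuiltIn> = HashMap::new();
    m.insert("increment", increment);
    m
}

fn main() {
    let mut vm = ToyVm { state: State::default(), built_ins: default_built_ins() };
    vm.run("increment");
    let data = vm.serialize();

    // The commands are supplied again because `data` does not contain them.
    let mut restored = ToyVm::deserialize(&data, default_built_ins()).unwrap();
    restored.run("increment");
    println!("{:?}", restored.state);
}
```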
The easiest way to support deserialization is to implement vm::HasDefaultBuiltInCommands for the state. This trait provides the default set of built-in commands for that state. If the state S implements both serde::Deserialize and vm::HasDefaultBuiltInCommands, the Texlang VM vm::VM<S> implements serde::Deserialize too. In this case deserialization can be done in the usual way with Serde:
use std::collections::HashMap;
use texlang::command;
use texlang::vm;

#[derive(Default, serde::Serialize, serde::Deserialize)]
struct State {
    number: i32,
}

impl vm::TexlangState for State {}

impl vm::HasDefaultBuiltInCommands for State {
    fn default_built_in_commands() -> HashMap<&'static str, command::BuiltIn<Self>> {
        // Returning an empty set of built-in commands, but in general this will be non-empty.
        HashMap::new()
    }
}

fn main() {
    // When `vm::HasDefaultBuiltInCommands` is implemented for the state,
    // the VM's plain `new` constructor can be used.
    let original_vm = vm::VM::<State>::new();

    let serialized_vm = serde_json::to_string_pretty(&original_vm).unwrap();
    println!("{serialized_vm}");

    let deserialized_vm: vm::VM<State> = serde_json::from_str(&serialized_vm).unwrap();
}
If the state doesn't implement vm::HasDefaultBuiltInCommands, or you are using a non-default set of built-in commands, deserialization can be done in one of two ways.
First way: use the VM::deserialize_with_built_in_commands helper function that accepts a Serde deserializer and the built-in commands:
use texlang::vm;

fn main() {
    let built_in_commands = Default::default();
    let vm = vm::VM::<()>::new_with_built_in_commands(built_in_commands);
    let serialized_vm = serde_json::to_string_pretty(&vm).unwrap();

    // The built-in commands are not in the serialized data, so provide them again.
    let built_in_commands = Default::default();
    let mut deserializer = serde_json::Deserializer::from_str(&serialized_vm);
    let vm = vm::VM::<()>::deserialize_with_built_in_commands(&mut deserializer, built_in_commands);
}
Second way: first deserialize the bytes to a value of type vm::serde::DeserializedVM, and then convert this value into a regular VM using the vm::serde::finish_deserialization function:
use texlang::vm;

fn main() {
    let built_in_commands = Default::default();
    let vm = vm::VM::<()>::new_with_built_in_commands(built_in_commands);
    let serialized_vm = serde_json::to_string_pretty(&vm).unwrap();

    // Step 1: deserialize the data only.
    let built_in_commands = Default::default();
    let deserialized_vm: Box<vm::serde::DeserializedVM<()>> =
        serde_json::from_str(&serialized_vm).unwrap();
    // Step 2: combine the data with the built-in commands to get a regular VM.
    let vm = vm::serde::finish_deserialization(deserialized_vm, built_in_commands);
}
Serializing VMs inside TeX commands
Using Rust code like in the previous subsection, it's possible to write TeX primitives that serialize the VM and write the result to a file; i.e., write a format file! The Texlang standard library includes an implementation of the \dump primitive that does this. The texcraft binary accepts a --format-file argument: it reads the format file and continues from where the dumped VM left off.
Primitive tags and (de)serialization
In a previous section we discussed primitive tags. These provide unique identifiers that are generated using a global counter at runtime. Tags sometimes appear in the state, but they are not safe to serialize and deserialize. Deserialized tags may collide with new tags generated using the global counter.
Instead, when (de)serializing a component that contains a tag, the tag field should be skipped when serializing and the value should be provided manually when deserializing. This can be achieved using Serde's skip field attribute:
use serde::{Deserialize, Serialize};
use texlang::command;

static TAG: command::StaticTag = command::StaticTag::new();

fn get_tag() -> command::Tag {
    TAG.get()
}

#[derive(Serialize, Deserialize)]
struct Component {
    variable_value: i32,
    #[serde(skip, default = "get_tag")]
    tag: command::Tag,
}

// This Default implementation is not needed for (de)serializing.
// However it illustrates how to ensure that new and deserialized components have the same tag.
impl Default for Component {
    fn default() -> Self {
        Self {
            variable_value: 0,
            tag: get_tag(),
        }
    }
}
Another approach is to extract the tag to its own sub-struct and then manually implement the Default trait for that sub-struct. In this case we instruct Serde to use the default value for the sub-struct when deserializing:
use serde::{Deserialize, Serialize};
use texlang::command;

static TAG: command::StaticTag = command::StaticTag::new();

#[derive(Serialize, Deserialize)]
struct Component {
    variable_value: i32,
    // Skipped fields are filled in with `Default::default()` when deserializing.
    #[serde(skip)]
    tags: Tags,
}

struct Tags {
    tag: command::Tag,
}

impl Default for Tags {
    fn default() -> Self {
        Self {
            tag: TAG.get(),
        }
    }
}
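The collision risk that motivates skipping tags can be seen in a small model that uses only the standard library. Everything here is a hypothetical stand-in: TagCounter models the per-process global counter as an ordinary struct so that two "runs" can be compared side by side, and Tag is a plain wrapper around an integer rather than Texlang's command::Tag.

```rust
/// Hypothetical stand-in for a primitive tag.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Tag(u32);

/// Models the global tag counter of one process.
struct TagCounter {
    next: u32,
}

impl TagCounter {
    fn new() -> Self {
        TagCounter { next: 1 }
    }

    /// Hands out the next unused tag in this process.
    fn fresh(&mut self) -> Tag {
        let tag = Tag(self.next);
        self.next += 1;
        tag
    }
}

struct Component {
    variable_value: i32,
    tag: Tag,
}

/// Unsafe approach: persists the raw tag value along with the data.
fn serialize_naive(c: &Component) -> (i32, u32) {
    (c.variable_value, c.tag.0)
}

/// Safe approach: persists the data only; the tag is supplied on the way back in.
fn serialize_skipping_tag(c: &Component) -> i32 {
    c.variable_value
}

fn deserialize_with_tag(variable_value: i32, tag: Tag) -> Component {
    Component { variable_value, tag }
}

fn main() {
    // First run: the component happens to receive the second tag.
    let mut first_run = TagCounter::new();
    let _other = first_run.fresh();
    let component = Component { variable_value: 7, tag: first_run.fresh() };
    let (_, stored_tag) = serialize_naive(&component);

    // Second run: tags are handed out in a different order.
    let mut second_run = TagCounter::new();
    let component_tag = second_run.fresh();
    let unrelated_tag = second_run.fresh();

    // The stored value now identifies an unrelated primitive.
    println!("collision: {}", Tag(stored_tag) == unrelated_tag);

    // Skipping the tag and providing the current one avoids the problem.
    let restored = deserialize_with_tag(serialize_skipping_tag(&component), component_tag);
    println!("restored value {} with tag {:?}", restored.variable_value, restored.tag);
}
```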
Introduction to Boxworks
Boxworks is a work-in-progress re-implementation of the typesetting engine inside TeX. It is language-agnostic in the sense that it doesn't rely on the TeX language. We envisage a future in which you can take, say, Typst source code and typeset it using the Boxworks engine.
Like the wider Texcraft project, the plan is to implement Boxworks as a collection of libraries that plug together to create a typesetting engine. Each individual library (e.g. a page builder) can be easily replaced if a different algorithm is desired. So far we've identified the following parts of TeX that can probably be individual decoupled modules in the Boxworks system. (This is not exhaustive and does not cover everything that will need to be implemented.)
Text preprocessor
This module takes Unicode characters as input and generates horizontal list elements. This sounds trivial but in TeX it's not because this process includes:
- Adding kerns.
- Replacing multiple characters with ligatures.
- Figuring out the right inter-word glue to add based on a variety of considerations like the current space factor and the values of variables like \spaceskip.
Most of the relevant TeX code is in "part 46: the chief executive".
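As a rough illustration of what such a module does, here is a minimal sketch that turns characters into a horizontal list with ligatures, kerns and glue. It is not the Boxworks API. The HItem type, the tiny ligature and kern tables, and the kern amount are all made up for the example (real values come from font metric files), and the space factor logic is left out entirely.

```rust
/// Hypothetical horizontal list element.
#[derive(Debug, Clone, PartialEq)]
enum HItem {
    Char(char),
    /// Kern width in made-up units.
    Kern(i32),
    /// Inter-word glue; real glue carries a width, a stretch and a shrink.
    Glue,
}

/// Made-up ligature table; a real one comes from the font metrics.
fn ligature(left: char, right: char) -> Option<char> {
    match (left, right) {
        ('f', 'i') => Some('\u{FB01}'),
        ('f', 'l') => Some('\u{FB02}'),
        _ => None,
    }
}

/// Made-up kern table; a real one comes from the font metrics.
fn kern(left: char, right: char) -> Option<i32> {
    match (left, right) {
        ('A', 'V') => Some(-5000),
        _ => None,
    }
}

/// Converts text into horizontal list elements.
fn preprocess(text: &str) -> Vec<HItem> {
    let mut out = Vec::new();
    // The last character emitted in the current word, if any.
    let mut prev: Option<char> = None;
    for c in text.chars() {
        if c == ' ' {
            out.push(HItem::Glue);
            prev = None;
            continue;
        }
        if let Some(p) = prev {
            if let Some(lig) = ligature(p, c) {
                // Replace the previous character with the ligature.
                out.pop();
                out.push(HItem::Char(lig));
                prev = Some(lig);
                continue;
            }
            if let Some(k) = kern(p, c) {
                out.push(HItem::Kern(k));
            }
        }
        out.push(HItem::Char(c));
        prev = Some(c);
    }
    out
}

fn main() {
    for item in preprocess("fi AV") {
        println!("{:?}", item);
    }
}
```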
Line width calculator
Calculates line widths based on values of \parshape and \hangindent. In TeX this logic seems to be duplicated between the line breaker and math processing. In the line breaker, relevant sections are TeX.2021.847-850.
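The rules involved are small enough to sketch directly. The following is a minimal model of how the indent and width of a given line follow from \hsize, \hangindent, \hangafter and \parshape, based on our reading of the TeX sections referenced above. The ParagraphShape struct and line_shape function are hypothetical names, not Boxworks types, and dimensions are plain integers.

```rust
/// Hypothetical container for the parameters that determine line widths.
#[derive(Debug, Default)]
struct ParagraphShape {
    hsize: i32,
    hang_indent: i32,
    hang_after: i32,
    /// (indent, width) pairs from \parshape; empty if \parshape is not set.
    par_shape: Vec<(i32, i32)>,
}

/// Returns (indent, width) for the given line, where the first line is 1.
fn line_shape(p: &ParagraphShape, line: usize) -> (i32, i32) {
    let line = line.max(1);
    // \parshape takes precedence; lines past its end reuse the last pair.
    if !p.par_shape.is_empty() {
        let index = line.min(p.par_shape.len()) - 1;
        return p.par_shape[index];
    }
    if p.hang_indent == 0 {
        return (0, p.hsize);
    }
    let special_lines = p.hang_after.unsigned_abs() as usize;
    // Negative \hangafter: the first lines hang. Otherwise the later lines hang.
    let hanging = if p.hang_after < 0 {
        line <= special_lines
    } else {
        line > special_lines
    };
    if hanging {
        // Positive \hangindent indents on the left; negative narrows on the right.
        (p.hang_indent.max(0), p.hsize - p.hang_indent.abs())
    } else {
        (0, p.hsize)
    }
}

fn main() {
    let p = ParagraphShape { hsize: 400, hang_indent: 30, hang_after: 2, ..Default::default() };
    for line in 1..=4 {
        println!("line {}: {:?}", line, line_shape(&p, line));
    }
}
```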
Line breaker
Breaks up a horizontal list into multiple lines and puts these on the enclosing vertical list. In TeX this is the famous Knuth-Plass line-breaking algorithm. Implemented in parts 38 and 39 of TeX.
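To give a feel for the optimization problem, here is a deliberately simplified line breaker. It is not the Knuth-Plass algorithm as implemented in TeX: there is no stretchable glue, no penalties, no hyphenation and no demerits. It is a plain dynamic program over word widths that minimizes the sum of squared leftover space on every line except the last. The function name and signature are hypothetical.

```rust
/// Returns the index of the first word on each line, or `None` if some word
/// is wider than the line. Simplified model: fixed-width spaces, and the cost
/// of a line is the square of its unused width (the last line is free).
fn break_lines(widths: &[u64], space: u64, line_width: u64) -> Option<Vec<usize>> {
    let n = widths.len();
    // best[i] is the minimal cost of setting words i..n, if that is possible.
    let mut best: Vec<Option<u64>> = vec![None; n + 1];
    // next[i] is the first word of the line that follows the line starting at i.
    let mut next = vec![n; n + 1];
    best[n] = Some(0);

    for i in (0..n).rev() {
        let mut used = 0;
        for j in i..n {
            // Candidate line holding words i..=j.
            used += widths[j] + if j > i { space } else { 0 };
            if used > line_width {
                break;
            }
            let slack = line_width - used;
            let line_cost = if j + 1 == n { 0 } else { slack * slack };
            if let Some(rest) = best[j + 1] {
                let total = line_cost + rest;
                if best[i].map_or(true, |b| total < b) {
                    best[i] = Some(total);
                    next[i] = j + 1;
                }
            }
        }
    }

    if best[0].is_none() {
        return None;
    }
    // Follow the chosen breaks from the start of the paragraph.
    let mut starts = Vec::new();
    let mut i = 0;
    while i < n {
        starts.push(i);
        i = next[i];
    }
    Some(starts)
}

fn main() {
    // Word widths for "aaa bb cc ddddd" with one unit per character.
    let widths = [3, 2, 2, 5];
    println!("{:?}", break_lines(&widths, 1, 6));
}
```

In this example a greedy breaker would fill the first line with "aaa bb" and leave "cc" alone on a very loose second line, whereas the dynamic program prefers "aaa" / "bb cc" / "ddddd". Looking at the paragraph as a whole rather than line by line is the idea that the real algorithm shares with this sketch.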
Output driver
Takes a vertical list corresponding to a page and outputs it in DVI format. Implemented in part 32: shipping pages out. In the long run there would also be an output driver for PDF and other formats.
Fuzz testing guide
Dependencies policy
Rust and Cargo make it very easy to add third-party dependencies. This can be useful for building software quickly. However, there is a well-known phenomenon of "dependency bloat" for Rust projects in which even small crates will have a huge number of dependencies, direct and transitive. This slows compilation times and adds maintenance costs as the dependencies need to be kept up to date. There is also some supply chain and code quality risk because at some point no one knows what's in the dependencies.
This point is especially compelling for a project that reimplements Knuth's TeX, because TeX has no third-party dependencies at all!
In order to mitigate the risk of dependency bloat, we have adopted the following fairly strict dependencies policies:
- Texcraft libraries are not allowed to have any required third-party dependencies. Third-party dependencies must be gated behind a Cargo feature.
- It is okay if a Cargo feature corresponding to a dependency is enabled by default. For example, the Texlang standard library enables the time feature by default, which uses the third-party crate chrono. In this case disabling the feature by default would be a footgun because the values of \day, \month, etc. would not be initialized correctly.
- There is a small curated collection of third-party Rust crates that are always okay to use. (But if they are used in a Texcraft library, they must be gated behind a Cargo feature.) If you want to use one of these crates, don't think twice: the project is already deeply invested in using them, so there's no downside. These crates are:
  - Serde. This is always gated behind a serde feature.
  - Clap. This is used for building Texcraft binaries and is not relevant for libraries.
- For other third-party dependencies, try to think critically about whether the dependency is really worth it. It can be very tempting to add some cool dependency-based feature. For example, Texcraft's word diffing library originally had support for color-based diffs because I thought it would be cool. But in the end such a niche feature is not worth taking on a third-party dependency.
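On the Rust side, gating a dependency behind a feature usually comes down to conditional compilation. The following sketch shows one common pattern, assuming the crate's Cargo.toml declares serde as an optional dependency and a feature named serde that enables it (for instance serde = ["dep:serde"]). The Component type is hypothetical and not taken from any Texcraft crate.

```rust
// The Serde derives are only applied when the `serde` feature is enabled,
// so the crate builds without any third-party code when the feature is off.
// (Compiling this file outside of Cargo may warn about the unknown feature name.)
#[cfg_attr(feature = "serde", derive(serde::Serialize, serde::Deserialize))]
#[derive(Debug, Default, PartialEq)]
pub struct Component {
    pub value: i32,
}

fn main() {
    let component = Component::default();
    println!("{:?}", component);
    if cfg!(feature = "serde") {
        println!("built with Serde support");
    } else {
        println!("built without third-party dependencies");
    }
}
```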
You can verify that Texcraft's libraries don't have any third-party dependencies by running the following command. The output should only reference Texcraft libraries:
cargo tree -e normal --features="" --no-default-features \
-p "boxworks*" \
-p core \
-p dvi \
-p texcraft-stdext \
-p "texlang*" \
-p tfm
Conversely, this shows the dependency tree when all features are enabled:
cargo tree -e normal --all-features \
-p "boxworks*" \
-p core \
-p dvi \
-p texcraft-stdext \
-p "texlang*" \
-p tfm
Dev dependencies
For dev dependencies we are less strict right now and there are no policy restrictions.