tfm/lib.rs
1//! # tfm: TeX font metric data
2//!
3//! This is a crate for working with TeX font metric data.
4//! It includes:
5//!
6//! - Functions to read and write TeX font metric (.tfm) files
7//! to and from a value of type [`File`]
8//! ([`deserialize`](File::deserialize), [`serialize`](File::serialize)).
9//!
10//! - Functions to read and write property list (.pl) files
11//! to and from a value of type [`pl::File`]
12//! ([`from_pl_source_code`](pl::File::from_pl_source_code), [`display`](pl::File::display)).
13//!
14//! - Converters from .tfm to .pl files and vice-versa
15//! (using Rust's [`From`] trait to go between [`File`] and [`pl::File`]).
16//!
17//! ## Background
18//!
19//! Probably the most famous part of the implementation of TeX is the Knuth-Plass line breaking algorithm.
20//! This algorithm determines the "optimal" places to add line breaks when typesetting a paragraph of text.
21//! In order to run the algorithm one needs to provide the dimensions of all characters in the current font.
22//! These dimensions are used to size the boxes in the Knuth-Plass box and glue model.
23//!
24//! In TeX, character dimensions are provided using TeX font metric files.
25//! These are binary files.
26//! By convention they have a .tfm file extension.
27//! Unlike more modern file formats like TrueType, .tfm files only contain the font dimensions;
28//! they don't contains the glyphs.
29//! In general,
30//! .tfm files are produced by other software like Metafont,
31//! placed in some well-known directory in the TeX distribution,
32//! and then read into memory when TeX is running.
33//!
34//! Because .tfm files are binary files, it's hard to debug or tweak them.
35//! To remedy this, Knuth and his team developed another file format called a property list file
36//! (extension .pl or .plst)
37//! that contains the same information but in a modifiable text format.
38//! They then wrote two programs:
39//! `tftopl` to convert a .tfm file to a .pl file,
40//! and `pltotf` to convert a .pl file to a .tfm file.
41//!
42//! The general goal of this crate to fully re-implement all of the TeX font metric
43//! code written by Knuth and others.
44//! This includes `tftopl`, `pltotf`, and also the parts of TeX itself that contain logic
45//! for reading and interpreting .tfm files.
46//! However, unlike these monolithic software programs,
47//! this re-implementation is in the form of a modular library in which
48//! individual pieces of logic and be used and re-used.
49//!
50//! ## Basic example
51//!
52//! ```
53//! // Include the .tfm file for Computer Modern in size 10pt.
54//! let tfm_bytes = include_bytes!["../corpus/computer-modern/cmr10.tfm"];
55//!
56//! // Deserialize the .tfm file.
57//! let (tfm_file_or_error, deserialization_warnings) = tfm::File::deserialize(tfm_bytes);
58//! let mut tfm_file = tfm_file_or_error.expect("cmr10.tfm is a valid .tfm file");
59//! assert_eq![deserialization_warnings, vec![], "the data in cmr10.tfm is 100% valid, so there are no deserialization warnings"];
60//! // TODO assert_eq![tfm_file.header.design_size, tfm::FixWord::UNITY * 10]; make it 11 to be more interesting
61//! // TODO query some data
62//!
63//! // Validate the .tfm file.
64//! let validation_warnings = tfm_file.validate_and_fix();
65//! assert_eq![validation_warnings, vec![], "the data in cmr10.tfm is 100% valid, so there are no validation warnings"];
66//!
67//! // Convert the .tfm file to a .pl file and print it.
68//! let pl_file: tfm::pl::File = tfm_file.clone().into();
69//! // TODO query some data
70//! println!["cmr10.pl:\n{}", pl_file.display(/*indent=*/2, tfm::pl::CharDisplayFormat::Default)];
71//! ```
72//!
73//!
74//! ## Advanced functionality
75//!
76//! In addition to supporting the basic use cases of querying font metric data
77//! and converting between different formats,
78//! this crate has advanced functionality for performing additional tasks on font metric data.
79//! The full set of functionality can be understood by navigating through the crate documentation.
80//! But here are 3 highlights we think are interesting:
81//!
82//! - **Language analysis of .pl files**:
83//! In `pltotf`, Knuth parses .pl files in a single pass.
84//! This crate takes a common approach nowadays of parsing in multiple passes:
85//! first constructing a [concrete syntax tree](pl::cst::Cst) (or parse tree),
86//! next constructing a [fully typed and checked abstract syntax tree](pl::ast::Ast),
87//! and finally building the [`pl::File`] itself.
88//! Each of the passes is exposed, so you can e.g. just build the AST for the .pl file and
89//! do some analysis on it.
90//!
91//! - **Debug output for .tfm files**:
92//!
93//! - **Compilation of lig/kern programs**:
94//!
95//!
96//! ## Binaries
97//!
98//! The Texcraft project produces 3 binaries based on this crate:
99//!
100//! - `tftopl` and `pltotf`: re-implementations of Knuth's programs.
101//! - `tfmtools`: a new binary that has a bunch of different tools
102//! for working with TeX font metric data.
103//! Run `tfmtools help` to list all of the available tools.
104//!
105//! In the root of [the Texcraft repository](https://github.com/jamespfennell/texcraft)
106//! these tools can be run with `cargo run --bin $NAME`
107//! and built with `cargo build --bin $NAME`.
108//!
109//!
110//! ## Correctness
111//!
112//! As part of the development of this crate significant effort has been spent
113//! ensuring it exactly replicates the work of Knuth.
114//! This correctness checking is largely based around diff testing the binaries
115//! `tftopl` and `pltotf`.
116//! We verify that the Texcraft and Knuth implementations have the same output
117//! and generate the same error messages.
118//!
119//! This diff testing has been performed in a few different ways:
120//!
121//! - We have run diff tests over all ~100,000 .tfm files in CTAN.
122//! These tests verify that `tftopl` gives the correct result,
123//! and that running `pltotf` on the output .pl file gives the correct result too.
124//! Unfortunately running `pltotf` on the .pl files in CTAN is infeasible
125//! because most of these files are Perl scripts, not property list files.
126//!
127//! - We have developed a fuzz testing harness (so far just for `tftopl`)
128//! that generates highly heterogenous .tfm files and verifies that `tftopl` gives the correct result.
129//! This fuzz testing has uncovered many issues in the Texcraft implementation,
130//! and has even identified [a 30-year old bug](https://tug.org/pipermail/tex-k/2024-March/004031.html)
131//! in Knuth's implementation of `tftopl`.
132//!
133//! Any .tfm or .pl file that exposes a bug in this library is added to
134//! [our automated testing corpus](https://github.com/jamespfennell/texcraft/tree/main/crates/tfm/bin/tests/data).
135//! Running `cargo t` validates that Texcraft's binaries give the same result as Knuth's binaries
136//! (the output of Knuth's binaries is in source control).
137//! This ensures there are no regressions.
138//!
139//! If you discover a .tfm or .pl file such that the Texcraft and Knuth implementations
140//! diverge, this indicates there is a bug in this library.
141//! Please create an issue on the Texcraft GitHub repo.
142//! We will fix the bug and add your files to the testing corpus.
143
144pub mod algorithms;
145use std::{
146 collections::{BTreeSet, HashMap, HashSet},
147 num::NonZeroU8,
148};
149pub mod ligkern;
150pub mod pl;
151
152use std::collections::BTreeMap;
153
154mod debug;
155mod deserialize;
156mod serialize;
157mod validate;
158
159pub use deserialize::DeserializationError;
160pub use deserialize::DeserializationWarning;
161pub use deserialize::RawFile;
162pub use deserialize::SubFileSizes;
163pub use validate::ValidationWarning;
164
165/// Complete contents of a TeX font metric (.tfm) file.
166///
167/// The struct contain multiple vectors.
168/// In TeX and TFtoPL there is an optimization in which all of data in the vectors
169/// is stored in one large vector of 32-bit integers.
170/// The conversion from [u32] to the specific types like [FixWord] are then done when the
171/// data is needed.
172/// This makes the memory footprint of this type much more compact,
173/// and such a change may be considered in the future.
174///
175/// In fact in TeX the font data for all fonts is stored in one contiguous piece of memory
176/// (`font_info`, defined in TeX82.2021.549).
177/// This is a little too unsafe to pull off though.
178#[derive(Clone, Debug, PartialEq, Eq)]
179pub struct File {
180 /// Header.
181 pub header: Header,
182
183 /// Smallest character in the .tfm.
184 pub smallest_char: Char,
185
186 /// Character dimensions.
187 pub char_dimens: BTreeMap<Char, CharDimensions>,
188
189 /// Character tags.
190 ///
191 /// Note there is no correlation between a character having
192 /// a tag and a character having a dimension.
193 /// All four combinations of (has or hasn't dimensions) and (has or hasn't a tag)
194 /// are possible.
195 pub char_tags: BTreeMap<Char, CharTag>,
196
197 /// Tags that have been unset, but whose discriminant is still written to a .tfm file by PLtoTF.
198 pub unset_char_tags: BTreeMap<Char, u8>,
199
200 /// Character widths
201 pub widths: Vec<FixWord>,
202
203 /// Character heights
204 pub heights: Vec<FixWord>,
205
206 /// Character depths
207 pub depths: Vec<FixWord>,
208
209 /// Character italic corrections
210 pub italic_corrections: Vec<FixWord>,
211
212 /// Lig kern program.
213 pub lig_kern_program: ligkern::lang::Program,
214
215 /// Kerns. These are referenced from inside the lig kern commands.
216 pub kerns: Vec<FixWord>,
217
218 /// Extensible characters.
219 pub extensible_chars: Vec<ExtensibleRecipe>,
220
221 /// Font parameters.
222 pub params: Vec<FixWord>,
223}
224
225/// Data about one character in a .tfm file.
226#[derive(Clone, Debug, PartialEq, Eq)]
227pub struct CharDimensions {
228 /// Index of the width of this character in the widths array.
229 ///
230 /// In valid TFM files, this index will always be non-zero.
231 /// This is because if the width index is zero it means there is no data for the character in the file.
232 /// In this case the other indices (height, etc.) are necessarily zero.
233 /// See TFtoPL.2014.? (where blocks of data with a zero width index are skipped)
234 /// and PLtoTF.2014.? (where missing characters are written with all indices 0).
235 ///
236 /// There is one edge case where this index can be zero.
237 /// This is if the width index is invalid.
238 /// In this case tftopl essentially sets the width index to 0.
239 ///
240 /// Note that even if a character doesn't have dimensions, it can still have a tag.
241 pub width_index: WidthIndex,
242 /// Index of the height of this character in the height array.
243 pub height_index: u8,
244 pub depth_index: u8,
245 pub italic_index: u8,
246}
247
248#[derive(Copy, Clone, Debug, PartialEq, Eq)]
249pub enum WidthIndex {
250 Invalid,
251 Valid(NonZeroU8),
252}
253
254impl WidthIndex {
255 pub fn valid(&self) -> Option<NonZeroU8> {
256 match self {
257 WidthIndex::Invalid => None,
258 WidthIndex::Valid(n) => Some(*n),
259 }
260 }
261 pub fn get(&self) -> u8 {
262 match self {
263 WidthIndex::Invalid => 0,
264 WidthIndex::Valid(n) => n.get(),
265 }
266 }
267}
268
269/// Tag of a character in a .tfm file.
270#[derive(Clone, Debug, PartialEq, Eq)]
271pub enum CharTag {
272 Ligature(u8),
273 List(Char),
274 Extension(u8),
275}
276
277impl CharTag {
278 pub fn ligature(&self) -> Option<u8> {
279 match self {
280 CharTag::Ligature(l) => Some(*l),
281 CharTag::List(_) | CharTag::Extension(_) => None,
282 }
283 }
284 pub fn list(&self) -> Option<Char> {
285 match self {
286 CharTag::List(c) => Some(*c),
287 CharTag::Ligature(_) | CharTag::Extension(_) => None,
288 }
289 }
290 pub fn extension(&self) -> Option<u8> {
291 match self {
292 CharTag::Extension(u) => Some(*u),
293 CharTag::Ligature(_) | CharTag::List(_) => None,
294 }
295 }
296}
297
298/// Extensible recipe instruction in a .tfm file.
299#[derive(Debug, Default, Clone, PartialEq, Eq)]
300pub struct ExtensibleRecipe {
301 pub top: Option<Char>,
302 pub middle: Option<Char>,
303 pub bottom: Option<Char>,
304 pub rep: Char,
305}
306
307impl ExtensibleRecipe {
308 pub fn is_seven_bit(&self) -> bool {
309 self.chars().all(|c| c.is_seven_bit())
310 }
311
312 pub fn chars(&self) -> impl Iterator<Item = Char> {
313 [self.top, self.middle, self.bottom, Some(self.rep)]
314 .into_iter()
315 .flatten()
316 }
317}
318
319impl Default for File {
320 fn default() -> Self {
321 Self {
322 header: Default::default(),
323 smallest_char: Char(1),
324 char_dimens: Default::default(),
325 char_tags: Default::default(),
326 unset_char_tags: Default::default(),
327 widths: vec![FixWord::ZERO],
328 heights: vec![FixWord::ZERO],
329 depths: vec![FixWord::ZERO],
330 italic_corrections: vec![FixWord::ZERO],
331 lig_kern_program: Default::default(),
332 kerns: vec![],
333 extensible_chars: vec![],
334 params: Default::default(),
335 }
336 }
337}
338
339/// Print debugging information about a .tfm file.
340pub fn debug<'a>(
341 path: Option<&'a str>,
342 sub_file_sizes: SubFileSizes,
343 tfm_file: Option<&'a File>,
344 raw_file: Option<&'a RawFile<'a>>,
345 sections: Vec<Section>,
346) -> impl std::fmt::Display + 'a {
347 debug::Debug {
348 sub_file_sizes,
349 path,
350 raw_file,
351 tfm_file,
352 sections,
353 }
354}
355
356impl File {
357 pub fn from_raw_file(raw_file: &RawFile) -> Self {
358 deserialize::from_raw_file(raw_file)
359 }
360
361 pub fn validate_and_fix(&mut self) -> Vec<ValidationWarning> {
362 validate::validate_and_fix(self)
363 }
364
365 pub fn deserialize(
366 b: &[u8],
367 ) -> (
368 Result<File, DeserializationError>,
369 Vec<DeserializationWarning>,
370 ) {
371 deserialize::deserialize(b)
372 }
373
374 pub fn serialize(&self) -> Vec<u8> {
375 serialize::serialize(self)
376 }
377
378 pub fn replace_lig_kern_program(
379 &mut self,
380 mut program: ligkern::lang::Program,
381 entrypoints: HashMap<Char, u16>,
382 ) {
383 let entrypoints = program.pack_entrypoints(entrypoints);
384 self.kerns = program.unpack_kerns();
385 self.lig_kern_program = program;
386 self.char_tags
387 .retain(|_, tag| !matches!(tag, CharTag::Ligature(_)));
388 for (c, ep) in entrypoints {
389 self.char_tags.insert(c, CharTag::Ligature(ep));
390 }
391 }
392
393 /// Return a map from characters to the lig/kern entrypoint for that character.
394 /// TODO: can probably return impl Iterator<Item=(Char, u8)>
395 pub fn lig_kern_entrypoints(&self) -> HashMap<Char, u8> {
396 self.char_tags
397 .iter()
398 .filter_map(|(c, d)| match *d {
399 CharTag::Ligature(l) => Some((*c, l)),
400 _ => None,
401 })
402 .collect()
403 }
404
405 fn char_info_bounds(&self) -> Option<(Char, Char)> {
406 let mut r: Option<(Char, Char)> = None;
407 for c in self.char_dimens.keys().copied() {
408 r = Some(match r {
409 None => (c, c),
410 Some((lower, upper)) => (
411 if c.0 < lower.0 { c } else { lower },
412 if c.0 < upper.0 { upper } else { c },
413 ),
414 })
415 }
416 r
417 }
418
419 /// Calculate the checksum of this .tfm file.
420 pub fn checksum(&self) -> u32 {
421 // This checksum algorithm is in PLtoTF.2014.134.
422 let (bc, ec) = self.char_info_bounds().unwrap_or((Char(1), Char(0)));
423 let mut b = [bc.0, ec.0, bc.0, ec.0];
424 for c in bc.0..=ec.0 {
425 let char_dimens = match self.char_dimens.get(&Char(c)) {
426 None => continue,
427 Some(char_dimens) => char_dimens,
428 };
429 let width = self.widths[char_dimens.width_index.get() as usize].0;
430 // TODO: adjust based on the design units
431 let width = width + (c as i32 + 4) * 0o20_000_000;
432 let add = |b: u8, m: u8| -> u8 {
433 (((b as i32) + (b as i32) + width) % (m as i32))
434 .try_into()
435 .expect("(i32 % u8) is always a u8")
436 };
437 b = [
438 add(b[0], 255),
439 add(b[1], 253),
440 add(b[2], 251),
441 add(b[3], 247),
442 ]
443 }
444 u32::from_be_bytes(b)
445 }
446
447 pub fn named_param(&self, param: NamedParameter) -> Option<FixWord> {
448 self.params.get((param.number() - 1) as usize).copied()
449 }
450
451 pub fn named_param_scaled(&self, param: NamedParameter) -> Option<common::Scaled> {
452 self.named_param(param)
453 .map(|p| p.to_scaled(self.header.design_size))
454 }
455
456 pub fn width_utf8(&self, char: char) -> Option<common::Scaled> {
457 let char: Char = char.try_into().ok()?;
458 let dimens = self.char_dimens.get(&char)?;
459 let i = dimens.width_index.valid()?.get();
460 Some(
461 self.widths
462 .get(i as usize)?
463 .to_scaled(self.header.design_size),
464 )
465 }
466 pub fn height_utf8(&self, char: char) -> Option<common::Scaled> {
467 let char: Char = char.try_into().ok()?;
468 let dimens = self.char_dimens.get(&char)?;
469 let i = dimens.height_index;
470 Some(
471 self.heights
472 .get(i as usize)?
473 .to_scaled(self.header.design_size),
474 )
475 }
476 pub fn depth_utf8(&self, char: char) -> Option<common::Scaled> {
477 let char: Char = char.try_into().ok()?;
478 let dimens = self.char_dimens.get(&char)?;
479 let i = dimens.depth_index;
480 Some(
481 self.depths
482 .get(i as usize)?
483 .to_scaled(self.header.design_size),
484 )
485 }
486}
487
488impl From<crate::pl::File> for File {
489 fn from(pl_file: crate::pl::File) -> Self {
490 let mut char_bounds: Option<(Char, Char)> = None;
491
492 let lig_kern_entrypoints = pl_file.lig_kern_entrypoints(true);
493 let mut lig_kern_program = pl_file.lig_kern_program;
494 let kerns = lig_kern_program.unpack_kerns();
495 let lig_kern_entrypoints = lig_kern_program.pack_entrypoints(lig_kern_entrypoints);
496
497 let mut widths = pl_file.additional_widths;
498 let mut heights = pl_file.additional_heights;
499 let mut depths = pl_file.additional_depths;
500 let mut italic_corrections = pl_file.additional_italics;
501 for (char, char_dimens) in &pl_file.char_dimens {
502 widths.push(char_dimens.width.unwrap_or_default());
503 match char_dimens.height {
504 None | Some(FixWord::ZERO) => {}
505 Some(height) => heights.push(height),
506 }
507 match char_dimens.depth {
508 None | Some(FixWord::ZERO) => {}
509 Some(depth) => depths.push(depth),
510 }
511 match char_dimens.italic_correction {
512 None | Some(FixWord::ZERO) => {}
513 Some(italic_correction) => italic_corrections.push(italic_correction),
514 }
515 char_bounds = Some(match char_bounds {
516 None => (*char, *char),
517 Some((lower, upper)) => (
518 if *char < lower { *char } else { lower },
519 if *char > upper { *char } else { upper },
520 ),
521 })
522 }
523 let (widths, width_to_index) = compress(&widths, 255);
524 let (heights, height_to_index) = compress(&heights, 15);
525 let (depths, depth_to_index) = compress(&depths, 15);
526 let (italic_corrections, italic_correction_to_index) = compress(&italic_corrections, 63);
527 let mut extensible_chars = vec![];
528 let char_dimens = match char_bounds {
529 None => Default::default(),
530 Some((lower, upper)) => {
531 let mut m: BTreeMap<Char, CharDimensions> = Default::default();
532 for c in lower.0..=upper.0 {
533 let pl_data = match pl_file.char_dimens.get(&Char(c)) {
534 Some(pl_data) => pl_data,
535 None => continue,
536 };
537 let width = pl_data.width.unwrap_or_default();
538 let width_index = *width_to_index.get(&width).expect(
539 "the map returned from compress(_,_) contains every input as a key",
540 );
541 m.insert(
542 Char(c),
543 CharDimensions {
544 width_index: WidthIndex::Valid(width_index),
545 height_index: match pl_data.height {
546 None => 0,
547 Some(height) => {
548 // If the height data is missing from the height_to_index map, it's because
549 // the height is 0. Similar with depths and italic corrections.
550 height_to_index
551 .get(&height)
552 .copied()
553 .map(NonZeroU8::get)
554 .unwrap_or(0)
555 }
556 },
557 depth_index: match pl_data.depth {
558 None => 0,
559 Some(depth) => depth_to_index
560 .get(&depth)
561 .copied()
562 .map(NonZeroU8::get)
563 .unwrap_or(0),
564 },
565 italic_index: match pl_data.italic_correction {
566 None => 0,
567 Some(italic_correction) => italic_correction_to_index
568 .get(&italic_correction)
569 .copied()
570 .map(NonZeroU8::get)
571 .unwrap_or(0),
572 },
573 },
574 );
575 }
576 m
577 }
578 };
579 let ordered_chars = {
580 let mut m: Vec<Char> = pl_file.char_tags.keys().copied().collect();
581 m.sort();
582 m
583 };
584 let char_tags = ordered_chars.into_iter().map(|c| {
585 (c,
586 match pl_file.char_tags.get(&c).unwrap() {
587 pl::CharTag::Ligature(_) => {
588 let entrypoint = *lig_kern_entrypoints.get(&c).expect("the map returned by crate::ligkern::lang::compress_entrypoints has a key for all chars with a lig tag");
589 CharTag::Ligature(entrypoint)
590 }
591 pl::CharTag::List(c) => CharTag::List(*c),
592 pl::CharTag::Extension(e) => {
593 let index: u8 = extensible_chars.len().try_into().unwrap();
594 extensible_chars.push(e.clone());
595 CharTag::Extension(index)
596 }
597 },
598 )
599 }).collect();
600
601 let mut file = Self {
602 header: pl_file.header,
603 smallest_char: char_bounds.map(|t| t.0).unwrap_or(Char(1)),
604 char_dimens,
605 char_tags,
606 unset_char_tags: pl_file.unset_char_tags.clone(),
607 widths,
608 heights,
609 depths,
610 italic_corrections,
611 lig_kern_program,
612 kerns,
613 extensible_chars,
614 params: pl_file.params,
615 };
616 if file.header.checksum.is_none() {
617 file.header.checksum = Some(file.checksum());
618 }
619 file
620 }
621}
622
623/// Lossy compression of numbers for TFM files.
624///
625/// The TFM file format can only store up to 15 heights, 15 depths and 63 italic corrections.
626/// If a property list file contains, e.g., more than 15 distinct heights, something has to give.
627/// PLtoTF contains a lossy compression algorithm that takes a list of values and returns another
628/// bounded list of values which approximates the original list.
629/// This is implemented in PLtoTF.2014.75-80 and re-implemented here.
630///
631/// ## How the algorithm works.
632///
633/// For a given delta, the algorithm partitions the ordered list of values such that within
634/// each partition the maximum distance between two elements is delta. All of the values within
635/// the partition are then approximated by `(interval_max+interval_min)/2`.
636/// As such, each value may be increased or decreased by up to `delta/2`.
637/// After this procedure, the list of values is replaced by the list of approximations.
638/// This compresses the list of values into a list whose size is the number of intervals.
639///
640/// Given this subroutine, the algorithm finds the smallest delta such that the number of intervals
641/// is less than the required maximum (e.g., 15 for heights).
642///
643/// There are some important features of this algorithm to note:
644///
645/// - For a given delta value there may be multiple possible partitions.
646/// The algorithm uses a greedy approach in which it maximizes the size of the first partition,
647/// then the size of the second partition, and so on.
648///
649/// - Distinct delta values can yield the same partition.
650/// For example, if the initial values are `[1, 4, 5]` then any delta in the range `[1, 3)`
651/// gives the same result (`[[1], [4, 5]]`)
652/// Whenever we check a delta, we are really checking the _interval of deltas_ that gives the same result.
653/// Both Knuth's and our implementations use this fact to speed up the search by reducing the search space.
654/// E.g. in the example above, after checking `delta=1` we _don't_ check `delta=2`.
655///
656/// ## This re-implementation
657///
658/// The re-implementation here follows PLtoTF closely enough, but with one modification.
659/// To find the optimal delta, Knuth first calculates the minimal possible delta.
660/// This is the minimum distance between adjacent elements in the ordered list of values.
661/// In general this will not be a valid solution because the number of intervals
662/// it generates will be large.
663/// So, he next finds the smallest `k` such that `2^k * min_delta` is a valid solution.
664/// Te then does a upwards linear search within the interval `[min_delta * 2^{k-1}, min_delta * 2^k]` to find
665/// the optimal delta.
666///
667/// Checking a particular delta is `O(n)`.
668/// The worst-case running time of Knuth's algorithm is then `O(n^3)` because the interval
669/// `[2^{k-1}, 2^k]` can contain `O(n^2)` distinct deltas to check in the worst-case [Note 1].
670///
671/// In the re-implementation here we realize that the problem of finding the smallest possible delta
672/// is a classic binary search problem.
673/// This is because if delta is a valid solution, any larger delta also is;
674/// and if delta is not a valid solution, any smaller delta is also not a valid solution.
675/// The re-implementation using binary search is `O(n log n)`.
676/// Moreover, the constant factors are about the same.
677///
678/// [Note 1] In the worst-case there are O(n^2) distinct deltas because each pair of elements yields a delta.
679/// Let m be the smallest delta and M the largest delta.
680/// In the initial `2^k`-based ranging scheme, the largest `K` satisfies `m 2^{K-1} < M <= m 2^K`,
681/// or `K-1 <= log_2(M/m)`. Thus there are `K=O(1)` ranges in this initial scheme.
682/// By the pigeon-hole principle, there exists a `k` such that the range `[m * 2^{k-1}, m * 2^k]`
683/// contains O(n^2) elements.
684/// In the worst-case, the solution is the maximum element of this range.
685pub fn compress(values: &[FixWord], max_size: u8) -> (Vec<FixWord>, HashMap<FixWord, NonZeroU8>) {
686 let max_size = max_size as usize;
687 let dedup_values = {
688 let s: HashSet<FixWord> = values.iter().copied().collect();
689 // remove the zero value for non-widths
690 // and then add it back in at the start
691 let mut v: Vec<FixWord> = s.into_iter().collect();
692 v.sort();
693 v
694 };
695 // After deduplication, it is possible we don't need to compress at all so we can exit early.
696 // This also handles the case when the values slice is empty.
697 if dedup_values.len() <= max_size {
698 let m: HashMap<FixWord, NonZeroU8> = dedup_values
699 .iter()
700 .enumerate()
701 .map(|(i, &w)| {
702 let i: u8 = i.try_into().expect("`dedup_values` has at most `max_size` elements, so the index is at most `max_size-1`");
703 let i: NonZeroU8 = (i+1).try_into().expect("`i<=max_size-1<=u8::MAX`, so `i+1<=u8::MAX`");
704 (w, i) })
705 .collect();
706 let mut dedup_values = dedup_values;
707 dedup_values.push(FixWord::ZERO);
708 dedup_values.rotate_right(1);
709 return (dedup_values, m);
710 }
711
712 // For the binary search we maintain lower and upper indices as usual.
713 // The optimal delta is in the interval [lower, upper].
714 //
715 // Invariant: delta<lower is never a solution.
716 // Because delta must be non-negative, we initialize it to zero.
717 let mut lower = FixWord::ZERO;
718 // Invariant: delta=upper is always solution.
719 // To initialize upper and begin the search we construct a solution that always works: a single
720 // interval encompassing the entire slice and the largest delta possible.
721 let max_delta = *dedup_values.last().unwrap() - *dedup_values.first().unwrap();
722 let mut upper = max_delta;
723 let mut solution = vec![dedup_values.len()];
724
725 let mut buffer = vec![];
726 while lower < upper {
727 // After the following line delta is potentially equal to lower. This is what we want as
728 // we know upper is a solution so to advance the search when upper=lower+1
729 // we need to check lower+1.
730 let delta = lower + (upper - lower) / 2;
731
732 let mut interval_start = *dedup_values.first().unwrap();
733 // The smallest delta such that the candidate solution will be the same.
734 // This is the maximum of all gaps that don't start a new interval.
735 let mut delta_lower = FixWord::ZERO;
736 // The largest delta such that the candidate solution will be different.
737 // This is the minimum of all gaps that start a new interval.
738 let mut delta_upper = max_delta;
739 for (i, &v) in dedup_values.iter().enumerate() {
740 let gap = v - interval_start;
741 if gap > delta {
742 // We need to start a new interval
743 if gap < delta_upper {
744 delta_upper = gap;
745 }
746 buffer.push(i);
747 // If the candidate solution is already too big, we can exit early.
748 if buffer.len() >= max_size {
749 break;
750 }
751 interval_start = v;
752 } else {
753 // We need to extend the current interval
754 // For any delta in the range [gap, delta] we would have made the same choice here.
755 if gap > delta_lower {
756 delta_lower = gap;
757 }
758 }
759 }
760 buffer.push(dedup_values.len());
761
762 if buffer.len() <= max_size {
763 // solution
764 std::mem::swap(&mut buffer, &mut solution);
765 upper = delta_lower;
766 } else {
767 // not a solution
768 lower = delta_upper;
769 }
770 buffer.clear();
771 }
772
773 let mut value_to_index = HashMap::<FixWord, NonZeroU8>::new();
774 let mut result = vec![FixWord::ZERO];
775 let mut previous = 0_usize;
776 for i in solution {
777 let interval = &dedup_values[previous..i];
778 previous = i;
779 for &v in interval {
780 let index: u8 = result.len().try_into().expect("the `result` array contains at most `1+max_size` elements, so the index it at most `max_size` which is a u8");
781 let index: NonZeroU8 = index
782 .try_into()
783 .expect("the `result` array contains at least 1 element so this is never 0");
784 value_to_index.insert(v, index);
785 }
786 let replacement = (*interval.last().unwrap() + *interval.first().unwrap()) / 2;
787 result.push(replacement);
788 }
789
790 (result, value_to_index)
791}
792
793#[derive(Debug, Hash, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
794pub enum Section {
795 /// Sub-file sizes
796 SubFileSizes,
797 /// Header
798 Header,
799 /// Character data
800 CharInfos,
801 /// Widths array
802 Widths,
803 /// Heights array
804 Heights,
805 /// Depths array
806 Depths,
807 /// Italic corrections array
808 ItalicCorrections,
809 /// Lig/kern instructions
810 LigKern,
811 /// Kerns array
812 Kerns,
813 /// Extensible recipes
814 ExtensibleRecipes,
815 /// Params array
816 Params,
817}
818
819impl Section {
820 pub const NAMES: [&'static str; 11] = [
821 "sub-file-sizes",
822 "header",
823 "char-infos",
824 "widths",
825 "heights",
826 "depths",
827 "italic-corrections",
828 "lig-kern",
829 "kerns",
830 "extensible-recipes",
831 "params",
832 ];
833 pub const ALL_SECTIONS: [Section; 11] = [
834 Section::SubFileSizes,
835 Section::Header,
836 Section::CharInfos,
837 Section::Widths,
838 Section::Heights,
839 Section::Depths,
840 Section::ItalicCorrections,
841 Section::LigKern,
842 Section::Kerns,
843 Section::ExtensibleRecipes,
844 Section::Params,
845 ];
846}
847
848impl TryFrom<&str> for Section {
849 type Error = ();
850
851 fn try_from(value: &str) -> Result<Self, Self::Error> {
852 for i in 0..Section::NAMES.len() {
853 if Section::NAMES[i] == value {
854 return Ok(Section::ALL_SECTIONS[i]);
855 }
856 }
857 Err(())
858 }
859}
860
861impl std::fmt::Display for Section {
862 fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
863 write!(f, "{}", Section::NAMES[*self as usize])
864 }
865}
866
867/// The TFM header, which contains metadata about the file.
868///
869/// This is defined in TFtoPL.2014.10.
870#[derive(Clone, Debug, Default, PartialEq, Eq)]
871pub struct Header {
872 /// The font checksum, if specified.
873 ///
874 /// In .tfm files checksums are always specified because the format has no
875 /// way to omit a checksum.
876 ///
877 /// In .pl files checksums are specified if the `CHECKSUM` node appears.
878 /// If no checksum is specified in a .pl file, pltotf calculates the
879 /// correct value and writes that.
880 ///
881 /// In TeX82, this is stored in the `font_check` array (TeX82.2021.549).
882 pub checksum: Option<u32>,
883 /// In TeX82, this is stored in the `font_dsize` array (TeX82.2021.549).
884 pub design_size: FixWord,
885 pub design_size_valid: bool,
886 pub character_coding_scheme: Option<String>,
887 pub font_family: Option<String>,
888 pub seven_bit_safe: Option<bool>,
889 pub face: Option<Face>,
890 /// The TFM format allows the header to contain arbitrary additional data.
891 pub additional_data: Vec<u32>,
892}
893
894impl Header {
895 /// Returns the default header when parsing property list files.
896 ///
897 /// This is defined in PLtoTF.2014.70.
898 pub fn pl_default() -> Header {
899 Header {
900 checksum: None,
901 design_size: FixWord::ONE * 10,
902 design_size_valid: true,
903 character_coding_scheme: Some("UNSPECIFIED".into()),
904 font_family: Some("UNSPECIFIED".into()),
905 seven_bit_safe: None,
906 face: Some(0.into()),
907 additional_data: vec![],
908 }
909 }
910
911 /// Returns the default header when parsing .tfm files.
912 ///
913 /// This is defined in PLtoTF.2014.70.
914 pub fn tfm_default() -> Header {
915 Header {
916 checksum: Some(0),
917 design_size: FixWord::ZERO,
918 design_size_valid: true,
919 character_coding_scheme: None,
920 font_family: None,
921 seven_bit_safe: None,
922 face: None,
923 additional_data: vec![],
924 }
925 }
926}
927
928/// A character in a TFM file.
929///
930/// TFM and PL files only support 1-byte characters.
931#[derive(Debug, Default, PartialEq, Eq, Hash, Clone, Copy, PartialOrd, Ord)]
932#[cfg_attr(feature = "arbitrary", derive(arbitrary::Arbitrary))]
933pub struct Char(pub u8);
934
935impl std::fmt::Display for Char {
936 fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
937 if (self.0 as char).is_ascii_graphic() {
938 write!(f, "{}", self.0 as char)
939 } else {
940 write!(f, "\\u{{{:02x}}}", self.0)
941 }
942 }
943}
944
945impl From<u8> for Char {
946 fn from(value: u8) -> Self {
947 Char(value)
948 }
949}
950
951impl TryFrom<char> for Char {
952 type Error = std::char::TryFromCharError;
953
954 fn try_from(value: char) -> Result<Self, Self::Error> {
955 let u: u8 = value.try_into()?;
956 Ok(Char(u))
957 }
958}
959
960impl From<Char> for char {
961 fn from(value: Char) -> Self {
962 value.0 as char
963 }
964}
965
966macro_rules! const_chars {
967 ( $( ($name: ident, $value: expr), )+ ) => {
968 $(
969 pub const $name: Char = Char($value);
970 )+
971 };
972}
973
974impl Char {
975 const_chars![
976 (A, b'A'),
977 (B, b'B'),
978 (C, b'C'),
979 (D, b'D'),
980 (V, b'V'),
981 (W, b'W'),
982 (X, b'X'),
983 (Y, b'Y'),
984 (Z, b'Z'),
985 ];
986
987 pub fn is_seven_bit(&self) -> bool {
988 self.0 <= 127
989 }
990}
991
992#[derive(Debug, PartialEq, Eq, Copy, Clone)]
993#[cfg_attr(feature = "arbitrary", derive(arbitrary::Arbitrary))]
994pub enum FaceWeight {
995 Light,
996 Medium,
997 Bold,
998}
999
1000#[derive(Debug, PartialEq, Eq, Copy, Clone)]
1001#[cfg_attr(feature = "arbitrary", derive(arbitrary::Arbitrary))]
1002pub enum FaceSlope {
1003 Roman,
1004 Italic,
1005}
1006
1007#[derive(Debug, PartialEq, Eq, Copy, Clone)]
1008#[cfg_attr(feature = "arbitrary", derive(arbitrary::Arbitrary))]
1009pub enum FaceExpansion {
1010 Regular,
1011 Condensed,
1012 Extended,
1013}
1014
1015#[derive(Debug, PartialEq, Eq, Copy, Clone)]
1016#[cfg_attr(feature = "arbitrary", derive(arbitrary::Arbitrary))]
1017pub enum Face {
1018 Valid(FaceWeight, FaceSlope, FaceExpansion),
1019 Other(u8),
1020}
1021
1022impl From<u8> for Face {
1023 fn from(value: u8) -> Self {
1024 if value >= 18 {
1025 return Face::Other(value);
1026 }
1027 let a = match (value % 6) / 2 {
1028 0 => FaceWeight::Medium,
1029 1 => FaceWeight::Bold,
1030 2 => FaceWeight::Light,
1031 _ => unreachable!(),
1032 };
1033 let b = match value % 2 {
1034 0 => FaceSlope::Roman,
1035 1 => FaceSlope::Italic,
1036 _ => unreachable!(),
1037 };
1038 let c = match value / 6 {
1039 0 => FaceExpansion::Regular,
1040 1 => FaceExpansion::Condensed,
1041 2 => FaceExpansion::Extended,
1042 _ => unreachable!(),
1043 };
1044 Face::Valid(a, b, c)
1045 }
1046}
1047
1048impl From<Face> for u8 {
1049 fn from(value: Face) -> Self {
1050 match value {
1051 Face::Valid(w, s, c) => {
1052 let a: u8 = match w {
1053 FaceWeight::Medium => 0,
1054 FaceWeight::Bold => 1,
1055 FaceWeight::Light => 2,
1056 };
1057 let b: u8 = match s {
1058 FaceSlope::Roman => 0,
1059 FaceSlope::Italic => 1,
1060 };
1061 let c: u8 = match c {
1062 FaceExpansion::Regular => 0,
1063 FaceExpansion::Condensed => 1,
1064 FaceExpansion::Extended => 2,
1065 };
1066 c * 6 + a * 2 + b
1067 }
1068 Face::Other(b) => b,
1069 }
1070 }
1071}
1072
1073/// A named TeX font metric parameter.
1074#[derive(PartialEq, Eq, Debug, Copy, Clone)]
1075#[cfg_attr(feature = "arbitrary", derive(arbitrary::Arbitrary))]
1076pub enum NamedParameter {
1077 Slant,
1078 Space,
1079 Stretch,
1080 Shrink,
1081 XHeight,
1082 Quad,
1083 ExtraSpace,
1084 Num1,
1085 Num2,
1086 Num3,
1087 Denom1,
1088 Denom2,
1089 Sup1,
1090 Sup2,
1091 Sup3,
1092 Sub1,
1093 Sub2,
1094 SupDrop,
1095 SubDrop,
1096 Delim1,
1097 Delim2,
1098 AxisHeight,
1099 DefaultRuleThickness,
1100 BigOpSpacing1,
1101 BigOpSpacing2,
1102 BigOpSpacing3,
1103 BigOpSpacing4,
1104 BigOpSpacing5,
1105}
1106
1107impl NamedParameter {
1108 pub fn number(&self) -> u8 {
1109 match self {
1110 NamedParameter::Slant => 1,
1111 NamedParameter::Space => 2,
1112 NamedParameter::Stretch => 3,
1113 NamedParameter::Shrink => 4,
1114 NamedParameter::XHeight => 5,
1115 NamedParameter::Quad => 6,
1116 NamedParameter::ExtraSpace => 7,
1117 NamedParameter::Num1 => 8,
1118 NamedParameter::Num2 => 9,
1119 NamedParameter::Num3 => 10,
1120 NamedParameter::Denom1 => 11,
1121 NamedParameter::Denom2 => 12,
1122 NamedParameter::Sup1 => 13,
1123 NamedParameter::Sup2 => 14,
1124 NamedParameter::Sup3 => 15,
1125 NamedParameter::Sub1 => 16,
1126 NamedParameter::Sub2 => 17,
1127 NamedParameter::SupDrop => 18,
1128 NamedParameter::SubDrop => 19,
1129 NamedParameter::Delim1 => 20,
1130 NamedParameter::Delim2 => 21,
1131 NamedParameter::AxisHeight => 22,
1132 NamedParameter::DefaultRuleThickness => 8,
1133 NamedParameter::BigOpSpacing1 => 9,
1134 NamedParameter::BigOpSpacing2 => 10,
1135 NamedParameter::BigOpSpacing3 => 11,
1136 NamedParameter::BigOpSpacing4 => 12,
1137 NamedParameter::BigOpSpacing5 => 13,
1138 }
1139 }
1140}
1141
1142/// Warning from the compilation of "next larger character" instructions.
1143#[derive(PartialEq, Eq, Debug, Clone)]
1144pub enum NextLargerProgramWarning {
1145 NonExistentCharacter { original: Char, next_larger: Char },
1146 InfiniteLoop { original: Char, next_larger: Char },
1147}
1148
1149impl NextLargerProgramWarning {
1150 pub fn bad_char(&self) -> Char {
1151 match self {
1152 NextLargerProgramWarning::NonExistentCharacter {
1153 original,
1154 next_larger: _,
1155 } => *original,
1156 NextLargerProgramWarning::InfiniteLoop { original, .. } => *original,
1157 }
1158 }
1159
1160 /// Returns the warning message the TFtoPL program prints for this kind of error.
1161 pub fn tftopl_message(&self) -> String {
1162 match self {
1163 NextLargerProgramWarning::NonExistentCharacter {
1164 original: _,
1165 next_larger,
1166 } => {
1167 format![
1168 "Bad TFM file: Character list link to nonexistent character '{:03o}.",
1169 next_larger.0
1170 ]
1171 }
1172 NextLargerProgramWarning::InfiniteLoop { original, .. } => {
1173 format!["Bad TFM file: Cycle in a character list!\nCharacter '{:03o} now ends the list.", original.0]
1174 }
1175 }
1176 }
1177
1178 /// Returns the section in Knuth's TFtoPL (version 2014) in which this warning occurs.
1179 pub fn tftopl_section(&self) -> u8 {
1180 84
1181 }
1182
1183 /// Returns the section in Knuth's PLtoTF (version 2014) in which this warning occurs.
1184 pub fn pltotf_section(&self) -> u8 {
1185 match self {
1186 NextLargerProgramWarning::NonExistentCharacter { .. } => 111,
1187 NextLargerProgramWarning::InfiniteLoop { .. } => 113,
1188 }
1189 }
1190
1191 /// Returns the warning message the PLtoTF program prints for this kind of error.
1192 pub fn pltotf_message(&self) -> String {
1193 match self {
1194 NextLargerProgramWarning::NonExistentCharacter {
1195 original,
1196 next_larger: _,
1197 } => {
1198 format![
1199 "The character NEXTLARGER than '{:03o} had no CHARACTER spec.",
1200 original.0
1201 ]
1202 }
1203 NextLargerProgramWarning::InfiniteLoop { original, .. } => {
1204 format![
1205 "A cycle of NEXTLARGER characters has been broken at '{:03o}.",
1206 original.0
1207 ]
1208 }
1209 }
1210 }
1211}
1212/// Compiled program of "next larger character" instructions
1213///
1214/// The .tfm file format can associate a "next larger" character to any character in a font.
1215/// Next larger characters form sequences: i.e. B can be the next larger character for A,
1216/// and C can be the next larger character for B,
1217/// leading to the sequences A-B-C.
1218/// These next larger characters are used at certain points in TeX.
1219/// TeX occasionally traverses the entire sequence for a given starting character (e.g. A).
1220///
1221/// As with ligatures, next larger specifications can contain infinite loops -
1222/// e.g, if X is the next larger character for Y
1223/// and Y is the next larger character for X.
1224/// These loops are invalid and removed by TFtoPL and PLtoTF.
1225///
1226/// Motivated by the idea of "parse don't validate", this type represents
1227/// a compiled version of the next larger specifications in which infinite loops
1228/// are statically guaranteed not to exist.
1229///
1230/// The basic use of a valid program looks like this:
1231///
1232/// ```
1233/// # use tfm::*;
1234/// let edges = vec![
1235/// (Char::A, Char::B),
1236/// (Char::B, Char::C),
1237/// ];
1238/// let (next_larger_program, warnings) = NextLargerProgram::new(edges.into_iter(), |_| true, true);
1239///
1240/// assert_eq![warnings, vec![]];
1241///
1242/// let sequence_A: Vec<Char> = next_larger_program.get(Char::A).collect();
1243/// assert_eq!(sequence_A, vec![Char::B, Char::C]);
1244///
1245/// let sequence_B: Vec<Char> = next_larger_program.get(Char::B).collect();
1246/// assert_eq!(sequence_B, vec![Char::C]);
1247///
1248/// let sequence_C: Vec<Char> = next_larger_program.get(Char::C).collect();
1249/// assert_eq!(sequence_C, vec![]);
1250///
1251/// // Character that is not in the program.
1252/// let sequence_D: Vec<Char> = next_larger_program.get(Char::D).collect();
1253/// assert_eq!(sequence_D, vec![]);
1254/// ```
1255///
1256/// ## Warnings
1257///
1258/// There are two types of error that can occur when constructing the next
1259/// larger program.
1260/// Both of these errors are handled gracefully, so we officially refer to them as warnings.
1261/// The constructor returns them as values of type [`NextLargerProgramWarning`].
1262///
1263/// ### Infinite loops
1264///
1265/// The first error is that next larger programs can contain infinite loops -
1266/// e.g, if X is the next larger character for Y
1267/// and Y is the next larger character for X.
1268/// In this case the loop is broken by removing the next larger program for the
1269/// character with the largest 8-bit code, in this case Y.
1270/// A [`NextLargerProgramWarning::InfiniteLoop`] warning is returned
1271/// from the program constructor.
1272///
1273/// ```
1274/// # use tfm::*;
1275/// let edges = vec![
1276/// (Char::X, Char::Y),
1277/// (Char::Y, Char::X),
1278/// ];
1279/// let (next_larger_program, warnings) = NextLargerProgram::new(edges.into_iter(), |_| true, true);
1280///
1281/// assert_eq!(warnings, vec![NextLargerProgramWarning::InfiniteLoop{
1282/// original: Char::Y,
1283/// next_larger: Char::X,
1284/// }]);
1285///
1286/// let sequence_X: Vec<Char> = next_larger_program.get(Char::X).collect();
1287/// assert_eq!(sequence_X, vec![Char::Y]);
1288///
1289/// let sequence_Y: Vec<Char> = next_larger_program.get(Char::Y).collect();
1290/// assert_eq!(sequence_Y, vec![]);
1291/// ```
1292///
1293/// ### Non-existent characters
1294///
1295/// The second error is that characters referred to in the next larger program
1296/// may not be defined in the .tfm or .pl file.
1297/// For example, a .pl file may contain the snippet `(CHARACTER C X (NEXTLARGER C Y))`
1298/// without defining the character Y.
1299/// The constructor [`NextLargerProgram::new`] accepts a function for checking if a
1300/// character exists.
1301/// In all cases a [`NextLargerProgramWarning::NonExistentCharacter`] warning is returned
1302/// if a non-existent character is encountered.
1303///
1304/// The behavior of the resulting next larger program is configured using the
1305/// `drop_non_existent_characters` argument.
1306/// If this is false, then the behavior is the same as PLtoTF and the program still
1307/// contains the character.
1308///
1309/// ```
1310/// # use tfm::*;
1311/// let edges = vec![
1312/// (Char::X, Char::Y),
1313/// ];
1314/// let character_exists = |c| {
1315/// if c == Char::Y {
1316/// false
1317/// } else {
1318/// true
1319/// }
1320/// };
1321/// let (next_larger_program, warnings) = NextLargerProgram::new(edges.into_iter(), character_exists, false);
1322///
1323/// assert_eq!(warnings, vec![NextLargerProgramWarning::NonExistentCharacter{
1324/// original: Char::X,
1325/// next_larger: Char::Y,
1326/// }]);
1327///
1328/// let sequence_X: Vec<Char> = next_larger_program.get(Char::X).collect();
1329/// assert_eq!(sequence_X, vec![Char::Y]);
1330/// ```
1331///
1332/// If `drop_non_existent_characters` is true, next larger instructions pointing at non-existent
1333/// characters are dropped.
1334/// This is how TFtoPL behaves.
1335///
1336///
1337/// ```
1338/// # use tfm::*;
1339/// let edges = vec![
1340/// (Char::X, Char::Y),
1341/// ];
1342/// let character_exists = |c| {
1343/// if c == Char::Y {
1344/// false
1345/// } else {
1346/// true
1347/// }
1348/// };
1349/// let (next_larger_program, warnings) = NextLargerProgram::new(edges.into_iter(), character_exists, true);
1350///
1351/// assert_eq!(warnings, vec![NextLargerProgramWarning::NonExistentCharacter{
1352/// original: Char::X,
1353/// next_larger: Char::Y,
1354/// }]);
1355///
1356/// let sequence_X: Vec<Char> = next_larger_program.get(Char::X).collect();
1357/// assert_eq!(sequence_X, vec![]);
1358/// ```
1359///
1360#[derive(Clone, Debug)]
1361pub struct NextLargerProgram {
1362 entrypoints: HashMap<Char, u8>,
1363 next_larger: Vec<(Char, NonZeroU8)>,
1364}
1365
1366impl NextLargerProgram {
1367 /// Build a new next larger program from an iterator over edges.
1368 pub fn new<I: Iterator<Item = (Char, Char)>, F: Fn(Char) -> bool>(
1369 edges: I,
1370 character_exists: F,
1371 drop_non_existent_characters: bool,
1372 ) -> (Self, Vec<NextLargerProgramWarning>) {
1373 // This function implements functionality in TFtoPL.2014.84 and PLtoTF.2014.{110,111,113}.
1374 let mut warnings: Vec<NextLargerProgramWarning> = vec![];
1375
1376 let mut node_to_larger = HashMap::<Char, Char>::new();
1377 let mut node_to_num_smaller = HashMap::<Char, usize>::new();
1378 for (smaller, larger) in edges {
1379 if !character_exists(larger) {
1380 warnings.push(NextLargerProgramWarning::NonExistentCharacter {
1381 original: smaller,
1382 next_larger: larger,
1383 });
1384 if drop_non_existent_characters {
1385 continue;
1386 }
1387 }
1388 node_to_larger.insert(smaller, larger);
1389 node_to_num_smaller.entry(smaller).or_default();
1390 *node_to_num_smaller.entry(larger).or_default() += 1;
1391 }
1392
1393 let mut leaves: Vec<Char> = vec![];
1394 let mut non_leaves: BTreeSet<Char> = Default::default();
1395 for (node, num_smaller) in &node_to_num_smaller {
1396 if *num_smaller == 0 {
1397 leaves.push(*node);
1398 } else {
1399 non_leaves.insert(*node);
1400 }
1401 }
1402
1403 let mut sorted_chars = vec![];
1404 let mut infinite_loop_warnings = vec![];
1405 loop {
1406 while let Some(smaller) = leaves.pop() {
1407 if let Some(larger) = node_to_larger.get(&smaller).copied() {
1408 let num_smaller = node_to_num_smaller
1409 .get_mut(&larger)
1410 .expect("`node_to_num_smaller` contains all nodes");
1411 *num_smaller = num_smaller
1412 .checked_sub(1)
1413 .expect("the larger of a smaller must have at least one smaller");
1414 if *num_smaller == 0 {
1415 leaves.push(larger);
1416 non_leaves.remove(&larger);
1417 }
1418 }
1419 sorted_chars.push(smaller);
1420 }
1421
1422 // There could be pending nodes left because of a cycle.
1423 // We break the cycle at the largest node.
1424 let smaller = match non_leaves.last() {
1425 None => break,
1426 Some(child) => *child,
1427 };
1428 let larger = node_to_larger.remove(&smaller).expect(
1429 "General graph fact: sum_(node)#in_edges(node)=sum_(node)#out_edges(node).
1430 Fact about the next larger graph: #out_edges(node)<=1, because each char has at most one next larger char.
1431 If #out_edge(child)=0 then sum_(node)#in_edges(node)=sum_(node)#out_edges(node) < #nodes.
1432 Thus there exists another node with #in_edges(node)=0 and that node is a leaf.
1433 But `leaves.len()=0` at this line of code",
1434 );
1435 infinite_loop_warnings.push(NextLargerProgramWarning::InfiniteLoop {
1436 original: smaller,
1437 next_larger: larger,
1438 });
1439 leaves.push(larger);
1440 non_leaves.remove(&larger);
1441 }
1442 warnings.extend(infinite_loop_warnings.into_iter().rev());
1443
1444 let next_larger = {
1445 let parents: HashSet<Char> = node_to_larger.values().copied().collect();
1446 let mut node_to_position = HashMap::<Char, u8>::new();
1447 let mut next_larger: Vec<(Char, NonZeroU8)> = vec![];
1448
1449 for c in sorted_chars.iter().rev() {
1450 if !parents.contains(c) {
1451 continue;
1452 }
1453 // The present character is the parent of at least one child, aka it is the next larger
1454 // character for another character. So it needs to be in the next_larger array.
1455 let child_position: u8 = next_larger
1456 .len()
1457 .try_into()
1458 .expect("there are at most u8::MAX chars in the `next_larger` array");
1459 let offset = match node_to_larger.get(c) {
1460 // The next_larger array contains at most 256 elements: one for each char.
1461 // (Actually it contains at most 255 because one character necessarily does not
1462 // have a child node and this character does not appear in the array.)
1463 // Anyway, an offset of 256 sends the index outside the array bound, and so
1464 // subsequent calls to iterator return None.
1465 None => NonZeroU8::MAX,
1466 Some(parent) => {
1467 let parent_position = *node_to_position
1468 .get(parent)
1469 .expect("parent has already been inserted");
1470 child_position
1471 .checked_sub(parent_position)
1472 .expect("parent inserted before so its position it is strictly smaller")
1473 .try_into()
1474 .expect("parent inserted before so its position it is strictly smaller")
1475 }
1476 };
1477 next_larger.push((*c, offset));
1478 node_to_position.insert(*c, child_position);
1479 }
1480 next_larger.reverse();
1481 next_larger
1482 };
1483
1484 let entrypoints = {
1485 let node_to_position: HashMap<Char, u8> = next_larger
1486 .iter()
1487 .enumerate()
1488 .map(|(i, (c, _))| {
1489 let u: u8 = i
1490 .try_into()
1491 .expect("there are at most u8::MAX chars in the `next_larger` array");
1492 (*c, u)
1493 })
1494 .collect();
1495 let mut entrypoints = HashMap::<Char, u8>::new();
1496 for c in sorted_chars.iter().rev() {
1497 if let Some(parent) = node_to_larger.get(c) {
1498 entrypoints.insert(
1499 *c,
1500 *node_to_position
1501 .get(parent)
1502 .expect("parent has already been inserted"),
1503 );
1504 }
1505 }
1506 entrypoints
1507 };
1508
1509 (
1510 Self {
1511 entrypoints,
1512 next_larger,
1513 },
1514 warnings,
1515 )
1516 }
1517
1518 /// Get the next larger sequence for a character
1519 pub fn get(&self, c: Char) -> impl Iterator<Item = Char> + '_ {
1520 NextLargerProgramIter {
1521 current: self.entrypoints.get(&c).copied(),
1522 program: self,
1523 }
1524 }
1525
1526 /// Returns whether this program is seven-bit safe.
1527 ///
1528 /// A next larger program is seven-bit safe if the next larger sequences for
1529 /// seven-bit characters only contain seven-bit characters.
1530 /// Conversely a program is seven-bit unsafe if there is a seven-bit
1531 /// character whose next larger sequence contains a non-seven-bit character.
1532 ///
1533 /// ```
1534 /// # use tfm::*;
1535 /// let edges = vec![
1536 /// (Char(250), Char(125)),
1537 /// (Char(125), Char(126)),
1538 /// ];
1539 /// let (next_larger_program, _) = NextLargerProgram::new(edges.into_iter(), |_| true, true);
1540 /// assert_eq!(true, next_larger_program.is_seven_bit_safe());
1541 ///
1542 /// let edges = vec![
1543 /// (Char(125), Char(250)),
1544 /// ];
1545 /// let (next_larger_program, _) = NextLargerProgram::new(edges.into_iter(), |_| true, true);
1546 /// assert_eq!(false, next_larger_program.is_seven_bit_safe());
1547 /// ```
1548 pub fn is_seven_bit_safe(&self) -> bool {
1549 // For each c, we only need to check the first element in c's next larger sequence.
1550 // If there is a subsequent element d of the sequence that is seven-bit unsafe,
1551 // we will find it when considering one of d's children.
1552 // This optimization makes this function O(n), rather than worst case O(n^2).
1553 self.entrypoints
1554 .keys()
1555 .copied()
1556 .filter(Char::is_seven_bit)
1557 .filter_map(|c| self.get(c).next())
1558 .all(|c| c.is_seven_bit())
1559 }
1560}
1561
1562#[derive(Clone, Debug)]
1563struct NextLargerProgramIter<'a> {
1564 current: Option<u8>,
1565 program: &'a NextLargerProgram,
1566}
1567
1568impl<'a> Iterator for NextLargerProgramIter<'a> {
1569 type Item = Char;
1570
1571 fn next(&mut self) -> Option<Self::Item> {
1572 // Note that the iterator is statically guaranteed to terminate!
1573 // If it returns None, it has already terminated.
1574 // If it returns Some, then self.current will be incremented by a strictly
1575 // positive number.
1576 // Incrementing like this can only happen a finite number of times before overflow
1577 // occurs, and then self.current is None and the iterator is terminated.
1578 match self.current {
1579 None => None,
1580 Some(current) => match self.program.next_larger.get(current as usize) {
1581 None => None,
1582 Some((c, inc)) => {
1583 self.current = current.checked_add(inc.get());
1584 Some(*c)
1585 }
1586 },
1587 }
1588 }
1589}
1590
1591impl common::FontFormat for File {
1592 const DEFAULT_FILE_EXTENSION: &'static str = "tfm";
1593 type Error = DeserializationError;
1594
1595 fn parse(b: &[u8]) -> Result<Self, Self::Error> {
1596 File::deserialize(b).0
1597 }
1598}
1599
1600/// Fixed-width numeric type used in TFM files.
1601///
1602/// This numeric type has 11 bits for the integer part,
1603/// 20 bits for the fractional part, and a single signed bit.
1604/// The inner value is the number multiplied by 2^20.
1605/// It is called a `fix_word` in TFtoPL.
1606///
1607/// In property list files, this type is represented as a decimal number
1608/// with up to 6 digits after the decimal point.
1609/// This is a non-lossy representation
1610/// because 10^(-6) is larger than 2^(-20).
1611#[derive(Default, PartialEq, Eq, Debug, Copy, Clone, PartialOrd, Ord, Hash)]
1612#[cfg_attr(feature = "arbitrary", derive(arbitrary::Arbitrary))]
1613pub struct FixWord(pub i32);
1614
1615impl FixWord {
1616 /// Representation of the number 0 as a [FixWord].
1617 pub const ZERO: FixWord = FixWord(0);
1618
1619 /// Representation of the number 1 as a [FixWord].
1620 pub const ONE: FixWord = FixWord(1 << 20);
1621
1622 /// Returns true if the number is less than 16.0 in magnitude according to Knuth.
1623 ///
1624 /// The number +16.0 is not allowed.
1625 /// This is covered in the E2E tests.
1626 /// See `check_fix` in TFtoPL.2014.60.
1627 pub fn is_abs_less_than_16(&self) -> bool {
1628 *self >= FixWord::ONE * -16 && *self < FixWord::ONE * 16
1629 }
1630
1631 /// Calculates the scaled value of this number at the specified design size.
1632 ///
1633 /// All dimensions in a TFM file are specified as multiples of the design size
1634 /// (with the exception of the design size itself).
1635 /// The scaled value (i.e. the value is scaled points) is the TFM value multiplied
1636 /// by the design size.
1637 ///
1638 /// The raw value and the design size both have 20 binary fractional bits,
1639 /// so the product has 40 fractional bits.
1640 /// A scaled value has only 16 fractional bits, and how the truncation is performed
1641 /// can effect the result.
1642 /// The specific multiplication algorithm Knuth wrote (TeX.2021.571-572)
1643 /// truncates the design size first and then performs the multiplication
1644 /// and then truncates again.
1645 /// Additionally, the integer division in that algorithm rounds towards zero
1646 /// even for negative numbers (e.g. -1.25 rounds to -1 instead of -2).
1647 ///
1648 /// In order to ensure the result here matched Knuth's algorithm, we reimplement
1649 /// Knuth's algorithm.
1650 pub fn to_scaled(self, design_size: FixWord) -> common::Scaled {
1651 // TeX.2021.568
1652 let mut z = common::Scaled(design_size.0 / 16);
1653 // TeX.2021.572
1654 let mut alpha = 16;
1655 // Ensure that z < 2^23 sp
1656 while z.0 >= 0o40000000 {
1657 z /= 2;
1658 alpha *= 2;
1659 }
1660 let beta = 256 / alpha; // beta = 16 * z / design_size
1661 let alpha = z * alpha; // alpha = 16 * design_size
1662
1663 // TeX.2021.571 (store_scaled)
1664 let [a, b, c, d] = self.0.to_be_bytes();
1665 assert!(a == 0 || a == 255);
1666 let sw = (((z * (d as i32)) / 0o400 + (z * (c as i32))) / 0o400 + z * (b as i32)) / beta;
1667 if a == 255 {
1668 // In this case self < 0.
1669 // We have calculated sw using only the 3 least significant bytes of self,
1670 // which is equivalent to setting self=16+self at the start. We then undo
1671 // this by setting sw = sw - 16 * design_size.
1672 sw - alpha
1673 } else {
1674 sw
1675 }
1676 }
1677}
1678
1679impl std::fmt::Display for FixWord {
1680 fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
1681 // TFtoPL.2014.40-43
1682 if self.0 < 0 {
1683 write!(f, "-")?;
1684 }
1685 let integer_part = (self.0 / FixWord::ONE.0).abs();
1686 write!(f, "{integer_part}.")?;
1687 let mut fp = (self.0 % FixWord::ONE.0).abs();
1688 fp = 10 * fp + 5;
1689 let mut delta = 10;
1690 loop {
1691 if delta > 0o4_000_000 {
1692 fp = fp + 0o2_000_000 - delta / 2;
1693 }
1694 write!(f, "{}", fp / 0o4_000_000)?;
1695 fp = 10 * (fp % 0o4_000_000);
1696 delta *= 10;
1697 if fp <= delta {
1698 break;
1699 }
1700 }
1701 Ok(())
1702 }
1703}
1704
1705impl std::ops::Add<FixWord> for FixWord {
1706 type Output = FixWord;
1707 fn add(self, rhs: FixWord) -> Self::Output {
1708 FixWord(self.0 + rhs.0)
1709 }
1710}
1711impl std::ops::Sub<FixWord> for FixWord {
1712 type Output = FixWord;
1713 fn sub(self, rhs: FixWord) -> Self::Output {
1714 FixWord(self.0 - rhs.0)
1715 }
1716}
1717
1718impl std::ops::Mul<i32> for FixWord {
1719 type Output = FixWord;
1720
1721 fn mul(self, rhs: i32) -> Self::Output {
1722 FixWord(self.0 * rhs)
1723 }
1724}
1725
1726impl std::ops::Div<i32> for FixWord {
1727 type Output = FixWord;
1728
1729 fn div(self, rhs: i32) -> Self::Output {
1730 FixWord(self.0 / rhs)
1731 }
1732}
1733
1734#[cfg(test)]
1735mod tests {
1736 use super::*;
1737
1738 #[test]
1739 fn to_scaled_test() {
1740 assert_eq!(common::Scaled::ZERO, FixWord::ZERO.to_scaled(FixWord::ONE));
1741 assert_eq!(common::Scaled::ONE, FixWord::ONE.to_scaled(FixWord::ONE));
1742 }
1743
1744 fn run_compress_test(
1745 values: Vec<FixWord>,
1746 max_size: u8,
1747 want: Vec<FixWord>,
1748 want_map: Vec<u8>,
1749 ) {
1750 let (got, got_map) = compress(&values, max_size);
1751 assert_eq!(got, want);
1752 let want_map: HashMap<FixWord, NonZeroU8> = want_map
1753 .into_iter()
1754 .enumerate()
1755 .map(|(i, t)| (values[i], t.try_into().unwrap()))
1756 .collect();
1757 assert_eq!(got_map, want_map);
1758 }
1759
1760 macro_rules! compress_tests {
1761 ( $( ($name: ident, $values: expr, $max_size: expr, $want: expr, $want_map: expr, ), )+ ) => {
1762 $(
1763 #[test]
1764 fn $name () {
1765 let values = $values;
1766 let max_size = $max_size;
1767 let want = $want;
1768 let want_map = $want_map;
1769 run_compress_test(values, max_size, want, want_map);
1770 }
1771 )+
1772 };
1773 }
1774
1775 compress_tests!(
1776 (no_op_0, vec![], 1, vec![FixWord(0)], vec![],),
1777 (
1778 no_op_2,
1779 vec![FixWord::ONE * 2, FixWord::ONE],
1780 2,
1781 vec![FixWord(0), FixWord::ONE, FixWord::ONE * 2],
1782 vec![2, 1],
1783 ),
1784 (
1785 just_deduplication,
1786 vec![FixWord::ONE, FixWord::ONE],
1787 1,
1788 vec![FixWord(0), FixWord::ONE],
1789 vec![1, 1],
1790 ),
1791 (
1792 simple_compression_case,
1793 vec![FixWord::ONE, FixWord::ONE * 2],
1794 1,
1795 vec![FixWord(0), FixWord::ONE * 3 / 2],
1796 vec![1, 1],
1797 ),
1798 (
1799 simple_compression_case_2,
1800 vec![
1801 FixWord::ONE,
1802 FixWord::ONE * 2,
1803 FixWord::ONE * 200,
1804 FixWord::ONE * 201
1805 ],
1806 2,
1807 vec![FixWord(0), FixWord::ONE * 3 / 2, FixWord::ONE * 401 / 2],
1808 vec![1, 1, 2, 2],
1809 ),
1810 (
1811 lower_upper_close_edge_case_1,
1812 vec![FixWord(1), FixWord(3)],
1813 1,
1814 vec![FixWord(0), FixWord(2)],
1815 vec![1, 1],
1816 ),
1817 (
1818 lower_upper_close_edge_case_2,
1819 vec![FixWord(0), FixWord(2)],
1820 1,
1821 vec![FixWord(0), FixWord(1)],
1822 vec![1, 1],
1823 ),
1824 (
1825 lower_upper_close_edge_case_3,
1826 vec![FixWord(1), FixWord(4)],
1827 1,
1828 vec![FixWord(0), FixWord(2)],
1829 vec![1, 1],
1830 ),
1831 (
1832 lower_upper_close_edge_case_4,
1833 vec![FixWord(1), FixWord(2)],
1834 1,
1835 vec![FixWord(0), FixWord(1)],
1836 vec![1, 1],
1837 ),
1838 );
1839
1840 mod next_larger_tests {
1841 use super::*;
1842
1843 fn run(
1844 edges: Vec<(Char, Char)>,
1845 want_sequences: HashMap<Char, Vec<Char>>,
1846 want_warnings: Vec<NextLargerProgramWarning>,
1847 ) {
1848 let (program, got_warnings) = NextLargerProgram::new(edges.into_iter(), |_| true, true);
1849 assert_eq!(got_warnings, want_warnings);
1850 for u in 0..=u8::MAX {
1851 let want_sequence = want_sequences
1852 .get(&Char(u))
1853 .map(Vec::as_slice)
1854 .unwrap_or_default();
1855 let got_sequence: Vec<Char> = program.get(Char(u)).collect();
1856 assert_eq!(
1857 got_sequence, want_sequence,
1858 "got/want sequences for {u} do not match"
1859 );
1860 }
1861 }
1862
1863 fn big_infinite_loop_edges() -> Vec<(Char, Char)> {
1864 (0..=u8::MAX)
1865 .into_iter()
1866 .map(|u| (Char(u), Char(u.wrapping_add(1))))
1867 .collect()
1868 }
1869
1870 fn big_infinite_loop_sequences() -> HashMap<Char, Vec<Char>> {
1871 (0..=u8::MAX)
1872 .into_iter()
1873 .map(|u| {
1874 let v: Vec<Char> = match u.checked_add(1) {
1875 None => vec![],
1876 Some(w) => (w..=u8::MAX).into_iter().map(Char).collect(),
1877 };
1878 (Char(u), v)
1879 })
1880 .collect()
1881 }
1882
1883 macro_rules! next_larger_tests {
1884 ( $( ($name: ident, $edges: expr, $want_sequences: expr, $want_warnings: expr, ), )+ ) => {
1885 $(
1886 #[test]
1887 fn $name () {
1888 run($edges, $want_sequences, $want_warnings);
1889 }
1890 )+
1891 };
1892 }
1893
1894 next_larger_tests!(
1895 (
1896 same_node_loop,
1897 vec![(Char::A, Char::A)],
1898 HashMap::from([(Char::A, vec![])]),
1899 vec![NextLargerProgramWarning::InfiniteLoop {
1900 original: Char::A,
1901 next_larger: Char::A,
1902 }],
1903 ),
1904 (
1905 two_loops,
1906 vec![
1907 (Char::A, Char::B),
1908 (Char::B, Char::C),
1909 (Char::C, Char::B),
1910 (Char::X, Char::Y),
1911 (Char::Y, Char::Z),
1912 (Char::Z, Char::X),
1913 ],
1914 HashMap::from([
1915 (Char::A, vec![Char::B, Char::C]),
1916 (Char::B, vec![Char::C]),
1917 (Char::C, vec![]),
1918 (Char::X, vec![Char::Y, Char::Z]),
1919 (Char::Y, vec![Char::Z]),
1920 (Char::Z, vec![]),
1921 ]),
1922 vec![
1923 NextLargerProgramWarning::InfiniteLoop {
1924 original: Char::C,
1925 next_larger: Char::B,
1926 },
1927 NextLargerProgramWarning::InfiniteLoop {
1928 original: Char::Z,
1929 next_larger: Char::X,
1930 },
1931 ],
1932 ),
1933 (
1934 path_leading_to_loop,
1935 vec![(Char::A, Char::B), (Char::B, Char::C), (Char::C, Char::B),],
1936 HashMap::from([
1937 (Char::A, vec![Char::B, Char::C]),
1938 (Char::B, vec![Char::C]),
1939 (Char::C, vec![]),
1940 ]),
1941 vec![NextLargerProgramWarning::InfiniteLoop {
1942 original: Char::C,
1943 next_larger: Char::B,
1944 }],
1945 ),
1946 (
1947 big_infinite_loop,
1948 big_infinite_loop_edges(),
1949 big_infinite_loop_sequences(),
1950 vec![NextLargerProgramWarning::InfiniteLoop {
1951 original: Char(u8::MAX),
1952 next_larger: Char(0),
1953 }],
1954 ),
1955 );
1956 }
1957}