Function tfm::format::compress

pub fn compress(
    values: &[Number],
    max_size: u8
) -> (Vec<Number>, HashMap<Number, NonZeroU8>)

Lossy compression of numbers for TFM files.

The TFM file format can only store up to 15 heights, 15 depths and 63 italic corrections. If a property list file contains, e.g., more than 15 distinct heights, something has to give. PLtoTF contains a lossy compression algorithm that takes a list of values and returns a bounded list of values approximating the original list. This algorithm is implemented in PLtoTF.2014.75-80 and re-implemented here.

How the algorithm works

For a given delta, the algorithm partitions the ordered list of values such that within each partition the maximum distance between two elements is at most delta. All of the values within a partition are then approximated by (interval_max + interval_min)/2. As such, each value may be increased or decreased by up to delta/2. After this procedure, the list of values is replaced by the list of approximations. This compresses the list of values into a list whose size is the number of intervals.
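
To make this subroutine concrete, here is a minimal Rust sketch. It is not the crate's code: it uses plain i64 values instead of Number, the name compress_with_delta is ours, and it assumes values is sorted in ascending order.

    /// Greedy partition of sorted `values` such that each interval has
    /// spread at most `delta`; each interval is then replaced by its
    /// midpoint, so no value moves by more than delta/2.
    fn compress_with_delta(values: &[i64], delta: i64) -> Vec<i64> {
        let mut result = Vec::new();
        let mut iter = values.iter().copied().peekable();
        while let Some(interval_min) = iter.next() {
            let mut interval_max = interval_min;
            // Grow the interval greedily: this maximizes the size of the
            // first partition, then the second, and so on.
            while let Some(&next) = iter.peek() {
                if next - interval_min > delta {
                    break;
                }
                interval_max = next;
                iter.next();
            }
            result.push((interval_max + interval_min) / 2);
        }
        result
    }

For [1, 4, 5] and delta = 2 this produces [1, 4]: the first interval is [1] (since 4 - 1 > 2) and the second is [4, 5], approximated by (5 + 4)/2 = 4.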

Given this subroutine, the algorithm finds the smallest delta such that the number of intervals is at most the required maximum (e.g., 15 for heights).
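
In code, this outer search is driven by a feasibility predicate that is monotone in delta. A sketch building on compress_with_delta above (is_feasible is our name, not the crate's):

    /// True if compressing with this delta meets the size bound
    /// (e.g., max_size = 15 for heights).
    fn is_feasible(values: &[i64], delta: i64, max_size: usize) -> bool {
        compress_with_delta(values, delta).len() <= max_size
    }

For example, with values [1, 4, 5] and max_size = 2, delta = 0 gives 3 intervals (infeasible) while delta = 1 gives [[1], [4, 5]] (feasible), so the smallest valid delta is 1.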

There are some important features of this algorithm to note:

  • For a given delta value there may be multiple possible partitions. The algorithm uses a greedy approach in which it maximizes the size of the first partition, then the size of the second partition, and so on.

  • Distinct delta values can yield the same partition. For example, if the initial values are [1, 4, 5], then any delta in the range [1, 3) gives the same result ([[1], [4, 5]]). Whenever we check a delta, we are really checking the whole interval of deltas that gives the same result. Both Knuth’s implementation and ours use this fact to speed up the search by reducing the search space: in the example above, after checking delta=1 we don’t check delta=2. (The sketch after this list makes the candidate set explicit.)
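
The search space can therefore be restricted to the pairwise differences of the values: the partition only changes when delta crosses one of them. A sketch of enumerating these candidates, again with i64 stand-ins and our own function name:

    /// The partition is constant between consecutive pairwise differences,
    /// so these differences are the only deltas worth checking. There are
    /// O(n^2) of them in the worst case.
    fn candidate_deltas(values: &[i64]) -> Vec<i64> {
        let mut deltas = Vec::new();
        for i in 0..values.len() {
            for j in i + 1..values.len() {
                deltas.push(values[j] - values[i]);
            }
        }
        deltas.sort_unstable();
        deltas.dedup();
        deltas
    }

For [1, 4, 5] the candidates are [1, 3, 4]: every delta in [1, 3) behaves exactly like delta = 1, which is why delta = 2 never needs to be checked separately.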

This re-implementation

The re-implementation here follows PLtoTF closely, with one modification. To find the optimal delta, Knuth first calculates the minimal possible delta. This is the minimum distance between adjacent elements in the ordered list of values. In general this will not be a valid solution because the number of intervals it generates will be too large. So, he next finds the smallest k such that 2^k * min_delta is a valid solution. He then does an upward linear search within the interval [min_delta * 2^{k-1}, min_delta * 2^k] to find the optimal delta.
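
Under those definitions, Knuth’s strategy might be sketched as follows. This is a simplification of PLtoTF’s actual logic; find_delta_knuth is our name, and the sketch assumes values is sorted with distinct elements and is infeasible at delta = 0.

    /// Exponential ranging followed by an upward linear search.
    fn find_delta_knuth(values: &[i64], max_size: usize) -> i64 {
        // The minimal possible delta: the smallest gap between adjacent
        // elements of the sorted list.
        let min_delta = values.windows(2).map(|w| w[1] - w[0]).min().unwrap();
        // Double until 2^k * min_delta is feasible.
        let mut upper = min_delta;
        while !is_feasible(values, upper, max_size) {
            upper *= 2;
        }
        // Scan upward through the candidates in (upper/2, upper]; the first
        // feasible one is the smallest feasible delta overall.
        for delta in candidate_deltas(values) {
            if delta > upper / 2 && delta <= upper && is_feasible(values, delta, max_size) {
                return delta;
            }
        }
        upper
    }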

Checking a particular delta is O(n). The worst-case running time of Knuth’s algorithm is then O(n^3), because the interval [min_delta * 2^{k-1}, min_delta * 2^k] can contain O(n^2) distinct deltas to check in the worst case.*

In the re-implementation here we observe that finding the smallest valid delta is a classic binary search problem: if delta is a valid solution then so is any larger delta, and if delta is not a valid solution then neither is any smaller delta. The re-implementation using binary search is O(n log n). Moreover, the constant factors are about the same as in Knuth’s version.
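
A sketch of the binary-search variant. For clarity it materializes the full candidate list, which on its own costs more than the O(n log n) quoted above; the point is the monotone structure that binary search exploits, not the crate’s exact implementation. The same assumptions as before apply.

    /// Binary search for the smallest feasible delta. If candidates[mid]
    /// is feasible the answer is at or below mid; otherwise it is strictly
    /// above. Each feasibility check costs O(n).
    fn find_delta_binary(values: &[i64], max_size: usize) -> i64 {
        let candidates = candidate_deltas(values);
        let (mut lo, mut hi) = (0, candidates.len() - 1);
        while lo < hi {
            let mid = lo + (hi - lo) / 2;
            if is_feasible(values, candidates[mid], max_size) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        candidates[lo]
    }

Putting the sketches together on the running example:

    fn main() {
        let values = [1, 4, 5];
        let delta = find_delta_binary(&values, 2);
        assert_eq!(delta, 1);
        assert_eq!(compress_with_delta(&values, delta), vec![1, 4]);
    }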

  * In the worst case there are O(n^2) distinct deltas, because each pair of elements yields a delta. Let m be the smallest delta and M the largest. In the initial 2^k-based ranging scheme, the largest K satisfies m * 2^{K-1} < M <= m * 2^K, i.e., K - 1 <= log_2(M/m). Thus there are K = O(1) ranges in this initial scheme, treating log_2(M/m) as a constant independent of n. By the pigeon-hole principle, there exists a k such that the range [m * 2^{k-1}, m * 2^k] contains Ω(n^2) of these deltas. In the worst case, the solution is the maximum element of this range, so the linear search performs Ω(n^2) checks.