Paul DuBois
dubois@primate.wisc.edu
Wisconsin Regional Primate Research Center
Revision date: 7 March 1997
This document contains miscellaneous observations about how troffcvt
behaves, and tries to document its limitations. It should be read
by anyone trying to write a postprocessor for troffcvt
output.
troffcvt supports the full troff language, aside
from some specific exceptions noted below. These are discussed
in sections numbered in parallel with Ossanna's troff manual.
Many are related to insignificant or obscure features of the language
(e.g., .fl, .pm). Some are more significant (e.g.,
diversion mishandling). In some sense these exceptions form the
troffcvt bug list.
troffcvt supports a limited subset of the groff
extensions to standard troff. In general, you should assume
that any particular groff extension is not supported by
troffcvt, but there are some important exceptions such
as aliases and long names. For more details, see the document
troffcvt Support for groff.
The most general and pervasive exception to standard troff
processing is that troffcvt knows nothing about the characteristics
of any output device; in particular, it uses no font metric information.
This means it doesn't know how wide or tall any character is.
This exception is pervasive in that it affects handling of a number
of requests and other aspects of the language. Some of the implications
are:
The numbering of the sections that follow corresponds to the section
numbering in the Ossanna troff manual, to make it easier
to determine where troffcvt bugs affect requests listed
in a given section of the Ossanna manual.
The default resolution used by troffcvt is 432 units/inch,
but it may be changed with the -r option. (A good value
might be the least common multiple of 72 and the resolution you
use in the target format.)
Since resolution is not fixed, postprocessors should use the value
specified on the \resolution line that appears as the first
line of the setup section. It indicates the number of basic units
per inch. Numeric values on following control lines that are specified
in basic units can be converted to other units as necessary using
this resolution.
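As a rough illustration of the two points above, here is a minimal Python sketch of a postprocessor's unit handling. The function names are hypothetical, and the sketch assumes the resolution is the second whitespace-separated field on the \resolution line; check that against real troffcvt output before relying on it.

```python
from fractions import Fraction
from math import gcd


def suggested_resolution(target_units_per_inch):
    """Least common multiple of 72 (points per inch) and the target resolution."""
    return 72 * target_units_per_inch // gcd(72, target_units_per_inch)


def parse_resolution(line):
    """Return the units-per-inch value from a \\resolution control line.

    Assumption: the line looks like "\\resolution N".
    """
    fields = line.split()
    if len(fields) < 2 or fields[0] != r"\resolution":
        raise ValueError("not a \\resolution line: %r" % line)
    return int(fields[1])


def basic_units_to_points(value, resolution):
    """Convert a value in basic units to points (exact, as a Fraction)."""
    return Fraction(value * 72, resolution)


def basic_units_to_inches(value, resolution):
    """Convert a value in basic units to inches (exact, as a Fraction)."""
    return Fraction(value, resolution)
```

Using exact fractions avoids rounding until the value is finally written in the target format's own units.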
The default scaling for unscaled numbers in troff requests
is not hardwired into troffcvt. Instead, scaling is specified
in the action file, although it's a good idea to use the same
default there that troff uses:
    req sp parse-num v eol break space $1      good
    req sp parse-num i eol break space $1      bad

Expressions that involve calculation of "amount of motion to reach
an absolute position" (as in, e.g., |3.2c) evaluate to zero. Since
the current position is unknown, the distance to any other position
cannot be determined. This affects processing of tbl output
particularly, since tbl is fond of using \h´|N´ to line up columns.
Fonts R, I and B are initially mounted on positions 1, 2 and 3,
respectively, and the special font is mounted on all other positions.
This means fonts R and 1, I and 2,
etc., are considered equivalent. If a different font is mounted
on a given position, references to that font, either by the name
or number, are considered equivalent. This is logical to me, although
in fact it doesn't reflect the behavior of all troff versions.
For instance, xroff does not necessarily consider R
and 1 equivalent unless you mess around with its font map.
The .f register is set to the number of the current font,
or zero if the current font is not mounted (it is allowable to
refer to a font simply by naming it, so the current font doesn't
necessarily have any number). .ft 0 and \f0 are
taken as referring to this font.
Font changes are written by name (not number) in the form \font
name, where name must be interpreted by the postprocessor.
This is a difficult problem since font names tend to be site-specific
and idiosyncratic, although the standard troffcvt file
reader provides some simple font handling support that might be
useful.
Output from the .bd request typically appears as the \embolden
and \embolden-special control lines. It's not clear whether
it's worth it for postprocessors to support this request,
particularly the special-font variant. Although one can switch
to the special font explicitly (.ft S, \fS), characters
from the special font are also logically part of the other default
fonts, and thus referenced for particular characters even if S
is not the current font. To fully support special font bolding,
you'd need to keep track of all the characters in the special
font and check every output character to see if it needs to come
from that font. Besides, this whole business of the relationship
between the special font and other fonts seems tightly linked
to the particular typesetting machinery used when troff
was originally written.
Any positive character size is allowed. For historical reasons,
embedded absolute size changes may be one or two digits up to
a size of 36, i.e., \s36 is the same as .ps 36 while
\s37 is the same as .ps 3 followed by "7".
Non-numeric input following \s is interpreted the same
way as \s0.
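The digit rule above can be sketched in Python as follows. This is only an illustration of the rule as stated here, not troffcvt's actual code. The function name is hypothetical, signed (relative) size changes are not handled, and it is an assumption that non-numeric input is left unconsumed.

```python
DIGITS = "0123456789"


def parse_absolute_size(text):
    """Split the text following \\s into (size, remaining_text).

    Two digits are taken as the size only if they form a number
    no greater than 36; otherwise a single digit is taken.
    """
    if not text or text[0] not in DIGITS:
        # Non-numeric input: treated like \s0.
        # Assumption: the offending character is not consumed.
        return 0, text
    if len(text) > 1 and text[1] in DIGITS and int(text[:2]) <= 36:
        return int(text[:2]), text[2:]
    return int(text[0]), text[1:]
```

So the text after \s in "\s36" yields size 36 with nothing left over, while "\s37" yields size 3 with "7" left as ordinary text.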
For the .cs request, only the font name is written out
on the \constant-width line; the width in which the characters
are to be written is currently ignored.
Since the current page number cannot be reliably determined, .bp
and .pn requests which specify a relative page number change
are not reliable.
My troff Summary and Index indicates that Vs
are the default scaling unit for .bp and .po requests.
The actions file supplied with the troffcvt distribution
tells troffcvt to ignore scaling for .bp and to
use ems for .po, which seems to make more sense.
.mk and .rt are not supported.
troff tosses extra spaces at the end of text lines. troffcvt
tries to do the same but gets confused by sequences such as "abc\fI
\fP". The trailing spaces are retained in the output, erroneously.
Use of the .j register as the argument to the .ad
request is allowed. Note: This depends on all the internal adjustment
mode type values being single-digit non-negative integers so that
the argument can be parsed by the parse-char action. The
internal codes are not necessarily the same as those used by any
particular version of troff. (The codes are known not
to be the same as those assumed by tbl, but I'm not sure
exactly what tbl assumes.)
No hyphenating is done; that is left for the postprocessor. The
optional hyphenation character appears as @opthyphen in
the output.
The .n, nl and .h registers are not set.
\p appears in the output as \break-spread.
In all the CFA (center, fill, adjust) modes, text interruption
in the input (\c) is processed such that the next text
line appears to be logically glued to the current one. The resulting
logical line counts as a single input line. (Actually, this appears
to be only sometimes true, e.g., for .ce, but not, evidently,
for .ul or .it. Huh.)
Text interruption in the input will not appear explicitly in the
output and thus is of no importance for postprocessors. \c
is manifest in troffcvt output merely as an absence of
a leading space on the next text output line. Example:
    Input 1     Input 2
    abc         abc\c
    def         def

    Output 1    Output 2
    abc         abc
     def        def

Postprocessors would write these out as "abc def" and "abcdef",
respectively.
The .a register is not set.
.sv, .os, .ns, .rs are not supported.
The troff manual doesn't say it, but .rm allows
multiple names to be specified for removal on a single request.
troffcvt does, too.
The troff manual doesn't say that you can invoke macros
as strings, either, but you can. troff prints "abc"
when given the following input:
    .de xx
    abc
    ..
    \*(xx

You can also invoke a string as though it is a macro (i.e., by
uttering the string name on a line by itself with a leading dot).
The contents of the string are interpolated into the input in place
of the line on which the invocation occurs. However, since strings
have no terminating newline, the input line following this "macro"
invocation is taken as part of the same input line on which the
invocation occurs.
troffcvt treats macros and strings as essentially equivalent.
The primary difference is that strings don't have arguments.
The copy mode mechanism doesn't care how long strings are.
Diversions are "supported" in a poor way that probably should
be changed. Diversion output isn't saved and just goes to stdout
like everything else. Output for diversion xx is bracketed
by \diversion-begin xx and \diversion-end
xx for .di or by \diversion-append xx
and \diversion-end xx for .da. Diversion
output may be nested, which is one reason support is poor. (It
puts the burden on the postprocessor to unnest them.)
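To give an idea of what that unnesting might look like, here is a minimal Python sketch that separates diversion output from the main stream with a stack. The function name is hypothetical. The sketch assumes each control line has the keyword and diversion name as its first two whitespace-separated fields, and that \diversion-begin starts a diversion afresh while \diversion-append adds to whatever was collected earlier.

```python
def unnest_diversions(lines):
    """Split output lines into (main_lines, diversions).

    diversions maps each diversion name to the list of lines
    collected for it; nested diversions are kept separate.
    """
    main = []
    diversions = {}
    stack = []  # names of the diversions currently open, innermost last
    for line in lines:
        fields = line.split()
        keyword = fields[0] if fields else ""
        name = fields[1] if len(fields) > 1 else ""
        if keyword == r"\diversion-begin":
            diversions[name] = []          # .di: start over
            stack.append(name)
        elif keyword == r"\diversion-append":
            diversions.setdefault(name, [])  # .da: keep earlier contents
            stack.append(name)
        elif keyword == r"\diversion-end":
            if stack:
                stack.pop()
        elif stack:
            diversions[stack[-1]].append(line)
        else:
            main.append(line)
    return main, diversions
```

What to do with the collected diversions afterward is a separate problem, for the reasons discussed in this section.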
Diversion output is not saved in a macro body, because diversions
are often linked to position traps and thus might never be called.
Since that would lose the output completely, I judged it better
to interpolate the diversion into the output at the point at which
it is created. The down side is that for diversions which are
invoked explicitly, the diversion doesn't appear where it should.
Possibly diversion output should be saved in temporary files and
written to the output when the diversion is done. But the question
is: when is a diversion "done"? (There may be a .da
later in the input.)
The .d, .h, .t, dn and dl registers
are not set. The .z register is the name of the
current diversion, not a numeric value. Its value is empty
if no diversion is currently active, otherwise the current diversion
name is interpolated into the output.
Position and diversion traps (.wh, .ch, .dt)
are not supported. troffcvt ought at least to write out
some of the information for these requests so that postprocessors
could try to use it if they wanted.
The input line trap (.it) is supported.
The troff manual doesn't say it, but .rr allows
multiple registers to be specified for removal on a single request.
troffcvt does, too.
The manual also doesn't say that if the increment or format arguments
are missing, and the register already exists, the existing increment
and format carry into the new definition. In troffcvt,
only the increment carries through, since formats are broken (see
below).
You cannot set, rename, remove or change the format of read-only
registers.
The number register formats i, I, a and A
are broken. These all print in the default format. Formats 01,
001, etc. are not parsed correctly either, yet.
The ct, dl, dn, hp, ln, nl,
sb, and st registers are not supported.
The value of the % register is unreliable, since the "current
page number" is unknown.
The .A, .T, .a, .d, .h, .n,
.t, .x, and .y registers are not supported.
The .w register is always set to 1 en, since troffcvt
calculates widths of strings by assuming that all characters are
1 en wide. (See §11.)
The .z register is anomalous, since it's not really a number;
see notes for §7.4. (This isn't a troffcvt bug; troff
treats .z specially, too.)
References to non-existent or unsupported registers are interpolated
as "0" (zero).
Tab and leader characters appear as @tab and @leader
in the output.
.ta with no arguments is written as \reset-tabs.
The postprocessor should reset tab settings to "every half-inch".
If explicit settings are given, the first one is written as \first-tab
position type and all following as \next-tab
position type.
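A postprocessor might track tab settings along these lines. This Python sketch rests on several assumptions that are not stated above: that position is an integer in basic units, that type is a single token, that default stops are left-adjusted (written "l" here), and that 16 default stops are enough. All function names are hypothetical.

```python
def default_tabs(resolution, count=16):
    """Tab stops every half inch, as (position, type) pairs.

    Assumptions: positions are in basic units, the stops are
    left-adjusted ("l"), and `count` stops are sufficient.
    """
    half_inch = resolution // 2
    return [(half_inch * i, "l") for i in range(1, count + 1)]


def apply_tab_line(tabs, line, resolution):
    """Return the tab list that results from one control line."""
    fields = line.split()
    if not fields:
        return tabs
    if fields[0] == r"\reset-tabs":
        return default_tabs(resolution)
    if fields[0] == r"\first-tab":
        # The first explicit setting replaces all previous stops.
        return [(int(fields[1]), fields[2])]
    if fields[0] == r"\next-tab":
        return tabs + [(int(fields[1]), fields[2])]
    return tabs  # not a tab-related line
```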
Field delimiter characters are written as @fieldbegin or
@fieldend, depending on whether they begin or end a field.
Field padding characters are written as @fieldpad when
the character occurs between pairs of field delimiter characters
(otherwise it is deleted, which may or may not be correct).
STX, ETX, ENQ, ACK, BEL, SO, SI and ESC are not treated specially.
You deserve what you get if you have them in your input files.
So there.
Ligature mode as set by .lg is not supported. The special
characters \(ff, \(fi, \(fl, \(Fi
and \(Fl normally should be defined in the action file
to write out @ff, @fi, @fl, @ffi and
@ffl, and postprocessors should be trained to recognize
these sequences.
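One way a postprocessor could handle these sequences is to map them to the Unicode ligature characters, as in the Python sketch below. It assumes the sequences can be found by simple pattern matching within a string of output text, which may not match how a given postprocessor reads troffcvt output. The names are hypothetical.

```python
import re

# Output sequence -> Unicode ligature character.
LIGATURES = {
    "@ff": "\ufb00",
    "@fi": "\ufb01",
    "@fl": "\ufb02",
    "@ffi": "\ufb03",
    "@ffl": "\ufb04",
}

# Try longer names first so "@ffi" is not mistaken for "@ff" plus "i".
_LIGATURE_RE = re.compile(
    "|".join(re.escape(name) for name in sorted(LIGATURES, key=len, reverse=True))
)


def expand_ligatures(text):
    """Replace ligature sequences in text with single characters."""
    return _LIGATURE_RE.sub(lambda m: LIGATURES[m.group(0)], text)
```

A target format without ligature glyphs could map the same names to the plain letters instead.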
No motion is generated for backspace characters; they appear as
@backspace in the output.
Underlining is indicated by \underline for normal underlining
and \cunderline for continuous underlining. These are identical
in troff; postprocessors may or may not wish to consider
them so, depending on the capabilities of the target format. Underlining
(both kinds) is turned off with \nounderline.
.tr doesn't work for special characters or for escaped
characters. The output character can be anything, but the input
character must be plain text. This is legal:
    .tr x\(**

This is not:

    .tr \(**x
Transparent mode (\!) is not supported very well.
Real-life observations of behavior of troff versions: It
doesn't appear to be quite true that the rest of the line after
\! is always passed as is, at least from my observations
on groff and SunOS 4.1.1 nroff. Embedded newlines
are still processed. Comments are still stripped. If a transparent
line within a multi-line section of conditional input contains
\}, the \} is recognized and terminates the conditional input if
it is within a rejected clause. If it is within an accepted clause,
the \} appears on the transparent line.
Comments and concealed newlines are swallowed at a very low level
in the input routines, and are thus unavailable to postprocessors.
\w´string´ computes widths of strings
only to an approximation. Since character widths are unknown,
the width is computed as though all characters in the string are
1 en wide. Font and size changes are recognized but ignored,
which leads to particularly egregious errors for constructs such
as \w´\s+9\s+9\s+9X\s-9\s-9\s-9´. The ramifications
of the fact that \w yields only approximate results are
legion, since \w may be used in any expression, e.g., in
numeric arguments to requests, or in escape sequences such as
\h´N´.
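To make the size of the error concrete, here is a small Python sketch of the "every character is 1 en wide" rule. It assumes an en is half an em and an em equals the current point size, and that the string contains only plain characters (no escape sequences). The function name is hypothetical and this is not troffcvt's actual code.

```python
def approximate_width(text, point_size, resolution):
    """Width of text in basic units, counting each character as 1 en.

    1 em = point_size points, 1 en = half an em, 72 points per inch,
    so 1 en = point_size * resolution / 144 basic units.
    """
    en = point_size * resolution // 144
    return len(text) * en
```

With this rule "iii" and "WWW" come out exactly the same width, which is why anything positioned by \w should be treated as approximate.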
\b´string´ and \o´string´
are supported by writing the characters in string to the
output, sandwiched between \bracket-begin (\overstrike-begin)
and \bracket-end (\overstrike-end). Certain characters,
if present in string, are botched, such as \e.
\l´Nc´ and \L´Nc´
are supported but don't always work. In particular, if the repetition
character is "x", as in \l'10x', the "x"
is eaten as part of the expression and not recognized as the repetition
character. Certain other repetition characters aren't written
to the output correctly (same bug as for \b and \o).
\zc appears as \zero-width c in the
output.
.nh and .hy appear as \hyphenate N
in the output. The value of N should be interpreted as
indicated in the troff manual. If N is zero, hyphenation
should be turned off.
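For reference, the troff manual describes the .hy values as additive: 2 means don't hyphenate last lines, 4 means don't split off the last two characters of a word, and 8 means don't split off the first two. A postprocessor could decode N roughly as in this Python sketch (the function and key names are hypothetical):

```python
def decode_hyphenate(n):
    """Decode the N from a \\hyphenate line into separate flags."""
    return {
        "enabled": n != 0,               # zero turns hyphenation off
        "skip_last_line": bool(n & 2),   # don't hyphenate last lines
        "keep_last_two": bool(n & 4),    # don't split off last two characters
        "keep_first_two": bool(n & 8),   # don't split off first two characters
    }
```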
The current hyphenation character is recognized and appears as
@opthyphen in the output.
.hw is not supported.
Not supported, because there is no way to determine how a postprocessor
might lay out text on a page. This is especially true for tc2html:
the resulting HTML document may be reformatted dynamically whenever
a user viewing the document in a Web browser window resizes the
window.
Unfortunately, processing of conditionals (.if, .ie/.el)
is to a large extent meaningless and may introduce errors into
the output. The tests for the conditions t and n
are processed properly, but other tests may not be. For instance,
many times a conditional will test the value of the current page
number (\n%), which cannot be determined reliably.
Conditional requests are processed in a special way. Normally,
to process a request, the arguments are parsed first. Then troffcvt
scans to the end of the request line, to avoid having extraneous
junk be parsed as text or another request, and then any actions
remaining in the request's action list are executed to interpret
the request arguments.
For conditional requests, that doesn't work. A different approach
is taken. Here is how the conditional requests can be specified
in an action file:
    req if parse-condition n eol
    req ie parse-condition y eol
    req el process-condition eol

For .if and .ie, the argument is the condition to be tested, but
after parsing it the rest of the line cannot be skipped over without
losing some of the conditional input. What happens instead is that
the parse-condition action gobbles up the condition and skips any
following whitespace. If the conditional input is a single line (no
\{ present), the input-line processor is invoked once recursively,
which causes the rest of the line to be processed as though it were
a new line. The tricky part is that processing this line will involve
reading the rest of the line, including the terminating linefeed.
When the inner invocation of the line processor returns from handling
the conditional input, the outer invocation of the processor that
is handling the conditional request (i.e., the one performing the
parse-condition action), is still in its argument-parsing phase, and
still expects to skip to the end of the input line after parsing
the condition. So a fake linefeed is shoved into the input before
returning to the condition parser.
For conditional requests that are followed by multi-line input,
a mild elaboration suffices. If the conditional input begins with
\{, the current conditional level is incremented and the
input processor is called repeatedly until the level returns to
the original value. (The level is decremented by the input routine
ChIn() which simply discards the \} and returns
the next character.)
If a condition fails, the input is scanned character-by-character
until the end of the current line (for single-line conditional
input), or until a \} matching the beginning \{
is found (for multi-line input).
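The skipping of rejected input can be pictured with the following Python sketch. It is a simplified model, not troffcvt's actual code: the function name is hypothetical, it works on a string rather than an input stack, and it does not try to handle an escaped backslash in front of a brace.

```python
def skip_rejected(text, start=0):
    """Return the index just past rejected conditional input.

    If the input at `start` begins with \\{, scan to the matching \\}.
    Otherwise skip to the end of the current line.
    """
    if not text.startswith("\\{", start):
        end = text.find("\n", start)
        return len(text) if end == -1 else end + 1
    level = 0
    i = start
    while i < len(text):
        if text.startswith("\\{", i):
            level += 1
            i += 2
        elif text.startswith("\\}", i):
            level -= 1
            i += 2
            if level == 0:
                return i  # just past the matching \}
        else:
            i += 1
    return len(text)  # unterminated: everything is skipped
```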
The else part of the .ie/.el if-else construction is accepted
or rejected by remembering the value of the previous .ie.
If the .ie succeeded, the .el part is skipped, otherwise
it's processed.
It does not appear to be necessary that the .el immediately
follow .ie, so troffcvt does not require that. .el
following .if is skipped, as is .el following another
.el.
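One way to model this bookkeeping is a single flag that records whether an .el is currently entitled to run. The Python sketch below is one reading of the rules above, with hypothetical names. In particular it assumes that any .if seen since the last .ie cancels the pending .el, and it ignores nesting.

```python
class ConditionalState:
    """Tracks whether the next .el should be processed or skipped."""

    def __init__(self):
        # True only when the most recent .ie failed and no .if or .el
        # has been seen since.
        self.else_pending = False

    def on_if(self, condition):
        """Handle .if: an .el after this is skipped."""
        self.else_pending = False
        return condition

    def on_ie(self, condition):
        """Handle .ie: remember the result for a later .el."""
        self.else_pending = not condition
        return condition

    def on_el(self):
        """Handle .el: return True if its input should be processed."""
        accept = self.else_pending
        self.else_pending = False  # an .el following another .el is skipped
        return accept
```

Because the flag persists until the next conditional request, the .el does not need to follow its .ie immediately.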
Observations about troff versions (which don't really belong
here, but I'm writing them down so I don't completely forget about
them):
It's not explicit in the troff manual, but for revertible
parameters such as indent or point size, the current and
previous values are saved in the environment. troffcvt
does this, too.
.rd is not supported.
.nx doesn't properly unwind the input stack if the current
input source is not a file. The request is simply ignored after
printing an error message.
.pi is not supported.
.mc, .pm, .fl are not supported. For .fl,
this doesn't matter because output isn't buffered anyway.
Most of the stuff mentioned in the troff addendum is unimplemented.
These are irrelevant to troffcvt.
The description of the .ab request doesn't specify whether
the string argument is to be read in copy mode or not. Assuming
that it should be, .ab can be defined in the action file
as
    req ab parse-string-value n eol abort $1

The .ad, .ft and .so requests behave as described.
Other requests in this section are unsupported.
These are all unsupported.
.R and c. are supported as general registers. .R
always contains a large value, since troffcvt always assumes
it can get more memory.
$$, .L, .b, and .j are supported as
read-only registers.
.P, .k, .T are not supported.
Conditional input is treated as described.