Document Standards

This page collects standards and other general information on structured documents, i.e. documents with a semantic content structure that can be transformed into a variety of display outputs.

Markup Languages

The Standard Generalized Markup Language (SGML) is the ancestor of many of today’s popular markup languages, such as HTML and XML. Conceived before the Internet and finalized as ISO 8879:1986, SGML does not have a “home page”; the above page is retained by XML Cover Pages as a historical archive.

Sadly, many links you’ll find on that page are already dead. A Guide to SGML has a few more that are still alive, including SGML and PDF – Why We Need Both and Reasons to Learn SGML. Note that the Document Type Definition (DTD) format, still in use for HTML and XML schemas, was originally designed for SGML – hence its peculiar syntax.

Most of the more recent markup-related standards, including CSS, HTML, XML, XSD and XSL, are maintained by the World Wide Web Consortium (W3C). The competing Web Hypertext Application Technology Working Group (WHATWG) promotes an HTML standard that aims to be practically useful rather than theoretically perfect; favoring theory over practice is a common failing of W3C standards.

Some document markup languages are not based on SGML, for example Unix troff and Microsoft’s Rich Text Format (RTF). Both are in continued but limited use: troff as the format for Unix documentation, RTF as an interchange format for word processors. The one non-SGML markup language that’s still widely used by human authors is LaTeX, covered in its own section below.

DocBook & DITA

DITA (Darwin Information Typing Architecture, originally created by IBM) and the older DocBook are two OASIS standards for structured documents of a usually (but not necessarily) technical nature. Essentially, DITA and DocBook are complex schemas that define a hierarchy of content tags for XML documents. Both systems resemble a simplified LaTeX without the underlying typesetting engine. You can transform documents into a printable or viewable format using free converters (typically XSLT style sheets plus Java programs) from the DITA Open Toolkit or The DocBook Project; you can use a semi-integrated wrapper around such converters like Syncro Soft Oxygen; or you can purchase a fully integrated solution like Adobe FrameMaker.
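
To make the idea of a content hierarchy that is independent of presentation more concrete, here is a minimal Python sketch. It assumes a tiny DocBook-like fragment: the element names article, title, section and para do exist in DocBook, but the fragment is not validated against the real schema. The two renderers, to_html and to_text, are hypothetical helpers written only for this illustration. Real toolchains use XSLT style sheets rather than hand-written code like this.

```python
import xml.etree.ElementTree as ET
from html import escape

# Simplified DocBook-like fragment (assumption: not validated against the real schema).
SOURCE = """
<article>
  <title>Document Standards</title>
  <section>
    <title>Markup Languages</title>
    <para>SGML is the ancestor of HTML &amp; XML.</para>
  </section>
</article>
"""


def to_html(node, depth=1):
    """Render the semantic tree as HTML; the heading level follows nesting depth."""
    parts = []
    for child in node:
        text = (child.text or "").strip()
        if child.tag == "title":
            parts.append(f"<h{depth}>{escape(text)}</h{depth}>")
        elif child.tag == "para":
            parts.append(f"<p>{escape(text)}</p>")
        elif child.tag == "section":
            parts.append(to_html(child, depth + 1))
    return "\n".join(parts)


def to_text(node, depth=1):
    """Render the same tree as plain text with underlined headings."""
    parts = []
    for child in node:
        text = (child.text or "").strip()
        if child.tag == "title":
            underline = ("=" if depth == 1 else "-") * len(text)
            parts.append(f"{text}\n{underline}")
        elif child.tag == "para":
            parts.append(text)
        elif child.tag == "section":
            parts.append(to_text(child, depth + 1))
    return "\n\n".join(parts)


# One source document, two display outputs.
root = ET.fromstring(SOURCE.strip())
html_output = to_html(root)
text_output = to_text(root)
print(html_output)
print()
print(text_output)
```

The source never says how a title should look. Each renderer decides that on its own, which is the property that lets one DITA or DocBook document feed several output formats.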

Ironically, getting good documentation on these document standards is quite difficult. I managed to get three printed books on DITA, and two of them were useless (see below about the third). FrameMaker comes with some documentation, but it naturally focuses on the integration of the standards with FrameMaker. So you will have to rely on the official reference documentation. For DITA, download the DITA Architectural Specification from the OASIS Standards repository and check out DITA XML.org. For DocBook, download DocBook: The Definitive Guide and perhaps DocBook XSL: The Complete Guide.

Scriptorium

Scriptorium Publishing Services provide a vast amount of information on using XML-based structured documents in technical writing. Tony Self’s DITA Style Guide offers a good concise overview and usage guidelines for DITA 1.2 – it’s the one DITA book I can recommend. Here are some other Scriptorium links of note, in alphabetical order:

TeX & LaTeX

The most important non-SGML markup language originated with Donald Knuth’s TeX typesetting system. Designed in 1978, TeX predates modern standards and accordingly comes with its own font creation facility (Metafont), device-independent page description format (DVI), and markup commands for non-ASCII characters. These features are largely obsolete today. Any decent TeX system should support Unicode input, Adobe PDF output, and modern font formats (Type 1, TrueType, OpenType). Far from obsolete, TeX’s peerless mathematical typesetting mode handles virtually any conceivable mathematical expression with aplomb.

TeX does have two shortcomings: it lacks semantic markup, and it’s rather incomprehensible. So is Knuth’s original user manual, The TeXbook (Addison-Wesley 1986). Both issues were admirably solved by Leslie Lamport’s LaTeX, an extension package that was quickly adopted by most TeX authors. Lamport also wrote an excellent user’s guide, LaTeX: A Document Preparation System (2nd ed., Addison-Wesley 1994).

LaTeX defines structural elements such as chapters and sections, manages book lists such as bibliographies and indices, and overall makes TeX much more accessible. Over the years, a growing user community organized in the TeX Users Group (TUG) and the Comprehensive TeX Archive Network (CTAN) has produced several free TeX implementations and a wealth of LaTeX extension packages. Stefan Kottwitz’s TeXblog keeps track of new releases. See LaTeX Typesetting with MiKTeX for my personal recommendations and customizations.

DITA or LaTeX?

DITA versus LaTeX may seem like an odd choice, but I’ve seen people wondering if they should abandon LaTeX for DITA since DITA is a modern XML-based industry standard. Conversely, authors using or interested in DITA may not even be aware that LaTeX can be a realistic option. Personally, I’ve opted for LaTeX, but the decision really depends on your required output formats.

LaTeX is based on a sophisticated print-optimized typesetting engine. Modern implementations have capabilities that rival specialized DTP software, such as automatically using the ligatures and real small caps of OpenType fonts. This is fantastic for PDF creation since you get a decent print-quality page layout by default, which you can further refine by embedding arbitrary TeX commands. Compared to XML-based markup, LaTeX generally produces superior PDF output with less effort. Moreover, it has better support for mathematical typesetting and book lists.

The downside is that output formats which don’t replicate printed pages, in particular HTML, ignore most of LaTeX’s power. That leaves you with a lot of unnecessary complexity, and with a command syntax that’s difficult to translate into SGML-derived markup. Conversely, DITA markup lacks typographic control but allows recombining the same document fragments into different output hierarchies. So if your primary output format is HTML, or if you need to dynamically select and assemble your documents, DITA is a better choice.
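
The recombination of fragments mentioned above can be sketched in a few lines of Python. Everything here is a hypothetical illustration: the topic ids, the TOPICS store, the two maps and the assemble function are invented for this example. Real DITA expresses the same idea with map files whose topicref elements point to separate topic files.

```python
# Hypothetical topic store: each topic is a self-contained fragment keyed by id.
TOPICS = {
    "install": "Installing the product",
    "configure": "Configuring the product",
    "troubleshoot": "Troubleshooting",
}

# Two hypothetical maps. Each entry is (topic id, list of child entries),
# so the same topics can be arranged into different hierarchies.
USER_GUIDE = [("install", [("configure", [])]), ("troubleshoot", [])]
QUICK_START = [("install", []), ("configure", [])]


def assemble(topic_map, topics, depth=0):
    """Return an indented outline by resolving each map entry to its topic."""
    lines = []
    for topic_id, children in topic_map:
        lines.append("  " * depth + topics[topic_id])
        lines.extend(assemble(children, topics, depth + 1))
    return lines


user_guide = assemble(USER_GUIDE, TOPICS)
quick_start = assemble(QUICK_START, TOPICS)
print("\n".join(user_guide))
print()
print("\n".join(quick_start))
```

The topics are written once and know nothing about where they will appear. Only the maps differ between the two deliverables, which is hard to reproduce in a LaTeX document whose structure is fixed by the order of its sectioning commands.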