Michel Goossens IT/ASD
A new language, XML, for transmitting structured data on the Web, is being proposed by the World Wide Web Consortium (W3C). It is expected to replace HTML in the next generation of browsers since, together with parallel developments such as MathML, CML and DSSSL, it offers greater functionality, improved information reuse, extensible markup and better presentation control.
In 1986 ISO adopted the SGML (Standard Generalised Markup Language) standard for structured document interchange (ISO 8879), but only large companies actually started to use document handling systems based on SGML. In fact, at CERN, thanks to Anders Berglund, we were one of the first Organisations in the world to use documents marked up according to an SGML DTD (Document Type Definition) in a production environment.
In late 1989 and the beginning of 1990, Tim Berners-Lee developed his hypertext browser/editor using a language which resembles SGML. (Tim recently claimed in an interview that he chose the notation of the language which was to be called HTML because people at CERN were used to angle brackets from their SGML experience.)
It was only in 1995 that HTML (Hypertext Markup Language) was formalised into an SGML DTD (HTML 2.0 (1), specification published as RFC 1866). But soon browsers (Netscape, Microsoft Internet Explorer) started to add their own (non-standard) features to that minimal DTD, and the collaborators of W3C (the WWW Consortium) also created new unofficial DTDs to cope better with various areas (maths, internationalisation, etc.). It was soon realised that portability of documents could no longer be guaranteed if information providers started using these incompatible extensions in a general way. Therefore, at the beginning of 1997, the W3C adopted the "HTML 3.2 Reference Specification" (2), which tried to strike a compromise between the needs of the various information providers.
Today, most documents on the Web are marked up and transmitted using the HTML 3.2 DTD. This HTML DTD has a simple content model, which is particularly well-suited for hypertext, multimedia, and the display of small and reasonably simple documents.
Yet HTML 3.2 is too limited to cope with the Web's many application areas (databases, search engines, optimal presentation, professional printing, data verification). Moreover, HTML has a fixed DTD, so the language cannot be extended to suit particular needs.
At the end of 1996 the W3C SGML working group started to develop XML (eXtensible Markup Language). The aim is to define a simple subset of the SGML meta-language that is particularly well tailored to use on the Internet. The formal definition of XML is especially important, since it allows documents marked up in XML to be checked and parsed easily.
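To make the point about easy checking concrete, here is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree parser. The function name is_well_formed and both sample documents are invented for this illustration. The sketch tests well-formedness only (properly nested and closed tags); it does not validate a document against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the text parses as a single well-formed XML document."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# Hypothetical sample documents, invented for this illustration.
good = "<note><to>CERN</to><body>Hello</body></note>"
bad = "<note><to>CERN<body>Hello</body></note>"  # <to> is never closed

print(is_well_formed(good))
print(is_well_formed(bad))
```

Because the syntax rules are strict, a parser can reject the second document outright instead of guessing what the author meant, as HTML browsers commonly do.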
XML differs from HTML in three main areas:
One of the basic aims of the XML effort is that the language should be easy to learn and implement while remaining as expressive as possible. Although documents marked up in HTML and XML are not directly compatible, HTML documents marked up according to the HTML 3.2 DTD should be easy to convert to XML syntax.
Currently, the first two documents on XML are available as W3C working drafts, namely,
Markup languages for other application areas, like CML (Chemical Markup Language) (9) and MathML (Mathematical Markup Language (10), see below), as well as subsets of the HyTime and TEI standards, are being defined in a way compatible with XML, to guarantee optimal interchange of information.
To control the presentation style, both on screen and on the printer, the current CSS (11) (Cascading Style Sheets) effort is well suited to the relatively low-level demands of present HTML in areas such as setting fonts, colours, white space, etc. However, to support the greatly expanded range of rendering techniques made possible by XML, the more general DSSSL (12) (Document Style Semantics and Specification Language, ISO/IEC 10179, 1996) formalism should be used. An initial specification for a subset of DSSSL, dssslo (13), to be used online in combination with XML, is being developed.
After being almost ignored for many years, mathematics on the Web is making a comeback. Several small-scale studies and experiments on how to deal with maths in SGML and with browsers took place, but only recently did the big players in the maths business (computer algebra vendors like Maple, Mathcad, Mathematica, large scientific publishers like Elsevier and the American Mathematical Society, software companies like Softquad and Adobe, as well as W3C) get together to define a language to describe the structure and content of mathematical expressions. The first outcome of their efforts is a draft for a mathematical markup language, MathML (14).
As an example, consider the equation x² + 4x + 4 = 0.
One way of using MathML is to use presentation tags to describe the visual layout of a mathematical formula.
<MROW>
  <MROW>
    <MSUP> <MI>x</MI> <MN>2</MN> </MSUP>
    <MO>+</MO>
    <MROW> <MN>4</MN> <MO>&InvisibleTimes;</MO> <MI>x</MI> </MROW>
    <MO>+</MO>
    <MN>4</MN>
  </MROW>
  <MO>=</MO>
  <MN>0</MN>
</MROW>
Here we find two kinds of MathML tags: those that contain data, such as <MI>, <MN>, and <MO> (for identifier, number, and operator tokens, respectively), and those that only contain other nested MathML tags like <MSUP> and <MROW>. Note the use of nested <MROW> elements to denote terms, in this case the left-hand side of the equation functioning as an operand of the equal sign. By typing data and marking terms we greatly facilitate things like spacing for visual rendering, voice rendering, line breaking, as well as automatic processing by external applications.
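The distinction between the two kinds of tags can be sketched in a few lines of Python, assuming the standard library's xml.etree.ElementTree parser. The names PRESENTATION, TOKEN_TAGS and classify are invented for this illustration. The invisible multiplication operator is written as the Unicode character U+2062, so that the parser needs no entity declarations.

```python
import xml.etree.ElementTree as ET

# The presentation markup of the example equation, written compactly.
# "\u2062" is the invisible-times character inside the <MO> element.
PRESENTATION = (
    "<MROW><MROW><MSUP><MI>x</MI><MN>2</MN></MSUP><MO>+</MO>"
    "<MROW><MN>4</MN><MO>\u2062</MO><MI>x</MI></MROW>"
    "<MO>+</MO><MN>4</MN></MROW><MO>=</MO><MN>0</MN></MROW>"
)

# Tags that carry data, with the token type each one denotes.
TOKEN_TAGS = {"MI": "identifier", "MN": "number", "MO": "operator"}


def classify(root):
    """Split elements into data-carrying tokens and pure layout containers."""
    tokens, containers = [], []
    for elem in root.iter():  # document order, root included
        if elem.tag in TOKEN_TAGS:
            tokens.append((TOKEN_TAGS[elem.tag], elem.text))
        else:
            containers.append(elem.tag)
    return tokens, containers


tokens, containers = classify(ET.fromstring(PRESENTATION))
print(containers)
for kind, text in tokens:
    print(kind, repr(text))
```

A renderer or a speech synthesiser could walk the same tree and choose spacing or phrasing from the token type alone, which is the benefit of typing the data.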
For the same expression we can express its semantic content, as follows:
<EXPR>
  <EXPR>
    <EXPR> <MI>x</MI> <POWER/> <MN>2</MN> </EXPR>
    <PLUS/>
    <EXPR> <MN>4</MN> <TIMES/> <MI>x</MI> </EXPR>
    <PLUS/>
    <MN>4</MN>
  </EXPR>
  <E/>
  <MN>0</MN>
</EXPR>
MathML content tags are typically contained within an <EXPR> tag, which denotes a semantically meaningful expression. Good practice demands that authors use <EXPR> tags only when their contents have an unambiguous mathematical meaning. In fact, <EXPR> tags function like parentheses; they indicate the order and scope of operations.
Note that in many cases MathML content tags are empty, i.e., they are of the form < ... />. Some of them play the role of standard infix operators.
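The parenthesis-like role of <EXPR> and the infix role of the empty tags can be illustrated with a small Python sketch, again assuming the standard library's xml.etree.ElementTree parser. The names CONTENT, OPERATORS and to_infix, and the ASCII symbols chosen for each operator tag, are assumptions made for this illustration and are not part of the MathML draft.

```python
import xml.etree.ElementTree as ET

# The content markup of the example equation, written compactly.
CONTENT = (
    "<EXPR><EXPR><EXPR><MI>x</MI><POWER/><MN>2</MN></EXPR><PLUS/>"
    "<EXPR><MN>4</MN><TIMES/><MI>x</MI></EXPR><PLUS/><MN>4</MN></EXPR>"
    "<E/><MN>0</MN></EXPR>"
)

# Assumed ASCII spellings for the empty operator tags used above.
OPERATORS = {"POWER": "^", "PLUS": "+", "TIMES": "*", "E": "="}


def to_infix(elem, top=True):
    """Render a content tree as text, one pair of parentheses per nested EXPR."""
    if elem.tag == "EXPR":
        inner = " ".join(to_infix(child, top=False) for child in elem)
        return inner if top else "(" + inner + ")"
    if elem.tag in OPERATORS:
        return OPERATORS[elem.tag]
    return (elem.text or "").strip()  # MI and MN carry their data as text


print(to_infix(ET.fromstring(CONTENT)))
# prints: ((x ^ 2) + (4 * x) + 4) = 0
```

Each nested <EXPR> becomes exactly one pair of parentheses, so the order and scope of the operations can be read directly from the markup.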
Several vendors and organisations have promised to develop freely available implementations of the HTML Math core standard. Initially, these implementations will use embedded objects such as Java applets; plug-ins are also planned. In particular, we expect to see at least the following two applications in early summer.
Several features of XML will probably also be included in the major browsers before the end of the year. In particular, Netscape and Microsoft Internet Explorer have shown interest in moving in the direction of XML in the not-too-distant future.