Scientific Publishing with XHTML, MathML and SVG

Manuel Strehl

Port 8000 Unternehmergesellschaft (haftungsbeschränkt)


Abstract


The web, and with it HTML, started as a platform for scientific information exchange. Now, more than 20 years later, the de facto standard for scientific publishing is PDF, generated mostly via word processors or various TeX dialects, less commonly via XML languages, or auto-generated by publishers from highly heterogeneous sources.

For more than ten years, MathML and SVG have provided the technical base and a well-defined extension to HTML, so that publications, even those heavy on complex formulae, could be published in a combined XHTML+MathML+SVG format. However, beyond the display of short abstracts, HTML doesn’t play any role in current web-based scientific databases except as a carrier medium for the download links of papers in other formats.

The talk examines the current state of composing natural science papers with XHTML, MathML and SVG on the one hand and the display possibilities in current browser engines on the other. The aim is to provide a baseline for future improvements in both fields, editors/generators as well as browser engines.

The analysis breaks down into three parts. Firstly, the combination of the three XML formats is compared to existing and often-used techniques for composing as well as publishing. For composing and editing papers, we will focus on word processors as WYSIWYG editors on the one hand and LaTeX as a text-based format on the other. For publishing and displaying, we compare XHTML+MathML+SVG with PDF and, again, with the file formats of word processors.

Methods for writing directly or generating from a source format are the most important factor for the acceptance of HTML and its companions and for their wider usage on the web. Without simple ways to generate result documents, the adoption of HTML is very unlikely. Beyond direct or WYSIWYG editing of XHTML+MathML+SVG documents, we show the current state of transformation methods from various other input formats. The editing doesn’t necessarily need to combine the three techniques in place, so separate transformations for formulae, graphics and text are examined, too.

Finally, only slightly less important than editing is the viewing of the resulting documents. We survey the current landscape of browser implementations with particular attention to MathML, CSS and SVG support. Where an implementation lacks support, we examine external technologies that can take the place of native capabilities.

The findings presented in the talk diverge. While display capabilities have become quite good, with Mozilla Firefox and Opera as the best performers thanks to native SVG and MathML support, the composing and generation of XHTML+MathML+SVG is a vast landscape. Many isolated solutions exist for the individual problems examined, but the work needed to join them into a single, simple workflow is unacceptable for people not familiar with XHTML, MathML or SVG. For a combined format of these three languages to regain ground in scientific publishing on the web, considerable effort will have to be put into generation tools.


Table of Contents

The Point of Departure
Comparison to Existing Technologies
Composing XHTML+MathML+SVG
Displaying XHTML+MathML+SVG
Conclusion

The Point of Departure

Restricting the Target Area

We will use the terms HTML and XHTML interchangeably in this paper. That is, the differences between both terms are not significant for the results presented here. When no specific version is mentioned, HTML5 and XHTML 1.x (as application/xhtml+xml) are both targeted.

In the course of this paper the term publication will be used frequently. While the range of possibilities to display results of scientific research is extremely wide, we will restrict the term here to quite simple documents.

The documents in question are static, without animation; they are equally suited to being viewed on screen and printed out; and they are assumed to consist mainly of continuous text, arbitrarily complex formulas and still images of any kind, possibly coloured. We assume the papers to refer to results in a natural science.

This restriction is not made arbitrarily but to meet the reality in a vast majority of publishing in the natural sciences. For example, literally any paper in the open access repository arXiv.org meets these criteria.

How Papers are Obtained via the Web

While almost any information on the web is viewable as an HTML document somewhere, this doesn’t hold for scientific publications. Searching Google Scholar for some arbitrary keywords supports the assumption that the standard format is PDF. The following table is not meant to provide hard facts but to serve as a quick overview of which formats are found.

Search terms at Google Scholar, differentiated by filetype query parameter
Search term              | All results | filetype:html | filetype:pdf | filetype:ps
bose einstein condensate | 37,000      | 1,410         | 8,830        | 154
shotgun sequencing       | 45,000      | 1,460         | 11,300       | 38
runge kutta methods      | 90,700      | 2,250         | 33,000       | 921
lattice enthalpy         | 274,000     | 10,700        | 36,200       | 2,710

Figure 1. Summed over the four search terms, filetype:pdf returns about 5.6 times as many results as filetype:html. The survey was run on 27 July 2010.
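
The quoted factor can be checked from the table. The following is a minimal sketch, assuming that the counts are pooled over the four search terms before dividing (the text does not state how the average was formed); the variable names are our own.

```python
# Result counts copied from the table above: (filetype:html, filetype:pdf).
counts = {
    "bose einstein condensate": (1410, 8830),
    "shotgun sequencing": (1460, 11300),
    "runge kutta methods": (2250, 33000),
    "lattice enthalpy": (10700, 36200),
}

total_html = sum(html for html, _ in counts.values())
total_pdf = sum(pdf for _, pdf in counts.values())

# Ratio of the summed counts; this reproduces the quoted factor.
pooled_ratio = total_pdf / total_html

# Mean of the four per-term ratios, shown for comparison only.
per_term = [pdf / html for html, pdf in counts.values()]
mean_ratio = sum(per_term) / len(per_term)

print(f"pooled: {pooled_ratio:.1f}, mean of per-term ratios: {mean_ratio:.1f}")
# prints "pooled: 5.6, mean of per-term ratios: 8.0"
```

Either way of averaging leads to the same qualitative picture: PDF results outnumber HTML results several times over.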

The question arises why HTML is relegated to such a niche existence in the very field it was originally developed for [w3.org/History/1989/proposal.html]. This is not the first time the question has been asked. For example, in 2003 Mark Schenk wrote about the possibilities of XHTML + CSS for scientific publishing [www.markschenk.com/cssexp/publication/article.xml] and came to the conclusion that the then-current web techniques were not sufficient to publish papers in HTML.

In the following section we suggest a partial explanation: the format of XHTML+MathML+SVG, or even plain HTML, doesn’t fit into most workflows for editing and postprocessing publications. With the advance of new technologies and new software supporting them, we cautiously venture to predict that this may change in some respects.

Comparison to Existing Technologies

We will concentrate on several spots in the toolbox for creating scientific publications. They can mainly be divided into two groups: creating and editing a publication on the one hand, and publishing it afterwards on the other. Since XHTML+MathML+SVG can serve both purposes, we will have to compare it with formats from both groups.

LaTeX and the TeX World

TeX and, more specifically, LaTeX [latex-project.org] enjoy unbroken popularity in the natural sciences as an editing format for publications. LaTeX provides simple input methods for producing high-quality print output with complex formulas and embedded objects like tables and graphics. The file format is plain text.

From .tex files several output formats can be created. The initially supported format is DVI, but there are popular PDF, PS and RTF output filters as well. Since the mid-90s many projects emerged to create an HTML output filter for TeX, but scarcely any is still active today or close to producing acceptable markup [www.uni-giessen.de/partosch/TeX/converters.html].

However, there are projects, like the often-used TeX4ht [www.cse.ohio-state.edu/~gurari/TeX4ht/], that also create MathML output from TeX formula input (not just GIFs, as is quite common). Again, the quality is debatable, but may suffice for a given project.

What seems to be mostly ignored in the TeX world is SVG support. MetaPost, a vector graphics language derived from METAFONT (which was designed to describe fonts), is part of the TeX suite, and embedding EPS files is a standard procedure, but embedding SVG in LaTeX documents or creating SVG from them does not seem to be possible directly at the moment. The work-arounds usually involve one or more intermediate formats, like converting SVG to EPS or converting the PDF output of pdflatex to SVG.

Compared to XHTML+MathML+SVG, LaTeX is much simpler to edit. Although it is not an output format, it facilitates the task of writing a paper through many surrounding techniques, like BibTeX for bibliographies or automated numbering of figures. Some of these techniques can be rebuilt in (X)HTML, but doing so requires profound knowledge of either cutting-edge CSS or JavaScript.

Other simplifications are not fully reproducible in XML-based languages. For example, in LaTeX paragraphs are simply separated by two newlines. For HTML, many meta-languages like the MediaWiki syntax or Markdown, and helpers like PHP’s nl2br, were invented so people don’t have to enclose each paragraph in a complete <p></p> pair.
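
To illustrate how little such a meta-language has to do for this particular convenience, the following sketch wraps blank-line separated blocks in p elements. It is a toy under our own assumptions (the function name paragraphs_to_html is hypothetical) and not the algorithm of Markdown, MediaWiki or nl2br.

```python
import re
from xml.sax.saxutils import escape


def paragraphs_to_html(text: str) -> str:
    """Wrap blocks separated by one or more blank lines in <p> elements."""
    blocks = re.split(r"\n\s*\n", text.strip())
    paragraphs = []
    for block in blocks:
        if not block.strip():
            continue
        # Join the lines of a block and escape &, < and > so that the
        # result stays well-formed XML.
        paragraphs.append(f"<p>{escape(' '.join(block.split()))}</p>")
    return "\n".join(paragraphs)


source = "First paragraph,\nstill the first.\n\nSecond paragraph with a < b."
print(paragraphs_to_html(source))
```

Everything beyond paragraphs (headings, lists, formulas) is where the real meta-languages differ and where the effort lies.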

The clear disadvantage is that TeX is never a final format but is by design a starting point. To create a viewable or printable result, the document has to be run (in some cases several times) through a processor that creates the final document in a target format.

ODF and the OpenOffice Suite

Like XHTML+MathML+SVG, ODF is a composing as well as a presentation format. It is standardized by ISO and implemented in the OpenOffice suite. We will look at XHTML export, formula editing, and MathML import and export. These features are available in Microsoft Office as well [blogs.msdn.com/...].

What neither Microsoft Office nor OpenOffice supports out of the box is SVG import, although vector-based drawings are possible in both suites. For OpenOffice, a plugin exists, and since this feature is #1 on the community’s wishlist [qa.openoffice.org/iz_votes.html] of open issues, it might be implemented one day.

As a word processor, OpenOffice simplifies common tasks like selecting the font of text, text size or paragraph style. It is possible to define central styles and apply them to selected paragraphs. If headings are marked up correctly (using the appropriate style), a table of contents can be auto-created. ODF Writer files can be exported in a range of formats, like DOC, RTF, PDF and (X)HTML.

ODF and XHTML with MathML can be converted into one another with the help of a built-in OpenOffice export filter. Changing the markup by hand and re-importing may change the display of the whole document. For advanced features of ODF, like footnotes, work-arounds are introduced. The appearance of the XHTML result diverges widely from the original and from the appearance of other export formats, which are print-centered.

The largest difference between editing XHTML+MathML+SVG and ODF files is that the former are usually written by hand in a text editor, or an editor with a preview facility, while the latter are almost exclusively edited in a closed WYSIWYG environment (apart from automatic creation). Which one an author prefers is mostly a matter of taste. However, in complex, convoluted documents, stray style elements or orphaned objects are hard to find and eliminate in a WYSIWYG editor. (That does not mean that debugging complex XHTML+MathML+SVG documents is an easy task.) On the other hand, the learning curve of a WYSIWYG editor is gentle, and publications can be created without prior knowledge of the internals of the format.

For XHTML+MathML+SVG there are in fact WYSIWYG editors, too, notably W3C’s Amaya [w3.org/Amaya]. But in terms of ease of editing they offer scientific users no benefit over OpenOffice.

The Portable Document Format

PDF [www.adobe.com/...] is a true output format. It was designed as such by Adobe and is an ISO standard, too. Beyond its roots in PostScript, PDF allows arbitrary content to be embedded, either for display purposes or as an attachment.

The most prominent use case is the display of a document with exactly the same look on any computer. This is achieved by embedding the fonts, or subsets of the fonts, that are used in the document, and by precise positioning of the objects on each page.

This approach is very different from HTML and CSS, where a certain uncertainty is not only tolerated but desired to allow flexible rendering on different devices. In PDF files, the output is fixed to a page, and the viewer interface is centered around the need to display this page.

The disadvantage of PDF compared to XHTML+MathML+SVG is that the document cannot be modified easily. If the source from which the PDF was created is lost, changing the content of the paper is extremely difficult.

Providing a guaranteed display, on the other hand, has the advantage that there are no unwanted surprises when the publication is viewed on different machines with different software. This head start of PDF has been reduced a bit with the advance of recent CSS technologies, most notably the rediscovery of the @font-face rule.

Composing XHTML+MathML+SVG

Writing a publication about a scientific topic means using a tool to shape the result to look like what the researcher or the publisher wants or needs. The basic design feature of the tool must be to get out of the author’s way.

This was one of the principal goals of LaTeX. It provides helpful macros on top of TeX to speed up repetitive and frequent tasks in writing documents. Another approach is the WYSIWYG interface of OpenOffice, which aims to give users easily recognizable icons and widgets to manipulate the document.

Generating and Editing (X)HTML

For HTML there is no single editor. Different methods for generating the necessary markup exist in parallel. When writing a paper, simple-to-use input formats are necessary for authors not willing to learn HTML. One can be ODF with OpenOffice’s export filter, another could be one of the popular replacements like wiki syntax, Markdown or reStructuredText.

Using HTML has some drawbacks compared to formats tailored for human input. Notably, automatic creation of references, numbering of chapters, sections and figures, and tools to manage the bibliography don’t exist as part of the language. Some of these drawbacks can be handled with CSS, for instance with automatic counters, but some, especially referencing objects, remain complicated. The problem can be solved with JavaScript and a coding guideline for classes and IDs, but since there is, e.g., no DOM API for CSS counters, the JavaScript to be written is not trivial.
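
One way around the missing DOM API is to resolve numbers and references once, when the document is generated, instead of in the browser. The following is a minimal sketch of that idea in Python, not the JavaScript approach described above; the class name ref, the id fig-setup and the function number_figures are hypothetical conventions chosen for illustration.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
ET.register_namespace("", XHTML)  # serialize XHTML without a prefix


def number_figures(xhtml: str) -> str:
    """Number figure captions and fill in references to the figures."""
    root = ET.fromstring(xhtml)
    numbers = {}
    # Pass 1: number all figures in document order and label the captions.
    for n, fig in enumerate(root.iter(f"{{{XHTML}}}figure"), start=1):
        fig_id = fig.get("id")
        if fig_id:
            numbers[fig_id] = n
        caption = fig.find(f"{{{XHTML}}}figcaption")
        if caption is not None:
            caption.text = f"Figure {n}. {caption.text or ''}"
    # Pass 2: give every <a class="ref" href="#id"> the matching label.
    for link in root.iter(f"{{{XHTML}}}a"):
        target = (link.get("href") or "").lstrip("#")
        if link.get("class") == "ref" and target in numbers:
            link.text = f"Figure {numbers[target]}"
    return ET.tostring(root, encoding="unicode")


sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<p>See <a class="ref" href="#fig-setup"/>.</p>'
    '<figure id="fig-setup"><figcaption>Experimental setup</figcaption></figure>'
    '</body></html>'
)
print(number_figures(sample))
```

Because all figures are numbered before any link is touched, forward references (a link that precedes its figure, as in the sample) are resolved as well. The price is that the numbering is frozen into the markup and has to be regenerated after every edit.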

Generating MathML

The possibilities for generating MathML are quite rich. Besides export from office suites, all larger mathematical applications offer MathML support, like Mathematica, Maple, and MATLAB. The W3C has a long list of MathML-enabled software [w3.org/Math/Software/].

The problem lies less in the generation than in the embedding of MathML in (X)HTML. Although this is possible in several ways, like directly as XML, as part of HTML5, or via the object element, there are several hindrances to a seamless integration. One is that, of the recent browsers with significant market share, only Opera and Firefox can display MathML out of the box, and both understand only Presentation MathML. The other is that the generators of MathML may produce quite wild output, because they try to work around issues in displaying MathML in IE with a plugin, like the MathPlayer of Design Science [www.dessci.com/en/products/mathplayer/].
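
As an illustration of the first option, direct embedding as XML, the following sketch builds a small XHTML document with an inline Presentation MathML formula using Python’s standard library. The namespace prefix m, the helper mml and the sample formula are our own choices for illustration; the result would have to be served as application/xhtml+xml for a browser to treat the MathML as such.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
MATHML = "http://www.w3.org/1998/Math/MathML"
ET.register_namespace("", XHTML)    # XHTML elements without a prefix
ET.register_namespace("m", MATHML)  # MathML elements as m:...


def mml(tag, text=None, *children):
    """Create a MathML element with optional text and child elements."""
    element = ET.Element(f"{{{MATHML}}}{tag}")
    element.text = text
    element.extend(children)
    return element


# Presentation MathML for E = m c^2
formula = mml(
    "math", None,
    mml("mi", "E"), mml("mo", "="),
    mml("mi", "m"), mml("msup", None, mml("mi", "c"), mml("mn", "2")),
)

html = ET.Element(f"{{{XHTML}}}html")
body = ET.SubElement(html, f"{{{XHTML}}}body")
paragraph = ET.SubElement(body, f"{{{XHTML}}}p")
paragraph.text = "Mass-energy equivalence: "
paragraph.append(formula)

document = ET.tostring(html, encoding="unicode")
print(document)
```

Both namespaces are declared once on the root element, so the formula itself carries no further overhead beyond the m: prefix.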

A different approach is to embed LaTeX formulas in HTML documents and use JavaScript to convert them to MathML on page load. Again, this works only in Opera, Firefox or IE with a plugin. A well-known representative is LaTeXMathML [math.etsu.edu/LaTeXMathML/], which is based on ASCIIMathML, which in turn allows a more forgiving input syntax.

Generating SVG

Like MathML, SVG can stem from various sources. Apart from drawing by hand, e.g., in Adobe Illustrator or Inkscape [inkscape.org], or writing in a text editor, scientific plotting software is increasingly adding decent SVG support. Examples from the open source world are gnuplot [gnuplot.info] and Grace [plasma-gate.weizmann.ac.il/Grace/], both of which are widely used for plotting data in the natural sciences.

As mentioned above, OpenOffice lacks support for both input and output of SVG. That leaves us with two options for embedding the vector graphic in an existing (X)HTML document. Either we use the object element, since Firefox < 4 doesn’t understand SVG in img elements, or we embed the SVG by hand via a text editor. In the latter case, the SVG plugin of Adobe for IE won’t render the graphic without tweaking, though.

Generation of Combined Documents

The ultimate challenge is to produce a single file containing XHTML, MathML and SVG side by side. Apart from text editors, only Amaya is able, at least partially, to create such combined documents from scratch. If SVG support lands in OpenOffice, this will yield another alternative.

Conversion from an Input Format

We followed a mixed approach to reproduce an already published physics paper in the XHTML+MathML+SVG format. We chose a paper from arxiv.org that provided the LaTeX sources together with a PDF, which we used as the reference rendering [arxiv.org/abs/cond-mat/0607380]. Additional criteria were the existence of sufficiently complex formulas, graphics that are based on vector sources, and a topic that is known to the authors, so that we can rule out errors introduced by the conversion. The selected paper was published in physica status solidi (c) [wiley.com/...] in 2006.

Using TeX4ht we converted the LaTeX sources to a preliminary HTML version. This step also converted the formulas to MathML. Some changes had to be made to the LaTeX sources due to limitations of TeX4ht. For example, upright text in math subscripts cannot be transformed.

Afterwards the produced markup was cleaned manually. The original XML was not well-formed. In particular, some math constructs regularly came out with wrongly nested mrow and msub tags. The stylesheet provided by TeX4ht was discarded, and the markup simplified as far as possible. That includes the usage of new HTML5 features like the figure element.
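
A well-formedness check of this kind is easy to automate. The sketch below, assuming Python’s standard XML parser and two made-up MathML fragments (not taken from the actual TeX4ht output), reports where parsing fails.

```python
import xml.etree.ElementTree as ET


def check_well_formed(markup: str):
    """Return None if the markup parses as XML, else (line, column, message)."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None


good = "<msub><mrow><mi>x</mi></mrow><mn>1</mn></msub>"
# mrow and msub are closed in the wrong order:
bad = "<msub><mrow><mi>x</mi></msub></mrow>"

print(check_well_formed(good))  # None
print(check_well_formed(bad))   # a tuple describing the mismatched tag
```

Note that this parser knows only the predefined XML entities, so named references such as &nbsp; are reported as errors as well unless they are replaced by numeric references beforehand.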

The original images of the paper were created with Grace and exported as EPS. We transformed the EPS sources to SVG with Inkscape. Then the SVG was embedded directly in the markup of the XHTML file. The XML declarations and doubled IDs had to be removed, but the rest of the SVG code was not altered.
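
The clean-up of the SVG sources can be scripted. The following sketch strips the XML declaration and, as a variation on removing doubled IDs by hand, prefixes every ID and every reference to it with a per-figure string. It is a regular-expression sketch under stated assumptions (double-quoted attributes, references only via href="#..." and url(#...)); the function name prepare_inline_svg and the prefix fig2- are hypothetical.

```python
import re


def prepare_inline_svg(svg_source: str, prefix: str) -> str:
    """Strip the XML declaration and make all IDs unique by prefixing them."""
    # The XML declaration is only allowed at the very start of a file,
    # not inside another document.
    svg = re.sub(r"<\?xml[^>]*\?>", "", svg_source).strip()
    # Prefix every id="..." definition ...
    svg = re.sub(r'\bid="([^"]+)"',
                 lambda m: f'id="{prefix}{m.group(1)}"', svg)
    # ... and every reference to it: href="#..." and url(#...).
    svg = re.sub(r'href="#([^"]+)"',
                 lambda m: f'href="#{prefix}{m.group(1)}"', svg)
    svg = re.sub(r"url\(#([^)]+)\)",
                 lambda m: f"url(#{prefix}{m.group(1)})", svg)
    return svg


svg_file = '''<?xml version="1.0" encoding="UTF-8"?>
<svg xmlns="http://www.w3.org/2000/svg"
     xmlns:xlink="http://www.w3.org/1999/xlink">
  <defs>
    <clipPath id="clip1"><rect id="box" width="10" height="10"/></clipPath>
  </defs>
  <g clip-path="url(#clip1)"><use xlink:href="#box"/></g>
</svg>'''

print(prepare_inline_svg(svg_file, "fig2-"))
```

A parser-based version would be more robust against unusual quoting, but for files exported by a single tool the pattern-based approach is usually sufficient.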

Finally, a new stylesheet was created that uses CSS 2.1/3 techniques to shape the document as closely as possible to the look of the reference PDF. Automatic counters and CSS columns are the most prominent of the advanced CSS features. A two-column layout is generated for print output, while on screen a single column is better suited to this continuous medium. We decided against font embedding, although this is a core feature of the PDF version. The only measurable difference, as long as the font used is installed locally, is the file size. We used DejaVu Serif for the screen and Latin Modern for the print styles, the former for its readability, the latter to closely resemble the PDF.

The results of this procedure reflect the fact that XHTML+MathML+SVG is both an input and an output format. The generated document is available here.

Comparison of reference PDF and generated XHTML
                                | PDF                  | XHTML
No. of files                    | 1                    | 1
Pages                           | 4                    | 4
File size                       | 256 kB               | 444 kB (60 kB gzipped, without embedded fonts)
Content editable                | hard                 | easy
Content printable               | yes                  | buggy
Semantic content (e.g. headers) | no (could be tagged) | yes, by definition

The font Latin Modern used in the stylesheet for print rendering would account for another 164 kB (gzipped, three styles). This could be reduced by including only the glyphs actually used, as is the case in the PDF.
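
Raw and gzipped sizes of the kind listed above can be measured for any file with a few lines. A minimal sketch, assuming 1 kB = 1000 bytes and using a synthetic stand-in for the document (the real file is not reproduced here, so the printed numbers are not those of the table):

```python
import gzip


def sizes_in_kb(data: bytes):
    """Return (raw size, gzipped size) in kB."""
    return len(data) / 1000, len(gzip.compress(data)) / 1000


# Stand-in for the real file: repetitive markup, as XML tends to be.
sample = b"<mrow><mi>x</mi><mo>+</mo><mn>1</mn></mrow>\n" * 2000
raw, packed = sizes_in_kb(sample)
print(f"raw: {raw:.1f} kB, gzipped: {packed:.1f} kB")
```

The verbose, repetitive tag names of MathML and SVG are exactly what this kind of compression removes, which is why the gzipped XHTML ends up well below its raw size.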

Another number of interest is the rendering speed. In Firefox, Firebug detects the load event after 3.1 seconds on a computer with an Intel Core2 Duo (2 GHz) and 2 GB RAM. On the same machine, the Adobe viewer takes roughly half that time, something below 2 seconds, to display the PDF. Opera is as fast as the PDF viewer, but both browsers take quite long to generate the print preview.

Page 2 in the reference PDF (Adobe viewer), in the Firefox print preview and in the Opera print preview (both browser renderings printed to PDF).

There are several pitfalls and problems when going the LaTeX → XHTML+MathML+SVG way:

At the same time, it could be shown that in principle XHTML+MathML+SVG can be used for scientific publications. While the input is cumbersome and error-prone, the output, examined in detail in the next section, serves the purpose very well.

Displaying XHTML+MathML+SVG

The display capabilities and possibilities are another essential part in the adoption of XHTML+MathML+SVG. Since Firefox 1.5 and Opera 10, two browsers now support the required techniques without plugins.

Browser Support

As shown in the last section, Firefox and Opera render a paper’s XHTML clone on screen comparably to the reference PDF. Problems remain in producing an attractive print rendering and in the rendering speed.

Additionally, the possibility to display page assets like page numbers or running headers or footers is missing. Since floating elements to the top or bottom of the containing element is not supported, footnotes and figures look displaced in print.

We constrained the topic of this paper to a narrow set of possible documents. The strong advantage of XHTML+MathML+SVG lies in additional ways to display data: video and audio embedding will become trivial in HTML5, while they are problematic in PDF; animations can be scripted with JavaScript (the JavaScript API of PDF is mostly limited to handling PDF forms); linking between documents and loading data in the background are core features of HTML-based web pages.

Aiding Technologies

Browsers can be extended in their functionality with plugins. Some may help with the rendering task of XHTML+MathML+SVG.

Flash

There are proposals to use Adobe Flash as a rendering layer for MathML or SVG. Some code is available that realizes these ideas in a satisfactory way. SVG Web [code.google.com/p/svgweb/] declares itself to be in alpha status, but delivers reasonable results. For MathML, fMath [fmath.info] promises cross-browser support.

A common problem of most of these libraries is that Flash cannot easily detect embedded markup, as would be desirable for compound scientific documents. This can be overcome, but still, the Flash application is an external dependency that has to be passed around together with an XHTML document.

Plugins for IE

For IE’s lack of support for SVG and MathML there are solid, working plugins to fill the gap. SVG is rendered by the Adobe SVG plugin, which was for a long time the most complete SVG implementation. MathML is rendered, for example, by MathPlayer from Design Science.

With these plugins a problem arises similar to the one in the Flash case. Embedded markup is hard to detect and render. Reliable results are only possible with the use of the object element. That can be created dynamically via JavaScript, but the detour is not straightforward.

Support for the application/xhtml+xml MIME type, embedded markup and SVG seems to be arriving in IE 9. If that is the case, there is only MathML left to deal with. Using MathPlayer together with a bit of JavaScript then has a good chance of bringing full XHTML+MathML+SVG support to Internet Explorer.

Printing Directly with CSS

Besides being read on screen, a paper will also be printed. PDFs are tailored for this task, and ODF, too, is targeted at printing the file. HTML, on the other hand, has traditionally printed badly. The HTML Print Profile is not incorporated in any printer driver, and CSS print support in browsers is incomplete.

Apart from that, common page elements, like page numbers or running headers, are impossible to define in either HTML or CSS.

One possibility is a third-party renderer like Prince XML [princexml.com]. It takes an HTML file and a print stylesheet, which may contain custom CSS extensions, and outputs a printable result. Recent versions of Prince XML can handle MathML and SVG, too.

Conversion to an Output Format

A requirement in many workflows is the conversion into another format. This is especially important if XHTML+MathML+SVG is the input format of a document.

For (X)HTML many solutions exist to convert a document to other formats, like PDF, RTF, or even images. If the document contains MathML and SVG, too, off-the-shelf software is hard to find. One example is the already mentioned Prince XML, which allows for conversion of XHTML+MathML+SVG to PDF.

One possible way to go is an XSLT stylesheet that converts the document to XSL-FO, which is designed to be easily converted into other formats. An open source implementation is Apache’s FOP [apache.org...]. It has the advantage of supporting SVG graphics with the help of Apache Batik. To get MathML into XSL-FO, one can then try to convert it directly, or convert it to SVG and let Batik render it.

For a possible XSL-FO 2, many attendees at a requirements workshop in Heidelberg, Germany, in 2006 voted for extended math support as a new feature. Unfortunately, the specification seems to have made no progress since then. A custom stylesheet for the direct conversion can be written, but it will be complex if it is to handle generic MathML.

The other way, going via SVG, involves a second XSLT stylesheet that converts MathML to SVG before the complete document is transformed. Such a stylesheet is published as pMML2SVG [pmml2svg.sf.net], but it is not free of issues. It is also an XSLT 2.0 stylesheet, which can’t be executed by FOP’s internal XSLT engine Xalan.

Conclusion

We have shown in the previous sections that in principle XHTML+MathML+SVG can be used as both an input and an output format for publications in the natural sciences. Due to this dual nature, the format has disadvantages compared with formats specifically designed for either input or output.

The generation of such a compound document is not as simple as, for example, the generation of PDF from LaTeX sources. At the moment it requires heavy manual tweaking to assemble a document that comes close to the look of documents generated by other means. A possible remedy could be WYSIWYG editors like Amaya, or OpenOffice if it supports SVG in the future.

The generated file itself doesn’t have any specific drawbacks compared to competing formats. The size, when gzipped, is of the same order of magnitude, and with the support of data: URIs everything, including bitmap images, can be packed into a single file.
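
Packing a bitmap into the document via a data: URI amounts to base64-encoding the file contents. A minimal sketch, assuming a PNG image; the eight signature bytes that open every PNG file stand in for a real image, and the function name to_data_uri is our own.

```python
import base64


def to_data_uri(data: bytes, mime_type: str) -> str:
    """Encode binary data as a base64 data: URI."""
    payload = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{payload}"


# The first eight bytes of every PNG file, standing in for a real image.
png_signature = b"\x89PNG\r\n\x1a\n"
uri = to_data_uri(png_signature, "image/png")
img_element = f'<img src="{uri}" alt="Placeholder"/>'
print(img_element)
```

Base64 inflates the binary data by about a third, part of which is recovered when the document is served gzipped.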

Displaying XHTML+MathML+SVG on screen has become well supported with the shipping of Firefox 1.5 and Opera 10, both with integrated MathML and SVG support. The use of CSS 3 and HTML5 allows lean markup that benefits from automatic labels, counters and advanced styling.

Printing documents is significantly more cumbersome than with other output formats. The lack of support for CSS print properties, as well as fundamental problems like the declaration of page numbers or footnotes, disqualifies the format for this usage at the moment. Work-arounds include the transformation into another format, like PDF, or, if the XHTML+MathML+SVG was generated from a source, creating another format directly from that source.

The XML-based format shows its strengths when it comes to novel ways of presenting scientific results. If the content has multimedia elements embedded or relies on animations that can be scripted, or if other features are wanted, like bookmarking, accessibility or creating mash-ups, legions of web-based software stand ready to process such documents.