
The XML FAQ — Frequently-Asked Questions about the Extensible Markup Language

Section 3: Authors

Q 3.4: How do I convert my information to XML format?

Write or use a converter in a language that understands XML

If the source file format has some kind of consistent and recognisable structure, even simple line-breaks or spacing, it's usually possible to write pattern-matching routines in many languages to isolate the information falling into such patterns and output it with tags around it.
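As a minimal sketch of this approach, the Python fragment below assumes a hypothetical plain-text source in which each line holds a `Label: value` pair and blank lines separate records. The sample data, the element names (`records`, `record`) and the function name are illustrative assumptions, not part of any standard.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input: one "Label: value" pair per line, blank line between records.
SAMPLE = """\
Name: Ada Lovelace
Born: 1815

Name: Alan Turing
Born: 1912
"""

# A label must start with a letter so that it is usable as an XML element name.
FIELD = re.compile(r"^(?P<label>[A-Za-z][A-Za-z0-9_-]*):\s*(?P<value>.*)$")


def text_to_xml(text):
    """Wrap each blank-line-separated block in <record> and each field in its own element."""
    root = ET.Element("records")
    for block in re.split(r"\n\s*\n", text.strip()):
        record = ET.SubElement(root, "record")
        for line in block.splitlines():
            match = FIELD.match(line.strip())
            if match is None:
                continue  # skip lines that do not fit the pattern
            field = ET.SubElement(record, match.group("label").lower())
            field.text = match.group("value")
    return root


if __name__ == "__main__":
    print(ET.tostring(text_to_xml(SAMPLE), encoding="unicode"))
```

Building the output through an XML library rather than by string concatenation means that characters such as `&` and `<` in the source text are escaped automatically.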

XSLT3 has a pattern-matching syntax for doing exactly this kind of ‘up-conversion’, and other processors such as Omnimark offer similar features. Such conversions may also create a temporary ‘half-way’ format to which a second conversion is applied to create the final XML format.

If the source files are in a known format (CSV, for example), there may be existing routines available for download or purchase which can create some XML format. A second XML-to-XML conversion can then be used to create the final format required.
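The two-stage route can be sketched in Python with the standard `csv` module. The first function produces a generic intermediate format and the second reshapes it into a target vocabulary. The column names (`id`, `title`), the element names (`table`, `row`, `field`, `catalogue`, `book`) and the sample data are all hypothetical, chosen only for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical CSV source with a header row.
SAMPLE_CSV = 'id,title\nb1,First Title\nb2,"Second, with comma"\n'


def csv_to_xml(csv_text):
    """First pass: turn each CSV row into a generic <row> of <field name="..."> elements."""
    root = ET.Element("table")
    for row in csv.DictReader(io.StringIO(csv_text)):
        row_el = ET.SubElement(root, "row")
        for name, value in row.items():
            field = ET.SubElement(row_el, "field", name=name)
            field.text = value
    return root


def to_final(table):
    """Second pass: reshape the generic rows into the (hypothetical) target vocabulary."""
    catalogue = ET.Element("catalogue")
    for row in table.iter("row"):
        values = {f.get("name"): f.text for f in row.iter("field")}
        book = ET.SubElement(catalogue, "book", id=values["id"])
        ET.SubElement(book, "title").text = values["title"]
    return catalogue


if __name__ == "__main__":
    print(ET.tostring(to_final(csv_to_xml(SAMPLE_CSV)), encoding="unicode"))
```

In practice the second pass is often written in XSLT instead. It is shown in Python here only to keep the example in one language.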

Database management systems may have built-in ‘export-to-XML’ routines which can create a similar ‘half-way’ format for subsequent conversion (see How do I get XML into or out of my database? for an example).

If the information is completely unformatted, or so badly or inconsistently formatted that automated conversion is impossible, it will have to be edited by hand into XML format. Wordprocessor documents are the classic example of this. There are companies which specialise in this kind of work, particularly around the Pacific Rim, which have long experience in dealing with all kinds of weird and wonderful formats, and will send fully-fledged XML back to you.

Two formats frequently requested as sources are better-supported:

LaTeX

Well-formed LaTeX documents (those that do not use homebrew macros, especially macros built on plain TeX or obsolete commands) can be converted using the TeX4ht package. At the time of writing (2015) this is unsupported since the untimely death of its author, but is fully functional.

TeX4ht can convert to HTML and ODF (OpenOffice format) in various ways, so the resulting file can easily be opened in OpenOffice and saved as a .docx file. There are command-line options for the oowriter program (or lowriter if you are using LibreOffice) which allow for scripted bulk conversion.
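A scripted bulk conversion might look like the Python sketch below. It assumes a recent LibreOffice whose `soffice` launcher accepts the `--headless`, `--convert-to` and `--outdir` options; check `soffice --help` on your own installation, and pass a different program name (such as `lowriter`) if that is what your system provides. The function names and the directory layout are hypothetical.

```python
import subprocess
from pathlib import Path


def build_convert_command(files, outdir, program="soffice", target="docx"):
    """Return the argument list for a headless batch conversion (assumed LibreOffice options)."""
    return [program, "--headless", "--convert-to", target,
            "--outdir", str(outdir), *[str(f) for f in files]]


def convert_all(source_dir, outdir, program="soffice"):
    """Convert every .odt file in source_dir; raises if the program is missing or fails."""
    files = sorted(Path(source_dir).glob("*.odt"))
    if files:
        subprocess.run(build_convert_command(files, outdir, program), check=True)
    return files
```

Calling `convert_all("odf", "docx")` would then convert every `.odt` file in a directory named `odf` and write the results to `docx`.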

Other facilities are available in some editors and online services (such as the blogs and forums which support LaTeX formatting in web pages). These may also be used for conversion.

Microsoft Word

Word (.docx) files are Zip files containing XML documents along with the associated images and stylesheets. By default, Word documents consist only of paragraphs (w:p elements) [1]. All the metadata about document structure is provided as font and spacing information, which can only reliably be interpreted by a human, making meaningful conversion exceptionally difficult.
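To see this structure for yourself, the following Python sketch opens a .docx file as a Zip archive and lists the text of each paragraph. It assumes the main document part is stored at the conventional location `word/document.xml`; the function name is illustrative.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace, conventionally bound to the w: prefix.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def paragraphs(docx):
    """Return the plain text of every w:p in the main document part.

    `docx` may be a filename or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    return ["".join(t.text or "" for t in p.iterfind(".//w:t", NS))
            for p in root.iterfind(".//w:p", NS)]
```

The result is a flat list of strings, which illustrates the problem: nothing in it says which paragraphs are headings, list items or quotations.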

However, if named styles (from the built-in style menu or created by the author) are used consistently, it is possible to write an XSLT3 script to match them and output more usable XML markup.
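The same idea can be sketched in Python rather than XSLT, purely for illustration. The mapping from style IDs to element names below is hypothetical, as are the output element names (`document`, `para`); the style IDs actually present depend on the document and on the language version of Word that created it.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical mapping from Word paragraph style IDs to target element names.
STYLE_TO_ELEMENT = {"Title": "title", "Heading1": "heading", "Quote": "quote"}


def styled_to_xml(document_xml):
    """Convert each w:p into an element chosen by its named paragraph style."""
    source = ET.fromstring(document_xml)
    result = ET.Element("document")
    for p in source.iterfind(".//w:p", NS):
        style = p.find("w:pPr/w:pStyle", NS)
        # The style ID lives in the namespaced w:val attribute.
        style_id = style.get(f"{{{W}}}val") if style is not None else None
        # Paragraphs with no style, or an unmapped one, fall back to <para>.
        element = ET.SubElement(result, STYLE_TO_ELEMENT.get(style_id, "para"))
        element.text = "".join(t.text or "" for t in p.iterfind(".//w:t", NS))
    return result
```

The input is the content of `word/document.xml` from inside the .docx archive. Inline formatting such as bold and italic is discarded by this sketch and would need further rules.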

Some editors (eg XMLMind, AbiWord) and other systems now provide conversion from Word, both to a purely visual (HTML) format, mimicking the appearance of the original, and to a ‘semantic’ vocabulary such as DocBook or DITA, with no formatting.

The XSLT3 route also applies to OpenOffice/LibreOffice, which also stores XML in a Zip file. The markup is different, but can be converted along the same lines.

  1. Apart from paragraphs, Word documents also contain tables, the only other block-level element in normal use.