The XML FAQ: How can I make my existing HTML files work in XML?


The XML FAQ — Frequently-Asked Questions about the Extensible Markup Language

Section 3: Authors

Q 3.5: How can I make my existing HTML files work in XML?

Either make them XHTML/HTML5, or use a different document type.

Either convert them to conform to some new document type (with or without a DTD or Schema) and write a stylesheet to go with them; or edit them to conform to XHTML or HTML5.

You may need to convert existing HTML files because XML does not permit end-tag minimisation (missing </p>, etc), unquoted attribute values, and a number of other SGML shortcuts which are commonly used in HTML. Many HTML authoring tools already produce almost (but not quite) well-formed XML by making sure that end-tags are used, attribute values are quoted, etc. However, many do not.

You may be able to convert HTML to XHTML using Dave Raggett's HTML Tidy program (an HTML5 version also exists), which can clean up some of the horrible mess of pseudo-markup left behind by incompetent HTML editors, and even separate out some of the formatting to a stylesheet, but there is usually still some hand-editing to do.

Most modern website design programs, including DreamWeaver, still don't produce anything like well-formed HTML, largely because they are intended for making pages look pretty, rather than getting the markup right. Using a website design program and its HTML pages as the sole repository of your information can be a dangerous and expensive mistake. If you're working the other way round, however, getting the information design right in XML first, and then exporting it to a page design produced using a website design program, it's probably less important that the HTML is a mess, because browsers are very forgiving.

Converting valid HTML to XHTML
If your HTML files are valid (full formal validation with an SGML parser against one of the published DTDs, not just a simple syntax check), then try validating them as XHTML with an XML parser. If you have been creating clean HTML without embedded formatting then this process should throw up only mismatches in upper/lowercase element and attribute names, and EMPTY elements like img, plus any non-standard element type names if you use them. Simple hand-editing or a short script should be enough to fix these mismatches.
If your HTML validly uses end-tag omission and unquoted attribute values, this can be fixed automatically by a normalisation program like sgmlnorm (from the OpenSP package, which is part of OpenJade), or by the sgml-normalize function in an editor like Emacs/psgml (don't be put off by the names, they both do XML).
If you have a lot of valid HTML files, you could write a script to do this in a programming language which understands SGML markup, such as Omnimark or SGMLC, or in one of the popular scripting languages (eg Perl, Python, Tcl) using their SGML/XML libraries; or you could even use editor macros if you know what you're doing.
If your HTML is invalid or badly-formed, try the HTML Tidy program mentioned above. If that doesn't fix them, I'm afraid you'll need to write something special using the procedure below, or do it all by hand-editing, or copy-and-paste from a browser.
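As a rough first check before any of the above, a short script can report whether a file is already well-formed XML. This is only a sketch using Python's standard library; the function name check_well_formed and the two sample strings are made up for illustration. Note that it tests well-formedness only, not validity against an XHTML DTD.

```python
import xml.etree.ElementTree as ET


def check_well_formed(text):
    """Return None if text parses as XML, otherwise the parser's error message."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        # The message includes the line and column where parsing stopped.
        return str(err)
    return None


# Hypothetical samples: HTML-style shortcuts versus the XHTML-style equivalent.
html_style = '<p>An image: <IMG SRC=pic.gif></p>'
xhtml_style = '<p>An image: <img src="pic.gif" alt="Picture"/></p>'

for label, sample in (("HTML-style", html_style), ("XHTML-style", xhtml_style)):
    problem = check_well_formed(sample)
    print(label, "->", problem if problem else "well-formed")
```

Running the same function over a directory of files gives a quick list of which ones still need normalising or hand-editing.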

Converting to a new document type
If you want to move your files out of HTML into some other DTD entirely, there are many native XML industrial DTDs, and modular XML versions of popular DTDs like TEI (literary, historical, and linguistic documents), DocBook (computer documentation), and DITA (technical documentation) to choose from. There were several sites for the exchange of new XML DTDs, but writing new ones is now rare.
You can of course just make up your own markup: so long as it makes sense and you create a well-formed file, you should be able to write a CSS or XSLT stylesheet and have your document displayed in a browser.
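To illustrate, here is a small sketch of a made-up document type with a CSS stylesheet attached. The element names (recipe, title, ingredient), the attribute, and the file name recipe.css are all invented for this example; the Python wrapper is there only to confirm that the document is well-formed.

```python
import xml.etree.ElementTree as ET

# A hypothetical home-made vocabulary; no DTD is needed for well-formedness.
doc = """<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="recipe.css"?>
<recipe>
  <title>Toast</title>
  <ingredient quantity="2">slices of bread</ingredient>
</recipe>
"""

# A matching stylesheet (would be saved as recipe.css, an assumed file name).
css = """
recipe     { display: block; margin: 1em }
title      { display: block; font-weight: bold }
ingredient { display: list-item; margin-left: 2em }
"""

# Parsing succeeds only if the document is well-formed.
root = ET.fromstring(doc.encode("utf-8"))
print(root.tag, [child.tag for child in root])
```

Because nothing is declared in advance, the browser knows nothing about these elements beyond what the stylesheet tells it, which is why each one needs an explicit display property.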

Converting invalid HTML to well-formed XHTML
If your files are invalid HTML (95% of the Web) they can be converted to well-formed DTDless files as follows:
Replace any DOCTYPE Declaration with the XML Declaration <?xml version="1.0" encoding="UTF-8"?> (or the equivalent with the appropriate character encoding).
If there was no DOCTYPE Declaration, just prepend the XML Declaration. Either way, the XML Declaration, if used, must be line 1 of the document.
Change any EMPTY elements (eg every BASE, ISINDEX, LINK, META, NEXTID and RANGE in the header, and every AREA, ATOPARA, AUDIOSCOPE, BASEFONT, BR, CHOOSE, COL, FRAME, HR, IMG, KEYGEN, LEFT, LIMITTEXT, OF, OVER, PARAM, RIGHT, SPACER, SPOT, TAB, and WBR in the body of the document) so that they end with /> instead, for example <img src="mypic.gif" alt="Picture"/>;
Make all element type names and attribute names lowercase;
Ensure there are correctly-matched explicit end-tags for all non-EMPTY elements; eg every <para> must have a </para>, etc;
Escape all < and & non-markup (ie literal text) characters as &lt; and &amp; respectively (there shouldn't have been any isolated < characters to start with, anyway!);
Ensure all attribute values are in matched quotes (values with embedded single quotes must be in double quotes, and vice versa; if you need both, use the &quot; character entity reference);
Ensure all script URIs which have & as a field separator are changed to use &amp; or a semicolon instead.
Ensure all scripts (eg JavaScript) which have < or & characters (mathematical less-than tests, and Boolean AND conditionals) are either given as CDATA Marked Sections, or (if browser processors accept them) changed to use &lt; and &amp; respectively.
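Several of the mechanical steps above can be sketched with the HTML parser in Python's standard library. This is a minimal illustration, not a complete converter: the class name TagSoupToXml, the helper convert, and the shortened EMPTY set are assumptions made for the example. It drops the DOCTYPE, prepends the XML Declaration, lowercases names, quotes and escapes attribute values, escapes literal text, and closes EMPTY elements. It does not repair missing or mismatched end-tags, and it silently drops comments.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# Illustrative subset of EMPTY element names; extend it from the list above.
EMPTY = {"area", "base", "br", "col", "hr", "img", "link", "meta", "param", "wbr"}


class TagSoupToXml(HTMLParser):
    """Re-serialise HTML start-tags, end-tags, and text using XML syntax."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = ['<?xml version="1.0" encoding="UTF-8"?>\n']

    def handle_decl(self, decl):
        pass  # discard any DOCTYPE Declaration

    def handle_starttag(self, tag, attrs):
        # HTMLParser has already lowercased element and attribute names.
        parts = [tag]
        for name, value in attrs:
            # A valueless attribute (value is None) gets its own name as value.
            parts.append(name + "=" + quoteattr(name if value is None else value))
        self.out.append("<" + " ".join(parts) + ("/>" if tag in EMPTY else ">"))

    def handle_endtag(self, tag):
        if tag not in EMPTY:  # EMPTY elements were closed in the start-tag
            self.out.append("</" + tag + ">")

    def handle_data(self, data):
        # Entity references were decoded by the parser; re-escape <, >, and &.
        self.out.append(escape(data))


def convert(html):
    parser = TagSoupToXml()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = ('<!DOCTYPE html><P CLASS=intro>Fish &amp; chips<BR>'
          '<IMG SRC="a.gif?x=1&y=2" ALT=Picture></P>')
result = convert(sample)
print(result)
ET.fromstring(result.encode("utf-8"))  # raises ParseError if not well-formed
```

The final parse is the important part: whatever tool does the conversion, running the output through an XML parser is the quickest way to find what still needs hand-editing.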

Be aware that some obsolete HTML browsers may not accept XML-style EMPTY elements with the trailing slash, so the above changes may not be backwards-compatible. An alternative is to add a dummy end-tag to all EMPTY elements, so <img src="foo.gif"/> becomes <img src="foo.gif"></img>. This is valid XML but you must be able to guarantee no-one will ever put any text content inside such elements. Adding a space before the closing slash in EMPTY elements (eg <img src="foo.gif" />) may also fool older browsers into accepting XHTML as HTML.

If you have to answer Yes to any of the questions in the Checklist below, you can save yourself a lot of grief by fixing those problems first before doing anything else. You will likely then be getting very close to having well-formed files.

Markup which is syntactically correct but semantically meaningless or void should be edited out before conversion. Examples are bogus spacing devices such as repeated empty paragraphs or linebreaks, empty tables, invisible spacing GIFs etc. XML uses stylesheets, and CSS3 means you won't need any of these.

Unfortunately there is rather a lot of work to do if your files are invalid: this is why many Webmasters now insist that only valid or well-formed files are used (and why you should instruct your designers to do the same), in order to avoid unnecessary manual maintenance and conversion costs later.

Checklist for invalid HTML
If your HTML files fall into this category (HTML created by most WYSIWYG editors is usually invalid) then they will almost certainly have to be converted manually, although if the deformities are regular and carefully constructed, the files may actually be almost well-formed, and you could write a program or script to do as described above. The oddities you may need to check for include:
Do the files contain markup syntax errors? For example, are there any missing angle-brackets, backslashes instead of forward slashes on end-tags, or elements which nest incorrectly (eg starting inside one element but ending outside it)?
Are there elements with missing end-tags that cannot be inferred by (eg) sgmlnorm?
Are there any URIs (eg in hrefs or srcs) which use Microsoft Windows-style backslashes instead of normal forward slashes?
Do the files contain markup which conflicts with HTML DTDs, such as headings or lists inside paragraphs, list items outside list environments, header elements like base preceding the first html, etc? (another sloppy editor trick)
Do the files use imaginary elements which are not in any known HTML DTD? (large amounts of these are used in proprietary markup systems masquerading as HTML). Although this is easy to transform to a DTDless well-formed file (because you don't have to define elements in advance) most proprietary or browser-specific extensions have never been formally defined, so it is often impossible to work out meaningfully where the element types can be used.
Are there any invalid (non-XML) characters in your files? Look especially for native Apple Mac Roman-8 characters left by careless designers; any of the illegal Windows characters (the 32 characters at decimal codes 128–159 inclusive) inserted by Microsoft editors; and any of the ASCII control characters 0–31 (except those permitted like TAB, CR, and LF). These must be converted to the correct characters in UTF-8 (or whatever you are using).
Do your files contain invalid (old Mosaic/Netscape-style) comments? Comments must look <!-- like this --> with double-dashes at each end and no other double (especially not multiple) dashes in between.
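A few of the checklist items lend themselves to automated screening. The sketch below assumes the file has already been read into a Python string; the function name checklist_report and the wording of its messages are invented for illustration, and the regular expressions are deliberately simple, so treat a clean result as a hint rather than proof.

```python
import re

# ASCII control characters 0-31, except TAB (9), LF (10), and CR (13).
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")
# Code points 128-159, typically Windows bytes decoded with the wrong encoding.
WINDOWS_RANGE = re.compile(r"[\x80-\x9f]")
# Anything that starts and ends like a comment.
COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)
# An href or src value containing a backslash.
BACKSLASH_URI = re.compile(r"""(?:href|src)\s*=\s*["']?[^"'\s>]*[\\]""", re.IGNORECASE)


def checklist_report(text):
    """Return a list of checklist problems found in text (empty if none)."""
    problems = []
    if CONTROL_CHARS.search(text):
        problems.append("ASCII control characters")
    if WINDOWS_RANGE.search(text):
        problems.append("characters in the 128-159 range")
    for match in COMMENT.finditer(text):
        body = match.group(1)
        # Extra double-dashes inside, or a dash right before the closing -->.
        if "--" in body or body.endswith("-"):
            problems.append("comment with extra dashes")
            break
    if BACKSLASH_URI.search(text):
        problems.append("backslash in a URI")
    return problems


# Hypothetical samples.
messy = 'ok <a href="images\\pic.gif">x</a> <!-- a -- b --> caf\x92s'
clean = '<p>fine <!-- ok --></p>'
print(checklist_report(messy))
print(checklist_report(clean))
```

Markup syntax errors, incorrect nesting, and elements not in any DTD are better found by running the file through a real parser than by pattern matching.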