Copyright © 2010 Silmaril Consultants
Rev: 2010-04-24T15:52:33+0100

C.6  How can I make my existing HTML files work in XML?

Either make them XHTML or use a different document type.

A.

Either convert them to conform to some new document type (with or without a DTD or Schema) and write a stylesheet to go with them; or edit them to conform to XHTML.

It is necessary to convert existing HTML files because XML does not permit end-tag minimisation (missing </p>, etc), unquoted attribute values, and a number of other SGML shortcuts which have been normal in most HTML DTDs. However, many HTML authoring tools already produce almost (but not quite) well-formed XML.
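As a rough illustration of why conversion is needed, the sketch below feeds two fragments to the XML parser in Python's standard library. The function name `is_well_formed` and both sample strings are invented for this example; they are not part of any tool mentioned in this FAQ.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True

# Typical SGML-style HTML: unquoted attribute value, omitted </P> end-tags
html_style = '<P CLASS=intro>First paragraph<P>Second paragraph'

# The same content with quoted attributes and explicit end-tags
xml_style = '<div><p class="intro">First paragraph</p><p>Second paragraph</p></div>'

print(is_well_formed(html_style))
print(is_well_formed(xml_style))
```

An XML parser rejects the first fragment outright rather than guessing at the missing pieces, which is the behaviour the paragraph above describes.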

You may be able to convert HTML to XHTML using Dave Raggett's HTML Tidy program, which can clean up some of the horrible formatting mess left behind by inadequate HTML editors, and even separate out some of the formatting to a stylesheet, but there is usually still some hand-editing to do.
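If you have many files, a Tidy run can be scripted. The following is only a sketch: it assumes a `tidy` executable is installed and on your PATH, and the helper names `tidy_command` and `run_tidy` and the file names are made up for illustration. Check the options against the documentation for your own version of Tidy.

```python
import shutil
import subprocess

def tidy_command(src, dest):
    """Build an argument list asking Tidy to write XHTML to dest."""
    # -asxhtml: convert to XHTML; -numeric: numeric rather than named entities
    return ["tidy", "-quiet", "-asxhtml", "-numeric", "-output", dest, src]

def run_tidy(src, dest):
    """Run Tidy on one file and return (exit status, diagnostics)."""
    if shutil.which("tidy") is None:
        raise FileNotFoundError("HTML Tidy is not installed or not on PATH")
    result = subprocess.run(tidy_command(src, dest),
                            capture_output=True, text=True)
    # Tidy conventionally exits with 0 (clean), 1 (warnings) or 2 (errors)
    return result.returncode, result.stderr

# Hypothetical usage:
#   status, messages = run_tidy("old-page.html", "new-page.xhtml")
```

The diagnostics returned on the error stream are a useful guide to the hand-editing that remains.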

Most modern website design programs, including DreamWeaver, still don't produce anything like clean HTML, largely because they are for making pages look pretty, rather than getting the information right. If you get the information right in XML first, and export it to a page design produced using a website design program, it's probably less important that the HTML is a mess. Using a website design program and its HTML pages as the sole repository of your information can be a dangerous mistake, though.

B.

Converting valid HTML to XHTML

If your HTML files are valid (full formal validation with an SGML parser, not just a simple syntax check), then try validating them as XHTML with an XML parser. If you have been creating clean HTML without embedded formatting then this process should throw up only mismatches in upper/lowercase element and attribute names, and empty elements (plus perhaps the odd non-standard element type name if you use them). Simple hand-editing or a short script should be enough to fix these problems.

If your HTML validly uses end-tag omission, this can be fixed automatically by a normalization program like sgmlnorm (from OpenSP, which is part of OpenJade) or by the sgml-normalize function in an editor like Emacs/psgml (don't be put off by the names, they both do XML).

If you have a lot of valid HTML files, you could write a script to do this in a programming language which understands SGML markup (such as Omnimark or SGMLC), or in one of the popular scripting languages (eg Perl, Python, or Tcl) using their SGML/XML libraries; or you could even use editor macros if you know what you're doing.

 

Converting to a new document type

If you want to move your files out of HTML into some other DTD entirely, there are many native XML application DTDs, and standard XML versions of popular DTDs like TEI, DocBook, or DITA to choose from. There is a pilot site run by CommerceNet (http://www.xmlx.com/) for the exchange of XML DTDs.

Alternatively you could just make up your own markup: so long as it makes sense and you create a well-formed file, you should be able to write a CSS or XSLT stylesheet and have your document displayed in a browser.

 

Converting invalid HTML to well-formed XHTML

If your files are invalid HTML (95% of the Web) they can be converted to well-formed DTDless files as follows:

  1. Replace the DOCTYPE Declaration with the XML Declaration <?xml version="1.0" encoding="iso-8859-1"?> (using the appropriate character encoding).

  2. If there was no DOCTYPE Declaration, just prepend the XML Declaration.

  3. Change any EMPTY elements (eg every BASE, ISINDEX, LINK, META, NEXTID and RANGE in the header, and every AREA, ATOPARA, AUDIOSCOPE, BASEFONT, BR, CHOOSE, COL, FRAME, HR, IMG, KEYGEN, LEFT, LIMITTEXT, OF, OVER, PARAM, RIGHT, SPACER, SPOT, TAB, and WBR in the body of the document) so that they end with /> instead, for example <img src="mypic.gif" alt="Picture"/>;

  4. Make all element names and attribute names lowercase;

  5. Ensure there are correctly-matched explicit end-tags for all non-EMPTY elements; eg every <para> must have a </para>, etc;

  6. Escape all < and & non-markup (ie literal text) characters as &lt; and &amp; respectively (there shouldn't be any isolated < characters to start with, anyway!);

  7. Ensure all attribute values are in matched quotes (values with embedded single quotes must be in double quotes, and vice versa—if you need both, use the &quot; character entity reference);

  8. Ensure all script URIs which have & as a field separator are changed to use &amp; or a semicolon instead.
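Several of the steps above can be roughed out in a script. The sketch below uses Python's standard `html.parser` module and covers steps 1 to 4, 6 and 7 only. It does not supply missing end-tags (step 5), it drops comments, and it treats the contents of script and style elements as ordinary text. The class name `XmlSerializer`, the function `html_to_xml_sketch`, and the shortened `EMPTY` set are assumptions made for this example.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# A shortened list of EMPTY element types; extend it to match your files
EMPTY = {"area", "base", "basefont", "br", "col", "frame", "hr",
         "img", "isindex", "link", "meta", "param", "wbr"}

class XmlSerializer(HTMLParser):
    """Re-emit parsed HTML using XML syntax conventions."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # html.parser has already lowercased the names (step 4)
        parts = [tag]
        for name, value in attrs:
            if value is None:          # minimised attribute, eg <hr noshade>
                value = name
            parts.append(f"{name}={quoteattr(value)}")   # step 7
        closing = "/>" if tag in EMPTY else ">"          # step 3
        self.out.append("<" + " ".join(parts) + closing)

    def handle_endtag(self, tag):
        if tag not in EMPTY:           # EMPTY elements were already closed
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data))  # step 6: & and < become references

def html_to_xml_sketch(html_text, encoding="utf-8"):
    parser = XmlSerializer()
    parser.feed(html_text)
    parser.close()                     # flush any buffered text
    # The DOCTYPE is ignored by the parser; prepend the XML Declaration
    # instead (steps 1 and 2)
    declaration = f'<?xml version="1.0" encoding="{encoding}"?>\n'
    return declaration + "".join(parser.out)

sample = '<P CLASS=intro>Fish &amp; chips<BR><IMG SRC=mypic.gif ALT="Picture"></P>'
print(html_to_xml_sketch(sample))
```

Because the end-tags are not repaired, the output is well-formed only if the input already closes every non-EMPTY element, so run the result through an XML parser before relying on it.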

Be aware that some obsolete HTML browsers may not accept XML-style EMPTY elements with the trailing slash, so the above changes may not be backwards-compatible. An alternative is to add a dummy end-tag to all EMPTY elements, so <img src="foo.gif"/> becomes <img src="foo.gif"></img>. This is valid XML but you must be able to guarantee no-one will ever put any text content in such elements. Adding a space before the closing slash in EMPTY elements (eg <img src="foo.gif" />) may also fool older browsers into accepting XHTML as HTML.

If you answer Yes to any of the questions in the tip ‘Checklist for invalid HTML’ below, you can save yourself a lot of grief by fixing those problems first before doing anything else. You will likely then be getting close to having well-formed files.

Markup which is syntactically correct but semantically meaningless or void should be edited out before conversion. Examples are spacing devices such as repeated empty paragraphs or linebreaks, empty tables, invisible spacing GIFs etc. XML uses stylesheets, so you won't need any of these.

Unfortunately there is rather a lot of work to do if your files are invalid: this is why many Webmasters now insist that only valid or well-formed files are used (and why you should instruct your designers to do the same), in order to avoid unnecessary manual maintenance and conversion costs later.

Checklist for invalid HTML

If your HTML files fall into this category (HTML created by most WYSIWYG editors is usually invalid) then they will almost certainly have to be converted manually, although if the deformities are regular and carefully constructed, the files may actually be almost well-formed, and you could write a program or script to do as described above. The oddities you may need to check for include:

  • Do the files contain markup syntax errors? For example, are there any missing angle-brackets, backslashes instead of forward slashes on end-tags, or elements which nest incorrectly (eg <B>those starting <I>inside another element</B> but ending outside</I> it)?

  • Are there any URIs (eg in hrefs or srcs) which use Microsoft Windows-style backslashes instead of normal forward slashes?

  • Do the files contain markup which conflicts with HTML DTDs, such as headings or lists inside paragraphs, list items outside list environments, header elements like base preceding the first html, etc? (another sloppy editor trick)

  • Do the files use imaginary elements which are not in any known HTML DTD? (large amounts of these are used in proprietary markup systems masquerading as HTML). Although this is easy to transform to a DTDless well-formed file (because you don't have to define elements in advance) most proprietary or browser-specific extensions have never been formally defined, so it is often impossible to work out meaningfully where the element types can be used.

  • Are there any invalid (non-XML) characters in your files? Look especially for native Apple Mac Roman-8 characters left by careless designers; any of the illegal characters (the 32 characters at decimal codes 128–159 inclusive) inserted by MS-Windows editors; and any of the ASCII control characters 0–31 (except those permitted like TAB, CR, and LF). These need to be converted to the correct characters in ISO 8859-1 (a common default in browsers), or the relevant plane of Unicode (in which case you will probably need to use UTF-8 as your document encoding).

  • Do your files contain invalid (old Mosaic/Netscape-style) comments? Comments must look <!-- like this --> with double-dashes at each end and no double (especially not multiple) dashes in between.
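Some of the checks in this list are mechanical enough to script. The sketch below looks for three of them: the control characters mentioned above, backslashes in href and src values, and comments containing stray double dashes. It assumes the file has been read as ISO 8859-1 text, so that bytes 128–159 appear as the characters U+0080 to U+009F. The function name `checklist_report` and the regular expressions are illustrative only, and regular expressions cannot detect nesting errors or DTD conflicts.

```python
import re

# Control characters 0-31 other than TAB (9), LF (10) and CR (13),
# plus the 32 characters at codes 128-159
BAD_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x80-\x9f]")

# Any comment, capturing the text between the opening and closing dashes
COMMENT = re.compile(r"<!--(.*?)-->", re.DOTALL)

# An href or src value containing a backslash
BACKSLASH_URI = re.compile(
    r"""(?:href|src)\s*=\s*["']?[^"'\s>]*\\[^"'\s>]*""", re.IGNORECASE)

def checklist_report(text):
    """Return a list of human-readable problems found in the text."""
    problems = []
    for m in BAD_CHARS.finditer(text):
        problems.append(
            f"illegal character U+{ord(m.group()):04X} at offset {m.start()}")
    for m in COMMENT.finditer(text):
        body = m.group(1)
        if "--" in body or body.endswith("-"):
            problems.append(f"invalid comment at offset {m.start()}")
    for m in BACKSLASH_URI.finditer(text):
        problems.append(f"backslash in URI at offset {m.start()}")
    return problems

# Hypothetical usage on a file read as ISO 8859-1:
#   with open("page.html", encoding="iso-8859-1") as f:
#       for line in checklist_report(f.read()):
#           print(line)
```

An empty list does not mean the file is valid; it only means these particular oddities were not found.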