The XML FAQ: What are these terms DTDless, valid, and well-formed?


The XML FAQ — Frequently-Asked Questions about the Extensible Markup Language

Section 4: Developers

Q 4.3: What are these terms DTDless, valid, and well-formed?

Well-formed means just syntactically correct; valid means it conforms to a DTD or Schema.

XML lets you use a Schema or Document Type Definition (DTD) to describe the markup (elements and other constructs) available in any specific type of document. However, designing and constructing a Schema or DTD can be complex, so XML also lets you work without one. DTDless operation means you can invent markup without having to define it formally, provided you stick to the well-formedness rules of XML syntax.

To make this work, a DTDless file is assumed to define its own markup purely by the existence and location of elements where you create them. When an XML application encounters a DTDless file, it builds its internal model of the document structure while it reads it, because it has no Schema or DTD to tell it what to expect. There must therefore be no surprises or ambiguous syntax. To achieve this, the document must be ‘well-formed’ (must follow the rules).

To understand why this concept is needed, look at standard HTML as an example:

The <img> element is declared (in the [SGML] DTDs for HTML) as EMPTY, so it doesn't have an end-tag (there is no such thing as </img>);
Many other HTML elements (such as <p>) allow you to omit the end-tag for brevity.
If an XML processor reads an HTML file without knowing this (because it isn't using a DTD), and it encounters an <img> or a <p> (or any other start-tag), it would have no way to know whether or not to expect an end-tag. This makes it impossible to know if the rest of the file is correct or not, because the processor now has no evidence of whether it is still inside an element or has finished with it.

Well-formed documents therefore require start-tags and end-tags on every normal element, and any EMPTY elements must be made unambiguous, either by using normal start-tags and end-tags, or by appending a slash to the name of the start-tag before the closing > as a signal that there will be no separate end-tag.
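The difference can be seen with any non-validating parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name parses and the file name pic.png are made up for illustration and are not part of the FAQ.

```python
import xml.etree.ElementTree as ET


def parses(text):
    """Return True if a non-validating parser accepts the text as well-formed."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# HTML habit: <img> has no end-tag and no slash, so the parser is still
# "inside" img when it meets </p>.
html_style = '<p><img src="pic.png"></p>'
# XML empty-element tag: the slash says no end-tag will follow.
slash_form = '<p><img src="pic.png"/></p>'
# Ordinary start-tag and end-tag with nothing in between.
paired_form = '<p><img src="pic.png"></img></p>'

for label, doc in [("html_style", html_style),
                   ("slash_form", slash_form),
                   ("paired_form", paired_form)]:
    print(label, parses(doc))
```

Only the HTML-style form is rejected; the other two are equivalent as far as the parser is concerned.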

All XML documents, both DTDless and valid, must be well-formed. They must start with an XML Declaration if necessary (for example, identifying the character encoding or using the Standalone Document Declaration):

 
<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?> 
<foo> 
  <bar>...<blort/>...</bar> 
</foo>
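As a rough illustration of why the encoding part of the XML Declaration can be necessary, the sketch below feeds the same bytes to Python's xml.etree.ElementTree with and without the declaration. The sample content (the word "café") is an assumption added for the demonstration; the element names are those of the example above.

```python
import xml.etree.ElementTree as ET

# Bytes as they might sit on disk. 0xE9 is "e acute" in ISO-8859-1,
# but on its own it is not a valid UTF-8 sequence.
declared = (b'<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>'
            b'<foo><bar>caf\xe9<blort/></bar></foo>')
undeclared = b'<foo><bar>caf\xe9<blort/></bar></foo>'

# With the declaration, the parser decodes the byte as ISO-8859-1.
root = ET.fromstring(declared)
bar_text = root.find("bar").text
print(bar_text)

# Without it, the parser assumes UTF-8 and rejects the byte.
try:
    ET.fromstring(undeclared)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False
print(undeclared_ok)
```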

David Brownell writes:

XML that's just well-formed doesn't need to use a Standalone Document Declaration at all. Such declarations are there to permit certain speedups when processing documents while ignoring external parameter entities — basically, you can't rely on external declarations in standalone documents. The types that are relevant are entities and attributes. Standalone documents must not require any kind of attribute value normalisation or defaulting, otherwise they are invalid.

It's also possible to use a Document Type Declaration with DTDless files, even though there is no Document Type to refer to:

Richard Lander writes:

If you need character entities [other than the five built-in ones] in a DTDless file, you can declare them in an internal subset without referencing anything other than the root element type:
 
<?xml version="1.0" standalone="yes"?> 
<!DOCTYPE example [ 
<!ENTITY mdash " — "> 
]> 
<example>Hindsight&mdash;a wonderful thing.</example>
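A quick way to see the internal subset at work is to parse the same text with and without it. This sketch uses Python's xml.etree.ElementTree, which expands entities declared in the internal subset; writing the entity value as the character reference &#x2014; is an assumption made here so that the sample stays pure ASCII.

```python
import xml.etree.ElementTree as ET

with_subset = """<?xml version="1.0" standalone="yes"?>
<!DOCTYPE example [
<!ENTITY mdash "&#x2014;">
]>
<example>Hindsight&mdash;a wonderful thing.</example>"""

# Same document, but nothing declares mdash.
without_subset = "<example>Hindsight&mdash;a wonderful thing.</example>"

# The declared entity is replaced by its value while parsing.
expanded = ET.fromstring(with_subset).text
print(expanded)

# An undeclared entity reference makes the file not well-formed.
try:
    ET.fromstring(without_subset)
    undeclared_ok = True
except ET.ParseError as err:
    undeclared_ok = False
    print("rejected:", err)
```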

So…here are the rules:

Peter Flynn writes:

Well-formed XML
All tags must be balanced: that is, every element which may contain character data or sub-elements must have both the start-tag and the end-tag present (omission is not allowed except for EMPTY elements, see below);
All attribute values must be in quotes. The single-quote character (the apostrophe) may be used if the value contains a double-quote character, and vice versa. If you need isolated quotes as data as well, you can use &apos; or &quot;. Do not under any circumstances use the automated typographic (‘curly’) inverted commas substituted by some wordprocessors for quoting attribute values: keep them for the values themselves (eg surname="O’Flynn" rather than surname="O'Flynn");
Any EMPTY elements (eg those with no end-tag like HTML's <img>, <hr>, <br>, and others) must either end with /> or they must look like non-EMPTY elements by having a real end-tag (but no content). Example: <hr> would become either <hr/> or <hr></hr> (with nothing in between).
There must not be any isolated markup-start characters (< or &) in your text data. They must be given as &lt; and &amp; respectively, and the sequence ]]> may only occur as the end of a CDATA marked section: if you are using it for any other purpose it must be given as ]]&gt;.
Elements must nest inside each other properly (no overlapping markup, same as for HTML);
Comment declarations (<!-- ... -->) can only appear between or inside elements, or in character data content, not inside a start-tag or end-tag. The sequence -- (hyphen hyphen) must not occur by itself in the text of a comment because it can only be used as part of the comment close delimiter (followed by the >);
DTDless well-formed documents may use attributes on any element, but the attributes are all assumed to be of type CDATA. You cannot use ID/IDREF attribute types for parser-checked cross-referencing in DTDless documents.
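XML files with no DTD are considered to have &lt;, &gt;, &apos;, &quot;, and &amp; predefined and thus available for use. With a DTD, these five are still recognised by every XML processor, but the Specification recommends declaring them anyway for interoperability; any other character entities used must be declared.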
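Several of these rules can be checked mechanically. The sketch below, again using only Python's standard library, runs one deliberately broken sample per rule through a non-validating parser, then uses the escape and quoteattr helpers from xml.sax.saxutils to produce text and attribute values that are safe to embed. The sample documents and the helper name is_well_formed are illustrative assumptions.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def is_well_formed(text):
    """Return True if a non-validating parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# One sample per rule; each breaks exactly that rule.
broken = {
    "missing end-tag": "<doc><item>text</doc>",
    "unquoted attribute value": "<doc id=a1/>",
    "isolated ampersand": "<doc>fish & chips</doc>",
    "isolated less-than": "<doc>1 < 2</doc>",
    "overlapping elements": "<doc><b><i>text</b></i></doc>",
    "double hyphen inside a comment": "<doc><!-- a -- b --></doc>",
}
results = {rule: is_well_formed(text) for rule, text in broken.items()}
for rule, ok in results.items():
    print(f"{rule}: {'accepted' if ok else 'rejected'}")

# escape() turns &, < and > into entity references; quoteattr() escapes the
# value and picks a quote character that does not clash with its content.
text = escape("fish & chips < 3")
attr = quoteattr('say "hi"')
fixed = f"<doc note={attr}>{text}</doc>"
print(fixed)

root = ET.fromstring(fixed)
```

Parsing the repaired document gives back the original characters, because the parser resolves the entity references again on the way in.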

Peter Flynn writes:

Valid XML
Valid XML files are well-formed files which have a Document Type Definition (DTD) or Schema and which conform to it. They must already be well-formed, so all the rules above apply.
A valid file begins with a Document Type Declaration specifying a DTD, or code specifying a W3C Schema. It may have an optional XML Declaration prepended.
 
<?xml version="1.0"?> 
<!DOCTYPE advert SYSTEM "http://www.foo.org/ad.dtd"> 
<advert>
 <headline>...<pic/>...</headline> 
 <text>...</text>
</advert>

The XML Specification predefines an SGML Declaration for XML which is fixed for all instances and is therefore hard-coded into all XML software and never specified separately (except when using an SGML/XML switchable validator like onsgmls: see below).

Peter Flynn writes:

The SGML Declaration for XML has been removed from the text of the Specification but is available as a separate document. As this appears to suffer occasionally from bitrot or neglect, there is a copy here (WebSGML TC) and here (Extended Naming Rules TC), and a version for onsgmls here.

The specified DTD must be accessible to the XML processor using the URI supplied in the SYSTEM Identifier, either by being available locally (ie the user already has a copy on disk), or by being retrievable via the network. Note that DTD specifications must be URIs (local, relative, or absolute). Platform-specific filesystem references (eg C:\dtds\my.dtd) are not URIs and cannot be used: use the file:///C|/dtds/my.dtd format instead.
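If SYSTEM Identifiers are generated by a script, the conversion from a Windows path to a file URI can be sketched as follows. The function name is hypothetical, and the sketch only handles absolute paths that begin with a drive letter. It writes the drive as C: rather than the older C| form shown above; that choice is an assumption about what current software expects, so check what your own processor accepts.

```python
from urllib.parse import quote


def windows_path_to_file_uri(path):
    """Turn an absolute drive-letter path into a file: URI (sketch only)."""
    # Backslashes become slashes; characters that are unsafe in a URI,
    # such as spaces, are percent-encoded. Slashes and the drive colon stay.
    return "file:///" + quote(path.replace("\\", "/"), safe="/:")


system_id = windows_path_to_file_uri(r"C:\dtds\my.dtd")
doctype = f'<!DOCTYPE advert SYSTEM "{system_id}">'
print(doctype)

spaced = windows_path_to_file_uri(r"C:\my dtds\ad.dtd")
print(spaced)
```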

It is possible (many people would say preferable) to supply a Formal Public Identifier with the PUBLIC keyword, and use an XML Catalog to dereference it, but the Specification mandates a SYSTEM Identifier so this must still be supplied after the PUBLIC identifier: no further keyword is needed. A PUBLIC identifier constitutes a claim to ownership only of the identifier, not to the DTD itself (although in many cases that is implied).

 
<!DOCTYPE advert PUBLIC
   "+//Silmaril//DTD Foo Corp Advertisements//EN"
   "http://www.foo.org/ad.dtd"> 
<advert>...</advert>
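The idea of dereferencing a PUBLIC identifier can be sketched in a few lines. The code below reads both identifiers from the Document Type Declaration with Python's xml.parsers.expat and looks the PUBLIC identifier up in a plain dictionary. The dictionary is only a stand-in for a real XML Catalog file, and the local path in it, like the function names, is hypothetical.

```python
import xml.parsers.expat


def read_doctype(xml_text):
    """Return the root name, SYSTEM and PUBLIC identifiers of a document."""
    found = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found.update(name=name, system_id=system_id, public_id=public_id)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_text, True)
    return found


# Stand-in for an XML Catalog: PUBLIC identifier -> local copy (hypothetical).
CATALOG = {
    "+//Silmaril//DTD Foo Corp Advertisements//EN":
        "file:///usr/local/share/dtds/ad.dtd",
}


def resolve_dtd(doctype):
    """Prefer the catalog entry; fall back to the SYSTEM Identifier."""
    return CATALOG.get(doctype["public_id"], doctype["system_id"])


document = """<!DOCTYPE advert PUBLIC
   "+//Silmaril//DTD Foo Corp Advertisements//EN"
   "http://www.foo.org/ad.dtd">
<advert>...</advert>"""

doctype = read_doctype(document)
print(doctype)
print(resolve_dtd(doctype))
```

A document that supplies only a SYSTEM Identifier has no PUBLIC identifier to look up, so the sketch falls through to the URI given in the document.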

The test for validity is that a validating parser finds no errors in the file: it must conform absolutely to the definitions and declarations in the DTD.
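It is worth seeing what happens when the parser is not a validating one. Python's standard library has no validating parser, so the sketch below can only show the other side of the coin: a document that contradicts its own declarations is still accepted, because xml.etree.ElementTree checks well-formedness only. The element declarations are an assumed internal subset written for this example, not the contents of the ad.dtd mentioned above.

```python
import xml.etree.ElementTree as ET

# The declarations say advert contains a headline followed by a text element.
# The instance omits headline and adds an undeclared element, so it is
# well-formed but not valid.
document = """<!DOCTYPE advert [
<!ELEMENT advert (headline, text)>
<!ELEMENT headline (#PCDATA)>
<!ELEMENT text (#PCDATA)>
]>
<advert>
 <text>...</text>
 <surprise/>
</advert>"""

# A non-validating parser reads the declarations but does not enforce them.
root = ET.fromstring(document)
children = [child.tag for child in root]
print(children)
```

To detect the two errors you would need to run the file through a validating parser instead.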

XML (W3C) Schemas are not usually linked directly from within an XML document instance in the way that DTDs are: the relevant Schema (XSD file) for a document instance is normally specified to the parser separately, either by file system reference, or using a Target Namespace.