D.2 What are these terms DTDless, valid, and well-formed?
XML lets you use a Schema or Document Type Definition (DTD) to describe the markup (elements and other constructs) available in any specific type of document. However, designing and constructing Schemas and DTDs can be complex, so XML also lets you work without one. DTDless operation means you can invent markup without having to define it formally, provided you stick to the rules of XML syntax.
To make this work, a DTDless file is assumed to define its own markup purely by the existence and location of elements where you create them. When an XML application encounters a DTDless file, it builds its internal model of the document structure while it reads it, because it has no Schema or DTD to tell it what to expect. There must therefore be no surprises or ambiguous syntax. To achieve this, the document must be ‘well-formed’ (must follow the rules).
To understand why this concept is needed, look at standard HTML as an example:
The img element is declared (in the DTDs for HTML) as EMPTY, so it doesn't have an end-tag (there is no such thing as </img>);
Many other HTML elements (such as p, the paragraph element) allow you to omit the end-tag for brevity when using the SGML version of HTML.
If an XML processor reads an HTML file without knowing this (because it isn't using a DTD), and it encounters an <img> or a <p> (or any other start-tag), it would have no way to know whether or not to expect an end-tag. This makes it impossible to know if the rest of the file is correct or not, because it now has no evidence of whether it is still inside an element or has finished with it.
Well-formed documents therefore require start-tags and end-tags on every normal element, and any EMPTY elements must be made unambiguous, either by using normal start-tags and end-tags, or by appending a slash to the name of the start-tag before the closing > as a sign that there will be no separate end-tag.
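As a rough illustration of the difference, here is a minimal sketch that assumes Python's standard xml.etree.ElementTree parser, which is non-validating and reads no DTD. The helper name is_well_formed and the sample markup are invented for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if a non-validating parser accepts the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# HTML-style markup: the parser cannot know that img has no end-tag.
html_style = '<p>A picture: <img src="cat.png"></p>'
# The same content with the EMPTY element made unambiguous, two ways.
slash_style = '<p>A picture: <img src="cat.png"/></p>'
pair_style = '<p>A picture: <img src="cat.png"></img></p>'

for sample in (html_style, slash_style, pair_style):
    print(is_well_formed(sample), sample)
```

Only the first sample should be rejected: without a DTD the parser treats the unclosed img as still open when it meets </p>.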
All XML documents, both DTDless and valid, must be well-formed. They must start with an XML Declaration if necessary (for example, identifying the character encoding or using the Standalone Document Declaration):
<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>
<foo>
<bar>...<blort/>...</bar>
</foo>
XML that's just well-formed doesn't need to use a Standalone Document Declaration at all. Such declarations are there to permit certain speedups when processing documents while ignoring external parameter entities: basically, you can't rely on external declarations in standalone documents. The declarations that matter here are those for entities and attributes. Standalone documents must not depend on external declarations for any kind of attribute value normalisation or defaulting, otherwise they are invalid.
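If you want to see what a parser makes of the Standalone Document Declaration, the following sketch may help. It assumes Python's xml.parsers.expat module, whose XmlDeclHandler callback reports the standalone clause; the function name read_standalone is made up for the example.

```python
import xml.parsers.expat


def read_standalone(document: bytes):
    """Return the standalone flag reported for the XML Declaration.

    expat reports 1 for standalone="yes", 0 for standalone="no", and -1
    when the clause is omitted. None here means the document had no
    XML Declaration at all.
    """
    seen = []

    def on_xml_decl(version, encoding, standalone):
        seen.append(standalone)

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_xml_decl
    parser.Parse(document, True)
    return seen[0] if seen else None


declared = (b'<?xml version="1.0" encoding="iso-8859-1" standalone="yes"?>'
            b'<foo><bar>...<blort/>...</bar></foo>')
print(read_standalone(declared))
print(read_standalone(b'<foo><bar>...<blort/>...</bar></foo>'))
```

Both documents are well-formed; the second simply says nothing about being standalone.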
It's also possible to use a Document Type Declaration with DTDless files, even though there is no Document Type to refer to:
If you need character entities [other than the five built-in ones] in a DTDless file, you can declare them in an internal subset without referencing anything other than the root element type:
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE example [
<!ENTITY mdash "---">
]>
<example>Hindsight&mdash;a wonderful thing.</example>
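To check how an internal-subset declaration behaves in practice, here is a small sketch assuming Python's standard xml.etree.ElementTree parser (other parsers may report errors differently). The constant names are invented for the example.

```python
import xml.etree.ElementTree as ET

WITH_SUBSET = (
    '<?xml version="1.0" standalone="yes"?>\n'
    '<!DOCTYPE example [\n'
    '  <!ENTITY mdash "---">\n'
    ']>\n'
    '<example>Hindsight&mdash;a wonderful thing.</example>'
)
WITHOUT_SUBSET = '<example>Hindsight&mdash;a wonderful thing.</example>'

# The declared entity is replaced by its replacement text.
print(ET.fromstring(WITH_SUBSET).text)

# Without the declaration the same reference is an error.
try:
    ET.fromstring(WITHOUT_SUBSET)
except ET.ParseError as err:
    print("rejected:", err)

# The five built-in entities need no declaration in a DTDless file.
builtins = ET.fromstring('<example>&lt;&amp;&gt;&apos;&quot;</example>')
print(builtins.text)
```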
All tags must be balanced: that is, every element which may contain character data or sub-elements must have both the start-tag and the end-tag present (omission is not allowed except for EMPTY elements, see below);
All attribute values must be in quotes. The single-quote character (the apostrophe) may be used if the value contains a double-quote character, and vice versa. If you need isolated quotes as data as well, you can use &apos; or &quot;. Do not under any circumstances use the automated typographic (‘curly’) inverted commas substituted by some word processors for quoting attribute values.
Any EMPTY elements (eg those with no end-tag, like HTML's img, hr, and br and others) must either end with /> or they must look like non-EMPTY elements by having a real end-tag (but no content). Example: <br> would become either <br/> or <br></br> (with nothing in between).
There must not be any isolated markup-start characters (< or &) in your text data. They must be given as &lt; and &amp; respectively, and the sequence ]]> may only occur as the end of a CDATA marked section: if you are using it for any other purpose it must be given as ]]&gt;.
Elements must nest inside each other properly (no overlapping markup, same as for HTML);
DTDless well-formed documents may use attributes on any element, but the attributes are all assumed to be of type CDATA. You cannot use ID/IDREF attribute types for parser-checked cross-referencing in DTDless documents.
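XML files with no DTD are considered to have &lt;, &gt;, &apos;, &quot;, and &amp; predefined and thus available for use. With a DTD, every other character entity you use must be declared; for interoperability, the Specification says valid documents should declare these five as well.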
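The quoting and escaping rules above can be applied mechanically. This is a minimal sketch assuming Python's xml.sax.saxutils helpers escape and quoteattr; the element name note, the attribute name size, and the sample strings are invented for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Text data containing <, & and the sequence ]]> outside a CDATA section.
text = 'if (a < b && c[d[0]]> 1) { ... }'
safe_text = escape(text)  # replaces &, < and > with entity references

# A value with only a double quote is delimited with apostrophes.
attr_plain = quoteattr('5" disk')
# A value with both kinds of quote needs an entity reference for one.
attr_both = quoteattr('5" disk, it\'s old')

element = f"<note size={attr_both}>{safe_text}</note>"
parsed = ET.fromstring(element)

print(element)
print(parsed.text == text, parsed.get("size"))
```

Parsing the result gives back the original text and attribute value, which is a quick way to confirm that nothing was lost in the escaping.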
Valid XML files are well-formed files which have a Document Type Definition (DTD) and which conform to it. They must already be well-formed, so all the rules above apply.
A valid file begins with a Document Type Declaration, but may have an optional XML Declaration prepended:
<?xml version="1.0"?>
<!DOCTYPE advert SYSTEM "http://www.foo.org/ad.dtd">
<advert>
<headline>...<pic/>...</headline>
<text>...</text>
</advert>
The XML Specification predefines an SGML Declaration for XML which is fixed for all instances and is therefore hard-coded into all XML software and never specified separately (except when using an SGML/XML switchable validator like onsgmls: see below).
The SGML Declaration for XML has been removed from the text of the Specification but is available as a separate document. As this appears to be suffering from bitrot or neglect, there is a copy here and a version for onsgmls here.
The specified DTD must be accessible to the XML processor using the URI supplied in the SYSTEM Identifier, either by being available locally (ie the user already has a copy on disk), or by being retrievable via the network. Note that SYSTEM Identifiers must be URIs (local, relative, or absolute). Proprietary filesystem references (eg C:\dtds\my.dtd) are not URIs and cannot be used: use the file:///C:/dtds/my.dtd format instead (older software may expect the file:///C|/dtds/my.dtd form).
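If you have to turn a Windows path into a usable SYSTEM Identifier, something like the following sketch may do. It assumes Python's pathlib and urllib.parse, handles only absolute drive-letter paths (not UNC shares), and produces the colon form of the drive letter rather than the older vertical-bar form. The function name windows_path_to_file_uri is made up for the example.

```python
from pathlib import PureWindowsPath
from urllib.parse import quote


def windows_path_to_file_uri(path: str) -> str:
    """Turn an absolute drive-letter path into a file: URI."""
    # C:\dtds\my.dtd becomes C:/dtds/my.dtd
    posix_form = PureWindowsPath(path).as_posix()
    # Percent-encode anything unsafe, keeping slashes and the drive colon.
    return "file:///" + quote(posix_form, safe="/:")


system_id = windows_path_to_file_uri(r"C:\dtds\my.dtd")
print(f'<!DOCTYPE advert SYSTEM "{system_id}">')
```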
It is possible (many people would say preferable) to supply a Formal Public Identifier with the PUBLIC keyword, and use an XML Catalog to dereference it, but the Specification mandates a SYSTEM Identifier so this must still be supplied (after the PUBLIC identifier: no further keyword is needed):
<!DOCTYPE advert PUBLIC "-//Foo, Inc//DTD Advertisements//EN"
"http://www.foo.org/ad.dtd">
<advert>...</advert>
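To see which identifiers a parser picks up from the Document Type Declaration, here is a sketch assuming Python's xml.parsers.expat module and its StartDoctypeDeclHandler callback. Note that expat is non-validating and, with its default settings as used here, does not retrieve the DTD, so this only reads the declaration and says nothing about validity. The function name read_doctype is invented for the example.

```python
import xml.parsers.expat


def read_doctype(document: str):
    """Return (root name, PUBLIC identifier, SYSTEM identifier) or None."""
    found = []

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found.append((name, public_id, system_id))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(document, True)
    return found[0] if found else None


with_public = (
    '<!DOCTYPE advert PUBLIC "-//Foo, Inc//DTD Advertisements//EN" '
    '"http://www.foo.org/ad.dtd">'
    '<advert>...</advert>'
)
with_system = (
    '<!DOCTYPE advert SYSTEM "http://www.foo.org/ad.dtd">'
    '<advert>...</advert>'
)

print(read_doctype(with_public))
print(read_doctype(with_system))
```

With only a SYSTEM Identifier present, the PUBLIC slot comes back as None.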
The test for validity is that a validating parser finds no errors in the file: it must conform absolutely to the definitions and declarations in the DTD.
XML (W3C) Schemas are not usually linked directly from within an XML document instance in the way that DTDs are: the relevant Schema (XSD file) for a document instance is normally specified to the parser separately, either by file system reference, or using a Target Namespace.