
The XML FAQ — Frequently-Asked Questions about the Extensible Markup Language

Section 3: Authors

Q 3.2: How does XML handle white-space in my documents?

Parsers keep it all. It's up to the application to decide what to do with it.

All white-space, including linebreaks (Mac CR, Win CR/LF, Unix LF), TAB characters, normal spaces, non-breaking spaces, and all other forms of space, even between ‘structural’ elements where no text can ever appear, is passed by the parser to the application (processor, browser, formatter, viewer, converter, etc). The only change the parser makes is to normalize every linebreak to a single LF, whatever the platform convention; nothing is discarded. The parser identifies the context in which the white-space was found (Element Content, Character Data Content, or Mixed Content), if this information is available, eg from a DTD or Schema. It is therefore the application's responsibility, not the parser's, to decide what to do with such space.

This is one of the few really radical changes from SGML, where compulsory DTDs meant all white-space in element content could be recognised and discarded by the parser before it got anywhere near the application. The reason for the change is explained under Why? below.

There are two different types of white-space:

  • insignificant white-space (discardable white-space) which occurs between structural elements in element content. This is space which occurs where only other elements are allowed, where text never occurs. It is usually inserted automatically by editing software or manually by an author or editor to help with the visual clarity of the markup, in the knowledge that it has nothing to do with spacing you see when the document is processed or formatted, because it will be discarded. In XML, however, this space does get passed to the application, so you have to rely on the application to discard it. In SGML it got suppressed, which is why you can put all that extra space in old-style HTML documents and not worry about it;

  • significant white-space which occurs only inside elements which can contain text (Character Data Content, like an HTML title) or text and markup mixed together (Mixed Content, as in normal paragraphs). In XML, this space will still get passed to the application exactly as under SGML.

In both cases, it is the application's responsibility to handle the space correctly (XSLT, for example, has provided the xsl:strip-space and xsl:preserve-space declarations since version 1.0 to specify how to handle it). The parser must therefore inform the application that white-space has occurred in element content, if it can detect it, so that the application can discard it. (Users of SGML will recognise that this information is not in the ESIS, but it is in the Grove.)

 
<chapter> 
  <title> 
   My title for
   Chapter 1. 
  </title> 
    <para> 
text 
    </para> 
</chapter>
	  

In the example above, the application will receive all the pretty-printing linebreaks, TABs, and spaces between the elements as well as those embedded in the chapter title. It is the function of the application, not the parser, to decide which type of white-space to discard and which to retain. Many XML applications have configurable options to allow programmers or users to control how such white-space is handled.
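As a rough illustration of the point above, the following sketch feeds the same example to Python's standard xml.etree.ElementTree parser and prints what the application receives. The choice of parser and the collapse() helper are assumptions made for this illustration only; collapsing runs of space is just one policy an application might adopt, not something XML prescribes.

```python
import xml.etree.ElementTree as ET

SOURCE = """<chapter>
  <title>
   My title for
   Chapter 1.
  </title>
    <para>
text
    </para>
</chapter>"""

chapter = ET.fromstring(SOURCE)
title = chapter.find("title")
para = chapter.find("para")

# White-space between <chapter> and <title>: no text can occur here,
# yet the parser still reports it to the application.
print(repr(chapter.text))

# White-space inside the title, including the pretty-printing linebreaks.
print(repr(title.text))

# White-space between </title> and <para>, also reported.
print(repr(title.tail))


def collapse(text):
    """Hypothetical application policy: trim and collapse runs of white-space."""
    return " ".join(text.split())


print(collapse(title.text))  # My title for Chapter 1.
print(collapse(para.text))   # text
```

The parser itself discards nothing here; any clean-up happens only because the application chose to call collapse().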

White-space within markup

  • White-space between attributes is present for syntactic purposes (to let the parser see where one attribute stops and the next one starts) so it has no significance.

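  • White-space within CDATA attribute values is largely preserved: the parser replaces each literal TAB or linebreak with a single space, but does not trim the value or collapse runs of spaces. The value may be normalized further (eg in XSLT) for subsequent use (see White-space in Schemas).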

  • White-space in Processing Instructions is preserved but any leading white-space is stripped from the content value.
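The three cases in this list can be checked with a short sketch using Python's standard xml.dom.minidom (built on the expat parser). The element, attribute, and Processing Instruction names are made up for the illustration. Note that the parser replaces the literal linebreak and TAB inside the second attribute value with spaces, while runs of ordinary spaces are kept.

```python
from xml.dom import minidom

# Hypothetical document: extra space between attributes, a literal
# linebreak and TAB inside an attribute value, and a Processing
# Instruction with several spaces after its target.
SOURCE = '<doc   a="one  two"\n     b="line1\nline2\tend"><?demo    keep  this?></doc>'

doc = minidom.parseString(SOURCE)
root = doc.documentElement

# White-space between attributes is only a separator and leaves no trace.
attr_names = sorted(root.attributes.keys())
print(attr_names)

# Runs of spaces inside a value survive ...
a_value = root.getAttribute("a")
print(repr(a_value))

# ... but a literal linebreak or TAB is replaced by a single space.
b_value = root.getAttribute("b")
print(repr(b_value))

# The space separating the PI target from its content is dropped;
# white-space inside the content is kept.
pi = root.firstChild
print(pi.target, repr(pi.data))
```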

Wendell Piez writes:

White-space in Schemas

Schema validation, including DTD validation, may provide for whitespace normalization on attributes (and in XSD, text content), or at any rate specify it in such a way that in some parses (schema-aware), they will be normalized (and hence may appear differently from the data). So, a bit ouch. In XSLT they are left alone, but given such a parse before the XSLT, they are not.

Peter Flynn writes:

Why?

In SGML, a DTD is compulsory, always. A parser therefore always knows in advance whether white-space has occurred in Element Content (and can therefore be discarded) or in Mixed Content or Character Data Content (where it must be preserved). XML allows processing without a DTD or Schema, where it may be impossible to tell whether space should be discarded or not, so the general rule was imposed that all white-space must be reported to the application.
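When no DTD or Schema is available, the application has to supply that knowledge itself. The sketch below, again using Python's standard xml.dom.minidom, assumes the application knows out of band which elements have Element Content; the ELEMENT_CONTENT set and the strip_element_content_space() function are hypothetical names invented for this illustration, standing in for what a DTD would otherwise tell the parser.

```python
from xml.dom import minidom

# Hypothetical stand-in for DTD knowledge: elements that may contain
# only other elements, never text.
ELEMENT_CONTENT = {"chapter"}

XML_SPACE = " \t\r\n"  # the four characters XML treats as white-space


def strip_element_content_space(node):
    """Remove white-space-only text nodes from elements with Element Content."""
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE:
            is_blank = child.data.strip(XML_SPACE) == ""
            if node.nodeName in ELEMENT_CONTENT and is_blank:
                node.removeChild(child)
        elif child.nodeType == child.ELEMENT_NODE:
            strip_element_content_space(child)


doc = minidom.parseString(
    "<chapter>\n"
    "  <title> My title </title>\n"
    "  <para>some <em>mixed</em> <b>text</b></para>\n"
    "</chapter>"
)
strip_element_content_space(doc.documentElement)
result = doc.documentElement.toxml()
print(result)
```

The space between the em and b elements survives because para is not listed as Element Content, so the application treats its white-space as significant.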