Copyright © 2010 Silmaril Consultants
Rev: 2010-03-01T00:19:00+0000

C.4  How does XML handle white-space in my documents?

Parsers keep it all. It's up to the application to decide what to do with it.

All white-space, including linebreaks, TAB characters, and normal spaces, even between ‘structural’ elements where no text can ever appear, is passed by the parser to the application (browser, formatter, viewer, converter, etc.). The parser does normalize line-ends, so whatever the platform convention (Mac CR, Win CR/LF, Unix LF) each linebreak arrives as a single LF, but it does not discard any of them. Where the information is available to the parser (e.g. from a DTD or Schema), it also identifies the context in which the white-space was found (element content, data content, or mixed content). This means it is the application's responsibility, not the parser's, to decide what to do with such space.

The parser must inform the application that white-space has occurred in element content, if it can detect it. (Users of SGML will recognize that this information is not in the ESIS, but it is in the Grove.)

 
<chapter> 
  <title> 
   My title for
   Chapter 1. 
  </title> 
    <para> 
text 
    </para> 
</chapter>
	  

In the example above, the application will receive all the pretty-printing linebreaks, TABs, and spaces between the elements as well as those embedded in the chapter title. It is the function of the application, not the parser, to decide which type of white-space to discard and which to retain. Many XML applications have configurable options to allow programmers or users to control how such white-space is handled.
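As a rough illustration of this division of labour, the following Python sketch parses the same document with the standard library's `xml.etree.ElementTree` and then applies one possible application-level policy (collapsing runs of white-space in the title). The choice of library and the collapsing rule are assumptions made for the example, not something XML itself prescribes.

```python
import xml.etree.ElementTree as ET

# The sample document, written with explicit \n so the white-space is visible.
doc = (
    "<chapter>\n"
    "  <title>\n"
    "   My title for\n"
    "   Chapter 1.\n"
    "  </title>\n"
    "    <para>\n"
    "text\n"
    "    </para>\n"
    "</chapter>"
)

root = ET.fromstring(doc)
title = root.find("title")

# The parser has kept everything: the pretty-printing space between
# <chapter> and <title>, and the linebreaks inside the title itself.
print(repr(root.text))
print(repr(title.text))

# The application decides what to do with it. Here (an assumed policy)
# it collapses every run of white-space in the title to a single space.
normalized = " ".join(title.text.split())
print(normalized)
```

A different application (an editor that must round-trip the file byte for byte, say) could just as reasonably keep every character it was given.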

Why?

In SGML, a DTD is always compulsory, so a parser always knows in advance whether white-space has occurred in element content (where it can be discarded) or in mixed content or PCDATA (where it must be preserved). XML allows processing without a DTD or Schema, in which case it may be impossible to tell whether space should be discarded or not. The general rule was therefore imposed that all white-space must be reported to the application.
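To see why the decision cannot safely be made without a DTD, consider what an application has to do when it has none. The sketch below, using Python's `xml.dom.minidom`, applies a guess: white-space-only text is dropped from any element that contains child elements and no other text. The function name `strip_element_content_space` and the rule itself are hypothetical, chosen only for illustration.

```python
from xml.dom import minidom, Node


def strip_element_content_space(element):
    """Heuristic only: remove white-space-only text nodes from elements
    that have child elements and no non-blank text of their own."""
    children = list(element.childNodes)
    has_elements = any(c.nodeType == Node.ELEMENT_NODE for c in children)
    has_text = any(
        c.nodeType == Node.TEXT_NODE and c.data.strip() for c in children
    )
    for child in children:
        if child.nodeType == Node.ELEMENT_NODE:
            strip_element_content_space(child)
        elif child.nodeType == Node.TEXT_NODE and has_elements and not has_text:
            element.removeChild(child)


dom = minidom.parseString(
    "<chapter>\n  <title> My title </title>\n  <para>text</para>\n</chapter>"
)
strip_element_content_space(dom.documentElement)
result = dom.documentElement.toxml()
print(result)
```

The guess works for this document, but it would wrongly delete the significant space in mixed content such as `<para><b>bold</b> <i>italic</i></para>`. Only a DTD or Schema can tell the two cases apart, which is why the parser reports everything and leaves the choice to the application.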