The XML FAQ: What is my information? DATA or DOCUMENT?


The XML FAQ — Frequently-Asked Questions about the Extensible Markup Language

Section 4: Developers

Q 4.14: What is my information? DATA or DOCUMENT?

It depends on what you're using it for.

Some important distinctions exist between the major classes of XML applications and the way in which they are used.

Two classes of applications are usually referred to as ‘document’ and ‘data’ applications, and this is reflected in the software, which is usually (but not always) aimed at one class or the other.

Document-style applications: These are like traditional publishers' work: text and images in a structured environment, with fonts and formatting. In most cases this includes Web pages as well as material destined for PDF, eBooks, or print-like books and magazines. The hallmark of document applications is that they make heavy use of Mixed Content (e.g. subelements occurring within running text).
Data-style applications: These are found mostly in e-commerce, web services, and process or application control, with XML being used as a container for information being stored or passed between systems, usually unformatted and unseen by humans. Their hallmark is the absence of Mixed Content, and the prevalence of numeric or categorical data.
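
As a rough illustration of the Mixed Content hallmark, the following Python sketch checks whether any element in a document contains both subelements and text of its own. It uses the standard-library `xml.etree.ElementTree` module; the function names and the two sample documents are invented for this example, and the test is only a heuristic, not a formal definition of either class.

```python
import xml.etree.ElementTree as ET


def has_mixed_content(elem):
    """True if elem has child elements AND non-whitespace text of its own."""
    children = list(elem)
    if not children:
        return False
    # Text directly inside elem is split between elem.text (before the
    # first child) and each child's .tail (after that child).
    own_text = [elem.text] + [child.tail for child in children]
    return any(t and t.strip() for t in own_text)


def uses_mixed_content(xml_string):
    """True if any element in the document has mixed content."""
    root = ET.fromstring(xml_string)
    return any(has_mixed_content(e) for e in root.iter())


# Hypothetical samples, one per class of application.
DOCUMENT_STYLE = "<para>The <emph>hallmark</emph> of document applications.</para>"
DATA_STYLE = "<order><item sku='A1'><qty>2</qty><price>9.50</price></item></order>"

print(uses_mixed_content(DOCUMENT_STYLE))  # True
print(uses_mixed_content(DATA_STYLE))      # False
```

Indentation whitespace between elements is ignored here, which is why a pretty-printed data file would still be reported as having no Mixed Content.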

There is a third major area, Web Development, whose requirements are often hybrid, and span the features of both document and data applications because they contain partly static descriptive text and partly dynamic data.

Rick Jelliffe writes:

(in the xml-dev mailing list in thread Re: [xml-dev] XML mantra)
Non-document data tends to be XML only when there are other requirements or factors, such as longevity, validation, standardization, surrounding architecture, use of framework or library, team exposure, lack of datatypes, where the data forms a natural hierarchy, and so on.

While in theory it would be possible to use data-class software to write a novel, or document-class software to create invoices, it would probably be severely suboptimal. Because of the nature of the information used by the two classes, data-class applications tend to use Schemas, and document-class applications tend to use DTDs, but there is a considerable degree of overlap.

The way in which XML gets used in these two classes is also divided in two: XML can be used manually or under program control.

Manual usage: This means editing and maintaining the files with an editor, from the keyboard, seeing the information on the screen as you do so. This is suitable for individual documents, especially in the publishing field, for web pages, and for developers working on single instances such as sample files or web site templates. Manual processing also implies running production programs like formatters, converters, and database queries on a one-by-one basis, using the keyboard and mouse in the normal way. Much of the software for manual usage can be run from the command line, which makes it easy to use for one-off applications and in hidden applications like Web scripts.
Programmable usage: This means writing programs which call on software services from APIs, libraries, or the network to handle XML files from inside the program. XML files in data applications are almost never edited by hand. This is the normal method of operating for e-commerce applications, web automation, web services, and other process or application controls. There are libraries and APIs for many languages, including Java, C, and C++ as well as the usual scripting languages like Python, Perl, Tcl, Ruby, etc.
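
A minimal sketch of programmable usage in Python, assuming a made-up invoice format (the element names `invoice`, `line`, `qty`, and `price` are hypothetical, not from any standard). The program reads the XML through the standard-library `xml.etree.ElementTree` API and computes a total without anyone opening the file in an editor.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical data-style document, as it might be passed between systems.
INVOICE = """<invoice number="1001">
  <line><qty>2</qty><price>9.50</price></line>
  <line><qty>1</qty><price>20.00</price></line>
</invoice>"""


def invoice_total(xml_string):
    """Sum qty * price over every <line> child of the root element."""
    root = ET.fromstring(xml_string)
    total = Decimal("0")
    for line in root.findall("line"):
        qty = int(line.findtext("qty"))
        price = Decimal(line.findtext("price"))  # Decimal avoids float rounding
        total += qty * price
    return total


print(invoice_total(INVOICE))  # 39.00
```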

In addition to these axes, there are currently two different ways of processing XML: memory-mapped (the whole document is held in memory as a tree) or event-triggered (the document is reported piece by piece as it is read). They are usually referred to by the names of their original instantiations, the Document Object Model (DOM) and the Simple API for XML (SAX) respectively. Both use a model of document engineering based on a tree-like structure of hierarchical document markup known as a Grove (a collection of trees, effectively an in-memory map of the result of parsing the document markup). In this model, every ‘node’ (item of information), from the outermost element down through every element and attribute to each piece of unmarked text, can be identified. For applications using Schemas, a Post-Schema-Validation Infoset (PSVI, equivalent to a grove) is defined, which specifies what information a parser should make available to the application.

Joe Fawcett writes:

(in article <eFIrHKtCGHA.2920@tk2msftngp13.phx.gbl>)
Briefly ‘node’ is a generic term for any of the many types of XML building blocks, including element: <myElement/>; attribute: <myElement myAttribute="myValue"/>; and text node: <myElement>my Text Node</myElement>
There are also comments [Comment Declarations], Processing Instructions and the invisible Document Node representing the root of the XML document, as well as others.
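
The node types mentioned above can be seen by walking a small document with Python's standard-library `xml.dom.minidom`. This is only a sketch: the sample document, the `list_nodes` helper, and the label table are made up for illustration, and other DOM implementations may expose further node types.

```python
from xml.dom import minidom, Node

# Human-readable labels for the DOM node types this sketch cares about.
NODE_TYPE_NAMES = {
    Node.DOCUMENT_NODE: "document",
    Node.ELEMENT_NODE: "element",
    Node.TEXT_NODE: "text",
    Node.COMMENT_NODE: "comment",
    Node.PROCESSING_INSTRUCTION_NODE: "processing instruction",
}


def list_nodes(node, found=None):
    """Walk the tree depth-first, collecting (kind, nodeName) pairs."""
    if found is None:
        found = []
    found.append((NODE_TYPE_NAMES.get(node.nodeType, "other"), node.nodeName))
    # Attributes are nodes too, but they are not in childNodes.
    if node.nodeType == Node.ELEMENT_NODE:
        attrs = node.attributes
        for i in range(attrs.length):
            found.append(("attribute", attrs.item(i).nodeName))
    for child in node.childNodes:
        list_nodes(child, found)
    return found


# Hypothetical sample containing one of each node type discussed.
SAMPLE = (
    '<?xml-stylesheet href="s.css"?>'
    '<myElement myAttribute="myValue">my Text Node<!-- a comment --></myElement>'
)

doc = minidom.parseString(SAMPLE)
for kind, name in list_nodes(doc):
    print(kind, name)
```

The first pair reported is the Document Node itself, which never appears in the markup but sits above the outermost element.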

Grossly oversimplified, a DOM-based application reads an entire XML document into memory and then provides programmable access to every node in every tree in the grove, whereas a SAX-based application reads the XML document sequentially and triggers an event for each node as it is encountered, running whatever rules or actions have been pre-programmed for that event. (In reality it's more complex than that, and the two methods share many concepts.)
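
The contrast can be sketched in Python using the two standard-library bindings, `xml.dom.minidom` for the DOM style and `xml.sax` for the SAX style. The catalog document and the helper names (`titles_dom`, `titles_sax`, `TitleHandler`) are invented for this example; both functions extract the same titles, one by querying a tree already in memory, the other by reacting to events during parsing.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample document.
CATALOG = (
    b"<catalog>"
    b"<book id='1'><title>A</title></book>"
    b"<book id='2'><title>B</title></book>"
    b"</catalog>"
)


def titles_dom(data):
    """DOM style: parse everything first, then query the in-memory tree."""
    doc = minidom.parseString(data)
    return [t.firstChild.data for t in doc.getElementsByTagName("title")]


class TitleHandler(xml.sax.ContentHandler):
    """SAX style: pre-programmed actions fired as the parser meets each node."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so collect and join later.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False


def titles_sax(data):
    handler = TitleHandler()
    xml.sax.parseString(data, handler)
    return handler.titles


print(titles_dom(CATALOG))  # ['A', 'B']
print(titles_sax(CATALOG))  # ['A', 'B']
```

The SAX version never holds the whole document in memory, which tends to matter for very large inputs; the DOM version is shorter and allows nodes to be revisited in any order.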

Both models provide an abstract API for constructing, accessing, and manipulating XML documents. A binding of the abstract API to a particular programming language provides a concrete API. Vendors provide concrete APIs which let you use one method or the other to query and manipulate XML documents. Both types of parser have been implemented in many languages and under many operating systems and interfaces. There are FAQs for both DOM and SAX.