The XML FAQ: Not the XML FAQ


The XML FAQ — Frequently-Asked Questions about the Extensible Markup Language

Section 5: Appendices

Q 5.3: Not the XML FAQ

Infrequently Asked Questions

This is a list of topics that people have asked about or searched for in relation to the XML FAQ, which are not necessarily directly connected to XML and its technology, nor frequently asked questions. It also includes some fall-back definitions for the benefit of users who have come to XML by different routes and may not have been exposed to a document publishing background.

Readers may also want to look at Joe English's ‘Not the SGML FAQ’ at http://www.flightlab.com/~joe/sgml/faq-not.txt.

AJaX

Asynchronous JavaScript and XML. A technique for improving the interactivity of web pages whereby in-browser scripting detects user activity and pre-fetches the required data asynchronously from an XML-backed data-store, instead of waiting until the user clicks on a link and requesting it synchronously from the server.

Attributes

These are items of metadata or metainformation (information about information) which can be added to the start-tag of an element. Usually attributes are a way of refining the meaning, function, or some other quality of an element. They take the form of a name and a quoted value joined by an equals sign, eg

 
<part xml:id="B22" catnum="51N1573R" level="App">Left-handed
      Screwdriver</part>

Attribute names must follow the XML rules for Names (see the spec). If your application does not use a DTD or Schema, the attribute values are treated as plain text (CDATA) and cannot have any special meaning to XML (with the exception of xml:id and xml:lang, see below). In a DTD or Schema, attributes can be assigned datatypes, the most common being (using DTD terminology for simplicity):

ID or IDREF

ID attribute values must be XML Names (no spaces; must begin with a letter) and they must be unique in a document. An IDREF attribute value can occur any number of times, but it must be the value of an ID attribute in the same document. ID and IDREF are most frequently used for cross-referencing within documents.

Note that an ID attribute can have any name: it doesn't have to be called ‘ID’, although it frequently is. Conversely — as a matter of best practice — you should never use the name ‘ID’ (‘id’) for an attribute which is not of type ID, simply because it's confusing. If your application has unique identity values that the community calls IDs, and which are not XML Names, either name the attribute something different (eg ‘Product-ID’) or document heavily that the value is not an XML ID.

There is a W3C Recommendation that document type designers should use the attribute name xml:id, and this can be interpreted by parsers as being a unique ID without the need for the document to use a DTD or Schema.

CDATA

Just text.

Token List

The attribute must have one of a restricted number of values (specified in parentheses in the declaration, separated by vertical bars), eg

 
<!ATTLIST part level (App|Jny|Mst) #REQUIRED> 
<!ATTLIST Q.27 resp  (Yes|No) "Yes">

In the first example there is no default, and a value is compulsory. In the second, ‘Yes’ is the default value (if the attribute is omitted, the parser will take the default value from the declaration).

ENTITY

The attribute value must be a declared Entity.

NMTOKEN

An XML Name Token is like an ID value (no spaces) but it can begin with a non-letter (eg a digit or punctuation).

Special attributes

In addition to xml:id (mentioned above), there are two others allowed by the XML Specification:

xml:space: to signal an intention that in that element, white space should be preserved by applications;
xml:lang: to specify the language used in the contents and attribute values of any element.

See sections 2.10 and 2.12 of the Spec for more detail.

In Schemas a much greater range of datatypes is available than in DTDs, and complex validation criteria can be attached to each.

Attributes in a DTD can be declared as #REQUIRED (compulsory), #IMPLIED (optional), or #FIXED (predefined and invariable).

There is not intended to be any limit on the length of an attribute value, but you should check that your processing software can handle unusual data volumes if you intend to use very large lengths.
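
As a rough illustration of how attributes reach an application, here is a minimal Python sketch using only the standard library. It parses the screwdriver example with a small internal DTD subset. The stock attribute and its default value are invented for this illustration and are not part of the original example; the point is only to show plain attribute lookup, the namespaced form in which xml:id is reported, and a default being supplied from a declaration.

```python
import xml.etree.ElementTree as ET

# Namespace that the reserved xml: prefix is bound to.
XML_NS = "http://www.w3.org/XML/1998/namespace"

# The stock attribute and its default are invented for this illustration.
DOC = """<!DOCTYPE part [
<!ATTLIST part stock (Yes|No) "Yes">
]>
<part xml:id="B22" catnum="51N1573R" level="App">Left-handed
      Screwdriver</part>"""

part = ET.fromstring(DOC)

# Ordinary attributes are looked up by name and come back as plain text.
catnum = part.get("catnum")
level = part.get("level")

# xml:id is stored under the XML namespace in ElementTree's notation.
part_id = part.get("{%s}id" % XML_NS)

# Omitted in the document, so the parser supplies the declared default.
stock = part.get("stock")

print(part_id, catnum, level, stock)
```

Note that this parser does not validate: it applies the default but would not complain about a value outside the token list.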

BPEL

The Business Process Execution Language is an XML-based specification of the steps required for a cooperative business process to take place between consenting servers.

Byte Order Mark

A two-byte signature (0xFEFF, defined in Unicode and ISO 10646) which must be prepended to the XML document when using the UCS-2 encoding, in order to allow processors to differentiate between the UCS-2 and UTF-8 encodings.

Colour

XML is designed for identifying information about the structure and content of text documents, rather than their appearance. Although it is perfectly possible to identify and store information about appearances, this information is usually kept in a CSS or XSL stylesheet. If you need to record information about the formatting or appearance of an existing document, there are features in the TEI Schema/DTD for doing so.

Data export

A common requirement in the flat data model used in many e-commerce systems is to export XML data to the Comma-Separated Values (CSV) data format used as input to spreadsheets, or to JSON (see question D.10 on ‘What about JSON?’). There is a simple example of a short script to do CSV here. More complex and sophisticated routines could easily be written using XSLT or other XML processing software. Users should note that while conversion to CSV or JSON is adequate for simple data formats, these are inappropriate formats for normal XML text documents which use Mixed Content models.
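
For flat data of the kind described, a CSV export can be sketched in a few lines of Python using only the standard library. The element names (items, item, sku, name, price) and the function xml_to_csv are hypothetical, chosen for illustration; the sketch assumes every row element carries the same simple child elements and no Mixed Content.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical flat data: every <item> has the same simple child elements.
DOC = """<items>
  <item><sku>51N1573R</sku><name>Left-handed Screwdriver</name><price>4.50</price></item>
  <item><sku>51N1574R</sku><name>Sky hook, large</name><price>12.00</price></item>
</items>"""

def xml_to_csv(xml_text, row_tag, columns):
    """Return CSV text with one row per row_tag element."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)
    for row in root.iter(row_tag):
        # findtext returns the default for a missing child element.
        writer.writerow([row.findtext(col, default="") for col in columns])
    return out.getvalue()

csv_text = xml_to_csv(DOC, "item", ["sku", "name", "price"])
print(csv_text)
```

The csv module takes care of quoting values that themselves contain commas, which is the usual trap in hand-rolled exports.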

Data import

Many XML projects require the import of existing documents in non-XML formats. The import of existing HTML documents is explained in How can I make my existing HTML files work in XML?, and if you can convert your documents to XHTML, this is probably the simplest method. OpenOffice saves Open Document Format (ODF) files, which are the international standard for office XML documents. Word files can be saved as WordML (2003) or Office Open XML (2007: Microsoft's alternative to ODF). In both cases an XSLT transformation can be written to create a suitable XML import format. For complex documents in other formats, however, specialist conversion software is needed. Some XML editors are beginning to offer inbuilt conversion of other formats, and there are many standalone conversion systems available (some at high cost) for formats which are otherwise not easily machine-accessible via markup, like PDF, PostScript, LaTeX, Quark XPress, and most proprietary document formats. The critical point is that almost all non-XML (non-SGML) documents are formatted to make them human-readable and pretty, not to make them machine-readable. It is therefore often the case that the information required to make the document meaningful in XML simply doesn't exist in these formats. The only alternative for this class of documents is to have them rekeyed or scanned into XML by one of the many companies in the Indian subcontinent or the Pacific Rim.

Disadvantages

XML markup has a few disadvantages:

It can be verbose unless element and attribute names are chosen with care. In large documents the markup overhead need not be large, but in short messages it can be significantly more than the actual data, especially when the element or attribute names are concocted by machine.
Overlapping markup is not permitted (an element cannot start inside one element and end inside another): element markup must nest hierarchically.
Most applications require the document to be loaded to memory in its entirety before it can be parsed and processed. This can become a problem for truly huge documents (larger than the addressable memory of a computer system). Arguably, XML is perhaps the wrong tool to use for files this size, but there are streaming systems which will enable them to be processed.
Some of the software is truly mediocre.
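
On the memory point above, streaming need not be exotic. As a minimal sketch (one approach among several, using only the Python standard library), an incremental parser can hand over each element as it is completed, so that its content can be discarded straight away. The log and record element names and the function count_records are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# A small stand-in for a very large file; the record element is hypothetical.
DOC = b"<log>" + b"".join(
    b"<record n='%d'>entry</record>" % i for i in range(1000)
) + b"</log>"

def count_records(source, tag):
    """Count elements named tag without holding every record's content."""
    total = 0
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            total += 1
            # Discard the element's content once it has been handled.
            elem.clear()
    return total

record_count = count_records(io.BytesIO(DOC), "record")
print(record_count)
```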

Editing

To edit (open) an XML file you should use an XML editor. It is possible to open an XML file using any standard plaintext editor or even a wordprocessor, but be aware that they may try to reformat the file incorrectly because they don't understand XML.

Entities

An entity is a unit of storage in XML. It can be as small as a character or as large as a whole document. Four types of entity are declarable:

General entities

which can be like string-replacement macros:

<!ENTITY IBM "International Business Machines">

These can be used for shorthand data entry or to guarantee uniform spelling like &IBM; and they get replaced when the file is parsed.

They can also represent external files:

<!ENTITY chap5 SYSTEM "chapter5.xml">

which can be used as a file-inclusion mechanism at the point where you insert &chap5;. External general file entities must not contain the XML Declaration or any Document Type Declaration.

Document entities

These are like external general file entities except that they specify the type of data they contain, using a declared Notation, so that the parser and application can decide how to handle them (eg include them or hand them to another program specific to their type of medium):

 
<!ELEMENT link (#PCDATA)>
<!ATTLIST link to ENTITY #REQUIRED>
... 
<!NOTATION PDF PUBLIC 
  "-//Adobe//NOTATION Portable Document Format//EN//PDF"
  "http://partners.adobe.com/public/developer/pdf/index_reference.html"> 
<!ENTITY pricelist SYSTEM "/sales/pricelist.pdf" NDATA PDF> 
... 
<para>Please refer to our <link to="pricelist">current price
      list</link>.</para>

This provides an extremely robust method of defining an external entity once and allowing it to be referenced multiple times (if the external filename changes, you only have to update the entity declaration).

Character entities

like &aacute; to represent characters like ‘á’ that users without the required keyboard features may want to enter;

Parameter Entities

are like General Entities but can only be referenced within a DTD. They are used for control of content models, inclusion or exclusion of declarations, and modification of modular constructs:

 
<!ENTITY % local.qandaset.mix "|bibliodiv">

(to use an example from the DTD for this FAQ) where the mix of element types in the content model for qandaset is specified by the entities qandaset.mix (defined by DocBook) and by local.qandaset.mix (definable by the user [me]) so that the DTD can be tweaked without having to be edited.

General entity names, including XML document entities and character entities, always start with an ampersand (&) and end with a semicolon (;), and can be used anywhere in your document. Parameter entities can only be used in a DTD: they start with a percent sign (%) and end with a semicolon.
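
To see general entity replacement happen, here is a minimal Python sketch using only the standard library. The memo element and its wording are hypothetical. A named character entity such as &aacute; would have to be declared in a DTD before it could be used, so the sketch uses numeric character references (decimal and hexadecimal) for the accented letter instead.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE memo [
<!ENTITY IBM "International Business Machines">
]>
<memo>Contact &IBM; and quote &#225; or &#xE1; for the accent.</memo>"""

memo = ET.fromstring(DOC)

# By the time the application sees the text, the references are gone.
text = memo.text
print(text)
```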

Enumeration

To count the number of occurrences of a node in an XML document, you can use the count function in XSL[T], eg

 
<xsl:value-of select="count(//chapter)"/>

To apply a counter to a repetitive element type, use the xsl:number element, eg

 
<xsl:number count="appendix" level="any" format="A"/>

For more on XSLT, see How do I control the formatting of XML?.
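
If you are working in a general-purpose language rather than XSLT, the same two jobs (counting nodes and labelling a repeated element type) can be sketched as follows in Python with the standard library. The book structure is hypothetical, and the lettering is a deliberately simple stand-in that only copes with up to 26 appendices.

```python
import xml.etree.ElementTree as ET

DOC = """<book>
  <chapter><title>One</title></chapter>
  <chapter><title>Two</title></chapter>
  <appendix><title>Tools</title></appendix>
  <appendix><title>Sources</title></appendix>
</book>"""

book = ET.fromstring(DOC)

# Equivalent in spirit to count(//chapter).
chapter_count = len(book.findall(".//chapter"))

# Label appendices A, B, C... in document order, like format="A".
labels = [
    "%s: %s" % (chr(ord("A") + i), app.findtext("title"))
    for i, app in enumerate(book.iter("appendix"))
]
print(chapter_count, labels)
```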

Environment variables

XML is a markup language, not a programming language, so it has no concept of environment variables. However, if you are using a DTD, and accessing your XML files under program control (eg in a script rather than by hand) it is possible to modify the value of declared attributes or entities (eg with a stream-editor like sed) before the file is opened, and thereby to pass values from the external environment into the document. A similar approach would be possible with Schemas.

Escaping

Escaping means temporarily switching the way a program works to do something different with the data. In SGML, it was conventional to use only ASCII characters in your documents because keyboards, screens, and fonts for other characters were often unavailable. To escape from the limitations of this format for non-ASCII characters like accents and symbols, a set of mnemonic names was available, prefixed by an ampersand (&) to turn the escapement on, and followed by a semicolon (;) to turn it off, so an á was given as &aacute;.

XML allows you to use Unicode, so any character or symbol in any language can be entered as itself. If you are using UTF-8 encoding in your documents, there is no need to use escaping except for the two markup symbols (< and &). However, not everyone has a Unicode editor, and complete Unicode fonts are very large, so it is conventional in alphabetic languages to pick an encoding which allows you to use the majority of the characters you need, and to use escaping for the occasional other characters.
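
When text is written into XML by program, the escaping of the markup symbols is usually delegated to a library. A minimal Python sketch with the standard library follows; the sample strings are invented for illustration. Note that this particular function also escapes the greater-than sign, which is harmless but not strictly required.

```python
from xml.sax.saxutils import escape, unescape

raw = "if a < b && b > 0: see AT&T"

# escape() replaces &, < and > with &amp;, &lt; and &gt;.
escaped = escape(raw)

# unescape() reverses those three replacements.
restored = unescape(escaped)

# Characters outside ASCII can be written as numeric character references
# when the output encoding cannot represent them directly.
ascii_safe = "caf\u00e9".encode("ascii", "xmlcharrefreplace").decode("ascii")

print(escaped)
print(ascii_safe)
```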

Floating-point

You cannot declare character data content or attribute values as floating-point (or many other data types) using DTDs. To do that you need to use a Schema.

GTT

The Gnome Time Tracker is a component of the Gnome interface used extensively on Linux systems. Part of its internal data is configured in XML.

Games

I am not aware of any computer games written using XML yet, although XML is used in some of the internal control and configuration files used by games.

Idempotency

A term used in the HTTP specification to describe the side-effect-free nature of repeated requests for a resource.

JavaScript

ECMAScript (to give it its real name) has nothing to do with the Java language. It's designed to run inside browser windows, navigating or acting on the markup of a page to create dynamic content, validate forms, or instantiate objects in ways that are not possible with static HTML. It is also designed so that it cannot write to the user's local filesystem, for obvious security reasons, so it cannot easily be used to create XML files locally, although there are some back-doors in Microsoft software which allow modified pages to be saved to disk.

Line breaks

XML files can be created using any of the three standard newline representations: CR (Mac), LF (Unix), or CR/LF (Windows). Use of anything else may lead to undefined behaviour (so old DOS editors that use LF/CR may create unusable files). XML processors normalise all line-ends to LF.
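
The normalisation of line-ends can be observed directly. In this minimal Python sketch (standard library only; the p element is hypothetical), the same two-line content is written with each of the three conventions and then parsed.

```python
import xml.etree.ElementTree as ET

# The same two-line content written with each newline convention.
variants = {
    "CR": "<p>one\rtwo</p>",
    "LF": "<p>one\ntwo</p>",
    "CR/LF": "<p>one\r\ntwo</p>",
}

# After parsing, each variant should hold the same text.
parsed = {name: ET.fromstring(doc).text for name, doc in variants.items()}
print(parsed)
```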

Line-breaking in your output is governed by your rendering engine (eg a browser, a typesetter, etc). Your DTD or Schema may define special elements or entities to be used on rare occasions when a forced linebreak is required, but this is not normally something done in XML (exception: reconstruction of historical documents using the TEI).

Loops

To process some XML repetitively, you need to use a processing language which allows looping or the cyclical handling of a defined set of nodes. For example in XSLT, to output all chapter titles to make a table of contents (ie out of natural document position), you could say:

 
<xsl:for-each select="//chapter"> 
  <li> 
    <xsl:value-of select="title"/> 
  </li> 
</xsl:for-each>
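
For comparison, the same loop in a general-purpose language might look like this minimal Python sketch using the standard library. The book structure is hypothetical, and the ul wrapper is my assumption about what the list items would sit inside.

```python
import xml.etree.ElementTree as ET

DOC = """<book>
  <chapter><title>Basics</title><para>...</para></chapter>
  <chapter><title>Users</title><para>...</para></chapter>
  <chapter><title>Authors</title><para>...</para></chapter>
</book>"""

book = ET.fromstring(DOC)

# Build an HTML-style list with one item per chapter title.
toc = ET.Element("ul")
for chapter in book.iter("chapter"):
    item = ET.SubElement(toc, "li")
    item.text = chapter.findtext("title")

toc_markup = ET.tostring(toc, encoding="unicode")
print(toc_markup)
```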

Multimedia

The Synchronized Multimedia Integration Language (SMIL) provides an XML vocabulary for simple authoring of interactive audiovisual presentations. SMIL is typically used for ‘rich media’/multimedia presentations which integrate streaming audio and video with images, text or any other media type.

Patents, Copyright, and Intellectual Property

I'm not a lawyer, and this is not legal advice. If you're worried, see a psychiatrist first ⌣

Since the USA (and, increasingly, elsewhere) stopped sanity-checking patent applications, pretty much anyone can patent anything in these countries, regardless of whether or not it already exists. If you are sufficiently intellectually bankrupt, you can then start sending invoices to companies and even individuals demanding payment of license fees for continued use.

XML was drafted during 1995 and first published in 1996, so anyone claiming they invented pointy-bracket self-defining hierarchically-nested structured markup after that is probably a few elements short of a Schema. XML is based on SGML, which is an international standard codified as ISO 8879:1986, and it was preceded by numerous other closely-related markup systems, so anyone claiming they invented it after that date is equally wide of the markup.

Lots of subsequent derivative technologies which owe their existence to the SGML and XML groundwork quite possibly are valid patents, in the same way that fire was not originally patented but matches and lighters were.

Patents were originally designed for new physical inventions. Their use for methodologies and algorithms extended the concept into the realm of ideas, which many people regard as deeply suspect. The patenting of natural phenomena like genes (which are pre-existing parts of Nature like politicians and pond scum), is meaningless and intellectually void, although legally enforceable in the USA and elsewhere.

Copyright subsists automatically in anything you create, but in some countries (notably the USA and France) you cannot enforce this unless you register your interest. Copyright persists for a number of years after your death (EU: 70, different elsewhere) in order to let your descendants benefit from sales of your work.

Copyright is for the physical form of intellectual expression like books, newspapers, works of art, web sites, or computer programs. It exists to prevent others stealing your work and selling it. You can quote snippets of other people's work without permission, such as a line of a poem, or a bar of music, or a sentence from a novel, provided you say whose it is and where to find it: otherwise you need to ask permission beforehand. Copyright already provides more than adequate protection for computer programs, making the use of patents for them unnecessary overkill.

Intellectual Property identifies you as the owner of the thoughts and ideas which may find their physical manifestation in patentable inventions or copyrightable publications. Even if you sell off your patents, and for long after your copyrights have expired, you can still be seen as the person who dreamed up the idea, and some countries (eg the UK) allow you formally to assert your right to be so identified, regardless of what happens to the book or the gizmo.

You should always acknowledge the intellectual property of others, especially when you use it in furtherance of your own aims. Pretending that someone else's smart ideas are your own is probably a worse offence than trying to patent fire, water, the wheel, or XML.

Pipelining

Technique for reducing complex sequential and parallel processing requirements to a set of components which can be completed under program control. The term is taken from the Unix facility for redirecting the output of one command into the input of another (called a ‘pipe’), in effect creating a chain or pipeline through which data passes on its way from source to result.

The W3C has a Note pending submission on an XML Pipeline Definition Language which could be used to define a pipeline in a portable, vendor-independent manner.

RSS

The Really Simple Syndication format was designed to allow news sites to process updates by machine, and it evolved into a semi-standard format for blogs and other frequently-changing sites to notify the world of changes. Unfortunately it was never properly defined, and has multiple incompatible and undocumented versions. It was about to be superseded by a vastly better language called Atom, but Microsoft have recently announced their support for RSS, so it looks like we may be stuck with a lemon for years to come.

‘Newsreaders’ (RSS readers) are available for all platforms, both standalone and as browser plugins. Do not confuse these with programs of the same description designed to provide access to the Usenet News service, which is a different thing entirely (and which you will need to read at comp.text.xml).

Rendering

Using XSLT or XSL-FO transformation (or other similar conversion systems), information marked up in XML can be rendered to almost any target: HTML, PDF, audio, Braille, and almost any plain-text format (eg LaTeX). How it appears (or sounds) is the result of using stylesheets or other transformation logic activated by the markup.

SML

The Spacecraft Markup Language is an application of XML.

The Standard ML programming language is not.

Did you mean SGML?

SOAP

A W3C standard for the ‘definition of the XML-based information which can be used for exchanging structured and typed information between peers in a decentralised, distributed environment’. Most commonly used in Web Services for message-passing.

Originally the Simple Object Access Protocol, the acronym is now undefined, or expressed as the Service-Oriented Access Protocol. Guru99 has a good tutorial on SOAP.

Searching

You can search individual XML files on a sequential, standalone, unindexed command-line basis using programs such as lxgrep or lxprintf, parts of the LTXML2 toolkit. Many editors include a search facility as well.

The original XSLT allowed a limited search facility simply by using functions like contains and starts-with. XSLT 2 and 3 add regular expressions. XQuery is a fully-fledged search language for XML.

The Saxon XSLT processor comes with an implementation of XQuery (see also the XQL FAQ), which can accept queries either from the command line or from a file. Saxon can also use a control file to specify groups of XML files to be searched together.

For indexed searching (for speed) you need an XQuery search tool that implements an indexing engine which reads and understands markup. These are usually implemented as part of a ‘native’ XML database system such as eXist (and many others), which run either standalone or in parallel with an XML server like Cocoon.

Traditional relational databases (MySQL, Oracle, etc) tend to store XML as undistinguished strings or BLOBs, using bolt-on XML backends to handle the markup on import and export. ‘Native’ XML databases have the XML handling built-in, and can be configured for granularity, to store at a specific element level, making markup-sensitive searching much more effective.
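
The sequential, unindexed kind of search mentioned at the start of this entry can also be sketched in a few lines of Python with the standard library. This is only a substring test on one element type, nothing like XQuery; the faq structure and the function titles_containing are hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<faq>
  <question xml:id="q1"><title>What is XML?</title></question>
  <question xml:id="q2"><title>What is a DTD?</title></question>
  <question xml:id="q3"><title>Where can I get XML software?</title></question>
</faq>"""

def titles_containing(root, word):
    """Return the text of every title element that contains word."""
    return [
        title.text
        for title in root.iter("title")
        # Skip empty titles, whose text is None.
        if title.text and word in title.text
    ]

faq = ET.fromstring(DOC)
hits = titles_containing(faq, "XML")
print(hits)
```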

Serving XML

See Do I have to change any of my server software to work with XML?

Sorting

To sort a repetitive set of XML elements in XSL[T], use the xsl:sort element, eg

 
<xsl:for-each select="//acronym"> 
  <xsl:sort select="@abbrev"/>
  <xsl:value-of select="@abbrev"/> 
  <xsl:text>: </xsl:text> 
  <xsl:apply-templates/> 
</xsl:for-each>
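
Outside XSLT, the same sort can be sketched with a general-purpose language. This minimal Python example uses only the standard library; the glossary data is invented for illustration, and the comparison is a plain string sort rather than a language-aware collation.

```python
import xml.etree.ElementTree as ET

DOC = """<glossary>
  <acronym abbrev="XSLT">XSL Transformations</acronym>
  <acronym abbrev="DTD">Document Type Definition</acronym>
  <acronym abbrev="SGML">Standard Generalized Markup Language</acronym>
</glossary>"""

glossary = ET.fromstring(DOC)

# Sort on the abbrev attribute; a missing attribute sorts as an empty string.
ordered = sorted(glossary.iter("acronym"), key=lambda a: a.get("abbrev", ""))

lines = ["%s: %s" % (a.get("abbrev"), a.text) for a in ordered]
print("\n".join(lines))
```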

Special characters

XML has only two special markup characters in normal documents:

The open angle bracket or less-than sign (<) which begins a start-tag or end-tag like <report> or </table>;
The ampersand character (&) which starts an entity reference like &aacute; for á or &sect; for §.

Contrary to popular opinion, the closing angle bracket or greater-than (>) and the semicolon (;) are not special characters in normal text: they only acquire their temporary special meaning once one of the two markup characters has been encountered.

In DTDs, the percent sign (%) has a special meaning in entity declarations: it defines the entity as a parameter entity, meaning that it can only be used inside the DTD, not in a document text, and only for data substitution (a kind of simple macro).

The exclamation mark (!) acquires a special meaning immediately after a less-than sign: when followed by one of the declaration keywords in a DTD it signals the start of a Declaration; when followed by two dashes it signals the start of a comment (ended by another two dashes and a greater-than sign).

TMX

TMX is a standard method to describe translation memory data that is being exchanged among tools and/or translation vendors for human-language translation (part of the OSCAR project from LISA).

Tables

You can define tables any way you wish in XML (see Does XML let me make up my own tags?) but there are a few existing table models which have become so widely-used (and supported by software) that it would need a very compelling reason to invent something new. There are more details in (Flynn, 1998 § 2.3.7).

HTML: HTML tables were invented by Mosaic (now Netscape) and first appeared in the HTML2 DTD. In all versions of HTML and XHTML they define a very simple but practical model, with very few refinements, suitable for web use and for rudimentary printing. Their chief advantage is that in a browser the cell heights and widths (and thus the column widths) expand or contract automatically to accommodate the amount of text contained in them. Most other table models assume the widths of the columns and the height of the cells will be specified in advance (which you can do in HTML but this is rarely used).
CALS: Computer-Aided Logistics and Support (and several other acronyms over the years) was (is) part of the US military project to ensure a consistent markup for all documentation, originally in SGML, now in XML. As part of this activity the CALS table model has become the most widely-used in technical documentation, especially for Interactive Electronic Technical Manuals (IETMs), with extensive support in all the major editors, and it is the default table model in the DocBook DTD and Schema. The CALS definitions are very powerful but quite complex, and can handle virtually all requirements for spanning, ruling, and aligning.
SASOUT: This model has been used extensively in the social sciences and elsewhere for defining tables based on the semantics of the data, rather than the appearance. At one time they were an alternative in DocBook (enabled by a simple parameter entity switch).
TEI: The TEI model is designed to allow the encoder to represent existing tables being transcribed from historical, literary, or archive material, rather than for the generation of new data. The markup is at the same level of simplicity as the HTML model, but it is designed to allow the inclusion of the much denser markup and metadata needed in research texts.
LaTeX: The LaTeX model is not of direct concern to the XML user except insofar as LaTeX is a common target for transformations from XML using XSLT in order to create PDFs. Like CALS, LaTeX tables can handle almost any formatting, but the default alignments assume that each column format is defined beforehand, and that each cell will occupy one line of data: an additional package (array) is needed to handle multi-line cells in the way that other models do.

In XML, it is not necessary to use tables to mark up lists as is often done in wordprocessors, because the processing facilities of languages like XSLT allow you to transform the document to use non-tabular methods (like HTML's divs). Table markup should therefore be confined to ‘real’ tables (data arranged in rows and columns) and not abused simply because you want something displayed on a level with something else: it is better to pick markup which is designed to do the job properly rather than to distort existing facilities.

Wordprocessor users are usually unaware that many structures that they currently use wordprocessor tables for are in fact segmented lists, which wordprocessors are incapable of handling correctly. One of the major reasons for doing it properly is that the data can then be reprocessed to make sense when read in the natural order.

Text document formatting functions

Because XML is a metalanguage to let you define and name your own information structures, it has no built-in knowledge of anything to start with. It therefore has no inherent understanding of any document specifics like bulleted lists, sections, footnotes, or any of the common online features like drop-down menus, forms (inputs, check boxes, radio buttons, and text areas), scripts, mouseovers, or other bells and whistles: these are things which you have to use XML to define, in a DTD or Schema for your specific application. Contrary to the impression given by some manufacturers, these things are not built into XML itself. You first choose or design a document type (Schema or DTD) to represent your information accurately, then you can generate effects like the above by using CSS styling, or writing an XSL[T] transformation of your XML to HTML, Word, LaTeX, PDF, or whatever other format is capable of instantiating them.

There are additional native-XML proposals and recommendations at the W3C for XML Forms handling, XML Linking, XML Security, and a lot of other features, but these are architectural enabling mechanisms, not drop-in replacements for HTML.

UML

The Unified Modeling Language has nothing to do with XML, although there are many points of contact, and some software is available to express some UML structures in XML for the purposes of inter-process messaging.

URI parsing errors

See Semicolon.

Variables

XML doesn't have variables or parameters, nor does it have fields or records. These are all terms from programming and database technology, and do not have exact equivalents in XML.

XML identifies your information with elements and attributes.

WAP

The Wireless Application Protocol (WAP) is now handled by the Open Mobile Alliance.

Well-formed

See Well-formed XML.

White-space

See How does XML handle white-space in my documents?.

XLL

The XML Linking Language comprises the XLink specification and the XPointer specification. For details, see the XML Linking Working Group at the W3C.

XLS

Microsoft proprietary spreadsheet file format written by their Excel spreadsheet program. XLS files are not XML files, but modern versions of Excel save their data as .xlsx files in Microsoft's Office Open XML format (OOXML).

Do not confuse XLS with XSL (see How do I control the formatting of XML?).

XML

This is the XML FAQ. Everything in it is about XML. For introductory explanations, see Basics.

XML and security, privacy, and identity standards

Eve

XML Protocol

There is a Working Group for Web Services at the W3C, and part of their remit is to work on an XML Protocol. See http://www.w3.org/2000/xp/Group/ for details.

XMLHTTP

Feature implemented in MSXML and elsewhere to allow the retrieval of web pages, binary data, or scripted responses under program control (like using curl, wget or dog in a shell script). Used asynchronously in AJaX applications to pre-fetch data, saving time to make it appear that an application is operating locally.

XUL

The XML User Interface Language, designed for specifying the user interface in the Mozilla browser.

asp.net

ASP (Active Server Pages) is a Microsoft language for serving dynamic web pages, similar in concept to JSP, PHP, and others. In itself, ASP has nothing inherently to do with XML, although like any server-side system, it can be used for serving XML just as well as any other type of file.

.NET itself is an application platform and methodology for web services development on Microsoft servers. Most web services are predicated on XML as the ‘common carrier’ of inter-business messaging, so .NET has a significant XML component.

Marc Hadley writes:

There are many alternatives to ASP, most of which use a similar page based approach. Java based alternatives include Java Server Pages (JSP), Java Server Faces (JSF) and Cocoon (which includes eXtensible Server Pages — XSP). Popular scripting language alternatives include Zope (Python) and Rails (Ruby) [all of which have extensive XML support. — Ed.]