
The XML FAQ — Frequently-Asked Questions about the Extensible Markup Language

Section 3: Authors

Q 3.19: What is parsing and how do I do it in XML?

Parsing is splitting up information into its component parts

Parsing is the act of splitting up information into its component parts (schools used to teach this in language classes until the teaching profession caught the anti-grammar virus).

‘Mary feeds Spot’ parses as

  1. Subject = Mary, proper noun, nominative case;

  2. Verb = feeds, transitive, third person singular, active voice, present tense;

  3. Object = Spot, proper noun, accusative case.

In computing, a parser is a program (or a piece of code or API that you can reference inside your own programs) which analyses files to identify the component parts. All applications that read input have a parser of some kind, otherwise they'd never be able to figure out what the information means. Microsoft Word contains a parser which runs when you open a .doc file and checks that it can identify all the hidden codes; iCal and Google Calendar contain a parser which reads an iCalendar (.ics) appointment attachment in your email, and works out what information is in it. Give them a corrupted file and you'll get an error message.

XML applications are just the same: they contain a parser which reads XML and identifies the function of each of the pieces of the document, and it then makes that information available in memory to the rest of the program.

While reading an XML file, a parser checks the syntax (pointy brackets, matching quotes, etc) for well-formedness, and reports any violations (reportable errors). The XML Specification lists what these are.
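As a minimal sketch of a well-formedness check, the following uses the parser in Python's standard library (xml.etree.ElementTree). The function name check_well_formed and the two sample strings are made up for illustration; they are not part of this FAQ or of any particular tool.

```python
import xml.etree.ElementTree as ET


def check_well_formed(text):
    """Return None if text is well-formed XML, otherwise a description of the error."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        # ParseError.position holds the (line, column) where the parser gave up.
        line, column = err.position
        return f"line {line}, column {column}: {err}"
    return None


# Illustrative samples: the second one never closes <forename>.
good = "<name><forename>Judy</forename></name>"
bad = "<name><forename>Judy</name>"

print(check_well_formed(good))
print(check_well_formed(bad))
```

The parser stops at the first violation it finds, so a badly damaged file may need several passes to clean up.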

Validation is another stage beyond parsing. As the component parts of the document are identified, a validating parser can compare them with the pattern laid down by the DTD or Schema, to check that they conform. In the process, default values and datatypes (if specified) can be added to the in-memory result that the validating parser gives to the application.
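The way default values end up in the in-memory result can be sketched with Python's standard library, with one important caveat: its parser (expat) is non-validating, so it does not check conformance to the DTD at all. It does, however, apply attribute defaults declared in the internal DTD subset. The ATTLIST declaration and the default value "unknown" below are invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the internal DTD subset declares a default for
# the gender attribute, and the <person> start-tag does not specify it.
DOC = """<!DOCTYPE person [
  <!ATTLIST person gender CDATA "unknown">
]>
<person birth="1960-02-31"/>"""

root = ET.fromstring(DOC)

# The defaulted attribute is reported alongside the one that was typed in.
print(root.get("birth"))
print(root.get("gender"))
```

Checking that a document actually conforms to its DTD or Schema needs a validating parser, such as the stand-alone ones described later in this answer.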

 
<person xml:id="abc123" birth="1960-02-31" gender="female"> 
  <name>
    <forename>Judy</forename> 
    <surname>O'Grady</surname> 
  </name>
</person> 
	  

The example above parses as:

  1. Element <person> identified, with Attribute xml:id (predefined type ‘ID’) containing "abc123", Attribute birth containing "1960-02-31", and Attribute gender containing "female"; the element contains ...

  2. Element <name> containing ...

  3. Element <forename> containing text ‘Judy’ followed by ...

  4. Element <surname> containing text ‘O'Grady’.

(and lots of other stuff too). This ends up as a kind of family-tree structure in the application's memory (tree structures are a common way for programs to store related data).
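To see the tree for yourself, here is a small sketch using Python's standard library parser. The helper name outline is made up for illustration. Note that this parser is non-validating, so it treats every attribute value as plain text: it does not know that xml:id is of type ID, and it reports the attribute under the full XML namespace name.

```python
import xml.etree.ElementTree as ET

DOC = """<person xml:id="abc123" birth="1960-02-31" gender="female">
  <name>
    <forename>Judy</forename>
    <surname>O'Grady</surname>
  </name>
</person>"""

# The xml: prefix is always bound to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def outline(element, depth=0):
    """Yield (depth, tag, attributes, text) for element and all its descendants."""
    text = (element.text or "").strip()
    yield depth, element.tag, dict(element.attrib), text
    for child in element:
        yield from outline(child, depth + 1)


root = ET.fromstring(DOC)
for depth, tag, attributes, text in outline(root):
    print("  " * depth + tag, attributes, repr(text))
```

Each level of indentation in the output corresponds to one generation in the family tree: person is the parent of name, which is the parent of forename and surname.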

As well as built-in parsers, there are also stand-alone parser-validators (see Bill Rayer’s tip), which read an XML file and tell you if they find an error (like missing angle-brackets or quotes, or misplaced markup). This is essential for testing files in isolation before doing something else with them, especially if they have been created by hand without an XML editor, or by an API which may be too deeply embedded elsewhere to allow easy testing.

Bill Rayer writes:

For standalone parsing/validation use software like James Clark's onsgmls or Richard Tobin's rxp. Both work under Linux and Windows/DOS. The difference is in the format of the error listing (if any), and that some versions of onsgmls do not retrieve DTDs or other files over the network, whereas rxp does.

Make sure your XML file correctly references its DTD in a Document Type Declaration, and that the DTD file[s] are locally accessible (rxp will retrieve them if you have an Internet connection; onsgmls may not, so it may need a local copy).

Download and install the software. Make sure it is installed to a location where your operating system can find it. If you don't know what any of this means, you will need some help from someone who knows how to download and install software on your type of operating system.

For onsgmls, copy pubtext/xml.soc and pubtext/xml.dcl to your working directory.

To validate your file, open a shell (command or terminal) window (Linux) or an MS-DOS (command) window (Microsoft Windows). In these examples we'll assume your XML file is called myfile.xml, that it's in a folder called myfolder, and that you have changed to that folder before typing the commands. Use the real names of your folder and file when you type the commands.

For onsgmls:
$ onsgmls -wxml -wundefined -cxml.soc -s myfile.xml
		

There are many other options for onsgmls which are described on the Web page. The ones given here are required because it's based on an SGML parser and these options switch it to XML mode and suppress the normal output, leaving just the errors (if any).

In Microsoft Windows you may have to prefix the onsgmls command with the full path to wherever it was installed, eg C:\Program Files\OpenSP\bin\onsgmls (put the whole path in double quotes if it contains a space, as this one does).

For rxp:
$ rxp myfile.xml
		

rxp also has some options which are described on its Web page.

In Microsoft Windows you may have to prefix the rxp command with the full path to wherever it was installed, eg C:\Program Files\ltxml2\bin\rxp (put the whole path in double quotes if it contains a space, as this one does).