Copyright © 2010 Silmaril Consultants
Rev: 2010-04-24T15:52:33+0100

C.26  What is parsing and how do I do it in XML?

Parsing is the act of splitting up information into its component parts (schools used to teach this in language classes until the teaching profession collectively caught the anti-grammar disease).

‘Mary feeds Spot’ parses as

  1. Subject = Mary, proper noun, nominative case

  2. Verb = feeds, transitive, third person singular, present tense

  3. Object = Spot, proper noun, accusative case

In computing, a parser is a program (or a piece of code or API that you can reference inside your own programs) which analyses files to identify the component parts. All applications that read input have a parser of some kind, otherwise they'd never be able to figure out what the information means. Microsoft Word contains a parser which runs when you open a .doc file and checks that it can identify all the hidden codes. Give it a corrupted file and you'll get an error message.

XML applications are just the same: they contain a parser which reads XML and identifies the function of each of the pieces of the document, and it then makes that information available in memory to the rest of the program.

While reading an XML file, a parser checks the syntax (pointy brackets, matching quotes, etc) for well-formedness, and reports any violations (reportable errors). The XML Specification lists what these are.
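As a minimal sketch of a well-formedness check, the following uses the parser in Python's standard library (xml.etree.ElementTree). The function name check_well_formed and the two sample strings are made up for illustration; the exact wording of the error message depends on the underlying parser.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if xml_text is well-formed, otherwise the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # The message includes the line and column where parsing stopped.
        return str(err)
    return None


good = "<name><forename>Judy</forename></name>"
bad = "<name><forename>Judy</name>"  # forename is never closed

print(check_well_formed(good))
print(check_well_formed(bad))
```

The first call finds nothing to report; the second returns a message, because the end-tag does not match the element that is currently open.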

Validation is another stage beyond parsing. As the component parts of the document are identified, a validating parser can compare them with the pattern laid down by a DTD or a Schema, to check that they conform. In the process, default values and datatypes (if specified) can be added to the in-memory result that the validating parser passes to the application.
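Python's standard library does not include a validating parser, so the sketch below illustrates only the default-value part of that process, not conformance checking. It relies on the expat parser (xml.parsers.expat), which reads attribute declarations in the internal DTD subset and reports declared defaults along with the attributes actually present. The ATTLIST declaration and the default value "unknown" are invented for this example.

```python
import xml.parsers.expat

# A cut-down person element with an invented attribute declaration:
# gender defaults to "unknown" when the document does not supply it.
DOCUMENT = """<!DOCTYPE person [
  <!ATTLIST person gender CDATA "unknown">
]>
<person corpid="abc123"><name>Judy</name></person>"""


def attributes_seen(xml_text):
    """Map each element name to the attributes the parser reports for it."""
    seen = {}

    def start_element(name, attrs):
        seen[name] = dict(attrs)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    parser.Parse(xml_text, True)
    return seen


print(attributes_seen(DOCUMENT)["person"])
```

The application sees a gender attribute on person even though the document itself never wrote one.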

<person corpid="abc123" 
        birth="1960-02-31" 
        gender="female">
  <name>
    <forename>Judy</forename>
    <surname>O'Grady</surname>
  </name>
</person> 
	  

The example above parses as:

  1. Element person identified with Attribute corpid containing abc123 and Attribute birth containing 1960-02-31 and Attribute gender containing female containing ...

  2. Element name containing ...

  3. Element forename containing text ‘Judy’ followed by ...

  4. Element surname containing text ‘O'Grady’

(and lots of other stuff too).
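The breakdown above can be reproduced approximately with a few lines of Python. This is a sketch using the standard library's xml.etree.ElementTree; the function name describe and the output wording are illustrative only.

```python
import xml.etree.ElementTree as ET

PERSON = """<person corpid="abc123"
        birth="1960-02-31"
        gender="female">
  <name>
    <forename>Judy</forename>
    <surname>O'Grady</surname>
  </name>
</person>"""


def describe(xml_text):
    """Return one line per element: its name, attributes, and any text content."""
    lines = []
    # iter() visits the root element and all its descendants in document order.
    for element in ET.fromstring(xml_text).iter():
        text = (element.text or "").strip()
        lines.append(f"Element {element.tag} attributes={element.attrib} text={text!r}")
    return lines


for line in describe(PERSON):
    print(line)
```

Note that this parser accepts the birth value as an ordinary string: checking that it is a real date would be a job for validation against a schema that declares a date datatype.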

As well as built-in parsers, there are also stand-alone parser-validators, which read an XML file and tell you if they find an error (like missing angle-brackets or quotes, or misplaced markup). This is essential for testing files in isolation before doing something else with them, especially if they have been created by hand without an XML editor, or by an API which may be too deeply embedded elsewhere to allow easy testing.

Bill Rayer writes:

For standalone parsing/validation use software like James Clark's nsgmls or Richard Tobin's rxp. Both work under Linux and Windows/DOS. The difference is in the format of the error listing (if any), and that some versions of nsgmls do not retrieve DTDs or other files over the network, whereas rxp does.

Make sure your XML file correctly references its DTD in a Document Type Declaration, and that the DTD file[s] are locally accessible (rxp will retrieve them if you have an Internet connection; nsgmls may not, so it may need a local copy).

Download and install the software. Make sure it is installed to a location where your operating system can find it. If you don't know what any of this means, you will need some help from someone who knows how to download and install software on your type of operating system.

For nsgmls, copy pubtext/xml.soc and pubtext/xml.dcl to your working directory.

To validate myfile.xml, open a shell window (Linux) or an MS-DOS (‘command’) window (Microsoft Windows). In these examples we'll assume your XML file is called myfile.xml and it's in a folder called myfolder. Use the real names of your folder and file when you type the commands.

For nsgmls:

$ nsgmls -wxml -wundefined -cxml.soc -s myfile.xml

There are many other options for nsgmls which are described on the Web page. The ones given here are required because it's an SGML parser and these options switch it to XML mode and suppress the normal output, leaving just the errors (if any).

(In Microsoft Windows you may have to prefix the nsgmls command with the full path to wherever it was installed, eg C:\Program Files\nsgmls\nsgmls.)

For rxp:

$ rxp myfile.xml

Rxp also has some options which are described on its Web page.

(In Microsoft Windows you may have to prefix the rxp command with the full path to wherever it was installed, eg C:\Program Files\rxp\rxp.)
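If you want to run these checks from a script rather than by hand, the commands above can be wrapped in a few lines of Python. This is only a sketch: the helper names build_command and run_parser are hypothetical, the options are exactly those shown above, and nothing is assumed about how either tool reports errors beyond collecting whatever it prints.

```python
import shutil
import subprocess

# Options copied from the commands above; nothing else is assumed about either tool.
COMMANDS = {
    "nsgmls": ["nsgmls", "-wxml", "-wundefined", "-cxml.soc", "-s"],
    "rxp": ["rxp"],
}


def build_command(tool, xml_path):
    """Return the argument list for running the named parser on xml_path."""
    return COMMANDS[tool] + [xml_path]


def run_parser(tool, xml_path):
    """Run the parser if it is on the PATH and return whatever it printed."""
    if shutil.which(tool) is None:
        return f"{tool} was not found on the PATH"
    result = subprocess.run(build_command(tool, xml_path),
                            capture_output=True, text=True)
    return result.stdout + result.stderr


if __name__ == "__main__":
    for tool in COMMANDS:
        print(run_parser(tool, "myfile.xml"))
```

Passing the command as a list of arguments avoids quoting problems when the file name or installation path contains spaces.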