C.26 What is parsing and how do I do it in XML?
Parsing is the act of splitting up information into its component parts (schools used to teach this in language classes until the teaching profession collectively caught the anti-grammar disease).
‘Mary feeds Spot’ parses as
Subject = Mary, proper noun, nominative case
Verb = feeds, transitive, third person singular, present tense
Object = Spot, proper noun, accusative case
In computing, a parser is a program (or a piece of code or API that you can reference inside your own programs) which analyses files to identify the component parts. All applications that read input have a parser of some kind, otherwise they'd never be able to figure out what the information means. Microsoft Word contains a parser which runs when you open a .doc file and checks that it can identify all the hidden codes. Give it a corrupted file and you'll get an error message.
XML applications are just the same: they contain a parser which reads XML and identifies the function of each of the pieces of the document, and it then makes that information available in memory to the rest of the program.
While reading an XML file, a parser checks the syntax (pointy brackets, matching quotes, etc) for well-formedness, and reports any violations (reportable errors). The XML Specification lists what these are.
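As a rough illustration of a well-formedness check, here is a minimal sketch using Python's standard xml.etree.ElementTree module (a non-validating parser). The function name check_well_formed and the two sample strings are invented for this example; they are not part of any tool mentioned in this FAQ.

```python
import xml.etree.ElementTree as ET

def check_well_formed(text):
    """Return None if text is well-formed XML, otherwise the parser's error message."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        # The message includes the line and column where parsing stopped.
        return str(err)
    return None

# Hypothetical sample inputs for illustration only.
good = "<name><forename>Judy</forename></name>"
bad = "<name><forename>Judy</name>"  # forename is never closed

print(check_well_formed(good))
print(check_well_formed(bad))
```

The first call should report no error; the second should report where the unclosed element was detected. The exact wording of the message depends on the underlying parser.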
Validation is another stage beyond parsing. As the component parts of the document are identified, a validating parser can compare them with the pattern laid down by a DTD or a Schema, to check that they conform. In the process, default values and datatypes (if specified) can be added to the in-memory result that the validating parser gives to the application.
<person corpid="abc123"
birth="1960-02-31"
gender="female">
<name>
<forename>Judy</forename>
<surname>O'Grady</surname>
</name>
</person>
The example above parses as:
Element person
  identified with Attribute corpid containing abc123 and Attribute birth containing 1960-02-31 and Attribute gender containing female
  containing ...
    Element name
      containing ...
        Element forename
          containing text ‘Judy’
        followed by ...
        Element surname
          containing text ‘O'Grady’
(and lots of other stuff too).
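To see a similar breakdown produced by a real parser, the following sketch feeds the person example to Python's standard xml.etree.ElementTree and walks the resulting in-memory tree. The helper name describe and its output format are assumptions made for this illustration; other parsers and APIs (DOM, SAX) expose the same information differently.

```python
import xml.etree.ElementTree as ET

DOC = """<person corpid="abc123"
        birth="1960-02-31"
        gender="female">
  <name>
    <forename>Judy</forename>
    <surname>O'Grady</surname>
  </name>
</person>"""

def describe(element, depth=0):
    """Return one line per element: its name, attributes, and any non-blank text."""
    line = "  " * depth + f"Element {element.tag}"
    for key, value in element.attrib.items():
        line += f" [{key}={value}]"
    text = (element.text or "").strip()
    if text:
        line += f" text {text!r}"
    lines = [line]
    # Recurse into child elements, indenting one level deeper.
    for child in element:
        lines.extend(describe(child, depth + 1))
    return lines

root = ET.fromstring(DOC)
print("\n".join(describe(root)))
```

Note that a non-validating parser like this one hands back the birth attribute as a plain string. Checking that it is a real date would be a job for validation against a schema that specifies datatypes.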
As well as built-in parsers, there are also stand-alone parser-validators, which read an XML file and tell you if they find an error (like missing angle-brackets or quotes, or misplaced markup). This is essential for testing files in isolation before doing something else with them, especially if they have been created by hand without an XML editor, or by an API which may be too deeply embedded elsewhere to allow easy testing.
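If neither of the tools described below is to hand, a very small stand-alone checker can be sketched with Python's standard library. This is only an illustration under stated assumptions: the function names check_file and check_files and the script name checkxml.py are made up here, and xml.etree.ElementTree checks well-formedness only (it does not validate against a DTD), so it is not a substitute for a validating parser.

```python
import sys
import xml.etree.ElementTree as ET

def check_file(path):
    """Return None if the file at path is well-formed XML, else an error string."""
    try:
        ET.parse(path)
    except ET.ParseError as err:
        return f"{path}: {err}"
    except OSError as err:
        return f"{path}: cannot read file ({err})"
    return None

def check_files(paths):
    """Check every path; print errors to stderr and return how many files failed."""
    failures = 0
    for path in paths:
        message = check_file(path)
        if message is not None:
            print(message, file=sys.stderr)
            failures += 1
    return failures

if __name__ == "__main__":
    # Hypothetical usage: python checkxml.py myfile.xml
    failed = check_files(sys.argv[1:])
    print(f"{failed} file(s) with errors")
```

Like the stand-alone tools, this stays silent about files that parse cleanly and reports only the problems.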
For standalone parsing/validation use software like James Clark's nsgmls or Richard Tobin's rxp. Both work under Linux and Windows/DOS. The difference is in the format of the error listing (if any), and that some versions of nsgmls do not retrieve DTDs or other files over the network, whereas rxp does.
Make sure your XML file correctly references its DTD in a Document Type Declaration, and that the DTD file[s] are locally accessible (rxp will retrieve them if you have an Internet connection; nsgmls may not, so it may need a local copy).
Download and install the software. Make sure it is installed to a location where your operating system can find it. If you don't know what any of this means, you will need some help from someone who knows how to download and install software on your type of operating system.
For nsgmls, copy pubtext/xml.soc and pubtext/xml.dcl to your working directory.
To validate myfile.xml, open a shell window (Linux) or an MS-DOS (‘command’) window (Microsoft Windows). In these examples we'll assume your XML file is called myfile.xml and it's in a folder called myfolder. Use the real names of your folder and file when you type the commands.
$ nsgmls -wxml -wundefined -cxml.soc -s myfile.xml
There are many other options for nsgmls which are described on the Web page. The ones given here are required because it's an SGML parser and these options switch it to XML mode and suppress the normal output, leaving just the errors (if any).
(In Microsoft Windows you may have to prefix the nsgmls command with the full path to wherever it was installed, eg C:\Program Files\nsgmls\nsgmls).
$ rxp myfile.xml
Rxp also has some options which are described on its Web page.
(In Microsoft Windows you may have to prefix the rxp command with the full path to wherever it was installed, eg C:\Program Files\rxp\rxp).