Parsing is the act of splitting up information into its
component parts (schools used to teach this in language
classes until the teaching profession caught the
anti-grammar virus).
‘Mary feeds Spot’ parses as
Subject = Mary, proper noun, nominative case;
Verb = feeds, transitive, third person singular,
active voice, present tense;
Object = Spot, proper noun, accusative case.
In computing, a parser is a program (or a piece of code
or API that you can reference inside your own programs)
which analyses files to identify the component parts. All
applications that read input have a parser of some kind,
otherwise they'd never be able to figure out what the
information means. Microsoft Word
contains a parser which runs when you open a
.doc file and checks that it can
identify all the hidden codes;
iCal and Google
Calendar contain a parser which
reads an .ical appointment attachment
in your email, and works out what information is in it. Give
them a corrupted file and you'll get an error
message.
XML applications are just the same: they contain a
parser which reads XML and identifies the function of each
of the pieces of the document, and it then makes that
information available in memory to the rest of the
program.
While reading an XML file, a parser checks the syntax
(pointy brackets, matching quotes, etc) for well-formedness, and reports any
violations (reportable errors). The XML Specification lists what these
are.
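As a rough illustration of a well-formedness check, here is a minimal sketch using the parser in Python's standard library (xml.etree.ElementTree). The function name check_well_formed and the two sample strings are made up for this example; the parser only checks well-formedness and does not validate.

```python
import xml.etree.ElementTree as ET

def check_well_formed(text):
    """Return None if text is well-formed XML, otherwise a short error report."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        # err.position holds the (line, column) where the parser gave up
        line, column = err.position
        return f"line {line}, column {column}: {err}"
    return None

good = "<name><forename>Judy</forename></name>"
bad = "<name><forename>Judy</name>"  # end-tag does not match the open element

print(check_well_formed(good))
print(check_well_formed(bad))
```

The parser stops at the first violation it finds, so a file with several errors may need several passes.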
Validation is another stage
beyond parsing. As the component parts of the document are
identified, a validating parser can compare them with the
pattern laid down by the DTD or Schema, to check that they
conform. In the process, default values and datatypes (if
specified) can be added to the in-memory result that the
validating parser gives to the
application.
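The comparison a validating parser makes can be sketched very roughly in Python. This is not how any real validator is implemented, and Python's standard library does not validate against a DTD or Schema: the two tables below (ALLOWED_CHILDREN and DEFAULT_ATTRIBUTES) are hypothetical hand-written stand-ins for the declarations a DTD would supply, and the status attribute is invented for the example. The sketch ignores the order and number of child elements, which a real DTD also constrains.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules standing in for a DTD: which child elements each
# element type may contain...
ALLOWED_CHILDREN = {
    "person": {"name"},
    "name": {"forename", "surname"},
    "forename": set(),
    "surname": set(),
}
# ...and default attribute values to supply when the document omits them.
DEFAULT_ATTRIBUTES = {"person": {"status": "active"}}

def validate(element, errors=None):
    """Check element and its descendants against the rules; return error messages."""
    if errors is None:
        errors = []
    if element.tag not in ALLOWED_CHILDREN:
        errors.append(f"undeclared element <{element.tag}>")
    else:
        # Add declared defaults to the in-memory tree
        for name, value in DEFAULT_ATTRIBUTES.get(element.tag, {}).items():
            element.attrib.setdefault(name, value)
        for child in element:
            if child.tag not in ALLOWED_CHILDREN[element.tag]:
                errors.append(f"<{child.tag}> not allowed inside <{element.tag}>")
    for child in element:
        validate(child, errors)
    return errors

doc = ET.fromstring(
    "<person><name><forename>Judy</forename><nickname>J</nickname></name></person>"
)
problems = validate(doc)
print(problems)
print(doc.get("status"))
```

For real work, use a validating parser rather than hand-written rules like these.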
<person xml:id="abc123" birth="1960-02-31" gender="female">
<name>
<forename>Judy</forename>
<surname>O'Grady</surname>
</name>
</person>
The example above parses as:
Element <person> identified
  with Attribute xml:id (predefined type ‘ID’) containing "abc123"
  and Attribute birth containing "1960-02-31"
  and Attribute gender containing "female"
  containing ...
    Element <name> containing ...
      Element <forename> containing text ‘Judy’ followed by ...
      Element <surname> containing text ‘O'Grady’.
(and lots of other stuff too). This ends up as a kind of
family-tree structure in the application's memory (tree
structures are a common way for programs to store related
data).
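To see the tree for yourself, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. The function name outline is made up for this example. Note that this particular parser reports the xml:id attribute under its full namespace name, {http://www.w3.org/XML/1998/namespace}id.

```python
import xml.etree.ElementTree as ET

DOC = """<person xml:id="abc123" birth="1960-02-31" gender="female">
  <name>
    <forename>Judy</forename>
    <surname>O'Grady</surname>
  </name>
</person>"""

def outline(element, depth=0):
    """Return one line per element, indented according to its depth in the tree."""
    line = "  " * depth + element.tag
    for name, value in element.attrib.items():
        line += f" [{name}={value!r}]"
    text = (element.text or "").strip()
    if text:
        line += f" text={text!r}"
    lines = [line]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(DOC)
for line in outline(root):
    print(line)
```

Each element becomes a node, with its attributes and text attached to it and its child elements hanging below it.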
As well as built-in parsers, there are also standalone
parser-validators (see Bill Rayer’s tip),
which read an XML file and tell you if they find an error
(like missing angle-brackets or quotes, or misplaced
markup). This is essential for testing files in isolation
before doing something else with them, especially if they
have been created by hand without an XML editor, or by an
API which may be too deeply embedded elsewhere to allow easy
testing.
If you don’t
want to install software, there are many online
parser-validator websites such as
Mukul Gandhi’s at
https://www.softwarebytes.org/xmlvalidation/
where you can upload your XML document,
the XSD schema and any ancillary files,
and select the version.
Bill Rayer writes:
For standalone parsing/validation use software like
James Clark's onsgmls or
Richard Tobin's rxp.
Both work under Linux and Windows/DOS. The difference is
in the format of the error listing (if any), and that some
versions of onsgmls do not
retrieve DTDs or other files over the network, whereas
rxp does.
Make sure your XML file correctly references its DTD
in a Document Type Declaration, and that the DTD file[s]
are locally accessible (rxp
will retrieve them if you have an Internet connection;
onsgmls may not, so it may need
a local copy).
Download and install the software. Make sure it is
installed to a location where your operating system can
find it. If you don't know what any of this means, you
will need some help from someone who knows how to download
and install software on your type of operating
system.
For onsgmls, copy
pubtext/xml.soc and
pubtext/xml.dcl to your working
directory.
To validate myfile.xml, open a
shell (command or terminal) window (Linux) or an MS-DOS
(command) window (Microsoft Windows). In these examples
we'll assume your XML file is called
myfile.xml and it's in a folder
called myfolder. Use the real names
of your folder and file when you type the commands.
- For onsgmls:
$ onsgmls -wxml -wundefined -cxml.soc -s myfile.xml
There are many other options
for onsgmls which are
described on its Web
page. The ones given here are required
because onsgmls is based on an SGML parser: they
switch it to XML mode and suppress the
normal output, leaving just the errors (if
any).
In Microsoft Windows you may have to prefix the
onsgmls command with the full
path to wherever it was installed, eg
C:\Program Files\OpenSP\bin\onsgmls.
- For rxp:
$ rxp myfile.xml
rxp also has some options
which are described on its Web
page.
In Microsoft Windows you may have to prefix the
rxp command with the full path to
wherever it was installed, eg
C:\Program Files\ltxml2\bin\rxp.
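If you need to check files from a script rather than by typing the commands, the two command lines above can be wrapped as in the following minimal Python sketch. The function names are made up for this example, and the options are only those shown above. The sketch assumes the program is on your PATH and collects both the standard output and the error output, since it makes no assumption about which one a given version writes its messages to.

```python
import shutil
import subprocess

def onsgmls_command(xml_path, catalog="xml.soc"):
    """Build the onsgmls command line shown above for one file."""
    return ["onsgmls", "-wxml", "-wundefined", f"-c{catalog}", "-s", xml_path]

def rxp_command(xml_path):
    """Build the rxp command line shown above for one file."""
    return ["rxp", xml_path]

def run_checker(command):
    """Run the command; return (exit code, messages), or None if the program is not found."""
    if shutil.which(command[0]) is None:
        return None
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode, result.stdout + result.stderr

outcome = run_checker(rxp_command("myfile.xml"))
if outcome is None:
    print("rxp was not found on the PATH")
else:
    code, messages = outcome
    print(code, messages)
```

In Microsoft Windows, replace the program name at the start of the list with the full path if the program is not on your PATH.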