While it is possible to write conversion routines by inventing your own XML parser, it is not recommended except as an exercise for students of computing science. All major languages already have XML libraries that do all the heavy lifting of parsing (and validating, if needed).
However, you do need to know what's in the XML document before you start: there is no magic wand that will automatically deduce what things mean and where they are located in the file.
If you have been handed some unknown XML files out of the blue, you will need to go and find the creator or some documentation about them. The first 2–3 lines of the file (the Document Type Declaration) may hold a clue as to what type of XML they are. You may also need a copy of the DTD or Schema to which the files have been created, as it will contain the rules of structure, and may contain information about the data types of attributes.
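As a rough illustration of looking for those clues, the following Python sketch reads the Document Type Declaration and the root element of a file using the standard library's xml.dom.minidom. The sample document, its identifiers, and the function name describe_xml are invented for this example; a real file may have no DOCTYPE at all, in which case the root element name and namespace are the only hints.

```python
from xml.dom import minidom

# Hypothetical sample document, invented for illustration only.
SAMPLE = (
    '<!DOCTYPE recipe PUBLIC "-//Example//DTD Recipe 1.0//EN" "recipe.dtd">'
    "<recipe><title>Soup</title></recipe>"
)


def describe_xml(text):
    """Return the clues a parser can report about an unknown XML document."""
    doc = minidom.parseString(text)
    root = doc.documentElement
    doctype = doc.doctype  # None when the document has no DOCTYPE
    return {
        "root": root.tagName,
        "namespace": root.namespaceURI,
        "public_id": doctype.publicId if doctype else None,
        "system_id": doctype.systemId if doctype else None,
    }


if __name__ == "__main__":
    for key, value in describe_xml(SAMPLE).items():
        print(f"{key}: {value}")
```

This only reports what the file declares about itself; it does not fetch or read the DTD, so you still need a copy of that (or the Schema) to learn the rules of structure.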
The use of a utility or language designed for the task is strongly recommended. XSLT3 has all the facilities for handling XML built in from the start, and standalone processors are available for all platforms. Many XML editors have a copy of XSLT (XSLT3, hopefully) built in, so they offer an integrated development environment for editing and conversion. XSLT3 conversion can also run inside server packages like Apache Cocoon, and there are versions of at least two processors (Saxon-CE and Saxon-JS) which run in browsers.
Other options for programmed conversion are:
Use an XML processing or pipelining package. XProc is a W3C Recommendation for pipelining, and XML Calabash is an Open Source implementation. There are others, some of which are commercial products which provide extensive document management, document database, and document conversion and editing functions, often as part of a much larger enterprise information solution, using XSLT3 or their own in-house systems. Two popular ones are MarkLogic and OmniMark.
For ‘data’ XML, you can use a conversion system that does not require writing code: Flexter and ReData are examples of conversion systems using a graphical interface for mapping source elements (XML) to target fields (several formats). While this approach is probably not appropriate for ‘document’ XML (books, articles, etc) it provides a useful method for data which is essentially tabular or rectangular in structure, as these systems can connect directly to your database.
Use a conventional compilable language. Java or C (or one of its many ++/♯ variants) would be common; Pascal, FORTRAN, or COBOL are rare these days, but XML libraries do exist for them. BASIC, anyone?
Use a scripting language. Perl, Python, Tcl, VBScript, or even PowerShell are all popular, and XML libraries exist for them; the Python ones have an excellent reputation.
Combine XML utilities with standard shell command utilities. Here is an early example of an XML-to-CSV routine which uses onsgmls to parse and validate the document and expose the ESIS, and awk to reformat it into CSV. Similar processes can be developed using the LTXML2 toolkit.
There are downloadable or web-based (sometimes free) programs claiming to be ‘easy’ XML converters. The editor would like to hear recommendations or warnings ⌣.
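To give a flavour of the scripting-language option, here is a minimal Python sketch that converts a small piece of ‘data’ XML to CSV using only the standard library (xml.etree.ElementTree and csv). The element and attribute names (orders, order, customer, total, currency) and the function name orders_to_csv are assumptions made up for the example; as noted above, you would need to know the real structure of your own files before writing anything like this.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical 'data' XML, invented for illustration only.
SAMPLE = """<orders>
  <order id="1001"><customer>Ada</customer><total currency="EUR">19.50</total></order>
  <order id="1002"><customer>Grace</customer><total currency="USD">7.25</total></order>
</orders>"""


def orders_to_csv(xml_text):
    """Convert each <order> element into one CSV row and return the CSV text."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "customer", "total", "currency"])
    for order in root.iter("order"):
        total = order.find("total")
        writer.writerow([
            order.get("id", ""),
            order.findtext("customer", default=""),
            total.text if total is not None else "",
            total.get("currency", "") if total is not None else "",
        ])
    return out.getvalue()


if __name__ == "__main__":
    print(orders_to_csv(SAMPLE), end="")
```

The csv module takes care of quoting fields that contain commas or quotation marks, which is one reason to prefer a library over assembling the output by string concatenation.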
One factor to consider is whether your application needs to ‘push’ the XML through the conversion process element by element (perhaps because much of it needs to be treated in a particular order), or whether it needs to ‘pull’ identifiable elements of the XML into the conversion process because only those elements are needed, or because the data is undergoing restructuring in the process. Most conversion systems can do either, or a mixture of both.
Rick Jelliffe writes:
Pull vs push
[…] it is also typical that in a pull-program, the structure of the input document is largely irrelevant: you just want to know how to address the information you want, and the input document tree could be rebalanced or rotated or sharded or differently arranged without any effect (apart from the addressing, which needs to be updated.)
However, with push programming, the structure of the input document is part of the information being represented. Change the structure (nesting and ordering) or represent attributes instead as elements, and you corrupt the information.
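The difference can be sketched in a few lines of Python with xml.etree.ElementTree. This is only an illustration under assumed names: the sample document, the HANDLERS table, and the functions pull_title and push_convert are all invented here. The pull function addresses one piece of information and ignores everything else; the push function walks the document in order and hands every element to a handler, so the order of the input determines the order of the output.

```python
import xml.etree.ElementTree as ET

# Hypothetical 'document' XML, invented for illustration only.
DOC = (
    "<article><title>Soup</title><para>Chop.</para>"
    "<note>Hot!</note><para>Boil.</para></article>"
)

# Push style: one handler per element type, applied as each element arrives.
HANDLERS = {
    "title": lambda text: "# " + text,
    "para": lambda text: text,
    "note": lambda text: "NOTE: " + text,
}


def pull_title(xml_text):
    """Pull: address the one item wanted, wherever it sits in the tree."""
    return ET.fromstring(xml_text).findtext(".//title")


def push_convert(xml_text):
    """Push: process every element in document order through its handler."""
    lines = []
    for elem in ET.fromstring(xml_text).iter():
        handler = HANDLERS.get(elem.tag)
        if handler is not None:
            lines.append(handler((elem.text or "").strip()))
    return lines


if __name__ == "__main__":
    print(pull_title(DOC))
    print("\n".join(push_convert(DOC)))
```

Moving the title inside some wrapper element would not affect pull_title, because its address (.//title) still matches; reordering the paragraphs, on the other hand, changes what push_convert produces.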
The process of converting XML to other formats is sometimes referred to as ‘down-converting’, as it may involve the unavoidable loss of information (usually metadata) when the target format simply doesn't have a way to represent it.