Validating XML Documents: A Comprehensive Guide, Study notes of Computer Science

A detailed tutorial on validating XML documents using Document Type Definitions (DTDs) and XML Schema. It covers the basics of XML, the differences between DTDs and XML Schema, and how to validate XML documents using various tools. It also includes examples of XML documents and their corresponding schema files.

Uploaded on 09/02/2009

koofers-user-6ox-1
Validating XML
Presented by developerWorks, your source for great tutorials
ibm.com/developerWorks
Table of Contents
If you're viewing this document online, you can click any of the topics below to link directly to that section.
1. Introduction 2
2. Validation basics 5
3. Validating a document 9
4. Document Type Definitions (DTDs) 13
5. XML Schema 20
6. Validation summary 28

Section 1. Introduction

Should I take this tutorial?

This tutorial examines the validation of XML documents using either Document Type Definitions (DTDs) or XML Schema. It is aimed at developers who have a need to control the types and content of the data in their XML documents, and assumes that you are familiar with the basic concepts of XML. (You can get a basic grounding in XML itself through the Introduction to XML tutorial.) It also assumes a basic familiarity with XML Namespaces. (You can pick up the basics of namespaces in the Understanding DOM tutorial.)

This tutorial demonstrates validation using Java from the command line, but the principles and concepts of validation are the same for any programming environment, so Java experience is not required to gain a thorough understanding. DTDs and XML Schema, in particular, are language- and platform-independent.

What is XML validation?

In the creation of a database, using a data model in conjunction with integrity constraints can ensure that the structure and content of the data meet the requirements. But how do you enforce that kind of control using XML, when your data is just text in hand-editable files? Fortunately, validating files and documents can make sure that data fits constraints. In this tutorial, you'll learn what validation is, and you'll learn how to check a document against a Document Type Definition (DTD) or an XML Schema document.

DTDs were originally defined in the XML 1.0 Recommendation and are a carryover from the original Standard Generalized Markup Language (SGML), the precursor to both HTML and XML. Their syntax is slightly different from that of XML itself, which is one drawback to using them. They also have limitations in how they can be used, which led developers to seek an alternative in the form of XML schemas. DTDs are still in use in a significant number of environments, however, so an understanding of them is important.

There are a number of competing schema proposals, but the primary alternative to DTDs is the XML Schema Recommendation maintained by the W3C. (Throughout the course of the tutorial, "XML Schema" should be considered synonymous with "W3C XML Schema.") These schema documents, which, in terms of syntax, are also XML documents, provide a more familiar and more powerful environment in which to create the constraints on the data that an XML document can contain.

Please note that validation is by no means a requirement when working with XML data. If the overall structure and content of the XML data aren't crucial, feel free to bypass validation.

By the end of this tutorial, you will have learned how to create both a DTD and an XML Schema document, and how to use them to validate an XML document.

Section 2. Validation basics

What is validation?

XML files are designed to be easy for people to read and edit. They are also designed for easy data exchange among different systems and different applications. Unfortunately, both of these advantages can work against the need for data to be in a specific format. Validation enables confirmation that XML data follows a specific predetermined structure so that an application can receive it in a predictable way. This structure against which the data is compared can be provided in a number of different ways, including Document Type Definitions (DTDs) and XML schemas.

A document that has been checked against a DTD or schema in this way is considered a "valid" document.

Valid versus well formed

Because valid has other meanings in the English language, it is sometimes confused with the XML-specific term well formed.

A well-formed document conforms to the rules of XML. All elements have start and end tags, all attributes are enclosed in quotes, all elements are nested correctly, and so on. A document cannot be parsed unless it is well formed.

Just because a document can be parsed, however, does not mean that it is valid in the XML sense. In order to be considered valid, a document must be parsed by a validating parser, which compares it to a predetermined structure.

Valid documents are always well formed, but well-formed documents are not necessarily valid.
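The distinction can be seen by running three small documents through a validating parser. The following is only a sketch: the internal DTD, the class name WellFormedVsValid, and the classify() helper are assumptions made for illustration and are not part of the tutorial's example files.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedVsValid {

    // Hypothetical internal DTD: a memory holds a donor followed by a subject.
    static final String DTD =
        "<!DOCTYPE memory [\n"
      + "  <!ELEMENT memory (donor, subject)>\n"
      + "  <!ELEMENT donor (#PCDATA)>\n"
      + "  <!ELEMENT subject (#PCDATA)>\n"
      + "]>\n";

    /** Parses the text with a validating parser and classifies the outcome. */
    static String classify(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setValidating(true);
        DocumentBuilder db = dbf.newDocumentBuilder();
        final boolean[] invalid = { false };
        db.setErrorHandler(new DefaultHandler() {
            // Validity problems arrive here as recoverable errors.
            @Override
            public void error(SAXParseException e) {
                invalid[0] = true;
            }
        });
        try {
            db.parse(new InputSource(new StringReader(xml)));
        } catch (SAXParseException e) {
            // Well-formedness problems are fatal and stop the parse.
            return "not well formed";
        }
        return invalid[0] ? "well formed but invalid" : "valid";
    }

    public static void main(String[] args) throws Exception {
        String valid = DTD + "<memory><donor>Elizabeth Davison</donor>"
                     + "<subject>Beach volleyball</subject></memory>";
        String invalid = DTD + "<memory><subject>Beach volleyball</subject></memory>";
        String broken = DTD + "<memory><donor>Elizabeth Davison</memory>";
        System.out.println(classify(valid));
        System.out.println(classify(invalid));
        System.out.println(classify(broken));
    }
}
```

The second document parses without trouble but is missing the donor the DTD asks for; the third never gets as far as validation because its donor element is not closed.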

Document Type Definitions (DTD)

The concept behind validation actually predates XML itself. XML was originally created as a simplified subset of SGML. SGML allows different systems to talk to each other by allowing authors to create a DTD. As long as the data followed the DTD, each system could read it.

DTDs define elements that are allowed in a document, what they can contain, and the attributes they can and/or must have.

Compare this simple document and its DTD:

The document:

Elizabeth Davison Beach volleyball

XML Schema:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:element name="memories">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="memory" type="memoryType"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <xsd:complexType name="memoryType">
    <xsd:sequence>
      <xsd:element name="subdate" type="xsd:date"/>
      <xsd:element name="donor" type="xsd:string"/>
      <xsd:element name="subject" type="xsd:string"/>
    </xsd:sequence>
    <xsd:attribute name="tapeid" type="idNumber" />
  </xsd:complexType>

</xsd:schema>

Notice that the syntax of the schema definitions is different from the syntax of DTDs: the schema is itself an XML document. The way the definitions are tied to a document also differs. A schema is associated with a document through XML Namespaces rather than through a DOCTYPE declaration.

The example document

The complete example XML file for validation in this tutorial consists of information that is part of The Millennium Memory Project, which collects donated home movies and other personal history recordings for posterity.

Each entry consists of a memory and the information about it, such as the donor, location, and subject.

2001-05-23 John Baker Fishing with the grandchildren on beautiful day. Pier 60 2001-05-18 Elizabeth Davison Beach volleyball Asbury Park, NJ

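import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

public class ErrorChecker implements ErrorHandler {

    // Recoverable errors, which include validation errors
    public void error (SAXParseException e) {
        System.out.println("Parsing error: "+e.getMessage());
    }

    // Conditions the parser considers worth mentioning but not errors
    public void warning (SAXParseException e) {
        System.out.println("Parsing problem: "+e.getMessage());
    }

    // Well-formedness errors; the parser cannot continue past these
    public void fatalError (SAXParseException e) {
        System.out.println("Parsing error: "+e.getMessage());
        System.out.println("Cannot continue.");
        System.exit(1);
    }
}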

Validation in JAXP

As discussed in the introduction of this tutorial, it is not necessary to code and run the following examples to understand validation. Should you decide to do so, using JAXP to parse (and ultimately validate) a document involves four steps. (The next panel discusses Validation in Xerces Java on page 11.)

  1. Create the DocumentBuilderFactory. Because the DocumentBuilder, which actually parses the document, is an abstract class, it cannot be instantiated directly. Instead, a DocumentBuilderFactory is created. This factory has certain properties, such as validation (queried with isValidating()), that determine the behavior of any parsers created with it. To create a validating parser, use setValidating(true).
  2. Create the DocumentBuilder. Use the DocumentBuilderFactory to create the DocumentBuilder object, which parses the document.
  3. Assign the ErrorHandler. It doesn't do any good for the parser to check for problems if it doesn't know what to do with them. Use the setErrorHandler() method of the DocumentBuilder to tell the parser to send errors to a new ErrorChecker object, which was created in the previous panel, Create an error handler on page 9.
  4. Parse the document. If a document is not well formed, StructureTest will catch the exception. If it is well formed but there are validation errors, the parser sends them to the ErrorChecker object which reports them.

This is the basic principle behind validating a document: create a validating parser, determine the destination for validation errors, and parse the document. The particulars are slightly different for Validation in Xerces Java on page 11.

import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class StructureTest {
    public static void main (String args[]) {
        File docFile = new File("memory.xml");
        try {
            // Step 1: create the factory and ask it for validating parsers
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setValidating(true);

            // Step 2: create the parser
            DocumentBuilder db = dbf.newDocumentBuilder();

            // Step 3: send validation errors to the ErrorChecker
            ErrorChecker errors = new ErrorChecker();
            db.setErrorHandler(errors);

            // Step 4: parse (and validate) the document
            Document doc = db.parse(docFile);
        } catch (Exception e) {
            System.out.print("Parsing problem.");
        }
    }
}

Validation in Xerces Java

Using Xerces Java to validate a document involves the same basic principles as Validation in JAXP on page 10. Here the steps are:

  1. Create the parser. Xerces allows the direct creation of the parser, unlike JAXP, which requires using factories.
  2. Turn on validation.
  3. Set the error handler.
  4. Parse the document.

import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.Document;

public class SchemaTest {
    public static void main (String args[]) {
        try {
            // Step 1: create the parser directly
            DOMParser parser = new DOMParser();

            // Step 2: turn on validation. Xerces2 checks against an
            // XML Schema only when the schema feature is enabled as well.
            parser.setFeature("http://xml.org/sax/features/validation", true);
            parser.setFeature("http://apache.org/xml/features/validation/schema", true);
            parser.setProperty(
                "http://apache.org/xml/properties/schema/external-noNamespaceSchemaLocation",
                "memory.xsd");

            // Step 3: set the error handler
            ErrorChecker errors = new ErrorChecker();
            parser.setErrorHandler(errors);

            // Step 4: parse the document
            parser.parse("memory.xml");
            Document doc = parser.getDocument();
        } catch (Exception e) {
            System.out.print("Problem parsing the file.");
        }
    }
}

The application directly instantiates a DOMParser. Each parser created this way has a set of features, one of which is validation. The parser's setFeature() method turns it on.

You can specify the location of the schema document within the XML (as seen in The XML Schema instance document on page 20), or you can specify it within the application itself using a property on the parser, as seen above.
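The feature and property names above are specific to Xerces. As an aside, later versions of JAXP (Java 5 and up) include a parser-independent javax.xml.validation package that can do the same job. The sketch below is not part of the tutorial's example code; the class name JaxpSchemaCheck, the isValid() helper, and the cut-down schema are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

public class JaxpSchemaCheck {

    // Hypothetical, cut-down schema: a memory holds a date and a donor.
    static final String XSD =
        "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
      + "<xsd:element name='memory'>"
      + "<xsd:complexType><xsd:sequence>"
      + "<xsd:element name='subdate' type='xsd:date'/>"
      + "<xsd:element name='donor' type='xsd:string'/>"
      + "</xsd:sequence></xsd:complexType>"
      + "</xsd:element>"
      + "</xsd:schema>";

    /** Returns true if the instance text validates against XSD. */
    static boolean isValid(String xml) throws Exception {
        SchemaFactory sf =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = sf.newSchema(new StreamSource(new StringReader(XSD)));
        Validator validator = schema.newValidator();
        try {
            // With no ErrorHandler set, the first error is thrown as an exception.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isValid(
            "<memory><subdate>2001-05-23</subdate><donor>John Baker</donor></memory>"));
        System.out.println(isValid(
            "<memory><subdate>5/23/2001</subdate><donor>John Baker</donor></memory>"));
    }
}
```

The second call fails validation because 5/23/2001 is not in the format that xsd:date requires.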

Section 4. Document Type Definitions (DTDs)

External DTDs

In order to validate a document, you must have a standard to validate it against. The oldest and best-supported means for specifying requirements in an XML document is the DTD. A DTD may be internal or external.

When it comes to DTDs, most people are more familiar with the external variety in which the DOCTYPE declaration refers to a file containing the actual definitions.

There are several ways to designate the location of a DTD file. For example, an XHTML file can designate a DTD that determines whether it is following the XHTML Strict, XHTML Transitional, or XHTML Frameset Recommendations developed at the W3C. To designate XHTML Transitional, the author might specify:

The DOCTYPE declaration consists of several parts:

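  • The name of the root element: If the document's root element had any other name, it would immediately be deemed invalid.
  • PUBLIC: A DOCTYPE can designate a publicly recognized DTD, potentially saving the processor a trip to the server to retrieve it. The alternative, SYSTEM, is shown below. A SYSTEM identifier indicates the URI where the DTD can be found.
  • "-//W3C//DTD HTML 4.01 Transitional//EN": The actual public identifier, which in this form names the HTML 4.01 Transitional DTD. (The public identifier of the XHTML 1.0 Transitional DTD is "-//W3C//DTD XHTML 1.0 Transitional//EN".)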

For custom DTDs, developers typically use a SYSTEM identifier, such as:

The parts match those for a PUBLIC identifier, except the declaration shows the location of the DTD.

Typically, the DOCTYPE declaration also specifies the SYSTEM identifier when using a PUBLIC identifier in case the processor doesn't recognize the latter:

The external DTD file simply contains the definitions, starting with the Elements on page 14. For an internal DTD, these definitions are part of the XML file itself.

Structure of an internal DTD

An external DTD can specify the contents of many different documents, making them somewhat easier to maintain. However, there are times when a valid document needs to stand on its own. In this case, you need to include the DTD information within the document itself.

** **]>** TBD TBD

Ignoring the actual content for a moment, notice the structure of the internal DTD. The DOCTYPE declaration still contains the information but rather than referring to a local or remote file, the actual DTD is included between the brackets.

Elements

In both internal and external DTDs, elements are the foundation of an XML document, so they are typically defined first.

An element definition consists of the ELEMENT keyword, the name of the element, and the content it can contain. The content of an element may be text, other elements, or nothing at all (in the case of empty elements).

]>

Designate an element that can contain text with the #PCDATA keyword. This is short for parsed character data; it refers to any text within an element and cannot include markup. Examples are the subdate, donor, and subject elements.

The memory and memories elements show the syntax used to specify elements that contain only other elements as content.
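To see text-only and element-only declarations at work, the sketch below checks a few memories documents against an internal DTD. The DTD is an assumption built from the element names used in this tutorial (the tutorial's own DTD may differ), and ElementContentCheck and countErrors() are hypothetical names.

```java
import java.io.StringReader;
import java.util.concurrent.atomic.AtomicInteger;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class ElementContentCheck {

    // Assumed DTD: element-only content for memories and memory,
    // text-only (#PCDATA) content for the three leaf elements.
    static final String DTD =
        "<!DOCTYPE memories [\n"
      + "  <!ELEMENT memories (memory)*>\n"
      + "  <!ELEMENT memory (subdate, donor, subject)>\n"
      + "  <!ELEMENT subdate (#PCDATA)>\n"
      + "  <!ELEMENT donor (#PCDATA)>\n"
      + "  <!ELEMENT subject (#PCDATA)>\n"
      + "]>\n";

    /** Validates one memory element and returns the number of errors reported. */
    static int countErrors(String memory) throws Exception {
        String xml = DTD + "<memories>" + memory + "</memories>";
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setValidating(true);
        DocumentBuilder db = dbf.newDocumentBuilder();
        final AtomicInteger errors = new AtomicInteger();
        db.setErrorHandler(new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                System.out.println("  " + e.getMessage());
                errors.incrementAndGet();
            }
        });
        db.parse(new InputSource(new StringReader(xml)));
        return errors.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Children in declared order:");
        countErrors("<memory><subdate>2001-05-18</subdate>"
            + "<donor>Elizabeth Davison</donor>"
            + "<subject>Beach volleyball</subject></memory>");
        System.out.println("Children out of order:");
        countErrors("<memory><donor>Elizabeth Davison</donor>"
            + "<subdate>2001-05-18</subdate>"
            + "<subject>Beach volleyball</subject></memory>");
        System.out.println("Loose text in an element-only element:");
        countErrors("<memory>Beach volleyball</memory>");
    }
}
```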

An element can also be defined as EMPTY, as in the media element. Empty elements typically carry all information in attributes. For example:

Elements that contain other elements form what are known as parent-child relationships. For example, the media element is contained within the memory element, so the memory element is considered the parent of the media element. Conversely, the media element is the child of the memory element. One parent, such as memory, may have multiple children.

The order of child elements can also be determined by looking at a DTD. Paradoxically, while child elements must always appear in the order in which they appear in the DTD, the DTD can be written in such a way that the children can appear in any order.

Strictly speaking, the required order doesn't change, but the options do. For example, the current DTD specifies that the location element can have either a place or a description.

If this choice could be repeated, as in:

<!ELEMENT location (place | description)*>

then the location could contain a place and a description, in any order. The same thought could be applied to the memory element:

In this case, the elements can appear in any order because the DTD allows unlimited choices. First a subdate could be chosen, then a location, then a donor, and so on. Notice, however, that once this technique is employed, certain previous restrictions become useless. Because choices can be made more than once, any of the specified elements can be chosen any number of times, or not at all.
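The trade-off is easy to try out. The sketch below validates several location elements against a content model passed in as a string, so the single choice and the repeated choice can be compared side by side. RepeatedChoice and countErrors() are hypothetical names, and the place and description declarations are assumed to be plain text.

```java
import java.io.StringReader;
import java.util.concurrent.atomic.AtomicInteger;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class RepeatedChoice {

    /** Validates location content against a content model; returns the error count. */
    static int countErrors(String model, String content) throws Exception {
        String xml =
            "<!DOCTYPE location [\n"
          + "  <!ELEMENT location " + model + ">\n"
          + "  <!ELEMENT place (#PCDATA)>\n"
          + "  <!ELEMENT description (#PCDATA)>\n"
          + "]>\n"
          + "<location>" + content + "</location>";
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setValidating(true);
        DocumentBuilder db = dbf.newDocumentBuilder();
        final AtomicInteger errors = new AtomicInteger();
        db.setErrorHandler(new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                errors.incrementAndGet();
            }
        });
        db.parse(new InputSource(new StringReader(xml)));
        return errors.get();
    }

    public static void main(String[] args) throws Exception {
        String once = "(place | description)";
        String repeated = "(place | description)*";
        String both = "<description>By the water</description><place>Pier 60</place>";

        // A single choice rejects two children; the repeated choice accepts them.
        System.out.println(countErrors(once, both) > 0);
        System.out.println(countErrors(repeated, both) == 0);

        // The cost: an empty location and a doubled place are accepted too.
        System.out.println(countErrors(repeated, "") == 0);
        System.out.println(countErrors(repeated,
            "<place>Pier 60</place><place>Asbury Park, NJ</place>") == 0);
    }
}
```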

This is a serious limitation of DTDs. It is overcome through the use of XML schemas, which allow much greater control. Schemas are also useful when defining mixed content.

Mixed content

A mixed content element contains both text and other elements. One good example of this is text containing HTML markup. Consider the following potential subject:

<subject>A reading of Charles Dickens' <i>A Christmas Carol</i>. Absolutely marvelous!</subject>

This is known as mixed content because it has both character data and an element (i). In order to make this acceptable to a validating parser, the i element must be defined and the subject element must be allowed to take any number of either #PCDATA or i choices. To allow common markup, the DTD needs to read:

<!ELEMENT subject (#PCDATA | i)*>
<!ELEMENT i (#PCDATA)>

Note that while this does solve the problem, there is no way to constrain the order. This, too, is a problem solved by XML Schema.
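A quick way to confirm the effect of a mixed content declaration is to validate the same subject twice, once with a text-only model and once with a mixed model. This is a sketch only: MixedContent and countErrors() are hypothetical names, and the placement of the i element around the book title is an assumption.

```java
import java.io.StringReader;
import java.util.concurrent.atomic.AtomicInteger;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class MixedContent {

    // Assumed markup: the title is wrapped in an i element.
    static final String SUBJECT =
        "<subject>A reading of Charles Dickens' <i>A Christmas Carol</i>. "
      + "Absolutely marvelous!</subject>";

    /** Validates SUBJECT against the given content model for subject. */
    static int countErrors(String subjectModel) throws Exception {
        String xml =
            "<!DOCTYPE subject [\n"
          + "  <!ELEMENT subject " + subjectModel + ">\n"
          + "  <!ELEMENT i (#PCDATA)>\n"
          + "]>\n"
          + SUBJECT;
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setValidating(true);
        DocumentBuilder db = dbf.newDocumentBuilder();
        final AtomicInteger errors = new AtomicInteger();
        db.setErrorHandler(new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                errors.incrementAndGet();
            }
        });
        db.parse(new InputSource(new StringReader(xml)));
        return errors.get();
    }

    public static void main(String[] args) throws Exception {
        // Text-only content: the i child is not allowed.
        System.out.println("(#PCDATA) errors: " + countErrors("(#PCDATA)"));
        // Mixed content: text and i elements in any number and order.
        System.out.println("(#PCDATA | i)* errors: " + countErrors("(#PCDATA | i)*"));
    }
}
```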

Define attributes

While it is possible to create an XML structure with nothing but elements, the more common situation is elements with attributes. Attributes must also be defined if they are to appear on elements in a validated document.

There are several ways to define an attribute. The first is to simply designate it as character data, or CDATA:

<!ATTLIST memory tapeid CDATA #REQUIRED>

In this case, the DTD assigns the attribute tapeid to the memory element. The tapeid attribute consists of character data, and is required. An attribute can also be designated as #IMPLIED, meaning that it is optional and has no default value, or as #FIXED, in which case the fixed value must also be specified.

Some attributes are enumerated, meaning that a value must be chosen from a predetermined list. For example:

In this case, the document must choose a value from the list. If no value is provided, the parser will use the default value of 8mm. The default is applied whenever the parser reads the DTD, even if the parser is not validating. (A non-validating parser is not required to read an external DTD, so defaults declared there may not be applied.)
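The following sketch shows a default being supplied by a non-validating parser. The attribute name mediatype and the values in the list are assumptions made for illustration (only the 8mm default comes from the text above), as are the class name DefaultAttributeValue and the mediaType() helper.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DefaultAttributeValue {

    /** Parses a lone media element and returns its mediatype attribute. */
    static String mediaType(String mediaTag) throws Exception {
        String xml =
            "<!DOCTYPE media [\n"
          + "  <!ELEMENT media EMPTY>\n"
          // Hypothetical enumerated attribute with a default of 8mm
          + "  <!ATTLIST media mediatype (8mm | vhs | audio) '8mm'>\n"
          + "]>\n"
          + mediaTag;
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setValidating(false);   // the default is applied even so
        Document doc = dbf.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getAttribute("mediatype");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(mediaType("<media/>"));
        System.out.println(mediaType("<media mediatype='vhs'/>"));
    }
}
```

Because the DTD here is internal, the parser always reads it, so the first call reports 8mm even though the document never mentions the attribute.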

Multiple attributes can be designated with a single ATTLIST definition:

A second means for defining attribute content involves IDs and IDREFs.

IDs and IDREFs

It is sometimes necessary to "link" data together with the use of an identifier, much the way primary and foreign keys work in a relational database. For example, it might be a requirement that the memory identifier matches up with the media identifier, so that a memory can be located. ID and IDREF datatypes allow you to enforce such data integrity:
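As a sketch of how a validating parser enforces this, the code below declares one attribute as an ID and another as an IDREF and then checks three small documents. Which attribute carries the ID and which the IDREF is an assumption here, as are the attribute name mediaid, the class name IdRefCheck, and the countErrors() helper.

```java
import java.io.StringReader;
import java.util.concurrent.atomic.AtomicInteger;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class IdRefCheck {

    // Assumed declarations: media owns the ID, memory refers to it.
    static final String DTD =
        "<!DOCTYPE memories [\n"
      + "  <!ELEMENT memories (memory | media)*>\n"
      + "  <!ELEMENT memory EMPTY>\n"
      + "  <!ELEMENT media EMPTY>\n"
      + "  <!ATTLIST media mediaid ID #REQUIRED>\n"
      + "  <!ATTLIST memory tapeid IDREF #REQUIRED>\n"
      + "]>\n";

    /** Validates the given children of memories; returns the error count. */
    static int countErrors(String children) throws Exception {
        String xml = DTD + "<memories>" + children + "</memories>";
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setValidating(true);
        DocumentBuilder db = dbf.newDocumentBuilder();
        final AtomicInteger errors = new AtomicInteger();
        db.setErrorHandler(new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                System.out.println("  " + e.getMessage());
                errors.incrementAndGet();
            }
        });
        db.parse(new InputSource(new StringReader(xml)));
        return errors.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("Reference matches an ID:");
        countErrors("<media mediaid='T1'/><memory tapeid='T1'/>");
        System.out.println("Reference to an ID that does not exist:");
        countErrors("<media mediaid='T1'/><memory tapeid='T2'/>");
        System.out.println("The same ID used twice:");
        countErrors("<media mediaid='T1'/><media mediaid='T1'/>");
    }
}
```

Note that ID values must be legal XML names, so they cannot begin with a digit; that is why the sample identifiers start with a letter.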

All of these difficulties are resolved with the use of XML Schema.

Section 5. XML Schema

The XML Schema instance document

In contrast to DTDs, schema documents are built in XML itself. Validation using schemas requires two documents: the schema document, and the instance document.

The schema document is the document containing the structure, and the instance document is the document containing the actual XML data. An application determines the schema for an instance document in one of two ways:

  1. From the document itself: While documents use the DOCTYPE declaration to point to an external DTD, they use attributes and namespaces to point to an external schema document:
...

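First declare the XML Schema instance namespace (http://www.w3.org/2001/XMLSchema-instance, conventionally bound to the xsi prefix), then use its noNamespaceSchemaLocation attribute to give the location of the schema document. Schemas can also be created for a particular target namespace. In that case, specify the targetNamespace in the schema document itself, and point to the schema from the instance document with the schemaLocation attribute instead.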

  2. Through properties set within the application: With Xerces, set the http://apache.org/xml/properties/schema/external-schemaLocation and the http://apache.org/xml/properties/schema/external-noNamespaceSchemaLocation properties to determine the location of the schema document, as seen in Validation in Xerces Java on page 11.

Structure of a schema document

A schema document is simply an XML document with predefined elements and attributes describing the structure of another XML document.

Consider this sample schema document:

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

<xsd:element name="memories"> <xsd:complexType> <xsd:sequence> <xsd:element name="memory" maxOccurs="unbounded" type="memoryType"/> </xsd:sequence> </xsd:complexType> </xsd:element> <xsd:complexType name="memoryType"> <xsd:sequence> <xsd:element name="media">