





















A detailed tutorial on validating XML documents using Document Type Definitions (DTDs) and XML Schema. It covers the basics of XML, the differences between DTDs and XML Schema, and how to validate XML documents using various tools. It also includes examples of XML documents and their corresponding schema files.






















Section 1. Introduction
Should I take this tutorial?
This tutorial examines the validation of XML documents using either Document Type Definitions (DTDs) or XML Schema. It is aimed at developers who have a need to control the types and content of the data in their XML documents, and assumes that you are familiar with the basic concepts of XML. (You can get a basic grounding in XML itself through the Introduction to XML tutorial.) It also assumes a basic familiarity with XML Namespaces. (You can pick up the basics of namespaces in the Understanding DOM tutorial.)
This tutorial demonstrates validation using Java from the command line, but the principles and concepts of validation are the same for any programming environment, so Java experience is not required to gain a thorough understanding. DTDs and XML Schema, in particular, are language- and platform-independent.
What is XML validation?
In the creation of a database, using a data model in conjunction with integrity constraints can ensure that the structure and content of the data meet the requirements. But how do you enforce that kind of control using XML, when your data is just text in hand-editable files? Fortunately, validating files and documents can make sure that data fits constraints. In this tutorial, you'll learn what validation is, and you'll learn how to check a document against a Document Type Definition (DTD) or an XML Schema document.
DTDs were originally defined in the XML 1.0 Recommendation and are a carryover from the original Standard Generalized Markup Language (SGML), the precursor to HTML. Their syntax is slightly different from XML, which is one drawback to using them. They also have limitations in how they can be used, which led developers to seek an alternative in the form of XML schemas. DTDs are still in use in a significant number of environments, however, so an understanding of them is important.
There are a number of competing schema proposals, but the primary alternative to DTDs is the XML Schema Recommendation maintained by the W3C. (Throughout the course of the tutorial, "XML Schema" should be considered synonymous with "W3C XML Schema.") These schema documents, which are themselves XML documents in terms of syntax, provide a more familiar and more powerful environment in which to define the constraints on the data that can appear in an XML document.
Please note that validation is by no means a requirement when working with XML data. If the overall structure and content of the XML data aren't crucial, feel free to bypass validation.
By the end of this tutorial you will learn how to create both a DTD and an XML Schema document. You'll also learn the concepts of using them to validate an XML document.
Section 2. Validation basics
What is validation?
XML files are designed to be easy for people to read and edit. They are also designed for easy data exchange among different systems and different applications. Unfortunately, both of these advantages can work against the need for data to be in a specific format. Validation enables confirmation that XML data follows a specific predetermined structure so that an application can receive it in a predictable way. This structure against which the data is compared can be provided in a number of different ways, including Document Type Definitions (DTDs) and XML schemas.
A document that has been checked against a DTD or schema in this way is considered a "valid" document.
Valid versus well formed
Because valid has other meanings in the English language, it is sometimes confused with the XML-specific term well formed.
A well-formed document conforms to the rules of XML. All elements have start and end tags, all attributes are enclosed in quotes, all elements are nested correctly, and so on. A document cannot be parsed unless it is well formed.
Just because a document can be parsed, however, does not mean that it is valid in the XML sense. In order to be considered valid, a document must be parsed by a validating parser, which compares it to a predetermined structure.
Valid documents are always well formed, but well-formed documents may not be valid.
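The difference is easy to see by running the same text through a non-validating and a validating parser. The following is a minimal sketch using the JAXP classes that ship with the JDK; the class name, the story element, and the tiny internal DTD are assumptions made up for this illustration and are not part of the tutorial's example files.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class ValidVersusWellFormed {

    // Well formed, but <story> is not declared in the (hypothetical) internal DTD.
    static final String INVALID =
        "<!DOCTYPE memories [\n"
      + "  <!ELEMENT memories (memory)*>\n"
      + "  <!ELEMENT memory (#PCDATA)>\n"
      + "]>\n"
      + "<memories><story>First steps</story></memories>";

    // Not well formed: <memory> is never closed.
    static final String BROKEN = "<memories><memory></memories>";

    /** True if a non-validating parser can parse the text at all. */
    static boolean isWellFormed(String xml) {
        try {
            parse(xml, false);
            return true;
        } catch (SAXException e) {
            return false;   // a fatal error means the text is not well formed
        }
    }

    /** Messages reported by a validating parser; empty means the document is valid. */
    static List<String> validationErrors(String xml) {
        try {
            return parse(xml, true);
        } catch (SAXException e) {
            throw new IllegalArgumentException("document is not well formed", e);
        }
    }

    private static List<String> parse(String xml, boolean validating) throws SAXException {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setValidating(validating);
            DocumentBuilder db = dbf.newDocumentBuilder();
            // Validity problems arrive through error(); fatalError() is left at its
            // default, which rethrows the exception for text that is not well formed.
            db.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    errors.add(e.getMessage());
                }
            });
            db.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }
}
```

With these inputs, isWellFormed(INVALID) is true while validationErrors(INVALID) is not empty: the document can be parsed, but it is not valid. BROKEN fails even the well-formedness check, so it can never be valid.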
Document Type Definitions (DTD)
The concept behind validation actually predates XML itself. XML was first created as an application of SGML, which allows different systems to talk to each other by letting authors create a DTD. As long as the data followed the DTD, each system could read it.
DTDs define elements that are allowed in a document, what they can contain, and the attributes they can and/or must have.
Compare this simple document and its DTD:
The document:
XML Schema:
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="memories">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="memory" type="memoryType"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:complexType name="memoryType">
    <xsd:sequence>
      <xsd:element name="subdate" type="xsd:date"/>
      <xsd:element name="donor" type="xsd:string"/>
      <xsd:element name="subject" type="xsd:string"/>
    </xsd:sequence>
    <xsd:attribute name="tapeid" type="idNumber" />
  </xsd:complexType>
</xsd:schema>
Notice that in this case the syntax for the schema definitions themselves is different from the syntax for DTDs. The means for tying the definitions to a document is also different: it uses XML Namespaces instead of the DOCTYPE.
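One practical consequence of the schema syntax is that content can be typed, as with the xsd:date type on subdate above. The following sketch, which uses the JDK's javax.xml.validation API, checks a lone subdate element against a one-line schema. The class name and the reduced schema are assumptions for illustration only; the tutorial's own schema is larger.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class TypedContentDemo {

    // A deliberately tiny schema: one element whose content must be a date.
    static final String XSD =
        "<xsd:schema xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\">"
      + "<xsd:element name=\"subdate\" type=\"xsd:date\"/>"
      + "</xsd:schema>";

    /** True if the document satisfies the schema above. */
    static boolean isValid(String xml) {
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("the schema itself could not be compiled", e);
        }
        try {
            // With no error handler set, the validator throws on the first error.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Here isValid("<subdate>2001-05-23</subdate>") is true, while isValid("<subdate>May 23, 2001</subdate>") is false, because the text is not in the lexical form xsd:date requires. A DTD could only say that subdate contains character data.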
The example document
The complete example XML file for validation in this tutorial consists of information that is part of The Millennium Memory Project, which collects donated home movies and other personal history recordings for posterity.
Each entry consists of a memory and the information about it, such as the donor, location, and subject.
   }

   public void error (SAXParseException e) {
      System.out.println("Parsing error: "+e.getMessage());
   }

   public void warning (SAXParseException e) {
      System.out.println("Parsing problem: "+e.getMessage());
   }

   public void fatalError (SAXParseException e) {
      System.out.println("Parsing error: "+e.getMessage());
      System.out.println("Cannot continue.");
      System.exit(1);
   }
}
Validation in JAXP
As discussed in the introduction of this tutorial, it is not necessary to code and run the following examples to understand validation. Should you decide to do so, using JAXP to parse (and ultimately validate) a document involves four steps: create a DocumentBuilderFactory, set it to produce validating parsers, create the DocumentBuilder and assign it an error handler, and parse the document. (The next panel discusses validation in Xerces Java.)
This is the basic principle behind validating a document: create a validating parser, determine the destination for validation errors, and parse the document. The particulars are slightly different for validation in Xerces Java.
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import java.io.File;
import org.w3c.dom.Document;

public class StructureTest {
   public static void main (String args[]) {
      File docFile = new File("memory.xml");
      try {
         // Create a factory that produces validating parsers
         DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
         dbf.setValidating(true);

         // Create the parser and send validation errors to ErrorChecker
         DocumentBuilder db = dbf.newDocumentBuilder();
         ErrorChecker errors = new ErrorChecker();
         db.setErrorHandler(errors);

         // Parse (and validate) the document
         Document doc = db.parse(docFile);
      } catch (Exception e) {
         System.out.print("Parsing problem.");
      }
   }
}
Validation in Xerces Java
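Using Xerces Java to validate a document involves the same basic principles as validation in JAXP. Here the steps are: instantiate the parser, turn on validation through the parser's features, set the error handler, and parse the document.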
import org.apache.xerces.parsers.DOMParser;
import java.io.File;
import org.w3c.dom.Document;

public class SchemaTest {
   public static void main (String args[]) {
      File docFile = new File("memory.xml");
      try {
         DOMParser parser = new DOMParser();

         // Turn on validation, and XML Schema validation in particular;
         // Xerces ignores the schema location property unless the schema feature is on
         parser.setFeature("http://xml.org/sax/features/validation", true);
         parser.setFeature("http://apache.org/xml/features/validation/schema", true);

         // Name the schema to use for documents that have no namespace
         parser.setProperty(
            "http://apache.org/xml/properties/schema/external-noNamespaceSchemaLocation",
            "memory.xsd");

         ErrorChecker errors = new ErrorChecker();
         parser.setErrorHandler(errors);
         parser.parse("memory.xml");
      } catch (Exception e) {
         System.out.print("Problem parsing the file.");
      }
   }
}
The application directly instantiates a DOMParser. Each parser created this way has a set of features, one of which is validation. The parser's setFeature() method turns it on.
You can specify the location of the schema document within the XML (as seen in The XML Schema instance document), or you can specify it within the application itself using a property on the parser, as seen above.
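If Xerces is not on your classpath, the same idea of letting the application choose the schema can be sketched with the standard JAXP classes in the JDK. Note that this swaps in DocumentBuilderFactory.setSchema() from javax.xml.validation in place of the Xerces-specific property; the class name and the reduced, in-memory schema are assumptions for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class SchemaFromApplication {

    // Stand-in for memory.xsd, held in memory so the sketch needs no files.
    static final String XSD =
        "<xsd:schema xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\">"
      + "<xsd:element name=\"memories\"><xsd:complexType><xsd:sequence>"
      + "<xsd:element name=\"memory\" type=\"xsd:string\" maxOccurs=\"unbounded\"/>"
      + "</xsd:sequence></xsd:complexType></xsd:element>"
      + "</xsd:schema>";

    /** Parses the document against XSD and returns any validation messages. */
    static List<String> validationErrors(String xml) {
        List<String> errors = new ArrayList<>();
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));

            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);  // schema validation relies on namespace support
            dbf.setSchema(schema);        // the application, not the document, names the schema

            DocumentBuilder db = dbf.newDocumentBuilder();
            db.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    errors.add(e.getMessage());
                }
            });
            db.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse the document", e);
        }
        return errors;
    }
}
```

The instance document needs no reference to the schema at all: validationErrors("<memories><memory>x</memory></memories>") returns an empty list, and a document with an undeclared child element returns at least one message.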
Section 4. Document Type Definitions (DTDs)
External DTDs
In order to validate a document, you must have a standard to validate it against. The oldest and best-supported means for specifying requirements in an XML document is the DTD. A DTD may be internal or external.
When it comes to DTDs, most people are more familiar with the external variety in which the DOCTYPE declaration refers to a file containing the actual definitions.
There are several ways to designate the location of a DTD file. For example, an XHTML file can designate a DTD that determines whether it is following the XHTML Strict, XHTML Transitional, or XHTML Frameset Recommendations developed at the W3C. To designate XHTML Transitional, the author might specify:
The DOCTYPE declaration consists of several parts:
For custom DTDs, developers typically use a SYSTEM identifier, such as:
The parts match those for a PUBLIC identifier, except the declaration shows the location of the DTD.
Typically, the DOCTYPE declaration also specifies the SYSTEM identifier when using a PUBLIC identifier in case the processor doesn't recognize the latter:
The external DTD file simply contains the definitions, starting with the elements. For an internal DTD, these definitions are part of the XML file itself.
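To see how a parser follows a SYSTEM identifier to an external DTD, consider the following sketch. So that it runs without any files on disk, an EntityResolver hands the parser the DTD text whenever the identifier ends in memory.dtd. The file name, the class name, and the two-line DTD are all assumptions made for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class ExternalDtdDemo {

    // What the (hypothetical) external file memory.dtd would contain.
    static final String DTD_TEXT =
        "<!ELEMENT memories (memory)*>\n"
      + "<!ELEMENT memory (#PCDATA)>\n";

    /** Wraps the given content in a document whose DOCTYPE points at memory.dtd. */
    static String document(String body) {
        return "<!DOCTYPE memories SYSTEM \"memory.dtd\">\n"
             + "<memories>" + body + "</memories>";
    }

    /** Validates against the external DTD and returns any messages. */
    static List<String> validationErrors(String xml) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setValidating(true);
            DocumentBuilder db = dbf.newDocumentBuilder();
            // Supply the DTD from memory; returning null lets the parser
            // resolve any other identifier in its usual way.
            db.setEntityResolver((publicId, systemId) ->
                systemId != null && systemId.endsWith("memory.dtd")
                    ? new InputSource(new StringReader(DTD_TEXT))
                    : null);
            db.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    errors.add(e.getMessage());
                }
            });
            db.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse the document", e);
        }
        return errors;
    }
}
```

Because the definitions live outside the document, any number of documents can share them. In a real application you would normally let the parser fetch the file itself and omit the resolver.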
Structure of an internal DTD
An external DTD can specify the contents of many different documents, making them somewhat easier to maintain. However, there are times when a valid document needs to stand on its own. In this case, you need to include the DTD information within the document itself.
Ignoring the actual content for a moment, notice the structure of the internal DTD. The DOCTYPE declaration still contains the information, but rather than referring to a local or remote file, the actual DTD is included between the brackets.
Elements
In both internal and external DTDs, elements are the foundation of an XML document, so they are typically defined first.
An element definition consists of the ELEMENT keyword, the name of the element, and the content it can contain. The content of an element may be text, other elements, or nothing at all (in the case of empty elements).
Designate an element that can contain text with the #PCDATA keyword. This is short for parsed character data; it refers to any text within an element and cannot include markup. Examples are the subdate, donor, and subject elements.
The memory and memories elements show the syntax used to specify elements that contain only other elements as content.
An element can also be defined as EMPTY, as in the media element. Empty elements typically carry all information in attributes. For example:
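The three kinds of content described here (text, child elements, and nothing at all) can be tried out with a short sketch. The internal DTD below is a cut-down assumption based on the element names mentioned in this section, and the mediaid attribute, the sample values, and the class name are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class ElementDeclarations {

    static final String DOCTYPE =
        "<!DOCTYPE memory [\n"
      + "  <!ELEMENT memory (media, subdate, donor)>\n"  // only child elements
      + "  <!ELEMENT media EMPTY>\n"                      // no content at all
      + "  <!ATTLIST media mediaid CDATA #REQUIRED>\n"    // data lives in the attribute
      + "  <!ELEMENT subdate (#PCDATA)>\n"                // text only
      + "  <!ELEMENT donor (#PCDATA)>\n"
      + "]>\n";

    /** Validates a memory element against the DTD above. */
    static List<String> validationErrors(String memoryElement) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setValidating(true);
            DocumentBuilder db = dbf.newDocumentBuilder();
            db.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    errors.add(e.getMessage());
                }
            });
            db.parse(new InputSource(new StringReader(DOCTYPE + memoryElement)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse the document", e);
        }
        return errors;
    }
}
```

A memory element containing an empty media, then a subdate, then a donor validates cleanly. Putting text inside media, or listing the children in a different order, produces validation errors.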
Elements can contain other elements. These relationships are known as parent-child relationships. For example, the media element is contained within the memory element, so the memory element is considered the parent of the media element. Conversely, the media element is the child of the memory element. One parent, such as memory, may have multiple children.
The order of child elements can also be determined by looking at a DTD. Paradoxically, while child elements must always appear in the order in which they appear in the DTD, the DTD can be written in such a way that the children can appear in any order.
Strictly speaking, the required order doesn't change, but the options do. For example, the current DTD specifies that the location element can have either a place or a description.
If this choice could be repeated, as in:
then the location could contain a place and a description, in any order. The same thought could be applied to the memory element:
In this case, the elements can appear in any order because the DTD allows unlimited choices. First a subdate could be chosen, then a location, then a donor, and so on. Notice, however, that once this technique is employed, certain previous restrictions become useless. Because choices can be made more than once, any of the specified elements can be chosen any number of times, or not at all.
This is a serious limitation of DTDs. It is overcome through the use of XML schemas, which allow much greater control. Schemas are also useful when defining mixed content.
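The trade-off is easy to observe by validating the same content against two versions of the location declaration. In this sketch the content model is passed in as a string; the class and method names are assumptions for illustration, and place and description are simply declared as text.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class RepeatedChoiceDemo {

    /**
     * Validates <location> with the given children against a DTD in which
     * location has the given content model, for example "(place | description)*".
     */
    static List<String> validationErrors(String contentModel, String children) {
        String xml =
            "<!DOCTYPE location [\n"
          + "  <!ELEMENT location " + contentModel + ">\n"
          + "  <!ELEMENT place (#PCDATA)>\n"
          + "  <!ELEMENT description (#PCDATA)>\n"
          + "]>\n"
          + "<location>" + children + "</location>";

        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setValidating(true);
            DocumentBuilder db = dbf.newDocumentBuilder();
            db.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    errors.add(e.getMessage());
                }
            });
            db.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse the document", e);
        }
        return errors;
    }
}
```

With the single choice "(place | description)", a location holding both a place and a description is rejected. With the repeated choice "(place | description)*", the two may appear in either order, but so may two place elements, or no children at all, which is exactly the loss of control described above.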
Mixed content
A mixed content element contains both text and other elements. One good example of this is text containing HTML markup. Consider the following potential subject:
This is known as mixed content because it has both character data and an element (i). In order to make this acceptable to a validating parser, the i element must be defined and the subject element must be allowed to take any number of either #PCDATA or i choices. To allow common markup, the DTD needs to read:
Note that while this does solve the problem, there is no way to constrain the order. This, too, is a problem solved by XML Schema.
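A mixed content declaration of the kind just described can be checked with the sketch below. The declaration allows any number of #PCDATA or i choices inside subject; the class name and the sample sentences are assumptions for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class MixedContentDemo {

    static final String DOCTYPE =
        "<!DOCTYPE subject [\n"
      + "  <!ELEMENT subject (#PCDATA | i)*>\n"  // text and i elements, freely mixed
      + "  <!ELEMENT i (#PCDATA)>\n"
      + "]>\n";

    /** Validates a subject element against the mixed content declaration. */
    static List<String> validationErrors(String subjectElement) {
        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setValidating(true);
            DocumentBuilder db = dbf.newDocumentBuilder();
            db.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    errors.add(e.getMessage());
                }
            });
            db.parse(new InputSource(new StringReader(DOCTYPE + subjectElement)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse the document", e);
        }
        return errors;
    }
}
```

Text with an embedded i element passes, and so does plain text or several i elements in a row, since the DTD cannot say where in the text the markup may appear. An element that is not part of the choice, such as b, is reported as an error.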
Define attributes
While it is possible to create an XML structure with nothing but elements, the more common situation is elements with attributes. Attributes must also be defined if they are to appear on elements in a validated document.
There are several ways to define an attribute. The first is to simply designate it as character data, or CDATA:
In this case, the DTD assigns the attribute tapeid to the memory element. The tapeid attribute consists of character data, and is required. An attribute can also be designated as #IMPLIED, meaning that it is optional and has no default, or as #FIXED, in which case a value must also be specified.
Some attributes are enumerated, meaning that a value must be chosen from a predetermined list. For example:
In this case, the document must choose a value from the list. If no value is provided, the parser will use the default value of 8mm. This is the case in any document for which a DTD is present, even if the parser is not validating.
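The defaulting behavior can be observed with a parser that is explicitly not validating. In the sketch below, the attribute name mediatype and the list of allowed values are assumptions for illustration; only the default of 8mm comes from the text above.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class AttributeDefaultDemo {

    // Internal DTD with an enumerated attribute whose default is 8mm.
    static final String DOCTYPE =
        "<!DOCTYPE media [\n"
      + "  <!ELEMENT media EMPTY>\n"
      + "  <!ATTLIST media mediatype (8mm | vhs | dvd) \"8mm\">\n"
      + "]>\n";

    /** Returns the mediatype the application sees for the given media element. */
    static String mediaType(String mediaElement) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setValidating(false);  // the DTD is still read, just not enforced
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(DOCTYPE + mediaElement)));
            return doc.getDocumentElement().getAttribute("mediatype");
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse the document", e);
        }
    }
}
```

An element written as a bare media tag, with no attribute at all, still reports a mediatype of 8mm to the application, while an explicit value such as vhs is passed through unchanged.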
Multiple attributes can be designated with a single ATTLIST definition:
A second means for defining attribute content involves IDs and IDREFs.
IDs and IDREFs
It is sometimes necessary to "link" data together with the use of an identifier, much the way primary and foreign keys work in a relational database. For example, it might be a requirement that the memory identifier matches up with the media identifier, so that a memory can be located. ID and IDREF datatypes allow you to enforce such data integrity:
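As a sketch of how this is enforced, the DTD below declares a mediaid attribute of type ID on media and a tapeid attribute of type IDREF on memory. Which attribute is the ID and which is the IDREF is an assumption made for this illustration, as are the class name and the sample values. Note that ID values must be XML names, so they cannot begin with a digit.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class IdRefDemo {

    static final String DOCTYPE =
        "<!DOCTYPE memories [\n"
      + "  <!ELEMENT memories (memory)*>\n"
      + "  <!ELEMENT memory (media)>\n"
      + "  <!ATTLIST memory tapeid IDREF #REQUIRED>\n"  // must match some ID in the document
      + "  <!ELEMENT media EMPTY>\n"
      + "  <!ATTLIST media mediaid ID #REQUIRED>\n"     // must be unique in the document
      + "]>\n";

    /** Builds a one-memory document with the given identifiers and validates it. */
    static List<String> validationErrors(String tapeid, String mediaid) {
        String xml = DOCTYPE
            + "<memories><memory tapeid=\"" + tapeid + "\">"
            + "<media mediaid=\"" + mediaid + "\"/></memory></memories>";

        List<String> errors = new ArrayList<>();
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setValidating(true);
            DocumentBuilder db = dbf.newDocumentBuilder();
            db.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    errors.add(e.getMessage());
                }
            });
            db.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse the document", e);
        }
        return errors;
    }
}
```

When the two identifiers match, as in validationErrors("T1", "T1"), the document is valid. A tapeid that points at an identifier no element carries, as in validationErrors("T2", "T1"), is reported as a validation error.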
All of these difficulties are resolved with the use of XML Schema.
Section 5. XML Schema
The XML Schema instance document
In contrast to DTDs, schema documents are built in XML itself. Validation using schemas requires two documents: the schema document, and the instance document.
The schema document is the document containing the structure, and the instance document is the document containing the actual XML data. An application determines the schema for an instance document in one of two ways:
First declare the XML Schema instance namespace, then use its noNamespaceSchemaLocation attribute to give the location of the schema document. Schemas can also be created for a particular target namespace. In that case, specify the targetNamespace in the schema document itself.
Structure of a schema document
A schema document is simply an XML document with predefined elements and attributes describing the structure of another XML document.
Consider this sample schema document:
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="memories">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="memory" maxOccurs="unbounded" type="memoryType"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <xsd:complexType name="memoryType">
    <xsd:sequence>
      <xsd:element name="media">