Distributed Software Development
XML
Chris Brooks
Department of Computer Science, University of San Francisco
Outline
• About XML
• Structuring XML documents
• Using CSS to display XML
• Parsing with DOM
• Parsing with SAX
XML
• XML is a language for describing data
  • Really more of a meta-language
• XML itself provides metadata
  • Data types, relations between data objects, etc.
• Designed to be read, created, and consumed by programs.
Advantages of XML
• Well-defined, easy-to-manipulate structure
• Human-readable
• Extensible
• Metadata can be included directly with data
• Widely used
Things to note
• An XML document has two components:
  • tags (metadata)
  • content (data)
• Metadata serves to help an application make sense of the data.
Example
<?xml version="1.0"?>
J.R.R. Tolkien
The Lord of the Rings
Fellowship of The Ring
The Two Towers
Return of the King
Ballantine
XML documents as trees
• An XML document can also be represented as a tree.
• This makes XML very easy to parse.
• The outermost element is the root element, and elements contained within it are children of that element.
• Content is stored at the leaves
• What would the tree for our Tolkien example look like?
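One way to explore that question is to parse a small version of the Tolkien example and print its tree. This is only a sketch: the tag names (library, book, author, title, volume, publisher) are assumptions made for illustration, and ElementTree from the Python standard library is used as a convenient tree API.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for the Tolkien example; every tag name here is assumed.
XML = """<?xml version="1.0"?>
<library>
  <book>
    <author>J.R.R. Tolkien</author>
    <title>The Lord of the Rings</title>
    <volume>Fellowship of The Ring</volume>
    <volume>The Two Towers</volume>
    <volume>Return of the King</volume>
    <publisher>Ballantine</publisher>
  </book>
</library>"""


def outline(element, depth=0):
    """Return one line per element, indented by depth, with any leaf text."""
    text = (element.text or "").strip()
    label = element.tag + (": " + text if text else "")
    lines = ["  " * depth + label]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(XML)   # root is the outermost element
tree_lines = outline(root)
print("\n".join(tree_lines))
```

With this assumed markup, the root is library, book is its only child, and the text content sits at the leaves.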
Elements
• XML requires that every starting tag have a corresponding closing tag.
• Everything between a starting tag and a closing tag is called an element.
• For example, Return of The King, together with the tags that enclose it, is an element.
• So is everything between the opening and closing tags of the element that contains it.
• As is everything between the opening and closing tags of the root element.
• This means that elements must be nested.
Tags and elements
• Tags form the boundaries of elements, and give processing instructions to parsers.
• Empty elements: <coAuthor ... /> All information is contained in the tag.
• Container elements: a start tag and an end tag enclosing content.
• Comments: <!-- here’s a comment -->
• Declaration: <!ENTITY jrrt "J.R.R. Tolkien"> This provides a way to define variables or constants in a single location.
• Entity reference: &jrrt;
Attributes and Values
• You can also specify that an element has attributes.
• These attributes can take on values.
• This is helpful when you want to specify that an object belongs to one of a few types.
    <book genre="fantasy" size="large">
Attributes vs. Sub-elements
• We could rewrite the example above using subelements instead of attributes.
• When to use one over the other is largely stylistic.
  • Can always transform one into the other
• If a feature can only take on one of a few values, an attribute might make more sense.
• If we expect to extend the number of genres, a subelement is preferable.
• Also, order is preserved for subelements
  • Semantically, attribute/value pairs are treated as a dictionary.
  • So, a list of authors should be done as subelements
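The dictionary-versus-sequence distinction can be seen directly in a parser. The sketch below uses ElementTree from the Python standard library; the author subelements and their placeholder names are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: genre and size as attributes, authors as subelements.
XML = """
<book genre="fantasy" size="large">
  <author>First Author</author>
  <author>Second Author</author>
</book>"""

book = ET.fromstring(XML)

# Attributes behave like a dictionary: lookup by name, one value per name.
attributes = dict(book.attrib)
genre = book.get("genre")

# Subelements form a sequence: repeats are allowed and document order is kept.
authors = [a.text for a in book.findall("author")]

print(attributes, genre, authors)
```

A second genre attribute on the same tag would be a well-formedness error, whereas a second author subelement is simply the next item in the list.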
Entities
• We could then use our entity definitions later in the document by prepending a ’&’ to the entity name and appending a ’;’:
    The author of The Lord of the Rings is &jrrt;
    he invented a grammar and semantics for Elvish, which can be found at &elvish-key;
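As a rough illustration of entity expansion, the sketch below declares the jrrt entity in an internal DTD subset and lets the parser substitute it. The note element name is an assumption, and ElementTree is used only because it ships with Python.

```python
import xml.etree.ElementTree as ET

# The entity is declared once in the internal DTD subset and referenced as &jrrt;
# The element name "note" is assumed for this sketch.
XML = """<?xml version="1.0"?>
<!DOCTYPE note [
  <!ENTITY jrrt "J.R.R. Tolkien">
]>
<note>The author of The Lord of the Rings is &jrrt;.</note>"""

note = ET.fromstring(XML)
expanded = note.text   # the parser has already replaced &jrrt; with its value
print(expanded)
```

Changing the declaration changes every place the entity is referenced, which is the "single location" benefit of entities.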
Using CSS to display XML
• CSS can also be used to display XML documents.
• Control is limited to laying out a complete XML document.
• If we want filtering or sorting, we’ll need to use XSLT.
An example
• Let’s say we have an XML-based CD database:
• We can use CSS to display it in a web browser.
• (see separate examples)
Outline
• About XML
• Structuring XML documents
• Validating XML with schema
• Using CSS to display XML
• Parsing with DOM
• Parsing with SAX
Parsing XML
• XML also has the advantage of being easy for programs to parse and construct.
• There are two different approaches to parsing and manipulating XML.
• SAX: Simple API for XML
  • Event-driven parser
  • User defines actions to take when an element is found during parsing.
Parsing XML
• DOM: Document Object Model
  • Tree parser: Entire document is instantiated in memory as a tree.
  • Nice for random-access applications
  • Large documents may consume a large amount of memory
• Most languages provide support for both. We’ll start with DOM.
Libraries
• The DOM model is specified in a language-independent way.
  • Implementations then follow this specification.
• This means that they all work very similarly.
• Java
  • javax.xml.parsers built into Java 1.5
  • Apache’s Xerces parser provides support for both SAX and DOM.
    • Xerces also has C++ and Perl implementations
  • JDOM is also a popular tool for parsing and creating XML in Java.
• Python
  • Built-in support for SAX, DOM, and minidom
  • ElementTree is a DOM-like parser.
  • 4Suite provides third-party implementations
Libraries
• Perl
  • LibXML provides SAX and DOM functionality.
• C#
  • .NET has built-in support for DOM (XmlDocument); for streaming it provides the pull-based XmlReader rather than a SAX API.
• Ruby
  • The REXML library provides tree parsing, but not with the DOM interface.
Parsing a document in Python
• Example:
    from xml.dom import minidom
    doc = minidom.parse('library.xml')
• Reads in and parses a document
• Creates a Document object.
• toxml() returns the document as an XML string.
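A self-contained variant of this example can be tried without a file on disk. The sketch below substitutes minidom.parseString for minidom.parse and uses a small inline document whose tag names are assumptions standing in for library.xml.

```python
from xml.dom import minidom

# Inline stand-in for library.xml; the tag names are assumed for illustration.
XML = "<library><book><title>The Two Towers</title></book></library>"

doc = minidom.parseString(XML)   # returns a Document, as minidom.parse(path) does
root = doc.documentElement       # the root element of the tree
xml_text = doc.toxml()           # serialize the whole document back to a string

print(root.tagName)
print(xml_text)
```

The serialized string begins with an XML declaration that minidom adds, followed by the original markup.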
Traversing the tree
• childNodes, firstChild, lastChild, parentNode
• Nodes in childNodes can have childNodes of their own.
• Leaves are text nodes.
  • They have a ’data’ attribute, which holds the text they store.
• This is useful if you need to process an entire document, but annoying if you’re searching.
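The navigation attributes listed above can be exercised on a small document. This is a minimal sketch with assumed tag names; the markup is written without whitespace between tags because minidom would otherwise also create whitespace-only text nodes.

```python
from xml.dom import minidom

# Hypothetical document, written without whitespace between tags.
XML = ("<library><book><title>The Two Towers</title>"
       "<publisher>Ballantine</publisher></book></library>")
doc = minidom.parseString(XML)

book = doc.documentElement.firstChild    # the <book> element
title = book.firstChild                  # the <title> element
publisher = book.lastChild               # the <publisher> element
title_text = title.firstChild.data       # the leaf is a text node
parent_name = title.parentNode.tagName   # step back up the tree


def leaf_text(node):
    """Collect the data of every text node below node, in document order."""
    if node.nodeType == node.TEXT_NODE:
        return [node.data]
    found = []
    for child in node.childNodes:
        found.extend(leaf_text(child))
    return found


all_text = leaf_text(doc)
print(title_text, parent_name, all_text)
```

Walking the whole tree like this suits whole-document processing; for targeted lookups it quickly becomes tedious.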
Finding specific elements
• getElementsByTagName finds all elements with a given tag name:
    eltlist = doc.getElementsByTagName('key')
• Can search at any node
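The difference between searching from the document and searching from a single node might look like the sketch below. The book and title tag names are assumptions chosen for illustration.

```python
from xml.dom import minidom

# Hypothetical two-book library.
XML = (
    "<library>"
    "<book><title>The Two Towers</title></book>"
    "<book><title>Return of the King</title></book>"
    "</library>"
)
doc = minidom.parseString(XML)

# Search the whole document.
all_titles = [t.firstChild.data for t in doc.getElementsByTagName("title")]

# Search below one node: only that book's descendants are considered.
second_book = doc.getElementsByTagName("book")[1]
second_titles = [t.firstChild.data
                 for t in second_book.getElementsByTagName("title")]

print(all_titles)
print(second_titles)
```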
Parsing with SAX
• DOM is very convenient to use in many cases, but not all
  • Document is too large to hold in memory
  • Document is malformed
  • Document is being produced (and should be consumed) incrementally
• In these cases, a SAX parser may be more appropriate.
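The incremental case can be sketched with the feed() and close() methods of the parser returned by xml.sax.make_parser() in the Python standard library. The chunk boundaries, the TitleCounter class, and the catalog markup are all hypothetical stand-ins for data arriving from a stream.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TitleCounter(ContentHandler):
    """Counts <title> start tags as they arrive."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "title":
            self.count += 1


# Hypothetical chunks, as they might arrive from a network stream.
chunks = [
    "<catalog><cd><title>First",
    "</title></cd><cd><ti",
    "tle>Second</title></cd>",
    "</catalog>",
]

handler = TitleCounter()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

counts = []
for chunk in chunks:
    parser.feed(chunk)             # parse only what has arrived so far
    counts.append(handler.count)
parser.close()                     # signal end of input

print(counts)
```

The handler sees the first title before the rest of the document exists, which a DOM parser cannot offer because it needs the complete input to build its tree.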
SAX: Simple API for XML
• SAX is an interface that was developed to provide a uniform way to integrate different XML parsers.
  • Interesting contrast in origin to DOM.
  • SAX developed ’bottom-up’ by XML developers
  • DOM developed ’top-down’ by the W3C.
• SAX is an event-driven parser
  • You define an event handler that is passed to the parser.
  • Describes how to handle particular types of elements.
  • Document is processed sequentially. State must be maintained by hand.
Using SAX within Python
• (Note: Java looks very similar)
• Most of the work involves creating handlers
• For example, to deal with processing content, override the content handler

    import xml.sax
    from xml.sax.handler import ContentHandler

    class CDHandler(ContentHandler):
        def __init__(self):
            ContentHandler.__init__(self)
            self.books = []
            self.buffer = ''
            self.inTitle = False

        def startElement(self, name, attrs):
            if name == 'title':
                self.inTitle = True

        def characters(self, content):
            # Collect text only while inside a title element.
            if self.inTitle:
                self.buffer += content

        def endElement(self, name):
            if name == 'title':
                self.inTitle = False
                print(self.buffer)
                self.buffer = ''
Using SAX within Python
• To use this, we then register the handler with a SAX parser.
    parser = xml.sax.make_parser()
    handler = CDHandler()
    parser.setContentHandler(handler)
    parser.parse('cdcat.xml')
SAX comments
• You must keep track of ’where you are’ yourself.
  • No access to the enclosing context
  • It’s hard with SAX to, for example, print the corresponding artist for each title node.
• SAX has more modest memory requirements than DOM
  • Nodes are discarded after parsing
• More flexible recovery from parsing errors.
• Use the parser that best fits your needs.
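To make the "keep track of where you are" point concrete, the sketch below pairs each title with its artist by carrying state between events. The catalog markup, the cd and artist element names, and the PairHandler class are assumptions for illustration, not the structure of cdcat.xml.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Hypothetical CD catalog; the element names are assumed for illustration.
XML = b"""<catalog>
  <cd><title>First Album</title><artist>Artist A</artist></cd>
  <cd><title>Second Album</title><artist>Artist B</artist></cd>
</catalog>"""


class PairHandler(ContentHandler):
    """Pairs each title with its artist by remembering state between events."""

    def __init__(self):
        super().__init__()
        self.pairs = []      # completed (title, artist) tuples
        self.current = {}    # fields seen so far inside the open <cd>
        self.field = None    # element whose text is currently being read
        self.buffer = ""

    def startElement(self, name, attrs):
        if name == "cd":
            self.current = {}
        elif name in ("title", "artist"):
            self.field = name
            self.buffer = ""

    def characters(self, content):
        # Text may arrive in several pieces, so accumulate it.
        if self.field is not None:
            self.buffer += content

    def endElement(self, name):
        if name == self.field:
            self.current[name] = self.buffer
            self.field = None
        elif name == "cd":
            self.pairs.append((self.current.get("title"),
                               self.current.get("artist")))


handler = PairHandler()
xml.sax.parseString(XML, handler)
pairs = handler.pairs
print(pairs)
```

All of the bookkeeping in current, field, and buffer is work a DOM tree would do implicitly through parentNode and childNodes; the trade is less memory for more hand-written state.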