XML-Relational Mapping: Approaches, Techniques, and Queries, Slides of Introduction to Database Management Systems

Various approaches to mapping xml data to relational databases, including text files, object-oriented dbms, and relational dbms. It covers schema-oblivious and schema-aware mapping techniques, node/edge-based and interval-based schema designs, and label-path and dewey-order encoding. The document also explores the advantages and disadvantages of each approach and provides examples of queries using these techniques.

Typology: Slides

2011/2012

Uploaded on 01/29/2012

arold
arold 🇺🇸

4.7

(24)

372 documents

1 / 4

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
1
XML-Relational Mapping
CPS 116
Introduction to Database Systems
2
Announcements
Homework #3 due next Tuesday (Nov. 2)
3
Approaches to XML processing
Text files (!)
Specialized XML DBMS
Lore (Stanford), Strudel (AT&T), Tamino/QuiP
(Software AG), X-Hive, Timber (Michigan), etc.
Still a long way to go
Object-oriented DBMS
eXcelon (ObjectStore), ozone, etc.
Not as mature as relational DBMS
Relational (and object-relational) DBMS
Middleware and/or object-relational extensions
4
Mapping XML to relational
Store XML in a CLOB (Character Large OBject) column
Simple, compact
Full-text indexing can help (often provided by DBMS vendors as
object-relational “extensions”)
Poor integration with relational query processing
Updates are expensive
Alternatives?
Schema-oblivious mapping:
well-formed XML generic relational schema
Node/edge-based mapping for graphs
Interval-based mapping for trees
Path-based mapping for trees
Schema-aware mapping:
valid XML special relational schema based on DTD
5
Node/edge-based: schema
Element(eid, tag)
Attribute(eid,attrName,attrValue)
Attribute order does not matter
ElementChild(eid, pos, child)
pos specifies the ordering of children
child references either Element(eid) or Text(tid)
Text(tid, value)
tid cannot be the same as any eid
)Need to “invent” lots of id’s
)Need indexes for efficiency, e.g., Element(tag), Text(value)
Key: (eid,attrName)
Keys: (eid, pos), (child)
6
Node/edge-based: example
<bibliography>
<book ISBN=”ISBN-10” price=”80.00”>
<title>Foundations of Databases</title>
<author>Abiteboul</author>
<author>Hull</author>
<author>Vianu</author>
<publisher>Addison Wesley</publisher>
<year>1995</year>
</book>…
</bibliography>
eid tag
e0 bibliography
e1 book
e2 title
e3 author
e4 author
e5 author
e6 publisher
e7 year
Element
eid pos child
e0 1 e1
e1 1 e2
e1 2 e3
e1 3 e4
e1 4 e5
e1 5 e6
e1 6 e7
e2 1 t0
e3 1 t1
e4 1 t2
e5 1 t3
e6 1 t4
e7 1 t5
ElementChild
eid attrName attrValue
e1 ISBN ISBN-10
e1 price 80
Attribute
tid value
t0 Foundations of Databases
t1 Abiteboul
t2 Hull
t3 Vianu
t4 Addison Wesley
t5 1995
Text
pf3
pf4

Partial preview of the text

Download XML-Relational Mapping: Approaches, Techniques, and Queries and more Slides Introduction to Database Management Systems in PDF only on Docsity!

XML-Relational Mapping

CPS 116

Introduction to Database Systems

Announcements

™ Homework #3 due next Tuesday (Nov. 2)

3

Approaches to XML processing

™ Text files (!)

™ Specialized XML DBMS

ƒ Lore (Stanford), Strudel (AT&T), Tamino/QuiP

(Software AG), X-Hive, Timber (Michigan), etc.

ƒ Still a long way to go

™ Object-oriented DBMS

ƒ eXcelon (ObjectStore), ozone, etc.

ƒ Not as mature as relational DBMS

™ Relational (and object-relational) DBMS

ƒ Middleware and/or object-relational extensions

4

Mapping XML to relational

™ Store XML in a CLOB (Character Large OBject) column

ƒ Simple, compact ƒ Full-text indexing can help (often provided by DBMS vendors as object-relational “extensions”) ƒ Poor integration with relational query processing ƒ Updates are expensive

™ Alternatives?

ƒ Schema-oblivious mapping: well-formed XML → generic relational schema

  • Node/edge-based mapping for graphs
  • Interval-based mapping for trees
  • Path-based mapping for trees ƒ Schema-aware mapping: valid XML → special relational schema based on DTD

5

Node/edge-based: schema

™ Element ( eid , tag )

™ Attribute ( eid , attrName , attrValue )

ƒ Attribute order does not matter

™ ElementChild ( eid , pos , child )

ƒ pos specifies the ordering of children ƒ child references either Element ( eid ) or Text ( tid )

™ Text ( tid , value )

ƒ tid cannot be the same as any eid

) Need to “invent” lots of id ’s

) Need indexes for efficiency, e.g., Element ( tag ), Text ( value )

Key: ( eid , attrName )

Keys: ( eid , pos ), ( child )

6

Node/edge-based: example

Foundations of Databases Abiteboul Hull Vianu Addison Wesley 1995

eid tag e0 bibliography e1 book e2 title e3 author e4 author e5 author e6 publisher e7 year

Element

eid pos child e0 1 e e1 1 e e1 2 e e1 3 e e1 4 e e1 5 e e1 6 e e2 1 t e3 1 t e4 1 t e5 1 t e6 1 t e7 1 t

ElementChild

eid attrName attrValue e1 ISBN ISBN- e1 price 80

Attribute

tid value t0 Foundations of Databases t1 Abiteboul t2 Hull t3 Vianu t4 Addison Wesley t5 1995

Text

Node/edge-based: simple paths

™ //title

ƒ SELECT eid FROM Element WHERE tag = ‘title’;

™ //section/title

ƒ SELECT e2.eid FROM Element e1, ElementChild c, Element e WHERE e1.tag = ‘section’ AND e2.tag = ‘title’ AND e1.eid = c.eid AND c.child = e2.eid;

) Path expression becomes joins!

ƒ Number of joins is proportional to the length of the path

expression

Node/edge-based: more complex paths

™ //bibliography/book[author=“Abiteboul”]/@price ƒ SELECT a.attrValue FROM Element e1, ElementChild c1, Element e2, Attribute a WHERE e1.tag = ‘bibliography’ AND e1.eid = c1.eid AND c1.child = e2.eid AND e2.tag = ‘book’

AND e2.eid = a.eid AND a.attrName = ‘price’;

AND EXISTS (SELECT * FROM ElementChild c2, Element e3, ElementChild c3, Text t WHERE e2.eid = c2.eid AND c2.child = e3.eid AND e3.tag = ‘author’ AND e2.eid = c3.eid AND c3.child = t.tid AND t.value = ‘Abiteboul’)

9

Node/edge-based: descendent-or-self

™ //book//title

ƒ Requires SQL3 recursion ƒ WITH ReachableFromBook(id) AS ((SELECT eid FROM Element WHERE tag = ‘book’) UNION ALL (SELECT c.child FROM ReachableFromBook r, ElementChild c WHERE r.eid = c.eid)) SELECT eid FROM Element WHERE eid IN (SELECT * FROM ReachableFromBook) AND tag = ‘title’;

10

Interval-based: schema

™ Element ( left , right , level , tag )

ƒ left is the start position of the element ƒ right is the end position of the element ƒ level is the nesting depth of the element (strictly speaking, unnecessary) ƒ Key is left

™ Attribute ( left , attrName , attrValue )

™ Text ( left , level , value )

) Where did ElementChild go?

ƒ E 1 is the parent of E 2 iff: [ E 1. left , E 1. right ] ⊃ [ E 2. left , E 2. right ], and E 1. level = E 2. level – 1

11

Interval-based: example

1 2 34Foundations of Databases 67Abiteboul 910Hull 1213Vianu 1516Addison Wesley 18191995 21… bibliography

book

title author author author publisher year

1,999,

2,21,

3,5,3 6,8,3 9,11,3 12,14,3 15,17,3 18,20,

12

Interval-based: queries

™ //section/title

ƒ SELECT e2.left FROM Element e1, Element e WHERE e1.tag = ‘section’ AND e2.tag = ‘title’ AND e1.left < e2.left AND e2.right < e1.right AND e1.level = e2.level-1;

)Path expression becomes “containment” joins!

  • Number of joins is proportional to path expression length

™ //book//title

ƒ SELECT e2.left FROM Element e1, Element e WHERE e1.tag = ‘book’ AND e2.tag = ‘section’ AND e1.left < e2.left AND e2.right < e1.right;

)No recursion!

Inlining

™ An author element just has a PCDATA child

™ Instead of using foreign keys

ƒ book_author ( ISBN , author_id )

ƒ author ( id , PCDATA_id )

ƒ PCDATA ( id , value )

™ Why not just “inline” the string value inside book?

ƒ book_author ( ISBN , author_PCDATA_value )

ƒ PCDATA table no longer stores author values

More general inlining

™ As long as we know the structure of an element and its

number of children (and recursively for all children), we can

inline this element where it appears

™ With no inlining at all

book ( ISBN , publisher_id ) publisher ( id , name_id , address_id ) name ( id , PCDATA_id ) address ( id , PCDATA_id )

™ With inlining

book ( ISBN , publisher_name_PCDATA_value , publisher_address_PCDATA_value )

21

Queries

™ book ( ISBN , price , title , publisher , year ),

book_author ( ISBN , author ),

book_section ( ISBN , section_pos , section_id ),

section ( id , title , text ), section_section ( id , section_pos , section_id )

™ //title ƒ (SELECT title FROM book) UNION ALL (SELECT title FROM section);

™ //section/title ƒ SELECT title FROM section;

™ //bibliography/book[author=“Abiteboul”]/@price ƒ SELECT price FROM book, book_author WHERE book.ISBN = book_author.ISBN AND author = ‘Abiteboul’; ™ //book//title ƒ (SELECT title FROM book) UNION ALL (SELECT title FROM section)

These queries only work for the given DTD

22

Pros and cons of inlining

™ Not always applicable

ƒ * and +, recursive schema (e.g., section)

™ Fewer joins

™ More “scattering” (e.g., there is no longer any table

containing all titles; author information is scattered

across book , section , etc.)

)Heuristic: do not inline elements that can be shared

23

Result restructuring

™ Simple results are fine

ƒ Each tuple returned by SQL gets converted to an element

™ Simple grouping is fine (e.g., books with multiple authors)

ƒ Tuples can be returned by SQL in sorted order; adjacent tuples are grouped into an element

™ Complex results are problematic (e.g., books with multiple

authors and multiple references)

ƒ One SQL query returns one table whose columns cannot store sets ƒ Option 1: return one table with all combinations of authors and references → bad ƒ Option 2: return two tables, one with authors and the other with references → join is done as post processing ƒ Option 3: return one table with all author and reference columns; pad with NULL’s; order determines grouping → messy

24

Comparison of approaches

™ Schema-oblivious

ƒ Flexible and adaptable; no DTD needed ƒ Queries are easy to formulate

  • Translation can be easily automated ƒ Queries involve lots of join and are expensive

™ Schema-aware

ƒ Less flexible and adaptable ƒ Need to know DTD to design the relational schema ƒ Query formulation requires knowing DTD and schema ƒ Queries are more efficient ƒ XQuery is tougher to formulate because of result restructuring