External Sorting and Selection Operators in Database Systems, Slides of Introduction to Database Management Systems

The concept of external sorting and its implementation in database systems. It also covers the selection operators and their access paths, including file scan, index scan, hash index, and b+-tree. Examples and cost estimates for each access path.

Typology: Slides

2012/2013

Uploaded on 04/27/2013

prakash
prakash 🇮🇳

4.6

(10)

63 documents

1 / 39

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Database Management Systems Design
Docsity.com
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18
pf19
pf1a
pf1b
pf1c
pf1d
pf1e
pf1f
pf20
pf21
pf22
pf23
pf24
pf25
pf26
pf27

Partial preview of the text

Download External Sorting and Selection Operators in Database Systems and more Slides Introduction to Database Management Systems in PDF only on Docsity!

Database Management Systems Design

Selection and SF

• Selections can have where clause with CNF, or DNF.

• Selectivity factor for CNF case:

  • If where clause is of the form

p1 and p2 and … and pn,

  • then
  • Assume that all predicates are independent
  • Exsmple: Select sid, sname From Students Where gpa= 4.0 AND age < 25 AND sname <> “Bob”;

SF = SFp 1 ∧ SFp 2 ∧ ...∧ SFpn

The Role of sorting

  • Sorting plays a pivotal role in the implementation of relational operators and in choosing access path
  • Idea: If some operator need to pass several times over the tuples of a table, sorting might speed things up
  • Examples:
    • Sorting for duplicate elimination in projections
    • Sort-merge join
    • Sorting for aggregation
    • Sorting for order by clauses
  • Question: Since a table R can have gigabytes of worth of data, how do we sort? - Answer: External Sorting

External Sorting

Input file

Sort pages

(Quicksort)

Merge-Sort

into 2-pages

Blocks (runs)

Merge-Sort

into 4-pages run

The idea behind External sorting

  • Phase I, sort each page in memory using an in-memory sorting algorithm
    • Often Quicksort is used
    • Requires 1 pass over the relation to sort
    • Each page is called a 1-page run
      • Run is a collection of pages with sorted tuples, stored as a file
  • Phase II, merge sort

1. Let i = 1

2. Start sort merging pairs of runs of size i to build runs of size

2i

3. When all runs of size i have been consumed

  • If only one run remains, finish
  • else set i = 2i, and goto step 2

Implementing External Sorting

  • Each run is stored in a temporary file
  • Worst case, you need three buffer pages
    • 2 pages for input
    • 1 page for output
  • Usage of pages
    • One is used to keep a page of tuples from a run A
    • The other is used to keep a page of tuples from a run B
    • Third page becomes a merged and sorted page to become

part of a new run C, with size twice that of the runs A and B

  • In practice, you have B pages available
    • B – 1 are used to input runs
    • 1 is used for output

Reality: B buffer pages

Input 1

Input 2 Output

Buffer pages

This is the minimal barebones scheme

Input B-

Sort & Merge

  • Sorting with B buffers: Pass
  • 3,4 5,2 8,1 2,10 7,21 9, - 5,2 2, - 3,4 8, - 1,2 2,3 4,5 8, - 1 st run on pass

Sketch of Algorithm

• Pass 0:

– Read B pages at a time,

– sort each page in memory (e.g. quick sort or

heap sort)

– Write run of size B to disk. This will produce

N/B runs, where N = NPages(R) for relation R

• Passes 1, 2, …, K

1. Use B -1 buffers to read a page from each run

of size i

2. Do a (B-1)-way sort merge to produce a run

with the size 2i. Each page is first kept in ouputDocsity.com

Leveraging on buffers

• By using B buffers first pass builds N/B

runs, where N is NPages(R) for a relation R

– Originally we had N runs …

• Moreover, the total number of passes to do

sorting is decreased by doing (B-1)-way

merging operations

• The cost for sorting a relation R, with N

pages using B buffers (1 for output, B-1 for

input) is:

  • I/O Cost =2N(log (^) B-1 N/B  + 1)

– Must do (B-1)-way merging operations

Scenario

• Query:

Q1: Select sname, sage

From Students where gpa = 4.0;

Q2: Select sname, gpa

From Students where gpa > 3.50 AND age < 25;

• Information for Students:

– Cardinality = 5,

– #tuples / page = 100

– SF gpa> 3.50 = 10%

Costs for Selection Access paths

• No index, and data not sorted for table R, predicate

is Attr op value

  • Algorithm:
    • Read each tuple,
    • Evaluate the predicates in the where clause for each one.
    • Emit the results to next operator or to the output stream
  • Cost: Read entire relation R, NPages(R) I/Os
  • Scenario:
    • Heap file with not index defined

Cost for selections with Hash Index

• Hash index, and predicate : Attr = value

  • Algorithm: Probe hash table to find page with values.
  • Clustered : Fetch records, and read any overflow page
    • Cost: ( 2 + #number of overflow pages) * I/O
  • Unclustered : Fetch records, fetch data pages, and do the

same for overflow pages.

  • Cost: Variable, in the worst case each search key forces us to read a different page in the data file.
  • Often we don’t use unclustered index for selections
  • Use selectivity to make a good guess

Selections using a B+-tree

• When a selection has a where clause with a

predicate of the form Attr op value

– Clustered B+-tree: best strategy to use (if

available)

  • Note that hash index is better if op is equality

– Un-clustered B+ tree: depends on SF for the

predicate

– Algorithm:

  • Search tree to find first data entry entry that points to qualifying tuples (tuples that pass the where condition)
  • Scan data entries to find and retrieve qualifying tuples
  • Returns these to next operator Docsity.com