External Sorting-Introduction to Database Systems-Lecture 08 Slides-Computer Science, Slides of Introduction to Database Management Systems

This lecture on database systems was delivered by Philip Bohannon at the University of California. Keywords: External Sorting, Brian Cooper, Sort, Quicksort, Mergesort, Heapsort, Selection Sort, Insertion Sort, Radix Sort, Bubble Sort, 2-way Sort, Double Buffering, Clustered B Tree

External sorting
R & G – Chapter 13
Brian Cooper
Yahoo! Research

A little bit about Y!

• Yahoo! is the most visited website in the world
  • Sorry Google
• 500 million unique visitors per month
• 74 percent of U.S. users use Y! (per month)
• 13 percent of U.S. users’ online time is on Y!

Why sort?

• Users usually want data sorted
• Sorting is first step in bulk-loading a B+ tree
• Sorting useful for eliminating duplicates
• Sort-merge join algorithm involves sorting

[Figure: an unsorted list of fruit names (Blueberry, Strawberry, Kiwi, Mango, Orange, Apple, Grapefruit, Banana) shown next to the same names in sorted order.]

Key problem in database sorting

4 GB: $

480 GB: $

• How to sort data that does not fit in memory?

Example: merge sort

[Figure: merge sort on eight fruit names. Two sorted runs, (Apple, Banana, Grapefruit, Orange) and (Blueberry, Kiwi, Mango, Strawberry), are merged into the sorted list Apple, Banana, Blueberry, Grapefruit, Kiwi, Mango, Orange, Strawberry.]
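The merge step in the fruit example can be sketched in a few lines of Python. This is an in-memory illustration only; the function names `merge_runs` and `merge_sort` are chosen here for the sketch and do not come from the slides.

```python
def merge_runs(left, right):
    """Merge two already-sorted lists into one sorted list."""
    merged = []
    i = j = 0
    # Repeatedly take the smaller head element of the two runs.
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One run is exhausted; the rest of the other is already sorted.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def merge_sort(items):
    """Plain recursive merge sort: split, sort each half, merge."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    return merge_runs(merge_sort(items[:mid]), merge_sort(items[mid:]))


fruits = ["Blueberry", "Strawberry", "Kiwi", "Mango",
          "Orange", "Apple", "Grapefruit", "Banana"]
print(merge_sort(fruits))
```

Merging the two four-element runs from the example, `merge_runs(["Apple", "Banana", "Grapefruit", "Orange"], ["Blueberry", "Kiwi", "Mango", "Strawberry"])`, yields the fully sorted list.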

Isn’t that good enough?

• Consider a file with N records
• Merge sort is O(N lg N) comparisons
• We want to minimize disk I/Os
• Don’t want to go to disk O(N lg N) times!
• Key insight: sort based on pages, not records
  • Read whole pages into RAM, not individual records
  • Do some in-memory processing
  • Write processed blocks out to disk
  • Repeat

• Pass 0: sort each page
• Pass 1: merge two pages into one run
• Pass 2: merge two runs into one run
• …
• Sorted!

2-way sort

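[Figure: 2-way sort. Unsorted pages pass through RAM and are written back sorted; in each later pass, pairs of sorted runs pass through RAM and are merged into longer sorted runs.]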
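The passes of the 2-way sort can be simulated with a small Python sketch. It is a simulation under stated assumptions, not a disk-based implementation: the "file" is a list of pages held in memory, each page is a list of records, and the function name `two_way_external_sort` and its I/O counter are hypothetical. Every page read or written is counted as one I/O.

```python
import heapq


def two_way_external_sort(pages):
    """Simulate 2-way external merge sort over a list of pages.

    Returns (sorted_pages, io_count); io_count is page reads plus writes.
    """
    if not pages:
        return [], 0
    page_size = max(len(p) for p in pages)
    io_count = 0

    # Pass 0: read each page, sort it in memory, write it back as a 1-page run.
    runs = []
    for page in pages:
        runs.append([sorted(page)])
        io_count += 2  # one read + one write

    # Later passes: merge runs two at a time until a single run remains.
    while len(runs) > 1:
        next_runs = []
        for k in range(0, len(runs), 2):
            pair = runs[k:k + 2]  # the last "pair" may hold one leftover run
            n_pages = sum(len(run) for run in pair)
            streams = [[rec for page in run for rec in page] for run in pair]
            merged = list(heapq.merge(*streams))
            # Cut the merged records back into pages of the original size.
            next_runs.append([merged[i:i + page_size]
                              for i in range(0, len(merged), page_size)])
            io_count += 2 * n_pages  # every page is read once and written once
        runs = next_runs
    return runs[0], io_count


pages = [["Blueberry", "Strawberry"], ["Kiwi", "Mango"],
         ["Orange", "Apple"], ["Grapefruit", "Banana"]]
sorted_pages, ios = two_way_external_sort(pages)
print(sorted_pages)
print(ios)  # 4 pages, 3 passes, 2 I/Os per page per pass: 24
```

A real implementation would stream pages through three buffers instead of materializing whole runs; the simulation only mirrors the pass structure and the I/O count.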

What did that cost us?

• Why is this better than plain old merge sort?
• N >> P, where N is the number of records and P is the number of pages
• So O(N lg N) >> O(P lg P)

• Example:
  • 1,000,000 record file
  • 8 KB pages
  • 100 byte records
  • = 80 records per page
  • = 12,500 pages
  • Plain merge sort: 41,863,137 disk I/Os
  • 2-way external merge sort: 365,241 disk I/Os
  • 4.8 days versus 1 hour
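The figures in this example can be reproduced with a short calculation. Two things here are assumptions rather than statements from the slides: the cost formula 2X(lg X + 1) with an unrounded logarithm, which is used because it reproduces both I/O counts, and a cost of 10 ms per disk I/O, which roughly reproduces the quoted running times.

```python
import math


def merge_sort_ios(units):
    """I/O count assuming 2 I/Os (read + write) per unit per pass
    and lg(units) + 1 passes. The formula is inferred, not quoted."""
    return 2 * units * (math.log2(units) + 1)


records = 1_000_000
records_per_page = 80                  # figure used on the slide
pages = records // records_per_page    # 12,500 pages

plain_ios = merge_sort_ios(records)    # one I/O unit per record
external_ios = merge_sort_ios(pages)   # one I/O unit per page

SECONDS_PER_IO = 0.010                 # assumed: 10 ms per disk I/O

plain_days = plain_ios * SECONDS_PER_IO / 86_400
external_hours = external_ios * SECONDS_PER_IO / 3_600

print(round(plain_ios), round(external_ios))
print(round(plain_days, 1), round(external_hours, 1))
```

Under these assumptions, the 80x reduction from records to pages, combined with the smaller logarithm, accounts for the gap between days and about an hour.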

Can we do better?

• 2-way merge sort only uses 3 memory buffers
  • Two buffers to hold input records
  • One buffer to hold output records
  • When that buffer fills up, flush to disk
• Usually we have a lot more memory than that
  • Set aside 100 MB for sort scratch space = 12,800 buffer pages (at 8 KB per page)
• Idea: read as much data into memory as possible each pass
  • Thus reducing the number of passes
  • Recall total cost: 2P * Passes
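To see how more buffers cut the number of passes, the sketch below applies the cost 2P * Passes with B buffer pages. It assumes the standard textbook scheme for general external merge sort: pass 0 produces sorted runs of B pages each, and every later pass merges B - 1 runs at a time. That scheme is not spelled out in this part of the slides, and the function names `num_passes` and `total_ios` are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def num_passes(pages, buffers):
    """Passes needed to sort `pages` pages with `buffers` buffer pages."""
    if buffers < 3:
        raise ValueError("need at least 3 buffers (2 input + 1 output)")
    if pages <= 0:
        return 0
    runs = ceil_div(pages, buffers)          # pass 0: runs of `buffers` pages
    passes = 1
    while runs > 1:
        runs = ceil_div(runs, buffers - 1)   # merge (buffers - 1) runs at a time
        passes += 1
    return passes


def total_ios(pages, buffers):
    """Total cost 2P * Passes: each pass reads and writes every page."""
    return 2 * pages * num_passes(pages, buffers)


PAGE_BYTES = 8 * 1024
buffer_pages = (100 * 1024 * 1024) // PAGE_BYTES   # 100 MB of scratch space

print(buffer_pages)                      # 12800
print(total_ios(12_500, buffer_pages))   # file fits in memory: 1 pass, 25000 I/Os
print(total_ios(12_500, 100))            # 100 buffers: 3 passes, 75000 I/Os
```

With 12,800 buffer pages the 12,500-page file from the earlier example fits in memory and is sorted in a single pass; even 100 buffers bring it down to three passes.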

Example

[Figure sequence over four slides: a diagram labeled "Input" and "Output"; the graphics are not recoverable.]