External Sorting: A Comprehensive Guide for Computer Science Students, Lecture notes of Advanced Data Analysis

Students who are in 2nd CSE jntuk

Typology: Lecture notes

2019/2020

Uploaded on 04/16/2020

RAGHU INSTITUTE OF TECHNOLOGY
AUTONOMOUS
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
II BTECH II SEM
Advanced Data Structures
Unit-1
Prepared By
Dr.V.Sangeetha
&
Mr. N.UdayKumar


Contents:

  • Sorting
  • Introduction
  • External Sorting
  • K-way Merging
  • Buffer Handling for Parallel Operation
  • Run Generation
  • Optimal Merging of Runs

References:

  1. Data Structures: A Pseudocode Approach, Richard F. Gilberg, Behrouz A. Forouzan, Cengage.
  2. Fundamentals of Data Structures in C, 2nd ed., Horowitz, Sahni, Anderson-Freed.

Why Sort?

A classic problem in computer science!

  • Data is often requested in sorted order, e.g., find students in increasing GPA order.
  • Sorting is the first step in bulk loading a B+ tree index.
  • Sorting is useful for eliminating duplicate copies in a collection of records.
  • The sort-merge join algorithm involves sorting.

Problem: sort 10GB of data with 1MB of RAM.
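To get a feel for the size of this problem, the short sketch below estimates how many sorted runs and how many 2-way merge passes it would take. It assumes binary units (1GB = 1024MB), that all of the 1MB is used to build each initial run, and that every merge pass combines runs in pairs; the function name `two_way_passes` is only illustrative.

```python
def two_way_passes(data_bytes, memory_bytes):
    """Return (initial_runs, merge_passes) for a 2-way external merge sort.

    Assumption: each initial run is exactly one memory-load of data.
    """
    # Ceiling division: number of memory-sized chunks in the data.
    initial_runs = -(-data_bytes // memory_bytes)
    # Each 2-way pass halves the run count (rounding up), so the number of
    # passes is ceil(log2(initial_runs)), computed here with integers only.
    merge_passes = (initial_runs - 1).bit_length()
    return initial_runs, merge_passes


GB = 1024 ** 3
MB = 1024 ** 2
runs, passes = two_way_passes(10 * GB, 1 * MB)
print(runs, passes)  # 10240 14
```

Under these assumptions the data is read and written once to create the runs and once more in each of the 14 merge passes, which is why reducing the number of passes matters so much.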


Using secondary storage effectively

General wisdom:

  • I/O costs dominate.
  • Design algorithms to reduce I/O.

Merge Sort

  • Let's try to achieve balanced partitioning of the list into two parts, A and B.
  • A gets n/2 elements; B gets the remaining half.
  • Sort A and B recursively.
  • Combine the sorted A and B using a process called merge, which combines two sorted lists into one.
  • How? We will see soon.

Partition into lists of size n/2

Example

[10, 4, 6, 3, 8, 2, 5, 7]

[10, 4, 6, 3] [8, 2, 5, 7]

[10, 4] [6, 3] [8, 2] [5, 7]

[4] [10] [3] [6] [2] [8] [5] [7]

Merge Sort

Example

81 94 11 96 12 35 17 99

81 94 11 96 | 12 35 17 99

Sort | Sort

11 81 94 96 | 12 17 35 99

Merge: combine two sorted lists by repeatedly choosing the smaller of the two "heads" of the lists.

Merge Sort: divide the records into two parts, merge-sort each part recursively, and then merge the two sorted lists.
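The two definitions above can be sketched directly in Python. This is a minimal in-memory illustration, not the external version; the function names `merge` and `merge_sort` are chosen here for the example.

```python
def merge(left, right):
    """Merge two sorted lists by repeatedly taking the smaller head."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    # One list is exhausted; the rest of the other is already sorted.
    result.extend(left[i:])
    result.extend(right[j:])
    return result


def merge_sort(records):
    """Divide into two halves, sort each recursively, then merge."""
    if len(records) <= 1:
        return list(records)
    mid = len(records) // 2
    return merge(merge_sort(records[:mid]), merge_sort(records[mid:]))


print(merge([11, 81, 94, 96], [12, 17, 35, 99]))
# [11, 12, 17, 35, 81, 94, 96, 99]
print(merge_sort([10, 4, 6, 3, 8, 2, 5, 7]))
# [2, 3, 4, 5, 6, 7, 8, 10]
```

The first call merges the two sorted halves from the example; the second sorts the list used in the partitioning example.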

2-Way Sort: Requires 3 Buffers

Phase 1: PREPARE.

  • Read a page, sort it, write it.
  • Only one buffer page is used.

Phases 2, 3, …, etc.: MERGE.

  • Three buffer pages are used.

[Figure: Disk → main memory buffers (INPUT 1, INPUT 2, OUTPUT; 1 buffer each) → Disk]
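The pass structure of the 2-way sort can be simulated in a few lines. The sketch below is a simplification: it keeps whole runs in Python lists rather than streaming them through three one-page buffers, and the sample pages and function names are assumptions made for illustration. Its purpose is only to show how the number of runs shrinks from pass to pass.

```python
def merge_runs(run_a, run_b):
    """Merge two sorted runs (lists of records) into one sorted run."""
    out, i, j = [], 0, 0
    while i < len(run_a) and j < len(run_b):
        if run_a[i] <= run_b[j]:
            out.append(run_a[i])
            i += 1
        else:
            out.append(run_b[j])
            j += 1
    return out + run_a[i:] + run_b[j:]


def two_way_external_sort(pages):
    """pages: list of pages, each a list of records.

    Returns (sorted_records, number_of_passes).
    """
    # Phase 1 (PREPARE): sort each page on its own; each page becomes a run.
    runs = [sorted(page) for page in pages]
    passes = 1
    # Phases 2, 3, ... (MERGE): merge runs in pairs until one run remains.
    while len(runs) > 1:
        next_runs = []
        for k in range(0, len(runs), 2):
            if k + 1 < len(runs):
                next_runs.append(merge_runs(runs[k], runs[k + 1]))
            else:
                next_runs.append(runs[k])  # odd run is carried forward
        runs = next_runs
        passes += 1
    return (runs[0] if runs else []), passes


pages = [[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2]]
result, passes = two_way_external_sort(pages)
print(result)  # [1, 2, 2, 3, 3, 4, 4, 5, 6, 6, 7, 8, 9]
print(passes)  # 4
```

With 7 pages the run count goes 7, 4, 2, 1, so there is one preparation pass followed by three merge passes.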

2-Way Sorting Cont'd

The most popular method for sorting on external storage devices is merge sort. This method consists of essentially two distinct phases.

First, segments of the input file are sorted using a good internal sort method. These sorted segments, known as runs, are written out onto external storage as they are generated.

Second, the runs generated in the first phase are merged together until only one run is left.

Example

  • Step 1) Internally sort three blocks at a time (i.e., 750 records) to obtain six runs $R_1$ to $R_6$. A method such as heapsort or quicksort could be used. These six runs are written out onto the scratch disk.
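Step 1 can be sketched as follows, using the example's figures of 4500 records, 250 records per block, and three blocks per run. The "disk" here is just a Python list, the random input is made up for the demonstration, and Python's built-in `list.sort` stands in for heapsort or quicksort; `generate_runs` is an illustrative name.

```python
import random

BLOCK_SIZE = 250      # records per block (from the example)
BLOCKS_PER_RUN = 3    # blocks sorted together in memory (from the example)


def generate_runs(records, block_size=BLOCK_SIZE, blocks_per_run=BLOCKS_PER_RUN):
    """Split the input into memory-sized segments and sort each into a run."""
    run_length = block_size * blocks_per_run
    runs = []
    for start in range(0, len(records), run_length):
        segment = records[start:start + run_length]  # "read" three blocks
        segment.sort()                               # internal sort (stand-in)
        runs.append(segment)                         # "write" the run out
    return runs


random.seed(0)
records = [random.randrange(100_000) for _ in range(4500)]  # made-up input
runs = generate_runs(records)
print(len(runs), [len(r) for r in runs])
# 6 [750, 750, 750, 750, 750, 750]
```

Each of the six runs is sorted on its own, but the runs are not yet ordered relative to one another; that is the job of the merge phase.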

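Example Cont'd

Step 2: Set aside three blocks of internal memory, each capable of holding 250 records. Two of these blocks will be used as input buffers and the third as an output buffer. Merge runs $R_1$ and $R_2$. This is carried out by first reading one block of each of these runs into the input buffers. Records are then merged from the input buffers into the output buffer; when the output buffer is full it is written out to disk, and when an input buffer becomes empty it is refilled with the next block of the same run.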
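The buffer handling in Step 2 can be sketched as below. Runs and the output "disk" are plain Python lists, and the helper names (`read_block`, `buffered_merge`) and the sample runs are assumptions for illustration. The sketch counts block reads and writes so the I/O of one merge is visible.

```python
BLOCK_SIZE = 250  # records per buffer (from the example)


def read_block(run, block_index, block_size=BLOCK_SIZE):
    """Return one block of a run; an empty list means the run is exhausted."""
    start = block_index * block_size
    return run[start:start + block_size]


def buffered_merge(run1, run2, block_size=BLOCK_SIZE):
    """Merge two sorted runs with two input buffers and one output buffer.

    Returns (merged_run, blocks_read, blocks_written).
    """
    runs = (run1, run2)
    next_block = [0, 0]   # next block to fetch from each run
    buffers = [[], []]    # the two input buffers
    pos = [0, 0]          # current position inside each input buffer
    output, disk = [], []
    reads = writes = 0

    def refill(k):
        nonlocal reads
        buffers[k] = read_block(runs[k], next_block[k], block_size)
        pos[k] = 0
        if buffers[k]:
            next_block[k] += 1
            reads += 1

    refill(0)
    refill(1)
    while buffers[0] or buffers[1]:
        # Pick the input buffer whose head record is smaller.
        if not buffers[1] or (buffers[0] and buffers[0][pos[0]] <= buffers[1][pos[1]]):
            k = 0
        else:
            k = 1
        output.append(buffers[k][pos[k]])
        pos[k] += 1
        if pos[k] == len(buffers[k]):
            refill(k)                 # input buffer empty: fetch the next block
        if len(output) == block_size:
            disk.extend(output)       # output buffer full: write it out
            output = []
            writes += 1
    if output:                        # flush a partially filled last block
        disk.extend(output)
        writes += 1
    return disk, reads, writes


run1 = list(range(0, 1500, 2))  # 750 sorted records (3 blocks)
run2 = list(range(1, 1500, 2))  # 750 sorted records (3 blocks)
merged, reads, writes = buffered_merge(run1, run2)
print(len(merged), reads, writes)  # 1500 6 6
```

Merging two 3-block runs therefore reads 6 blocks and writes 6 blocks, while only three blocks of memory are in use at any time.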

Example Analysis

Let us now analyze how much time is required to sort these 4500 records. The analysis will use the following notation:

$t_s$ = maximum seek time

$t_l$ = maximum latency time

$t_{rw}$ = time to read or write one block of 250 records

$t_{IO} = t_s + t_l + t_{rw}$

$t_{IS}$ = time to internally sort 750 records

$n t_m$ = time to merge $n$ records from input buffers to the output buffer

Computing times for disk sort
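As a rough worked calculation, the sketch below adds up the cost of the example in units of t_IO, t_m, and t_IS. The merge schedule is an assumption, since the notes above only describe the first merge: the six runs are merged in pairs, then two of the resulting 1500-record runs are merged, and finally the 3000-record run is merged with the remaining 1500-record run. It also assumes every block is read once and written once per merge.

```python
BLOCK = 250   # records per block
RUN = 750     # records per initial run (three blocks)
N = 4500      # records in the file


def merge_cost(*run_sizes):
    """Cost of one merge: each block is read once and written once."""
    records = sum(run_sizes)
    blocks = records // BLOCK
    return {"t_IO": 2 * blocks, "t_m": records, "t_IS": 0}


def add(a, b):
    """Add two cost dictionaries term by term."""
    return {key: a[key] + b[key] for key in a}


# Run generation: read 18 blocks, sort six segments, write 18 blocks.
total = {"t_IO": 2 * (N // BLOCK), "t_m": 0, "t_IS": N // RUN}

# Assumed merge schedule (hypothetical; see the lead-in).
schedule = [
    (750, 750), (750, 750), (750, 750),  # R1+R2, R3+R4, R5+R6
    (1500, 1500),                        # two of the 1500-record runs
    (3000, 1500),                        # result with the remaining run
]
for sizes in schedule:
    total = add(total, merge_cost(*sizes))

print(total)  # {'t_IO': 132, 't_m': 12000, 't_IS': 6}
```

Under these assumptions the total time is 132 t_IO + 12000 t_m + 6 t_IS, and the I/O term is the one the later techniques (k-way merging, buffering) aim to reduce.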
