Data Structures and Algorithm Analysis - Lecture II, Study notes of Algorithms and Programming

The outline of a lecture in the data structures and algorithm analysis course at george mason university. The lecture covers count sorts, string matching, hashing, and b-trees. It includes detailed explanations of count sorts, horspool’s algorithm for string matching, the basics of hashing, collisions, open hashing, closed hashing with linear probing, and storing data on disk.

Typology: Study notes

Pre 2010

Uploaded on 02/10/2009

koofers-user-kd6-1
koofers-user-kd6-1 🇺🇸

10 documents

1 / 25

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Outline Introduction Count Sorts String Matching Hashing B-Trees Homework
CS 483 - Data Structures and Algorithm Analysis
Lecture VII: Chapter 7
R. Paul Wiegand
George Mason University, Department of Computer Science
March 29, 2006
R. Paul Wiegand George Mason University, Department of Computer Science
CS483 Lecture II
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18
pf19

Partial preview of the text

Download Data Structures and Algorithm Analysis - Lecture II and more Study notes Algorithms and Programming in PDF only on Docsity!

CS 483 - Data Structures and Algorithm Analysis

Lecture VII: Chapter 7

R. Paul Wiegand

George Mason University, Department of Computer Science

March 29, 2006

R. Paul Wiegand George Mason University, Department of Computer Science

Outline

1 Introduction: Space vs. Time Tradeoff

2 Sorting by Counting

3 String Matching

4 Hashing

5 B-Trees

6 Homework

R. Paul Wiegand George Mason University, Department of Computer Science

Comparison Count Sort

ComparisonCountingSort(A[0... n − 1])

for i ← 0 to n − 1 do Count[i ] ← 0 for i ← 0 to n − 2 do for j ← i + 1 to n − 1 do if A[i ] < A[j] then Count[j] ← Count[j] + 1 else Count[j] ← Count[j] + 1 for i ← 0 to n − 1 do S[Count[i ]] ← A[i ] return S

A = 64 31 84 96 19 47 0 0 0 0 0 0 3 0 1 1 0 0 3 1 2 2 0 1 3 1 4 3 0 1 3 1 4 5 0 1 3 1 4 5 0 2 3 1 4 5 0 2

init i = 0 i = 1 i = 2 i = 3 i = 4 i = 5

S

C (n) =

∑n− 2

i =

∑n− 1

j=i +1 1 =^

n(n−1)

2 ∈^ Θ(n

For each element to be sorted, count the total number of elements smaller than this element.

R. Paul Wiegand George Mason University, Department of Computer Science

Distribution Counting

ComparisonCountingSort(A[0... n − 1])

for j ← 0 to u − ℓ do D[j] ← 0 for i ← 0 to n − 1 do D[A[i ] − ℓ] ← D[A[i ] − ℓ] + 1 for j ← 1 to u − ℓ do D[j] ← D[j − 1] + D[j] for i ← n − 1 downto 0 do j ←− A[i ] − ℓ S[D[j] − 1] ←− A[i ] D[j] ←− D[j] − 1

Sometimes the input is

constrained

Fixed array of

values

Each in [ℓ, u]

Sometimes we want

additional information

information

}

After accumulation...

D[0... 2] S[0... 5]

A[i = 5] = 12 1 4 6 12 A[i = 4] = 12 1 3 6 12 12 A[i = 3] = 13 1 2 6 12 12 13 A[i = 2] = 12 1 2 5 12 12 12 13 A[i = 1] = 11 1 2 5 11 12 12 12 13 A[i = 0] = 13 0 1 5 11 12 12 12 13 13

R. Paul Wiegand George Mason University, Department of Computer Science

Horspool’s Algorithm

Idea: When we shift, make as large a shift as possible

Match pattern from right to left

Consider character c of the text that was aligned against the last

character of the pattern

t(c) =

m, if c is not in the first m − 1 characters

dist from rightmost c in first m − 1 characters, otherwise

Still Θ(n) in Avg case, Θ(nm) in worst case

But on average, must faster than brute force

ShiftTable(P[0... m − 1])

for j ← 0 to m − 2 do T [P[j]] ← m − 1 − j

return T

R. Paul Wiegand George Mason University, Department of Computer Science

Horspool’s Algorithm

Idea: When we shift, make as large a shift as possible

Match pattern from right to left

Consider character c of the text that was aligned against the last

character of the pattern

t(c) =

m, if c is not in the first m − 1 characters

dist from rightmost c in first m − 1 characters, otherwise

Still Θ(n) in Avg case, Θ(nm) in worst case

But on average, must faster than brute force

ShiftTable(P[0... m − 1])

for j ← 0 to m − 2 do T [P[j]] ← m − 1 − j

return T

May repeatedly over- write shift value for a given character

R. Paul Wiegand George Mason University, Department of Computer Science

The Basics of Hashing

Hashes are often useful for implementing dictionaries (basic

operations: Insert, Search, & Delete)

Construct a data type to store records by key value (Hash Table),

generally an array H[0... m − 1]

Use the key to access the table by computing its address with a

predefined Hash Function, h(k)

If keys are nonnegative integers, a simple hash function is

h(k) = k mod m

For strings of a fixed length, we might use:

h(k) =

ℓ− 1

i =0 ord(ci^ )

mod m

Or, where C is a larger constant than any ord(ci ):

h ← 0; for i ← 0 to ℓ − 1 do h ← (h · C + ord(ci ))

R. Paul Wiegand George Mason University, Department of Computer Science

Collisions

Hash functions should try to:

1 Distribute keys in the table as evenly as possible

2 Be easy to compute

When the hash functions computes the same value for different

keys, a collision occurs

When m < n (n is the number of keys inserted into the table),

this will occur

Even when m ≥ n it is still possible (depending on the data

and the hash function)

Hash implementations need to have a collision resolution

method, such as:

Open hashing (separate chaining)
Closed hashing (open addressing)

R. Paul Wiegand George Mason University, Department of Computer Science

Open Hashing

Each cell in the hash table is a

linked list

Values are stored in list, collisions

are handled by chaining values

If n keys are distributed evenly,

each list is about the same size: mn

load factor — α = mn

Average number of nodes visited

during a successful search:

S ≈ 1 + α 2

Average number of nodes visited

during an unsuccessful search:

U ≈ α

Example:

m = 5 h(k) = (suitvalue + cardvalue) mod m {♠,♦ ,♥ ,♣ } = { 42 , 28 , 14 , 0 } {K , Q, J, A} = { 13 , 12 , 11 , 1 }

Data A♦ 5 ♣ 9 ♠ 7 ♠ K♥ h(k) 4 0 1 4 2

5 ♣ 9 ♠ K♥ A♦

When load factor is near 1 & keys are well distributed, access is Θ(1) on average

R. Paul Wiegand George Mason University, Department of Computer Science

Closed Hashing with Linear Probing

All keys are stored in table

On collisions, we shift right until we

find an open position

At the end, we wrap back to the start

Delete is problematic (mark & skip)

Avg. # comparisons when successful:

S ≈ 12

1 − 1 −^1 α

Avg. # comparisons when

unsuccessful: U ≈ 12

1 − (1−^1 α) 2

Example:

Data A♦ 5 ♣ 9 ♠ 7 ♠ K♥ h(k) 4 0 1 4 2

A♦
5 ♣ A♦
5 ♣ 9 ♠ A♦
5 ♣ 9 ♠ 7 ♠ A♦
5 ♣ 9 ♠ 7 ♠ K♥ A♦

R. Paul Wiegand George Mason University, Department of Computer Science

Clustering & Double Hashing

0.2 0.4 0.6 0.

0

50

100

150

200

α

comparisons

Unsuccessful Successful

The main problem is clustering

A cluster is a sequence of consecutive

filled positions in the table

One possible solution: double hash

Use a second hash function to

compute the probe interval

h 2 (k) = m − 2 − k mod (m − 2)

We need h 2 (k) and m to be

“relatively prime” (only common

divisor is 1)

Choosing a prime m ensures this

R. Paul Wiegand George Mason University, Department of Computer Science

Clustering & Double Hashing

0.2 0.4 0.6 0.

0

50

100

150

200

α

comparisons

Unsuccessful Successful

The main problem is clustering

A cluster is a sequence of consecutive

filled positions in the table

One possible solution: double hash

Use a second hash function to

compute the probe interval

h 2 (k) = m − 2 − k mod (m − 2)

We need h 2 (k) and m to be

“relatively prime” (only common

divisor is 1)

Choosing a prime m ensures this

We can still have problems as α approaches 1

Only solution: rehash (scan table & relocate into a bigger table)

R. Paul Wiegand George Mason University, Department of Computer Science

B-Trees

K 1... K j K m  1

T 0 T 1 T j  1 T j

T m  2 T m  1

Data records stored in leaves in increasing order of the keys

Each parental node contains m − 1 (distinct) ordered keys

All keys in T 0 are smaller than K 1 , all keys in T 1 are in [K 1 , K 2 ), etc.

Every B-Tree of order m > 2 must satisfy:

Root is leaf or has between 2 and m children

Internal nodes (∼ root ∨ leaf) have b/w ⌈m/ 2 ⌉ and m children

The tree is (perfectly) balanced; all leaves at same level

R. Paul Wiegand George Mason University, Department of Computer Science

Searching in a B-Tree

Keys are ordered in the node, so we can use binary search to find

the pointer to follow

But we don’t care about key comparisons, we care about disk access

We usually choose the order of a B-Tree s.t. the node size

corresponds with disk pages

How many nodes do we have to consider? Height plus 1 ...

R. Paul Wiegand George Mason University, Department of Computer Science