Download Data Structures and Algorithm Analysis - Lecture II and more Study notes Algorithms and Programming in PDF only on Docsity!
CS 483 - Data Structures and Algorithm Analysis
Lecture VII: Chapter 7
R. Paul Wiegand
George Mason University, Department of Computer Science
March 29, 2006
R. Paul Wiegand George Mason University, Department of Computer Science
Outline
1 Introduction: Space vs. Time Tradeoff
2 Sorting by Counting
3 String Matching
4 Hashing
5 B-Trees
6 Homework
R. Paul Wiegand George Mason University, Department of Computer Science
Comparison Count Sort
ComparisonCountingSort(A[0... n − 1])
for i ← 0 to n − 1 do Count[i ] ← 0 for i ← 0 to n − 2 do for j ← i + 1 to n − 1 do if A[i ] < A[j] then Count[j] ← Count[j] + 1 else Count[j] ← Count[j] + 1 for i ← 0 to n − 1 do S[Count[i ]] ← A[i ] return S
A = 64 31 84 96 19 47 0 0 0 0 0 0 3 0 1 1 0 0 3 1 2 2 0 1 3 1 4 3 0 1 3 1 4 5 0 1 3 1 4 5 0 2 3 1 4 5 0 2
init i = 0 i = 1 i = 2 i = 3 i = 4 i = 5
S
C (n) =
∑n− 2
i =
∑n− 1
j=i +1 1 =^
n(n−1)
2 ∈^ Θ(n
For each element to be sorted, count the total number of elements smaller than this element.
R. Paul Wiegand George Mason University, Department of Computer Science
Distribution Counting
ComparisonCountingSort(A[0... n − 1])
for j ← 0 to u − ℓ do D[j] ← 0 for i ← 0 to n − 1 do D[A[i ] − ℓ] ← D[A[i ] − ℓ] + 1 for j ← 1 to u − ℓ do D[j] ← D[j − 1] + D[j] for i ← n − 1 downto 0 do j ←− A[i ] − ℓ S[D[j] − 1] ←− A[i ] D[j] ←− D[j] − 1
Sometimes the input is
constrained
Fixed array of
values
Each in [ℓ, u]
Sometimes we want
additional information
information
}
After accumulation...
D[0... 2] S[0... 5]
A[i = 5] = 12 1 4 6 12 A[i = 4] = 12 1 3 6 12 12 A[i = 3] = 13 1 2 6 12 12 13 A[i = 2] = 12 1 2 5 12 12 12 13 A[i = 1] = 11 1 2 5 11 12 12 12 13 A[i = 0] = 13 0 1 5 11 12 12 12 13 13
R. Paul Wiegand George Mason University, Department of Computer Science
Horspool’s Algorithm
Idea: When we shift, make as large a shift as possible
Match pattern from right to left
Consider character c of the text that was aligned against the last
character of the pattern
t(c) =
m, if c is not in the first m − 1 characters
dist from rightmost c in first m − 1 characters, otherwise
Still Θ(n) in Avg case, Θ(nm) in worst case
But on average, must faster than brute force
ShiftTable(P[0... m − 1])
for j ← 0 to m − 2 do T [P[j]] ← m − 1 − j
return T
R. Paul Wiegand George Mason University, Department of Computer Science
Horspool’s Algorithm
Idea: When we shift, make as large a shift as possible
Match pattern from right to left
Consider character c of the text that was aligned against the last
character of the pattern
t(c) =
m, if c is not in the first m − 1 characters
dist from rightmost c in first m − 1 characters, otherwise
Still Θ(n) in Avg case, Θ(nm) in worst case
But on average, must faster than brute force
ShiftTable(P[0... m − 1])
for j ← 0 to m − 2 do T [P[j]] ← m − 1 − j
return T
May repeatedly over- write shift value for a given character
R. Paul Wiegand George Mason University, Department of Computer Science
The Basics of Hashing
Hashes are often useful for implementing dictionaries (basic
operations: Insert, Search, & Delete)
Construct a data type to store records by key value (Hash Table),
generally an array H[0... m − 1]
Use the key to access the table by computing its address with a
predefined Hash Function, h(k)
If keys are nonnegative integers, a simple hash function is
h(k) = k mod m
For strings of a fixed length, we might use:
h(k) =
ℓ− 1
i =0 ord(ci^ )
mod m
Or, where C is a larger constant than any ord(ci ):
h ← 0; for i ← 0 to ℓ − 1 do h ← (h · C + ord(ci ))
R. Paul Wiegand George Mason University, Department of Computer Science
Collisions
Hash functions should try to:
1 Distribute keys in the table as evenly as possible
2 Be easy to compute
When the hash functions computes the same value for different
keys, a collision occurs
When m < n (n is the number of keys inserted into the table),
this will occur
Even when m ≥ n it is still possible (depending on the data
and the hash function)
Hash implementations need to have a collision resolution
method, such as:
Open hashing (separate chaining)
Closed hashing (open addressing)
R. Paul Wiegand George Mason University, Department of Computer Science
Open Hashing
Each cell in the hash table is a
linked list
Values are stored in list, collisions
are handled by chaining values
If n keys are distributed evenly,
each list is about the same size: mn
load factor — α = mn
Average number of nodes visited
during a successful search:
S ≈ 1 + α 2
Average number of nodes visited
during an unsuccessful search:
U ≈ α
Example:
m = 5 h(k) = (suitvalue + cardvalue) mod m {♠,♦ ,♥ ,♣ } = { 42 , 28 , 14 , 0 } {K , Q, J, A} = { 13 , 12 , 11 , 1 }
Data A♦ 5 ♣ 9 ♠ 7 ♠ K♥ h(k) 4 0 1 4 2
5 ♣ 9 ♠ K♥ A♦
When load factor is near 1 & keys are well distributed, access is Θ(1) on average
R. Paul Wiegand George Mason University, Department of Computer Science
Closed Hashing with Linear Probing
All keys are stored in table
On collisions, we shift right until we
find an open position
At the end, we wrap back to the start
Delete is problematic (mark & skip)
Avg. # comparisons when successful:
S ≈ 12
1 − 1 −^1 α
Avg. # comparisons when
unsuccessful: U ≈ 12
1 − (1−^1 α) 2
Example:
Data A♦ 5 ♣ 9 ♠ 7 ♠ K♥ h(k) 4 0 1 4 2
A♦
5 ♣ A♦
5 ♣ 9 ♠ A♦
5 ♣ 9 ♠ 7 ♠ A♦
5 ♣ 9 ♠ 7 ♠ K♥ A♦
R. Paul Wiegand George Mason University, Department of Computer Science
Clustering & Double Hashing
0.2 0.4 0.6 0.
0
50
100
150
200
α
comparisons
Unsuccessful Successful
The main problem is clustering
A cluster is a sequence of consecutive
filled positions in the table
One possible solution: double hash
Use a second hash function to
compute the probe interval
h 2 (k) = m − 2 − k mod (m − 2)
We need h 2 (k) and m to be
“relatively prime” (only common
divisor is 1)
Choosing a prime m ensures this
R. Paul Wiegand George Mason University, Department of Computer Science
Clustering & Double Hashing
0.2 0.4 0.6 0.
0
50
100
150
200
α
comparisons
Unsuccessful Successful
The main problem is clustering
A cluster is a sequence of consecutive
filled positions in the table
One possible solution: double hash
Use a second hash function to
compute the probe interval
h 2 (k) = m − 2 − k mod (m − 2)
We need h 2 (k) and m to be
“relatively prime” (only common
divisor is 1)
Choosing a prime m ensures this
We can still have problems as α approaches 1
Only solution: rehash (scan table & relocate into a bigger table)
R. Paul Wiegand George Mason University, Department of Computer Science
B-Trees
K 1... K j K m 1
T 0 T 1 T j 1 T j
T m 2 T m 1
Data records stored in leaves in increasing order of the keys
Each parental node contains m − 1 (distinct) ordered keys
All keys in T 0 are smaller than K 1 , all keys in T 1 are in [K 1 , K 2 ), etc.
Every B-Tree of order m > 2 must satisfy:
Root is leaf or has between 2 and m children
Internal nodes (∼ root ∨ leaf) have b/w ⌈m/ 2 ⌉ and m children
The tree is (perfectly) balanced; all leaves at same level
R. Paul Wiegand George Mason University, Department of Computer Science
Searching in a B-Tree
Keys are ordered in the node, so we can use binary search to find
the pointer to follow
But we don’t care about key comparisons, we care about disk access
We usually choose the order of a B-Tree s.t. the node size
corresponds with disk pages
How many nodes do we have to consider? Height plus 1 ...
R. Paul Wiegand George Mason University, Department of Computer Science