Hash Tables in Algorithms & Data Abstract Structures - CPSC 223, Fall 2010, Slides of Data Structures and Algorithms

Part of the lecture notes for the Algorithms & Data Abstract Structures course (CPSC 223) at the university of x, taught in the fall of 2010. The notes cover hash tables: the basic idea, advantages over arrays, hash functions, collisions, and resolving collisions with open addressing and separate chaining. They also include example hash functions and a discussion of the cost of hashing.

Typology: Slides

2012/2013

Uploaded on 09/09/2013

zaid

CPSC 223

Algorithms & Data Abstract Structures

Lecture 24: Hash Tables

11/30/10

Today …

• Hash Tables [Ch 12: 686-706]

• Reminders:

  • Project presentations Thursday …
  • Guest lecture next Tuesday
  • Next week: (re)read “The data-structure canon”

CPSC 223 -- Fall 2010

B-Trees versus Arrays

  • What are advantages of balanced search trees over arrays for storing collections of data items?
  • Output (traversal) in sorted order
  • Faster retrieve (and lookup) …
  • O(n) for arrays, O(log n) for balanced search trees

Can we improve search time for arrays?

Yes!

  • Using Hash Tables …

Hash Tables

Basic Idea

  • Define a “hash function” h
  • h : Key → Index
  • Make h fast (e.g., constant time)
  • This makes retrieve O(1)!
  • … which is even faster than in BSTs

[Figure: h maps keys to array indexes 0, 1, 2, …, n – 1 of the table]
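
As a rough sketch of the basic idea (not code from the course), the snippet below stores and retrieves an entry with one hash computation and one array access. The names `TABLE_SIZE`, `h`, `insert`, and `retrieve` are assumptions made for illustration, the hash function is a placeholder, and collisions are ignored at this point.

```python
TABLE_SIZE = 10  # assumed table size, for illustration only

def h(key: int) -> int:
    """Hypothetical hash function: maps an integer key to an index 0..TABLE_SIZE-1."""
    return key % TABLE_SIZE

table = [None] * TABLE_SIZE

def insert(key: int, value: str) -> None:
    # One hash computation plus one array write: constant time.
    table[h(key)] = (key, value)

def retrieve(key: int):
    # One hash computation plus one array read: constant time.
    slot = table[h(key)]
    if slot is not None and slot[0] == key:
        return slot[1]
    return None

insert(42, "answer")
print(retrieve(42))  # answer
print(retrieve(7))   # None
```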

Hash Functions

“Perfect Hash Functions”

  • Map each key to a unique array index
  • Hard if you do not know all search key values to expect
  • Note you may also have more keys than indexes

Most Hash Functions

  • Map two or more keys to the same index
  • This results in “collisions”
  • We have to deal with collisions (more later) …
  • … but we also want hash functions that minimize collisions

Examples of Hash Functions (from textbook)

Assumptions

  • keys are positive integers
  • we have a hash table (array) of 100 elements (0 .. 99)

“Selecting digits”

  • Select digits of the key to use as the hash value
  • Let's say keys are 9-digit employee numbers
    • h(k) = 4th and 9th digits of k
    • For example: h(001364825) = 35
    • Here we store (retrieve) entry with key 001364825 at table[35]
  • This is a fast and simple approach, but
    • May not evenly distribute data
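
The digit-selection example can be sketched as follows. This is an illustration only: the function name `select_digits` is hypothetical, and the key is assumed to be passed as a string so that leading zeros survive.

```python
def select_digits(key: str) -> int:
    """Hash a 9-digit key by joining its 4th and 9th digits (counting from 1)."""
    if len(key) != 9 or not key.isdigit():
        raise ValueError("expected a 9-digit key")
    return int(key[3] + key[8])  # 4th digit is index 3, 9th digit is index 8

print(select_digits("001364825"))  # 35
```

Because only two digits are used, every key sharing those two digits lands in the same slot, which is one way the distribution can end up uneven.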

Examples of Hash Functions (from textbook)

“Folding”

  • Add digits instead
  • Let's say keys are 9-digit employee numbers
    • h(k) = i₁ + i₂ + … + i₉, where k = i₁i₂…i₉
    • For example: h(001364825) = 29
    • Store (retrieve) entry with key 001364825 at table[29]
  • This is also fast, but
    • Also may not evenly distribute data
    • In this example, only hits indexes in the range 0 to 81
    • Can pick different schemes (like i₁i₂i₃ + i₄i₅i₆ + i₇i₈i₉)
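
A minimal sketch of both folding schemes, assuming keys are given as digit strings. The names `fold_digits` and `fold_groups` are made up for this example.

```python
def fold_digits(key: str) -> int:
    """Add the individual digits of the key."""
    return sum(int(d) for d in key)

def fold_groups(key: str, group: int = 3) -> int:
    """Add the key's digits in groups, e.g. 001 + 364 + 825."""
    return sum(int(key[i:i + group]) for i in range(0, len(key), group))

print(fold_digits("001364825"))  # 29
print(fold_groups("001364825"))  # 1190
```

The grouped sum (1190 here) is larger than the 100-element table, so it would still have to be reduced to a valid index before use.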

Examples of Hash Functions (from textbook)

“Modular Arithmetic”

  • Sometimes we end up with indexes outside of the range of table indexes
  • We can use the modulo operator (%) to map values to valid table indexes
  • h(k) = i mod table size, where i is the (possibly out-of-range) value computed from k
  • In our example we can use the key directly … h(001364825) = 1,364,825 mod 100 = 25
  • Key values used may require carefully chosen table sizes
    • E.g., 110 mod 100, 210 mod 100, 310 mod 100, etc. all give 10
    • Convention to more evenly distribute values is to use a prime number (e.g., 101 in this case)
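
The effect of the table size can be checked with a short sketch. The function name `h_mod` is an assumption for illustration.

```python
def h_mod(key: int, table_size: int) -> int:
    """Map an integer key to a valid index with the modulo operator."""
    return key % table_size

print(h_mod(1364825, 100))  # 25

keys = [110, 210, 310]
print([h_mod(k, 100) for k in keys])  # [10, 10, 10]  (all collide)
print([h_mod(k, 101) for k in keys])  # [9, 8, 7]     (spread out with a prime size)
```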

Resolving Collisions (insert)

  • Two general approaches
    • Open Addressing: if location occupied, then find another location
    • Restructuring the Hash Table: add more room to the Hash Table to store collisions

Approach 1: Open Addressing

If a location is taken, “probe” (search) array for the next “open” (available) index

  • Linear probing
    • Search for next available sequentially
    • Take the next free index
    • If at the end, start at position 0
    • Search works similarly
      • Deletion tricky
      • Mark indexes as “deleted” so we don’t throw off search
    • Linear probing can create large “primary” clusters

[Figure: linear probing for key k4 with h(k4) = i = 1; probes continue at i + 1, i + 2, i + 3]
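
The following is a minimal sketch of linear probing with “deleted” markers, not the course's implementation. It assumes integer keys, a hash function of key mod table size, distinct keys (there is no duplicate check), and hypothetical names throughout (`LinearProbingTable`, `EMPTY`, `DELETED`).

```python
EMPTY = object()    # slot has never been used
DELETED = object()  # slot was used, then deleted (keeps searches going)

class LinearProbingTable:
    def __init__(self, size: int = 7):
        self.slots = [EMPTY] * size

    def _probe(self, key: int):
        """Yield indexes h(k), h(k)+1, ... wrapping to position 0 at the end."""
        start = key % len(self.slots)
        for step in range(len(self.slots)):
            yield (start + step) % len(self.slots)

    def insert(self, key: int) -> int:
        for i in self._probe(key):
            if self.slots[i] is EMPTY or self.slots[i] is DELETED:
                self.slots[i] = key  # take the next free index
                return i
        raise RuntimeError("table is full")

    def search(self, key: int):
        for i in self._probe(key):
            if self.slots[i] is EMPTY:
                return None  # a never-used slot ends the search
            if self.slots[i] is not DELETED and self.slots[i] == key:
                return i
        return None

    def delete(self, key: int) -> bool:
        i = self.search(key)
        if i is None:
            return False
        self.slots[i] = DELETED  # mark, rather than empty, the slot
        return True

t = LinearProbingTable(7)
print([t.insert(k) for k in (1, 8, 15)])  # [1, 2, 3]
t.delete(8)
print(t.search(15))  # 3
```

Keys 1, 8, and 15 all hash to index 1 and form a small cluster. If deleting 8 simply emptied slot 2, the search for 15 would stop there and wrongly report it missing; the marker lets the probe continue.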

Approach 1: Open Addressing

If a location is taken, “probe” (search) array for the next “open” (available) index

  • Quadratic probing
    • Helps eliminate “primary” clusters
    • Instead of sequentially probing
    • Probe “quadratic” sequences
      • i + 1², i + 2², i + 3², i + 4², …
    • Creates “secondary” clusters since collisions use same sequences

[Figure: quadratic probing for key k3 with h(k3) = i = 1; probes at i + 1 and i + 4]
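
To make the probe sequence concrete, here is a small sketch that lists the indexes visited from a home index i. The function name is hypothetical, and wrapping with mod table size is an assumption carried over from the linear-probing case.

```python
def quadratic_probe_sequence(i: int, table_size: int, probes: int):
    """Return the first `probes` indexes i, i + 1^2, i + 2^2, ... (mod table size)."""
    return [(i + j * j) % table_size for j in range(probes)]

print(quadratic_probe_sequence(1, 11, 5))  # [1, 2, 5, 10, 6]
```

Any two keys with the same home index get exactly this same list, which is where the “secondary” clusters come from.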

Approach 1: Open Addressing

If a location is taken, “probe” (search) array for the next “open” (available) index

  • Double hashing
    • Further reduces clustering
    • Use a second hash function h₂ to determine the size of sequence steps
    • Note steps depend on key value

[Figure: double hashing for key k4 with h(k4) = i = 1; next probe at i + h₂(k4)]
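
A sketch of double hashing under stated assumptions: the slides do not give a specific h₂, so the choice below (1 + key mod (m − 1), which is never zero) is one common textbook option and not necessarily the one used in the course. All function names are hypothetical.

```python
def h1(key: int, m: int) -> int:
    """Home index (assumed: key mod table size)."""
    return key % m

def h2(key: int, m: int) -> int:
    """Step size (assumed choice); always at least 1 so probing moves forward."""
    return 1 + key % (m - 1)

def double_hash_sequence(key: int, m: int, probes: int):
    start, step = h1(key, m), h2(key, m)
    return [(start + j * step) % m for j in range(probes)]

print(double_hash_sequence(12, 11, 4))  # [1, 4, 7, 10]
print(double_hash_sequence(23, 11, 4))  # [1, 5, 9, 2]
```

Both keys start at index 1, but because the step depends on the key, their probe sequences separate right away.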

Approach 2: Restructure Array

Change the structure of the hash table to hold multiple items in the same position

  • Separate Chaining (HW10)
    • Instead of using a static array, use linked lists (chains)
    • We end up with “Chain Nodes” holding entries

Approach 2: Restructure Array

Separate Chaining

  • Instead of using a static array, use a linked list (chain)
  • End up with “Chain Nodes” holding entries

[Figure: h maps a key to a position in the Table (Array); Linked Lists (one per table location)]
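
Separate chaining can be sketched as an array of linked-list heads. This is an illustration under assumptions (integer keys, key mod table size as the hash, hypothetical class and method names), not the HW10 solution.

```python
class ChainNode:
    """One node in a chain: holds an entry and a link to the next node."""
    def __init__(self, key, value, next_node=None):
        self.key, self.value, self.next = key, value, next_node

class ChainedHashTable:
    def __init__(self, size: int = 5):
        self.heads = [None] * size  # one linked list per table location

    def _index(self, key: int) -> int:
        return key % len(self.heads)

    def insert(self, key: int, value) -> None:
        i = self._index(key)
        self.heads[i] = ChainNode(key, value, self.heads[i])  # add at the front

    def retrieve(self, key: int):
        node = self.heads[self._index(key)]
        while node is not None:  # walk the chain
            if node.key == key:
                return node.value
            node = node.next
        return None

    def delete(self, key: int) -> bool:
        i = self._index(key)
        prev, node = None, self.heads[i]
        while node is not None:
            if node.key == key:
                if prev is None:
                    self.heads[i] = node.next
                else:
                    prev.next = node.next
                return True
            prev, node = node, node.next
        return False

t = ChainedHashTable(5)
t.insert(2, "a")
t.insert(7, "b")      # 7 % 5 == 2, so it joins the same chain as key 2
print(t.retrieve(7))  # b
print(t.retrieve(2))  # a
```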

The Cost of Hashing

  • Ideally
    • Insert, Delete, and Retrieve are O(1)
    • Traversal is O(n) … but the result is not sorted
  • In practice
    • Collisions increase the cost
    • Cost depends on the “load factor” α … how full the table is
      • α = # items in table / table size
    • As the table fills, the chances of collisions increase
    • Thus hashing efficiency decreases as load factor increases
    • Note that α > 1 if more items than array positions
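
The load factor is a simple ratio; the short sketch below (with a hypothetical `load_factor` helper) shows a half-full table and a table holding more items than it has positions.

```python
def load_factor(num_items: int, table_size: int) -> float:
    """Load factor = number of items in the table / table size."""
    return num_items / table_size

print(load_factor(50, 100))   # 0.5
print(load_factor(150, 100))  # 1.5
```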

The Cost of Hashing

Cost of Separate Chaining

  • Insertion is still O(1)
    • New items added to the front of the linked list
  • Deletion, retrieval may require searching entire linked list (chain)
  • So again, cost depends on collisions
  • Here α is the average length of each linked list (assuming a “good” hash function)
  • But since α = n / constant, search is worst-case O(n)
  • In practice, hash tables are efficient at searching though!
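
The gap between average and worst case can be seen by counting chain lengths. This sketch assumes integer keys hashed with key mod table size; `chain_lengths` is a made-up helper.

```python
def chain_lengths(keys, table_size: int):
    """Count how many keys land in each chain."""
    lengths = [0] * table_size
    for k in keys:
        lengths[k % table_size] += 1
    return lengths

m = 10
good = chain_lengths(range(30), m)         # keys spread evenly over the table
bad = chain_lengths(range(0, 300, 10), m)  # every key hashes to index 0

print(sum(good) / m, max(good))  # 3.0 3
print(sum(bad) / m, max(bad))    # 3.0 30
```

Both tables hold 30 items in 10 chains, so the average chain length is 3.0 in each case, but in the second table a search may have to walk a single chain of all 30 items.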

Assignment 10 – Hash Table

HashTable

  • Maps each keyword to a table index (hash function)
  • Each table index contains a (linked) List of ChainNodes
  • In Dictionary
    • insert involves adding (keyword, Entry) pairs
    • remove involves removing Entries (and possibly ChainNodes)
    • search (new operation) finds and returns Entries given a keyword

[Figure: Table with indexes 0 to 5, for the case h(device) = 2 and h(contrivance) = 2. Objects shown: C1 : ChainNode (keyword = “device”) with L1 : List of e1 : Entry and e2 : Entry; C2 : ChainNode (keyword = “contrivance”) with L2 : List of e1 : Entry; L3 : List]
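
The structure described for the assignment can be sketched roughly as below. The slides do not show the assignment's actual interface or language, so the method signatures, the string hash (sum of character codes mod table size), and the use of Python lists in place of linked Lists are all assumptions made for illustration.

```python
class ChainNode:
    """Holds one keyword and the list of Entries stored under it."""
    def __init__(self, keyword: str):
        self.keyword = keyword
        self.entries = []

class Dictionary:
    def __init__(self, table_size: int = 6):
        self.table = [[] for _ in range(table_size)]  # each index: list of ChainNodes

    def _h(self, keyword: str) -> int:
        # Assumed hash function, not the one from the assignment.
        return sum(ord(c) for c in keyword) % len(self.table)

    def _find(self, keyword: str):
        for node in self.table[self._h(keyword)]:
            if node.keyword == keyword:
                return node
        return None

    def insert(self, keyword: str, entry) -> None:
        node = self._find(keyword)
        if node is None:  # first Entry for this keyword: add a new ChainNode
            node = ChainNode(keyword)
            self.table[self._h(keyword)].append(node)
        node.entries.append(entry)

    def search(self, keyword: str):
        node = self._find(keyword)
        return list(node.entries) if node else []

    def remove(self, keyword: str, entry) -> bool:
        node = self._find(keyword)
        if node is None or entry not in node.entries:
            return False
        node.entries.remove(entry)
        if not node.entries:  # last Entry gone: drop the ChainNode too
            self.table[self._h(keyword)].remove(node)
        return True

d = Dictionary()
d.insert("device", "gadget")
d.insert("device", "appliance")
d.insert("contrivance", "gadget")
print(d.search("device"))  # ['gadget', 'appliance']
```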