Hashing-Algorithms and Data Representation-Lecture Slides, Slides of Data Representation and Algorithm Design

This lecture was delivered by Dr. Ameet Shashank at B R Ambedkar National Institute of Technology. It relates to the Data Representation and Algorithm Design course. Its main points are: Hashing, Function, Plan, Good, Collision, Resolution, Strategy, Social, Security, Java

Typology: Slides

2011/2012

Uploaded on 07/15/2012

4.2 Hashing

Optimize Judiciously
Reference: Effective Java by Joshua Bloch.
More computing sins are committed in the name of efficiency
(without necessarily achieving it) than for any other single reason -
including blind stupidity. - William A. Wulf
We should forget about small efficiencies, say about 97% of the time:
premature optimization is the root of all evil. - Donald E. Knuth
We follow two rules in the matter of optimization:
Rule 1: Don't do it.
Rule 2 (for experts only). Don't do it yet - that is, not until
you have a perfectly clear and unoptimized solution.
- M. A. Jackson

Hashing: Basic Plan
Save items in a key-indexed table. Index is a function of the key.
Hash function. Method for computing table index from key.
Collision resolution strategy. Algorithm and data structure to handle
two keys that hash to the same index.
Classic space-time tradeoff.
• No space limitation: trivial hash function with key as address.
• No time limitation: trivial collision resolution with sequential search.
• Limitations on both time and space: hashing (the real world).

Choosing a Good Hash Function
Idealistic goal: scramble the keys uniformly.
• Efficiently computable.
• Each table position equally likely for each key (a thoroughly researched problem).
Ex: Social Security numbers.
• Bad: first three digits (573 = California, 574 = Alaska).
• Better: last three digits (assigned in chronological order within a given geographic region).
Ex: date of birth.
• Bad: birth year.
• Better: birthday.
Ex: phone numbers.
• Bad: first three digits.
• Better: last three digits.
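
The phone number example can be made concrete with a small experiment. The sketch below is only an illustration: the class name, the helper names, and the 1000 made-up numbers sharing one prefix are all assumptions, not part of the slides. It counts how many distinct table indices each choice of digits produces.

```java
import java.util.HashSet;
import java.util.Set;

class PhoneDigitHashDemo {
    // "Bad" choice from the slide: index by the first three digits.
    static int firstThree(String digits) {
        return Integer.parseInt(digits.substring(0, 3));
    }

    // "Better" choice from the slide: index by the last three digits.
    static int lastThree(String digits) {
        return Integer.parseInt(digits.substring(digits.length() - 3));
    }

    // Returns {distinct indices using first three, distinct indices using last three}.
    static int[] distinctIndices(String[] numbers) {
        Set<Integer> first = new HashSet<>();
        Set<Integer> last = new HashSet<>();
        for (String n : numbers) {
            first.add(firstThree(n));
            last.add(lastThree(n));
        }
        return new int[] { first.size(), last.size() };
    }

    // Made-up data (an assumption): 1000 ten-digit numbers with a shared prefix.
    static String[] sampleNumbers() {
        String[] numbers = new String[1000];
        for (int i = 0; i < numbers.length; i++) {
            numbers[i] = String.format("6095550%03d", i);
        }
        return numbers;
    }

    public static void main(String[] args) {
        int[] counts = distinctIndices(sampleNumbers());
        System.out.println("distinct indices, first three digits: " + counts[0]);
        System.out.println("distinct indices, last three digits:  " + counts[1]);
    }
}
```

With this data every number lands on the same index under the first choice, while the second choice spreads the numbers over many indices.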

Hash Codes and Hash Functions

Hash code. A 32-bit int (between -2147483648 and 2147483647).

Hash function. An int between 0 and M-1.

Bug. Don't use (code % M) as array index: the result is negative when code is negative.

Subtle bug. Don't use (Math.abs(code) % M) as array index: Math.abs(Integer.MIN_VALUE) is still negative.

OK. Safe to use ((code & 0x7fffffff) % M) as array index.

    String s = "call";
    int code = s.hashCode();    // 3045982
    int hash = code % M;        // 7121 when M = 8191
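
The three ways of turning a hash code into an array index can be compared side by side. This is a minimal sketch; the class and method names are hypothetical, and M = 8191 is taken from the example above.

```java
class HashIndexDemo {
    // Bug: Java's % keeps the sign of the dividend, so this can be negative.
    static int naiveIndex(int code, int m) {
        return code % m;
    }

    // Subtle bug: Math.abs(Integer.MIN_VALUE) overflows and stays negative.
    static int absIndex(int code, int m) {
        return Math.abs(code) % m;
    }

    // Safe: clearing the sign bit first gives a value between 0 and m-1.
    static int safeIndex(int code, int m) {
        return (code & 0x7fffffff) % m;
    }

    public static void main(String[] args) {
        int m = 8191;
        int[] codes = { "call".hashCode(), -17, Integer.MIN_VALUE };
        for (int code : codes) {
            System.out.println(code
                    + ": naive=" + naiveIndex(code, m)
                    + " abs=" + absIndex(code, m)
                    + " safe=" + safeIndex(code, m));
        }
    }
}
```

Only the masked version stays inside the table for all three sample codes.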

Implementing Hash Code in Java

API for hashCode() (inherited from Object).

• Return an int.

• If x.equals(y) then x and y must have the same hash code.

• Repeated calls to x.hashCode() must return the same value.

Default implementation. Typically derived from the memory address of x.

Customized implementations. String, URL, Integer, Date.

User-defined implementations. Tricky to get right, black art.
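
A user-defined hash code that respects the API rules above might look like the following. This is a sketch under assumptions: the PhoneNumberKey class and its three fields are hypothetical, and the 17/31 recipe is one common convention, not something the slides prescribe.

```java
final class PhoneNumberKey {
    private final int area;
    private final int exchange;
    private final int ext;

    PhoneNumberKey(int area, int exchange, int ext) {
        this.area = area;
        this.exchange = exchange;
        this.ext = ext;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof PhoneNumberKey)) return false;
        PhoneNumberKey that = (PhoneNumberKey) other;
        return area == that.area && exchange == that.exchange && ext == that.ext;
    }

    // Combines exactly the fields that equals() compares,
    // so equal keys always produce equal hash codes.
    @Override
    public int hashCode() {
        int hash = 17;
        hash = 31 * hash + area;
        hash = 31 * hash + exchange;
        hash = 31 * hash + ext;
        return hash;
    }

    public static void main(String[] args) {
        PhoneNumberKey a = new PhoneNumberKey(609, 258, 4455);
        PhoneNumberKey b = new PhoneNumberKey(609, 258, 4455);
        System.out.println(a.equals(b) + " " + (a.hashCode() == b.hashCode()));
    }
}
```

Because the fields are final, repeated calls to hashCode() on the same object return the same value.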

Designing a Good Hash Code

Java 1.5 string library.

• Equivalent to $h = 31^{L-1} s_0 + \dots + 31^2 s_{L-3} + 31 s_{L-2} + s_{L-1}$.

• Horner's method to hash string of length L: O(L).

Ex.

    public int hashCode() {
        int hash = 0;
        for (int i = 0; i < length(); i++)
            hash = (31 * hash) + s[i];    // s[i] is the ith character of s
        return hash;
    }

    String s = "call";
    int code = s.hashCode();

$3045982 = 99 \cdot 31^3 + 97 \cdot 31^2 + 108 \cdot 31^1 + 108 \cdot 31^0$

    char    Unicode
    …       …
    'a'     97
    'b'     98
    'c'     99
    …       …
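
To check that Horner's method and the polynomial describe the same value, both can be written as standalone methods and compared with String.hashCode(). The class and method names below are hypothetical; the sketch relies on Java int arithmetic wrapping around in the same way in both versions.

```java
class HornerStringHash {
    // Horner's method: one multiplication and one addition per character.
    static int hornerHash(String s) {
        int hash = 0;
        for (int i = 0; i < s.length(); i++) {
            hash = 31 * hash + s.charAt(i);
        }
        return hash;
    }

    // Direct evaluation of the polynomial, working from the last character
    // (coefficient 31^0) back to the first (coefficient 31^(L-1)).
    static int polynomialHash(String s) {
        int hash = 0;
        int power = 1;
        for (int i = s.length() - 1; i >= 0; i--) {
            hash += power * s.charAt(i);
            power *= 31;
        }
        return hash;
    }

    public static void main(String[] args) {
        String s = "call";
        System.out.println(hornerHash(s));
        System.out.println(polynomialHash(s));
        System.out.println(s.hashCode());
    }
}
```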

Designing a Bad Hash Code

Java 1.1 string library.

• For long strings: only examines 8-9 evenly spaced characters.

• Saves time in performing arithmetic… but great potential for bad collision patterns.

    public int hashCode() {
        int hash = 0;
        int skip = Math.max(1, length() / 8);
        for (int i = 0; i < length(); i += skip)
            hash = (37 * hash) + s[i];
        return hash;
    }

    http://www.cs.princeton.edu/introcs/13loop/Hello.java
    http://www.cs.princeton.edu/introcs/13loop/Hello.class
    http://www.cs.princeton.edu/introcs/13loop/Hello.html
    http://www.cs.princeton.edu/introcs/13loop/index.html
    http://www.cs.princeton.edu/introcs/12type/index.html
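
The collision risk is easy to reproduce with a standalone version of the sampling scheme. In this sketch the class name, method names, and the two 16-character test strings are made up for illustration; the strings differ only at a position that the sampling loop never reads.

```java
import java.util.Arrays;

class SampledHashDemo {
    // Same scheme as the slide, written as a static method over a String.
    static int sampledHash(String s) {
        int hash = 0;
        int skip = Math.max(1, s.length() / 8);
        for (int i = 0; i < s.length(); i += skip) {
            hash = 37 * hash + s.charAt(i);
        }
        return hash;
    }

    // Builds a string of 16 'a' characters with one character replaced.
    static String sample(int position, char c) {
        char[] chars = new char[16];
        Arrays.fill(chars, 'a');
        chars[position] = c;
        return new String(chars);
    }

    public static void main(String[] args) {
        // Length 16 gives skip = 2, so only even positions are examined.
        String a = sample(1, 'X');
        String b = sample(1, 'Y');
        System.out.println(sampledHash(a) == sampledHash(b)); // sampled hash collides
        System.out.println(a.hashCode() == b.hashCode());     // full hash does not
    }
}
```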

Separate Chaining


Separate chaining: array of M linked lists.

• Hash: map key to integer i between 0 and M-1.

• Insert: put at front of ith chain (if not already there).

• Search: only need to search ith chain.

[Figure: separate chaining with M = 8191 (typically M ≈ N/10).]

    key            hash
    call           7121
    me             3480
    ishmael        5017
    seriously      0
    untravelled    3
    suburban       3
    ....

    st[0]       jocularly -> seriously
    st[1]       listen
    st[2]       null
    st[3]       suburban -> untravelled -> considerating
    ...
    st[8190]    browsing

Separate Chaining: Java Implementation

    public class ListHashST<Key, Val> {
        private int M = 8191;
        // Node stores Object fields: no generic array creation in Java
        private Node[] st = new Node[M];

        private static class Node {
            Object key;
            Object val;
            Node next;

            Node(Object key, Object val, Node next) {
                this.key  = key;
                this.val  = val;
                this.next = next;
            }
        }

        // returns a value between 0 and M-1
        private int hash(Key key) {
            return (key.hashCode() & 0x7fffffff) % M;
        }

Separate Chaining: Java Implementation (cont)

        public void put(Key key, Val val) {
            int i = hash(key);
            // check if key already present
            for (Node x = st[i]; x != null; x = x.next) {
                if (key.equals(x.key)) {
                    x.val = val;
                    return;
                }
            }
            // insert at front of chain
            st[i] = new Node(key, val, st[i]);
        }

        @SuppressWarnings("unchecked")
        public Val get(Key key) {
            int i = hash(key);
            for (Node x = st[i]; x != null; x = x.next)
                if (key.equals(x.key))
                    return (Val) x.val;
            return null;
        }
    }

Separate Chaining Performance

Separate chaining performance.

• Cost is proportional to length of chain.

• Average length = N / M.

• Worst case: all keys hash to same chain.

Theorem. Let α = N / M > 1 be average length of list. For any t > 1, probability that list length > t α is exponentially small in t. (Depends on hash map being random map.)

Parameters.

• M too large ⇒ too many empty chains.

• M too small ⇒ chains too long.

• Typical choice: α = N / M ≈ 10 ⇒ constant-time ops.
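
A quick simulation gives a feel for these numbers. The sketch below is an illustration only: the class name, the seed, and the use of java.util.Random as a stand-in for a random hash function are all assumptions. It throws N = 10·M pseudo-random hash codes into M chains and reports the average and the longest chain.

```java
import java.util.Random;

class ChainLengthDemo {
    // Distributes n pseudo-random hash codes over m chains; returns each chain's length.
    static int[] chainLengths(int n, int m, long seed) {
        Random random = new Random(seed);
        int[] lengths = new int[m];
        for (int k = 0; k < n; k++) {
            int code = random.nextInt();            // stand-in for key.hashCode()
            lengths[(code & 0x7fffffff) % m]++;
        }
        return lengths;
    }

    // Length of the longest chain.
    static int longest(int[] lengths) {
        int max = 0;
        for (int len : lengths) {
            max = Math.max(max, len);
        }
        return max;
    }

    public static void main(String[] args) {
        int m = 8191;
        int n = 10 * m;
        int[] lengths = chainLengths(n, m, 42L);
        System.out.println("average chain length: " + (double) n / m);
        System.out.println("longest chain:        " + longest(lengths));
    }
}
```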

Symbol Table: Implementations Cost Summary

                          Worst Case                  Average Case
    Implementation        Get      Put      Remove    Get      Put      Remove
    Sorted array          log N    N        N         log N    N/2      N/2
    Unsorted list         N        N        N         N/2      N        N/2
    Separate chaining     N        N        N         1*       1*       1*

    * assumes hash function is random

Advantages. Fast insertion, fast search.

Disadvantage. Hash table has fixed size, assumes good hash function. Fix: use repeated doubling, and rehash all keys.
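
The doubling fix can be sketched as a variant of the separate chaining table. Everything specific here is an assumption made for illustration: the class name, the threshold of 10 keys per chain on average, and the choice to relink existing nodes rather than allocate new ones.

```java
class ResizingChainST<K, V> {
    private static final int MAX_AVERAGE_LENGTH = 10;   // assumed resize threshold

    private static final class Node<K, V> {
        final K key;
        V val;
        Node<K, V> next;

        Node(K key, V val, Node<K, V> next) {
            this.key = key;
            this.val = val;
            this.next = next;
        }
    }

    private Node<K, V>[] st;
    private int n;

    @SuppressWarnings("unchecked")
    ResizingChainST(int initialChains) {
        st = (Node<K, V>[]) new Node[initialChains];
    }

    private int hash(K key, int m) {
        return (key.hashCode() & 0x7fffffff) % m;
    }

    int size()   { return n; }
    int chains() { return st.length; }

    V get(K key) {
        for (Node<K, V> x = st[hash(key, st.length)]; x != null; x = x.next) {
            if (key.equals(x.key)) return x.val;
        }
        return null;
    }

    void put(K key, V val) {
        int i = hash(key, st.length);
        for (Node<K, V> x = st[i]; x != null; x = x.next) {
            if (key.equals(x.key)) { x.val = val; return; }
        }
        st[i] = new Node<>(key, val, st[i]);
        n++;
        // Double the number of chains once the average chain gets too long.
        if (n > MAX_AVERAGE_LENGTH * st.length) resize(2 * st.length);
    }

    // Rehashes every key, because the index depends on the number of chains.
    @SuppressWarnings("unchecked")
    private void resize(int newChains) {
        Node<K, V>[] old = st;
        st = (Node<K, V>[]) new Node[newChains];
        for (Node<K, V> head : old) {
            Node<K, V> x = head;
            while (x != null) {
                Node<K, V> next = x.next;
                int i = hash(x.key, newChains);
                x.next = st[i];
                st[i] = x;
                x = next;
            }
        }
    }
}
```

A caller would create the table with a small number of chains, for example `new ResizingChainST<Integer, String>(4)`, and let it grow as keys are inserted.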

Linear Probing


Linear probing: array of size M (typically M ≈ 2N).

• Hash: map key to integer i between 0 and M-1.

• Insert: put in slot i if free; if not try i+1, i+2, etc.

• Search: search slot i; if occupied but no match, try i+1, i+2, etc.

    index                      0  1  2  3  4  5  6  7  8  9 10 11 12
    initial table              -  -  -  S  H  -  -  A  C  E  R  -  -
    insert I, hash(I) = 11     -  -  -  S  H  -  -  A  C  E  R  I  -
    insert N, hash(N) = 8      -  -  -  S  H  -  -  A  C  E  R  I  N
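
The insert and search rules above can be sketched in a few lines. This is an illustration under assumptions, not the lecture's implementation: the class name is hypothetical, keys are Strings and values are ints for brevity, probing wraps around from slot M-1 to slot 0, and one slot is always kept empty so that every search terminates.

```java
class LinearProbingSketch {
    private final int m;
    private final String[] keys;
    private final int[] vals;
    private int n;

    LinearProbingSketch(int m) {
        this.m = m;
        this.keys = new String[m];
        this.vals = new int[m];
    }

    private int hash(String key) {
        return (key.hashCode() & 0x7fffffff) % m;
    }

    // Stores the pair and returns the slot that was used.
    int put(String key, int val) {
        int i = hash(key);
        while (keys[i] != null && !keys[i].equals(key)) {
            i = (i + 1) % m;                 // slot taken by another key: try the next one
        }
        if (keys[i] == null) {
            if (n == m - 1) throw new IllegalStateException("table full");
            keys[i] = key;
            n++;
        }
        vals[i] = val;
        return i;
    }

    // Returns the slot holding key, or -1 if an empty slot is reached first.
    int slotOf(String key) {
        for (int i = hash(key); keys[i] != null; i = (i + 1) % m) {
            if (keys[i].equals(key)) return i;
        }
        return -1;
    }

    Integer get(String key) {
        int i = slotOf(key);
        return i < 0 ? null : Integer.valueOf(vals[i]);
    }

    public static void main(String[] args) {
        LinearProbingSketch st = new LinearProbingSketch(13);
        // "Aa" and "BB" share the hash code 2112, so the second lands one slot later.
        System.out.println(st.put("Aa", 1));
        System.out.println(st.put("BB", 2));
    }
}
```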

Double Hashing

Idea. Avoid clustering by using second hash to compute skip for search.

Hash. Map key to integer i between 0 and M-1.

Second hash. Map key to nonzero skip value k.

Ex: k = 1 + (v mod 97), where v is the key's hashCode().

Effect. Skip values give different search paths for keys that collide.

Best practices. Make k and M relatively prime.
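
The probe sequence that double hashing produces can be generated directly. The sketch below assumes M = 8191 (a prime larger than 97, so every skip value from 1 to 97 is relatively prime to M) and masks the sign bit before taking remainders; the class and method names are hypothetical.

```java
class DoubleHashProbeDemo {
    // First hash: starting slot between 0 and m-1.
    static int firstSlot(int v, int m) {
        return (v & 0x7fffffff) % m;
    }

    // Second hash: nonzero skip value k = 1 + (v mod 97).
    static int skip(int v) {
        return 1 + (v & 0x7fffffff) % 97;
    }

    // The first m slots examined for a key whose hash code is v.
    static int[] probeSequence(int v, int m) {
        int[] sequence = new int[m];
        int i = firstSlot(v, m);
        int k = skip(v);
        for (int j = 0; j < m; j++) {
            sequence[j] = i;
            i = (i + k) % m;
        }
        return sequence;
    }

    // True if the m probes hit m different slots, i.e. the whole table is reachable.
    static boolean visitsEverySlot(int v, int m) {
        boolean[] seen = new boolean[m];
        for (int slot : probeSequence(v, m)) {
            if (seen[slot]) return false;
            seen[slot] = true;
        }
        return true;
    }

    public static void main(String[] args) {
        int m = 8191;
        // Two hash codes that collide on the first slot but get different skips.
        System.out.println(firstSlot(5, m) + " skip " + skip(5));
        System.out.println(firstSlot(8196, m) + " skip " + skip(8196));
        System.out.println(visitsEverySlot(5, m));
    }
}
```

Keys whose hash codes are identical still follow the same path, since both hashes here are derived from the same value v.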

Double Hashing Performance

Theorem. [Guibas-Szemerédi] Let α = N / M < 1 be the load factor (fraction of table slots in use). Assuming the hash function is random, the average number of probes is:

• insert / search miss ≈ $\frac{1}{1 - \alpha}$

• search hit ≈ $\frac{1}{\alpha} \ln \frac{1}{1 - \alpha}$

Parameters. Typical choice: M ≈ 2N ⇒ constant-time ops.

Disadvantage. Delete cumbersome to implement.

Hashing Tradeoffs

Separate chaining vs. linear probing/double hashing.

• Space for links vs. empty table slots.

• Small table + linked allocation vs. big coherent array.

Linear probing vs. double hashing.

    number of probes

    load factor α        50%     66%     75%     90%
    linear probing
        get              1.5     2.0     3.0     5.5
        put              2.5     5.0     8.5    55.5
    double hashing
        get              1.4     1.6     1.8     2.6
        put              1.5     2.0     3.0     5.5

Odds and Ends


Hashing: Java Library

Java has built-in libraries for symbol tables.

• java.util.HashMap = separate chaining hash table implementation.

Duplicate policy.

• Java HashMap allows null values.

• Our implementation forbids null values.

    import java.util.HashMap;

    public class HashMapDemo {
        public static void main(String[] args) {
            HashMap<String, String> st = new HashMap<String, String>();
            st.put("www.cs.princeton.edu", "128.112.136.11");
            st.put("www.princeton.edu",    "128.112.128.15");
            System.out.println(st.get("www.cs.princeton.edu"));
        }
    }

Symbol Table: Using HashMap

Symbol table. Implement our API using java.util.HashMap.

    import java.util.HashMap;
    import java.util.Iterator;

    public class ST<Key, Val> implements Iterable<Key> {
        private HashMap<Key, Val> st = new HashMap<Key, Val>();

        // null values are forbidden: putting null removes the key
        public void put(Key key, Val val) {
            if (val == null) st.remove(key);
            else             st.put(key, val);
        }

        public Val get(Key key)          { return st.get(key);            }
        public Val remove(Key key)       { return st.remove(key);         }
        public boolean contains(Key key) { return st.containsKey(key);    }
        public int size()                { return st.size();              }
        public Iterator<Key> iterator()  { return st.keySet().iterator(); }
    }

Algorithmic Complexity Attacks

Is the random hash map assumption important in practice?

• Obvious situations: aircraft control, nuclear reactor, pacemaker.

• Surprising situations: denial-of-service attacks. A malicious adversary learns your ad hoc hash function (e.g., by reading Java API) and causes a big pile-up in single address that grinds performance to a halt.

Real-world exploits. [Crosby-Wallach 2003]

• Bro server: send carefully chosen packets to DOS the server, using less bandwidth than a dial-up modem.

• Perl 5.8.0: insert carefully chosen strings into associative array.

• Linux 2.4.20 kernel: save files with carefully chosen names.

Reference: http://www.cs.rice.edu/~scrosby/hash

Algorithmic Complexity Attack: Java Library

Goal. Find strings with the same hash code.

Solution. The base-31 hash code is part of Java's string API.

$2^N$ strings of length 2N that hash to same value!

    Key    hashCode()
    Aa     2112
    BB     2112

The 16 strings of length 8 built from these two blocks all share one hashCode():

    AaAaAaAa    BBAaAaAa
    AaAaAaBB    BBAaAaBB
    AaAaBBAa    BBAaBBAa
    AaAaBBBB    BBAaBBBB
    AaBBAaAa    BBBBAaAa
    AaBBAaBB    BBBBAaBB
    AaBBBBAa    BBBBBBAa
    AaBBBBBB    BBBBBBBB
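
The construction can be reproduced mechanically: since "Aa" and "BB" have the same hash code, every string assembled from N such blocks has the same base-31 hash. The sketch below uses hypothetical class and method names; it generates all $2^N$ strings and counts their distinct hash codes.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CollidingStringsDemo {
    // Builds all 2^n strings made of n blocks, each block either "Aa" or "BB".
    static List<String> collidingStrings(int n) {
        List<String> result = new ArrayList<>();
        result.add("");
        for (int block = 0; block < n; block++) {
            List<String> next = new ArrayList<>();
            for (String prefix : result) {
                next.add(prefix + "Aa");
                next.add(prefix + "BB");
            }
            result = next;
        }
        return result;
    }

    // Number of different hashCode() values among the given strings.
    static int distinctHashCodes(List<String> strings) {
        Set<Integer> codes = new HashSet<>();
        for (String s : strings) {
            codes.add(s.hashCode());
        }
        return codes.size();
    }

    public static void main(String[] args) {
        List<String> strings = collidingStrings(4);
        System.out.println(strings.size() + " strings, "
                + distinctHashCodes(strings) + " distinct hash code(s)");
    }
}
```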