String Matching: Rabin-Karp Algorithm
Greg Plaxton
Theory in Programming Practice, Fall 2005
Department of Computer Science
University of Texas at Austin

The (Exact) String Matching Problem

  • The (exact) string matching problem: Given a text string t and a pattern string p, find all occurrences of p in t
  • A naive algorithm for this problem simply considers all possible starting positions i of a matching string within t, and compares p to the substring of t beginning at each such position i
    • The worst-case complexity of this algorithm is Θ(mn), where m denotes the length of t and n denotes the length of p
    • Can we do better?
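
As a minimal sketch of the naive algorithm described above (the function name `naive_match` is an illustrative choice, not something from the notes), each candidate starting position is tried in turn and the pattern is compared against the substring found there:

```python
def naive_match(t: str, p: str) -> list[int]:
    """Return every offset i at which p occurs in t."""
    matches = []
    # Try each starting position i at which p could still fit inside t.
    for i in range(len(t) - len(p) + 1):
        # Compare p to the substring of t beginning at position i.
        if t[i:i + len(p)] == p:
            matches.append(i)
    return matches


print(naive_match("abracadabra", "abra"))  # [0, 7]
```

Each comparison may inspect every symbol of p, and there is one comparison per starting position, which is where the product of the two lengths in the worst-case bound comes from.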

The Rabin-Karp String Matching Algorithm

  • Assume the text string t is of length m and the pattern string p is of length n
  • Let s_i denote the length-n contiguous substring of t beginning at offset i ≥ 0
    • So, for example, s_0 is the length-n prefix of t
  • The main idea is to use a hash function h to map each s_i to a good-sized set such as the set of the first k nonnegative integers, for some suitable k
    • Initially, we compute h(p)
    • Whenever we encounter an i for which h(s_i) = h(p), we check for a match as in the naive algorithm
    • If h(s_i) ≠ h(p), we don’t need to check for a match
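
The filtering idea above can be sketched as follows. This is only an illustration under stated assumptions: the function name `hash_filtered_match` is hypothetical, the hash function is passed in as a parameter `h`, and Python's built-in `hash` stands in for it in the usage line. The sketch recomputes each hash from scratch, so it shows the control flow rather than any speedup.

```python
from typing import Callable


def hash_filtered_match(t: str, p: str, h: Callable[[str], int]) -> list[int]:
    """Return every offset at which p occurs in t, using h as a filter."""
    n = len(p)            # pattern length, as in the notes
    target = h(p)         # computed once, up front
    matches = []
    for i in range(len(t) - n + 1):
        s_i = t[i:i + n]  # length-n substring of t at offset i
        # Only when the hash values agree is the full comparison performed;
        # a hash match alone may be a false positive.
        if h(s_i) == target and s_i == p:
            matches.append(i)
    return matches


# Assumption: the built-in hash() is used here purely as a stand-in for h.
print(hash_filtered_match("abracadabra", "abra", hash))  # [0, 7]
```

The explicit comparison after a hash match is what keeps the output exact: unequal hashes rule a position out, while equal hashes only make it a candidate.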

The Choice of Hash Function

  • It should be easy to compare two hash values
    • For example, if the range of the hash function is a set of sufficiently small nonnegative integers, then two hash values can be compared with a single machine instruction
  • The number of false positives induced by the hash function should be similar to that achieved by a “random” function
    • If the range of the hash function is of size k, we’d like each hash value to be achieved by approximately the same number of n-symbol strings (where n is the length of the pattern)
  • It should be easy (e.g., a constant number of machine instructions) to compute h(s_{i+1}) given h(s_i)
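
To make the tension between these requirements concrete, here is a toy sketch using a deliberately simple hash that is not taken from the notes: the sum of the character codes modulo a small, arbitrarily chosen `K`. It is trivially easy to update from one substring to the next, but it spreads strings poorly, since any two anagrams collide.

```python
K = 101  # hypothetical small modulus, chosen only for illustration


def additive_hash(s: str) -> int:
    """Toy hash: sum of the character codes of s, modulo K."""
    return sum(ord(c) for c in s) % K


def roll(h_prev: int, outgoing: str, incoming: str) -> int:
    """Derive the hash of the next substring: drop one symbol, add one."""
    return (h_prev - ord(outgoing) + ord(incoming)) % K


t, n = "abracadabra", 4
h = additive_hash(t[:n])
for i in range(len(t) - n):
    h = roll(h, t[i], t[i + n])
    # The constant-time update agrees with recomputing from scratch.
    assert h == additive_hash(t[i + 1:i + 1 + n])

# Weakness: reordering the symbols never changes the hash value,
# so anagrams of the pattern are guaranteed false positives.
assert additive_hash("abra") == additive_hash("raab")
```

A hash like this satisfies the first and third requirements but not the second, which is why a less structured function is wanted.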

A Good Choice for the Hash Function

  • View each string as a nonnegative number, but take the result modulo k for some suitable modulus k
  • For example, we might take k to be 2^32, to ensure that the hash values can be stored in a 32-bit integer
  • In practice the modulus k is generally taken to be a prime (e.g., a 32-bit prime) in order to better destroy any structure in the input data
    • For example, note that the 8-bit ASCII codes for printable characters all begin with a 0
    • So if we use k = 2^32, bits 7, 15, 23, and 31 of the hash of a printable string are guaranteed to be zero
  • But can we still compute h(s_{i+1}) from h(s_i) efficiently?
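
The points above can be checked with a short sketch. Several things here are assumptions rather than statements from the notes: each string is read as a base-256 number (one 8-bit symbol per digit), the names `number_hash` and `roll` are hypothetical, and the prime 2^31 - 1 is just one convenient choice of modulus. The update shown is one standard way to obtain the next hash value with a constant number of arithmetic operations; it is not necessarily the derivation the author has in mind.

```python
RADIX = 256  # assumption: one 8-bit symbol per base-256 digit


def number_hash(s: str, k: int) -> int:
    """Interpret s as a base-256 number and reduce it modulo k."""
    h = 0
    for c in s:
        h = (h * RADIX + ord(c)) % k
    return h


def roll(h_prev: int, outgoing: str, incoming: str, high: int, k: int) -> int:
    """Derive the next hash: remove the leading symbol, shift, append one."""
    return ((h_prev - ord(outgoing) * high) * RADIX + ord(incoming)) % k


# With k = 2^32 the hash is just the last four bytes of the string, so the
# top bit of each byte (bits 7, 15, 23, 31) is zero for printable ASCII.
h = number_hash("Rabin-Karp", 2**32)
assert all((h >> b) & 1 == 0 for b in (7, 15, 23, 31))

# With a prime modulus, the constant-time update agrees with recomputation.
K = 2**31 - 1                  # a prime; an illustrative choice
t, n = "abracadabra", 4
high = pow(RADIX, n - 1, K)    # weight of the leading symbol, computed once
h = number_hash(t[:n], K)
for i in range(len(t) - n):
    h = roll(h, t[i], t[i + n], high, K)
    assert h == number_hash(t[i + 1:i + 1 + n], K)
```

Because `high` is computed once before the scan, each step of the loop costs a fixed number of multiplications, additions, and one reduction modulo k, regardless of the pattern length.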