Edit Distance: String Shuffle and Customized Dynamic Programming Algorithm, Slides of Data Structures and Algorithms

These slides pose the problem of determining whether one string is a shuffle of two others, and present an efficient dynamic programming algorithm for the edit distance problem. Edit distance measures the minimum number of edit operations (substitution, insertion, or deletion) required to transform one string into another. The slides also cover customizations for substring matching and longest common subsequence problems.

Typology: Slides

2012/2013

Uploaded on 04/27/2013

shareeka_555

Edit Distance

Problem of the Day

Suppose you are given three strings of characters: X, Y, and Z, where |X| = n, |Y| = m, and |Z| = n + m. Z is said to be a shuffle of X and Y iff Z can be formed by interleaving the characters from X and Y in a way that maintains the left-to-right ordering of the characters from each string.

1. Show that cchocohilaptes is a shuffle of chocolate and chips, but chocochilatspe is not.
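One way to check the shuffle property mechanically is a small dynamic program over prefixes of X and Y. The sketch below is an assumption about how the exercise might be solved, not code from the slides; the names is_shuffle and ok and the MAXLEN bound are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAXLEN 64   /* assumed upper bound on |X| and |Y| for this sketch */

/* Returns true iff z is a shuffle of x and y.
   ok[i][j] is true when the first i+j characters of z can be formed by
   interleaving the first i characters of x with the first j of y. */
bool is_shuffle(const char *x, const char *y, const char *z)
{
    size_t n = strlen(x), m = strlen(y);
    static bool ok[MAXLEN + 1][MAXLEN + 1];

    assert(n <= MAXLEN && m <= MAXLEN);
    if (strlen(z) != n + m) return false;

    for (size_t i = 0; i <= n; i++)
        for (size_t j = 0; j <= m; j++) {
            if (i == 0 && j == 0) { ok[i][j] = true; continue; }
            /* the last character of the prefix of z came from x or from y */
            bool from_x = i > 0 && ok[i-1][j] && x[i-1] == z[i+j-1];
            bool from_y = j > 0 && ok[i][j-1] && y[j-1] == z[i+j-1];
            ok[i][j] = from_x || from_y;
        }
    return ok[n][m];
}
```

Under this sketch, is_shuffle("chocolate", "chips", "cchocohilaptes") holds, while the call with "chocochilatspe" does not: after "chocochilat" the next letter s matches neither the pending e of chocolate nor the pending p of chips.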

Edit Distance

Misspellings make approximate pattern matching an important problem.

If we are to deal with inexact string matching, we must first define a cost function telling us how far apart two strings are, i.e., a distance measure between pairs of strings.

A reasonable distance measure minimizes the cost of the changes which have to be made to convert one string to another.

String Edit Operations

There are three natural types of changes:

• Substitution – Change a single character from pattern s to a different character in text t, such as changing “shot” to “spot”.

• Insertion – Insert a single character into pattern s to help it match text t, such as changing “ago” to “agog”.

• Deletion – Delete a single character from pattern s to help it match text t, such as changing “hour” to “our”.

Recursive Edit Distance Code

    #define MATCH  0    /* enumerated type symbol for match */
    #define INSERT 1    /* enumerated type symbol for insert */
    #define DELETE 2    /* enumerated type symbol for delete */

    int string_compare(char *s, char *t, int i, int j)
    {
        int k;              /* counter */
        int opt[3];         /* cost of the three options */
        int lowest_cost;    /* lowest cost */

        if (i == 0) return(j * indel(' '));
        if (j == 0) return(i * indel(' '));

        opt[MATCH] = string_compare(s,t,i-1,j-1) + match(s[i],t[j]);
        opt[INSERT] = string_compare(s,t,i,j-1) + indel(t[j]);
        opt[DELETE] = string_compare(s,t,i-1,j) + indel(s[i]);

        lowest_cost = opt[MATCH];
        for (k=INSERT; k<=DELETE; k++)
            if (opt[k] < lowest_cost) lowest_cost = opt[k];

        return( lowest_cost );
    }

Speeding it Up

This program is absolutely correct but takes exponential time because it recomputes values again and again and again!

But there can only be |s| · |t| possible unique recursive calls, since there are only that many distinct (i, j) pairs to serve as the parameters of recursive calls.

By storing the values for each of these (i, j) pairs in a table, we can avoid recomputing them and just look them up as needed.
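The table idea can be tried directly on the recursive code by caching each (i, j) result the first time it is computed. This is a minimal sketch under stated assumptions: unit costs for match() and indel(), strings padded with a leading blank, and the hypothetical names memo, UNKNOWN, compare_memo, and edit_distance_memo, none of which appear in the slides.

```c
#include <assert.h>
#include <string.h>

#define MAXLEN  64      /* assumed bound on string length for this sketch */
#define UNKNOWN (-1)    /* marks an (i, j) pair that has not been computed yet */

static int memo[MAXLEN][MAXLEN];

/* Assumed unit costs: a mismatch costs 1, an insertion or deletion costs 1. */
static int match(char c, char d) { return c == d ? 0 : 1; }
static int indel(char c) { (void)c; return 1; }

/* s and t are padded with a leading blank, so real characters start at index 1. */
static int compare_memo(const char *s, const char *t, int i, int j)
{
    if (i == 0) return j * indel(' ');
    if (j == 0) return i * indel(' ');
    if (memo[i][j] != UNKNOWN) return memo[i][j];   /* table lookup */

    int best = compare_memo(s, t, i-1, j-1) + match(s[i], t[j]);
    int ins  = compare_memo(s, t, i, j-1) + indel(t[j]);
    int del  = compare_memo(s, t, i-1, j) + indel(s[i]);
    if (ins < best) best = ins;
    if (del < best) best = del;

    return memo[i][j] = best;   /* store before returning */
}

/* Entry point: clears the table, then compares the two padded strings. */
int edit_distance_memo(const char *s, const char *t)
{
    int n = (int)strlen(s) - 1;   /* lengths without the padding blank */
    int m = (int)strlen(t) - 1;

    assert(n >= 0 && m >= 0 && n < MAXLEN && m < MAXLEN);
    for (int i = 0; i <= n; i++)
        for (int j = 0; j <= m; j++)
            memo[i][j] = UNKNOWN;
    return compare_memo(s, t, n, m);
}
```

Each (i, j) pair is now evaluated at most once, so the work is proportional to |s| · |t| rather than exponential. A call such as edit_distance_memo(" shot", " spot") passes both strings with the leading blank already in place.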

Differences with Dynamic Programming

The dynamic programming version has three differences from the recursive version:

• First, it gets its intermediate values using table lookup instead of recursive calls.

• Second, it updates the parent field of each cell, which will enable us to reconstruct the edit-sequence later.

• Third, it is instrumented using a more general goal_cell() function instead of just returning m[|s|][|t|].cost. This will enable us to apply this routine to a wider class of problems.

We assume that each string has been padded with an initial blank character, so the first real character of string s sits in s[1].

Dynamic Programming Edit Distance

    int string_compare(char *s, char *t)
    {
        int i,j,k;     /* counters */
        int opt[3];    /* cost of the three options */

        for (i=0; i<MAXLEN; i++) {
            row_init(i);
            column_init(i);
        }

        /* strings are padded, so strlen() counts the leading blank */
        for (i=1; i<strlen(s); i++)
            for (j=1; j<strlen(t); j++) {
                opt[MATCH] = m[i-1][j-1].cost + match(s[i],t[j]);
                opt[INSERT] = m[i][j-1].cost + indel(t[j]);
                opt[DELETE] = m[i-1][j].cost + indel(s[i]);

                m[i][j].cost = opt[MATCH];
                m[i][j].parent = MATCH;
                for (k=INSERT; k<=DELETE; k++)
                    if (opt[k] < m[i][j].cost) {
                        m[i][j].cost = opt[k];
                        m[i][j].parent = k;
                    }
            }

        /* the goal cell, not a fixed corner, supplies the answer */
        goal_cell(s,t,&i,&j);
        return( m[i][j].cost );
    }

Example

Below is an example run, showing the cost values for turning “thou shalt not” into “you should not” in five moves. Rows are indexed by the pattern P = “thou-shalt-not” and columns by the text T = “you-should-not”:

T              y   o   u   -   s   h   o   u   l   d   -   n   o   t
P   pos    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14
:     0    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14
t:    1    1   1   2   3   4   5   6   7   8   9  10  11  12  13  13
h:    2    2   2   2   3   4   5   5   6   7   8   9  10  11  12  13
o:    3    3   3   2   3   4   5   6   5   6   7   8   9  10  11  12
u:    4    4   4   3   2   3   4   5   6   5   6   7   8   9  10  11
-:    5    5   5   4   3   2   3   4   5   6   6   7   7   8   9  10
s:    6    6   6   5   4   3   2   3   4   5   6   7   8   8   9  10
h:    7    7   7   6   5   4   3   2   3   4   5   6   7   8   9  10
a:    8    8   8   7   6   5   4   3   3   4   5   6   7   8   9  10
l:    9    9   9   8   7   6   5   4   4   4   4   5   6   7   8   9
t:   10   10  10   9   8   7   6   5   5   5   5   5   6   7   8   8
-:   11   11  11  10   9   8   7   6   6   6   6   6   5   6   7   8
n:   12   12  12  11  10   9   8   7   7   7   7   7   6   5   6   7
o:   13   13  13  12  11  10   9   8   7   8   8   8   7   6   5   6
t:   14   14  14  13  12  11  10   9   8   8   9   9   8   7   6   5

The edit sequence from “thou-shalt-not” to “you-should-not” is DSMMMMMISMSMMMM, where D is a deletion, S a substitution, M a match, and I an insertion.

Reconstruct Path Code

    void reconstruct_path(char *s, char *t, int i, int j)
    {
        /* parent -1 marks the start of the alignment */
        if (m[i][j].parent == -1) return;

        if (m[i][j].parent == MATCH) {
            reconstruct_path(s,t,i-1,j-1);
            match_out(s, t, i, j);
            return;
        }
        if (m[i][j].parent == INSERT) {
            reconstruct_path(s,t,i,j-1);
            insert_out(t,j);
            return;
        }
        if (m[i][j].parent == DELETE) {
            reconstruct_path(s,t,i-1,j);
            delete_out(s,i);
            return;
        }
    }
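The fragments above depend on a table m, cost functions, and output routines that the slides do not show. The following is a self-contained sketch that fills them in under stated assumptions: a cell struct with cost and parent fields, unit costs, a zeroth row and column initialized for plain edit distance, and hypothetical functions fill_table() and edit_path() that write the edit letters into a buffer instead of printing them.

```c
#include <assert.h>
#include <string.h>

#define MAXLEN 64   /* assumed bound on unpadded string length */
#define MATCH  0
#define INSERT 1
#define DELETE 2

typedef struct { int cost; int parent; } cell;   /* assumed table layout */
static cell m[MAXLEN + 1][MAXLEN + 1];

static int match(char c, char d) { return c == d ? 0 : 1; }
static int indel(char c) { (void)c; return 1; }

/* Fills the table for padded strings s and t; returns the edit distance. */
int fill_table(const char *s, const char *t)
{
    int n = (int)strlen(s) - 1;       /* lengths without the padding blank */
    int len_t = (int)strlen(t) - 1;
    assert(n >= 0 && len_t >= 0 && n <= MAXLEN && len_t <= MAXLEN);

    for (int j = 0; j <= len_t; j++) {            /* zeroth row: insertions */
        m[0][j].cost = j;
        m[0][j].parent = (j > 0) ? INSERT : -1;
    }
    for (int i = 0; i <= n; i++) {                /* zeroth column: deletions */
        m[i][0].cost = i;
        m[i][0].parent = (i > 0) ? DELETE : -1;
    }
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= len_t; j++) {
            int opt[3];
            opt[MATCH]  = m[i-1][j-1].cost + match(s[i], t[j]);
            opt[INSERT] = m[i][j-1].cost + indel(t[j]);
            opt[DELETE] = m[i-1][j].cost + indel(s[i]);
            m[i][j].cost = opt[MATCH];
            m[i][j].parent = MATCH;
            for (int k = INSERT; k <= DELETE; k++)
                if (opt[k] < m[i][j].cost) {
                    m[i][j].cost = opt[k];
                    m[i][j].parent = k;
                }
        }
    return m[n][len_t].cost;
}

/* Appends the edit letters leading to cell (i, j) to out, starting at
   offset len, and returns the new length. out is kept NUL-terminated. */
int edit_path(const char *s, const char *t, int i, int j, char *out, int len)
{
    if (m[i][j].parent == -1) { out[len] = '\0'; return len; }

    if (m[i][j].parent == MATCH) {
        len = edit_path(s, t, i-1, j-1, out, len);
        out[len++] = (s[i] == t[j]) ? 'M' : 'S';
    } else if (m[i][j].parent == INSERT) {
        len = edit_path(s, t, i, j-1, out, len);
        out[len++] = 'I';
    } else {
        len = edit_path(s, t, i-1, j, out, len);
        out[len++] = 'D';
    }
    out[len] = '\0';
    return len;
}
```

Because the recursion emits each letter only after returning from the parent cell, the sequence comes out in left-to-right order even though the walk starts at the goal cell. With the same tie-breaking as the slides (MATCH preferred, then INSERT, then DELETE), calling fill_table(" thou-shalt-not", " you-should-not") and then edit_path() from cell (14, 14) should reproduce the sequence quoted in the example.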

Customizing Edit Distance

• Table Initialization – The functions row_init() and column_init() initialize the zeroth row and column of the dynamic programming table, respectively.

• Penalty Costs – The functions match(c,d) and indel(c) present the costs for transforming character c to d and inserting/deleting character c. For edit distance, match costs nothing if the characters are identical, and 1 otherwise, while indel always returns 1.

• Goal Cell Identification – The function goal_cell() returns the indices of the cell marking the endpoint of the solution.

Substring Matching

Suppose that we want to find where a short pattern s best occurs within a long text t, say, searching for “Skiena” in all its misspellings (Skienna, Skena, Skina, ...).

Plugging this search into our original edit distance function will achieve little sensitivity, since the vast majority of any edit cost will be that of deleting the body of the text.

We want an edit distance search where the cost of starting the match is independent of the position in the text, so that a match in the middle is not prejudiced against.

Likewise, the goal state is not necessarily at the end of both strings, but the cheapest place to match the entire pattern somewhere in the text.

Customizations

    void row_init(int i)
    {
        m[0][i].cost = 0;       /* note change */
        m[0][i].parent = -1;    /* note change */
    }

    void goal_cell(char *s, char *t, int *i, int *j)
    {
        int k;    /* counter */

        /* last row of the table: the whole pattern has been matched */
        *i = strlen(s) - 1;
        *j = 0;
        for (k=1; k<strlen(t); k++)
            if (m[*i][k].cost < m[*i][*j].cost) *j = k;
    }
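To see the two customizations working together, the sketch below folds them into one self-contained function. It assumes unit costs and padded strings, tracks costs only (no parent field), and uses the hypothetical names best_substring_match, cost, and min3, which are not part of the slides.

```c
#include <assert.h>
#include <string.h>

#define MAXLEN 128   /* assumed bound on unpadded string lengths */

static int cost[MAXLEN + 1][MAXLEN + 1];

static int min3(int a, int b, int c)
{
    int lo = a < b ? a : b;
    return lo < c ? lo : c;
}

/* Returns the lowest cost of matching the whole pattern against any
   substring of text; *end receives the text position where the first such
   match ends. Both strings are padded with a leading blank. */
int best_substring_match(const char *pattern, const char *text, int *end)
{
    int n = (int)strlen(pattern) - 1;   /* lengths without the padding blank */
    int len = (int)strlen(text) - 1;
    assert(n >= 0 && len >= 0 && n <= MAXLEN && len <= MAXLEN);

    /* Customized row init: starting the match costs nothing anywhere in text. */
    for (int j = 0; j <= len; j++) cost[0][j] = 0;
    /* Column init is unchanged: skipping pattern characters still costs. */
    for (int i = 1; i <= n; i++) cost[i][0] = i;

    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= len; j++)
            cost[i][j] = min3(
                cost[i-1][j-1] + (pattern[i] != text[j]),   /* match or substitute */
                cost[i][j-1] + 1,                           /* insert text[j] */
                cost[i-1][j] + 1);                          /* delete pattern[i] */

    /* Customized goal cell: cheapest entry in the last row. */
    *end = 0;
    for (int j = 1; j <= len; j++)
        if (cost[n][j] < cost[n][*end]) *end = j;
    return cost[n][*end];
}
```

Searching for " Skiena" in a text containing the misspelling "Skienna" returns cost 1 under this sketch, regardless of how much text surrounds the match, because neither the prefix nor the suffix of the text is charged for.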