Parallelized Knuth-Morris-Pratt Search Algorithm

Chris Dolen

Robert Zimmerman

Introduction

Our project was to implement a highly parallelized and modular version of the Knuth-Morris-Pratt algorithm. The main objective of the project was to demonstrate the implementation of parallel units capable of performing this algorithm on the FPGA. The secondary objective was to develop a modular platform for parallelization of serial algorithms that require a short input string that varies often and a large database that is fairly static.

Knuth-Morris-Pratt Algorithm

The algorithm we chose to parallelize is called the Knuth-Morris-Pratt algorithm, as it was invented by Donald Knuth, Vaughan Pratt, and J. H. Morris.  The algorithm is used to search for a given string in a longer string or set of strings.  It is better than the brute force algorithm of comparing characters in the strings by moving 1 character at a time in the set of strings – it skips a subset of the string if a partial match has already been found. A link to the paper describing this algorithm is located in the references section. The example below is based on an online example that is derived from this paper.

Example of the Algorithm

This section shows an example of the algorithm in action. First let’s create our string that we want to check for a possible match.  This string we will call the database.  It can be any series of characters.  We also need a string that we are searching for.  This string is called the search string.  We will also have two index variables, x and i.  x will be the current position we are looking at in the database, and i will be the current character we are matching in the search string.  In order to skip characters that have already been matched, a table also needs to be created.  So let’s create an array called table, and index with the variable i.  The index here is the same I as previously mentioned.  Now we can define our database and our search string:

Database: ABC ABCDAB ABCDABCDABDE

Search string: ABCDABD

The table is initialized to all 0’s.  It needs to be computed first, so that it can be referenced to bypass recomparing partial matches.  Here is the pseudo code of the table building algorithm, taken from Wikipedia.

algorithm kmp_table:
    input:
        an array of characters, W (the word to be analyzed)
        an array of integers, T (the table to be filled)
    output:
        nothing (but during operation, it populates the table)
 
    define variables:
        an integer, i ← 2 (the current position we are computing in T)
        an integer, j ← 0 (the zero-based index in W of the next character of the current candidate substring)
 
    (the first few values are fixed but different from what the algorithm might suggest)
    let T[0] ← -1, T[1] ← 0
 
    while i is less than the length of W, do:
        (first case: the substring continues)
        if W[i - 1] = W[j], let T[i] ← j + 1, i ← i + 1, j ← j + 1
 
        (second case: it doesn't, but we can fall back)
        otherwise, if j > 0, let j ← T[j]
 
        (third case: we have run out of candidates.  Note j = 0)
        otherwise, let T[i] ← 0, i ← i + 1

 

Our code was based off this pseudo code, so understanding how it works is important.  First of all, you can see that table[0] = -1, and table[1] = 0.  Those two table values are always set to these initial conditions.  So our table looks like this:

i

0

1

2

3

4

5

6

Search String

A

B

C

D

A

B

D

Table[i]

-1

0

0

0

0

0

0

 

What we’re trying to do is fill in the table so that a repeating pattern in the search string is noted.  As you can see, there is a repeating pattern of AB in the search string.  Knowing this will make the searching of the database faster.  The table building algorithm helps us find these repeating patterns.  So, starting at i=2, we look at (i-1) which is a B.  Now the search string is checked to see if this character repeats anywhere, starting at index 0 (referred to as j in the Wikipedia pseudo code).  Index 0 (j=0) is A, which is not equal to B.  Now we check to see if j > 0, which it is not.  Therefore we determine that Table[2] = 0.  For i=3, we look at (3-1) which is C, and not equal to A.  So the same thing happens, and Table[3] = 0.  When i=4, we compare D to A, and they are not equal, so Table[4] = 0.  But, when i=5, we compare A to A, and they are equal.  This means the variable j is incremented by 1, and Table[5] = 1.  Now, when i=6, B (at i-1) is compared to B (at j, which is now 1).  Again, these are equal.  So j is incremented once more, and now Table[6] = 2.  This is the end of the search string, so the table building algorithm is over.  Our table now looks like this:

 

i

0

1

2

3

4

5

6

Search String

A

B

C

D

A

B

D

Table[i]

-1

0

0

0

0

1

2

 

You might notice that even though the partial match starts at i=4, its table value is still a 0, and a 1 isn’t in the table until the following character, at i=5.  You might also notice that ABD is not a repeating pattern, yet D (at i=6) has a non-zero value in the table.  This is just how the table building algorithm works.  It will become clearer once we finish the example of the search algorithm.  For now, just remember that a 1 in the table means the previous search string character is the beginning of a partial match, that the partial match continues as long as the table values keep incrementing, and that the maximum value in an incrementing sequence in the table is where the partial match fails.  For example, our partial match above ends at i=5, but the highest value in the table is 2, which is at i=6.  So the partial match failed at i=6.

Now to continue with the search algorithm.  The pseudo code for this algorithm, from Wikipedia, is shown below:

algorithm kmp_search:
    input:
        an array of characters, S (the text to be searched)
        an array of characters, W (the word sought)
    output:
        an integer (the zero-based position in S at which W is found)
 
    define variables:
        an integer, m ← 0 (the beginning of the current match in S)
        an integer, i ← 0 (the position of the current character in W)
        an array of integers, T (the table, computed elsewhere)
 
    while m + i is less than the length of S, do:
        if W[i] = S[m + i],
            let i ← i + 1
            if i equals the length of W,
                return m
        otherwise,
            let m ← m + i - T[i],
            if i is greater than 0,
                let i ← T[i]
 
    (if we reach here, we have searched all of S unsuccessfully)
    return the length of S

 

Here we have our database and search string:

x

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

DB

A

B

C

 

A

B

C

D

A

B

 

A

B

C

D

A

B

C

D

A

B

D

E

i

0

1

2

3

4

5

6

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

SS

A

B

C

D

A

B

D

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

To begin with, x and i start at 0, and the first characters of the database and search string are compared.  A and A match, so the algorithm continues comparing.  B and B match, C and C match, but “space” and D do not match.  When this occurs, the table is checked to see what should be done next.  x is marking the beginning of the current match in the database, meaning the first place we found a character match.  Right now it’s 0, but when this match failed, it gets assigned x + i – Table[i].  x is 0, i is the position of the current character in the search string, which is 3, and Table[3] = 0.  So x is now 3.  i becomes Table[i], which is again 0.  So now our progress looks like this:

x

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

DB

A

B

C

 

A

B

C

D

A

B

 

A

B

C

D

A

B

C

D

A

B

D

E

i

 

 

 

0

1

2

3

4

5

6

 

 

 

 

 

 

 

 

 

 

 

 

 

SS

 

 

 

A

B

C

D

A

B

D

 

 

 

 

 

 

 

 

 

 

 

 

 

 

As you can see, the algorithm skipped checking x=1 and x=2, as it knows there is no partial match there.  This was determined by the help of the table.  If there were a partial match there, the table value at i=3 (the location where the string matching failed) would have been non-zero.  As we can see by looking at our array, the next location we really want to be looking at is x=4, which is where the next A is.  This happens after one more step.  A is compared to “space”, and since it is not a match, the search string is shifted one more place. 

x

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

DB

A

B

C

 

A

B

C

D

A

B

 

A

B

C

D

A

B

C

D

A

B

D

E

i

 

 

 

 

0

1

2

3

4

5

6

 

 

 

 

 

 

 

 

 

 

 

 

SS