Module 5: Pattern Search


Problem Definition

Simple description:

Example:

Practical issues:

In-Class Exercise 5.1: Download this template and implement a simple algorithm for finding a pattern within a text. Both the pattern and text are of type char[] (arrays of characters). Analyse the complexity of your algorithm for a pattern of size m and a text of length n.


Using Signatures: The Rabin-Karp Algorithm

Key ideas:

Details:

Pseudocode:


Algorithm: RabinKarpPatternSearch (pattern, text)
Input: pattern array of length m, text array of length n
     // Compute the signature of the pattern using modulo-arithmetic. 
1.   patternHash = computePatternSignature (pattern);
     // A small optimization: compute d^(m-1) mod q just once. 
2.   multiplier = d^(m-1) mod q;
     // Initialize the current text signature to first substring of length m. 
3.   textHash = compute signature of text[0]...text[m-1]
     // No match possible after this position: 
4.   lastPossiblePosition = n - m
     // Start scan. 
5.   textCursor = 0
6.   while textCursor <= lastPossiblePosition
7.     if textHash = patternHash
          // Potential match. 
8.       if match (pattern, text, textCursor)
           // Match found 
9.         return textCursor
10.      endif
         // Different strings with same signature, so continue search.        
11.    endif
12.    textCursor = textCursor + 1
       // Use O(1) computation to compute next signature: this uses 
       // the multiplier as well as d and q. 
13.    textHash = compute signature of text[textCursor],...,text[textCursor+m-1]
14.  endwhile
15.  return -1
Output: position in text if pattern is found, -1 if not found.
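
The pseudocode above can be sketched in Java as follows. This is a minimal sketch, not the course's reference solution: the class and method names are hypothetical, and the radix d = 32 and prime modulus q = 33554393 are assumptions chosen so that the signatures agree with the numbers in the example that follows.

```java
final class RabinKarpSketch {
    // Assumed parameters: radix d and a large prime modulus q.
    static final long D = 32;
    static final long Q = 33554393;

    /** Signature of s[start..start+m-1], computed by Horner's rule modulo Q. */
    static long signature(char[] s, int start, int m) {
        long h = 0;
        for (int i = 0; i < m; i++) {
            h = (h * D + s[start + i]) % Q;
        }
        return h;
    }

    /** Character-by-character check: different strings can share a signature. */
    static boolean matchesAt(char[] pattern, char[] text, int pos) {
        for (int i = 0; i < pattern.length; i++) {
            if (text[pos + i] != pattern[i]) return false;
        }
        return true;
    }

    /** Returns the first position of pattern in text, or -1 if not found. */
    static int search(char[] pattern, char[] text) {
        int m = pattern.length, n = text.length;
        if (m > n) return -1;
        long patternHash = signature(pattern, 0, m);
        long multiplier = 1;                       // d^(m-1) mod q, computed once
        for (int i = 1; i < m; i++) multiplier = (multiplier * D) % Q;
        long textHash = signature(text, 0, m);
        int lastPossiblePosition = n - m;
        for (int cursor = 0; cursor <= lastPossiblePosition; cursor++) {
            if (textHash == patternHash && matchesAt(pattern, text, cursor)) {
                return cursor;
            }
            if (cursor < lastPossiblePosition) {
                // O(1) update: remove text[cursor], shift by d, append text[cursor+m].
                long withoutLeading = (textHash + Q - (text[cursor] * multiplier) % Q) % Q;
                textHash = (withoutLeading * D + text[cursor + m]) % Q;
            }
        }
        return -1;
    }
}
```

Calling RabinKarpSketch.search("BACA".toCharArray(), "ABABACBACACABA".toCharArray()) returns 6. The guard around the update avoids reading past the end of the text after the last window.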
  

Example:


 A B A B A C B A C A C A B A          textHash=2199650 patternHash=2231457
 B A C A

 A B A B A C B A C A C A B A          textHash=2231425 patternHash=2231457
   B A C A

 A B A B A C B A C A C A B A          textHash=2199651 patternHash=2231457
     B A C A

 A B A B A C B A C A C A B A          textHash=2231458 patternHash=2231457
       B A C A

 A B A B A C B A C A C A B A          textHash=2200705 patternHash=2231457
         B A C A

 A B A B A C B A C A C A B A          textHash=2265187 patternHash=2231457
           B A C A

 A B A B A C B A C A C A B A          textHash=2231457 patternHash=2231457
             B A C A

 A B A B A C B A C A C A B A          Pattern found
             B A C A
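
The signature values in this trace can be checked by hand. The worked step below is a sketch under assumptions inferred from the numbers rather than stated in the text: radix d = 32, ASCII codes (A = 65, B = 66), and a modulus q larger than every value shown, so the "mod q" leaves the values unchanged. Here h_i is the signature of the window starting at text position i and t_i is the code of the character at position i.

```latex
\begin{align*}
h_{i+1} &= \bigl( (h_i - t_i \, d^{\,m-1}) \, d + t_{i+m} \bigr) \bmod q \\[4pt]
h_0 &= 65\cdot 32^{3} + 66\cdot 32^{2} + 65\cdot 32 + 66 = 2199650
      && \text{(window A B A B)} \\
h_1 &= (2199650 - 65\cdot 32768)\cdot 32 + 65
     = 69730\cdot 32 + 65 = 2231425
      && \text{(window B A B A)}
\end{align*}
```

Both values agree with the first two lines of the trace, and the second is obtained from the first with a constant number of operations.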

Analysis:

In practice:

In-Class Exercise 5.2: Download this template and implement the Rabin-Karp algorithm but with a different signature function. Use the simple signature of adding up the ascii codes in the string (shown in the second figure in this section).


The Knuth-Morris-Pratt Algorithm

Key ideas:

Example:

  • Before going into detail, consider an example.

  • Text = "B A B C A B A B A B A B B A C A A B"
    Pattern = "B A B A B B"

  • Suppose the nextIndex function turns out to be: (we'll learn how to compute it later)
    MISMATCH POSITION   NEXTINDEX
    0                   0
    1                   0
    2                   0
    3                   1
    4                   2
    5                   3

  • Let's apply this nextIndex function to the text:
    
     B A B C A B A B A B A B B A C A A B          Matched up to pattern position 0
     B A B A B B
    
     B A B C A B A B A B A B B A C A A B          Matched up to pattern position 1
     B A B A B B
    
     B A B C A B A B A B A B B A C A A B          Matched up to pattern position 2
     B A B A B B
    
     B A B C A B A B A B A B B A C A A B          Mismatch at pattern position 3 nextIndex=1
     B A B A B B
    
     B A B C A B A B A B A B B A C A A B          Mismatch at pattern position 1 nextIndex=0
         B A B A B B
    
     B A B C A B A B A B A B B A C A A B          Mismatch at start of pattern: shifting right
           B A B A B B
    
     B A B C A B A B A B A B B A C A A B          Mismatch at start of pattern: shifting right
             B A B A B B
    
     B A B C A B A B A B A B B A C A A B          Matched up to pattern position 0
               B A B A B B
    
     B A B C A B A B A B A B B A C A A B          Matched up to pattern position 1
               B A B A B B
    
     B A B C A B A B A B A B B A C A A B          Matched up to pattern position 2
               B A B A B B
    
     B A B C A B A B A B A B B A C A A B          Matched up to pattern position 3
               B A B A B B
    
     B A B C A B A B A B A B B A C A A B          Matched up to pattern position 4
               B A B A B B
    
     B A B C A B A B A B A B B A C A A B          Mismatch at pattern position 5 nextIndex=3
               B A B A B B
    
     B A B C A B A B A B A B B A C A A B          Matched up to pattern position 3
                   B A B A B B
    
     B A B C A B A B A B A B B A C A A B          Matched up to pattern position 4
                   B A B A B B
    
     B A B C A B A B A B A B B A C A A B          Matched up to pattern position 5
                   B A B A B B
    
     B A B C A B A B A B A B B A C A A B          Pattern found
                   B A B A B B
    

Computing the nextIndex function:

  • The hard part is understanding how to compute the nextIndex function.

  • At first, it may seem that having a high-degree of overlap is good:
    
     E A A A A E A A A A C A A A A
       A A A A F
    
     E A A A A E A A A A C A A A A
         A A A A F
    
    Here, the overall "shift" was only one space.

  • Now consider:
    
     E A B C D E A A A A C A A A A
       A B C D F
    
     E A B C D E A A A A C A A A A
               A B C D F
    
    In this case: the overall shift was "four spaces" (much better).

  • What made the difference?
    • If the pattern has very little repetition, the shift will be larger.
    • If the pattern has a lot of repetition, the shift will generally be smaller.

In-Class Exercise 5.3: Consider this partial match:


 E A B C A B C A B A B C E A A A A C A A A A
   A B C A B C A F
How many letters can we skip? What about this case (below)?

 E A B C B B C A C A B C E A A A A C A A A A
   A B C B B C A F

  • What is the intuition? Consider this example:
    
     E A B A B E A A A A C A A A A
       A B A B F
    
     E A B A B E A A A A C A A A A
           A B A B F
    
    The amount shifted depends on how large a "prefix of the pattern" is a "suffix of the matched-text".

  • Consider the first example again:
    
     E A A A A E A A A A C A A A A
       A A A A F
    
    • The A A A prefix of the pattern is a suffix of the matched text A A A A
    • That's why we must consider the possibility of matching it:
      
       E A A A A E A A A A C A A A A
           A A A A F
       

  • Second example:
    
     E A B A B E A A A A C A A A A
       A B A B F
    
    • The A B prefix of the pattern is a suffix of the matched text A B A B.

  • Thus, intuitively, the amount to slide forward depends on the prefix-suffix match:
    • The more the match, the less we can slide forward.

The relation with nextIndex:

  • Consider the first example:
    
     E A A A A E A A A A C A A A A          Mismatch at pattern position 4 nextIndex=3
      
    • Here, we slide forward one space.
    • But four characters were matched.
      => need to start comparisons at position 3 (fourth character) in pattern.
      => nextIndex = 3.
      
         A A A A F
       E A A A A E A A A A C A A A A
           A A A A F
        
  • Consider the second example:
    
     E A B A B E A A A A C A A A A          Mismatch at pattern position 4 nextIndex=2
       
    • In this example, we slide forward two spaces.
    • Since two characters match, the comparisons start at position 2.
      => nextIndex=2.
      
         A B A B F
       E A B A B E A A A A C A A A A
             A B A B F
         

  • How to define nextIndex:
    • Notice: nextIndex is the length of the longest prefix of the pattern that is also a suffix of the matched text (other than the whole matched text, since the pattern must slide forward at least one space).
    • Alternatively: the length of the longest such suffix of the matched text that is a prefix of the pattern.
    • But: the matched text is also in the pattern!
      => can pre-compute nextIndex using only the pattern.
  • Formal definition:
    • Suppose the mismatch occurs at the k-th character (in the pattern), i.e., at position k-1 in the 0-based numbering used in the traces.
      => first k-1 characters of the pattern have matched.
    • Call this substring of length k-1 the matched part.
    • Let nextIndex = length of the longest proper suffix of the matched part that is also a prefix of the pattern.

Pseudocode:


Algorithm: KnuthMorrisPrattPatternSearch (pattern, text)
Input: pattern array of length m, text array of length n
     // Compute the nextIndex function, stored in an array: 
1.   nextIndex = computeNextIndex (pattern)
     // Start the pattern search. 
2.   textCursor = 0
3.   patternCursor = 0
4.   while textCursor < n and patternCursor < m
5.     if text[textCursor] = pattern[patternCursor]
         // Advance cursors as long as characters match. 
6.       textCursor = textCursor + 1
7.       patternCursor = patternCursor + 1
8.     else
         // Mismatch. 
9.       if patternCursor = 0
           // If the mismatch occurred at first position, simply 
           // advance the pattern one space. 
10.        textCursor = textCursor + 1
11.      else
           // Otherwise, use the nextIndex function. The textCursor 
           // already points to the next place to compare in the text. 
12.        patternCursor = nextIndex[patternCursor]
13.      endif
14.    endif
15.  endwhile
16.  if patternCursor = m
       // If there was a match, return the first position in the match. 
17.    return (textCursor - patternCursor)
18.  else
19.    return -1
20.  endif
Output: position in text if pattern is found, -1 if not found.
  

Algorithm: computeNextIndex (pattern)
Input: a pattern array of length m
1.   nextIndex[0] = 0
2.   for i=1 to m-1
3.     nextIndex[i] = length of the longest proper suffix of pattern[0]...pattern[i-1]
                      that is a prefix of pattern.
4.   endfor
5.   return nextIndex
Output: the nextIndex array (representing the nextIndex function)
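
The two algorithms above can be sketched in Java as follows. This is a direct, unoptimized transcription with hypothetical class and method names; computeNextIndex simply tries every candidate length, from longest to shortest.

```java
final class KmpSketch {
    /**
     * For each i, the length of the longest proper suffix of
     * pattern[0..i-1] that is also a prefix of the pattern.
     */
    static int[] computeNextIndex(char[] pattern) {
        int m = pattern.length;
        int[] nextIndex = new int[m];              // all entries start at 0
        for (int i = 1; i < m; i++) {
            // Try candidate lengths from the longest proper one (i-1) down to 1.
            for (int len = i - 1; len >= 1; len--) {
                boolean same = true;
                for (int j = 0; j < len && same; j++) {
                    same = pattern[j] == pattern[i - len + j];
                }
                if (same) {
                    nextIndex[i] = len;
                    break;
                }
            }
        }
        return nextIndex;
    }

    /** Returns the first position of pattern in text, or -1 if not found. */
    static int search(char[] pattern, char[] text) {
        int m = pattern.length, n = text.length;
        int[] nextIndex = computeNextIndex(pattern);
        int textCursor = 0, patternCursor = 0;
        while (textCursor < n && patternCursor < m) {
            if (text[textCursor] == pattern[patternCursor]) {
                textCursor++;                      // characters match: advance both
                patternCursor++;
            } else if (patternCursor == 0) {
                textCursor++;                      // mismatch at start: shift right
            } else {
                patternCursor = nextIndex[patternCursor];  // text cursor stays put
            }
        }
        return patternCursor == m ? textCursor - patternCursor : -1;
    }
}
```

For the pattern "B A B A B B" this yields the nextIndex table from the earlier example (0, 0, 0, 1, 2, 3), and searching the example text returns position 7.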
  

Analysis:

  • Since we never backtrack in scanning the text: O(n) for the scan.

  • Our implementation of computeNextIndex took: O(m^2) time.
    => Overall time: O(m^2 + n)

  • It is possible to optimize computeNextIndex to take O(m) time.
    => Overall time: O(m + n)
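
One common way to reach O(m) is to reuse the entries already computed, in the same way the search itself reuses nextIndex. The sketch below is one possible formulation under that idea, with hypothetical names; it is not taken from the notes.

```java
final class NextIndexLinearSketch {
    /**
     * nextIndex[i] = length of the longest proper suffix of pattern[0..i-1]
     * that is also a prefix of the pattern, computed in a single pass.
     */
    static int[] computeNextIndex(char[] pattern) {
        int m = pattern.length;
        int[] nextIndex = new int[m];   // nextIndex[0] and nextIndex[1] are 0
        int k = 0;                      // value found for the previous prefix
        for (int i = 2; i < m; i++) {
            // Extend the previous overlap by pattern[i-1]; on failure,
            // fall back to the next shorter overlap via nextIndex itself.
            while (k > 0 && pattern[i - 1] != pattern[k]) {
                k = nextIndex[k];
            }
            if (pattern[i - 1] == pattern[k]) {
                k++;
            }
            nextIndex[i] = k;
        }
        return nextIndex;
    }
}
```

The linear bound follows because k grows by at most one per loop iteration and every pass through the inner while loop shrinks it, so the inner loop runs at most m times in total.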


Pattern Search with Deterministic Finite Automata

The "Evil Bureaucracy" metaphor:

  • Suppose you need to get a document (e.g., passport application) signed by various officials in sequence.
  • The officials:
    OFFICIAL   ABBREVIATION
    Assistant Director   A
    Director   D
    Managing Director   M
    Chief Managing Director   C
  • The paper trail:

  • Interpretation: going from "state" to "state"

    • Here, "A" represents the state of "approval by A".
    • Alternatively, suppose you have the signatures, then "A" checks for the "A" signature.

  • The above "state machine" recognizes the pattern "A D M C".

What is a DFA?

  • DFA = Deterministic Finite Automaton.

  • A DFA is a collection of "states" and "arcs".

  • The arcs sometimes have "labels" on them.

  • A DFA scans an input string and follows labels accordingly.

  • Some DFA "states" are marked as "final states".

  • If the input causes the DFA to go into a final state, the input is "accepted".

Example: consider a DFA to recognize the string "A B C D".

  • If the input string is "A B C D", the DFA goes into the final state.

  • Any other input string causes the DFA to end in a "non-final" state.

Example: consider this DFA

  • The DFA finds the pattern "A B C D" in the text "... (any number of E's) ... A B C D ... (anything else)".

  • Example text: "E E A B C D A E D C"

  • Here, we allow states to "get stuck" (not accept input).

  • Note:
    • States are (typically) numbered 0, 1, ... etc.
    • Input characters are "swallowed" (removed) one-by-one for each arc taken.

In-Class Exercise 5.4: Draw a DFA to find the first occurrence of "A B C D" in a text consisting of the following structure: "... (zero or more E's) ... A ... (zero or more E's) ... B C D ... (anything)"

A DFA to find the first occurrence of "A B C D" in any text:

  • Here, the idea is to let state 5 "eat" non-A characters until the first "A" is found.

  • When an "A" is detected in other states, move to state 1.

Thus, a DFA can be used for the pattern search problem.

Another example: a DFA for the pattern "A B C A B D"

  • Note: only some arcs are shown.

  • Observe the arc from state 5 to state 3 (taken on input "C")
    => because the sub-pattern "A B" is already scanned, so after the "C" the prefix "A B C" has been matched.

The nextState function:

  • A DFA is represented in a program using a "nextState" function (a table).

  • The table has one row for each state, one column for each possible input character.

  • The table entry for row i and column "C" tells you the next state going from state i upon input "C".
    => the arc leading out of i with character "C".

Example:

  • Suppose our character set is {A, B, C, D}.

  • The nextState function for the pattern "A B C A B D" is:
                     INPUT CHAR
    CURRENT STATE    A  B  C  D
    0                1  0  0  0
    1                1  2  0  0
    2                1  0  3  0
    3                4  0  0  0
    4                1  5  0  0
    5                1  0  3  6

  • Sample execution using text: "A A B C A B C A B D D"
    
     A A B C A B C A B D D          Processing 'A': next-state = 1
     ^
    
     A A B C A B C A B D D          Processing 'A': next-state = 1
       ^
    
     A A B C A B C A B D D          Processing 'B': next-state = 2
         ^
    
     A A B C A B C A B D D          Processing 'C': next-state = 3
           ^
    
     A A B C A B C A B D D          Processing 'A': next-state = 4
             ^
    
     A A B C A B C A B D D          Processing 'B': next-state = 5
               ^
    
     A A B C A B C A B D D          Processing 'C': next-state = 3
                 ^
    
     A A B C A B C A B D D          Processing 'A': next-state = 4
                   ^
    
     A A B C A B C A B D D          Processing 'B': next-state = 5
                     ^
    
     A A B C A B C A B D D          Processing 'D': next-state = 6
                       ^
    
    
     A A B C A B C A B D D          Pattern found
             A B C A B D
    

Building the nextState function:

  • Some observations:
    • The nextState function uses the current character in the input.
    • States can be numbered according to the length of the prefix detected so far.
    • Forward arcs (low number to high-numbered state) go along the pattern itself.
    • Reverse arc lengths depend on the length of the prefix detected so far.
    • Key observation: a reverse arc leads back to the state whose number is the length of the longest suffix (of the characters scanned so far) that is a prefix of the pattern.
    • The final state is numbered with the pattern length.

Pseudocode:


Algorithm: DFAPatternSearch (pattern, text)
Input: pattern array of length m, text array of length n
     // First, build the nextState function. 
1.   nextState = computeNextStateFunction (pattern)
     // Now scan input character by character. 
2.   currentState = 0
3.   for textCursor=0 to n-1
4.     character = text[textCursor]
       // Go to next state. 
5.     currentState = nextState[currentState][character]
       // If it's the final state, we're done. 
6.     if currentState = m
7.       return (textCursor - m + 1)
8.     endif
9.   endfor
10.  return -1
Output: position in text if pattern is found, -1 if not found.
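
The scanning loop can be sketched in Java as follows, using the nextState table from the example above as hard-coded data. The names are hypothetical, and two details are assumptions rather than part of the pseudocode: columns are looked up through an alphabet string, and a character outside the alphabet sends the DFA back to state 0.

```java
final class DfaSearchSketch {
    // Table for the pattern "A B C A B D"; columns follow ALPHABET order.
    static final String ALPHABET = "ABCD";
    static final int[][] NEXT_STATE = {
        {1, 0, 0, 0},
        {1, 2, 0, 0},
        {1, 0, 3, 0},
        {4, 0, 0, 0},
        {1, 5, 0, 0},
        {1, 0, 3, 6},
    };

    /** Returns the first position of the pattern in text, or -1 if not found. */
    static int search(int[][] nextState, String alphabet, char[] text) {
        int m = nextState.length;          // one row per non-final state 0..m-1
        int currentState = 0;
        for (int textCursor = 0; textCursor < text.length; textCursor++) {
            int column = alphabet.indexOf(text[textCursor]);
            // Assumption: characters outside the alphabet reset the search.
            currentState = (column < 0) ? 0 : nextState[currentState][column];
            if (currentState == m) {
                return textCursor - m + 1; // final state reached
            }
        }
        return -1;
    }
}
```

With the text "A A B C A B C A B D D", the scan visits states 1, 1, 2, 3, 4, 5, 3, 4, 5, 6 and returns position 4.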
  

Algorithm: computeNextStateFunction (pattern)
Input: pattern array of length m
1.   for state=0 to m-1
2.     for character=smallestChar to largestChar
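         // In this state, the first "state" characters of the pattern have been matched. 
3.       patternPlusChar = concatenate (pattern[0]...pattern[state-1], character)
4.       k = length of longest suffix of patternPlusChar that is a prefix of pattern
5.       nextState[state][character] = k
6.     endfor
7.   endfor
8.   return nextState
Output: the nextState table (representing the nextState function)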
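
A straightforward Java sketch of this table construction is shown below. The names are hypothetical and the alphabet is passed in as a string (an assumption, so that each character maps to a column); for every state and character it takes the matched prefix plus that character and tries suffix lengths from longest to shortest.

```java
final class DfaBuilderSketch {
    /** Builds the nextState table with one row per state 0..m-1. */
    static int[][] computeNextState(char[] pattern, String alphabet) {
        int m = pattern.length;
        String whole = new String(pattern);
        int[][] nextState = new int[m][alphabet.length()];
        for (int state = 0; state < m; state++) {
            for (int c = 0; c < alphabet.length(); c++) {
                // Characters seen so far: the matched prefix plus this character.
                String seen = new String(pattern, 0, state) + alphabet.charAt(c);
                // Longest suffix of "seen" that is a prefix of the pattern.
                int k = seen.length();
                while (k > 0 && !whole.startsWith(seen.substring(seen.length() - k))) {
                    k--;
                }
                nextState[state][c] = k;
            }
        }
        return nextState;
    }
}
```

For the pattern "A B C A B D" over the alphabet {A, B, C, D} this reproduces the table from the earlier example. The sketch favors clarity over speed and makes no attempt to reuse work between states.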
  

Analysis:

  • The nextState function requires O(c m^2) time to compute where c = size of character set.

  • Processing the text requires O(n) time.
    => total time is O(c m^2 + n).

In practice:

  • The DFA method is rarely used because character sets tend to be large
    => nextState computation is significant.

  • However, DFA's are easy to build and have many other applications
    => useful as library code.

  • DFA's can be automatically built using tools like Lex.


Wildcards, Regular Expressions and Non-Deterministic Finite Automata

What does "wildcard" mean?

  • Consider the Unix command to list files, used in the following way:
    
        ls *.java
       

  • Here, the "*" is equivalent to "any character string".
    (Not containing special characters like "*").

  • The pattern *.java is a wildcard expression.

  • In simple text search, a pattern is one string.

  • A wildcard expression can specify a set of strings:
    *.java = {.java, a.java, aa.java, aaa.java, ... ,b.java, ... (infinite) set}

Regular expressions:

  • Regular expressions are a generalization of wildcard expressions.

  • Regular expressions are specified "bottom up" using operators:

    • Terminal: a character is a regular expression.
      Example: "A" is a regular expression.

    • Concatenation: if R1 and R2 are regular expressions, so is R1R2.
      Example: "A" concatenated with "B" gives regular expression "A B".

    • Or: if R1 and R2 are regular expressions, so is R1 | R2.
      • A | B is a regular expression specifying the set {"A", "B"}.
      • Since C is a regular expression, using concatenation, (A | B) C   is a regular expression.
      • (A | B) C specifies the set {"A C", "B C"}.

    • Closure (or Kleene-star): if R1 is a regular expression, R1* specifies any number of repetitions of R1:
      • Note: the * is used as a superscript and (confusingly) is NOT the same as the wildcard "*".
      • The expression A* specifies the set {"", "A", "A A", "A A A", ... (infinite set)}, which includes the empty string.
      • Similarly, (AB)*C = {"C", "A B C", "A B A B C", ... }.

  • The alphabet and expression operators.
    • Underlying both wildcard expressions and regular expressions is an alphabet
      => the characters used as terminals.
    • Operators used for expressions, such as | and * are NOT permitted to be in the alphabet.

  • Expressing wildcards with regular expressions:
    • Suppose our alphabet consists of {A, B, C}
    • Consider the wildcard expression "* A B *".
    • The corresponding regular expression is: (A | B | C)* (A B) (A | B | C)*
    • Intuition: any combination of letters, followed by A B, followed by any combination of letters.

  • Terminology: the following are equivalent
    • A C satisfies expression (A | B) C
    • A C matches expression (A | B) C
    • A C is in the set specified by expression (A | B) C
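
These set memberships can be tried out with Java's built-in java.util.regex package, which uses the same |, * and parenthesis operators. The sketch below is only an illustration with a hypothetical wrapper name; strings are written without the spaces used in the notes for readability.

```java
import java.util.regex.Pattern;

final class RegexSetSketch {
    /** True if the whole string s is in the set specified by regex. */
    static boolean satisfies(String regex, String s) {
        return Pattern.matches(regex, s);
    }
}
```

For example, satisfies("(A|B)C", "AC") and satisfies("(AB)*C", "ABABC") are true, while satisfies("(AB)*C", "ABAB") is false. The wildcard expression "* A B *" over the alphabet {A, B, C} corresponds to satisfies("(A|B|C)*(AB)(A|B|C)*", s).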

In-Class Exercise 5.5: Construct the regular expression for the set {"D A B", "D A C", "D A B E", "D A C E", "D A B E E", "D A C E E",... }

The pattern-search problem we want to solve:

  • Given a regular expression, e.g., (A | B) (A B)* C, and a text, find the first occurrence of a substring in the text that satisfies the regular expression.
    (i.e., is in the set specified by the expression).
    • Example: regular expression (A | B) (A B)* C, and text "D B B A A B C D A".
    • The substring "A A B C" in "D B B A A B C D A" satisfies the expression.

  • First, we'll solve a simpler problem: Given a regular expression and a string (pattern), does the string satisfy the expression?

  • We'll use a Non-Deterministic Finite Automaton (NDFA) to recognize strings that match a regular expression.
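
For comparison, the substring-search problem stated above can be tried with Java's library matcher before building anything by hand. This is a small sketch with a hypothetical method name; it says nothing about how the library works internally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexFindSketch {
    /**
     * Returns {start, end} of the first substring of text that satisfies
     * regex (end is exclusive), or null if there is none.
     */
    static int[] firstMatch(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        return matcher.find() ? new int[] { matcher.start(), matcher.end() } : null;
    }
}
```

For the expression (A | B) (A B)* C and the text "D B B A A B C D A" (written without spaces), firstMatch("(A|B)(AB)*C", "DBBAABCDA") reports start 3 and end 7, which is the substring "A A B C".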

What is an NDFA?

  • First, recall that our DFA examples were built to recognize a single pattern (string).

  • We can construct a DFA to recognize strings from a set:
    • Consider the set {"A B C", "A D E"}
    • This DFA recognizes the set:

    • Note: there are multiple final states.

  • Consider a DFA (separately) for each string in {"A B C", "A D E"}


    (States are numbered differently, to avoid confusion).

  • Now, combine them into an NDFA:

    • The NDFA has "special states" that allow non-deterministic jumps to other states.
    • A jump from a "special state" does NOT swallow an input character.
    • The NDFA can pick any outgoing arc from a "special state".
    • The non-special states are called "regular" states.
    • Regular states must swallow input.
    • For a particular input character, there is only one state to go to from a regular state.

  • What does "choice" really mean?
    • Think of the NDFA faced with a choice of two arcs (in a special state).
    • The NDFA makes a clone of itself.
    • The original follows one arc.
    • The clone follows the second arc.
    • Thus, for multiple arcs, there are multiple clones.
      => a family of NDFA's operate on input (in parallel).
    • If any one of the NDFA's reaches a final state, the input string is "recognized".
    Example:

Constructing an NDFA from a regular expression:

  • Recall original goal: a recognizer for regular expressions.

  • The NDFA will be constructed recursively using combining rules.

  • Rules:
    • Terminal: An NDFA for a single character, e.g., for "A"

      • Special states are used as start and end states.
      • A regular state is used to recognize the character.
    • Concatenation:

      • Suppose we have NDFA's for regular expressions R1 and R2
      • We want the NDFA for R1 R2
      • The NDFA's are combined in sequence by merging the final state of R1 and the start state of R2.
    • Or:

      • Suppose we have NDFA's for regular expressions R1 and R2
      • We want the NDFA for R1 | R2
      • The NDFA's are combined in "parallel" by introducing new special states.
    • Closure:

      • Suppose we have an NDFA for regular expression R1.
      • We want the NDFA for R1*.
      • Add new start and final states, and new arcs.

  • Optimizations:
    • The above combination rules will result in "bloated" NDFA's.
    • Several optimizations can be applied, for example:
      • Special states in sequence:

      • Removing the special start state for a single-character.

In-Class Exercise 5.6: Construct an NDFA for the regular expression "C (A | B) (A B)*". First, use the rules to construct an unoptimized version, then try to optimize it (reduce the number of states).

How to "run" an NDFA on input:

  • Track all possible states in clones simultaneously.
  • Use a deque
    => a double-ended queue.
  • When processing a "special state", place all possible next states in the front of the deque.
  • When processing a regular state, consume the input character, and place the next state at the end of the deque.
  • Keep processing until either:
    • A final state is reached
      => accept input string.
    • All input is consumed without reaching a final state
      => string does not satisfy regular expression.
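
The "track all clones" idea can be sketched in Java as follows. Note the swap: instead of the deque described above, this sketch keeps a boolean set of active states, which is a different but equivalent way to follow every clone at once. The encoding is hypothetical: label[s] == 0 marks a special state, any other value is the character a regular state swallows, and arcs[s] lists the states reachable from s. It answers the simpler whole-string question, so it accepts only if the final state is active once all input is consumed.

```java
final class NdfaSketch {
    /** Marks s and everything reachable from it through special states. */
    static void follow(int s, char[] label, int[][] arcs, boolean[] marked) {
        if (marked[s]) return;
        marked[s] = true;
        if (label[s] == 0) {                 // special state: free jumps
            for (int t : arcs[s]) follow(t, label, arcs, marked);
        }
    }

    /** True if the whole input drives some clone into the final state. */
    static boolean accepts(char[] label, int[][] arcs, int start, int fin, String input) {
        boolean[] active = new boolean[label.length];
        follow(start, label, arcs, active);
        for (char c : input.toCharArray()) {
            boolean[] next = new boolean[label.length];
            for (int s = 0; s < label.length; s++) {
                // Only regular states swallow input, and only their own character.
                if (active[s] && label[s] != 0 && label[s] == c) {
                    for (int t : arcs[s]) follow(t, label, arcs, next);
                }
            }
            active = next;                   // clones that got stuck disappear
        }
        return active[fin];
    }
}
```

As a usage example, a hand-built NDFA for (A | B) C has label = {0, 'A', 'B', 'C', 0} and arcs = {{1, 2}, {3}, {3}, {4}, {}} with start state 0 and final state 4; it accepts "AC" and "BC" and rejects "CC".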

In-Class Exercise 5.7: Write pseudocode describing a recursive algorithm to see if a given string (such as PatternSearch.java) matches a given wildcard pattern (such as Patt*nSearch.*).


Web Search Engines

In this section, we'll give an overview of how search engines work.

Three parts to search engines:

  • Crawling:
    • The process of scouring the web looking for new documents.
    • Usually, by retrieving a page and then following links in it.
    • Typically non-recursive, with priorities.
      => keep separate list of "well-known" sites.
    • Crawling is independent of other processes.
    • Easy to parallelize.

  • Indexing:
    • Key idea: build an inverted list over document base.
    • Also collect/compute information used in relevance-ranking.

  • Query handling:
    • Present a web-page for users with query fields, e.g., Google's simple homepage.
    • Extract query and parse it.
    • Compute/extract results (links to documents).
    • Sort results by relevance or other factors.
    • Display (send back to browser).

In-Class Exercise 5.8: Look up the "page rank" algorithm and describe how it works. What are the disadvantages of using that algorithm?

Indexing via an example:

  • Consider these two documents:

  • An inverted list:
    • Simply a collection of all "words" with pointers to documents that contain them:

  • Key ideas in processing a document:
    • Skip "stop words" such as "the, an, a,...".
    • Maintain positions of important words.
    • Add data to inverted list.
    • Other info: statistical data
      (e.g., number of occurrences in document).
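
The processing steps above can be sketched in Java as follows. This is a toy illustration, not how a production search engine stores its index: the class name, the stop-word list, and the choice of nested maps are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class InvertedIndexSketch {
    // Illustrative stop-word list only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "an", "a", "of", "is", "in"));

    // word -> (document id -> positions of the word in that document)
    final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    /** Splits the content into words and records each non-stop word's position. */
    void addDocument(int docId, String content) {
        String[] words = content.toLowerCase().split("[^a-z0-9]+");
        for (int position = 0; position < words.length; position++) {
            String word = words[position];
            if (word.isEmpty() || STOP_WORDS.contains(word)) continue;
            postings.computeIfAbsent(word, w -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position);
        }
    }

    /** Ids of all documents that contain the word. */
    Set<Integer> documentsContaining(String word) {
        Map<Integer, List<Integer>> entry = postings.get(word.toLowerCase());
        return entry == null ? Collections.<Integer>emptySet() : entry.keySet();
    }
}
```

The number of occurrences of a word in a document is then the size of its position list, which is one example of the statistical data mentioned above.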

Query-handling:

  • Example: "Algorithm"
    • Find all documents containing word.
    • Rank according to relevance.
      => rank Document 97 higher than Document 145 (because "Algorithm" is in the title).
  • Example: "Binary Search Tree".
    • Find all documents containing the words.
    • Identify those containing all three words.
    • Rank according to relevance.
      => rank Document 145 higher than Document 97 (because the words occur in sequence).
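
The multi-word query can be sketched in Java as follows, assuming an inverted list that maps each word to the documents containing it and to the word's positions in each document. All names are hypothetical, and the "in sequence" test is only one simple relevance signal among the many a real engine would combine.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class QuerySketch {
    /** Helper for building a small inverted list by hand. */
    static void post(Map<String, Map<Integer, List<Integer>>> postings,
                     String word, int docId, int... positions) {
        List<Integer> list = postings.computeIfAbsent(word, w -> new HashMap<>())
                                     .computeIfAbsent(docId, d -> new ArrayList<>());
        for (int p : positions) list.add(p);
    }

    /** Documents that contain every query word (intersection of the lists). */
    static Set<Integer> containingAll(Map<String, Map<Integer, List<Integer>>> postings,
                                      List<String> words) {
        Set<Integer> result = null;
        for (String word : words) {
            Map<Integer, List<Integer>> entry = postings.get(word);
            Set<Integer> docs = (entry == null) ? Collections.<Integer>emptySet() : entry.keySet();
            if (result == null) result = new TreeSet<>(docs);
            else result.retainAll(docs);
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    /** True if the words occur consecutively, in order, somewhere in the document. */
    static boolean occursInSequence(Map<String, Map<Integer, List<Integer>>> postings,
                                    List<String> words, int docId) {
        if (words.isEmpty()) return false;
        for (int start : positionsOf(postings, words.get(0), docId)) {
            boolean inOrder = true;
            for (int i = 1; i < words.size() && inOrder; i++) {
                inOrder = positionsOf(postings, words.get(i), docId).contains(start + i);
            }
            if (inOrder) return true;
        }
        return false;
    }

    private static List<Integer> positionsOf(Map<String, Map<Integer, List<Integer>>> postings,
                                             String word, int docId) {
        Map<Integer, List<Integer>> entry = postings.get(word);
        if (entry == null || !entry.containsKey(docId)) return Collections.<Integer>emptyList();
        return entry.get(docId);
    }
}
```

With made-up positions in which "binary", "search" and "tree" are scattered in Document 97 but adjacent in Document 145, containingAll returns both documents and occursInSequence is true only for Document 145, which is the kind of signal that would rank it higher.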