\( \newcommand{\eqb}[1]{\begin{eqnarray*}#1\end{eqnarray*}} \newcommand{\eqbn}[1]{\begin{eqnarray}#1\end{eqnarray}} \newcommand{\bb}[1]{\mathbf{#1}} \newcommand{\nchoose}[2]{\left(\begin{array}{c} #1 \\ #2 \end{array}\right)} \newcommand{\defn}{\stackrel{\vartriangle}{=}} \newcommand{\eql}{\;\; = \;\;} \definecolor{dkblue}{RGB}{0,0,120} \definecolor{dkred}{RGB}{120,0,0} \definecolor{dkgreen}{RGB}{0,120,0} \newcommand{\bigsp}{\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;} \newcommand{\plss}{\;\;+\;\;} \newcommand{\miss}{\;\;-\;\;} \newcommand{\implies}{\Rightarrow\;\;\;\;\;\;\;\;\;\;\;\;} \)


Module 5: Pattern Search


Module objectives

 

By the end of this module, you will be able to:

 


5.1   Problem Definition

 

Simple description:

  • Given two strings of characters, the "text" and the "pattern", find the first occurrence of the "pattern" (as a substring) in the "text".
  • Terminology:
    • Pattern: string or string-specification you are looking for.
    • Text: sequence of characters in which you are searching.

Example:

  • A pattern "BAC" that is found in the text "ABABACBABABA"
    
     A B A B A C B A B A B A
           B A C
       
  • Pattern "BAD" is not found in text "ABABACBABABA"
 

Practical issues:

  • Type of data:
    • Usually: text (characters).
    • Others: image data, bitstrings.

  • How is the data presented to the algorithm?
    • Usually: two arrays of characters.
    • Other ways: files, database tables.

  • Typical operations supported by the algorithm (Java):
    
    public interface PatternSearchAlgorithm {
    
      public int findFirstOccurrence (char[] pattern, char[] text);
    
      public int[] findAllOccurrences (char[] pattern, char[] text);
    
    }
        

  • Variations:
    • Are multiple searches expected on the same text?
      ⇒ text can be pre-processed.
    • Find all occurrences of pattern.
    • Find largest-prefix of pattern in text (closest-match).
    • Multiple patterns: wildcards, regular expressions.
      Example:
      
            grep in*face *.java 
            
      (Find all lines containing words that match the pattern "in*face" in all .java files).
    • Multiple documents (texts).

  • Applications:
    • Editors.
    • Databases.
    • Document searching.
    • Pattern recognition (images, character recognition).
 

In-Class Exercise 5.1: Download this template and implement the straightforward iterative algorithm for finding a pattern within a text. Both the pattern and text are of type char[] (arrays of characters). You will also need UniformRandom.java.

  • Start by writing pseudocode for two versions: one with for-loops and one with while-loops.
  • Analyse the complexity of your algorithm for a pattern of size m and a text of length n.
  • Implement the algorithm in the template.
  • Write pseudocode for an alternative recursive version. What is its running time in terms of m and n?
 


5.2   Using Signatures: The Rabin-Karp Algorithm

 

Key ideas:

  • Compute the "signature" of the pattern.
  • Compute the "signature" of each substring of the text.
  • Scan text until signature matches
    ⇒ potential match, so perform string comparison.

 

Details:

  • Recall how a typical signature is computed:


    ⇒ signature involves all characters
    ⇒ computing signature for each substring takes O(m) time.
    ⇒ no faster than naive search.

  • Observation: two successive substrings differ by only two characters.


    ⇒ next signature can be computed quickly (O(1)).

  • Note: simple addition of ASCII values is a weak signature (many different substrings produce the same sum).
    ⇒ use a different signature function.

  • Let's put this idea into pseudocode:
     
    
    Algorithm: SignaturePatternSearch (pattern, text)
    Input: pattern array of length m, text array of length n
    
    1.   patternHash = computePatternSignature (pattern);
2.   textHash = compute signature of text[0]...text[m-1]
    3.   lastPossiblePosition = n - m               // No match possible after this position: 
    4.   textCursor = 0
    5.   while textCursor <= lastPossiblePosition
    6.     if textHash = patternHash                // Potential match. 
    7.       if match (pattern, text, textCursor)
    8.         return textCursor                    // Match found 
    9.       endif
    10.    endif
    11.    textCursor = textCursor + 1
    12.    textHash = compute signature of text[textCursor],...,text[textCursor+m-1]
    13.  endwhile
    14.  return -1
    Output: position in text if pattern is found, -1 if not found.
      

  • Rabin and Karp used a different signature function to lower the chances of a false positive.
    • Let m be the pattern length.
    • Define the signature of the m-length substring starting at text[i] to be:

      f(text[i],...,text[i+m-1])
          = text[i]*d^(m-1) + text[i+1]*d^(m-2) + ... + text[i+m-1]*d^0

    • Then, for the next substring:

      f(text[i+1],...,text[i+m])
          = text[i+1]*d^(m-1) + text[i+2]*d^(m-2) + ... + text[i+m]*d^0
          = d * (f(text[i],...,text[i+m-1]) - text[i]*d^(m-1)) + text[i+m]

    • Suppose we use the short form: f_i = f(text[i],...,text[i+m-1]).
      Then, f_(i+1) = d * (f_i - text[i]*d^(m-1)) + text[i+m]

  • What is d?
    • d can be any positive integer.
    • Choose d so that d^k can be computed fast
      ⇒ d is a power of 2.
    • Example: d=16 or d=32.

  • What if d^k gets too large?
    ⇒ Overflow can occur.

  • Use modulo arithmetic to avoid overflow:
    • Let q be a large prime number.
    • Choose q so that (d+1)*q is less than the largest integer.
      (In Java: largest integer is Integer.MAX_VALUE).
    • Use this fact: if x mod q = y mod q and a mod q = b mod q, then:
      • (x+a) mod q = (y+b) mod q
      • (x*a) mod q = (y*b) mod q.
    • Essentially, all arithmetic operations can be performed "mod q".
    • Thus, to compute


      f_(i+1) = d * (f_i - text[i]*d^(m-1)) + text[i+m]

      we can instead compute

      • t1 = d^(m-1) mod q
      • t2 = (text[i]*t1) mod q
      • t3 = (d * (f_i - t2)) mod q
      • f_(i+1) = (t3 + text[i+m]) mod q

    • The "mod q" approach guarantees that two equal strings will have the same signature.

  • What if two different substrings have the same signature?
    • Identical signatures do NOT mean identical substrings.
    • A string comparison is still needed if the signatures match.
      ⇒ will happen in only a few cases.
 

Pseudocode:


Algorithm: RabinKarpPatternSearch (pattern, text)
Input: pattern array of length m, text array of length n
     // Compute the signature of the pattern using modulo-arithmetic. 
1.   patternHash = computePatternSignature (pattern);
     // A small optimization: compute d^(m-1) mod q just once. 
2.   multiplier = d^(m-1) mod q;
     // Initialize the current text signature to first substring of length m. 
3.   textHash = compute signature of text[0]...text[m-1]
     // No match possible after this position: 
4.   lastPossiblePosition = n - m
     // Start scan. 
5.   textCursor = 0
6.   while textCursor <= lastPossiblePosition
7.     if textHash = patternHash
          // Potential match. 
8.       if match (pattern, text, textCursor)
           // Match found 
9.         return textCursor
10.      endif
         // Different strings with same signature, so continue search.        
11.    endif
12.    textCursor = textCursor + 1
       // Use O(1) computation to compute next signature: this uses 
       // the multiplier as well as d and q. 
13.    textHash = compute signature of text[textCursor],...,text[textCursor+m-1]
14.  endwhile
15.  return -1
Output: position in text if pattern is found, -1 if not found.
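
To make the arithmetic above concrete, here is a minimal Java sketch of the algorithm. The choices d = 32 and q = 1000000007 (a large prime), the use of long arithmetic to avoid overflow, and the method names are illustrative assumptions, not part of the notes.

    public class RabinKarpSearch {

        static final long Q = 1_000_000_007L;   // a large prime (assumed choice)
        static final long D = 32;               // radix: a power of two (assumed choice)

        // Signature of s[start..start+m-1], i.e. s[start]*D^(m-1) + ... + s[start+m-1]*D^0, mod Q.
        static long signature(char[] s, int start, int m) {
            long h = 0;
            for (int i = 0; i < m; i++) {
                h = (h * D + s[start + i]) % Q;
            }
            return h;
        }

        // Returns the index of the first occurrence of pattern in text, or -1.
        public static int findFirstOccurrence(char[] pattern, char[] text) {
            int m = pattern.length, n = text.length;
            if (m == 0 || m > n) return (m == 0 ? 0 : -1);

            // multiplier = D^(m-1) mod Q, computed once.
            long multiplier = 1;
            for (int i = 1; i < m; i++) {
                multiplier = (multiplier * D) % Q;
            }

            long patternHash = signature(pattern, 0, m);
            long textHash = signature(text, 0, m);

            for (int textCursor = 0; textCursor <= n - m; textCursor++) {
                // Signatures match => possible occurrence; confirm with a direct comparison.
                if (textHash == patternHash && match(pattern, text, textCursor)) {
                    return textCursor;
                }
                // Rolling update: f_(i+1) = D * (f_i - text[i]*D^(m-1)) + text[i+m], all mod Q.
                if (textCursor < n - m) {
                    long t2 = (text[textCursor] * multiplier) % Q;
                    textHash = (D * ((textHash - t2 + Q) % Q) + text[textCursor + m]) % Q;
                }
            }
            return -1;
        }

        static boolean match(char[] pattern, char[] text, int start) {
            for (int i = 0; i < pattern.length; i++) {
                if (pattern[i] != text[start + i]) return false;
            }
            return true;
        }
    }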
  
 

Example:


 A B A B A C B A C A C A B A          TextHash=2199650 patternHash=2231457
 B A C A

 A B A B A C B A C A C A B A          TextHash=2231425 patternHash=2231457
   B A C A

 A B A B A C B A C A C A B A          TextHash=2199651 patternHash=2231457
     B A C A

 A B A B A C B A C A C A B A          TextHash=2231458 patternHash=2231457
       B A C A

 A B A B A C B A C A C A B A          TextHash=2200705 patternHash=2231457
         B A C A

 A B A B A C B A C A C A B A          TextHash=2265187 patternHash=2231457
           B A C A

 A B A B A C B A C A C A B A          TextHash=2231457 patternHash=2231457
             B A C A

 A B A B A C B A C A C A B A          Pattern found
             B A C A
 

Analysis:

  • The pattern signature is computed once: O(m).

  • The first text signature takes O(m) time.

  • Each subsequent signature takes O(1) time
    ⇒ O(m+n) time overall.

  • However, in theory a "false" match could occur every time.
    ⇒ comparison required at every text position.
    ⇒ O(mn) time overall.

  • In practice, this is unlikely: O(m+n) time on the average.

  • Recall: the naive algorithm takes O(mn) time.

  • How different is O(mn) from O(m+n)?
    If m = n/2 then mn = O(n^2) but m+n = O(n).

In practice:

  • Signatures take time to compute
    ⇒ naive search often performs better.
  • Signature approach not worth it for small m (pattern length).
  • The signature technique is very general
    ⇒ can apply to 2D patterns, other kinds of data.
 

In-Class Exercise 5.2: Download this template and implement the Rabin-Karp algorithm but with a different signature function. Use the simple signature of adding up the ASCII codes in the string (shown in the second figure in this section).
 


5.3   The Knuth-Morris-Pratt Algorithm

 

Key ideas:

  • The particular sequence of characters in the pattern can be exploited.

  • For example, consider
    
     B A B C A B A B A B A B B A C A A B          
               B A B A B B
    
     B A B C A B A B A B A B B A C A A B          
                   B A B A B B
    
    
    • In the first line, we know that five characters matched.
      ⇒ partial match.
    • This means that the text contained "B A B A B".
    • Thus, the "B A B" in the first part of the pattern could be matched against the second "B A B" in the text.
    • We could pre-compute all such possibilities.
    • This will allow "skipping" parts of the text.

  • Observations about skipping:
    • In the above example, we can skip two places to the right.
    • Note: we don't have to compare the "B A B"
      ⇒ Can start comparison later in the pattern
      ⇒ In the above example: start comparing at the "A" in "B A B A".
      ⇒ No backing up required!
    • Thus, the text-cursor always moves right.
      ⇒ we only decide where in the pattern to start comparing.
      ⇒ need to pre-compute pattern-shifts in case of partial matches.

  • We will build a function called nextIndex that will take the mismatch position (in the pattern), and return the "next pattern position to start comparing".
 

Example:

  • Before going into detail, consider an example.

  • Text = "B A B C A B A B A B A B B A C A A B"
    Pattern = "B A B A B B"

  • Suppose the nextIndex function turns out to be: (we'll learn how to compute it later)
        Mismatch position     nextIndex
        0                     0
        1                     0
        2                     0
        3                     1
        4                     2
        5                     3

  • Let's apply this nextIndex function to the text:
    
     B A B C A B A B A B A B B A C A A B          Matched up to pattern position 0
     B A B A B B
    
     B A B C A B A B A B A B B A C A A B          Matched up to pattern position 1
     B A B A B B
    
     B A B C A B A B A B A B B A C A A B          Matched up to pattern position 2
     B A B A B B
    
     B A B C A B A B A B A B B A C A A B          Mismatch at pattern position 3 nextIndex=1
     B A B A B B
    
     B A B C A B A B A B A B B A C A A B          Mismatch at pattern position 1 nextIndex=0
         B A B A B B
    
     B A B C A B A B A B A B B A C A A B          Mismatch at start of pattern: shifting right
           B A B A B B
    
     B A B C A B A B A B A B B A C A A B          Mismatch at start of pattern: shifting right
             B A B A B B
    
     B A B C A B A B A B A B B A C A A B          Matched up to pattern position 0
               B A B A B B
    
     B A B C A B A B A B A B B A C A A B          Matched up to pattern position 1
               B A B A B B
    
     B A B C A B A B A B A B B A C A A B          Matched up to pattern position 2
               B A B A B B
    
     B A B C A B A B A B A B B A C A A B          Matched up to pattern position 3
               B A B A B B
    
     B A B C A B A B A B A B B A C A A B          Matched up to pattern position 4
               B A B A B B
    
     B A B C A B A B A B A B B A C A A B          Mismatch at pattern position 5 nextIndex=3
               B A B A B B
    
     B A B C A B A B A B A B B A C A A B          Matched up to pattern position 3
                   B A B A B B
    
     B A B C A B A B A B A B B A C A A B          Matched up to pattern position 4
                   B A B A B B
    
     B A B C A B A B A B A B B A C A A B          Matched up to pattern position 5
                   B A B A B B
    
     B A B C A B A B A B A B B A C A A B          Pattern found
                   B A B A B B
    
 

Computing the nextIndex function:

  • The hard part is understanding how to compute the nextIndex function.

  • At first, it may seem that having a high-degree of overlap is good:
    
     E A A A A E A A A A C A A A A
       A A A A F
    
     E A A A A E A A A A C A A A A
         A A A A F
    
    Here, the overall "shift" was only one space.

  • Now consider:
    
     E A B C D E A A A A C A A A A
       A B C D F
    
     E A B C D E A A A A C A A A A
               A B C D F
    
    In this case: the overall shift was "four spaces" (much better).

  • What made the difference?
    • If the pattern has very little internal repetition, the shift will be larger.
    • If the pattern has a lot of repetition, the shift will generally be smaller.
 

In-Class Exercise 5.3: Consider this partial match:


 E A B C A B C A B A B C E A A A A C A A A A
   A B C A B C A F
How many letters can we skip? What about this case (below)?

 E A B C B B C A C A B C E A A A A C A A A A
   A B C B B C A F
 

  • What is the intuition? Consider this example:
    
     E A B A B E A A A A C A A A A
       A B A B F
    
     E A B A B E A A A A C A A A A
           A B A B F
    
    The amount shifted depends on the length of the largest prefix of the pattern that is also a suffix of the matched text.

  • Consider the first example again:
    
     E A A A A E A A A A C A A A A
       A A A A F
    
    • The A A A prefix of the pattern is a suffix of the matched text A A A A
    • That's why we must consider the possibility of matching it:
      
       E A A A A E A A A A C A A A A
           A A A A F
       

  • Second example:
    
     E A B A B E A A A A C A A A A
       A B A B F
    
    • The A B prefix of the pattern is a suffix of the matched text A B A B.

  • Thus, intuitively, the amount we can slide forward depends on the prefix-suffix match:
    • The longer the match, the less we can slide forward.
 

The relation with nextIndex:

  • Consider the first example:
    
     E A A A A E A A A A C A A A A          Mismatch at pattern position 4 nextIndex=3
      
    • Here, we slide forward one space.
    • But four characters were matched.
      ⇒ need to start comparisons at position 3 (fourth character) in pattern.
      ⇒ nextIndex = 3.
      
         A A A A F
       E A A A A E A A A A C A A A A
           A A A A F
        
  • Consider the second example:
    
     E A B A B E A A A A C A A A A          Mismatch at pattern position 4 nextIndex=2
       
    • In this example, we slide forward two spaces.
    • Since two characters match, the comparisons start at position 2.
      ⇒ nextIndex=2.
      
         A B A B F
       E A B A B E A A A A C A A A A
             A B A B F
         

  • How to define nextIndex:
    • Notice: nextIndex is the length of the largest prefix of the pattern that is also a (proper) suffix of the matched-text.
    • Alternatively: the length of the largest proper suffix of the matched-text that is a prefix of the pattern.
    • But: the matched text is also in the pattern!
      ⇒ can pre-compute nextIndex using only the pattern.

  • Formal definition:
    • Suppose the mismatch occurs at the k-th character (in the pattern)
      ⇒ the first k-1 characters of the pattern have matched.
    • Call this (k-1)-character substring the matched part.
    • Let nextIndex = the length of the largest proper suffix of the matched part that is also a prefix of the pattern.

Pseudocode:


Algorithm: KnuthMorrisPrattPatternSearch (pattern, text)
Input: pattern array of length m, text array of length n
     // Compute the nextIndex function, stored in an array: 
1.   nextIndex = computeNextIndex (pattern)
     // Start the pattern search. 
2.   textCursor = 0
3.   patternCursor = 0
4.   while textCursor < n and patternCursor < m
5.     if text[textCursor] = pattern[patternCursor]
         // Advance cursors as long as characters match. 
6.       textCursor = textCursor + 1
7.       patternCursor = patternCursor + 1
8.     else
         // Mismatch. 
9.       if patternCursor = 0
           // If the mismatch occurred at first position, simply 
           // advance the pattern one space. 
10.        textCursor = textCursor + 1
11.      else
           // Otherwise, use the nextIndex function. The textCursor 
           // already points to the next place to compare in the text. 
12.        patternCursor = nextIndex[patternCursor]
13.      endif
14.    endif
15.  endwhile
16.  if patternCursor = m
       // If there was a match, return the first position in the match. 
17.    return (textCursor - patternCursor)
18.  else
19.    return -1
Output: position in text if pattern is found, -1 if not found.
  

Algorithm: computeNextIndex (pattern)
Input: a pattern array of length m
1.   nextIndex[0] = 0
2.   for i=1 to m-1
3.     nextIndex[i] = length of the largest proper suffix of pattern[0]...pattern[i-1]
                      that is also a prefix of the pattern.
4.   endfor
5.   return nextIndex
Output: the nextIndex array (representing the nextIndex function)
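
Below is a direct, unoptimized Java transcription of the two pseudocode blocks above; the class and method names are only illustrative.

    public class KnuthMorrisPrattSearch {

        // nextIndex[i] = length of the largest proper suffix of pattern[0..i-1]
        // that is also a prefix of the pattern (straightforward, unoptimized version).
        static int[] computeNextIndex(char[] pattern) {
            int m = pattern.length;
            int[] nextIndex = new int[m];
            for (int i = 1; i < m; i++) {
                // Try candidate lengths from the largest (i-1) down to 0.
                for (int len = i - 1; len >= 0; len--) {
                    boolean ok = true;
                    for (int j = 0; j < len; j++) {
                        if (pattern[j] != pattern[i - len + j]) { ok = false; break; }
                    }
                    if (ok) { nextIndex[i] = len; break; }
                }
            }
            return nextIndex;
        }

        // Returns the index of the first occurrence of pattern in text, or -1.
        public static int findFirstOccurrence(char[] pattern, char[] text) {
            int m = pattern.length, n = text.length;
            if (m == 0) return 0;
            int[] nextIndex = computeNextIndex(pattern);
            int textCursor = 0, patternCursor = 0;
            while (textCursor < n && patternCursor < m) {
                if (text[textCursor] == pattern[patternCursor]) {
                    textCursor++;            // advance both cursors on a match
                    patternCursor++;
                } else if (patternCursor == 0) {
                    textCursor++;            // mismatch at pattern start: slide pattern by one
                } else {
                    patternCursor = nextIndex[patternCursor];   // reuse the partial match
                }
            }
            return (patternCursor == m) ? textCursor - patternCursor : -1;
        }
    }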
  

Analysis:

  • Since we never backtrack in scanning the text: O(n) for the scan.

  • Our implementation of computeNextIndex took: O(m^2) time.
    ⇒ Overall time: O(m^2 + n)

  • It is possible to optimize computeNextIndex to take O(m) time.
    ⇒ Overall time: O(m + n)
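
For completeness, here is a sketch of the O(m) computation of nextIndex mentioned above, using the standard failure-function idea (the classical technique, written here against the same nextIndex definition used in these notes).

    // Computes the same nextIndex array in O(m) time.
    static int[] computeNextIndexLinear(char[] pattern) {
        int m = pattern.length;
        int[] nextIndex = new int[m];
        int k = 0;   // longest proper suffix of pattern[0..i-1] that is also a prefix
        for (int i = 1; i < m; i++) {
            nextIndex[i] = k;
            // Update k so that it describes pattern[0..i] for the next iteration:
            // shrink the current prefix-suffix until it can be extended by pattern[i].
            while (k > 0 && pattern[i] != pattern[k]) {
                k = nextIndex[k];
            }
            if (pattern[i] == pattern[k]) {
                k++;
            }
        }
        return nextIndex;
    }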
 


5.4   The Boyer-Moore-Horspool Algorithm

 

The problems with previous algorithms:

  • KMP runs in time O(m + n), but is complicated and is inefficient for small patterns.
  • KMP does not exploit the case where pattern characters occur only rarely in the text (which would allow large skips).

The Boyer-Moore-Horspool algorithm:

  • It's a heuristic: runs in worst-case O(mn) time.
  • But is simple and effective in practice.

Key ideas:

  • Start comparisons from the right end of the pattern.

  • When a mismatch occurs, shift the pattern to the right.
  • Ideal case: the mismatch character in the text does not occur in the pattern at all
          ⇒ the entire pattern can be shifted past the mismatch position.
  • Less ideal case: the mismatch character does occur in the pattern
          ⇒ shift the pattern so that its rightmost occurrence (to the left of the mismatch position) lines up with the mismatch position in the text.

  • The rule at each step is:
    • Start comparing from the right end of the pattern.
    • Identify the mismatch char in the text, and its position.
    • Find the rightmost occurrence of this char in the pattern that is to the left of the mismatch position.
    • Move the pattern rightwards to align these two chars.

The worst-case running time is still O(mn), but the algorithm is easy to implement and fast in practice. A sketch appears below.
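
A minimal Java sketch of this heuristic. It uses Horspool's usual simplification, where the shift is always decided by the text character aligned with the last pattern position (a slight variation of the rule stated above); the table size and names are illustrative.

    import java.util.Arrays;

    public class HorspoolSearch {

        // Returns the index of the first occurrence of pattern in text, or -1.
        public static int findFirstOccurrence(char[] pattern, char[] text) {
            int m = pattern.length, n = text.length;
            if (m == 0 || m > n) return (m == 0 ? 0 : -1);

            // skip[c] = how far to slide the pattern when character c of the text
            // is aligned with the last pattern position and a mismatch occurred.
            int[] skip = new int[Character.MAX_VALUE + 1];
            Arrays.fill(skip, m);                       // chars not in pattern: slide past them
            for (int i = 0; i < m - 1; i++) {
                skip[pattern[i]] = m - 1 - i;           // rightmost occurrence wins
            }

            int pos = 0;                                // left end of the current alignment
            while (pos <= n - m) {
                int j = m - 1;
                while (j >= 0 && pattern[j] == text[pos + j]) {
                    j--;                                // compare right to left
                }
                if (j < 0) {
                    return pos;                         // all m characters matched
                }
                pos += skip[text[pos + m - 1]];         // slide using the skip table
            }
            return -1;
        }
    }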
 


5.5   Pattern Search with Deterministic Finite Automata

 

The "Evil Bureaucracy" metaphor:

  • Suppose you need to get a document (e.g., passport application) signed by various officials in sequence.

  • The officials:
    OFFICIAL                     ABBREVIATION
    Assistant Director           A
    Director                     D
    Managing Director            M
    Chief Managing Director      C

  • The paper trail:

  • Interpretation: going from "state" to "state"


         ⇒ Here, "A" represents the state of "approval by A".

  • Consider this alternative scenario:
    • Suppose your form needs to have one item satisfied for each bureaucrat.
    • It is each bureaucrat's job to check that their item is on the form.
    • Thus, "A" checks for the "A" item, "B" looks for the "B" item, and so on.

  • Then, the above "state machine" recognizes the pattern of items "A D M C".
 

What is a DFA?

  • DFA = Deterministic Finite Automaton.

  • A DFA is a collection of "states" and "arcs".

  • The arcs have "labels" on them.

  • A label on an arc is a "rule" that describes when one state leads to another.

  • A DFA scans an input string and follows labels accordingly.

  • Each input character causes a jump from the current "state" to the next "state"
          ⇒ Which can sometimes result in jumping from a state to itself

  • Some DFA "states" are marked as "final states".

  • If the input causes the DFA to go into a final state, the input is "accepted".
 

Example: consider a DFA to recognize the string "A B C D".

  • If the input string is "A B C D", the DFA goes into the final state.
    • Initially we are in state 0.
    • The first letter, "A", causes a jump to state 1.
    • The next letter, "B", causes a jump from state 1 to state 2.
    • ... and so on.

  • Note: there is an implicit "cursor" or position in the input string as we progress one character at a time.

  • Any other input string causes the DFA to end in a "non-final" state.
 

Two key questions about DFAs:

  • Given a DFA, what is the set of strings that will be "accepted" by the DFA (i.e., will reach a final state)?

  • Given a set of strings, can a DFA be constructed to "accept" only that set?
 

Example: consider this DFA

  • The DFA finds the pattern "A B C D" when it is fed a text of the form
          "... (any number of E's) ... A B C D ... (anything else)".

  • Example text: "E E A B C D A E D C"

  • Here, we allow states to "get stuck" (not accept input).

  • Note:
    • States are (typically) numbered 0, 1, ... etc.
    • Input characters are "swallowed" (removed) one-by-one for each arc taken.
 

In-Class Exercise 5.4: Draw a DFA to find the first occurrence of "A B C D" in a text consisting of the following structure: "... (zero or more E's) ... A ... (zero or more E's) ... B C D ... (anything)". Write the pseudocode equivalent of the DFA. That is, what does a DFA implementation in code (pseudocode) look like?
 

A DFA to find the first occurrence of "A B C D" in any text:

  • Here, the idea is to let state 5 "eat" non-A characters until the first "A" is found.

  • When an "A" is detected in other states, move to state 1.
Thus, a DFA can be used for the pattern search problem.
 

Another example: a DFA for the pattern "A B C A B D"

  • Note: only some arcs are shown.

  • Observe the arc from state 5 to state 3 (on input "C")
    ⇒ after it, the most recently scanned characters are "A B C", a prefix of length 3.
 

The nextState function:

  • A DFA is represented in a program using a "nextState" function (a table).

  • The table has one row for each state, one column for each possible input character.

  • The table entry for row i and column "C" tells you the next state going from state i upon input "C".
    ⇒ the arc leading out of i with character "C".
 

Example:

  • Suppose our character set is {A, B, C, D}.

  • The nextState function for the pattern "A B C A B D" is:
    Input char
    Current
    state
    A B C D
    0 1 0 0 0
    1 1 2 0 0
    2 1 0 3 0
    3 4 0 0 0
    4 1 5 0 0
    5 1 0 3 6

  • Sample execution using text: "A A B C A B C A B D D"
    
     Text:  A A B C A B C A B D D

     Processing 'A' (position 0):  next-state = 1
     Processing 'A' (position 1):  next-state = 1
     Processing 'B' (position 2):  next-state = 2
     Processing 'C' (position 3):  next-state = 3
     Processing 'A' (position 4):  next-state = 4
     Processing 'B' (position 5):  next-state = 5
     Processing 'C' (position 6):  next-state = 3
     Processing 'A' (position 7):  next-state = 4
     Processing 'B' (position 8):  next-state = 5
     Processing 'D' (position 9):  next-state = 6    (final state)

     A A B C A B C A B D D          Pattern found
             A B C A B D
    
 

Building the nextState function:

  • Some observations:
    • The nextState function uses the current character in the input.
    • States can be numbered according to the length of the prefix detected so far.
    • Forward arcs (from a state to the next-higher-numbered state) follow the pattern itself.
    • Reverse arcs depend on the prefix detected so far.
    • Key observation: a reverse arc on a character c goes back to the state whose number is the length of the largest suffix of (detected prefix + c) that is also a prefix of the pattern.
    • The final state is numbered with the pattern length.
 

Pseudocode:


Algorithm: DFAPatternSearch (pattern, text)
Input: pattern array of length m, text array of length n
     // First, build the nextState function. 
1.   nextState = computeNextStateFunction (pattern)
     // Now scan input character by character. 
2.   currentState = 0
3.   for textCursor=0 to n-1
4.     character = text[textCursor]
       // Go to next state. 
5.     currentState = nextState[currentState][character]
       // If it's the final state, we're done. 
6.     if currentState = m
7.       return (textCursor - m + 1)
8.     endif
9.   endfor
10.  return -1
Output: position in text if pattern is found, -1 if not found.
  

Algorithm: computeNextStateFunction (pattern)
Input: pattern array of length m
1.   for state=0 to m-1
2.     for character=smallestChar to largestChar
3.       prefixPlusChar = concatenate (pattern[0]...pattern[state-1], character)
4.       k = length of longest suffix of prefixPlusChar that is a prefix of pattern
5.       nextState[state][character] = k
6.     endfor
7.   endfor
8.   return nextState
Output: the nextState table (the nextState function)
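
For concreteness, here is a Java sketch of both pieces. Indexing the table directly by character code (size 256 here) is an assumption made for illustration; a real implementation would map the alphabet to a compact range first.

    public class DFAPatternSearch {

        static final int ALPHABET_SIZE = 256;   // illustrative choice for 8-bit characters

        // Build the nextState table: one row per state 0..m-1, one column per character.
        static int[][] computeNextState(char[] pattern) {
            int m = pattern.length;
            int[][] nextState = new int[m][ALPHABET_SIZE];
            for (int state = 0; state < m; state++) {
                for (char c = 0; c < ALPHABET_SIZE; c++) {
                    // prefixPlusChar = pattern[0..state-1] followed by c.
                    // Find the longest suffix of it that is a prefix of the pattern.
                    int k = state + 1;
                    while (k > 0 && !isSuffix(pattern, state, c, k)) {
                        k--;
                    }
                    nextState[state][c] = k;
                }
            }
            return nextState;
        }

        // Does pattern[0..k-1] equal the last k characters of (pattern[0..state-1] + c)?
        static boolean isSuffix(char[] pattern, int state, char c, int k) {
            if (pattern[k - 1] != c) return false;
            for (int j = 0; j < k - 1; j++) {
                if (pattern[j] != pattern[state - (k - 1) + j]) return false;
            }
            return true;
        }

        // Scan the text one character at a time; state m is the final state.
        public static int findFirstOccurrence(char[] pattern, char[] text) {
            int m = pattern.length;
            if (m == 0) return 0;
            int[][] nextState = computeNextState(pattern);
            int state = 0;
            for (int textCursor = 0; textCursor < text.length; textCursor++) {
                char c = text[textCursor];
                // Characters outside the table cannot be in the pattern: back to state 0.
                state = (c < ALPHABET_SIZE) ? nextState[state][c] : 0;
                if (state == m) {
                    return textCursor - m + 1;
                }
            }
            return -1;
        }
    }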
  
 

Analysis:

  • The nextState function requires O(cm^2) time to compute, where c = size of the character set.

  • Processing the text requires O(n) time.
    ⇒ total time is O(cm^2 + n).

In practice:

  • The DFA method is rarely used because character sets tend to be large
    ⇒ nextState computation is significant.

  • However, DFA's are easy to build and have many other applications
    ⇒ useful as library code.

  • DFA's can be automatically built using tools like Lex.
 


5.6   Wildcards, Regular Expressions and Non-Deterministic Finite Automata

 

What does "wildcard" mean?

  • Consider the Unix command to list files, used in the following way:
    
        ls *.java
       

  • Here, the "*" is equivalent to "any character string".
    (Not containing special characters like "*").

  • The pattern *.java is a wildcard expression.

  • In simple text search, a pattern is one string.

  • A wildcard expression can specify a set of strings:
    *.java = {.java, a.java, aa.java, aaa.java, ... ,b.java, ... (infinite) set}
 

Regular expressions:

  • Regular expressions are a generalization of wildcard expressions.

  • Think of a regular expression as a "way to describe a set of strings that satisfy a property".

  • Regular expressions are specified recursively using operators:

    • Terminal: a character is a regular expression.
      Example: "A" is a regular expression.

    • Concatenation: if R1 and R2 are regular expressions, so is R1R2.
      Example: "A" concatenated with "B" gives regular expression "A B".

    • Or: if R1 and R2 are regular expressions, so is R1 | R2.
      • A | B is a regular expression specifying the set {"A", "B"}.
      • Since C is a regular expression, using concatenation, (A | B) C   is a regular expression.
      • (A | B) C specifies the set {"A C", "B C"}.

    • Closure (or Kleene-star): if R1 is a regular expression, R1* specifies any number of repetitions of R1:
      • Note: the * is used as a superscript and (confusingly) is NOT the same as the wildcard "*".
      • The expression A* specifies the infinite set {"" (the empty string), "A", "A A", "A A A", ... }.
      • Similarly, (AB)*C = {"C", "A B C", "A B A B C", ... }.

  • The alphabet and expression operators.
    • Underlying both wildcard expressions and regular expressions is an alphabet
      ⇒ the characters used as terminals.
    • Operators used in expressions, such as | and *, are NOT permitted to be in the alphabet.

  • Expressing wildcards with regular expressions:
    • Suppose our alphabet consists of {A, B, C}
    • Consider the wildcard expression "* A B *".
    • The corresponding regular expression is: (A | B | C)* (A B) (A | B | C)*
    • Intuition: any combination of letters, followed by A B, followed by any combination of letters.

  • Terminology: the following are equivalent
    • A C satisfies expression (A | B) C
    • A C matches expression (A | B) C
    • A C is in the set specified by expression (A | B) C
 

In-Class Exercise 5.5: Construct the regular expression for the set {"D A B", "D A C", "D A B E", "D A C E", "D A B E E", "D A C E E",... }
 

The pattern-search problem we want to solve:

  • Given a regular expression, e.g., (A | B) (A B)* C, and a text, find the first occurrence of a substring in the text that satisfies the regular expression.
    (i.e., is in the set specified by the expression).
    • Example: regular expression (A | B) (A B)* C, and text "D B B A A B C D A".
    • The substring "A A B C" in "D B B A A B C D A" satisfies in the expression.

  • First, we'll solve a simpler problem: Given a regular expression and a string (pattern), does the string satisfy the expression?

  • We'll use a Non-Deterministic Finite Automaton (NDFA) to recognize strings that match a regular expression.
 

What is an NDFA?

  • First, recall that our DFA examples were built to recognize a single pattern (string).

  • We can construct a DFA to recognize strings from a set:
    • Consider the set {"A B C", "A D E"}
    • This DFA recognizes the set:

    • Note: there are multiple final states.

  • Consider a DFA (separately) for each string in {"A B C", "A D E"}


    (States are numbered differently, to avoid confusion).

  • Now, combine them into an NDFA:

    • The NDFA has "special states" that allow non-deterministic jumps to other states.
    • A jump from a "special state" does NOT swallow an input character.
    • The NDFA can pick any outgoing arc from a "special state".
    • The non-special states are called "regular" states.
    • Regular states must swallow input.
    • For a particular input character, there is only one state to go to from a regular state.

  • What does "choice" really mean?
    • Think of the NDFA faced with a choice of two arcs (in a special state).
    • The NDFA makes a clone of itself.
    • The original follows one arc.
    • The clone follows the second arc.
    • Thus, for multiple arcs, there are multiple clones.
      ⇒ a family of NDFA's operate on input (in parallel).
    • If any one of the NDFA's reaches a final state, the input string is "recognized".
    • Think of each clone as a "thread" running on its own.
 
Example:

 

Constructing an NDFA from a regular expression:

  • Recall original goal: a recognizer for regular expressions.

  • The NDFA will be constructed recursively using combining rules.

  • Rules:
    • Terminal: An NDFA for a single character, e.g., for "A"

      • Special states are used as start and end states.
      • A regular state is used to recognize the character.
    • Concatenation:

      • Suppose we have NDFA's for regular expressions R1 and R2
      • We want the NDFA for R1 R2
      • The NDFA's are combined in sequence by merging the final state of R1 and the start state of R2.
    • Or:

      • Suppose we have NDFA's for regular expressions R1 and R2
      • We want the NDFA for R1 | R2
      • The NDFA's are combined in "parallel" by introducing new special states.
    • Closure:

      • Suppose we have an NDFA for regular expression R1.
      • We want the NDFA for R1*.
      • Add new start and final states, and new arcs (arrows).

  • Optimizations:
    • The above combination rules will result in "bloated" NDFA's.
    • Several optimizations can be applied, for example:
      • Special states in sequence:

      • Removing the special start state for a single-character.

 

In-Class Exercise 5.6: Construct an NDFA for the regular expression "C (A | B) (A B)*". First, use the rules to construct an unoptimized version, then try to optimize it (reduce the number of states).
 

Two ways to "run" an NDFA on input:

  1. Use threads:
    • For each "clone", fire off a thread that starts the new clone with the appropriate "choice" made (one thread per choice).
    • If any thread reaches a final state, the string is accepted.
    • If any thread gets stuck, it stops.

  2. Use a single thread and a deque (a double-ended queue).
 

Using a deque:

  • Track all possible states in clones simultaneously.

  • When processing a "special state", place all possible next states in the front of the deque.

  • When processing a regular state, consume the input character, and place the next state at the end of the deque.

  • Keep processing until either:
    • A final state is reached
      ⇒ accept input string.
    • All input is consumed
      ⇒ the string does not satisfy the regular expression.
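
The sketch below illustrates the same "track every clone" idea in Java, but keeps a set of current states instead of a deque (an equivalent piece of bookkeeping). The NDFA representation used here — each state with epsilon (special-state) successors and at most one character transition — is an assumption made purely for illustration.

    import java.util.HashSet;
    import java.util.Set;

    public class NdfaSimulator {

        // Illustrative NDFA representation (builder must fill every array entry).
        static class Ndfa {
            int startState;
            Set<Integer> finalStates = new HashSet<>();
            int[][] epsilon;        // epsilon[i] = states reachable from i without input
            char[] label;           // label[i] = character consumed by regular state i
            int[] labelledNext;     // successor after consuming label[i], or -1 if none
        }

        // Add state s and everything reachable from s by epsilon jumps alone.
        static void addWithClosure(Ndfa a, int s, Set<Integer> states) {
            if (!states.add(s)) return;
            for (int t : a.epsilon[s]) {
                addWithClosure(a, t, states);
            }
        }

        // Does the NDFA accept the given input string?
        public static boolean accepts(Ndfa a, char[] input) {
            Set<Integer> current = new HashSet<>();
            addWithClosure(a, a.startState, current);
            for (char c : input) {
                Set<Integer> next = new HashSet<>();
                for (int s : current) {
                    // A regular state swallows its character and moves on.
                    if (a.labelledNext[s] >= 0 && a.label[s] == c) {
                        addWithClosure(a, a.labelledNext[s], next);
                    }
                }
                current = next;                      // all surviving clones
                if (current.isEmpty()) return false; // every clone got stuck
            }
            for (int s : current) {
                if (a.finalStates.contains(s)) return true;
            }
            return false;
        }
    }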
 

In-Class Exercise 5.7: Write pseudocode describing a recursive algorithm to see if a given string (such as PatternSearch.java) matches a given wildcard pattern (such as Patt*nSearch.*).
 


5.7   Web Search Engines

 

In this section, we'll give an overview of how search engines work.

Three parts to search engines:

  • Crawling:
    • The process of scouring the web looking for new documents.
    • Usually, by retrieving a page and then following links in it.
    • Typically non-recursive, with priorities.
      ⇒ keep separate list of "well-known" sites.
    • Crawling is independent of other processes.
    • Easy to parallelize.

  • Indexing
    • Key idea: build an inverted list over document base.
    • Also collect/compute information used in relevance-ranking.

  • Query handling:
    • Present a web-page for users with query fields, e.g., Google's simple homepage.
    • Extract query and parse it.
    • Compute/extract results (links to documents).
    • Sort results by relevance or other factors.
    • Display (send back to browser).
 

In-Class Exercise 5.8: Look up the "page rank" algorithm and describe how it works. What are the disadvantages of using that algorithm?
 

Indexing via an example:

  • Consider these two documents:

  • An inverted list:
    • Simply a collection of all "words" with pointers to documents that contain them:

  • Key ideas in processing a document:
    • Skip "stop words" such as "the, an, a,...".
    • Maintain positions of important words.
    • Add data to inverted list.
    • Other info: statistical data
      (e.g., number of occurrences in document).
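
A toy Java sketch of an inverted list built with a hash map; the stop-word list, the tokenization, and the all-words query are deliberately simplified assumptions.

    import java.util.*;

    public class InvertedIndex {

        // word -> set of document ids containing that word
        private final Map<String, Set<Integer>> index = new HashMap<>();
        private static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "of", "and");

        // Add one document to the index.
        public void addDocument(int docId, String contents) {
            for (String token : contents.toLowerCase().split("\\W+")) {
                if (token.isEmpty() || STOP_WORDS.contains(token)) continue;   // skip stop words
                index.computeIfAbsent(token, k -> new HashSet<>()).add(docId);
            }
        }

        // Return documents containing all the query words (no ranking here).
        public Set<Integer> findAll(String query) {
            Set<Integer> result = null;
            for (String token : query.toLowerCase().split("\\W+")) {
                Set<Integer> docs = index.getOrDefault(token, Set.of());
                if (result == null) result = new HashSet<>(docs);
                else result.retainAll(docs);                                   // intersect lists
            }
            return result == null ? Set.of() : result;
        }
    }

Ranking would then reorder the resulting document set using the positional and statistical information mentioned above.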
 

Query-handling:

  • Example: "Algorithm"
    • Find all documents containing word.
    • Rank according to relevance.
      ⇒ rank Document 97 higher than Document 145 (because "Algorithm" is in the title).
  • Example: "Binary Search Tree".
    • Find all documents containing the words.
    • Identify those containing all three words.
    • Rank according to relevance.
      ⇒ rank Document 145 higher than Document 97 (because the words occur in sequence).



© 2001, Rahul Simha