By the end of this module, you will be able to:
Simple description:
Example:
A B A B A C B A B A B A B A C
Practical issues:
public interface PatternSearchAlgorithm { public int findFirstOccurrence (char[] pattern, char[] text); public int[] findAllOccurrences (char[] pattern, char[] text); }
grep in*face *.java(Find all lines containing words that match the pattern "in*face" in all .java files).
In-Class Exercise 5.1:
Download this template
and implement the straightforward iterative algorithm for
finding a pattern within a text.
Both the pattern and text are of type char[] (arrays of characters).
You will also need UniformRandom.java.
Key ideas:
Details:
⇒ signature involves all characters
⇒ computing signature for each substring takes O(m) time.
⇒ no faster than naive search.
⇒ next signature can be computed quickly (O(1)).
Algorithm: SignaturePatternSearch (pattern, text) Input: pattern array of length m, text array of length n 1. patternHash = computePatternSignature (pattern); 2. texthash = compute signature of text[0]...text[m-1] 3. lastPossiblePosition = n - m // No match possible after this position: 4. textCursor = 0 5. while textCursor <= lastPossiblePosition 6. if textHash = patternHash // Potential match. 7. if match (pattern, text, textCursor) 8. return textCursor // Match found 9. endif 10. endif 11. textCursor = textCursor + 1 12. textHash = compute signature of text[textCursor],...,text[textCursor+m-1] 13. endwhile 14. return -1 Output: position in text if pattern is found, -1 if not found.
fi+1 = d * (fi -
text[i]*dm-1) + text[i+m]
we can instead compute
Pseudocode:
Algorithm: RabinKarpPatternSearch (pattern, text) Input: pattern array of length m, text array of length n // Compute the signature of the pattern using modulo-arithmetic. 1. patternHash = computePatternSignature (pattern); // A small optimization: compute dm-1 mod q just once. 2. multiplier = dm-1 mod q; // Initialize the current text signature to first substring of length m. 3. texthash = compute signature of text[0]...text[m-1] // No match possible after this position: 4. lastPossiblePosition = n - m // Start scan. 5. textCursor = 0 6. while textCursor <= lastPossiblePosition 7. if textHash = patternHash // Potential match. 8. if match (pattern, text, textCursor) // Match found 9. return textCursor 10. endif // Different strings with same signature, so continue search. 11. endif 12. textCursor = textCursor + 1 // Use O(1) computation to compute next signature: this uses // the multiplier as well as d and q. 13. textHash = compute signature of text[textCursor],...,text[textCursor+m-1] 14. endwhile 15. return -1 Output: position in text if pattern is found, -1 if not found.
Example:
A B A B A C B A C A C A B A TextHash=2199650 patternHash=2231457 B A C A A B A B A C B A C A C A B A TextHash=2231425 patternHash=2231457 B A C A A B A B A C B A C A C A B A TextHash=2199651 patternHash=2231457 B A C A A B A B A C B A C A C A B A TextHash=2231458 patternHash=2231457 B A C A A B A B A C B A C A C A B A TextHash=2200705 patternHash=2231457 B A C A A B A B A C B A C A C A B A TextHash=2265187 patternHash=2231457 B A C A A B A B A C B A C A C A B A TextHash=2231457 patternHash=2231457 B A C A A B A B A C B A C A C A B A Pattern found B A C A
Analysis:
In practice:
In-Class Exercise 5.2: Download this template and implement the Rabin-Karp algorithm but with a different signature function. Use the simple signature of adding up the ascii codes in the string (shown in the second figure in this section).
Key ideas:
B A B C A B A B A B A B B A C A A B B A B A B B B A B C A B A B A B A B B A C A A B B A B A B B
Example:
Mismatch Position | nextIndex | ||
0 | 0 | ||
1 | 0 | ||
2 | 0 | ||
3 | 1 | (Infer: match up to 0) | |
4 | 2 | (Infer: match up to 1) | |
5 | 3 | (Infer: match up to 2) |
B A B C A B A B A B A B B A C A A B Matched up to pattern position 0 B A B A B B B A B C A B A B A B A B B A C A A B Matched up to pattern position 1 B A B A B B B A B C A B A B A B A B B A C A A B Matched up to pattern position 2 B A B A B B B A B C A B A B A B A B B A C A A B Mismatch at pattern position 3 nextIndex=1 B A B A B B B A B C A B A B A B A B B A C A A B Mismatch at pattern position 1 nextIndex=0 B A B A B B B A B C A B A B A B A B B A C A A B Mismatch at start of pattern: shifting right B A B A B B B A B C A B A B A B A B B A C A A B Mismatch at start of pattern: shifting right B A B A B B B A B C A B A B A B A B B A C A A B Matched up to pattern position 0 B A B A B B B A B C A B A B A B A B B A C A A B Matched up to pattern position 1 B A B A B B B A B C A B A B A B A B B A C A A B Matched up to pattern position 2 B A B A B B B A B C A B A B A B A B B A C A A B Matched up to pattern position 3 B A B A B B B A B C A B A B A B A B B A C A A B Matched up to pattern position 4 B A B A B B B A B C A B A B A B A B B A C A A B Mismatch at pattern position 5 nextIndex=3 B A B A B B B A B C A B A B A B A B B A C A A B Matched up to pattern position 3 B A B A B B B A B C A B A B A B A B B A C A A B Matched up to pattern position 4 B A B A B B B A B C A B A B A B A B B A C A A B Matched up to pattern position 5 B A B A B B B A B C A B A B A B A B B A C A A B Pattern found B A B A B B
Computing the nextIndex function:
E A A A A E A A A A C A A A A A A A A F E A A A A E A A A A C A A A A A A A A FHere, the overall "shift" was only one space.
E A B C D E A A A A C A A A A A B C D F E A B C D E A A A A C A A A A A B C D FIn this case: the overall shift was "four spaces" (much better).
In-Class Exercise 5.3:
Consider this partial match:
E A B C A B C A B A B C E A A A A C A A A A
A B C A B C A F
How many letters can we skip? What about this case (below)?
E A B C B B C A C A B C E A A A A C A A A A
A B C B B C A F
E A B A B E A A A A C A A A A A B A B F E A B A B E A A A A C A A A A A B A B FThe amount shifted depends on how large a "prefix of the pattern" is a "suffix of the matched-text".
E A A A A E A A A A C A A A A A A A A F
E A A A A E A A A A C A A A A A A A A F
E A B A B E A A A A C A A A A A B A B F
The relation with nextIndex:
E A A A A E A A A A C A A A A Mismatch at pattern position 4 nextIndex=3
A A A A F E A A A A E A A A A C A A A A A A A A F
E A B A B E A A A A C A A A A Mismatch at pattern position 4 nextIndex=2
A B A B F E A B A B E A A A A C A A A A A B A B F
Pseudocode:
Algorithm: KnuthMorrisPrattPatternSearch (pattern, text) Input: pattern array of length m, text array of length n // Compute the nextIndex function, stored in an array: 1. nextIndex = computeNextIndex (pattern) // Start the pattern search. 2. textCursor = 0 3. patternCursor = 0 4. while textCursor < n and patternCursor < m 5. if text[textCursor] = pattern[patternCursor] // Advance cursors as long as characters match. 6. textCursor = textCursor + 1 7. patternCursor = patternCursor + 1 8. else // Mismatch. 9. if patternCursor = 0 // If the mismatch occurred at first position, simply // advance the pattern one space. 10. textCursor = textCursor + 1 11. else // Otherwise, use the nextIndex function. The textCursor // already points to the next place to compare in the text. 12. patternCursor = nextIndex[patternCursor] 13. endif 14. endif 15. endwhile 16. if patternCursor = m // If there was a match, return the first position in the match. 17. return (textCursor - patternCursor) 18. else 19. return -1 Output: position in text if pattern is found, -1 if not found.
Algorithm: computeNextIndex (pattern) Input: a pattern array of length m 1. nextIndex[0] = 0 2. for i=1 to m-1 3. nextIndex[i] = largest suffix of pattern[0]...pattern[i-1] that is a prefix of pattern. 4. endfor 5. return nextIndex Output: the nextIndex array (representing the nextIndex function)
Analysis:
The problems with previous algorithms:
The Boyer-Moore-Horspool algorithm:
Key ideas:
The worst-case running time is still O(mn), but it is easy to implement and practically fast.
The "Evil Bureaucracy" metaphor:
OFFICIAL | ABBREVIATION | ||
Assistant Director | A | ||
Director | D | ||
Managing Director | M | ||
Chief Managing Director | C |
⇒
Here, "A" represents the state of "approval by A".
What is a DFA?
Example: consider a DFA to recognize the string "A B C D".
Two key questions about DFAs:
Example: consider this DFA
In-Class Exercise 5.4: Draw a DFA to find the first occurence of "A B C D" in a text consisting of the following structure: "... (zero or more E's) ... A ... (zero or more E's) ... B C D ... (anything)". Write the pseudocode equivalent of the DFA. That is, what does a DFA implementation in code (pseudocode) look like?
A DFA to find the first occurrence of "A B C D" in any text:
Another example: a DFA for the pattern "A B C A B D"
The nextState function:
Example:
Input char | ||||
Current state | A | B | C | D |
0 | 1 | 0 | 0 | 0 |
1 | 1 | 2 | 0 | 0 |
2 | 1 | 0 | 3 | 0 |
3 | 4 | 0 | 0 | 0 |
4 | 1 | 5 | 0 | 0 |
5 | 1 | 0 | 3 | 6 |
A A A B C A B C A B D Processing 'A': next-state = 1 A A A B C A B C A B D Processing 'A': next-state = 1 A A B B C A B C A B D Processing 'B': next-state = 2 A A B C C A B C A B D Processing 'C': next-state = 3 A A B C A A B C A B D Processing 'A': next-state = 4 A A B C A B B C A B D Processing 'B': next-state = 5 A A B C A B C C A B D Processing 'C': next-state = 3 A A B C A B C A A B D Processing 'A': next-state = 4 A A B C A B C A B B D Processing 'B': next-state = 5 A A B C A B C A B D D Processing 'D': next-state = 6 A A B C A B C A B D D Pattern found A B C A B D
Building the nextState function:
Pseudocode:
Algorithm: DFAPatternSearch (pattern, text) Input: pattern array of length m, text array of length n // First, build the nextState function. 1. nextState = computeNextStateFunction (pattern) // Now scan input character by character. 2. currentState = 0 3. for textCursor=0 to n-1 4. character = text[textCursor] // Go to next state. 5. currentState = nextState[currentState][character] // If it's the final state, we're done. 6. if currentState = m 7. return (textCursor - m + 1) 8. endif 9. endfor 10. return -1 Output: position in text if pattern is found, -1 if not found.
Algorithm: computeNextStateFunction (pattern) Input: pattern array of length m 1. for state=0 to m-1 2. for character=smallestChar to largestChar 3. patternPlusChar = concatenate (pattern, character) 4. k = length of longest suffix of patternPlusChar that is a prefix of pattern 5. nextState[state][character] = k 6. endfor 7. endfor
Analysis:
In practice:
What does "wildcard" mean?
ls *.java
Regular expressions:
In-Class Exercise 5.5: Construct the regular expression for the set {"D A B", "D A C", "D A B E", "D A C E", "D A B E E", "D A C E E",... }
The pattern-search problem we want to solve:
What is an NDFA?
(States are numbered differently, to avoid confusion).
Constructing an NDFA from a regular expression:
In-Class Exercise 5.6: Construct an NDFA for the regular expression "C (A | B) (A B)*". First, use the rules to construct an unoptimized version, then try to optimize it (reduce the number of states).
Two ways to "run" an NDFA on input:
Using a deque:
In-Class Exercise 5.7: This is a recursion-writing exercise. Write pseudocode describing a recursive algorithm to see if a given string (such as PatternSearch.java) matches a given wildcard pattern (such as Patt*nSearch.*). Your algorithm should solve the problem directly, without using DFAs or NDFAs.
In this section, we'll give an overview of how search engines work.
Three parts to search engines:
In-Class Exercise 5.8: Look up the "page rank" algorithm and describe how it works. What are the disadvantages of using that algorithm?
Indexing via an example:
Query-handling: