By the end of this module, you will be able to:
We will first consider two aspects of keys:
Signatures (of keys):
In-Class Exercise 4.1: Compare the two signatures, "first-letter-of-string" and "concatenation-of-ascii-codes", with regard to uniqueness, brevity, efficiency and order-preservation.
Addresses (of keys):
Key ideas:
Two problems:
Details:
Example:
"catch", "throws", "break", "byte", "volatile", "throw", "protected", "this", "do", "switch", "boolean", "char", "package", "true", "extends", "new", "instanceof", "for", "public", "return", "while", "case", "abstract", "false", "void", "synchronized", "implements", "finally", "null", "try", "const", "default", "native", "continue", "super", "import", "class", "final", "transient", "short", "double", "long", "private", "goto", "int", "if", "else", "static", "interface", "float"
Pseudocode:
Algorithm: initialize (numBuckets) Input: desired number of buckets 1. Initialize array of linked lists;
Algorithm: insert (key, value) Input: key-value pair // Compute table entry: 1. entry = key.hashCode() mod numBuckets 2. if table[entry] is null // No list present, so create one 3. table[entry] = new linked list; 4. table[entry].add (key, value) 5. else 6. // Otherwise, add to existing list 7. table[entry].add (key, value) 8. endif
Algorithm: search (key) Input: search-key // Compute table entry: 1. entry = key.hashCode() mod numBuckets 2. if table[entry] is null 3. return null 4. else 5. return table[entry].search (key) 6. endif Output: value, if found; null otherwise.
About signatures:
a: 1 ("abstract") b: 3 ("boolean", "break", "byte") c: 6 ("case", "catch", "char", "class", "const", "continue") d: 3 ("default", "do", "double") e: 2 ("else", "extends") f: 5 ("false", "final", "finally", "float", "for") g: 1 ("goto") h: 0 i: 6 ("if", "implements", "import", "instanceof", "int", "interface") j: 0 k: 0 l: 1 ("long") m: 0 n: 3 ("native", "new", "null") o: 0 p: 4 ("package", "private", "protected", "public") q: 0 r: 1 ("return") s: 5 ("short", "static", "super", "switch", "synchronized") t: 6 ("this", "throw", "throws", "transient", "true", "try") u: 0 v: 2 ("void", "volatile") w: 1 ("while") x: 0 y: 0 z: 0
table entry 0: 3 table entry 1: 0 table entry 2: 3 table entry 3: 3 table entry 4: 0 table entry 5: 1 table entry 6: 1 table entry 7: 1 table entry 8: 4 table entry 9: 0 table entry 10: 0 table entry 11: 2 table entry 12: 0 table entry 13: 2 table entry 14: 1 table entry 15: 1 table entry 16: 0 table entry 17: 0 table entry 18: 2 table entry 19: 2 table entry 20: 2 table entry 21: 4 table entry 22: 1 table entry 23: 0 table entry 24: 0 table entry 25: 0(A more even, but not perfect distribution).
int hashValue = 0; for (int i=0; i < length; i++) { // Note: 31 = 2**5-1. The string characters are assumed to be // stored in charArray[]. hashValue = 31*hashValue + charArray[i]; } return hashValue;
In-Class Exercise 4.2:
Download HashAnalysis.java
and the file words.txt (a dictionary).
Print out the distribution of words (only the number) over
buckets for each of the following signatures:
Hint: you do not have to implement a hashtable to compute these numbers;
you only need to compute the signature, and record statistics.
Sizing a table:
Analysis (assuming few keys per bucket):
Variations:
In-Class Exercise 4.3: Suppose you had to use a hash table for a collection of floating-point numbers: x1, x2,...,xn. Design a hashing function to hash to the signatures 0,...,m (which are integers).
For a collection of points, (x1,y1), (x2,y2), ..., (xn,yn), consider the following queries:
One approach:
2D hashing:
p1: | (-1.5, 6.25) |
p2: | (-0.75, 6.1) |
p3: | (-0.1, 6.33) |
p4: | (-0.1, 7.1) |
p5: | (2.1, 6.48) |
p6: | (2.25, 6.8) |
Point | Coordinates | 2D hash-value |
p1 | (-1.5, 6.25) | (0,0) |
p2 | (-0.75, 6.1) | (0,0) |
p3 | (-0.1, 6.33) | (1,1) |
p4 | (-0.1, 7.1) | (1,4) |
p5 | (2.1, 6.48) | (4,1) |
p6 | (2.25, 6.8) | (4,3) |
In-Class Exercise 4.4: Describe how to efficiently compute the hash value for a point p=(x,y), given m and the bounding rectangle.
⇒ closest point overall: p3 = (-0.1, 6.33).
In-Class Exercise 4.5: The 2D hashtable above is a 2D-geometric analog of the regular 1D hashtable (that we saw earlier for strings). Design a tree-like data structure for 2D point data that is analogous to the binary trees we used for strings.
Consider combinations of "structure" and whether or not signatures are used:
Key ideas:
Key ideas:
Insertion:
Key | Signature | |
A | 1000001 | |
B | 0100001 | |
C | 1100001 | |
D | 0010001 | |
E | 1010001 | |
F | 0110001 |
Algorithm: insert (key, value) Input: key-value pair 1. if trie is empty 2. root = create new root with key-value pair; 3. return 4. endif // Start numbering the bits from 0. 5. recursiveInsert (root, key, value, 0)
Algorithm: recursiveInsert (node, key, value, bitNum) Input: trie node, key-value pair, which bit we are using now // Compare with node key to see if it's a duplicate. 1. if node.key = key 2. Handle duplicates; 3. return 4. endif // Otherwise, examine the bitNum-th bit 5. if key.getBit (bitNum) = 0 // Go left if possible, or insert. 6. if node.left is null 7. node.left = new trie node with key-value; 8. else // Note: at next level we'll need to examine the next bit. 9. recursiveInsert (node.left, key, value, bitNum+1) 10. endif 11. else // Same thing on the right 12. if node.right is null 13. node.right = new trie node with key-value; 14. else 15. recursiveInsert (node.right, key, value, bitNum+1) 16. endif 17. endif
Search:
Algorithm: search (key) Input: search-key 1. node = recursiveSearch (root, key, 0) 2. if node is null 3. return null 4. else 5. return node.value 6. endif Output: value, if key is found
Algorithm: recursiveSearch (node, key, bitNum) Input: trie node, search-key, which bit to examine // Compare with key in node. 1. if node.key = key // Found. 2. return node 3. endif // Otherwise, navigate further. 4. if key.getBit (bitNum) = 0 5. if node.left is null // Not found => search ends. 6. return null 7. else // Search left. 8. return recursiveSearch (node.left, key, bitNum+1) 9. endif 10. else 11. if node.right is null // Not found => search ends. 12. return null 13. else // Search right. 14. return recursiveSearch (node.right, key, bitNum+1) 15. endif 16. endif Output: trie node if found, else null.
In-Class Exercise 4.6: Use the "ascii code of letter" signature to insert "A B C D E F" into a simple trie. Show all your steps. Ascii codes are available here. Notice what happens and explain why we used "reverse ascii" earlier.
Analysis:
In-Class Exercise 4.7: Why is this true? That is, why is it that n keys need at most log2(n) bits to represent the keys?
Key ideas:
Insertion:
Algorithm: initialize (maxBits) Input: maximum number of bits 1. Store maximum number of bits to use;
Algorithm: insert (key, value) Input: key-value pair 1. if trie is empty 2. root = create empty internal node; // Start bit-numbering at 0 and create path to leaf: 3. extendBranch (root, key, value, 0) 4. return 5. endif 6. recursiveInsert (root, key, value, 0)
Algorithm: extendBranch (node, key, value, bitNum) Input: internal trie node, key-value pair, bit number 1. Create path of internal nodes from level bitNum to maxBits-1; 2. if key.getBit (maxBits) = 0 3. create left leaf at end of path; 4. else 5. create right leaf at end of path; 6. endif
Algorithm: recursiveInsert (node, key, value, bitNum) Input: trie node, key-value pair, bit number 1. if key.getBit (bitNum) = 0 // Check left side. 2. if node.left is null // Grow a branch of internal nodes and append leaf. 3. extendBranch (node, key, value, bitNum) 4. else 5. recursiveInsert (node.left, key, value, bitNum+1) 6. endif 7. else // Check right side. 8. if node.right is null // Grow a branch of internal nodes and append leaf. 9. extendBranch (node, key, value, bitNum) 10. else 11. recursiveInsert (node.right, key, value, bitNum+1) 12. endif 13. endif
Search:
Sort-order scan:
Analysis:
In-Class Exercise 4.8: Why is it that the full trie wastes O(n) storage? In other words, explain why the number of internal nodes is O(n).
Key ideas:
Insertion:
Algorithm: insert (key, value) Input: key-value pair 1. if trie is empty 2. root = new root node with key-value; 3. return 4. endif // Start with level 0 (bit number 0): 5. recursiveInsert (root, key, value, 0)
Algorithm: recursiveInsert (node, key, value, bitNum) Input: trie node, key-value pair, bit number 1. if node contains a key // Need to create a path to distinguish key and node.key 2. extendSmallestBranch (node, node.key, node.value, key, value, bitNum) 3. return 4. endif // Otherwise, node is an empty interior node for navigation only. 5. if key.getBit (bitNum) = 0 // Check left. 6. if node.left is null 7. node.left = new node containing key-value; 8. else 9. recursiveInsert (node.left, key, value, bitNum+1) 10. endif 11. else // Check right. 12. if node.right is null 13. node.right = new node containing key-value; 14. else 15. recursiveInsert (node.right, key, value, bitNum+1) 16. endif 17. endif
Algorithm: extendSmallestBranch (node, key1, value1, key2, value2, bitNum) Input: node from which to build branch, two key-value pairs, bit number. // Examine bits from bitNum to maxBits. // As long as the bits are equal in the two keys, extend branch. // When the bits differ, stop and create children with key1 and key2.
In-Class Exercise 4.9: Use the "reverse ascii code of first letter" signature to insert "A B C D E F" into a compressed trie. Show all your steps. Ascii codes are available here.
Motivation:
Solutions (using Binary Search Tree as data structure):
A Java programming trick for handling duplicates: (source file)
import java.util.*; public class Duplicate { // Any Map can be used, e.g., TreeMap. static Map originalDataStructure = new HashMap(); // Insertion. static void insert (Object key, Object value) { // 1. Attempt a direct insertion. Object oldValue = originalDataStructure.put (key, value); // 2. If the value wasn't already there, nothing to be done. if (oldValue == null) { // No duplicates. return; } // 3. Otherwise, check if duplicates already exist. if (oldValue instanceof HashSet) { // 3.1 There are, so add the new one. HashSet duplicates = (HashSet) oldValue; duplicates.add (value); } else { // 3.2 This is the first duplicate => create a HashSet to store duplicates. HashSet duplicates = new HashSet (); // 3.3 Place old value and duplicate in hashset. duplicates.add (oldValue); duplicates.add (value); // 3.4 Add the hashset itself as the value. originalDataStructure.put (key, duplicates); } } // Enumerate and print all values. static void printAll () { Collection values = originalDataStructure.values(); for (Iterator i=values.iterator(); i.hasNext(); ) { Object obj = i.next(); // If the value is a HashSet, we have duplicates. if (obj instanceof HashSet) { // Extract the different values. HashSet duplicates = (HashSet) obj; for (Iterator j=duplicates.iterator(); j.hasNext(); ) { System.out.println (j.next()); } } else { // If it's not a HashSet, the extract obj is a value. System.out.println (obj); } } } public static void main (String[] argv) { // Add data with duplicates. insert ("Ali", "Addled Ali"); insert ("Bill", "Bewildered Bill"); insert ("Bill", "Befuddled Bill"); // Duplicate. insert ("Chen", "Cantankerous Chen"); insert ("Dave", "Discombobulated Dave"); insert ("Dave", "Distracted Dave"); // Duplicate. insert ("Dave", "Disconcerted Dave"); // Duplicate. insert ("Ella", "Egregious Ella"); insert ("Franco", "Flummoxed Franco"); insert ("Gita", "Grumpy Gita"); insert ("Gita", "Grouchy Gita"); // Duplicate. // Print. printAll(); } }