Module 4: Hashing and Tries


Introduction: Signatures and Addresses

 

We will first consider two aspects of keys:

 

Signatures:

 

Addressing:


Hashing

 

Key ideas:

 

Two problems:

 

Details:

Example:

 

Pseudocode:

 

About signatures:

In-Class Exercise 4.1: Download this template and the file words (a dictionary). Print out the distribution of words (only the number) over buckets for each of the following keys:

Hint: you do not have to implement hashing to compute these numbers; you only need to compute the hash function.
 

Sizing a table:

 

Analysis (assuming few keys per bucket):

 

Variations:

In-Class Exercise 4.2: Suppose you had to use a hash table for a collection of floating-point numbers: x1, x2,...,xn. Design a hashing function to hash to the range 0,...,m.


Geometric Hashing

 

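For a collection of points, (x1,y1), (x2,y2), ..., (xn,yn), consider the following queries:

  • Equality search: is a given point (x,y) in the collection?
  • Nearest-point query: which point in the collection is closest to a given query point q?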

One approach:

  • Suppose points are represented by (x,y) values (reals).
  • Use a hash function on x-coordinate of each point.
  • Can work very well for equality search (first query).
  • But nearest-point query?
 

2D hashing:

  • We'll use this example to illustrate:
    • Data:
      p1: (-1.5, 6.25)
      p2: (-0.75, 6.1)
      p3: (-0.1, 6.33)
      p4: (-0.1, 7.1)
      p5: (2.1, 6.48)
      p6: (2.25, 6.8)
    • Query point: q = (0.76, 6.32).

  • Find minimum x and y values:
    xmin = min (x1,...,xn) = -1.5
    ymin = min (y1,...,yn) = 6.1

  • Find maximum x and y values:
    xmax = max (x1,...,xn) = 2.25
    ymax = max (y1,...,yn) = 7.1

  • Then, the rectangle with corners (xmin, ymin), (xmax, ymax) is a bounding rectangle.

  • Divide the bounding rectangle into an m x m grid with m^2 cells.
    Example with m = 8.

  • Give coordinates to each cell, e.g. when m = 5:

  • Each point lies in some cell: use the coordinates as the 2D-hash value.

    Point   Coordinates     2D hash-value
    p1      (-1.5, 6.25)    (0,0)
    p2      (-0.75, 6.1)    (0,0)
    p3      (-0.1, 6.33)    (1,1)
    p4      (-0.1, 7.1)     (1,4)
    p5      (2.1, 6.48)     (4,1)
    p6      (2.25, 6.8)     (4,3)

  • Insert all the points using 2D hash-values:
    • Create a 2D table of linked lists (2D hash table).
    • Insert each point in appropriate linked list:

  • For nearest-point query:
    • Hash query point q.
      => cell (3, 1)

    • Work outwards in a spiral to find nearest non-empty cell.

    • Find closest point in non-empty cell: p5
    • Mark radius around q with distance to current closest point p5:

    • Search all cells that overlap circle:


      => closest point overall: p3 = (-0.1, 6.33).
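The 2D hashing steps above can be sketched in Java roughly as follows. This is a minimal sketch, not the course's implementation: the class name GridHash and its methods are hypothetical, points are plain double[]{x, y} arrays, and the bounding rectangle is assumed to have nonzero width and height. Instead of a true spiral followed by a circle test, the sketch scans square "rings" of cells around the query cell and stops once no farther ring can contain a closer point. Note that a point lying exactly on a cell boundary (such as p2) can land in the neighboring cell under the floor convention used here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class: a 2D hash table over an m x m grid of cells.
class GridHash {
    final int m;
    final double xmin, ymin, cellW, cellH;
    final List<double[]>[][] cells;          // 2D table of point lists

    @SuppressWarnings("unchecked")
    GridHash(double[][] points, int m) {
        this.m = m;
        double x0 = Double.POSITIVE_INFINITY, y0 = Double.POSITIVE_INFINITY;
        double x1 = Double.NEGATIVE_INFINITY, y1 = Double.NEGATIVE_INFINITY;
        for (double[] p : points) {          // bounding rectangle
            x0 = Math.min(x0, p[0]);  x1 = Math.max(x1, p[0]);
            y0 = Math.min(y0, p[1]);  y1 = Math.max(y1, p[1]);
        }
        xmin = x0;  ymin = y0;
        cellW = (x1 - x0) / m;               // assumed > 0
        cellH = (y1 - y0) / m;               // assumed > 0
        cells = new List[m][m];
        for (int i = 0; i < m; i++)
            for (int j = 0; j < m; j++)
                cells[i][j] = new ArrayList<>();
        for (double[] p : points) {          // insert using 2D hash-values
            int[] c = cell(p[0], p[1]);
            cells[c[0]][c[1]].add(p);
        }
    }

    // 2D hash-value of (x, y); clamped so edge and outside points map to border cells.
    int[] cell(double x, double y) {
        int cx = (int) Math.floor((x - xmin) / cellW);
        int cy = (int) Math.floor((y - ymin) / cellH);
        return new int[] { clamp(cx), clamp(cy) };
    }

    private int clamp(int c) { return Math.max(0, Math.min(m - 1, c)); }

    // Nearest stored point to (qx, qy), or null if the table is empty.
    double[] nearest(double qx, double qy) {
        int[] c = cell(qx, qy);
        double[] best = null;
        double bestDist = Double.POSITIVE_INFINITY;
        double minSide = Math.min(cellW, cellH);
        for (int r = 0; r < m; r++) {
            // Every point in ring r is at least (r-1)*minSide away from q.
            if (bestDist <= (r - 1) * minSide) break;
            for (int i = c[0] - r; i <= c[0] + r; i++) {
                for (int j = c[1] - r; j <= c[1] + r; j++) {
                    if (i < 0 || j < 0 || i >= m || j >= m) continue;
                    // Keep only cells exactly on ring r.
                    if (Math.max(Math.abs(i - c[0]), Math.abs(j - c[1])) != r) continue;
                    for (double[] p : cells[i][j]) {
                        double d = Math.hypot(p[0] - qx, p[1] - qy);
                        if (d < bestDist) { bestDist = d; best = p; }
                    }
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] pts = { {-1.5, 6.25}, {-0.75, 6.1}, {-0.1, 6.33},
                           {-0.1, 7.1}, {2.1, 6.48}, {2.25, 6.8} };
        GridHash g = new GridHash(pts, 5);
        double[] n = g.nearest(0.76, 6.32);
        System.out.println("nearest: (" + n[0] + ", " + n[1] + ")");
    }
}
```

With the example data and m = 5, the query point hashes to cell (3, 1) and the search reports p3, as in the walk-through above.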


Tries

 

Consider combinations of "structure" and whether or not signatures are used:

Key ideas:

  • Note: "trie" rhymes with "try".

  • Use bit-string signature for each key.
    Example: "reverse ascii code of first letter in string"
    Thus, signature("C") = 1100001.

  • At level i, use i-th bit of signature.

  • If bit = 0, go left. Otherwise, go right.

  • Compare with binary tree:

  • Note:
    • A trie is not necessarily in sorted order: an in-order traversal need not visit the keys in sorted order.
    • Different signatures will produce different tries.
    • We did not compare with key in node.

  • Varieties of tries:
    • Simple trie: "store keys in interior nodes".
    • Full trie: "store keys in leaves".
    • Compressed trie: "store keys in leaves and compress paths".
    • Patricia trie: "combination of simple trie and compressed trie".
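As a small illustration of a bit-string signature, here is one way the "reverse ascii code of first letter" signature could be computed in Java. The class and method names (ReverseAsciiSignature, signature, getBit) are hypothetical, and the sketch assumes 7-bit ASCII keys with at least one character.

```java
// Hypothetical helper: "reverse ascii code of first letter" as a bit string.
class ReverseAsciiSignature {

    // 7-bit ASCII code of the first character, lowest-order bit first.
    static String signature(String key) {
        int code = key.charAt(0) & 0x7F;
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 7; i++) {
            sb.append((code >> i) & 1);
        }
        return sb.toString();
    }

    // i-th bit of the signature, numbering bits from 0.
    static int getBit(String key, int i) {
        return signature(key).charAt(i) - '0';
    }

    public static void main(String[] args) {
        for (String k : new String[] {"A", "B", "C", "D", "E", "F"}) {
            System.out.println(k + " " + signature(k));
        }
    }
}
```

For "C" (ASCII 67, binary 1000011) this gives 1100001, matching the example above.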


Simple Tries

 

Key ideas:

  • Note: Simple tries are also called "Digital Search Trees".
  • Insertion: navigate using signature bits until insertion can be made.
  • Search: navigate using signature bits, compare with nodes along the way.
 

Insertion:

  • Example:
    • We will use the signature "reverse ascii code of first letter"
      (The data happens to have only one letter).
    • We will insert the following keys:
      Key Signature
      A 1000001
      B 0100001
      C 1100001
      D 0010001
      E 1010001
      F 0110001
    • Insert "A" (1000001)
      => Empty trie, place in root.
    • Insert "B" (0100001)
      Compare with "A" => not equal, so proceed
      First bit = 0 => go left
      No link => insert

    • Insert "C"
      Compare with "A" => not equal, so proceed
      First bit = 1 => go right
      No link => insert

    • Insert "D"

    • Insert "E"

    • Insert "F"

  • Pseudocode:
    
    Algorithm: insert (key, value)
    Input: key-value pair
    1.   if trie is empty
    2.     root = create new root with key-value pair;
    3.     return
    4.   endif
         // Start numbering the bits from 0. 
    5.   recursiveInsert (root, key, value, 0)
       
    
    Algorithm: recursiveInsert (node, key, value, bitNum)
    Input: trie node, key-value pair, which bit we are using now   
         // Compare with node key to see if it's a duplicate. 
    1.   if node.key = key
    2.     Handle duplicates;
    3.     return
    4.   endif
         // Otherwise, examine the bitNum-th bit 
    5.   if key.getBit (bitNum) = 0
           // Go left if possible, or insert. 
    6.     if node.left is null
    7.       node.left = new trie node with key-value;
    8.     else
             // Note: at next level we'll need to examine the next bit. 
    9.       recursiveInsert (node.left, key, value, bitNum+1)
    10.    endif
    11.  else
           // Same thing on the right 
    12.    if node.right is null
    13.      node.right = new trie node with key-value;
    14.    else
    15.      recursiveInsert (node.right, key, value, bitNum+1)
    16.    endif
    17.  endif
       

Search:

  • Search is straightforward:
    • Compare the input key with the current node.
    • If equal, the key is found; return.
    • Otherwise, examine i-th bit (at level i) and go left or right accordingly.
    • If next link is null, search ends without finding the key.

  • Pseudocode:
    
    Algorithm: search (key)
    Input: search-key
    1.   node = recursiveSearch (root, key, 0)
    2.   if node is null
    3.     return null
    4.   else
    5.     return node.value
    6.   endif
    Output: value, if key is found
       
    
    Algorithm: recursiveSearch (node, key, bitNum)
    Input: trie node, search-key, which bit to examine
         // Compare with key in node. 
    1.   if node.key = key
           // Found. 
    2.     return node
    3.   endif
         // Otherwise, navigate further. 
    4.   if key.getBit (bitNum) = 0
    5.     if node.left is null
             // Not found => search ends. 
    6.       return null
    7.     else
             // Search left. 
    8.       return recursiveSearch (node.left, key, bitNum+1)
    9.     endif
    10.  else
    11.    if node.right is null
             // Not found => search ends. 
    12.      return null
    13.    else
             // Search right. 
    14.      return recursiveSearch (node.right, key, bitNum+1)
    15.    endif
    16.  endif
    Output: trie node if found, else null.
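The insertion and search pseudocode for the simple trie might translate to Java along these lines. This is only a sketch under assumptions: the class name SimpleTrie is hypothetical, keys and values are Strings, the signature is the "reverse ascii code of first letter" (so bit i of the signature is the i-th lowest-order bit of the first character), keys are single letters as in the example, and duplicates are handled by overwriting the stored value.

```java
// Hypothetical class: a simple trie (digital search tree) with String keys.
class SimpleTrie {

    static class Node {
        String key, value;
        Node left, right;
        Node(String key, String value) { this.key = key; this.value = value; }
    }

    Node root;

    // Assumed signature: bit i of the reversed ASCII code of the first letter.
    static int getBit(String key, int bitNum) {
        return (key.charAt(0) >> bitNum) & 1;
    }

    void insert(String key, String value) {
        if (root == null) {
            root = new Node(key, value);
            return;
        }
        recursiveInsert(root, key, value, 0);   // bits are numbered from 0
    }

    private void recursiveInsert(Node node, String key, String value, int bitNum) {
        if (node.key.equals(key)) {
            node.value = value;                 // one way to handle duplicates
            return;
        }
        if (getBit(key, bitNum) == 0) {
            if (node.left == null) node.left = new Node(key, value);
            else recursiveInsert(node.left, key, value, bitNum + 1);
        } else {
            if (node.right == null) node.right = new Node(key, value);
            else recursiveInsert(node.right, key, value, bitNum + 1);
        }
    }

    // Returns the value for key, or null if the key is not present.
    String search(String key) {
        Node node = root;
        int bitNum = 0;
        while (node != null && !node.key.equals(key)) {
            node = (getBit(key, bitNum++) == 0) ? node.left : node.right;
        }
        return (node == null) ? null : node.value;
    }

    public static void main(String[] args) {
        SimpleTrie trie = new SimpleTrie();
        for (String k : new String[] {"A", "B", "C", "D", "E", "F"}) {
            trie.insert(k, "value-" + k);
        }
        System.out.println(trie.search("E"));
        System.out.println(trie.search("Z"));
    }
}
```

Search is written iteratively here; it follows the same path as the recursive pseudocode and also copes with an empty trie.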
       

In-Class Exercise 4.3: Use the "ascii code of first letter" signature to insert "A B C D E F" into a simple trie. Show all your steps. Ascii codes are available here.
 

Analysis:

  • Maximum height = maximum length of bitstring.

  • With a signature that wastes no bits, n keys need only about log2(n) bits each.
    => maximum tree height = O(log (n)).

    In-Class Exercise 4.4: Why is this true? That is, why is it that n keys need at most log2(n) bits to represent the keys?

  • Insertion: O(log (n)).

  • Search: O(log (n)).

  • Note: a signature function should be constructed carefully so that no bits are "wasted"
    • The "ascii code" signature wastes the first few bits (since they are common to all letters).
    • Commonly used hash functions do not waste bits because of the "mod" function.


Full tries

 

Key ideas:

  • Disadvantage of simple trie:
    does not maintain sort-order, even if signature is order-preserving.

  • In Full Trie: maintain sort order
    Note: must use order-preserving signature.

  • To do this: avoid using interior nodes
    => all keys at leaves.

  • Interior nodes are for navigation only:

  • To sort: scan leaves left to right.

Insertion:

  • Key ideas:
    • Navigate using bits as long as internal nodes exist.
    • If internal node for i-th bit does not exist, create internal node and all other nodes on path to leaf.

  • Note: we will need to know in advance the maximum number of bits.

  • Example:
    • We will insert the keys (strings) "A", "B", "C", "D", "E", "F".
    • Signature: the 5 lowest-order bits of the first letter.
      (Note: the data just happens to have only one letter).
    • Insert "A" (00001)
      • Trie is empty, so create root as well as all internal nodes on path to leaf node for "A".

    • Insert "B" (00010)
      • Traverse left for first three 0's.
      • Bit = 1 at 4-th level.
      • No internal node exists:
        => create path to leaf.

    • Insert "C" (00011)
      • All internal nodes on the path exist
        => only navigation is required (plus creating the leaf node).

    • Insert "D" (00100)

    • Insert "E" (00101)

    • Insert "F" (00110)

  • Pseudocode:
    
    Algorithm: initialize (maxBits)
    Input: maximum number of bits
    1.   Store maximum number of bits to use;
        
    
    Algorithm: insert (key, value)
    Input: key-value pair
    1.   if trie is empty
    2.     root = create empty internal node;
           // Start bit-numbering at 0 and create path to leaf: 
    3.     extendBranch (root, key, value, 0)
    4.     return
    5.   endif
    6.   recursiveInsert (root, key, value, 0)
        
    
    Algorithm: extendBranch (node, key, value, bitNum)
    Input: internal trie node, key-value pair, bit number
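    1.  Create path of internal nodes, following the bits of key, from level bitNum to maxBits-1;
        // Bits are numbered 0,...,maxBits-1, so the last bit selects the leaf. 
    2.  if key.getBit (maxBits-1) = 0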
    3.    create left leaf at end of path;
    4.  else
    5.    create right leaf at end of path;
    6.  endif
        
    
    Algorithm: recursiveInsert (node, key, value, bitNum)
    Input: trie node, key-value pair, bit number
    1.  if key.getBit (bitNum) = 0
          // Check left side. 
    2.    if node.left is null
            // Grow a branch of internal nodes and append leaf. 
    3.      extendBranch (node, key, value, bitNum)
    4.    else
    5.      recursiveInsert (node.left, key, value, bitNum+1)
    6.    endif
    7.  else
          // Check right side. 
    8.    if node.right is null
            // Grow a branch of internal nodes and append leaf. 
    9.      extendBranch (node, key, value, bitNum)
    10.   else
    11.     recursiveInsert (node.right, key, value, bitNum+1)
    12.   endif
    13. endif
        

Search:

  • Similar to simple trie.

  • One difference: compare with keys only at leaves.

Sort-order scan:

  • Must use order-preserving signature.

  • Since all keys are at leaves
    => in-order traversal will result in sort-order.

  • Optimization: link leaves together.
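A full trie with insertion, search and the sort-order scan could look roughly like this in Java. It is a sketch under assumptions: the class name FullTrie is hypothetical, the signature is the maxBits lowest-order bits of the first letter read from the highest of those bits down (so it is order-preserving for the single-letter example), and two keys with the same signature simply overwrite each other. The loop in insert merges recursiveInsert and extendBranch: it follows existing internal nodes and creates missing ones on the way to the leaf.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class: a full trie, keys stored only in leaves at depth maxBits.
class FullTrie {

    static class Node {
        String key, value;      // set only in leaves
        Node left, right;
    }

    final int maxBits;
    Node root;

    FullTrie(int maxBits) { this.maxBits = maxBits; }

    // Assumed signature: the maxBits lowest-order bits of the first letter,
    // most significant of those bits first (bit 0 is used at the root).
    int getBit(String key, int bitNum) {
        return (key.charAt(0) >> (maxBits - 1 - bitNum)) & 1;
    }

    void insert(String key, String value) {
        if (root == null) root = new Node();
        Node node = root;
        for (int bitNum = 0; bitNum < maxBits; bitNum++) {
            if (getBit(key, bitNum) == 0) {
                if (node.left == null) node.left = new Node();
                node = node.left;
            } else {
                if (node.right == null) node.right = new Node();
                node = node.right;
            }
        }
        node.key = key;         // node is now the leaf for this signature
        node.value = value;
    }

    // Only one key comparison, at the leaf.
    String search(String key) {
        Node node = root;
        for (int bitNum = 0; node != null && bitNum < maxBits; bitNum++) {
            node = (getBit(key, bitNum) == 0) ? node.left : node.right;
        }
        return (node != null && key.equals(node.key)) ? node.value : null;
    }

    // In-order traversal; leaves are visited left to right.
    List<String> sortedKeys() {
        List<String> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    private void collect(Node node, List<String> out) {
        if (node == null) return;
        collect(node.left, out);
        if (node.key != null) out.add(node.key);
        collect(node.right, out);
    }

    public static void main(String[] args) {
        FullTrie trie = new FullTrie(5);
        for (String k : new String[] {"F", "B", "D", "A", "E", "C"}) {
            trie.insert(k, "value-" + k);
        }
        System.out.println(trie.sortedKeys());
    }
}
```

Linking the leaves together, as suggested above, would avoid the traversal through internal nodes; the sketch leaves that out.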
 

Analysis:

  • Again, maximum height = number of bits in signature.
    => O(log (n)), if the signature wastes no bits.

  • Insertion: O(log (n)).

  • Search: O(log (n)).

  • Note:
    • Simple trie: a key-comparison occurs at each interior node in search path.
    • Full trie: only bit-evaluations occur at internal nodes
      => only ONE key comparison (at leaf)
      => faster search (especially if keys are long).
    • Simple trie: One node per key.
    • Full trie: multiple (wasted) internal nodes.
      => O(n) extra storage. (Why?)

In-Class Exercise 4.5: Why is it that the full trie wastes O(n) storage? In other words, explain why the number of internal nodes is O(n).


Compressed tries

 

Key ideas:

  • Suppose "D" and "E" are the only keys to the left of the root.

  • Common internal nodes are "wasted":

  • Compressed trie: compress common sections.

  • Idea: during insertion, extend path only as much as needed:

  • Note: sort-order is still maintained in leaves.

Insertion:

  • Example:
    • Insert "A" (00001)
      • Trie empty => insert as root.

    • Insert "B" (00010)
      • "B" differs from "A" in fourth bit
        => path of length 3 required.

    • Insert "C" (00011)
      • "C" differs from "B" in last bit
        => full path required

    • Insert "D" (00100)
      • "D" differs at third bit, no path-creation required.

    • Insert "E" (00101)
      • "E" differs from "D" in last bit
        => path extension required.

    • Insert "F" (00110)
      • No path creation required.

  • Pseudocode:
    
    Algorithm: insert (key, value)
    Input: key-value pair
    1.   if trie is empty
    2.     root = new root node with key-value;
    3.     return
    4.   endif
         // Start with level 0 (bit number 0): 
    5.   recursiveInsert (root, key, value, 0)
        
    
    Algorithm: recursiveInsert (node, key, value, bitNum)
    Input: trie node, key-value pair, bit number
    1.   if node contains a key
           // Need to create a path to distinguish key and node.key 
    2.     extendSmallestBranch (node, node.key, node.value, key, value, bitNum)
    3.     return
    4.   endif
         // Otherwise, node is an empty interior node for navigation only. 
    5.   if key.getBit (bitNum) = 0
           // Check left. 
    6.     if node.left is null
    7.       node.left = new node containing key-value;
    8.     else
    9.       recursiveInsert (node.left, key, value, bitNum+1)
    10.    endif
    11.  else
           // Check right. 
    12.    if node.right is null
    13.      node.right = new node containing key-value;
    14.    else
    15.      recursiveInsert (node.right, key, value, bitNum+1)
    16.    endif
    17.  endif
        
    
    Algorithm: extendSmallestBranch (node, key1, value1, key2, value2, bitNum)
    Input: node from which to build branch, two key-value pairs, bit number.
         // Examine bits from bitNum to maxBits-1 (bits are numbered from 0). 
         // As long as the bits are equal in the two keys, extend branch. 
         // When the bits differ, stop and create children with key1 and key2. 
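One possible Java rendering of compressed-trie insertion, including extendSmallestBranch, is sketched below. The assumptions: the class name CompressedTrie and the helper leaf are hypothetical, the signature is the same 5-lowest-order-bits signature used in the example (highest of those bits first), and two different keys with identical signatures are rejected. Unlike the pseudocode, which turns the node holding a key into an internal node, this sketch builds the new branch and returns it to the parent; the resulting shape is the same.

```java
// Hypothetical class: a compressed trie; a node with a non-null key is a leaf.
class CompressedTrie {

    static class Node {
        String key, value;      // null in internal (navigation-only) nodes
        Node left, right;
    }

    final int maxBits;
    Node root;

    CompressedTrie(int maxBits) { this.maxBits = maxBits; }

    // Assumed signature: maxBits lowest-order bits of the first letter, high bit first.
    int getBit(String key, int bitNum) {
        return (key.charAt(0) >> (maxBits - 1 - bitNum)) & 1;
    }

    static Node leaf(String key, String value) {
        Node n = new Node();
        n.key = key;
        n.value = value;
        return n;
    }

    void insert(String key, String value) {
        root = insert(root, key, value, 0);
    }

    // Returns the (possibly new) root of the subtree after insertion.
    private Node insert(Node node, String key, String value, int bitNum) {
        if (node == null) return leaf(key, value);
        if (node.key != null) {                       // reached a leaf
            if (node.key.equals(key)) { node.value = value; return node; }
            return extendSmallestBranch(node, leaf(key, value), bitNum);
        }
        if (getBit(key, bitNum) == 0) node.left = insert(node.left, key, value, bitNum + 1);
        else node.right = insert(node.right, key, value, bitNum + 1);
        return node;
    }

    // Extend the path while the two keys agree; split at the first differing bit.
    private Node extendSmallestBranch(Node a, Node b, int bitNum) {
        if (bitNum >= maxBits) {
            throw new IllegalArgumentException("keys share a signature");
        }
        Node internal = new Node();
        int bitA = getBit(a.key, bitNum), bitB = getBit(b.key, bitNum);
        if (bitA != bitB) {
            internal.left = (bitA == 0) ? a : b;
            internal.right = (bitA == 0) ? b : a;
        } else if (bitA == 0) {
            internal.left = extendSmallestBranch(a, b, bitNum + 1);
        } else {
            internal.right = extendSmallestBranch(a, b, bitNum + 1);
        }
        return internal;
    }

    String search(String key) {
        Node node = root;
        int bitNum = 0;
        while (node != null && node.key == null) {
            node = (getBit(key, bitNum++) == 0) ? node.left : node.right;
        }
        return (node != null && node.key.equals(key)) ? node.value : null;
    }

    public static void main(String[] args) {
        CompressedTrie trie = new CompressedTrie(5);
        for (String k : new String[] {"A", "B", "C", "D", "E", "F"}) {
            trie.insert(k, "value-" + k);
        }
        System.out.println(trie.search("D"));
    }
}
```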
        

In-Class Exercise 4.6: Use the "reverse ascii code of first letter" signature to insert "A B C D E F" into a compressed trie. Show all your steps. Ascii codes are available here.


Handling Duplicates (in any Data Structure)

 

Motivation:

  • In practice, data often contains duplicates.

  • In some applications, you want to store multiple values for a single key.
    => data structure should be able to store all values for a key.

Solutions (using Binary Search Tree as data structure):

  • Duplicate lists:
    • Use the "attached-list" idea from hashing.
    • Store duplicates in a linked-list off of tree nodes.

    • Advantages: flexible, efficient (only those keys with duplicates need the list).
    • Disadvantages: extra space required in each node.

  • Duplicate cache:
    • If duplicates are few (O(1)), store in separate table.

    • Constant-cost added to each search (to search cache).
    • Advantages: simple, efficient for few duplicates, interchangeable with data structures.
    • Disadvantages: inefficient for many duplicates.
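The duplicate-list idea can be sketched on a plain (unbalanced) binary search tree as follows. The class name DuplicateListTree and its methods are hypothetical; keys and values are Strings for brevity. Each node carries a reference for the list (the extra space mentioned above), but the list itself is created only when the first duplicate arrives.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical class: a binary search tree with a duplicate list per node.
class DuplicateListTree {

    static class Node {
        final String key;
        final String value;          // first value inserted for this key
        List<String> duplicates;     // null until a duplicate is inserted
        Node left, right;
        Node(String key, String value) { this.key = key; this.value = value; }
    }

    Node root;

    void insert(String key, String value) {
        root = insert(root, key, value);
    }

    private Node insert(Node node, String key, String value) {
        if (node == null) return new Node(key, value);
        int cmp = key.compareTo(node.key);
        if (cmp < 0) {
            node.left = insert(node.left, key, value);
        } else if (cmp > 0) {
            node.right = insert(node.right, key, value);
        } else {
            // Duplicate key: attach the value to the node's list.
            if (node.duplicates == null) node.duplicates = new LinkedList<>();
            node.duplicates.add(value);
        }
        return node;
    }

    // All values stored for key, in insertion order; empty if key is absent.
    List<String> search(String key) {
        List<String> result = new ArrayList<>();
        Node node = root;
        while (node != null) {
            int cmp = key.compareTo(node.key);
            if (cmp == 0) {
                result.add(node.value);
                if (node.duplicates != null) result.addAll(node.duplicates);
                break;
            }
            node = (cmp < 0) ? node.left : node.right;
        }
        return result;
    }

    public static void main(String[] args) {
        DuplicateListTree tree = new DuplicateListTree();
        tree.insert("Bill", "first");
        tree.insert("Ali", "only");
        tree.insert("Bill", "second");
        System.out.println(tree.search("Bill"));
    }
}
```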

A Java programming trick for handling duplicates: (source file)


import java.util.*;

public class Duplicate {

  // Any Map can be used, e.g., TreeMap. 
  static Map originalDataStructure = new HashMap();


  // Insertion. 

  static void insert (Object key, Object value)
  {
    // 1. Attempt a direct insertion. put() returns the previous value (or null). 
    Object oldValue = originalDataStructure.put (key, value);

    // 2. If the key wasn't already there, nothing more to be done. 
    if (oldValue == null) {
      // No duplicates. 
      return;
    }

    // 3. Otherwise, check if duplicates already exist. 
    HashSet duplicates;
    if (oldValue instanceof HashSet) {
      // 3.1 There are, so reuse the existing set. 
      duplicates = (HashSet) oldValue;
    }
    else {
      // 3.2 This is the first duplicate => create a HashSet and place the old value in it. 
      duplicates = new HashSet ();
      duplicates.add (oldValue);
    }
    // 3.3 Add the new value to the set. 
    duplicates.add (value);
    // 3.4 The put() in step 1 replaced the stored value, so store the set itself as the value. 
    originalDataStructure.put (key, duplicates);
  }


  // Enumerate and print all values. 

  static void printAll ()
  {
    Collection values = originalDataStructure.values();
    for (Iterator i=values.iterator();  i.hasNext(); ) {
      Object obj = i.next();
      // If the value is a HashSet, we have duplicates. 
      if (obj instanceof HashSet) {
        // Extract the different values. 
        HashSet duplicates = (HashSet) obj;
        for (Iterator j=duplicates.iterator();  j.hasNext(); ) {
          System.out.println (j.next());
        }
      }
      else {
        // If it's not a HashSet, the extracted obj is itself the value. 
        System.out.println (obj);
      }
    }
  }

  public static void main (String[] argv)
  {
    // Add data with duplicates. 
    insert ("Ali", "Anorexic Ali");
    insert ("Bill", "Bulimic Bill");
    insert ("Bill", "Blasphemous Bill");     // Duplicate. 
    insert ("Chen", "Cadaverous Chen");
    insert ("Dave", "Dyspeptic Dave");
    insert ("Dave", "Duplicitous Dave");     // Duplicate. 
    insert ("Dave", "Diabolical Dave");      // Duplicate. 
    insert ("Ella", "Egregious Ella");
    insert ("Franco", "Flatulent Franco");
    insert ("Gita", "Gluttonous Gita"); 
    insert ("Gita", "Grotesque Gita");       // Duplicate. 

    // Print. 
    printAll();
  }

}