CS 3212: Algorithms

Student questions


Here we will post your questions with short answers, as long as the questions are not from homework or the module questions themselves. (For those, stop by during office hours.)

General questions about assignments/programming etc.

What's a good way to create my own tests? This is generally hard to do with guarantees. For any particular algorithm, it's best to identify categories of edge cases. For example, in sorting, edge cases can include extreme sizes (empty or single-element arrays), odd- vs even-length arrays for algorithms like mergesort (that split in the middle), already-sorted data, and reverse-sorted data.
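For instance, a small Java harness that runs a sort over these edge-case categories might look like the sketch below; `Arrays.sort` stands in for your own implementation, and the case list is illustrative, not exhaustive:

```java
import java.util.Arrays;

// A sketch of an edge-case test harness for a sort. Replace the second
// Arrays.sort call with your own sorting method to test it.
public class SortEdgeCases {
    static int[][] cases() {
        return new int[][] {
            {},                    // empty array
            {7},                   // single element
            {3, 1, 4, 1, 5, 9},    // even length (mergesort splits evenly)
            {3, 1, 4, 1, 5},       // odd length
            {1, 2, 3, 4, 5},       // already sorted
            {5, 4, 3, 2, 1},       // reverse sorted
            {2, 2, 2, 2}           // all duplicates
        };
    }

    // Compare against a trusted oracle: a copy sorted by the library.
    public static boolean passes(int[] input) {
        int[] expected = input.clone();
        Arrays.sort(expected);
        int[] actual = input.clone();
        Arrays.sort(actual);       // replace with yourSort(actual)
        return Arrays.equals(expected, actual);
    }

    public static void main(String[] args) {
        for (int[] c : cases())
            System.out.println(Arrays.toString(c) + " -> " + passes(c));
    }
}
```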

Why do we lose commitment points for using an extension? Great question. Let's address this in class on 10/1. (If I forget, remind me.)

Questions about pattern search.

What's a special state in the NDFA? These states are additional states created and inserted for our use in constructing NDFA's. After all, the NDFA is not some given thing that constrains what we do: we are constructing an NDFA for our pattern-search purposes, which means we can build one any which way. The bottom-up construction approach is simplified if we introduce these extra "glue" states to put the larger NDFA together. In practice, one can optimize away some of these after the large NDFA is constructed.

When should one use a DFA vs an NDFA for pattern search? An NDFA works for both fixed strings and regular expressions whereas a DFA works only for fixed strings (without wildcards, without regular expressions). The DFA is also the first step in understanding how an NDFA works.

When parsing a regular expression is * part of the regex or something else? That depends on the mini-language used to describe the regular expression. The use of * is popular in many regular-expression languages, including the one used by Java. Remember, a regular expression (like a math expression) is a formal and precise way of describing a set of strings. Thus, constructing a regular expression must precisely follow rules to avoid ambiguity. This means some notation is needed to say "zero or more occurrences of ...". Most regex languages use * for "zero or more ..." and + for "one or more ..."
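As a quick illustration using Java's own regex mini-language (`java.util.regex`):

```java
import java.util.regex.Pattern;

// Java's regex language uses * and + exactly as described above.
public class StarPlusDemo {
    public static void main(String[] args) {
        // "ab*" : an 'a' followed by zero or more 'b's
        System.out.println(Pattern.matches("ab*", "a"));     // true
        System.out.println(Pattern.matches("ab*", "abbb"));  // true
        // "ab+" : an 'a' followed by one or more 'b's
        System.out.println(Pattern.matches("ab+", "a"));     // false
        System.out.println(Pattern.matches("ab+", "ab"));    // true
    }
}
```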

How to construct an NDFA for "C (A | B)"? Or "(A B)*"? In the bottom-up process shown in Module 5, start by constructing an NDFA for the individual letters C, A and B. Then piece together the larger NDFA. Thus, for (AB)*, you'd first build the NDFA for AB (using the concatenation rule) and then build the one for (AB)* by adding the additional states and arrows using the third rule.
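Here is one possible way to sketch that bottom-up construction for (AB)* in Java; the state numbers, field names, and simulation code are my own, not the module's:

```java
import java.util.*;

// Sketch: build the NDFA for (AB)* bottom-up, then simulate it with the
// usual set-of-states technique. States are plain integers.
public class ABStarNfa {
    // letter transitions: state -> (letter -> next state)
    static Map<Integer, Map<Character, Integer>> delta = new HashMap<>();
    // epsilon ("glue") transitions: state -> set of next states
    static Map<Integer, Set<Integer>> eps = new HashMap<>();
    static final int START = 3, ACCEPT = 3;   // the star's glue state

    static {
        // Step 1: NDFAs for the single letters, chained for the
        // concatenation AB: 0 --A--> 1 --B--> 2.
        delta.computeIfAbsent(0, k -> new HashMap<>()).put('A', 1);
        delta.computeIfAbsent(1, k -> new HashMap<>()).put('B', 2);
        // Step 2: the star rule adds a glue state (3) with epsilon arrows
        // into the AB machine and back out of its accept state.
        eps.computeIfAbsent(3, k -> new HashSet<>()).add(0);
        eps.computeIfAbsent(2, k -> new HashSet<>()).add(3);
    }

    public static boolean matches(String s) {
        Set<Integer> cur = closure(new HashSet<>(Set.of(START)));
        for (char c : s.toCharArray()) {
            Set<Integer> next = new HashSet<>();
            for (int q : cur) {
                Map<Character, Integer> m = delta.get(q);
                if (m != null && m.containsKey(c)) next.add(m.get(c));
            }
            cur = closure(next);
        }
        return cur.contains(ACCEPT);
    }

    // Follow epsilon arrows until no new states appear.
    static Set<Integer> closure(Set<Integer> states) {
        Deque<Integer> stack = new ArrayDeque<>(states);
        while (!stack.isEmpty())
            for (int t : eps.getOrDefault(stack.pop(), Set.of()))
                if (states.add(t)) stack.push(t);
        return states;
    }

    public static void main(String[] args) {
        System.out.println(matches(""));      // true: zero occurrences of AB
        System.out.println(matches("ABAB"));  // true
        System.out.println(matches("ABA"));   // false
    }
}
```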

Why is BMH used if it's O(mn)? What is KMP like in practice? Both RK and KMP are theoretically better but a bit slow in practice. RK is slow because its hash functions require a bit of arithmetic that's more expensive than a simple letter-to-letter comparison (which can be done at byte-comparison speeds). KMP is slow in practice because of the computation of the next-state function, although it will be the fastest for repeated searches of the same pattern in many texts. BMH is an example of a simple heuristic that performs well in practice on standard English text. Also, simple text search in applications like a text editor rarely needs long patterns, nor does it commonly involve extremely long texts. A word search in terabytes of text would need a different strategy. (How would you do this?)
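For reference, the core of BMH fits in a few lines. The following is a sketch assuming ASCII text, with the "skip table" idea spelled out in comments:

```java
import java.util.Arrays;

// Sketch of Boyer-Moore-Horspool search (ASCII text assumed).
// Returns the index of the first match, or -1 if none.
public class Bmh {
    public static int search(String text, String pattern) {
        int n = text.length(), m = pattern.length();
        if (m == 0 || m > n) return -1;
        // Skip table: when the window's last text character is c, how far
        // can we safely shift the window? Default: the full pattern length.
        int[] skip = new int[128];
        Arrays.fill(skip, m);
        for (int i = 0; i < m - 1; i++)
            skip[pattern.charAt(i)] = m - 1 - i;
        int i = 0;
        while (i <= n - m) {
            int j = m - 1;
            while (j >= 0 && text.charAt(i + j) == pattern.charAt(j)) j--;
            if (j < 0) return i;                 // full match at position i
            i += skip[text.charAt(i + m - 1)];   // shift by the last char's skip
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(search("hello world", "world"));   // 6
    }
}
```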

Memory questions

How are trees stored in memory?
Every tree node is like a linked-list node except that instead of a single "next" pointer, there are now two pointers, one called "left" and the other called "right". And just like we needed an outside-the-list variable to point to the front of the list, we'll need an outside-the-tree variable to point to the root of the tree.
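In Java, that looks like the following sketch (field names are illustrative):

```java
// A minimal tree node: like a linked-list node, but with two pointers.
public class Tree {
    static class Node {
        int data;
        Node left;    // pointer to the left child (or null)
        Node right;   // pointer to the right child (or null)
    }

    // The outside-the-tree variable pointing at the root, analogous to
    // the "front" variable of a linked list.
    static Node root;

    public static void main(String[] args) {
        root = new Node();            // one heap block for the root
        root.data = 5;
        root.left = new Node();       // another heap block for the left child
        root.left.data = 3;
        root.right = new Node();      // and one for the right child
        root.right.data = 8;
        System.out.println(root.left.data + " " + root.data + " " + root.right.data);
    }
}
```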

When showing memory diagrams, how much detail should be included? For this course, use the type of memory diagrams we showed in the 2113 material with addresses shown whenever a pointer is involved.

Why are values copied for assignment between stack-variables versus address-copying for assignment between heap-variables? Not true! Every variable assignment (of any kind) is a value copy. That is, the value in the right-side variable gets copied into the left-side variable. Consider:


     int b = 5;
     int a = b;     
     // a now has the value 5

     int[] x = {1,2,3};
     int[] y = x;     
     // The variable y now has the value that was in x, which is
     // a pointer to the segment of memory (in the heap) that contains
     // the array [1][2][3]

     // Now let's look at a heap variable example
     Node n = new Node();
     n.x = 5;
     Node m = new Node();
      m.x = n.x;
     // The variable x that's in the heap block pointed to by m will
     // have the value 5, because that's what got copied from n.x
  

What's beyond the memory diagrams we've done so far? It's worth understanding how polymorphism works with objects. Other things worth understanding: how does garbage collection work? What does the term "memory leak" mean? Here are some short answers:

  • There is a part of the language runtime (in Java's case, the JVM's garbage collector) that is always looking for unused heap blocks. Consider this:
    
        Node n = new Node();   // Suppose n has the address 6068
        n.x = 5;
        n = new Node();        // Now n is made to point to a new block, say at 7322
        // At this point, no variable is pointing to the block at 6068.
        // Which means it's kind of "floating" (nothing points to it).
        // The garbage collector is looking for such blocks and will 
        // scoop it up for use in the future.
      
  • A memory leak is when a program keeps creating heap blocks it no longer needs but never releases. This is difficult to do in Java but quite easy in C/C++, both of which require the programmer to explicitly return blocks to the heap; neither has a garbage collector.
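Even with a garbage collector, a Java program can effectively leak by holding on to references it no longer needs: the collector only reclaims blocks that nothing points to. A small sketch of that pattern (names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a Java-style "leak": a long-lived list keeps references the
// program no longer needs, so the garbage collector can never reclaim them.
public class LeakSketch {
    static List<int[]> cache = new ArrayList<>();   // lives for the whole run

    static void handleRequest() {
        int[] scratch = new int[1000];
        cache.add(scratch);   // bug: never removed, so GC can't reclaim it
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) handleRequest();
        System.out.println(cache.size());   // grows with every request
    }
}
```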


Hashing questions

What's the difference between a hashmap, hashset and hashtable? The term "hashtable" is Java's older term for "hashmap" so let's just distinguish between hashset and hashmap:


    // A hashset stores only keys:
    HashSet<String> set = new HashSet<>();
    set.add ("hello");
    set.add ("world");
    if (set.contains("hiya")) {
        // ...
    }

    // Each time a hashmap stores a key, the associated "value" is also
    // stored, because the real purpose is retrieval: when the key is
    // presented later, the hashmap is expected to return the associated
    // value.
    HashMap<Integer,String> map = new HashMap<>();
    map.put (1, "one");
    map.put (2, "two");
    map.put (3, "three");
    String s = map.get(2);    // s points to "two"
  

Is there a universal hashing function? Not in general, but there are near-universal hash functions for particular data types like integers. These have provable properties about the low probability of collision. There's a fair bit of math that goes into their analysis.

How does one reduce collisions in hashing? In practice, you have a couple of options:

  • First, evaluate standard hash functions on your application's data to see if it does indeed produce collisions.
  • One straightforward approach is to hash twice, so that a bucket with multiple keys points to a second-level hashtable instead of a list.
  • Another option is to design your own custom hash function. This is useful when storing complex objects like images.
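As an example of the last option, here is a sketch of a custom hash function for a simple "complex object" (a 2D point), using Java's standard `hashCode`/`equals` pattern; the mixing formula is the conventional one, not the only choice:

```java
import java.util.HashSet;

// A 2D point with a custom hash function. In Java, hashCode and equals
// must be overridden together so hash-based containers work correctly.
public class Point {
    final int x, y;
    Point(int x, int y) { this.x = x; this.y = y; }

    @Override public boolean equals(Object o) {
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }

    @Override public int hashCode() {
        // Mix the two coordinates; 31 is the conventional multiplier.
        return 31 * x + y;
    }

    public static void main(String[] args) {
        HashSet<Point> set = new HashSet<>();
        set.add(new Point(1, 2));
        System.out.println(set.contains(new Point(1, 2)));   // true
    }
}
```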

How does the hash function in geometric hashing work? Module 4 explains some of it, but if that's not enough, let's do this in class.

Trie questions

What makes an internal node common enough to be removed in a compressed trie? How do you compress a part of the trie without recreating it the next time something is inserted? A compressed trie takes the approach of "uncompress only when you must". Thus, whenever two keys need to be distinguished by additional internal-node routing, that part of the trie that does that routing is expanded at the moment the second of the two keys is inserted. Then, when deletions occur, the common part is compressed.

How is a trie faster if comparisons are still being made to check if a bit is 0 or 1? A trie is no faster than a balanced binary tree but much faster than a linked list. This is because the trie height is no more than log(n). So, yes, comparisons are made to go left or right but no more than log(n) comparisons. So, why use a trie in place of a tree? A couple of reasons:

  • Tries do not need special balancing tricks. They will balance out as long as the signature is short enough (around log(n) bits).
  • Tries can be used with very long keys that have short signatures.
  • Some data types may not have a natural "order". A signature allows a tree-like structure (the trie) to be used nonetheless.
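To make the signature-routing idea concrete, here is a sketch of a binary trie keyed on fixed-length bit signatures; the class layout is my own invention, not the module's:

```java
// A binary trie keyed on fixed-length bit signatures: each bit routes
// left (0) or right (1), so lookups cost at most `bits` comparisons.
public class BitTrie {
    static class Node {
        Node left, right;   // left = bit 0, right = bit 1
        boolean isKey;      // true if a signature ends here
    }

    Node root = new Node();
    final int bits;         // signature length

    BitTrie(int bits) { this.bits = bits; }

    void insert(int signature) {
        Node cur = root;
        for (int i = bits - 1; i >= 0; i--) {
            if (((signature >> i) & 1) == 0)
                cur = (cur.left != null) ? cur.left : (cur.left = new Node());
            else
                cur = (cur.right != null) ? cur.right : (cur.right = new Node());
        }
        cur.isKey = true;
    }

    boolean contains(int signature) {
        Node cur = root;
        for (int i = bits - 1; i >= 0; i--) {
            cur = ((signature >> i) & 1) == 0 ? cur.left : cur.right;
            if (cur == null) return false;
        }
        return cur.isKey;
    }

    public static void main(String[] args) {
        BitTrie t = new BitTrie(4);
        t.insert(0b1010);
        t.insert(0b0011);
        System.out.println(t.contains(0b1010));   // true
        System.out.println(t.contains(0b0101));   // false
    }
}
```

Note that the height here is exactly the signature length, which is why keeping signatures around log(n) bits keeps lookups logarithmic.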