Recall the three basic operations implemented earlier for a linked list:
public class OurLinkedList {

    // ... variable declarations ...

    public void add (Integer K)           // Insert operation.
    {
        // ...
    }

    public boolean contains (Integer K)   // Search operation.
    {
        // ...
    }

    public Integer get (int i)            // get() operation.
    {
        // ...
    }
}
The other data structure we created for this purpose was the array-list, with exactly the same operations:
public class OurArrayList {

    // ...

    public void add (Integer K)           // Insert operation.
    {
        // ...
    }

    public boolean contains (Integer K)   // Search operation.
    {
        // ...
    }

    public Integer get (int i)            // get() operation.
    {
        // ...
    }
}
Let us next compare the performance of these two data structures for each of these operations:
public class ListComparison {

    public static void main (String[] argv)
    {
        // Use 10,000 insertions (repeat for 1000 samples).
        testInsert (1000, 10000);

        // Use a list with 100,000 elements (repeat for 1000 samples).
        testSearch (1000, 100000);

        // Use a list with 100,000 elements (repeat for 1000 samples).
        testGet (1000, 100000);
    }

    static void testInsert (int numTrials, int numElements)
    {
        // Evaluate the "insert" operation in a linked list.
        // Repeat for the given number of trials.
        double total = 0;
        for (int k=0; k < numTrials; k++) {
            long startTime = System.currentTimeMillis();
            // Make a list and add numElements to it.
            OurLinkedList list = new OurLinkedList ();
            for (int i=0; i < numElements; i++) {
                list.add (i);
            }
            long timeTaken = System.currentTimeMillis() - startTime;
            total += timeTaken;
        }

        // This is the average insert time.
        double avg = total / numTrials;
        System.out.println ("Average insert time for linked list: " + avg);

        // Now repeat for an array list.
        total = 0;
        for (int k=0; k < numTrials; k++) {
            long startTime = System.currentTimeMillis();
            // Make a list and add numElements to it.
            OurArrayList list = new OurArrayList ();
            for (int i=0; i < numElements; i++) {
                list.add (i);
            }
            long timeTaken = System.currentTimeMillis() - startTime;
            total += timeTaken;
        }

        // Average for the array list.
        avg = total / numTrials;
        System.out.println ("Average insert time for array list: " + avg);
    }

    static void testSearch (int numTrials, int numElements)
    {
        // ... similar ...
    }

    static void testGet (int numTrials, int numElements)
    {
        // ... similar ...
    }
}
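The bodies of testSearch() and testGet() are elided above. As one possible sketch, testSearch() could be filled in along the same lines as testInsert(); the particular value searched for here is arbitrary:

    static void testSearch (int numTrials, int numElements)
    {
        // Build a linked list once, then time repeated searches on it.
        OurLinkedList linkedList = new OurLinkedList ();
        for (int i=0; i < numElements; i++) {
            linkedList.add (i);
        }

        double total = 0;
        for (int k=0; k < numTrials; k++) {
            long startTime = System.currentTimeMillis();
            // Search for a value near the end (assuming add() appends at the
            // rear, this forces a scan of most of the list).
            linkedList.contains (numElements - 1);
            long timeTaken = System.currentTimeMillis() - startTime;
            total += timeTaken;
        }
        System.out.println ("Average search time for linked list: " + (total / numTrials));

        // ... repeat with OurArrayList, exactly as in testInsert() ...
    }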
Note: the timings below (average time per trial, in milliseconds) were observed on three different machines:

Operation             | 2006 Mac-OSX | SUN server (Unix) | 2005 Win-XP
----------------------|--------------|-------------------|------------
Insert (linked-list)  |        3.412 |             2.969 |       3.562
Insert (array-list)   |        3.39  |             1.136 |       1.406
Search (linked-list)  |        0.626 |             4.675 |       0.687
Search (array-list)   |        0.669 |             4.177 |       0.469
get()  (linked-list)  |        0.432 |             3.479 |       0.719
get()  (array-list)   |        0.0   |             0.002 |       0.0
In-Class Exercise 1: Why is get() much faster for an array list? First examine the code to see how the method is implemented and then explain.
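To get started on the exercise, here is a rough sketch of how the two get() methods might be implemented (assuming the linked list uses the fields front, next and data, and the array list stores its elements in an array called data, as in the code above and in earlier modules); the actual code you examine may differ in details:

    // In OurLinkedList: follow i links from the front, one hop at a time.
    public Integer get (int i)
    {
        ListItem listPtr = front;
        for (int k=0; k < i; k++) {
            listPtr = listPtr.next;
        }
        return listPtr.data;
    }

    // In OurArrayList: a single direct array access.
    public Integer get (int i)
    {
        return data[i];
    }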
Clearly, one way to compare algorithms is: implement them and test them on large data sets.
Disadvantages of this approach:
Goal of abstract analysis:
Some key ideas:
A few more key ideas:
The Big-Oh notation:
What does it mean?
In-Class Exercise 2: Suppose Algorithm A takes 3n^3 + 5n^2 + 100n time and Algorithm B takes 4n^3 time (worst-case) on a problem of size n. If we were to plot the two curves f(n) = 3n^3 + 5n^2 + 100n and g(n) = 4n^3, would the curve for g(n) eventually rise above that of f(n)? If so, at what value of n does that happen? Write a small program to find out.
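One possible sketch for such a program (the class name is arbitrary) is a simple loop that prints both values and stops at the first crossover:

    public class GrowthComparison {

        public static void main (String[] argv)
        {
            // Print f(n) and g(n) until g(n) first exceeds f(n).
            for (int n=1; n <= 200; n++) {
                double f = 3.0*n*n*n + 5.0*n*n + 100.0*n;
                double g = 4.0*n*n*n;
                System.out.println ("n=" + n + "  f(n)=" + f + "  g(n)=" + g);
                if (g > f) {
                    System.out.println ("g(n) rises above f(n) at n=" + n);
                    break;
                }
            }
        }
    }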
About constants:
Now we're ready for a formal analysis of the linked-list operations:
Next, let's consider the array list:
In-Class Exercise 3: For the insert operation on an array-list, suppose that we start with an initial array size of 1. How many array-doublings are needed if 1024 items are inserted into the list? In general, for large n, how many doublings are needed?
In-Class Exercise 4: How much time (in order-notation) is needed, worst-case, for search and get() in an array-list?
Consider this problem: given an array of n numbers, determine whether any value occurs more than once (duplicate detection).
Consider this simple algorithm for the problem:
Algorithm: duplicateDetection (A)
Input: An array A

1.    duplicatesExist = false
2.    for i=1 to n
3.        // Check whether A[i] occurs again
4.        for j=1 to n
5.            if i != j
6.                if A[i] = A[j]
7.                    duplicatesExist = true
8.                endif
9.            endif
10.       endfor
11.   endfor
12.   return duplicatesExist

Note: here is a Java implementation of the algorithm, instrumented to measure its running time on arrays of various sizes:
public class DuplicateDetection {

    public static void main (String[] argv)
    {
        // Make a large array and test.
        int[] X = makeData (10000);
        detectDuplicates (X);

        // We'll do this for data sizes of 10K, 30K, 50K, 70K and 90K.
        X = makeData (30000);
        detectDuplicates (X);

        X = makeData (50000);
        detectDuplicates (X);

        X = makeData (70000);
        detectDuplicates (X);

        X = makeData (90000);
        detectDuplicates (X);
    }

    static void detectDuplicates (int[] A)
    {
        // Check for duplicates.
        long startTime = System.currentTimeMillis();

        boolean dupExists = false;
        for (int i=0; i < A.length; i++) {
            for (int j=0; j < A.length; j++) {
                if ( (i != j) && (A[i] == A[j]) ) {
                    // Duplicates exist.
                    dupExists = true;
                }
            }
        }

        double timeTaken = System.currentTimeMillis() - startTime;
        System.out.println ("Time taken for size=" + A.length + ": " + timeTaken);
    }

    static int[] makeData (int size)
    {
        // ... how this works is not relevant ...
    }
}
Analysis for an array of n elements:
The constant of proportionality:
In-Class Exercise 5: Add code to the above program to identify the constant of proportionality. That is, divide the actual measured running time by n^2. Alternatively, find the constant b such that b * running-time = n^2; then the constant of proportionality is a = b^-1 = 1/b.
There is an obvious improvement: the algorithm above compares every pair twice (once as (i,j) and once as (j,i)), so it is enough to have the inner loop start at j = i+1 and compare each pair only once.
In-Class Exercise 6: Modify the above program to incorporate this optimization. Then, identify the new constant of proportionality. In terms of n, what is the exact number of comparisons? (It's going to be smaller than n*(n-1), obviously).
If we want sorted output:
In a sorted list, the elements are maintained in increasing order, which changes how insert and search must work.
Consider the linked version:
Notice that we can analyse the performance without looking at any code.
In-Class Exercise 7: In Big-Oh notation, how much time (as a function of n, the number of elements) do insert and search take? How do these functions compare with the unsorted linked list?
For completeness, let's examine the code:
class ListItem {
    // ...
}

public class OurSortedLinkedList {

    // ...

    public void add (Integer K)
    {
        if (front == null) {
            // This is the same as before:
            front = new ListItem ();
            front.data = K;
            rear = front;
            rear.next = null;
        }
        else {
            // This part is a little more complicated now since
            // we have to first find the right place and then
            // possibly insert between existing elements.

            // Find the right place for it.
            ListItem listPtr = front;
            ListItem followPtr = null;
            while ( (listPtr != null) && (listPtr.data < K) ) {
                followPtr = listPtr;
                listPtr = listPtr.next;
            }

            // Make the node.
            ListItem nextOne = new ListItem ();
            nextOne.data = K;

            // There are three cases to handle.
            if (listPtr == front) {
                // CASE 1: Insert in front.
                nextOne.next = front;
                front = nextOne;
            }
            else if (listPtr == null) {
                // CASE 2: Insert at rear.
                rear.next = nextOne;
                rear = nextOne;
            }
            else {
                // CASE 3: Insert in the middle.
                followPtr.next = nextOne;
                nextOne.next = listPtr;
            }
        }

        numItems ++;
    }

    public boolean contains (Integer K)
    {
        if (front == null) {
            return false;
        }

        // Start from the front and walk down the list. We don't
        // have to go further once we've hit something larger than K.
        ListItem listPtr = front;
        while ( (listPtr != null) && (listPtr.data <= K) ) {
            if ( listPtr.data.equals(K) ) {
                return true;
            }
            listPtr = listPtr.next;
        }
        return false;
    }

    public String toString ()
    {
        // ...
    }
}
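For Exercise 8 below, a small test driver along these lines could be used (a sketch: the class name and the particular values are arbitrary, and printing the list relies on the elided toString() above). To see node "addresses", one can additionally print the ListItem objects themselves; their default toString() produces something like ListItem@1b6d3586, which can stand in for the address in your drawing.

    public class SortedListTest {

        public static void main (String[] argv)
        {
            OurSortedLinkedList list = new OurSortedLinkedList ();

            // Insert a few values out of order; each add() should place
            // the value in its correct sorted position.
            int[] values = {5, 2, 8, 1, 7};
            for (int i=0; i < values.length; i++) {
                list.add (values[i]);
                // Print the list after each insertion (uses the class's toString()).
                System.out.println (list);
            }

            // A couple of searches.
            System.out.println (list.contains (7));   // Expect: true.
            System.out.println (list.contains (3));   // Expect: false.
        }
    }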
In-Class Exercise 8: Execute the above program, while printing out the actual node addresses. Draw a step-by-step picture showing the state of the list after each insertion. Write the node addresses down on the drawing.
Now let's consider the array-list version:
Again, we can analyse the time taken without looking at code.
In-Class Exercise 9: How much time is needed in Big-Oh notation for each of the two operations, insert and search, for an array-list with n elements?
Now let's look at the code:
public class OurSortedArrayList {

    // This is the array in which we'll store the integers.
    Integer[] data = new Integer [1];

    // Initially, there are none.
    int numItems = 0;

    public void add (Integer K)
    {
        if (numItems >= data.length) {
            // Need more space. Let's double it.
            Integer [] data2 = new Integer [2 * data.length];
            // Copy over data into new space.
            for (int i=0; i < data.length; i++) {
                data2[i] = data[i];
            }
            // Make the new array the current one.
            data = data2;
        }

        // Now find the right place.
        int k = numItems;
        for (int i=0; i < numItems; i++) {
            if (data[i] > K) {
                k = i;
                break;
            }
        }

        // Insert at k, by shifting everything to the right.
        for (int j=numItems; j > k; j--) {
            data[j] = data[j-1];
        }
        data[k] = K;

        numItems ++;
    }

    public boolean contains (Integer K)
    {
        return binarySearch (data, K, 0, numItems-1);
    }

    static boolean binarySearch (Integer[] A, int value, int start, int end)
    {
        // Only need to check if the interval got inverted.
        if (start > end) {
            return false;
        }

        // Find the middle:
        int mid = (start + end) / 2;

        if (A[mid] == value) {
            return true;
        }
        else if (value < A[mid]) {
            // Search the left half: A[start],...,A[mid-1]
            return binarySearch (A, value, start, mid-1);
        }
        else {
            // Search the right half: A[mid+1],...,A[end]
            return binarySearch (A, value, mid+1, end);
        }
    }
}
In-Class Exercise 10: Download Log.java and implement a method to compute the base-2 logarithm of an integer. The result must itself be an integer (truncated from a real number if necessary). What is the connection between this exercise and binary search above?
Recall Selection-sort:
Algorithm: selectionSort (A)
Input: an unsorted array A

1.    for i=1 to n-1
          // Find i-th smallest element in A[i], ..., A[n]
2.        pos = i
3.        for j=i+1 to n
4.            if A[j] < A[pos]
                  // Record best so far
5.                pos = j
6.            endif
7.        endfor
8.        swap A[i] and A[pos]
9.    endfor
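For reference, a direct Java translation of this pseudocode (shifted to 0-based array indices) might look like this:

    static void selectionSort (int[] A)
    {
        int n = A.length;
        for (int i=0; i < n-1; i++) {
            // Find the position of the smallest element in A[i], ..., A[n-1].
            int pos = i;
            for (int j=i+1; j < n; j++) {
                if (A[j] < A[pos]) {
                    // Record best so far.
                    pos = j;
                }
            }
            // Swap A[i] and A[pos].
            int temp = A[i];
            A[i] = A[pos];
            A[pos] = temp;
        }
    }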
Let us analyse the running time: in the i-th iteration of the outer loop, the inner loop performs (n - i) comparisons, so the total number of comparisons is (n-1) + (n-2) + ... + 2 + 1.
In-Class Exercise 11: What is the total amount of "work done" above? Simplify the above expression and express it in Big-Oh notation.
Suppose we have three algorithms whose execution time as a function of problem size (n) is:
Algorithms A and B are fundamentally different from Algorithm C:
In-Class Exercise 12: To see the difference between exponential and polynomial, compute n^4 and 2^n for n = 10, 20, ..., 100. Write a small program to print out these values, along with the ratio 2^n / n^4.
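A minimal sketch of such a program (note that 2^100 is far too large for a long, so double, or BigInteger for exact values, is needed):

    public class PolyVsExp {

        public static void main (String[] argv)
        {
            // Compare n^4 and 2^n for n = 10, 20, ..., 100.
            for (int n=10; n <= 100; n += 10) {
                double poly = Math.pow (n, 4);
                double exp = Math.pow (2, n);
                System.out.println ("n=" + n + "  n^4=" + poly + "  2^n=" + exp
                                    + "  ratio=" + (exp / poly));
            }
        }
    }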
Factorials:
In-Class Exercise 13: Argue that factorials are worse than exponentials, i.e., that n! must eventually grow larger than a^n for any a.
Ease of analysis:
In-Class Exercise 14: Recall the "Manhattan" example from Module 4 (the material on Recursion). Download and examine Manhattan.java. Consider the special case where the number of rows and columns are identical; thus, we'll use only r to denote both the number of rows and the number of columns. Modify the code to count the number of calls made to countPaths(). This will serve as the "work done" by the algorithm. Let f(r) denote the work done for different values of r. Then print f(r) for various values of r in the range r = 1, 2, ..., 10. How does f(r) compare with 2^r or r!?
The following table summarizes the ranking of common time complexities.
The higher a time complexity appears in the table, the more efficient it is. Conversely, the lower it appears, the worse the efficiency becomes, to the point where running times range from indefinitely long to practically impossible.
O(c) or O(1)  | Constant time
O(log n)      | Logarithmic time
O(n)          | Linear time
O(n log n)    | "Loglinear" time
O(n^c)        | Polynomial time
O(n^2)        | Quadratic time
O(n^3)        | Cubic time
O(n^4)        | Quartic time
O(c^n)        | Exponential time
O(n!)         | Factorial time
For a small data set, we can see quick divergence between the different classes on a linear-scale plot, but it is difficult to grasp the overall scale of growth and how different each class really is. 100 elements is such a small data set that algorithms up to low-order polynomial time finish in acceptable real time. However, when the data set grows large, even polynomial-time algorithms can become impractically slow.
We are most concerned with large data sets. The following visualizations illustrate data sets up to a size of 10,000 elements. 10,000 elements can only marginally be considered a "large" data set, since most modern applications deal with data that is many orders of magnitude larger, i.e., billions or trillions of records. Regardless, at 10,000 records the pattern of extreme performance differences starts to emerge in the visualizations.
With a linear-scale plot, everything worse than loglinear is tightly grouped close to the y-axis, and it is difficult to see the differences between those classes.
If we scale the y-axis logarithmically, the asymptotic behavior becomes more apparent. Exponential and factorial are still off the chart even at small values of n, but all the other classes, with their better asymptotic performance, clearly start to flatten (on this scale), which helps differentiate their performance.