Data Structures

1. What is a data structure?
• An organization of a data set
• Several operations to be performed on the data set

2. The general problem of data structure design
• Input: a data set and specifications of the operations to be supported
• Design: Come up with an organization (structure) of the data elements, and algorithms for the operations, so that the operations are as fast as possible.

3. When do we need a data structure?

4. Review of Stacks and Queues

5. A stack is a data structure S that supports three operations:
• Push (S,a)
• Pop(S)
• Top(S)

6. The pushing (insertion) and popping (deletion) are done at one and the same end of the stack, namely, its top.

7. One implementation is an array S[1:N] and a global index I pointing to the top of the stack, with the convention that S[I] is empty and S[I-1] is full
 procedure Push(S,a) begin S[I] := a; I := I+1; end

 function Pop(S) begin I := I-1; return(S[I]); end

 function Top(S) begin return(S[I-1]); end
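As a concrete illustration, here is a short Python sketch of the array implementation above. Python lists are 0-indexed, so the convention becomes: S[I] is the first empty slot, S[I-1] holds the top element. The class name ArrayStack is just for illustration.

```python
class ArrayStack:
    """Array-based stack: I is the index of the first empty slot."""

    def __init__(self, capacity):
        self.S = [None] * capacity
        self.I = 0

    def push(self, a):
        self.S[self.I] = a      # store at the first empty slot
        self.I += 1

    def pop(self):
        self.I -= 1             # S[I-1] held the top element
        return self.S[self.I]

    def top(self):
        return self.S[self.I - 1]
```

All three operations run in O(1) time, since they touch only the top slot.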

8. A queue is a data structure Q that supports two operations:
• enqueue(Q,a): inserts a new element at the rear end (called the tail) of the queue Q
• dequeue(Q): deletes and returns the front element from the head of the queue.

9. One implementation is a circular array Q[0:N-1] and two global indexes H and T, pointing to the head and tail elements of Q.

Q[H] is considered full, Q[T] is considered empty.

By circular we mean: the element before Q[0] is Q[N-1],
and the element after Q[N-1] is Q[0].
 procedure enqueue(Q,a) begin Q[T] := a; T := (T-1) mod N; end

 function dequeue(Q)
 begin H := (H-1) mod N; return(Q[(H+1) mod N]); end
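The same circular-array scheme can be sketched in Python. Following the convention above, the indices H and T decrement (mod N) as elements are enqueued and dequeued, so the queue "grows downward" around the circle. The class name is illustrative.

```python
class CircularQueue:
    """Circular array queue: Q[H] holds the front element (when non-empty),
    Q[T] is the first empty slot; both indices decrement mod N."""

    def __init__(self, capacity):
        self.N = capacity
        self.Q = [None] * capacity
        self.H = 0
        self.T = 0

    def enqueue(self, a):
        self.Q[self.T] = a
        self.T = (self.T - 1) % self.N

    def dequeue(self):
        a = self.Q[self.H]
        self.H = (self.H - 1) % self.N
        return a
```

Both operations are O(1); the mod arithmetic implements the wrap-around from Q[0] back to Q[N-1].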

10. Records: A record is a built-in structured data type. It is an aggregate of several elements, called fields, where each field is a variable of a standard type or of a record type.

11. Syntax of a record:

 record name
 begin
  field declaration;
  field declaration;
  ...
  field declaration;
 end

Example:
 record employee
 begin
  char name[1:30];
  integer SSN;
  char address[1:100];
  real salary;
 end

Once a record has been defined, it becomes a new data type that can be used in declaring record variables:

employee X;
employee Y[1:100];
 record company
 begin
  char name[1:50];
  char address[1:100];
  employee Z[1:1000];
 end

to access an individual field F of a record R, use
R.F

 Examples: X.SSN := 123456789; X.salary := X.salary + 1000;
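In a modern language the employee/company records above can be sketched, for example, with Python dataclasses. Fixed field sizes like char name[1:30] have no direct analogue, so plain strings stand in for them.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Employee:
    name: str = ""
    SSN: int = 0
    address: str = ""
    salary: float = 0.0

@dataclass
class Company:
    name: str = ""
    address: str = ""
    Z: List[Employee] = field(default_factory=list)   # array of employees

# Field access mirrors the R.F notation:
X = Employee()
X.SSN = 123456789
X.salary = X.salary + 1000
```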

12. Pointers data types
A pointer is a data type whose contents are addresses of other variables of a specified type.

 Examples: charptr p; intptr q[1:n]; employeeptr r;

to access the field F of a record pointed to by a record pointer r, use
r --> F

13. A self-referential record is a record that has a pointer field that points to a record of the same type.

14. Example:
 record employee
 begin
  char name[1:30];
  integer SSN;
  employeeptr next;
 end

15. Singly Linked Lists:
• a singly linked list is a sequence of records, where every record has a field that points to the next record

• a pointer P that points to the first record is part of the linked list structure

• to access the i-th record, we have to "walk" from the first record to the i-th record, one record at a time, using the next pointers.

• to insert a record at a specified location, say after the i-th record, we walk to the i-th record, then change the values of the pointers so the i-th record points to the new record and the new record points to the (i+1)st record.

• deleting the i-th record is done in a similar fashion.
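The walk/insert/delete operations described above can be sketched in Python with a self-referential node class. The helper names (walk, insert_after, delete_after) are illustrative, not standard library functions.

```python
class Node:
    """A singly-linked-list record: a value and a pointer to the next record."""
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def walk(P, i):
    """Return the i-th record (1-indexed) by following next pointers from P."""
    p = P
    for _ in range(i - 1):
        p = p.next
    return p

def insert_after(P, i, value):
    """Insert a new record after the i-th one: the i-th record points to the
    new record, and the new record points to the old (i+1)-st record."""
    p = walk(P, i)
    p.next = Node(value, p.next)

def delete_after(P, i):
    """Delete the (i+1)-st record by making the i-th record bypass it."""
    p = walk(P, i)
    p.next = p.next.next
```

Note that the walk makes access to the i-th record take O(i) time, unlike an array's O(1) indexing; the payoff is O(1) insertion and deletion once the position is found.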

16. Doubly linked lists: these are like singly linked lists except that each record has a field that points to the previous record, and another field that points to the next record.

17. Introduction to graphs
• Definition: A graph G=(V,E) consists of a finite set V, whose elements are called nodes, and a set E, which is a subset of V x V. The elements of E are called edges.

If the directions of the edges are of significance, that is, (x,y) is different from (y,x), the graph is called directed.

Otherwise, the graph is called undirected.

• if (x,y) is an edge, then x is said to be adjacent to y, and y is adjacent from x.

• in the case of undirected graphs, if (x,y) is an edge, we just say that x and y are adjacent (or x is adjacent to y, or y is adjacent to x). Also, we say that x is the neighbor of y.

• The indegree of a node x is the number of nodes adjacent to x

• The outdegree of a node x is the number of nodes adjacent from x

• the degree of a node x in an undirected graph is the number of neighbors of x

• A path from a node x to a node y in a graph is a sequence of nodes x, x1, x2, ..., xn, y, such that x is adjacent to x1, x1 is adjacent to x2, ..., and xn is adjacent to y.

• The length of a path is the number of its edges.

• a cycle is a path that begins and ends at the same node

• The distance from node x to node y is the length of the shortest path from x to y.

• an undirected graph is connected if there is at least one path between every pair of nodes

• Representations: Adjacency lists and adjacency matrices
Given a graph G=(V,E) of n nodes, G can be represented by a matrix A[1:n,1:n], where
A[i,j]=1 if (i,j) is an edge;
A[i,j]=0 if (i,j) is not an edge.
The nodes are assumed to be conveniently labeled 1,2,...,n.
A is called an adjacency matrix.

G can also be represented by what's called adjacency lists: an array P[1:n] of pointers, where P[i] points to the linked list of all the nodes adjacent from node i. Each one of those nodes is represented by a simple record of two fields: the node label and the next pointer.
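Both representations can be sketched in Python as follows; Python lists replace the linked lists of the adjacency-list representation, and row/column 0 is left unused so nodes keep their labels 1..n. The function names are illustrative.

```python
def adjacency_matrix(n, edges):
    """A[i][j] = 1 iff (i, j) is an edge; nodes are labeled 1..n."""
    A = [[0] * (n + 1) for _ in range(n + 1)]   # row/column 0 unused
    for (i, j) in edges:
        A[i][j] = 1
    return A

def adjacency_lists(n, edges):
    """P[i] is the list of nodes adjacent from node i."""
    P = [[] for _ in range(n + 1)]              # P[0] unused
    for (i, j) in edges:
        P[i].append(j)
    return P
```

The matrix uses O(n^2) space but answers "is (i,j) an edge?" in O(1) time; the lists use O(n + |E|) space and suit scanning a node's neighbors.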

18. Trees

• Definition: a tree is a connected acyclic graph (i.e., it has no cycles).
• Rooted trees: A rooted tree is a tree where one node is designated as root. By holding the tree at its root, and letting the other nodes descend from it, we get a hierarchical structure. In that structure, every node has a parent, and some nodes have one or more children.
• Definitions:

• a leaf is a node that has no children

• the ancestors of a node x is the set of nodes on the path from x to the root, including x and the root.

• the proper ancestors of x are all the ancestors of x except x itself.

• the descendents of x are the children of x, their children, and so on all the way down.

• the proper descendents of x are all the descendents of x except x itself.

• the subtree rooted at x is the tree consisting of x and all its descendents.

• the depth of x is the distance from the root to x.

• the height of x is the distance from x to the farthest descendent of x.

• the height (or depth) of the tree is the height of the root.

• the nodes clearly partition into levels. The top level contains just the root, and it is labeled level 0. The next level, labeled level 1, contains the children of the root. The level after that, labeled level 2, contains the grandchildren of the root, and so on.

• the depth of a node is the label of its level.

• the height of the tree is the label of the lowest level.
• Binary trees: rooted trees where every node has at most two children

• Representation of binary trees: every node is represented by a record that has at least three fields: the node label (or contents), a pointer to the left child, and a pointer to the right child. The tree is pointed to by the pointer to the root node.
• Binary Search Trees: a BST is a binary tree T where
1. every node contains a value, called key (real, integer, character, etc.)
2. for every node x, all the nodes in the left subtree of x have keys that are <= the key of x, and all the nodes in the right subtree of x have keys that are > the key of x.

• BST supports three operations: insert, delete, and search

In the three algorithms below, T is a pointer to the tree root, and a is a key.

The search function returns a pointer to the record containing a, if any. Otherwise, it returns nil.

 function search(T,a)
 begin
  nodeptr p;
  p := T;
  while (p != nil and p-->key != a) do
   if a < p-->key then
    p := p-->left;
   else
    p := p-->right;
   endif
  endwhile
  return(p);
 end search

The insert procedure inserts the key a into T.
 procedure insert(T,a)
 begin
  nodeptr p;
  p := T;
  Bool done := false;
  while not done do
   if a <= p-->key then
    if p-->left != nil then
     p := p-->left;
    else
     p-->left := new(node);
     p-->left-->key := a;
     done := true;
    endif
   else
    if p-->right != nil then
     p := p-->right;
    else
     p-->right := new(node);
     p-->right-->key := a;
     done := true;
    endif
   endif
  endwhile
 end insert
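Here is a runnable Python sketch of the search and insert operations above. It follows the same logic (go left on a <= key, go right otherwise) and additionally handles the empty-tree case, which the pseudocode leaves out; the class and function names are illustrative.

```python
class BSTNode:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def search(T, a):
    """Return the node containing key a, or None if a is not in the tree."""
    p = T
    while p is not None and p.key != a:
        p = p.left if a < p.key else p.right
    return p

def insert(T, a):
    """Insert key a into the BST rooted at T; return the (possibly new) root."""
    if T is None:                       # empty tree: a becomes the root
        return BSTNode(a)
    p = T
    while True:
        if a <= p.key:
            if p.left is not None:
                p = p.left
            else:
                p.left = BSTNode(a)     # attach a as the new left child
                return T
        else:
            if p.right is not None:
                p = p.right
            else:
                p.right = BSTNode(a)    # attach a as the new right child
                return T
```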

• The procedure delete deletes the node containing key a in the tree
• p points to the node that will be found to contain a
• q points to the parent of p
• r points to a child of p
• direction = 0 if p is a left child of q, 1 if p is the right child of q
• (for simplicity, the code below assumes the deleted node is not the root, so that q is always defined)
procedure  delete(T,a)
begin
nodeptr  p,q,r,s;
integer  direction;
p = T;
while  (p != nil and p-->key != a) do
if a < p-->key then
q := p;
p := p-->left;
direction := 0;
else
q := p;
p := p-->right;
direction := 1;
endif
endwhile
if p == nil then
return;
elseif p-->left == nil and p-->right == nil then
/* p has no children */
/* delete node pointed to by p and exit */
if direction == 0 then
q-->left := nil;
else
q-->right := nil;
endif
free(p);
elseif p-->left == nil then
/* p has only one child, the right one */
if direction == 0 then
q-->left := p-->right;
/* shortcut from parent to grandchild */
else
q-->right := p-->right;
endif
free(p);
elseif p-->right == nil then
/* p has only one child, the left one */
if direction == 0 then
q-->left := p-->left;
else
q-->right := p-->left;
endif
free(p);
else	/* p has two children*/
/* find the maximum node in the left subtree*/
/* subtree of p */
s := p-->left;
q := p;
/* now q will be the parent of s , and direction*/
/* will indicate the type of child s is to q*/
direction := 0;
while s-->right != nil do
q := s;
s := s-->right;
direction := 1;
endwhile
/*Now s points to the maximum node in the */
/*left subtree of p*/
p-->key := s-->key;
/* now node s must be deleted. But since s has */
/* no right child, the deletion is done by */
/* leaf deletion or shortcutting */
if s-->left == nil then 	/* s is a leaf*/
if direction == 0 then
q-->left := nil;
else
q-->right := nil;
endif
free (s);
return;
else		/* s has a left child */
if direction == 0 then
q-->left := s-->left;
else
q-->right := s-->left;
endif
free(s);
return;
endif
endif
end delete

19. The time complexity of insert, delete, and search is O(h) each, where h is the height of the tree.

20. Full trees and almost complete trees:

• A full binary tree is a binary tree where every non-leaf has two children and all the leaves are at the same level.

• Exercises: show that the number of nodes in level i of a full binary tree is 2^i. Show also that a full binary tree of height h has 2^(h+1) - 1 nodes.

• the canonical labeling of a full binary tree is a labeling of the nodes from top to bottom, left to right, starting with the root being labeled 1.

• observe that the labels of the children of node i are 2i and 2i+1. The label of the parent of i is [i/2] (integer division).

• an almost complete binary tree of n nodes is the binary tree consisting of the first n nodes of a full binary tree.

• observe that if the bottom level of an almost complete binary tree is removed, we are left with a full tree. Also, the nodes in the bottom level are packed to the left end of the tree without any "holes"

• an almost complete binary tree of n nodes can be implemented with an array A[1:n] of n elements, where A[i] contains the contents of node of label i.
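The canonical-labeling arithmetic makes this array implementation trivial: navigation is pure index arithmetic, with no pointers. A minimal Python sketch (A[0] left unused so that node i lives in A[i], matching the A[1:n] convention):

```python
def left(i):
    """Label of the left child of node i."""
    return 2 * i

def right(i):
    """Label of the right child of node i."""
    return 2 * i + 1

def parent(i):
    """Label of the parent of node i (integer division)."""
    return i // 2

# The first 6 nodes of a full binary tree, stored by canonical label:
A = [None, 'a', 'b', 'c', 'd', 'e', 'f']   # A[0] unused
```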

21. Heaps (min-heaps and max-heaps)

• Definition: a min-heap is an almost complete binary tree where every node has a value less than or equal to the values of its children

• Operations on a min-heap: delete-min and insert
function delete-min(H)   /* H is the heap */
begin
 x := root of H;
 a := key of x;   /* to be returned at the end */
 remove a from node x;
 take the last node (node n), remove its key (call it b), and store b in the root;
 remove node n;
 /* now we have to restore the heap */
 while (x has a key bigger than one of its children) do
  swap the key of x with the key of the smaller child;
  make x point to that child;
 endwhile
 /* note: the while loop will stop when x becomes smaller */
 /* than both its children, or when x becomes a leaf */
 return(a);
end delete-min

Time of delete-min: O(h)=O(log n)

procedure insert(H,a)   /* inserts the key value a into the heap */
begin
 create a node of label n+1, and insert a into that node;
 let x point to that node;
 while (x is smaller than its parent) do
  swap the key of x with the key of the parent of x;
  let x point to its parent;
 endwhile
 /* this will loop until either x becomes larger than its parent, */
 /* or x becomes the root */
end insert
Time of insert: O(h)=O(log n)
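Combining the almost-complete-tree array layout with the two operations above gives a compact Python min-heap. H[0] is unused so node i has children 2i and 2i+1 and parent i//2, exactly as in the canonical labeling; the class name is illustrative.

```python
class MinHeap:
    """Array-based min-heap; H[0] is unused so node i has children 2i, 2i+1."""

    def __init__(self):
        self.H = [None]

    def insert(self, a):
        self.H.append(a)                 # new node gets label n+1
        x = len(self.H) - 1
        while x > 1 and self.H[x] < self.H[x // 2]:
            # swap with the parent while x is smaller than its parent
            self.H[x], self.H[x // 2] = self.H[x // 2], self.H[x]
            x //= 2

    def delete_min(self):
        a = self.H[1]                    # the minimum is at the root
        self.H[1] = self.H[-1]           # move the last node's key to the root
        self.H.pop()
        x, n = 1, len(self.H) - 1
        while 2 * x <= n:                # sift down toward the smaller child
            c = 2 * x
            if c + 1 <= n and self.H[c + 1] < self.H[c]:
                c += 1
            if self.H[x] <= self.H[c]:
                break
            self.H[x], self.H[c] = self.H[c], self.H[x]
            x = c
        return a
```

Both operations walk at most one root-to-leaf path, hence the O(h) = O(log n) bounds stated above.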

• Note that with insert(), we can create a heap from scratch by n insertions into an initially empty heap. This takes time =
O(log 1 + log 2 + ... + log n) = O(log(n!)) = O(n log n).
There is a procedure for constructing a heap in O(n) time.

• By starting with a heap, and calling delete-min n times, we sort the n keys in the heap in time =
O(log n + log(n-1) + ... + log 1) = O(n log n).
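This is exactly heapsort. Python's standard heapq module provides both the O(n) heap construction mentioned above (heapify) and delete-min (heappop), so the whole scheme fits in a few lines:

```python
import heapq

def heap_sort(keys):
    """Sort by building a min-heap in O(n), then deleting the minimum n times."""
    h = list(keys)
    heapq.heapify(h)                  # O(n) bottom-up heap construction
    return [heapq.heappop(h) for _ in range(len(h))]   # n delete-mins
```

The total time is O(n) for heapify plus O(n log n) for the n delete-mins, i.e., O(n log n) overall.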

22. Sets and the Union-Find Data Structure

• General implementations of sets
• Union-Find problem: Given n sets, where set i has a single element {i}, we need to perform O(n) operations of union (U) and find (F):

• U(A,B) removes the sets A and B, and replaces them with a new set equal to
A U B

• F(x) is a function that returns the name of the set containing x at the time F is called.

• note that the sets are mutually disjoint at all times.

• the problem is to design a data structure so that O(n) calls to U and F take as little time as possible.

• we will carry out the design by having two different representations of the sets: one conceptual and one physical.

• the conceptual representation of each set is a rooted tree (not necessarily binary). The nodes are labeled with the elements of the corresponding set. The label of the root serves the extra purpose of being the name of the set.

• the physical representation of all the trees (at once) is a single array called PARENT[1:n].

• if i is not a root, PARENT[i] is the parent of node i in whatever tree i happens to be.

• if i is a root, PARENT[i]= -(number of nodes in the tree rooted at i).
• three different implementations of the Union-Find

• first implementation
U(i,j) is carried out by simply making i the parent of j, thus combining the two trees into a single tree, and at once eliminating the old trees as distinct sets.

F(x) returns the root of the tree containing x by walking up that tree from x, using the PARENT array, until a node is reached which has no parent. That is the root.

procedure U(i,j)
begin
 PARENT[i] := PARENT[i] + PARENT[j]; /* this computes -(total number of nodes in the new tree) */
 PARENT[j] := i;
end

Time: O(1), that is, constant.

function F(x)
begin
 integer r := x;
 while (PARENT[r] > 0) do
  r := PARENT[r];
 endwhile
 /* now r is a root */
 return(r);
end

Time: O(h) where h is the height of the tree containing x.
time for O(n) Us and Fs: up to O(n^2). The proof is an exercise.

• Second Implementation
in U(i,j), make the root of the heavier tree become the parent of the other root. By heavier we mean "has more nodes".

F(x) is not modified here.

procedure U(i,j)
begin
 if |PARENT[i]| >= |PARENT[j]| then
  PARENT[i] := PARENT[i] + PARENT[j];
  PARENT[j] := i;
 else
  PARENT[j] := PARENT[i] + PARENT[j];
  PARENT[i] := j;
 endif
end
time: O(1)
Time of an O(n) sequence of Us and Fs: O(nlog n).
This is because every tree is of height <= log(its number of nodes).
The proof for that is by induction on the number of Us performed.
Therefore, every tree is of height <= O(log n).
Hence, every F(x) takes O(log n) time.
As a result, O(n) Us take O(n) time, and O(n) Fs take O(n log n) time.
The total time for O(n) Us and Fs is then O(nlog n).

• Third implementation
U will remain the same as in the second implementation.
F will be changed using a technique called path compression. In particular, F(x) first finds the root. Afterwards, in a second walk from x to the root, every node on that path is made a direct child of the root.
function F(x)
begin
 integer r,s,t;
 r := x;
 while (PARENT[r] > 0) do
  r := PARENT[r];
 endwhile
 /* now r is a root */
 s := x;
 while s != r do
  t := s;
  s := PARENT[s];
  PARENT[t] := r;
 endwhile
 return(r);
end

Time of F: O(h) where h is the height of the tree containing x.
Time of an O(n) sequence of Us and Fs: O(nG(n)), where
G(n) <= 5 for all n <= 2^65000. That is, for all practical values of n, the time is O(n).
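The third implementation can be sketched in Python using the PARENT-array representation described above: PARENT[i] holds i's parent, and a root holds -(size of its tree). U uses union by size as in the second implementation; F adds the second, path-compressing walk. As in the notes, U assumes its arguments are roots (set names).

```python
class UnionFind:
    """PARENT-array Union-Find with union by size and path compression."""

    def __init__(self, n):
        # index 0 unused; elements 1..n start as n singleton sets of size 1
        self.PARENT = [-1] * (n + 1)

    def F(self, x):
        r = x
        while self.PARENT[r] > 0:       # first walk: find the root
            r = self.PARENT[r]
        while x != r:                   # second walk: compress the path
            self.PARENT[x], x = r, self.PARENT[x]
        return r

    def U(self, i, j):
        """Union of the sets named by roots i and j."""
        if abs(self.PARENT[i]) >= abs(self.PARENT[j]):   # i's tree is heavier
            self.PARENT[i] += self.PARENT[j]   # sums of negative sizes
            self.PARENT[j] = i
        else:
            self.PARENT[j] += self.PARENT[i]
            self.PARENT[i] = j
```

After any call F(x), every node on the path from x to its root points directly at the root, which is what makes long sequences of operations run in near-linear time.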
Abdou Youssef 2001-05-28