Algorithms and Data Structures (COMP 1004 Theory 2)


1 Algorithms and Data Structures (COMP 1004 Theory 2) Anthony Hunter, Department of Computer Science, University College London. February 8.

2 Course contents Search algorithms Graph algorithms File compression algorithms Cryptology 2 / 152

3 Search algorithms Divide and conquer techniques Binary search Interpolation search Binary search trees 2-3-4 trees B-trees Hashing

4 Searching a repository A repository is a set of records Each record has one or more keys (descriptors/identifiers) The search key(s) is a delineation of the records required Retrieve records that match the search key(s) Examples of repositories Repository Record Search key Library database Library book Book/Author name Student database Student file Student name Dictionary Word entry Word Web Web page Keywords

5 Sequential search Consider the items in the array in sequence. For an unsorted array of size n takes n + 1 comparisons for an unsuccessful search (always) takes n/2 comparisons for a successful search (on average) For a sorted array of size n takes n/2 comparisons for a successful and unsuccessful search (on average) 5 / 152

6 Binary search Assume records are sorted Use divide and conquer: Look at mid-point, and if need to continue the search, then use left or right half of array. For an array of size n, binary search never uses more than log n comparisons for successful and unsuccessful search 6 / 152

7 Binary search The following is an array of keys, where each key is from a different record, and each row is a cycle of binary search for which the search key is M. Example of binary search A A A C E E E G H I L M N P R S X l x r l x r l x r x Legend: l is the pointer to the left end of the current array, r is the pointer to the right end of the current array, and x is the pointer to the key currently being compared with the search key. 7 / 152

8 Binary search Position for comparison for binary search: x = l + ⌊(r − l)/2⌋ A A A C E E E G H I L M N P R S X l x r l x r l x r x Calculation for above example: (Cycle 1) x = 1 + ⌊(17 − 1)/2⌋ = 9 (Cycle 2) x = 10 + ⌊(17 − 10)/2⌋ = 13 (Cycle 3) x = 10 + ⌊(12 − 10)/2⌋ = 11
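
A minimal Python sketch of binary search over a sorted list, following the midpoint formula above; the function name, the 0-based indexing (the slides use 1-based positions) and the small usage example are illustrative assumptions, not from the slides.

def binary_search(keys, v):
    # Return the index of v in the sorted list keys, or None if v is absent.
    l, r = 0, len(keys) - 1
    while l <= r:
        x = l + (r - l) // 2              # midpoint: x = l + floor((r - l) / 2)
        if keys[x] == v:
            return x
        elif v < keys[x]:
            r = x - 1                     # continue in the left half of the array
        else:
            l = x + 1                     # continue in the right half of the array
    return None

keys = list("AAACEEEGHILMNPRSX")          # the array from the example above
print(binary_search(keys, "M"))           # -> 11 (0-based position of M)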

9 Interpolation search Interpolation search is a divide and conquer approach that is more intelligent than binary search. Binary search uses the midpoint as the next position to search, whereas interpolation search tries to gauge the position based on the distance of the search key from the keys at the left and right ends of the array. Interpolation search assumes a fairly well distributed set of keys, which is often not a reasonable assumption. Example of interpolation search When searching for a name in a telephone directory, if we look for a name beginning with B (respectively V), then we start by looking near the start (respectively end) of the directory.

10 Interpolation search Position for comparison for binary search: x = l + ⌊(r − l)/2⌋ Position for comparison for interpolation search: x = l + ⌊((v − kl)/(kr − kl)) × (r − l)⌋ Legend: l is the pointer to the left end of the current array, r is the pointer to the right end of the current array, x is the pointer to the key currently being compared with the search key, v is the search key, kl is the key of the record at position l, and kr is the key of the record at position r.

11 Interpolation search Position for comparison for interpolation search: x = l + ⌊((v − kl)/(kr − kl)) × (r − l)⌋ A A A C E E E G H I L M N P R S X l x r l x r x Calculation for above example: (Cycle 1) x = 1 + ⌊((M − A)/(X − A)) × (17 − 1)⌋ = 1 + ⌊((13 − 1)/(24 − 1)) × 16⌋ = 9 Note, we use the numerical representation of each letter (i.e. A = 1, B = 2, C = 3, etc) when substituting for v, kl, and kr.
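
A minimal Python sketch of interpolation search using the position formula above; it assumes numeric keys (letters mapped to 1..26 as in the slides), and the function name and example are illustrative, not from the slides.

def interpolation_search(keys, v):
    # Search the sorted list of numeric keys for v; return its index or None.
    l, r = 0, len(keys) - 1
    while l <= r and keys[l] <= v <= keys[r]:
        if keys[r] == keys[l]:
            x = l                                            # all keys in range are equal
        else:
            # estimate the position from v's distance to the keys at the two ends
            x = l + (v - keys[l]) * (r - l) // (keys[r] - keys[l])
        if keys[x] == v:
            return x
        elif v < keys[x]:
            r = x - 1
        else:
            l = x + 1
    return None

keys = [ord(c) - ord("A") + 1 for c in "AAACEEEGHILMNPRSX"]   # A=1, B=2, ...
print(interpolation_search(keys, 13))                         # key M=13 is at index 11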

12 Binary search trees: Definition A binary search tree is a binary tree that satisfies the binary search tree property Binary search tree property For all nodes x, y If y is a node in the left subtree of x, then key[y] ≤ key[x] If y is a node in the right subtree of x, then key[y] ≥ key[x] Arrays for representing a binary tree: key[x], right[x], left[x], parent[x], where x is a node in the tree.

13 Binary search trees: Examples For simplicity in these examples, we have the node and its key value being the same key[2] = 2 parent[2] = 3 left[2] = nil right[2] = nil key[3] = 3 parent[3] = 6 left[3] = 2 right[3] = 5 key[5] = 5 parent[5] = 3 left[5] = nil right[5] = nil key[6] = 6 parent[6] = nil left[6] = 3 right[6] = 7 key[7] = 7 parent[7] = 6 left[7] = nil right[7] = 9 key[9] = 9 parent[9] = 7 left[9] = nil right[9] = nil 13 / 152

14 Binary search trees: Examples / 152

15 Binary search trees: Searching
For the root of the tree x, and a search key k, the position of k in the tree is returned (or nil if it is not in the tree).
TreeSearch(x, k)
  if x = nil or k = key[x] then return x
  if k < key[x] then return TreeSearch(left[x], k)
  else return TreeSearch(right[x], k)
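
A minimal Python sketch of TreeSearch; the Node class and the tiny example tree are illustrative assumptions, not part of the slides.

class Node:
    # key/left/right/parent mirror key[x], left[x], right[x] and parent[x] above
    def __init__(self, key):
        self.key, self.left, self.right, self.parent = key, None, None, None

def tree_search(x, k):
    # Return the node with key k in the subtree rooted at x, or None (nil).
    if x is None or k == x.key:
        return x
    if k < x.key:
        return tree_search(x.left, k)
    return tree_search(x.right, k)

# Tiny example: 6 at the root with children 3 and 7, as in the earlier example.
root, left, right = Node(6), Node(3), Node(7)
root.left, root.right = left, right
left.parent = right.parent = root
print(tree_search(root, 7).key)   # -> 7
print(tree_search(root, 4))       # -> None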

16 Binary search trees: Minimum and maximum elements
The minimum element in a binary search tree is found by travelling down the left-most branch until there is no left child. If the tree has no left branch, then return the root. TreeMaximum is defined similarly.
TreeMinimum(x)
  while left[x] ≠ nil
    x = left[x]
  return x

17 Binary search trees: Successor keys Successor of 13 is 15, and successor of 15 is 17.

18 Binary search trees: Successor keys If all keys are distinct, then the successor of a node x is the node with the smallest key greater than key[x]. The structure of a binary search tree allows the successor to be identified without the need to compare nodes. Rules for finding successor keys If the right subtree of x is non-empty, then the successor of x is the leftmost node in the right subtree. If the right subtree of x is empty, and x has a successor, then the successor is the lowest ancestor of x whose left child is also an ancestor of x.

19 Binary search trees: Successor keys
For a node in the tree x, the successor of x is returned (or nil if x is the largest key in the tree).
TreeSuccessor(x)
  if right[x] ≠ nil then return TreeMinimum(right[x])
  y = parent[x]
  while y ≠ nil and x = right[y]
    x = y
    y = parent[y]
  return y
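
A minimal Python sketch of TreeMinimum and TreeSuccessor following the pseudocode above; the Node class is an illustrative assumption, repeated here so the sketch is self-contained.

class Node:
    def __init__(self, key):
        self.key, self.left, self.right, self.parent = key, None, None, None

def tree_minimum(x):
    # Follow left children until there is no left child.
    while x.left is not None:
        x = x.left
    return x

def tree_successor(x):
    # Return the node with the smallest key greater than x.key, or None.
    if x.right is not None:
        return tree_minimum(x.right)         # leftmost node of the right subtree
    y = x.parent
    while y is not None and x is y.right:    # climb until we leave a left subtree
        x, y = y, y.parent
    return y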

20 Binary search trees: Insertion To insert a key v into the tree, we construct a new node z such that key[z] = v left[z] = right[z] = nil We assume that a new node is inserted as leaf. The TreeInsert algorithm modifies a tree T so that node z is inserted in the correct position in T. 20 / 152

21 Binary search trees: Insertion
Node z is inserted in the correct position in the tree.
TreeInsert(z)
  y = nil
  x = the root of the tree
  while x ≠ nil
    y = x
    if key[z] < key[x] then x = left[x]
    else x = right[x]
  parent[z] = y
  if y = nil then the root of the tree is z
  else if key[z] < key[y] then left[y] = z
  else right[y] = z
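
A minimal Python sketch of TreeInsert; the Node class, the convention of returning the (possibly new) root, and the usage example are illustrative assumptions.

class Node:
    def __init__(self, key):
        self.key, self.left, self.right, self.parent = key, None, None, None

def tree_insert(root, z):
    # Insert node z as a leaf; the root changes only if the tree was empty.
    y, x = None, root
    while x is not None:                     # walk down to find the parent of z
        y = x
        x = x.left if z.key < x.key else x.right
    z.parent = y
    if y is None:
        return z                             # the tree was empty, so z becomes the root
    if z.key < y.key:
        y.left = z
    else:
        y.right = z
    return root

# Build the tree from the insertion sequence on the next slide.
root = None
for k in [15, 5, 7, 18, 3, 17, 4, 13, 9, 20, 2]:
    root = tree_insert(root, Node(k))
print(root.key)   # -> 15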

22 Binary search trees: Insertion This could be obtained from the sequence 15,5,7,18,3,17,4,13,9,20,2 of insertions / 152

23 Binary search trees: Deletion If node has no children, just delete it (e.g. key 13) / 152

24 Binary search trees: Deletion If node has no children, just delete it (e.g. key 13) / 152

25 Binary search trees: Deletion If node has exactly one child, splice it out (e.g. key 16) / 152

26 Binary search trees: Deletion If node has exactly one child, splice it out (e.g. key 16) / 152

27 Binary search trees: Deletion If node has two children (e.g. key 5), splice out its successor, which will have at most one child, and then put the successor key in the place of the key to be deleted.

28 Binary search trees: Deletion If node has two children (e.g. key 5), splice out its successor, which will have at most one child, and then put the successor key in the place of the key to be deleted.

29 Binary search trees: Deletion
TreeDelete(z)
  if left[z] = nil or right[z] = nil then y = z
  else y = TreeSuccessor(z)
  if left[y] ≠ nil then x = left[y]
  else x = right[y]
  if x ≠ nil then parent[x] = parent[y]
  if parent[y] = nil then the root of the tree is x
  else if y = left[parent[y]] then left[parent[y]] = x
  else right[parent[y]] = x
  if y ≠ z then key[z] = key[y]
The first conditional determines the node y to splice out. The second determines the assignment for x: if y has no children, then x is nil, otherwise x is the child of y. The remaining lines splice y out.
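
A minimal Python sketch of TreeDelete following the pseudocode above; the Node class and helper functions are repeated so the sketch is self-contained, and all names are illustrative assumptions.

class Node:
    def __init__(self, key):
        self.key, self.left, self.right, self.parent = key, None, None, None

def tree_minimum(x):
    while x.left is not None:
        x = x.left
    return x

def tree_successor(x):
    if x.right is not None:
        return tree_minimum(x.right)
    y = x.parent
    while y is not None and x is y.right:
        x, y = y, y.parent
    return y

def tree_delete(root, z):
    # Delete node z from the tree; return the (possibly new) root.
    # y is the node actually spliced out: z itself, or z's successor.
    y = z if z.left is None or z.right is None else tree_successor(z)
    # x is y's only child, or None if y has no children.
    x = y.left if y.left is not None else y.right
    if x is not None:
        x.parent = y.parent
    if y.parent is None:
        root = x                              # y was the root
    elif y is y.parent.left:
        y.parent.left = x
    else:
        y.parent.right = x
    if y is not z:
        z.key = y.key                         # copy the successor's key into z's place
    return root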

30 Binary search trees: Deletion with example 1 TreeDelete(z) z = 13 if left[z] = nil or right[z] = nil then y = z y = 13 else y = TreeSuccessor(z) if left[y] nil then x = left[y] else x = right[y] x = nil if x nil then parent[x] = parent[y] if parent[y] = nil then the root of the tree is x else if y = left[parent[y]] then left[parent[y]] = x else right[parent[y]] = x right[12] = nil if y z then key[z] = key[y] 30 / 152

31 Binary search trees: Deletion with example 2 TreeDelete(z) z = 16 if left[z] = nil or right[z] = nil then y = z y = 16 else y = TreeSuccessor(z) if left[y] nil then x = left[y] else x = right[y] x = 20 if x nil then parent[x] = parent[y] parent[20] = 15 if parent[y] = nil then the root of the tree is x else if y = left[parent[y]] then left[parent[y]] = x else right[parent[y]] = x right[15] = 20 if y z then key[z] = key[y] 31 / 152

32 Binary search trees: Deletion with example 3 TreeDelete(z) z = 5 if left[z] = nil or right[z] = nil then y = z else y = TreeSuccessor(z) y = 6 if left[y] nil then x = left[y] else x = right[y] x = 7 if x nil then parent[x] = parent[y] parent[7] = 10 if parent[y] = nil then the root of the tree is x else if y = left[parent[y]] then left[parent[y]] = x left[10] = 7 else right[parent[y]] = x if y z then key[z] = key[y] key[5] = 6 32 / 152

33 Binary search trees: Shape of tree Sequence of keys being inserted affects the shape of the tree. If keys are in order in the sequence of insertion (e.g. 1,2,3,4,..etc), then the tree is a chain. Some sequences give a balanced tree (e.g. 4,2,6,1,3,5,7) / 152

34 Binary search trees: Algorithms These algorithms are useful in a wide variety of applications Their performance depends on the shape of the tree. Best case when tree is balanced hence giving log n performance Worst case when tree is a chain hence giving linear time performance Worst case can occur in practice (e.g. when keys are in order, in reverse order, alternating large / small keys, etc.). If building tree from N keys, then there are N! possible sequences. If each of these is equally likely, then average time performance is approximately 2lnN time where ln is natural log. 34 / 152

35 Binary search trees: Balanced trees Balancing of tree may be better than using a random sequence of insertions / 152

36 2-3-4 trees: A type of balanced tree A 2-3-4 tree is always balanced. Three types of node 2-node which contains 1 key. If it is not a leaf, it has 2 children. 3-node which contains 2 keys. If it is not a leaf, it has 3 children. 4-node which contains 3 keys. If it is not a leaf, it has 4 children. The only way for the tree to grow in height is when the root is a 4-node and this is split to create three 2-nodes.

37 2-3-4 trees: Types of node 2-node: contains key k1; its left child holds keys < k1 and its right child holds keys ≥ k1. 3-node: contains keys k1, k2; its children hold keys < k1, keys between k1 and k2, and keys ≥ k2. 4-node: contains keys k1, k2, k3; its children hold keys < k1, keys between k1 and k2, keys between k2 and k3, and keys ≥ k3.

38 2-3-4 trees: Searching To find a key, start at root. If search key is not at root, select appropriate child, and search by recursion. H E R,V A,B,C F J,K T X,Z 38 / 152

39 2-3-4 trees: Insertion Insertion is always at a leaf. To find correct leaf, start at root, and find the correct branch. H E R,V A,B,C F J,K T X,Z If the leaf is a 2-node or 3-node, just add the key. For example, to insert S, we obtain. H E R,V A,B,C F J,K S,T X,Z 39 / 152

40 2-3-4 trees: Insertion Insertion is always at a leaf. To find correct leaf, start at root, and find the correct branch. H E R,V A,B,C F J,K S,T X,Z If the leaf is a 4-node, then we need to split the node, by passing up one of the middle keys to its parent. For example, to insert D, we obtain. H B,E A C,D F R,V J,K S,T X,Z 40 / 152

41 2-3-4 trees: Splitting a 4-node with 2-node parent To split a 4-node with a 2-node as parent, pass the middle key up to the parent, and put the remaining keys into two 2-nodes. Before: the parent is a 2-node with key k and the child is a 4-node with keys k1, k2, k3. After: the parent is a 3-node containing k and k2, and the child is replaced by two 2-nodes holding k1 and k3.

42 2-3-4 trees: Splitting a 4-node with 3-node parent To split a 4-node with a 3-node as parent, pass the middle key up to the parent, and put the remaining keys into two 2-nodes. Before: the parent is a 3-node with keys k and k′ and the child is a 4-node with keys k1, k2, k3. After: the parent is a 4-node containing k, k′ and k2, and the child is replaced by two 2-nodes holding k1 and k3.

43 2-3-4 trees: Splitting a 4-node at the root The root is a 4-node with keys k1, k2, k3 and four subtrees. Split it into three 2-nodes: k2 becomes the new root, with 2-nodes holding k1 and k3 as its children; the four original subtrees are attached, in order, below k1 and k3.

44 2-3-4 trees: Insertion Use the policy of splitting a 4-node when we come to it at insertion time (i.e. when travelling down a branch to insert a new key). G B J,T A C,D,F H K,R X Construct this tree from the sequence A,B,F,X,G,H,J,K,T,D,C,R.

45 2-3-4 trees: Non-unique keys Non-unique keys can be accommodated by relaxing the definitions. I A,E,G A, A,C E,E H N,R L,M P S,X Construct this tree from A,S,E,A,R,C,H,I,N,G,E,X,A,M,P,L,E. 45 / 152

46 B-trees A B-tree is a balanced tree Each node is called a page. The order of a B-tree is a number n that provides bounds on the number of keys per page. Each page has at most 2n keys. Each page (apart from the root) has at least n keys. Each page is a leaf or it has m + 1 children (where m is the number of keys on that page).

47 B-trees: Example For order n = 2, with sequence of insertions: 20, 40, 10, 30, 15, 35, 7, 26, 18, 22, 5, 42, 13, 46, 27, 8, 32, 38, 24, 45, 25, 2, 14, ,20 2,5,7,8 13,14,15,18 22,24 30,40 26,27,28 32,35,38 42,45,46 47 / 152

48 B-trees: Example For order n = 1, with sequence of insertions: 20, 40, 10, 30, 15, 35, 7, 26, 18, 22, 5, 42, 13, 46, 27 (i.e. the first part of the sequence in the previous example) , / 152

49 B-trees: Searching If there are a large number of keys on a page, use binary search, otherwise use linear search. Keys and pointers on a page k1 k2 ... km p0 p1 p2 ... pm−1 pm When looking for a key x that is not equal to any ki on the page: 1 if x < k1, then continue on page p0 2 if ki < x < ki+1, then continue on page pi 3 if km < x, then continue on page pm If the relevant pointer is nil, then x does not exist in the tree.
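
A minimal Python sketch of searching a B-tree page by page, as described above; the Page class, the use of binary search within a page, and the small example tree (taken from the later insertion example) are illustrative assumptions.

import bisect

class Page:
    # A B-tree page: sorted keys k1..km and child pointers p0..pm (no children for a leaf).
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []

def btree_search(page, x):
    # Return (page, index) locating x, or None if x is not in the tree.
    i = bisect.bisect_left(page.keys, x)         # binary search within the page
    if i < len(page.keys) and page.keys[i] == x:
        return page, i
    if not page.children:                        # reached a leaf without finding x
        return None
    return btree_search(page.children[i], x)     # continue on the appropriate child page

root = Page([20], [Page([7, 10, 15, 18]), Page([26, 30, 35, 40])])
page, i = btree_search(root, 35)
print(page.keys[i])            # -> 35
print(btree_search(root, 9))   # -> None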

50 B-trees: Insertion Each insertion of a new key is at a leaf page. If the page has space (i.e. the current number of keys is less than 2n, where n is the order of the tree), then insert the new key into the page. If the page is full, then we need to split the page: With the new key, there will be 2n + 1 keys. If the keys are in order on a page, take the middle key, and insert it in the parent page. Then take the keys left of the middle key and put them into the page left of the middle key, and take the keys right of the middle key and put them into the page right of the middle key.

51 B-trees: Insertion 20 7,10,15,18 26,30,35,40 To add key 22 to the above tree, we try to insert into the right leaf. However, this is full (assuming n = 2). Since the items are 22,26,30,35,40, the middle item is 30. This then passes to the parent, and the resulting tree is below. 20,30 7,10,15,18 22,26 35,40 51 / 152

52 B-trees: Insertion 20,30 7,10 25,28 35,40 To add key 8 to the above tree, we try to insert into the left leaf. However, this is full (assuming n = 1). The middle item, which is 8, passes to the parent which is also full. The middle item of the parent, which is 20, forms a new root, and the resulting tree is below ,28 35,40 52 / 152

53 B-trees: Deletion Let x be the key to be deleted. If x is on leaf page, delete it. If x is not on leaf page, then find successor (or predecessor), and take that to overwrite x. The successor (or predecessor) is always on the leaf page. After deletion, need to check that B-tree conditions are maintained: If not, need to combine neighbouring pages and transfer a key from parent. 53 / 152

54 B-trees: Deletion 25 10,20 2,5 13,14 22,24 30,40 26,27,28 32,35 41,42 To delete key 25, use key 24 or key 26 to replace it ,20 2,5 13,14 22,24 30,40 27,28 32,35 41,42 54 / 152

55 B-trees: Deletion 20,30 7,10,15,18 22,26 35,40 To delete key 30 from the above tree, use 26 or 35 from the leaves. 20,35 7,10,15,18 22,26 40 But the B-tree conditions are violated, so restructure to the following. 20 7,10,15,18 22,26,35,40

56 B-trees: Scalable A B-tree with 1000 keys per page will have a branching factor of 1001. The first level of the tree will have 1 page and 1000 keys. The second level will have 1001 pages and 1,001,000 keys. The third level will have 1,002,001 pages and 1,002,001,000 keys. Hence, a B-tree of order 1000 and of height 2 can have over 1 billion keys, and any search needs to consider at most 3 pages.

57 B-trees: Trade-off when fixing order of tree Smaller page means a higher tree, and therefore more pages need to be considered when searching, and therefore more load/unload operations. But, each load/unload operation is quicker, and searching each page is quicker. Larger page means a lower tree, and therefore fewer pages need to be considered when searching, and therefore fewer load/unload operations. But, each load/unload operation is slower, and searching each page is slower. 57 / 152

58 B-trees: Worst-case performance Let n be the minimum number of keys per page and N be the number of keys in the tree Number of pages accessed = height of tree ≈ log_n N Search per page ≈ log_2 n (using binary search) Worst case ≈ (Number of pages accessed) × (Search per page) ≈ (log_n N) × (log_2 n) = log_2 N

59 Hashing There are many occasions where tree methods for searching are not appropriate, such as Compilers where rapid access to information about symbols is required Sparse matrices where there are many 0 elements Hashing offers a good time-space compromise If no space limitation, then use the key as the memory address (and so if a particular key is not used, the space is vacant) If no time limitation, then use sequential search (and so each space is used).

60 Hashing We have keys k1,..., kn and a hash table of size m. For each key k, we calculate some function h(k) which is an integer in the range 1 ≤ h(k) ≤ m. We use the location h(k) in the table 1,..., m as the place to store information about k (or at least a pointer to where it is). The set of possible key values can be very large (e.g. all names of up to 32 characters long) but not many are expected to be used.

61 Hashing: Comparison with direct address tables Suppose we have the numbers 0,..,9 as keys, but we plan to use at most 3 of them. key data XXXX YYYY ZZZZ 9 Direct Address Table key data h(2) XXXX h(5) YYYY h(8) ZZZZ Hash Table 61 / 152

62 Hashing: Choice of hash function Domain of h (Universe of keys) h(n) Range of h (Addresses) For a hash function h, we use a function that has the following properties It is simple to calculate The values of h(k) should be well distributed in 1,..., m. Use hash function for entering and retrieving information. 62 / 152

63 Hashing: Collisions We may have k1 ≠ k2, but h(k1) = h(k2). This is a collision. If m is sufficiently large, and h is a good choice of function, then collisions should be rare.

64 Hashing Example of a hash function h(k) = k mod m (where m is a prime number) Let m be 101 and let AKEY be a key. We can encode the ith letter of the alphabet as the 5-bit binary representation of i. So AKEY is coded as 00001 01011 00101 11001, which is 44217 in decimal. Since 44217 mod 101 = 80, the key AKEY hashes to position 80 in the table. Many other keys (e.g. BARH) also hash to position 80.
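
A minimal Python sketch of this hash function: each letter is read as a 5-bit digit (A=1, ..., Z=26), so the key is interpreted as a base-32 number and reduced modulo the prime m; the function name is an illustrative assumption.

def hash_key(key, m=101):
    value = 0
    for ch in key:
        value = value * 32 + (ord(ch) - ord("A") + 1)   # append the next 5-bit digit
    return value % m

print(hash_key("AKEY"))   # 44217 mod 101 -> 80
print(hash_key("BARH"))   # also hashes to 80: a collision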

65 Hashing Each 5-bit number equals a single digit base-32 number, so AKEY is the base-32 number with digits 1, 11, 5, 25. Now suppose m is 32, and so h(k) = k mod 32. Then the hash value of any key is simply the value of its last character. Therefore, if there are lots of keys with the same last character, then there are lots of collisions. The simplest way to avoid this problem is to make m prime.

66 Hashing: Linear probing Linear probing is a simple solution for handling collisions For a collision at s, try s + 1, then s + 2, etc. If position m is occupied, try position 0 (effectively table is circular). So finding next location is x = (x + 1) mod m But linear probing may lead to clustering. 66 / 152
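
A minimal Python sketch of a hash table with linear probing; the class, the simple letter-sum hash, and the example keys are illustrative assumptions (the sketch also assumes the table never fills completely and that deletions are not needed).

class LinearProbingTable:
    def __init__(self, m=19):
        self.m = m
        self.slots = [None] * m                  # each slot holds a key or None

    def _hash(self, key):
        return sum(ord(c) - ord("A") + 1 for c in key) % self.m

    def insert(self, key):
        x = self._hash(key)
        while self.slots[x] is not None:         # collision: try the next slot
            x = (x + 1) % self.m                 # the table is effectively circular
        self.slots[x] = key

    def search(self, key):
        x = self._hash(key)
        while self.slots[x] is not None:
            if self.slots[x] == key:
                return x
            x = (x + 1) % self.m
        return None

table = LinearProbingTable()
table.insert("ANT")                # hashes to slot 16
table.insert("TAN")                # same hash, so probing places it in slot 17
print(table.search("TAN"))         # -> 17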

67 Hashing: Linear probing key data k 1 k x k y k z k 2 k 3 AAAA XXXX YYYY ZZZZ BBBB CCCC Suppose h(k 1 ) = h(k 2 ) = h(k 3 ). Also, suppose the three addresses after this are occupied. So k 2 requires 4 probes, and k 3 requires 5 probes. 67 / 152

68 Hashing: Linear probing A S A S A E S A A E S A A E R S A A C E R S A A C E H R S A A C E H I R S A A C E H I N R S A A C E G H I N R S A A C E E G H I N R S A A C E E G H I X N R S A A C A E E G H I X N R S A A C A E E G H I X M N R S A A C A E E G H I X M N P R S A A C A E E G H I X L M N P R S A A C A E E G H I X E L M N P R Insertions from following sequence into table where m = 19, and each probe is denoted by the underscore symbol. A S E A R C H I N G E X A M P L E 68 / 152

69 Hashing: Quadratic probing Avoids some of the clustering arising with linear probing. For a collision at s, try s + p1, then s + p2, ..., where p1 = 1, p2 = 3, p3 = 6, ... More generally, let p1 = 1, and let pi = pi−1 + i. For example, consider k5 which hashes to the same location as k1. For k5, there are four probes (because of collisions with k1, k2, k3, and k4), before k5 is inserted. key k1 k2 k3 k4 k5 data AAAA BBBB CCCC DDDD EEEE

70 Hashing: Chaining An alternative to linear and quadratic probing. If there is a collision at location s, a linked list is set at s. All items that hashed to the same location are entered into this list. Collisions are reduced, but now need to search the linked list. Data structure is expandable. 70 / 152
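
A minimal Python sketch of chaining, using the letter codes modulo 11 from the table two slides below; the class and method names are illustrative assumptions.

class ChainedHashTable:
    def __init__(self, m=11):
        self.m = m
        self.chains = [[] for _ in range(m)]      # one list (chain) per position

    def _hash(self, letter):
        return (ord(letter) - ord("A")) % self.m  # A=0, B=1, ..., then mod 11

    def insert(self, letter):
        self.chains[self._hash(letter)].append(letter)

    def search(self, letter):
        return letter in self.chains[self._hash(letter)]   # scan the (short) chain

table = ChainedHashTable()
for ch in "ASEARCHINGEXAMPLE":
    table.insert(ch)
print(table.chains[0])     # -> ['A', 'A', 'A', 'L']
print(table.search("G"))   # -> True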

71 Hashing: Chaining f (k 1 ) f (k 2 ) f (k 4 ) f (k 7 ) f (k 9 ) k 1 k 2 k 4 k 7 k 9 Keys k 1, k 2, and k 4, hash to the same position, and k 7 and k 9 hash to the same position. A linked list is created for each of these sets of keys. 71 / 152

72 Hashing: Chaining
Letter Code mod 11
A 0 0     N 13 2
B 1 1     O 14 3
C 2 2     P 15 4
D 3 3     Q 16 5
E 4 4     R 17 6
F 5 5     S 18 7
G 6 6     T 19 8
H 7 7     U 20 9
I 8 8     V 21 10
J 9 9     W 22 0
K 10 10   X 23 1
L 11 0    Y 24 2
M 12 1    Z 25 3

73 Hashing: Chaining
Position Chain
0 A,A,A,L
1 M,X
2 C,N
3
4 E,E,E,P
5
6 G,R
7 H,S
8 I
9
10
Key:  A S E A R C H I N G E X A M P L E
Hash: 0 7 4 0 6 2 7 8 2 6 4 1 0 1 4 0 4

74 Hashing: Comparison with other techniques Comparison with binary search trees. Hashing is preferred to binary search trees for many applications because it is simple and faster if the table is sufficiently large Binary search trees have the advantage that there is no limit on the number of insertions (i.e. a dynamic data structure). Comparison with B-trees. Hashing is fast to construct and can involve many changes to the stored information. But size is limited and the performance is difficult to predict. B-trees are slow to construct and are used when there are relatively few changes to the stored information (i.e. the structure concerning the keys). But they are expandable and the performance is predictable.

75 Graph algorithms Techniques based on depth-first search Finding articulation points Finding strongly-connected components Optimizing network flow Isomorphic graphs 75 / 152

76 Finding articulation points
Search(G)
  for each v ∈ N
    mark[v] = not-visited
  for each v ∈ N
    if mark[v] ≠ visited then DepthFirstSearch(v)
DepthFirstSearch(v)
  mark[v] = visited
  for each node w adjacent to v
    if mark[w] ≠ visited then DepthFirstSearch(w)

77 Finding articulation points / 152

78 Finding articulation points An articulation point is a node x in a graph such that if x is deleted and all arcs involving x are deleted, then the graph is no longer connected. 78 / 152

79 Finding articulation points 1 Carry out a depth-first search in G, starting from any node. Let T be the tree generated by the depth-first search, and for each node v of the graph, let prenum[v] be the number assigned by the search. 2 Traverse the tree T in postorder. For each node v visited, calculate lowest[v] as the minimum of: prenum[v]; prenum[w] for each node w such that there exists an edge (v, w) in G that has no corresponding edge in T; and lowest[x] for every child x of v in T. 3 Articulation points are now determined as follows. The root of T is an articulation point of G if and only if it has more than one child. A node v other than the root of T is an articulation point of G if and only if v has a child x such that lowest[x] ≥ prenum[v].
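
A minimal Python sketch of this prenum/lowest method; the adjacency-list representation (a dict mapping each node to its neighbours), the recursive formulation, and the small example are illustrative assumptions, and the graph is assumed to be connected and undirected.

def articulation_points(graph):
    prenum, lowest, parent = {}, {}, {}
    points, counter = set(), [0]

    def dfs(v):
        counter[0] += 1
        prenum[v] = lowest[v] = counter[0]
        children = 0
        for w in graph[v]:
            if w not in prenum:                       # (v, w) is an edge of the tree T
                parent[w] = v
                children += 1
                dfs(w)
                lowest[v] = min(lowest[v], lowest[w])
                # non-root rule: child w cannot reach above v
                if parent.get(v) is not None and lowest[w] >= prenum[v]:
                    points.add(v)
            elif w != parent.get(v):                  # edge of G with no counterpart in T
                lowest[v] = min(lowest[v], prenum[w])
        # root rule: the root is an articulation point iff it has more than one child
        if parent.get(v) is None and children > 1:
            points.add(v)

    dfs(next(iter(graph)))
    return points

# Example: 1, 2 and 3 form a cycle and 4 hangs off 3, so 3 is the only articulation point.
g = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
print(articulation_points(g))   # -> {3}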

80 Finding articulation points Node Prenum Lowest / 152

81 Finding articulation points Find articulation points in the graph when either of the following arcs have been added. 1 Arc from 3 to 4. 2 Arc from 5 to / 152

82 Finding strongly connected components A directed graph is strongly connected if there exists a path from u to v and also a path from v to u for every distinct pair of nodes u and v If a directed graph is not strongly connected, we are interested in the largest set of nodes such that the corresponding subgraphs are strongly connected. Each of these subgraphs is called a strongly connected component of the original graph. 82 / 152

83 Finding strongly connected components
Use a modified version of depth-first search, where num is a global variable that is initialized as 0.
Search(G)
  for each v ∈ N
    mark[v] = not-visited
  for each v ∈ N
    if mark[v] ≠ visited then DepthFirstSearch(v)
DepthFirstSearch(v)
  mark[v] = visited
  for each node w adjacent to v
    if mark[w] ≠ visited then DepthFirstSearch(w)
  num = num + 1
  postnum[v] = num

84 Finding strongly connected components 1 Carry out a depth-first search of the graph starting from an arbitrary node. For each node v of the graph let postnum[v] be the number assigned during the search. 2 Construct a new graph G′. G′ is the same as G except the direction of every edge is reversed. 3 Carry out a depth-first search in G′. Begin this search at the node w that has the highest value of postnum. (If G contains n nodes, it follows that postnum[w] = n.) If the search starting at w does not reach all the nodes, choose as the second starting point the node that has the highest value of postnum among all the unvisited nodes; and so on. 4 Each tree in the resulting forest corresponds to one strongly connected component of G.
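
A minimal Python sketch of this method (a depth-first search to compute postnum order, then a second search on the reversed graph in decreasing postnum order); the dict-of-lists representation and the example are illustrative assumptions, and every node is assumed to appear as a key of the dict.

def strongly_connected_components(graph):
    order = []                                   # nodes in increasing order of postnum
    visited = set()

    def dfs(g, v, out):
        visited.add(v)
        for w in g[v]:
            if w not in visited:
                dfs(g, w, out)
        out.append(v)                            # numbered on the way back, like postnum

    for v in graph:
        if v not in visited:
            dfs(graph, v, order)

    reverse = {v: [] for v in graph}             # G' has the direction of every edge reversed
    for v in graph:
        for w in graph[v]:
            reverse[w].append(v)

    visited.clear()
    components = []
    for v in reversed(order):                    # start from the highest remaining postnum
        if v not in visited:
            comp = []
            dfs(reverse, v, comp)
            components.append(comp)              # one tree of the forest = one component
    return components

# Example: 1 -> 2 -> 3 -> 1 is one strongly connected component; 4 is on its own.
g = {1: [2], 2: [3], 3: [1, 4], 4: []}
print(strongly_connected_components(g))          # -> [[2, 3, 1], [4]]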

85 Finding strongly connected components Graph G Node Postnum / 152

86 Finding strongly connected components Graph G′ Node Postnum

87 Finding strongly connected components G and G′ have exactly the same strongly connected components. Therefore, there is a path from x to y in a strongly connected component of G iff there is a path from x to y in a strongly connected component of G′. The highest postnum (of the remaining nodes) is the root of a search used to visit some set of nodes: so starting the search here in G′ forces us to visit the same nodes but in reverse order.

88 Optimizing network flow Network flow problems are important in industry, commerce, government, etc. For example Network of pipes from an oilfield to a refinery Traffic flow on roads Material flow in a factory It is partly an operations research problem Algorithms based on weighted graphs can provide valuable solutions 88 / 152

89 Optimizing network flow A 0/6 0/8 B 0/3 0/3 C 0/6 0/3 D E 0/8 0/6 F There is a source and a sink. Each arc is labelled with current flow and capacity. Flow into a node is the sum of the current flow on the incoming arcs. Flow out of a node is the sum of the current flow on the outgoing arcs. Flow into a node must equal the flow out of a node. Flow at the source must equal the flow at the sink. The aim is to increase flow through the network

90 Optimizing network flow: An example A A 0/6 0/8 6/6 0/8 0/6 B 0/3 0/3 C 0/3 6/6 B 0/3 0/3 C 0/3 D E D E 0/8 0/6 F 6/8 0/6 F Current flow = 0 Current flow = 6 90 / 152

91 Optimizing network flow: An example A A 6/6 3/8 6/6 6/8 6/6 B 0/3 0/3 C 3/3 3/6 B 3/3 3/3 C 3/3 D E D E 6/8 3/6 F 6/8 6/6 F Current flow = 9 Current flow = / 152

92 Optimizing network flow: Graph traversal A 6/6 6/8 For network flow, a normal path is a sequence of forward traversals of arcs. For network flow, a super path is a sequence of forward traversals and backwards traversals of arcs. For example, A,C,D,B,E,F is a super path, where D to B is a backwards traversal. B 3/3 3/3 C 3/6 3/3 D E 6/8 6/6 F 92 / 152

93 Optimizing network flow: Greedy approach is insufficient Greedy selection in network flow For any normal path (i.e. a sequence of forward traversals of arcs), from source to sink, if there is spare capacity x in each of the arcs in the path, then increase flow by x on each arc in the path. Since the greedy approach builds a solution without backtracking, once flow is added to an arc, it cannot be subtracted. Yet, we need to subtract flow in order to redistribute flow from some arcs to optimize flow in the whole network. 93 / 152

94 Optimizing network flow: Ford-Fulkerson Method Apply parts A and B exhaustively. Part A For any normal path (i.e. a sequence of forward traversals of arcs), from source to sink, if there is spare capacity x in each of the arcs in the path, then increase flow by x on each arc in the path. Part B For any super path (i.e. a sequence of forward traversals and backwards traversals of arcs), from source to sink, IF each forward traversal of an arc has at least x unused capacity, AND each backward traversal of an arc has current flow of at least x, THEN flow can be increased by x by adding x to flow of forward traversals of arcs AND subtracting x from flow of backward traversals of arcs. 94 / 152

95 Optimizing network flow: An example A 5/5 0/5 A 5/5 5/5 B 0/5 5/5 C B 5/5 0/5 C D E 0/5 D E 5/5 0/5 5/5 F 5/5 5/5 F 95 / 152

96 Optimizing network flow: Algorithmic issues A 1/1000 0/1000 B 1/1 C 0/1000 1/1000 D 1: Add 1 to path ABCD 2: Make ABD = 1 & ACD = 1 3: Add 1 to path ABCD 4: Make ABD = 2 & ACD = 2 5: Add 1 to path ABCD 6: Make ABD = 3 & ACD = 3 7: Add 1 to path ABCD 8: Make ABD = 4 & ACD = : Make ABD = 1000 & ACD = / 152

97 Optimizing network flow: Algorithmic issues For the Ford-Fulkerson Method Problem: Need to limit the length of the paths considered. Solution: Use the shortest available path from source to sink until maximum flow is obtained. Algorithm based on breadth-first search. Treat graph as undirected. Consider paths of length n, before paths of length n + 1.
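
A minimal Python sketch of this shortest-augmenting-path approach: flow is pushed along the shortest super path found by breadth-first search, with backward traversals handled as residual capacity on the reverse arc. The dict-of-dicts representation and the example (the pathological network from the earlier slide) are illustrative assumptions.

from collections import deque

def max_flow(capacity, source, sink):
    # capacity[u][v] is the capacity of arc u -> v; returns the maximum flow value.
    nodes = set(capacity) | {v for u in capacity for v in capacity[u]}
    residual = {u: {} for u in nodes}
    for u in capacity:
        for v, c in capacity[u].items():
            residual[u][v] = residual[u].get(v, 0) + c   # forward residual capacity
            residual[v].setdefault(u, 0)                 # backward residual starts at 0

    flow = 0
    while True:
        # breadth-first search for the shortest super path from source to sink
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow                                  # no augmenting path left
        # find the bottleneck x along the path, then push x units of flow
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        x = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= x                          # forward traversal: add flow
            residual[v][u] += x                          # allows a later backward traversal
        flow += x

cap = {"A": {"B": 1000, "C": 1000}, "B": {"C": 1, "D": 1000}, "C": {"D": 1000}, "D": {}}
print(max_flow(cap, "A", "D"))   # -> 2000, found with only two augmenting paths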

98 Optimizing network flow: Algorithmic issues A A B C B C C D B D D D D Maximal flow obtained at depth 2 using paths ABD and ACD, and so paths ABCD and ACBD do not need to be considered. 98 / 152

99 Isomorphic graphs Bijection A function f : X → Y is an injection (one-to-one) iff for all y ∈ Y there is at most one x ∈ X such that f(x) = y. A function f : X → Y is a surjection (onto) iff the range of f is Y. A function f : X → Y is a bijection iff f is an injection and a surjection. Isomorphism G1 = (N1, A1) and G2 = (N2, A2) are isomorphic if there is a bijection f : N1 → N2 such that (i, j) ∈ A1 iff (f(i), f(j)) ∈ A2.

100 Isomorphic graphs a A e b C D d c E B Let f(a) = A, f(b) = B, f(c) = C, f(d) = D, f(e) = E. 100 / 152

101 Isomorphic graphs Showing that G1 and G2 are isomorphic is computationally hard: the problem is in NP and no polynomial-time algorithm is known. However, if some invariant (i.e. some property that is preserved under a graph isomorphism) fails to hold, then G1 and G2 are not isomorphic. Some invariants: number of nodes number of arcs degrees of each node existence of simple cycles...

102 File compression File Encoding Compressed file Aim for compressed file to use substantially less space than the input file. Decoding File 102 / 152

103 File compression Input is a file X Output is a another version of X that uses substantially less space Assume that most files have a lot of redundancy (i.e. a relatively low information content), and so use methods to save space by exploiting this. Examples of file compression Text files Image files Audio files Some characters used more than others Often large homogeneous areas Often repeated patterns 103 / 152

104 File compression: Space savings Space savings with file compression depend on the input. Text file - often 20-50% Binary file - often 50-90% Note that general purpose compression techniques must make some files longer, otherwise we could apply them repeatedly to obtain arbitrarily small files.

105 File compression: Run length encoding For example, text made up of letters but no numbers. AAAABBBAABBBBBCCCCCCCCDAB This can be encoded as follows, where sequences of one or two iterations of a character are not rewritten. 4A3BAA5B8CDAB
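
A minimal Python sketch of this run length encoding; it assumes that runs of three or more characters are rewritten as a count followed by the character, that no count exceeds 9, and the function name is illustrative.

def run_length_encode(text):
    out, i = [], 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1                                        # extend the current run
        run = j - i
        out.append(f"{run}{text[i]}" if run >= 3 else text[i] * run)
        i = j
    return "".join(out)

print(run_length_encode("AAAABBBAABBBBBCCCCCCCCDAB"))     # -> 4A3BAA5B8CDAB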

106 File compression: Run length encoding Raster file Encoding For larger raster files, where there are relatively few changes between 0 and 1 in a row, then the savings can be dramatic. 106 / 152

107 File compression: Run length encoding with escape Encodation Encodation is a sequence of the following items Escape character, Count of character, Character being counted For example, AAAABBBAABBBBBCCCCCCCC This can be encoded as follows, with Q as escape character, using characters to denote counts (i.e. A=1, B=2, C=3, D=4, E=5, etc), and where a sequence of less than four iterations of a character is not rewritten. QDABBBAAQEBQHC 107 / 152

108 File compression: Run length encoding with escape What if Q is in the input text file? Option 1: Use Q(Count of Q)Q. For example, QBQ denotes a sequence of two Qs. So three characters required even if only one Q in the input. Option 2: Use Q(space). For example, AQQBQC is encoded as AQ Q BQ C. These options can make the output file longer than the input file. 108 / 152

109 File compression: Run length encoding with escape Very long runs encoded by multiple escape sequences. For example, a sequence of 51 As would be encoded by QZAQYA. A little redundancy can help in error recovery. For example, end of line characters means that if there is an error on the line, the next line can start afresh. However, for text files, run length encoding is too crude for more than a few percent space saving in general. 109 / 152

110 File compression: Fixed versus variable length encoding For fixed length encoding in bits, use for instance 8 bits per character (A = , B = , C = , etc). So, ABBA is encoded as For variable length encoding, instead of using 8 bits to code each character, use fewer bits for frequent characters. For example, for text ABRACADABRA, use A = 0, B = 1, C = 10, D= 11, and R = / 152

111 File compression: Decoding To encode ABRACADABRA, with A = 0, B = 1, C = 10, D = 11, and R = 01, we require a delimiter (here a space) to ensure correct decoding: 0 1 01 0 10 0 11 0 1 01 0 If no delimiter is used, then 010101001101010 could also be decoded as RRRARBRRA or ... A delimiter is not needed if no code is the prefix of another. For example, for text ABRACADABRA, use A = 11, B = 00, C = 010, D = 10, and R = 011.

112 File compression: Code trie A type of binary tree where each leaf is labelled with a key Generating code Each code is generated from the sequence of arcs from root to leaf If rightwards arc, then read 1 If leftwards arc, then read 0 A C E R S Key Code A 0000 C 0001 E 001 R S

113 File compression: Code trie Given a set of characters, there is a choice of code trie. A B D A R C R B D C 113 / 152

114 File compression: Code trie The trie representation guarantees that no character code is the prefix of another. So a sequence of character codes without delimiters can be uniquely decoded if character codes generated from a code trie. For example, can be decoded to what word? A C E R S Key Code A 0000 C 0001 E 001 R S / 152

115 File compression: Code trie An application of a code trie (that is not for file compression) is the telephone numbering system. Here, we need a branching factor up to 10. National/International Local International USA Liverpool National Bloomsbury UCL F E So no number is the prefix of any other number (e.g ). 115 / 152

116 Heaps A heap can be represented by a complete binary tree. A heap can also be represented by an array using the following indexing. Root is position 1 in the array. If the root of a subtree is at position i, then the root of its left subtree is at position 2i and the root of its right subtree is at position 2i + 1. In this example of a heap, the number denotes the index in the array, rather than the key value. In the remaining examples of heaps, the number will denote the key value.

117 Heaps Heap property For all nodes x in the tree, key[parent[x]] ≥ key[x]

118 Heaps: Extract maximum item Extract key 16 and find home for key 1 (i.e. the key in the leaf that is removed) / 152

119 Heaps: Extract maximum item 1 Remove maximum key. This creates a free slot. 2 Remove last leaf of complete binary tree, and put key into temp variable. 3 Select maximum key from children of root, and put into free slot at root. This creates a free slot one level down. 4 While temp is less than left and right child of free slot, select maximum of left and right child of free slot, and put that selection into the free slot. This creates a free slot one level down. 5 If temp is greater than left and right child of free slot, put temp into the free slot. Class exercise: Extract maximum (i.e. 14) from previous heap 119 / 152

120 Heaps: Insertion To insert key 15, create new leaf, and find home for key 15 on the branch down to the new leaf. This may involve moving some items down until new leaf has a key / 152

121 Heaps: Insertion 1 Add leaf to the complete binary tree. Initially this is a free slot. 2 Focus on the branch from root down to the new leaf. Starting at the root, go down looking for correct place in which to insert the new key. This involves comparing the new key with the key of the current node. 3 To insert new item at the correct node, move the current key down one level in the branch. Repeat this until an item is inserted into the free slot at the leaf. Class exercise: Insert 12 into previous heap 121 / 152

122 Heaps: Heapify algorithm Input for the Heapify algorithm is an index i where either of the following hold (and so violating heap property). key[i] < key[left[i]] key[i] < key[right[i]] Assume that heap property holds at subtree at left[i] and subtree at right[i]. If these conditions hold, then the Heapify algorithm swaps key[i] with one of the children of i, and then the Heapify algorithm is applied recursively until the heap property holds on the branch from i to the leaf. 122 / 152

123 Heaps: Heapify algorithm In the following tree, the node with key 4 violates the heap property. The following is the result of swapping key 4 with key 14. The following is the result of swapping key 4 with key 8.

124 Heaps: Buildheap algorithm The Buildheap algorithm builds a heap with the heap property holding. It uses applications of the Heapify algorithm on increasingly large subtrees. All leaves satisfy the heap property Start at the rightmost non-leaf node and work towards the root (going from right to left in the array)
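
A minimal Python sketch of Heapify and Buildheap; it uses 0-based array indexing (the children of i are at 2i+1 and 2i+2, whereas the slides use positions 2i and 2i+1 in a 1-based array), and the example array is illustrative.

def heapify(a, i, n):
    # Restore the heap property at index i, assuming both subtrees below i already satisfy it.
    largest = i
    left, right = 2 * i + 1, 2 * i + 2
    if left < n and a[left] > a[largest]:
        largest = left
    if right < n and a[right] > a[largest]:
        largest = right
    if largest != i:
        a[i], a[largest] = a[largest], a[i]   # swap the violating key with its larger child
        heapify(a, largest, n)                # continue down the branch

def build_heap(a):
    # Apply heapify to every non-leaf node, from the last one back to the root.
    n = len(a)
    for i in range(n // 2 - 1, -1, -1):
        heapify(a, i, n)

a = [4, 1, 3, 2, 16, 9, 10, 14, 8, 7]
build_heap(a)
print(a)   # -> [16, 14, 10, 8, 7, 9, 3, 2, 4, 1]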

125 Heaps: Buildheap algorithm (i) 4 (iv) (ii) (v) (iii) (vi) / 152

126 Heaps: Buildheap algorithm Class exercise: Apply the Buildheap algorithm to the following heap / 152

127 Heaps: Reverse order We can use the heap property for the reverse order key[parent[x]] ≤ key[x] for all nodes x

128 Heaps: Priority queues A priority queue is based on a heap. It is a data structure for maintaining a set S of elements, each with an associated value called a key. It supports the following operations Insert(S,x) which inserts x into S Maximum(S) which returns the largest key in S ExtractMax(S) which removes and returns the largest key of S A priority queue is an alternative to queues and stacks It is important in a range of applications including Huffman coding, network flow, and finding minimum spanning trees. 128 / 152

129 File compression: Huffman coding algorithm Example The Huffman coding algorithm is a greedy algorithm. Given an item of text, it generates the optimal code trie (i.e. the code trie that produces the most compact encoding). It uses the frequency information about the characters in the text. For the text A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS, the frequency information is: Space = 11; A=3; B=3; C=1; D=2; E=5; F=1; G=2; I=6; L=2; M=4; N=5; O=3; P=1; R=2; S=4; T=3; and U=2.

130 File compression: Huffman coding algorithm 1 Outline of the Huffman coding algorithm 1 A tree node is assigned for each character with non-zero occurrence 2 Two nodes n1 and n2 with lowest frequency form a subtree with a new node as parent, and the sum of the n1 frequency and the n2 frequency is attached to the parent. 3 Step 2 is repeated with n1 and n2 removed, and the parent of n1 and n2 added. 2 Given a set of characters C, there are |C| − 1 merging operations (i.e. step 2). 3 The resulting code trie has |C| leaves.

131 File compression: Huffman coding algorithm
Let Q be a priority queue, let C be the set of characters being coded, and let f(c) be the frequency of character c ∈ C
Huffman(C)
  n = |C|
  let Q be the priority queue formed from C using f
  for i = 1 to n − 1
    z = AllocateNode()
    x = ExtractMin(Q)
    y = ExtractMin(Q)
    left[z] = x
    right[z] = y
    f[z] = f[x] + f[y]
    Insert(Q, z)
  return ExtractMin(Q)
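
A minimal Python sketch of the Huffman algorithm using the standard library heapq module as the priority queue; the trie representation (a character at a leaf, a pair at an internal node), the tie-breaking counter, and the exact code assignments produced are illustrative assumptions. The frequencies in the usage example are those of the worked example on the next slide.

import heapq

def huffman_codes(freq):
    # Build the code trie by repeatedly merging the two lowest-frequency nodes,
    # then read off each character's code from its root-to-leaf path.
    heap = [(f, i, c) for i, (c, f) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)                      # two lowest frequencies
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (t1, t2)))   # merge under a new parent
        counter += 1
    codes = {}
    def walk(trie, prefix):
        if isinstance(trie, tuple):
            walk(trie[0], prefix + "0")                      # leftwards arc reads 0
            walk(trie[1], prefix + "1")                      # rightwards arc reads 1
        else:
            codes[trie] = prefix or "0"
    walk(heap[0][2], "")
    return codes

freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
print(huffman_codes(freq))
# -> {'a': '0', 'c': '100', 'b': '101', 'f': '1100', 'e': '1101', 'd': '111'}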

132 File compression: Huffman coding example f:5 e:9 c:12 b:13 d:16 a:45 c:12 b:13 14 d:16 a:45 f:5 e:9 14 d:16 25 a:45 f:5 e:9 c:12 b: / 152

133 File compression: Huffman coding example a:45 c:12 b:13 14 d:16 f:5 e:9 a: c:12 b:13 14 d:16 f:5 e:9 133 / 152

134 File compression: Huffman coding example 100 a: c:12 b:13 14 d:16 f:5 e:9 134 / 152

135 File compression: An optimization problem Cost(T) is the cost of the trie (i.e. the number of bits used to encode the text file that was used to generate the code trie). C is the set of characters in the text file. f(c) is the frequency of the character c d(c) is the depth of c in the trie (i.e. the number of bits in the code for c). Cost of a code trie Cost(T) = Σ over c ∈ C of f(c) · d(c)

136 File compression: An optimization problem A B C d f d.f A B C C B A d f d.f A B C / 152

137 Cryptology Message Sender Coded message Key Recipient Message 137 / 152

138 Cryptology: Caesar cipher If a letter in the message is the nth letter of the alphabet, then replace by the (n + k)th letter, where k is the key. Example of encoding with k = 1 Message A T T A C K A T D A W N Encoding B U U B D L A B U A E B X O A message encoded with the Caesar cipher is easy to break because there are only 26 possibilities for k. 138 / 152
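
A minimal Python sketch of the Caesar cipher; to reproduce the example above it assumes a 27-character alphabet in which the space is the 0th letter (so a space shifted by 1 becomes A).

ALPHABET = " ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def caesar(message, k):
    # Shift every character k places in the alphabet above, wrapping around.
    return "".join(ALPHABET[(ALPHABET.index(c) + k) % len(ALPHABET)] for c in message)

print(caesar("ATTACK AT DAWN", 1))    # -> BUUBDLABUAEBXO
print(caesar("BUUBDLABUAEBXO", -1))   # shifting back by k decodes the message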

139 Cryptology: Substitution tables Using a substitution table is more robust than a Caesar cipher since there are 27! (approximately 10^28) possible tables if we use the alphabet plus space symbol. Example of substitution table A B C D E F G H I.. T T H E Q U I C K R.. V Example of encoding Message A T T A C K A T D A W N Encoding H V V H L T H V T Q H X O However, patterns in language can be used to break the code. E is the most common letter (in English), and so substitute it for the most frequently used letter in the encoded text,... Also, some pairs of letters are very frequent (e.g. ER) and others never occur (e.g. QJ).

140 Cryptology: Vigenere cipher A generalization of the idea of the Caesar cipher. A key is composed of a sequence of values k1,.., kn where each ki is used as the key for the ith character to be encoded. If the text is longer than the key, then the key is applied repeatedly. It is possible to have the key as long as the text to be encoded, in which case it is difficult to crack, but it is difficult to distribute the key. Example of encoding with k = 123 Key 1 2 3 1 2 3 1 2 3 1 2 3 1 2 Message A T T A C K A T D A W N Encoding B V W B E N A C W A F D X P

141 Cryptology: Public key systems Key distribution is a big problem for secure communication and e-commerce Long keys that are changed frequently and available for every citizen, whilst maintaining security and cost-effectiveness, are difficult to achieve. An alternative, which avoids key distribution, is a public key system (e.g. RSA).

142 Cryptology: Public key systems User has two types of key A Public Key which is widely known like a telephone number. A Secret Key which is kept secret like a credit card PIN. Encoding To transmit a message M, the sender uses the public key P of the intended recipient, giving P(M). Decoding To understand the encoded message P(M), the recipient uses his/her own secret key S to decode, giving S(P(M)) = M. 142 / 152

143 Cryptology: Requirements for a public key systems Fidelity requirement For any message M, S(P(M)) = M. Security requirement All pairs (S, P) are distinct, and deriving S from P is as hard as reading P(M). Viability requirement Both S and P are easy to compute. 143 / 152

144 Cryptology: Modulo arithmetic Modulo arithmetic is the same as normal arithmetic over the integers, except if working modulo n, then every result x is replaced by the element of {0, 1,..., n − 1} that is equivalent. For example: 1 ≡ 7 (mod 6), 7 ≡ 21 (mod 7), 21 ≡ 41 (mod 10); and (7 mod 6) = 1, (41 mod 10) = 1, (32 mod 7) = 4. Efficient modulo exponentiation via rewrites: a^(2c) mod n = (a^c)^2 mod n; a^(2c+1) mod n = a.(a^c)^2 mod n; a.b mod n = ((a mod n).(b mod n)) mod n
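
A minimal Python sketch of fast modular exponentiation using the rewrite rules above (this is the same result as the built-in pow(a, e, n)); the function name is illustrative.

def mod_exp(a, e, n):
    if e == 0:
        return 1 % n
    half = mod_exp(a, e // 2, n)          # a^(e div 2) mod n
    result = (half * half) % n            # a^(2c) mod n = (a^c)^2 mod n
    if e % 2 == 1:
        result = (a * result) % n         # a^(2c+1) mod n = a.(a^c)^2 mod n
    return result

print(mod_exp(120, 37, 3713))             # -> 1404, the same as pow(120, 37, 3713)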

145 Cryptology: RSA RSA uses arithmetical algorithms for very large integers, where N is 200 digits and p and s are 100 digits. Public key P is (N, p) Secret key S is (N, s) Each message is represented by one or more numbers, each less than N. Encoding C = P(M) = M^p mod N Decoding M = S(C) = C^s mod N

146 Cryptology: RSA Choose three large (approximately 100 digit) random prime numbers The largest of these is s The other two are x and y. Let N = xy. Calculating factors of N is computationally difficult (e.g. according to Wikipedia, a 232-digit number was factored using hundreds of machines over a span of 2 years). Choose p so that ps mod (x − 1)(y − 1) = 1

147 Cryptology: RSA Example ATTACK AT DAWN Let A=01, B=02, C=03,..., T=20,..., Z=26. To illustrate, we just use 2 digit primes. x = 47 y = 79 s = 97 So N = xy = 47 × 79 = 3713 Since 97p mod 3588 = 1, where (x − 1)(y − 1) = 46 × 78 = 3588, we get p = (3588 + 1)/97 = 37
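
A minimal Python sketch of this toy RSA example; it assumes Python 3.8+ (for the modular inverse via pow with exponent -1), and the chunk 0120 for AT is taken from the next slide.

x, y, s = 47, 79, 97
N = x * y                          # 3713
phi = (x - 1) * (y - 1)            # 3588
p = pow(s, -1, phi)                # multiplicative inverse of s modulo phi -> 37
assert (p * s) % phi == 1

def encode(chunk):                 # raise a 4-digit chunk to the pth power mod N
    return pow(chunk, p, N)

def decode(chunk):                 # raise an encrypted chunk to the sth power mod N
    return pow(chunk, s, N)

m = 120                            # the chunk 0120, i.e. AT with A=01, T=20
c = encode(m)
print(p, c, decode(c))             # -> 37 1404 120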

148 Cryptology: RSA Example (cont'd) Encryption: Break the message into 4 digit chunks, and raise each to the pth power (modulo N). AT becomes 0120, TA becomes 2001, CK becomes 0311, and so on. For example, 0120^37 mod 3713 = 1404.

149 Cryptology: RSA Example (cont'd) Decryption: Break the encrypted message into 4 digit chunks, and raise each to the sth power (modulo N). For example, 1404^97 mod 3713 = 0120, which decodes to AT.

150 Cryptology: RSA The decryption key s and the factors of N (i.e. x and y) are kept secret. For small numbers (e.g. 3713), it is easy to discover the factors (in this case 47 and 79). For large numbers, such as 100 digit numbers, it is computationally difficult (perhaps taking years to calculate). Calculating each key component N, p, and s is only done once. Given a public and secret key, they may be used many times, and it is cheap to apply them.

151 Cryptology: RSA Relatively prime (coprime) Two integers that have no divisors in common other than +1 and -1 are relatively prime. Euler totient function The Euler totient function Φ is such that Φ(N) is the number of positive integers less than N that are relatively prime to N. If N = xy, where x, y are prime, then Φ(N) = Φ(x)Φ(y) = (x − 1)(y − 1)

152 Cryptology: RSA Fermat/Euler result M^(Φ(N)+1) mod N = M mod N The message M is encoded and then decoded, giving M^(ps) mod N, where ps = Φ(N) + 1 So M^(ps) mod N = M^(Φ(N)+1) mod N = M mod N


More information

Greedy Algorithms. Textbook reading. Chapter 4 Chapter 5. CSci 3110 Greedy Algorithms 1/63

Greedy Algorithms. Textbook reading. Chapter 4 Chapter 5. CSci 3110 Greedy Algorithms 1/63 CSci 3110 Greedy Algorithms 1/63 Greedy Algorithms Textbook reading Chapter 4 Chapter 5 CSci 3110 Greedy Algorithms 2/63 Overview Design principle: Make progress towards a solution based on local criteria

More information

Question Bank Subject: Advanced Data Structures Class: SE Computer

Question Bank Subject: Advanced Data Structures Class: SE Computer Question Bank Subject: Advanced Data Structures Class: SE Computer Question1: Write a non recursive pseudo code for post order traversal of binary tree Answer: Pseudo Code: 1. Push root into Stack_One.

More information

CS473-Algorithms I. Lecture 11. Greedy Algorithms. Cevdet Aykanat - Bilkent University Computer Engineering Department

CS473-Algorithms I. Lecture 11. Greedy Algorithms. Cevdet Aykanat - Bilkent University Computer Engineering Department CS473-Algorithms I Lecture 11 Greedy Algorithms 1 Activity Selection Problem Input: a set S {1, 2,, n} of n activities s i =Start time of activity i, f i = Finish time of activity i Activity i takes place

More information

COSC 2007 Data Structures II Final Exam. Part 1: multiple choice (1 mark each, total 30 marks, circle the correct answer)

COSC 2007 Data Structures II Final Exam. Part 1: multiple choice (1 mark each, total 30 marks, circle the correct answer) COSC 2007 Data Structures II Final Exam Thursday, April 13 th, 2006 This is a closed book and closed notes exam. There are total 3 parts. Please answer the questions in the provided space and use back

More information

Multiple Choice. Write your answer to the LEFT of each problem. 3 points each

Multiple Choice. Write your answer to the LEFT of each problem. 3 points each CSE 0-00 Test Spring 0 Name Last 4 Digits of Student ID # Multiple Choice. Write your answer to the LEFT of each problem. points each. Suppose f ( x) is a monotonically increasing function. Which of the

More information

CS711008Z Algorithm Design and Analysis

CS711008Z Algorithm Design and Analysis CS711008Z Algorithm Design and Analysis Lecture 7. Binary heap, binomial heap, and Fibonacci heap Dongbo Bu Institute of Computing Technology Chinese Academy of Sciences, Beijing, China 1 / 108 Outline

More information

Data Structures Question Bank Multiple Choice

Data Structures Question Bank Multiple Choice Section 1. Fundamentals: Complexity, Algorthm Analysis 1. An algorithm solves A single problem or function Multiple problems or functions Has a single programming language implementation 2. A solution

More information

Huffman Coding. Version of October 13, Version of October 13, 2014 Huffman Coding 1 / 27

Huffman Coding. Version of October 13, Version of October 13, 2014 Huffman Coding 1 / 27 Huffman Coding Version of October 13, 2014 Version of October 13, 2014 Huffman Coding 1 / 27 Outline Outline Coding and Decoding The optimal source coding problem Huffman coding: A greedy algorithm Correctness

More information

Data Structures. Dynamic Sets

Data Structures. Dynamic Sets Data Structures Binary Search Tree Dynamic Sets Elements have a key and satellite data Dynamic sets support queries such as: Search(S, k) Minimum(S) Maximum(S) Successor(S, x) Predecessor(S, x) Insert(S,

More information

Solutions to Exam Data structures (X and NV)

Solutions to Exam Data structures (X and NV) Solutions to Exam Data structures X and NV 2005102. 1. a Insert the keys 9, 6, 2,, 97, 1 into a binary search tree BST. Draw the final tree. See Figure 1. b Add NIL nodes to the tree of 1a and color it

More information

Hash Table and Hashing

Hash Table and Hashing Hash Table and Hashing The tree structures discussed so far assume that we can only work with the input keys by comparing them. No other operation is considered. In practice, it is often true that an input

More information

CS8391-DATA STRUCTURES QUESTION BANK UNIT I

CS8391-DATA STRUCTURES QUESTION BANK UNIT I CS8391-DATA STRUCTURES QUESTION BANK UNIT I 2MARKS 1.Define data structure. The data structure can be defined as the collection of elements and all the possible operations which are required for those

More information

TU/e Algorithms (2IL15) Lecture 2. Algorithms (2IL15) Lecture 2 THE GREEDY METHOD

TU/e Algorithms (2IL15) Lecture 2. Algorithms (2IL15) Lecture 2 THE GREEDY METHOD Algorithms (2IL15) Lecture 2 THE GREEDY METHOD x y v w 1 Optimization problems for each instance there are (possibly) multiple valid solutions goal is to find an optimal solution minimization problem:

More information

COMP171. Hashing.

COMP171. Hashing. COMP171 Hashing Hashing 2 Hashing Again, a (dynamic) set of elements in which we do search, insert, and delete Linear ones: lists, stacks, queues, Nonlinear ones: trees, graphs (relations between elements

More information

( ) D. Θ ( ) ( ) Ο f ( n) ( ) Ω. C. T n C. Θ. B. n logn Ο

( ) D. Θ ( ) ( ) Ο f ( n) ( ) Ω. C. T n C. Θ. B. n logn Ο CSE 0 Name Test Fall 0 Multiple Choice. Write your answer to the LEFT of each problem. points each. The expected time for insertion sort for n keys is in which set? (All n! input permutations are equally

More information

Bioinformatics Programming. EE, NCKU Tien-Hao Chang (Darby Chang)

Bioinformatics Programming. EE, NCKU Tien-Hao Chang (Darby Chang) Bioinformatics Programming EE, NCKU Tien-Hao Chang (Darby Chang) 1 Tree 2 A Tree Structure A tree structure means that the data are organized so that items of information are related by branches 3 Definition

More information

Data Structures and Algorithms Week 4

Data Structures and Algorithms Week 4 Data Structures and Algorithms Week. About sorting algorithms. Heapsort Complete binary trees Heap data structure. Quicksort a popular algorithm very fast on average Previous Week Divide and conquer Merge

More information

CS711008Z Algorithm Design and Analysis

CS711008Z Algorithm Design and Analysis CS700Z Algorithm Design and Analysis Lecture 7 Binary heap, binomial heap, and Fibonacci heap Dongbo Bu Institute of Computing Technology Chinese Academy of Sciences, Beijing, China 1 / Outline Introduction

More information

Chapter 11: Indexing and Hashing

Chapter 11: Indexing and Hashing Chapter 11: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL

More information

Operations on Heap Tree The major operations required to be performed on a heap tree are Insertion, Deletion, and Merging.

Operations on Heap Tree The major operations required to be performed on a heap tree are Insertion, Deletion, and Merging. Priority Queue, Heap and Heap Sort In this time, we will study Priority queue, heap and heap sort. Heap is a data structure, which permits one to insert elements into a set and also to find the largest

More information

looking ahead to see the optimum

looking ahead to see the optimum ! Make choice based on immediate rewards rather than looking ahead to see the optimum! In many cases this is effective as the look ahead variation can require exponential time as the number of possible

More information

Jana Kosecka. Red-Black Trees Graph Algorithms. Many slides here are based on E. Demaine, D. Luebke slides

Jana Kosecka. Red-Black Trees Graph Algorithms. Many slides here are based on E. Demaine, D. Luebke slides Jana Kosecka Red-Black Trees Graph Algorithms Many slides here are based on E. Demaine, D. Luebke slides Binary Search Trees (BSTs) are an important data structure for dynamic sets In addition to satellite

More information

logn D. Θ C. Θ n 2 ( ) ( ) f n B. nlogn Ο n2 n 2 D. Ο & % ( C. Θ # ( D. Θ n ( ) Ω f ( n)

logn D. Θ C. Θ n 2 ( ) ( ) f n B. nlogn Ο n2 n 2 D. Ο & % ( C. Θ # ( D. Θ n ( ) Ω f ( n) CSE 0 Test Your name as it appears on your UTA ID Card Fall 0 Multiple Choice:. Write the letter of your answer on the line ) to the LEFT of each problem.. CIRCLED ANSWERS DO NOT COUNT.. points each. The

More information

9. Heap : Priority Queue

9. Heap : Priority Queue 9. Heap : Priority Queue Where We Are? Array Linked list Stack Queue Tree Binary Tree Heap Binary Search Tree Priority Queue Queue Queue operation is based on the order of arrivals of elements FIFO(First-In

More information

R10 SET - 1. Code No: R II B. Tech I Semester, Supplementary Examinations, May

R10 SET - 1. Code No: R II B. Tech I Semester, Supplementary Examinations, May www.jwjobs.net R10 SET - 1 II B. Tech I Semester, Supplementary Examinations, May - 2012 (Com. to CSE, IT, ECC ) Time: 3 hours Max Marks: 75 *******-****** 1. a) Which of the given options provides the

More information

Topic: Heaps and priority queues

Topic: Heaps and priority queues David Keil Data Structures 8/05 1 Topic: Heaps and priority queues The priority-queue problem The heap solution Binary trees and complete binary trees Running time of heap operations Array implementation

More information

4.1 COMPUTATIONAL THINKING AND PROBLEM-SOLVING

4.1 COMPUTATIONAL THINKING AND PROBLEM-SOLVING 4.1 COMPUTATIONAL THINKING AND PROBLEM-SOLVING 4.1.2 ALGORITHMS ALGORITHM An Algorithm is a procedure or formula for solving a problem. It is a step-by-step set of operations to be performed. It is almost

More information

Lecture 7. Transform-and-Conquer

Lecture 7. Transform-and-Conquer Lecture 7 Transform-and-Conquer 6-1 Transform and Conquer This group of techniques solves a problem by a transformation to a simpler/more convenient instance of the same problem (instance simplification)

More information

12 Abstract Data Types

12 Abstract Data Types 12 Abstract Data Types 12.1 Foundations of Computer Science Cengage Learning Objectives After studying this chapter, the student should be able to: Define the concept of an abstract data type (ADT). Define

More information

Computational Optimization ISE 407. Lecture 16. Dr. Ted Ralphs

Computational Optimization ISE 407. Lecture 16. Dr. Ted Ralphs Computational Optimization ISE 407 Lecture 16 Dr. Ted Ralphs ISE 407 Lecture 16 1 References for Today s Lecture Required reading Sections 6.5-6.7 References CLRS Chapter 22 R. Sedgewick, Algorithms in

More information

Garbage Collection: recycling unused memory

Garbage Collection: recycling unused memory Outline backtracking garbage collection trees binary search trees tree traversal binary search tree algorithms: add, remove, traverse binary node class 1 Backtracking finding a path through a maze is an

More information

1 Heaps. 1.1 What is a heap? 1.2 Representing a Heap. CS 124 Section #2 Heaps and Graphs 2/5/18

1 Heaps. 1.1 What is a heap? 1.2 Representing a Heap. CS 124 Section #2 Heaps and Graphs 2/5/18 CS 124 Section #2 Heaps and Graphs 2/5/18 1 Heaps Heaps are data structures that make it easy to find the element with the most extreme value in a collection of elements. They are also known a priority

More information

Final Exam in Algorithms and Data Structures 1 (1DL210)

Final Exam in Algorithms and Data Structures 1 (1DL210) Final Exam in Algorithms and Data Structures 1 (1DL210) Department of Information Technology Uppsala University February 0th, 2012 Lecturers: Parosh Aziz Abdulla, Jonathan Cederberg and Jari Stenman Location:

More information

Quiz 1 Solutions. (a) f(n) = n g(n) = log n Circle all that apply: f = O(g) f = Θ(g) f = Ω(g)

Quiz 1 Solutions. (a) f(n) = n g(n) = log n Circle all that apply: f = O(g) f = Θ(g) f = Ω(g) Introduction to Algorithms March 11, 2009 Massachusetts Institute of Technology 6.006 Spring 2009 Professors Sivan Toledo and Alan Edelman Quiz 1 Solutions Problem 1. Quiz 1 Solutions Asymptotic orders

More information

Data Structure. IBPS SO (IT- Officer) Exam 2017

Data Structure. IBPS SO (IT- Officer) Exam 2017 Data Structure IBPS SO (IT- Officer) Exam 2017 Data Structure: In computer science, a data structure is a way of storing and organizing data in a computer s memory so that it can be used efficiently. Data

More information

( ). Which of ( ) ( ) " #& ( ) " # g( n) ( ) " # f ( n) Test 1

( ). Which of ( ) ( )  #& ( )  # g( n) ( )  # f ( n) Test 1 CSE 0 Name Test Summer 006 Last Digits of Student ID # Multiple Choice. Write your answer to the LEFT of each problem. points each. The time to multiply two n x n matrices is: A. "( n) B. "( nlogn) # C.

More information

Answers. 1. (A) Attempt any five : 20 Marks

Answers. 1. (A) Attempt any five : 20 Marks Important Instructions to examiners: 1) The answers should be examined by key words and not as word-to-word as given in the model answer scheme. 2) The model answer and the answer written by candidate

More information

UNIT IV -NON-LINEAR DATA STRUCTURES 4.1 Trees TREE: A tree is a finite set of one or more nodes such that there is a specially designated node called the Root, and zero or more non empty sub trees T1,

More information

Chapter 12: Indexing and Hashing. Basic Concepts

Chapter 12: Indexing and Hashing. Basic Concepts Chapter 12: Indexing and Hashing! Basic Concepts! Ordered Indices! B+-Tree Index Files! B-Tree Index Files! Static Hashing! Dynamic Hashing! Comparison of Ordered Indexing and Hashing! Index Definition

More information

1 Tree Sort LECTURE 4. OHSU/OGI (Winter 2009) ANALYSIS AND DESIGN OF ALGORITHMS

1 Tree Sort LECTURE 4. OHSU/OGI (Winter 2009) ANALYSIS AND DESIGN OF ALGORITHMS OHSU/OGI (Winter 2009) CS532 ANALYSIS AND DESIGN OF ALGORITHMS LECTURE 4 1 Tree Sort Suppose a sequence of n items is given. We can sort it using TreeInsert and DeleteMin: TreeSort Initialize an empty

More information

DATA STRUCTURES AND ALGORITHMS

DATA STRUCTURES AND ALGORITHMS LECTURE 11 Babeş - Bolyai University Computer Science and Mathematics Faculty 2017-2018 In Lecture 10... Hash tables Separate chaining Coalesced chaining Open Addressing Today 1 Open addressing - review

More information

Horn Formulae. CS124 Course Notes 8 Spring 2018

Horn Formulae. CS124 Course Notes 8 Spring 2018 CS124 Course Notes 8 Spring 2018 In today s lecture we will be looking a bit more closely at the Greedy approach to designing algorithms. As we will see, sometimes it works, and sometimes even when it

More information

Comparisons. Θ(n 2 ) Θ(n) Sorting Revisited. So far we talked about two algorithms to sort an array of numbers. What is the advantage of merge sort?

Comparisons. Θ(n 2 ) Θ(n) Sorting Revisited. So far we talked about two algorithms to sort an array of numbers. What is the advantage of merge sort? So far we have studied: Comparisons Insertion Sort Merge Sort Worst case Θ(n 2 ) Θ(nlgn) Best case Θ(n) Θ(nlgn) Sorting Revisited So far we talked about two algorithms to sort an array of numbers What

More information

Algorithm Design (8) Graph Algorithms 1/2

Algorithm Design (8) Graph Algorithms 1/2 Graph Algorithm Design (8) Graph Algorithms / Graph:, : A finite set of vertices (or nodes) : A finite set of edges (or arcs or branches) each of which connect two vertices Takashi Chikayama School of

More information

CSE 241 Class 17. Jeremy Buhler. October 28, Ordered collections supported both, plus total ordering operations (pred and succ)

CSE 241 Class 17. Jeremy Buhler. October 28, Ordered collections supported both, plus total ordering operations (pred and succ) CSE 241 Class 17 Jeremy Buhler October 28, 2015 And now for something completely different! 1 A New Abstract Data Type So far, we ve described ordered and unordered collections. Unordered collections didn

More information

Data structures. Organize your data to support various queries using little time and/or space

Data structures. Organize your data to support various queries using little time and/or space Data structures Organize your data to support various queries using little time and/or space Given n elements A[1..n] Support SEARCH(A,x) := is x in A? Trivial solution: scan A. Takes time Θ(n) Best possible

More information

Chapter 12: Indexing and Hashing

Chapter 12: Indexing and Hashing Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL

More information

DATA STRUCTURES/UNIT 3

DATA STRUCTURES/UNIT 3 UNIT III SORTING AND SEARCHING 9 General Background Exchange sorts Selection and Tree Sorting Insertion Sorts Merge and Radix Sorts Basic Search Techniques Tree Searching General Search Trees- Hashing.

More information

Recitation 9. Prelim Review

Recitation 9. Prelim Review Recitation 9 Prelim Review 1 Heaps 2 Review: Binary heap min heap 1 2 99 4 3 PriorityQueue Maintains max or min of collection (no duplicates) Follows heap order invariant at every level Always balanced!

More information

( ) + n. ( ) = n "1) + n. ( ) = T n 2. ( ) = 2T n 2. ( ) = T( n 2 ) +1

( ) + n. ( ) = n 1) + n. ( ) = T n 2. ( ) = 2T n 2. ( ) = T( n 2 ) +1 CSE 0 Name Test Summer 00 Last Digits of Student ID # Multiple Choice. Write your answer to the LEFT of each problem. points each. Suppose you are sorting millions of keys that consist of three decimal

More information

COMP Data Structures

COMP Data Structures COMP 2140 - Data Structures Shahin Kamali Topic 5 - Sorting University of Manitoba Based on notes by S. Durocher. COMP 2140 - Data Structures 1 / 55 Overview Review: Insertion Sort Merge Sort Quicksort

More information

Comparisons. Heaps. Heaps. Heaps. Sorting Revisited. Heaps. So far we talked about two algorithms to sort an array of numbers

Comparisons. Heaps. Heaps. Heaps. Sorting Revisited. Heaps. So far we talked about two algorithms to sort an array of numbers So far we have studied: Comparisons Tree is completely filled on all levels except possibly the lowest, which is filled from the left up to a point Insertion Sort Merge Sort Worst case Θ(n ) Θ(nlgn) Best

More information

Heaps and Priority Queues

Heaps and Priority Queues Heaps and Priority Queues (A Data Structure Intermezzo) Frits Vaandrager Heapsort Running time is O(n lg n) Sorts in place Introduces an algorithm design technique» Create data structure (heap) to manage

More information

Course Review for Finals. Cpt S 223 Fall 2008

Course Review for Finals. Cpt S 223 Fall 2008 Course Review for Finals Cpt S 223 Fall 2008 1 Course Overview Introduction to advanced data structures Algorithmic asymptotic analysis Programming data structures Program design based on performance i.e.,

More information

Data Structures and Algorithms Chapter 4

Data Structures and Algorithms Chapter 4 Data Structures and Algorithms Chapter. About sorting algorithms. Heapsort Complete binary trees Heap data structure. Quicksort a popular algorithm very fast on average Previous Chapter Divide and conquer

More information

FORTH SEMESTER DIPLOMA EXAMINATION IN ENGINEERING/ TECHNOLIGY- MARCH, 2012 DATA STRUCTURE (Common to CT and IF) [Time: 3 hours

FORTH SEMESTER DIPLOMA EXAMINATION IN ENGINEERING/ TECHNOLIGY- MARCH, 2012 DATA STRUCTURE (Common to CT and IF) [Time: 3 hours TED (10)-3071 Reg. No.. (REVISION-2010) (Maximum marks: 100) Signature. FORTH SEMESTER DIPLOMA EXAMINATION IN ENGINEERING/ TECHNOLIGY- MARCH, 2012 DATA STRUCTURE (Common to CT and IF) [Time: 3 hours PART

More information

FINAL EXAM REVIEW CS 200 RECITATIOIN 14

FINAL EXAM REVIEW CS 200 RECITATIOIN 14 FINAL EXAM REVIEW CS 200 RECITATIOIN 14 I don t *think* the final is cumulative. Sangmi will say for sure during her review sessions. Either way, keep in mind that this course s material is naturally cumulative

More information

Test points UTA Student ID #

Test points UTA Student ID # CSE Name Test points UTA Student ID # Multiple Choice. Write your answer to the LEFT of each problem. points each. Which of the following statements is true about HEAPSOT? A. It is stable.. It has a worse-case

More information

Data Structures and Algorithms. Roberto Sebastiani

Data Structures and Algorithms. Roberto Sebastiani Data Structures and Algorithms Roberto Sebastiani roberto.sebastiani@disi.unitn.it http://www.disi.unitn.it/~rseba - Week 0 - B.S. In Applied Computer Science Free University of Bozen/Bolzano academic

More information

CS8391-DATA STRUCTURES

CS8391-DATA STRUCTURES ST.JOSEPH COLLEGE OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERI NG CS8391-DATA STRUCTURES QUESTION BANK UNIT I 2MARKS 1.Explain the term data structure. The data structure can be defined

More information

Chapter 2: Basic Data Structures

Chapter 2: Basic Data Structures Chapter 2: Basic Data Structures Basic Data Structures Stacks Queues Vectors, Linked Lists Trees (Including Balanced Trees) Priority Queues and Heaps Dictionaries and Hash Tables Spring 2014 CS 315 2 Two

More information

Design and Analysis of Algorithms

Design and Analysis of Algorithms CSE 1, Winter 201 Design and Analysis of Algorithms Lecture 7: Bellman-Ford, SPs in DAGs, PQs Class URL: http://vlsicad.ucsd.edu/courses/cse1-w1/ Lec. Added after class Figure.: Single-Edge Extensions

More information