Computational Complexity and Implications for Security
DRAFT Notes on Infeasible Computation for MA/CS 109
Leo Reyzin with the help of Nick Benes

The Study of Computational Complexity

Let's summarize what we've done. We started out saying everything is bits: numbers, text, formatting, pictures, music, and even programs themselves. This gave us a model of information (bits) and computation (instructions, written down as bits, executed on hardware built out of gates). Once we have a precise model, we can reason about what is possible and what is not.

This model, in particular, enabled us to prove that there are important questions whose answers cannot be computed at all (such as whether a program will halt or whether a Diophantine equation has solutions). On the other hand, it also enabled us to specify algorithms for solving certain questions quite efficiently (such as searching an index, sorting a list, or finding the shortest path in a graph).

But between the uncomputable questions and the efficiently computable ones, there are a whole lot of very interesting questions whose answers are computable, but for which the computation would take a very long time, so long that it doesn't seem to be better than not being able to solve them at all. The area of computer science that studies how long computation takes (or how much memory or other resources it needs) is called complexity theory.

For instance, take the game of checkers (a similar fact is true for the Chinese game of go). Generalize the game to allow for boards of various sizes: N × N instead of the fixed 8 × 8. Consider a particular position (a position includes information about whose turn it is).
Note that because this is a game of skill alone (i.e., luck is not involved), one of three conditions must hold when the players start from that position: either there is a way for white to win no matter what black does, or a way for black to win no matter what white does, or a way for each player to force a tie no matter what the other player does. So it is natural to ask which one of the three situations holds. It turns out that the answer to this question is computable, but even the best possible algorithm will take a very long time for some starting positions. Namely, the number of steps the algorithm will take (for at least some of the possible starting positions) is an exponential function of N (the size of the board). In other words, even for modest board sizes, the problem takes a very long time. Solving checkers is an example of an exponential-time problem.

In computer science, a problem is generally considered feasible if there is an algorithm that, for any input of size N (such as N numbers to sort, or N nodes in a graph in which we need to find the shortest path), finds the solution using no more steps than N raised to some (small) power, such as N^2 or N^3 (or N log N, which is less than N^2). Since this number of steps is at most a polynomial function of N, such problems are called polynomial-time problems and the algorithms to solve them are called polynomial-time algorithms. For example, Dijkstra's algorithm takes at most N^2 steps on an input of size N. Therefore, it is a polynomial-time algorithm, and the problem that it solves (the shortest path problem) is a polynomial-time problem.
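
To get a feel for the gap between polynomial and exponential growth, here is a small Python sketch that tabulates step counts. The function names are made up for illustration, and the base 2 in the exponential is an assumption: the notes only say that the number of steps is an exponential function of N.

```python
def polynomial_steps(n: int, power: int = 2) -> int:
    """Steps used by a hypothetical algorithm that needs n**power steps."""
    return n ** power


def exponential_steps(n: int) -> int:
    """Steps used by a hypothetical algorithm that needs 2**n steps (base 2 is an assumption)."""
    return 2 ** n


if __name__ == "__main__":
    # Print one row per input size so the columns can be compared side by side.
    print(f"{'N':>4} {'N^2':>8} {'N^3':>10} {'2^N':>32}")
    for n in (8, 16, 32, 64, 100):
        print(f"{n:>4} {polynomial_steps(n):>8} {polynomial_steps(n, 3):>10} {exponential_steps(n):>32}")
```

At N = 100, for example, N^2 is 10,000 while 2^N is roughly 1.27 × 10^30.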

Polynomial-time problems are numerous and include the problems you have already seen: finding an element in a list, sorting a list, or finding the shortest path in a graph. The class of all such problems is denoted by the letter P. Exponential-time problems are not in P: they are much worse than polynomial-time problems.

Thus, some problems (such as finding the shortest path in a graph) are known to be polynomial-time, while some (such as solving checkers) are known to be exponential-time. Of course, there are also problems in between, and problems even worse than exponential time. However, there are also many problems for which we simply don't know what the best algorithm can do, even though we've been studying them for many years. Among those, one class of problems is very important. We discuss it next.

NP

For simplicity, let us focus on problems that demand only a yes/no answer. Consider the following problem.

Traveling Salesman Problem: Given a graph (i.e., nodes, edges connecting nodes, and prices on each edge), as well as some budget b, find if there is a path covering all the nodes whose total cost is no more than the budget b.

Note that this problem seems easier than asking to find the best possible route: because it is only a yes/no question, it merely asks whether there is a route priced at less than or equal to b. However, it turns out that it's only slightly easier: if this yes/no question can be answered, then finding the best possible route is not much harder. More importantly, this problem is very different from the shortest path problem, which asks only how to get from one point to another regardless of what other points you cover.

The following interactive web page http://mcs109.bu.edu/site/?p=tsp allows you to play with an example of the Traveling Salesman Problem. Try to find the optimal route, or even just a route that comes in under $900.

The naive approach is to list all the possible routes and then check them all.
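
The naive approach can be sketched in a few lines of Python. This is only an illustration under some assumptions: a route is taken to be an ordering of the nodes that visits each one once, prices are stored in a dictionary keyed by unordered pairs of nodes, and the function names and the four-city prices are invented for the example.

```python
from itertools import permutations


def route_cost(route, price):
    """Total price of visiting the nodes in this order, or None if some edge does not exist."""
    total = 0
    for a, b in zip(route, route[1:]):
        edge = frozenset((a, b))
        if edge not in price:
            return None
        total += price[edge]
    return total


def exists_route_within_budget(nodes, price, budget):
    """Try every ordering of the nodes (N! of them); return a route within budget, or None."""
    for route in permutations(nodes):
        cost = route_cost(route, price)
        if cost is not None and cost <= budget:
            return list(route)
    return None


if __name__ == "__main__":
    # Hypothetical prices for four fully connected cities.
    nodes = ["A", "B", "C", "D"]
    price = {
        frozenset(("A", "B")): 10, frozenset(("B", "C")): 15,
        frozenset(("C", "D")): 20, frozenset(("A", "D")): 25,
        frozenset(("B", "D")): 30, frozenset(("A", "C")): 35,
    }
    print(exists_route_within_budget(nodes, price, 45))  # ['A', 'B', 'C', 'D']
    print(exists_route_within_budget(nodes, price, 44))  # None
```

The loop over `permutations(nodes)` is what makes this approach hopeless for larger graphs: it may have to look at every possible ordering before it can answer no.
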
Assume you have an example with 7 cities that are all connected. Then you have seven choices for the starting city; given a starting city, you have six choices for the second; given the first two cities, you have five choices for the third; etc. This gives you 7 × 6 × ... × 1 = 7! = 5040 possible routes. But suppose you have a graph of 26 nodes. By the same argument, enumerating all possible routes on such a graph will take 26! ≈ 4 × 10^26 steps (four followed by 26 zeroes). That's a very large number: if a computer enumerating 1 billion paths per second started at the birth of the universe, it would be finishing up right about now. More generally, on N cities you need to check N! routes, which is even worse than exponential time.

Of course, checking all routes is not the only possible approach. For the shortest path problem, for example, we found a much better way than checking all routes (namely, Dijkstra's algorithm): it took only about N^2 steps, which is polynomial. Maybe there is a way that involves a polynomial number of steps for this problem, too? It turns out that, despite decades of research, no one knows a polynomial-time algorithm to solve this problem.

It is important to reiterate what a polynomial-time algorithm means. A polynomial-time algorithm must take a number of steps that is at most some pre-specified polynomial of N (such as N^2 or N^3) for every input of size N. It's not good enough to have an algorithm that works for some inputs: it has to work in the given time no matter what input of size N you give it.

Unlike the checkers example above, this problem has a very important feature: if the answer is yes, then there is a proof that can be verified easily (in polynomial time). The proof, in this case, is just the list of nodes in the order you visit them. It is indeed easy to verify: verification simply requires checking that the list covers every node, adding the costs of all the edges traveled, and seeing if the total is no more than the budget b. (Note that there is not necessarily a short proof if the answer is no.)

There are many other problems with the above feature. Here we name just a few more; there are thousands of interesting ones identified in the scientific literature. Our emphasis is on graph problems, because they are the easiest examples to state given the background introduced in the class. However, problems with the above feature don't have to have anything to do with graphs, and many do not.

Clique

Consider a large undirected graph, such as the Facebook graph in which users are nodes and the friend relation forms edges. Suppose you want to know if there is a group of K nodes that are all connected to each other (in the Facebook example, that means K people who are all each other's friends). Such a group is called a clique. Naturally, if the answer is yes, then there is an easily (polynomial-time) verifiable proof: if I give you the K nodes, you can easily check that they are all each other's friends by verifying that there is an edge between every pair (there are K(K-1)/2 < K^2 pairs to consider).

3-Coloring

Consider a large undirected graph on N nodes. Ask whether it is possible, using only three different colors, to color each node so that no two nodes that are connected by an edge are the same color.
Again, if the answer is yes, then there is an easily (polynomial-time) verifiable proof: if I tell you the color of each node, you can check that no connected pair has the same color, by considering all pairs of nodes (there are N(N-1)/2 < N^2 pairs to consider).

Scheduling with Precedence Constraints

Suppose you have N tasks, some of which must precede others (for example, if the tasks relate to building a house, you can't paint the walls until you build them). Each task comes with the amount of time needed to complete it. Given a number m of workers and a time t, figure out if it is possible to complete all the tasks in time t.

Proving Theorems

Given a (precisely written down) mathematical statement, find out if it has a proof of length at most N. Naturally, if the answer is yes, then there is a proof that can be efficiently verified (because verifying mathematical proofs is something that can be automated).
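
The "easily verifiable proof" for the three graph problems above can be made concrete with short checkers. The Python sketch below is one possible way to write them, under some assumptions: edges are stored as a set of unordered pairs, colors are the numbers 0, 1, and 2, a route visits each node once, and all function names are invented for the example.

```python
from itertools import combinations


def verify_route(nodes, price, budget, route):
    """Check a claimed route: it covers every node once, uses only real edges, and fits the budget."""
    if len(route) != len(nodes) or set(route) != set(nodes):
        return False
    total = 0
    for a, b in zip(route, route[1:]):
        edge = frozenset((a, b))
        if edge not in price:
            return False
        total += price[edge]
    return total <= budget


def verify_clique(edges, group, k):
    """Check that the group has k distinct nodes and that every one of the k(k-1)/2 pairs is an edge."""
    if len(set(group)) != k or len(group) != k:
        return False
    return all(frozenset(pair) in edges for pair in combinations(group, 2))


def verify_3_coloring(nodes, edges, coloring):
    """Check that every node has one of three colors and no connected pair shares a color."""
    if any(coloring.get(v) not in (0, 1, 2) for v in nodes):
        return False
    # Consider all N(N-1)/2 pairs of nodes, as in the text.
    for a, b in combinations(nodes, 2):
        if frozenset((a, b)) in edges and coloring[a] == coloring[b]:
            return False
    return True
```

Each checker does an amount of work that is polynomial in the size of the graph, even though nobody knows how to find the route, group, or coloring in polynomial time.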

Any yes/no problem with this feature (namely, that if the answer is yes, then there is a proof that can be verified in polynomial time) is said to belong to a class of problems called NP (for nondeterministic polynomial time).

So can problems in NP be solved efficiently (that is, in polynomial time)? In other words, is the class NP exactly the same class of problems as the class P? Or, in yet other words, does finding a proof have to be much harder than verifying it? Computer scientists have been working on this question since the early 1970s, after Stephen Cook, Richard Karp, and Leonid Levin (now at BU CS) described the class NP. The question is still open. In 2000, the Clay Mathematics Institute announced a list of seven Millennium Problems that are particularly important for mathematics, each with a $1 million prize. Whether P = NP is one of them.

Fortunately, to claim the $1 million prize, you don't have to find an efficient algorithm for every problem in NP. It has been proven that finding an efficient algorithm for any one of the above problems (Traveling Salesman, Clique, 3-Coloring, Scheduling with Precedence Constraints, or Proving Theorems) will suffice: such an efficient algorithm can be converted into an efficient algorithm for any other problem in NP. In fact, there are many important problems with the same property as the above problems; namely, finding a polynomial-time algorithm for any of them will prove that P = NP, because it will give a polynomial-time algorithm for all other problems in NP. Such special problems are called NP-complete.

Most computer scientists, however, believe that P ≠ NP. That is, they believe that an efficient algorithm for the Traveling Salesman Problem does not exist. If it did, too many important problems that people have been working on for centuries would turn out to be easy. In fact, proving mathematical theorems would become too easy!
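
The difference between finding a proof and verifying one can be seen in a small Python sketch for 3-Coloring. The search below is just the obvious brute-force method, not the best algorithm known, and the function names and the edge-list representation are assumptions made for the example. Checking one coloring takes one pass over the edges, while the search may try all 3^N colorings.

```python
from itertools import product


def is_valid_coloring(edges, coloring):
    """Verify a coloring: one pass over the edges, so polynomial time."""
    return all(coloring[a] != coloring[b] for a, b in edges)


def find_3_coloring(nodes, edges):
    """Search for a coloring by trying all 3**N assignments.

    Returns (coloring, number_tried); coloring is None if the graph is not 3-colorable.
    """
    tried = 0
    for colors in product(range(3), repeat=len(nodes)):
        tried += 1
        coloring = dict(zip(nodes, colors))
        if is_valid_coloring(edges, coloring):
            return coloring, tried
    return None, tried


if __name__ == "__main__":
    # Four nodes that are all connected to each other cannot be 3-colored,
    # so the search has to go through every one of the 3**4 = 81 assignments.
    nodes = ["A", "B", "C", "D"]
    edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D")]
    print(find_3_coloring(nodes, edges))  # (None, 81)
```

Whether the search can always be replaced by something that runs in polynomial time is exactly the P versus NP question for this problem.
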
Cryptography

So this seems depressing, maybe even more depressing than knowing that we can't do something. With these problems, we know we can do them, but it would take so long that we might as well not be able to do them at all! And, moreover, we can't even prove that we can't solve them faster.

But while your inability to do something is bad for you, the bad guys' inability to do something is good for you. Thus, if we can use hardness of computation to our advantage, maybe this depressing news has a silver lining.

Suppose I want to have an account on a remote server (as we typically do for email, BU grades, on-line shopping, etc.). I could pick a password (like "SuperSecret109") and the remote server can store it. Every time I log in, it would compare what I enter to what it stores. But then if someone were to steal or hack into the server, they would get my password that the server stores. Worse, I probably use the same password for other accounts, too, so they would gain access to many of my accounts.

A better way would be to use the defining property of NP problems: that verification can be done efficiently. For example, instead of choosing a password, I could choose a 3-colorable graph. Note that this is much easier than 3-coloring a given graph: since I get to choose, I can start with random colors on the nodes and then draw in the edges, making sure that no two nodes of the same color are connected. I would give the graph to the server and keep the 3-coloring to myself (on my personal laptop, for example). Then my password is the coloring, but it's not stored on the server. It's easy for the server to check if I've entered a valid coloring, but if someone steals or hacks into the server, all they get is the graph. The server doesn't know a valid 3-coloring, but merely verifies the one I enter every time I log in, and then erases it to make sure it can't get stolen. Thus, using a computationally hard problem, I have split my password into a public portion that the server knows (and, in fact, anyone can know) and a secret portion that only I know. In fact, given the prominence of the problem, probably people won't bother trying to hack into the server, because if they figure out a way to solve an arbitrary 3-coloring problem, then they'll go for the $1 million Millennium Prize rather than my accounts.

[Figure: with a password, Leo holds SuperSecret109 and the server stores SuperSecret109 (has to be secret); with a graph, Leo holds the coloring and the server stores the graph (not secret).]

The fundamental insight to take from this example is that to verify someone's secret (and, thus, someone's identity), you don't need to know any secrets.

Of course, as you know from using passwords, no one actually uses three-colorings for passwords. That would be too much of a pain. However, the fundamental insight still applies: when you log in to a competent service, they don't know your password. The server doesn't store your password, but stores only something that can verify your password. Getting the original password back from it is computationally hard (likely harder than simply trying all possible passwords).

Another reason three-coloring is not used in real life is that even if humans could remember colorings (or delegate that task to a laptop/cell phone/etc.), the 3-coloring problem is not a very convenient example, because, even though for many graphs 3-coloring is hard, there are also many graphs for which it's easy. I would have to do extra work to generate a graph for which it's hard.
It would be nice to use a problem that always seems to be hard. We will talk about such problems in later lectures.
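
The graph-based login described above can be sketched in Python. This is a toy illustration only, not something to deploy. The class name `Server`, the function name, the edge probability, and the use of Python's `random` module are all assumptions, and the sketch makes no attempt to produce graphs that are actually hard to 3-color.

```python
import random
from itertools import combinations


def make_graph_with_planted_coloring(num_nodes, edge_probability=0.5, seed=None):
    """Pick random colors first, then add edges only between nodes of different colors."""
    rng = random.Random(seed)
    coloring = {v: rng.randrange(3) for v in range(num_nodes)}
    edges = set()
    for a, b in combinations(range(num_nodes), 2):
        if coloring[a] != coloring[b] and rng.random() < edge_probability:
            edges.add((a, b))
    return edges, coloring  # the graph is public, the coloring is the secret


class Server:
    """Hypothetical server that stores only the public graph for each user."""

    def __init__(self):
        self._graphs = {}  # user name -> (number of nodes, set of edges)

    def register(self, user, num_nodes, edges):
        self._graphs[user] = (num_nodes, frozenset(edges))

    def login(self, user, coloring):
        """Verify the submitted coloring against the stored graph; the coloring is not kept."""
        if user not in self._graphs:
            return False
        num_nodes, edges = self._graphs[user]
        if any(coloring.get(v) not in (0, 1, 2) for v in range(num_nodes)):
            return False
        return all(coloring[a] != coloring[b] for a, b in edges)
```

A user would call `make_graph_with_planted_coloring` once, hand the edges to `register`, and keep the coloring private. Each later call to `login` succeeds only when the submitted coloring is valid for the stored graph.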