Performance: it's harder than you'd think to go fast
Dan S. Wallach and Mack Joyner, Rice University
Copyright 2016 Dan Wallach, All Rights Reserved
Today's lecture
Prime number generation: several different algorithms with different big-O performance
Making things run in parallel: using Java 8 streams to run things together
The pesky real world: there's no replacement for measurement
Prime numbers: who cares?
Definition: a prime number is only divisible by itself and 1.
Exceptionally useful, particularly in cryptography:
Modular arithmetic is the basis for most modern cryptography.
Modular arithmetic requires prime numbers.
Modular arithmetic: Z_p
We're working in Z_p, the integers in [1, p).
2 + 3 = 5 (mod 7)
2 + 4 = 6 (mod 7)
2 + 5 = 0 (mod 7) Forbidden! (0 isn't in [1, p))
2 × 3 = 6 (mod 7)
2 × 4 = 1 (mod 7)
6 × 6 = 1 (mod 7)
If p weren't prime, multiplication wouldn't be closed!
Note: Z_p is closed under multiplication but not addition.
Some numbers are generators
In Z_p, we define a generator g such that g^1, g^2, ..., g^(p-1) covers all the elements in the group.
Example, for p = 7: g = 2 is not a generator, but g = 3 is.
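One way to check the claim is to multiply by g repeatedly mod p and see whether every element of [1, p) shows up. A minimal sketch in plain Java (the class and method names are mine, not from the course library):

```java
public class GeneratorCheck {
    // Returns true if g generates every element of [1, p) under multiplication mod p.
    // Fine for small p; for large p, the int multiplication here would overflow.
    public static boolean isGenerator(int g, int p) {
        boolean[] seen = new boolean[p];
        int x = 1;
        for (int i = 1; i < p; i++) {
            x = (x * g) % p; // x now holds g^i mod p
            seen[x] = true;
        }
        for (int i = 1; i < p; i++) {
            if (!seen[i]) {
                return false; // some element was never hit
            }
        }
        return true;
    }
}
```

For p = 7 this confirms the slide: g = 2 cycles through only {2, 4, 1}, while g = 3 hits all six elements.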
Discrete logarithms
In the regular integers, say I give you a big number q = 5^8437591243259543 and ask you to take log_5 q.
Logarithms, over huge integers, are tractable. But what about in Z_p?
No known efficient solution to the discrete log problem.
Modern cryptography depends on this being hard to solve (for large p).
Diffie-Hellman (1976)
Public: generator g ∈ Z_p
Alice: random a ∈ Z_p
Bob: random b ∈ Z_p
A→B: g^a
B→A: g^b
Alice: computes (g^b)^a = g^ab
Bob: computes (g^a)^b = g^ab
Eve: knows g, g^a, g^b; cannot compute g^ab
http://www.nytimes.com/2016/03/02/technology/cryptography-pioneers-to-win-turing-award.html
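The whole exchange fits in a few lines of java.math.BigInteger, whose modPow does the modular exponentiation. A toy sketch (class and method names are mine; a real deployment would use a 2048-bit prime p and cryptographically random secrets):

```java
import java.math.BigInteger;

public class ToyDiffieHellman {
    // Runs one Diffie-Hellman exchange and returns the shared secret g^(ab) mod p.
    // Throws if Alice's and Bob's computations somehow disagree.
    public static BigInteger sharedSecret(BigInteger p, BigInteger g,
                                          BigInteger a, BigInteger b) {
        BigInteger gToA = g.modPow(a, p);     // Alice publishes g^a mod p
        BigInteger gToB = g.modPow(b, p);     // Bob publishes g^b mod p
        BigInteger alice = gToB.modPow(a, p); // Alice computes (g^b)^a
        BigInteger bob = gToA.modPow(b, p);   // Bob computes (g^a)^b
        if (!alice.equals(bob)) {
            throw new AssertionError("the exchange is broken");
        }
        return alice; // Eve sees g, g^a, and g^b, but not this value
    }
}
```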
DLog-based crypto, in practice
p tends to be 2048 or 3072 bits; smaller values are too easy to break!
Bigger messages? Build hybrids with classical cryptography.
Want a smaller p that's still strong? Elliptic curve cryptography: define multiplication over things that aren't integers.
Want to learn more about all this? Olivier Pereira (UCL Belgium) will be teaching a crypto class this spring.
Over the summer
We suggested you implement a Sieve of Eratosthenes: the classic, reasonably efficient method to generate lots of prime numbers.
This won't find 2048-bit prime numbers! But it's worth understanding.
Let's start with something simpler
We can test a number n for primality: divide by numbers < sqrt(n). Any composite n must have a factor ≤ sqrt(n).

public static boolean isPrime(int n) {
  if (n < 1) {
    return false;
  } else if (n == 1 || n == 2) {
    return true; // special cases for 1 and 2
  }
  if ((n & 1) == 0) {
    return false; // is n even? then it's not prime
  }
  int max = (int) ceil(sqrt(n));
  for (int i = 3; i <= max; i += 2) {
    if (n % i == 0) {
      return false; // n is divisible by i (no remainder), so it's not prime
    }
  }
  return true;
}
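The same test also works in plain Java with nothing from the course library. A self-contained sketch (and, like the slide's version, it deliberately treats 1 as "prime" so the generated lists can include it):

```java
public class TrialDivision {
    // Trial division: true if n has no divisor between 2 and sqrt(n).
    public static boolean isPrime(int n) {
        if (n < 1) {
            return false;
        } else if (n == 1 || n == 2) {
            return true; // special cases for 1 and 2
        } else if ((n & 1) == 0) {
            return false; // even numbers > 2 are composite
        }
        int max = (int) Math.ceil(Math.sqrt(n));
        for (int i = 3; i <= max; i += 2) {
            if (n % i == 0) {
                return false; // found a factor, so n is composite
            }
        }
        return true;
    }
}
```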
Want all the prime numbers < n?
Simple O(n sqrt n) algorithm:

public static IList<Integer> primesSimple(int maxPrime) {
  if (maxPrime < 2) {
    return List.of(1); // better than nothing, I suppose
  }
  return rangeInt(1, maxPrime).filter(Primes::isPrime);
}

You can see this, and everything else, in edu.rice.util.Primes. Eight versions there (one more written yesterday).
Can we do better?
Let's take an outer product and find all the composite numbers < n. Anything missing must be a prime!

public static IList<Integer> primesFasterStillNoLists(int maxPrime) {
  if (maxPrime < 2) {
    return List.of(1); // better than nothing, I suppose
  }
  int maxFactor = (int) ceil(sqrt(maxPrime));
  boolean[] notPrimes = new boolean[maxPrime + 1]; // start off false: tracks all the composites we've found so far
  for (int i = 2; i <= maxFactor; i++) {
    for (int j = 2; j <= maxPrime; j++) {
      int product = i * j;
      if (product > maxPrime) {
        // breaks the j-loop, continues the i-loop: once the product is bigger
        // than n, we can stop looking (this is where the speedup comes from)
        break;
      }
      notPrimes[product] = true;
    }
  }
  return rangeInt(2, maxPrime).filter(i -> !notPrimes[i]).add(1); // anything we missed must be prime
}

Runtime: O(n log n)
But Eratosthenes is faster still!
Eratosthenes: incrementally find primes, knock out their multiples.

public static IList<Integer> primesEratosthenes(int maxPrime) {
  if (maxPrime < 2) {
    return List.of(1); // better than nothing, I suppose
  }
  boolean[] notPrime = new boolean[maxPrime + 1]; // starts initialized to false: tracks all the composites we've found so far
  int maxFactor = (int) ceil(sqrt(maxPrime)); // any composite n must have a factor <= sqrt(n)
  // special case for 2, then the standard case afterward
  for (int i = 4; i <= maxPrime; i += 2) {
    notPrime[i] = true;
  }
  for (int i = 3; i <= maxFactor; i++) {
    if (!notPrime[i]) {
      // i is prime, so its multiples are not
      int skip = 2 * i; // optimization: odd + odd = even, so we can avoid half of the work
      for (int j = i * i; j <= maxPrime; j += skip) {
        notPrime[j] = true;
      }
    }
  }
  return rangeInt(1, maxPrime).filter(i -> !notPrime[i]); // anything we missed must be prime
}

Runtime: O(n log log n) (i.e., barely slower than linear time!)
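For reference, here's the same sieve with only java.util collections, so you can run it outside the course codebase. This sketch uses my own names and follows the usual math convention of excluding 1 (unlike the course versions, which prepend it):

```java
import java.util.ArrayList;
import java.util.List;

public class Sieve {
    // Sieve of Eratosthenes: returns all primes <= maxPrime, in order.
    public static List<Integer> primesUpTo(int maxPrime) {
        List<Integer> result = new ArrayList<>();
        if (maxPrime < 2) {
            return result;
        }
        boolean[] notPrime = new boolean[maxPrime + 1]; // starts all-false
        for (int i = 2; i * i <= maxPrime; i++) {
            if (!notPrime[i]) {
                for (int j = i * i; j <= maxPrime; j += i) {
                    notPrime[j] = true; // i is prime, so its multiples are not
                }
            }
        }
        for (int i = 2; i <= maxPrime; i++) {
            if (!notPrime[i]) {
                result.add(i); // anything not crossed out is prime
            }
        }
        return result;
    }
}
```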
Runtime performance?
[Log-log chart: nanoseconds per prime vs. problem size (maximum prime), from 1e2 to 1e7, comparing the O(n sqrt n), O(n log n), and O(n log log n) implementations.]
As n grows, it looks like we're amortizing some one-time startup costs.
Straight lines on log-log graphs imply monomial functions (y = ax^k), so you're looking at sqrt(n) dominating for large n.
Eratosthenes for the win!
Big-O vs. real-world performance
When n is small, big-O performance is meaningless; code clarity, debugging, features, modularity, etc. are more important!
When n is massive, big-O eventually dominates.
But big-O misses a lot of how modern computers work:
Should you worry about constant factors?
Do you really need to understand the CPU memory hierarchy?
Do you really need to understand the garbage collector?
Do you really need to understand how to write parallel code?
Java Streams: an easy way to write parallel code
Java 8 streams
Here's our O(n sqrt n) prime number tester with streams:

public static IList<Integer> primesSimpleParallel(int maxPrime) {
  if (maxPrime < 2) {
    return List.of(1); // better than nothing, I suppose
  }
  return Adapters.streamToList(
      IntStream.rangeClosed(2, maxPrime) // make a stream of integers (like our LazyList.rangeInt)
          .parallel() // make it parallel! the filter will run on every core of your computer simultaneously
          .filter(Primes::isPrime)
          .boxed()) // boxed() to get from IntStream to Stream<Integer>
      .add(1);
}

Adapters.streamToList converts back to an IList (edu.rice.stream.adapters).
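If you don't have the edu.rice adapters handy, the same parallel pipeline works with the standard library alone. A sketch with my own names (this one follows the usual convention that 1 isn't prime):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ParallelPrimes {
    // Trial-division primality test, used as the stream's filter predicate.
    static boolean isPrime(int n) {
        if (n < 2) {
            return false;
        } else if (n == 2) {
            return true;
        } else if ((n & 1) == 0) {
            return false;
        }
        for (int i = 3; i * i <= n; i += 2) {
            if (n % i == 0) {
                return false;
            }
        }
        return true;
    }

    // Each candidate is tested independently, so .parallel() can split the range
    // across every core with no shared mutable state. Collecting an ordered
    // stream preserves the range's encounter order, so the result comes back sorted.
    public static List<Integer> primesUpTo(int maxPrime) {
        return IntStream.rangeClosed(2, maxPrime)
                .parallel()
                .filter(ParallelPrimes::isPrime)
                .boxed() // IntStream -> Stream<Integer>
                .collect(Collectors.toList());
    }
}
```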
Prof. Wallach's 6-core desktop
Performance results
[Log-log chart: nanoseconds per prime vs. problem size, now adding the parallel O(n sqrt n) implementation to the previous three.]
For small problem sizes, initial overhead is pretty large. (But should we care?)
Parallelism (as much as 12x on a 6-core MacPro) can be a huge winner.
Big-O still wins out in the end!
Four O(n log n) implementations
[Log-log chart: n log n implemented with a LazyList, a parallel stream, an array of boolean, and an array of boolean + parallel.]
Parallelism helps a lot, but you may still be able to do a lot better.
The parallel array-of-boolean version only wins on huge problem sizes. (Hmmm.)
Parallelism is tricky
O(n sqrt n) implementations: each number tested independently.
Embarrassingly parallel: no shared data, just filter a range.
Easy to scale, run on many cores.
Functional programming → high performance!
O(n log n) implementations: shared array of booleans.
Memory contention: the cores have to fight with each other for writes.
Speedup is limited when the problem cannot be partitioned.
Nine different implementations!
[Log-log chart: all nine implementations, from O(n^2 log n) and O(n log^2 n) down to O(n log log n).]
Really bad algorithms never win.
Eratosthenes squeaks out a win for large n.
O(n sqrt n) was the simplest possible implementation, and easily parallelized.
Digression: measuring performance
These tests report the fastest of ten runs, to smooth out your computer's transient behaviors:
Garbage collector
JIT compiler
Unrelated tasks on the same computer
Better to take the mean? Median? Not necessarily.
Nanosecond-accurate clocks? When you divide by n, you get good per-prime accuracy.
A complete prime-number benchmark run takes three minutes.
Should we test for bigger n? Depends on the real-world problem!
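The "fastest of ten" idea is easy to reproduce with System.nanoTime. A sketch (class and method names are mine, not the course's actual benchmark harness):

```java
public class BestOfTen {
    // Times ten runs of the task and returns the fastest, in nanoseconds.
    // Taking the minimum (rather than the mean) filters out one-off noise from
    // the garbage collector, the JIT compiler, and unrelated background tasks.
    public static long fastestNanos(Runnable task) {
        long best = Long.MAX_VALUE;
        for (int run = 0; run < 10; run++) {
            long start = System.nanoTime();
            task.run();
            long elapsed = System.nanoTime() - start;
            best = Math.min(best, elapsed);
        }
        return best;
    }
}
```

The minimum works because noise only ever adds time: the fastest run is the closest to the code's true cost. Divide it by n to get the per-prime figures on these charts.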
How not to win at benchmarking
Vendors can never resist the urge to cheat. Why? Because it sells!
The best benchmark is your real program: if a vendor can make it faster, then you win!
Real benchmarks have real teeth: SPEC requires reporting compiler flags, etc.
http://www.anandtech.com/show/7384/state-of-cheating-in-android-benchmarks
Benefits of performance testing
Understand your problem better: big-O analysis hides how your code really runs, and tiny benchmarks help confirm your intuitions.
Break your code with harder and harder cases.
Surprise! Prof. Wallach's lazy list functions weren't always lazy.
Discovered when testing these prime number generators last spring: the inefficient O(n^2 log n) version's flatmap broke lazy concat.
Prof. Wallach converted that into a unit test. Never again!
Mutation?
The Sieve of Eratosthenes has the best scalar performance.
Parallelizing it is possible, but not easy: http://home.math.au.dk/himsen/Project1_parallel_algorithms_Torben.pdf
The purely functional O(n sqrt n) version is trivial to parallelize; it might even be a good fit for running on a GPU (>100x performance).
It's hard to guess, in advance, what's really going to win.
Pragmatic thoughts
When you've got nine implementations, you can compare them: performance and correctness!
Use the simple version to unit test the fancy version.
Small unit tests won't find bugs that only occur at scale.
Parallelism only pays off for large n.
Also important: task granularity (tiny tasks don't parallelize as well). Why? What if your task is so fast that you're just waiting on memory?
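"Use the simple version to unit test the fancy version" can look like this: the obviously-correct trial division acts as an oracle for the sieve. A sketch with hypothetical names, not the course's actual tests:

```java
import java.util.ArrayList;
import java.util.List;

public class CrossCheck {
    // Oracle: slow but obviously-correct trial division.
    static boolean isPrimeSlow(int n) {
        if (n < 2) {
            return false;
        }
        for (int i = 2; i * i <= n; i++) {
            if (n % i == 0) {
                return false;
            }
        }
        return true;
    }

    // Fast version under test: Sieve of Eratosthenes.
    static List<Integer> sieve(int max) {
        boolean[] notPrime = new boolean[Math.max(max + 1, 1)];
        List<Integer> out = new ArrayList<>();
        for (int i = 2; i <= max; i++) {
            if (!notPrime[i]) {
                out.add(i);
                for (long j = (long) i * i; j <= max; j += i) {
                    notPrime[(int) j] = true;
                }
            }
        }
        return out;
    }

    // The simple version checks the fancy one across the whole range.
    public static boolean agree(int max) {
        List<Integer> expected = new ArrayList<>();
        for (int i = 2; i <= max; i++) {
            if (isPrimeSlow(i)) {
                expected.add(i);
            }
        }
        return expected.equals(sieve(max));
    }
}
```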
Want more where this came from?
Rice Comp322: Fundamentals of Parallel Programming
Rice Comp422: Parallel Programming
Coming up
Week 14/Monday: Intro Computer Security (fun!)
No class Wednesday or Friday (YouTube videos will be linked on Piazza)
We promise there will be an exam question on the videos
Week 15/Monday: Intro Android (fun!)
Week 15/Wednesday: Life after Comp215
Alternative libraries to edu.rice.* (including Java 8 streams)
Alternative programming languages
Week 15/Friday: Final exam review