Performance: It's harder than you'd think to go fast
Dan S. Wallach and Mack Joyner, Rice University. Copyright 2016 Dan Wallach, All Rights Reserved.

Today's lecture
- Prime number generation: several different algorithms with different big-O performance
- Making things run in parallel: using Java 8 streams to run things together
- The pesky real world: there's no replacement for measurement

Prime numbers: who cares?
- Definition: a prime number is divisible only by itself and 1
- Exceptionally useful, particularly in cryptography
- Modular arithmetic is the basis for most modern cryptography, and modular arithmetic requires prime numbers

Modular arithmetic: Z_p
We're working in Z_p, the integers in [1, p).
2 + 3 = 5 (mod 7); 2 + 4 = 6 (mod 7); 2 + 5 = 0 (mod 7) (forbidden! 0 falls outside [1, p))
2 * 3 = 6 (mod 7); 2 * 4 = 1 (mod 7); 6 * 6 = 1 (mod 7)
Note: Z_p is closed under multiplication but not addition. If p wasn't prime, multiplication wouldn't be closed either!
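The (mod 7) examples above are easy to check mechanically. A minimal sketch in plain Java (class and method names are mine, not the course's):

```java
// Verifies the mod-7 arithmetic examples from the slide.
// Math.floorMod keeps results in [0, p), unlike %, which can go negative.
public class ModDemo {
    public static int addMod(int a, int b, int p) {
        return Math.floorMod(a + b, p);
    }

    public static int mulMod(int a, int b, int p) {
        return Math.floorMod(a * b, p);
    }

    public static void main(String[] args) {
        System.out.println(addMod(2, 3, 7)); // 5
        System.out.println(addMod(2, 5, 7)); // 0 -- outside [1, 7): addition isn't closed
        System.out.println(mulMod(2, 4, 7)); // 1
        System.out.println(mulMod(6, 6, 7)); // 1
    }
}
```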

Some numbers are generators
In Z_p, we define a generator g such that g^1, g^2, ..., g^(p-1) covers all the elements in the group. Example, for p = 7: g = 2 is not a generator, but g = 3 is.
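The generator property can be tested directly by computing all the powers of g and checking coverage. A sketch (class and method names are mine):

```java
import java.util.BitSet;

// Checks whether g generates all of {1, ..., p-1}: compute g^1 ... g^(p-1) mod p
// and see whether every element of the group appears.
public class GeneratorDemo {
    public static boolean isGenerator(int g, int p) {
        BitSet seen = new BitSet(p);
        int x = 1;
        for (int i = 1; i < p; i++) {
            x = (x * g) % p; // next power of g
            seen.set(x);
        }
        return seen.cardinality() == p - 1; // all p-1 elements covered?
    }

    public static void main(String[] args) {
        System.out.println(isGenerator(2, 7)); // false: powers of 2 mod 7 are only {1, 2, 4}
        System.out.println(isGenerator(3, 7)); // true: 3, 2, 6, 4, 5, 1 covers everything
    }
}
```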

Discrete logarithms
In the regular integers, say I give you a big number q = 5^8437591243259543 and ask you to take log_5 q. Logarithms, over huge integers, are tractable. But what about in Z_p? There is no known efficient solution to the discrete log problem. Modern cryptography depends on this being hard to solve (for large p).

Diffie-Hellman (1976)
Alice: random a in Z_p
Bob: random b in Z_p
Public: generator g in Z_p
A -> B: g^a
B -> A: g^b
Alice: computes (g^b)^a = g^ab
Bob: computes (g^a)^b = g^ab
Eve: knows g, g^a, g^b; cannot compute g^ab
http://www.nytimes.com/2016/03/02/technology/cryptography-pioneers-to-win-turing-award.html
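The key-agreement math above can be demonstrated with BigInteger.modPow. This is a toy sketch only: the tiny p and the choice of g below are illustrative values I picked, not anything from the course, and a real deployment needs a ~2048-bit prime.

```java
import java.math.BigInteger;

// Toy Diffie-Hellman: both sides derive the same secret, since
// (g^b)^a = (g^a)^b = g^(ab) mod p.
public class DiffieHellmanDemo {
    public static boolean sharedSecretsMatch(BigInteger p, BigInteger g,
                                             BigInteger a, BigInteger b) {
        BigInteger ga = g.modPow(a, p); // Alice -> Bob
        BigInteger gb = g.modPow(b, p); // Bob -> Alice
        BigInteger aliceShared = gb.modPow(a, p); // Alice computes (g^b)^a
        BigInteger bobShared = ga.modPow(b, p);   // Bob computes (g^a)^b
        return aliceShared.equals(bobShared);
    }

    public static void main(String[] args) {
        BigInteger p = BigInteger.valueOf(467); // small prime, demo only
        BigInteger g = BigInteger.valueOf(2);   // demo base, not a vetted generator
        System.out.println(sharedSecretsMatch(p, g,
            BigInteger.valueOf(153), BigInteger.valueOf(289))); // true
    }
}
```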

DLog-based crypto, in practice
- p tends to be 2048 or 3072 bits; smaller values are too easy to break!
- Bigger messages? Build hybrids with classical cryptography.
- Want a smaller p that's still strong? Elliptic curve cryptography! Define multiplication over things that aren't integers.
- Want to learn more about all this? Olivier Pereira (UCL Belgium) will be teaching a crypto class this spring.

Over the summer
We suggested you implement a Sieve of Eratosthenes: a classic, reasonably efficient method to generate lots of prime numbers. This won't find 2048-bit prime numbers! But it's worth understanding.

Let's start with something simpler
We can test a number n for primality: divide by numbers up to sqrt(n).

public static boolean isPrime(int n) {
  if (n < 1) {
    return false;
  } else if (n == 1 || n == 2) {
    return true; // special cases for 1 and 2
  }
  if ((n & 1) == 0) {
    return false; // is n even? then it's not prime
  }
  int max = (int) ceil(sqrt(n)); // any composite n must have a factor <= sqrt(n)
  for (int i = 3; i <= max; i += 2) {
    if (n % i == 0) {
      return false; // if n is divisible by i (no remainder), then it's not prime
    }
  }
  return true;
}
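The slide's test relies on the course's static imports of Math.ceil and Math.sqrt. A self-contained version might look like this (the class name TrialDivision is mine; the course's convention of treating 1 as "prime" is preserved to match the slides):

```java
// Standalone trial-division primality test, plain JDK only.
public class TrialDivision {
    public static boolean isPrime(int n) {
        if (n < 1) {
            return false;
        } else if (n == 1 || n == 2) {
            return true; // special cases for 1 and 2 (course convention: 1 counts)
        }
        if ((n & 1) == 0) {
            return false; // even numbers > 2 are composite
        }
        int max = (int) Math.ceil(Math.sqrt(n)); // composites have a factor <= sqrt(n)
        for (int i = 3; i <= max; i += 2) {
            if (n % i == 0) {
                return false; // found an odd factor
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isPrime(97)); // true
        System.out.println(isPrime(91)); // false: 7 * 13
    }
}
```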

Want all the prime numbers < n?
Simple O(n sqrt n) algorithm:

public static IList<Integer> primesSimple(int maxPrime) {
  if (maxPrime < 2) {
    return List.of(1); // better than nothing, I suppose
  }
  return rangeInt(1, maxPrime).filter(Primes::isPrime);
}

You can see this, and everything else, in edu.rice.util.primes. Eight versions there (one more written yesterday).

Can we do better?
Let's take an outer product and find all the composite numbers < n. Anything missing must be a prime!

public static IList<Integer> primesFasterStillNoLists(int maxPrime) {
  if (maxPrime < 2) {
    return List.of(1); // better than nothing, I suppose
  }
  int maxFactor = (int) ceil(sqrt(maxPrime));
  boolean[] notPrimes = new boolean[maxPrime + 1]; // array of boolean: tracks all the composites we've found so far (starts off false)
  for (int i = 2; i <= maxFactor; i++) {
    for (int j = 2; j <= maxPrime; j++) {
      int product = i * j;
      if (product > maxPrime) {
        break; // breaks the j-loop, continues the i-loop: once the product is bigger than n, we can stop looking (this is where the speedup comes from)
      }
      notPrimes[product] = true;
    }
  }
  return rangeInt(2, maxPrime).filter(i -> !notPrimes[i]).add(1); // anything we missed must be prime
}

Runtime: O(n log n)

But Eratosthenes is faster still!
Eratosthenes: incrementally find primes, knock out their multiples.

public static IList<Integer> primesEratosthenes(int maxPrime) {
  if (maxPrime < 2) {
    return List.of(1); // better than nothing, I suppose
  }
  boolean[] notPrime = new boolean[maxPrime + 1]; // array of boolean: tracks all the composites we've found so far (these start off initialized to false)
  int maxFactor = (int) ceil(sqrt(maxPrime)); // any composite n must have a factor <= sqrt(n)
  // special case for 2, then standard case afterward
  for (int i = 4; i <= maxPrime; i += 2) {
    notPrime[i] = true;
  }
  for (int i = 3; i <= maxFactor; i++) {
    if (!notPrime[i]) { // i is prime, so its multiples are not
      int skip = 2 * i; // optimization: odd + odd = even, so we can avoid half of the work
      for (int j = i * i; j <= maxPrime; j += skip) {
        notPrime[j] = true;
      }
    }
  }
  return rangeInt(1, maxPrime).filter(i -> !notPrime[i]); // anything we missed must be prime
}

Runtime: O(n log log n) (i.e., barely slower than linear time!)

Runtime performance?
(Log-log chart: nanoseconds per prime vs. problem size, 1e2 to 1e7, for the O(n sqrt n), O(n log n), and O(n log log n) implementations.)
- As n grows, it looks like we're amortizing some one-time startup costs.
- Straight lines on log-log graphs imply monomial functions (y = ax^k), so you're looking at sqrt(n) dominating for large n.
- Eratosthenes for the win!

Big-O vs. real-world performance
- When n is small, big-O performance is meaningless; code clarity, debugging, features, modularity, etc., are more important!
- When n is massive, big-O eventually dominates. But big-O misses a lot of how modern computers work.
- Should you worry about constant factors? Do you really need to understand the CPU memory hierarchy? Do you really need to understand the garbage collector? Do you really need to understand how to write parallel code?

Java Streams: an easy way to write parallel code

Java 8 streams
Here's our O(n sqrt n) prime number tester with streams:

public static IList<Integer> primesSimpleParallel(int maxPrime) {
  if (maxPrime < 2) {
    return List.of(1); // better than nothing, I suppose
  }
  return Adapters.streamToList( // convert back to an IList (edu.rice.stream.adapters)
      IntStream.rangeClosed(2, maxPrime) // make a stream of integers (like our LazyList.rangeInt)
          .parallel() // make it parallel! the filter will run on every core of your computer simultaneously
          .filter(Primes::isPrime)
          .boxed()) // boxed() to get from IntStream to Stream
      .add(1);
}
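The same parallel-filter idea can be sketched with only the JDK: no edu.rice adapters, collecting into a java.util.List instead of an IList. Class and method names are mine, and this helper uses the standard definition of prime (1 excluded), unlike the course's convention.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Parallel prime generation with plain Java 8 streams.
public class ParallelPrimes {
    static boolean isPrime(int n) {
        if (n < 2) return false;
        if (n == 2) return true;
        if ((n & 1) == 0) return false;
        for (int i = 3; (long) i * i <= n; i += 2) {
            if (n % i == 0) return false;
        }
        return true;
    }

    public static List<Integer> primesUpTo(int maxPrime) {
        return IntStream.rangeClosed(2, maxPrime)
            .parallel()                    // fan the filter out across all cores
            .filter(ParallelPrimes::isPrime)
            .boxed()                       // IntStream -> Stream<Integer>
            .collect(Collectors.toList()); // order is preserved despite parallelism
    }

    public static void main(String[] args) {
        System.out.println(primesUpTo(30)); // [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
    }
}
```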

Prof. Wallach's 6-core desktop

Performance results
(Log-log chart: nanoseconds per prime vs. problem size, 1e2 to 1e7, now adding the parallel O(n sqrt n) implementation.)
- For small problem sizes, initial overhead is pretty large. (But should we care?)
- Parallelism (as much as 12x on a 6-core MacPro) can be a huge winner.
- Big-O still wins out in the end!

Four O(n log n) implementations
(Log-log chart: nanoseconds per prime vs. problem size for the LazyList, parallel stream, array-of-boolean, and array-of-boolean + parallel versions.)
- Parallelism helps a lot (parallel stream vs. LazyList), but you may still be able to do a lot better.
- The parallel array-of-boolean version only wins on huge problem sizes. (Hmmm.)

Parallelism is tricky
- O(n sqrt n) implementations: each number tested independently. Embarrassingly parallel: no shared data, just filter a range. Easy to scale, run on many cores. Functional programming → high performance!
- O(n log n) implementations: shared array of booleans. Memory contention: the cores have to fight with each other for writes. Speedup is limited when the problem cannot be partitioned.

Nine different implementations!
(Log-log chart: nanoseconds per prime vs. problem size for all nine versions, from O(n^2 log n) down to O(n log log n), serial and parallel.)
- Really bad algorithms never win.
- Eratosthenes squeaks a win for large n.
- O(n sqrt n) was the simplest possible implementation, and easily parallelized.

Digression: measuring performance
- These tests report the fastest of ten runs, to smooth out your computer's transient behaviors: garbage collector, JIT compiler, unrelated tasks on the same computer. Better to take the mean? Median?
- Nanosecond-accurate clocks? Not necessarily, but when you divide by n, you get good accuracy.
- Complete prime-number benchmark run: three minutes. Should we test for bigger n? Depends on the real-world problem!
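The fastest-of-ten idea can be sketched as a tiny harness. Names here are illustrative, not the course's benchmarking code, and a serious effort would use a framework like JMH instead.

```java
// Minimal best-of-N timing harness: run the workload several times, keep the
// fastest, and divide by n to get nanoseconds per item.
public class Bench {
    public static double bestNanosPerItem(Runnable workload, int n, int runs) {
        long best = Long.MAX_VALUE;
        for (int r = 0; r < runs; r++) {
            long start = System.nanoTime();
            workload.run();
            long elapsed = System.nanoTime() - start;
            if (elapsed < best) {
                best = elapsed; // fastest run smooths out GC/JIT/OS noise
            }
        }
        return (double) best / n;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        double nsPerItem = bestNanosPerItem(() -> {
            long sum = 0;
            for (int i = 0; i < n; i++) sum += i; // trivial stand-in workload
        }, n, 10);
        System.out.println(nsPerItem >= 0); // true
    }
}
```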

How not to win at benchmarking
- Vendors can never resist the urge to cheat. Why? Because it sells!
- The best benchmark is your real program. If a vendor can make it faster, then you win!
- Real benchmarks have real teeth: SPEC requires reporting compiler flags, etc.
http://www.anandtech.com/show/7384/state-of-cheating-in-android-benchmarks

Benefits of performance testing
- Understand your problem better: big-O analysis hides how your code really runs. Tiny benchmarks help confirm your intuitions.
- Break your code with harder and harder cases. Surprise! Prof. Wallach's lazy list functions weren't always lazy. Discovered when testing these prime number generators last spring: the inefficient O(n^2 log n) version's flatmap broke lazy concat. Prof. Wallach converted that into a unit test. Never again!

Mutation?
- The Sieve of Eratosthenes has the best scalar performance. Parallelization is possible, but not easy: http://home.math.au.dk/himsen/Project1_parallel_algorithms_Torben.pdf
- The purely functional O(n sqrt n) is trivial to parallelize. Might even be a good fit for running on a GPU (>100x performance).
- Hard to guess, in advance, what's really going to win.

Pragmatic thoughts
- When you've got nine implementations, you can compare them: performance and correctness! Use the simple version to unit test the fancy version. Small unit tests won't find bugs that only occur at scale.
- Parallelism only pays off for large n. Also important: task granularity (tiny tasks don't parallelize as well). Why? What if your task is so fast that you're just waiting on memory?

Want more where this came from? Rice Comp322: Fundamentals of Parallel Programming Rice Comp422: Parallel Programming

Coming up
- Week 14/Monday: Intro Computer Security (fun!) No class Wednesday or Friday (YouTube videos will be linked on Piazza). We promise there will be an exam question on the videos.
- Week 15/Monday: Intro Android (fun!)
- Week 15/Wednesday: Life after Comp215: alternative libraries to edu.rice.* (including Java 8 streams), alternative programming languages.
- Week 15/Friday: Final exam review.