EE/CSCI 451: Parallel and Distributed Computation

Size: px

Start display at page:

Download "EE/CSCI 451: Parallel and Distributed Computation"

Sheryl Bryant
5 years ago
Views:

1 EE/CSCI 451: Parallel and Distributed Computation Lecture #2 1/17/2017 Xuehai Qian University of Southern California 1

2 Outline Opportunities and challenges in exploiting parallelism Course outline High level view of Processor Core Organization Chapter 2.1, 2.2 in the text (Implicit parallelism) Superscalar Very Long Instruction Word (VLIW) processor Multi-core processor organization 2

3 Announcements HW #1 out next Wednesday, due on Jan 31 (check course website) Programming HW #1 out this Friday, due on Jan 26(check course website) Piazza account has been set up ( Course website: 3

4 Midterms and Final Midterm I: 24 th Feb in lab session Midterm II: 31 st March in lab session Final, as scheduled by the University No makeup exams 4

5 Course Dependencies EE 457 CS 402 EE 477 EE 451 EE 560 EE 557 CS 503 CS 596 CS 570 EE 653 EE 657 EE 659 CS 653 EE 677 REQUIRED SUGGESTED 5

6 Challenges (1) Processor - Memory performance gap Memory (DRAM) Access time 5 nsec Cache Processor 2 GHz 0.5 nsec 6

7 Challenges (2) Limit to achievable speedup Amdahl s Law Serial time = 1, Best Parallel time =, Program 1 unit α 1-α Serial Parallel α 1 Speedup = Serial time Parallel time = 1. Example: 50 % of the code is parallizable, Speedup 2 (no matter how many processors are used) 7

8 Challenges (3) Co-ordination Do in parallel P I, P J,, P K end overhead =? Synchronization P I P J P K Barrier 8

9 Challenges (4) Communication Compute Communicate Wait? Interconnection Network 9

10 Challenges (5) Load balance Example: Matrix multiplication P Z[ compute output matrix ij, 0 i, j < p = number of processors p (0,0) n n n p n p n n n n ( P 1, P 1) 10

11 Challenges (6) Key performance metrics jklzmn ozpk Speedup = qmlmnnkn ozpk Given p > 1 processors, can we get p fold improvement performance? Scalability 11

12 Course Outline (1) Overview / Course Introduction Architectural Principles for Application Developers Pipelined processor organization: data and control hazards, ILP, out of order execution, multithreading. Memory systems: DRAM organization, cache organization. Impact on software performance, locality, multithreading, prefetching. 12

13 Course Outline (2) Analytical Models for Parallel Systems Architecture performance metrics: CPI, MIPS, SpecMark. Software performance benchmarks: Peak performance, sustained performance, LinPack, Bandwidth benchmarks. Limits on achievable performance, Amdahl s Law, scalability definitions, work optimality, Iso-efficiency function, Order notation. 13

14 Course Outline (3) Analytical Models for Parallel Systems(cont.) Communication costs in parallel machines: start-up cost, throughput, latency. Routing mechanisms: packet routing, cut through, virtual channels. Modeling message passing and shared address space machines. Data layouts and graph embeddings. Multi-core, many-core and GPU architectures, data parallel programming abstraction of GPUs. 14

15 Course Outline (4) PRAM and Data Parallel Algorithms PRAM model of computation, Brent s theorem, other models of computation, illustrative examples Max, Scan operations Recursive doubling, graph algorithms Performance analysis, scalability, relation ship to practical platforms Sorting, FFT 15

16 Course Outline (5) Basic Communication Primitives Broadcast and all-to-all, communication costs on various topologies Personalized communication, Reduce, prefix sum, and scatter and gather Graph embeddings 16

17 Course Outline (6) Message Passing Programming Model Message passing abstraction, send receive primitives, blocking and non-blocking commands, collective operations Illustrative examples: Canon s algorithm, overlapping computation and communication, Odd- even merge sort, Review for Midterm 2 Shared Address Space Programming Models Pthreads, OpenMP, Illustrative examples 17

18 Course Outline (7) Parallel Dense Algebra Matrix vector, matrix matrix computations Parallel Dense Algebra Parallel Search and Sorting Parallel search, illustrative example applications, throughput optimization, Multi-dimensional search, decision tree and decomposition Sorting techniques, Bitonic sort, row-column sort, Mapping onto parallel architectures 18

19 Course Outline (8) Cloud, Big Data and Map Reduce Cloud as a computing platform, Large data sets and organization Discovery of knowledge, annotation of data Computational and programming models for Big Data Map Reduce as a parallel programming model, Hadoop, Spark Illustrative examples from data science domain 19

20 Acknowledgement Lecture notes compiled over the years My research team Course TA and mentors during Spring 14, 15, and 16 EE451 Students in Spring 14, 15, and 16 Vipin Kumar, Ananth Grama 20

21 Research Projects Research High Performance Networking Big Data Analysis Social Network Analysis Energy Efficient Computing Webpages ceng.usc.edu/~prasanna/

22 Related Courses EE/CSCI 451 Introduction to Parallel and Distributed Computation (Prasanna) EE 454 Introduction to Systems-on-Chip (Bogdan) EE 457 Computer Systems Organization (Puvvada) CSCI 402 Operating Systems (Cheng) EE 557 Computer Systems Architecture (Annavaram/Dubois) EE 560L Digital System Design Tools and Techniques (Puvvada) EE 532 Wireless Internet and Pervasive Computing (Hwang) EE 533 Network Processor Programming and Design (Cho) CSCI 558L Internetworking and Distributed System Laboratory (Cho) CSCI 570 Analysis of Algorithms (Shamsian/Adamchik) CSCI 596 Scientific Computing and Visualization (Nakano) EE 653 Advanced Topics in Microarchitecture (Dubois) CSCI 653 High Performance Computing and Simulations (Nakano) EE 659 Interconnection Networks (Pinkston) EE 657 Parallel and Distributed Computing (Dubios) EE 677 VLSI Architectures and Algorithms (Prasanna) 22

23 Sample Course work with focus on Computer Architecture EE/CSCI 451 Introduction to Parallel and Distributed Computation (Prasanna) EE 454 Introduction to Systems-on-Chip (Bogdan) EE 457 Computer Systems Organization (Puvvada) CSCI 402 Operating Systems (Cheng) EE 557 Computer Systems Architecture (Annavaram/Dubois) EE 533 Network Processor Programming and Design (Cho) EE 599 Cyber-Physical Systems (Bogdan) CSCI 570 Analysis of Algorithms (Shamsian/Adamchik) EE 653 Advanced Topics in Microarchitecture (Dubois) EE 659 Interconnection Networks (Pinkston) EE 677 VLSI Architectures and Algorithms (Prasanna) 23

24 Sample course work with focus on Hardware, embedded systems EE/CSCI 451 Introduction to Parallel and Distributed Computation (Prasanna) EE 454 Introduction to Systems-on-Chip (Bogdan) EE 457 Computer Systems Organization (Puvvada) EE 477 MOS VLSI Circuit Design (Nazarian) EE 577a VLSI System Design (Pedram/Nazarian) EE 577b VLSI System Design (Pedram) EE 560L Digital System Design Tools and Techniques (Puvvada) EE 532 Wireless Internet and Pervasive Computing (Hwang) EE 533 Network Processor Programming and Design (Cho) EE 599 Cyber-Physical Systems (Bogdan) CSCI 570 Analysis of Algorithms (Shamsian/Adamchik) EE 677 VLSI Architectures and Algorithms (Prasanna) 24

25 Sample course work with focus on parallel and scientific computing EE/CSCI 451 Introduction to Parallel and Distributed Computation (Prasanna) EE 454 Introduction to Systems-on-Chip (Bogdan) EE 457 Computer Systems Organization (Puvvada) CSCI 402 Operating Systems (Cheng) EE 557 Computer Systems Architecture (Annavaram/Dubois) EE 542 Internet and Cloud Computing (Hwang) CSCI 570 Analysis of Algorithms (Shamsian/Adamchik) CSCI 596 Scientific Computing and Visualization (Nakano) CSCI 653 High Performance Computing and Simulations (Nakano) 25

26 Sample course work with focus on distributed systems and big data EE/CSCI 451 Introduction to Parallel and Distributed Computation (Prasanna) EE 450 Introduction to Computer Networks (Zahid/Touch) EE 542 Internet and Cloud Computing (Hwang) CSCI 402 Operating Systems (Cheng) CSCI 555 Advanced Operating Systems (Govindan) CSCI 567 Machine Learning (Sha/Liu) CSCI 570 Analysis of Algorithms (Shamsian/Adamchik) CSCI 585 Database Systems (Raghavachary/Ghyam) CSCI 657 Advanced Distributed Systems (Lloyd) 26

27 CSCI 596: Scientific Computing and Visualization (4 Units) Overview: Particle and continuum simulations are used as a vehicle to learn basic elements scientific computing & visualization. Hands-on experience in: 1) formulating a mathematical model to describe a physical phenomenon; 2) discretizing the model (differential or integral equations) into algebraic forms in order to allow numerical solution; 3) designing/analyzing numerical algorithms to solve the algebraic equations on parallel computers; 4) translating the algorithms into a program; 5) performing a computer experiment by executing the program; 6) visualizing simulation data in an immersive and interactive virtual environment; and 7) managing/mining large datasets. Instructor: Aiichiro Nakano, anakano@usc.edu Prerequisite: Basic knowledge of programming, data structures, linear algebra, and calculus Intended Audience: MS students Course Website: cacs.usc.edu/education/cs596.html Offered: Fall 14, Fall 15, Fall

28 CSCI 653: High Performance Computing & Simulations (4 Units) Overview: Provide students with advanced techniques that are common to high performance computer simulations in science and engineering. Scalable algorithms for both deterministic and stochastic simulations of particles and continuum will be implemented on massively parallel and distributed computing platforms, and the simulation datasets will be visualized and analyzed in immersive and interactive virtual environment. Instructor: Aiichiro Nakano, Prerequisite: CS 596 Intended Audience: PhD students Course Website: cacs.usc.edu/education/cs653.html Offered: Fall

29 CSCI 570: Analysis of Algorithms (4 Units) Overview: The course is intended as a first graduate course in the design and analysis of algorithms. The main focus is on developing an understanding for the major algorithm design techniques. At times, the practical side of algorithm design is also explored with interesting examples of their usage in solving industry problems. Instructor: S. Shamsian, sshamsia@usc.edu Prerequisite: Students in the class are expected to have a reasonable degree of mathematical sophistication, and to be familiar with the basic notions of algorithms and data structures, discrete mathematics, and probability. Undergraduate classes in these subjects should be sufficient. Textbook: Algorithm Design, Jon Kleinberg/Eva Tardos Offered: Spring, Summer, Fall 29

30 EE 677: VLSI Architectures and Algorithms (3 Units) Overview: This course focuses on reconfigurable computing based on Field Programmable Gate Arrays (FPGAs) for implementing application-specific architectures and algorithms including embedded computing applications. The course will address devices, systems, and application synthesis and analysis. Instructor: Viktor Prasanna, Prerequisite: EE 457 and CS 570 or consent of the instructor Intended Audience: MS and PhD students Course Website: ceng.usc.edu/~prasanna/ee677 Text: Papers from the literature Offered: in Spring 30 30

31 Outline Opportunities and challenges in exploiting parallelism Course outline High level view of Processor Core Organization Chapter 2.1, 2.2 in the text (Implicit parallelism) Superscalar Very Long Instruction Word (VLIW) processor Multi-core processor organization 31

32 Uniprocessor (Serial Processor) RAM (Random Access Machine) A simple view Memory Fetch Execute Processor Memory access = 1 cycle Execution = 1 cycle 32

33 Pipelined implementation of processor (1) All processors are pipelined Five stage pipeline (See Hennessey Patterson, EE 457) IF ID EX MEM WB Instruction Fetch Instruction Decode/ Register file read Execute/ Address calculation Memory access Write Back 33

34 Pipelined implementation of processor(2) Ins IF ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB IF ID EX MEM 1. Load 2. Load 3. Add R1, Add R2, Add R1, R2 6. Store WB Total # of cycles to complete execution = 10 34

35 Simple Performance Model Instruction exec. rate = clock rate = peak performance? Time to execute n ins. n / clock rate Example: 2 GHz processor core Peak performance = 2 G Ins. per second 35

36 Superscalar execution Example: 2 pipelines (Sharing hardware resources) 2 improvement? 5 stage pipeline Instruction memory Registers Functional units Data memory 5 stage pipeline 36

37 2-way super scalar execution (almost 2 Ins. Issued per cycle) Load 2. Load 3. Add R1, Add R2, Add R1, R2 IF ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB 6. Store Total # of cycles = 8 Note: Ins. 5 and 6 cannot be issued in the same cycle 37 IF ID EX MEM WB

38 Data dependencies Output of Ins i is an input to Ins k (k > i) Potential problem: Ins k is close to Ins i during execution (during execution Ins L and Ins I are in the hardware pipeline concurrently) Example: Single Pipeline 1. Load 2. Load 3. Add R1, Add R2, Add R1, R2 Simple solution = Stall the pipeline Hardware solution = Internal Forwarding Ex.: Dependency between Ins. 4 and 5 can be handled without stalling pipeline. 38

39 Out of Order Issue Example: 2-way superscalar 1. Load 2. Add R1, Load 4. Add R2, 104 Cannot issue Ins. 1, 2 simultaneously. Cannot issue Ins. 3, 4 simultaneously However, can do: Initiate Ins. 1, Ins. 3 in clock cycle 0 Initiate Ins. 2, Ins. 4 in clock cycle 1 39

40 Dynamic Instruction Issue Code Window - Series of instructions Identify those that can be executed in parallel Note: Hardware does scheduling at run time Way of exploiting instruction level parallelism 40

Resource dependency Example: 2-way super scalar execution one hardware Floating Point Unit (FPU) Execution pipelines FPU Load R1, @1000 Load R2,

41 Resource dependency Example: 2-way super scalar execution one hardware Floating Point Unit (FPU) Execution pipelines FPU Load Load FP Mult R1, 100 FP Mult R2, 104 Execution pipelines Execution hardware detects such dependencies and schedules the execution accordingly 41

42 Superscalar (1) Instruction-level parallelism Multiple function units (Ex. arithmetic logic unit, bit shifter) One or more instructions issued per clock In-order issue Out-of-order execution In-order commit 42

43 Superscalar (2) 43

44 Bulldozer Architecture (1) Microprocessor microarchitecture by AMD Two-core module Two integer execution units 2 arithmetic logic units 2 address generation units One floating point execution unit Shared L2 cache Shared instruction fetch and decode stages 44

45 Bulldozer Architecture (2) Handle 4 independent arithmetical and memory operations per clock for each integer core Issue bandwidth = 4 45

46 Sandy Bridge Microarchitecture by Intel For each Integer execution unit (core) 3 ALUs + 2 AGUs 3 ALU+ 1 AGU / 2 ALU + 2 AGU ( Note: four instruction decoders) 46

47 Very Long Instruction Word (VLIW) Processor Superscalar processor Dynamic scheduling of Ins. Scheduling hardware is expensive Dependence window size # of pipelines Do scheduling at compile time (in software) Specify parallelism as parallel activities (using long word of the instruction format) 47

48 Processor Execution Model SISD Single Instruction Single Data (Random Access Machine) Instruction Set Architecture (ISA) e.g. X86 M fetch 64 P execute 48

49 Designing and Executing a Parallel Program Performance metric Execution time Execution time depends on Compile time optimizations (Compiler level) Execution time optimizations (Processor level Implicit) Many optimizations are possible at architecture and hardware level (EE 457, EE 557) 49 Course focus Problem User Program Compile Code for target ISA Target HW e.g. matrix mult.

50 Multi-core Processor (1) AMD 6278 processor 16 cores Core Core Core 2.4 GHz L1 cache: 16 KB L1 Cache L1 Cache L1 Cache L2 cache: 2 MB L2 Cache L2 Cache L2 Cache L3 cache: 6 MB Main memory: 64 GB L3 Cache Main Memory 50

51 Multi-core Processor (2) Parallelism Pipelined execution of Instructions (each core) Instruction-level parallelism (each core) Thread-level parallelism (across cores) Data-level parallelism Loop-level parallelism 51

52 Multi-core Processor (3) Challenges (to achieving high performance) Identify and exploit explicit parallelism Cache performance Co-ordination 52

53 Summary Challenges in exploiting parallelism Course outline Core -> Multi-core -> System Implicit Parallelism 53

EE/CSCI 451 Parallel and Distributed Computation

EE/CSCI 451 Parallel and Distributed Computation Lecture #1 1/11/2018 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Course information Outline