High Performance Computing in C and C++
1 High Performance Computing in C and C++ Rita Borgo Computer Science Department, Swansea University
2 Announcement No change in lecture schedule; the timetable remains the same: Monday 1 to 2, Glyndwr C; Friday 2 to 3, Glyndwr C. Substitute lecturer for week 3. Finish C. Coursework 1: C and HPC.
3 Summary Goals of HPC: Performance: Definition Metrics Top500 list Scaling Efficiency Cost
4 Today Computational Models: parallel vs. distributed computing. Parallel Architectures. Intro to Parallel Programming.
5 Assignment 1 Due date: November 2, 2012 at 11:00 AM. Three parts: 4 questions; one programming assignment. Coding conventions: TO BE FOLLOWED.
6 Assignment 1 Due date: November 2, 2012 at 11:00 AM. New College Policy: WARNING!! Late submissions get a ZERO!!
7 Course Administration All assignments are individual work (unless stated otherwise). Copying solutions is considered cheating. Submitted documents will be compared. Keep a copy of the listings to provide evidence of creative work. Unfair practice and plagiarism: University definition and procedure: andprogress/unfairpracticeprocedure/ School definition and procedure: refer to the Handbook.
8 COMPUTATIONAL MODELS
9 Two Types of HPC Parallel Computing: breaking the problem to be computed into parts that can run simultaneously on different processors. Example: a program to perform matrix multiplication. Solves tightly coupled problems. Distributed Computing: parts of the work are computed in different places (note: does not necessarily imply simultaneous processing). Example: running a workflow in a Grid. Solves loosely coupled problems (not much communication).
10 Architecture Types Shared Memory: Usually via threads, all processors can access all memory directly at any time; Distributed Memory: A processor can access only its own memory, but processors can share data using message passing.
11 Architecture: Shared Memory Uniform Memory Access (UMA) vs. Non-Uniform Memory Access (NUMA). [Diagrams: UMA - processing elements PE 0 ... PE n all reach one shared memory through a single interconnect; NUMA - shared memories 1 ... m, each local to a group of PEs (PE 1 ... PE n, ..., PE (m-1)n+1 ... PE m.n), linked by an interconnect.]
12 Architecture: Shared Memory Shared memory (uniform memory access - UMA): multiple CPUs, single memory, shared I/O. All resources in an SMP machine are equally available to each CPU. Processors share access to a common memory space, implemented over a shared memory bus or switch. Support for critical sections is required. Local cache is critical.
13 Shared Memory - UMA Why is local cache critical? Without it, bus/switch contention (or network traffic) reduces the system's efficiency; for this reason, uniform memory access systems do not scale well. Cache introduces problems of coherency (ensuring that caches are updated when other processors alter the memory).
14 Shared Memory - NUMA Shared memory (non-uniform memory access - NUMA): multiple CPUs; each CPU has fast access to its local area of the memory, but slower access to other areas. Scales well to a large number of processors. Complicated memory access pattern. Global address space.
15 Distributed Memory Each processor has its own local memory. Data exchange/sharing is done through explicit communication: message passing (e.g. the MPI library). Larger latencies between processors. Scalability is good if the task to be computed can be divided properly.
16 Why the Schism? Problems whose parts are completely separated from and independent of one another are trivial to parallelize/distribute. But most interesting problems have some irreducible interaction between their parts. The two different memory models or computing paradigms encourage two very different ways to handle those interactions.
17 Basic Issues Two processes: Alice and Bob. Alice's task: add two numbers; the first number is her own. Bob's task: provide the second number. Three possible scenarios. Whiteboard = shared memory.
18 The Shared Memory Lucky Example
19 The Shared Memory Unlucky Example 1
20 The Shared Memory Unlucky Example 2
21 Shared Memory: now what? How do you solve it? Locking mechanism. Semaphore. Synchronization MUST be guaranteed! However, results are non-deterministic...
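To make the locking idea concrete, here is a minimal shared-memory sketch (my illustration, not from the slides) using POSIX threads: two threads repeatedly update a shared counter, and a mutex protects the critical section so the read-modify-write cannot be interleaved. Removing the lock/unlock calls reproduces the unlucky interleavings of the previous slides.

    #include <pthread.h>
    #include <stdio.h>

    long counter = 0;                       /* shared state ("the whiteboard") */
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void *worker(void *arg) {
        (void)arg;                          /* unused */
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&lock);      /* enter critical section */
            counter = counter + 1;          /* read-modify-write is now atomic */
            pthread_mutex_unlock(&lock);    /* leave critical section */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter); /* 2000000 with the lock; unpredictable without it */
        return 0;
    }

A possible build (the file name is hypothetical): gcc -pthread counter.c -o counter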
22 Distributed Memory Message Passing
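By contrast, here is a minimal message-passing sketch of the Alice/Bob exchange (my illustration, assuming the standard MPI library): each rank owns its own number, and the second number travels between the two address spaces only through an explicit MPI_Send/MPI_Recv pair.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, a = 0, b = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            a = 40;                                          /* Alice's own number */
            MPI_Recv(&b, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,   /* wait for Bob's number */
                     MPI_STATUS_IGNORE);
            printf("sum = %d\n", a + b);
        } else if (rank == 1) {
            b = 2;                                           /* Bob's number */
            MPI_Send(&b, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);  /* send it to Alice explicitly */
        }

        MPI_Finalize();
        return 0;
    }

A possible build/run (file name hypothetical): mpicc alice_bob.c -o alice_bob && mpirun -np 2 ./alice_bob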
23 HPC Strategies Good performance and scalability: how well we can obtain solutions faster, or solve larger problems. Solutions are not created equal in parallelization: Luck. Increase data locality. Reduce dependencies. Amortize system overheads.
24 Simple Case Study: test 1
#include <stdio.h>
#include <time.h>
#define N 2048
float x[N], y[N], A[N][N];
int main(void) {
    int i, j, irepeat;
    float total;
    clock_t c1, c2;
    c1 = clock();
    for (irepeat = 0; irepeat < 5; irepeat++) {
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                y[i] = y[i] + A[i][j] * x[j];
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                x[i] = x[i] + A[i][j] * y[j];
    }
    c2 = clock();
    total = (c2 - c1) * 1000.f / CLOCKS_PER_SEC;
    printf("total time = %6.2f milliseconds\n", total);
    return 0;
}
25 Simple Case Study: test 2
#include <stdio.h>
#include <time.h>
#define N 2048
float x[N], y[N], A[N][N];
int main(void) {
    int i, j, irepeat;
    float total;
    clock_t c1, c2;
    c1 = clock();
    for (irepeat = 0; irepeat < 5; irepeat++) {
        for (j = 0; j < N; j++)
            for (i = 0; i < N; i++)
                y[i] = y[i] + A[i][j] * x[j];
        for (j = 0; j < N; j++)
            for (i = 0; i < N; i++)
                x[i] = x[i] + A[i][j] * y[j];
    }
    c2 = clock();
    total = (c2 - c1) * 1000.f / CLOCKS_PER_SEC;
    printf("total time = %6.2f milliseconds\n", total);
    return 0;
}
26 Simple Case Study
$> gcc -o test1 test1.c
$> test1
total time taken = milliseconds
$> gcc -o test2 test2.c
$> test2
total time taken = milliseconds
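A back-of-the-envelope calculation (my addition, not on the slide) shows what is at stake: A holds 2048 x 2048 = 4,194,304 floats, i.e. 16 MB at 4 bytes each, far larger than a typical CPU cache, while x and y are only 8 KB each. In test 1 the inner loop walks along a row of A, so consecutive accesses are 4 bytes apart and reuse the cache lines just fetched; in test 2 the inner loop walks down a column, so consecutive accesses are 2048 * 4 = 8192 bytes apart and almost every access misses the cache. This is why the row-wise version is expected to run much faster, which is exactly the point of the next slide.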
27 Data locality Data are stored in a linear address space. The storage pattern makes no difference to what the program computes, but it is important for performance. Access data according to how the data is stored: local data: follow the native layout (C uses row-major order); remote data: reduce the need to move data between nodes.
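A small sketch (my illustration) of what row-major order means in C: element [i][j] of an N x N array lives at flat offset i*N + j, so stepping j by one moves to the neighbouring float while stepping i by one jumps a whole row.

    #include <stdio.h>
    #define N 2048

    float A[N][N];

    int main(void) {
        float *flat = &A[0][0];             /* the matrix viewed as one linear array */
        int i = 3, j = 7;
        /* Row-major layout: A[i][j] is the (i*N + j)-th float from the start of A. */
        printf("same element? %d\n", &A[i][j] == &flat[i * N + j]);
        /* Neighbouring j values are 4 bytes apart; neighbouring i values are N*4 = 8192 bytes apart. */
        printf("stride in j = %ld bytes, stride in i = %ld bytes\n",
               (long)((char *)&A[i][j + 1] - (char *)&A[i][j]),
               (long)((char *)&A[i + 1][j] - (char *)&A[i][j]));
        return 0;
    }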
28 PARALLEL PROCESSING & ARCHITECTURES
29 Parallel Processing Concurrent use of multiple processors to process data, either by running the same program on each processor or by running different programs on each processor. Parallel processing may occur in the instruction stream, the data stream, or both. The sequence of instructions read from memory is the instruction stream; the operations performed on the data in the processor are the data stream.
30 Types of Parallelism Pipeline parallelism: instruction stream. Data parallelism: data stream. Task parallelism: more complex, interdependencies cannot be avoided, implies data sharing.
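To make data parallelism concrete, here is a minimal sketch (my illustration, not from the slides) using OpenMP in C, assuming a compiler with OpenMP support such as gcc -fopenmp: every iteration applies the same operation to a different array element, so the iterations can be divided among threads.

    #include <stdio.h>
    #define N 1000000

    float data[N];

    int main(void) {
        /* Data parallelism: the same operation applied to many data elements.    */
        /* The pragma asks the runtime to split the iterations across threads;    */
        /* without -fopenmp the pragma is simply ignored and the loop runs serially. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            data[i] = data[i] * 2.0f + 1.0f;

        printf("data[0] = %f\n", data[0]);
        return 0;
    }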
31 Architectural Classification Flynn's Classification (1972): four classes, based on the multiplicity of instruction streams and data streams. Instruction stream: the sequence of instructions read from memory. Data stream: the operations performed on the data in the processor.
                                        Number of Data Streams
                                        Single        Multiple
  Number of Instruction     Single      SISD          SIMD
  Streams                   Multiple    MISD          MIMD
32 Flynn's Taxonomy Classes: SISD - Single Instruction, Single Data. Instructions operate sequentially on a single stream of data in a single memory unit. Classic von Neumann architecture. Machines may still consist of multiple (unicore) processors operating on independent data - these can be considered as multiple SISD systems. Examples: scalar and superscalar processors. Superscalar processors: instruction-level parallelism (more than one instruction per clock cycle). Example machines: UNIVAC 1, Cray 1.
33 Flynn's Taxonomy Classes: SIMD - Single Instruction, Multiple Data. A single instruction stream (broadcast to all processors) acting on multiple data. The most common form of this architecture class is the vector processor; these can deliver results several times faster than scalar processors. Example: GPUs. Limitations: not all algorithms can be vectorized; algorithm implementation is tricky; no compiler support; architecture specific. Example machine: Cray SMP.
34 Flynn's Taxonomy Classes: MISD - Multiple Instruction, Single Data. No practical implementations of this architecture.
35 Flynn's Taxonomy Classes: MISD - Multiple Instruction, Single Data. No practical implementations of this architecture. MIMD - Multiple Instruction, Multiple Data. Multiple instruction streams acting on different data. Most common HPC systems. Can be either shared- or distributed-memory MIMD. Multi-core superscalar processors are classified as MIMD. Example machines: Cray T3, IBM BG/L.
36 Flynn's Taxonomy Classes: SISD - Single Instruction, Single Data. SIMD - Single Instruction, Multiple Data. MISD - Multiple Instruction, Single Data. MIMD - Multiple Instruction, Multiple Data. Parallelism can be achieved within: SISD (superscalar); SIMD (vector processors); MIMD (clusters, massively parallel processors).
37 In-processor Parallelism (single processor) Pipelining: overlap the execution of instructions. Example: $> cat scalar_array | extract_contour | gzip -c > contour_data.z. cat: reads the disk file "scalar_array" and sends its content to "extract_contour" through a pipe. extract_contour: visualization program (scalar function -> contour geometry). gzip: compresses the data (geometry) and writes the compressed data to disk.
38 In-processor Parallelism (single processor) Pipelining (SISD): overlap the execution of instructions. Example: $> cat scalar_array | extract_contour | gzip -c > contour_data.z. 3 processes running together keeps 3 cores busy (speedup) and keeps multiple devices busy at the same time, which helps to overlap computation with the wait for services done by system devices (OS scheduler). Achieves streaming.
39 In-processor Parallelism (single processor) Pipelining: Overlap the execution of instructions. Reduces the idle time of hardware components. Good performance with independent instructions. Performing more operations per clock cycle. Does not reduce latency. As fast as the slowest step. Branches can be a problem (loops).
40 Pipelining: Branches Example: loop-level parallelism: exploit parallelism among iterations of a loop. Example 1:
for (i = 1; i <= 1000; i = i + 1)
    x[i] = x[i] + y[i];    // instruction-level parallelism
41 Pipelining: Branches Loop-level parallelism: exploit parallelism among iterations of a loop. Example 1:
for (i = 1; i <= 1000; i = i + 1)
    x[i] = x[i] + y[i];    // instruction-level parallelism
Example 2:
for (i = 1; i <= 100; i = i + 1) {
    a[i] = a[i] + b[i];        // S1
    b[i+1] = c[i] + d[i];      // S2
}
Is this loop parallel?
42 Pipelining: Branches Loop-level parallelism: exploit parallelism among iterations of a loop. Example 2:
for (i = 1; i <= 100; i = i + 1) {
    a[i] = a[i] + b[i];        // S1
    b[i+1] = c[i] + d[i];      // S2
}
Is this loop parallel? There is a loop-carried dependency: S1 depends on S2... but no cycles!
43 Loop-level parallelism A loop is parallel unless there is a cycle in the dependencies. The absence of a cycle means that the dependencies give a partial ordering on the statements. Parallel loop, Example 2 re-written:
a[1] = a[1] + b[1];                  // S1, first iteration
for (i = 1; i <= 99; i = i + 1) {
    b[i+1] = c[i] + d[i];            // S2
    a[i+1] = a[i+1] + b[i+1];        // S1
}
b[101] = c[100] + d[100];            // S2, last iteration
44 Loop-level parallelism A loop is parallel unless there is a cycle in the dependencies. The absence of a cycle means that the dependencies give a partial ordering on the statements. Non-parallel loop, Example 3:
for (i = 1; i <= 100; i = i + 1) {
    a[i+1] = a[i] + c[i];        // S1
    b[i+1] = b[i] + a[i+1];      // S2
}
S1 and S2 depend on each other.
45 Loop unrolling There are a number of techniques for converting loop-level parallelism into instruction-level parallelism. Such techniques work by unrolling the loop.
for (int i = 0; i < imagesize; i++) {
    pixels[i] *= scale;
}
Unrolled by a factor of 4 (assuming imagesize is a multiple of 4):
for (int i = 0; i < imagesize; i += 4) {
    pixels[i]     *= scale;
    pixels[i + 1] *= scale;
    pixels[i + 2] *= scale;
    pixels[i + 3] *= scale;
}
Activated by the option -funroll-loops in GCC.
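The compiler can perform this transformation itself, so the source can stay in the simple form; a possible invocation (the file name is hypothetical) is:
$> gcc -O2 -funroll-loops -o image_scale image_scale.c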
46 In-processor Parallelism (single processor) Pipelining (SISD): overlap the execution of instructions. Reduces the idle time of hardware components. Good performance with independent instructions. Performing more operations per clock cycle. Discrepancy between peak and actual performance is often caused by pipeline effects. Difficult to keep pipelines full (conditional branches might be a reason). Branch prediction helps: correct prediction is very fast; incorrect prediction is very slow. Accuracy is about 95%, so 5% of branches cause a pipeline stall (bad!).
47 In-processor Parallelism (single processor) Vector architectures (SIMD): each result is independent of the previous result: long pipeline (high clock rate). Vector instructions access memory with a known pattern: highly interleaved memory (low latency); no (data) caches required (they do use an instruction cache). Reduces branches and branch problems in pipelines. Fewer instruction fetches. Bad performance on problems that do not have independent inputs.
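As an illustration (mine, not from the slides) of the kind of loop such hardware handles well, here is a SAXPY-style loop: every iteration is independent and the memory accesses have unit stride, so a vectorizing compiler (e.g. gcc -O3, or -O2 -ftree-vectorize) can map it to vector instructions that process several floats per operation.

    #include <stdio.h>
    #define N 4096

    float x[N], y[N];

    int main(void) {
        const float a = 2.5f;
        /* Independent iterations, unit-stride memory accesses: ideal for SIMD/vector units. */
        for (int i = 0; i < N; i++)
            y[i] = a * x[i] + y[i];

        printf("y[0] = %f\n", y[0]);
        return 0;
    }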
48 Vector Processors: Branches Branches are expensive on GPUs.
__kernel void stripe(__global const float4 *input,
                     __global float4 *output)
{
    int i = get_global_id(0);
    // Lighten even pixels, darken odd pixels
    if (i % 2) {
        output[i] = input[i] * 0.9f;
    } else {
        output[i] = input[i] * 1.1f;
    }
}
Each pair of threads will take different branches (fragments). Only half will actually be running in parallel.
49 Multiprocessor Parallelism MIMD architectures. Divide the workload up between processors, often by dividing up a data structure; each processor works on its own data. Typically processors need to communicate: shared memory, message exchange, or distributed shared memory (virtual global address space).
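A minimal shared-memory sketch of dividing up a data structure (my illustration, assuming POSIX threads): the array is split into contiguous chunks, each thread sums its own chunk, and only the small per-thread results are combined at the end.

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000
    #define NTHREADS 4

    float data[N];
    double partial[NTHREADS];               /* one slot per thread: no write conflicts */

    void *sum_chunk(void *arg) {
        int id = *(int *)arg;
        int chunk = N / NTHREADS;
        int lo = id * chunk;
        int hi = (id == NTHREADS - 1) ? N : lo + chunk;   /* last thread takes the remainder */
        double s = 0.0;
        for (int i = lo; i < hi; i++)
            s += data[i];
        partial[id] = s;
        return NULL;
    }

    int main(void) {
        pthread_t threads[NTHREADS];
        int ids[NTHREADS];
        for (int i = 0; i < N; i++)
            data[i] = 1.0f;                 /* so the expected total is N */

        for (int t = 0; t < NTHREADS; t++) {
            ids[t] = t;
            pthread_create(&threads[t], NULL, sum_chunk, &ids[t]);
        }
        double total = 0.0;
        for (int t = 0; t < NTHREADS; t++) {
            pthread_join(threads[t], NULL);
            total += partial[t];            /* combine the per-thread results */
        }
        printf("total = %.0f\n", total);
        return 0;
    }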
50 PARALLEL PROGRAMMING
51 Designing a Parallel Program Granularity Data Dependency Communication
52 Granularity Granularity of parallelism is the ratio of the computation performed in parallel to the communication: Fine: relatively small amounts of computational work are done between communication events. Coarse: relatively large amounts of computational work are done between communication events.
53 Granularity Four types of parallelism (in order of granularity size): instruction-level parallelism (e.g. a pipeline); thread-level parallelism (e.g. running a multi-threaded Java program); process-level parallelism (e.g. running an MPI job on a cluster); job-level parallelism (e.g. running a batch of independent single-processor jobs on a cluster). Which is best? The most efficient granularity depends on the algorithm and the hardware environment in which it runs. In most cases the overhead associated with communication and synchronization is high relative to execution speed, so it is advantageous to have coarse granularity. Fine-grain parallelism can help reduce overheads due to load imbalance.
54 Communication vs. Computation Main issues that affect parallel efficiency: Ratio of computation to communication: higher computation usually yields better performance. Communication bandwidth & latency: latency has the biggest impact. Scalability: inherent limit in the problem; hardware limit: do the bandwidth & latency scale with the number of processors?
55 Dependency Dependencies are one of the primary inhibitors to parallelism. Dependency: if event A must occur before event B, then B is dependent on A. Two types of dependency: Control dependency: waiting for the instruction that controls the execution flow to complete, e.g. if (X != 0) Y = 1.0/X; here Y has a control dependency on the test X != 0. Data dependency: dependency because of calculations or memory accesses. Flow dependency: A = X + Y; B = A + C;. Anti-dependency: B = A + C; A = X + Y;. Output dependency: A = 2; X = A + 1; A = 5;.
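The slide's examples, gathered into one compilable C snippet (my arrangement) with comments naming each dependency:

    #include <stdio.h>

    int main(void) {
        float X = 3.0f, Y = 0.0f, A = 0.0f, B = 0.0f, C = 1.0f;

        /* Control dependency: the assignment to Y must wait for the test on X. */
        if (X != 0)
            Y = 1.0f / X;

        /* Flow dependency: B reads the A written by the previous statement. */
        A = X + Y;
        B = A + C;

        /* Anti-dependency: A is read here and overwritten by the next statement,
           so the two statements cannot be freely reordered. */
        B = A + C;
        A = X + Y;

        /* Output dependency: two writes to A; the final value must be the later one. */
        A = 2;
        X = A + 1;
        A = 5;

        printf("%.1f %.1f %.1f %.1f\n", A, B, X, Y);
        return 0;
    }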
56 Identifying Dependency Draw a Directed Acyclic Graph (DAG) to identify the dependencies among a sequence of instructions. Anti-dependency: a variable appears as a parent (input) in a calculation and then as a child (output) in a later calculation. Output dependency: a variable appears as a child in a calculation and then as a child again in a later calculation. Example sequence: 1: X = A + B; 2: D = X * 17; 3: A = B + C; 4: X = C + E. [DAG: flow edge from 1 to 2 through X; anti-dependency on A between 1 and 3; anti-dependency on X between 2 and 4; output dependency on X between 1 and 4.]
57 How to Handle Data Dependencies: Distributed memory architectures - communicate required data at synchronization points. Shared memory architectures - synchronize read/write operations between tasks. Loop-carried dependencies are the most important.