Parallel Computing Architectures


1 Parallel Computing Architectures Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/

2 Copyright Moreno Marzolla, Università di Bologna, Italy This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0). To view a copy of this license, visit https://creativecommons.org/licenses/by-sa/4.0/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA. 2

3 An Abstract Parallel Architecture [Figure: several processors and several memories connected by an interconnection network] How is parallelism handled? What is the exact physical location of the memories? What is the topology of the interconnection network? 3

4 Why are parallel architectures important? There is no "typical" parallel computer: different vendors use different architectures There is currently no universal programming paradigm that fits all architectures Parallel programs must be tailored to the underlying parallel architecture The architecture of a parallel computer limits the choice of the programming paradigm that can be used 4

5 Von Neumann architecture and its extensions 5

6 Von Neumann architecture [Figure: the CPU (general-purpose registers R0..Rn-1, ALU, PC, IR and PSW registers, control unit) connected to memory through the data, address and control buses] 6

7 The Fetch-Decode-Execute cycle The CPU performs an infinite loop. Fetch: the opcode of the next instruction is fetched from the memory address stored in the PC register and placed in the IR register. Decode: the content of the IR register is analyzed to identify the instruction to execute. Execute: the control unit activates the appropriate functional units of the CPU to perform the actions required by the instruction (e.g., read values from memory, execute arithmetic computations, and so on). 7
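To make the loop concrete, here is a minimal sketch of a fetch-decode-execute interpreter for an invented toy instruction set (the opcodes, memory layout and accumulator register are made up for illustration and do not correspond to any real CPU):

#include <stdio.h>
#include <stdint.h>

/* Toy machine: two-byte instructions [opcode, operand] */
enum { OP_HALT = 0, OP_LOAD = 1, OP_ADD = 2, OP_STORE = 3 };

int main(void)
{
    uint8_t mem[256] = { OP_LOAD, 10, OP_ADD, 11, OP_STORE, 12, OP_HALT, 0 };
    uint8_t pc = 0, acc = 0, ir, operand;
    mem[10] = 2; mem[11] = 3;                        /* input data */

    for (;;) {
        ir = mem[pc]; operand = mem[pc+1]; pc += 2;  /* fetch: read opcode and advance PC */
        switch (ir) {                                /* decode: select the operation */
        case OP_LOAD:  acc  = mem[operand]; break;   /* execute: activate the needed unit */
        case OP_ADD:   acc += mem[operand]; break;
        case OP_STORE: mem[operand] = acc;  break;
        case OP_HALT:  printf("mem[12] = %d\n", mem[12]); return 0;
        }
    }
}

The program loads 2 and 3 from memory, adds them, stores 5 back to memory, and halts.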

8 How to limit the bottlenecks of the Von Neumann architecture Reduce the memory access latency Hide the memory access latency Rely on CPU registers whenever possible Use caches Multithreading and context-switch during memory accesses Execute multiple instructions at the same time Pipelining Multiple issue Branch prediction Speculative execution SIMD extensions 8

9 CPU times compared to the real world

Event                        Actual time    Scaled time
1 CPU cycle                  0.3 ns         1 s
Level 1 cache access         0.9 ns         3 s
Main memory access           120 ns         6 min
Solid-state disk I/O         μs             2-6 days
Rotational disk I/O          1-10 ms        1-12 months
Internet: SF to NYC          40 ms          4 years
Internet: SF to UK           81 ms          8 years
Internet: SF to Australia    183 ms         19 years
Physical system reboot       1 m            6 millennia

Source:

10 Caching 10

11 Cache hierarchy Large memories are slow; fast memories are small. [Figure: CPU → L1 Cache → L2 Cache → L3 Cache → (possible interconnect bus) → DRAM] 11

12 Cache hierarchy of the AMD Bulldozer architecture Source: 12

13 CUDA memory hierarchy [Figure: each thread has its own registers and local memory; threads in the same block share a per-block shared memory; all blocks access the global, constant and texture memories] 13

14 How the cache works Cache memory is very fast Often located inside the processor chip Expensive, very small compared to system memory The cache contains a copy of the content of some recently accessed (main) memory locations If the same memory location is accessed again, and its content is in cache, then the cache is accessed instead of the system RAM 14

15 How the cache works If the CPU accesses a memory location whose content is not already in the cache, the content of that memory location and "some" adjacent locations are copied into the cache; in doing so it might be necessary to evict some old data from the cache to make room for the new data. The smallest unit of data that can be transferred to/from the cache is the cache line. On Intel processors, usually 1 cache line = 64 B 15
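On Linux with glibc, the cache line size can also be queried at run time; the check below is only a sketch, and _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension that may be missing or report 0 on some systems:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* L1 data cache line size in bytes (0 or -1 if the value is not available) */
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    printf("L1 data cache line size: %ld bytes\n", line);
    return 0;
}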

16-21 Example [Figure sequence: the CPU requests elements stored in RAM (a, b, c, ..., l); on a miss, the whole cache line containing the requested element is copied from RAM into the cache, so subsequent accesses to adjacent elements are served from the cache] 21

22 Spatial and temporal locality Cache memory works well when applications exhibit spatial and/or temporal locality. Spatial locality: accessing memory locations adjacent to recently accessed ones is fast (they are likely in the same cache line). Temporal locality: repeatedly accessing the same memory location(s) is fast (the data is likely still in the cache). 22

23 Example: matrix-matrix product
Given two square matrices p, q, compute r = p q
[Figure: element r[i][j] is the dot product of row i of p and column j of q]

void matmul( double *p, double* q, double *r, int n)
{
    int i, j, k;
    for (i=0; i<n; i++) {
        for (j=0; j<n; j++) {
            r[i*n + j] = 0.0;
            for (k=0; k<n; k++) {
                r[i*n + j] += p[i*n + k] * q[k*n + j];
            }
        }
    }
}
23

24 Matrix representation Matrices in C are stored in row-major order: the elements of each row are contiguous in memory, and adjacent rows are stored one after the other. [Figure: a matrix and its row-major layout in memory, with rows starting at p[0][0], p[1][0], p[2][0], p[3][0]] 24
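A small sketch (not from the slides) that checks the row-major layout: for an n x n matrix stored contiguously, element (i, j) lives at offset i*n + j from the first element, which is exactly the indexing used by matmul() above; the size and indices are arbitrary example values:

#include <stdio.h>

int main(void)
{
    enum { N = 4 };
    double a[N][N];
    double *flat = &a[0][0];
    int i = 2, j = 3;

    /* &a[i][j] and flat + i*N + j denote the same memory location */
    printf("same address? %s\n",
           (&a[i][j] == flat + i*N + j) ? "yes" : "no");
    return 0;
}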

25 Row-wise vs column-wise access Row-wise access is OK: the data is contiguous in memory, so the cache helps (spatial locality) Column-wise access is NOT OK: the accessed elements are not contiguous in memory (strided access) so the cache does NOT help 25
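As a sketch of the two access patterns (the function names are invented for the example): both loops below sum the same n x n matrix stored in row-major order, but the first reads it with unit stride while the second jumps n elements between consecutive accesses:

/* Row-wise traversal: consecutive iterations touch adjacent addresses */
double sum_rowwise(const double *m, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            s += m[i*n + j];
    return s;
}

/* Column-wise traversal: consecutive iterations are n elements apart */
double sum_colwise(const double *m, int n)
{
    double s = 0.0;
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            s += m[i*n + j];
    return s;
}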

26 Matrix-matrix product Given two square matrices p, q, compute r = p q. [Figure: to compute r[i][j], p is scanned row-wise (OK), while q is scanned column-wise (NOT OK)] 26

27 Optimizing the memory access pattern [Figure: the 4x4 matrices p, q and r with their elements laid out in row-major order] 27

28 Optimizing the memory access pattern [Figure: transpose q into qt, so that the columns of q become rows of qt and both operands can be scanned row-wise] 28
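A possible sketch of this optimization (not necessarily the code in matmul.c): transpose q into a scratch matrix qt, then compute every r[i][j] as the dot product of a row of p and a row of qt, so that both operands are read with unit stride:

#include <stdlib.h>

/* r = p * q; q is transposed first so that both operands are scanned row-wise */
void matmul_transpose(const double *p, const double *q, double *r, int n)
{
    double *qt = malloc((size_t)n * n * sizeof(*qt));
    if (qt == NULL) return;

    for (int i = 0; i < n; i++)        /* qt = transpose of q */
        for (int j = 0; j < n; j++)
            qt[j*n + i] = q[i*n + j];

    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += p[i*n + k] * qt[j*n + k];   /* qt[j*n + k] == q[k*n + j] */
            r[i*n + j] = s;
        }
    }
    free(qt);
}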

29 But... Transposing q requires time. Is it worth it? See matmul.c: compare ./matmul-plain vs ./matmul-transpose. To measure the cache performance you can use perf. On Debian/Ubuntu: sudo apt-get install linux-tools-common linux-tools-generic Which performance counters can be measured by perf is system-dependent. The meaning of the performance counters is system-dependent. 29

30 Measuring cache stats
perf stat -B -e \
    task-clock,cycles,cache-references,cache-misses,\
    L1-dcache-loads,L1-dcache-load-misses,\
    L1-dcache-stores,L1-dcache-store-misses \
    ./exec_name

$ ./cache-stats.sh ./matmul-plain
Starting plain matrix-matrix multiply (n=1000)... done
r[0][0] =      elapsed time =      s

Performance counter stats for './matmul-plain':
    3909,... msec task-clock      # 0,997 CPUs utilized
    cycles                        # 3,465 GHz
    cache-references              # 32,235 M/sec                 (low-level cache references and misses)
    cache-misses                  # 26,401 % of all cache refs
    L1-dcache-loads               # 3597,174 M/sec               (L1 data cache references and misses)
    L1-dcache-load-misses         # 8,03% of all L1-dcache hits
    L1-dcache-stores              # 0,000 K/sec
    L1-dcache-store-misses        # 0,455 M/sec
    3,... seconds time elapsed
30

31 Measuring cache stats
perf stat -B -e \
    task-clock,cycles,cache-references,cache-misses,\
    L1-dcache-loads,L1-dcache-load-misses,\
    L1-dcache-stores,L1-dcache-store-misses \
    ./exec_name

$ ./cache-stats.sh ./matmul-transpose
Starting transpose matrix-matrix multiply (n=1000)... done
r[0][0] =      elapsed time =      s

Performance counter stats for './matmul-transpose':
    2678,... msec task-clock      # 0,999 CPUs utilized
    cycles                        # 3,461 GHz
    cache-references              # 1,234 M/sec                  (low-level cache references and misses)
    cache-misses                  # 39,092 % of all cache refs
    L1-dcache-loads               # 5253,280 M/sec               (L1 data cache references and misses)
    L1-dcache-load-misses         # 0,91% of all L1-dcache hits
    L1-dcache-stores              # 0,000 K/sec
    L1-dcache-store-misses        # 0,732 M/sec
    2,... seconds time elapsed
31

32 Instruction-Level Parallelism (ILP) 32

33 Instruction-Level Parallelism Uses multiple functional units to increase the performance of a processor Pipelining: the functional units are organized like an assembly line, and can be used strictly in that order Multiple issue: the functional units can be used whenever required Static multiple issue: the order in which functional units are activated is decided at compile time (example: Intel IA64) Dynamic multiple issue (superscalar): the order in which functional units are activated is decided at run time 33

34 Instruction-Level Parallelism [Figure, left (Pipelining): the five pipeline stages Instruction Fetch (IF), Instruction Decode (ID), Execute (EX), Memory Access (MEM), Write Back (WB). Figure, right (Multiple Issue): an instruction fetch and decode unit issues instructions in order to integer, floating-point and load/store functional units, which execute out of order; a commit unit retires the results in order] 34

35 Pipelining [Figure: the instruction stream Instr1..Instr5 flows through the pipeline; at every cycle a new instruction enters the IF stage while the previous ones advance to ID, EX, MEM and WB, so up to five instructions are in flight at the same time] 35

36 Control Dependency

z = x + y;
if ( z > 0 ) {
    w = x;
} else {
    w = y;
}

The instructions w = x and w = y have a control dependency on z > 0. Control dependencies can limit the performance of pipelined architectures. A branch-free equivalent:

z = x + y;
c = z > 0;
w = x*c + y*(1-c);

Don't do this by hand; ideally, compilers should take care of these optimizations
36

37 In the real world... From GCC documentation Built-in Function: long __builtin_expect (long exp, long c) You may use __builtin_expect to provide the compiler with branch prediction information. In general, you should prefer to use actual profile feedback for this (-fprofile-arcs), as programmers are notoriously bad at predicting how their programs actually perform. However, there are applications in which this data is hard to collect. The return value is the value of exp, which should be an integral expression. The semantics of the built-in are that it is expected that exp == c. For example: if (__builtin_expect (x, 0)) foo (); 37

38 Branch Hint: Example

#include <stdlib.h>

int main( void )
{
    int A[1000000]; /* array size chosen arbitrarily for the example */
    size_t i;
    const size_t n = sizeof(A) / sizeof(A[0]);
    for ( i=0; __builtin_expect( i<n, 1 ); i++ ) {
        A[i] = i;
    }
    return 0;
}

Don't do this by hand; ideally, compilers should take care of these optimizations
38

39 Hardware multithreading Allows the CPU to switch to another task when the current task is stalled Fine-grained multithreading A context switch has essentially zero cost The CPU can switch to another task even on stalls of short durations (e.g., waiting for memory operations) Requires CPU with specific support, e.g., Cray XMT Coarse-grained multithreading A context switch has non-negligible cost, and is appropriate only for threads blocked on I/O operations or similar The CPU is less efficient in the presence of stalls of short duration 39

40 Source: 40

41 Hardware multithreading Simultaneous multithreading (SMT) is an implementation of fine-grained multithreading where different threads can use multiple functional units at the same time. HyperThreading is Intel's implementation of SMT: each physical processor core is seen by the Operating System as two "logical" processors, and each logical processor maintains a complete set of the architecture state: general-purpose registers, control registers, advanced programmable interrupt controller (APIC) registers, some machine state registers. Intel claims that HT provides a 15-30% speedup with respect to a similar, non-HT processor 41
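From a C program, the number of logical processors seen by the operating system (hardware threads, not physical cores) can be obtained with sysconf(); a minimal sketch, noting that _SC_NPROCESSORS_ONLN is a common glibc/BSD extension rather than a POSIX requirement:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* On an HT-enabled CPU this counts every hardware thread as one processor */
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    printf("logical processors online: %ld\n", n);
    return 0;
}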

42 HyperThreading [Figure: the IF, ID, EX, MEM and WB pipeline stages are separated by queues] The pipeline stages are separated by two buffers (one for each executing thread). If one thread is stalled, the other one might go ahead and fill the pipeline slot, resulting in higher processor utilization. Source: "Hyper-Threading Technology Architecture and Microarchitecture" 42

43 HyperThreading See the output of lscpu ("Thread(s) per core") or lstopo (hwloc Ubuntu/Debian package) Processor without HT Processor with HT 43

44 44

45 Flynn's Taxonomy
Classification by the number of instruction streams and data streams:
    SISD: Single Instruction Stream, Single Data Stream (the classical Von Neumann architecture)
    SIMD: Single Instruction Stream, Multiple Data Streams
    MISD: Multiple Instruction Streams, Single Data Stream
    MIMD: Multiple Instruction Streams, Multiple Data Streams
45

46 SIMD SIMD instructions apply the same operation (e.g., sum, product, ...) to multiple elements (typically 4 or 8, depending on the width of SIMD registers and the data types of the operands). This means that there must be 4/8/... independent ALUs. [Figure: over time, four lanes execute LOAD A[0..3], LOAD B[0..3], C[0..3] = A[0..3] + B[0..3], STORE C[0..3] in lockstep] 46

47 SSE Streaming SIMD Extensions Extension to the x86 instruction set Provides new SIMD instructions operating on small arrays of integer or floating-point numbers Introduces 8 new 128-bit registers (XMM0-XMM7) SSE2 instructions can handle 2 64-bit doubles, or 2 64-bit integers, or 4 32-bit integers, or 8 16-bit integers, or 16 8-bit chars [Figure: the 128-bit registers XMM0, XMM1, ..., XMM7] 47

48 SSE (Streaming SIMD Extensions) 4 lanes 32 bit 32 bit 32 bit 32 bit X3 X2 X1 X0 Y3 Y2 Y1 Y0 Op Op Op Op X3 op Y3 X2 op Y2 X1 op Y1 X0 op Y0 48

49 Example

__m128 a = _mm_set_ps( 1.0, 2.0, 3.0, 4.0 );
__m128 b = _mm_set_ps( 2.0, 4.0, 6.0, 8.0 );
__m128 ab = _mm_mul_ps(a, b);

[Figure: the four 32-bit lanes of a and b are multiplied pairwise to produce ab]
49
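For reference, a self-contained sketch of the same example (not part of the original slides). Note that _mm_set_ps() fills the register from the highest lane down and _mm_storeu_ps() writes lane 0 into out[0]; on x86-64 SSE is enabled by default (e.g., gcc -O2 sse_demo.c):

#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void)
{
    __m128 a  = _mm_set_ps( 1.0f, 2.0f, 3.0f, 4.0f );
    __m128 b  = _mm_set_ps( 2.0f, 4.0f, 6.0f, 8.0f );
    __m128 ab = _mm_mul_ps(a, b);    /* four 32-bit products in a single instruction */

    float out[4];
    _mm_storeu_ps(out, ab);          /* out = { 32, 18, 8, 2 } */
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    return 0;
}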

50 GPU Modern GPUs (Graphics Processing Units) have a large number of cores, and can be regarded as a form of SIMD architecture Fermi GPU die (source: 50

51 CPU vs GPU The difference between CPUs and GPUs can be appreciated by looking at how the chip surface is used. [Figure: in the CPU most of the die is devoted to control logic and cache, with a few large ALUs; in the GPU most of the die is devoted to many small ALUs, with little control logic and cache; both include DRAM controllers] 51

52 GPU core A single GPU core contains a fetch/decode unit shared among multiple ALUs: if there are 8 ALUs, each instruction can operate on 8 values simultaneously. Each GPU core maintains multiple execution contexts, and can switch between them at virtually zero cost (fine-grained parallelism). [Figure: one fetch/decode unit feeding 8 ALUs, with several execution contexts (Ctx)] 52

53 GPU Example: 12 instruction streams x 8 ALUs = 96 operations in parallel 53

54 MIMD In MIMD systems there are multiple execution units that can execute multiple sequences of instructions Each execution unit generally operates on its own input data Time Multiple Instruction Streams Multiple Data Streams LOAD A[0] CALL F() a = 18 w=7 LOAD B[0] z=8 b=9 t = 13 C[0] = A[0] + B[0] y = 1.7 if ( a>b ) c = 7 k = G(w,t) STORE C[0] z=x+y a=a-1 k=k+1 54

55 MIMD architectures
Shared Memory: a set of processors sharing a common memory space; each processor can access any memory location. [Figure: several CPUs connected to a single shared memory through an interconnect]
Distributed Memory: a set of compute nodes connected through an interconnection network; the simplest example is a cluster of PCs connected via Ethernet. Nodes can share data through explicit communications. [Figure: nodes, each with its own CPU and memory, connected by an interconnect]
55
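As an illustration of the shared-memory model (not from the slides): with OpenMP, all threads running on the cores of a node see the same array, so parallelizing a loop takes a single directive (compile with gcc -fopenmp); a distributed-memory version would instead exchange data through explicit messages, e.g. with MPI:

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double a[N];
    double sum = 0.0;

    /* All threads share the array a[]; the reduction combines the per-thread partial sums */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = i * 0.5;
        sum += a[i];
    }
    printf("sum = %f, max threads = %d\n", sum, omp_get_max_threads());
    return 0;
}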

56 AMD Ryzen Intel core i7 56

57 Hybrid architectures Many HPC systems are based on hybrid architectures Each compute node is a shared-memory multiprocessor A large number of compute nodes is connected through an interconnect network GPU GPU GPU GPU GPU GPU GPU GPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU Mem Mem Mem Mem Interconnect 57

58 Distributed memory system: IBM BlueGene/Q at CINECA
Architecture:      10 BGQ Frames
Model:             IBM-BG/Q
Processor Type:    IBM PowerA2, 1.6 GHz
Computing Cores:
Computing Nodes:
RAM:               1 GByte / core
Internal Network:  5D Torus
Disk Space:        2 PByte scratch space
Peak Performance:  2 PFlop/s
58

59 59

60 60

61 (June 2018) 61

62 (June 2018) 62

63 SANDIA ASCI RED Date: 1996 Peak performance: 1.8Teraflops Floor space: 150m2 Power consumption: Watt 63

64 SANDIA ASCI RED Date: 1996 Peak performance: 1.8Teraflops Floor space: 150m2 Power consumption: Watt Sony PLAYSTATION 3 Date: 2006 Peak performance: >1.8Teraflops Floor space: 0.08m2 Power consumption: <200 Watt 64

65 Inside SONY's PS3 Cell Broadband Engine 65

66 Empirical rules When writing parallel applications (especially on distributed-memory architectures) keep in mind that: Computation is fast Communication is slow Input/output is incredibly slow 66

67 Recap
Shared memory
    Advantages: easier to program; useful for applications with irregular data access patterns (e.g., graph algorithms)
    Disadvantages: the programmer must take care of race conditions; limited memory bandwidth
Distributed memory
    Advantages: highly scalable, provides very high computational power by adding more nodes; useful for applications with strong locality of reference and a high computation / communication ratio
    Disadvantages: latency of the interconnect network; difficult to program
67
