COMPUTER ORGANIZATION AND DESIGN

Size: px
Start display at page:

Download "COMPUTER ORGANIZATION AND DESIGN"

Transcription

1 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface ARM Editio Chapter 6 Parallel Processors from Cliet to Cloud Itroductio Goal: coectig multiple computers to get higher performace Multiprocessors Scalability, availability, power efficiecy Task-level (process-level) parallelism High throughput for idepedet jobs Parallel processig program Sigle program ru o multiple processors Multicore microprocessors Chips with multiple processors (cores) 6.1 Itroductio Chapter 6 Parallel Processors from Cliet to Cloud 2

2 Hardware ad Software Hardware Serial: e.g., Petium 4 Parallel: e.g., quad-core Xeo e5345 Software Sequetial: e.g., matrix multiplicatio Cocurret: e.g., operatig system Sequetial/cocurret software ca ru o serial/parallel hardware Challege: makig effective use of parallel hardware Chapter 6 Parallel Processors from Cliet to Cloud 3 What We ve Already Covered 2.11: Parallelism ad Istructios Sychroizatio 3.6: Parallelism ad Computer Arithmetic Subword Parallelism 4.10: Parallelism ad Advaced Istructio-Level Parallelism 5.10: Parallelism ad Memory Hierarchies Cache Coherece Chapter 6 Parallel Processors from Cliet to Cloud 4

3 Parallel Programmig Parallel software is the problem Need to get sigificat performace improvemet Otherwise, just use a faster uiprocessor, sice it s easier! Difficulties Partitioig Coordiatio Commuicatios overhead 6.2 The Difficulty of Creatig Parallel Processig Programs Chapter 6 Parallel Processors from Cliet to Cloud 5 Amdahl s Law Sequetial part ca limit speedup Example: 100 processors, 90 speedup? T ew = T parallelizable /100 + T sequetial Speedup = (1-F parallelizable 1 ) + F parallelizable /100 = 90 Solvig: F parallelizable = Need sequetial part to be 0.1% of origial time Chapter 6 Parallel Processors from Cliet to Cloud 6

4 Scalig Example Workload: sum of 10 scalars, ad matrix sum Speed up from 10 to 100 processors Sigle processor: Time = ( ) t add 10 processors Time = 10 t add + 100/10 t add = 20 t add Speedup = 110/20 = 5.5 (55% of potetial) 100 processors Time = 10 t add + 100/100 t add = 11 t add Speedup = 110/11 = 10 (10% of potetial) Assumes load ca be balaced across processors Chapter 6 Parallel Processors from Cliet to Cloud 7 Scalig Example (cot) What if matrix size is ? Sigle processor: Time = ( ) t add 10 processors Time = 10 t add /10 t add = 1010 t add Speedup = 10010/1010 = 9.9 (99% of potetial) 100 processors Time = 10 t add /100 t add = 110 t add Speedup = 10010/110 = 91 (91% of potetial) Assumig load balaced Chapter 6 Parallel Processors from Cliet to Cloud 8

5 Strog vs Weak Scalig Strog scalig: problem size fixed As i example Weak scalig: problem size proportioal to umber of processors 10 processors, matrix Time = 20 t add 100 processors, matrix Time = 10 t add /100 t add = 20 t add Costat performace i this example Chapter 6 Parallel Processors from Cliet to Cloud 9 Istructio ad Data Streams A alterate classificatio Istructio Streams Sigle Multiple Sigle SISD: Itel Petium 4 MISD: No examples today Data Streams Multiple SIMD: SSE istructios of x86 MIMD: Itel Xeo e5345 SPMD: Sigle Program Multiple Data A parallel program o a MIMD computer Coditioal code for differet processors 6.3 SISD, MIMD, SIMD, SPMD, ad Vector Chapter 6 Parallel Processors from Cliet to Cloud 10

6 Example: DAXPY (Y = a X + Y) Covetioal LEGv8 code: LDURD D0,[X28,a] ADDI X0,X19,512 //load scalar a loop: LDURD D2,[X19,#0] //load x(i) FMULD D2,D2,D0 LDURD D4,[X20,#0] FADDD D4,D4,D2 STURD D4,[X20,#0] ADDI X19,X19,#8 ADDI X20,X20,#8 CMPB X0,X19 //upper boud of what to load //a x x(i) //load y(i) //a x x(i) + y(i) //store ito y(i) //icremet idex to x //icremet idex to y //compute boud B.NE loop //check if doe Vector LEGv8 code: LDURD D0,[X28,a] //load scalar a LDURDV V1,[X19,#0] //load vector x FMULDVS V2,V1,D0 //vector-scalar multiply LDURDV V3,[X20,#0] //load vector y FADDDV V4,V2,V3 //add y to product STURDV V4,[X20,#0] //store the result Chapter 6 Parallel Processors from Cliet to Cloud 11 Vector Processors Highly pipelied fuctio uits Stream data from/to vector registers to uits Data collected from memory ito registers Results stored from registers to memory Example: Vector extesio to LEGv elemet registers (64-bit elemets) Vector istructios lv, sv: load/store vector addv.d: add vectors of double addvs.d: add scalar to each elemet of vector of double Sigificatly reduces istructio-fetch badwidth Chapter 6 Parallel Processors from Cliet to Cloud 12

7 Vector vs. Scalar Vector architectures ad compilers Simplify data-parallel programmig Explicit statemet of absece of loop-carried depedeces Reduced checkig i hardware Regular access patters beefit from iterleaved ad burst memory Avoid cotrol hazards by avoidig loops More geeral tha ad-hoc media extesios (such as MMX, SSE) Better match with compiler techology Chapter 6 Parallel Processors from Cliet to Cloud 13 SIMD Operate elemetwise o vectors of data E.g., MMX ad SSE istructios i x86 Multiple data elemets i 128-bit wide registers All processors execute the same istructio at the same time Each with differet data address, etc. Simplifies sychroizatio Reduced istructio cotrol hardware Works best for highly data-parallel applicatios Chapter 6 Parallel Processors from Cliet to Cloud 14

8 Vector vs. Multimedia Extesios Vector istructios have a variable vector width, multimedia extesios have a fixed width Vector istructios support strided access, multimedia extesios do ot Vector uits ca be combiatio of pipelied ad arrayed fuctioal uits: Chapter 6 Parallel Processors from Cliet to Cloud 15 Multithreadig Performig multiple threads of executio i parallel Replicate registers, PC, etc. Fast switchig betwee threads Fie-grai multithreadig Switch threads after each cycle Iterleave istructio executio If oe thread stalls, others are executed Coarse-grai multithreadig Oly switch o log stall (e.g., L2-cache miss) Simplifies hardware, but does t hide short stalls (eg, data hazards) 6.4 Hardware Multithreadig Chapter 6 Parallel Processors from Cliet to Cloud 16

9 Simultaeous Multithreadig I multiple-issue dyamically scheduled processor Schedule istructios from multiple threads Istructios from idepedet threads execute whe fuctio uits are available Withi threads, depedecies hadled by schedulig ad register reamig Example: Itel Petium-4 HT Two threads: duplicated registers, shared fuctio uits ad caches Chapter 6 Parallel Processors from Cliet to Cloud 17 Multithreadig Example Chapter 6 Parallel Processors from Cliet to Cloud 18

10 Future of Multithreadig Will it survive? I what form? Power cosideratios Þ simplified microarchitectures Simpler forms of multithreadig Toleratig cache-miss latecy Thread switch may be most effective Multiple simple cores might share resources more effectively Chapter 6 Parallel Processors from Cliet to Cloud 19 Shared Memory SMP: shared memory multiprocessor Hardware provides sigle physical address space for all processors Sychroize shared variables usig locks Memory access time UMA (uiform) vs. NUMA (ouiform) 6.5 Multicore ad Other Shared Memory Multiprocessors Chapter 6 Parallel Processors from Cliet to Cloud 20

11 Example: Sum Reductio Sum 100,000 umbers o 100 processor UMA Each processor has ID: 0 P 99 Partitio 1000 umbers per processor Iitial summatio o each processor sum[p] = 0; for (i = 1000*P; i < 1000*(P+1); i = i + 1) sum[p] = sum[p] + A[i]; Now eed to add these partial sums Reductio: divide ad coquer Half the processors add pairs, the quarter, Need to sychroize betwee reductio steps Chapter 6 Parallel Processors from Cliet to Cloud 21 Example: Sum Reductio half = 100; repeat sych(); if (half%2!= 0 && P == 0) sum[0] = sum[0] + sum[half-1]; /* Coditioal sum eeded whe half is odd; Processor0 gets missig elemet */ half = half/2; /* dividig lie o who sums */ if (P < half) sum[p] = sum[p] + sum[p+half]; util (half == 1); Chapter 6 Parallel Processors from Cliet to Cloud 22

12 History of GPUs Early video cards Frame buffer memory with address geeratio for video output 3D graphics processig Origially high-ed computers (e.g., SGI) Moore s Law Þ lower cost, higher desity 3D graphics cards for PCs ad game cosoles Graphics Processig Uits Processors orieted to 3D graphics tasks Vertex/pixel processig, shadig, texture mappig, rasterizatio 6.6 Itroductio to Graphics Processig Uits Chapter 6 Parallel Processors from Cliet to Cloud 23 Graphics i the System Chapter 6 Parallel Processors from Cliet to Cloud 24

13 GPU Architectures Processig is highly data-parallel GPUs are highly multithreaded Use thread switchig to hide memory latecy Less reliace o multi-level caches Graphics memory is wide ad high-badwidth Tred toward geeral purpose GPUs Heterogeeous CPU/GPU systems CPU for sequetial code, GPU for parallel code Programmig laguages/apis DirectX, OpeGL C for Graphics (Cg), High Level Shader Laguage (HLSL) Compute Uified Device Architecture (CUDA) Chapter 6 Parallel Processors from Cliet to Cloud 25 Example: NVIDIA Tesla Streamig multiprocessor 8 Streamig processors Chapter 6 Parallel Processors from Cliet to Cloud 26

14 Example: NVIDIA Tesla Streamig Processors Sigle-precisio FP ad iteger uits Each SP is fie-graied multithreaded Warp: group of 32 threads Executed i parallel, SIMD style 8 SPs 4 clock cycles Hardware cotexts for 24 warps Registers, PCs, Chapter 6 Parallel Processors from Cliet to Cloud 27 Classifyig GPUs Do t fit icely ito SIMD/MIMD model Coditioal executio i a thread allows a illusio of MIMD But with performace degredatio Need to write geeral purpose code with care Istructio-Level Parallelism Data-Level Parallelism Static: Discovered at Compile Time VLIW SIMD or Vector Dyamic: Discovered at Rutime Superscalar Tesla Multiprocessor Chapter 6 Parallel Processors from Cliet to Cloud 28

15 GPU Memory Structures Chapter 6 Parallel Processors from Cliet to Cloud 29 Puttig GPUs ito Perspective Feature Multicore with SIMD GPU SIMD processors 4 to 8 8 to 16 SIMD laes/processor 2 to 4 8 to 16 Multithreadig hardware support for SIMD threads Typical ratio of sigle precisio to double-precisio performace 2 to 4 16 to 32 2:1 2:1 Largest cache size 8 MB 0.75 MB Size of memory address 64-bit 64-bit Size of mai memory 8 GB to 256 GB 4 GB to 6 GB Memory protectio at level of page Yes Yes Demad pagig Yes No Itegrated scalar processor/simd processor Cache coheret Yes No Yes No Chapter 6 Parallel Processors from Cliet to Cloud 30

16 Guide to GPU Terms Chapter 6 Parallel Processors from Cliet to Cloud 31 Message Passig Each processor has private physical address space Hardware seds/receives messages betwee processors 6.7 Clusters, WSC, ad Other Message-Passig MPs Chapter 6 Parallel Processors from Cliet to Cloud 32

17 Loosely Coupled Clusters Network of idepedet computers Each has private memory ad OS Coected usig I/O system E.g., Etheret/switch, Iteret Suitable for applicatios with idepedet tasks Web servers, databases, simulatios, High availability, scalable, affordable Problems Admiistratio cost (prefer virtual machies) Low itercoect badwidth c.f. processor/memory badwidth o a SMP Chapter 6 Parallel Processors from Cliet to Cloud 33 Sum Reductio (Agai) Sum 100,000 o 100 processors First distribute 100 umbers to each The do partial sums sum = 0; for (i = 0; i<1000; i = i + 1) sum = sum + AN[i]; Reductio Half the processors sed, other half receive ad add The quarter sed, quarter receive ad add, Chapter 6 Parallel Processors from Cliet to Cloud 34

18 Sum Reductio (Agai) Give sed() ad receive() operatios limit = 100; half = 100;/* 100 processors */ repeat half = (half+1)/2; /* sed vs. receive dividig lie */ if (P >= half && P < limit) sed(p - half, sum); if (P < (limit/2)) sum = sum + receive(); limit = half; /* upper limit of seders */ util (half == 1); /* exit with fial sum */ Sed/receive also provide sychroizatio Assumes sed/receive take similar time to additio Chapter 6 Parallel Processors from Cliet to Cloud 35 Grid Computig Separate computers itercoected by log-haul etworks E.g., Iteret coectios Work uits farmed out, results set back Ca make use of idle time o PCs E.g., SETI@home, World Commuity Grid Chapter 6 Parallel Processors from Cliet to Cloud 36

19 Itercoectio Networks Network topologies Arragemets of processors, switches, ad liks Bus Rig N-cube (N = 3) 2D Mesh Fully coected 6.8 Itroductio to Multiprocessor Network Topologies Chapter 6 Parallel Processors from Cliet to Cloud 37 Multistage Networks Chapter 6 Parallel Processors from Cliet to Cloud 38

20 Network Characteristics Performace Latecy per message (uloaded etwork) Throughput Lik badwidth Total etwork badwidth Bisectio badwidth Cogestio delays (depedig o traffic) Cost Power Routability i silico Chapter 6 Parallel Processors from Cliet to Cloud 39 Parallel Bechmarks Lipack: matrix liear algebra SPECrate: parallel ru of SPEC CPU programs Job-level parallelism SPLASH: Staford Parallel Applicatios for Shared Memory Mix of kerels ad applicatios, strog scalig NAS (NASA Advaced Supercomputig) suite computatioal fluid dyamics kerels PARSEC (Priceto Applicatio Repository for Shared Memory Computers) suite Multithreaded applicatios usig Pthreads ad OpeMP 6.10 Multiprocessor Bechmarks ad Performace Models Chapter 6 Parallel Processors from Cliet to Cloud 40

21 Code or Applicatios? Traditioal bechmarks Fixed code ad data sets Parallel programmig is evolvig Should algorithms, programmig laguages, ad tools be part of the system? Compare systems, provided they implemet a give applicatio E.g., Lipack, Berkeley Desig Patters Would foster iovatio i approaches to parallelism Chapter 6 Parallel Processors from Cliet to Cloud 41 Modelig Performace Assume performace metric of iterest is achievable GFLOPs/sec Measured usig computatioal kerels from Berkeley Desig Patters Arithmetic itesity of a kerel FLOPs per byte of memory accessed For a give computer, determie Peak GFLOPS (from data sheet) Peak memory bytes/sec (usig Stream bechmark) Chapter 6 Parallel Processors from Cliet to Cloud 42

22 Rooflie Diagram Attaiable GPLOPs/sec = Max ( Peak Memory BW Arithmetic Itesity, Peak FP Performace ) Chapter 6 Parallel Processors from Cliet to Cloud 43 Comparig Systems Example: Optero X2 vs. Optero X4 2-core vs. 4-core, 2 FP performace/core, 2.2GHz vs. 2.3GHz Same memory system To get higher performace o X4 tha X2 Need high arithmetic itesity Or workig set must fit i X4 s 2MB L-3 cache Chapter 6 Parallel Processors from Cliet to Cloud 44

23 Optimizig Performace Optimize FP performace Balace adds & multiplies Improve superscalar ILP ad use of SIMD istructios Optimize memory usage Software prefetch Avoid load stalls Memory affiity Avoid o-local data accesses Chapter 6 Parallel Processors from Cliet to Cloud 45 Optimizig Performace Choice of optimizatio depeds o arithmetic itesity of code Arithmetic itesity is ot always fixed May scale with problem size Cachig reduces memory accesses Icreases arithmetic itesity Chapter 6 Parallel Processors from Cliet to Cloud 46

24 i7-960 vs. NVIDIA Tesla 280/ Real Stuff: Bechmarkig ad Rooflies i7 vs. Tesla Chapter 6 Parallel Processors from Cliet to Cloud 47 Rooflies Chapter 6 Parallel Processors from Cliet to Cloud 48

25 Bechmarks Chapter 6 Parallel Processors from Cliet to Cloud 49 Performace Summary GPU (480) has 4.4 X the memory badwidth Beefits memory boud kerels GPU has 13.1 X the sigle precisio throughout, 2.5 X the double precisio throughput Beefits FP compute boud kerels CPU cache prevets some kerels from becomig memory boud whe they otherwise would o GPU GPUs offer scatter-gather, which assists with kerels with strided data Lack of sychroizatio ad memory cosistecy support o GPU limits performace for some kerels Chapter 6 Parallel Processors from Cliet to Cloud 50

26 Multi-threadig DGEMM Use OpeMP: void dgemm (it, double* A, double* B, double* C) { #pragma omp parallel for for ( it sj = 0; sj < ; sj += BLOCKSIZE ) for ( it si = 0; si < ; si += BLOCKSIZE ) for ( it sk = 0; sk < ; sk += BLOCKSIZE ) do_block(, si, sj, sk, A, B, C); } 6.12 Goig Faster: Multiple Processors ad Matrix Multiply Chapter 6 Parallel Processors from Cliet to Cloud 51 Multithreaded DGEMM Chapter 6 Parallel Processors from Cliet to Cloud 52

27 Multithreaded DGEMM Chapter 6 Parallel Processors from Cliet to Cloud 53 Fallacies Amdahl s Law does t apply to parallel computers Sice we ca achieve liear speedup But oly o applicatios with weak scalig Peak performace tracks observed performace Marketers like this approach! But compare Xeo with others i example Need to be aware of bottleecks 6.13 Fallacies ad Pitfalls Chapter 6 Parallel Processors from Cliet to Cloud 54

28 Pitfalls Not developig the software to take accout of a multiprocessor architecture Example: usig a sigle lock for a shared composite resource Serializes accesses, eve if they could be doe i parallel Use fier-graularity lockig Chapter 6 Parallel Processors from Cliet to Cloud 55 Cocludig Remarks Goal: higher performace by usig multiple processors Difficulties Developig parallel software Devisig appropriate architectures SaaS importace is growig ad clusters are a good match Performace per dollar ad performace per Joule drive both mobile ad WSC 6.14 Cocludig Remarks Chapter 6 Parallel Processors from Cliet to Cloud 56

29 Cocludig Remarks (co t) SIMD ad vector operatios match multimedia applicatios ad are easy to program Chapter 6 Parallel Processors from Cliet to Cloud 57

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance

More information

COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance

More information

Instruction and Data Streams

Instruction and Data Streams Advaced Architectures Master Iformatics Eg. 2017/18 A.J.Proeça Data Parallelism 1 (vector & SIMD extesios) (most slides are borrowed) AJProeça, Advaced Architectures, MiEI, UMiho, 2017/18 1 Istructio ad

More information

Chapter 7. Multicores, Multiprocessors, and Clusters

Chapter 7. Multicores, Multiprocessors, and Clusters Chapter 7 Multicores, Multiprocessors, and Clusters Introduction Goal: connecting multiple computers to get higher performance Multiprocessors Scalability, availability, power efficiency Job-level (process-level)

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Multi-Threading. Hyper-, Multi-, and Simultaneous Thread Execution

Multi-Threading. Hyper-, Multi-, and Simultaneous Thread Execution Multi-Threadig Hyper-, Multi-, ad Simultaeous Thread Executio 1 Performace To Date Icreasig processor performace Pipeliig. Brach predictio. Super-scalar executio. Out-of-order executio. Caches. Hyper-Threadig

More information

Chapter 7. Multicores, Multiprocessors, and

Chapter 7. Multicores, Multiprocessors, and Chapter 7 Multicores, Multiprocessors, and Clusters Introduction Goal: connecting multiple computers to get higher h performance Multiprocessors Scalability, availability, power efficiency Job-level (process-level)

More information

CS2410 Computer Architecture. Flynn s Taxonomy

CS2410 Computer Architecture. Flynn s Taxonomy CS2410 Computer Architecture Dept. of Computer Sciece Uiversity of Pittsburgh http://www.cs.pitt.edu/~melhem/courses/2410p/idex.html 1 Fly s Taxoomy SISD Sigle istructio stream Sigle data stream (SIMD)

More information

Chapter 7. Multicores, Multiprocessors, and Clusters

Chapter 7. Multicores, Multiprocessors, and Clusters Chapter 7 Multicores, Multiprocessors, and Clusters Introduction Goal: connecting multiple computers to get higher performance Multiprocessors Scalability, availability, power efficiency Job-level (process-level)

More information

Issues in Parallel Processing. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University

Issues in Parallel Processing. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Issues in Parallel Processing Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Introduction Goal: connecting multiple computers to get higher performance

More information

Multiprocessors. HPC Prof. Robert van Engelen

Multiprocessors. HPC Prof. Robert van Engelen Multiprocessors Prof. Robert va Egele Overview The PMS model Shared memory multiprocessors Basic shared memory systems SMP, Multicore, ad COMA Distributed memory multicomputers MPP systems Network topologies

More information

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5 Morga Kaufma Publishers 26 February, 28 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 5 Set-Associative Cache Architecture Performace Summary Whe CPU performace icreases:

More information

CMSC22200 Computer Architecture Lecture 9: Out-of-Order, SIMD, VLIW. Prof. Yanjing Li University of Chicago

CMSC22200 Computer Architecture Lecture 9: Out-of-Order, SIMD, VLIW. Prof. Yanjing Li University of Chicago CMSC22200 Computer Architecture Lecture 9: Out-of-Order, SIMD, VLIW Prof. Yajig Li Uiversity of Chicago Admiistrative Stuff Lab2 due toight Exam I: covers lectures 1-9 Ope book, ope otes, close device

More information

Computer Graphics Hardware An Overview

Computer Graphics Hardware An Overview Computer Graphics Hardware A Overview Graphics System Moitor Iput devices CPU/Memory GPU Raster Graphics System Raster: A array of picture elemets Based o raster-sca TV techology The scree (ad a picture)

More information

Master Informatics Eng. 2017/18. A.J.Proença. Memory Hierarchy. (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2017/18 1

Master Informatics Eng. 2017/18. A.J.Proença. Memory Hierarchy. (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2017/18 1 Advaced Architectures Master Iformatics Eg. 2017/18 A.J.Proeça Memory Hierarchy (most slides are borrowed) AJProeça, Advaced Architectures, MiEI, UMiho, 2017/18 1 Itroductio Programmers wat ulimited amouts

More information

Online Course Evaluation. What we will do in the last week?

Online Course Evaluation. What we will do in the last week? Online Course Evaluation Please fill in the online form The link will expire on April 30 (next Monday) So far 10 students have filled in the online form Thank you if you completed it. 1 What we will do

More information

Parallel Systems I The GPU architecture. Jan Lemeire

Parallel Systems I The GPU architecture. Jan Lemeire Parallel Systems I The GPU architecture Jan Lemeire 2012-2013 Sequential program CPU pipeline Sequential pipelined execution Instruction-level parallelism (ILP): superscalar pipeline out-of-order execution

More information

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1

More information

Chapter 4 Threads. Operating Systems: Internals and Design Principles. Ninth Edition By William Stallings

Chapter 4 Threads. Operating Systems: Internals and Design Principles. Ninth Edition By William Stallings Operatig Systems: Iterals ad Desig Priciples Chapter 4 Threads Nith Editio By William Stalligs Processes ad Threads Resource Owership Process icludes a virtual address space to hold the process image The

More information

Chapter 7. Multicores, Multiprocessors, and Clusters. Goal: connecting multiple computers to get higher performance

Chapter 7. Multicores, Multiprocessors, and Clusters. Goal: connecting multiple computers to get higher performance Chapter 7 Multicores, Multiprocessors, and Clusters Introduction Goal: connecting multiple computers to get higher performance Multiprocessors Scalability, availability, power efficiency Job-level (process-level)

More information

CMSC Computer Architecture Lecture 11: More Caches. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 11: More Caches. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 11: More Caches Prof. Yajig Li Uiversity of Chicago Lecture Outlie Caches 2 Review Memory hierarchy Cache basics Locality priciples Spatial ad temporal How to access

More information

CMSC Computer Architecture Lecture 12: Virtual Memory. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 12: Virtual Memory. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 12: Virtual Memory Prof. Yajig Li Uiversity of Chicago A System with Physical Memory Oly Examples: most Cray machies early PCs Memory early all embedded systems

More information

Parallel computing and GPU introduction

Parallel computing and GPU introduction 國立台灣大學 National Taiwan University Parallel computing and GPU introduction 黃子桓 tzhuan@gmail.com Agenda Parallel computing GPU introduction Interconnection networks Parallel benchmark Parallel programming

More information

Isn t It Time You Got Faster, Quicker?

Isn t It Time You Got Faster, Quicker? Is t It Time You Got Faster, Quicker? AltiVec Techology At-a-Glace OVERVIEW Motorola s advaced AltiVec techology is desiged to eable host processors compatible with the PowerPC istructio-set architecture

More information

CMSC Computer Architecture Lecture 10: Caches. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 10: Caches. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 10: Caches Prof. Yajig Li Uiversity of Chicago Midterm Recap Overview ad fudametal cocepts ISA Uarch Datapath, cotrol Sigle cycle, multi cycle Pipeliig Basic idea,

More information

Appendix D. Controller Implementation

Appendix D. Controller Implementation COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Appedix D Cotroller Implemetatio Cotroller Implemetatios Combiatioal logic (sigle-cycle); Fiite state machie (multi-cycle, pipelied);

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Part A Datapath Design

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Part A Datapath Design COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter The Processor Part A path Desig Itroductio CPU performace factors Istructio cout Determied by ISA ad compiler. CPI ad

More information

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved.

Chapter 6. Parallel Processors from Client to Cloud. Copyright 2014 Elsevier Inc. All rights reserved. Chapter 6 Parallel Processors from Client to Cloud FIGURE 6.1 Hardware/software categorization and examples of application perspective on concurrency versus hardware perspective on parallelism. 2 FIGURE

More information

Design of Digital Circuits Lecture 21: SIMD Processors II and Graphics Processing Units

Design of Digital Circuits Lecture 21: SIMD Processors II and Graphics Processing Units Desig of Digital Circuits Lecture 21: SIMD Processors II ad Graphics Processig Uits Dr. Jua Gómez Lua Prof. Our Mutlu ETH Zurich Sprig 2018 17 May 2018 New Course: Bachelor s Semiar i Comp Arch Fall 2018

More information

Uniprocessors. HPC Prof. Robert van Engelen

Uniprocessors. HPC Prof. Robert van Engelen Uiprocessors HPC Prof. Robert va Egele Overview PART I: Uiprocessors PART II: Multiprocessors ad ad Compiler Optimizatios Parallel Programmig Models Uiprocessors Multiprocessors Processor architectures

More information

SCI Reflective Memory

SCI Reflective Memory Embedded SCI Solutios SCI Reflective Memory (Experimetal) Atle Vesterkjær Dolphi Itercoect Solutios AS Olaf Helsets vei 6, N-0621 Oslo, Norway Phoe: (47) 23 16 71 42 Fax: (47) 23 16 71 80 Mail: atleve@dolphiics.o

More information

Design of Digital Circuits Lecture 22: GPU Programming. Dr. Juan Gómez Luna Prof. Onur Mutlu ETH Zurich Spring May 2018

Design of Digital Circuits Lecture 22: GPU Programming. Dr. Juan Gómez Luna Prof. Onur Mutlu ETH Zurich Spring May 2018 Desig of Digital Circuits Lecture 22: GPU Programmig Dr. Jua Gómez Lua Prof. Our Mutlu ETH Zurich Sprig 2018 18 May 2018 Ageda for Today GPU as a accelerator Program structure Bulk sychroous programmig

More information

Course Site: Copyright 2012, Elsevier Inc. All rights reserved.

Course Site:   Copyright 2012, Elsevier Inc. All rights reserved. Course Site: http://cc.sjtu.edu.c/g2s/site/aca.html 1 Computer Architecture A Quatitative Approach, Fifth Editio Chapter 2 Memory Hierarchy Desig 2 Outlie Memory Hierarchy Cache Desig Basic Cache Optimizatios

More information

COMP Parallel Computing. PRAM (1): The PRAM model and complexity measures

COMP Parallel Computing. PRAM (1): The PRAM model and complexity measures COMP 633 - Parallel Computig Lecture 2 August 24, 2017 : The PRAM model ad complexity measures 1 First class summary This course is about parallel computig to achieve high-er performace o idividual problems

More information

ELE 455/555 Computer System Engineering. Section 4 Parallel Processing Class 1 Challenges

ELE 455/555 Computer System Engineering. Section 4 Parallel Processing Class 1 Challenges ELE 455/555 Computer System Engineering Section 4 Class 1 Challenges Introduction Motivation Desire to provide more performance (processing) Scaling a single processor is limited Clock speeds Power concerns

More information

CMSC Computer Architecture Lecture 5: Pipelining. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 5: Pipelining. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 5: Pipeliig Prof. Yajig Li Uiversity of Chicago Admiistrative Stuff Lab1 Due toight Lab2: out later today; due 2 weeks from ow Review sessio this Friday Turig award

More information

Shared Memory and Distributed Multiprocessing. Bhanu Kapoor, Ph.D. The Saylor Foundation

Shared Memory and Distributed Multiprocessing. Bhanu Kapoor, Ph.D. The Saylor Foundation Shared Memory and Distributed Multiprocessing Bhanu Kapoor, Ph.D. The Saylor Foundation 1 Issue with Parallelism Parallel software is the problem Need to get significant performance improvement Otherwise,

More information

Operating System Concepts. Operating System Concepts

Operating System Concepts. Operating System Concepts Chapter 4: Mass-Storage Systems Logical Disk Structure Logical Disk Structure Disk Schedulig Disk Maagemet RAID Structure Disk drives are addressed as large -dimesioal arrays of logical blocks, where the

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor Advanced Issues

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor Advanced Issues COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 4 The Processor Advaced Issues Review: Pipelie Hazards Structural hazards Desig pipelie to elimiate structural hazards.

More information

Transforming Irregular Algorithms for Heterogeneous Computing - Case Studies in Bioinformatics

Transforming Irregular Algorithms for Heterogeneous Computing - Case Studies in Bioinformatics Trasformig Irregular lgorithms for Heterogeeous omputig - ase Studies i ioiformatics Jig Zhag dvisor: Dr. Wu Feg ollaborator: Hao Wag syergy.cs.vt.edu Irregular lgorithms haracterized by Operate o irregular

More information

GPUMP: a Multiple-Precision Integer Library for GPUs

GPUMP: a Multiple-Precision Integer Library for GPUs GPUMP: a Multiple-Precisio Iteger Library for GPUs Kaiyog Zhao ad Xiaowe Chu Departmet of Computer Sciece, Hog Kog Baptist Uiversity Hog Kog, P. R. Chia Email: {kyzhao, chxw}@comp.hkbu.edu.hk Abstract

More information

Chapter 3. Floating Point Arithmetic

Chapter 3. Floating Point Arithmetic COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 3 Floatig Poit Arithmetic Review - Multiplicatio 0 1 1 0 = 6 multiplicad 32-bit ALU shift product right multiplier add

More information

Computer Architecture Lecture 8: SIMD Processors and GPUs. Prof. Onur Mutlu ETH Zürich Fall October 2017

Computer Architecture Lecture 8: SIMD Processors and GPUs. Prof. Onur Mutlu ETH Zürich Fall October 2017 Computer Architecture Lecture 8: SIMD Processors ad GPUs Prof. Our Mutlu ETH Zürich Fall 2017 18 October 2017 Ageda for Today & Next Few Lectures SIMD Processors GPUs Itroductio to GPU Programmig Digitaltechik

More information

Analysis of Algorithms

Analysis of Algorithms Aalysis of Algorithms Ruig Time of a algorithm Ruig Time Upper Bouds Lower Bouds Examples Mathematical facts Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite

More information

Cluster Computing Spring 2004 Paul A. Farrell

Cluster Computing Spring 2004 Paul A. Farrell Cluster Computig Sprig 004 3/18/004 Parallel Programmig Overview Task Parallelism OS support for task parallelism Parameter Studies Domai Decompositio Sequece Matchig Work Assigmet Static schedulig Divide

More information

UH-MEM: Utility-Based Hybrid Memory Management. Yang Li, Saugata Ghose, Jongmoo Choi, Jin Sun, Hui Wang, Onur Mutlu

UH-MEM: Utility-Based Hybrid Memory Management. Yang Li, Saugata Ghose, Jongmoo Choi, Jin Sun, Hui Wang, Onur Mutlu UH-MEM: Utility-Based Hybrid Memory Maagemet Yag Li, Saugata Ghose, Jogmoo Choi, Ji Su, Hui Wag, Our Mutlu 1 Executive Summary DRAM faces sigificat techology scalig difficulties Emergig memory techologies

More information

Running Time. Analysis of Algorithms. Experimental Studies. Limitations of Experiments

Running Time. Analysis of Algorithms. Experimental Studies. Limitations of Experiments Ruig Time Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Most algorithms trasform iput objects ito output objects. The

More information

FAST BIT-REVERSALS ON UNIPROCESSORS AND SHARED-MEMORY MULTIPROCESSORS

FAST BIT-REVERSALS ON UNIPROCESSORS AND SHARED-MEMORY MULTIPROCESSORS SIAM J. SCI. COMPUT. Vol. 22, No. 6, pp. 2113 2134 c 21 Society for Idustrial ad Applied Mathematics FAST BIT-REVERSALS ON UNIPROCESSORS AND SHARED-MEMORY MULTIPROCESSORS ZHAO ZHANG AND XIAODONG ZHANG

More information

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Single-Cycle Disadvantages & Advantages

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Single-Cycle Disadvantages & Advantages COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 4 The Processor Pipeliig Sigle-Cycle Disadvatages & Advatages Clk Uses the clock cycle iefficietly the clock cycle must

More information

UNIVERSITY OF MORATUWA

UNIVERSITY OF MORATUWA UNIVERSITY OF MORATUWA FACULTY OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING B.Sc. Egieerig 2014 Itake Semester 2 Examiatio CS2052 COMPUTER ARCHITECTURE Time allowed: 2 Hours Jauary 2016

More information

Analysis of Algorithms

Analysis of Algorithms Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Ruig Time Most algorithms trasform iput objects ito output objects. The

More information

CSC 220: Computer Organization Unit 11 Basic Computer Organization and Design

CSC 220: Computer Organization Unit 11 Basic Computer Organization and Design College of Computer ad Iformatio Scieces Departmet of Computer Sciece CSC 220: Computer Orgaizatio Uit 11 Basic Computer Orgaizatio ad Desig 1 For the rest of the semester, we ll focus o computer architecture:

More information

Running Time ( 3.1) Analysis of Algorithms. Experimental Studies. Limitations of Experiments

Running Time ( 3.1) Analysis of Algorithms. Experimental Studies. Limitations of Experiments Ruig Time ( 3.1) Aalysis of Algorithms Iput Algorithm Output A algorithm is a step- by- step procedure for solvig a problem i a fiite amout of time. Most algorithms trasform iput objects ito output objects.

More information

Analysis Metrics. Intro to Algorithm Analysis. Slides. 12. Alg Analysis. 12. Alg Analysis

Analysis Metrics. Intro to Algorithm Analysis. Slides. 12. Alg Analysis. 12. Alg Analysis Itro to Algorithm Aalysis Aalysis Metrics Slides. Table of Cotets. Aalysis Metrics 3. Exact Aalysis Rules 4. Simple Summatio 5. Summatio Formulas 6. Order of Magitude 7. Big-O otatio 8. Big-O Theorems

More information

Data Structures and Algorithms. Analysis of Algorithms

Data Structures and Algorithms. Analysis of Algorithms Data Structures ad Algorithms Aalysis of Algorithms Outlie Ruig time Pseudo-code Big-oh otatio Big-theta otatio Big-omega otatio Asymptotic algorithm aalysis Aalysis of Algorithms Iput Algorithm Output

More information

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe CHAPTER 18 Strategies for Query Processig Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe Itroductio DBMS techiques to process a query Scaer idetifies

More information

Computer Architecture

Computer Architecture Computer Architecture Overview Prof. Tie-Fu Che Dept. of Computer Sciece Natioal Chug Cheg Uiv Sprig 2002 Overview- Computer Architecture Course Focus Uderstadig the desig techiques, machie structures,

More information

Pseudocode ( 1.1) Analysis of Algorithms. Primitive Operations. Pseudocode Details. Running Time ( 1.1) Estimating performance

Pseudocode ( 1.1) Analysis of Algorithms. Primitive Operations. Pseudocode Details. Running Time ( 1.1) Estimating performance Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Pseudocode ( 1.1) High-level descriptio of a algorithm More structured

More information

CSE 305. Computer Architecture

CSE 305. Computer Architecture CSE 305 Computer Architecture Computer Architecture Course Teachers Rifat Shahriyar (rifat1816@gmail.com) Johra Muhammad Moosa Textbook Computer Orgaizatio ad Desig (The Hardware/Software Iterface) David

More information

Lecture 1: Introduction and Fundamental Concepts 1

Lecture 1: Introduction and Fundamental Concepts 1 Uderstadig Performace Lecture : Fudametal Cocepts ad Performace Aalysis CENG 332 Algorithm Determies umber of operatios executed Programmig laguage, compiler, architecture Determie umber of machie istructios

More information

Basic allocator mechanisms The course that gives CMU its Zip! Memory Management II: Dynamic Storage Allocation Mar 6, 2000.

Basic allocator mechanisms The course that gives CMU its Zip! Memory Management II: Dynamic Storage Allocation Mar 6, 2000. 5-23 The course that gives CM its Zip Memory Maagemet II: Dyamic Storage Allocatio Mar 6, 2000 Topics Segregated lists Buddy system Garbage collectio Mark ad Sweep Copyig eferece coutig Basic allocator

More information

CS252 Spring 2017 Graduate Computer Architecture. Lecture 6: Out-of-Order Processors

CS252 Spring 2017 Graduate Computer Architecture. Lecture 6: Out-of-Order Processors CS252 Sprig 2017 Graduate Computer Architecture Lecture 6: Out-of-Order Processors Lisa Wu, Krste Asaovic http://ist.eecs.berkeley.edu/~cs252/sp17 WU UCB CS252 SP17 2 WU UCB CS252 SP17 Last Time i Lecture

More information

Programming with Shared Memory PART II. HPC Spring 2017 Prof. Robert van Engelen

Programming with Shared Memory PART II. HPC Spring 2017 Prof. Robert van Engelen Programmig with Shared Memory PART II HPC Sprig 2017 Prof. Robert va Egele Overview Sequetial cosistecy Parallel programmig costructs Depedece aalysis OpeMP Autoparallelizatio Further readig HPC Sprig

More information

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe CHAPTER 19 Query Optimizatio Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe Itroductio Query optimizatio Coducted by a query optimizer i a DBMS Goal:

More information

n Explore virtualization concepts n Become familiar with cloud concepts

n Explore virtualization concepts n Become familiar with cloud concepts Chapter Objectives Explore virtualizatio cocepts Become familiar with cloud cocepts Chapter #15: Architecture ad Desig 2 Hypervisor Virtualizatio ad cloud services are becomig commo eterprise tools to

More information

The University of Adelaide, School of Computer Science 22 November Computer Architecture. A Quantitative Approach, Sixth Edition.

The University of Adelaide, School of Computer Science 22 November Computer Architecture. A Quantitative Approach, Sixth Edition. Computer Architecture A Quatitative Approach, Sixth Editio Chapter 2 Memory Hierarchy Desig 1 Itroductio Programmers wat ulimited amouts of memory with low latecy Fast memory techology is more expesive

More information

Adapted from instructor s. Organization and Design, 4th Edition, Patterson & Hennessy, 2008, MK]

Adapted from instructor s. Organization and Design, 4th Edition, Patterson & Hennessy, 2008, MK] Review and Advanced d Concepts Adapted from instructor s supplementary material from Computer Organization and Design, 4th Edition, Patterson & Hennessy, 2008, MK] Pipelining Review PC IF/ID ID/EX EX/M

More information

Arquitectura de Computadores

Arquitectura de Computadores Arquitectura de Computadores Capítulo 2. Procesadores segmetados Based o the origial material of the book: D.A. Patterso y J.L. Heessy Computer Orgaizatio ad Desig: The Hardware/Software Iterface 4 th

More information

Outline. CSCI 4730 Operating Systems. Questions. What is an Operating System? Computer System Layers. Computer System Layers

Outline. CSCI 4730 Operating Systems. Questions. What is an Operating System? Computer System Layers. Computer System Layers Outlie CSCI 4730 s! What is a s?!! System Compoet Architecture s Overview Questios What is a?! What are the major operatig system compoets?! What are basic computer system orgaizatios?! How do you commuicate

More information

Computing architectures Part 2 TMA4280 Introduction to Supercomputing

Computing architectures Part 2 TMA4280 Introduction to Supercomputing Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:

More information

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5.

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5. Morga Kaufma Publishers 26 February, 208 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 5 Virtual Memory Review: The Memory Hierarchy Take advatage of the priciple

More information

Computer Organization and Design, 5th Edition: The Hardware/Software Interface

Computer Organization and Design, 5th Edition: The Hardware/Software Interface Computer Organization and Design, 5th Edition: The Hardware/Software Interface 1 Computer Abstractions and Technology 1.1 Introduction 1.2 Eight Great Ideas in Computer Architecture 1.3 Below Your Program

More information

Announcements. Reading. Project #4 is on the web. Homework #1. Midterm #2. Chapter 4 ( ) Note policy about project #3 missing components

Announcements. Reading. Project #4 is on the web. Homework #1. Midterm #2. Chapter 4 ( ) Note policy about project #3 missing components Aoucemets Readig Chapter 4 (4.1-4.2) Project #4 is o the web ote policy about project #3 missig compoets Homework #1 Due 11/6/01 Chapter 6: 4, 12, 24, 37 Midterm #2 11/8/01 i class 1 Project #4 otes IPv6Iit,

More information

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.

More information

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe CHAPTER 20 Itroductio to Trasactio Processig Cocepts ad Theory Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe Itroductio Trasactio Describes local

More information

Design of Digital Circuits Lecture 20: SIMD Processors. Prof. Onur Mutlu ETH Zurich Spring May 2018

Design of Digital Circuits Lecture 20: SIMD Processors. Prof. Onur Mutlu ETH Zurich Spring May 2018 Desig of Digital Circuits Lecture 20: SIMD Processors Prof. Our Mutlu ETH Zurich Sprig 2018 11 May 2018 New Course: Bachelor s Semiar i Comp Arch Fall 2018 2 credit uits Rigorous semiar o fudametal ad

More information

CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. Lecture 17 GPUs

CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. Lecture 17 GPUs CS 152 Computer Architecture ad Egieerig CS252 Graduate Computer Architecture Lecture 17 GPUs Krste Asaovic Electrical Egieerig ad Computer Scieces Uiversity of Califoria at Berkeley http://www.eecs.berkeley.edu/~krste

More information

ECE5917 SoC Architecture: MP SoC Part 1. Tae Hee Han: Semiconductor Systems Engineering Sungkyunkwan University

ECE5917 SoC Architecture: MP SoC Part 1. Tae Hee Han: Semiconductor Systems Engineering Sungkyunkwan University ECE5917 SoC Architecture: MP SoC Part 1 Tae Hee Ha: tha@skku.edu Semicoductor Systems Egieerig Sugkyukwa Uiversity Outlie Overview Parallelism Data-Level Parallelism Istructio-Level Parallelism Thread-Level

More information

Outline and Reading. Analysis of Algorithms. Running Time. Experimental Studies. Limitations of Experiments. Theoretical Analysis

Outline and Reading. Analysis of Algorithms. Running Time. Experimental Studies. Limitations of Experiments. Theoretical Analysis Outlie ad Readig Aalysis of Algorithms Iput Algorithm Output Ruig time ( 3.) Pseudo-code ( 3.2) Coutig primitive operatios ( 3.3-3.) Asymptotic otatio ( 3.6) Asymptotic aalysis ( 3.7) Case study Aalysis

More information

Computer Systems - HS

Computer Systems - HS What have we leared so far? Computer Systems High Level ENGG1203 2d Semester, 2017-18 Applicatios Sigals Systems & Cotrol Systems Computer & Embedded Systems Digital Logic Combiatioal Logic Sequetial Logic

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently

More information

What are we going to learn? CSC Data Structures Analysis of Algorithms. Overview. Algorithm, and Inputs

What are we going to learn? CSC Data Structures Analysis of Algorithms. Overview. Algorithm, and Inputs What are we goig to lear? CSC316-003 Data Structures Aalysis of Algorithms Computer Sciece North Carolia State Uiversity Need to say that some algorithms are better tha others Criteria for evaluatio Structure

More information

Parallel Programming Overview

Parallel Programming Overview Parallel Programmig Overview Task Parallelism OS support or task parallelism Parameter Studies Domai Decompositio Sequece Matchig Master/Slave paradigm Divide task ito early parallel sub-tasks Start the

More information

A collection of open-sourced RISC-V processors

A collection of open-sourced RISC-V processors Riscy Processors A collectio of ope-sourced RISC-V processors Ady Wright, Sizhuo Zhag, Thomas Bourgeat, Murali Vijayaraghava, Jamey Hicks, Arvid Computatio Structures Group, CSAIL, MIT 4 th RISC-V Workshop

More information

Multiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism

Multiple Issue and Static Scheduling. Multiple Issue. MSc Informatics Eng. Beyond Instruction-Level Parallelism Computing Systems & Performance Beyond Instruction-Level Parallelism MSc Informatics Eng. 2012/13 A.J.Proença From ILP to Multithreading and Shared Cache (most slides are borrowed) When exploiting ILP,

More information

Vector Processing. Computer Organization Architectures for Embedded Computing. Friday, 13 December 13

Vector Processing. Computer Organization Architectures for Embedded Computing. Friday, 13 December 13 Vector Processing Computer Organization Architectures for Embedded Computing Friday, 13 December 13 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy 4th Edition, 2011, MK

More information

ELEG 5173L Digital Signal Processing Introduction to TMS320C6713 DSK

ELEG 5173L Digital Signal Processing Introduction to TMS320C6713 DSK Departmet of Electrical Egieerig Uiversity of Arasas ELEG 5173L Digital Sigal Processig Itroductio to TMS320C6713 DSK Dr. Jigia Wu wuj@uar.edu ANALOG V.S DIGITAL 2 Aalog sigal processig ASP Aalog sigal

More information

CMSC Computer Architecture Lecture 2: ISA. Prof. Yanjing Li Department of Computer Science University of Chicago

CMSC Computer Architecture Lecture 2: ISA. Prof. Yanjing Li Department of Computer Science University of Chicago CMSC 22200 Computer Architecture Lecture 2: ISA Prof. Yajig Li Departmet of Computer Sciece Uiversity of Chicago Admiistrative Stuff Lab1 out toight Due Thursday (10/18) Lab1 review sessio Tomorrow, 10/05,

More information

CSE 417: Algorithms and Computational Complexity

CSE 417: Algorithms and Computational Complexity Time CSE 47: Algorithms ad Computatioal Readig assigmet Read Chapter of The ALGORITHM Desig Maual Aalysis & Sortig Autum 00 Paul Beame aalysis Problem size Worst-case complexity: max # steps algorithm

More information

Page 1. Why Care About the Memory Hierarchy? Memory. DRAMs over Time. Virtual Memory!

Page 1. Why Care About the Memory Hierarchy? Memory. DRAMs over Time. Virtual Memory! Why Care About the Memory Hierarchy? Memory Virtual Memory -DRAM Memory Gap (latecy) Reasos: Multi process systems (abstractio & memory protectio) Solutio: Tables (holdig per process traslatios) Fast traslatio

More information

Threads and Concurrency in Java: Part 1

Threads and Concurrency in Java: Part 1 Cocurrecy Threads ad Cocurrecy i Java: Part 1 What every computer egieer eeds to kow about cocurrecy: Cocurrecy is to utraied programmers as matches are to small childre. It is all too easy to get bured.

More information

CMSC Computer Architecture Lecture 3: ISA and Introduction to Microarchitecture. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 3: ISA and Introduction to Microarchitecture. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 3: ISA ad Itroductio to Microarchitecture Prof. Yajig Li Uiversity of Chicago Lecture Outlie ISA uarch (hardware implemetatio of a ISA) Logic desig basics Sigle-cycle

More information

Threads and Concurrency in Java: Part 1

Threads and Concurrency in Java: Part 1 Threads ad Cocurrecy i Java: Part 1 1 Cocurrecy What every computer egieer eeds to kow about cocurrecy: Cocurrecy is to utraied programmers as matches are to small childre. It is all too easy to get bured.

More information

Chapter 1. Introduction to Computers and C++ Programming. Copyright 2015 Pearson Education, Ltd.. All rights reserved.

Chapter 1. Introduction to Computers and C++ Programming. Copyright 2015 Pearson Education, Ltd.. All rights reserved. Chapter 1 Itroductio to Computers ad C++ Programmig Copyright 2015 Pearso Educatio, Ltd.. All rights reserved. Overview 1.1 Computer Systems 1.2 Programmig ad Problem Solvig 1.3 Itroductio to C++ 1.4 Testig

More information

Reliable Transmission. Spring 2018 CS 438 Staff - University of Illinois 1

Reliable Transmission. Spring 2018 CS 438 Staff - University of Illinois 1 Reliable Trasmissio Sprig 2018 CS 438 Staff - Uiversity of Illiois 1 Reliable Trasmissio Hello! My computer s ame is Alice. Alice Bob Hello! Alice. Sprig 2018 CS 438 Staff - Uiversity of Illiois 2 Reliable

More information

Lecture Notes 6 Introduction to algorithm analysis CSS 501 Data Structures and Object-Oriented Programming

Lecture Notes 6 Introduction to algorithm analysis CSS 501 Data Structures and Object-Oriented Programming Lecture Notes 6 Itroductio to algorithm aalysis CSS 501 Data Structures ad Object-Orieted Programmig Readig for this lecture: Carrao, Chapter 10 To be covered i this lecture: Itroductio to algorithm aalysis

More information

Switching Hardware. Spring 2018 CS 438 Staff, University of Illinois 1

Switching Hardware. Spring 2018 CS 438 Staff, University of Illinois 1 Switchig Hardware Sprig 208 CS 438 Staff, Uiversity of Illiois Where are we? Uderstad Differet ways to move through a etwork (forwardig) Read sigs at each switch (datagram) Follow a kow path (virtual circuit)

More information

Structuring Redundancy for Fault Tolerance. CSE 598D: Fault Tolerant Software

Structuring Redundancy for Fault Tolerance. CSE 598D: Fault Tolerant Software Structurig Redudacy for Fault Tolerace CSE 598D: Fault Tolerat Software What do we wat to achieve? Versios Damage Assessmet Versio 1 Error Detectio Iputs Versio 2 Voter Outputs State Restoratio Cotiued

More information

EE260: Digital Design, Spring /16/18. n Example: m 0 (=x 1 x 2 ) is adjacent to m 1 (=x 1 x 2 ) and m 2 (=x 1 x 2 ) but NOT m 3 (=x 1 x 2 )

EE260: Digital Design, Spring /16/18. n Example: m 0 (=x 1 x 2 ) is adjacent to m 1 (=x 1 x 2 ) and m 2 (=x 1 x 2 ) but NOT m 3 (=x 1 x 2 ) EE26: Digital Desig, Sprig 28 3/6/8 EE 26: Itroductio to Digital Desig Combiatioal Datapath Yao Zheg Departmet of Electrical Egieerig Uiversity of Hawaiʻi at Māoa Combiatioal Logic Blocks Multiplexer Ecoders/Decoders

More information

CMSC Computer Architecture Lecture 1: Introduction. Prof. Yanjing Li Department of Computer Science University of Chicago

CMSC Computer Architecture Lecture 1: Introduction. Prof. Yanjing Li Department of Computer Science University of Chicago CMSC 22200 Computer Architecture Lecture 1: Itroductio Prof. Yajig Li Departmet of Computer Sciece Uiversity of Chicago Lecture Outlie Meet ad greet Computer architecture: overview ad perspectives Course

More information