Design of Digital Circuits Lecture 21: SIMD Processors II and Graphics Processing Units


1 Design of Digital Circuits
Lecture 21: SIMD Processors II and Graphics Processing Units
Dr. Juan Gómez Luna, Prof. Onur Mutlu
ETH Zurich Spring May 2018

2 New Course: Bachelor's Seminar in Comp Arch
Fall credit units
Rigorous seminar on fundamental and cutting-edge topics in computer architecture
Critical presentation, review, and discussion of seminal works in computer architecture
  We will cover many ideas & issues, analyze their tradeoffs, perform critical thinking and brainstorming
Participation, presentation, report and review writing
Stay tuned for more information

3 Agenda for Today & Next Few Lectures
Single-cycle Microarchitectures
Multi-cycle and Microprogrammed Microarchitectures
Pipelining
Issues in Pipelining: Control & Data Dependence Handling, State Maintenance and Recovery, Out-of-Order Execution
Other Execution Paradigms

4 Readings for Today
Peleg and Weiser, "MMX Technology Extension to the Intel Architecture," IEEE Micro
Lindholm et al., "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro

5 Other Approaches to Concurrency (or Instruction Level Parallelism)

6 Approaches to (Instruction-Level) Concurrency
Pipelining
Out-of-order execution
Dataflow (at the ISA level)
Superscalar Execution
VLIW
Fine-Grained Multithreading
SIMD Processing (Vector and array processors, GPUs)
Decoupled Access Execute
Systolic Arrays

7 SIMD Processing: Exploiting Regular (Data) Parallelism

8 Recall: Flynn's Taxonomy of Computers
Mike Flynn, "Very High-Speed Computing Systems," Proc. of IEEE, 1966
SISD: Single instruction operates on single data element
SIMD: Single instruction operates on multiple data elements
  Array processor
  Vector processor
MISD: Multiple instructions operate on single data element
  Closest form: systolic array processor, streaming processor
MIMD: Multiple instructions operate on multiple data elements (multiple instruction streams)
  Multiprocessor
  Multithreaded processor

9 Recall: SIMD Processing
Single instruction operates on multiple data elements
  In time or in space
Multiple processing elements
Time-space duality
  Array processor: Instruction operates on multiple data elements at the same time using different spaces
  Vector processor: Instruction operates on multiple data elements in consecutive time steps using the same space

10 Recall: Array vs. Vector Processors
Instruction stream:
  LD  VR ← A[3:0]
  ADD VR ← VR, 1
  MUL VR ← VR, 2
  ST  A[3:0] ← VR
Array processor (same op at the same time; different ops in the same space):
  Time 1: LD0 LD1 LD2 LD3
  Time 2: AD0 AD1 AD2 AD3
  Time 3: MU0 MU1 MU2 MU3
  Time 4: ST0 ST1 ST2 ST3
Vector processor (different ops at the same time; same op in the same space):
  Time 1: LD0
  Time 2: LD1 AD0
  Time 3: LD2 AD1 MU0
  Time 4: LD3 AD2 MU1 ST0
  Time 5:     AD3 MU2 ST1
  Time 6:         MU3 ST2
  Time 7:             ST3

11 Recall: Memory Banking
Memory is divided into banks that can be accessed independently; banks share address and data buses (to minimize pin cost)
Can start and complete one bank access per cycle
Can sustain N parallel accesses if all N go to different banks
Figure: the CPU connects over a shared address bus and data bus to Bank 0, Bank 1, Bank 2, ..., Bank 15, each with its own MAR and MDR
Picture credit: Derek Chiou

12 Some Issues
Stride and banking
  As long as the stride and the number of banks are relatively prime to each other and there are enough banks to cover bank access latency, we can sustain 1 element/cycle throughput
Storage of a matrix
  Row major: Consecutive elements in a row are laid out consecutively in memory
  Column major: Consecutive elements in a column are laid out consecutively in memory
  You need to change the stride when accessing a row versus column
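
The stride-and-banking condition can be checked numerically. The sketch below assumes simple low-order interleaving (bank = element address mod number of banks) and one access issued per cycle; the function names are illustrative and not from the lecture.

```c
#include <stdbool.h>

/* Greatest common divisor (Euclid's algorithm). */
unsigned gcd(unsigned a, unsigned b) {
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Number of distinct banks touched by a strided access stream,
   assuming bank = element_address % num_banks (an assumption). */
unsigned distinct_banks(unsigned stride, unsigned num_banks) {
    return num_banks / gcd(stride, num_banks);
}

/* The stream returns to the same bank every distinct_banks() accesses.
   With one access issued per cycle, 1 element/cycle is sustainable when
   that revisit distance is at least the bank access latency in cycles. */
bool sustains_one_per_cycle(unsigned stride, unsigned num_banks,
                            unsigned bank_latency) {
    return distinct_banks(stride, num_banks) >= bank_latency;
}
```

With 16 banks, a stride of 1 or 3 touches all 16 banks, while a stride of 10 touches only 8 and a stride of 16 keeps hitting a single bank.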

13 Matrix Multiplication
A and B, both in row-major order
A is 4x6, B is 6x10, C is 4x10
Dot products of rows and columns of A and B
A: Load A0 into vector register V1
  Each time, increment address by one to access the next column
  Accesses have a stride of 1
B: Load B0 into vector register V2
  Each time, increment address by 10
  Accesses have a stride of 10
Different strides can lead to bank conflicts
  How do we minimize them?
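
A scalar sketch of one such dot product makes the two strides visible. It assumes the 4x6 and 6x10 shapes from the slide and `double` elements; the function and constant names are illustrative.

```c
#include <stddef.h>

enum { A_ROWS = 4, A_COLS = 6, B_COLS = 10 };  /* A is 4x6, B is 6x10 */

/* Dot product of row i of A with column j of B, both stored row-major.
   The A operand is walked with stride 1, the B operand with stride B_COLS. */
double dot_row_col(const double *A, const double *B, size_t i, size_t j) {
    const double *a = A + i * A_COLS;  /* first element of row i of A    */
    const double *b = B + j;           /* first element of column j of B */
    double sum = 0.0;
    for (size_t k = 0; k < A_COLS; k++)
        sum += a[k * 1] * b[k * B_COLS];  /* strides: 1 and 10 */
    return sum;
}
```

The result is C[i][j]; a vector machine would issue the two operand loads as vector loads with strides 1 and 10.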

14 Minimizing Bank Conflicts
More banks
Better data layout to match the access pattern
  Is this always possible?
Better mapping of address to bank
  E.g., randomized mapping
  Rau, "Pseudo-randomly interleaved memory," ISCA

15 Recall: Questions (II)
What if vector data is not stored in a strided fashion in memory? (irregular memory access to a vector)
  Idea: Use indirection to combine/pack elements into vector registers
  Called scatter/gather operations

16 Gather/Scatter Operations
Want to vectorize loops with indirect accesses:
  for (i=0; i<n; i++)
      A[i] = B[i] + C[D[i]]
Indexed instruction (Gather)
  LV     vd, rd       # Load indices in D vector
  LVI    vc, rc, vd   # Load indirect from rc base
  LV     vb, rb       # Load B vector
  ADDV.D va, vb, vc   # Do add
  SV     va, ra       # Store result
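
As a scalar reference for what the vector sequence computes, the following sketch spells out the gather step. The comments map each statement to the slide's instructions; the function name and the use of `double` elements are assumptions.

```c
#include <stddef.h>

/* Scalar reference for the gather-based loop A[i] = B[i] + C[D[i]]. */
void gather_add(double *A, const double *B, const double *C,
                const size_t *D, size_t n) {
    for (size_t i = 0; i < n; i++) {
        double gathered = C[D[i]];  /* LVI: indexed load relative to base C */
        A[i] = B[i] + gathered;     /* ADDV.D, then SV */
    }
}
```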

17 Gather/Scatter Operations
Gather/scatter operations often implemented in hardware to handle sparse vectors (matrices)
Vector loads and stores use an index vector which is added to the base register to generate the addresses
Scatter example (figure): Index Vector, Data Vector (to Store), and Stored Vector (in Memory) at locations starting from Base; Base+1, Base+3, Base+4, and Base+5 are marked X
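
A scatter is the store-side counterpart of a gather. A minimal scalar sketch, with an illustrative function name and `double` data assumed:

```c
#include <stddef.h>

/* Scatter: element i of the data vector is stored at base + index[i].
   Locations not named by any index are left untouched. */
void scatter(double *base, const size_t *index, const double *data, size_t n) {
    for (size_t i = 0; i < n; i++)
        base[index[i]] = data[i];
}
```

For instance, an index vector of {0, 2, 6, 7} (hypothetical values chosen to match the X pattern in the figure) writes Base, Base+2, Base+6, and Base+7 and leaves Base+1, Base+3, Base+4, and Base+5 unchanged.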

18 Array vs. Vector Processors, Revisited
Array vs. vector processor distinction is a "purist's" distinction
Most "modern" SIMD processors are a combination of both
  They exploit data parallelism in both time and space
  GPUs are a prime example we will cover in a bit more detail

19 Recall: Array vs. Vector Processors
Figure: repeated from slide 10 (instruction stream LD, ADD, MUL, ST on A[3:0]; the array processor performs the same op on all four elements at the same time, the vector processor staggers the elements across consecutive time steps)

20 Vector Instruction Execution
VADD A, B → C
Figure: execution using one pipelined functional unit versus four pipelined functional units. With one unit, one element pair (A[i], B[i]) enters the pipeline per cycle and one result (C[0], C[1], C[2], ...) completes per cycle. With four units, four element pairs enter per cycle (e.g., A[24..27], B[24..27]) and four results complete per cycle (C[0..3], C[4..7], C[8..11], ...).
Slide credit: Krste Asanovic

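21 Vector Unit Structure
Figure: lanes, each containing a slice of the functional units and one partition of the vector registers, all connected to the memory subsystem
Partitioned vector registers, one partition per lane: elements 0, 4, 8, ...; elements 1, 5, 9, ...; elements 2, 6, 10, ...; elements 3, 7, 11, ...
Slide credit: Krste Asanovic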

22 Vector Instruction Level Parallelism
Can overlap execution of multiple vector instructions
  Example machine has 32 elements per vector register and 8 lanes
  Completes 24 operations/cycle while issuing 1 vector instruction/cycle
Figure: vector instructions (mul, add, ...) issued over time and overlapping across the Load Unit, Multiply Unit, and Add Unit
Slide credit: Krste Asanovic
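
The 24 operations/cycle figure follows from the three functional units in the figure (load, multiply, add) each working on 8 lanes at once. A small sketch of that arithmetic, with illustrative function names:

```c
/* Peak operations completed per cycle when every unit is busy. */
unsigned peak_ops_per_cycle(unsigned functional_units, unsigned lanes) {
    return functional_units * lanes;
}

/* Cycles one vector instruction occupies its functional unit. */
unsigned cycles_per_vector_instruction(unsigned vector_length, unsigned lanes) {
    return (vector_length + lanes - 1) / lanes;  /* ceiling division */
}
```

With 3 units and 8 lanes the peak is 24 operations/cycle, and a 32-element vector instruction occupies its unit for 4 cycles, which is why instructions issued one per cycle can overlap.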

23 Automatic Code Vectorization
for (i=0; i < N; i++)
    C[i] = A[i] + B[i];
Figure: in the scalar sequential code, each iteration (Iter. 1, Iter. 2, ...) performs its own operations (add, store) in turn; in the vectorized code, one vector instruction performs the same operation for all iterations
Vectorization is a compile-time reordering of operation sequencing ⇒ requires extensive loop dependence analysis
Slide credit: Krste Asanovic
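
One way to picture what a vectorizing compiler produces is a strip-mined loop: the body is reordered so that the same operation is applied to a whole group of iterations before moving on. The sketch below emulates this in plain C; `VLEN` and the function name are assumptions, and real compilers emit vector instructions rather than inner loops.

```c
#include <stddef.h>

enum { VLEN = 4 };  /* assumed vector length, for illustration only */

void vector_add(const int *A, const int *B, int *C, size_t n) {
    size_t i = 0;
    /* Vectorized part: each group of VLEN iterations is handled together. */
    for (; i + VLEN <= n; i += VLEN) {
        int va[VLEN], vb[VLEN], vc[VLEN];
        for (size_t l = 0; l < VLEN; l++) va[l] = A[i + l];       /* vector load  */
        for (size_t l = 0; l < VLEN; l++) vb[l] = B[i + l];       /* vector load  */
        for (size_t l = 0; l < VLEN; l++) vc[l] = va[l] + vb[l];  /* vector add   */
        for (size_t l = 0; l < VLEN; l++) C[i + l] = vc[l];       /* vector store */
    }
    /* Scalar remainder for the iterations that do not fill a vector. */
    for (; i < n; i++)
        C[i] = A[i] + B[i];
}
```

The reordering is legal here because no iteration reads a value written by another iteration, which is what the loop dependence analysis has to establish.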

24 Vector/SIMD Processing Summary
Vector/SIMD machines are good at exploiting regular data-level parallelism
  Same operation performed on many data elements
  Improve performance, simplify design (no intra-vector dependencies)
Performance improvement limited by vectorizability of code
  Scalar operations limit vector machine performance
  Remember Amdahl's Law
  CRAY-1 was the fastest SCALAR machine at its time!
Many existing ISAs include (vector-like) SIMD operations
  Intel MMX/SSE/AVX, PowerPC AltiVec, ARM Advanced SIMD
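
Amdahl's Law quantifies how the scalar portion limits the overall gain. A minimal sketch, where `f` is the vectorizable fraction of execution time and `s` is the speedup of that fraction (both names are illustrative):

```c
/* Amdahl's Law: overall speedup = 1 / ((1 - f) + f / s). */
double amdahl_speedup(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}
```

If only half of the execution time is vectorizable, even an arbitrarily fast vector unit gives less than 2x overall, which is why scalar performance still matters on a vector machine.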

25 SIMD Operations in Modern ISAs

26 SIMD ISA Extensions
Single Instruction Multiple Data (SIMD) extension instructions
  Single instruction acts on multiple pieces of data at once
  Common application: graphics
  Perform short arithmetic operations (also called packed arithmetic)
For example: add four 8-bit numbers
Must modify ALU to eliminate carries between 8-bit values
padd8 $s2, $s0, $s1
  $s0:  a3       a2       a1       a0
  $s1:  b3       b2       b1       b0
  $s2:  a3 + b3  a2 + b2  a1 + b1  a0 + b0
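
The effect of cutting the carry chain between bytes can be emulated in software on an ordinary 32-bit adder. This is a sketch of the packed-add semantics (wraparound per byte), not of how the modified ALU is built; the function name simply mirrors the slide's mnemonic.

```c
#include <stdint.h>

/* Four independent 8-bit additions packed in one 32-bit word.
   No carry crosses from one byte into the next. */
uint32_t padd8(uint32_t a, uint32_t b) {
    /* Add the low 7 bits of every byte; any carry stays inside its byte. */
    uint32_t low7 = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    /* Add the top bit of every byte without producing a carry-out. */
    uint32_t top = (a ^ b) & 0x80808080u;
    return low7 ^ top;
}
```

For example, the byte 0xFF plus 0x01 wraps to 0x00 without disturbing the neighboring byte.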

27 Intel Pentium MMX Operations
Idea: One instruction operates on multiple data elements simultaneously
  À la array processing (yet much more limited)
  Designed with multimedia (graphics) operations in mind
No VLEN register
Opcode determines data type:
  8 8-bit bytes
  4 16-bit words
  2 32-bit doublewords
  1 64-bit quadword
Stride is always equal to 1.
Peleg and Weiser, "MMX Technology Extension to the Intel Architecture," IEEE Micro

28 MMX Example: Image Overlaying (I)
Goal: Overlay the human in image 1 on top of the background in image 2
for (i = 0; i < image_size; i++) {
    if (x[i] == Blue) new_image[i] = y[i];
    else new_image[i] = x[i];
}
Peleg and Weiser, "MMX Technology Extension to the Intel Architecture," IEEE Micro
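
The branch in this loop can be replaced by a compare that produces a mask, followed by bitwise selects, which is the form packed instructions can execute on many pixels at once. The sketch below does this one byte at a time in plain C; the mapping to the MMX instructions PCMPEQB, PAND, PANDN, and POR in the comments is an assumption about how the steps correspond, and the function name is illustrative.

```c
#include <stdint.h>
#include <stddef.h>

/* Branch-free overlay: where x is blue, take the pixel from y, else keep x. */
void overlay(const uint8_t *x, const uint8_t *y, uint8_t *out,
             size_t n, uint8_t blue) {
    for (size_t i = 0; i < n; i++) {
        /* 0xFF where x[i] == blue, 0x00 otherwise (compare, like PCMPEQB). */
        uint8_t mask = (uint8_t)-(x[i] == blue);
        uint8_t from_y = y[i] & mask;            /* like PAND  */
        uint8_t from_x = x[i] & (uint8_t)~mask;  /* like PANDN */
        out[i] = from_y | from_x;                /* like POR   */
    }
}
```

A packed implementation would apply the same four steps to 8 bytes per instruction instead of looping over single bytes.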

29 MMX Example: Image Overlaying (II)
Figure: Y = Blossom image, X = Woman's image
Peleg and Weiser, "MMX Technology Extension to the Intel Architecture," IEEE Micro

30 GPUs (Graphics Processing Units)

31 GPUs are SIMD Engines Underneath
The instruction pipeline operates like a SIMD pipeline (e.g., an array processor)
However, the programming is done using threads, NOT SIMD instructions
To understand this, let's go back to our parallelizable code example
But, before that, let's distinguish between
  Programming Model (Software) vs. Execution Model (Hardware)

32 Programming Model vs. Hardware Execution Model
Programming Model refers to how the programmer expresses the code
  E.g., Sequential (von Neumann), Data Parallel (SIMD), Dataflow, Multi-threaded (MIMD, SPMD), ...
Execution Model refers to how the hardware executes the code underneath
  E.g., Out-of-order execution, Vector processor, Array processor, Dataflow processor, Multiprocessor, Multithreaded processor, ...
Execution Model can be very different from the Programming Model
  E.g., von Neumann model implemented by an OoO processor
  E.g., SPMD model implemented by a SIMD processor (a GPU)

33 How Can You Exploit Parallelism Here?
Scalar Sequential Code:
  for (i=0; i < N; i++)
      C[i] = A[i] + B[i];
Let's examine three programming options to exploit instruction-level parallelism present in this sequential code:
  1. Sequential (SISD)
  2. Data-Parallel (SIMD)
  3. Multithreaded (MIMD/SPMD)

34 Prog. Model 1: Sequential (SISD)
Scalar Sequential Code:
  for (i=0; i < N; i++)
      C[i] = A[i] + B[i];
Can be executed on a:
Pipelined processor
Out-of-order execution processor
  Independent instructions executed when ready
  Different iterations are present in the instruction window and can execute in parallel in multiple functional units
  In other words, the loop is dynamically unrolled by the hardware
Superscalar or VLIW processor
  Can fetch and execute multiple instructions per cycle

35 Prog. Model 2: Data Parallel (SIMD)
Scalar Sequential Code:
  for (i=0; i < N; i++)
      C[i] = A[i] + B[i];
Vectorized Code:
  VLD  A → V1
  VLD  B → V2
  VADD V1 + V2 → V3
  VST  V3 → C
Realization: Each iteration is independent
Idea: Programmer or compiler generates a SIMD instruction to execute the same instruction from all iterations across different data
Best executed by a SIMD processor (vector, array)

36 Prog. Model 3: Multithreaded
Scalar Sequential Code:
  for (i=0; i < N; i++)
      C[i] = A[i] + B[i];
Realization: Each iteration is independent
Idea: Programmer or compiler generates a thread to execute each iteration. Each thread does the same thing (but on different data)
Can be executed on a MIMD machine

37 Prog. Model 3: Multithreaded
  for (i=0; i < N; i++)
      C[i] = A[i] + B[i];
Realization: Each iteration is independent
Idea: Programmer or compiler generates a thread to execute each iteration. Each thread does the same thing (but on different data)
Can be executed on a MIMD machine
This particular model is also called SPMD: Single Program Multiple Data
Can also be executed on a SIMD machine, as SIMT: Single Instruction Multiple Thread

38 A GPU is a SIMD (SIMT) Machine
Except it is not programmed using SIMD instructions
It is programmed using threads (SPMD programming model)
  Each thread executes the same code but operates on a different piece of data
  Each thread has its own context (i.e., can be treated/restarted/executed independently)
A set of threads executing the same instruction are dynamically grouped into a warp (wavefront) by the hardware
  A warp is essentially a SIMD operation formed by hardware!

39 SPMD on SIMT Machine
  for (i=0; i < N; i++)
      C[i] = A[i] + B[i];
Figure: Warp 0 steps through PC X, PC X+1, PC X+2, and PC X+3, with the threads for all iterations executing the same instruction at each PC
Warp: A set of threads that execute the same instruction (i.e., at the same PC)
Realization: Each iteration is independent
This particular model is also called SPMD: Single Program Multiple Data
A GPU executes it using the SIMT model: Single Instruction Multiple Thread

40 Graphics Processing Units: SIMD not Exposed to Programmer (SIMT)

41 SIMD vs. SIMT Execution Model
SIMD: A single sequential instruction stream of SIMD instructions → each instruction specifies multiple data inputs
  [VLD, VLD, VADD, VST], VLEN
SIMT: Multiple instruction streams of scalar instructions → threads grouped dynamically into warps
  [LD, LD, ADD, ST], NumThreads
Two Major SIMT Advantages:
  Can treat each thread separately → i.e., can execute each thread independently (on any type of scalar pipeline) → MIMD processing
  Can group threads into warps flexibly → i.e., can group threads that are supposed to truly execute the same instruction → dynamically obtain and maximize benefits of SIMD processing

42 Multithreading of Warps
  for (i=0; i < N; i++)
      C[i] = A[i] + B[i];
Assume a warp consists of 32 threads
If you have 32K iterations, and 1 iteration/thread → 1K warps
Warps can be interleaved on the same pipeline → Fine grained multithreading of warps
Figure: Warp 10 at PC X and Warp 20 at PC X+2, each covering its own block of iterations
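
The warp count is a ceiling division of the thread count by the warp size. A one-function sketch (the name is illustrative):

```c
/* Number of warps needed to cover all threads (one iteration per thread). */
unsigned long num_warps(unsigned long threads, unsigned long warp_size) {
    return (threads + warp_size - 1) / warp_size;  /* ceiling division */
}
```

With 32K threads (32 * 1024) and 32 threads per warp this gives 1024 warps, the 1K on the slide.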

43 Warps and Warp-Level FGMT
Warp: A set of threads that execute the same instruction (on different data elements) → SIMT (Nvidia-speak)
All threads run the same code
Warp: The threads that run lengthwise in a woven fabric
Figure: a thread warp consists of scalar threads W, X, Y, and Z sharing a common PC; Thread Warp 3, Thread Warp 8, and Thread Warp 7 feed the SIMD pipeline

44 High-Level View of a GPU

45 Latency Hiding via Warp-Level FGMT
Warp: A set of threads that execute the same instruction (on different data elements)
Fine-grained multithreading
  One instruction per thread in pipeline at a time (No interlocking)
  Interleave warp execution to hide latencies
Register values of all threads stay in register file
FGMT enables long latency tolerance
  Millions of pixels
Figure: SIMD pipeline with I-Fetch, Decode, RF, ALU, D-Cache, and Writeback stages. Warps available for scheduling (Thread Warps 3, 8, and 7) issue into the pipeline. At the D-Cache, if all threads hit, the data proceeds to writeback; on a miss, the warp joins the warps accessing the memory hierarchy (Thread Warps 1, 2, and 6).
Slide credit: Tor Aamodt

46 Warp Execution (Recall the Slide)
32-thread warp executing ADD A[tid], B[tid] → C[tid]
Figure: execution using one pipelined functional unit versus four pipelined functional units. With one unit, one thread's operands (A[i], B[i]) enter the pipeline per cycle and one result (C[0], C[1], C[2], ...) completes per cycle. With four units, four threads' operands enter per cycle (e.g., A[24..27], B[24..27]) and four results complete per cycle (C[0..3], C[4..7], C[8..11], ...).
Slide credit: Krste Asanovic

47 SIMD Execution Unit Structure
Figure: lanes, each containing a slice of the functional units and the registers for its threads, all connected to the memory subsystem
Registers for each thread, one partition per lane: thread IDs 0, 4, 8, ...; thread IDs 1, 5, 9, ...; thread IDs 2, 6, 10, ...; thread IDs 3, 7, 11, ...
Slide credit: Krste Asanovic

48 Warp Instruction Level Parallelism
Can overlap execution of multiple instructions
  Example machine has 32 threads per warp and 8 lanes
  Completes 24 operations/cycle while issuing 1 warp/cycle
Figure: warps W0 through W5 issued over time and overlapping across the Load Unit, Multiply Unit, and Add Unit
Slide credit: Krste Asanovic

49 SIMT Memory Access
Same instruction in different threads uses thread id to index and access different data elements
Let's assume N=16, 4 threads per warp → 4 warps
Figure: threads and the data elements they access, grouped into Warp 0, Warp 1, Warp 2, and Warp 3
Slide credit: Hyesoon Kim
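
The grouping in this example can be written as a mapping from thread id to a warp and a position within the warp. The sketch assumes consecutive thread ids are placed in the same warp; the struct and function names are illustrative.

```c
/* Position of a thread: which warp it belongs to and its slot in that warp. */
typedef struct {
    unsigned warp;
    unsigned lane;
} WarpSlot;

/* Assumes consecutive thread ids are grouped into the same warp. */
WarpSlot slot_of(unsigned tid, unsigned threads_per_warp) {
    WarpSlot s = { tid / threads_per_warp, tid % threads_per_warp };
    return s;
}
```

With N=16 and 4 threads per warp, thread 5 lands in Warp 1 at slot 1 and accesses data element 5, and thread 15 lands in Warp 3 at slot 3.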

50 Sample GPU SIMT Code (Simplified)
CPU code:
  for (ii = 0; ii < …; ++ii) {
      C[ii] = A[ii] + B[ii];
  }
CUDA code:
  // there are … threads
  __global__ void KernelFunction(…) {
      int tid = blockDim.x * blockIdx.x + threadIdx.x;
      int vara = aa[tid];
      int varb = bb[tid];
      C[tid] = vara + varb;
  }
Slide credit: Hyesoon Kim
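
To see how the per-thread index covers the whole array, the kernel can be emulated on a CPU by looping over every (block, thread) pair. This is a plain C sketch, not CUDA: `dim1`, `kernel_function`, and `launch` are hypothetical stand-ins for the CUDA built-ins and launch syntax, and the bounds check on `tid` is an addition not shown on the slide.

```c
/* One-dimensional stand-in for CUDA's blockDim / blockIdx / threadIdx. */
typedef struct { unsigned x; } dim1;

/* Body executed by every emulated thread. */
void kernel_function(dim1 blockDim, dim1 blockIdx, dim1 threadIdx,
                     const int *aa, const int *bb, int *C, unsigned n) {
    unsigned tid = blockDim.x * blockIdx.x + threadIdx.x;
    if (tid >= n) return;  /* guard for a partially filled last block */
    int vara = aa[tid];
    int varb = bb[tid];
    C[tid] = vara + varb;
}

/* Sequential emulation of a launch: one call per (block, thread) pair. */
void launch(unsigned num_blocks, unsigned threads_per_block,
            const int *aa, const int *bb, int *C, unsigned n) {
    dim1 blockDim = { threads_per_block };
    for (unsigned b = 0; b < num_blocks; b++) {
        for (unsigned t = 0; t < threads_per_block; t++) {
            dim1 blockIdx = { b }, threadIdx = { t };
            kernel_function(blockDim, blockIdx, threadIdx, aa, bb, C, n);
        }
    }
}
```

On a GPU the two loops disappear: the hardware runs the threads of each block in warps, and each thread computes its own `tid`.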

51 Sample GPU Program (Less Simplified)
Slide credit: Hyesoon Kim

52 Warp-based SIMD vs. Traditional SIMD
Traditional SIMD contains a single thread
  Sequential instruction execution; lock-step operations in a SIMD instruction
  Programming model is SIMD (no extra threads) → SW needs to know vector length
  ISA contains vector/SIMD instructions
Warp-based SIMD consists of multiple scalar threads executing in a SIMD manner (i.e., same instruction executed by all threads)
  Does not have to be lock step
  Each thread can be treated individually (i.e., placed in a different warp) → programming model not SIMD
    SW does not need to know vector length
    Enables multithreading and flexible dynamic grouping of threads
  ISA is scalar → SIMD operations can be formed dynamically
  Essentially, it is SPMD programming model implemented on SIMD hardware

53 SPMD
Single procedure/program, multiple data
  This is a programming model rather than computer organization
Each processing element executes the same procedure, except on different data elements
  Procedures can synchronize at certain points in program, e.g. barriers
Essentially, multiple instruction streams execute the same program
  Each program/procedure 1) works on different data, 2) can execute a different control-flow path, at run-time
  Many scientific applications are programmed this way and run on MIMD hardware (multiprocessors)
  Modern GPUs programmed in a similar way on a SIMD hardware

54 SIMD vs. SIMT Execution Model
SIMD: A single sequential instruction stream of SIMD instructions → each instruction specifies multiple data inputs
  [VLD, VLD, VADD, VST], VLEN
SIMT: Multiple instruction streams of scalar instructions → threads grouped dynamically into warps
  [LD, LD, ADD, ST], NumThreads
Two Major SIMT Advantages:
  Can treat each thread separately → i.e., can execute each thread independently on any type of scalar pipeline → MIMD processing
  Can group threads into warps flexibly → i.e., can group threads that are supposed to truly execute the same instruction → dynamically obtain and maximize benefits of SIMD processing

55 Threads Can Take Different Paths in Warp-based SIMD
Each thread can have conditional control flow instructions
Threads can execute different control flow paths
Figure: a thread warp (Thread 1, Thread 2, Thread 3, Thread 4) with a common PC, next to a control flow graph with basic blocks A, B, C, D, E, F, and G
Slide credit: Tor Aamodt

56 Control Flow Problem in GPUs/SIMT
A GPU uses a SIMD pipeline to save area on control logic
  Groups scalar threads into warps
Branch divergence occurs when threads inside warps branch to different execution paths
Figure: a branch splitting into Path A and Path B
This is the same as conditional/predicated/masked execution. Recall the Vector Mask and Masked Vector Operations?
Slide credit: Tor Aamodt

57 Remember: Each Thread Is Independent
Two Major SIMT Advantages:
  Can treat each thread separately → i.e., can execute each thread independently on any type of scalar pipeline → MIMD processing
  Can group threads into warps flexibly → i.e., can group threads that are supposed to truly execute the same instruction → dynamically obtain and maximize benefits of SIMD processing
If we have many threads
We can find individual threads that are at the same PC
And, group them together into a single warp dynamically
This reduces "divergence" → improves SIMD utilization
  SIMD utilization: fraction of SIMD lanes executing a useful operation (i.e., executing an active thread)
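
SIMD utilization as defined here can be computed from the active mask of each issued warp instruction. The sketch below assumes one bit per lane in a 32-bit mask; the function names are illustrative.

```c
#include <stdint.h>
#include <stddef.h>

/* Count the set bits in a mask (one bit per active thread). */
unsigned popcount32(uint32_t m) {
    unsigned c = 0;
    while (m != 0) {
        m &= m - 1;  /* clear the lowest set bit */
        c++;
    }
    return c;
}

/* Fraction of SIMD lanes doing useful work over a sequence of issued
   warp instructions, given the active mask of each one. */
double simd_utilization(const uint32_t *active_masks, size_t issued,
                        unsigned lanes) {
    if (issued == 0 || lanes == 0) return 0.0;
    unsigned long active = 0;
    for (size_t i = 0; i < issued; i++)
        active += popcount32(active_masks[i]);
    return (double)active / ((double)issued * (double)lanes);
}
```

On a 4-lane pipeline, issuing one instruction with all 4 threads active and one with only 2 active gives 6 useful lane-slots out of 8, i.e. 75% utilization.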

58 Dynamic Warp Formation/Merging
Idea: Dynamically merge threads executing the same instruction (after branch divergence)
Form new warps from warps that are waiting
  Enough threads branching to each path enables the creation of full new warps
Figure: Warp X, Warp Y, Warp Z

59 Dynamic Warp Formation/Merging
Idea: Dynamically merge threads executing the same instruction (after branch divergence)
Figure: a branch splitting into Path A and Path B
Fung et al., "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow," MICRO

60 Dynamic Warp Formation Example
Active masks of Warp x and Warp y at each basic block of the control flow graph:
  A: x/1111, y/1111
  B: x/1110, y/0011
  C: x/1000, y/0010
  D: x/0110, y/0001
  E: x/1110, y/0011
  F: x/0001, y/1100
  G: x/1111, y/1111
Legend: execution of Warp x at Basic Block A; execution of Warp y at Basic Block A; a new warp created from scalar threads of both Warp x and y executing at Basic Block D
Execution order over time:
  Baseline:               A A B B C C D D E E F F G G A A
  Dynamic Warp Formation: A A B B C D E E F G G A A
Slide credit: Tor Aamodt

61 Hardware Constraints Limit Flexibility of Warp Grouping
Figure: lanes, each containing a slice of the functional units and the registers for its threads, all connected to the memory subsystem
Registers for each thread, one partition per lane: thread IDs 0, 4, 8, ...; thread IDs 1, 5, 9, ...; thread IDs 2, 6, 10, ...; thread IDs 3, 7, 11, ...
Can you move any thread flexibly to any lane?
Slide credit: Krste Asanovic
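
If each thread's registers live in one lane, two warps can only be merged when their active threads occupy different lanes. The sketch below checks this with the 4-bit masks from the dynamic warp formation example (slide 60); it assumes threads cannot change lanes, and the function names are illustrative.

```c
#include <stdbool.h>

/* Two warps at the same PC can share one issue slot only if no lane is
   needed by both (assumes every thread must stay in its home lane). */
bool can_merge_in_place(unsigned mask_x, unsigned mask_y) {
    return (mask_x & mask_y) == 0;
}

/* Active mask of the merged warp (meaningful when the merge is allowed). */
unsigned merged_mask(unsigned mask_x, unsigned mask_y) {
    return mask_x | mask_y;
}
```

At block D, x/0110 and y/0001 use disjoint lanes and merge into 0111. At block B, x/1110 and y/0011 both need one of the lanes, so they cannot merge under this constraint. This is consistent with the dynamic warp formation timeline on slide 60, where C, D, and F appear once but B and E still appear twice.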


63 We did not cover the following slides in lecture. These are for your preparation for the next lecture.

64 An Example GPU

65 NVIDIA GeForce GTX 285
NVIDIA-speak:
  240 stream processors
  "SIMT execution"
Generic speak:
  30 cores
  8 SIMD functional units per core
Slide credit: Kayvon Fatahalian

66 NVIDIA GeForce GTX 285 "core"
64 KB of storage for thread contexts (registers)
Figure legend: SIMD functional unit, control shared across 8 units; multiply-add; multiply; instruction stream decode; execution context storage
Slide credit: Kayvon Fatahalian

67 NVIDIA GeForce GTX 285 "core"
64 KB of storage for thread contexts (registers)
Groups of 32 threads share instruction stream (each group is a Warp)
Up to 32 warps are simultaneously interleaved
Up to 1024 thread contexts can be stored
Slide credit: Kayvon Fatahalian

68 NVIDIA GeForce GTX 285
Figure: the 30 cores of the chip, with texture units (Tex)
30 cores on the GTX 285: 30,720 threads
Slide credit: Kayvon Fatahalian
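
The GTX 285 numbers on these slides are products of per-core quantities. A short sketch of the arithmetic; the function names are illustrative, and the bytes-per-thread value is a derived figure that assumes the 64 KB is split evenly across all 1024 contexts.

```c
/* Total across the chip, given a count per core. */
unsigned total_across_cores(unsigned cores, unsigned per_core) {
    return cores * per_core;
}

/* Thread contexts held by one core. */
unsigned contexts_per_core(unsigned warps_per_core, unsigned threads_per_warp) {
    return warps_per_core * threads_per_warp;
}

/* Register storage available to each thread context, in bytes,
   assuming an even split of the core's context storage. */
unsigned context_bytes_per_thread(unsigned storage_kb, unsigned contexts) {
    return storage_kb * 1024u / contexts;
}
```

30 cores with 8 SIMD functional units each give the 240 stream processors; 32 warps of 32 threads give 1024 contexts per core, and 30 cores give 30,720 threads. An even split of 64 KB over 1024 contexts would leave 64 bytes per thread.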

69 Evolution of NVIDIA GPUs
Figure: number of stream processors and GFLOPS for GTX 285 (2009), GTX 480 (2010), GTX 780 (2013), GTX 980 (2014), P100 (2016), and V100 (2017)

70 NVIDIA V100
NVIDIA-speak:
  5120 stream processors
  "SIMT execution"
Generic speak:
  80 cores
  64 SIMD functional units per core
Tensor cores for Machine Learning

71 NVIDIA V100 Block Diagram
80 cores on the V100

72 NVIDIA V100 Core
15.7 TFLOPS Single Precision
7.8 TFLOPS Double Precision
125 TFLOPS for Deep Learning (Tensor cores)


More information

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5.

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5. Morga Kaufma Publishers 26 February, 208 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 5 Virtual Memory Review: The Memory Hierarchy Take advatage of the priciple

More information

Multiprocessors. HPC Prof. Robert van Engelen

Multiprocessors. HPC Prof. Robert van Engelen Multiprocessors Prof. Robert va Egele Overview The PMS model Shared memory multiprocessors Basic shared memory systems SMP, Multicore, ad COMA Distributed memory multicomputers MPP systems Network topologies

More information

CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. Lecture 17 GPUs

CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture. Lecture 17 GPUs CS 152 Computer Architecture ad Egieerig CS252 Graduate Computer Architecture Lecture 17 GPUs Krste Asaovic Electrical Egieerig ad Computer Scieces Uiversity of Califoria at Berkeley http://www.eecs.berkeley.edu/~krste

More information

Master Informatics Eng. 2017/18. A.J.Proença. Memory Hierarchy. (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2017/18 1

Master Informatics Eng. 2017/18. A.J.Proença. Memory Hierarchy. (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2017/18 1 Advaced Architectures Master Iformatics Eg. 2017/18 A.J.Proeça Memory Hierarchy (most slides are borrowed) AJProeça, Advaced Architectures, MiEI, UMiho, 2017/18 1 Itroductio Programmers wat ulimited amouts

More information

Course Site: Copyright 2012, Elsevier Inc. All rights reserved.

Course Site:   Copyright 2012, Elsevier Inc. All rights reserved. Course Site: http://cc.sjtu.edu.c/g2s/site/aca.html 1 Computer Architecture A Quatitative Approach, Fifth Editio Chapter 2 Memory Hierarchy Desig 2 Outlie Memory Hierarchy Cache Desig Basic Cache Optimizatios

More information

Threads and Concurrency in Java: Part 1

Threads and Concurrency in Java: Part 1 Threads ad Cocurrecy i Java: Part 1 1 Cocurrecy What every computer egieer eeds to kow about cocurrecy: Cocurrecy is to utraied programmers as matches are to small childre. It is all too easy to get bured.

More information
