ECE5917 SoC Architecture: MP SoC Part 1. Tae Hee Han: Semiconductor Systems Engineering Sungkyunkwan University

1 ECE5917 SoC Architecture: MP SoC Part 1
Tae Hee Han: tha@skku.edu, Semiconductor Systems Engineering, Sungkyunkwan University

2 Outline
Overview
Parallelism: Data-Level Parallelism, Instruction-Level Parallelism, Thread-Level Parallelism, Processor-Level Parallelism
Multi-core

3 Overview

4 Where Are We Headed?
Note: the time frame is popularity based (not based on first appearance).
[Figure: MIPS over time. Single-chip CPU Era (~2004), the Era of Instruction Level Parallelism: pipelining, superscalar, speculative OOO. Era of Thread & Processor Level Parallelism: multithread, SIMD extension, multithread/multi-core, special-purpose HW, CPU-GPU fusion.]

5 Where Are We Headed? (Intel/AMD Architecture Transition)
[Figure: process and core-count roadmap across the Single-Core Era, the Multi-Core Era, and the System-level Integration Era, with the "single-core crisis" and the "CELL Shock" marked.]
Intel Desktop & Server (NetBurst): 130nm Northwood/Gallatin (1-core), 90nm Prescott/Smithfield, 65nm Cedar Mill/Presler (2-core, MCM)
Intel Mobile (P6): 130nm Tualatin (Pentium III, 1-core), 130nm Banias and 90nm Dothan (Pentium M), 65nm Yonah (2-core)
Intel Core: 65nm Merom (2-core), 45nm Penryn (4-core, MCM)
Intel Nehalem: 45nm Nehalem (4-core), 32nm Westmere (>= 6-core)
Intel Sandy Bridge: 32nm Sandy Bridge, 22nm Ivy Bridge
AMD Desktop & Server: K7 (180nm, 130nm), K8 (130nm, 90nm, 65nm), K10/K8L (65nm, 45nm), Bulldozer (32nm), moving from 1-core through 2-core and 4-core to >= 6-core

6 Processor Architectures: Flynn's Classification
SISD: Single Instruction, Single Data stream. Uniprocessor.
SIMD: Single Instruction, Multiple Data streams. The same instruction is executed by multiple processing units, e.g., multimedia processors, vector architectures.
MISD: Multiple Instruction, Single Data stream. Successive functional units operate on the same stream of data. Rarely found in general-purpose commercial designs.
MIMD: Multiple Instruction, Multiple Data streams. Each processor has its own instruction and data streams. Most popular form of parallel processing.
  Single-user: high performance for one application.
  Multiprogrammed: running many tasks simultaneously (e.g., servers).
[Figures: SISD and SIMD diagrams, each showing an instruction pool and a data pool feeding one or several processing units (PU).]

7 System-level Integration (Chuck Moore, AMD, at MICRO 2008)
Single-chip CPU Era: extreme focus on single-threaded performance; multi-issue, out-of-order execution plus moderate cache hierarchy.
Chip Multiprocessor (CMP) Era:
  Early: hasty integration of multiple cores into the same chip/package.
  Mid-life: address some of the HW scalability and (memory) interference issues.
  Current: homogeneous CPUs plus moderate system-level functionality.
System-level Integration Era (~2010 onward):
  Integration of substantial system-level functionality.
  Heterogeneous processors and accelerators.
  Introspective control systems for managing on-chip resources & events.

8 Challenges! (Chuck Moore, AMD, 2011)
[Figure: six trend plots, each with a "We are here" marker.
  Moore's Law: integration (log scale) vs. time; DFM, variability, reliability, wire delay.
  Power Wall: power budget (TDP) vs. time; server: power = $$, DT: eliminate fans, mobile: battery.
  Frequency Wall: frequency vs. time.
  ILP Complexity Wall: IPC vs. issue width.
  Locality: performance vs. cache size.
  Single-thread performance vs. time.]

9 Three Walls to Serial Performance
Memory Wall
Instruction Level Parallelism (ILP) Wall
Power Wall
Source: excellent article, "The Many-Core Inflection Point for Mass Market Computer Systems", by John L. Manferdelli, Microsoft Corporation /02/the-may-core-iflectio-poitfor-mass-market-computer-systems/

10 Recall: Memory Wall
Processor vs. memory (DRAM) performance gap!
DRAM: a 1-cycle access in 1980 takes 100s of cycles in 2010.
Registers: fast but small and expensive.
We want: fast, large, and cheap memory.

11 Recall: Typical Memory Hierarchy
Random access (read) latency, ordered from smaller/faster/costlier (top) to larger/slower/cheaper (bottom):

Type                       Access time          Capacity                                 Managed by
Register (F/F)             1 cycle              ≈ 500~1,000B                             Compiler
L1 cache (SRAM)            ≈ 3~4 cycles         ≈ 64KB                                   HW
L2 cache (SRAM)            ≈ 10~30 cycles       ≈ 256KB                                  HW
L3 cache (SRAM)            ≈ 30~60 cycles       ≈ 2~8MB                                  HW
Main memory (DRAM)         ≈ 100~300 cycles     512~4GB (mobile) / 4~16GB (PC)           OS
Flash storage (SSD)        ≈ 5K~10K cycles      8~32GB (mobile) / 128~512GB (PC)         OS/Operator
Data storage (HDD)         ≈ 10M~20M cycles     > 1TB (PC)                               OS/Operator

Performance gap of ≈ 10^5 between main memory and HDD; a flash cache or SSD sits in between.
External secondary storage: external HDD, tape, CD/DVD, cloud server.

12 Recall: How to Alleviate the Memory Wall Problem
Hiding/reducing the memory access latency: a holistic approach with caches, local memory, DRAM stacking, HW/SW prefetching, data locality optimization, memory controller, SMT.
Increasing the bandwidth: latency helps BW, but not vice versa.
Reducing the number of memory accesses: keep as much reusable data in cache and local memory as possible.

13 ILP Wall
Duplicate hardware speculatively executes future instructions before the results of current instructions are known, while providing hardware safeguards to prevent the errors that might be caused by out-of-order execution.
  1. e = a + b
  2. f = c + d
  3. g = e * f
Branches must be guessed to decide what instructions to execute simultaneously. If you guess wrong, you throw away this part of the result.
Data dependencies may prevent successive instructions from executing in parallel, even if there are no branches.

14 Power Wall
Moore's Law: transistor density doubles every 18~24 months.
CMOS power = active power + standby power: $P_{total} = V^2 f C a + V I_{leakage}$
A drastic increase in leakage current and a decrease in noise margin prevent further voltage scaling at around 1 V.
Limitations in processor performance: Memory Wall, ILP Wall, Power Wall.
Not only battery, but also heat!
  Intel [...] consumed ~2 W; a 3.3 GHz Intel Core i7 consumes 130 W.
  Heat must be dissipated from a 1.5 x 1.5 cm chip.
  This is the limit of what can be cooled by air.
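
As a rough illustration of the power equation on this slide, the following C sketch evaluates active plus standby power. The function name, the parameter names, and the reading of `a` as a switching-activity factor are assumptions made for this example; the slide only gives the formula.

```c
#include <assert.h>

/* Total CMOS power, following the slide's formula:
 *   P_total = V^2 * f * C * a + V * I_leakage
 * v      : supply voltage in volts
 * f      : clock frequency in hertz
 * c      : switched capacitance in farads
 * a      : activity factor, 0..1 (assumed meaning of "a")
 * i_leak : leakage current in amperes
 * Returns power in watts. */
double cmos_total_power(double v, double f, double c, double a, double i_leak)
{
    assert(v >= 0.0 && f >= 0.0 && c >= 0.0 && a >= 0.0 && i_leak >= 0.0);
    double active = v * v * f * c * a;   /* dynamic (switching) power */
    double standby = v * i_leak;         /* static (leakage) power */
    return active + standby;
}
```

Because the active term is quadratic in V, doubling the voltage in this model quadruples the active power while only doubling the standby power.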

15 Power Wall
Power dissipation in clocked digital devices is proportional to the clock frequency, imposing a natural limit on clock rates.
A significant increase in clock speed without heroic (and expensive) cooling is not possible → chips would simply melt.
Clock speed increased by a factor of 1,000 during the last two decades. The ability of manufacturers to dissipate heat is limited, though.
Looking back at the last five years, clock rates are pretty much flat.
You could bank on Materials Science (MS) breakthroughs: the MS engineers have usually delivered, but can they keep doing it?

16 Pollack's Rule: Trade-offs
[Figure: improvement (X) vs. CMOS process technology (µm), showing area (lead/compaction), performance (lead/compaction), and the Pollack's Rule trend.]
Pollack's Rule: "performance increase due to µ-architecture advances is roughly proportional to [the] square root of [the] increase in complexity".
Implications (in the same technology): a new µ-arch consumes about 2-3x the die area of the last µ-arch, but provides x performance.
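
A worked reading of Pollack's Rule, under the simplifying assumption that performance scales exactly with the square root of die area (the rule itself only says "roughly"):

```latex
\[
\frac{\mathrm{Perf}_{new}}{\mathrm{Perf}_{old}}
  \approx \sqrt{\frac{\mathrm{Area}_{new}}{\mathrm{Area}_{old}}},
\qquad
\sqrt{2} \approx 1.41, \quad \sqrt{3} \approx 1.73
\]
```

Under that assumption, a core that is 2-3x larger would be expected to deliver only about 1.4-1.7x the performance, which is the trade-off that motivates spending the area on more cores instead.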

17 Multi-core
Put multiple CPUs on the same die.
Why is this better than multiple dies?
  Smaller, cheaper.
  Closer, so lower inter-processor latency.
  Can share an L2 cache (complicated).
  Less power.
Cost of multi-core: complexity; slower single-thread execution.

18 Creating Parallel Processing Programs
It is difficult to write SW that uses multiple processors to complete one task faster, and the problem gets worse as the number of processors increases.
The first reason is that you must get better performance and efficiency from a parallel processing program on a multiprocessor.
Think of the analogy of eight reporters trying to write a single story in hopes of doing the work eight times faster.

19 But (Fortunately)
With the rise of the Internet and rich multimedia applications, the need for handling independent tasks and huge data increased dramatically → Task Level Parallelism and Data Level Parallelism.
The user computing environment is changing to include many background tasks.
Multiprocessors can speed up these types of applications with the help of tighter integration of cores and multithreading.

20 Multi-core vs. Manycore
Multi-core: current trajectory
  Stay with the current fastest core design.
  Replicate every 18 months (2, 4, 8, ... etc.).
  Advantage: do not alienate serial workloads.
  Examples: AMD X2 (2 cores), Intel Core2 Quad (4 cores), AMD Barcelona (4 cores).
Manycore: converging in this direction
  Simplify cores (shorter pipelines, lower clock frequencies, in-order processing).
  Start at 100s of cores and replicate every 18 months.
  Advantage: easier verification, defect tolerance, highest compute/surface-area, best power efficiency.
  Examples: Cell SPE (8 cores), Nvidia G80 (128 cores), Intel Polaris (80 cores), Cisco/Tensilica Metro (188 cores).
Convergence: ultimately toward manycore
  Manycore, if we can figure out how to program it!
  Hedge: heterogeneous multi-core.

21 Manycore System: CPU or GPU
CPU: large cache and sophisticated flow control minimize latency for arbitrary memory access for serial processes.
GPU: simple flow control and limited cache, more transistors for computing in parallel. High arithmetic intensity hides memory latency.
[Figure: CPU vs. GPU die layout (control, ALUs, cache, DRAM). Source: NVIDIA]

22 How Small is Small

Core                Typical use               Area                  Power @ clock
Power5              server                    389 mm^2              120W@1900MHz
Intel Core2 sc      laptop                    130 mm^2              15W@1000MHz
ARM Cortex A8       automobiles               5 mm^2                0.8W@800MHz
Tensilica DP        cell phones / printers    0.8 mm^2              W@600MHz
Tensilica Xtensa    Cisco router              0.32 mm^2 for 3!      0.05W@600MHz

Each core operates at 1/3 to 1/10th the efficiency of the largest chip, but you can pack 100x more cores onto a chip and consume 1/20 the power.
[Figure: relative die sizes of Power5, Intel Core2, ARM, Tensilica DP, and Xtensa x 3.]

23 More Concurrency: Design for Low Power
Cubic power improvement with lower clock rate due to $V^2 F$.
Slower clock rates enable use of simpler cores.
Simpler cores use less area (lower leakage) and reduce cost.
Tailor the design to the application to reduce waste.
This is how iPhones and MP3 players are designed to maximize battery life and minimize cost.
[Figure: relative die sizes of Power5, Intel Core2, ARM, Tensilica DP, and Xtensa x 3.]
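
One way to see the cubic effect: if supply voltage can be lowered in proportion to clock frequency (an idealized assumption, not something the slide states), dynamic power V^2 F scales with the cube of the frequency scale factor. The sketch below compares one full-speed core against several slower cores; both function names are hypothetical.

```c
#include <assert.h>

/* Dynamic power of `cores` identical cores, each clocked at `freq_scale`
 * times the baseline frequency, relative to one baseline core.
 * Assumes voltage scales linearly with frequency, so per-core power
 * goes as freq_scale^3. */
double relative_dynamic_power(int cores, double freq_scale)
{
    assert(cores > 0 && freq_scale > 0.0);
    return cores * freq_scale * freq_scale * freq_scale;
}

/* Aggregate throughput relative to one baseline core, assuming the work
 * is perfectly parallel (no serial fraction). */
double relative_throughput(int cores, double freq_scale)
{
    assert(cores > 0 && freq_scale > 0.0);
    return cores * freq_scale;
}
```

With 4 cores at one quarter of the clock, this model gives the same aggregate throughput at 1/16 of the dynamic power, provided the workload actually parallelizes.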

24 Tension between Concurrency and Power Efficiency
Highly concurrent systems can be more power efficient: dynamic power is proportional to $V^2 f C$. Build systems with even higher concurrency?
However, many algorithms are unable to exploit massive concurrency yet. If higher concurrency cannot deliver faster time to solution, then the power efficiency benefit is wasted. So we should build fewer/faster processors?

25 Path to Power Efficiency: Reducing Waste in Computing
Examine the methodology of the low-power embedded computing market: optimized for low power, low cost, and high computational efficiency.
"Years of research in low-power embedded computing have shown only one design technique to reduce power: reduce waste." (Mark Horowitz, Stanford University & Rambus Inc.)
Sources of waste:
  Wasted transistors (surface area)
  Wasted computation (useless work/speculation/stalls)
  Wasted bandwidth (data movement)
  Designing for serial performance

26 What's Next?
[Figure: candidate chip organizations: all large core; mixed large and small core; many small cores; all small core; many floating-point cores; + 3D stacked memory. Different classes of chips: home, games/graphics, business, scientific.]
The question is not whether this will happen but whether we are ready.
Source: Jack Dongarra, Intl. Supercomputing Conf. (ISC)

27 Intel Single-chip Cloud Computer (Dec. 2009)

28 Parallelism - Introduction

29 Little's Law
Throughput (T) = Number-in-flight (N) / Latency (L)
Example: 4 floating-point registers, 8 cycles per floating-point op. Little's Law → 1/2 issue per cycle.
[Figure: issue, execution, and write-back (WB) timeline.]
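
The slide's example can be checked with a one-line helper. This is only a sketch; the function name is hypothetical.

```c
#include <assert.h>

/* Little's Law: throughput T = N / L, where N is the number of operations
 * in flight and L is the latency of each operation in cycles. */
double littles_law_throughput(double in_flight, double latency_cycles)
{
    assert(latency_cycles > 0.0);
    return in_flight / latency_cycles;
}
```

Calling `littles_law_throughput(4.0, 8.0)` reproduces the slide's figure of 1/2 issue per cycle: with only 4 registers to hold results in flight and 8 cycles per operation, at most one operation can start every other cycle.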

30 Basic Performance Quantities
Latency: every operation requires time to execute, i.e., instruction, memory, or network latency.
Bandwidth: number of (parallel) operations completed per cycle, i.e., number of FPUs, DRAM, network, etc.
Concurrency: total number of operations in flight.
Little's Law relates these three:
  Concurrency = Latency × Bandwidth
  - or -
  Effective Throughput = Expressed Concurrency / Latency
Concurrency must be filled with parallel operations.
Can't exceed peak throughput with superfluous concurrency: each channel has a maximum (limited) throughput.
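
The two forms of Little's Law above, plus the cap at peak throughput, can be sketched as follows. The function and parameter names are illustrative assumptions, and the hard `min` against the peak is a simplification of how a real channel saturates.

```c
#include <assert.h>

/* Concurrency (operations in flight) needed to sustain a given bandwidth:
 *   Concurrency = Latency x Bandwidth */
double required_concurrency(double latency_cycles, double bandwidth_ops_per_cycle)
{
    assert(latency_cycles >= 0.0 && bandwidth_ops_per_cycle >= 0.0);
    return latency_cycles * bandwidth_ops_per_cycle;
}

/* Effective throughput = expressed concurrency / latency, but never more
 * than the channel's peak throughput. */
double effective_throughput(double expressed_concurrency,
                            double latency_cycles,
                            double peak_ops_per_cycle)
{
    assert(latency_cycles > 0.0);
    double t = expressed_concurrency / latency_cycles;
    return (t < peak_ops_per_cycle) ? t : peak_ops_per_cycle;
}
```

For a memory system with 100-cycle latency and a peak of 2 operations per cycle, 200 operations must be in flight to reach the peak; expressing only 50 yields 0.5 operations per cycle, and expressing 400 still yields only 2.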

31 Performance Optimization: Contending Forces
Contending forces of device efficiency and usage/traffic:
  Improve throughput
  Reduce volume of data
  Restructure to satisfy Little's Law
  Implementation & algorithmic optimization
Often boils down to several key challenges:
  Management of data/task locality
  Management of data dependencies
  Management of communication
  Management of variable and dynamic parallelism

32 Classes of Parallelism and Parallel Architectures (1/2)
Basically two kinds of parallelism in applications:
  Data-level parallelism (DLP): there are many data items that can be operated on at the same time.
  Task-level parallelism (TLP): tasks of work are created that can operate independently and largely in parallel.
Source: Computer Architecture: A Quantitative Approach, 5th ed. (Hennessy & Patterson, Morgan Kaufmann, 2011)

33 Classes of Parallelism and Parallel Architectures (2/2)
Computer HW in turn can exploit these two kinds of application parallelism in four major ways:
  Instruction-level parallelism: exploits DLP at modest levels with compiler help using ideas like pipelining, and at medium levels using ideas like speculative execution.
  Vector architectures and GPUs: exploit DLP by applying a single instruction to a collection of data in parallel (SIMD).
  Thread-level parallelism: exploits either DLP or TLP in a tightly coupled hardware model that allows for interaction among parallel threads.
  Request-level parallelism: exploits parallelism among largely decoupled tasks specified by the programmer or the OS.
Source: Computer Architecture: A Quantitative Approach, 5th ed. (Hennessy & Patterson, Morgan Kaufmann, 2011)

34 Uses of Parallelism
Horizontal parallelism for throughput: more units working in parallel.
Vertical parallelism for latency hiding: pipelining keeps units busy when waiting for dependencies of resource, data, and control.
[Figure: units A, B, C, D working side by side (throughput) and staggered in a pipeline (latency).]

35 Program Execution Time
Latency metric: program execution time in seconds.
$CPU_{time} = \frac{Seconds}{Program} = \frac{Instructions}{Program} \times \frac{Cycles}{Instruction} \times \frac{Seconds}{Cycle} = IC \times CPI \times CCT$
Your system architecture can affect all of them:
  CPI (cycles per instruction): memory latency, IO latency, ...
  CCT (clock cycle time, the inverse of clock frequency): cache organization, power budget, ...
  IC (instruction count): OS overhead, compiler choice, ...
Independent?
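
The execution-time equation is simple enough to evaluate directly. A minimal sketch, with a hypothetical function name:

```c
#include <assert.h>

/* CPU time = IC x CPI x CCT
 * ic  : instruction count (instructions per program)
 * cpi : average cycles per instruction
 * cct : clock cycle time in seconds (1 / clock frequency)
 * Returns execution time in seconds. */
double cpu_time_seconds(double ic, double cpi, double cct)
{
    assert(ic >= 0.0 && cpi >= 0.0 && cct >= 0.0);
    return ic * cpi * cct;
}
```

For example, 10^9 instructions at a CPI of 2 on a 1 GHz clock (CCT = 1 ns) take about 2 seconds. The three factors are not independent in practice, which is the point of the slide's closing question: a deeper pipeline may shorten CCT while raising CPI.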

36 Architecture Methods for Performance Enhancement
Powerful instructions
  MD-technique: multiple data operands per operation. SIMD (vector, sub-word SIMD extension).
  MO-technique: multiple operations per instruction. Sophisticated ISA (e.g., CISC-like), VLIW.
Pipelining
Multiple instruction issue
  Single stream: superscalar.
  Multiple streams: multithreading, multi-core.

37 Powerful Instructions: MD Technique
MD-technique: multiple data operands per operation.
SIMD: Single Instruction, Multiple Data.
Vector instruction:
  for (i = 0; i < 64; i++) c[i] = a[i] + 5*b[i];
  or
  c = a + 5*b
Assembly:
  Set    vl,64
  Ldv    v1,0(r2)
  Mulvi  v2,v1,5
  Ldv    v1,0(r1)
  Addv   v3,v1,v2
  Stv    v3,0(r3)

38 Powerful Instructions: MD Technique
SIMD computing:
  All PEs (Processing Elements) execute the same operation.
  Typical mesh or hypercube connectivity.
  Exploits data locality of, e.g., image processing applications.
  Dense encoding (few instruction bits needed).
[Figure: SIMD execution method. Over time, PE1, PE2, ..., PEn all execute Instruction 1, then Instruction 2, Instruction 3, ..., Instruction n.]

39 Powerful Instructions: MD Technique
Sub-word parallelism:
  SIMD on a restricted scale, used for multimedia instructions.
  Short vectors added to existing ISAs for microprocessors.
Examples: Intel MMX/SSE/AVX, ARM NEON, AMD 3DNow!

40 Powerful Instructions: MO Technique
MO-technique: multiple operations per instruction. Two options:
  CISC (Complex Instruction Set Computer)
  VLIW (Very Long Instruction Word)
VLIW instruction example:

field         FU 1             FU 2              FU 3              FU 4            FU 5
instruction   sub r8, r5, 3    and r1, r5, 12    mul r6, r5, r2    ld r3, 0(r5)    bnez r5, 13

41 Parallelism - Data Level Parallelism

42 Recall: Flynn's Classification of Processor Architecture
SISD: Single Instruction, Single Data stream. Uniprocessor.
SIMD: Single Instruction, Multiple Data streams. The same instruction is executed by multiple processing units, e.g., multimedia processors, vector architectures.
MISD: Multiple Instruction, Single Data stream. Successive functional units operate on the same stream of data. Rarely found in general-purpose commercial designs.
MIMD: Multiple Instruction, Multiple Data streams. Each processor has its own instruction and data streams. Most popular form of parallel processing.
  Single-user: high performance for one application.
  Multiprogrammed: running many tasks simultaneously (e.g., servers).
[Figures: SISD and SIMD diagrams, each showing an instruction pool and a data pool feeding one or several processing units (PU).]

43 Data-level Parallelism
Data parallelism focuses on distributing the data across different parallel computing nodes. It is usually found in:
  Multimedia computing: identical ops on streams or arrays of sound samples, pixels, video frames.
  Scientific computing: weather forecasting, car-crash simulation, biological modeling.

44 DLP Kernels Dominate Many Computational Workloads

45 DLP and Throughput Computing
Source: Chuck Moore (AMD, 2011)

46 Data Parallelism & Loop Level Parallelism (LLP)
Data parallelism: similar independent/parallel computations on different elements of arrays, which usually result in independent (or parallel) loop iterations.
A common way to increase parallelism among instructions is to exploit data parallelism among independent iterations of a loop, i.e., exploit Loop Level Parallelism (LLP), by unrolling the loop either statically by the compiler or dynamically by hardware, which increases the size of the basic block present.
The resulting larger basic block provides more instructions that can be scheduled or re-ordered by the compiler/hardware to eliminate more stall cycles.
  for (i = 1; i <= 1000; i = i + 1) x[i] = x[i] + y[i];
The same loop as 4 vector instructions:
  LV     Load Vector X
  LV     Load Vector Y
  ADDV   Add Vector X, X, Y
  SV     Store Vector X
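
To make the unrolling step concrete, here is a hand-unrolled version of the slide's loop, written as a sketch. The unroll factor of 4, the 0-based indexing, and the function name are choices made for this example; a compiler would normally pick the factor itself.

```c
#include <stddef.h>

/* Computes x[i] = x[i] + y[i] for i in [0, n), unrolled by 4.
 * The four statements in the main loop body are independent of each other,
 * so the enlarged basic block gives the scheduler more freedom. */
void add_arrays_unrolled4(double *x, const double *y, size_t n)
{
    size_t i = 0;

    /* Main unrolled loop: handles groups of 4 elements. */
    for (; i + 4 <= n; i += 4) {
        x[i]     += y[i];
        x[i + 1] += y[i + 1];
        x[i + 2] += y[i + 2];
        x[i + 3] += y[i + 3];
    }

    /* Cleanup loop: handles the remaining 0..3 elements. */
    for (; i < n; i++) {
        x[i] += y[i];
    }
}
```

The cleanup loop is needed whenever the trip count is not a multiple of the unroll factor; the slide's 1000 iterations divide evenly by 4, so it would not execute there.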

47 Resurgence of DLP
Convergence of application demands and technology constraints drives architecture choice.
New applications, such as graphics, machine vision, speech recognition, machine learning, etc., all require large numerical computations that are often trivially data parallel.
SIMD-based architectures (vector-SIMD, subword-SIMD, SIMT/GPUs) are the most efficient way to execute these algorithms.

48 SIMD Classifications
Vector architectures
SIMD extensions (sub-word SIMD), e.g., Intel MMX: Multimedia Extensions (1996), SSE: Streaming SIMD Extensions (1999), AVX: Advanced Vector Extensions (2010)
Graphics Processing Units (GPUs)

49 Vector Architectures
Basic idea: read sets of data elements into vector registers, operate on those registers, and disperse the results back into memory.
Registers are controlled by the compiler:
  Register files act as compiler-controlled buffers.
  Used to hide memory latency.
  Leverage memory bandwidth.
Vector loads/stores are deeply pipelined: pay for memory latency once per vector ld/st!
Regular loads/stores: pay for memory latency for each vector element.
[Figure: SCALAR (1 operation): add r3, r1, r2 computes r3 = r1 + r2. VECTOR (N operations): vadd.vv v3, v1, v2 adds v1 and v2 element by element into v3 over the vector length.]

50 Vector Programming Model
Scalar registers: r0 ... r15.
Vector registers: v0 ... v15, each holding elements [0], [1], [2], ..., [VLRMAX-1].
Vector Length Register: VLR.
Vector arithmetic instructions: ADDV v3, v1, v2 operates on elements [0], [1], ..., [VLR-1] of v1 and v2 and writes v3.
Vector load & store instructions: LV v1, r1, r2 loads vector register v1 from memory, with base address r1 and stride r2.

51 Multiple Datapaths
Vector elements are interleaved across lanes. Example: V[0, 4, 8, ...] on Lane 1, V[1, 5, 9, ...] on Lane 2, etc.
Compute for multiple elements per cycle. Example: Lane 1 computes on V[0] and V[4] in one cycle.
Modular, scalable design.
No inter-lane communication needed for most vector instructions.

52 Vector Processors (I)
A vector is a one-dimensional array of numbers. Many scientific/commercial programs use vectors:
  for (i = 0; i <= 49; i++) C[i] = (A[i] + B[i]) / 2;
A vector processor is one whose instructions operate on vectors rather than scalar (single data) values.
Basic requirements:
  Need to load/store vectors → vector registers (contain vectors).
  Need to operate on vectors of different lengths → vector length register (VLEN).
  Elements of a vector might be stored apart from each other in memory → vector stride register (VSTR). Stride: distance between two elements of a vector.
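
The roles of the vector length and stride registers can be mimicked in plain C. The sketch below processes the slide's averaging loop in strips no longer than a maximum vector length, reading elements `stride` apart. `MAX_VECTOR_LENGTH`, its value of 16, and the function name are assumptions for illustration; the inner loop stands in for what a single vector instruction would do in hardware.

```c
#include <stddef.h>

/* Hypothetical hardware limit on elements per vector register. */
#define MAX_VECTOR_LENGTH 16

/* Computes c[k] = (a[k] + b[k]) / 2 for k = 0, stride, 2*stride, ...,
 * covering n elements in total. */
void vector_average_strided(const double *a, const double *b, double *c,
                            size_t n, size_t stride)
{
    size_t done = 0;

    while (done < n) {
        /* "Set the vector length register" for this strip. */
        size_t vlen = n - done;
        if (vlen > MAX_VECTOR_LENGTH) {
            vlen = MAX_VECTOR_LENGTH;
        }

        /* One strip: conceptually a strided vector load of a and b,
         * a vector add, a vector divide, and a strided vector store. */
        for (size_t k = 0; k < vlen; k++) {
            size_t idx = (done + k) * stride;
            c[idx] = (a[idx] + b[idx]) / 2.0;
        }

        done += vlen;
    }
}
```

With `stride` set to 1 and `n` set to 50 this matches the loop on the slide; a stride of 2 would touch every other element, as when walking one field of an interleaved array.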

53 Vector Processors (II)
A vector instruction performs an operation on each element in consecutive cycles:
  Vector functional units are pipelined.
  Each pipeline stage operates on a different data element.
Vector instructions allow deeper pipelines:
  No intra-vector dependencies → no hardware interlocking within a vector.
  No control flow within a vector.
  Known stride allows prefetching of vectors into cache/memory.

54 Vector Processor Pros
No dependencies within a vector: pipelining and parallelization work well; can have very deep pipelines, no dependencies!
Each instruction generates a lot of work: reduces instruction fetch bandwidth.
Highly regular memory access pattern: interleaving multiple banks for higher memory bandwidth; prefetching.
No need to explicitly code loops: fewer branches in the instruction sequence.

55 Vector Processor Cons
Still requires a traditional scalar unit (integer and FP) for the non-vector operations.
Difficult to maintain precise interrupts (can't roll back all the individual operations already completed).
Compiler or programmer has to vectorize programs.
Not very efficient for small vector sizes.
Not suitable/efficient for many different classes of applications.
Requires a specialized, high-bandwidth memory system, usually built around heavily banked memory with data interleaving.

56 Vector Processor Limitations
Performance of a vector instruction depends on the length of the operand vectors.
Initiation rate: the rate at which individual operations can start in a functional unit. For fully pipelined units this is one operation per cycle.
Start-up time (latency): the time it takes to produce the first element of the result. Depends on how deep the pipeline of the functional unit is. Especially large for the load/store unit.
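
A simple timing model shows why short vectors are inefficient: the start-up time is paid once per instruction regardless of length. The model below (start-up cycles, then a fixed number of element operations started per cycle) and its names are assumptions for illustration, not a description of any particular machine.

```c
#include <assert.h>

/* Cycles to finish one vector instruction under a simple model:
 * startup_cycles : pipeline fill latency before the first result
 * vector_length  : number of elements to process
 * ops_per_cycle  : initiation rate (1 for a single fully pipelined unit) */
unsigned long vector_instruction_cycles(unsigned long startup_cycles,
                                        unsigned long vector_length,
                                        unsigned long ops_per_cycle)
{
    assert(ops_per_cycle > 0);
    /* Round up: a partially filled last cycle still costs a cycle. */
    unsigned long issue_cycles =
        (vector_length + ops_per_cycle - 1) / ops_per_cycle;
    return startup_cycles + issue_cycles;
}
```

With a 12-cycle start-up and one operation per cycle, a 64-element vector takes 76 cycles (about 1.2 cycles per element), while a 4-element vector takes 16 cycles (4 cycles per element).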

57 Multimedia SIMD Extensions
Key ideas:
  Media applications operate on data types narrower than the native word size: video & graphics systems use 8 bits per primary color; audio samples use 8-16 bits.
  No memories associated with ALUs, but a pool of relatively wide (64 to 256 bits) registers that store several narrower operands. E.g., a 256-bit adder: 16 simultaneous operations on 16 bits, 32 simultaneous operations on 8 bits.
  No direct communication between ALUs, but via registers and with special shuffling/permutation instructions.
  Not co-processors or supercomputers, but tightly integrated into the CPU pipeline.

58 Multimedia SIMD Extensions
Meant for programmers to utilize, not for compilers to generate. Recent x86 compilers: capable for FP-intensive apps.
Why is it popular?
  Costs little to add to the standard arithmetic unit.
  Easy to implement.
  Needs smaller memory bandwidth than vector.
  Separate data transfers aligned in memory. Vector: single instruction, 64 memory accesses, page fault in the middle of the vector likely!
  Uses much smaller register space.
  Fewer operands.
  No need for the sophisticated mechanisms of vector architecture.

59 Multimedia Extensions (aka SIMD extensions)
[Figure: one 64b register viewed as 2 x 32b, 4 x 16b, or 8 x 8b.]
Very short vectors added to existing ISAs for microprocessors.
Use an existing wide-bit register split into small-bit registers: Lincoln Labs TX-2 from 1957 had a 36b datapath split into 2 x 18b or 4 x 9b.
Newer designs have wider registers: 128b for PowerPC Altivec and Intel SSE2/3/4; 256b for Intel AVX.
A single instruction operates on all elements within a register.
[Figure: 4 x 16b adds, combining two registers of four 16b elements into four 16b results.]
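
The "4 x 16b adds" idea can be emulated in portable C on a 64-bit integer. This is a software sketch of what a sub-word SIMD add does in hardware, using a well-known masking trick to keep carries from crossing lane boundaries; it is not how the slides' ISAs implement it, and the function name is hypothetical.

```c
#include <stdint.h>

/* Adds four independent 16-bit lanes packed into 64-bit words.
 * Each lane wraps around modulo 2^16, like a non-saturating SIMD add. */
uint64_t add4x16(uint64_t a, uint64_t b)
{
    /* Top bit of each 16-bit lane. */
    const uint64_t high = 0x8000800080008000ULL;

    /* Add the low 15 bits of every lane. The largest possible lane sum,
     * 0x7FFF + 0x7FFF, still fits in 16 bits, so nothing spills over. */
    uint64_t low_sum = (a & ~high) + (b & ~high);

    /* Fold in the top bits with XOR, which adds them without producing
     * a carry out of the lane. */
    return low_sum ^ ((a ^ b) & high);
}
```

A plain 64-bit `a + b` would give the wrong answer whenever a lane overflows, because the carry would leak into the neighbouring lane; hardware SIMD units cut the carry chain at lane boundaries instead.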

60 SIMD Multimedia Extensions like SSE-4
At the core of multimedia extensions:
  SIMD parallelism
  Variable-sized data fields: vector length = register width / type size
[Figure: registers V0, V1, V2, ... feeding a WIDE UNIT, with each register holding sixteen 8-bit operands, eight 16-bit operands, or four 32-bit operands.]

61 Multimedia Extensions versus Vectors
Limited instruction set:
  No vector length control.
  No strided load/store or scatter/gather.
  Unit-stride loads must be aligned to a 64/128-bit boundary.
Limited vector register length:
  Requires superscalar dispatch to keep multiply/add/load units busy.
  Loop unrolling to hide latencies increases register pressure.
Trend towards fuller vector support in microprocessors:
  Better support for misaligned memory accesses.
  Support of double-precision (64-bit floating-point).
  New Intel AVX spec (announced April 2008), 256b vector registers (expandable up to 1024b).

62 Parallelism - Instruction Level Parallelism

63 ILP?
Instruction-level parallelism (ILP) is a measure of how many of the operations in a computer program can be performed simultaneously. The potential overlap among instructions is called instruction-level parallelism.
There are two approaches to instruction-level parallelism:
  Dynamic approach, where mainly hardware locates the parallelism → superscalar.
  Static approach, which largely relies on software to locate parallelism → VLIW (Very Long Instruction Word).
How much ILP exists in programs is very application specific. In certain fields, such as graphics and scientific computing, the amount can be very large. However, workloads such as cryptography exhibit much less parallelism.

64 ILP vs. PLP
ILP (Instruction-Level Parallelism): overlap individual machine operations (add, mul, load, ...) so that they execute in parallel.
PLP (Processor-Level Parallelism): having separate processors getting separate chunks of the program (processors programmed to do so).

65 Micro-architectural Techniques for ILP
Instruction pipelining.
Superscalar or VLIW: multiple execution units are used to execute multiple instructions in parallel.
Out-of-order execution: note that this technique is independent of both pipelining and superscalar. Register renaming is used to enable out-of-order execution.
Speculative execution: execution of complete instructions or parts of instructions before being certain whether this execution should take place. Branch prediction is used with speculative execution.

66 Micro-architectural Techniques for ILP
Modern processor techniques:
  Deep pipelines
  Superscalar issue
  Out-of-order, speculative execution
  Branch prediction
  Register renaming, dataflow order
Execution flow:
  In-order, speculative fetch
  Out-of-order execute
  In-order commit, using a reorder buffer for precise exceptions
[Figure: in order: branch prediction, I-Cache, fetch unit, instruction (fetch) buffer, decode/rename, dispatch. Out of order: reservation stations feeding Int, Int, Float, Float, L/S, L/S units. In order: reorder buffer, retire, write buffer, D-Cache.]

67 ILP (Parallel Instruction Execution) Constraints
ILP constraints:
  Structural dependence (resource contention)
  Code dependences (sequential semantics of the program)
    Control dependences
    Data dependences
      True dependences (RAW)
      Storage conflicts (not in in-order processors): anti-dependences (WAR), output dependences (WAW)

68 Types of Dependencies
Structural dependence (structural hazard): HW perspective.
Code dependence: SW (program) perspective.
  Data dependence (data hazard)
    True data dependence
    Name dependencies: output dependence, anti-dependence
  Control dependence (control hazard)
Note: H/W terminology is hazards; S/W terminology is dependencies.

69 Visualizing Pipelining
[Figure: pipeline diagram with time (clock cycles 1-7) across and instruction order down; each instruction passes through Ifetch, Reg, ALU, DMem, Reg, starting one cycle after the previous one.]

70 Pipelining
Overlaps execution of instructions by exploiting Instruction Level Parallelism. Pipelining became a universal technique in 1985.
Recall that CPU time (latency):
$\frac{Seconds}{Program} = \frac{Instructions}{Program} \times \frac{Cycles}{Instruction} \times \frac{Seconds}{Cycle} = IC \times CPI \times CCT$
Performance enhancement:
  Reduce the number of instructions per program (IC): given the ISA, it fully depends on SW (compiler, programmer).
  Reduce the number of cycles per instruction (CPI) and the number of seconds per cycle (CCT): mostly depends on HW organization & implementation technology under system requirements.
Pipelining can reduce CCT & (effective) CPI.

71 Pipelining is not quite that easy!
Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle.
  Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away).
  Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (missing sock).
  Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).
Note: H/W terminology is hazards; S/W terminology is dependencies.

72 Structural Hazards
A structural hazard arises when two or more different instructions want to use the same hardware resource in the same cycle, e.g., MEM uses the same memory port as IF, as shown in this slide.
[Figure: pipeline diagram over clock cycles 1-7 for Load, Instr 1, Instr 2, Instr 3, Instr 4 in instruction order, each passing through Ifetch, Reg, ALU, DMem, Reg.]

73 Structural Hazards
Structural hazards are reduced with these rules:
  Each instruction uses a resource at most once.
  Always use the resource in the same pipeline stage.
  Use the resource for one cycle only.
ISAs are designed with this in mind. Sometimes it is very complex to do this; it heavily depends on programs and hardware resources.
Some common structural hazards:
  Memory access conflict.
  Floating point: since many floating-point instructions require many cycles, it's easy for them to interfere with each other.
  Starting up more of one type of instruction than there are resources.

74 Data Hazards
Pipeline stages: IF, ID/RF, EX, MEM, WB. Instructions in order:
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11
[Figure: pipeline diagram over time (clock cycles), each instruction passing through Ifetch, Reg, ALU, DMem, Reg.]
The use of the result of the ADD instruction in the next three instructions causes a hazard, since the register is not written until after those instructions read it.

75 Data Hazards
Read After Write (RAW): caused by a dependence, a need for communication. Instr-J tries to read an operand before Instr-I writes it.
  I: add r1, r2, r3
  J: sub r4, r1, 43
Happens in concurrent execution or OoO:
Write After Write (WAW): caused by an output dependence and the re-use of the name r1. Instr-J tries to write operand (r1) before Instr-I writes it.
  I: sub r1, r4, r3
  J: add r1, r2, r3
  K: mul r6, r1, r7
Write After Read (WAR): caused by an anti-dependence and the re-use of the name r1. Instr-J tries to write operand (r1) before Instr-I reads it.
  I: add r4, r1, r3
  J: add r1, r2, r3
  K: mul r6, r1, r7
Solutions for data hazards:
  Stalling
  Forwarding: connect the new value directly to the next stage
  Speculation (w/ HW) or reordering (w/ compiler and/or HW)
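
The three hazard definitions reduce to comparisons between the destination and source registers of an earlier instruction I and a later instruction J. The sketch below encodes that check; the `Instr` struct, the flag names, and the use of -1 for "no register" (e.g., an immediate operand) are assumptions made for this example.

```c
/* Register numbers of one instruction; -1 means "no register". */
typedef struct {
    int dest;
    int src1;
    int src2;
} Instr;

/* Bit flags so that several hazards can be reported at once. */
enum {
    HAZ_NONE = 0,
    HAZ_RAW  = 1,  /* J reads what I writes  (true dependence)   */
    HAZ_WAW  = 2,  /* J writes what I writes (output dependence) */
    HAZ_WAR  = 4   /* J writes what I reads  (anti-dependence)   */
};

/* Returns the set of potential data hazards between an earlier
 * instruction i and a later instruction j. */
int data_hazards(Instr i, Instr j)
{
    int hazards = HAZ_NONE;

    if (i.dest >= 0 && (j.src1 == i.dest || j.src2 == i.dest)) {
        hazards |= HAZ_RAW;
    }
    if (i.dest >= 0 && j.dest == i.dest) {
        hazards |= HAZ_WAW;
    }
    if (j.dest >= 0 && (i.src1 == j.dest || i.src2 == j.dest)) {
        hazards |= HAZ_WAR;
    }
    return hazards;
}
```

Applied to the I/J pairs on the slide, this reports RAW for `add r1, r2, r3` followed by `sub r4, r1, 43`, WAW for `sub r1, r4, r3` followed by `add r1, r2, r3`, and WAR for `add r4, r1, r3` followed by `add r1, r2, r3`. Whether a potential hazard becomes a real one depends on the pipeline: WAW and WAR only matter when instructions can complete out of order.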

76 Control Hazards
A control hazard is when we need to find the destination of a branch, and can't fetch any new instructions until we know that destination.
  10: beq r1,r3,36
  14: and r2,r3,r5
  18: or  r6,r1,r7
  22: add r8,r1,r9
  36: xor r10,r1,r11
[Figure: pipeline diagram, each instruction passing through Ifetch, Reg, ALU, DMem, Reg.]

77 Five Branch Hazard Alternatives
#1: Stall until the branch direction is clear.
#2: Predict branch not taken.
  Execute successor instructions in sequence.
  Squash instructions in the pipeline if the branch is actually taken.
  Advantage of late pipeline state update.
  47% of MIPS branches are not taken on average.
  PC+4 is already calculated, so use it to get the next instruction.
#3: Predict branch taken.
  53% of MIPS branches are taken on average.
  But the branch target address has not been calculated yet in MIPS, so MIPS still incurs a 1-cycle branch penalty.
  Other machines: branch target known before outcome.
#4: Execute both paths.
#5: Delayed branch. Define the branch to take place AFTER a following instruction:
  branch instruction
  sequential successor 1
  sequential successor 2
  ...
  sequential successor n
  branch target if taken
  A 1-slot delay allows proper decision and branch target address in a 5-stage pipeline.
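
To compare alternatives #1 and #2 numerically, a common back-of-the-envelope model adds the expected branch stall cycles to the base CPI. The model and the function name are assumptions for this sketch; in particular, the branch fraction used in the example (20% of instructions) is illustrative and does not come from the slides.

```c
#include <assert.h>

/* Effective CPI with branch stalls:
 *   CPI = base_cpi + branch_fraction * penalized_fraction * penalty_cycles
 * branch_fraction    : fraction of instructions that are branches
 * penalized_fraction : fraction of branches that pay the penalty
 *                      (1.0 when always stalling; the taken fraction
 *                      when predicting not taken)
 * penalty_cycles     : stall cycles per penalized branch */
double cpi_with_branch_penalty(double base_cpi,
                               double branch_fraction,
                               double penalized_fraction,
                               double penalty_cycles)
{
    assert(branch_fraction >= 0.0 && branch_fraction <= 1.0);
    assert(penalized_fraction >= 0.0 && penalized_fraction <= 1.0);
    return base_cpi + branch_fraction * penalized_fraction * penalty_cycles;
}
```

With a base CPI of 1, a 1-cycle penalty, and the assumed 20% branch fraction, always stalling gives a CPI of about 1.2, while predict-not-taken, which only pays when the branch is taken (53% per the slide), gives about 1.11.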

78 Pipelining
Pipelined design: one stage per cycle; overlap instructions. Cost: pipeline registers.
Stages:
  Instruction fetch (PC, I-Cache)
  Decode & read operands (decoder, register file)
  Execute (ALU)
  Memory access (D-Cache)
  Writeback
To reduce stalls:
  Forwarding paths for data dependencies.
  Predict-not-taken branches for control dependencies.
  Instruction & data caches to reduce memory stalls.

79 Pipelining and ILP
Higher clock frequency (lower CCT): deeper pipelines. Decompose pipeline stages into smaller stages; overlap more instructions.
Lower CPI_base: wider pipelines. Insert multiple instructions in parallel in the pipeline.
Lower CPI_stall: diversified pipelines for different functional units; out-of-order execution.
Balance conflicting goals: deeper & wider pipelines ⇒ more control hazards; branch prediction (speculation).

80 Deep Pipelining
Idea: break up the instruction into N stages. Ideal CCT = 1/N compared to non-pipelined. So let's use a large N.
[Figure: pipeline stages Fetch 1, Fetch 2, Decode, Read Registers, ALU, Memory 1, Memory 2, Write Registers.]
Other motivations for deep pipelines:
  Not all basic operations have the same latency: integer ALU, FP ALU, cache access.
  Difficult to fit them in one pipeline stage: CCT must be large enough to fit the longest one.
  Break some of them into multiple pipeline stages, e.g., data cache access in 2 stages, FP add in 2 stages, FP mul in 3 stages.

81 Limits to Pipeline Depth
Each pipeline stage introduces some overhead (O)
  Delay of pipeline registers
  Inequalities in work per stage
    Cannot break up work into stages at arbitrary points
  Clock skew
    Clocks to different registers may not be perfectly aligned
(Figure: a cycle of length T split into N stages, each of length T/N plus overhead O)
If original CCT was T, with N stages CCT is T/N + O
  As N → ∞, speedup = T / (T/N + O) → T/O
  Assuming that IC and CPI stay constant
Eventually overhead dominates and leads to diminishing returns
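
The diminishing returns can be seen by evaluating the speedup formula for growing N. This is a minimal sketch: the values of T and O and the function names are hypothetical, chosen only to illustrate the T/O ceiling.

```python
def pipelined_cct(t, n, overhead):
    """Clock cycle time with n stages: T/N + O."""
    return t / n + overhead


def speedup(t, n, overhead):
    """Speedup over the unpipelined design, assuming IC and CPI stay constant."""
    return t / pipelined_cct(t, n, overhead)


T = 10.0  # hypothetical unpipelined cycle time (ns)
O = 0.5   # hypothetical per-stage overhead (ns)

for n in (2, 5, 10, 20, 50, 1000):
    print(f"N={n:5d}  CCT={pipelined_cct(T, n, O):6.3f} ns  speedup={speedup(T, n, O):6.2f}")
print(f"upper bound T/O = {T / O:.1f}")
```

With these assumed numbers the speedup approaches, but never reaches, T/O = 20, however many stages are added.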

82 Pipelining Limits
(Figure: Pentium 3 vs. Pentium 4) [Grochowski, Intel, 1997]
High clock frequency, but modest performance gains
  Due to memory latency and branch delays
Power consumption increases dangerously!

83 Wide or Superscalar Pipelines
Pipeline stages (figure): Fetch 1, Decode, Read Registers, ALU, Memory, Write Registers
Idea: operate on N instructions each cycle
  Parallelism at the instruction level
  CPI base = 1/N
Options (from simpler to harder)
  One integer and one floating-point instruction
  Any N=2 instructions
  Any N=4 instructions
  Any N=? instructions
    What are the limits here?

84 Diversified Pipelines
Pipeline stages (figure): Fetch 1, Decode, Read Registers, then separate Int Add, Int Mult, FPU and Memory pipelines of different depths, then Write Registers
Idea: decouple the execution portion of the pipeline for different instructions
  Separate pipelines for simple integer, integer multiply, FP, load/store
Advantage: avoids unnecessary stalls
  e.g. slow FP instruction does not block independent integer instructions
Disadvantages
  WAW hazards
  Imprecise (out-of-order) exceptions
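
The WAW hazard arises because pipelines of different depths let a later instruction write its result before an earlier one. The sketch below assumes one instruction issues per cycle in program order and uses hypothetical unit latencies; `waw_hazards` and the register names are illustrative, not taken from the slides.

```python
# Assumption: execution latency (cycles) of each diversified pipeline.
LATENCY = {"int_add": 1, "int_mult": 2, "fpu": 4, "memory": 3}


def waw_hazards(program, latency):
    """program: list of (dest_register, unit); instruction i issues in cycle i.

    Returns pairs (i, j), i < j, where the later instruction j writes the same
    register no later than the earlier instruction i, so the older value
    would end up as the final register content.
    """
    found = []
    for i, (dest_i, unit_i) in enumerate(program):
        finish_i = i + latency[unit_i]
        for j in range(i + 1, len(program)):
            dest_j, unit_j = program[j]
            finish_j = j + latency[unit_j]
            if dest_j == dest_i and finish_j <= finish_i:
                found.append((i, j))
    return found


program = [
    ("r1", "fpu"),      # issues cycle 0, writes r1 in cycle 4
    ("r2", "int_add"),  # issues cycle 1, writes r2 in cycle 2
    ("r1", "int_add"),  # issues cycle 2, writes r1 in cycle 3 (before instruction 0)
]
print(waw_hazards(program, LATENCY))
```

Here the third instruction finishes before the first, so without a stall or register renaming the stale FP result would overwrite the newer value of r1.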

85 ILP Architectures
Computer Architecture: a contract (instruction format and the interpretation of the bits that constitute an instruction) between the class of programs that are written for the architecture and the set of processor implementations of that architecture
In ILP Architectures: + information embedded in the program pertaining to available parallelism between instructions and operations in the program

86 Sequential Architecture and Superscalar Processors
Program contains no explicit information regarding dependencies that exist between instructions
Dependencies between instructions must be determined by the hardware
  It is only necessary to determine dependencies with sequentially preceding instructions that have been issued but not yet completed
Compiler may re-order instructions to facilitate the hardware's task of extracting parallelism
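
The check the hardware performs can be sketched as a comparison of register names between a candidate instruction and each instruction that has issued but not yet completed. This is a simplified software model, not a description of any real issue logic; the `Instr` record, `hazards`, and `can_issue` are hypothetical names, and real designs typically remove WAR and WAW hazards by register renaming.

```python
from collections import namedtuple

# Hypothetical instruction record: text, destination register, source registers.
Instr = namedtuple("Instr", ["text", "dest", "srcs"])


def hazards(earlier, later):
    """Return the set of data hazards that 'later' has with respect to 'earlier'."""
    found = set()
    if earlier.dest is not None and earlier.dest in later.srcs:
        found.add("RAW")  # later reads what earlier writes
    if later.dest is not None and later.dest in earlier.srcs:
        found.add("WAR")  # later overwrites what earlier still reads
    if earlier.dest is not None and earlier.dest == later.dest:
        found.add("WAW")  # both write the same register
    return found


def can_issue(candidate, in_flight):
    """Issue only if the candidate has no hazard with any issued-but-incomplete instruction."""
    return all(not hazards(prev, candidate) for prev in in_flight)


i1 = Instr("add r1, r2, r3", "r1", ("r2", "r3"))
i2 = Instr("sub r4, r1, r5", "r4", ("r1", "r5"))
i3 = Instr("or  r6, r7, r8", "r6", ("r7", "r8"))

print(hazards(i1, i2))      # i2 reads r1, which i1 writes
print(can_issue(i2, [i1]))  # False: must wait for i1
print(can_issue(i3, [i1]))  # True: independent of i1
```

Only in-flight instructions are passed to `can_issue`, which mirrors the slide's point that completed instructions no longer need to be checked.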

87 Scalar, Superscalar, Deep pipeline
Scalar Processor: one instruction passes through in each cycle
Superscalar Processor
  More than one instruction passes through in each cycle
  For an m-way superscalar, the effective CPI is 1/m of the scalar pipeline's CPI
(Figure: 3-way pipelined superscalar)

88 Superscalar Performance
Performance spectrum?
  What if all instructions were dependent?
    Speedup gain = 0 (a speedup factor of 1); superscalar buys us nothing
  What if all instructions were independent?
    Speedup = N, where N = superscalarity
Again, the key is typical program behavior
  Some parallelism exists
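
The two ends of this spectrum can be reproduced with a toy issue model. The sketch assumes unit latency for every instruction and an idealized machine that issues any ready instruction, up to the issue width, each cycle; `cycles_needed` is a hypothetical helper, not part of the course material.

```python
def cycles_needed(deps, width):
    """deps[i] is the set of earlier instruction indices that instruction i depends on.

    Each cycle, up to 'width' instructions whose producers finished in an
    earlier cycle are issued; every instruction takes one cycle.
    """
    done_cycle = {}
    remaining = list(range(len(deps)))
    cycle = 0
    while remaining:
        cycle += 1
        issued = []
        for i in remaining:
            if len(issued) == width:
                break
            if all(d in done_cycle and done_cycle[d] < cycle for d in deps[i]):
                issued.append(i)
        for i in issued:
            done_cycle[i] = cycle
        remaining = [i for i in remaining if i not in done_cycle]
    return cycle


K = 8
chain = [set()] + [{i - 1} for i in range(1, K)]  # every instruction depends on the previous one
independent = [set() for _ in range(K)]           # no dependencies at all

for name, deps in (("fully dependent", chain), ("fully independent", independent)):
    scalar = cycles_needed(deps, 1)
    wide = cycles_needed(deps, 4)
    print(f"{name}: scalar={scalar} cycles, 4-wide={wide} cycles, ratio={scalar / wide:.1f}")
```

A fully dependent chain gains nothing from the extra width (a ratio of 1.0 against the scalar machine), while the fully independent stream reaches the full factor of 4; real programs fall in between.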

89 Simplified View of an OoO Superscalar Processor
(Figure: block diagram, top to bottom)
In-Order Issue: I-Cache, Branch Prediction, Fetch Unit, Instruction (fetch) buffer, Decode / Rename, Dispatch; the issue width is marked on this path
  Read registers or assign register tag
  Advance instructions to reservation stations
Out-of-Order Execution: reservation stations feeding Int, Int, Float, Float, L/S, L/S units
  Monitor register tag
  Receive data being forwarded
  Issue when all operands ready
In-Order Commit: Reorder Buffer, Write buffer, Retire, D-Cache

90 Independence Architecture and VLIW Processors
By knowing which operations are independent, the hardware needs no further checking to determine which instructions can be issued in the same cycle
The set of independent operations >> the set of dependent operations
  Only a subset of independent operations are specified
The compiler may additionally specify on which functional unit and in which cycle an operation is executed
  The hardware needs to make no run-time decisions

91 VLIW Processors
Operation vs. Instruction
  Operation: a unit of computation (add, load, branch = instruction in sequential arch.)
  Instruction: set of operations that are intended to be issued simultaneously
Compiler decides which operation goes into each instruction (scheduling)
All operations that are supposed to begin at the same time are packaged into a single VLIW instruction
(Figure: two VLIW instructions, each with one IF and one ID stage followed by three parallel EX, M, WB operation pipelines)

92 VLIW: Very Long Instruction Word
Instruction word (figure): Int Op 1 | Int Op 2 | Mem Op 1 | Mem Op 2 | FP Op 1 | FP Op 2
  Two Integer Units, Single Cycle Latency
  Two Load/Store Units, Three Cycle Latency
  Two Floating-Point Units, Four Cycle Latency
Compiler schedules parallel execution
Multiple parallel operations packed into one long instruction word
Compiler must avoid data hazards (no interlocks)
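
Because the hardware has no interlocks, the compiler must place each operation in a cycle where its inputs are already available and a slot of the right kind is free. The following greedy scheduler is a minimal sketch of that idea for the machine on this slide (unit counts and latencies from the slide); the function name, the operation names, and the restriction to true data dependences are assumptions, and production VLIW compilers use considerably more elaborate algorithms.

```python
LATENCY = {"int": 1, "mem": 3, "fp": 4}  # cycles per unit kind, from the slide
SLOTS = {"int": 2, "mem": 2, "fp": 2}    # units (slots) per instruction word, from the slide


def schedule(ops):
    """ops: list of (name, kind, deps) in program order; deps name earlier ops.

    Returns {name: issue_cycle}, with cycle 0 being the first instruction word.
    """
    issue = {}
    kind_of = {}
    used = {}  # (cycle, kind) -> number of slots already filled
    for name, kind, deps in ops:
        # Earliest cycle in which every producer's result is available.
        cycle = max((issue[d] + LATENCY[kind_of[d]] for d in deps), default=0)
        # Move later until the instruction word has a free slot of this kind.
        while used.get((cycle, kind), 0) >= SLOTS[kind]:
            cycle += 1
        used[(cycle, kind)] = used.get((cycle, kind), 0) + 1
        issue[name] = cycle
        kind_of[name] = kind
    return issue


ops = [
    ("ld_a", "mem", []),
    ("ld_b", "mem", []),
    ("ld_c", "mem", []),               # third load: both mem slots of word 0 are taken
    ("mul", "fp", ["ld_a", "ld_b"]),
    ("add", "fp", ["mul", "ld_c"]),
    ("inc", "int", []),
    ("st", "mem", ["add"]),
]
for name, cycle in schedule(ops).items():
    print(f"cycle {cycle:2d}: {name}")
```

Cycles in which no operation is ready still need an instruction word, filled with no-ops, which is one source of the code-size growth usually attributed to VLIW.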

93 VLIW Strengths
In hardware it is very simple: a collection of function units (adders, multipliers, branch units, etc.) connected by a bus, plus some registers and caches
More silicon goes to the actual processing (rather than being spent on branch prediction, for example)
It should run fast, as the only limit is the latency of the function units themselves
Programming a VLIW chip is very much like writing microcode

94 VLIW Limitations
The need for a powerful compiler
Increased code size arising from aggressive scheduling policies
Larger memory bandwidth and register-file bandwidth
Limitations due to binary compatibility across implementations

95 VLIW past & future
Decline of VLIWs for general purpose systems:
  Couldn't be integrated in a single chip
  Binary compatibility between implementations
Rediscovery of VLIW in embedded systems
  No more integrability issues
  Binary incompatibility not relevant (for DSP, not CPU)
Advantages of VLIW:
  Simplified hardware
  Architecture can be optimized ad hoc to achieve ILP

96 Summary: Superscalar vs. VLIW
Additional info required in the program
  Superscalar: None
  VLIW: Minimally, a partial list of independences; a complete specification of when and where each operation is to be executed
Dependences analysis
  Superscalar: Performed by HW
  VLIW: Performed by compiler
Independences analysis
  Superscalar: Performed by HW
  VLIW: Performed by compiler
Scheduling
  Superscalar: Performed by HW
  VLIW: Performed by compiler
Role of compiler
  Superscalar: Rearranges the code to make the analysis and scheduling HW more successful
  VLIW: Replaces virtually all the analysis and scheduling HW

97 ILP Open Problems
Pipelined scheduling:
  Optimized scheduling of pipelined behavioral descriptions
  Two simple types of pipelining (structural and functional)
Controller cost:
  Most scheduling algorithms do not consider the controller cost, which depends directly on the controller style used during scheduling
Area constraints:
  The resource-constrained algorithms could have better interaction between scheduling and floorplanning
Realism:
  Scheduling realistic design descriptions that contain several special language constructs
  Using more realistic libraries and cost functions
  Scheduling algorithms must also be expanded to incorporate different target architectures

98 Summary: Limits to ILP
Doubling issue rates above today's 3-6 instructions per clock probably requires processor to:
  Issue 3-4 data-memory accesses per cycle,
  Resolve 2-3 branches per cycle,
  Rename and access over 20 registers per cycle, and
  Fetch instructions per cycle.
Complexity of implementing these capabilities is likely to mean sacrifices in maximum clock rate
  Widest-issue processor tends to be slowest in terms of clock rate
  Also consider ROI in terms of area and power

99 Summary: Limits to ILP (cont'd)
Most ways to increase performance also boost power consumption
Key question is energy efficiency: does a method increase power consumption faster than it boosts performance?
Multiple-issue techniques are energy inefficient:
  Incur logic overhead that grows faster than issue rate
  Growing gap between peak issue rates and sustained performance
Number of transistors switching = f(peak issue rate); performance = f(sustained rate); growing gap between peak and sustained performance → increasing energy per unit of performance

100 Evolved Solution or Alternatives
MT (Multithreaded) approach
  More tightly coupled than MP
  Decentralized multithreaded architectures
    Hardware for inter-thread synchronization and communication
    Multiscalar (U of Wisconsin), Superthreading (U of Minnesota)
  Centralized multithreaded architectures
    Share pipelines among multiple threads
    TERA, SMT (throughput-oriented)
    Trace Processor, DMT (performance-oriented)
MP (Multiprocessor) approach
  Decentralize all resources
  Multiprocessing on a single chip
    Communicate through shared-memory: Stanford Hydra
    Communicate through messages: MIT RAW


CIS 121 Data Structures and Algorithms with Java Spring Stacks and Queues Monday, February 12 / Tuesday, February 13 CIS Data Structures ad Algorithms with Java Sprig 08 Stacks ad Queues Moday, February / Tuesday, February Learig Goals Durig this lab, you will: Review stacks ad queues. Lear amortized ruig time aalysis

More information

n Explore virtualization concepts n Become familiar with cloud concepts

n Explore virtualization concepts n Become familiar with cloud concepts Chapter Objectives Explore virtualizatio cocepts Become familiar with cloud cocepts Chapter #15: Architecture ad Desig 2 Hypervisor Virtualizatio ad cloud services are becomig commo eterprise tools to

More information

CMSC Computer Architecture Lecture 3: ISA and Introduction to Microarchitecture. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 3: ISA and Introduction to Microarchitecture. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 3: ISA ad Itroductio to Microarchitecture Prof. Yajig Li Uiversity of Chicago Lecture Outlie ISA uarch (hardware implemetatio of a ISA) Logic desig basics Sigle-cycle

More information

Python Programming: An Introduction to Computer Science

Python Programming: An Introduction to Computer Science Pytho Programmig: A Itroductio to Computer Sciece Chapter 6 Defiig Fuctios Pytho Programmig, 2/e 1 Objectives To uderstad why programmers divide programs up ito sets of cooperatig fuctios. To be able to

More information

Structuring Redundancy for Fault Tolerance. CSE 598D: Fault Tolerant Software

Structuring Redundancy for Fault Tolerance. CSE 598D: Fault Tolerant Software Structurig Redudacy for Fault Tolerace CSE 598D: Fault Tolerat Software What do we wat to achieve? Versios Damage Assessmet Versio 1 Error Detectio Iputs Versio 2 Voter Outputs State Restoratio Cotiued

More information

IMP: Superposer Integrated Morphometrics Package Superposition Tool

IMP: Superposer Integrated Morphometrics Package Superposition Tool IMP: Superposer Itegrated Morphometrics Package Superpositio Tool Programmig by: David Lieber ( 03) Caisius College 200 Mai St. Buffalo, NY 4208 Cocept by: H. David Sheets, Dept. of Physics, Caisius College

More information

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe

Copyright 2016 Ramez Elmasri and Shamkant B. Navathe Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe CHAPTER 22 Database Recovery Techiques Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe Itroductio Recovery algorithms Recovery cocepts Write-ahead

More information

Cache-Optimal Methods for Bit-Reversals

Cache-Optimal Methods for Bit-Reversals Proceedigs of the ACM/IEEE Supercomputig Coferece, November 1999, Portlad, Orego, U.S.A. Cache-Optimal Methods for Bit-Reversals Zhao Zhag ad Xiaodog Zhag Departmet of Computer Sciece College of William

More information

Switching Hardware. Spring 2018 CS 438 Staff, University of Illinois 1

Switching Hardware. Spring 2018 CS 438 Staff, University of Illinois 1 Switchig Hardware Sprig 208 CS 438 Staff, Uiversity of Illiois Where are we? Uderstad Differet ways to move through a etwork (forwardig) Read sigs at each switch (datagram) Follow a kow path (virtual circuit)

More information

What are Information Systems?

What are Information Systems? Iformatio Systems Cocepts What are Iformatio Systems? Roma Kotchakov Birkbeck, Uiversity of Lodo Based o Chapter 1 of Beett, McRobb ad Farmer: Object Orieted Systems Aalysis ad Desig Usig UML, (4th Editio),

More information

Data diverse software fault tolerance techniques

Data diverse software fault tolerance techniques Data diverse software fault tolerace techiques Complemets desig diversity by compesatig for desig diversity s s limitatios Ivolves obtaiig a related set of poits i the program data space, executig the

More information

Chapter 3. Floating Point Arithmetic

Chapter 3. Floating Point Arithmetic COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 3 Floatig Poit Arithmetic Review - Multiplicatio 0 1 1 0 = 6 multiplicad 32-bit ALU shift product right multiplier add

More information

FAST BIT-REVERSALS ON UNIPROCESSORS AND SHARED-MEMORY MULTIPROCESSORS

FAST BIT-REVERSALS ON UNIPROCESSORS AND SHARED-MEMORY MULTIPROCESSORS SIAM J. SCI. COMPUT. Vol. 22, No. 6, pp. 2113 2134 c 21 Society for Idustrial ad Applied Mathematics FAST BIT-REVERSALS ON UNIPROCESSORS AND SHARED-MEMORY MULTIPROCESSORS ZHAO ZHANG AND XIAODONG ZHANG

More information

MOTIF XF Extension Owner s Manual

MOTIF XF Extension Owner s Manual MOTIF XF Extesio Ower s Maual Table of Cotets About MOTIF XF Extesio...2 What Extesio ca do...2 Auto settig of Audio Driver... 2 Auto settigs of Remote Device... 2 Project templates with Iput/ Output Bus

More information

Using the Keyboard. Using the Wireless Keyboard. > Using the Keyboard

Using the Keyboard. Using the Wireless Keyboard. > Using the Keyboard 1 A wireless keyboard is supplied with your computer. The wireless keyboard uses a stadard key arragemet with additioal keys that perform specific fuctios. Usig the Wireless Keyboard Two AA alkalie batteries

More information

Running Time. Analysis of Algorithms. Experimental Studies. Limitations of Experiments

Running Time. Analysis of Algorithms. Experimental Studies. Limitations of Experiments Ruig Time Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Most algorithms trasform iput objects ito output objects. The

More information

Ones Assignment Method for Solving Traveling Salesman Problem

Ones Assignment Method for Solving Traveling Salesman Problem Joural of mathematics ad computer sciece 0 (0), 58-65 Oes Assigmet Method for Solvig Travelig Salesma Problem Hadi Basirzadeh Departmet of Mathematics, Shahid Chamra Uiversity, Ahvaz, Ira Article history:

More information

Reliable Transmission. Spring 2018 CS 438 Staff - University of Illinois 1

Reliable Transmission. Spring 2018 CS 438 Staff - University of Illinois 1 Reliable Trasmissio Sprig 2018 CS 438 Staff - Uiversity of Illiois 1 Reliable Trasmissio Hello! My computer s ame is Alice. Alice Bob Hello! Alice. Sprig 2018 CS 438 Staff - Uiversity of Illiois 2 Reliable

More information

1. SWITCHING FUNDAMENTALS

1. SWITCHING FUNDAMENTALS . SWITCING FUNDMENTLS Switchig is the provisio of a o-demad coectio betwee two ed poits. Two distict switchig techiques are employed i commuicatio etwors-- circuit switchig ad pacet switchig. Circuit switchig

More information

Design of Digital Circuits Lecture 20: SIMD Processors. Prof. Onur Mutlu ETH Zurich Spring May 2018

Design of Digital Circuits Lecture 20: SIMD Processors. Prof. Onur Mutlu ETH Zurich Spring May 2018 Desig of Digital Circuits Lecture 20: SIMD Processors Prof. Our Mutlu ETH Zurich Sprig 2018 11 May 2018 New Course: Bachelor s Semiar i Comp Arch Fall 2018 2 credit uits Rigorous semiar o fudametal ad

More information

Running Time ( 3.1) Analysis of Algorithms. Experimental Studies. Limitations of Experiments

Running Time ( 3.1) Analysis of Algorithms. Experimental Studies. Limitations of Experiments Ruig Time ( 3.1) Aalysis of Algorithms Iput Algorithm Output A algorithm is a step- by- step procedure for solvig a problem i a fiite amout of time. Most algorithms trasform iput objects ito output objects.

More information

Analysis of Algorithms

Analysis of Algorithms Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Ruig Time Most algorithms trasform iput objects ito output objects. The

More information

Page 1. Why Care About the Memory Hierarchy? Memory. DRAMs over Time. Virtual Memory!

Page 1. Why Care About the Memory Hierarchy? Memory. DRAMs over Time. Virtual Memory! Why Care About the Memory Hierarchy? Memory Virtual Memory -DRAM Memory Gap (latecy) Reasos: Multi process systems (abstractio & memory protectio) Solutio: Tables (holdig per process traslatios) Fast traslatio

More information

CMSC Computer Architecture Lecture 1: Introduction. Prof. Yanjing Li Department of Computer Science University of Chicago

CMSC Computer Architecture Lecture 1: Introduction. Prof. Yanjing Li Department of Computer Science University of Chicago CMSC 22200 Computer Architecture Lecture 1: Itroductio Prof. Yajig Li Departmet of Computer Sciece Uiversity of Chicago Lecture Outlie Meet ad greet Computer architecture: overview ad perspectives Course

More information

DATA STRUCTURES. amortized analysis binomial heaps Fibonacci heaps union-find. Data structures. Appetizer. Appetizer

DATA STRUCTURES. amortized analysis binomial heaps Fibonacci heaps union-find. Data structures. Appetizer. Appetizer Data structures DATA STRUCTURES Static problems. Give a iput, produce a output. Ex. Sortig, FFT, edit distace, shortest paths, MST, max-flow,... amortized aalysis biomial heaps Fiboacci heaps uio-fid Dyamic

More information

The Magma Database file formats

The Magma Database file formats The Magma Database file formats Adrew Gaylard, Bret Pikey, ad Mart-Mari Breedt Johaesburg, South Africa 15th May 2006 1 Summary Magma is a ope-source object database created by Chris Muller, of Kasas City,

More information

ECE4050 Data Structures and Algorithms. Lecture 6: Searching

ECE4050 Data Structures and Algorithms. Lecture 6: Searching ECE4050 Data Structures ad Algorithms Lecture 6: Searchig 1 Search Give: Distict keys k 1, k 2,, k ad collectio L of records of the form (k 1, I 1 ), (k 2, I 2 ),, (k, I ) where I j is the iformatio associated

More information

Lecture 5. Counting Sort / Radix Sort

Lecture 5. Counting Sort / Radix Sort Lecture 5. Coutig Sort / Radix Sort T. H. Corme, C. E. Leiserso ad R. L. Rivest Itroductio to Algorithms, 3rd Editio, MIT Press, 2009 Sugkyukwa Uiversity Hyuseug Choo choo@skku.edu Copyright 2000-2018

More information

A SOFTWARE MODEL FOR THE MULTILAYER PERCEPTRON

A SOFTWARE MODEL FOR THE MULTILAYER PERCEPTRON A SOFTWARE MODEL FOR THE MULTILAYER PERCEPTRON Roberto Lopez ad Eugeio Oñate Iteratioal Ceter for Numerical Methods i Egieerig (CIMNE) Edificio C1, Gra Capitá s/, 08034 Barceloa, Spai ABSTRACT I this work

More information

Message Integrity and Hash Functions. TELE3119: Week4

Message Integrity and Hash Functions. TELE3119: Week4 Message Itegrity ad Hash Fuctios TELE3119: Week4 Outlie Message Itegrity Hash fuctios ad applicatios Hash Structure Popular Hash fuctios 4-2 Message Itegrity Goal: itegrity (ot secrecy) Allows commuicatig

More information

1 Enterprise Modeler

1 Enterprise Modeler 1 Eterprise Modeler Itroductio I BaaERP, a Busiess Cotrol Model ad a Eterprise Structure Model for multi-site cofiguratios are itroduced. Eterprise Structure Model Busiess Cotrol Models Busiess Fuctio

More information

Guide to Applying Online

Guide to Applying Online Guide to Applyig Olie Itroductio Respodig to requests for additioal iformatio Reportig: submittig your moitorig or ed of grat Pledges: submittig your Itroductio This guide is to help charities submit their

More information