POLITECNICO DI MILANO. Advanced Topics on Heterogeneous System Architectures: Multiprocessors
Transcription
1 POLITECNICO DI MILANO
Advanced Topics on Heterogeneous System Architectures: Multiprocessors
Politecnico di Milano, Conference Room, Bld 20
19 November, 2015
Antonio Miele, Marco Santambrogio
Politecnico di Milano
2 Outline
- Multiprocessors
- Flynn taxonomy
- SIMD architectures
- Vector architectures
- MIMD architectures
- A real life example
- What's next
3 Supercomputers
- Definition of a supercomputer:
  - Fastest machine in world at given task
  - A device to turn a compute-bound problem into an I/O bound problem
  - Any machine costing $30M+
  - Any machine designed by Seymour Cray
- CDC6600 (Cray, 1964) regarded as first supercomputer
4 The Cray XD1 example
- The XD1 uses AMD Opteron 64-bit CPUs and it incorporates Xilinx Virtex-II FPGAs
- Performance gains from FPGA
  - RC5 Cipher Breaking: 1000x faster than 2.4 GHz P4
  - Elliptic Curve Cryptography: x faster than 1 GHz P3
  - Vehicular Traffic Simulation: 300x faster on XC2V6000 than 1.7 GHz Xeon, 650x faster on XC2VP100 than 1.7 GHz Xeon
  - Smith Waterman DNA matching: 28x faster than 2.4 GHz Opteron
5 Supercomputer Applications
- Typical application areas
  - Military research (nuclear weapons, cryptography)
  - Scientific research
  - Weather forecasting
  - Oil exploration
  - Industrial design (car crash simulation)
- All involve huge computations on large data sets
- In the 70s-80s, supercomputer meant vector machine
6 Parallel Architectures
- Definition: "A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast" (Almasi and Gottlieb, Highly Parallel Computing, 1989)
- The aim is to replicate processors to add performance, as opposed to designing a faster processor.
- Parallel architecture extends traditional computer architecture with a communication architecture
  - abstractions (HW/SW interface)
  - different structures to realize the abstraction efficiently
7 Beyond ILP
- ILP architectures (superscalar, VLIW...):
  - Support fine-grained, instruction-level parallelism;
  - Fail to support large-scale parallel systems;
- Multiple-issue CPUs are very complex, and returns (as far as extracting greater parallelism) are diminishing => extracting parallelism at higher levels becomes more and more attractive.
- A further step: process- and thread-level parallel architectures.
- To achieve ever greater performance: connect multiple microprocessors in a complex system.
8 Beyond ILP
- Most recent microprocessor chips are multiprocessor on-chip: Intel Core Duo, IBM Power 5, Sun Niagara
- Major difficulty in exploiting parallelism in multiprocessors: suitable software => being (at least partially) overcome, in particular, for servers and for embedded applications which exhibit natural parallelism without the need of rewriting large software chunks
9 Flynn Taxonomy (1966)
- SISD - Single Instruction Single Data: uniprocessor systems
- MISD - Multiple Instruction Single Data: no practical configuration and no commercial systems
- SIMD - Single Instruction Multiple Data: simple programming model, low overhead, flexibility, custom integrated circuits
- MIMD - Multiple Instruction Multiple Data: scalable, fault tolerant, off-the-shelf micros
10 Flynn
11 SISD
- A serial (non-parallel) computer
- Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle
- Single data: only one data stream is being used as input during any one clock cycle
- Deterministic execution
- This is the oldest and, even today, the most common type of computer
12 SIMD
- A type of parallel computer
- Single instruction: all processing units execute the same instruction at any given clock cycle
- Multiple data: each processing unit can operate on a different data element
- Best suited for specialized problems characterized by a high degree of regularity, such as graphics/image processing
13 MISD
- A single data stream is fed into multiple processing units.
- Each processing unit operates on the data independently via independent instruction streams.
14 MIMD
- Nowadays, the most common type of parallel computer
- Multiple Instruction: every processor may be executing a different instruction stream
- Multiple Data: every processor may be working with a different data stream
- Execution can be synchronous or asynchronous, deterministic or non-deterministic
15 Which kind of multiprocessors?
- Many of the early multiprocessors were SIMD
- The SIMD model received great attention in the 80s; today it is applied only in very specific instances (vector processors, multimedia instructions)
- MIMD has emerged as the architecture of choice for general-purpose multiprocessors
- Let's look at these architectures in more detail.
16 SIMD - Single Instruction Multiple Data
- Same instruction executed by multiple processors using different data streams.
- Each processor has its own data memory.
- Single instruction memory and control processor to fetch and dispatch instructions
- Processors are typically special-purpose.
- Simple programming model.
17 SIMD Architecture
- Central controller broadcasts instructions to multiple processing elements (PEs)
- [Figure: Array Controller; Inter-PE Connection Network; eight PEs, each with its own Mem; Control and Data lines]
- Only requires one controller for whole array
- Only requires storage for one copy of program
- All computations fully synchronized
18 SIMD model
- Synchronized units: single Program Counter
- Each unit has its own addressing registers
  - Can use different data addresses
- Motivations for SIMD:
  - Cost of the control unit is shared by all execution units
  - Only one copy of the code in execution is necessary
- Real life:
  - SIMD machines have a mix of SISD and SIMD instructions
  - A host computer executes sequential operations
  - SIMD instructions are sent to all the execution units, each of which has its own memory and registers and exploits an interconnection network to exchange data
19 SIMD Machines Today
- Distributed-memory SIMD failed as large-scale general-purpose computer platform
  - required huge quantities of data parallelism (>10,000 elements)
  - required programmer-controlled distributed data layout
- Vector supercomputers (shared-memory SIMD) still successful in high-end supercomputing
  - reasonable efficiency on short vector lengths ( elements)
  - single memory space
- Distributed-memory SIMD popular for special-purpose accelerators
  - image and graphics processing
- Renewed interest for Processor-in-Memory (PIM)
  - memory bottlenecks => put some simple logic close to memory
  - viewed as enhanced memory for conventional system
  - technology push from new merged DRAM + logic processes
  - commercial examples, e.g., graphics in Sony Playstation-2/3
20 Reality: Sony Playstation 2000
21 Playstation 2000
- Emotion Engine:
  - Superscalar MIPS core
  - Vector Coprocessor Pipelines
  - RAMBUS DRAM interface
- Sample Vector Unit
  - 2-wide VLIW
  - Includes Microcode Memory
  - High-level instructions like matrix-multiply
22 Alternative Model: Vector Processing
- Vector processors have high-level operations that work on linear arrays of numbers: "vectors"
- SCALAR (1 operation): add r3, r1, r2 (r3 = r1 + r2)
- VECTOR (N operations, N = vector length): add.vv v3, v1, v2 (v3 = v1 + v2, element by element)
23 Vector Supercomputers
- Epitomized by Cray-1, 1976: Scalar Unit + Vector Extensions
  - Load/Store Architecture
  - Vector Registers
  - Vector Instructions
  - Hardwired Control
  - Highly Pipelined Functional Units
  - Interleaved Memory System
  - No Data Caches
  - No Virtual Memory
24 Properties of Vector Instructions
- A single vector instruction specifies a great deal of work
  - Equivalent to executing an entire loop
  - Each instruction represents tens or hundreds of operations
  - The fetch and decode bandwidth needed to keep multiple deeply pipelined FUs busy is dramatically reduced
- Vector instructions indicate that the computation of each result in the vector is independent of the computation of the other elements of the vector
  - No need to check for data hazards within the vector
- Hardware needs to check for data hazards only between two vector instructions, once per vector operand
25 Properties of Vector Instructions
- Each result independent of previous result
  => long pipeline, compiler ensures no dependencies
  => high clock rate
- Vector instructions access memory with known pattern
  => highly interleaved memory to fetch the vector from a set of memory banks
  => amortize memory latency of over 64 elements
  => no (data) caches required! (Do use instruction cache)
- Reduces branches and branch problems in pipelines
  - An entire loop is replaced by a vector instruction, therefore control hazards that would arise from the loop branch are avoided
26 Styles of Vector Architectures
- A vector processor consists of a pipelined scalar unit (may be out-of-order or VLIW) + a vector unit
- memory-memory vector processors: all vector operations are memory to memory (the first ones, such as CDC)
- vector-register processors: all vector operations are between vector registers (except load and store)
  - Vector equivalent of load-store architectures
  - Includes all vector machines since the late 1980s: Cray, Convex, Fujitsu, Hitachi, NEC
27 Components of Vector Processor
- Vector Register: fixed-length bank holding a single vector
  - has at least 2 read and 1 write ports
  - typically 8-32 vector registers, each holding bit elements
- Vector Functional Units (FUs): fully pipelined, start a new operation every clock cycle
  - typically 4 to 8 FUs: FP add, FP mult, FP reciprocal (1/X), integer add, logical, shift; may have multiple of the same unit
- Control unit to detect hazards (control for FUs and data from register accesses)
- Scalar operations may use either the vector functional units or a dedicated set.
28 Components of Vector Processor
- Vector Load-Store Units (LSUs): fully pipelined unit to load or store a vector to and from memory
  - Pipelining allows moving words between vector registers and memory with a bandwidth of 1 word per clock cycle
  - Handles also scalar loads and stores
  - may have multiple LSUs
- Scalar registers: single element for FP scalar or address
- Cross-bar to connect FUs, LSUs, registers
29 Vector programming model
- Scalar Registers: r0 to r15
- Vector Registers: v0 to v15, each holding elements [0], [1], [2], ..., [VLRMAX-1]
- Vector Length Register: VLR
- Vector Arithmetic Instructions: ADDV v3, v1, v2 operates on elements [0], [1], ..., [VLR-1] of v1 and v2 and writes v3
- Vector Load and Store Instructions: LV v1, r1, r2 loads Vector Register v1 from Memory, with Base in r1 and Stride in r2
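As a rough illustration of the programming model on this slide, the C sketch below emulates a vector register, a strided LV and an ADDV in software. The type `vreg`, the functions `lv` and `addv`, the value chosen for `VLRMAX`, and the use of an element stride rather than a byte stride are all assumptions made for the example; they are not taken from the slides or from any real ISA.

```c
#include <assert.h>
#include <stddef.h>

#define VLRMAX 64  /* assumed maximum vector length (hypothetical value) */

/* One vector register: VLRMAX elements, of which the first vlr are active. */
typedef struct { double e[VLRMAX]; } vreg;

/* Emulates "LV v, base, stride": load vlr elements starting at base,
   stepping by stride elements (an element stride is a simplification). */
void lv(vreg *v, const double *base, ptrdiff_t stride, size_t vlr) {
    assert(vlr <= VLRMAX);
    for (size_t i = 0; i < vlr; i++)
        v->e[i] = base[(ptrdiff_t)i * stride];
}

/* Emulates "ADDV vd, va, vb": element-wise add of the first vlr elements. */
void addv(vreg *vd, const vreg *va, const vreg *vb, size_t vlr) {
    assert(vlr <= VLRMAX);
    for (size_t i = 0; i < vlr; i++)
        vd->e[i] = va->e[i] + vb->e[i];
}
```

For instance, with `vlr` set to 4 and a stride of 2, `lv` picks every other element of the source array, which is the access pattern the Base/Stride pair describes.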
30 Vector Code Example
# C code
for (i=0; i<64; i++)
  C[i] = A[i] + B[i];
# Scalar Code
      LI     R4, #64
loop: L.D    F0, 0(R1)
      L.D    F2, 0(R2)
      ADD.D  F4, F2, F0
      S.D    F4, 0(R3)
      DADDIU R1, 8
      DADDIU R2, 8
      DADDIU R3, 8
      DSUBIU R4, 1
      BNEZ   R4, loop
# Vector Code
      LI     VLR, #64
      LV     V1, R1
      LV     V2, R2
      ADDV.D V3, V1, V2
      SV     V3, R3
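To make the difference between the two listings concrete, the sketch below counts how many instructions each version executes. The counts are read directly off the slide (nine instructions in the scalar loop body plus one setup instruction, five instructions in the vector version); the function names and the `max_vl` parameter are hypothetical, and the vector count is assumed to hold only while the whole array fits in one vector register.

```c
#include <assert.h>

/* Scalar loop body on the slide: 2 loads, 1 add, 1 store,
   3 pointer updates, 1 counter update, 1 branch. */
enum { SCALAR_BODY_INSTRS = 9, SCALAR_SETUP_INSTRS = 1 };

/* Vector version on the slide: LI VLR, 2 x LV, ADDV.D, SV. */
enum { VECTOR_INSTRS = 5 };

/* Instructions executed by the scalar loop for n iterations. */
unsigned long scalar_dynamic_instrs(unsigned long n) {
    return SCALAR_SETUP_INSTRS + SCALAR_BODY_INSTRS * n;
}

/* Instructions executed by the vector code; this sketch only covers
   the case where n fits in one vector register (n <= max_vl). */
unsigned long vector_dynamic_instrs(unsigned long n, unsigned long max_vl) {
    assert(n <= max_vl);
    (void)n;
    return VECTOR_INSTRS;
}
```

For the 64-element loop of the slide this gives 577 executed instructions for the scalar code against 5 for the vector code, which is the reduction in fetch and decode work that vector instructions are credited with.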
31 Vector Instruction Set Advantages
- Compact
  - one short instruction encodes N operations
- Expressive, tells hardware that these N operations:
  - are independent
  - use the same functional unit
  - access disjoint registers
  - access registers in same pattern as previous instructions
  - access a contiguous block of memory (unit-stride load/store)
  - access memory in a known pattern (strided load/store)
- Scalable
  - can run same code on more parallel pipelines (lanes)
32 Vector Arithmetic Execution
- Use deep pipeline (=> fast clock) to execute element operations
- Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!)
- [Figure: six stage multiply pipeline reading V1 and V2 and writing V3; V3 <- v1 * v2]
33 Vector Instruction Execution
- ADDV C,A,B
- [Figure: execution using one pipelined functional unit: one element pair enters per cycle (A[3], B[3] to A[6], B[6] in flight; C[0] to C[2] completed)]
- [Figure: execution using four pipelined functional units: four element pairs enter per cycle (A[12], B[12] to A[27], B[27] in flight; C[0] to C[11] completed)]
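The effect of adding pipelined functional units can be estimated with a simple timing model, sketched below. The model is an assumption, not something stated on the slides: each lane accepts one element pair per cycle, the first results appear after `depth` cycles, and there are no stalls. The function name `vector_op_cycles` is hypothetical.

```c
#include <assert.h>

/* Cycles to complete one vector instruction over n elements,
   under the simplified no-stall model described above. */
unsigned vector_op_cycles(unsigned n, unsigned lanes, unsigned depth) {
    assert(lanes > 0 && depth > 0);
    if (n == 0)
        return 0;
    unsigned groups = (n + lanes - 1) / lanes;  /* ceil(n / lanes) */
    return depth + groups - 1;                  /* fill the pipe, then one group per cycle */
}
```

With a six-stage pipeline and 64 elements, this model gives 69 cycles on one functional unit and 21 cycles on four, so the speedup approaches the number of units only when the vector is long compared with the pipeline depth.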
34 Vector Memory System
- Cray-1: 16 banks, 4 cycle bank busy time, 12 cycle latency
- Bank busy time: time before a bank is ready to accept the next request
- To avoid conflicts, the stride and the number of banks should be relatively prime
- [Figure: Vector Registers; Address Generator (Base, Stride); Memory Banks]
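The rule about stride and number of banks can be checked with a few lines of C. The sketch assumes the simplest interleaving (element address modulo the number of banks selects the bank) and one access issued per cycle; the function names are hypothetical. A strided stream cycles through `nbanks / gcd(stride, nbanks)` distinct banks, so it returns to the same bank after that many cycles.

```c
#include <assert.h>

/* Greatest common divisor (Euclid's algorithm). */
unsigned gcd_u(unsigned a, unsigned b) {
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Number of distinct banks touched by an arbitrarily long strided stream. */
unsigned distinct_banks(unsigned stride, unsigned nbanks) {
    assert(nbanks > 0);
    return nbanks / gcd_u(stride % nbanks, nbanks);
}

/* Nonzero if a long stream never finds its bank still busy, assuming one
   access per cycle and a bank that is busy for busy_cycles cycles. */
int stream_is_conflict_free(unsigned stride, unsigned nbanks, unsigned busy_cycles) {
    return distinct_banks(stride, nbanks) >= busy_cycles;
}
```

With the Cray-1 figures from the slide (16 banks, 4 cycle busy time), a stride of 1 or 3 visits all 16 banks, while a stride of 8 alternates between only 2 banks and stalls. Relatively prime stride and bank count is the case that always uses every bank.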
35 Vector Unit Structure
- [Figure: four lanes, each with functional unit pipelines and a slice of the vector registers, connected to the memory subsystem; the lanes hold elements 0, 4, 8, ...; 1, 5, 9, ...; 2, 6, 10, ...; and 3, 7, 11, ... respectively]
36 T0 Vector Microprocessor (UCB/ICSI, 1995)
- Vector register elements striped over lanes
- [Figure: eight lanes holding elements [0] to [31]; each lane holds every eighth element, e.g. [0], [8], [16], [24] in the first lane and [7], [15], [23], [31] in the last]
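The striping shown in these two figures follows a simple rule, sketched below: element i lives in lane i modulo the number of lanes, at position i divided by the number of lanes within that lane's slice of the register. The struct and function names are hypothetical.

```c
#include <assert.h>

/* Where one vector element is stored when registers are striped over lanes. */
typedef struct {
    unsigned lane;  /* which lane holds the element */
    unsigned slot;  /* position inside that lane's slice of the register */
} lane_pos;

lane_pos element_position(unsigned element, unsigned lanes) {
    assert(lanes > 0);
    lane_pos p = { element % lanes, element / lanes };
    return p;
}
```

With 8 lanes, element 13 maps to lane 5 (alongside elements 5, 21 and 29), which matches the T0 layout above.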
37 Vector Applications
Limited to scientific computing?
- Multimedia Processing (compress., graphics, audio synth, image proc.)
- Standard benchmark kernels (Matrix Multiply, FFT, Convolution, Sort)
- Lossy Compression (JPEG, MPEG video and audio)
- Lossless Compression (Zero removal, RLE, Differencing, LZW)
- Cryptography (RSA, DES/IDEA, SHA/MD5)
- Speech and handwriting recognition
- Operating systems/networking (memcpy, memset, parity, checksum)
- Databases (hash/join, data mining, image/video serving)
- Language run-time support (stdlib, garbage collection)
- even SPECint95
38 MIMD - Multiple Instruction Multiple Data
- Each processor fetches its own instructions and operates on its own data.
- Processors are often off-the-shelf microprocessors.
- Scalable to a variable number of processor nodes.
- Flexible:
  - single-user machines focusing on high performance for one specific application,
  - multi-programmed machines running many tasks simultaneously,
  - some combination of these functions.
- Cost/performance advantages due to the use of off-the-shelf microprocessors.
- Fault tolerance issues.
39 Why MIMD?
- MIMDs are flexible: they can function as single-user machines for high performance on one application, as multiprogrammed multiprocessors running many tasks simultaneously, or as some combination of such functions;
- Can be built starting from standard CPUs (such is the present case nearly for all multiprocessors!).
40 MIMD
- To exploit a MIMD with n processors:
  - at least n threads or processes are needed to execute
  - independent threads are typically identified by the programmer or created by the compiler.
- Parallelism is contained in the threads: thread-level parallelism.
- Thread: anything from a large, independent process to parallel iterations of a loop.
- Important: parallelism is identified by the software (not by hardware as in superscalar CPUs!)... keep this in mind, we'll use it!
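The "parallel iterations of a loop" case can be sketched in C by splitting the iteration space into one contiguous chunk per thread. This is only an illustration of how software identifies the parallelism: the names `chunk`, `chunk_for_thread` and `add_chunk` are hypothetical, and the code that would actually start one thread per chunk (for example with POSIX threads) is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open range [begin, end) of loop iterations given to one thread. */
typedef struct { size_t begin; size_t end; } chunk;

/* Split n iterations as evenly as possible among nthreads threads;
   the first (n % nthreads) threads get one extra iteration. */
chunk chunk_for_thread(size_t n, size_t nthreads, size_t t) {
    assert(nthreads > 0 && t < nthreads);
    size_t base = n / nthreads;
    size_t rem = n % nthreads;
    chunk k;
    k.begin = t * base + (t < rem ? t : rem);
    k.end = k.begin + base + (t < rem ? 1 : 0);
    return k;
}

/* Work done by one thread: its share of C[i] = A[i] + B[i].
   Chunks do not overlap, so threads never write the same element. */
void add_chunk(const double *a, const double *b, double *c, chunk k) {
    for (size_t i = k.begin; i < k.end; i++)
        c[i] = a[i] + b[i];
}
```

Each processor of a MIMD machine would run `add_chunk` on its own chunk with its own instruction stream, in contrast to the single vector instruction used for the same loop on a vector machine.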
41 MIMD Machines
Existing MIMD machines fall into 2 classes, depending on the number of processors involved, which in turn dictates a memory organization and interconnection strategy.
- Centralized shared-memory architectures
  - at most few dozen processor chips (< 100 cores)
  - Large caches, single memory multiple banks
  - Often called symmetric multiprocessors (SMP) and the style of architecture called Uniform Memory Access (UMA)
- Distributed memory architectures
  - To support large processor counts
  - Requires high-bandwidth interconnect
  - Disadvantage: data communication among processors
- [Figures: processors P0 to P3 with caches, main memory modules and an Interconnection Network; nodes, each with Processor, Cache and Main Memory, joined by an Interconnection Network]
42 Key issues to design multiprocessors
- How many processors?
- How powerful are processors?
- How do parallel processors share data?
- Where to place the physical memory?
- How do parallel processors cooperate and coordinate?
- What type of interconnection topology?
- How to program processors?
- How to maintain cache coherency?
- How to maintain memory consistency?
- How to evaluate system performance?
43 Create the most amazing game console
44 One core
- A 64-bit Power Architecture core
- Two-issue superscalar execution
- Two-way multithreaded core
- In-order execution
- Cache
  - 32 KB instruction and a 32 KB data Level 1 cache
  - 512 KB Level 2 cache
  - The size of a cache line is 128 bytes
- One core to rule them all
45 Cell: PS3
- Cell is a heterogeneous chip multiprocessor
  - One 64-bit Power core
  - 8 specialized co-processors based on a novel single-instruction multiple-data (SIMD) architecture called SPU (Synergistic Processor Unit)
46 Ducks Demo
47 Duck Demo SPE Usage
48 Xenon: XBOX360
- Three symmetrical cores, each two-way SMT-capable and clocked at 3.2 GHz
- SIMD: VMX128 extension for each core
- 1 MB L2 cache (lockable by the GPU) running at half speed (1.6 GHz) with a 256-bit bus
49 Microsoft vision
- Microsoft envisions a procedurally rendered game as having at least two primary components:
  - Host thread: a game's host thread will contain the main thread of execution for the game
  - Data generation thread: where the actual procedural synthesis of object geometry takes place
- These two threads could run on the same PPE, or they could run on two separate PPEs.
- In addition to these two threads, the game could make use of separate threads for handling physics, artificial intelligence, player input, etc.
50 The Xenon architecture
51 From ILP to TLP: from the processor to the programmer
- Keep it simple
  - Strip out hardware that's intended to optimize instruction scheduling at runtime
  - Neither the Xenon nor the Cell has an instruction window
  - Instructions pass through the processor in the order in which they're fetched
  - Two adjacent, non-dependent instructions are executed in parallel where possible
- Static execution
  - Is simple to implement
  - Takes up much less die space than dynamic execution, since the processor doesn't need to spend a lot of transistors on the instruction window and related hardware
  - The transistors that the lack of an instruction window frees up can be used to put more actual execution units on the die
- Rethink how you organize the processor
  - You can't just eliminate the instruction window and replace it with more execution
52 Regrouping the execution units
- No hardware spent on an instruction window that looks for ILP at runtime
  - The programmer has to structure the code stream at compile time so that it contains a high level of thread-level parallelism (TLP)
- Three separate cores
  - Each of which individually contains a relatively small number of execution units
  - The many parallel threads out of which the programmer has woven the code stream are then scheduled to run on those separate cores
- This TLP strategy will work extremely well for tasks like procedural synthesis that can be parallelized at the thread level.
- However, it won't work as well as an old-fashioned wide execution core plus large instruction window for inherently single-threaded tasks. In particular, three types of game-oriented tasks are likely to suffer from the lack of out-of-order processing and core width:
  - Game control
  - Artificial intelligence (AI)
  - Physics
53 Procedural Synthesis in a nutshell
- Procedural synthesis is about making optimal use of system bandwidth and main memory by dynamically generating lower-level geometry data from statically stored higher-level scene data
- For 3D games
  - Artists use a 3D rendering program to produce content for the game
  - Each model is translated into a collection of polygons
  - Each polygon is represented in the computer's memory as a collection of vertices
- When the computer is rendering a scene in a game in real-time
  - Models that are being displayed on the screen start out in main memory as stored vertex data
  - That vertex data is fed from main memory into the GPU, where it is then rendered into a 3D image and output to the monitor as a sequence of frames.
54 Limitations
There are two problems
- The costs of creating art assets for a 3D game are going through the roof along with the size and complexity of the games themselves
- Console hardware's limited main memory sizes and limited bus bandwidth
55 The Xbox 360's solution
- Store high-level descriptions of objects in main memory
- Have the CPU procedurally generate the geometry (i.e., the vertex data) of the objects on the fly
  - Main memory stores high-level information
  - This information is passed into the Xbox 360's Xenon CPU, where the vertex data are generated by one or more running threads
  - These threads then feed that vertex data directly into the GPU by way of a special set of write buffers in the L2 cache
  - The GPU then takes that vertex information and renders the trees normally, just as if it had gotten that information from main memory
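A toy version of this idea, written in C, is sketched below: a few bytes of high-level description are expanded by the CPU into a much larger array of vertices. The object chosen (a flat rectangular patch), the types `vertex` and `patch_desc`, and the function `generate_vertices` are all hypothetical; they are not the Xbox 360 API, and the hand-off to the GPU through the L2 write buffers is not modelled.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y, z; } vertex;

/* Hypothetical high-level description of a flat rectangular patch. */
typedef struct {
    float x0, y0;         /* position of the first vertex */
    float spacing;        /* distance between neighbouring vertices */
    unsigned rows, cols;  /* number of vertices in each direction */
} patch_desc;

/* Expand the description into rows * cols vertices, row by row.
   Returns the number of vertices written, or 0 if out is too small. */
size_t generate_vertices(const patch_desc *d, vertex *out, size_t capacity) {
    size_t count = (size_t)d->rows * d->cols;
    assert(out != NULL);
    if (count > capacity)
        return 0;
    size_t k = 0;
    for (unsigned r = 0; r < d->rows; r++) {
        for (unsigned c = 0; c < d->cols; c++) {
            out[k].x = d->x0 + (float)c * d->spacing;
            out[k].y = d->y0 + (float)r * d->spacing;
            out[k].z = 0.0f;
            k++;
        }
    }
    return count;
}
```

The description stays the same size however many vertices it expands to, which is where the saving in main memory and bus bandwidth comes from. In a game, one or more data generation threads would run a routine of this kind.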
56 The Xbox 360's solution
57 Questions
"RISK more than others think is safe, CARE more than others think is wise, DREAM more than others think is practical, EXPECT more than others think is possible" (Cadet Maxim)
POLITECNICO DI MILANO. Advanced Topics on Heterogeneous System Architectures! Multiprocessors
POLITECNICO DI MILANO Advanced Topics on Heterogeneous System Architectures! Multiprocessors Politecnico di Milano! SeminarRoom, Bld 20! 30 November, 2017! Antonio Miele! Marco Santambrogio! Politecnico
More informationChapter 4 Data-Level Parallelism
CS359: Computer Architecture Chapter 4 Data-Level Parallelism Yanyan Shen Department of Computer Science and Engineering Shanghai Jiao Tong University 1 Outline 4.1 Introduction 4.2 Vector Architecture
More informationData-Level Parallelism in SIMD and Vector Architectures. Advanced Computer Architectures, Laura Pozzi & Cristina Silvano
Data-Level Parallelism in SIMD and Vector Architectures Advanced Computer Architectures, Laura Pozzi & Cristina Silvano 1 Current Trends in Architecture Cannot continue to leverage Instruction-Level parallelism
More informationComputer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors
Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently
More informationStatic Compiler Optimization Techniques
Static Compiler Optimization Techniques We examined the following static ISA/compiler techniques aimed at improving pipelined CPU performance: Static pipeline scheduling. Loop unrolling. Static branch
More informationParallel Computing: Parallel Architectures Jin, Hai
Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer
More informationCS 152 Computer Architecture and Engineering. Lecture 16: Vector Computers
CS 152 Computer Architecture and Engineering Lecture 16: Vector Computers Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste
More informationHakam Zaidan Stephen Moore
Hakam Zaidan Stephen Moore Outline Vector Architectures Properties Applications History Westinghouse Solomon ILLIAC IV CDC STAR 100 Cray 1 Other Cray Vector Machines Vector Machines Today Introduction
More informationAsanovic/Devadas Spring Vector Computers. Krste Asanovic Laboratory for Computer Science Massachusetts Institute of Technology
Vector Computers Krste Asanovic Laboratory for Computer Science Massachusetts Institute of Technology Supercomputers Definition of a supercomputer: Fastest machine in world at given task Any machine costing
More informationComputer Architecture
Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,
More informationMultiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering
Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to
More informationWHY PARALLEL PROCESSING? (CE-401)
PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:
More informationOnline Course Evaluation. What we will do in the last week?
Online Course Evaluation Please fill in the online form The link will expire on April 30 (next Monday) So far 10 students have filled in the online form Thank you if you completed it. 1 What we will do
More informationCS 252 Graduate Computer Architecture. Lecture 7: Vector Computers
CS 252 Graduate Computer Architecture Lecture 7: Vector Computers Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste http://inst.cs.berkeley.edu/~cs252
More informationAdvanced Topics in Computer Architecture
Advanced Topics in Computer Architecture Lecture 7 Data Level Parallelism: Vector Processors Marenglen Biba Department of Computer Science University of New York Tirana Cray I m certainly not inventing
More informationParallel Processors. The dream of computer architects since 1950s: replicate processors to add performance vs. design a faster processor
Multiprocessing Parallel Computers Definition: A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast. Almasi and Gottlieb, Highly Parallel
More informationSerial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing
CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.
More informationParallel Architecture. Hwansoo Han
Parallel Architecture Hwansoo Han Performance Curve 2 Unicore Limitations Performance scaling stopped due to: Power Wire delay DRAM latency Limitation in ILP 3 Power Consumption (watts) 4 Wire Delay Range
More informationIssues in Parallel Processing. Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University
Issues in Parallel Processing Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Introduction Goal: connecting multiple computers to get higher performance
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 6 Parallel Processors from Client to Cloud Introduction Goal: connecting multiple computers to get higher performance
More informationCS 152 Computer Architecture and Engineering. Lecture 17: Vector Computers
CS 152 Computer Architecture and Engineering Lecture 17: Vector Computers Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste
More informationMotivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism
Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the
More informationComputer Architecture: SIMD and GPUs (Part I) Prof. Onur Mutlu Carnegie Mellon University
Computer Architecture: SIMD and GPUs (Part I) Prof. Onur Mutlu Carnegie Mellon University A Note on This Lecture These slides are partly from 18-447 Spring 2013, Computer Architecture, Lecture 15: Dataflow
More informationThis Unit: Putting It All Together. CIS 371 Computer Organization and Design. Sources. What is Computer Architecture?
This Unit: Putting It All Together CIS 371 Computer Organization and Design Unit 15: Putting It All Together: Anatomy of the XBox 360 Game Console Application OS Compiler Firmware CPU I/O Memory Digital
More informationThis Unit: Putting It All Together. CIS 501 Computer Architecture. What is Computer Architecture? Sources
This Unit: Putting It All Together CIS 501 Computer Architecture Unit 12: Putting It All Together: Anatomy of the XBox 360 Game Console Application OS Compiler Firmware CPU I/O Memory Digital Circuits
More informationIntroduction to Computing and Systems Architecture
Introduction to Computing and Systems Architecture 1. Computability A task is computable if a sequence of instructions can be described which, when followed, will complete such a task. This says little
More informationParallel Systems I The GPU architecture. Jan Lemeire
Parallel Systems I The GPU architecture Jan Lemeire 2012-2013 Sequential program CPU pipeline Sequential pipelined execution Instruction-level parallelism (ILP): superscalar pipeline out-of-order execution
More informationIBM Cell Processor. Gilbert Hendry Mark Kretschmann
IBM Cell Processor Gilbert Hendry Mark Kretschmann Architectural components Architectural security Programming Models Compiler Applications Performance Power and Cost Conclusion Outline Cell Architecture:
More informationELE 455/555 Computer System Engineering. Section 4 Parallel Processing Class 1 Challenges
ELE 455/555 Computer System Engineering Section 4 Class 1 Challenges Introduction Motivation Desire to provide more performance (processing) Scaling a single processor is limited Clock speeds Power concerns
More informationEE 4683/5683: COMPUTER ARCHITECTURE
EE 4683/5683: COMPUTER ARCHITECTURE Lecture 5B: Data Level Parallelism Avinash Kodi, kodi@ohio.edu Thanks to Morgan Kauffman and Krtse Asanovic Agenda 2 Flynn s Classification Data Level Parallelism Vector
More informationUnit 11: Putting it All Together: Anatomy of the XBox 360 Game Console
Computer Architecture Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console Slides originally developed by Milo Martin & Amir Roth at University of Pennsylvania! Computer Architecture
More informationComputing architectures Part 2 TMA4280 Introduction to Supercomputing
Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:
More informationArchitecture of parallel processing in computer organization