Computer Architecture

Computer Architecture Slide Sets, WS 2013/2014
Prof. Dr. Uwe Brinkschulte, M.Sc. Benjamin Betting
Part 10: Thread and Task Level Parallelism

Basic concepts
Thread: Threads are lightweight processes consisting of a sequence of instructions. All threads of a task share a common (virtual) address space and can communicate via this common address space.
Task: Tasks are heavyweight processes. Each task has its own address space. Tasks can only communicate via inter-task communication channels such as shared memory segments, pipes, message queues or sockets. A task can contain several threads.
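The slides define this distinction abstractly; the following minimal C sketch (my own illustration, not from the slides) makes it concrete. Two POSIX threads increment a counter in the common address space, while a forked child task receives its own copy of that address space, so its update stays invisible to the parent. Compile with cc -pthread.

```c
/* Sketch illustrating the thread/task distinction from the slides:
 * threads share one address space; a forked task gets its own copy. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static int counter = 0;                 /* lives in the shared address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);          /* threads must synchronize accesses */
    counter++;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("after two threads: counter = %d\n", counter);   /* prints 2 */

    if (fork() == 0) {                  /* child task: private copy of counter */
        counter += 100;                 /* invisible to the parent */
        exit(0);
    }
    wait(NULL);
    printf("after child task:  counter = %d\n", counter);   /* still 2 */
    return 0;
}
```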

Basic concepts
Instruction-level parallelism is limited. To exploit parallel processing further, thread-level or task-level parallelism can be used. Two major architectures are known:
- Multithreaded processors exploit thread-level parallelism
- Chip multiprocessors (multi-core processors, many-core processors) exploit task-level parallelism
Both concepts are also used in combination.

Basic concepts
In a multithreaded processor, instructions of several threads of the program are candidates for concurrent issue. This can be done in a classical scalar pipeline to hide the latencies of memory accesses: instructions from several threads are then processed in the different pipeline stages. It can also be combined with a superscalar pipeline to raise the level of exploitable parallelism from the intra-thread level to the inter-thread level; this is called SMT (Simultaneous Multithreading).

Basic concepts
Chip multiprocessors combine multiple processor cores on a single chip; these processors are therefore also called multi-core processors. Today's multi-core processors integrate 2-8 cores on a chip. As the number of cores grows in the future (e.g. > 100), the term many-core processors is used. The cores can execute several tasks in parallel and can be homogeneous or heterogeneous. With multithreaded cores, multithreading and chip multiprocessing can be combined.

Multithreaded Architectures
Multithreaded processor:
- Supports the execution of multiple threads in hardware
- Stores the context information of several threads in separate register sets and can execute instructions of different threads at the same time in the processor pipeline
- Different stages of the processor pipeline can contain instructions from different threads
- This exploits thread-level parallelism on the basis of parallelism in time (pipelining)

Multithreaded Architectures
Goal: reduction of latencies caused by memory accesses or dependencies. Such latencies can be bridged by switching to another thread: during the latency, instructions from other threads are fed into the pipeline. Processor utilization is thereby raised, and the throughput of a load consisting of multiple threads increases (while the throughput of a single thread remains the same).
- Explicit multithreaded processors: each thread is a real thread of the application program
- Implicit multithreaded processors: speculative parallel threads are created dynamically out of a sequential program

Basic multithreading techniques
[Figure: pipeline occupancy over time (processor cycles) for (a) a single-threaded processor, (b) the cycle-by-cycle interleaving technique (fine-grained multithreading), where the context is switched every clock cycle, and (c) the block interleaving technique (coarse-grained multithreading), where instructions of a thread are executed until an event causes a latency and the context is then switched.]
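The figure is easiest to internalize by simulating it. The toy C program below (my own illustration, not from the slides) models which thread occupies the pipeline in each cycle: in fine-grained mode the processor rotates round-robin every cycle, in coarse-grained mode it stays on one thread until that thread hits a stall, using a made-up miss pattern. The switch-on-stall cycle itself is simplified away.

```c
/* Toy model of fine-grained vs. coarse-grained multithreading:
 * each cycle, one thread id occupies the pipeline. */
#include <stdio.h>

#define THREADS 4
#define CYCLES  16

/* Hypothetical miss pattern: thread t stalls every (t + 3) cycles. */
static int stalls(int t, int cycle) { return cycle % (t + 3) == 0; }

static void run(int fine_grained) {
    int current = 0;
    printf(fine_grained ? "fine-grained:   " : "coarse-grained: ");
    for (int cycle = 1; cycle <= CYCLES; cycle++) {
        if (fine_grained)
            current = cycle % THREADS;         /* switch every cycle */
        else if (stalls(current, cycle))
            current = (current + 1) % THREADS; /* switch only on a stall */
        printf("T%d ", current);
    }
    printf("\n");
}

int main(void) {
    run(1);   /* (b) cycle-by-cycle interleaving */
    run(0);   /* (c) block interleaving */
    return 0;
}
```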

Comparing multithreading to superscalar and VLIW
[Figure: issue-slot occupancy over time (processor cycles) for (a) a four-way superscalar processor, (b) a four-way VLIW processor, (c) a four-way superscalar processor with cycle-by-cycle interleaving, and (d) a four-way VLIW processor with cycle-by-cycle interleaving; context switches occur between cycles in (c) and (d).]

Classification of block interleaving techniques
Block interleaving techniques are classified as:
- static
  - explicit-switch
  - implicit-switch (switch-on-load, switch-on-store, switch-on-branch, ...)
- dynamic
  - conditional-switch
  - switch-on-signal
  - switch-on-cache-miss
  - switch-on-use

Simultaneous multithreading (SMT)
A simultaneous multithreaded processor is able to issue instructions of multiple threads to multiple execution units in a single clock cycle. This exploits thread-level and instruction-level parallelism in time and space.
[Figure: SMT pipeline with instruction fetch, instruction decode and rename, instruction window, issue ports 1-4 feeding reservation stations, execution units 1-4, and retire/write-back.]

Comparing SMT to chip multiprocessing
[Figure: issue-slot occupancy over time (processor cycles) for (a) simultaneous multithreading, where instructions of several threads share the issue slots of one wide core within the same cycle, and (b) chip multiprocessing, where the issue slots of each core are filled by a single thread.]

Other applications of multithreading
The ability to switch contexts quickly opens up further application fields for multithreading:
- Reduction of energy consumption: mispredictions in superscalar processors cost energy; multithreaded processors can execute instructions from other threads instead
- Event handling: helper threads handle special events (e.g. garbage collection)
- Real-time processing: allows efficient real-time scheduling policies like LLF (least laxity first) or GP (guaranteed percentage)

Chip multiprocessing architectures
A chip multiprocessor (CMP) combines several processors on a single chip. Instead of chip multiprocessor, such a device is today also called a multi-core processor, where a core denotes a single processor on the multi-core processor chip.
- Each core can have the complexity of today's microprocessors and holds its own primary cache for instructions and data
- Usually, the cores are organized as memory-coupled multiprocessors with a shared address space
- Furthermore, a secondary cache is contained on the chip
- For future multi-core processors containing a large number of cores (> 100), the term many-core processor is used

Possible multi-core configurations
[Figure: (a) shared main memory, where each of four processors has its own primary and secondary cache and all share the global memory; (b) shared secondary cache, where each of four processors has its own primary cache and all share one on-chip secondary cache in front of the global memory.]

Possible multi-core configurations (2)
[Figure: (c) shared primary cache, where all processors share a single primary cache, backed by a secondary cache and the global memory.]

Chip multiprocessor / multi-core
Simulations show the shared-secondary-cache architecture to be superior to shared primary cache and shared main memory. Therefore, mostly a large shared secondary cache is implemented on the processor chip. Cache coherence protocols known from symmetric multiprocessor architectures (e.g. the MESI protocol) guarantee correct access to the shared memory cells from inside and outside the processor chip. Today, chip multiprocessing is often combined with simultaneous multithreading: each core is then an SMT core, giving the advantages of both approaches.
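The slides only name MESI. As a reminder of what the protocol does, here is a deliberately simplified sketch (my own illustration) of the per-line state transitions a MESI controller performs in response to local reads/writes and snooped bus traffic; real controllers additionally handle write-back, intervention and ownership transfer.

```c
/* Simplified MESI state machine for a single cache line (illustration
 * only; write-back and data supply on snoop hits are omitted). */
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;
typedef enum { LOCAL_READ, LOCAL_WRITE, BUS_READ, BUS_WRITE } event_t;

static mesi_t next_state(mesi_t s, event_t e, int others_have_copy) {
    switch (e) {
    case LOCAL_READ:
        if (s == INVALID)                   /* read miss: fetch the line  */
            return others_have_copy ? SHARED : EXCLUSIVE;
        return s;                           /* read hits keep the state   */
    case LOCAL_WRITE:
        return MODIFIED;                    /* invalidates remote copies  */
    case BUS_READ:                          /* another core reads the line */
        return (s == INVALID) ? INVALID : SHARED;
    case BUS_WRITE:                         /* another core writes it     */
        return INVALID;
    }
    return s;
}

int main(void) {
    mesi_t s = INVALID;
    s = next_state(s, LOCAL_READ, 0);   /* -> EXCLUSIVE               */
    s = next_state(s, LOCAL_WRITE, 0);  /* -> MODIFIED                */
    s = next_state(s, BUS_READ, 0);     /* -> SHARED (data supplied)  */
    printf("final state: %d\n", s);     /* prints 1 (SHARED)          */
    return 0;
}
```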

An early single-chip multiprocessor proposal: Hydra
[Figure: the Hydra single-chip multiprocessor. Four CPUs, each with its own primary I-cache, primary D-cache and memory controller, connect via centralized bus arbitration mechanisms to an on-chip secondary cache. Off chip are an L3 interface with a cache SRAM array, a Rambus memory interface to the DRAM main memory, and a DMA/I-O bus interface to the I/O devices.]

Multi-Core examples: IBM Power5
- Symmetric multi-core processor with two 64-bit, two-way SMT cores, each having a 64 KByte instruction cache and a 32 KByte data cache
- Both cores share a 1.875 MByte on-chip secondary cache
- The controller for the third-level cache is on chip as well
- Four Power5 chips and four L3 cache chips are combined in a multi-chip module

Multi-Core examples: IBM Power6
- Similar to the Power5, but superscalar in-order execution
- Level-1 cache size raised to 64 KBytes for instructions and for data on each core
- 65 nm process
- 5 GHz clock frequency

Multi-Core examples: IBM Power7
- 4, 6 or 8 cores
- Turbo mode deactivates 4 out of 8 cores but gives the remaining 4 cores access to all memory controllers => improves single-core performance
- Each core supports four-way SMT
- 45 nm process
- 4 GHz clock frequency

Multi-Core examples: Intel Core 2 Duo (Wolfdale)
- 2 processor cores of the Intel Core 2 architecture
- 32 KBytes data and 32 KBytes instruction cache for each core
- 6 MBytes L2 cache, shared by both cores
- 45 nm process
- 3 GHz clock frequency
[Figure: die layout with Core 1 and Core 2 beside the shared L2 cache.]

Multi-Core examples
[Figure: microarchitecture of the Intel Core 2 family (a single core). Source: c't 16/2006]

Multi-Core examples: Intel Core 2 Quad (Yorkfield)
- 2 Wolfdale dies in a multi-chip module => 4 processor cores of the Intel Core 2 architecture
- 32 KBytes data and 32 KBytes instruction cache for each core
- 6 MBytes L2 cache per die
- 45 nm process
- 3 GHz clock frequency

Multi-Core examples: Intel Core i7-3930K (Sandy Bridge E)
- 6-core processor (hexa-core)
- 32 KBytes data and 32 KBytes instruction cache for each core
- 256 KBytes L2 cache for each core
- 15 MBytes shared L3 cache
- 32 nm process
- 3.3 GHz clock frequency

Heterogeneous multi-cores
While homogeneous multi-core processors are commonly used for general-purpose computing, heterogeneous multi-core processors are seen as a future trend for embedded systems. A first member of this technology is the IBM Cell processor, containing one Power processor (Power Processor Element, PPE) and 8 dependent processors (Synergistic Processing Elements, SPE):
- PPE: based on the Power architecture, two-way SMT, controls the 8 SPEs
- SPE: contains a RISC processor with 128-bit SIMD (multimedia) instructions, a memory flow controller and a bus controller
Originally designed for the Sony PlayStation 3, the Cell processor is now used in various application domains.
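The slides show no SPE code. Purely as an illustration of what a 128-bit SIMD instruction does, the sketch below uses GCC's portable vector extensions rather than the Cell's actual spu_* intrinsics: four packed 32-bit floats are added in a single operation.

```c
/* Illustration of 128-bit SIMD semantics using GCC/clang vector
 * extensions; not Cell-specific code. */
#include <stdio.h>

typedef float v4sf __attribute__((vector_size(16)));  /* 4 x 32-bit float */

int main(void) {
    v4sf a = {1.0f, 2.0f, 3.0f, 4.0f};
    v4sf b = {10.0f, 20.0f, 30.0f, 40.0f};
    v4sf c = a + b;            /* one SIMD add: four lanes in parallel */
    for (int i = 0; i < 4; i++)
        printf("%g ", c[i]);   /* prints 11 22 33 44 */
    printf("\n");
    return 0;
}
```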

Cell Processor Die
[Figure: die photo of the Cell processor.]

GPUs
Heterogeneous many-cores: 1000 and more streaming processor cores for shading.
- First generation: special-purpose hardware for the various shading tasks
- Second generation: programmable streaming processors for pixel shading, vertex shading, ...
- Third generation: unified shaders. Example: Radeon R600 GPU

GPUs
Another example: NVIDIA GF100
- 4 Graphics Processing Clusters (GPC)
- 768 KBytes L2 cache
- 6 memory controllers

GPUs
A GPC consists of:
- a raster engine (triangle setup, rasterization, Z-management)
- a polymorph engine (vertex attribute fetch, tessellation)
- 4 streaming multiprocessors (SM) performing unified shading: vertex, geometry, raster, texture and pixel processing

GPUs
An SM consists of:
- 32 CUDA cores (Compute Unified Device Architecture)
- 16 load/store units
- 4 special function units (sin, cos, square root calculation, etc.)
=> GF100 overall:
- 4 x 4 x 32 = 512 CUDA cores
- 4 x 4 x 16 = 256 load/store units
- 4 x 4 x 4 = 64 special function units
- 4 x 4 = 16 polymorph engines
- 4 raster engines

Multi-Core discussion: performance
Thanks to multithreading in PC and server operating systems, two to four cores already increase processor throughput significantly. Exploiting eight or more cores requires parallel application programs; software development is therefore challenged to deliver the necessary number of parallel threads, either through parallelizing compilers or through parallel applications. Experience from multiprocessors shows that a moderate number of parallel threads yields a high performance improvement, but that this does not scale to higher degrees of parallelism: beginning at 4 to 8 threads, the additional improvement drops off sharply. With 8 cores, except for very compute-intensive applications (signal processing, graphics processing), some cores will be temporarily idle. Furthermore, memory bandwidth can become a bottleneck.
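The slides state this scaling limit only qualitatively. Amdahl's law (not cited on the slide, but the standard quantitative form of the argument) makes it precise: if a fraction $p$ of a program is parallelizable and $n$ cores are available, the achievable speedup is

$$ S(n) = \frac{1}{(1 - p) + \dfrac{p}{n}}, \qquad \lim_{n \to \infty} S(n) = \frac{1}{1 - p}. $$

With $p = 0.9$, for example, even arbitrarily many cores give at most a tenfold speedup, consistent with the observation above that gains flatten out beyond 4 to 8 threads.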

Multi-Core discussion: hardware
- While current multi-core processors use cache-coupled interconnects, future processors might rely on grid structures (networks on chip) to improve performance
- Adaptive and reconfigurable MPSoCs (Multi-Processor Systems-on-Chip) will gain importance for embedded systems and general-purpose computing
- Heterogeneous many-core GPUs are state of the art
- Reconfigurable cache memories might allow variable connections to different cores
- Available input/output bandwidth is still an open problem for throughput-oriented programs

Multi-Core discussion: hardware
For data access, transactional memory might be a model for future multi-core processors. Similar to database systems, a memory access sequence is organized as a transaction that is executed either completely or not at all; hardware support for checkpointing and rollback is necessary. As an advantage, concurrent access is simplified (no locks). Furthermore, fault-tolerance and dependability techniques will become more important, as the error probability increases with decreasing transistor dimensions. On-chip power management will keep the importance it already has today.
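The slides describe transactional memory abstractly. As a software analogy (my own sketch, not hardware TM and not from the slides), the C11 compare-and-swap retry loop below shows the all-or-nothing pattern: take a snapshot, compute a new value, and commit only if no other thread changed the data in between, otherwise roll back and re-execute.

```c
/* Optimistic "transaction" on one word via C11 atomics: the update
 * commits atomically or retries; a software analogy to hardware TM. */
#include <stdatomic.h>
#include <stdio.h>

static _Atomic long account = 100;

static void deposit(long amount) {
    long snapshot, updated;
    do {
        snapshot = atomic_load(&account);    /* begin: take a checkpoint */
        updated  = snapshot + amount;        /* compute inside the txn   */
    } while (!atomic_compare_exchange_weak(&account, &snapshot, updated));
    /* The commit succeeds only if no other thread wrote 'account' in
     * the meantime; otherwise the loop "rolls back" and retries. */
}

int main(void) {
    deposit(42);
    printf("balance = %ld\n", atomic_load(&account));  /* prints 142 */
    return 0;
}
```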

Multi-Core discussion: software
Currently, operating system concepts known from memory-coupled multiprocessor systems are used: the operating system scheduler assigns independent processes to the available processors. In contrast to those systems, the closer coupling of the cores in a multi-core processor changes the computation-to-synchronization ratio, allowing more fine-grained parallelism to be used. Parallel computing will become the standard programming model of the future. Most currently existing software is sequential and can thus run on only one core. Programming languages and tools to exploit the fine-grained parallelism of multi-core processors need to be developed (see the sketch below). Furthermore, software engineering techniques are needed to allow the development of safe parallel programs.
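As one existing example of such a tool (my choice of illustration; the slides name no specific language), OpenMP lets the compiler turn an annotated sequential loop into parallel work spread across the cores:

```c
/* Sequential loop parallelized across cores with one OpenMP pragma.
 * Compile with: cc -fopenmp dot.c */
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++) {        /* set up some data */
        a[i] = 1.0;
        b[i] = 2.0;
    }

    /* The runtime splits the iterations among the available cores;
     * the reduction clause combines the per-thread partial sums. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i] * b[i];

    printf("dot product = %.1f (threads available: %d)\n",
           sum, omp_get_max_threads());
    return 0;
}
```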

Multi-Core discussion: software
The development of applications for multi-core processors will become one of the main future markets for computer scientists. Today's applications have to be reworked with the goal of exploiting parallelism, gaining performance and increasing comfort. New applications that are currently not realizable due to a lack of processor performance will arise; these are hard to predict, but they must have a need for high computational performance that is reachable through parallelism. Such applications might come from speech recognition, image recognition, data mining, learning technologies or hardware synthesis.