HPM Hardware Performance Monitor for Bluegene/Q


PRASHOBH BALASUNDARAM, I-HSIN CHUNG, KRIS DAVIS, and JOHN H. MAGERLEIN, International Business Machines

The Hardware Performance Monitor (HPM) is a component of the IBM High Performance Computing Toolkit. HPM is used to instrument applications for extracting hardware performance counter data on Blue Gene/Q. The Blue Gene/Q hardware performance counter design is a marked deviation from the counters on the Blue Gene/L and Blue Gene/P systems. As the number of cores on a multi-core processor increases, the design of hardware performance counters, as well as that of other performance analysis tools, needs modification. This document presents the design, functionality, performance characteristics, and usage of the hardware performance counters on Blue Gene/Q.

General Terms: hardware counters, performance monitoring, Blue Gene/Q

Additional Key Words and Phrases: counters, metrics, tools

1. BLUE GENE/Q PROCESSOR OVERVIEW

The Blue Gene/Q processor is a system-on-chip multi-core processor hosting 18 cores. Sixteen of these 64-bit PowerPC cores are dedicated to computation, while the 17th core offloads operating system tasks. The 18th core is a spare used to improve processor yield during manufacturing. The Blue Gene/Q processor incorporates the direct memory access engine as well as the interconnection network, and includes the I/O unit, network unit, and messaging unit in addition to the processor unit. Each processor core is a 4-way multi-threaded 64-bit PowerPC microprocessor that operates at 1.6 GHz and executes instructions in order in two pipelines. One pipeline executes all integer, control, and memory access instructions; the other executes all floating point arithmetic instructions. The floating point unit is a 4-wide SIMD double precision unit known as the QPU, capable of delivering up to four fused multiply-add results per processor clock. The L1 (first level) data cache is 16 kilobytes with 64-byte lines.
The load-store interface of this processor allows loading and storing of 32 bytes per cycle. Each core has a private level 1 prefetch (L1P) unit, which accepts, decodes, and dispatches all requests raised by the core. Each entry in the L1P unit can hold 128 bytes. The L2 (second level) cache is shared: all 16 compute cores can access it. The L2 cache comprises 16 slices connected via a shared crossbar switch. Each L2 cache slice is 16-way set associative, write-back, with a capacity of 2 megabytes. To achieve even slice utilization, physical addresses are scattered across the slices using a programmable hash function. A direct consequence of this design for hardware performance analysis tools, including HPM, is that the L2 cache related hardware performance metrics are reliably available at the node level only. The L2 cache supports novel features such as memory speculation and atomic memory updates. Prefetching into the L2 cache is directed by hints from the L1P unit. The aggregate read bandwidth from all slices is GB/s, while the write bandwidth is GB/s. The list prefetch and stream prefetch modes of the L1P unit are another interesting innovation, whose goal is to provide more control over data prefetch. There are two modes of data prefetch aiming to reduce the latency of access to the shared L2 cache: the stream prefetch mode and the list prefetch mode. Data is prefetched into the L1 prefetch buffer (128-byte lines, 32 prefetch lines). If the memory access pattern is

contiguous, the stream prefetch unit can offer low-latency access to data by prefetching additional contiguous data in the same sequence. The stream prefetch unit is adaptive: since four threads per core accessing multiple streams can put increased pressure on the L1 cache, the depth of stream prefetching is adjusted so that the total depth of all streams fits within the 32 prefetch lines. The stream prefetch engine can also be programmed to reduce the length of the prefetch streams. When access to memory locations is not contiguous but the same pattern is repeated many times, list prefetch is useful. This type of access pattern is often encountered in scientific computing codes such as iterative solvers. The stream prefetch unit provides guidance to the list prefetch unit by recording the memory address pattern where cache misses occur. Once this pattern is recorded, the list prefetch unit can target the prefetch of memory addresses where cache misses are anticipated. In addition to traditional hardware performance monitoring metrics, the HPM on Blue Gene/Q incorporates counters and derived metrics composed of data from the list prefetch, stream prefetch, and other features. Though the Blue Gene/Q processor is designed as a throughput processor favouring multi-threaded application performance over single-thread performance, there are several innovations to improve the performance of threaded regions. Programming interfaces and compiler support for these features are available, and HPM supports analysis of counter data for a subset of them. The first feature designed to improve multi-thread performance is speculative memory. The L2 cache on Blue Gene/Q is multi-versioning: it can separate state changes caused by a speculative thread from the main memory state.
At the end of the speculative section, these changes are either invalidated or committed. In case of conflicting accesses to the same memory locations from different threads, the L2 can detect the conflict and the speculative runtime system can reason about the correctness of the concurrent execution. This enables features like transactional memory and speculative execution, both of which are supported by the IBM XL compilers for Blue Gene/Q. Effective use of the quad floating point unit of the Blue Gene/Q processor is essential for maximizing the performance of this MPP system. The QPX architecture and instruction set are a new Blue Gene/Q specific development. QPX follows a vector single instruction multiple data computing model: there are four elements per vector and four execution lanes, supported by a register file of 256-bit registers, each containing four 64-bit vector elements. Data manipulation across the four execution lanes is also enabled by the new instruction set through compare, convert, and move instructions. Complex arithmetic instructions allow arithmetic operations across adjacent lanes.

Fig. 1. Layout of UPC counters in a ring

Fig. 2. Hardware counter multiplexing

2. UNIVERSAL PERFORMANCE COUNTERS AND THE BGPM API

The universal performance counter (UPC) is a processor hardware component designed to collect performance events from all 17 processor cores, the L1P units, the 16 L2 units, as well as the message unit. The UPC hardware has central (node-shared) as well as distributed (core-local) components; this is a major design change from the UPC unit on Blue Gene/P. The local UPC unit is named UPC P, while the central UPC unit is named UPC C. Each UPC P unit provides counters for processor core, L1P, completed instruction, and floating point operation events. The UPC L2 unit has 16 counters for each L2 slice, while the I/O units have 43 counters. The central UPC unit (UPC C) contains 64-bit counters into which the UPC P counters are accumulated, and it provides overflow detection and counter aggregation operations. Figure 1 illustrates the layout of the UPC counter components. The UPC P and UPC C units together provide the functionality to implement three modes of operation; HPM uses the distributed mode, which collects 24 counters from each UPC P unit. Figure 2 illustrates the details of the processor event selection logic configurable by the punit logic. Blue Gene/Q's hardware performance monitor implements more than 600 events and other selected filtered opcodes. All these events contend for the 24 counter registers per UPC P unit, and not all combinations of events can be counted simultaneously. These restrictions necessitate design changes in the HPM implementation. The performance monitoring API on Blue Gene/Q is organized hierarchically, as illustrated in Figure 3. The software layers interact with the compute node kernel (the operating system on Blue Gene/Q) and other system software components through a hierarchy of application programming interfaces.
The BGPM (Blue Gene performance monitoring) API is intended for use by developers of performance tools. HPM and PAPI are intended for use by application developers and implement instrumentation functionality. The BGPM layer is used by both HPM and PAPI for accessing the hardware performance monitoring units. HPM is the end-user facing component and implements functionality to instrument application code and to control the formatting of output.

Fig. 3. HPM's software layers

The software layers used by HPM (illustrated in Figure 3) include BGPM, the universal performance counter interface, the compute node kernel, and the UPC low-level SPI. The UPC low-level SPI layer interfaces with hardware components to expose functionality to higher layers of system software. The system programming interface named UPCI (universal performance counter interface) can use both the UPC low-level interface and the compute node kernel's functions. The BGPM API uses system calls to the compute node kernel as well as UPCI API functions. The HPM software is built on the software distributed mode implemented in the BGPM API.

3. HARDWARE PERFORMANCE MONITOR

Corresponding to the thread-safe and non-thread-safe versions of the IBM XL compilers, there are two versions of HPM: a process-only version and a thread-safe version. The process-only version, named libmpihpm, is intended for MPI-only programs compiled with the non-thread-safe compiler; libmpihpm_r is the thread-safe version, designed for OpenMP applications. The HPM libraries use BGPM to initialize, start, stop, and read the hardware counters. Depending on the user-selected options, HPM calculates counter values and derived metrics and formats the output files. The HPM libraries are used in a C/C++ or Fortran application by enclosing code segments within HPM API calls. The initialization step sets up the data structures necessary for HPM to operate, and can be performed from an MPI application. The hpminit() call from C and the f_hpminit call from Fortran initialize HPM's data structures. The hpmstart (C) or f_hpmstart (Fortran) call starts HPM's counter recording. If no environment variables are set, the default event set is enabled for counting. The hpmstart and hpmstop calls enclose the code segment to be profiled.
In the multi-threaded version, the hpmstart and hpmstop calls enclose a threaded section of code. The threaded parallel region must be

started after hpmstart and stopped before hpmstop. The hpmstop function stops the currently selected sets of events and records the final counter values. An example C code is listed in Figure 4, while Figure 5 lists the Fortran example.

// Include the HPM header files
#include "hpm.h"
#include "mpi.h"

// Initialize MPI
ierr = MPI_Init(&argc, &argv);
ierr = MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
ierr = MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
hpminit();

hpmstart("MainLoop");   // MainLoop = label for the code segment
// Code to be profiled
hpmstop("MainLoop");    // stop the code segment
// Note: labels started must match the stopped code segments

// Terminate HPM
hpmterminate();
MPI_Finalize();

Fig. 4. Example usage of HPM from C code

! Fortran code to be preprocessed using CPP
! Naming the input file as filename.F or filename.F90 enables preprocessing
#include "f_hpm.h"

call mpi_init(ierr)
call mpi_comm_rank(mpi_comm_world, my_id, ierr)
call mpi_comm_size(mpi_comm_world, numprocs, ierr)
call f_hpminit()

! MainLoop = label name and 8 = length of the label name
call f_hpmstart("MainLoop", 8)
! Code to be profiled
call f_hpmstop("MainLoop", 8)

call f_hpmterminate()
call mpi_finalize(ierr)

Fig. 5. Example usage of HPM from Fortran code

The instrumented application must be linked against HPM before it is executed. Example Makefile sections for linking HPM into the application are shown in Figure 7.

// Include the HPM header files
#include "hpm.h"
#include "mpi.h"

// Initialize MPI
ierr = MPI_Init(&argc, &argv);
ierr = MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
ierr = MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
hpminit();

hpmstart("OuterLoop");    // OuterLoop = label for the outer code segment
// Code to be profiled
hpmstart("InnerLoop");    // InnerLoop = label for the inner code segment
// Code to be profiled
hpmstop("InnerLoop");     // stop the inner code segment
// Code to be profiled
hpmstop("OuterLoop");     // stop the outer code segment
// Note: labels started must match the stopped code segments

// Terminate HPM
hpmterminate();
MPI_Finalize();

Fig. 6. Nesting of HPM calls

# Declare the make variables
CC = bgxlc_r -I/bgsys/drivers/ppcfloor/comm/xl/include -Ilibmpihpm_r/src
CFLAGS = -O3 -qarch=qp -qtune=qp -qhot -qsimd=auto -qsmp=omp:noauto
LFLAGS = $(CFLAGS)

# MPI libraries
MPI_LIBS = -L/bgsys/drivers/ppcfloor/comm/gcc/lib -lmpich -lmpl -lopa \
           -L/bgsys/drivers/ppcfloor/comm/sys/lib -lpami \
           -L/bgsys/drivers/ppcfloor/spi/lib -lSPI_cnk -lSPI_upci_cnk -lrt -lpthread -lm

SYS_INC = $(MPI_INC) -I/bgsys/drivers/$(FLOOR) -I/bgsys/drivers/ppcfloor/spi/include/kernel/cnk
SYS_LIB = /bgsys/drivers/$(FLOOR)/bgpm/lib/libbgpm.a
HPM_LIB = ../../libmpihpm/src/libhpm.a

# Linking executable step
# Link executable to HPM_LIB, SYS_LIB and MPI_LIBS

Fig. 7. Linking the HPM libraries

3.1. Environment variables

HPM's functionality can be controlled using several environment variables. The list that follows documents the HPM environment variables and their functionality.

HPM_EVENT_SET =

  -1 : the default setting; corresponds to a basic set of non-multiplexed counters
   0 : multiplexed event set providing information about total cycles, instructions, and LSU events
   1 : multiplexed event set to explore branch prediction
   2 : multiplexed event set presenting data about the floating point instruction mix
   3 : multiplexed event set with a mix of different counters
   4 : multiplexed event set for stream prefetching events

HPM_SCOPE - applies only to the non-threaded version
  process : prints output per MPI process
  node : prints output per node

HPM_ASC_OUTPUT : prints output as a text file; yes (default) / no
HPM_CSV_OUTPUT : prints output as a CSV file; yes / no (default)
HPM_VIZ_OUTPUT : prints output as a viz file; yes / no (default)

The output-related environment variables are not mutually exclusive, allowing HPM to write output in several formats at the same time.

HPM_EXCLUSIVE : calculates output of outer nested sections exclusive of inner nested sections; yes / no (default)
HPM_IO_BATCH : reduces the number of output files simultaneously opened by HPM; yes / no (default)

3.2. HPM multiplexing

The BGPM layer implements a feature to multiplex hardware counters, switching between different sets of events while measuring counter data. The user still selects an event set and enables multiplexing. Since counter conflicts on Blue Gene/Q depend on the number of events and the scope of measurement (node, process, or thread level), the BGPM layer implements multiplexing so that it can be reused by user tools. When multiplexing is enabled, BGPM chooses the number of groups to be used. The period of multiplexing is set by HPM using BGPM API calls to minimize overhead while retaining accuracy; the HPM output is accurate even if the compute time is as low as a second.

3.3. Exploring stream and list prefetch using HPM

Stream prefetching on Blue Gene/Q can be programmatically modified.
When the user selects HPM_EVENT_SET=4, these prefetch events are enabled for counting. The available stream and list prefetch events are L1P streams established, L1P stream write invalidate, L1P list hit, and stream depth adaptation events. The L1P list hit metric can be used to explore whether list prefetching is effective. The L1P stream depth adaptation events give an indication of the number of times the stream depth had to be modified; this event, together with the L1P stream write invalidate event, is an indicator of the effectiveness of stream prefetch.

3.4. HPM design

The HPM header file defines a set of macros which replace HPM function calls with calls incorporating the details of the location from which the HPM calls are made. The hpminit function initializes a set of common data structures used by HPM, as well as the BGPM data structures. The main data structures used by HPM are defined in a common header file: arrays to track the code blocks' labels, the number of times each code block is started and stopped, the wall time per code block, and the counter data per code block. A function named index_from_label converts the string label entered by the user to a numeric index, which is used to tally the number of times a code block is visited as the program executes. The counter data is tracked per process for every code block. The maximum number of code blocks allowed by HPM is 20, and the maximum length of a code block label is 80 characters. To keep track of the process number and the number of cores used by a process, a set of mask arrays is defined. Each process maintains, in an array named myslots, information about which slot it should update its counter data to; once the counter data is read, the slots corresponding to the process are updated. HPM prints the labels of the counters used as descriptive text; to translate a counter number to its descriptive text, counter labels are mapped to counter numbers in a function named set_labels. When the hpmstart function is called, it converts the code block label to a code block index. It registers the source file and line number of the start call, as well as the nesting level of this hpmstart call. The slots corresponding to the process are identified.
Then the starting counter values as well as the wall clock time values are recorded for the current code block. The number of events in a specified event set is known in advance, since the event sets are hard coded. The starting values of the events recorded are updated per code block index. The hpmstop function performs steps similar to hpmstart: it converts the code block label to a code block index, reads the counter values, and records them in the array locations of the specified code block index. The hpmterminate function processes the data extracted by HPM and converts it to a form that can be printed in the various formats requested by the user. This function handles the environment variables related to output formats, the calculation of exclusive and inclusive metrics, and the writing of output in batches to reduce file system impact. hpmterminate calls a function named printderived to compute and print the derived metrics; printderived checks whether the counters needed to compute each derived metric are available in the selected event set, and if all components are available it prints the output.

3.5. HPM's overhead measurements

HPM's instrumentation overhead is minimal, since it uses hardware performance counters for measurements. To validate the low overhead, a test application was run at varying lengths of compute time from within three nested HPM regions. The run time of the original, uninstrumented code was measured first. The application was then instrumented using HPM and its run time measured with a non-multiplexed event set, and again with a multiplexed event set. The differences in run times are presented in Figure 8.

Fig. 8. HPM's overhead measurements. (The figure plots time in seconds against number of processes for three cases - without HPM, with a non-multiplexed event set, and with a multiplexed event set - with one panel for a short compute time and one for a larger compute time.)

HPM derived metric definitions:

Million Instructions Per Second : total number of instructions (in millions) divided by total time in seconds
Cycles Per Instruction : total number of cycles divided by total number of instructions
MegaFlops per second : total flops (in millions) divided by total time in seconds
Total AXU Instructions : total number of AXU instructions (in millions)
Total AXU Instructions committed per second : total number of AXU instructions committed per second
Percentage of FP instructions to total instructions : total number of AXU instructions divided by total number of all instructions, expressed as a percentage
Instructions per load/store : total number of instructions divided by the total number of load and store instructions
L1 loads per load miss : total number of L1 loads per load miss
L1 stores per store miss : total number of L1 stores per store miss
L1 cache hit percentage : (1.0 - (total L1 misses / total L1 accesses)) * 100
Percentage of branch mispredictions : total number of branches mispredicted divided by total number of branches, expressed as a percentage

Fig. 9. List of derived metrics

3.6. HPM derived metrics

Derived metrics are values computed from one or more hardware counter values. Since raw hardware counter data is hard to interpret, HPM calculates the derived metrics, which are available in the ASCII and CSV output formats. A derived metric is computed automatically if its component counters are available in the selected event set. The definitions of the derived metrics are listed in Figure 9.

3.7. HPM Peekperf output

Fig. 10. Peekperf view of HPM output

Peekperf is a graphical user interface that displays HPM data in a browsable format. When the HPM_VIZ_OUTPUT environment variable is set, HPM produces peekperf-compatible output, which displays the source code and the counter output in an integrated view.


More information

ECE 341 Final Exam Solution

ECE 341 Final Exam Solution ECE 341 Final Exam Solution Time allowed: 110 minutes Total Points: 100 Points Scored: Name: Problem No. 1 (10 points) For each of the following statements, indicate whether the statement is TRUE or FALSE.

More information

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

ASSEMBLY LANGUAGE MACHINE ORGANIZATION ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

Chapter 12. CPU Structure and Function. Yonsei University

Chapter 12. CPU Structure and Function. Yonsei University Chapter 12 CPU Structure and Function Contents Processor organization Register organization Instruction cycle Instruction pipelining The Pentium processor The PowerPC processor 12-2 CPU Structures Processor

More information

Blue Gene/Q User Workshop. Performance analysis

Blue Gene/Q User Workshop. Performance analysis Blue Gene/Q User Workshop Performance analysis Agenda Code Profiling Linux tools GNU Profiler (Gprof) bfdprof Hardware Performance counter Monitors IBM Blue Gene/Q performances tools Internal mpitrace

More information

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:

More information

Detection and Analysis of Iterative Behavior in Parallel Applications

Detection and Analysis of Iterative Behavior in Parallel Applications Detection and Analysis of Iterative Behavior in Parallel Applications Karl Fürlinger and Shirley Moore Innovative Computing Laboratory, Department of Electrical Engineering and Computer Science, University

More information

Control Hazards. Prediction

Control Hazards. Prediction Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Inside Intel Core Microarchitecture

Inside Intel Core Microarchitecture White Paper Inside Intel Core Microarchitecture Setting New Standards for Energy-Efficient Performance Ofri Wechsler Intel Fellow, Mobility Group Director, Mobility Microprocessor Architecture Intel Corporation

More information

Lecture 5: Instruction Pipelining. Pipeline hazards. Sequential execution of an N-stage task: N Task 2

Lecture 5: Instruction Pipelining. Pipeline hazards. Sequential execution of an N-stage task: N Task 2 Lecture 5: Instruction Pipelining Basic concepts Pipeline hazards Branch handling and prediction Zebo Peng, IDA, LiTH Sequential execution of an N-stage task: 3 N Task 3 N Task Production time: N time

More information

Lecture-13 (ROB and Multi-threading) CS422-Spring

Lecture-13 (ROB and Multi-threading) CS422-Spring Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue

More information

Introduction to the MMAGIX Multithreading Supercomputer

Introduction to the MMAGIX Multithreading Supercomputer Introduction to the MMAGIX Multithreading Supercomputer A supercomputer is defined as a computer that can run at over a billion instructions per second (BIPS) sustained while executing over a billion floating

More information

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures W.M. Roshan Weerasuriya and D.N. Ranasinghe University of Colombo School of Computing A Comparative

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

M7: Next Generation SPARC. Hotchips 26 August 12, Stephen Phillips Senior Director, SPARC Architecture Oracle

M7: Next Generation SPARC. Hotchips 26 August 12, Stephen Phillips Senior Director, SPARC Architecture Oracle M7: Next Generation SPARC Hotchips 26 August 12, 2014 Stephen Phillips Senior Director, SPARC Architecture Oracle Safe Harbor Statement The following is intended to outline our general product direction.

More information

ProfileMe: Hardware-Support for Instruction-Level Profiling on Out-of-Order Processors

ProfileMe: Hardware-Support for Instruction-Level Profiling on Out-of-Order Processors ProfileMe: Hardware-Support for Instruction-Level Profiling on Out-of-Order Processors Jeffrey Dean Jamey Hicks Carl Waldspurger William Weihl George Chrysos Digital Equipment Corporation 1 Motivation

More information

CUDA OPTIMIZATIONS ISC 2011 Tutorial

CUDA OPTIMIZATIONS ISC 2011 Tutorial CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

Basics of Performance Engineering

Basics of Performance Engineering ERLANGEN REGIONAL COMPUTING CENTER Basics of Performance Engineering J. Treibig HiPerCH 3, 23./24.03.2015 Why hardware should not be exposed Such an approach is not portable Hardware issues frequently

More information

Manycore Processors. Manycore Chip: A chip having many small CPUs, typically statically scheduled and 2-way superscalar or scalar.

Manycore Processors. Manycore Chip: A chip having many small CPUs, typically statically scheduled and 2-way superscalar or scalar. phi 1 Manycore Processors phi 1 Definition Manycore Chip: A chip having many small CPUs, typically statically scheduled and 2-way superscalar or scalar. Manycore Accelerator: [Definition only for this

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

CS433 Homework 2 (Chapter 3)

CS433 Homework 2 (Chapter 3) CS433 Homework 2 (Chapter 3) Assigned on 9/19/2017 Due in class on 10/5/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies

More information

Appendix C. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Appendix C. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Appendix C Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure C.2 The pipeline can be thought of as a series of data paths shifted in time. This shows

More information

IBM High Performance Computing Toolkit

IBM High Performance Computing Toolkit IBM High Performance Computing Toolkit Pidad D'Souza (pidsouza@in.ibm.com) IBM, India Software Labs Top 500 : Application areas (November 2011) Systems Performance Source : http://www.top500.org/charts/list/34/apparea

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

Spring 2011 Prof. Hyesoon Kim

Spring 2011 Prof. Hyesoon Kim Spring 2011 Prof. Hyesoon Kim PowerPC-base Core @3.2GHz 1 VMX vector unit per core 512KB L2 cache 7 x SPE @3.2GHz 7 x 128b 128 SIMD GPRs 7 x 256KB SRAM for SPE 1 of 8 SPEs reserved for redundancy total

More information

Chapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 04 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 4.1 Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for

More information

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis Bruno da Silva, Jan Lemeire, An Braeken, and Abdellah Touhafi Vrije Universiteit Brussel (VUB), INDI and ETRO department, Brussels,

More information

Carlo Cavazzoni, HPC department, CINECA

Carlo Cavazzoni, HPC department, CINECA Introduction to Shared memory architectures Carlo Cavazzoni, HPC department, CINECA Modern Parallel Architectures Two basic architectural scheme: Distributed Memory Shared Memory Now most computers have

More information

Profiling: Understand Your Application

Profiling: Understand Your Application Profiling: Understand Your Application Michal Merta michal.merta@vsb.cz 1st of March 2018 Agenda Hardware events based sampling Some fundamental bottlenecks Overview of profiling tools perf tools Intel

More information

Handout 2 ILP: Part B

Handout 2 ILP: Part B Handout 2 ILP: Part B Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level Parallelism Loop unrolling by compiler to increase ILP Branch prediction to increase ILP

More information

COSC 6385 Computer Architecture. - Multi-Processors (V) The Intel Larrabee, Nvidia GT200 and Fermi processors

COSC 6385 Computer Architecture. - Multi-Processors (V) The Intel Larrabee, Nvidia GT200 and Fermi processors COSC 6385 Computer Architecture - Multi-Processors (V) The Intel Larrabee, Nvidia GT200 and Fermi processors Fall 2012 References Intel Larrabee: [1] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M.

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

The IBM Blue Gene/Q 1 System

The IBM Blue Gene/Q 1 System The IBM Blue Gene/Q 1 System IBM Blue Gene Team [10] IBM Somers NY, USA Abstract We describe the architecture of the IBM Blue Gene/Q system, the third generation in the IBM Blue Gene line of massively

More information

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

COSC 6385 Computer Architecture - Thread Level Parallelism (I) COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month

More information

The Alpha Microprocessor: Out-of-Order Execution at 600 MHz. Some Highlights

The Alpha Microprocessor: Out-of-Order Execution at 600 MHz. Some Highlights The Alpha 21264 Microprocessor: Out-of-Order ution at 600 MHz R. E. Kessler Compaq Computer Corporation Shrewsbury, MA 1 Some Highlights Continued Alpha performance leadership 600 MHz operation in 0.35u

More information

Chapter 2. Parallel Hardware and Parallel Software. An Introduction to Parallel Programming. The Von Neuman Architecture

Chapter 2. Parallel Hardware and Parallel Software. An Introduction to Parallel Programming. The Von Neuman Architecture An Introduction to Parallel Programming Peter Pacheco Chapter 2 Parallel Hardware and Parallel Software 1 The Von Neuman Architecture Control unit: responsible for deciding which instruction in a program

More information

Performance of Computer Systems. CSE 586 Computer Architecture. Review. ISA s (RISC, CISC, EPIC) Basic Pipeline Model.

Performance of Computer Systems. CSE 586 Computer Architecture. Review. ISA s (RISC, CISC, EPIC) Basic Pipeline Model. Performance of Computer Systems CSE 586 Computer Architecture Review Jean-Loup Baer http://www.cs.washington.edu/education/courses/586/00sp Performance metrics Use (weighted) arithmetic means for execution

More information

Instruction Pipelining Review

Instruction Pipelining Review Instruction Pipelining Review Instruction pipelining is CPU implementation technique where multiple operations on a number of instructions are overlapped. An instruction execution pipeline involves a number

More information

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor 1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic

More information

Superscalar Processors

Superscalar Processors Superscalar Processors Increasing pipeline length eventually leads to diminishing returns longer pipelines take longer to re-fill data and control hazards lead to increased overheads, removing any a performance

More information

Martin Kruliš, v

Martin Kruliš, v Martin Kruliš 1 Optimizations in General Code And Compilation Memory Considerations Parallelism Profiling And Optimization Examples 2 Premature optimization is the root of all evil. -- D. Knuth Our goal

More information

Processors. Young W. Lim. May 12, 2016

Processors. Young W. Lim. May 12, 2016 Processors Young W. Lim May 12, 2016 Copyright (c) 2016 Young W. Lim. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version

More information

WHY PARALLEL PROCESSING? (CE-401)

WHY PARALLEL PROCESSING? (CE-401) PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:

More information

Intel Architecture for Software Developers

Intel Architecture for Software Developers Intel Architecture for Software Developers 1 Agenda Introduction Processor Architecture Basics Intel Architecture Intel Core and Intel Xeon Intel Atom Intel Xeon Phi Coprocessor Use Cases for Software

More information

Great Reality #2: You ve Got to Know Assembly Does not generate random values Arithmetic operations have important mathematical properties

Great Reality #2: You ve Got to Know Assembly Does not generate random values Arithmetic operations have important mathematical properties Overview Course Overview Course theme Five realities Computer Systems 1 2 Course Theme: Abstraction Is Good But Don t Forget Reality Most CS courses emphasize abstraction Abstract data types Asymptotic

More information

COSC 6385 Computer Architecture - Data Level Parallelism (III) The Intel Larrabee, Intel Xeon Phi and IBM Cell processors

COSC 6385 Computer Architecture - Data Level Parallelism (III) The Intel Larrabee, Intel Xeon Phi and IBM Cell processors COSC 6385 Computer Architecture - Data Level Parallelism (III) The Intel Larrabee, Intel Xeon Phi and IBM Cell processors Edgar Gabriel Fall 2018 References Intel Larrabee: [1] L. Seiler, D. Carmean, E.

More information

Performance Analysis with Periscope

Performance Analysis with Periscope Performance Analysis with Periscope M. Gerndt, V. Petkov, Y. Oleynik, S. Benedict Technische Universität München periscope@lrr.in.tum.de October 2010 Outline Motivation Periscope overview Periscope performance

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v7.0 March 2015 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

Final Exam Fall 2008

Final Exam Fall 2008 COE 308 Computer Architecture Final Exam Fall 2008 page 1 of 8 Saturday, February 7, 2009 7:30 10:00 AM Computer Engineering Department College of Computer Sciences & Engineering King Fahd University of

More information

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP

CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP CISC 662 Graduate Computer Architecture Lecture 13 - Limits of ILP Michela Taufer http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer

More information

EC 513 Computer Architecture

EC 513 Computer Architecture EC 513 Computer Architecture Complex Pipelining: Superscalar Prof. Michel A. Kinsy Summary Concepts Von Neumann architecture = stored-program computer architecture Self-Modifying Code Princeton architecture

More information

The University of Texas at Austin

The University of Texas at Austin EE382N: Principles in Computer Architecture Parallelism and Locality Fall 2009 Lecture 24 Stream Processors Wrapup + Sony (/Toshiba/IBM) Cell Broadband Engine Mattan Erez The University of Texas at Austin

More information

Adapted from David Patterson s slides on graduate computer architecture

Adapted from David Patterson s slides on graduate computer architecture Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual

More information

Blue Gene/Q. Hardware Overview Michael Stephan. Mitglied der Helmholtz-Gemeinschaft

Blue Gene/Q. Hardware Overview Michael Stephan. Mitglied der Helmholtz-Gemeinschaft Blue Gene/Q Hardware Overview 02.02.2015 Michael Stephan Blue Gene/Q: Design goals System-on-Chip (SoC) design Processor comprises both processing cores and network Optimal performance / watt ratio Small

More information

OpenMP Programming. Prof. Thomas Sterling. High Performance Computing: Concepts, Methods & Means

OpenMP Programming. Prof. Thomas Sterling. High Performance Computing: Concepts, Methods & Means High Performance Computing: Concepts, Methods & Means OpenMP Programming Prof. Thomas Sterling Department of Computer Science Louisiana State University February 8 th, 2007 Topics Introduction Overview

More information

TUNING CUDA APPLICATIONS FOR MAXWELL

TUNING CUDA APPLICATIONS FOR MAXWELL TUNING CUDA APPLICATIONS FOR MAXWELL DA-07173-001_v6.5 August 2014 Application Note TABLE OF CONTENTS Chapter 1. Maxwell Tuning Guide... 1 1.1. NVIDIA Maxwell Compute Architecture... 1 1.2. CUDA Best Practices...2

More information

Intel Architecture for HPC

Intel Architecture for HPC Intel Architecture for HPC Georg Zitzlsberger georg.zitzlsberger@vsb.cz 1st of March 2018 Agenda Salomon Architectures Intel R Xeon R processors v3 (Haswell) Intel R Xeon Phi TM coprocessor (KNC) Ohter

More information

Short Notes of CS201

Short Notes of CS201 #includes: Short Notes of CS201 The #include directive instructs the preprocessor to read and include a file into a source code file. The file name is typically enclosed with < and > if the file is a system

More information

COMPUTER ORGANIZATION AND DESI

COMPUTER ORGANIZATION AND DESI COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count Determined by ISA and compiler

More information

Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow

Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow Fundamental Optimizations (GTC 2010) Paulius Micikevicius NVIDIA Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control flow Optimization

More information

Kaisen Lin and Michael Conley

Kaisen Lin and Michael Conley Kaisen Lin and Michael Conley Simultaneous Multithreading Instructions from multiple threads run simultaneously on superscalar processor More instruction fetching and register state Commercialized! DEC

More information

CMSC411 Fall 2013 Midterm 2 Solutions

CMSC411 Fall 2013 Midterm 2 Solutions CMSC411 Fall 2013 Midterm 2 Solutions 1. (12 pts) Memory hierarchy a. (6 pts) Suppose we have a virtual memory of size 64 GB, or 2 36 bytes, where pages are 16 KB (2 14 bytes) each, and the machine has

More information

Superscalar Processors

Superscalar Processors Superscalar Processors Superscalar Processor Multiple Independent Instruction Pipelines; each with multiple stages Instruction-Level Parallelism determine dependencies between nearby instructions o input

More information

Porting Applications to Blue Gene/P

Porting Applications to Blue Gene/P Porting Applications to Blue Gene/P Dr. Christoph Pospiech pospiech@de.ibm.com 05/17/2010 Agenda What beast is this? Compile - link go! MPI subtleties Help! It doesn't work (the way I want)! Blue Gene/P

More information

CS201 - Introduction to Programming Glossary By

CS201 - Introduction to Programming Glossary By CS201 - Introduction to Programming Glossary By #include : The #include directive instructs the preprocessor to read and include a file into a source code file. The file name is typically enclosed with

More information

Alexandria University

Alexandria University Alexandria University Faculty of Engineering Computer and Communications Department CC322: CC423: Advanced Computer Architecture Sheet 3: Instruction- Level Parallelism and Its Exploitation 1. What would

More information