SIMinG-1k: A thousand-core simulator running on general-purpose graphical processing units


CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. 2013; 25: Published online 15 October 2012 in Wiley Online Library (wileyonlinelibrary.com)

SPECIAL ISSUE PAPER

SIMinG-1k: A thousand-core simulator running on general-purpose graphical processing units

Shivani Raghav 1,*, Andrea Marongiu 2, Christian Pinto 2, Martino Ruggiero 1,2, David Atienza 1 and Luca Benini 2
1 ESL - École Polytechnique Fédérale de Lausanne, Switzerland
2 DEIS - University of Bologna, Italy

SUMMARY

This paper introduces SIMinG-1k, a manycore simulator infrastructure. SIMinG-1k is a graphics-processing-unit-accelerated parallel simulator for design-space exploration of large-scale manycore systems. It features an optimal trade-off between modeling accuracy and simulation speed. Its main objectives are high performance, flexibility, and the ability to simulate thousands of cores. SIMinG-1k can model different architectures (currently, we support ARM (Available from: arm.doc.ddi0100i/index.html) and Intel x86) using a two-step approach in which an architecture-specific front end is decoupled from a fast, parallel manycore virtual machine running on a graphical processing unit platform. We evaluate the simulator for target architectures with up to 4096 cores. Our results demonstrate very high scalability and almost linear speedup as the number of simulated cores increases. Copyright 2012 John Wiley & Sons, Ltd. Received 7 January 2011; Revised 13 June 2012; Accepted 6 September 2012

KEY WORDS: simulation; parallel simulator; manycore; GPU; CUDA

1. INTRODUCTION

In almost every computing domain, modern system design and application development methodologies are largely based on simulation. Virtual platforms are extensively used for early software development before the real hardware is available. They are also needed to optimize the parallelization and hardware resource utilization of the application once the hardware is available. During the last decade, the design of integrated architectures has been characterized by a paradigm shift. Boosting the clock frequencies of monolithic processor cores has clearly reached its limits, and designers have turned to multicore architectures to satisfy the growing computational needs of applications within a reasonable power envelope. To meet the ever-increasing demand for peak performance while fitting tight power budgets, there is a clear trend towards simplifying the core microarchitecture. Using elementary cores (e.g. in-order execution, no sophisticated circuitry for branch prediction, instruction-level parallelism, etc.) allows significant reductions in consumed power, which in turn allows integration of a higher number of cores on a single chip, providing higher peak performance. Future manycore processors will thus embed thousands of simple cores and memories on a single chip, connected through a network-on-chip more than a hundred times faster than traditional off-chip interconnections.

*Correspondence to: Shivani Raghav, Embedded Systems Laboratory - École Polytechnique Fédérale de Lausanne, 1018, Switzerland. shivani.raghav@epfl.ch

Copyright 2012 John Wiley & Sons, Ltd.

A similar trend is clearly foreseeable in every computing domain, from High Performance Computing (HPC) to embedded systems. Examples of such architectures include on-chip manycore accelerators such as the Hypercore Architecture Line from Plurality [1], Platform 2012 [2], or future evolutions of Intel's prototypes Larrabee [3] and the Single-Chip Cloud Computer [4]. Dark silicon pushes innovation towards specialization, where a single chip will include a spectrum of hardware accelerators to access and manipulate the data of cloud workloads with minimal energy.

Simulation and virtual prototyping technology must obviously evolve to tackle the numerous challenges inherent in simulating such highly parallel architectures. Current state-of-the-art sequential simulators use SystemC [5], binary translation, smart sampling techniques, or tuneable abstraction levels for hardware description. These simulation technologies typically have to trade simulation accuracy against simulation speed. Because very low-level hardware operations are accurately modeled, simulation is slow. This can lead to unacceptable performance when simulating a huge number of cores.

Simulating a parallel system is an inherently parallel task. Individual processor simulation may proceed independently until the point where communication or synchronization with other processors is required. This is the key idea behind parallel simulation technology, which distributes the simulation workload over parallel hardware resources. Parallel simulators exploit the availability of multiple physical processing nodes to increase the simulation rate. However, this requirement may turn out to be much too costly when server clusters or computing farms are adopted as the host for running simulations. The high cost in terms of increasing latency and decreasing bandwidth typically leads to poor scalability, because of the synchronization overhead incurred when increasing the number of processing nodes.

The development of computer technology has recently led to an unprecedented performance increase of general-purpose graphical processing units (GPGPU). Modern GPGPUs integrate hundreds of processors on the same device, communicating through low-latency and high-bandwidth on-chip networks and memory hierarchies. This allows cutting interprocessor communication costs by orders of magnitude with respect to server clusters. Additionally, such scalable computation power and flexibility are delivered at a rather low cost by commodity graphics processing unit (GPU) hardware. With the improvement in GPU hardware performance, their programmability has also increased significantly in the last 5 years. This has led to the diffusion of computing clusters based on such manycores, providing inexpensive HPC solutions to a wide community. This scenario motivated our idea of developing a novel parallel simulation technology that exploits the computational power of widely available and low-cost GPUs.

We developed a new simulation technology to deploy a parallel simulator for on-chip 1000-core systems on top of GPGPUs. Although we are currently researching a methodology targeted at full-system simulation [6], the specific contribution of this paper focuses on Instruction Set Simulation (ISS). In the design of our simulation environment, we provide the capability of easily simulating target processors from different computing domains.
We use a two-step approach in which an architecture-specific front end is decoupled from a target-independent virtual ISS (VISS) that is parallelized for fast simulation on the GPU platform. To better explain our methodology, we present the implementation and results of two ISS: an x86 ISS, representative of the HPC domain, and an ARM ISS, representative of the embedded domain. We previously published our early implementation and results for GPGPU-based simulation of the ARM and x86 ISAs [7]. Our implementation of the x86 Instruction Set Architecture (ISA) does not include the extended floating-point (x87) and Single Instruction Multiple Data (SIMD) subsets. In that work we developed two separate ISS, each specifically optimized for execution on an NVIDIA [8] GeForce GTX 295 GPU. This approach, however, presents two main limitations. First, supporting a different ISA requires developing a new ISS from scratch and arduously tailoring its implementation to the peculiarities of the GPU hardware. This requires devising optimizations that are specific to the target processor, and implies significant effort and time (design, implementation, and debugging). Second, simulating a complex ISA (i.e. x86) typically requires many conditional instructions to manage all of the possible variants in the control flow. This is particularly relevant in both the decode and execution phases of a processor pipeline.

Variable-length decoding requires sophisticated check patterns. These check patterns depend on fields of the target instruction's opcodes that are unknown until runtime. When such check patterns are implemented in the decoding phase of the pipeline, the use of control-flow branches increases, and thus the possibility of concurrent operation on the GPU decreases. Similarly, depending on the type of application running on the simulator, cores can fetch different instructions, which are implemented using control-flow operations in the execution phase of the pipeline. Any time cores perform divergent flow operations on the GPU, performance suffers because of intensive processor stalls.

In this paper, we overcome these limitations and significantly extend our previous work by taking a different approach aimed at both easing the development of a new ISS and maximizing the performance of a core simulation engine running on the GPGPU. This simulation engine has been designed as a highly efficient, target-independent VISS, capable of executing a small yet comprehensive set of generic instructions found at the heart of any ISA. Our VISS design leverages a three-stage pipeline (fetch, decode, and execute), the simulation of each stage being carefully optimized to achieve maximum efficiency out of the underlying GPU hardware. Developing simulation support for a new ISA on our infrastructure only requires building a small architecture-specific translator. The latter operates as a post-pass compiler and transforms a compiled program for the target architecture into its equivalent in the VISS format.

However, designing a parallel VISS for manycores running on GPUs is not a straightforward task. The main advantage of using the GPU as a host platform is its high memory bandwidth and large number of cores, capable of running thousands of hardware thread contexts in Single Instruction Multiple Thread (SIMT) fashion. Implementing a simulator for such a platform poses several challenges, such as control flow divergence, the high latency of global memory, bank conflicts, and slow host CPU to GPU communication. In Section 3, we explain each of these limitations in detail and provide solutions to overcome them for effective simulation. It is also important to remember that running a data-parallel program on top of the manycore simulator provides very high performance and scalability, whereas a task-parallel program entails an overhead because of control flow divergence, which is known to be detrimental to GPU performance. At the present stage of development, we only provide application-level simulation. The behavior of the Operating System (OS), input/output (I/O), and device modeling is not yet supported and is part of our future work. To enable simulation of parallel shared memory applications, we provide support for OpenMP-style programming. In particular, we discuss an implementation of locking primitives that leverages Central Processing Unit (CPU) GPU synchronization. Our experimental results confirm the soundness and efficiency of our solution.

The rest of the paper is organized as follows. In Section 2, we discuss related work, and Section 3 briefly explains the GPU architecture and the Compute Unified Device Architecture (CUDA) programming model. Section 4 describes the design and implementation of our manycore simulator, SIMinG-1k.
In Section 5, we evaluate the performance and scalability of SIMinG-1k and also compare our results with a commercial, state-of-the-art sequential simulator. Finally, Section 6 concludes the paper.

2. GRAPHICAL PROCESSING UNIT ARCHITECTURE AND COMPUTE UNIFIED DEVICE ARCHITECTURE PROGRAMMING MODEL

The Fermi-based GPU used in this work is an NVIDIA GeForce GTX 480, a two-level shared-memory parallel machine comprising 480 Streaming Processors (SPs) organized in 15 Streaming Multiprocessors (SMs). SMs manage the execution of programs using so-called warps, groups of 32 threads. Each SM features two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently. All instructions are executed in an SIMD fashion, where one instruction is applied to all threads in a warp. This execution method is called SIMT. All threads in a warp execute the same instruction or remain idle (different threads can perform branching and other forms of independent work) [8]. In the Fermi architecture, each SM has 64 KB of on-chip memory that can be configured as 48 KB of shared memory with 16 KB of L1 cache, or as 16 KB of shared memory with 48 KB of L1 cache. Fermi also features a 768 KB unified L2 cache that provides efficient data sharing across the GPU. CUDA is the software architecture for issuing and managing computations on the GPU [9].
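On Fermi-class devices, this shared-memory/L1 split is selected per kernel from the host. The fragment below is only an illustration of the corresponding CUDA runtime call (the kernel name viss_kernel is a placeholder of ours, not part of SIMinG-1k): a simulator that stages per-core contexts in shared memory would request the 48 KB shared / 16 KB L1 configuration.

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void viss_kernel() { /* placeholder for the simulator kernel */ }

    int main() {
        // Prefer 48 KB shared memory / 16 KB L1 for this kernel (Fermi and later).
        cudaError_t err = cudaFuncSetCacheConfig(viss_kernel, cudaFuncCachePreferShared);
        if (err != cudaSuccess)
            std::printf("cache config failed: %s\n", cudaGetErrorString(err));
        return 0;
    }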

3. DESIGN AND IMPLEMENTATION

In this section, we describe the implementation of our GPU-based manycore instruction set simulator. The goal is to simulate on-chip manycore accelerators with thousands of cores by modeling the functional correctness of processor pipelines and their program-visible state. The key idea behind our approach is to identify the inherent parallelism in manycore architecture simulation and execute it efficiently on top of highly parallel GPGPU hardware. Despite the parallel nature of manycore simulation, there are many issues potentially hindering the effectiveness of the approach that need to be taken into account. Besides the evident necessity of carefully designing intercore communication and synchronization, it is also extremely important to consider some peculiarities of GPGPU hardware and the CUDA programming model. Some of the key implementation issues for performance are the following:

Control Flow Divergence - Because of the SIMT nature of GPU architectures, divergent control flow in CUDA programs is detrimental to GPU performance. In parallel ISA simulation, many conditional instructions are typically required to model the decoding and execution stages of the pipeline. Moreover, supporting a large number of simulated instructions during the execution phase results in a high probability of different ISS fetching different instructions at the same time. We describe the steps taken to optimize the performance of our simulator in this respect later in this section and in Section 3.1.

High Latency of Global Memory - GPU global memory has a very high latency of several hundred cycles. In parallel ISA simulation, the contexts of all simulated cores are frequently accessed, which makes them good candidates for placement in the faster shared memory. This, however, requires laying out the corresponding data structures over memory banks in a smart manner to better utilize memory bandwidth. We minimize the number of bus transactions by coalescing memory accesses, as described in Section 3.2.1.

Bank Conflicts in Shared Memory - The fast GPU shared memory leverages a multiported, multibanked design to improve memory access parallelism. Accesses by multiple threads within the same bank cause a conflict. In Section 3.2.1, we describe how we eliminate such conflicts and achieve optimal performance.

Slow GPU CPU Interaction - CPU GPU interaction is a costly process, and CUDA and the GPU hardware do not provide any means to synchronize threads among multiprocessors. This poses a serious performance bottleneck when simulating shared memory applications that use locks and other synchronization primitives, because these may require synchronization among thread blocks, which is not natively supported by CUDA and the GPU hardware and can only be achieved through interactions with the CPU. We describe in Section 3.2.2 how we deal with this issue.

To address the first challenge, we design our simulation framework so as to minimize frequent divergent execution flows by construction. Intuitively, the higher the number of instructions supported by every ISS, the higher the probability that two ISS will be executing different instructions, and as a consequence that the simulated pipelines will be subject to stalls implied by divergent instruction streams. To mitigate this effect, we take a two-step approach (Figure 1). At the heart of our proposal sits a VISS, which supports a minimal yet comprehensive set of instructions.
More complex instructions found in the targeted ISA can be decomposed into a set of these simple micro-operations. This is carried out statically on the host CPU platform. More precisely, an offline architecture-specific translation step transforms the target program object code (i.e. the result of a standard compilation stage for the target architecture) into the generic micro-operations supported by our VISS. This methodology has two important advantages. First, it enables easier creation of simulation infrastructures for several different architectures and makes their emulation on the GPU more portable: only one generic implementation of the virtualized ISS (VISS) is required, and it can be used in conjunction with any architecture-specific translation technique. Second, our methodology improves simulation performance on GPGPUs.

Figure 1. Manycore simulation workflow. CUDA, Compute Unified Device Architecture; VISS, Virtual Instruction Set Simulation; GPU, graphical processing unit; SM, Streaming Multiprocessors.

We carefully design the micro-operation encoding format in such a way that instructions can be fetched and decoded in a single parallel step on thousands of virtualized ISS, hence providing optimal opportunity for parallel execution on the GPU. For example, we have previously experienced [7] that variable-length x86 instruction decoding requires a variable number of steps depending on opcodes and addressing modes. This makes it one of the most difficult and expensive parts of parallel x86 simulation on the GPU, because of frequent control flow divergence. With our new approach, all of the complex decoding work has been shifted to the offline translation phase. The highly optimized VISS runs much faster on GPU hardware, as it now benefits from the high concurrency enabled by the new design.

As explained in Section 1, our goal is a simulation technology that can model on-chip manycore accelerators from different computing domains. To assess the soundness and flexibility of our methodology, we describe the implementation of two ISAs: a Reduced Instruction Set Computing (RISC)-like (ARM) architecture targeting the embedded system domain, and a simple x86 architecture similar to those used in manycores from the HPC domain [3, 4].

To estimate the execution cycle count on the two architectures, we adopt a timestamping technique. In general, a single instruction from a target ISA is translated into multiple instructions in the VISS format. We emit a timestamp before the first and after the last VISS instruction. The interpretation phase in the simulator implements a statistics module, which is in charge of gathering the most common processor metrics, such as overall execution cycles, active cycles, and idle cycles. When a begin timestamp is encountered in the interpretation phase, the statistics module increments the source instruction count. The end timestamp is annotated with the source instruction latency (we refer to the instruction latencies presented in the x86 [10] and ARM [11] datasheets). The latency information is then used by the statistics module to increment the instruction cycle count. It is worth recalling here that manycore Systems-on-Chip (SoCs) employ simple processors to achieve the required performance/watt targets. Thus, both the ARM and x86 processors considered in this work are simple, in-order cores, without sophisticated circuitry for branch prediction, ILP extraction, and so forth.

This makes our technique for performance profiling accurate enough to estimate processor behavior. Clearly, memory and interconnect effects may generate stalls and delays that affect these measures. However, in this work, we focus on ISS simulation and assume that memory transactions have a deterministic duration because of the presence of per-core scratchpad memory. Cache, Network-on-Chip (NoC), and main memory effects are accounted for in separate modules of our simulation framework, which model the behavior of these system components. The interested reader is referred to our proposal for full-system simulation in [6]. The following subsections describe in more detail the translation and interpretation phases for both ISS.

3.1. Translator

One of the main tasks of the translator is to split each of the source ISA (x86 or ARM) instructions into simpler VISS instructions called micro-ops. The set of micro-ops is chosen to be much smaller than the large number of instructions found in ISAs such as x86 and ARM. Currently, our VISS executes 18 instructions, empirically determined as the minimum set of basic operations needed to support most of the instructions in the x86 and ARM ISAs. It is easy to extend the set of micro-ops to support instructions from more complex ISAs. However, it is important to keep in mind that the more complex the simulated ISA (and thus the minimum number of necessary micro-ops), the higher the likelihood of control flow divergence during the execution stage of the core pipeline.

The translator works as a post-pass compiler over a compiled x86/ARM program, and it operates in a similar manner to a standard disassembler. First, the compiled x86/ARM program is decoded into a stream of instructions from the target ISA. Instruction opcodes are used as an index into the instruction tables. Operands are then parsed on the basis of the corresponding ISA specifications. Once the disassembled instruction stream has been reconstructed, we use short custom functions to convert it into a corresponding stream of VISS micro-ops. Micro-ops are 64 bits long; therefore, instructions with embedded long (32-bit) immediate values can easily be transformed into an intermediate form that preserves the long immediates. Figure 2 shows the format of the 64-bit intermediate micro-operations. It is important to note that simple instructions (such as a single-register transfer) can be translated using a single micro-op, whereas more complex instructions (such as those requiring calculation of various addressing modes) may need several micro-ops.

One important task of the translator is re-mapping the original register allocation for the target architecture's register file onto the VISS register file. The latter consists of 32 general-purpose registers. The first 16 (r0 r15) directly map onto the corresponding registers of the target architecture (e.g. only the first 16 ARM registers are made visible to the program/compiler).

Figure 2. Instruction format for intermediate micro-operations.
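Since the figure itself is not reproduced here, the sketch below gives one plausible C rendering of such a 64-bit micro-op word. The field names and widths are our assumptions; only the ingredients named in the text (an opcode drawn from the ~18 micro-ops, register operands, the cc bit, the three sf bits, and room for a 32-bit immediate) are taken from the paper.

    #include <stdint.h>

    /* Hypothetical layout of a 64-bit VISS micro-op; field widths are illustrative. */
    typedef struct {
        uint64_t opcode : 6;   /* one of the ~18 VISS micro-ops                  */
        uint64_t rd     : 5;   /* destination register (r0-r15, t1-t16)          */
        uint64_t rs1    : 5;   /* first source register                          */
        uint64_t rs2    : 5;   /* second source register                         */
        uint64_t cc     : 1;   /* if set, execute only when t16 holds true       */
        uint64_t sf     : 3;   /* which flags to update at run time              */
        uint64_t unused : 7;
        uint64_t imm    : 32;  /* long (32-bit) immediate, when present          */
    } micro_op_t;

    /* A complex source instruction expands into several such words; with purely
       illustrative mnemonics, the ARM load "LDR r0, [r1, #8]" could become:
           t1 <- ADDI r1, 8     ; compute the effective address
           r0 <- LOAD [t1]      ; scratchpad access                              */

Keeping every micro-op in one fixed-size word is what allows the fetch and decode steps to be performed in a single, divergence-free pass across thousands of VISS instances.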

The other 16 registers (t1 t16) are used as temporary registers for intermediate operations and special usage. Register t14 is used as the program counter of the VISS, t15 is dedicated as the ISS flag register, and t16 is allocated for condition-checking operations. The translator also generates final addresses for all the symbols in the program according to the architecture's memory map, specified through a linker script. Thus, all code segments, global, initialized, and uninitialized data, as well as other special sections, are mapped onto the Scratch Pad Memories (SPMs) according to the global address space specified through the linker script.

Some arithmetic or logic instructions in any ISA require setting flag (condition code) values. By decoding an instruction, we know whether it requires a conditional check for a set of flags. This can easily be determined offline in our translation phase. On the basis of this decoded information and of the result of the operation's execution (at runtime), the bits in the flag register are updated. Modeling this part entirely in the translation phase, however, would require additional operations. Because the rules to compute flag status are quite complex, generating the necessary micro-ops for this computation would be an inefficient solution. Therefore, we implement the conditional check for flag updates in the interpretation phase at runtime. We use three bits (the sf bits in Figure 2) to pass the information decoded by the translator (the type of flags that require a conditional check) to the interpretation phase. If at runtime any of these bits is set, the set-flag logic inside the GPU simulator is called; it checks the condition and sets the corresponding bit in the flag register. For conditional instructions, the translator implements the condition-checking mechanism by generating specific additional micro-ops. These store the result of the condition check inside a dedicated register (t16). In addition, one bit (the cc bit in Figure 2) indicates whether condition evaluation is required for a particular instruction. If the cc bit is set and register t16 is true, the specified condition is met; otherwise, the interpreter skips the operation.

To provide a qualitative measure of the effort spent in developing a target-specific translator, we report the lines of code (LoC) for the two supported architectures. The ARM translator is composed of 1800 LoC, whereas the x86 translator, which handles a more complicated ISA, required 3200 LoC.

3.2. Virtual instruction set simulation: compute unified device architecture implementation

In this section, we describe the core VISS, written in CUDA, that accepts the micro-op format and executes it on the GPU. The VISS (Figure 1) fetches, decodes, and emulates the execution of individual intermediate micro-ops. Thousands of cores with associated instruction and data scratchpads (SPMs) can be simulated by mapping each instance of the VISS to a CUDA thread. Each virtual core is assigned a private context structure, which represents the current CPU state. Initially, host memory is allocated for a context structure that represents one core. The main (host) program populates this data structure and then copies it to the GPU global device memory. When instantiating N virtual cores, the global memory contains an array of N context structures, whose members constitute the register file, flags, program counter, and other state information. Each VISS exploits a configurable amount of code and data scratchpad memory.
All memory references are handled by a dedicated code portion of the ISS that models the behavior of the scratchpads. Each VISS repeatedly fetches the loaded byte code of micro-operations for its assigned simulated core. As on real hardware, the instruction is fetched, decoded, and executed at run time. Each VISS core updates its simulated register file and program counter until the program has finished. For all instructions requiring memory references, the traces are sent to input/output buffers that provide a generic interface towards memory. This provides a flexible communication infrastructure in case the core simulator needs to be plugged into some other architectural component simulator (e.g. the cache) [6]. The simulation loop of the VISS is made up of three phases, each corresponding to one main step of a classical processor pipeline: fetch, decode, and execute. During the first step, instructions are fetched from memory to be further decoded and executed. The micro-op format chosen for the translation makes it straightforward to decode the opcode and operands in a single step (executed in parallel over multiple VISS). In the execution step, the previously extracted opcode and operands are used to simulate the target instruction semantics.
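The CUDA sketch below illustrates this organization: an array of per-core contexts in global memory and a kernel in which each thread runs the fetch/decode/execute loop of its assigned core. It is a minimal sketch under our own assumptions; the structure members, opcode cases, and halt condition are simplified placeholders rather than the actual SIMinG-1k source.

    #include <cuda_runtime.h>
    #include <stdint.h>

    #define NUM_REGS 32              /* r0-r15 plus t1-t16                          */
    #define PC       29              /* t14 (index 29) acts as the program counter  */

    struct CoreCtx {                 /* program-visible state of one virtual core   */
        uint32_t regs[NUM_REGS];     /* t15 = flags, t16 = condition-check result   */
        int      halted;
    };

    __global__ void viss_run(CoreCtx *ctx, const uint64_t *code, int n_cores) {
        int core = blockIdx.x * blockDim.x + threadIdx.x;
        if (core >= n_cores) return;
        CoreCtx c = ctx[core];                      /* copy-in of this core's context */
        while (!c.halted) {
            uint64_t op  = code[c.regs[PC]];        /* fetch                           */
            uint32_t opc = (uint32_t)(op & 0x3F);   /* decode in a single step         */
            switch (opc) {                          /* execute                         */
                case 0: /* ADD: update registers and, if the sf bits ask, the flags */ break;
                case 1: /* LOAD/STORE: forwarded to the scratchpad model             */ break;
                /* ... remaining VISS micro-ops ... */
                default: c.halted = 1; break;
            }
            c.regs[PC] += 1;                        /* next micro-op (branches differ) */
        }
        ctx[core] = c;                              /* copy-out of the updated context */
    }

On the host side, the main program fills an array of n_cores such structures, copies it to device memory with cudaMemcpy, launches the kernel, and copies the contexts back, matching the description above.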

The actual instruction execution is modeled with a C switch/case construct. This is translated by the CUDA compiler into a series of conditional branches, which are taken depending on the decoded instruction. Intuitively, the parallel application running on the simulated architecture has a great impact on simulator performance. When running data-parallel applications, all cores simulate the same instruction flow. This matches the native GPU SIMT model of execution and results in high parallelism. On the contrary, when simulating a task-parallel application, simulated cores execute distinct instruction streams, which is highly likely to result in control flow divergence during the execution stage of the pipeline. This implies a performance loss because of the serialization of thread execution within a warp.

3.2.1. CPU context allocation

One of the key concerns of CUDA programming is to ensure optimal utilization of memory bandwidth. When the threads in a warp access the same memory block, the number of memory transactions for the operation is minimized. Given the criticality of memory accesses and their impact on simulation performance, it is of the utmost importance to carefully design the layout of the ISS execution contexts in memory. Each core context can be represented with an aggregate data structure (i.e. an array of registers and flags). Because each simulated core is mapped to a CUDA thread, parallel ISS execution implies concurrent accesses to core contexts, which we want to service in the most efficient manner. Every ISS frequently accesses its execution context during program execution, so it is beneficial to place the corresponding data structure in the low-latency shared memory rather than accessing it from global memory. This requires explicit copy operations, whose cost can be minimized if we design the matrix layout according to its access pattern. Let us take the example of placing context arrays in memory in row order, as shown in the upper part of Figure 3(a). Here, different rows represent different context arrays (one per simulated core), and the elements of a row are the registers of that core's context. Suppose that a data-parallel application is running on the simulator and that different cores access the same register (from their private contexts) at the same time. This scenario leads to the inefficient access pattern shown in the lower part of Figure 3(a). Laying out context arrays vertically in memory is not directly feasible, so we aggregate all context arrays within a single data structure (i.e. a matrix) to obtain this goal (see Figure 3(c)). Padding is employed whenever the number of simulated cores is not an integer multiple of the size of a CUDA warp (i.e. a block of 32 parallel threads). Ensuring that each warp accesses a single line when accessing global memory enables memory coalescing, a well-known optimization for effective memory bandwidth exploitation in CUDA programming.

Figure 3. (a) Scanning matrices row-major leads to memory bandwidth waste; (b) scanning matrices column-major minimizes memory transactions; (c) a matrix representing the execution contexts of the simulated cores.
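A minimal illustration of the two layouts of Figure 3, with names and constants of our own choosing:

    #define NUM_REGS       32
    #define NUM_CORES_PAD  4096   /* simulated cores, padded to a multiple of 32 (warp size) */

    /* Row-major: the registers of one core are contiguous, so when the 32 threads
       of a warp all read the same register they touch 32 different memory lines.  */
    #define CTX_ROW(core, reg)   ((core) * NUM_REGS + (reg))

    /* Column-major (the layout adopted here): the same register of all cores is
       contiguous, so a warp reading register `reg` falls on one memory line and
       the accesses coalesce into a single transaction.                            */
    #define CTX_COL(core, reg)   ((reg) * NUM_CORES_PAD + (core))

    __device__ uint32_t read_reg(const uint32_t *ctx_matrix, int core, int reg) {
        return ctx_matrix[CTX_COL(core, reg)];
    }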

More precisely, with the proposed representation of execution contexts, concurrent ISS access memory in column-major order. This results in separate threads in a warp accessing the same memory line, which is served with a single memory transaction, as shown in Figure 3(b). A similar copy-out scheme is employed at the end of the kernel execution to restore the contexts in global memory. Another important optimization is to ensure that there are no bank conflicts in shared memory. Accesses by multiple threads to a single data element within the same bank cause conflicts and result in longer transactions. Hence, we place our CPU context data structure in shared memory in such a way that each access of a simulated core to its context register file results only in linear addressing [9]. As a result, all threads of a half-warp access different banks, and we do not see bank conflicts.

3.2.2. Synchronization

One of the main requirements of shared memory parallel programming is synchronization primitives such as locks. A multicore simulation should therefore provide an effective intercore synchronization mechanism. To support shared memory synchronization, we translate application-level primitives into micro-ops corresponding to standard hardware primitives such as test-and-set, spin locks, wait-for-event, and signal-event. On the simulator side, these micro-ops are implemented using CUDA atomic instructions [8] (a minimal sketch of such a lock is given after the list of approaches below). We implement wait-for-event and signal-event using an event table in GPU global memory. If a simulated core has been waiting for an event for a certain period, it voluntarily yields the processor by freezing its instruction pointer. The event table entry needed to resume execution is set by another core simulating a signal-event instruction.

Because of the lack of support for interblock synchronization in CUDA, supporting synchronization among all simulated cores is a challenging task. In a naive implementation where locks are implemented with busy waiting, we may easily incur deadlocks when simulating a higher number of cores than the physical GPU processors available. The GPU hardware scheduler selects thread blocks for execution on the basis of the available computational resources. If the number of threads in all blocks is higher than the number of processors, only a subset of the blocks can execute, whereas the remaining blocks wait until the first set finishes its execution. This generates a deadlock. To address this issue, we consider three different approaches:

1. Core Yield at Instruction Level - In this case, a synchronization point with the host side is created after every single simulated instruction. To do this, we save the ISS context state in GPU global memory and then perform a sort of global (barrier) synchronization by terminating the CUDA kernel (i.e. our simulator) and returning execution to the host side. Because the amount of computation (i.e. simulating a single instruction) is extremely small in comparison with the communication between the host CPU and the GPU device, this approach shows a high performance overhead because of frequent synchronization with the host side.

2. Core Yield at Synchronization Level - In this approach, synchronization between different CUDA blocks is carried out only when interblock communication is necessary. Only synchronization instructions between different cores and wait-on-response instructions from a simulated core residing on a currently inactive block may cause deadlock problems.
Thus, by identifying such potentially dangerous instructions and yielding the core simulation only upon their occurrence, we achieve considerable performance gains.

3. Core Yield at Timeout - Here, we further optimize performance by yielding the core simulation only when the synchronization instruction occurs within a simulated core residing on an inactive block. Because it is difficult to determine in advance whether the intercore communication occurs between cores simulated on active or inactive blocks, we adopt a timeout mechanism. At any point, if the simulator discovers that a particular simulated core has been waiting for an event for longer than a certain period, it sends a trap that yields the simulation of the waiting core. This allows the simulation of previously inactive blocks to proceed, thus removing the deadlock.

We provide an evaluation of these approaches in Section 4.3.
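For concreteness, the sketch referred to above shows how a test-and-set lock can be built from a CUDA atomic, together with a bounded spin that reports failure after a timeout so that the caller can freeze the waiting core's instruction pointer and yield its simulation. This is our own illustrative code under assumed names and a simplistic timeout policy, not the SIMinG-1k implementation.

    /* Test-and-set acquire on a lock word kept in GPU global memory. */
    __device__ bool try_lock(int *lock) {
        return atomicExch(lock, 1) == 0;          /* true if we obtained the lock   */
    }

    __device__ void unlock(int *lock) {
        atomicExch(lock, 0);
    }

    /* Bounded spin: give up after `timeout` attempts so the simulated core can be
       yielded, letting thread blocks that are not yet resident make progress and
       avoiding the deadlock described above.                                       */
    __device__ bool lock_or_yield(int *lock, int timeout) {
        for (int i = 0; i < timeout; ++i)
            if (try_lock(lock)) return true;
        return false;                             /* caller yields the waiting core */
    }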

4. EXPERIMENTS AND RESULTS

In this section, we present a set of experiments aimed at evaluating the performance of SIMinG-1k. We carried out three different sets of experiments:

1. The first set of experiments aims at measuring the performance of our VISS. We measure the overhead (or performance loss) induced by the translation process in both the ARM and x86 front ends. As the main performance metric, we adopt Simulated Million Instructions Per Second (S-MIPS), simply computed as the total number of dynamic instructions present in an application binary divided by the (host) wall-clock time, which provides an intuitive measure of the achievable throughput. We calculate S-MIPS for a set of benchmarks compiled for execution on the VISS. We also present the overhead of the translation process and show the ratio between the number of original assembly instructions and the number of translated micro-operations. The results of these experiments are described in Section 4.1.

2. The second set of experiments is targeted at comparing our simulation methodology against state-of-the-art simulation techniques. The Open Virtual Platform Simulator (OVPSim) is a representative example of a modern simulation technology based on binary translation and is used for performance comparison in our experiments. In our results, we demonstrate that parallel simulation on GPGPUs is not only feasible for thousands of cores but also, under certain circumstances, profitable compared with binary translation technology. The results of these experiments are described in Section 4.2.

3. Finally, a third set of experiments aims at evaluating the effectiveness of our support for shared memory synchronization primitives.

In Table I, we list the benchmarks adopted for our experiments. At the present stage of this work, we lack some architectural support in SIMinG-1k (i.e. timers and interrupts) for the implementation of a full-fledged OS. Because most applications found in standard benchmark suites implement parallelism through library calls requiring some form of OS support, we are not currently capable of running these applications as-is. At the moment, parallel applications in our framework can be encoded using the popular OpenMP programming model [12]. OpenMP directives are translated into calls to a runtime library during compilation. We have implemented a custom version of this support library that bypasses the OS and directly leverages the low-level parallelization services provided by our simulator. The considered benchmarks include complete programs and representative computational kernels found at the heart of many embedded and scientific applications. The first five kernels are extracted from a complete JPEG decoder [13] and from the OpenMP Source Code Repository [14] benchmark suite. These kernels are parallelized with OpenMP and are used to test the scalability of SIMinG-1k (see Section 4.1). The parallelization scheme adopted for these five benchmarks simply focuses on evenly dividing the total loop iterations among all participating processors (OpenMP static loop scheduling). The last two benchmarks are used to compare the performance of SIMinG-1k against OVPSim. They were extracted directly from the OVPSim suite and run unmodified (they are used in the OVP release to test simulation speed).

Table I. Benchmarks.
   Benchmark                                  Acronym   Source
1  Inverse Discrete Cosine Transform (DCT)    IDCT      JPEG Decoding
2  Luminance Dequantization                   DQ        JPEG Decoding
3  Background Subtraction                     NCC       Normalized Cut Clustering Algorithm
4  Matrix Multiplication                      MMULT     OpenMP Source Code Repository
5  Fast Fourier Transform                     FFT       OpenMP Source Code Repository
6  Dhrystone                                  DHR       OVPSim suite
7  Fibonacci                                  FIB       OVPSim suite
OVPSim, Open Virtual Platform Simulator.
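As an illustration of this parallelization scheme, a loop kernel in the style of MMULT can be annotated as below so that its outer-loop iterations are divided evenly among the participating cores (OpenMP static scheduling). The code is a generic example of ours, not the benchmark source.

    /* Outer-loop iterations are split into equal contiguous chunks, one per core. */
    void mmult(const float *A, const float *B, float *C, int N, int K, int M) {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < M; ++j) {
                float acc = 0.0f;
                for (int k = 0; k < K; ++k)
                    acc += A[i * K + k] * B[k * M + j];
                C[i * M + j] = acc;
            }
    }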

Unlike the first five benchmarks, these two benchmarks are not parallelized (each ISS executes an instance of the program sequentially). The host machine for our experiments consists of the following:

1. An Intel i7 quad-core x86-64 based machine running Linux, equipped with 6 GB DDR3 DRAM (the host)
2. An NVIDIA GTX 480 GPGPU, equipped with 1.5 GB GDDR5 and 480 CUDA cores (the device)

4.1. Simulation performance scalability

In this section we show the performance scalability of our simulator. Benchmarks one to five from Table I were executed on simulated architectures consisting of an increasing number of nodes, each node being composed of a processor plus data and instruction scratchpad memory. Each core accesses a different subset of the application dataset, according to its identification number. Table II summarizes the size of the dataset for each considered benchmark. In Figures 4(a) and 4(b) we show the S-MIPS for ARM and x86 architecture simulation, respectively. It can be noticed that the simulation scales well in both cases up to 2048 cores. Beyond 2048 cores, the achievable throughput increases only very slightly, and in some cases worsens. Because of the per-block shared memory requirements on the GPU, we are only able to run at most three blocks per multiprocessor at a time. When simulating 4096 cores, we exceed this limit, and the extra blocks are dynamically scheduled, thus impacting the final scalability. The Inverse DCT (IDCT), Matrix Multiplication (MMULT), and Luminance Dequantization (DQ) benchmarks show the best scaling results, because they are completely data parallel and no divergence takes place during the simulation. Other benchmarks, such as the Fast Fourier Transform (FFT), can rather be considered task parallel, where the control flow diverges depending on a thread's identification number. This ultimately results in GPU processor stalls.

Table II. Benchmark dataset sizes.
Kernel   Dataset size
IDCT     4096 DCT blocks (8*8 pixels)
DQ       4096 DCT blocks (8*8 pixels)
MMULT    (4096 x 100)*(100 x 100)
NCC      4096 parallel rows
FFT      (Data size = 4096)
IDCT, Inverse DCT; DQ, Luminance Dequantization; MMULT, Matrix Multiplication; NCC, Background Subtraction; FFT, Fast Fourier Transform.

Figure 4. (a) ARM; (b) x86. SIMinG-ARM and SIMinG-x86 Simulated Million Instructions Per Second. MMULT, Matrix Multiplication; IDCT, Inverse DCT; NCC, Background Subtraction; DQ, Luminance Dequantization; FFT, Fast Fourier Transform.

Table IIIa. Instruction and translated micro-op counts for the ARM ISA, with the translation ratio and the translation time overhead (in seconds). Columns: #Instr, #Micro-ops, Trans ratio, Trans time; rows: IDCT, MMULT, DQ, NCC, FFT, Dhrystone, Fibonacci.

Table IIIb. Instruction and translated micro-op counts for the x86 ISA, with the translation ratio and the translation time overhead (in seconds). Columns: #Instr, #Micro-ops, Trans ratio, Trans time.

IDCT, Inverse DCT; MMULT, Matrix Multiplication; DQ, Luminance Dequantization; NCC, Background Subtraction; FFT, Fast Fourier Transform.

In Table III, we report the number of ARM and x86 instructions decoded from the application binaries as well as the number of micro-ops translated for each. We see an average translation ratio of 2.5 for ARM and 5 for x86. As explained in Section 3.1, the instruction translation ratio is higher for x86 because of its more complex instructions, which in turn get translated into a higher number of VISS instructions. Because a CISC (x86) binary contains fewer original instructions than the corresponding RISC (ARM) binary, one might expect that, for a particular benchmark, the time required to simulate the x86 binary would be less than that for ARM. On the contrary, the S-MIPS of x86 is lower than that of ARM. This is because, even though S-MIPS is calculated as the total number of original ISA instructions divided by wall-clock time, the time it takes to simulate the application does not depend on the number of instructions in the binary but on the number of micro-ops simulated. The number of micro-ops simulated is indeed higher for CISC, and therefore the benchmark takes longer to simulate. Overall, in Figures 4(a) and 4(b), we see an average of 900 S-MIPS for ARM and 600 S-MIPS for x86 for a total of 4096 simulated cores. Another important point is that the time it takes to translate an application is a one-time overhead, independent of the number of cores over which the application is going to be parallelized at runtime. In Table III, we report the average translation time of all our benchmarks for both the ARM and x86 simulators. The translation overhead is very contained in both the ARM and x86 cases. It is important to note that for longer-running applications (such as MMULT, FFT, and Background Subtraction (NCC)), the ratio of translation overhead to total simulation time is very small and almost inconsequential.
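To see why the higher micro-op count dominates, consider a purely illustrative example (the numbers are not measurements): suppose a benchmark compiles to 1.0 million ARM instructions and 0.8 million x86 instructions. With the average translation ratios above, the simulator executes about 2.5 million micro-ops for ARM but about 4.0 million for x86. If micro-ops are simulated at roughly the same rate on the GPU, the x86 run takes about 1.6 times longer, so its S-MIPS (0.8 million original instructions divided by the longer wall-clock time) ends up at roughly half of the ARM figure.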

4.2. Performance comparison

In this section, we compare our simulation methodology with a state-of-the-art simulation technique: binary translation. Binary translation natively executes on the host processor an instruction from the simulated stream. Single-ISS simulation using this approach on a powerful CPU is likely to outperform our simulation approach on the GPU. Still, we expect the high number of streaming processors available on a single graphics card to allow our simulation approach to deliver significant speedups in thousand-core simulation with respect to CPU execution of either our simulator or binary translators. We also want to assess the simulation performance scaling of our ISS running on a standard multicore. Modern Chip Multiprocessors (CMPs) such as the Intel Nehalem integrate on a single die a small number (typically from two to eight) of aggressive cores, which boast a sophisticated multilevel cache system and a peak operating frequency much higher than that of a GPU streaming processor. Considering single-ISS performance, it is expected that even the most carefully optimized GPU version of our VISS would run much slower than its CPU counterpart. Even so, we expect our GPU-based simulation approach to exhibit significant benefits for thousand-core simulation.

The Open Virtual Platform Simulator is a well-known commercial sequential simulation platform, able to model architectures composed of thousands of computing nodes. It exploits Just-in-Time Code Morphing and a caching system to accelerate the simulation. The comparison with OVPSim is only available for the ARM architecture, because x86 is currently not supported by OVP. To the best of our knowledge, none of the currently available x86 simulators is capable of managing target architectures with thousands of cores. In Sections 4.2.1 and 4.2.2, we compare the performance scaling of SIMinG-1k running on a GPU against OVPSim and against SIMinG-1k running on the CPU, respectively.

4.2.1. Comparison between SIMinG-1k and OVPSim

To compare the performance of OVP against SIMinG-1k running on the GPU, we performed two different experiments. First, we consider two benchmarks from the OVP test suite, Dhrystone and Fibonacci (see Table I). These benchmarks are provided with the OVP distribution and, unlike our other benchmarks, are not parallelized. Independent of the number of cores instantiated on SIMinG-1k, every ISS executes the benchmark in its entirety. The main reason for using this set of benchmarks is to highlight the reason behind the steady throughput (S-MIPS) of OVP shown in Figure 5. In both cases, OVP shows a constant S-MIPS with an increasing number of simulated cores because of its code morphing technology. With these benchmarks, OVP needs to invoke its morphing phase just once and exploits the translation caching system to speed up the simulation. SIMinG-1k, on the other hand, scales well up to 2048 simulated cores (the same considerations from the previous section hold for higher core counts). The break-even performance point between SIMinG-1k and OVP is 1024 cores.

Figure 5. (a) Dhrystone; (b) Fibonacci. SIMinG-ARM and Open Virtual Platform Simulator (OVPSim) Simulated Million Instructions Per Second (S-MIPS).

In the second experiment, we consider the data-parallel benchmarks MMULT and NCC.

The remaining benchmarks (FFT, DQ, and IDCT) show similar behavior and have not been included because of space constraints.

We recall here that, with this parallelization approach, smaller chunks of data are processed by each core as the core count increases. The results of this test are shown in Figure 6. It can be noticed from this plot that OVPSim suffers from a high initial overhead, induced by its code morphing phase. This overhead becomes increasingly evident as the workload diminishes, because morphing time tends to dominate. Although this initial overhead is clearly amortized as soon as the workload increases, in our opinion it is worth underlining that SIMinG-1k, on the contrary, performs equally well even for very small workloads. On-chip manycore accelerators are often involved in data-parallel computation, which may contain even very small amounts of work (e.g. embedded accelerators for image processing, which may perform single-pixel computations). In these scenarios, SIMinG-1k performs better than OVP. The break-even performance point between SIMinG-1k and OVPSim for data-parallel kernels is 512 cores.

Figure 6. (a) MMULT; (b) NCC. ARM and Open Virtual Platform (OVP) Simulated Million Instructions Per Second (S-MIPS). MMULT, Matrix Multiplication; NCC, Background Subtraction.

4.2.2. Comparison between SIMinG-1k-GPU and SIMinG-1k-CPU

In this section, we compare the performance of SIMinG-1k running on the GPU with SIMinG-1k running on a powerful CMP such as the Intel Nehalem. General-purpose CPUs are easier to program and leverage a complex hardware architecture to boost task execution. In addition, because of their higher core frequency, it is easy to predict that the single-core performance of SIMinG-1k would be higher on a general-purpose CPU than on GPU platforms. On the other hand, GPU platforms exploit their parallel computation by simulating a high number of cores; therefore, we want to identify the simulated core count at which using the GPU platform gains an advantage over a faster CPU. We thus parallelized SIMinG-1k to run on an i7 processor, using pthreads to wrap each simulated core within one parallel task. The GPU used for this work can run threads in parallel on 15 (MPs) x 32 (cores/MP) = 480 cores. For this experiment, we consider two benchmarks: the IDCT benchmark selected from the OpenMP Source Code Repository [14] benchmark suite and the Fibonacci benchmark selected from the OVPSim suite [15]. The results of these experiments are shown in Figure 7. It can be seen that for a small number of simulated cores (below 128), the CPU performs better than the GPU. However, beyond 128 cores, the GPU outperforms the CPU, thus confirming our original intuition. The remaining benchmarks from Table I show similar behavior.

4.3. Experiments with synchronization primitives

In this section, we evaluate the three different approaches to supporting deadlock-free synchronization primitives described in Section 3.2.2. To perform this evaluation, we consider a classical producer consumer synchronization pattern [16]. The experiment simulates a total of 4096 cores, with a CUDA block size of 256. By maximizing the usage of the available GPU resources within each thread block, we ensure that a total of 14 thread blocks (3584 cores) are active at one time, with each thread block running on a single SM. One thread block remains inactive until resources become available for it. A total of 14 cores, one from each thread block, produce and consume elements in an array. Table IV shows the simulation time


More information

SPECULATIVE MULTITHREADED ARCHITECTURES

SPECULATIVE MULTITHREADED ARCHITECTURES 2 SPECULATIVE MULTITHREADED ARCHITECTURES In this Chapter, the execution model of the speculative multithreading paradigm is presented. This execution model is based on the identification of pairs of instructions

More information

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal

More information

GRAPHICS PROCESSING UNITS

GRAPHICS PROCESSING UNITS GRAPHICS PROCESSING UNITS Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 4, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011

More information

CS427 Multicore Architecture and Parallel Computing

CS427 Multicore Architecture and Parallel Computing CS427 Multicore Architecture and Parallel Computing Lecture 6 GPU Architecture Li Jiang 2014/10/9 1 GPU Scaling A quiet revolution and potential build-up Calculation: 936 GFLOPS vs. 102 GFLOPS Memory Bandwidth:

More information

A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE

A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE Abu Asaduzzaman, Deepthi Gummadi, and Chok M. Yip Department of Electrical Engineering and Computer Science Wichita State University Wichita, Kansas, USA

More information

Fundamentals of Quantitative Design and Analysis

Fundamentals of Quantitative Design and Analysis Fundamentals of Quantitative Design and Analysis Dr. Jiang Li Adapted from the slides provided by the authors Computer Technology Performance improvements: Improvements in semiconductor technology Feature

More information

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

ASSEMBLY LANGUAGE MACHINE ORGANIZATION ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction

More information

Parallelizing FPGA Technology Mapping using GPUs. Doris Chen Deshanand Singh Aug 31 st, 2010

Parallelizing FPGA Technology Mapping using GPUs. Doris Chen Deshanand Singh Aug 31 st, 2010 Parallelizing FPGA Technology Mapping using GPUs Doris Chen Deshanand Singh Aug 31 st, 2010 Motivation: Compile Time In last 12 years: 110x increase in FPGA Logic, 23x increase in CPU speed, 4.8x gap Question:

More information

Handout 3. HSAIL and A SIMT GPU Simulator

Handout 3. HSAIL and A SIMT GPU Simulator Handout 3 HSAIL and A SIMT GPU Simulator 1 Outline Heterogeneous System Introduction of HSA Intermediate Language (HSAIL) A SIMT GPU Simulator Summary 2 Heterogeneous System CPU & GPU CPU GPU CPU wants

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,

More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information

Computer Organization and Design, 5th Edition: The Hardware/Software Interface

Computer Organization and Design, 5th Edition: The Hardware/Software Interface Computer Organization and Design, 5th Edition: The Hardware/Software Interface 1 Computer Abstractions and Technology 1.1 Introduction 1.2 Eight Great Ideas in Computer Architecture 1.3 Below Your Program

More information

Chapter 12. CPU Structure and Function. Yonsei University

Chapter 12. CPU Structure and Function. Yonsei University Chapter 12 CPU Structure and Function Contents Processor organization Register organization Instruction cycle Instruction pipelining The Pentium processor The PowerPC processor 12-2 CPU Structures Processor

More information

Final Lecture. A few minutes to wrap up and add some perspective

Final Lecture. A few minutes to wrap up and add some perspective Final Lecture A few minutes to wrap up and add some perspective 1 2 Instant replay The quarter was split into roughly three parts and a coda. The 1st part covered instruction set architectures the connection

More information

Using GPUs to compute the multilevel summation of electrostatic forces

Using GPUs to compute the multilevel summation of electrostatic forces Using GPUs to compute the multilevel summation of electrostatic forces David J. Hardy Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of

More information

Top-Level View of Computer Organization

Top-Level View of Computer Organization Top-Level View of Computer Organization Bởi: Hoang Lan Nguyen Computer Component Contemporary computer designs are based on concepts developed by John von Neumann at the Institute for Advanced Studies

More information

An Introduction to Parallel Programming

An Introduction to Parallel Programming An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe

More information

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010 Introduction to Multicore architecture Tao Zhang Oct. 21, 2010 Overview Part1: General multicore architecture Part2: GPU architecture Part1: General Multicore architecture Uniprocessor Performance (ECint)

More information

ARM Simulation using C++ and Multithreading

ARM Simulation using C++ and Multithreading International Journal of Innovative Technology and Exploring Engineering (IJITEE) ARM Simulation using C++ and Multithreading Suresh Babu S, Channabasappa Baligar Abstract: - This project is to be produced

More information

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 12

More information

Understanding Outstanding Memory Request Handling Resources in GPGPUs

Understanding Outstanding Memory Request Handling Resources in GPGPUs Understanding Outstanding Memory Request Handling Resources in GPGPUs Ahmad Lashgar ECE Department University of Victoria lashgar@uvic.ca Ebad Salehi ECE Department University of Victoria ebads67@uvic.ca

More information

GPU Fundamentals Jeff Larkin November 14, 2016

GPU Fundamentals Jeff Larkin November 14, 2016 GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate

More information

Improving Performance of Machine Learning Workloads

Improving Performance of Machine Learning Workloads Improving Performance of Machine Learning Workloads Dong Li Parallel Architecture, System, and Algorithm Lab Electrical Engineering and Computer Science School of Engineering University of California,

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Processors, Performance, and Profiling

Processors, Performance, and Profiling Processors, Performance, and Profiling Architecture 101: 5-Stage Pipeline Fetch Decode Execute Memory Write-Back Registers PC FP ALU Memory Architecture 101 1. Fetch instruction from memory. 2. Decode

More information

Von Neumann architecture. The first computers used a single fixed program (like a numeric calculator).

Von Neumann architecture. The first computers used a single fixed program (like a numeric calculator). Microprocessors Von Neumann architecture The first computers used a single fixed program (like a numeric calculator). To change the program, one has to re-wire, re-structure, or re-design the computer.

More information

The Use of Cloud Computing Resources in an HPC Environment

The Use of Cloud Computing Resources in an HPC Environment The Use of Cloud Computing Resources in an HPC Environment Bill, Labate, UCLA Office of Information Technology Prakashan Korambath, UCLA Institute for Digital Research & Education Cloud computing becomes

More information

Chapter 14 Performance and Processor Design

Chapter 14 Performance and Processor Design Chapter 14 Performance and Processor Design Outline 14.1 Introduction 14.2 Important Trends Affecting Performance Issues 14.3 Why Performance Monitoring and Evaluation are Needed 14.4 Performance Measures

More information

Comparing Memory Systems for Chip Multiprocessors

Comparing Memory Systems for Chip Multiprocessors Comparing Memory Systems for Chip Multiprocessors Jacob Leverich Hideho Arakida, Alex Solomatnikov, Amin Firoozshahian, Mark Horowitz, Christos Kozyrakis Computer Systems Laboratory Stanford University

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Design of CPU Simulation Software for ARMv7 Instruction Set Architecture

Design of CPU Simulation Software for ARMv7 Instruction Set Architecture Design of CPU Simulation Software for ARMv7 Instruction Set Architecture Author: Dillon Tellier Advisor: Dr. Christopher Lupo Date: June 2014 1 INTRODUCTION Simulations have long been a part of the engineering

More information

Lecture 27: Pot-Pourri. Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs Disks and reliability

Lecture 27: Pot-Pourri. Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs Disks and reliability Lecture 27: Pot-Pourri Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs Disks and reliability 1 Shared-Memory Vs. Message-Passing Shared-memory: Well-understood

More information

Performance analysis basics

Performance analysis basics Performance analysis basics Christian Iwainsky Iwainsky@rz.rwth-aachen.de 25.3.2010 1 Overview 1. Motivation 2. Performance analysis basics 3. Measurement Techniques 2 Why bother with performance analysis

More information

General Purpose GPU Computing in Partial Wave Analysis

General Purpose GPU Computing in Partial Wave Analysis JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data

More information

Parallel Processing SIMD, Vector and GPU s cont.

Parallel Processing SIMD, Vector and GPU s cont. Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP

More information

Exploring GPU Architecture for N2P Image Processing Algorithms

Exploring GPU Architecture for N2P Image Processing Algorithms Exploring GPU Architecture for N2P Image Processing Algorithms Xuyuan Jin(0729183) x.jin@student.tue.nl 1. Introduction It is a trend that computer manufacturers provide multithreaded hardware that strongly

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs Chapter 5 (Part II) Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Virtual Machines Host computer emulates guest operating system and machine resources Improved isolation of multiple

More information

GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATION SEMESTER: III

GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATION SEMESTER: III GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATION SEMESTER: III Subject Name: Operating System (OS) Subject Code: 630004 Unit-1: Computer System Overview, Operating System Overview, Processes

More information

A Cache Hierarchy in a Computer System

A Cache Hierarchy in a Computer System A Cache Hierarchy in a Computer System Ideally one would desire an indefinitely large memory capacity such that any particular... word would be immediately available... We are... forced to recognize the

More information

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 12: Multi-Core Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 4 " Due: 11:49pm, Saturday " Two late days with penalty! Exam I " Grades out on

More information

GPUs and GPGPUs. Greg Blanton John T. Lubia

GPUs and GPGPUs. Greg Blanton John T. Lubia GPUs and GPGPUs Greg Blanton John T. Lubia PROCESSOR ARCHITECTURAL ROADMAP Design CPU Optimized for sequential performance ILP increasingly difficult to extract from instruction stream Control hardware

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

Efficient, Scalable, and Provenance-Aware Management of Linked Data

Efficient, Scalable, and Provenance-Aware Management of Linked Data Efficient, Scalable, and Provenance-Aware Management of Linked Data Marcin Wylot 1 Motivation and objectives of the research The proliferation of heterogeneous Linked Data on the Web requires data management

More information

Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications

Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications September 2013 Navigating between ever-higher performance targets and strict limits

More information

6 February Parallel Computing: A View From Berkeley. E. M. Hielscher. Introduction. Applications and Dwarfs. Hardware. Programming Models

6 February Parallel Computing: A View From Berkeley. E. M. Hielscher. Introduction. Applications and Dwarfs. Hardware. Programming Models Parallel 6 February 2008 Motivation All major processor manufacturers have switched to parallel architectures This switch driven by three Walls : the Power Wall, Memory Wall, and ILP Wall Power = Capacitance

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

Outline Marquette University

Outline Marquette University COEN-4710 Computer Hardware Lecture 1 Computer Abstractions and Technology (Ch.1) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations

More information

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3

More information

ARM Processors for Embedded Applications

ARM Processors for Embedded Applications ARM Processors for Embedded Applications Roadmap for ARM Processors ARM Architecture Basics ARM Families AMBA Architecture 1 Current ARM Core Families ARM7: Hard cores and Soft cores Cache with MPU or

More information

Accelerating Dynamic Binary Translation with GPUs

Accelerating Dynamic Binary Translation with GPUs Accelerating Dynamic Binary Translation with GPUs Chung Hwan Kim, Srikanth Manikarnike, Vaibhav Sharma, Eric Eide, Robert Ricci School of Computing, University of Utah {chunghwn,smanikar,vaibhavs,eeide,ricci}@utah.edu

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Software-Controlled Multithreading Using Informing Memory Operations

Software-Controlled Multithreading Using Informing Memory Operations Software-Controlled Multithreading Using Informing Memory Operations Todd C. Mowry Computer Science Department University Sherwyn R. Ramkissoon Department of Electrical & Computer Engineering University

More information

OpenCL Base Course Ing. Marco Stefano Scroppo, PhD Student at University of Catania

OpenCL Base Course Ing. Marco Stefano Scroppo, PhD Student at University of Catania OpenCL Base Course Ing. Marco Stefano Scroppo, PhD Student at University of Catania Course Overview This OpenCL base course is structured as follows: Introduction to GPGPU programming, parallel programming

More information

When MPPDB Meets GPU:

When MPPDB Meets GPU: When MPPDB Meets GPU: An Extendible Framework for Acceleration Laura Chen, Le Cai, Yongyan Wang Background: Heterogeneous Computing Hardware Trend stops growing with Moore s Law Fast development of GPU

More information

CUDA GPGPU Workshop 2012

CUDA GPGPU Workshop 2012 CUDA GPGPU Workshop 2012 Parallel Programming: C thread, Open MP, and Open MPI Presenter: Nasrin Sultana Wichita State University 07/10/2012 Parallel Programming: Open MP, MPI, Open MPI & CUDA Outline

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently

More information

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,

More information

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:

More information

Control Hazards. Branch Prediction

Control Hazards. Branch Prediction Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional

More information

Parallel Simulation Accelerates Embedded Software Development, Debug and Test

Parallel Simulation Accelerates Embedded Software Development, Debug and Test Parallel Simulation Accelerates Embedded Software Development, Debug and Test Larry Lapides Imperas Software Ltd. larryl@imperas.com Page 1 Modern SoCs Have Many Concurrent Processing Elements SMP cores

More information

ECE 341. Lecture # 15

ECE 341. Lecture # 15 ECE 341 Lecture # 15 Instructor: Zeshan Chishti zeshan@ece.pdx.edu November 19, 2014 Portland State University Pipelining Structural Hazards Pipeline Performance Lecture Topics Effects of Stalls and Penalties

More information

CHAPTER 6 Memory. CMPS375 Class Notes (Chap06) Page 1 / 20 Dr. Kuo-pao Yang

CHAPTER 6 Memory. CMPS375 Class Notes (Chap06) Page 1 / 20 Dr. Kuo-pao Yang CHAPTER 6 Memory 6.1 Memory 341 6.2 Types of Memory 341 6.3 The Memory Hierarchy 343 6.3.1 Locality of Reference 346 6.4 Cache Memory 347 6.4.1 Cache Mapping Schemes 349 6.4.2 Replacement Policies 365

More information

Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs

Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs A paper comparing modern architectures Joakim Skarding Christian Chavez Motivation Continue scaling of performance

More information

Lecture 2: CUDA Programming

Lecture 2: CUDA Programming CS 515 Programming Language and Compilers I Lecture 2: CUDA Programming Zheng (Eddy) Zhang Rutgers University Fall 2017, 9/12/2017 Review: Programming in CUDA Let s look at a sequential program in C first:

More information

Digital System Design Using Verilog. - Processing Unit Design

Digital System Design Using Verilog. - Processing Unit Design Digital System Design Using Verilog - Processing Unit Design 1.1 CPU BASICS A typical CPU has three major components: (1) Register set, (2) Arithmetic logic unit (ALU), and (3) Control unit (CU) The register

More information

The Nios II Family of Configurable Soft-core Processors

The Nios II Family of Configurable Soft-core Processors The Nios II Family of Configurable Soft-core Processors James Ball August 16, 2005 2005 Altera Corporation Agenda Nios II Introduction Configuring your CPU FPGA vs. ASIC CPU Design Instruction Set Architecture

More information

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2017 Lecture 15

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2017 Lecture 15 CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2017 Lecture 15 LAST TIME: CACHE ORGANIZATION Caches have several important parameters B = 2 b bytes to store the block in each cache line S = 2 s cache sets

More information

WHY PARALLEL PROCESSING? (CE-401)

WHY PARALLEL PROCESSING? (CE-401) PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:

More information

Chapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 04 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 4.1 Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information