SIMinG-1k: A thousand-core simulator running on general-purpose graphical processing units


CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. 2013; 25: Published online 15 October 2012 in Wiley Online Library (wileyonlinelibrary.com)

SPECIAL ISSUE PAPER

SIMinG-1k: A thousand-core simulator running on general-purpose graphical processing units

Shivani Raghav 1,*, Andrea Marongiu 2, Christian Pinto 2, Martino Ruggiero 1,2, David Atienza 1 and Luca Benini 2
1 ESL - École Polytechnique Fédérale de Lausanne, Switzerland
2 DEIS - University of Bologna, Italy

SUMMARY

This paper introduces SIMinG-1k, a manycore simulator infrastructure. SIMinG-1k is a graphics-processing-unit-accelerated parallel simulator for design-space exploration of large-scale manycore systems. It features an optimal trade-off between modeling accuracy and simulation speed. Its main objectives are high performance, flexibility, and the ability to simulate thousands of cores. SIMinG-1k can model different architectures (currently, we support ARM (Available from: arm.doc.ddi0100i/index.html) and Intel x86) using a two-step approach in which an architecture-specific front end is decoupled from a fast, parallel manycore virtual machine running on a graphical processing unit platform. We evaluate the simulator for target architectures with up to 4096 cores. Our results demonstrate very high scalability and almost linear speedup as the number of simulated cores increases. Copyright 2012 John Wiley & Sons, Ltd. Received 7 January 2011; Revised 13 June 2012; Accepted 6 September 2012

KEY WORDS: simulation; parallel simulator; manycore; GPU; CUDA

1. INTRODUCTION

In almost every computing domain, modern system design and application development methodologies are largely based on simulation. Virtual platforms are extensively used for early software development before the real hardware is available. They are also needed to optimize the parallelization and hardware resource utilization of the application once the hardware is available. During the last decade, the design of integrated architectures has been characterized by a paradigm shift. Boosting the clock frequencies of monolithic processor cores has clearly reached its limits, and designers have turned to multicore architectures to satisfy the growing computational needs of applications within a reasonable power envelope. To meet the ever-increasing demand for peak performance while fitting tight power budgets, there is a clear trend towards simplifying the core microarchitecture. Using elementary cores (e.g. in-order execution, no sophisticated circuitry for branch prediction, instruction-level parallelism, etc.) allows significant reductions in consumed power, which in turn allows integration of a higher number of cores on a single chip, providing higher peak performance. Future manycore processors will thus embed thousands of simple cores and memories on a single chip, connected through a network-on-chip more than a hundred times faster than traditional off-chip interconnections.

*Correspondence to: Shivani Raghav, Embedded Systems Laboratory - École Polytechnique Fédérale de Lausanne, 1018, Switzerland. shivani.raghav@epfl.ch

Copyright 2012 John Wiley & Sons, Ltd.

A similar trend is clearly foreseeable in every computing domain, from High Performance Computing (HPC) to embedded systems. Examples of such architectures include on-chip manycore accelerators such as the Hypercore Architecture Line from Plurality [1], Platform 2012 [2], or future evolutions of Intel's prototypes Larrabee [3] and the Single-Chip Cloud Computer [4]. Dark silicon pushes innovation towards specialization, where a single chip will include a spectrum of hardware accelerators to access and manipulate the data of cloud workloads with minimal energy.

Simulation and virtual prototyping technology must obviously evolve to tackle the numerous challenges inherent in simulating such highly parallel architectures. Current state-of-the-art sequential simulators use SystemC [5], binary translation, smart sampling techniques, or tuneable abstraction levels for hardware description. These simulation technologies typically have to trade simulation accuracy against simulation speed. Because very low-level hardware operations are accurately modeled, simulation is slow. This can lead to unacceptable performance when simulating a huge number of cores.

Simulating a parallel system is an inherently parallel task. Individual processor simulation may proceed independently until the point where communication or synchronization with other processors is required. This is the key idea behind parallel simulation technology, which distributes the simulation workload over parallel hardware resources. Parallel simulators exploit the availability of multiple physical processing nodes to increase the simulation rate. However, this requirement may turn out to be much too costly when server clusters or computing farms are adopted as the host for running simulations. The high cost in terms of increasing latency and decreasing bandwidth typically leads to poor scalability, because of the synchronization overhead incurred when increasing the number of processing nodes.

The development of computer technology has recently led to an unprecedented performance increase of general-purpose graphical processing units (GPGPU). Modern GPGPUs integrate hundreds of processors on the same device, communicating through low-latency and high-bandwidth on-chip networks and memory hierarchies. This allows cutting interprocessor communication costs by orders of magnitude with respect to server clusters. Additionally, such scalable computation power and flexibility are delivered at a rather low cost by commodity graphics processing unit (GPU) hardware. With the improvement in GPU hardware performance, their programmability has also increased significantly in the last 5 years. This has led to the diffusion of computing clusters based on such manycores, providing inexpensive HPC solutions to a wide community. This scenario motivated our idea of developing a novel parallel simulation technology that exploits the computational power of widely available and low-cost GPUs.

We developed a new simulation technology to deploy a parallel simulator for on-chip 1000-core systems on top of GPGPUs. Although we are currently researching a methodology targeted at full-system simulation [6], the specific contribution of this paper focuses on Instruction Set Simulation (ISS). In the design of our simulation environment, we provide the capability of easily simulating target processors from different computing domains.
We use a two-step approach in which an architecture-specific front end is decoupled from a target-independent virtual ISS (VISS) that is parallelized for fast simulation on the GPU platform. To better explain our methodology, we present the implementation and results of two ISS: an x86 ISS, representative of the HPC domain, and an ARM ISS, representative of the embedded domain. We previously published our early implementation and results for GPGPU-based simulation of the ARM and x86 ISAs [7]. Our implementation of the x86 Instruction Set Architecture (ISA) does not include the extended floating-point (x87) and Single Instruction Multiple Data (SIMD) subsets. In that work we developed two separate ISS, each specifically optimized for execution on an NVIDIA [8] GeForce GTX 295 GPU. This approach, however, presents two main limitations. First, supporting a different ISA requires developing a new ISS from scratch and arduously tailoring its implementation to the peculiarities of the GPU hardware. This requires devising optimizations that are specific to the target processor, and implies significant effort and time (design, implementation, and debugging). Second, simulating a complex ISA (i.e. x86) typically requires many conditional instructions to manage all of the possible variants in the control flow. This is particularly relevant in both the decode and execution phases of a processor pipeline.

Variable-length decoding requires sophisticated check patterns. These check patterns depend on fields of the target instruction's opcodes that are unknown until runtime. When such check patterns are implemented in the decoding phase of the pipeline, the use of control-flow branches increases, and thus the possibility of concurrent operation on the GPU decreases. Similarly, depending on the type of application running on the simulator, cores can fetch different instructions, which are implemented using control-flow operations in the execution phase of the pipeline. Any time cores perform divergent flow operations on the GPU, performance suffers because of intensive processor stalls.

In this paper, we overcome these limitations and significantly extend our previous work by taking a different approach aimed at both easing the development of a new ISS and maximizing the performance of a core simulation engine running on the GPGPU. This simulation engine has been designed as a highly efficient, target-independent VISS, capable of executing a small yet comprehensive set of generic instructions found at the heart of any ISA. Our VISS design leverages a three-stage pipeline (fetch, decode, and execute), the simulation of each stage being carefully optimized to achieve maximum efficiency out of the underlying GPU hardware. Developing simulation support for a new ISA on our infrastructure only requires building a small architecture-specific translator. The latter operates as a post-pass compiler and transforms a compiled program for the target architecture into its equivalent in the VISS format.

However, designing a parallel VISS for manycores running on GPUs is not a straightforward task. The main advantage of using the GPU as a host platform is its high memory bandwidth and large number of cores, capable of running thousands of hardware thread contexts in Single Instruction Multiple Thread (SIMT) fashion. Implementing a simulator for such a platform poses several challenges, such as control flow divergence, the high latency of global memory, bank conflicts, and slow host CPU to GPU communication. In Section 3, we explain each of these limitations in detail and provide solutions to overcome them for effective simulation. It is also important to remember that running a data-parallel program on top of the manycore simulator provides very high performance and scalability, whereas a task-parallel program entails an overhead because of control flow divergence, which is known to be detrimental to GPU performance. At the present stage of development, we only provide application-level simulation. The behavior of the Operating System (OS), input/output (I/O), and device modeling is not yet supported and is part of our future work. To enable simulation of parallel shared memory applications, we provide support for OpenMP-style programming. In particular, we discuss an implementation of locking primitives that leverages Central Processing Unit (CPU) GPU synchronization. Our experimental results confirm the soundness and efficiency of our solution.

The rest of the paper is organized as follows. In Section 2, we discuss related work, and Section 3 briefly explains the GPU architecture and the Compute Unified Device Architecture (CUDA) programming model. Section 4 describes the design and implementation of our manycore simulator, SIMinG-1k.
In Section 5, we evaluate the performance and scalability of SIMinG-1k and also compare our results with a commercial, state-of-the-art sequential simulator. Finally, Section 6 concludes the paper.

2. GRAPHICAL PROCESSING UNIT ARCHITECTURE AND COMPUTE UNIFIED DEVICE ARCHITECTURE PROGRAMMING MODEL

The Fermi-based GPU used in this work is an NVIDIA GeForce GTX 480, a two-level shared-memory parallel machine comprising 480 Streaming Processors (SPs) organized in 15 Streaming Multiprocessors (SMs). SMs manage the execution of programs using so-called warps, groups of 32 threads. Each SM features two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently. All instructions are executed in an SIMD fashion, where one instruction is applied to all threads in a warp. This execution method is called SIMT. All threads in a warp execute the same instruction or remain idle (different threads can perform branching and other forms of independent work) [8]. In the Fermi architecture, each SM has 64 KB of on-chip memory that can be configured as 48 KB of shared memory with 16 KB of L1 cache, or as 16 KB of shared memory with 48 KB of L1 cache. Fermi also features a 768 KB unified L2 cache that provides efficient data sharing across the GPU. CUDA is the software architecture for issuing and managing computations on the GPU [9].
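On Fermi-class devices, this shared-memory/L1 split is selected per kernel from the host. The fragment below is only an illustration of the corresponding CUDA runtime call (the kernel name viss_kernel is a placeholder of ours, not part of SIMinG-1k): a simulator that stages per-core contexts in shared memory would request the 48 KB shared / 16 KB L1 configuration.

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void viss_kernel() { /* placeholder for the simulator kernel */ }

    int main() {
        // Prefer 48 KB shared memory / 16 KB L1 for this kernel (Fermi and later).
        cudaError_t err = cudaFuncSetCacheConfig(viss_kernel, cudaFuncCachePreferShared);
        if (err != cudaSuccess)
            std::printf("cache config failed: %s\n", cudaGetErrorString(err));
        return 0;
    }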

3. DESIGN AND IMPLEMENTATION

In this section, we describe the implementation of our GPU-based manycore instruction set simulator. The goal is to simulate on-chip manycore accelerators with thousands of cores by modeling the functional correctness of processor pipelines and their program-visible state. The key idea behind our approach is to identify the inherent parallelism in manycore architecture simulation and execute it efficiently on top of highly parallel GPGPU hardware. Despite the parallel nature of manycore simulation, there are many issues potentially hindering the effectiveness of the approach that need to be taken into account. Besides the evident necessity of carefully designing intercore communication and synchronization, it is also extremely important to consider some peculiarities of GPGPU hardware and the CUDA programming model. Some of the key implementation issues for performance are the following:

Control Flow Divergence - Because of the SIMT nature of GPU architectures, divergent control flow in CUDA programs is detrimental to GPU performance. In parallel ISA simulation, many conditional instructions are typically required to model the decoding and execution stages of the pipeline. Moreover, supporting a large number of simulated instructions during the execution phase results in a high probability of different ISS fetching different instructions at the same time. We describe the steps taken to optimize the performance of our simulator in this respect later in this section and in Section 3.1.

High Latency of Global Memory - GPU global memory has a very high latency of several hundred cycles. In parallel ISA simulation, the contexts of all simulated cores are frequently accessed, which makes them good candidates for placement in the faster shared memory. This, however, requires laying out the corresponding data structures over memory banks in a smart manner to better utilize memory bandwidth. We minimize the number of bus transactions by coalescing memory accesses, as described in Section 3.2.1.

Bank Conflicts in Shared Memory - The fast GPU shared memory leverages a multiported, multibanked design to improve memory access parallelism. Accesses by multiple threads within the same bank cause a conflict. In Section 3.2.1, we describe how we eliminate such conflicts and achieve optimal performance.

Slow GPU CPU Interaction - CPU GPU interaction is a costly process, and CUDA and the GPU hardware do not provide any means to synchronize threads among multiprocessors. This poses a serious performance bottleneck when simulating shared memory applications that use locks and other synchronization primitives, because these may require synchronization among thread blocks, which is not natively supported by CUDA and the GPU hardware and can only be achieved through interactions with the CPU. We describe in Section 3.2.2 how we deal with this issue.

To address the first challenge, we design our simulation framework so as to minimize frequent divergent execution flows by construction. Intuitively, the higher the number of instructions supported by every ISS, the higher the probability that two ISS will be executing different instructions, and as a consequence that the simulated pipelines will be subject to stalls implied by divergent instruction streams. To mitigate this effect, we take a two-step approach (Figure 1). At the heart of our proposal sits a VISS, which supports a minimal yet comprehensive set of instructions.
More complex instructions found in the targeted ISA can be decomposed into a set of these simple micro-operations. This is carried out statically on the host CPU platform. More precisely, an offline architecture-specific translation step transforms the target program object code (i.e. the result of a standard compilation stage for the target architecture) into the generic micro-operations supported by our VISS. This methodology has two important advantages. First, it enables easier creation of simulation infrastructures for several different architectures and makes their emulation on the GPU more portable: only one generic implementation of the virtualized ISS (VISS) is required, and it can be used in conjunction with any architecture-specific translation technique. Second, our methodology improves simulation performance on GPGPUs.

Figure 1. Manycore simulation workflow. CUDA, Compute Unified Device Architecture; VISS, Virtual Instruction Set Simulation; GPU, graphical processing unit; SM, Streaming Multiprocessors.

We carefully design the micro-operation encoding format in such a way that instructions can be fetched and decoded in a single parallel step on thousands of virtualized ISS, hence providing optimal opportunity for parallel execution on the GPU. For example, we have previously experienced [7] that variable-length x86 instruction decoding requires a variable number of steps depending on opcodes and addressing modes. This makes it one of the most difficult and expensive parts of parallel x86 simulation on the GPU, because of frequent control flow divergence. With our new approach, all of the complex decoding work has been shifted to the offline translation phase. The highly optimized VISS runs much faster on GPU hardware, as it now benefits from the high concurrency enabled by the new design.

As explained in Section 1, our goal is a simulation technology that can model on-chip manycore accelerators from different computing domains. To assess the soundness and flexibility of our methodology, we describe the implementation of two ISAs: a Reduced Instruction Set Computing (RISC)-like (ARM) architecture targeting the embedded system domain, and a simple x86 architecture similar to those used in manycores from the HPC domain [3, 4].

To estimate the execution cycle count on the two architectures, we adopt a timestamping technique. In general, a single instruction from a target ISA is translated into multiple instructions in the VISS format. We emit a timestamp before the first and after the last VISS instruction. The interpretation phase in the simulator implements a statistics module, which is in charge of gathering the most common processor metrics, such as overall execution cycles, active cycles, and idle cycles. When a begin timestamp is encountered in the interpretation phase, the statistics module increments the source instruction count. The end timestamp is annotated with the source instruction latency (we refer to the instruction latencies presented in the x86 [10] and ARM [11] datasheets). The latency information is then used by the statistics module to increment the instruction cycle count. It is worth recalling here that manycore Systems-on-Chip (SoCs) employ simple processors to achieve the required performance/watt targets. Thus, both the ARM and x86 processors considered in this work are simple, in-order cores, without sophisticated circuitry for branch prediction, ILP extraction, and so forth.

This makes our technique for performance profiling accurate enough to estimate processor behavior. Clearly, memory and interconnect effects may generate stalls and delays that affect these measures. However, in this work, we focus on ISS simulation and assume that memory transactions have a deterministic duration because of the presence of per-core scratchpad memory. Cache, Network-on-Chip (NoC), and main memory effects are accounted for in separate modules of our simulation framework, which model the behavior of these system components. The interested reader is referred to our proposal for full-system simulation in [6]. The following subsections describe in more detail the translation and interpretation phases for both ISS.

3.1. Translator

One of the main tasks of the translator is to split each of the source ISA (x86 or ARM) instructions into simpler VISS instructions called micro-ops. The set of micro-ops is chosen to be much smaller than the large number of instructions found in ISAs such as x86 and ARM. Currently, our VISS executes 18 instructions, empirically determined as the minimum set of basic operations needed to support most of the instructions in the x86 and ARM ISAs. It is easy to extend the set of micro-ops to support instructions from more complex ISAs. However, it is important to keep in mind that the more complex the simulated ISA (and thus the minimum number of necessary micro-ops), the higher the likelihood of control flow divergence during the execution stage of the core pipeline.

The translator works as a post-pass compiler over a compiled x86/ARM program, and it operates in a similar manner to a standard disassembler. First, the compiled x86/ARM program is decoded into a stream of instructions from the target ISA. Instruction opcodes are used as an index into the instruction tables. Operands are then parsed on the basis of the corresponding ISA specifications. Once the disassembled instruction stream has been reconstructed, we use short custom functions to convert it into a corresponding stream of VISS micro-ops. Micro-ops are 64 bits long; therefore, instructions with embedded long (32-bit) immediate values can easily be transformed into an intermediate form that preserves the long immediates. Figure 2 shows the format of the 64-bit intermediate micro-operations. It is important to note that simple instructions (such as a single-register transfer) can be translated using a single micro-op, whereas more complex instructions (such as those requiring calculation of various addressing modes) may need several micro-ops.

One important task of the translator is re-mapping the original register allocation for the target architecture's register file onto the VISS register file. The latter consists of 32 general-purpose registers. The first 16 (r0 r15) directly map onto the corresponding registers of the target architecture (e.g. only the first 16 ARM registers are made visible to the program/compiler).

Figure 2. Instruction format for intermediate micro-operations.
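Since the figure itself is not reproduced here, the sketch below gives one plausible C rendering of such a 64-bit micro-op word. The field names and widths are our assumptions; only the ingredients named in the text (an opcode drawn from the ~18 micro-ops, register operands, the cc bit, the three sf bits, and room for a 32-bit immediate) are taken from the paper.

    #include <stdint.h>

    /* Hypothetical layout of a 64-bit VISS micro-op; field widths are illustrative. */
    typedef struct {
        uint64_t opcode : 6;   /* one of the ~18 VISS micro-ops                  */
        uint64_t rd     : 5;   /* destination register (r0-r15, t1-t16)          */
        uint64_t rs1    : 5;   /* first source register                          */
        uint64_t rs2    : 5;   /* second source register                         */
        uint64_t cc     : 1;   /* if set, execute only when t16 holds true       */
        uint64_t sf     : 3;   /* which flags to update at run time              */
        uint64_t unused : 7;
        uint64_t imm    : 32;  /* long (32-bit) immediate, when present          */
    } micro_op_t;

    /* A complex source instruction expands into several such words; with purely
       illustrative mnemonics, the ARM load "LDR r0, [r1, #8]" could become:
           t1 <- ADDI r1, 8     ; compute the effective address
           r0 <- LOAD [t1]      ; scratchpad access                              */

Keeping every micro-op in one fixed-size word is what allows the fetch and decode steps to be performed in a single, divergence-free pass across thousands of VISS instances.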

The other 16 registers (t1 t16) are used as temporary registers for intermediate operations and special usage. Register t14 is used as the program counter of the VISS, t15 is dedicated as the ISS flag register, and t16 is allocated for condition-checking operations. The translator also generates final addresses for all the symbols in the program according to the architecture's memory map, specified through a linker script. Thus, all code segments, global, initialized, and uninitialized data, as well as other special sections, are mapped onto the Scratch Pad Memories (SPMs) according to the global address space specified through the linker script.

Some arithmetic or logic instructions in any ISA require setting flag (condition code) values. By decoding an instruction, we know whether it requires a conditional check for a set of flags. This can easily be determined offline in our translation phase. On the basis of this decoded information and of the result of the operation's execution (at runtime), the bits in the flag register are updated. Modeling this part entirely in the translation phase, however, would require additional operations. Because the rules to compute flag status are quite complex, generating the necessary micro-ops for this computation would be an inefficient solution. Therefore, we implement the conditional check for flag updates in the interpretation phase at runtime. We use three bits (the sf bits in Figure 2) to pass the information decoded by the translator (the type of flags that require a conditional check) to the interpretation phase. If at runtime any of these bits is set, the set-flag logic inside the GPU simulator is called; it checks the condition and sets the corresponding bit in the flag register. For conditional instructions, the translator implements the condition-checking mechanism by generating specific additional micro-ops. These store the result of the condition check inside a dedicated register (t16). In addition, one bit (the cc bit in Figure 2) indicates whether condition evaluation is required for a particular instruction. If the cc bit is set and register t16 is true, the specified condition is met; otherwise, the interpreter skips the operation.

To provide a qualitative measure of the effort spent in developing a target-specific translator, we report the lines of code (LoC) for the two supported architectures. The ARM translator is composed of 1800 LoC, whereas the x86 translator, which handles a more complicated ISA, required 3200 LoC.

3.2. Virtual instruction set simulation: compute unified device architecture implementation

In this section, we describe the core VISS, written in CUDA, that accepts the micro-op format and executes it on the GPU. The VISS (Figure 1) fetches, decodes, and emulates the execution of individual intermediate micro-ops. Thousands of cores with associated instruction and data scratchpads (SPMs) can be simulated by mapping each instance of the VISS to a CUDA thread. Each virtual core is assigned a private context structure, which represents the current CPU state. Initially, host memory is allocated for a context structure that represents one core. The main (host) program populates this data structure and then copies it to the GPU global device memory. When instantiating N virtual cores, the global memory contains an array of N context structures, whose members constitute the register file, flags, program counter, and other state information. Each VISS exploits a configurable amount of code and data scratchpad memory.
All memory references are handled by a dedicated code portion of the ISS that models the behavior of the scratchpads. Each VISS repeatedly fetches the loaded byte code of micro-operations for its assigned simulated core. As on real hardware, the instruction is fetched, decoded, and executed at run time. Each VISS core updates its simulated register file and program counter until the program has finished. For all instructions requiring memory references, the traces are sent to input/output buffers that provide a generic interface towards memory. This provides a flexible communication infrastructure in case the core simulator needs to be plugged into some other architectural component simulator (e.g. the cache) [6]. The simulation loop of the VISS is made up of three phases, each corresponding to one main step of a classical processor pipeline: fetch, decode, and execute. During the first step, instructions are fetched from memory to be further decoded and executed. The micro-op format chosen for the translation makes it straightforward to decode the opcode and operands in a single step (executed in parallel over multiple VISS). In the execution step, the previously extracted opcode and operands are used to simulate the target instruction semantics.
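The CUDA sketch below illustrates this organization: an array of per-core contexts in global memory and a kernel in which each thread runs the fetch/decode/execute loop of its assigned core. It is a minimal sketch under our own assumptions; the structure members, opcode cases, and halt condition are simplified placeholders rather than the actual SIMinG-1k source.

    #include <cuda_runtime.h>
    #include <stdint.h>

    #define NUM_REGS 32              /* r0-r15 plus t1-t16                          */
    #define PC       29              /* t14 (index 29) acts as the program counter  */

    struct CoreCtx {                 /* program-visible state of one virtual core   */
        uint32_t regs[NUM_REGS];     /* t15 = flags, t16 = condition-check result   */
        int      halted;
    };

    __global__ void viss_run(CoreCtx *ctx, const uint64_t *code, int n_cores) {
        int core = blockIdx.x * blockDim.x + threadIdx.x;
        if (core >= n_cores) return;
        CoreCtx c = ctx[core];                      /* copy-in of this core's context */
        while (!c.halted) {
            uint64_t op  = code[c.regs[PC]];        /* fetch                           */
            uint32_t opc = (uint32_t)(op & 0x3F);   /* decode in a single step         */
            switch (opc) {                          /* execute                         */
                case 0: /* ADD: update registers and, if the sf bits ask, the flags */ break;
                case 1: /* LOAD/STORE: forwarded to the scratchpad model             */ break;
                /* ... remaining VISS micro-ops ... */
                default: c.halted = 1; break;
            }
            c.regs[PC] += 1;                        /* next micro-op (branches differ) */
        }
        ctx[core] = c;                              /* copy-out of the updated context */
    }

On the host side, the main program fills an array of n_cores such structures, copies it to device memory with cudaMemcpy, launches the kernel, and copies the contexts back, matching the description above.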

The actual instruction execution is modeled with a C switch/case construct. This is translated by the CUDA compiler into a series of conditional branches, which are taken depending on the decoded instruction. Intuitively, the parallel application running on the simulated architecture has a great impact on simulator performance. When running data-parallel applications, all cores simulate the same instruction flow. This matches the native GPU SIMT model of execution and results in high parallelism. On the contrary, when simulating a task-parallel application, simulated cores execute distinct instruction streams, which is highly likely to result in control flow divergence during the execution stage of the pipeline. This implies a performance loss because of the serialization of thread execution within a warp.

3.2.1. CPU context allocation

One of the key concerns of CUDA programming is to ensure optimal utilization of memory bandwidth. When the threads in a warp access the same memory block, the number of memory transactions for the operation is minimized. Given the criticality of memory accesses and their impact on simulation performance, it is of the utmost importance to carefully design the layout of the ISS execution contexts in memory. Each core context can be represented with an aggregate data structure (i.e. an array of registers and flags). Because each simulated core is mapped to a CUDA thread, parallel ISS execution implies concurrent accesses to core contexts, which we want to service in the most efficient manner. Every ISS frequently accesses its execution context during program execution, so it is beneficial to place the corresponding data structure in the low-latency shared memory rather than accessing it from global memory. This requires explicit copy operations, whose cost can be minimized if we design the matrix layout according to its access pattern. Let us take the example of placing context arrays in memory in row order, as shown in the upper part of Figure 3(a). Here, different rows represent different context arrays (one per simulated core), and the elements of a row are the registers of that core's context. Suppose that a data-parallel application is running on the simulator and that different cores access the same register (from their private contexts) at the same time. This scenario leads to the inefficient access pattern shown in the lower part of Figure 3(a). Laying out context arrays vertically in memory is not directly feasible, so we aggregate all context arrays within a single data structure (i.e. a matrix) to obtain this goal (see Figure 3(c)). Padding is employed whenever the number of simulated cores is not an integer multiple of the size of a CUDA warp (i.e. a block of 32 parallel threads). Ensuring that each warp accesses a single line when accessing global memory enables memory coalescing, a well-known optimization for effective memory bandwidth exploitation in CUDA programming.

Figure 3. (a) Scanning matrices row-major leads to memory bandwidth waste; (b) scanning matrices column-major minimizes memory transactions; (c) a matrix representing the execution contexts of the simulated cores.
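A minimal illustration of the two layouts of Figure 3, with names and constants of our own choosing:

    #define NUM_REGS       32
    #define NUM_CORES_PAD  4096   /* simulated cores, padded to a multiple of 32 (warp size) */

    /* Row-major: the registers of one core are contiguous, so when the 32 threads
       of a warp all read the same register they touch 32 different memory lines.  */
    #define CTX_ROW(core, reg)   ((core) * NUM_REGS + (reg))

    /* Column-major (the layout adopted here): the same register of all cores is
       contiguous, so a warp reading register `reg` falls on one memory line and
       the accesses coalesce into a single transaction.                            */
    #define CTX_COL(core, reg)   ((reg) * NUM_CORES_PAD + (core))

    __device__ uint32_t read_reg(const uint32_t *ctx_matrix, int core, int reg) {
        return ctx_matrix[CTX_COL(core, reg)];
    }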

More precisely, with the proposed representation of execution contexts, concurrent ISS access memory in column-major order. This results in separate threads in a warp accessing the same memory line, which is served with a single memory transaction, as shown in Figure 3(b). A similar copy-out scheme is employed at the end of the kernel execution to restore the contexts in global memory. Another important optimization is to ensure that there are no bank conflicts in shared memory. Accesses by multiple threads to a single data element within the same bank cause conflicts and result in longer transactions. Hence, we place our CPU context data structure in shared memory in such a way that each access of a simulated core to its context register file results only in linear addressing [9]. As a result, all threads of a half-warp access different banks, and we do not see bank conflicts.

3.2.2. Synchronization

One of the main requirements of shared memory parallel programming is synchronization primitives such as locks. A multicore simulation should therefore provide an effective intercore synchronization mechanism. To support shared memory synchronization, we translate application-level primitives into micro-ops corresponding to standard hardware primitives such as test-and-set, spin locks, wait-for-event, and signal-event. On the simulator side, these micro-ops are implemented using CUDA atomic instructions [8] (a minimal sketch of such a lock is given after the list of approaches below). We implement wait-for-event and signal-event using an event table in GPU global memory. If a simulated core has been waiting for an event for a certain period, it voluntarily yields the processor by freezing its instruction pointer. The event table entry needed to resume execution is set by another core simulating a signal-event instruction.

Because of the lack of support for interblock synchronization in CUDA, supporting synchronization among all simulated cores is a challenging task. In a naive implementation where locks are implemented with busy waiting, we may easily incur deadlocks when simulating a higher number of cores than the physical GPU processors available. The GPU hardware scheduler selects thread blocks for execution on the basis of the available computational resources. If the number of threads in all blocks is higher than the number of processors, only a subset of the blocks can execute, whereas the remaining blocks wait until the first set finishes its execution. This generates a deadlock. To address this issue, we consider three different approaches:

1. Core Yield at Instruction Level - In this case, a synchronization point with the host side is created after every single simulated instruction. To do this, we save the ISS context state in GPU global memory and then perform a sort of global (barrier) synchronization by terminating the CUDA kernel (i.e. our simulator) and returning execution to the host side. Because the amount of computation (i.e. simulating a single instruction) is extremely small in comparison with the communication between the host CPU and the GPU device, this approach shows a high performance overhead because of frequent synchronization with the host side.

2. Core Yield at Synchronization Level - In this approach, synchronization between different CUDA blocks is carried out only when interblock communication is necessary. Only synchronization instructions between different cores and wait-on-response instructions from a simulated core residing on a currently inactive block may cause deadlock problems.
Thus, by identifying such potentially dangerous instructions and yielding the core simulation only upon their occurrence, we achieve considerable performance gains.

3. Core Yield at Timeout - Here, we further optimize performance by yielding the core simulation only when the synchronization instruction occurs within a simulated core residing on an inactive block. Because it is difficult to determine in advance whether the intercore communication occurs between cores simulated on active or inactive blocks, we adopt a timeout mechanism. At any point, if the simulator discovers that a particular simulated core has been waiting for an event for longer than a certain period, it sends a trap that yields the simulation of the waiting core. This allows the simulation of previously inactive blocks to proceed, thus removing the deadlock.

We provide an evaluation of these approaches in Section 4.3.
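For concreteness, the sketch referred to above shows how a test-and-set lock can be built from a CUDA atomic, together with a bounded spin that reports failure after a timeout so that the caller can freeze the waiting core's instruction pointer and yield its simulation. This is our own illustrative code under assumed names and a simplistic timeout policy, not the SIMinG-1k implementation.

    /* Test-and-set acquire on a lock word kept in GPU global memory. */
    __device__ bool try_lock(int *lock) {
        return atomicExch(lock, 1) == 0;          /* true if we obtained the lock   */
    }

    __device__ void unlock(int *lock) {
        atomicExch(lock, 0);
    }

    /* Bounded spin: give up after `timeout` attempts so the simulated core can be
       yielded, letting thread blocks that are not yet resident make progress and
       avoiding the deadlock described above.                                       */
    __device__ bool lock_or_yield(int *lock, int timeout) {
        for (int i = 0; i < timeout; ++i)
            if (try_lock(lock)) return true;
        return false;                             /* caller yields the waiting core */
    }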

4. EXPERIMENTS AND RESULTS

In this section, we present a set of experiments aimed at evaluating the performance of SIMinG-1k. We carried out three different sets of experiments:

1. The first set of experiments aims at measuring the performance of our VISS. We measure the overhead (or performance loss) induced by the translation process in both the ARM and x86 front ends. As the main performance metric, we adopt Simulated Million Instructions Per Second (S-MIPS), simply computed as the total number of dynamic instructions present in an application binary divided by the (host) wall-clock time, which provides an intuitive measure of the achievable throughput. We calculate S-MIPS for a set of benchmarks compiled for execution on the VISS. We also present the overhead of the translation process and show the ratio between the number of original assembly instructions and the number of translated micro-operations. The results of these experiments are described in Section 4.1.

2. The second set of experiments is targeted at comparing our simulation methodology against state-of-the-art simulation techniques. The Open Virtual Platform Simulator (OVPSim) is a representative example of a modern simulation technology based on binary translation and is used for performance comparison in our experiments. In our results, we demonstrate that parallel simulation on GPGPUs is not only feasible for thousands of cores but also, under certain circumstances, profitable compared with binary translation technology. The results of these experiments are described in Section 4.2.

3. Finally, a third set of experiments aims at evaluating the effectiveness of our support for shared memory synchronization primitives.

In Table I, we list the benchmarks adopted for our experiments. At the present stage of this work, we lack some architectural support in SIMinG-1k (i.e. timers and interrupts) for the implementation of a full-fledged OS. Because most applications found in standard benchmark suites implement parallelism through library calls requiring some form of OS support, we are not currently capable of running these applications as-is. At the moment, parallel applications in our framework can be encoded using the popular OpenMP programming model [12]. OpenMP directives are translated into calls to a runtime library during compilation. We have implemented a custom version of this support library that bypasses the OS and directly leverages the low-level parallelization services provided by our simulator. The considered benchmarks include complete programs and representative computational kernels found at the heart of many embedded and scientific applications. The first five kernels are extracted from a complete JPEG decoder [13] and from the OpenMP Source Code Repository [14] benchmark suite. These kernels are parallelized with OpenMP and are used to test the scalability of SIMinG-1k (see Section 4.1). The parallelization scheme adopted for these five benchmarks simply focuses on evenly dividing the total loop iterations among all participating processors (OpenMP static loop scheduling). The last two benchmarks are used to compare the performance of SIMinG-1k against OVPSim. They were extracted directly from the OVPSim suite and run unmodified (they are used in the OVP release to test simulation speed).

Table I. Benchmarks.
   Benchmark                                  Acronym   Source
1  Inverse Discrete Cosine Transform (DCT)    IDCT      JPEG Decoding
2  Luminance Dequantization                   DQ        JPEG Decoding
3  Background Subtraction                     NCC       Normalized Cut Clustering Algorithm
4  Matrix Multiplication                      MMULT     OpenMP Source Code Repository
5  Fast Fourier Transform                     FFT       OpenMP Source Code Repository
6  Dhrystone                                  DHR       OVPSim suite
7  Fibonacci                                  FIB       OVPSim suite
OVPSim, Open Virtual Platform Simulator.
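As an illustration of this parallelization scheme, a loop kernel in the style of MMULT can be annotated as below so that its outer-loop iterations are divided evenly among the participating cores (OpenMP static scheduling). The code is a generic example of ours, not the benchmark source.

    /* Outer-loop iterations are split into equal contiguous chunks, one per core. */
    void mmult(const float *A, const float *B, float *C, int N, int K, int M) {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < M; ++j) {
                float acc = 0.0f;
                for (int k = 0; k < K; ++k)
                    acc += A[i * K + k] * B[k * M + j];
                C[i * M + j] = acc;
            }
    }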

Unlike the first five benchmarks, these two benchmarks are not parallelized (each ISS executes an instance of the program sequentially). The host machine for our experiments consists of the following:

1. An Intel i7 quad-core x86-64 based machine running Linux, equipped with 6 GB DDR3 DRAM (the host)
2. An NVIDIA GTX 480 GPGPU, equipped with 1.5 GB GDDR5 and 480 CUDA cores (the device)

4.1. Simulation performance scalability

In this section we show the performance scalability of our simulator. Benchmarks one to five from Table I were executed on simulated architectures consisting of an increasing number of nodes, each node being composed of a processor plus data and instruction scratchpad memory. Each core accesses a different subset of the application dataset, according to its identification number. Table II summarizes the size of the dataset for each considered benchmark. In Figures 4(a) and 4(b) we show the S-MIPS for ARM and x86 architecture simulation, respectively. It can be noticed that the simulation scales well in both cases up to 2048 cores. Beyond 2048 cores, the achievable throughput increases only very slightly, and in some cases worsens. Because of the per-block shared memory requirements on the GPU, we are only able to run at most three blocks per multiprocessor at a time. When simulating 4096 cores, we exceed this limit, and the extra blocks are dynamically scheduled, thus impacting the final scalability. The Inverse DCT (IDCT), Matrix Multiplication (MMULT), and Luminance Dequantization (DQ) benchmarks show the best scaling results, because they are completely data parallel and no divergence takes place during the simulation. Other benchmarks, such as the Fast Fourier Transform (FFT), can rather be considered task parallel, where the control flow diverges depending on a thread's identification number. This ultimately results in GPU processor stalls.

Table II. Benchmark dataset sizes.
Kernel   Dataset size
IDCT     4096 DCT blocks (8*8 pixels)
DQ       4096 DCT blocks (8*8 pixels)
MMULT    (4096 x 100)*(100 x 100)
NCC      4096 parallel rows
FFT      (Data size = 4096)
IDCT, Inverse DCT; DQ, Luminance Dequantization; MMULT, Matrix Multiplication; NCC, Background Subtraction; FFT, Fast Fourier Transform.

Figure 4. (a) ARM; (b) x86. SIMinG-ARM and SIMinG-x86 Simulated Million Instructions Per Second. MMULT, Matrix Multiplication; IDCT, Inverse DCT; NCC, Background Subtraction; DQ, Luminance Dequantization; FFT, Fast Fourier Transform.

Table IIIa. Instruction and translated micro-op counts for the ARM ISA, with the translation ratio and the translation time overhead (in seconds). Columns: #Instr, #Micro-ops, Trans ratio, Trans time; rows: IDCT, MMULT, DQ, NCC, FFT, Dhrystone, Fibonacci.

Table IIIb. Instruction and translated micro-op counts for the x86 ISA, with the translation ratio and the translation time overhead (in seconds). Columns: #Instr, #Micro-ops, Trans ratio, Trans time.

IDCT, Inverse DCT; MMULT, Matrix Multiplication; DQ, Luminance Dequantization; NCC, Background Subtraction; FFT, Fast Fourier Transform.

In Table III, we report the number of ARM and x86 instructions decoded from the application binaries as well as the number of micro-ops translated for each. We see an average translation ratio of 2.5 for ARM and 5 for x86. As explained in Section 3.1, the instruction translation ratio is higher for x86 because of its more complex instructions, which in turn get translated into a higher number of VISS instructions. Because a CISC (x86) binary contains fewer original instructions than the corresponding RISC (ARM) binary, one might expect that, for a particular benchmark, the time required to simulate the x86 binary would be less than that for ARM. On the contrary, the S-MIPS of x86 is lower than that of ARM. This is because, even though S-MIPS is calculated as the total number of original ISA instructions divided by wall-clock time, the time it takes to simulate the application does not depend on the number of instructions in the binary but on the number of micro-ops simulated. The number of micro-ops simulated is indeed higher for CISC, and therefore the benchmark takes longer to simulate. Overall, in Figures 4(a) and 4(b), we see an average of 900 S-MIPS for ARM and 600 S-MIPS for x86 for a total of 4096 simulated cores. Another important point is that the time it takes to translate an application is a one-time overhead, independent of the number of cores over which the application is going to be parallelized at runtime. In Table III, we report the average translation time of all our benchmarks for both the ARM and x86 simulators. The translation overhead is very contained in both the ARM and x86 cases. It is important to note that for longer-running applications (such as MMULT, FFT, and Background Subtraction (NCC)), the ratio of translation overhead to total simulation time is very small and almost inconsequential.
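To see why the higher micro-op count dominates, consider a purely illustrative example (the numbers are not measurements): suppose a benchmark compiles to 1.0 million ARM instructions and 0.8 million x86 instructions. With the average translation ratios above, the simulator executes about 2.5 million micro-ops for ARM but about 4.0 million for x86. If micro-ops are simulated at roughly the same rate on the GPU, the x86 run takes about 1.6 times longer, so its S-MIPS (0.8 million original instructions divided by the longer wall-clock time) ends up at roughly half of the ARM figure.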

4.2. Performance comparison

In this section, we compare our simulation methodology with a state-of-the-art simulation technique: binary translation. Binary translation natively executes on the host processor an instruction from the simulated stream. Single-ISS simulation using this approach on a powerful CPU is likely to outperform our simulation approach on the GPU. Still, we expect the high number of streaming processors available on a single graphics card to allow our simulation approach to deliver significant speedups in thousand-core simulation with respect to CPU execution of either our simulator or binary translators. We also want to assess the simulation performance scaling of our ISS running on a standard multicore. Modern Chip Multiprocessors (CMPs) such as the Intel Nehalem integrate on a single die a small number (typically from two to eight) of aggressive cores, which boast a sophisticated multilevel cache system and a peak operating frequency much higher than that of a GPU streaming processor. Considering single-ISS performance, it is expected that even the most carefully optimized GPU version of our VISS would run much slower than its CPU counterpart. Even so, we expect our GPU-based simulation approach to exhibit significant benefits for thousand-core simulation.

The Open Virtual Platform Simulator is a well-known commercial sequential simulation platform, able to model architectures composed of thousands of computing nodes. It exploits Just-in-Time Code Morphing and a caching system to accelerate the simulation. The comparison with OVPSim is only available for the ARM architecture, because x86 is currently not supported by OVP. To the best of our knowledge, none of the currently available x86 simulators is capable of managing target architectures with thousands of cores. In Sections 4.2.1 and 4.2.2, we compare the performance scaling of SIMinG-1k running on a GPU against OVPSim and against SIMinG-1k running on the CPU, respectively.

4.2.1. Comparison between SIMinG-1k and OVPSim

To compare the performance of OVP against SIMinG-1k running on the GPU, we performed two different experiments. First, we consider two benchmarks from the OVP test suite, Dhrystone and Fibonacci (see Table I). These benchmarks are provided with the OVP distribution and, unlike our other benchmarks, are not parallelized. Independent of the number of cores instantiated on SIMinG-1k, every ISS executes the benchmark in its entirety. The main reason for using this set of benchmarks is to highlight the reason behind the steady throughput (S-MIPS) of OVP shown in Figure 5. In both cases, OVP shows a constant S-MIPS with an increasing number of simulated cores because of its code morphing technology. With these benchmarks, OVP needs to invoke its morphing phase just once and exploits the translation caching system to speed up the simulation. SIMinG-1k, on the other hand, scales well up to 2048 simulated cores (the same considerations from the previous section hold for higher core counts). The break-even performance point between SIMinG-1k and OVP is 1024 cores.

Figure 5. (a) Dhrystone; (b) Fibonacci. SIMinG-ARM and Open Virtual Platform Simulator (OVPSim) Simulated Million Instructions Per Second (S-MIPS).

In the second experiment, we consider the data-parallel benchmarks MMULT and NCC.

The remaining benchmarks (FFT, DQ, and IDCT) show similar behavior and have not been included because of space constraints.

We recall here that, with this parallelization approach, smaller chunks of data are processed by each core as the core count increases. The results of this test are shown in Figure 6. It can be noticed from this plot that OVPSim suffers from a high initial overhead, induced by its code morphing phase. This overhead becomes increasingly evident as the workload diminishes, because morphing time tends to dominate. Although this initial overhead is clearly amortized as soon as the workload increases, in our opinion it is worth underlining that SIMinG-1k, on the contrary, performs equally well even for very small workloads. On-chip manycore accelerators are often involved in data-parallel computation, which may contain even very small amounts of work (e.g. embedded accelerators for image processing, which may perform single-pixel computations). In these scenarios, SIMinG-1k performs better than OVP. The break-even performance point between SIMinG-1k and OVPSim for data-parallel kernels is 512 cores.

Figure 6. (a) MMULT; (b) NCC. ARM and Open Virtual Platform (OVP) Simulated Million Instructions Per Second (S-MIPS). MMULT, Matrix Multiplication; NCC, Background Subtraction.

4.2.2. Comparison between SIMinG-1k-GPU and SIMinG-1k-CPU

In this section, we compare the performance of SIMinG-1k running on the GPU with SIMinG-1k running on a powerful CMP such as the Intel Nehalem. General-purpose CPUs are easier to program and leverage a complex hardware architecture to boost task execution. In addition, because of their higher core frequency, it is easy to predict that the single-core performance of SIMinG-1k would be higher on a general-purpose CPU than on GPU platforms. On the other hand, GPU platforms exploit their parallel computation by simulating a high number of cores; therefore, we want to identify the simulated core count at which using the GPU platform gains an advantage over a faster CPU. We thus parallelized SIMinG-1k to run on an i7 processor, using pthreads to wrap each simulated core within one parallel task. The GPU used for this work can run threads in parallel on 15 (MPs) x 32 (cores/MP) = 480 cores. For this experiment, we consider two benchmarks: the IDCT benchmark selected from the OpenMP Source Code Repository [14] benchmark suite and the Fibonacci benchmark selected from the OVPSim suite [15]. The results of these experiments are shown in Figure 7. It can be seen that for a small number of simulated cores (below 128), the CPU performs better than the GPU. However, beyond 128 cores, the GPU outperforms the CPU, thus confirming our original intuition. The remaining benchmarks from Table I show similar behavior.

4.3. Experiments with synchronization primitives

In this section, we evaluate the three different approaches to supporting deadlock-free synchronization primitives described in Section 3.2.2. To perform this evaluation, we consider a classical producer consumer synchronization pattern [16]. The experiment simulates a total of 4096 cores, with a CUDA block size of 256. By maximizing the usage of the available GPU resources within each thread block, we ensure that a total of 14 thread blocks (3584 cores) are active at one time, with each thread block running on a single SM. One thread block remains inactive until resources become available for it. A total of 14 cores, one from each thread block, produce and consume elements in an array. Table IV shows the simulation time


More information

SPECULATIVE MULTITHREADED ARCHITECTURES

SPECULATIVE MULTITHREADED ARCHITECTURES 2 SPECULATIVE MULTITHREADED ARCHITECTURES In this Chapter, the execution model of the speculative multithreading paradigm is presented. This execution model is based on the identification of pairs of instructions

More information

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal

More information

GRAPHICS PROCESSING UNITS

GRAPHICS PROCESSING UNITS GRAPHICS PROCESSING UNITS Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 4, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011

More information

CS427 Multicore Architecture and Parallel Computing

CS427 Multicore Architecture and Parallel Computing CS427 Multicore Architecture and Parallel Computing Lecture 6 GPU Architecture Li Jiang 2014/10/9 1 GPU Scaling A quiet revolution and potential build-up Calculation: 936 GFLOPS vs. 102 GFLOPS Memory Bandwidth:

More information

A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE

A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE Abu Asaduzzaman, Deepthi Gummadi, and Chok M. Yip Department of Electrical Engineering and Computer Science Wichita State University Wichita, Kansas, USA

More information

Fundamentals of Quantitative Design and Analysis

Fundamentals of Quantitative Design and Analysis Fundamentals of Quantitative Design and Analysis Dr. Jiang Li Adapted from the slides provided by the authors Computer Technology Performance improvements: Improvements in semiconductor technology Feature

More information

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

ASSEMBLY LANGUAGE MACHINE ORGANIZATION ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction

More information

Parallelizing FPGA Technology Mapping using GPUs. Doris Chen Deshanand Singh Aug 31 st, 2010

Parallelizing FPGA Technology Mapping using GPUs. Doris Chen Deshanand Singh Aug 31 st, 2010 Parallelizing FPGA Technology Mapping using GPUs Doris Chen Deshanand Singh Aug 31 st, 2010 Motivation: Compile Time In last 12 years: 110x increase in FPGA Logic, 23x increase in CPU speed, 4.8x gap Question:

More information

Handout 3. HSAIL and A SIMT GPU Simulator

Handout 3. HSAIL and A SIMT GPU Simulator Handout 3 HSAIL and A SIMT GPU Simulator 1 Outline Heterogeneous System Introduction of HSA Intermediate Language (HSAIL) A SIMT GPU Simulator Summary 2 Heterogeneous System CPU & GPU CPU GPU CPU wants

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,

More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information

Computer Organization and Design, 5th Edition: The Hardware/Software Interface

Computer Organization and Design, 5th Edition: The Hardware/Software Interface Computer Organization and Design, 5th Edition: The Hardware/Software Interface 1 Computer Abstractions and Technology 1.1 Introduction 1.2 Eight Great Ideas in Computer Architecture 1.3 Below Your Program

More information

Chapter 12. CPU Structure and Function. Yonsei University

Chapter 12. CPU Structure and Function. Yonsei University Chapter 12 CPU Structure and Function Contents Processor organization Register organization Instruction cycle Instruction pipelining The Pentium processor The PowerPC processor 12-2 CPU Structures Processor

More information

Final Lecture. A few minutes to wrap up and add some perspective

Final Lecture. A few minutes to wrap up and add some perspective Final Lecture A few minutes to wrap up and add some perspective 1 2 Instant replay The quarter was split into roughly three parts and a coda. The 1st part covered instruction set architectures the connection

More information

Using GPUs to compute the multilevel summation of electrostatic forces

Using GPUs to compute the multilevel summation of electrostatic forces Using GPUs to compute the multilevel summation of electrostatic forces David J. Hardy Theoretical and Computational Biophysics Group Beckman Institute for Advanced Science and Technology University of

More information

Top-Level View of Computer Organization

Top-Level View of Computer Organization Top-Level View of Computer Organization Bởi: Hoang Lan Nguyen Computer Component Contemporary computer designs are based on concepts developed by John von Neumann at the Institute for Advanced Studies

More information

An Introduction to Parallel Programming

An Introduction to Parallel Programming An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe

More information

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010 Introduction to Multicore architecture Tao Zhang Oct. 21, 2010 Overview Part1: General multicore architecture Part2: GPU architecture Part1: General Multicore architecture Uniprocessor Performance (ECint)

More information

ARM Simulation using C++ and Multithreading

ARM Simulation using C++ and Multithreading International Journal of Innovative Technology and Exploring Engineering (IJITEE) ARM Simulation using C++ and Multithreading Suresh Babu S, Channabasappa Baligar Abstract: - This project is to be produced

More information

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing UNIVERSIDADE TÉCNICA DE LISBOA INSTITUTO SUPERIOR TÉCNICO Departamento de Engenharia Informática Architectures for Embedded Computing MEIC-A, MEIC-T, MERC Lecture Slides Version 3.0 - English Lecture 12

More information

Understanding Outstanding Memory Request Handling Resources in GPGPUs

Understanding Outstanding Memory Request Handling Resources in GPGPUs Understanding Outstanding Memory Request Handling Resources in GPGPUs Ahmad Lashgar ECE Department University of Victoria lashgar@uvic.ca Ebad Salehi ECE Department University of Victoria ebads67@uvic.ca

More information

GPU Fundamentals Jeff Larkin November 14, 2016

GPU Fundamentals Jeff Larkin November 14, 2016 GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate

More information

Improving Performance of Machine Learning Workloads

Improving Performance of Machine Learning Workloads Improving Performance of Machine Learning Workloads Dong Li Parallel Architecture, System, and Algorithm Lab Electrical Engineering and Computer Science School of Engineering University of California,

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Processors, Performance, and Profiling

Processors, Performance, and Profiling Processors, Performance, and Profiling Architecture 101: 5-Stage Pipeline Fetch Decode Execute Memory Write-Back Registers PC FP ALU Memory Architecture 101 1. Fetch instruction from memory. 2. Decode

More information

Von Neumann architecture. The first computers used a single fixed program (like a numeric calculator).

Von Neumann architecture. The first computers used a single fixed program (like a numeric calculator). Microprocessors Von Neumann architecture The first computers used a single fixed program (like a numeric calculator). To change the program, one has to re-wire, re-structure, or re-design the computer.

More information

The Use of Cloud Computing Resources in an HPC Environment

The Use of Cloud Computing Resources in an HPC Environment The Use of Cloud Computing Resources in an HPC Environment Bill, Labate, UCLA Office of Information Technology Prakashan Korambath, UCLA Institute for Digital Research & Education Cloud computing becomes

More information

Chapter 14 Performance and Processor Design

Chapter 14 Performance and Processor Design Chapter 14 Performance and Processor Design Outline 14.1 Introduction 14.2 Important Trends Affecting Performance Issues 14.3 Why Performance Monitoring and Evaluation are Needed 14.4 Performance Measures

More information

Comparing Memory Systems for Chip Multiprocessors

Comparing Memory Systems for Chip Multiprocessors Comparing Memory Systems for Chip Multiprocessors Jacob Leverich Hideho Arakida, Alex Solomatnikov, Amin Firoozshahian, Mark Horowitz, Christos Kozyrakis Computer Systems Laboratory Stanford University

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Design of CPU Simulation Software for ARMv7 Instruction Set Architecture

Design of CPU Simulation Software for ARMv7 Instruction Set Architecture Design of CPU Simulation Software for ARMv7 Instruction Set Architecture Author: Dillon Tellier Advisor: Dr. Christopher Lupo Date: June 2014 1 INTRODUCTION Simulations have long been a part of the engineering

More information

Lecture 27: Pot-Pourri. Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs Disks and reliability

Lecture 27: Pot-Pourri. Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs Disks and reliability Lecture 27: Pot-Pourri Today s topics: Shared memory vs message-passing Simultaneous multi-threading (SMT) GPUs Disks and reliability 1 Shared-Memory Vs. Message-Passing Shared-memory: Well-understood

More information

Performance analysis basics

Performance analysis basics Performance analysis basics Christian Iwainsky Iwainsky@rz.rwth-aachen.de 25.3.2010 1 Overview 1. Motivation 2. Performance analysis basics 3. Measurement Techniques 2 Why bother with performance analysis

More information

General Purpose GPU Computing in Partial Wave Analysis

General Purpose GPU Computing in Partial Wave Analysis JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data

More information

Parallel Processing SIMD, Vector and GPU s cont.

Parallel Processing SIMD, Vector and GPU s cont. Parallel Processing SIMD, Vector and GPU s cont. EECS4201 Fall 2016 York University 1 Multithreading First, we start with multithreading Multithreading is used in GPU s 2 1 Thread Level Parallelism ILP

More information

Exploring GPU Architecture for N2P Image Processing Algorithms

Exploring GPU Architecture for N2P Image Processing Algorithms Exploring GPU Architecture for N2P Image Processing Algorithms Xuyuan Jin(0729183) x.jin@student.tue.nl 1. Introduction It is a trend that computer manufacturers provide multithreaded hardware that strongly

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs

Chapter 5 (Part II) Large and Fast: Exploiting Memory Hierarchy. Baback Izadi Division of Engineering Programs Chapter 5 (Part II) Baback Izadi Division of Engineering Programs bai@engr.newpaltz.edu Virtual Machines Host computer emulates guest operating system and machine resources Improved isolation of multiple

More information

GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATION SEMESTER: III

GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATION SEMESTER: III GUJARAT TECHNOLOGICAL UNIVERSITY MASTER OF COMPUTER APPLICATION SEMESTER: III Subject Name: Operating System (OS) Subject Code: 630004 Unit-1: Computer System Overview, Operating System Overview, Processes

More information

A Cache Hierarchy in a Computer System

A Cache Hierarchy in a Computer System A Cache Hierarchy in a Computer System Ideally one would desire an indefinitely large memory capacity such that any particular... word would be immediately available... We are... forced to recognize the

More information

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 12: Multi-Core Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 4 " Due: 11:49pm, Saturday " Two late days with penalty! Exam I " Grades out on

More information

GPUs and GPGPUs. Greg Blanton John T. Lubia

GPUs and GPGPUs. Greg Blanton John T. Lubia GPUs and GPGPUs Greg Blanton John T. Lubia PROCESSOR ARCHITECTURAL ROADMAP Design CPU Optimized for sequential performance ILP increasingly difficult to extract from instruction stream Control hardware

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

Efficient, Scalable, and Provenance-Aware Management of Linked Data

Efficient, Scalable, and Provenance-Aware Management of Linked Data Efficient, Scalable, and Provenance-Aware Management of Linked Data Marcin Wylot 1 Motivation and objectives of the research The proliferation of heterogeneous Linked Data on the Web requires data management

More information

Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications

Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications September 2013 Navigating between ever-higher performance targets and strict limits

More information

6 February Parallel Computing: A View From Berkeley. E. M. Hielscher. Introduction. Applications and Dwarfs. Hardware. Programming Models

6 February Parallel Computing: A View From Berkeley. E. M. Hielscher. Introduction. Applications and Dwarfs. Hardware. Programming Models Parallel 6 February 2008 Motivation All major processor manufacturers have switched to parallel architectures This switch driven by three Walls : the Power Wall, Memory Wall, and ILP Wall Power = Capacitance

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

Outline Marquette University

Outline Marquette University COEN-4710 Computer Hardware Lecture 1 Computer Abstractions and Technology (Ch.1) Cristinel Ababei Department of Electrical and Computer Engineering Credits: Slides adapted primarily from presentations

More information

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary 2 Table of content 1 Background information 2 Optimizations 3 Summary 3

More information

ARM Processors for Embedded Applications

ARM Processors for Embedded Applications ARM Processors for Embedded Applications Roadmap for ARM Processors ARM Architecture Basics ARM Families AMBA Architecture 1 Current ARM Core Families ARM7: Hard cores and Soft cores Cache with MPU or

More information

Accelerating Dynamic Binary Translation with GPUs

Accelerating Dynamic Binary Translation with GPUs Accelerating Dynamic Binary Translation with GPUs Chung Hwan Kim, Srikanth Manikarnike, Vaibhav Sharma, Eric Eide, Robert Ricci School of Computing, University of Utah {chunghwn,smanikar,vaibhavs,eeide,ricci}@utah.edu

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Software-Controlled Multithreading Using Informing Memory Operations

Software-Controlled Multithreading Using Informing Memory Operations Software-Controlled Multithreading Using Informing Memory Operations Todd C. Mowry Computer Science Department University Sherwyn R. Ramkissoon Department of Electrical & Computer Engineering University

More information

OpenCL Base Course Ing. Marco Stefano Scroppo, PhD Student at University of Catania

OpenCL Base Course Ing. Marco Stefano Scroppo, PhD Student at University of Catania OpenCL Base Course Ing. Marco Stefano Scroppo, PhD Student at University of Catania Course Overview This OpenCL base course is structured as follows: Introduction to GPGPU programming, parallel programming

More information

When MPPDB Meets GPU:

When MPPDB Meets GPU: When MPPDB Meets GPU: An Extendible Framework for Acceleration Laura Chen, Le Cai, Yongyan Wang Background: Heterogeneous Computing Hardware Trend stops growing with Moore s Law Fast development of GPU

More information

CUDA GPGPU Workshop 2012

CUDA GPGPU Workshop 2012 CUDA GPGPU Workshop 2012 Parallel Programming: C thread, Open MP, and Open MPI Presenter: Nasrin Sultana Wichita State University 07/10/2012 Parallel Programming: Open MP, MPI, Open MPI & CUDA Outline

More information

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors

Computer and Information Sciences College / Computer Science Department CS 207 D. Computer Architecture. Lecture 9: Multiprocessors Computer and Information Sciences College / Computer Science Department CS 207 D Computer Architecture Lecture 9: Multiprocessors Challenges of Parallel Processing First challenge is % of program inherently

More information

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,

More information

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:

More information

Control Hazards. Branch Prediction

Control Hazards. Branch Prediction Control Hazards The nub of the problem: In what pipeline stage does the processor fetch the next instruction? If that instruction is a conditional branch, when does the processor know whether the conditional

More information

Parallel Simulation Accelerates Embedded Software Development, Debug and Test

Parallel Simulation Accelerates Embedded Software Development, Debug and Test Parallel Simulation Accelerates Embedded Software Development, Debug and Test Larry Lapides Imperas Software Ltd. larryl@imperas.com Page 1 Modern SoCs Have Many Concurrent Processing Elements SMP cores

More information

ECE 341. Lecture # 15

ECE 341. Lecture # 15 ECE 341 Lecture # 15 Instructor: Zeshan Chishti zeshan@ece.pdx.edu November 19, 2014 Portland State University Pipelining Structural Hazards Pipeline Performance Lecture Topics Effects of Stalls and Penalties

More information

CHAPTER 6 Memory. CMPS375 Class Notes (Chap06) Page 1 / 20 Dr. Kuo-pao Yang

CHAPTER 6 Memory. CMPS375 Class Notes (Chap06) Page 1 / 20 Dr. Kuo-pao Yang CHAPTER 6 Memory 6.1 Memory 341 6.2 Types of Memory 341 6.3 The Memory Hierarchy 343 6.3.1 Locality of Reference 346 6.4 Cache Memory 347 6.4.1 Cache Mapping Schemes 349 6.4.2 Replacement Policies 365

More information

Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs

Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs Presenting: Comparing the Power and Performance of Intel's SCC to State-of-the-Art CPUs and GPUs A paper comparing modern architectures Joakim Skarding Christian Chavez Motivation Continue scaling of performance

More information

Lecture 2: CUDA Programming

Lecture 2: CUDA Programming CS 515 Programming Language and Compilers I Lecture 2: CUDA Programming Zheng (Eddy) Zhang Rutgers University Fall 2017, 9/12/2017 Review: Programming in CUDA Let s look at a sequential program in C first:

More information

Digital System Design Using Verilog. - Processing Unit Design

Digital System Design Using Verilog. - Processing Unit Design Digital System Design Using Verilog - Processing Unit Design 1.1 CPU BASICS A typical CPU has three major components: (1) Register set, (2) Arithmetic logic unit (ALU), and (3) Control unit (CU) The register

More information

The Nios II Family of Configurable Soft-core Processors

The Nios II Family of Configurable Soft-core Processors The Nios II Family of Configurable Soft-core Processors James Ball August 16, 2005 2005 Altera Corporation Agenda Nios II Introduction Configuring your CPU FPGA vs. ASIC CPU Design Instruction Set Architecture

More information

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2017 Lecture 15

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2017 Lecture 15 CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2017 Lecture 15 LAST TIME: CACHE ORGANIZATION Caches have several important parameters B = 2 b bytes to store the block in each cache line S = 2 s cache sets

More information

WHY PARALLEL PROCESSING? (CE-401)

WHY PARALLEL PROCESSING? (CE-401) PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:

More information

Chapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 04. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 04 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 4.1 Potential speedup via parallelism from MIMD, SIMD, and both MIMD and SIMD over time for

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information