MULTIPROG Prague, Czech Republic, January 18th, 2016


MULTIPROG-2016
Proceedings of the Ninth International Workshop on Programmability and Architectures for Heterogeneous Multicores

Editors:
Miquel Pericàs, Chalmers University of Technology, Sweden
Vassilis Papaefstathiou, Chalmers University of Technology, Sweden
Ferad Zyulkyarov, Barcelona Supercomputing Center, Spain
Oscar Palomar, Barcelona Supercomputing Center, Spain

Prague, Czech Republic, January 18th, 2016

The ninth edition of the Workshop on Programmability and Architectures for Heterogeneous Multicores (MULTIPROG-2016) took place in Prague, Czech Republic, on January 18th, 2016. The workshop was co-located with the HiPEAC 2016 conference. MULTIPROG aims to bring together researchers interested in programming models, runtimes, and computer architecture. The workshop is intended for quick publication of early results, work-in-progress, etc., and is not intended to prevent later publication of extended papers. This year we received a total of 15 submissions. The authors came mainly from Europe (48 authors); we also had contributions from Asia (13 authors), Brazil (2 authors) and the U.S. (1 author). Each submission was reviewed by up to three members from our Program Committee. The organizing committee selected eight regular papers and three position papers for presentation at the workshop. In addition to the selected papers, the workshop included a keynote and three invited talks:

- Prof. David Kaeli, Northeastern University, gave the MULTIPROG 2016 keynote: "Accelerators as First-class Computing Devices"
- Prof. Per Stenström, Chalmers University of Technology, gave an invited talk: "MECCA - Meeting the Challenges in Computer Architecture"
- Rainer Leupers, CTO of Silexica Software Solutions GmbH, gave an invited talk: "Use Case Driven Embedded Multicore Software Development"
- Jean-François Lavignon, European Technology Platform for High Performance Computing (ETP4HPC), gave an invited talk: "The ETP4HPC Strategic Research Agenda"

We have assembled the accepted papers into these informal proceedings. The 2016 edition of MULTIPROG was well attended and generated lively discussions among the participants. We hope these proceedings will encourage you to submit your research to the next edition of the workshop!

Organizing Committee:
Ferad Zyulkyarov, Barcelona Supercomputing Center, Spain
Oscar Palomar, Barcelona Supercomputing Center, Spain
Vassilis Papaefstathiou, Chalmers University of Technology, Sweden
Miquel Pericàs, Chalmers University of Technology, Sweden

Steering Committee:
Eduard Ayguade, UPC/BSC, Spain
Benedict R. Gaster, University of the West of England, UK
Lee Howes, Qualcomm, USA
Per Stenström, Chalmers University of Technology, Sweden
Osman Unsal, Barcelona Supercomputing Center, Spain

Program Committee:
Abdelhalim Amer, Argonne National Lab, USA
Ali Jannesari, TU Darmstadt, Germany
Avi Mendelson, Technion, Israel
Christos Kotselidis, University of Manchester, UK
Daniel Goodman, Oracle Labs, UK
Dong Ping Zhang, AMD, USA
Gilles Sassatelli, LIRMM, France
Håkan Grahn, Blekinge Institute of Technology, Sweden
Hans Vandierendonck, Queen's University of Belfast, UK
Kenjiro Taura, University of Tokyo, Japan
Luigi Nardi, Imperial College London, UK
Naoya Maruyama, RIKEN AICS, Japan
Oscar Plata, University of Malaga, Spain
Pedro Trancoso, University of Cyprus, Cyprus
Polyvios Pratikakis, FORTH-ICS, Greece
Roberto Gioiosa, Pacific Northwest National Laboratory, USA
Ruben Titos, BSC, Spain
Sasa Tomic, IBM Research, Switzerland
Simon McIntosh-Smith, University of Bristol, UK
Timothy G. Mattson, Intel, USA
Trevor E. Carlson, Uppsala University, Sweden

External Reviewers:
Julio Villalba, University of Malaga, Spain

Index of selected Papers:

Accelerating HPC Kernels with RHyMe - REDEFINE HyperCell Multicore. Saptarsi Das, Nalesh S., Kavitha Madhu, Soumitra Kumar Nandy and Ranjani Narayan

Project Beehive: A Hardware/Software Co-designed Stack for Runtime and Architectural Research. Christos Kotselidis, Andrey Rodchenko, Colin Barrett, Andy Nisbet, John Mawer, Will Toms, James Clarkson, Cosmin Gorgovan, Amanieu d'Antras, Yaman Cakmakci, Thanos Stratikopoulos, Sebastian Werner, Jim Garside, Javier Navaridas, Antoniu Pop, John Goodacre and Mikel Lujan

Reaching intrinsic compute efficiency requires adaptable micro-architectures. Mark Wijtvliet, Luc Waeijen, Michaël Adriaansen and Henk Corporaal

Toward Transparent Heterogeneous Systems. Baptiste Delporte, Roberto Rigamonti and Alberto Dassatti

Exploring LLVM Infrastructure for Simplified Multi-GPU Programming. Alexander Matz, Mark Hummel and Holger Fröning

Efficient scheduling policies for dynamic dataflow programs executed on multi-core. Malgorzata Michalska, Nicolas Zufferey, Jani Boutellier, Endri Bezati and Marco Mattavelli

OpenMP scheduling on ARM big.LITTLE architecture. Anastasiia Butko, Louisa Bessad, David Novo, Florent Bruguier, Abdoulaye Gamatie, Gilles Sassatelli, Lionel Torres and Michel Robert

Collaborative design and optimization using Collective Knowledge. Anton Lokhmotov and Grigori Fursin

Heterogeneous (CPU+GPU) Working-set Hash Tables. Ziaul Choudhury and Suresh Purini

A Safe and Tight Estimation of the Worst-Case Execution Time of Dynamically Scheduled Parallel Applications. Petros Voudouris, Per Stenström and Risat Pathan

Accelerating HPC Kernels with RHyMe - REDEFINE HyperCell Multicore

Saptarsi Das 1, Nalesh S. 1, Kavitha T. Madhu 1, S. K. Nandy 1 and Ranjani Narayan 2
1 CAD Laboratory, Indian Institute of Science, Bangalore {sdas, nalesh, kavitha}@cadl.iisc.ernet.in, nandy@serc.iisc.in
2 Morphing Machines Pvt. Ltd., Bangalore ranjani@morphing.in

Abstract. In this paper, we present a coarse grained reconfigurable array (CGRA) designed to accelerate high performance computing (HPC) application kernels. The proposed CGRA, named RHyMe (REDEFINE HyperCell Multicore), is based on the REDEFINE CGRA. It consists of a set of reconfigurable data-paths called HyperCells interconnected through a network-on-chip (NoC). The network of HyperCells serves as the hardware data-path for the realization of HyperOps, which are the basic schedulable entities in REDEFINE. RHyMe is specialized to accelerate regular computations like loops and relies on the compiler to generate the meta-data used at runtime for orchestrating the kernel execution. As a result, the compute hardware is simple and the memory structures can be managed explicitly, rendering a simple as well as efficient architecture.

1 Introduction

Modern high performance computing (HPC) applications demand heterogeneous computing platforms which consist of a variety of specialized hardware accelerators alongside general purpose processing (GPP) cores to accelerate compute intensive functions. Although accelerators give dramatically higher efficiency than GPPs for their target applications, they are not as flexible and perform poorly on other applications. Graphics processing units (GPUs) can be used for accelerating a wide range of parallel applications. However, GPUs are better suited to single instruction multiple data (SIMD) applications. Field programmable gate arrays (FPGAs) may be used to generate accelerators on demand. Although this mitigates the flexibility issue involved with specialized hardware accelerators, the finer granularity of the lookup tables (LUTs) in FPGAs leads to significantly higher configuration time and lower operating frequency. Coarse-grain reconfigurable architectures (CGRAs), consisting of a pool of compute elements (CEs) interconnected using some communication infrastructure, overcome the reconfiguration overheads of FPGAs while providing performance close to specialized hardware accelerators.

Examples of CGRAs include the Molen Polymorphic Processor [13], the Convey Hybrid-Core Computer [3], DRRA [12], REDEFINE [2], CRFU [10], DySER [7] and TRIPS [4]. REDEFINE, as reported in [2], is a runtime reconfigurable polymorphic application-specific integrated circuit (ASIC). Polymorphism in ASICs is synonymous with attributing different functionalities to fixed hardware in space and time. REDEFINE is a massively parallel distributed system, comprising a set of Compute Elements (CEs) communicating over a Network-on-Chip (NoC) [6] using messages. REDEFINE follows a macro data-flow execution model at the level of macro operations (also called HyperOps). HyperOps are convex partitions of the application kernel's data-flow graph, and are compositions of one or more multiple-input-multiple-output (MIMO) operations. The ability of REDEFINE to provision CEs to serve as composed data-paths for MIMO operations over the NoC is a key differentiator that sets REDEFINE apart from other CGRAs. REDEFINE exploits temporal parallelism inside the CEs, while spatial parallelism is exploited across CEs. The CE can be an instruction-set processor, a specialized custom function unit (CFU) or a reconfigurable data-path. In this paper, we present the REDEFINE CGRA with HyperCells [8], [5] as CEs so as to support parallelism of all granularities. HyperCell is a reconfigurable data-path that can be configured on demand to accelerate frequently occurring code segments. Custom data-paths dynamically set up within HyperCells enable exploitation of fine-grain parallelism. Coarse grained parallelism is exploited across HyperCells. We refer to this architecture, i.e., REDEFINE with HyperCells as CEs, as the REDEFINE HyperCell Multicore (RHyMe). In this paper we present the RHyMe hardware, comprising both the resources for computation and runtime orchestration. The paper is structured as follows. The execution model and a brief overview of the compilation flow employed are described in section 2. Section 3 presents the hardware architecture of RHyMe. Sections 4 and 5 present some results and the conclusions of the paper.

2 Execution Model & Compilation Framework

In this section we present a brief overview of the execution model of the RHyMe architecture followed by a high level description of the compilation flow. RHyMe is a macro data-flow engine comprising three major hardware components, namely compute fabric, orchestrator and memory (see figure 1). As mentioned previously, the compute fabric is composed of HyperCells. An application kernel to be executed on RHyMe comprises convex schedulable entities called HyperOps that are executed atomically. Each HyperOp is composed of pHyperOps, each of which is mapped onto a HyperCell of RHyMe. In the scope of this exposition, we consider loops from HPC applications as the kernels for acceleration on RHyMe. The computation corresponding to a loop in the kernel is treated as a HyperOp and its iteration space is divided into a number of HyperOp instances. Execution of a HyperOp on RHyMe involves three major phases namely, configuration of the hardware resources (HyperCells and orchestrator), execution of

the instances of a HyperOp by binding runtime parameters, and synchronization among HyperOp instances. HyperOp instances are scheduled for execution when the following conditions are met (a small readiness-check sketch follows Fig. 1 below):

- The HyperCells executing the HyperOp instance and the orchestrator are configured.
- The operands of the HyperOp instance are available in REDEFINE's memory.
- The HyperCells to which the HyperOp is mapped are free to execute the HyperOp.
- The runtime parameters of the HyperCells and orchestrator are bound.

A HyperOp requires the HyperCells and orchestrator to be configured when launching the first instance for execution. Subsequent instances require only the runtime parameters to be sent to the HyperCells and orchestrator, as explained in detail in section 3. In [9], the authors have presented a detailed description of the execution model and the compilation flow for RHyMe. In the following section we discuss the hardware architecture of RHyMe in greater detail.

3 REDEFINE HyperCell Multicore (RHyMe) Architecture

In this section we present the architectural details of RHyMe. As mentioned in section 2, RHyMe has three major components, namely the Compute Fabric, the Memory and the Orchestrator.

Fig. 1. REDEFINE HyperCell Multicore (RHyMe): the compute fabric of HyperCells and routers, the distributed global memory bank sets on either side of the fabric, and the orchestrator (data movement unit, runtime parameter computation unit, HyperOp scheduler unit and configuration unit) interfacing to the external memory and the host environment over data, configuration metadata and control paths.
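As a concrete illustration of the scheduling conditions just listed, the following C sketch expresses the readiness test an implementation might perform before launching a HyperOp instance. The structure and field names are assumptions made purely for illustration; they are not taken from the RHyMe implementation.

    /* Illustrative readiness check for launching a HyperOp instance,
     * mirroring the four conditions listed in the text. The struct and
     * field names are assumptions, not the actual RHyMe implementation. */
    #include <stdbool.h>

    typedef struct {
        bool hypercells_configured;  /* HyperCells running this instance are configured */
        bool orchestrator_configured;
        bool operands_in_memory;     /* operands available in REDEFINE's memory          */
        bool hypercells_free;        /* mapped HyperCells are free to execute            */
        bool params_bound;           /* runtime parameters of HyperCells/orchestrator    */
    } hyperop_instance_t;

    static bool ready_to_launch(const hyperop_instance_t *inst)
    {
        return inst->hypercells_configured &&
               inst->orchestrator_configured &&
               inst->operands_in_memory &&
               inst->hypercells_free &&
               inst->params_bound;
    }

Only the first instance of a HyperOp pays the configuration cost; for subsequent instances only the parameter-binding condition changes, which is why runtime parameter binding is the per-instance step.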

3.1 Compute Fabric

The compute fabric of RHyMe consists of a set of HyperCells interconnected via an NoC.

HyperCell: Micro-architectural details of HyperCell are presented in [5] and [8], where HyperCell is presented as a hardware platform for the realization of multiple input multiple output (MIMO) macro instructions. In this exposition, we adopt the HyperCell micro-architecture for the CEs in RHyMe. A HyperCell has a controller and a local storage alongside a reconfigurable data-path (refer to figure 2). The reconfigurable data-path of HyperCell comprises a set of compute units (CUs) connected by a circuit-switched interconnect (refer to figure 2). The CUs and switches can be configured to realize various data-flow graphs (DFGs). The reconfigurable data-path of HyperCell is designed to support pipelined execution of instances of such DFGs. Flow control of data in this circuit-switched network is ensured by a lightweight ready-valid synchronization mechanism that ensures data is not overwritten until it is consumed. This mechanism makes HyperCell tolerant to non-deterministic latencies in data delivery at the CUs' inputs. The local storage of a HyperCell consists of a set of register files, each with one read port and one write port. Each operand in a register file is associated with a valid bit. An operand can only be read if its corresponding valid bit is set. Likewise, an operand can be written to a register location only if the corresponding valid bit is reset. The controller is responsible for delivering both configuration and data inputs to the HyperCell's data-path and for transferring results to RHyMe's memory. It orchestrates four kinds of data transfers:

- Load data from RHyMe's memory to HyperCell's local data storage.
- Load inputs from HyperCell's local storage to HyperCell's reconfigurable data-path.
- Store outputs from HyperCell's reconfigurable data-path to RHyMe's memory.
- Store outputs from HyperCell's reconfigurable data-path to local storage for reuse in subsequent instances of the realized DFG.

These data transfers are specified in terms of a set of four control sequences. The control sequences are stored in dedicated storage inside the HyperCells. The HyperCell controller comprises four FSMs that process these control sequences. The control sequences together realize a modulo schedule [11] of the DFG instantiated on the data-path. Each control sequence contains a prologue and an epilogue, both of which are executed once, and a steady state which is executed multiple times. The sequences are generated in a parametric manner. The runtime parameters are the start and end pointers for the prologue, steady state and epilogue, the base addresses of the inputs and outputs, the number of times the steady state is executed, and the epilogue spill. At runtime, the reconfigurable data-path of HyperCell and its controller are configured for the first HyperOp instance. To facilitate execution of different instances of a HyperOp, the runtime parameters are bound to HyperCell once per instance.
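To make the role of these runtime parameters concrete, the sketch below models how a parametric control sequence might be walked: the prologue and epilogue once, the steady state a runtime-determined number of times. This is a simplified software model under assumed names (including the hypothetical issue_control_word helper), not the actual controller FSMs, and it ignores the epilogue spill.

    /* Simplified model of one parametric control sequence: the prologue
     * and epilogue execute once, the steady state repeats according to a
     * runtime parameter. Names and structure are assumptions for
     * illustration only; issue_control_word is a hypothetical helper. */
    #include <stdint.h>

    typedef struct {
        uint32_t prologue_start, prologue_end;   /* pointers into the sequence store */
        uint32_t steady_start, steady_end;
        uint32_t epilogue_start, epilogue_end;
        uint32_t steady_count;                   /* times the steady state executes  */
        uint32_t input_base, output_base;        /* operand base addresses           */
    } runtime_params_t;

    void issue_control_word(uint32_t addr, uint32_t in_base, uint32_t out_base);

    void run_control_sequence(const runtime_params_t *p)
    {
        for (uint32_t a = p->prologue_start; a <= p->prologue_end; a++)
            issue_control_word(a, p->input_base, p->output_base);

        for (uint32_t i = 0; i < p->steady_count; i++)
            for (uint32_t a = p->steady_start; a <= p->steady_end; a++)
                issue_control_word(a, p->input_base, p->output_base);

        for (uint32_t a = p->epilogue_start; a <= p->epilogue_end; a++)
            issue_control_word(a, p->input_base, p->output_base);
    }

Because only the fields of such a parameter record change between HyperOp instances, re-binding them is far cheaper than re-sending the full configuration, which matches the once-per-instance binding described above.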

A control sequence is realized as a set of control words interpreted at runtime. The HyperCells' control sequences are also responsible for communicating outputs between HyperCells. In order to facilitate inter-HyperCell communication, the local storage of a HyperCell is write-addressable from other HyperCells. An acknowledgement-based synchronization scheme is used to maintain flow control during communication among HyperCells. Further, control words are grouped together to send multiple operands at once to a remote HyperCell's local storage, increasing the granularity of synchronization messages.

Fig. 2. Reconfigurable data-path of HyperCell: compute units and buffers connected by peripheral and corner switches, with the local storage, controller and transporter interfacing to and from the NoC router over data and control paths.

Network on Chip: The NoC of RHyMe provides the necessary infrastructure for inter-HyperCell communication (refer to figure 1), communication between the memory and the HyperCells, and communication between the orchestrator and the HyperCells. Detailed micro-architectural descriptions of the NoC are presented in [6]; we adopt the same NoC for RHyMe. The NoC consists of routers arranged in a toroidal mesh topology. Each router is connected to four neighbouring routers and a HyperCell. Packets are routed from a source router to a destination router based on a deterministic routing algorithm, namely the west-first algorithm [6]. There are four types of packets handled by the NoC, namely load/store packets, configuration packets, synchronization packets and inter-HyperCell packets. The load/store packets carry data between the HyperCells and memory. Configuration packets contain configuration meta-data or runtime parameters sent by the orchestrator to the HyperCells. Synchronization packets are sent from each HyperCell to the orchestrator to indicate the end of computation for the current pHyperOp mapped to that HyperCell.

Inter-HyperCell packets consist of data transmission packets between the HyperCells and acknowledgement packets for maintaining flow control in inter-HyperCell communication. A transporter module acts as the interface between a HyperCell and its router and is responsible for packetizing data communicated across HyperCells as well as load/store requests (refer to figure 2).

3.2 Memory

A distributed global memory storage is provisioned in RHyMe as shown in figure 1. This memory serves as the input and output storage used by successive computations. It can be viewed as an overlay memory explicitly managed by the orchestrator. All inputs required by a particular computation are loaded into memory before execution starts. The memory is implemented as a set of memory banks. Multiple logical partitions are created at compile time. One of the partitions is used as operand data storage and the others act as prefetch buffers for subsequent computations.

3.3 Orchestrator

The orchestrator is responsible for the following activities:

- data movement between RHyMe's memory and the external memory interface,
- computation of runtime parameters,
- scheduling of HyperOps and HyperOp instances and synchronization between successive instances or HyperOps, and
- configuration of the HyperCells and the other orchestrator modules.

Fig. 3. Interactions between modules of the Orchestrator: the configuration unit, data movement unit, HyperOp scheduler unit and runtime parameter computation unit exchange data, configuration metadata and control with the external memory, the host environment, RHyMe's memory and the compute fabric.
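The partitioned overlay memory of section 3.2, combined with the orchestrator's data-movement role, amounts to a double-buffering scheme: while the fabric computes on one partition, another is being refilled. The sketch below is only a software analogy of that partition rotation, with assumed function names; in RHyMe the data movement unit, scheduler and fabric perform these steps concurrently in hardware.

    /* Software analogy of rotating RHyMe's memory partitions between the
     * roles of operand storage and prefetch buffer. Function names are
     * assumptions; in hardware dmu_prefetch() proceeds concurrently with
     * fabric_execute() rather than sequentially as written here. */
    #define NUM_PARTITIONS 2

    void dmu_prefetch(int partition, int instance);   /* external memory -> partition */
    void fabric_execute(int partition, int instance); /* compute on the partition     */
    void dmu_writeback(int partition, int instance);  /* partition -> external memory */

    void run_hyperop_instances(int num_instances)
    {
        dmu_prefetch(0, 0);                            /* fill the first partition */
        for (int i = 0; i < num_instances; i++) {
            int active = i % NUM_PARTITIONS;
            int next   = (i + 1) % NUM_PARTITIONS;
            if (i + 1 < num_instances)
                dmu_prefetch(next, i + 1);             /* next instance's operands */
            fabric_execute(active, i);
            dmu_writeback(active, i);                  /* partition may now be reused */
        }
    }

The partition-by-partition synchronization described below for the HyperOp scheduler unit corresponds to the rule that a partition may be overwritten only after its previous contents have been consumed.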

The aforementioned tasks are carried out by three modules of the orchestrator. Brief descriptions of the modules are given below. Figure 3 depicts the interactions between these modules.

Configuration Unit: The Configuration Unit is responsible for the initial configuration of the HyperCells as well as of the other units of the orchestrator listed below. The configuration metadata for the HyperCells is delivered through the NoC. Configuration metadata for the other modules of the orchestrator is delivered directly to the recipient module. These metadata transactions are presented in figure 3.

Data Movement Unit: As seen in figure 3, the Data Movement Unit (DMU) is responsible for managing data transactions between the external memory and RHyMe's memory. It is configured by the configuration unit such that fetching data for computations and write-back to the external memory overlap with computation. The DMU configuration corresponds to a set of load and store instructions. The overlap of operand fetches with computation is accomplished by dividing the address space of RHyMe's memory into partitions. As mentioned in section 3.2, during execution of one HyperOp on the compute fabric, one partition of the address space acts as operand storage for the active HyperOp and the rest act as prefetch buffers for subsequent HyperOp instances. Each partition of RHyMe's memory is free to be written into when its previous contents have been consumed. This is achieved through partition-by-partition synchronization at the HyperOp Scheduler Unit. The compilation flow is responsible for creating the appropriate configuration meta-data to perform the aforementioned activities.

Runtime Parameter Computation Unit: The Runtime Parameter Computation Unit (RPCU) is responsible for computing the HyperCell runtime parameters listed in section 3.1. The RPCU computes the runtime parameters for a successive instance of a HyperOp while the HyperCells are busy computing previous instances, thus amortizing the overheads of parameter computation. The runtime parameter computation is expressed as a sequence of ALU and branch operations, and the RPCU comprises a data-path that processes these instructions. Similar to the DMU, the RPCU works in synchrony with the HyperOp Scheduler Unit. The runtime parameters computed are forwarded to the HyperOp Scheduler Unit, which in turn binds them to the compute fabric (see figure 3).

HyperOp Scheduler Unit: The HyperOp Scheduler Unit (HSU) is responsible for scheduling instances of a HyperOp onto the compute fabric for execution. The HSU waits for the conditions listed previously to be met in order to trigger the execution of a new HyperOp or HyperOp instance on the HyperCells. When all the HyperCells are free to execute a new HyperOp instance, the scheduler unit binds a new set of runtime parameters to the HyperCells to enable execution of the instance.

4 Results

In this section, we present experimental results to demonstrate the effectiveness of the RHyMe architecture. HPC kernels from the Polybench benchmark suite [1] were employed in this evaluation. The kernels are from the domains of linear

algebra and stencil computations. For each kernel we create 6 experimental setups with different problem sizes, listed in table 1.

Table 1. Computational complexity and problem sizes of the kernels
(matmul O(n^3), gesummv O(n^2), gemver O(n^2), syrk O(n^3), syr2k O(n^3), jacobi1d O(mn), jacobi2d O(mn^2), siedel2d O(mn^2))
  Setup1: n = 256 for all kernels; jacobi1d m = 2, jacobi2d m = 10, siedel2d m = 10
  Setup2: n = 512 for all kernels; jacobi1d m = 2, jacobi2d m = 10, siedel2d m = 10
  Setup3: n = 1024 for all kernels; jacobi1d m = 10, jacobi2d m = 20, siedel2d m = 20
  Setup4: n = 2048 for all kernels; jacobi1d m = 100, jacobi2d m = 20, siedel2d m = 20
  Setup5: n = 4096 for all kernels; jacobi1d m = 100, jacobi2d m = 100, siedel2d m = 100
  Setup6: n = 8192 for all kernels; jacobi1d m = 100, jacobi2d m = 100, siedel2d m = 100

For these experiments, we have selected a template of the RHyMe compute fabric with HyperCells arranged in 4 rows and 6 columns. Each HyperCell comprises 25 compute units (CUs), each consisting of an integer ALU and a single precision floating point unit (FPU). The local storage of each HyperCell consists of 8 banks of 64-deep register files. A HyperCell has a configuration memory of 16 KB. As mentioned in section 3.2, RHyMe's distributed global memory is divided into 12 sets; 2 sets on either side of the fabric act as data storage for a column of four HyperCells. A set consists of 4 banks of 16 KB each, with one router giving access to 4 banks. The overall storage capacity is hence 768 KB. Since each router is connected to 4 banks on either side, 4 loads/stores can be serviced per request. Thus, each load/store request from a HyperCell can address four words from the memory. RHyMe's orchestrator has a configuration storage for the different components of the orchestrator and a HyperOp configuration storage corresponding to the HyperCells' configuration metadata. The former is 16 KB and the latter is 20 KB in size and can hold HyperCell configurations for four HyperOps at a time. In this exposition, RHyMe is assumed to be embedded in a heterogeneous multicore machine with a shared L2 cache. The L2 cache size is 512 KB. The data movement unit (DMU) of RHyMe's orchestrator interfaces directly with the shared L2 cache. Figure 4 shows the steps involved in executing a HyperOp in RHyMe. We refer to the data transfer latency as T_mem, the computation latency as T_comp, the runtime parameter binding latency as T_param and the synchronization latency as T_sync. For maximizing performance, (max(T_mem, T_param + T_comp) + T_sync) should be minimized. Given a kernel and a fixed number of HyperCells, T_comp, T_sync and T_param are fixed. Hence, maximizing performance requires T_mem to be less than or equal to (T_param + T_comp), such that the computation and parameter binding steps completely overlap the data transfer step. T_mem can be reduced by increasing the bandwidth between the L2 cache and RHyMe memory. We have hence

conducted experiments for two different configurations, with the results given in table 2. In the first configuration (referred to as MemSetup1), the L2 cache has a line size of 64 B and the DMU-to-RHyMe-memory interface is capable of handling one word per cycle. In the second configuration, referred to as MemSetup2, the L2 cache line size is doubled to 128 B and the DMU-to-RHyMe-memory interface is capable of handling two words per cycle.

Fig. 4. Execution flow of HyperOps: for each HyperOp instance, runtime parameter binding (T_param), transfer of data between RHyMe's memory and the external memory (T_mem), computation (T_comp), and synchronization among producer-consumer HyperOps (T_sync).

In table 2 we present (T_param + T_comp) and T_mem for various kernels. We define a metric

  η = ((T_param + T_comp) - T_mem) / max((T_param + T_comp), T_mem)

that measures the effectiveness of overlap of the data transfer step with the compute and configuration step. Figure 5 presents η for the various kernels for the two different configurations. A positive value in figure 5 indicates that the data transfer is completely hidden. It can be observed that increasing the bandwidth between the L2 cache and RHyMe memory (MemSetup2) helps increase η for most of the kernels. However, in the case of gesummv, gemver and jacobi2d, η is negative with MemSetup2 as well. In the case of jacobi2d, this is attributed to the relatively large amount of data consumed and produced per HyperOp. In the case of gesummv and gemver, the volume of data required is comparable with the volume of computation in each HyperOp, whereas the other kernels require an order of magnitude less data than computation. For these two kernels, T_mem dominates the overall execution time and no reasonable increase in memory bandwidth can hide it effectively (refer to table 2). In table 3 we present the computation times for various kernels as fractions of their respective overall execution times. We observe that as the problem size increases, the fraction grows and becomes close to one. This indicates the effective amortization of configuration and synchronization latencies at larger problem sizes.

Table 2. Comparison of computation vs. memory transaction latencies (per HyperOp) for various kernels: T_comp, T_mem and the effectiveness of overlap η under MemSetup1 and MemSetup2, for matmul, gesummv, gemver, syrk, syr2k, jacobi1d, jacobi2d and siedel2d.

Fig. 5. η for various kernels under MemSetup1 and MemSetup2.

Table 3. Computation time as a fraction of overall execution time for different kernels under MemSetup1 and MemSetup2, for problem sizes Setup1-Setup6 (matmul, gesummv, gemver, syrk, syr2k, siedel2d, jacobi1d, jacobi2d).

Fig. 6. Efficiency of execution for various kernels on RHyMe (Setup1 through Setup6, under MemSetup1 and MemSetup2); gesummv and gemver are plotted separately due to the order of magnitude difference in efficiency.

This can be attributed to the improvement in temporal utilization of the resources in the compute fabric of RHyMe with increasing problem size. An exception to this trend is jacobi1d. In the case of jacobi1d, even for the larger problem sizes (Setup 4, 5 and 6), the amount of computation involved is not large enough to effectively amortize configuration overheads. Hence we observe a significant configuration overhead for jacobi1d. For any given kernel, efficiency is measured as the ratio of the actual performance of the kernel on RHyMe to the theoretical peak performance. Actual performance is affected by various architectural artifacts of RHyMe such as the NoC bandwidth, the RHyMe memory bandwidth and the HyperCell's local storage bandwidth. While measuring peak performance we simply consider the parallelism available in each kernel and the number of basic operations that can be executed in parallel. The efficiency for the various kernels with the experimental setups in table 1 can be seen in figure 6. With increasing problem sizes, efficiency increases since configuration and synchronization overheads are more effectively amortized (refer to table 3). As mentioned previously, in the case of gesummv and gemver, the overwhelming dominance of data transfer latency leads to less than 1% efficiency. For the other kernels, we achieve efficiencies ranging from 14% to 40% with large problem sizes. Table 4 lists the performance for the different kernels for the largest problem size (Setup6) in terms of Giga Floating Point Operations per Second (GFLOPS) at a 500 MHz operating frequency. The table also presents the improvement in performance achieved by increasing the bandwidth between the external L2 and RHyMe's memory. Against a theoretical peak performance of 300 GFLOPS (24 HyperCells with 25 CUs each at 500 MHz, assuming one floating-point operation per CU per cycle), for

most kernels we achieve performance ranging from 42 to 136 GFLOPS. Due to the reasons mentioned previously, gesummv and gemver show up to 2 GFLOPS performance and are unsuitable for execution on the RHyMe platform.

Table 4. Performance of various kernels on RHyMe measured at two different configurations, MemSetup1 and MemSetup2: performance in GFLOPS and the percentage increase, for matmul, gesummv, gemver, syrk, syr2k, jacobi1d, jacobi2d and siedel2d.

5 Conclusion

In this paper we presented the architectural details of the REDEFINE HyperCell Multicore (RHyMe). RHyMe is a data-driven coarse-grain reconfigurable architecture designed for fast execution of loops in HPC applications. RHyMe facilitates exploitation of spatial and temporal parallelism. The CEs of RHyMe, a.k.a. HyperCells, offer reconfigurable data-paths for realizing MIMO operations and alleviate the fetch-decode overheads of a fine-grain instruction processing machine. HyperCell's reconfigurable data-path offers the ability to exploit a high degree of fine-grain parallelism, while the controller of HyperCell enables exploiting pipeline parallelism. A multitude of HyperCells that can communicate with each other directly enables the creation of large computation pipelines. RHyMe employs a lightweight configuration, scheduling and synchronization mechanism with minimal runtime overheads, as is evident from the results presented.

References

1. Polybench: Polyhedral benchmark. pouchet/software/polybench/
2. Alle, M., Varadarajan, K., Fell, A., Reddy, C.R., Nimmy, J., Das, S., Biswas, P., Chetia, J., Rao, A., Nandy, S.K., Narayan, R.: REDEFINE: Runtime reconfigurable polymorphic ASIC. ACM Trans. Embedded Comput. Syst. 9(2) (2009)
3. Brewer, T.M.: Instruction set innovations for the Convey HC-1 computer. IEEE Micro 30(2) (2010)

4. Burger, D., Keckler, S., McKinley, K., Dahlin, M., John, L., Lin, C., Moore, C., Burrill, J., McDonald, R., Yoder, W.: Scaling to the end of silicon with EDGE architectures. Computer 37(7) (July 2004)
5. Das, S., Madhu, K., Krishna, M., Sivanandan, N., Merchant, F., Natarajan, S., Biswas, I., Pulli, A., Nandy, S., Narayan, R.: A framework for post-silicon realization of arbitrary instruction extensions on reconfigurable data-paths. Journal of Systems Architecture 60(7) (2014)
6. Fell, A., Biswas, P., Chetia, J., Nandy, S.K., Narayan, R.: Generic routing rules and a scalable access enhancement for the network-on-chip RECONNECT. In: Annual IEEE International SoC Conference, SoCC 2009, September 9-11, 2009, Belfast, Northern Ireland, UK, Proceedings (2009)
7. Govindaraju, V., Ho, C.H., Sankaralingam, K.: Dynamically specialized datapaths for energy efficient computing. In: HPCA. IEEE Computer Society (2011)
8. Madhu, K.T., Das, S., Krishna, M., Sivanandan, N., Nandy, S.K., Narayan, R.: Synthesis of instruction extensions on HyperCell, a reconfigurable datapath. In: Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIV), 2014 International Conference on. IEEE (2014)
9. Madhu, K.T., Das, S., Nalesh, S., Nandy, S.K., Narayan, R.: Compiling HPC kernels for the REDEFINE CGRA. In: 17th IEEE International Conference on High Performance Computing and Communications, HPCC 2015, 7th IEEE International Symposium on Cyberspace Safety and Security, CSS 2015, and 12th IEEE International Conference on Embedded Software and Systems, ICESS 2015, New York, NY, USA, August 24-26 (2015)
10. Noori, H., Mehdipour, F., Inoue, K., Murakami, K.: Improving performance and energy efficiency of embedded processors via post-fabrication instruction set customization. The Journal of Supercomputing 60(2) (2012)
11. Rau, B.R.: Iterative modulo scheduling: An algorithm for software pipelining loops. In: Proceedings of the 27th Annual International Symposium on Microarchitecture. MICRO 27, ACM, New York, NY, USA (1994)
12. Shami, M., Hemani, A.: Partially reconfigurable interconnection network for dynamically reprogrammable resource array. In: ASIC, 2009. ASICON '09. IEEE 8th International Conference on (2009)
13. Vassiliadis, S., Wong, S., Gaydadjiev, G., Bertels, K., Kuzmanov, G., Panainte, E.M.: The MOLEN polymorphic processor. IEEE Trans. Computers 53(11) (2004)

Project Beehive: A Hardware/Software Co-designed Stack for Runtime and Architectural Research

Christos Kotselidis, Andrey Rodchenko, Colin Barrett, Andy Nisbet, John Mawer, Will Toms, James Clarkson, Cosmin Gorgovan, Amanieu d'Antras, Yaman Cakmakci, Thanos Stratikopoulos, Sebastian Werner, Jim Garside, Javier Navaridas, Antoniu Pop, John Goodacre, and Mikel Luján
Advanced Processor Technologies Group, The University of Manchester, first.last@manchester.ac.uk

Abstract. The end of Dennard scaling combined with stagnation in architectural and compiler optimizations makes it challenging to achieve significant performance deltas. Solutions based solely in hardware or software are no longer sufficient to maintain the pace of improvements seen during the past few decades. In hardware, the end of single-core scaling resulted in the proliferation of multi-core system architectures; however, this has forced complex parallel programming techniques into the mainstream. To further exploit physical resources, systems are becoming increasingly heterogeneous with specialized computing elements and accelerators. Programming across a range of disparate architectures requires a new level of abstraction that programming languages will have to adapt to. In software, emerging complex applications, from domains such as Big Data and computer vision, run on multi-layered software stacks targeting hardware with a variety of constraints and resources. Hence, optimizing for the power-performance (and resiliency) space requires experimentation platforms that offer quick and easy prototyping of hardware/software co-designed techniques. To that end, we present Project Beehive: a hardware/software co-designed stack for runtime and architectural research. Project Beehive utilizes various state-of-the-art software and hardware components along with novel and extensible co-design techniques. The objective of Project Beehive is to provide a modern platform for experimentation on emerging applications, programming languages, compilers, runtimes, and low-power heterogeneous many-core architectures in a full-system co-designed manner.

1 Introduction

Traditionally, software and hardware providers have been delivering significant performance improvements on a yearly basis. Unfortunately, this is beginning to change. Predictions about dark silicon [2] and resiliency, especially in the forthcoming exascale era [1], suggest the traditional approaches to computing

problems are impeded by power constraints; saturation in architectural and compiler research; and process manufacturing. Mitigation of these problems is likely to come through vertical integration and optimization techniques, or bespoke solutions for each problem or cluster of problems. However, whilst such an approach may yield the desired results, it is both complex and expensive to implement. At the current time only a handful of vendors, such as Oracle, Google, Facebook, etc., have both the financial resources and the engineering expertise required to deliver on this approach. Co-designing an architectural solution at the system level 1 requires significant resources and expertise. The design space to be explored is vast, and there is the potential that a poor, even if well intentioned, decision will propagate through the entire co-designed stack; amending the consequences at a later date may prove extremely complex and expensive, if not impossible. Project Beehive aims to provide a platform for rapid experimentation and prototyping, at the system level, enabling accurate decision making for architectural and runtime optimizations. The project is intended to facilitate:

- Co-designed research and development for traditional and emerging workloads such as Big Data and computer vision applications.
- Co-designed compiler and runtime research of multiple languages building on top of Truffle [5], Graal, and Maxine VM [4].
- Heterogeneous processing on a variety of platforms focusing mainly on ARMv7, Aarch64, and x86.
- Fast prototyping and experimentation on heterogeneous programming on GPGPUs and FPGAs.
- Co-designed architectural research on power, performance, and reliability techniques.
- Dynamic binary optimization techniques via binary instrumentation and optimization at both the system and chip level.

The following subsections describe the general architecture of Project Beehive and its various components. Finally, some preliminary performance numbers along with the short-term and long-term plans are also presented.

2 Beehive Architecture

2.1 Overview

Beehive, as depicted in Figure 1, targets a variety of workloads spanning from traditional benchmarks to emerging applications from a variety of domains such as computer vision and Big Data. Applications can execute either directly on hardware, indirectly on hardware using our dynamic binary optimization layer (MAMBO and MAMBO64), or inside our simulator.

1 In this context we refer to an architectural solution as a co-designed solution that spans from a running application to the underlying hardware architecture.

Fig. 1. Project Beehive architecture overview: applications (traditional benchmarks such as SpecJVM and DaCapo, DSLs, computer vision SLAM applications, and Big Data applications such as Spark, Flink and Hadoop), the runtime layer (Maxine VM with Truffle, T1X, Graal and JACC, the memory manager/GC, native applications and accelerator back-ends such as PTX and OpenCL), the operating system with Beehive drivers and services (PIN, MAMBO PIN, the MAMBO dynamic binary optimizer), and the compute platform (heterogeneous architectures including Aarch64, x86, ARMv7, GPUs, FPGAs, VPUs and ASICs; simulators including the gem5 full-system simulator, Cacti, the McPAT power simulator, NVSim and the Hotspot thermal simulator; and emulated architectures via the MAMBO dynamic binary translator), spanning performance, power, ISA extensions and resiliency concerns.

The runtime layer centers around an augmented Maxine Research VM and the MAMBO components. The VM provides the capability to target both high-performance x86 and low-power ARM systems, in addition to heterogeneous architectures. Our enhanced capability is made possible via a range of compilers: the T1X and Graal compilers support the ARMv7, Aarch64, and x86 architectures, while our Jacc compiler can target GPGPUs, FPGAs, and SIMD units. Moreover, by replacing the C1X compiler with Graal it is also possible to fully benefit from the Truffle AST interpreter on our VM. Beehive offers the ability to perform architectural research via an integrated simulation environment. Our environment is built around the gem5, Cacti, NVSim, and McPAT tools. By using this environment, novel micro-architectures can be simulated and performance, power, temperature, and reliability metrics can be gathered.

2.2 Applications

Beehive targets a variety of applications in order to enable co-designed optimizations in numerous domains. Whilst compiler and micro-architectural research traditionally uses benchmarks such as SPEC and PARSEC, Beehive also considers complex emerging application areas. The two primary domains considered are Big Data software stacks, such as Spark, Flink, and Hadoop, along with computer vision SLAM (Simultaneous Localization and Mapping). In the vision arena SLAMBench [3] will be the main vehicle of experimentation. SLAMBench currently includes implementations in C++, CUDA, OpenCL, OpenMP and Java, allowing a broad range of languages, platforms and techniques to be investigated.

2.3 Runtime Layer

Some of the key features of Beehive are found in its runtime layer, which provides capability beyond simply running native applications. For instance, our MAMBO64 component is able to translate ARMv7 binaries into Aarch64 instructions at runtime, whilst MAMBO enables binary translation/optimization in a manner similar to PIN 2. Despite being able to execute native C/C++ applications, Beehive has been designed to target languages that utilize a managed runtime system. Our managed runtime system is based on the Maxine Research VM, which has been augmented with a selection of state-of-the-art components. For example, we have properly integrated and increased the stability of both the template (T1X) compiler and the Graal compiler, which also allows Project Beehive to utilize the Truffle AST interpreter. Moreover, this work has required us to undertake extensive infrastructure work to allow us to easily downstream the Graal and Truffle code bases in order to provide Beehive with the state-of-the-art components on a regular basis. The VM is designed to enable execution across a range of hardware configurations. To that end, we introduce support for low-power ARM systems by extending the T1X and Graal compilers to support both the ARMv7 and Aarch64 architectures, along with continuing the existing x86 support. Additionally, the VM supports heterogeneous execution via the Jacc (Java accelerator) framework. By annotating source code using Jacc's API, which is similar to OpenMP/OpenACC, it is possible to execute performance-critical code on specialized hardware such as GPGPUs and FPGAs. Regarding the memory manager (GC), various options are being explored, ranging from enhancing Maxine VM's current GC algorithms to porting existing state-of-the-art memory management components.

2.4 Hardware Layer

As depicted in Figure 1, Project Beehive targets a variety of hardware platforms and therefore significant effort is being placed in providing the appropriate support for the compilers and runtime of choice. In addition to targeting conventional CPU/GPU systems, it is also possible to target FPGA systems, the primary target being the Xilinx Zynq ARM/FPGA. In-house tools and IP (Intellectual Property) can be used to rapidly assemble hardware systems targeted at specific applications, for example accelerators for computer vision, hardware models appropriate to system-level simulation, or database accelerators. The hardware accelerators have access to the processor's main memory at 10 Gb/s through the processor's cache system, allowing high-speed transfer of data between generic and custom processing resources. The system uses an exclusively user-space driver, allowing new hardware to be added

and easily linked to runtimes or binary translators. Using the Zynq's ARM processors it is possible to identify IP blocks currently configured on the FPGA and, if necessary, reconfigure it, whilst applications continue running on the host ARM device. This allows a runtime to dynamically tune its hardware resources to match its power/performance requirements. Typical examples of the hardware layer in use might include preprocessing image data in SLAMBench; integrating with MAMBO's dynamic binary instrumentation to provide high performance memory system simulation, using our memory system IP; or providing a small low-power micro-controller which might be used for some runtime housekeeping task.

Fig. 2. DaCapo-9.12-bach benchmarks (higher is better) normalized to Hotspot-C2-Current: avrora, batik, fop, h2, jython, pmd, lusearch, luindex, xalan, tradesoap, tradebeans, tomcat, sunflow and the geometric mean, for Hotspot-C2-Current, Hotspot-Graal-Current, Maxine-Graal-Original and Maxine-Graal-Current.

2.5 Simulation Layer

Besides running directly on real hardware, Beehive offers the opportunity to conduct micro-architectural research via its simulation infrastructure. The gem5 full-system simulator has been augmented to include accurate power and temperature models using the McPAT and HotSpot simulators. Both simulators are invoked within gem5, allowing power and temperature readings to be triggered either from the simulator (allowing transient power and temperature traces to be recorded) or from within the simulated OS (allowing accurate power and temperature figures to be used within user-space programs) with minimal performance overhead. Furthermore, the non-volatile memory simulator NVSim has been incorporated into the simulation infrastructure. This can be invoked by McPAT (alongside the conventional SRAM modeling tool Cacti) and allows accurate delay, power and temperature modeling of non-volatile memory anywhere in the memory hierarchy.

3 Initial Evaluation

Project Beehive combines work conducted on various parts of the co-designed stack. Although, presently, it cannot be evaluated holistically, individual components are very mature and can be independently evaluated. Due to space limitations, we present preliminary developments in two areas of interest.

3.1 Maxine VM Development

The following major changes were made to Maxine VM since Oracle Labs stopped its active development: 1) profiling instrumentation in T1X, 2) more optimistic optimizations were enabled (including optimistic elimination of zero-count exception handlers), and 3) critical math substitutions were enabled. The following configurations were evaluated on the DaCapo-9.12-bach benchmarks (with the exception of eclipse) as depicted in Figure 2: 1) Hotspot-C2-Current (ver ), 2) Hotspot-Graal-Current 3, 3) Maxine-Graal-Original 4, 4) Maxine-Graal-Current 5. (3 rev ; 4 maxine rev.8749, graal rev ; 5 maxine rev.8809, graal rev.11557.) Our work on improving the performance and stability of Maxine-Graal resulted in a 1.64x speedup over the initially committed version. The plan is to keep working towards increasing the performance and stability of all versions of Maxine-Graal: ARMv7, Aarch64, and x86.

3.2 MapReduce Use Case

Parallel frameworks, such as Flink, Spark and Hadoop, abstract functionality from the underlying parallelism. Performance tuning is therefore reliant on the capabilities provided through specializations in the API. These attempts to reduce the semantic distance between application elements require additional experience and expertise. Furthermore, every layer in the software stack abstracts the functionality and hardware even further. Co-designing the layers in a complete application is an alternative approach that aims to maintain productivity for all. MapReduce is a very simple framework, yet a popular and powerful tool in the Big Data arena. In multicore implementations there exists a semantic distance between the Map and Reduce methods. The method-level abstraction for compilation in Java cannot span this distance and so compiles each method independently. Existing MapReduce frameworks offer the Combine method explicitly in order to compensate for this inconvenience. By designing a new MapReduce framework with a co-designed optimizer, it is possible to inline the Reduce method within the Map method. This allows the optimizing Java compiler to virtualize or eliminate many objects that would otherwise be required as intermediate data. It is possible to reduce execution times by up to 2.0x for naive, yet efficient, benchmarks while at the same time reducing the strain on the GC. Importantly, this is possible without altering or extending the API presented to the user.

4 Conclusions

In this paper, we introduced Project Beehive: a hardware/software co-designed stack for full-system runtime and architectural research. Project Beehive builds

on top of existing state-of-the-art as well as novel components at all layers of the stack. The short-term plans are to complete the ARMv7 and Aarch64 ports of the T1X and Graal compilers, while increasing confidence by achieving high application coverage, along with establishing a high-performing GC framework. Our vision regarding Project Beehive is to unify the platform capabilities under a semantically aware runtime, increasing developer productivity. Furthermore, we plan on defining a hybrid ISA between emulated and hardware capabilities in order to provide a roadmap for moving capabilities between abstractions offered in software and, later, in hardware. Finally, we plan to work on new hardware services for scale-out and on the representation of volatile and non-volatile communication services in order to provide a consistent view of platform capabilities across heterogeneous processors.

Acknowledgement. The research leading to these results has received funding from UK EPSRC grants DOME EP/J016330/1, AnyScale Apps EP/L000725/1, INPUT EP/K015699/1 and PAMELA EP/K008730/1, and the European Union's Seventh Framework Programme under grant agreement n AXLE project and n RETHINK big. Mikel Luján is funded by a Royal Society University Research Fellowship and Antoniu Pop by a Royal Academy of Engineering Research Fellowship.

References

1. Cappello, F., Geist, A., Gropp, B., Kale, L., Kramer, B., Snir, M.: Toward exascale resilience (November 2009)
2. Esmaeilzadeh, H., Blem, E., St. Amant, R., Sankaralingam, K., Burger, D.: Dark silicon and the end of multicore scaling. In: ISCA '11 (2011)
3. Nardi, L., Bodin, B., Zia, M.Z., Mawer, J., Nisbet, A., Kelly, P.H.J., Davison, A.J., Luján, M., O'Boyle, M.F.P., Riley, G., Topham, N., Furber, S.: Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM. In: ICRA (2015)
4. Wimmer, C., Haupt, M., Van De Vanter, M.L., Jordan, M., Daynès, L., Simon, D.: Maxine: An approachable virtual machine for, and in, Java (January 2013)
5. Würthinger, T., Wimmer, C., Wöß, A., Stadler, L., Duboscq, G., Humer, C., Richards, G., Simon, D., Wolczko, M.: One VM to rule them all. In: Proceedings of the 2013 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming & Software. Onward! '13 (2013)

Position Paper: Reaching intrinsic compute efficiency requires adaptable micro-architectures

Mark Wijtvliet, Luc Waeijen, Michaël Adriaansen, and Henk Corporaal
Eindhoven University of Technology, 5612 AZ, The Netherlands, {m.wijtvliet, l.j.w.waeijen, &

Abstract. Today's embedded applications demand high compute performance at a tight energy budget, which requires a high compute efficiency. Compute efficiency is upper-bounded by the technology node; however, in practice programmable devices are orders of magnitude away from achieving this intrinsic compute efficiency. This work investigates the sources of inefficiency that cause this, and identifies four key design guidelines that can steer compute efficiency towards sub-picojoule per operation. Based on these guidelines a novel architecture with an adaptive micro-architecture, and an accompanying tool flow, is proposed.

Keywords: adaptive micro-architecture, intrinsic compute efficiency, spatial layout

1 Introduction

Modern embedded applications require high computational performance under severe energy constraints. Mobile phones, for example, have to implement the 4G protocol, which has a workload of about 1000 GOPS [11]. Due to battery capacity limitations, the computation on a mobile phone has a budget of about 1 Watt. Thus, under these requirements, each computational operation can only use about 1 pJ of energy (1 W / 1000 GOPS = 1 pJ per operation). Another example is ambulatory healthcare monitoring, where a patient's vital signs are monitored over an extended period of time. Because these devices have to be mobile and small, energy is very limited. An added constraint is that the compute platform has to be programmable, as the field of ambulatory healthcare is still developing, and improved algorithms and new applications are developed at a fast rate. To support such embedded applications, a computational operation has an energy budget in the sub-picojoule domain. However, current programmable devices do not have a high enough compute efficiency to meet this requirement. One of the most compute-efficient microprocessors, the ARM Cortex-M0, has a compute efficiency of 5.1 pJ/op in 40nm low-power technology [8]. The intrinsic compute efficiency (ICE) of 45nm technology is 1 pJ/op [6]. There is thus a gap between the ICE and the achieved efficiency of at least a factor of 5. However, to support compute-intensive embedded applications, processors more powerful than the Cortex-M0 are needed, which increases the gap up to several orders of

magnitude [6]. In order to meet the demands of modern embedded applications, this gap has to be closed. This work investigates, in section 2, why programmable processing devices do not meet the ICE and how this can be improved. It is concluded that an adaptive micro-architecture should be leveraged to improve parallelism exploitation and save energy on dynamic control and data transport, two of the largest sources of inefficiency. Based on this observation a novel architecture is proposed in section 3, designed to narrow the efficiency gap. The tool flow and compiler approach are discussed in section 4. Section 5 concludes this work.

2 Discussion and Related Work

The achieved compute efficiency (ACE) of current programmable architectures is several orders of magnitude lower than the ICE [6]. Hameed et al. find a similar gap (500x) between general purpose processors and Application Specific Integrated Circuits (ASICs), which come very close to the ICE. There are many sources of inefficiency for general purpose processors that contribute to this gap. Hameed et al. identified several of these sources [4]. They extend a Tensilica processor [2] with complex instructions and a configurable hardware accelerator to support a target application. This brings the compute efficiency for the application within 3x of the ICE, but at the expense of generality: the resulting architecture is highly specialized. Based on their optimizations it can be concluded that the largest sources of overhead are:

1. Dynamic control, e.g., fetching and decoding instructions.
2. Data transport, e.g., moving data between memory, caches and register files.
3. Mismatch between application and architecture parallelism, e.g., an 8-bit add on a 32-bit adder.

The first two sources of overhead can be attributed to sequential execution. A large amount of energy is used because the processor fetches and decodes a new instruction every cycle. This can be mitigated by using spatial layout (execution in parallel). By increasing the number of issue slots, it is possible to achieve a single-instruction steady state, such that no new instructions need to be fetched for an extended period of time. Refer to the for-loop in Fig. 1 for an example. The loop body contains operations A, B, C and the control flow computation CF. The loop in the figure can be transformed from the sequential version (left) to the spatial version (right) by software pipelining. The single-cycle loop body in Fig. 1 does not require any other instructions to be fetched and decoded. It can be observed that a general purpose processor with only one issue slot can never support single-cycle loops due to the control flow. This technique is already used in very long instruction word (VLIW) processors [9], but is only applicable if the number of issue slots and their compute capabilities match the loop. ASICs and Field Programmable Gate Arrays (FPGAs) implement the extreme form of spatial layout: by completely spatially mapping the application, the need for instruction fetching and decoding is eliminated altogether.
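As a source-level illustration of the transformation shown in Fig. 1, the C sketch below software-pipelines a loop with three dependent operations A, B and C into a prologue, a single steady-state statement and an epilogue. The stage functions are placeholders; on a spatial architecture the three calls in the steady state would occupy parallel issue slots in the same cycle rather than executing sequentially.

    /* Software pipelining of the loop of Fig. 1: after the prologue, each
     * steady-state iteration carries work from three consecutive loop
     * iterations (C of i-2, B of i-1, A of i), so only one wide
     * instruction would need to be fetched per cycle on a spatial
     * data-path. a(), b() and c() are placeholder stages. */
    extern int  a(int i);   /* stage A of iteration i       */
    extern int  b(int x);   /* stage B, consumes A's result */
    extern void c(int y);   /* stage C, consumes B's result */

    void pipelined_loop(int n)   /* assumes n >= 2 */
    {
        int xa, xb;

        /* prologue: start the first two iterations */
        xa = a(0);
        xb = b(xa); xa = a(1);

        /* single-cycle steady state (conceptually one VLIW instruction) */
        for (int i = 2; i < n; i++) {
            c(xb); xb = b(xa); xa = a(i);
        }

        /* epilogue: drain the last two iterations */
        c(xb); xb = b(xa);
        c(xb);
    }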

Fig. 1. Program execution example for multi- and single-cycle loops

The second source of inefficiency, data transport, is reduced substantially by adapting the data-path to the application in such a way that the register file (RF) and memory system are bypassed as much as possible, as in explicit datapaths [12]. The memory system and the RF are two of the main energy users in a processor [4]. Thus, by keeping data in the pipeline, the overall energy usage can be reduced significantly.
The third source of inefficiency can be addressed by adapting the micro-architecture to the application. Applications have varying types of parallelism: bit-level (BLP), instruction-level (ILP) and data-level (DLP). BLP is exploited by multi-bit functional units, such as a 32-bit adder. ILP is exploited with multiple issue slots, such as in very long instruction word (VLIW) processors. Finally, DLP is exploited by single instruction multiple data (SIMD) architectures. Different applications expose different types and amounts of parallelism. When the micro-architecture is tuned to the application, such as in an ASIC or FPGA, the mix of different types of parallelism can be exploited in the optimal manner.
Micro-architecture adaptation is the key to achieving a higher compute efficiency. FPGAs and ASICs do this, but at an unacceptable price. For ASICs the data-path is adapted to one set of applications, so generality is lost. FPGAs are configured at gate level, which requires many memory cells to store the hardware configuration (bitfile). These cells leak current, resulting in high static power consumption [7, 3]. Furthermore, the dynamic power is also high [1], due to the large configurable interconnect. Additionally, efficiently compiling for FPGAs is hard due to the very fine granularity. Although there are High-Level Synthesis tools that reduce the programming effort, they cannot always provide high-quality results [13] because of this.
Summarizing, to achieve high compute efficiency, the overhead of adaptability should be reduced, while still supporting:
1. Single instruction steady state, e.g., single-cycle loops
2. Data transport reduction, e.g., explicit bypassing
3. Application-tailored exploitation of parallelism, e.g., VLIW with matching SIMD vector lanes
4. Programmability

3 Architecture Proposal
In this section an energy-efficient architecture is proposed that ticks all the requirement boxes from section 2. An adaptive micro-architecture is realized by separating control units, such as instruction fetch (IF) and instruction decoder (ID), from the functional units (FU). Each ID can be connected to one or more FUs through a control network, and FUs are interconnected via a data network. These networks use switch-boxes that are configured before the application is executed, and remain static during execution, much like FPGAs. The number of switch-boxes is much smaller than in an FPGA, and multiple bits are routed at once. Therefore the proposed architecture requires significantly fewer configuration bits, thereby avoiding the high static energy usage that FPGAs suffer from. Various FU types are considered: arithmetic logic units, load-store units, RFs and branch units. The adaptive micro-architecture enables high energy efficiency while attaining high compute performance.

3.1 Single instruction steady state
It is possible to construct VLIW-like micro-architectures by grouping IDs in a common control group and connecting them to FUs, as shown in Fig. 2. In this figure a three-issue-slot VLIW is shown. By adapting the number of issue slots to the application, a single-instruction steady state is supported. This reduces instruction fetching and decoding, resulting in lower dynamic energy usage. Multiple ID control groups enable the construction of multiple independent VLIWs.

3.2 Data transport reduction
Reduction of data transport is achieved by directly connecting FUs through a switch-box network. This allows results from one FU to bypass the RF and memory, and directly flow to the next FU. Complex data-flow patterns, such as butterfly patterns in the fast Fourier transform, can be wired between the FUs. This reduces RF accesses that would otherwise have been required to accommodate these patterns. The special case of data-flow patterns where each compute node performs the same operation, such as reduction trees, can be supported with only one ID to control the entire structure.

3.3 Application tailored exploitation of parallelism
The varying amount of ILP in an application can be exploited by the configurable VLIW structures. DLP is captured by constructing SIMD-type vector lanes within each issue slot, as shown in Fig. 2, where issue slot 3 has a vector width of four. BLP is addressed by combining multiple narrower FUs into wider units, e.g., combining two 16-bit FUs into one 32-bit unit. This allows efficient support of multiple data widths, e.g., processing 8-bit pixels for an image application in one case, and supporting 32-bit fixed point for health monitoring applications.

3.4 Programmability
The possible configurations in the proposed architecture all bear a strong resemblance to VLIW processors with an explicit data-path and issue slots with vector lanes. This requires a compiler which supports explicit bypassing, which is described in more detail in section 4.

Fig. 2. Proposed architecture

Fig. 3. Tool flow

4 Tool flow
Many architectures have been published in the literature, but few of them are used in industry, often because of the lack of mature tools. The development of the tool flow for the highly flexible proposed architecture is challenging. A chip description file that lists the available resources and interconnect options is used as input to the tool flow, as shown in Fig. 3. Generation of the hardware description (HD) and mapping of data-flow patterns to FUs is done based on this file. The tools to generate a synthesizable HD are already implemented, but require integration with the full tool flow. The most challenging part, the construction of the compiler, is in progress. Section 4.1 discusses the challenges and various approaches to deal with them.

Fig. 4. Resource and dependence graphs

4.1 Compiler
Designing the compiler is particularly challenging because of the explicit data-path of the proposed architecture, and because code has to be generated for all possible combinations of IDs and FUs. In addition to the tasks of a regular compiler, the compiler needs to route data between FUs. This is similar to compilers for transport-triggered architectures [5]. One approach to scheduling for an explicit data-path is list scheduling using a resource graph (RG). The RG has a node for every FU at every clock cycle in the schedule, as shown in Fig. 4. Scheduling is done by mapping nodes from the data dependence graph onto the nodes in the RG. However, scheduling deadlocks can occur when the result of a scheduled operation cannot be routed to its destination, because the required pass-through resource became occupied during scheduling. Guaranteeing that values can always be read from the RF prevents these scheduling deadlocks. There are two methods to guarantee this. The first always allocates a temporary route to the RF. The second generates max-flow graphs to check whether all data can reach the RF [10]. Instead of preventing deadlocks, they can also be resolved by a backtracking scheduler that unschedules operations if their result cannot be routed.

5 Conclusions
Various sources of inefficiency in programmable devices have been investigated, and methods to reduce these inefficiencies have been discussed. Four design guidelines have been established that will steer programmable devices in the direction of sub-picojoule compute efficiency, while not sacrificing the generality and performance of these devices. A novel architecture with an adaptable micro-architecture which adheres to these guidelines has been proposed, and its tool flow has been described. The proposed architecture is an adaptable mix between multi-core VLIW, SIMD, and FPGA architectures, which allows efficient mapping of an application by using the best from each of these architectures. A synthesizable hardware description of the architecture is available, and will be used in future work for validation of the guidelines presented here, and further development of the architecture.

References
[1] Amara Amara, Frédéric Amiel, and Thomas Ea. FPGA vs. ASIC for low power applications. In: Microelectronics Journal 37.8 (2006).
[2] Cadence Design Systems, Inc. Tensilica Customizable Processor IP. url: http://ip.cadence.com/ipportfolio/tensilica-ip (visited on 11/20/2015).
[3] Lanping Deng, K. Sobti, and C. Chakrabarti. Accurate models for estimating area and power of FPGA implementations. In: Acoustics, Speech and Signal Processing (ICASSP), IEEE International Conference on. Mar. 2008.
[4] Rehan Hameed et al. Understanding Sources of Inefficiency in General-purpose Chips. In: SIGARCH Comput. Archit. News 38.3 (June 2010).
[5] Jan Hoogerbrugge. Code generation for transport triggered architectures. Delft University of Technology.
[6] Akash Kumar et al. Multimedia Multiprocessor Systems: Analysis, Design and Management. Embedded Systems. Springer Netherlands.
[7] Fei Li et al. Architecture Evaluation for Power-efficient FPGAs. In: Proceedings of the 2003 ACM/SIGDA Eleventh International Symposium on Field Programmable Gate Arrays (FPGA '03). New York, NY, USA: ACM, 2003.
[8] ARM Ltd. ARM Cortex-M0 Specifications. url: products/processors/cortex-m/cortex-m0.php (visited on 11/19/2015).
[9] Yi Qian, Steve Carr, and Philip Sweany. Loop fusion for clustered VLIW architectures. In: ACM SIGPLAN Notices 37.7 (2002).
[10] Dongrui She et al. Scheduling for register file energy minimization in explicit datapath architectures. In: Design, Automation and Test in Europe Conference and Exhibition (DATE), Mar. 2012.
[11] C.H. van Berkel. Multi-core for mobile phones. In: Design, Automation and Test in Europe Conference and Exhibition (DATE).
[12] Luc Waeijen et al. A Low-Energy Wide SIMD Architecture with Explicit Datapath. In: Journal of Signal Processing Systems 80.1 (2015).
[13] M. Wijtvliet, S. Fernando, and H. Corporaal. SPINE: From C loop-nests to highly efficient accelerators using Algorithmic Species. In: Field Programmable Logic and Applications (FPL), International Conference on. Sept. 2015.

Toward Transparent Heterogeneous Systems

Baptiste Delporte, Roberto Rigamonti, Alberto Dassatti
Reconfigurable and Embedded Digital Systems Institute - REDS
HEIG-VD - School of Business and Engineering Vaud
HES-SO, University of Applied Sciences Western Switzerland

Abstract. Heterogeneous parallel systems are widely spread nowadays. Despite their availability, their usage and adoption are still limited, and they are even more rarely used to their full potential. Indeed, compelling new technologies are constantly developed and keep changing the technological landscape, but each of them targets a limited subset of supported devices, and nearly all of them require new programming paradigms and specific toolsets. Software, however, can hardly keep pace with the growing number of computational capabilities, and developers are less and less motivated to learn skills that could quickly become obsolete. In this paper we present our effort in the direction of a transparent system optimization based on automatic code profiling and Just-In-Time compilation, which resulted in a fully-working embedded prototype capable of dynamically detecting computing-intensive code blocks and automatically dispatching them to different computation units. Experimental results show that our system allows gains of up to 32× in performance after an initial warm-up phase, without requiring any human intervention.

1 Introduction
Improvements in computational power marked the last six decades and represented the major factor that allowed mankind to tackle problems of growing complexity. However, owing to physical and technological limitations, this process came to an abrupt halt in the past few years [25, 20, 14]. Industry tried to circumvent the obstacle by switching the paradigm, and both parallelism and specialization became the keywords to understand current market trends: the former identifies the tendency of having computing units that are composed of many independent entities that supposedly increase the throughput by a multiplicative factor; the latter reflects the drift toward architectures (think of, for instance, DSPs) that are focused on solving particular classes of problems that arise when facing a specific task. These two non-exclusive phenomena broadened the panorama of technological solutions available to the system developer but, contrary to expectations, were not capable of sustaining the growth that was observed in the previous years [20, 8]. Indeed, while architectures and systems evolve at a fast pace, software does not [9]. Big software projects, which are usually the most computing-intensive ones, demand careful planning over the years,

and cannot sustain abrupt changes of direction to adapt to the latest technological trend. Developers are even more of a static resource, in the sense that they require long time periods before becoming proficient in a new paradigm. Moreover, the most experienced ones, who are the most valuable resource of a software company and those who would be leading the change, are even less inclined to impose radical turns to a project, as this would significantly affect their mastery and their control of the situation. Faced with such a dilemma, the only reasonable solution seems to be automation.
In this paper we present a solution to this problem capable of detecting segments of code that are hot from a computational standpoint and dynamically dispatching them to different computing units, thus relieving the load of the main processor and increasing the execution speed by exploiting the peculiarities of those computing units. In particular, as a case study, we demonstrate our approach by providing a fully-working embedded system, based on the REPTAR platform [2], that automatically transfers heavy tasks from the board's Cortex-A8 ARM processor to the C64x+ DSP processor that is incorporated in the same DM3730 chip [23]. To achieve this goal, we execute the code we want to optimize in the LLVM [7] Just-In-Time (JIT) framework, we then identify functions worth optimizing by using Linux's perf_event [26] tool, and finally we dispatch them to the DSP, aiming at accelerating their execution. We will hereafter refer to our proposal as Versatile Performance Enhancer (VPE). While the performance we obtain is obviously worse than the one we could achieve by a careful handcrafting of the code, we get this result at no cost for the developer, who is totally unaware of the environment in which the code will be executed. Also, as the code changes are performed at run-time, they can adapt to optimize particular input patterns (think of, for instance, a convolution where most of the kernel's coefficients are zeros), further enhancing the performance. Finally, the system can dynamically react to changes in the context of execution, for example resources that become available, are upgraded, or experience a hardware failure. In the following we will first present the current state of the art, and then accurately describe our approach.

2 State of the Art
Parallel and heterogeneous architectures are increasingly widespread. While a lot of research addresses Symmetrical Multi-Processor (SMP) systems, heterogeneous architectures are not yet well integrated in the development flow, and integrators usually have to come up with ad-hoc, non-portable solutions. The scenario is slowly changing, and mainstream solutions are starting to be proposed. Particularly interesting in this context are some language-based solutions, such as CUDA [1], OpenCL [13, 1], HSA [11], and OpenMP Extensions for Heterogeneous Architectures [15]. All of these solutions target specific hardware settings, mostly GPU-like in nature, and are constructed in such a way that hardware manufacturers can keep a strong hold on their technologies. While these proposals might look like a perfectly reasonable answer to the

heterogeneity problem, they share a major drawback: they all require programmers to learn a new programming paradigm and a new toolset to work with them. Moreover, these solutions are similar but mostly incompatible in nature, and this fact worsens the situation even further. Supporting more than one approach is expensive for companies, and the reality is that the developer seldom has a choice about the methodology to adopt once the hardware platform is selected, most likely by the team in charge of the hardware part, which is probably unaware, or only partially aware, of the implications that choice has on the programming side.
A partial solution to this problem, called SoSOC, is presented in [18]: to avoid setting the knowledge of the DSP architecture and the related toolset (compiler, library, ...) as an entry barrier to using the DM3730 chip for actual development, the authors wrote a library that presents a friendly interface to the programmer and allows the dispatching of functions to a set of targets based on either the developer's wishes or some statistics computed during early runs. While interesting and with encouraging results, in our view the approach has a major drawback: not only does the user of SoSOC have to learn (yet) another library, but someone also has to provide handcrafted code for any specialized unit of interest. This is a considerable waste of time and resources, and limits the applicability of the system to the restricted subset of architectures directly supported by the development team. Furthermore, the developer might not be aware of the real bottlenecks of the system for a particular input set, so he might nominate non-relevant functions for remote execution, wasting precious resources.
Other academic proposals exist. A notable one is StarPU from INRIA Bordeaux [4]. StarPU provides an API and a pragma-based environment that, coupled with a run-time scheduler for heterogeneous hardware, composes a complete solution. While the main focus of the project is CPU/GPU systems, it could be extended to less standard systems. As with [1, 13, 11, 15, 18], StarPU shows the same limitations: a new set of tools and a new language or API to master. Compared with these alternatives, our solution does not need application developers to be aware of the optimization steps that will be undertaken, it does not target a specific architecture, and it does not require any additional step from the developer's side.
Another interesting technique, called BAAR and focused on Intel's Xeon Phi architecture, is presented in [16, 17]. This proposal is similar to ours, in that the code to be optimized is run inside LLVM's Just-In-Time framework, and functions deemed to be best executed on the Xeon Phi are offloaded to a remote server that compiles them with Intel's compiler and executes them. However, their analysis step lacks the versatility that characterizes our approach: functions are statically analyzed using Polly [21], a state-of-the-art polyhedral optimizer for automatic parallelization, to investigate their suitability for remote execution, and if this is the case, they are sent to the remote target. In our proposal, instead, optimizations are triggered according to an advanced performance analyzer, fitting the current input set under processing and not

expected-usage scenarios or other compile-time metrics. This gives us fine-grained control over the metric to optimize, the strategy to achieve this optimization, and the best target selection for a given task at any moment during the program's life. To the best of our knowledge, no other approach addresses the versatility and the transparency aspects simultaneously as VPE does.

3 VPE approach
The analysis in the previous section highlighted the importance of alternatives able to automate the code acceleration and dispatching steps. VPE aims at the run-time optimization of a generic code for a specific heterogeneous platform and input data pair, all in a transparent way. The idea behind it is that the developer just writes the code as if it had to be executed on a standard CPU. The VPE framework JIT-compiles this code and executes it, collecting statistics at run-time. When a user function (system calls are automatically excluded from the analysis) behaves according to a specific pattern, for instance, is particularly CPU-intensive, VPE acts to alter the run-time behaviour, trying to optimize the execution. In the case of CPU-intensive code, this could be the dispatching to a remote target specialized for the type of operations executed. After a warm-up delay, which can quickly become negligible for a large family of algorithms adopted in both scientific and industrial settings, the performance is potentially increased. If this is not the case (for instance, after an abrupt discontinuity in the input data pattern that makes the computation unsuitable for the selected remote target), VPE can revise its decision and act accordingly.
In structuring VPE, similarly to [16], we have chosen to cast the problem in the LLVM framework [5, 6]. LLVM recently came as an alternative to the widely known GCC compiler, whose structure was deemed to be too intricate to allow people to easily start contributing to it. The biggest culprit seemed to be the lack of a neat separation between the front-end, the optimization, and the back-end steps [5]. LLVM tried to solve this issue by creating an Intermediate Representation (IR), an enriched assembly that acts as a common language between the different steps [7]. The advantage here is that each component of the compilation chain can be unaware of the remaining parts and still be capable of doing its job; for instance, the ARM back-end does not need to know whether the code it is trying to assemble comes from C++ or FORTRAN code, allowing a back-end designer to focus solely on what this tool is supposed to do. As a result, a slew of LLVM-based tools came out in the past few years, with remarkable contributions coming from the academic community too, and this fueled its diffusion. Among others, LLVM has featured a Just-In-Time (JIT) compiler (MCJIT), which is the core component of our system, for many years now, whereas GCC introduced one only at the end of 2014 [10]. Also, a number of tools allowing in-depth code analysis and optimization, such as [12, 21, 22] to cite a few, can be easily integrated, leaving the door open to future extensions.

We have thus started from MCJIT, integrated an advanced profiling technique, and altered its behaviour by acting directly on the code's IR, allowing us to dynamically switch functions at will. We then took an embedded system that suited our needs, and experimentally verified the improvements introduced by our solution.

3.1 Profiling
Detecting which function is the best candidate to be sped up is a task that can be accomplished neither at development time nor at compile time, as it is usually strongly dependent on inputs. We therefore had to shape our architecture to include a performance monitoring solution, and after considering different alternatives (such as OProfile [27]), we opted for perf_event [26]. perf_event gives access to a large number of hardware performance counters, although at a penalty that can reach up to 20% overhead. In particular, very interesting measures can be acquired, including cache misses, branch misses, page faults, and many others, leaving the choice of which figure of merit to optimize for to the system engineer. In this paper we adopt, as the sole performance metric for selecting which function to off-load, the number of CPU cycles required for its execution. Our only optimization strategy is blind off-loading: that is, we off-load the candidate function and observe whether this results in a performance improvement, reverting our choice if it does not. It should be noted, however, that large gains could derive from a careful crafting of this optimization step: as an example, one might think of reorganizing a data structure on the fly after figuring out that it is causing too many cache misses [24]. While we do claim that having reliable and accurate statistics is vital to devising a clever optimization strategy, and having such a powerful performance analyzer integrated in our system is surely a strength of our approach, we will not investigate this topic further in this paper.
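For illustration, a minimal cycle counter built on the perf_event interface could look like the following C sketch (our own simplified example; VPE's actual integration with the JIT differs).

    /* Minimal sketch: count the CPU cycles of a code region with perf_event. */
    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>

    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags) {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    int main(void) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_CPU_CYCLES;   /* the metric used by VPE */
        attr.disabled = 1;
        attr.exclude_kernel = 1;                  /* user-space cycles only */

        int fd = perf_event_open(&attr, 0 /* this process */, -1 /* any cpu */, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        /* ... region of interest, e.g. the candidate function ... */

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        uint64_t cycles = 0;
        read(fd, &cycles, sizeof(cycles));
        printf("cycles: %llu\n", (unsigned long long)cycles);
        close(fd);
        return 0;
    }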

3.2 Function call techniques
Once an interesting function is detected, we would like to off-load it to another computational unit (we will refer to this computational unit as the remote target from now on). For this to happen, we have to transfer all the function's code, parameters, and shared data to the remote target, then hand control to it, wait for the function to return, and finally grab the returned values. Invoking a function on the remote target is particularly tricky: while LLVM's MCJIT compiler includes a Remote Target Interface, it has the peculiarity of operating on modules only, where a module is a collection of functions [3]. This behaviour has only very recently been changed with the introduction of a new JIT, called ORC, but this code is still under development and is available for the x86_64 architecture only. Operating at module level is very uncomfortable, as MCJIT requires a module to be finalized before being executed, and leaves us no simple way to alter the function invocation at run-time. To acquire the capacity of dynamically dispatching functions, we thus automatically replace all functions with a caller that, in normal situations, simply executes the corresponding function via a function pointer (see Figure 1). While this introduces a call overhead (as all function invocations must perform this additional caller step), when we wish to execute a function on the remote target we just have to alter this function pointer to make it point to another function that deals with the remote target, as shown in Figure 1. Similarly, when we consider that a function is no longer worth remote execution (for instance, we might have observed that the remote target is slower than the local CPU on the given task, we know that the remote target is already busy, or we have a more suitable function for the given computation unit), we set this pointer back to its original value. Since computing-intensive functions are automatically detected and off-loaded to the remote target, the overhead imposed by the additional step quickly becomes negligible.

Fig. 1. Comparison of the execution flows in a standard system (left column) and in VPE (right column). While without VPE the JIT directly invokes the desired function, in VPE an intermediate step through a wrapper has to be made. When a remote target is selected, the wrapper invokes a function that is in charge of handling the communication with it, sending it the parameters and the code, and waiting for the results to be handed back.
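The dispatching mechanism can be pictured with the following simplified C sketch (our own model of the idea; VPE performs the equivalent rewriting at the IR level, and all names below are placeholders): every call goes through a wrapper that jumps to whatever implementation the current function pointer selects.

    /* Simplified sketch of the caller/function-pointer indirection. */
    #include <stdio.h>

    static int heavy_kernel_cpu(const int *data, int n) {   /* original function */
        int acc = 0;
        for (int i = 0; i < n; ++i) acc += data[i] * data[i];
        return acc;
    }

    static int heavy_kernel_dsp(const int *data, int n) {   /* stands in for the   */
        /* marshal arguments, trigger remote execution,        remote-target path  */
           /* wait for completion and fetch the result */
        return heavy_kernel_cpu(data, n);                    /* placeholder body */
    }

    /* Function pointer the profiler flips between local and remote versions. */
    static int (*heavy_kernel_impl)(const int *, int) = heavy_kernel_cpu;

    /* The wrapper that replaces every original call site. */
    static int heavy_kernel(const int *data, int n) {
        return heavy_kernel_impl(data, n);
    }

    int main(void) {
        int data[4] = {1, 2, 3, 4};
        printf("%d\n", heavy_kernel(data, 4));   /* runs on the CPU */
        heavy_kernel_impl = heavy_kernel_dsp;    /* VPE decides to off-load */
        printf("%d\n", heavy_kernel(data, 4));   /* now takes the remote path */
        return 0;
    }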

3.3 Memory allocation problem
As briefly mentioned in the previous paragraph, once a function is invoked on the remote target, all data relative to it have to be transferred as well. In standard SMP systems this issue is not very relevant: all processors usually access the whole memory space, and hardware mechanisms guarantee cache coherency. On heterogeneous systems, however, the problem is more relevant. Often the remote target has only partial access to the main system's memory, and there is no hardware support to ease the data sharing. In this context differences between systems are remarkable, and we can distinguish between two macro-categories based on memory organization: we have indeed systems with shared memory (physically shared or virtually shared), and systems without it. We stress here that the two types can easily co-exist in the same platform, and while a subset of processing units can see the memory as a single address space, a different subset can provide a different view. In the context of VPE we consider only shared-memory systems; in systems where this assumption does not hold, we could adopt a message passing layer to virtualize the real hardware resources, as in [17].

4 Experimental Setup
To validate our proposal, we looked for a heterogeneous platform suitable for building a demonstrator. We have chosen a TI DM3730 DaVinci digital media processor SoC. It is present on the REPTAR platform [2] we could use for our tests and has already been adopted by [18], allowing us an indirect comparison. The DM3730 chip hosts an ARM Cortex-A8 1 GHz processor and a C64x+ DSP processor running at 800 MHz. Part of the address space is shared between the two processors, therefore we can easily transfer data by placing them in this region. This can be achieved by custom memory management functions, which however do not require any human intervention: when the JIT loads the IR code, it detects the memory operations and automatically replaces them with our custom ones. Note that this setup is not restrictive, as transfers among non-shared memory regions can be easily achieved by a framework such as MPI, as in [17]. The chosen DSP lacks an LLVM back-end we could use to automatically compile the code we are running in the JIT. The TI compiler used to produce the binaries executed on the DSP is proprietary software, and writing a compatible back-end was out of the scope of the project. While this could appear as a major obstacle, we have circumvented it by creating a set of scripts that compiles the functions' code using the aforementioned closed-source compiler, and then extracts a symbol table that is loaded and used in VPE.

5 Benchmarking

5.1 Methodology
We have evaluated the performance of VPE using a set of six algorithms: construction of the complementary nucleotidic sequence of an input DNA sequence, 2D convolution with a square kernel matrix, dot product of two vectors, multiplication of two square matrices, search of a nucleotidic pattern in an input DNA sequence, and Fast Fourier Transform (FFT). These algorithms were inspired by the Computer Language Benchmarks Game and were adapted to limit the use of floating point numbers, which are only handled in software by the DSP we

use and would, therefore, strongly penalize it. The applications have been written in their naive implementation, that is, without any thorough handcrafted optimization², and have been compiled on the ARM target with all the optimizations turned on (-O3). For each algorithm, a simple application allocates the data and calls the computing-intensive function repeatedly, in a continuous loop. The size of the data is constant and the processing is made on the same data from one call to another. The execution time of the processing function, including the target selection mechanism, the call to the function, and the execution of the function itself, is recorded at each iteration. We have compared the performance of the algorithm running on the ARM core with the performance of the same algorithm on the DSP, once VPE has taken the decision to dynamically dispatch the function to the DSP. The performance figures reported for VPE skip this initial warm-up phase, where the algorithm is first run on the CPU while VPE records the performance, as its cost quickly becomes negligible as the number of iterations of the algorithm increases.

5.2 Results and analysis
Figure 2(a) shows that the execution time of the selected algorithms on the ARM core can be in the order of seconds. This is notably the case for the matrix multiplication, but the other tests do not score far better. Once VPE has selected the DSP as a remote target, noticeable improvements in terms of performance can be observed: the acceleration of the nucleotidic complement nearly reaches a factor of eight, while the convolution sports a 4× speedup. Detailed timings for the different algorithms are reported in Table 1. Please note that the standard deviation is significantly increased when the code is running on the DSP under the control of VPE, since the profiler periodically slows down the execution while collecting and analyzing usage statistics. The most significant improvements have been obtained with the matrix multiplication and the pattern matching. Indeed, since the original versions of the algorithms are based on nested loops, the TI compiler has detected optimization opportunities and carried out software pipelining that resulted in a reduction of the number of required CPU cycles, thereby increasing the execution speed on the DSP target. Figure 2(b) shows, on a logarithmic scale, the time required for matrix multiplication as a function of matrix size: for small matrices (smaller than 75×75), we can see that it is not worth executing the operations on the DSP, as the time required for the setup (around 100 ms) exceeds the execution time on the ARM processor, although a remote execution would still have, in this case, the advantage of freeing the CPU for other tasks. For bigger matrices, however, the advantage becomes considerable. The versatility of our approach comes in handy again in this case: we could easily, for instance, automatically learn a correlation between the size of the matrix passed as a parameter and the performance achieved (we could do this using a simple decision tree [19]) and ground future decisions on this criterion.

² The source code of the applications can be downloaded here:

Fig. 2. (a) Execution time of the algorithms on the REPTAR platform: the performance of the algorithm running on the ARM core and the performance of the same algorithm dispatched on the DSP, after the transition triggered by VPE. The execution times are given in milliseconds (the axis scale is logarithmic). (b) Execution time of the matrix multiplication algorithm for a varying matrix size. Despite the ARM code being compiled with all the optimizations turned on (-O3), the DSP largely outperforms it for matrices with size greater than 75×75.

While the improvements are remarkable, the optimization strategy we have selected (that is, blindly off-loading the code to the DSP) does not guarantee a performance improvement. This is the case for the FFT code, which suffers a 25% performance penalty from being executed on the DSP. This performance penalty is due to the non-optimality of the code for the particular architecture: the hand-optimized DSP version of the same algorithm requires on average 109 ms, while the code executed by VPE takes around 720 ms. Two important points can be observed here: VPE will never be capable of outsmarting a developer in the job of optimizing the code for a particular architecture, and the optimization it performs might not always be the best choice available. For the former point, the result is a consequence of the fact that the code has been written without any knowledge of the system it will be executed upon, and thus it cannot benefit from the system's peculiarities. The improvements given by VPE, however, come for free from the application developer's stance, since this result requires no effort on his side; this contrasts with, for instance, the achievements of [18]. Concerning the latter point, it is linked with the amount of knowledge available to VPE and the amount of intelligence we have incorporated in it. A more clever optimization strategy, as well as a better investigation of the type of operations performed inside the routine that is a candidate for off-loading and a thorough analysis of the statistics collected by perf_event, could have led to a better choice, which would have been, in the FFT case, leaving the FFT function on the ARM processor. However, the dynamic nature of VPE makes such optimization attempts forgivable, as we can easily detect mediocre performance on the remote unit and reverse our decision. This is an opportunity which is not available in, for instance, the works of [16, 17].

Table 1. Timings (in ms) for the different algorithms tested on the REPTAR platform. The number reported after the ± represents one standard deviation. With normal execution we indicate the execution of the algorithm on the ARM CPU with no performance collection taking place, while with VPE we indicate the very same code running on the DSP in the VPE framework.

Algorithm      normal execution    VPE    Speedup
Complement     ±                   ±
Convolution    ±                   ±
DotProduct     ±                   ±
MatrixMult     ±                   ±
FFT            ±                   ±
PatternMatch   ±                   ±

5.3 Image processing prototype
We have also built a prototype demonstrator for the REPTAR board that uses a 2D convolution algorithm to detect contours in a video, similar in spirit to the one proposed in SOSoC. Both the CPU usage and the frame rate are displayed during the execution of the video processing. We use the OpenCV library to decode and display the video frames in a dedicated process. The system starts by invoking the video process that is in charge of decoding the current frame; the pixel matrix is then sent to the convolution process. The computation of the convolution is performed within VPE and the resulting matrix is sent back to the video application, which displays it. Figure 3(a) shows that, despite the main CPU being under heavy load, the frame rate is very low, at around 1.5 fps. After a predefined time interval, chosen to allow the spectators to observe the system running for a while, VPE is granted the right to automatically optimize the execution. Once this happens, it detects that the convolution is the most expensive task and starts sending the new frames to the DSP, halving the CPU load (the image handling is still performed by the CPU) and multiplying the frame rate by four. Short bursts of CPU usage are, however, to be expected even when the convolution code is running on the DSP, as VPE still periodically analyzes the collected performance data to spot variations in the system's usage that could trigger a different resource allocation policy. A detailed view of the CPU usage and frame rate evolution is shown in Figure 3(c).

6 Conclusion
In this paper we have presented a transparent system optimization scheme capable of using a run-time code profiler and a JIT to automatically dispatch computing-intensive chunks of code to a set of heterogeneous computing units.

We have also built a working prototype that exploits this technique to accelerate a standard image processing algorithm by a factor of four, and to significantly improve performance on a set of standard benchmarks. Future work will concentrate on testing our approach on a larger number of platforms, as well as on exploring additional run-time optimization schemes that could further reduce the algorithms' computation time.

References
1. A. Danalis, G. Marin, C. McCurdy, J.S. Meredith, P.C. Roth, K. Spafford, V. Tipparaju and J.S. Vetter: The Scalable Heterogeneous Computing (SHOC) Benchmark Suite. In: Proc. of the General-Purpose Computation on Graphics Processing Units Workshop (2010)
2. A. Dassatti, O. Auberson, R. Bornet, E. Messerli, J. Stadelmann and Y. Thoma: REPTAR: A Universal Platform For Codesign Applications. In: Proc. of the European Embedded Design Conf. in Education and Research (2014)
3. B.C. Lopes and R. Auler: Getting Started with LLVM Core Libraries. Packt Publishing (2014)
4. C. Augonnet, S. Thibault, R. Namyst and P.A. Wacrenier: StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures. Concurrency and Computation: Practice & Experience (2011)
5. C. Lattner: Introduction to the LLVM Compiler System. In: Proc. of the ACAT Workshop (2008)
6. C. Lattner: LLVM and Clang: Advancing Compiler Technology. In: Proc. of FOSDEM (2011)
7. C. Lattner and V. Adve: LLVM: A Compilation Framework for Life-long Program Analysis & Transformation. In: Proc. of the CGO Symposium (2004)
8. D.E. Womble, S.S. Dosanjh, B. Hendrickson, M.A. Heroux, S.J. Plimpton, J.L. Tomkins and D.S. Greenberg: Massively parallel computing: A Sandia perspective. Parallel Computing (1999)
9. F.P. Brooks: The Mythical Man-month (Anniversary Ed.) (1995)
10. Free Software Foundation Inc.: GCC 5 Release Notes, gcc-5/changes.html
11. G. Kyriazis: Heterogeneous system architecture: A technical review. Tech. rep., AMD (2013)
12. G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. Lugo-Martinez, S. Swanson and M.B. Taylor: Conservation Cores: Reducing the Energy of Mature Computations. In: Proc. of the ASPLOS Conference (2010)
13. J.E. Stone, D. Gohara and G. Shi: OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems. Computing in Science & Engineering (2010)
14. K. Asanovic, R. Bodik, B.C. Catanzaro, J.J. Gebis, P. Husbands, K. Keutzer, D.A. Patterson, W.L. Plishker, J. Shalf, S.W. Williams and K.A. Yelick: The Landscape of Parallel Computing Research: A View from Berkeley. Tech. rep., EECS Department, University of California, Berkeley (2006)
15. L. White: OpenMP Extensions for Heterogeneous Architectures. Lecture Notes in Computer Science (2011)
16. M. Damschen and C. Plessl: Easy-to-use on-the-fly binary program acceleration on many-cores. In: Proc. Int. ASCS Workshop (2015)

17. M. Damschen, H. Riebler, G. Vaz and C. Plessl: Transparent Offloading of Computational Hotspots from Binary Code to Xeon Phi. In: Proc. of the Design, Automation & Test in Europe Conference & Exhibition (2015)
18. O. Nasrallah, W. Luithardt, D. Rossier, A. Dassatti, J. Stadelmann, X. Blanc, N. Pazos, F. Sauser and S. Monnerat: SOSoC, a Linux framework for System Optimization using System on Chip. In: Proc. of the IEEE System-on-Chip Conference (2013)
19. S.R. Safavian and D. Landgrebe: A Survey of Decision Tree Classifier Methodology. IEEE Trans. on Systems, Man, and Cybernetics (1991)
20. S.W. Keckler, W.J. Dally, B. Khailany, M. Garland and D. Glasco: GPUs and the Future of Parallel Computing. IEEE Micro (2011)
21. T. Grosser, A. Groesslinger and C. Lengauer: Polly - Performing polyhedral optimizations on a low-level intermediate representation. Parallel Processing Letters (2012)
22. T. Oh, H. Kim, N.P. Johnson, J.W. Lee and D.I. August: Practical Automatic Loop Specialization. In: Proc. of the ASPLOS Conference (2013)
23. Texas Instruments Incorporated: DM3730, DM3725 Digital Media Processors Datasheet (2011)
24. T.M. Chilimbi, M.D. Hill and J.R. Larus: Cache-Conscious Structure Layout. In: Proc. of the PLDI Conf. (1999)
25. U. Lopez-Novoa, A. Mendiburu and J. Miguel-Alonso: A Survey of Performance Modeling and Simulation Techniques for Accelerator-Based Computing. IEEE Transactions on Parallel and Distributed Systems (2015)
26. V.M. Weaver: Linux perf_event Features and Overhead. In: Proc. of the FastPath Workshop (2013)
27. W.E. Cohen: Multiple Architecture Characterization of the Linux Build Process with OProfile. In: Proc. of the Workshop on Workload Characterization (2003)

Fig. 3. Screenshots of the VPE system in execution on the REPTAR platform. The system is working on a signal processing task (contour detection in this case) on a video, while recording the percentage of CPU usage (in the small graph) and the frame rate (in the top-left corner of the video player). (a) depicts the system before VPE transitioned the computation-intensive task (in this case, 2D convolution) to the DSP, while (b) shows the system after this transition happened. It can be seen that when VPE triggers the transition to the DSP processor, the load of the ARM core is considerably relieved (but still not negligible, as the ARM core has to perform all the visualization-related tasks) and the frame rate increases by a factor of four. (c) CPU usage and frame rate for the image processing prototype. The system starts on the CPU and performs statistics calculation. Then, when we allow it to change its execution target with a specific command, it decides to move the heaviest computation it is performing (in this case a 2D convolution) to the DSP, relieving the CPU load. At the same time, the frame rate gets multiplied by a factor of four. Slightly after this moment, a rapid peak in CPU usage, due to the performance calculation, is visible.

Exploring LLVM Infrastructure for Simplified Multi-GPU Programming

Alexander Matz¹, Mark Hummel² and Holger Fröning³
¹,³ Ruprecht-Karls University of Heidelberg, Germany
alexander.matz@ziti.uni-heidelberg.de, holger.froening@ziti.uni-heidelberg.de
² NVIDIA, US
mhummel@nvidia.com

Abstract. GPUs have established themselves in the computing landscape, convincing users and designers by their excellent performance and energy efficiency. They differ in many aspects from general-purpose CPUs, for instance their highly parallel architecture, their thread-collective bulk-synchronous execution model, and their programming model. In particular, languages like CUDA or OpenCL require users to express parallelism in a very fine-grained but also highly structured, hierarchical way, and to express locality very explicitly. We leverage these observations to derive a methodology for scaling out single-device programs to an execution on multiple devices, aggregating compute and memory resources. Our approach comprises three steps:
1. Collect information about data dependencies and memory access patterns using static code analysis
2. Merge this information in order to choose an appropriate partitioning strategy
3. Apply code transformations to implement the chosen partitioning and insert calls to a dynamic runtime library
We envision a tool that allows a user to write a single-device program that utilizes an arbitrary number of GPUs, either within one machine boundary or distributed at cluster level. In this work, we introduce our concept and tool chain for regular workloads. We present results from early experiments that further motivate our work and provide a discussion on related opportunities and future directions.

1 Introduction
GPU Computing has gained a tremendous amount of interest in the computing landscape due to multiple reasons. GPUs as processors have a high computational power and an outstanding energy efficiency in terms of performance-per-Watt metrics. Domain-specific languages like OpenCL or CUDA, which are based on data-parallel programming, have been key to bringing these properties to the masses. Without such tools, programming through graphics APIs like OpenGL or similar would have been too cumbersome for most users. We observe that data-parallel languages like OpenCL or CUDA can greatly simplify parallel programming, as no hybrid solutions like sequential code enriched with vector instructions are required. The inherent domain decomposition

principle ensures the finest granularity when partitioning the problem, typically resulting in a mapping of one single output element to one thread. Work agglomeration at thread level is rendered unnecessary. The Bulk-Synchronous Parallel (BSP) programming paradigm and its associated slackness regarding the ratio of virtual to physical processors allows for effective latency-hiding techniques that make large caching structures obsolete. At the same time, a typical code exhibits substantial amounts of locality, as the rather flat memory hierarchy of thread-parallel processors has to rely on large amounts of data reuse to keep the vast number of processing units busy.
However, this beauty of simplicity is only applicable to single-GPU programs. Once a program is scaled out to any number of GPUs larger than one, the programmer has to start using orthogonal orchestration techniques for data movement and kernel launches. These modifications are scattered throughout host and device code. We understand the efforts behind this orchestration to be high and completely incompatible with the single-device programming model, independent of whether these multiple GPUs are within one machine boundary or spread across several.
With this work we introduce our efforts on GPU Mekong¹. Its main objective is to provide a simplified path to scale out the execution of GPU programs from one GPU to almost any number, independent of whether the GPUs are located within one host or distributed at cloud or cluster level. Unlike existing solutions, this work proposes to maintain the GPU's native programming model, which relies on a bulk-synchronous, thread-collective execution. No hybrid solutions like OpenCL/CUDA programs combined with message passing are required. As a result, we can maintain the simplicity and efficiency of GPU Computing in the scale-out case, together with high productivity and performance. We base our approach on compilation techniques including static code analysis and code transformations regarding host and device code. We initially focus on multiple GPU devices within one machine boundary (a single computer), which allows us to avoid efforts regarding multi-device programming (cudaSetDevice, streams, events and similar). Our initial tool stack is based on OpenCL programs as input, LLVM as compilation infrastructure and a CUDA backend to orchestrate data movement and kernel launches on any number of GPUs.
In this paper, we make the following contributions:
1. A detailed reasoning about the motivation and conceptual ideas of our approach, including a discussion of current design space options
2. Introduction of our compilation tool stack for analysis passes and code transformation regarding device and host code
3. Initial analysis of workload characteristics regarding suitability for this approach
4. Exemplary performance analysis of the execution of a single-device workload on four GPUs

¹ With Mekong we are actually referring to the Mekong Delta, a huge river delta in southwestern Vietnam that transforms one of the longest rivers of the world into an abundant number of tributaries before this huge water stream finally empties into the South China Sea. Similar to this river delta, the Mekong project aims to transform a single data stream into a large number of smaller streams that can be easily mapped to multiple GPUs.

The remainder of this work is structured as follows: first, we establish some background about GPUs and their programming models in section 2. We describe the overall idea and our preliminary compilation pipeline in section 3. In section 4, we describe the characteristics of BSP applications suitable for our approach. We present the partitioning schemes and their implementation in section 5. Our first experiments are described and discussed in section 6. Section 7 presents background information and section 8 discusses our current state and future direction.

2 Background
A GPU is a powerful high-core-count device with multiple Streaming Multiprocessors (SMs) that can execute thousands of threads concurrently. Each SM is essentially composed of a large number of computing cores and a shared scratchpad memory. Threads are organized in blocks, but the scheduler of a GPU doesn't handle each single thread or block; instead, threads are organized in warps (typically 32 threads) and these warps are scheduled to the SMs during runtime. Context switching between warps comes at negligible cost, so long-latency events can easily be hidden.
GPU Computing has been dramatically pushed by the availability of programming languages like OpenCL or CUDA. They are mainly based on three concepts: (1) a thread hierarchy based on cooperative thread arrays (CTAs) to facilitate an effective mapping of the vast number of threads (often upwards of several thousands) to organizational units like the SMs; (2) shared memory as an explicit element of the memory hierarchy, in essence forcing users to manually specify locality and thereby inherently optimizing locality to a large extent; (3) barrier synchronization for the enforcement of a logical order between autonomous computational instances like threads or CTAs.
Counter-intuitively, developing applications that utilize multiple GPUs is rather difficult despite the explicitly expressed high degree of parallelism. These difficulties stem from the need to orchestrate execution and manage the available memory. In contrast to multi-socket CPU systems, the memory of multiple GPUs in a single system is not shared, and data has to be moved explicitly between the main memory and the memory of each GPU.
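To make this concrete, the following minimal CUDA kernel (an illustrative example of ours, not taken from the paper) maps exactly one output element to each thread and relies on the block/thread hierarchy described above:

    // Illustrative CUDA example: one thread per output element.
    // The grid of thread blocks (CTAs) is what is later remapped onto
    // multiple GPUs.
    __global__ void vecAdd(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n)
            c[i] = a[i] + b[i];                          // one output element per thread
    }

    // Host-side launch for a single GPU:
    //   int threads = 256;
    //   int blocks  = (n + threads - 1) / threads;
    //   vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);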

3 Concept and Compilation Pipeline
We base our approach on the following observations: first, the inherent thread hierarchy that forms CTAs allows an easy remapping to multiple GPUs. We leverage this to span up BSP aggregation layers that cover all SMs of multiple GPUs. While this step is rather straightforward, the huge bandwidth disparity between on-device and off-device memory accesses requires an effective partitioning technique to maximize locality for memory accesses. We use concepts based on block data movement (cudaMemcpy) and fine-grained remote memory accesses (UVA) to set up a virtual global address space that embraces all device memory. We note that such an approach is not novel; it is well explored in the area of general-purpose computing (shared virtual memory), with known advantages and drawbacks. The main difference we observe with respect to previous efforts is that GPU programs exhibit a huge amount of locality due to the BSP-like execution model and the explicit memory hierarchy, much larger than for traditional general-purpose computing. We leverage this fact for automated data placement optimization.
The intended compilation pipeline is assembled using mostly LLVM tools as well as custom analysis and transformation passes. The framework targets OpenCL device code and C/C++ host code using the CUDA driver API, which allows us to implement the complete frontend using Clang (with the help of libclc to provide OpenCL built-in functions, types, and macros). With the frontend and code generation being handled by existing LLVM tools, we can focus on implementing the analysis and transformation passes that represent the core of our work. These passes form a sub-pipeline consisting of three steps (see figure 1):
1. The first step is solely an analysis step that is applied to both host and device code. The goal is to extract features that can be used to characterize the workload and aid in deciding on a partitioning strategy. Examples of these features include memory access patterns, data dependencies, and kernel iterations. The results are written into a database, which is used as the main means of communication between the different steps.
2. The second step uses the results from the first step in order to reach a decision on the partitioning strategy and its parameters. These parameters (for example, the dimension along which the workload is partitioned) are written back into the database.
3. With the details of the partitioning scheme agreed on, the third step applies the corresponding code transformations in host and device code. This phase is highly dependent on the chosen partitioning scheme.

4 Workload Characterization
In order to efficiently and correctly partition the application, its specific nature needs to be taken into account (see Fig. 2 for example features). In much the same way that dependencies have to be taken into account when parallelizing a loop, data and control flow dependencies need to be considered when partitioning GPU kernels. Some of these dependencies can be retrieved by analyzing the device code, while others can be identified in the host code. Regarding what kind of transformations are allowed, the most important characteristic of a workload is whether it is regular or irregular.

Fig. 1. High-level overview of the compilation pipeline

Regular workloads are characterized by their well-defined memory access patterns in kernels. Well-defined in this context means that the memory locations accessed over time depend on a fixed number of parameters that are known at kernel launch time. The accessed memory locations being the result of dereferencing input data (as is the case in sparse computations, for example) is a clear exclusion criterion. Regular workloads can be analyzed to a very high extent, which allows for extensive reasoning about the applied partitioning scheme and other optimization schemes. They usually can be statically partitioned according to elements in the output data. Device code can be inspected for data reuse that does not leverage shared memory and can be optimized accordingly. On the host code side, data movement and kernel synchronization are the main optimization targets.
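The following two illustrative CUDA kernels (our own examples) contrast a regular access pattern, where the accessed locations depend only on launch-time parameters and the thread index, with an irregular one, where the index itself is loaded from memory:

    // Regular: accessed locations depend only on the thread index and on
    // parameters known at launch time (one level of indirection).
    __global__ void scale(const float* in, float* out, float alpha, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = alpha * in[i];
    }

    // Irregular: the index itself is read from memory (two levels of
    // indirection), as in sparse or graph workloads -- this pattern is
    // excluded from the static partitioning considered here.
    __global__ void gather(const float* in, const int* idx, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[idx[i]];   // idx[i] is loaded, then dereferenced
    }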

Workload                   Classification  Data reuse  Indirections  Iterations
Dense Matrix Multiply      Regular         High        1             1
Himeno (19 point stencil)  Regular         High        1             many
Prefix sum                 Regular         Low         1             log2(n)
SpMV/Graph traversal       Irregular       Low         2             many

Fig. 2. Characterization of selected workloads

5 Analysis and Transformations

This section details some of the analysis and transformations that form the core of our project. The first subsection focuses on the analysis phase, where the applicable partitioning schemes are identified and one of them is selected, while the second subsection goes into how and which transformations are applied. As of now we focus on implementing a reasonably efficient 1D partitioning. Depending on how data movement is handled, it can be divided into further sub-schemes:

UVA: This approach leverages NVIDIA Unified Virtual Addressing (UVA), which allows GPUs to directly access memory on different GPUs via peer-to-peer communication. It is the easiest algorithm to implement, and both host and device code only need small modifications. For all but one device, all data accesses are non-local, resulting in peer-to-peer communication between GPUs. With this scheme, the data set has to fit entirely into a single GPU.

Input replication: Input replication is similar to the UVA approach in that it does not require any data reshaping, and the device code transformations are exactly the same. But instead of utilizing direct memory access between GPUs, input data gets fully replicated among devices. Results are written into a local buffer on each device and later collected and merged by the host code.

Streaming: This is the approach that we suspect solves both the performance and problem size issues of the first two approaches. Both input and output data are divided into partitions, and device buffers are reshaped to only hold a single partition at a time. This strategy requires extensive modifications on both host and device code, but also presents more opportunities to optimize data movement and memory management.

5.1 Analysis

In order to identify viable partitioning schemes, the analysis step extracts a set of features exhibited by the code and performs a number of tests that, if failed, dismiss certain partitioning schemes. Since for now we focus on regular workloads, the most important test performed determines the number of indirections when accessing global memory. One indirection corresponds to a direct memory access using a simple index. This index can be the result of a moderately complex calculation as long as

none of the values used in the calculations themselves have been read from global memory. Every time the result of one or more load instructions is used to calculate an index, it counts as another level of indirection. For regular workloads, where we know the data dependencies in advance, any level of indirection that is greater than one dismisses the workload for partitioning. Although this might seem like a massive limitation, a number of workloads can be implemented using only one level of indirection. Examples include matrix multiplications, reductions, stencil codes, and n-body (without cluster optimizations).

If the code is a regular workload, certain features that help in deciding on a partitioning scheme are extracted. Useful information includes:

- Maximum loop nesting level
- Minimum and maximum values of indices
- Index stride between loop iterations
- Index stride between neighboring threads
- Number and index of output elements

For the partitioning schemes we are currently exploring, we require the output elements to be distinct for each thread (i.e., no two threads have the same output element). Without this restriction, access to output data would have to be kept coherent between devices. The index stride between neighboring threads on input data is used in order to determine partition shapes and sizes. For UVA and Input Replication this is not relevant, but streaming kernels with partitioned input data should only be partitioned across block-working-set boundaries. The analysis of the index stride between loop iterations gives a deeper insight into the nature of the workload. As an example, in a non-transposed matrix multiplication the loop stride in the left matrix is 1, while it is the matrix width in the right matrix. All extracted features will be used to train a classifier that later provides hints for proven-good partitioning schemes.

5.2 Transformation

The transformation phase is highly dependent on the chosen partitioning scheme. Host code of the kind of CUDA applications we are looking at usually follows this formula (a minimal skeleton is sketched below):

1. Read input data
2. Initialize device(s)
3. Distribute data
4. Launch kernels
5. Read back results
6. Produce output
7. Clean up
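The following CUDA driver API skeleton illustrates this structure for a single device. It is a hedged sketch rather than code from the framework: the PTX module name, the kernel name, and the launch configuration are placeholders, and error checking is omitted.

```c
/* Hypothetical single-GPU skeleton of steps 1-7 using the CUDA driver API. */
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
  size_t n = 1 << 20;
  float *hIn  = (float *)malloc(n * sizeof(float));   /* 1. read input data */
  float *hOut = (float *)malloc(n * sizeof(float));

  cuInit(0);                                          /* 2. initialize device */
  CUdevice dev;   cuDeviceGet(&dev, 0);
  CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);
  CUmodule mod;   cuModuleLoad(&mod, "kernel.ptx");
  CUfunction fn;  cuModuleGetFunction(&fn, mod, "compute");

  CUdeviceptr dIn, dOut;                              /* 3. distribute data */
  cuMemAlloc(&dIn,  n * sizeof(float));
  cuMemAlloc(&dOut, n * sizeof(float));
  cuMemcpyHtoD(dIn, hIn, n * sizeof(float));

  void *args[] = { &dIn, &dOut };                     /* 4. launch kernels */
  cuLaunchKernel(fn, (unsigned)(n / 256), 1, 1, 256, 1, 1, 0, NULL, args, NULL);

  cuMemcpyDtoH(hOut, dOut, n * sizeof(float));        /* 5. read back results */
  printf("out[0] = %f\n", hOut[0]);                   /* 6. produce output */

  cuMemFree(dIn); cuMemFree(dOut);                    /* 7. clean up */
  cuModuleUnload(mod);
  cuCtxDestroy(ctx);
  free(hIn); free(hOut);
  return 0;
}
```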

The relevant parts of this code are steps 2 through 5. Iterative workloads have the same structure, but repeat steps 3 through 5. Host code modifications currently consist of replacing the regular CUDA calls with custom replacements that act on several GPUs instead of a single one. These are the functions that are implemented differently depending on the partitioning scheme. As an example, for 1D UVA based partitioning, cuMemAlloc allocates memory only on a single GPU, but enables peer-to-peer access between that GPU and all other visible GPUs. In contrast, for 1D Input Replication based partitioning, cuMemAlloc allocates memory of the same size on all available GPUs. In all cases the kernel configuration is modified to account for the possibly multiple device buffers and the new partitioned grid size.

For device code we currently employ a trick that greatly simplifies 1D partitioning for the UVA and Input Replication schemes. The regular thread grid is embedded in a larger super grid that spans the complete workload on all devices. The super ID corresponds to the device number of a GPU within the super grid, and the super size corresponds to the size of a single partition. These additional parameters are passed as extra arguments to the kernel. With this abstraction, transformations on the device code are limited to augmenting the function to accept these arguments as well as replacing calls to get_global_id. All calls to get_global_id that query the dimension we are partitioning along are replaced with the following computation: super_id*super_size + get_global_id(<original arguments>). This way, no index recalculations have to be performed, as the kernel is essentially the same, just with the grid shrunk and shifted along the partitioned dimension. In order to distribute data for the input replication scheme, regular CUDA memcpy operations are employed. So far this does not pose a problem, since all GPUs involved are part of the local system.
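To illustrate the device-code transformation just described, the pair of OpenCL kernels below shows a vector-add kernel before and after the super-grid rewrite. The kernel itself is an invented minimal example; only the super_id/super_size index computation follows the scheme described above.

```c
/* Original single-device kernel. */
__kernel void vec_add(__global const float *a, __global const float *b,
                      __global float *c) {
  int i = get_global_id(0);
  c[i] = a[i] + b[i];
}

/* Transformed kernel (sketch): the per-device grid is embedded in a super
 * grid; super_id and super_size are supplied by the modified host code. */
__kernel void vec_add_part(__global const float *a, __global const float *b,
                           __global float *c, int super_id, int super_size) {
  int i = super_id * super_size + get_global_id(0);
  c[i] = a[i] + b[i];
}
```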

6 Early Experiments

As a proof of concept of our ideas, we performed preliminary experiments with the first implementation of our toolstack, which implements the automatic 1D UVA and Input Replication partitioning from section 5. The workload in question is a relatively naive square matrix multiply, with the only optimization being the use of 32x32 tiling. It has been chosen due to its regular nature and high computation-to-communication ratio. The experiments have been executed on a single-node system equipped with two Intel Xeon E v3 processors running at 3.20 GHz, 256GB of DDR3 RAM, and a set of 8 NVIDIA K80 GPUs (each combining 2 GK210 GPUs). The system runs on Ubuntu.

As can be seen in figure 3, even with the high compute-to-memory ratio of a matrix multiply, a UVA based partitioning does not perform well and always results in a speedup of less than one. This can be attributed to the high latency of memory accesses on peers, which cannot be hidden even by the very high amount of parallelism exposed by this kernel.

Fig. 3. Measured speedup for 1D UVA partitioning vs Input Replication (speedup over the number of GPUs, for NxN matrices).

Fig. 4. Runtime breakdown of 1D Input Replication partitioning with N=4096 and N= (time per operation: Dev->Host, Kernel, Host->Dev).

Input replication, on the other hand, performs reasonably well for larger input sizes. As suspected, figure 4 shows that the initially higher cost of having to distribute the data to all devices is outweighed by the speedup of the kernel execution up to a certain number of GPUs. Until the workload hits its problem-size dependent point of saturation, the speedup is just slightly less than linear. These initial results indicate that Input Replication based (and possibly Streaming based) automated partitioning might be a promising option to produce high-performance GPU code in a productive manner.

7 Related Work

There are several projects that focus on simplifying distributed GPU programming without introducing new programming languages or specific libraries that have to be used by the user. Highly relevant are the projects SnuCL from [8] and rCUDA from [13]. They offer a solution to virtualize GPUs on remote nodes so that they appear as local devices (for OpenCL and CUDA, respectively), but still require the user to partition the application manually. We consider these projects to be a highly attractive option to scale from a single-node, single-GPU application to a multi-node, multi-GPU application using our techniques. Several forms of automated partitioning techniques have been proposed in the past. Even though all are similar in principle, the details make them differ

substantially. Cilardo et al. discuss memory-optimized automated partitioning of applications for FPGA platforms: while [4] focuses on analyzing memory access patterns using Z-polyhedrals, [5] explores memory partitioning in High-Level Synthesis (HLS) tasks.

The work on run-time systems we examined focuses on shared virtual memory and memory optimizations. Li et al. explore the use of page migration for virtual shared memory in [11]. Tao et al. utilize page migration techniques in order to optimize data distribution in NUMA systems in [14]. Both of these works are a great inspiration for the virtual shared memory system we intend to use in order to support irregular workloads. ScaleMP is a successful real-world example of a software-based virtual shared memory system.

A mix of compile-time and run-time systems (similar to our approach) has been used in various works: Pai et al. describe the use of page migration to manage the distinct address spaces of general-purpose CPUs and discrete accelerators like GPUs, based on the X10 compiler and run-time [12]. Lee et al. use kernel partitioning techniques to enable a collaborative execution of a single kernel across heterogeneous processors like CPUs and GPUs (SKMD) [10], and introduce an automatic system for mapping multiple kernels across multiple computing devices, using out-of-order scheduling and mapping of multiple kernels on multiple heterogeneous processors (MKMD) [9].

Work on memory access patterns has a rich history. Recent work that focuses on GPUs includes Fang et al., who introduced a tool to analyze memory access patterns to predict the performance of OpenCL kernels using local memory [6], which we find very inspiring for our work. Ben-Nun et al. are a very recent representative of various works that extend code with library calls to optimize execution on multiple GPUs by decisions based on the specified access pattern [3].

Code analysis and transformation has also been used to optimize single-device code. In [7], Fauzia et al. utilize static code analysis in order to speed up execution by coalescing memory accesses and promoting data from shared memory to registers and local memory, respectively. Similarly, Baskaran et al. focus on automatically moving memory between slow off-chip and faster on-chip memory [1].

8 Discussion

In this paper, we presented our initial work on GPU Mekong, a tool that simplifies multi-GPU programming using the LLVM infrastructure for source code analysis and code transformations. We observe that the use of multiple GPUs steadily increases for reasons including memory aggregation and computational power. In particular, even NVIDIA's top-notch Tesla-class GPU called K80 is internally composed of two K40s connected by a PCIe switch, requiring multi-device programming techniques and manual partitioning. With GPU Mekong, we aim to support such multi-GPU systems without additional efforts besides good (single-device) CUDA/OpenCL programming skills.

We observe that a dense matrix multiply operation can be computed in parallel with a very high efficiency, given the right data distribution technique. It seems that UVA techniques (load/store forwarding over PCIe) are too limited in terms of bandwidth and/or access latency. Depending on the workload, it might be useful to revisit them later to support fine-grain remote accesses. For irregular workloads with more than one level of indirection, our current approach of statically partitioning data and code is not going to work. We see virtual shared memory based on page migration as a possible solution for these cases. Given the highly structured behavior of GPU kernels, in particular due to the use of shared memory optimizations (bulk data movement prior to fine-grained accesses), we see strong differences to page migration techniques for general-purpose processors like CPUs. Also, even though it is a common belief that irregular workloads have no locality, recent work has shown that this is not true [2]. As multi-device systems show strong locality effects due to tree-like interconnection networks (in particular for PCIe), we anticipate that scheduling such data movements correctly is mandatory to diminish bandwidth limitations due to contention effects. We plan to support this with a run-time that intercepts block data movements, predicts associated costs, and re-schedules them as needed.

Besides such work on fully-automated code transformations for multiple GPUs, we are envisioning multiple other research aspects. In particular, our code analysis technique could also highlight performance issues found in the single-GPU code. Examples include detecting shared memory bank conflicts or global memory coalescing issues. However, we still have to find out to what extent these performance bugs could be automatically solved, or if they simply have to be reported to the user. Similarly, we are considering exploring promoting global memory allocations automatically to shared memory for performance reasons. Such a privatization would dramatically help in using the explicit levels of the memory hierarchy.

9 Acknowledgements

We gratefully acknowledge the sponsorship we have received from Google (Google Research Award, 2014) and the German Excellence Initiative, with substantial equipment grants from NVIDIA. We acknowledge the support of various colleagues during discussions, in particular Sudhakar Yalamanchili from Georgia Tech.

References

1. Baskaran, M.M., Bondhugula, U., Krishnamoorthy, S., Ramanujam, J., Rountev, A., Sadayappan, P.: Automatic data movement and computation mapping for multi-level parallel architectures with explicitly managed memories. In: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. pp PPoPP 08, ACM, New York, NY, USA (2008),

56 12 2. Beamer, S., Asanovic, K., Patterson, D.: Locality exists in graph processing: Workload characterization on an ivy bridge server. In: Workload Characterization (IISWC), 2015 IEEE International Symposium on. pp (Oct 2015) 3. Ben-Nun, T., Levy, E., Barak, A., Rubin, E.: Memory access patterns: The missing piece of the multi-gpu puzzle. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. pp. 19:1 19:12. SC 15, ACM, New York, NY, USA (2015), 4. Cilardo, A., Gallo, L.: Improving multibank memory access parallelism with latticebased partitioning. ACM Transactions on Architecture and Code Optimization (TACO) 11(4), 45 (2015) 5. Cilardo, A., Gallo, L.: Interplay of loop unrolling and multidimensional memory partitioning in hls. In: Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition. pp EDA Consortium (2015) 6. Fang, J., Sips, H., Varbanescu, A.: Aristotle: a performance impact indicator for the opencl kernels using local memory. Scientific Programming 22(3), (Jan 2014) 7. Fauzia, N., Pouchet, L.N., Sadayappan, P.: Characterizing and enhancing global memory data coalescing on gpus. In: Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization. pp IEEE Computer Society (2015) 8. Kim, J., Seo, S., Lee, J., Nah, J., Jo, G., Lee, J.: Snucl: An opencl framework for heterogeneous cpu/gpu clusters. In: Proceedings of the 26th ACM International Conference on Supercomputing. pp ICS 12, ACM, New York, NY, USA (2012), 9. Lee, J., Samadi, M., Mahlke, S.: Orchestrating multiple data-parallel kernels on multiple devices. In: International Conference on Parallel Architectures and Compilation Techniques (PACT). vol. 24 (2015) 10. Lee, J., Samadi, M., Park, Y., Mahlke, S.: Skmd: Single kernel on multiple devices for transparent cpu-gpu collaboration. ACM Transactions on Computer Systems (TOCS) 33(3), 9 (2015) 11. Li, K., Hudak, P.: Memory coherence in shared virtual memory systems. ACM Transactions on Computer Systems (TOCS) 7(4), (1989) 12. Pai, S., Govindarajan, R., Thazhuthaveetil, M.J.: Fast and efficient automatic memory management for gpus using compiler-assisted runtime coherence scheme. In: Proceedings of the 21st international conference on Parallel architectures and compilation techniques. pp ACM (2012) 13. Peña, A.J., Reaño, C., Silla, F., Mayo, R., Quintana-Ortí, E.S., Duato, J.: A complete and efficient cuda-sharing solution for {HPC} clusters. Parallel Computing 40(10), (2014), Tao, J., Schulz, M., Karl, W.: Ars: an adaptive runtime system for locality optimization. Future Generation Computer Systems 19(5), (2003)

Efficient scheduling policies for dynamic dataflow programs executed on multi-core

Małgorzata Michalska 1, Nicolas Zufferey 2, Jani Boutellier 3, Endri Bezati 1, and Marco Mattavelli 1
1 EPFL STI-SCI-MM, École Polytechnique Fédérale de Lausanne, Switzerland
2 Geneva School of Economics and Management, University of Geneva, Switzerland
3 Department of Computer Science and Engineering, University of Oulu, Finland

Abstract. An important challenge for dataflow program implementations on multi-core platforms is the partitioning and scheduling providing the best possible throughput while satisfying multiple objective functions. Not only has it been proven that these problems are NP-complete, but the quality of any heuristic approach is also strongly affected by other factors, such as buffer dimensioning, the mutual influence of an established partitioning configuration and scheduling strategy, and uncertainties of a compiler affecting the profiling information. This paper focuses on the adaptation of some alternative scheduling policies to the dataflow domain and the observation of their properties and behavior when applied to different partitioning configurations. It investigates the impact of scheduling on the overall execution time, and verifies which policies could further drive the metaheuristic-based search for a close-to-optimal partitioning configuration.

1 Introduction

In the emerging field of massively parallel platforms, there is a strong demand for efficient implementations that exploit the available concurrency. Dataflow programs, characterized by some interesting properties providing a natural way of dealing with parallelism, seem to possess the necessary features to successfully handle this requirement. For the purpose of mapping a dataflow program on a target architecture, the program should be treated as a set of non-decomposable components (connected to each other by a set of buffers) that can be freely placed. This implies that dataflow programs are portable to different architectures by making only two decisions: (1) assign the components to the processing units; (2) find the execution order (sequencing) inside every unit. For static dataflow programs the problem of partitioning and scheduling has been studied very well, and a whole class of compile-time algorithms has been proven valid [14]. In the case of dynamic dataflow programs the problem becomes much more complicated, because it requires creating a reliable model of execution that could sufficiently cover and capture the entire application behavior, which depends on the input data.

The exploration of the design space should take into consideration three dimensions: partitioning of dataflow components, scheduling inside each partition, and dimensioning of the buffers that connect the components with each other [8]. Exploring one dimension requires making an assumption on, or at least narrowing the setup of, the other two dimensions. On the other hand, such assumptions usually strongly influence the outcome of the exploration. For instance, in the case of scheduling exploration, the quality of the applied partitioning and buffer dimensioning determines the size of the space of possible admissible scheduling configurations. Still, it can be stated that if the whole dynamic behavior of an application is properly captured for a given input sequence and a non-blocking buffer dimensioning is applied, the scheduling problem for dataflow programs is always feasible, in comparison to some other Models of Computation (MoCs) [2]. Under these circumstances, the challenge becomes finding a scheduling configuration that optimizes the desired objective. For the case of signal processing systems, among various possible objective functions, the most natural is the maximization of the data throughput, since it contributes to the improvement of other objective functions [12]. Providing such an optimal solution to the partitioning-scheduling problem has, however, been proven to be NP-complete even if only a platform with two processors is considered [25].

After discussing the related work in Section 2, the contribution of this paper starts in Section 3 with a proper formulation of the partitioning and scheduling problem, specifically in the dataflow domain. Indeed, to the best of our knowledge, such a formulation is still missing in the dataflow-related literature. Furthermore, Section 4 presents the methodology of experiments, which involves modeling of the program execution and target architecture, and simulation and verification of the different scheduling policies described in Section 5. The main objective is the analysis of these policies and their performance potential for different partitioning configurations. Section 6 contains the results of simulation supported by experiments conducted on a real platform. Finally, the results, observations, advantages and drawbacks of the applied methodology clarify a direction of future work discussed in Section 7.

can be executed. The processing part of actors is encapsulated in the atomic firing, completely abstracting from time.

The problem of partitioning and scheduling of parallel programs in general has been widely described in the literature in numerous variants [22]. In the dataflow domain, in particular, the programs are usually treated as graphs that need to be optimally partitioned [29]. According to the commonly used terminology, the partitioning can be defined as a mapping of an application in the spatial domain (binding), whereas scheduling takes place in the temporal domain (sequencing) [24]. It is also usually emphasized that the partitioning is performed at compile time, whereas scheduling occurs at run-time and is subject to the satisfaction of firing rules, as well as to the scheduling policy for the sequential execution of actors inside each processor [9]. Although the partitioning and scheduling problems seem to rely on and impact each other, much more attention has been paid so far to the partitioning problem. Several experiments lead to considering the partitioning problem dominant over the scheduling problem [2, 7].

Since some dataflow models can be very general and therefore difficult to schedule efficiently, an interesting idea comes along with the concept of flow-shop scheduling [1]. The asynchronous dataflow models can be, in some cases, transformed into simpler synchronous ones, where the partitioning and scheduling can be applied directly to the actions. After the partitioning stage (which is an assignment of all actions to the processing units), the scheduling is performed first in the offline phase (schedules are computed at compile time), and then in the run-time phase where a dispatching mechanism selects a schedule for data processing [3]. Another approach for simplifying the scheduling problem is to reduce the complexity of the network and control the desired level of granularity. This can be achieved by actor merging, which can be treated as a special transformation performed on sets of actors [13]. Recent research shows that actor merging is possible even in the case of applications with data-dependent behavior and in the end can act quasi-statically [4]. This, however, does not solve the scheduling problem entirely, since even for a set of merged actors, if multiple merged actors are partitioned on one processor, a scheduling approach needs to be defined.

Fig. 1. Construction of a dataflow network and actor.

3 Problem formulation

The problem formulation described here should cover the dataflow MoC in a very generic way and avoid any limitations on the allowed level of dynamism in the application. It is used as a starting point for any experiment on partitioning and scheduling. Following the production field terminology [17], the goal is to find an assignment of n jobs (understood as action firings) to m parallel machines (understood as processing units) so that the overall makespan (completion time of the last performed job among all processing units) is minimized. Assuming processing units with no parallel execution, only one job (one action) can be executed at a time on each machine. When all jobs are assigned to the machines, the next decision is about their order of execution within each machine, as it is restricted that each job may consist of one and exactly one stage.

Each job j has an associated processing time (or weight) p_j and a group (or actor) g_j. There are k possible groups, and each one can be divided into subgroups where all jobs have the same processing time. This division can be easily identified with actors and associated firings, which can be different executions of the same action. With some pairs {j, j'} of incompatible jobs (i.e., with g_j ≠ g_j') is associated a communication time w_jj'. The communication time is subject to a fixed quantity q_jj' of information (or number of tokens) that needs to be transferred. The size of this data is fixed for any subgroup (i.e., an action always produces/consumes the same amount of data). Due to the structure of dataflow programs, the following constraints need to be satisfied:

- Group constraint. All jobs belonging to the same group have to be processed on the same machine (an actor must be entirely assigned to one processing unit). A fixed relative order is decided within each group (it can be assumed that this order is established based on the program's input data).
- Precedence constraint. A precedence (j, j') means that job j (plus the associated communication time) must be completed before job j' is allowed to start.
- Setup constraint. It requires that for each existing connection (j, j') involving jobs from different groups, a setup (or communication) time w_jj' occurs. More precisely, let C_j (resp. B_j) be the completion (resp. starting) time of job j. Then, B_j' ≥ C_j + w_jj'.
- Communication channel capacity constraint. The size of the communication channel (buffer) the information (tokens) is being transmitted through is bounded by B. That is, the sum of the q_jj' assigned to this buffer cannot exceed B. If this limit is reached, it might affect the overall performance.

The range of values for the p_j and the w_jj' fully depends on the targeted architecture. In homogeneous platforms, p_j is constant no matter how a group (actor) is actually partitioned. In heterogeneous platforms, this value can vary according to the processor family the processing unit belongs to (i.e., software or hardware). Let m_j be the machine assigned to job j. Then w_jj' is the product of two elements: the number of tokens q_jj' and the variable time c_jj'(m_j, m_j') needed to transfer a single unit of information from m_j to m_j'.
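In compact form (a reconstruction from the definitions above, not a formula reproduced from the paper), the objective and the setup relation can be written as:

```latex
\min \; C_{\max} = \max_{j} C_j
\quad \text{s.t.} \quad
B_{j'} \ge C_j + w_{jj'} \ \text{for every connection } (j, j'),
\qquad
w_{jj'} = q_{jj'}\, c_{jj'}(m_j, m_{j'}).
```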

For two given jobs j and j', the largest c_jj' can be significantly larger than the smallest c_jj'. In theory, every connection (j, j') can have as many different c_jj' values as the number of different possible assignments to the machines, but in practice this number can usually be reduced to a few different values, depending on the internal structure of the target platform (i.e., multiple NUMA nodes) [18].

4 Methodology of experiments

The goal of the methodology is to deal with the most general dataflow MoC, DPN, which is considered to be fully dynamic and capable of covering all classes of signal processing applications, such as audio/video codecs or packet switching in communication networks. In order to provide a framework for analysis and simulation of such dynamic applications, several models need to be introduced. The starting point is the execution model of the application that is going to be partitioned and scheduled, and the model of the target architecture (set of machines). The next step is the profiling of the application on the target architecture in order to provide the model with weights assigned to every job. Next, the results of the profiling are exploited by the simulation tool in order to calculate the makespan for various partitioning and scheduling configurations. Finally, as in [21], the simulated results are verified and compared with the actual execution times obtained on the platform for a given set of configurations.

4.1 Program execution modeling

A representation of the dataflow program that captures its entire behavior, as extensively studied to solve other optimization problems of dynamic dataflow implementations [8], can be built by generating a directed, acyclic graph G, called the Execution Trace Graph (ETG), which has to consider a sufficiently large and statistically meaningful set of input stimuli in order to cover the whole span of dynamic behavior. The execution of a DPN program with firings can be represented as a collection of action executions, called firings, which are characterized by intrinsic dependencies. The dependencies are either due to the data that is exchanged over communication channels or to internal properties, such as a Finite State Machine or a State Variable. In the first case, the firings (jobs) belong to different actors (groups) and the setup constraint occurs. In the second case, the firings belong to the same actor and the dependency contributes to the precedence constraint.

4.2 Target platform

The platform used in this work is built as an array of Transport Triggered Architecture processors (Fig. 2), further referenced as TTA. It resembles the Very Long Instruction Word (VLIW) architecture, with the internal datapaths of the processors exposed in the instruction set. The program description consists only

of the operand transfers between the computational resources. A TTA processor is made of functional units connected by input and output sockets to an interconnection network consisting of buses [31]. Among the several strengths of the TTA architecture, the property of highest importance for the sake of this work is a simple instruction memory without caches [11, 28]. To the best of our knowledge, this is also the only multiprocessor platform with no significant interprocessor communication penalty. Thus, it allows a validation of the execution and architecture models with regard to different partitioning and scheduling configurations, before the methodology is extended with a figure of merit for the possibly complex communication time.

Fig. 2. Transport Triggered Architecture model.

The choice of the TTA as a target platform has also been dictated by the quest for a measurable and deterministic processing time of an application with possibly negligible overheads. In fact, the applied profiling methodology operates on actors executed in isolation, that is, one actor at a time on a single processor core. Thanks to that, it is possible to apply the profiling only once and exploit its results in various configurations. This is a valuable property of the TTA architecture compared to other platforms, where the results of profiling usually depend on the partitioning configuration and may turn out to be invalid when other configurations are approached [18, 26]. The profiling information is obtained using a minimally intrusive timestamp hardware operation taking place on the cycle-accurate TTA simulator [30]. The location of the timestamp calls makes it possible to measure the execution time in clock-cycles for every action inside the application, as well as the overall time spent inside every actor outside the actual algorithmic parts (actions). This additional time can be identified with the internal scheduling overhead of an actor. Such profiling seems to be a unique opportunity compared to other platforms (i.e., NUMA architectures), where especially the communication time profiling is a highly troublesome process [18].

4.3 Performance simulation

A simulation tool developed as a part of the TURNUS co-design framework [6] is used to simulate the performance for different partitioning and scheduling configurations. It is able to compute in a deterministic way the makespan (execution time) for any given set of partitioning, scheduling and buffer dimensioning configurations, taking as input the ETG, the p_j values and the w_jj' values. The simulation tool considers the constraints specified in the problem formulation, monitors the events occurring on every processor in parallel, and throughout the execution follows the model of behavior defined for DPN actors. When multiple actors could be executed at one time, it makes a choice based on the specified internal scheduling policy. The simulation completes the toolchain used for the experiments, depicted in Fig. 3. Our previous experiments have proven that the simulation tool can be effectively and reliably used to simulate the performance of an application running on the TTA platform exploiting the results of a single profiling run. Different partitioning configurations can be simulated with a maximal discrepancy between the simulated and real execution time of less than 5% [16].

Fig. 3. Methodology of experiments: toolchain.

4.4 Analyzed application

All experiments have been performed using an MPEG4 SP decoder network, which is an implementation of the full MPEG-4 4:2:0 Simple Profile decoder standard written in the CAL Actor Language [10]. The main functional blocks include a parser, a reconstruction block, a 2-D inverse discrete cosine transform (IDCT) block and a motion compensator. These functional units are hierarchical compositions of actors in themselves. The decoding starts from the parser (the most complicated actor in the network, consisting of 71 actions), which extracts data from the incoming bitstream, continues through the reconstruction blocks exploiting the correlation of pixels, and ends at the motion compensator performing a selective adding of blocks. The whole network is presented in Fig. 4.

Fig. 4. MPEG4 SP decoder network

5 Scheduling policies

This work validates six different scheduling policies for actors partitioned on one core. The first three are direct implementations of existing techniques described in the literature and widely used in systems of multiple types:

Non Preemptive (NP): one actor is executed as long as the firing conditions are satisfied, that is, it has the necessary input tokens and available space in outgoing buffers. It can be considered analogous to FCFS scheduling, known also as Run-to-Completion [20]. The scheduler moves to the execution of the next actor on the list only if its firing conditions are not satisfied any more. The expression preemptiveness refers here to the change of the target actor after a successful firing and not to the interruption of a single task, which is, by nature, not allowed in dataflow programs.

Round Robin (RR): after a successful firing of an actor, the scheduler moves to another one and verifies its firing conditions. It is not allowed to execute an actor multiple times in a row if there are other ones executable at the same time. This policy follows directly the standard RR procedure used in operating systems and described in [20].

NP/RR swapped (NP/RR): it is similar to the concept of Round Robin with credits scheduling, where each task (actor, in this case) can receive a different number of cells for execution in each round [19]. In this case the choice of the number of cells is binary: either equal to one or to the number determined by the NP policy. The choice is made based on the criticality of an actor, which is represented as the percentage of its executions belonging to the critical path (CC). CC, defined as the longest time-weighted sequence of events from the start of the program to its termination, is evaluated using multiple algorithms as described in [8].
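The C++ fragment below sketches the difference between the NP and RR selection loops for the actors of one partition. It is an illustrative sketch only: the Actor interface and the termination handling are assumptions, not the scheduler implementation used in this work.

```cpp
#include <vector>
#include <cstddef>

// Hypothetical actor interface: canFire() checks input tokens and output
// buffer space, fire() executes exactly one action firing.
struct Actor {
  bool canFire() const;
  void fire();
};

// Non Preemptive (NP): keep firing the current actor while its firing
// conditions hold, then move to the next actor on the list.
void run_np(std::vector<Actor *> &partition, bool &stop) {
  std::size_t i = 0;
  while (!stop) {
    while (partition[i]->canFire()) partition[i]->fire();  // run to completion
    i = (i + 1) % partition.size();
  }
}

// Round Robin (RR): at most one successful firing per actor per visit.
void run_rr(std::vector<Actor *> &partition, bool &stop) {
  std::size_t i = 0;
  while (!stop) {
    if (partition[i]->canFire()) partition[i]->fire();     // single firing
    i = (i + 1) % partition.size();
  }
}
```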

The other three policies are extensions of these strategies with the introduction of different types of priorities (priority scheduling [20]). Unlike the existing approaches, they exploit the information obtained at the level of action firings, not actors (i.e., jobs, not groups). As a result, although only the actors can be chosen by the scheduler, the system of priorities changes from firing to firing throughout the execution. Extracting the information at this level is performed with the simulation tool operating within the TURNUS framework [6].

Critical Non Preemptive (CNP): as long as the next firing of an actor is in CC, it is executed on an NP basis. For the non-critical executions, an RR approach is applied instead. It is a strategy similar to NP/RR, but the priority is resolved independently for each action firing. In this case only the actual critical firings are given the priority, not actors as such.

Critical Outgoings Workload (COW): priority is assigned to actors according to different properties. The highest priority goes to the actor whose next firing is critical. If multiple actors are waiting to execute a critical firing, the next level of priority is given to the one whose firing has outgoing dependencies in other partitions. If the decision cannot be made based on these two criteria, the heaviest firing is chosen.

Earliest Critical Outgoings (ECO): priority is assigned to the actor whose next firing occurs the earliest in CC, or, if no critical firing is currently available, to a firing with the highest number of outgoing dependencies in other partitions. Non-resolved cases are handled on an RR basis.

6 Experimental results

In order to explore the design space in the dimension of scheduling, a fixed setup of partitioning and buffer dimensioning must be specified. In all experiments, two sets of partitioning configurations spanning up to 8 processors have been compared. The first set contained configurations where the overall workload of each partition is balanced, whereas the second one was created out of some random configurations. The idea behind that was to verify whether a certain tendency in the performance of different scheduling policies occurs independently of the quality of partitioning. As for the buffer dimensioning, in order to minimize its influence on the results, we would ideally aim at considering infinite buffer sizes. For practical purposes, as experimentally verified, a buffer size of 8192 is already a good approximation of an infinite buffer, because blocking at the outputs is not likely to happen. This value has been used for profiling, platform execution and performance simulation.

The first part of the analysis was the execution time simulated for the NP strategy, which is originally used by the TTA backend of ORCC [27]. The accuracy obtained for the simulation tool was very high; for instance, for the random set of partitioning configurations the difference between the TTA platform execution and the emulated results was less than 1.8%. This makes the accuracy even higher compared to our previous work [16]. This improvement might be due to the more convenient buffer size (8192 vs 512 used previously) and the use of a longer input sequence (30 frames vs 5 frames used previously). Secondly, for each partitioning configuration, the simulation tool estimated the execution times for 6 different scheduling policies and calculated the speed-ups versus the mono-core execution. The results for the balanced (resp. random) partitioning configurations are presented in Table 1 (resp. 2).

Table 1. Estimated speed-ups: balanced partitioning configurations (columns: No. of units, NP, RR, NP/RR, CNP, COW, ECO)

Table 2. Estimated speed-ups: random partitioning configurations (columns: No. of units, NP, RR, NP/RR, CNP, COW, ECO)

It can be clearly observed that some policies tend to perform much better than the others for almost any configuration set. For example, RR outperforms NP by more than 10% on average, and by up to 25%. The strategies relying on changing the actor after every execution (RR, COW, ECO) are also in general more efficient than NP and its derivatives. Surprisingly, CNP does not perform really well. This may be due to the fact that, once the critical firings are given priority to fire, the critical path itself might be modified by the concurrent decisions of the scheduler. At higher processor counts all policies start to perform very similarly. This may be due to the fact that, as the average number of actors on one processor decreases, the possible choice of the scheduler becomes limited and less sensitive to the strategy it is using. Another observation is that the balanced partitioning configurations resulted in much more diversity in the results than the random ones. This can lead to the conclusion that the partitioning problem should in fact be considered dominant over the scheduling problem, as it determines the room for improvement available to the scheduling policies. The same kind of observation was made in order acceptance and scheduling problems [23].

For further experiments, the two relatively extreme strategies RR and NP have been chosen. The scheduler inside the TTA backend has been modified to perform the scheduling on both an NP and an RR basis, so that a comparison of performances is possible. The execution times are presented in Fig. 5 (resp. 6) for balanced (resp. random) configurations.

Fig. 5. TTA platform execution: balanced partitioning configurations

The same tendency can again be observed in both sets of partitioning configurations. It thus confirms the legitimacy of the partitioning setup applied to the design space for the exploration of scheduling. Since good as well as bad partitioning configurations behave in the same way for different scheduling policies, using the simulation tool to tune the scheduling policy for the metaheuristic search of an optimal partitioning configuration seems to be a promising direction.

Fig. 6. TTA platform execution: random partitioning configurations

At the beginning, that is, up to 3 units, NP outperforms RR. However, the difference between them gradually decreases. From 4 units, RR achieves a better performance. This phenomenon can be explained by the presence of intra-partition scheduling overhead. This overhead is not measurable with the current profiling methodology, but we would logically expect it to be proportional to the number of actors in one partition, since if there are more actors, more conditions need to be checked at every scheduling decision. Nevertheless, even in the presence of this unfavorable overhead, the modified RR scheduler brought up to 14.5% improvement.

7 Future work

The most promising aspect of our current work is the extension of the different scheduling approaches to platforms other than the TTA, with an emphasis on NUMA architectures and various heterogeneous platforms. This involves a much more advanced profiling methodology and the introduction of a probability model, since more uncertainty is present in the architecture, especially regarding the caches. On the other hand, it is highly important to understand the differences between the estimated execution times and the platform results and, in particular, to investigate whether the intra-partition scheduling overhead can be measured or at least approximated. For this purpose, the goal would be to extend the simulation tool to keep track of the scheduler's decisions in a more detailed way, especially in terms of the overall number of firing conditions that are checked before a successful execution. In this work, the scheduling strategies are evaluated globally, that is, the same strategy is defined for every processing unit (partition). It might also be useful to analyze the possibility of defining a different scheduling policy for each partition, depending on the level of dynamism occurring in the sequencing of every subset of actors. Finally, having the model extended to cover different architectures in a generic way, the target will be to use the simulation tool in order to improve the algorithms for partitioning of dataflow applications. Exploring the properties and performance potential of different scheduling policies should help drive the metaheuristic search for a close-to-optimal partitioning.

References

1. Baker, K. R., Trietsch, D.: Principles of Sequencing and Scheduling. Wiley (2009).
2. Benini, L., Lombardi, M., Milano, M., Ruggiero, M.: Optimal resource allocation and scheduling for the CELL BE platform. Annals of Operations Research, (2011).
3. Boutellier, J., Sadhanala, V., Lucarz, C., Brisk, P., Mattavelli, M.: Scheduling of dataflow models within the reconfigurable video coding framework. IEEE Workshop on Signal Processing Systems, Washington, DC, (2008).
4. Boutellier, J., Ersfolk, J., Lilius, J., Mattavelli, M., Roquier, G., Silven, O.: Actor Merging for Dataflow Process Networks. IEEE Transactions on Signal Processing, vol. 63, (2015).

69 5. Casale-Brunet, S., Elguindy. A., Bezati, E., Thavot, R., Roquier, G., Mattavelli, M., Janneck, J. W.: Methods to explore design space for MPEG RMC codec specifications. Signal Processing: Image Communication, vol. 28, (2013). 6. Casale-Brunet, S., Alberti, C., Mattavelli, M., Janneck, J. W.: TURNUS: a Unified Dataflow Design Space Exploration Framework for Heterogeneous Parallel Systems. Conference on Design and Architectures for Signal and Image Processing (DASIP), Cagliari, Italy (2013). 7. Casale-Brunet, S., Bezati, E., Alberti, C., Mattavelli, M., Amaldi, E., Janneck, J. W.: Partitioning And Optimization Of High Level Stream Applications For Multi Clock Domain Architectures. IEEE Workshop on Signal Processing, Taipei, Taiwan, (2013). 8. Casale-Brunet, S.: Analysis and optimization of dynamic dataflow programs. PhD Thesis at EPFL, Switzerland (2015). 9. Eisenring, M., Teich, J., Thiele, L.: Rapid Prototyping of Dataflow Programs on Hardware/Software Architectures. Proc. of HICSS-31, Proc. of the Hawai Int. Conf. on System Sciences, (1998). 10. Eker, J., Janneck, J. W.: CAL Language Report. Tech. Memo UCB/ERL M03/48, UC Berkeley (2003). 11. Esko, O., Jääskeläinen, P., Huerta, P., de La Lama, C. S., Takala, J., Martinez, J. I.: Customized exposed datapath soft-core design flow with compiler support. 15th Annual IEEE International ASIC/SOC Conference, (2002). 12. Hirzel, M., Soulé, R., Schneider, S., Gedik, B., Grimm, R.: A catalog of Stream Processing Optimizations. ACM Computing Surveys, vol. 46 (2014). 13. Janneck, J. W.: Actors and their composition. Formal Aspects Comput., vol. 15, (2003). 14. Lee, E. A., Messerschmitt, D. G. : Static Scheduling of Synchronous Data Flow Programs for Digital Signal Processing. IEEE Transactions on Computers, vol. C-36, (1987). 15. Lee, E. A., Parks, T. M.: Dataflow process networks. Proceedings of the IEEE, (1995). 16. Michalska, M., Boutellier, J., Mattavelli, M.: A methodology for profiling and partitioning stream programs on many-core architectures. International Conference on Computational Science (ICCS), Procedia Computer Science Ed., (2015). 17. Pinedo, M.: Scheduling: Theory, Algorithms, and Systems, third edition. Prentice Hall (2008). 18. Selva, M.: Performance Monitoring of Throughput Constrained Dataflow Programs Executed On Shared-Memory Multi-core Architectures. PhD Thesis at INSA Lyon, France (2015). 19. Singh, S.: Round-robin with credits: an improved scheduling strategy for rateallocation in high-speed packet-switching. Global Telecommunications Conference, GLOBECOM (1994). 20. Silberschatz, A., Galvin, P., Gagne, G: Operating System Concepts. Wiley (2005). 21. Silver, E. A., Zufferey, N.: Inventory Control of an Item with a Probabilistic Replenishment Lead Time and a Known Supplier Shutdown Period. International Journal of Production Research 49 (4), (2011). 22. Sinnen, O.: Task scheduling for parallel systems. Wiley Series on Parallel and Distributed Computing (2007). 23. Thevenin, S., Zufferey, N., Widmer, M.: Metaheuristics for a Scheduling Problem with Rejection and Tardiness Penalties. Journal of Scheduling 18 (1), (2015).

24. Thiele, L., Bacivarov, I., Haid, W., Huang, K.: Mapping Applications to Tiled Multiprocessor Embedded Systems. Seventh International Conference on Application of Concurrency to System Design, (2007).
25. Ullman, J. D.: NP-complete scheduling problems. Journal of Computer and System Sciences, (1975).
26. Weaver, V., Terpstra, D., Moore, S.: Non-Determinism and Overcount on Modern Hardware Performance Counter Implementations. IEEE International Symposium on Performance Analysis of Systems and Software, Austin (2013).
27. Yviquel, H., Lorence, A., Jerbi, K., Cocherel, G.: Orcc: Multimedia Development Made Easy. Proceedings of the 21st ACM International Conference on Multimedia, (2013).
28. Yviquel, H.: From dataflow-based video coding tools to dedicated embedded multicore platforms. PhD Thesis at Université Rennes, France (2013).
29. Yviquel, H., Casseau, E., Raulet, M., Jääskeläinen, P., Takala, J.: Towards runtime actor mapping of dynamic dataflow programs onto multi-core platforms. 8th International Symposium on Image and Signal Processing and Analysis (2013).
30. Yviquel, H., Sanchez, A., Jääskeläinen, P., Takala, J., Raulet, M., Casseau, E.: Embedded Multi-Core Systems Dedicated to Dynamic Dataflow Programs. Journal of Signal Processing Systems, 1-16 (2014).
31. TTA-Based Co-design Environment, Last checked: December 2014.

Position Paper: OpenMP scheduling on ARM big.LITTLE architecture

Anastasiia Butko, Louisa Bessad, David Novo, Florent Bruguier, Abdoulaye Gamatié, Gilles Sassatelli, Lionel Torres, and Michel Robert
LIRMM (CNRS and University of Montpellier), Montpellier, France

Abstract. Single-ISA heterogeneous multicore systems are emerging as a promising direction to achieve a more suitable balance between performance and energy consumption. However, a proper utilization of these architectures is essential to reach the energy benefits. In this paper, we demonstrate the ineffectiveness of popular OpenMP scheduling policies executing the Rodinia benchmark on the Exynos 5 Octa (5422) SoC, which integrates the ARM big.LITTLE architecture.

1 Introduction

Traditional CPUs consume just too much power, and new solutions are needed to scale up to the ever-growing computational demands. Accordingly, major efforts are focusing on achieving a more holistic balance between performance and energy consumption. In this context, heterogeneous multicore architectures are firmly established as the main gateway to higher energy efficiency. Particularly interesting is the concept of single-ISA heterogeneous multicore systems [1], which is an attempt to include heterogeneity at the microarchitectural level while preserving a common abstraction to the software stack. In single-ISA heterogeneous multicore systems, all cores execute the same machine code and thus any core can execute any part of the code. Such a model makes it possible to execute the same OS kernel binary implemented for symmetric Chip Multi-Processors (CMPs) with only minimal configuration changes. In order to take advantage of single-ISA heterogeneous multicore architectures, we need an appropriate strategy to manage the distribution of computation tasks, also known as efficient thread scheduling in multithreading programming models. OpenMP [2] is a popular programming model that provides a shared-memory parallel programming interface. It features a thread-based fork-join task allocation model and various loop scheduling policies to determine the way in which iterations of a parallel loop are assigned to threads.

This paper measures the impact of different loop scheduling policies in a real state-of-the-art single-ISA heterogeneous multicore system. We use the Exynos 5 Octa (5422) System-on-Chip (SoC) [3] integrating the ARM big.LITTLE architecture [4], which couples relatively slower, low-power processor cores (LITTLE)

with relatively more powerful and power-hungry ones (big). We provide insightful performance and energy consumption results on the Rodinia OpenMP benchmark suite [5] and demonstrate the ineffectiveness of typical loop scheduling policies in the context of single-ISA heterogeneous multicore architectures.

2 The Exynos 5 Octa (5422) SoC

2.1 Platform Description

We run our experiments on the Odroid-XU3 board, which contains the Exynos 5 Octa (5422) SoC with the ARM big.LITTLE architecture. ARM big.LITTLE technology features two sets of cores: a low-performance, energy-efficient cluster called LITTLE, and a power-hungry, high-performance cluster called big. The Exynos 5 Octa (5422) SoC architecture and its main parameters are presented in Figure 1. It contains: (1) a cluster of four out-of-order superscalar Cortex-A15 cores with 32kB private caches and a 2MB L2 cache, and (2) a cluster of four in-order Cortex-A7 cores with 32kB private caches and a 512KB L2 cache. Each cluster operates at independent frequencies, ranging from 200MHz up to 1.4GHz for the LITTLE and up to 2GHz for the big. The SoC contains 2GB of LPDDR3 RAM, which runs at a 933MHz frequency and, with a 2x32-bit bus, achieves 14.9GB/s of memory bandwidth. The L2 caches are connected to the main memory via the 64-bit Cache Coherent Interconnect (CCI) 400 [6].

Fig. 1: Exynos 5 Octa (5422) SoC.

2.2 Software Support

Execution models. ARM big.LITTLE processors have three main software execution models [4]. The first and simplest model is called cluster migration. A single cluster is active at a time, and migration is triggered on a given workload threshold. The second mode, named CPU migration, relies on pairing every big core with a LITTLE core. Each pair of cores acts as a virtual core in which

only one actual core among the combined two is powered up and running at a time. Only four physical cores at most are active. The main difference between the cluster migration and CPU migration models is that the four actual cores running at a time are identical in the former, while they can be different in the latter. The heterogeneous multiprocessing (HMP) mode, also known as Global Task Scheduling (GTS), allows using all of the cores simultaneously. Clearly, HMP provides the highest flexibility, and consequently it is the most promising mode to achieve the best performance/energy trade-offs.

Benchmarks. We consider the Rodinia benchmark suite for heterogeneous computing [5]. It is composed of applications and kernels of different nature in terms of workload, from domains such as bioinformatics, image processing, data mining, medical imaging and physics simulation. It also includes classical algorithms like LU decomposition and graph traversal. In our experiments, the OpenMP implementations are configured with 4 or 8 threads, depending on the number of cores that are visible to the thread scheduling algorithm. Due to space constraints, we selected the following subset of benchmarks: backprop, bfs, heartwall, hotspot, kmeans openmp/serial, lud, nn, nw and srad v1/v2.

Thread scheduling algorithms. OpenMP provides three loop scheduling algorithms, which determine the way in which iterations of a parallel loop are assigned to threads. The static scheduling is the default loop scheduling algorithm, which divides the loop into equal or almost equal chunks. This scheduling provides the lowest overhead but, as we will show in the results, the potential load imbalance can cause significant synchronization overheads. The dynamic scheduling assigns chunks at runtime once threads complete previously assigned iterations. An internal work queue of chunk-sized blocks is used. By default, the chunk size is 1, and this can be explicitly specified by a programmer at compile time. Finally, the guided scheduling is similar to dynamic scheduling, but the chunk size exponentially decreases from the value calculated as #iterations/#threads to 1 by default, or to a value explicitly specified by a programmer at compile time. In the next section, we consider these three loop scheduling policies with the default chunk size. Furthermore, the experiments are run with the following software system configuration: the Ubuntu Linux kernel LTS 3.10, the GCC compiler and the OpenMP 3.1 runtime libraries.

3 Experimental Results

In this section we present a detailed analysis of the OpenMP implementation of the Rodinia benchmark suite running on the ARM big.LITTLE architecture. We consider the following configurations: the Cortex-A7 cluster running at 200 MHz, 800 MHz and 1.4 GHz; the Cortex-A15 cluster running at 200 MHz, 800 MHz and 2 GHz; the Cortex-A7/A15 clusters running at 200/200 MHz, 800/800 MHz, 1.4/2 GHz, 200 MHz/2 GHz and 1.4 GHz/200 MHz.
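Throughout these experiments, the loop scheduling policy is selected per parallel loop through OpenMP's schedule clause; the saxpy loop below is an illustrative stand-in for the Rodinia kernels, not code from the benchmark suite.

```c
#include <omp.h>

void saxpy_static(int n, float a, const float *x, float *y) {
  /* static (default): iterations split into equal chunks, one per thread */
  #pragma omp parallel for schedule(static)
  for (int i = 0; i < n; ++i) y[i] += a * x[i];
}

void saxpy_dynamic(int n, float a, const float *x, float *y) {
  /* dynamic: chunks (default size 1) handed out from a work queue at run time */
  #pragma omp parallel for schedule(dynamic)
  for (int i = 0; i < n; ++i) y[i] += a * x[i];
}

void saxpy_guided(int n, float a, const float *x, float *y) {
  /* guided: chunk size starts near #iterations/#threads and shrinks to 1 */
  #pragma omp parallel for schedule(guided)
  for (int i = 0; i < n; ++i) y[i] += a * x[i];
}
```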

3 Experimental Results

In this section we present a detailed analysis of the OpenMP implementation of the Rodinia benchmark suite running on the ARM big.LITTLE architecture. We consider the following configurations: the Cortex-A7 cluster running at 200 MHz, 800 MHz and 1.4 GHz; the Cortex-A15 cluster running at 200 MHz, 800 MHz and 2 GHz; and the Cortex-A7/A15 clusters running at 200/200 MHz, 800/800 MHz, 1.4/2 GHz, 200 MHz/2 GHz and 1.4 GHz/200 MHz.

Static Thread Scheduling. Figure 2(a) shows, in logarithmic scale, the measured execution time of the different configurations using the static scheduling algorithm. The results are normalized with respect to the slowest configuration, i.e., the Cortex-A7 running at 200MHz. As expected, the highest performance is typically achieved by the Cortex-A15 running at 2GHz. For example, a speedup of 21x is observed when running kmeans openmp on the big cluster. When using the HMP mode to simultaneously run on the big and LITTLE clusters (i.e., A7/A15 in the figure), the execution time is usually longer than that of the big cluster alone, despite using four additional active cores. An even higher penalty is observed when operating the LITTLE cluster at a lower frequency, especially so for the lud, nn and nw applications.

Fig. 2: Normalized speedup using static scheduling (reference: A7 at 200MHz). (a) Execution time speedup comparison; (b) EtoS comparison.

Figure 2(b) shows the normalized Energy to Solution (EtoS) measured with the on-board power monitors present in the Odroid-XU3 board. Results are again normalized against the reference Cortex-A7 running at 200MHz. We observe that the Cortex-A7 cluster is generally more energy-efficient than the Cortex-A15. Furthermore, the best energy efficiency is achieved when operating at 800MHz. We also observe that for a few applications (i.e., bfs, kmeans serial, and srad v1) the Cortex-A15 running at 800MHz provides a slightly better EtoS than the reference Cortex-A7 cluster. These applications benefit the most from the A15 out-of-order architecture, achieving the largest speedups. This leads

to a higher energy efficiency despite running on a core with higher power consumption. When using the HMP mode, some applications exhibit a very high EtoS. Particularly high are the EtoS values of the lud and nn applications executed in the Cortex-A7/A15 configuration running at 200MHz/2GHz. Our experiments also show that HMP is less energy efficient than the big cluster running at maximum frequency (i.e., A15 at 2GHz). In conclusion, static thread scheduling achieves a highly suboptimal use of our heterogeneous architecture, which turns out to be slower and less energy efficient than a single big cluster.

Further investigations were carried out with the Scalasca [7] and Vampir [8] software tools, which permit instrumenting the code and visualizing low-level behavior based on collected execution traces. Figure 3 shows a snapshot of the execution trace of the lud application alongside a zoom on two consecutive parallel-for loop constructs. It is clearly visible that the OpenMP runtime spawned eight threads, which got assigned to the eight cores. The four threads assigned to the Cortex-A15 cores completed execution of their chunks significantly faster than the Cortex-A7 cores. As a result, the execution critical path is determined by the slowest cores, which slows down system performance.

Fig. 3: lud on HMP big.LITTLE at 200MHz/2GHz (execution trace showing the master thread and the OMP worker threads on the Cortex-A7 and Cortex-A15 cores, with idle time spent at OMP barriers).

Dynamic and Guided Thread Scheduling. Figures 4(a-b) respectively illustrate the execution time using dynamic and guided thread scheduling, normalized by the static scheduling discussed previously. The dynamic scheduling is able to achieve good speedups for some applications (e.g., nn) but also degrades the performance of some others (e.g., nw). Something very similar happens with the guided scheduling, but with different application/configuration sets. For example, heartwall is now degraded for the 1.4GHz/200MHz configuration while nn achieves a 1.8x speedup. Figures 4(c-d) respectively show the EtoS of the dynamic and guided scheduling normalized by the static scheduling. We observe a very high correlation with respect to the corresponding execution time graphs. Accordingly, we can conclude that there is no existing policy that is generally superior. The best policy will depend on the application and on the architecture configuration. However, we believe that none of the policies is able to fully leverage the heterogeneity of our architecture and that more intelligent thread scheduling policies are needed

to sustain the energy efficiency promised by single-ISA heterogeneous multicore systems.

Fig. 4: Normalized execution time speedup and EtoS.

4 Conclusion

In this paper, we evaluate the performance and energy trade-offs of a single-ISA heterogeneous multicore system. The investigations were conducted on the Odroid-XU3 board, which includes an ARM big.LITTLE Exynos 5 Octa (5422) chip. We provided performance and energy results on the Rodinia OpenMP benchmark suite using typical loop scheduling policies, i.e., static, dynamic and guided. The results show that the given policies are inefficient in their use of heterogeneous cores. Therefore, we conclude that further research is required to propose suitable scheduling policies able to leverage the superior energy efficiency of the LITTLE cores while maintaining the faster execution times of the big cores.

5 Acknowledgement

The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/ ) under the Mont-Blanc 2 Project: grant agreement no.

References

1. R. Kumar, D. M. Tullsen, P. Ranganathan, N. P. Jouppi, and K. I. Farkas, Single-ISA heterogeneous multi-core architectures for multithreaded workload performance, in Proceedings of the 31st Annual International Symposium on Computer Architecture, ISCA '04, (Washington, DC, USA), pp. 64, IEEE Computer Society.
2. O. A. R. Board, The OpenMP API specification for parallel programming. November.
3. Samsung, Exynos Octa SoC. November.
4. B. Jeff, big.LITTLE technology moves towards fully heterogeneous global task scheduling. November.
5. S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S.-H. Lee, and K. Skadron, Rodinia: A benchmark suite for heterogeneous computing, in Workload Characterization (IISWC), IEEE International Symposium on, Oct.
6. ARM, CoreLink CCI-400 Cache Coherent Interconnect Technical Reference Manual, November. Revision r1p1.
7. Scalasca. November.
8. Vampir - performance optimization. November 2015.

Collaborative design and optimization using Collective Knowledge

Anton Lokhmotov (dividiti, UK) and Grigori Fursin (dividiti, UK; cTuning foundation, France)

Abstract. Designing faster, more energy efficient and reliable computer systems requires effective collaboration between hardware designers, system programmers and performance analysts, as well as feedback from system users. We present Collective Knowledge (CK), an open framework for reproducible and collaborative design and optimization. CK enables systematic and reproducible experimentation, combined with leading edge predictive analytics to gain valuable insights into system performance. The modular architecture of CK helps engineers create and share entire experimental workflows involving modules such as tools, programs, data sets, experimental results, predictive models and so on. We encourage a wide community, including system engineers and users, to share and reuse CK modules to fuel R&D on increasing the efficiency and decreasing the costs of computing everywhere.

1 Introduction

1.1 The need for collaboration

Designing faster, more energy efficient and reliable computer systems requires effective collaboration between several groups of engineers, for example:

- hardware designers develop and optimize hardware, and provide low-level tools to analyze its behavior such as simulators and profilers with event counters;
- system programmers port to new hardware and then optimize proprietary or open-source compilers (e.g. LLVM, GCC) and libraries (e.g. OpenCL, khronos.org/opencl; OpenVX, khronos.org/openvx; OpenCV, opencv.org; Caffe, caffe.berkeleyvision.org; BLAS, netlib.org/blas);
- performance analysts collect benchmarks and representative workloads, and automate running them on new hardware.

In our experience, the above groups still collaborate infrequently (e.g. on achieving development milestones), despite the widely recognized virtues of hardware/software co-design [1]. Moreover, the effectiveness of collaboration typically

depends on the proactivity and diligence of individual engineers, the level of investment into collaboration tools, the pressure exerted by customers and users, and so on.

Ineffective collaboration could perhaps be tolerated many decades ago when design and optimization choices were limited. Today systems are so complex that any seemingly insignificant choice can lead to dramatic degradation of performance and other important characteristics [2,3,4,5]. To mitigate commercial risks, companies develop proprietary infrastructures for testing and performance analysis, and bear the associated maintenance costs. For example, whenever a performance analyst reports a performance issue, she should provide the program code along with instructions for how to build and run it, and the experimental conditions (e.g. the hardware and compiler revisions). Reproducing the reported issue may take many days, while omitting a single condition in the report may lead to frustrating back-and-forth communication and further time being wasted. Dealing with a performance issue reported by a user is even harder: the corresponding experimental conditions need to be elicited from the user (or guessed), the program code and build scripts imported into the proprietary infrastructure, the environment painstakingly reconstructed, etc. Ineffective collaboration wastes precious resources and runs the risk of designing uncompetitive computer systems.

1.2 The need for representative workloads

The conclusions of performance analysis intrinsically depend on the workloads selected for evaluation [6]. Several companies devise and license benchmark suites based on their guesses of what representative workloads might be in the near future. Since benchmarking is their primary business, their programs, data sets and methodology often go unchallenged, with the benchmarking scores driving the purchasing decisions both of OEMs (e.g. phone manufacturers) and consumers (e.g. phone users). When the stakes are that high, the vendors have no choice but to optimize their products for the commercial benchmarks. When those turn out to bear no close resemblance to real workloads, the products underperform.

Leading academics have long recognized the need for representative workloads to drive research in hardware design and software tools [7,8]. With funding agencies increasingly requiring academics to demonstrate impact, academics have the right incentives to share representative workloads and data sets with the community. Incentives to share representative workloads may be somewhat different for industry. Consider the example of Realeyes (realeyesit.com), a participant in the EU CARP project (carpproject.eu). Recognizing the value of collaborative R&D, Realeyes released under a permissive license a benchmark comprised of several standard image processing algorithms used in their pipeline for evaluating human emotions [9]. Now

Realeyes enjoy the benefits of our research on run-time adaptation (Section 3) and accelerator programming [10] that their benchmark enabled. We thus have reasons to believe that the expert community can tackle the issue of representative workloads. The challenge for vendors and researchers alike will be to keep up with the emerging workloads, as this will be crucial for competitiveness.

1.3 The need for predictive analytics

While traditionally performance analysts would only obtain benchmarking figures, they have recently also started performing more sophisticated analyses to detect unexpected behavior and suggest improvements to hardware and system software engineers. Conventional labour-intensive analysis (e.g. frame by frame, shader by shader for graphics) is not only extremely costly but is simply unsustainable for analyzing hundreds of real workloads (e.g. the most popular mobile games).

Much of the success of companies like Google, Facebook and Amazon can be attributed to using statistical ("machine learning", "predictive analytics") techniques, which allow them to make uncannily accurate predictions about users' preferences. Whereas most people would agree with this, the same people would resist the idea of using statistical techniques in their own area of expertise. A litmus test for our community is to ask ten computer engineers whether statistical techniques would help them design better processors and compilers. In our own experience, only one out of ten would say yes, while the others would typically lack interdisciplinary knowledge. We have grown to appreciate the importance of statistical techniques over the years. (One of us actually flunked statistics at university.) We constantly find useful applications of predictive analytics in computer engineering. For example, identifying a minimal set of representative programs and inputs has many benefits for design space exploration, including vastly reduced simulation time.

1.4 Our humble proposal for solution

We present Collective Knowledge, a simple and extensible framework for collaborative and reproducible R&D ([11], Section 2). With Collective Knowledge, engineers can systematically investigate design and optimization choices using leading edge statistical techniques, conveniently exchange experimental workflows across organizational boundaries (including benchmarks), and automatically maintain programming tools and documentation. Several performance-oriented open-source tools exist, including LLVM's LNT (llvm.org/docs/lnt), ARM's Workload Automation (github.com/arm-software/workload-automation), and Phoronix Media's OpenBenchmarking (openbenchmarking.org). These tools do not, however, provide robust mechanisms for reproducible experimentation and capabilities for collaborative design and optimization. We

demonstrate some of these mechanisms and capabilities on a computationally intensive algorithm from the Realeyes benchmark (Section 3). We believe that Collective Knowledge can be combined with open-source and proprietary tools to create robust, cost-effective solutions to accelerate computer engineering.

2 Collective Knowledge

Fig. 1 shows how a typical experimental workflow can be converted into a collection of CK modules such as programs (e.g. benchmarks), data sets, tools (e.g. compilers and libraries), scripts, experimental results, predictive models, articles, etc. In addition, CK modules can abstract away access to hardware, monitor run-time state, apply predictive analytics, etc.

Fig. 1. Converting a typical experimental workflow to the Collective Knowledge format. (The figure sketches how ad-hoc tuning scripts, programs, data sets and tools are gradually wrapped into CK modules with a unified JSON API, assembled into workflows, shared via repositories such as GitHub or Bitbucket, and analyzed with statistical and predictive-analytics modules; example commands include: $ ck pull repo:ctuning-programs; $ ck list program; $ ck list dataset; $ ck compile program:*susan speed; $ ck run program:cbench-automotive-susan; $ ck crowdtune program.)

Each CK module has a class. Classes are implemented in Python, with a JSON (JavaScript Object Notation, json.org) meta description, a JSON-based API, and a unified command line interface. New classes can be defined as needed. Each CK module has a DOI-style unique identifier (UID). CK modules can be referenced and searched by their UIDs using Hadoop-based

Elasticsearch (an open-source distributed real-time search and analytics engine: elastic.co). CK modules can be flexibly combined into experimental workflows, similar to playing with LEGO(R) bricks. Engineers can share CK workflows, complete with all their modules, via repositories such as GitHub. Other engineers can reproduce an experiment under the same or similar conditions using a single CK command. Importantly, if the other engineers are unable to reproduce an experiment due to uncaptured dependencies (e.g. on run-time state), they can debug the workflow and share the fixed workflow back (possibly with new extensions, experiments, models, etc.).

Collaborating groups of engineers are thus able to gradually expose, in a unified way, the multi-dimensional design and optimization choices c of all modules, their features f, dependencies on other modules, run-time state s and observed behavior b, as shown in Fig. 1 and described in detail in [12,13]. This, in turn, enables collaboration on the most essential question of computer engineering: how to optimize any given computation in terms of performance, power consumption, resource usage, accuracy, resiliency and cost; in other words, how to learn and optimize the behavior function B:

b = B(c, f, s)

2.1 Systematic benchmarking

Collective Knowledge supports systematic benchmarking of a program's performance profile under reproducible conditions, with the experimental results being aggregated in a local or remote CK repository. Engineers gradually improve the reproducibility of CK benchmarking by implementing CK modules to set run-time state and monitor unexpected behavior across participating systems. For example, on mobile devices, unexpected performance variation can often be attributed to dynamic voltage and frequency scaling (DVFS). Mobile devices have power and temperature limits to prevent device damage; in addition, when a workload's computational requirements can still be met at a lower frequency, lowering the frequency conserves energy. Further complications arise when benchmarking on heterogeneous multicore systems such as ARM big.LITTLE: in a short time, a workload can migrate between cores having different microarchitectures, as well as running at different frequencies. Controlling for such factors (or at least accounting for them with elementary statistics) is key to meaningful performance evaluation on mobile devices.
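As an illustration of what "setting run-time state" can involve, the C++ sketch below pins the frequency of one CPU core via the standard Linux cpufreq sysfs interface before a measurement. It is a minimal sketch assuming root access and a kernel that exposes these files and supports the userspace governor; it is not part of the CK codebase, where such logic would live inside a CK module.

    #include <fstream>
    #include <string>
    #include <iostream>

    // Write a value to a cpufreq sysfs file; returns false if the file is not writable.
    static bool write_sysfs(const std::string& path, const std::string& value) {
        std::ofstream f(path);
        if (!f) return false;
        f << value;
        return static_cast<bool>(f);
    }

    int main() {
        const std::string cpu0 = "/sys/devices/system/cpu/cpu0/cpufreq/";
        // Use the "userspace" governor so the frequency stays where we put it.
        if (!write_sysfs(cpu0 + "scaling_governor", "userspace") ||
            !write_sysfs(cpu0 + "scaling_setspeed", "1400000")) {   // value in kHz
            std::cerr << "could not fix the CPU frequency (need root?)\n";
            return 1;
        }
        // ... run and measure the workload here ...
        return 0;
    }

Recording the governor and frequency alongside the results is what allows another engineer to reproduce the same run-time state later.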

3 Example

Systematically collecting performance data that can be trusted is essential but does not by itself produce insights. The Collective Knowledge approach makes it possible to seamlessly apply leading-edge statistical techniques to the collected data, thus converting raw data into useful insights.

Consider the Histogram of Oriented Gradients (HOG), a widely used computer vision algorithm for detecting objects [14]. Realeyes deploy HOG in several stages of their image processing pipeline. Different stages use different flavours of HOG, varying considerably in their computational requirements. For example, one stage of the pipeline may invoke HOG on a small-sized image but with a high amount of computation per pixel ("computational intensity"); another stage may invoke HOG on a medium-sized image but with low computational intensity. In addition, the Realeyes pipeline may be customized differently for running on mobile devices (e.g. phones), personal computers (e.g. laptops) or in the cloud. In this paper, we use two versions of HOG: an OpenCV-based CPU implementation (with TBB parallelization) and a hand-written OpenCL implementation (data parallel kernel); the related CK repository is at github.com/ctuning/reproduce-carp-project.

  Platform      CPU (ARM)        GPU (ARM)
  Chromebook 1  Cortex-A15 x 2   Mali-T604 x 4
  Chromebook 2  Cortex-A15 x 4   Mali-T628 x 4

Table 1. Experimental platforms: Samsung Chromebooks 1 (XE303C12, 2012) and 2 (XE503C12, 2014). Notation: processor architecture x number of cores.

Suppose we are interested in optimizing the execution time of HOG (we could also consider multi-objective optimization, e.g. finding appropriate trade-offs between execution time, energy consumption and cost). Computing HOG on the GPU is typically faster than on the CPU. The total GPU execution time (including the memory transfer overhead), however, may exceed the CPU execution time. Figure 2 shows a performance surface plot for one flavour of HOG with DVFS disabled and the processor frequencies controlled for. The X and Y axes show the CPU and the GPU frequencies, while the Z axis shows the CPU execution time divided by the total GPU execution time. When this ratio is greater than 1 (the light pink to bright red areas), using the GPU is faster than using the CPU, despite the memory transfer overhead. A sensible scheduling decision, therefore, is to schedule the workload on the GPU. While it may be possible to infer when to use the GPU from this plot (just avoid the light blue to navy areas), what if the performance also depends on other factors besides the processor frequencies? Will we still be able to make sensible scheduling decisions most of the time?

To answer this question, we conducted multiple experiments with HOG (1x1 cells) on two Chromebook platforms (see Table 1). The experiments covered the Cartesian product of the CPU and GPU frequencies available on both platforms (CPU: 1600 MHz, 800 MHz; GPU: 533 MHz, 266 MHz), 3 block sizes (16, 64, 128), and 23 images (of different shapes and sizes), for a total of 276 samples (with 5 repetitions each).

Fig. 2. Platform: Chromebook 2. Program: HOG 4x4; block size: 64. X axis: CPU frequency (MHz); Y axis: GPU frequency (MHz); Z axis: CPU execution time divided by GPU [kernel + memory transfer] execution time.

To analyze the collected experimental data, we use decision trees, a popular supervised learning method for classification and regression (en.wikipedia.org/wiki/decision_tree_learning). We build decision trees using a Collective Knowledge interface to the Python scikit-learn package (scikit-learn.org). We thus obtain a predictive model that tells us whether it is faster to execute HOG on the GPU or on the CPU by considering several features of a sample (experiment). In other words, the model classifies a sample by assigning to it one of two labels: YES means the GPU should be used; NO means the CPU should be used. We train the model on the experimental data by labelling a sample with YES if the CPU execution time exceeds the GPU execution time by at least 7% (to account for variability), and with NO otherwise.

Figure 3 shows a decision tree of depth 1 built from the experimental data obtained on Chromebook 1 using just one feature: the block size (designated as worksize in the figure), which, informally, determines the computational intensity of the algorithm. The root node divides the training set of 276 samples into two subsets. For the 92 samples in the first subset, represented by the left leaf node (L1), the worksize is less than or equal to 40 (i.e. 16). For the 184 samples in the second subset, represented by the right leaf node (L2), the worksize is greater than 40 (i.e. 64 and 128).

Fig. 3. Platform: Chromebook 1. Model: feature set: 1; depth: 1. (The root splits on worksize <= 40: leaf L1 covers 92 samples, 90 labelled NO and 2 labelled YES; leaf L2 covers 184 samples, 4 NO and 180 YES.)

Fig. 4. Platform: Chromebook 1. Model: feature set: 1; depth: 2. (As Fig. 3, but the right subtree splits again on the worksize; leaf L2 covers 92 samples with 4 NO and 88 YES, and leaf L3 covers 92 samples, all labelled YES.)

Fig. 5. Platform: Chromebook 1. Model: feature set: 2; depth: 4. (A deeper tree that additionally splits on the image rows, image columns, GPU frequency and CPU frequency, with 8 leaf nodes L1-L8.)

  Id   Features
  FS1  worksize [block size]
  FS2  all features from FS1; CPU frequency; GPU frequency; image rows (m); image columns (n); image size (m x n); (GWS0, GWS1, GWS2) [OpenCL global work size]
  FS3  all features from FS2; CPU frequency / GPU frequency; image size / CPU frequency; image size / GPU frequency

Table 2. Feature sets: simple (FS1); natural (FS2); designed (FS3).

In the first subset, 90 samples are labelled with NO and 2 samples are labelled with YES. Since the majority of the samples are labelled with NO, the tree predicts that a workload for which the worksize is less than or equal to 40 should be executed on the CPU. Similarly, a workload for which the worksize is greater than 40 should be executed on the GPU. Intuitively, this makes sense: a workload with a higher computational intensity (a higher value of the worksize) should be executed on the GPU, despite the memory transfer overhead.

For 6 samples out of 276, the model in Figure 3 mispredicts the correct scheduling decision. (We say that the rate of correct predictions is 270/276, or 97.8%.) For example, for the two samples out of 92 in the subset for which the worksize is 16 (L1), the GPU was still faster than the CPU. Yet, based on the labelling of the majority of the samples in this subset, the model mispredicts that the workload should be executed on the CPU.

Figure 4 shows a decision tree of depth 2 using the same worksize feature. The right child of the root now has two children of its own. All the samples in the rightmost leaf (L3), for which the worksize is greater than 96 (i.e. 128), are labelled with YES. This means that at the highest computational intensity the GPU was always faster than the CPU, thus confirming our intuition. However, the model in Figure 4 still makes 6 mispredictions.

To improve the prediction rate, we build models using more features, as well as more levels. In Table 2, we consider two more sets of features. The natural set is constructed from the features that we expected would impact the scheduling. Figure 5 shows a decision tree of depth 4 built using the natural feature set. This model uses 4 additional features (the GPU frequency, the CPU frequency, the number of image columns, the number of image rows) and has 8 leaf nodes, but still results in 2 mispredictions (L7), achieving a prediction rate of 99.3%. This model makes the same decision on the worksize at the top level, but better fits the training data at lower levels. However, this model is more difficult to grasp intuitively and may not fit new data well. The designed set can be used to build models achieving a 100.0% prediction rate. A decision tree of depth 5 (not shown) uses all the new features from the designed set. With 12 leaf nodes, however, this model is even more difficult to grasp intuitively and exhibits even more overfitting than the model in Figure 5.
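Once trained, such a tree is trivial to deploy inside a scheduler. The sketch below hand-codes the depth-2 tree of Figure 4 as a C++ predicate, using the thresholds (40 and 96) and leaf labels reported above; it illustrates how an exported model could be used, and is not code from the CK repository.

    // Decision procedure corresponding to the depth-2 tree of Figure 4:
    // returns true if this HOG invocation should be dispatched to the GPU.
    struct HogSample {
        int worksize;   // block size: 16, 64 or 128 in the experiments
    };

    bool run_on_gpu_chromebook1(const HogSample& s) {
        if (s.worksize <= 40)   // leaf L1: majority NO  -> use the CPU
            return false;
        if (s.worksize <= 96)   // leaf L2: majority YES -> use the GPU
            return true;
        return true;            // leaf L3: always YES   -> use the GPU
    }

For Chromebook 2, the retrained tree of Figure 6 would return false in all three leaves, reflecting that its 4-core CPU wins more often.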

Fig. 6. Platform: Chromebook 2. Model: feature set: 1; depth: 2. (All three leaves are labelled NO: L1 covers 92 samples with NO (92) / YES (0), L2 covers 92 samples with NO (83) / YES (9), and L3 covers 92 samples with NO (52) / YES (40).)

Now, if we use a simple model trained on data from Chromebook 1 (Figure 4) for predicting scheduling decisions on Chromebook 2, we only achieve a 51.1% prediction rate (not shown). A similar model retrained on data from Chromebook 2 (Figure 6) achieves an 82.3% prediction rate. Note that the top-level decision has changed to the worksize being less than 96. In other words, up to that worksize the CPU is generally faster than the GPU even as problems become more computationally intensive. This makes sense: the CPU of Chromebook 2 has 4 cores, whereas the CPU of Chromebook 1 has 2 cores. This demonstrates the importance of retraining models for different platforms.

As before, using more features and levels can bring the prediction rate to 100.0%. For example, using the natural feature set improves the prediction rate to 90.2% (Figure 7). Note that the top-level decision no longer depends on the worksize but on the first dimension of the OpenCL global work size. For brevity, we omit a demonstration of the importance of using more data for training. For example, to build more precise models, we could have added experiments with a worksize of 32 to determine whether it would still be considered as non-intensive as the worksize of 16. The Collective Knowledge approach makes it possible to crowdsource such experiments and rebuild models as more mispredictions are detected and more data becomes available.

High-level programming frameworks for heterogeneous systems such as Android's RenderScript (developer.android.com/guide/topics/renderscript), Qualcomm's Symphony (developer.qualcomm.com/symphony, formerly known as MARE), and Khronos's OpenVX (khronos.org/openvx) can be similarly trained to dispatch tasks to system resources efficiently.

Fig. 7. Platform: Chromebook 2. Model: feature set: 2; depth: 2. (The root splits on the first dimension of the OpenCL global work size (GWS0); the left subtree splits on the CPU frequency and the right subtree on the number of image rows, giving four leaves L1-L4.)

4 Conclusion

We have presented Collective Knowledge, an open methodology that enables collaborative design and optimization of computer systems. This methodology encourages contributions from the expert community to avoid common benchmarking pitfalls (allowing, for example, to fix the processor frequency, capture run-time state, find missing software/hardware features, improve models, etc.).

4.1 Representative workloads

We believe the expert community can tackle the issue of representative workloads as well as the issue of rigorous evaluation. The community will both provide representative workloads and rank them according to established quality criteria. Furthermore, a panel of recognized experts could periodically (say, every 6 months) provide a ranking to complement commercial benchmark suites. Success will depend on establishing the right incentives for the community.

As the example of Realeyes shows, even when commercial sensitivity prevents a company from releasing their full application under an open-source license, it may still be possible to distill a performance-sensitive portion of it into a standalone benchmark. The community can help the company optimize their benchmark (for free or for a fee), thus improving the overall performance of their full application. (The original HOG paper [14] has a very high citation count; just imagine this community combining their efforts to squeeze out every gram of HOG performance across different flavours, data sets, hardware platforms, etc.) Some software developers will just want to see their benchmark appear in the ranked selection of workloads, highlighting their skill and expertise (similar to kudos for open-source contributions).

4.2 Predictive analytics

We believe that the Collective Knowledge approach convincingly demonstrates that statistical techniques can indeed help computer engineers do a better job

in many practical scenarios. Why do we think this is important? Although we are not suggesting that even the most advanced statistical techniques can ever substitute for human expertise and ingenuity, applying them can liberate engineers from repetitive, time-consuming and error-prone tasks that machines are better at. Instead, engineers can unleash their creativity on problem solving and innovating. Even if this idea is not particularly novel, Collective Knowledge brings it one small step closer to reality.

4.3 Trust me, I am a catalyst!

We view Collective Knowledge as a catalyst for accelerating knowledge discovery and stimulating flows of reproducible insights across largely divided hardware/software and industry/academia communities. Better flows will lead to breakthroughs in the energy efficiency, performance and reliability of computer systems. Effective knowledge sharing and open innovation will enable exciting new applications in consumer electronics, robotics, automotive and healthcare at better quality, lower cost and faster time-to-market.

5 Acknowledgements

We thank the EU FP TETRACOM Coordination Action for funding initial CK development. We thank the CK community for their encouragement, support and contributions. In particular, we thank our partners and customers for providing us valuable opportunities to improve Collective Knowledge on real-world use cases.

References

1. J. Teich. Hardware/software codesign: The past, the present, and predicting the future. Proceedings of the IEEE, 100 (Special Centennial Issue), May.
2. John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach (second edition). Morgan Kaufmann Publishers.
3. R. Whaley and J. Dongarra. Automatically tuned linear algebra software. In Proceedings of the Conference on High Performance Networking and Computing.
4. B. Aarts, M. Barreteau, F. Bodin, P. Brinkhaus, Z. Chamski, H.-P. Charles, C. Eisenbeis, J. Gurd, J. Hoogerbrugge, P. Hu, W. Jalby, P.M.W. Knijnenburg, M.F.P. O'Boyle, E. Rohou, R. Sakellariou, H. Schepers, A. Seznec, E.A. Stöhr, M. Verhoeven, and H.A.G. Wijshoff. OCEANS: Optimizing compilers for embedded applications. In Proc. Euro-Par '97, volume 1300 of Lecture Notes in Computer Science.
5. K.D. Cooper, D. Subramanian, and L. Torczon. Adaptive optimizing compilers for the 21st century. Journal of Supercomputing, 23(1).
6. Raj Jain. The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. May 1991.

7. Krste Asanovic, Ras Bodik, Bryan C. Catanzaro, Joseph J. Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William L. Plishker, John Shalf, Samuel W. Williams, and Katherine A. Yelick. The landscape of parallel computing research: a view from Berkeley. Technical Report UCB/EECS, Electrical Engineering and Computer Sciences, University of California at Berkeley, December.
8. Luigi Nardi, Bruno Bodin, M. Zeeshan Zia, John Mawer, Andy Nisbet, Paul H. J. Kelly, Andrew J. Davison, Mikel Luján, Michael F. P. O'Boyle, Graham Riley, Nigel Topham, and Steve Furber. Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM. In Proceedings of the IEEE Conference on Robotics and Automation (ICRA), May (arXiv).
9. Elnar Hajiyev, Róbert Dávid, László Marák, and Riyadh Baghdadi. Realeyes image processing benchmark.
10. Riyadh Baghdadi, Ulysse Beaugnon, Tobias Grosser, Michael Kruse, Chandan Reddy, Sven Verdoolaege, Javed Absar, Sven van Haastregt, Alexey Kravets, Robert David, Elnar Hajiyev, Adam Betts, Jeroen Ketema, Albert Cohen, Alastair Donaldson, and Anton Lokhmotov. PENCIL: a platform-neutral compute intermediate language for accelerator programming. In Proceedings of the 24th International Conference on Parallel Architectures and Compilation Techniques (PACT '15), September.
11. Grigori Fursin, Anton Lokhmotov, and Ed Plowman. Collective Knowledge: towards R&D sustainability. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE '16), March.
12. Grigori Fursin, Renato Miceli, Anton Lokhmotov, Michael Gerndt, Marc Baboulin, Allen D. Malony, Zbigniew Chamski, Diego Novillo, and Davide Del Vento. Collective Mind: Towards practical and collaborative auto-tuning. Scientific Programming, 22(4), July.
13. Grigori Fursin, Abdul Memon, Christophe Guillon, and Anton Lokhmotov. Collective Mind, Part II: Towards performance- and cost-aware software engineering as a natural science. In Proceedings of the 18th International Workshop on Compilers for Parallel Computing (CPC '15), January.
14. Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005.

Heterogeneous (CPU+GPU) Working-set Hash Tables

Ziaul Choudhury and Suresh Purini
International Institute of Information Technology, Hyderabad, India

Abstract. In this paper, we propose heterogeneous (CPU+GPU) hash tables that optimize operations for frequently accessed keys. The idea is to maintain a dynamic set of the most frequently accessed keys in the GPU memory and the rest of the keys in the CPU main memory. Further, queries are processed in batches of fixed size. We measured the query throughput of our hash tables using Millions of Queries Processed per Second (MQPS) as a metric, on different key access distributions. On distributions where some keys are queried more frequently than others, we achieved on average 10x higher MQPS when compared to a highly tuned serial hash table in the C++ Boost library, and 5x higher MQPS against a state-of-the-art concurrent lock-free hash table. The maximum load factor on the hash tables was set to 0.9. On uniform random query distributions, as expected, our hash tables do not outperform concurrent lock-free hash tables, but nevertheless match their performance.

1 Introduction

A hash table is a key-value store which supports constant time insert, delete and search operations. Hash tables do not lay any special emphasis on the key access patterns over time. However, the key access sequences in real-world applications tend to have some structure. For example, until a certain point in time a small subset of keys could be searched more frequently than the rest. Data structures like splay trees [17] are specifically designed to reduce the access times of the frequently accessed keys by keeping them close to the root using rotation operations on the tree. The working-set property states that it requires at most O(log[ω(x) + 2]) time to search for a key x, where ω(x) is the number of distinct keys accessed since the last access of x. Splay trees satisfy this property in an amortized sense [17], while the working-set structure satisfies the same in the worst-case sense [4]. The working-set structure is an array of balanced binary search trees where the most recently accessed keys occupy the smaller trees at the front. Through a sequence of insertions and deletions the older keys are propagated to the bigger trees towards the end.

Following the recent rise of accelerator-based computing, like heterogeneous multi-core CPU+GPU based systems, many data structures, for example quad-trees [9] and B+ trees [5], have been successfully ported to these exotic

platforms, thus achieving greater performance. In this direction, inspired by the working-set structure, we propose a set of two-level heterogeneous (CPU+GPU) hash tables in this paper. In all the designs the first level of the hash table is smaller in size and resides in the GPU memory. It essentially caches the most recently accessed keys, or in other words hot data. The second-level hash table resides in the CPU memory and contains the rest of the keys. We overlay an MRU (most recently used) list on the keys residing in the GPU hash table. The queries are batched and are processed first on the GPU, followed by the CPU. Our overall hash tables can be viewed as a heterogeneous two-level working-set structure. To the best of our knowledge, this is the first attempt towards designing heterogeneous (CPU+GPU) hash tables, wherein we use the GPU accelerator to improve the query throughput by exploiting the key access patterns of the hash table.

The rest of the paper is organized as follows. In section 2 we give a brief background necessary towards designing a heterogeneous (CPU+GPU) hash table. In section 3 we describe our hash table designs in detail, followed by experimental results in section 4. We conclude with directions for some future work in section 5.

2 Background

This section gives an overview of the heterogeneous (CPU+GPU) architecture, followed by a brief discussion of the multi-core CPU and GPU hash tables that inspired the work in this paper.

NVIDIA GPUs are composed of multiple streaming multiprocessors (SMs), each containing a number of lightweight primitive cores. The SMs execute in parallel independently. The memory subsystem is composed of a global DRAM and an L2 cache shared by all the SMs. There is also a small software-managed data cache, whose access time is close to register speed, called shared memory. This is attached to each SM and shared by all the cores within an SM. A compute kernel on a GPU is organized as a collection of thread blocks, whose threads are in turn grouped into batches of 32 called warps. One instruction by a warp of threads is executed in a constant number of cycles within the SM. A warp is the basic unit of execution in a GPU kernel. The GPU is embedded in a system as an accelerator device connected to the CPU through a low-bandwidth PCI Express (PCIe) bus. GPUs are programmed using popular frameworks like CUDA [1] and OpenCL.

Heterogeneous computing using the CPU and GPU traditionally involves the GPU handling the data-parallel part of the computation, by taking advantage of its massive number of lightweight parallel threads, while the CPU handles the sequential code or data transfer management. Unfortunately, a large fraction of the time in a CPU+GPU code is spent in transferring data across the slow PCIe bus. This problem can be mitigated by carefully placing the data in the GPU so that fetching of new data from the CPU is kept to a minimum. The CPU, after transferring the data and launching the kernel, mostly sits idle during the

computation. In this work the motivation is to keep both devices (by devices we mean the CPU and its connected GPU) busy while executing successive operations on the respective hash tables.

Data structures that use both the CPU and GPU simultaneously have been reported in the literature. Kelly and Breslow [9] proposed a heterogeneous approach to construct quad-trees by building the first few levels on the CPU and the rest of the levels on the GPU. The workload division strategy has also proven its worth in cases where the costly or frequent operations were accelerated on the GPU while the rest of the operations were handled by the CPU. Daga and Nutter [5] proposed a B+ tree implementation on an Accelerated Processing Unit (APU). They eliminated the need to copy the entire tree to the GPU memory, thus freeing the implementation from the limited GPU memory.

General hash table designs include the linear hash table, chained hash table, cuckoo hash table and hopscotch hash table [7]. Among these, the cuckoo hashing [15] technique can achieve good performance for lookup operations. Cuckoo hashing is an open-addressing scheme which uses a fixed number of hash tables with one hash function per table. On a collision, the key replaces the key already present in the slot. The slotless key is then hashed into a different table by the hash function of that table, and the process continues until all the keys have a slot. There have been efforts directed towards designing high-performance hash tables for multi-core systems. Lea's hash table from the Java Concurrency Package [11] is a closed-address lock-based hash table based on chaining. Hopscotch hashing [7] guarantees constant-time lookup operations. It is a lock-based open-address technique which combines linear probing with the cuckoo hashing technique. Initial work on lock-free hashing was done in [13], which used chaining. A lock-free version of cuckoo hashing was designed in [14]. The algorithm allows mutating operations to operate concurrently with query ones and requires only single-word compare-and-swap primitives. It uses a two-round query protocol enhanced with a logical clock technique to ensure correctness.

Pioneering work on parallel hashing on the GPU was done by Alcantara et al. [2]. They used cuckoo hashing on the GPU for faster lookup and update operations. Each thread handled a separate query and used GPU atomic operations to prevent race conditions while probing for hash table slots. This design is part of the CUDPP [3] library, which is a data-parallel library of common algorithms for the GPU. The work in [10] presented the Stadium Hashing (Stash) technique, which is a cuckoo hash design scalable to large hash tables. It removes the restriction of maintaining the hash table wholly in the limited GPU memory by storing container buckets in the host memory as well. It uses a compact data structure named the ticket-board, separate from the hash table buckets maintained in the GPU memory, which guides all the operations on the hash tables.

3 Proposed Heterogeneous Hash Tables

In this section, we give an overview of the basic design of our hash tables and their memory layout across both devices (Figure 1). The primary goal of our

hash tables is to support faster operations on recently accessed keys, similar to the working-set structure. Unlike previous works, the size and scalability of our hash tables are not restricted by the limited GPU memory. The GPU is used as a cache to store the most frequently accessed keys (hot data). These key-value pairs are processed in parallel by all the GPU threads. The search and update queries are bundled into batches of size B before processing. We intuitively expect that every key k with ω(k) ≤ cM, where M is the size of the GPU global memory and 0 < c ≤ 1 is some constant, is available in the GPU. The value of c depends on the key-value pair record size.

All the key-value pairs in our heterogeneous hash tables are partitioned between a list and a CPU-based hash table. The list is implemented using an array residing in unified memory. Unified memory is an abstracted form of memory that can be accessed by both devices without any explicit data transfers [1]. Support for unified memory is provided from CUDA 6.0 onwards. Internally this memory is first allocated on the GPU. When the CPU accesses an address in this memory, a GPU memory block containing the requested memory is transferred to the CPU by the underlying CUDA framework implicitly. The key-value pairs stored in the list are arranged from the most recently to the least recently accessed pair, in left to right order. The size of the list is M and it has three sections: an active middle area, which contains all the key-value pairs belonging to the list, and empty left and right areas of size B each. The query vector is first processed by the GPU and then by the CPU. After both devices have processed the query vector, it is copied to the left section of the list in unified memory. A reorganize operation now arranges the key-value pairs of the list in MRU order. This reorganization will be explained later in the paper. The MRU list may then contain more than the allowed number of key-value pairs. The overflow keys accumulate in the empty rightmost area of the list after the reorganization step. These overflow keys are the oldest in the GPU memory and will be accommodated in the CPU memory during successive operations on the hash tables. The rest of the key-value pairs, which are old enough and thus cannot be accommodated in the MRU list due to size constraints, are maintained in the CPU hash table. The keys in the CPU are not maintained in MRU order. The architecture of the CPU hash table is different for each of the designs and will be described in the later sections.

Each element in the query vector, called a query element, contains a key-value pair. The rightmost three bits in the key are reserved. The first two bits identify the operation being carried out with the key, i.e. a search, insert or delete. The last bit is set if the key-value pair is present in the GPU, and unset otherwise (Figure 1). The next three sections describe the hash table designs in detail.
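To make the query-element encoding concrete, the following C++ sketch shows one plausible way to pack the operation bits and the working-set bit into the low three bits of a 64-bit key. The operation codes (00 search, 01 insert, 11 delete) follow the paper; the exact bit positions and the key shift are our assumptions for illustration, since the paper only states that the rightmost three bits are reserved.

    #include <cstdint>

    // Low three bits of the key are reserved:
    //   bits 0-1: operation (00 = search, 01 = insert, 11 = delete)
    //   bit  2  : working-set bit (1 = key-value pair resides in the GPU MRU list)
    enum Op : uint64_t { SEARCH = 0x0, INSERT = 0x1, DELETE = 0x3 };

    constexpr uint64_t OP_MASK = 0x3;
    constexpr uint64_t WS_BIT  = 0x4;

    struct QueryElement {
        uint64_t key;    // user key shifted left by 3; low bits hold the flags
        uint64_t value;
    };

    inline QueryElement make_query(uint64_t user_key, Op op) {
        return { (user_key << 3) | static_cast<uint64_t>(op), 0 };
    }
    inline uint64_t user_key(const QueryElement& q) { return q.key >> 3; }
    inline Op       op_of(const QueryElement& q)    { return static_cast<Op>(q.key & OP_MASK); }
    inline bool     in_gpu(const QueryElement& q)   { return (q.key & WS_BIT) != 0; }
    inline void     mark_in_gpu(QueryElement& q, bool b) {
        q.key = b ? (q.key | WS_BIT) : (q.key & ~WS_BIT);
    }

Packing the flags into the key itself is what allows both devices to flip the working-set bit with simple bit-wise AND/OR operations on the shared query vector.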

3.1 A Spliced Hash Table

A spliced hash table (S-hash) is the simplest of our designs, where a standard GPU hash table from the CUDPP library is fused together with a serial CPU hash table from the C++ Boost [16] library, within the framework described in the previous section.

Figure 1: The left figure shows the structure of a query element. The figure on the right shows the overall structure of the hash tables. The value part in the query vector is omitted for simplification.

The GPU hash table (CUDPP hash) is separate from the MRU list and is maintained as a separate data structure in the GPU memory. All the keys that belong to the MRU list are also present in the CUDPP hash. The CUDPP hash processes search and update queries in batches. The CUDPP hash and the Boost hash communicate through the MRU list in unified memory. By communicate we mean that the overflow keys in the MRU list which also lie in the CUDPP hash are removed from the GPU memory and are added to the CPU Boost hash during successive operations. Recall that the MRU list contains M slots. To identify the position of a key in the MRU list, log M bits are stored along with the value part. These position bits link a key-value pair in the CUDPP hash to its location in a specific slot of the MRU list.

Operations: The operations are first bundled into a query vector and sent to the CUDPP hash for processing. The working-set bit is set for each insert(key) operation in a query element. The CUDPP hash cannot handle mixed operations; hence the query elements with search operations are separated from the delete operations before processing. Each GPU thread handles an individual query element. For a search query, if the key is found in the CUDPP hash, the position bits located in the value field are read, the working-set bit situated at the location of the key in the MRU list is set to 0, and the working-set bit corresponding to the key-value pair in the query element is set to 1. Delete queries are handled by first removing the key-value pair from the CUDPP hash and simultaneously from the MRU list by setting the working-set bit situated at the location of the key to 0. The working-set bit in the query element is also left unset for a delete operation. The search and delete queries that could not be serviced by the GPU are sent to the CPU. The CPU takes one query element at a time and executes the corresponding operation on the Boost hash. The setting of the working-set bit in the query element is done as before. The query vector is now copied to the leftmost section of the MRU list. This copying can be avoided if the query vector is placed in this section of the MRU list at the beginning. To prevent

duplicate keys in the hash tables, the query vector is scanned for repeated keys with the working-set bit set. If duplicates are found, the working-set bit is kept set for one key and left unset for the rest. The keys in the query vector whose working-set bit is set and which are not available in the CUDPP hash are added to the GPU in a separate batch. Now a Reorganize operation is executed on the list, which arranges the keys in MRU order (Figure 3).

Reorganize: This operation shrinks the MRU list by removing all the keys in the list whose working-set bits are set to 0. Figure 2 shows an instance of this operation. The MRU list with the associated left and right sections is shown, along with an example query vector with an insertion of the key 98 and searches for the other keys. The operation bits for the search, insert and delete operations are 00, 01 and 11 respectively. Notice these bits along with the key in the figure. Once the query vector is processed, it is added to the leftmost section of the MRU list. As 12 and 72 belong to the MRU list, the corresponding working-set bit in each query element is set and the working-set bit corresponding to the location of these keys inside the MRU list is unset. Now an exclusive prefix scan is carried out on the list. This prefix scan is carried out using the working-set bit values. The overflow area, where the working-set bits are set to X, is not included in the prefix scan. The keys are then packed using the indices returned by the scan operation. The index starts at B, where B is the size of the query vector. If a key overflows to the overflow section due to the addition of a new key to the MRU list (key 98 in Figure 2), it gets added by the CPU. This addition of the overflow keys by the CPU is done when the CUDPP hash starts processing the next batch of queries. At this point both devices are active simultaneously. Since this overflow section belongs to the MRU list in unified memory, the CPU can read these keys without any explicit data transfers. The prefix scan is carried out in-place using a scan operation from the Thrust high-performance GPU library [8].

Figure 2: The figure shows a reorganize operation on a simple MRU list with the overflow and the query sections. The value part is omitted for simplification.
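To illustrate the compaction step, the sequential C++ sketch below mimics what the Thrust-based pass computes: an exclusive prefix scan over the working-set bits yields, for every live key, its packed destination index (offset by B), while dead slots are dropped. It is a CPU-side illustration of the idea under the assumption that the list is large enough to hold all packed keys; it is not the in-place GPU implementation used in the paper, and overflow handling is omitted for brevity.

    #include <vector>
    #include <cstdint>

    struct Slot { uint64_t key; uint64_t value; int ws_bit; };  // ws_bit: 1 = keep, 0 = drop

    // Scan the query section plus the old active area (scanned_len slots) and pack
    // every kept key into consecutive slots starting at index B.
    void reorganize(std::vector<Slot>& list, std::size_t B, std::size_t scanned_len) {
        std::vector<std::size_t> dest(scanned_len, 0);
        for (std::size_t i = 1; i < scanned_len; ++i)            // exclusive prefix scan
            dest[i] = dest[i - 1] + list[i - 1].ws_bit;          // over the working-set bits

        std::vector<Slot> packed(list.size(), Slot{0, 0, 0});
        for (std::size_t i = 0; i < scanned_len; ++i)
            if (list[i].ws_bit)
                packed[B + dest[i]] = list[i];                   // destination index starts at B

        list.swap(packed);
    }

On the GPU, the scan is a single thrust call and the scatter is fully data-parallel, which is why the reorganization cost stays small relative to query processing.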

3.2 A Simplified S-Hash Table

The simplified S-hash table (SS-hash) eliminates the CUDPP hash and operates on the MRU list directly. The step to separate the queries based on operations is no longer necessary, as the GPU now handles mixed operations together in one batch. The algorithm for maintaining the MRU list in the GPU remains the same. The only difference is the replacement of the CUDPP hash with our MRU list processing logic, described below (Figure 3).

MRU List Processing: After the query vector is filled with query elements, a GPU kernel is launched. The thread configuration of the kernel is adjusted to launch W warps, where W equals the size of the active middle section of the MRU list. A warp is assigned a single MRU list element. Each block in the kernel loads a copy of the query vector into its shared memory. The i-th warp processes the (j * 32 + i)-th key in the MRU list, where j is the index of the block containing the warp and each block has a maximum capacity of 32 warps. The warp reads the assigned key in the list and all the threads in the warp linearly scan the query vector in shared memory for the key. If a thread in the warp finds a match, it first reads the operation bits to identify whether it is a search or a delete operation. For a successful search operation, a thread sets the working-set bit of the key in the query vector and unsets the corresponding bit in the MRU list. The thread uses the copy of the query vector in global memory for this step. This bit manipulation is done using the bit-wise OR/AND primitives. For a successful delete, the key-value pair along with the working-set bit in the MRU list is set to 0. The working-set bit in the query vector is left unset. The success of a concurrent search for the same key that is getting deleted is determined by whether the search read the key before the delete started modifying the bits in the key-value pair. Insert operations need not be processed, as they will be taken care of by the reorganize step described before. This is fairly straightforward code without any optimization. Listed below are some optimizations, some of which are intrinsic to our algorithm and some of which are incorporated with minor modifications.

- Memory bank conflicts: Bank conflicts within a block occur when a few threads within a warp read the same shared memory location. In our algorithm all the warp threads read adjacent shared memory locations, therefore preventing bank conflicts.

- Global memory coalescing: Coalescing happens when all the threads of a warp read successive global memory locations. Consequently, all the read requests are served by a single memory transaction. In our algorithm all the warp threads read a single global memory location, so the scope for coalesced reads is lost. As an optimization, before all the warps in a block start reading keys from the MRU list, warp 0 in each block reads in a set of contiguous keys from the MRU list and places them in a shared memory buffer. After this the warps, including warp 0, start executing and fetching the keys from this buffer instead of the MRU list.

- Warp serialization: If the threads from two different warps read the same shared memory location, the two warps are scheduled one after another on the respective SM. There is a high probability of this happening, as all the warps within a block scan the query vector linearly starting from the

Listed below are some optimizations which are intrinsic to our algorithm and some others which are incorporated with minor modifications.

Memory bank conflicts: Bank conflicts within a block occur when a few threads within a warp read the same shared memory location. In our algorithm all the warp threads read adjacent shared memory locations, therefore preventing bank conflicts.

Global memory coalescing: Coalescing happens when all the threads from a warp read successive global memory locations; consequently all the read requests are served by a single memory transaction. In our algorithm all the warp threads read a single global memory location, so the scope for coalesced reads is lost. As an optimization, before all the warps in a block start reading keys from the MRU list, warp 0 in each block reads in a set of contiguous keys from the MRU list and places them in a shared memory buffer. After this the warps, including warp 0, start executing and fetch the keys from this buffer instead of the MRU list.

Warp serialization: If the threads from two different warps read the same shared memory location, the two warps are scheduled one after another on the respective SM. There is a high probability of this happening, as all the warps within a block scan the query vector linearly starting from the beginning. To reduce this probability, each warp chooses a random location in the query vector to start the scan from and wraps around in case it overflows the size of the query vector.

Launch configuration: The number of warps, and thereby blocks, launched can be reduced if more work is assigned to a single warp. Instead of processing a single key from the MRU list, each warp can pick up a constant number of keys to look for inside the query vector.

Redundant work: There might be scenarios where all the query elements are serviced by a very small fraction of the warps, while the majority of the warps do redundant work of simply reading the query vector and the keys before expiring. To combat this issue, each warp, on successfully processing a query, decrements a global atomic counter initialized to B. A warp only starts its execution cycle if the value of this counter is greater than 0.

In this design, the Boost hash is replaced by a lock-free cuckoo hash from [14]. The overflow keys are now added in parallel by individual CPU threads to the CPU hash table.

Figure 3: The overall design of the S-hash and SS-hash tables.

3.3 A Cache Partitioned SS-hash Table

In this section, the focus is shifted to the CPU hash table design. The lock-free cuckoo hash is replaced by our own implementation in the SS-hash table. The developed hash table is optimized for multi-core caches using the technique of partitioning, hence we call it the CPSS hash table. The work in [12] designed a shared memory hash table for multi-core machines (CPHASH) by partitioning the table across the caches of cores and using message passing to transfer search/insert/delete operations to a partition. Each partition was handled by a separate thread on a core. We design our hash table along similar lines. The hash table processes queries in batches and operates in a rippling fashion during query processing.

Our hash table is implemented using a circular array. The array housing the hash table is partitioned into P different partitions, where P is the number of CPU threads launched. In each partition P[i], a fixed number of buffer slots, R[i], are reserved. The rest of the slots in each partition are used for hashing the keys. Within a partition the collisions are resolved using closed addressing. A mixed form of cuckoo hashing and linear probing is used. Each partition uses two hash functions, h1 and h2, each operating on half the slots reserved for hashing. Each partition is serviced by a thread and handles Q/P queries, where Q is the total number of queries batched for processing on the CPU.

Operations: The query element m in the batch is assigned to the partition k = H(m) % P, where H is a hash function and H ∈ {h1, h2}. The assignment is completed by writing the contents of the query element to a slot in R[k]. After this the threads execute a barrier operation and come out of the barrier only if there are no more entries in the buffer slots of each thread's partition. Each thread i reads a key from its buffer and hashes it to one of the hashing slots using h1. If the slot returned is already full, the thread searches for an empty slot in the next constant number of slots using linear probing. If this scheme fails, the thread replaces the last read key from its slot and inserts its key into this slot. The slotless key is hashed into the other half of the partition using h2, and the same process is repeated there. If the thread is unsuccessful in assigning a slot to the removed key, it simply replaces the key from the last read slot and inserts the just removed key in R[(i+1) % P]. The insertion into the buffer slots of an adjacent partition is done using lock-free techniques. All the insertions and deletions happen at the starting slot of these buffers using the atomic compare-and-swap primitive; this is the same mechanism used by a lock-free stack [6]. For search and delete operations, each thread probes a constant number of slots within its partition. Unsuccessful queries are added to the adjacent partition's buffer slots for the adjacent thread to process.

There is a major issue with concurrent cuckoo hashing in general: a search for a specific key might be in progress while that key is in movement due to insertions happening in parallel, and thus the search returns false for a key present in the hash table. Note that in our case the overall algorithm for the hash tables is designed in such a way that the CPU-side insertions always happen in a separate batch before the searches and deletes.
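Below is a minimal, hypothetical sketch of the two pieces just described, routing a query to its partition and the CAS-based push onto a partition's buffer slots in the style of a lock-free stack [6]; the node layout and names are assumptions, not the paper's code.

    #include <atomic>
    #include <cstdint>

    struct BufferNode {
        uint64_t    key;
        uint8_t     op_bits;     // 00 search, 01 insert, 11 delete
        BufferNode* next;
    };

    struct Partition {
        std::atomic<BufferNode*> buffer_head{nullptr};   // R[i]: incoming queries
        // ... hashing slots for h1 and h2 would follow here ...
    };

    // Route query 'm' to partition k = H(m) % P and push it onto R[k].
    void enqueue_query(Partition* parts, int P, BufferNode* m,
                       uint64_t (*H)(uint64_t))
    {
        int k = static_cast<int>(H(m->key) % P);
        BufferNode* old_head = parts[k].buffer_head.load(std::memory_order_relaxed);
        do {
            m->next = old_head;   // compare_exchange_weak refreshes old_head on failure
        } while (!parts[k].buffer_head.compare_exchange_weak(
                     old_head, m, std::memory_order_release,
                     std::memory_order_relaxed));
    }

The same push is what a thread would use to hand a displaced or unsuccessful key to the adjacent partition's buffer R[(i+1) % P].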

4 Performance Evaluation

This section compares the performance of our heterogeneous hash tables to the most effective prior hash tables in both sequential and concurrent (multi-core) environments. The metric used for performance comparison is query throughput, measured in Millions of Queries Processed per Second (MQPS). The experimental setup consists of an NVIDIA Tesla K20 GPU and an Intel Xeon E CPU. The CPU and GPU are connected through a PCIe bus with 8 GB/s peak bandwidth. The GPU has 5 GB of global memory with 14 SMs and 192 cores per SM, and runs CUDA 7.5. The host is an 8-core CPU running at 3.2 GHz with 32 GB RAM. The CPU code is implemented using the C++11 standard. All the results are averaged over 100 runs. Our hash tables are compared against a concurrent lock-free cuckoo hash implementation (LF-Cuckoo) from [14] and a serial hash table from the Boost library. For completeness we also compared the results with Lea's concurrent locked (LL) hash table.

Figure 4: The query throughput of the heterogeneous hash tables on different key access distributions and query mixes.

4.1 Query performance

We use micro-benchmarks similar to the works in [7],[14]. Each experiment uses the same data set of 64-bit key-value pairs for all the hash tables. The results were collected by setting the hash tables' densities (load factor) close to 90% (0.9). Figure 4 compares the query throughput of the hash tables on 10M queries. All the hash tables are filled with 64M key-value pairs initially. The results are shown for two types of query mixes, one with a higher percentage of search queries and the other with more update operations. Two types of key access patterns are simulated for the search and delete queries. A Uniform distribution generates the queries at random from the data set, while a Gaussian distribution generates queries where a fraction of the keys are queried more than the others. The standard deviation of the distribution is set such that 20% of the keys have higher access frequency. As each warp in the GPU processes a single MRU list element, it is treated as a single GPU thread in the plots, i.e., a value of x for the number of GPU threads actually corresponds to 32x hardware threads. The number of {GPU, CPU} threads is varied linearly from {32, 4} to {1024, 128}. The size of the MRU list is fixed at 1M key-value pairs and each query vector has 8K entries. The Boost hash, being a serial structure, always operates with a single CPU thread.

Figure 5: The cache misses/query comparison for the hash tables.

As can be seen in Figure 4, for search-dominated uniform key access patterns the heterogeneous hash tables outperform the Boost hash and Lea's hash, and the throughput scales with the increasing number of threads, although our hash tables have lower query throughput than the lock-free cuckoo hash. For the insert-dominated case, the heterogeneous hash tables outperform all the other hash tables. The reason is the simplified insert operation, where the simple reorganize operation inserts the keys into the MRU list and thereby into the hash tables. The CPU only handles the overflow inserts, which have a low probability of occurrence. For the Gaussian distribution case, our hash tables outperformed the others by a significant margin. They can process 10 times more queries compared to the Boost hash and 5 times more compared to the lock-free cuckoo hash. The frequently accessed keys are always processed on the GPU. The CPU only processes the unprocessed queries in the query vector and the overflow keys. The probability of the CPU doing work is low, as most of the queries are satisfied by the GPU without generating any overflow.

Figure 5 shows the cache misses per query for the Uniform and the Gaussian distribution cases. The CPSS hash has fewer cache misses compared to the Boost hash and the lock-free cuckoo hash. As the GPU has a primitive cache hierarchy and most of the memory optimizations have already been taken care of, only the CPU cache misses are reported. In the Gaussian distribution case the CPSS hash performs much better compared to the other case, as most of the queries are resolved by the GPU itself and the CPU has less work to do.

4.2 Structural Analysis

The experiments in this section are carried out on the CPSS hash to find out the reasons for the speed-up that was reported earlier. In Figure 6 the number of queries was varied with {1024, 128} threads in total. The other parameters are the same as before. As can be seen, for the uniform scenarios half the time is spent on memory copies. These cover both DeviceToHost and HostToDevice implicit memory transfers. These memory transfers were captured with the help of the CUDA profiler. In the Gaussian distribution case the GPU performs most of the work with minimum memory transfer overhead, and hence the expected speed-up is achieved.

Figure 6: The top two graphs show the time split of the CPU, GPU and memory transfers for processing different numbers of queries under different key access distributions. The bottom graphs show the variance of the query throughput with the size of the MRU list and the query vector respectively.

As can be seen in Figure 6, the maximum throughput is achieved when our hash tables are configured with an MRU list and query vector size of 1M and 8K respectively. With increasing size of the MRU list and the query vector, the time spent by the GPU in the Reorganize operation and the time for the DeviceToHost memory transfers increase. This is the reason for the diminishing query throughput at higher values of these parameters.

5 Conclusion

In this work, we proposed a set of heterogeneous working-set hash tables whose layout spans GPU and CPU memories, where the GPU handles the most frequently accessed keys. The hash tables operate without any explicit data transfers between the devices. This concept can be extended to any set of interconnected devices with varying computational powers, where the most frequently accessed keys lie on the fastest device and so on. For non-uniform key access distributions, our hash tables outperformed all the others in query throughput. In our future work, we plan to investigate the challenges involved in using multiple accelerators, including GPUs and FPGAs. We envisage that maintaining a global MRU list spanning all the devices could be computationally expensive, so suitable approximations that give the right trade-off have to be made.

References

1. NVIDIA CUDA Compute Unified Device Architecture - Programming Guide.
2. D. A. F. Alcantara. Efficient Hash Tables on the GPU. PhD thesis, Davis, CA, USA, AAI.
3. Y. Allusse, P. Horain, A. Agarwal, and C. Saipriyadarshan. GpuCV: A GPU-accelerated framework for image processing and computer vision. In Advances in Visual Computing, volume 5359 of Lecture Notes in Computer Science. Springer, Dec.
4. M. Bădoiu, R. Cole, E. D. Demaine, and J. Iacono. A unified access bound on comparison-based dynamic dictionaries. Theor. Comput. Sci., 382(2):86-96, Aug.
5. M. Daga and M. Nutter. Exploiting coarse-grained parallelism in B+ tree searches on an APU. In Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC '12, Washington, DC, USA. IEEE Computer Society.
6. D. Hendler, N. Shavit, and L. Yerushalmi. A scalable lock-free stack algorithm. In Proceedings of the Sixteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA '04, New York, NY, USA. ACM.
7. M. Herlihy, N. Shavit, and M. Tzafrir. Hopscotch hashing. In 22nd Intl. Symp. on Distributed Computing.
8. J. Hoberock and N. Bell. Thrust: A parallel template library.
9. M. Kelly and A. Breslow. Quad-tree construction on the GPU: A hybrid CPU-GPU approach. Retrieved June 13.
10. F. Khorasani, M. E. Belviranli, R. Gupta, and L. N. Bhuyan. Stadium hashing: Scalable and flexible hashing on GPUs.
11. D. Lea. Hash table util.concurrent.ConcurrentHashMap, revision 1.3, in JSR-166, the proposed Java Concurrency Package.
12. Z. Metreveli, N. Zeldovich, and M. F. Kaashoek. CPHash: A cache-partitioned hash table. SIGPLAN Not., 47(8), Feb.
13. M. M. Michael. High performance dynamic lock-free hash tables and list-based sets. In Proceedings of the Fourteenth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '02, pages 73-82, New York, NY, USA. ACM.
14. N. Nguyen and P. Tsigas. Lock-free cuckoo hashing. In Distributed Computing Systems (ICDCS), 2014 IEEE 34th International Conference on. IEEE.
15. R. Pagh and F. F. Rodler. Cuckoo hashing. J. Algorithms, 51(2), May.
16. B. Schling. The Boost C++ Libraries. XML Press.
17. D. D. Sleator and R. E. Tarjan. Self-adjusting binary search trees. J. ACM, 32(3), July 1985.

A Safe and Tight Estimation of the Worst-Case Execution Time of Dynamically Scheduled Parallel Application

Petros Voudouris, Per Stenström, Risat Pathan
Chalmers University of Technology, Sweden
{petrosv, per.stenstrom,

Abstract. Estimating a safe and tight upper bound on the Worst-Case Execution Time (WCET) of a parallel program is a major challenge for the design of real-time systems. This paper proposes, for the first time, a framework to estimate the WCET of dynamically scheduled parallel applications. Assuming that the WCET can be safely estimated for a sequential task on a multicore system, we model a parallel application using a directed acyclic graph (DAG). The execution time of the entire application is computed using a breadth-first scheduler that simulates non-preemptive execution of the nodes of the DAG (called the BFS scheduler). Experiments using the Fibonacci application from the Barcelona OpenMP Task Suite (BOTS) show that timing anomalies are a major obstacle to safely estimating the WCET of parallel applications. To avoid such anomalies, the estimated execution time of an application computed under the simulation of the BFS scheduler is multiplied by a constant factor to derive a safe bound on the WCET. Finally, a new anomaly-free, priority-based scheduling policy (called the Lazy-BFS scheduler) is proposed to safely estimate the WCET. Experimental results show that the bound on the WCET computed using Lazy-BFS is not only safe but also 30% tighter than that computed for BFS.

Keywords: parallel program; timing anomaly; time predictability

1 Introduction

There is an increasing demand for more advanced functions in today's prevailing embedded real-time systems, such as automotive and avionics. In addition to the embedded domains, timeliness is also important in high-performance server applications, for example, to guarantee a bounded response time in the light of an increasing number of clients and their processing requests. The need to satisfy such increasing computing demands both in the embedded and in the high-performance domains requires more powerful processing platforms. Contemporary multicore processors provide such computing power. The main challenge in ensuring timing predictability and maximizing throughput is to come up with techniques to exploit parallel multicore architectures. Although sequential programming has been the primary paradigm to implement tasks of hard real-time applications [2], such a paradigm limits the extent to which a parallel multicore architecture can be exploited (according to Amdahl's law [3]).

On the other hand, the HPC community has developed several parallel programming models for task parallelism (e.g., Cilk [4]) and for data parallelism (e.g., OpenMP loops [5]). Under parallel programming models, a classical sequential program is implemented as a collection of parallel tasks that can execute in parallel on different cores. The quest for more performance has recently attracted parallel programming models for the design of real-time applications [6, 7]. While the HPC domain is mainly concerned with average throughput, the design of real-time systems is primarily concerned with the worst-case behavior. The blending of high-performance and real-time computing poses a new challenge: how can the worst-case timing behavior of a parallel application be analyzed?

Scheduling algorithms play one of the most important roles in determining whether the timing constraints of an application are met or not. The timing analysis of real-time scheduling algorithms often assumes that the worst-case execution time (WCET) of each application task is known [1]. If the WCET of a task is not estimated safely, then the outcome of the schedulability analysis may be erroneous, which could result in catastrophic consequences for hard real-time applications. The estimation of the WCET also needs to be tight in order to avoid over-provisioning of computing resources.

The tasks of a parallel application are scheduled either statically or dynamically. Static scheduling binds (offline) each task to a particular core, while dynamic scheduling allows a task to execute on any core. Recently, there have been several works on WCET estimation for statically scheduled parallel applications under simplistic assumptions, for example that the number of tasks is smaller than the number of available cores or that the maximum number of tasks is two [8, 9, 10]. Such limitations may constrain the programmer from exploiting higher inter-task parallelism and, hence, limit performance. In addition, a task assigned to a core that is already highly loaded may need to await execution while other cores of the platform may be idle, hence contributing to load imbalance. While the approaches proposed in [8, 9, 10] are inspiring, the limitations of static scheduling motivate us to investigate the problem of estimating the WCET of dynamically scheduled parallel applications on multicores. To the best of our knowledge, this paper proposes, for the first time, a framework for estimating the WCET of a dynamically scheduled parallel application on multicores. The proposed framework is applied to the Fibonacci application from BOTS [11].

This paper makes the following contributions. First, a methodology to model a parallel application is proposed. The model captures information regarding what code units can execute in parallel and what must execute sequentially, so as to establish the WCET of parallel applications. Second, we identify timing anomalies triggered by dynamic scheduling of tasks. Third, we contribute new scheduling policies under which timing anomalies can be avoided. Finally, experimental results are presented to show the tightness with which the WCET can be estimated using the proposed scheduling algorithms, using Fibonacci from BOTS.

The rest of this paper is organized as follows: Section 2 presents our assumed system model. From this, we identify in Section 3 timing anomalies challenging WCET estimation. A systematic methodology to model parallel applications and the design of a runtime simulator are presented in Section 4.
Section 5 then presents our proposed scheduling algorithms (BFS, Lazy-BFS) for the run-time system. Our experimental results are presented in Section 6. Related work is presented in Section 7 before concluding in Section 8.

2 Timing Anomalies

The estimated WCET of an individual task is an upper bound on the WCET, meaning it is safe. A task during runtime may take less than its estimated WCET. The overall execution time of a dynamically scheduled parallel application may increase when some tasks take less than their WCETs, which is known as an execution-time-based timing anomaly [20]. An example of such an anomaly is demonstrated in Figure 1. The C_u value beside each node in Figure 1 is the WCET of the corresponding task. The DAG is executed based on non-preemptive BFS on M=2 cores. For example, consider the DAG and the schedule on the left-hand side of Figure 1. The execution time of the application is 9. Consider the case when node B does not execute for 3 time units but finishes after 1 unit of execution, while all other nodes take their WCETs. The DAG and the schedule in such a case are shown on the right-hand side of Figure 1. The execution time of the application is 10. In other words, the overall execution time of the application is increased when node B takes less than its WCET. This example demonstrates an execution-time-based timing anomaly.

Spawn-Based Timing Anomaly. The execution-time-based timing anomaly made us curious to find scenarios that can result in other types of timing anomalies. In this process, we find a new type of timing anomaly that we call a spawn-based timing anomaly. In parallel programming, a parallel task may be generated based on conditional statements, for example, depending on the specific value of some variable.

Figure 1. Two copies of a DAG over nodes A-G (C_A=1, C_C=2, C_D=2, C_E=2, C_F=5, C_G=1; C_B=3 on the left, C_B=1 on the right) and their two-core (P0, P1) schedules. The DAG on the left, when executed on two cores, has an execution time of 9 (schedule on the left). If node B takes 1 time unit (the DAG on the right-hand side), the execution time is 10 on two cores. BFS is used in both cases.

A node that is generated based on some conditional statement is called a conditional node, which may not always be present in the DAG if, for example, the values of the input change. A spawn-based timing anomaly occurs if relatively fewer nodes are generated.

Consider the following DAGs in Figure 2, where node C is a conditional node.

Figure 2. Two versions of a DAG with a conditional node C (node WCETs C_A=1, C_B=2, C_C=2, C_D=1, C_E=5, C_F=2, C_G=2, C_H=1) and their two-core (P0, P1) schedules. When node C is generated, the schedule length is 9. When node C is not generated, the schedule length is 10. BFS is used in both cases.

The schedules in Figure 2 show that the execution time of the application is larger when fewer nodes are generated (i.e., when node C is not generated). We are not aware of any work where such an anomaly has already been identified. Timing anomalies occur only if the execution time of a DAG is computed based on a total ordering of the nodes' execution that is different from the ordering during the actual execution. This paper proposes a framework to compute a safe estimation of the WCET of parallel applications while mitigating the effect of timing anomalies.

3 Proposed Scheduler

Any scheduling algorithm can be plugged into the ExeSIM module. Well-known scheduling strategies are the breadth-first scheduler [12], the work-first scheduler with work stealing [13], etc. We consider two different scheduling algorithms for ExeSIM: BFS and Lazy-BFS.

3.1 Breadth-First Scheduler (BFS)

We have implemented the non-preemptive BFS in ExeSIM. This scheduler dispatches tasks from the ready queue in breadth-first order to the idle cores for execution. Each task executes until completion without any preemption. The output of ExeSIM using the BFS scheduler is an estimation of the WCET of the DAG of an application where each node of the DAG takes its WCET. We denote this estimation by EXE_BFS. As discussed in Section 2, such an estimation may not be an upper bound on the WCET of the application due to timing anomalies, i.e., the actual execution time may be larger than EXE_BFS during runtime when some tasks take less than their WCETs.

A safe estimation of the WCET of an application executed on an M-core platform under BFS is given (according to Theorem 3 in [20]) as follows:

    WCET_BFS = (2 - 1/M) × EXE_BFS    (1)

The value of WCET_BFS is a safe bound on the WCET of an application scheduled under BFS. The multiplicative factor (2 - 1/M) in Eq. (1) may result in too much pessimism in case ExeSIM is close (tight) in its estimation of the WCET. In order to derive a tighter estimation of the WCET, we propose a priority-based scheduler, called Lazy-BFS.

3.2 Lazy-BFS: Priority-Based Non-Preemptive Scheduler

In Lazy-BFS, each node has a priority and nodes are stored in the ready queue in non-increasing priority order. We now present the priority assignment policy for Lazy-BFS and then the details of its scheduling policy.

Priority assignment policy. The priority of a node is denoted as a pair (L, p), where L is the level of the node in the DAG and p is the level-priority value at level L. The first node is assigned level 1 and level-priority 1, i.e., (L, p) = (1, 1). The next nodes are given priorities based on the priority of the node that generates them. The new nodes are given different priorities than that of the parent node. Let the total number of new nodes generated from a parent node with priority (L, p) be D (the degree). These D nodes are ordered in BFS order (i.e., the order in which they are created). Each of the new nodes generated from a parent node with priority (L, p) is assigned level (L+1). These D ordered nodes with level (L+1) are respectively assigned level-priorities p_i = (p_par - 1) × D + i, where p_i is the level-priority of the i-th child, p_par is the level-priority of the parent, D the degree, and i the position of the child. Figure 3 presents an example of priority assignment. It can be seen that nodes at different levels are assigned different level values, and the level-priority is assigned based on the previous equation. Nodes that are eligible to execute in parallel will not have a tie in both their level and level-priority pair.

Figure 3: Example of priority assignment. The root A has priority (1, 1); its children B and C have priorities (2, 1) and (2, 2); the nodes D and E at the next level have priorities (3, 3) and (3, 4).

We assume that a smaller value implies higher priority. The priorities of two nodes A and B are compared as follows. First, the levels of A and B are compared. If node A has a smaller level than that of node B, then A has higher priority. If the levels of A and B are equal, then the node with the smaller level-priority value has higher priority.
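A small sketch of the priority machinery described above is shown below; the child rule p_i = (p_par - 1) × D + i is an inference from the Figure 3 example rather than a formula stated explicitly in the extracted text, and all names are illustrative.

    #include <utility>
    #include <vector>

    using Priority = std::pair<int, int>;   // (level L, level-priority p); smaller is higher

    // Priorities of the D children of a parent with priority (L, p),
    // in the BFS order in which they are created.
    std::vector<Priority> child_priorities(Priority parent, int D) {
        std::vector<Priority> out;
        for (int i = 1; i <= D; ++i)
            out.push_back({parent.first + 1, (parent.second - 1) * D + i});
        return out;
    }

    // Compare levels first; on a tie, compare level-priorities.
    bool higher_priority(Priority a, Priority b) {
        return a < b;   // std::pair compares lexicographically
    }

For instance, child_priorities({2, 2}, 2) yields (3, 3) and (3, 4), matching Figure 3.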

Scheduling policy. Lazy-BFS executes tasks based on their priority in a non-preemptive fashion. In Lazy-BFS, a task is allowed to start its execution if each of its higher priority tasks has already been dispatched for execution. Note that if a relatively higher priority task is not generated yet, a relatively lower priority task, which may be in the ready queue, cannot start its execution even if some core is idle. This ensures that tasks are executed strictly in their decreasing priority order. The policy is non-greedy (lazy) in the sense that a ready task may not be executed even if a core is idle.

We may have a situation where some higher priority task would not be created (e.g., due to a conditional spawn) while a relatively lower priority task waits in the ready queue. This may create a deadlock situation. We avoid deadlock as follows: if all cores become idle, then the highest priority task from the ready queue is dispatched for execution even if some of its (non-existent) higher-priority tasks have not yet been dispatched for execution.

Whenever a new task starts execution, the priority of that task is stored in a variable (L_lowest, p_lowest) in the runtime system. If multiple tasks are ready to execute in the ready queue, Lazy-BFS starts executing the highest-priority task with priority (L, p) non-preemptively on an idle core if one of the two following conditions is satisfied: (C1) If at least one core is busy when (L_lowest, p_lowest) = (L, p-1), then each of the tasks having a priority higher than (L, p) has either finished execution or is currently in execution. In such a case, the highest-priority ready task with priority (L, p) is allocated to the idle core for execution. We also set (L_lowest, p_lowest) = (L, p) to specify that the lowest-priority task that has already been given a core has priority (L, p). (C2) If all the cores become idle, then the highest-priority ready task with priority (L, p) is allocated to the idle core for execution. We also set (L_lowest, p_lowest) = (L, p).

4 Analysis Framework

We consider a multicore platform with M identical cores such that each core has a (normalized) speed of 1. We consider a time-predictable multicore architecture [15, 23] such that an upper bound on the time to access any shared resource, for example, memory controllers [16], caches [17, 18], or the interconnection network [19], is known. We focus on parallel applications assuming a task-based dataflow parallel programming model (e.g., Cilk [4], OpenMP [5]).

A parallel application is modeled as a directed acyclic graph (DAG) denoted by G = (V, E), where V (the set of nodes) is the set of tasks and E (the set of edges) is the set of dependencies between tasks. If there is an edge from node u_i ∈ V to node u_k ∈ V, then the edge specifies that execution of task u_k can start only after execution of task u_i completes. We assume that the WCET of each node of the DAG is known (please see [10], where such an approach is proposed). The WCET of a node u ∈ V is denoted as C_u. The WCET of each node includes any synchronization delay due to critical sections (please see [21], which proposes time-predictable synchronization primitives). The overheads related to scheduling decisions and management of tasks in the ready queue have been incorporated in the WCET of the corresponding task.

In addition to the occurrence of timing anomalies, another major challenge in analyzing a dynamically scheduled parallel program is the many possible execution interleavings of different parallel nodes. We present a methodology to model the structure of a parallel application to capture information about such interleavings as a directed acyclic graph (DAG).

The structure (i.e., nodes and edges) of the DAG of a parallel application depends on the input parameters. The main challenge is to determine the DAG that will have the longest execution time, called the worst-case DAG, for some given key input. Such key input parameters are also used in computing the WCET of a sequential program (e.g., loop bounds, number of array elements, etc.) [14]. Modeling an application as a DAG from a time-predictability perspective is the first building block, called the GenDAG module, of our proposed framework. We use three types of nodes to model the application's different parts.

Spawn node: It models the #pragma omp task directive and generates new nodes. It has a set of nodes connected in series; each node models the execution time that is required to generate a task. When the #pragma omp task directive is included in a loop or before a recursive call, multiple nodes are created. So, for example, a loop from 0 to 3 will generate 4 nodes connected in series.

Basic node: It models the execution time of a sequentially executed piece of code.

Synchronization node: It models the time that is required to identify that all the related nodes are synchronized.

As an example, Figure 4 presents the generation of the graph for Fibonacci with input 3. The code of Fibonacci is presented below.

    int fib(int n) {
      int x, y;
      if (n < 2) return n;
      #pragma omp task shared(x)
      x = fib(n-1);
      #pragma omp task shared(y)
      y = fib(n-2);
      #pragma omp taskwait
      return x + y;
    }

❶ Initially only the spawn node for Fib(3) is ready for execution and the first node of the spawn node is executed. ❷ Next Fib(2), which is also a spawn node, is generated. Since Fib(3) is a spawn node, the corresponding synchronization node (S3) is also generated. ❸ At the next step, the second node in Fib(3) and the first node of Fib(2) are executed in parallel. Consequently, Fib(1) and the corresponding synchronization node (S2) are generated from Fib(2). S2 now points to the synchronization node that its parent was pointing to (S3). At ❹ the second node from Fib(2), the Fib(1) from Fib(2) and the Fib(1) from Fib(3) are executed in parallel, and similarly Fib(0) is generated. Since the two Fib(1) nodes were executed, their dependencies are released. At ❺ Fib(0) is executed and S2 becomes ready since all its dependencies have been released. ❻ When S2 finishes, S3 can start its execution since all its dependencies have been released.
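The following is a minimal, hypothetical sketch of how the three node types and their dependency bookkeeping could be represented; the field names are assumptions, since the paper only fixes the node kinds, their WCETs, and the dependency edges.

    #include <vector>

    enum class NodeKind { Spawn, Basic, Sync };

    struct DagNode {
        NodeKind kind;
        long     wcet;                      // e.g. 300 (spawn), 400 (basic), 100 (sync), as in Section 5
        int      pending_deps = 0;          // dependencies not yet released
        std::vector<DagNode*> successors;   // edges released when this node finishes
    };

    // Releasing a dependency makes a successor ready once its count hits zero,
    // mirroring the "release dependency" steps in the Figure 4 walk-through.
    inline bool release_dependency(DagNode& n) {
        return --n.pending_deps == 0;
    }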

Figure 4: Example of graph generation for Fibonacci with input 3, shown in seven steps (❶-❼). The legend distinguishes spawn, basic and sync nodes, dependencies, generated and finished nodes, and released dependencies.

The second building block of our proposed framework is the DAG execution simulator, called the ExeSIM module. The purpose of this module is to simulate the execution of the worst-case DAG to find the WCET of parallel applications under some scheduling policy. Throughput-oriented run-time systems have various sources of time unpredictability, for example, random work stealing. We implement the ExeSIM module from scratch to avoid such sources of timing unpredictability. ExeSIM is an event-based simulator that mimics the execution of the tasks of a parallel application. The input to ExeSIM is the root node of the worst-case DAG of an application and the output is the execution time of the entire application. Figure 5 presents an abstract view of the simulator. The GenDAG module inserts new ready nodes into the ready queue. Based on the scheduling policy and the available processors, the appropriate nodes are selected for execution. The results are fed back to GenDAG to progress the execution of the application.

Figure 5: Abstract view of the ExeSIM simulator. GenDAG feeds new ready nodes (with their WCETs) into the ready queue; the scheduling policy chooses ready nodes to execute on the available processors, and the execution results are fed back to GenDAG to progress the application.
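To make this loop concrete, the following is a minimal, hypothetical sketch of an event-driven simulation in the spirit of ExeSIM; it assumes the worst-case DAG has already been generated rather than grown on the fly by GenDAG, and the node layout and scheduler hook are illustrative assumptions.

    #include <algorithm>
    #include <deque>
    #include <vector>

    struct SimNode { long wcet; int pending_deps; std::vector<SimNode*> succ; };
    struct Running { SimNode* node; long finish_time; };

    // Simulate non-preemptive execution of the DAG rooted at 'root' on M cores
    // and return the makespan. 'pick' chooses which ready node to dispatch next.
    long simulate(SimNode* root, int M, SimNode* (*pick)(std::deque<SimNode*>&)) {
        std::deque<SimNode*> ready{root};
        std::vector<Running> running;
        long now = 0;
        while (!ready.empty() || !running.empty()) {
            // Dispatch ready nodes onto idle cores, non-preemptively.
            while (!ready.empty() && (int)running.size() < M) {
                SimNode* n = pick(ready);
                running.push_back({n, now + n->wcet});
            }
            // Advance time to the earliest completion and retire that node.
            auto next = std::min_element(running.begin(), running.end(),
                [](const Running& a, const Running& b) {
                    return a.finish_time < b.finish_time; });
            now = next->finish_time;
            for (SimNode* s : next->node->succ)
                if (--s->pending_deps == 0) ready.push_back(s);
            running.erase(next);
        }
        return now;
    }

A BFS-style pick would simply pop the front of the ready queue, while a Lazy-BFS pick would additionally check the dispatch conditions (C1)/(C2) before handing out a node.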

5 Experiments

The code of Fibonacci from BOTS is analyzed to generate its worst-case DAGs based on the GenDAG module. Recall that we assume that the WCET of each individual node is known. The WCET of each spawn, sync, and basic node is assumed to be 300, 100 and 400 time units, respectively. We computed the WCET of each application using ExeSIM considering variation in input size (denoted by n), number of available cores (denoted by M), and scheduling policy (BFS or Lazy-BFS). Since ExeSIM is implemented as a sequential program, it is currently capable of handling small inputs. It is expected that the experimental results for larger inputs and other applications from BOTS will follow trends similar to those presented here.

In an execution-time-based timing anomaly, some nodes of the DAG take less than their WCETs. To capture such behavior, we consider two additional parameters, pn and pw, defined as follows. Parameter pn ranges in [0, 1] and represents the percentage of nodes of a DAG for which the actual execution time is less than their WCETs. A node that takes a smaller execution time than its WCET is called an anomaly-critical node. Parameter pw captures the actual execution time of an anomaly-critical node as a percentage of its WCET. These two new parameters pn and pw are used as follows. When a new node is generated by the GenDAG module, a random number in the range [0, 1] is generated using some built-in function rand(). If rand() is larger than pn, then the new node's actual execution time is set to its WCET; otherwise, the new node's actual execution time is set to pw times its WCET. For example, assume that the WCET of a new node is 20. Let pn = 0.3 and pw = 90% for some experiment. If rand() generates 0.75, then the new node executes for 20 time units, which is equal to its WCET, because 0.75 > pn = 0.3. If rand() is 0.15, then the node's execution time is set to (pw × 20) = (90% × 20) = 18 time units. The ExeSIM simulator determines the execution time of the entire application based on the actual execution time of each node. If the computed execution time of the application is larger than the estimated WCET for a specific scheduling policy, then a timing anomaly is detected for that scheduling policy.

Each experiment is characterized by four parameters (n, M, pn, pw). We considered 20 different values of pn ∈ {0.05, 0.1, ..., 1.0} and pw = 98%. For some given values of n and M, we compute the execution times for BFS. At each value of pn for some given n and M, the percentage of cases where the computed execution time is larger than the computed WCET, i.e., where an anomaly is detected, is recorded; this percentage of executions is the percentage of timing anomalies. The results are presented in Figure 6 for Fibonacci with input n = 10, 11, 12, and 13, considering M = 4, 8 and 16 cores. The x-axis in the graphs of Figure 6 represents the percentage of anomaly-critical nodes (pn) and the y-axis represents the percentage of timing anomalies under BFS. It is evident that for all input parameters and numbers of cores, BFS suffers from timing anomalies. In summary, timing anomalies are frequent and we need mechanisms to mitigate them. A safe bound using BFS is given in Eq. (1). The estimation of the WCET under Lazy-BFS is safe by construction since timing anomalies cannot occur in Lazy-BFS.
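Returning to the anomaly-injection parameters, a minimal sketch of the (pn, pw) draw described above is shown below, assuming a uniform rand() in [0, 1]; the function name and types are illustrative.

    #include <cstdlib>

    // Returns the simulated actual execution time of a newly generated node.
    long actual_execution_time(long wcet, double pn, double pw) {
        double r = static_cast<double>(std::rand()) / RAND_MAX;  // uniform in [0, 1]
        return (r > pn) ? wcet                                   // node takes its full WCET
                        : static_cast<long>(pw * wcet);          // anomaly-critical node
    }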

By analogy with EXE_BFS, the estimation under Lazy-BFS is denoted as EXE_Lazy-BFS; it is compared with the safe BFS bound WCET_BFS of Eq. (1). In Figure 7 we compare the two estimations for Fibonacci. The x-axis is the input, clustered by the number of processors. The vertical axis shows the WCET estimation. From the results it can be seen that in all cases the WCET estimation with Lazy-BFS is smaller than that with BFS. In addition, increasing the input size increases the WCET estimation, and similarly, increasing the number of processors decreases the WCET estimation. The WCET estimation using Lazy-BFS is around 30% tighter than that using BFS.

Figure 6: Percentage of anomalies for Fibonacci using the BFS schedule.

Figure 7: WCET estimation for Fibonacci, comparing the Lazy-BFS estimation and the safe BFS bound. The graph shows configurations for inputs 10, 11, 12 and 13, clustered for 4, 8 and 16 cores.


More information

GPUs and GPGPUs. Greg Blanton John T. Lubia

GPUs and GPGPUs. Greg Blanton John T. Lubia GPUs and GPGPUs Greg Blanton John T. Lubia PROCESSOR ARCHITECTURAL ROADMAP Design CPU Optimized for sequential performance ILP increasingly difficult to extract from instruction stream Control hardware

More information

AN OCM BASED SHARED MEMORY CONTROLLER FOR VIRTEX 4. Bas Breijer, Filipa Duarte, and Stephan Wong

AN OCM BASED SHARED MEMORY CONTROLLER FOR VIRTEX 4. Bas Breijer, Filipa Duarte, and Stephan Wong AN OCM BASED SHARED MEMORY CONTROLLER FOR VIRTEX 4 Bas Breijer, Filipa Duarte, and Stephan Wong Computer Engineering, EEMCS Delft University of Technology Mekelweg 4, 2826CD, Delft, The Netherlands email:

More information

Reconfigurable Multicore Server Processors for Low Power Operation

Reconfigurable Multicore Server Processors for Low Power Operation Reconfigurable Multicore Server Processors for Low Power Operation Ronald G. Dreslinski, David Fick, David Blaauw, Dennis Sylvester, Trevor Mudge University of Michigan, Advanced Computer Architecture

More information

OPERA. Low Power Heterogeneous Architecture for the Next Generation of Smart Infrastructure and Platforms in Industrial and Societal Applications

OPERA. Low Power Heterogeneous Architecture for the Next Generation of Smart Infrastructure and Platforms in Industrial and Societal Applications OPERA Low Power Heterogeneous Architecture for the Next Generation of Smart Infrastructure and Platforms in Industrial and Societal Applications Co-funded by the Horizon 2020 Framework Programme of the

More information

INTERNATIONAL JOURNAL OF PROFESSIONAL ENGINEERING STUDIES Volume VII /Issue 2 / OCT 2016

INTERNATIONAL JOURNAL OF PROFESSIONAL ENGINEERING STUDIES Volume VII /Issue 2 / OCT 2016 NEW VLSI ARCHITECTURE FOR EXPLOITING CARRY- SAVE ARITHMETIC USING VERILOG HDL B.Anusha 1 Ch.Ramesh 2 shivajeehul@gmail.com 1 chintala12271@rediffmail.com 2 1 PG Scholar, Dept of ECE, Ganapathy Engineering

More information

Fast Flexible FPGA-Tuned Networks-on-Chip

Fast Flexible FPGA-Tuned Networks-on-Chip This work was funded by NSF. We thank Xilinx for their FPGA and tool donations. We thank Bluespec for their tool donations. Fast Flexible FPGA-Tuned Networks-on-Chip Michael K. Papamichael, James C. Hoe

More information

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Kenneth B. Kent University of New Brunswick Faculty of Computer Science Fredericton, New Brunswick, Canada ken@unb.ca Micaela Serra

More information

Virtual Machines. 2 Disco: Running Commodity Operating Systems on Scalable Multiprocessors([1])

Virtual Machines. 2 Disco: Running Commodity Operating Systems on Scalable Multiprocessors([1]) EE392C: Advanced Topics in Computer Architecture Lecture #10 Polymorphic Processors Stanford University Thursday, 8 May 2003 Virtual Machines Lecture #10: Thursday, 1 May 2003 Lecturer: Jayanth Gummaraju,

More information

Comparing Memory Systems for Chip Multiprocessors

Comparing Memory Systems for Chip Multiprocessors Comparing Memory Systems for Chip Multiprocessors Jacob Leverich Hideho Arakida, Alex Solomatnikov, Amin Firoozshahian, Mark Horowitz, Christos Kozyrakis Computer Systems Laboratory Stanford University

More information

MULTIMEDIA PROCESSING ON MANY-CORE TECHNOLOGIES USING DISTRIBUTED MULTIMEDIA MIDDLEWARE

MULTIMEDIA PROCESSING ON MANY-CORE TECHNOLOGIES USING DISTRIBUTED MULTIMEDIA MIDDLEWARE MULTIMEDIA PROCESSING ON MANY-CORE TECHNOLOGIES USING DISTRIBUTED MULTIMEDIA MIDDLEWARE Michael Repplinger 1,2, Martin Beyer 1, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken,

More information

On mapping to multi/manycores

On mapping to multi/manycores On mapping to multi/manycores Jeronimo Castrillon Chair for Compiler Construction (CCC) TU Dresden, Germany MULTIPROG HiPEAC Conference Stockholm, 24.01.2017 Mapping for dataflow programming models MEM

More information

Trends and Challenges in Multicore Programming

Trends and Challenges in Multicore Programming Trends and Challenges in Multicore Programming Eva Burrows Bergen Language Design Laboratory (BLDL) Department of Informatics, University of Bergen Bergen, March 17, 2010 Outline The Roadmap of Multicores

More information

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy

More information

2 TEST: A Tracer for Extracting Speculative Threads

2 TEST: A Tracer for Extracting Speculative Threads EE392C: Advanced Topics in Computer Architecture Lecture #11 Polymorphic Processors Stanford University Handout Date??? On-line Profiling Techniques Lecture #11: Tuesday, 6 May 2003 Lecturer: Shivnath

More information

Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS

Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS Who am I? Education Master of Technology, NTNU, 2007 PhD, NTNU, 2010. Title: «Managing Shared Resources in Chip Multiprocessor Memory

More information

The Mont-Blanc approach towards Exascale

The Mont-Blanc approach towards Exascale http://www.montblanc-project.eu The Mont-Blanc approach towards Exascale Alex Ramirez Barcelona Supercomputing Center Disclaimer: Not only I speak for myself... All references to unavailable products are

More information

Demand fetching is commonly employed to bring the data

Demand fetching is commonly employed to bring the data Proceedings of 2nd Annual Conference on Theoretical and Applied Computer Science, November 2010, Stillwater, OK 14 Markov Prediction Scheme for Cache Prefetching Pranav Pathak, Mehedi Sarwar, Sohum Sohoni

More information

Computing architectures Part 2 TMA4280 Introduction to Supercomputing

Computing architectures Part 2 TMA4280 Introduction to Supercomputing Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:

More information

Computing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany

Computing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany Computing on GPUs Prof. Dr. Uli Göhner DYNAmore GmbH Stuttgart, Germany Summary: The increasing power of GPUs has led to the intent to transfer computing load from CPUs to GPUs. A first example has been

More information

Adaptive Cache Partitioning on a Composite Core

Adaptive Cache Partitioning on a Composite Core Adaptive Cache Partitioning on a Composite Core Jiecao Yu, Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Scott Mahlke Computer Engineering Lab University of Michigan, Ann Arbor, MI {jiecaoyu, lukefahr,

More information

IBM Research - Tokyo 数理 計算科学特論 C プログラミング言語処理系の最先端実装技術. Trace Compilation IBM Corporation

IBM Research - Tokyo 数理 計算科学特論 C プログラミング言語処理系の最先端実装技術. Trace Compilation IBM Corporation 数理 計算科学特論 C プログラミング言語処理系の最先端実装技術 Trace Compilation Trace JIT vs. Method JIT https://twitter.com/yukihiro_matz/status/533775624486133762 2 Background: Trace-based Compilation Using a Trace, a hot path identified

More information

Multi processor systems with configurable hardware acceleration

Multi processor systems with configurable hardware acceleration Multi processor systems with configurable hardware acceleration Ph.D in Electronics, Computer Science and Telecommunications Ph.D Student: Davide Rossi Ph.D Tutor: Prof. Roberto Guerrieri Outline Motivations

More information

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620 Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved

More information

Computer Architecture. R. Poss

Computer Architecture. R. Poss Computer Architecture R. Poss 1 ca01-10 september 2015 Course & organization 2 ca01-10 september 2015 Aims of this course The aims of this course are: to highlight current trends to introduce the notion

More information

Multi2sim Kepler: A Detailed Architectural GPU Simulator

Multi2sim Kepler: A Detailed Architectural GPU Simulator Multi2sim Kepler: A Detailed Architectural GPU Simulator Xun Gong, Rafael Ubal, David Kaeli Northeastern University Computer Architecture Research Lab Department of Electrical and Computer Engineering

More information

Memory-Aware Loop Mapping on Coarse-Grained Reconfigurable Architectures

Memory-Aware Loop Mapping on Coarse-Grained Reconfigurable Architectures Memory-Aware Loop Mapping on Coarse-Grained Reconfigurable Architectures Abstract: The coarse-grained reconfigurable architectures (CGRAs) are a promising class of architectures with the advantages of

More information

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs An Extension of the StarSs Programming Model for Platforms with Multiple GPUs Eduard Ayguadé 2 Rosa M. Badia 2 Francisco Igual 1 Jesús Labarta 2 Rafael Mayo 1 Enrique S. Quintana-Ortí 1 1 Departamento

More information

Out-of-Order Parallel Simulation of SystemC Models. G. Liu, T. Schmidt, R. Dömer (CECS) A. Dingankar, D. Kirkpatrick (Intel Corp.)

Out-of-Order Parallel Simulation of SystemC Models. G. Liu, T. Schmidt, R. Dömer (CECS) A. Dingankar, D. Kirkpatrick (Intel Corp.) Out-of-Order Simulation of s using Intel MIC Architecture G. Liu, T. Schmidt, R. Dömer (CECS) A. Dingankar, D. Kirkpatrick (Intel Corp.) Speaker: Rainer Dömer doemer@uci.edu Center for Embedded Computer

More information

Deep Learning Accelerators

Deep Learning Accelerators Deep Learning Accelerators Abhishek Srivastava (as29) Samarth Kulshreshtha (samarth5) University of Illinois, Urbana-Champaign Submitted as a requirement for CS 433 graduate student project Outline Introduction

More information

SDACCEL DEVELOPMENT ENVIRONMENT. The Xilinx SDAccel Development Environment. Bringing The Best Performance/Watt to the Data Center

SDACCEL DEVELOPMENT ENVIRONMENT. The Xilinx SDAccel Development Environment. Bringing The Best Performance/Watt to the Data Center SDAccel Environment The Xilinx SDAccel Development Environment Bringing The Best Performance/Watt to the Data Center Introduction Data center operators constantly seek more server performance. Currently

More information

Introduction to FPGA Design with Vivado High-Level Synthesis. UG998 (v1.0) July 2, 2013

Introduction to FPGA Design with Vivado High-Level Synthesis. UG998 (v1.0) July 2, 2013 Introduction to FPGA Design with Vivado High-Level Synthesis Notice of Disclaimer The information disclosed to you hereunder (the Materials ) is provided solely for the selection and use of Xilinx products.

More information

Hardware/Software Co-design

Hardware/Software Co-design Hardware/Software Co-design Zebo Peng, Department of Computer and Information Science (IDA) Linköping University Course page: http://www.ida.liu.se/~petel/codesign/ 1 of 52 Lecture 1/2: Outline : an Introduction

More information

6LPXODWLRQÃRIÃWKHÃ&RPPXQLFDWLRQÃ7LPHÃIRUÃDÃ6SDFH7LPH $GDSWLYHÃ3URFHVVLQJÃ$OJRULWKPÃRQÃDÃ3DUDOOHOÃ(PEHGGHG 6\VWHP

6LPXODWLRQÃRIÃWKHÃ&RPPXQLFDWLRQÃ7LPHÃIRUÃDÃ6SDFH7LPH $GDSWLYHÃ3URFHVVLQJÃ$OJRULWKPÃRQÃDÃ3DUDOOHOÃ(PEHGGHG 6\VWHP LPXODWLRQÃRIÃWKHÃ&RPPXQLFDWLRQÃLPHÃIRUÃDÃSDFHLPH $GDSWLYHÃURFHVVLQJÃ$OJRULWKPÃRQÃDÃDUDOOHOÃ(PEHGGHG \VWHP Jack M. West and John K. Antonio Department of Computer Science, P.O. Box, Texas Tech University,

More information

HSA Foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017!

HSA Foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017! Advanced Topics on Heterogeneous System Architectures HSA Foundation! Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017! Antonio R. Miele! Marco D. Santambrogio! Politecnico di Milano! 2

More information

White paper FUJITSU Supercomputer PRIMEHPC FX100 Evolution to the Next Generation

White paper FUJITSU Supercomputer PRIMEHPC FX100 Evolution to the Next Generation White paper FUJITSU Supercomputer PRIMEHPC FX100 Evolution to the Next Generation Next Generation Technical Computing Unit Fujitsu Limited Contents FUJITSU Supercomputer PRIMEHPC FX100 System Overview

More information

The Use of Cloud Computing Resources in an HPC Environment

The Use of Cloud Computing Resources in an HPC Environment The Use of Cloud Computing Resources in an HPC Environment Bill, Labate, UCLA Office of Information Technology Prakashan Korambath, UCLA Institute for Digital Research & Education Cloud computing becomes

More information