Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference


This is a pre-print, author's version of the paper to appear in the IEEE International Symposium on Workload Characterization (IISWC), 2017.

Shin-Ying Lee and Carole-Jean Wu
School of Computing, Informatics, and Decision Systems Engineering, Arizona State University
{lee.shin-ying,carole-jean.wu}@asu.edu

Abstract: Modern computer systems are accelerator-rich, equipped with many types of hardware accelerators to speed up computation. For example, graphics processing units (GPUs) are a type of accelerator widely employed to accelerate parallel workloads. To utilize different accelerators well, for better execution time speedup or lower total energy consumption, many scheduling algorithms have been proposed to select the optimal target device to process an OpenCL kernel according to the kernel's individual characteristics. However, in a real computer system, many workloads are co-located on a single machine and are processed on different devices simultaneously. The CPU cores and accelerators may contend for shared resources, such as the host main memory and the shared last-level cache. Thus, scheduling an OpenCL kernel by considering only its own characteristics is not robust. To maximize system throughput, it is important to consider the execution behavior of all co-located applications when scheduling OpenCL kernel execution. In this paper, we provide a detailed characterization study demonstrating that scheduling an OpenCL kernel to run on different devices can introduce varying performance impact to itself and to the other co-located applications due to memory interference. Based on the characterization results, we then develop a light-weight, scalable performance degradation predictor designed specifically for heterogeneous computer systems, called HeteroPDP. HeteroPDP dynamically predicts and balances the execution time slowdown of all co-located applications in a heterogeneous computation environment. Our real-system evaluation results show that, compared with always running an OpenCL kernel on the host CPU, HeteroPDP achieves a 3X execution time speedup when an OpenCL kernel runs alone and improves system fairness from 24% to 65% when an OpenCL kernel is co-located with other applications.

I. INTRODUCTION

Hardware accelerators are increasingly used to improve application performance, system throughput, and energy efficiency in modern computing platforms [45]. For instance, graphics processing units (GPUs), with the key design feature of massive multithreading, are widely deployed on high performance computing (HPC) clusters to speed up the execution of general-purpose parallel workloads. Figure 1 illustrates an example of an accelerator-rich heterogeneous system which comprises a general-purpose chip-multiprocessor (CMP) and several types of hardware accelerators. With the availability of a unified programming interface, general-purpose computations can be fluidly offloaded to the different accelerator devices to maximize application performance or energy efficiency. Open Computing Language (OpenCL) is a framework that offers this computation offloading capability.

Fig. 1. An example of an accelerator-rich heterogeneous computer system. The machine comprises four CPU cores (CPU 0-3) with a shared last-level cache, connected through a northbridge/interconnect to the DRAM and, behind a PCIe controller, to a GPU, DSP, and FPGA ("Where to run?"). All the CPU cores share the last-level cache, whereas all the CPU cores and accelerators share the interconnect, PCIe controller, and main memory. An acceleratable application, e.g., machine learning, can be scheduled to run on the CPU or an accelerator.
Applications written in OpenCL can run on a collection of accelerators or devices of different instruction set architectures (ISAs) that support the standard. Thus, depending on the application requirements, the optimization goals, and the performance and power characteristics of the available devices, an intelligent OpenCL scheduler can schedule segments of the application (kernels) onto different OpenCL-enabled devices to improve application execution time. State-of-the-art approaches, such as [3], [38], [39], build predictive models to determine an optimal execution target among all available accelerators of different compute and power characteristics. However, these prior works focus on scheduling algorithm designs for a single application only and do not consider important, realistic runtime effects, such as memory interference, stemming from background processes, operating-system activities, or co-located applications. In a realistic execution environment, many processes run concurrently, and a number of native CPU applications can be co-located on the same system at the same time. For example, in an on-demand cloud computing environment, e.g., Amazon Web Services (AWS) [1], Google Cloud [14], and Microsoft Azure [9], compute nodes simultaneously service multiple applications or host multiple virtual machines running native CPU applications as well as acceleratable applications.
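To make the offloading mechanics concrete, the following is a minimal sketch of run-time device selection using the standard OpenCL host API. The helper function and its fallback behavior are illustrative assumptions, not the paper's scheduler; only the clGetPlatformIDs/clGetDeviceIDs calls are standard API.

```cpp
// Minimal sketch: enumerate OpenCL platforms and pick a device of the
// requested type (CPU or GPU). Error handling is elided; in a scheduler
// such as the one this paper proposes, device_type would be the output
// of the scheduling decision rather than a fixed argument.
#include <CL/cl.h>
#include <cstdio>

cl_device_id pick_device(cl_device_type device_type) {
    cl_uint num_platforms = 0;
    clGetPlatformIDs(0, NULL, &num_platforms);           // count platforms
    cl_platform_id platforms[8];
    clGetPlatformIDs(num_platforms < 8 ? num_platforms : 8, platforms, NULL);

    for (cl_uint p = 0; p < num_platforms && p < 8; ++p) {
        cl_device_id dev;
        cl_uint num_devices = 0;
        // Ask each platform for one device of the requested type.
        if (clGetDeviceIDs(platforms[p], device_type, 1, &dev,
                           &num_devices) == CL_SUCCESS && num_devices > 0)
            return dev;
    }
    return NULL;  // no such device on this machine
}

int main() {
    // The same kernel source can later be built for either target.
    cl_device_id cpu = pick_device(CL_DEVICE_TYPE_CPU);
    cl_device_id gpu = pick_device(CL_DEVICE_TYPE_GPU);
    std::printf("CPU target: %s, GPU target: %s\n",
                cpu ? "found" : "none", gpu ? "found" : "none");
    return 0;
}
```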

These compute nodes are also well equipped with a wide range of accelerators, such as high-performance GPUs or field-programmable gate arrays (FPGAs), and offer computation offloading and acceleration opportunities. In such an execution environment, co-located applications contend for shared resources in the memory subsystem and suffer varying degrees of performance degradation from memory interference. Thus, existing OpenCL schedulers that consider only the characteristics of the application itself, and do not take into account memory interference from co-located workloads, are not robust and provide sub-optimal system throughput and fairness. To understand the need for an intelligent scheduler that can accurately decide which execution target an application should run on in the presence of memory interference, we perform detailed performance characterization studies for a diverse set of OpenCL applications, alone and with co-located applications. Corroborating recent prior findings, our studies show that, for individual application execution scenarios, the optimal execution target switches between the host CMP and the discrete GPU (Section III-A). The decision is influenced by the degree of parallelism and divergence in an application and by the amount of data movement overhead between the host system and the selected accelerator. Furthermore, we demonstrate that the optimal execution target switches based on the degree of memory interference from the co-located applications or processes (Section III-B). More importantly, the room for performance improvement is substantial, motivating a scheduler design that considers the effect of memory interference. We take a step beyond the large-scale performance characterization studies and propose a simple, light-weight performance degradation predictor, called HeteroPDP. Unlike existing performance estimators, such as [3], HeteroPDP is tailored for heterogeneous systems with multiple levels of memory interference. In the presence of co-located CPU applications on the host system, HeteroPDP predicts the respective execution time slowdown factors for an OpenCL application and either schedules the application onto the remaining cores in the host CMP or offloads it to the GPU accelerator to maximize overall system throughput or fairness. HeteroPDP is implemented on a real heterogeneous system setup and is integrated into the existing OpenCL device driver. Our real-system evaluation results show execution target prediction accuracies of 80% and 72% for the alone and co-located scenarios, respectively. These prediction accuracies translate into a significant application execution time speedup of 3X (alone), and system fairness improves from 24% to 65% (co-located). In summary, this paper makes the following key contributions. We observe that, with different optimization goals, the optimal target device to process an OpenCL kernel may switch in an accelerator-rich heterogeneous computer system. We demonstrate that the multi-level memory interference in a heterogeneous system significantly influences the scheduling decision of OpenCL applications in co-located execution.
We present HeteroPDP, a light-weight, flexible prediction scheme that accurately predicts system performance degradation and selects the optimal target device to process a kernel, depending on the optimization goal, in a heterogeneous system.

II. EXPERIMENTAL METHODOLOGY

This section introduces the experimental setup for the performance characterization studies and the design evaluation on a real heterogeneous computer system.

A. Experiment Infrastructure and Configurations

To explore memory interference and performance degradation in a heterogeneous multiprogrammed environment, we build a system that comprises an Intel Core i7 processor (a quad-core CMP with an 8MB shared last-level cache) and an AMD GCN discrete GPU card attached via a PCIe x16 bus. On this system, the host processor and the GPU card share the same host DRAM controller and main memory modules. Both the CMP cores and the GPU card are OpenCL-compatible and are able to execute OpenCL programs. The detailed experiment setup and system configurations are presented in Table I. To collect application-specific information for performance prediction, we instrument the OpenCL JIT compiler to generate static information, e.g., the static instruction count (Section IV), as input to the HeteroPDP predictors. To collect runtime system resource utilization information, such as the last-level cache miss count, we integrate Intel's Performance Counter Monitor toolkit (PCM) [40] into HeteroPDP to periodically sample system resource utilization at runtime.

B. Workload Construction

We use a wide range of workloads exhibiting varying execution behavior for the performance characterization studies. We use applications from the SPEC2006 benchmark suite with the reference dataset to represent the native CPU workloads [16]. These applications introduce varying degrees of shared-resource pressure on the memory subsystem. We classify these CPU applications into two categories, computation-intensive or memory-intensive, based on the average misses per kilo-instruction (MPKI) [24].
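As a concrete illustration of this kind of counter sampling and MPKI-based classification, the sketch below uses the public Intel PCM C++ API as we understand it; the 1-second sample period mirrors the sampling interval described in Section IV-C, while the 10-MPKI classification threshold is our illustrative assumption, not a value from the paper.

```cpp
// Sketch of periodic utilization sampling with Intel PCM. Function names
// follow the public PCM library (cpucounters.h); the MPKI threshold is an
// illustrative assumption for the memory- vs. computation-intensive split.
#include "cpucounters.h"  // Intel Performance Counter Monitor
#include <chrono>
#include <cstdio>
#include <thread>

int main() {
    PCM *pcm = PCM::getInstance();
    if (pcm->program() != PCM::Success) return 1;  // program the counters

    SystemCounterState before = getSystemCounterState();
    std::this_thread::sleep_for(std::chrono::seconds(1));  // 1s sample period
    SystemCounterState after = getSystemCounterState();

    uint64 misses = getL3CacheMisses(before, after);
    uint64 instrs = getInstructionsRetired(before, after);
    double mpki = 1000.0 * (double)misses / (double)instrs;

    // Classify the running workload in the spirit of Section II-B.
    std::printf("LLC MPKI = %.2f -> %s intensive\n", mpki,
                mpki > 10.0 ? "memory" : "computation");
    pcm->cleanup();
    return 0;
}
```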

We take various applications from the AMD SDK [2], Intel SDK [18], Hetero-Mark [34], Pannotia [6], Rodinia [7], [8], SHOC [11], and XSBench [35] benchmark suites to evaluate the behavior of OpenCL applications. Due to the resolution of the performance counters used in HeteroPDP, we do not use OpenCL applications that finish in under 2 seconds and focus our studies on the longer-running OpenCL applications. Table II lists the benchmarks used in this paper. For the co-located execution scenario, we construct workload combinations by pairing one native CPU application with one OpenCL application, which results in 6 × 26 = 156 multiprogrammed workloads. To study the scalability of HeteroPDP, we increase the number of native CPU applications and synthesize an additional 38 multiprogrammed workloads, each consisting of two SPEC applications and one OpenCL application from the listed benchmarks. To prevent our experimental machine from thermal throttling, the 38 workloads are the combinations that complete within 5 minutes.

TABLE I
MEMORY INTERFERENCE INFRASTRUCTURE SETUP AND CONFIGURATIONS.

Host CPU: Intel Core i7 x86-64 CPU; 4 cores; 3.4GHz core frequency; 8MB shared LLC; turbo boost disabled; hyperthreading disabled
Host DRAM: DDR; 2 channels; 22GB/s max available bandwidth
Accelerator (GPU): AMD FirePro S9150 GCN GPU; 44 compute units (CUs); 900MHz core frequency; PCIe x16, 8GT/s
GPU DRAM: GDDR5 with ECC; 512-bit width; 320GB/s max available bandwidth
Software Runtime: Ubuntu Linux, kernel v4.4.0; clang/clang++ v3.8.0; Intel PCM v2.11; Intel OpenCL driver; AMD OpenCL driver

TABLE II
WORKLOADS USED IN THE PERFORMANCE CHARACTERIZATION STUDIES AND DESIGN EVALUATION. THE ASTERISK SYMBOL INDICATES HIGH MEMORY INTENSITY.

Native CPU applications, SPEC2006 [16]: bzip2, calculix, lbm*, mcf*, perlbench, xalancbmk
OpenCL applications:
AMD SDK [2]: AutoCluster*, Binomial*, BlackScholes, Histogram*, LUDecomposition, MonteCarloAS
Hetero-Mark [34]: AES, FIR, KMN*, PR
Intel SDK [18]: Bitonics*, GEMM*, MedianFilter, MonteCarlo
Pannotia [6]: bc*, csr*, ell*
Rodinia [7], [8]: cfd*, gaussian*, heartwall, kmeans*, leukocyte, pathfinder, streamcluster*
SHOC [11]: s3d
XSBench [35]: XSBench*

III. MOTIVATION FOR AN INTELLIGENT APPLICATION EXECUTION TARGET SCHEDULER

We begin this section with performance characterization and analysis for the alone and co-located execution scenarios. In the alone case, an OpenCL application is the sole application running on the heterogeneous system and is dispatched onto one execution target among all available processors and accelerators. In contrast, in the co-located case, an OpenCL application shares the heterogeneous system with other native CPU applications. Section III-A shows that there is significant room for performance improvement depending on which execution target an OpenCL application runs on, for both the alone and co-located cases. Section III-B shows that the optimal execution target switches for OpenCL applications in the presence of memory interference from a memory-intensive co-located application. Then, Sections III-C and III-D present more detailed fairness characterization for the co-located case and for different scheduling priorities imposed on the concurrent applications.

A. Performance Characterization for alone and co-located

Fig. 2. The average execution time speedup of running OpenCL applications alone, and the execution time slowdown fairness of co-located execution, on a quad-core CPU, on the GPU, and on the optimal target between the CPU and GPU devices.

Offloading an OpenCL application onto a hardware accelerator does not always lead to performance improvement or energy reduction, mainly for three reasons. First, performing computations on an accelerator often requires moving a considerable amount of data between the host system and the accelerator and synchronizing the execution, both of which are expensive in terms of execution time and energy consumption [5], [15], [25], [31], [34].
Second, to make shared data accessible to the host CPU as well as the hardware accelerators, the device driver or operating system has to frequently modify the page tables and translation lookaside buffers (TLBs) to remap the data into different memory spaces, which can introduce very long operation latencies [36].

Third, the OpenCL JIT compiler is not always able to transform the OpenCL kernel code to fully utilize the dedicated target accelerator, making performance sub-optimal [38]. Consequently, offloading computations onto an accelerator may instead degrade application performance and incur higher energy dissipation.

Figure 2 shows the system performance of running an OpenCL application on the Intel CMP and on the discrete GPU card, alone and co-located, averaged across the 26 OpenCL applications. The horizontal axis indicates the execution target of the OpenCL application, whereas the y-axis represents system performance: execution time speedup for alone and fairness¹ for co-located. Figure 2(a) shows that, although offloading the OpenCL application to the GPU achieves an impressive speedup on average compared with the CMP execution target, there is still room for performance improvement: with oracle execution target information, application performance can be further improved by an average of 50%. Figure 2(b) shows a similar performance trend for co-located. Thus, to maximize system performance, an intelligent execution target scheduler is needed for both the alone and co-located execution scenarios.

B. Optimal Execution Target Varies in the Presence of Memory Interference

We delve deeper into a few workload combinations to illustrate that the optimal OpenCL execution target varies in the presence of memory interference from a memory-intensive co-located application. In this study, we use mcf as the memory-intensive application running on the CMP. When an OpenCL application is co-located with mcf on the CMP, shared last-level cache contention degrades application performance, whereas when the OpenCL application is offloaded to the GPU, performance degradation comes from a different level of the memory hierarchy, i.e., the DRAM memory bandwidth: the already expensive data transfer cost of OpenCL offloading is exacerbated.

Fig. 3. The execution time speedup of five OpenCL applications (FIR, BIT, HIS, XSB, KMN) when running (a) alone and (b) co-located with the native CPU application mcf. The labels on top of the bars indicate the optimal target device based on execution time speedup.

Figure 3(a) shows the execution time speedup of five different OpenCL applications alone on the CMP versus on the GPU accelerator and on the optimal, higher-performing execution target. Figure 3(b) shows the execution time speedup of the same OpenCL applications co-located with mcf and the corresponding optimal execution target. The optimal execution target changes for three of the five OpenCL applications, i.e., BIT, HIS, and XSB. It is clear that the decision depends on the memory intensities of, and the interference between, the OpenCL and co-located workloads. Hence, considering only the features of an OpenCL application is insufficient to maximize application and system performance; it is crucial for an intelligent execution target scheduler to take into account the characteristics of all co-located applications.

¹ Fairness is a commonly-used metric to evaluate the execution time slowdown of multiprogrammed execution [4], [12], [30] and is defined as the ratio of the minimum and the maximum slowdown among all concurrent applications.
C. Large-scale Performance Degradation Characterization with Different Co-location Scenarios

To fairly evaluate overall system performance, the fairness metric is commonly used for co-located workloads in multiprogrammed execution [4], [12], [30]. Fairness is defined as follows:

Fairness = min_i(Slowdown_i) / max_i(Slowdown_i)    (1)

where i ranges over the co-located applications and Slowdown_i is the ratio of application i's execution time in co-located to that in alone. The goal of using fairness as the optimization target is to ensure a fine balance of slowdown among all co-located applications; a fairness of 1 represents a system with equal slowdown among all co-located workloads.

Figure 4 shows the execution target preference of the OpenCL application in the co-located scenario for all 156 workload combinations in this study. The x-axis represents the workload combinations while the y-axis represents the ratio of fairness when the OpenCL application runs on the CMP versus on the GPU. The data points are sorted by this fairness ratio in increasing order. We observe that the fairness ratio varies significantly, from 0.01 to 100. For a large number of workload combinations (toward either end of the curve), there is a clear OpenCL execution target preference.

D. Large-scale Performance Degradation Characterization with Different Scheduling Priorities

Real-time constraints and process scheduling priorities can affect the scheduling decision as well. Many interrupt services, for example, must be handled by the host processor under a hard real-time deadline. To evaluate how scheduling priorities influence the scheduling decision of an OpenCL application and affect overall system performance, we adopt the weighted slowdown metric [12] and use it to calculate fairness, defined as:

WeightedFairness = min_i(WeightedSlowdown_i) / max_i(WeightedSlowdown_i)    (2)
WeightedSlowdown_i = Slowdown_i × weight_i

where weight_i is the scheduling weight given to process i.
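Equations 1 and 2 translate directly into code; the following minimal sketch assumes only the definitions above (slowdowns measured as T_co-located / T_alone, weights encoding scheduling priority).

```cpp
// Sketch of the fairness metrics from Equations 1 and 2. Unweighted
// fairness (Equation 1) is the same computation with all weights = 1.
#include <algorithm>
#include <vector>

double weighted_fairness(const std::vector<double> &slowdowns,
                         const std::vector<double> &weights) {
    double lo = 1e300, hi = 0.0;
    for (size_t i = 0; i < slowdowns.size(); ++i) {
        double ws = slowdowns[i] * weights[i];  // WeightedSlowdown_i
        lo = std::min(lo, ws);
        hi = std::max(hi, ws);
    }
    return lo / hi;  // 1.0 = perfectly balanced slowdown across apps
}
```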

Fig. 4. The ratio of fairness between running an OpenCL kernel on the CMP versus on the GPU for workloads comprising one OpenCL application and one native CPU application, annotated with the regions that prefer GPU execution and CMP execution. A ratio higher than 1 indicates that running on the CMP yields higher fairness, and the workload thereby prefers the CMP.

Fig. 5. The ratio of fairness of running an OpenCL kernel on the CMP versus on the GPU when the co-located native CPU application is assigned different OS scheduling priorities/weights (×0.5 to ×2.5). The blue boxes point out workloads whose target execution device varies when the co-located application has different scheduling weights.

Figure 5 presents the fairness ratio based on weighted slowdown, with each co-located native CPU application given a weight factor varying from 0.5 to 2.5. A weight of 0.5 means the co-located native CPU application is more latency-tolerant than the OpenCL application, whereas 2.5 indicates the co-scheduled native CPU application is highly latency-critical. The weights can also represent, for example, the operating system scheduling priority. We see that when the scheduling priority of the co-located native CPU process increases, the fairness ratio shifts remarkably as well, favoring the GPU as the OpenCL execution target, as labeled with the blue boxes in Figure 5. Therefore, in order to meet real-time deadlines, an intelligent OpenCL execution target scheduling framework should also consider process scheduling priorities to reach a correct target selection decision.

IV. PERFORMANCE PREDICTION AND OPTIMIZATION FRAMEWORK

Based on the performance characterization studies, we design a simple, light-weight performance prediction and optimization framework, called HeteroPDP. HeteroPDP estimates the slowdown of each co-located application and schedules the OpenCL application to an execution target in a heterogeneous system with the goal of maximizing fairness, system throughput, or weighted speedup. Figure 6 illustrates the overall execution flow and the design components of HeteroPDP.

Fig. 6. The OpenCL kernel execution flow and slowdown prediction flow of HeteroPDP: static features are collected at compilation time (by the C/C++ compiler for native CPU applications, with slowdown estimation via the pre-characterized utilization table, and by the OpenCL JIT compiler for OpenCL applications), dynamic features at kernel launch time, and runtime utilization during application execution. These inputs feed the regression models for OpenCL execution time alone and co-located on the CMP/GPU (Sections IV-B, IV-C) and the table-based CPU slowdown estimation (Section IV-E).

Fig. 7. System diagram of the HeteroPDP scheme: the OpenCL ICD supplies static and dynamic features from the OpenCL source code, user API calls, the JIT compiler, and the command queue, while PCM supplies runtime utilization from performance counters. The OpenCL performance predictor (regression models) and the CPU performance predictor (pre-characterized table) jointly produce the scheduling decision.

A. HeteroPDP Overview and Execution Flow

HeteroPDP is implemented as part of the OpenCL installable client driver (ICD). When an OpenCL API is invoked within an application, HeteroPDP retrieves and collects application-specific information available in the command queue of the ICD, such as the size of data transfers between the host and device memories.
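As a rough illustration of where such an ICD-level hook could sit, the interposition-style sketch below wraps clEnqueueNDRangeKernel; the forwarding pointer and the feature-recording helper are hypothetical stand-ins for the vendor dispatch mechanism, not HeteroPDP's actual implementation, and the plain function signature assumes a platform where the OpenCL calling-convention macros are empty (as on Linux).

```cpp
// Hypothetical ICD-layer hook: capture launch-time (dynamic) features,
// then forward to the vendor implementation. real_clEnqueueNDRangeKernel
// and record_dynamic_features are illustrative placeholders.
#include <CL/cl.h>

extern cl_int (*real_clEnqueueNDRangeKernel)(
    cl_command_queue, cl_kernel, cl_uint, const size_t *,
    const size_t *, const size_t *, cl_uint, const cl_event *, cl_event *);

void record_dynamic_features(size_t threads);  // e.g., feed the predictor

cl_int clEnqueueNDRangeKernel(cl_command_queue q, cl_kernel k,
                              cl_uint work_dim,
                              const size_t *global_offset,
                              const size_t *global_size,
                              const size_t *local_size,
                              cl_uint num_events, const cl_event *wait_list,
                              cl_event *event) {
    // Dynamic feature: total number of work-items (threads) spawned.
    size_t threads = 1;
    for (cl_uint d = 0; d < work_dim; ++d) threads *= global_size[d];
    record_dynamic_features(threads);

    // Forward to the vendor implementation unchanged.
    return real_clEnqueueNDRangeKernel(q, k, work_dim, global_offset,
                                       global_size, local_size,
                                       num_events, wait_list, event);
}
```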
Based on the application-specific features (Section IV-B), HeteroPDP estimates application execution time and selects an execution target for the OpenCL application. The proposed HeteroPDP framework is illustrated in Figure 7. To perform performance prediction in HeteroPDP, at compilation time the compiler collects static features (Section IV-B) for OpenCL kernels and builds a lookup table (Section IV-E) for native CPU applications. At runtime, HeteroPDP periodically queries performance counters to collect system resource utilization, and retrieves the OpenCL kernel's dynamic features (Section IV-B) right before kernel launch time for performance prediction.
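Putting this flow together, the following is a minimal sketch of the decision step under the fairness goal: evaluate the linear models of Equation 3, form slowdowns per Equation 4, and pick the target with the better predicted fairness. The function names and the availability of trained coefficient vectors are assumptions for illustration.

```cpp
// Sketch of a HeteroPDP-style decision step under the fairness goal.
// Feature vectors and trained coefficients are assumed to come from the
// instrumented compiler/driver (Sections IV-B to IV-E).
#include <algorithm>
#include <vector>

// Equation 3: predicted time = sum_i c_i * f_i.
double predict_time(const std::vector<double> &coef,
                    const std::vector<double> &feat) {
    double t = 0.0;
    for (size_t i = 0; i < coef.size(); ++i) t += coef[i] * feat[i];
    return t;
}

enum Target { CMP, GPU };

Target choose_target(double t_cmp_alone, double t_cmp_coloc,
                     double t_gpu_alone, double t_gpu_coloc,
                     double cpu_slowdown_if_cmp,    // from the lookup
                     double cpu_slowdown_if_gpu) {  // table (Section IV-E)
    double ocl_sd_cmp = t_cmp_coloc / t_cmp_alone;  // Equation 4
    double ocl_sd_gpu = t_gpu_coloc / t_gpu_alone;
    // Fairness for each choice: min/max over {OpenCL, native CPU} slowdowns.
    double f_cmp = std::min(ocl_sd_cmp, cpu_slowdown_if_cmp) /
                   std::max(ocl_sd_cmp, cpu_slowdown_if_cmp);
    double f_gpu = std::min(ocl_sd_gpu, cpu_slowdown_if_gpu) /
                   std::max(ocl_sd_gpu, cpu_slowdown_if_gpu);
    return f_cmp >= f_gpu ? CMP : GPU;
}
```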

HeteroPDP estimates the execution time and slowdown of the OpenCL kernel running on the CMP and on the GPU with simple regression-based models, which use the kernel's static features, dynamic features, and the system resource utilization as prediction inputs (Sections IV-B to IV-D). While a full-fledged machine learning technique could also be used and might offer higher prediction accuracy, our evaluation results indicate that a simple performance model works sufficiently well for the purpose of this work (Section V-A). HeteroPDP also assesses the impact of shared resource contention on the co-located native CPU applications using a pre-characterized performance estimation approach [27]. It evaluates the performance degradation of the native CPU applications by looking up a table (the pre-characterized utilization table) indexed by the co-located OpenCL kernel's per-thread working set size and its amount of data transfer (Section IV-E). HeteroPDP then predicts the optimal execution target based on the optimization goal and schedules the OpenCL kernel accordingly.

B. OpenCL Kernel Execution Time Prediction for alone

To establish the regression model for predicting the performance of an OpenCL kernel when it runs alone in a heterogeneous system, we first analyze and identify a set of important kernel characteristics, including both static and dynamic features. The static features of a kernel, such as the number of static instructions, can be retrieved by the OpenCL JIT compiler at compilation time. The dynamic features of a kernel include parameters, such as the size of the input data set, and user commands specified at kernel launch time, such as the total number of threads. The kernel characteristics are extracted with the instrumented OpenCL JIT compiler and device driver, and are used to train the regression-based performance prediction models: one for predicting the OpenCL application execution time on the host CMP execution target and the other for the GPU execution target. We run an OpenCL kernel with a varying number of threads and different input data set sizes and collect its corresponding execution time by querying the clGetEventProfilingInfo() API; we then construct the correlation between the features and the execution time (Section IV-D). Overall, the regression model expresses the predicted execution time as a function of a number of important features, as shown in Equation 3, where c_i and f_i represent the i-th coefficient and feature, respectively:

Performance_target = Σ_i c_i × f_i    (3)

Table III summarizes the kernel-specific features used in the performance prediction models for the host CMP and GPU execution targets.

TABLE III
THE OPENCL KERNEL FEATURES USED FOR EXECUTION TIME PREDICTION.

Static features for predicting execution time on the CMP: # of scalar ALU instructions; # of scalar memory instructions; # of vector ALU instructions; # of vector memory instructions; # of branch instructions; # of atomic instructions
Static features for predicting execution time on the GPU: # of memory instructions; # of integer instructions; # of floating-point instructions; # of special math instructions; # of branch instructions; # of barrier instructions
Dynamic features: # of threads spawned; size of memory buffer allocated
Runtime utilization for predicting execution time of co-located: last-level cache miss count; host DRAM bandwidth utilization
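The execution times used as training labels come from the standard OpenCL profiling API; a minimal sketch of timing a kernel this way follows, assuming the command queue was created with CL_QUEUE_PROFILING_ENABLE and that setup/error handling is elided.

```cpp
// Minimal sketch: measure a kernel's device execution time with the
// OpenCL profiling API, as used to gather training labels in Section IV-B.
#include <CL/cl.h>

double kernel_time_ms(cl_command_queue q, cl_kernel k, size_t global_size) {
    cl_event ev;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global_size, NULL, 0, NULL, &ev);
    clWaitForEvents(1, &ev);  // make sure the kernel has finished

    cl_ulong start = 0, end = 0;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);
    clReleaseEvent(ev);
    return (end - start) * 1e-6;  // profiling timestamps are in nanoseconds
}
```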
C. OpenCL Kernel Execution Time Prediction for co-located

Similar to predicting the execution time of an OpenCL application alone, we build an additional regression model to predict the kernel execution time in the presence of co-located applications. In such an execution scenario, shared memory resource utilization, such as the last-level cache and the DRAM bandwidth on the host, influences OpenCL kernel performance. To account for memory interference effects, we include two additional system utilization features in the performance prediction model for co-located: (1) the shared last-level cache miss count on the host and (2) the host DRAM bandwidth utilization incurred by the co-located native CPU applications. HeteroPDP obtains these two runtime utilization features, and thereby the degree of memory interference, by periodically (every 1 second) checking performance counters on the host machine and taking the moving average of 8 consecutive samples.

In summary, when an OpenCL kernel is launched, we use the regression models to predict the OpenCL kernel execution time on (1) each of the two available execution targets, alone (time_alone, with Equation 3), and (2) each of the two available execution targets, co-located (time_co-located). HeteroPDP then estimates the slowdown factor of the OpenCL application on the two execution targets with Equation 4. Note that these parameters and features are chosen to form the regression models because they have been identified as highly correlated with kernel execution time [38].

Slowdown = time_co-located / time_alone    (4)

D. Performance Model Training for OpenCL Kernels

To build the regression models for OpenCL kernel execution time prediction in HeteroPDP, we take a set of 63 distinct OpenCL kernels with varying input data set sizes from the OpenCL benchmarks listed in Table II as the training set. We first execute the OpenCL kernels to collect the corresponding kernel execution times with different static and dynamic features to build the initial regression models. We then apply the commonly-used K-fold cross-validation algorithm [32] with 32 test passes to eliminate overfitting and to maximize the coefficient of determination (R-squared), narrowing the training set size from 63 down to 45 kernels². That is, 45 kernels are used to derive the coefficients of the regression models and the remaining 18 kernels are used to validate the prediction errors; the kernels used for model training and validation do not overlap. Similarly, using the same 63 kernels, we vary the degree of memory interference (i.e., the host DRAM utilization and the shared last-level cache miss count) by co-locating the OpenCL kernels with microbenchmarks, and perform the same model training procedure for co-located as for the alone case. To minimize runtime execution overhead, the regression models are trained offline. Moreover, to better correlate the features and parameters, the prediction models are trained as regression models with interaction terms.
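The fold logic of K-fold cross-validation is generic and can be sketched independently of the fitting backend; in the sketch below, fit() and validation_error() are placeholders for the actual model fitting (done offline with MATLAB's fitlm()/crossval() in the paper), and the round-robin fold assignment is an illustrative choice.

```cpp
// Sketch of a K-fold cross-validation loop in the spirit of Section IV-D.
// fit() and validation_error() stand in for the offline model training.
#include <cstddef>
#include <vector>

struct Sample { std::vector<double> features; double exec_time; };
struct Model  { std::vector<double> coef; };

Model  fit(const std::vector<Sample> &train);                 // placeholder
double validation_error(const Model &, const std::vector<Sample> &);

double kfold_error(const std::vector<Sample> &data, size_t k) {
    double total = 0.0;
    for (size_t fold = 0; fold < k; ++fold) {
        std::vector<Sample> train, test;
        for (size_t i = 0; i < data.size(); ++i)
            (i % k == fold ? test : train).push_back(data[i]);
        total += validation_error(fit(train), test);  // held-out error
    }
    return total / k;  // average error across the k folds
}
```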

² The model training is done with the MATLAB fitlm() and crossval() APIs [28].

E. Performance Degradation Prediction for Native CPU Applications

To assess the fairness or weighted speedup of multiple concurrent applications running on a heterogeneous system, HeteroPDP has to determine the performance of the native CPU applications as well. It does so with an offline-trained lookup table; a major advantage of an offline-trained table is its low computation overhead. Therefore, instead of applying a prediction model to project the execution time slowdown of co-located native CPU applications with complicated execution behavior, we adapt a previously proposed approach, called Bubble-Up [27], to measure and estimate the CPU application slowdown caused by the co-scheduled OpenCL kernel. In Bubble-Up, a simple lookup table (the pre-characterized utilization table) is built at compilation time for predicting the degree of performance degradation under different levels of shared memory contention from other co-located applications. The table is constructed for each native CPU application and is trained with a collection of microbenchmarks that generate a fixed level of contention for a specific shared resource, such as the last-level cache or the shared DRAM bandwidth. In our design, when an OpenCL kernel is launched, HeteroPDP looks up the pre-characterized utilization table, indexing it with the system status (i.e., DRAM bandwidth utilization and OpenCL buffer size), to predict the execution time slowdown of the native CPU applications. Note that Bubble-Up was originally proposed for application slowdown estimation for CPU applications in the multiprogrammed execution scenario; we revise the algorithm for performance degradation prediction for native CPU applications in a heterogeneous system setup.

For HeteroPDP, if an OpenCL kernel is running on the host CMP, the main resource contention occurs at the shared last-level cache. To predict the pressure the OpenCL kernel imposes on the shared cache, we use the maximum number of concurrent threads that can run on the CMP's SIMD or vector functional units and the total working set size to estimate its demand for shared cache capacity. On the other hand, when the OpenCL kernel is offloaded onto the discrete GPU, the major resource interference occurs at the data movement operations contending for shared main memory bandwidth. To predict the slowdown caused by this bandwidth contention, HeteroPDP uses the total size of data transfer required to launch the OpenCL kernel to estimate the host DRAM bandwidth requirement.
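A minimal sketch of such a pre-characterized utilization table follows: a per-application 2D table indexed by buckets of host DRAM bandwidth utilization and of the co-located OpenCL kernel's memory footprint, returning a measured slowdown. The bucket counts and edges are our illustrative assumptions, not values from the paper.

```cpp
// Sketch of a Bubble-Up-style pre-characterized utilization table
// (Section IV-E). slowdown[bw][fp] is filled offline by co-running
// microbenchmarks that generate fixed levels of shared-resource contention.
#include <cstddef>

const size_t BW_BUCKETS = 8, FP_BUCKETS = 8;  // illustrative granularity

struct UtilizationTable {
    double slowdown[BW_BUCKETS][FP_BUCKETS];  // measured offline

    double lookup(double bw_util,        // fraction of peak DRAM bandwidth
                  double footprint_mb) const {
        size_t b = static_cast<size_t>(bw_util * BW_BUCKETS);
        if (b >= BW_BUCKETS) b = BW_BUCKETS - 1;
        size_t f = static_cast<size_t>(footprint_mb / 64.0);  // 64MB/bucket
        if (f >= FP_BUCKETS) f = FP_BUCKETS - 1;
        return slowdown[b][f];  // predicted native CPU slowdown
    }
};
```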
Fig. 8. The prediction accuracy of selecting the optimal target device to run an OpenCL kernel for (a) alone (80% correct) and (b) co-located with a native CPU process (72% correct). The bar portions denote the prediction outcomes [predicted target, optimal target]: [CMP, CMP], [GPU, GPU], [CMP, GPU], and [GPU, CMP].

V. EVALUATION RESULTS AND ANALYSIS FOR HeteroPDP

In this section, we present the evaluation results for the prediction models and the execution target prediction accuracies (Section V-A), as well as the performance of HeteroPDP in the alone and co-located execution scenarios (Section V-B).

A. Evaluation of Execution Time Prediction Models and Execution Target Prediction

The ultimate goal of the HeteroPDP framework is to predict the optimal execution target for an OpenCL application in the alone and co-located execution scenarios. Since HeteroPDP bases its execution target prediction on the four execution time prediction models, we also evaluate the prediction accuracies of the individual models. Figure 8 presents the execution target selection accuracy for alone and for co-located. The different portions of each bar represent the different prediction outcomes [predicted execution target, optimal execution target]. For example, [CMP, CMP] means that the predicted execution target for the OpenCL application is the host processor and the optimal execution target is also the host processor, i.e., a correct prediction. For the alone case, the execution target is selected to minimize the execution time of the OpenCL application. For the co-located case, the execution target is selected to maximize fairness, as defined in Section III-C. Overall, HeteroPDP achieves 80% and 72% execution target prediction accuracies for the alone and co-located scenarios, respectively. The training set for our prediction models is relatively small and may not cover the diverse execution behavior of the OpenCL kernels in this paper; we believe the model prediction accuracy can be improved significantly with a larger training set.

We investigate the prediction accuracies of the individual execution time models as well. Figure 9 shows the cumulative density function of the execution time prediction errors for (1) the OpenCL application on the host processor, alone, (2) the OpenCL application on the GPU, alone, (3) the OpenCL application on the host processor, co-located, and (4) the OpenCL application on the GPU, co-located.

Fig. 9. The CDF of prediction errors for predicting OpenCL kernel execution time, for the four cases alone on the CMP, alone on the GPU, co-located on the CMP, and co-located on the GPU. The red dashed line indicates the 20% error margin.

Fig. 10. The system speedup of HeteroPDP when running an OpenCL application alone, relative to running the OpenCL application on the CMP.

We observe that the execution time prediction error for the majority of applications or workload combinations is below 10%. For the four respective models, (1) through (4), 73%, 70%, 68%, and 72% of the workloads meet the 20% error cutoff. We find that the execution time prediction error mainly comes from two sources. First, for applications with a high degree of memory or branch divergence, the execution behavior is less predictable with a simple regression model. Second, because of the limited timer resolution on the GPU, we are not able to accurately measure the execution time of short-running OpenCL kernels; consequently, HeteroPDP has a relatively high prediction error for short-running kernels.

B. Evaluation of OpenCL Application and System Performance

Next, we investigate the application and system performance impact of HeteroPDP for alone and co-located. Figure 10 shows the performance speedup of an OpenCL application running alone on the target heterogeneous system. The bars represent the OpenCL application on the different execution targets, i.e., the host CMP, the GPU, the execution target (CMP or GPU) selected by HeteroPDP, and the optimal execution target (Opt), whereas the y-axis plots the speedup over the baseline CMP execution target. The always-offloading-to-GPU choice improves OpenCL application performance by 2.5X on average, while HeteroPDP improves application performance by 3X. HeteroPDP bridges the performance gap between always offloading to the GPU and the optimal target selection by 72%.

Fig. 11. The speedup and fairness of HeteroPDP when an OpenCL application is co-located with a native CPU application. The label Native CPU represents the native CPU workloads, OCL represents the OpenCL workloads, and Avg is the average speedup across all co-located applications.

Figure 11 shows the respective performance speedups of the native CPU application and the OpenCL application in the co-located multiprogrammed workloads. The x-axis again shows the execution target of the OpenCL application (the Avg bars indicate the average throughput across all co-located applications), the left y-axis shows the application performance speedup normalized to the baseline (where the OpenCL application runs on the host CMP), and the right y-axis plots the fairness evaluation. Similar to the alone execution scenario, the proposed HeteroPDP improves the weighted speedup over the always-offloading-to-GPU choice and, at the same time, improves the fairness of the co-located applications.

C. HeteroPDP with Varying Scheduling Priorities

Assigning equal weights to the native CPU applications and the OpenCL application is not reflective of the scheduling priorities enforced in typical systems. As previously mentioned, HeteroPDP can be configured to consider the priorities of co-located applications when making a scheduling decision.
Thus, we perform a characterization study varying the weight ratio of the native CPU application to the OpenCL application. This weight ratio is taken into account when the fairness of the system is calculated, thereby influencing the scheduling decision of the OpenCL application. Figure 12 shows the execution target prediction accuracy of HeteroPDP with the weight ratio varying from 0.5 to 2.5. A weight ratio less than 1 indicates that the native CPU application has a lower priority than the OpenCL application, a weight ratio of 1 means all applications have equal priority, and a weight ratio higher than 1 indicates that the native CPU application has a higher priority than the OpenCL application. As we increase the importance of the native CPU application's speedup with a larger weight ratio, the optimal execution target for the OpenCL application increasingly switches to the GPU, as expected. HeteroPDP achieves a similarly good prediction accuracy of 75% for selecting the execution target. Figure 13 shows the corresponding system performance impact of HeteroPDP with varying scheduling priorities (weight ratios). As the native CPU application is given a heavier weight, its performance improvement becomes more important when maximizing the overall system throughput.

Fig. 12. The prediction accuracy of selecting the optimal target to run an OpenCL kernel co-located with one native CPU application that has varying scheduling weights (64%, 72%, 73%, 75%, and 70% across the weight settings). The bar portions denote the outcomes [predicted target, optimal target].

Fig. 13. The speedup and weighted fairness of HeteroPDP when running workloads consisting of one OpenCL application and one native CPU application with varying scheduling weights.

We notice that when the weight ratio is 0.5, the performance of the OpenCL applications is lower than with equal weight (i.e., a weight ratio of 1). This is because HeteroPDP's target prediction accuracy is slightly lower at this setting than at the other weight ratios, as shown in Figure 12; this is also reflected in the trend of the system's weighted fairness improvement.

D. HeteroPDP Scalability Analysis

Finally, we assess the scalability of the proposed design by increasing the number of native CPU applications on the four-core CMP. In this study, we co-locate two native CPU applications on the host processor and evaluate the prediction behavior of HeteroPDP for the OpenCL application. Figure 14 shows the prediction accuracy of target device selection under this more resource-stressed execution environment. The evaluation results indicate that, although the number of co-located processes increases, HeteroPDP still achieves a similarly good prediction accuracy of 70% compared with the execution scenario with only one native CPU process (Figure 8). Similarly, the good execution target prediction accuracy translates into system throughput improvement for HeteroPDP. Figure 15 shows the respective speedups of the co-located applications as well as the system throughput and fairness results. HeteroPDP continues its accurate execution target prediction without any prediction model revision and continues to mitigate the performance degradation in the co-located execution environment.

Fig. 14. The prediction accuracy of selecting the optimal target device to run an OpenCL kernel co-located with two native CPU applications.

Fig. 15. The speedup and fairness of HeteroPDP when running workloads consisting of two native CPU applications and one OpenCL application.

VI. RELATED WORK

A. Memory Interference and Management

An extensive body of prior work has studied shared resource management in the CMP domain, focusing on the capacity management of shared caches and DRAM. Mutlu and Moscibroda proposed a stall-time fair DRAM scheduling algorithm to reduce the performance degradation and unfairness caused by shared resource contention in the DRAM modules [30]. Jaleel et al. proposed a thread-aware dynamic insertion policy (TADIP) to monitor and select the insertion policy for co-located applications that share the LLC, significantly mitigating shared LLC contention [19]. Because of the performance importance of shared caches, many other works similarly proposed solutions to improve the utilization of the shared LLC [17], [33], [37], [41], [44].
Intra-application cache interference stemming from OS-related activities and hardware prefetching can also occur and degrade an application's performance. Wu and Martonosi studied the intra-application cache interference problem and proposed simple techniques to mitigate such cache contention [42]. From the scheduling aspect, Bubble-Up was designed to predict the degree of shared resource contention and to schedule services to different server nodes in datacenter execution environments [27]; this work targeted maximizing per-node load without violating specified quality-of-service constraints.

Similarly, many other prior works attack the shared cache contention problem with modified scheduler designs [10], [20]. Nevertheless, these existing solutions mainly target the homogeneous CMP domain. In contrast, this work identifies a performance improvement opportunity in a heterogeneous system and delves into a cross-ISA heterogeneous system to quantify the performance behavior under multi-level memory interference in a co-located execution environment.

B. Shared Resource Management for Heterogeneous Systems

Many commercial products integrate CPU and GPU cores into a single die. Thus, how to efficiently manage the resources shared between the multiple processors is a real and important research problem, particularly for shared last-level caches. Lee and Kim proposed a thread-level-parallelism-aware policy (TAP) to partition the shared cache for co-located CPU and GPU workloads [23]. Mekkat et al. developed an algorithm, called HeLM, to dynamically determine the priority of CPU and GPU cache accesses [29]. Kayıran et al. designed a concurrency management scheme that mitigates memory contention in a heterogeneous system by regulating the number of concurrent threads on the GPU cores [21]. García et al. quantified the impact of a shared virtual memory space between the CPU and GPU cores and suggested that developers have to redesign OpenCL programs to balance utilization between CPU and GPU cores to optimize system throughput [13]. Ausavarungnirun et al. developed a staged DRAM controller that aims to improve the fairness of CPU-GPU shared DRAM by using dedicated CPU and GPU request queues in the memory controller and treating CPU/GPU requests with different priorities [4]. None of these works, however, addressed the shared resource contention problem from the scheduling aspect by taking into account the degree of memory interference from multiple levels of the memory hierarchy.

C. OpenCL Kernel Scheduling

Furthermore, many prior works have identified that using GPUs to accelerate OpenCL kernels does not always lead to performance improvement, due to the data movement overhead [5], [15], [25], [31], [34]. In order to identify the optimal target device to run an OpenCL kernel, prior works proposed applying a variety of machine learning techniques, e.g., K-means clustering [43], support vector machines (SVMs) [39], regression models [3], and decision trees [38], to dynamically analyze and predict the behavior of an OpenCL kernel. Besides machine learning techniques, Margiolas and O'Boyle proposed using a modified OpenCL JIT compiler to analyze workload behavior at compilation time [26] and partition GPU resources between multiple OpenCL kernels. Lee and Abdelrahman designed a launch-time framework to better utilize different target execution devices by performing an additional post-compilation optimization pass at runtime [22]. However, all of these works take only the characteristics of an OpenCL kernel into account and do not consider the memory interference effects arising in a real heterogeneous system. In this work, we perform a detailed performance characterization study and highlight the importance of considering memory interference from co-located applications. We then design and implement a simple OpenCL scheduler on an experimental heterogeneous system for both the alone and co-located execution environments.
To the best of our knowledge, this is the first work that demonstrates a scheduler design that can accurately predict an OpenCL execution target for a heterogeneous system with multi-level memory interference.

VII. CONCLUSION

This paper presents a detailed performance characterization study of the multiprogrammed heterogeneous computation environment. We show that the performance of an OpenCL application can be significantly affected by co-located native CPU applications, and vice versa. Hence, a high-performing, robust OpenCL framework design should take the entire system utilization into account instead of considering only the characteristics of the OpenCL application. In order to balance the performance degradation of a heterogeneous system, we develop a light-weight and scalable performance degradation predictor (HeteroPDP), based on simple regression models. HeteroPDP can accurately select the target device in a heterogeneous system to optimize and balance the performance degradation among all co-located workloads. HeteroPDP is designed and implemented within the existing OpenCL framework, and is evaluated on a real system consisting of an Intel Core i7 CMP and an AMD FirePro GPU. Overall, HeteroPDP improves the performance of OpenCL applications by 3X by intelligently selecting the execution target between the host CMP and the GPU, whereas the always-offloading-to-GPU decision produces a 2.5X speedup. This paper shows that a simple regression model approach and the consideration of multi-level memory interference in HeteroPDP can effectively improve the scheduling decisions of OpenCL applications, leading to higher application performance and system throughput.

ACKNOWLEDGMENT

The authors would like to thank the paper shepherd Dr. Sandeep Agrawal (Oracle) and the anonymous reviewers for their useful feedback. This work is supported in part by the National Science Foundation (under grants CCF # and CCF # ).

REFERENCES

[1] Amazon, "Overview of Amazon web services," Dec. [Online].
[2] AMD, "AMD SDK: a complete development platform," Mar.
[3] N. Ardalani, C. Lestourgeon, K. Sankaralingam, and X. Zhu, "Cross-architecture performance prediction (XAPP) using CPU code to predict GPU performance," in Proc. of the 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec. 2015.
[4] R. Ausavarungnirun, K. K.-W. Chang, L. Subramanian, G. H. Loh, and O. Mutlu, "Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems," in Proc. of the 39th IEEE/ACM International Symposium on Computer Architecture (ISCA), Jun. 2012.
[5] M. E. Belviranli, F. Khorasani, L. N. Bhuyan, and R. Gupta, "CuMAS: Data transfer aware multi-application scheduling for shared GPUs," in Proc. of the 2016 ACM International Conference on Supercomputing (ICS), Jun. 2016.


More information

Quantifying the Energy Cost of Data Movement for Emerging Smartphone Workloads on Mobile Platforms

Quantifying the Energy Cost of Data Movement for Emerging Smartphone Workloads on Mobile Platforms Quantifying the Energy Cost of Data Movement for Emerging Smartphone Workloads on Mobile Platforms Arizona State University Dhinakaran Pandiyan(dpandiya@asu.edu) and Carole-Jean Wu(carole-jean.wu@asu.edu

More information

Designing High-Performance and Fair Shared Multi-Core Memory Systems: Two Approaches. Onur Mutlu March 23, 2010 GSRC

Designing High-Performance and Fair Shared Multi-Core Memory Systems: Two Approaches. Onur Mutlu March 23, 2010 GSRC Designing High-Performance and Fair Shared Multi-Core Memory Systems: Two Approaches Onur Mutlu onur@cmu.edu March 23, 2010 GSRC Modern Memory Systems (Multi-Core) 2 The Memory System The memory system

More information

Simultaneous Multithreading on Pentium 4

Simultaneous Multithreading on Pentium 4 Hyper-Threading: Simultaneous Multithreading on Pentium 4 Presented by: Thomas Repantis trep@cs.ucr.edu CS203B-Advanced Computer Architecture, Spring 2004 p.1/32 Overview Multiple threads executing on

More information

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory

More information

Understanding Outstanding Memory Request Handling Resources in GPGPUs

Understanding Outstanding Memory Request Handling Resources in GPGPUs Understanding Outstanding Memory Request Handling Resources in GPGPUs Ahmad Lashgar ECE Department University of Victoria lashgar@uvic.ca Ebad Salehi ECE Department University of Victoria ebads67@uvic.ca

More information

Improving Real-Time Performance on Multicore Platforms Using MemGuard

Improving Real-Time Performance on Multicore Platforms Using MemGuard Improving Real-Time Performance on Multicore Platforms Using MemGuard Heechul Yun University of Kansas 2335 Irving hill Rd, Lawrence, KS heechul@ittc.ku.edu Abstract In this paper, we present a case-study

More information

Row Buffer Locality Aware Caching Policies for Hybrid Memories. HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu

Row Buffer Locality Aware Caching Policies for Hybrid Memories. HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu Row Buffer Locality Aware Caching Policies for Hybrid Memories HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu Executive Summary Different memory technologies have different

More information

Position Paper: OpenMP scheduling on ARM big.little architecture

Position Paper: OpenMP scheduling on ARM big.little architecture Position Paper: OpenMP scheduling on ARM big.little architecture Anastasiia Butko, Louisa Bessad, David Novo, Florent Bruguier, Abdoulaye Gamatié, Gilles Sassatelli, Lionel Torres, and Michel Robert LIRMM

More information

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Min Kyu Jeong, Doe Hyun Yoon^, Dam Sunwoo*, Michael Sullivan, Ikhwan Lee, and Mattan Erez The University of Texas at Austin Hewlett-Packard

More information

Transparent Offloading and Mapping (TOM) Enabling Programmer-Transparent Near-Data Processing in GPU Systems Kevin Hsieh

Transparent Offloading and Mapping (TOM) Enabling Programmer-Transparent Near-Data Processing in GPU Systems Kevin Hsieh Transparent Offloading and Mapping () Enabling Programmer-Transparent Near-Data Processing in GPU Systems Kevin Hsieh Eiman Ebrahimi, Gwangsun Kim, Niladrish Chatterjee, Mike O Connor, Nandita Vijaykumar,

More information

When MPPDB Meets GPU:

When MPPDB Meets GPU: When MPPDB Meets GPU: An Extendible Framework for Acceleration Laura Chen, Le Cai, Yongyan Wang Background: Heterogeneous Computing Hardware Trend stops growing with Moore s Law Fast development of GPU

More information

Managing GPU Concurrency in Heterogeneous Architectures

Managing GPU Concurrency in Heterogeneous Architectures Managing Concurrency in Heterogeneous Architectures Onur Kayıran, Nachiappan CN, Adwait Jog, Rachata Ausavarungnirun, Mahmut T. Kandemir, Gabriel H. Loh, Onur Mutlu, Chita R. Das Era of Heterogeneous Architectures

More information

Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies

Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies Alexander Merritt, Vishakha Gupta, Abhishek Verma, Ada Gavrilovska, Karsten Schwan {merritt.alex,abhishek.verma}@gatech.edu {vishakha,ada,schwan}@cc.gtaech.edu

More information

Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. Yoongu Kim Michael Papamichael Onur Mutlu Mor Harchol-Balter

Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. Yoongu Kim Michael Papamichael Onur Mutlu Mor Harchol-Balter Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior Yoongu Kim Michael Papamichael Onur Mutlu Mor Harchol-Balter Motivation Memory is a shared resource Core Core Core Core

More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,

More information

Elaborazione dati real-time su architetture embedded many-core e FPGA

Elaborazione dati real-time su architetture embedded many-core e FPGA Elaborazione dati real-time su architetture embedded many-core e FPGA DAVIDE ROSSI A L E S S A N D R O C A P O T O N D I G I U S E P P E T A G L I A V I N I A N D R E A M A R O N G I U C I R I - I C T

More information

The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory

The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory Lavanya Subramanian* Vivek Seshadri* Arnab Ghosh* Samira Khan*

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

PAGE PLACEMENT STRATEGIES FOR GPUS WITHIN HETEROGENEOUS MEMORY SYSTEMS

PAGE PLACEMENT STRATEGIES FOR GPUS WITHIN HETEROGENEOUS MEMORY SYSTEMS PAGE PLACEMENT STRATEGIES FOR GPUS WITHIN HETEROGENEOUS MEMORY SYSTEMS Neha Agarwal* David Nellans Mark Stephenson Mike O Connor Stephen W. Keckler NVIDIA University of Michigan* ASPLOS 2015 EVOLVING GPU

More information

high performance medical reconstruction using stream programming paradigms

high performance medical reconstruction using stream programming paradigms high performance medical reconstruction using stream programming paradigms This Paper describes the implementation and results of CT reconstruction using Filtered Back Projection on various stream programming

More information

Efficient Hardware Acceleration on SoC- FPGA using OpenCL

Efficient Hardware Acceleration on SoC- FPGA using OpenCL Efficient Hardware Acceleration on SoC- FPGA using OpenCL Advisor : Dr. Benjamin Carrion Schafer Susmitha Gogineni 30 th August 17 Presentation Overview 1.Objective & Motivation 2.Configurable SoC -FPGA

More information

CSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore

CSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Status Quo Previously, CPU vendors

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

A Simple Model for Estimating Power Consumption of a Multicore Server System

A Simple Model for Estimating Power Consumption of a Multicore Server System , pp.153-160 http://dx.doi.org/10.14257/ijmue.2014.9.2.15 A Simple Model for Estimating Power Consumption of a Multicore Server System Minjoong Kim, Yoondeok Ju, Jinseok Chae and Moonju Park School of

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

Computer Architecture Lecture 24: Memory Scheduling

Computer Architecture Lecture 24: Memory Scheduling 18-447 Computer Architecture Lecture 24: Memory Scheduling Prof. Onur Mutlu Presented by Justin Meza Carnegie Mellon University Spring 2014, 3/31/2014 Last Two Lectures Main Memory Organization and DRAM

More information

Understanding Reduced-Voltage Operation in Modern DRAM Devices

Understanding Reduced-Voltage Operation in Modern DRAM Devices Understanding Reduced-Voltage Operation in Modern DRAM Devices Experimental Characterization, Analysis, and Mechanisms Kevin Chang A. Giray Yaglikci, Saugata Ghose,Aditya Agrawal *, Niladrish Chatterjee

More information

Improving Virtual Machine Scheduling in NUMA Multicore Systems

Improving Virtual Machine Scheduling in NUMA Multicore Systems Improving Virtual Machine Scheduling in NUMA Multicore Systems Jia Rao, Xiaobo Zhou University of Colorado, Colorado Springs Kun Wang, Cheng-Zhong Xu Wayne State University http://cs.uccs.edu/~jrao/ Multicore

More information

Contention-Aware Scheduling of Parallel Code for Heterogeneous Systems

Contention-Aware Scheduling of Parallel Code for Heterogeneous Systems Contention-Aware Scheduling of Parallel Code for Heterogeneous Systems Chris Gregg Jeff S. Brantley Kim Hazelwood Department of Computer Science, University of Virginia Abstract A typical consumer desktop

More information

Flexible Architecture Research Machine (FARM)

Flexible Architecture Research Machine (FARM) Flexible Architecture Research Machine (FARM) RAMP Retreat June 25, 2009 Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan Bronson Christos Kozyrakis, Kunle Olukotun Motivation Why CPUs + FPGAs make sense

More information

IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM

IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM I5 AND I7 PROCESSORS Juan M. Cebrián 1 Lasse Natvig 1 Jan Christian Meyer 2 1 Depart. of Computer and Information

More information

General Purpose GPU Computing in Partial Wave Analysis

General Purpose GPU Computing in Partial Wave Analysis JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

Practical Near-Data Processing for In-Memory Analytics Frameworks

Practical Near-Data Processing for In-Memory Analytics Frameworks Practical Near-Data Processing for In-Memory Analytics Frameworks Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University http://mast.stanford.edu PACT Oct 19, 2015 Motivating Trends End of Dennard

More information

INTERFERENCE FROM GPU SYSTEM SERVICE REQUESTS

INTERFERENCE FROM GPU SYSTEM SERVICE REQUESTS INTERFERENCE FROM GPU SYSTEM SERVICE REQUESTS ARKAPRAVA BASU, JOSEPH L. GREATHOUSE, GURU VENKATARAMANI, JÁN VESELÝ AMD RESEARCH, ADVANCED MICRO DEVICES, INC. MODERN SYSTEMS ARE POWERED BY HETEROGENEITY

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

COL862 Programming Assignment-1

COL862 Programming Assignment-1 Submitted By: Rajesh Kedia (214CSZ8383) COL862 Programming Assignment-1 Objective: Understand the power and energy behavior of various benchmarks on different types of x86 based systems. We explore a laptop,

More information

XPU A Programmable FPGA Accelerator for Diverse Workloads

XPU A Programmable FPGA Accelerator for Diverse Workloads XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for

More information

Energy-Efficiency Prediction of Multithreaded Workloads on Heterogeneous Composite Cores Architectures using Machine Learning Techniques

Energy-Efficiency Prediction of Multithreaded Workloads on Heterogeneous Composite Cores Architectures using Machine Learning Techniques Energy-Efficiency Prediction of Multithreaded Workloads on Heterogeneous Composite Cores Architectures using Machine Learning Techniques Hossein Sayadi Department of Electrical and Computer Engineering

More information

Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh

Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh Accelerating Pointer Chasing in 3D-Stacked : Challenges, Mechanisms, Evaluation Kevin Hsieh Samira Khan, Nandita Vijaykumar, Kevin K. Chang, Amirali Boroumand, Saugata Ghose, Onur Mutlu Executive Summary

More information

AMD Fusion APU: Llano. Marcello Dionisio, Roman Fedorov Advanced Computer Architectures

AMD Fusion APU: Llano. Marcello Dionisio, Roman Fedorov Advanced Computer Architectures AMD Fusion APU: Llano Marcello Dionisio, Roman Fedorov Advanced Computer Architectures Outline Introduction AMD Llano architecture AMD Llano CPU core AMD Llano GPU Memory access management Turbo core technology

More information

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Based on papers by: A.Fedorova, M.Seltzer, C.Small, and D.Nussbaum Pisa November 6, 2006 Multithreaded Chip

More information

The Affinity Effects of Parallelized Libraries in Concurrent Environments. Abstract

The Affinity Effects of Parallelized Libraries in Concurrent Environments. Abstract The Affinity Effects of Parallelized Libraries in Concurrent Environments FABIO LICHT, BRUNO SCHULZE, LUIS E. BONA, AND ANTONIO R. MURY 1 Federal University of Parana (UFPR) licht@lncc.br Abstract The

More information

Understanding GPGPU Vector Register File Usage

Understanding GPGPU Vector Register File Usage Understanding GPGPU Vector Register File Usage Mark Wyse AMD Research, Advanced Micro Devices, Inc. Paul G. Allen School of Computer Science & Engineering, University of Washington AGENDA GPU Architecture

More information

RUNTIME SUPPORT FOR ADAPTIVE SPATIAL PARTITIONING AND INTER-KERNEL COMMUNICATION ON GPUS

RUNTIME SUPPORT FOR ADAPTIVE SPATIAL PARTITIONING AND INTER-KERNEL COMMUNICATION ON GPUS RUNTIME SUPPORT FOR ADAPTIVE SPATIAL PARTITIONING AND INTER-KERNEL COMMUNICATION ON GPUS Yash Ukidave, Perhaad Mistry, Charu Kalra, Dana Schaa and David Kaeli Department of Electrical and Computer Engineering

More information

Performance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA

Performance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA Performance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA Pak Lui, Gilad Shainer, Brian Klaff Mellanox Technologies Abstract From concept to

More information

Messaging Overview. Introduction. Gen-Z Messaging

Messaging Overview. Introduction. Gen-Z Messaging Page 1 of 6 Messaging Overview Introduction Gen-Z is a new data access technology that not only enhances memory and data storage solutions, but also provides a framework for both optimized and traditional

More information

EXPLORING PARALLEL PROCESSING OPPORTUNITIES IN AERMOD. George Delic * HiPERiSM Consulting, LLC, Durham, NC, USA

EXPLORING PARALLEL PROCESSING OPPORTUNITIES IN AERMOD. George Delic * HiPERiSM Consulting, LLC, Durham, NC, USA EXPLORING PARALLEL PROCESSING OPPORTUNITIES IN AERMOD George Delic * HiPERiSM Consulting, LLC, Durham, NC, USA 1. INTRODUCTION HiPERiSM Consulting, LLC, has a mission to develop (or enhance) software and

More information

Intelligent Scheduling and Memory Management Techniques. for Modern GPU Architectures. Shin-Ying Lee

Intelligent Scheduling and Memory Management Techniques. for Modern GPU Architectures. Shin-Ying Lee Intelligent Scheduling and Memory Management Techniques for Modern GPU Architectures by Shin-Ying Lee A Dissertation Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy

More information

EECS750: Advanced Operating Systems. 2/24/2014 Heechul Yun

EECS750: Advanced Operating Systems. 2/24/2014 Heechul Yun EECS750: Advanced Operating Systems 2/24/2014 Heechul Yun 1 Administrative Project Feedback of your proposal will be sent by Wednesday Midterm report due on Apr. 2 3 pages: include intro, related work,

More information

Architectural and Runtime Enhancements for Dynamically Controlled Multi-Level Concurrency on GPUs

Architectural and Runtime Enhancements for Dynamically Controlled Multi-Level Concurrency on GPUs Architectural and Runtime Enhancements for Dynamically Controlled Multi-Level Concurrency on GPUs A Dissertation Presented by Yash Ukidave to The Department of Electrical and Computer Engineering in partial

More information

Architecture, Programming and Performance of MIC Phi Coprocessor

Architecture, Programming and Performance of MIC Phi Coprocessor Architecture, Programming and Performance of MIC Phi Coprocessor JanuszKowalik, Piotr Arłukowicz Professor (ret), The Boeing Company, Washington, USA Assistant professor, Faculty of Mathematics, Physics

More information

Lecture 1: Introduction

Lecture 1: Introduction Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline

More information

Addendum to Efficiently Enabling Conventional Block Sizes for Very Large Die-stacked DRAM Caches

Addendum to Efficiently Enabling Conventional Block Sizes for Very Large Die-stacked DRAM Caches Addendum to Efficiently Enabling Conventional Block Sizes for Very Large Die-stacked DRAM Caches Gabriel H. Loh Mark D. Hill AMD Research Department of Computer Sciences Advanced Micro Devices, Inc. gabe.loh@amd.com

More information

NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems

NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems Carl Pearson 1, I-Hsin Chung 2, Zehra Sura 2, Wen-Mei Hwu 1, and Jinjun Xiong 2 1 University of Illinois Urbana-Champaign, Urbana

More information

Introduction to Xeon Phi. Bill Barth January 11, 2013

Introduction to Xeon Phi. Bill Barth January 11, 2013 Introduction to Xeon Phi Bill Barth January 11, 2013 What is it? Co-processor PCI Express card Stripped down Linux operating system Dense, simplified processor Many power-hungry operations removed Wider

More information

IN modern systems, the high latency of accessing largecapacity

IN modern systems, the high latency of accessing largecapacity IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 27, NO. 10, OCTOBER 2016 3071 BLISS: Balancing Performance, Fairness and Complexity in Memory Access Scheduling Lavanya Subramanian, Donghyuk

More information

MediaTek CorePilot 2.0. Delivering extreme compute performance with maximum power efficiency

MediaTek CorePilot 2.0. Delivering extreme compute performance with maximum power efficiency MediaTek CorePilot 2.0 Heterogeneous Computing Technology Delivering extreme compute performance with maximum power efficiency In July 2013, MediaTek delivered the industry s first mobile system on a chip

More information

Heterogeneous Processing Systems. Heterogeneous Multiset of Homogeneous Arrays (Multi-multi-core)

Heterogeneous Processing Systems. Heterogeneous Multiset of Homogeneous Arrays (Multi-multi-core) Heterogeneous Processing Systems Heterogeneous Multiset of Homogeneous Arrays (Multi-multi-core) Processing Heterogeneity CPU (x86, SPARC, PowerPC) GPU (AMD/ATI, NVIDIA) DSP (TI, ADI) Vector processors

More information

Fundamentals of Quantitative Design and Analysis

Fundamentals of Quantitative Design and Analysis Fundamentals of Quantitative Design and Analysis Dr. Jiang Li Adapted from the slides provided by the authors Computer Technology Performance improvements: Improvements in semiconductor technology Feature

More information

Computing architectures Part 2 TMA4280 Introduction to Supercomputing

Computing architectures Part 2 TMA4280 Introduction to Supercomputing Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:

More information

AB-Aware: Application Behavior Aware Management of Shared Last Level Caches

AB-Aware: Application Behavior Aware Management of Shared Last Level Caches AB-Aware: Application Behavior Aware Management of Shared Last Level Caches Suhit Pai, Newton Singh and Virendra Singh Computer Architecture and Dependable Systems Laboratory Department of Electrical Engineering

More information

Exploring GPU Architecture for N2P Image Processing Algorithms

Exploring GPU Architecture for N2P Image Processing Algorithms Exploring GPU Architecture for N2P Image Processing Algorithms Xuyuan Jin(0729183) x.jin@student.tue.nl 1. Introduction It is a trend that computer manufacturers provide multithreaded hardware that strongly

More information

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy

More information

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures W.M. Roshan Weerasuriya and D.N. Ranasinghe University of Colombo School of Computing A Comparative

More information

DEMM: a Dynamic Energy-saving mechanism for Multicore Memories

DEMM: a Dynamic Energy-saving mechanism for Multicore Memories DEMM: a Dynamic Energy-saving mechanism for Multicore Memories Akbar Sharifi, Wei Ding 2, Diana Guttman 3, Hui Zhao 4, Xulong Tang 5, Mahmut Kandemir 5, Chita Das 5 Facebook 2 Qualcomm 3 Intel 4 University

More information

Heterogeneous platforms

Heterogeneous platforms Heterogeneous platforms Systems combining main processors and accelerators e.g., CPU + GPU, CPU + Intel MIC, AMD APU, ARM SoC Any platform using a GPU is a heterogeneous platform! Further in this talk

More information

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Performance COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals What is Performance? How do we measure the performance of

More information

Survey results. CS 6354: Memory Hierarchy I. Variety in memory technologies. Processor/Memory Gap. SRAM approx. 4 6 transitors/bit optimized for speed

Survey results. CS 6354: Memory Hierarchy I. Variety in memory technologies. Processor/Memory Gap. SRAM approx. 4 6 transitors/bit optimized for speed Survey results CS 6354: Memory Hierarchy I 29 August 2016 1 2 Processor/Memory Gap Variety in memory technologies SRAM approx. 4 6 transitors/bit optimized for speed DRAM approx. 1 transitor + capacitor/bit

More information

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance 11 th International LS-DYNA Users Conference Computing Technology LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance Gilad Shainer 1, Tong Liu 2, Jeff Layton

More information

Characterizing Multi-threaded Applications based on Shared-Resource Contention

Characterizing Multi-threaded Applications based on Shared-Resource Contention Characterizing Multi-threaded Applications based on Shared-Resource Contention Tanima Dey Wei Wang Jack W. Davidson Mary Lou Soffa Department of Computer Science University of Virginia Charlottesville,

More information