Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference


This is a pre-print, author's version of the paper to appear in the IEEE International Symposium on Workload Characterization (IISWC), 2017.

Shin-Ying Lee and Carole-Jean Wu
School of Computing, Informatics, and Decision Systems Engineering, Arizona State University
{lee.shin-ying,carole-jean.wu}@asu.edu

Abstract: Modern computer systems are accelerator-rich, equipped with many types of hardware accelerators to speed up computation. For example, graphics processing units (GPUs) are a type of accelerator widely employed to accelerate parallel workloads. To utilize different accelerators well, for better execution time speedup or lower total energy consumption, many scheduling algorithms have been proposed to select the optimal target device to process an OpenCL kernel according to the kernel's individual characteristics. However, in a real computer system, many workloads are co-located on a single machine and are processed on different devices simultaneously. The CPU cores and accelerators may contend for shared resources, such as the host main memory and the shared last-level cache. Thus, scheduling an OpenCL kernel by considering only its own characteristics is not robust. To maximize system throughput, it is important to consider the execution behavior of all co-located applications when scheduling OpenCL kernel execution. In this paper, we provide a detailed characterization study demonstrating that scheduling an OpenCL kernel to run on different devices can introduce varying performance impact to itself and to the other co-located applications due to memory interference. Based on the characterization results, we then develop a light-weight, scalable performance degradation predictor designed specifically for heterogeneous computer systems, called HeteroPDP. HeteroPDP dynamically predicts and balances the execution time slowdown of all co-located applications in a heterogeneous computation environment. Our real-system evaluation results show that, compared with always running an OpenCL kernel on the host CPU, HeteroPDP achieves a 3X execution time speedup when an OpenCL kernel runs alone and improves system fairness from 24% to 65% when an OpenCL kernel is co-located with other applications.

I. INTRODUCTION

Hardware accelerators are increasingly used to improve application performance, system throughput, and energy efficiency in modern computing platforms [45]. For instance, graphics processing units (GPUs), with the key design feature of massive multithreading, are widely deployed on high performance computing (HPC) clusters to speed up the execution of general-purpose parallel workloads. Figure 1 illustrates an example of an accelerator-rich heterogeneous system which comprises a general-purpose chip-multiprocessor (CMP) and several types of hardware accelerators. With the availability of a unified programming interface, general-purpose computations can be fluidly offloaded to the different accelerator devices to maximize application performance or energy efficiency. Open Computing Language (OpenCL) is a framework that offers this computation offloading capability.

Fig. 1. An example of an accelerator-rich heterogeneous computer system. The machine comprises four CPU cores (CPU 0-3) with a shared last-level cache, connected through a northbridge/interconnect to the DRAM and, behind a PCIe controller, to a GPU, DSP, and FPGA ("Where to run?"). All the CPU cores share the last-level cache, whereas all the CPU cores and accelerators share the interconnect, PCIe controller, and main memory. An acceleratable application, e.g., machine learning, can be scheduled to run on the CPU or an accelerator.
Applications written in OpenCL can run on a collection of accelerators or devices of different instruction set architectures (ISAs) that support the standard. Thus, depending on the application requirements, the optimization goals, and the performance and power characteristics of the available devices, an intelligent OpenCL scheduler can schedule segments of the application (kernels) onto different OpenCL-enabled devices to improve application execution time. State-of-the-art approaches, such as [3], [38], [39], build predictive models to determine an optimal execution target among all available accelerators of different compute and power characteristics. However, these prior works focus on scheduling algorithm designs for a single application only and do not consider important, realistic runtime effects, such as memory interference, stemming from background processes, operating-system activities, or co-located applications. In a realistic execution environment, many processes run concurrently, and a number of native CPU applications can be co-located on the same system at the same time. For example, in an on-demand cloud computing environment, e.g., Amazon Web Services (AWS) [1], Google Cloud [14], and Microsoft Azure [9], compute nodes simultaneously service multiple applications or host multiple virtual machines running native CPU applications as well as acceleratable applications.
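To make the offloading mechanics concrete, the following is a minimal sketch of run-time device selection using the standard OpenCL host API. The helper function and its fallback behavior are illustrative assumptions, not the paper's scheduler; only the clGetPlatformIDs/clGetDeviceIDs calls are standard API.

```cpp
// Minimal sketch: enumerate OpenCL platforms and pick a device of the
// requested type (CPU or GPU). Error handling is elided; in a scheduler
// such as the one this paper proposes, device_type would be the output
// of the scheduling decision rather than a fixed argument.
#include <CL/cl.h>
#include <cstdio>

cl_device_id pick_device(cl_device_type device_type) {
    cl_uint num_platforms = 0;
    clGetPlatformIDs(0, NULL, &num_platforms);           // count platforms
    cl_platform_id platforms[8];
    clGetPlatformIDs(num_platforms < 8 ? num_platforms : 8, platforms, NULL);

    for (cl_uint p = 0; p < num_platforms && p < 8; ++p) {
        cl_device_id dev;
        cl_uint num_devices = 0;
        // Ask each platform for one device of the requested type.
        if (clGetDeviceIDs(platforms[p], device_type, 1, &dev,
                           &num_devices) == CL_SUCCESS && num_devices > 0)
            return dev;
    }
    return NULL;  // no such device on this machine
}

int main() {
    // The same kernel source can later be built for either target.
    cl_device_id cpu = pick_device(CL_DEVICE_TYPE_CPU);
    cl_device_id gpu = pick_device(CL_DEVICE_TYPE_GPU);
    std::printf("CPU target: %s, GPU target: %s\n",
                cpu ? "found" : "none", gpu ? "found" : "none");
    return 0;
}
```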

These compute nodes are also well equipped with a wide range of accelerators, such as high-performance GPUs or field-programmable gate arrays (FPGAs), and offer computation offloading and acceleration opportunities. In such an execution environment, co-located applications contend for shared resources in the memory subsystem and suffer varying degrees of performance degradation from memory interference. Thus, existing OpenCL schedulers that consider only the characteristics of the application itself, and do not take into account memory interference from co-located workloads, are not robust and provide sub-optimal system throughput and fairness. To understand the need for an intelligent scheduler that can accurately decide which execution target an application should run on in the presence of memory interference, we perform detailed performance characterization studies for a diverse set of OpenCL applications, alone and with co-located applications. Corroborating recent prior findings, our studies show that, for individual application execution scenarios, the optimal execution target switches between the host CMP and the discrete GPU (Section III-A). The decision is influenced by the degree of parallelism and divergence in an application and by the amount of data movement overhead between the host system and the selected accelerator. Furthermore, we demonstrate that the optimal execution target switches based on the degree of memory interference from the co-located applications or processes (Section III-B). More importantly, the room for performance improvement is substantial, motivating a scheduler design that considers the effect of memory interference. We take a step beyond the large-scale performance characterization studies and propose a simple, light-weight performance degradation predictor, called HeteroPDP. Unlike existing performance estimators, such as [3], HeteroPDP is tailored for heterogeneous systems with multiple levels of memory interference. In the presence of co-located CPU applications on the host system, HeteroPDP predicts the respective execution time slowdown factors for an OpenCL application and either schedules the application onto the remaining cores in the host CMP or offloads it to the GPU accelerator to maximize overall system throughput or fairness. HeteroPDP is implemented on a real heterogeneous system setup and is integrated into the existing OpenCL device driver. Our real-system evaluation results show execution target prediction accuracies of 80% and 72% for the alone and co-located scenarios, respectively. These prediction accuracies translate into a significant application execution time speedup of 3X (alone), and system fairness improves from 24% to 65% (co-located). In summary, this paper makes the following key contributions. We observe that, with different optimization goals, the optimal target device to process an OpenCL kernel may switch in an accelerator-rich heterogeneous computer system. We demonstrate that the multi-level memory interference in a heterogeneous system significantly influences the scheduling decision of OpenCL applications in co-located execution.
We present HeteroPDP, a light-weight, flexible prediction scheme that accurately predicts system performance degradation and selects the optimal target device to process a kernel, depending on the optimization goal, in a heterogeneous system.

II. EXPERIMENTAL METHODOLOGY

This section introduces the experimental setup for the performance characterization studies and the design evaluation on a real heterogeneous computer system.

A. Experiment Infrastructure and Configurations

To explore memory interference and performance degradation in a heterogeneous multiprogrammed environment, we build a system that comprises an Intel Core i7 processor (a quad-core CMP with an 8MB shared last-level cache) and an AMD GCN discrete GPU card attached via a PCIe x16 bus. On this system, the host processor and the GPU card share the same host DRAM controller and main memory modules. Both the CMP cores and the GPU card are OpenCL-compatible and are able to execute OpenCL programs. The detailed experiment setup and system configurations are presented in Table I. To collect application-specific information for performance prediction, we instrument the OpenCL JIT compiler to generate static information, e.g., the static instruction count (Section IV), as input to the HeteroPDP predictors. To collect runtime system resource utilization information, such as the last-level cache miss count, we integrate Intel's Performance Counter Monitor toolkit (PCM) [40] into HeteroPDP to periodically sample system resource utilization at runtime.

B. Workload Construction

We use a wide range of workloads exhibiting varying execution behavior for the performance characterization studies. We use applications from the SPEC2006 benchmark suite with the reference dataset to represent the native CPU workloads [16]. These applications introduce varying degrees of shared-resource pressure on the memory subsystem. We classify these CPU applications into two categories, computation-intensive or memory-intensive, based on the average misses per kilo-instruction (MPKI) [24].
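As a concrete illustration of this kind of counter sampling and MPKI-based classification, the sketch below uses the public Intel PCM C++ API as we understand it; the 1-second sample period mirrors the sampling interval described in Section IV-C, while the 10-MPKI classification threshold is our illustrative assumption, not a value from the paper.

```cpp
// Sketch of periodic utilization sampling with Intel PCM. Function names
// follow the public PCM library (cpucounters.h); the MPKI threshold is an
// illustrative assumption for the memory- vs. computation-intensive split.
#include "cpucounters.h"  // Intel Performance Counter Monitor
#include <chrono>
#include <cstdio>
#include <thread>

int main() {
    PCM *pcm = PCM::getInstance();
    if (pcm->program() != PCM::Success) return 1;  // program the counters

    SystemCounterState before = getSystemCounterState();
    std::this_thread::sleep_for(std::chrono::seconds(1));  // 1s sample period
    SystemCounterState after = getSystemCounterState();

    uint64 misses = getL3CacheMisses(before, after);
    uint64 instrs = getInstructionsRetired(before, after);
    double mpki = 1000.0 * (double)misses / (double)instrs;

    // Classify the running workload in the spirit of Section II-B.
    std::printf("LLC MPKI = %.2f -> %s intensive\n", mpki,
                mpki > 10.0 ? "memory" : "computation");
    pcm->cleanup();
    return 0;
}
```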

We take various applications from the AMD SDK [2], Intel SDK [18], Hetero-Mark [34], Pannotia [6], Rodinia [7], [8], SHOC [11], and XSBench [35] benchmark suites to evaluate the behavior of OpenCL applications. Due to the resolution of the performance counters used in HeteroPDP, we do not use OpenCL applications that finish in under 2 seconds and focus our studies on the longer-running OpenCL applications. Table II lists the benchmarks used in this paper. For the co-located execution scenario, we construct workload combinations by pairing one native CPU application with one OpenCL application, which results in 6 × 26 = 156 multiprogrammed workloads. To study the scalability of HeteroPDP, we increase the number of native CPU applications and synthesize an additional 38 multiprogrammed workloads, each consisting of two SPEC applications and one OpenCL application from the listed benchmarks. To prevent our experimental machine from thermal throttling, the 38 workloads are the combinations that complete within 5 minutes.

TABLE I
MEMORY INTERFERENCE INFRASTRUCTURE SETUP AND CONFIGURATIONS.

Host CPU: Intel Core i7 x86-64 CPU; 4 cores; 3.4GHz core frequency; 8MB shared LLC; turbo boost disabled; hyperthreading disabled
Host DRAM: DDR; 2 channels; 22GB/s max available bandwidth
Accelerator (GPU): AMD FirePro S9150 GCN GPU; 44 compute units (CUs); 900MHz core frequency; PCIe x16, 8GT/s
GPU DRAM: GDDR5 with ECC; 512-bit width; 320GB/s max available bandwidth
Software Runtime: Ubuntu Linux, kernel v4.4.0; clang/clang++ v3.8.0; Intel PCM v2.11; Intel OpenCL driver; AMD OpenCL driver

TABLE II
WORKLOADS USED IN THE PERFORMANCE CHARACTERIZATION STUDIES AND DESIGN EVALUATION. THE ASTERISK SYMBOL INDICATES HIGH MEMORY INTENSITY.

Native CPU applications, SPEC2006 [16]: bzip2, calculix, lbm*, mcf*, perlbench, xalancbmk
OpenCL applications:
AMD SDK [2]: AutoCluster*, Binomial*, BlackScholes, Histogram*, LUDecomposition, MonteCarloAS
Hetero-Mark [34]: AES, FIR, KMN*, PR
Intel SDK [18]: Bitonics*, GEMM*, MedianFilter, MonteCarlo
Pannotia [6]: bc*, csr*, ell*
Rodinia [7], [8]: cfd*, gaussian*, heartwall, kmeans*, leukocyte, pathfinder, streamcluster*
SHOC [11]: s3d
XSBench [35]: XSBench*

III. MOTIVATION FOR AN INTELLIGENT APPLICATION EXECUTION TARGET SCHEDULER

We begin this section with performance characterization and analysis for the alone and co-located execution scenarios. In the alone case, an OpenCL application is the sole application running on the heterogeneous system and is dispatched onto one execution target among all available processors and accelerators. In contrast, in the co-located case, an OpenCL application shares the heterogeneous system with other native CPU applications. Section III-A shows that there is significant room for performance improvement depending on which execution target an OpenCL application runs on, for both the alone and co-located cases. Section III-B shows that the optimal execution target switches for OpenCL applications in the presence of memory interference from a memory-intensive co-located application. Then, Sections III-C and III-D present more detailed fairness characterization for the co-located case and for different scheduling priorities imposed on the concurrent applications.

A. Performance Characterization for alone and co-located

Fig. 2. The average execution time speedup of running OpenCL applications alone, and the execution time slowdown fairness of co-located execution, on a quad-core CPU, on the GPU, and on the optimal target between the CPU and GPU devices.

Offloading an OpenCL application onto a hardware accelerator does not always lead to performance improvement or energy reduction, mainly for three reasons. First, performing computations on an accelerator often requires moving a considerable amount of data between the host system and the accelerator and synchronizing the execution, both of which are expensive in terms of execution time and energy consumption [5], [15], [25], [31], [34].
Second, to make shared data accessible to the host CPU as well as the hardware accelerators, the device driver or operating system has to frequently modify the page tables and translation lookaside buffers (TLBs) to remap the data into different memory spaces, which can introduce very long operation latencies [36].

Third, the OpenCL JIT compiler is not always able to transform the OpenCL kernel code to fully utilize the dedicated target accelerator, making performance sub-optimal [38]. Consequently, offloading computations onto an accelerator may instead degrade application performance and incur higher energy dissipation.

Figure 2 shows the system performance of running an OpenCL application on the Intel CMP and on the discrete GPU card, alone and co-located, averaged across the 26 OpenCL applications. The horizontal axis indicates the execution target of the OpenCL application, whereas the y-axis represents system performance: execution time speedup for alone and fairness¹ for co-located. Figure 2(a) shows that, although offloading the OpenCL application to the GPU achieves an impressive speedup on average compared with the CMP execution target, there is still room for performance improvement: with oracle execution target information, application performance can be further improved by an average of 50%. Figure 2(b) shows a similar performance trend for co-located. Thus, to maximize system performance, an intelligent execution target scheduler is needed for both the alone and co-located execution scenarios.

B. Optimal Execution Target Varies in the Presence of Memory Interference

We delve deeper into a few workload combinations to illustrate that the optimal OpenCL execution target varies in the presence of memory interference from a memory-intensive co-located application. In this study, we use mcf as the memory-intensive application running on the CMP. When an OpenCL application is co-located with mcf on the CMP, shared last-level cache contention degrades application performance, whereas when the OpenCL application is offloaded to the GPU, performance degradation comes from a different level of the memory hierarchy, i.e., the DRAM memory bandwidth: the already expensive data transfer cost of OpenCL offloading is exacerbated.

Fig. 3. The execution time speedup of five OpenCL applications (FIR, BIT, HIS, XSB, KMN) when running (a) alone and (b) co-located with the native CPU application mcf. The labels on top of the bars indicate the optimal target device based on execution time speedup.

Figure 3(a) shows the execution time speedup of five different OpenCL applications alone on the CMP versus on the GPU accelerator and on the optimal, higher-performing execution target. Figure 3(b) shows the execution time speedup of the same OpenCL applications co-located with mcf and the corresponding optimal execution target. The optimal execution target changes for three of the five OpenCL applications, i.e., BIT, HIS, and XSB. It is clear that the decision depends on the memory intensities of, and the interference between, the OpenCL and co-located workloads. Hence, considering only the features of an OpenCL application is insufficient to maximize application and system performance; it is crucial for an intelligent execution target scheduler to take into account the characteristics of all co-located applications.

¹ Fairness is a commonly-used metric to evaluate the execution time slowdown of multiprogrammed execution [4], [12], [30] and is defined as the ratio of the minimum and the maximum slowdown among all concurrent applications.
C. Large-scale Performance Degradation Characterization with Different Co-location Scenarios

To fairly evaluate overall system performance, the fairness metric is commonly used for co-located workloads in multiprogrammed execution [4], [12], [30]. Fairness is defined as follows:

Fairness = min_i(Slowdown_i) / max_i(Slowdown_i)    (1)

where i ranges over the co-located applications and Slowdown_i is the ratio of application i's execution time in co-located to that in alone. The goal of using fairness as the optimization target is to ensure a fine balance of slowdown among all co-located applications; a fairness of 1 represents a system with equal slowdown among all co-located workloads.

Figure 4 shows the execution target preference of the OpenCL application in the co-located scenario for all 156 workload combinations in this study. The x-axis represents the workload combinations while the y-axis represents the ratio of fairness when the OpenCL application runs on the CMP versus on the GPU. The data points are sorted by this fairness ratio in increasing order. We observe that the fairness ratio varies significantly, from 0.01 to 100. For a large number of workload combinations (toward either end of the curve), there is a clear OpenCL execution target preference.

D. Large-scale Performance Degradation Characterization with Different Scheduling Priorities

Real-time constraints and process scheduling priorities can affect the scheduling decision as well. Many interrupt services, for example, must be handled by the host processor under a hard real-time deadline. To evaluate how scheduling priorities influence the scheduling decision of an OpenCL application and affect overall system performance, we adopt the weighted slowdown metric [12] and use it to calculate fairness, defined as:

WeightedFairness = min_i(WeightedSlowdown_i) / max_i(WeightedSlowdown_i)    (2)
WeightedSlowdown_i = Slowdown_i × weight_i

where weight_i is the scheduling weight given to process i.
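Equations 1 and 2 translate directly into code; the following minimal sketch assumes only the definitions above (slowdowns measured as T_co-located / T_alone, weights encoding scheduling priority).

```cpp
// Sketch of the fairness metrics from Equations 1 and 2. Unweighted
// fairness (Equation 1) is the same computation with all weights = 1.
#include <algorithm>
#include <vector>

double weighted_fairness(const std::vector<double> &slowdowns,
                         const std::vector<double> &weights) {
    double lo = 1e300, hi = 0.0;
    for (size_t i = 0; i < slowdowns.size(); ++i) {
        double ws = slowdowns[i] * weights[i];  // WeightedSlowdown_i
        lo = std::min(lo, ws);
        hi = std::max(hi, ws);
    }
    return lo / hi;  // 1.0 = perfectly balanced slowdown across apps
}
```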

Fig. 4. The ratio of fairness between running an OpenCL kernel on the CMP versus on the GPU for workloads comprising one OpenCL application and one native CPU application, annotated with the regions that prefer GPU execution and CMP execution. A ratio higher than 1 indicates that running on the CMP yields higher fairness, and the workload thereby prefers the CMP.

Fig. 5. The ratio of fairness of running an OpenCL kernel on the CMP versus on the GPU when the co-located native CPU application is assigned different OS scheduling priorities/weights (×0.5 to ×2.5). The blue boxes point out workloads whose target execution device varies when the co-located application has different scheduling weights.

Figure 5 presents the fairness ratio based on weighted slowdown, with each co-located native CPU application given a weight factor varying from 0.5 to 2.5. A weight of 0.5 means the co-located native CPU application is more latency-tolerant than the OpenCL application, whereas 2.5 indicates the co-scheduled native CPU application is highly latency-critical. The weights can also represent, for example, the operating system scheduling priority. We see that when the scheduling priority of the co-located native CPU process increases, the fairness ratio shifts remarkably as well, favoring the GPU as the OpenCL execution target, as labeled with the blue boxes in Figure 5. Therefore, in order to meet real-time deadlines, an intelligent OpenCL execution target scheduling framework should also consider process scheduling priorities to reach a correct target selection decision.

IV. PERFORMANCE PREDICTION AND OPTIMIZATION FRAMEWORK

Based on the performance characterization studies, we design a simple, light-weight performance prediction and optimization framework, called HeteroPDP. HeteroPDP estimates the slowdown of each co-located application and schedules the OpenCL application to an execution target in a heterogeneous system with the goal of maximizing fairness, system throughput, or weighted speedup. Figure 6 illustrates the overall execution flow and the design components of HeteroPDP.

Fig. 6. The OpenCL kernel execution flow and slowdown prediction flow of HeteroPDP: static features are collected at compilation time (by the C/C++ compiler for native CPU applications, with slowdown estimation via the pre-characterized utilization table, and by the OpenCL JIT compiler for OpenCL applications), dynamic features at kernel launch time, and runtime utilization during application execution. These inputs feed the regression models for OpenCL execution time alone and co-located on the CMP/GPU (Sections IV-B, IV-C) and the table-based CPU slowdown estimation (Section IV-E).

Fig. 7. System diagram of the HeteroPDP scheme: the OpenCL ICD supplies static and dynamic features from the OpenCL source code, user API calls, the JIT compiler, and the command queue, while PCM supplies runtime utilization from performance counters. The OpenCL performance predictor (regression models) and the CPU performance predictor (pre-characterized table) jointly produce the scheduling decision.

A. HeteroPDP Overview and Execution Flow

HeteroPDP is implemented as part of the OpenCL installable client driver (ICD). When an OpenCL API is invoked within an application, HeteroPDP retrieves and collects application-specific information available in the command queue of the ICD, such as the size of data transfers between the host and device memories.
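As a rough illustration of where such an ICD-level hook could sit, the interposition-style sketch below wraps clEnqueueNDRangeKernel; the forwarding pointer and the feature-recording helper are hypothetical stand-ins for the vendor dispatch mechanism, not HeteroPDP's actual implementation, and the plain function signature assumes a platform where the OpenCL calling-convention macros are empty (as on Linux).

```cpp
// Hypothetical ICD-layer hook: capture launch-time (dynamic) features,
// then forward to the vendor implementation. real_clEnqueueNDRangeKernel
// and record_dynamic_features are illustrative placeholders.
#include <CL/cl.h>

extern cl_int (*real_clEnqueueNDRangeKernel)(
    cl_command_queue, cl_kernel, cl_uint, const size_t *,
    const size_t *, const size_t *, cl_uint, const cl_event *, cl_event *);

void record_dynamic_features(size_t threads);  // e.g., feed the predictor

cl_int clEnqueueNDRangeKernel(cl_command_queue q, cl_kernel k,
                              cl_uint work_dim,
                              const size_t *global_offset,
                              const size_t *global_size,
                              const size_t *local_size,
                              cl_uint num_events, const cl_event *wait_list,
                              cl_event *event) {
    // Dynamic feature: total number of work-items (threads) spawned.
    size_t threads = 1;
    for (cl_uint d = 0; d < work_dim; ++d) threads *= global_size[d];
    record_dynamic_features(threads);

    // Forward to the vendor implementation unchanged.
    return real_clEnqueueNDRangeKernel(q, k, work_dim, global_offset,
                                       global_size, local_size,
                                       num_events, wait_list, event);
}
```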
Based on the application-specific features (Section IV-B), HeteroPDP estimates application execution time and selects an execution target for the OpenCL application. The proposed HeteroPDP framework is illustrated in Figure 7. To perform performance prediction in HeteroPDP, at compilation time the compiler collects static features (Section IV-B) for OpenCL kernels and builds a lookup table (Section IV-E) for native CPU applications. At runtime, HeteroPDP periodically queries performance counters to collect system resource utilization, and retrieves the OpenCL kernel's dynamic features (Section IV-B) right before kernel launch time for performance prediction.
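Putting this flow together, the following is a minimal sketch of the decision step under the fairness goal: evaluate the linear models of Equation 3, form slowdowns per Equation 4, and pick the target with the better predicted fairness. The function names and the availability of trained coefficient vectors are assumptions for illustration.

```cpp
// Sketch of a HeteroPDP-style decision step under the fairness goal.
// Feature vectors and trained coefficients are assumed to come from the
// instrumented compiler/driver (Sections IV-B to IV-E).
#include <algorithm>
#include <vector>

// Equation 3: predicted time = sum_i c_i * f_i.
double predict_time(const std::vector<double> &coef,
                    const std::vector<double> &feat) {
    double t = 0.0;
    for (size_t i = 0; i < coef.size(); ++i) t += coef[i] * feat[i];
    return t;
}

enum Target { CMP, GPU };

Target choose_target(double t_cmp_alone, double t_cmp_coloc,
                     double t_gpu_alone, double t_gpu_coloc,
                     double cpu_slowdown_if_cmp,    // from the lookup
                     double cpu_slowdown_if_gpu) {  // table (Section IV-E)
    double ocl_sd_cmp = t_cmp_coloc / t_cmp_alone;  // Equation 4
    double ocl_sd_gpu = t_gpu_coloc / t_gpu_alone;
    // Fairness for each choice: min/max over {OpenCL, native CPU} slowdowns.
    double f_cmp = std::min(ocl_sd_cmp, cpu_slowdown_if_cmp) /
                   std::max(ocl_sd_cmp, cpu_slowdown_if_cmp);
    double f_gpu = std::min(ocl_sd_gpu, cpu_slowdown_if_gpu) /
                   std::max(ocl_sd_gpu, cpu_slowdown_if_gpu);
    return f_cmp >= f_gpu ? CMP : GPU;
}
```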

HeteroPDP estimates the execution time and slowdown of the OpenCL kernel running on the CMP and on the GPU with simple regression-based models, which use the kernel's static features, dynamic features, and the system resource utilization as prediction inputs (Sections IV-B to IV-D). While a full-fledged machine learning technique could also be used and might offer higher prediction accuracy, our evaluation results indicate that a simple performance model works sufficiently well for the purpose of this work (Section V-A). HeteroPDP also assesses the impact of shared resource contention on the co-located native CPU applications using a pre-characterized performance estimation approach [27]. It evaluates the performance degradation of the native CPU applications by looking up a table (the pre-characterized utilization table) indexed by the co-located OpenCL kernel's per-thread working set size and its amount of data transfer (Section IV-E). HeteroPDP then predicts the optimal execution target based on the optimization goal and schedules the OpenCL kernel accordingly.

B. OpenCL Kernel Execution Time Prediction for alone

To establish the regression model for predicting the performance of an OpenCL kernel when it runs alone in a heterogeneous system, we first analyze and identify a set of important kernel characteristics, including both static and dynamic features. The static features of a kernel, such as the number of static instructions, can be retrieved by the OpenCL JIT compiler at compilation time. The dynamic features of a kernel include parameters, such as the size of the input data set, and user commands specified at kernel launch time, such as the total number of threads. The kernel characteristics are extracted with the instrumented OpenCL JIT compiler and device driver, and are used to train the regression-based performance prediction models: one for predicting the OpenCL application execution time on the host CMP execution target and the other for the GPU execution target. We run an OpenCL kernel with a varying number of threads and different input data set sizes and collect its corresponding execution time by querying the clGetEventProfilingInfo() API; we then construct the correlation between the features and the execution time (Section IV-D). Overall, the regression model expresses the predicted execution time as a function of a number of important features, as shown in Equation 3, where c_i and f_i represent the i-th coefficient and feature, respectively:

Performance_target = Σ_i c_i × f_i    (3)

Table III summarizes the kernel-specific features used in the performance prediction models for the host CMP and GPU execution targets.

TABLE III
THE OPENCL KERNEL FEATURES USED FOR EXECUTION TIME PREDICTION.

Static features for predicting execution time on the CMP: # of scalar ALU instructions; # of scalar memory instructions; # of vector ALU instructions; # of vector memory instructions; # of branch instructions; # of atomic instructions
Static features for predicting execution time on the GPU: # of memory instructions; # of integer instructions; # of floating-point instructions; # of special math instructions; # of branch instructions; # of barrier instructions
Dynamic features: # of threads spawned; size of memory buffer allocated
Runtime utilization for predicting execution time of co-located: last-level cache miss count; host DRAM bandwidth utilization
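The execution times used as training labels come from the standard OpenCL profiling API; a minimal sketch of timing a kernel this way follows, assuming the command queue was created with CL_QUEUE_PROFILING_ENABLE and that setup/error handling is elided.

```cpp
// Minimal sketch: measure a kernel's device execution time with the
// OpenCL profiling API, as used to gather training labels in Section IV-B.
#include <CL/cl.h>

double kernel_time_ms(cl_command_queue q, cl_kernel k, size_t global_size) {
    cl_event ev;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global_size, NULL, 0, NULL, &ev);
    clWaitForEvents(1, &ev);  // make sure the kernel has finished

    cl_ulong start = 0, end = 0;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);
    clReleaseEvent(ev);
    return (end - start) * 1e-6;  // profiling timestamps are in nanoseconds
}
```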
C. OpenCL Kernel Execution Time Prediction for co-located

Similar to predicting the execution time of an OpenCL application alone, we build an additional regression model to predict the kernel execution time in the presence of co-located applications. In such an execution scenario, shared memory resource utilization, such as the last-level cache and the DRAM bandwidth on the host, influences OpenCL kernel performance. To account for memory interference effects, we include two additional system utilization features in the performance prediction model for co-located: (1) the shared last-level cache miss count on the host and (2) the host DRAM bandwidth utilization incurred by the co-located native CPU applications. HeteroPDP obtains these two runtime utilization features, and thereby the degree of memory interference, by periodically (every 1 second) checking performance counters on the host machine and taking the moving average of 8 consecutive samples.

In summary, when an OpenCL kernel is launched, we use the regression models to predict the OpenCL kernel execution time on (1) each of the two available execution targets, alone (time_alone, with Equation 3), and (2) each of the two available execution targets, co-located (time_co-located). HeteroPDP then estimates the slowdown factor of the OpenCL application on the two execution targets with Equation 4. Note that these parameters and features are chosen to form the regression models because they have been identified as highly correlated with kernel execution time [38].

Slowdown = time_co-located / time_alone    (4)

D. Performance Model Training for OpenCL Kernels

To build the regression models for OpenCL kernel execution time prediction in HeteroPDP, we take a set of 63 distinct OpenCL kernels with varying input data set sizes from the OpenCL benchmarks listed in Table II as the training set. We first execute the OpenCL kernels to collect the corresponding kernel execution times with different static and dynamic features to build the initial regression models. We then apply the commonly-used K-fold cross-validation algorithm [32] with 32 test passes to eliminate overfitting and to maximize the coefficient of determination (R-squared), narrowing the training set size from 63 down to 45 kernels². That is, 45 kernels are used to derive the coefficients of the regression models and the remaining 18 kernels are used to validate the prediction errors; the kernels used for model training and validation do not overlap. Similarly, using the same 63 kernels, we vary the degree of memory interference (i.e., the host DRAM utilization and the shared last-level cache miss count) by co-locating the OpenCL kernels with microbenchmarks, and perform the same model training procedure for co-located as for the alone case. To minimize runtime execution overhead, the regression models are trained offline. Moreover, to better correlate the features and parameters, the prediction models are trained as regression models with interaction terms.
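The fold logic of K-fold cross-validation is generic and can be sketched independently of the fitting backend; in the sketch below, fit() and validation_error() are placeholders for the actual model fitting (done offline with MATLAB's fitlm()/crossval() in the paper), and the round-robin fold assignment is an illustrative choice.

```cpp
// Sketch of a K-fold cross-validation loop in the spirit of Section IV-D.
// fit() and validation_error() stand in for the offline model training.
#include <cstddef>
#include <vector>

struct Sample { std::vector<double> features; double exec_time; };
struct Model  { std::vector<double> coef; };

Model  fit(const std::vector<Sample> &train);                 // placeholder
double validation_error(const Model &, const std::vector<Sample> &);

double kfold_error(const std::vector<Sample> &data, size_t k) {
    double total = 0.0;
    for (size_t fold = 0; fold < k; ++fold) {
        std::vector<Sample> train, test;
        for (size_t i = 0; i < data.size(); ++i)
            (i % k == fold ? test : train).push_back(data[i]);
        total += validation_error(fit(train), test);  // held-out error
    }
    return total / k;  // average error across the k folds
}
```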

² The model training is done with the MATLAB fitlm() and crossval() APIs [28].

E. Performance Degradation Prediction for Native CPU Applications

To assess the fairness or weighted speedup of multiple concurrent applications running on a heterogeneous system, HeteroPDP has to determine the performance of the native CPU applications as well. It does so with an offline-trained lookup table; a major advantage of an offline-trained table is its low computation overhead. Therefore, instead of applying a prediction model to project the execution time slowdown of co-located native CPU applications with complicated execution behavior, we adapt a previously proposed approach, called Bubble-Up [27], to measure and estimate the CPU application slowdown caused by the co-scheduled OpenCL kernel. In Bubble-Up, a simple lookup table (the pre-characterized utilization table) is built at compilation time for predicting the degree of performance degradation under different levels of shared memory contention from other co-located applications. The table is constructed for each native CPU application and is trained with a collection of microbenchmarks that generate a fixed level of contention for a specific shared resource, such as the last-level cache or the shared DRAM bandwidth. In our design, when an OpenCL kernel is launched, HeteroPDP looks up the pre-characterized utilization table, indexing it with the system status (i.e., DRAM bandwidth utilization and OpenCL buffer size), to predict the execution time slowdown of the native CPU applications. Note that Bubble-Up was originally proposed for application slowdown estimation for CPU applications in the multiprogrammed execution scenario; we revise the algorithm for performance degradation prediction for native CPU applications in a heterogeneous system setup.

For HeteroPDP, if an OpenCL kernel is running on the host CMP, the main resource contention occurs at the shared last-level cache. To predict the pressure the OpenCL kernel imposes on the shared cache, we use the maximum number of concurrent threads that can run on the CMP's SIMD or vector functional units and the total working set size to estimate its demand for shared cache capacity. On the other hand, when the OpenCL kernel is offloaded onto the discrete GPU, the major resource interference occurs at the data movement operations contending for shared main memory bandwidth. To predict the slowdown caused by this bandwidth contention, HeteroPDP uses the total size of data transfer required to launch the OpenCL kernel to estimate the host DRAM bandwidth requirement.
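A minimal sketch of such a pre-characterized utilization table follows: a per-application 2D table indexed by buckets of host DRAM bandwidth utilization and of the co-located OpenCL kernel's memory footprint, returning a measured slowdown. The bucket counts and edges are our illustrative assumptions, not values from the paper.

```cpp
// Sketch of a Bubble-Up-style pre-characterized utilization table
// (Section IV-E). slowdown[bw][fp] is filled offline by co-running
// microbenchmarks that generate fixed levels of shared-resource contention.
#include <cstddef>

const size_t BW_BUCKETS = 8, FP_BUCKETS = 8;  // illustrative granularity

struct UtilizationTable {
    double slowdown[BW_BUCKETS][FP_BUCKETS];  // measured offline

    double lookup(double bw_util,        // fraction of peak DRAM bandwidth
                  double footprint_mb) const {
        size_t b = static_cast<size_t>(bw_util * BW_BUCKETS);
        if (b >= BW_BUCKETS) b = BW_BUCKETS - 1;
        size_t f = static_cast<size_t>(footprint_mb / 64.0);  // 64MB/bucket
        if (f >= FP_BUCKETS) f = FP_BUCKETS - 1;
        return slowdown[b][f];  // predicted native CPU slowdown
    }
};
```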
Fig. 8. The prediction accuracy of selecting the optimal target device to run an OpenCL kernel for (a) alone (80% correct) and (b) co-located with a native CPU process (72% correct). The bar portions denote the prediction outcomes [predicted target, optimal target]: [CMP, CMP], [GPU, GPU], [CMP, GPU], and [GPU, CMP].

V. EVALUATION RESULTS AND ANALYSIS FOR HeteroPDP

In this section, we present the evaluation results for the prediction models and the execution target prediction accuracies (Section V-A), as well as the performance of HeteroPDP in the alone and co-located execution scenarios (Section V-B).

A. Evaluation of Execution Time Prediction Models and Execution Target Prediction

The ultimate goal of the HeteroPDP framework is to predict the optimal execution target for an OpenCL application in the alone and co-located execution scenarios. Since HeteroPDP bases its execution target prediction on the four execution time prediction models, we also evaluate the prediction accuracies of the individual models. Figure 8 presents the execution target selection accuracy for alone and for co-located. The different portions of each bar represent the different prediction outcomes [predicted execution target, optimal execution target]. For example, [CMP, CMP] means that the predicted execution target for the OpenCL application is the host processor and the optimal execution target is also the host processor, i.e., a correct prediction. For the alone case, the execution target is selected to minimize the execution time of the OpenCL application. For the co-located case, the execution target is selected to maximize fairness, as defined in Section III-C. Overall, HeteroPDP achieves 80% and 72% execution target prediction accuracies for the alone and co-located scenarios, respectively. The training set for our prediction models is relatively small and may not cover the diverse execution behavior of the OpenCL kernels in this paper; we believe the model prediction accuracy can be improved significantly with a larger training set.

We investigate the prediction accuracies of the individual execution time models as well. Figure 9 shows the cumulative density function of the execution time prediction errors for (1) the OpenCL application on the host processor, alone, (2) the OpenCL application on the GPU, alone, (3) the OpenCL application on the host processor, co-located, and (4) the OpenCL application on the GPU, co-located.

Fig. 9. The CDF of prediction errors for predicting OpenCL kernel execution time, for the four cases alone on the CMP, alone on the GPU, co-located on the CMP, and co-located on the GPU. The red dashed line indicates the 20% error margin.

Fig. 10. The system speedup of HeteroPDP when running an OpenCL application alone, relative to running the OpenCL application on the CMP.

We observe that the execution time prediction error for the majority of applications or workload combinations is below 10%. For the four respective models, (1) through (4), 73%, 70%, 68%, and 72% of the workloads meet the 20% error cutoff. We find that the execution time prediction error mainly comes from two sources. First, for applications with a high degree of memory or branch divergence, the execution behavior is less predictable with a simple regression model. Second, because of the limited timer resolution on the GPU, we are not able to accurately measure the execution time of short-running OpenCL kernels; consequently, HeteroPDP has a relatively high prediction error for short-running kernels.

B. Evaluation of OpenCL Application and System Performance

Next, we investigate the application and system performance impact of HeteroPDP for alone and co-located. Figure 10 shows the performance speedup of an OpenCL application running alone on the target heterogeneous system. The bars represent the OpenCL application on the different execution targets, i.e., the host CMP, the GPU, the execution target (CMP or GPU) selected by HeteroPDP, and the optimal execution target (Opt), whereas the y-axis plots the speedup over the baseline CMP execution target. The always-offloading-to-GPU choice improves OpenCL application performance by 2.5X on average, while HeteroPDP improves application performance by 3X. HeteroPDP bridges the performance gap between always offloading to the GPU and the optimal target selection by 72%.

Fig. 11. The speedup and fairness of HeteroPDP when an OpenCL application is co-located with a native CPU application. The label Native CPU represents the native CPU workloads, OCL represents the OpenCL workloads, and Avg is the average speedup across all co-located applications.

Figure 11 shows the respective performance speedups of the native CPU application and the OpenCL application in the co-located multiprogrammed workloads. The x-axis again shows the execution target of the OpenCL application (the Avg bars indicate the average throughput across all co-located applications), the left y-axis shows the application performance speedup normalized to the baseline (where the OpenCL application runs on the host CMP), and the right y-axis plots the fairness evaluation. Similar to the alone execution scenario, the proposed HeteroPDP improves the weighted speedup over the always-offloading-to-GPU choice and, at the same time, improves the fairness of the co-located applications.

C. HeteroPDP with Varying Scheduling Priorities

Assigning equal weights to the native CPU applications and the OpenCL application is not reflective of the scheduling priorities enforced in typical systems. As previously mentioned, HeteroPDP can be configured to consider the priorities of co-located applications when making a scheduling decision.
Thus, we perform a characterization study varying the weight ratio of the native CPU application to the OpenCL application. This weight ratio is taken into account when the fairness of the system is calculated, thereby influencing the scheduling decision of the OpenCL application. Figure 12 shows the execution target prediction accuracy of HeteroPDP with the weight ratio varying from 0.5 to 2.5. A weight ratio less than 1 indicates that the native CPU application has a lower priority than the OpenCL application, a weight ratio of 1 means all applications have equal priority, and a weight ratio higher than 1 indicates that the native CPU application has a higher priority than the OpenCL application. As we increase the importance of the native CPU application's speedup with a larger weight ratio, the optimal execution target for the OpenCL application increasingly switches to the GPU, as expected. HeteroPDP achieves a similarly good prediction accuracy of 75% for selecting the execution target. Figure 13 shows the corresponding system performance impact of HeteroPDP with varying scheduling priorities (weight ratios). As the native CPU application is given a heavier weight, its performance improvement becomes more important when maximizing the overall system throughput.

Fig. 12. The prediction accuracy of selecting the optimal target to run an OpenCL kernel co-located with one native CPU application that has varying scheduling weights (64%, 72%, 73%, 75%, and 70% across the weight settings). The bar portions denote the outcomes [predicted target, optimal target].

Fig. 13. The speedup and weighted fairness of HeteroPDP when running workloads consisting of one OpenCL application and one native CPU application with varying scheduling weights.

We notice that when the weight ratio is 0.5, the performance of the OpenCL applications is lower than with equal weight (i.e., a weight ratio of 1). This is because HeteroPDP's target prediction accuracy is slightly lower at this setting than at the other weight ratios, as shown in Figure 12; this is also reflected in the trend of the system's weighted fairness improvement.

D. HeteroPDP Scalability Analysis

Finally, we assess the scalability of the proposed design by increasing the number of native CPU applications on the four-core CMP. In this study, we co-locate two native CPU applications on the host processor and evaluate the prediction behavior of HeteroPDP for the OpenCL application. Figure 14 shows the prediction accuracy of target device selection under this more resource-stressed execution environment. The evaluation results indicate that, although the number of co-located processes increases, HeteroPDP still achieves a similarly good prediction accuracy of 70% compared with the execution scenario with only one native CPU process (Figure 8). Similarly, the good execution target prediction accuracy translates into system throughput improvement for HeteroPDP. Figure 15 shows the respective speedups of the co-located applications as well as the system throughput and fairness results. HeteroPDP continues its accurate execution target prediction without any prediction model revision and continues to mitigate the performance degradation in the co-located execution environment.

Fig. 14. The prediction accuracy of selecting the optimal target device to run an OpenCL kernel co-located with two native CPU applications.

Fig. 15. The speedup and fairness of HeteroPDP when running workloads consisting of two native CPU applications and one OpenCL application.

VI. RELATED WORK

A. Memory Interference and Management

An extensive body of prior work has studied shared resource management in the CMP domain, focusing on the capacity management of shared caches and DRAM. Mutlu and Moscibroda proposed a stall-time fair DRAM scheduling algorithm to reduce the performance degradation and unfairness caused by shared resource contention in the DRAM modules [30]. Jaleel et al. proposed a thread-aware dynamic insertion policy (TADIP) to monitor and select the insertion policy for co-located applications that share the LLC, significantly mitigating shared LLC contention [19]. Because of the performance importance of shared caches, many other works similarly proposed solutions to improve the utilization of the shared LLC [17], [33], [37], [41], [44].
Intra-application cache interference stemming from OS-related activities and hardware prefetching can also occur and degrade an application's performance. Wu and Martonosi studied the intra-application cache interference problem and proposed simple techniques to mitigate such cache contention [42]. From the scheduling aspect, Bubble-Up was designed to predict the degree of shared resource contention and to schedule services to different server nodes in datacenter execution environments [27]; this work targeted maximizing per-node load without violating specified quality-of-service constraints.

Similarly, many other prior works attack the shared cache contention problem with modified scheduler designs [10], [20]. Nevertheless, these existing solutions mainly target the homogeneous CMP domain. In contrast, this work identifies a performance improvement opportunity in a heterogeneous system and delves into a cross-ISA heterogeneous system to quantify the performance behavior under multi-level memory interference in a co-located execution environment.

B. Shared Resource Management for Heterogeneous Systems

Many commercial products integrate CPU and GPU cores into a single die. Thus, how to efficiently manage the resources shared between the multiple processors is a real and important research problem, particularly for shared last-level caches. Lee and Kim proposed a thread-level-parallelism-aware policy (TAP) to partition the shared cache for co-located CPU and GPU workloads [23]. Mekkat et al. developed an algorithm, called HeLM, to dynamically determine the priority of CPU and GPU cache accesses [29]. Kayıran et al. designed a concurrency management scheme that mitigates memory contention in a heterogeneous system by regulating the number of concurrent threads on the GPU cores [21]. García et al. quantified the impact of a shared virtual memory space between the CPU and GPU cores and suggested that developers have to redesign OpenCL programs to balance utilization between CPU and GPU cores to optimize system throughput [13]. Ausavarungnirun et al. developed a staged DRAM controller that aims to improve the fairness of CPU-GPU shared DRAM by using dedicated CPU and GPU request queues in the memory controller and treating CPU/GPU requests with different priorities [4]. None of these works, however, addressed the shared resource contention problem from the scheduling aspect by taking into account the degree of memory interference from multiple levels of the memory hierarchy.

C. OpenCL Kernel Scheduling

Furthermore, many prior works have identified that using GPUs to accelerate OpenCL kernels does not always lead to performance improvement, due to the data movement overhead [5], [15], [25], [31], [34]. In order to identify the optimal target device to run an OpenCL kernel, prior works proposed applying a variety of machine learning techniques, e.g., K-means clustering [43], support vector machines (SVMs) [39], regression models [3], and decision trees [38], to dynamically analyze and predict the behavior of an OpenCL kernel. Besides machine learning techniques, Margiolas and O'Boyle proposed using a modified OpenCL JIT compiler to analyze workload behavior at compilation time [26] and partition GPU resources between multiple OpenCL kernels. Lee and Abdelrahman designed a launch-time framework to better utilize different target execution devices by performing an additional post-compilation optimization pass at runtime [22]. However, all of these works take only the characteristics of an OpenCL kernel into account and do not consider the memory interference effects arising in a real heterogeneous system. In this work, we perform a detailed performance characterization study and highlight the importance of considering memory interference from co-located applications. We then design and implement a simple OpenCL scheduler on an experimental heterogeneous system for both the alone and co-located execution environments.
To the best of our knowledge, this is the first work that demonstrates a scheduler design that can accurately predict an OpenCL execution target for a heterogeneous system with multi-level memory interference.

VII. CONCLUSION

This paper presents a detailed performance characterization study of the multiprogrammed heterogeneous computation environment. We show that the performance of an OpenCL application can be significantly affected by co-located native CPU applications, and vice versa. Hence, a high-performing, robust OpenCL framework design should take the entire system utilization into account instead of considering only the characteristics of the OpenCL application. In order to balance the performance degradation of a heterogeneous system, we develop a light-weight and scalable performance degradation predictor (HeteroPDP), based on simple regression models. HeteroPDP can accurately select the target device in a heterogeneous system to optimize and balance the performance degradation among all co-located workloads. HeteroPDP is designed and implemented within the existing OpenCL framework, and is evaluated on a real system consisting of an Intel Core i7 CMP and an AMD FirePro GPU. Overall, HeteroPDP improves the performance of OpenCL applications by 3X by intelligently selecting the execution target between the host CMP and the GPU, whereas the always-offloading-to-GPU decision produces a 2.5X speedup. This paper shows that a simple regression model approach and the consideration of multi-level memory interference in HeteroPDP can effectively improve the scheduling decisions of OpenCL applications, leading to higher application performance and system throughput.

ACKNOWLEDGMENT

The authors would like to thank the paper shepherd Dr. Sandeep Agrawal (Oracle) and the anonymous reviewers for their useful feedback. This work is supported in part by the National Science Foundation (under grants CCF # and CCF # ).

REFERENCES

[1] Amazon, "Overview of Amazon web services," Dec. [Online].
[2] AMD, "AMD SDK: a complete development platform," Mar.
[3] N. Ardalani, C. Lestourgeon, K. Sankaralingam, and X. Zhu, "Cross-architecture performance prediction (XAPP) using CPU code to predict GPU performance," in Proc. of the 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec. 2015.
[4] R. Ausavarungnirun, K. K.-W. Chang, L. Subramanian, G. H. Loh, and O. Mutlu, "Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems," in Proc. of the 39th IEEE/ACM International Symposium on Computer Architecture (ISCA), Jun. 2012.
[5] M. E. Belviranli, F. Khorasani, L. N. Bhuyan, and R. Gupta, "CuMAS: Data transfer aware multi-application scheduling for shared GPUs," in Proc. of the 2016 ACM International Conference on Supercomputing (ICS), Jun. 2016.


More information

Quantifying the Energy Cost of Data Movement for Emerging Smartphone Workloads on Mobile Platforms

Quantifying the Energy Cost of Data Movement for Emerging Smartphone Workloads on Mobile Platforms Quantifying the Energy Cost of Data Movement for Emerging Smartphone Workloads on Mobile Platforms Arizona State University Dhinakaran Pandiyan(dpandiya@asu.edu) and Carole-Jean Wu(carole-jean.wu@asu.edu

More information

Designing High-Performance and Fair Shared Multi-Core Memory Systems: Two Approaches. Onur Mutlu March 23, 2010 GSRC

Designing High-Performance and Fair Shared Multi-Core Memory Systems: Two Approaches. Onur Mutlu March 23, 2010 GSRC Designing High-Performance and Fair Shared Multi-Core Memory Systems: Two Approaches Onur Mutlu onur@cmu.edu March 23, 2010 GSRC Modern Memory Systems (Multi-Core) 2 The Memory System The memory system

More information

Simultaneous Multithreading on Pentium 4

Simultaneous Multithreading on Pentium 4 Hyper-Threading: Simultaneous Multithreading on Pentium 4 Presented by: Thomas Repantis trep@cs.ucr.edu CS203B-Advanced Computer Architecture, Spring 2004 p.1/32 Overview Multiple threads executing on

More information

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory

More information

Understanding Outstanding Memory Request Handling Resources in GPGPUs

Understanding Outstanding Memory Request Handling Resources in GPGPUs Understanding Outstanding Memory Request Handling Resources in GPGPUs Ahmad Lashgar ECE Department University of Victoria lashgar@uvic.ca Ebad Salehi ECE Department University of Victoria ebads67@uvic.ca

More information

Improving Real-Time Performance on Multicore Platforms Using MemGuard

Improving Real-Time Performance on Multicore Platforms Using MemGuard Improving Real-Time Performance on Multicore Platforms Using MemGuard Heechul Yun University of Kansas 2335 Irving hill Rd, Lawrence, KS heechul@ittc.ku.edu Abstract In this paper, we present a case-study

More information

Row Buffer Locality Aware Caching Policies for Hybrid Memories. HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu

Row Buffer Locality Aware Caching Policies for Hybrid Memories. HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu Row Buffer Locality Aware Caching Policies for Hybrid Memories HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu Executive Summary Different memory technologies have different

More information

Position Paper: OpenMP scheduling on ARM big.little architecture

Position Paper: OpenMP scheduling on ARM big.little architecture Position Paper: OpenMP scheduling on ARM big.little architecture Anastasiia Butko, Louisa Bessad, David Novo, Florent Bruguier, Abdoulaye Gamatié, Gilles Sassatelli, Lionel Torres, and Michel Robert LIRMM

More information

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Min Kyu Jeong, Doe Hyun Yoon^, Dam Sunwoo*, Michael Sullivan, Ikhwan Lee, and Mattan Erez The University of Texas at Austin Hewlett-Packard

More information

Transparent Offloading and Mapping (TOM) Enabling Programmer-Transparent Near-Data Processing in GPU Systems Kevin Hsieh

Transparent Offloading and Mapping (TOM) Enabling Programmer-Transparent Near-Data Processing in GPU Systems Kevin Hsieh Transparent Offloading and Mapping () Enabling Programmer-Transparent Near-Data Processing in GPU Systems Kevin Hsieh Eiman Ebrahimi, Gwangsun Kim, Niladrish Chatterjee, Mike O Connor, Nandita Vijaykumar,

More information

When MPPDB Meets GPU:

When MPPDB Meets GPU: When MPPDB Meets GPU: An Extendible Framework for Acceleration Laura Chen, Le Cai, Yongyan Wang Background: Heterogeneous Computing Hardware Trend stops growing with Moore s Law Fast development of GPU

More information

Managing GPU Concurrency in Heterogeneous Architectures

Managing GPU Concurrency in Heterogeneous Architectures Managing Concurrency in Heterogeneous Architectures Onur Kayıran, Nachiappan CN, Adwait Jog, Rachata Ausavarungnirun, Mahmut T. Kandemir, Gabriel H. Loh, Onur Mutlu, Chita R. Das Era of Heterogeneous Architectures

More information

Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies

Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies Shadowfax: Scaling in Heterogeneous Cluster Systems via GPGPU Assemblies Alexander Merritt, Vishakha Gupta, Abhishek Verma, Ada Gavrilovska, Karsten Schwan {merritt.alex,abhishek.verma}@gatech.edu {vishakha,ada,schwan}@cc.gtaech.edu

More information

Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. Yoongu Kim Michael Papamichael Onur Mutlu Mor Harchol-Balter

Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. Yoongu Kim Michael Papamichael Onur Mutlu Mor Harchol-Balter Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior Yoongu Kim Michael Papamichael Onur Mutlu Mor Harchol-Balter Motivation Memory is a shared resource Core Core Core Core

More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,

More information

Elaborazione dati real-time su architetture embedded many-core e FPGA

Elaborazione dati real-time su architetture embedded many-core e FPGA Elaborazione dati real-time su architetture embedded many-core e FPGA DAVIDE ROSSI A L E S S A N D R O C A P O T O N D I G I U S E P P E T A G L I A V I N I A N D R E A M A R O N G I U C I R I - I C T

More information

The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory

The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory Lavanya Subramanian* Vivek Seshadri* Arnab Ghosh* Samira Khan*

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

PAGE PLACEMENT STRATEGIES FOR GPUS WITHIN HETEROGENEOUS MEMORY SYSTEMS

PAGE PLACEMENT STRATEGIES FOR GPUS WITHIN HETEROGENEOUS MEMORY SYSTEMS PAGE PLACEMENT STRATEGIES FOR GPUS WITHIN HETEROGENEOUS MEMORY SYSTEMS Neha Agarwal* David Nellans Mark Stephenson Mike O Connor Stephen W. Keckler NVIDIA University of Michigan* ASPLOS 2015 EVOLVING GPU

More information

high performance medical reconstruction using stream programming paradigms

high performance medical reconstruction using stream programming paradigms high performance medical reconstruction using stream programming paradigms This Paper describes the implementation and results of CT reconstruction using Filtered Back Projection on various stream programming

More information

Efficient Hardware Acceleration on SoC- FPGA using OpenCL

Efficient Hardware Acceleration on SoC- FPGA using OpenCL Efficient Hardware Acceleration on SoC- FPGA using OpenCL Advisor : Dr. Benjamin Carrion Schafer Susmitha Gogineni 30 th August 17 Presentation Overview 1.Objective & Motivation 2.Configurable SoC -FPGA

More information

CSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore

CSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Status Quo Previously, CPU vendors

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

A Simple Model for Estimating Power Consumption of a Multicore Server System

A Simple Model for Estimating Power Consumption of a Multicore Server System , pp.153-160 http://dx.doi.org/10.14257/ijmue.2014.9.2.15 A Simple Model for Estimating Power Consumption of a Multicore Server System Minjoong Kim, Yoondeok Ju, Jinseok Chae and Moonju Park School of

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

Computer Architecture Lecture 24: Memory Scheduling

Computer Architecture Lecture 24: Memory Scheduling 18-447 Computer Architecture Lecture 24: Memory Scheduling Prof. Onur Mutlu Presented by Justin Meza Carnegie Mellon University Spring 2014, 3/31/2014 Last Two Lectures Main Memory Organization and DRAM

More information

Understanding Reduced-Voltage Operation in Modern DRAM Devices

Understanding Reduced-Voltage Operation in Modern DRAM Devices Understanding Reduced-Voltage Operation in Modern DRAM Devices Experimental Characterization, Analysis, and Mechanisms Kevin Chang A. Giray Yaglikci, Saugata Ghose,Aditya Agrawal *, Niladrish Chatterjee

More information

Improving Virtual Machine Scheduling in NUMA Multicore Systems

Improving Virtual Machine Scheduling in NUMA Multicore Systems Improving Virtual Machine Scheduling in NUMA Multicore Systems Jia Rao, Xiaobo Zhou University of Colorado, Colorado Springs Kun Wang, Cheng-Zhong Xu Wayne State University http://cs.uccs.edu/~jrao/ Multicore

More information

Contention-Aware Scheduling of Parallel Code for Heterogeneous Systems

Contention-Aware Scheduling of Parallel Code for Heterogeneous Systems Contention-Aware Scheduling of Parallel Code for Heterogeneous Systems Chris Gregg Jeff S. Brantley Kim Hazelwood Department of Computer Science, University of Virginia Abstract A typical consumer desktop

More information

Flexible Architecture Research Machine (FARM)

Flexible Architecture Research Machine (FARM) Flexible Architecture Research Machine (FARM) RAMP Retreat June 25, 2009 Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan Bronson Christos Kozyrakis, Kunle Olukotun Motivation Why CPUs + FPGAs make sense

More information

IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM

IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM IMPROVING ENERGY EFFICIENCY THROUGH PARALLELIZATION AND VECTORIZATION ON INTEL R CORE TM I5 AND I7 PROCESSORS Juan M. Cebrián 1 Lasse Natvig 1 Jan Christian Meyer 2 1 Depart. of Computer and Information

More information

General Purpose GPU Computing in Partial Wave Analysis

General Purpose GPU Computing in Partial Wave Analysis JLAB at 12 GeV - INT General Purpose GPU Computing in Partial Wave Analysis Hrayr Matevosyan - NTC, Indiana University November 18/2009 COmputationAL Challenges IN PWA Rapid Increase in Available Data

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

Practical Near-Data Processing for In-Memory Analytics Frameworks

Practical Near-Data Processing for In-Memory Analytics Frameworks Practical Near-Data Processing for In-Memory Analytics Frameworks Mingyu Gao, Grant Ayers, Christos Kozyrakis Stanford University http://mast.stanford.edu PACT Oct 19, 2015 Motivating Trends End of Dennard

More information

INTERFERENCE FROM GPU SYSTEM SERVICE REQUESTS

INTERFERENCE FROM GPU SYSTEM SERVICE REQUESTS INTERFERENCE FROM GPU SYSTEM SERVICE REQUESTS ARKAPRAVA BASU, JOSEPH L. GREATHOUSE, GURU VENKATARAMANI, JÁN VESELÝ AMD RESEARCH, ADVANCED MICRO DEVICES, INC. MODERN SYSTEMS ARE POWERED BY HETEROGENEITY

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 18, 2005 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

COL862 Programming Assignment-1

COL862 Programming Assignment-1 Submitted By: Rajesh Kedia (214CSZ8383) COL862 Programming Assignment-1 Objective: Understand the power and energy behavior of various benchmarks on different types of x86 based systems. We explore a laptop,

More information

XPU A Programmable FPGA Accelerator for Diverse Workloads

XPU A Programmable FPGA Accelerator for Diverse Workloads XPU A Programmable FPGA Accelerator for Diverse Workloads Jian Ouyang, 1 (ouyangjian@baidu.com) Ephrem Wu, 2 Jing Wang, 1 Yupeng Li, 1 Hanlin Xie 1 1 Baidu, Inc. 2 Xilinx Outlines Background - FPGA for

More information

Energy-Efficiency Prediction of Multithreaded Workloads on Heterogeneous Composite Cores Architectures using Machine Learning Techniques

Energy-Efficiency Prediction of Multithreaded Workloads on Heterogeneous Composite Cores Architectures using Machine Learning Techniques Energy-Efficiency Prediction of Multithreaded Workloads on Heterogeneous Composite Cores Architectures using Machine Learning Techniques Hossein Sayadi Department of Electrical and Computer Engineering

More information

Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh

Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh Accelerating Pointer Chasing in 3D-Stacked : Challenges, Mechanisms, Evaluation Kevin Hsieh Samira Khan, Nandita Vijaykumar, Kevin K. Chang, Amirali Boroumand, Saugata Ghose, Onur Mutlu Executive Summary

More information

AMD Fusion APU: Llano. Marcello Dionisio, Roman Fedorov Advanced Computer Architectures

AMD Fusion APU: Llano. Marcello Dionisio, Roman Fedorov Advanced Computer Architectures AMD Fusion APU: Llano Marcello Dionisio, Roman Fedorov Advanced Computer Architectures Outline Introduction AMD Llano architecture AMD Llano CPU core AMD Llano GPU Memory access management Turbo core technology

More information

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Based on papers by: A.Fedorova, M.Seltzer, C.Small, and D.Nussbaum Pisa November 6, 2006 Multithreaded Chip

More information

The Affinity Effects of Parallelized Libraries in Concurrent Environments. Abstract

The Affinity Effects of Parallelized Libraries in Concurrent Environments. Abstract The Affinity Effects of Parallelized Libraries in Concurrent Environments FABIO LICHT, BRUNO SCHULZE, LUIS E. BONA, AND ANTONIO R. MURY 1 Federal University of Parana (UFPR) licht@lncc.br Abstract The

More information

Understanding GPGPU Vector Register File Usage

Understanding GPGPU Vector Register File Usage Understanding GPGPU Vector Register File Usage Mark Wyse AMD Research, Advanced Micro Devices, Inc. Paul G. Allen School of Computer Science & Engineering, University of Washington AGENDA GPU Architecture

More information

RUNTIME SUPPORT FOR ADAPTIVE SPATIAL PARTITIONING AND INTER-KERNEL COMMUNICATION ON GPUS

RUNTIME SUPPORT FOR ADAPTIVE SPATIAL PARTITIONING AND INTER-KERNEL COMMUNICATION ON GPUS RUNTIME SUPPORT FOR ADAPTIVE SPATIAL PARTITIONING AND INTER-KERNEL COMMUNICATION ON GPUS Yash Ukidave, Perhaad Mistry, Charu Kalra, Dana Schaa and David Kaeli Department of Electrical and Computer Engineering

More information

Performance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA

Performance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA Performance Optimizations via Connect-IB and Dynamically Connected Transport Service for Maximum Performance on LS-DYNA Pak Lui, Gilad Shainer, Brian Klaff Mellanox Technologies Abstract From concept to

More information

Messaging Overview. Introduction. Gen-Z Messaging

Messaging Overview. Introduction. Gen-Z Messaging Page 1 of 6 Messaging Overview Introduction Gen-Z is a new data access technology that not only enhances memory and data storage solutions, but also provides a framework for both optimized and traditional

More information

EXPLORING PARALLEL PROCESSING OPPORTUNITIES IN AERMOD. George Delic * HiPERiSM Consulting, LLC, Durham, NC, USA

EXPLORING PARALLEL PROCESSING OPPORTUNITIES IN AERMOD. George Delic * HiPERiSM Consulting, LLC, Durham, NC, USA EXPLORING PARALLEL PROCESSING OPPORTUNITIES IN AERMOD George Delic * HiPERiSM Consulting, LLC, Durham, NC, USA 1. INTRODUCTION HiPERiSM Consulting, LLC, has a mission to develop (or enhance) software and

More information

Intelligent Scheduling and Memory Management Techniques. for Modern GPU Architectures. Shin-Ying Lee

Intelligent Scheduling and Memory Management Techniques. for Modern GPU Architectures. Shin-Ying Lee Intelligent Scheduling and Memory Management Techniques for Modern GPU Architectures by Shin-Ying Lee A Dissertation Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy

More information

EECS750: Advanced Operating Systems. 2/24/2014 Heechul Yun

EECS750: Advanced Operating Systems. 2/24/2014 Heechul Yun EECS750: Advanced Operating Systems 2/24/2014 Heechul Yun 1 Administrative Project Feedback of your proposal will be sent by Wednesday Midterm report due on Apr. 2 3 pages: include intro, related work,

More information

Architectural and Runtime Enhancements for Dynamically Controlled Multi-Level Concurrency on GPUs

Architectural and Runtime Enhancements for Dynamically Controlled Multi-Level Concurrency on GPUs Architectural and Runtime Enhancements for Dynamically Controlled Multi-Level Concurrency on GPUs A Dissertation Presented by Yash Ukidave to The Department of Electrical and Computer Engineering in partial

More information

Architecture, Programming and Performance of MIC Phi Coprocessor

Architecture, Programming and Performance of MIC Phi Coprocessor Architecture, Programming and Performance of MIC Phi Coprocessor JanuszKowalik, Piotr Arłukowicz Professor (ret), The Boeing Company, Washington, USA Assistant professor, Faculty of Mathematics, Physics

More information

Lecture 1: Introduction

Lecture 1: Introduction Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline

More information

Addendum to Efficiently Enabling Conventional Block Sizes for Very Large Die-stacked DRAM Caches

Addendum to Efficiently Enabling Conventional Block Sizes for Very Large Die-stacked DRAM Caches Addendum to Efficiently Enabling Conventional Block Sizes for Very Large Die-stacked DRAM Caches Gabriel H. Loh Mark D. Hill AMD Research Department of Computer Sciences Advanced Micro Devices, Inc. gabe.loh@amd.com

More information

NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems

NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems NUMA-Aware Data-Transfer Measurements for Power/NVLink Multi-GPU Systems Carl Pearson 1, I-Hsin Chung 2, Zehra Sura 2, Wen-Mei Hwu 1, and Jinjun Xiong 2 1 University of Illinois Urbana-Champaign, Urbana

More information

Introduction to Xeon Phi. Bill Barth January 11, 2013

Introduction to Xeon Phi. Bill Barth January 11, 2013 Introduction to Xeon Phi Bill Barth January 11, 2013 What is it? Co-processor PCI Express card Stripped down Linux operating system Dense, simplified processor Many power-hungry operations removed Wider

More information

IN modern systems, the high latency of accessing largecapacity

IN modern systems, the high latency of accessing largecapacity IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 27, NO. 10, OCTOBER 2016 3071 BLISS: Balancing Performance, Fairness and Complexity in Memory Access Scheduling Lavanya Subramanian, Donghyuk

More information

MediaTek CorePilot 2.0. Delivering extreme compute performance with maximum power efficiency

MediaTek CorePilot 2.0. Delivering extreme compute performance with maximum power efficiency MediaTek CorePilot 2.0 Heterogeneous Computing Technology Delivering extreme compute performance with maximum power efficiency In July 2013, MediaTek delivered the industry s first mobile system on a chip

More information

Heterogeneous Processing Systems. Heterogeneous Multiset of Homogeneous Arrays (Multi-multi-core)

Heterogeneous Processing Systems. Heterogeneous Multiset of Homogeneous Arrays (Multi-multi-core) Heterogeneous Processing Systems Heterogeneous Multiset of Homogeneous Arrays (Multi-multi-core) Processing Heterogeneity CPU (x86, SPARC, PowerPC) GPU (AMD/ATI, NVIDIA) DSP (TI, ADI) Vector processors

More information

Fundamentals of Quantitative Design and Analysis

Fundamentals of Quantitative Design and Analysis Fundamentals of Quantitative Design and Analysis Dr. Jiang Li Adapted from the slides provided by the authors Computer Technology Performance improvements: Improvements in semiconductor technology Feature

More information

Computing architectures Part 2 TMA4280 Introduction to Supercomputing

Computing architectures Part 2 TMA4280 Introduction to Supercomputing Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:

More information

AB-Aware: Application Behavior Aware Management of Shared Last Level Caches

AB-Aware: Application Behavior Aware Management of Shared Last Level Caches AB-Aware: Application Behavior Aware Management of Shared Last Level Caches Suhit Pai, Newton Singh and Virendra Singh Computer Architecture and Dependable Systems Laboratory Department of Electrical Engineering

More information

Exploring GPU Architecture for N2P Image Processing Algorithms

Exploring GPU Architecture for N2P Image Processing Algorithms Exploring GPU Architecture for N2P Image Processing Algorithms Xuyuan Jin(0729183) x.jin@student.tue.nl 1. Introduction It is a trend that computer manufacturers provide multithreaded hardware that strongly

More information

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model Parallel Programming Principle and Practice Lecture 9 Introduction to GPGPUs and CUDA Programming Model Outline Introduction to GPGPUs and Cuda Programming Model The Cuda Thread Hierarchy / Memory Hierarchy

More information

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures W.M. Roshan Weerasuriya and D.N. Ranasinghe University of Colombo School of Computing A Comparative

More information

DEMM: a Dynamic Energy-saving mechanism for Multicore Memories

DEMM: a Dynamic Energy-saving mechanism for Multicore Memories DEMM: a Dynamic Energy-saving mechanism for Multicore Memories Akbar Sharifi, Wei Ding 2, Diana Guttman 3, Hui Zhao 4, Xulong Tang 5, Mahmut Kandemir 5, Chita Das 5 Facebook 2 Qualcomm 3 Intel 4 University

More information

Heterogeneous platforms

Heterogeneous platforms Heterogeneous platforms Systems combining main processors and accelerators e.g., CPU + GPU, CPU + Intel MIC, AMD APU, ARM SoC Any platform using a GPU is a heterogeneous platform! Further in this talk

More information

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Performance COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Performance COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals What is Performance? How do we measure the performance of

More information

Survey results. CS 6354: Memory Hierarchy I. Variety in memory technologies. Processor/Memory Gap. SRAM approx. 4 6 transitors/bit optimized for speed

Survey results. CS 6354: Memory Hierarchy I. Variety in memory technologies. Processor/Memory Gap. SRAM approx. 4 6 transitors/bit optimized for speed Survey results CS 6354: Memory Hierarchy I 29 August 2016 1 2 Processor/Memory Gap Variety in memory technologies SRAM approx. 4 6 transitors/bit optimized for speed DRAM approx. 1 transitor + capacitor/bit

More information

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance 11 th International LS-DYNA Users Conference Computing Technology LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance Gilad Shainer 1, Tong Liu 2, Jeff Layton

More information

Characterizing Multi-threaded Applications based on Shared-Resource Contention

Characterizing Multi-threaded Applications based on Shared-Resource Contention Characterizing Multi-threaded Applications based on Shared-Resource Contention Tanima Dey Wei Wang Jack W. Davidson Mary Lou Soffa Department of Computer Science University of Virginia Charlottesville,

More information