Towards a Holistic Approach to Auto-Parallelization


Towards a Holistic Approach to Auto-Parallelization: Integrating Profile-Driven Parallelism Detection and Machine-Learning-Based Mapping. Georgios Tournavitis, Zheng Wang, Björn Franke and Michael F. P. O'Boyle. MPSoCs 2009, St. Goar, Germany. Institute for Computing Systems Architecture.

Overview: introduction & motivation, profile-driven parallelization, ML-based mapping, empirical evaluation, conclusions & future work.

Introduction. Performance on multi-core architectures requires the extraction of thread-level parallelism, but legacy code is sequential, the majority of programmers still think and program sequentially, and parallel programming is tedious and error-prone. Auto-parallelization has seen 30 years of research, yet it fails to deliver performance beyond niche settings.

State-of-the-art. A state-of-the-art auto-parallelizer, applied to array-based scientific applications, achieves on average no speedup. (Setup: NAS NPB 2.3 OMP-C and SPEC CFP2000; 2 quad-cores, 8 cores in total, Intel Xeon X5450 @ 3.00 GHz; Intel icc 10.1 -O2 -xT -axT -ipo.)

State-of-the-art. There is a wide performance gap between auto-parallelized and hand-parallelized code. Why? (Same setup as above.)

State-of-the-art. Why the wide performance gap? 1. Static parallelism detection is conservative due to lack of information. 2. There is no automatic and portable approach to parallelism mapping. (Same setup as above.)

Problem statement. (Venn-diagram build across several slides.) Among all parallel loops, only a subset is statically provable as parallel; of the statically provable DOALL loops, much parallelism remains unexploited; and only a subset of the parallel loops is actually profitable to parallelize.

Overview of the holistic approach: sequential code in C → profiling-driven analysis → code with OpenMP annotations → machine-learning-based mapper → code with profitable loops.

Profiling-driven parallelism detection. Key point: the restrictions of static analysis can be overcome using precise, dynamic information. How? Instrument the middle-level IR: this gives a straightforward correlation with the source code, preserves high-level information (loops, induction/reduction variables), avoids obfuscation introduced by the ISA, and lets the instrumented code execute natively.

Profiling-driven parallelism detection. Dynamically reconstruct a precise view of control and data flow: identify parallel loops, identify privatizable data, and use a hybrid static/profiling-based approach to uncover parallel reductions. Because profiling alone is unsafe, the user validates correctness.

Overview (recap): sequential code in C → profile-driven analysis → code with OpenMP annotations → machine-learning-based mapper → code with profitable loops.

ML-based mapping. How do we select and map profitable loops? Traditional compilers use hard-wired heuristics; machine learning instead provides portable and automatic modeling of the compiler, runtime and architecture.

ML-based mapping. How good can hand-crafted heuristics be? (Scatter plot, NAS PB 2.3 + SPEC FP2000 on an Intel Xeon X5450: number of instructions vs. number of iterations per loop, with points labelled "should be parallelized" and "should NOT be parallelized"; no simple threshold separates the two classes. The slide margin sketches the standard SVM formulation: training data, maximum-margin constraints, a kernel for the non-linear extension, and a reduction of the multiclass problem to binary subproblems.)

ML-based mapping. Predictive modeling based on a Support Vector Machine (SVM): decide (i) profitability and (ii) the loop scheduling policy; construct separating hyperplanes in a transformed, higher-dimensional feature space; non-linear and multiclass extensions are used. (Figure: data that is not linearly separable in the original feature space becomes separable after a transformation, e.g. f' = f·X - b.)
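The SVM setup the slides allude to is the standard textbook formulation; as a hedged reconstruction (not copied verbatim from the paper):

```latex
% Training data: feature vectors x_i with class labels c_i
D = \{(\mathbf{x}_i, c_i)\}_{i=1}^{n}, \quad c_i \in \{-1, +1\}
% Maximum-margin hyperplane: minimize \|\mathbf{w}\| subject to
c_i(\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1, \quad i = 1, \dots, n
% Non-linear extension: replace dot products with a kernel, e.g. an RBF
k(\mathbf{x}, \mathbf{x}') = \exp\!\left(-\gamma \,\|\mathbf{x} - \mathbf{x}'\|^2\right)
```

The multiclass case (e.g. choosing among scheduling policies) is reduced to several binary subproblems, each decided by one such hyperplane.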

ML-based mapping. Off-line learning: the model is trained ahead of time; for a new program, features are extracted using the smallest input, and the trained model predicts "profitable or not" plus a scheduling policy. Features: static — IR instruction count, IR load/store count, IR branch count; dynamic — loop iteration count, data access count, instruction count, branch count.

Final approval. Profile-driven parallelization cannot guarantee correctness for a new input, so the parallel loops selected as profitable by the ML-based mapper are subject to user approval.

Empirical evaluation 19

Benchmarks. Two sets of applications: NAS Parallel Benchmarks 2.3 and SPEC FP2000 (the subset written in C). Two versions of each: sequential, and manually parallelized with OpenMP by expert programmers. All input datasets are used.

Performance on Intel. Intel icc fails to deliver any improvement; for small datasets it even causes slowdown. (NAS NPB 2.3 OMP-C and SPEC CFP2000; 2 quad-cores, 8 cores in total, Intel Xeon X5450 @ 3.00 GHz; icc 10.1 -O2 -xT -axT -ipo.)

Performance on Intel. The profile-driven approach achieves a 3.34× average speedup, 96% of the hand-parallelized performance. (Same setup as above.)

Performance on Cell. On average, even the manually parallelized code delivers a slowdown: the OpenMP overheads and trade-offs are radically different on this platform. (NAS NPB 2.3 OMP-C and SPEC CFP2000; dual-socket QS20 Cell blade; IBM xlc ssc v0.9 -O5 -qstrict -qarch=cell -qipa=partition=minute.)

Performance on Cell. Profile-driven parallelization combined with ML-based mapping avoids the slowdowns and delivers a speedup on average. (Same setup as above.)

Conclusions. Static-only parallelization approaches fail. Profiling is effective at uncovering parallel loops, and ML-based mapping adapts across architectures; combined, profiling information and machine learning outperform each approach in isolation. The performance of the holistic approach is close to that of manual parallelization.

Future work. Consider low-overhead TLS/TM to ensure correctness, study a wider set of benchmarks, and extend the approach to partially sequential loops.

Thanks!!! Acknowledgments: