Multigrain Parallelism: Bridging Coarse-Grain Parallel Languages and Fine-Grain Event-Driven Multithreading


1 Department of Electrical and Computer Engineering, Computer Architecture and Parallel Systems Laboratory (CAPSL). Multigrain Parallelism: Bridging Coarse-Grain Parallel Languages and Fine-Grain Event-Driven Multithreading. CPEG Spring 2015, Advanced Topics in Computing Systems. Jaime Arteaga, Stephane Zuckerman, Guang Gao. November 17th, 2015

2 Outline 1. Motivation. 2. Our Proposed Solution. 3. First Approach. 4. Second (current) Approach. 5. Implementation so far. 6. Current and future work.

3 Motivation Current architectures continue to integrate more processing units per chip every year. However, traditional programming models (such as OpenMP) continue to be the most popular. OpenMP has tried to keep up: Tasks in v Task dependencies and support for accelerators in v. 4.0.

4 Motivation Exascale computing requires new programming models (not adapting legacy ones). Codelet Model: Specifically designed to handle fine-grain parallelism. Suited for new architectures and Exascale machines. DARTS (Delaware Adaptive Run-Time System): University of Delaware's implementation of the Codelet model. Codelet = non-preemptive light-weight task. Unlike OpenMP, DARTS can be verbose. Using codelets for parallel programming requires training.

9 Fib in DARTS (code listing, shown across slides 5 to 9): 120 lines of code

11 Fib in OpenMP 3.0 (code listing, shown across slides 10 and 11): 44 lines of code

12 Our Proposed Solution To design a source-to-source compiler that takes OpenMP (2.5, 3.1, 4.0) code and generates DARTS code, taking advantage of the programming interface and features found in the Codelet model. (Diagram labels, in extraction order: barrier C/C++ OpenMP 2 DARTS C++ OpenMP DARTS.)

13 DARTS Delaware RunTime System. Implementation of the Codelet model (similar ones include ETI's SWARM, Rice's Habanero, OCR, etc.). Two levels of parallelism, Threaded Procedures (TPs) and Codelets, to group processing units according to the number of clusters in the system. An Implementation of the Codelet Model, Joshua Suetterlein, Stephane Zuckerman, Guang R. Gao, Euro-Par 2013.

14 DARTS Better suited for many-core architectures. Event and dataflow driven (instead of control-flow driven). Fine-grain parallelism gives better control of resources (best for energy efficiency, resiliency, and fault-tolerance mechanisms). (Plots: Matrix Multiplication, BFS.)

15 First Approach: OMPi to DARTS Lightweight open-source OpenMP compiler from University of Ioannina (Greece). Implemented in and supporting C. Supports OpenMP 2.5, 3.1, and mostly 4.0. Advantages: Open source. Few files. Disadvantages: Not completely modular. A lot of constants and data structures across files. Not very well-known (IR and AST not standard).

16 Second Approach: Clang to DARTS Main front end for LLVM (part of LLVM since v 2.6). Parses C, C++, Objective-C, Objective-C++. Advantages: Better error and warning messages. Faster. Produces a standard IR (LLVM's). Better documentation. Implemented in C++ (more of a personal preference).

20 Implementation so far Each pragma construct is transformed into a TP. Several copies of each TP are created, and the total number of threads in the region (OMP_NUM_THREADS) is distributed among the TPs according to the hardware at hand (HWLOC).

23 Implementation so far A shared variable is created in the TP frame. A private variable is created on each codelet. Special considerations apply to firstprivate and lastprivate variables.

24 Implementation so far (Diagram labels: class TPParallel, class TPCompute, class TPSingle, class TPFor, class TPFor.)

25 (Diagram labels, in extraction order: TP Function TP Task Branch TP Task Taskwait Return.)

26 Current and Future Work Fine-grain scheduling optimizations. Bug finding. Benchmarking with mini-apps and popular workloads. Implement the full OpenMP 3.1 and 4.0 standards.

27 References 1. An Implementation of the Codelet Model, Joshua Suetterlein, Stephane Zuckerman, Guang R. Gao, Euro-Par 2013. 2. Position Paper: Using a Codelet Program Execution Model for Exascale Machines, Stephane Zuckerman, Joshua Suetterlein, Rob Knauerhase and Guang R. Gao, PLDI. 3. OpenMP. 4. OMPi.
