Politecnico di Milano
Automatic parallelization of sequential specifications for symmetric MPSoCs
[Full text is available at https://re.public.polimi.it/retrieve/handle/11311/240811/92308/iess.pdf]
Fabrizio Ferrandi, Luca Fossati, Marco Lattuada, Gianluca Palermo, Donatella Sciuto, Antonino Tumeo
{ferrandi,fossati,lattuada,gpalermo,sciuto,tumeo}@elet.polimi.it
Thursday, May 31 - IESS '07, Irvine, California, USA

Outline
- Introduction
- Related Work
- Target Architecture: CerberO
- Parallelization
- Partitioning
- Task start conditions
- Experimental Results
- Conclusions

Introduction
- On-chip multiprocessors are gaining momentum
- The development of good parallel applications is highly dependent on software tools
- Developers must contend with several problems not encountered in sequential programming: non-determinism, communication, synchronization, data partitioning and distribution, load balancing, heterogeneity, shared or distributed memory, deadlocks, race conditions
- This work proposes an approach for the automatic parallelization of sequential programs

Objectives
- This work focuses on a complete design flow: from the high-level sequential C description of the application to its deployment on a multiprocessor system-on-chip prototype
- The sequential code is partitioned into tasks with a specific clustering algorithm
- Then the resulting task graph is optimized and the parallel C code is generated (C-to-C transformation)
- The generated tasks are dynamically schedulable: boolean start conditions are evaluated at run time, and tasks are started as soon as their actual data dependences are satisfied

Related Work
Several strategies for partial automation of the parallelization process have been proposed:
- Problem-solving environments that generate parallel programs starting from high-level sequential descriptions
- Machine-independent code annotations
The parallelization process:
- The initial specification is parsed into an intermediate graph representation
- The intermediate representation is partitioned
- An initial task graph is obtained; tasks then need to be allocated on processors through clustering and cluster scheduling (merging)

Related Work (Partitioning)
- [Girkar et al.]: a Hierarchical Task Graph (HTG) (we do not use hierarchy); simplification of the conditions for execution of task nodes
- [Luis et al.]: extend Girkar's work by using a Petri net model to represent parallel code
- [Newburn and Shen]: the PEDIGREE compiler, a flow for automatic parallelization (it is not a C-to-C tool); the Program Dependence Graph (PDG) is analysed, searching for control-equivalent regions, which are then partitioned bottom-up

Parallelization: the flow
- The sequential C code is compiled with a slightly modified version of GCC 4.0
- The internal structures are dumped
- The C-to-C partitioning algorithm works on a modified system dependence graph (SDG)
- Code is generated for both OpenMP and CerberO

FSDG Creation
- Vertices: statements or predicate expressions
- Grey solid edges: data dependences
- Black edges: control dependences
- Black dashed edges: both
- Grey dashed edges: feedback edges (loops)
- All the loops are converted into do-while loops
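
As an illustration of these vertex and edge classes, here is a small, hypothetical C fragment (not taken from the paper) annotated with the FSDG that a flow of this kind would build for it; the vertex labels v1..v7 exist only in the comments.

```c
/* Loop already in the do-while form the flow converts to. */
int sum_array(const int *data, int n)
{
    int acc = 0;              /* v1: statement vertex                         */
    int i = 0;                /* v2: statement vertex                         */
    if (n <= 0)               /* v3: predicate expression                     */
        return acc;
    do {                      /* v4: loop predicate (i < n)                   */
        acc += data[i];       /* v5: data deps on v1, v2 (grey solid edges),  */
                              /*     control dep on v4 (black edge)           */
        i++;                  /* v6: data dep on v2, control dep on v4        */
    } while (i < n);          /* feedback edges v5->v4, v6->v4 (grey dashed)  */
    return acc;               /* v7: data dep on v5                           */
}
```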

Partitioning
First step: feedback-edge analysis
- A partition for each loop
- A partition for the nodes not in any loop
Second step: control-edge analysis
- Recognition of control-equivalent (CE) regions
- Statement nodes descending from the same branch outcome (TRUE or FALSE) of a predicate node are grouped together
- Each region presents potential parallelism
Third step: data-dependence analysis of the CE regions
- Depth-first exploration; a node is added to a cluster if:
  - it depends on one and only one node
  - all its predecessors have already been added
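
The control-equivalent grouping of the second step and the potential parallelism it exposes can be pictured on a small, hypothetical fragment (function and variable names are illustrative, not the paper's):

```c
/* Statements guarded by the same branch outcome of a predicate fall into
 * the same control-equivalent region. */
void stage(int c, int x, int *out1, int *out2)
{
    if (c) {            /* predicate node                                        */
        *out1 = x * 2;  /* CE region of the TRUE branch: this statement and the  */
        *out2 = x + 1;  /* next are grouped; each depends only on x, so the      */
                        /* depth-first clustering can place them in two clusters */
    } else {            /* that may run in parallel                              */
        *out1 = 0;      /* CE region of the FALSE branch                         */
        *out2 = -x;
    }
}
```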

Optimizations
The partitioning phase tends to produce too many small clusters: task-management overhead could eliminate all the advantages of parallel execution.
Two types of optimizations are applied.
Optimizations on control structures:
- Control predicates are executed before the statements beneath them
- The control predicate and the instructions that depend on it are grouped in the same task
- Then and Else clauses are grouped in the same task, since they are mutually exclusive
- If control structures are not optimized, they are replicated in order to remove any control dependences among tasks
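
A toy sketch of the control-structure optimization (assumed code, not produced by the tool): the predicate, the statements that depend on it, and the two mutually exclusive clauses all end up in the same task.

```c
/* Keeping the predicate, the then clause and the else clause together in one
 * task avoids replicating the control structure across several tasks just to
 * remove the control dependences. */
void task_select(int c, int x, int *y)
{
    if (c)              /* predicate executed before the statements beneath it  */
        *y = x * 2;     /* then clause                                          */
    else
        *y = -x;        /* else clause: mutually exclusive with the then clause */
}
```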

Optimizations (2)
Optimizations on data dependences:
- Data-dependent clusters can be joined together to form a bigger cluster
- Candidates are those clusters containing a number of instructions smaller than N (a predetermined number)
- The algorithm tries to join a task with its successors
- Two clusters are grouped if all the data dependences on edges exiting from a cluster have the same target cluster
- Repeated until no more clusters are joined or no more clusters smaller than N exist
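
A sketch of this merging pass, under stated assumptions: each cluster knows its instruction count and the target clusters of its outgoing data-dependence edges; the data structures and names are illustrative, not the paper's.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_SUCC 8

struct cluster {
    size_t n_instr;                    /* instructions contained in the cluster */
    struct cluster *succ[MAX_SUCC];    /* targets of the outgoing data edges    */
    size_t n_succ;
};

/* A cluster smaller than the threshold N is joined with a successor only if
 * every outgoing data edge points to that same successor. */
static bool try_merge(struct cluster *c, size_t N)
{
    if (c->n_instr >= N || c->n_succ == 0)
        return false;
    for (size_t i = 1; i < c->n_succ; i++)
        if (c->succ[i] != c->succ[0])
            return false;              /* edges exit towards different clusters */
    c->succ[0]->n_instr += c->n_instr; /* join the two clusters                 */
    c->n_instr = 0;
    c->n_succ = 0;
    return true;   /* the caller repeats over all clusters until nothing merges */
}
```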

Task Creation
The final clustered FSDG must be translated into specific data structures effectively representing the tasks.
First step: identification of the task variables
- Edges coming from the ENTRY node are global variables
- Edges entering a cluster represent its input parameters
- Edges leaving a cluster represent its output parameters
- Edges whose source and destination nodes are both contained in the same cluster are the local variables of the task
- When several edges represent the same variable, a single variable is instantiated
Second step: computation of the start conditions
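
The variable classification of the first step can be captured in a task descriptor along these lines (a sketch; the field names and the opaque pointers are assumptions, not the tool's actual data structures):

```c
/* Each class of FSDG edges maps to one field of the descriptor. */
struct task_desc {
    int   id;
    void *globals;      /* variables on edges coming from the ENTRY node         */
    void *in_params;    /* variables on edges entering the cluster               */
    void *out_params;   /* variables on edges leaving the cluster                */
    /* variables on edges fully inside the cluster remain local to the task body */
    void (*body)(void *globals, void *in_params, void *out_params);
};
```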

Start Conditions
- If C1 = TRUE, then a and b must be ready before the task starts
- Else, only c must be ready before the task starts
When C1 is determined, the branch outcome is known: if a, b and c are produced by different tasks, there is no need to wait until all three are produced.
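
For this example the start condition can be written as a boolean expression over the branch outcome and per-producer "task ended" flags; the sketch below assumes a, b and c are produced by three different tasks, and the flag names are illustrative.

```c
#include <stdbool.h>

/* C1:            outcome of the predicate, known once the task computing it ends.
 * T_a, T_b, T_c: TRUE when the task producing a, b or c has ended.
 * The task may start as soon as the parameters of the taken path are ready. */
static bool start_condition(bool C1, bool T_a, bool T_b, bool T_c)
{
    return (C1 && T_a && T_b) || (!C1 && T_c);
}
```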

Start Conditions
A start condition is valid when:
- A task is started only once (only one of the predecessors of the task can start it)
  - Problem: track whether a task has already been started
  - Solution: a boolean variable set to TRUE when the task starts
- The parameters necessary for a correct execution of the task must have already been computed when the start condition evaluates to TRUE
  - Problem: we do not need all the variables, only the parameters for the actual execution path
  - Solution: an algorithm that generates start conditions depending on the execution path

Start Conditions Algorithm
First step: explore the input parameters of a task
- Parameters used in the same control region (i.e. all in the true or all in the false branch) are combined with an AND: they must all be ready if that region is going to be executed
- All the resulting AND expressions (one for each control region) are joined by an OR operator
Second step: explore the preceding tasks
- Search for where the input parameters of the task to start are written
- If a parameter is written along more than one control flow, all the corresponding paths are joined in an OR expression

Example
- Tx: TRUE if task x has ended
- Cx: outcome of predicate x
- (x): all the possible paths that compute x in the preceding tasks

Start condition, expanded over the paths:
¬C2·(C0·T0) + C2·[¬C3·((C0·T0 + C1·T1)·T1) + C3·(C1·T1)]

with
(b) = C0·T0
(a) = C0·T0 + C1·T1
(d) = T1
(c) = C1·T1

that is:
¬C2·(b) + C2·[¬C3·((a)·(d)) + C3·(c)]

Backend
- The start condition is inserted at the end of both Task0 and Task1; when it evaluates to TRUE, Task2 is launched
- For long conditions, BDDs (Binary Decision Diagrams) can be used to reduce the complexity
- A C Writer backend produces the final parallel C:
  - OpenMP-compliant code for functional validation
  - Code compliant with the CerberO platform threading API
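
A self-contained sketch of this placement, reusing the a/b/c example of the Start Conditions slide; the names and the flag-based launch only illustrate the shape of the generated code, they are not the C Writer's actual output.

```c
#include <stdbool.h>
#include <stdio.h>

/* Branch outcome and "producer task ended" flags for the a/b/c example. */
static bool C1 = true;
static bool T_a, T_b, T_c;
static bool task2_started;            /* ensures Task2 is launched only once */

static bool start_cond_task2(void)    /* C1 and T_a and T_b, or not C1 and T_c */
{
    return (C1 && T_a && T_b) || (!C1 && T_c);
}

static void task2(void)
{
    puts("Task2 running");
}

/* Check appended by the backend at the end of every predecessor task; on
 * CerberO the test-and-set below would be protected by a hardware lock. */
static void maybe_start_task2(void)
{
    if (!task2_started && start_cond_task2()) {
        task2_started = true;
        task2();
    }
}

static void task0(void)               /* produces a and b */
{
    T_a = true;
    T_b = true;
    maybe_start_task2();
}

static void task1(void)               /* produces c */
{
    T_c = true;
    maybe_start_task2();
}

int main(void)
{
    task0();                          /* with C1 TRUE, Task2 is launched here       */
    task1();                          /* the second check finds it already started  */
    return 0;
}
```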

Experimental Setup
- CerberO architectures with 2 to 6 processors
- The CerberO OS layer is thin, but the thread-management routines of the architecture have an overhead: shared-memory accesses to store the thread and processor tables
- The sequential programs have been run on a single-processor architecture with CerberO-like memory mapping and CerberO-like thread management
- They have been hand-modified to account for these aspects and to allow a fair comparison

Experimental Results (ADPCM)
- At most 4 parallel threads
- Maximum speedup with 4 processors: 70%
- More processors: more synchronization/threading overhead

Experimental Results (JPEG)
- RGB-to-YUV and 2D-DCT have been parallelized (70% of the sequential JPEG execution time)
- For the whole JPEG algorithm, the maximum speedup reached is 42%
- At most 4 parallel threads

Conclusions
Main contributions:
- A complete design flow from sequential to parallel C code executable on CerberO, a homogeneous multiprocessor system on FPGA
- A partitioning algorithm that extracts parallelism and transforms all control dependences into data dependences
- An algorithm that generates start conditions for dynamic thread scheduling without requiring complex operating-system support from the target architecture
The flow has been applied to several standard applications: the ADPCM and JPEG algorithms show speedups up to 70% and 42%, respectively.

Thank you for your attention! Questions?

Related Work (Clustering & Merging)
Clustering algorithms:
- Dominant Sequence Clustering (DSC) by [Yang and Gerasoulis]
- Linear clustering by [Kim and Browne]
- [Sarkar]'s internalization algorithm (SIA)
Cluster scheduling:
- [Hou, Wang]: evolutionary algorithms
- [Kianzad and Bhattacharyya]: a single-step evolutionary approach for both the clustering and the cluster-scheduling aspects

CerberO details
- A symmetric shared-memory multiprocessor system-on-chip (MPSoC) prototype on FPGA
- Multiple Xilinx MicroBlaze processors, shared IPs and the controller for the shared external memory reside on the shared bus
- The address space of each processor is partitioned into two parts: a private part and a shared part
- Instructions are cached by each processor; data are moved from the shared memory to the fast private memory
- The Synchronization Engine (SE) provides hardware locks/barriers
- A thin operating-system layer dynamically schedules and allocates threads
- More details in the GLSVLSI '07 paper by [Tumeo et al.]