Politecnico di Milano
Automatic parallelization of sequential specifications for symmetric MPSoCs
[Full text is available at https://re.public.polimi.it/retrieve/handle/11311/240811/92308/iess.pdf]
Fabrizio Ferrandi, Luca Fossati, Marco Lattuada, Gianluca Palermo, Donatella Sciuto, Antonino Tumeo
{ferrandi,fossati,lattuada,gpalermo,sciuto,tumeo}@elet.polimi.it
Thursday, May 31 - IESS '07, Irvine, California, USA
Outline
- Introduction
- Related Work
- Target Architecture: CerberO
- Parallelization
  - Partitioning
  - Task start conditions
- Experimental Results
- Conclusions
Introduction
- On-chip multiprocessors are gaining momentum
- The development of good parallel applications is highly dependent on software tools
- Developers must contend with several problems not encountered in sequential programming: non-determinism, communication, synchronization, data partitioning and distribution, load balancing, heterogeneity, shared or distributed memory, deadlocks, race conditions
- This work proposes an approach for the automatic parallelization of sequential programs
Objectives
- This work focuses on a complete design flow: from the high-level sequential C description of the application to its deployment on a multiprocessor system-on-chip prototype
- The sequential code is partitioned into tasks with a specific clustering algorithm
- The resulting task graph is then optimized and the parallel C code is generated (C-to-C transformation)
- The generated tasks are dynamically schedulable:
  - run-time evaluation of boolean conditions
  - tasks are started as soon as their real data dependences are satisfied
Related Work
- Several strategies for partial automation of the parallelization process have been proposed:
  - problem-solving environments that generate parallel programs from high-level sequential descriptions
  - machine-independent code annotations
- The parallelization process:
  - the initial specification is parsed into an intermediate graph representation
  - the intermediate representation is partitioned
  - an initial task graph is obtained; tasks then need to be allocated to processors through clustering and cluster scheduling (merging)
Related Work (Partitioning)
- [Girkar et al.]: a Hierarchical Task Graph (HTG) (we do not use hierarchy); simplification of the conditions for the execution of task nodes
- [Luis et al.]: extend Girkar's work by using a Petri net model to represent parallel code
- [Newburn and Shen]: the PEDIGREE compiler, a flow for automatic parallelization (it is not a C-to-C tool)
  - the Program Dependence Graph (PDG) is analysed, searching for control-equivalent regions
  - these regions are then partitioned bottom-up
Parallelization: the flow
- The sequential C code is compiled with a slightly modified version of GCC 4.0
- The internal structures are dumped
- The C-to-C partitioning algorithm works on a modified system dependence graph (SDG)
- Code is generated for both OpenMP and CerberO
FSDG Creation
- Vertices: statements or predicate expressions
- Grey solid edges: data dependences
- Black edges: control dependences
- Black dashed edges: both
- Grey dashed edges: feedback edges (loops)
- All loops are converted into do-while loops
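The loop normalization mentioned above can be illustrated on a small C fragment (function names are illustrative, not from the tool): a counted for loop becomes a guarded do-while, so each loop contributes a single feedback edge to the FSDG.

```c
#include <assert.h>

/* Original counted loop, as written by the programmer. */
int sum_for(const int *v, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* The same loop after do-while normalization: the guard is hoisted,
 * leaving one back (feedback) edge from the loop test to the body. */
int sum_do_while(const int *v, int n) {
    int s = 0;
    int i = 0;
    if (i < n) {          /* hoisted guard: handles the zero-iteration case */
        do {
            s += v[i];
            i++;
        } while (i < n);  /* single feedback edge in the FSDG */
    }
    return s;
}
```

Both versions compute the same result; only the control structure changes.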
Partitioning
- First step: feedback edge analysis
  - a partition for each loop
  - a partition for the nodes not in loops
- Second step: control edge analysis
  - recognition of control-equivalent (CE) regions
  - statement nodes descending from the same branch condition (TRUE or FALSE) of a predicate node are grouped together
  - each region presents potential parallelism
- Third step: data dependence analysis of CE regions
  - depth-first exploration; a node is added to a cluster if:
    - it is dependent on one and only one node
    - all its predecessors have already been added
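The third step's cluster-growing rule can be sketched as follows. This is a minimal illustration under simplifying assumptions (fixed-size graph, data-dependence edges only; the data layout and names are hypothetical, not the tool's actual structures):

```c
#include <assert.h>

#define MAXN 16

/* Toy dependence graph: preds[v] lists the data predecessors of node v. */
typedef struct {
    int n;                    /* number of nodes */
    int npred[MAXN];          /* predecessor count per node */
    int preds[MAXN][MAXN];    /* predecessor lists */
} Graph;

/* A node may join the cluster only if it depends on exactly one node
 * and all of its predecessors are already in the cluster. */
static int can_join(const Graph *g, int v, const int *in_cluster) {
    if (g->npred[v] != 1) return 0;
    for (int i = 0; i < g->npred[v]; i++)
        if (!in_cluster[g->preds[v][i]]) return 0;
    return 1;
}

/* Grow a cluster from a seed node, iterating to a fixpoint so that
 * chains of single-dependence nodes all end up in the same cluster. */
void grow_cluster(const Graph *g, int seed, int *in_cluster) {
    in_cluster[seed] = 1;
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int v = 0; v < g->n; v++)
            if (!in_cluster[v] && can_join(g, v, in_cluster)) {
                in_cluster[v] = 1;
                changed = 1;
            }
    }
}
```

A node with two or more data predecessors starts a new cluster instead, which is where the potential parallelism between clusters comes from.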
Partitioning
Optimizations
- The partitioning phase tends to produce too many small clusters
  - task management overhead could eliminate all the advantages of parallel execution
- Two types of optimizations
- Optimizations on control structures:
  - control predicates are executed before the statements beneath them
  - a control predicate and the instructions that depend on it are grouped in the same task
  - the Then and Else clauses are grouped in the same task, since they are mutually exclusive
  - control structures that are not optimized are replicated, in order to remove any control dependence among tasks
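The replication of control structures turns a control dependence into a data dependence: the task that computes the predicate exports its outcome as an ordinary value, and the dependent task guards its own statements with it. A minimal sketch (the task bodies and names are illustrative):

```c
#include <assert.h>

/* "Task A" computes the control predicate and exports it as data. */
int task_a(int x, int *p_out) {
    *p_out = (x > 0);      /* the control predicate, now an output value */
    return x * 2;
}

/* "Task B" receives the predicate as an input parameter and replicates
 * the control structure locally: no control edge from A remains, only
 * the data dependence on p and on A's result. */
int task_b(int p, int a_result) {
    int r = 0;
    if (p)                 /* replicated control structure */
        r = a_result + 1;
    return r;
}
```

With this transformation the scheduler only has to track data readiness, which is what the start conditions of the following slides exploit.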
Optimizations (2)
- Optimizations on data dependences:
  - data-dependent clusters can be joined together to form a bigger cluster
  - candidates are the clusters containing fewer than N instructions (N is a predetermined number)
  - the algorithm tries to join a task with its successors
  - two clusters are grouped if all the data dependence edges exiting a cluster have the same target cluster
  - this is repeated until no more clusters are joined or no clusters smaller than N remain
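One sweep of this merging pass can be sketched as below. The data layout, threshold name and field names are hypothetical; only the merge criterion (small cluster, all out-edges to one target) follows the slide:

```c
#include <assert.h>

#define MAXC 8
#define N_MIN 3   /* the threshold N: clusters smaller than this are candidates */

typedef struct {
    int size[MAXC];           /* instructions per cluster */
    int nsucc[MAXC];          /* out-edge count per cluster */
    int succ[MAXC][MAXC];     /* target cluster of each out-edge */
    int merged_into[MAXC];    /* -1 while the cluster still exists */
} Clusters;

/* If every out-edge of cluster c targets the same cluster t, return t;
 * otherwise return -1 (c cannot be merged this sweep). */
static int unique_successor(const Clusters *cs, int c) {
    if (cs->nsucc[c] == 0) return -1;
    int t = cs->succ[c][0];
    for (int i = 1; i < cs->nsucc[c]; i++)
        if (cs->succ[c][i] != t) return -1;
    return t;
}

/* One merging sweep: join each small cluster with its unique successor.
 * Returns the number of merges; the caller repeats until it returns 0. */
int merge_pass(Clusters *cs, int nclusters) {
    int merges = 0;
    for (int c = 0; c < nclusters; c++) {
        if (cs->merged_into[c] != -1 || cs->size[c] >= N_MIN) continue;
        int t = unique_successor(cs, c);
        if (t == -1) continue;
        cs->size[t] += cs->size[c];   /* the target cluster absorbs c */
        cs->merged_into[c] = t;
        merges++;
    }
    return merges;
}
```

Repeating the pass until it makes no progress implements the fixpoint described on the slide.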
Task Creation
- The final clustered FSDG must be translated into specific data structures that effectively represent the tasks
- First step: identification of the task variables
  - edges coming from the ENTRY node are global variables
  - edges entering a cluster represent its input parameters
  - edges leaving a cluster represent its output parameters
  - edges whose source and destination nodes are both contained in the same cluster are the local variables of the task
  - for edges that represent the same variable, a single variable is instantiated
- Second step: computation of the start conditions
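The edge classification of the first step can be written as a single function over the cluster membership of an edge's endpoints. This is a sketch with illustrative names; `ENTRY` is a sentinel id standing in for the graph's ENTRY node:

```c
#include <assert.h>

enum VarClass { VAR_GLOBAL, VAR_INPUT, VAR_OUTPUT, VAR_LOCAL, VAR_UNRELATED };

#define ENTRY (-1)   /* sentinel id for the ENTRY node */

/* Classify the variable carried by edge (src -> dst) with respect to one
 * cluster; in_cluster[v] is nonzero if node v belongs to that cluster. */
enum VarClass classify(int src, int dst, const int *in_cluster) {
    if (src == ENTRY)                  /* edge from ENTRY: global variable */
        return VAR_GLOBAL;
    int s = in_cluster[src], d = in_cluster[dst];
    if (s && d)  return VAR_LOCAL;     /* both ends inside: task local   */
    if (!s && d) return VAR_INPUT;     /* produced elsewhere, used here  */
    if (s && !d) return VAR_OUTPUT;    /* produced here, used elsewhere  */
    return VAR_UNRELATED;              /* edge does not touch this task  */
}
```

Edges carrying the same variable would then be collapsed onto one instantiated variable, as the slide notes.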
Start Conditions
- If C1 = TRUE: a and b must be ready before the task starts
- Else: only c must be ready before the task starts
- When C1 is determined, the branch outcome is known
  - if a, b and c are produced by different tasks, there is no need to wait until all three are produced
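The example above maps directly onto a boolean start condition. In this sketch the readiness flags and the predicate would be set by the producing tasks; the function and parameter names are illustrative:

```c
#include <assert.h>

/* Path-dependent start condition for the slide's example: once C1 is
 * known, the task waits for a and b on the TRUE path, or only for c
 * on the FALSE path. */
int can_start(int c1_known, int c1, int a_ready, int b_ready, int c_ready) {
    if (!c1_known)
        return 0;                       /* branch outcome not decided yet */
    return c1 ? (a_ready && b_ready)    /* TRUE path: needs a and b */
              : c_ready;                /* FALSE path: needs only c */
}
```

So on the FALSE path the task can start as soon as c is produced, even if a and b never arrive.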
Start Conditions
- A start condition is valid when:
  - the task is started only once (only one of the predecessors of the task can start it)
    - problem: tracking whether a task has already been started
    - solution: a boolean variable set to TRUE when the task starts
  - the parameters necessary for a correct execution of the task have already been computed when the start condition evaluates to TRUE
    - problem: we do not need all the variables, just the parameters for the actual execution path
    - solution: an algorithm that generates start conditions depending on the execution path
Start Conditions Algorithm
- First step: explore the input parameters of the task
  - parameters used in the same control region (i.e., all in the true or all in the false branch) are combined with an "and": they must all be ready if that region is going to be executed
  - all the resulting "and" expressions (one for each control region) are joined by an "or" operator
- Second step: explore the preceding tasks
  - searching for where the input parameters of the task to start are written
  - if a parameter is written along more than one control flow, all the corresponding paths are joined in an "or" expression
Example
- Tx: TRUE if task x ended
- Cx: outcome of predicate x
- (x): all the possible paths that compute x in the preceding tasks
- (b) = C0·T0
- (a) = C0·T0 + C1·T1
- (d) = T1
- (c) = C1·T1
- Start condition: ¬C2·(b) + C2·[¬C3·((a)·(d)) + C3·(c)]
- Expanded: ¬C2·(C0·T0) + C2·[¬C3·((C0·T0 + C1·T1)·T1) + C3·(C1·T1)]
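Written out as C, the example's start condition becomes an ordinary boolean expression. The placement of the negations on the C2/C3 false branches is an assumption (the original overbars do not survive extraction); Tx is 1 when task x has ended:

```c
#include <assert.h>

/* Sketch of the example's start condition. The sub-expressions mirror
 * the slide: (b)=C0*T0, (a)=C0*T0+C1*T1, (d)=T1, (c)=C1*T1; the NOTs
 * on c2/c3 are an assumed reconstruction of the lost negation bars. */
int start_cond(int c0, int c1, int c2, int c3, int t0, int t1) {
    int b = c0 && t0;
    int a = (c0 && t0) || (c1 && t1);
    int d = t1;
    int c = c1 && t1;
    return (!c2 && b) ||
           ( c2 && ((!c3 && a && d) || (c3 && c)));
}
```

Expressions of this shape are what the backend evaluates at run time at the end of each predecessor task.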
Backend
- The condition is inserted at the end of both Task0 and Task1
  - when it evaluates to TRUE, Task2 is launched
- For long conditions, BDDs (Binary Decision Diagrams) can be used to reduce the complexity
- A C Writer backend produces the final parallel C code:
  - OpenMP-compliant code for functional validation
  - code compliant with the CerberO platform threading API
Experimental Setup
- CerberO architectures with 2 to 6 processors
- The CerberO OS layer is thin, but the thread management routines of the architecture have an overhead
  - shared memory accesses to store the thread and processor tables
- The sequential programs have been run on a single-processor architecture with
  - CerberO-like memory mapping
  - CerberO-like thread management
- They have been hand-modified to account for these aspects and to allow a fair comparison
Experimental Results (ADPCM)
- At most 4 parallel threads
- Maximum speedup, with 4 processors: 70%
- More processors: more synchronization/threading overhead
Experimental Results (JPEG)
- The RGB-to-YUV conversion and the 2D-DCT have been parallelized (70% of the sequential JPEG execution time)
- For the whole JPEG algorithm, the maximum speedup reached is 42%
- At most 4 parallel threads
Conclusions
- Main contributions:
  - a complete design flow from sequential to parallel C code, executable on CerberO, a homogeneous multiprocessor system on FPGA
  - a partitioning algorithm that extracts parallelism and transforms all control dependences into data dependences
  - an algorithm that generates start conditions for dynamic thread scheduling without requiring complex operating system support from the target architecture
- The flow has been applied to several standard applications
  - the ADPCM and JPEG algorithms show speedups of up to 70% and 42%, respectively
Thank you for your attention! Questions?
Related work (Clustering & Merging)
- Clustering algorithms:
  - dominant sequence clustering (DSC) by [Yang and Gerasoulis]
  - linear clustering by [Kim and Browne]
  - [Sarkar]'s internalization algorithm (SIA)
- Cluster scheduling:
  - [Hou, Wang]: evolutionary algorithms
  - [Kianzad and Bhattacharyya]: a single-step evolutionary approach for both the clustering and the cluster scheduling aspects
CerberO details
- A symmetric shared-memory multiprocessor system-on-chip (MPSoC) prototype on FPGA
- Multiple Xilinx MicroBlazes, shared IPs and the controller for the shared external memory reside on the shared bus
- The addressing space of each processor is partitioned into two parts: a private part and a shared part
- Instructions are cached by each processor; data are moved from the shared to the fast private memory
- A Synchronization Engine (SE) provides hardware locks/barriers
- A thin operating system layer dynamically schedules and allocates threads
- More details in the GLSVLSI '07 paper by [Tumeo et al.]