Integrated Modulo Scheduling and Cluster Assignment for TMS320C64x+ Architecture


1 Integrated Modulo Scheduling and Cluster Assignment for TMS320C64x+ Architecture *
Nikolai Kim, Andreas Krall
{kim,andi}@complang.tuwien.ac.at
Institute of Computer Languages, University of Technology Vienna
ODES-11: Optimizations for DSP and Embedded Systems
* This work is supported by the Austrian Science Fund (FWF) under contract P21842, Epicopt: Optimal Code Generation for Explicitly Parallel Processors.

2 Outline
Implementation
- Extension and adaptation of swing modulo scheduling
- Two different cluster assignment heuristics
- Implemented within LLVM 2.9
- Targeting TI's TMS320C64x+ DSP
Evaluation
- UAS (unified assign and schedule) and ILP (integer linear programming) as baselines
- Based on a cycle-accurate simulator
- MiBench, MediaBench, DSPstone, BenchmarkGames, SingleUnit tests
- 35 kernels in total, the 14 most representative presented

3 Target architecture
[Figure: block diagram of cluster A and cluster B, each with a register file, four functional units (LA, SA, MA, DA and LB, SB, MB, DB), and a crosspath]
Texas Instruments TMS320C64x+
- Clustered VLIW architecture, 2 clusters
- 4 functional units and 32 general-purpose registers per cluster
- 3 predicate registers per cluster, 6 cycles of branch latency
- DSP, SIMD subset, predication, software pipelining buffer

4 Intercluster communication
[Figure: two instruction sequences, a) and b), with the instructions arranged in columns for cluster B and cluster A]
a)
    B0 = ...
    A0 = ...
    B1 = COPY A0
    B2 = LOAD B0 [B1]
b)
    B0 = ...
    A0 = ...
    B1 = ADD B0, A0
Data transfer
- a) explicit, via inserted COPY instructions
- b) implicit, via intercluster crosspaths; a use placed directly after the definition incurs a 1-cycle delay (crosspath stall)
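The crosspath stall rule in b) can be illustrated with a minimal sketch. The `Op` record and the `crosspath_stalls` helper are hypothetical names, not part of the presented implementation. The sketch assumes single-cycle operations and that the first letter of a register name identifies its cluster.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    cycle: int        # issue cycle of the operation
    cluster: str      # "A" or "B": cluster whose functional unit executes it
    dest: str         # destination register, e.g. "B1"
    srcs: tuple = ()  # source registers, e.g. ("B0", "A0")

def crosspath_stalls(ops):
    """Count crosspath reads issued in the cycle directly after the definition.

    Assumption: every operation has a latency of one cycle, so a definition
    issued in cycle c is "directly before" a use issued in cycle c + 1.
    """
    def_cycles = {}
    for op in ops:
        def_cycles.setdefault(op.dest, set()).add(op.cycle)

    stalls = 0
    for op in ops:
        for src in op.srcs:
            reads_other_cluster = src[0] != op.cluster
            if reads_other_cluster and (op.cycle - 1) in def_cycles.get(src, ()):
                stalls += 1
    return stalls

# Case b) from the slide: ADD on cluster B reads A0 through the crosspath.
back_to_back = [Op(0, "B", "B0"), Op(0, "A", "A0"), Op(1, "B", "B1", ("B0", "A0"))]
# Same code with the ADD delayed by one cycle.
spaced = [Op(0, "B", "B0"), Op(0, "A", "A0"), Op(2, "B", "B1", ("B0", "A0"))]
```

In this model, delaying the use by one cycle removes the stall, which is the effect the rescheduling step is meant to achieve.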

5 If-Conversion
[Figure, panels a) to c): control-flow graphs of a loop with the blocks entry, for.cond, land.rhs, land.end, for.body, for.end, for.loop, and BB#6, labeled with the predicates p, !p, q, and !q]
Basics
- Runs as a preprocessing step to modulo scheduling
- Requires hardware support for predication, removes conditional branches
- Reduces the basic block count, increases instruction-level parallelism
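The idea of if-conversion can be sketched on a single two-armed branch. This is an illustrative toy, not the LLVM pass used in the work: `if_convert` and `run` are hypothetical helpers, and an operation is modeled as a `(dest, function, sources)` tuple.

```python
from operator import add, sub

def if_convert(pred, then_ops, else_ops):
    """Merge both arms of a branch on `pred` into one predicated block.

    Each result entry is (predicate, expected value, operation): the operation
    only takes effect when the predicate has the expected value.
    """
    guarded = [(pred, True, op) for op in then_ops]
    guarded += [(pred, False, op) for op in else_ops]
    return guarded

def run(block, env):
    """Execute a predicated block in order; operations with a false guard are skipped."""
    env = dict(env)
    for pred, expected, (dest, fn, srcs) in block:
        if bool(env[pred]) == expected:
            env[dest] = fn(*(env[s] for s in srcs))
    return env

# if (p) x = a + b; else x = a - b;   becomes a single branch-free block
block = if_convert("p", [("x", add, ("a", "b"))], [("x", sub, ("a", "b"))])
taken = run(block, {"p": 1, "a": 5, "b": 3})["x"]
not_taken = run(block, {"p": 0, "a": 5, "b": 3})["x"]
```

Both arms now live in one basic block, so a scheduler is free to interleave them.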

6 Modulo scheduling (1)
General
- Iterative II (initiation interval) scheme, adapted from swing modulo scheduling
- Extended to address target-specific factors such as functional unit support and crosspath stalls
- Employs modulo variable expansion based on lifetime analysis
- Uses a modulo resource table that also captures crosspath occupation
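A modulo resource table that tracks the crosspath as one more resource could look roughly like the sketch below. The class, its methods, and the resource names (`L_A`, `X_A`, and so on) are assumptions made for illustration; the slides do not show the actual data structure.

```python
class ModuloResourceTable:
    """Reservation table with one row per cycle modulo the initiation interval."""

    def __init__(self, ii, resources):
        self.ii = ii
        # slots[resource][row] holds the name of the occupying operation or None
        self.slots = {res: [None] * ii for res in resources}

    def is_free(self, cycle, needed):
        row = cycle % self.ii
        return all(self.slots[res][row] is None for res in needed)

    def reserve(self, cycle, needed, op_name):
        """Reserve all needed resources at once, or none of them."""
        if not self.is_free(cycle, needed):
            return False
        row = cycle % self.ii
        for res in needed:
            self.slots[res][row] = op_name
        return True

# Hypothetical resources of cluster A: four units plus its crosspath X_A.
mrt = ModuloResourceTable(ii=2, resources=["L_A", "S_A", "M_A", "D_A", "X_A"])
first = mrt.reserve(1, ["L_A", "X_A"], "add")      # occupies row 1
clash = mrt.reserve(3, ["S_A", "X_A"], "sub")      # cycle 3 maps to row 1 again
other_row = mrt.reserve(2, ["S_A", "X_A"], "sub")  # row 0 is still free
```

Because cycles are taken modulo II, the operation at cycle 3 conflicts with the one at cycle 1 even though they belong to different iterations.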

7 Modulo scheduling (2)
[Figure: flowchart with the steps Schedule nodes, Assign clusters, a "success?" decision (yes/no), Reschedule nodes, Increase II, and Emit schedule]
Specific
- Two-pass setup:
  1. Iteratively generate a preliminary schedule in combination with the provided clustering heuristics
  2. Distribute intercluster copies, avoid crosspath stalls
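The iterative scheme can be summarized as a driver loop that raises II until scheduling and cluster assignment both succeed. The sketch below is an assumption-laden outline: `schedule_loop` and `toy_try_schedule` are hypothetical, and starting from the textbook resource-constrained lower bound (`res_mii`) is a common choice that the slides do not confirm.

```python
import math

def res_mii(op_counts, unit_counts):
    """Textbook resource bound: the busiest unit class limits the II."""
    return max(math.ceil(op_counts[k] / unit_counts[k]) for k in op_counts)

def schedule_loop(mii, max_ii, try_schedule):
    """Try II = mii, mii + 1, ... until try_schedule(ii) returns a schedule."""
    for ii in range(mii, max_ii + 1):
        schedule = try_schedule(ii)
        if schedule is not None:
            return ii, schedule
    return None  # give up; the loop stays non-pipelined

def toy_try_schedule(ii):
    # Stand-in for "schedule nodes + assign clusters":
    # pretend the attempt only succeeds for II >= 4.
    return {"ii": ii} if ii >= 4 else None

# 5 multiplies on 2 M units and 4 memory operations on 2 D units
lower_bound = res_mii({"M": 5, "D": 4}, {"M": 2, "D": 2})
result = schedule_loop(lower_bound, 10, toy_try_schedule)
```

In the presented flow, a successful attempt would be followed by the second pass (copy distribution and stall avoidance) before the schedule is emitted.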

8 Cluster assignment
Simple naive heuristic
- Non-integrated, loosely coupled with the scheduling routine
- Orders the dependence graph (DG) by depth, handles all dependences uniformly
- Processes the DG at once in a top-down manner
- Decides based on already assigned predecessor nodes only
Extended variant
- Runs inline with the modulo scheduler
- Operates on a DG whose edges are annotated prior to scheduling
- Uses a simple copy cost scheme for DG edge annotation
- Additionally incorporates cluster utilization counters
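A minimal sketch of a top-down, predecessor-driven assignment in the spirit of the simple heuristic follows. The function names are hypothetical, and the tie-breaking rule (prefer the less loaded cluster, then cluster A) is an assumption added to make the example deterministic; the slides do not specify it.

```python
from collections import Counter

def node_depths(preds):
    """Depth of a node = length of the longest path from any root."""
    memo = {}
    def depth(n):
        if n not in memo:
            memo[n] = 1 + max((depth(p) for p in preds[n]), default=-1)
        return memo[n]
    for n in preds:
        depth(n)
    return memo

def assign_clusters(preds):
    """Visit nodes by increasing depth and follow the majority of the
    already assigned predecessors."""
    depth = node_depths(preds)
    load = Counter()
    cluster = {}
    for n in sorted(preds, key=lambda n: (depth[n], n)):
        votes = Counter(cluster[p] for p in preds[n])
        # Most predecessor votes first, then lower load, then name (assumed tie-break).
        choice = min(("A", "B"), key=lambda c: (-votes[c], load[c], c))
        cluster[n] = choice
        load[choice] += 1
    return cluster

# Hypothetical dependence graph: node -> list of predecessors
dg = {"a": [], "b": [], "c": ["a"], "d": ["b"], "e": ["c", "d"]}
assignment = assign_clusters(dg)
```

The two independent chains a-c and b-d end up on different clusters, and only the join node e needs a value from the other side.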

9 Copy-cost annotation
[Figure: dependence graph with nodes a to i; edges are annotated with copy costs of 0 or 1]
Details
- Qualifies adjacent nodes in terms of register copies
- Annotation only, no cluster information generated
- Takes crosspath access possibilities into account
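One possible reading of the copy cost scheme is sketched below: an edge costs 0 if the consumer could read the value through a crosspath, and 1 if separating the two nodes would require an explicit COPY. This rule, the `CROSSPATH_CAPABLE` set, and the helper name are assumptions for illustration only; the actual scheme is not spelled out in the slides. The opcodes follow the earlier intercluster communication example, where ADD reads a remote register directly and LOAD needs a COPY.

```python
# Hypothetical: opcodes assumed able to read one operand via the crosspath.
CROSSPATH_CAPABLE = {"ADD"}

def annotate_copy_costs(opcode, edges):
    """Label each DG edge (producer, consumer) with an assumed copy cost.

    0: the consumer could fetch the value over a crosspath if the nodes
       end up on different clusters.
    1: a different cluster would force an explicit COPY instruction.
    No cluster is chosen here; the result is an annotation only.
    """
    costs = {}
    for producer, consumer in edges:
        costs[(producer, consumer)] = 0 if opcode[consumer] in CROSSPATH_CAPABLE else 1
    return costs

opcode = {"a": "DEF", "b": "DEF", "c": "ADD", "d": "LOAD"}
edges = [("a", "c"), ("b", "c"), ("a", "d")]
costs = annotate_copy_costs(opcode, edges)
```

The sketch ignores that an instruction has only one crosspath operand at its disposal, so both edges into the ADD receive cost 0 even though only one of them could use the crosspath at a time.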

10 Performance factors
Optimization objectives
- Fast schedule generation
- Minimal initiation interval through the iterative scheme
- Reduction of crosspath stalls through explicit rescheduling
- Minimization of intercluster copies through DG labeling
- Even cluster balance through utilization counters

11 Performance evaluation: UAS as baseline
[Figure: Cycle speedup (%) compared to UAS; series: Simple/UAS, Extended/UAS; axis: Speedup in %]

12 Performance evaluation: optimal ILP as baseline
[Figure: Optimality gap (%) to ILP; series: Simple/ILP, Extended/ILP; axis: Runtime optimality gap in %]

13 Performance evaluation: initiation intervals
[Figure: Absolute initiation interval values; series: Simple/UAS, Extended/UAS; axis: Initiation interval]

14 Summary
Conclusions
- The extended clustering heuristic is generally more effective
- Significant speedup compared to UAS (avg. 24.8%)
- In part a significant gap to ILP (avg. 15.8%)
- Nearly even cluster load distribution
Shortcomings, current research
- Backend modulo scheduling support is currently very basic
- Rudimentary loop analysis, restricted applicability
- Clustering is still suboptimal in terms of register copies
- More sophisticated clustering algorithms are in development
- Fair, undistorted comparison to alternative implementations

15 Thank You
Thank you for being my audience!
