Integrated Modulo Scheduling and Cluster Assignment for TMS320C64x+ Architecture 1

Integrated Modulo Scheduling and Cluster Assignment for TMS320C64x+ Architecture 1 Nikolai Kim, Andreas Krall {kim,andi}@complang.tuwien.ac.at Institute of Computer Languages University of Technology Vienna computer lang uages ODES-11: Optimizations for DSP and Embedded Systems 1 This work is supported by the Austrian Science Fund (FWF) under contract P21842, Epicopt: Optimal Code Generation for Explicitly Parallel Processors.

Outline Implementation Swing modulo scheduling extension/adaptation Two different cluster assignment heuristics Implemented within LLVM 2.9 Targeting TI s TMS320C64X DSP Evaluation Taking UAS, ILP as baseline Based on a cycle accurate simulator MiBench, mediabench, DSPStone, BenchmarkGames, SingleUnit tests 35 kernels in total, 14 most representative presented

Target architecture cluster A crosspath A Register file A LB SB MB DB LA SA MA DA Register file B crosspath B cluster B Texas Instruments TMS320C64X Clustered VLIW architecture, 2 clusters 4 functional units, 32 GP registers per cluster 3 predicate registers per cluster, 6 cycles branch latency DSP, SIMD subset, predication, soft. pipelining buffer

Intercluster communication cluster B cluster A B0 =... A0 =... B1 = COPY A0 B2 = LOAD B0 [B1] a) B0 =... A0 =... B1 = ADD B0, A0 b) Data transfer a) explicit, via inserted COPY instructions b) implicit, via intercluster crosspaths, 1 cycle delay (crosspath stall) for uses placed directly after definitions

If-Conversion entry entry for.cond for.cond p!p p land.rhs land.rhs BB#6!q for.loop!p BB#6 q land.end land.end for.end!q for.body q!q b) c) for.end for.body a) Basics As preprocessing to modulo scheduling Requires hardware support, removes conditional branches Reduces basic block count, increases ILP

Modulo scheduling (1) General Iterative II scheme, swing scheduling adaptation Extended to address target specific factors such as functional unit support and crosspath stalls Employs modulo variable expansion based on lifetime analysis Utilizes modulo resource table, captures crosspath occupation

Modulo scheduling (2) Schedule nodes Assign clusters yes success? no Reschedule nodes Increase II Emit schedule Specific Two-pass setup: Iteratively generate a preliminary schedule in combination with provided clustering heuristics Distribute intercluster copies, avoid crosspath stalls

Cluster assignment Simple naive heuristic Non-integrated, losely coupled with scheduling routine DG depth ordering, uniform handling of all dependences Processes the DG at once in a top-down manner Decides upon already assigned predecessor nodes only Extended variant Runs inline with the modulo scheduler Operates on a DG with edges annotated prior to scheduling Uses a simple copy cost scheme for DG edge annotation Additionally incorporates cluster utilization counters

Copy-cost annotation 0 a 0 1 b c 0 0 d 1 e f g 0 h 1 1 0 i Details Qualifies adjacent nodes in terms of register copies Annotation only, no cluster information generated Takes crosspath access possibilities into account

Performance factors Optimization objectives Fast schedule generation Minimal initiation interval through iterative scheme Reduction of crosspath stalls through explicit rescheduling Minimization of intercluster copies through DG labeling Even cluster balance through utilization counters

Performance evaluation: UAS as baseline Simple/UAS Extended/UAS 60.00 Speedup in % 40.00 20.00 0.00-20.00-40.00 Figure: Cycle speedup (%) comparison to UAS

Performance evaluation: optimal ILP as baseline 70.00 Simple/ILP Extended/ILP Runtime optimality gap % 60.00 50.00 40.00 30.00 20.00 10.00 0.00 Figure: Optimality gap (%) to ILP

Performance evaluation: initiation intervals 30 Simple/UAS Extended/UAS 25 Initiation interval 20 15 10 5 0 Figure: Absolute initiation interval values

Summary Conclusions Extended clustering heuristic generally more potent Significant speedup compared to UAS (avg. 24.8%) Partially significant gap to ILP (avg. 15.8%) Nearly even cluster load distribution Shortcomings, current research Backend modulo scheduling support currently very basic Rudimentary loop analysis, restricted applicability Clustering still suboptimal in terms of register copies More sophisticated clustering algorithms in development Fair, undistorted comparison to alternative implementations

Thank You Thank you for being my audience!