Introduction Warp Processors Dynamic HW/SW Partitioning. Introduction Standard binary - Separating Function and Architecture


Roman Lysecky
Department of Electrical and Computer Engineering, University of Arizona

Introduction: Warp Processors - Dynamic HW/SW Partitioning
1. Initially execute application in software only
2. Profile application to determine critical regions
3. Partition critical regions to hardware
4. Program configurable logic and update software binary
5. Partitioned application executes faster with lower energy consumption
[Figure: Dynamic Part. Module (DPM); time and energy for SW Only vs. HW/SW execution]
US Patent Pending, 2004

Applications: Fingerprint Detection
[Figure: fingerprint detection against a fingerprint database; SW-only execution on a processor compared with SW/profiling, dynamic partitioning, and HW/SW execution on a warp processor]

Introduction: Standard binary - Separating Function and Architecture
- Software binaries of the past reflected the specific language of the underlying architecture, which limited portability
- Current standard binary concept: separate function from detailed architecture
  - Develop new architectures for existing applications
  - Trend towards dynamic translation and optimization
- FPGA expansion: ideally, improve performance by simply adding additional FPGAs, similar to adding memory
[Figure: SW binary, standard compiler, profiling, CAD, x86; execution time (s)]

Why configurable logic (FPGAs)?
C code for bit reversal:
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ff) | ((x << 8) & 0xff00ff00);
    x = ((x >> 4) & 0x0f0f0f0f) | ((x << 4) & 0xf0f0f0f0);
    x = ((x >> 2) & 0x33333333) | ((x << 2) & 0xcccccccc);
    x = ((x >> 1) & 0x55555555) | ((x << 1) & 0xaaaaaaaa);
Compiled for the processor:
    sll $v[],$v[],x
    srl $v[],$v[],x
    or $v[],$v[],$v[]
    srl $v[],$v[],x8
    and $v[],$v[],$t5[]
    sll $v[],$v[],x8
    and $v[],$v[],$t4[]
    or $v[],$v[],$v[]
    srl $v[],$v[],x4
    and $v[],$v[],$t[]
    sll $v[],$v[],x4
    and $v[],$v[],$t[]
    ...
- Processor: requires between 32 and 128 cycles
- Hardware for bit reversal: requires only 1 cycle (speedup of 32x to 128x)
[Figure: hardware for bit reversal, wiring each bit of the original X value to the mirrored bit of the reversed X value]

Dynamic HW/SW Partitioning
[Figure: SW binary, standard compiler, profiling, CAD tools, FPGA, and processor; label "Traditional partitioning done here"]
- Enabler: synthesis from binaries [Stitt & Vahid, 2005][Stitt & Vahid, 2002]
- Advantages
  - Does not require any special compilers
  - Completely transparent
  - Provides separation of function and architecture for architectures incorporating FPGAs
  - Avoids complexities of supporting different FPGAs
  - Opens additional market segments (i.e., all software developers) that otherwise would not use FPGAs and CAD
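The bit-reversal example can be checked with a small, self-contained C sketch. The function names bit_reverse32 and bit_reverse32_wires are illustrative and do not come from the slides, and a 32-bit word is assumed. The first function is the shift-and-mask software version; the second models the hardware view, in which each output bit is simply wired to the mirrored input bit.

```c
#include <stdint.h>

/* Software version: swap progressively smaller groups of bits
 * (halves, bytes, nibbles, bit pairs, single bits). */
uint32_t bit_reverse32(uint32_t x)
{
    x = (x >> 16) | (x << 16);
    x = ((x >> 8) & 0x00ff00ffu) | ((x << 8) & 0xff00ff00u);
    x = ((x >> 4) & 0x0f0f0f0fu) | ((x << 4) & 0xf0f0f0f0u);
    x = ((x >> 2) & 0x33333333u) | ((x << 2) & 0xccccccccu);
    x = ((x >> 1) & 0x55555555u) | ((x << 1) & 0xaaaaaaaau);
    return x;
}

/* Reference model of the hardware view (an assumption for illustration):
 * output bit (31 - i) is a plain wire from input bit i, so no gates
 * are involved, only routing. */
uint32_t bit_reverse32_wires(uint32_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        if (x & (UINT32_C(1) << i))
            r |= UINT32_C(1) << (31 - i);
    }
    return r;
}
```

Both functions should agree on every input; the software version spends several shift, mask, and OR instructions per stage, while the wiring model has no computation at all, which is the point the slide makes about configurable logic.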

Warp Processors: Tools (CAD)
[Figure: on-chip CAD flow with Decompilation, Partitioning, RT Synthesis, and Updater stages, producing an updated binary and a HW bitstream; part of the flow is marked "My Research Focus"]

Existing FPGAs Not Suitable
- Existing FPGAs require extremely complex CAD tools
  - Designed to handle large arbitrary circuits, ASIC prototyping, etc.
  - Require long execution times and very large memory usage
  - Not suitable for dynamic on-chip execution
[Figure: execution time and memory usage of each CAD stage]

CAD-Oriented FPGA
- Solution: develop a custom CAD-oriented FPGA
  - Careful simultaneous design of FPGA and CAD
  - FPGA features evaluated for impact on CAD
  - Add architecture features for SW kernels
- Enables development of fast, lean compilation tools
[Figure: execution time and memory usage of each CAD stage]

Warp Configurable Logic Architecture (WCLA)
- Need a fast, efficient coprocessor interface
  - Analyzed digital signal processors (DSP) and existing coprocessors
- Data address generators (DADG) and loop control hardware (LCH)
  - Provide fast loop execution
  - Support memory accesses with regular access patterns
- Integrated 32-bit multiplier-accumulator (MAC)
  - Frequently found within critical SW kernels
  - Fast, single-cycle multipliers are large and require many interconnections
[Figure: ARM processor, DADG & LCH, registers, 32-bit MAC, and configurable logic fabric]
A Configurable Logic Fabric for Dynamic Hardware/Software Partitioning, DATE 2004

WCLA - Configurable Logic Fabric (CLF)
- Hundreds of existing commercial and research fabrics
  - Most designed to balance circuit density and speed
- Analyzed FPGA features to determine their impact on CAD
  - Designed our CLF in conjunction with compilation tools
- Array of configurable logic blocks (CLBs) surrounded by switch matrices (SMs)
  - Each CLB is directly connected to an SM
  - Along with SM design, allows for design of lean JIT routing
[Figure: DADG, LCH, 32-bit MAC, and configurable logic fabric]

WCLA - Combinational Logic Block
- Existing FPGAs: flexibility/density
  - Large CLBs, various internal routing resources
- Our combinational logic block: simplicity
  - Limited internal routing, reduces on-chip CAD complexity
- Incorporates two 3-input 2-output LUTs
  - Equivalent to four 3-input LUTs with fixed internal routing
  - Allows for good quality circuits while reducing JIT technology mapping complexity
- Provides routing resources between adjacent CLBs to support carry chains
  - Reduces number of nets we need to route
[Figure: CLB with inputs a-f feeding two LUTs, outputs o1-o4, and connections to adjacent CLBs]
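To make the 3-input 2-output LUT and the carry-chain remark concrete, here is a minimal C model. It is only a sketch: the type lut3x2_t, the functions, and the choice of a full adder as the configured function are assumptions for illustration, not the WCLA bitstream format. Each LUT output is an 8-entry truth table indexed by the three inputs, and the carry output of one LUT feeds the next, standing in for the direct routing between adjacent CLBs.

```c
#include <stdint.h>

/* Hypothetical model of one 3-input, 2-output LUT: one 8-entry truth
 * table per output, stored as the bits of a byte. */
typedef struct {
    uint8_t table[2];
} lut3x2_t;

/* Look up one output bit; input a is index bit 0, b bit 1, c bit 2. */
unsigned lut3x2_eval(const lut3x2_t *lut, unsigned a, unsigned b,
                     unsigned c, int output)
{
    unsigned index = (a & 1u) | ((b & 1u) << 1) | ((c & 1u) << 2);
    return (lut->table[output] >> index) & 1u;
}

/* Full adder configuration: output 0 is the sum (a ^ b ^ c, table 0x96),
 * output 1 is the carry-out (majority of a, b, c, table 0xE8). */
static const lut3x2_t FULL_ADDER = { { 0x96, 0xE8 } };

/* 4-bit ripple-carry adder: one LUT per bit position, with the carry
 * handed to the neighbouring LUT like an adjacent-CLB carry chain. */
unsigned add4_with_luts(unsigned x, unsigned y)
{
    unsigned carry = 0, result = 0;
    for (int bit = 0; bit < 4; bit++) {
        unsigned a = (x >> bit) & 1u;
        unsigned b = (y >> bit) & 1u;
        result |= lut3x2_eval(&FULL_ADDER, a, b, carry, 0) << bit;
        carry = lut3x2_eval(&FULL_ADDER, a, b, carry, 1);
    }
    return result | (carry << 4);   /* bit 4 holds the final carry-out */
}
```

Because sum and carry share the same three inputs, a single 2-output LUT covers a full adder, and the carry never has to be routed as a general net.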

WCLA - Switch Matrix
- Existing FPGAs: flexibility/speed
  - Large routing resources, various routing options
- Our switch matrix: simplicity
  - Allows for design of a fast, lean routing algorithm
- All nets are routed using only a single pair of channels throughout the configurable logic fabric
  - Each short channel is associated with a single long channel
  - Designed for fast, lean routing
[Figure: switch matrix with short and long (L) channels]
A Configurable Logic Fabric for Dynamic Hardware/Software Partitioning, DATE 2004

ROCM - Riverside On-Chip Minimizer
- Two-level minimization tool
- Uses a combination of approaches from Espresso-II [Brayton, et al., 1984][Hassoun & Sasao, 2002] and Presto [Svoboda & White, 1979]
- Uses a single expand phase instead of multiple iterations
- Eliminates the need to compute the off-set, to reduce memory usage
- On average only % larger than optimal solution
[Figure: Expand, Reduce, Irredundant; on-set, dc-set, off-set]
On-Chip Logic Minimization, DAC 2003
A Codesigned On-Chip Logic Minimizer, CODES+ISSS 2003

ROCTM - Riverside On-Chip Technology Mapper (Technology Mapping/Packing)
- Decompose hardware circuit into a DAG
  - Nodes correspond to basic 2-input logic gates (AND, OR, XOR, etc.)
- Hierarchical bottom-up graph clustering algorithm
  - Breadth-first traversal combining nodes to form single-output LUTs
  - Combine LUTs with common inputs to form final 2-output LUTs
  - Pack LUTs in which output from one LUT is input to second LUT
[Figure: execution time and memory usage of each CAD stage]
Dynamic Hardware/Software Partitioning: A First Approach, DAC 2003
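The single expand phase described for ROCM can be sketched in C as follows. This is an illustrative simplification and not ROCM's actual implementation: the cube encoding, the names cube_t and expand_cube, the fixed NUM_VARS, and the brute-force minterm enumeration are all assumptions. What it shares with the slide is the idea that an expanded cube is accepted by checking it against the on-set and dc-set only, so no off-set is ever built.

```c
#include <stdint.h>

#define NUM_VARS 4   /* hypothetical input count, small enough to enumerate */

/* A cube (product term): a variable is a literal of the cube when its
 * bit in `care` is set; `value` holds the required polarity. */
typedef struct {
    uint8_t care;
    uint8_t value;
} cube_t;

static int cube_covers(cube_t c, unsigned minterm)
{
    return (minterm & c.care) == c.value;
}

/* A cube is valid if every minterm it contains is covered by some cube
 * of `cover` (on-set plus dc-set). The off-set is never computed. */
static int cube_is_valid(cube_t c, const cube_t *cover, int n)
{
    for (unsigned m = 0; m < (1u << NUM_VARS); m++) {
        if (!cube_covers(c, m))
            continue;
        int hit = 0;
        for (int i = 0; i < n && !hit; i++)
            hit = cube_covers(cover[i], m);
        if (!hit)
            return 0;
    }
    return 1;
}

/* Single expand pass: try to drop each literal once, in variable order,
 * and keep the wider cube whenever it is still valid. */
cube_t expand_cube(cube_t c, const cube_t *cover, int n)
{
    for (int v = 0; v < NUM_VARS; v++) {
        uint8_t bit = (uint8_t)(1u << v);
        if (!(c.care & bit))
            continue;
        cube_t wider = { (uint8_t)(c.care & ~bit), (uint8_t)(c.value & ~bit) };
        if (cube_is_valid(wider, cover, n))
            c = wider;
    }
    return c;
}
```

For example, with variable a as bit 0 and b as bit 1, expanding a'b against the cover {a'b, ab} drops the literal a and leaves the single-literal cube b. A real minimizer would replace the enumeration with a covering or tautology check that scales beyond a handful of inputs.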

ROCPLACE - Riverside On-Chip Placer
- Dependency-based positional placement algorithm
  - Identify critical path, placing critical nodes in center of CLF
  - Use dependencies between remaining CLBs to determine placement
  - Attempt to use adjacent CLB routing whenever possible
[Figure: execution time and memory usage of each CAD stage]
Dynamic Hardware/Software Partitioning: A First Approach, DAC 2003
A Configurable Logic Fabric for Dynamic Hardware/Software Partitioning, DATE 2004

Routing
- Find a path within the FPGA to connect source and sinks of each net within our hardware circuit
- Pathfinder [Ebeling, et al., 1995]
  - Introduced negotiated congestion
  - During each routing iteration, route nets using shortest path
    - Allows overuse (congestion) of resources
  - If congestion exists (illegal routing)
    - Update cost of congested resources
    - Rip up all routes and reroute all nets
- VPR [Betz, et al., 1997]
  - Increased performance over Pathfinder
  - Routability-driven: use fewest tracks possible
  - Timing-driven: optimize circuit speed
  - Many techniques are used in commercial FPGA CAD tools
[Figure: flowchart: route; congestion illegal? yes: rip up and reroute; no: done]

ROCR - Riverside On-Chip Router
- Resource graph
  - Nodes correspond to SMs
  - Edges correspond to channels between SMs
  - Capacity of edge equal to the number of wires within the channel
- Requires much less memory than VPR as resource graph is smaller
- Produces circuits with critical path % shorter than VPR (RD)
[Figure: flowchart: route on resource graph; illegal? yes: rip up; no: done]
Dynamic FPGA Routing for Just-in-Time FPGA Compilation, DAC 2004

ROCR - Memory Usage
- VPR requires over 5MB of memory with an average of over MB
- ROCR requires at most.6 MB
- VPR requires up to 6X more memory
[Figure: memory usage (KB) of VPR (RD), VPR (TD), and ROCR per benchmark, alu4 through tseng, and on average]

ROCR - Algorithm Performance
- ROCR is on average X faster than VPR (TD)
- Up to X faster for ex5p
[Figure: execution time (s) of VPR (TD) and ROCR per benchmark, alu4 through tseng, and on average]
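The negotiated-congestion loop described for Pathfinder (route every net by shortest path, allow overuse, raise the cost of congested resources, rip up and reroute) can be sketched in C as below. This is a toy model under stated assumptions, not ROCR or VPR code: capacities sit on nodes rather than on channel edges, costs are small integers, and all names (route_graph_t, net_t, negotiate_routes) are hypothetical.

```c
#include <limits.h>
#include <string.h>

#define MAX_NODES 16

typedef struct {
    int n_nodes;
    int adj[MAX_NODES][MAX_NODES]; /* 1 if an undirected link exists */
    int capacity[MAX_NODES];       /* nets a node can legally carry */
    int history[MAX_NODES];        /* accumulated congestion penalty */
    int occupancy[MAX_NODES];      /* nets using the node this iteration */
} route_graph_t;

typedef struct {
    int src, dst;
    int path[MAX_NODES];           /* filled from dst back to src */
    int path_len;
} net_t;

/* Cost of entering node n: base cost plus history, scaled up when one
 * more net would exceed the node's capacity. */
static int node_cost(const route_graph_t *g, int n)
{
    int over = g->occupancy[n] + 1 - g->capacity[n];
    if (over < 0)
        over = 0;
    return (1 + g->history[n]) * (1 + over);
}

/* O(N^2) Dijkstra from src to dst; records the path and occupies it.
 * Returns 0 if dst is unreachable. */
static int route_net(route_graph_t *g, net_t *net)
{
    int dist[MAX_NODES], prev[MAX_NODES], done[MAX_NODES] = {0};
    for (int i = 0; i < g->n_nodes; i++) {
        dist[i] = INT_MAX;
        prev[i] = -1;
    }
    dist[net->src] = 0;
    for (;;) {
        int u = -1;
        for (int i = 0; i < g->n_nodes; i++)
            if (!done[i] && dist[i] != INT_MAX && (u < 0 || dist[i] < dist[u]))
                u = i;
        if (u < 0)
            return 0;
        if (u == net->dst)
            break;
        done[u] = 1;
        for (int v = 0; v < g->n_nodes; v++) {
            if (!g->adj[u][v] || done[v])
                continue;
            int d = dist[u] + node_cost(g, v);
            if (d < dist[v]) {
                dist[v] = d;
                prev[v] = u;
            }
        }
    }
    net->path_len = 0;
    for (int n = net->dst; n != -1; n = prev[n]) {
        net->path[net->path_len++] = n;
        g->occupancy[n]++;
    }
    return 1;
}

/* Returns the iteration at which routing became legal, 0 if it did not
 * within max_iters, or -1 if some net cannot be routed at all. */
int negotiate_routes(route_graph_t *g, net_t *nets, int n_nets, int max_iters)
{
    memset(g->history, 0, sizeof g->history);
    for (int iter = 1; iter <= max_iters; iter++) {
        memset(g->occupancy, 0, sizeof g->occupancy); /* rip up all routes */
        for (int i = 0; i < n_nets; i++)
            if (!route_net(g, &nets[i]))
                return -1;
        int congested = 0;
        for (int n = 0; n < g->n_nodes; n++) {
            int over = g->occupancy[n] - g->capacity[n];
            if (over > 0) {
                g->history[n] += over;  /* make the contested node dearer */
                congested = 1;
            }
        }
        if (!congested)
            return iter;
    }
    return 0;
}
```

With two nets that both prefer the same unit-capacity node and a longer detour available to one of them, the first iteration leaves the shared node overused; its history cost then rises and the second iteration sends one net around the detour. Moving the capacities onto edges, as the ROCR resource graph does, would change the bookkeeping but not the negotiation loop.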

Experimental Setup
- Warp Processor
  - MHz ARM7 processor
  - Configurable logic fabric with maximum frequency of 5 MHz
  - Used dynamic on-chip CAD tools to map critical region to hardware
    - Requires less than seconds to perform synthesis and compilation
- Traditional HW/SW Partitioning
  - MHz ARM7 processor
  - Xilinx Virtex-E FPGA (executing at maximum possible speed)
  - Manually partitioned software using VHDL
    - VHDL synthesized using Xilinx ISE 4. on desktop
[Figure: ARM7 with the warp configurable logic vs. ARM7 with Xilinx Virtex-E]
Dynamic FPGA Routing for Just-in-Time FPGA Compilation, DAC 2004

Performance Speedup (Critical Region, Single Kernel)
- Average critical region speedup of 4 vs. for Virtex-E
- WCLA simplicity results in faster HW circuits
[Figure: critical region speedup over SW-only execution for Warp Proc. and Xilinx Virtex-E per benchmark, brev through matmul, and on average]

Performance Speedup (Overall, Multiple Kernels)
- Average speedup of 7.4
- Energy reduction of 8% - 94%
[Figure: overall speedup over SW-only execution for Warp Proc. per benchmark, brev through matmul, and on average]

Results
[Figure: execution time and memory usage of the on-chip CAD tools (75MHz ARM7) compared with Xilinx ISE]

Conclusions
- Developed warp processors
  - Dynamically and transparently re-implement SW kernels as HW implemented using on-chip FPGA
- Developed Warp Configurable Logic Architecture
  - Designed specifically to allow development of lean on-chip CAD tools
- Developed fast, lean on-chip compilation tools
  - Require an order of magnitude less memory and execution time; capable of on-chip execution
- Speedups are significant
  - Average speedups of 7.4X
  - Speedup of over X possible for many applications
  - Energy reduction of 8% to 94%
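The slides report critical-region speedup and overall speedup separately. One way to relate the two is Amdahl's law, which the slides do not state; the sketch below applies it under that assumption, with hypothetical parameter names, and should not be read as the method used to produce the reported numbers.

```c
/* Overall application speedup when a fraction `critical_fraction` of the
 * software-only execution time is accelerated by `kernel_speedup` and the
 * rest runs unchanged (Amdahl's law; an assumption, not from the slides). */
double overall_speedup(double critical_fraction, double kernel_speedup)
{
    return 1.0 / ((1.0 - critical_fraction)
                  + critical_fraction / kernel_speedup);
}
```

For instance, if the critical regions account for 90% of software execution time and the hardware runs them 10 times faster, the overall speedup is 1 / (0.1 + 0.09), a little over 5. The non-accelerated remainder bounds the overall gain at 10 no matter how fast the hardware kernel becomes.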

Future Directions
- Extend to desktop/server/PDA domains
- Increase parallelism within
- Efficient memory/data reuse methods
- Development of a standard HW binary
- Support more complex architectures
- High Performance HW/SW Partitioning
  - Operating system aware HW/SW partitioning
    - HW/SW partitioning must be tightly integrated with OS
    - What OS support is required for HW/SW partitioning and warp processing?
- Low Power Design
  - Dynamic power management within FPGAs
    - Requires development of new architectures and CAD tools

Patents & Publications
Patents
- F. Vahid, R. Lysecky, G. Stitt. Warp Processor for Dynamic Hardware/Software Partitioning. US Patent Pending, 2004.
Publications
- R. Lysecky, F. Vahid, S. Tan. A Study of the Scalability of On-Chip Routing for Just-in-Time FPGA Compilation. IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), April 2005.
- R. Lysecky, F. Vahid. A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning. Design Automation and Test in Europe Conference (DATE), 2005.
- R. Lysecky, F. Vahid, S. Tan. Dynamic FPGA Routing for Just-in-Time FPGA Compilation. Design Automation Conference (DAC), 2004.
- R. Lysecky, F. Vahid. A Configurable Logic Architecture for Dynamic Hardware/Software Partitioning. Design Automation and Test in Europe Conference (DATE), 2004.
- R. Lysecky, F. Vahid. On-Chip Logic Minimization. Design Automation Conference (DAC), 2003.
- G. Stitt, R. Lysecky, F. Vahid. Dynamic Hardware/Software Partitioning: A First Approach. Design Automation Conference (DAC), 2003.
