Custom Instruction Generation Using Temporal Partitioning Techniques for a Reconfigurable Functional Unit


1 Custom Instruction Generation Using Temporal Partitioning Techniques for a Reconfigurable Functional Unit Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani, Kazuaki Murakami, Koji Inoue, Mehdi Sedighi Computer and IT Engineering Department, Amirkabir University of Technology {mehdipur,szamani,msedighi}@aut.ac.ir Department of Informatics, Graduate School of Information Science and Electrical Engineering, Kyushu University noori@c.csce.kyushu-u.ac.jp, {murakami,inoue}@i.kyushu-u.ac.jp

2 Agenda: Introduction; Application-specific instruction set extension; Temporal partitioning; Some definitions; General overview of the architecture; RFU architecture: a quantitative approach; Generating custom instructions; Mapping custom instructions; Integrating RFU with base processor; Integrated framework for generating and mapping custom instructions; Performance evaluation; References

3 Introduction An extensible processor with a reconfigurable functional unit (RFU) can be an alternative to General Purpose Processors (GPPs), Application-Specific Integrated Circuits (ASICs) and Application-Specific Instruction set Processors (ASIPs) for achieving enhanced performance in embedded systems. ASICs: not flexible; expensive and time-consuming design process. GPPs: very flexible; may not offer the necessary performance.

4 Introduction ASIPs: more flexible than ASICs; more potential than GPPs to meet the high-performance demands of embedded applications; require generating a complete instruction set architecture for the targeted application; a full-custom solution is too expensive and has long design turnaround times.

5 Application-specific instruction set extension Another method for performance improvement: an extensible processor with a reconfigurable functional unit offers a favorable tradeoff between efficiency and flexibility while keeping design turnaround time much shorter. Critical portions of an application's dataflow graph (DFG) are accelerated by using custom functional units. Nodes of the DFG -> instructions of the critical portions. Edges of the DFG -> dependencies between instructions.

6 Temporal Partitioning Partitioning a data flow graph into a number of partitions such that each partition can fit into the target hardware and dependencies among the graph nodes are not violated.
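The definition above can be turned into a small validity check. This is a minimal sketch, not the authors' tool: it assumes "fits into the target hardware" can be approximated by a hypothetical `capacity` (maximum nodes per partition), and that partitions execute in index order, so a dependency is respected when a producer's partition index is not greater than its consumer's.

```python
def is_valid_temporal_partitioning(edges, partition_of, capacity):
    """Check a temporal partitioning of a DFG.

    edges        -- list of (producer, consumer) node pairs
    partition_of -- dict mapping each node to its partition index (0, 1, ...)
    capacity     -- hypothetical hardware limit: max nodes per partition
    """
    # Count how many nodes land in each partition.
    sizes = {}
    for part in partition_of.values():
        sizes[part] = sizes.get(part, 0) + 1
    if any(size > capacity for size in sizes.values()):
        return False  # some partition does not fit the hardware
    # A producer must never be placed in a later partition than its consumer.
    return all(partition_of[u] <= partition_of[v] for u, v in edges)


edges = [("a", "b"), ("b", "c")]
print(is_valid_temporal_partitioning(edges, {"a": 0, "b": 0, "c": 1}, capacity=2))  # True
print(is_valid_temporal_partitioning(edges, {"a": 1, "b": 0, "c": 1}, capacity=2))  # False
```

A real RFU check would also have to account for inputs, outputs and routing, which this sketch deliberately ignores.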

7 Some definitions Hot Basic Block (HBB): a basic block whose execution frequency is greater than a given threshold specified in the profiler. Custom Instructions (CIs): extensions to the Instruction Set Architecture (ISA) that are executed on the RFU. Reconfigurable Functional Unit (RFU): custom hardware for executing CIs.

8 General overview of the architecture Adaptive Dynamic Extensible Processor. Base processor: N-way in-order general RISC (Fetch, Decode, Execute, Memory, Write stages and register file). Augmented hardware: the profiler detects start addresses of Hot Basic Blocks (HBBs); the sequencer switches between the main processor and the RFU; the RFU executes custom instructions.

9 Operation modes Training mode (first panel): applications run with binary-level profiling; the profiler detects the start addresses of HBBs. Training mode (second panel): tools generate custom instructions, generate configuration data for the ACC and initialize the sequencer table; binary rewriting. Normal mode: applications run; the sequencer monitors the PC and switches between the main processor and the ACC; CIs are executed. [Figure: three panels, each showing the processor, ACC, profiler and sequencer]

10 Tool Chain Base processor: SimpleScalar (PISA configuration); 22 applications of MiBench. Profiler: detecting start addresses of HBBs; results are used for designing the RFU. Reading HBBs from object code; generating DFG for HBBs; custom instruction generator; mapping CIs on the RFU; optimization (constant propagation); updating DFG.

11 Reconfigurable Functional Unit (RFU) The RFU is a matrix of Functional Units (FUs). It has a two-level configuration memory: a multi-context memory (keeps two or four configurations) and a cache. FUs support only logical operations, add/subtract, shifts and compare. The RFU updates the PC. The RFU has a variable delay that depends on the size of the custom instruction.

12 RFU Architecture: A Quantitative Approach 22 programs of MiBench were chosen, and the SimpleScalar toolset was used for simulation. The RFU is a matrix of FUs; the design parameters are the number of inputs, number of outputs, number of FUs, connections, and location of inputs and outputs. Some definitions: frequency and weight are considered in the measurements. Frequency: CI execution frequency. Weight: equal to the number of executed instructions. Average = Σ(Freq × Weight) over all CIs. Rejection: percentage of CIs that could not be mapped on the RFU. Coverage: percentage of CIs that could be mapped on the RFU. Basic block: a sequence of instructions that terminates in a control instruction. Hot basic block: a basic block executed more than a threshold number of times.
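One way to read the frequency-and-weight definitions is as a weighted percentage. The sketch below assumes that each CI contributes Freq × Weight to the total, and that coverage and rejection are the mapped and unmapped shares of that total. The slide does not state the normalization, so this reading, the function name and the dictionary keys are all illustrative assumptions.

```python
def weighted_coverage(cis):
    """Return (coverage %, rejection %) weighted by freq * weight.

    Each CI is a dict with hypothetical keys:
      freq   -- execution frequency of the CI
      weight -- number of instructions the CI replaces
      mapped -- True if the CI could be mapped on the RFU
    """
    total = sum(ci["freq"] * ci["weight"] for ci in cis)
    mapped = sum(ci["freq"] * ci["weight"] for ci in cis if ci["mapped"])
    coverage = 100.0 * mapped / total
    return coverage, 100.0 - coverage


cis = [
    {"freq": 30, "weight": 5, "mapped": True},    # contributes 150
    {"freq": 5, "weight": 10, "mapped": False},   # contributes 50
]
print(weighted_coverage(cis))  # (75.0, 25.0)
```

Under this reading, a large CI that runs rarely can matter less than a small CI in a very hot block.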

13 RFU Architecture Distribution of inputs across rows: row 1 = 7, row 2 = 2, row 3 = 2, row 4 = 2, row 5 = 1. Connections with variable length: row1 -> row3 = 1, row1 -> row4 = 1, row1 -> row5 = 1, row2 -> row4 = 1. Synthesis results using Hitachi 0.18 μm: Area: mm²; Delay: 9.66 ns

14 Integrating RFU with the Base Processor [Figure: block diagram showing the register file (Reg0 ... Reg31), configuration memory, decoder, sequencer, DEC/EXE pipeline registers, the RFU with FU1 to FU4, and EXE/MEM pipeline registers]

15 Generation of Custom Instructions Custom instructions exclude floating-point, multiply, divide and load instructions, and include at most one STORE, at most one BRANCH/JUMP and all other fixed-point instructions. A simple algorithm is used for generating custom instructions. HBBs usually include 10~40 instructions for MiBench. The custom instruction generator is going to be executed on the base processor (in online training mode).
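The slide does not spell out the "simple algorithm", so the following is only one plausible sketch of the stated rules. It assumes a sequential scan over the instructions of an HBB, closing the current CI whenever an excluded instruction appears or a second store or branch would be added. The instruction-kind labels and the function name are hypothetical.

```python
# Hypothetical instruction-kind labels; excluded kinds may not join a CI.
EXCLUDED = {"fp", "mul", "div", "load"}


def split_into_cis(instr_kinds):
    """Group instruction indices of one HBB into candidate CIs."""
    cis, current = [], []
    stores = branches = 0
    for idx, kind in enumerate(instr_kinds):
        if kind in EXCLUDED:
            # An excluded instruction ends the current CI and is skipped.
            if current:
                cis.append(current)
            current, stores, branches = [], 0, 0
            continue
        # At most one store and one branch/jump per CI.
        if (kind == "store" and stores == 1) or (kind == "branch" and branches == 1):
            cis.append(current)
            current, stores, branches = [], 0, 0
        current.append(idx)
        stores += kind == "store"
        branches += kind == "branch"
    if current:
        cis.append(current)
    return cis


kinds = ["alu", "alu", "load", "alu", "store", "alu", "store", "branch"]
print(split_into_cis(kinds))  # [[0, 1], [3, 4, 5], [6, 7]]
```

This scan ignores data dependencies between the grouped instructions; a DFG-based generator would need to take them into account.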

16 Mapping Custom Instructions Mapping is the same as the well-known placement problem: Determining the appropriate positions for DFG nodes on the RFU. Assigning CI instructions to FUs is done based on the priority of the nodes.

17 Mapping Custom Instructions The slack of each node represents its criticality and also its priority for partitioning. A slack of 0 means that the node is on the critical path of the DFG and should be scheduled with the highest priority. For nodes with the same criticality, their ASAP level determines their mapping order.
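The priority rule can be illustrated with a standard ASAP/ALAP computation. This is a minimal sketch: it assumes unit-latency nodes, 1-based levels, a node list given in topological order, and that among equal-slack nodes the lower ASAP level goes first (the slide only says that ASAP level decides the order).

```python
def asap_alap_slack(nodes, edges):
    """nodes must be in topological order; edges are (producer, consumer)."""
    preds = {n: [] for n in nodes}
    succs = {n: [] for n in nodes}
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)
    # ASAP: earliest level, one past the latest predecessor.
    asap = {}
    for n in nodes:
        asap[n] = 1 + max((asap[p] for p in preds[n]), default=0)
    depth = max(asap.values())
    # ALAP: latest level that still lets every successor meet its own ALAP.
    alap = {}
    for n in reversed(nodes):
        alap[n] = min((alap[s] for s in succs[n]), default=depth + 1) - 1
    slack = {n: alap[n] - asap[n] for n in nodes}
    return asap, alap, slack


nodes = ["a", "d", "b", "c"]                  # topological order
edges = [("a", "b"), ("b", "c"), ("d", "c")]
asap, alap, slack = asap_alap_slack(nodes, edges)
order = sorted(nodes, key=lambda n: (slack[n], asap[n]))
print(slack)  # {'a': 0, 'd': 1, 'b': 0, 'c': 0}
print(order)  # ['a', 'b', 'c', 'd']
```

Here `a`, `b` and `c` form the critical path (slack 0), so they are handled before `d`.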

18 Mapping Algorithm (1/2) First step: determining an appropriate row for the selected node. Row number = last row, if the selected node is on a critical path whose length is greater than or equal to the RFU depth. Otherwise, row number = ALAP - slack - 1, to prevent nodes that do not belong to critical paths from occupying FUs in the lower RFU rows.

19 Mapping Algorithm (2/2) Second step: determining an appropriate column. The column is determined according to the minimum connection length criterion. For each row, a maximum capacity is set to prevent many nodes from gathering in one row. The capacity of rows is determined with respect to the longest critical path and the number of critical paths in the DFG.
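The column step can be sketched as a small cost minimization. The sketch assumes that connection length is the horizontal distance between a node's column and the columns of its already placed predecessors, and that row capacity is a simple count. Both are simplifications, and `choose_column` and its parameters are hypothetical names.

```python
def choose_column(row, pred_positions, occupied, n_cols, row_capacity):
    """Pick a free column in `row` that minimizes total connection length.

    pred_positions -- (row, col) of already placed predecessor nodes
    occupied       -- set of (row, col) positions already in use
    Returns a column index, or None if the row is full.
    """
    used = {c for (r, c) in occupied if r == row}
    if len(used) >= row_capacity:
        return None  # row capacity reached
    free = [c for c in range(n_cols) if c not in used]
    if not free:
        return None
    # Assumed cost model: sum of horizontal distances to predecessors.
    return min(free, key=lambda c: sum(abs(c - pc) for (_, pc) in pred_positions))


col = choose_column(row=1, pred_positions=[(0, 2), (0, 3)],
                    occupied={(1, 2)}, n_cols=4, row_capacity=3)
print(col)  # 3
```

When several columns tie on cost, `min` returns the leftmost one; the slides do not say how ties are broken.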

20 An Example: Mapping of a CI on the RFU

21 Generating Custom Instruction for the Target RFU In our primary CI generator we did not consider any constraints for the generated CIs and tried to generate CIs as large as possible. Therefore, some of the generated CIs cannot be mapped on the proposed RFU due to its constraints.

22 Customizing CI generator for the Target RFU First approach: some primary constraints of the RFU (number of inputs, number of outputs and number of nodes) were added to our CI generator tool so that it generates mappable CIs. In this approach the CI generator is unaware of the results of the mapping process, so some CIs may ultimately not be mapped to the RFU due to routing constraints.

23 Customizing CI generator for the Target RFU Second approach: integrated framework. It performs an integrated temporal partitioning and mapping process: it takes rejected CIs as input, partitions them into appropriate mappable CIs, and adds nodes to the current partition while architectural constraints are satisfied. The ASAP level of nodes represents their execution order according to their dependencies. Advantages: reduces the number of rejected CIs; uses a mapping-aware temporal partitioning process.
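The "add nodes while constraints are satisfied" step can be sketched as a greedy fill. This is an illustrative sketch only: `fits` is a hypothetical callback standing in for the real mapping attempt on the RFU, and the node order is assumed to be dependency-respecting (for example, sorted by ASAP level).

```python
def greedy_partition(order, fits):
    """Split a rejected CI into partitions that each satisfy `fits`.

    order -- nodes in a dependency-respecting order (e.g. by ASAP level)
    fits  -- hypothetical predicate: can this node list be mapped on the RFU?
    """
    partitions, current = [], []
    for node in order:
        if current and not fits(current + [node]):
            # Adding the node would break a constraint: close the partition.
            partitions.append(current)
            current = [node]
        else:
            current.append(node)
    if current:
        partitions.append(current)
    return partitions


# Toy constraint: at most two nodes per partition.
print(greedy_partition([1, 2, 3, 4, 5], fits=lambda part: len(part) <= 2))
# [[1, 2], [3, 4], [5]]
```

Because the mapper is consulted on every addition, each resulting partition is mappable by construction, which is the benefit the slide attributes to a mapping-aware process.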

24 Integrated Framework- Temporal Partitioning Algorithms HTTP: traverses DFG nodes horizontally according to the ASAP level of the nodes; usually brings about more parallelism for instruction execution; may require large intermediate data. VTTP: traverses the DFG nodes vertically; creates partitions with longer critical paths; reduces the size of intermediate data. The size of intermediate data affects the data transfer rate and the size of the configuration memory.
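The two traversal styles can be contrasted on a tiny DFG. The sketch below assumes the horizontal order is a stable sort by ASAP level and approximates the vertical order with a depth-first topological traversal that follows a path downward as far as dependencies allow. The function names are hypothetical and the actual HTTP and VTTP algorithms may differ in detail.

```python
def http_order(nodes, asap):
    """Horizontal traversal: level by level (stable sort on ASAP level)."""
    return sorted(nodes, key=lambda n: asap[n])


def vttp_order(nodes, edges):
    """Vertical traversal: depth-first topological order."""
    succs = {n: [] for n in nodes}
    indeg = {n: 0 for n in nodes}
    for u, v in edges:
        succs[u].append(v)
        indeg[v] += 1
    # Stack of ready nodes; reversed so the first source is popped first.
    stack = [n for n in reversed(nodes) if indeg[n] == 0]
    order = []
    while stack:
        n = stack.pop()
        order.append(n)
        for s in reversed(succs[n]):
            indeg[s] -= 1
            if indeg[s] == 0:
                stack.append(s)  # continue down this path next
    return order


nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "c"), ("b", "d"), ("c", "e"), ("d", "e")]
asap = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 3}
print(http_order(nodes, asap))   # ['a', 'b', 'c', 'd', 'e']
print(vttp_order(nodes, edges))  # ['a', 'c', 'b', 'd', 'e']
```

Cutting the horizontal order after two nodes leaves both `a` and `b` producing intermediate values, while cutting the vertical order after two nodes leaves only `c` live, which matches the intermediate-data trade-off described above.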

25 Integrated Framework- Incremental Temporal Partitioning Algorithm The incremental temporal partitioning process is performed iteratively: each partition that does not satisfy the RFU constraints is modified, and a new iteration starts. Two different partition modification strategies are used for HTTP and VTTP; the main difference is in how the nodes to be moved to the next partition are selected.

26 Integrated Framework- Incremental Temporal Partitioning Algorithm Incremental HTTP: the node with the highest ASAP level is selected and moved to the subsequent partition. Node selection and moving order: 15, 13, 11, 9, 14, 12, 10, 8, 3 and 7.
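The incremental HTTP modification step can be sketched as a loop that keeps evicting the deepest node. This is a hedged illustration: `fits` is a hypothetical stand-in for the RFU constraint check, and ties between nodes on the same ASAP level are broken by list position, which the slide does not specify. The example DFG is made up and is not the one from the slide.

```python
def shrink_partition_http(partition, asap, fits):
    """Move highest-ASAP nodes out until the partition satisfies `fits`.

    Returns (remaining nodes, nodes moved to the subsequent partition).
    """
    current = list(partition)
    moved = []
    while current and not fits(current):
        # Pick the node with the highest ASAP level (first one on ties).
        node = max(current, key=lambda n: asap[n])
        current.remove(node)
        moved.append(node)
    return current, moved


# Two chains 1 -> 2 -> 3 and 4 -> 5 -> 6, with their ASAP levels.
asap = {1: 1, 2: 2, 3: 3, 4: 1, 5: 2, 6: 3}
print(shrink_partition_http([1, 2, 3, 4, 5, 6], asap, lambda p: len(p) <= 2))
# ([1, 4], [3, 6, 2, 5])
```

The partition is trimmed level by level from the bottom, so what remains is wide and shallow.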

27 Integrated Framework- Incremental Temporal Partitioning Algorithm Incremental VTTP: a node with the highest ASAP level is selected and moved. The other nodes are selected, in order of their ASAP level, from the path on which the previously moved node was located. Node selection and moving order: 15, 14, 6, 13, 12, 5, 11, 10, 4 and 7.
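A possible reading of the incremental VTTP step is sketched below. It assumes that "the path of the previously moved node" means its direct predecessors, and that a node may only leave the partition once none of its consumers remain in it (so dependencies stay valid). `fits` is a hypothetical constraint check, and the example DFG is made up rather than taken from the slide.

```python
def shrink_partition_vttp(partition, edges, asap, fits):
    """Move nodes out along one path at a time until `fits` holds.

    Returns (remaining nodes, nodes moved to the subsequent partition).
    """
    current = list(partition)
    moved = []
    last = None
    while current and not fits(current):
        in_current = set(current)
        # Only nodes with no consumer left in the partition may leave it.
        movable = [n for n in current
                   if not any(u == n and v in in_current for u, v in edges)]
        # Prefer predecessors of the node moved last (stay on its path).
        on_path = [n for n in movable if (n, last) in edges]
        candidates = on_path or movable
        node = max(candidates, key=lambda n: asap[n])
        current.remove(node)
        moved.append(node)
        last = node
    return current, moved


# Two chains 1 -> 2 -> 3 and 4 -> 5 -> 6, with their ASAP levels.
edges = [(1, 2), (2, 3), (4, 5), (5, 6)]
asap = {1: 1, 2: 2, 3: 3, 4: 1, 5: 2, 6: 3}
print(shrink_partition_vttp([1, 2, 3, 4, 5, 6], edges, asap, lambda p: len(p) <= 2))
# ([4, 5], [3, 2, 1, 6])
```

Compared with the horizontal strategy, this evicts a whole chain before touching the next one, so the remaining partition keeps a longer path intact.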

28 Customizing Mapping Tool Spiral shaped mapping is possible thanks to the horizontal connections in the third and fourth rows of RFU

29 Performance Evaluation Base processor configuration: issue: 4-way; L1-I cache: 32K, 2-way, 1-cycle latency; L1-D cache: 32K, 4-way, 1-cycle latency; unified L2: 1M, 6-cycle latency; execution units: 4 integer, 4 floating point; RUU size and fetch queue size: values not given. SimpleScalar was configured to behave as a 4-issue in-order RISC processor. The base processor supports the MIPS instruction set. 22 applications of MiBench were used.

30 Delay of RFU according to CI length [Figure: RFU delay (ns) versus CI length; obtained with Synopsys tools and Hitachi 0.18 μm]

31 CI length for MiBench applications

32 Intermediate data size [Figure: number of 32-bit intermediate data for HTTP and VTTP across bitcnts, blowfish, blowfish (dec), cjpeg, djpeg, fft, fft (inv), gsm (dec), gsm (enc), lame, rijndael (enc), rijndael (dec) and sha]

33 Maximum critical path length for CIs [Figure: critical path length for HTTP and VTTP across bitcnts, blowfish, blowfish (dec), cjpeg, djpeg, fft, fft (inv), gsm (dec), gsm (enc), lame, rijndael (enc), rijndael (dec) and sha]

34 Speedup comparison [Figure: speedup for HTTP, VTTP and CIGen across bitcnts, blowfish, blowfish (dec), cjpeg, djpeg, fft, fft (inv), gsm (dec), gsm (enc), lame, rijndael (enc), rijndael (dec) and sha]


37 Thank you for listening
