Design of a clockless MSP430 core using mixed asynchronous design flow


LETTER
IEICE Electronics Express, Vol.14, No.8, 1-12

Ziho Shin 1,3a), Myeong-Hoon Oh 1,3b), Jeong-Gun Lee 2, Hag Young Kim 3, and Young Woo Kim 1,3

1 Dept. of Computer SW, University of Science and Technology (UST), 217, Gajeong-ro, Yuseong-gu, Daejeon, 34113, Republic of Korea
2 Dept. of Computer Engineering, Hallym University, 1, Hallimdaehak-gil, Chuncheon-si, Gangwon-do, 24252, Republic of Korea
3 Cloud Computing Research Group, Electronics and Telecommunication Research Institute (ETRI), 218, Gajeong-ro, Yuseong-gu, Daejeon, 34129, Republic of Korea
a) zshin@ust.ac.kr
b) mhoonoh@etri.re.kr, Corresponding Author

Abstract: There are various limitations on the supporting tools and design methodologies for the implementation of an asynchronous delay-insensitive model. In this paper, we propose a new design flow that exploits a mixed model, which combines a bounded delay model and a delay-insensitive model. To develop the design flow, we use an asynchronous finite-state machine for the bounded delay model and the null convention logic for the delay-insensitive model. Further, we designed an MSP430 core to verify the proposed design flow, and the simulation results show that it exhibits a performance improvement of 30.34% over its synchronous counterpart.

Keywords: asynchronous circuit, AFSM, NCL, UNCLE, delay insensitive, bounded delay
Classification: Integrated circuits

References
[1] Y. I. Ismail: Interconnect design and limitations in nanoscale technologies, IEEE ISCAS (2008) 780 (DOI: /ISCAS ).
[2] C. J. Anderson, et al.: Physical design of a fourth-generation POWER GHz microprocessor, Proc. ISSCC2001 (2001) 232 (DOI: /ISSCC ).
[3] J. Sparso and S. Furber: Principles of Asynchronous Circuit Design: A Systems Perspective (Springer US, New York, 2001).
[4] K. M. Fant: Logically Determined Design: Clockless System Design with Null Convention Logic (John Wiley & Sons, Hoboken, 2005).
[5] R. B. Reese, et al.: Uncle - An RTL approach to asynchronous design, Proc. 18th ASYNC (2012) 65 (DOI: /ASYNC ).
[6] G. De Micheli: Synchronous logic synthesis: Algorithms for cycle-time minimization, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 10 (1991) 63 (DOI: / ).
[7] P.-H. Ho: Industrial clock synthesis, ISPD (2009) (DOI: / ).
[8] E. G. Friedman: Clock distribution network in synchronous digital integrated circuits, Proc. IEEE 89 (2001) 665 (DOI: / ).
[9] M.-H. Oh, et al.: Architectural design issues in a clockless 32 Bit processor using an asynchronous HDL, ETRI J. 35 (2013) 480 (DOI: /etrij ).
[10] C. J. Myers: Asynchronous Circuit Design (John Wiley & Sons, New York, 2001) 88.
[11] K. Y. Yun: Synthesis of asynchronous controllers for heterogeneous systems, Ph.D. Dissertation, Stanford University (1994).
[12] S. M. Nowick: Automatic synthesis of burst-mode asynchronous controllers, Ph.D. Dissertation, Stanford University (1994).
[13] R. B. Reese: Uncle (Unified NCL Environment) (Mississippi State University, 2011).
[14] A. Kondratyev and K. Lwin: Design of asynchronous circuits using synchronous CAD tools, IEEE Des. Test Comput. 19 (2002) 107 (DOI: /MDT ).
[15] Z. Xia, et al.: An asynchronous FPGA based on dual/single-rail hybrid architecture, Proc. ERSA (2012) 139.
[16] P. A. Beerel, et al.: Proteus: An ASIC flow for GHz asynchronous designs, IEEE Des. Test Comput. 28 (2011) 36 (DOI: /MDT ).
[17] C. A. R. Hoare: Communicating sequential processes, Commun. ACM 21 (1978) 666 (DOI: / ).
[18] R. O. Ozdag and P. A. Beerel: High-speed QDI asynchronous pipelines, Proc. 8th ASYNC (2002) 13 (DOI: /ASYNC ).
[19] Texas Instruments: MSP430x2xx Family User's Guide (Texas Instruments, 2013).
[20] M.-H. Oh, et al.: Design of low-power asynchronous MSP430 processor core using AFSM based controllers, Proc. 23rd ITC-CSCC (2008)
[21] L. Nazhandali, et al.: SenseBench: Toward an accurate evaluation of sensor network processor, Proc. IEEE Int. Symp. Workload Characterization (2005) 197 (DOI: /IISWC ).
1 Introduction

Traditional synchronous circuit designs are limited by their single global clock signal: the clock network consumes a large amount of power, clock skew degrades performance, and multiple clock domains give rise to metastability problems [1]. Asynchronous circuits, in contrast, have no global clock and are therefore fundamentally free from these limitations. Instead of a global clock, an asynchronous circuit guarantees its functional correctness using distributed handshake control for localized synchronization. Moreover, the power consumption of a processor can be reduced by eliminating the clock network, which accounts for 70% of the entire power consumption [2]. Therefore, the asynchronous circuit has low-power characteristics. Additionally, the asynchronous circuit emits less electromagnetic interference (EMI), because it does not use a global, periodic control signal.

Asynchronous circuit designs employ a handshake protocol to transfer data between their internal modules. The handshake protocol uses a request (Req) signal, which indicates data validity, and an acknowledgement (Ack) signal, which indicates the completion of data delivery. Depending on how the two control signals are used, handshake protocols can be classified into two categories: four-phase signaling and two-phase signaling [3]. A four-phase signaling protocol uses only the rising edges of the control signals to synchronize neighboring modules and requires a return-to-zero phase. A two-phase signaling protocol uses both the rising and falling edges of the control signals to perform the handshake. Since two-phase signaling needs no return-to-zero phase, it can in principle achieve higher performance than four-phase signaling. However, the two-phase signaling protocol has the disadvantage of greater design complexity.

Further, asynchronous circuits are categorized by delay model, such as the bounded delay model, the speed-independent model, and the delay-insensitive model [3]. The operation of the bounded delay model is similar to that of a synchronous circuit. To implement the bounded delay model, the delays of gates and wires must be calculated and the worst-case scenarios analyzed. Thus, this model can reuse the data path circuits of a synchronous design without any modification. Since the speed-independent model does not need to consider wire delays and models gate delays as unbounded, it allows more asynchrony than the bounded delay model. Nonetheless, the speed-independent model has a limitation: when multiple input changes occur, the proper control inputs must be selected. Hence, this model has increased design complexity.

The delay-insensitive model is the ideal model for asynchronous circuits: it assumes unbounded delays for both wires and gates. This model employs a multi-bit data encoding scheme to detect the completion of data-dependent operations; however, this scheme causes area overhead in an implementation. Nevertheless, the delay-insensitive model provides operational stability under variations of process, voltage, and temperature, and the operating time of the circuit depends on the applied data. The NULL Convention Logic (NCL) [4] has been introduced as a technique for implementing circuits under this model. An NCL-based circuit, which uses a four-phase signaling protocol, inserts a NULL state between the transferred data as a spacer; hence the name NULL convention logic. To represent the NULL state and data (logic high '1' or logic low '0'), the NCL uses dual-rail or quad-rail schemes instead of a single-rail scheme. In an NCL-based circuit, the Req signal is encoded and embedded into the data lines based on the dual-rail scheme, and the Ack signal is received through the Ack network when the request has been processed. Additionally, the primitive NCL (threshold) gates introduced in [4] are universal, so every Boolean equation can be expressed with them. Therefore, designers can implement a delay-insensitive circuit for any Boolean equation using the set of primitive NCL gates.
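The dual-rail encoding and the hysteresis of threshold gates can be sketched behaviorally. The following is a minimal Python model, not the gate library of [4] or the netlist produced by any tool: the rail ordering, the class names, and the minterm-style AND construction are assumptions chosen for illustration. A THmn gate sets its output when at least m of its n inputs are high, resets only when all inputs are low, and otherwise holds its value.

```python
# Dual-rail code words as (rail1, rail0). This ordering is an assumption
# made for the sketch; (1, 1) is not a legal code word.
NULL = (0, 0)
DATA0 = (0, 1)
DATA1 = (1, 0)


def encode(bit):
    """Map a single-rail bit to its dual-rail code word."""
    return DATA1 if bit else DATA0


class ThresholdGate:
    """THmn gate with hysteresis: set at >= m high inputs, reset at all low."""

    def __init__(self, m, n):
        self.m = m
        self.n = n
        self.out = 0

    def update(self, *inputs):
        if len(inputs) != self.n:
            raise ValueError("wrong number of inputs")
        high = sum(inputs)
        if high >= self.m:
            self.out = 1
        elif high == 0:
            self.out = 0
        # otherwise hold the previous output (hysteresis)
        return self.out


class DualRailAnd:
    """Dual-rail AND built from four TH22 gates (one per input minterm)
    and one TH13 gate that merges the three minterms giving output 0."""

    def __init__(self):
        self.m11 = ThresholdGate(2, 2)
        self.m10 = ThresholdGate(2, 2)
        self.m01 = ThresholdGate(2, 2)
        self.m00 = ThresholdGate(2, 2)
        self.any_zero = ThresholdGate(1, 3)

    def update(self, a, b):
        a1, a0 = a
        b1, b0 = b
        out1 = self.m11.update(a1, b1)
        out0 = self.any_zero.update(
            self.m10.update(a1, b0),
            self.m01.update(a0, b1),
            self.m00.update(a0, b0),
        )
        return (out1, out0)


gate = DualRailAnd()
for a, b in [(NULL, NULL), (encode(1), NULL), (encode(1), encode(1)),
             (NULL, NULL), (encode(0), encode(1))]:
    print(a, b, "->", gate.update(a, b))
```

In this model the output leaves NULL only after both inputs carry data and returns to NULL only after both inputs are NULL again, which is the behavior the NULL spacer relies on.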

The Unified NULL Convention Logic Environment (Uncle) has been introduced as an open-source tool set that supports the implementation of NCL circuits [5]. In this paper, we propose a new design flow that combines a bounded delay model and a delay-insensitive model. Further, we designed an MSP430 core using the proposed design flow. Finally, the performance of the designed core is evaluated and compared to that of a synchronous counterpart.

This paper is organized as follows. Section 2 presents the conventional asynchronous circuit design methodologies. In Section 3, we propose a new design flow using a mixed delay model. Section 4 describes the design and implementation of an MSP430 core using the proposed design flow. Section 5 describes the simulation environment and analyzes the simulation results. Finally, Section 6 presents the conclusion and future work.

2 Related work

The design of a synchronous circuit focuses mainly on optimizing the sequential logic with respect to the clock cycle time and on optimizing the clock networks [6, 7, 8], whereas the design of an asynchronous circuit uses graph theory to optimize the control logic. Furthermore, because an asynchronous circuit has no global control signal, hazards and race conditions must be considered at the architectural level [9]. An Asynchronous Finite-State Machine (AFSM) [10] is similar to a Mealy-type FSM, whose output values depend on both the current state and the current inputs. The AFSM defines the changes of states according to the inputs rather than the signal transitions. Therefore, the AFSM has a restriction: it must settle into the new state before the next input changes. The AFSM can be used for the bounded delay model. To support the synthesis of AFSM designs, the 3D [11] and MINIMALIST [12] tools have been developed. The 3D tool supports conditional branches and a directed don't care state to eliminate the design constraints [11].
As a tool for the delay-insensitive model, Uncle [13] has been developed as an open-source tool set for the design of NCL circuits, and it guarantees self-timed operation of the circuits derived from the model. In [14], Kondratyev introduced the design of asynchronous circuits using synchronous CAD tools based on the NCL. However, [14] provides neither automated synthesis of Ack signal generation nor a simulation methodology for the generated netlists. Another drawback of [14] is that the tool is currently unavailable. The Xia group [15] suggested a hybrid architecture for interfacing between single-rail and delay-insensitive dual-rail circuits, applying the dual-rail encoding to the critical path. However, their work neither proposed an NCL-based dual-rail scheme nor applied any handshake protocol to the single-rail data path. The Proteus project was also introduced as a design flow for a delay-insensitive model [16]. Proteus provides a high-level language interface, which is a translator for Communicating Sequential Processes (CSP) [17]. Proteus focuses on dual-rail domino logic based on pre-charged half-buffer (PCHB) custom cells in order to obtain high performance [18]. Thus, the approach is suitable for full-custom design. Because it requires full-custom cells and the tool is not open to the public, it is at a disadvantage in terms of design flexibility.

The Uncle also provides an automatic mapping function, which translates a register transfer level (RTL)-based design to a netlist of NCL gates. The concept of the design flow of the Uncle is shown in Fig. 1. When a designer inputs RTL code into the Uncle, it invokes a conventional synthesis CAD tool to translate the RTL code to a single-rail and-or-not netlist. Next, the Uncle expands the single-rail netlist to its dual-rail version; the resulting dual-rail and-or-not netlist, which is composed of the predefined primitive gates from the Uncle, is then mapped to the NCL gates. This predefined primitive gate library is called andor2.db. Since the Uncle tool supports only andor2.db, which is dedicated to NCL designs, the library includes only the small number of primitive gates required for synthesizing NCL circuits. Owing to this limitation, the use of the library could affect design flexibility and might lead to performance degradation in other types of circuit design. Subsequently, the Uncle generates the Ack networks and verifies their validity. Finally, the Uncle runs a simulation using a dedicated simulator called Uncle_sim.

Fig. 1. The UNCLE, NCL design flow

After finalizing the mapping process, the Uncle adds registers on the input and output sides to guarantee the delay-insensitivity of the generated NCL netlist. Because of this process, a designer has to insert global clock and reset signals deliberately before the Uncle mapping process. If the original RTL design does not include both the global clock and reset signals, the Uncle cannot generate the NCL netlist.
The netlist generated by the Uncle can be applied only to the data path, since the Uncle does not consider the interaction with the control path of an existing system; this is a disadvantage of using the Uncle. Another drawback is that the Uncle translates only one module at a time. Therefore, if a designer intends to translate a design that is composed of multiple sub-modules, each sub-module must be mapped one by one and the interconnections between the translated modules must be made manually. Consequently, the Uncle offers no solution for compatibility between the existing data paths of a single-rail design and an NCL-based dual-rail design. Furthermore, it is not easy to verify the consistency of the circuit functionality at each design step using the Uncle_sim simulator, because it supports only the simulation of the circuit netlist produced at the final design step.
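The single-rail to dual-rail expansion step described above can be illustrated with a small sketch. It is not Uncle's implementation: the netlist format, the `_t`/`_f` net suffixes, and the function names are hypothetical, and the expansion shown is the textbook Boolean one (AND on the true rails paired with OR on the false rails, and inversion as a rail swap) without the hysteresis of real NCL gates.

```python
def expand_dual_rail(netlist):
    """Expand a single-rail and-or-not netlist into a dual-rail one.

    Each entry is (gate, output_net, [input_nets]) in topological order.
    Every net x becomes a true rail x_t and a false rail x_f.
    """
    dual = []
    for gate, out, ins in netlist:
        t = [f"{i}_t" for i in ins]
        f = [f"{i}_f" for i in ins]
        if gate == "and":
            dual.append(("and", f"{out}_t", t))
            dual.append(("or", f"{out}_f", f))
        elif gate == "or":
            dual.append(("or", f"{out}_t", t))
            dual.append(("and", f"{out}_f", f))
        elif gate == "not":
            # Inversion costs no logic in dual-rail: the rails are swapped.
            dual.append(("buf", f"{out}_t", f))
            dual.append(("buf", f"{out}_f", t))
        else:
            raise ValueError(f"unsupported gate: {gate}")
    return dual


def evaluate(netlist, inputs):
    """Evaluate a combinational netlist given a dict of input net values."""
    values = dict(inputs)
    for gate, out, ins in netlist:
        args = [values[i] for i in ins]
        if gate == "and":
            values[out] = int(all(args))
        elif gate == "or":
            values[out] = int(any(args))
        elif gate == "not":
            values[out] = 1 - args[0]
        else:  # "buf"
            values[out] = args[0]
    return values


# y = (a AND b) OR (NOT c)
single = [("and", "n1", ["a", "b"]), ("not", "n2", ["c"]), ("or", "y", ["n1", "n2"])]
dual = expand_dual_rail(single)
for gate in dual:
    print(gate)
```

With all input rails at 0 (the NULL spacer), every output rail of the expanded netlist also evaluates to 0, so NULL propagates through the logic; with valid dual-rail inputs, exactly one output rail is asserted.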

In this paper, in order to mitigate the aforementioned disadvantages of the Uncle, we propose a design flow for the asynchronous mixed delay model with three new features:

1. Support for a mixed delay model: In our proposed flow, the AFSM design methodology is employed to support a bounded delay model for the single-rail control circuit design, while the NCL-based Uncle flow is used to support a delay-insensitive model. We then forward the control signal of the NCL Ack network to the four-phase AFSM handshake protocol. Finally, we use C-elements to synchronize the communication between the delay-insensitive data path and the control path.
2. Data path interfacing: To support the communication between the data path and the control path, we design translation logic for the data path that ensures compatibility with the existing system. This translation logic provides the interface between the NCL-based dual-rail data path and a single-rail data path.
3. Timing simulation environment and verification method: We modify the command script of the Uncle to generate a Standard Delay Format (SDF) file, and we write an SDF-annotated Verilog simulation model for timing simulation with a conventional CAD tool.

3 Suggested design flow

3.1 Control path design

The Uncle aims to support a data-driven style of design. Therefore, the output from the Uncle does not consider the interaction with the control path of an existing control-driven design. In this paper, we suggest a combination of an NCL-based delay-insensitive data path and an AFSM-based control path, as shown in Fig. 2(a). The matched delay cells in the control path are no longer required when the control path is combined with the NCL-based delay-insensitive data path, and those delay cells should be eliminated to maintain the self-timed characteristics obtained from the NCL-based delay-insensitive data path. Further, the Ack signal from the NCL Ack network should be connected to the AFSM-based handshake protocol in order to enable communication between the NCL-based data path and the AFSM-based control path. To keep the control signal stable, we insert a C-element, as shown in Fig. 2(b).

Fig. 2. Control path structure: (a) AFSM control path (b) AFSM control path and NCL based data path signaling method
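The role of the C-element in the four-phase handshake can be sketched behaviorally. In the model below, a Muller C-element changes its output only when both inputs agree. Which two signals the paper's circuit actually feeds into the C-element is not stated in the text, so pairing the control-path request with a data-path completion signal is an assumption of this sketch, as are all names.

```python
class CElement:
    """Muller C-element: the output follows the inputs only when they agree."""

    def __init__(self):
        self.out = 0

    def update(self, a, b):
        if a == b:
            self.out = a
        # when the inputs differ, the previous output is held
        return self.out


def four_phase_cycle(c_element):
    """Drive one four-phase handshake and record (req, completion, ack).

    req        : request from the control path (assumed name)
    completion : 1 when the data path holds valid data,
                 0 when it has returned to NULL (assumed name)
    ack        : C-element output returned to the control path
    """
    steps = [
        (1, 0),  # request raised, data path still computing
        (1, 1),  # data valid: ack rises
        (0, 1),  # request withdrawn, data path not yet back to NULL
        (0, 0),  # NULL reached: ack falls, return-to-zero complete
    ]
    trace = []
    for req, completion in steps:
        trace.append((req, completion, c_element.update(req, completion)))
    return trace


for step in four_phase_cycle(CElement()):
    print(step)
```

Because the acknowledgement rises only after the data path signals completion, the time between request and acknowledgement tracks the actual computation time rather than a fixed matched delay.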

3.2 Data path translation

The Uncle uses a dual-rail data path design to implement the delay-insensitive model. However, existing data paths are mostly designed using a single-rail scheme. Moreover, the Uncle does not support the automatic mapping of multiple modules simultaneously. Owing to this limitation, translation logic for the single-rail to dual-rail (STD) and dual-rail to single-rail (DTS) directions is needed to compose the heterogeneous circuit styles. Fig. 3 shows the designed translation logic at an abstract level, i.e., a single-rail to dual-rail translation logic (SDTL) and a dual-rail to single-rail translation logic (DSTL). Additionally, since an NCL-based system has NULL states, the SDTL and the DSTL must include circuits for generating and capturing a NULL state when encoding the data. The integration of the overall method is presented in Fig. 4 with the detailed circuit structures of the SDTL and DSTL.

Fig. 3. Data path translation structure

Fig. 4. Mixed signaling with SDTL and DSTL

3.3 Functional simulation methodology

The Uncle_sim is used for the functional simulation of the NCL-based netlist. However, it does not support a timing simulation environment at each step of design refinement; it supports only the simulation of the netlists produced at the final step. Therefore, if a designer encounters a functional error in the final step, it is not possible to simulate the intermediate netlist from each design step, and a simulation methodology using conventional CAD tools is required. The command script of the Uncle is modified to generate the SDF file so as to ensure compatibility between the conventional CAD tools and the Uncle. Subsequently, the SDF file is annotated into the Verilog simulation model.

The functional simulation flow of an NCL-based circuit using a conventional CAD tool is as follows. When the Uncle produces the NCL-based netlist, the designer re-synthesizes the netlist with the andor2.db gate library, which is provided by the Uncle for translating the NCL gates to andor2.db-based gates. The designer thereby obtains the SDF file for the NCL gates implemented with the andor2.db library. Finally, the SDF annotation of the synthesized netlist is performed by writing a Verilog simulation model, and the simulation is run using the conventional CAD tools.
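The behavior expected of the translation logic in Section 3.2 can be sketched as follows. This is a behavioral assumption only: the detailed SDTL and DSTL circuits of Fig. 4 are not reproduced in the text, and the function signatures, the `req` input, and the `valid` output are hypothetical names. The sketch shows the two NULL-related duties mentioned above, generating a NULL spacer on the dual-rail side and detecting when a dual-rail word is completely valid.

```python
NULL_PAIR = (0, 0)


def sdtl(bits, req):
    """Single-rail to dual-rail: emit data while req is high, NULL otherwise.

    bits : list of 0/1 values from the single-rail data path
    req  : four-phase request; 0 produces the NULL spacer
    Each output pair is (true_rail, false_rail).
    """
    if not req:
        return [NULL_PAIR] * len(bits)
    return [(b, 1 - b) for b in bits]


def dstl(word):
    """Dual-rail to single-rail: return (valid, bits).

    valid is 1 only when every pair carries data (completion detection);
    a NULL or partially arrived word yields (0, None).
    """
    if any(pair == (1, 1) for pair in word):
        raise ValueError("illegal dual-rail code word (1, 1)")
    if all(pair != NULL_PAIR for pair in word):
        return 1, [t for t, _ in word]
    return 0, None


print(dstl(sdtl([1, 0, 1, 1], req=1)))
print(dstl(sdtl([1, 0, 1, 1], req=0)))
print(dstl([(1, 0), NULL_PAIR]))
```

A single-rail consumer would latch `bits` only while `valid` is high, so a partially arrived dual-rail word is never sampled.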

To summarize, Fig. 5 shows the suggested design flow of the asynchronous mixed delay model, which integrates the design flow of the Uncle with the three new features described above. In Section 4, we describe the design of a 16-bit processor core using the proposed design flow.

Fig. 5. Proposed asynchronous mixed delay model design flow

4 Processor architecture & design

4.1 Overview of TI MSP430

The MSP430 [19] is a 16-bit processor used as a low-power microcontroller (MCU) in fields such as the Internet of Things (IoT). The MSP430 provides a relatively simple instruction set architecture (ISA) and low-power characteristics, together with an open compiler environment. The MSP430 core executes 27 reduced instruction set computer (RISC)-type instructions and supports 7 addressing modes. The 27 supported instructions can be categorized by the number of operands they use: dual-operand (Instruction Group II, 2 operands), single-operand (Instruction Group I, 1 operand), and jumps (Instruction Group III, 0 operand). In principle, every instruction can use all the addressing modes without restriction; therefore, code size can be smaller than on other MCUs when building various functions.

4.2 Architecture for MSP430

A complex instruction set computer (CISC) architecture is suitable for the MSP430 core [20] in order to support the various addressing modes and the various opcode sizes of the instructions. The CISC-based MSP430 architecture can use the data path flexibly depending on the given instruction and addressing mode. Fig. 6 shows the suggested control flow of the MSP430 core. When an instruction is loaded into the core, it is decoded. Afterwards, one of three different paths is selected according to the opcode and addressing mode, and the MSP430 core executes the instruction through the selected path. After the execution, the result is written back to the register and the instruction is finalized.

Fig. 6. Control flow of the MSP430 core
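The three instruction groups of Section 4.1 correspond to three encodings of the 16-bit instruction word in the standard MSP430 instruction set documented in [19]. The sketch below separates them by their top bits and extracts the fields a decoder needs. It is an illustration of the instruction formats only, not the authors' decoder, and the dictionary keys are hypothetical names.

```python
def decode(word):
    """Split a 16-bit MSP430 instruction word into its fields."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("instruction word must fit in 16 bits")

    # Jump format: bits 15-13 = 001, 3-bit condition, 10-bit signed offset.
    if (word >> 13) == 0b001:
        offset = word & 0x3FF
        if offset & 0x200:          # sign-extend the 10-bit word offset
            offset -= 0x400
        return {"group": "jump", "cond": (word >> 10) & 0x7, "offset": offset}

    # Single-operand format: bits 15-10 = 000100.
    if (word >> 10) == 0b000100:
        return {
            "group": "single",
            "opcode": (word >> 7) & 0x7,
            "byte": (word >> 6) & 0x1,
            "as": (word >> 4) & 0x3,    # addressing mode bits
            "reg": word & 0xF,
        }

    # Dual-operand format: 4-bit opcode in bits 15-12, values 0x4 to 0xF.
    if (word >> 12) >= 0x4:
        return {
            "group": "dual",
            "opcode": word >> 12,
            "src": (word >> 8) & 0xF,
            "ad": (word >> 7) & 0x1,    # destination addressing mode bit
            "byte": (word >> 6) & 0x1,
            "as": (word >> 4) & 0x3,    # source addressing mode bits
            "dst": word & 0xF,
        }

    raise ValueError("not a core instruction word")


for w in (0x4304, 0x1204, 0x3FFF):
    print(hex(w), decode(w))
```

The source and destination register indexes together with the `as` and `ad` bits determine how many operand fetches an instruction needs, which is the kind of information a control flow uses to choose between paths.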

The control flow is composed of five steps: instruction fetch and decode state (IFID), source fetch (OF1), destination fetch (OF2), zero operand instruction group (Jump), and write back stage (EXWB). The suggested block diagram of the data path for the MSP430 core is shown in Fig. 7; its groups correspond to the control flow presented in Fig. 6. When an instruction is given to the IFID module, it is decoded into the arithmetic and logical unit (ALU) Opcode, SRC_index, DST_index, jump offset, and addressing mode. These are used to generate the control signal of each multiplexer, and the indexes indicate where data come from and where they are stored. Then, the data is sent to the OF1, OF2, Jump, and EXWB modules. Subsequently, the input data and the instruction are processed through various data paths according to the instruction and addressing mode.

Fig. 7. The MSP430 data path

4.3 Implementation

In this paper, we designed the MSP430 core in three different ways:

1. The proposed NCL and AFSM based mixed delay model asynchronous core using the suggested design flow shown in Fig. 5 (NCL+AFSM)
2. AFSM-based bounded delay model asynchronous core (AFSM)
3. Synchronous core (SYNC)

For a fair comparison, these three cores are synthesized to the gate level using the andor2.db gate library provided by the Uncle. To improve the performance of the NCL+AFSM version, the ALU is designed in a delay-insensitive style with the Uncle, since the ALU is one of the data path modules with the most data-dependent processing time in the MSP430 core. Through the delay-insensitive implementation of the ALU, average-case performance can be obtained. The control path, on the other hand, is designed using the AFSM with a four-phase handshake protocol to handle communication with the data path. To eliminate excessive restrictions on concurrent operations in the bounded delay model based AFSM control path, the 3D tool is used for the logic synthesis. Except for the self-timed ALU, the handshake control signals of the AFSM-based control path enforce the worst-case timing of the corresponding data path sub-modules by using matched delay cells. These delay cells provide the design margin for safe operation. Further, in order to meet the setup/hold time constraints of the latches, the matched delay cells are excluded from optimization during synthesis. A C-element is used between the AFSM control path and the NCL Ack network to achieve the average-case data-dependent computation time of the self-timed ALU. Apart from the ALU, the remaining parts of the data path follow the synchronous design methodology; hence, they are designed under the bounded delay model.

Fig. 8. Designed core architecture: Asynchronous mixed delay model

Fig. 8 shows the architecture of the designed NCL+AFSM version of the core. To provide the interface between the self-timed ALU, which is based on a dual-rail scheme, and the other parts of the data path, which are based on a single-rail scheme, the SDTL and DSTL described in Section 3.2 are inserted at the boundary between the single-rail circuits and the dual-rail circuits of the self-timed ALU.

The AFSM version of the core uses the same control path as the NCL+AFSM version. Its data path, including the ALU, is designed simply as a single-rail scheme. We inserted matched delay cells into the control path to manage the timing of the handshake signals. For the IFID module, the matched delay is estimated by summing the worst-case delays of the IF part, the ID part, and the PC control part, and adding a design margin for reliable circuit operation. The matched delays for OF1, OF2, and EXWB are calculated in the same manner.

The SYNC version of the core shares the data path of the AFSM version and uses a global clock to manage the FSM of the control path. The clock cycle is determined by the delay of OF2, which is the worst-case module along the entire data path, plus a design margin.

The results of synthesizing the three cores are as follows. The core designed with the proposed mixed delay model occupies approximately 80% more cell area than the synchronous and AFSM cores, because it employs the SDTL and DSTL to interface between the single-rail and dual-rail data paths. Further, we did not optimize the synthesis process when translating the NCL netlist to the andor2.db-based netlist, in order to preserve the delay-insensitive characteristics obtained from the Uncle and to provide a fair comparison among the three cores.
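The timing rules described in Section 4.3 amount to simple arithmetic, sketched below. All delay values are hypothetical placeholders, not figures from the paper; modelling the design margin as a ratio and charging the SYNC core one clock cycle per visited stage are also assumptions of this sketch.

```python
def matched_delay(part_delays_ns, margin_ratio):
    """Matched delay of one stage: sum of worst-case part delays plus margin."""
    if margin_ratio < 0:
        raise ValueError("margin must be non-negative")
    return sum(part_delays_ns) * (1.0 + margin_ratio)


def sync_clock_period(stage_delays_ns, margin_ratio):
    """SYNC clock period: the slowest stage sets the cycle for every stage."""
    return max(stage_delays_ns.values()) * (1.0 + margin_ratio)


# Hypothetical worst-case delays in ns (illustration only, not from the paper).
MARGIN = 0.25
ifid_parts = [4.0, 3.0, 1.0]            # IF part, ID part, PC control part
stage_worst_case = {"IFID": 8.0, "OF1": 6.0, "OF2": 12.0, "EXWB": 10.0}

ifid_delay = matched_delay(ifid_parts, MARGIN)
clock = sync_clock_period(stage_worst_case, MARGIN)

# An instruction that visits IFID, OF1 and EXWB:
path = ["IFID", "OF1", "EXWB"]
afsm_time = sum(matched_delay([stage_worst_case[s]], MARGIN) for s in path)
sync_time = clock * len(path)

print(f"IFID matched delay: {ifid_delay} ns")
print(f"SYNC clock period : {clock} ns")
print(f"AFSM path time    : {afsm_time} ns")
print(f"SYNC path time    : {sync_time} ns")
```

Under these assumptions each AFSM stage waits only for its own matched delay, whereas every SYNC stage waits for the OF2-limited clock period, which is why a bounded delay core can finish a path sooner even without a self-timed ALU.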
5 Simulation

5.1 Simulation environment

The three versions of the MSP430 core were modeled in Verilog HDL and synthesized to the gate level using the library provided by the Uncle [13], so that the simulation behavior of each version could be compared on an equal basis. As the basic synthesis tool, we used the Synopsys Design Compiler. It was used for both the data path and the control path of the SYNC version, for only the data path of the AFSM version, and for the data path except the ALU of the NCL+AFSM version. The 3D tool was used to synthesize the control path of the AFSM version and the NCL+AFSM version. The ALU for the NCL+AFSM version was generated by the Uncle, following the flow described earlier.

Then, we annotated the SDF files to the Verilog HDL gate-level netlists of each version of the core, as described in Section 3.3, and performed the timing simulation using the Cadence NC-Verilog. To confirm the functionality and to evaluate the performance of each version, we applied benchmark programs [21] that are frequently used in IoT services such as networking and sensor data processing.

5.2 Simulation result

Fig. 9 shows the captured waveforms of the three core designs, focusing on the completion time of the EXWB modules (see Fig. 7). In the AFSM version, the execution time of the ALU, measured as the delay from the rising edge of the EXWBExecution (Req) signal to the rising edge of the EXWBExecutionDone (Ack) signal (see Fig. 8), is fixed at 15 ns even if the operation changes. The SYNC version also has a fixed, optimized clock cycle of 30.6 ns, which is determined by the worst-case timing of the OF2 module, because OF2 has the largest delay in the entire data path. Meanwhile, in the NCL+AFSM version, the delay varies according to the given instruction (4.2 ns in Fig. 9(a), 7.38 ns in Fig. 9(b)). The maximum delay of the NCL+AFSM version was measured to be 14.72 ns, as shown in Fig. 9(c), which is almost the same as the delay of the AFSM version. However, it is still lower than that of the AFSM version once the AFSM design margin is considered. Accordingly, it is confirmed that the NCL+AFSM based ALU has a flexible, data-dependent delay for a given instruction and data.

(a) EXWBExecution Rise time to EXWBExcutionDone Rise Time: ALUopcode: 4, AFSM: 15 ns, AFSM+NCL: 4.2 ns, SYNC clock cycle: 30.60 ns
(b) EXWBExecution Rise time to EXWBExcutionDone Rise Time: ALUopcode: 2, AFSM: 15 ns, AFSM+NCL: 7.38 ns, SYNC clock cycle: 30.60 ns
(c) EXWBExecution Rise time to EXWBExcutionDone Rise Time: ALUopcode: 5, AFSM: 15 ns, AFSM+NCL: 14.72 ns, SYNC clock cycle: 30.60 ns
Fig. 9. The waveform from three cores: self-timed feature
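The ALU delays listed in the Fig. 9 captions can be compared directly with the fixed 15 ns matched delay of the AFSM version. The short calculation below uses only those reported values; the variable and function names are hypothetical.

```python
# Delays taken from the Fig. 9 captions (ns).
AFSM_ALU_DELAY_NS = 15.0
NCL_ALU_DELAYS_NS = {4: 4.2, 2: 7.38, 5: 14.72}   # ALUopcode -> AFSM+NCL delay


def reduction_percent(baseline_ns, measured_ns):
    """Relative reduction of the measured delay against the baseline."""
    return 100.0 * (baseline_ns - measured_ns) / baseline_ns


reductions = {
    opcode: reduction_percent(AFSM_ALU_DELAY_NS, delay)
    for opcode, delay in NCL_ALU_DELAYS_NS.items()
}

for opcode, pct in reductions.items():
    print(f"ALUopcode {opcode}: {pct:.1f}% shorter than the fixed AFSM delay")
```

For these three cases the self-timed ALU completes roughly 72%, 51%, and 2% sooner than the fixed AFSM delay, which shows how strongly the benefit depends on the operation being executed.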

Fig. 10. Benchmark program simulation results

Fig. 11. Benchmark program instruction set analysis

Fig. 10 presents the completion time of each version during the execution of the four benchmark programs. Compared to the SYNC version, the NCL+AFSM version shows a performance improvement of at least 27.02% (THOLD benchmark) and at most 34.4% (BUF_CRC benchmark). As shown in Fig. 11, more than 80% of the BUF_CRC program consists of arithmetic operations (AR_OP) and special operations (SP_OP) of the ALU. This result shows that the more heavily a benchmark program uses the self-timed ALU, the more the performance of the entire system improves.

6 Conclusion and future works

In this paper, we propose a new design flow for the asynchronous mixed delay model: an AFSM for the bounded delay model and the Uncle for the delay-insensitive model. We then designed an MSP430 core targeting IoT applications using the proposed design flow. The proposed design flow supports seamless interfacing between the dual-rail and single-rail encoded data paths, and it provides communication between the data-driven data path and the control path. Additionally, it preserves the self-timed characteristics obtained from the delay-insensitive model. We verified the self-timed performance through timing simulation and observed that the designed core exhibits a performance improvement of 30.34% over the synchronous core. In the near future, we will perform static timing analysis in order to check the timing constraints with the layout synthesis. In addition, we will implement our design on an FPGA and verify its performance and functionality in a real working chip.

Acknowledgments

This work was supported by the ICT R&D program of MSIP/IITP. [B , Low-power and High-density Micro Server System Development for Cloud Infrastructure] and Basic Science Research Program through the National Research Foundation (2015R1D1A3A ).
