An Overload-Free Data-Driven Ultra-Low-Power Networking Platform Architecture


Shuji SANNOMIYA 1, Yukikuni NISHIDA 2, Makoto IWATA 3, and Hiroaki NISHIKAWA 1
1 Faculty of Engineering, Information and Systems, University of Tsukuba, Tsukuba Science City, Ibaraki, Japan
2 Graduate School of Systems and Information Engineering, University of Tsukuba, Tsukuba Science City, Ibaraki, Japan
3 School of Information, Kochi University of Technology, Kami, Kochi, Japan

Abstract

To enhance the sustainability of communication, especially in times of disaster, both low power consumption and tolerance for the traffic increase caused by emergency communication must be realized urgently. Our previous study presented ULP-DDNS (Ultra-Low-Power Data-Driven Networking System), which extends the lifetime of battery-operated devices forming an ad-hoc network that can provide a communication environment in areas where fixed and wired networks are disabled by a disaster. This paper presents a networking platform architecture with a runtime overload-avoidance mechanism that dynamically maintains the processing load within the design target, giving the ULP-DDNS tolerance for increased traffic. The mechanism exploits the unique positive correlation between processing load and consumption current in data-driven processors realized by a self-timed pipeline: when the current increases, it raises the throughput by runtime voltage scaling in order to reduce the processing load.

Keywords: data-driven processor, protocol handling, real-time multiprocessing, self-timed pipeline

1. Introduction

Enhancing the sustainability of communication is one of the urgent issues in emergency situations, especially in times of disaster.
We have already proposed ULP-DDNS (Ultra-Low-Power Data-Driven Networking System) [1] to achieve the ultra-low power consumption that is indispensable for extending the lifetime of battery-operated mobile devices. These devices form an ad-hoc network that can provide a communication environment in areas where fixed and wired networks are disabled by a disaster. To ensure connectivity over the ULP-DDNS, it is indispensable to tolerate the traffic increase caused by emergency communication for safety confirmation, information gathering, and so forth. Concretely, protocol processing should be guaranteed on every platform (network node) even when traffic increases.

However, a platform may become inoperative when incoming traffic increases. This is because the increased traffic may raise the number of packets processed concurrently in the platform beyond the design target; that is, the pipeline occupancy, defined as the ratio of the number of valid data to the number of pipeline stages, may exceed the design target. To keep the platforms free from such an overload situation, both observability and controllability of the pipeline occupancy are indispensable. Unfortunately, the pipeline occupancy of current mainstream processors cannot be observed accurately because the number of valid data may change at runtime depending on unpredictable branches and/or interrupts. In contrast, data-driven processors realized by a self-timed pipeline provide direct observability of their pipeline occupancy: the localized data transfer of the self-timed pipeline drives only the pipeline stages holding valid data, so the consumption current of the self-timed pipeline is proportional to the runtime pipeline occupancy. In other words, the pipeline occupancy can be observed externally through the amount of consumption current. Moreover, the throughput of the self-timed pipeline can be controlled in real time by changing the supply voltage using a DVS (Dynamic Voltage Scaling) technique [2].
Consequently, the pipeline occupancy can be kept within the design target by raising the pipeline throughput through DVS when the consumption current increases due to increased traffic.

In this paper, an overload-free data-driven networking platform architecture is proposed based on the direct observability and controllability of the pipeline occupancy of the self-timed pipeline. Changing the throughput by DVS takes time because of both the signal propagation in the control circuit and the parasitic capacitance of the circuit. The fluctuation of the pipeline occupancy should therefore be temporally smoothed and reduced so that the occupancy stays within the design target until the throughput reaches its target value. The key idea of the proposed architecture is to temporally smooth and lower the pipeline occupancy at runtime by changing the parallelism of the target protocol handling, based on the real-time multiprocessing capability of the data-driven processor realized by the self-timed pipeline. The feasibility of the proposed architecture is discussed based on measurements of the latest version of the data-driven processors realized by the self-timed pipeline [3].

Fig. 1: Self-timed (clockless) pipeline.

2. ULP-DDNS platform

To realize the overload-free networking platform, both observability and controllability of the pipeline occupancy are indispensable. Fortunately, both can be provided in the data-driven networking platform of the ULP-DDNS. This section explains how these indispensable features are provided and discusses a basic technique that exploits them to achieve an overload-free networking platform.

2.1 Ultra-low-power data-driven networking processor

The platform of the proposed ULP-DDNS is realized by both an ad-hoc networking scheme that reduces redundant traffic and a data-driven processor that handles communication protocols with low power. The proposed ad-hoc networking scheme realizes an ad-hoc network over mobile devices in areas where the existing fixed and wired network infrastructure has become inoperative due to a fault or disaster, and it reduces the redundant traffic caused by existing simple flooding (broadcasting) when urgent information is delivered all over the ad-hoc network [4]. Our evaluation showed that the proposed ad-hoc networking scheme reduces the traffic to 1/10 [1]. This traffic reduction directly decreases the number of packets sent and received by every node (platform) in the ad-hoc network, and thus contributes to lowering the power consumption of every platform. In addition to the ad-hoc networking scheme, a data-driven networking processor is proposed to lower the power consumption required to handle the protocol for sending and receiving each packet.
The proposed data-driven networking processor, named ULP-DDCMP (Ultra-Low-Power Data-Driven Chip MultiProcessor), is realized by using an optimized circular pipeline. When unary operations are executed, this pipeline makes it possible to bypass the pipeline stages for firing control, which detect the arrival of a pair of operands [3]. Each processor core of the ULP-DDCMP is named ULP-CUE (Ultra-Low-Power CUE) as a successor of the CUE series of data-driven processors [3]. The ultra-low power consumption that results from the synergy between the traffic reduction by the ad-hoc networking scheme and the low-power protocol handling by the ULP-DDCMP has been demonstrated by using simulators and a prototype VLSI chip of the ULP-DDCMP. The results show that the ULP-DDNS can reduce power consumption by a factor of several hundred compared with an existing network system [1].

2.2 Real-time observability and controllability

One of the main contributors to the ultra-low power consumption is the localized data transfer of the self-timed pipeline (STP) used to realize the ULP-DDCMP. The localized data transfer also provides both a strong positive correlation between the pipeline occupancy and the consumption current and real-time adaptability to dynamic voltage scaling. In the STP, only the pipeline stages holding valid data are driven, as a consequence of the localized data transfer called handshake. Figure 1 shows the basic structure of the STP, in which each stage consists of a data latch (DL), functional logic (FL), and a transfer control unit (C). The STP is a kind of asynchronous bundled-data pipeline, and it employs a four-phase handshake [5]. Based on the four-phase handshake, the valid data in the STP are transferred between adjacent stages as follows.

Reset: After the assertion of the reset signal, the C negates both its send signal, which represents a transfer request, and its ack signal, which represents an acknowledgement.

The C asserts its ack signal after the send signal of the preceding C is asserted. After the assertion of the ack signal, the preceding C negates its send signal. After the negation of that send signal, the C asserts both its gate-open signal (cp) and its own send signal and concurrently negates its ack signal, but only if the ack signal of the succeeding C is negated. As a result, the data is latched in the stage to which the C belongs. The succeeding C then repeats the same steps.

This handshake not only concentrates the dynamic consumption current into the pipeline stages holding valid data but also eliminates global clocks. Generally, a clock-synchronized circuit requires a PLL (Phase-Locked Loop) circuit to change the clock frequency according to the supply voltage, and the PLL takes several tens of microseconds to change the clock frequency. That is, the supply voltage must be held constant for several tens of microseconds.
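The localized data transfer described above can be illustrated with a small behavioral model. The following Python sketch is only an abstraction made for illustration: it does not model the send, ack, and cp signals individually, the names `handshake_round` and `occupancy` are hypothetical, and it assumes that a token advances only when the next stage was empty at the start of the round and that the sink after the last stage always acknowledges. It shows that the number of stages doing work in a round follows the number of valid data in the pipeline, which is the property behind the correlation between occupancy and consumption current.

```python
from typing import List, Optional, Tuple

Stage = Optional[str]  # a stage holds one valid datum (token) or is empty


def handshake_round(stages: List[Stage], incoming: Stage = None) -> Tuple[int, bool]:
    """Advance every token by at most one stage.

    Returns (number of data transfers performed in this round,
    whether `incoming` was accepted by the first stage).
    """
    before = list(stages)  # decisions use the state at the start of the round
    transfers = 0
    # Assumption: the sink after the last stage always acknowledges.
    if before[-1] is not None:
        stages[-1] = None
        transfers += 1
    # A token moves only if the succeeding stage was empty (its ack negated).
    for i in range(len(stages) - 2, -1, -1):
        if before[i] is not None and before[i + 1] is None:
            stages[i + 1] = before[i]
            stages[i] = None
            transfers += 1
    accepted = incoming is not None and before[0] is None
    if accepted:
        stages[0] = incoming
        transfers += 1
    return transfers, accepted


def occupancy(stages: List[Stage]) -> float:
    """Ratio of the number of valid data to the number of pipeline stages."""
    return sum(s is not None for s in stages) / len(stages)
```

Feeding three tokens into an empty 13-stage pipeline at rounds 0, 2, and 4 makes the per-round transfer count grow from 1 to 3 as the tokens accumulate, while empty stages contribute nothing.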

In the STP, by contrast, no PLL is required, and the delay times of the DL, FL, and C change at an equal rate according to the supply voltage. Therefore, the supply voltage of the STP can be scaled at runtime as long as the rate of change of the voltage is moderate enough to guarantee correct transistor switching; that is, the throughput of the ULP-DDCMP can be changed while the target protocols are being handled.

In the ULP-DDCMP, both the occupancy and the throughput increase when the number of packets processed concurrently increases. Figure 2 shows these characteristics as measured on the existing ULP-DDCMP chip. As shown in Fig. 2(a), the throughput stays at its maximum value regardless of the pipeline occupancy once the occupancy exceeds the design target value. The ULP-DDCMP may therefore become inoperative due to overflow of the STP if the input traffic continues to exceed the design target. That is, the pipeline occupancy should be kept within the design target to realize the overload-free networking platform. As shown in Fig. 2(b), the pipeline occupancy correlates with the consumption current of the STP, i.e., the statically unpredictable pipeline occupancy can be observed at runtime through the consumption current. Consequently, the overload situation can be avoided by raising the pipeline throughput so that the pipeline occupancy stays within the design target value when the occupancy increases.

Fig. 2: Direct observability on pipeline occupancy. [Axes: throughput (normalized) versus pipeline occupancy, with the design target marked.]

3. Runtime overload-avoidance mechanism

Based on the direct observability and controllability, the throughput of the protocol handling in the ULP-DDCMP can be changed when the input traffic increases. This section discusses the platform architecture that realizes this runtime load control for overload avoidance.
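The load control outlined above, observing the occupancy through the consumption current and raising the supply voltage when it is too high, can be sketched as a simple control step. The following Python sketch is a minimal illustration under stated assumptions, not the authors' implementation: the class name `OverloadAvoider`, the linear current-to-occupancy calibration, the voltage range and step, and the hysteresis threshold for lowering the voltage again are all hypothetical. The 40% default is the design target that the paper reports for each ULP-CUE. The sketch also ignores the fact that the current itself depends on the supply voltage.

```python
from dataclasses import dataclass


@dataclass
class OverloadAvoider:
    idle_current_ma: float           # hypothetical calibration: current at 0% occupancy
    full_current_ma: float           # hypothetical calibration: current at 100% occupancy
    target_occupancy: float = 0.40   # design target reported for each ULP-CUE
    release_occupancy: float = 0.20  # hypothetical threshold for lowering the voltage
    v_min_mv: int = 800              # hypothetical supply-voltage range and step
    v_max_mv: int = 1200
    v_step_mv: int = 50
    voltage_mv: int = 800

    def estimate_occupancy(self, current_ma: float) -> float:
        """Map the measured current to an occupancy, assuming a linear relation."""
        span = self.full_current_ma - self.idle_current_ma
        ratio = (current_ma - self.idle_current_ma) / span
        return min(1.0, max(0.0, ratio))

    def update(self, current_ma: float) -> int:
        """One control step: return the new supply voltage in millivolts."""
        occ = self.estimate_occupancy(current_ma)
        if occ > self.target_occupancy:
            # Occupancy above the design target: raise the throughput.
            self.voltage_mv = min(self.v_max_mv, self.voltage_mv + self.v_step_mv)
        elif occ < self.release_occupancy:
            # Load is light again: fall back toward the low-power setting.
            self.voltage_mv = max(self.v_min_mv, self.voltage_mv - self.v_step_mv)
        return self.voltage_mv
```

With `idle_current_ma=1.0` and `full_current_ma=11.0`, a reading of 6.0 mA maps to 50% occupancy and raises the voltage by one step, while a reading of 4.0 mA (30%) leaves it unchanged. A real controller would step gradually enough to respect the limit on the rate of voltage change noted above.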
3.1 Networking platform architecture

As already described, both the observation of the pipeline occupancy through the consumption current and the control of the effective throughput by DVS can be realized at runtime. Unfortunately, some delay elapses between a change in the pipeline occupancy and the moment the effective throughput reaches its target value, because of the signal propagation delay through the control circuits and their parasitic capacitance. Therefore, the fluctuation of the pipeline occupancy should be temporally moderate to leave enough time for changing the effective throughput.

To make the fluctuation of the pipeline occupancy temporally smooth without any runtime overhead, the data-driven programs of the target protocols are modified to reduce the variation in the number of operations executed concurrently. As illustrated in Fig. 3, programs for data-driven processors are defined as data-flow graphs (DFGs). A DFG consists of nodes and arcs: each node describes an operation, and each arc represents the data dependency between two successive operations. The data dependencies between operations naturally represent the ILP (Instruction Level Parallelism) inherent in the programs, so describing a target program as a DFG extracts its ILP. In data-driven processors, each operation is executed independently of the other operations, and its execution time is also independent of that of the other operations, as a result of the real-time multiprocessing [6]. Based on this feature, the number of operations executed concurrently can be changed by postponing the execution of operations on non-critical paths, as shown in Fig. 3. This program modification can temporally smooth the number of operations executed concurrently without any overhead on the execution time of the operations on the critical path of the target programs.

Fig. 3: Temporally-smoothing the number of operations executed concurrently.

Figure 4 shows the basic architecture that realizes an overload-free networking platform based on the techniques discussed. To enhance the throughput of the protocol handling when the input traffic increases, a runtime overload-avoidance mechanism is introduced that increases the supply voltage according to the increased consumption current. This mechanism can be implemented by using the runtime voltage scaling technique [2] for the self-timed pipeline.

This kind of load control in the platform should not increase the traffic in the ad-hoc network, because increased traffic leads to network congestion. From this standpoint, the throughput of the protocol handling for receiving packets should be kept constant to guarantee packet reception, because retransmissions caused by refused packet reception increase the traffic in the ad-hoc network. Therefore, the receiving protocol handling at the link layer is excluded from the throughput control, as shown in Fig. 4. On the other hand, the throughput of the protocol handling for sending packets is enhanced by increasing the supply voltage in order to reduce the pipeline occupancy under increased traffic.

Based on this basic architecture, the pipeline occupancy derived from the protocol handling up to the network layer can be reduced under increased traffic. However, the pipeline occupancy depends not only on the protocol handling up to the network layer but also on the internal processing, including the upper-layer protocol handling and the application processing.
3.2 Runtime parallelism transformation

As for the internal processing, enhancing the throughput does not necessarily reduce the pipeline occupancy, because some of the internal processing may be resident. For example, a GUI (Graphical User Interface) manager continues to run while the display device is lit. To reduce the pipeline occupancy derived from such internal processing, the number of data (tokens in the data-driven processors) flowing through the STP should be reduced. However, tokens derived from different programs are processed concurrently at different stages of the STP without any distinction by type of processing, so it is difficult to selectively remove the flowing tokens of a particular processing type.

Fortunately, the processing time constraints of the upper-layer protocol handling and the application processing are often loose in comparison with those of the link-level protocol handling. For instance, the response time of the MAC (Media Access Control) protocol handling is strictly determined on a microsecond time scale by the specification of the physical-layer hardware, while a delay of several seconds in a mailer application can be accepted or ignored. By utilizing such slack time in some of the internal processing, the pipeline occupancy can be temporally smoothed and reduced in the data-driven processors.

By utilizing the real-time multiprocessing feature, the number of operations executed simultaneously can be reduced, as already shown in Fig. 3. For internal processing with slack time, the number of concurrently executing operations can be reduced further at the expense of an increase in the processing time. In the extreme case it can be 1, as shown in Fig. 5, as long as the increased time is acceptable. Consequently, the pipeline occupancy derived from the internal processing with slack time can be reduced by transforming the parallelism of the programs.
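The effect of postponing operations on non-critical paths can be illustrated with a small scheduling sketch. The following Python code is an illustration only: the function `schedule`, the example graph, and the assumption that every operation takes one time step are hypothetical and are not taken from the authors' tool chain. It places the operations of a DFG into time steps under a limit on the number of operations fired per step, giving priority to operations with the longest remaining path so that the critical path is delayed as little as possible.

```python
from typing import Dict, List


def schedule(dfg: Dict[str, List[str]], limit: int) -> List[List[str]]:
    """Place unit-time operations into time steps.

    `dfg` maps each operation to the operations it depends on;
    `limit` is the maximum number of operations fired per step.
    """
    successors: Dict[str, List[str]] = {op: [] for op in dfg}
    for op, preds in dfg.items():
        for pred in preds:
            successors[pred].append(op)

    height: Dict[str, int] = {}

    def longest_path(op: str) -> int:
        # Number of operations from `op` to the end of the graph.
        if op not in height:
            height[op] = 1 + max((longest_path(s) for s in successors[op]), default=0)
        return height[op]

    done: set = set()
    steps: List[List[str]] = []
    while len(done) < len(dfg):
        ready = [op for op in dfg
                 if op not in done and all(p in done for p in dfg[op])]
        # Critical-path operations first; the others are postponed.
        ready.sort(key=lambda op: (-longest_path(op), op))
        fired = ready[:limit]
        steps.append(fired)
        done.update(fired)
    return steps


# Hypothetical example: a1 -> a2 -> a3 is the critical path,
# while b1, b2, b3 are independent operations feeding the final sum.
example = {
    "a1": [], "a2": ["a1"], "a3": ["a2"],
    "b1": [], "b2": [], "b3": [],
    "sum": ["a3", "b1", "b2", "b3"],
}
```

For this example, `schedule(example, 4)` fires four operations in the first step and one in each later step, whereas `schedule(example, 2)` fires at most two per step and still finishes in the same four steps, which corresponds to the smoothing of Fig. 3. With `schedule(example, 1)` the parallelism drops to 1 at the cost of seven steps, which corresponds to the extreme case described above.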
To realize such a transformation of the parallelism, any overhead on the processing time of the running programs should be avoided in order to satisfy the required processing time constraints. In this paper, a runtime parallelism transformation with no overhead on the processing time is introduced by exploiting the real-time multiprocessing capability of the data-driven processor realized by the STP. The runtime parallelism transformation is realized by switching the program at runtime, i.e., an internal processing program with high throughput (parallelism) is switched to its alternative version with low parallelism when the pipeline occupancy increases. It is difficult to switch a program that is already running to the alternative version, because the tokens of the running program are spread over the STP. Therefore, the switching should take place at the beginning of the execution of the program or of an iteration. The switching should also be coordinated with changes in the pipeline occupancy, and thus a switch operation is introduced to realize a branch on the pipeline occupancy. As shown in Fig. 5, the switch operation changes the data flow at runtime according to the direction input externally from the runtime overload-avoidance mechanism. In the runtime overload-avoidance mechanism, the direction of the switch operation is determined according to the input consumption current, which represents the pipeline occupancy. As a result of this control, the pipeline occupancy can be reduced both by enhancing the throughput of the protocol handling for sending packets and by decreasing the number of operations executed concurrently when the input traffic increases.

Fig. 4: Networking platform with runtime overload-avoidance mechanism.

Fig. 5: Runtime parallelism transformation by switching DFG.

3.3 Preliminary evaluation

The proposed architecture depends entirely on the previously proposed runtime DVS technique [2], on the parallelism transformation of the target protocol handling program, and on the real-time multiprocessing capability of the data-driven processors realized by the STP. As a preliminary evaluation of the feasibility of the proposed architecture, both the parallelism transformation and the real-time multiprocessing are verified by using the ULP-DDCMP chip, which is the latest data-driven processor realized by the STP.
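The switch operation of Section 3.2 can be pictured as a branch taken once per packet or iteration. The following Python sketch is a hypothetical illustration, not the data-driven program used in the evaluation: the two program stand-ins, the function names, and the current threshold are all assumptions. The point it shows is that the direction is evaluated only at the start of each iteration, so a program whose tokens are already in flight is never switched midway.

```python
from typing import Callable, Iterable, List

Program = Callable[[bytes], str]


def high_parallelism(packet: bytes) -> str:
    # Stand-in for the original DFG with high parallelism.
    return f"high:{len(packet)}"


def low_parallelism(packet: bytes) -> str:
    # Stand-in for the alternative DFG with parallelism close to 1.
    return f"low:{len(packet)}"


def switch(current_ma: float, threshold_ma: float) -> Program:
    """Choose the DFG version for the next iteration from the observed current."""
    return low_parallelism if current_ma > threshold_ma else high_parallelism


def process(packets: Iterable[bytes],
            currents_ma: Iterable[float],
            threshold_ma: float) -> List[str]:
    results = []
    for packet, current in zip(packets, currents_ma):
        program = switch(current, threshold_ma)  # decided once, before the iteration starts
        results.append(program(packet))
    return results
```

For example, with a threshold of 3.0 mA, packets that arrive while the current reads 5.0 mA are handled by the low-parallelism version, and later packets return to the high-parallelism version once the reading falls below the threshold.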
UDP/IP is chosen as the concrete protocol because its connectionless packet transfer results in the low power consumption indispensable in ad-hoc networking, i.e., it is one of the protocols expected to be used in ad-hoc networking.

As shown in Fig. 6(a), the ULP-DDCMP chip houses four ULP-CUEs interconnected by a multi-stage token router realized by the STP. In the design of this chip, the circular STP realizing each ULP-CUE is divided finely in order to eliminate the pipeline bottleneck. As a result of this pipeline division, the number of stages of each ULP-CUE is 13. The chip is fabricated in a 65 nm CMOS 7-metal-layer process technology. The ULP-DDCMP is implemented on an evaluation board that mounts two FPGAs: one FPGA realizes the runtime DVS with PID (Proportional Integral Derivative) control to stabilize the supply voltage at a target value, and the other FPGA realizes logging of the performance and power consumption. The evaluation board is shown in Fig. 6(b).

Fig. 6: ULP-DDCMP chip and its evaluation board.

The ULP-DDCMP provides an instruction set sufficient to describe the UDP/IP handling program, and the data-driven program of the UDP/IP handling was actually described by using this instruction set. The described UDP/IP handling program realizes the checksum calculation and the generation of the UDP/IP header; packets containing a pseudo header and payload are input to the program, and the program outputs IP datagrams. The number of operations executed simultaneously in the originally described program varies from 1 to 5, and thus the maximum pipeline occupancy is approximately 38% (= 5/13). This means that one UDP/IP handling can be executed in one ULP-CUE within the design target, because the design target of each ULP-CUE is 40%, as shown in Fig. 2.

To verify the temporal smoothing of the number of concurrently executed operations, an alternative version of the UDP/IP handling is derived from the original version by using the introduced scheme, as shown in Fig. 5. In the derived alternative version, the number of operations executed concurrently is reduced to almost 1. That is, it is verified that the parallelism can be changed by modifying the program.

The real-time processing capability is verified by using the alternative version. The processing time required to process one packet is measured by using the logging function on the evaluation board while the number of input packets is increased, i.e.
the multiplicity is increased. Figure 7 shows the measured result. In sequential processing, the processing time per packet is proportional to the multiplicity. In contrast, the real-time processing capability of the ULP-DDCMP keeps the processing time per packet approximately constant regardless of the multiplicity, as shown in the result. In addition, the processing time per packet was measured for different input timings of the packets, and the same results were obtained. That is, the processing time of a program is independent of that of the other programs.

Admittedly, the processing time per packet increases by approximately 10% when the multiplicity is 4, in comparison with the other results. The cause of this increase is the elastic capability of the STP. The STP can maintain its maximum throughput even when the pipeline occupancy exceeds the design target, as shown in Fig. 2. The number of operations executed concurrently in the alternative version is not exactly 1 and temporarily becomes 2; therefore, the pipeline occupancy temporarily exceeds the design target when the multiplicity is 4. In other words, the STP naturally provides tolerance for temporary overload. If the increased processing time is not acceptable, the processing time can be kept constant by limiting the multiplicity to at most 3 or by pipelining the STP more deeply.

Fig. 7: Processing time for one packet.

4. Conclusion

In this paper, a data-driven networking platform architecture with a runtime overload-avoidance mechanism was presented in order to realize the overload-free networking platform indispensable for a sustainable networking environment. Based on the direct observability and controllability of the pipeline occupancy, which is the processing load of the platform, the overload-avoidance mechanism makes it possible to dynamically keep the pipeline occupancy within the design target. Concretely, the pipeline occupancy is observed through the consumption current, and it is reduced by raising the pipeline throughput with the runtime DVS when the input traffic increases. Moreover, a runtime parallelism transformation is proposed to make the control delay time inherent in the DVS circuit negligible. As a preliminary evaluation, the feasibility of the newly proposed runtime parallelism transformation was verified through measurements of the latest version of the data-driven processors. We are now developing a simulator [7] for the comprehensive evaluation of the ad-hoc network environment realized by the proposed architecture, and the evaluation result will be reported soon.

Acknowledgement

Although it is impossible to give credit individually to all those who organized and supported our project, the authors would like to express their sincere appreciation to all the colleagues in the project. This research work was supported in part by Core Research for Evolutional Science and Technology (CREST), Japan Science and Technology Agency (JST).

References

[1] Kazuhiro Aoki, Hiroshi Ishii, Makoto Iwata, and Hiroaki Nishikawa, A Comprehensive Evaluation of ULP-DDNS by Platform Simulator, in Proc. of PDPTA, pp , July
[2] Kei Miyagi, Shuji Sannomiya, Makoto Iwata, and Hiroaki Nishikawa, Low-Powered Self-Timed Pipeline with Runtime Fine-Grain Power Supply, in Proc. of PDPTA, pp , July
[3] Shuji Sannomiya, Kazuhiro Aoki, Makoto Iwata, and Hiroaki Nishikawa, Power-Performance Verification of Ultra-Low-Power Data-Driven Networking Processor: ULP-CUE, in Proc. of PDPTA, pp , July
[4] Hiroshi Ishii, Keisuke Utsu, and Hiroaki Nishikawa, Integrated Evaluation on Effectiveness of ULP-DDNS Networking Layer, in Proc. of PDPTA, pp , July
[5] C. J. Myers, Asynchronous circuit design, Univ. of Utah John Wiley & Sons, Inc.,
[6] Hiroaki Nishikawa, Design Philosophy of a Networking-Oriented Data-Driven Processor: CUE, IEICE Transactions on Electronics, Vol.E89-C No.3, pp , Mar
[7] Kazuhiro Aoki, Shuji Sannomiya, Makoto Iwata, Hiroshi Ishii, and Hiroaki Nishikawa, An Implementation of Platform Simulator for Congestion-Free Ultra-Low-Power Data-Driven Networking System, in Proc. of PDPTA, PDP2081, July 2013.
