SIM 2011 26 th South Symposium on Microelectronics 167 A Direct Memory Access Controller (DMAC) IP-Core using the AMBA AXI protocol 1 Ilan Correa, 2 José Luís Güntzel, 1 Aldebaro Klautau and 1 João Crisóstomo Costa {ilan, aldebaro, jweyl}@ufpa.br, guntzel@inf.ufsc.br 1 Sensors and Embedded Systems Laboratory (LASSE) Federal University of Pará, Belém, Brazil 2 Embedded Computing Lab. (LCE) Federal University of Santa Catarina, Florianópolis, Brazil Abstract This work presents the design of a hardware controller that allows a peripheral of an embedded computer system to directly access the memory, known as direct memory access controller (DMAC). This DMAC is inspired in the Dalton Project DMAC, also being compatible with the widely used Intel I8237. In order to make it suitable for current embedded applications, its communication pattern follows the AMBA AXI protocol, which allows more flexibility such as fewer timing restrictions and data transfer in the system with minimum CPU load. 1. Introduction The tremendous advances in integrated circuit fabrication technology have allowed a myriad of new products that are present in everyday. Most of these products fall into the embedded systems class. Particularly, personal mobile devices (PMDs), such as smatphones, tablets etc, became very popular, currently responding for a significant part of the consumer electronics market. New electronic products, mainly PMDs, present very stringent design requirements in terms of energy consumption, reliability, cost and performance (speed). Currently, the major challenge in embedded systems design concerns energy efficiency, which corresponds to achieve the performance required by the application with as low power consumption as possible. Unfortunately, those are conflicting targets. Thereby, new solutions in terms of hardware, embedded software or both are greatly demanded, as those presented in [1], [2]. To meet performance requirements of embedded systems, a widely used design technique is the adoption of blocks that are designed to efficiently perform a particular task or a critical part of a larger algorithm. Those blocks are generally referred to as Intellectual Property (IP) blocks. This paper presents the design of an IP block that acts as a direct memory access controller (DMAC). It is based on the widely used Intel I8237 DMAC [3], as well as on Dalton Project's DMAC [4]. But unlike those DMACs, ours follows the AMBA AXI [5] protocol, which provides ways to achieve those requirement aforementioned and is becoming widely used in industry. As basic design requirement, it was assumed that this DMAC should allow data transfer between a peripheral and the memory consuming the least amount of CPU cycles possible. The DMAC design is the result of a partnership in the context of NAMITEC Project [6], involving two research groups, one from the Federal University of Santa Catarina and another from the Federal University of Pará. 2. AMBA AXI overview The Advanced Microcontroller Bus Architecture (AMBA) was developed by ARM Ltd in 1996 and has several versions. In 2003 a new AMBA with many features was released, looking for achieving better performance. In this new version there are five communication channels which are defined as: Read address channel, Write address channel, Read data channel, Write data channel and Write response channel. The channels Read/Write address channel carry information needed to begin a transfer, Read/Write data channel carry data after handshake in any address channel and Write response channel is used with Write data channel to indicate how a write operation ended or whether there was an error during handshake in a transfer or address channel. The AMBA AXI architecture was conceived to allow only two devices to be directly connected. In a given transfer, one device is the "Master", while the other is the "Slave". Since in the DMAC there might exist various devices, an "interconnect block" must be used, as the one available from ARM [7]. The topology of the interconnect block is such that for each Master there is a Slave, and for each Slave there is a Master (shown in Fig. 1). Hence, the interconnect block acts as a data multiplexer, allowing a Master to connect with any other Slave by using the information provided through Read or Write address channel [7]. A transfer is always started by a Master. Initially, it releases control information in Read or Write address channel, such as word length, number of words and addresses. After control information has been transferred
168 SIM 2011 26 th South Symposium on Microelectronics (and Slaves have answered), data transfer starts through Write or Read data channel. Finally, in the case of write operation the response is sent by Write response channel, from Slave to Master. In order to allow an energy-efficient operation, in this work the simplest AXI protocol was adopted, which only manages the handshaking and data transfer as specified in version two. However, may be interesting to add some of the features of newer versions, (mainly from version four), whenever they contribute to save power. Fig. 1 Interconnect block (taken from [7]). 3. Inside the controller The direct memory access controller (DMAC) acts as a bridge, where it receives an amount of data, stores the data in its internal buffer and forwards the data to the destination. This operation is repeated until all data has been transferred. The DMAC also handles the priority between all devices requesting memory access. For the operation to be performed the DMAC should have a set of components that assists it in the job (e.g. Registers, buffers, multiplexers, etc), in addition to the interfaces of the protocol, two Masters and a Slave. Fig. 2 shows the internal organization of the designed DMAC. These components keep and forward information concerning transfer direction, number of words to be transferred, size of each word, priority of a transfer, which device will transfer, temporary information about current transfer and a memory to store the data to be forwarded. Fig. 2 shows the set of components that performs all data manipulation inside the DMAC. This set consists of simple components, what makes this controller less power hungry, causing less impact in the overall system consumption. Registers, multiplexers, comparators, a buffer and adders (not shown) compose its datapath. The datapath is coordinated by the control block. The DMAC components are described in the following subsections. Fig. 2 DMAC internal structure and its interfaces to the system. 3.1. Registers The registers store information coming from CPU, information that CPU should read to know which device will transfer, temporary information generated performing the transfer and priority control.
SIM 2011 26 th South Symposium on Microelectronics 169 Information coming from CPU corresponds to the direction of transfer, number of words to be transferred and addresses. 3.2. Multiplexers A DMA transfer can occur from memory to a device, from device to memory or from memory to memory. In the first type of transfer the device is referred to as Receiver, whereas in the second type of transfer, it is referred to as the Provider. In each transfer the DMAC reads a block of words from the Provider and forwards to the Receiver. To properly accomplish this, extern signal should be selected among those signals coming from memory or from devices. This is the role of multiplexers: they generate internal signal from external signals, and thereby the control block deals only with internal signals, which make its design easy. For instance, in the case of a transfer operation from memory to a device, the controller generates control information concerning Read address channel connected to memory and waits the response coming from it. After memory response, it starts the transfer putting words on Read data channel, thus informing the controller that data is on this channel by selecting its signals through the multiplexers. After internal buffer is full the controller generates signals on Write address channel connected to device and starts forwarding the buffer contents. Such operation is similar to the first transfer previously described. 3.3. Buffer This component provides a way to store a block of words and then forwards when it is full. It avoids excessive handshakes. 3.4. Control block This component provides the control signals to the datapath, being highly dependent on how datapath components are organized. The control block consists of two state machines: the "priority machine" and the "transfer machine". The former manages priority between devices and controls the beginning of the operation, whereas the latter manages internal signals used to control the datapath and external signals, used in interruptions and in the AXI protocol. Fig. 3 depicts these state machines. (a) (b) Fig. 3 State machines of DMAC control block: priority machine (a) and transfer machine (b). The priority machine (Fig. 3 (a)) controls priority of accessing memory for two or more devices (it can be easily adapted for more devices). While there is no request on interruption interface, it stays in the NOP state. Whenever there is a request the machine goes either to state D0 or D1, depending on the device requesting access. In D0 and D1 a signal is generated and the transfer machine (Fig. 3 (b)) starts its operation. The transfer machine generates control signals to the datapath and externals signals. It performs four tasks: waits request (IDLE), waits for permission from the CPU (S0), read data from the Provider and write data to the Receiver. While waiting for request, the transfer machine is basically waiting for the signal coming from the priority machine. This signal changes its state from IDLE to S0. In S0 it writes data from the CPU in its internal register (information about transfer) and waits for the signal from CPU that indicates that the bus is free. In S1, S2 and S3 the controller reads data from the Provider and writes these data to the Receiver in S4, S5 and S6.
170 SIM 2011 26 th South Symposium on Microelectronics These states are triggered by signals of AXI (they are properly selected on the multiplexers). They generate output signal of AXI too. For instance, in a read operation from memory, S1, S2 and S3 generate signals on Read address channel and Read data channel. In the second part of the transfer, the controller forwards the data using Write address channel, Write data channel and Write Response channel. 4. DMA operation This section presents a simplified description of the DMAC operation when two devices disp0 and disp1 request a transfer at the same time, supposing that device disp0 has higher priority and therefore, will transfer first. In this examples memory is the Provider and a device is the Receiver. Initially, the DMAC waits for request and the two state machines are in NOP and IDLE states, respectively. Then, devices disp0 and disp1 request a transfer at the same time. The priority machine goes to state D0, writes on proper register which device should transfer and generates a signal that makes the operation machine goes from IDLE to S0. In S0 the DMAC interrupts the CPU and the signals coming from Slave interface connected to CPU are selected. The CPU reads some internal registers and with these data writes on other register information about the transfer. After that, it releases a signal indicating that the bus is free and the operation machine goes to S1. In S1 the Read address channel of Master connected to Provider is used. When Provider is complete it sends a signal and the transfer machine progresses to S2. In S2 Read data channel is used to transfer data, with each received data goes to S3, stores data, and verifies if buffer is full. If the buffer is not full, the machine proceeds to S2. Else, it goes to S4. S4 and S5 are similar to S2 and S3, respectively, except that the controller writes using Write address channel and Write data channel. S6 verifies if all data has been transferred. If not, the machine goes to S2. Else, it goes to IDLE states.. State S6 also verifies the Write response channel, looking for errors. 5. Synthesis Results In order to validate the DMAC architecture, its datapath was described in VHDL and synthesized for Stratix II and Stratix III Altera FPGAs by using Quartus II software. We estimate that the impact of the control block on the DMAC performance and FPGA resources is not as relevant as that of the datapath. The obtained results, in terms of resources and critical delay, are showed in Table 1. The small difference of delay for these FPGA families may indicate that the DMAC performance is bounded by the internal FPGA wire connections and by capacitive load represented by the chip pads. Therefore, for a more accurate evaluation, an ASIC synthesis is demanded. Table 1. Synthesis results of DMAC datapath for Altera FPGAs family/device # of ALUTs # of flip-flops logest delay (ns) Stratix II - EP2S15F484C4 269 573 9.296 stratix III - EP3SE50F484C4 268 573 9.172 6. Conclusion This paper presented the design of an AMBA AXI based direct memory access controller (DMAC). The internal structure of DMAC was showed, highlighting the simplicity of design which reduces power consumption, a critical subject in embedded systems. The DMAC operation also allows for CPU time savings, due to the small number of cycles for preparing each transfer. A device can be easily adapted to the designed DMAC because AMBA AXI does not have hard timing constraints. The DMAC was designed in such a way that the number of devices can be easily expanded. The compatibility with commercial IPs and systems is ensured by the use of AMBA AXI protocol. Although such protocol requires a significant number of signals, it is not a relevant problem since the DMAC is to be used in systems-on-a-chip (SoCs). The DMAC is currently being described in VHDL. For the final version of this paper, we intend to add synthesis results targeting Altera FPGA devices. 7. References [1] Mutlu, O.; Hyesoon Kim; Patt, Y.N.; "Address-Value Delta (AVD) Prediction: A Hardware Technique for Efficiently Parallelizing Dependent Cache Misses," Computers, IEEE Transactions on, vol.55, no.12, pp.1491-1508, Dec. 2006. [2] Jones, R.B.; Allan, V.H.;, "Software pipelining: a comparison and improvement," Microprogramming and Microarchitecture. Micro 23. Proceedings of the 23rd Annual Workshop and Symposium. Workshop on, pp.46-56, 27-29 Nov 1990.
SIM 2011 26 th South Symposium on Microelectronics 171 [3] 8237A High Performance Programmable DMA Controller Datasheet, Intel Corporation, 1993. [4] The UCR Dalton Project - IP-Based Embedded System Design, 2010; http://www.cs.ucr.edu/~dalton/. [5] AMBA AXI Protocol Specification ver 2.0, ARM Ltd., 2003. [6] Instituto Nacional de Ciência e Tecnologia de Sistemas Micro e Nanoeletrônicos, 2010; http://namitec.cti.gov.br/. [7] PrimeCell AXI Configurable Interconnect, ARM Ltd., 2004
172 SIM 2011 26 th South Symposium on Microelectronics