A 1-V Heterogeneous Reconfigurable Processor IC for Baseband Wireless Applications

Hui Zhang, Vandana Prabhu, Varghese George, Marlene Wan, Martin Benes, Arthur Abnous, and Jan M. Rabaey

EECS Dept., University of California at Berkeley
Broadcom Corp., Irvine, CA

Berkeley Wireless Research Center
Allston Way, Suite 200, Berkeley, CA
Tel: (510)
Fax: (510)
hui@eecs.berkeley.edu

Abstract

Heterogeneous reconfiguration enables the flexible implementation of baseband wireless functions at energy-efficiency levels between 50 and 100 MIPS/mW, a factor of 8 better than traditional DSP processors. A 5.2 mm × 6.7 mm prototype processor, targeted at voice compression, is implemented in a 0.25 µm 6-metal CMOS process and consumes 1.8 mW at an average operation rate of 40 MHz. It combines an embedded microprocessor with an array of computational units of different granularities, connected by a hierarchical configurable interconnect network.

ISSCC Subject Area: Signal Processing

Introduction

The advent of the third generation of wireless applications creates a need for processing modules that simultaneously offer high computational performance, ultra-low energy consumption, and a high degree of flexibility and adaptability. Flexibility and adaptability are necessary in the presence of multiple and evolving standards, and help to increase quality of service under dynamically changing conditions. (Re)configurable processors combine flexibility and low energy by providing a direct spatial mapping from algorithm to architecture, which reduces the control overhead typically associated with instruction-set processors.

General Concept

The Pleiades processor approach [1] combines an on-chip microprocessor with an array of heterogeneous programmable computational units of different granularities (called satellite processors) connected by a reconfigurable interconnect network (Figure 1). The microprocessor supports the control-intensive components of the applications as well as the reconfiguration, while repetitive and regular data-intensive loops (henceforth referred to as kernels) are mapped directly onto the array of satellites by configuring the satellite parameters and the interconnections between them (Figure 2). Synchronization between the satellite processors is accomplished by a data-driven communication protocol, in accordance with the data-flow nature of the computations performed in the kernels. A generalized interface wrapper is placed around each satellite processor to comply with the communication protocol. By exploiting the locality of the computations and the correlations within data streams, and by distributing the control, this spatial programming approach achieves an energy efficiency (in MIPS/mW) at least an order of magnitude better than what can be accomplished in comparable DSP processors.

Processor Architecture

A prototype processor has been implemented for voice processing and related wireless applications. The Maia processor (Figure 3) combines an ARM8 core with 21 satellite processors: two MACs, two ALUs, eight address generators, eight embedded memories (four 512 × 16-bit and four 1K × 16-bit), and an embedded low-energy FPGA array [3]. Through an interface control unit, the ARM8 configures the memory-mapped satellites over a configuration bus, and exchanges data with the satellites through 2 pairs of I/O interface ports and direct memory reads and writes. Connections between satellite modules are made through a 2-level hierarchical mesh-structured reconfigurable interconnect network. The 20-pin chip contains 1.2 million transistors and measures 5.2 mm × 6.7 mm in 0.25 µm 6-metal CMOS technology (Figure 4).

The embedded ARM8 core is optimized for low-energy operation and can operate under variable supply voltages [2]. Both the dual-stage pipelined MAC (including shift, round, and saturate functions) and the ALU can be configured to handle a range of operations. The address generators and embedded memories are distributed to supply multiple parallel data streams to the computational elements. Each address generator has a small local instruction memory and can be programmed to support various addressing patterns and nested loops with loop counters and stride counters. It acts as the local controller of a data-flow kernel by initiating the data-flow threads and by signaling their completion to the ARM8.

The embedded FPGA contains a 4 × 8 array of 5-input 3-output CLBs, optimized for arithmetic operations and data-flow control functions. It has 3 levels of interconnect hierarchy, superimposing nearest-neighbor, mesh, and tree architectures. Its energy efficiency has been measured to be 70 times higher than that of equivalent industrial solutions [3].

The interface control unit coordinates synchronization and communication between the synchronous ARM8 core and the asynchronous reconfigurable data paths. Most importantly, it lets the core reconfigure the satellites by mapping all the configuration memories into the ARM8 memory space.

Communication Network

The data-driven synchronization between the processing elements uses a 2-phase self-timed handshaking scheme with REQUEST and ACKNOWLEDGE signals (Figure 5a), realized in a globally-asynchronous locally-synchronous fashion. This approach reduces power consumption by ensuring that a module is activated only when data is ready, and it also allows modules to operate at different and dynamically varying rates. Each module includes a network interface controller to coordinate communication and synchronization. Data links combine 16-bit fixed-width data words with 2-bit control tokens that tag the different data structures (scalar, vector, or matrix) supported by the network (Figure 5b).

Keeping the energy of the reconfigurable communication network as low as possible is crucial to the success of the approach. This is achieved through a combination of architecture and circuit optimizations. The network itself is implemented as a 2-level hierarchical mesh. Clusters of tightly connected modules are formed according to communication locality. Each cluster has a local mesh with 2 buses per channel and a universal switchbox at every intersection point (Figure 6a). Global interconnections are supported by a second-level, larger-granularity mesh (implemented on the higher metal layers) with 2 buses per channel and hierarchical switchboxes located at the key connection points. The hierarchical switchbox (Figure 6b) contains a universal switchbox for each mesh level, as well as a number of cross-level interconnect switches. This hierarchical network architecture requires only a limited number of buses to achieve sufficient connection flexibility for our target applications, and cuts the interconnect energy cost by a factor of 7 compared to a straightforward crossbar network implementation.

Communication energy is further reduced by a low-swing (0.4 V) pseudo-differential signaling scheme (Figure 7a). The capacitive loads are also reduced by simplifying the switch network with NMOS-only switches. The circuit uses a single wire for each data bit while retaining most advantages of differential signaling, such as high common-mode noise rejection, low input offset, and good sensitivity. It employs an NMOS-only push-pull driver with a very low supply voltage. The receiver is a clocked sense amplifier followed by a static flip-flop. It contains two pairs of input transistors: the gates of P1 and P3 are connected to d, while the gates of P4 and P2 are biased at GND and REF, respectively.

Figure 7b shows the signaling waveforms. Initially, A and B are discharged to GND, and n1 and n2 are equalized. The receiver is enabled by a negative pulse generated from the handshaking signals. If d is low, the current drive of P3 is the same as that of P4, while the current drive of P1 is larger than that of P2. Consequently, B and A are pulled high and low, respectively, by the cross-coupled inverter pair. The opposite transition is triggered if d is high. The static flip-flop that follows retains the data value even after the sense amplifier is reinitialized. The low-swing signaling reduces the interconnect energy by a factor of 3.4 compared to a full-swing CMOS implementation.

Results and Data Measurements

The overall chip characteristics are summarized in Table 1. Table 2 shows the performance of the individual chip components (based on a per-block analysis). The energy dissipation of the processor when programmed for a VCELP voice coder (with 1.8 mW total power consumption) is presented in Table 3, including a breakdown of the energy over the major functions. Dominant kernels are mapped directly onto hardware satellites, and their run-time reconfiguration is performed by the ARM core. The kernel energy presented in the table therefore incorporates contributions from both the satellites and the ARM8 configuration. The program-control part of the algorithm is mapped entirely to software. The total measured energy efficiency is a factor of 8 better than the best reported in the literature [4].

Acknowledgments

The research was funded by DARPA ACS and the California MICRO program. The support from Philips, Atmel, and Conexant is greatly appreciated. The authors also wish to thank SGS-Thomson for providing fabrication facilities for the integrated circuits.

References

[1] Arthur Abnous and Jan Rabaey, "Ultra-Low-Power Domain-Specific Multimedia Processors," IEEE VLSI Signal Processing Workshop, October 1996.
[2] Tom Burd et al., "A Dynamic Voltage Scaled Microprocessor System," submitted to ISSCC.
[3] Varghese George et al., "The Design of a Low-Energy FPGA," Proceedings of ISLPED99, Aug. 1999.
[4] Wai Lee et al., "A 1-V DSP for Wireless Communication," Digest of Technical Papers of ISSCC 97.

Table 1: Chip Characteristics

Technology                   0.25 µm 6-level metal CMOS
Main Supply Voltage          1 V
Additional Voltages          0.4 V, 1.5 V
Die Size                     5.2 mm x 6.7 mm
Transistor Count             1.2 million transistors
Average Cycle Speed          40 MHz
Average Power Dissipation    1.5-2 mW

Table 2: Performance of hardware modules

Hardware module         Pipeline speed (ns)   Energy per operation (pJ)   Area (mm²)
MAC
ALU
Memory (1K x 16)
Memory (512 x 16)
Address generator
Interconnect network    0                     *                           NA
FPGA                    25                    8**                         2.76

*  This number is the average energy consumption per connection.
** This number is the average energy consumption across various arithmetic functions.

Table 3: VCELP energy consumption breakdown among dominant kernels and program control

Functionality                          Energy consumption (mJ) for 1 sec of VCELP speech processing
Kernels
  Dot product
  FIR filter                           0.3
  IIR filter                           0.02
  Vector sum with scalar multiply
  Compute code                         0.0
  Covariance matrix compute
Program control
Total                                  1.787
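The power and energy figures in Tables 1 and 3 are related by unit arithmetic: energy in mJ spent over one second equals average power in mW, and power divided by clock rate gives energy per clock cycle. The following is a minimal sketch of that arithmetic. The function names are illustrative only, and the 1.8 mW and 40 MHz inputs are taken to be the VCELP operating point quoted in the text.

```python
def average_power_mw(energy_mj: float, seconds: float) -> float:
    """Average power in mW: 1 mJ spent over 1 s is 1 mW."""
    return energy_mj / seconds


def energy_per_cycle_pj(power_mw: float, clock_mhz: float) -> float:
    """Energy per clock cycle in pJ: 1 mW at 1 MHz is 1 nJ, i.e. 1000 pJ."""
    return power_mw / clock_mhz * 1000.0


if __name__ == "__main__":
    # Assumed operating point: 1.8 mW average power at a 40 MHz average rate.
    print(f"{energy_per_cycle_pj(1.8, 40.0):.1f} pJ per cycle")
```

Under these assumptions the whole chip spends on the order of 45 pJ per clock cycle. This is an average over active and idle modules, not a per-operation figure of the kind listed in Table 2.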

Figure 1: Heterogeneous Reconfigurable Processor Architecture. Blocks shown: microprocessor, configuration bus, reconfigurable interconnect, and satellite processors (configurable logic, embedded memory, address generator, arithmetic co-processors).

Figure 2: Mapping a computational kernel on an array of satellite processors. The kernel shown is:

    for (i = 1; i <= length; i++) {
      for (k = i; k <= length; k++) {
        phi[i][k] = phi[i-1][k-1] + in[np-i]*in[np-k] - in[na-1-i]*in[na-1-k];
      }
    }

Blocks shown: execution control, two address generators (AddrGen), two multipliers (MPY), and an add/subtract unit (+/-).
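As a software reference for the kernel in Figure 2, the sketch below reads the update as phi[i][k] = phi[i-1][k-1] + in[np-i]*in[np-k] - in[na-1-i]*in[na-1-k] and checks it against a direct windowed sum. Several things here are assumptions rather than statements from the paper: the subtraction between the two products, the `-1` offsets, the summation window n = np .. na-2 (chosen because it makes the recursion exact), the seeding of row 0, and all function names.

```python
def covariance_direct(x, i, k, np_, na):
    """Direct windowed sum. The window n = np_ .. na-2 is an assumption."""
    return sum(x[n - i] * x[n - k] for n in range(np_, na - 1))


def covariance_recursive(x, length, np_, na):
    """Fill phi[i][k] for 0 <= i <= k <= length with the Figure 2 update.

    Requires np_ >= length and len(x) >= na - 1 so no index goes negative.
    """
    phi = [[0] * (length + 1) for _ in range(length + 1)]
    # Row 0 has no diagonal predecessor, so it is seeded by the direct sum.
    for k in range(length + 1):
        phi[0][k] = covariance_direct(x, 0, k, np_, na)
    # Each update costs two multiplications plus one add and one subtract.
    for i in range(1, length + 1):
        for k in range(i, length + 1):
            phi[i][k] = (phi[i - 1][k - 1]
                         + x[np_ - i] * x[np_ - k]
                         - x[na - 1 - i] * x[na - 1 - k])
    return phi
```

Each inner-loop iteration needs two products and an add/subtract stage fed by indexed reads, which is the shape of the two-MPY, one +/- mapping the figure depicts.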

Figure 3: Floorplan of Prototype Processor. Blocks shown: four 1K memories (Mem1K), four 512-word memories (Mem512), eight address generators (AG), FPGA, two MACs, two ALUs, two I/O ports, interface, and ARM core, connected through universal and hierarchical switchboxes on the level-1 and level-2 meshes.

Figure 5: Data-driven globally-asynchronous locally-synchronous inter-processor communication.
(a) Globally asynchronous, locally synchronous signaling. Signals shown between the reconfigurable network and each processor module: In, Req in, Out, Req out, Done, Clk Enable, and a delayed local Clk.
(b) Control tokens differentiate and delineate data streams and data structures (scalar, vector, matrix). Legend: regular data; data associated with an end-of-vector token.
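The role of the control tokens in Figure 5b can be illustrated with a small behavioral model: each data word carries a tag, and a MAC satellite accumulates products until it sees the word tagged as the end of a vector, then emits one scalar. This is only a sketch. The tag values, the `Word` and `Token` types, and both functions are hypothetical, since the paper does not give the 2-bit encoding.

```python
from enum import IntEnum
from typing import Iterable, Iterator, NamedTuple


class Token(IntEnum):
    """Hypothetical 2-bit tag values; the source does not give the encoding."""
    REGULAR = 0
    END_OF_VECTOR = 1
    END_OF_MATRIX = 2  # listed for completeness, unused in this sketch


class Word(NamedTuple):
    value: int
    token: Token


def vector_stream(values: Iterable[int]) -> Iterator[Word]:
    """Tag every word as regular data except the last, which ends the vector."""
    items = list(values)
    for idx, v in enumerate(items):
        last = idx == len(items) - 1
        yield Word(v, Token.END_OF_VECTOR if last else Token.REGULAR)


def mac_satellite(a: Iterable[Word], b: Iterable[Word]) -> Iterator[Word]:
    """Multiply-accumulate two streams; emit one scalar per completed vector."""
    acc = 0
    for wa, wb in zip(a, b):
        acc += wa.value * wb.value
        if wa.token == Token.END_OF_VECTOR:
            yield Word(acc, Token.REGULAR)
            acc = 0
```

For example, `list(mac_satellite(vector_stream([1, 2, 3]), vector_stream([4, 5, 6])))` yields a single word with value 32. The satellite needs no loop counter of its own: the vector length travels with the data, which is the point of tagging the stream.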

Figure 4: Heterogeneous Reconfigurable Processor Chip Microphotograph. Labeled blocks: eight AGUs, FPGA, two MACs, two ALUs, interconnect network, interface, and ARM8 core.
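The address generators (the AGU blocks in the microphotograph) are described as supporting nested loops with loop counters and stride counters. A minimal sketch of such an address pattern is given below. The `(count, stride)` parameterization and the function name are assumptions made for illustration; the paper does not describe the actual instruction format of the address generator.

```python
from typing import Iterator, Sequence, Tuple


def address_stream(base: int, loops: Sequence[Tuple[int, int]]) -> Iterator[int]:
    """Yield addresses for nested loops given (count, stride) pairs.

    The first pair is the outermost loop. Strides may be negative.
    """
    if not loops:
        yield base
        return
    count, stride = loops[0]
    for step in range(count):
        # Each outer step offsets the base seen by the inner loops.
        yield from address_stream(base + step * stride, loops[1:])
```

For instance, `list(address_stream(0, [(3, 1), (2, 3)]))` walks a 2 x 3 row-major array column by column and gives `[0, 3, 1, 4, 2, 5]`. A negative stride produces descending streams of the kind a kernel needs when it reads `in[np-k]` for increasing `k`.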

Figure 6: Hierarchical Mesh Network and Switch Matrices. (a) Level-1 mesh with universal switchboxes, connecting the modules within each cluster. (b) Level-2 mesh with hierarchical switchboxes (only cross-mesh connections are shown).

Figure 7: Pseudo-differential low-swing interconnect circuitry. (a) Circuit diagram. Labeled devices and nodes: P1 to P7, N1 to N4, internal nodes n1 and n2, outputs A and B, and inputs d, REF, and clk. (b) Circuit waveforms for clk, in, d (0.4 V swing), A, B, and out.
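The receiver enable pulse in Figure 7 is generated from the handshaking signals of the 2-phase self-timed protocol. As a behavioral sketch of that protocol (not of the circuit), the model below assumes the usual transition-signaling reading of "2-phase": every edge on REQUEST announces a word, every edge on ACKNOWLEDGE frees the link, and a word is pending whenever the two levels differ. The class and method names are hypothetical.

```python
class TwoPhaseLink:
    """Transition-signaled link: a word is pending whenever req != ack."""

    def __init__(self) -> None:
        self.req = 0
        self.ack = 0
        self.data = None
        self.transitions = 0  # total REQUEST plus ACKNOWLEDGE edges

    def can_send(self) -> bool:
        return self.req == self.ack

    def send(self, word: int) -> None:
        if not self.can_send():
            raise RuntimeError("previous word not yet acknowledged")
        self.data = word
        self.req ^= 1  # one REQUEST edge announces the word
        self.transitions += 1

    def can_receive(self) -> bool:
        return self.req != self.ack

    def receive(self) -> int:
        if not self.can_receive():
            raise RuntimeError("no word pending")
        word = self.data
        self.ack ^= 1  # one ACKNOWLEDGE edge frees the link
        self.transitions += 1
        return word
```

In this model each transfer costs exactly two signal edges, with no return-to-zero phase, and a receiving module has work to do only while `can_receive()` is true, mirroring the statement that a module is activated only when data is ready.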
