Coarse Grain Reconfigurable Arrays are Signal Processing Engines!

Advanced Topics in Telecommunications, Algorithms and Implementation Platforms for Wireless Communications, TLT-9707
Waqar Hussain, Researcher
waqar.hussain@tut.fi
Tampere University of Technology, Finland

Electronic Products
- Multifunction devices are becoming popular, beyond their reliability and durability
- Example: the mobile phone
  - The key selling features of a cell phone are size, weight, long battery life, audio/video streaming and the ability to run several games
  - Adaptability to many communication standards
  - Expectations of real-time performance
- No limits to human desire
Embedded Technology
- Embedded technology empowers a mobile phone to carry all these features.
- An embedded system is intended for a specific use and consists of hardware capable of performing a set of different tasks with the help of software.
- Example: Embedded System = RISC + Accelerator(s)

Why Coarse Grain Reconfigurable Arrays?
Answer: Computationally Intensive Kernels (CIKs) need to be accelerated in a signal processing system.
Examples of CIKs:
1. FIR Filtering
2. Encoding and Decoding
   a) Viterbi
   b) Reed-Solomon
3. Matrix-Vector Multiplication
4. Fast Fourier Transform
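To see why a kernel such as FIR filtering counts as computationally intensive, it helps to write it out. The following is a minimal sketch in plain Python, not code from the course; the function name `fir_filter` and its arguments are chosen here for illustration. Every output sample costs one multiply-accumulate per filter tap, and it is this inner loop that an accelerator would take over from the RISC core.

```python
def fir_filter(samples, taps):
    """Direct-form FIR filter: y[n] = sum over k of taps[k] * samples[n - k].

    Samples before the start of the input are treated as zero.
    """
    out = []
    for n in range(len(samples)):
        acc = 0
        # One multiply-accumulate per tap: the computationally intensive part.
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * samples[n - k]
        out.append(acc)
    return out
```

For N input samples and T taps this performs roughly N x T multiply-accumulate operations, all with the same regular structure, which is what makes the kernel a good candidate for acceleration.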
Why Coarse Grain Reconfigurable Arrays?
Question: So why a CGRA, why not traditional accelerators?
- It is more desirable to use devices that can accelerate multiple kernels than traditional accelerators that were designed to accelerate only a single kernel.
- Thanks to reconfigurability!

Why are CGRAs Powerful Engines?
Answer: Due to their structure!
- CGRAs offer high parallelism and throughput due to their array-based structure.
- Algorithms containing parallelism are the most suitable to be mapped onto a CGRA.
- A CGRA can process large streams of data.
- The structural unit of a CGRA is an ALU, called a Processing Element (PE).
- Each PE is connected to other PEs using point-to-point interconnections or a Network on Chip (NoC).
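The idea of an array of PEs that is reconfigured to run different kernels can be illustrated with a toy software model. This is only a sketch under strong assumptions: the operation set `OPS`, the function `run_array` and the "context" format are hypothetical names introduced here, and they do not describe how any particular CGRA is actually programmed. Each row of the context holds one entry per PE, and the source indices stand in for point-to-point wiring from the previous row.

```python
import operator

# Hypothetical operation set of one PE (names are illustrative only).
OPS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
}

def run_array(context, inputs):
    """Push one set of inputs through the rows of a toy PE array.

    context: list of rows; each row is a list of (op, src_a, src_b),
    one tuple per PE. src_a and src_b index the outputs of the previous
    row (or the array inputs for the first row).
    """
    values = list(inputs)
    for row in context:
        # In hardware, all PEs of a row would operate in parallel.
        values = [OPS[op](values[a], values[b]) for op, a, b in row]
    return values

# Configuration 1: 2-element dot product a0*b0 + a1*b1.
DOT_CONTEXT = [
    [("mul", 0, 1), ("mul", 2, 3)],
    [("add", 0, 1)],
]

# Configuration 2: same array shape, different kernel: (a0-b0)*(a1-b1).
DIFF_PRODUCT_CONTEXT = [
    [("sub", 0, 1), ("sub", 2, 3)],
    [("mul", 0, 1)],
]

def process_stream(context, stream):
    """Apply one configuration to a stream of input sets."""
    return [run_array(context, item) for item in stream]
```

Swapping `DOT_CONTEXT` for `DIFF_PRODUCT_CONTEXT` changes the kernel without changing the array itself, which is the point of reconfigurability; `process_stream` mimics feeding a long data stream through a fixed configuration.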
CGRA in an Embedded System
- An example of an embedded system is RISC + Accelerator(s)
  - RISC = COFFEE
  - Accelerator = BUTTER
- Both COFFEE and BUTTER were designed at the Department of Computer Systems, Tampere University of Technology, Finland

BUTTER
- A general-purpose Coarse Grain Reconfigurable Array (CGRA), which is a matrix of processing elements (PEs).
- Each PE is capable of performing a set of different tasks and is connected to the others using point-to-point interconnections.
- BUTTER was capable of processing many computationally intensive kernels.

Problems with BUTTER!
- BUTTER's presence in the system is expensive if it is not used most of the time
- BUTTER occupies a large number of hardware resources
- A general-purpose CGRA requires a few million gates of FPGA
Solution: CREMA
- A parameterized general-purpose CGRA to generate special-purpose accelerators.

Category of Interconnections
Processing Elements in CREMA
- Two operand registers
- Decoder for operation selection
- Supports integer and floating-point operations
- Blocks with a dashed border are scalable and selectable for instantiation
- LUT for logical operations
Figure: Processing Element Template

CREMA-based System
- COFFEE for general-purpose processing
- CREMA-generated accelerator for CIKs
- Network of switched interconnections for faster data transfer between modules
CGRAs to be made Scalable

Scalability in Software
- Fixed hardware can be used to process an algorithm of variable length
- For example, a single FFT butterfly can be used to process 4, 8, 16, 64, 128, 256 and higher points of FFT
- In this case, the hardware (the FFT butterfly) is fixed, but we can scale the software as required to process different lengths of FFTs
- Another example is matrix-vector multiplication: the arithmetic resources required by a 4th-order matrix-vector multiplication can be used to process higher-order matrix-vector multiplications.
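The FFT example above can be sketched in software. The code below is an illustration only, assuming a radix-2 decimation-in-time FFT (the slides do not say which radix the single butterfly uses); the names `butterfly`, `bit_reverse_permute` and `fft` are chosen here. The function `butterfly` plays the role of the fixed hardware resource, and only the surrounding loops, the "software", change with the transform length.

```python
import cmath

def butterfly(a, b, w):
    """The one fixed resource: a radix-2 butterfly with twiddle factor w."""
    t = w * b
    return a + t, a - t

def bit_reverse_permute(x):
    """Reorder the input so the in-place stages produce natural-order output."""
    n = len(x)
    bits = n.bit_length() - 1
    out = [0j] * n
    for i in range(n):
        rev = int(format(i, f"0{bits}b")[::-1], 2) if bits else 0
        out[rev] = complex(x[i])
    return out

def fft(x):
    """Radix-2 FFT of any power-of-two length, built only from butterfly()."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    data = bit_reverse_permute(x)
    size = 2
    while size <= n:
        half = size // 2
        for start in range(0, n, size):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / size)
                i, j = start + k, start + k + half
                # The same butterfly is reused for every pair, at every stage.
                data[i], data[j] = butterfly(data[i], data[j], w)
        size *= 2
    return data
```

Calling `fft` on 4, 8, 16 or 64 points uses the identical `butterfly`; a longer transform just invokes it more often, which is the sense in which the software scales while the hardware stays fixed.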
Scalability in CGRA

Why Scale the Hardware?
- An example: Wireless LAN
How to Scale the Hardware?
- The resources required by a set of applications can give an idea of how to scale the hardware
- In short, the nature of the applications has to drive the dimensioning of the hardware
- For a small set of applications this might be easy, but for a large set of applications it might be difficult
- A method needs to be defined?

Case Study
Applications driving dimensioning:
- Matrix-Vector Multiplication
- Radix-4 FFT Processing
Target platform under dimensioning:
- CREMA, a Coarse-Grain Reconfigurable Array consisting of 4x8 processing elements
Scaling order:
1. Matrix-Vector Multiplication: from a 4x8 to a 6x8 and a 4x16 PE CGRA
2. Radix-4 FFT Processing: from a 4x8 to a 9x8 and a 4x16 PE CGRA
Scaling influence on design strategies:
- Rapid prototyping and system integration
- Global optimum implementation for area and speed
Applications Mapped on CREMA and BUTTER
- Integer and floating-point matrix-vector multiplication
  - Execution time compared with RISC and DSP
- 2D low-pass image filtering based on an averaging window
- FFT
  - Satisfied execution time constraints for SISO and MIMO OFDM applications
  - Resource utilization and execution time were compared with the state of the art
- W-CDMA cell search
  - Execution time compared with a RISC core
- In all of the above applications, CREMA, as a template-based device, required fewer resources for its generated accelerator than BUTTER

Thank You
Questions?