Flexible wireless communication architectures


1 Flexible wireless communication architectures Sridhar Rajagopal Department of Electrical and Computer Engineering Rice University, Houston TX Faculty Candidate Seminar Southern Methodist University April 23, 2003 RICE UNIVERSITY This work has been supported in part by NSF, Nokia and Texas Instruments

2 Future wireless devices demand flexibility. Wireless cellular, wireless LAN, Bluetooth/home networks: multiple algorithms and environments supported in the same device. High-data-rate mobile devices with multimedia. Flexible algorithms: multiple antennas, complex signal processing. Flexible architectures: high performance (Mbps), low power (mW). Fast design with structured exploration.

3 Flexibility needed in different layers. [Figure labels: application layer (Puppeteer project at Rice); network layer; MAC layer; physical layer (flexible algorithms, mapping, flexible architectures); analog RF]

4 Research vision: Attain flexibility. Design me. Algorithms: flexibility to support a variety of sophisticated algorithms. Architectures: flexibility that adapts hardware to algorithms. Fast, structured design exploration.

5 Contributions: Algorithms. Multi-user channel estimation [Jnl. of VLSI Sig. Proc. '02, ASAP '00]: matrix inversions; numerical techniques (conjugate-gradient descent) for complexity reduction. Multi-user detection [ISCAS '01]: block-based computation to streaming computations; pipelining, lower memory requirements. Parallel, fixed-point, streaming VLSI implementations [IEEE Trans. Wireless Comm. '02].

6 Contributions: Architectures. Heterogeneous DSP-FPGA system designs [ICSPAT '00]. Computer arithmetic [Symp. on Comp. Arith. '01]: dynamic truncation in ASICs using on-line arithmetic with Most Significant Digit First computation. [Ph.D. Thesis] Scalable Wireless Application-specific Processors (SWAPs): rapid, structured architectures with flexibility-performance tradeoffs.

7 Scalable Wireless Application-specific Processors. Family of flexible programmable processors built from clusters of ALUs. High performance by supporting 100s of ALUs. Can provide customization for various algorithms. Adapts ("swaps") architecture dynamically for power. [Figure labels: scale ALUs; scale clusters]

8 Rapid, structured design for SWAPs. [Diagram labels: low-complexity, parallel, fixed-point algorithms; ASIC design; DSP design; apply; architecture exploration; SWAPs]

9 Research vision summary. Provide a structured framework to rapidly explore flexible, high-performance, low-power architectures (SWAPs). Efficient algorithm design for mapping to SWAPs. Understanding of the algorithms, DSPs and ASICs used. Flexibility-performance trade-offs. Interdisciplinary research: wireless communications, VLSI signal processing, computer architecture, computer arithmetic, circuits, CAD, compilers.

10 Talk Outline. Research vision. SWAPs background. Algorithm design for SWAPs. Architecture design for SWAPs. Current and future research goals.

11 SWAPs borrow from DSPs. DSPs use instruction-level parallelism (ILP) and subword parallelism (MMX). Not enough ALUs for GOPs of computation: 100s are needed, and the TI C6x has 8 ALUs. Why not more ALUs? Cannot support more registers (area, ports), and it is difficult to find ILP as ALUs increase. [Figure labels: 1 ALU; RF (register file)]

12 SWAPs borrow from ASICs. Exploit data parallelism (DP), available in many wireless algorithms. This is what ASICs do!
    int i, a[n], b[n], sum[n];        // 32 bits
    short int c[n], d[n], diff[n];    // 16 bits, packed
    for (i = 0; i < 1024; i++) {
        sum[i] = a[i] + b[i];
        diff[i] = c[i] - d[i];
    }
[Figure annotations: Subword, ILP, DP]
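The loop on this slide can be written as a small self-contained C function. This is only a sketch: the name `sum_and_diff`, the explicit length parameter and the fixed-width types are assumptions made for illustration and do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (name and signature are illustrative only).
 * Every iteration is independent of the others, which is the data
 * parallelism (DP) the slide refers to. The add and the subtract inside
 * one iteration do not depend on each other (ILP), and the 16-bit
 * operands could be packed two per 32-bit word (subword parallelism). */
void sum_and_diff(size_t n,
                  const int32_t *a, const int32_t *b, int32_t *sum,
                  const int16_t *c, const int16_t *d, int16_t *diff)
{
    assert(a && b && sum && c && d && diff);
    for (size_t i = 0; i < n; i++) {
        sum[i]  = a[i] + b[i];
        diff[i] = (int16_t)(c[i] - d[i]);
    }
}
```

On a clustered machine each cluster would take a slice of the index range; the sequential loop here only fixes the arithmetic that every slice performs.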

13 SWAPs borrow from stream processors. Kernels (computation) and streams (communication). Use local data in clusters, providing GOPs support. Imagine stream processor at Stanford [Rixner '01]. [Figure labels: input data; kernel; stream; output data; received signal; matched filter; interference cancellation; Viterbi decoding; decoded bits; correlator; channel estimation] Reference: Scott Rixner, Stream Processor Architecture, Kluwer Academic Publishers, Boston, MA, 2001.

14 SWAPs are multi-cluster DSPs. SWAPs adapt clusters to DP: identical clusters, same operations. Power down unused FUs and clusters. [Figure labels: DSP (1 cluster), internal memory, ILP; SWAP memory: Stream Register File (SRF), ILP, DP]

15 Arithmetic clusters in SWAPs. [Figure labels: from/to SRF; distributed register files (support more ALUs); SRF; cross point; scratchpad (indexed accesses); intercluster network; comm. unit]

16 Talk Outline. Research vision. SWAPs background. Algorithm design for SWAPs. Architecture design for SWAPs. Current and future research goals.

17 SWAPs: Physical layer algorithms. Complex signal processing algorithms with GOPs of computation. [Figure labels: antenna; RF front-end; baseband processing (detection, channel estimation, decoding); higher (MAC/network/OS) layers]

18 SWAP mapping example: Viterbi decoding. Multiple-antenna (MIMO) systems: complexity is exponential in transmit x receive antennas. Estimation: linear MMSE, blind, conjugate gradient. Detection: FFT, (blind) interference cancellation. Decoding: Viterbi, Turbo, LDPC. Also joint schemes. SWAP flexibility lets you use the best algorithms for the situation. Example for concept demonstration: Viterbi decoding.

19 Parallel Viterbi Decoding for SWAPs. [Figure labels: detected bits; ACS unit; traceback unit; decoded bits] Add-Compare-Select (ACS): trellis interconnect and computations; parallelism depends on constraint length (#states). Traceback: searching; the conventional approach is sequential (no DP) with dynamic branching and is difficult to implement in a parallel architecture. Use Register Exchange (RE) as the parallel solution.
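One trellis step of ACS combined with register exchange might look like the following C sketch. The state convention (next state = ((s << 1) | u) mod S), the branch-metric layout, and the name `acs_re_step` are assumptions chosen for illustration; the slides do not specify them, and overflow of the path metrics is not handled here.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_STATES 256  /* enough for constraint length K = 9 */

/* One Add-Compare-Select step with Register Exchange (illustrative).
 * Assumed convention: next state = ((s << 1) | u) mod num_states.
 * pm_in / pm_out : path metric per state, smaller is better
 * re_in / re_out : survivor registers, newest decoded bit in bit 0
 * bm             : branch metrics, bm[2*p + u] for predecessor p, input u */
void acs_re_step(unsigned num_states,
                 const uint32_t *pm_in, uint32_t *pm_out,
                 const uint64_t *re_in, uint64_t *re_out,
                 const uint32_t *bm)
{
    assert(num_states >= 2 && num_states <= MAX_STATES);
    assert((num_states & (num_states - 1)) == 0);  /* power of two */

    /* Each iteration writes only its own outputs, so the loop can be
     * spread across clusters (data parallelism). */
    for (unsigned ns = 0; ns < num_states; ns++) {
        unsigned u  = ns & 1u;                        /* input bit entering ns */
        unsigned p0 = ns >> 1;                        /* predecessor, top bit 0 */
        unsigned p1 = (ns >> 1) | (num_states >> 1);  /* predecessor, top bit 1 */

        uint32_t m0 = pm_in[p0] + bm[2 * p0 + u];     /* add */
        uint32_t m1 = pm_in[p1] + bm[2 * p1 + u];
        unsigned best = (m1 < m0) ? p1 : p0;          /* compare, select */

        pm_out[ns] = (m1 < m0) ? m1 : m0;
        /* Register exchange: copy the winner's history and append u,
         * so no sequential traceback is needed afterwards. */
        re_out[ns] = (re_in[best] << 1) | u;
    }
}
```

After the last step, the survivor register of the state with the smallest path metric holds the decoded bits directly, which is what removes the sequential traceback with dynamic branching.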

20 Parallel Viterbi needs re-ordering for SWAPs. [Figure: regular ACS keeps the DP vector in natural order X(0), X(1), ..., X(15); ACS in SWAPs uses the re-ordered vector X(0), X(2), ..., X(14), X(1), X(3), ..., X(15)] Exploiting Viterbi DP in SWAPs: use RE instead of regular traceback; re-order ACS and RE.
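The re-ordered vector in the figure lists the even-indexed elements first and the odd-indexed elements second. A minimal C sketch of that permutation follows; the helper name `even_odd_reorder` is hypothetical, and how the actual SWAP implementation performs the shuffle is not stated in the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write the even-indexed elements of `in` to the
 * first half of `out` and the odd-indexed elements to the second half.
 * `in` and `out` must not overlap, and n must be even. */
void even_odd_reorder(size_t n, const uint32_t *in, uint32_t *out)
{
    assert(in != NULL && out != NULL);
    assert(n % 2 == 0);
    size_t half = n / 2;
    for (size_t j = 0; j < half; j++) {
        out[j]        = in[2 * j];      /* X(0), X(2), X(4), ... */
        out[half + j] = in[2 * j + 1];  /* X(1), X(3), X(5), ... */
    }
}
```

With this layout, elements 2j and 2j+1 sit at the same offset j in the two halves, so a pair of neighbouring trellis states can be consumed by the same lane without irregular indexing.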

21 Talk Outline. Research vision. SWAPs background. Algorithm design for SWAPs. Architecture design for SWAPs. Current and future research goals.

22 SWAP architecture design. More clusters are better than more ALUs per cluster (if #clusters > 2). 1. Decide how many clusters: exploit DP. 2. Decide what to put within each cluster: maximize ILP with high functional unit efficiency. Search the design space with the explore tool. Time-power-area characterization.

23 Design a SWAP cluster: Explore. Auto-exploration of adders and multipliers for ACS. [Figure: instruction count plotted against #adders and #multipliers; each point is labelled (adder util%, multiplier util%)]

24 Explore tool benefits. Instruction count vs. ALU efficiency: what goes inside each cluster. Design customized application-specific units: better performance with increased ALU utilization. Explore multiple algorithms: turn off functional units not in use for a given kernel, using Vdd-gating and clock-gating techniques.

25 Example for SWAP architecture design. Explore Algorithm 1: 3 adders, 3 multipliers, 32 clusters. Explore Algorithm 2: 4 adders, 1 multiplier, 64 clusters. Explore Algorithm 3: 2 adders, 2 multipliers, 64 clusters. Explore Algorithm 4: 2 adders, 2 multipliers, 16 clusters. Chosen architecture: 4 adders, 3 multipliers, 64 clusters.
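The chosen architecture on this slide equals the element-wise maximum of the four per-algorithm requirements. The C sketch below encodes that rule; treating it as the selection rule is an inference from the numbers, and the type `swap_config` and function `choose_architecture` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Per-algorithm requirement as reported by an exploration step. */
typedef struct {
    unsigned adders;       /* adders per cluster */
    unsigned multipliers;  /* multipliers per cluster */
    unsigned clusters;     /* number of clusters */
} swap_config;

static unsigned max_u(unsigned x, unsigned y) { return x > y ? x : y; }

/* Assumed rule: provision the element-wise maximum so that every
 * algorithm fits; units an algorithm does not need can be turned off. */
swap_config choose_architecture(const swap_config *req, size_t count)
{
    assert(req != NULL && count > 0);
    swap_config chosen = req[0];
    for (size_t i = 1; i < count; i++) {
        chosen.adders      = max_u(chosen.adders, req[i].adders);
        chosen.multipliers = max_u(chosen.multipliers, req[i].multipliers);
        chosen.clusters    = max_u(chosen.clusters, req[i].clusters);
    }
    return chosen;
}
```

Fed the four requirements listed on the slide, this returns 4 adders, 3 multipliers and 64 clusters, matching the stated choice.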

26 SWAP flexibility provides power savings. Multiple algorithms have different ALU and cluster requirements. Turning off ALUs (add mul compiler options): use the right #ALUs from the explore tool. Turning off clusters: data is spread across the SRF of all clusters, a cluster only has access to its own SRF, and the next kernel may need data from the SRF of other clusters, so reconfiguration support needs to be provided.

27 SWAPs provide cluster reconfiguration. [Figure labels: SRF; latches; MDX1; MDX2; mux-demux network with stream buffers; clusters] Additional latency (a few cycles) due to microcontroller stalls; minimal loss in performance.

28 Cluster reconfiguration for Viterbi. Packet 1: constraint length 7 (16 clusters). Packet 2: constraint length 9 (64 clusters). Packet 3: constraint length 5 (4 clusters). [Figure annotations: DP; can be turned OFF]
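The three packets on this slide are consistent with one cluster per four trellis states, where a constraint length K gives 2^(K-1) states. The C sketch below encodes that reading; the ratio of four states per cluster is inferred from the slide's numbers rather than stated, and `active_clusters` is a hypothetical name.

```c
#include <assert.h>

/* Assumption inferred from the slide: K = 5, 7, 9 map to 4, 16, 64
 * clusters, i.e. four trellis states per cluster. */
#define STATES_PER_CLUSTER 4u

/* Number of clusters to keep powered for constraint length k,
 * clamped to the range [1, max_clusters]. */
unsigned active_clusters(unsigned k, unsigned max_clusters)
{
    assert(k >= 2 && k <= 16);
    assert(max_clusters >= 1);

    unsigned states = 1u << (k - 1);            /* 2^(K-1) trellis states */
    unsigned wanted = states / STATES_PER_CLUSTER;

    if (wanted < 1) wanted = 1;
    if (wanted > max_clusters) wanted = max_clusters;
    return wanted;
}
```

Clusters beyond the returned count are the ones that could be turned off for that packet.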

29 SWAPs provide flexibility at negligible overhead. [Figure: execution time (cycles) of kernels (computation) and memory accesses for rate ½ Viterbi decoding with Packet 1 (K = 7), Packet 2 (K = 9) and Packet 3 (K = 5). Other labels: clusters; memory; 64-bit; no data memory accesses]

30 SWAP exploration for Viterbi decoding. [Figure: frequency needed to attain real-time (in MHz) versus number of clusters for K = 9, K = 7 and K = 5. Labels: DSP; different SWAPs (without reconfiguration); same SWAP (with reconfiguration); max DP] Ideal C64x (w/o co-proc) needs ~200 MHz for real-time.

31 SWAPs: Salient features. 1-2 orders of magnitude better than a DSP. Any constraint length: 10 MHz at 128 Kbps. Same code for all constraint lengths: no need to re-compile or load another code as long as the parallelism/cluster ratio is constant. Power savings due to dynamic cluster scaling.

32 Expected SWAP power consumption. Power model based on [Khailany '03]. 64 clusters and 1 multiplier per cluster: 0.13 micron, 1.2 V. Peak active power: ~9 mW at 1 MHz (DSP ~1 mW). Area: ~53.7 mm^2. 10 MHz, 128 Kbps with reconfiguration:
    Viterbi       Clusters used   Peak power
    K = 9         64              ~90 mW
    K = 7         16              ~28.57 mW
    K = 5         4               ~13.8 mW
    overhead      0               ~8.1 mW
    DSP, K = 9    1               ~200 mW
[Figure: power (in mW) versus active clusters (max 64)]
Reference: Brucek Khailany et al., "Exploring the VLSI Scalability of Stream Processors", Proceedings of the Ninth Symposium on High Performance Computer Architecture, February 8-12, 2003.

33 Multiuser Estimation-Detection-Decoding. Real-time target: 128 Kbps per user. [Figure: frequency needed to attain real-time (in MHz) versus number of clusters for a 32-user base-station. Labels: DSP; fast, medium and slow mobile fading scenarios] Ideal C64x (w/o co-proc) needs ~15 GHz for real-time.

34 Expected SWAP power: base-station. 32-user base-station with 3 Xs (multipliers) per cluster and 64 clusters: 0.13 micron, 1.2 V. Peak active power: ~18.19 mW for 1 MHz (increased Xs). Area: ~93.4 mm^2. Total peak base-station power consumption: ~18.19 W at 1 GHz for 32 users at 128 Kbps/user.
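The two power figures on this slide are related by a simple frequency scaling. As a worked check, assuming peak active power grows linearly with clock frequency (an assumption implied by the slide's per-MHz figure, not stated as a model), and using the illustrative symbol $p_{\text{MHz}}$ for the per-MHz power:

```latex
P_{\text{peak}} \approx p_{\text{MHz}} \cdot f
             = 18.19\ \tfrac{\text{mW}}{\text{MHz}} \times 1000\ \text{MHz}
             = 18190\ \text{mW} \approx 18.19\ \text{W}
```

The same scaling applied to the Viterbi case in this deck (~9 mW at 1 MHz, run at 10 MHz) gives ~90 mW.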

35 Talk Outline. Research vision. SWAPs background. Algorithm design for SWAPs. Architecture design for SWAPs. Current and future research goals.

36 Current research: Flexibility vs. performance. SWAPs: 128 Kbps at ~ mW for Viterbi; borrow DP from ASICs! Suitable for base-stations: flexibility more important than power. Suitable for mobile devices: power constraints are tighter; can be customized for further power savings. Handset SWAPs (H-SWAPs): borrow task pipelining from ASICs! Application-specific units and a specialized comm. network.

37 Handset SWAPs: H-SWAPs. Trade data parallelism for task pipelining. [Figure labels: SWAPs (max. clusters and reconfigure), SRF, DP; SWAPlet (limit clusters), limited DP; H-SWAPs (collection of customized SWAPlets), each with limited DP]

38 Sample points in architecture exploration. Programmable solutions with increased customization. DSPs (1 cluster): ILP, subword. SWAPs (multiple clusters): ILP, subword, DP. H-SWAPs (optimized for handsets): ILP, subword, DP, task pipelining, custom ALUs. Performance and power benefits (with decreasing flexibility).

39 Future: Efficient algorithms and mapping. Multiple antenna systems with 1-2 orders-of-magnitude higher complexity. [Figure labels: channel estimator; non-coherent STC; coherent STC; channel equalizer; MRC; detector; demodulator; decoder; beamforming; multipath channel; turbo equalizer]

40 Future research: Architectures. Generalized and structured framework and tools: joint algorithm-architecture exploration; area-time-power-flexibility tradeoffs. Potential applications: embedded systems. Image and video processing: cameras, variety of compression algorithms. Biomedical applications: hearing aids, DSP running on body heat. Sensor networks: compression of data before transmission. Quote: Gene Frantz, TI Fellow.

41 SWAPs: Flexibility, Performance, Power. Future wireless devices need flexibility in both algorithms and architectures. Rapid exploration for Scalable Wireless Application-specific Processors: a structured approach with flexibility-performance trade-offs. SWAPs offer flexibility, high performance and low power: they exploit data parallelism like ASICs, deliver 1-2 orders of magnitude better performance than DSPs, and turn off unused clusters and unused ALUs for low power.


High performance, power-efficient DSPs based on the TI C64x Sridhar Rajagopal, Joseph R. Cavallaro, Scott Rixner Rice University {sridhar,cavallar,rixner}@rice.edu RICE UNIVERSITY Recent (2003) Research