Memory Intensive Architectures for DSP and Data Communication Pronita Mehrotra, Paul Franzon

1 Memory Intensive Architectures for DSP and Data Communication Pronita Mehrotra, Paul Franzon Department of Electrical and Computer Engineering North Carolina State University

2 Outline
- Objectives
- Approach
- Signal Integrity and Routability
- Algorithms and DRAM Architecture
- Memory Mapping Scheme
- Twiddle Factor Generation Scheme
- Analysis of FFT Architecture and Performance
- Forwarding Schemes for Router
- Routing Scheme based on Compaction
- Binary Search Based Routing Scheme
- SHOCC High Density Packaging technology increases potential performance of a 1GB DSP system by a factor of about 20
- Lower Memory Requirements for the Router

3 Motivation & Approach
- Radar processor for future UAVs
  - Large problem size (1 GB, 1 M point FFTs)
  - High-performance/Low-volume
- Leverage High Density Packaging
  - Utilize SHOCC (Seamless High Off Chip Connectivity)
  - Allows 128 parallel 16-bit wide memory channels
  - Number of channels limited by signal integrity and routability
- Designed 2,048-bit, 250 MHz, memory bus
- Determine architecture that maximizes the potential of this memory bandwidth

4 Physical Design
- High Density Substrate (8cm x 8cm)
- Edge-mounted commercial DDR DRAMs
  - approx. 2 mm pitch ( => ~ 120 pins => up to x36 memory)
  - 2 sets of 64x64 Mbit
  - organized as multiple independent banks
  - Mbps per pin
  - Better availability than RAMBUS + more certain SI issues
- µm solder bump pitch (today)
- SHOCC-mounted identical Accelerator ICs, bare die (64 Multiplier-Accumulators) (approx. 1 sq.cm.)
  - interconnected by 2GHz, 128-bit bus

5 Substrate Stack-up
[Figure: cross-section of the stack on the Si substrate: BCB dielectric layers (four of 5µ, one of 10µ), signal layers S1, S2 (local ground) and S3 (2µ each), and Gnd/Pow planes (2µ)]
- S2 acting as the local ground reduces the coupling between S1 and S3
- Maxwell Q-3D (Ansoft) parameter extractor used to determine R, L, C

6 Routing Approach
- 2-Stage breakout routing approach:
  - 13 µm Breakout Pitch (2 layers)
  - 26 µm Intermediate pitch (1 layer)
  - 36 µm final routing
- Parallel Routing
- X-Y routing
- Pitch decided by crosstalk limitations
[Figure labels: S1, Gnd, S2]

7 SI Issues for High Density Wiring
- 0.25 µm CMOS Technology: DC NM = 1.04V
  - Our design uses an upper limit of 0.7V
- Noise Sources:
  - Crosstalk, especially in the breakout region
  - SSN
  - Reflection Noise, a potential issue for long, wide memory wiring

8 Equivalent Circuit (SHOCC Line)
[Figure: signal path from driver (Dr.) through R_oc, L_oc, C_oc and R_bump, L_bump, C_bump into the SHOCC line model, then through R_bump, L_bump, C_bump and R_oc, L_oc, C_oc to the receiver (Rec.)]
- Input Signal: 2ns pulse with a rise time of 80ps
- Driver: 5 stage driver with a stage ratio of 3

9 SHOCC line model (Crosstalk)
[Figure: coupled ladder model of neighboring top and bottom signal lines. Each line is split into n segments with series R/n and L/n and shunt C/2n; the lines are coupled through C_mtt, L_mtt and C_mtb, L_mtb]

10 Crosstalk Noise in Different Regions
[Figure: three plots of Crosstalk Noise (mV) versus Length (cm), each with curves for bottom (S3) and top (S1)]
Crosstalk Noise for:
(a) 13µ initial breakout pitch
(b) 26µ XY routing
(c) 36µ routing (under DRAMs)
Trace Width in all cases = 10µ

11 Reflection Noise
[Figure: Reflection Noise (mV) versus Length (cm) for 36µ routing, with curves for bottom (S3) and top (S1)]
- Reflection Noise constitutes a fairly small percentage of the total noise

12 Delays in Different SHOCC regions
[Figure: three plots of Delay (ns) versus Length (cm), each with curves for bottom (S3) and top (S1)]
Worst Case delay for:
(a) 13µ initial breakout pitch
(b) 26µ XY routing
(c) 36µ routing (under DRAMs)
Trace Width in all cases = 10µ

13 Noise and Timing Analysis
- For a 1cm x 1cm chip (with 2500 I/O pins), the escape lengths in the two regions are 1cm and 0.8cm
- For the various routing regions, crosstalk noise is 0.19V + 0.2V + ... = 0.56V
- The reflection noise is approximately 0.03V
- The total RSS noise is 0.6V (with an SSN of 0.2V). This is within the Noise Margin of 0.7V for a 0.25µ technology
- The worst case off-chip skew on an 8cm x 8cm substrate is around 0.2ns. After adding factors for on-chip skew and jitter, we can have a cycle time of at least 2ns
- This gives an I/O bandwidth > 100GByte/sec
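
The noise budget and bandwidth figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes the total is the root-sum-square of the 0.56V crosstalk sum, the 0.03V reflection noise and the 0.2V SSN, and that the 2,048-bit bus from the Motivation slide transfers once per 2ns cycle. Variable names are illustrative and not taken from the slides.

```python
import math

# Values quoted on the slide
crosstalk_v = 0.56     # crosstalk summed over the routing regions
reflection_v = 0.03    # reflection noise
ssn_v = 0.2            # simultaneous switching noise
noise_margin_v = 0.7   # design limit for the 0.25µ technology

# Assumption: independent noise sources combine as a root-sum-square
rss_noise_v = math.sqrt(crosstalk_v**2 + reflection_v**2 + ssn_v**2)

# Assumption: all 2,048 bus bits transfer once per 2 ns cycle
bus_width_bits = 2048
cycle_time_ns = 2.0
bandwidth_gbyte_per_s = bus_width_bits / 8 / cycle_time_ns  # bytes per ns equals GByte/s

print(f"RSS noise: {rss_noise_v:.3f} V (limit {noise_margin_v} V)")
print(f"Peak I/O bandwidth: {bandwidth_gbyte_per_s:.0f} GByte/s")
```

Under these assumptions the RSS noise rounds to the 0.6V quoted on the slide and stays below the 0.7V limit, and the peak bandwidth is above the 100GByte/sec bound.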

14 DRAM Timing Issues
- DRAM organized in banks and rows:
  - Random access takes 60 ns
  - A new bank can be accessed every 15 ns
  - A different entry within the row most recently accessed can be read or written in 4 ns
- However, in the FFT described next we can sustain 98% of peak bandwidth
- SRAM performance at DRAM prices
[Figure labels: Row Address, Sense Amps, Column Address, Bank Address, Data Word]

15 FFT Architectural Issues
- Conventional FFT implementation would spend most time in only one memory channel
  - Developed staggered channel algorithm
- Need to maximize page mode access in DRAMs
  - Developed novel memory map scheme for data
- Conventional FFT stores twiddle factors in main memory
  - Instead we regenerate them on-the-fly in the datapath during otherwise dead cycles

16 Micro-accelerator IC
[Figure: 1 cm x 1 cm die floorplan with 32-bit FP arithmetic units (X), SRAM, and MEM interface blocks]
- For a 0.25µ technology, a 32 bit multiplier and adder would take up an area < 1mm². A 1cm² chip area can hold enough hardware to make a fully parallel 16 point FFT
- Micro-Accelerator: control reconfigures IC and manages MEM interface
  - 64 multipliers and adders per chip
  - bit mem interface units
  - 0.5KB SRAM to store twiddle factors
- Four chips work together to give a radix 64 FFT engine

17 Performance of the FFT Engine
- A 32-bit multiply-accumulate unit, in 0.25µ technology, takes < 2ns to execute
- A 64-point FFT (including the twiddling) can be done in < 32ns
- By pipelining the FFT into two stages, a result can be obtained every 20ns
[Figure: pipeline timing diagram with Read, two stages of four 8-point units each, and Write; each slot takes 20 ns, and some slots add a penalty for new page access]

18 Address-Mapping Algorithm
- Key to success when using DRAM
  - Maximize page-mode accesses
- At each stage, the result set of each 64-point FFT is written to different DRAMs according to the following relation:
    DRAM# = (FFT# + Index) % 64
  where FFT# is (Index/64)
- Resulting performance:
  - Most of the new-page penalty is hidden by bank operations
  - 1.31ms for 4 stage million-point FFT
  - Within 1.6% of perfect SRAM performance
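
The effect of the relation DRAM# = (FFT# + Index) % 64 can be illustrated with a short sketch. It assumes that Index/64 is an integer division and that a next-stage 64-point FFT gathers inputs at a stride of 64; the slides do not state the stride, so that part is an assumption. The function name `dram_for` is hypothetical.

```python
CHANNELS = 64  # one DRAM channel per FFT output


def dram_for(index):
    """DRAM# = (FFT# + Index) % 64 with FFT# = Index // 64 (integer division assumed)."""
    fft_number = index // CHANNELS
    return (fft_number + index) % CHANNELS


# The 64 results of one 64-point FFT (here FFT number 5) are spread over all DRAMs.
outputs_one_fft = [dram_for(5 * 64 + k) for k in range(64)]

# Assumed next-stage access pattern: 64 inputs taken at a stride of 64.
inputs_next_stage = [dram_for(64 * j + 3) for j in range(64)]

# For comparison, a plain Index % 64 mapping sends those inputs to a single DRAM.
naive_next_stage = [(64 * j + 3) % CHANNELS for j in range(64)]

print(len(set(outputs_one_fft)), len(set(inputs_next_stage)), len(set(naive_next_stage)))
```

With the staggered mapping both the write set and the assumed read set touch 64 distinct DRAMs, whereas the plain modulo mapping would serialize the strided reads on one channel.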

19 ...Addressing Scheme
Example: The indices and memory layout of the data after the end of the first stage are shown for one row in each of the DRAMs as an illustrative example. The other stages are the same.
[Figure: one row of data indices in each of DRAM 0, DRAM 1, ..., DRAM 63]
- The inputs for the next stage are now arranged in different DRAMs, allowing full exploitation of the memory bandwidth.
- This shuffling of data after reading, for the next stage, is easily implemented using shift registers.

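20 ...Addressing Scheme
[Figure: data indices across banks B0 to B3 in DRAM 0, DRAM 1, ..., DRAM 63. DRAM 0 holds index 0 in B0, 4096 in B1, 8192 in B2 and 12288 in B3; DRAM 1 holds 1 in B0, 4097 in B1 and 8193 in B2; DRAM 63 holds 63 in B0, 4159 in B1 and 8255 in B2]
- Reads: Row # = FFT#/4, Bank # = FFT# % 4
- Writes: Row # = FFT#/256, Bank # = (FFT#/64) % 4
- Where FFT# = index/64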
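
The row and bank relations can be written as two small functions. This is a sketch that assumes all divisions are integer divisions; the function names are hypothetical.

```python
def read_location(index):
    """Row and bank used when reading: Row = FFT#/4, Bank = FFT# % 4."""
    fft_number = index // 64
    return fft_number // 4, fft_number % 4


def write_location(index):
    """Row and bank used when writing: Row = FFT#/256, Bank = (FFT#/64) % 4."""
    fft_number = index // 64
    return fft_number // 256, (fft_number // 64) % 4


# Banks touched by the reads of eight consecutive 64-point FFTs
read_banks = [read_location(64 * f)[1] for f in range(8)]
print(read_banks)

# Write locations for indices that appear in the figure
print(write_location(4096), write_location(8255), write_location(12288))
```

Consecutive FFTs rotate through the four banks on reads, so a bank is revisited only every fourth 20 ns slot. The write locations reproduce the B1, B2 and B3 labels shown for indices 4096, 8255 and 12288.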

21 Twiddle Factor Generation
- A one-dimensional input array (N) can be manipulated as a two-dimensional array (LxM):

  $$X(s, r) = \sum_{m=0}^{M-1} W_N^{Lmr} \, W_N^{ms} \sum_{l=0}^{L-1} x(l, m) \, W_N^{Msl}$$

- For a Radix-64 FFT, L=64
- The results of the 64-point FFT need to be multiplied with the twiddle factors $W_N^{ms}$, where $W_N^{ms} = e^{-j(2\pi/N)ms}$

22 Twiddle Factor Generation
- For all stages, s varies from 0 to 63.
- By storing an initial set of twiddle factors (m=1), subsequent twiddle factors in the same stage can be generated by multiplying current factors by the initial factors:

  $$W_N^{ms} = W_N^{(m-1)s} \cdot W_N^{s}$$

- Whenever m reaches 64, an initial twiddle factor set can be generated for the next stage:

  $$W_N^{64s} = W_{N/64}^{1 \cdot s}$$
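
The recurrence can be checked numerically. The sketch below assumes the forward-transform convention W_N = e^{-j2π/N} and uses N = 4096 purely as an example size; the function names are hypothetical and the code models the arithmetic only, not the hardware schedule.

```python
import cmath


def direct_twiddle(n_points, m, s):
    """W_N^{ms} computed directly (assumed convention: W_N = exp(-j*2*pi/N))."""
    return cmath.exp(-2j * cmath.pi * m * s / n_points)


def twiddles_by_recurrence(n_points, s_count=64, m_count=64):
    """Build W_N^{ms} for m = 0..m_count-1 from the stored m = 1 set only."""
    initial = [direct_twiddle(n_points, 1, s) for s in range(s_count)]
    current = [1 + 0j] * s_count          # m = 0: every factor is 1
    table = [current]
    for _ in range(1, m_count):
        # W^{ms} = W^{(m-1)s} * W^{s}: one complex multiply per factor
        current = [c * w for c, w in zip(current, initial)]
        table.append(current)
    return table


N = 4096
table = twiddles_by_recurrence(N)
max_error = max(
    abs(table[m][s] - direct_twiddle(N, m, s))
    for m in range(64)
    for s in range(64)
)
print(f"largest deviation from direct evaluation: {max_error:.2e}")
```

Each new factor costs one complex multiplication, which is why otherwise idle multipliers are enough to produce them.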

23 Scheduling FFT Operations
- The first 2 stages of an 8-point FFT do not involve any multiplications.
- The free multipliers can be used for generating twiddle factors needed later.
[Figure: timeline from 2ns to 12ns showing the 1st two stages of the 8-point FFT, generation of twiddle factors, the 3rd stage, and twiddling the final results]

24 FFT Performance Discussion
- 1,048,576 point FFT in 1.31 ms
  - 892 FFT/s
  - 1.44 x FLOPS
  - 127 GBps sustained memory performance
- Commercial comparisons
  - BOPS Inc. System: 80 sq.cm. of PCB, 4-32-bit memory channels, 4 PEs with each PE having 5 FP units
    - 21.5 ms to perform one million point FFT
  - Motorola's AltiVec: 128 bit vector execution unit with 4 parallel executions, simultaneous load of 4 IEEE floats
    - 511 ms for a million point FFT
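
The 1.31 ms figure is consistent with a simple estimate. The sketch assumes each of the four passes issues N/64 64-point FFT slots at one slot per 20 ns and ignores new-page penalties; whether the final pass really uses full 64-point slots is not stated on the slides, so treat this as a plausibility check rather than the authors' timing model.

```python
n_points = 1_048_576          # million-point FFT
ffts_per_pass = n_points // 64
passes = 4                    # "4 stage" FFT from the address-mapping slide
slot_ns = 20                  # one 64-point result every 20 ns

total_ms = ffts_per_pass * passes * slot_ns / 1e6

# Ratios against the commercial systems quoted on the slide
speedup_vs_bops = 21.5 / total_ms
speedup_vs_altivec = 511 / total_ms

print(f"estimated time: {total_ms:.2f} ms")
print(f"vs BOPS: {speedup_vs_bops:.1f}x, vs AltiVec: {speedup_vs_altivec:.0f}x")
```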

25 Optical Burst Switching (OBS)
- Need to decouple the transmission/switching from forwarding/routing
  - One control channel that goes through O/E/O conversion
  - Data cuts through nodes without any conversion
- Just-in-Time signaling protocol for burst transmission
  - Transmit packet after some delay without waiting for confirmation
[Figure: signaling sequence between Calling Host, Calling Switch, Called Switch and Called Host, showing SETUP, CALL PROC, processing delay, crossconnect configured, OPTICAL BURST and CONNECT]

26 OBS Node Architecture
[Figure: block diagram of the node: Input Fibers #1..#N, Demux, ICC #1..#N, IDC #1..#N, Input Modules, Router, Buffer and Scheduler, FDL, Switching Fabric, Output Modules, ODC #1..#N, Mux, Output Fibers #1..#N; the legend distinguishes optical blocks from electrical blocks]

27 Message Engine
[Figure: message engine block diagram: Message generator, Data Bus, Message Parsing and Header Verification, TTL and CRC update, Route Lookup, Scheduler, Switch Control, Exception Handler (to software), and SRAM/DRAM; a Hard Path and a Soft Path are marked]

28 Forwarding Engine
- The bottleneck of the forwarding engine is the route lookup
- Speed
  - Reduce the number of lookups, esp. in main memory
    - Number of memory accesses 2-9 (IPv4)
    - Partition data to ease hardware pipelining
    - Existing schemes take ns (average time) for address lookup
- Scalability
  - Reduce the amount of memory required to store data
    - Direct/indirect lookup schemes use memory inefficiently
    - Tree based schemes better

29 Trie Vs. Tree
[Figure: a binary trie with branches labeled 0 and 1, beside a binary tree with < and > comparisons at each node]
- Memory Accesses:
  - Binary Trie: Number of address bits (32 for IPv4)
  - Binary Tree: log2(N) (16 for 64K entries)
*Nick McKeown, Balaji Prabhakar, High Performance Switches and Routers: Theory and Practice, Hot Interconnects Tutorial Slides (
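
The access counts can be made concrete with a minimal binary trie for longest-prefix matching. This is a generic textbook sketch, not the scheme proposed later in the deck. The prefixes 10* (next hop 7) and 01* (next hop 5) appear in the sample prefix set; the next hop 3 for 1011* is a made-up value for illustration.

```python
import math


class TrieNode:
    def __init__(self):
        self.children = [None, None]  # child for bit 0 and bit 1
        self.next_hop = None          # set only on nodes that hold a prefix


def insert(root, prefix, next_hop):
    node = root
    for bit in prefix:
        i = int(bit)
        if node.children[i] is None:
            node.children[i] = TrieNode()
        node = node.children[i]
    node.next_hop = next_hop


def longest_prefix_match(root, address_bits):
    """Return (best next hop, number of trie nodes visited below the root)."""
    node, best, visited = root, root.next_hop, 0
    for bit in address_bits:
        node = node.children[int(bit)]
        if node is None:
            break
        visited += 1                  # one memory access per address bit consumed
        if node.next_hop is not None:
            best = node.next_hop
    return best, visited


root = TrieNode()
insert(root, "10", 7)
insert(root, "01", 5)
insert(root, "1011", 3)               # hypothetical next hop

print(longest_prefix_match(root, "10110000"))
print(longest_prefix_match(root, "01110000"))
print(math.log2(64 * 1024))           # binary-tree accesses for 64K entries
```

The trie walk costs one access per address bit consumed, up to the full address width, while a balanced binary search over N sorted entries needs about log2(N) accesses.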

30 Trie Based Schemes: Direct Lookup
- An entry for each address
- Inefficient use of memory
- Very poor scalability
- Trie of depth=1 and degree=2^B
- Lookup Time = 1 cycle (60ns)
[Figure: a B-bit address indexing a table of size 2^B, and a plot of Required Memory Size versus Address Bits annotated "1,000 DRAM chips"]

31 Trie Based Schemes: Indirect Lookup
- Address split in 2 or more parts*
- Somewhat better use of memory
- Poor scalability
- Lookup Time = N cycles (N = no. of segments in the address)
- Memory Requirement = Depends on the routing table. Can reduce memory usage by using variable offset length
[Figure: address split into segments B1 and B2]
*P. Gupta, S. Lin, N. McKeown, Routing Lookups in hardware at memory access speeds, in Proc. IEEE Infocom 98, Session 10B-1, San Francisco, CA, pp

32 Trie Based Schemes: Trie Optimizations
- Memory usage optimal
- Lookup Time = H cycles (H = depth of tree)
[Figure: a binary tree, a path-compressed (Patricia) tree with Skip values, and a level-compressed (LC) tree; nodes without a prefix are marked "No Prefix"]
*S. Nilsson, G. Karlsson, IP-Address Lookup Using LC-Tries, IEEE Journal on Selected Areas in Communications, Vol. 17, No. 6, June 1999, pp

33 Trie or Tree?
- Issues with Trie Based Schemes:
  - Extra nodes with no data add to the depth of the tree
    - More memory accesses needed
  - Search time proportional to the size of the address
    - Binary trie for IPv4 can take up to 32 cycles
    - For IPv6 the worst case could be 128 cycles
- Issues with Tree Based Schemes:
  - Binary search works for exact matching
    - Backtracking or wrong paths
    - Unbalanced approaches
  - Pre-processing overhead higher

34 Tree Based Schemes: Binary Search
- Encoding prefixes as ranges
- Multiway search to reduce search time from log_2(N) to log_{k+1}(N)
- Pre-computed table of best matching prefixes for the first Y bits
- Worst Case Lookup time = 490ns (>32,000 entries)
[Table: worst case search (ns) and worst case relative to Patricia, for Patricia, Binary, 16 bit + binary, and 16 bit + 6 way]
*B. Lampson, V. Srinivasan, G. Varghese, IP Lookups using Multiway and Multicolumn Search, Infocom 98, Vol. 3, 1998, pp

35 Lookups using Hash Tables
- Hash tables organized by prefix lengths
  - Hash collisions?
- Lookup Time = log2(address bits)
- Improve performance by binary search of hash tables, using markers in tables corresponding to shorter lengths to point to prefixes of greater lengths
[Figure labels: Length, Hash]
*M. Waldvogel, G. Varghese, J. Turner, B. Plattner, Scalable High Speed IP Routing Lookups, ACM Comput. Commun. Rev., Vol. 27, Oct. 1997, pp

36 Proposed Scheme Using Compaction
- Store path information in a smaller (almost 250x smaller than the forwarding table), faster, wide (about 1000 bits) on-chip SRAM
- Few SRAM and one DRAM lookups
- Store a table containing the number of 1's in each level
- Additionally, for each row of SRAM, the first few bits store the number of 1's up to the previous row in that level
- First 16 bits or so can be direct mapped
- A lookup can be done every 60-65ns (14-15 million lookups per second)
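
One way to read the "number of 1's up to the previous row" idea is as a rank computation over a bitmap: the count of set bits before a position gives that entry's index in a densely packed table. The sketch below is an interpretation under that assumption, not the authors' exact data structure. The 8-bit row width is a toy value (the slide describes rows of about 1000 bits), and the function names are hypothetical.

```python
ROW_BITS = 8  # toy row width; an assumption for readability


def build_level(bits):
    """Split one level's bitmap into rows and record the ones preceding each row."""
    rows, ones_before, total = [], [], 0
    for start in range(0, len(bits), ROW_BITS):
        row = bits[start:start + ROW_BITS]
        rows.append(row)
        ones_before.append(total)     # stored alongside the row in the wide SRAM
        total += row.count("1")
    return rows, ones_before


def compact_index(rows, ones_before, position):
    """Index of the set bit at `position` in the compacted table, or None if the bit is 0."""
    row_number, column = divmod(position, ROW_BITS)
    row = rows[row_number]
    if row[column] != "1":
        return None
    # One row read: the stored running count plus the ones to the left within the row
    return ones_before[row_number] + row[:column].count("1")


rows, ones_before = build_level("0100100011000001")
print(ones_before)
print([compact_index(rows, ones_before, p) for p in (1, 4, 8, 9, 15)])
```

Because the running count is stored with each row, a single wide row read is enough to turn a bit position into an index for the one off-chip lookup.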

37 Proposed Scheme Using Compaction
- On-chip SRAM and Off-chip DRAM
  - >1000 bit wide on-chip SRAM
  - For 40,000 prefixes in the routing table, the required SRAM size is less than 5kB
  - 2 sets of these memories can be used to hide the update operations
- Pipelined SRAM and DRAM operation
  - Only 1 DRAM lookup in all cases
  - One lookup can be done every 60-65ns
  - million lookups per second

38 Binary Search Based Proposed Scheme
- Sorting Prefixes:
  - Two prefixes: A = a1 a2 ... an, B = b1 b2 ... bm
  - If n = m, compare by numerical value
  - If n ≠ m, chop the longer prefix and compare. If the chopped prefixes are equal, then the shorter prefix is considered larger
- After Sorting: 00010*, 0001*, *, *, *, *, 01011*, 01*, *, *, *, *, 1011*, 10*, 110*
[Table: Sample Prefix Set* with columns Prefix and Next Hop; legible rows include 10* with next hop 7 and 01* with next hop 5]
*N. Yazdani, P.S. Min, Fast and Scalable Schemes for IP Address Lookup Problem, Proc. IEEE Conference on High Performance Switching and Routing, pp ,
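
The comparison rule can be expressed as a comparator and applied to the seven prefixes that remain legible on the slide. This is a minimal sketch: prefixes are written as bit strings without the trailing `*`, and the function name is hypothetical.

```python
from functools import cmp_to_key


def compare_prefixes(a, b):
    """Order two prefixes given as bit strings, following the rule on the slide."""
    n = min(len(a), len(b))
    head_a, head_b = a[:n], b[:n]     # chop the longer prefix to the shorter length
    if head_a != head_b:
        # Equal-length bit strings compare the same way as their numerical values
        return -1 if head_a < head_b else 1
    if len(a) == len(b):
        return 0
    # Chopped prefixes are equal: the shorter prefix is considered larger
    return 1 if len(a) < len(b) else -1


prefixes = ["10", "01", "110", "1011", "0001", "01011", "00010"]
ordered = sorted(prefixes, key=cmp_to_key(compare_prefixes))
print([p + "*" for p in ordered])
```

The resulting order matches the relative order of the same prefixes in the "After Sorting" list, with every longer prefix placed immediately before the shorter prefix that contains it.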

39 Binary Search Based Proposed Scheme
- Sorting gives depth-first search of corresponding binary trie
- Binary trie constructed as:
  - If A is a prefix of B, then B is the child of A
  - If A < B, then A lies on the left of B
[Figure: trie of the sample prefixes rooted at Root(*)]

40 Modified Prefix Table
- Store information about all parents in another field
- Pre-processing requires another step
- Update process is O(N) (same as Lampson's scheme)
- Memory requirement is 2x lower
[Table: columns Prefix, Next Hop and Parent Info. In the annotated example the match between the search key and the table entry extends to 4 bits, and the Best Matching Prefix is 01*]

41 Conclusions
- 2,048-bit memory system buildable in high density packaging technologies
  - Limit determined by Signal Integrity Issues
  - Modeled & Simulated with Ansoft and Hspice
- FFT Architecture optimized to maximize available memory bandwidth
  - Memory map perfectly matched to DDR DRAM architecture
  - On-the-fly twiddle factor calculation
  - Verified in Verilog model
- Result: 20x faster than is possible with conventional packaging

42 Conclusions
- Trie-based routing scheme using compaction suggested for smaller address sizes
  - SRAM size is almost 250x smaller than DRAM size
  - One DRAM access only
- Binary Search scheme for larger address size
  - Number of memory accesses = log2(N)
  - Memory requirement 2x lower than existing schemes
  - Update process is O(N), same as existing schemes

43 Future Work
- FFT
  - Complete Verification (Verilog)
  - Submit journal papers (T.VLSI, CPMT) (Conference paper published - EPEP)
- Forwarding Engine
  - Verify routing schemes (high level Verilog)
  - Evaluate pre-processing overheads
  - Evaluate performance against standard routing tables
  - Submit journal paper
- Conducting scaling studies to support OBS
