Energy Efficient Adaptive Beamforming on Sensor Networks


1 Energy Efficient Adaptive Beamforming on Sensor Networks
Viktor K. Prasanna, Bhargava Gundala, Mitali Singh
Dept. of EE-Systems, University of Southern California

2 Outline
- Problem Definition
- Computational Characteristics
- Prior Solution
- Power Optimizations
  - Sensor Node Level
  - Inter Node Level
- Challenges/Discussion

3 Problem Scenario
[Figure: energy constrained network; labels: Passive, Active]

4 Beamforming
Definition: the technique that spatially filters the signals received from an array of sensors and estimates the spatial features of the sources.
Procedure:
1. Passively and repeatedly sample the acoustic propagation wave field signals.
2. Linearly combine the input data with a weight matrix to form a sonar beam for a particular direction of look.
Adaptive Sonar Beamforming:
- For high SNR and high resolution
- Time-changing signal and noise properties are included in the derivation of the weights, making them adapt accordingly
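
The procedure above (sample the array, then linearly combine the samples with weights for one look direction) can be sketched as a conventional, non-adaptive narrowband beamformer. This is a minimal sketch under stated assumptions: a uniform linear array with half-wavelength spacing, a noise-free plane wave, and hypothetical function names that do not come from the slides.

```python
import cmath
import math

def steering_vector(num_sensors, angle_deg, spacing_wavelengths=0.5):
    """Phase of a plane wave at each sensor of a uniform linear array (assumed geometry)."""
    phase_step = 2 * math.pi * spacing_wavelengths * math.sin(math.radians(angle_deg))
    return [cmath.exp(1j * phase_step * m) for m in range(num_sensors)]

def beam_output(snapshot, look_angle_deg):
    """Combine one array snapshot with conventional weights w = s / N, i.e. y = w^H x."""
    n = len(snapshot)
    weights = [s / n for s in steering_vector(n, look_angle_deg)]
    return sum(w.conjugate() * x for w, x in zip(weights, snapshot))

# Hypothetical scenario: 8 sensors, one unit-amplitude source at 20 degrees.
snapshot = steering_vector(8, 20.0)
on_target = abs(beam_output(snapshot, 20.0))    # beam steered at the source
off_target = abs(beam_output(snapshot, -40.0))  # beam steered elsewhere
```

An adaptive beamformer would replace the fixed weights with weights derived from the measured signal and noise statistics; the linear combination step stays the same.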

5 Space Time Adaptive Processing
[Figure: data cube for each CPI (Coherent Processing Interval) with axes Elements (1 to N), Range gates (1 to L), and PRIs (Pulse Repetition Intervals, 1 to M); output: target detection]

6 MITRE RT_STAP Benchmark
[Figure: processing stages: Input Data, Preprocessing Step 1, Preprocessing Step 2, Doppler Processing, Weight Computation, Weight Application]
Input data dimensions: L (1920), N (22), M (64)
T_latency = msec & T_period = msec

7 Input Data Cube
- Elements (N = 22)
- Range Gates (L = 1920)
- PRIs (M = 64)

8 Sonar Signal Processing
[Figure: paths from element space to beam space in the time domain and in the frequency domain (FFT), using conventional beamforming and adaptive beamforming]
- Sampling rate = 10 Hz ~ 25 kHz
- Output rate = 1 Hz ~ 100 Hz
- 100 ~ 5000 beams per output

9 An Example Adaptive Beamformer
MVDR (Minimum Variance Distortionless Response)
[Figure: block diagram with the blocks FFT, Corner Turn, Covariance, Factorization, Steering, and Linear Solver & Beamformer; dimensions: N channels, F frequency bins, B beams per bin]
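
For reference, the textbook MVDR weight vector that a covariance, factorization, and linear-solver chain of this kind computes can be written as follows. The symbols $R$ (sample covariance of the N channels in one frequency bin), $s(\theta)$ (steering vector for one beam), and $w$ (weight vector) are notation assumed here; the slide does not define them.

```latex
\min_{w}\; w^{H} R\, w
\quad \text{subject to} \quad w^{H} s(\theta) = 1
\qquad \Longrightarrow \qquad
w = \frac{R^{-1} s(\theta)}{s(\theta)^{H} R^{-1} s(\theta)}
```

In practice $R^{-1} s(\theta)$ is usually obtained by factorizing $R$ and solving a linear system rather than forming the inverse, which is consistent with the separate Factorization and Linear Solver blocks in the diagram.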

10 Computational Characteristics
[Figure: data (D) and tasks (T) flowing from the initial data layout through subproblems S1, S2, S3, S4 to the outputs]
- Overall processing consists of a sequence of subproblems
- Computational requirements are different for each subproblem
- Large amount of data is repeatedly processed in real time
- Data access patterns change from subproblem to subproblem
- Throughput and latency performance requirements

11 Adaptive Processing
Key Problems
- Doppler Processing (FFT)
- Weight Computation (covariance matrix factorization)
- Weight Application (matrix-vector product)
[Figure: data cube with Elements (N = 22), Range Gates (L = 1920), PRIs (M = 64); labels: apply, adaptation]

12 Prior Solution
Architecture = tightly coupled collection of processors
- High bandwidth, low latency network
[Figure: processors connected by the network; output: target detection]

13 Key Issue: Communication Cost
Coarse grain machines: powerful processing nodes
- SP-2: typical configuration
  640 Mflops/node, 64 MB 4 GB Memory, GB Internal Disk
- T3E: typical configuration
  1200 Mflops/node (T3E-1200)
  Local memory access time: 87 ~ 253 nsec
  Global memory access time: 1 ~ 2 µsec (SHMEM)
Large software overhead for message transfer
- SP-2: ~39 µsec overhead/message using MPL/MPI, ~9 nsec/byte/node transfer rate
- Local memory access: 100s of nsec
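
The SP-2 figures above suggest a simple linear cost model for one message: a fixed software overhead plus a per-byte transfer term. The sketch below is an assumption-laden illustration (the linear model and the function names are not from the slides); it shows why many small messages cost far more than one large one.

```python
SP2_OVERHEAD_S = 39e-6   # ~39 usec software overhead per message (from the slide)
SP2_PER_BYTE_S = 9e-9    # ~9 nsec per byte per node (from the slide)

def message_time(num_bytes, overhead_s=SP2_OVERHEAD_S, per_byte_s=SP2_PER_BYTE_S):
    """Estimated time for one message under an assumed linear cost model."""
    return overhead_s + per_byte_s * num_bytes

def breakeven_bytes(overhead_s=SP2_OVERHEAD_S, per_byte_s=SP2_PER_BYTE_S):
    """Message size at which the transfer term equals the fixed overhead."""
    return overhead_s / per_byte_s

one_large = message_time(1000)        # 1000 bytes in a single message
ten_small = 10 * message_time(100)    # the same 1000 bytes in ten messages
```

Under this model the overhead dominates until a message reaches a few kilobytes, so aggregating data into fewer, larger messages is the first lever for reducing communication cost.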

14 Key Idea: Data Remapping
[Figure: data access pattern across processors P0 to P3 for subproblems S1, S2, S3, with a "Remap?" decision between consecutive subproblems]
Benefits of remapping must exceed the overhead

15 Impact of Data Remapping
Our Results
- Results reported in IPPS 95
- Implementation performed on IBM SP-2 at MHPCC
- Code developed using C, MPI and ESSL

16 Lessons learnt
Objective: adaptive beamforming on parallel machines
- Task level parallelism
- Minimize communication cost
- Data remapping

17 Energy Efficiency
Power is critical and must be conserved
- Reduce power dissipation at sensor node level
  - Energy efficient algorithms
- Decrease power dissipation at inter-node level
  - Optimize communication cost between sensors
[Figure: energy constrained network of sensors]

18 Power Model for a Processing Element
[Figure: processor with functional units (FU), cache, and memory; frequency control sets the processor frequency $f_p$ and the bus frequency $f_b$]
$Power_{Total} = Power_{Processor} + Power_{Data\,bus} + Power_{Memory}$
$Power_{unit} = Power_{Dynamic} + Power_{Static} = 0.5\, f(n)\, C V^2 f_{Active} + V I_{Leakage}$
$f_{max} \propto (V - V_t)/V$
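
The per-unit power expression can be explored numerically. The sketch below is illustrative only: it treats f(n) as a dimensionless activity factor (an assumption, since the slide does not define it), and every numeric value is hypothetical rather than measured.

```python
def dynamic_power(activity, capacitance_f, voltage_v, frequency_hz):
    """0.5 * f(n) * C * V^2 * f, with f(n) assumed to be an activity factor."""
    return 0.5 * activity * capacitance_f * voltage_v ** 2 * frequency_hz

def static_power(voltage_v, leakage_a):
    """V * I_leakage."""
    return voltage_v * leakage_a

def relative_max_frequency(voltage_v, threshold_v):
    """Quantity that f_max is proportional to in the slide: (V - V_t) / V."""
    return (voltage_v - threshold_v) / voltage_v

# Hypothetical unit: 1 nF switched capacitance, 100 MHz clock, 1 mA leakage.
high = dynamic_power(1.0, 1e-9, 3.3, 100e6) + static_power(3.3, 1e-3)
low = dynamic_power(1.0, 1e-9, 1.65, 100e6) + static_power(1.65, 1e-3)
dynamic_ratio = dynamic_power(1.0, 1e-9, 3.3, 100e6) / dynamic_power(1.0, 1e-9, 1.65, 100e6)
```

Halving the supply voltage at a fixed clock cuts the dynamic term by a factor of four, while `relative_max_frequency` indicates that the attainable clock also drops as V approaches the threshold voltage.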

19 Reduce Processor-Memory Data Traffic
Instructions for memory access consume a lot of power

Instruction (Intel 486DX2)    Energy (10^-8 Joules)
MOV DX, BX
MOV DX, [BX]
MOV [BX], DX

Reduce the number of memory accesses
- Reduce cache misses
- High data reuse in cache
- Use registers
Reduce power consumed on the data bus

20 Example: Matrix Multiplication
Cache size = n
[Figure: A (indexed i, j) = B (indexed i, k) x C (indexed k, j)]
    Do i = 0 ;
      Do j = 0 ;
        A[i,j] ← 0 ;
        Do k = 0 ;
          A[i,j] ← A[i,j] + B[i,k] x C[k,j] ;
          k++ ;
        j++ ;
      i++ ;
$Energy = \alpha n^3 + \beta (n + n^2) n + \gamma (3n^2) \approx (\alpha + \beta) n^3$
$Time = n^3$ + lower order terms

21 Optimization I: Reduce Bus Traffic
Block Matrix Multiply
[Figure: n x n matrices partitioned into blocks]
$Energy = \alpha n^3 + 2\beta (n \cdot n^{1/2}) n + \gamma (3n^2)$
$Time = n^3$ + lower order terms
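
A blocked (tiled) multiplication performs the same arithmetic as the triple loop but works on one small tile at a time, so each tile is reused while it is still in cache. The following is a minimal sketch, assuming square matrices stored as lists of lists and a caller-chosen block size; the function names are illustrative.

```python
def matmul_naive(b, c):
    """A = B x C with the plain i, j, k loop order."""
    n = len(b)
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                a[i][j] += b[i][k] * c[k][j]
    return a

def matmul_blocked(b, c, block):
    """Same product, computed tile by tile (block x block tiles)."""
    n = len(b)
    a = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(0, n, block):
            for k0 in range(0, n, block):
                # Multiply one tile of B by one tile of C into one tile of A.
                for i in range(i0, min(i0 + block, n)):
                    for j in range(j0, min(j0 + block, n)):
                        acc = a[i][j]
                        for k in range(k0, min(k0 + block, n)):
                            acc += b[i][k] * c[k][j]
                        a[i][j] = acc
    return a

# Small integer-valued matrices so both versions agree exactly.
B = [[float(i + j) for j in range(6)] for i in range(6)]
C = [[float(i * j) for j in range(6)] for i in range(6)]
```

In Python the tiling does not save energy by itself; the sketch only shows the loop structure. With a tile side near the square root of the cache size, the bus traffic term drops from order n^3 to order n^2.5, as in the energy expression above.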

22 Optimization II: Reduce Peak Bus Bandwidth
[Figure: n x n matrices A, B, and C, with expressions relating Data, Time, and Bus Data Rate to the Processor Rate]

23 Optimization III: Application directed Data Layouts
- Applications have different data access patterns
  - Matrices accessed by rows, columns, diagonals, sub-squares
  - Tree structures accessed along paths, sub-trees
- Naive data layouts degrade performance
  - Large working sets cause capacity misses
  - Improper alignment in memory causes conflict misses
[Figure: the same 4 x 4 matrix, a(0,0) to a(3,3), shown under a row major layout and under a block layout, each mapped onto Page 0 to Page 3]
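
The difference between the two layouts can be made concrete by mapping each matrix element to a memory offset and counting the pages an access pattern touches. This is a sketch under assumptions that the slide only implies: a 4 x 4 matrix, 2 x 2 blocks, and pages holding four elements each. All function names are hypothetical.

```python
def row_major_offset(i, j, n):
    """Offset of a[i][j] when rows are stored one after another."""
    return i * n + j

def block_offset(i, j, n, b):
    """Offset of a[i][j] when each b x b block is stored contiguously."""
    blocks_per_row = n // b
    block_index = (i // b) * blocks_per_row + (j // b)
    return block_index * b * b + (i % b) * b + (j % b)

def pages_touched(cells, offset_fn, page_size):
    """Set of page numbers touched when visiting the given (i, j) cells."""
    return {offset_fn(i, j) // page_size for i, j in cells}

N, BLOCK, PAGE = 4, 2, 4   # assumed sizes

def row_major(i, j):
    return row_major_offset(i, j, N)

def blocked(i, j):
    return block_offset(i, j, N, BLOCK)

sub_square = [(0, 0), (0, 1), (1, 0), (1, 1)]
first_row = [(0, 0), (0, 1), (0, 2), (0, 3)]
```

With these sizes, a 2 x 2 sub-square spans two pages in the row major layout but a single page in the block layout, and the situation reverses for a full row. The better layout therefore depends on the access pattern of the application.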

24 Cache Friendly Algorithms
Cache friendly
- High data reuse
- Low cache pollution
- Regular access patterns
Data layouts
- Static data layouts (Matrix Multiply)
- Dynamic data layouts (FFT)

25 Fast Fourier Transform
DFT: Cooley-Tukey Algorithm
- Compute DFT of size N = N1 * N2
  - Step 1: compute N2 DFTs of size N1
  - Step 2: multiply twiddle factors
  - Step 3: compute N1 DFTs of size N2
- Divide and conquer recursively
Current Approach
- MIT FFTW
  - Determine optimal factorization
  - Perform low level optimizations for kernels
  - Construct larger size FFTs from kernels
- Key Assumption
  - All DFTs of same size have same execution time
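
The three Cooley-Tukey steps can be written out directly for a single level of decomposition. The sketch below is a plain reference version, not an optimized kernel and not FFTW's implementation: the small DFTs are evaluated by the O(n^2) definition, and the function names are illustrative.

```python
import cmath

def dft(x):
    """Direct DFT by definition; stands in for a small optimized kernel."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def cooley_tukey(x, n1, n2):
    """One decomposition level for N = n1 * n2."""
    n = n1 * n2
    assert len(x) == n
    # Step 1: n2 DFTs of size n1, each over a stride-n2 subsequence of x.
    inner = [dft(x[r::n2]) for r in range(n2)]        # inner[r][k1]
    # Step 2: multiply by the twiddle factors exp(-2*pi*i * r * k1 / N).
    for r in range(n2):
        for k1 in range(n1):
            inner[r][k1] *= cmath.exp(-2j * cmath.pi * r * k1 / n)
    # Step 3: n1 DFTs of size n2; output index is k1 + n1 * k2.
    out = [0j] * n
    for k1 in range(n1):
        column = dft([inner[r][k1] for r in range(n2)])
        for k2 in range(n2):
            out[k1 + n1 * k2] = column[k2]
    return out

signal = [complex(i, (i * i) % 5) for i in range(12)]
max_error = max(abs(a - b) for a, b in zip(cooley_tukey(signal, 4, 3), dft(signal)))
```

Note that Step 1 reads the input with stride n2. The stride of each small DFT therefore depends on where it sits in the decomposition, which is the property that makes same-size DFTs differ in cost.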

26 Problem with Current Approach
Not all N-point DFTs have the same cost!
- Different data access patterns with various strides
- Stride affects execution time
[Figure: 32-point FFT with strided access, experimental results; execution time (usec) versus stride (2^s); Sun Ultra 1: 167 MHz, L2 cache = 512 KB = 32 K points]

27 Our Approach
- Reorganize input data layout to change non-unit stride to unit stride
- Dynamic Data Layout: perform data reorganization during computation
[Figure: N2 N1-point FFTs and N1 N2-point FFTs, with a data reorganization step]
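
One way to picture the reorganization: before a batch of small FFTs that would read the array with a non-unit stride, gather each strided subsequence into a contiguous run so that every small FFT then reads at unit stride. The sketch below uses hypothetical helper names and assumes the array length is a multiple of the stride; deciding when a reorganization pays off is part of the authors' method and is not modeled here.

```python
def gather_unit_stride(data, stride):
    """Reorder data so elements that were `stride` apart become adjacent."""
    sub_len = len(data) // stride
    return [data[start + k * stride]
            for start in range(stride)
            for k in range(sub_len)]

def subsequence_offsets(length, stride, which):
    """Offsets of subsequence `which` after reorganization (a contiguous run)."""
    sub_len = length // stride
    return list(range(which * sub_len, (which + 1) * sub_len))

data = list(range(12))
reorganized = gather_unit_stride(data, 3)
second = [reorganized[o] for o in subsequence_offsets(len(data), 3, 1)]
```

After the gather, the elements data[1], data[4], data[7], data[10] sit next to each other, so a small FFT over them touches consecutive memory instead of every third element.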

28 Example
[Figure: decomposition trees for a 1024*1024 point FFT, FFTW versus USC approach, with execution times in ms]
54.96% improvement over state-of-the-art FFTW package on DEC Alpha

29 Other Techniques for Node Level Power Optimizations?
- Voltage frequency scaling: $f_{max} \propto (V - V_t)/V$
- Power management (idle/sleep/active states)
- Reduce precision
- Clock gating

Instruction (Fujitsu Sparc 934)    Energy (10^-8 Joules)
OR
MUL

30 Current Work
- Development and verification of techniques proposed for power optimization
  - Existing simulators
    - SimplePower (based on SimpleScalar architecture)
    - JouleTrack (code length limitations)
  - Board level power measurements
    - Brutus Evaluation Board (SA-1100)
- Build a functional level power simulation
  - Fast with acceptable level of accuracy
  - Develop a multiprocessor power model

31 Space Time Representation
Compute results in each block, for N x N matrices
- Schedule blocks row-major
- $N^2/c$ steps
- Data per step: $N\sqrt{c}$
- Operations per step: $Nc$
- Data reuse per step: $\sqrt{c}$
- Total traffic: $(N^2/c) \cdot N\sqrt{c} = N^3/\sqrt{c}$
[Figure: matrices A and B partitioned into blocks A11, A12, ..., A1N and B11, B12, ..., B1N; each marked point is the computation for result (i,j)]
c = cache size

32 Theorem
Unidirectional Space-Time representation leads to cache friendly algorithms => Energy Efficient Algorithms

33 Network level Energy Optimization
- Computation cost is much lower than communication cost
- Radio interface consumes a large amount of power

Power consumed (WINS sensor node)
Transmission (100 m)    600 mW (at 100 kbits/sec)
Reception               300 mW
Processor (SA1100)      250 MIPS/watt

- Energy to transfer 32 bits over 100 m in a WINS sensor node = $\frac{(600 + 300)\,\text{mW}}{100\,\text{kbits/s}} \times 32 = 288 \times 10^{-6}$ Joules
- Energy to execute a 32-bit instruction using the SA1100 processor = $\frac{1}{250\,\text{MIPS/watt}} = 0.004 \times 10^{-6}$ Joules
- Additional overhead for bits added for error correction
- Retransmissions are frequent due to unreliable links (e.g. wireless)
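
The comparison between radio and processor energy follows from the table figures alone. The sketch below redoes the arithmetic with illustrative constant and function names; it ignores error-correction bits and retransmissions, so the radio cost is, if anything, understated.

```python
TX_POWER_W = 0.600      # transmission over 100 m (from the slide)
RX_POWER_W = 0.300      # reception (from the slide)
BIT_RATE_BPS = 100e3    # 100 kbits/sec (from the slide)
MIPS_PER_WATT = 250     # SA1100 processor (from the slide)

def radio_energy_joules(num_bits):
    """Energy to send and receive num_bits over one hop."""
    return (TX_POWER_W + RX_POWER_W) / BIT_RATE_BPS * num_bits

def instruction_energy_joules():
    """Energy per instruction: 1 / (instructions per second per watt)."""
    return 1.0 / (MIPS_PER_WATT * 1e6)

word_transfer = radio_energy_joules(32)
instructions_per_word = word_transfer / instruction_energy_joules()
```

With these numbers, moving one 32-bit word over the radio costs about as much energy as executing roughly 72,000 instructions, which is the quantitative basis for trading extra local computation for less communication.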

34 Reduce Communication Cost
- Exploit data redundancy to reduce data traffic
- Improve locality of computation while assigning subtasks to nodes
- Communication limited to closely placed nodes
  - Larger distance requires higher transmission power
  - Larger distance reduces reliability of the link

35 Network Level Power Optimization
Issues
- Topology of network is unknown
- Estimation of communication cost
- Task allocation
- Broadcast communication model
Need: framework for energy efficient computation in ad hoc networks
