Energy Efficient Adaptive Beamforming on Sensor Networks
|
|
- Aubrie Simpson
- 5 years ago
- Views:
Transcription
1 Energy Efficient Adaptive Beamforming on Sensor Networks Viktor K. Prasanna Bhargava Gundala, Mitali Singh Dept. of EE-Systems University of Southern California
2 Outline ❹ Problem Definition ❹ Computational Characteristics ❹ Prior Solution ❹ Power Optimizations ❹ Sensor Node Level ❹ Inter Node Level ❹ Challenges/Discussion 1
3 Problem Scenario Energy Constrained Network Passive Active 2
4 Beamforming Def: The technique which spatially filters the signals received from an array of sensors and estimates the spatial features of the sources Procedure: 1. passively and repeatedly sample acoustic propagation wave field signals 2. input data, linearly combined with a weight matrix to form a sonar beam for a particular direction of look Adaptive Sonar Beamforming: For High SNR and High resolution Time changing signal and noise properties included in the derivation of weights, making them adapt accordingly 3
5 Space Time Adaptive Processing Elements 1 N Range gates 1 2 L Pulse Repetition Interval N Target Detection L M PRIs Each CPI (Coherent Processing Interval) 4
6 MITRE RT_STAP Benchmark Input Data Preprocessing Step 1 Preprocessing Step 2 L (1920) N (22) M (64) Weight Application Weight Computation Doppler Processing T latency = msec & T period = msec 5
7 Input Data Cube Elements (N = 22) Range Gates (L = 1920) PRIs (M = 64) 6
8 Sonar Signal Processing Adaptive Beamforming Sampling Rate =10 Hz~25 KHz Element Space Output Rate =1 Hz~100 Hz Beam Space Conventional Beamforming Frequency Domain Adaptive FFT Beamforming Adaptive FFT Beamforming Time Domain 100 ~5000 Beams per Output 7
9 An Example Adaptive Beamformer MVDR (Minimum Variance Distortionless Response) Channel s Frequency Bins F N FFT N F Corner Turn Beams per Bin B N N N Factorization F Steering F Covariance Linear Solver & Beamformer F N B 8
10 Computational Characteristics D A D D T A T A A A T A D A T A S 1 S 2 S 3 S 4 Outputs Initial Data Layout ❹ Overall processing consists of sequence of subproblems ❹ Computational requirements are different for each subproblem ❹ Large amount of data is repeatedly processed in real-time ❹ Data access patterns change from subproblem to subproblem ❹ Throughput and latency performance requirements 9
11 Adaptive Processing Key Problems ❹Doppler Processing (FFT) ❹Weight Computation apply (Co Variance matrix factorization) ❹Weight Application (Matrix Vector Product) adaptation Gates Elements (N = 22) Range (L = 1920) PRIs (M = 64) 10
12 Prior Solution Architecture= tightly coupled collection of processors Target detection High bandwidth, low latency network 11
13 Key Issue: Communication Cost Coarse grain machines : Powerful processing nodes -SP-2: Typical Configuration 640 Mflops/node 64 MB 4 GB Memory GB Internal Disk - T3E: Typical Configuration 1200 Mflops/node (T3E- 1200) Local Memory Access Time: 87 ~ 253 nsec Global Memory Access Time: 1~2 µ sec (SHMEM) ❹ Large software overhead for message transfer - SP-2: ~39 µsec overhead/message using MPL/MPI ~ 9 nsec/byte/node transfer rate - local memory access: 100 s of nsec 12
14 Key Idea- Data Remapping Data Access Pattern P 0 P 3 P 0 P 3 P 0 P 3 S 1 S 2 S 3 Remap? Remap? Benefits of Remapping Must Exceed the Overhead 13
15 Impact of Data Remapping Our Results Results reported in IPPS 95 Implementation performed on IBM SP-2 at MHPCC Code developed using C, MPI and ESSL 14
16 Lessons learnt Objective : Adaptive beamforming on parallel machines ❹ Task level parallelism ❹ Minimize communication cost ❹ Data Remapping 15
17 Energy Efficiency Power is critical and must be conserved ❹Reduce power dissipation at sensor node level ❹energy efficient algorithms ❹Energ y Constrained ❹Netw ork ❹Sensors ❹Decrease power dissipation at inter-node level ❹Optimize on communication cost between sensors ❹16
18 Power Model for a Processing Frequency Control Element Frequency Control f p Processor Processor f b FU FU Cache Memory Power Total = Power Processor +Power Data bus + Power Memory Power unit = Power Dynamic + Power Static = 0.5f(n)CV 2 f Active + VI Leakage F max (V-V t )/V 17
19 Reduce Processor-Memory Data Traffic Instructions for Memory access consume lot of power Instruction (Intel 486DX2) MOV DX BX MOV DX [BX] MOV [BX] DX Energy (10-8 Joules) Reduce # of memory accesses ❹ reduce cache misses ❹ high data reuse in cache ❹ use registers Reduce power consumed on the data bus 18
20 Cache size =n Example: Matrix Multiplication j k j i A i B x k C Do i = 0 ; Do j = 0 ; A[i,j] 0 ; Do k = 0 ; A[i, j] A[i,j] + B[i,k] x C[k,j] ; k++; j++; i++ ; Energy = αn 3 + β(n+n 2 )n + γ(3n 2 ) (α + β)n 3 Time = n 3 + lower order terms 19
21 Optimization I: Reduce Bus Traffic Block Matrix Multiply n n n n x Energy = αn 3 + 2β(n.n 1/2 )n + γ(3n 2 ) Time = n 3 + lower order terms 20
22 n Optimization II: Reduce Peak Bus Bandwidth A B C n n n n n Data = 2n 3 2 Bus Data Rate Time = 1 n n 2 Processor Rate! 21
23 Optimization III: Application directed Data Layouts ❹Applications have different data access patterns ❹ Matrices accessed by rows, columns, diagonals, sub-squares ❹ Tree structures accessed along paths, sub-trees ❹ Naive data layouts degrade performance ❹ Large working sets cause capacity misses ❹ Improper alignment in memory causes conflict misses Row major Layout Block Layout a 0,0 a 0,1 a 0,2 a 0,3 a 0,0 a 0,1 a 0,2 a 0,3 a 1,0 a 1,0 a 1,2 a 1,3 a 1,0 a 1,1 a 1,2 a 1,3 a 2,0 a 2,1 a 2,2 a 2,3 a 2,0 a 2,1 a 2,2 a 2,3 a 3,0 a 3,1 a 3,2 a 3,3 a 3,0 a 3,1 a 3,2 a 3,3 Page 0 Page 1 Page 2 Page 3 Page 0 Page 1 Page 2 Page 3 22
24 Cache Friendly Algorithms Cache friendly ❹High data reuse ❹Low cache pollution ❹Regular access patterns ❹Static data layouts (Matrix Multiply) ❹Dynamic data layouts (FFT) Data layouts 23
25 Fast Fourier Transform DFT: Cooley-Tukey Algorithm ❹ Compute DFT of size N = N 1 *N 2 ❹ Step1: compute N 2 DFTs of size N 1 ❹ Step2: multiply twiddle factors ❹ Step3: compute N 1 DFTs of size N 2 ❹ Divide and conquer recursively Current Approach ❹ MIT FFTW ❹ Determine optimal factorization ❹ Perform low level optimizations for kernels ❹ Construct larger size FFTs from kernels ❹ Key Assumption ❹ All DFTs of same size have same execution time 24
26 Problem with Current Approach All N-point DFTs do not have the same cost! ❹ different data access patterns with various strides ❹ stride affects execution time 32-point FFT with Strided Access - Experimental Results Execution Time (usec) N = Stride (2^s) Sun Ultra 1: 167MHz, L2 Cache = 512 KB = 32 K points 25
27 Our Approach Reorganize input data layout to change non-unit stride to unit stride Dynamic Data Layout Perform data reorganization during computation N 2 N 1 -point FFTs N 1 N 2 -point FFTs Data Reorganization 26
28 Example FFTW USC approach Decomposition trees for a 1024*1024 point FFT ms ms 54.96% improvement over state-of-the-art FFTW package on DEC Alpha 27
29 Other Techniques for Node Level Power Optimizations? ❹ Voltage frequency scaling f max α (V-V t )/V ❹ Power management (idle/sleep/active states) ❹ Reduce precision ❹ Clock Gating Instruction (Fujitsu Sparc 934) OR MUL Energy (10-8 Joules)
30 Current Work ❹ Development and Verification of techniques proposed for power optimization ❹ Existing simulators ❹ ❹ Simple Power(based on Simple Scalar architecture) Joule Track (Code Length Limitations) ❹ Board level Power Measurements ❹ Brutus Evaluation Board (SA-1100) ❹ Build a functional level power simulation ❹ ❹ Fast with acceptable level of accuracy. Develop a multiprocessor power model 31
31 Space Time Representation Compute results in each block ❹ Schedule blocks row-major ❹ N 2 steps c ❹Data per step N c ❹Operations per step Nc ❹Data reuse per step c ❹Total traffic N 2 * N c = N 3 c c A B for N x N matrices A 11 A 12 A 1N B 11 B 12 B 1N c c = computation for result (i,j) c = cache size 33
32 Theorem Unidirectional Space-Time representation leads to cache friendly algorithms => Energy Efficient Algorithms 34
33 Network level Energy Optimization ❹ Computation cost is much lower than communication cost ❹ Radio interface consumes a large amount of power POWER Consumed Transmission(100m) Reception Processor (SA1100) WINS sensor Node 600mw (at 100kbits/sec) 300mw 250MIPS/watt ❹ Energy to transfer 32 bits over 100m in WINS sensor node =( ( )mw 100kbits/s) x 32 = 288 x 10 6 Joules ❹ Energy to execute a 32 bit instruction using SA1100 processor = MIPS/watt = x 10 6 Joules ❹ Additional overhead for bits added for error correction ❹ Retransmissions are frequent due to unreliable links(e.g.wireless) 29
34 Reduce Communication Cost ❹Exploit data redundancy to reduce data traffic ❹Improve locality of computation while assigning subtasks to node ❹ Communication limited to closely placed nodes ❹Larger distance requires higher transmission power ❹Reduces reliability of link 30
35 Network Level Power Optimization Issues ❹Topology of network is unknown ❹Estimation of Communication cost ❹Task allocation ❹Broadcast Communication Model ❹Need: Framework for Energy Efficient Computation in Adhoc Networks 32
Module 5: Performance Issues in Shared Memory and Introduction to Coherence Lecture 9: Performance Issues in Shared Memory. The Lecture Contains:
The Lecture Contains: Data Access and Communication Data Access Artifactual Comm. Capacity Problem Temporal Locality Spatial Locality 2D to 4D Conversion Transfer Granularity Worse: False Sharing Contention
More informationFrequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System
Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM
More informationEffect of memory latency
CACHE AWARENESS Effect of memory latency Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100 ns. Assume that the processor has two ALU units and it is capable
More informationEnergy Optimizations for FPGA-based 2-D FFT Architecture
Energy Optimizations for FPGA-based 2-D FFT Architecture Ren Chen and Viktor K. Prasanna Ming Hsieh Department of Electrical Engineering University of Southern California Ganges.usc.edu/wiki/TAPAS Outline
More informationHigh Throughput Energy Efficient Parallel FFT Architecture on FPGAs
High Throughput Energy Efficient Parallel FFT Architecture on FPGAs Ren Chen Ming Hsieh Department of Electrical Engineering University of Southern California Los Angeles, USA 989 Email: renchen@usc.edu
More information3.2 Cache Oblivious Algorithms
3.2 Cache Oblivious Algorithms Cache-Oblivious Algorithms by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science,
More informationThe course that gives CMU its Zip! Memory System Performance. March 22, 2001
15-213 The course that gives CMU its Zip! Memory System Performance March 22, 2001 Topics Impact of cache parameters Impact of memory reference patterns memory mountain range matrix multiply Basic Cache
More informationKartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18
Accelerating PageRank using Partition-Centric Processing Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Outline Introduction Partition-centric Processing Methodology Analytical Evaluation
More informationDISTRIBUTED PARALLEL PROCESSING TECHNIQUES FOR ADAPTIVE SONAR BEAMFORMING
2000, HCS Research Lab. All Rights Reserved. DISTRIBUTED PARALLEL PROCESSING TECHNIQUES FOR ADAPTIVE SONAR BEAMFORMING ALAN D. GEORGE, JESUS GARCIA, KEONWOOK KIM, and PRIYABRATA SINHA High-performance
More informationLecture 15: Caches and Optimization Computer Architecture and Systems Programming ( )
Systems Group Department of Computer Science ETH Zürich Lecture 15: Caches and Optimization Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Last time Program
More informationScheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok
Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok Texas Learning and Computation Center Department of Computer Science University of Houston Outline Motivation
More informationCache-oblivious Programming
Cache-oblivious Programming Story so far We have studied cache optimizations for array programs Main transformations: loop interchange, loop tiling Loop tiling converts matrix computations into block matrix
More informationREAL TIME OPERATING SYSTEMS. Lesson-15:
REAL TIME OPERATING SYSTEMS Lesson-15: Power Optimization 1 1. Memory Optimization 2 Power Optimization Saving power and energy requirement for a given set of codes, while finishing instructions in the
More informationLINPACK Benchmark. on the Fujitsu AP The LINPACK Benchmark. Assumptions. A popular benchmark for floating-point performance. Richard P.
1 2 The LINPACK Benchmark on the Fujitsu AP 1000 Richard P. Brent Computer Sciences Laboratory The LINPACK Benchmark A popular benchmark for floating-point performance. Involves the solution of a nonsingular
More informationECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation
ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating
More informationDouble-Precision Matrix Multiply on CUDA
Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices
More informationSystems Programming and Computer Architecture ( ) Timothy Roscoe
Systems Group Department of Computer Science ETH Zürich Systems Programming and Computer Architecture (252-0061-00) Timothy Roscoe Herbstsemester 2016 AS 2016 Caches 1 16: Caches Computer Architecture
More informationEE/CSCI 451: Parallel and Distributed Computation
EE/CSCI 451: Parallel and Distributed Computation Lecture #2 1/17/2017 Xuehai Qian xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline Opportunities
More informationParallelism in Spiral
Parallelism in Spiral Franz Franchetti and the Spiral team (only part shown) Electrical and Computer Engineering Carnegie Mellon University Joint work with Yevgen Voronenko Markus Püschel This work was
More informationECE 669 Parallel Computer Architecture
ECE 669 Parallel Computer Architecture Lecture 9 Workload Evaluation Outline Evaluation of applications is important Simulation of sample data sets provides important information Working sets indicate
More informationOptimizing Cache Performance in Matrix Multiplication. UCSB CS240A, 2017 Modified from Demmel/Yelick s slides
Optimizing Cache Performance in Matrix Multiplication UCSB CS240A, 2017 Modified from Demmel/Yelick s slides 1 Case Study with Matrix Multiplication An important kernel in many problems Optimization ideas
More informationHigh-Performance Packet Classification on GPU
High-Performance Packet Classification on GPU Shijie Zhou, Shreyas G. Singapura, and Viktor K. Prasanna Ming Hsieh Department of Electrical Engineering University of Southern California 1 Outline Introduction
More informationCISC 360. Cache Memories Nov 25, 2008
CISC 36 Topics Cache Memories Nov 25, 28 Generic cache memory organization Direct mapped caches Set associative caches Impact of caches on performance Cache Memories Cache memories are small, fast SRAM-based
More informationScientific Computing. Some slides from James Lambers, Stanford
Scientific Computing Some slides from James Lambers, Stanford Dense Linear Algebra Scaling and sums Transpose Rank-one updates Rotations Matrix vector products Matrix Matrix products BLAS Designing Numerical
More informationOptimum Array Processing
Optimum Array Processing Part IV of Detection, Estimation, and Modulation Theory Harry L. Van Trees WILEY- INTERSCIENCE A JOHN WILEY & SONS, INC., PUBLICATION Preface xix 1 Introduction 1 1.1 Array Processing
More informationMemory Management Algorithms on Distributed Systems. Katie Becker and David Rodgers CS425 April 15, 2005
Memory Management Algorithms on Distributed Systems Katie Becker and David Rodgers CS425 April 15, 2005 Table of Contents 1. Introduction 2. Coarse Grained Memory 2.1. Bottlenecks 2.2. Simulations 2.3.
More informationAdvanced Parallel Programming I
Advanced Parallel Programming I Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2016 22.09.2016 1 Levels of Parallelism RISC Software GmbH Johannes Kepler University
More informationApplication Performance on Dual Processor Cluster Nodes
Application Performance on Dual Processor Cluster Nodes by Kent Milfeld milfeld@tacc.utexas.edu edu Avijit Purkayastha, Kent Milfeld, Chona Guiang, Jay Boisseau TEXAS ADVANCED COMPUTING CENTER Thanks Newisys
More informationDesign, implementation, and evaluation of parallell pipelined STAP on parallel computers
Syracuse University SURFACE Electrical Engineering and Computer Science College of Engineering and Computer Science 1998 Design, implementation, and evaluation of parallell pipelined STAP on parallel computers
More informationToday Cache memory organization and operation Performance impact of caches
Cache Memories 1 Today Cache memory organization and operation Performance impact of caches The memory mountain Rearranging loops to improve spatial locality Using blocking to improve temporal locality
More informationCMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)
CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can
More informationCache Memories. EL2010 Organisasi dan Arsitektur Sistem Komputer Sekolah Teknik Elektro dan Informatika ITB 2010
Cache Memories EL21 Organisasi dan Arsitektur Sistem Komputer Sekolah Teknik Elektro dan Informatika ITB 21 Topics Generic cache memory organization Direct mapped caches Set associative caches Impact of
More informationHigh performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli
High performance 2D Discrete Fourier Transform on Heterogeneous Platforms Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli Motivation Fourier Transform widely used in Physics, Astronomy, Engineering
More informationGiving credit where credit is due
CSCE 23J Computer Organization Cache Memories Dr. Steve Goddard goddard@cse.unl.edu http://cse.unl.edu/~goddard/courses/csce23j Giving credit where credit is due Most of slides for this lecture are based
More informationν Hold frequently accessed blocks of main memory 2 CISC 360, Fa09 Cache is an array of sets. Each set contains one or more lines.
Topics CISC 36 Cache Memories Dec, 29 ν Generic cache memory organization ν Direct mapped caches ν Set associatie caches ν Impact of caches on performance Cache Memories Cache memories are small, fast
More informationEE/CSCI 451: Parallel and Distributed Computation
EE/CSCI 451: Parallel and Distributed Computation Lecture #11 2/21/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline Midterm 1:
More informationAdvanced Computing Research Laboratory. Adaptive Scientific Software Libraries
Adaptive Scientific Software Libraries and Texas Learning and Computation Center and Department of Computer Science University of Houston Challenges Diversity of execution environments Growing complexity
More informationAutomatic Performance Tuning. Jeremy Johnson Dept. of Computer Science Drexel University
Automatic Performance Tuning Jeremy Johnson Dept. of Computer Science Drexel University Outline Scientific Computation Kernels Matrix Multiplication Fast Fourier Transform (FFT) Automated Performance Tuning
More informationLast class. Caches. Direct mapped
Memory Hierarchy II Last class Caches Direct mapped E=1 (One cache line per set) Each main memory address can be placed in exactly one place in the cache Conflict misses if two addresses map to same place
More informationWhite paper FUJITSU Supercomputer PRIMEHPC FX100 Evolution to the Next Generation
White paper FUJITSU Supercomputer PRIMEHPC FX100 Evolution to the Next Generation Next Generation Technical Computing Unit Fujitsu Limited Contents FUJITSU Supercomputer PRIMEHPC FX100 System Overview
More informationAdministration. Prerequisites. Meeting times. CS 380C: Advanced Topics in Compilers
Administration CS 380C: Advanced Topics in Compilers Instructor: eshav Pingali Professor (CS, ICES) Office: POB 4.126A Email: pingali@cs.utexas.edu TA: TBD Graduate student (CS) Office: Email: Meeting
More informationMultilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology
1 Multilevel Memories Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind CPU-Memory Bottleneck 6.823
More informationCache Memories October 8, 2007
15-213 Topics Cache Memories October 8, 27 Generic cache memory organization Direct mapped caches Set associative caches Impact of caches on performance The memory mountain class12.ppt Cache Memories Cache
More informationParallel Pipeline STAP System
I/O Implementation and Evaluation of Parallel Pipelined STAP on High Performance Computers Wei-keng Liao, Alok Choudhary, Donald Weiner, and Pramod Varshney EECS Department, Syracuse University, Syracuse,
More informationCache Memories. University of Western Ontario, London, Ontario (Canada) Marc Moreno Maza. CS2101 October 2012
Cache Memories Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS2101 October 2012 Plan 1 Hierarchical memories and their impact on our 2 Cache Analysis in Practice Plan 1 Hierarchical
More informationPerformance Issues in Parallelization Saman Amarasinghe Fall 2009
Performance Issues in Parallelization Saman Amarasinghe Fall 2009 Today s Lecture Performance Issues of Parallelism Cilk provides a robust environment for parallelization It hides many issues and tries
More informationWhite paper Advanced Technologies of the Supercomputer PRIMEHPC FX10
White paper Advanced Technologies of the Supercomputer PRIMEHPC FX10 Next Generation Technical Computing Unit Fujitsu Limited Contents Overview of the PRIMEHPC FX10 Supercomputer 2 SPARC64 TM IXfx: Fujitsu-Developed
More informationCache memories are small, fast SRAM based memories managed automatically in hardware.
Cache Memories Cache memories are small, fast SRAM based memories managed automatically in hardware. Hold frequently accessed blocks of main memory CPU looks first for data in caches (e.g., L1, L2, and
More informationPARALLEL ALGORITHMS FOR ADAPTIVE MATCHED-FIELD PROCESSING ON DISTRIBUTED ARRAY SYSTEMS
PARALLEL ALGORITHMS FOR ADAPTIVE MATCHED-FIELD PROCESSING ON DISTRIBUTED ARRAY SYSTEMS KILSEOK CHO, ALAN D. GEORGE, AND RAJ SUBRAMANIYAN High-performance Computing and Simulation (HCS) Research Laboratory
More informationOutline. Issues with the Memory System Loop Transformations Data Transformations Prefetching Alias Analysis
Memory Optimization Outline Issues with the Memory System Loop Transformations Data Transformations Prefetching Alias Analysis Memory Hierarchy 1-2 ns Registers 32 512 B 3-10 ns 8-30 ns 60-250 ns 5-20
More informationMaster Informatics Eng.
Advanced Architectures Master Informatics Eng. 207/8 A.J.Proença The Roofline Performance Model (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 207/8 AJProença, Advanced Architectures,
More informationScalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA
Scalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA Yun R. Qu, Viktor K. Prasanna Ming Hsieh Dept. of Electrical Engineering University of Southern California Los Angeles, CA 90089
More informationOptimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology
Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology EE382C: Embedded Software Systems Final Report David Brunke Young Cho Applied Research Laboratories:
More informationComputer and Hardware Architecture II. Benny Thörnberg Associate Professor in Electronics
Computer and Hardware Architecture II Benny Thörnberg Associate Professor in Electronics Parallelism Microscopic vs Macroscopic Microscopic parallelism hardware solutions inside system components providing
More informationAgenda. Cache-Memory Consistency? (1/2) 7/14/2011. New-School Machine Structures (It s a bit more complicated!)
7/4/ CS 6C: Great Ideas in Computer Architecture (Machine Structures) Caches II Instructor: Michael Greenbaum New-School Machine Structures (It s a bit more complicated!) Parallel Requests Assigned to
More informationModern CPU Architectures
Modern CPU Architectures Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2014 16.04.2014 1 Motivation for Parallelism I CPU History RISC Software GmbH Johannes
More informationPerformance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply
Performance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply University of California, Berkeley Berkeley Benchmarking and Optimization Group (BeBOP) http://bebop.cs.berkeley.edu
More informationLecture 2. Memory locality optimizations Address space organization
Lecture 2 Memory locality optimizations Address space organization Announcements Office hours in EBU3B Room 3244 Mondays 3.00 to 4.00pm; Thurs 2:00pm-3:30pm Partners XSED Portal accounts Log in to Lilliput
More informationCS 426 Parallel Computing. Parallel Computing Platforms
CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:
More informationObjective. We will study software systems that permit applications programs to exploit the power of modern high-performance computers.
CS 612 Software Design for High-performance Architectures 1 computers. CS 412 is desirable but not high-performance essential. Course Organization Lecturer:Paul Stodghill, stodghil@cs.cornell.edu, Rhodes
More informationA Comparison of Capacity Management Schemes for Shared CMP Caches
A Comparison of Capacity Management Schemes for Shared CMP Caches Carole-Jean Wu and Margaret Martonosi Princeton University 7 th Annual WDDD 6/22/28 Motivation P P1 P1 Pn L1 L1 L1 L1 Last Level On-Chip
More informationCache Memories. Cache Memories Oct. 10, Inserting an L1 Cache Between the CPU and Main Memory. General Org of a Cache Memory
5-23 The course that gies CMU its Zip! Topics Cache Memories Oct., 22! Generic cache memory organization! Direct mapped caches! Set associatie caches! Impact of caches on performance Cache Memories Cache
More informationENERGY EFFICIENT PARAMETERIZED FFT ARCHITECTURE. Ren Chen, Hoang Le, and Viktor K. Prasanna
ENERGY EFFICIENT PARAMETERIZED FFT ARCHITECTURE Ren Chen, Hoang Le, and Viktor K. Prasanna Ming Hsieh Department of Electrical Engineering University of Southern California, Los Angeles, USA 989 Email:
More informationCache memories The course that gives CMU its Zip! Cache Memories Oct 11, General organization of a cache memory
5-23 The course that gies CMU its Zip! Cache Memories Oct, 2 Topics Generic cache memory organization Direct mapped caches Set associatie caches Impact of caches on performance Cache memories Cache memories
More informationEfficient Tridiagonal Solvers for ADI methods and Fluid Simulation
Efficient Tridiagonal Solvers for ADI methods and Fluid Simulation Nikolai Sakharnykh - NVIDIA San Jose Convention Center, San Jose, CA September 21, 2010 Introduction Tridiagonal solvers very popular
More informationWhat are Clusters? Why Clusters? - a Short History
What are Clusters? Our definition : A parallel machine built of commodity components and running commodity software Cluster consists of nodes with one or more processors (CPUs), memory that is shared by
More informationTowards a Performance- Portable FFT Library for Heterogeneous Computing
Towards a Performance- Portable FFT Library for Heterogeneous Computing Carlo C. del Mundo*, Wu- chun Feng* *Dept. of ECE, Dept. of CS Virginia Tech Slides Updated: 5/19/2014 Forecast (Problem) AMD Radeon
More informationMemory Hierarchy. Computer Systems Organization (Spring 2017) CSCI-UA 201, Section 3. Instructor: Joanna Klukowska
Memory Hierarchy Computer Systems Organization (Spring 2017) CSCI-UA 201, Section 3 Instructor: Joanna Klukowska Slides adapted from Randal E. Bryant and David R. O Hallaron (CMU) Mohamed Zahran (NYU)
More informationHigh-Performance Computational Electromagnetic Modeling Using Low-Cost Parallel Computers
High-Performance Computational Electromagnetic Modeling Using Low-Cost Parallel Computers July 14, 1997 J Daniel S. Katz (Daniel.S.Katz@jpl.nasa.gov) Jet Propulsion Laboratory California Institute of Technology
More informationA scalable, fixed-shuffling, parallel FFT butterfly processing architecture for SDR environment
LETTER IEICE Electronics Express, Vol.11, No.2, 1 9 A scalable, fixed-shuffling, parallel FFT butterfly processing architecture for SDR environment Ting Chen a), Hengzhu Liu, and Botao Zhang College of
More information211: Computer Architecture Summer 2016
211: Computer Architecture Summer 2016 Liu Liu Topic: Assembly Programming Storage - Assembly Programming: Recap - Call-chain - Factorial - Storage: - RAM - Caching - Direct - Mapping Rutgers University
More informationPerformance Issues in Parallelization. Saman Amarasinghe Fall 2010
Performance Issues in Parallelization Saman Amarasinghe Fall 2010 Today s Lecture Performance Issues of Parallelism Cilk provides a robust environment for parallelization It hides many issues and tries
More informationA 50Mvertices/s Graphics Processor with Fixed-Point Programmable Vertex Shader for Mobile Applications
A 50Mvertices/s Graphics Processor with Fixed-Point Programmable Vertex Shader for Mobile Applications Ju-Ho Sohn, Jeong-Ho Woo, Min-Wuk Lee, Hye-Jung Kim, Ramchan Woo, Hoi-Jun Yoo Semiconductor System
More informationMemory Hierarchy. Cache Memory Organization and Access. General Cache Concept. Example Memory Hierarchy Smaller, faster,
Memory Hierarchy Computer Systems Organization (Spring 2017) CSCI-UA 201, Section 3 Cache Memory Organization and Access Instructor: Joanna Klukowska Slides adapted from Randal E. Bryant and David R. O
More informationDesign and Evaluation of I/O Strategies for Parallel Pipelined STAP Applications
Design and Evaluation of I/O Strategies for Parallel Pipelined STAP Applications Wei-keng Liao Alok Choudhary ECE Department Northwestern University Evanston, IL Donald Weiner Pramod Varshney EECS Department
More informationHow to Write Fast Numerical Code
How to Write Fast Numerical Code Lecture: Memory hierarchy, locality, caches Instructor: Markus Püschel TA: Alen Stojanov, Georg Ofenbeck, Gagandeep Singh Organization Temporal and spatial locality Memory
More information3D ADI Method for Fluid Simulation on Multiple GPUs. Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA
3D ADI Method for Fluid Simulation on Multiple GPUs Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA Introduction Fluid simulation using direct numerical methods Gives the most accurate result Requires
More informationComparative Performance Analysis of Parallel Beamformers
1999, HCS Research Lab. All Rights Reserved. Comparative Performance Analysis of Parallel Beamformers Keonwook Kim, Alan D. George and Priyabrata Sinha HCS Research Lab, Electrical and Computer Engineering
More informationMotivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism
Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the
More informationStorage I/O Summary. Lecture 16: Multimedia and DSP Architectures
Storage I/O Summary Storage devices Storage I/O Performance Measures» Throughput» Response time I/O Benchmarks» Scaling to track technological change» Throughput with restricted response time is normal
More informationMemory Systems and Performance Engineering. Fall 2009
Memory Systems and Performance Engineering Fall 2009 Basic Caching Idea A. Smaller memory faster to access B. Use smaller memory to cache contents of larger memory C. Provide illusion of fast larger memory
More informationParallel Exact Inference on the Cell Broadband Engine Processor
Parallel Exact Inference on the Cell Broadband Engine Processor Yinglong Xia and Viktor K. Prasanna {yinglonx, prasanna}@usc.edu University of Southern California http://ceng.usc.edu/~prasanna/ SC 08 Overview
More informationMultiprocessors. Flynn Taxonomy. Classifying Multiprocessors. why would you want a multiprocessor? more is better? Cache Cache Cache.
Multiprocessors why would you want a multiprocessor? Multiprocessors and Multithreading more is better? Cache Cache Cache Classifying Multiprocessors Flynn Taxonomy Flynn Taxonomy Interconnection Network
More informationAdvanced Computer Architecture
18-742 Advanced Computer Architecture Test 2 April 14, 1998 Name (please print): Instructions: DO NOT OPEN TEST UNTIL TOLD TO START YOU HAVE UNTIL 12:20 PM TO COMPLETE THIS TEST The exam is composed of
More informationUniprocessors. HPC Fall 2012 Prof. Robert van Engelen
Uniprocessors HPC Fall 2012 Prof. Robert van Engelen Overview PART I: Uniprocessors and Compiler Optimizations PART II: Multiprocessors and Parallel Programming Models Uniprocessors Processor architectures
More informationExploration of Cache Coherent CPU- FPGA Heterogeneous System
Exploration of Cache Coherent CPU- FPGA Heterogeneous System Wei Zhang Department of Electronic and Computer Engineering Hong Kong University of Science and Technology 1 Outline ointroduction to FPGA-based
More informationLow-Power Interconnection Networks
Low-Power Interconnection Networks Li-Shiuan Peh Associate Professor EECS, CSAIL & MTL MIT 1 Moore s Law: Double the number of transistors on chip every 2 years 1970: Clock speed: 108kHz No. transistors:
More informationQR Decomposition on GPUs
QR Decomposition QR Algorithms Block Householder QR Andrew Kerr* 1 Dan Campbell 1 Mark Richards 2 1 Georgia Tech Research Institute 2 School of Electrical and Computer Engineering Georgia Institute of
More informationParallelising Pipelined Wavefront Computations on the GPU
Parallelising Pipelined Wavefront Computations on the GPU S.J. Pennycook G.R. Mudalige, S.D. Hammond, and S.A. Jarvis. High Performance Systems Group Department of Computer Science University of Warwick
More informationHigh Performance MPI on IBM 12x InfiniBand Architecture
High Performance MPI on IBM 12x InfiniBand Architecture Abhinav Vishnu, Brad Benton 1 and Dhabaleswar K. Panda {vishnu, panda} @ cse.ohio-state.edu {brad.benton}@us.ibm.com 1 1 Presentation Road-Map Introduction
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology
More informationAn Adaptive Framework for Scientific Software Libraries. Ayaz Ali Lennart Johnsson Dept of Computer Science University of Houston
An Adaptive Framework for Scientific Software Libraries Ayaz Ali Lennart Johnsson Dept of Computer Science University of Houston Diversity of execution environments Growing complexity of modern microprocessors.
More informationEE/CSCI 451 Midterm 1
EE/CSCI 451 Midterm 1 Spring 2018 Instructor: Xuehai Qian Friday: 02/26/2018 Problem # Topic Points Score 1 Definitions 20 2 Memory System Performance 10 3 Cache Performance 10 4 Shared Memory Programming
More informationCache Memories. Topics. Next time. Generic cache memory organization Direct mapped caches Set associative caches Impact of caches on performance
Cache Memories Topics Generic cache memory organization Direct mapped caches Set associative caches Impact of caches on performance Next time Dynamic memory allocation and memory bugs Fabián E. Bustamante,
More informationAll MSEE students are required to take the following two core courses: Linear systems Probability and Random Processes
MSEE Curriculum All MSEE students are required to take the following two core courses: 3531-571 Linear systems 3531-507 Probability and Random Processes The course requirements for students majoring in
More informationHPCC Results. Nathan Wichmann Benchmark Engineer
HPCC Results Nathan Wichmann Benchmark Engineer Outline What is HPCC? Results Comparing current machines Conclusions May 04 2 HPCChallenge Project Goals To examine the performance of HPC architectures
More informationCHAPTER 5 PROPAGATION DELAY
98 CHAPTER 5 PROPAGATION DELAY Underwater wireless sensor networks deployed of sensor nodes with sensing, forwarding and processing abilities that operate in underwater. In this environment brought challenges,
More informationLarge and Fast: Exploiting Memory Hierarchy
CSE 431: Introduction to Operating Systems Large and Fast: Exploiting Memory Hierarchy Gojko Babić 10/5/018 Memory Hierarchy A computer system contains a hierarchy of storage devices with different costs,
More informationEE/CSCI 451: Parallel and Distributed Computation
EE/CSCI 451: Parallel and Distributed Computation Lecture #8 2/7/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline From last class
More informationComputer Architecture A Quantitative Approach, Fifth Edition. Chapter 1. Copyright 2012, Elsevier Inc. All rights reserved. Computer Technology
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology
More information