Energy Efficient Adaptive Beamforming on Sensor Networks

1 Energy Efficient Adaptive Beamforming on Sensor Networks
Viktor K. Prasanna
Bhargava Gundala, Mitali Singh
Dept. of EE-Systems, University of Southern California

2 Outline
- Problem Definition
- Computational Characteristics
- Prior Solution
- Power Optimizations
  - Sensor Node Level
  - Inter Node Level
- Challenges/Discussion

3 Problem Scenario
[Figure: energy constrained network; labels: Passive, Active]

4 Beamforming
Definition: a technique that spatially filters the signals received from an array of sensors and estimates the spatial features of the sources.
Procedure:
1. Passively and repeatedly sample the signals of the acoustic propagation wave field.
2. Linearly combine the input data with a weight matrix to form a sonar beam for a particular direction of look.
Adaptive sonar beamforming:
- For high SNR and high resolution
- Time-varying signal and noise properties are included in the derivation of the weights, so the weights adapt accordingly
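Step 2 of the procedure above can be sketched in a few lines. This is a minimal illustration, assuming a uniform line array, a narrowband plane-wave source and conventional (non-adaptive) weights; the function names, the half-wavelength spacing and the angles are hypothetical choices and do not come from the slides.

```python
import cmath
import math


def steering_vector(num_sensors, spacing_wavelengths, angle_rad):
    """Phase seen at each sensor of a uniform line array for a plane wave
    arriving from angle_rad (assumed array geometry, not from the slides)."""
    return [
        cmath.exp(-2j * math.pi * spacing_wavelengths * n * math.sin(angle_rad))
        for n in range(num_sensors)
    ]


def conventional_weights(num_sensors, spacing_wavelengths, look_angle_rad):
    """Non-adaptive weights: the steering vector for the look direction,
    scaled so that a unit signal from that direction gives unit output."""
    a = steering_vector(num_sensors, spacing_wavelengths, look_angle_rad)
    return [v / num_sensors for v in a]


def beam_output(weights, snapshot):
    """Linear combination of the sensor samples: y = w^H x."""
    return sum(w.conjugate() * x for w, x in zip(weights, snapshot))


if __name__ == "__main__":
    # One noiseless snapshot from a source at 20 degrees.
    snapshot = steering_vector(8, 0.5, math.radians(20))
    for look_deg in (20, -40):
        w = conventional_weights(8, 0.5, math.radians(look_deg))
        print(look_deg, abs(beam_output(w, snapshot)))
```

Looking toward the source gives an output magnitude of 1, while looking elsewhere gives a much smaller value. An adaptive beamformer would replace `conventional_weights` with weights derived from the measured signal and noise statistics.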

5 Space Time Adaptive Processing
[Figure: data cube for each CPI (Coherent Processing Interval) with axes Elements (1 to N), Range gates (1 to L) and PRIs (Pulse Repetition Intervals, M per CPI), feeding Target Detection]

6 MITRE RT_STAP Benchmark
[Figure: input data cube with L = 1920, N = 22, M = 64, processed by Preprocessing Step 1, Preprocessing Step 2, Doppler Processing, Weight Computation and Weight Application]
T_latency = msec and T_period = msec

7 Input Data Cube
[Figure: data cube with axes Elements (N = 22), Range Gates (L = 1920), PRIs (M = 64)]

8 Sonar Signal Processing
[Figure: beamforming variants: conventional and adaptive beamforming, element space and beam space, time domain and frequency domain (FFT followed by adaptive beamforming)]
- Sampling rate: 10 Hz to 25 kHz
- Output rate: 1 Hz to 100 Hz
- 100 to 5000 beams per output

9 An Example Adaptive Beamformer
MVDR (Minimum Variance Distortionless Response)
[Figure: processing pipeline with stages FFT, Corner Turn, Covariance, Factorization, Steering, Linear Solver & Beamformer; dimensions: N channels, F frequency bins, B beams per bin]
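The covariance, linear solver and beamformer stages named above can be illustrated with the standard MVDR weight formula, w = R^-1 a / (a^H R^-1 a), for a single frequency bin. This is a sketch under stated assumptions: the slides do not give the formula or an implementation, plain Gaussian elimination stands in for the factorization stage, and all function names and the array geometry are hypothetical.

```python
import cmath
import math


def steering_vector(num_sensors, spacing_wavelengths, angle_rad):
    """Steering vector of a uniform line array (assumed geometry)."""
    return [
        cmath.exp(-2j * math.pi * spacing_wavelengths * n * math.sin(angle_rad))
        for n in range(num_sensors)
    ]


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0j] * n
    for r in range(n - 1, -1, -1):
        acc = aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = acc / aug[r][r]
    return x


def mvdr_weights(covariance, steering):
    """w = R^-1 a / (a^H R^-1 a): minimum output power subject to
    unit (distortionless) response in the look direction."""
    r_inv_a = solve(covariance, steering)
    denom = sum(a.conjugate() * v for a, v in zip(steering, r_inv_a))
    return [v / denom for v in r_inv_a]


def response(weights, direction_vector):
    """Beamformer response w^H v to a signal with spatial signature v."""
    return sum(w.conjugate() * v for w, v in zip(weights, direction_vector))
```

With a covariance matrix made of unit noise plus a strong interferer, `response(w, a)` stays at 1 for the look direction while the response toward the interferer is driven close to zero.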

10 Computational Characteristics
[Figure: data flows from the initial data layout through subproblems S1, S2, S3, S4 to the outputs]
- Overall processing consists of a sequence of subproblems
- Computational requirements are different for each subproblem
- A large amount of data is repeatedly processed in real time
- Data access patterns change from subproblem to subproblem
- Throughput and latency performance requirements

11 Adaptive Processing
Key problems:
- Doppler Processing (FFT)
- Weight Computation (covariance matrix factorization)
- Weight Application (matrix-vector product)
[Figure: data cube with Elements (N = 22), Range Gates (L = 1920), PRIs (M = 64); labels: apply, adaptation]

12 Prior Solution
Architecture = tightly coupled collection of processors
[Figure: processors connected by a high-bandwidth, low-latency network, performing target detection]

13 Key Issue: Communication Cost
Coarse-grain machines: powerful processing nodes
- SP-2 typical configuration: 640 Mflops/node, 64 MB 4 GB Memory, GB Internal Disk
- T3E typical configuration: 1200 Mflops/node (T3E-1200)
  - Local memory access time: 87 ~ 253 nsec
  - Global memory access time: 1 ~ 2 µsec (SHMEM)
Large software overhead for message transfer
- SP-2: ~39 µsec overhead/message using MPL/MPI, ~9 nsec/byte/node transfer rate
- Local memory access: 100s of nsec

14 Key Idea: Data Remapping
[Figure: data access patterns across processors P0 to P3 for subproblems S1, S2, S3, with a possible remap between consecutive subproblems]
Benefits of remapping must exceed the overhead
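The remapping criterion above can be written as a simple cost comparison. The sketch below is illustrative only: it assumes a linear message cost model (fixed overhead plus a per-byte cost) using the SP-2 figures quoted on the communication cost slide, and the function names and the stage timings passed in are hypothetical.

```python
def message_time_us(num_bytes, overhead_us=39.0, ns_per_byte=9.0):
    """Time for one message in microseconds under an assumed linear model:
    fixed software overhead plus a per-byte transfer cost."""
    return overhead_us + num_bytes * ns_per_byte / 1000.0


def should_remap(time_without_remap_us, time_with_remap_us,
                 remap_messages, bytes_per_message):
    """Remap only if the time saved in the next subproblem exceeds the
    cost of the messages needed to redistribute the data."""
    remap_cost_us = remap_messages * message_time_us(bytes_per_message)
    saved_us = time_without_remap_us - time_with_remap_us
    return saved_us > remap_cost_us


if __name__ == "__main__":
    # Hypothetical stage timings: remapping saves 500 us and costs 3 messages.
    print(should_remap(1000.0, 500.0, remap_messages=3, bytes_per_message=1000))
```

In this model three 1000-byte messages cost 144 µsec, so a remap that saves 500 µsec pays off while one that saves only 100 µsec does not.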

15 Impact of Data Remapping: Our Results
- Results reported in IPPS 95
- Implementation performed on IBM SP-2 at MHPCC
- Code developed using C, MPI and ESSL

16 Lessons Learnt
Objective: adaptive beamforming on parallel machines
- Task level parallelism
- Minimize communication cost
- Data remapping

17 Energy Efficiency
Power is critical and must be conserved
- Reduce power dissipation at sensor node level
  - Energy efficient algorithms
- Decrease power dissipation at inter-node level
  - Optimize communication cost between sensors
[Figure: energy constrained network of sensors]

18 Power Model for a Processing Element
[Figure: processor with functional units (FU) and cache, connected to memory; frequency controls f_p and f_b]
$P_{\text{total}} = P_{\text{processor}} + P_{\text{data bus}} + P_{\text{memory}}$
$P_{\text{unit}} = P_{\text{dynamic}} + P_{\text{static}} = 0.5\, f(n)\, C V^2 f_{\text{active}} + V I_{\text{leakage}}$
$f_{\max} \propto (V - V_t)/V$
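The per-unit power model above can be evaluated numerically. The sketch below assumes that f(n) acts as a dimensionless switching-activity factor, which the slide does not state; all function and parameter names are hypothetical and the values in the usage section are made up for illustration.

```python
def dynamic_power(activity, capacitance_f, voltage_v, frequency_hz):
    """Dynamic term 0.5 * f(n) * C * V^2 * f_active, in watts.
    'activity' stands in for f(n) (assumed to be a switching-activity factor)."""
    return 0.5 * activity * capacitance_f * voltage_v ** 2 * frequency_hz


def static_power(voltage_v, leakage_a):
    """Static term V * I_leakage, in watts."""
    return voltage_v * leakage_a


def unit_power(activity, capacitance_f, voltage_v, frequency_hz, leakage_a):
    """Total power of one unit: dynamic plus static."""
    return (dynamic_power(activity, capacitance_f, voltage_v, frequency_hz)
            + static_power(voltage_v, leakage_a))


def relative_max_frequency(voltage_v, threshold_v):
    """Quantity that f_max is proportional to: (V - V_t) / V."""
    return (voltage_v - threshold_v) / voltage_v


if __name__ == "__main__":
    # Made-up operating points: halving V cuts dynamic power by 4x
    # but also lowers the attainable clock frequency.
    for v in (2.0, 1.0):
        print(v, dynamic_power(1.0, 1e-9, v, 1e8), relative_max_frequency(v, 0.5))
```

The quadratic dependence on V is the reason voltage scaling is attractive for energy, at the price of a lower maximum frequency.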

19 Reduce Processor-Memory Data Traffic
Instructions for memory access consume a lot of power

Instruction (Intel 486DX2)    Energy (10^-8 Joules)
MOV DX, BX
MOV DX, [BX]
MOV [BX], DX

Reduce the number of memory accesses
- Reduce cache misses
- High data reuse in cache
- Use registers
Reduce power consumed on the data bus

20 Example: Matrix Multiplication
Cache size = n
[Figure: A = B x C, where row i of B and column j of C produce A[i,j]]
Do i = 0 ;
  Do j = 0 ;
    A[i,j] ← 0 ;
    Do k = 0 ;
      A[i,j] ← A[i,j] + B[i,k] x C[k,j] ;
      k++ ;
    j++ ;
  i++ ;
$\text{Energy} = \alpha n^3 + \beta (n + n^2) n + \gamma (3n^2) \approx (\alpha + \beta) n^3$
Time = $n^3$ + lower order terms
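The loop nest and the energy expression above can be checked with a small runnable version. This is a sketch, assuming that α is the energy per multiply-add, β the energy per word moved over the bus and γ the energy per word accessed in main memory; the slide does not define these constants, and the function names are hypothetical.

```python
def naive_multiply(b, c):
    """Triple-loop product A = B x C for square matrices given as nested
    lists. Returns the result and the number of multiply-add operations."""
    n = len(b)
    a = [[0] * n for _ in range(n)]
    operations = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                a[i][j] += b[i][k] * c[k][j]
                operations += 1
    return a, operations


def modeled_energy(n, alpha, beta, gamma):
    """Energy = alpha*n^3 + beta*(n + n^2)*n + gamma*(3*n^2),
    using the assumed meaning of alpha, beta and gamma."""
    operations = n ** 3
    # With a cache of size n, each of the n values of i brings in one
    # row of B (n words) and all of C (n^2 words).
    bus_words = (n + n ** 2) * n
    memory_words = 3 * n ** 2
    return alpha * operations + beta * bus_words + gamma * memory_words


if __name__ == "__main__":
    product, ops = naive_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(product, ops, modeled_energy(2, 1.0, 1.0, 1.0))
```

For large n the n^3 terms dominate, which is where the approximation (α + β)n^3 comes from.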

21 Optimization I: Reduce Bus Traffic
Block Matrix Multiply
[Figure: block multiplication of n x n matrices]
$\text{Energy} = \alpha n^3 + 2\beta (n \cdot n^{1/2}) n + \gamma (3n^2)$
Time = $n^3$ + lower order terms
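The bus-traffic term in the blocked energy expression can be reproduced by counting block loads. The sketch below assumes square blocks whose area equals the cache size (block side √n for a cache of n words) and counts two block loads per block multiplication; this accounting and the function name are assumptions made for illustration.

```python
def blocked_multiply(b, c, block):
    """Blocked product A = B x C for square nested-list matrices.
    Returns the result and the modeled number of words moved over the bus,
    counting one block of B and one block of C per block multiplication."""
    n = len(b)
    assert n % block == 0, "block size must divide the matrix size"
    a = [[0] * n for _ in range(n)]
    bus_words = 0
    for i0 in range(0, n, block):
        for j0 in range(0, n, block):
            for k0 in range(0, n, block):
                bus_words += 2 * block * block
                for i in range(i0, i0 + block):
                    for j in range(j0, j0 + block):
                        acc = a[i][j]
                        for k in range(k0, k0 + block):
                            acc += b[i][k] * c[k][j]
                        a[i][j] = acc
    return a, bus_words


if __name__ == "__main__":
    n = 4
    b = [[i * n + j for j in range(n)] for i in range(n)]
    identity = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    # Cache size n = 4 words, so the block side is sqrt(4) = 2.
    result, bus_words = blocked_multiply(b, identity, block=2)
    print(result == b, bus_words)
```

For n = 4 the count is 64 words, which equals 2(n · n^{1/2})n, compared with (n + n^2)n = 80 words for the unblocked loop in the same model.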

22 Optimization II: Reduce Peak Bus Bandwidth
[Figure: matrices A, B and C with dimension labels n, and a formula relating Data, Time, Bus Data Rate and Processor Rate]

23 Optimization III: Application-Directed Data Layouts
- Applications have different data access patterns
  - Matrices accessed by rows, columns, diagonals, sub-squares
  - Tree structures accessed along paths, sub-trees
- Naive data layouts degrade performance
  - Large working sets cause capacity misses
  - Improper alignment in memory causes conflict misses
[Figure: a 4 x 4 matrix a(0,0) ... a(3,3) mapped to Page 0 through Page 3 under a row-major layout and under a block layout]
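The difference between the two layouts in the figure can be made concrete by computing which page each matrix element lands on. This is a sketch assuming a 4 x 4 matrix, pages of 4 elements and 2 x 2 blocks for the block layout; the slide does not state the block shape, and the function names are hypothetical.

```python
def row_major_page(i, j, n, page_size):
    """Page holding element (i, j) when the n x n matrix is stored row by row."""
    return (i * n + j) // page_size


def block_page(i, j, n, block, page_size):
    """Page holding element (i, j) when the matrix is stored block by block,
    with blocks ordered row-major and elements row-major inside each block."""
    blocks_per_row = n // block
    block_index = (i // block) * blocks_per_row + (j // block)
    offset_in_block = (i % block) * block + (j % block)
    return (block_index * block * block + offset_in_block) // page_size


def pages_touched(cells, page_of):
    """Set of pages needed to access the given (i, j) cells."""
    return {page_of(i, j) for i, j in cells}


if __name__ == "__main__":
    sub_square = [(0, 0), (0, 1), (1, 0), (1, 1)]
    print(pages_touched(sub_square, lambda i, j: row_major_page(i, j, 4, 4)))
    print(pages_touched(sub_square, lambda i, j: block_page(i, j, 4, 2, 4)))
```

Accessing a 2 x 2 sub-square touches two pages under the row-major layout but only one under the block layout, which is the kind of access-pattern dependence the slide describes.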

24 Cache Friendly Algorithms
Cache friendly:
- High data reuse
- Low cache pollution
- Regular access patterns
Data layouts:
- Static data layouts (Matrix Multiply)
- Dynamic data layouts (FFT)

25 Fast Fourier Transform
DFT: Cooley-Tukey Algorithm
- Compute DFT of size N = N1 * N2
  - Step 1: compute N2 DFTs of size N1
  - Step 2: multiply by twiddle factors
  - Step 3: compute N1 DFTs of size N2
- Divide and conquer recursively
Current Approach
- MIT FFTW
  - Determine optimal factorization
  - Perform low-level optimizations for kernels
  - Construct larger size FFTs from kernels
- Key assumption
  - All DFTs of the same size have the same execution time
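The three Cooley-Tukey steps listed above can be sketched directly. This is a minimal, non-recursive illustration that uses a direct O(N^2) DFT for the sub-transforms; a real library would recurse and use optimized kernels. The function names are hypothetical.

```python
import cmath


def dft(x):
    """Direct DFT, used here as the small-size kernel and as a reference."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def cooley_tukey_dft(x, n1, n2):
    """DFT of size N = n1 * n2 via one level of Cooley-Tukey decomposition."""
    n = n1 * n2
    assert len(x) == n
    # Step 1: n2 DFTs of size n1, each over a stride-n2 subsequence of x.
    inner = [dft(x[r::n2]) for r in range(n2)]
    # Step 2: multiply by the twiddle factors W_N^(r * k1).
    for r in range(n2):
        for k1 in range(n1):
            inner[r][k1] *= cmath.exp(-2j * cmath.pi * r * k1 / n)
    # Step 3: n1 DFTs of size n2 across the results of step 1.
    out = [0j] * n
    for k1 in range(n1):
        column = dft([inner[r][k1] for r in range(n2)])
        for k2 in range(n2):
            out[k1 + n1 * k2] = column[k2]
    return out


if __name__ == "__main__":
    signal = [complex(i % 5, -i) for i in range(12)]
    error = max(abs(p - q) for p, q in zip(dft(signal), cooley_tukey_dft(signal, 3, 4)))
    print(error)
```

Note that step 1 reads its inputs with stride n2 (`x[r::n2]`), which is the strided access pattern that the following slides identify as a cost.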

26 Problem with Current Approach
All N-point DFTs do not have the same cost!
- Different data access patterns with various strides
- Stride affects execution time
[Figure: 32-point FFT with strided access, experimental results; x-axis: Stride (2^s), y-axis: Execution Time (usec); Sun Ultra 1: 167 MHz, L2 Cache = 512 KB = 32 K points]

27 Our Approach
- Reorganize the input data layout to change non-unit stride to unit stride
- Dynamic Data Layout: perform data reorganization during computation
[Figure: N2 N1-point FFTs, Data Reorganization, N1 N2-point FFTs]
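The stride-to-unit-stride reorganization can be illustrated on a flat array. The sketch below assumes the reorganization is a transpose-like permutation that makes each stride-N2 subsequence contiguous; the slides do not specify the exact permutation or when it is applied, and the function name is hypothetical.

```python
def reorganize_to_unit_stride(x, n1, n2):
    """Permute x (length n1 * n2) so that the n2 subsequences x[r::n2],
    each of length n1, become contiguous runs in the output."""
    assert len(x) == n1 * n2
    return [x[n2 * a + r] for r in range(n2) for a in range(n1)]


if __name__ == "__main__":
    data = list(range(12))
    reorganized = reorganize_to_unit_stride(data, n1=3, n2=4)
    print(reorganized)
    # Each size-3 sub-transform input is now a unit-stride slice.
    print(reorganized[0:3], data[0::4])
```

After the permutation, the r-th sub-transform reads `reorganized[r * n1:(r + 1) * n1]` instead of `data[r::n2]`, so each small FFT sees unit-stride input at the cost of one pass over the data.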

28 Example
[Figure: decomposition trees for a 1024*1024 point FFT, FFTW and USC approach, with execution times in ms]
54.96% improvement over the state-of-the-art FFTW package on DEC Alpha

29 Other Techniques for Node Level Power Optimizations?
- Voltage-frequency scaling: $f_{\max} \propto (V - V_t)/V$
- Power management (idle/sleep/active states)
- Reduce precision
- Clock gating

Instruction (Fujitsu Sparc 934)    Energy (10^-8 Joules)
OR
MUL

30 Current Work
- Development and verification of techniques proposed for power optimization
  - Existing simulators
    - SimplePower (based on SimpleScalar architecture)
    - JouleTrack (code length limitations)
  - Board-level power measurements
    - Brutus Evaluation Board (SA-1100)
- Build a functional-level power simulation
  - Fast, with acceptable level of accuracy
  - Develop a multiprocessor power model

31 Space Time Representation
Compute results in each block (for N x N matrices, c = cache size)
- Schedule blocks row-major
- $N^2/c$ steps
- Data per step: $N\sqrt{c}$
- Operations per step: $Nc$
- Data reuse per step: $\sqrt{c}$
- Total traffic: $(N^2/c) \cdot N\sqrt{c} = N^3/\sqrt{c}$
[Figure: blocks A_11, A_12, ..., A_1N and B_11, B_12, ..., B_1N; a marked cell denotes the computation for result (i,j)]

32 Theorem
Unidirectional space-time representation leads to cache friendly algorithms => energy efficient algorithms

33 Network Level Energy Optimization
- Computation cost is much lower than communication cost
- Radio interface consumes a large amount of power

Power consumed (WINS sensor node):
Transmission (100 m)    600 mW (at 100 kbits/sec)
Reception               300 mW
Processor (SA1100)      250 MIPS/watt

- Energy to transfer 32 bits over 100 m in a WINS sensor node = ((600 + 300) mW / 100 kbits/s) x 32 = 288 x 10^-6 Joules
- Energy to execute a 32-bit instruction using the SA1100 processor = 1 / (250 MIPS/watt) = 0.004 x 10^-6 Joules
- Additional overhead for bits added for error correction
- Retransmissions are frequent due to unreliable links (e.g. wireless)
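The arithmetic behind the two energy figures above can be reproduced as follows. This is a sketch using the numbers quoted on the slide (600 mW transmit, 300 mW receive, 100 kbits/s, 250 MIPS/watt); the function names are hypothetical, and error-correction bits and retransmissions are ignored.

```python
def radio_energy_joules(num_bits, tx_power_w=0.6, rx_power_w=0.3,
                        bit_rate_bps=100e3):
    """Energy to move num_bits across one link, counting both the
    transmitter and the receiver for the duration of the transfer."""
    transfer_time_s = num_bits / bit_rate_bps
    return (tx_power_w + rx_power_w) * transfer_time_s


def instruction_energy_joules(mips_per_watt=250.0):
    """Energy per instruction for a processor rated in MIPS per watt."""
    instructions_per_joule = mips_per_watt * 1e6
    return 1.0 / instructions_per_joule


if __name__ == "__main__":
    radio = radio_energy_joules(32)
    cpu = instruction_energy_joules()
    print(radio, cpu, radio / cpu)
```

Under these figures, sending one 32-bit word costs about as much energy as executing 72,000 instructions, which is the motivation for trading computation for communication.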

34 Reduce Communication Cost
- Exploit data redundancy to reduce data traffic
- Improve locality of computation while assigning subtasks to nodes
- Communication limited to closely placed nodes
  - Larger distance requires higher transmission power
  - Larger distance reduces reliability of the link

35 Network Level Power Optimization Issues
- Topology of network is unknown
- Estimation of communication cost
- Task allocation
- Broadcast communication model
Need: framework for energy efficient computation in ad hoc networks

Module 5: Performance Issues in Shared Memory and Introduction to Coherence Lecture 9: Performance Issues in Shared Memory. The Lecture Contains:

Module 5: Performance Issues in Shared Memory and Introduction to Coherence Lecture 9: Performance Issues in Shared Memory. The Lecture Contains: The Lecture Contains: Data Access and Communication Data Access Artifactual Comm. Capacity Problem Temporal Locality Spatial Locality 2D to 4D Conversion Transfer Granularity Worse: False Sharing Contention

More information

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System

Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM

More information

Effect of memory latency

Effect of memory latency CACHE AWARENESS Effect of memory latency Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100 ns. Assume that the processor has two ALU units and it is capable

More information

Energy Optimizations for FPGA-based 2-D FFT Architecture

Energy Optimizations for FPGA-based 2-D FFT Architecture Energy Optimizations for FPGA-based 2-D FFT Architecture Ren Chen and Viktor K. Prasanna Ming Hsieh Department of Electrical Engineering University of Southern California Ganges.usc.edu/wiki/TAPAS Outline

More information

High Throughput Energy Efficient Parallel FFT Architecture on FPGAs

High Throughput Energy Efficient Parallel FFT Architecture on FPGAs High Throughput Energy Efficient Parallel FFT Architecture on FPGAs Ren Chen Ming Hsieh Department of Electrical Engineering University of Southern California Los Angeles, USA 989 Email: renchen@usc.edu

More information

3.2 Cache Oblivious Algorithms

3.2 Cache Oblivious Algorithms 3.2 Cache Oblivious Algorithms Cache-Oblivious Algorithms by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science,

More information

The course that gives CMU its Zip! Memory System Performance. March 22, 2001

The course that gives CMU its Zip! Memory System Performance. March 22, 2001 15-213 The course that gives CMU its Zip! Memory System Performance March 22, 2001 Topics Impact of cache parameters Impact of memory reference patterns memory mountain range matrix multiply Basic Cache

More information

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Accelerating PageRank using Partition-Centric Processing Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Outline Introduction Partition-centric Processing Methodology Analytical Evaluation

More information

DISTRIBUTED PARALLEL PROCESSING TECHNIQUES FOR ADAPTIVE SONAR BEAMFORMING

DISTRIBUTED PARALLEL PROCESSING TECHNIQUES FOR ADAPTIVE SONAR BEAMFORMING 2000, HCS Research Lab. All Rights Reserved. DISTRIBUTED PARALLEL PROCESSING TECHNIQUES FOR ADAPTIVE SONAR BEAMFORMING ALAN D. GEORGE, JESUS GARCIA, KEONWOOK KIM, and PRIYABRATA SINHA High-performance

More information

Lecture 15: Caches and Optimization Computer Architecture and Systems Programming ( )

Lecture 15: Caches and Optimization Computer Architecture and Systems Programming ( ) Systems Group Department of Computer Science ETH Zürich Lecture 15: Caches and Optimization Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Last time Program

More information

Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok

Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok Texas Learning and Computation Center Department of Computer Science University of Houston Outline Motivation

More information

Cache-oblivious Programming

Cache-oblivious Programming Cache-oblivious Programming Story so far We have studied cache optimizations for array programs Main transformations: loop interchange, loop tiling Loop tiling converts matrix computations into block matrix

More information

REAL TIME OPERATING SYSTEMS. Lesson-15:

REAL TIME OPERATING SYSTEMS. Lesson-15: REAL TIME OPERATING SYSTEMS Lesson-15: Power Optimization 1 1. Memory Optimization 2 Power Optimization Saving power and energy requirement for a given set of codes, while finishing instructions in the

More information

LINPACK Benchmark. on the Fujitsu AP The LINPACK Benchmark. Assumptions. A popular benchmark for floating-point performance. Richard P.

LINPACK Benchmark. on the Fujitsu AP The LINPACK Benchmark. Assumptions. A popular benchmark for floating-point performance. Richard P. 1 2 The LINPACK Benchmark on the Fujitsu AP 1000 Richard P. Brent Computer Sciences Laboratory The LINPACK Benchmark A popular benchmark for floating-point performance. Involves the solution of a nonsingular

More information

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating

More information

Double-Precision Matrix Multiply on CUDA

Double-Precision Matrix Multiply on CUDA Double-Precision Matrix Multiply on CUDA Parallel Computation (CSE 60), Assignment Andrew Conegliano (A5055) Matthias Springer (A995007) GID G--665 February, 0 Assumptions All matrices are square matrices

More information

Systems Programming and Computer Architecture ( ) Timothy Roscoe

Systems Programming and Computer Architecture ( ) Timothy Roscoe Systems Group Department of Computer Science ETH Zürich Systems Programming and Computer Architecture (252-0061-00) Timothy Roscoe Herbstsemester 2016 AS 2016 Caches 1 16: Caches Computer Architecture

More information

EE/CSCI 451: Parallel and Distributed Computation

EE/CSCI 451: Parallel and Distributed Computation EE/CSCI 451: Parallel and Distributed Computation Lecture #2 1/17/2017 Xuehai Qian xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline Opportunities

More information

Parallelism in Spiral

Parallelism in Spiral Parallelism in Spiral Franz Franchetti and the Spiral team (only part shown) Electrical and Computer Engineering Carnegie Mellon University Joint work with Yevgen Voronenko Markus Püschel This work was

More information

ECE 669 Parallel Computer Architecture

ECE 669 Parallel Computer Architecture ECE 669 Parallel Computer Architecture Lecture 9 Workload Evaluation Outline Evaluation of applications is important Simulation of sample data sets provides important information Working sets indicate

More information

Optimizing Cache Performance in Matrix Multiplication. UCSB CS240A, 2017 Modified from Demmel/Yelick s slides

Optimizing Cache Performance in Matrix Multiplication. UCSB CS240A, 2017 Modified from Demmel/Yelick s slides Optimizing Cache Performance in Matrix Multiplication UCSB CS240A, 2017 Modified from Demmel/Yelick s slides 1 Case Study with Matrix Multiplication An important kernel in many problems Optimization ideas

More information

High-Performance Packet Classification on GPU

High-Performance Packet Classification on GPU High-Performance Packet Classification on GPU Shijie Zhou, Shreyas G. Singapura, and Viktor K. Prasanna Ming Hsieh Department of Electrical Engineering University of Southern California 1 Outline Introduction

More information

CISC 360. Cache Memories Nov 25, 2008

CISC 360. Cache Memories Nov 25, 2008 CISC 36 Topics Cache Memories Nov 25, 28 Generic cache memory organization Direct mapped caches Set associative caches Impact of caches on performance Cache Memories Cache memories are small, fast SRAM-based

More information

Scientific Computing. Some slides from James Lambers, Stanford

Scientific Computing. Some slides from James Lambers, Stanford Scientific Computing Some slides from James Lambers, Stanford Dense Linear Algebra Scaling and sums Transpose Rank-one updates Rotations Matrix vector products Matrix Matrix products BLAS Designing Numerical

More information

Optimum Array Processing

Optimum Array Processing Optimum Array Processing Part IV of Detection, Estimation, and Modulation Theory Harry L. Van Trees WILEY- INTERSCIENCE A JOHN WILEY & SONS, INC., PUBLICATION Preface xix 1 Introduction 1 1.1 Array Processing

More information

Memory Management Algorithms on Distributed Systems. Katie Becker and David Rodgers CS425 April 15, 2005

Memory Management Algorithms on Distributed Systems. Katie Becker and David Rodgers CS425 April 15, 2005 Memory Management Algorithms on Distributed Systems Katie Becker and David Rodgers CS425 April 15, 2005 Table of Contents 1. Introduction 2. Coarse Grained Memory 2.1. Bottlenecks 2.2. Simulations 2.3.

More information

Advanced Parallel Programming I

Advanced Parallel Programming I Advanced Parallel Programming I Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2016 22.09.2016 1 Levels of Parallelism RISC Software GmbH Johannes Kepler University

More information

Application Performance on Dual Processor Cluster Nodes

Application Performance on Dual Processor Cluster Nodes Application Performance on Dual Processor Cluster Nodes by Kent Milfeld milfeld@tacc.utexas.edu edu Avijit Purkayastha, Kent Milfeld, Chona Guiang, Jay Boisseau TEXAS ADVANCED COMPUTING CENTER Thanks Newisys

More information

Design, implementation, and evaluation of parallell pipelined STAP on parallel computers

Design, implementation, and evaluation of parallell pipelined STAP on parallel computers Syracuse University SURFACE Electrical Engineering and Computer Science College of Engineering and Computer Science 1998 Design, implementation, and evaluation of parallell pipelined STAP on parallel computers

More information

Today Cache memory organization and operation Performance impact of caches

Today Cache memory organization and operation Performance impact of caches Cache Memories 1 Today Cache memory organization and operation Performance impact of caches The memory mountain Rearranging loops to improve spatial locality Using blocking to improve temporal locality

More information

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can

More information

Cache Memories. EL2010 Organisasi dan Arsitektur Sistem Komputer Sekolah Teknik Elektro dan Informatika ITB 2010

Cache Memories. EL2010 Organisasi dan Arsitektur Sistem Komputer Sekolah Teknik Elektro dan Informatika ITB 2010 Cache Memories EL21 Organisasi dan Arsitektur Sistem Komputer Sekolah Teknik Elektro dan Informatika ITB 21 Topics Generic cache memory organization Direct mapped caches Set associative caches Impact of

More information

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli High performance 2D Discrete Fourier Transform on Heterogeneous Platforms Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli Motivation Fourier Transform widely used in Physics, Astronomy, Engineering

More information

Giving credit where credit is due

Giving credit where credit is due CSCE 23J Computer Organization Cache Memories Dr. Steve Goddard goddard@cse.unl.edu http://cse.unl.edu/~goddard/courses/csce23j Giving credit where credit is due Most of slides for this lecture are based

More information

ν Hold frequently accessed blocks of main memory 2 CISC 360, Fa09 Cache is an array of sets. Each set contains one or more lines.

ν Hold frequently accessed blocks of main memory 2 CISC 360, Fa09 Cache is an array of sets. Each set contains one or more lines. Topics CISC 36 Cache Memories Dec, 29 ν Generic cache memory organization ν Direct mapped caches ν Set associatie caches ν Impact of caches on performance Cache Memories Cache memories are small, fast

More information

EE/CSCI 451: Parallel and Distributed Computation

EE/CSCI 451: Parallel and Distributed Computation EE/CSCI 451: Parallel and Distributed Computation Lecture #11 2/21/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline Midterm 1:

More information

Advanced Computing Research Laboratory. Adaptive Scientific Software Libraries

Advanced Computing Research Laboratory. Adaptive Scientific Software Libraries Adaptive Scientific Software Libraries and Texas Learning and Computation Center and Department of Computer Science University of Houston Challenges Diversity of execution environments Growing complexity

More information

Automatic Performance Tuning. Jeremy Johnson Dept. of Computer Science Drexel University

Automatic Performance Tuning. Jeremy Johnson Dept. of Computer Science Drexel University Automatic Performance Tuning Jeremy Johnson Dept. of Computer Science Drexel University Outline Scientific Computation Kernels Matrix Multiplication Fast Fourier Transform (FFT) Automated Performance Tuning

More information

Last class. Caches. Direct mapped

Last class. Caches. Direct mapped Memory Hierarchy II Last class Caches Direct mapped E=1 (One cache line per set) Each main memory address can be placed in exactly one place in the cache Conflict misses if two addresses map to same place

More information

White paper FUJITSU Supercomputer PRIMEHPC FX100 Evolution to the Next Generation

White paper FUJITSU Supercomputer PRIMEHPC FX100 Evolution to the Next Generation White paper FUJITSU Supercomputer PRIMEHPC FX100 Evolution to the Next Generation Next Generation Technical Computing Unit Fujitsu Limited Contents FUJITSU Supercomputer PRIMEHPC FX100 System Overview

More information

Administration. Prerequisites. Meeting times. CS 380C: Advanced Topics in Compilers

Administration. Prerequisites. Meeting times. CS 380C: Advanced Topics in Compilers Administration CS 380C: Advanced Topics in Compilers Instructor: eshav Pingali Professor (CS, ICES) Office: POB 4.126A Email: pingali@cs.utexas.edu TA: TBD Graduate student (CS) Office: Email: Meeting

More information

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology 1 Multilevel Memories Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind CPU-Memory Bottleneck 6.823

More information

Cache Memories October 8, 2007

Cache Memories October 8, 2007 15-213 Topics Cache Memories October 8, 27 Generic cache memory organization Direct mapped caches Set associative caches Impact of caches on performance The memory mountain class12.ppt Cache Memories Cache

More information

Parallel Pipeline STAP System

Parallel Pipeline STAP System I/O Implementation and Evaluation of Parallel Pipelined STAP on High Performance Computers Wei-keng Liao, Alok Choudhary, Donald Weiner, and Pramod Varshney EECS Department, Syracuse University, Syracuse,

More information

Cache Memories. University of Western Ontario, London, Ontario (Canada) Marc Moreno Maza. CS2101 October 2012

Cache Memories. University of Western Ontario, London, Ontario (Canada) Marc Moreno Maza. CS2101 October 2012 Cache Memories Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS2101 October 2012 Plan 1 Hierarchical memories and their impact on our 2 Cache Analysis in Practice Plan 1 Hierarchical

More information

Performance Issues in Parallelization Saman Amarasinghe Fall 2009

Performance Issues in Parallelization Saman Amarasinghe Fall 2009 Performance Issues in Parallelization Saman Amarasinghe Fall 2009 Today s Lecture Performance Issues of Parallelism Cilk provides a robust environment for parallelization It hides many issues and tries

More information

White paper Advanced Technologies of the Supercomputer PRIMEHPC FX10

White paper Advanced Technologies of the Supercomputer PRIMEHPC FX10 White paper Advanced Technologies of the Supercomputer PRIMEHPC FX10 Next Generation Technical Computing Unit Fujitsu Limited Contents Overview of the PRIMEHPC FX10 Supercomputer 2 SPARC64 TM IXfx: Fujitsu-Developed

More information

Cache memories are small, fast SRAM based memories managed automatically in hardware.

Cache memories are small, fast SRAM based memories managed automatically in hardware. Cache Memories Cache memories are small, fast SRAM based memories managed automatically in hardware. Hold frequently accessed blocks of main memory CPU looks first for data in caches (e.g., L1, L2, and

More information

PARALLEL ALGORITHMS FOR ADAPTIVE MATCHED-FIELD PROCESSING ON DISTRIBUTED ARRAY SYSTEMS

PARALLEL ALGORITHMS FOR ADAPTIVE MATCHED-FIELD PROCESSING ON DISTRIBUTED ARRAY SYSTEMS PARALLEL ALGORITHMS FOR ADAPTIVE MATCHED-FIELD PROCESSING ON DISTRIBUTED ARRAY SYSTEMS KILSEOK CHO, ALAN D. GEORGE, AND RAJ SUBRAMANIYAN High-performance Computing and Simulation (HCS) Research Laboratory

More information

Outline. Issues with the Memory System Loop Transformations Data Transformations Prefetching Alias Analysis

Outline. Issues with the Memory System Loop Transformations Data Transformations Prefetching Alias Analysis Memory Optimization Outline Issues with the Memory System Loop Transformations Data Transformations Prefetching Alias Analysis Memory Hierarchy 1-2 ns Registers 32 512 B 3-10 ns 8-30 ns 60-250 ns 5-20

More information

Master Informatics Eng.

Master Informatics Eng. Advanced Architectures Master Informatics Eng. 207/8 A.J.Proença The Roofline Performance Model (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 207/8 AJProença, Advanced Architectures,

More information

Scalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA

Scalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA Scalable and Dynamically Updatable Lookup Engine for Decision-trees on FPGA Yun R. Qu, Viktor K. Prasanna Ming Hsieh Dept. of Electrical Engineering University of Southern California Los Angeles, CA 90089

More information

Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology

Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology EE382C: Embedded Software Systems Final Report David Brunke Young Cho Applied Research Laboratories:

More information

Computer and Hardware Architecture II. Benny Thörnberg Associate Professor in Electronics

Computer and Hardware Architecture II. Benny Thörnberg Associate Professor in Electronics Computer and Hardware Architecture II Benny Thörnberg Associate Professor in Electronics Parallelism Microscopic vs Macroscopic Microscopic parallelism hardware solutions inside system components providing

More information

Agenda. Cache-Memory Consistency? (1/2) 7/14/2011. New-School Machine Structures (It s a bit more complicated!)

Agenda. Cache-Memory Consistency? (1/2) 7/14/2011. New-School Machine Structures (It s a bit more complicated!) 7/4/ CS 6C: Great Ideas in Computer Architecture (Machine Structures) Caches II Instructor: Michael Greenbaum New-School Machine Structures (It s a bit more complicated!) Parallel Requests Assigned to

More information

Modern CPU Architectures

Modern CPU Architectures Modern CPU Architectures Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2014 16.04.2014 1 Motivation for Parallelism I CPU History RISC Software GmbH Johannes

More information

Performance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply
