Kernel Benchmarks and Metrics for Polymorphous Computer Architectures
1 Kernel Benchmarks and Metrics for Polymorphous Computer Architectures (PCAKernels-1)
- Hank Hoffmann, James Lebak (Presenter), Janice McMahon
- Seventh Annual High-Performance Embedded Computing Workshop (HPEC), 24 September 2003
- This work is sponsored by the Defense Advanced Research Projects Agency under Air Force Contract F C. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.
2 Targeting Mission Cycle
- Targeting mission cycle: Detection, Location, Identification, Target Nomination, Weapon Selection, Targeting, Attack, Assessment
- Future Warfighting Scenarios, examples: SIGINT, Communication Satellite, Airborne Vehicle, Surveillance Satellite, Communication Antenna, Aegis Cruiser, Personal Terminals, Micro UAV
- Key Program Goal: re-configurable embedded processor approach to provide multi-mission capability
3 Polymorphous Computing
- Stream processing: regular, deterministic operations; constant flow of input data
- Threaded processing: complex operations; dynamic data movement
- Set of homogeneous computing tiles (diagram labels: P, M)
- Morphs shown: SIMD, Distributed Cache, Systolic, Dedicated co-processors
- 1 morph \ mor-()f\ n : re-structuring of tiles for optimized processing
- 2 morph \ mor-()f\ vt : to re-structure tiles for optimized processing
4 Architectural Flexibility
- Radar Processing Flow: Front end, Signal Processing, Detection/Estimation, Back end, Discrimination/Identification, Command Control
- Chart of performance across: Signal Processing Benchmark 1, Signal Processing Benchmark 2, Information Processing Benchmark, Knowledge Processing Benchmark, Intelligence Processing Benchmark
- Processor classes compared: Specialized Class, DSP Class, PPC Class, Server Class, PCA
- Workload types: Structured Bit-operations, Vectors/Streaming, Dynamic/Threading, Symbolic Operations
5 Outline: Introduction; Kernel Benchmarks and Metrics; Programming PCA Architectures; Case Study: SVD Kernel; Conclusions
6 Kernel Synthesis from Application Survey
- Specific Application Areas: Radar, Sonar, Infrared, Hyper-Spectral, SIGINT, Communication, Data Fusion
- Broad Processing Categories
  - Front-end Processing: data independent, stream-oriented; signal processing, image processing, high-speed network communication. Examples: pulse compression, adaptive beamforming, target detection
  - Back-end Processing: data dependent, thread oriented; information processing, knowledge processing. Examples: workload optimization, target classification
- Specific Kernels
  - Signal/Image Processing: FIR Filter, SVD, CFAR Detection
  - Communication: Corner Turn
  - Information/Knowledge Processing: Graph Optimization, Pattern Recognition, Real-time Database Operations
- MIT-LL surveyed DoD applications to provide kernel benchmark definitions and example requirements and data sets
7 Kernel Performance Evaluation
- Kernel Benchmarks
  - Signal/Image Processing: FIR Filter, SVD, CFAR Detection
  - Communication: Corner Turn
  - Information/Knowledge Processing: Graph Optimization, Pattern Recognition, Real-time Database Operations
- Performance Metrics: floating point and integer ops; latency; throughput; efficiency; stability; density and cost (size, weight, power)
- Definitions
  - $\text{Throughput} = \dfrac{\text{Workload (FLOPS or OPS)}}{\text{Execution time (seconds)}}$
  - $\text{Efficiency} = \dfrac{\text{Throughput}}{\text{Hardware Peak}}$
  - $\text{Stability} = \dfrac{\mathrm{MIN}(\text{Throughput})}{\mathrm{MAX}(\text{Throughput})}$
- Architectures: PowerPC (G4), RAW, Smart Memory, TRIPS, MONARCH
8 Throughput-Stability Product: A New Kernel Metric
- $\text{Throughput} = \dfrac{\text{Workload (FLOPS or OPS)}}{\text{Execution time (seconds)}}$
- $\text{Interval Stability} = \dfrac{\mathrm{MIN}_I(\text{Throughput})}{\mathrm{MAX}_I(\text{Throughput})}$
- Throughput × Stability rewards consistent high performance and penalizes lack of performance or lack of consistency
- For a given application, PCA processors should achieve a higher product of throughput and stability than conventional processors
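The metric on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slide does not say how the interval I is chosen or which throughput value is multiplied by the stability, so the sketch applies one stability value to every sample in the interval. The function names and the sample numbers are hypothetical.

```python
def throughput(workload_ops, exec_time_s):
    """Throughput = workload (FLOPS or OPS) / execution time (seconds)."""
    return workload_ops / exec_time_s


def interval_stability(throughputs):
    """Interval stability = MIN_I(throughput) / MAX_I(throughput)."""
    return min(throughputs) / max(throughputs)


def throughput_stability(throughputs):
    """Scale each throughput sample by the stability of the whole interval.

    Assumption: one stability value is applied to every sample in the interval.
    """
    s = interval_stability(throughputs)
    return [t * s for t in throughputs]


# Hypothetical throughput samples in GFLOP/s over increasing data set sizes.
steady = [3.0, 3.0, 3.0]
cache_cliff = [3.0, 3.0, 1.5]

print(throughput_stability(steady))
print(throughput_stability(cache_cliff))
```

A processor whose throughput halves at one data set size loses half of its score everywhere in the interval, which is the penalty for inconsistency the slide describes.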
9 Outline: Introduction; Kernel Benchmarks and Metrics; Programming PCA Architectures; Case Study: SVD Kernel; Conclusions
10 High Performance Programming: Conventional vs. PCA Processors
- PowerPC (G4)
  - Characteristics: rigid memory hierarchy; rigid datapath; specialized structures
  - High performance programming: change algorithm to match memory hierarchy; one degree of freedom; can only work with blocking factor
- Raw
  - Characteristics: flexible memory hierarchy; flexible datapath(s); generic structures
  - High performance programming: co-optimize algorithm and architecture; many degrees of freedom; optimize time/space tradeoff
- PCA provides more degrees of freedom, and thus greater flexibility (morphability) and greater performance over a range of applications
11 Kernel Benchmarks and the PowerPC G4
- PowerPC G4 specs: 500 MHz clock rate; 4 Gflop/s peak; 125 MHz main memory bus; L1 cache: 32 KB, on chip; L2 cache: 2 MB, 250 MHz bus; Mercury daughtercard
- Diagram: Main Memory, L2 Cache
- Two predictors of kernel performance:
  - Programmer's maximization of data reuse and locality (blocking factor)
  - Memory hierarchy of G4
- Blocking factor determines max achieved performance
- Memory hierarchy determines shape of performance curve
- Want to maximize blocking factor to limit memory hierarchy bottleneck
12 FIR Filter (G4)
- Plots versus vector length: FIR Filter Throughput (MFLOP/s) and FIR Throughput × Stability, for number of filters = 4 and filter size = 16, with Level 1 and Level 2 cache boundaries marked
- PowerPC G4 (Mercury): 500 MHz, peak 4 GFLOP/s; mean efficiency: 29%
- Implemented with VSIPL real FIR filter
- Caches are performance bottlenecks
  - Performance curve changes when cache is full
  - Product metric penalizes G4 for performance drop at cache boundaries
13 Baseline Performance Measurements: Throughput and Stability
- Charts: Throughput; Data Set and Overall Stability
- PowerPC G4 (Mercury): 500 MHz, 32 KB L1, 2 MB L2, peak 4 GFLOP/s
- Data Set Stability: ratio of minimum to maximum over all data set sizes for a particular kernel
- Overall Stability: ratio of minimum to maximum over all floating-point kernels and all data set sizes
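The two scopes of stability defined on this slide differ only in which measurements the minimum and maximum range over. A minimal sketch, assuming measurements are stored as a nested dictionary keyed by kernel name and data set size; the layout, the function names, and the numbers are all hypothetical, not measured results.

```python
def data_set_stability(throughput_by_size):
    """Min/max throughput over all data set sizes of one kernel."""
    values = list(throughput_by_size.values())
    return min(values) / max(values)


def overall_stability(kernels):
    """Min/max throughput over all kernels and all data set sizes."""
    values = [t for sizes in kernels.values() for t in sizes.values()]
    return min(values) / max(values)


# Hypothetical throughputs in MFLOP/s, keyed by kernel and data set size.
measurements = {
    "fir": {1024: 1500.0, 65536: 1000.0},
    "svd": {1024: 600.0, 65536: 300.0},
}

for name, sizes in measurements.items():
    print(name, data_set_stability(sizes))
print("overall", overall_stability(measurements))
```

Because the overall figure takes its minimum and maximum over a superset of each kernel's measurements, it can never exceed any single kernel's data set stability.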
14 Stream Algorithms for Tiled Architectures
- Systolic morph (time/space diagram) on a tile array of edge length R
  - M(R) edge tiles are allocated to memory management
  - P(R) inner tiles perform computation systolically using registers and static network
- Stream algorithm efficiency: $E(N,R) = \dfrac{C(N)}{T(N,R)\,\bigl(P(R) + M(R)\bigr)}$, where
  - N = problem size
  - R = edge length of tile array
  - C(N) = number of operations
  - T(N,R) = number of time steps
  - P(R) + M(R) = total number of processors
- Compute efficiency condition: $\lim_{\sigma, R \to \infty} E(\sigma, R) = 1$, where $\sigma = N/R$
- Stream algorithms achieve high efficiency by optimizing the time/space tradeoff, tailoring memory hierarchy and datapaths to the specific needs of the application
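To make the efficiency formula concrete, the sketch below evaluates it for an idealized case. The assumptions are mine, not the slide's: the tile counts P = R^2 and M = 2R are borrowed from the time domain convolution slide, the compute tiles are assumed never to idle (so T = C / P), and the workload size is an arbitrary hypothetical number.

```python
def stream_efficiency(ops, time_steps, compute_tiles, memory_tiles):
    """E = C / (T * (P + M)), the stream algorithm efficiency."""
    return ops / (time_steps * (compute_tiles + memory_tiles))


def ideal_efficiency(r):
    """Efficiency when every compute tile is busy on every time step.

    Assumptions: P = R^2 compute tiles, M = 2R memory tiles, T = C / P.
    """
    p = r * r
    m = 2 * r
    ops = 1_000_000 * p       # hypothetical workload, divisible by p
    time_steps = ops // p     # idealized: no idle compute tiles
    return stream_efficiency(ops, time_steps, p, m)


for r in (4, 16, 64):
    print(r, ideal_efficiency(r))
```

Under these assumptions the expression reduces to R / (R + 2), so the share of tiles spent on memory management shrinks as the array grows and the efficiency approaches 1, in line with the compute efficiency condition.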
15 Time Domain Convolution on RAW
- RAW chip with R rows and R+2 columns
  - Number of filters = R
  - Number of memory tiles: $M = 2R$
  - Number of processing tiles: $P = R^2$
- Array layout: manage input vectors; systolic array for K-tap filter; manage output vectors
- Each row performs a number of K-tap filters
- Stream algorithms achieve high performance by removing the memory access bottleneck from the computational critical path
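As a reference point for what each row of the array computes, here is a plain sequential time domain convolution with a K-tap filter, plus the tile counts from the slide. This is a sketch of the arithmetic only; it does not model the systolic data flow on Raw, and the function names are hypothetical.

```python
def fir_filter(x, h):
    """Full time domain convolution of input x with K-tap filter h."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            # Each input sample contributes to K consecutive outputs.
            y[n + k] += hk * xn
    return y


def raw_tile_counts(r):
    """Tile allocation for a Raw array with R rows and R+2 columns."""
    return {"filters": r, "memory_tiles": 2 * r, "processing_tiles": r * r}


print(fir_filter([1, 2, 3], [1, 1]))
print(raw_tile_counts(4))
```

The counts are consistent with the stated geometry: 2R memory tiles plus R^2 processing tiles equals R(R+2), the total number of tiles in an array of R rows and R+2 columns.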
16 FIR Filter (RAW)
- Plots versus vector length: Throughput (GFLOP/s) and Throughput × Stability
- RAW: 250 MHz, 4 GFLOP/s
- G4: 500 MHz, 4 GFLOP/s
- Raw implements the appropriate memory hierarchy for the problem
- Raw's Throughput × Stability score stays high
17 Outline: Introduction; Kernel Benchmarks and Metrics; Programming PCA Architectures; Case Study: SVD Kernel; Conclusions
18 Singular Value Decomposition (SVD)
- Diagram: Input Matrix (M rows, N columns) → Upper-Triangular Matrix → Bidiagonal Matrix → Diagonal Matrix $\Sigma$
- $X = U \Sigma V^H$, with $U$, $V$ unitary and $\Sigma$ diagonal
- SVD is becoming more widely used in signal and image processing
  - Important for spectral analysis
  - Can also be used for adaptive beamforming, especially for ill-conditioned problems
- SVD kernel implementation is a reduced SVD that begins with a QR factorization if M > N
  - Uses Modified Gram-Schmidt QR factorization
  - Many possible optimizations, especially block factorization
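The reason a reduced SVD can begin with a QR factorization when M > N follows from a short derivation. The symbols $Q$, $R$, and $U_R$ are introduced here for illustration and do not appear on the slide.

```latex
\begin{align*}
X &= Q R, & Q &\in \mathbb{C}^{M \times N},\; Q^H Q = I,\;
              R \in \mathbb{C}^{N \times N} \text{ upper triangular} \\
R &= U_R \Sigma V^H & &\text{(SVD of the small } N \times N \text{ factor)} \\
X &= (Q U_R)\, \Sigma V^H = U \Sigma V^H, & U &= Q U_R
\end{align*}
```

Since $Q$ and $U_R$ both have orthonormal columns, so does $U = Q U_R$, and the iterative part of the SVD only has to work on the $N \times N$ matrix $R$ rather than on the tall $M \times N$ input.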
19 SVD Results (G4)
- Plots: SVD Throughput (Mflop/s) and SVD Throughput × Stability
- PowerPC G4 (Mercury): 500 MHz, peak 4 GFLOP/s; mean efficiency: 16%
- Reduced SVD of a 16-column complex matrix
- Begins with MGS QR factorization (needs A+R)
- L1 cache drives inner loop performance. Plot markers:
  - 1: A+R fills L1 cache
  - 2: One column of A is half of L1 cache
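The two plot markers correspond to specific row counts, which can be estimated with simple arithmetic. The sketch below is an estimate under assumptions the slide does not state: elements are single-precision complex (8 bytes each) and R is stored as a full 16 x 16 matrix. The 32 KB L1 size comes from the G4 specification slide.

```python
L1_BYTES = 32 * 1024       # L1 cache size from the G4 specification slide
BYTES_PER_ELEMENT = 8      # assumption: single-precision complex
N_COLS = 16                # 16-column matrix, as on this slide


def rows_where_a_plus_r_fills_l1():
    """Largest M for which the M x 16 matrix A plus a full 16 x 16 R fit in L1."""
    l1_elements = L1_BYTES // BYTES_PER_ELEMENT
    r_elements = N_COLS * N_COLS   # assumption: R stored as a full square matrix
    return (l1_elements - r_elements) // N_COLS


def rows_where_column_is_half_l1():
    """M at which a single column of A occupies half of L1."""
    return (L1_BYTES // 2) // BYTES_PER_ELEMENT


print(rows_where_a_plus_r_fills_l1())
print(rows_where_column_is_half_l1())
```

With these assumptions the markers would fall at 240 rows and 2048 rows; a different element size or a packed triangular R would shift them.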
20 Modified Gram-Schmidt QR Results (G4)
- Plots: MGS Throughput (Mflop/s) and MGS Throughput × Stability
- PowerPC G4 (Mercury): 500 MHz, peak 4 GFLOP/s; mean efficiency: 12%
- Modified Gram-Schmidt QR factorization of a 16-column complex matrix
- MGS is about 60% of SVD time
- L1 cache drives inner loop performance. Plot markers:
  - 1: A+R fills L1 cache
  - 2: One column of A is half of L1 cache
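For readers unfamiliar with the kernel being measured, the following is a minimal, unoptimized sketch of Modified Gram-Schmidt QR for a complex matrix in pure Python. It is not the benchmarked implementation: it assumes the input has full column rank, uses lists of rows as the matrix representation, and makes no attempt at blocking.

```python
def mgs_qr(a):
    """Modified Gram-Schmidt QR of an M x N matrix given as a list of rows.

    Returns (q, r): q is M x N with orthonormal columns, r is N x N upper
    triangular. Assumes full column rank.
    """
    m, n = len(a), len(a[0])
    # Work column by column: cols[j] is column j of the matrix.
    cols = [[complex(a[i][j]) for i in range(m)] for j in range(n)]
    r = [[0j] * n for _ in range(n)]
    for j in range(n):
        norm = sum(abs(v) ** 2 for v in cols[j]) ** 0.5
        r[j][j] = norm
        cols[j] = [v / norm for v in cols[j]]
        for k in range(j + 1, n):
            # "Modified": project against the already-updated column k.
            r[j][k] = sum(qv.conjugate() * v for qv, v in zip(cols[j], cols[k]))
            cols[k] = [v - r[j][k] * qv for v, qv in zip(cols[k], cols[j])]
    q = [[cols[j][i] for j in range(n)] for i in range(m)]
    return q, r


a = [[1 + 1j, 2], [0, 1j], [1, 1]]
q, r = mgs_qr(a)
```

The inner loop sweeps every remaining column once per step, touching all of A and one row of R each time, which is consistent with the slide's note that the factorization needs A+R and that L1 capacity drives inner loop performance.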
21 SVD for RAW Architecture
- Diagram: Input Matrix (M rows, N columns) → Banded Matrix → Bidiagonal Matrix → Diagonal Matrix $\Sigma$; memory tiles and compute tiles
- Goal is to match problem size and architecture
- Use 2D systolic morph
  - Maximizes time/space efficiency
  - Uses architecture in a scalable way
- Uses efficient QR/LQ approach to get to banded form
  - Fast Givens approach for QR/LQ
  - Decoupled algorithm with good parallelism
- Banded form matches array dimension of systolic morph
  - Provides high locality for reduction to bidiagonal form
- Raw implementation seeks to efficiently match the many possible algorithms to the many possible architectural configurations
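The idea behind a Givens-based QR can be sketched as follows. Note the substitution: the slide names the Fast Givens variant, while this sketch uses standard Givens rotations (with square roots) on a real matrix, sequentially, purely to show how rotations zero one subdiagonal entry at a time. It is not the Raw implementation, and the function name is hypothetical.

```python
import math


def givens_qr(a):
    """QR by standard Givens rotations for a real M x N matrix (list of rows).

    Returns (q, r) with q M x M orthogonal and r M x N upper triangular.
    """
    m, n = len(a), len(a[0])
    r = [[float(v) for v in row] for row in a]
    q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for j in range(n):
        # Zero column j from the bottom up, rotating adjacent rows i-1 and i.
        for i in range(m - 1, j, -1):
            x, y = r[i - 1][j], r[i][j]
            if y == 0.0:
                continue
            h = math.hypot(x, y)
            c, s = x / h, y / h
            for k in range(n):
                t1, t2 = r[i - 1][k], r[i][k]
                r[i - 1][k] = c * t1 + s * t2
                r[i][k] = -s * t1 + c * t2
            r[i][j] = 0.0  # exact zero instead of rounding residue
            # Accumulate Q by applying the transposed rotation to its columns.
            for k in range(m):
                t1, t2 = q[k][i - 1], q[k][i]
                q[k][i - 1] = c * t1 + s * t2
                q[k][i] = -s * t1 + c * t2
    return q, r


a = [[1, 2], [3, 4], [5, 6]]
q, r = givens_qr(a)
```

Each rotation touches only two adjacent rows, so rotations acting on disjoint row pairs are independent of one another; that locality is what makes Givens-style factorizations a natural fit for a systolic array.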
22 RAW and G4 Results: Fast Givens QR Factorization
- The QR is a key sub-kernel of the SVD
- Plots versus N (for N by N matrices): Throughput (GFLOP/s) and Throughput × Stability
- The QR performance demonstrates the benefit of the PCA approach on matrix algebra operations
23 Lincoln Laboratory PCA Testbed
- Test Bed Architecture
  - Intel PC: dual processor, 66 MHz/64-bit wide PCI bus, running Linux
  - Clusters on LLAN (Ethernet LAN)
  - Mercury RACE/VME: Solaris/MCOS; SBC, G4, DSP
  - Annapolis Wildstar: high speed I/O, DSP/FPGA
  - Unit under test: RAW Test Board (October 2003), 2 MB DRAM, high speed I/O, USB interface, daughtercard, high speed A/D
- Test Bed Objectives
  - Kernel performance evaluation
  - Application morphing demonstration
  - High-level software prototyping
24 Outline: Introduction; Kernel Benchmarks and Metrics; Programming PCA Architectures; Case Study: SVD Kernel; Conclusions
25 Conclusions
- MIT Lincoln Laboratory has defined kernel benchmarks for the PCA program
  - Multiple categories of processing
  - Based on DoD application needs
- Establishing a performance baseline on conventional architectures
  - Performance is limited by the blocking factor and by the memory hierarchy
  - Example: CFAR (low ops/byte) reaches 3% efficiency; FIR (high ops/byte) reaches 29% efficiency
- PCA processors allow opportunities for high performance
  - Performance achieved through co-optimization of the algorithm and the architecture
  - Example: unusual SVD algorithm leads to high performance on Raw
  - The greater degree of freedom allows greater optimization across a variety of problem domains
26 PCA Team: Hector Chan, Bill Coate, Jim Daly, Ryan Haney, Hank Hoffmann, Preston Jackson, James Lebak, Janice McMahon, Eddie Rutledge, Glenn Schrader, Edmund Wong
Kernel Benchmarks and Metrics for Polymorphous Computer Architectures James Lebak Hank Hoffmann Janice McMahon MIT Lincoln Laboratory
Kernel Benchmarks and Metrics for Polymorphous Computer Architectures James Lebak Hank Hoffmann Janice McMahon Polymorphous computer architectures (PCA) are new computer architectures being developed under
More informationThe HPEC Challenge Benchmark Suite
The HPEC Challenge Benchmark Suite Ryan Haney, Theresa Meuse, Jeremy Kepner and James Lebak Massachusetts Institute of Technology Lincoln Laboratory HPEC 2005 This work is sponsored by the Defense Advanced
More informationEvaluating the Potential of Graphics Processors for High Performance Embedded Computing
Evaluating the Potential of Graphics Processors for High Performance Embedded Computing Shuai Mu, Chenxi Wang, Ming Liu, Yangdong Deng Department of Micro-/Nano-electronics Tsinghua University Outline
More informationHigh Performance DoD DSP Applications
High Performance DoD DSP Applications Robert Bond Embedded Digital Systems Group 23 August 2003 Slide-1 Outline DoD High-Performance DSP Applications Middleware (with some streaming constructs) Future
More informationQR Decomposition on GPUs
QR Decomposition QR Algorithms Block Householder QR Andrew Kerr* 1 Dan Campbell 1 Mark Richards 2 1 Georgia Tech Research Institute 2 School of Electrical and Computer Engineering Georgia Institute of
More informationThe Vector, Signal, and Image Processing Library (VSIPL): Emerging Implementations and Further Development
The Vector, Signal, and Image Processing Library (VSIPL): Emerging Implementations and Further Development Randall Janka and Mark Richards Georgia Tech Research University James Lebak MIT Lincoln Laboratory
More informationA KASSPER Real-Time Signal Processor Testbed
A KASSPER Real-Time Signal Processor Testbed Glenn Schrader 244 Wood St. exington MA, 02420 Phone: (781)981-2579 Fax: (781)981-5255 gschrad@ll.mit.edu The Knowledge Aided Sensor Signal Processing and Expert
More informationMaster Informatics Eng.
Advanced Architectures Master Informatics Eng. 207/8 A.J.Proença The Roofline Performance Model (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 207/8 AJProença, Advanced Architectures,
More informationFrequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System
Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM
More information300x Matlab. Dr. Jeremy Kepner. MIT Lincoln Laboratory. September 25, 2002 HPEC Workshop Lexington, MA
300x Matlab Dr. Jeremy Kepner September 25, 2002 HPEC Workshop Lexington, MA This work is sponsored by the High Performance Computing Modernization Office under Air Force Contract F19628-00-C-0002. Opinions,
More informationSocial Behavior Prediction Through Reality Mining
Social Behavior Prediction Through Reality Mining Charlie Dagli, William Campbell, Clifford Weinstein Human Language Technology Group MIT Lincoln Laboratory This work was sponsored by the DDR&E / RRTO
More informationAdaptive Scientific Software Libraries
Adaptive Scientific Software Libraries Lennart Johnsson Advanced Computing Research Laboratory Department of Computer Science University of Houston Challenges Diversity of execution environments Growing
More informationOutline. Parallel Algorithms for Linear Algebra. Number of Processors and Problem Size. Speedup and Efficiency
1 2 Parallel Algorithms for Linear Algebra Richard P. Brent Computer Sciences Laboratory Australian National University Outline Basic concepts Parallel architectures Practical design issues Programming
More informationCluster-based 3D Reconstruction of Aerial Video
Cluster-based 3D Reconstruction of Aerial Video Scott Sawyer (scott.sawyer@ll.mit.edu) MIT Lincoln Laboratory HPEC 12 12 September 2012 This work is sponsored by the Assistant Secretary of Defense for
More informationNear Memory Computing Spectral and Sparse Accelerators
Near Memory Computing Spectral and Sparse Accelerators Franz Franchetti ECE, Carnegie Mellon University www.ece.cmu.edu/~franzf Co-Founder, SpiralGen www.spiralgen.com The work was sponsored by Defense
More informationInstruction Set Extensions for Photonic Synchronous Coalesced Access
Instruction Set Extensions for Photonic Synchronous Coalesced Access Paul Keltcher, David Whelihan, Jeffrey Hughes September 12, 2013 This work is sponsored by Defense Advanced Research Projects Agency
More informationAnalysis and Mapping of Sparse Matrix Computations
Analysis and Mapping of Sparse Matrix Computations Nadya Bliss & Sanjeev Mohindra Varun Aggarwal & Una-May O Reilly MIT Computer Science and AI Laboratory September 19th, 2007 HPEC2007-1 This work is sponsored
More informationLLMORE: Mapping and Optimization Framework
LORE: Mapping and Optimization Framework Michael Wolf, MIT Lincoln Laboratory 11 September 2012 This work is sponsored by Defense Advanced Research Projects Agency (DARPA) under Air Force contract FA8721-05-C-0002.
More informationProcessor Architectures At A Glance: M.I.T. Raw vs. UC Davis AsAP
Processor Architectures At A Glance: M.I.T. Raw vs. UC Davis AsAP Presenter: Course: EEC 289Q: Reconfigurable Computing Course Instructor: Professor Soheil Ghiasi Outline Overview of M.I.T. Raw processor
More informationCSE5351: Parallel Processing Part III
CSE5351: Parallel Processing Part III -1- Performance Metrics and Benchmarks How should one characterize the performance of applications and systems? What are user s requirements in performance and cost?
More informationAdvanced Computing Research Laboratory. Adaptive Scientific Software Libraries
Adaptive Scientific Software Libraries and Texas Learning and Computation Center and Department of Computer Science University of Houston Challenges Diversity of execution environments Growing complexity
More informationVector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks
Vector Architectures Vs. Superscalar and VLIW for Embedded Media Benchmarks Christos Kozyrakis Stanford University David Patterson U.C. Berkeley http://csl.stanford.edu/~christos Motivation Ideal processor
More informationSDA: Software-Defined Accelerator for Large- Scale DNN Systems
SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, Yong Wang, Bo Yu, Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A dominant
More informationAdvanced Computer Architecture
18-742 Advanced Computer Architecture Test 2 April 14, 1998 Name (please print): Instructions: DO NOT OPEN TEST UNTIL TOLD TO START YOU HAVE UNTIL 12:20 PM TO COMPLETE THIS TEST The exam is composed of
More informationFast Algorithms for Regularized Minimum Norm Solutions to Inverse Problems
Fast Algorithms for Regularized Minimum Norm Solutions to Inverse Problems Irina F. Gorodnitsky Cognitive Sciences Dept. University of California, San Diego La Jolla, CA 9293-55 igorodni@ece.ucsd.edu Dmitry
More informationIntel Enterprise Processors Technology
Enterprise Processors Technology Kosuke Hirano Enterprise Platforms Group March 20, 2002 1 Agenda Architecture in Enterprise Xeon Processor MP Next Generation Itanium Processor Interconnect Technology
More informationA Characterization of High Performance DSP Kernels on the TRIPS Architecture
Technical Report #TR-06-62, Department of Computer Sciences, University of Texas A Characterization of High Performance DSP Kernels on the TRIPS Architecture Kevin B. Bush Mark Gebhart Doug Burger Stephen
More informationSDA: Software-Defined Accelerator for Large- Scale DNN Systems
SDA: Software-Defined Accelerator for Large- Scale DNN Systems Jian Ouyang, 1 Shiding Lin, 1 Wei Qi, 1 Yong Wang, 1 Bo Yu, 1 Song Jiang, 2 1 Baidu, Inc. 2 Wayne State University Introduction of Baidu A
More informationEffect of memory latency
CACHE AWARENESS Effect of memory latency Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100 ns. Assume that the processor has two ALU units and it is capable
More informationTrends in the Infrastructure of Computing
Trends in the Infrastructure of Computing CSCE 9: Computing in the Modern World Dr. Jason D. Bakos My Questions How do computer processors work? Why do computer processors get faster over time? How much
More informationLLGrid: On-Demand Grid Computing with gridmatlab and pmatlab
LLGrid: On-Demand Grid Computing with gridmatlab and pmatlab Albert Reuther 29 September 2004 This work is sponsored by the Department of the Air Force under Air Force contract F19628-00-C-0002. Opinions,
More informationHigh-Performance Linear Algebra Processor using FPGA
High-Performance Linear Algebra Processor using FPGA J. R. Johnson P. Nagvajara C. Nwankpa 1 Extended Abstract With recent advances in FPGA (Field Programmable Gate Array) technology it is now feasible
More informationBenchmarking Real-World In-Vehicle Applications
Benchmarking Real-World In-Vehicle Applications NVIDIA GTC 2015-03-18 m y c a b l e GmbH Michael Carstens-Behrens Gartenstraße 10 24534 Neumuenster, Germany +49 4321 559 56-55 +49 4321 559 56-10 mcb@mycable.de
More informationIntel released new technology call P6P
P6 and IA-64 8086 released on 1978 Pentium release on 1993 8086 has upgrade by Pipeline, Super scalar, Clock frequency, Cache and so on But 8086 has limit, Hard to improve efficiency Intel released new
More informationAdministrative Issues. L11: Sparse Linear Algebra on GPUs. Triangular Solve (STRSM) A Few Details 2/25/11. Next assignment, triangular solve
Administrative Issues L11: Sparse Linear Algebra on GPUs Next assignment, triangular solve Due 5PM, Tuesday, March 15 handin cs6963 lab 3 Project proposals Due 5PM, Wednesday, March 7 (hard
More informationAn Advanced Graph Processor Prototype
An Advanced Graph Processor Prototype Vitaliy Gleyzer GraphEx 2016 DISTRIBUTION STATEMENT A. Approved for public release: distribution unlimited. This material is based upon work supported by the Assistant
More informationCASE STUDY: Using Field Programmable Gate Arrays in a Beowulf Cluster
CASE STUDY: Using Field Programmable Gate Arrays in a Beowulf Cluster Mr. Matthew Krzych Naval Undersea Warfare Center Phone: 401-832-8174 Email Address: krzychmj@npt.nuwc.navy.mil The Robust Passive Sonar
More informationCSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore
CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Status Quo Previously, CPU vendors
More informationIntel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins
Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Outline History & Motivation Architecture Core architecture Network Topology Memory hierarchy Brief comparison to GPU & Tilera Programming Applications
More informationJohn Bloomfield, Mercury Computer Systems, Inc. HPEC September Mercury Computer Systems, Inc.
3DUWLWLRQLQJ &RPSXWDWLRQD7DVNV LWKLQDQ)3*$5,6& +HWHURJHQHRXV 0XWLFRPSXWHU John Bloomfield, Mercury Computer Systems, Inc. HPEC September 2002 Agenda Why worry about partitioning? How we partitioned a real-world
More informationGedae cwcembedded.com. The CHAMP-AV6 VPX-REDI. Digital Signal Processing Card. Maximizing Performance with Minimal Porting Effort
Technology White Paper The CHAMP-AV6 VPX-REDI Digital Signal Processing Card Maximizing Performance with Minimal Porting Effort Introduction The Curtiss-Wright Controls Embedded Computing CHAMP-AV6 is
More informationSimultaneous Multithreading on Pentium 4
Hyper-Threading: Simultaneous Multithreading on Pentium 4 Presented by: Thomas Repantis trep@cs.ucr.edu CS203B-Advanced Computer Architecture, Spring 2004 p.1/32 Overview Multiple threads executing on
More informationMaximizing Server Efficiency from μarch to ML accelerators. Michael Ferdman
Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency from μarch to ML accelerators Michael Ferdman Maximizing Server Efficiency with ML accelerators Michael
More informationBehavioral Data Mining. Lecture 12 Machine Biology
Behavioral Data Mining Lecture 12 Machine Biology Outline CPU geography Mass storage Buses and Networks Main memory Design Principles Intel i7 close-up From Computer Architecture a Quantitative Approach
More informationVersal: AI Engine & Programming Environment
Engineering Director, Xilinx Silicon Architecture Group Versal: Engine & Programming Environment Presented By Ambrose Finnerty Xilinx DSP Technical Marketing Manager October 16, 2018 MEMORY MEMORY MEMORY
More informationComputer Systems Architecture I. CSE 560M Lecture 19 Prof. Patrick Crowley
Computer Systems Architecture I CSE 560M Lecture 19 Prof. Patrick Crowley Plan for Today Announcement No lecture next Wednesday (Thanksgiving holiday) Take Home Final Exam Available Dec 7 Due via email
More informationSDACCEL DEVELOPMENT ENVIRONMENT. The Xilinx SDAccel Development Environment. Bringing The Best Performance/Watt to the Data Center
SDAccel Environment The Xilinx SDAccel Development Environment Bringing The Best Performance/Watt to the Data Center Introduction Data center operators constantly seek more server performance. Currently
More informationHow to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić
How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about
More informationOptimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology
Optimization of Vertical and Horizontal Beamforming Kernels on the PowerPC G4 Processor with AltiVec Technology EE382C: Embedded Software Systems Final Report David Brunke Young Cho Applied Research Laboratories:
More informationTesla Architecture, CUDA and Optimization Strategies
Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationCS 426 Parallel Computing. Parallel Computing Platforms
CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:
More informationOptimal Configuration of Compute Nodes for Synthetic Aperture Radar Processing
Optimal Configuration of Compute Nodes for Synthetic Aperture Radar Processing Jeffrey T. Muehring and John K. Antonio Deptartment of Computer Science, P.O. Box 43104, Texas Tech University, Lubbock, TX
More informationOn Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators
On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC
More informationExploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture
Exploiting ILP, TLP, and DLP with the Polymorphous TRIPS Architecture Ramadass Nagarajan Karthikeyan Sankaralingam Haiming Liu Changkyu Kim Jaehyuk Huh Doug Burger Stephen W. Keckler Charles R. Moore Computer
More informationIdentifying Working Data Set of Particular Loop Iterations for Dynamic Performance Tuning
Identifying Working Data Set of Particular Loop Iterations for Dynamic Performance Tuning Yukinori Sato (JAIST / JST CREST) Hiroko Midorikawa (Seikei Univ. / JST CREST) Toshio Endo (TITECH / JST CREST)
More informationEvaluation of Stream Virtual Machine on Raw Processor
Evaluation of Stream Virtual Machine on Raw Processor Jinwoo Suh, Stephen P. Crago, Janice O. McMahon, Dong-In Kang University of Southern California Information Sciences Institute Richard Lethin Reservoir
More informationHow to build a Megacore microprocessor. by Andreas Olofsson (MULTIPROG WORKSHOP 2017)
How to build a Megacore microprocessor by Andreas Olofsson (MULTIPROG WORKSHOP 2017) 1 Disclaimers 2 This presentation summarizes work done by Adapteva from 2008-2016. Statements and opinions are my own
More informationImaging Solutions by Mercury Computer Systems
Imaging Solutions by Mercury Computer Systems Presented By Raj Parihar Computer Architecture Reading Group, UofR Mercury Computer Systems Boston based; designs and builds embedded multi computers Loosely
More informationParallel Programming Multicore systems
FYS3240 PC-based instrumentation and microcontrollers Parallel Programming Multicore systems Spring 2011 Lecture #9 Bekkeng, 4.4.2011 Introduction Until recently, innovations in processor technology have
More informationHigh Performance Computing on GPUs using NVIDIA CUDA
High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and
More informationSupport for Programming Reconfigurable Supercomputers
Support for Programming Reconfigurable Supercomputers Miriam Leeser Nicholas Moore, Albert Conti Dept. of Electrical and Computer Engineering Northeastern University Boston, MA Laurie Smith King Dept.
More informationPerforming Multi-Phased Radar Processing with a Very Deep FPGA Pipeline
Performing Multi-Phased Radar Processing with a Very Deep FPGA Pipeline Jeffrey T. Muehring and John K. Antonio School of Computer Science University of Oklahoma antonio@ou.edu 2000 MAPLD Conference The
More informationA Scalable Multiprocessor for Real-time Signal Processing
A Scalable Multiprocessor for Real-time Signal Processing Daniel Scherrer, Hans Eberle Institute for Computer Systems, Swiss Federal Institute of Technology CH-8092 Zurich, Switzerland {scherrer, eberle}@inf.ethz.ch
More informationHakam Zaidan Stephen Moore
Hakam Zaidan Stephen Moore Outline Vector Architectures Properties Applications History Westinghouse Solomon ILLIAC IV CDC STAR 100 Cray 1 Other Cray Vector Machines Vector Machines Today Introduction
More informationWhy Multiprocessors?
Why Multiprocessors? Motivation: Go beyond the performance offered by a single processor Without requiring specialized processors Without the complexity of too much multiple issue Opportunity: Software
More informationComputer Architecture s Changing Definition
Computer Architecture s Changing Definition 1950s Computer Architecture Computer Arithmetic 1960s Operating system support, especially memory management 1970s to mid 1980s Computer Architecture Instruction
More informationQuixilica Floating-Point QR Processor Core
Data sheet Quixilica Floating-Point QR Processor Core With 13 processors on XC2V6000-5 - 20 GFlop/s at 100MHz With 10 processors on XC2V6000-5 - 15 GFlop/s at 97MHz With 4 processors on XC2V3000-5 - 81
More informationEvaluating the Potential of Graphics Processors for High Performance Embedded Computing
Evaluating the Potential of Graphics Processors for Performance Embedded Computing Shuai Mu 1, Chenxi Wang 1, Ming Liu 2, Dongdong Li 2, Maohua Zhu 1, Xiaoliang Chen 3, Xiang Xie 1, Yangdong Deng 1 1 Tsinghua
More informationrepresent parallel computers, so distributed systems such as Does not consider storage or I/O issues
Top500 Supercomputer list represent parallel computers, so distributed systems such as SETI@Home are not considered Does not consider storage or I/O issues Both custom designed machines and commodity machines
More informationIntroduction to GPU computing
Introduction to GPU computing Nagasaki Advanced Computing Center Nagasaki, Japan The GPU evolution The Graphic Processing Unit (GPU) is a processor that was specialized for processing graphics. The GPU
Virtual Prototyping and Performance Analysis of RapidIO-based System Architectures for Space-Based Radar
Virtual Prototyping and Performance Analysis of RapidIO-based System Architectures for Space-Based Radar David Bueno, Adam Leko, Chris Conger, Ian Troxel, and Alan D. George HCS Research Laboratory College
Exploring Tradeoffs Between Power and Performance for a Scientific Visualization Algorithm
Exploring Tradeoffs Between Power and Performance for a Scientific Visualization Algorithm Stephanie Labasan & Matt Larsen (University of Oregon), Hank Childs (Lawrence Berkeley National Laboratory) 26
Parallel computer architecture classification
Parallel computer architecture classification Hardware Parallelism Computing: execute instructions that operate on data. Computer Instructions Data Flynn's taxonomy (Michael Flynn, 1967) classifies computer
The LINPACK Benchmark on the Fujitsu AP 1000
The LINPACK Benchmark on the Fujitsu AP 1000 Richard P. Brent Computer Sciences Laboratory The LINPACK Benchmark A popular benchmark for floating-point performance. Involves the solution of a nonsingular
Simplify System Complexity
Simplify System Complexity With the new high-performance CompactRIO controller Arun Veeramani Senior Program Manager National Instruments NI CompactRIO The World's Only Software Designed Controller
An Investigation of the QR Decomposition Algorithm on Parallel Architectures
CS 252 COMPUTER ARCHITECTURE MAY 2000 An Investigation of the QR Decomposition Algorithm on Parallel Architectures Vito Dai and Brian Limketkai Abstract This paper presents an implementation of a QR decomposition
OPERA. Low Power Heterogeneous Architecture for the Next Generation of Smart Infrastructure and Platforms in Industrial and Societal Applications
OPERA Low Power Heterogeneous Architecture for the Next Generation of Smart Infrastructure and Platforms in Industrial and Societal Applications Co-funded by the Horizon 2020 Framework Programme of the
LAPACK. Linear Algebra PACKage. Janice Giudice David Knezevic
LAPACK Linear Algebra PACKage Janice Giudice David Knezevic Motivating Question Recalling from last week... Level 1 BLAS: vector ops Level 2 BLAS: matrix-vector ops, O(n^2) flops on O(n^2) data
Adaptive Computing Systems (ACS) Domain for Implementing DSP Algorithms in Reconfigurable Hardware. Objective/Approach/Process
Adaptive Computing Systems (ACS) Domain for Implementing DSP Algorithms in Reconfigurable Hardware John Zaino, Eric Pauer, Ken Smith, Paul Fiore, Jairam Ramanathan, Cory Myers {john.c.aino, ken.smith,
3D-Stacked Logic-in-Memory Hardware For Sparse Matrix Operations
Carnegie Mellon 3D-Stacked Logic-in-Memory Hardware For Sparse Matrix Operations Franz Franchetti Carnegie Mellon University In collaboration with Qiuling Zhu, Fazle Sadi, Qi Guo, Guangling Xu, Ekin Sumbul,
Optimizing Cache Performance in Matrix Multiplication. UCSB CS240A, 2017 Modified from Demmel/Yelick's slides
Optimizing Cache Performance in Matrix Multiplication UCSB CS240A, 2017 Modified from Demmel/Yelick's slides Case Study with Matrix Multiplication An important kernel in many problems Optimization ideas
Design, Implementation and Performance Evaluation of Synthetic Aperture Radar Signal Processor on FPGAs
Design, Implementation and Performance Evaluation of Synthetic Aperture Radar Signal Processor on FPGAs Hemang Parekh Masters Thesis MS(Computer Engineering) University of Kansas 23rd June, 2000 Committee:
Scientific Computing. Some slides from James Lambers, Stanford
Scientific Computing Some slides from James Lambers, Stanford Dense Linear Algebra Scaling and sums Transpose Rank-one updates Rotations Matrix vector products Matrix Matrix products BLAS Designing Numerical
Spring 2010 Prof. Hyesoon Kim. Xbox 360 System Architecture, Andrews, Baker
Spring 2010 Prof. Hyesoon Kim Xbox 360 System Architecture, Andrews, Baker 3 CPU cores 4-way SIMD vector units 8-way 1MB L2 cache (3.2 GHz) 2 way SMT 48 unified shaders 3D graphics units 512-Mbyte DRAM
A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures
A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures W.M. Roshan Weerasuriya and D.N. Ranasinghe University of Colombo School of Computing A Comparative
Basics of Performance Engineering
ERLANGEN REGIONAL COMPUTING CENTER Basics of Performance Engineering J. Treibig HiPerCH 3, 23./24.03.2015 Why hardware should not be exposed Such an approach is not portable Hardware issues frequently
Searching for Meaning in the Era of Big Data and IoT
Searching for Meaning in the Era of Big Data and IoT Trung Tran MIT Lincoln Labs GraphEx Conference 11 May 2016 Distribution Statement A MTO Strategy EM Spectrum Tactical Information Extraction Globalization
Clearspeed Embedded Apps and Architecture for Space
Clearspeed Embedded Apps and Architecture for Space EEL 6686: Presentation 1 Chris Morales Kaz Onishi ECE University of Florida, Gainesville, Florida January 29, 2015 Introduction Embedded systems
Staged Memory Scheduling
Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie Mellon University, *AMD Research June 12th 2012 Executive Summary Observation:
Mercury Computer Systems & The Cell Broadband Engine
Mercury Computer Systems & The Cell Broadband Engine Georgia Tech Cell Workshop 18-19 June 2007 About Mercury Leading provider of innovative computing solutions for challenging applications R&D centers
Systems Design and Programming. Instructor: Chintan Patel
Systems Design and Programming Instructor: Chintan Patel Text: Barry B. Brey, 'The Intel Microprocessors, 8086/8088, 80186/80188, 80286, 80386, 80486, Pentium and Pentium Pro Processor, Pentium II, Pentium
Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster
Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Veerendra Allada, Troy Benjegerdes Electrical and Computer Engineering, Ames Laboratory Iowa State University &
Storage I/O Summary. Lecture 16: Multimedia and DSP Architectures
Storage I/O Summary Storage devices Storage I/O Performance Measures» Throughput» Response time I/O Benchmarks» Scaling to track technological change» Throughput with restricted response time is normal
Scalable Multi-DM642-based MPEG-2 to H.264 Transcoder. Arvind Raman, Sriram Sethuraman Ittiam Systems (Pvt.) Ltd. Bangalore, India
Scalable Multi-DM642-based MPEG-2 to H.264 Transcoder Arvind Raman, Sriram Sethuraman Ittiam Systems (Pvt.) Ltd. Bangalore, India Outline of Presentation MPEG-2 to H.264 Transcoding Need for a multiprocessor
The Fusion Distributed File System
The Fusion Distributed File System Dongfang Zhao February 2015 Outline Introduction FusionFS System Architecture Metadata Management Data Movement Implementation Details Unique
Executable Requirements: Opportunities and Impediments
Executable Requirements: Opportunities and Impediments G. A. Shaw and A. H. Anderson * Abstract: In a top-down, language-based design methodology,
Anatomy of AMD's TeraScale Graphics Engine
Anatomy of AMD's TeraScale Graphics Engine Mike Houston Design Goals Focus on Efficiency f(perf/watt, Perf/$) Scale up processing power and AA performance Target >2x previous generation Enhance stream
Performance Benefits of OpenVMS V8.4 Running on BL8x0c i2 Server Blades
Performance Benefits of OpenVMS V8.4 Running on BL8x0c i2 Server Blades A detailed review of performance features and test results for OpenVMS V8.4. March 2011. © 2011, TechWise Research. All Rights Reserved
A Streaming Virtual Machine for GPUs
A Streaming Virtual Machine for GPUs Kenneth Mackenzie (Reservoir L, Inc) Dan Campbell (Georgia Tech Research Institute) Peter Szilagyi (Reservoir L, Inc) Copyright 2005 Government Purpose Rights, All