LORE: Mapping and Optimization Framework
1 LORE: Mapping and Optimization Framework
Michael Wolf, MIT Lincoln Laboratory
11 September 2012
This work is sponsored by the Defense Advanced Research Projects Agency (DARPA) under Air Force contract FA C. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government. Distribution Statement A: Approved for public release, distribution is unlimited. (9/10/2012). NOT APPROVED FOR PUBLIC RELEASE
2 Overview of Mapping and Optimization Challenges
Mapping and optimization spans many architectures (FPGAs, chips, small hybrid systems, clusters, supercomputers, data warehouses) and applications (SAR, secure communication, cyber, database operations)
Challenges:
- Realistic simulations of applications
- Support for diverse languages/numerical libraries
- Support for diverse devices and architectures
3 LORE
LORE is MIT Lincoln Laboratory's Mapping and Optimization Runtime Environment: a parallel framework/environment for
- Optimizing data-to-processor mapping for parallel applications
- Simulating and optimizing new (and existing) architectures
- Generating performance data (runtime, power, etc.)
- Code generation and execution for target architectures
Three generations of mapping and optimization:
- pmapper: Matlab, dense operations, simple architecture model, executor, serial
- SMaRT/MORE: Matlab, sparse operations (limited data size), architecture model, no executor, serial
- LORE: multiple language support, sparse/dense operations, architecture model, executor, parallel, robust software
pmapper patent issued: "Method and apparatus performing automatic mapping for multiprocessor system"
4 Key Features of LORE
- Support for multiple languages and numerical libraries
- Ability to solve large problems: written in C++ and runs in parallel, fitting larger problems into memory and reducing time to solution
- Support for dense and sparse linear algebra operations
- Production-quality research software: easy-to-use interfaces, designed to support future algorithms/packages/languages
LORE is NOT an autoparallelizing compiler:
- It will not generate optimized parallel code for an arbitrary set of (serial or parallel) instructions
- Data layouts are optimized in the context of maps
5 Outline
- Motivation/Overview
- Design and Usage
  - Usage 1: Map Optimization
  - Usage 2: Performance Evaluation
- LORE and POEM
- Preliminary POEM Results: 2D FFT
- Next Steps and Summary
6 LORE Framework Overview
LORE input: user code, LORE parameters, architectural model
LORE output (one or more of):
1. Set of optimized maps, or
2. Performance data, or
3. Optimized architecture(s), or
4. Generated code, or
5. Results from a run on the target architecture
Key/novel features:
- Application code to simulator
- Data mapping optimization for user code
- Support for multiple languages and libraries
- Ability to solve large problems
- Production-quality software
7 LORE Design Overview
LORE input → Parse Manager (Parser, Language Interface) → AST → Map Manager (Map Converter, Map Builder) → mapped AST → Analyzer and Optimizer (core functionality) → Code Generator, Runtime Engine, Machine Interface → LORE output
AST = abstract syntax tree
8 Outline
- Motivation/Overview
- Design and Usage
  - Usage 1: Map Optimization
  - Usage 2: Performance Evaluation
- LORE and POEM
- Preliminary POEM Results: 2D FFT
- Next Steps and Summary
9 LORE Usage 1: Map Optimization
- LORE produces a set of optimized maps for the parallel variables specified in the user code
- Matrix-vector product example (y = A x): LORE computes a map for the dense matrix and a map for each of the two vectors
- LORE optimizes data mapping to improve the parallel performance of key computational kernels
10 Map Optimization: Input (Dense Matrix-Vector Product)
Input from application: user code for the dense matrix-vector product y = A x
[Pipeline: user code → Parser → Mapper → Analyzer and Optimizer → optimized maps]
11 Map Optimization: Representation (Dense Matrix-Vector Product)
The parser converts the user code into an abstract syntax tree (AST), which is input language/numerical library neutral
[AST: SB node → MV node → PVar: y, PVar: A, PVar: x]
12 Map Optimization: Mapping (Dense Matrix-Vector Product)
LORE computes a map for each parallel variable in the AST:
- PVar y, vector map: P0: {0,1}, P1: {2,3}
- PVar A, matrix map: P0: {0,1}, P1: {2,3}
- PVar x, vector map: P0: {0,1}, P1: {2,3}
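The block maps above can be read as an assignment of index blocks to processor ranks. As a minimal sketch (plain Python for illustration; `block_map` is a hypothetical helper, not LORE's C++ interface), such a map might be represented as a dictionary from rank to the indices it owns:

```python
# Hypothetical sketch of a block "map" like the ones shown above:
# each processor rank owns a contiguous block of indices.
def block_map(n_indices, n_procs):
    """Return {rank: [owned indices]} distributing n_indices over n_procs."""
    block = (n_indices + n_procs - 1) // n_procs  # ceiling division
    return {p: list(range(p * block, min((p + 1) * block, n_indices)))
            for p in range(n_procs)}

vector_map = block_map(4, 2)
# vector_map == {0: [0, 1], 1: [2, 3]}, matching "P0: {0,1}, P1: {2,3}"
```

The same structure serves for the matrix map by interpreting the owned indices as row blocks.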
13 Map Optimization: Output (Dense Matrix-Vector Product)
- LORE output: optimized maps for the parallel variables
- The new maps are used to redistribute the vector and matrix data
- The optimized matrix-vector product y = A x is then calculated with the new data distributions
14 Outline
- Motivation/Overview
- Design and Usage
  - Usage 1: Map Optimization
  - Usage 2: Performance Evaluation
- LORE and POEM
- Preliminary POEM Results: 2D FFT
- Next Steps and Summary
15 LORE Usage 2: Performance Evaluation
LORE simulates user code (e.g., y = A x) on a specified architecture to produce performance evaluation metrics
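A performance evaluator of this kind can be approximated with a toy analytic model. The sketch below is an illustrative assumption, not LORE's actual simulator: it estimates the runtime of a block-row-distributed matrix-vector product from compute, memory, and network terms (`matvec_time` and its parameters are hypothetical names).

```python
# Toy analytic performance model (illustrative assumption, not LORE's
# simulator): per-processor compute + memory + network time for y = A x
# with the n x n matrix A distributed by block rows.
def matvec_time(n, n_procs, flop_rate, mem_bw, net_bw, net_latency):
    """Estimated seconds for a block-row-distributed matrix-vector product."""
    rows = n / n_procs                 # rows owned per processor
    flops = 2.0 * rows * n             # multiply-add per local matrix entry
    mem_bytes = 8.0 * rows * n         # stream the local matrix block once
    net_bytes = 8.0 * (n - rows)       # gather remote pieces of x
    return (flops / flop_rate
            + mem_bytes / mem_bw
            + net_latency + net_bytes / net_bw)

t1 = matvec_time(4096, 1, 1e9, 1e10, 1e9, 1e-6)
t16 = matvec_time(4096, 16, 1e9, 1e10, 1e9, 1e-6)
# More processors shrink the compute and memory terms but add communication.
```

Sweeping such a model over processor counts and network parameters is the flavor of trade-off study the slides describe, at far lower fidelity.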
16 Performance Evaluation: Input (Dense Matrix-Vector Product)
Input from application: user code for the dense matrix-vector product y = A x and an architecture model
[Pipeline: user code + architecture model → Parser → Mapper → Analyzer and Optimizer → MI code generator → MI code → performance evaluator/simulator → performance data]
MI = machine independent
17 Performance Evaluation: Representation (Dense Matrix-Vector Product)
The parser converts the user code into an abstract syntax tree (AST), which is input language/numerical library neutral
[AST: SB node → MV node → PVar: y, PVar: A, PVar: x]
18 Performance Evaluation: Mapping (Dense Matrix-Vector Product)
LORE computes a map for each parallel variable in the AST:
- PVar y, vector map: P0: {0,1}, P1: {2,3}
- PVar A, matrix map: P0: {0,1}, P1: {2,3}
- PVar x, vector map: P0: {0,1}, P1: {2,3}
19 Performance Evaluation: Machine Independent Code
The mapped AST and the architecture model are used to generate machine independent (MI) code:
- P0: read A00, x0, y0; y0 = A00 x0; send x0; read A01; y0 += A01 x1; write y0
- P1: read A11, x1, y1; y1 = A11 x1; send x1; read A10; y1 += A10 x0; write y1
20 Machine Independent Code
Machine independent code for the matrix-vector product (2 row blocks, 2 processors), as a flow graph of computation, memory access, and communication operations:
- P0: read A00, x0, y0 → y0 = A00 x0 → send x0 → read A01 → y0 += A01 x1 → write y0
- P1: read A11, x1, y1 → y1 = A11 x1 → send x1 → read A10 → y1 += A10 x0 → write y1
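The schedule above is a 2x2-blocked matrix-vector product: each processor multiplies its diagonal block by its local piece of x, exchanges x blocks, then accumulates the off-diagonal contribution. A minimal serial sketch of that schedule in plain Python (illustrative only; `blocked_matvec` is a hypothetical name):

```python
# Illustrative serial execution of the machine-independent schedule for a
# 2x2-blocked matrix-vector product (not LORE code).
def blocked_matvec(A, x):
    """y = A x following the MI schedule: diagonal blocks first,
    then off-diagonal accumulation after exchanging the x blocks."""
    def mv(B, v):  # dense block times vector
        return [sum(B[i][j] * v[j] for j in range(len(v)))
                for i in range(len(B))]
    (A00, A01), (A10, A11) = A
    x0, x1 = x
    y0 = mv(A00, x0)                              # P0: y0 = A00 x0 (local)
    y1 = mv(A11, x1)                              # P1: y1 = A11 x1 (local)
    # After "send x0" / "send x1", accumulate the off-diagonal blocks.
    y0 = [a + b for a, b in zip(y0, mv(A01, x1))]  # P0: y0 += A01 x1
    y1 = [a + b for a, b in zip(y1, mv(A10, x0))]  # P1: y1 += A10 x0
    return y0 + y1
```

On real hardware the two block rows run concurrently; the dependence structure (local block before remote block) is exactly the flow graph on the slide.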
21 Performance Evaluation: Output
The MI code is passed to the performance evaluator, which simulates the user code on the target architecture and produces performance data
22 Outline
- Motivation/Overview
- Design and Usage
  - Usage 1: Map Optimization
  - Usage 2: Performance Evaluation
- LORE and POEM
- Preliminary POEM Results: 2D FFT
- Next Steps and Summary
23 POEM (Photonically Optimized Embedded Microprocessor)
- Photonic innovation: photonic switching technology, ring resonator technology, photonics layered on silicon
- Architecture design: architecture parameter characterization, proposed photonic architectures
- Application analysis: simulation, mapping, and optimization framework; processing chains for military-critical applications
- Fielded systems: handheld DNA sequencing device, UAV surveillance, handheld PDA field devices
POEM will bridge the gap between innovations in chip-scale photonics and fielded military-critical systems. It will offer a complete architecture design and analysis for numerous real-world military-critical applications.
24 LORE's Role in the POEM Program
LORE provides POEM's simulation, mapping, and optimization framework for application analysis: it is used to study chip-scale photonics and its impact on applications
25 LORE and POEM: Applications
- LORE provides a framework for analyzing POEM applications (SAR, secure communication, cyber, database operations)
- LORE supports many key numerical kernels (FFT, sparse matrix-vector product, vector updates, etc.); applications are supported through composition of these kernels
- Easy to extend to analyze new applications
- Initially analyzing a synthetic aperture radar (SAR) application
LORE enables the analysis of many applications on many different architectures (existing and proposed)
26 LORE and POEM: Architectures
- LORE supports simulation of applications on POEM architectures (e.g., electronic mesh and photonic bus)
- Framework for simulating user code: the LORE simulator for understanding big-picture trends, plus interfaces to third-party simulators (e.g., PhoenixSim) for higher-fidelity performance data
LORE enables the analysis of many applications on many different architectures (existing and proposed)
27 LORE and POEM: Optimization of Maps
- A good application-data-to-processor mapping is crucial to achieving peak parallel performance on target machines
- Not difficult for SAR applications (simple maps are sufficient)
- Challenging for applications with irregular communication (DNA sequence analysis and sparse matrix computations)
LORE provides automatic map optimization
28 Outline
- Motivation/Overview
- Design and Usage
  - Usage 1: Map Optimization
  - Usage 2: Performance Evaluation
- LORE and POEM
- Preliminary POEM Results: 2D FFT
- Next Steps and Summary
29 LORE and Performance Evaluation
LORE used to produce performance evaluation data (time in seconds, GFLOPS) for the 2D FFT, an important kernel in the SAR processing chain
[Pipeline: user code + architectural model → Parser → Mapper → Analyzer and Optimizer → MI code generator → MI code → performance evaluator/simulator → performance data]
30 POEM Architecture Models
- Two architectures modeled: electronic mesh and photonic bus, each with 16 processors/cores (P1-P16, with local memory) and 4 shared memories (M1-M4)
- Processors and shared memory are the same in both models
- Network parameters (latency, bandwidth) set to allow an apples-to-apples comparison between the networks
The LORE framework allows direct comparison of the electronic mesh and photonic bus architectures
31 Preliminary POEM Simulations
Initially simulated the 2D FFT kernel, an important kernel for SAR applications:
- Electronic mesh (EM): FFT, CT, FFT. A 1D FFT for each row of the dense matrix; then a corner turn (CT: read, transposed write) for data locality of the second FFT; then a 1D FFT for each column.
- Photonic bus (PB): FFT, SCA, FFT. The corner turn is replaced by a synchronous coalesced access (SCA) write.
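The row-FFT/corner-turn/row-FFT decomposition can be checked in a few lines of Python. The sketch below uses a naive 1D DFT as an illustrative stand-in for a fast FFT and a transpose for the corner turn; the final transpose restores the conventional output orientation (function names are hypothetical):

```python
import cmath

def dft(v):
    """Naive 1D DFT (illustrative stand-in for a fast FFT)."""
    n = len(v)
    return [sum(v[j] * cmath.exp(-2j * cmath.pi * k * j / n)
                for j in range(n)) for k in range(n)]

def transpose(m):
    """Corner turn: matrix row-to-column reorganization."""
    return [list(row) for row in zip(*m)]

def fft2d(m):
    """2D DFT as: 1D DFT of each row, corner turn, 1D DFT of each
    (former) column; a last transpose restores row-major orientation."""
    step1 = [dft(row) for row in m]   # FFT of each row
    ct = transpose(step1)             # corner turn
    step2 = [dft(row) for row in ct]  # FFT of each column of step1
    return transpose(step2)
```

The corner turn is pure data reorganization, which is why replacing it with a cheaper SCA write leaves the numerical result unchanged while removing a communication bottleneck.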
32 Photonic Synchronous Coalesced Access Network (PSCAN)
- PSCAN leverages the differences between electronic and photonic interconnect to achieve large efficiency gains in critical operations
- Utilizes distance independence to rapidly reorganize spatially separate data; large gains in efficiency even when bandwidth is equalized
- Synthesizes a matrix row-to-column transpose/write into a single SCA-write transaction by interleaving data from spatially separate data producers
- A novel ISA construct enabled by a highly synchronous photonic waveguide
[Timing diagram: processors DRIVE0-DRIVE3 interleave data onto the waveguide against the photonic clock (P-CLK); address info driven to DRAM]
Credit: Dave Whelihan, MITLL
33 Preliminary LORE Results
[Plot: GOPS and % of ideal for the 2D FFT operation vs. number of cores, for electronic, PSCAN, and ideal]
Experiment:
- Number of memory controllers: 4
- Matrix size: 4096 x 4096
- Bandwidth to memory equalized in the photonic and electronic models
- Traditional 2D FFT with block transpose run on the electronic mesh; 2D FFT with SCA run on PSCAN
- Number of processors scaled to examine the performance of photonics and electronics in an increasingly work-starved environment
The PSCAN architecture with SCA yields significant speed-up over the electronic mesh architecture using block transpose
34 Data Reorganization Bottleneck in Modern Manycore Systems
[Plot: percentage of time spent reorganizing data for the 2D FFT (block transpose vs. SCA) vs. number of cores]
- As the number of cores in modern CMPs grows, data reorganization becomes an increasing performance bottleneck: performance is limited by the inability to keep processors supplied with data
- The SCA operation performs the data reorganization step of the 2D FFT more efficiently as the number of processors grows, alleviating the traditional block transpose bottleneck
35 Outline
- Motivation/Overview
- Design and Usage
  - Usage 1: Map Optimization
  - Usage 2: Performance Evaluation
- LORE and POEM
- Preliminary POEM Results: 2D FFT
- Next Steps and Summary
36 LORE Next Steps
- Extend the architecture model and LORE simulator: support the memory hierarchy; model network contention more accurately
- Support for external simulators, e.g., PhoenixSim (Columbia), SST (Sandia); needed for higher-fidelity simulations
- Better power modeling (e.g., dynamic power)
- Additional parallel numerical library/language support: additional languages (Matlab, Python), libraries (e.g., VSIPL++/PVTOL), and kernels (e.g., sparse matrix operations)
- Code generator and runtime engine for execution/emulation on target architectures
37 Summary
- LORE: parallel framework/environment for optimizing data-to-processor mapping for parallel applications and for simulating and optimizing new (and existing) architectures
- LORE used to compare photonic and electronic architectures of interest to the POEM project: supports many applications (through composition of kernels) and different architectures
- Preliminary results for the 2D FFT kernel support the thesis that photonics can improve performance for these applications; the SCA instruction mitigates the performance impact of the corner turn in the 2D FFT operation
LORE allows exploration of the architecture design space in the context of real application constraints
38 Acknowledgements
LORE: Michelle Beard, Anna Klein, Sanjeev Mohindra, Julie Mullen, Eric Robinson, Nadya Bliss (manager), Minna Song (MIT student intern)
POEM: Dave Whelihan, Jeff Hughes, Scott Sawyer
More informationCHERI A Hybrid Capability-System Architecture for Scalable Software Compartmentalization
CHERI A Hybrid Capability-System Architecture for Scalable Software Compartmentalization Robert N.M. Watson *, Jonathan Woodruff *, Peter G. Neumann, Simon W. Moore *, Jonathan Anderson, David Chisnall
More informationHigh-Performance Linear Algebra Processor using FPGA
High-Performance Linear Algebra Processor using FPGA J. R. Johnson P. Nagvajara C. Nwankpa 1 Extended Abstract With recent advances in FPGA (Field Programmable Gate Array) technology it is now feasible
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology
More informationComputer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per
More informationNon-Uniform Memory Access (NUMA) Architecture and Multicomputers
Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Parallel and Distributed Computing MSc in Information Systems and Computer Engineering DEA in Computational Engineering Department of Computer
More informationRadar Tomography of Moving Targets
On behalf of Sensors Directorate, Air Force Research Laboratory Final Report September 005 S. L. Coetzee, C. J. Baker, H. D. Griffiths University College London REPORT DOCUMENTATION PAGE Form Approved
More informationWhite Paper Assessing FPGA DSP Benchmarks at 40 nm
White Paper Assessing FPGA DSP Benchmarks at 40 nm Introduction Benchmarking the performance of algorithms, devices, and programming methodologies is a well-worn topic among developers and research of
More informationScientific Applications. Chao Sun
Large Scale Multiprocessors And Scientific Applications Zhou Li Chao Sun Contents Introduction Interprocessor Communication: The Critical Performance Issue Characteristics of Scientific Applications Synchronization:
More informationECE468 Computer Organization and Architecture. Memory Hierarchy
ECE468 Computer Organization and Architecture Hierarchy ECE468 memory.1 The Big Picture: Where are We Now? The Five Classic Components of a Computer Processor Control Input Datapath Output Today s Topic:
More informationCopyright 2012, Elsevier Inc. All rights reserved.
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more
More informationWhy Multiprocessors?
Why Multiprocessors? Motivation: Go beyond the performance offered by a single processor Without requiring specialized processors Without the complexity of too much multiple issue Opportunity: Software
More informationBeyond the PDP-11: Architectural support for a memory-safe C abstract machine
Trustworthy Systems Research and CTSRDCRASH-worthy Development Beyond the PDP-11: Architectural support for a memory-safe C abstract machine David Chisnall, Colin Rothwell, Brooks Davis, Robert N.M. Watson,
More informationOrganic Computing DISCLAIMER
Organic Computing DISCLAIMER The views, opinions, and/or findings contained in this article are those of the author(s) and should not be interpreted as representing the official policies, either expressed
More informationUNCLASSIFIED. R-1 ITEM NOMENCLATURE PE D8Z: Data to Decisions Advanced Technology FY 2012 OCO
Exhibit R-2, RDT&E Budget Item Justification: PB 2012 Office of Secretary Of Defense DATE: February 2011 BA 3: Advanced Development (ATD) COST ($ in Millions) FY 2010 FY 2011 Base OCO Total FY 2013 FY
More informationEE482C, L1, Apr 4, 2002 Copyright (C) by William J. Dally, All Rights Reserved. Today s Class Meeting. EE482S Lecture 1 Stream Processor Architecture
1 Today s Class Meeting EE482S Lecture 1 Stream Processor Architecture April 4, 2002 William J Dally Computer Systems Laboratory Stanford University billd@cslstanfordedu What is EE482C? Material covered
More informationEE482S Lecture 1 Stream Processor Architecture
EE482S Lecture 1 Stream Processor Architecture April 4, 2002 William J. Dally Computer Systems Laboratory Stanford University billd@csl.stanford.edu 1 Today s Class Meeting What is EE482C? Material covered
More informationGraphBLAS: A Programming Specification for Graph Analysis
GraphBLAS: A Programming Specification for Graph Analysis Scott McMillan, PhD GraphBLAS: A Programming A Specification for Graph Analysis 2016 October Carnegie 26, Mellon 2016University 1 Copyright 2016
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationConstruction Change Order analysis CPSC 533C Analysis Project
Construction Change Order analysis CPSC 533C Analysis Project Presented by Chiu, Chao-Ying Department of Civil Engineering University of British Columbia Problems of Using Construction Data Hybrid of physical
More informationVirtual Prototyping and Performance Analysis of RapidIO-based System Architectures for Space-Based Radar
Virtual Prototyping and Performance Analysis of RapidIO-based System Architectures for Space-Based Radar David Bueno, Adam Leko, Chris Conger, Ian Troxel, and Alan D. George HCS Research Laboratory College
More informationTRACT: Threat Rating and Assessment Collaboration Tool
TRACT: Threat Rating and Assessment Collaboration Tool Robert Hollinger and Doran Smestad Advised by: George Heineman (WPI), Philip Marquardt (MIT/LL) Worcester Polytechnic Institute Major Qualifying Project
More informationCan FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al.
Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al.) Andreas Kurth 2017-12-05 1 In short: The situation Image credit:
More informationVirtuoso: Narrowing the Semantic Gap in Virtual Machine Introspection
Virtuoso: Narrowing the Semantic Gap in Virtual Machine Introspection Brendan Dolan-Gavitt *, Tim Leek, Michael Zhivich, Jonathon Giffin *, and Wenke Lee * * Georgia Institute of Technology MIT Lincoln
More informationNon-Uniform Memory Access (NUMA) Architecture and Multicomputers
Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico February 29, 2016 CPD
More informationCenter for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop
Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop http://cscads.rice.edu/ Discussion and Feedback CScADS Autotuning 07 Top Priority Questions for Discussion
More informationDesign methodology for multi processor systems design on regular platforms
Design methodology for multi processor systems design on regular platforms Ph.D in Electronics, Computer Science and Telecommunications Ph.D Student: Davide Rossi Ph.D Tutor: Prof. Roberto Guerrieri Outline
More informationCovert and Anomalous Network Discovery and Detection (CANDiD)
Covert and Anomalous Network Discovery and Detection (CANDiD) Rajmonda S. Caceres, Edo Airoldi, Garrett Bernstein, Edward Kao, Benjamin A. Miller, Raj Rao Nadakuditi, Matthew C. Schmidt, Kenneth Senne,
More informationMulti-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation
Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation 1 Cheng-Han Du* I-Hsin Chung** Weichung Wang* * I n s t i t u t e o f A p p l i e d M
More informationVisualizing Attack Graphs, Reachability, and Trust Relationships with NAVIGATOR*
Visualizing Attack Graphs, Reachability, and Trust Relationships with NAVIGATOR* Matthew Chu, Kyle Ingols, Richard Lippmann, Seth Webster, Stephen Boyer 14 September 2010 9/14/2010-1 *This work is sponsored
More informationMulti-core Computing Lecture 2
Multi-core Computing Lecture 2 MADALGO Summer School 2012 Algorithms for Modern Parallel and Distributed Models Phillip B. Gibbons Intel Labs Pittsburgh August 21, 2012 Multi-core Computing Lectures: Progress-to-date
More informationInterconnection Networks
Lecture 18: Interconnection Networks Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Credit: many of these slides were created by Michael Papamichael This lecture is partially
More informationChallenges and Advances in Parallel Sparse Matrix-Matrix Multiplication
Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication Aydin Buluc John R. Gilbert University of California, Santa Barbara ICPP 2008 September 11, 2008 1 Support: DOE Office of Science,
More informationLUNAR TEMPERATURE CALCULATIONS ON A GPU
LUNAR TEMPERATURE CALCULATIONS ON A GPU Kyle M. Berney Department of Information & Computer Sciences Department of Mathematics University of Hawai i at Mānoa Honolulu, HI 96822 ABSTRACT Lunar surface temperature
More informationSpiral. Computer Generation of Performance Libraries. José M. F. Moura Markus Püschel Franz Franchetti & the Spiral Team. Performance.
Spiral Computer Generation of Performance Libraries José M. F. Moura Markus Püschel Franz Franchetti & the Spiral Team Platforms Performance Applications What is Spiral? Traditionally Spiral Approach Spiral
More informationCompilers and Compiler-based Tools for HPC
Compilers and Compiler-based Tools for HPC John Mellor-Crummey Department of Computer Science Rice University http://lacsi.rice.edu/review/2004/slides/compilers-tools.pdf High Performance Computing Algorithms
More informationParallel Matlab: RTExpress on 64-bit SGI Altix with SCSL and MPT
Parallel Matlab: RTExpress on -bit SGI Altix with SCSL and MPT Cosmo Castellano Integrated Sensors, Inc. Phone: 31-79-1377, x3 Email Address: castellano@sensors.com Abstract By late, RTExpress, a compiler
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied
More informationIntroducing a Cache-Oblivious Blocking Approach for the Lattice Boltzmann Method
Introducing a Cache-Oblivious Blocking Approach for the Lattice Boltzmann Method G. Wellein, T. Zeiser, G. Hager HPC Services Regional Computing Center A. Nitsure, K. Iglberger, U. Rüde Chair for System
More information