LLMORE: Mapping and Optimization Framework

Size: px
Start display at page:

Download "LLMORE: Mapping and Optimization Framework"

Transcription

1 LORE: Mapping and Optimization Framework Michael Wolf, MIT Lincoln Laboratory 11 September 2012 This work is sponsored by Defense Advanced Research Projects Agency (DARPA) under Air Force contract FA C Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government. Distribution Statement A: Approved for public release, distribution is unlimited. (9/10/2012). NOT APPROVED FOR PUBLIC RELEASE

2 Architectures Applications Overview of Mapping and Optimization Challenges FPGA SAR Secure Communication Cyber Database operations Chips Mapping and Optimization Small hybrid systems Clusters Challenges: Realistic simulations of applications Support for diverse languages/numerical libraries Support for diverse devices and architectures Supercomputers Data warehouses Michael Wolf - 2

3 LORE LORE is MIT Lincoln Laboratory s Mapping and Optimization Runtime Environment Parallel framework/environment for Optimizing data to processor mapping for parallel applications Simulating and optimizing new (and existing) architectures Generating performance data (runtime, power, etc.) Code generation and execution for target architectures pmapper SMaRT/MORE LORE pmapper: Matlab, dense operations, simple architecture model, executor, serial SMaRT/MORE: Matlab, sparse operations (limited data size), architecture model, no executor, serial LORE: multiple language support, sparse/dense operations, architecture model, executor, parallel, robust software pmapper patent issued: Method and apparatus performing automatic mapping for multiprocessor system Three generations of mapping and optimization Michael Wolf - 3

4 Key Features of LORE Support for multiple languages and numerical libraries Ability to solve large problems Written in C++ and runs in parallel Fit larger problems into memory, reduces time to solution Support for dense and sparse linear algebra operations Production quality research software Easy to use interfaces Designed to support future algorithms/packages/languages LORE is NOT an autoparallelizing compiler Will not generate optimized parallel code for any set of (serial or parallel) instructions Data layouts optimized in context of maps Michael Wolf - 4

5 Outline Motivation/Overview Design and Usage Usage 1: Map Optimization Usage 2: Performance Evaluation LORE and POEM Preliminary POEM Results: 2D FFT Next Steps and Summary Michael Wolf - 5

6 LORE Framework Overview LORE input User Code LORE Parameters Architectural Model Key/Novel Features Application code to simulator Data mapping optimization for user code Support for multiple languages and libraries Ability to solve large problems Production quality software LORE LORE output 1. Set of op=mized maps or 2. Performance data or 3. Op=mized architecture(s) or 4. Generated code or 5. Results from run on target architecture Output: One or more Michael Wolf - 6

7 LORE Design Overview LORE Input LORE Output Parse Manager Parser Language Interface Map Manager Map Converter Builder Map Builder Analyzer and Op=mizer Core Functionality Code Generator Run=me Engine Machine Interface LORE Output = abstract syntax tree Michael Wolf - 7

8 Outline Motivation/Overview Design and Usage Usage 1: Map Optimization Usage 2: Performance Evaluation LORE and POEM Preliminary POEM Results: 2D FFT Next Steps and Summary Michael Wolf - 8

9 LORE Usage 1: Map Optimization y = A x y = A x LORE LORE produces set of optimized maps for parallel variables specified in user code Matrix-vector product example LORE computes map for dense matrix LORE computes map for two vectors LORE optimizes data mapping to improve parallel performance of key computational kernels Michael Wolf - 9

10 Map Optimization: Input Dense Matrix-Vector Product User Code: User code Parser Mapper Mapped Optimized Maps y = A x Analyzer and Optimizer LORE y=ax Input from application: user code for dense matrix-vector product Michael Wolf - 10

11 Map Optimization: Representation Dense Matrix-Vector Product SB User code y=ax Parser Mapper Mapped Optimized Maps MV Analyzer and Optimizer LORE PVar: y PVar: A PVar: x Parser converts user code into abstract syntax tree (), which is input language/numerical library neutral Michael Wolf - 11

12 Map Optimization: Mapping Dense Matrix-Vector Product SB User code Parser Mapper Mapped Optimized Maps MV LORE Analyzer and Optimizer PVar: y PVar: A PVar: x Vector Map P0: {0,1} P1: {2,3} Matrix Map P0: {0,1} P1: {2,3} Vector Map P0: {0,1} P1: {2,3} LORE computes map for each parallel variable in Michael Wolf - 12

13 Map Optimization: Output Dense Matrix-Vector Product x y A x y A User code Parser Mapper Mapped Optimized Maps LORE Analyzer and Optimizer y = A x Optimized y=ax LORE output: optimized maps for parallel variables New maps used to redistribute vector and matrix data Optimized matrix-vector product calculated with new data distributions Michael Wolf - 13

14 Outline Motivation/Overview Design and Usage Usage 1: Map Optimization Usage 2: Performance Evaluation LORE and POEM Preliminary POEM Results: 2D FFT Next Steps and Summary Michael Wolf - 14

15 LORE Usage 2: Performance Evaluation y = A x Performance data LORE M1 P1 P2 M LORE simulates user code on specified architecture to produce performance evaluation metrics Michael Wolf - 15

16 Performance Evaluation: Input Dense Matrix-Vector Product User Code: Simulator Performance Data y = A x User Code, Arch. Model Parser Mapper Mapped MI Code Generator MI code y=ax LORE Performance Evaluator Analyzer and Optimizer Architecture Model: MI = Machine Independent M1 P1 P2 M2 Input from application: user code for dense matrix-vector product, architecture model Michael Wolf - 16

17 Performance Evaluation: Representation Dense Matrix-Vector Product Simulator Performance Data SB User Code, Arch. Model y=ax Parser Mapper Mapped MI Code Generator MI code Performance Evaluator MV LORE Analyzer and Optimizer MI = Machine Independent PVar: y PVar: A PVar: x Parser converts user code into abstract syntax tree (), which is input language/numerical library neutral Michael Wolf - 17

18 Performance Evaluation: Mapping Dense Matrix-Vector Product Simulator Performance Data SB User Code, Arch. Model Parser Mapper Mapped MI Code Generator MI code MV Performance Evaluator LORE Analyzer and Optimizer MI = Machine Independent PVar: y PVar: A PVar: x Vector Map P0: {0,1} P1: {2,3} Matrix Map P0: {0,1} P1: {2,3} Vector Map P0: {0,1} P1: {2,3} LORE computes map for each parallel variable in Michael Wolf - 18

19 Performance Evaluation: Machine Independent Code Mapped SB MV Architecture Model Simulator Output PVar: y Vector Map P0: {0,1} P1: {2,3} PVar: A Matrix Map P0: {0,1} P1: {2,3} PVar: x Vector Map P0: {0,1} P1: {2,3} M1 P1 P2 M2 Input Parser Mapper Mapped MI Code Generator MI code Performance Evaluator MI Code Generator read A 0,0 read x 0 read y 0 read A 1,1 read x 1 read y 1 LORE Analyzer and Optimizer y 0 = A 0,0 x 0 y 1 = A 1,1 x 1 MI = Machine Independent send x 0 send x 1 read A 0,1 read A 1,0 y 0 += A 0,1 x 1 y 1 += A 1,0 x 0 MI Code write y 0 write y 1 Mapped and architecture model used to generate machine independent code Michael Wolf - 19

20 Machine Independent Code read A 0,0 read x 0 read y 0 read A 1,1 read x 1 read y 1 P0 y 0 = A 0,0 x 0 y 1 = A 1,1 x 1 P1 Computation Memory access Communication send x 0 send x 1 Flow graph of operations read A 0,1 read A 1,0 y 0 += A 0,1 x 1 y 1 += A 1,0 x 0 # rows = 2 # processors = 2 write y 0 write y 1 Machine independent code for matrix-vector product Michael Wolf - 20

21 Performance Evaluation: Output read A 0,0 read x 0 read y 0 read A 1,1 read x 1 read y 1 y 0 = A 0,0 x 0 y 1 = A 1,1 x 1 Simulator Performance Data send x 0 send x 1 Input Parser Mapper Mapped MI Code Generator MI code read A 0,1 y 0 += A 0,1 x 1 read A 1,0 y 1 += A 1,0 x 0 Performance Evaluator MI Code write y 0 write y 1 LORE Analyzer and Optimizer Simulator Performance Data Simulation of user code on target architecture Michael Wolf - 21

22 Outline Motivation/Overview Design and Usage Usage 1: Map Optimization Usage 2: Performance Evaluation LORE and POEM Preliminary POEM Results: 2D FFT Next Steps and Summary Michael Wolf - 22

23 POEM Photonic Switching Technology Ring Resonator Technology Photonics Layered on Silicon Architecture Parameter Characterization Architecture Design Proposed Photonic Architectures Photonic Innovation P hotonically O ptimized E mbedded M icroprocessor Fielded Systems Application Analysis Simulation, Mapping, and Optimization Framework Processing Chains for Military-Critical Applications Handheld DNA Sequencing Device UAV Surveillance Handheld PDA Field Devices POEM will bridge the gap between innovations in chip-scale photonics and fielded military-critical systems. It will offer a complete architecture design and analysis for numerous real-world military-critical applications. Michael Wolf - 23

24 LORE s Role in POEM Program Photonic Switching Technology Ring Resonator Technology Photonics Layered on Silicon Architecture Parameter Characterization Processing Chains for Military-Critical Applications Photonic Innovation Architecture Design Proposed Photonic Architectures P hotonically O ptimized Application Analysis E mbedded M icroprocessor Simulation, Mapping, and Optimization Framework Fielded Systems Handheld DNA Sequencing Device UAV Surveillance Handheld PDA Field Devices LORE used to study chip-scale photonics and its impact on applications Michael Wolf - 24

25 LORE and POEM: Applications SAR Secure Communication Cyber Database operations LORE LORE provides framework for analyzing POEM applications LORE supports many key numerical kernels (FFT, sparse matrix-vector product, vector updates, etc.) Applications supported through composition of these kernels Easy to extend to analyze new applications Initially analyzing synthetic aperture radar (SAR) application LORE enables the analysis of many applications on many different architectures (existing and proposed) Michael Wolf - 25

26 LORE and POEM: Architectures Application Electronic Mesh LORE Photonic Bus LORE supports simulation of applications on POEM architectures (e.g., electronic mesh and photonic bus) Framework for simulating user code LORE simulator for understanding big picture trends Interface to third party simulators (e.g., PhoenixSim) for higher fidelity performance data LORE enables the analysis of many applications on many different architectures (existing and proposed) Michael Wolf - 26

27 LORE and POEM: Optimization of Maps y = A x y = A x LORE Optimization of maps Good application data to processor mapping crucial to achieving peak parallel performance on target machines Not difficult for SAR applications (simple maps sufficient) Challenging for applications with irregular communication (DNA sequence analysis and sparse matrix computations) LORE provides automatic map optimization Michael Wolf - 27

28 Outline Motivation/Overview Design and Usage Usage 1: Map Optimization Usage 2: Performance Evaluation LORE and POEM Preliminary POEM Results: 2D FFT Next Steps and Summary Michael Wolf - 28

29 LORE and Performance Evaluation LORE output LORE input User Code Architectural Model Parser LORE Mapper Mapped Analyzer and Optimizer Simulator MI Code Generator MI code Performance Evaluator Performance data: time (s), GFLOPS LORE used to produce performance evaluation data for 2D FFT, an important kernel in SAR processing chain Michael Wolf - 29

30 POEM Architecture Models Electronic Mesh Photonic Bus M1 P1 P2 P3 P4 M2 M1 M2 M3 M4 P1 P2 P3 P4 P5 P9 P6 P10 P7 P11 P8 P12 P=processor/core M=Shared memory =local memory P5 P9 P6 P10 P7 P11 P8 P12 M3 P13 P14 P15 P16 M4 P13 P14 P15 P16 Two architectures modeled (electronic mesh and photonic bus Processors same, shared memory same Network parameters (latency, bandwidth) set to allow for apple to apples comparison between networks LORE framework allows direct comparison of electronic mesh and photonic bus architectures Michael Wolf - 30

31 Preliminary POEM Simulations 2D FFT: FFT, CT, FFT 2D FFT: FFT, SCA, FFT 1D FFT for each row of dense matrix FFT Read FFT Write FFT Read FFT For data locality of FFT CT Read Transposed Write CT SCA Write 1D FFT for each column of dense matrix EM=Electronic mesh PB=Photonic bus SCA = Synchronous coalesced access FFT Read FFT Write Architectures simulated: EM FFT Read FFT Write Architectures simulated: PB Initially, simulated 2D FFT kernel Important kernel for SAR applications Michael Wolf - 31

32 Photonic Synchronous Coalesced Access Network (PSCAN) PSCAN leverages the differences between electronic and photonic interconnect to achieve large efficiency gains in critical operations Utilizes distance independence to rapidly re-organize spatially separate data Large gains in efficiency even when bandwidth is equalized MI M 1 SCA write Synthesizes matrix row-to-column transpose/write into a single transaction by interleaving data from spatially separate data producers Novel ISA construct enabled by a highly synchronous photonic waveguide DRIVE0 DRIVE1 DRIVE2 DRIVE3 P-CLK 0 Data is not driven time DRAM Address info driven by proc 3 Credit: Dave Whelihan, MITLL Michael Wolf - 32

33 Preliminary LORE Results GOPS Performance of 2D FFT Operation % of Ideal Experiment: Number of memory controllers: 4 Matrix size: 4096 x 4096 Equalize bandwidth to memory in photonic and electronic models Run traditional 2D FFT with block transpose for electronic mesh Run 2D FFT with SCA for PSCAN Scale number of processors Number of Cores Electronic PSCAN Ideal Examine performance of photonics and electronics in an increasingly workstarved environment PSCAN architecture with SCA yields significant speed-up over electronic mesh architectures using block transpose Michael Wolf - 33

34 Data Reorganization Bottleneck in Modern Manycore Systems 1 Time Spent Reorganizing Data for 2D FFT Block Transpose SCA Percentage Data Reorganization FFT CT FFT Read FFT Write Read Transpose Write Read FFT Write FFT CT FFT Read FFT SCA Read FFT Write Number of Cores Block Transpose SCA As the number of cores in modern CMPs grows, the data reorganization becomes an increasing performance bottleneck. Performance is limited by the inability to keep processors supplied with data. SCA operation performs data reorganization step of 2D FFT more efficiently as number of processors grows, alleviating traditional block transpose bottleneck Michael Wolf - 34

35 Outline Motivation/Overview Design and Usage Usage 1: Map Optimization Usage 2: Performance Evaluation LORE and POEM Preliminary POEM Results: 2D FFT Next Steps and Summary Michael Wolf - 35

36 LORE Next Steps Extend architecture model, LORE simulator Support memory hierarchy Model network contention more accurately Support for external simulators E.g., PhoenixSim (Columbia), SST (Sandia) Needed for higher fidelity simulations Better power modeling (e.g., dynamic power) Additional parallel numerical library/language support Additional languages: Matlab, Python Additional libraries: e.g., VSIPL++/PVTOL Additional kernels: e.g., Sparse matrix operations Code generator and runtime engine for execution/emulation on target architectures Michael Wolf - 36

37 Summary LORE: Parallel framework/environment for Optimizing data to processor mapping for parallel applications Simulating and optimizing new (and existing) architectures LORE used to compare photonic and electronic architectures of interest to POEM project Support for many applications (through composition of kernels) Support for different architectures Preliminary results for 2D FFT kernel Support thesis that photonics can improve performance for these applications SCA instruction mitigates performance impact of corner turn in 2D FFT operation LORE allows for exploration of architecture design space in context of real application constraints Michael Wolf - 37

38 Acknowledgements LORE Michelle Beard Anna Klein Sanjeev Mohindra Julie Mullen Eric Robinson Nadya Bliss (Manager) Minna Song (MIT student intern) POEM Dave Whelihan Jeff Hughes Scott Sawyer Michael Wolf - 38

Instruction Set Extensions for Photonic Synchronous Coalesced Access

Instruction Set Extensions for Photonic Synchronous Coalesced Access Instruction Set Extensions for Photonic Synchronous Coalesced Access Paul Keltcher, David Whelihan, Jeffrey Hughes September 12, 2013 This work is sponsored by Defense Advanced Research Projects Agency

More information

Analysis and Mapping of Sparse Matrix Computations

Analysis and Mapping of Sparse Matrix Computations Analysis and Mapping of Sparse Matrix Computations Nadya Bliss & Sanjeev Mohindra Varun Aggarwal & Una-May O Reilly MIT Computer Science and AI Laboratory September 19th, 2007 HPEC2007-1 This work is sponsored

More information

Quantifying the Effect of Matrix Structure on Multithreaded Performance of the SpMV Kernel

Quantifying the Effect of Matrix Structure on Multithreaded Performance of the SpMV Kernel Quantifying the Effect of Matrix Structure on Multithreaded Performance of the SpMV Kernel Daniel Kimball, Elizabeth Michel, Paul Keltcher, and Michael M. Wolf MIT Lincoln Laboratory Lexington, MA {daniel.kimball,

More information

DataSToRM: Data Science and Technology Research Environment

DataSToRM: Data Science and Technology Research Environment The Future of Advanced (Secure) Computing DataSToRM: Data Science and Technology Research Environment This material is based upon work supported by the Assistant Secretary of Defense for Research and Engineering

More information

Cluster-based 3D Reconstruction of Aerial Video

Cluster-based 3D Reconstruction of Aerial Video Cluster-based 3D Reconstruction of Aerial Video Scott Sawyer (scott.sawyer@ll.mit.edu) MIT Lincoln Laboratory HPEC 12 12 September 2012 This work is sponsored by the Assistant Secretary of Defense for

More information

The HPEC Challenge Benchmark Suite

The HPEC Challenge Benchmark Suite The HPEC Challenge Benchmark Suite Ryan Haney, Theresa Meuse, Jeremy Kepner and James Lebak Massachusetts Institute of Technology Lincoln Laboratory HPEC 2005 This work is sponsored by the Defense Advanced

More information

An Advanced Graph Processor Prototype

An Advanced Graph Processor Prototype An Advanced Graph Processor Prototype Vitaliy Gleyzer GraphEx 2016 DISTRIBUTION STATEMENT A. Approved for public release: distribution unlimited. This material is based upon work supported by the Assistant

More information

Evaluating the Potential of Graphics Processors for High Performance Embedded Computing

Evaluating the Potential of Graphics Processors for High Performance Embedded Computing Evaluating the Potential of Graphics Processors for High Performance Embedded Computing Shuai Mu, Chenxi Wang, Ming Liu, Yangdong Deng Department of Micro-/Nano-electronics Tsinghua University Outline

More information

Dynamic Distributed Dimensional Data Model (D4M) Database and Computation System

Dynamic Distributed Dimensional Data Model (D4M) Database and Computation System Dynamic Distributed Dimensional Data Model (D4M) Database and Computation System Jeremy Kepner, William Arcand, William Bergeron, Nadya Bliss, Robert Bond, Chansup Byun, Gary Condon, Kenneth Gregson, Matthew

More information

Signal Processing on Databases

Signal Processing on Databases Signal Processing on Databases Jeremy Kepner Lecture 0: Introduction 3 October 2012 This work is sponsored by the Department of the Air Force under Air Force Contract #FA8721-05-C-0002. Opinions, interpretations,

More information

PAKCK: Performance and Power Analysis of Key Computational Kernels on CPUs and GPUs

PAKCK: Performance and Power Analysis of Key Computational Kernels on CPUs and GPUs PAKCK: Performance and Power Analysis of Key Computational Kernels on CPUs and GPUs Julia S. Mullen, Michael M. Wolf and Anna Klein MIT Lincoln Laboratory Lexington, MA 02420 Email: {jsm, michael.wolf,anna.klein

More information

300x Matlab. Dr. Jeremy Kepner. MIT Lincoln Laboratory. September 25, 2002 HPEC Workshop Lexington, MA

300x Matlab. Dr. Jeremy Kepner. MIT Lincoln Laboratory. September 25, 2002 HPEC Workshop Lexington, MA 300x Matlab Dr. Jeremy Kepner September 25, 2002 HPEC Workshop Lexington, MA This work is sponsored by the High Performance Computing Modernization Office under Air Force Contract F19628-00-C-0002. Opinions,

More information

Kernel Benchmarks and Metrics for Polymorphous Computer Architectures

Kernel Benchmarks and Metrics for Polymorphous Computer Architectures PCAKernels-1 Kernel Benchmarks and Metrics for Polymorphous Computer Architectures Hank Hoffmann James Lebak (Presenter) Janice McMahon Seventh Annual High-Performance Embedded Computing Workshop (HPEC)

More information

Sampling Large Graphs for Anticipatory Analysis

Sampling Large Graphs for Anticipatory Analysis Sampling Large Graphs for Anticipatory Analysis Lauren Edwards*, Luke Johnson, Maja Milosavljevic, Vijay Gadepally, Benjamin A. Miller IEEE High Performance Extreme Computing Conference September 16, 2015

More information

6 February Parallel Computing: A View From Berkeley. E. M. Hielscher. Introduction. Applications and Dwarfs. Hardware. Programming Models

6 February Parallel Computing: A View From Berkeley. E. M. Hielscher. Introduction. Applications and Dwarfs. Hardware. Programming Models Parallel 6 February 2008 Motivation All major processor manufacturers have switched to parallel architectures This switch driven by three Walls : the Power Wall, Memory Wall, and ILP Wall Power = Capacitance

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 1. Copyright 2012, Elsevier Inc. All rights reserved. Computer Technology

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 1. Copyright 2012, Elsevier Inc. All rights reserved. Computer Technology Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology

More information

A KASSPER Real-Time Signal Processor Testbed

A KASSPER Real-Time Signal Processor Testbed A KASSPER Real-Time Signal Processor Testbed Glenn Schrader 244 Wood St. exington MA, 02420 Phone: (781)981-2579 Fax: (781)981-5255 gschrad@ll.mit.edu The Knowledge Aided Sensor Signal Processing and Expert

More information

Gedae cwcembedded.com. The CHAMP-AV6 VPX-REDI. Digital Signal Processing Card. Maximizing Performance with Minimal Porting Effort

Gedae cwcembedded.com. The CHAMP-AV6 VPX-REDI. Digital Signal Processing Card. Maximizing Performance with Minimal Porting Effort Technology White Paper The CHAMP-AV6 VPX-REDI Digital Signal Processing Card Maximizing Performance with Minimal Porting Effort Introduction The Curtiss-Wright Controls Embedded Computing CHAMP-AV6 is

More information

Sub-Graph Detection Theory

Sub-Graph Detection Theory Sub-Graph Detection Theory Jeremy Kepner, Nadya Bliss, and Eric Robinson This work is sponsored by the Department of Defense under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions,

More information

Phastlane: A Rapid Transit Optical Routing Network

Phastlane: A Rapid Transit Optical Routing Network Phastlane: A Rapid Transit Optical Routing Network Mark Cianchetti, Joseph Kerekes, and David Albonesi Computer Systems Laboratory Cornell University The Interconnect Bottleneck Future processors: tens

More information

6LPXODWLRQÃRIÃWKHÃ&RPPXQLFDWLRQÃ7LPHÃIRUÃDÃ6SDFH7LPH $GDSWLYHÃ3URFHVVLQJÃ$OJRULWKPÃRQÃDÃ3DUDOOHOÃ(PEHGGHG 6\VWHP

6LPXODWLRQÃRIÃWKHÃ&RPPXQLFDWLRQÃ7LPHÃIRUÃDÃ6SDFH7LPH $GDSWLYHÃ3URFHVVLQJÃ$OJRULWKPÃRQÃDÃ3DUDOOHOÃ(PEHGGHG 6\VWHP LPXODWLRQÃRIÃWKHÃ&RPPXQLFDWLRQÃLPHÃIRUÃDÃSDFHLPH $GDSWLYHÃURFHVVLQJÃ$OJRULWKPÃRQÃDÃDUDOOHOÃ(PEHGGHG \VWHP Jack M. West and John K. Antonio Department of Computer Science, P.O. Box, Texas Tech University,

More information

Communications Research Innovation Across Commercial and Military Boundaries Optical Communications

Communications Research Innovation Across Commercial and Military Boundaries Optical Communications Communications Research Innovation Across Commercial and Military Boundaries Optical Communications 27 October 2015 Bryan Robinson MIT Lincoln Laboratory This work is sponsored by National Aeronautics

More information

3D Technologies For Low Power Integrated Circuits

3D Technologies For Low Power Integrated Circuits 3D Technologies For Low Power Integrated Circuits Paul Franzon North Carolina State University Raleigh, NC paulf@ncsu.edu 919.515.7351 Outline 3DIC Technology Set Approaches to 3D Specific Power Minimization

More information

The Vector, Signal, and Image Processing Library (VSIPL): Emerging Implementations and Further Development

The Vector, Signal, and Image Processing Library (VSIPL): Emerging Implementations and Further Development The Vector, Signal, and Image Processing Library (VSIPL): Emerging Implementations and Further Development Randall Janka and Mark Richards Georgia Tech Research University James Lebak MIT Lincoln Laboratory

More information

BubbleNet: A Cyber Security Dashboard for Visualizing Patterns

BubbleNet: A Cyber Security Dashboard for Visualizing Patterns BubbleNet: A Cyber Security Dashboard for Visualizing Patterns Sean McKenna 1,2 Diane Staheli 2 Cody Fulcher 2 Miriah Meyer 1 1 University of Utah 2 MIT Lincoln Laboratory The Lincoln Laboratory portion

More information

ECE 669 Parallel Computer Architecture

ECE 669 Parallel Computer Architecture ECE 669 Parallel Computer Architecture Lecture 9 Workload Evaluation Outline Evaluation of applications is important Simulation of sample data sets provides important information Working sets indicate

More information

Executable Requirements: Opportunities and Impediments

Executable Requirements: Opportunities and Impediments Executable Requirements: Oppotunities and Impediments Executable Requirements: Opportunities and Impediments G. A. Shaw and A. H. Anderson * Abstract: In a top-down, language-based design methodology,

More information

Dana Sinno MIT Lincoln Laboratory 244 Wood Street Lexington, MA phone:

Dana Sinno MIT Lincoln Laboratory 244 Wood Street Lexington, MA phone: Self-Organizing Networks (SONets) with Application to Target Tracking Dana Sinno 244 Wood Street Lexington, MA 02420-9108 phone: 781-981-4526 email: @ll.mit.edu Abstract The growing interest in large arrays

More information

Social Behavior Prediction Through Reality Mining

Social Behavior Prediction Through Reality Mining Social Behavior Prediction Through Reality Mining Charlie Dagli, William Campbell, Clifford Weinstein Human Language Technology Group MIT Lincoln Laboratory This work was sponsored by the DDR&E / RRTO

More information

NetSpeed ORION: A New Approach to Design On-chip Interconnects. August 26 th, 2013

NetSpeed ORION: A New Approach to Design On-chip Interconnects. August 26 th, 2013 NetSpeed ORION: A New Approach to Design On-chip Interconnects August 26 th, 2013 INTERCONNECTS BECOMING INCREASINGLY IMPORTANT Growing number of IP cores Average SoCs today have 100+ IPs Mixing and matching

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology

More information

Big Orange Bramble. August 09, 2016

Big Orange Bramble. August 09, 2016 Big Orange Bramble August 09, 2016 Overview HPL SPH PiBrot Numeric Integration Parallel Pi Monte Carlo FDS DANNA HPL High Performance Linpack is a benchmark for clusters Created here at the University

More information

Scott Philips, Edward Kao, Michael Yee and Christian Anderson. Graph Exploitation Symposium August 9 th 2011

Scott Philips, Edward Kao, Michael Yee and Christian Anderson. Graph Exploitation Symposium August 9 th 2011 Activity-Based Community Detection Scott Philips, Edward Kao, Michael Yee and Christian Anderson Graph Exploitation Symposium August 9 th 2011 23-1 This work is sponsored by the Office of Naval Research

More information

Reconfigurable Versus Fixed Versus Hybrid Architectures

Reconfigurable Versus Fixed Versus Hybrid Architectures Reconfigurable Versus Fixed Versus Hybrid Architectures John K. Antonio Oklahoma Supercomputing Symposium 2008 Norman, Oklahoma October 6, 2008 Computer Science, University of Oklahoma Overview The (past)

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Near Memory Computing Spectral and Sparse Accelerators

Near Memory Computing Spectral and Sparse Accelerators Near Memory Computing Spectral and Sparse Accelerators Franz Franchetti ECE, Carnegie Mellon University www.ece.cmu.edu/~franzf Co-Founder, SpiralGen www.spiralgen.com The work was sponsored by Defense

More information

Computing architectures Part 2 TMA4280 Introduction to Supercomputing

Computing architectures Part 2 TMA4280 Introduction to Supercomputing Computing architectures Part 2 TMA4280 Introduction to Supercomputing NTNU, IMF January 16. 2017 1 Supercomputing What is the motivation for Supercomputing? Solve complex problems fast and accurately:

More information

Electric Power HIL Controls Collaborative

Electric Power HIL Controls Collaborative Electric Power HIL Controls Collaborative Chris Smith Microgrid and DER Controller Symposium 2/16/2017 MIT Lincoln Laboratory This work is sponsored by the Department of Energy, Office of Electricity Delivery

More information

Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok

Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok Texas Learning and Computation Center Department of Computer Science University of Houston Outline Motivation

More information

Optimizing Parallel Sparse Matrix-Vector Multiplication by Corner Partitioning

Optimizing Parallel Sparse Matrix-Vector Multiplication by Corner Partitioning Optimizing Parallel Sparse Matrix-Vector Multiplication by Corner Partitioning Michael M. Wolf 1,2, Erik G. Boman 2, and Bruce A. Hendrickson 3 1 Dept. of Computer Science, University of Illinois at Urbana-Champaign,

More information

Optimization Case Study for Kepler K20 GPUs: Synthetic Aperture Radar Backprojection

Optimization Case Study for Kepler K20 GPUs: Synthetic Aperture Radar Backprojection Optimization Case Study for Kepler K20 GPUs: Synthetic Aperture Radar Backprojection Thomas M. Benson 1 Daniel P. Campbell 1 David Tarjan 2 Justin Luitjens 2 1 Georgia Tech Research Institute {thomas.benson,dan.campbell}@gtri.gatech.edu

More information

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware

More information

DSENT A Tool Connecting Emerging Photonics with Electronics for Opto- Electronic Networks-on-Chip Modeling Chen Sun

DSENT A Tool Connecting Emerging Photonics with Electronics for Opto- Electronic Networks-on-Chip Modeling Chen Sun A Tool Connecting Emerging Photonics with Electronics for Opto- Electronic Networks-on-Chip Modeling Chen Sun In collaboration with: Chia-Hsin Owen Chen George Kurian Lan Wei Jason Miller Jurgen Michel

More information

Center Extreme Scale CS Research

Center Extreme Scale CS Research Center Extreme Scale CS Research Center for Compressible Multiphase Turbulence University of Florida Sanjay Ranka Herman Lam Outline 10 6 10 7 10 8 10 9 cores Parallelization and UQ of Rocfun and CMT-Nek

More information

Intro to: Ultra-low power, ultra-high bandwidth density SiP interconnects

Intro to: Ultra-low power, ultra-high bandwidth density SiP interconnects This work was supported in part by DARPA under contract HR0011-08-9-0001. The views, opinions, and/or findings contained in this article/presentation are those of the author/presenter

More information

Tools and Primitives for High Performance Graph Computation

Tools and Primitives for High Performance Graph Computation Tools and Primitives for High Performance Graph Computation John R. Gilbert University of California, Santa Barbara Aydin Buluç (LBNL) Adam Lugowski (UCSB) SIAM Minisymposium on Analyzing Massive Real-World

More information

Trends in HPC (hardware complexity and software challenges)

Trends in HPC (hardware complexity and software challenges) Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18

More information

Storage Hierarchy Management for Scientific Computing

Storage Hierarchy Management for Scientific Computing Storage Hierarchy Management for Scientific Computing by Ethan Leo Miller Sc. B. (Brown University) 1987 M.S. (University of California at Berkeley) 1990 A dissertation submitted in partial satisfaction

More information

EECS4201 Computer Architecture

EECS4201 Computer Architecture Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis These slides are based on the slides provided by the publisher. The slides will be

More information

Design of a System-on-Chip Switched Network and its Design Support Λ

Design of a System-on-Chip Switched Network and its Design Support Λ Design of a System-on-Chip Switched Network and its Design Support Λ Daniel Wiklund y, Dake Liu Dept. of Electrical Engineering Linköping University S-581 83 Linköping, Sweden Abstract As the degree of

More information

Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS

Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS Who am I? Education Master of Technology, NTNU, 2007 PhD, NTNU, 2010. Title: «Managing Shared Resources in Chip Multiprocessor Memory

More information

Non-Uniform Memory Access (NUMA) Architecture and Multicomputers

Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico September 26, 2011 CPD

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References q This set of slides is mainly based on: " CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory " Slide of Applied

More information

Perfect Power Law Graphs: Generation, Sampling, Construction, and Fitting

Perfect Power Law Graphs: Generation, Sampling, Construction, and Fitting Perfect Power Law Graphs: Generation, Sampling, Construction, and Fitting Jeremy Kepner SIAM Annual Meeting, Minneapolis, July 9, 2012 This work is sponsored by the Department of the Air Force under Air

More information

zorder-lib: Library API for Z-Order Memory Layout

zorder-lib: Library API for Z-Order Memory Layout zorder-lib: Library API for Z-Order Memory Layout E. Wes Bethel Lawrence Berkeley National Laboratory Berkeley, CA, USA, 94720 April, 2015 i Acknowledgment This work was supported by the Director, Office

More information

Algorithms and Architecture. William D. Gropp Mathematics and Computer Science

Algorithms and Architecture. William D. Gropp Mathematics and Computer Science Algorithms and Architecture William D. Gropp Mathematics and Computer Science www.mcs.anl.gov/~gropp Algorithms What is an algorithm? A set of instructions to perform a task How do we evaluate an algorithm?

More information

Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins

Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Outline History & Motivation Architecture Core architecture Network Topology Memory hierarchy Brief comparison to GPU & Tilera Programming Applications

More information

LECTURE 5: MEMORY HIERARCHY DESIGN

LECTURE 5: MEMORY HIERARCHY DESIGN LECTURE 5: MEMORY HIERARCHY DESIGN Abridged version of Hennessy & Patterson (2012):Ch.2 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive

More information

HPCS HPCchallenge Benchmark Suite

HPCS HPCchallenge Benchmark Suite HPCS HPCchallenge Benchmark Suite David Koester, Ph.D. () Jack Dongarra (UTK) Piotr Luszczek () 28 September 2004 Slide-1 Outline Brief DARPA HPCS Overview Architecture/Application Characterization Preliminary

More information

ATAC: Improving Performance and Programmability with On-Chip Optical Networks

ATAC: Improving Performance and Programmability with On-Chip Optical Networks ATAC: Improving Performance and Programmability with On-Chip Optical Networks James Psota, Jason Miller, George Kurian, Nathan Beckmann, Jonathan Eastep, Henry Hoffman, Jifeng Liu, Mark Beals, Jurgen Michel,

More information

Data Reorganization Interface

Data Reorganization Interface Data Reorganization Interface Kenneth Cain Mercury Computer Systems, Inc. Phone: (978)-967-1645 Email Address: kcain@mc.com Abstract: This presentation will update the HPEC community on the latest status

More information

Ibex: A Space Situational Awareness Data Fusion Program

Ibex: A Space Situational Awareness Data Fusion Program Ibex: A Space Situational Awareness Data Fusion Program Lt Col Travis Blake, PhD, USAF Program Manager DARPA/TTO - Space Systems Maj Christopher Gustin, USAF Deputy Program Manager SMC/SYFA - Space C2

More information

CHERI A Hybrid Capability-System Architecture for Scalable Software Compartmentalization

CHERI A Hybrid Capability-System Architecture for Scalable Software Compartmentalization CHERI A Hybrid Capability-System Architecture for Scalable Software Compartmentalization Robert N.M. Watson *, Jonathan Woodruff *, Peter G. Neumann, Simon W. Moore *, Jonathan Anderson, David Chisnall

More information

High-Performance Linear Algebra Processor using FPGA

High-Performance Linear Algebra Processor using FPGA High-Performance Linear Algebra Processor using FPGA J. R. Johnson P. Nagvajara C. Nwankpa 1 Extended Abstract With recent advances in FPGA (Field Programmable Gate Array) technology it is now feasible

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

Non-Uniform Memory Access (NUMA) Architecture and Multicomputers

Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Parallel and Distributed Computing MSc in Information Systems and Computer Engineering DEA in Computational Engineering Department of Computer

More information

Radar Tomography of Moving Targets

Radar Tomography of Moving Targets On behalf of Sensors Directorate, Air Force Research Laboratory Final Report September 005 S. L. Coetzee, C. J. Baker, H. D. Griffiths University College London REPORT DOCUMENTATION PAGE Form Approved

More information

White Paper Assessing FPGA DSP Benchmarks at 40 nm

White Paper Assessing FPGA DSP Benchmarks at 40 nm White Paper Assessing FPGA DSP Benchmarks at 40 nm Introduction Benchmarking the performance of algorithms, devices, and programming methodologies is a well-worn topic among developers and research of

More information

Scientific Applications. Chao Sun

Scientific Applications. Chao Sun Large Scale Multiprocessors And Scientific Applications Zhou Li Chao Sun Contents Introduction Interprocessor Communication: The Critical Performance Issue Characteristics of Scientific Applications Synchronization:

More information

ECE468 Computer Organization and Architecture. Memory Hierarchy

ECE468 Computer Organization and Architecture. Memory Hierarchy ECE468 Computer Organization and Architecture Hierarchy ECE468 memory.1 The Big Picture: Where are We Now? The Five Classic Components of a Computer Processor Control Input Datapath Output Today s Topic:

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Why Multiprocessors?

Why Multiprocessors? Why Multiprocessors? Motivation: Go beyond the performance offered by a single processor Without requiring specialized processors Without the complexity of too much multiple issue Opportunity: Software

More information

Beyond the PDP-11: Architectural support for a memory-safe C abstract machine

Beyond the PDP-11: Architectural support for a memory-safe C abstract machine Trustworthy Systems Research and CTSRDCRASH-worthy Development Beyond the PDP-11: Architectural support for a memory-safe C abstract machine David Chisnall, Colin Rothwell, Brooks Davis, Robert N.M. Watson,

More information

Organic Computing DISCLAIMER

Organic Computing DISCLAIMER Organic Computing DISCLAIMER The views, opinions, and/or findings contained in this article are those of the author(s) and should not be interpreted as representing the official policies, either expressed

More information

UNCLASSIFIED. R-1 ITEM NOMENCLATURE PE D8Z: Data to Decisions Advanced Technology FY 2012 OCO

UNCLASSIFIED. R-1 ITEM NOMENCLATURE PE D8Z: Data to Decisions Advanced Technology FY 2012 OCO Exhibit R-2, RDT&E Budget Item Justification: PB 2012 Office of Secretary Of Defense DATE: February 2011 BA 3: Advanced Development (ATD) COST ($ in Millions) FY 2010 FY 2011 Base OCO Total FY 2013 FY

More information

EE482C, L1, Apr 4, 2002 Copyright (C) by William J. Dally, All Rights Reserved. Today s Class Meeting. EE482S Lecture 1 Stream Processor Architecture

EE482C, L1, Apr 4, 2002 Copyright (C) by William J. Dally, All Rights Reserved. Today s Class Meeting. EE482S Lecture 1 Stream Processor Architecture 1 Today s Class Meeting EE482S Lecture 1 Stream Processor Architecture April 4, 2002 William J Dally Computer Systems Laboratory Stanford University billd@cslstanfordedu What is EE482C? Material covered

More information

EE482S Lecture 1 Stream Processor Architecture

EE482S Lecture 1 Stream Processor Architecture EE482S Lecture 1 Stream Processor Architecture April 4, 2002 William J. Dally Computer Systems Laboratory Stanford University billd@csl.stanford.edu 1 Today s Class Meeting What is EE482C? Material covered

More information

GraphBLAS: A Programming Specification for Graph Analysis

GraphBLAS: A Programming Specification for Graph Analysis GraphBLAS: A Programming Specification for Graph Analysis Scott McMillan, PhD GraphBLAS: A Programming A Specification for Graph Analysis 2016 October Carnegie 26, Mellon 2016University 1 Copyright 2016

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Construction Change Order analysis CPSC 533C Analysis Project

Construction Change Order analysis CPSC 533C Analysis Project Construction Change Order analysis CPSC 533C Analysis Project Presented by Chiu, Chao-Ying Department of Civil Engineering University of British Columbia Problems of Using Construction Data Hybrid of physical

More information

Virtual Prototyping and Performance Analysis of RapidIO-based System Architectures for Space-Based Radar

Virtual Prototyping and Performance Analysis of RapidIO-based System Architectures for Space-Based Radar Virtual Prototyping and Performance Analysis of RapidIO-based System Architectures for Space-Based Radar David Bueno, Adam Leko, Chris Conger, Ian Troxel, and Alan D. George HCS Research Laboratory College

More information

TRACT: Threat Rating and Assessment Collaboration Tool

TRACT: Threat Rating and Assessment Collaboration Tool TRACT: Threat Rating and Assessment Collaboration Tool Robert Hollinger and Doran Smestad Advised by: George Heineman (WPI), Philip Marquardt (MIT/LL) Worcester Polytechnic Institute Major Qualifying Project

More information

Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al.

Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al. Can FPGAs beat GPUs in accelerating next-generation Deep Neural Networks? Discussion of the FPGA 17 paper by Intel Corp. (Nurvitadhi et al.) Andreas Kurth 2017-12-05 1 In short: The situation Image credit:

More information

Virtuoso: Narrowing the Semantic Gap in Virtual Machine Introspection

Virtuoso: Narrowing the Semantic Gap in Virtual Machine Introspection Virtuoso: Narrowing the Semantic Gap in Virtual Machine Introspection Brendan Dolan-Gavitt *, Tim Leek, Michael Zhivich, Jonathon Giffin *, and Wenke Lee * * Georgia Institute of Technology MIT Lincoln

More information

Non-Uniform Memory Access (NUMA) Architecture and Multicomputers

Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico February 29, 2016 CPD

More information

Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop

Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop http://cscads.rice.edu/ Discussion and Feedback CScADS Autotuning 07 Top Priority Questions for Discussion

More information

Design methodology for multi processor systems design on regular platforms

Design methodology for multi processor systems design on regular platforms Design methodology for multi processor systems design on regular platforms Ph.D in Electronics, Computer Science and Telecommunications Ph.D Student: Davide Rossi Ph.D Tutor: Prof. Roberto Guerrieri Outline

More information

Covert and Anomalous Network Discovery and Detection (CANDiD)

Covert and Anomalous Network Discovery and Detection (CANDiD) Covert and Anomalous Network Discovery and Detection (CANDiD) Rajmonda S. Caceres, Edo Airoldi, Garrett Bernstein, Edward Kao, Benjamin A. Miller, Raj Rao Nadakuditi, Matthew C. Schmidt, Kenneth Senne,

More information

Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation

Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation 1 Cheng-Han Du* I-Hsin Chung** Weichung Wang* * I n s t i t u t e o f A p p l i e d M

More information

Visualizing Attack Graphs, Reachability, and Trust Relationships with NAVIGATOR*

Visualizing Attack Graphs, Reachability, and Trust Relationships with NAVIGATOR* Visualizing Attack Graphs, Reachability, and Trust Relationships with NAVIGATOR* Matthew Chu, Kyle Ingols, Richard Lippmann, Seth Webster, Stephen Boyer 14 September 2010 9/14/2010-1 *This work is sponsored

More information

Multi-core Computing Lecture 2

Multi-core Computing Lecture 2 Multi-core Computing Lecture 2 MADALGO Summer School 2012 Algorithms for Modern Parallel and Distributed Models Phillip B. Gibbons Intel Labs Pittsburgh August 21, 2012 Multi-core Computing Lectures: Progress-to-date

More information

Interconnection Networks

Interconnection Networks Lecture 18: Interconnection Networks Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Credit: many of these slides were created by Michael Papamichael This lecture is partially

More information

Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication

Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication Aydin Buluc John R. Gilbert University of California, Santa Barbara ICPP 2008 September 11, 2008 1 Support: DOE Office of Science,

More information

LUNAR TEMPERATURE CALCULATIONS ON A GPU

LUNAR TEMPERATURE CALCULATIONS ON A GPU LUNAR TEMPERATURE CALCULATIONS ON A GPU Kyle M. Berney Department of Information & Computer Sciences Department of Mathematics University of Hawai i at Mānoa Honolulu, HI 96822 ABSTRACT Lunar surface temperature

More information

Spiral. Computer Generation of Performance Libraries. José M. F. Moura Markus Püschel Franz Franchetti & the Spiral Team. Performance.

Spiral. Computer Generation of Performance Libraries. José M. F. Moura Markus Püschel Franz Franchetti & the Spiral Team. Performance. Spiral Computer Generation of Performance Libraries José M. F. Moura Markus Püschel Franz Franchetti & the Spiral Team Platforms Performance Applications What is Spiral? Traditionally Spiral Approach Spiral

More information

Compilers and Compiler-based Tools for HPC

Compilers and Compiler-based Tools for HPC Compilers and Compiler-based Tools for HPC John Mellor-Crummey Department of Computer Science Rice University http://lacsi.rice.edu/review/2004/slides/compilers-tools.pdf High Performance Computing Algorithms

More information

Parallel Matlab: RTExpress on 64-bit SGI Altix with SCSL and MPT

Parallel Matlab: RTExpress on 64-bit SGI Altix with SCSL and MPT Parallel Matlab: RTExpress on -bit SGI Altix with SCSL and MPT Cosmo Castellano Integrated Sensors, Inc. Phone: 31-79-1377, x3 Email Address: castellano@sensors.com Abstract By late, RTExpress, a compiler

More information

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied

More information

Introducing a Cache-Oblivious Blocking Approach for the Lattice Boltzmann Method

Introducing a Cache-Oblivious Blocking Approach for the Lattice Boltzmann Method Introducing a Cache-Oblivious Blocking Approach for the Lattice Boltzmann Method G. Wellein, T. Zeiser, G. Hager HPC Services Regional Computing Center A. Nitsure, K. Iglberger, U. Rüde Chair for System

More information