1 Opportunities and Approaches for System Software in Supporting Application/Architecture Co-Design Ron Brightwell Sandia National Laboratories Scalable System Software rbbrigh@sandia.gov Workshop on Application/Architecture Co-Design for Extreme-Scale Computing September 24, 2010 Sandia is a Multiprogram Laboratory Operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy Under Contract DE-AC04-94AL85000.
2 Outline Overview of co-design Fundamental capabilities needed for co-design Two examples of limited co-design
3 Potential System Architecture Targets

| System attribute           | 2010           | ~2015               | ~2018                  |
|----------------------------|----------------|---------------------|------------------------|
| System peak                | 2 Petaflop/sec | 200 Petaflop/sec    | 1 Exaflop/sec          |
| Power                      | 6 MW           | 15 MW               | 20 MW                  |
| System memory              | 0.3 PB         | 5 PB                | … PB                   |
| Node performance           | 125 GF         | 0.5 TF or 7 TF      | 1 TF or 10 TF          |
| Node memory BW             | 25 GB/s        | 0.1 TB/s or 1 TB/s  | 0.4 TB/s or 4 TB/s     |
| Node concurrency           | 12             | O(100) or O(1,000)  | O(1,000) or O(10,000)  |
| System size (nodes)        | 18,700         | 50,000 or 5,000     | 1,000,000 or 100,000   |
| Total node interconnect BW | 1.5 GB/s       | 20 GB/s             | 200 GB/s               |
| MTTI                       | days           | O(1 day)            | O(1 day)               |
4 Co-design is a key element of the Exascale strategy Architectures are undergoing a major change: single-thread performance is remaining relatively constant while on-chip parallelism is increasing rapidly Hierarchical parallelism, heterogeneity Massive multithreading NVRAM for caching I/O Applications will need to change in response to architectural changes: manage locality and extreme scalability (billion-way parallelism), potentially tolerate latency Resilience? Unprecedented opportunity for applications/algorithms to influence architectures, system software, and the next programming model Hardware R&D is needed to reach exascale; we will not be able to solve all of the exascale problems through architecture work alone
5 Co-design space is subject to other constraints: power, system cost, R&D costs, physical limitations, multiple applications Goal is to build a sustainable infrastructure with broad market support Extend beyond natural evolution of commodity hardware to create new markets Create system building blocks that offer superior price/performance/programmability at all scales (exascale, departmental, and embedded)
6 Co-design expands the feasible solution space to allow better solutions Application driven: find the best technology to run this code (application, model, algorithms, code). Sub-optimal. Technology driven: fit your application to this technology (architecture, programming model, resilience, power). Sub-optimal. Now, we must expand the co-design space to find better solutions: new applications & algorithms, better technology and performance.
7 Hardware/Software co-design is a mature field Design of an integrated system that contains hardware and software Focus on embedded systems (cell phones, appliances, engines, controllers, etc.) Concurrent development of hardware and software Interactions and tradeoffs Partitioning is a focus Must satisfy real-time and/or other performance/energy metrics/constraints
8 Original DOD Standard for HW/SW co-development had shortcomings (diagram: integrated prototypes, modeling substrate)
9 Lockheed Martin Co-design Methodology HW/SW Cosim.
10 Phases of Co-Design
11 Co-Design Schematic
12 Why has co-design not been used more extensively in HPC? Leveraging of COTS technology: almost all leadership systems have some custom components, but HPC has benefited from the ability to leverage commercial technology HPC applications are very complex and may contain millions of lines of code ~15-20 years of architectural and programming model stability: bulk synchronous processing + explicit message passing Lack of adequate simulation tools: often use byte-to-flop ratios and Excel spreadsheets, and industry simulation tools are proprietary However, there are some HPC co-design examples and there are useful tools
13 Fundamental Capabilities for Co-Design Software agility Applications Need to identify an important, representative subset Application code must be small and malleable System software Smaller is better Lightweight is ideal Toolchain is always a huge issue Hardware simulation tools Sandia SST Virtualization Leverage virtual machine capability to emulate new hardware capability Need mechanisms to know the impact of co-design quickly Integrated teams Co-design centers
14 System simulation should be a key enabling technology Co-simulation of hardware and software Assess architectural choices and their impact on applications Identify bottlenecks and enable the development of algorithms for future architectures Key features Open source with the ability to interface to proprietary software Holistic: performance, power, area, cost, reliability analysis Modular and multiscale (cycle accurate to analytical) Input traces as well as joint execution Parallel FPGA acceleration
15 SST Simulation Project Parallel Discrete Event core with conservative optimization over MPI Holistic Integrated Tech. Models for power McPAT, Sim-Panalyzer Multiscale Detailed and simple models for processor, network, and memory Current Release (2.0) includes parallel simulation core, configuration, power models, basic network and processor models, and interface to detailed memory model
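The heart of a simulator like SST is a discrete-event core: an event queue ordered by timestamp, with component handlers that schedule future events. The sketch below shows only that sequential essence (SST's actual core is parallel and conservatively synchronized over MPI; the component names and the 50-cycle latency here are made up for illustration):

```python
import heapq

def simulate(initial_events, handlers):
    """Minimal sequential discrete-event loop: repeatedly pop the
    earliest event and let its component's handler schedule new ones."""
    queue = list(initial_events)      # entries are (time, component, payload)
    heapq.heapify(queue)
    trace = []
    while queue:
        time, comp, payload = heapq.heappop(queue)
        trace.append((time, comp, payload))
        for event in handlers[comp](time, payload):
            heapq.heappush(queue, event)
    return trace

# Toy model: a NIC hands each packet to memory after a 50-cycle latency.
handlers = {
    "nic": lambda t, p: [(t + 50, "mem", p)],
    "mem": lambda t, p: [],
}
trace = simulate([(0, "nic", "pkt0"), (10, "nic", "pkt1")], handlers)
print(trace)
# [(0, 'nic', 'pkt0'), (10, 'nic', 'pkt1'), (50, 'mem', 'pkt0'), (60, 'mem', 'pkt1')]
```

Swapping the lambda handlers for cycle-accurate or analytical component models is what gives such a core its multiscale character.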
16 SST simulations have quantified the impact of the Memory Wall Most of DOE's applications (e.g., climate, fusion, shock physics) spend most of their instructions accessing memory or doing integer computations, not floating point Additionally, most integer computations are computing memory addresses Advanced development efforts are focused on accelerating memory subsystem performance for both scientific and informatics applications
17 SST is providing architectural insights to algorithm developers Input: SST trace for SpMV (lots of instruction stream data). Model: use a restricted sin² function to mark the start/finish of each instruction; use FFTs to analyze behavior. (Figures: trace fragment from SpMV inner loop; number of in-flight instructions vs. clock cycle; important cycle frequencies.)
18 Sandia Mantevo Project Mini-applications Small, self-contained programs that embody essential performance characteristics of key applications Mini-drivers Small programs that act as drivers of performance-impacting Trilinos packages Application proxies Parameterizable applications that can be calibrated to mimic the performance of a large-scale application, then used as a proxy for the large-scale application
19 Sandia System Software Lightweight kernels Small code base (<50K LOC) Focused on performance and scalability Separate policy decision from policy enforcement Move resource management as close to application as possible Protect applications from each other Let user processes (libraries) manage resources Get out of the way Kitten Latest-generation LWK Open-source Lightweight virtualization when combined with Palacios hypervisor Portals
20 Portals Network Programming Interface Network API developed by Sandia, U. New Mexico, Intel Previous generations of Portals deployed on several production massively parallel systems 1993: 1800-node Intel Paragon (SUNMOS) 1997: 10,000-node Intel ASCI Red (Puma/Cougar) 1999: 1800-node Cplant cluster (Linux) 2005: 10,000-node Cray Sandia Red Storm (Catamount) 2009: 18,688-node Cray XT5 ORNL Jaguar (Linux) Focused on providing Lightweight connectionless model for MPP systems Low latency High bandwidth Independent progress Overlap of computation and communication Scalable buffering semantics Supports MPI, Cray SHMEM, ARMCI, GASNet, Lustre, etc.
21 NIC Architecture Co-Design
22 NIC Architecture Co-design Prevailing architectural constraints have driven many applications to highly bursty communication patterns In a power constrained world this trend will be unsustainable due to inefficient use of the system interconnect Design Goal: Produce a NIC architecture that enables overlap through high message rates and independent progress Using simulation, NIC hardware & software and host driver software were simultaneously profiled for various architecture choices Trade-offs: Which architectural features provide performance advantages What software bottlenecks need to be moved to hardware Which functions can be left to run on NIC CPU or in the host driver Next step: rework applications (or portions) to take advantage of the new features and provide feedback for more architectural improvements
23 MPI Will Likely Persist Into Exascale Era Number of network endpoints will increase significantly (5-50x) Memory and power will be dominant resources to manage Networks must be power and energy efficient Data movement must be carefully managed Memory copies will be very expensive Impact of unexpected messages must be minimized Eager protocol for short messages leads to receive-side buffering Need strategies for managing host buffer resources Flow control will be critical N-to-1 communication patterns will (continue to be) disastrous Must preserve key network performance characteristics Latency Bandwidth Message rate (throughput)
24 High Message Throughput is Vital Message rate determines the minimum message size needed to saturate the available network bandwidth
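The relationship on this slide is simple arithmetic: to keep a link of bandwidth B busy at message rate R, messages must average at least B/R bytes. A small illustration (the 100 GB/s bandwidth and 20M messages/s figures are hypothetical, chosen only to make the arithmetic concrete):

```python
def min_message_size(bandwidth_bytes_per_s: float, msgs_per_s: float) -> float:
    """Smallest average message size that can saturate the link."""
    return bandwidth_bytes_per_s / msgs_per_s

# At 100 GB/s link bandwidth and 20 million messages/s, messages must
# average at least 5 KB for the link to run at full bandwidth.
size = min_message_size(100e9, 20e6)
print(size)  # 5000.0

# Halving the achievable message rate doubles the message size needed
# to reach peak bandwidth -- which is why message rate is vital.
assert min_message_size(100e9, 10e6) == 2 * size
```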
25 Current Flow Control Strategies Not Sufficient Credit-based Limit number of outstanding send operations Used credits are replenished implicitly or explicitly Effectiveness limited to N-to-1 scenario Potential performance penalty for well-behaved applications Acknowledgment-based Receiver explicitly confirms receipt of every message Significant per-message performance penalty Round trip acknowledgment doubles latency Performance penalty for well-behaved applications Local copying (bounce buffer) mitigates latency penalty Both strategies limit message rate and effective bandwidth Flow control implemented at user-level inside MPI library Network transport usually has its own flow control mechanism No mechanism for back pressure from host resources to network
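The credit-based scheme above can be sketched in a few lines, and the penalty for well-behaved applications is visible in the sketch: the sender stalls the moment its credits run out, even when the receiver could accept more. This is a minimal model with made-up class names, not any real transport:

```python
class Receiver:
    def __init__(self):
        self.buffered = 0

    def deliver(self):
        self.buffered += 1

    def drain(self, sender):
        # Explicit replenishment: one credit returned per drained message.
        sender.credits += self.buffered
        self.buffered = 0

class CreditSender:
    """Credit-based flow control: the sender may inject only as many
    messages as it holds credits; the receiver replenishes them."""

    def __init__(self, credits: int):
        self.credits = credits

    def try_send(self, receiver) -> bool:
        if self.credits == 0:
            return False          # back pressure: the send must wait
        self.credits -= 1
        receiver.deliver()
        return True

s, r = CreditSender(credits=2), Receiver()
assert s.try_send(r) and s.try_send(r)
assert not s.try_send(r)   # credits exhausted: blocked even though the
                           # receiver could in fact accept more messages
r.drain(s)                 # credits come back only after a drain
assert s.try_send(r)
```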
26 Applications Must Become More Asynchronous Applications cannot continue to be bulk synchronous Overhead of synchronization will limit scaling Synchronization increases susceptibility to noise Network API must provide asynchronous operations and progress Data movement must be independent of host activity Active Messages Polling is fundamental to all AM Progress only when nothing else to do Polling memory for message reception is inefficient Needs hardware support to integrate message arrival with thread invocation Run-time systems will also need to communicate Need to communicate evolving state of the system Need a common portable API Using TCP OOB connection will be infeasible
27 Resiliency Will Impact Network API Network will need to expose errors to enable recovery Applications and system components will have different resiliency requirements Reachability errors must be handled by run-time services Graceful degradation may be appropriate for some applications May need OOB mechanism for recognizing network failures AM or event-driven API would be ideal Hardware support for network-level protection RAS system invoking OS via network messages
28 Portals 4.0: Applying Lessons Learned from Cray SeaStar High message rate Atomic search and post for MPI receives required round-trip across PCI Eliminate round-trip by having Portals manage unexpected messages Flow control Encourages well-behaved applications Fail fast Identify application scalability issues early Resource exhaustion caused unrecoverable failure Recovery doesn't have to be fast Resource exhaustion will disable Portal Subsequent messages will fail with event notification at initiator Applies back pressure from network Performance for scalable applications Correctness for non-scalable applications
29 Portals 4.0 (cont'd) Hardware implementation Designed for intelligent or programmable NICs Arbitrary list insertion Unneeded symmetry on initiator and target objects New functionality for one-sided operations Eliminate matching information Smaller network header Minimize processing at target Scalable event delivery Lightweight counting events Triggered operations Chain sequence of data movement operations Build asynchronous collective operations
30 Triggered Operations
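The semantics of triggered operations and lightweight counting events can be modeled in a few lines: an operation is attached to a counter with a threshold and fires exactly once when the count reaches it, so chains of data movement can proceed without host involvement. This is a minimal model of the idea, not the Portals 4 API:

```python
class Counter:
    """A counting event with attached triggered operations: each
    operation fires once when the count reaches its threshold."""

    def __init__(self):
        self.count = 0
        self.pending = []   # list of (threshold, operation)

    def trigger(self, threshold, op):
        self.pending.append((threshold, op))

    def increment(self):
        self.count += 1
        ready = [op for t, op in self.pending if self.count >= t]
        self.pending = [(t, op) for t, op in self.pending if self.count < t]
        for op in ready:
            op()

# Fan-in step of an asynchronous reduction tree: forward one combined
# message to the parent only after both children's contributions arrive.
log = []
arrivals = Counter()
arrivals.trigger(2, lambda: log.append("send to parent"))
arrivals.increment()              # first child's data arrives
assert log == []                  # nothing fires yet
arrivals.increment()              # second child's data arrives
assert log == ["send to parent"]  # the chained send fires automatically
```

On a programmable NIC the increments come from message arrivals and the fired operation is itself a data-movement command, which is what makes asynchronous collectives possible.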
31 Network Interface Controller Power will be number one constraint for exascale systems Current systems waste energy Using host cores to process messages is inefficient Only move data when necessary Move data to final destination No intermediate copying due to network Specialized network hardware Atomic operations Match processing Addressing and address translation Virtual address translation Avoid registration cache Logical node translation Rank translation on a million nodes Hardware support for thread activation on message arrival
32 High Message Throughput Challenges 20M messages per second implies a new message every 50 ns Significant constraints created by MPI semantics On-load approach Roadblocks: host general-purpose processors are inefficient for list management; caching (a cache miss is … ns latency); microbenchmarks are cache friendly, real life is not Benefits: easier & cheaper Off-load approach Roadblocks: storage requirements on NIC; NIC embedded processor is worse at list management than the host processor Benefits: opportunity to create dedicated hardware; macroscale pipelining (Diagram: queue processor with match processor, SRAM list managers, and ALPUs for the posted-receive and unexpected-message queues)
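The list-management cost that dominates both approaches comes from MPI's matching rules, which are simple to state but inherently sequential: every incoming header must be compared, in posting order, against the posted-receive list, with wildcards allowed on source and tag. A sketch (the entries and the wildcard encoding are illustrative):

```python
ANY = None   # stand-in for MPI_ANY_SOURCE / MPI_ANY_TAG

def match(posted, source, tag):
    """Walk the posted-receive list in order; MPI semantics require the
    FIRST entry whose (source, tag) matches, honoring wildcards, so the
    walk cannot be reordered or trivially parallelized."""
    for i, (src, t) in enumerate(posted):
        if (src is ANY or src == source) and (t is ANY or t == tag):
            return i
    return -1   # no match: message goes to the unexpected-message queue

posted = [(1, 10), (ANY, 20), (2, ANY)]
assert match(posted, 1, 10) == 0    # exact match on the first entry
assert match(posted, 7, 20) == 1    # wildcard source
assert match(posted, 2, 99) == 2    # wildcard tag
assert match(posted, 3, 30) == -1   # unexpected message
```

At 50 ns per message, even a few cache misses per list entry blow the budget, which is the case for a dedicated match unit such as the ALPU.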
33 Posted Queue Results (chart: 128-entry ALPU)
34 Match Unit Architecture Architecture drivers: high throughput (3-stage pipeline); irregular data alignment (SIMD operation, permute units); program consistency (forwarding in datapath, read-before-write in register file, predication) (Diagram: input FIFOs, permute unit, ternary unit, ALU, register files, microcode branch unit, data copy unit, predicate unit, output FIFO)
35 Match Time Results (30 items)
36 Match Time Results (300 items) (chart: relative match time and relative size across designs — 1x, 3.8x, 4.6x, 7.7x, 61.5x, 77x)
37 Minimizing Memory Bandwidth Usage in Network Stack Memory bandwidth is most often the limiting factor for on-node performance (charts: instruction mix, memory usage in Sandia applications, integer instruction usage) We must minimize the use of host memory bandwidth in the network stack Bounce buffers (or any other copying) incur a 2x memory bandwidth penalty A fast off-load approach can minimize host memory bandwidth utilization: it allows the NIC to determine where received messages need to be put in host memory and DMA the data directly there, eliminating the need for bounce buffers High message rate can reduce the need for buffering of non-contiguous data
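The 2x penalty is simple accounting: the payload must be written into host memory once no matter what, and a bounce buffer adds one extra read plus one extra write to copy it to its final destination. That accounting, as a sketch:

```python
def host_memory_traffic(msg_bytes: int, bounce_buffer: bool) -> int:
    """Bytes of host memory traffic consumed to receive one message.
    The payload is DMAed into host memory once regardless; a bounce
    buffer adds a read plus a write for the copy, i.e. an extra 2x
    the payload size on top of the unavoidable 1x."""
    if bounce_buffer:
        return 3 * msg_bytes   # DMA write + copy read + copy write
    return msg_bytes           # NIC DMAs straight to the final buffer

direct = host_memory_traffic(4096, bounce_buffer=False)
copied = host_memory_traffic(4096, bounce_buffer=True)
assert copied - direct == 2 * direct   # the copy costs 2x extra bandwidth
```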
38 IAA Algorithms Project / Extreme-Scale Algorithms & Software Institute
39 Motivation Strong scaling of Charon on TLCC (P. Lin, J. Shadid 2009) Domain decomposition preconditioning with incomplete factorizations Inflation in iteration count due to number of subdomains With scalable threaded triangular solves Solve triangular system on larger subdomains Reduce number of subdomains (MPI tasks)
40 MPI Shared Memory Allocation Idea: shared memory alloc/free functions: MPI_Comm_alloc_mem() MPI_Comm_free_mem() Status: Available in current development branch of OpenMPI Demonstrated usage with threaded triangular solve
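MPI_Comm_alloc_mem/MPI_Comm_free_mem were proposed extensions at the time (MPI-3 later standardized a similar capability as MPI_Win_allocate_shared). The underlying idea — one allocation on a node that every rank attaches to, rather than per-rank private copies — can be illustrated with Python's standard shared-memory facility. This is an analogy to show the concept, not the MPI API:

```python
from multiprocessing import shared_memory

# One "rank" on the node creates the segment; a peer attaches by name
# instead of holding its own private copy of the data.
owner = shared_memory.SharedMemory(create=True, size=1024)
peer = shared_memory.SharedMemory(name=owner.name)

owner.buf[0] = 42            # written once by the allocating rank...
value_seen = peer.buf[0]     # ...directly visible to the attached rank
assert value_seen == 42

peer.close()
owner.close()
owner.unlink()               # the creating rank frees the segment
```

This is what lets a threaded triangular solve work on one large subdomain shared by all ranks on a node instead of many small per-rank subdomains.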
41 Simple MPI Program Simple MPI application Two distributed-memory/MPI kernels Want to replace an MPI kernel with a more efficient hybrid MPI/threaded kernel Threading on multicore node
42 Simple MPI + Hybrid Program n=4 Very minor changes to code MPIKernel1 does not change Hybrid MPI/threaded kernel runs on rank 0 of each node Threading on multicore node
43 Iterative Approach to Hybrid Parallelism Many sections of parallel applications scale extremely well using the MPI-only model; don't change these sections much Approach allows introduction of multithreaded kernels in iterative fashion Tune how multithreaded an application is Can focus on parts of application that don't scale with MPI-only programming Approach requires few changes to MPI-only sections
44 Iterative Approach to Hybrid Parallelism Can use 1 hybrid kernel
45 Iterative Approach to Hybrid Parallelism Or use 2 hybrid kernels
46 Preliminary PCG Results (charts: iterations and runtime relative to flat MPI PCG, comparing flat MPI PCG vs. threaded preconditioning)
47 Summary: Bimodal MPI-only / MPI + X Programming Interface traditional MPI-only applications with efficient MPI + X kernels Only change parts of applications that don't scale MPI shared memory allocation useful Allows seamless combination of traditional MPI programming with MPI+X kernels Iterative approach to multithreading Implemented PCG using MPI shared memory extensions and level-set method Effective in reducing iterations Runtime did not scale (work in progress) Better triangular solver algorithms needed
48 Acknowledgments People from whom I think I stole slides: Sudip Dosanjh Jim Ang Arun Rodrigues Scott Hemmert Brian Barrett Mike Heroux Michael Wolfe
More informationModule 11: I/O Systems
Module 11: I/O Systems Reading: Chapter 13 Objectives Explore the structure of the operating system s I/O subsystem. Discuss the principles of I/O hardware and its complexity. Provide details on the performance
More informationOverview of research activities Toward portability of performance
Overview of research activities Toward portability of performance Do dynamically what can t be done statically Understand evolution of architectures Enable new programming models Put intelligence into
More informationIntroduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono
Introduction to CUDA Algoritmi e Calcolo Parallelo References This set of slides is mainly based on: CUDA Technical Training, Dr. Antonino Tumeo, Pacific Northwest National Laboratory Slide of Applied
More informationIncorporating DMA into QoS Policies for Maximum Performance in Shared Memory Systems. Scott Marshall and Stephen Twigg
Incorporating DMA into QoS Policies for Maximum Performance in Shared Memory Systems Scott Marshall and Stephen Twigg 2 Problems with Shared Memory I/O Fairness Memory bandwidth worthless without memory
More informationCS 426 Parallel Computing. Parallel Computing Platforms
CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:
More informationThe Future of High Performance Computing
The Future of High Performance Computing Randal E. Bryant Carnegie Mellon University http://www.cs.cmu.edu/~bryant Comparing Two Large-Scale Systems Oakridge Titan Google Data Center 2 Monolithic supercomputer
More informationRed Storm / Cray XT4: A Superior Architecture for Scalability
Red Storm / Cray XT4: A Superior Architecture for Scalability Mahesh Rajan, Doug Doerfler, Courtenay Vaughan Sandia National Laboratories, Albuquerque, NM Cray User Group Atlanta, GA; May 4-9, 2009 Sandia
More informationConventional Computer Architecture. Abstraction
Conventional Computer Architecture Conventional = Sequential or Single Processor Single Processor Abstraction Conventional computer architecture has two aspects: 1 The definition of critical abstraction
More informationComparing Memory Systems for Chip Multiprocessors
Comparing Memory Systems for Chip Multiprocessors Jacob Leverich Hideho Arakida, Alex Solomatnikov, Amin Firoozshahian, Mark Horowitz, Christos Kozyrakis Computer Systems Laboratory Stanford University
More informationxsim The Extreme-Scale Simulator
www.bsc.es xsim The Extreme-Scale Simulator Janko Strassburg Severo Ochoa Seminar @ BSC, 28 Feb 2014 Motivation Future exascale systems are predicted to have hundreds of thousands of nodes, thousands of
More informationIntroduction to High Performance Parallel I/O
Introduction to High Performance Parallel I/O Richard Gerber Deputy Group Lead NERSC User Services August 30, 2013-1- Some slides from Katie Antypas I/O Needs Getting Bigger All the Time I/O needs growing
More informationUpgrade Your MuleESB with Solace s Messaging Infrastructure
The era of ubiquitous connectivity is upon us. The amount of data most modern enterprises must collect, process and distribute is exploding as a result of real-time process flows, big data, ubiquitous
More informationThe Future of Interconnect Technology
The Future of Interconnect Technology Michael Kagan, CTO HPC Advisory Council Stanford, 2014 Exponential Data Growth Best Interconnect Required 44X 0.8 Zetabyte 2009 35 Zetabyte 2020 2014 Mellanox Technologies
More informationibench: Quantifying Interference in Datacenter Applications
ibench: Quantifying Interference in Datacenter Applications Christina Delimitrou and Christos Kozyrakis Stanford University IISWC September 23 th 2013 Executive Summary Problem: Increasing utilization
More information4. Hardware Platform: Real-Time Requirements
4. Hardware Platform: Real-Time Requirements Contents: 4.1 Evolution of Microprocessor Architecture 4.2 Performance-Increasing Concepts 4.3 Influences on System Architecture 4.4 A Real-Time Hardware Architecture
More informationMultiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University
A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor
More informationChapter 3 Parallel Software
Chapter 3 Parallel Software Part I. Preliminaries Chapter 1. What Is Parallel Computing? Chapter 2. Parallel Hardware Chapter 3. Parallel Software Chapter 4. Parallel Applications Chapter 5. Supercomputers
More informationPaving the Road to Exascale
Paving the Road to Exascale Gilad Shainer August 2015, MVAPICH User Group (MUG) Meeting The Ever Growing Demand for Performance Performance Terascale Petascale Exascale 1 st Roadrunner 2000 2005 2010 2015
More informationScaling to Petaflop. Ola Torudbakken Distinguished Engineer. Sun Microsystems, Inc
Scaling to Petaflop Ola Torudbakken Distinguished Engineer Sun Microsystems, Inc HPC Market growth is strong CAGR increased from 9.2% (2006) to 15.5% (2007) Market in 2007 doubled from 2003 (Source: IDC
More informationContinuum Computer Architecture
Plenary Presentation to the Workshop on Frontiers of Extreme Computing: Continuum Computer Architecture Thomas Sterling California Institute of Technology and Louisiana State University October 25, 2005
More informationDesigning and debugging real-time distributed systems
Designing and debugging real-time distributed systems By Geoff Revill, RTI This article identifies the issues of real-time distributed system development and discusses how development platforms and tools
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationIn-Network Computing. Sebastian Kalcher, Senior System Engineer HPC. May 2017
In-Network Computing Sebastian Kalcher, Senior System Engineer HPC May 2017 Exponential Data Growth The Need for Intelligent and Faster Interconnect CPU-Centric (Onload) Data-Centric (Offload) Must Wait
More information6.9. Communicating to the Outside World: Cluster Networking
6.9 Communicating to the Outside World: Cluster Networking This online section describes the networking hardware and software used to connect the nodes of cluster together. As there are whole books and
More informationProgramming Models for Supercomputing in the Era of Multicore
Programming Models for Supercomputing in the Era of Multicore Marc Snir MULTI-CORE CHALLENGES 1 Moore s Law Reinterpreted Number of cores per chip doubles every two years, while clock speed decreases Need
More informationCUDA GPGPU Workshop 2012
CUDA GPGPU Workshop 2012 Parallel Programming: C thread, Open MP, and Open MPI Presenter: Nasrin Sultana Wichita State University 07/10/2012 Parallel Programming: Open MP, MPI, Open MPI & CUDA Outline
More informationPerformance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster
Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Veerendra Allada, Troy Benjegerdes Electrical and Computer Engineering, Ames Laboratory Iowa State University &
More informationModern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design
Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant
More informationAdvanced Computer Networks. End Host Optimization
Oriana Riva, Department of Computer Science ETH Zürich 263 3501 00 End Host Optimization Patrick Stuedi Spring Semester 2017 1 Today End-host optimizations: NUMA-aware networking Kernel-bypass Remote Direct
More informationThe University of Texas at Austin
EE382N: Principles in Computer Architecture Parallelism and Locality Fall 2009 Lecture 24 Stream Processors Wrapup + Sony (/Toshiba/IBM) Cell Broadband Engine Mattan Erez The University of Texas at Austin
More informationLUSTRE NETWORKING High-Performance Features and Flexible Support for a Wide Array of Networks White Paper November Abstract
LUSTRE NETWORKING High-Performance Features and Flexible Support for a Wide Array of Networks White Paper November 2008 Abstract This paper provides information about Lustre networking that can be used
More informationFlexible Architecture Research Machine (FARM)
Flexible Architecture Research Machine (FARM) RAMP Retreat June 25, 2009 Jared Casper, Tayo Oguntebi, Sungpack Hong, Nathan Bronson Christos Kozyrakis, Kunle Olukotun Motivation Why CPUs + FPGAs make sense
More informationANSYS HPC. Technology Leadership. Barbara Hutchings ANSYS, Inc. September 20, 2011
ANSYS HPC Technology Leadership Barbara Hutchings barbara.hutchings@ansys.com 1 ANSYS, Inc. September 20, Why ANSYS Users Need HPC Insight you can t get any other way HPC enables high-fidelity Include
More informationRDMA Read Based Rendezvous Protocol for MPI over InfiniBand: Design Alternatives and Benefits
RDMA Read Based Rendezvous Protocol for MPI over InfiniBand: Design Alternatives and Benefits Sayantan Sur Hyun-Wook Jin Lei Chai D. K. Panda Network Based Computing Lab, The Ohio State University Presentation
More informationScalable Software Transactional Memory for Chapel High-Productivity Language
Scalable Software Transactional Memory for Chapel High-Productivity Language Srinivas Sridharan and Peter Kogge, U. Notre Dame Brad Chamberlain, Cray Inc Jeffrey Vetter, Future Technologies Group, ORNL
More informationHPC and IT Issues Session Agenda. Deployment of Simulation (Trends and Issues Impacting IT) Mapping HPC to Performance (Scaling, Technology Advances)
HPC and IT Issues Session Agenda Deployment of Simulation (Trends and Issues Impacting IT) Discussion Mapping HPC to Performance (Scaling, Technology Advances) Discussion Optimizing IT for Remote Access
More informationCSc33200: Operating Systems, CS-CCNY, Fall 2003 Jinzhong Niu December 10, Review
CSc33200: Operating Systems, CS-CCNY, Fall 2003 Jinzhong Niu December 10, 2003 Review 1 Overview 1.1 The definition, objectives and evolution of operating system An operating system exploits and manages
More informationCS370 Operating Systems
CS370 Operating Systems Colorado State University Yashwant K Malaiya Spring 2018 Lecture 2 Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 2 What is an Operating System? What is
More informationGPUfs: Integrating a file system with GPUs
GPUfs: Integrating a file system with GPUs Mark Silberstein (UT Austin/Technion) Bryan Ford (Yale), Idit Keidar (Technion) Emmett Witchel (UT Austin) 1 Traditional System Architecture Applications OS CPU
More informationOSIsoft PI World 2018
OSIsoft PI World 2018 Writing Highly Performant PI Web API Applications Presented by Jim Bazis, Max Drexel Introduction Max Drexel mdrexel@osisoft.com Software Developer PI Web API Team Jim Bazis jbazis@osisoft.com
More information