Opportunities and Approaches for System Software in Supporting

1 Opportunities and Approaches for System Software in Supporting Application/Architecture Co-Design Ron Brightwell Sandia National Laboratories Scalable System Software rbbrigh@sandia.gov Workshop on Application/Architecture Co-Design for Extreme-Scale Computing September 24, 2010 Sandia is a Multiprogram Laboratory Operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy Under Contract DE-AC04-94AL85000.

2 Outline Overview of co-design Fundamental capabilities needed for co-design Two examples of limited co-design

3 Potential System Architecture Targets
System attributes
System peak: 2 Petaflop/sec | 200 Petaflop/sec | 1 Exaflop/sec
Power: 6 MW | 15 MW | 20 MW
System memory: 0.3 PB | 5 PB | PB
Node performance: 125 GF | 0.5 TF or 7 TF | 1 TF or 10 TF
Node memory BW: 25 GB/s | 0.1 TB/sec or 1 TB/sec | 0.4 TB/sec or 4 TB/sec
Node concurrency: 12 | O(100) or O(1,000) | O(1,000) or O(10,000)
System size (nodes): 18,700 | 50,000 or 5,000 | 1,000, ,000
Total Node Interconnect BW: 1.5 GB/s | 20 GB/sec | 200 GB/sec
MTTI: days | O(1 day) | O(1 day)
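The peak and power rows of the table imply a steep rise in energy efficiency. A minimal sketch of that arithmetic follows; the function name and the use of GF/W as the unit are choices made here for illustration, not something the slides state.

```python
# Peak and power figures copied from the three columns of the table above.
# The column labels are not in the transcript, so they are indexed 0..2 here.
SYSTEM_PEAK_FLOPS = [2e15, 200e15, 1e18]   # 2 PF, 200 PF, 1 EF
SYSTEM_POWER_WATTS = [6e6, 15e6, 20e6]     # 6 MW, 15 MW, 20 MW


def gflops_per_watt(peak_flops, power_watts):
    """Return energy efficiency in gigaflops per watt."""
    return peak_flops / power_watts / 1e9


efficiency = [
    gflops_per_watt(p, w) for p, w in zip(SYSTEM_PEAK_FLOPS, SYSTEM_POWER_WATTS)
]

for column, value in enumerate(efficiency):
    print(f"column {column}: {value:.2f} GF/W")
```

Under these figures the last column needs roughly 150 times the flops per watt of the first, which is one way to read the later claim that power will be the dominant constraint.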

4 Co-design is a key element of the Exascale strategy Architectures are undergoing a major change Single-thread performance is remaining relatively constant and on-chip parallelism is increasing rapidly Hierarchical parallelism, heterogeneity Massive multithreading NVRAM for caching I/O Applications will need to change in response to architectural changes Manage locality and extreme scalability (billion-way parallelism) Potentially tolerate latency Resilience? Unprecedented opportunity for applications/algorithms to influence architectures, system software and the next programming model Hardware R&D is needed to reach exascale We will not be able to solve all of the exascale problems through architecture work alone

5 Co-design space is subject to other constraints Power, system cost, R&D costs Physical limitations Multiple applications Goal is to build a sustainable infrastructure with broad market support Extend beyond natural evolution of commodity hardware to create new markets Create system building blocks that offer superior price/performance/programmability at all scales (exascale, departmental and embedded)

6 Co-design expands the feasible solution space to allow better solutions Application driven: Find the best technology to run this code. Sub-optimal. Technology driven: Fit your application to this technology. Sub-optimal. Now, we must expand the co-design space to find better solutions: new applications & algorithms, better technology and performance. [Diagram labels: Application (Model, Algorithms, Code); Technology (architecture, programming model, resilience, power)]

7 Hardware/Software co-design is a mature field Design of an integrated system that contains hardware and software Focus on embedded systems (cell phones, appliances, engines, controllers, etc.) Concurrent development of hardware and software Interactions and tradeoffs Partitioning is a focus Must satisfy real-time and/or other performance/energy metrics/constraints

8 Original DOD Standard for HW/SW co-development had shortcomings Integrated Prototypes Modeling Substrate

9 Lockheed Martin Co-design Methodology HW/SW Cosim.

10 Phases of Co-Design

11 Co-Design Schematic

12 Why has co-design not been used more extensively in HPC? Leveraging of COTS technology Almost all leadership systems have some custom components but HPC has benefited from the ability to leverage commercial technology HPC applications are very complex May contain a million lines of code ~15-20 years of architectural and programming model stability Bulk synchronous processing + explicit message passing Lack of Adequate Simulation Tools Often use Byte to Flop ratios and Excel spreadsheets Industry simulation tools are proprietary However, there are some HPC co-design examples and there are useful tools

13 Fundamental Capabilities for Co-Design Software agility Applications Need to identify an important, representative subset Application code must be small and malleable System software Smaller is better Lightweight is ideal Toolchain is always a huge issue Hardware simulation tools Sandia SST Virtualization Leverage virtual machine capability to emulate new hardware capability Need mechanisms to know the impact of co-design quickly Integrated teams Co-design centers

14 System simulation should be a key enabling technology Co-simulation of hardware and software Assess architectural choices and their impact on applications Identify bottlenecks and enable the development of algorithms for future architectures Key features Open source with the ability to interface to proprietary software Holistic: performance, power, area, cost, reliability analysis Modular and multiscale (cycle accurate to analytical) Input traces as well as joint execution Parallel FPGA acceleration

15 SST Simulation Project Parallel Discrete Event core with conservative optimization over MPI Holistic Integrated Tech. Models for power McPAT, Sim-Panalyzer Multiscale Detailed and simple models for processor, network, and memory Current Release (2.0) at Includes parallel simulation core, configuration, power models, basic network and processor models, and interface to detailed memory model
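To make "discrete event core" concrete, here is a minimal sequential sketch of an event queue driving a toy processor and memory model. It is not SST's API and it is not parallel; the class name, handler names, and latencies are all hypothetical and chosen only for illustration.

```python
import heapq
import itertools


class EventQueue:
    """Sequential discrete-event core: events fire in timestamp order."""

    def __init__(self):
        self.now = 0
        self._heap = []
        self._tie = itertools.count()  # keeps ordering stable for equal times

    def schedule(self, delay, handler, payload):
        heapq.heappush(
            self._heap, (self.now + delay, next(self._tie), handler, payload)
        )

    def run(self):
        while self._heap:
            self.now, _, handler, payload = heapq.heappop(self._heap)
            handler(self, payload)


# Hypothetical latencies in cycles, chosen only for illustration.
CPU_ISSUE_GAP = 1
MEMORY_LATENCY = 100

completed = []  # (completion time, request id) pairs


def memory_respond(queue, request_id):
    completed.append((queue.now, request_id))


def cpu_issue(queue, request_id):
    # Each load goes to the memory model, which answers after a fixed delay.
    queue.schedule(MEMORY_LATENCY, memory_respond, request_id)
    if request_id < 3:
        queue.schedule(CPU_ISSUE_GAP, cpu_issue, request_id + 1)


sim = EventQueue()
sim.schedule(0, cpu_issue, 0)
sim.run()
print(completed)
```

A parallel core would partition components across ranks and exchange events between them; the conservative synchronization that requires is omitted here.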

16 SST simulations have quantified the impact of the Memory Wall Most of DOE's applications (e.g., climate, fusion, shock physics, ...) spend most of their instructions accessing memory or doing integer computations, not floating point Additionally, most integer computations are computing memory addresses Advanced development efforts are focused on accelerating memory subsystem performance for both scientific and informatics applications

17 SST is providing architectural insights to algorithm developers Input: SST Trace for SpMV. Lots of instruction stream data. Model: Use restricted sin^2 function to mark start/finish of each instruction. Use FFTs to analyze behavior. Trace fragment from SpMV inner loop Number of in-flight instructions vs. clock cycle. Important cycle frequencies
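The idea of finding "important cycle frequencies" in a trace can be sketched as follows. This is an assumption-laden illustration: it uses a naive discrete Fourier transform from the standard library in place of an FFT, it works directly on an in-flight instruction count rather than the sin^2 marking described on the slide, and the trace itself is synthetic.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Naive O(n^2) discrete Fourier transform; returns |X[k]| for k in 0..n/2."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples)
        )
        mags.append(abs(acc))
    return mags


def dominant_period(samples):
    """Return the period (in cycles) of the strongest non-DC frequency."""
    mean = sum(samples) / len(samples)
    centered = [x - mean for x in samples]  # remove the DC component
    mags = dft_magnitudes(centered)
    k = max(range(1, len(mags)), key=lambda i: mags[i])
    return len(samples) / k


# Synthetic trace: in-flight instruction count per clock cycle, repeating every
# 8 cycles as a stand-in for one loop iteration. These numbers are made up.
loop_body = [1, 2, 4, 6, 6, 4, 2, 1]
trace = loop_body * 16

print(dominant_period(trace))
```

On a real trace the spectrum would show several peaks, and the interesting ones are those that line up with loop trip counts or memory latencies.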

18 Sandia Mantevo Project Mini-applications Small, self-contained programs that embody essential performance characteristics of key applications Mini-drivers Small programs that act as drivers of performance-impacting Trilinos packages Application proxies Parameterizable applications that can be calibrated to mimic the performance of a large-scale application, then used as a proxy for the large-scale application

19 Sandia System Software Lightweight kernels Small code base (<50K LOC) Focused on performance and scalability Separate policy decision from policy enforcement Move resource management as close to application as possible Protect applications from each other Let user processes (libraries) manage resources Get out of the way Kitten Latest-generation LWK Open-source Lightweight virtualization when combined with Palacios hypervisor Portals

20 Portals Network Programming Interface Network API developed by Sandia, U. New Mexico, Intel Previous generations of Portals deployed on several production massively parallel systems 1993: 1800-node Intel Paragon (SUNMOS) 1997: 10,000-node Intel ASCI Red (Puma/Cougar) 1999: 1800-node Cplant cluster (Linux) 2005: 10,000-node Cray Sandia Red Storm (Catamount) 2009: 18,688-node Cray XT5 ORNL Jaguar (Linux) Focused on providing Lightweight connectionless model for MPP systems Low latency High bandwidth Independent progress Overlap of computation and communication Scalable buffering semantics Supports MPI, Cray SHMEM, ARMCI, GASNet, Lustre, etc.

21 NIC Architecture Co-Design

22 NIC Architecture Co-design Prevailing architectural constraints have driven many applications to highly bursty communication patterns In a power constrained world this trend will be unsustainable due to inefficient use of the system interconnect Design Goal: Produce a NIC architecture that enables overlap through high message rates and independent progress Using simulation, NIC hardware & software and host driver software were simultaneously profiled for various architecture choices Trade-offs: Which architectural features provide performance advantages What software bottlenecks need to be moved to hardware Which functions can be left to run on NIC CPU or in the host driver Next step: rework applications (or portions) to take advantage of the new features and provide feedback for more architectural improvements

23 MPI Will Likely Persist Into Exascale Era Number of network endpoints will increase significantly (5-50x) Memory and power will be dominant resources to manage Networks must be power and energy efficient Data movement must be carefully managed Memory copies will be very expensive Impact of unexpected messages must be minimized Eager protocol for short messages leads to receive-side buffering Need strategies for managing host buffer resources Flow control will be critical N-to-1 communication patterns will (continue to be) disastrous Must preserve key network performance characteristics Latency Bandwidth Message rate (throughput)

24 High Message Throughput is Vital Message rate determines the minimum message size needed to saturate the available network bandwidth
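The relationship on this slide is a one-line formula: if a NIC can issue at most R messages per second, each message must carry at least B / R bytes to fill a link of bandwidth B. The sketch below pairs the 1.5 GB/s node interconnect figure from slide 3 with message rates that are hypothetical, picked only to show the trend.

```python
def min_saturating_message_bytes(bandwidth_bytes_per_s, messages_per_s):
    """Smallest message size that fills the link at the given message rate."""
    return bandwidth_bytes_per_s / messages_per_s


LINK_BANDWIDTH = 1.5e9  # bytes/s, "Total Node Interconnect BW" from slide 3

# Hypothetical message rates (messages per second), for illustration only.
for rate in (1e6, 5e6, 20e6):
    size = min_saturating_message_bytes(LINK_BANDWIDTH, rate)
    print(f"{rate / 1e6:>4.0f} M msg/s -> {size:.0f} bytes to saturate the link")
```

A higher message rate lowers the break-even size, so applications that send many small messages can still use the full bandwidth.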

25 Current Flow Control Strategies Not Sufficient Credit-based Limit number of outstanding send operations Used credits are replenished implicitly or explicitly Effectiveness limited to N-to-1 scenario Potential performance penalty for well-behaved applications Acknowledgment-based Receiver explicitly confirms receipt of every message Significant per-message performance penalty Round trip acknowledgment doubles latency Performance penalty for well-behaved applications Local copying (bounce buffer) mitigates latency penalty Both strategies limit message rate and effective bandwidth Flow control implemented at user-level inside MPI library Network transport usually has its own flow control mechanism No mechanism for back pressure from host resources to network
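A minimal sketch of the credit-based scheme described above, for one sender and one receiver. The class and method names are hypothetical and do not come from any MPI implementation; the list named wire simply stands in for the network.

```python
from collections import deque


class CreditSender:
    """Sender side of a credit scheme for a single receiver."""

    def __init__(self, initial_credits):
        self.credits = initial_credits
        self.pending = deque()  # messages waiting for a credit

    def send(self, message, wire):
        if self.credits > 0:
            self.credits -= 1
            wire.append(message)
        else:
            self.pending.append(message)  # stall instead of overrunning receiver

    def on_credit_return(self, count, wire):
        """Explicit replenishment: the receiver reports freed buffers."""
        self.credits += count
        while self.credits > 0 and self.pending:
            self.credits -= 1
            wire.append(self.pending.popleft())


wire = []  # stands in for the network
sender = CreditSender(initial_credits=2)
for i in range(5):
    sender.send(f"msg{i}", wire)

sent_before_return = list(wire)   # only two messages fit in the credits
sender.on_credit_return(2, wire)  # receiver freed two buffers
sent_after_return = list(wire)
print(sent_before_return, sent_after_return, list(sender.pending))
```

The stall in send is the penalty the slide mentions: even a well-behaved sender waits once its credits run out, whether or not the receiver is actually short of buffers.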

26 Applications Must Become More Asynchronous Applications cannot continue to be bulk synchronous Overhead of synchronization will limit scaling Synchronization increases susceptibility to noise Network API must provide asynchronous operations and progress Data movement must be independent of host activity Active Messages Polling is fundamental to all AM Progress only when nothing else to do Polling memory for message reception is inefficient Needs hardware support to integrate message arrival with thread invocation Run-time systems will also need to communicate Need to communicate evolving state of the system Need a common portable API Using TCP OOB connection will be infeasible

27 Resiliency Will Impact Network API Network will need to expose errors to enable recovery Applications and system components will have different resiliency requirements Reachability errors must be handled by run-time services Graceful degradation may be appropriate for some applications May need OOB mechanism for recognizing network failures AM or event-driven API would be ideal Hardware support for network-level protection RAS system invoking OS via network messages

28 Portals 4.0: Applying Lessons Learned from Cray SeaStar High message rate Atomic search and post for MPI receives required round-trip across PCI Eliminate round-trip by having Portals manage unexpected messages Flow control Encourages well-behaved applications Fail fast Identify application scalability issues early Resource exhaustion caused unrecoverable failure Recovery doesn't have to be fast Resource exhaustion will disable Portal Subsequent messages will fail with event notification at initiator Applies back pressure from network Performance for scalable applications Correctness for non-scalable applications
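The "search and post" operation refers to MPI's two matching lists: receives the application has posted, and headers of messages that arrived before a matching receive existed. The sketch below shows the logic in a simplified form, assuming matching on source and tag only (no communicator). The names are hypothetical and this is not the Portals API.

```python
ANY = None  # wildcard, standing in for MPI_ANY_SOURCE / MPI_ANY_TAG


def matches(posted, source, tag):
    """True if a posted (source, tag) entry accepts an incoming message."""
    want_source, want_tag = posted
    return want_source in (ANY, source) and want_tag in (ANY, tag)


class MatchEngine:
    def __init__(self):
        self.posted = []      # receives posted by the application, in order
        self.unexpected = []  # headers of messages that arrived early, in order

    def post_receive(self, source, tag):
        """Search the unexpected list, then append to the posted list.

        Doing both as one step is the 'atomic search and post'.
        """
        for i, (msg_source, msg_tag) in enumerate(self.unexpected):
            if matches((source, tag), msg_source, msg_tag):
                return self.unexpected.pop(i)
        self.posted.append((source, tag))
        return None

    def message_arrival(self, source, tag):
        """Match against posted receives in order; otherwise record as unexpected."""
        for i, entry in enumerate(self.posted):
            if matches(entry, source, tag):
                return self.posted.pop(i)
        self.unexpected.append((source, tag))
        return None


engine = MatchEngine()
engine.post_receive(ANY, 7)
engine.post_receive(3, 7)
first = engine.message_arrival(3, 7)   # takes the oldest matching receive
second = engine.message_arrival(3, 9)  # nothing posted for tag 9 yet
late = engine.post_receive(3, ANY)     # finds the early message
print(first, second, late)
```

When both lists live on the same side of the I/O bus, post_receive needs no round trip, which is the motivation for letting Portals manage unexpected messages.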

29 Portals 4.0 (cont'd) Hardware implementation Designed for intelligent or programmable NICs Arbitrary list insertion Unneeded symmetry on initiator and target objects New functionality for one-sided operations Eliminate matching information Smaller network header Minimize processing at target Scalable event delivery Lightweight counting events Triggered operations Chain sequence of data movement operations Build asynchronous collective operations

30 Triggered Operations
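One way to picture triggered operations: an operation is queued against a counting event and fires once the counter reaches a threshold, with no host involvement at that moment. The sketch below models that behavior in plain Python. The class and function names are hypothetical and are not the Portals 4 interface.

```python
class CountingEvent:
    """Counter that fires queued operations once it reaches their threshold."""

    def __init__(self):
        self.count = 0
        self._triggered = []  # (threshold, operation) pairs not yet fired

    def add_trigger(self, threshold, operation):
        if self.count >= threshold:
            operation()
        else:
            self._triggered.append((threshold, operation))

    def increment(self):
        self.count += 1
        ready = [op for th, op in self._triggered if th <= self.count]
        self._triggered = [(th, op) for th, op in self._triggered if th > self.count]
        for op in ready:
            op()


# One interior node of a reduction tree: after both children have delivered
# their data, the node forwards its partial result to the parent.
log = []
parent = CountingEvent()
node = CountingEvent()


def forward_to_parent():
    log.append("node forwards partial result")
    parent.increment()  # models the forwarded data landing at the parent


def parent_done():
    log.append("parent complete")


parent.add_trigger(1, parent_done)
node.add_trigger(2, forward_to_parent)

node.increment()  # data from child 0 arrives
node.increment()  # data from child 1 arrives, which fires the chain
print(log)
```

Chaining triggers like this is how a sequence of data movements can form an asynchronous collective: the whole tree is set up in advance and progresses as data arrives.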

31 Network Interface Controller Power will be number one constraint for exascale systems Current systems waste energy Using host cores to process messages is inefficient Only move data when necessary Move data to final destination No intermediate copying due to network Specialized network hardware Atomic operations Match processing Addressing and address translation Virtual address translation Avoid registration cache Logical node translation Rank translation on a million nodes Hardware support for thread activation on message arrival

32 High Message Throughput Challenges 20M messages per second implies a new message every 50ns Significant constraints created by MPI semantics On-load approach Roadblocks Host general purpose processors are inefficient for list management Caching (a cache miss is ns latency) Microbenchmarks are cache friendly, real life is not Benefits Queue Processor Easier & cheaper Off-load approach Roadblocks Storage requirements on NIC NIC embedded processor is worse at list management (than the host processor) Benefits Opportunity to create dedicated hardware Macroscale pipelining Header Posted Receives Match SRAM List Manager ALPU Unexp Msg Match Processor Bus SRAM List Manager ALPU Posted Receive
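The 50 ns figure is simply the reciprocal of the message rate. The sketch below reproduces it and then asks how many match-list entries fit in that budget; the per-entry cost used there is a hypothetical number, not one taken from the slides.

```python
def ns_per_message(messages_per_s):
    """Time available to process one message, in nanoseconds."""
    return 1e9 / messages_per_s


def entries_within_budget(messages_per_s, ns_per_entry):
    """How many list entries can be examined per message at a given cost each."""
    return int(ns_per_message(messages_per_s) // ns_per_entry)


budget = ns_per_message(20e6)
print(f"{budget:.0f} ns per message at 20M messages/s")

# Hypothetical cost of examining one match-list entry, for illustration only.
ASSUMED_NS_PER_ENTRY = 5.0
print(entries_within_budget(20e6, ASSUMED_NS_PER_ENTRY), "entries per message")
```

Under that assumed cost only a handful of entries can be traversed per message, which suggests why the off-load design considers dedicated match hardware.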

33 Posted Queue Results 128 Entry ALPU

34 Match Unit Architecture Architecture Drivers High throughput 3 stage pipeline Irregular data alignment SIMD operation Permute units Program Consistency Forwarding in datapath Read before write in register file [Block diagram labels: Input FIFO, Input FIFO Unit, Permute, Ternary Register File, Ternary Unit, ALU, Register File, Data Copy Unit, Microcode, Branch Unit, Predicate Register File, Predicate Unit, Output FIFO]

35 Match Time Results (30 items)

36 Match Time Results (300 items) 1x 4.6x Relative 3.8x Size 77x 7.7x 61.5x

37 Minimizing Memory Bandwidth Usage in Network Stack Memory bandwidth is most often the limiting factor for on-node performance [Chart titles: Instruction Mix; Memory Usage in Sandia Applications; Integer Instruction Usage] We must minimize the use of host memory bandwidth in the network stack Bounce buffers (or any other copying) incur a 2x memory bandwidth penalty A fast off-load approach can minimize host memory bandwidth utilization Allows the NIC to determine where received messages need to be put in host memory and DMA the data directly there, eliminating the need for bounce buffers High message rate can reduce the need for buffering of non-contiguous data

38 IAA Algorithms Project / Extreme-Scale Algorithms & Software Institute

39 Motivation Strong scaling of Charon on TLCC (P. Lin, J. Shadid 2009) Domain decomposition preconditioning with incomplete factorizations Inflation in iteration count due to number of subdomains With scalable threaded triangular solves Solve triangular system on larger subdomains Reduce number of subdomains (MPI tasks)

40 MPI Shared Memory Allocation Idea: Shared memory alloc/free functions: MPI_Comm_alloc_mem() MPI_Comm_free_mem() Status: Available in current development branch of OpenMPI Demonstrated usage with threaded triangular solve

41 Simple MPI Program Simple MPI application Two distributed memory/MPI kernels Want to replace an MPI kernel with more efficient hybrid MPI/threaded kernel Threading on multicore node

42 Simple MPI + Hybrid Program n=4 Very minor changes to code MPIKernel1 does not change Hybrid MPI/Threaded kernel runs on rank 0 of each node Threading on multicore node

43 Iterative Approach to Hybrid Parallelism Many sections of parallel applications scale extremely well using MPI-only model. Don't change these sections much Approach allows introduction of multithreaded kernels in iterative fashion Tune how multithreaded an application is Can focus on parts of application that don't scale with MPI-only programming Approach requires few changes to MPI-only sections

44 Iterative Approach to Hybrid Parallelism Can use 1 hybrid kernel

45 Iterative Approach to Hybrid Parallelism Or use 2 hybrid kernels

46 Preliminary PCG Results [Plots: Iterations; Runtime relative to flat MPI PCG. Series: Flat MPI PCG, Threaded Preconditioning]

47 Summary: Bimodal MPI-only / MPI + X Programming Interface traditional MPI-only applications with efficient MPI + X kernels Only change parts of applications that don't scale MPI shared memory allocation useful Allows seamless combination of traditional MPI programming with MPI+X kernels Iterative approach to multithreading Implemented PCG using MPI shared memory extensions and level set method Effective in reducing iterations Runtime did not scale (work in progress) Better triangular solver algorithms needed
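The "level set method" for triangular solves is commonly understood as level scheduling: rows are grouped so that every row in a group depends only on rows in earlier groups, which lets each group be processed in parallel. The sketch below is written under that assumption and is not the implementation from the talk; the sparse row format and the function names are hypothetical, and the per-level loop runs serially here.

```python
def level_sets(lower_rows):
    """Group rows of a sparse lower-triangular matrix into dependency levels.

    lower_rows[i] maps column j < i to the off-diagonal entry L[i][j].
    """
    level = [0] * len(lower_rows)
    for i, row in enumerate(lower_rows):
        level[i] = 1 + max((level[j] for j in row), default=-1)
    groups = {}
    for i, lv in enumerate(level):
        groups.setdefault(lv, []).append(i)
    return [groups[lv] for lv in sorted(groups)]


def solve_lower(lower_rows, diag, b):
    """Forward substitution for L x = b, visiting rows level by level."""
    x = [0.0] * len(b)
    for rows in level_sets(lower_rows):
        # Rows in one level depend only on earlier levels, so this inner loop
        # could be split across threads; it runs serially here.
        for i in rows:
            s = sum(v * x[j] for j, v in lower_rows[i].items())
            x[i] = (b[i] - s) / diag[i]
    return x


# Small made-up system: rows 0 and 1 are independent, row 2 needs row 0,
# and row 3 needs rows 1 and 2.
lower_rows = [{}, {}, {0: 1.0}, {1: 2.0, 2: 1.0}]
diag = [2.0, 4.0, 1.0, 2.0]
b = [2.0, 4.0, 3.0, 6.0]

print(level_sets(lower_rows))
print(solve_lower(lower_rows, diag, b))
```

The number of levels bounds the available parallelism: a matrix with long dependency chains yields many small levels, which is consistent with the slide's remark that better triangular solver algorithms are needed.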

48 Acknowledgments People from whom I think I stole slides: Sudip Dosanjh Jim Ang Arun Rodrigues Scott Hemmert Brian Barrett Mike Heroux Michael Wolfe


More information

CS370 Operating Systems

CS370 Operating Systems CS370 Operating Systems Colorado State University Yashwant K Malaiya Fall 2016 Lecture 2 Slides based on Text by Silberschatz, Galvin, Gagne Various sources 1 1 2 System I/O System I/O (Chap 13) Central

More information

Cray XE6 Performance Workshop

Cray XE6 Performance Workshop Cray XE6 erformance Workshop odern HC Architectures David Henty d.henty@epcc.ed.ac.uk ECC, University of Edinburgh Overview Components History Flynn s Taxonomy SID ID Classification via emory Distributed

More information

Overview. CS 472 Concurrent & Parallel Programming University of Evansville

Overview. CS 472 Concurrent & Parallel Programming University of Evansville Overview CS 472 Concurrent & Parallel Programming University of Evansville Selection of slides from CIS 410/510 Introduction to Parallel Computing Department of Computer and Information Science, University

More information
