1 Opportunities and Approaches for System Software in Supporting Application/Architecture Co-Design Ron Brightwell Sandia National Laboratories Scalable System Software rbbrigh@sandia.gov Workshop on Application/Architecture Co-Design for Extreme-Scale Computing September 24, 2010 Sandia is a Multiprogram Laboratory Operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy Under Contract DE-AC04-94AL85000.

2 Outline Overview of co-design Fundamental capabilities needed for co-design Two examples of limited co-design

3 Potential System Architecture Targets
System attributes
System peak: 2 Petaflop/sec | 200 Petaflop/sec | 1 Exaflop/sec
Power: 6 MW | 15 MW | 20 MW
System memory: 0.3 PB | 5 PB | PB
Node performance: 125 GF | 0.5 TF or 7 TF | 1 TF or 10 TF
Node memory BW: 25 GB/s | 0.1 TB/sec or 1 TB/sec | 0.4 TB/sec or 4 TB/sec
Node concurrency: 12 | O(100) or O(1,000) | O(1,000) or O(10,000)
System size (nodes): 18,700 | 50,000 or 5,000 | 1,000, ,000
Total node interconnect BW: 1.5 GB/s | 20 GB/sec | 200 GB/sec
MTTI: days | O(1 day) | O(1 day)

4 Co-design is a key element of the Exascale strategy Architectures are undergoing a major change Single thread performance is remaining relatively constant and on chip parallelism is increasing rapidly Hierarchical parallelism, heterogeneity Massive multithreading NVRAM for caching I/O Applications will need to change in response to architectural changes Manage locality and extreme scalability (billion-way parallelism) Potentially tolerate latency Resilience? Unprecedented opportunity for applications/algorithms to influence architectures, system software and the next programming model Hardware R&D is needed to reach exascale We will not be able to solve all of the exascale problems through architectures work only

5 Co-design space is subject to other constraints Power, system cost, R&D costs Physical limitations Multiple applications Goal is to build a sustainable infrastructure with broad market support Extend beyond natural evolution of commodity hardware to create new markets Create system building blocks that offer superior price/performance/programmability at all scales (exascale, departmental and embedded)

6 Co-design expands the feasible solution space to allow better solutions Application driven: Find the best technology to run this code. Sub-optimal Application Model Algorithms Code Now, we must expand the co-design space to find better solutions: new applications & algorithms, better technology and performance. Technology architecture programming model resilience power Technology driven: Fit your application to this technology. Sub-optimal.

7 Hardware/Software co-design is a mature field Design of an integrated system that contains hardware and software Focus on embedded systems (cell phones, appliances, engines, controllers, etc.) Concurrent development of hardware and software Interactions and tradeoffs Partitioning is a focus Must satisfy real-time and/or other performance/energy metrics/constraints

8 Original DOD Standard for HW/SW co-development had shortcomings Integrated Prototypes Modeling Substrate

9 Lockheed Martin Co-design Methodology HW/SW Cosim.

10 Phases of Co-Design

11 Co-Design Schematic

12 Why has co-design not been used more extensively in HPC? Leveraging of COTS technology Almost all leadership systems have some custom components but HPC has benefited from the ability to leverage commercial technology HPC applications are very complex May contain a million lines of code ~15-20 years of architectural and programming model stability Bulk synchronous processing + explicit message passing Lack of adequate simulation tools Often use Byte to Flop ratios and Excel spreadsheets Industry simulation tools are proprietary However, there are some HPC co-design examples and there are useful tools

13 Fundamental Capabilities for Co-Design Software agility Applications Need to identify an important, representative subset Application code must be small and malleable System software Smaller is better Lightweight is ideal Toolchain is always a huge issue Hardware simulation tools Sandia SST Virtualization Leverage virtual machine capability to emulate new hardware capability Need mechanisms to know the impact of co-design quickly Integrated teams Co-design centers

14 System simulation should be a key enabling technology Co-simulation of hardware and software Assess architectural choices and their impact on applications Identify bottlenecks and enable the development of algorithms for future architectures Key features Open source with the ability to interface to proprietary software Holistic: performance, power, area, cost, reliability analysis Modular and multiscale (cycle accurate to analytical) Input traces as well as joint execution Parallel FPGA acceleration

15 SST Simulation Project Parallel Discrete Event core with conservative optimization over MPI Holistic Integrated Tech. Models for power McPAT, Sim-Panalyzer Multiscale Detailed and simple models for processor, network, and memory Current Release (2.0) at Includes parallel simulation core, configuration, power models, basic network and processor models, and interface to detailed memory model
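To make "discrete event core" concrete, here is a minimal sequential sketch of the idea: components schedule timestamped events, and the core always processes the earliest one next. This is not SST's API. The `Simulator` class, the handler names, and the 100-cycle memory latency are all assumptions chosen for illustration, and the parallel, conservative synchronization over MPI that the slide mentions is not modeled.

```python
import heapq
from itertools import count


class Simulator:
    """Minimal sequential discrete-event core (hypothetical, not SST's API)."""

    def __init__(self):
        self.now = 0
        self._queue = []
        self._seq = count()  # tie-breaker: equal-time events keep insertion order

    def schedule(self, delay, handler, *args):
        """Schedule handler(sim, *args) to run `delay` cycles from now."""
        heapq.heappush(self._queue, (self.now + delay, next(self._seq), handler, args))

    def run(self):
        """Process events in timestamp order until none remain."""
        while self._queue:
            self.now, _, handler, args = heapq.heappop(self._queue)
            handler(self, *args)


def demo():
    """Two loads issued by a toy CPU model, answered by a toy memory model."""
    log = []

    def cpu_issue(sim, addr):
        log.append((sim.now, "cpu issues load", addr))
        sim.schedule(100, memory_reply, addr)  # assumed 100-cycle memory latency

    def memory_reply(sim, addr):
        log.append((sim.now, "memory replies", addr))

    sim = Simulator()
    sim.schedule(0, cpu_issue, 0x10)
    sim.schedule(5, cpu_issue, 0x20)
    sim.run()
    return log
```

Calling `demo()` returns the event log in simulated-time order. Swapping the simple memory handler for a more detailed one, without touching the core, is the kind of modularity the "multiscale" bullet describes.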

16 SST simulations have quantified the impact of the Memory Wall Most of DOE's applications (e.g., climate, fusion, shock physics) spend most of their instructions accessing memory or doing integer computations, not floating point Additionally, most integer computations are computing memory addresses Advanced development efforts are focused on accelerating memory subsystem performance for both scientific and informatics applications

17 SST is providing architectural insights to algorithm developers Input: SST trace for SpMV. Lots of instruction stream data. Model: Use restricted sin^2 function to mark start/finish of each instruction. Use FFTs to analyze behavior. Trace fragment from SpMV inner loop Number of in-flight instructions vs. clock cycle. Important cycle frequencies
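The slide does not give the exact model, so the following is one possible reading, sketched under stated assumptions: each instruction contributes a sin^2 pulse restricted to the cycles between its start and finish, the pulses are summed into a smoothed in-flight signal, and a discrete Fourier transform exposes the dominant cycle frequency. The function names and the synthetic trace are hypothetical, and a naive DFT stands in for an FFT to keep the example dependency-free.

```python
import cmath
import math


def in_flight_signal(instructions, n_cycles):
    """Sum one sin^2 pulse per instruction, restricted to [start, start + duration)."""
    signal = [0.0] * n_cycles
    for start, duration in instructions:
        for t in range(start, min(start + duration, n_cycles)):
            signal[t] += math.sin(math.pi * (t - start) / duration) ** 2
    return signal


def dft_magnitudes(signal):
    """Naive O(n^2) DFT magnitudes for bins 0 .. n/2 (an FFT gives the same values)."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n // 2 + 1)
    ]


def dominant_frequency(signal):
    """Index of the strongest non-DC bin; the period is len(signal) / index cycles."""
    mags = dft_magnitudes(signal)
    return max(range(1, len(mags)), key=mags.__getitem__)


# Synthetic trace (an assumption): one 8-cycle instruction issued every 8 cycles.
trace = [(start, 8) for start in range(0, 64, 8)]
signal = in_flight_signal(trace, 64)
period_cycles = len(signal) / dominant_frequency(signal)
```

For this synthetic trace the strongest bin corresponds to a period of 8 cycles, matching the issue interval. On a real SpMV trace the interesting output would be the set of strong bins rather than a single one.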

18 Sandia Mantevo Project Mini-applications Small, self-contained programs that embody essential performance characteristics of key applications Mini-drivers Small programs that act as drivers of performance-impacting Trilinos packages Application proxies Parameterizable applications that can be calibrated to mimic the performance of a large-scale application, then used as a proxy for the large-scale application

19 Sandia System Software Lightweight kernels Small code base (<50K LOC) Focused on performance and scalability Separate policy decision from policy enforcement Move resource management as close to application as possible Protect applications from each other Let user processes (libraries) manage resources Get out of the way Kitten Latest-generation LWK Open-source Lightweight virtualization when combined with Palacios hypervisor Portals

20 Portals Network Programming Interface Network API developed by Sandia, U. New Mexico, Intel Previous generations of Portals deployed on several production massively parallel systems 1993: 1800-node Intel Paragon (SUNMOS) 1997: 10,000-node Intel ASCI Red (Puma/Cougar) 1999: 1800-node Cplant cluster (Linux) 2005: 10,000-node Cray Sandia Red Storm (Catamount) 2009: 18,688-node Cray XT5 ORNL Jaguar (Linux) Focused on providing Lightweight connectionless model for MPP systems Low latency High bandwidth Independent progress Overlap of computation and communication Scalable buffering semantics Supports MPI, Cray SHMEM, ARMCI, GASNet, Lustre, etc.

21 NIC Architecture Co-Design

22 NIC Architecture Co-design Prevailing architectural constraints have driven many applications to highly bursty communication patterns. In a power constrained world this trend will be unsustainable due to inefficient use of the system interconnect. Design goal: Produce a NIC architecture that enables overlap through high message rates and independent progress. Using simulation, NIC hardware & software and host driver software were simultaneously profiled for various architecture choices. Trade-offs: Which architectural features provide performance advantages? What software bottlenecks need to be moved to hardware? Which functions can be left to run on the NIC CPU or in the host driver? Next step: rework applications (or portions) to take advantage of the new features and provide feedback for more architectural improvements.

23 MPI Will Likely Persist Into Exascale Era Number of network endpoints will increase significantly (5-50x) Memory and power will be dominant resources to manage Networks must be power and energy efficient Data movement must be carefully managed Memory copies will be very expensive Impact of unexpected messages must be minimized Eager protocol for short messages leads to receive-side buffering Need strategies for managing host buffer resources Flow control will be critical N-to-1 communication patterns will (continue to be) disastrous Must preserve key network performance characteristics Latency Bandwidth Message rate (throughput)

24 High Message Throughput is Vital Message rate determines the minimum message size needed to saturate the available network bandwidth
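The relationship on this slide is simple arithmetic: a network that can inject at most R messages per second needs messages of at least B / R bytes to fill a link of bandwidth B. The sketch below works that out. The 10 GB/s link and the two message rates are illustrative assumptions, not measurements from the talk, and the function names are hypothetical.

```python
def min_saturating_message_bytes(bandwidth_bytes_per_s, messages_per_s):
    """Smallest message size that can saturate the link at the given message rate."""
    return bandwidth_bytes_per_s / messages_per_s


def achieved_bandwidth(bandwidth_bytes_per_s, messages_per_s, message_bytes):
    """Bandwidth actually delivered: limited by the link or by the message rate."""
    return min(bandwidth_bytes_per_s, messages_per_s * message_bytes)


LINK = 10e9  # assumed 10 GB/s link

# A NIC sustaining 20 million messages/s saturates the link with 500-byte messages.
fast_nic = min_saturating_message_bytes(LINK, 20e6)

# A NIC sustaining only 1 million messages/s needs 10,000-byte messages,
# and delivers just 0.5 GB/s when the application sends 500-byte messages.
slow_nic = min_saturating_message_bytes(LINK, 1e6)
slow_nic_small_msgs = achieved_bandwidth(LINK, 1e6, 500)
```

The higher the sustainable message rate, the smaller the messages an application can send without leaving bandwidth unused, which is why message rate matters for fine-grained communication.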

25 Current Flow Control Strategies Not Sufficient Credit-based Limit number of outstanding send operations Used credits are replenished implicitly or explicitly Effectiveness limited to N-to-1 scenario Potential performance penalty for well-behaved applications Acknowledgment-based Receiver explicitly confirms receipt of every message Significant per-message performance penalty Round trip acknowledgment doubles latency Performance penalty for well-behaved applications Local copying (bounce buffer) mitigates latency penalty Both strategies limit message rate and effective bandwidth Flow control implemented at user-level inside MPI library Network transport usually has its own flow control mechanism No mechanism for back pressure from host resources to network
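As a rough illustration of the credit-based strategy, the sketch below models a sender that may have only a fixed number of messages outstanding to one receiver. Further sends queue locally until the receiver returns a credit. The `CreditSender` class and its method names are hypothetical and do not correspond to any particular MPI implementation.

```python
from collections import deque


class CreditSender:
    """Toy model of sender-side credit-based flow control toward one receiver."""

    def __init__(self, credits):
        self.credits = credits   # sends the receiver has agreed to buffer
        self.pending = deque()   # sends stalled waiting for a credit
        self.wire = []           # messages actually injected into the network

    def send(self, msg):
        self.pending.append(msg)
        self._drain()

    def credit_returned(self, n=1):
        """Receiver freed buffer space (explicitly, or piggybacked on other traffic)."""
        self.credits += n
        self._drain()

    def _drain(self):
        # Inject queued messages in order while credits remain.
        while self.credits > 0 and self.pending:
            self.credits -= 1
            self.wire.append(self.pending.popleft())


sender = CreditSender(credits=2)
for msg in ["a", "b", "c", "d"]:
    sender.send(msg)
injected_before_credit = list(sender.wire)   # only two messages go out
sender.credit_returned()
injected_after_credit = list(sender.wire)    # one more is released
```

The stall on the third send shows the cost the slide points out: even a well-behaved application pays for the bookkeeping and can be throttled while credits are in flight.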

26 Applications Must Become More Asynchronous Applications cannot continue to be bulk synchronous Overhead of synchronization will limit scaling Synchronization increases susceptibility to noise Network API must provide asynchronous operations and progress Data movement must be independent of host activity Active Messages Polling is fundamental to all AM Progress only when nothing else to do Polling memory for message reception is inefficient Needs hardware support to integrate message arrival with thread invocation Run-time systems will also need to communicate Need to communicate evolving state of the system Need a common portable API Using TCP OOB connection will be infeasible

27 Resiliency Will Impact Network API Network will need to expose errors to enable recovery Applications and system components will have different resiliency requirements Reachability errors must be handled by run-time services Graceful degradation may be appropriate for some applications May need OOB mechanism for recognizing network failures AM or event-driven API would be ideal Hardware support for network-level protection RAS system invoking OS via network messages

28 Portals 4.0: Applying Lessons Learned from Cray SeaStar High message rate Atomic search and post for MPI receives required round-trip across PCI Eliminate round-trip by having Portals manage unexpected messages Flow control Encourages well-behaved applications Fail fast Identify application scalability issues early Resource exhaustion caused unrecoverable failure Recovery doesn't have to be fast Resource exhaustion will disable Portal Subsequent messages will fail with event notification at initiator Applies back pressure from network Performance for scalable applications Correctness for non-scalable applications
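The fail-fast behavior described on this slide can be modeled in a few lines. The sketch assumes a target with a fixed number of slots for unexpected messages: exhausting them disables the entry, later messages fail with an event reported to the initiator, and a slow recovery path drains the buffers and re-enables delivery. The class, method names, and event strings are hypothetical. This is a behavioral model, not the Portals 4 API.

```python
class PortalEntry:
    """Toy model of a target-side entry with fail-fast flow control."""

    def __init__(self, unexpected_slots):
        self.free_slots = unexpected_slots
        self.enabled = True
        self.unexpected = []

    def deliver(self, msg):
        """Return the event reported back to the initiator: 'ack' or 'failed'."""
        if not self.enabled:
            return "failed"          # back pressure: entry is disabled
        if self.free_slots == 0:
            self.enabled = False     # resource exhaustion disables the entry
            return "failed"
        self.free_slots -= 1
        self.unexpected.append(msg)
        return "ack"

    def recover(self):
        """Slow path: drain buffered messages, then re-enable the entry."""
        drained, self.unexpected = self.unexpected, []
        self.free_slots += len(drained)
        self.enabled = True
        return drained


entry = PortalEntry(unexpected_slots=2)
events = [entry.deliver(i) for i in range(4)]   # third and fourth messages fail
drained = entry.recover()
event_after_recovery = entry.deliver(99)
```

Well-behaved applications never hit the slow path, so they pay nothing for the mechanism, while applications that flood a target still run correctly once the initiators retry after recovery.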

29 Portals 4.0 (cont'd) Hardware implementation Designed for intelligent or programmable NICs Arbitrary list insertion Unneeded symmetry on initiator and target objects New functionality for one-sided operations Eliminate matching information Smaller network header Minimize processing at target Scalable event delivery Lightweight counting events Triggered operations Chain sequence of data movement operations Build asynchronous collective operations

30 Triggered Operations
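A triggered operation is a data movement operation that is set up in advance and fires on its own once a counting event reaches a threshold. The sketch below models that idea for one interior node of a reduction tree, which forwards to its parent as soon as both children have reported, with no further host involvement. `CountingEvent`, `trigger_at`, and the log strings are hypothetical names for illustration, not Portals 4 calls.

```python
class CountingEvent:
    """Toy counting event that fires registered operations at a threshold."""

    def __init__(self):
        self.value = 0
        self._triggered = []  # (threshold, operation) pairs not yet fired

    def trigger_at(self, threshold, operation):
        """Register `operation` to run once the counter reaches `threshold`."""
        self._triggered.append((threshold, operation))
        self._fire()

    def increment(self, n=1):
        """Called on message arrival or completion (by the NIC in a real system)."""
        self.value += n
        self._fire()

    def _fire(self):
        ready = [op for th, op in self._triggered if th <= self.value]
        self._triggered = [(th, op) for th, op in self._triggered if th > self.value]
        for op in ready:
            op()


def demo():
    """One interior node of a reduction tree with two children."""
    log = []
    from_children = CountingEvent()
    at_parent = CountingEvent()

    def forward_to_parent():
        log.append("forward to parent")
        at_parent.increment()        # models the message arriving at the parent

    # Set up the whole chain before any data arrives.
    from_children.trigger_at(2, forward_to_parent)
    at_parent.trigger_at(1, lambda: log.append("parent notified"))

    from_children.increment()        # first child arrives: nothing fires yet
    after_first = list(log)
    from_children.increment()        # second child arrives: the chain runs
    return after_first, log
```

Because every step is registered up front, the chain progresses on message arrival alone, which is how a sequence of triggered operations can implement an asynchronous collective.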

31 Network Interface Controller Power will be number one constraint for exascale systems Current systems waste energy Using host cores to process messages is inefficient Only move data when necessary Move data to final destination No intermediate copying due to network Specialized network hardware Atomic operations Match processing Addressing and address translation Virtual address translation Avoid registration cache Logical node translation Rank translation on a million nodes Hardware support for thread activation on message arrival

32 High Message Throughput Challenges 20M messages per second implies a new message every 50 ns Significant constraints created by MPI semantics On-load approach Roadblocks Host general purpose processors are inefficient for list management Caching (a cache miss is ns latency) Microbenchmarks are cache friendly, real life is not Benefits Queue Processor Easier & cheaper Off-load approach Roadblocks Storage requirements on NIC NIC embedded processor is worse at list management (than the host processor) Benefits Opportunity to create dedicated hardware Macroscale pipelining Header Posted Receives Match SRAM List Manager ALPU Unexp Msg Match Processor Bus SRAM List Manager ALPU Posted Receive

33 Posted Queue Results 128 Entry ALPU

34 Match Unit Architecture Permute Ternary Register File Input FIFO Input Fifo Unit ALU Register File Microcode Branch Unit Architecture Drivers High throughput 3 stage pipeline Irregular data alignment SIMD operation Permute units Program Consistency Forwarding in datapath Ternary Unit Data Copy Unit ALU Read before write in register file Predicate Register File Predicate Unit Output FIFO

35 Match Time Results (30 items)

36 Match Time Results (300 items) 1x 4.6x Relative 3.8x Size 77x 7.7x 61.5x

37 Minimizing Memory Bandwidth Usage in Network Stack Memory bandwidth is most often the limiting factor for on node performance Instruction Mix Memory Usage in Sandia Applications Integer Instruction Usage We must minimize the use of host memory bandwidth in the network stack Bounce buffers (or any other copying) incur a 2x memory bandwidth penalty A fast off-load approach can minimize host memory bandwidth utilization Allows the NIC to determine where received messages need to be put in host memory and DMA the data directly there, eliminating the need for bounce buffers High message rate can reduce the need for buffering of non-contiguous data
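The 2x penalty can be read as simple byte accounting, sketched below under an assumed model: every byte read from or written to host memory counts once, caches are ignored, and the NIC always writes the incoming message once by DMA. A bounce buffer then adds one read and one write of the whole message for the copy to its final location. The function name is hypothetical.

```python
def host_memory_bytes(message_bytes, bounce_buffer):
    """Host memory traffic to receive one message, under the assumed model."""
    traffic = message_bytes               # NIC DMA write into host memory
    if bounce_buffer:
        traffic += message_bytes          # host reads the bounce buffer
        traffic += message_bytes          # host writes the final destination
    return traffic


MESSAGE = 4096  # illustrative message size in bytes

direct = host_memory_bytes(MESSAGE, bounce_buffer=False)
bounced = host_memory_bytes(MESSAGE, bounce_buffer=True)
extra = bounced - direct                  # the copy's cost: 2x the message size
```

Under these assumptions the copy adds twice the message size in memory traffic, which is what an off-load NIC avoids by placing data directly at its final destination.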

38 IAA Algorithms Project / Extreme-Scale Algorithms & Software Institute

39 Motivation Strong scaling of Charon on TLCC (P. Lin, J. Shadid 2009) Domain decomposition preconditioning with incomplete factorizations Inflation in iteration count due to number of subdomains With scalable threaded triangular solves Solve triangular system on larger subdomains Reduce number of subdomains (MPI tasks)

40 MPI Shared Memory Allocation Idea: Shared memory alloc/free functions: MPI_Comm_alloc_mem() MPI_Comm_free_mem() Status: Available in current development branch of OpenMPI Demonstrated usage with threaded triangular solve

41 Simple MPI Program Simple MPI application Two distributed memory/MPI kernels Want to replace an MPI kernel with a more efficient hybrid MPI/threaded kernel Threading on multicore node

42 Simple MPI + Hybrid Program n=4 Very minor changes to code MPIKernel1 does not change Hybrid MPI/Threaded kernel runs on rank 0 of each node Threading on multicore node

43 Iterative Approach to Hybrid Parallelism Many sections of parallel applications scale extremely well using MPI-only model. Don't change these sections much Approach allows introduction of multithreaded kernels in iterative fashion Tune how multithreaded an application is Can focus on parts of application that don't scale with MPI-only programming Approach requires few changes to MPI-only sections

44 Iterative Approach to Hybrid Parallelism Can use 1 hybrid kernel

45 Iterative Approach to Hybrid Parallelism Or use 2 hybrid kernels

46 Preliminary PCG Results Figure: iterations and runtime (relative to flat MPI PCG) for flat MPI PCG and threaded preconditioning

47 Summary: Bimodal MPI-only / MPI + X Programming Interface traditional MPI-only applications with efficient MPI + X kernels Only change parts of applications that don't scale MPI shared memory allocation useful Allows seamless combination of traditional MPI programming with MPI+X kernels Iterative approach to multithreading Implemented PCG using MPI shared memory extensions and level set method Effective in reducing iterations Runtime did not scale (work in progress) Better triangular solver algorithms needed

48 Acknowledgments People from whom I think I stole slides: Sudip Dosanjh Jim Ang Arun Rodrigues Scott Hemmert Brian Barrett Mike Heroux Michael Wolfe
