The Case of the Missing Supercomputer Performance
Transcription
1 The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8192 Processors of ASCI Q. Fabrizio Petrini, Darren Kerbyson, Scott Pakin (Los Alamos National Lab). Presented by Jiahua He.
2 Skeleton of the Story. Machine: ASCI Q (second on the Top500 list); 2048 Alpha SMP nodes with 4 processors per node, interconnected with the Quadrics QsNet network. Application: SAGE, a compressible Eulerian hydrodynamics program; 150,000 lines of Fortran MPI code. Beginning: a serious but previously undetected problem. Techniques: measurement to determine real performance; an analytical model to predict expected performance; microbenchmarks to identify the problem source; a simulator to examine what-if scenarios. Result: a factor of 2 improvement in application performance. 10/03/05
3 Steps. (1) Performance expectation: use the analytical model to determine the performance that SAGE ought to see on ASCI Q; measure the real performance of SAGE. (2) Problem source: if the measured performance is less than expected, use custom microbenchmarks to identify the source of the discrepancy. (3) Problem elimination: use the simulator to try different measures; eliminate the cause of the problem. (4) Remeasurement: remeasure, and repeat from step 2 if measured and expected performance still do not match.
4 Step 1: Performance expectation. Use the analytical model to determine the performance that SAGE ought to see on ASCI Q, and measure the real performance of SAGE.
5 Performance Expectation. Model (Darren Kerbyson et al., SC01): validated on many large-scale systems, including all ASCI systems; typical prediction error of less than 10%. Terms: QA is the first 4096-processor segment; QB is the second 4096-processor segment; weak scaling fixes the per-node problem size and scales the number of processors.
6 MYSTERY #1: SAGE performs significantly worse on ASCI Q than was predicted by our performance model.
7 Different # of proc. Is the model accurate? n-proc: using n processors per node. The only significant difference occurs with 4-proc, which gives confidence in the model and limits the problem to the 4-proc case. 3-proc outperforms 4-proc when using more than 256 nodes; 2-proc outperforms 4-proc when using more than 512 nodes.
8 Perf Variability. A constant amount of work in each cycle should take a constant amount of time. Measured cycle times vary from 0.7 s to 3.0 s: a factor of 4 in variability.
9 Breakdown of Cycle Time. cycle = computation + local boundary exchange + collective communication. Local boundary exchanges (get, put): plateau above 500 processors; match the model prediction. Collective communications (allreduce, reduction, broadcast): increase with the number of processors; constant number and payload size in allreduce operations. Difference between allreduce and reduction/broadcast: the frequency of occurrence.
10 Observations. Summary: significant difference between expected and observed performance, only with 4-proc; high variability; source of the performance deficit: collective operations, especially allreduce. Deduction: improve the performance of allreduce, especially when using four processors per node.
11 Step 2: Problem source. If the measured performance is less than expected, use custom microbenchmarks to identify the source of the discrepancy.
12 Investigating allreduce. allreduce latency: 3 ms with 4-proc; less than 0.3 ms otherwise. Synthetic parallel benchmark: alternately computes for 0, 1, or 5 ms, then performs either an allreduce or a barrier. An ideal scalable system shows logarithmic growth with the number of nodes and insensitivity to computational granularity. Result: not scalable.
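As a rough illustration of what "logarithmic growth with the number of nodes" means for a collective, the sketch below models a tree-based operation that needs one communication stage per doubling of the node count. The per-stage cost of 5 µs and the function name are hypothetical, chosen only to make the shape of the curve visible; they are not measurements from ASCI Q.

```python
import math

def ideal_collective_latency_us(nodes: int, per_stage_us: float = 5.0) -> float:
    """Latency of an idealized tree-based collective.

    One stage per doubling of the node count; per_stage_us is a made-up
    constant, not a QsNet measurement.
    """
    if nodes < 1:
        raise ValueError("nodes must be >= 1")
    return math.ceil(math.log2(nodes)) * per_stage_us

for n in (2, 64, 1024):
    print(n, ideal_collective_latency_us(n))
```

In this idealized model the latency does not depend on how long each process computes between collectives, which is the "insensitivity to computational granularity" the slide mentions.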
13 Optimization. Optimizing allreduce: always polling; blocking after a limited time (100 µs, determined empirically); improves latency by a factor of 7. Expectation: at 4096 processors, SAGE spends 51% of its time in allreduce, so the expected performance gain is 78%. Measurement result: only a marginal improvement in application performance.
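The 78% figure on this slide is consistent with a standard Amdahl's-law calculation using the slide's own numbers (51% of time in allreduce, allreduce made 7 times faster). The function name below is illustrative.

```python
def expected_speedup(fraction: float, factor: float) -> float:
    """Amdahl's law: overall speedup when `fraction` of the runtime
    is made `factor` times faster and the rest is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)

# Slide values: 51% of time in allreduce, allreduce 7x faster.
gain = expected_speedup(0.51, 7.0) - 1.0
print(f"expected gain: {gain:.0%}")
```

The remaining time is 0.49 + 0.51/7 ≈ 0.563 of the original, giving a speedup of about 1.78, or a gain of roughly 78%.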
14 MYSTERY #2: Although SAGE spends half of its time in allreduce (at 4,096 processors), making allreduce seven times faster leads to a negligible performance improvement.
15 Analyzing Noise. The cause is neither MPI nor the network but the node: periodic system activities (noise). Need a spare proc (Fig. 3, 6). Blocking in allreduce. Benchmark: synthetic, 1000 s of computation per proc in the absence of noise; max slowdown: only 2.5%. Refined benchmark: 1 million 1 ms iterations per proc in the absence of noise, matching the pattern of LANL codes; similar result.
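A per-processor noise microbenchmark of the kind described on this slide can be sketched as a loop of identical compute quanta whose durations are timed individually. The version below is a scaled-down, single-process Python approximation, not the authors' benchmark: the iteration count, the work size, and the use of the fastest quantum as the noise-free estimate are all assumptions.

```python
import time

def measure_noise(iterations: int, work: int):
    """Time `iterations` identical compute quanta; return (slowdown, durations).

    slowdown compares the total elapsed time with the time the run would
    have taken if every quantum had been as fast as the fastest one seen
    (an assumed stand-in for the noise-free time).
    """
    durations = []
    for _ in range(iterations):
        start = time.perf_counter()
        acc = 0
        for i in range(work):          # fixed amount of work per quantum
            acc += i * i
        durations.append(time.perf_counter() - start)
    ideal = min(durations) * iterations
    slowdown = sum(durations) / ideal - 1.0
    return slowdown, durations

slowdown, durations = measure_noise(iterations=2000, work=2000)
print(f"slowdown vs. noise-free estimate: {slowdown:.1%}")
```

Keeping the individual durations, rather than only the total, is what later allows the noise to be examined per event and per node.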
16 MYSTERY #3: Although the noise hypothesis could explain SAGE's suboptimal performance, microbenchmarks of per-processor noise indicate that at most 2.5% of performance is being lost to noise.
17 Node Aggregation. Exposes structure in what appears to be uncorrelated noise on a per-proc basis. Important observation: a regular pattern across nodes; each cluster (32 nodes) contains noisier nodes. Zooming into a cluster: node 0 is the cluster manager, node 1 the quorum node, node 31 the RMS cluster monitor.
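The aggregation step can be sketched as summing per-processor noise into per-node totals. The mapping of rank r to node r // 4 and the sample values below are assumptions made for illustration; only the figure of 4 processors per node comes from the slides.

```python
from collections import defaultdict

PROCS_PER_NODE = 4  # ASCI Q nodes have 4 processors each

def aggregate_by_node(per_proc_noise):
    """Sum per-processor noise (e.g. ms lost) into per-node totals.

    per_proc_noise maps a global processor rank to its measured noise;
    rank r is assumed to live on node r // PROCS_PER_NODE.
    """
    per_node = defaultdict(float)
    for rank, noise in per_proc_noise.items():
        per_node[rank // PROCS_PER_NODE] += noise
    return dict(per_node)

# Made-up numbers: a daemon on node 0 happens to land on processor 1.
sample = {0: 0.2, 1: 3.1, 2: 0.2, 3: 0.3, 4: 0.2, 5: 0.3, 6: 0.2, 7: 0.2}
print(aggregate_by_node(sample))
```

Because a daemon may run on whichever processor of a node is available, its cost looks scattered per processor but adds up to a consistent per-node total.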
18 FINDING #1: Analyzing noise on a per-node basis instead of a per-processor basis reveals a regular structure across nodes.
19 Noise Events
20 Sources of Noise. Kernel: distributed heartbeat generated at kernel level; lightweight (hundreds of microseconds); high frequency (one every 125 ms). RMS daemons (Quadrics Resource Management System): one every 30 s. TruCluster daemons (HP cluster management software): one about every 100 s.
21 Step 3: Problem elimination. Use the simulator to try different measures, and eliminate the cause of the problem.
22 Coscheduling. Application: fine-grained, bulk-synchronous; a delay in one process slows down the whole app. With a large number of processors, at least one process is slow in every iteration. Coscheduling: pay the penalty only once. The authors developed a prototype, but give no details or results.
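Why a large processor count makes a slow process in every iteration almost certain can be shown with a short probability calculation. The sketch assumes each process is delayed independently with the same small probability per iteration; the value 0.001 is hypothetical.

```python
def p_iteration_delayed(p_proc: float, nprocs: int) -> float:
    """Probability that at least one of nprocs independent processes
    is delayed in a given iteration."""
    return 1.0 - (1.0 - p_proc) ** nprocs

# p_proc = 0.001 is a made-up per-process delay probability.
for n in (4, 1024, 8192):
    print(n, round(p_iteration_delayed(0.001, n), 3))
```

Coscheduling attacks the independence assumption: if the noise runs on all nodes at the same time, the delays overlap and the application pays for them once instead of once per affected process.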
23 Discrete-event Simulator. Why a simulator? Time on ASCI Q is scarce, and configuration changes are not always practical. Event = <F, L, E, P>: F is the frequency of the event, L the average duration, E the distribution, P the placement. Workload: barriers + 1 ms computations. Validated against measured events (top two curves). Predicted performance gain from removing noise: on node 0, 1, or 31, marginal improvement (15%); kernel noise on all nodes, dramatic improvement.
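The following is a much-simplified Monte Carlo sketch of a barrier-plus-computation loop under noise, not the authors' simulator. It represents each noise source by a period (F), a fixed duration (L, with the distribution E assumed constant), and a set of affected nodes (P). The 125 ms period comes from the slides; the 0.3 ms and 20 ms durations, the one-node-per-32 placement, and all names are assumptions for illustration.

```python
import random
from dataclasses import dataclass

@dataclass
class Noise:
    period_ms: float     # F: one event every period_ms on each affected node
    duration_ms: float   # L: time stolen per event (E: assumed constant)
    nodes: range         # P: which nodes the event runs on

def simulate(sources, grain_ms=1.0, iterations=5000, seed=0):
    """Return the mean iteration time (ms) of a compute-then-barrier loop."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(iterations):
        slowest = grain_ms
        for src in sources:
            # Chance that one node sees an event during this compute phase.
            p_hit = min(1.0, grain_ms / src.period_ms)
            # Chance that at least one affected node is hit.
            p_any = 1.0 - (1.0 - p_hit) ** len(src.nodes)
            if rng.random() < p_any:
                slowest = max(slowest, grain_ms + src.duration_ms)
        total += slowest   # the barrier waits for the slowest node
    return total / iterations

N = 1024
# Short, frequent noise on every node (duration is hypothetical).
kernel = Noise(period_ms=125.0, duration_ms=0.3, nodes=range(N))
# Long, rare noise on one node per 32-node cluster (values are hypothetical).
daemon = Noise(period_ms=30_000.0, duration_ms=20.0, nodes=range(0, N, 32))

print("kernel noise only:", simulate([kernel]))
print("daemon noise only:", simulate([daemon]))
```

With 1 ms of computation per iteration, the short noise present on all nodes is hit in almost every iteration, while the long noise on a few nodes is hit rarely, so under these assumed parameters the former costs more on average.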
24 FINDING #2: On fine-grained applications, more performance is lost to short but frequent noise on all nodes than to long but less frequent noise on just a few nodes.
25 Eliminating Noise. It is infeasible to remove all the noise: two TruCluster heartbeats run at kernel level, and removing them would require substantial kernel modifications. Optimizations: removed ten daemons from all nodes; increased the RMS interval from 30 s to 60 s; moved several TruCluster daemons from nodes 1 and 2 to node 0. Microbenchmarks: barriers + computations (0, 1, or 5 ms). Improvements: 2.2 to 13 times faster.
26 Step 4: Remeasurement. Remeasure, and repeat from step 2 if measured and expected performance still do not match.
27 Optimized SAGE Performance. Old curves (top two curves). New curves: 4-proc, but without nodes 0 and 31. Jan-27-03: 1024-node segment (only up to 3716 proc). May-01-03: full-sized ASCI Q (up to 7680 proc). May-01-03 (min): minimum time over 50 cycles. Results: Jan-27-03 and May-01-03 are much improved; May-01-03 (min) closely matches expected performance, which points to further optimizations.
28 Summary. Different configurations tested prior to and after noise removal. Total processing rate = (# usable proc) * (cells per proc) / (cycle time), with a fixed 13,500 cells per proc and a varied number of usable proc. Best observed (???) processing rate is only 15% below model expectation.
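The processing-rate metric on this slide can be written as a one-line function. The 13,500 cells per processor is from the slide; the processor count and the 1.0 s cycle time in the usage line are placeholders, not values from the paper's tables.

```python
CELLS_PER_PROC = 13_500  # fixed per-processor problem size from the slide

def processing_rate(usable_procs: int, cycle_time_s: float) -> float:
    """Total cells processed per second across all usable processors."""
    return usable_procs * CELLS_PER_PROC / cycle_time_s

# Placeholder inputs: 4096 usable processors, 1.0 s per cycle.
print(processing_rate(usable_procs=4096, cycle_time_s=1.0))
```

The metric rewards configurations that give up a few processors per node or per cluster if the resulting drop in cycle time more than compensates for the lost processors.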
29 FINDING #3: We were able to double SAGE's performance by removing noise caused by several types of dæmons, confining dæmons to the cluster manager, and removing the cluster manager and the RMS cluster monitor from each cluster's compute pool.
30 Discussion. The computational granularity of the app determines which type of noise matters. Load-balanced, coarse-grained app (e.g. LINPACK): long noise dominates; short noise becomes coscheduled. Medium-grained app (e.g. SAGE): medium noise dominates. Fine-grained app (e.g. deterministic Sn-transport): short noise dominates; the frequency of long noise is low.
31 FINDING #4: Substantial performance loss occurs when an application resonates with system noise: high-frequency, fine-grained noise affects only fine-grained applications; low-frequency, coarse-grained noise affects only coarse-grained applications.
32 Conclusion. Described a figurative journey to improve the performance of a sizable hydrodynamics app, SAGE, on the world's second-fastest supercomputer, ASCI Q. Methodologies: the first to determine how fast an app could potentially run; developed a methodology to analyze artifacts that degrade app performance yet are not part of the app; doubled the performance of SAGE without modifying a single line of code. Notions: noise and resonance; applicable to other systems and other apps.
33 More discussions. What do they mean by "best observed" in Table 3? The processing rate of regular 4-proc using 7680 proc (120.6) is still lower than that of 3-proc with only 6144 proc. The analytical model is constructed manually (Darren Kerbyson et al., SC01); it is enormously labor intensive.
34 Thanks! Any questions? The Case of the Missing Supercomputer Performance (SC 2003)
More informationClusters of SMP s. Sean Peisert
Clusters of SMP s Sean Peisert What s Being Discussed Today SMP s Cluters of SMP s Programming Models/Languages Relevance to Commodity Computing Relevance to Supercomputing SMP s Symmetric Multiprocessors
More informationBlue Waters I/O Performance
Blue Waters I/O Performance Mark Swan Performance Group Cray Inc. Saint Paul, Minnesota, USA mswan@cray.com Doug Petesch Performance Group Cray Inc. Saint Paul, Minnesota, USA dpetesch@cray.com Abstract
More informationHPX. High Performance ParalleX CCT Tech Talk Series. Hartmut Kaiser
HPX High Performance CCT Tech Talk Hartmut Kaiser (hkaiser@cct.lsu.edu) 2 What s HPX? Exemplar runtime system implementation Targeting conventional architectures (Linux based SMPs and clusters) Currently,
More informationSFS: Random Write Considered Harmful in Solid State Drives
SFS: Random Write Considered Harmful in Solid State Drives Changwoo Min 1, 2, Kangnyeon Kim 1, Hyunjin Cho 2, Sang-Won Lee 1, Young Ik Eom 1 1 Sungkyunkwan University, Korea 2 Samsung Electronics, Korea
More informationMPI On-node and Large Processor Count Scaling Performance. October 10, 2001 Terry Jones Linda Stanberry Lawrence Livermore National Laboratory
MPI On-node and Large Processor Count Scaling Performance October 10, 2001 Terry Jones Linda Stanberry Lawrence Livermore National Laboratory Outline Scope Presentation aimed at scientific/technical app
More informationPHX: Memory Speed HPC I/O with NVM. Pradeep Fernando Sudarsun Kannan, Ada Gavrilovska, Karsten Schwan
PHX: Memory Speed HPC I/O with NVM Pradeep Fernando Sudarsun Kannan, Ada Gavrilovska, Karsten Schwan Node Local Persistent I/O? Node local checkpoint/ restart - Recover from transient failures ( node restart)
More informationWhite Paper. Why Remake Storage For Modern Data Centers
White Paper Why Remake Storage For Modern Data Centers Executive Summary Managing data growth and supporting business demands of provisioning storage have been the top concern of IT operations for the
More informationParallel Architectures
Parallel Architectures Part 1: The rise of parallel machines Intel Core i7 4 CPU cores 2 hardware thread per core (8 cores ) Lab Cluster Intel Xeon 4/10/16/18 CPU cores 2 hardware thread per core (8/20/32/36
More informationReducing Application Runtime Variability on Jaguar XT5
Reducing Application Runtime Variability on Jaguar XT5 Sarp Oral Feiyi Wang David A. Dillow Ross Miller Galen M. Shipman Don Maxwell Oak Ridge National Laboratory Leadership Computing Facility {oralhs,fwang2,dillowda,rgmiller,gshipman,maxwellde}@ornl.gov
More informationAutomatic Identification of Application I/O Signatures from Noisy Server-Side Traces. Yang Liu Raghul Gunasekaran Xiaosong Ma Sudharshan S.
Automatic Identification of Application I/O Signatures from Noisy Server-Side Traces Yang Liu Raghul Gunasekaran Xiaosong Ma Sudharshan S. Vazhkudai Instance of Large-Scale HPC Systems ORNL s TITAN (World
More informationAlleviating Scalability Issues of Checkpointing
Rolf Riesen, Kurt Ferreira, Dilma Da Silva, Pierre Lemarinier, Dorian Arnold, Patrick G. Bridges 13 November 2012 Alleviating Scalability Issues of Checkpointing Protocols Overview 2 3 Motivation: scaling
More informationMultilevel Algorithms for Multi-Constraint Hypergraph Partitioning
Multilevel Algorithms for Multi-Constraint Hypergraph Partitioning George Karypis University of Minnesota, Department of Computer Science / Army HPC Research Center Minneapolis, MN 55455 Technical Report
More informationComposite Metrics for System Throughput in HPC
Composite Metrics for System Throughput in HPC John D. McCalpin, Ph.D. IBM Corporation Austin, TX SuperComputing 2003 Phoenix, AZ November 18, 2003 Overview The HPC Challenge Benchmark was announced last
More informationDynamic Feedback: An Effective Technique for Adaptive Computing
Dynamic Feedback: An Effective Technique for Adaptive Computing Pedro Diniz and Martin Rinard Department of Computer Science Engineering I Building University of California, Santa Barbara Santa Barbara,
More informationNon-Blocking Collectives for MPI
Non-Blocking Collectives for MPI overlap at the highest level Torsten Höfler Open Systems Lab Indiana University Bloomington, IN, USA Institut für Wissenschaftliches Rechnen Technische Universität Dresden
More informationHarp-DAAL for High Performance Big Data Computing
Harp-DAAL for High Performance Big Data Computing Large-scale data analytics is revolutionizing many business and scientific domains. Easy-touse scalable parallel techniques are necessary to process big
More informationBuilding MPI for Multi-Programming Systems using Implicit Information
Building MPI for Multi-Programming Systems using Implicit Information Frederick C. Wong 1, Andrea C. Arpaci-Dusseau 2, and David E. Culler 1 1 Computer Science Division, University of California, Berkeley
More informationOLAP Introduction and Overview
1 CHAPTER 1 OLAP Introduction and Overview What Is OLAP? 1 Data Storage and Access 1 Benefits of OLAP 2 What Is a Cube? 2 Understanding the Cube Structure 3 What Is SAS OLAP Server? 3 About Cube Metadata
More informationAccelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures Mohammadreza Bayatpour, Hari Subramoni, D. K.
Accelerating MPI Message Matching and Reduction Collectives For Multi-/Many-core Architectures Mohammadreza Bayatpour, Hari Subramoni, D. K. Panda Department of Computer Science and Engineering The Ohio
More informationASPERA HIGH-SPEED TRANSFER. Moving the world s data at maximum speed
ASPERA HIGH-SPEED TRANSFER Moving the world s data at maximum speed ASPERA HIGH-SPEED FILE TRANSFER 80 GBIT/S OVER IP USING DPDK Performance, Code, and Architecture Charles Shiflett Developer of next-generation
More informationApplication-Specific System Customization on Many-Core Platforms: The VT-ASOS Framework Position paper
Application-Specific System Customization on Many-Core Platforms: The VT-ASOS Framework Position paper Godmar Back and Dimitrios S. Nikolopoulos Center for High-End Computing Systems Department of Computer
More informationEVALUATING INFINIBAND PERFORMANCE WITH PCI EXPRESS
EVALUATING INFINIBAND PERFORMANCE WITH PCI EXPRESS INFINIBAND HOST CHANNEL ADAPTERS (HCAS) WITH PCI EXPRESS ACHIEVE 2 TO 3 PERCENT LOWER LATENCY FOR SMALL MESSAGES COMPARED WITH HCAS USING 64-BIT, 133-MHZ
More informationScalaIOTrace: Scalable I/O Tracing and Analysis
ScalaIOTrace: Scalable I/O Tracing and Analysis Karthik Vijayakumar 1, Frank Mueller 1, Xiaosong Ma 1,2, Philip C. Roth 2 1 Department of Computer Science, NCSU 2 Computer Science and Mathematics Division,
More informationSome aspects of parallel program design. R. Bader (LRZ) G. Hager (RRZE)
Some aspects of parallel program design R. Bader (LRZ) G. Hager (RRZE) Finding exploitable concurrency Problem analysis 1. Decompose into subproblems perhaps even hierarchy of subproblems that can simultaneously
More informationParallel Computer Architecture Spring Shared Memory Multiprocessors Memory Coherence
Parallel Computer Architecture Spring 2018 Shared Memory Multiprocessors Memory Coherence Nikos Bellas Computer and Communications Engineering Department University of Thessaly Parallel Computer Architecture
More informationAn Analysis of the Amount of Global Level Redundant Computation in the SPEC 95 and SPEC 2000 Benchmarks
An Analysis of the Amount of Global Level Redundant Computation in the SPEC 95 and SPEC 2000 s Joshua J. Yi and David J. Lilja Department of Electrical and Computer Engineering Minnesota Supercomputing
More informationIntroduction. Communication Systems Simulation - I. Monte Carlo method. Simulation methods
Introduction Communication Systems Simulation - I Harri Saarnisaari Part of Simulations and Tools for Telecommunication Course First we study what simulation methods are available Use of the Monte Carlo
More information