ELASTIC: Dynamic Tuning for Large-Scale Parallel Applications
|
|
- Clinton Pierce
- 5 years ago
- Views:
Transcription
1 Workshop on Extreme-Scale Programming Tools 18th November 2013 Supercomputing 2013 ELASTIC: Dynamic Tuning for Large-Scale Parallel Applications Toni Espinosa Andrea Martínez, Anna Sikora, Eduardo César and Joan Sorribes Universitat Autónoma de Barcelona Computer Architecture and Operating Systems Departament
2 Outline 1Motivation. 2Scalable Dynamic Tuning. 3ELASTIC. 4Experimental Evaluation. 5Conclusions and Future Work.
3 Motivation Centralised Architecture of Tuning Tools Analysis Analysis and and and tuning tuning module module Analysis and tuning module Elimination of a single centralised control point. BE daemon BE daemon... BE daemon... BE daemon Task 0 Task 01 Task n-1 1 Task n-1 Distribution of the analysis and tuning process, remaining effective. 3/22
4 Outline 1Motivation. 2Scalable Dynamic Tuning. 3ELASTIC. 4Experimental Evaluation. 5Conclusions and Future Work.
5 Hierarchical Tuning Network Decompose. A base level of analysis and tuning modules (ATM) that controls disjoint domains of application tasks. ATM user Source Code Local performance improvements are achieved. Executable Application Memory Execution time tool Monitoring Analysis Tuning and how to obtain global Monitoring performance orders. improvements? Events. Tuning Analysis orders. and tuning domain 5/22
6 Hierarchical Tuning Network Abstract. The abstraction mechanism is carried out by the ATMs. representing the tasks of the virtual parallel application 6/22
7 Hierarchical Tuning Network Decompose. Abstract. ATM Abstractor The Theactuation virtual of parallel each application ATM of theis network decomposed gives a hierarchical distribution of the analysis and tuning process and then analysed and tuned by ATMs located at the high level. Source Code user Executable Application Memory Execution time tool Monitoring Analysis Tuning 7/22
8 Abstraction Mechanism Parallel Application Task ATM Level i+1 Level i+1 Instrumentation Order Translator Abstractor Event Creator INTERNAL API Event Manager Performance Evaluator Instrumentation Order Sender Event to the parent level Instrumentation orders from the parent level Parallel Application Task ATM Level i Level i Instrumentation Order Translator Abstractor Event Creator INTERNAL API Event Manager Performance Evaluator Instrumentation Order Sender Events from the immediate children Instrumentation orders to the immediate children 8/22
9 Knowledge in the Tuning Network Performance Performance Model Model Monitoring Monitoring Points Points Performance Performance Expressions Expressions Tuning Points, Tuning Actions Points, and Actions Synchronisation and Synchronisation Method Method Abstraction Abstraction Model Model How to translate How to a translate monitoring a monitoring order order How to translate How to a translate tuning order a tuning order How to create How a to new create event a new event How to decompose How to decompose the real or the virtual real parallel or virtual application parallel application 9/22
10 Outline 1Motivation. 2Scalable Dynamic Tuning. 3ELASTIC. 4Experimental Evaluation. 5Conclusions and Future Work.
11 ELASTIC Prototype implementation in C++. For MPI parallel applications. Target systems: UNIX based supercomputers. user Source Code Executable Application Memory Execution time Dynamic ELASTIC instrumentation Event tracing Monitoring Analysis Tuning Performance and abstraction models Dynamic instrumentation 11/22
12 ELASTIC: Architecture ELASTIC Front-End Abstractor-ATM ELASTIC Front-End Abstractor-ATM pair ELASTIC Back-Ends TMLib Abstractor-ATM Abstractor-ATM Abstractor-ATM The tuning network topology can be configured to accommodate the size of the parallel application and the complexity of the tuning strategy being employed. Abstractor-ATM Abstractor-ATM... Abstractor-ATM... Abstractor-ATM ELASTIC Back-End... ELASTIC Back-End ELASTIC Back-End... ELASTIC Back-End ELASTIC Back-End... ELASTIC Back-End TMLib Application Task TMLib Application Task TMLib Application Task TMLib Application Task TMLib Application Task TMLib Application Task 12/22
13 ELASTIC Package Set of code and configurations that implements the performance and abstraction model. Performance Model Monitoring Points Performance Expressions Tuning Points, Actions and Synchronisation Method Abstraction Model How to translate a monitoring order vector<monitoring Order> PerformanceEvaluator::InitialMonitoringOrders() bool PerformanceEvaluator::NewEvent(Event *e) vector<order> PerformanceEvaluator::EvaluatePerformance() vector<monitoring Order> InstrumentationOrderTranslator:: TranslateMonitoringOrder(MonitoringOrder *mo) vector<tuning Order> InstrumentationOrderTranslator:: TranslateTuningOrder(TuningOrder *to) How to translate a tuning order How to create a new event How to decompose the real or virtual parallel application bool EventCreator::NewEvent(Event *e) vector<event> EventCreator::CreateEvent() 13/22
14 ELASTIC Package Codification of the ELASTIC Package based on subclassing Abstractor-ATM components. This plugin architecture converts ELASTIC into a general purpose tuning tool and gives it the flexibility to tackle a wide range of performance problems 14/22
15 Outline 1Motivation. 2Scalable Dynamic Tuning. 3ELASTIC. 4Experimental Evaluation. 5Conclusions and Future Work.
16 Experimental Evaluation The evaluation consists of Executing a parallel application which presents a specific performance problem and using ELASTIC to dynamically detect and resolve the problem. Synthetic Parallel Application Real Agent-based Parallel Application SPMD Load balancing problem Execution Environment: Supercomputer SuperMUC at LRZ compute nodes ( cores). Each node has 2 8-core 2.7 GHz Intel Xeon processors and 32GB main memory. SuSe Linux. 16/22
17 Experimental Evaluation Results Execution time (s) % Original application Application tuned by ELASTIC 50% 46.8% 40.6% 29.4% 25.8% Number of tasks 17/22
18 Experimental Evaluation Results 300 Original application Application tuned by ELASTIC % Execution time (s) % 28.3% 30.3% Number of tasks 18/22
19 Outline 1Motivation. 2Scalable Dynamic Tuning. 3ELASTIC. 4Experimental Evaluation. 5Conclusions and Future Work.
20 Conclusions The distribution of the dynamic tuning process through a hierarchical tuning network of analysis modules has been defined. ELASTIC, a tool that implements the proposed design, has been developed. It offers dynamic tuning through dynamic monitoring, automatic performance analysis and dynamic modifications. It presents an adaptable topology and a plugin architecture. The encouraging results obtained from the experimental evaluation using ELASTIC show that our approach is effective for large-scale dynamic tuning. 20/22
21 Future Work Creation of general ELASTIC Packages which solve a given performance problem. It would be required a small adaptation to applied them to specific parallel applications. Combine our approach with the one implemented under the AutoTune project. 21/22
22 Workshop on Extreme-Scale Programming Tools Supercomputing 2013 ELASTIC: Dynamic Tuning for Large-Scale Parallel Applications Thank you Toni Espinosa, Andrea Martinez, Anna Sikora, Eduardo César and Joan Sorribes Universitat Autónoma de Barcelona Computer Architecture and Operating Systems Departament
23 Tuning Network Topology Parallel Application of N tasks The structure of the topology will depend on the number of levels in the hierarchy and the number of Abstractor-ATM pairs in each level The use of tuning networks composed of the minimum number of non-saturated Abstractor-ATM pairs. The maximum domain size that an Abstractor-ATM pair can manage without becoming saturated. 23/66
24 Modelling an analysis and tuning process N *T m * E a E T a a N ( ) T t * f rp E c *T c N *T m N *T m N *T m Event Batch 1 N 1 N 1 N Event Batch N *T m N *T m N *T m T a N 1 N 1 N 1 N N *T m N *T m N *T m T a N ( ) ( ) Event Batch 1 N 1 N 1 N N *T m N *T m N *T m T a N ( ) Event Batch 1 N 1 N 1 N 24/66
25 Calculating the number of tasks that an analysis module can manage without becoming saturated f analysis = f e E a Period analysis Period analysis N *T m * E a Parallel Application of 256 tasks + E T a a ( N) *T c T t * f rp E c 1 T f analisis a f T m E a a f T c E e c f rp T 64 t /66
26 Experimental Evaluation Application and performance problem Logical layout: 2D grid. Iteration pattern: o Computation. o Communication. work units Computation Communication fixed size Load imbalance: hotspots of additional workload were introduced into the application at runtime. Scenario 1 10% Scenario 2 21% 1024 Tasks (32x32 grid) - After 1st Injection Tasks (32x32 grid) - After 2nd Injection Tasks (32x32 grid) - After 3rd Injection Work Units Work Units /66 0
27 Per task Experimental Evaluation ELASTIC Package to balance the work units Performance Model Abstraction Model Performance Expressions Measurement Points Constraint: the work units only can be moved between neighbouring tasks Input: work units. work units iteration id task id Place work() Output: work units to send. Tuning Points, Action and Synchronisation Points: [send_north, send_south, send_east, send_west] Action: set the value of these variables. Synchronisation: at the beginning of the migration phase. Migration function 27/66
28 Experimental Evaluation ELASTIC Package to balance the work units Performance Model Abstraction Model Decomposition Scheme Real/Virtual Application Constraint: communication pattern between tasks Monitoring Order Translation & Event Creation Tuning Network Monitoring Order Tuning Order Translation Tuning Network Events [send_south, 60] [send_south, 20] [send_south, 20] [send_south, 20] Tuning Order work units 28/66
29 Experimental Evaluation Experimentation Plan For the two scenarios Maximum domain size = iterations. 20 work units per task. The additional load is proportional to the size of the parallel application. 29/66
30 Experimental Evaluation Results Load State 4096 tasks parallel application (64x64 grid) Original application Virtual application 30/66
31 Experimental Evaluation Results Execution time (s) Original application Application tuned by ELASTIC 65.1% 63.1% 54.6% 50.1% 33.1% 22.9% Number of tasks 31/66
32 Experimental Evaluation Results Iteration Time Iteration time (ms) Iteration time (ms) 256 Tasks Synthetic Application 1024 Tasks Synthetic Application 2304 Tasks Synthetic Application 256 Task Synthetic Application Original Application Original Application 2000 Application tuned by ELASTIC 2000 Application tuned by ELASTIC Ideal balanced time Ideal balanced time Iteration Iteration Iteration time (ms) Iteration time (ms) 1024 Task Synthetic Application Original Application 2000 Application tuned by ELASTIC Ideal balanced time Iteration 4096 Tasks Synthetic Application 9216 Tasks Synthetic Application Tasks Synthetic Application Task Synthetic Application Original Application 2000 Application tuned by ELASTIC Ideal balanced time Iteration time (ms) 256 Tasks Synthetic Application 256 Task Synthetic Application 9216 Task Synthetic Application Original Application 2000 Application tuned by ELASTIC Ideal balanced time Original Application Iteration time (ms) Application tuned by ELASTIC Ideal balanced time Iteration Task Synthetic Application Iteration Iteration Iteration Iteration time (ms) Task Synthetic Application Original Application 2000 Application tuned by ELASTIC Ideal balanced time /66
33 Experimental Evaluation Results Execution time (s) % Original application Application tuned by ELASTIC 50% 46.8% 40.6% 29.4% 25.8% Number of tasks 33/66
34 Experimental Evaluation Synthetic Parallel Application Real Agent-based Parallel Application Application and performance problem. ELASTIC Package developed. Experimentation plan. Results. 34/66
35 Task 2 Task n-1 Task 0 Task 1 Experimental Evaluation Application and performance problem Large-scale agent-based simulation. Simulates an epidemic model. A AA A A A A A Communication A A A A Communication pattern: any-to-any A A A A A A A Load imbalance problem due to the dynamic behaviour of the agents: o Births and death. o Time required to process an agent is not uniform. 35/66
36 Per task Experimental Evaluation ELASTIC Package to balance the computation time Performance Model Abstraction Model Performance Expressions Measurement Points Marquez et al Migrations can be between any two tasks in the analysis and tuning domain time # agent iteration id Place phase_4() Input: #agents and computation time. Output: #agents to send to each task. task id Tuning Points, Action and Synchronisation Points: [intradomain_migrate[], interdomain_migrate[]] Action: set the value of these variables. Synchronisation: at the beginning of the migration phase. Migration functions. 36/66
37 Experimental Evaluation ELASTIC Package to balance the computation time Performance Model Abstraction Model Decomposition Scheme Real/Virtual Application Constraint: domains with the same size Monitoring Order Translation & Event Creation Tuning Network Monitoring Order Tuning Order Translation Tuning Network [intradomain, 80] [interdomain, 20] [interdomain, 20] Tuning Order #agents Events [interdomain, 20] [interdomain, 20] time 37/66
38 Experimental Evaluation Experimentation Plan Scale the number of agents. Scale the simulation space. Domain size = /66
39 256 tasks Experimental Evaluation Computation times agents on 256 tasks Agent-based application alone Agent-based application with ELASTIC Ideal balanced time Results Computation times agents on 512 tasks 4500 Agent-based application alone Agent-based application with ELASTIC 4000 Ideal balanced time 512 tasks Computation time (ms) Computation time (ms) Iteration Iteration 4500 Agent-based application alone 4500 Agent-based application alone 1024 tasks Agent-based application with ELASTIC Agent-based application with ELASTIC 2048 tasks 4000 Ideal balanced time 4000 Ideal balanced time Computation time (ms) Computation times agents on 1024 tasks Computation times agents on 2048 tasks Computation time (ms) Iteration Iteration 39/66
40 Experimental Evaluation Results 300 Original application Application tuned by ELASTIC % Execution time (s) % 28.3% 30.3% Number of tasks 40/66
AutoTune Workshop. Michael Gerndt Technische Universität München
AutoTune Workshop Michael Gerndt Technische Universität München AutoTune Project Automatic Online Tuning of HPC Applications High PERFORMANCE Computing HPC application developers Compute centers: Energy
More informationCode Auto-Tuning with the Periscope Tuning Framework
Code Auto-Tuning with the Periscope Tuning Framework Renato Miceli, SENAI CIMATEC renato.miceli@fieb.org.br Isaías A. Comprés, TUM compresu@in.tum.de Project Participants Michael Gerndt, TUM Coordinator
More informationDynamic performance tuning supported by program specification
Scientific Programming 10 (2002) 35 44 35 IOS Press Dynamic performance tuning supported by program specification Eduardo César a, Anna Morajko a,tomàs Margalef a, Joan Sorribes a, Antonio Espinosa b and
More informationIntel VTune Amplifier XE
Intel VTune Amplifier XE Vladimir Tsymbal Performance, Analysis and Threading Lab 1 Agenda Intel VTune Amplifier XE Overview Features Data collectors Analysis types Key Concepts Collecting performance
More informationACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS
ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS Ferdinando Alessi Annalisa Massini Roberto Basili INGV Introduction The simulation of wave propagation
More informationPeta-Scale Simulations with the HPC Software Framework walberla:
Peta-Scale Simulations with the HPC Software Framework walberla: Massively Parallel AMR for the Lattice Boltzmann Method SIAM PP 2016, Paris April 15, 2016 Florian Schornbaum, Christian Godenschwager,
More informationAgenda. Optimization Notice Copyright 2017, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.
Agenda VTune Amplifier XE OpenMP* Analysis: answering on customers questions about performance in the same language a program was written in Concepts, metrics and technology inside VTune Amplifier XE OpenMP
More informationHow to overcome common performance problems in legacy climate models
How to overcome common performance problems in legacy climate models Jörg Behrens 1 Contributing: Moritz Handke 1, Ha Ho 2, Thomas Jahns 1, Mathias Pütz 3 1 Deutsches Klimarechenzentrum (DKRZ) 2 Helmholtz-Zentrum
More informationDynamic Load Partitioning Strategies for Managing Data of Space and Time Heterogeneity in Parallel SAMR Applications
Dynamic Load Partitioning Strategies for Managing Data of Space and Time Heterogeneity in Parallel SAMR Applications Xiaolin Li and Manish Parashar The Applied Software Systems Laboratory Department of
More informationImplementation of an integrated efficient parallel multiblock Flow solver
Implementation of an integrated efficient parallel multiblock Flow solver Thomas Bönisch, Panagiotis Adamidis and Roland Rühle adamidis@hlrs.de Outline Introduction to URANUS Why using Multiblock meshes
More informationCross-Layer Memory Management to Reduce DRAM Power Consumption
Cross-Layer Memory Management to Reduce DRAM Power Consumption Michael Jantz Assistant Professor University of Tennessee, Knoxville 1 Introduction Assistant Professor at UT since August 2014 Before UT
More informationVIProf: A Vertically Integrated Full-System Profiler
VIProf: A Vertically Integrated Full-System Profiler NGS Workshop, April 2007 Hussam Mousa Chandra Krintz Lamia Youseff Rich Wolski RACELab Research Dynamic software adaptation As program behavior or resource
More informationAutomatic Tuning of HPC Applications with Periscope. Michael Gerndt, Michael Firbach, Isaias Compres Technische Universität München
Automatic Tuning of HPC Applications with Periscope Michael Gerndt, Michael Firbach, Isaias Compres Technische Universität München Agenda 15:00 15:30 Introduction to the Periscope Tuning Framework (PTF)
More informationCenter Extreme Scale CS Research
Center Extreme Scale CS Research Center for Compressible Multiphase Turbulence University of Florida Sanjay Ranka Herman Lam Outline 10 6 10 7 10 8 10 9 cores Parallelization and UQ of Rocfun and CMT-Nek
More informationMassively Parallel Phase Field Simulations using HPC Framework walberla
Massively Parallel Phase Field Simulations using HPC Framework walberla SIAM CSE 2015, March 15 th 2015 Martin Bauer, Florian Schornbaum, Christian Godenschwager, Johannes Hötzer, Harald Köstler and Ulrich
More informationEnergy Efficiency Tuning: READEX. Madhura Kumaraswamy Technische Universität München
Energy Efficiency Tuning: READEX Madhura Kumaraswamy Technische Universität München Project Overview READEX Starting date: 1. September 2015 Duration: 3 years Runtime Exploitation of Application Dynamism
More informationIHK/McKernel: A Lightweight Multi-kernel Operating System for Extreme-Scale Supercomputing
: A Lightweight Multi-kernel Operating System for Extreme-Scale Supercomputing Balazs Gerofi Exascale System Software Team, RIKEN Center for Computational Science 218/Nov/15 SC 18 Intel Extreme Computing
More informationTopology and affinity aware hierarchical and distributed load-balancing in Charm++
Topology and affinity aware hierarchical and distributed load-balancing in Charm++ Emmanuel Jeannot, Guillaume Mercier, François Tessier Inria - IPB - LaBRI - University of Bordeaux - Argonne National
More informationBlueGene/L. Computer Science, University of Warwick. Source: IBM
BlueGene/L Source: IBM 1 BlueGene/L networking BlueGene system employs various network types. Central is the torus interconnection network: 3D torus with wrap-around. Each node connects to six neighbours
More informationTuning Alya with READEX for Energy-Efficiency
Tuning Alya with READEX for Energy-Efficiency Venkatesh Kannan 1, Ricard Borrell 2, Myles Doyle 1, Guillaume Houzeaux 2 1 Irish Centre for High-End Computing (ICHEC) 2 Barcelona Supercomputing Centre (BSC)
More informationMemory Footprint of Locality Information On Many-Core Platforms Brice Goglin Inria Bordeaux Sud-Ouest France 2018/05/25
ROME Workshop @ IPDPS Vancouver Memory Footprint of Locality Information On Many- Platforms Brice Goglin Inria Bordeaux Sud-Ouest France 2018/05/25 Locality Matters to HPC Applications Locality Matters
More informationEnzo-P / Cello. Scalable Adaptive Mesh Refinement for Astrophysics and Cosmology. San Diego Supercomputer Center. Department of Physics and Astronomy
Enzo-P / Cello Scalable Adaptive Mesh Refinement for Astrophysics and Cosmology James Bordner 1 Michael L. Norman 1 Brian O Shea 2 1 University of California, San Diego San Diego Supercomputer Center 2
More informationPerformance Analysis with Vampir
Performance Analysis with Vampir Johannes Ziegenbalg Technische Universität Dresden Outline Part I: Welcome to the Vampir Tool Suite Event Trace Visualization The Vampir Displays Vampir & VampirServer
More informationOASIS: Self-tuning Storage for Applications
OASIS: Self-tuning Storage for Applications Kostas Magoutis, Prasenjit Sarkar, Gauri Shah 14 th NASA Goddard- 23 rd IEEE Mass Storage Systems Technologies, College Park, MD, May 17, 2006 Outline Motivation
More informationOracle Financial Consolidation and Close Cloud. What s New in the November Update (16.11)
Oracle Financial Consolidation and Close Cloud What s New in the November Update (16.11) November 2016 TABLE OF CONTENTS REVISION HISTORY... 3 ORACLE FINANCIAL CONSOLIDATION AND CLOSE CLOUD, NOVEMBER UPDATE...
More informationDistributed Memory Parallel Markov Random Fields Using Graph Partitioning
Distributed Memory Parallel Markov Random Fields Using Graph Partitioning Colleen Heinemann, Talita Perciano, Daniela Ushizima, Wes Bethel December 11, 2017 Overview What is MRF-based image segmentation?
More informationCommunication and Topology-aware Load Balancing in Charm++ with TreeMatch
Communication and Topology-aware Load Balancing in Charm++ with TreeMatch Joint lab 10th workshop (IEEE Cluster 2013, Indianapolis, IN) Emmanuel Jeannot Esteban Meneses-Rojas Guillaume Mercier François
More informationDavid R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms.
Whitepaper Introduction A Library Based Approach to Threading for Performance David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms.
More informationICON for HD(CP) 2. High Definition Clouds and Precipitation for Advancing Climate Prediction
ICON for HD(CP) 2 High Definition Clouds and Precipitation for Advancing Climate Prediction High Definition Clouds and Precipitation for Advancing Climate Prediction ICON 2 years ago Parameterize shallow
More informationOutline 1 Motivation 2 Theory of a non-blocking benchmark 3 The benchmark and results 4 Future work
Using Non-blocking Operations in HPC to Reduce Execution Times David Buettner, Julian Kunkel, Thomas Ludwig Euro PVM/MPI September 8th, 2009 Outline 1 Motivation 2 Theory of a non-blocking benchmark 3
More informationPerformance Optimization of a Massively Parallel Phase-Field Method Using the HPC Framework walberla
Performance Optimization of a Massively Parallel Phase-Field Method Using the HPC Framework walberla SIAM PP 2016, April 13 th 2016 Martin Bauer, Florian Schornbaum, Christian Godenschwager, Johannes Hötzer,
More informationOn the scalability of tracing mechanisms 1
On the scalability of tracing mechanisms 1 Felix Freitag, Jordi Caubet, Jesus Labarta Departament d Arquitectura de Computadors (DAC) European Center for Parallelism of Barcelona (CEPBA) Universitat Politècnica
More informationQuantifying Load Imbalance on Virtualized Enterprise Servers
Quantifying Load Imbalance on Virtualized Enterprise Servers Emmanuel Arzuaga and David Kaeli Department of Electrical and Computer Engineering Northeastern University Boston MA 1 Traditional Data Centers
More informationREADEX Runtime Exploitation of Application Dynamism for Energyefficient
READEX Runtime Exploitation of Application Dynamism for Energyefficient exascale computing EnA-HPC @ ISC 17 Robert Schöne TUD Project Motivation Applications exhibit dynamic behaviour Changing resource
More informationJackson Marusarz Software Technical Consulting Engineer
Jackson Marusarz Software Technical Consulting Engineer What Will Be Covered Overview Memory/Thread analysis New Features Deep dive into debugger integrations Demo Call to action 2 Analysis Tools for Diagnosis
More informationIntroduction to Jackknife Algorithm
Polytechnic School of the University of São Paulo Department of Computing Engeneering and Digital Systems Laboratory of Agricultural Automation Introduction to Jackknife Algorithm Renato De Giovanni Fabrício
More informationA Scalable Adaptive Mesh Refinement Framework For Parallel Astrophysics Applications
A Scalable Adaptive Mesh Refinement Framework For Parallel Astrophysics Applications James Bordner, Michael L. Norman San Diego Supercomputer Center University of California, San Diego 15th SIAM Conference
More informationImproving Linear Algebra Computation on NUMA platforms through auto-tuned tuned nested parallelism
Improving Linear Algebra Computation on NUMA platforms through auto-tuned tuned nested parallelism Javier Cuenca, Luis P. García, Domingo Giménez Parallel Computing Group University of Murcia, SPAIN parallelum
More informationsimulation framework for piecewise regular grids
WALBERLA, an ultra-scalable multiphysics simulation framework for piecewise regular grids ParCo 2015, Edinburgh September 3rd, 2015 Christian Godenschwager, Florian Schornbaum, Martin Bauer, Harald Köstler
More informationIntel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins
Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Outline History & Motivation Architecture Core architecture Network Topology Memory hierarchy Brief comparison to GPU & Tilera Programming Applications
More informationOptimising Multicore JVMs. Khaled Alnowaiser
Optimising Multicore JVMs Khaled Alnowaiser Outline JVM structure and overhead analysis Multithreaded JVM services JVM on multicore An observational study Potential JVM optimisations Basic JVM Services
More informationImage-Space-Parallel Direct Volume Rendering on a Cluster of PCs
Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs B. Barla Cambazoglu and Cevdet Aykanat Bilkent University, Department of Computer Engineering, 06800, Ankara, Turkey {berkant,aykanat}@cs.bilkent.edu.tr
More informationIOS: A Middleware for Decentralized Distributed Computing
IOS: A Middleware for Decentralized Distributed Computing Boleslaw Szymanski Kaoutar El Maghraoui, Carlos Varela Department of Computer Science Rensselaer Polytechnic Institute http://www.cs.rpi.edu/wwc
More informationMigration of tools and methodologies for performance prediction and efficient HPC on cloud environments: Results and conclusion *
Migration of tools and methodologies for performance prediction and efficient HPC on cloud environments: Results and conclusion * Ronal Muresano, Alvaro Wong, Dolores Rexachs and Emilio Luque Computer
More informationMulti-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation
Multi-GPU Scaling of Direct Sparse Linear System Solver for Finite-Difference Frequency-Domain Photonic Simulation 1 Cheng-Han Du* I-Hsin Chung** Weichung Wang* * I n s t i t u t e o f A p p l i e d M
More informationE-Store: Fine-Grained Elastic Partitioning for Distributed Transaction Processing Systems
E-Store: Fine-Grained Elastic Partitioning for Distributed Transaction Processing Systems Rebecca Taft, Essam Mansour, Marco Serafini, Jennie Duggan, Aaron J. Elmore, Ashraf Aboulnaga, Andrew Pavlo, Michael
More informationFrom Notebooks to Supercomputers: Tap the Full Potential of Your CUDA Resources with LibGeoDecomp
From Notebooks to Supercomputers: Tap the Full Potential of Your CUDA Resources with andreas.schaefer@cs.fau.de Friedrich-Alexander-Universität Erlangen-Nürnberg GPU Technology Conference 2013, San José,
More informationAn Oracle White Paper September Oracle Utilities Meter Data Management Demonstrates Extreme Performance on Oracle Exadata/Exalogic
An Oracle White Paper September 2011 Oracle Utilities Meter Data Management 2.0.1 Demonstrates Extreme Performance on Oracle Exadata/Exalogic Introduction New utilities technologies are bringing with them
More informationSHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008
SHARCNET Workshop on Parallel Computing Hugh Merz Laurentian University May 2008 What is Parallel Computing? A computational method that utilizes multiple processing elements to solve a problem in tandem
More informationStorage Hierarchy Management for Scientific Computing
Storage Hierarchy Management for Scientific Computing by Ethan Leo Miller Sc. B. (Brown University) 1987 M.S. (University of California at Berkeley) 1990 A dissertation submitted in partial satisfaction
More informationBlack-box and Gray-box Strategies for Virtual Machine Migration
Full Review On the paper Black-box and Gray-box Strategies for Virtual Machine Migration (Time required: 7 hours) By Nikhil Ramteke Sr. No. - 07125 1. Introduction Migration is transparent to application
More informationAnna Morajko.
Performance analysis and tuning of parallel/distributed applications Anna Morajko Anna.Morajko@uab.es 26 05 2008 Introduction Main research projects Develop techniques and tools for application performance
More informationEfficiency Evaluation of the Input/Output System on Computer Clusters
Efficiency Evaluation of the Input/Output System on Computer Clusters Sandra Méndez, Dolores Rexachs and Emilio Luque Computer Architecture and Operating System Department (CAOS) Universitat Autònoma de
More informationScalasca support for Intel Xeon Phi. Brian Wylie & Wolfgang Frings Jülich Supercomputing Centre Forschungszentrum Jülich, Germany
Scalasca support for Intel Xeon Phi Brian Wylie & Wolfgang Frings Jülich Supercomputing Centre Forschungszentrum Jülich, Germany Overview Scalasca performance analysis toolset support for MPI & OpenMP
More informationSimulation of Cloud Computing Environments with CloudSim
Simulation of Cloud Computing Environments with CloudSim Print ISSN: 1312-2622; Online ISSN: 2367-5357 DOI: 10.1515/itc-2016-0001 Key Words: Cloud computing; datacenter; simulation; resource management.
More informationThe Development of Scalable Traffic Simulation Based on Java Technology
The Development of Scalable Traffic Simulation Based on Java Technology Narinnat Suksawat, Yuen Poovarawan, Somchai Numprasertchai The Department of Computer Engineering Faculty of Engineering, Kasetsart
More informationPLB-HeC: A Profile-based Load-Balancing Algorithm for Heterogeneous CPU-GPU Clusters
PLB-HeC: A Profile-based Load-Balancing Algorithm for Heterogeneous CPU-GPU Clusters IEEE CLUSTER 2015 Chicago, IL, USA Luis Sant Ana 1, Daniel Cordeiro 2, Raphael Camargo 1 1 Federal University of ABC,
More informationAdvanced Job Launching. mapping applications to hardware
Advanced Job Launching mapping applications to hardware A Quick Recap - Glossary of terms Hardware This terminology is used to cover hardware from multiple vendors Socket The hardware you can touch and
More informationCoriolis: Scalable VM Clustering in Clouds
1 / 21 Coriolis: Scalable VM Clustering in Clouds Daniel Campello 1 Carlos Crespo 1 Akshat Verma 2 RajuRangaswami 1 Praveen Jayachandran 2 1 School of Computing and Information Sciences
More informationSUSE. High Performance Computing. Eduardo Diaz. Alberto Esteban. PreSales SUSE Linux Enterprise
SUSE High Performance Computing Eduardo Diaz PreSales SUSE Linux Enterprise ediaz@suse.com Alberto Esteban Territory Manager North-East SUSE Linux Enterprise aesteban@suse.com HPC Overview SUSE High Performance
More informationMulti-Threaded UPC Runtime for GPU to GPU communication over InfiniBand
Multi-Threaded UPC Runtime for GPU to GPU communication over InfiniBand Miao Luo, Hao Wang, & D. K. Panda Network- Based Compu2ng Laboratory Department of Computer Science and Engineering The Ohio State
More informationMemory & Thread Debugger
Memory & Thread Debugger Here is What Will Be Covered Overview Memory/Thread analysis New Features Deep dive into debugger integrations Demo Call to action Intel Confidential 2 Analysis Tools for Diagnosis
More informationJULEA: A Flexible Storage Framework for HPC
JULEA: A Flexible Storage Framework for HPC Workshop on Performance and Scalability of Storage Systems Michael Kuhn Research Group Scientific Computing Department of Informatics Universität Hamburg 2017-06-22
More informationFast Methods with Sieve
Fast Methods with Sieve Matthew G Knepley Mathematics and Computer Science Division Argonne National Laboratory August 12, 2008 Workshop on Scientific Computing Simula Research, Oslo, Norway M. Knepley
More informationEfficient Processing of Multiple Contacts in MPP-DYNA
Efficient Processing of Multiple Contacts in MPP-DYNA Abstract Brian Wainscott Livermore Software Technology Corporation 7374 Las Positas Road, Livermore, CA, 94551 USA Complex models often contain more
More informationDynamic Tuning of Parallel Programs
Dynamic Tuning of Parallel Programs A. Morajko, A. Espinosa, T. Margalef, E. Luque Dept. Informática, Unidad de Arquitectura y Sistemas Operativos Universitat Autonoma de Barcelona 08193 Bellaterra, Barcelona
More information1 of 8 14/12/2013 11:51 Tuning long-running processes Contents 1. Reduce the database size 2. Balancing the hardware resources 3. Specifying initial DB2 database settings 4. Specifying initial Oracle database
More informationREADEX: A Tool Suite for Dynamic Energy Tuning. Michael Gerndt Technische Universität München
READEX: A Tool Suite for Dynamic Energy Tuning Michael Gerndt Technische Universität München Campus Garching 2 SuperMUC: 3 Petaflops, 3 MW 3 READEX Runtime Exploitation of Application Dynamism for Energy-efficient
More informationQoS-aware resource allocation and load-balancing in enterprise Grids using online simulation
QoS-aware resource allocation and load-balancing in enterprise Grids using online simulation * Universität Karlsruhe (TH) Technical University of Catalonia (UPC) Barcelona Supercomputing Center (BSC) Samuel
More informationClearSpeed Visual Profiler
ClearSpeed Visual Profiler Copyright 2007 ClearSpeed Technology plc. All rights reserved. 12 November 2007 www.clearspeed.com 1 Profiling Application Code Why use a profiler? Program analysis tools are
More informationExploiting Task-Parallelism on GPU Clusters via OmpSs and rcuda Virtualization
Exploiting Task-Parallelism on Clusters via Adrián Castelló, Rafael Mayo, Judit Planas, Enrique S. Quintana-Ortí RePara 2015, August Helsinki, Finland Exploiting Task-Parallelism on Clusters via Power/energy/utilization
More informationSCALING A DISTRIBUTED SPATIAL CACHE OVERLAY. Alexander Gessler Simon Hanna Ashley Marie Smith
SCALING A DISTRIBUTED SPATIAL CACHE OVERLAY Alexander Gessler Simon Hanna Ashley Marie Smith MOTIVATION Location-based services utilize time and geographic behavior of user geotagging photos recommendations
More informationORAP Forum October 10, 2013
Towards Petaflop simulations of core collapse supernovae ORAP Forum October 10, 2013 Andreas Marek 1 together with Markus Rampp 1, Florian Hanke 2, and Thomas Janka 2 1 Rechenzentrum der Max-Planck-Gesellschaft
More informationPerformance Tools for Technical Computing
Christian Terboven terboven@rz.rwth-aachen.de Center for Computing and Communication RWTH Aachen University Intel Software Conference 2010 April 13th, Barcelona, Spain Agenda o Motivation and Methodology
More informationOverview of Tianhe-2
Overview of Tianhe-2 (MilkyWay-2) Supercomputer Yutong Lu School of Computer Science, National University of Defense Technology; State Key Laboratory of High Performance Computing, China ytlu@nudt.edu.cn
More informationUpdate of Post-K Development Yutaka Ishikawa RIKEN AICS
Update of Post-K Development Yutaka Ishikawa RIKEN AICS 11:20AM 11:40AM, 2 nd of November, 2017 FLAGSHIP2020 Project Missions Building the Japanese national flagship supercomputer, post K, and Developing
More informationLoad Balancing for Parallel Multi-core Machines with Non-Uniform Communication Costs
Load Balancing for Parallel Multi-core Machines with Non-Uniform Communication Costs Laércio Lima Pilla llpilla@inf.ufrgs.br LIG Laboratory INRIA Grenoble University Grenoble, France Institute of Informatics
More informationCluster Network Products
Cluster Network Products Cluster interconnects include, among others: Gigabit Ethernet Myrinet Quadrics InfiniBand 1 Interconnects in Top500 list 11/2009 2 Interconnects in Top500 list 11/2008 3 Cluster
More informationA Global Operating System for HPC Clusters
A Global Operating System Emiliano Betti 1 Marco Cesati 1 Roberto Gioiosa 2 Francesco Piermaria 1 1 System Programming Research Group, University of Rome Tor Vergata 2 BlueGene Software Division, IBM TJ
More informationTools for Intel Xeon Phi: VTune & Advisor Dr. Fabio Baruffa - LRZ,
Tools for Intel Xeon Phi: VTune & Advisor Dr. Fabio Baruffa - fabio.baruffa@lrz.de LRZ, 27.6.- 29.6.2016 Architecture Overview Intel Xeon Processor Intel Xeon Phi Coprocessor, 1st generation Intel Xeon
More informationSlurm BOF SC13 Bull s Slurm roadmap
Slurm BOF SC13 Bull s Slurm roadmap SC13 Eric Monchalin Head of Extreme Computing R&D 1 Bullx BM values Bullx BM bullx MPI integration ( runtime) Automatic Placement coherency Scalable launching through
More informationParallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)
Parallel Computing 2012 Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Algorithm Design Outline Computational Model Design Methodology Partitioning Communication
More informationPorting and Optimisation of UM on ARCHER. Karthee Sivalingam, NCAS-CMS. HPC Workshop ECMWF JWCRP
Porting and Optimisation of UM on ARCHER Karthee Sivalingam, NCAS-CMS HPC Workshop ECMWF JWCRP Acknowledgements! NCAS-CMS Bryan Lawrence Jeffrey Cole Rosalyn Hatcher Andrew Heaps David Hassell Grenville
More informationOracle Financial Consolidation and Close Cloud. What s New in the December Update (16.12)
Oracle Financial Consolidation and Close Cloud What s New in the December Update (16.12) December 2016 TABLE OF CONTENTS REVISION HISTORY... 3 ORACLE FINANCIAL CONSOLIDATION AND CLOSE CLOUD, DECEMBER UPDATE...
More informationSDC Design patterns GoF
SDC Design patterns GoF Design Patterns The design pattern concept can be viewed as an abstraction of imitating useful parts of other software products. The design pattern is a description of communicating
More informationThe 7 deadly sins of cloud computing [2] Cloud-scale resource management [1]
The 7 deadly sins of [2] Cloud-scale resource management [1] University of California, Santa Cruz May 20, 2013 1 / 14 Deadly sins of of sin (n.) - common simplification or shortcut employed by ers; may
More informationLecture 26: Interconnects. James C. Hoe Department of ECE Carnegie Mellon University
18 447 Lecture 26: Interconnects James C. Hoe Department of ECE Carnegie Mellon University 18 447 S18 L26 S1, James C. Hoe, CMU/ECE/CALCM, 2018 Housekeeping Your goal today get an overview of parallel
More informationPARALLEL AND DISTRIBUTED PLATFORM FOR PLUG-AND-PLAY AGENT-BASED SIMULATIONS. Wentong CAI
PARALLEL AND DISTRIBUTED PLATFORM FOR PLUG-AND-PLAY AGENT-BASED SIMULATIONS Wentong CAI Parallel & Distributed Computing Centre School of Computer Engineering Nanyang Technological University Singapore
More informationPerformance Analysis with Periscope
Performance Analysis with Periscope M. Gerndt, V. Petkov, Y. Oleynik, S. Benedict Technische Universität München periscope@lrr.in.tum.de October 2010 Outline Motivation Periscope overview Periscope performance
More informationScalable Critical Path Analysis for Hybrid MPI-CUDA Applications
Center for Information Services and High Performance Computing (ZIH) Scalable Critical Path Analysis for Hybrid MPI-CUDA Applications The Fourth International Workshop on Accelerators and Hybrid Exascale
More informationInteractive Analysis of Large Distributed Systems with Scalable Topology-based Visualization
Interactive Analysis of Large Distributed Systems with Scalable Topology-based Visualization Lucas M. Schnorr, Arnaud Legrand, and Jean-Marc Vincent e-mail : Firstname.Lastname@imag.fr Laboratoire d Informatique
More informationWORKLOAD CHARACTERIZATION OF INTERACTIVE CLOUD SERVICES BIG AND SMALL SERVER PLATFORMS
WORKLOAD CHARACTERIZATION OF INTERACTIVE CLOUD SERVICES ON BIG AND SMALL SERVER PLATFORMS Shuang Chen*, Shay Galon**, Christina Delimitrou*, Srilatha Manne**, and José Martínez* *Cornell University **Cavium
More informationMDHIM: A Parallel Key/Value Store Framework for HPC
MDHIM: A Parallel Key/Value Store Framework for HPC Hugh Greenberg 7/6/2015 LA-UR-15-25039 HPC Clusters Managed by a job scheduler (e.g., Slurm, Moab) Designed for running user jobs Difficult to run system
More informationMiAMI: Multi-Core Aware Processor Affinity for TCP/IP over Multiple Network Interfaces
MiAMI: Multi-Core Aware Processor Affinity for TCP/IP over Multiple Network Interfaces Hye-Churn Jang Hyun-Wook (Jin) Jin Department of Computer Science and Engineering Konkuk University Seoul, Korea {comfact,
More informationAnalyzing the Performance of IWAVE on a Cluster using HPCToolkit
Analyzing the Performance of IWAVE on a Cluster using HPCToolkit John Mellor-Crummey and Laksono Adhianto Department of Computer Science Rice University {johnmc,laksono}@rice.edu TRIP Meeting March 30,
More informationSecurity Correlation Server System Deployment and Planning Guide
CorreLog Security Correlation Server System Deployment and Planning Guide The CorreLog Server provides a method of collecting security information contained in log messages generated by network devices
More informationBenchmarking Porting Costs of the SKYNET High Performance Signal Processing Middleware
Benchmarking Porting Costs of the SKYNET High Performance Signal Processing Middleware Michael J. Linnig and Gary R. Suder Engineering Fellows Sept 12, 2014 Linnig@Raytheon.com Gary_R_Suder@Raytheon.com
More informationHANDLING LOAD IMBALANCE IN DISTRIBUTED & SHARED MEMORY
HANDLING LOAD IMBALANCE IN DISTRIBUTED & SHARED MEMORY Presenters: Harshitha Menon, Seonmyeong Bak PPL Group Phil Miller, Sam White, Nitin Bhat, Tom Quinn, Jim Phillips, Laxmikant Kale MOTIVATION INTEGRATED
More informationLoad-Balancing for Large Scale Situated Agent-Based Simulations
Procedia Computer Science Volume 51, 2015, Pages 90 99 ICCS 2015 International Conference On Computational Science Load-Balancing for Large Scale Situated Agent-Based Simulations Omar Rihawi, Yann Secq,
More informationMaxwell: a 64-FPGA Supercomputer
Maxwell: a 64-FPGA Supercomputer Copyright 2007, the University of Edinburgh Dr Rob Baxter Software Development Group Manager, EPCC R.Baxter@epcc.ed.ac.uk +44 131 651 3579 Outline The FHPCA Why build Maxwell?
More information