Efficient Clustering and Scheduling for Task-Graph based Parallelization
|
|
- Milton Osborne Bridges
- 5 years ago
- Views:
Transcription
1 Center for Information Services and High Performance Computing TU Dresden Efficient Clustering and Scheduling for Task-Graph based Parallelization Marc Hartung 02. February
2 HPC-OM 2/23
3 Content 1 Motivation 2 Scheduling 3 TGSim - Framework 4 Results 3/23
4 Outline 1 Motivation 2 Scheduling 3 TGSim - Framework 4 Results 4/23
5 Motivation Achieve speed-up through parallel execution of the ODE-system s tasks OM-Simulation Start Values ODE-System Time Integration 5/23
6 Motivation Achieve speed-up through parallel execution of the ODE-system s tasks OM-Simulation Start Values ODE-System Time Integration Assigning tasks to more than one CPU reduces simulation time Improvement depends on model 5/23
7 Motivation Task Graph visualizes right-hand side evaluation Contains computation costs, dependencies and communication costs T1 70 T2 T T4 T5 0 TX strongly connected component computation costs dependency with communication costs 6/23
8 Motivation Task Graph visualizes right-hand side evaluation Contains computation costs, dependencies and communication costs T1 Core 1 Core 2 T2 T3 T T2 T3 50 T4 T T4 T5 TX strongly connected component 1 0 computation costs dependency with communication costs 6/23
9 Motivation Task Graph visualizes right-hand side evaluation Contains computation costs, dependencies and communication costs T1 Core 1 Core 2 T2 T3 T T2 T3 50 T4 T T4 T5 TX strongly connected component 1 0 computation costs dependency with communication costs Algorithms for task-to-core mapping and ordering are needed! 6/23
10 Motivation Task Graph based Parallelization + Heterogeneous data dependencies + Allows nested parallelism + Numerical stable + Universal parallel solution (in theory) 7/23
11 Motivation Task Graph based Parallelization + Heterogeneous data dependencies + Allows nested parallelism + Numerical stable + Universal parallel solution (in theory) Obstacles Compile time Parallel efficiency Model dependent 7/23
12 Outline 1 Motivation 2 Scheduling 3 TGSim - Framework 4 Results 8/23
13 Scheduling Scheduling is a NP-complete decision problem Many greedy algorithms available Complexity between O(n) and O(n 4 ) (n... number of tasks) 9/23
14 Scheduling Scheduling is a NP-complete decision problem Many greedy algorithms available Complexity between O(n) and O(n 4 ) (n... number of tasks) Low cost algorithms achieve usable solutions But: No speed up guaranty 9/23
15 Scheduling - Taxonomy /23
16 Scheduling - ETF Earliest Time First - Algorithm List scheduler (for bounded number of processors) Checks every ready task for earliest start time Draws solved by highest bottom level Complexity: O(p n 2 ) (p... number of cores) 11/23
17 Scheduling - ETF Earliest Time First - Algorithm List scheduler (for bounded number of processors) Checks every ready task for earliest start time Draws solved by highest bottom level Complexity: O(p n 2 ) (p... number of cores) 11/23
18 Scheduling - ETF Earliest Time First - Algorithm List scheduler (for bounded number of processors) Checks every ready task for earliest start time Draws solved by highest bottom level Complexity: O(p n 2 ) (p... number of cores) 11/23
19 Scheduling - ETF Earliest Time First - Algorithm List scheduler (for bounded number of processors) Checks every ready task for earliest start time Draws solved by highest bottom level Complexity: O(p n 2 ) (p... number of cores) 11/23
20 Scheduling - ETF Earliest Time First - Algorithm List scheduler (for bounded number of processors) Checks every ready task for earliest start time Draws solved by highest bottom level Complexity: O(p n 2 ) (p... number of cores) 11/23
21 Scheduling - ETF Earliest Time First - Algorithm List scheduler (for bounded number of processors) Checks every ready task for earliest start time Draws solved by highest bottom level Complexity: O(p n 2 ) (p... number of cores) 11/23
22 Scheduling - ETF Earliest Time First - Algorithm List scheduler (for bounded number of processors) Checks every ready task for earliest start time Draws solved by highest bottom level Complexity: O(p n 2 ) (p... number of cores) 11/23
23 Scheduling - ETF Earliest Time First - Algorithm List scheduler (for bounded number of processors) Checks every ready task for earliest start time Draws solved by highest bottom level Complexity: O(p n 2 ) (p... number of cores) 11/23
24 Scheduling - ETF Earliest Time First - Algorithm List scheduler (for bounded number of processors) Checks every ready task for earliest start time Draws solved by highest bottom level Complexity: O(p n 2 ) (p... number of cores) Other list scheduler: LVL... Level Scheduler DLS... Dynamical Level Scheduling MCP... Modified Critical Path 11/23
25 Scheduling - Status Model: Modelica.Fluid.Examples.BranchingDynamicPipes System: Intel i7-3930k 6x 3.20 GHz, Linux Theoretical-Speed-Up Cpp-Runtime_Level_OpenMP speed-up 7,00 6,00 5,00 4,00 3,00 2,00 1,00 0,00 2,38 2,80 1,59 1, number of cores 12/23
26 Scheduling - Status Model: Modelica.Fluid.Examples.BranchingDynamicPipes System: Intel i7-3930k 6x 3.20 GHz, Linux Theoretical-Speed-Up Cpp-Runtime_Level_OpenMP speed-up 7,00 6,00 5,00 4,00 3,00 2,00 1,00 0,00 2,38 2,80 1,59 1, number of cores Approach: Analyse scheduler and parallelization methods to close gaps 12/23
27 Outline 1 Motivation 2 Scheduling 3 TGSim - Framework 4 Results 13/23
28 TGSim-Framework Task Graph Simulation Framework Analyse and evaluate scheduling and clustering algorithms Benchmark different parallelization methods Parallel runtime prediction for OM simulations and other traceable programs 14/23
29 TGSim-Framework Task Graph Simulation Framework Analyse and evaluate scheduling and clustering algorithms Benchmark different parallelization methods Parallel runtime prediction for OM simulations and other traceable programs Implementation: Written in C++ using OOP Easy to expand and user-friendly Creates OM-simulation alike programs with low overhead tracing mechanisms ODE-tasks replaced by wait tasks to reduce unintended influences 14/23
30 TGSim workflow MOS Profiled CppRuntime-simulation creates GraphML-file TGSim uses GraphML-File as input OMC GraphML TGSim Scheduling Runtime Analysis Analytical evaluation of scheduled task graphs Execution of scheduled simulations to benchmark parallel methods Statistic 15/23
31 Outline 1 Motivation 2 Scheduling 3 TGSim - Framework 4 Results 16/23
32 Results - Outline 1 TGSim runtime simulation vs. OM-Cpp-Runtime simulation Comparable? 2 Compare parallelization methods Dynamic scheduling and static scheduling 3 Compare TGSim scheduler Scheduling algorithms: MCP, DLS, ETF, LVL 17/23
33 Results - TGSim vs. Cpp-Runtime Model: Modelica.Fluid.Examples.BranchingDynamicPipes System: Intel Xeon E x 2.90GHz, Linux TGSim_Level_OpenMP Cpp-Runtime_Level_OpenMP speed-up 3,00 2,50 2,00 1,50 1,00 0,50 0, number of cores 18/23
34 Results - TGSim vs. Cpp-Runtime Model: Modelica.Fluid.Examples.BranchingDynamicPipes System: Intel Xeon E x 2.90GHz, Linux TGSim_Level_OpenMP Cpp-Runtime_Level_OpenMP speed-up 3,00 2,50 2,00 1,50 1,00 0,50 0, number of cores TGSim simulates OM-simulation work flow very well 18/23
35 Results - TGSim vs. Cpp-Runtime Model: Modelica.Fluid.Examples.BranchingDynamicPipes System: Intel Xeon E x 2.90GHz, Linux TGSim_Level_OpenMP Cpp-Runtime_Level_OpenMP speed-up 3,00 2,50 2,00 1,50 1,00 0,50 0, number of cores TGSim simulates OM-simulation work flow very well To simplify comparing, the OpenMP-OM-Cpp-Runtime results will be in every diagram 18/23
36 Results - Benchmark Dynamic Methods Model: Modelica.Fluid.Examples.BranchingDynamicPipes System: Intel Xeon E x 2.90GHz, Linux TGSim_IntelTBB TGSim_OpenMP Cpp-Runtime_OpenMP speed-up 3,50 3,00 2,50 2,00 1,50 1,00 0,50 0, number of cores IntelTBB is comparable to OpenMP Small disadvantage: Initialization time of IntelTBB ist 3700µs and of OpenMP 4µs 19/23
37 Results - Benchmark Static Methods Model: Modelica.Fluid.Examples.BranchingDynamicPipes System: Intel Xeon E x 2.90GHz, Linux 3,00 2,50 TGSim_Pthread TGSim_MPI Cpp-Runtime_OpenMP speed-up 2,00 1,50 1,00 0,50 0, number of cores PThread Performance in the first test cases better Static parallelization depends on scheduling 20/23
38 Results - Benchmark Scheduling Model: Modelica.Fluid.Examples.BranchingDynamicPipes System: Intel Xeon E x 2.90GHz, Linux MCP DLS ETF LVL Cpp-Runtime speed-up number of cores 21/23
39 Results - Benchmark Scheduling Model: Modelica.Fluid.Examples.BranchingDynamicPipes System: Intel Xeon E x 2.90GHz, Linux speed-up MCP DLS ETF LVL Cpp-Runtime Theoretical_Speed-Up number of cores Proper scheduling leads to high improvements 21/23
40 Summary & Future Work Summary With increasing number of cores static scheduling performs better than dynamic Scheduler which consider communication costs comparable in performance and much better than other PThreads fastest parallelization method, OpenMP and IntelTBB comparable 22/23
41 Summary & Future Work Summary With increasing number of cores static scheduling performs better than dynamic Scheduler which consider communication costs comparable in performance and much better than other PThreads fastest parallelization method, OpenMP and IntelTBB comparable Future Work Extend HPCOM OpenModelica library including TGSim optimizations 22/23
42 Thank you for your attention. 23/23
43 Test Framework TGSim II MPI example 1 // core // task 1 4 wait ( costs1 ); 5 MPI \ _Isend ( data1,..., core2,...); 6 7 // task 2 8 wait ( costs2 ); 9 // task4 11 MPI \ _Irecv ( data3,..., core2,...); 12 wait ( costs4 ); 1 // core // task MPIIrecv ( data1,..., core1,...); wait ( costs3 ); 9 11 MPIsend ( data3,..., core1,...); 23/23
Design Approach for a Generic and Scalable Framework for Parallel FMU Simulations
Center for Information Services and High Performance Computing TU Dresden Design Approach for a Generic and Scalable Framework for Parallel FMU Simulations Martin Flehmig, Marc Hartung, Marcus Walther
More informationHigh-Performance-Computing meets OpenModelica: Achievements in the HPC-OM Project
OpenModelica Workshop 2015 High-Performance-Computing meets OpenModelica: Achievements in the HPC-OM Project Linköping, 02/02/2015 HPC-OM www.hpc-om.de slide 2 Outline Outline 1. Parallelization Approaches
More informationTask-Graph-Based Parallelization of Modelica-Simulations. Tutorial on the Usage of the HPCOM-Module
Task-Graph-Based Parallelization of Modelica-Simulations Tutorial on the Usage of the HPCOM-Module 2 Introduction Prerequisites https://openmodelica.org/download/nightlybuildsdownload a multi-core cpu
More informationQuantifying power consumption variations of HPC systems using SPEC MPI benchmarks
Center for Information Services and High Performance Computing (ZIH) Quantifying power consumption variations of HPC systems using SPEC MPI benchmarks EnA-HPC, Sept 16 th 2010, Robert Schöne, Daniel Molka,
More informationOutline 1 Motivation 2 Theory of a non-blocking benchmark 3 The benchmark and results 4 Future work
Using Non-blocking Operations in HPC to Reduce Execution Times David Buettner, Julian Kunkel, Thomas Ludwig Euro PVM/MPI September 8th, 2009 Outline 1 Motivation 2 Theory of a non-blocking benchmark 3
More informationMPI Profile (mpip) on IRS BenchMark Application
MPI Profile (mpip) on IRS BenchMark Application MPI with ZRAD8 -np=2 -np=6 -np=1 Call App% MPI% App% MPI% App% MPI% App% MPI% App% MPI% any.62 3.94.79 28.22 26.23 73. 1.72 1.67 1.46 6.28.12 1.1.24 8.44
More informationIntroduction to Performance Tuning & Optimization Tools
Introduction to Performance Tuning & Optimization Tools a[i] a[i+1] + a[i+2] a[i+3] b[i] b[i+1] b[i+2] b[i+3] = a[i]+b[i] a[i+1]+b[i+1] a[i+2]+b[i+2] a[i+3]+b[i+3] Ian A. Cosden, Ph.D. Manager, HPC Software
More informationParallel Systems. Project topics
Parallel Systems Project topics 2016-2017 1. Scheduling Scheduling is a common problem which however is NP-complete, so that we are never sure about the optimality of the solution. Parallelisation is a
More informationHigh Performance Computing Systems
High Performance Computing Systems Multikernels Doug Shook Multikernels Two predominant approaches to OS: Full weight kernel Lightweight kernel Why not both? How does implementation affect usage and performance?
More informationRuntime Function Instrumentation with EZTrace
Runtime Function Instrumentation with EZTrace Charles Aulagnon, Damien Martin-Guillerez, François Rué and François Trahay 5 th Workshop on Productivity and Performance (PROPER 2012) INTRODUCTION Modern
More informationBei Wang, Dmitry Prohorov and Carlos Rosales
Bei Wang, Dmitry Prohorov and Carlos Rosales Aspects of Application Performance What are the Aspects of Performance Intel Hardware Features Omni-Path Architecture MCDRAM 3D XPoint Many-core Xeon Phi AVX-512
More informationA Characterization of Shared Data Access Patterns in UPC Programs
IBM T.J. Watson Research Center A Characterization of Shared Data Access Patterns in UPC Programs Christopher Barton, Calin Cascaval, Jose Nelson Amaral LCPC `06 November 2, 2006 Outline Motivation Overview
More informationLoad Balanced Parallel Simulated Annealing on a Cluster of SMP Nodes
Load Balanced Parallel Simulated Annealing on a Cluster of SMP Nodes Agnieszka Debudaj-Grabysz 1 and Rolf Rabenseifner 2 1 Silesian University of Technology, Gliwice, Poland; 2 High-Performance Computing
More informationPotentials and Limitations for Energy Efficiency Auto-Tuning
Center for Information Services and High Performance Computing (ZIH) Potentials and Limitations for Energy Efficiency Auto-Tuning Parco Symposium Application Autotuning for HPC (Architectures) Robert Schöne
More informationHigh Performance Computing Implementation on a Risk Assessment Problem
High Performance Computing Implementation on a Risk Assessment Problem Carlos A. Acosta 1 and Juan D. Ocampo 2 University of Texas at San Antonio, San Antonio, TX, 78249 Harry Millwater, Jr. 3 University
More informationMunara Tolubaeva Technical Consulting Engineer. 3D XPoint is a trademark of Intel Corporation in the U.S. and/or other countries.
Munara Tolubaeva Technical Consulting Engineer 3D XPoint is a trademark of Intel Corporation in the U.S. and/or other countries. notices and disclaimers Intel technologies features and benefits depend
More informationCUDA GPGPU Workshop 2012
CUDA GPGPU Workshop 2012 Parallel Programming: C thread, Open MP, and Open MPI Presenter: Nasrin Sultana Wichita State University 07/10/2012 Parallel Programming: Open MP, MPI, Open MPI & CUDA Outline
More informationOCTOPUS Performance Benchmark and Profiling. June 2015
OCTOPUS Performance Benchmark and Profiling June 2015 2 Note The following research was performed under the HPC Advisory Council activities Special thanks for: HP, Mellanox For more information on the
More informationGenerating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory
Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory Roshan Dathathri Thejas Ramashekar Chandan Reddy Uday Bondhugula Department of Computer Science and Automation
More informationAutoTune Workshop. Michael Gerndt Technische Universität München
AutoTune Workshop Michael Gerndt Technische Universität München AutoTune Project Automatic Online Tuning of HPC Applications High PERFORMANCE Computing HPC application developers Compute centers: Energy
More informationSSS: An Implementation of Key-value Store based MapReduce Framework. Hirotaka Ogawa (AIST, Japan) Hidemoto Nakada Ryousei Takano Tomohiro Kudoh
SSS: An Implementation of Key-value Store based MapReduce Framework Hirotaka Ogawa (AIST, Japan) Hidemoto Nakada Ryousei Takano Tomohiro Kudoh MapReduce A promising programming tool for implementing largescale
More informationCPU-GPU Heterogeneous Computing
CPU-GPU Heterogeneous Computing Advanced Seminar "Computer Engineering Winter-Term 2015/16 Steffen Lammel 1 Content Introduction Motivation Characteristics of CPUs and GPUs Heterogeneous Computing Systems
More informationIntegrating Parallel Application Development with Performance Analysis in Periscope
Technische Universität München Integrating Parallel Application Development with Performance Analysis in Periscope V. Petkov, M. Gerndt Technische Universität München 19 April 2010 Atlanta, GA, USA Motivation
More informationScheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok
Scheduling FFT Computation on SMP and Multicore Systems Ayaz Ali, Lennart Johnsson & Jaspal Subhlok Texas Learning and Computation Center Department of Computer Science University of Houston Outline Motivation
More informationEnabling Legacy Applications on Service Grids. Asvija B, C-DAC, Bangalore
Enabling Legacy Applications on Service Grids Asvija B, C-DAC, 1 Legacy Applications Service enablement of existing legacy applications is difficult No source code available Proprietary code fragments
More informationTuning Alya with READEX for Energy-Efficiency
Tuning Alya with READEX for Energy-Efficiency Venkatesh Kannan 1, Ricard Borrell 2, Myles Doyle 1, Guillaume Houzeaux 2 1 Irish Centre for High-End Computing (ICHEC) 2 Barcelona Supercomputing Centre (BSC)
More informationDuksu Kim. Professional Experience Senior researcher, KISTI High performance visualization
Duksu Kim Assistant professor, KORATEHC Education Ph.D. Computer Science, KAIST Parallel Proximity Computation on Heterogeneous Computing Systems for Graphics Applications Professional Experience Senior
More informationAdaptive Power Profiling for Many-Core HPC Architectures
Adaptive Power Profiling for Many-Core HPC Architectures Jaimie Kelley, Christopher Stewart The Ohio State University Devesh Tiwari, Saurabh Gupta Oak Ridge National Laboratory State-of-the-Art Schedulers
More informationIntroduction to parallel Computing
Introduction to parallel Computing VI-SEEM Training Paschalis Paschalis Korosoglou Korosoglou (pkoro@.gr) (pkoro@.gr) Outline Serial vs Parallel programming Hardware trends Why HPC matters HPC Concepts
More informationGROMACS Performance Benchmark and Profiling. August 2011
GROMACS Performance Benchmark and Profiling August 2011 Note The following research was performed under the HPC Advisory Council activities Participating vendors: Intel, Dell, Mellanox Compute resource
More informationPredicting Program Phases and Defending against Side-Channel Attacks using Hardware Performance Counters
Predicting Program Phases and Defending against Side-Channel Attacks using Hardware Performance Counters Junaid Nomani and Jakub Szefer Computer Architecture and Security Laboratory Yale University junaid.nomani@yale.edu
More informationOverview. Idea: Reduce CPU clock frequency This idea is well suited specifically for visualization
Exploring Tradeoffs Between Power and Performance for a Scientific Visualization Algorithm Stephanie Labasan & Matt Larsen (University of Oregon), Hank Childs (Lawrence Berkeley National Laboratory) 26
More informationSimulation using MIC co-processor on Helios
Simulation using MIC co-processor on Helios Serhiy Mochalskyy, Roman Hatzky PRACE PATC Course: Intel MIC Programming Workshop High Level Support Team Max-Planck-Institut für Plasmaphysik Boltzmannstr.
More informationDown selecting suitable manycore technologies for the ELT AO RTC. David Barr, Alastair Basden, Nigel Dipper and Noah Schwartz
Down selecting suitable manycore technologies for the ELT AO RTC David Barr, Alastair Basden, Nigel Dipper and Noah Schwartz GFLOPS RTC for AO workshop 27/01/2016 AO RTC Complexity 1.E+05 1.E+04 E-ELT
More informationPerformance Tools for Technical Computing
Christian Terboven terboven@rz.rwth-aachen.de Center for Computing and Communication RWTH Aachen University Intel Software Conference 2010 April 13th, Barcelona, Spain Agenda o Motivation and Methodology
More informationApplying Multi-Core Model Checking to Hardware-Software Partitioning in Embedded Systems
V Brazilian Symposium on Computing Systems Engineering Applying Multi-Core Model Checking to Hardware-Software Partitioning in Embedded Systems Alessandro Trindade, Hussama Ismail, and Lucas Cordeiro Foz
More informationApril 2 nd, Bob Burroughs Director, HPC Solution Sales
April 2 nd, 2019 Bob Burroughs Director, HPC Solution Sales Today - Introducing 2 nd Generation Intel Xeon Scalable Processors how Intel Speeds HPC performance Work Time System Peak Efficiency Software
More informationCMAQ PARALLEL PERFORMANCE WITH MPI AND OPENMP**
CMAQ 5.2.1 PARALLEL PERFORMANCE WITH MPI AND OPENMP** George Delic* HiPERiSM Consulting, LLC, P.O. Box 569, Chapel Hill, NC 27514, USA 1. INTRODUCTION This presentation reports on implementation of the
More informationToward Automated Application Profiling on Cray Systems
Toward Automated Application Profiling on Cray Systems Charlene Yang, Brian Friesen, Thorsten Kurth, Brandon Cook NERSC at LBNL Samuel Williams CRD at LBNL I have a dream.. M.L.K. Collect performance data:
More informationMemory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor System
Center for Information ervices and High Performance Computing (ZIH) Memory Performance and Cache Coherency Effects on an Intel Nehalem Multiprocessor ystem Parallel Architectures and Compiler Technologies
More informationPLB-HeC: A Profile-based Load-Balancing Algorithm for Heterogeneous CPU-GPU Clusters
PLB-HeC: A Profile-based Load-Balancing Algorithm for Heterogeneous CPU-GPU Clusters IEEE CLUSTER 2015 Chicago, IL, USA Luis Sant Ana 1, Daniel Cordeiro 2, Raphael Camargo 1 1 Federal University of ABC,
More informationIBM High Performance Computing Toolkit
IBM High Performance Computing Toolkit Pidad D'Souza (pidsouza@in.ibm.com) IBM, India Software Labs Top 500 : Application areas (November 2011) Systems Performance Source : http://www.top500.org/charts/list/34/apparea
More informationHPC Tools on Windows. Christian Terboven Center for Computing and Communication RWTH Aachen University.
- Excerpt - Christian Terboven terboven@rz.rwth-aachen.de Center for Computing and Communication RWTH Aachen University PPCES March 25th, RWTH Aachen University Agenda o Intel Trace Analyzer and Collector
More informationAgenda. Optimization Notice Copyright 2017, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.
Agenda VTune Amplifier XE OpenMP* Analysis: answering on customers questions about performance in the same language a program was written in Concepts, metrics and technology inside VTune Amplifier XE OpenMP
More informationShengyue Wang, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew Department of Computer Science & Engineering University of Minnesota
Loop Selection for Thread-Level Speculation, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew Department of Computer Science & Engineering University of Minnesota Chip Multiprocessors (CMPs)
More informationHPC Architectures. Types of resource currently in use
HPC Architectures Types of resource currently in use Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationHPC code modernization with Intel development tools
HPC code modernization with Intel development tools Bayncore, Ltd. Intel HPC Software Workshop Series 2016 HPC Code Modernization for Intel Xeon and Xeon Phi February 17 th 2016, Barcelona Microprocessor
More informationVIProf: A Vertically Integrated Full-System Profiler
VIProf: A Vertically Integrated Full-System Profiler NGS Workshop, April 2007 Hussam Mousa Chandra Krintz Lamia Youseff Rich Wolski RACELab Research Dynamic software adaptation As program behavior or resource
More informationDynamic Load Balancing in Parallelization of Equation-based Models
Dynamic Load Balancing in Parallelization of Equation-based Models Mahder Gebremedhin Programing Environments Laboratory (PELAB), IDA Linköping University mahder.gebremedhin@liu.se Annual OpenModelica
More informationIntroduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620
Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved
More informationIntel Parallel Studio XE 2015
2015 Create faster code faster with this comprehensive parallel software development suite. Faster code: Boost applications performance that scales on today s and next-gen processors Create code faster:
More informationxsim The Extreme-Scale Simulator
www.bsc.es xsim The Extreme-Scale Simulator Janko Strassburg Severo Ochoa Seminar @ BSC, 28 Feb 2014 Motivation Future exascale systems are predicted to have hundreds of thousands of nodes, thousands of
More informationRIGHTNOW A C E
RIGHTNOW A C E 2 0 1 4 2014 Aras 1 A C E 2 0 1 4 Scalability Test Projects Understanding the results 2014 Aras Overview Original Use Case Scalability vs Performance Scale to? Scaling the Database Server
More informationDetermining Optimal MPI Process Placement for Large- Scale Meteorology Simulations with SGI MPIplace
Determining Optimal MPI Process Placement for Large- Scale Meteorology Simulations with SGI MPIplace James Southern, Jim Tuccillo SGI 25 October 2016 0 Motivation Trend in HPC continues to be towards more
More informationIntroduction Attributed Graphs Rule Specification Implementation Conclusion. AGG and PROGRES. Bernhard Scholz. 27th January 2006.
AGG and PROGRES 27th January 2006 AGG and PROGRES 1 Motivation AGG Attributed Graph Grammar System TU Berlin PROGRES PROgrammed Graph REwriting Systems RTWH Aachen How do these systems work? Graph model
More informationImproved Event Generation at NLO and NNLO. or Extending MCFM to include NNLO processes
Improved Event Generation at NLO and NNLO or Extending MCFM to include NNLO processes W. Giele, RadCor 2015 NNLO in MCFM: Jettiness approach: Using already well tested NLO MCFM as the double real and virtual-real
More informationScalasca support for Intel Xeon Phi. Brian Wylie & Wolfgang Frings Jülich Supercomputing Centre Forschungszentrum Jülich, Germany
Scalasca support for Intel Xeon Phi Brian Wylie & Wolfgang Frings Jülich Supercomputing Centre Forschungszentrum Jülich, Germany Overview Scalasca performance analysis toolset support for MPI & OpenMP
More informationAn Introduction to Balder An OpenMP Run-time Library for Clusters of SMPs
An Introduction to Balder An Run-time Library for Clusters of SMPs Sven Karlsson Department of Microelectronics and Information Technology, KTH, Sweden Institute of Computer Science (ICS), Foundation for
More informationLS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance
11 th International LS-DYNA Users Conference Computing Technology LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance Gilad Shainer 1, Tong Liu 2, Jeff Layton
More informationALL the assignments (A1, A2, A3) and Projects (P0, P1, P2) we have done so far.
Midterm Exam Reviews ALL the assignments (A1, A2, A3) and Projects (P0, P1, P2) we have done so far. Particular attentions on the following: System call, system kernel Thread/process, thread vs process
More informationFUJITSU PHI Turnkey Solution
FUJITSU PHI Turnkey Solution Integrated ready to use XEON-PHI based platform Dr. Pierre Lagier ISC2014 - Leipzig PHI Turnkey Solution challenges System performance challenges Parallel IO best architecture
More informationComputer Aided Engineering with Today's Multicore, InfiniBand-Based Clusters ANSYS, Inc. All rights reserved. 1 ANSYS, Inc.
Computer Aided Engineering with Today's Multicore, InfiniBand-Based Clusters 2006 ANSYS, Inc. All rights reserved. 1 ANSYS, Inc. Proprietary Our Business Simulation Driven Product Development Deliver superior
More informationSPEC MPI2007 Benchmarks for HPC Systems
SPEC MPI2007 Benchmarks for HPC Systems Ron Lieberman Chair, SPEC HPG HP-MPI Performance Hewlett-Packard Company Dr. Tom Elken Manager, Performance Engineering QLogic Corporation Dr William Brantley Manager
More informationShifter: Fast and consistent HPC workflows using containers
Shifter: Fast and consistent HPC workflows using containers CUG 2017, Redmond, Washington Lucas Benedicic, Felipe A. Cruz, Thomas C. Schulthess - CSCS May 11, 2017 Outline 1. Overview 2. Docker 3. Shifter
More informationAUTOMATIC SMT THREADING
AUTOMATIC SMT THREADING FOR OPENMP APPLICATIONS ON THE INTEL XEON PHI CO-PROCESSOR WIM HEIRMAN 1,2 TREVOR E. CARLSON 1 KENZO VAN CRAEYNEST 1 IBRAHIM HUR 2 AAMER JALEEL 2 LIEVEN EECKHOUT 1 1 GHENT UNIVERSITY
More informationE-Store: Fine-Grained Elastic Partitioning for Distributed Transaction Processing Systems
E-Store: Fine-Grained Elastic Partitioning for Distributed Transaction Processing Systems Rebecca Taft, Essam Mansour, Marco Serafini, Jennie Duggan, Aaron J. Elmore, Ashraf Aboulnaga, Andrew Pavlo, Michael
More informationIntel Distribution for Python* и Intel Performance Libraries
Intel Distribution for Python* и Intel Performance Libraries 1 Motivation * L.Prechelt, An empirical comparison of seven programming languages, IEEE Computer, 2000, Vol. 33, Issue 10, pp. 23-29 ** RedMonk
More informationFirst Experiences with Intel Cluster OpenMP
First Experiences with Intel Christian Terboven, Dieter an Mey, Dirk Schmidl, Marcus Wagner surname@rz.rwth aachen.de Center for Computing and Communication RWTH Aachen University, Germany IWOMP 2008 May
More informationEvaluating selected cluster file systems with Parabench
Evaluating selected cluster file systems with Parabench Internship report Authors: Marcel Krause; Jens Schlager, B.A. Tutors: Olga Mordvinova, Julian M. Kunkel Winter term 2009 2010 Overview Test Scenarios
More informationImproved Analysis and Evaluation of Real-Time Semaphore Protocols for P-FP Scheduling
Improved Analysis and Evaluation of Real-Time Semaphore Protocols for P-FP Scheduling RTAS 13 April 10, 2013 Björn B. bbb@mpi-sws.org Semaphores + P-FP Scheduling Used in practice: VxWorks, QNX, ThreadX,
More informationTypical Bugs in parallel Programs
Center for Information Services and High Performance Computing (ZIH) Typical Bugs in parallel Programs Parallel Programming Course, Dresden, 8.- 12. February 2016 Joachim Protze (protze@rz.rwth-aachen.de)
More informationACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS
ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS Ferdinando Alessi Annalisa Massini Roberto Basili INGV Introduction The simulation of wave propagation
More informationHPC In The Cloud? Michael Kleber. July 2, Department of Computer Sciences University of Salzburg, Austria
HPC In The Cloud? Michael Kleber Department of Computer Sciences University of Salzburg, Austria July 2, 2012 Content 1 2 3 MUSCLE NASA 4 5 Motivation wide spread availability of cloud services easy access
More informationBenchmark runs of pcmalib on Nehalem and Shanghai nodes
MOSAIC group Institute of Theoretical Computer Science Department of Computer Science Benchmark runs of pcmalib on Nehalem and Shanghai nodes Christian Lorenz Müller, April 9 Addresses: Institute for Theoretical
More informationPortable Power/Performance Benchmarking and Analysis with WattProf
Portable Power/Performance Benchmarking and Analysis with WattProf Amir Farzad, Boyana Norris University of Oregon Mohammad Rashti RNET Technologies, Inc. Motivation Energy efficiency is becoming increasingly
More informationDirk Tetzlaff Technical University of Berlin
Software Engineering g for Embedded Systems Intelligent Task Mapping for MPSoCs using Machine Learning Dirk Tetzlaff Technical University of Berlin 3rd Workshop on Mapping of Applications to MPSoCs June
More informationAnalyzing Cache Bandwidth on the Intel Core 2 Architecture
John von Neumann Institute for Computing Analyzing Cache Bandwidth on the Intel Core 2 Architecture Robert Schöne, Wolfgang E. Nagel, Stefan Pflüger published in Parallel Computing: Architectures, Algorithms
More informationThe Optimal CPU and Interconnect for an HPC Cluster
5. LS-DYNA Anwenderforum, Ulm 2006 Cluster / High Performance Computing I The Optimal CPU and Interconnect for an HPC Cluster Andreas Koch Transtec AG, Tübingen, Deutschland F - I - 15 Cluster / High Performance
More informationLocal vs. Global Optimization: Operator Placement Strategies in Heterogeneous Environments
Local vs. Global Optimization: Operator Placement Strategies in Heterogeneous Environments Tomas Karnagel, Dirk Habich, Wolfgang Lehner Database Technology Group Technische Universität Dresden Dresden,
More informationELP. Effektive Laufzeitunterstützung für zukünftige Programmierstandards. Speaker: Tim Cramer, RWTH Aachen University
ELP Effektive Laufzeitunterstützung für zukünftige Programmierstandards Agenda ELP Project Goals ELP Achievements Remaining Steps ELP Project Goals Goals of ELP: Improve programmer productivity By influencing
More informationOutline. Motivation Parallel k-means Clustering Intel Computing Architectures Baseline Performance Performance Optimizations Future Trends
Collaborators: Richard T. Mills, Argonne National Laboratory Sarat Sreepathi, Oak Ridge National Laboratory Forrest M. Hoffman, Oak Ridge National Laboratory Jitendra Kumar, Oak Ridge National Laboratory
More informationHigh-performance aspects in virtualized infrastructures
SVM 21 High-performance aspects in virtualized infrastructures Vitalian Danciu, Nils gentschen Felde, Dieter Kranzlmüller, Tobias Lindinger SVM 21 - HPC aspects in virtualized infrastructures 1/29/21 Niagara
More informationResource and Performance Distribution Prediction for Large Scale Analytics Queries
Resource and Performance Distribution Prediction for Large Scale Analytics Queries Prof. Rajiv Ranjan, SMIEEE School of Computing Science, Newcastle University, UK Visiting Scientist, Data61, CSIRO, Australia
More informationET International HPC Runtime Software. ET International Rishi Khan SC 11. Copyright 2011 ET International, Inc.
HPC Runtime Software Rishi Khan SC 11 Current Programming Models Shared Memory Multiprocessing OpenMP fork/join model Pthreads Arbitrary SMP parallelism (but hard to program/ debug) Cilk Work Stealing
More informationUsing Intel VTune Amplifier XE for High Performance Computing
Using Intel VTune Amplifier XE for High Performance Computing Vladimir Tsymbal Performance, Analysis and Threading Lab 1 The Majority of all HPC-Systems are Clusters Interconnect I/O I/O... I/O I/O Message
More informationIntroducing OTF / Vampir / VampirTrace
Center for Information Services and High Performance Computing (ZIH) Introducing OTF / Vampir / VampirTrace Zellescher Weg 12 Willers-Bau A115 Tel. +49 351-463 - 34049 (Robert.Henschel@zih.tu-dresden.de)
More informationPARALLEL ID3. Jeremy Dominijanni CSE633, Dr. Russ Miller
PARALLEL ID3 Jeremy Dominijanni CSE633, Dr. Russ Miller 1 ID3 and the Sequential Case 2 ID3 Decision tree classifier Works on k-ary categorical data Goal of ID3 is to maximize information gain at each
More informationOncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries
Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries Jeffrey Young, Alex Merritt, Se Hoon Shon Advisor: Sudhakar Yalamanchili 4/16/13 Sponsors: Intel, NVIDIA, NSF 2 The Problem Big
More informationMILC Performance Benchmark and Profiling. April 2013
MILC Performance Benchmark and Profiling April 2013 Note The following research was performed under the HPC Advisory Council activities Special thanks for: HP, Mellanox For more information on the supporting
More informationShared memory parallel algorithms in Scotch 6
Shared memory parallel algorithms in Scotch 6 François Pellegrini EQUIPE PROJET BACCHUS Bordeaux Sud-Ouest 29/05/2012 Outline of the talk Context Why shared-memory parallelism in Scotch? How to implement
More informationI/O Profiling Towards the Exascale
I/O Profiling Towards the Exascale holger.brunst@tu-dresden.de ZIH, Technische Universität Dresden NEXTGenIO & SAGE: Working towards Exascale I/O Barcelona, NEXTGenIO facts Project Research & Innovation
More informationObfuscating Transformations. What is Obfuscator? Obfuscation Library. Obfuscation vs. Deobfuscation. Summary
? Obfuscating Outline? 1? 2 of Obfuscating 3 Motivation: Java Virtual Machine? difference between Java and others? Most programming languages: Java: Source Code Machine Code Predefined Architecture Java
More informationParallel Computer Architecture and Programming Final Project
Muhammad Hilman Beyri (mbeyri), Zixu Ding (zixud) Parallel Computer Architecture and Programming Final Project Summary We have developed a distributed interactive ray tracing application in OpenMP and
More informationOVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI
CMPE 655- MULTIPLE PROCESSOR SYSTEMS OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI What is MULTI PROCESSING?? Multiprocessing is the coordinated processing
More informationA2E: Adaptively Aggressive Energy Efficient DVFS Scheduling for Data Intensive Applications
A2E: Adaptively Aggressive Energy Efficient DVFS Scheduling for Data Intensive Applications Li Tan 1, Zizhong Chen 1, Ziliang Zong 2, Rong Ge 3, and Dong Li 4 1 University of California, Riverside 2 Texas
More informationPROGRAMOVÁNÍ V C++ CVIČENÍ. Michal Brabec
PROGRAMOVÁNÍ V C++ CVIČENÍ Michal Brabec PARALLELISM CATEGORIES CPU? SSE Multiprocessor SIMT - GPU 2 / 17 PARALLELISM V C++ Weak support in the language itself, powerful libraries Many different parallelization
More informationCase study: OpenMP-parallel sparse matrix-vector multiplication
Case study: OpenMP-parallel sparse matrix-vector multiplication A simple (but sometimes not-so-simple) example for bandwidth-bound code and saturation effects in memory Sparse matrix-vector multiply (spmvm)
More informationOpenStaPLE, an OpenACC Lattice QCD Application
OpenStaPLE, an OpenACC Lattice QCD Application Enrico Calore Postdoctoral Researcher Università degli Studi di Ferrara INFN Ferrara Italy GTC Europe, October 10 th, 2018 E. Calore (Univ. and INFN Ferrara)
More informationRevisiting the Past 25 Years: Lessons for the Future. Guri Sohi University of Wisconsin-Madison
Revisiting the Past 25 Years: Lessons for the Future Guri Sohi University of Wisconsin-Madison Outline VLIW OOO Superscalar Enhancing Superscalar And the future 2 Beyond pipelining to ILP Late 1980s to
More informationIntroduction to MPI. May 20, Daniel J. Bodony Department of Aerospace Engineering University of Illinois at Urbana-Champaign
Introduction to MPI May 20, 2013 Daniel J. Bodony Department of Aerospace Engineering University of Illinois at Urbana-Champaign Top500.org PERFORMANCE DEVELOPMENT 1 Eflop/s 162 Pflop/s PROJECTED 100 Pflop/s
More information