Dimemas internals and details. BSC Performance Tools
1 Dimemas internals and details. BSC Performance Tools
2 CEPBA tools framework (overview figure): XML control; instrumentation and data acquisition (OMPITrace, Valgrind, Dyninst, PAPI, MRNET); Paraver traces (.prv + .pcf); time analysis and filters (.cfg); Dimemas traces (.trf); DIMEMAS and VENUS (IBM-ZRL) simulators; predictions/expectations; stats generation (how2gen.xml; .viz, .cube, .xls, .txt); machine description; instruction-level simulators; data display tools (PeekPerf).
3 Dimemas tracefile. Characterises the application: a sequence of resource demands (computation bursts) for each task plus a sequence of communication events. This is the application model. Format: SDDF, for historical reasons, with a fixed set of record definitions (a new format also exists).
4 Dimemas tracefile. Format: SDDF for historical reasons. Definition of records:

#1: "CPU burst" {
    int    "taskid";
    int    "thid";
    double "time";
};;

#2: "NX send" {
    int "taskid";
    int "thid";
    int "dest taskid";
    int "msg length";
    int "tag";
    int "commid";
    int "use_rendezvous";
};;

#40: "block begin" {
    int "taskid";
    int "thid";
    int "blockid";
};;

#41: "block end" {
    int "taskid";
    int "thid";
    int "blockid";
};;

#201: "global OP" {
    int "rank";
    int "thid";
    int "glop_id";
    int "comm_id";
    int "root_rank";
    int "root_thid";
    int "bytes_sent";
    int "bytes_recvd";
};;
5 Dimemas tracefile. ASCII records:

"block begin" { 35, 0, 73 };;
"NX recv"     { 35, 0, 39, 4160, 10003, 0, 1 };;
"block end"   { 35, 0, 73 };;
"CPU burst"   { 35, 0, };;
"block begin" { 35, 0, 73 };;
"NX recv"     { 35, 0, 31, 4160, 10004, 0, 1 };;
"block end"   { 35, 0, 73 };;
"CPU burst"   { 35, 0, };;
"block begin" { 35, 0, 75 };;
"NX send"     { 35, 0, 34, 1560, 10001, 0, 0 };;
"block end"   { 35, 0, 75 };;
"CPU burst"   { 35, 0, };;
"block begin" { 35, 0, 75 };;
"NX send"     { 35, 0, 31, 3640, 10003, 0, 0 };;
"block end"   { 35, 0, 75 };;
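The ASCII records above are simple enough to parse with a regular expression. A minimal sketch in Python (hypothetical helper, not part of the toolset; the full SDDF grammar also allows string and array fields, which this ignores):

```python
import re

# Matches records such as: "NX send" { 35, 0, 34, 1560, 10001, 0, 0 };;
RECORD = re.compile(r'\s*"([^"]+)"\s*\{([^}]*)\}\s*;;')

def parse_record(line):
    """Return (record_name, numeric_fields) for one Dimemas ASCII record."""
    m = RECORD.match(line)
    if m is None:
        raise ValueError("not a record: %r" % line)
    fields = [int(f) for f in m.group(2).split(",") if f.strip()]
    return m.group(1), fields
```

For example, parse_record('"NX send" { 35, 0, 34, 1560, 10001, 0, 0 };;') yields the record name 'NX send' and the field list [35, 0, 34, 1560, 10001, 0, 0].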
6 Dimemas trace generation.
Dimemas instrumentation: MPIDtrace, run the same way as OMPItrace.
Paraver trace to Dimemas trace generation: prv2trf original.prv dimemas.trf
Default: the duration of each computation region is taken from the .prv computation duration.

Usage: prv2trf [options] <paraver_trace> <dimemas_trace>
  -h                             This help
  -n                             Do not generate initial idle states (forces a synchronized start of all threads)
  -i <iprobe_miss_threshold>     Maximum MPI_Iprobe misses before an Iprobe-area burst is discarded
  -b <hw_counter_type>,<factor>  Hardware counter type and factor used to generate burst durations:
                                 the computation region duration is derived from hardware counters,
                                 assuming/modeling a given performance (<factor>)
7 Parallel machine model. Dimemas is a coarse-grain, trace-driven simulator. Abstract architecture: a network of SMP nodes (each with local memory, L links and B buses) running a multiprogrammed workload. Objective: capture the key factors influencing performance using basic MPI protocols, with no attempt to model the details of a specific implementation; simple/general, and fast to simulate.
8 Dimemas GUI: specify the trace to simulate (open the file chooser).
9 Parallel machines: highly non-linear systems.
Linear components:
  Point-to-point communication: T = L + MessageSize / BW
  Sequential processor performance: global speed, or per block/subroutine
Non-linear components:
  Synchronization semantics: blocking receives, rendezvous
  Resource contention in the communication subsystem: links (in/out, half-duplex), buses
10 Dimemas GUI: specify the target machine.
11 p2p communication model: early receiver. MPI_send pays the machine latency (uses CPU, independent of size) and computation proceeds. The logical transfer completes after Size/BW; the physical transfer simulates contention for machine resources (links & buses). The receiving process is blocked in MPI_recv, which also pays the machine latency (uses CPU, independent of size).
12 p2p communication model: late receiver. MPI_send pays the machine latency (uses CPU, independent of size) and computation proceeds. The physical transfer simulates contention for machine resources (links & buses), but the logical transfer (Size/BW) cannot complete before MPI_recv is posted, which pays the machine latency (uses CPU, independent of size).
13 p2p communication model: rendezvous. MPI_send pays the machine latency (uses CPU, independent of size) and the sending process blocks until the receiver reaches MPI_recv. Then the physical transfer (simulated contention for links & buses) and the logical transfer (Size/BW) take place. MPI_recv pays the machine latency (uses CPU, independent of size).
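Ignoring the simulated contention for links and buses, the three timelines above reduce to one simple cost function. A toy sketch (all names hypothetical; the real simulator additionally models resource occupation and per-resource queuing):

```python
def p2p_completion(size, latency, bandwidth, send_t, recv_t, rendezvous=False):
    """Time at which the receiver holds the message, given the times at
    which MPI_send (send_t) and MPI_recv (recv_t) are reached.
    The machine latency is paid per side and is independent of size."""
    if rendezvous:
        # Sender blocks until the matching receive has been posted.
        start = max(send_t + latency, recv_t + latency)
    else:
        # Early receiver: the transfer starts as soon as the sender has
        # paid the latency. Late receiver: completion still has to wait
        # for the receive to be posted.
        start = max(send_t + latency, recv_t)
    return start + size / bandwidth
```

With size 4160 bytes, latency 1 ms and bandwidth 1 MB/s, an early receiver completes at 5.16 ms, while a receiver arriving at 10 ms delays completion to 14.16 ms (15.16 ms under rendezvous).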
14 Collective communication model. Generic model: barrier / fan-in / fan-out phases. Cost of each communication phase: a generic or per-call model factor (lin / log / const) applied to the size of the message (min / avg / max over all processes). Per-process timeline: collective processor time, block time, communication time.
15 Collective communication model. Generic model, communication time:

  Time = Latency + MODEL_FACTOR * Size / Bandwidth

Model factor:
  Null         0
  Constant     1
  Linear       P
  Logarithmic  Factor = sum_{i=1..ceil(log2 P)} steps_i, where steps_i accounts for contention on the B buses
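A sketch of the generic model formula (assumption: the logarithmic factor is approximated here by ceil(log2 P), whereas the simulator derives the number of steps taking bus contention into account):

```python
import math

def model_factor(model, nprocs):
    """MODEL_FACTOR for the generic collective model."""
    factors = {
        "null": 0,                            # cost-free phase
        "constant": 1,                        # one transfer
        "linear": nprocs,                     # P sequential transfers
        "log": math.ceil(math.log2(nprocs)),  # tree-like, ~log2(P) steps
    }
    return factors[model]

def collective_time(model, nprocs, size, latency, bandwidth):
    """Time = Latency + MODEL_FACTOR * Size / Bandwidth."""
    return latency + model_factor(model, nprocs) * size / bandwidth
```

For example, with 128 processes a logarithmic phase pays 7 transfer steps while a linear phase pays 128.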
16 Collective communication model. Per-call model: the model factor (lin / log / const) and the size of the message used (min / mean / max over all processes) are specified in an input file.
17 Dimemas GRID: model extension. Several machines (each with its L links and B buses) connected through an external network with dedicated connections; variation of effective bandwidth due to traffic; collective communication extension. Not targeted by this tutorial.
18 Architecture description file. Configuration file, SDDF format for historical reasons. Definition of records:

#1: "environment information" {
    char "machine_name"[];
    int  "machine_id";
    // "instrumented_architecture": architecture used to instrument
    char "instrumented_architecture"[];
    // "number_of_nodes": number of nodes on the virtual machine
    int  "number_of_nodes";
    // "network_bandwidth": data transfer rate between nodes in Mbytes/s
    //  0 means instantaneous communication
    double "network_bandwidth";
    // "number_of_buses_on_network": maximum number of messages on the network
    //  0 means no limit, 1 means bus contention
    int  "number_of_buses_on_network";
    // 1 constant, 2 linear, 3 logarithmic
    int  "communication_group_model";
};;
19 Architecture description file. Configuration file (continued):

#2: "node information" {
    int  "machine_id";
    // "node_id": node number
    int  "node_id";
    // "simulated_architecture": architecture node name
    char "simulated_architecture"[];
    // "number_of_processors": number of processors within the node
    int  "number_of_processors";
    // "number_of_input_links": number of input links in the node
    int  "number_of_input_links";
    // "number_of_output_links": number of output links in the node
    int  "number_of_output_links";
    // "startup_on_local_communication": communication startup
    double "startup_on_local_communication";
    // "startup_on_remote_communication": communication startup
    double "startup_on_remote_communication";
    // "speed_ratio_instrumented_vs_simulated": relative processor speed
    double "speed_ratio_instrumented_vs_simulated";
    // "memory_bandwidth": data transfer rate within the node in Mbytes/s
    //  0 means instantaneous communication
    double "memory_bandwidth";
    double "external_net_startup";
};;
20 Architecture description file: example configuration (figure annotations: in/out links, BW, B buses, L latency).

"wide area network information" {"", 1, 0, 4, 0.0, 0.0, 1};;
"environment information" {"", 0, "", 128, 250.0, 0, 3};;
"node information" {0, 0, "", 1, 1, 1, 0.0, , 1.0, 0.0, 0.0};;
"node information" {0, 1, "", 1, 1, 1, 0.0, , 1.0, 0.0, 0.0};;
"node information" {0, 2, "", 1, 1, 1, 0.0, , 1.0, 0.0, 0.0};;
"node information" {0, 3, "", 1, 1, 1, 0.0, , 1.0, 0.0, 0.0};;
"mapping information" {"WRF.MN.128p.chop2.trf", 128, [128] {0,1,2,3,4,5,6,7,8,9,10,11,...,125,126,127}};;
"configuration files" {"", "", "collectives.cfg", ""};;
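Because the configuration is plain text, it is easy to generate programmatically (the sed-based parametric script later in this tutorial does exactly this). A hypothetical sketch, with the field order taken from the record definitions above:

```python
def make_cfg(nodes, bandwidth, buses, startup, trace, ntasks):
    """Emit a minimal Dimemas configuration: one environment record,
    one record per node, and a round-robin task-to-node mapping."""
    lines = ['"environment information" {"", 0, "", %d, %.1f, %d, 3};;'
             % (nodes, bandwidth, buses)]
    for n in range(nodes):
        # 1 processor, 1 input link, 1 output link per node;
        # the same startup for local and remote communication.
        lines.append('"node information" {0, %d, "", 1, 1, 1, %.1f, %.1f,'
                     ' 1.0, 0.0, 0.0};;' % (n, startup, startup))
    mapping = ",".join(str(t % nodes) for t in range(ntasks))
    lines.append('"mapping information" {"%s", %d, [%d] {%s}};;'
                 % (trace, ntasks, ntasks, mapping))
    return "\n".join(lines)
```

make_cfg(4, 250.0, 0, 0.0, "WRF.MN.128p.chop2.trf", 8) produces four node records and the mapping {0,1,2,3,0,1,2,3}.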
21 Application analysis: what-if simulations.
  Ideal network?  BW = infinity, L = 0
  Group messages? Bandwidth problem?  compare BW = infinity against BW = target
  Concurrent communication problems?  BW = target, L = target, buses = 1, 2, ...
(Figure: real run vs. ideal network; phases: allgather + sendrecv, alltoall, allreduce, waitall, sendrecv.)
22 Hands-on session. The directory intro2dimemas contains a guidelines document that you can apply to the WRF.128p trace or to your own. A comparison of the original and simulated traces for the WRF.128p case: real MareNostrum run vs. Dimemas prediction for MareNostrum.
23 Hands-on session. Configurations: BW 5 MB/s, BW 10 MB/s, L 100 us, BW 250 MB/s. Sensitivity to the different factors (latency, BW, ...)? In different parts of the trace?
24 Hands-on session. Configurations: 2 buses, 2 links, BW 5 MB/s, BW 10 MB/s, BW 250 MB/s. Relationship between bandwidth, injectors and contention. Amount of contention? Endpoint contention?
25 Application analysis: endpoint contention. Simulation with Dimemas of the PEPC exchange phase: very low BW, 1 output link, infinite input links. Endpoint contention appears at low-ranked processes. Recommendation: it is important to schedule communications, with everybody sending in destination-rank order.
26 Speedup model.

  eff_i = T_i / T

  CommEff = max_i(eff_i)

  LB = sum_{i=1..P} eff_i / (P * max_i(eff_i))

Directly from real execution metrics:

  Sup = (P / P0) * (LB / LB0) * (CommEff / CommEff0) * (IPC / IPC0) * (#instr0 / #instr)

Refining load balance into migrating/local load imbalance (macroLB) and serialization (microLB, requires a Dimemas simulation):

  Sup = (P / P0) * (macroLB / macroLB0) * (microLB / microLB0) * (CommEff / CommEff0) * (IPC / IPC0) * (#instr0 / #instr)
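The factors of the model can be computed directly from the per-process useful computation times of a real run. A sketch (function and variable names are hypothetical):

```python
def efficiency_model(useful, elapsed):
    """useful[i]: useful computation time of process i; elapsed: total
    elapsed time T. Returns (LB, CommEff, parallel efficiency), where
    parallel efficiency = LB * CommEff."""
    P = len(useful)
    eff = [u / elapsed for u in useful]          # eff_i = T_i / T
    comm_eff = max(eff)                          # CommEff = max(eff_i)
    lb = sum(eff) / (P * max(eff))               # LB = sum(eff_i)/(P*max(eff_i))
    return lb, comm_eff, lb * comm_eff
```

For example, four processes with useful times [8, 8, 4, 4] seconds in a 10-second run give LB = 0.75, CommEff = 0.8 and a parallel efficiency of 0.6.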
27 Parametric studies: estimating the impact of different factors. GADGET: ideal speedup obtained by speeding up ALL the computation bursts by the ratio factor. The more processes, the lower the speedup (higher impact of bandwidth limitations)! (Figures: speedup vs. bandwidth (MB/s) for several ratio values, at increasing process counts.)
28 Parametric studies.

#!/bin/sh
echo bw time
for log_bw in $(seq 6 14)
do
    let i=2**log_bw
    # Instantiate the template configuration with this bandwidth
    sed s/BWREF/$i.0/g machine.ref.cfg > tmp.cfg
    # Run the simulation and keep the last field of the "Execution time" line
    echo $i `Dimemas -S 32K tmp.cfg | grep Execu | awk '{print $NF}'`
    rm tmp.cfg
done

machine.ref.cfg:
"environment information" {"", 0, "", 128, BWREF, 0, 3};;
"node information" {0, 0, "", 1, 1, 1, 0.0, , 1.0, 0.0, 0.0};;
"node information" {0, 1, "", 1, 1, 1, 0.0, , 1.0, 0.0, 0.0};;
29 Estimating impact: speeding up only SELECTED regions by the ratio factor. Hybrid GADGET (128 processes): profile of code regions as % of computation time (the dominant region accounts for 93.67% of elapsed time). We do need to overcome the hybrid Amdahl's law: asynchrony + load-balancing mechanisms! (Figures: speedup vs. bandwidth (MB/s) for several ratio values, per code region.)
30 Using block factors. Clusterize with option -b; convert to .trf; specify block performance factors (the time of a block is divided by its factor); simulate. Example: WRF.NM.128p.chop2.prv clusterized with Cluster.I.IPC.xml; prediction when speeding up cluster 2 by 100x.
31 Dimemas GUI: block factors.

"environment information" {"", 0, "", 128, 250.0, 0, 3};;
"node information" {0, 0, "", 1, 1, 1, 0.0, , 1.0, 0.0, 0.0};;
"node information" {0, 1, "", 1, 1, 1, 0.0, , 1.0, 0.0, 0.0};;
"modules information" {1007, 100.0};;
32 Application analysis: serialization. Detected through Dimemas simulation with an ideal interconnect: relevant computation between communications in the sendrecv phases. Precise measurement and prediction result in early detection: runs at small core counts warn of potential problems at large core counts. (Figure: real run vs. ideal network for increasing process counts.)
33 CEPBA tools framework: interconnect evaluation environment (same overview figure as before, highlighting the path from Paraver/Dimemas traces through DIMEMAS to VENUS (IBM-ZRL)).
34 Interconnect simulation environment. Dimemas: MPI replay with a very fast, coarse-grain network model. Venus (IBM): detailed network simulator with routing and protocols. The Dimemas simulator (client) and the Venus ServerMod (server) interact over a socket, exchanging traces and routes; configuration files provide the mapping and topology; the output is traces and statistics.
35 Multiscale simulation.
36 Multiscale simulation: L2 cache size vs. network bandwidth. Left: cluster representatives' IPC with different L2 cache sizes (64KB to 4MB). Right: application execution time with different network bandwidths (125Mb/s to 500Mb/s). VAC and WRF are dominated by computation phases, so the impact of the network is negligible. For NAS BT the network bandwidth is more significant: an L2 size reduction can be compensated by an increase in network bandwidth.
Tree-Based Density Clustering using Graphics Processors A First Marriage of MRNet and GPUs Evan Samanas and Ben Welton Paradyn Project Paradyn / Dyninst Week College Park, Maryland March 26-28, 2012 The
More informationStandard promoted by main manufacturers Fortran. Structure: Directives, clauses and run time calls
OpenMP Introducción Directivas Regiones paralelas Worksharing sincronizaciones Visibilidad datos Implementación OpenMP: introduction Standard promoted by main manufacturers http://www.openmp.org, http://www.compunity.org
More informationTutorial: Analyzing MPI Applications. Intel Trace Analyzer and Collector Intel VTune Amplifier XE
Tutorial: Analyzing MPI Applications Intel Trace Analyzer and Collector Intel VTune Amplifier XE Contents Legal Information... 3 1. Overview... 4 1.1. Prerequisites... 5 1.1.1. Required Software... 5 1.1.2.
More informationNon-Uniform Memory Access (NUMA) Architecture and Multicomputers
Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico February 29, 2016 CPD
More informationOutline. Overview Theoretical background Parallel computing systems Parallel programming models MPI/OpenMP examples
Outline Overview Theoretical background Parallel computing systems Parallel programming models MPI/OpenMP examples OVERVIEW y What is Parallel Computing? Parallel computing: use of multiple processors
More informationCOMP Superscalar. COMPSs Tracing Manual
COMP Superscalar COMPSs Tracing Manual Version: 2.4 November 9, 2018 This manual only provides information about the COMPSs tracing system. Specifically, it illustrates how to run COMPSs applications with
More informationIntroduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano
Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed
More informationPerformance Analysis of MPI-programs. 4. Characteristics and methods of debugging for parallel programs.
Performance Analysis of MPI-programs (Originally was written in Russian. Sample.) 4. Characteristics and methods of debugging for parallel programs. 4.1 Main performance characteristics. Possibility to
More informationExercises: April 11. Hermann Härtig, TU Dresden, Distributed OS, Load Balancing
Exercises: April 11 1 PARTITIONING IN MPI COMMUNICATION AND NOISE AS HPC BOTTLENECK LOAD BALANCING DISTRIBUTED OPERATING SYSTEMS, SCALABILITY, SS 2017 Hermann Härtig THIS LECTURE Partitioning: bulk synchronous
More informationComposite Metrics for System Throughput in HPC
Composite Metrics for System Throughput in HPC John D. McCalpin, Ph.D. IBM Corporation Austin, TX SuperComputing 2003 Phoenix, AZ November 18, 2003 Overview The HPC Challenge Benchmark was announced last
More information6.189 IAP Lecture 5. Parallel Programming Concepts. Dr. Rodric Rabbah, IBM IAP 2007 MIT
6.189 IAP 2007 Lecture 5 Parallel Programming Concepts 1 6.189 IAP 2007 MIT Recap Two primary patterns of multicore architecture design Shared memory Ex: Intel Core 2 Duo/Quad One copy of data shared among
More informationECE 574 Cluster Computing Lecture 13
ECE 574 Cluster Computing Lecture 13 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 21 March 2017 Announcements HW#5 Finally Graded Had right idea, but often result not an *exact*
More informationThree basic multiprocessing issues
Three basic multiprocessing issues 1. artitioning. The sequential program must be partitioned into subprogram units or tasks. This is done either by the programmer or by the compiler. 2. Scheduling. Associated
More informationNon-Uniform Memory Access (NUMA) Architecture and Multicomputers
Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico September 26, 2011 CPD
More informationA Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004
A Study of High Performance Computing and the Cray SV1 Supercomputer Michael Sullivan TJHSST Class of 2004 June 2004 0.1 Introduction A supercomputer is a device for turning compute-bound problems into
More informationBlocking SEND/RECEIVE
Message Passing Blocking SEND/RECEIVE : couple data transfer and synchronization - Sender and receiver rendezvous to exchange data P P SrcP... x : =... SEND(x, DestP)... DestP... RECEIVE(y,SrcP)... M F
More informationPurity: An Integrated, Fine-Grain, Data- Centric, Communication Profiler for the Chapel Language
Purity: An Integrated, Fine-Grain, Data- Centric, Communication Profiler for the Chapel Language Richard B. Johnson and Jeffrey K. Hollingsworth Department of Computer Science, University of Maryland,
More informationLecture 7: Parallel Processing
Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction
More informationAssessment of LS-DYNA Scalability Performance on Cray XD1
5 th European LS-DYNA Users Conference Computing Technology (2) Assessment of LS-DYNA Scalability Performance on Cray Author: Ting-Ting Zhu, Cray Inc. Correspondence: Telephone: 651-65-987 Fax: 651-65-9123
More informationChapter 5. Multiprocessors and Thread-Level Parallelism
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model
More informationEnabling Scalable Parallel Processing of Venus/OMNeT++ Network Models on the IBM Blue Gene/Q Supercomputer
Enabling Scalable Parallel Processing of Venus/OMNeT++ Network Models on the IBM Blue Gene/Q Supercomputer Chris Carothers, Elsa Gonsiorowski and Justin LaPre Center for Computational Innovations Rensselaer
More informationHPX. High Performance ParalleX CCT Tech Talk Series. Hartmut Kaiser
HPX High Performance CCT Tech Talk Hartmut Kaiser (hkaiser@cct.lsu.edu) 2 What s HPX? Exemplar runtime system implementation Targeting conventional architectures (Linux based SMPs and clusters) Currently,
More information