Dimemas internals and details. BSC Performance Tools

1 Dimemas internals and details. BSC Performance Tools

2 CEPBA tools framework (diagram). Components shown: XML control; Valgrind, OMPItrace, MRNET, Dyninst, PAPI; time analysis and filters; .prv + .pcf traces; Paraver (.cfg); .trf traces; DIMEMAS with its machine description; VENUS (IBM-ZRL); instruction-level simulators; predictions/expectations; stats generation (how2gen.xml; .viz, .cube, .xls, .txt output); PeekPerf and other data display tools.

3 Dimemas tracefile. Characterises the application (application model): a sequence of resource demands for each task and a sequence of events (communication). Format: SDDF, for historical reasons, with a definition of records. New format.

4 Dimemas tracefile. Format: SDDF, for historical reasons. Definition of records:
#1: "CPU burst" { int "taskid"; int "thid"; double "time"; };;
#2: "NX send" { int "taskid"; int "thid"; int "dest taskid"; int "msg length"; int "tag"; int "commid"; int "use_rendezvous"; };;
#40: "block begin" { int "taskid"; int "thid"; int "blockid"; };;
#41: "block end" { int "taskid"; int "thid"; int "blockid"; };;
#201: "global OP" { int "rank"; int "thid"; int "glop_id"; int "comm_id"; int "root_rank"; int "root_thid"; int "bytes_sent"; int "bytes_recvd"; };;

5 Dimemas tracefile. Format: ASCII records.
"block begin" { 35, 0, 73 };;
"NX recv" { 35, 0, 39, 4160, 10003, 0, 1 };;
"block end" { 35, 0, 73 };;
"CPU burst" { 35, 0, };;
"block begin" { 35, 0, 73 };;
"NX recv" { 35, 0, 31, 4160, 10004, 0, 1 };;
"block end" { 35, 0, 73 };;
"CPU burst" { 35, 0, };;
"block begin" { 35, 0, 75 };;
"NX send" { 35, 0, 34, 1560, 10001, 0, 0 };;
"block end" { 35, 0, 75 };;
"CPU burst" { 35, 0, };;
"block begin" { 35, 0, 75 };;
"NX send" { 35, 0, 31, 3640, 10003, 0, 0 };;
"block end" { 35, 0, 75 };;
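The ASCII records above are regular enough to read with a few lines of code. The following is a minimal sketch, assuming every record has the form `"name" { v1, v2, ... };;` with numeric fields only. `parse_records` and `FIELDS` are hypothetical names for illustration, not part of Dimemas; the field names come from the record definitions on slide 4.

```python
import re

# Field names taken from the record definitions on slide 4.
# Records whose layout is not listed here are returned as plain value lists.
FIELDS = {
    "NX send": ["taskid", "thid", "dest taskid", "msg length", "tag",
                "commid", "use_rendezvous"],
    "block begin": ["taskid", "thid", "blockid"],
    "block end": ["taskid", "thid", "blockid"],
}

# "record name" { comma-separated values };;
RECORD_RE = re.compile(r'"([^"]+)"\s*\{([^{}]*)\}\s*;;')


def _num(token):
    """Convert a field to int when possible, otherwise to float."""
    try:
        return int(token)
    except ValueError:
        return float(token)


def parse_records(text):
    """Yield (record_name, fields) for every record found in text.

    Empty fields are skipped, so an incomplete record yields fewer values.
    """
    for name, body in RECORD_RE.findall(text):
        values = [_num(p.strip()) for p in body.split(",") if p.strip()]
        names = FIELDS.get(name)
        if names and len(names) == len(values):
            yield name, dict(zip(names, values))
        else:
            yield name, values
```

Returning a dictionary only when the number of values matches the known layout keeps the sketch tolerant of records it does not know about.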

6 Dimemas trace generation.
Dimemas instrumentation: MPIDtrace, run the same way as OMPItrace.
Paraver to Dimemas trace generation: prv2trf (original.prv to dimemas.trf). Default: the duration of each computation region is taken from the computation duration in the .prv.
Usage: prv2trf -i <iprobe_miss_threshold> -b <hw_counter_type>,<factor> <paraver_trace> <dimemas_trace>
  -h                              This help
  -n                              No generate initial idle states
  -i <iprobe_miss_threshold>      Maximum MPI_Iprobe misses to discard Iprobe area burst
  -b <hw_counter_type>,<factor>   Hardware counter type and factor used to generate burst durations
Slide annotations: "Force synchronized start of all threads"; for -b, the computation region duration is derived from hardware counters, assuming/modeling a given performance (<factor>).

7 Parallel machine model. Dimemas: coarse-grain, trace-driven simulator. Abstract architecture: a network of SMPs with a multiprogrammed workload (diagram: nodes with local memory attached through links, L, to buses, B). Key factors influencing performance. Basic MPI protocols; no attempt to model details of a specific implementation. Objectives: simple/general, fast simulation.

8 Dimemas GUI. Specify trace to simulate. Screenshot callouts: "Open chooser", "Specify".

9 Parallel machines: highly non-linear systems.
Linear components: point-to-point communication, $T = \frac{MessageSize}{BW} + L$; sequential processor performance (global speed, or per block/subroutine).
Non-linear components: synchronization semantics (blocking receives, rendezvous); resource contention in the communication subsystem (links, in/out and half-duplex; busses).
Diagram: nodes with local memory attached through links (L) to buses (B).
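The linear point-to-point term can be written directly as a function. This is a minimal sketch: the function name and the choice of units (bytes, MB/s with 1 MB taken as 10^6 bytes, seconds) are assumptions made for illustration. Treating a bandwidth of 0 as instantaneous communication follows the convention stated in the architecture description file.

```python
def transfer_time(message_size_bytes, bandwidth_mb_s, latency_s):
    """Linear point-to-point cost: T = MessageSize / BW + L.

    Units are an assumption of this sketch: bytes, MB/s (10**6 bytes/s)
    and seconds.
    """
    if bandwidth_mb_s == 0:
        # Configuration-file convention: 0 means instantaneous communication.
        return latency_s
    return message_size_bytes / (bandwidth_mb_s * 1e6) + latency_s
```

For example, `transfer_time(4160, 250.0, 0.0)` gives the pure transfer time of a 4160-byte message at 250 MB/s.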

10 Dimemas GUI. Specify target machine.

11 p2p communication model: early receiver. Timeline diagram with these elements: MPI_recv (process blocked); MPI_send (computation proceeds); machine latency on each side (uses CPU, independent of size); logical transfer; physical transfer of duration Size/BW, with simulated contention for machine resources (links & buses).

12 p2p communication model: late receiver. Timeline diagram with these elements: MPI_send (computation proceeds); MPI_recv; machine latency on each side (uses CPU, independent of size); physical transfer of duration Size/BW, with simulated contention for machine resources (links & buses); logical transfer.

13 p2p communication model: rendezvous. Timeline diagram with these elements: MPI_send (process blocked); MPI_recv; machine latency on each side (uses CPU, independent of size); physical transfer of duration Size/BW, with simulated contention for machine resources (links & buses); logical transfer.
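The three timelines differ only in when the physical transfer may start and in whether the sender keeps computing. The sketch below is a simplified reading of the slides, not the Dimemas implementation: it charges the latency once on the sending side, ignores contention for links and buses, and blocks a rendezvous sender until the data is out. `simulate_p2p` and its parameters are hypothetical names.

```python
def simulate_p2p(t_send, t_recv, size_bytes, bandwidth_bytes_s, latency_s,
                 rendezvous=False):
    """Return (sender_resume, receiver_resume) for a single message.

    Assumptions of this sketch (not taken from the Dimemas sources):
    latency is charged once before the transfer, there is no contention,
    and the receiver resumes as soon as the data has arrived.
    """
    transfer = size_bytes / bandwidth_bytes_s
    if rendezvous:
        # The transfer cannot start before both sides have reached the call.
        start = max(t_send, t_recv) + latency_s
        sender_resume = start + transfer  # sender stays blocked meanwhile
    else:
        # Eager send: the transfer starts regardless of the receiver.
        start = t_send + latency_s
        sender_resume = start  # computation proceeds
    arrival = start + transfer
    # Early receiver waits for the data; late receiver finds it already there.
    receiver_resume = max(t_recv, arrival)
    return sender_resume, receiver_resume
```

With an early receiver the receive side waits for the arrival time; with a late receiver it continues immediately because `arrival` is already in the past.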

14 Collective communication model. Generic model: barrier / fan-in / fan-out. The cost of each communication phase is set generically or per call, from a model factor (lin / log / const) and the size of message (min, avg or max over all processes). Timeline labels for a collective: processor time, block time, comm. time.

15 Collective communication model: generic model. The communication time depends on the model factor (lin / log / const) and the size:
$Time = Latency + \frac{Size}{Bandwidth} \times MODEL\_FACTOR$
Model        Factor
Null         0
Constant     1
Linear       P
Logarithmic  $N_{steps}$
with $N_{steps} = \sum_{i=1}^{\log_2 P} steps_i$ and $steps_i = \frac{C}{B}$.
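The generic model can be sketched as two small functions. The time formula is taken as it reads in the transcription, with the latency outside the factor; the original slide layout is lost, so that grouping is an assumption. For the logarithmic model the sketch assumes one step per stage when no per-stage step counts are given, which corresponds to no bus contention. All function and parameter names are hypothetical.

```python
import math


def model_factor(model, num_procs, steps=None):
    """Factor of the generic collective model.

    model: "null", "constant", "linear" or "logarithmic".
    steps: optional per-stage step counts for the logarithmic model; when
    omitted, one step per stage is assumed (an assumption of this sketch).
    """
    if model == "null":
        return 0
    if model == "constant":
        return 1
    if model == "linear":
        return num_procs
    if model == "logarithmic":
        stages = math.ceil(math.log2(num_procs)) if num_procs > 1 else 0
        if steps is None:
            steps = [1] * stages
        return sum(steps)
    raise ValueError(f"unknown model: {model}")


def collective_time(latency_s, size_bytes, bandwidth_bytes_s, model, num_procs):
    """Time = Latency + Size / Bandwidth * MODEL_FACTOR (as transcribed)."""
    factor = model_factor(model, num_procs)
    return latency_s + size_bytes / bandwidth_bytes_s * factor
```

Passing explicit `steps` lets the caller account for stages that need more than one step, for instance when there are fewer buses than simultaneous messages.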

16 Collective communication model: per-call model. For each call, the input file specifies the model factor (lin, log or const) and the size of message (min, mean or max over all processes).

17 Dimemas GRID: model extension. Diagram: machines with links (L) and buses (B) joined by dedicated connections and an external network. Variation in effective bandwidth due to traffic. Collective communication extension. Not targeted by this tutorial.

18 Architecture description file. Configuration file in SDDF format, for historical reasons. Definition of records:
#1: "environment information" {
  char "machine_name"[];
  int "machine_id";
  // "instrumented_architecture" "Architecture used to instrument"
  char "instrumented_architecture"[];
  // "number_of_nodes" "Number of nodes on virtual machine"
  int "number_of_nodes";
  // "network_bandwidth" "Data transfer rate between nodes in Mbytes/s"
  // "0 means instantaneous communication"
  double "network_bandwidth";
  // "number_of_buses_on_network" "Maximum number of messages on network"
  // "0 means no limit"
  // "1 means bus contention"
  int "number_of_buses_on_network";
  // "1 Constant, 2 Linear, 3 Logarithmic"
  int "communication_group_model";
};;

19 Architecture description file. Configuration file:
#2: "node information" {
  int "machine_id";
  // "node_id" "Node number"
  int "node_id";
  // "simulated_architecture" "Architecture node name"
  char "simulated_architecture"[];
  // "number_of_processors" "Number of processors within node"
  int "number_of_processors";
  // "number_of_input_links" "Number of input links in node"
  int "number_of_input_links";
  // "number_of_output_links" "Number of output links in node"
  int "number_of_output_links";
  // "startup_on_local_communication" "Communication startup"
  double "startup_on_local_communication";
  // "startup_on_remote_communication" "Communication startup"
  double "startup_on_remote_communication";
  // "speed_ratio_instrumented_vs_simulated" "Relative processor speed"
  double "speed_ratio_instrumented_vs_simulated";
  // "memory_bandwidth" "Data transfer rate into node in Mbytes/s"
  // "0 means instantaneous communication"
  double "memory_bandwidth";
  double "external_net_startup";
};;

20 Architecture description file. Example configuration; the slide callouts mark the fields for the number of CPUs, the in/out links, the bandwidth (BW), the buses (B) and the latency (L):
"wide area network information" {"", 1, 0, 4, 0.0, 0.0, 1};;
"environment information" {"", 0, "", 128, 250.0, 0, 3};;
"node information" {0, 0, "", 1, 1, 1, 0.0, , 1.0, 0.0, 0.0};;
"node information" {0, 1, "", 1, 1, 1, 0.0, , 1.0, 0.0, 0.0};;
"node information" {0, 2, "", 1, 1, 1, 0.0, , 1.0, 0.0, 0.0};;
"node information" {0, 3, "", 1, 1, 1, 0.0, , 1.0, 0.0, 0.0};;
"mapping information" {"WRF.MN.128p.chop2.trf", 128, [128] {0,1,2,3,4,5,6,7,8,9,10,11,...,125,126,127}};;
"configuration files" {"", "", "collectives.cfg", ""};;
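Since the example machine has 128 nodes with identical "node information" records, such a file is easier to generate than to type. The following is a minimal sketch: the two helper functions are hypothetical, the field order follows the record definitions on slides 18 and 19, and the remote startup (latency) value must be supplied by the caller because it is missing from the transcribed example.

```python
def environment_record(number_of_nodes, network_bandwidth,
                       number_of_buses=0, communication_group_model=3):
    """Format an "environment information" record (fields as on slide 18)."""
    return ('"environment information" {"", 0, "", %d, %.1f, %d, %d};;'
            % (number_of_nodes, network_bandwidth, number_of_buses,
               communication_group_model))


def node_record(node_id, remote_startup, processors=1, in_links=1, out_links=1):
    """Format a "node information" record (fields as on slide 19).

    remote_startup has no default: its value is not recoverable from the
    transcribed example, so any number passed here is the caller's choice.
    """
    return ('"node information" {0, %d, "", %d, %d, %d, 0.0, %f, 1.0, 0.0, 0.0};;'
            % (node_id, processors, in_links, out_links, remote_startup))


def machine_config(number_of_nodes, network_bandwidth, remote_startup):
    """Return the environment record followed by one record per node."""
    lines = [environment_record(number_of_nodes, network_bandwidth)]
    lines += [node_record(n, remote_startup) for n in range(number_of_nodes)]
    return "\n".join(lines)
```

A call such as `machine_config(128, 250.0, 0.000008)` produces the environment line of the example plus 128 node lines; the latency of 8 microseconds in that call is an arbitrary illustrative value.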

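21 Application Analysis. Questions answered by varying the simulated network: Ideal network? (BW = ∞, L = 0). Group messages? Bandwidth problem? Concurrent communication problems? (BW = target, L = target, buses = 1, 2, ...). Further settings on the slide read "BW =, L = 0" and "L =, BW =". Figure: timelines of the real run and of the ideal network, with phases labeled allgather + sendrecv, alltoall, allreduce, waitall and sendrecv.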

22 Hands-on session. The directory intro2dimemas contains a guidelines document that you can apply to the WRF.128p trace or to your own. The slide compares the original and simulated traces for the WRF.128p case. Figure panels: real MareNostrum; Dimemas prediction for MareNostrum.

23 Hands-on session. Figure panels: BW 5 MB/s; BW 10 MB/s; L 100 us; BW 250 MB/s. How sensitive is the application to the different factors (latency, BW, ...)? In different parts of the trace?

24 Hands-on session. Figure panels: busses; 2 links; BW 5 MB/s; BW 10 MB/s; 2 busses; BW 250 MB/s. Relationship between bandwidth, injectors and contention. Amount of contention? Endpoint contention?

25 Application Analysis: endpoint contention. Simulation with Dimemas of the PEPC exchange phase: very low BW, 1 output link, input links. Everybody sends in destination rank order, which causes endpoint contention at low-ranked processes. Recommendation: it is important to schedule communications.
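One common way to schedule an exchange so that senders do not pile up on the same destination is to rotate the destination list by the sender's rank. The sketch below contrasts that with the rank-ordered pattern described on the slide. It illustrates the general technique only; it is not taken from PEPC, and all names are hypothetical.

```python
def naive_order(rank, num_procs):
    """Every rank sends to destinations 0, 1, 2, ... (skipping itself)."""
    return [d for d in range(num_procs) if d != rank]


def staggered_order(rank, num_procs):
    """Rank r sends to r+1, r+2, ... (mod P): distinct targets at each step."""
    return [(rank + k) % num_procs for k in range(1, num_procs)]


def max_senders_per_destination(order_fn, num_procs, step):
    """Largest number of ranks targeting the same destination at one step."""
    counts = {}
    for rank in range(num_procs):
        dest = order_fn(rank, num_procs)[step]
        counts[dest] = counts.get(dest, 0) + 1
    return max(counts.values())
```

With 8 processes, the first step of the rank-ordered pattern has 7 ranks sending to rank 0, while the rotated pattern has exactly one sender per destination at every step.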

26 Speedup model.
$eff_i = \frac{T_i}{T}$, $LB = \frac{\sum_{i=1}^{P} eff_i}{P \cdot \max(eff_i)}$, $CommEff = \max(eff_i)$
Directly from real execution metrics:
$Sup = \frac{P}{P_0} \cdot \frac{LB}{LB_0} \cdot \frac{CommEff}{CommEff_0} \cdot \frac{IPC}{IPC_0} \cdot \frac{\#instr_0}{\#instr}$
Requires Dimemas simulation:
$Sup = \frac{P}{P_0} \cdot \frac{macroLB}{macroLB_0} \cdot \frac{microLB}{microLB_0} \cdot \frac{CommEff}{CommEff_0} \cdot \frac{IPC}{IPC_0} \cdot \frac{\#instr_0}{\#instr}$
Slide annotations: migrating/local load imbalance; serialization.
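The first form of the model needs only per-process computation times and the elapsed time. A minimal sketch follows, assuming `compute_times` holds the T_i values and `total_time` is T. The function names and the example numbers in the usage note are illustrative, not measurements.

```python
def efficiencies(compute_times, total_time):
    """Return (eff, LB, CommEff) from per-process times T_i and elapsed time T."""
    eff = [t / total_time for t in compute_times]          # eff_i = T_i / T
    load_balance = sum(eff) / (len(eff) * max(eff))        # LB
    comm_eff = max(eff)                                    # CommEff
    return eff, load_balance, comm_eff


def speedup(p, p0, lb, lb0, comm_eff, comm_eff0, ipc, ipc0, instr, instr0):
    """Sup as a product of ratios against the reference run (subscript 0)."""
    return ((p / p0) * (lb / lb0) * (comm_eff / comm_eff0)
            * (ipc / ipc0) * (instr0 / instr))
```

For four processes computing for 8, 6, 4 and 6 time units in a run of 10 units, this gives LB = 0.75 and CommEff = 0.8. Their product, 0.6, equals the average of the per-process efficiencies.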

27 Parametric studies: estimating the impact of different factors. GADGET: ideal speeding up of ALL the computation bursts by the ratio factor. The more processes, the less speedup (higher impact of bandwidth limitations). Figure: three plots of speedup versus bandwidth (MB/s) and ratio.

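28 Parametric studies. Script sweeping the network bandwidth:
#!/bin/sh
# Sweep the network bandwidth over powers of two, 2^6 to 2^14 MB/s.
echo bw time
for log_bw in $(seq 6 14)
do
    i=$((1 << log_bw))
    # Instantiate the template by replacing the BWREF placeholder.
    sed "s/BWREF/$i.0/g" machine.ref.cfg > tmp.cfg
    # Print the bandwidth and the last field of the "Execu" line of the output.
    echo $i `Dimemas -S 32K tmp.cfg | grep Execu | awk '{print $NF}'`
    rm tmp.cfg
done
machine.ref.cfg:
"environment information" {"", 0, "", 128, BWREF, 0, 3};;
"node information" {0, 0, "", 1, 1, 1, 0.0, , 1.0, 0.0, 0.0};;
"node information" {0, 1, "", 1, 1, 1, 0.0, , 1.0, 0.0, 0.0};;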
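The template substitution in the shell loop above can also be done in memory, which is convenient when several parameters are swept at once. The following is a minimal sketch that only renders the configurations and does not run the simulator. `bandwidth_sweep` is a hypothetical helper, and the `BWREF` placeholder is the one used in the reference configuration.

```python
def bandwidth_sweep(template, log_lo=6, log_hi=14, placeholder="BWREF"):
    """Return [(bandwidth_mb_s, config_text), ...] for powers of two.

    The default range 2**6 .. 2**14 mirrors the shell loop; the template
    is any configuration text containing the placeholder.
    """
    configs = []
    for log_bw in range(log_lo, log_hi + 1):
        bw = 2 ** log_bw
        # Same substitution as the sed command: placeholder -> "<bw>.0".
        configs.append((bw, template.replace(placeholder, f"{bw}.0")))
    return configs
```

Each rendered text can then be written to a temporary file and passed to the simulator, as the script does with tmp.cfg.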

29 Estimating impact. Hybrid GADGET (128 processes): speeding up SELECTED code regions by the ratio factor. Figure: profile of % of computation time and % of elapsed time per code region, and plots of speedup versus bandwidth (MB/s) and ratio; one label reads 93.67%. We do need to overcome the hybrid Amdahl's law: asynchrony + load balancing mechanisms!

30 Using block factors. Steps: clusterize with option b; convert to trf; specify block performance factors (the time of a block is divided by its factor); simulate. Example: WRF.NM.128p.chop2.prv clusterized with Cluster.I.IPC.xml; prediction speeding up cluster 2 by 100x.

31 Dimemas GUI: block factors.
"environment information" {"", 0, "", 128, 250.0, 0, 3};;
"node information" {0, 0, "", 1, 1, 1, 0.0, , 1.0, 0.0, 0.0};;
"node information" {0, 1, "", 1, 1, 1, 0.0, , 1.0, 0.0, 0.0};;
"modules information" {1007, 100.0};;

32 Application Analysis: serialization. Detected through Dimemas simulation for an ideal interconnect. Precise measurement and prediction result in early detection: runs at small core counts warn of potential problems at large core counts. Relevant computation between communication in sendrecv phases. Figure panels: real run; ideal network.

33 CEPBA tools framework: interconnect evaluation environment. Same tool-chain diagram as slide 2.

34 Interconnect simulation environment. Dimemas: MPI replay; very fast, coarse-grain network model. Venus (IBM): detailed network simulator (routing, protocols). Diagram: the Dimemas simulator (client) and the Venus simulator, through ServerMod (server), interact over a socket; each has its own config file; other labels: traces, routes, mapping, topology, statistics.

35 Multiscale Simulation

36 Multiscale simulation: L2 cache size vs. network bandwidth. Left: IPC of the cluster representatives with different L2 cache sizes. Right: application execution time with different network bandwidths. Figure labels: 64KB, 4MB, 512MB; 125Mb/s, 250Mb/s, 500Mb/s. VAC and WRF: dominated by computation phases; the impact of the network is negligible. NAS BT: network bandwidth is more significant; an L2 size reduction can be compensated by an increase in network bandwidth.


Collective Communication in MPI and Advanced Features Collective Communication in MPI and Advanced Features Pacheco s book. Chapter 3 T. Yang, CS240A. Part of slides from the text book, CS267 K. Yelick from UC Berkeley and B. Gropp, ANL Outline Collective

More information

Seismic Code. Given echo data, compute under sea map Computation model

Seismic Code. Given echo data, compute under sea map Computation model Seismic Code Given echo data, compute under sea map Computation model designed for a collection of workstations uses variation of RPC model workers are given an independent trace to compute requires little

More information

Introduction to parallel computing concepts and technics

Introduction to parallel computing concepts and technics Introduction to parallel computing concepts and technics Paschalis Korosoglou (support@grid.auth.gr) User and Application Support Unit Scientific Computing Center @ AUTH Overview of Parallel computing

More information

"Charting the Course to Your Success!" MOC A Developing High-performance Applications using Microsoft Windows HPC Server 2008

Charting the Course to Your Success! MOC A Developing High-performance Applications using Microsoft Windows HPC Server 2008 Description Course Summary This course provides students with the knowledge and skills to develop high-performance computing (HPC) applications for Microsoft. Students learn about the product Microsoft,

More information

Tutorial: Application MPI Task Placement

Tutorial: Application MPI Task Placement Tutorial: Application MPI Task Placement Juan Galvez Nikhil Jain Palash Sharma PPL, University of Illinois at Urbana-Champaign Tutorial Outline Why Task Mapping on Blue Waters? When to do mapping? How

More information

Programming Scalable Systems with MPI. Clemens Grelck, University of Amsterdam

Programming Scalable Systems with MPI. Clemens Grelck, University of Amsterdam Clemens Grelck University of Amsterdam UvA / SurfSARA High Performance Computing and Big Data Course June 2014 Parallel Programming with Compiler Directives: OpenMP Message Passing Gentle Introduction

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Parallel Computer Architecture Spring Shared Memory Multiprocessors Memory Coherence

Parallel Computer Architecture Spring Shared Memory Multiprocessors Memory Coherence Parallel Computer Architecture Spring 2018 Shared Memory Multiprocessors Memory Coherence Nikos Bellas Computer and Communications Engineering Department University of Thessaly Parallel Computer Architecture

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Introduction to parallel Computing

Introduction to parallel Computing Introduction to parallel Computing VI-SEEM Training Paschalis Paschalis Korosoglou Korosoglou (pkoro@.gr) (pkoro@.gr) Outline Serial vs Parallel programming Hardware trends Why HPC matters HPC Concepts

More information

Prof. Thomas Sterling

Prof. Thomas Sterling High Performance Computing: Concepts, Methods & Means Performance 3 : Measurement Prof. Thomas Sterling Department of Computer Science Louisiana i State t University it February 27 th, 2007 Term Projects

More information

Lecture 14: Mixed MPI-OpenMP programming. Lecture 14: Mixed MPI-OpenMP programming p. 1

Lecture 14: Mixed MPI-OpenMP programming. Lecture 14: Mixed MPI-OpenMP programming p. 1 Lecture 14: Mixed MPI-OpenMP programming Lecture 14: Mixed MPI-OpenMP programming p. 1 Overview Motivations for mixed MPI-OpenMP programming Advantages and disadvantages The example of the Jacobi method

More information

LOAD BALANCING DISTRIBUTED OPERATING SYSTEMS, SCALABILITY, SS Hermann Härtig

LOAD BALANCING DISTRIBUTED OPERATING SYSTEMS, SCALABILITY, SS Hermann Härtig LOAD BALANCING DISTRIBUTED OPERATING SYSTEMS, SCALABILITY, SS 2016 Hermann Härtig LECTURE OBJECTIVES starting points independent Unix processes and block synchronous execution which component (point in

More information

Cloud Computing CS

Cloud Computing CS Cloud Computing CS 15-319 Programming Models- Part I Lecture 4, Jan 25, 2012 Majd F. Sakr and Mohammad Hammoud Today Last 3 sessions Administrivia and Introduction to Cloud Computing Introduction to Cloud

More information

Approaches to Performance Evaluation On Shared Memory and Cluster Architectures

Approaches to Performance Evaluation On Shared Memory and Cluster Architectures Approaches to Performance Evaluation On Shared Memory and Cluster Architectures Peter Strazdins (and the CC-NUMA Team), CC-NUMA Project, Department of Computer Science, The Australian National University

More information

Tree-Based Density Clustering using Graphics Processors

Tree-Based Density Clustering using Graphics Processors Tree-Based Density Clustering using Graphics Processors A First Marriage of MRNet and GPUs Evan Samanas and Ben Welton Paradyn Project Paradyn / Dyninst Week College Park, Maryland March 26-28, 2012 The

More information

Standard promoted by main manufacturers Fortran. Structure: Directives, clauses and run time calls

Standard promoted by main manufacturers   Fortran. Structure: Directives, clauses and run time calls OpenMP Introducción Directivas Regiones paralelas Worksharing sincronizaciones Visibilidad datos Implementación OpenMP: introduction Standard promoted by main manufacturers http://www.openmp.org, http://www.compunity.org

More information

Tutorial: Analyzing MPI Applications. Intel Trace Analyzer and Collector Intel VTune Amplifier XE

Tutorial: Analyzing MPI Applications. Intel Trace Analyzer and Collector Intel VTune Amplifier XE Tutorial: Analyzing MPI Applications Intel Trace Analyzer and Collector Intel VTune Amplifier XE Contents Legal Information... 3 1. Overview... 4 1.1. Prerequisites... 5 1.1.1. Required Software... 5 1.1.2.

More information

Non-Uniform Memory Access (NUMA) Architecture and Multicomputers

Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico February 29, 2016 CPD

More information

Outline. Overview Theoretical background Parallel computing systems Parallel programming models MPI/OpenMP examples

Outline. Overview Theoretical background Parallel computing systems Parallel programming models MPI/OpenMP examples Outline Overview Theoretical background Parallel computing systems Parallel programming models MPI/OpenMP examples OVERVIEW y What is Parallel Computing? Parallel computing: use of multiple processors

More information

COMP Superscalar. COMPSs Tracing Manual

COMP Superscalar. COMPSs Tracing Manual COMP Superscalar COMPSs Tracing Manual Version: 2.4 November 9, 2018 This manual only provides information about the COMPSs tracing system. Specifically, it illustrates how to run COMPSs applications with

More information

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed

More information

Performance Analysis of MPI-programs. 4. Characteristics and methods of debugging for parallel programs.

Performance Analysis of MPI-programs. 4. Characteristics and methods of debugging for parallel programs. Performance Analysis of MPI-programs (Originally was written in Russian. Sample.) 4. Characteristics and methods of debugging for parallel programs. 4.1 Main performance characteristics. Possibility to

More information

Exercises: April 11. Hermann Härtig, TU Dresden, Distributed OS, Load Balancing

Exercises: April 11. Hermann Härtig, TU Dresden, Distributed OS, Load Balancing Exercises: April 11 1 PARTITIONING IN MPI COMMUNICATION AND NOISE AS HPC BOTTLENECK LOAD BALANCING DISTRIBUTED OPERATING SYSTEMS, SCALABILITY, SS 2017 Hermann Härtig THIS LECTURE Partitioning: bulk synchronous

More information

Composite Metrics for System Throughput in HPC

Composite Metrics for System Throughput in HPC Composite Metrics for System Throughput in HPC John D. McCalpin, Ph.D. IBM Corporation Austin, TX SuperComputing 2003 Phoenix, AZ November 18, 2003 Overview The HPC Challenge Benchmark was announced last

More information

6.189 IAP Lecture 5. Parallel Programming Concepts. Dr. Rodric Rabbah, IBM IAP 2007 MIT

6.189 IAP Lecture 5. Parallel Programming Concepts. Dr. Rodric Rabbah, IBM IAP 2007 MIT 6.189 IAP 2007 Lecture 5 Parallel Programming Concepts 1 6.189 IAP 2007 MIT Recap Two primary patterns of multicore architecture design Shared memory Ex: Intel Core 2 Duo/Quad One copy of data shared among

More information

ECE 574 Cluster Computing Lecture 13

ECE 574 Cluster Computing Lecture 13 ECE 574 Cluster Computing Lecture 13 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 21 March 2017 Announcements HW#5 Finally Graded Had right idea, but often result not an *exact*

More information

Three basic multiprocessing issues

Three basic multiprocessing issues Three basic multiprocessing issues 1. artitioning. The sequential program must be partitioned into subprogram units or tasks. This is done either by the programmer or by the compiler. 2. Scheduling. Associated

More information

Non-Uniform Memory Access (NUMA) Architecture and Multicomputers

Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico September 26, 2011 CPD

More information

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004 A Study of High Performance Computing and the Cray SV1 Supercomputer Michael Sullivan TJHSST Class of 2004 June 2004 0.1 Introduction A supercomputer is a device for turning compute-bound problems into

More information

Blocking SEND/RECEIVE

Blocking SEND/RECEIVE Message Passing Blocking SEND/RECEIVE : couple data transfer and synchronization - Sender and receiver rendezvous to exchange data P P SrcP... x : =... SEND(x, DestP)... DestP... RECEIVE(y,SrcP)... M F

More information

Purity: An Integrated, Fine-Grain, Data- Centric, Communication Profiler for the Chapel Language

Purity: An Integrated, Fine-Grain, Data- Centric, Communication Profiler for the Chapel Language Purity: An Integrated, Fine-Grain, Data- Centric, Communication Profiler for the Chapel Language Richard B. Johnson and Jeffrey K. Hollingsworth Department of Computer Science, University of Maryland,

More information

Lecture 7: Parallel Processing

Lecture 7: Parallel Processing Lecture 7: Parallel Processing Introduction and motivation Architecture classification Performance evaluation Interconnection network Zebo Peng, IDA, LiTH 1 Performance Improvement Reduction of instruction

More information

Assessment of LS-DYNA Scalability Performance on Cray XD1

Assessment of LS-DYNA Scalability Performance on Cray XD1 5 th European LS-DYNA Users Conference Computing Technology (2) Assessment of LS-DYNA Scalability Performance on Cray Author: Ting-Ting Zhu, Cray Inc. Correspondence: Telephone: 651-65-987 Fax: 651-65-9123

More information

Chapter 5. Multiprocessors and Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

Enabling Scalable Parallel Processing of Venus/OMNeT++ Network Models on the IBM Blue Gene/Q Supercomputer

Enabling Scalable Parallel Processing of Venus/OMNeT++ Network Models on the IBM Blue Gene/Q Supercomputer Enabling Scalable Parallel Processing of Venus/OMNeT++ Network Models on the IBM Blue Gene/Q Supercomputer Chris Carothers, Elsa Gonsiorowski and Justin LaPre Center for Computational Innovations Rensselaer

More information

HPX. High Performance ParalleX CCT Tech Talk Series. Hartmut Kaiser

HPX. High Performance ParalleX CCT Tech Talk Series. Hartmut Kaiser HPX High Performance CCT Tech Talk Hartmut Kaiser (hkaiser@cct.lsu.edu) 2 What s HPX? Exemplar runtime system implementation Targeting conventional architectures (Linux based SMPs and clusters) Currently,

More information