Cluster Nazionale CSN4


CCR Workshop - Stato e Prospettive del Calcolo Scientifico (Status and Prospects of Scientific Computing)

Cluster Nazionale CSN4: parallel computing and scientific activity

Roberto Alfieri - Parma University & INFN, Gr.Coll. di Parma
LNL - 16/02/2011

Outline

- PC clusters in the theoretical physics community
- The CSN4cluster project
- New granularity attributes on the Grid
- CSN4cluster job submission
- Examples of physics parallel applications executed on the cluster
- Conclusions

Parallel computing in the INFN theoretical physics community

- Lattice simulations (local-communication oriented): APE projects
- General purpose (parallel and serial): PC clusters

2005: many small and medium-sized clusters, often managed by the users themselves
2005-2007: single centralized cluster (CNAF) - 24 Xeon nodes + InfiniBand
2007-2009: 4 PC clusters based on federated projects - Grid access for serial jobs
2010: single centralized cluster (the CSN4cluster project)

CSN4cluster: timeline

Late 2009: CSN4-CCR collaboration to define the cluster requirements and evaluate site proposals
Feb. 2010: INFN-Pisa project approved
June 2010: cluster in operation for sequential jobs
July 2010: call for scientific proposals
Sept. 2010: 15+1 projects approved and fair-shares defined
Dec. 2010: cluster in operation for parallel jobs

CSN4cluster access: the theophys VO

Access method: via INFNGrid only (both serial and parallel jobs)

Access policy - thanks to G. Andronico (CT):
- theophys VO members (~124 up to now) run with low priority
- theophys subgroups (or others) can apply to the CSN4 cluster committee for a granted fair-share

Active fair-share grants:
- 16 have already been assigned, corresponding to 16 CSN4 IS proposals
- Requests: 130K day*core serial + 250K day*core parallel = 380K day*core
- Availability: 365K day*core per year
- Details: http://wiki.infn.it/cn/csn4/calcolo/

MPI and multi-thread support in EGEE

MPI has always been supported by EGEE, but:
- a survey of users and administrators in April 2009 showed that MPI is still scarcely used
- multi-thread programming should be supported in EGEE to exploit the upcoming multi-core architectures

=> Set-up of a new EGEE MPI-WG. Recommendation document released in 06/2010:
http://www.grid.ie/mpi/wiki/workinggroup

The MPI-WG proposes new JDL attributes for multi-thread support (a complete JDL sketch follows the table):

Attribute            Meaning
CPUNumber=P          Total number of required CPUs
SMPGranularity=C     Minimum number of cores per node
HostNumber=N         Total number of required nodes
WholeNodes=true      Reserve the whole node (all cores)
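As an illustration only, a complete JDL that combines the proposed attributes with the usual job description fields might look like the sketch below; the executable and sandbox file names are hypothetical, not taken from the slides.

# Hypothetical JDL sketch: 2 whole nodes with at least 8 cores each
Type = "Job";
Executable = "my_parallel.sh";
StdOutput = "std.out";
StdError = "std.err";
InputSandbox = {"my_parallel.sh"};
OutputSandbox = {"std.out", "std.err"};
WholeNodes = true;
HostNumber = 2;
SMPGranularity = 8;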

Granularity attributes: JDL examples

WholeNodes=false (not set):

CPUNumber = 24;            # 24 CPUs on any number of nodes (default)

CPUNumber = 64;
SMPGranularity = 2;        # 32 nodes with 2 CPUs per node (SMPsize >= 2)

CPUNumber = 16;
HostNumber = 2;            # 2 nodes with 8 CPUs per node (SMPsize >= 8)

WholeNodes=true:

WholeNodes = true;
HostNumber = 2;
SMPGranularity = 8;        # 2 whole nodes with SMPsize >= 8

WholeNodes = true;
SMPGranularity = 8;        # 1 whole node with SMPsize >= 8 (default HostNumber=1)

WholeNodes = true;
HostNumber = 2;            # 2 whole nodes with SMPsize >= 1 (default SMPGranularity=1)

Granularity support: preliminary patch

The new JDL attributes proposed by the MPI-WG aren't implemented in gLite yet.

A preliminary patch for the CREAM-CE has been developed and tested in collaboration with the gLite middleware developers - thanks to M. Sgaravatto (PD), S. Monforte (CT), A. Gianelle (PD).

The patch has been installed on the CSN4cluster and is now operational with a temporary syntax, waiting for the final integration of the attributes in gLite.

Temporary syntax, JDL examples:

CeRequirements = "wholenodes==\"true\" && hostnumber==2";   # 2 whole nodes

CPUNumber = 16;
CeRequirements = "SMPGranularity==2";                       # 8 nodes with 2 CPUs per node

CeRequirements are interpreted by the CEs; the match-making process is not involved.

CSN4cluster: computational resources

Resources - thanks to E. Mazzoni, A. Ciampa, S. Arezzini (PI):
- 1 CE: gridce3.pi.infn.it (CREAM - LSF)
- 128 WNs, dual Opteron 8356 (2x4 cores per node), 10 TFlops peak performance

In modern multi-core processors the memory architecture is NUMA. CPU/memory affinity is the ability to bind a process to a specific CPU/memory bank (a binding sketch follows the tables below).

Measured network performance (using NetPIPE):

Comm type          Latency    Max bandwidth
1 Intra-socket     640 ns     14 GBytes/s
2 Intra-board      820 ns     12 GBytes/s
3 InfiniBand       3300 ns    11 GBytes/s

Memory performance (peak):

Memory type        Latency    Max bandwidth
L3 cache           35 ns
DDR3               50 ns      32 GBytes/s
NUMA (HT or QPI)   90 ns      11 GBytes/s
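Not on the original slide, but as a minimal illustration of CPU/memory affinity on such a NUMA worker node, the topology can be inspected and a hypothetical application bound to one socket and its local memory bank with numactl:

# Show the NUMA topology of the node: NUMA nodes, their cores and memory
numactl --hardware

# Run a hypothetical application restricted to NUMA node 0
# (both the CPUs and the memory allocations stay on that node)
numactl --cpunodebind=0 --membind=0 ./myapp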

CSN4cluster: resource access

Direct job submission to the CREAM-CE: in the JDL

Requirements = (other.glueceinfohostname == "gridce3.pi.infn.it");

gives access to 2 queues:

theompi: parallel jobs only - reservation time 8h - Role=parallel required
  voms-proxy-init --voms theophys:/theophys/<group_name>/Role=parallel

theoshort: short serial jobs - runtime 4h - Role=parallel must not be specified
  voms-proxy-init --voms theophys:/theophys/<group_name>

The serial queue allows the exploitation of cores when they are not used by parallel jobs. An example submission command line is sketched below.
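For completeness, a direct submission to the CREAM-CE could look roughly like the following sketch; the JDL file name is hypothetical and the exact endpoint/queue identifier (cream-lsf-<queue>) should be checked against the site configuration:

# Create a VOMS proxy with the parallel Role (required for the theompi queue)
voms-proxy-init --voms theophys:/theophys/<group_name>/Role=parallel

# Submit directly to the CREAM-CE with automatic proxy delegation (-a);
# "parallel.jdl" is a hypothetical JDL file, the queue name is site-dependent
glite-ce-job-submit -a -r gridce3.pi.infn.it:8443/cream-lsf-theompi parallel.jdl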

MPI job: explicit submission

Direct job submission means we know the SMP architecture, the MPI flavour, etc.
This example executes 16 MPI ranks (2 whole nodes): ranks p0..p7 run on the first node and p8..p15 on the second, according to the LSF hostfile (how $HF is obtained is sketched after this example).

JDL:

Executable = "mpi.sh";
Requirements = (other.glueceinfohostname == "gridce3.pi.infn.it");
CeRequirements = "wholenodes==\"true\" && hostnumber==2";

mpi.sh:

NP=$(cat $HF | wc --lines)
mpirun -np $NP -hostfile $HF mympi
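The slides do not show how $HF is set; a reasonable assumption on an LSF cluster is that it points to the hostfile the batch system generates for the job (one line per allocated slot), for example:

# Assumption: LSB_DJOB_HOSTFILE is the per-job hostfile provided by LSF
HF=$LSB_DJOB_HOSTFILE
NP=$(cat $HF | wc --lines)
mpirun -np $NP -hostfile $HF mympi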

OpenMP job

WholeNodes allows the submission of multi-threaded jobs on the Grid.
This example executes 8 OpenMP threads (t0..t7) on one whole node. The user should be aware of the potential impact of memory affinity on performance (a pinning sketch follows).

JDL:

Executable = "openmp.sh";
Requirements = (other.glueceinfohostname == "gridce3.pi.infn.it");
CeRequirements = "wholenodes==\"true\" && hostnumber==1";

openmp.sh:

export OMP_NUM_THREADS=8
./myomp
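As a hedged addition not shown on the slide, thread placement on the whole node can also be controlled explicitly, for example with the GNU OpenMP affinity variable or with taskset (the binary is the hypothetical ./myomp from above):

export OMP_NUM_THREADS=8

# Pin the 8 OpenMP threads to cores 0-7 (GNU OpenMP runtime)
export GOMP_CPU_AFFINITY="0-7"
./myomp

# Alternative: restrict the whole process to cores 0-7 with taskset
taskset -c 0-7 ./myomp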

Hybrid MPI-OpenMP job

Hybrid parallel programming on the Grid is enabled too.
This example requires 2 MPI ranks (mpi0, mpi1), one per node; each MPI process launches 8 OpenMP threads (t0..t7). A variant that derives the thread count from the hostfile is sketched after the script.

JDL:

Executable = "mpi_openmp.sh";
Requirements = (other.glueceinfohostname == "gridce3.pi.infn.it");
CeRequirements = "wholenodes==\"true\" && hostnumber==2";

mpi_openmp.sh:

# Generate a new hostfile with one entry per node
cat $HF | sort -u > $HF2
NP=$(cat $HF2 | wc --lines)
mpirun -np $NP -x OMP_NUM_THREADS=8 --hostfile $HF2 mympiomp
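As an assumption-labeled refinement of the script above, the thread count can be derived from the LSF hostfile instead of being hardcoded to 8, so the same wrapper also works on nodes with a different SMPsize:

# Cores per node = total slots in the hostfile / number of distinct nodes
TOTAL_SLOTS=$(cat $HF | wc --lines)
NODES=$(cat $HF | sort -u | wc --lines)
CORES_PER_NODE=$(( TOTAL_SLOTS / NODES ))

cat $HF | sort -u > $HF2
mpirun -np $NODES -x OMP_NUM_THREADS=$CORES_PER_NODE --hostfile $HF2 mympiomp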

Hybrid MPI-OpenMP job with affinity

The Open MPI rankfile supports CPU affinity: binding of an MPI process to a specific core or range of cores.
This example requires 4 MPI ranks (mpi0..mpi3), two per node; each MPI process launches 4 OpenMP threads (t0..t3). A binding-check sketch follows the script.

Rankfile (host names are taken from the new hostfile, one pair of ranks per node):

rank 0=<node1> slot=0-3
rank 1=<node1> slot=4-7
rank 2=<node2> slot=0-3
rank 3=<node2> slot=4-7

JDL:

Executable = "mpi_openmp.bash";
Requirements = (other.glueceinfohostname == "gridce3.pi.infn.it");
CeRequirements = "wholenodes==\"true\" && hostnumber==2";

mpi_openmp.bash:

# Generate the new hostfile (one entry per node)
cat $HF | sort -u > $HF2
# Generate the rankfile: two MPI ranks per node, bound to cores 0-3 and 4-7
awk '{print "rank " i++ "=" $1 " slot=0-3" "\n" "rank " i++ "=" $1 " slot=4-7"}' $HF2 > $RF
NP=$(cat $RF | wc --lines)
mpirun -np $NP -x OMP_NUM_THREADS=4 --hostfile $HF2 --rankfile $RF mpiomp_exec
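Not on the slide, but a useful check when experimenting with rankfiles: Open MPI can report the bindings it actually applies, e.g.

# Diagnostic run: print how each MPI rank was bound to its cores
mpirun -np $NP --report-bindings --hostfile $HF2 --rankfile $RF mpiomp_exec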

MPI job submission via MPI-start

If a higher level of abstraction is needed (i.e. we don't know where the MPI job will land) we have to use the MPI-start wrapper. MPI-start is the submission method recommended by the EGEE MPI-WG.
The current version of MPI-start is not able to manage hybrid MPI-OpenMP applications or memory/CPU affinity.

This example executes 16 MPI ranks (2 whole nodes).

JDL:

Executable = "mpistart_wrapper.sh";
Arguments = "mympi OPENMPI";
InputSandbox = {"mpistart_wrapper.sh", "mpi-hooks.sh", "mympi.c"};
#Requirements = (other.glueceinfohostname == "gridce3.pi.infn.it");
CeRequirements = "wholenodes==\"true\" && hostnumber==2";

- mpistart_wrapper.sh is a standard script; modification is not needed
- mpi-hooks.sh contains the pre- and post-execution scripts (a sketch follows)
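A minimal sketch of what mpi-hooks.sh might contain, assuming the usual MPI-start convention of pre_run_hook/post_run_hook functions and the I2G_MPI_APPLICATION variable exported by the wrapper; the compilation step is illustrative:

#!/bin/bash
# Hypothetical mpi-hooks.sh: compile the shipped source before the run,
# inspect the produced files afterwards.

pre_run_hook () {
  echo "Compiling ${I2G_MPI_APPLICATION}"
  mpicc -o ${I2G_MPI_APPLICATION} ${I2G_MPI_APPLICATION}.c
  return $?
}

post_run_hook () {
  echo "Run finished, listing produced files"
  ls -l
  return 0
}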

Results of real applications: USQCD

Hybrid Monte Carlo simulation of pure gauge SU(3) on a 32x32x32x8 lattice (2000 sweeps), using the publicly available USQCD collaboration Chroma library (http://usqcd.jlab.org/usqcd-docs/chroma/). Thanks to A. Feo (Turin U.).

- Pure MPI code
- Total memory occupation of the grid: ~36 MBytes
- Memory affinity matters when not all the data are in cache
- Cache effect: efficiency > 1 once the data per CPU (~2.25 MBytes) fit in cache

Run times (ranked = with rankfile/affinity):

Np           8 (1x8)    16 (2x8)   32 (4x8)   64 (8x8)   128 (16x8)
Non-ranked   295 min    146 min    62 min     27 min     14 min
Ranked       287 min    139 min    59 min     27 min     14 min

Results of real applications: NumRel

Evolution of a stable general relativistic TOV star using the Einstein Toolkit consortium codes (http://einsteintoolkit.org/). Thanks to R. De Pietri (Parma U.).

Hydrodynamical simulation of a perfect fluid coupled to the full Einstein equations (dynamical space-time) on a 3-dimensional grid with 5 levels of refinement, spanning an octant of radius 177 km with a maximum resolution within the star of 370 m.

- Hybrid MPI-OpenMP code
- Total memory occupation of the grid: ~8 GBytes
- The code is NOT fully parallelized, but the large memory request requires parallelization

Conclusions and future work

The cluster is ready for use. Now we need dissemination events and activities:
- Wiki page: http://wiki.infn.it/cn/csn4/calcolo/csn4cluster/
- Cluster inauguration (April 2011)
- Training course (date subject to fund availability from the INFN Administration)

This model can be extended to other MPI sites after the integration of the granularity attributes in gLite.

MPI-start should support affinity: the current version of MPI-start is not able to manage hybrid MPI-OpenMP applications or memory/CPU affinity.
- Management of CPU/memory affinity is important: the migration of a process away from its allocated memory has an impact on performance.
- Hybrid MPI-OpenMP applications need an adapted MPI-start able to suggest the right number of threads.

Thank you for your attention!