Going further with Damaris: Energy/Performance Study in Post-Petascale I/O Approaches

1 Going further with Damaris: Energy/Performance Study in Post-Petascale I/O Approaches
Matthieu Dorier, Orçun Yildiz, Shadi Ibrahim, Gabriel Antoniu, Anne-Cécile Orgerie
2nd workshop of the JLESC, Chicago, November

2 Challenge (2009): make CM1's I/O scale for the future Blue Waters system (image credit: Leigh Orf, Bob Wilhelmson)

3 Traditional I/O approach. [Diagram omitted.] Periodic checkpoints from 100,000+ processes produce too much data and too many files; analysis happens offline, after the simulation, on another cluster (10,000+ processes) after a data transfer; 100 to 1000 I/O servers absorb the I/O bursts.

4 Time-partitioning I/O (or why you didn't get results in time for the deadline): the simulation periodically stops to perform I/O.
- File-per-process approach: too many files, hard to read back, high metadata overhead
- Collective I/O approach: requires synchronization and data communication steps
(A sketch of the two variants follows below.)
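As an illustration of the two variants, here is a minimal, self-contained MPI-IO sketch; it is not CM1's actual output code, and the file names, sizes and checkpoint layout are made up. Both calls block the whole simulation while they run, which is precisely the cost of time partitioning.

```c
#include <mpi.h>
#include <stdio.h>

#define N 1024  /* doubles written per process (illustrative) */

/* File-per-process: trivial to write, but 100,000 ranks means
 * 100,000 files and heavy metadata load on the file system. */
static void write_file_per_process(int rank, const double* data) {
    char path[64];
    snprintf(path, sizeof(path), "ckpt_%06d.dat", rank);
    MPI_File f;
    MPI_File_open(MPI_COMM_SELF, path,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &f);
    MPI_File_write(f, data, N, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&f);
}

/* Collective I/O: one shared file, easy to read back, but every rank
 * must take part in the call (synchronization + data communication). */
static void write_collective(int rank, const double* data) {
    MPI_File f;
    MPI_File_open(MPI_COMM_WORLD, "ckpt.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &f);
    MPI_Offset off = (MPI_Offset)rank * N * sizeof(double);
    MPI_File_write_at_all(f, off, data, N, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&f);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double data[N];
    for (int i = 0; i < N; i++) data[i] = rank + i;

    write_file_per_process(rank, data);  /* variant 1 */
    write_collective(rank, data);        /* variant 2 */

    MPI_Finalize();
    return 0;
}
```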

5 Solution: Damaris

6 Damaris in a few key concepts
- Dedicated I/O cores: these cores do not perform any computation
- Shared memory: improves memory usage by avoiding copies
- Plugin system: adaptability/flexibility, connection with visualization software
- Simple API and external XML-based metadata (a sketch follows below)
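For illustration, here is a minimal sketch of what the client side of a Damaris-instrumented simulation can look like. The calls (damaris_initialize, damaris_start, damaris_client_comm_get, damaris_write, damaris_end_iteration, damaris_stop, damaris_finalize) are taken from the public Damaris C API as documented for later releases; their exact availability in the 1.0 release described here, as well as the variable name, grid size and XML file name, are assumptions.

```c
#include <mpi.h>
#include "Damaris.h"

#define N      32    /* local grid size per dimension (illustrative) */
#define MAX_IT 100   /* number of simulation iterations (illustrative) */

/* Stand-in for CM1's compute phase (hypothetical). */
static void compute_step(MPI_Comm comm, double* field, int n) {
    (void)comm;
    for (int i = 0; i < n; i++) field[i] += 1.0;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    /* The external XML file declares variables, layouts and plugins. */
    damaris_initialize("cm1_damaris.xml", MPI_COMM_WORLD);

    int is_client = 0;
    damaris_start(&is_client);  /* dedicated cores enter their event loop here */
    if (is_client) {
        MPI_Comm comm;
        damaris_client_comm_get(&comm);  /* communicator without the I/O cores */

        double field[N * N * N] = {0};
        for (int it = 0; it < MAX_IT; it++) {
            compute_step(comm, field, N * N * N);
            /* The write only places data in shared memory: no extra copy,
             * no blocking on the file system. */
            damaris_write("pressure", field);
            damaris_end_iteration();  /* plugins / VisIt run on the I/O cores */
        }
        damaris_stop();  /* releases the dedicated cores */
    }

    damaris_finalize();
    MPI_Finalize();
    return 0;
}
```

The point to notice is that damaris_write only hands data over through shared memory, so the simulation never blocks on the file system; the dedicated cores, which stay inside damaris_start until damaris_stop is called, perform the actual I/O or in situ processing.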

7 Damaris in the context of the Joint Lab
- Starting point, Nov. 2009: preliminary discussions on I/O challenges for Blue Waters at the 2nd JLPC workshop
- First steps, 2010: start of the collaboration of the KerData team (INRIA) with NCSA; Matthieu Dorier's MS internship at UIUC with Franck Cappello and Marc Snir
- 2011: collaboration extended to ANL, ongoing since (Rob Ross, Tom Peterka); other internships and mutual visits followed
- Damaris at the core of several joint projects:
  - 2012: FACCTS, PIs: Rob Ross, Gabriel Antoniu
  - Data@Exascale Associate Team (INRIA, ANL, UIUC), PIs: Gabriel Antoniu (INRIA), Rob Ross (ANL), Marc Snir (UIUC)
  - a WP within the NextGN PUF ANL-INRIA (+partners)
  - other joint projects in preparation

8 Evolution of Damaris. [Timeline figure omitted.] Development started with the time-partitioning version (ICS ACM SRC 2nd Prize, Cluster 2012, IPDPS 2013 PhD forum), followed by in situ visualization enabled with VisIt (LDAV 2013) and then dedicated nodes (IPDPS 2014 PhD forum, DIDC 2014). Applications and platforms over time: CM1, OLAM, GTC, Nek5000 on Grid 5000, Kraken, Titan, Intrepid and Blue Waters.

9 People involved
- INRIA: Matthieu Dorier, Gabriel Antoniu, Lokman Rahmani, Shadi Ibrahim, Orçun Yildiz, Anne-Cécile Orgerie
- NCSA: Roberto Sisneros, Dave Semeraro
- ANL: Franck Cappello, Marc Snir, Rob Ross, Tom Peterka, Dries Kimpe
- Internships: Matthieu Dorier (1st year master), Matthieu Perin (1st year master), Sergiu Vicol (1st year bachelor), Catalina Nita (1st year master), Orçun Yildiz (2nd year master)
- External users/contributors: Leigh Orf (Central Michigan), Francieli Zanon Boito (UFRGS), Rodrigo Kassick (UFRGS)

10 Damaris 1.0: state of the implementation
- 3 modes: synchronous (time-partitioning), dedicated core(s), dedicated node(s)
- Very simple API for C, C++ and Fortran simulations
- XML-based data description
- Shared memory can be enabled/disabled
- Plugin system (C++ plugins)
- Connection to VisIt for in situ visualization
- About 20,000 lines of C++ code, based on MPI
- Depends on Boost, Xerces-C, XSD (optionally VisIt)
- http://damaris.gforge.inria.fr
- Potential plans for integration within the VisIt package
- Potential plans for use as one of the default backends in CM1

11 Three I/O approaches in Damaris: time partitioning, dedicated core(s), dedicated node(s). Switch between modes using the configuration file: <dedicated cores="X" nodes="Y" />
- Time partitioning: good at small scale, bad at larger scales
- Dedicated cores: good when there are many cores per node and the memory can be afforded
- Dedicated nodes: good when there are few cores per node and the memory of one node is enough
(See the configuration sketch below.)
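As an illustrative sketch, a configuration file built around the <dedicated> element above could look as follows. Only <dedicated cores="..." nodes="..."/> comes from the slide; the surrounding elements (<architecture>, <buffer>, <data>, <layout>, <variable>) follow the structure of the Damaris documentation and should be treated as assumptions for the 1.0 schema, as should all attribute values.

```xml
<?xml version="1.0"?>
<simulation name="cm1" language="c">
  <architecture>
    <!-- The only thing to change when switching I/O approach:
         cores="0" nodes="0"  -> time partitioning (synchronous)
         cores="1" nodes="0"  -> one dedicated core per node
         cores="0" nodes="2"  -> dedicated nodes -->
    <dedicated cores="1" nodes="0" />
    <!-- Shared-memory segment used to pass data without copies
         (name and size are assumed values). -->
    <buffer name="data-buffer" size="268435456" />
  </architecture>
  <data>
    <!-- Layout and variable written via damaris_write("pressure", ...) -->
    <layout name="cells" type="double" dimensions="32,32,32" />
    <variable name="pressure" layout="cells" />
  </data>
</simulation>
```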

12 Focus of this talk: how much does the I/O approach impact energy efficiency? Other talks related to this collaboration: Lokman Rahmani, Gabriel Antoniu

13 Goals: study the impact of the I/O approach (dedicated cores, nodes, etc.), of the I/O frequency (time between I/O phases) and of the underlying architecture on simulation performance and energy consumption.

14 Experimental setup: CM1 on Grid 5000
- Nancy site: Graphene cluster (4 cores/node), 20G InfiniBand network, 6 PVFS servers, EATON Power Distribution Units; CM1 on 32 nodes (128 cores)
- Rennes site: Parapluie cluster (24 cores/node), 20G InfiniBand network, 3 PVFS servers, EATON Power Distribution Units; CM1 on 16 nodes (384 cores)

15 Impact of the I/O approach (G5K/Nancy). [Plot omitted; lower = better.] Takeaway: a longer run time plus I/O variability translates into lower power usage.

16 Impact of the I/O frequency (G5K/Nancy). [Plot omitted; lower = better.]
- Time-partitioning: linear dependency between the I/O frequency and the energy consumption
- Dedicated resources: when doing I/O every 10 iterations, DC(1), DC(2) and DN(7:1) cannot keep up, resulting in higher energy consumption
- When the dedicated resources can keep up, the energy consumption no longer depends on the I/O

17 Impact of the architecture (G5K/Nancy, Rennes). [Plot omitted; lower = better.]
- Dedicating 1 core is the best approach on the Rennes site (24 cores per node)
- Dedicating 1 node out of every 8 is the best approach on the Nancy site (4 cores per node)
- A different number of cores per node means a different optimal I/O approach

18 Overall power/runtime results. [Plots omitted.] Best configurations: Nancy, 1 dedicated node for 7 simulation nodes; Rennes, 1 dedicated core per 24-core node.

19 Can we model the energy efficiency under different I/O approaches? Can we predict the best one?

20 Model hypotheses: the application is computation-intensive, and I/O is fully overlapped with computation.

21 Energy model: general case. The energy consumed by a run is

$E = T_{\mathrm{sim}} \cdot P_{\mathrm{sim}}$

With $T_{\mathrm{base}}$ the time for 1 iteration on 1 core and $n_{\mathrm{iterations}}$ the number of iterations, the simulation time is

$T_{\mathrm{sim}} = \dfrac{T_{\mathrm{base}} \cdot n_{\mathrm{iterations}}}{\big(n_{\mathrm{cores/node}} \cdot s_{\mathrm{core}}(n_{\mathrm{cores/node}})\big) \cdot \big(n_{\mathrm{nodes}} \cdot s_{\mathrm{node}}(n_{\mathrm{nodes}})\big)}$

where $s_{\mathrm{core}}$ and $s_{\mathrm{node}}$ are scalability functions w.r.t. the number of cores per node and the number of nodes.

Simplification 1: the scalability w.r.t. the number of cores per node does not depend on the number of nodes, and the scalability w.r.t. the number of nodes does not depend on the number of cores per node.

22 Energy model for dedicated nodes. With $P_{\mathrm{max}}$ the max power of a node, $P_{\mathrm{idle}}$ its idle power, $c$ the number of simulation nodes and $d$ the number of dedicated nodes:

$P_{\mathrm{sim}} = \dfrac{P_{\mathrm{max}} \cdot c + \frac{1}{2}(P_{\mathrm{idle}} + P_{\mathrm{max}}) \cdot d}{c + d}$

Simplification 2: the power of a dedicated node is the average of the max and idle powers.

23 Energy model for dedicated cores:

$P_{\mathrm{sim}} = P_{\mathrm{max}}$

Simplification 3: the power of a node running the simulation does not change significantly when some of its cores are dedicated to I/O. (A numerical sketch of the full model follows below.)
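Putting slides 21-23 together, here is a small numerical sketch of the model in C. The calibration constants and the perfect-scaling s_core/s_node functions are placeholders (slide 24 measures the real ones), the node counts loosely mirror the Rennes setup, and none of the printed numbers are the study's results.

```c
#include <stdio.h>

/* Placeholder calibration constants (slide 24 measures the real ones). */
#define T_BASE   2.0    /* time for 1 iteration on 1 core, seconds (assumed) */
#define P_MAX  220.0    /* max power of a node, watts (assumed)  */
#define P_IDLE 100.0    /* idle power of a node, watts (assumed) */

/* Scalability functions w.r.t. cores/node and nodes; Simplification 1 lets
 * us keep them independent. Perfect scaling here, purely as a placeholder. */
static double s_core(int cores) { (void)cores; return 1.0; }
static double s_node(int nodes) { (void)nodes; return 1.0; }

/* T_sim = T_base * n_iter / ((cores * s_core(cores)) * (nodes * s_node(nodes))) */
static double t_sim(int n_iter, int cores, int nodes) {
    return T_BASE * n_iter
         / ((cores * s_core(cores)) * (nodes * s_node(nodes)));
}

/* Dedicated nodes: average node power with c simulation nodes and d
 * dedicated nodes at (P_idle + P_max)/2 each (Simplification 2). */
static double p_sim_dedicated_nodes(int c, int d) {
    return (P_MAX * c + 0.5 * (P_IDLE + P_MAX) * d) / (c + d);
}

int main(void) {
    const int n_iter = 1000, total_nodes = 16, cores = 24; /* Rennes-like */

    /* Dedicated cores: all nodes run at P_max (Simplification 3),
     * one core per node is lost to I/O. */
    double e_dc = t_sim(n_iter, cores - 1, total_nodes)
                * P_MAX * total_nodes;

    /* Dedicated nodes DN(7:1): 14 simulation nodes + 2 dedicated ones. */
    double e_dn = t_sim(n_iter, cores, 14)
                * p_sim_dedicated_nodes(14, 2) * total_nodes;

    printf("E(DC(1))   = %.0f J\n", e_dc);
    printf("E(DN(7:1)) = %.0f J\n", e_dn);
    printf("predicted best: %s\n",
           e_dc < e_dn ? "dedicated cores" : "dedicated nodes");
    return 0;
}
```

Under these placeholder assumptions the dedicated-core variant comes out ahead, which happens to match the Rennes observation; in practice it is the measured scalability and power values that decide between the approaches.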

24 Model calibration (with CM1 on G5K/Rennes): measured the scalability of CM1 w.r.t. the number of cores per node and w.r.t. the number of nodes, and the power consumption of 8 nodes on the Rennes site of G5K (max power and idle power). [Plots omitted.]

25 Model validation (CM1 on G5K/Rennes). Five runs (error bars = min-max). Worst relative error between model and observation: 4%. Larger variations with DN(7:1), probably due to network contention. Best approach predicted (and observed): 1 dedicated core per node.

26 Model validation (CM1 on G5K/Nancy). Five runs (error bars = min-max). Worst relative error between model and observation: 5.7%. Best approach predicted (and observed): 1 dedicated node for 7 non-dedicated nodes.

27 Model accuracy: summary

Site   | Approach               | Accuracy
-------|------------------------|---------
Rennes | Dedicated cores (1)    | 96.0%
Rennes | Dedicated cores (2)    | 96.9%
Rennes | Dedicated nodes (15:1) | 97.3%
Rennes | Dedicated nodes (7:1)  | 98.0%
Nancy  | Dedicated cores (1)    | 95.0%
Nancy  | Dedicated nodes (15:1) | 94.3%
Nancy  | Dedicated nodes (7:1)  | 95.0%

28 Conclusion

29 Conclusion and future directions
Contributions:
- Insight into the energy/performance of I/O approaches, all available within Damaris
- Energy model for dedicated cores and dedicated nodes
- Validation on Grid 5000 with CM1
Model limitations:
- Valid for computation-intensive applications
- Does not include network-related energy consumption
- Does not include the energy consumption of the storage system
Future work:
- Validation with other simulations
- Tradeoff between compression, performance and energy consumption

30 A bit of advertisement: Darshan-Web. Demo: Installation tutorial:
