Optimization of non-contiguous MPI-I/O operations

Size: px

Start display at page:

Download "Optimization of non-contiguous MPI-I/O operations"

Rolf Newman
5 years ago
Views:

1 Optimization of non-contiguous MPI-I/O operations Enno Zickler Arbeitsbereich Wissenschaftliches Rechnen Fachbereich Informatik Fakultät für Mathematik, Informatik und Naturwissenschaften Universität Hamburg / 30

2 Outline 1 Non-Contiguous Data Access 2 Data Sieving 3 NCT Library 4 Evaluation 5 Summary & Conclusions 2 / 30

3 Non-Contiguous Data Access 3 / 30

4 What is Non-Contiguous Data Access? access on data which is scattered over the storage common data layouts often lead to non-contiguous access e.g. accessing only one element of a triples in an array non collective parallel access may also induce such access Figure 1: Non-Contiguous Data 4 / 30

5 Why is Non-Contiguous Data Access a problem? generally poor performance on storage systems overhead for many individual I/O calls on the file system long seek time on HDDs as I/O is a major bottleneck in HPC optimization of non-contiguous access is important 5 / 30

6 Data Sieving 6 / 30

7 What is Data Sieving? method to increase performance on non-contiguous data areas of unwanted (holes) data are also accessed the desired data is sieved out additional memory for buffers is needed (typically 2MB-4MB) much better performance for small holes and small data sizes 7 / 30

8 Figure 2: Data Sieving 8 / 30

9 Available implementations ROMIO[1] + good performance for small data and hole size + sustain detailed information about data layout but even harmful for big data and hole size requires detailed knowledge about the system MPI-I/O only hard to customize Performance model-directed data sieving [2] + automatic decision based on system characteristics + bandwidth, latency and number of I/O nodes properly MPI-I/O only (not released yet) 9 / 30

10 NCT Library 10 / 30

11 Goals 1 Flexibility allow algorithms to adopt to the underlying system easy to modify behavior of data sieving 2 Automation possibility to use sophisticated algorithms which do not need further user adjustments 3 Universality MPI Independence for wider use of data sieving POSIX level API 11 / 30

12 Design & Implementation C library Features view based file access functions POSIX compatible modularity through optimization plugins provides helper functions for plugins Processing of data sieving decision making: when and how to use data sieving helper functions access data and copy data between buffers 3 differn algorithms were implemented naive ROMIO simple performance model 12 / 30

13 API 1 t y p e d e f s t r u c t n c t t u p l e t 2 { 3 u i n t 3 2 t s i z e ; 4 u i n t 3 2 t d e l t a O f f s e t ; 5 } n c t t u p l e ; 6 7 t y p e d e f s t r u c t n c t i n f o t 8 { 9 double latencyfs ; 10 i n t bandwidth ; 11 i n t numnodes ; 12 double latencynetwork ; 13 u i n t 6 4 t s t r i p e S i z e ; 14 u i n t 6 4 t b l o c k s i z e ; 15 i n t dsmethod ; 16 } n c t i n f o ; n c t v i e w n c t c r e a t e v i e w ( u i n t 3 2 t i n i t a l O f f s e t, s i z e t d s b u f s i z e, v o i d d s b u f, i n t tuplecount, n c t t u p l e l i s t, n c t i n f o i n f o ) ; v o i d n c t d e s t r o y v i e w ( n c t v i e w view ) ; s i z e t n c t r e a d ( i n t fd, v o i d b u f f, u i n t 6 4 t o f f s e t, s i z e t b y t e c o u n t, n c t v i e w view ) ; s i z e t n c t w r i t e ( i n t fd, v o i d b u f f, u i n t 6 4 t o f f s e t, s i z e t b y t e c o u n t, nct view view ) ; Listing 1: nct.h 13 / 30

14 Evaluation 14 / 30

15 Methodology custom benchmark tool to use different data pattern vary hole and data size read and write of 10GB or at least 100sec WR- Cluster 1-10 lustre nodes 128KiB and 2MiB Stripe size different pattern min 3 iterations per setting all 3 runs are in a 20% range of the mean value model created for expected results 15 / 30

16 Reading model Figure 3: Model WR reading default pattern 16 / 30

17 Reading naive Figure 4: Naive 1 node WR 128 KiByte stripe reading default pattern 17 / 30

18 Reading ROMIO Figure 5: ROMIO WR cluster read 2 nodes 2 MiByte stripes 18 / 30

19 Reading ROMIO delta to naive Figure 6: ROMIO WR cluster write 2 nodes 2 MiByte stripes difference to naive 19 / 30

20 Reading simple performance model Figure 7: Simple performance Model WR cluster reading 2 nodes 2 MiByte stripes 20 / 30

21 Reading simple performance model delta to naive Figure 8: Simple performance Model WR cluster reading 2 nodes 2 MiByte stripes difference to naive 21 / 30

22 Writing ROMIO delta to naive Figure 9: ROMIO WR cluster write 2 nodes 2 MiByte stripes difference to naive 22 / 30

23 Writing simple performance model delta to naive Figure 10: Simple performance Model WR cluster writing 2 nodes 2 MiByte stripes difference to naive 23 / 30

24 Reading naive aligned Figure 11: naive reading 2 nodes 128 KiByte stripes aligned 24 / 30

25 Writing naive aligned Figure 12: naive writing 2 nodes 128 KiByte stripes aligned 25 / 30

26 Summary & Conclusions 26 / 30

27 POSIX leveled, flexible library makes research much easier potential in further optimization based on system characteristics implementation of the MPI-I/O interface is planed 27 / 30

28 Questions? 28 / 30

29 Literature Literature 29 / 30

30 Literature Literatur I Rajeev Thakur, William Gropp, and Ewing Lusk. Data sieving and collective i/o in ROMIO. In Frontiers of Massively Parallel Computation, Frontiers 99. The Seventh Symposium on the, pages IEEE, Yong Chen, Yin Lu, Prathamesh Amritkar, Rajeev Thakur, and Yu Zhuang. Performance model-directed data sieving for high-performance i/o. The Journal of Supercomputing, pages 1 25, September / 30

31 Literature Writing naive Figure 13: naive writing 10 nodes 2 MiByte stripes 30 / 30

Optimization of non-contiguous MPI-I/O Operations

Optimization of non-contiguous MPI-I/O Operations Bachelor Thesis Arbeitsbereich Wissenschaftliches Rechnen Fachbereich Informatik Fakultät für Mathematik, Informatik und Naturwissenschaften Universität