ECE7995 (7) Parallel I/O


Marvin Blair

1 ECE7995 (7) Parallel I/O

2 Parallel I/O
From the user's perspective:
- Multiple processes or threads of a parallel program access data concurrently from a common file.
From the system's perspective:
- Files are striped across multiple I/O servers.
- The file system is designed to perform well for concurrent writes and reads (a parallel file system).
I/O requests are served in parallel, so you may receive good performance.
[Figure: compute nodes connected through an interconnect to I/O nodes]

3 A Scenario: Running an MPI Program with a Parallel File System
[Figure: processes P0, P1, P2, ..., Pn on compute nodes CN0, CN1, CN2, ..., CNn; a cluster network; data servers DS0, DS1, ..., DSm; and a metadata server (Meta-S)]

4 Parallel I/O Infrastructure
- HDF5, Parallel netCDF: map application abstractions onto storage abstractions and provide data portability.
- PVFS, Lustre, GPFS, PanFS: maintain the logical space and provide efficient access to data.
- MPI-IO, ROMIO: organize accesses from many processes, especially those using collective I/O.
- Underlying I/O hardware and storage devices.

5 What Are Parallel File Systems?
- Store application data persistently: usually extremely large datasets that can't fit in memory.
- Provide a global shared namespace (files, directories).
- Designed for parallelism: concurrent (often coordinated) access from many clients.
- Designed for high performance: operate over high-speed networks (IB, Myrinet), with an I/O path optimized for maximum bandwidth.


6 Parallel File Systems
- Provide a directory tree all nodes can see (the global name space).
- Map data across many servers and drives (parallelism of access).
- Coordinate access to data so certain access rules are followed (useful semantics).

7 Data Distribution in Parallel File Systems

8 Data Distribution
- Round-robin (aka Simple Stripe in PVFS) is a reasonable default solution.
  - Works consistently for a variety of workloads.
  - Works well on most systems.
- Can you think of a system where this might not work so well?
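The round-robin placement can be made concrete with a little arithmetic. The following is a minimal sketch in C, not PVFS code: the names `stripe_loc`, `stripe_locate`, `strip_size`, and `num_servers` are hypothetical, and it assumes a fixed strip size with stripe unit k stored on server k mod N.

```c
#include <assert.h>
#include <stdint.h>

/* Where one byte of a striped file lives (hypothetical type, not PVFS API). */
typedef struct {
    uint32_t server;        /* index of the data server, 0..num_servers-1 */
    uint64_t local_offset;  /* byte offset inside that server's part of the file */
} stripe_loc;

/* Round-robin mapping: stripe unit k is stored on server k % num_servers.
 * strip_size and num_servers are illustrative parameters. */
stripe_loc stripe_locate(uint64_t file_offset, uint64_t strip_size,
                         uint32_t num_servers)
{
    assert(strip_size > 0 && num_servers > 0);
    uint64_t unit = file_offset / strip_size;    /* global stripe-unit index */
    uint64_t within = file_offset % strip_size;  /* position inside the unit */
    stripe_loc loc;
    loc.server = (uint32_t)(unit % num_servers);
    /* Each server holds every num_servers-th unit, packed back to back. */
    loc.local_offset = (unit / num_servers) * strip_size + within;
    return loc;
}
```

For example, with a 64 KB strip and 7 servers, file offset 8 * 65536 + 10 falls on server 1 at local offset 65546. A large contiguous read therefore touches every server, which is the source of the parallelism and also the reason requests from one program arrive at each server interleaved with everyone else's.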

9 Data Distribution
- Clients perform writes/reads of a file at various regions.
- The regions usually depend on the application workload and the number of tasks.

10 PVFS - Parallel Virtual File System
- An open source parallel file system.
- Brings state-of-the-art parallel I/O concepts to production parallel systems.
- Designed to scale to petabytes of storage and provide access rates at 100s of GB/s.
- Developed by the Parallel Architecture Research Lab at Clemson University since 1993, MCS of ANL, and the Ohio Supercomputer Center.
- Major developers: Walt Ligon, Rob Ross, Phil Carns, Pete Wyckoff, Neil Miller, Rob Latham, Sam Lang, Brad Settlemyer.
- Current stable release (PVFS2):

11 PVFS Features
- Performance
  - Designed to provide high performance for parallel applications, where concurrent, large I/O and many file accesses are common.
  - Dynamic distribution of I/O and metadata, avoiding single points of contention.
- Optimizations
  - Integration with HPC interfaces (MPI-IO).
  - Non-contiguous accesses.
- Easy Deployment
  - Hardware independent.
  - Mostly userspace (small Linux kernel module).
  - Proven production environment.
- Good Research Platform
  - Much research in parallel I/O has used PVFS.

12 IOrchestrator: Improving the Performance of Multi-node I/O Systems via Inter-Server Coordination

13 Outline
- A Motivation Example
- Design of IOrchestrator
- Performance Evaluation
- Related Work
- Conclusions

14 A Motivation Example
Experimental setting
- MPICH compiled with ROMIO
- 6 compute nodes, 7 data servers, 1 metadata server
- PVFS with default striping configuration
- CFQ disk scheduler
The benchmark: mpi-io-test
- Run two instances of the benchmark.
- Each instance collectively reads a 10 GB file with a 64 KB request size.
- Five processes are spawned in each running instance.
- The processes access contiguous data.

15 I/O Request Generation
for (j = 0; j < opt_iter; j++) {
    err = MPI_File_read_all(fh, buf, nchars, MPI_CHAR, &status);
}
[Figure: requests from Instance 1 and Instance 2 over Iterations 0, 1, and 2]

16 Work-Conserving I/O Schedulers on a Single Disk
Disk head thrashing!
[Figure: requests from Process 1 and Process 2; new position of the disk head]

17 Non-Work-Conserving Scheduling for One Disk
Requests are efficiently served in a non-work-conserving manner.
[Figure: requests from Process 1 and Process 2; position and new position of the disk head]
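To see why the service order matters, one can compare how far the head travels under the two orders. The sketch below is illustrative only: `head_travel` is a hypothetical helper, positions are abstract block numbers, and travel distance stands in for seek cost (real seek time is not linear in distance).

```c
#include <assert.h>
#include <stddef.h>

/* Total distance the disk head travels when requests are served in the
 * given order, starting from position `head`. Hypothetical helper:
 * positions are abstract block numbers, not real disk addresses. */
unsigned long head_travel(unsigned long head, const unsigned long *req,
                          size_t n)
{
    assert(req != NULL || n == 0);
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++) {
        /* absolute distance from the current head position to the request */
        total += (req[i] > head) ? req[i] - head : head - req[i];
        head = req[i];
    }
    return total;
}
```

Take two processes, one reading blocks 0, 1, 2 and the other reading blocks 1000, 1001, 1002. Served alternately (0, 1000, 1, 1001, 2, 1002) from head position 0, the head travels 4998 blocks; served one process at a time (0, 1, 2, 1000, 1001, 1002), it travels 1002. A non-work-conserving scheduler accepts a short idle wait for the next nearby request in order to get the second order.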

18 Analysis of On-Disk Data Accesses
The disk is accessed in a random order, even though the data accesses of each instance have ample spatial locality.

19 Non-Work-Conserving Scheduling for Multiple Disks
- Anticipating the requests accessing nearby disk areas.
- The next request from the same instance does not arrive quickly, because other requests from the same collective I/O call are still being served or pending at other servers.
- Anticipation fails! Disk heads have to seek to serve the remote pending requests.
[Figure: requests from Instance 1 and Instance 2 in Iteration 0; position and new position of the disk head]

20 Coordinating Data Accesses on Disks
Dedicated use of the disks through server coordination. Anticipation succeeds!
[Figure: requests from Instance 1 and Instance 2 over Iterations 0, 1, and 2; new position of the disk head]

21 Issues with Dedicated Use of Disks
- But what if the program takes a long think time to issue its next requests?
- The cost of idle waiting might be larger than the cost of long disk-head seeks.
[Figure: requests from Instance 1 and Instance 2 in Iteration 0; position of the disk head]

22 Issues with Dedicated Use of Disks (Cont'd)
- But what if requests are not evenly distributed over all the disks?
[Figure: requests from Instance 1 and Instance 2 in Iteration 0; position of the disk head]

23 Outline
- A Motivation Example
- Design of IOrchestrator
- Performance Evaluation
- Related Work
- Conclusions

24 IOrchestrator: Orchestration of Data Access on Multi-node I/O Systems
Objectives
- Recover spatial locality in a parallel program that runs in a shared multi-node storage cluster.
- System performance should not be compromised.
Challenges
- How to track the spatial locality and think time of each program?
- How to determine the cost-effectiveness of dedicated services for programs?
- How to implement the data access orchestration in a parallel file system such as PVFS2?

25 Measurement of Spatial Locality and a Program's Reuse Distance
Two concepts
- Spatial locality: the average distance of disk head movement for serving a request.
- Program's reuse distance: the think time between two requests from the same program that hit a data server.
Measurements
- To statistically quantify the spatial locality and reuse distance, we use a method similar to the one developed in Linux.
- Smooth out the short-term dynamics accurately.
- Phase out historical statistics quickly.
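The slide does not spell out the estimator. One common way to smooth short-term noise while letting old history fade is an exponentially weighted moving average, sketched below in C. The names `ewma` and `ewma_update` are hypothetical, and both the choice of this estimator and any particular weight are assumptions, not details confirmed by the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Running estimate of one per-program statistic, e.g. seek distance or
 * reuse distance. Hypothetical type for illustration. */
typedef struct {
    double value;   /* current smoothed estimate */
    int    primed;  /* 0 until the first sample has been seen */
} ewma;

/* Fold one new sample into the estimate.
 * alpha in (0, 1] is the weight given to the new sample (an assumed
 * tuning knob): a small alpha smooths short-term dynamics, a large
 * alpha phases out history faster. */
void ewma_update(ewma *e, double sample, double alpha)
{
    assert(e != NULL && alpha > 0.0 && alpha <= 1.0);
    if (!e->primed) {           /* the first sample seeds the estimate */
        e->value = sample;
        e->primed = 1;
        return;
    }
    e->value = (1.0 - alpha) * e->value + alpha * sample;
}
```

With alpha = 1/8, a first sample of 100 followed by a sample of 200 gives an estimate of 112.5: the newest sample moves the estimate only slightly, while a sample's influence shrinks by a factor of 7/8 with every later update.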

26 Eligible Programs for Dedicated Service
- The spatial locality condition: the standard deviation of spatial locality is small.
  $\sigma_{SL} \le 20\%$
- The benefits condition: the average disk seek distance across all the programs sharing the data server is large enough.
  $\left(\sum_{i=0}^{n} SL_i\right) / \sum_{i=0}^{n} SL_{ij} \ge 1.5$
- The cost-effectiveness condition: the average reuse distance is smaller than the disk seek time.
  $\left(\sum_{i=0}^{n} RD_{ij}\right) / n \le SeekTime$
Here $\sigma_{SL}$ is the standard deviation of SL, SeekTime is the disk seek time derived from SL, RD is the reuse distance, and SL is the spatial locality.
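The three conditions combine into a single yes/no test per program. The sketch below is one hedged reading in C: the thresholds 20% and 1.5 come from the slide, the comparison directions are taken from its wording ("small", "large enough", "smaller than"), and the struct layout and names (`prog_stats`, `eligible_for_dedicated_service`) are hypothetical. The benefit ratio is passed in precomputed because the slide's formula for it is not fully legible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Statistics for one program at one data server (hypothetical layout). */
typedef struct {
    double sl_stddev;      /* std. deviation of spatial locality, as a fraction (0.20 = 20%) */
    double benefit_ratio;  /* seek-distance ratio used by the benefits condition */
    double avg_reuse_ms;   /* average reuse distance (think time), in ms */
    double seek_time_ms;   /* disk seek time derived from SL, in ms */
} prog_stats;

/* True when all three conditions hold, so giving the program the disk
 * exclusively for a while is expected to pay off. */
bool eligible_for_dedicated_service(const prog_stats *p)
{
    assert(p != NULL);
    bool stable_locality = p->sl_stddev <= 0.20;               /* spatial locality condition */
    bool worthwhile      = p->benefit_ratio >= 1.5;            /* benefits condition */
    bool cheap_to_wait   = p->avg_reuse_ms <= p->seek_time_ms; /* cost-effectiveness condition */
    return stable_locality && worthwhile && cheap_to_wait;
}
```

A program with steady locality and a 3 ms think time against an 8 ms seek qualifies; the same program with a 20 ms think time does not, because idling the disk for it would cost more than the seeks it avoids.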

27 Scheduling of Programs
- What is an object for scheduling?
  - Each eligible program is a running object for dedicated I/O service.
  - Ineligible programs together constitute one object for I/O service.
- How to determine the time slices for scheduling?
  - Fixed window size: 500 ms.
  - Each object receives a portion of the window as the time slice for its dedicated service.
  - The ratio in each window is inversely proportional to the percentage of its average reuse distance over the sum of the distances of all objects.
- How to avoid starvation?
  - The programs having poor SL or longer RD are serviced together in a dedicated time slice.
  - The ratio is determined by their combined data access efficiency on disks.
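The time-slice rule can be sketched as follows in C. This is an illustration under stated assumptions, not the IOrchestrator implementation: `compute_slices` is a hypothetical name, each object's share is taken to be proportional to the inverse of its average reuse distance and normalized over the 500 ms window, and the group of ineligible programs is treated like any other object even though the slide says its share is set by combined access efficiency.

```c
#include <assert.h>
#include <stddef.h>

#define WINDOW_MS 500.0  /* fixed scheduling window from the slide */

/* Split one scheduling window among n objects so that an object's slice
 * is inversely proportional to its average reuse distance.
 * avg_rd[i] : average reuse distance of object i (must be positive)
 * slice_ms[i]: output, time slice of object i in milliseconds */
void compute_slices(const double *avg_rd, double *slice_ms, size_t n)
{
    assert(avg_rd != NULL && slice_ms != NULL && n > 0);
    double inv_sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        assert(avg_rd[i] > 0.0);
        inv_sum += 1.0 / avg_rd[i];   /* sum of the inverse weights */
    }
    for (size_t i = 0; i < n; i++)
        slice_ms[i] = WINDOW_MS * (1.0 / avg_rd[i]) / inv_sum;
}
```

Two objects with average reuse distances of 1 ms and 3 ms would receive 375 ms and 125 ms of the window: the program that issues its next request sooner keeps the disk busy during its slice, so it gets the larger share.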

28 IOrchestrator Architecture
[Figure: components shown are the compute nodes (with instrumented MPI library), "mpdlistjobs", IDs of open files, the metadata server, program-files, orchestrator, locality, ischeduler, disk I/O scheduler, hard disk, and the data servers]

29 Outline
- A Motivation Example
- Design of IOrchestrator
- Performance Evaluation
- Related Work
- Conclusions

30 Performance Evaluation: Benchmarks
Name | Access Pattern | Source
mpi-io-test | contiguous data sets | PVFS2 software package
ior-mpi-io | non-contiguous data sets | the ASCI Purple benchmark suite developed at LLNL
mpi-tile-io | data access in a tile-by-tile fashion | the Parallel I/O Benchmarking Consortium at ANL
noncontig | data access with vector-derived MPI data type | the Parallel I/O Benchmarking Consortium at ANL
hpio | diverse set of access patterns | Northwestern University and SNL

31 Homogeneous Workloads (w/o Collective I/O)
- IOrchestrator improves the I/O throughput of the entire file system by up to 89%, and by 43% on average.
- For the mpi-io-test benchmark, using IOrchestrator increases I/O throughput by 57% for reads and 37% for writes.

32 On-Disk Data Access
- The disk head frequently alternates between two disk regions.
- CFQ does not preserve spatial locality without IOrchestrator.

33 Reuse Distances
- Without IOrchestrator, there are many very large reuse distances.
- With IOrchestrator, reuse distances are significantly reduced, and CFQ turns the strong locality in the program into efficient disk access.

34 Homogeneous Workloads (w/ Collective I/O)
- IOrchestrator improves the I/O throughput of the entire file system by up to 63%, and by 28% on average.
- For the ior-mpi-io benchmark, I/O throughput is significantly reduced when collective I/O is used, because of unbalanced workloads.

35 Performance of Heterogeneous Workloads
- Even time slicing improves the system throughput by 17%.
- IOrchestrator improves the system throughput by 47%.

36 Effect of File Distances among Programs
The throughputs are improved by 33%, 64%, 147%, and 147%, respectively.

37 Impact of Scheduling Window Size
The throughputs are improved by 40%, 48%, 58%, and 59%, respectively, with the selected window sizes.

38 Conclusions
- Coordinating data accesses across data servers is critical to preserving the spatial locality of parallel programs.
- We design and implement IOrchestrator on top of PVFS2 to coordinate request scheduling across data servers according to the monitored access behaviors of programs.
- Our experiments with representative benchmarks show that IOrchestrator increases I/O throughput by 39% on average.
