Mining Supercomputer Jobs' I/O Behavior from System Logs. Xiaosong Ma

1 Mining Supercomputer Jobs' I/O Behavior from System Logs Xiaosong Ma

3 OLCF Architecture Overview
Compute systems: Rhea node development cluster; Eos, a 736-node Cray XC30 cluster.
Scalable I/O Network (SION): InfiniBand.
Storage: two groups of servers, each in front of 8 OST (LUN) of Atlas.
Logging path: per-OST I/O throughput is collected by a monitoring tool into a MySQL database, which holds the server-side I/O throughput logs.
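The logging path above can be illustrated with a small query sketch. The slides do not show the database schema, so the table name `ost_throughput`, its columns, and the sample rows are all hypothetical, and Python's built-in `sqlite3` stands in for the MySQL database so that the example is self-contained.

```python
import sqlite3

# In-memory stand-in for the monitoring database (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ost_throughput (ts INTEGER, ost_id INTEGER, mbps REAL)")

# Made-up coarse-grained samples: (timestamp, OST id, throughput in MB/s).
rows = [
    (0, 0, 10.0), (0, 1, 20.0),
    (2, 0, 100.0), (2, 1, 150.0),
    (4, 0, 5.0), (4, 1, 5.0),
]
conn.executemany("INSERT INTO ost_throughput VALUES (?, ?, ?)", rows)


def aggregate_throughput(conn, start, end):
    """Sum throughput over all OSTs for each timestamp in [start, end)."""
    cur = conn.execute(
        "SELECT ts, SUM(mbps) FROM ost_throughput "
        "WHERE ts >= ? AND ts < ? GROUP BY ts ORDER BY ts",
        (start, end),
    )
    return cur.fetchall()


print(aggregate_throughput(conn, 0, 4))
```

Summing over OSTs per timestamp gives one system-wide throughput series, which is the kind of mixed-traffic trace the later slides work from.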

7 Server-side I/O Throughput Logs
Source: RAID controller, coarse-granule logging.
Advantages of I/O throughput logs: zero overhead; no impact on user I/O; no user effort.
Challenge: mixed I/O traffic.

9 Prior Work: IOSI Workflow
IOSI input: job scheduler logs (Start_time and End_time of each run of the target app, identified by user ID + app ID) and throughput logs, which together yield a sample set.
IOSI steps: data preprocessing; per-sample wavelet transform; cross-sample I/O burst identification.
IOSI output: the application's I/O signature.
Strong assumption: identical runs of the app.
IOSI paper: Automatic Identification of Application I/O Signatures from Noisy Server-Side Traces, FAST '...
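The workflow above can be sketched in a few lines. This is a simplified illustration, not the IOSI algorithm: the log layout, the function names, the single-level pairwise averaging (a Haar approximation up to a scaling factor) and the fixed threshold are all assumptions made for the example. It relies on the same strong assumption the slide names, that runs of the app are identical, so its bursts line up across samples while noise from other jobs does not.

```python
from typing import List, Tuple

Log = List[Tuple[int, float]]  # (timestamp in seconds, throughput in MB/s)


def extract_sample(log: Log, start: int, end: int) -> List[float]:
    """Throughput values logged while one run of the target app was active."""
    return [mbps for ts, mbps in log if start <= ts < end]


def haar_smooth(sample: List[float]) -> List[float]:
    """One-level Haar-style approximation: average each adjacent pair."""
    return [(sample[i] + sample[i + 1]) / 2 for i in range(0, len(sample) - 1, 2)]


def common_bursts(samples: List[List[float]], threshold: float) -> List[int]:
    """Indices at which every sample exceeds the threshold."""
    n = min(len(s) for s in samples)
    return [i for i in range(n) if all(s[i] > threshold for s in samples)]


# Made-up log, one value per second. Both runs write at offsets 2-3;
# the spikes at other offsets stand for traffic from unrelated jobs.
values = [10, 10, 500, 500, 10, 10, 300, 10,   # run 1 (t = 0..7)
          10, 10,                              # idle gap
          10, 10, 480, 520, 400, 10, 10, 10,   # run 2 (t = 10..17)
          10, 10]
log = [(ts, float(v)) for ts, v in enumerate(values)]
job_windows = [(0, 8), (10, 18)]  # Start_time, End_time from scheduler logs

samples = [haar_smooth(extract_sample(log, s, e)) for s, e in job_windows]
bursts = common_bursts(samples, threshold=250.0)
print(bursts)  # [1]: only the burst shared by both runs survives
```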

11 AID: Automatic I/O Diverter
SC16 technical paper presentation: Thursday pm, D
Input: timeline of jobs with Start_time and End_time, grouped by app.
Automatically identifying I/O-heavy apps (no prior knowledge, no user involvement).
Output: scheduling suggestion.
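One way to picture identifying I/O-heavy apps from job windows and throughput logs alone is the heuristic below. It is a hypothetical sketch, not AID's method: the thresholds `busy_mbps`, `min_support` and `min_fraction`, the app names and the data are all invented for illustration. An app is flagged only when it has enough runs and most of them coincide with high server-side throughput.

```python
from collections import defaultdict
from statistics import mean


def io_heavy_candidates(jobs, log, busy_mbps, min_support, min_fraction):
    """jobs: (app_id, start, end) tuples; log: (timestamp, MB/s) tuples."""
    busy = defaultdict(int)   # runs of each app that saw high throughput
    total = defaultdict(int)  # runs of each app with any logged samples
    for app, start, end in jobs:
        window = [mbps for ts, mbps in log if start <= ts < end]
        if not window:
            continue
        total[app] += 1
        if mean(window) >= busy_mbps:
            busy[app] += 1
    return sorted(
        app for app in total
        if total[app] >= min_support and busy[app] / total[app] >= min_fraction
    )


# Made-up data: throughput is high whenever appA runs.
values = [400, 420, 10, 10, 390, 410, 10, 10, 405, 395, 10, 10]
log = [(ts, float(v)) for ts, v in enumerate(values)]
jobs = [
    ("appA", 0, 2), ("appA", 4, 6), ("appA", 8, 10),
    ("appB", 2, 4), ("appB", 6, 8),
    ("appC", 0, 2),  # overlaps appA once: too few runs to draw a conclusion
]

candidates = io_heavy_candidates(jobs, log, busy_mbps=200.0,
                                 min_support=2, min_fraction=0.8)
print(candidates)  # ['appA']
```

Requiring a minimum number of runs keeps an app such as `appC`, which merely overlapped with a busy period once, from being flagged on mixed traffic alone.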

12 Application I/O Characterization Results
Total number of logged jobs: 8,969
Unique applications identified: 9,998
Initial I/O-intensive candidates: 9
Candidates passing scope checking: 67
Candidates passing minimum support:
User-verified candidates: 8
Results from months of Titan I/O traffic and job logs (with user verification).

13 Application I/O Characterization Results
User-verified I/O-intensive applications (table columns: ID, Node, Time(m), OST, App. Domain).
App. domains listed: Geo-sciences, Combustion, Astrophysics, Combustion, Systems research, Combustion, Computer Science, Environmental.

16 Application I/O Characterization Results
Applications are not using parallel I/O systems well!
Similar finding to Huong HPDC work (Darshan).
Motivates better I/O performance data analysis.
Connecting programs to systems.

17 Questions?
Xiaosong Ma, Qatar Computing Research Institute, Hamad Bin Khalifa University

18 I/O Contention on Large-Scale HPC Systems
ORNL's Titan (World's # Supercomputer): 27.1 PF peak performance; 18,688 compute nodes; 16-core AMD Opteron; Nvidia Tesla GPU + 6 GB memory; 3D torus interconnect.
Performance variance on HPC: the parallel file system is shared, so collisions between I/O-heavy jobs lead to I/O performance degradation.
I/O performance variance on Titan with IOR [6].

19 CDF of per-OST I/O throughput
Figure annotations: share of time an OST spends below given percentages of its capacity (MB/s); the highest annotated share is over 99% of the time.
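Each annotation on such a CDF is one point of the empirical distribution: the fraction of logged samples that fall below a chosen share of OST capacity. A minimal sketch, using made-up throughput samples and a hypothetical capacity of 400 MB/s (neither comes from the slides):

```python
def fraction_of_time_below(throughputs, capacity, fraction):
    """Empirical CDF value: share of samples below fraction * capacity."""
    limit = capacity * fraction
    return sum(1 for t in throughputs if t < limit) / len(throughputs)


# Made-up per-OST throughput samples in MB/s, one per logging interval.
samples = [5, 8, 12, 3, 40, 7, 250, 6, 9, 4]
capacity = 400.0  # hypothetical OST capacity in MB/s

for frac in (0.05, 0.2, 0.9):
    share = fraction_of_time_below(samples, capacity, frac)
    print(f"{share:.0%} of time below {frac:.0%} of capacity")
```

With equally spaced log samples, the share of samples equals the share of time, which is why the slide can phrase the annotations as "% time".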

Automatic Identification of Application I/O Signatures from Noisy Server-Side Traces. Yang Liu, Raghul Gunasekaran, Xiaosong Ma, Sudharshan S. Vazhkudai. Instance of Large-Scale HPC Systems: ORNL's TITAN (World