Lessons from Post-processing Climate Data on Modern Flash-based HPC Systems



Adnan Haider (1), Sheri Mickelson (2), John Dennis (2)
(1) Illinois Institute of Technology, USA; (2) National Center for Atmospheric Research, USA

Slide 2: Post-processing Climate Data Software
- Post-processing software analyzes climate data.
- Important goal of post-processing software: allow scientists to do more science in less time.
- Two post-processing tools:
  - PyAverager: computes averages
  - PyReshaper: converts input to a different file layout
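To make the two tools concrete, here is a minimal sketch. It assumes the layout conversion goes from time-slice records (one record per time step holding every variable) to time-series records (one record per variable holding every time step). Plain dictionaries stand in for files. The names `to_time_series` and `time_average` and the sample variables are hypothetical and are not the PyReshaper or PyAverager API.

```python
from collections import defaultdict


def to_time_series(time_slices):
    """Regroup per-time-step records into per-variable series.

    time_slices: list of dicts, one per time step, mapping variable -> value.
    Returns a dict mapping variable -> list of values ordered by time step.
    """
    series = defaultdict(list)
    for step in time_slices:
        for name, value in step.items():
            series[name].append(value)
    return dict(series)


def time_average(series):
    """Average each variable over all of its time steps."""
    return {name: sum(values) / len(values) for name, values in series.items()}


# Hypothetical data: three time steps, two variables.
slices = [
    {"temp": 280.0, "salt": 34.0},
    {"temp": 282.0, "salt": 35.0},
    {"temp": 284.0, "salt": 36.0},
]
series = to_time_series(slices)
averages = time_average(series)
```

The regrouping touches every value once on read and once on write with little computation in between, which is one reason a layout conversion of this kind tends to be dominated by I/O.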

Slide 3: I/O Workload Characteristics
[Figure: for PyReshaper on each dataset (Ice, Land, Atmosphere, Atmosphere S.E., Ocean), the percentage of runtime spent doing I/O (0% to 100%) and the average I/O request size (MB).]
- I/O bound!
- Varying I/O workloads
- Can flash-based HPC systems reduce execution time?
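The two quantities plotted on this slide can be derived from a simple trace of I/O requests. The sketch below assumes a trace of `(seconds, bytes)` pairs per request and a known total runtime; the function name `io_profile` and the sample numbers are hypothetical and are not measurements from the talk.

```python
def io_profile(total_runtime_s, io_ops):
    """Summarize an I/O trace.

    total_runtime_s: wall-clock runtime of the whole job in seconds.
    io_ops: list of (seconds, bytes) tuples, one per I/O request.
    Returns (percent of runtime spent in I/O, average request size in MB).
    """
    io_time = sum(seconds for seconds, _ in io_ops)
    total_bytes = sum(nbytes for _, nbytes in io_ops)
    pct_io = 100.0 * io_time / total_runtime_s
    avg_request_mb = total_bytes / len(io_ops) / 1e6
    return pct_io, avg_request_mb


# Hypothetical trace: three requests in a job that ran for 10 seconds.
ops = [(2.0, 4_000_000), (3.0, 8_000_000), (3.0, 12_000_000)]
pct_io, avg_request_mb = io_profile(10.0, ops)
```

A job whose I/O percentage is close to 100% is I/O bound in the sense used on the slide, and the average request size is what distinguishes one dataset's workload from another.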

Slide 4: What is Flash?
- Faster hardware which accelerates I/O
- Flash in HPC systems


Slides 6-9: Two Flash System Architectures
- Flash devices: Gordon uses SSDs; Wrangler: [missing in source]
- Storage architecture: Gordon is local; Wrangler is pooled
- Interconnect: Gordon uses InfiniBand; Wrangler uses PCI Express
- Yellowstone has disks
[Diagram: local flash design (Gordon): compute nodes connected by RDMA via InfiniBand to a flash-based I/O node with memory and four SSDs. Pooled flash design (Wrangler): compute nodes connected through a PCI Express interface to a parallel file system/object store.]

Slides 10-12: PyReshaper on 1 Compute Node, Ocean (large)
[Figure: metadata, read, and write time in seconds for each configuration. Yellowstone: read and write HDD. Gordon: read HDD/write SSD, read SSD/write HDD, read and write SSD*, read and write HDD. Wrangler: read HDD/write flash, read flash/write HDD, read and write flash, read and write HDD.]
- Single SSD runs out of capacity!
- Reading from SSD increases runtime by 75%.
- 3.6x reduction in execution time compared to Yellowstone.

Slides 13-15: PyReshaper on 1 Compute Node, Ice (small)
[Figure: metadata, read, and write time in seconds for each configuration. Yellowstone: read and write HDD. Gordon: read HDD/write SSD, read SSD/write HDD, read and write SSD, read and write HDD. Wrangler: read HDD/write flash, read flash/write HDD, read and write flash, read and write HDD.]
- SSDs decrease runtime by 47%.
- Hybrid I/O decreases runtime by 6x.
- 11x reduction in execution time compared to Yellowstone.
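The PyReshaper results are quoted partly as percentages (a 75% increase, a 47% decrease) and partly as factors (3.6x, 6x, 11x), which makes them hard to compare at a glance. The sketch below converts between the two forms. The helper names are hypothetical, and the only inputs are the figures quoted on the slides.

```python
def factor_from_percent_decrease(pct):
    """A runtime decrease of pct percent, expressed as a speedup factor."""
    return 1.0 / (1.0 - pct / 100.0)


def percent_decrease_from_factor(factor):
    """A speedup factor, expressed as a percent decrease in runtime."""
    return 100.0 * (1.0 - 1.0 / factor)


def factor_from_percent_increase(pct):
    """A runtime increase of pct percent, expressed as a slowdown factor."""
    return 1.0 + pct / 100.0


ssd_speedup = factor_from_percent_decrease(47)      # 47% decrease, about 1.89x
hybrid_decrease = percent_decrease_from_factor(6)   # 6x, about 83% decrease
flash_decrease = percent_decrease_from_factor(11)   # 11x, about 91% decrease
ssd_read_slowdown = factor_from_percent_increase(75)  # 75% increase, 1.75x
```

On a common scale, a 47% decrease is a little under 2x, so it is a much smaller gain than the 6x and 11x results.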

Slide 16: Lesson 1
A mismatch between storage architecture and I/O workload can hide the benefits of flash devices, increasing runtime by 4x.
- Single SSD and interconnect
[Diagram: local flash design (Gordon): compute nodes connected by RDMA via InfiniBand to a flash-based I/O node with memory and four SSDs.]

Slide 17: Lesson 2
- Local flash architecture is more common.
- The number of flash devices per compute node should increase.
[Figure: performance on Gordon with 16 processes. Y-axis: seconds. X-axis: # of compute nodes (SSDs) / # of processes per node, at 1/16, 2/8, 4/4, 8/2, 16/1. Series: Ice, Land, Atmosphere, Atmosphere S.E. Annotation: optimal number of SSDs.]
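The x-axis of the Gordon experiment covers every way to spread 16 processes evenly across compute nodes. A minimal sketch of that enumeration follows. It assumes, as the axis label "# of Compute Nodes (SSDS)" suggests, that each node contributes one SSD; the function name `node_layouts` is hypothetical.

```python
def node_layouts(total_processes):
    """List every (nodes, processes_per_node) split that divides evenly."""
    return [
        (nodes, total_processes // nodes)
        for nodes in range(1, total_processes + 1)
        if total_processes % nodes == 0
    ]


layouts = node_layouts(16)

# With one SSD per node (an assumption), processes per node is also the
# number of processes sharing each SSD.
processes_per_ssd = {nodes: per_node for nodes, per_node in layouts}
```

Moving right along the axis adds SSDs and reduces the number of processes contending for each one, which is the trade-off the lesson is about.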

Slide 18: Lesson 3
Hybrid I/O (reading from and writing to different device types) decreases flash storage consumption by half while decreasing runtime by 6x.
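The "by half" figure follows from simple accounting if the output is about the same size as the input, which is plausible for a layout conversion but is an assumption here. The sketch below counts how much flash a job occupies under each read/write placement; the function name, the device labels, and the 100 GB sizes are hypothetical.

```python
def flash_bytes_used(input_bytes, output_bytes, read_device, write_device):
    """Flash capacity occupied when input and output live on the given devices.

    read_device and write_device are either "flash" or "hdd".
    """
    used = 0
    if read_device == "flash":
        used += input_bytes   # input staged on flash
    if write_device == "flash":
        used += output_bytes  # output written to flash
    return used


GB = 10**9
size = 100 * GB  # assumed: output is the same size as the input

all_flash = flash_bytes_used(size, size, "flash", "flash")
hybrid = flash_bytes_used(size, size, "hdd", "flash")
```

Under that assumption the hybrid placement occupies exactly half the flash of the all-flash placement, which is also why it can avoid running out of capacity on a single SSD.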

Slides 19-22: Conclusion
- Pooled architecture performs better than local architecture, but if the local architecture alleviates bottlenecks it can be a more feasible solution.
- Moving from Yellowstone's HDD to Wrangler's HDD provided up to a 3.6x reduction in execution time.
- Moving from Yellowstone's HDD to Wrangler's flash provided up to an 11x reduction in execution time.
- As data volumes grow, a cost-effective I/O architecture must be considered.

Slide 23: Acknowledgements
- Sheri Mickelson and John Dennis
- Kevin Paul and the ASAP group

Slide 24: Flash-based Systems in the Future
[Figure labels: Comet, Trinity, Gordon, Wrangler, Aurora]


Slide 25: Evolution of Flash Systems
[Timeline figure with year markers 2012, 2013, and 2015. Systems shown: Gordon-std, Gordon-vsmp, Catalyst, Comet, Wrangler, Cori, Trinity, Aurora.]
Labels on the figure:
- Local flash architecture: flash devices (SSD) on remote nodes
- Pooled flash: aggregates 16 flash devices at job config
- Pooled: flash devices as flash, all-to-all connection
- Burst buffer: 750 TB of flash and 750 GB/s bandwidth
- Local flash: 800 GB of flash on compute node via PCI Express
- Local flash: 320 GB of flash on each compute node
- Burst buffer
- Burst buffer: Xeon processor based burst buffer nodes

Slides 26-27: Gordon Performance Analysis
[Figure: ratio of SSD throughput to HDD throughput on Gordon, plotted against # of processes / amount of data written (GB) and I/O request size (KB). Annotations: scalability, workload.]

Slide 28: Wrangler Performance Analysis
[Figure: ratio of SSD throughput to HDD throughput on Wrangler, plotted against # of processes / amount of data written (GB) and I/O request size (KB). Annotation: consistent.]

Slide 29: Performance Comparison
[Figure: speedup provided by flash on Gordon and Wrangler for the Ice, Land, ATM, and ATM S.E. datasets.]

Slide 30: Speedup
[Figure: speedup for the Atmosphere, Atmosphere S.E., and Ocean datasets. Series: Gordon best time over Wrangler HDD time; Wrangler HDD time over Wrangler flash time.]
