
ABSTRACT

ZHANG, WENZHAO. A Memory Hierarchy- and Network Topology-Aware Framework for Runtime Data Sharing at Scale. (Under the direction of Dr. Nagiza F. Samatova.)

Data analytics is often performed in a post-processing manner: the data generated by an application is first written to the file system, for example a parallel file system (PFS), and then read back into dynamic random-access memory (DRAM) for analytics, requiring substantial I/O time. Runtime data sharing across multiple applications is a promising alternative approach for avoiding these increasing I/O bottlenecks. For instance, the data generated by a running application can be moved to a DRAM-based server staging space, where that data is then retrieved by various client analytics applications, such as visualization and transformation. Thus, data generation and analytics can run concurrently, and slow PFS access is replaced by fast DRAM access.

In this dissertation, we illustrate the value of the proposed framework using large-scale scientific datasets generated and shared over modern supercomputers. Specifically, we demonstrate that our framework enables the runtime sharing of Adaptive Mesh Refinement (AMR) scientific data, unlike traditional uniform mesh data. AMR represents a significant advance for large-scale scientific simulations. By dynamically refining resolutions over time and space, AMR simulations generate hierarchical, multi-resolution, and non-uniform meshes. This kind of refinement provides sufficient precision for regions of interest at finer levels while avoiding unnecessary data generation for regions of non-interest. However, because they cannot handle the dynamic characteristics of AMR data and the dynamic runtime behaviors of AMR simulations, existing methods are not applicable to runtime AMR data sharing for scientific analytics.

In this dissertation, we propose a framework to facilitate runtime AMR data sharing for scientific applications, with the goals of realizing effective AMR data access and further optimizing data access performance over the staging space by exploiting the memory hierarchy and network topology of modern supercomputers. We first present a purely DRAM-based framework to support runtime AMR data sharing. By employing an architecture with dedicated server processes for metadata management, an efficient and balanced AMR data distribution policy, and a poly-tree-based spatial index, the framework enables client applications to effectively write AMR data to, and retrieve it from, the staging space. Then, we present a set of methods to further extend the framework. The upgraded framework is able to utilize Solid State Drives (SSDs) on modern supercomputers as an overflow space for when the DRAM fills. It can detect common spatially constrained AMR data retrieval patterns and prefetch data from SSDs to DRAM to bridge the speed gap between the two memory layers. Moreover, the framework is able to utilize a model for optimizing runtime AMR data distribution across the network topology. Finally, we present a collection of methods to address some key performance issues plaguing SSDs, such as read contention and file fragmentation. To address read contention on SSDs, we present a general-purpose online read algorithm that is able to detect the problem and utilize memory hierarchy resources to relieve it. To maintain a near-optimal operating environment for SSDs, we present methods to orchestrate data chunks across different memory layers in both online and offline manners to handle issues that may compromise SSD performance.

A Memory Hierarchy- and Network Topology-Aware Framework for Runtime Data Sharing at Scale

by
Wenzhao Zhang

A dissertation submitted to the Graduate Faculty of North Carolina State University
in partial fulfillment of the requirements for the Degree of Doctor of Philosophy

Computer Science

Raleigh, North Carolina
2018

APPROVED BY:

Dr. Rada Y. Chirkova
Dr. Kemafor Anyanwu Ogan
Dr. Ranga Raju Vatsavai
Dr. Nagiza F. Samatova (Chair of Advisory Committee)

BIOGRAPHY

From August 2013 to December 2017, Wenzhao Zhang pursued his Ph.D. study in Computer Science at North Carolina State University under the direction of Dr. Nagiza F. Samatova. During his doctoral study, he spent two summers working at Lawrence Berkeley National Laboratory and one summer working at the Renaissance Computing Institute.

ACKNOWLEDGEMENTS

First and foremost, I am very grateful to my advisor, Dr. Nagiza Samatova. I definitely could not have reached this point without her consistent guidance and support over the past four and a half years. I am very thankful to my committee members, Dr. Rada Chirkova, Dr. Kemafor Ogan, and Dr. Raju Vatsavai, for taking their valuable time to serve on my thesis committee and for offering their insights on this dissertation.

This dissertation would not have been completed without the help of Dr. Samatova's research group. I am very grateful to Xiaocheng Zou and Houjun Tang for their priceless help with my Ph.D. research. Additionally, I would like to thank Steve Harenberg and Stephen Ranshous for their help with my research paper writing.

I have had the honour of collaborating with researchers at national laboratories: Drs. Suren Byna, Kesheng (John) Wu, Bin Dong, Dan Martin, Hans Johansen, and Dharshi Devendran from Lawrence Berkeley National Laboratory, and Scott Klasky and Qing (Gary) Liu from Oak Ridge National Laboratory.

The work was supported by the U.S. Department of Energy, Office of Science (SciDAC SDM Center) and the U.S. National Science Foundation (Expeditions in Computing and EAGER programs).

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

Chapter 1 INTRODUCTION
    1.1 A DRAM-Based Framework for Runtime AMR Data Sharing
        1.1.1 Problems and Challenges
        1.1.2 Approach and Results
    1.2 Exploring Memory Hierarchy and Network Topology for Runtime AMR Data Sharing
        1.2.1 Problems and Challenges
        1.2.2 Approach and Results
    1.3 Memory Hierarchy Aware Data Read Performance Optimization
        1.3.1 Problems and Challenges
        1.3.2 Approach and Results

Chapter 2 A DRAM-Based Framework for Runtime AMR Data Sharing
    2.1 Introduction
    2.2 Background
        2.2.1 Issue I: Architecture
        2.2.2 Issue II: Online Data Organization
        2.2.3 Issue III: Online Spatial Index
    2.3 Methods
        2.3.1 Architecture
        2.3.2 Online Data Organization
        2.3.3 Online Spatial Index
        2.3.4 Implementation
    2.4 Results
        2.4.1 Scalability
        2.4.2 Performance over AMR Data
        2.4.3 Performance of Spatially Constrained Interaction Coupled with AMR Data
    2.5 Related Works
    2.6 Conclusion

Chapter 3 Exploring Memory Hierarchy and Network Topology for Runtime AMR Data Sharing
    3.1 Introduction
    3.2 Background
        3.2.1 Block-structured AMR Data
        3.2.2 Overview of AMRZone
    3.3 Methods
        3.3.1 Runtime Staging Space Capacity Control
        3.3.2 Spatial Read Patterns Detection and Prefetching
        3.3.3 Runtime Data Placement Optimization over Topology
        3.3.4 Implementation
    3.4 Results
        3.4.1 Staging Space Capacity Control
        3.4.2 Spatially Constrained AMR Data Read Patterns Detection and Prefetching
        3.4.3 Topology-aware Runtime AMR Data Placement Optimization
    3.5 Related Work
    3.6 Conclusion

Chapter 4 Memory Hierarchy Aware Data Read Performance Optimization
    4.1 Introduction
    4.2 Background
        4.2.1 SSDs in Scientific HPC Systems
        4.2.2 Commonly Found Access Patterns in Scientific Data Analytic Applications
    4.3 Method
        4.3.1 Overview
        4.3.2 Online Algorithm for Read Memory Hierarchy Resource
        4.3.3 Online Data Management
        4.3.4 Offline Data Management
    4.4 Results
        4.4.1 Experimental Setup
        4.4.2 Suitable Storage Layout as Testbed for the Framework
        4.4.3 Read Performance Evaluation
        4.4.4 Model Evaluation
        4.4.5 Data Management Evaluation
        4.4.6 Overhead Analysis
    4.5 Related Work
    4.6 Conclusion

Chapter 5 Conclusion and Future Work
    5.1 Conclusion
    5.2 Future Work
        5.2.1 Runtime Value Index Construction
        5.2.2 New Hardware Architecture
        5.2.3 Post-processing Data Analytics Support
        5.2.4 Integration with AMR Simulations
        5.2.5 Runtime Workflow Semantics and Autonomic Engine for AMR Data

BIBLIOGRAPHY

LIST OF TABLES

Table 4.1 Variables for the algorithm for reading memory hierarchy resources
Table 4.2 RID index size for each simulation's dataset
Table 4.3 Values for key system-specific parameters in the coarse model
Table 4.4 Comparison of the effectiveness of defragmentation

LIST OF FIGURES

Figure 2.1 The visualization graph of a 1GB block-structured AMR dataset generated by BISICLES [20]. The coarsest (or lowest) level (level 0) covers the entire global data domain. A finer (or higher) level is generated by refining a set of boxes on the adjacent coarser level, only covering some subregions of interest with higher resolution as defined by a refinement ratio. The boxes at the finer levels represent spatial regions of more interest.

Figure 2.2 A uniformly partitioned virtual bounding box over the finest level of the AMR data in Figure 2.1. These partitions would be distributed to the staging space according to a space-filling curve (e.g., Hilbert), leading to an unbalanced workload distribution, among other issues.

Figure 2.3 The client-server architecture of AMRZone consists of two types of server processes: (1) mservers, which only manage metadata, recording how AMR boxes are distributed across dservers and constructing a spatial index; and (2) dservers, which only manage the binary data. Note that Application1 and Application2 could be simulations or other data analytics programs. Also note that AMRZone does not limit the number of applications that can connect to the server side.

Figure 2.4 A spatial query over AMR data. Typically, boxes are retrieved from multiple levels, rather than a single level. The boxes at level 1 are refined from bigger boxes at level 0.

Figure 2.5 The polytree-based spatial index for AMR data. Tree nodes correspond to AMR boxes. Directed edges denote refinement relationships. It effectively represents the many-to-many refinement relationships of AMR boxes across different levels.

Figure 2.6 Configuration details for the scalability comparison experiments. The row header denotes four domain sizes and the column header gives four partition sizes. The format, $B $C($N) / $S($N), denotes the total number of boxes (B), the total number of parallel client processes (C), the total number of client nodes (N), the total number of DataSpaces server or AMRZone dserver processes (S), and the total number of server nodes (N).

Figure 2.7 Results of weak scalability comparison experiments between AMRZone and DataSpaces over 5 time-steps. AMRZone generally performs better than DataSpaces (22 out of 32 cases). In the best case, it achieves about a 46% performance improvement; in the worst case, there is about a 35% performance reduction.

Figure 2.8 Configuration details for two sets of AMRZone experiments, one over expanded BISICLES AMR datasets and one over synthetic uniform datasets. Rows 2 and 6 give the dataset size (GB) for one time-step; for the synthetic datasets, the global dimension size is also given. Rows 3 and 7 give the total number of client processes (C), client nodes (N), dserver processes (S), and server nodes (N) for the corresponding time-step. Rows 4 and 8 give the total number of boxes (B) and the box sizes (MB) in the corresponding time-step; for the synthetic datasets, the dimension size of a box is also given. In rows 2 and 4, the values for the BISICLES datasets are averages.

Figure 2.9 Results of box write/read performance testing for AMRZone over real AMR datasets, with comparisons against synthetic uniform data, over 10 time-steps in total. In the worst case, the AMR-data-related task demands more than 36% additional execution time; in the best case, it is 2% more. In 5 out of 8 cases, AMR-coupled write/read needs less than 10% more time. Note that these are not weak scaling tests.

Figure 2.10 Statistics for the AMR data workload on dserver processes. The row header denotes the four domain sizes and the total number of dserver processes, respectively. The column header denotes the minimum, maximum, average, median, first quartile, and third quartile of the workload on the dservers for the experiment related to each domain size. Note that, for each domain size, 10 time-steps of AMR data are written to the server space.

Figure 2.11 Configuration details for two sets of AMRZone experiments over expanded BISICLES datasets, one for spatially constrained data retrieval and one for AMR box retrieval. Rows 2 and 6 give the amount of data (GB) retrieved for a time-step. Rows 3 and 7 give the total number of client processes (C), client nodes (N), dserver processes (S), and server nodes (N) for the corresponding time-step. Rows 4 and 8 give the total number of boxes and the box sizes (MB) in the retrieved data for a time-step. The values in rows 2, 4, 6, and 8 are averages.

Figure 2.12 Results of spatially constrained data retrieval performance testing for AMRZone over AMR datasets, with comparisons against AMR box reads, over 10 time-steps in total. Note that these are not weak scaling tests.

Figure 3.1 The visualization graph of a 1GB block-structured AMR dataset generated by BISICLES [20]. BISICLES is a large-scale simulation for modeling Antarctic ice sheets.

Figure 3.2 The polytree-based spatial index for AMR data in AMRZone. Tree nodes represent AMR boxes. Edges represent the refinement relationship.

Figure 3.3 Illustration of spatial access pattern detection over 2D AMR data for a client process. The accessed region over time-step n is compared with the one over time-step n-1. After finding the boundary variation (the right boundary moves one unit toward the right), the framework predicts this change will continue, and it generates a new predicted spatial access region by applying the same trend. This new predicted region is used for prefetching data of time-step n.

Figure 3.4 The timing sequence diagram for prefetching, involving a client process, an mserver process's thread, a dserver process, and the dserver's dedicated prefetching thread. For a dserver, there is a time period between retrieving a time-step's data and receiving the data request of the next time-step, which can be leveraged to do prefetching for the next time-step.

Figure 3.5 Illustration of how redundant messages can be sent during prefetching. At level 0, the two spatial regions overlap with box0_1 and box0_2, respectively, and prefetching messages are sent for those two boxes. However, at level 1, box1_2 overlaps with both box0_1 and box0_2, so multiple messages are sent for box1_2.

Figure 3.6 The throughput difference between nodes at different topology distances on Cori [19] (the testbed for this work). For details of how the topology distances are calculated, please refer to the corresponding methods section. The result is based on a micro-benchmark, which sends 1 MB ping-pong messages between a pair of nodes, repeating the process 30,000 times. This benchmark setup (small message size and large number of messages) resembles AMR data, as each AMR box is usually not very large (from a few KBs to a dozen MBs), but the number of boxes in a time-step is high (from several thousand to tens of thousands). At any given time, only one pair of nodes is communicating. Average experimental values are reported.

Figure 3.7 An illustration of the runtime factors that our framework must consider when determining where to place an AMR box. The mserver must consider not only topology distance, but also the size of each AMR box and the workloads of all staging nodes, which keep changing at runtime. For example, in the figure, node 0 should be chosen in terms of topology distance, but node 3 should be selected in terms of workload. The final choice should be appropriately balanced among all factors.

Figure 3.8 The effectiveness of our framework in handling the condition when the staging space is becoming full, compared with direct SSD and PFS access (not involving our framework's staging space). Compared to directly writing data to SSDs, with added time periods to imitate computation periods, our framework achieves an average improvement of 72.85% and a median improvement of 71.93%.

Figure 3.9 Illustration of the 11 major Antarctic ice shelves. The spatial regions of those ice shelves are used as spatial access constraints over the BISICLES datasets.

Figure 3.10 The effectiveness of our framework performing prefetching for spatially constrained access, compared to direct SSD and PFS access (not involving our framework's staging space). The spatial access patterns of the client processes are based on the 11 major Antarctic ice shelves over a BISICLES dataset, as illustrated in Figure 3.9. Each client accesses one such region. With pattern detection and prefetching, the average and median performance improvements are 26.47% and 26.03%, respectively.

Figure 3.11 On Cori, the effectiveness of our framework's topology-aware runtime AMR data placement optimization, compared to direct SSD and PFS access (not involving our framework's staging space). For writing, the average and median improvements of the topology-aware optimization are 18.08% and 18.73%, respectively. For reading, the average and median improvements are 10.57% and 10.49%, respectively.

Figure 3.12 On Titan, the effectiveness of our framework's topology-aware runtime AMR data placement optimization, compared to direct PFS access (not involving our framework's staging space). Titan does not have SSDs, so no comparison to direct SSD access is possible. Note that all data in a time-step is used, and no spatial access patterns are involved. For writing, the average and median improvements are 24.85% and 26.39%, respectively. For reading, the average and median improvements are 17.21% and 16.43%, respectively.

Figure 4.1 SSD and advanced PFS write performance comparison using one Sith compute node at Oak Ridge National Laboratory. SSD write performance is worse than that of the advanced PFS. The data size written in each test is 1GB. Refer to the experimental setup section for details of the Sith cluster and the PFS striping policy.

Figure 4.2 Framework architecture overview: the gray components are our contributions and can be categorized into online and offline modules. The online modules have two major functionalities, executing a read algorithm and performing data management. They govern all online read and write flows through all memory layers.

Figure 4.3 Scenario where reading a specially arranged PFS cache can relieve contention on the SSD: the first process on each node reads the PFS cache, which is distributed across OSTs in a disjoint manner. Because of the reduced parallel read contention on the SSD, the absence of contention between the processes that read the PFS, and the PFS's high sequential read speed, overall performance can be improved.

Figure 4.4 Algorithm for reading memory hierarchy resources, applying a two-step coarse model when reading the SSD. First, it checks whether contention is likely; if so, it uses only a near-optimal subset of processes to reduce parallelism in case too many processes are given; otherwise, all processes simply read the SSD and the next step is skipped. Second, it decides whether reading the PFS could help mitigate contention on the SSD; if so, it uses one process per node to read the PFS and the other processes to read the SSD; otherwise, all processes read the SSD.

Figure 4.5 State flow chart for managing aggregated write buffers to interleave write and read operations, hiding data write overhead on the file systems.

Figure 4.6 Every process accesses all OSTs or a subset of OSTs. Contention is severe in this case.

Figure 4.7 Two ideal cases in which read contention on the OSTs is minimized. Case 1: the number of processes is smaller than the number of OSTs; Case 2: the number of processes is larger than the number of OSTs.

Figure 4.8 MLOC's VC-SC storage layout. The dataset is first binned to treat value-constrained (VC) access as the first priority; each bin is then partitioned in Hilbert curve order to serve spatially constrained (SC) access as the second priority. RID denotes the index of a value when the dataset is linearized in row-major order.

Figure 4.9 Three types of domain decomposition used as benchmarks in the spatially constrained (SC) access evaluation.

Figure 4.10 Evaluation of spatially constrained (SC) access, accessing 10% and 20% regions using three types of domain decomposition. Performance on the VC-SC layout is worse than on the unoptimized dataset because SC is treated with minor priority, so SC access suffers more overhead. This is the compromise the layout must make in order to avoid generating multiple full copies of datasets; please refer to the corresponding section for details.

Figure 4.11 Evaluation of value-constrained (VC) access, with 10% and 20% data selectivity using region-value and region-only access.

Figure 4.12 10% and 20% spatially constrained (SC) access on the SSD cache: the U-shaped lines show that increasing the number of processes cannot always reduce access time, which is caused by read contention on the SSD's file system.

Figure 4.13 Improvement for 10% and 20% spatially constrained (SC) access by SSD-PFS hybrid read: by applying one or two processes per node to read a carefully arranged PFS cache that is distributed across disjoint OSTs, contention on the SSD can be effectively relieved. Please refer to Figure 4.3 for an illustration of the read strategy.

Figure 4.14 Model correctness evaluation. When 100%, 90%, 80%, and 70% of the PFS cache is available, the prediction is generally valid. For the 60% case, the prediction is invalid. Overall precision is about 90%.

Figure 4.15 Evaluation of SSD write methods: writing data that is read through 10% and 20% spatially constrained (SC) subvolume access.

CHAPTER 1

INTRODUCTION

Post-processing-based data analytics typically involves first writing the generated data to a parallel file system (PFS), then reading the data from the PFS into dynamic random-access memory (DRAM) for analytics. As a hard-drive-based PFS is orders of magnitude slower than DRAM [10, 37, 54, 65], this approach incurs substantial I/O time. Runtime data sharing is an effective approach for avoiding the high I/O latency incurred by post-processing methods [26, 69]. The principal idea of runtime data sharing is to assemble a DRAM-based server staging space on a set of dedicated compute nodes. Client processes, which run over another set of nodes, can be simulations, producing data and writing it to the staging space, or analytics applications, reading data from the staging space [5, 27]. Thus, data generation and analytics can be executed concurrently, and slow PFS access is replaced by fast DRAM access.

In this dissertation, we demonstrate the value of the proposed framework using large-scale scientific datasets generated and shared over modern supercomputers. Specifically, we illustrate that our framework enables the runtime sharing of Adaptive Mesh Refinement (AMR) scientific data, unlike traditional uniform mesh data. AMR [6-8] represents a great advance for large-scale scientific simulations. AMR-capable simulations are able to dynamically refine data resolutions over time and space, generating hierarchical, multi-resolution, and non-uniform meshes. This kind of refinement provides sufficient precision for regions of interest at finer levels while avoiding unnecessary data generation for regions of non-interest [69, 75, 76]. By adopting AMR, scientific simulations can achieve significant savings in computing resources, such as CPU, memory, and storage, while retaining or even improving computation accuracy [16, 47].

Many scientific applications have successfully adopted the AMR model, such as BISICLES [20] (a large-scale Antarctic ice-sheet modeling program), Enzo [18, 51] (an astrophysics simulation for analyzing cosmological structure formation), and GenASiS [9, 12, 31, 32] (a simulation for studying neutron star mergers and core-collapse supernovae). In this dissertation, we focus on block-structured AMR, which consists of a collection of disjoint rectangular boxes (or regions) at each refinement level [76].

However, the great flexibility of AMR is also its Achilles' heel. Specifically, a set of dynamic characteristics are inherent to AMR data: the numbers, sizes, and locations of the AMR boxes are usually unpredictable before a simulation run and keep changing as the simulation progresses [69]. Moreover, AMR simulations demonstrate several dynamic runtime behaviors, namely large and heterogeneous changes in resource requirements (DRAM, CPU, and network bandwidth) [44]. Because they are incapable of adapting to these characteristics, existing methods cannot facilitate runtime AMR data sharing across scientific applications. Moreover, flattening and unifying AMR boxes to make the data compatible with existing methods is not a viable solution, as AMR's advantages would be lost and significant overhead would be introduced [76].

Therefore, we present a framework to facilitate runtime AMR data sharing across scientific applications. First, we present a purely DRAM-based framework to support efficient runtime AMR data sharing, which has dedicated server processes for handling metadata, a balanced AMR data placement policy, and a poly-tree-based [21] spatial index to address the issues introduced by the dynamic characteristics of AMR data. Building on this, we further present a set of methods to extend the framework. The upgraded framework is able to store data on Solid State Drives (SSDs) when the DRAM space is full, and to optimize AMR data access performance over the staging space by detecting common spatial AMR data retrieval patterns and adaptively distributing AMR data across the network topology. We finally present a set of approaches to address some key performance issues plaguing SSDs, such as read contention and file fragmentation. In the following subsections, we briefly summarize the challenges and contributions of our work.

1.1 A DRAM-Based Framework for Runtime AMR Data Sharing

1.1.1 Problems and Challenges

To create a framework that supports runtime AMR data sharing, there are three major challenges that should be addressed.

First, the architecture of such a framework should enable efficient online AMR data management. However, runtime AMR metadata synchronization across the distributed server processes of the staging space would incur significant overhead, because of the dynamic characteristics inherent to AMR data and because the data being written to the staging space could arrive in any order. Second, to retain high data access throughput, the framework should maintain a balanced workload distribution across the server-side nodes at runtime. However, static data domain partitioning and distribution methods (e.g., space-filling curves [55]) typically fail to achieve this goal for AMR data: due to the dynamic characteristics of AMR data, those methods cannot determine a suitable partition size until most of the AMR boxes of a domain have been received and examined, which would produce a large runtime overhead. Third, to support data retrieval over specific spatial regions, the framework needs an efficient online spatial index that can satisfy AMR's unique spatial data access patterns, which typically involve accessing multiple boxes across multiple levels [76]. However, existing spatial indices do not effectively capture the hierarchical and non-uniform structure of AMR data.

1.1.2 Approach and Results

To the best of our knowledge, runtime AMR data sharing across scientific applications has not been well explored. To address the architecture issue, the proposed framework has dedicated server processes that handle only metadata, such as tracking the placement of each AMR box and building a spatial index over the metadata. It also has server processes that are responsible only for binary data, namely binary data storage and transmission. With this design, the framework is able to significantly reduce AMR metadata synchronization overhead.

To achieve an overall balanced AMR data distribution over the staging space, the metadata server processes of the framework do not perform any static data domain partitioning and thus do not require any static global domain information before a simulation run. Instead, the metadata servers examine each received AMR box and determine its placement at runtime. To support this task, a metadata server maintains a collection of workload tables that track how much data is stored on each binary data server and each compute node.

To facilitate efficient spatially constrained AMR data retrieval, the framework constructs a poly-tree-based [21] spatial index over the metadata. This spatial index represents the multi-level, many-to-many relationships between AMR boxes well and supports efficient spatial AMR data retrieval.

In addition to addressing the above challenges, the framework demonstrates comparable performance and scalability with existing state-of-the-art work when tested over uniform mesh data; in the best case, our framework achieves a 46% performance improvement.

1.2 Exploring Memory Hierarchy and Network Topology for Runtime AMR Data Sharing

1.2.1 Problems and Challenges

Our previous approach addresses how to achieve efficient runtime AMR data sharing in a purely DRAM-based staging space. However, a purely DRAM-based method is becoming insufficient, as the volume of data being generated by scientific applications continues to grow, often exceeding the capacity of DRAM by 100% or more [43], necessitating the use of the PFS and thus increasing access latency. To address this capacity issue, solid state drives (SSDs) have been utilized as an overflow space for when DRAM fills [43]. SSDs are chosen because they are usually two orders of magnitude faster than the hard disk drives that the PFS uses, and they provide more than ten times the capacity of DRAM [37].

To develop such a memory hierarchy-aware runtime AMR data sharing framework, three major challenges must be addressed. First, to avoid out-of-memory (OOM) errors when receiving data from clients, the framework must know when to move data from DRAM to another part of the memory hierarchy. To predict when the DRAM space will be full, the framework would need to know the ingress throughput; however, the runtime output of AMR simulations is typically unpredictable [44], resulting in a highly dynamic throughput in the staging space. As this issue is rare in the domain of uniform mesh data, it is a challenge brought on by AMR data. Second, because SSDs are about one order of magnitude slower than DRAM [37], being able to identify spatial read patterns to facilitate prefetching data from the SSDs into DRAM is desirable. Unfortunately, due to the multi-level, non-uniform structure of AMR data, the techniques used in [43] are scarcely applicable, requiring new methods to be developed. Third, in order to further optimize data access performance, the framework needs to distribute data across the staging space properly to reduce network transmission latency at runtime. This is not as straightforward as it would be with uniform mesh data [60], as the sizes of AMR boxes are unknown a priori and highly irregular when generated [69].

1.2.2 Approach and Results

To the best of our knowledge, a framework that supports runtime AMR data sharing between scientific applications, employs more than the DRAM portion of the memory hierarchy, and accounts for network topology has not been proposed. By greatly extending our research on a purely DRAM-based runtime AMR data sharing framework (discussed in the previous subsection), we address the above challenges.

First, in order to predict when the staging space is becoming full, the upgraded framework can monitor changes in the ingress throughput to retain an accurate estimate of when DRAM will reach capacity, making it possible to move data to the SSDs in a timely manner. Second, in order to adapt to the highly irregular structure of AMR data and prefetch data from SSDs to DRAM to bridge the speed gap between the two memory layers, the framework employs an AMR data-aware algorithm that effectively searches all AMR levels to generate spatial regions for prefetching. Third, in order to reduce network transmission latency, the framework utilizes a multivariate cost model to adaptively distribute AMR data over the network topology. The model considers factors such as box sizes, staging space workload balance, and topology distances at runtime.

When tested over real AMR datasets, besides effectively managing the staging space without any OOM errors, the framework's spatial access pattern detection and prefetching methods demonstrate about a 26% performance improvement, and its runtime AMR data placement optimization can improve performance by up to 18%.
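The cost model itself is developed in Chapter 3; as a rough illustration of the idea only, the sketch below scores each candidate staging node by combining its topology distance to the writer with its current workload and picks the cheapest one. The linear form of the score, the alpha/beta weights, and all names here are illustrative assumptions rather than the exact model used by the framework.

    #include <cstddef>
    #include <limits>
    #include <vector>

    // Hypothetical per-node state tracked by a metadata server.
    struct StagingNode {
        double topology_distance;  // e.g., hop count from the writing client
        double workload_bytes;     // data already placed on this node
    };

    // Score one candidate node for an incoming AMR box; lower is better.
    // alpha/beta trade off network distance against load balance and are
    // assumed tunable, not values taken from the dissertation.
    double placement_cost(const StagingNode& n, double box_bytes,
                          double alpha = 1.0, double beta = 1.0) {
        return alpha * n.topology_distance + beta * (n.workload_bytes + box_bytes);
    }

    // Pick the node with the minimum combined cost for this box.
    std::size_t choose_node(const std::vector<StagingNode>& nodes, double box_bytes) {
        std::size_t best = 0;
        double best_cost = std::numeric_limits<double>::max();
        for (std::size_t i = 0; i < nodes.size(); ++i) {
            double c = placement_cost(nodes[i], box_bytes);
            if (c < best_cost) { best_cost = c; best = i; }
        }
        return best;
    }

Because both the box sizes and the per-node workloads change at runtime, the score is recomputed for every incoming box rather than fixed by a static partition.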

1.3 Memory Hierarchy Aware Data Read Performance Optimization

1.3.1 Problems and Challenges

Many scientific data analytics applications are becoming I/O bound, as simulations on modern high performance computing (HPC) systems, such as S3D combustion [15] and FLASH reactive hydrodynamics [3], are generating increasingly large datasets. Further compounding this problem, the huge volume of data continuously widens the speed and capacity gap between DRAM and hard disk drives (HDDs). To address this bottleneck, methods for optimizing read performance have been proposed in two independent lines of research: memory hierarchy exploitation and storage layout reorganization.

Regarding the first line, although the related methods achieve their goal by employing SSDs in a pivotal role combined with either the PFS or DRAM, they still lack some considerations that we argue to be important. First, read contention on SSDs may cause up to a 50% performance reduction, greatly compromising the effectiveness of the devices. Second, supporting methods are required to maintain a near-optimal operating environment for SSDs, namely to handle high write latency and fragmentation issues on the devices.

The other line of research for improving read performance involves storage layout reorganization. These techniques transform the original dataset into certain fixed formats [34, 41] or create partial replicas [62, 68]. While they may significantly improve performance for certain types of access patterns, they are not effective for other types. One straightforward approach is to apply several methods to optimize different categories of patterns, but in this case it is necessary to keep more than one full copy of the dataset, which is not storage efficient.

1.3.2 Approach and Results

In order to relieve the read contention issue of SSDs, we propose a general-purpose, memory hierarchy-aware online read algorithm. It can detect possible read contention on SSDs and utilize memory hierarchy resources to mitigate the issue. In order to maintain a near-optimal operating environment for SSDs, we also present a set of data management methods. They are able to orchestrate data across different memory layers to handle issues that may compromise the effectiveness of SSDs, such as fragmentation. Note that these methods only work in cases where each individual SSD device is explicitly available for access, for example when SSDs are installed on each compute node, so that the proposed methods can explicitly manage data access behaviors and data organization on the SSDs. Otherwise, if SSDs are installed on a set of dedicated I/O nodes and only an abstract access interface is provided, as is the case on Cori [19], the proposed methods are not applicable. In order to optimize common access patterns without creating multiple full copies of datasets, we present methods to cache carefully selected data chunks on SSDs. These methods can facilitate both spatially constrained and value-constrained data read access.

When tested over real simulation datasets, the online algorithm can reduce data access time by about 20%-50%. The methods for optimizing common access patterns achieve about a 20%-30% performance improvement.
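Chapter 4 develops the full read algorithm and its coarse performance model; the fragment below only sketches the two-step decision summarized above, in which the reader first caps SSD parallelism when contention is likely and then optionally diverts one process per node to a disjointly striped PFS cache. The thresholds, struct layout, and function names are hypothetical placeholders, not the dissertation's actual model.

    #include <cstddef>

    // Hypothetical inputs describing one parallel read request.
    struct ReadPlanInput {
        std::size_t procs_per_node;      // reader processes on each node
        std::size_t ssd_sweet_spot;      // per-node parallelism beyond which the SSD thrashes (assumed known)
        bool        pfs_cache_available; // a disjointly striped PFS copy of the hot chunks exists
    };

    struct ReadPlan {
        std::size_t ssd_readers_per_node; // processes that read the node-local SSD
        std::size_t pfs_readers_per_node; // processes redirected to the PFS cache
    };

    // Step 1: cap SSD parallelism if contention is likely.
    // Step 2: if a suitable PFS cache exists, offload one process per node to it.
    ReadPlan plan_read(const ReadPlanInput& in) {
        ReadPlan plan{in.procs_per_node, 0};
        bool contention_likely = in.procs_per_node > in.ssd_sweet_spot;
        if (contention_likely) {
            plan.ssd_readers_per_node = in.ssd_sweet_spot;   // near-optimal subset reads the SSD
            if (in.pfs_cache_available && plan.ssd_readers_per_node > 0) {
                plan.pfs_readers_per_node = 1;               // first process on the node reads the PFS cache
                plan.ssd_readers_per_node -= 1;
            }
        }
        return plan;
    }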

CHAPTER 2

A DRAM-BASED FRAMEWORK FOR RUNTIME AMR DATA SHARING

2.1 Introduction

Scientific data analytics is often performed in a post-processing manner: the data generated by a simulation is first written to the file system and then read back for analytics, requiring substantial I/O time. Runtime data sharing across multiple applications is a promising alternative approach for avoiding these increasingly severe I/O bottlenecks [73]. For instance, the data generated by a running simulation can be moved to the memory of a set of dedicated compute nodes, where that data is then retrieved by various analytics applications, such as visualization and transformation [5, 27]. To facilitate this runtime data sharing, related methods typically organize a set of nodes to provide an in-memory data staging and management space on the server side. Through client-side APIs, applications running on other nodes can efficiently write data to the space and retrieve data from it. Although these methods are effective at handling uniform mesh data, they currently do not support adaptive mesh refinement (AMR) data.

AMR represents a significant advance for large-scale scientific simulations [6-8]. By dynamically refining resolutions over time and space, AMR simulations generate hierarchical, multi-resolution, and non-uniform meshes.

This kind of refinement provides sufficient precision for regions of interest at finer levels while avoiding unnecessary data generation for regions of non-interest [76]. In this work, we focus on block-structured AMR, which consists of a collection of disjoint rectangular boxes (or regions) at each refinement level [76].

To the best of our knowledge, runtime AMR data sharing across applications has not been well explored. This is a non-trivial task due to the dynamic characteristics inherent to AMR data; namely, the numbers, sizes, and locations of the AMR boxes are usually unpredictable before a simulation run and keep changing as the simulation progresses. These characteristics prevent existing methods from effectively handling AMR data. Moreover, flattening and unifying AMR boxes to make the data compatible with existing methods is not a viable solution, as AMR's advantages would be lost and significant overhead would be introduced [76].

To create a framework that facilitates runtime AMR data sharing across multiple applications, there are three major challenges that should be addressed. First, the architecture of such a framework should enable efficient online AMR data management. However, runtime AMR metadata synchronization across distributed server processes would incur significant overhead, because of the dynamic characteristics inherent to AMR data and because the data being written to the staging space could arrive in any order. Second, to retain high throughput, the framework should maintain a balanced workload distribution across the server-side nodes at runtime. However, static data domain partitioning and distribution methods (e.g., space-filling curves [55]) typically fail to achieve this goal for AMR data: due to the dynamic characteristics of AMR data, those methods cannot determine a suitable partition size until most of the AMR boxes of a domain have been received and examined, which would produce a large runtime overhead. Third, to support data retrieval over a specific spatial region, the framework needs an efficient online spatial index that can satisfy AMR's unique spatial data access patterns, which typically involve accessing multiple boxes across multiple levels [76]. However, existing spatial indices do not effectively capture the hierarchical and non-uniform structure of AMR data. Moreover, due to the dynamic nature of AMR, building the spatial index efficiently at runtime while maintaining high data transmission performance poses another challenge.

In this work, we propose AMRZone, a framework for facilitating runtime AMR data sharing across multiple scientific applications. In addition to addressing the above challenges, AMRZone demonstrates comparable performance and scalability with existing state-of-the-art work when tested over uniform mesh data; in the best case, our framework achieves a 46% performance improvement. Specifically, towards addressing the above challenges, we make the following contributions through our framework:

- An architecture that facilitates AMR data management by dedicating some server processes to handle metadata exclusively (Section 2.3.1).

- Online balanced AMR data distribution on the server side, achieved by adopting a runtime workload policy based on AMR boxes (Section 2.3.2).

- A polytree-based online spatial index to facilitate spatially constrained AMR data retrieval (Section 2.3.3).

2.2 Background

DataSpaces [25] is the current state-of-the-art framework for runtime data sharing across multiple scientific applications over uniform mesh data. In the following sections, we explain three major issues that prevent it from effectively supporting AMR data. Although other frameworks (described in Section 2.5) can provide a distributed, in-memory data manipulation space, we select DataSpaces for comparison because it is the only framework that can build an explicit online index over the distributedly staged data as well as provide effective data access APIs, both of which are key features that facilitate runtime data sharing across applications.

At the heart of DataSpaces is a distributed hash index that enables efficient data retrieval from spatial regions of interest. The index is based on a Hilbert space-filling curve [50] that is used to partition a global data domain into subregions (or partitions) and then distribute these partitions evenly to the staging nodes. Although it is effective at handling uniform mesh data, several non-trivial issues would arise if DataSpaces were applied to AMR data. A 1GB, 5-level dataset generated by BISICLES [20] (a large-scale Antarctic ice sheet modeling code for climate simulation) is used to illustrate the hierarchical and non-uniform structure of AMR data; its visualization is shown in Figure 2.1.

2.2.1 Issue I: Architecture

DataSpaces has a client-server architecture. The server side is composed of a set of server processes (or servers) running on different nodes to form a virtual in-memory space. The client side is a collection of APIs used to interact with the space. Each server process is responsible for both maintaining metadata (the distribution of data subregions) and transporting data. The server side demands a pre-defined global domain size before any data is written or read. Therefore, once a partition size is determined, it is easy to know the total number of subregions and how to map each one to the servers evenly, before any client connects to the servers. If a client needs to access a certain region of data, it can get the metadata by contacting any server, making the metadata management of this architecture very effective over uniform mesh data.

Figure 2.1 The visualization graph of a 1GB block-structured AMR dataset generated by BISICLES [20]. The coarsest (or lowest) level (level 0) covers the entire global data domain. A finer (or higher) level is generated by refining a set of boxes on the adjacent coarser level, only covering some subregions of interest with higher resolution as defined by a refinement ratio. The boxes at the finer levels represent spatial regions of more interest.

However, the dynamic nature of AMR data (namely, that the numbers, sizes, and locations of the AMR boxes are usually unpredictable before a simulation run and keep changing as the simulation progresses) makes it impossible for the framework to determine a balanced workload distribution before a simulation run (see Section 2.2.2 for details). Thus, in order to maintain consistent metadata between all the servers to monitor the overall workload status, it is necessary to frequently exchange metadata between all servers at runtime. This frequent communication between the servers results in a high runtime synchronization overhead.

2.2.2 Issue II: Online Data Organization

In the data staging space, which is based on the Hilbert curve, an unbalanced workload distribution could arise because the AMR boxes are divided across the different partitions in a highly uneven manner, as illustrated in Figure 2.2. Due to the dynamic nature of AMR data, it would be impossible for DataSpaces to determine a suitable virtual bounding box and partition sizes for all levels of all time-steps before a simulation run.

Figure 2.2 A uniformly partitioned virtual bounding box over the finest level of the AMR data in Figure 2.1. These partitions would be distributed to the staging space according to a space-filling curve (e.g., Hilbert), leading to an unbalanced workload distribution, among other issues.

Moreover, once the data has been partitioned and distributed to the server nodes, it would be too time consuming to dynamically optimize an unbalanced workload distribution, as that would require retrieving, repartitioning, and redistributing all of the staged data. On the other hand, attempting to achieve a balanced workload distribution at runtime would be costly, because the majority of the AMR boxes must be received and evaluated before a suitable partition size can be determined. Arguably, any partitioning method based on space-filling curves (e.g., Z-order, Peano, or Hilbert curves [55]) would not be effective at evenly distributing AMR data onto a set of nodes at runtime. Besides space-filling curves, heat-diffusion-based runtime workload balancing algorithms [23] are also not very suitable, because they require local or global AMR data migration from nodes with high workload to nodes with low workload, which would introduce noticeable network transmission overhead.

Figure 2.3 The client-server architecture of AMRZone consists of two types of server processes: (1) mservers, which only manage metadata, recording how AMR boxes are distributed across dservers and constructing a spatial index; and (2) dservers, which only manage the binary data. Note that Application1 and Application2 could be simulations or other data analytics programs. Also note that AMRZone does not limit the number of applications that can connect to the server side.

2.2.3 Issue III: Online Spatial Index

Analytics over AMR data is usually performed over the boxes from all levels that overlap with the specified spatial region [76], rather than just the boxes on a single level, as illustrated in Figure 2.4. However, to retrieve AMR boxes from all levels using DataSpaces' hash index, where partitions are given by the Hilbert space-filling curve, it would be necessary to check the boxes in every partition that overlaps with the specified query region. This kind of linear checking is inefficient when faced with a large number of parallel queries. Moreover, it would take up CPU resources that could otherwise be used for processing other runtime tasks (e.g., data transportation), reducing parallelism.

2.3 Methods

AMRZone is designed to facilitate data sharing across multiple AMR-capable scientific applications. To address the three issues stated in Section 2.2, AMRZone employs a centralized metadata management architecture, an AMR box-based runtime workload assignment policy, and a polytree-based spatial index.

2.3.1 Architecture

AMRZone consists of a distributed client-server architecture. The server side consists of a set of server processes (or servers), which run on a user-defined collection of compute nodes, providing a shared memory-based virtual data staging and management space with public data access functions. The client side is a set of APIs that can be integrated into running applications (e.g., simulations or other data analytics programs) to access (e.g., write, retrieve, or update) the data in the space. The architecture of AMRZone is depicted in Figure 2.3.

As described in Section 2.2.1, an architecture in which a server process is responsible for both maintaining metadata and transporting data introduces a significant performance overhead for AMR metadata management. To avoid this issue, AMRZone uses servers that exclusively handle either the metadata task or the data transportation task:

- Mservers - responsible for metadata, namely recording on which data server an AMR box is placed and building a spatial index.

- Dservers - manage the actual binary data of AMR boxes, including data storage and transmission.

The mservers act as coordinators between the clients and dservers. For example, when an application wants to write or read a certain region of data, it first contacts an mserver. The mserver updates or searches the metadata and sends back the communication addresses of the set of dservers with which the application can establish connections and perform the data transportation. Note that our framework does not limit the number of applications that connect to the data staging space.
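To make the coordinator role concrete, the sketch below walks through a client-side write of one AMR box against this split mserver/dserver design. All of the function names and types here (contact_mserver, request_placement, and so on) are hypothetical stand-ins for whatever transport layer actually carries the messages (MPI in the prototype, Section 2.3.4); they are not the real AMRZone client API.

    #include <cstdint>

    // Hypothetical metadata describing one AMR box.
    struct BoxMeta {
        int level;                 // AMR refinement level
        int lower[3], upper[3];    // box corners in level-relative index space
        std::uint64_t bytes;       // size of the binary payload
    };

    // Opaque handles and helper calls assumed for illustration only.
    struct MserverHandle;
    struct DserverHandle;
    MserverHandle* contact_mserver(int timestep);
    const char*    request_placement(MserverHandle*, const BoxMeta&);  // mserver records placement, returns a dserver address
    DserverHandle* connect_dserver(const char* dserver_address);
    void           send_box_payload(DserverHandle*, const BoxMeta&, const void* data);

    // Client-side flow for writing one box: metadata goes to the mserver,
    // while the binary payload goes directly to the chosen dserver.
    void write_amr_box(int timestep, const BoxMeta& meta, const void* data) {
        MserverHandle* ms = contact_mserver(timestep);           // 1. ask the coordinator
        const char* dserver_addr = request_placement(ms, meta);  // 2. mserver picks and records a dserver
        DserverHandle* ds = connect_dserver(dserver_addr);       // 3. connect to that dserver
        send_box_payload(ds, meta, data);                        // 4. ship only the binary data
    }

The key property this flow illustrates is that only small metadata messages pass through the centralized mserver, while bulk data moves point-to-point between clients and dservers.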

With this architecture design, AMRZone is able to significantly reduce metadata synchronization. To accomplish this, AMRZone uses a single mserver to handle all metadata associated with a single time-step of a simulation. This design eliminates the need for the servers to exchange metadata at runtime, preventing a significant performance overhead. Concerning the placement of mservers and dservers on the compute nodes, each compute node contains only one mserver process. If a single node's memory is not sufficient, the metadata of different time-steps can be divided across multiple mservers on multiple compute nodes. Additionally, an mserver is able to set up a thread pool to handle incoming requests in parallel. Because threads share the same memory space within a process, the threads inside the same process do not need to exchange any metadata. Moreover, because the metadata-related message sizes are usually very small (a few dozen bytes at most), this centralized metadata management design does not become a performance bottleneck for the framework. In contrast, multiple dserver processes run on a single node, and there is no multi-threading inside a dserver process, as there is no metadata sharing requirement for dservers. With this architecture design, AMRZone is able to achieve a high degree of parallelism. In fact, as we show in Section 2.4.1, this architecture gives satisfactory performance when facing more than 10,000 writers/readers in parallel.

The write and read functions associated with the client APIs use AMR boxes as the atomic units. We make this design decision because AMR data is typically accessed by boxes. For example, in Chombo [17], a popular block-structured AMR data manipulation framework, a level's domain is represented by a collection of boxes. In a previous AMR data analytics method [76], analysis tasks are performed over a set of boxes. For more details of the initial prototype implementation of the framework, refer to Section 2.3.4.

2.3.2 Online Data Organization

As stated in Section 2.2.2, when space-filling curves are used to partition and distribute AMR data, it is difficult to avoid an unbalanced online workload placement at the staging space. To address this issue, the mservers of AMRZone do not perform a static partition of the data domain and, thus, do not require any global domain information. An mserver checks each received AMR box and determines its placement at runtime. To support this task, an mserver maintains a collection of workload tables to monitor how much data is stored on each dserver and each node.

Algorithm 2.1 gives the general procedure that the mservers use to decide on which dserver to place the binary data of an AMR box after a write-amr-box request is received. First, it searches the dserver nodes' workload table for a node with the minimum workload (lines 1-7). Next, it finds the dserver process with the minimum workload on the specified node (lines 9-14). By considering each of the AMR boxes received at runtime, the algorithm avoids any static and uniform data domain partitioning, thus realizing a far better data distribution balance across the staging space than space-filling curves. According to the results in Section 2.4.2, our framework's read performance over AMR data is comparable to the read performance over balanced uniform mesh data, demonstrating the effectiveness of this workload assignment policy.

To store the metadata for an individual AMR box, we employ a linear hash table because of its efficient insertion and lookup operations. A cell of the hash table corresponds to an AMR box, and a combination of a box's coordinates is used as the hash key (a minimal sketch of such a key follows below). Inside each dserver process, there is also a collection of hash tables corresponding to all the levels of all time-steps. These are used to store and retrieve the boxes' binary data. When a dserver receives the binary data of a box, it only needs to insert the box into a hash table. After a box is placed in the hash table, an mserver uses a simple pointer-based list to chain boxes at the same level of a time-step together. This operation does not compromise the performance of an mserver, as every insertion only involves the movement of a few lightweight pointers. These lists can be used for spatial queries or for constructing the polytree-based spatial index (Section 2.3.3).
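As a minimal illustration of the box-keyed hash table described above, the snippet below builds a key from a box's level and corner coordinates and combines them with a standard hash mixer. The struct layout and the specific mixing constants are assumptions made for illustration; the dissertation only states that a combination of a box's coordinates serves as the key.

    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>

    // Assumed minimal descriptor for one AMR box's metadata.
    struct BoxKey {
        int level;
        int lower[3];
        int upper[3];
        bool operator==(const BoxKey& o) const {
            if (level != o.level) return false;
            for (int d = 0; d < 3; ++d)
                if (lower[d] != o.lower[d] || upper[d] != o.upper[d]) return false;
            return true;
        }
    };

    // Combine the level and corner coordinates into one hash value.
    struct BoxKeyHash {
        std::size_t operator()(const BoxKey& k) const {
            std::uint64_t h = static_cast<std::uint64_t>(k.level);
            auto mix = [&h](int v) {
                h ^= static_cast<std::uint64_t>(static_cast<std::uint32_t>(v))
                     + 0x9e3779b97f4a7c15ULL + (h << 6) + (h >> 2);
            };
            for (int d = 0; d < 3; ++d) { mix(k.lower[d]); mix(k.upper[d]); }
            return static_cast<std::size_t>(h);
        }
    };

    // Per-box record an mserver might keep: which dserver holds the payload.
    struct BoxRecord { int dserver_rank; std::uint64_t bytes; };

    using BoxMetadataTable = std::unordered_map<BoxKey, BoxRecord, BoxKeyHash>;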

Algorithm 2.1: How an mserver determines the placement of an AMR box in the staging space
Input: 1D array of the workload of all staging nodes: T1[]; the size of T1: N
Input: 2D array of the workload of all dservers on each staging node: T2[N][]; the size of T2[N]: S[]
Result: A suitable dserver on which to place the box: ds_index

 1  /* Find the node with the minimum workload: */
 2  min_wl = maximum possible workload for a node
 3  node_index = 0
 4  for i = {0, ..., N-1} do
 5      if T1[i] < min_wl then
 6          min_wl = T1[i]
 7          node_index = i
 8  /* Find the dserver process with the minimum workload on that node: */
 9  min_wl = maximum possible workload for a dserver
10  ds_index = 0
11  for i = {0, ..., S[node_index]-1} do
12      if T2[node_index][i] < min_wl then
13          min_wl = T2[node_index][i]
14          ds_index = i

2.3.3 Online Spatial Index

To facilitate AMR data sharing across multiple applications, the framework needs to be able to effectively retrieve data from a specific spatial region of interest. As shown in previous AMR data analytics work [76], spatial queries over AMR data usually request data from all levels, rather than a single level (as illustrated in Figure 2.4). For this purpose, an mserver's linear hash table, which is used to store the metadata of the AMR boxes (as described in Section 2.3.2), is far from efficient: it requires linearly checking all AMR boxes, reducing parallelism by competing for CPU resources with other runtime tasks, such as data transportation. Popular spatial indices (e.g., R-tree [36], Quadtree [56], UB-tree [4], etc.) are also unsuitable for AMR data, because they are not capable of capturing its inherent hierarchical structure. The common idea behind these indices is to organize a set of sub-spaces so that the selection space of a query can be efficiently narrowed down; a one-to-many relationship between sub-spaces is a necessary pre-condition for most of them. However, the relationships between AMR boxes at adjacent levels are many-to-many (in other words, one coarse box can cover multiple fine boxes, and multiple fine boxes can cover one coarse box). Even if it were possible to divide a box into a set of smaller boxes to reduce this many-to-many relationship to a one-to-many relationship, this approach may result in a large number of boxes and add more complicated runtime procedures to the framework. Not only could this compromise index construction and search performance, but it could also introduce network overhead due to additional box write/read operations.

Figure 2.4 A spatial query over AMR data. Typically, boxes are retrieved from multiple levels, rather than a single level. The boxes at level 1 are refined from bigger boxes at level 0.

To overcome those issues, we propose a polytree-based [21] spatial index for AMR data, as shown in Figure 2.5. The root of the polytree represents a single time-step of the simulation. Each level of the tree corresponds to an AMR level, and the nodes at the same level of the tree represent the AMR boxes of that level. Finally, the directed edges from coarser-level nodes to finer-level nodes represent refinement relationships. This polytree index is constructed over the boxes' metadata inside the mservers. Given this index structure, the many-to-many refinement relationships between AMR boxes are well represented and spatial queries can be answered efficiently. After finding the AMR boxes at a level that overlap with a given spatial region, the search at the next finer level can be limited to the boxes that are refined from those found boxes at the coarser level. By performing the search in this manner, AMRZone avoids inefficiently checking all boxes at a certain level.

Algorithm 2.2 illustrates this complete depth-first search procedure. The algorithm starts at the coarsest level (level 0) and searches for AMR boxes that overlap with the given spatial region (lines 3-7). When an overlapping box is found, a recursive function is invoked (lines 9-15) to check the boxes at the adjacent finer level that are refined from the found box. In the function, because we are dealing with boxes at finer levels, it is necessary to refine the given region by a refinement ratio (line 12) before checking whether a box overlaps with the given spatial region.

Figure 2.5 The polytree-based spatial index for AMR data. Tree nodes correspond to AMR boxes. Directed edges denote the refinement relationship. The index effectively represents the many-to-many refinement relationships of AMR boxes across different levels.

When the mserver finds a box satisfying the criteria, it first instructs the dserver that holds the box's binary data to transfer the data to the client, and then it updates the total number of found boxes (lines 5-6 and 13-14). After the entire search concludes, the mserver sends the client the total number of found boxes. This number can be used as the condition variable that terminates the querying API function.

The next major issue is when to build this index, as the boxes generated by a running application could be sent to the framework in any order. Building the index while boxes are being received could introduce significant overhead, because it requires frequently checking the relationships of the received boxes. Instead, after a box is received, an mserver only chains it to the boxes list at its corresponding level, as described in the above subsection (2.3.2). A client API function must be explicitly invoked to request the mserver to build the index. In this way, clients are given the freedom to choose when to build the index. Client applications that use the API to transport data to the staging space usually have control over how the data is sent and know when the transportation finishes. Therefore, the applications can schedule the data-write and index-build tasks in a disjoint manner to avoid unnecessary overhead. This practice is common in many data management fields; for example, in a DBMS, after using SQL insertion statements to write data into the database, users can execute index-build procedures to build a more sophisticated index over the data.

In order to build this polytree-based spatial index, the mserver needs to iterate over the boxes at each level. The iteration at each level is performed over the boxes lists described in subsection (2.3.2). For each box, the mserver must then check all the boxes at the next finer level to see if there is a refinement relationship; if so, a pointer to the finer-level box is added. This kind of linear iteration could be inefficient. To speed up this procedure, AMRZone's API can divide a level's index-building workload into multiple subtasks, which are processed by different threads inside the mserver in parallel, as sketched below.
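The following is a minimal sketch of this parallel index-build step. It assumes boxes are axis-aligned 2D rectangles and that a coarse box and a fine box are related whenever their (appropriately refined) extents overlap; the types and function names are illustrative, not the actual AMRZone data structures.

    #include <algorithm>
    #include <cstddef>
    #include <thread>
    #include <vector>

    struct Box {
        double lo[2], hi[2];        // 2D extent in this box's own level coordinates
        std::vector<Box*> refined;  // pointers to refined boxes at the next finer level
    };

    // Assumed relation: a fine box is refined from a coarse box if their extents
    // overlap once the coarse extent is scaled up by the refinement ratio.
    static bool refines(const Box& coarse, const Box& fine, int ratio) {
        for (int d = 0; d < 2; ++d) {
            if (coarse.hi[d] * ratio <= fine.lo[d]) return false;
            if (fine.hi[d] <= coarse.lo[d] * ratio) return false;
        }
        return true;
    }

    // Link one level of the polytree: for every coarse box, record the finer boxes
    // refined from it. The coarse boxes are split evenly among worker threads.
    void link_level(std::vector<Box>& coarse, std::vector<Box>& fine,
                    int ratio, unsigned num_threads) {
        std::vector<std::thread> workers;
        std::size_t chunk = (coarse.size() + num_threads - 1) / num_threads;
        for (unsigned t = 0; t < num_threads; ++t) {
            std::size_t begin = t * chunk;
            std::size_t end = std::min(coarse.size(), begin + chunk);
            workers.emplace_back([&, begin, end] {
                for (std::size_t i = begin; i < end; ++i)     // each thread owns a
                    for (Box& f : fine)                       // disjoint slice of
                        if (refines(coarse[i], f, ratio))     // coarse boxes
                            coarse[i].refined.push_back(&f);
            });
        }
        for (auto& w : workers) w.join();
    }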

Algorithm 2.2: Searching the polytree-based index to perform a spatial query over one time-step of AMR data

Input: Specified spatial region: R
Input: 1D array of refinement ratios for all levels: REF[]
Input: The built polytree-based spatial index: Index
Input: 1D array of boxes at the coarsest level: Boxes0[]
Input: The size of Boxes0[]: N
Result: The metadata of all found AMR boxes

1  num_found_box = 0
2  /* Search the coarsest level (level 0) first: */
3  for i = 0, ..., N-1 do
4      if region_overlap(R, Boxes0[i]) == TRUE then
5          process_metadata(Boxes0[i])
6          num_found_box++
7          spatialQuery(Boxes0[i], 0)
8  /* Function to search an AMR box's refined boxes: */
9  Procedure spatialQuery(box, lev)
10     ref_boxes = get_refined_boxes(box, Index)
11     for j = 0, ..., size of ref_boxes - 1 do
12         if region_overlap(refine_region(R, REF[lev]), ref_boxes[j]) == TRUE then
13             process_metadata(ref_boxes[j])
14             num_found_box++
15             spatialQuery(ref_boxes[j], lev + 1)
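For concreteness, here is a small self-contained C++ sketch of the depth-first query in Algorithm 2.2 over an in-memory polytree. The box/region types and the way metadata is "processed" (simply counted here) are illustrative assumptions, not AMRZone's actual data structures.

    #include <cstddef>
    #include <vector>

    struct Region { double lo[2], hi[2]; };

    struct BoxNode {
        Region extent;                    // extent in its own level's coordinates
        std::vector<BoxNode*> refined;    // children in the next finer level
    };

    static bool overlaps(const Region& a, const Region& b) {
        for (int d = 0; d < 2; ++d)
            if (a.hi[d] <= b.lo[d] || b.hi[d] <= a.lo[d]) return false;
        return true;
    }

    // Scale a query region up to the next finer level.
    static Region refine_region(Region r, int ratio) {
        for (int d = 0; d < 2; ++d) { r.lo[d] *= ratio; r.hi[d] *= ratio; }
        return r;
    }

    // Recursive part of the query: visit boxes refined from 'box'.
    static void spatial_query(const BoxNode& box, const Region& region,
                              const std::vector<int>& ref, std::size_t level,
                              std::size_t& num_found) {
        Region finer = refine_region(region, ref[level]);
        for (const BoxNode* child : box.refined)
            if (overlaps(finer, child->extent)) {
                ++num_found;                     // process_metadata() would go here
                spatial_query(*child, finer, ref, level + 1, num_found);
            }
    }

    // Entry point: search level 0 linearly, then descend through the polytree.
    std::size_t query(const std::vector<BoxNode>& level0, const Region& region,
                      const std::vector<int>& ref) {
        std::size_t num_found = 0;
        for (const BoxNode& b : level0)
            if (overlaps(region, b.extent)) {
                ++num_found;
                spatial_query(b, region, ref, 0, num_found);
            }
        return num_found;
    }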

Implementation

To construct a prototype of AMRZone, one of the most important pieces is the data transportation layer. There are only a few technical choices, such as TCP/IP-based programming APIs, network-native APIs, and MPI [35]. TCP/IP-based APIs are not designed for high performance computing. Libraries that are based on network-native APIs, such as DART [24], which is used to build DataSpaces, are not very portable. For example, DART uses complicated network-native programming APIs to implement data transmission; in other words, there is a different implementation for each type of network interconnect (e.g., InfiniBand [53], Gemini [13], etc.). As a result, it currently does not support newer high performance computing systems, such as Edison [29] at the Lawrence Berkeley National Laboratory (LBNL). Moreover, there may be a long transition period before it can be ported to next-generation supercomputers, namely Summit [59] at the Oak Ridge National Laboratory (ORNL) and Cori [19] at LBNL. To the best of our knowledge, work that uses a network-native API to perform data transmission faces this portability issue to some degree. Thus, to develop a portable prototype of AMRZone, we use MPI to implement the server processes and data transportation. Pthreads [58] are used to implement the thread pool inside mservers. We leverage pthread-based mutexes and reader/writer locks to protect the finely partitioned data structures that manage the metadata inside the mservers, maintaining data consistency while achieving a high degree of parallelism. Finally, it is important to note that the methods described in the above three subsections are independent of any specific data transportation implementation.

2.4 Results

Evaluations of AMRZone are driven by the goal of showing its high performance compared with the existing state-of-the-art framework, as well as its efficiency in sharing AMR data across multiple applications. Towards this end, we compare the scalability of write/read tasks between AMRZone and DataSpaces over uniform mesh data (Section 2.4.1). In addition, we evaluate AMRZone's performance for write/read actions and spatially constrained accesses over real AMR data (Sections 2.4.2 and 2.4.3, respectively). The AMRZone prototype implementation is evaluated on Titan [63] at the Oak Ridge National Laboratory (ORNL). Titan is a Cray XK7 machine with a total of 18,688 compute nodes. Each node contains a 16-core 2.2GHz AMD Opteron 6274 processor and 32GB of memory. A pair of nodes shares a Gemini [13] high-speed interconnect router.

For each experiment, the total time of all writes/reads (in seconds) is reported. Each experiment is repeated at least 15 times, and the run with the smallest write/read time is reported, since it has the least influence from outliers with much larger values.

2.4.1 Scalability

To demonstrate that our architecture design can handle large data transmissions with many parallel writers and readers, we compare the weak scalability of AMRZone and DataSpaces. Specifically, we use the officially distributed DataSpaces 1.6 (the latest version) source code. The same compilation configuration as the DataSpaces 1.6 module on Titan is adopted. Moreover, the code for testing the DataSpaces server program is unmodified and the same as the one used to build the module on Titan. We use the client APIs of both frameworks to develop our own client testing programs.

Since DataSpaces cannot handle real AMR data, we use synthetic 3D double-precision uniform data. A time-step's domain is evenly partitioned into a set of subregions (or boxes) by four partition sizes, which are assigned evenly to a collection of parallel processes. After launching, the parallel processes of a client testing program first write their assigned boxes of a time-step to the server space and then retrieve those written boxes from the space. In each experiment, the write/read operations are performed over 5 time-steps.

The configuration details of the experiments are summarized in Figure 2.6. Every row represents a subset of experiments where one box size is used to partition four different domains. As the domain size increases, so does the total number of boxes and parallel client/server processes. By default, we use the minimum number of Titan nodes to hold the client and server processes. However, DART [24], on which DataSpaces is built, utilizes remote direct memory access (RDMA) to transport data, and the memory available for RDMA on each Titan node is about 2GB by default; therefore, a large box size or a high number of parallel processes means more nodes must be used to host the same number of processes, otherwise DataSpaces crashes. Also note that for AMRZone, Figure 2.6 only shows how many dserver instances are deployed; for all experiment cases, we consistently use one mserver instance with 15 threads.

Figure 2.7 shows the results of the experiments. In total, there are 32 comparison scenarios (each of the 16 cases in Figure 2.6 includes a write and a read scenario). In 22 of these scenarios, AMRZone performs better than DataSpaces. In the best case it achieves around a 46% improvement. The min, max, median and average improvements are around 0.9%, 46%, 22% and 24%, respectively. In the other 10 scenarios, AMRZone performs worse. In the worst case, it experiences around a 35% higher execution time. The min, max, median and average reductions are around 0.05%, 35%, 19% and 15%, respectively. Regarding scalability, AMRZone generally achieves a better result (a smaller increase in execution time as more compute resources are devoted to processing a larger domain size) compared to DataSpaces.

Figure 2.6 Configuration details for the scalability comparison experiments. The row header denotes four domain sizes and the column header gives four partition sizes. The format, $B $C($N) / $S($N), denotes the total number of boxes (B), the total number of parallel client processes (C), the total number of client nodes (N), the total number of DataSpaces server or AMRZone dserver processes (S), and the total number of server nodes (N).

Figure 2.7 Results of the weak scalability comparison experiments between AMRZone and DataSpaces, over 5 time-steps. AMRZone generally performs better than DataSpaces (22 out of 32 cases). In the best case, it achieves about a 46% performance improvement; in the worst case, there is about a 35% performance reduction.

Based on these results, we consider our framework's performance to be, on average, comparable with DataSpaces.

2.4.2 Performance over AMR Data

The experiments in this section evaluate the write/read performance and the workload distribution of AMR boxes over the server space of our framework. First, we use the 2D AMR datasets generated by BISICLES [20], a large-scale Antarctic ice sheet modeling code for climate simulation. Then, to have a baseline for comparison, we also include experiments over 2D double-precision synthetic uniform data with similar configurations. The testing programs know the exact coordinates of the boxes. Figure 2.8 gives detailed information about the experiments over these two datasets.

The BISICLES-generated datasets consist of double-precision values and are about 1GB in size. Each dataset has 5 levels, with a total of about 6,700 boxes. To create larger datasets for performance testing, all boxes of a time-step are expanded by factors of 8, 16, 32 and 64, respectively, so an original 1GB dataset grows by the same factors while the total number of boxes in a time-step does not change. During this expansion, we ensure that the relative positions of the boxes with respect to their adjacent levels do not change. In each experiment, 512, 1024, 2048 and 4096 parallel processes (based on the client APIs of AMRZone) are used to write/read 10 time-steps of data (recall that each write/read operates on one AMR box). On the server side, we consistently use 1 mserver process with 15 threads and the minimum number of nodes to host the client and dserver processes.

In each of these AMR data related experiments, the workload assignment policy gives each client process a similar amount of data to write/read. This means that some processes may be assigned a few big boxes, while others may be assigned more boxes of smaller sizes. Although still not completely balanced, compared to assigning each process a similar number of boxes, this approach achieves a more balanced workload distribution between client processes, thus improving performance.

It is important to point out that none of these experiments are weak scaling. First, regarding the experiments over AMR data, although on average the per-process workload remains the same as the time-step domain size and number of processes increase, the actual workload for each process is not the same due to the highly irregular sizes of the AMR boxes. In fact, in one time-step of this BISICLES simulation, the biggest box is 17 times larger than the smallest one. Second, for the experiments over synthetic data, although the actual overall workload for each process is consistent, the number of boxes and the box size for each process differ among the experiments. A higher number of processes with bigger data chunks could cause noticeably more network transmission overhead.
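As a concrete illustration of the size-balanced assignment policy described above, the following sketch greedily assigns each box to the client process with the smallest running total of bytes. This is only one plausible way to realize "a similar amount of data per process"; the dissertation does not specify the exact assignment algorithm.

    #include <algorithm>
    #include <cstdint>
    #include <functional>
    #include <queue>
    #include <utility>
    #include <vector>

    // Greedy longest-processing-time style assignment: sort boxes by size
    // (descending) and always give the next box to the least-loaded process.
    // Returns, for each process, the list of assigned box indices.
    std::vector<std::vector<std::size_t>>
    assign_boxes(const std::vector<std::uint64_t>& box_sizes, std::size_t num_procs) {
        std::vector<std::size_t> order(box_sizes.size());
        for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
        std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
            return box_sizes[a] > box_sizes[b];
        });

        // Min-heap of (current load, process id).
        using Entry = std::pair<std::uint64_t, std::size_t>;
        std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
        for (std::size_t p = 0; p < num_procs; ++p) heap.push({0, p});

        std::vector<std::vector<std::size_t>> assignment(num_procs);
        for (std::size_t idx : order) {
            auto [load, proc] = heap.top();
            heap.pop();
            assignment[proc].push_back(idx);
            heap.push({load + box_sizes[idx], proc});
        }
        return assignment;
    }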

Figure 2.8 Configuration details for two sets of AMRZone experiments, one over expanded BISICLES AMR datasets, one over synthetic uniform datasets. Rows 2 and 6 give the dataset size (GB) for one time-step; for the synthetic datasets, they also give the global dimension size. Rows 3 and 7 give the total number of client processes (C), the total number of client nodes (N), the total number of dserver processes (S) and the total number of server nodes (N) for the corresponding time-step. Rows 4 and 8 give the total number of boxes (B) and box sizes (MB) in the corresponding time-step; for the synthetic datasets, they also give the dimension size of a box. In rows 2 and 4, the values for the BISICLES datasets are averages.

Thus, when reviewing the results of the two sets of experiments, it is more appropriate to compare the two sets to each other, rather than comparing all experiments of a single set together.

Figure 2.9 shows the results of the experiments. As expected, the performance over AMR data is worse than the performance over the uniform synthetic data. This can be attributed to the unbalanced workload distribution across client processes when writing/reading AMR datasets. In the worst case (reading an 8GB dataset), the AMR data related tasks demand more than 36% additional execution time. In the best case (writing a 64GB dataset), they only need about 2% additional execution time. There are three cases in which AMR data coupled tasks take more than 10% additional time: writing/reading an 8GB dataset (29%, 36%), and writing a 32GB dataset (11%). A possible explanation for the two 8GB AMR cases taking a noticeably higher percentage of additional time is that the box size of the 8GB synthetic data domain is small, making write/read operations very efficient; therefore, the two AMR cases perform comparatively worse. In all other cases (5 out of 8), AMR data coupled writes/reads require no more than 10% additional time. Considering the unbalanced distribution of AMR boxes across the client processes (the biggest box is 17 times larger than the smallest one), we believe AMRZone's performance over real AMR datasets to be comparable with its performance over uniform mesh data, and thus satisfactory.

Finally, the statistics for the AMR data workload on the dservers are shown in Figure 2.10. When the number of dservers is relatively small, the workloads are closer to each other; in contrast, as the number of dservers increases, the gaps between the workloads also increase. This trend is expected because the workload assignment becomes more unbalanced for a higher number of processes.

Figure 2.9 Results of box write/read performance testing for AMRZone over real AMR datasets, with comparisons on synthetic uniform data, over 10 time-steps in total. In the worst case, the AMR data related task demands more than 36% additional execution time; in the best case it is 2% more. In 5 cases (out of 8), AMR data coupled writes/reads need less than 10% more time. Note these are not weak scaling tests.

Figure 2.10 The statistics for the AMR data workload on dserver processes. The row header denotes the four domain sizes and the total number of dserver processes, respectively. The column header denotes the minimum, maximum, average, median, first quartile and third quartile of the workload on the dservers for the experiment related to each domain size. Note that, for each domain size, there are 10 time-steps of AMR data written to the server space.

However, the min, avg, median, Q1, and Q3 are quite similar to each other for all process counts, which indicates a good overall balance. Considering that, for the AMR dataset, the biggest box is 17 times larger than the smallest one, we believe AMRZone's workload assignment policy on the server side produces satisfactory results.

2.4.3 Performance of Spatially Constrained Interaction Coupled with AMR Data

In this section, we evaluate the performance of AMRZone under a more complicated data sharing scenario: retrieving the data (or AMR boxes) of specific spatial regions of interest. The datasets are the same collection of BISICLES's 1GB time-steps used in the previous subsection (Section 2.4.2). In this dataset, the regions that are covered by boxes at finer levels (for instance, levels 3 and 4) represent spatial areas of greater interest, for example ice sheet grounding lines, calving fronts, and ice streams [20]. Figure 2.1 shows the visualization of one time-step. At the finest level, there are about 4,100-4,300 AMR boxes.

In each of the experiments, we expand all the boxes of a time-step by 0, 2, 4 and 8 times, respectively, similar to what is done in Section 2.4.2. Moreover, we first write 10 time-steps of data to the server space, then use 256, 512, 1024 and 2048 parallel processes (based on the client APIs of AMRZone) to perform spatially constrained data retrievals over the staged time-steps one by one. The processes could represent potential data analytics applications. The coordinates of the AMR boxes at the finest level are used by the client processes as the spatial query conditions to perform the data retrieval. The box assignment policy assigns each process a similar number of spatial regions. On the server side, we consistently use 1 mserver process with 15 threads and the minimum number of nodes to host the client and dserver processes. Since AMR data write performance has been evaluated in the previous experiments, we only record the execution time of data retrieval in this section.

Before using the coordinates of an AMR box at the finest level for a spatial query, the coordinates need to be properly mapped to the domain of the coarsest level, according to the refinement ratios. Recall that, for AMR data spatial queries, boxes at all levels are retrieved rather than a single level (2.3.3), and more than one box could be refined from a single coarser-level box. So, the total amount of data retrieved may be much larger than the actual size of a time-step. For a single time-step of the 1GB BISICLES datasets, the experiments designed above would retrieve about 26,000 AMR boxes and 13GB of data in total. So, for the time-steps that are expanded 2, 4 and 8 times respectively, the final retrieved amount of data is about 26GB, 52GB and 104GB. To have a point of comparison, we also include AMR box read experiments over the BISICLES datasets (here the testing program knows the exact coordinates of each AMR box).
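The mapping from finest-level coordinates down to the coarsest-level domain is a simple division by the accumulated refinement ratios. A hedged sketch follows, assuming integer box bounds and per-level ratios between adjacent levels (the usual way block-structured AMR data is described); the names are illustrative.

    #include <vector>

    struct IBox { long lo[2], hi[2]; };  // integer extent in a level's own index space

    // Map a box given at 'level' down to level-0 coordinates by dividing through
    // the refinement ratios of all coarser levels. ratios[l] is the ratio between
    // level l and level l+1.
    IBox map_to_level0(IBox box, int level, const std::vector<int>& ratios) {
        for (int l = level - 1; l >= 0; --l) {
            for (int d = 0; d < 2; ++d) {
                box.lo[d] /= ratios[l];
                // Round the upper bound up so the coarse box still covers the fine one.
                box.hi[d] = (box.hi[d] + ratios[l] - 1) / ratios[l];
            }
        }
        return box;
    }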

Figure 2.11 Configuration details for two sets of AMRZone experiments over expanded BISICLES datasets, one for spatially constrained data retrieval, one for AMR box retrieval. Rows 2 and 6 give the amount of data (GB) retrieved for a time-step. Rows 3 and 7 give the total number of client processes (C), the total number of client nodes (N), the total number of dserver processes (S) and the total number of server nodes (N) for the corresponding time-step. Rows 4 and 8 give the total number of boxes (Boxes) and box sizes (MB) in the retrieved data for a time-step. The values in rows 2, 4, 6 and 8 are averages.

For the box read experiments, we expand the boxes in the 1GB BISICLES datasets by 13, 26, 52 and 104 times, write them to the staging space and retrieve the boxes (similar to what is performed in Section 2.4.2, a read operation is provided the exact coordinates of a box, and the workload assignment policy assigns each process a similar amount of data to read). Figure 2.11 gives detailed information about these experiments. It is important to point out that neither of the two sets of experiments is weak scaling, because the actual workload for each process does not remain the same as the dataset size and number of processes increase. Thus, when reviewing the results of the two sets of experiments, it is more appropriate to compare the two sets to each other, rather than comparing all experiments of a single set together.

Figure 2.12 shows the results. For the 13GB, 26GB and 52GB cases, the spatially constrained data retrieval uses about 69%, 63%, and 32% less execution time compared to reading the AMR boxes. For the 104GB case, spatially constrained data retrieval takes about 31% more execution time. An important fact which should be considered before explaining the results is that the spatial access retrieves about 4 times more boxes than the box read (as described earlier). However, when the average box size is relatively small, transmitting an individual box is so efficient that even 4 times more transmissions can still be fast. In addition, a relatively smaller number of processes helps to achieve a more balanced workload distribution. Thus, in the first three cases, the spatial queries perform better than reading all the AMR boxes. However, with an increasing average box size, 4 times more data transportation causes significant network overhead. Worse, more processes lead to a more unbalanced workload distribution, further compromising performance. In terms of the amount of data (MB) one process retrieves for a one time-step spatial access, the ratio of maximum to minimum for the 2048-process case is 143:37.

Figure 2.12 Results of spatially constrained data retrieval performance testing for AMRZone over AMR datasets, with comparisons to AMR box reads, over 10 time-steps in total. Note these are not weak scaling tests.

So, for the last case, spatial access endures a noticeable performance downgrade. However, considering the above factors, we believe the spatial AMR data retrieval performance of AMRZone is satisfactory overall.

Finally, it takes about 0.8 seconds for the mserver to build the polytree-based spatial index for the 10 time-steps of data in these experiments. In fact, because expanding the boxes does not affect the box counts and relative locations at each level of a time-step, whether the boxes are expanded or not does not influence the efficiency of the index construction. Considering that the index is built once and read many times, we believe this result is satisfactory.

2.5 Related Works

In-situ and in-transit data analytics are widely used to avoid the high overhead related to file system I/O. In-transit refers to the approach of moving data from the compute nodes on which a simulation is running to a virtual in-memory space that is constructed by another collection of nodes, and performing various analytics tasks over that space. In-situ means the analytics tasks share the same compute resources as the running simulation. The term analytics can denote actions like writing data to storage, feature extraction, indexing, compression, transformation, visualization, etc. [73]. To support these complicated tasks, a number of approaches have been proposed.

Work that does not involve file system I/O usually studies how to efficiently move data among nodes and provides various functions to facilitate analytics tasks. EVPath [30] enables users to set up dataflows among compute nodes through which fully-typed data can flow with assigned operators, filters, or routing logic. GLEAN [10] makes data movement topology-aware and provides functionalities like data subfiling and compression. DataSpaces [25] builds a space-filling-curve [50] based index over the data in the virtual space, and provides efficient access functions so that live data of any spatial region can be written to or read from the space. These key features of DataSpaces greatly facilitate runtime data sharing across applications, compared to manually implementing these complex communication behaviors with low-level programming standards. Such data sharing scenarios typically consist of multiple heterogeneous and coupled simulation processes dynamically exchanging data on-the-fly [25].

Other related work is coupled with file system I/O. Besides inheriting EVPath's data transportation, extending its functions and enhancing performance, FlexPath [22] is integrated into ADIOS [49] as a transport method. Adopting the data transportation and manipulation methods of EVPath, DataStager [1] provides a phase-aware, congestion-avoiding data movement scheduler and an interface compatible with ADIOS. PreDatA [74] is also based on EVPath and provides a few pluggable data analytics functions, such as sorting, plotting, and reducing/integrating with ADIOS. The method in [5] combines DataSpaces, ADIOS, and other in-situ techniques to speed up scientific analysis tasks.

Based on DataSpaces, ActiveSpaces [27] supports defining and executing data processing routines in the space. SDS [28] provides efficient scientific data management and querying as services. All of the above works target uniform mesh data. Moreover, only DataSpaces can build an explicit online data index and provide a public data access API, which are indispensable features for supporting runtime data sharing across applications. To the best of our knowledge, runtime data sharing across AMR-capable applications has not been studied before.

2.6 Conclusion

In this work, we first identify three major challenges in developing a framework to facilitate AMR data sharing across multiple scientific applications. We then address these challenges with our framework, AMRZone, whose performance and scalability are comparable with the existing state-of-the-art framework even when tested over uniform mesh data.

CHAPTER 3

EXPLORING MEMORY HIERARCHY AND NETWORK TOPOLOGY FOR RUNTIME AMR DATA SHARING

3.1 Introduction

Runtime data sharing is an effective approach for avoiding the high I/O latency incurred by post-processing methods [26, 69]. The principal idea of runtime data sharing is to assemble a dynamic-random-access-memory (DRAM) based staging space on a set of dedicated compute nodes. Client processes, which run on another set of nodes, can be simulations, producing data and writing it to the staging space, or analytics applications, reading data from the staging space [5, 27]. Thus, simulations and analytics can run concurrently, and slow parallel file system (PFS) access is replaced by fast DRAM access. However, the volume of data being generated by scientific applications continues to grow, often exceeding the capacity of DRAM by 100% or more [43], necessitating the use of the PFS and thus increasing access latency. To address this capacity issue, solid state drives (SSDs) have been utilized as an overflow space for when DRAM fills [43]. SSDs are chosen as they are usually two orders of magnitude faster than the hard disk drives that the PFS uses, and they provide more than ten times the capacity of DRAM [37]. Still, as SSDs are about one order of magnitude slower than DRAM [37], optimizations of data access performance over the staging space are desirable.

Currently, there are two independent lines of research for improving data access performance over the staging space. The first approach directly targets the speed gap between SSDs and DRAM: by detecting spatial patterns in the read requests of the clients, the staging space can prefetch data from the SSDs into DRAM, relieving the access latency of the SSDs [43]. The second approach focuses on the topology of the network, aiming to decrease the distance between a requesting client and where the data is stored in the staging space [60]. However, both of these methods were developed for uniform mesh data, and are thus hardly applicable to adaptive mesh refinement (AMR) data [6-8], given its multi-level and non-uniform structure, as well as its highly irregular box sizes. To the best of our knowledge, a framework to support runtime AMR data sharing between scientific applications that employs more than the DRAM portion of the memory hierarchy and accounts for network topology has not been proposed. In the following, we summarize three major issues that should be addressed in order to develop such a framework, as well as our contributions, among others, to the domain of runtime data sharing.

First, to avoid out-of-memory (OOM) errors when receiving data from clients, the framework must know when to move data from DRAM to another part of the memory hierarchy. Similar to [43], we consider a memory hierarchy consisting of DRAM and SSDs, as the PFS is too slow and CPU caches are too small. To predict when the DRAM space will be full, the framework needs to know the ingress throughput; however, AMR simulations' runtime output is typically unpredictable [44], resulting in a highly dynamic throughput at the staging space. As this issue is rare in the domain of uniform mesh data, it is a challenge brought on by AMR data. In our framework, we address it by monitoring the changes in the throughput to retain an accurate estimate of when DRAM will reach capacity, thus enabling us to move data to the SSDs in a timely manner.

Second, similar to uniform mesh data, being able to identify spatial read patterns to facilitate prefetching data from the SSDs into DRAM would help reconcile the difference in access latency. Unfortunately, due to the multi-level, non-uniform structure of AMR data, the techniques used in [43] are scarcely applicable, requiring new methods to be developed. To address this issue, our framework employs an AMR data-aware algorithm to effectively search all AMR levels to generate spatial regions for prefetching.

Third, in order to further optimize data access performance, the framework needs to distribute data across the staging space properly to reduce network transmission latency at runtime. This is not as straightforward as it is with uniform mesh data [60], as the sizes of AMR boxes are unknown a priori and highly irregular when generated [69]. To address this issue, we propose a multivariate cost model, accounting for box sizes, staging space workload balance, and network topology, which the framework utilizes to properly select the staging nodes on which to store AMR boxes.

In summary, we present a framework to facilitate runtime AMR data sharing between scientific applications. We specifically address the scenario in which the generated data far exceeds the capacity of a pure DRAM-based staging space, and we optimize its data access performance across multiple memory layers and the network topology. When tested over real AMR datasets, our framework's spatial access pattern detection and prefetching methods demonstrate about a 26% performance improvement, and its runtime AMR data placement optimization can improve performance by up to 18%. Specifically, our framework makes the following contributions:

• An AMR data-aware capacity control policy for the staging space (Section 3.3.1).
• Methods for AMR data spatial access pattern detection and prefetching (Section 3.3.2).
• A model for optimizing runtime AMR data distribution across the network topology (Section 3.3.3).

3.2 Background

3.2.1 Block-structured AMR Data

Adaptive mesh refinement (AMR) data has been shown to be an important advancement for scientific applications [6-8]. In this work, we focus on block-structured AMR data, which consists of a collection of disjoint rectangular boxes (or regions) at each refinement level [76]. Figure 3.1 [69] shows a visualization of a 1GB, 5-level block-structured AMR dataset produced by BISICLES [20], a program for modeling the Antarctic ice sheets for climate simulation. The first level, level 0, covers the entire domain and is the coarsest. Each higher level is a refinement of the coarser level below it, retaining only some spatial regions of interest and storing them at a higher resolution. A refinement ratio determines how much finer the resolution is. The number, sizes, and locations of the AMR boxes are usually unpredictable before a simulation run, and they continually change as the simulation progresses [69]. Moreover, AMR simulations demonstrate several dynamic runtime behaviors [44], specifically large and heterogeneous changes in resource requirements (DRAM, CPU, and network bandwidth).

3.2.2 Overview of AMRZone

The research in this work is based on our previous work, AMRZone [69], which is a DRAM-based runtime AMR data sharing framework for scientific applications. AMRZone has a client-server architecture. The server is composed of a set of dedicated compute nodes, providing a virtual DRAM-based data staging space.

Figure 3.1 The visualization of a 1GB block-structured AMR dataset generated by BISICLES [20]. BISICLES is a large-scale simulation for modeling Antarctic ice sheets.

The client is a set of APIs, which can be used by applications to access data in the space. To facilitate runtime AMR data management, AMRZone consists of two types of server processes: mservers, dedicated to metadata, namely tracing the AMR box distribution over the staging space and constructing a spatial index; and dservers, which handle binary data, such as storage and transfer. The mservers coordinate between dservers and clients. For instance, when a client needs to write some data, it first sends a request to an mserver. The mserver then finds a suitable dserver and sends back the dserver's communication address, to which the client then connects for data transmission. Inside an mserver, there is a thread pool to handle requests in parallel. This design has been shown to have scalability comparable with the state-of-the-art framework in the uniform mesh data domain when tested with up to 16,384 cores [69].

AMRZone achieves an overall balanced workload distribution over the staging space by (i) maintaining workload tables for every staging node and dserver, and (ii) employing an AMR-box-oriented runtime workload distribution policy. To facilitate spatial AMR data retrieval, it builds a polytree-based [21] online index, as illustrated in Figure 3.2 [69]. In the index, every box not in the highest level keeps a list of pointers to the boxes in the next higher resolution level that are refined from it. This index represents the many-to-many relationships of AMR boxes well: a coarser-level box can have multiple finer-level boxes, and a finer-level box can be refined from more than one coarser-level box.

Figure 3.2 The polytree-based spatial index for AMR data in AMRZone. Tree nodes represent AMR boxes. Edges represent the refinement relationship.

3.3 Methods

By greatly extending our previous work, AMRZone [69], this work addresses the three major issues stated in Section 3.1. In doing so, we propose a framework to facilitate runtime AMR data sharing between scientific applications that utilizes multiple memory layers and is cognizant of the network topology.

3.3.1 Runtime Staging Space Capacity Control

In this section, we first discuss how to detect changes in the ingress throughput, i.e., the rate at which data is being written to the staging space; then we describe our methods for managing the space's capacity in two different cases.

Runtime Ingress Throughput Detection

In some cases, our framework needs to know the data ingress throughput before predicting when the staging space will become full. However, previous work shows that the output throughput of AMR simulations is unpredictable [44]; thus, it is necessary to dynamically detect throughput changes at runtime.

In our framework, the total available DRAM of a staging node is divided evenly among all dservers (server processes that only handle binary data, 3.2.2). Each dserver manages its share of DRAM independently, recording usage statistics and moving data between DRAM and SSDs. This design is adopted because sharing the entire DRAM among all processes on a node requires frequent resource usage synchronization, introducing a non-negligible overhead. Moreover, AMRZone has proven to be effective at balancing the data distribution across all dservers [69].

In order to track the changes in ingress throughput, a dserver records the elapsed time (t1) during which a fixed amount of data is received (i.e., until an inspection point is reached). Once the fixed amount of data has been received, or the inspection point is reached, the throughput is re-calculated, and the iteration continues.
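A minimal sketch of this per-dserver throughput tracker is shown below, assuming the inspection point is a fixed byte count chosen at start-up; the names and types are illustrative rather than the framework's actual code.

    #include <chrono>
    #include <cstdint>

    // Tracks the DRAM ingress throughput of one dserver between inspection points.
    class IngressTracker {
    public:
        explicit IngressTracker(std::uint64_t inspection_bytes)
            : inspection_bytes_(inspection_bytes),
              window_start_(std::chrono::steady_clock::now()) {}

        // Called whenever a box of 'bytes' is received; returns the latest
        // throughput estimate (bytes/second).
        double on_receive(std::uint64_t bytes) {
            received_ += bytes;
            if (received_ >= inspection_bytes_) {
                auto now = std::chrono::steady_clock::now();
                double t1 = std::chrono::duration<double>(now - window_start_).count();
                if (t1 > 0.0) throughput_ = static_cast<double>(received_) / t1;
                received_ = 0;                 // start a new inspection window
                window_start_ = now;
            }
            return throughput_;
        }

        double throughput() const { return throughput_; }  // bytes/second

    private:
        std::uint64_t inspection_bytes_;
        std::uint64_t received_ = 0;
        double throughput_ = 0.0;
        std::chrono::steady_clock::time_point window_start_;
    };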

Around 5% of the initial amount of available DRAM is a reasonable value for the inspection point, as it enables a dserver to effectively detect throughput variation while avoiding the overhead of frequent checks.

Capacity Control

We first discuss the common scenario of writing data to the staging space. If only a single data producer connects to the space, and there is a sufficient time period between the generation of two adjacent time-steps of data (e.g., computation time), it is possible to utilize this period to move one time-step of data to the SSDs to make space for the next time-step. Under this condition, the write throughput of the client applications should not be delayed. In this case, a dserver uses the following condition to determine whether a data move task should be triggered when it arrives at a time period between two time-steps:

2 × TS_Data > FreeDRAM

Here, TS_Data is the average amount of data in a time-step received by a dserver, and FreeDRAM is how much DRAM is currently available to the dserver. When the inequality is true, it is necessary to move a time-step of data from DRAM to the SSDs to make space for the next time-step. Twice the average time-step size is used to account for the potentially large variance in throughput, helping to avoid OOM errors. Moreover, through the staging space's API, we also provide the ability to notify the staging space to move a certain time-step's data to the SSDs, giving users more flexibility.

Next, we discuss a more demanding case for the capacity control of the staging space, which may become common in the era of big data. This second case is described by the following conditions:

• Before the next computation period, the generated data is too big to be stored in a client application's DRAM, and thus needs to be moved to the staging space.
• The total amount of data is too large to be stored entirely in the DRAM of the staging space, or there are multiple data producers connecting to the staging space (one may be producing data during the other's computation period). Thus, some staged data needs to be moved to SSDs as soon as possible.

If these conditions hold, then the write throughput of the client applications can be delayed, as the servers must temporarily block receiving data to make space in DRAM by moving data to SSDs.
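For the common (non-blocking) case above, the trigger reduces to a simple check at each time-step boundary. A hedged sketch, assuming the dserver tracks a running average of the per-time-step data volume:

    #include <cstdint>

    // Decide, at the boundary between two time-steps, whether a dserver should
    // move one time-step of data from DRAM to the SSDs (common, non-blocking case).
    // ts_data_avg : average amount of data (bytes) received per time-step
    // free_dram   : DRAM (bytes) currently available to this dserver
    bool should_move_timestep(std::uint64_t ts_data_avg, std::uint64_t free_dram) {
        // Reserve twice the average time-step size to absorb throughput variance.
        return 2 * ts_data_avg > free_dram;
    }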

A dserver utilizes the following model to decide when a data move task should be triggered in this case:

t1 × DRAM_Ingress_TP > FreeDRAM + t1 × SSD_Write_TP

Here, t1 is the elapsed time between the latest two inspection points, and DRAM_Ingress_TP is the average ingress throughput during that time. Using these as predictions for the average throughput and the time period until the next inspection point, the model checks whether the predicted total amount of data received by a dserver (t1 × DRAM_Ingress_TP) exceeds the sum of the currently available DRAM (FreeDRAM) and the amount of DRAM that can be freed by writing data to the SSDs (t1 × SSD_Write_TP) in the interim. If the predicted value is larger, then an OOM condition is likely and a data move must be initiated. The amount of data to be moved is set as t1 × DRAM_Ingress_TP − FreeDRAM, and the data to be moved starts with the latest time-step and proceeds in reverse chronological order. To ease data management, an AMR box is always moved in its entirety, rather than kept partially in DRAM.

Finally, each dserver has a dedicated thread for data move tasks. As we consider only runtime data sharing, not post-processing analytics, each dserver creates its own private file to store data; more complex data formats, such as HDF5 [33], are not used. When needed, the data moved to SSDs can be brought back to DRAM. In this case, when the staging space is full, older data can be moved to SSDs or freed.

3.3.2 Spatial Read Patterns Detection and Prefetching

In this section, we discuss how to detect common spatially constrained AMR data retrieval patterns and perform prefetching. We suppose the data to be accessed has already been moved to SSDs, so prefetching it into DRAM can help bridge the speed gap between DRAM and SSDs. When the staging space is becoming full, data can be moved to SSDs as discussed in Section 3.3.1.

A spatial access region is represented by a bounding box over the coarsest level of an AMR dataset. The AMR boxes that overlap with the given bounding box are retrieved from all levels [69, 75, 76]. A selection region does not have to be aligned with the boundaries of AMR boxes. The boundaries of an access pattern may change as the analytics progresses over a series of time-steps. However, the trends of these changes are usually predictable, and by catching the trends our framework can perform data prefetching. In this work, trends are identified by tracing each client process's accesses. For example, for a client process's access region over time-step n, our framework compares the access boundaries with those over time-step n-1; if there is no change, the same access region is re-used for prefetching; otherwise, the framework analyzes the boundary variations, generates a new predicted access region by applying the detected boundary changes, and uses the updated region for prefetching, as illustrated in Figure 3.3.
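The boundary extrapolation itself is a small per-boundary computation. Below is a hedged sketch of the linear-trend prediction described above, where each boundary is assumed to keep moving by the same offset it moved between the last two time-steps; the types are illustrative.

    // 2D access region over the coarsest AMR level.
    struct AccessRegion {
        double lo[2], hi[2];
    };

    // Predict the access region for time-step n+1 from the regions observed at
    // time-steps n-1 and n, assuming each boundary keeps moving by the same
    // per-time-step offset.
    AccessRegion predict_next(const AccessRegion& prev, const AccessRegion& curr) {
        AccessRegion next = curr;
        for (int d = 0; d < 2; ++d) {
            next.lo[d] += curr.lo[d] - prev.lo[d];   // no change -> same boundary reused
            next.hi[d] += curr.hi[d] - prev.hi[d];
        }
        return next;
    }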

Figure 3.3 An illustration of 2D AMR data coupled spatial access pattern detection for a client process. The accessed region over time-step n is compared with the one over time-step n-1. After finding the boundary variation (the right boundary moves one unit toward the right), the framework predicts this change will continue, and it generates a new predicted spatial access region by applying the same trend. This new predicted region is used for prefetching the data of time-step n+1.

The mservers (server processes that only manage metadata) are responsible for tracing the data access history of every client process, detecting patterns, and sending prefetching messages to the dservers (server processes that only manage binary data). Each dserver has a dedicated thread for prefetching, so the data retrieval for time-step n and the prefetching for time-step n+1 can be performed in parallel. While the dserver is transferring the requested data for time-step n to the client, the prefetched data for time-step n+1 begins to be brought into DRAM. However, as it is quicker to transfer data over the network than to read it from SSDs, the dserver may not have all of the next time-step prefetched when the client requests it. As a result, the performance of the dserver with prefetching is bounded above by its performance when all the data is already in DRAM (please refer to Section 3.4.2 for results). Figure 3.4 shows the time sequence diagram of prefetching.

Next, we discuss how to search for the AMR boxes which overlap with a predicted spatial access region when prefetching for the next time-step. It is necessary to search the predicted region rather than re-use the boxes' coordinates of the previous time-step, as the coordinates keep changing across time-steps [69]. For the coarsest level (level 0), the framework needs to linearly check each box to see if it overlaps with the given region. For higher-level (more refined) boxes, it can take advantage of AMRZone's spatial index. The boxes which are refined from a coarser-level box also overlap with that coarser-level box spatially [69]. So, after finding a box which overlaps with the given spatial access region, the search at the next higher level can be limited to those boxes that are refined from the found one.

Figure 3.4 The timing sequence diagram for prefetching, involving a client process, an mserver process's thread, a dserver process and the dserver's dedicated prefetching thread. For a dserver, there is a time period between retrieving a time-step's data and receiving the data request for the next time-step, which can be leveraged to prefetch the next time-step.

Algorithm 3.1: An mserver searches the spatial index for prefetching data of the next time-step (TS)

Input: The predicted access region for prefetching: SR
Input: The array of refinement ratios for all levels: Ref[]
Input: The built spatial index: Idx
Result: Prefetching messages are sent

1  /* Search the coarsest level (level 0) of the next TS: */
2  for box IN all coarsest-level boxes do
3      if box not prefetched AND region_overlap(SR, box) == TRUE then
4          record box as prefetched
5          sendPrefetchMsg(box)
6          scPrefetch(box, 0)
7  /* Function to search a box's refined boxes at the adjacent higher level: */
8  Procedure scPrefetch(cbox, level)
9      ref_boxes = retrieve_refined_boxes(cbox, Idx)
10     for box IN ref_boxes do
11         if box not prefetched AND region_overlap(refine_region(SR, Ref[level]), box) == TRUE then
12             record box as prefetched
13             sendPrefetchMsg(box)
14             scPrefetch(box, level + 1)
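The essential addition relative to the query in Algorithm 2.2 is the deduplication bookkeeping, which prevents a box reachable from several coarse boxes from generating multiple prefetching messages. A self-contained C++ sketch (with illustrative types and a stubbed-out message send) follows:

    #include <cstddef>
    #include <unordered_set>
    #include <vector>

    struct PrefetchRegion { double lo[2], hi[2]; };

    struct PrefetchBox {
        PrefetchRegion extent;                  // extent in its own level's coordinates
        std::vector<PrefetchBox*> refined;      // boxes refined from this one
    };

    static bool overlap(const PrefetchRegion& a, const PrefetchRegion& b) {
        for (int d = 0; d < 2; ++d)
            if (a.hi[d] <= b.lo[d] || b.hi[d] <= a.lo[d]) return false;
        return true;
    }

    static PrefetchRegion refine(PrefetchRegion r, int ratio) {
        for (int d = 0; d < 2; ++d) { r.lo[d] *= ratio; r.hi[d] *= ratio; }
        return r;
    }

    // Placeholder for the real "send a prefetch message to the owning dserver" call.
    static void send_prefetch_msg(const PrefetchBox* /*box*/) {}

    // Depth-first prefetch search with deduplication: a box reachable from several
    // coarse boxes is prefetched only once.
    static void sc_prefetch(const PrefetchBox* box, const PrefetchRegion& region,
                            const std::vector<int>& ratios, std::size_t level,
                            std::unordered_set<const PrefetchBox*>& prefetched) {
        PrefetchRegion finer = refine(region, ratios[level]);
        for (const PrefetchBox* child : box->refined)
            if (!prefetched.count(child) && overlap(finer, child->extent)) {
                prefetched.insert(child);
                send_prefetch_msg(child);
                sc_prefetch(child, finer, ratios, level + 1, prefetched);
            }
    }

    void prefetch_time_step(const std::vector<PrefetchBox>& level0,
                            const PrefetchRegion& predicted,
                            const std::vector<int>& ratios) {
        std::unordered_set<const PrefetchBox*> prefetched;
        for (const PrefetchBox& b : level0)
            if (!prefetched.count(&b) && overlap(predicted, b.extent)) {
                prefetched.insert(&b);
                send_prefetch_msg(&b);
                sc_prefetch(&b, predicted, ratios, 0, prefetched);
            }
    }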

Figure 3.5 Illustration of how redundant messages can be sent during prefetching. At level 0, the two spatial regions overlap with box0_1 and box0_2, respectively, and prefetching messages are sent for those two boxes. However, at level 1, box1_2 overlaps with both box0_1 and box0_2, so multiple messages would be sent for box1_2.

Algorithm 3.1 illustrates the depth-first procedure by which an mserver utilizes the spatial index to search for boxes in a given predicted access region. The algorithm first searches the coarsest level for boxes which overlap with the given spatial region (lines 2-6). After finding an overlapping box, a prefetching message is sent (line 5) and a recursive function (lines 8-14) is called to search boxes at the next higher level. This function first retrieves the boxes at the next level which are refined from the found box (line 9). It then iterates over those finer-level boxes to check whether they overlap with the given region. Before this check, the refinement ratio must be applied to the given prefetch region to enable accurate overlap checks at the current refinement level. Moreover, before the overlap check, the algorithm first makes sure a box has not already been prefetched; after the check (lines 3, 11), it records the box as prefetched (lines 4, 12). Otherwise, given the many-to-many relationships of the boxes, redundant prefetching messages may be sent for a single box, as illustrated in Figure 3.5. Without this check, the performance improvement may be small under high-parallelism scenarios, as too many prefetching messages can cause a bottleneck inside an mserver.

3.3.3 Runtime Data Placement Optimization over Topology

Our final optimization is network topology-aware data placement, i.e., determining which staging node an AMR box should be stored on. Data transmission time is a nontrivial part of the overall data access time. However, the compute nodes allocated for a job can be distributed physically far away from each other over the entire topology, which can lead to two cases that incur undesirable network-related overhead:

• Communication may have to go through many levels of switches/routers.
• Communication is more likely to be subject to contention from other users' tasks.

Figure 3.6 The throughput difference between nodes of different topology distances on Cori [19] (the testbed for this work). For details of how the topology distances are calculated, please refer to Section 3.3.4. The result is based on a micro-benchmark, which sends 1MB ping-pong messages between a pair of nodes, repeating the process 30,000 times. This benchmark setup (small message size and a large number of messages) resembles AMR data, as each AMR box is usually not very large (from a few KBs to a dozen MBs), but the number of boxes in a time-step is high (from several thousand to tens of thousands). At any given time, only one pair of nodes communicates with each other. Average experiment values are reported.

Figure 3.7 An illustration of the runtime factors that our framework must consider when determining where to place an AMR box. The mserver must consider not only the topology distance, but also the size of each AMR box and the workloads of all staging nodes, which keep changing at runtime. For example, in the figure, in terms of topology distance node 0 should be chosen to place the incoming box, but in terms of workload, node 3 should be selected. The final choice should be appropriately balanced among all factors.

Figure 3.6 shows the throughput difference between nodes of different topology distances (to be defined later) on Cori [19]. As shown, the difference in throughput can be up to 34% between different distances. Thus, matching two processes whose nodes are close together can help to improve data access performance.

For uniform mesh data, runtime topology-aware data placement is relatively straightforward. A time-step's domain is pre-partitioned into uniform regions before the data is sent to the space [26, 69]. For each client process, the framework needs to determine a set of server processes which are the closest to the client, and it then maps the client's data to those selected server processes evenly [60]. For AMR data, ideally the framework would find the staging node which has both the lowest workload and the shortest topology distance. However, this is usually unrealistic in real-world scenarios, as the sizes of AMR boxes are highly irregular and unknown to the servers [69], and the workload on each node is not absolutely balanced and keeps changing during runtime. Hence, our framework must be able to consider multiple runtime factors, such as the size of the received AMR boxes, the workloads of all staging nodes, as well as the topology, as illustrated in Figure 3.7.

The final placement should be a balanced choice that considers all of these factors. In order to address this issue, we propose an experiment-based model to select a suitable staging node on which to place an AMR box:

weight = (threshold − node_workload) − α × distance × (box_size / node_avg_box_size)

The following is a brief explanation of each parameter:

• weight: The result of evaluating a node with the given model. The node with the highest weight is chosen to store the incoming box.
• threshold: A statistical value over all nodes' current workloads, e.g., the first quartile.
• node_workload: How much binary data has been stored on a staging node.
• distance: The topology distance between a node and the client which sends the box.
• box_size: The size of the incoming box.
• node_avg_box_size: The average size of all boxes on the node.
• α: A system-specific variable (Section 3.3.4).

For each incoming box, the model is evaluated over a set of selected nodes to determine the box's placement. The model considers workload balance the highest-priority factor, as a large amount of data distributed unevenly across a small number of nodes/processes will not only reduce the parallelism of data access but also increase the likelihood of network congestion. Secondary is the topology distance between the client process and the staging node being considered. This helps achieve a balance between distance and workload distribution in staging node selection.

The minuend in the model, threshold − node_workload, is directly related to workload. The threshold is first used to filter nodes before the model is evaluated: the staging node selection is only performed among the nodes whose workload is smaller than this threshold. The threshold − node_workload term is then the gap between a node's workload and the threshold value. A larger gap means a lower workload on a node, which indicates that the node is a more favorable choice for storing the box from the perspective of workload balance. Through this setting, the model treats workload balance as the first priority, because it only considers nodes whose current workload is not already high (smaller than the threshold), and because, when facing a large amount of data, this workload gap value will be much larger than a topology distance value, exerting a higher influence over the model result (Section 3.3.4).
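A compact sketch of evaluating this model for one incoming box is given below. The first-quartile threshold and byte-based workload units follow the description above; everything else (names, edge-case handling, how α is supplied) is an illustrative assumption.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    struct NodeState {
        std::uint64_t workload;       // bytes currently stored on the node
        double avg_box_size;          // average size (bytes) of boxes on the node
        int distance;                 // topology distance to the requesting client
    };

    // Pick the staging node for one incoming box of 'box_size' bytes.
    // Only nodes whose workload is below the first-quartile threshold are considered.
    // Returns the index of the chosen node, or -1 if no node passes the filter.
    int pick_node(const std::vector<NodeState>& nodes, double box_size, double alpha) {
        if (nodes.empty()) return -1;

        // First-quartile threshold over all nodes' current workloads.
        std::vector<std::uint64_t> loads;
        for (const auto& n : nodes) loads.push_back(n.workload);
        std::size_t q1 = loads.size() / 4;
        std::nth_element(loads.begin(), loads.begin() + q1, loads.end());
        double threshold = static_cast<double>(loads[q1]);

        int best = -1;
        double best_weight = 0.0;
        for (std::size_t i = 0; i < nodes.size(); ++i) {
            if (static_cast<double>(nodes[i].workload) >= threshold) continue;  // filter
            double gap = threshold - static_cast<double>(nodes[i].workload);
            double size_ratio = nodes[i].avg_box_size > 0.0
                                    ? box_size / nodes[i].avg_box_size : 1.0;
            double weight = gap - alpha * nodes[i].distance * size_ratio;
            if (best < 0 || weight > best_weight) {
                best = static_cast<int>(i);
                best_weight = weight;
            }
        }
        return best;
    }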

The subtrahend in the model, α × distance × (box_size / node_avg_box_size), is related to topology distance, where a smaller value indicates a more favorable choice. It is used to offset the influence of workload; i.e., rather than choosing a node that has the lowest workload but is far away from the client, it is better to select one whose workload is a little higher but whose distance to the client is shorter. The box_size/node_avg_box_size term is used to reduce or increase the influence of the distance. Specifically, if this ratio is less than one, then the incoming box is small relative to the others on the node, lessening the effect distance has, as the transmission of a small box should be quite efficient. If the ratio is larger than one, then the opposite is true, and distance plays a larger role, as the transmission is more likely to be subject to network contention. Finally, the node with the highest weight is chosen, and the box is placed on the process whose workload is the lowest on that node. Note that this work does not consider heat-diffusion-based [23] runtime data replication to improve performance, which can be future work.

3.3.4 Implementation

The implementation of this work is based on our previous work, AMRZone [69]. On top of AMRZone's prototype, we have implemented the staging space capacity control, spatial access pattern recognition and prefetching, and network topology-aware modules. In this section, we primarily discuss the network topology of one of our testbeds, Cori [19], and how the model described in Section 3.3.3 adapts to that topology.

How a topology is configured has great influence over how the topology distance should be calculated for a pair of nodes. For example, the Titan supercomputer [63] features a 3D torus topology. Every node on the machine is assigned a unique set of 3D coordinates, so the topology distance between two nodes can be calculated as the distance between two points in a 3D space [60]. However, our testbed Cori has a more advanced Dragonfly topology [19], to which a 3D topological distance cannot be applied. In Cori phase I, nodes are physically organized in a three-level architecture, cabinet-chassis-blade. At the top level, there are 12 cabinets; each cabinet contains 3 chassis; each chassis is composed of 12 blades; each blade has 4 compute nodes. Complicated connections are established between the different levels [19], with nodes within the same blade being the most closely connected. Given this topology, when calculating the distance between a pair of nodes, we only consider the highest level of difference. Specifically, if two nodes are within the same blade, they have a topology distance of 0; if they are only in different blades, the distance is 1; if they are located in different chassis, the distance is 2; if they are in different cabinets, the distance is 3. Lower-level (dis)similarities do not matter; i.e., no matter where two nodes are located within two different cabinets, their distance is 3.
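Assuming node positions are known as (cabinet, chassis, blade) triples, the distance rule above, together with the paired-cabinet special case described next, can be computed as in the following illustrative sketch; this is not the framework's actual code, and the interpretation of the cabinet-pair wiring is an assumption.

    struct NodeLoc {
        int cabinet;   // 0..11
        int chassis;   // 0..2 within the cabinet
        int blade;     // 0..11 within the chassis
    };

    // Topology distance on Cori phase I: 0 = same blade, 1 = different blades
    // (or directly wired blades within a cabinet pair), 2 = different chassis,
    // 3 = different cabinets.
    int topo_distance(const NodeLoc& a, const NodeLoc& b) {
        if (a.cabinet == b.cabinet && a.chassis == b.chassis && a.blade == b.blade)
            return 0;
        // Special case: cabinets {0,1}, {2,3}, ... form pairs, and blades in the
        // same position within a cabinet pair are directly connected (assumed
        // interpretation of the wiring described in the text).
        if (a.cabinet / 2 == b.cabinet / 2 && a.blade == b.blade)
            return 1;
        if (a.cabinet != b.cabinet) return 3;
        if (a.chassis != b.chassis) return 2;
        return 1;  // same cabinet and chassis, different blades
    }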

different cabinets, their distance is 3. This policy has proven able to capture the significant distance differences in the topology while ignoring those that matter little. Figure 3.6 shows the difference in throughput for the four types of distances. However, there is a special case on Cori. The 12 cabinets of Cori are divided into 6 pairs: 0-1, 2-3, 4-5, 6-7, 8-9, and 10-11. Such a pair of cabinets is wired more closely together. Specifically, all the blades within a cabinet pair are also directly connected to the blade in the same position in each chassis of the cabinet pair. So, if a pair of nodes are located in two blades that are connected in this manner, their distance is set to 1. Since the topology layout is fixed for the lifetime of Cori, we only need to compute the distance information once and store it in a file. During initialization, our framework simply reads the pre-computed information and uses it for runtime data placement optimization. Finally, as shown above, the topology distance values on Cori are small. If applied directly to our runtime optimization model, they would not effectively influence the outcome. This is where the α in the model helps, by weighting the distance values appropriately. In our framework, we set α to 3% of the minuend in the model (threshold - node_workload, which represents the workload factor). For example, for a pair of nodes whose distance is 3, multiplying by this α lets the distance factor account for approximately 10% of the workload factor. Our experiments show this choice of α produces reasonable results (Section 3.4.3).

3.4 Results

Our experiments are designed to isolate and evaluate each of the proposed framework's new capabilities: its ability to successfully manage the staging space capacity (Section 3.4.1), to enhance spatially constrained AMR data access over SSDs via pattern recognition and prefetching (Section 3.4.2), and to optimize data distribution over the network at runtime (Section 3.4.3). Each experiment is run multiple times, and the average result values are reported. The framework's prototype is primarily evaluated on Cori phase I [19] at the National Energy Research Scientific Computing Center (NERSC). Cori is a Cray XC40 machine with 1,630 compute nodes. Each node contains two 16-core Intel Haswell processors at 2.3GHz and 128GB of DRAM. Note that the actual amount of DRAM available to applications on a node is less than 128GB, due to the operating system and other runtime libraries. All nodes are connected via the Cray Aries interconnect with a Dragonfly topology. Cori has 875TB of SSD space, composed of a collection of dedicated SSD nodes, which jobs gain access to by declaring their required space in the job submission file. The details of how data is managed over the SSDs are transparent to the user. We believe our framework's design is flexible enough to be adapted to other machines with different SSD architectures, e.g., Summit [59] at Oak Ridge National Laboratory (ORNL), on which each compute node is equipped with SSDs.
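Because the topology-aware placement evaluated in Section 3.4.3 relies on the Cori distance policy described above, the following sketch makes it concrete. The (cabinet, chassis, blade) coordinates and the exact handling of the cabinet-pair special case are illustrative assumptions about how node locations could be encoded, not the framework's actual data structures.

    def cori_distance(a, b):
        """Topology distance between two nodes on Cori phase I.

        a, b: (cabinet, chassis, blade) tuples identifying the blade a node sits on.
        Only the highest level of difference matters: same blade -> 0,
        different blade (same chassis) -> 1, different chassis -> 2,
        different cabinet -> 3, except that blades in the same position of
        paired cabinets (0-1, 2-3, ...) are directly wired and treated as 1.
        """
        (ca, ha, ba), (cb, hb, bb) = a, b
        if ca == cb:
            if ha == hb:
                return 0 if ba == bb else 1
            return 2
        # Different cabinets: check the paired-cabinet special case.
        if ca // 2 == cb // 2 and ba == bb:
            return 1
        return 3

    def precompute_distances(coords):
        """Compute the distance table once; it is valid for the machine's lifetime
        and can be written to a file and reloaded at framework initialization."""
        table = {}
        ids = list(coords)
        for i, u in enumerate(ids):
            for v in ids[i:]:
                d = cori_distance(coords[u], coords[v])
                table[(u, v)] = table[(v, u)] = d
        return table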

The topology-aware runtime AMR data placement optimization is also evaluated on Titan [63] at ORNL. As Titan does not have any SSDs, this is the only portion of the framework tested there. Titan is a Cray XK7 machine with a total of 18,688 nodes. Each node contains a 16-core 2.2GHz AMD Opteron 6274 processor and 32GB of memory. Nodes are connected via the Gemini [13] high-speed interconnect. The datasets used all come from BISICLES [20] simulations. Figure 3.1 shows a visualization of one dataset. Each dataset consists of double-precision values, is about 1GB in size, and is composed of 5 levels with 6,700-7,000 boxes. For the staging space, one mserver with 32 threads is used in all experiments. The client testing programs that connect to the staging space are built on our framework's APIs.

3.4.1 Staging Space Capacity Control

Our first experiment tests the framework's staging space capacity control. We add consistent time periods between writing two adjacent time-steps in the client testing program, imitating the computation periods between time-steps. The idle period between writes is set to 5 seconds, considerably shorter than the 60+ seconds between BISICLES time-steps. In this case, our program is expected to be able to move data to the SSDs, freeing space in DRAM, while maintaining a high write throughput. We also include a test without idle periods between writing two adjacent time-steps, representing the demanding cases discussed in Section 3.3.1, where the client's throughput may be delayed. To create large datasets for performance testing, we expand all boxes in a dataset by a factor of 256, making sure the expansion does not change a box's position relative to other boxes at the same level and adjacent levels. In total, 100 time-steps are written to the staging space, and the staging space becomes full at about time-step 42. 4,096 client processes and the same number of dservers are used for testing. Each client process is assigned a similar amount of data to write and knows the exact coordinates of its assigned boxes. Note that all data in a time-step is used, so no spatial access patterns are involved. To provide a baseline for comparison, we also include experiments that directly access the SSDs and PFS (not involving our framework's staging space). The dataset expansion ratios and number of client processes are the same as above. For PFS access, we utilize all 248 Object Storage Targets (OSTs) of Cori. The results are shown in Figure 3.8. For the case with computation time periods between two time-steps, our framework achieves significantly improved performance, and the client processes' write throughput is not delayed, as the idle periods fully cover the time required to move data from DRAM to the SSDs. Compared to writing directly to the SSDs, our framework improves the average and median throughput by 72.85% and 71.93%, respectively.
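For reference, a client testing program along these lines might look like the sketch below. The staging handle, its write_box call, and the box attributes are hypothetical names standing in for the framework's client APIs; the 5-second sleep imitates the computation period between time-steps, and setting idle to 0 reproduces the demanding no-idle case.

    import time

    IDLE_SECONDS = 5       # imitated computation period between time-steps
    NUM_TIMESTEPS = 100

    def run_capacity_test(staging, my_boxes, idle=IDLE_SECONDS):
        """Write NUM_TIMESTEPS of AMR boxes, pausing between time-steps.

        staging: client handle to the staging space (assumed API).
        my_boxes: boxes assigned to this client process, each with known
                  coordinates and a binary payload.
        """
        throughputs = []
        for ts in range(NUM_TIMESTEPS):
            start = time.time()
            total_bytes = 0
            for box in my_boxes:
                staging.write_box(ts, box.coords, box.payload)  # assumed API
                total_bytes += len(box.payload)
            elapsed = max(time.time() - start, 1e-9)
            throughputs.append(total_bytes / elapsed)
            if idle > 0:
                time.sleep(idle)  # gives the servers time to drain DRAM to SSDs
        return throughputs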

Figure 3.8 The effectiveness of our framework in handling the condition when the staging space is becoming full, compared with direct SSD and PFS access (not involving our framework's staging space). Compared to writing data directly to the SSDs, with time periods added to imitate computation, our framework achieves average and median improvements of 72.85% and 71.93%, respectively.

For the other, demanding case without computation periods, the client's throughput drops at about time-step 42, as the framework must frequently stop receiving data in order to move overflow data to the SSDs before the DRAM space is full. Moreover, in both cases our framework avoids any out-of-memory (OOM) errors. Overall, we believe our framework exhibits satisfactory performance in staging space capacity control when handling a large amount of data.

3.4.2 Spatially Constrained AMR Data Read Patterns Detection and Prefetching

Our next experiments evaluate our framework's effectiveness at detecting spatially constrained AMR data access patterns and prefetching accordingly. To set up the experiment, we first write 100 1GB time-steps to the staging space, which is composed of 512 dservers. The spatial regions the clients access are based on the 11 major Antarctic ice shelves represented in the BISICLES datasets, as illustrated in Figure 3.9. In total, 11 client processes are used, each accessing one region. Every 10 time-steps, a spatial region expands by 5% in a certain fixed direction, as illustrated in Figure 3.3. Since data write performance was shown in the previous subsection, here we focus on the performance of spatial region retrieval. Note that there are no breaks between a client retrieving two adjacent time-steps' data.
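The spatial constraint each client issues per time-step can be reproduced with a small helper like the one below. The region representation (a min/max corner rectangle) and the direction encoding are assumptions made for illustration only.

    def region_for_timestep(base_region, ts, direction, rate=0.05, period=10):
        """Expand a client's query region every `period` time-steps.

        base_region: ((xmin, ymin), (xmax, ymax)) covering the ice shelf at time-step 0.
        direction:   e.g. (1, 0) grows the region along +x only.
        Every `period` time-steps the region grows by `rate` of its current extent
        in the fixed direction, matching the access pattern used in the experiments.
        """
        (xmin, ymin), (xmax, ymax) = base_region
        for _ in range(ts // period):
            dx = (xmax - xmin) * rate * direction[0]
            dy = (ymax - ymin) * rate * direction[1]
            xmax, ymax = xmax + dx, ymax + dy
        return ((xmin, ymin), (xmax, ymax))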

Figure 3.9 Illustration of the 11 major Antarctic ice shelves (Riiser-Larsen, Fimbul, Larsen C, Ronne-Filchner, Amery, Willkins, Abbott, Ross, West, Shackleton, and Getz). The spatial regions of those ice shelves are used as spatial access constraints over the BISICLES datasets.

For the server side of our framework, we include three cases. First, the data resides entirely in the staging space (DRAM), which should produce the highest throughput. Second, the data is moved to the SSDs, and spatial access pattern detection and prefetching are enabled in the servers. Third, the data is moved to the SSDs, and spatial access pattern detection and prefetching are disabled in the servers. We also include experiments with direct SSD and PFS access for baseline comparison. The datasets are placed on the SSDs and PFS in HDF5 format [33]. As we are directly accessing the file system, there is no spatial index to leverage: when retrieving a box, the program must search all of the metadata (all boxes' coordinates) to find which boxes overlap with the given spatial region. Once found, the binary data is read back into the testing program's DRAM. The full results are shown in Figure 3.10. Read throughput is highest when all of the data fits entirely into DRAM, which is the upper-bound scenario. Pattern detection and prefetching make a noticeable difference in throughput when the data is on the SSDs: compared to no prefetching, the average throughput is 26.47% higher, with a median of 26.03%. Moreover, the throughput improvement is consistent throughout all time-steps, indicating that our framework catches most of the spatial pattern changes and performs prefetching with high precision. However, even with prefetching from the SSDs, the throughput cannot reach the upper bound. This is because the time required to transmit the current time-step's data from the staging space to the client process is smaller than the time required to prefetch the entire predicted data from the SSDs to DRAM, as discussed earlier. When directly accessing the SSDs and PFS, the retrieval of the boxes' binary data is random, as the program has to search all of the metadata. For the PFS, which is based on spinning hard drives, random data access is very slow, so the throughput is commensurately low. For the SSDs, which are insensitive to random data access, performance is about 400% better than that of the PFS, but it is still at least 50% below our framework, because there is no spatial index to utilize.

3.4.3 Topology-aware Runtime AMR Data Placement Optimization

We now look at the last optimization, topology-aware data placement. In addition to testing on Cori, we also test this optimization on the Titan supercomputer at ORNL. Each dataset is expanded by a factor of 256, and the client and dserver processes are scaled up to 4,096 each. The dataset expansion and workload assignment among client processes are done in the same way as in Section 3.4.1. In this experiment we first write 100 time-steps to the staging space and then read all 100 time-steps back. Again, an idle time is inserted between writing time-steps, mimicking a process's compute time. After reading one time-step of data, our framework frees all the data from the previous time-step to avoid moving data when the DRAM space becomes insufficient, as discussed earlier.
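A client-side read-back loop for this experiment might look like the following sketch; read_timestep and free_timestep are hypothetical names for the framework's retrieval and space-release calls, and whether the release is triggered by the client or handled entirely inside the servers is an assumption here.

    def read_back(staging, num_timesteps=100):
        """Read all time-steps back, releasing the previous one after each read.

        staging.read_timestep(ts) is assumed to return a dict mapping box ids to
        binary payloads; freeing the previous time-step keeps the staging DRAM
        from filling, so the servers do not have to migrate data while serving reads.
        """
        results = []
        for ts in range(num_timesteps):
            data = staging.read_timestep(ts)          # assumed API
            results.append(sum(len(p) for p in data.values()))  # placeholder analytics
            if ts > 0:
                staging.free_timestep(ts - 1)         # assumed API
        return results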

Figure 3.10 The effectiveness of our framework's prefetching for spatially constrained access, compared to direct SSD and PFS access (not involving our framework's staging space). The spatial access patterns of the 11 client processes are based on the 11 major Antarctic ice shelves in a BISICLES dataset, as illustrated in Figure 3.9; each client accesses one such region. For our framework with pattern detection and prefetching, the average and median performance improvements are 26.47% and 26.03%, respectively.
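The direct SSD and PFS baselines in Figure 3.10 have no spatial index: they scan all box metadata for overlaps with the query region and then issue the reads, roughly as sketched below. The dataset handle and its boxes/read methods are hypothetical placeholders over the HDF5 files, and the overlap test assumes 2D axis-aligned boxes.

    def overlaps(box, region):
        """Axis-aligned overlap test between a box and the query region (2D)."""
        (bx0, by0), (bx1, by1) = box
        (rx0, ry0), (rx1, ry1) = region
        return bx0 <= rx1 and rx0 <= bx1 and by0 <= ry1 and ry0 <= by1

    def direct_read(dataset, region):
        """Baseline without a spatial index: scan all box coordinates, then read.

        dataset.boxes() yields (box_id, coords) for every box in a time-step, and
        dataset.read(box_id) fetches the binary payload; both are assumed APIs.
        """
        hits = [box_id for box_id, coords in dataset.boxes() if overlaps(coords, region)]
        return {box_id: dataset.read(box_id) for box_id in hits}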

Figure 3.11 On Cori, the effectiveness of our framework's topology-aware runtime AMR data placement optimization, compared to direct SSD and PFS access (not involving our framework's staging space). For writing, the average and median improvements of the topology-aware optimization are 18.08% and 18.73%, respectively. For reading, the average and median improvements are 10.57% and 10.49%, respectively.

Experiments are run on server processes both with and without the optimization. We also include experiments that directly access the SSDs and PFS (not involving our framework's staging space) for comparison. All data in a time-step is used, thus no spatial access patterns are involved. Figure 3.11 shows the results on Cori. For writing, the average and median improvements of the topology-aware optimization are 18.08% and 18.73%, respectively. The throughput is consistently high, as our program utilizes the compute time periods to move data to the SSDs. When reading the time-steps back, the throughput for the first 42 time-steps is high, as the data is read from DRAM; during this period the average and median improvements are 12.89% and 12.84%. For the following time-steps, the throughput drops sharply, as data is read from the SSDs (at about time-step 43 the DRAM becomes full, so the subsequent time-steps have to be moved onto the SSDs). During this period the average improvement is 8.77% and the median improvement is 8.85%, which are lower than in the writing case because SSD access latency becomes dominant and overshadows the effectiveness of the topology optimization. Overall, the average and median improvements for reading are 10.57% and 10.49%, respectively. For both writing and reading, our framework performs consistently better than directly accessing the SSDs or the PFS. We run a similar set of experiments on Titan. The only configuration difference from Cori is that only 10 time-steps are used, as each Titan node has only 32GB of DRAM. As Titan has no SSDs, if too much data were written to the staging space the PFS would have to serve as the overflow space.
