Towards a Distributed Large-Scale Dynamic Graph Data Store


2016 IEEE International Parallel and Distributed Processing Symposium Workshops

Keita Iwabuchi, Scott Sallinen, Roger Pearce, Brian Van Essen, Maya Gokhale, Satoshi Matsuoka
Center for Applied Scientific Computing, Lawrence Livermore National Laboratory; Dept. of Mathematical and Computing Sciences, Tokyo Institute of Technology; Dept. of Electrical and Computer Engineering, University of British Columbia; Global Scientific Information and Computing Center, Tokyo Institute of Technology
iwabuchi.k.ab@m.titech.ac.jp, scotts@ece.ubc.ca, {rpearce, vanessen1, maya}@llnl.gov, matsu@is.titech.ac.jp

Abstract: In many graph applications, the structure of the graph changes dynamically over time and may require real-time analysis. However, constructing a large graph is expensive, and most studies of large graphs have focused on static rather than dynamic graph data structures. To address this issue, we propose DegAwareRHH, a high-performance dynamic graph data store designed to scale out and store large, scale-free graphs by leveraging compact hash tables with high data locality. We extend DegAwareRHH to multiple processes and distributed memory, and perform dynamic graph construction on large scale-free graphs using emerging Big Data HPC systems such as the Catalyst cluster at LLNL. We demonstrate that DegAwareRHH processes a request stream 206.5x faster than a state-of-the-art shared-memory dynamic graph processing framework, when both implementations use 24 threads/processes to construct a graph with 1 billion edge insertion requests and 54 million edge deletion requests. DegAwareRHH also achieves a processing rate of over 2 billion edge insertion requests per second using 128 compute nodes to construct a large-scale web graph containing 128 billion edges, the largest open-source real graph dataset to our knowledge.

I. INTRODUCTION

In many graph applications, the structure of a graph changes dynamically over time and may require real-time analysis. Large-scale graph processing often presents challenging data-intensive workloads, due to the unstructured memory access patterns common to many applications. Significant research into the analysis of challenging scale-free graphs that are static (graph topology remains fixed) has led to remarkable improvements in processing capabilities on HPC systems [1], [2], [3]; much of this work has been enabled by the bi-annual Graph500 competition [4]. Research into the algorithms and data structure management for large-scale dynamic temporal graphs, that is, graphs whose topology changes as a function of time, is at the moment immature when compared to the static scenario. While capabilities for processing large-scale dynamic graphs are lacking, the need for such capabilities is strong. In this work, we present our progress towards developing the data structures and infrastructure management necessary to support dynamic graph analysis at large scale, on distributed High Performance Computing (HPC) platforms.

Even with the large memory capacities of HPC systems, many graph applications require additional out-of-core memory. Such large memory requirements can be due to large-scale topology, as is the case with the 128 billion edge web graph we study, or rich property data that decorates the topology. For example, Facebook manages 1.39 billion active users as of 2014, and their social network graph has more than 400 billion edges [5].
To accommodate such large graphs we use node-local NVRAM to extend main memory and to persistently store the graph data. Node-local NVRAM (e.g., NAND Flash) has found its way into HPC platforms, driven in large part by application checkpointing needs. In this work, we present a dynamic graph data store (DegAwareRHH) that adopts open-addressing compact hash tables with high data locality for vertices with large degree, in order to minimize the number of accesses to NVRAM. DegAwareRHH is degree aware, and uses separate compact data structures for low-degree vertices to reduce their storage and search overheads. We implemented DegAwareRHH for distributed memory using an asynchronous visitor queue abstraction [6] [7], and perform scaling studies of large-scale graph construction on up to 128 compute nodes.

Summary of our contributions:
- We demonstrate that DegAwareRHH can construct a graph with 1 billion edge insertion requests and 54 million edge deletion requests 206.5 times faster than a state-of-the-art shared-memory graph processing framework, when both implementations use 24 threads/processes.
- We present scaling studies of constructing large-scale synthetic graphs and a large real-world graph (up to 275 billion edge insertion requests), scaling up to 128 compute nodes.
- We show that DegAwareRHH processes over 2 billion edge insertion requests per second, using 128 compute nodes, on a large-scale real graph with 128 billion edges.

II. BACKGROUND

A. Scale-free Graphs

Many real-world graphs can be classified as scale-free, where the distribution of vertex degrees follows a power-law distribution [8]. The degree of a vertex is the number of edges connecting to it. A power-law vertex degree distribution means that the majority of vertices have a low degree, while a select few have a very large degree. Recent work has focused on optimizations related to the challenge of high-degree vertices, which cause load imbalance for parallel computations [6] [9] [7]. In this work, we identify low-degree vertices, which are particularly challenging for graph data stores in NVRAM.

B. NVRAM

In the next generation of supercomputers, providing sufficient main memory capacity is one of the biggest challenges due to the high power consumption of DRAM. NVRAM will greatly expand the possibility of processing extremely large-scale graphs that exceed the DRAM capacity of the nodes. However, graph construction cost using NVRAM is extremely high compared to DRAM: a naive data structure implementation would cause significant performance degradation due to unstructured memory accesses to the device.

C. Dynamic Graph Processing

In many real-world graph applications, the structure of the graph changes dynamically over time and may require real-time analysis. For example, genomic sequence assembly leveraging de Bruijn graphs [10] [11] incrementally builds graphs during sequence assembly. These applications construct a graph based on relatively short fragments of DNA from a sequencer. When a new sequence is generated, graph analysis and graph construction/updates are conducted simultaneously. Therefore, not only is high-performance dynamic graph construction required, but also efficient data retrieval during graph analysis.

D. Graph Data Structures

Graph processing is a highly data-intensive problem, thus improving the data locality of the graph data structure is an essential optimization [12]. Over the years, many data structure models have been studied. In this section, we discuss classic graph data structure models for static and dynamic graphs and their advantages and disadvantages. In the following, we consider a graph G with n vertices and m edges.

1) Static Graph Data Structures:

Adjacency matrix: The simplest data structure model is an adjacency matrix, an n x n matrix. If there is an edge from v_i to v_j, the entry a_ij holds a positive value or the property data of the edge. An adjacency matrix consumes O(n^2) memory, which is a huge disadvantage when processing sparse graphs.

Compressed Sparse Row (CSR): The CSR data structure is the de-facto graph data structure widely used in graph processing implementations. It consists of two arrays, an index array and an edge array. The index array holds indices into the edge array, and the edge array holds neighbor vertex IDs. More specifically, each position in the index array represents a source vertex's ID, and the corresponding element refers to an index in the edge array. The range of a vertex v_i's edges in the edge array is from index[v_i] to index[v_i + 1]. The memory size of the CSR data structure is O(n + 1) for the index array and O(m) for the edge array. An overview of the CSR data structure is illustrated in Figure 1.

Figure 1. CSR graph data structure
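A minimal C++ sketch of this layout (illustrative code, not the paper's implementation): the index array points into the packed edge array, and vertex v's neighbors occupy edge[index[v]] through edge[index[v+1]-1].

```cpp
// Minimal CSR sketch: index array of length n+1, edge array of length m.
#include <cstdint>
#include <iostream>
#include <vector>

struct CSRGraph {
    std::vector<uint64_t> index;  // length n + 1, offsets into `edges`
    std::vector<uint64_t> edges;  // length m, neighbor vertex IDs

    // Iterate over the out-neighbors of vertex v.
    template <typename F>
    void for_each_neighbor(uint64_t v, F f) const {
        for (uint64_t i = index[v]; i < index[v + 1]; ++i) f(edges[i]);
    }
};

int main() {
    // Tiny example graph: 0 -> {1, 2}, 1 -> {2}, 2 -> {}
    CSRGraph g{{0, 2, 3, 3}, {1, 2, 2}};
    g.for_each_neighbor(0, [](uint64_t u) { std::cout << u << '\n'; });
}
```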
The CSR data structure provides high data locality and efficient memory usage due to its packed array structure. However, because of this packing, CSR is not well suited for storing dynamic graphs: in general, even adding or deleting a single edge or vertex requires updating the entire array, causing large data movement.

2) Dynamic Graph Data Structures:

Key-Value: The Key-Value Store (KVS) model, in other words a map or dictionary data structure, manages an element as a pair of a key and a value. In a simple KVS graph model, an edge table uses the source and target vertex pair as the key and the edge property as the value. To store vertex data, the KVS holds a vertex table consisting of pairs of a vertex and its property data. However, while such a simple Key-Value Store model can insert and delete vertices and edges efficiently, it does not consider graph topology. Therefore, it is difficult to obtain locality benefits among the edges adjacent to a vertex, which is a highly important factor for the performance of many graph analyses.

Adjacency-list: The basic components of an adjacency-list are a vertex table and edge lists. The vertex table describes the set of all vertices of a graph. Each element of the vertex table consists of a pointer to an edge list and also vertex property data if needed.

An edge list holds the list of neighbors of its vertex, together with edge property data. The main advantage of the adjacency-list model is that it provides high data locality among the out-going edges (or in-coming edges, depending on the configuration) of each vertex. There are many variations on this model, each with its advantages and disadvantages. For the vertex table, typically a tree or hash (key-value) data structure is used, and for the edge list, a single vector or linked-list data structure is used. A tree data structure is widely used in database systems; however, a tree data structure will potentially cause random memory accesses for each operation (such as insert, find, and delete). Even though the time complexity of a tree data structure is O(log(n)), this is too costly for large-scale graphs stored in NVRAM; therefore, we use near-constant-time hash tables.

III. ROBIN HOOD HASHING (RHH)

In this section we describe the hash table algorithm used by DegAwareRHH. In order to minimize the number of accesses to NVRAM, which result in page misses, we choose Robin Hood Hashing [13] because of its locality properties and compactness. Robin Hood Hashing is an open-addressing hashing scheme (without collision chains) designed to maintain a small average probe distance, the distance between the initial (hashed) position and the current position of a key. The average probe distance is an important factor determining the construction cost of the table. We expect that the combination of a low probe distance and the sequential memory access patterns of Robin Hood Hashing will be beneficial for out-of-core processing using NVRAM. Previous work has improved RHH's performance on various workloads, and has also shown that the probe distance can remain small under various load conditions [14] [15]. Now, we describe how Robin Hood Hashing behaves during insert, delete, and search operations.

A. Insert

A core characteristic of an open-addressing hash table is how it deals with hash collisions. In case of a collision, Robin Hood Hashing performs linear probing over the cyclic probe sequence H(k), H(k)+1, ..., L-1, 0, 1, ..., H(k)-1, where H(k) is the hash value of k and L is the length of the table. The length of this probe sequence is called the probe distance. Specifically, RHH moves an existing element if a new element hashes into its position and the probe distance of the existing element is equal to or smaller than that of the new element. This strategy attempts to keep the average probe distance of all elements in the table small.

An illustration of edge insertion using Robin Hood Hashing is shown in Figure 2. Let {i, j} be an edge between vertices i and j. In this example, the hash function is h = v_id mod C, where v_id is the ID of a vertex and C is the capacity of the hash table.

Figure 2. Edge insertion into a Robin Hood Hashing table

First, compute the hash value (initial position) of a new edge {1,5} using the hash function. Second, edge {1,6} is moved to the next position, since the probe distance of {1,6} is equal to that of edge {1,5}. Next, edge {2,2} is moved to the next position and edge {1,6} is inserted in its position (index 2), because the probe distance of edge {2,2} is smaller than that of edge {1,6}. Ideally, a target element is located in the same memory page as its hashed position, even when the element has been moved to another position because of a hash conflict.
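The displacement rule above can be summarized in a short, self-contained sketch. This is illustrative code, not the paper's implementation: it assumes the "equal or smaller probe distance" displacement described above and omits tombstones, table growth, and the probe-distance threshold.

```cpp
// Minimal Robin Hood insertion sketch: elements whose probe distance is
// equal to or smaller than that of the incoming element are displaced and
// re-inserted further along the cyclic probe sequence.
#include <cstdint>
#include <utility>
#include <vector>

struct Slot {
    bool     occupied = false;
    uint8_t  probe    = 0;   // distance from the hashed (initial) position
    uint64_t key      = 0;
    uint64_t value    = 0;
};

struct RobinHoodTable {
    std::vector<Slot> slots;
    explicit RobinHoodTable(size_t capacity) : slots(capacity) {}

    size_t hash(uint64_t key) const { return key % slots.size(); }

    void insert(uint64_t key, uint64_t value) {
        uint8_t probe = 0;
        size_t  pos   = hash(key);
        for (;;) {
            Slot& s = slots[pos];
            if (!s.occupied) {                 // empty slot: place the element here
                s = Slot{true, probe, key, value};
                return;
            }
            if (s.probe <= probe) {            // resident is at least as "rich":
                std::swap(s.probe, probe);     // swap it out and keep probing
                std::swap(s.key, key);         // with the displaced element
                std::swap(s.value, value);
            }
            pos = (pos + 1) % slots.size();    // cyclic linear probing
            ++probe;                           // (growth / threshold checks omitted)
        }
    }
};

int main() {
    RobinHoodTable t(8);
    t.insert(1, 5);   // e.g. edge {1,5} keyed by source vertex
    t.insert(1, 6);
    t.insert(2, 2);
}
```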
B. Delete

When deleting an element, instead of simply erasing all of its data (probe distance, key, and value) and moving succeeding elements forward, we mark the element as deleted by setting a tombstone flag, keeping its existing probe distance. By keeping the probe distances of deleted elements, we can expect to reduce the cost of moving elements when inserting or deleting an element. The main drawback of this approach appears after many delete operations have been performed: the average probe distance will be large even though the number of actual elements is small. In the worst case, the probe distance may exceed the capacity of the table. To prevent this situation, we periodically re-hash all elements of the table, erasing the tombstones of deleted elements and reducing the probe distances of active elements.

C. Search

A search operation is straightforward: compute the hash value of the target element and probe for it linearly from the hashed position. If an empty slot (not including tombstones) is found, or an element whose probe distance is smaller than the current search distance, the search fails, since the target element does not exist in the table.
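A companion sketch of the delete and search behavior, again hypothetical code rather than the paper's implementation and using the same slot layout as the insertion sketch above: deletion only sets a tombstone flag and keeps the probe distance, and the search terminates early on a truly empty slot or on an element closer to its home position than the search has traveled.

```cpp
// Minimal Robin Hood search and tombstone-deletion sketch.
#include <cstdint>
#include <vector>

struct Slot {
    bool     occupied  = false;
    bool     tombstone = false;  // deleted, but probe distance preserved
    uint8_t  probe     = 0;
    uint64_t key       = 0;
    uint64_t value     = 0;
};

// Returns the slot index of `key`, or -1 if it is not in the table.
long find(const std::vector<Slot>& slots, uint64_t key) {
    size_t  pos   = key % slots.size();
    uint8_t probe = 0;
    for (size_t n = 0; n < slots.size(); ++n) {
        const Slot& s = slots[pos];
        if (!s.occupied) return -1;        // truly empty slot: key is absent
        if (s.probe < probe) return -1;    // resident is "richer": key is absent
        if (!s.tombstone && s.key == key) return static_cast<long>(pos);
        pos = (pos + 1) % slots.size();    // tombstones do not stop the probe
        ++probe;
    }
    return -1;
}

// Tombstone deletion: keep the slot and its probe distance so that probe
// sequences of other elements remain valid; a periodic rehash reclaims space.
void erase(std::vector<Slot>& slots, uint64_t key) {
    long pos = find(slots, key);
    if (pos >= 0) slots[pos].tombstone = true;
}
```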

IV. DEGAWARERHH

In this section, we first describe the design of DegAwareRHH, then describe its insertion algorithm, and finally give some optimization techniques that improve its performance.

A. Overview

DegAwareRHH is designed to target graphs that have the following characteristics: 1) extremely large scale, scale-free graphs with hundreds of billions to trillions of edges; 2) each vertex and edge has property data; 3) there may be multiple edges between two vertices (a multigraph). For this type of graph, the key challenges of designing a dynamic graph data structure are: 1) supporting out-of-core data structures in order to process large-scale graphs; 2) high-performance insertion and deletion of vertices and edges; 3) quickly locating a specific edge matching topological and property constraints; 4) providing high locality when reading all adjacent edges of a vertex, in order to improve graph analysis performance.

B. Graph Data Structure Using Robin Hood Hashing

DegAwareRHH is based on the adjacency-list data structure model using Robin Hood Hashing. Additionally, it is designed to manage scale-free graphs (with many low-degree and a few very high-degree vertices) by using two different data structures, depending on the vertex type.

1) Middle-high-degree Table: Locating a specific edge of a high-degree vertex may be highly costly. Therefore, to quickly locate a specific edge matching topological constraints, we use a hash table for the edge list instead of a simple 1D array. This data structure efficiently handles insertion while maintaining locality of access.

2) Low-degree Table: Even with a compact hash table, some operations carry relatively high overhead for low-degree vertices, such as allocating a new table and pointer accesses to edge lists. Accordingly, we use a single Robin Hood Hashing table to store low-degree vertices in order to reduce these costs.

C. Data Layout

An illustration of DegAwareRHH is shown in Figure 3. Depending on the degree of a source vertex, edges are stored in two types of tables: a low-degree table and a middle-high-degree table. The low-degree table stores edges in a single compact table, i.e., directly using Robin Hood Hashing. The middle-high-degree table consists of a vertex table and edge-chunks. The vertex table stores source vertex information, i.e., the source vertex's ID and vertex property data if needed. Adjacent edges are stored in edge-chunks, and the vertex table holds pointers to the edge-chunks. In order to locate a specific edge in constant time, we also use Robin Hood Hashing for the edge-chunks. In our current implementation, only 1 byte of extra space is allocated per element to construct a Robin Hood Hashing table (7 bits for the probe distance and 1 bit for a tombstone flag).

Figure 3. Overview of the NVRAM-specialized Degree Aware Dynamic Graph Data Structure (low-degree table and mid-high-degree table)

D. Programming API

Like existing popular graph processing frameworks, DegAwareRHH provides basic APIs including: graph construction and update, such as add edge, delete edge, and update property data; and graph reading APIs, such as get property data and get adjacency-list. The get adjacency-list function returns a forward iterator pointing to the adjacency-list. Using C++ templates, our implementation can hold any type of object for key, value, and property data.
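As a purely illustrative, self-contained toy with the shape of the API just described (add edge, delete edge, adjacency access via forward iterators, property data as a template parameter): the class and method names below are invented for this sketch, a std::multimap stands in for the Robin Hood tables, and none of it should be taken as the paper's actual interface.

```cpp
// Toy stand-in mimicking the described API shape; not the real DegAwareRHH.
#include <cstdint>
#include <iostream>
#include <map>
#include <utility>

template <typename Vertex, typename EdgeProp>
class toy_dynamic_graph {
    std::multimap<Vertex, std::pair<Vertex, EdgeProp>> edges_;  // src -> (dst, prop)
public:
    void add_edge(Vertex src, Vertex dst, EdgeProp prop) {
        edges_.emplace(src, std::make_pair(dst, prop));
    }
    void delete_edge(Vertex src, Vertex dst) {
        auto range = edges_.equal_range(src);
        for (auto it = range.first; it != range.second; ++it)
            if (it->second.first == dst) { edges_.erase(it); return; }
    }
    // "get adjacency-list": a pair of forward iterators over src's edges.
    auto adjacency(Vertex src) const { return edges_.equal_range(src); }
};

int main() {
    toy_dynamic_graph<uint64_t, unsigned char> g;
    g.add_edge(1, 5, 'a');
    g.add_edge(1, 6, 'b');
    g.delete_edge(1, 5);
    auto [first, last] = g.adjacency(1);
    for (auto it = first; it != last; ++it)
        std::cout << it->second.first << ' ' << it->second.second << '\n';
}
```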
E. Insertion Algorithm

An illustration of the edge insertion algorithm of DegAwareRHH is shown in Figure 4. Let (u, v) be an edge to insert, where u is the source vertex ID and v is the target vertex ID.

Step 1. Query the current degree (the number of edges) of vertex u in the low-degree table.

Step 2-A. If the degree is more than 0 and less than the low-degree threshold of the middle-high-degree table, insert the edge into the low-degree table. If the degree of the low-degree vertex is equal to the low-degree threshold, move all edges of the vertex to the middle-high-degree table.

Step 2-B. Otherwise, query the current degree of vertex u in the middle-high-degree table. If the degree is 0, that is, vertex u is a new vertex, insert into the low-degree table. Otherwise, insert the edge into the middle-high-degree table.

Step 3. While inserting an element, if the probe distance of an element exceeds a threshold due to too many hash conflicts, the table is grown to double its size. It is well known that the probe distance usually remains a small constant, and it rarely becomes long.

Figure 4. Degree Aware Edge Insertion Algorithm
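A compact sketch of the routing logic in Steps 1-3, using ordinary standard-library containers as stand-ins for the Robin Hood tables. The container choice and function name are assumptions for illustration; the threshold value of 2 is the one used in the paper's experiments (Section VI), and the Step 3 table growth is internal to the hash tables and omitted.

```cpp
// Degree-aware edge insertion sketch: low-degree vertices share one table,
// vertices that reach the threshold are promoted to per-vertex edge chunks.
#include <cstdint>
#include <unordered_map>
#include <vector>

constexpr size_t kLowDegreeThreshold = 2;

std::unordered_multimap<uint64_t, uint64_t> low_degree_table;        // src -> dst
std::unordered_map<uint64_t, std::vector<uint64_t>> mid_high_table;  // src -> edge chunk

void insert_edge(uint64_t u, uint64_t v) {
    // Step 1: current degree of u in the low-degree table.
    size_t low_deg = low_degree_table.count(u);

    if (low_deg > 0 && low_deg < kLowDegreeThreshold) {      // Step 2-A: stay low-degree
        low_degree_table.emplace(u, v);
        return;
    }
    if (low_deg == kLowDegreeThreshold) {                    // Step 2-A: promote u
        auto range = low_degree_table.equal_range(u);
        auto& chunk = mid_high_table[u];
        for (auto it = range.first; it != range.second; ++it) chunk.push_back(it->second);
        low_degree_table.erase(u);
        chunk.push_back(v);
        return;
    }
    // Step 2-B: u is not in the low-degree table.
    auto it = mid_high_table.find(u);
    if (it == mid_high_table.end()) low_degree_table.emplace(u, v);  // brand-new vertex
    else it->second.push_back(v);                                    // already high degree
}

int main() {
    insert_edge(1, 5);
    insert_edge(1, 6);   // vertex 1 now holds the threshold number of edges
    insert_edge(1, 7);   // promoted to the mid-high-degree table on this insert
}
```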

F. Optimization

Load Factor: It has been reported that performance degrades rapidly when the table is close to full [13]. To prevent this situation, we double the table capacity when the number of elements, not including tombstones, exceeds 90% of its capacity.

Rehashing Table: As described in Section III, after many elements are deleted and inserted, the probe distance may become quite large. Thus, we check the probe distance each time an insertion is completed, and rehash all elements if the value is larger than the table size.

Memory Pool Allocator: Allocating small amounts of memory is an expensive operation. We use a memory pool allocation technique to reduce the cost of individual, small-sized memory allocations.

V. EXTENDING DEGAWARERHH FOR DISTRIBUTED-MEMORY

Distributed-memory DegAwareRHH has been implemented with MPI to store a distributed graph. We adopt the distributed asynchronous visitor queue [6] [7] as the driver of DegAwareRHH. The visitor queue framework provides the parallelism and creates a data-driven flow of computation over MPI with asynchronous communication. All visitors are asynchronously transmitted, scheduled, and executed. An illustration of the distributed dynamic graph construction algorithm is shown in Figure 5; note that Figure 5 only shows the communication from Process A to Process B, to keep the figure concise.

Figure 5. Distributed Dynamic Graph Construction over the Visitor Queue Framework

Before constructing the graph, we first need to determine an allocation strategy for vertices, i.e., which process is responsible for a given vertex. To determine the owner process of a new edge request e_id, consisting of an operation and a source/destination pair, we use Consistent Hashing: a vertex's owner process is computed as hash(e_id.source) mod P, where P is the number of processes, and all processes use the same hash function. Due to this strategy, any process can determine in constant time the owner of a given vertex, and thus the process responsible for the target location of any edge.

Each process independently reads graph construction requests (insertion/deletion) and passes them to the visitor queue sequentially. The visitor queue applies a request immediately to the local graph store if it is the owner of the source vertex; however, if the owner of the vertex is a remote process, it pushes the request into its local message queue (wrapped in a visitor object), and sends the visitors using asynchronous communication when the queue becomes full. Each process checks its receive queue periodically and executes received visitors as visits to the local store, i.e., it applies the received graph construction requests to the local process.
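A minimal sketch of the owner computation and per-destination buffering described above. The struct and function names are hypothetical, and the communication itself is left as a stub, since the real system wraps requests in visitor objects handled by the asynchronous visitor queue framework [6] [7].

```cpp
// Owner computation (hash of the source vertex mod P) and per-rank buffering.
#include <cstdint>
#include <functional>
#include <vector>

struct EdgeRequest {
    bool     insert;   // true = insert, false = delete
    uint64_t source;
    uint64_t target;
};

// Every process computes the same owner for a given source vertex in O(1).
int owner_of(const EdgeRequest& req, int num_procs) {
    return static_cast<int>(std::hash<uint64_t>{}(req.source) % num_procs);
}

struct RequestRouter {
    int my_rank;
    int num_procs;
    size_t flush_threshold = 4096;
    std::vector<std::vector<EdgeRequest>> outgoing;  // one queue per rank

    RequestRouter(int rank, int procs)
        : my_rank(rank), num_procs(procs), outgoing(procs) {}

    void route(const EdgeRequest& req) {
        int owner = owner_of(req, num_procs);
        if (owner == my_rank) {
            apply_locally(req);                      // owner: apply immediately
        } else {
            outgoing[owner].push_back(req);          // remote: buffer the visitor
            if (outgoing[owner].size() >= flush_threshold)
                send_async(owner, outgoing[owner]);  // ship when the queue fills up
        }
    }

    void apply_locally(const EdgeRequest&) { /* insert/delete in the local store */ }
    void send_async(int, std::vector<EdgeRequest>&) { /* asynchronous send stub */ }
};

int main() {
    RequestRouter router(/*rank=*/0, /*procs=*/4);
    router.route({true, 1, 5});
}
```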
VI. EXPERIMENTS

In this section, we experimentally evaluate the graph construction performance of our dynamic graph data store (DegAwareRHH).

A. Experimental Setup

1) Implementation: We implemented DegAwareRHH in C++ and used the Boost.Interprocess library to allocate in a memory-mapped region. Specifically, we use the Boost.Interprocess memory pool allocator to allocate the hash tables, reducing the cost of many small memory allocations. We use memory-mapped files as the interface to NVRAM. For in-core experiments, we create the files under /dev/shm so that only DRAM is used. Based on a preliminary experiment, we set the middle-high-degree threshold at 2, that is, vertices with 2 or more edges are stored in the middle-high-degree table.

For experimental comparison, we show the performance of the following implementations:

Baseline model: The model consists of a vertex table and edge tables. The vertex table holds source vertex IDs, property data, and a pointer to an edge table, using the Boost unordered_map container; an edge table holds target vertex IDs and edge property data, using the Boost vector container. The Boost unordered_map consists of multiple buckets, where each bucket is a chunk of any number of elements. When deleting an element in an edge table, instead of naively using the Boost vector's erase function, we delete the element by swapping it with the last element of the edge table, to avoid moving succeeding elements forward after each deletion.

STINGER [16]: STINGER is a shared-memory (in-core) parallel dynamic graph processing framework developed at the Georgia Institute of Technology. STINGER can update a graph with several times to three orders of magnitude better performance in comparison with 12 state-of-the-art open-source graph databases and libraries [17], including SQLite, Neo4j, Giraph, DEX, and the Boost Graph Library. We used version .

To perform a fair comparison, we compiled the implementations using GCC-4.9 with the -O3 optimization option. For the Baseline model and DegAwareRHH, we used Boost and MVAPICH2.
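The Baseline model's deletion-by-swap can be illustrated with a few lines of C++ (a generic sketch, not the actual Baseline code): the element to delete is overwritten by the last element and the vector is shrunk by one, so no succeeding elements are shifted forward.

```cpp
// Swap-with-last ("swap and pop") deletion for an unordered edge vector.
#include <cstdint>
#include <vector>

void erase_unordered(std::vector<uint64_t>& edge_list, uint64_t target) {
    for (size_t i = 0; i < edge_list.size(); ++i) {
        if (edge_list[i] == target) {
            edge_list[i] = edge_list.back();  // O(1), but element order is not preserved
            edge_list.pop_back();
            return;                           // delete a single matching edge
        }
    }
}

int main() {
    std::vector<uint64_t> edges = {5, 6, 7, 8};
    erase_unordered(edges, 6);  // edges becomes {5, 8, 7}
}
```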

2) Datasets: We show experiments using both synthetic graph models and a real-world graph.

RMAT: Generates scale-free graphs [18]; we follow the Graph500 v1.2 specifications for the generator parameters [4]. After graph generation, all vertex labels are uniformly permuted to destroy any locality artifacts from the generator. The generator produces an undirected graph in random order; for an edge (e_i, e_j), we also generate the edge in the opposite direction, (e_j, e_i).

WebGraph2012 (hyperlink graph): We also use a large web graph [19], the largest open-source real graph dataset to our knowledge, which has 128 billion edges (as a directed graph). Each vertex corresponds to a web page and each edge is a hyperlink. Each vertex is represented as a 64-bit integer value, thus an edge is a pair of 64-bit integers.

In this experiment, we set the vertex and edge property data to a dummy value (unsigned char, 0), since the datasets above do not have property data. To evaluate the performance of the delete operation, we add 5% to 50% of edge deletion requests into the edge insertion requests in random order, but with carefully chosen positions such that the insertion request of an edge comes before its deletion request. The datasets are stored in files as pairs of source and target vertex IDs. We use the source vertex's ID as the key of the low-degree table and the middle-high-degree table, and the target vertex's ID as the key of the edge-chunk.

3) Machine Configurations: We use the Catalyst cluster at Lawrence Livermore National Laboratory. Each Catalyst node has 12-core Intel(R) Xeon(R) E5-2695v2 processors (2 sockets) and 128 GB of DRAM, and is equipped with node-local, PCI-attached 800 GB NAND flash NVRAM. The nodes are connected over dual-rail QDR-80 Intel TrueScale fabrics.

4) Experiment Method: In this paper, we evaluate DegAwareRHH in the context of graph construction performance. We repeated the following steps: Step 1) each process buffers a subset of edges (1 million) from the file system into DRAM, to avoid measuring the cost of reading edges from files; Step 2) the edges from the edge buffer are inserted into or deleted from the graph data store sequentially. All edges are uniquely inserted, that is, all insertion operations are performed following a find operation. We only measured the execution time of Step 2.

B. Single Node Experiments

First, we evaluate the graph construction performance of DegAwareRHH against STINGER and the Baseline model on a single node. We show graph construction performance on an RMAT graph for the Baseline model, STINGER, and DegAwareRHH in Figure 6. We measure performance for 6, 12, and 24 threads (STINGER) or processes (Baseline & DegAwareRHH). The dataset has 1 billion edge insertion requests and 54 million edge deletion requests. The x-axis denotes the number of processed graph construction (edge insertion/deletion) requests in millions; the y-axis denotes the cumulative execution time in seconds. All implementations strong scale when increasing the number of threads/processes from 6 to 24, by 1.9 times for STINGER and the Baseline model and by 3.7 times for DegAwareRHH.
As the size of the constructed graph increases, the execution time of each chunk step on STINGER and the Baseline model increases progressively. On the other hand, DegAwareRHH shows near-linear scaling and can insert over 1 billion edges in 50 seconds with 24 processes. DegAwareRHH outperforms the other implementations, specifically, 206.5 times faster than STINGER and 23.5 times faster than the Baseline model with 24 threads/processes.

The differences in the results come from differences in the data structures. STINGER stores a graph using the adjacency-list model, like DegAwareRHH; however, it stores edges in a linked-block-list data structure (each linked block holds multiple edges), which requires a sequential search to find a target edge or an empty space for a new edge. Thus, the time to insert an edge (i, j) increases as the out-degree of i increases.

Figure 6. Unique edge insertion and 5% edge deletion (RMAT 25, in-core, single node); lower is better.

The Baseline model stores edges in a vector container and also requires a sequential access to find a target edge; however, unlike STINGER, a new edge is always inserted at the last position of the array. Due to this, it performs better than STINGER. On the other hand, since DegAwareRHH uses hash tables for its edge tables, the execution time remains near constant even as the number of inserted edges increases.

Second, we evaluate graph construction performance while varying the number of edge deletion requests on the three implementations with 24 threads/processes. The results are shown in Figure 7. We varied the rate of added edge deletion requests over 5, 10, 20, 30, 40, and 50 percent. The x-axis denotes the rate, and the y-axis denotes the number of processed graph construction requests per second on a log scale. As the number of deletion requests increases, the performance of STINGER and the Baseline model increases, since the cost of finding an edge decreases as the edge tables become smaller. In contrast, DegAwareRHH slows down slightly as the number of deletion requests increases (by 5% when comparing the 5% and 50% rates). Nevertheless, DegAwareRHH still outperforms the other implementations, e.g., 121x faster than STINGER and 16x faster than the Baseline model in the 50% edge deletion case.

Figure 7. Unique edge insertion and deletion (RMAT 25, in-core, single node). Higher is better; note the log y-axis scale.

C. Multiple Node Experiments

We evaluate the graph construction performance of DegAwareRHH against the Baseline model on multiple nodes. Note that we do not use STINGER in this experiment, since it only supports a shared-memory environment.

1) Weak Scaling: To explore how DegAwareRHH scales when multiple processes run on multiple compute nodes, we first perform a weak scaling experiment which increases the number of compute nodes while fixing the size of the sub-graph per node. Each compute node constructs a subgraph with 134 million vertices and over 2 billion undirected edges; thus the actual number of inserted edges is over 4 billion per node. We increase the number of compute nodes up to 64 with 24 processes per node; therefore, at 64 nodes, the graph has 8.6 billion vertices and 275 billion inserted edges. In this experiment, we also evaluate the performance of non-unique edge insertions, since the Baseline model has a huge overhead to find an edge when inserting it uniquely. We perform four experiment scenarios: unique edge insertion (Figure 8); non-unique edge insertion (Figure 9); unique edge insertion and deletion (Figure 10); and non-unique edge insertion and deletion (Figure 11). We added 5% of deletion requests in the deletion scenarios; in the fourth scenario, only a single edge is deleted per deletion request. Note that except for the non-unique insertion workload, we halted the Baseline experiments at more than 8 compute nodes before finishing, due to excessive run times.

Figure 8. In-core unique edge insertion (24 processes per node, 4.3 billion edges per node); lower is better.

Figure 9. In-core non-unique edge insertion (24 processes per node, 4.3 billion edges per node); lower is better.

Figure 10. In-core unique edge insertion and 5% deletion (24 processes per node, 4.3 billion edges per node); lower is better.

Figure 11. In-core non-unique edge insertion and 5% deletion (24 processes per node, 4.3 billion edges per node); lower is better.

First of all, DegAwareRHH scales as the number of compute nodes increases and outperforms the Baseline in three of the four scenarios; for instance, it achieves a processing rate of 800 million requests per second at 64 nodes in the unique insertion workload (Figure 8). On the non-unique insertion workload (Figure 9), the Baseline model outperforms DegAwareRHH by 29.8% at 64 nodes. This result can be attributed to the edge insertion algorithm of the Baseline model, which simply inserts an edge at the last position of an edge vector without checking for the existence of the edge. On the non-unique insertion and deletion workload (Figure 11), even though edges are inserted without checking for existence, the Baseline model slows down significantly when only 5% of deletion requests are added.

2) Strong Scaling: To evaluate strong scaling performance on a realistic workload, we perform graph construction experiments on a large-scale, real-world graph (WebGraph2012). Both implementations can hold the graph on a minimum of 32 nodes; we increase the number of nodes to 32, 64, and 128. Both implementations scale with increasing node count, by up to 4.1 times for the Baseline model and 4.3 times for DegAwareRHH. As the degree of vertices in this graph is smaller than that of the RMAT graphs, the performance gap between the Baseline model and DegAwareRHH shrinks; DegAwareRHH outperforms the Baseline model by 30.69% at 128 nodes. In addition, we found that the vertex tables of both implementations become smaller as the number of processes increases, improving performance due to greater data locality.

Figure 12. In-core unique edge insertion (strong scaling).

3) Out-of-core Experiments: Finally, we perform an experimental evaluation of out-of-core graph construction performance. We set up the experiment by reordering the edge stream into sorted order, and regard this as the best-performance scenario, because the edges of the same vertex are inserted consecutively into the same edge-list, which contributes to data locality. In contrast, we regard the randomly ordered datasets as a worst-case performance scenario. We construct RMAT graphs with 4.3 billion edge insertions for the in-core scenario and 17 billion edge insertions for the out-of-core scenario. Because out-of-core experiments with the Baseline model require excessive run time, we show estimated performance based on its measured performance after it exceeds DRAM size but before finishing; note that the Baseline model slows down as the size of the graph increases, so the estimated performance is better than the actual one.

We show the results in Figure 13. In the sorted-order scenarios, the Baseline model slows down by 58.8% when comparing the in-core scenario to the out-of-core one. On the other hand, out-of-core DegAwareRHH can process a 4x larger graph than the in-core scenario with only an 11.4% performance degradation, and maintains a high graph construction rate of over 27 million requests per second. In the randomized-order scenarios, although DegAwareRHH slows down considerably, it still processes 0.2 million edge insertion requests per second.

Figure 13. Out-of-core unique edge insertion (higher is better, log y-axis).

VII. RELATED WORK

Many dynamic graph data structures have been proposed based on tree data structures [20]. However, a tree data structure takes O(log(n)) time per operation and tends to incur more random accesses than Robin Hood Hashing, a critical issue in large-scale and/or out-of-core graph processing. Furthermore, several studies have been conducted on out-of-core large-scale static graph processing [21] [22] [23] [24]; however, little is known about large-scale and/or out-of-core dynamic graph data stores. Several studies have reported that graph databases are slow and have huge memory footprints when considering large-scale graphs [5] [17] [25]. A key-value store is a popular database model and is designed to easily scale to a very large size [26]; however, since a key-value store does not consider the topology of a graph, its locality properties on graph analysis workloads are poor.

Another approach is to accelerate I/O performance itself: maximizing random I/O performance by using multithreading [27]; effectively utilizing storage bandwidth by performing sequential accesses, as in the Parallel Sliding Windows (PSW) method [22], or by using an edge-centric rather than a vertex-centric computation model [23]; or combining multiple NVRAM devices to improve bandwidth and IOPS without using expensive NVRAM devices [28] [29].

Setting a high threshold for the middle-high-degree table will reduce the cost of inserting edges into the middle-high-degree table, but it causes many hash conflicts in the low-degree table, which carry relatively high overhead for low-degree vertices, as described in Section IV-B2. Accordingly, from the perspective of improving the performance of Robin Hood Hashing itself, especially under many hash conflicts, techniques such as using buckets [30] or multiple hashing, for example Cuckoo hashing [31], may be used.
VIII. CONCLUSION

We implemented DegAwareRHH, a high-performance dynamic graph data store for shared and distributed memory, using our asynchronous visitor queue framework. We demonstrated that DegAwareRHH is 206.5 times faster than STINGER, a state-of-the-art shared-memory graph processing framework, with 24 threads/processes on a single node when constructing a graph from 1 billion edge insertion requests and 54 million edge deletion requests. We confirmed that DegAwareRHH preserves high graph construction performance on all of our graph construction workloads, unique or non-unique insertions and deletions, even as the size of the graph increases, owing to its use of a hash table for its edge-lists, in contrast to STINGER and the Baseline model, which use simple 1D arrays for their edge-lists. DegAwareRHH also achieves a processing rate of over 2 billion edge insertion requests per second at 128 nodes on a large-scale real graph with 128 billion edges.

IX. ACKNOWLEDGMENTS

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-CONF). Funding was partially provided by LDRD 13-ERD-025. Experiments were performed at the Livermore Computing facility.

REFERENCES

[1] A. Yoo, A. Baker, R. Pearce, and V. Henson, "A scalable eigensolver for large scale-free graphs using 2D graph partitioning," in Supercomputing, 2011.

[2] A. Buluç and K. Madduri, "Parallel breadth-first search on distributed memory systems," in Supercomputing.
[3] F. Checconi, F. Petrini, J. Willcock, A. Lumsdaine, A. R. Choudhury, and Y. Sabharwal, "Breaking the speed and scalability barriers for graph exploration on distributed-memory machines," in Supercomputing.
[4] Graph500. [Online]. Available:
[5] A. Ching, S. Edunov, M. Kabiljo, D. Logothetis, and S. Muthukrishnan, "One trillion edges: Graph processing at facebook-scale," Proc. VLDB Endow., vol. 8, no. 12, Aug.
[6] R. Pearce, M. Gokhale, and N. Amato, "Scaling techniques for massive scale-free graphs in distributed (external) memory," in Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on, May 2013.
[7] R. Pearce, M. Gokhale, and N. M. Amato, "Faster parallel traversal of scale free graphs at extreme scale with vertex delegates," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '14. Piscataway, NJ, USA: IEEE Press, 2014.
[8] A.-L. Barabási and R. Albert, "Emergence of scaling in random networks," Science, vol. 286, no. 5439.
[9] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, "PowerGraph: Distributed graph-parallel computation on natural graphs," in Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI '12. Berkeley, CA, USA: USENIX Association, 2012.
[10] P. E. Compeau, P. A. Pevzner, and G. Tesler, "How to apply de Bruijn graphs to genome assembly," Nature Biotechnology, vol. 29, no. 11.
[11] D. R. Zerbino and E. Birney, "Velvet: Algorithms for de novo short read assembly using de Bruijn graphs," Genome Research, vol. 18, no. 5.
[12] L. Nai, Y. Xia, I. G. Tanase, H. Kim, and C.-Y. Lin, "GraphBIG: Understanding graph computing in the context of industrial solutions," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '15. New York, NY, USA: ACM, 2015, pp. 69:1-69:12.
[13] P. Celis, "Robin Hood hashing," Ph.D. dissertation, Waterloo, Ont., Canada.
[14] M. Mitzenmacher, "A new approach to analyzing Robin Hood hashing," CoRR.
[15] L. Devroye, P. Morin, and A. Viola, "On worst-case Robin Hood hashing," SIAM Journal on Computing, vol. 33, no. 4.
[16] D. Ediger, R. McColl, J. Riedy, and D. Bader, "STINGER: High performance data structure for streaming graphs," in High Performance Extreme Computing (HPEC), 2012 IEEE Conference on, Sept 2012.
[17] R. C. McColl, D. Ediger, J. Poovey, D. Campbell, and D. A. Bader, "A performance evaluation of open source graph databases," in Proceedings of the First Workshop on Parallel Programming for Analytics Applications, ser. PPAA '14. New York, NY, USA: ACM, 2014.
[18] D. Chakrabarti, Y. Zhan, and C. Faloutsos, "R-MAT: A recursive model for graph mining," in Proceedings of the Fourth SIAM International Conference on Data Mining. Society for Industrial Mathematics, 2004.
[19] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener, "Graph structure in the web," Comput. Netw., vol. 33, no. 1-6, Jun.
[20] D. D. Sleator and R. E. Tarjan, "A data structure for dynamic trees," in Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing, ser. STOC '81. New York, NY, USA: ACM, 1981.
[21] R. Pearce, M. Gokhale, and N. Amato, "Multithreaded asynchronous graph traversal for in-memory and semi-external memory," in High Performance Computing, Networking, Storage and Analysis (SC), 2010 International Conference for, Nov 2010.
[22] A. Kyrola, G. Blelloch, and C. Guestrin, "GraphChi: Large-scale graph computation on just a PC," in Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI '12. Berkeley, CA, USA: USENIX Association, 2012.
[23] A. Roy, I. Mihailovic, and W. Zwaenepoel, "X-Stream: Edge-centric graph processing using streaming partitions," in Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, ser. SOSP '13. New York, NY, USA: ACM, 2013.
[24] K. Iwabuchi, H. Sato, Y. Yasui, K. Fujisawa, and S. Matsuoka, "NVM-based hybrid BFS with memory efficient data structure," in Big Data (Big Data), 2014 IEEE International Conference on, Oct 2014.
[25] O. Goonetilleke, S. Sathe, T. Sellis, and X. Zhang, "Microblogging queries on graph databases: An introspection," in Proceedings of GRADES '15. New York, NY, USA: ACM, 2015, pp. 5:1-5:6.
[26] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, "Bigtable: A distributed storage system for structured data," ACM Transactions on Computer Systems (TOCS), vol. 26, no. 2, p. 4.
[27] B. Hendrickson and J. Berry, "Graph analysis with high-performance computing," Computing in Science & Engineering, vol. 10, no. 2, March.
[28] D. Zheng, R. Burns, and A. S. Szalay, "Toward millions of file system IOPS on low-cost, commodity hardware," in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, ser. SC '13. New York, NY, USA: ACM, 2013, pp. 69:1-69:12.
[29] K. Sato, K. Mohror, A. Moody, T. Gamblin, B. de Supinski, N. Maruyama, and S. Matsuoka, "A user-level InfiniBand-based file system and checkpoint strategy for burst buffers," in Cluster, Cloud and Grid Computing (CCGrid), IEEE/ACM International Symposium on, May 2014.
[30] A. Viola, "Distributional analysis of Robin Hood linear probing hashing with buckets," in International Conference on Analysis of Algorithms, DMTCS proc. AD, vol. 297, 2005.
[31] R. Pagh and F. F. Rodler, "Cuckoo hashing," Journal of Algorithms, vol. 51, no. 2.


More information

TEFS: A Flash File System for Use on Memory Constrained Devices

TEFS: A Flash File System for Use on Memory Constrained Devices 2016 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE) TEFS: A Flash File for Use on Memory Constrained Devices Wade Penson wpenson@alumni.ubc.ca Scott Fazackerley scott.fazackerley@alumni.ubc.ca

More information

CASS-MT Task #7 - Georgia Tech. GraphCT: A Graph Characterization Toolkit. David A. Bader, David Ediger, Karl Jiang & Jason Riedy

CASS-MT Task #7 - Georgia Tech. GraphCT: A Graph Characterization Toolkit. David A. Bader, David Ediger, Karl Jiang & Jason Riedy CASS-MT Task #7 - Georgia Tech GraphCT: A Graph Characterization Toolkit David A. Bader, David Ediger, Karl Jiang & Jason Riedy October 26, 2009 Outline Motivation What is GraphCT? Package for Massive

More information

Compact Encoding of the Web Graph Exploiting Various Power Laws

Compact Encoding of the Web Graph Exploiting Various Power Laws Compact Encoding of the Web Graph Exploiting Various Power Laws Statistical Reason Behind Link Database Yasuhito Asano, Tsuyoshi Ito 2, Hiroshi Imai 2, Masashi Toyoda 3, and Masaru Kitsuregawa 3 Department

More information

Apache Spark Graph Performance with Memory1. February Page 1 of 13

Apache Spark Graph Performance with Memory1. February Page 1 of 13 Apache Spark Graph Performance with Memory1 February 2017 Page 1 of 13 Abstract Apache Spark is a powerful open source distributed computing platform focused on high speed, large scale data processing

More information

Chapter 5 Hashing. Introduction. Hashing. Hashing Functions. hashing performs basic operations, such as insertion,

Chapter 5 Hashing. Introduction. Hashing. Hashing Functions. hashing performs basic operations, such as insertion, Introduction Chapter 5 Hashing hashing performs basic operations, such as insertion, deletion, and finds in average time 2 Hashing a hash table is merely an of some fixed size hashing converts into locations

More information

ASSIGNMENTS. Progra m Outcom e. Chapter Q. No. Outcom e (CO) I 1 If f(n) = Θ(g(n)) and g(n)= Θ(h(n)), then proof that h(n) = Θ(f(n))

ASSIGNMENTS. Progra m Outcom e. Chapter Q. No. Outcom e (CO) I 1 If f(n) = Θ(g(n)) and g(n)= Θ(h(n)), then proof that h(n) = Θ(f(n)) ASSIGNMENTS Chapter Q. No. Questions Course Outcom e (CO) Progra m Outcom e I 1 If f(n) = Θ(g(n)) and g(n)= Θ(h(n)), then proof that h(n) = Θ(f(n)) 2 3. What is the time complexity of the algorithm? 4

More information

Was ist dran an einer spezialisierten Data Warehousing platform?

Was ist dran an einer spezialisierten Data Warehousing platform? Was ist dran an einer spezialisierten Data Warehousing platform? Hermann Bär Oracle USA Redwood Shores, CA Schlüsselworte Data warehousing, Exadata, specialized hardware proprietary hardware Introduction

More information

Announcements. Reading Material. Recap. Today 9/17/17. Storage (contd. from Lecture 6)

Announcements. Reading Material. Recap. Today 9/17/17. Storage (contd. from Lecture 6) CompSci 16 Intensive Computing Systems Lecture 7 Storage and Index Instructor: Sudeepa Roy Announcements HW1 deadline this week: Due on 09/21 (Thurs), 11: pm, no late days Project proposal deadline: Preliminary

More information

Sparse Matrix Partitioning for Parallel Eigenanalysis of Large Static and Dynamic Graphs

Sparse Matrix Partitioning for Parallel Eigenanalysis of Large Static and Dynamic Graphs Sparse Matrix Partitioning for Parallel Eigenanalysis of Large Static and Dynamic Graphs Michael M. Wolf and Benjamin A. Miller Lincoln Laboratory Massachusetts Institute of Technology Lexington, MA 02420

More information

Extreme-scale Graph Analysis on Blue Waters

Extreme-scale Graph Analysis on Blue Waters Extreme-scale Graph Analysis on Blue Waters 2016 Blue Waters Symposium George M. Slota 1,2, Siva Rajamanickam 1, Kamesh Madduri 2, Karen Devine 1 1 Sandia National Laboratories a 2 The Pennsylvania State

More information

Acolyte: An In-Memory Social Network Query System

Acolyte: An In-Memory Social Network Query System Acolyte: An In-Memory Social Network Query System Ze Tang, Heng Lin, Kaiwei Li, Wentao Han, and Wenguang Chen Department of Computer Science and Technology, Tsinghua University Beijing 100084, China {tangz10,linheng11,lkw10,hwt04}@mails.tsinghua.edu.cn

More information

Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic

Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic WHITE PAPER Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic Western Digital Technologies, Inc. 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Executive

More information

Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin

Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin Department of Electrical Engineering, National Taiwan Normal University, Taipei, Taiwan Abstract String matching is the most

More information

Fast Iterative Graph Computation: A Path Centric Approach

Fast Iterative Graph Computation: A Path Centric Approach Fast Iterative Graph Computation: A Path Centric Approach Pingpeng Yuan*, Wenya Zhang, Changfeng Xie, Hai Jin* Services Computing Technology and System Lab. Cluster and Grid Computing Lab. School of Computer

More information

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti International Journal of Computer Engineering and Applications, ICCSTAR-2016, Special Issue, May.16 MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti 1 Department

More information

AAL 217: DATA STRUCTURES

AAL 217: DATA STRUCTURES Chapter # 4: Hashing AAL 217: DATA STRUCTURES The implementation of hash tables is frequently called hashing. Hashing is a technique used for performing insertions, deletions, and finds in constant average

More information

Hashing for searching

Hashing for searching Hashing for searching Consider searching a database of records on a given key. There are three standard techniques: Searching sequentially start at the first record and look at each record in turn until

More information

Online Optimization of VM Deployment in IaaS Cloud

Online Optimization of VM Deployment in IaaS Cloud Online Optimization of VM Deployment in IaaS Cloud Pei Fan, Zhenbang Chen, Ji Wang School of Computer Science National University of Defense Technology Changsha, 4173, P.R.China {peifan,zbchen}@nudt.edu.cn,

More information

FPGA-based Supercomputing: New Opportunities and Challenges

FPGA-based Supercomputing: New Opportunities and Challenges FPGA-based Supercomputing: New Opportunities and Challenges Naoya Maruyama (RIKEN AICS)* 5 th ADAC Workshop Feb 15, 2018 * Current Main affiliation is Lawrence Livermore National Laboratory SIAM PP18:

More information

Survey Paper on Traditional Hadoop and Pipelined Map Reduce

Survey Paper on Traditional Hadoop and Pipelined Map Reduce International Journal of Computational Engineering Research Vol, 03 Issue, 12 Survey Paper on Traditional Hadoop and Pipelined Map Reduce Dhole Poonam B 1, Gunjal Baisa L 2 1 M.E.ComputerAVCOE, Sangamner,

More information

PuLP: Scalable Multi-Objective Multi-Constraint Partitioning for Small-World Networks

PuLP: Scalable Multi-Objective Multi-Constraint Partitioning for Small-World Networks PuLP: Scalable Multi-Objective Multi-Constraint Partitioning for Small-World Networks George M. Slota 1,2 Kamesh Madduri 2 Sivasankaran Rajamanickam 1 1 Sandia National Laboratories, 2 The Pennsylvania

More information

CMSC 341 Lecture 16/17 Hashing, Parts 1 & 2

CMSC 341 Lecture 16/17 Hashing, Parts 1 & 2 CMSC 341 Lecture 16/17 Hashing, Parts 1 & 2 Prof. John Park Based on slides from previous iterations of this course Today s Topics Overview Uses and motivations of hash tables Major concerns with hash

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2014/15 Lecture II: Indexing Part I of this course Indexing 3 Database File Organization and Indexing Remember: Database tables

More information

Job Startup at Exascale:

Job Startup at Exascale: Job Startup at Exascale: Challenges and Solutions Hari Subramoni The Ohio State University http://nowlab.cse.ohio-state.edu/ Current Trends in HPC Supercomputing systems scaling rapidly Multi-/Many-core

More information

Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries

Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries Jeffrey Young, Alex Merritt, Se Hoon Shon Advisor: Sudhakar Yalamanchili 4/16/13 Sponsors: Intel, NVIDIA, NSF 2 The Problem Big

More information

Job Re-Packing for Enhancing the Performance of Gang Scheduling

Job Re-Packing for Enhancing the Performance of Gang Scheduling Job Re-Packing for Enhancing the Performance of Gang Scheduling B. B. Zhou 1, R. P. Brent 2, C. W. Johnson 3, and D. Walsh 3 1 Computer Sciences Laboratory, Australian National University, Canberra, ACT

More information

Techniques to improve the scalability of Checkpoint-Restart

Techniques to improve the scalability of Checkpoint-Restart Techniques to improve the scalability of Checkpoint-Restart Bogdan Nicolae Exascale Systems Group IBM Research Ireland 1 Outline A few words about the lab and team Challenges of Exascale A case for Checkpoint-Restart

More information

Course Introduction / Review of Fundamentals of Graph Theory

Course Introduction / Review of Fundamentals of Graph Theory Course Introduction / Review of Fundamentals of Graph Theory Hiroki Sayama sayama@binghamton.edu Rise of Network Science (From Barabasi 2010) 2 Network models Many discrete parts involved Classic mean-field

More information

DDSF: A Data Deduplication System Framework for Cloud Environments

DDSF: A Data Deduplication System Framework for Cloud Environments DDSF: A Data Deduplication System Framework for Cloud Environments Jianhua Gu, Chuang Zhang and Wenwei Zhang School of Computer Science and Technology, High Performance Computing R&D Center Northwestern

More information

GraphIn: An Online High Performance Incremental Graph Processing Framework

GraphIn: An Online High Performance Incremental Graph Processing Framework GraphIn: An Online High Performance Incremental Graph Processing Framework Dipanjan Sengupta 1(B), Narayanan Sundaram 2, Xia Zhu 2, Theodore L. Willke 2, Jeffrey Young 1, Matthew Wolf 1, and Karsten Schwan

More information

LH*Algorithm: Scalable Distributed Data Structure (SDDS) and its implementation on Switched Multicomputers

LH*Algorithm: Scalable Distributed Data Structure (SDDS) and its implementation on Switched Multicomputers LH*Algorithm: Scalable Distributed Data Structure (SDDS) and its implementation on Switched Multicomputers Written by: Salman Zubair Toor E-Mail: salman.toor@it.uu.se Teacher: Tore Risch Term paper for

More information

Managing and Mining Billion Node Graphs. Haixun Wang Microsoft Research Asia

Managing and Mining Billion Node Graphs. Haixun Wang Microsoft Research Asia Managing and Mining Billion Node Graphs Haixun Wang Microsoft Research Asia Outline Overview Storage Online query processing Offline graph analytics Advanced applications Is it hard to manage graphs? Good

More information

Workload Characterization using the TAU Performance System

Workload Characterization using the TAU Performance System Workload Characterization using the TAU Performance System Sameer Shende, Allen D. Malony, and Alan Morris Performance Research Laboratory, Department of Computer and Information Science University of

More information

Part II: Data Center Software Architecture: Topic 2: Key-value Data Management Systems. SkimpyStash: Key Value Store on Flash-based Storage

Part II: Data Center Software Architecture: Topic 2: Key-value Data Management Systems. SkimpyStash: Key Value Store on Flash-based Storage ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 2: Key-value Data Management Systems SkimpyStash: Key Value

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 10. Graph databases Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Graph Databases Basic

More information

Complementary Graph Coloring

Complementary Graph Coloring International Journal of Computer (IJC) ISSN 2307-4523 (Print & Online) Global Society of Scientific Research and Researchers http://ijcjournal.org/ Complementary Graph Coloring Mohamed Al-Ibrahim a*,

More information

Page Mapping Scheme to Support Secure File Deletion for NANDbased Block Devices

Page Mapping Scheme to Support Secure File Deletion for NANDbased Block Devices Page Mapping Scheme to Support Secure File Deletion for NANDbased Block Devices Ilhoon Shin Seoul National University of Science & Technology ilhoon.shin@snut.ac.kr Abstract As the amount of digitized

More information

Enosis: Bridging the Semantic Gap between

Enosis: Bridging the Semantic Gap between Enosis: Bridging the Semantic Gap between File-based and Object-based Data Models Anthony Kougkas - akougkas@hawk.iit.edu, Hariharan Devarajan, Xian-He Sun Outline Introduction Background Approach Evaluation

More information

Graph Compression Using Pattern Matching Techniques

Graph Compression Using Pattern Matching Techniques Graph Compression Using Pattern Matching Techniques Rushabh Jitendrakumar Shah Department of Computer Science California State University Dominguez Hills Carson, CA, USA rshah13@toromail.csudh.edu Abstract:

More information

Advanced Database Systems

Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

On Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs

On Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs On Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs Sungpack Hong 2, Nicole C. Rodia 1, and Kunle Olukotun 1 1 Pervasive Parallelism Laboratory, Stanford University

More information

IBM Spectrum Scale IO performance

IBM Spectrum Scale IO performance IBM Spectrum Scale 5.0.0 IO performance Silverton Consulting, Inc. StorInt Briefing 2 Introduction High-performance computing (HPC) and scientific computing are in a constant state of transition. Artificial

More information

Fast and Scalable Subgraph Isomorphism using Dynamic Graph Techniques. James Fox

Fast and Scalable Subgraph Isomorphism using Dynamic Graph Techniques. James Fox Fast and Scalable Subgraph Isomorphism using Dynamic Graph Techniques James Fox Collaborators Oded Green, Research Scientist (GT) Euna Kim, PhD student (GT) Federico Busato, PhD student (Universita di

More information

Efficient graph computation on hybrid CPU and GPU systems

Efficient graph computation on hybrid CPU and GPU systems J Supercomput (2015) 71:1563 1586 DOI 10.1007/s11227-015-1378-z Efficient graph computation on hybrid CPU and GPU systems Tao Zhang Jingjie Zhang Wei Shu Min-You Wu Xiaoyao Liang Published online: 21 January

More information

CS220 Database Systems. File Organization

CS220 Database Systems. File Organization CS220 Database Systems File Organization Slides from G. Kollios Boston University and UC Berkeley 1.1 Context Database app Query Optimization and Execution Relational Operators Access Methods Buffer Management

More information

Performance analysis of parallel de novo genome assembly in shared memory system

Performance analysis of parallel de novo genome assembly in shared memory system IOP Conference Series: Earth and Environmental Science PAPER OPEN ACCESS Performance analysis of parallel de novo genome assembly in shared memory system To cite this article: Syam Budi Iryanto et al 2018

More information

High-Performance Key-Value Store on OpenSHMEM

High-Performance Key-Value Store on OpenSHMEM High-Performance Key-Value Store on OpenSHMEM Huansong Fu*, Manjunath Gorentla Venkata, Ahana Roy Choudhury*, Neena Imam, Weikuan Yu* *Florida State University Oak Ridge National Laboratory Outline Background

More information

GPU Sparse Graph Traversal

GPU Sparse Graph Traversal GPU Sparse Graph Traversal Duane Merrill (NVIDIA) Michael Garland (NVIDIA) Andrew Grimshaw (Univ. of Virginia) UNIVERSITY of VIRGINIA Breadth-first search (BFS) 1. Pick a source node 2. Rank every vertex

More information

Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce

Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce Huayu Wu Institute for Infocomm Research, A*STAR, Singapore huwu@i2r.a-star.edu.sg Abstract. Processing XML queries over

More information

An MPI failure detector over PMPI 1

An MPI failure detector over PMPI 1 An MPI failure detector over PMPI 1 Donghoon Kim Department of Computer Science, North Carolina State University Raleigh, NC, USA Email : {dkim2}@ncsu.edu Abstract Fault Detectors are valuable services

More information

Modeling and evaluation on Ad hoc query processing with Adaptive Index in Map Reduce Environment

Modeling and evaluation on Ad hoc query processing with Adaptive Index in Map Reduce Environment DEIM Forum 213 F2-1 Adaptive indexing 153 855 4-6-1 E-mail: {okudera,yokoyama,miyuki,kitsure}@tkl.iis.u-tokyo.ac.jp MapReduce MapReduce MapReduce Modeling and evaluation on Ad hoc query processing with

More information

SEDA: An Architecture for Well-Conditioned, Scalable Internet Services

SEDA: An Architecture for Well-Conditioned, Scalable Internet Services SEDA: An Architecture for Well-Conditioned, Scalable Internet Services Matt Welsh, David Culler, and Eric Brewer Computer Science Division University of California, Berkeley Operating Systems Principles

More information

The Role of Database Aware Flash Technologies in Accelerating Mission- Critical Databases

The Role of Database Aware Flash Technologies in Accelerating Mission- Critical Databases The Role of Database Aware Flash Technologies in Accelerating Mission- Critical Databases Gurmeet Goindi Principal Product Manager Oracle Flash Memory Summit 2013 Santa Clara, CA 1 Agenda Relational Database

More information