Towards a Distributed Large-Scale Dynamic Graph Data Store


2016 IEEE International Parallel and Distributed Processing Symposium Workshops

Keita Iwabuchi, Scott Sallinen, Roger Pearce, Brian Van Essen, Maya Gokhale, Satoshi Matsuoka
Center for Applied Scientific Computing, Lawrence Livermore National Laboratory; Dept. of Mathematical and Computing Sciences, Tokyo Institute of Technology; Dept. of Electrical and Computer Engineering, University of British Columbia; Global Scientific Information and Computing Center, Tokyo Institute of Technology
iwabuchi.k.ab@m.titech.ac.jp, scotts@ece.ubc.ca, {rpearce, vanessen1, maya}@llnl.gov, matsu@is.titech.ac.jp

Abstract: In many graph applications, the structure of the graph changes dynamically over time and may require real-time analysis. However, constructing a large graph is expensive, and most studies of large graphs have focused on static rather than dynamic graph data structures. To address this issue, we propose DegAwareRHH, a high-performance dynamic graph data store designed to scale out and store large, scale-free graphs by leveraging compact hash tables with high data locality. We extend DegAwareRHH to multiple processes and distributed memory, and perform dynamic graph construction on large scale-free graphs using emerging Big Data HPC systems such as the Catalyst cluster at LLNL. We demonstrate that DegAwareRHH processes a request stream 206.5x faster than a state-of-the-art shared-memory dynamic graph processing framework, when both implementations use 24 threads/processes to construct a graph with 1 billion edge insertion requests and 54 million edge deletion requests. DegAwareRHH also achieves a processing rate of over 2 billion edge insertion requests per second using 128 compute nodes to construct a large-scale web graph containing 128 billion edges, the largest open-source real graph dataset to our knowledge.

I. INTRODUCTION

In many graph applications, the structure of a graph changes dynamically over time and may require real-time analysis. Large-scale graph processing often presents challenging data-intensive workloads, due to the unstructured memory access patterns common to many applications. Significant research into the analysis of challenging scale-free graphs that are static (graph topology remains fixed) has led to remarkable improvements in processing capabilities on HPC systems [1], [2], [3]; much of this work has been enabled by the bi-annual Graph500 competition [4]. Research into the algorithms and data structure management for large-scale dynamic temporal graphs, that is, graphs whose topology changes as a function of time, is at the moment immature when compared to the static scenario. While capabilities for processing large-scale dynamic graphs are lacking, the need for such capabilities is strong. In this work, we present our progress towards developing the data structures and infrastructure management necessary to support dynamic graph analysis at large scale, on distributed High Performance Computing (HPC) platforms.

Even with the large memory capacities of HPC systems, many graph applications require additional out-of-core memory. Such large memory requirements can be due to large-scale topology, as is the case with the 128 billion edge web graph we study, or rich property data that decorates the topology. For example, Facebook manages 1.39 billion active users as of 2014, and their social network graph has more than 400 billion edges [5].
To accommodate such large graphs we use node-local NVRAM to extend main memory and to persistently store the graph data. Node-local NVRAM (e.g., NAND Flash) has found its way into HPC platforms, driven in large part by application checkpointing needs. In this work, we present a dynamic graph data store (DegAwareRHH) that adopts open-addressing compact hash tables with high data locality for vertices with large degree, in order to minimize the number of accesses to NVRAM. DegAwareRHH is degree aware, and uses separate compact data structures for low-degree vertices to reduce their storage and search overheads. We implemented DegAwareRHH for distributed memory using an asynchronous visitor queue abstraction [6] [7], and perform scaling studies of large-scale graph construction on up to 128 compute nodes.

Summary of our contributions:
- We demonstrate that DegAwareRHH can construct a graph with 1 billion edge insertion requests and 54 million edge deletion requests 206.5 times faster than a state-of-the-art shared-memory graph processing framework, when both implementations use 24 threads/processes.
- We present scaling studies of constructing large-scale synthetic graphs and a large real-world graph (up to 275 billion edge insertion requests), scaling up to 128 compute nodes.
- We show that DegAwareRHH processes over 2 billion edge insertion requests per second, using 128 compute nodes, on a large-scale real graph with 128 billion edges.

II. BACKGROUND

A. Scale-free Graphs

Many real-world graphs can be classified as scale-free, where the distribution of vertex degrees follows a power-law distribution [8]. The degree of a vertex is the number of edges connecting to it. A power-law vertex degree distribution means that the majority of vertices have a low degree, while a select few have a very large degree. Recent work has focused on optimizations related to the challenge of high-degree vertices, which cause load imbalance for parallel computations [6] [9] [7]. In this work, we identify low-degree vertices, which are particularly challenging for graph data stores in NVRAM.

B. NVRAM

In the next generation of supercomputers, providing sufficient main memory capacity is one of the biggest challenges due to the high power consumption of DRAM. NVRAM will greatly expand the possibility of processing extremely large-scale graphs that exceed the DRAM capacity of the nodes. However, graph construction cost using NVRAM is extremely high compared to DRAM: a naive data structure implementation would cause significant performance degradation due to unstructured memory accesses to the device.

C. Dynamic Graph Processing

In many real-world graph applications, the structure of the graph changes dynamically over time and may require real-time analysis. For example, genomic sequence assembly leveraging de Bruijn graphs [10] [11] incrementally builds graphs during sequence assembly. These applications construct a graph based on relatively short fragments of DNA from a sequencer. When a new sequence is generated, graph analysis and graph construction/updates are conducted simultaneously. Therefore, not only is high-performance dynamic graph construction required, but also efficient data retrieval during graph analysis.

D. Graph Data Structures

Graph processing is a highly data-intensive problem, thus improving the data locality of the graph data structure is an essential optimization [12]. Over the years, many data structure models have been studied. In this section, we discuss classic graph data structure models for static and dynamic graphs and their advantages and disadvantages. In the following, we consider a graph G with n vertices and m edges.

1) Static Graph Data Structures:

Adjacency matrix: The simplest data structure model is an adjacency matrix, an n x n matrix. If there is an edge from v_i to v_j, the entry a_ij holds a positive value or the property data of the edge. An adjacency matrix consumes O(n^2) memory, which is a huge disadvantage when processing sparse graphs.

Compressed Sparse Row (CSR): The CSR data structure is the de-facto graph data structure widely used in graph processing implementations. It consists of two arrays, an index array and an edge array. The index array holds indices into the edge array, and the edge array holds neighbor vertex IDs. More specifically, each position in the index array represents a source vertex's ID, and the corresponding element refers to an index in the edge array. The range of a vertex v_i's edges in the edge array is from index[v_i] to index[v_i + 1]. The memory size of the CSR data structure is O(n + 1) for the index array and O(m) for the edge array. An overview of the CSR data structure is illustrated in Figure 1.

Figure 1. CSR graph data structure
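A minimal C++ sketch of this layout (illustrative code, not the paper's implementation): the index array points into the packed edge array, and vertex v's neighbors occupy edge[index[v]] through edge[index[v+1]-1].

```cpp
// Minimal CSR sketch: index array of length n+1, edge array of length m.
#include <cstdint>
#include <iostream>
#include <vector>

struct CSRGraph {
    std::vector<uint64_t> index;  // length n + 1, offsets into `edges`
    std::vector<uint64_t> edges;  // length m, neighbor vertex IDs

    // Iterate over the out-neighbors of vertex v.
    template <typename F>
    void for_each_neighbor(uint64_t v, F f) const {
        for (uint64_t i = index[v]; i < index[v + 1]; ++i) f(edges[i]);
    }
};

int main() {
    // Tiny example graph: 0 -> {1, 2}, 1 -> {2}, 2 -> {}
    CSRGraph g{{0, 2, 3, 3}, {1, 2, 2}};
    g.for_each_neighbor(0, [](uint64_t u) { std::cout << u << '\n'; });
}
```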
The CSR data structure provides high data locality and efficient memory usage due to its packed array structure. However, because of this packing, CSR is not well suited for storing dynamic graphs: in general, even adding or deleting a single edge or vertex requires updating the entire array, causing large data movement.

2) Dynamic Graph Data Structures:

Key-Value: The Key-Value Store (KVS) model, in other words a map or dictionary data structure, manages an element as a pair of a key and a value. In a simple KVS graph model, an edge table uses the source and target vertex pair as the key and the edge property as the value. To store vertex data, the KVS holds a vertex table consisting of pairs of a vertex and its property data. However, while such a simple Key-Value Store model can insert and delete vertices and edges efficiently, it does not consider graph topology. Therefore, it is difficult to obtain locality benefits among the edges adjacent to a vertex, which is a highly important factor for the performance of many graph analyses.

Adjacency-list: The basic components of an adjacency-list are a vertex table and edge lists. The vertex table describes the set of all vertices of a graph. Each element of the vertex table consists of a pointer to an edge list and also vertex property data if needed.

An edge list holds the list of neighbors of its vertex, together with edge property data. The main advantage of the adjacency-list model is that it provides high data locality among the out-going edges (or in-coming edges, depending on the configuration) of each vertex. There are many variations on this model, each with its advantages and disadvantages. For the vertex table, typically a tree or hash (key-value) data structure is used, and for the edge list, a single vector or linked-list data structure is used. A tree data structure is widely used in database systems; however, a tree data structure will potentially cause random memory accesses for each operation (such as insert, find, and delete). Even though the time complexity of a tree data structure is O(log(n)), this is too costly for large-scale graphs stored in NVRAM; therefore, we use near-constant-time hash tables.

III. ROBIN HOOD HASHING (RHH)

In this section we describe the hash table algorithm used by DegAwareRHH. In order to minimize the number of accesses to NVRAM, which result in page misses, we choose Robin Hood Hashing [13] because of its locality properties and compactness. Robin Hood Hashing is an open-addressing hashing scheme (without collision chains) designed to maintain a small average probe distance, the distance between the initial (hashed) position and the current position of a key. The average probe distance is an important factor determining the construction cost of the table. We expect that the combination of a low probe distance and the sequential memory access patterns of Robin Hood Hashing will be beneficial for out-of-core processing using NVRAM. Previous work has improved RHH's performance on various workloads, and has also shown that the probe distance can remain small under various load conditions [14] [15]. Now, we describe how Robin Hood Hashing behaves during insert, delete, and search operations.

A. Insert

A core characteristic of an open-addressing hash table is how it deals with hash collisions. In case of a collision, Robin Hood Hashing performs linear probing over the cyclic probe sequence H(k), H(k)+1, ..., L-1, 0, 1, ..., H(k)-1, where H(k) is the hash value of k and L is the length of the table. The length of this probe sequence is called the probe distance. Specifically, RHH moves an existing element if a new element hashes into its position and the probe distance of the existing element is equal to or smaller than that of the new element. This strategy attempts to keep the average probe distance of all elements in the table small.

An illustration of edge insertion using Robin Hood Hashing is shown in Figure 2. Let {i, j} be an edge between vertices i and j. In this example, the hash function is h = v_id mod C, where v_id is the ID of a vertex and C is the capacity of the hash table.

Figure 2. Edge insertion into a Robin Hood Hashing table

First, compute the hash value (initial position) of a new edge {1,5} using the hash function. Second, edge {1,6} is moved to the next position, since the probe distance of {1,6} is equal to that of edge {1,5}. Next, edge {2,2} is moved to the next position and edge {1,6} is inserted in its position (index 2), because the probe distance of edge {2,2} is smaller than that of edge {1,6}. Ideally, a target element is located in the same memory page as its hashed position, even when the element has been moved to another position because of a hash conflict.
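The displacement rule above can be summarized in a short, self-contained sketch. This is illustrative code, not the paper's implementation: it assumes the "equal or smaller probe distance" displacement described above and omits tombstones, table growth, and the probe-distance threshold.

```cpp
// Minimal Robin Hood insertion sketch: elements whose probe distance is
// equal to or smaller than that of the incoming element are displaced and
// re-inserted further along the cyclic probe sequence.
#include <cstdint>
#include <utility>
#include <vector>

struct Slot {
    bool     occupied = false;
    uint8_t  probe    = 0;   // distance from the hashed (initial) position
    uint64_t key      = 0;
    uint64_t value    = 0;
};

struct RobinHoodTable {
    std::vector<Slot> slots;
    explicit RobinHoodTable(size_t capacity) : slots(capacity) {}

    size_t hash(uint64_t key) const { return key % slots.size(); }

    void insert(uint64_t key, uint64_t value) {
        uint8_t probe = 0;
        size_t  pos   = hash(key);
        for (;;) {
            Slot& s = slots[pos];
            if (!s.occupied) {                 // empty slot: place the element here
                s = Slot{true, probe, key, value};
                return;
            }
            if (s.probe <= probe) {            // resident is at least as "rich":
                std::swap(s.probe, probe);     // swap it out and keep probing
                std::swap(s.key, key);         // with the displaced element
                std::swap(s.value, value);
            }
            pos = (pos + 1) % slots.size();    // cyclic linear probing
            ++probe;                           // (growth / threshold checks omitted)
        }
    }
};

int main() {
    RobinHoodTable t(8);
    t.insert(1, 5);   // e.g. edge {1,5} keyed by source vertex
    t.insert(1, 6);
    t.insert(2, 2);
}
```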
B. Delete

When deleting an element, instead of simply erasing all of its data (probe distance, key, and value) and moving succeeding elements forward, we mark the element as deleted by setting a tombstone flag, keeping its existing probe distance. By keeping the probe distances of deleted elements, we can expect to reduce the cost of moving elements when inserting or deleting an element. The main drawback of this approach appears after many delete operations have been performed: the average probe distance will be large even though the number of actual elements is small. In the worst case, the probe distance may exceed the capacity of the table. To prevent this situation, we periodically re-hash all elements of the table, erasing the tombstones of deleted elements and reducing the probe distances of active elements.

C. Search

A search operation is straightforward: compute the hash value of the target element and probe for it linearly from the hashed position. If an empty slot (not including tombstones) is found, or an element whose probe distance is smaller than the current search distance, the search fails, since the target element does not exist in the table.
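A companion sketch of the delete and search behavior, again hypothetical code rather than the paper's implementation and using the same slot layout as the insertion sketch above: deletion only sets a tombstone flag and keeps the probe distance, and the search terminates early on a truly empty slot or on an element closer to its home position than the search has traveled.

```cpp
// Minimal Robin Hood search and tombstone-deletion sketch.
#include <cstdint>
#include <vector>

struct Slot {
    bool     occupied  = false;
    bool     tombstone = false;  // deleted, but probe distance preserved
    uint8_t  probe     = 0;
    uint64_t key       = 0;
    uint64_t value     = 0;
};

// Returns the slot index of `key`, or -1 if it is not in the table.
long find(const std::vector<Slot>& slots, uint64_t key) {
    size_t  pos   = key % slots.size();
    uint8_t probe = 0;
    for (size_t n = 0; n < slots.size(); ++n) {
        const Slot& s = slots[pos];
        if (!s.occupied) return -1;        // truly empty slot: key is absent
        if (s.probe < probe) return -1;    // resident is "richer": key is absent
        if (!s.tombstone && s.key == key) return static_cast<long>(pos);
        pos = (pos + 1) % slots.size();    // tombstones do not stop the probe
        ++probe;
    }
    return -1;
}

// Tombstone deletion: keep the slot and its probe distance so that probe
// sequences of other elements remain valid; a periodic rehash reclaims space.
void erase(std::vector<Slot>& slots, uint64_t key) {
    long pos = find(slots, key);
    if (pos >= 0) slots[pos].tombstone = true;
}
```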

IV. DEGAWARERHH

In this section, we first describe the design of DegAwareRHH, then describe its insertion algorithm, and finally give some optimization techniques that improve its performance.

A. Overview

DegAwareRHH is designed to target graphs that have the following characteristics: 1) extremely large scale, scale-free graphs with hundreds of billions to trillions of edges; 2) each vertex and edge has property data; 3) there may be multiple edges between two vertices (a multigraph). For this type of graph, the key challenges of designing a dynamic graph data structure are: 1) supporting out-of-core data structures in order to process large-scale graphs; 2) high-performance insertion and deletion of vertices and edges; 3) quickly locating a specific edge matching topological and property constraints; 4) providing high locality when reading all adjacent edges of a vertex, in order to improve graph analysis performance.

B. Graph Data Structure Using Robin Hood Hashing

DegAwareRHH is based on the adjacency-list data structure model using Robin Hood Hashing. Additionally, it is designed to manage scale-free graphs (with many low-degree and a few very high-degree vertices) by using two different data structures, depending on the vertex type.

1) Middle-high-degree Table: Locating a specific edge of a high-degree vertex may be highly costly. Therefore, to quickly locate a specific edge matching topological constraints, we use a hash table for the edge list instead of a simple 1D array. This data structure efficiently handles insertion while maintaining locality of access.

2) Low-degree Table: Even with a compact hash table, some operations carry relatively high overhead for low-degree vertices, such as allocating a new table and pointer accesses to edge lists. Accordingly, we use a single Robin Hood Hashing table to store low-degree vertices in order to reduce these costs.

C. Data Layout

An illustration of DegAwareRHH is shown in Figure 3. Depending on the degree of a source vertex, edges are stored in two types of tables: a low-degree table and a middle-high-degree table. The low-degree table stores edges in a single compact table, i.e., directly using Robin Hood Hashing. The middle-high-degree table consists of a vertex table and edge-chunks. The vertex table stores source vertex information, i.e., the source vertex's ID and vertex property data if needed. Adjacent edges are stored in edge-chunks, and the vertex table holds pointers to the edge-chunks. In order to locate a specific edge in constant time, we also use Robin Hood Hashing for the edge-chunks. In our current implementation, only 1 byte of extra space is allocated per element to construct a Robin Hood Hashing table (7 bits for the probe distance and 1 bit for a tombstone flag).

Figure 3. Overview of the NVRAM-specialized Degree Aware Dynamic Graph Data Structure (low-degree table and mid-high-degree table)

D. Programming API

Like existing popular graph processing frameworks, DegAwareRHH provides basic APIs including: graph construction and update, such as add edge, delete edge, and update property data; and graph reading APIs, such as get property data and get adjacency-list. The get adjacency-list function returns a forward iterator pointing to the adjacency-list. Using C++ templates, our implementation can hold any type of object for key, value, and property data.
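As a purely illustrative, self-contained toy with the shape of the API just described (add edge, delete edge, adjacency access via forward iterators, property data as a template parameter): the class and method names below are invented for this sketch, a std::multimap stands in for the Robin Hood tables, and none of it should be taken as the paper's actual interface.

```cpp
// Toy stand-in mimicking the described API shape; not the real DegAwareRHH.
#include <cstdint>
#include <iostream>
#include <map>
#include <utility>

template <typename Vertex, typename EdgeProp>
class toy_dynamic_graph {
    std::multimap<Vertex, std::pair<Vertex, EdgeProp>> edges_;  // src -> (dst, prop)
public:
    void add_edge(Vertex src, Vertex dst, EdgeProp prop) {
        edges_.emplace(src, std::make_pair(dst, prop));
    }
    void delete_edge(Vertex src, Vertex dst) {
        auto range = edges_.equal_range(src);
        for (auto it = range.first; it != range.second; ++it)
            if (it->second.first == dst) { edges_.erase(it); return; }
    }
    // "get adjacency-list": a pair of forward iterators over src's edges.
    auto adjacency(Vertex src) const { return edges_.equal_range(src); }
};

int main() {
    toy_dynamic_graph<uint64_t, unsigned char> g;
    g.add_edge(1, 5, 'a');
    g.add_edge(1, 6, 'b');
    g.delete_edge(1, 5);
    auto [first, last] = g.adjacency(1);
    for (auto it = first; it != last; ++it)
        std::cout << it->second.first << ' ' << it->second.second << '\n';
}
```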
E. Insertion Algorithm

An illustration of the edge insertion algorithm of DegAwareRHH is shown in Figure 4. Let (u, v) be an edge to insert, where u is the source vertex ID and v is the target vertex ID.

Step 1. Query the current degree (the number of edges) of vertex u in the low-degree table.

Step 2-A. If the degree is more than 0 and less than the low-degree threshold of the middle-high-degree table, insert the edge into the low-degree table. If the degree of the low-degree vertex is equal to the low-degree threshold, move all edges of the vertex to the middle-high-degree table.

Step 2-B. Otherwise, query the current degree of vertex u in the middle-high-degree table. If the degree is 0, that is, vertex u is a new vertex, insert into the low-degree table. Otherwise, insert the edge into the middle-high-degree table.

Step 3. While inserting an element, if the probe distance of an element exceeds a threshold due to too many hash conflicts, the table is grown to double its size. It is well known that the probe distance usually remains a small constant, and it rarely becomes long.

Figure 4. Degree Aware Edge Insertion Algorithm
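A compact sketch of the routing logic in Steps 1-3, using ordinary standard-library containers as stand-ins for the Robin Hood tables. The container choice and function name are assumptions for illustration; the threshold value of 2 is the one used in the paper's experiments (Section VI), and the Step 3 table growth is internal to the hash tables and omitted.

```cpp
// Degree-aware edge insertion sketch: low-degree vertices share one table,
// vertices that reach the threshold are promoted to per-vertex edge chunks.
#include <cstdint>
#include <unordered_map>
#include <vector>

constexpr size_t kLowDegreeThreshold = 2;

std::unordered_multimap<uint64_t, uint64_t> low_degree_table;        // src -> dst
std::unordered_map<uint64_t, std::vector<uint64_t>> mid_high_table;  // src -> edge chunk

void insert_edge(uint64_t u, uint64_t v) {
    // Step 1: current degree of u in the low-degree table.
    size_t low_deg = low_degree_table.count(u);

    if (low_deg > 0 && low_deg < kLowDegreeThreshold) {      // Step 2-A: stay low-degree
        low_degree_table.emplace(u, v);
        return;
    }
    if (low_deg == kLowDegreeThreshold) {                    // Step 2-A: promote u
        auto range = low_degree_table.equal_range(u);
        auto& chunk = mid_high_table[u];
        for (auto it = range.first; it != range.second; ++it) chunk.push_back(it->second);
        low_degree_table.erase(u);
        chunk.push_back(v);
        return;
    }
    // Step 2-B: u is not in the low-degree table.
    auto it = mid_high_table.find(u);
    if (it == mid_high_table.end()) low_degree_table.emplace(u, v);  // brand-new vertex
    else it->second.push_back(v);                                    // already high degree
}

int main() {
    insert_edge(1, 5);
    insert_edge(1, 6);   // vertex 1 now holds the threshold number of edges
    insert_edge(1, 7);   // promoted to the mid-high-degree table on this insert
}
```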

F. Optimization

Load Factor: It has been reported that performance degrades rapidly when the table is close to full [13]. To prevent this situation, we double the table capacity when the number of elements, not including tombstones, exceeds 90% of its capacity.

Rehashing Table: As described in Section III, after many elements are deleted and inserted, the probe distance may become quite large. Thus, we check the probe distance each time an insertion is completed, and rehash all elements if the value is larger than the table size.

Memory Pool Allocator: Allocating small amounts of memory is an expensive operation. We use a memory pool allocation technique to reduce the cost of individual, small-sized memory allocations.

V. EXTENDING DEGAWARERHH FOR DISTRIBUTED-MEMORY

Distributed-memory DegAwareRHH has been implemented with MPI to store a distributed graph. We adopt the distributed asynchronous visitor queue [6] [7] as the driver of DegAwareRHH. The visitor queue framework provides the parallelism and creates a data-driven flow of computation over MPI with asynchronous communication. All visitors are asynchronously transmitted, scheduled, and executed. An illustration of the distributed dynamic graph construction algorithm is shown in Figure 5; note that Figure 5 only shows the communication from Process A to Process B, to keep the figure concise.

Figure 5. Distributed Dynamic Graph Construction over the Visitor Queue Framework

Before constructing the graph, we first need to determine an allocation strategy for vertices, i.e., which process is responsible for a given vertex. To determine the owner process of a new edge request e_id, consisting of an operation and a source/destination pair, we use Consistent Hashing: a vertex's owner process is computed as hash(e_id.source) mod P, where P is the number of processes, and all processes use the same hash function. Due to this strategy, any process can determine in constant time the owner of a given vertex, and thus the process responsible for the target location of any edge.

Each process independently reads graph construction requests (insertion/deletion) and passes them to the visitor queue sequentially. The visitor queue applies a request immediately to the local graph store if it is the owner of the source vertex; however, if the owner of the vertex is a remote process, it pushes the request into its local message queue (wrapped in a visitor object), and sends the visitors using asynchronous communication when the queue becomes full. Each process checks its receive queue periodically and executes received visitors as visits to the local store, i.e., it applies the received graph construction requests to the local process.
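A minimal sketch of the owner computation and per-destination buffering described above. The struct and function names are hypothetical, and the communication itself is left as a stub, since the real system wraps requests in visitor objects handled by the asynchronous visitor queue framework [6] [7].

```cpp
// Owner computation (hash of the source vertex mod P) and per-rank buffering.
#include <cstdint>
#include <functional>
#include <vector>

struct EdgeRequest {
    bool     insert;   // true = insert, false = delete
    uint64_t source;
    uint64_t target;
};

// Every process computes the same owner for a given source vertex in O(1).
int owner_of(const EdgeRequest& req, int num_procs) {
    return static_cast<int>(std::hash<uint64_t>{}(req.source) % num_procs);
}

struct RequestRouter {
    int my_rank;
    int num_procs;
    size_t flush_threshold = 4096;
    std::vector<std::vector<EdgeRequest>> outgoing;  // one queue per rank

    RequestRouter(int rank, int procs)
        : my_rank(rank), num_procs(procs), outgoing(procs) {}

    void route(const EdgeRequest& req) {
        int owner = owner_of(req, num_procs);
        if (owner == my_rank) {
            apply_locally(req);                      // owner: apply immediately
        } else {
            outgoing[owner].push_back(req);          // remote: buffer the visitor
            if (outgoing[owner].size() >= flush_threshold)
                send_async(owner, outgoing[owner]);  // ship when the queue fills up
        }
    }

    void apply_locally(const EdgeRequest&) { /* insert/delete in the local store */ }
    void send_async(int, std::vector<EdgeRequest>&) { /* asynchronous send stub */ }
};

int main() {
    RequestRouter router(/*rank=*/0, /*procs=*/4);
    router.route({true, 1, 5});
}
```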
VI. EXPERIMENTS

In this section, we experimentally evaluate the graph construction performance of our dynamic graph data store (DegAwareRHH).

A. Experimental Setup

1) Implementation: We implemented DegAwareRHH in C++ and used the Boost.Interprocess library to allocate in a memory-mapped region. Specifically, we use the Boost.Interprocess memory pool allocator to allocate the hash tables, reducing the cost of many small memory allocations. We use memory-mapped files as the interface to NVRAM. For in-core experiments, we create the files under /dev/shm so that only DRAM is used. Based on a preliminary experiment, we set the middle-high-degree threshold at 2, that is, vertices with 2 or more edges are stored in the middle-high-degree table.

For experimental comparison, we show the performance of the following implementations:

Baseline model: The model consists of a vertex table and edge tables. The vertex table holds source vertex IDs, property data, and a pointer to an edge table, using the Boost unordered_map container; an edge table holds target vertex IDs and edge property data, using the Boost vector container. The Boost unordered_map consists of multiple buckets, where each bucket is a chunk of any number of elements. When deleting an element in an edge table, instead of naively using the Boost vector's erase function, we delete the element by swapping it with the last element of the edge table, to avoid moving succeeding elements forward after each deletion.

STINGER [16]: STINGER is a shared-memory (in-core) parallel dynamic graph processing framework developed at the Georgia Institute of Technology. STINGER can update a graph with several times to three orders of magnitude better performance in comparison with 12 state-of-the-art open-source graph databases and libraries [17], including SQLite, Neo4j, Giraph, DEX, and the Boost Graph Library. We used version .

To perform a fair comparison, we compiled the implementations using GCC-4.9 with the -O3 optimization option. For the Baseline model and DegAwareRHH, we used Boost and MVAPICH2.
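The Baseline model's deletion-by-swap can be illustrated with a few lines of C++ (a generic sketch, not the actual Baseline code): the element to delete is overwritten by the last element and the vector is shrunk by one, so no succeeding elements are shifted forward.

```cpp
// Swap-with-last ("swap and pop") deletion for an unordered edge vector.
#include <cstdint>
#include <vector>

void erase_unordered(std::vector<uint64_t>& edge_list, uint64_t target) {
    for (size_t i = 0; i < edge_list.size(); ++i) {
        if (edge_list[i] == target) {
            edge_list[i] = edge_list.back();  // O(1), but element order is not preserved
            edge_list.pop_back();
            return;                           // delete a single matching edge
        }
    }
}

int main() {
    std::vector<uint64_t> edges = {5, 6, 7, 8};
    erase_unordered(edges, 6);  // edges becomes {5, 8, 7}
}
```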

2) Datasets: We show experiments using both synthetic graph models and a real-world graph.

RMAT: Generates scale-free graphs [18]; we follow the Graph500 v1.2 specifications for the generator parameters [4]. After graph generation, all vertex labels are uniformly permuted to destroy any locality artifacts from the generator. The generator produces an undirected graph in random order; for an edge (e_i, e_j), we also generate the edge in the opposite direction, (e_j, e_i).

WebGraph2012 (hyperlink graph): We also use a large web graph [19], the largest open-source real graph dataset to our knowledge, which has 128 billion edges (as a directed graph). Each vertex corresponds to a web page and each edge is a hyperlink. Each vertex is represented as a 64-bit integer value, thus an edge is a pair of 64-bit integers.

In this experiment, we set the vertex and edge property data to a dummy value (unsigned char, 0), since the datasets above do not have property data. To evaluate the performance of the delete operation, we add 5% to 50% of edge deletion requests into the edge insertion requests in random order, but with carefully chosen positions such that the insertion request of an edge comes before its deletion request. The datasets are stored in files as pairs of source and target vertex IDs. We use the source vertex's ID as the key of the low-degree table and the middle-high-degree table, and the target vertex's ID as the key of the edge-chunk.

3) Machine Configurations: We use the Catalyst cluster at Lawrence Livermore National Laboratory. Each Catalyst node has 12-core Intel(R) Xeon(R) E5-2695v2 processors (2 sockets) and 128 GB of DRAM, and is equipped with node-local, PCI-attached 800 GB NAND flash NVRAM. The nodes are connected over dual-rail QDR-80 Intel TrueScale fabrics.

4) Experiment Method: In this paper, we evaluate DegAwareRHH in the context of graph construction performance. We repeated the following steps: Step 1) each process buffers a subset of edges (1 million) from the file system into DRAM, to avoid measuring the cost of reading edges from files; Step 2) the edges from the edge buffer are inserted into or deleted from the graph data store sequentially. All edges are uniquely inserted, that is, all insertion operations are performed following a find operation. We only measured the execution time of Step 2.

B. Single Node Experiments

First, we evaluate the graph construction performance of DegAwareRHH against STINGER and the Baseline model on a single node. We show graph construction performance on an RMAT graph for the Baseline model, STINGER, and DegAwareRHH in Figure 6. We measure performance for 6, 12, and 24 threads (STINGER) or processes (Baseline & DegAwareRHH). The dataset has 1 billion edge insertion requests and 54 million edge deletion requests. The x-axis denotes the number of processed graph construction (edge insertion/deletion) requests in millions; the y-axis denotes the cumulative execution time in seconds. All implementations strong scale when increasing the number of threads/processes from 6 to 24, by 1.9 times for STINGER and the Baseline model and by 3.7 times for DegAwareRHH.
As the size of the constructed graph increases, the execution time of each chunk step on STINGER and the Baseline model increases progressively. On the other hand, DegAwareRHH shows near-linear scaling and can insert over 1 billion edges in 50 seconds with 24 processes. DegAwareRHH outperforms the other implementations, specifically, 206.5 times faster than STINGER and 23.5 times faster than the Baseline model with 24 threads/processes.

The differences in the results come from differences in the data structures. STINGER stores a graph using the adjacency-list model, like DegAwareRHH; however, it stores edges in a linked-block-list data structure (each linked block holds multiple edges), which requires a sequential search to find a target edge or an empty space for a new edge. Thus, the time to insert an edge (i, j) increases as the out-degree of i increases.

Figure 6. Unique edge insertion and 5% edge deletion (RMAT 25, in-core, single node); lower is better.

The Baseline model stores edges in a vector container and also requires a sequential access to find a target edge; however, unlike STINGER, a new edge is always inserted at the last position of the array. Due to this, it performs better than STINGER. On the other hand, since DegAwareRHH uses hash tables for its edge tables, the execution time remains near constant even as the number of inserted edges increases.

Second, we evaluate graph construction performance while varying the number of edge deletion requests on the three implementations with 24 threads/processes. The results are shown in Figure 7. We varied the rate of added edge deletion requests over 5, 10, 20, 30, 40, and 50 percent. The x-axis denotes the rate, and the y-axis denotes the number of processed graph construction requests per second on a log scale. As the number of deletion requests increases, the performance of STINGER and the Baseline model increases, since the cost of finding an edge decreases as the edge tables become smaller. In contrast, DegAwareRHH slows down slightly as the number of deletion requests increases (by 5% when comparing the 5% and 50% rates). Nevertheless, DegAwareRHH still outperforms the other implementations, e.g., 121x faster than STINGER and 16x faster than the Baseline model in the 50% edge deletion case.

Figure 7. Unique edge insertion and deletion (RMAT 25, in-core, single node). Higher is better; note the log y-axis scale.

C. Multiple Node Experiments

We evaluate the graph construction performance of DegAwareRHH against the Baseline model on multiple nodes. Note that we do not use STINGER in this experiment, since it only supports a shared-memory environment.

1) Weak Scaling: To explore how DegAwareRHH scales when multiple processes run on multiple compute nodes, we first perform a weak scaling experiment which increases the number of compute nodes while fixing the size of the sub-graph per node. Each compute node constructs a subgraph with 134 million vertices and over 2 billion undirected edges; thus the actual number of inserted edges is over 4 billion per node. We increase the number of compute nodes up to 64 with 24 processes per node; therefore, at 64 nodes, the graph has 8.6 billion vertices and 275 billion inserted edges. In this experiment, we also evaluate the performance of non-unique edge insertions, since the Baseline model has a huge overhead to find an edge when inserting it uniquely. We perform four experiment scenarios: unique edge insertion (Figure 8); non-unique edge insertion (Figure 9); unique edge insertion and deletion (Figure 10); and non-unique edge insertion and deletion (Figure 11). We added 5% of deletion requests in the deletion scenarios; in the fourth scenario, only a single edge is deleted per deletion request. Note that except for the non-unique insertion workload, we halted the Baseline experiments at more than 8 compute nodes before finishing, due to excessive run times.

Figure 8. In-core unique edge insertion (24 processes per node, 4.3 billion edges per node); lower is better.

Figure 9. In-core non-unique edge insertion (24 processes per node, 4.3 billion edges per node); lower is better.

Figure 10. In-core unique edge insertion and 5% deletion (24 processes per node, 4.3 billion edges per node); lower is better.

Figure 11. In-core non-unique edge insertion and 5% deletion (24 processes per node, 4.3 billion edges per node); lower is better.

First of all, DegAwareRHH scales as the number of compute nodes increases and outperforms the Baseline in three of the four scenarios; for instance, it achieves a processing rate of 800 million requests per second at 64 nodes in the unique insertion workload (Figure 8). On the non-unique insertion workload (Figure 9), the Baseline model outperforms DegAwareRHH by 29.8% at 64 nodes. This result can be attributed to the edge insertion algorithm of the Baseline model, which simply inserts an edge at the last position of an edge vector without checking for the existence of the edge. On the non-unique insertion and deletion workload (Figure 11), even though edges are inserted without checking for existence, the Baseline model slows down significantly when only 5% of deletion requests are added.

2) Strong Scaling: To evaluate strong scaling performance on a realistic workload, we perform graph construction experiments on a large-scale, real-world graph (WebGraph2012). Both implementations can hold the graph on a minimum of 32 nodes; we increase the number of nodes to 32, 64, and 128. Both implementations scale with increasing node count, by up to 4.1 times for the Baseline model and 4.3 times for DegAwareRHH. As the degree of vertices in this graph is smaller than that of the RMAT graphs, the performance gap between the Baseline model and DegAwareRHH shrinks; DegAwareRHH outperforms the Baseline model by 30.69% at 128 nodes. In addition, we found that the vertex tables of both implementations become smaller as the number of processes increases, improving performance due to greater data locality.

Figure 12. In-core unique edge insertion (strong scaling).

3) Out-of-core Experiments: Finally, we perform an experimental evaluation of out-of-core graph construction performance. We set up the experiment by reordering the edge stream into sorted order, and regard this as the best-performance scenario, because the edges of the same vertex are inserted consecutively into the same edge-list, which contributes to data locality. In contrast, we regard the randomly ordered datasets as a worst-case performance scenario. We construct RMAT graphs with 4.3 billion edge insertions for the in-core scenario and 17 billion edge insertions for the out-of-core scenario. Because out-of-core experiments with the Baseline model require excessive run time, we show estimated performance based on its measured performance after it exceeds DRAM size but before finishing; note that the Baseline model slows down as the size of the graph increases, so the estimated performance is better than the actual one.

We show the results in Figure 13. In the sorted-order scenarios, the Baseline model slows down by 58.8% when comparing the in-core scenario to the out-of-core one. On the other hand, out-of-core DegAwareRHH can process a 4x larger graph than the in-core scenario with only an 11.4% performance degradation, and maintains a high graph construction rate of over 27 million requests per second. In the randomized-order scenarios, although DegAwareRHH slows down considerably, it still processes 0.2 million edge insertion requests per second.

Figure 13. Out-of-core unique edge insertion (higher is better, log y-axis).

VII. RELATED WORK

Many dynamic graph data structures have been proposed based on tree data structures [20]. However, a tree data structure takes O(log(n)) time per operation and tends to incur more random accesses than Robin Hood Hashing, a critical issue in large-scale and/or out-of-core graph processing. Furthermore, several studies have been conducted on out-of-core large-scale static graph processing [21] [22] [23] [24]; however, little is known about large-scale and/or out-of-core dynamic graph data stores. Several studies have reported that graph databases are slow and have huge memory footprints when considering large-scale graphs [5] [17] [25]. A key-value store is a popular database model and is designed to easily scale to a very large size [26]; however, since a key-value store does not consider the topology of a graph, its locality properties on graph analysis workloads are poor.

Another approach is to accelerate I/O performance itself: maximizing random I/O performance by using multithreading [27]; effectively utilizing storage bandwidth by performing sequential accesses, as in the Parallel Sliding Windows (PSW) method [22], or by using an edge-centric rather than a vertex-centric computation model [23]; or combining multiple NVRAM devices to improve bandwidth and IOPS without using expensive NVRAM devices [28] [29].

Setting a high threshold for the middle-high-degree table will reduce the cost of inserting edges into the middle-high-degree table, but it causes many hash conflicts in the low-degree table, which carry relatively high overhead for low-degree vertices, as described in Section IV-B2. Accordingly, from the perspective of improving the performance of Robin Hood Hashing itself, especially under many hash conflicts, techniques such as using buckets [30] or multiple hashing, for example Cuckoo hashing [31], may be used.
VIII. CONCLUSION

We implemented DegAwareRHH, a high-performance dynamic graph data store for shared and distributed memory, using our asynchronous visitor queue framework. We demonstrated that DegAwareRHH is 206.5 times faster than STINGER, a state-of-the-art shared-memory graph processing framework, with 24 threads/processes on a single node when constructing a graph from 1 billion edge insertion requests and 54 million edge deletion requests. We confirmed that DegAwareRHH preserves high graph construction performance on all of our graph construction workloads, unique or non-unique insertions and deletions, even as the size of the graph increases, owing to its use of a hash table for its edge-lists, in contrast to STINGER and the Baseline model, which use simple 1D arrays for their edge-lists. DegAwareRHH also achieves a processing rate of over 2 billion edge insertion requests per second at 128 nodes on a large-scale real graph with 128 billion edges.

IX. ACKNOWLEDGMENTS

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-CONF). Funding was partially provided by LDRD 13-ERD-025. Experiments were performed at the Livermore Computing facility.

REFERENCES

[1] A. Yoo, A. Baker, R. Pearce, and V. Henson, "A scalable eigensolver for large scale-free graphs using 2D graph partitioning," in Supercomputing, 2011.

[2] A. Buluç and K. Madduri, "Parallel breadth-first search on distributed memory systems," in Supercomputing.
[3] F. Checconi, F. Petrini, J. Willcock, A. Lumsdaine, A. R. Choudhury, and Y. Sabharwal, "Breaking the speed and scalability barriers for graph exploration on distributed-memory machines," in Supercomputing.
[4] Graph500. [Online]. Available:
[5] A. Ching, S. Edunov, M. Kabiljo, D. Logothetis, and S. Muthukrishnan, "One trillion edges: Graph processing at facebook-scale," Proc. VLDB Endow., vol. 8, no. 12, Aug.
[6] R. Pearce, M. Gokhale, and N. Amato, "Scaling techniques for massive scale-free graphs in distributed (external) memory," in Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on, May 2013.
[7] R. Pearce, M. Gokhale, and N. M. Amato, "Faster parallel traversal of scale free graphs at extreme scale with vertex delegates," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '14. Piscataway, NJ, USA: IEEE Press, 2014.
[8] A.-L. Barabási and R. Albert, "Emergence of scaling in random networks," Science, vol. 286, no. 5439.
[9] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, "PowerGraph: Distributed graph-parallel computation on natural graphs," in Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI '12. Berkeley, CA, USA: USENIX Association, 2012.
[10] P. E. Compeau, P. A. Pevzner, and G. Tesler, "How to apply de Bruijn graphs to genome assembly," Nature Biotechnology, vol. 29, no. 11.
[11] D. R. Zerbino and E. Birney, "Velvet: Algorithms for de novo short read assembly using de Bruijn graphs," Genome Research, vol. 18, no. 5.
[12] L. Nai, Y. Xia, I. G. Tanase, H. Kim, and C.-Y. Lin, "GraphBIG: Understanding graph computing in the context of industrial solutions," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '15. New York, NY, USA: ACM, 2015, pp. 69:1-69:12.
[13] P. Celis, "Robin Hood hashing," Ph.D. dissertation, Waterloo, Ont., Canada.
[14] M. Mitzenmacher, "A new approach to analyzing Robin Hood hashing," CoRR.
[15] L. Devroye, P. Morin, and A. Viola, "On worst-case Robin Hood hashing," SIAM Journal on Computing, vol. 33, no. 4.
[16] D. Ediger, R. McColl, J. Riedy, and D. Bader, "STINGER: High performance data structure for streaming graphs," in High Performance Extreme Computing (HPEC), 2012 IEEE Conference on, Sept 2012.
[17] R. C. McColl, D. Ediger, J. Poovey, D. Campbell, and D. A. Bader, "A performance evaluation of open source graph databases," in Proceedings of the First Workshop on Parallel Programming for Analytics Applications, ser. PPAA '14. New York, NY, USA: ACM, 2014.
[18] D. Chakrabarti, Y. Zhan, and C. Faloutsos, "R-MAT: A recursive model for graph mining," in Proceedings of the Fourth SIAM International Conference on Data Mining. Society for Industrial Mathematics, 2004.
[19] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener, "Graph structure in the web," Comput. Netw., vol. 33, no. 1-6, Jun.
[20] D. D. Sleator and R. E. Tarjan, "A data structure for dynamic trees," in Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing, ser. STOC '81. New York, NY, USA: ACM, 1981.
[21] R. Pearce, M. Gokhale, and N. Amato, "Multithreaded asynchronous graph traversal for in-memory and semi-external memory," in High Performance Computing, Networking, Storage and Analysis (SC), 2010 International Conference for, Nov 2010.
[22] A. Kyrola, G. Blelloch, and C. Guestrin, "GraphChi: Large-scale graph computation on just a PC," in Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI '12. Berkeley, CA, USA: USENIX Association, 2012.
[23] A. Roy, I. Mihailovic, and W. Zwaenepoel, "X-Stream: Edge-centric graph processing using streaming partitions," in Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, ser. SOSP '13. New York, NY, USA: ACM, 2013.
[24] K. Iwabuchi, H. Sato, Y. Yasui, K. Fujisawa, and S. Matsuoka, "NVM-based hybrid BFS with memory efficient data structure," in Big Data (Big Data), 2014 IEEE International Conference on, Oct 2014.
[25] O. Goonetilleke, S. Sathe, T. Sellis, and X. Zhang, "Microblogging queries on graph databases: An introspection," in Proceedings of GRADES '15. New York, NY, USA: ACM, 2015, pp. 5:1-5:6.
[26] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, "Bigtable: A distributed storage system for structured data," ACM Transactions on Computer Systems (TOCS), vol. 26, no. 2, p. 4.
[27] B. Hendrickson and J. Berry, "Graph analysis with high-performance computing," Computing in Science & Engineering, vol. 10, no. 2, March.
[28] D. Zheng, R. Burns, and A. S. Szalay, "Toward millions of file system IOPS on low-cost, commodity hardware," in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, ser. SC '13. New York, NY, USA: ACM, 2013, pp. 69:1-69:12.
[29] K. Sato, K. Mohror, A. Moody, T. Gamblin, B. de Supinski, N. Maruyama, and S. Matsuoka, "A user-level InfiniBand-based file system and checkpoint strategy for burst buffers," in Cluster, Cloud and Grid Computing (CCGrid), IEEE/ACM International Symposium on, May 2014.
[30] A. Viola, "Distributional analysis of Robin Hood linear probing hashing with buckets," in International Conference on Analysis of Algorithms, DMTCS proc. AD, vol. 297, 2005.
[31] R. Pagh and F. F. Rodler, "Cuckoo hashing," Journal of Algorithms, vol. 51, no. 2.


More information

TEFS: A Flash File System for Use on Memory Constrained Devices

TEFS: A Flash File System for Use on Memory Constrained Devices 2016 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE) TEFS: A Flash File for Use on Memory Constrained Devices Wade Penson wpenson@alumni.ubc.ca Scott Fazackerley scott.fazackerley@alumni.ubc.ca

More information

CASS-MT Task #7 - Georgia Tech. GraphCT: A Graph Characterization Toolkit. David A. Bader, David Ediger, Karl Jiang & Jason Riedy

CASS-MT Task #7 - Georgia Tech. GraphCT: A Graph Characterization Toolkit. David A. Bader, David Ediger, Karl Jiang & Jason Riedy CASS-MT Task #7 - Georgia Tech GraphCT: A Graph Characterization Toolkit David A. Bader, David Ediger, Karl Jiang & Jason Riedy October 26, 2009 Outline Motivation What is GraphCT? Package for Massive

More information

Compact Encoding of the Web Graph Exploiting Various Power Laws

Compact Encoding of the Web Graph Exploiting Various Power Laws Compact Encoding of the Web Graph Exploiting Various Power Laws Statistical Reason Behind Link Database Yasuhito Asano, Tsuyoshi Ito 2, Hiroshi Imai 2, Masashi Toyoda 3, and Masaru Kitsuregawa 3 Department

More information

Apache Spark Graph Performance with Memory1. February Page 1 of 13

Apache Spark Graph Performance with Memory1. February Page 1 of 13 Apache Spark Graph Performance with Memory1 February 2017 Page 1 of 13 Abstract Apache Spark is a powerful open source distributed computing platform focused on high speed, large scale data processing

More information

Chapter 5 Hashing. Introduction. Hashing. Hashing Functions. hashing performs basic operations, such as insertion,

Chapter 5 Hashing. Introduction. Hashing. Hashing Functions. hashing performs basic operations, such as insertion, Introduction Chapter 5 Hashing hashing performs basic operations, such as insertion, deletion, and finds in average time 2 Hashing a hash table is merely an of some fixed size hashing converts into locations

More information

ASSIGNMENTS. Progra m Outcom e. Chapter Q. No. Outcom e (CO) I 1 If f(n) = Θ(g(n)) and g(n)= Θ(h(n)), then proof that h(n) = Θ(f(n))

ASSIGNMENTS. Progra m Outcom e. Chapter Q. No. Outcom e (CO) I 1 If f(n) = Θ(g(n)) and g(n)= Θ(h(n)), then proof that h(n) = Θ(f(n)) ASSIGNMENTS Chapter Q. No. Questions Course Outcom e (CO) Progra m Outcom e I 1 If f(n) = Θ(g(n)) and g(n)= Θ(h(n)), then proof that h(n) = Θ(f(n)) 2 3. What is the time complexity of the algorithm? 4

More information

Was ist dran an einer spezialisierten Data Warehousing platform?

Was ist dran an einer spezialisierten Data Warehousing platform? Was ist dran an einer spezialisierten Data Warehousing platform? Hermann Bär Oracle USA Redwood Shores, CA Schlüsselworte Data warehousing, Exadata, specialized hardware proprietary hardware Introduction

More information

Announcements. Reading Material. Recap. Today 9/17/17. Storage (contd. from Lecture 6)

Announcements. Reading Material. Recap. Today 9/17/17. Storage (contd. from Lecture 6) CompSci 16 Intensive Computing Systems Lecture 7 Storage and Index Instructor: Sudeepa Roy Announcements HW1 deadline this week: Due on 09/21 (Thurs), 11: pm, no late days Project proposal deadline: Preliminary

More information

Sparse Matrix Partitioning for Parallel Eigenanalysis of Large Static and Dynamic Graphs

Sparse Matrix Partitioning for Parallel Eigenanalysis of Large Static and Dynamic Graphs Sparse Matrix Partitioning for Parallel Eigenanalysis of Large Static and Dynamic Graphs Michael M. Wolf and Benjamin A. Miller Lincoln Laboratory Massachusetts Institute of Technology Lexington, MA 02420

More information

Extreme-scale Graph Analysis on Blue Waters

Extreme-scale Graph Analysis on Blue Waters Extreme-scale Graph Analysis on Blue Waters 2016 Blue Waters Symposium George M. Slota 1,2, Siva Rajamanickam 1, Kamesh Madduri 2, Karen Devine 1 1 Sandia National Laboratories a 2 The Pennsylvania State

More information

Acolyte: An In-Memory Social Network Query System

Acolyte: An In-Memory Social Network Query System Acolyte: An In-Memory Social Network Query System Ze Tang, Heng Lin, Kaiwei Li, Wentao Han, and Wenguang Chen Department of Computer Science and Technology, Tsinghua University Beijing 100084, China {tangz10,linheng11,lkw10,hwt04}@mails.tsinghua.edu.cn

More information

Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic

Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic WHITE PAPER Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic Western Digital Technologies, Inc. 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Executive

More information

Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin

Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin Accelerating String Matching Algorithms on Multicore Processors Cheng-Hung Lin Department of Electrical Engineering, National Taiwan Normal University, Taipei, Taiwan Abstract String matching is the most

More information

Fast Iterative Graph Computation: A Path Centric Approach

Fast Iterative Graph Computation: A Path Centric Approach Fast Iterative Graph Computation: A Path Centric Approach Pingpeng Yuan*, Wenya Zhang, Changfeng Xie, Hai Jin* Services Computing Technology and System Lab. Cluster and Grid Computing Lab. School of Computer

More information

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti International Journal of Computer Engineering and Applications, ICCSTAR-2016, Special Issue, May.16 MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti 1 Department

More information

AAL 217: DATA STRUCTURES

AAL 217: DATA STRUCTURES Chapter # 4: Hashing AAL 217: DATA STRUCTURES The implementation of hash tables is frequently called hashing. Hashing is a technique used for performing insertions, deletions, and finds in constant average

More information

Hashing for searching

Hashing for searching Hashing for searching Consider searching a database of records on a given key. There are three standard techniques: Searching sequentially start at the first record and look at each record in turn until

More information

Online Optimization of VM Deployment in IaaS Cloud

Online Optimization of VM Deployment in IaaS Cloud Online Optimization of VM Deployment in IaaS Cloud Pei Fan, Zhenbang Chen, Ji Wang School of Computer Science National University of Defense Technology Changsha, 4173, P.R.China {peifan,zbchen}@nudt.edu.cn,

More information

FPGA-based Supercomputing: New Opportunities and Challenges

FPGA-based Supercomputing: New Opportunities and Challenges FPGA-based Supercomputing: New Opportunities and Challenges Naoya Maruyama (RIKEN AICS)* 5 th ADAC Workshop Feb 15, 2018 * Current Main affiliation is Lawrence Livermore National Laboratory SIAM PP18:

More information

Survey Paper on Traditional Hadoop and Pipelined Map Reduce

Survey Paper on Traditional Hadoop and Pipelined Map Reduce International Journal of Computational Engineering Research Vol, 03 Issue, 12 Survey Paper on Traditional Hadoop and Pipelined Map Reduce Dhole Poonam B 1, Gunjal Baisa L 2 1 M.E.ComputerAVCOE, Sangamner,

More information

PuLP: Scalable Multi-Objective Multi-Constraint Partitioning for Small-World Networks

PuLP: Scalable Multi-Objective Multi-Constraint Partitioning for Small-World Networks PuLP: Scalable Multi-Objective Multi-Constraint Partitioning for Small-World Networks George M. Slota 1,2 Kamesh Madduri 2 Sivasankaran Rajamanickam 1 1 Sandia National Laboratories, 2 The Pennsylvania

More information

CMSC 341 Lecture 16/17 Hashing, Parts 1 & 2

CMSC 341 Lecture 16/17 Hashing, Parts 1 & 2 CMSC 341 Lecture 16/17 Hashing, Parts 1 & 2 Prof. John Park Based on slides from previous iterations of this course Today s Topics Overview Uses and motivations of hash tables Major concerns with hash

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2014/15 Lecture II: Indexing Part I of this course Indexing 3 Database File Organization and Indexing Remember: Database tables

More information

Job Startup at Exascale:

Job Startup at Exascale: Job Startup at Exascale: Challenges and Solutions Hari Subramoni The Ohio State University http://nowlab.cse.ohio-state.edu/ Current Trends in HPC Supercomputing systems scaling rapidly Multi-/Many-core

More information

Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries

Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries Oncilla - a Managed GAS Runtime for Accelerating Data Warehousing Queries Jeffrey Young, Alex Merritt, Se Hoon Shon Advisor: Sudhakar Yalamanchili 4/16/13 Sponsors: Intel, NVIDIA, NSF 2 The Problem Big

More information

Job Re-Packing for Enhancing the Performance of Gang Scheduling

Job Re-Packing for Enhancing the Performance of Gang Scheduling Job Re-Packing for Enhancing the Performance of Gang Scheduling B. B. Zhou 1, R. P. Brent 2, C. W. Johnson 3, and D. Walsh 3 1 Computer Sciences Laboratory, Australian National University, Canberra, ACT

More information

Techniques to improve the scalability of Checkpoint-Restart

Techniques to improve the scalability of Checkpoint-Restart Techniques to improve the scalability of Checkpoint-Restart Bogdan Nicolae Exascale Systems Group IBM Research Ireland 1 Outline A few words about the lab and team Challenges of Exascale A case for Checkpoint-Restart

More information

Course Introduction / Review of Fundamentals of Graph Theory

Course Introduction / Review of Fundamentals of Graph Theory Course Introduction / Review of Fundamentals of Graph Theory Hiroki Sayama sayama@binghamton.edu Rise of Network Science (From Barabasi 2010) 2 Network models Many discrete parts involved Classic mean-field

More information

DDSF: A Data Deduplication System Framework for Cloud Environments

DDSF: A Data Deduplication System Framework for Cloud Environments DDSF: A Data Deduplication System Framework for Cloud Environments Jianhua Gu, Chuang Zhang and Wenwei Zhang School of Computer Science and Technology, High Performance Computing R&D Center Northwestern

More information

GraphIn: An Online High Performance Incremental Graph Processing Framework

GraphIn: An Online High Performance Incremental Graph Processing Framework GraphIn: An Online High Performance Incremental Graph Processing Framework Dipanjan Sengupta 1(B), Narayanan Sundaram 2, Xia Zhu 2, Theodore L. Willke 2, Jeffrey Young 1, Matthew Wolf 1, and Karsten Schwan

More information

LH*Algorithm: Scalable Distributed Data Structure (SDDS) and its implementation on Switched Multicomputers

LH*Algorithm: Scalable Distributed Data Structure (SDDS) and its implementation on Switched Multicomputers LH*Algorithm: Scalable Distributed Data Structure (SDDS) and its implementation on Switched Multicomputers Written by: Salman Zubair Toor E-Mail: salman.toor@it.uu.se Teacher: Tore Risch Term paper for

More information

Managing and Mining Billion Node Graphs. Haixun Wang Microsoft Research Asia

Managing and Mining Billion Node Graphs. Haixun Wang Microsoft Research Asia Managing and Mining Billion Node Graphs Haixun Wang Microsoft Research Asia Outline Overview Storage Online query processing Offline graph analytics Advanced applications Is it hard to manage graphs? Good

More information

Workload Characterization using the TAU Performance System

Workload Characterization using the TAU Performance System Workload Characterization using the TAU Performance System Sameer Shende, Allen D. Malony, and Alan Morris Performance Research Laboratory, Department of Computer and Information Science University of

More information

Part II: Data Center Software Architecture: Topic 2: Key-value Data Management Systems. SkimpyStash: Key Value Store on Flash-based Storage

Part II: Data Center Software Architecture: Topic 2: Key-value Data Management Systems. SkimpyStash: Key Value Store on Flash-based Storage ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 2: Key-value Data Management Systems SkimpyStash: Key Value

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 10. Graph databases Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Graph Databases Basic

More information

Complementary Graph Coloring

Complementary Graph Coloring International Journal of Computer (IJC) ISSN 2307-4523 (Print & Online) Global Society of Scientific Research and Researchers http://ijcjournal.org/ Complementary Graph Coloring Mohamed Al-Ibrahim a*,

More information

Page Mapping Scheme to Support Secure File Deletion for NANDbased Block Devices

Page Mapping Scheme to Support Secure File Deletion for NANDbased Block Devices Page Mapping Scheme to Support Secure File Deletion for NANDbased Block Devices Ilhoon Shin Seoul National University of Science & Technology ilhoon.shin@snut.ac.kr Abstract As the amount of digitized

More information

Enosis: Bridging the Semantic Gap between

Enosis: Bridging the Semantic Gap between Enosis: Bridging the Semantic Gap between File-based and Object-based Data Models Anthony Kougkas - akougkas@hawk.iit.edu, Hariharan Devarajan, Xian-He Sun Outline Introduction Background Approach Evaluation

More information

Graph Compression Using Pattern Matching Techniques

Graph Compression Using Pattern Matching Techniques Graph Compression Using Pattern Matching Techniques Rushabh Jitendrakumar Shah Department of Computer Science California State University Dominguez Hills Carson, CA, USA rshah13@toromail.csudh.edu Abstract:

More information

Advanced Database Systems

Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

Optimization solutions for the segmented sum algorithmic function

Optimization solutions for the segmented sum algorithmic function Optimization solutions for the segmented sum algorithmic function ALEXANDRU PÎRJAN Department of Informatics, Statistics and Mathematics Romanian-American University 1B, Expozitiei Blvd., district 1, code

More information

On Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs

On Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs On Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs Sungpack Hong 2, Nicole C. Rodia 1, and Kunle Olukotun 1 1 Pervasive Parallelism Laboratory, Stanford University

More information

IBM Spectrum Scale IO performance

IBM Spectrum Scale IO performance IBM Spectrum Scale 5.0.0 IO performance Silverton Consulting, Inc. StorInt Briefing 2 Introduction High-performance computing (HPC) and scientific computing are in a constant state of transition. Artificial

More information

Fast and Scalable Subgraph Isomorphism using Dynamic Graph Techniques. James Fox

Fast and Scalable Subgraph Isomorphism using Dynamic Graph Techniques. James Fox Fast and Scalable Subgraph Isomorphism using Dynamic Graph Techniques James Fox Collaborators Oded Green, Research Scientist (GT) Euna Kim, PhD student (GT) Federico Busato, PhD student (Universita di

More information

Efficient graph computation on hybrid CPU and GPU systems

Efficient graph computation on hybrid CPU and GPU systems J Supercomput (2015) 71:1563 1586 DOI 10.1007/s11227-015-1378-z Efficient graph computation on hybrid CPU and GPU systems Tao Zhang Jingjie Zhang Wei Shu Min-You Wu Xiaoyao Liang Published online: 21 January

More information

CS220 Database Systems. File Organization

CS220 Database Systems. File Organization CS220 Database Systems File Organization Slides from G. Kollios Boston University and UC Berkeley 1.1 Context Database app Query Optimization and Execution Relational Operators Access Methods Buffer Management

More information

Performance analysis of parallel de novo genome assembly in shared memory system

Performance analysis of parallel de novo genome assembly in shared memory system IOP Conference Series: Earth and Environmental Science PAPER OPEN ACCESS Performance analysis of parallel de novo genome assembly in shared memory system To cite this article: Syam Budi Iryanto et al 2018

More information

High-Performance Key-Value Store on OpenSHMEM

High-Performance Key-Value Store on OpenSHMEM High-Performance Key-Value Store on OpenSHMEM Huansong Fu*, Manjunath Gorentla Venkata, Ahana Roy Choudhury*, Neena Imam, Weikuan Yu* *Florida State University Oak Ridge National Laboratory Outline Background

More information

GPU Sparse Graph Traversal

GPU Sparse Graph Traversal GPU Sparse Graph Traversal Duane Merrill (NVIDIA) Michael Garland (NVIDIA) Andrew Grimshaw (Univ. of Virginia) UNIVERSITY of VIRGINIA Breadth-first search (BFS) 1. Pick a source node 2. Rank every vertex

More information

Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce

Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce Parallelizing Structural Joins to Process Queries over Big XML Data Using MapReduce Huayu Wu Institute for Infocomm Research, A*STAR, Singapore huwu@i2r.a-star.edu.sg Abstract. Processing XML queries over

More information

An MPI failure detector over PMPI 1

An MPI failure detector over PMPI 1 An MPI failure detector over PMPI 1 Donghoon Kim Department of Computer Science, North Carolina State University Raleigh, NC, USA Email : {dkim2}@ncsu.edu Abstract Fault Detectors are valuable services

More information

Modeling and evaluation on Ad hoc query processing with Adaptive Index in Map Reduce Environment

Modeling and evaluation on Ad hoc query processing with Adaptive Index in Map Reduce Environment DEIM Forum 213 F2-1 Adaptive indexing 153 855 4-6-1 E-mail: {okudera,yokoyama,miyuki,kitsure}@tkl.iis.u-tokyo.ac.jp MapReduce MapReduce MapReduce Modeling and evaluation on Ad hoc query processing with

More information

SEDA: An Architecture for Well-Conditioned, Scalable Internet Services

SEDA: An Architecture for Well-Conditioned, Scalable Internet Services SEDA: An Architecture for Well-Conditioned, Scalable Internet Services Matt Welsh, David Culler, and Eric Brewer Computer Science Division University of California, Berkeley Operating Systems Principles

More information

The Role of Database Aware Flash Technologies in Accelerating Mission- Critical Databases

The Role of Database Aware Flash Technologies in Accelerating Mission- Critical Databases The Role of Database Aware Flash Technologies in Accelerating Mission- Critical Databases Gurmeet Goindi Principal Product Manager Oracle Flash Memory Summit 2013 Santa Clara, CA 1 Agenda Relational Database

More information