A Study of Partitioning Policies for Graph Analytics on Large-scale Distributed Platforms


Gurbinder Gill, Roshan Dathathri, Loc Hoang, Keshav Pingali
Department of Computer Science, University of Texas at Austin

ABSTRACT

Distributed-memory clusters are used for in-memory processing of very large graphs with billions of nodes and edges. This requires partitioning the graph among the machines in the cluster. When a graph is partitioned, a node in the graph may be replicated on several machines, and communication is required to keep these replicas synchronized. Good partitioning policies attempt to reduce this synchronization overhead while keeping the computational load balanced across machines. A number of recent studies have looked at ways to control replication of nodes, but these studies are not conclusive because they were performed on small clusters with eight to sixteen machines, did not consider work-efficient data-driven algorithms, or did not optimize communication for the partitioning strategies they studied. This paper presents an experimental study of partitioning strategies for work-efficient graph analytics applications on large KNL and Skylake clusters with up to 256 machines using the Gluon communication runtime, which implements partitioning-specific communication optimizations. Evaluation results show that although simple partitioning strategies like Edge-Cuts perform well on a small number of machines, an alternative partitioning strategy called Cartesian Vertex-Cut (CVC) performs better at scale, even though paradoxically it has a higher replication factor and performs more communication than Edge-Cut partitioning does. Results from communication micro-benchmarks resolve this paradox by showing that communication overhead depends not only on communication volume but also on the communication pattern among the partitions. These experiments suggest that high-performance graph analytics systems should support multiple partitioning strategies, like Gluon does, as no single graph partitioning strategy is best for all cluster sizes. For such systems, a decision tree for selecting a good partitioning strategy based on characteristics of the computation and the cluster is presented.

PVLDB Reference Format: Gurbinder Gill, Roshan Dathathri, Loc Hoang, and Keshav Pingali. A Study of Partitioning Policies for Graph Analytics on Large-scale Distributed Platforms. PVLDB, 12(4), 2018.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment. Proceedings of the VLDB Endowment, Vol. 12, No. 4. ISSN 2150-8097.

1. INTRODUCTION

Graph analytics systems must handle extremely large graphs with billions of nodes and trillions of edges [3]. Graphs of this size do not fit in the main memory of a single machine, so systems like Pregel [32], PowerGraph [18], PowerLyra [12], Gemini [51], and D-Galois [15] use distributed-memory clusters. Since each host in the cluster can address only its own memory, it is necessary to partition the graph among hosts and use a communication layer like MPI to perform data transfers between hosts [7, 8, 11, 12, 18, 23, 41, 42, 44]. Graph partitioning must balance two concerns. The first concern is computational load balance.
For the graph algorithms considered in this paper, computation is performed in rounds: in each round, active nodes in the graph are visited, and a computation is performed by reading and writing the immediate neighbors of the active node [3]. For simple topology-driven graph algorithms in which all nodes are active at the beginning of a round, the computational load of a host is proportional to the numbers of nodes and edges assigned to that host, so by dividing nodes and edges evenly among hosts, it is possible to achieve load balance [12]. However, work-efficient graph algorithms like the ones considered in this paper are data-driven: nodes become active in data-dependent, statically unpredictable ways, so a statically balanced partition of the graph does not necessarily result in computational load balance. The second concern is communication overhead. We consider graph partitioning with replication: when a graph is partitioned, its edges are divided up among the hosts, and if edge (n 1 n 2) is assigned to a host h, proxy nodes are created for n 1 and n 2 on host h and connected by an edge. A given node in the original graph may have proxies on several hosts in the partitioned graph, so updates to proxies must be synchronized during execution using inter-host communication. This communication entirely dominates the execution time of graph analytics applications on large scale clusters as shown by the results in Section 4, so optimizing communication is the key to high performance. Substantial effort has gone into designing graph partitioning strategies that reduce communication overhead. Overlapping communication with computation (as done in HPC applications) would reduce its relative overhead, but there is relatively little computation in graph analytics applications. Therefore, reducing the volume of communication has been the focus of much effort in this area. Since communication is needed to synchronize proxies, reducing the average number of proxies per node (known in the literature as the average replication factor) while also ensuring computational load balance can reduce communication [8, 23, 41, 42, 44]. This is one of the driving principles behind the Vertex-Cut partitioning strategy (Vertex-Cuts and other partitioning strategies are described in

2 detail in Section 2) used in PowerGraph [18] and PowerLyra [12]. Partitioning policies developed for efficient distribution of sparse matrix computation such as 2D block partitioning [7, 11] can also be used since sparse graphs are isomorphic to sparse matrices. Several papers [3,7,12,45,51] have studied how the performance of graph analytics applications changes when different partitioning policies are used, and they have advocated particular partitioning policies based on their results. We believe these studies are not conclusive for the following reasons. Most of the evaluations were done for small graphs on small clusters, so it is not clear whether their conclusions extend to large graphs and large clusters. In some cases, only topology-driven algorithms were evaluated, so it is not clear whether their conclusions extend to work-efficient data-driven algorithms. The distributed graph analytics systems used in the studies optimize communication only for particular partitioning strategies, putting other partitioning strategies at a disadvantage. This paper makes the following contributions: 1. We present the first detailed performance analysis of stateof-the-art, work-efficient graph analytics applications using different graph partitioning strategies including Edge-Cuts, 2D block partitioning strategies, and general Vertex-Cuts on large-scale clusters, including one with 256 machines and roughly 69K threads. These experiments use a system called D-Galois, a distributed-memory version of the Galois system [36] based on the Gluon communication runtime [15]. Gluon performs communication optimizations that are specific to each partitioning strategy. Our results show that although Edge-Cuts perform well on small-scale clusters, a 2D partitioning policy called Cartesian Vertex-Cut () [7] performs the best at scale even though it results in higher replication factors and higher communication volumes than the other partitioning strategies. 2. We present an analytical model for estimating communication volumes for different partitioning strategies, and an empirical study using micro-benchmarks to estimate communication times. These help estimate and understand the performance differences in communication required by the partitioning strategies. In particular, these models explain why at scale, has lower communication overhead even though it performs more communication. 3. We give a simple decision tree that can be used by the user to select a partitioning strategy at runtime given an application and the number of distributed hosts. Although the chosen policy might not be the best in some cases, we show that the application s performance using the chosen policy is almost as good as using the best policy. This paper s contributions include some important lessons for designers of high-performance graph analytics systems. 1. It is desirable to support optimized implementations of multiple partitioning policies including Edge-Cuts and Cartesian Vertex-Cuts, like D-Galois does. Existing systems either support general partitioning, using approaches like gather-applyscatter (PowerGraph, PowerLyra), without optimizing communication for particular partitioning policies like Edge-Cuts, or support only Edge-Cuts (Gemini). Neither approach is flexible enough. 2. 
An important lesson for designers of efficient graph partitioning policies is that the communication overhead in graph analytics applications depends on not only the communication volume but also the communication pattern among the partitions as explained in Section 2. The replication factor and the number of edges/vertices split between partitions [12,18,41, 44] are not adequate proxies for communication overhead. The rest of this paper is organized as follows. Section 2 describes the graph partitioning strategies used in this study, including Edge- Cuts, 2D block partitioning strategies, and general Vertex-Cuts. Section 3 presents an analytical model for estimating the communication volumes required by different partitioning strategies and an empirical study using micro-benchmarks to model communication time. Section 4 presents detailed experimental results using stateof-the-art graph analytics benchmarks including data-driven algorithms for betweenness centrality, breadth-first search, connected components, page-rank, and single-source shortest-path. Section 5 presents a decision tree for choosing a partitioning strategy. Section 6 puts the contributions of this paper in the perspective of related work. Section 7 concludes the paper. 2. PARTITIONING POLICIES The graph partitioning policies considered in this work divide a graph s edges among hosts and creating proxy nodes on each host for the endpoints of the edges assigned to that host. If edge e : (n 1 n 2) is assigned to a host h, h creates proxy nodes on itself for nodes n 1 and n 2, and adds an edge between them. For each node in the graph, one proxy is made the master, and the other proxies are made mirrors. Intuitively, the master holds the canonical value of the node during the computation, and it communicates that value to the mirrors as needed. Each partitioning strategy represents choices made along two dimensions: (i) how edges are partitioned among hosts and (ii) how the master is chosen from the proxies of a given node. To understand these choices, it is useful to consider both the graph-theoretic (i.e., nodes and edges) and the adjacency matrix representation of a graph D Partitions In 1D partitioning, nodes are partitioned among hosts. If a host owns node n, all outgoing (or incoming) edges connected to n are assigned to that host, and the corresponding proxy for node n is made the master. In the graph analytical literature, this partitioning strategy is called outgoing (or incoming) Edge-Cut [23, 41, 42, 51]. In matrix-theoretic terms, this corresponds to assigning rows (or columns) to hosts. 1D partitioning is commonly used in stencil codes in computational science applications; the mirror nodes are often referred to as halo nodes in that context. Stencil codes use topology-driven algorithms on meshes, which are uniform-degree graphs, and computational load balance can be accomplished by assigning roughly equal numbers of nodes to all hosts. For powerlaw graphs, the adjacency matrix is irregular, and ensuring load balance for data-driven graph algorithms through static partitioning is difficult. A number of policies are used in practice. 1. Balanced nodes: Assign roughly equal numbers of nodes to all hosts. 2. Balanced edges: Assign nodes such that all hosts have roughly the same number of edges. 3. Balanced nodes and edges [51]: Assign nodes such that a given linear combination of the number of nodes and edges on a host has roughly the same value on all hosts.
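The last two of these policies can be implemented with a prefix sum of node degrees. As an illustration, the following sketch (a simplification under our own naming, not the partitioner used by any of the systems studied here) builds contiguous node blocks with roughly equal outgoing-edge counts:

```python
# A minimal sketch (not the code of any system evaluated here) of the
# "balanced edges" flavor of 1D partitioning: nodes are split into
# contiguous blocks so that each host receives roughly the same number of
# outgoing edges, using a prefix sum of out-degrees. All names are ours.
from bisect import bisect_left

def balanced_edge_blocks(out_degrees, num_hosts):
    """Return, for each host, the contiguous [start, end) range of nodes it owns."""
    prefix = [0]                                  # prefix[i] = edges with source < i
    for d in out_degrees:
        prefix.append(prefix[-1] + d)
    total_edges = prefix[-1]
    blocks, start = [], 0
    for h in range(1, num_hosts + 1):
        target = h * total_edges / num_hosts      # ideal cumulative edge count
        end = len(out_degrees) if h == num_hosts else bisect_left(prefix, target, lo=start)
        blocks.append((start, end))
        start = end
    return blocks

# Example: a skewed degree sequence split across 4 hosts.
print(balanced_edge_blocks([100, 3, 2, 1, 1, 50, 1, 1, 2, 40], 4))
# -> [(0, 1), (1, 2), (2, 6), (6, 10)]
```

A linear combination of node and edge counts (the third policy) can be handled the same way by replacing each degree with alpha + degree in the prefix sum.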

3 (a) CheckerBoard Vertex-Cut () (b) Cartesian Vertex-Cut () (c) Jagged Vertex-Cut () Figure 1: 2D partitioning policies. Figure 2: Communication patterns in and policies: red arrows are reductions and blue arrows are broadcasts for host 4. The first policy is the simplest and does not require any analysis of the graph. However, it may result in substantial load imbalance for power-law graphs since there is usually a large variation in node degrees. The other two policies require computing the degree of each vertex and the prefix-sums of these degrees to determine how to partition the set of nodes D Block Partitions In 2D block partitioning, the adjacency matrix is blocked along both dimensions, and each host is assigned some of the blocks. Unlike in 1D partitioning, both outgoing and incoming edges of a given node may be distributed among different hosts. In the graph analytical literature, such partitioning strategies are called Vertex- Cuts. 2D block partitioning can be viewed as a restricted form of Vertex-Cuts in which the adjacency matrix is blocked. This paper explores three alternative implementations of 2D block partitioning, illustrated in Figure 1 using a cluster of eight hosts in a 4 2 grid. The descriptions below assume that hosts are organized in a grid of size p r p c. 1. CheckerBoard Vertex-Cuts () [11, 27]: The nodes of the graph are partitioned into equal size blocks and assigned to the hosts. Masters are created on each host for its block of nodes. The matrix is partitioned into contiguous blocks of size N/p r N/p c, and each host is assigned one block of edges as shown in Figure 1(a). This approach is used by CombBLAS [9] for sparse matrix-vector multiplication (SpMV) on graphs. There are several variations to this approach; the one used in this study is shown in Figure 1(a). 2. Cartesian Vertex-Cuts () [7]: Nodes are partitioned among hosts using any 1D block partitioning policy, and masters are created on hosts for the nodes assigned to it. Unlike, these node partitions need not be of the same size, as shown in Figure 1(b). This can happen, for example, if the 1D block partitioning assigns nodes to hosts such that the number of edges is balanced among hosts. The columns are then partitioned into same sized blocks as the rows. Therefore, blocks along the diagonal will be square, but other blocks may be rectangular unlike in. These blocks of edges can be distributed among hosts in different ways. This study uses a block distribution along rows and a cyclic distribution along columns, as shown in Figure 1(b). 3. Jagged Vertex-Cuts () [11]: The first stage of is similar to. However, instead of partitioning columns into same sized blocks as the rows, each block of rows can be partitioned independently into blocks for a more balanced number of edges (non-zeros) in edge blocks. Therefore, blocks will not be aligned among the column dimension. These edge blocks can be assigned to hosts in any fashion, but in this study, the host assignment is kept the same as as shown in Figure 1(c). Although these 2D block partitioning strategies seem similar, they have different communication requirements. Consider a push-style graph algorithm in which active nodes perform computation on their own labels and push values to their immediate outgoing neighbors, where they are reduced to compute labels for the next round. In a distributed-memory setting, a proxy with outgoing edges of a node n on host h push values to its proxy neighbors also present on h. 
These proxies may be masters or mirrors, and it is useful to distinguish two classes of mirrors for a node n: in-mirrors, mirrors that have incoming edges, and out-mirrors, mirrors that have outgoing edges. For a push-style algorithm in the distributed setting, the computation at an active node in the original graph is performed partially at each in-mirror, the intermediate values are reduced to the master, and the final value is broadcast to the out-mirrors. Therefore, the communication pattern can be described as broadcast along the row dimension and reduce along the column dimension of the adjacency matrix. The hosts that participate in broadcast and/or reduce depends on the 2D partitioning policy and the assignment of edge blocks to the hosts. To illustrate this, consider Figure 2 which shows the 4 2 grid of hosts and the reduce and broadcast partners for host 4 in and. For, by walking down the fourth block column

of the matrix in Figure 1(a), we see that incoming edges to the nodes owned by host 4 may be mapped to hosts {1,3,5,7}, so labels of mirrors on these hosts must be communicated to host 4 during the reduce phase. Similarly, the labels of mirrors on host 4 must be communicated to masters on hosts {5,6,7,8}. Once the values have been reduced at masters on host 4, they must be sent to hosts that own out-mirrors for nodes owned by host 4. Walking over the fourth block row of the matrix, we see that only host 3 is involved in this communication. Similarly, the labels of masters on host {3} must be sent to host 4. The same analysis can be done for the CVC and JVC partitionings. For JVC, a given host may need to involve all other hosts in reductions and broadcasts, leading to larger communication requirements than CVC.

2.3 Unrestricted Vertex-Cut Partitions

Unrestricted or general Vertex-Cuts are partitioning strategies that assign edges to hosts without restriction, and they do not correspond to 1D or 2D blocked partitions. The partitioning strategies used in the PowerGraph [18] and PowerLyra [12] systems are examples of this strategy. For example, in PowerLyra's Hybrid Vertex-Cut (HVC), nodes are assigned to hosts in a manner similar to an Edge-Cut. However, high-degree nodes are treated differently from low-degree nodes to avoid assigning all edges connected to a high-degree node to the same host, which creates load imbalance. If (n1, n2) is an edge and n2 has low in-degree (based on some threshold), the edge is assigned to the host that owns n2; otherwise, it is assigned to the host that owns n1.

2.4 Discussion

A small detail in 2D block partitioning is that when the number of processors is not a perfect square, we factorize the number into a product p_x × p_y and assign the larger factor to the row dimension rather than the column dimension. This heuristic reduces the number of broadcast partners because the reduce phase communicates only updated values from proxies to the master, and the number of these updates decreases over iterations, whereas the broadcast phase in each round sends the updated canonical value from the master to all its proxies. In our studies, we observed that for Edge-Cuts and HVC, almost all pairs of hosts have proxies in common for some number of nodes. Therefore, for these partitioning policies, Edge-Cut partitioning performs all-to-all communication while HVC performs two rounds of all-to-all communication (from mirrors to masters followed by masters to mirrors). On the other hand, CVC has p_c parallel all-to-all communications, each among p_r of the P hosts, followed by p_r parallel all-to-all communications, each among p_c of the P hosts. Therefore, each host in CVC sends fewer messages and has fewer communication partners than in Edge-Cuts and HVC, which can help both at the application level (a host need not wait for another host in another row) and at the hardware level (less contention for network resources). The next section explores these differences among partitioning policies quantitatively.

3. PERFORMANCE MODEL FOR COMMUNICATION

This section presents performance models to estimate communication volume and communication time for three partitioning policies: Edge-Cut, Hybrid Vertex-Cut, and Cartesian Vertex-Cut. We formally define the replication factor (Section 3.1), describe a simple analytical model for estimating the communication volume using the replication factor (Section 3.2), and use micro-benchmarks to estimate the communication time from the communication volume (Section 3.3).
3.1 Replication Factor

Given a graph G = (V, E) and a strategy S for partitioning the graph among P hosts, let v be a node in V and let r_{P,S}(v) denote the number of proxies of v created when the graph is partitioned. r_{P,S}(v) is referred to as the replication factor of v, and it is an integer between 1 and P. The average replication factor across all nodes is a number between 1 (no node is replicated) and P (all nodes are replicated on all hosts), and it is given by the following equation:

$$r_{P,S} = \frac{\sum_{v \in V} r_{P,S}(v)}{|V|} \quad (1)$$

3.2 Estimating Communication Volume

For simplicity in the communication volume model, assume that the data flow in the algorithm is from the source to the destination of an edge. A similar analysis can be performed for other cases. Assume that u nodes (u ≤ |V|) are updated in a given round and that the size of the node data (update) is b bytes.

3.2.1 Edge-Cut (EC)

Consider an incoming Edge-Cut (IEC) strategy (1D column partitioning), and let r_{P,E}(v) be the replication factor for a node v. In IEC, the destination of an edge is always a master, so only the master node is updated during computation. At the end of a round, the master node updates its mirrors on other hosts. Since an update requires b bytes, the master node for v in the graph must send (r_{P,E}(v) - 1) · b bytes. If u nodes are updated in the round, the volume of communication is the following:

$$C_{IEC}(G, P) \approx (r_{P,E} - 1) \cdot u \cdot b \quad (2)$$

For an outgoing Edge-Cut (OEC) strategy (1D row partitioning), the source of an edge is always a master, so the mirrors must send the updates to their master, and only the master needs the updated value. The communication volume remains the same as IEC's only if all the mirrors of the u nodes are updated in the round. In practice, only some mirrors of the nodes in u are updated. Let f_E denote the average fraction of mirrors of nodes in u that are updated. The communication volume in the round is the following:

$$C_{OEC}(G, P) \approx (f_E \cdot r_{P,E} - 1) \cdot u \cdot b \quad (3)$$

3.2.2 Hybrid Vertex-Cut (HVC)

In a general Vertex-Cut strategy like Hybrid Vertex-Cut, the mirrors must communicate with their master, and the masters must communicate with their mirrors. If r_{P,H} is the average replication factor, the volume of communication in a round in which an average fraction f_H of mirrors are updated is

$$C_{HVC}(G, P) \approx ((f_H \cdot r_{P,H} - 1) + (r_{P,H} - 1)) \cdot u \cdot b \quad (4)$$

Comparing this with (2) and (3), we see that the volume of communication for OEC and IEC can be much lower than HVC's if they have the same replication factor, and it will be greater than HVC's only if their replication factor is almost twice that of HVC.

3.2.3 Cartesian Vertex-Cut (CVC)

Consider a Cartesian Vertex-Cut with P = p_r × p_c. Unlike HVC, at most p_r mirrors must communicate with their master, and the masters must communicate with at most p_c of their mirrors. Let r_{P,C} be the average replication factor, and let f_C^r and f_C^c be the fractions of mirrors that are active along the rows and columns, respectively. Then, the volume of communication in a round is

$$C_{CVC}(G, P) \approx ((f_C^r \cdot r_{P,C} - 1) + (f_C^c \cdot r_{P,C} - 1)) \cdot u \cdot b \quad (5)$$

Comparing this with (4), we see that the communication volume for CVC can be less than the volume for HVC even with the same replication factor. This illustrates that the replication factor is not the sole determinant of communication volume.
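As a restatement of the model, the short sketch below transcribes Equations (2), (4), and (5) directly into code; the function and parameter names are ours and it is only an illustration of the equations, not part of any system evaluated here.

```python
# Direct transcription of the per-round communication volume model of
# Section 3.2. r = average replication factor, u = nodes updated in the
# round, b = bytes per update, f* = fraction of mirrors that are active.
# Names are ours, not the paper's or D-Galois's.
def vol_iec(r, u, b):
    """Eq. (2): incoming Edge-Cut; masters broadcast updates to their mirrors."""
    return (r - 1) * u * b

def vol_hvc(r, u, b, f=1.0):
    """Eq. (4): Hybrid Vertex-Cut; reduce from mirrors to masters, then broadcast back."""
    return ((f * r - 1) + (r - 1)) * u * b

def vol_cvc(r, u, b, f_row=1.0, f_col=1.0):
    """Eq. (5): Cartesian Vertex-Cut; reduce and broadcast each restricted to one grid dimension."""
    return ((f_row * r - 1) + (f_col * r - 1)) * u * b

# Illustrative values only: same replication factor for all three policies.
r, u, b = 8.0, 1_000_000, 8
print(vol_iec(r, u, b))            # 56,000,000 bytes: one-directional updates
print(vol_hvc(r, u, b))            # 112,000,000 bytes with all mirrors active
print(vol_cvc(r, u, b, 0.5, 0.5))  # 48,000,000 bytes with half the mirrors active
```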

[Figure 3: Communication time (msec) versus communication volume (MB) for different communication patterns on different numbers of (a) KNL hosts and (b) Skylake hosts on Stampede.]

[Figure 4: Percentage of computation rounds of pr and sssp in each communication volume range (≤ 1 MB, > 1 MB and ≤ 100 MB, > 100 MB) for different partitioning policies on 32 and 256 KNL hosts.]

3.3 Micro-benchmarking to Estimate Communication Time

To relate communication volumes under different communication patterns to communication time, we adapted the MVAPICH2 all-to-allv micro-benchmark. For a given total communication volume across all hosts, we simulate the communication patterns of the three strategies (the message sizes differ for different strategies): (1) IEC/OEC: one round of all-to-all communication among all hosts; (2) HVC: two rounds of all-to-all communication among all hosts; (3) CVC: a round of all-to-all communication among all row hosts followed by a round of all-to-all communication among all column hosts. In a given strategy, the same message size is used between every pair of hosts. In practice, the message sizes usually differ, and point-to-point communication (instead of a collective) is typically used to dynamically manage buffers and to parallelize serialization and deserialization of data.

Figures 3a and 3b show the communication time of the different strategies on different numbers of KNL hosts and Skylake hosts, respectively, on the Stampede cluster [43] (described in Section 4) as the total communication volume increases. As expected, the communication time depends not only on the communication volume but also on the communication pattern. The performance difference among the communication patterns grows as the number of hosts increases. While EC, HVC, and CVC perform similarly on 32 KNL hosts, CVC gives a geomean speedup of 4.6× and 5.6× over EC and HVC, respectively, to communicate the same volume on 256 KNL hosts; the speedup is higher for low volume (≤ 1 MB) than for medium volume (> 1 MB and ≤ 100 MB). The communication volume in work-efficient graph analytical applications changes over rounds, and several rounds have low communication volume in practice. For example, Figure 4 shows the percentage of computation rounds of pagerank and sssp with communication volume in the low, medium, and high ranges for different partitioning policies on 32 and 256 KNL hosts of Stampede. In Figure 3, CVC communicates more data in the same amount of time on 256 KNL hosts; e.g., CVC can synchronize 128 to 256 MB of data in the same time that EC and HVC can synchronize 1 MB of data. Similar behavior is also observed on Skylake hosts. Thus, CVC can reduce communication time over EC and HVC for the same communication volume, or it can increase the volume without an increase in time.

3.4 Discussion

The analytical model and micro-benchmark results described in this section show that although communication time depends on the communication volume (which in turn depends on the average replication factor), it is incorrect to conclude from this alone that partitioning strategies with larger average replication factors or communication volumes will perform worse than those with smaller replication factors or communication volumes. In particular, CVC might be expected to perform better at scale because it requires fewer messages and fewer pairs of processors to communicate than other partitioning strategies like EC and HVC do. While the micro-benchmarks used MPI collectives with the same message sizes, most distributed graph analytics systems use MPI or similar interfaces to perform point-to-point communication with variable message sizes. For example, the D-Galois [15] system uses LCI [14] (an alternative to MPI that has been shown to perform better than MPI for graph analytical applications) instead of MPI for sending and receiving point-to-point messages between hosts. Even in such systems, CVC can be expected to perform better at scale than EC and HVC because it needs fewer messages and fewer pairs of processors to communicate. We study this quantitatively using the D-Galois system in the next section.
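To make the message-count argument concrete, the sketch below counts, for one host, the distinct communication partners and messages per synchronization round under the idealized patterns of Sections 2.4 and 3.3; it is a back-of-the-envelope illustration under our own naming and assumed grid shapes, not a measurement from the study.

```python
# Per-host communication partners and messages per synchronization round,
# assuming the idealized patterns of Sections 2.4 and 3.3: EC does one
# all-to-all, HVC does two (reduce then broadcast), and CVC does an
# all-to-all within its grid column followed by one within its grid row.
def per_host_costs(p_r, p_c):
    P = p_r * p_c
    return {                      # (distinct partners, messages sent per round)
        "EC":  (P - 1, P - 1),
        "HVC": (P - 1, 2 * (P - 1)),
        "CVC": ((p_r - 1) + (p_c - 1), (p_r - 1) + (p_c - 1)),
    }

# Grid shapes are assumptions, with the larger factor on the rows (Section 2.4).
for p_r, p_c in [(8, 4), (16, 16)]:
    print(f"{p_r * p_c:3d} hosts: {per_host_costs(p_r, p_c)}")
```

Under these assumptions, at 256 hosts arranged as a 16 × 16 grid, a host under CVC exchanges messages with only 30 other hosts per round instead of 255, which is consistent with the micro-benchmark trends in Figure 3.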

6 Table 1: Inputs and their properties. kron3 wdc12 V 173M 978M 3,563M E 1,791M 42,574M 128,736M E / V Max D out 3.2M 7,447 55,931 Max D in 3.2M 75M 95M Size on disk 136GB 325GB 986GB Table 2: Execution time of Gemini and D-Galois with. 32 hosts Gemini (sec) D-Galois (sec) kron3 bfs cc pr sssp bfs cc pr sssp EXPERIMENTAL RESULTS All experiments were conducted on the Texas Advanced Computing Center s Stampede Cluster (Stampede2) [2, 43]. Stampede2 has 2 distributed clusters: one with Intel Knights Landing (KNL) nodes and another with Intel Skylake nodes. Each KNL cluster host has GHz cores with 4 hardware threads per core, 96 GB RAM, and 16 GB MC-DRAM, which serves as a direct-mapped L3 cache. Each Skylake cluster host has GHz cores on 2 sockets (24 cores per socket) with 2 hardware threads per core and 192 GB RAM. Both clusters use 1Gb/sec Intel Omni-Path (OPA) network. We limit our experiments to 256 hosts on both the clusters and we use 272 threads per host (69632 threads in total) on the KNL cluster and 48 threads per host (12288 threads in total) on the Skylake cluster. All code was compiled using g Unless otherwise stated, all results are presented using the KNL cluster. We used three graphs in our evaluation: synthetically generated randomized power-law graph kron3 (using kron [31] generator with weights of.57,.19,.19, and.5, as suggested by graph5 [1]) and the largest publicly available web-crawl graphs, [5, 6, 38] and wdc12 (Web Data Commons) [33, 34]; we present results for wdc12 only at scale (256 hosts). Properties of these graphs are listed in Table 1. To study the impact of graph partitioning on application execution time, we use five graph analytics applications: betweenness centrality ( bc ), breadth-first search ( bfs ), connected components ( cc ), pagerank ( pr ), and single-source shortest path ( sssp ). We implement topology-driven algorithm for pagerank and data-driven algorithms for the rest. We run bc with only one source. The source nodes for bc, bfs, and sssp are the maximum out-degree node. The directed graph is given as input to bc, bfs, pagerank, and sssp (bc and sssp use a weighted graph), while an undirected graph (we make the directed graph symmetric by adding reverse edges) is given as input to cc. We present the mean execution time of 3 runs (each being maximum across all hosts) excluding graph partitioning time. All algorithms are run until convergence except for pagerank, which is run for up to 1 rounds or iterations. 4.1 Implementation Existing graph analytics systems implement a single partitioning strategy, so they are not suitable for comparative studies of graph partitioning. Moreover, these frameworks do not optimize communication patterns to exploit structure in the partitioning policy. Table 3: Different initial Edge-Cut policies used for different benchmarks, inputs, and partitioning policies: IE: Incoming Edge- Cut, OE: Outgoing Edge-Cut, UE: Undirected Edge-Cut. bc/bfs/sssp cc pr kron3 wdc12 X OE UE IE IE UE IE IE UE IE OE UE IE OE UE IE OE UE IE X OE UE IE OE UE OE OE UE OE OE UE IE OE UE IE OE UE IE X OE UE IE OE UE OE OE UE OE OE UE IE OE UE IE OE UE IE Therefore, we used a system called D-Galois: it uses the sharedmemory Galois system [3, 36] for computing on each host and the Gluon communication runtime [15] for inter-host communication and synchronization. Gluon is an efficient bulk-synchronous communication substrate, which enables existing shared-memory CPU and GPU graph analytics frameworks to run on distributed heterogeneous clusters. 
Gluon is partition-aware and optimizes communication for particular partitioning strategies by exploiting their structural invariants. Instead of naively reducing from mirrors to masters and broadcasting from masters to mirrors during synchronization (as done in gather-apply-scatter model), it avoids redundant reductions or broadcasts by exploiting structural invariants in the partitioning strategy, as described in Section 3. Gluon also exploits the fact that the partitioning of the graph does not change during computation and it uses this temporal invariance to reduce the overhead and volume of communication by optimizing metadata like node IDs that needs to be communicated along with the node values or updates. To ensure that D-Galois is a suitable platform for this study, we compared it with Gemini [51], a state-of-the-art system that uses only Edge-Cuts. Table 2 shows the execution times for Gemini versions of our benchmarks on 32 KNL hosts; wdc12 is omitted because Gemini runs out of memory while partitioning it (even on 256 hosts). Since Gemini supports only Edge-Cuts, we compare it with D-Galois using Edge-Cut (). We see that D-Galois outperforms Gemini. Gemini also does not scale beyond 32 hosts [15]. These results support the claim that D-Galois is a reasonable stateof-the-art platform for performing studies of graph partitioning. For our study, graphs are stored on disk in CSR and CSC formats (size for directed, weighted graphs in Table 1). The synthetic graphs (kron3) are stored after randomizing the vertices (randomized order) while the web-crawl graphs ( and wdc12) are stored in their natural (crawled) vertex order. D-Galois s distributed graph partitioner reads the input graph such that each host only reads a distinct portion of the file on disk once. D-Galois also supports loading partitions directly from disk, and we use this to evaluate XtraPulp [41], the state-of-the-art graph partitioner for large scale-free graphs. XtraPulp generates (using command line parameters -e 1.1 -v 1. -c -d -s and 272 OpenMP threads) partitions for kron3 and, and it runs out of memory for wdc12 (the authors have been in-

7 Table 4: Graph partitioning time (includes time to load and construct graph) and static load balance of edges assigned to hosts on 256 KNL hosts. Partitioning Max-by-mean time (sec) edges kron3 wdc12 bc bc bfs cc pr bfs cc pr sssp sssp X X X OOM OOM OOM OOM OOM OOM OOM OOM OOM OOM formed about this). We write these partitions to disk and load it in D-Galois. We term this the XtraPulp Edge-Cut (X) policy. We implement five partitioning policies in D-Galois: (1) Edge- Cut (), (2) Hybrid Vertex-Cut (), (3) CheckerBoard Vertex- Cut (), (4) Jagged Vertex-Cut (), and (5) Cartesian Vertex- Cut (). Direction of edges traversed during computation is application dependent. bc, bfs, and sssp traverse outgoing edges of a directed graph, pr traverses incoming edges of a directed graph, and cc traverses outgoing edges of a symmetric, directed graph (Gluon handles undirected graphs in this way). Due to this, each application might prefer a different policy: Outgoing Edge-Cut (OE): Nodes of a directed graph are partitioned into contiguous blocks while trying to balance outgoing edges and all outgoing edges of a node are assigned to the same block as the node, like in Gemini [51]. Incoming Edge-Cut (IE): Nodes of a directed graph are partitioned into contiguous blocks while balancing incoming edges and all incoming edges are assigned to the same block. Undirected Edge-Cut (UE): Nodes of a symmetric, directed graph are partitioned into contiguous blocks while trying to balance outgoing edges and all outgoing edges are assigned to the same block. does not communicate during graph partitioning as each process reads its portion of the graph directly from the file. All Vertex- Cut policies use some 2 to further partition the edges of some vertices, which are sent to the respective hosts. Table 3 shows the initial Edge-Cut used by the partitioning policies for each benchmark and input. as described in Section 2.3 is used for graphs with skewed in-degree (e.g., wdc12 and ). For graphs with skewed out-degree (e.g., kron3), if (n 1 n 2) is an edge and n 1 has low out-degree (based on some threshold), the edge is assigned to the host that owns n 1; otherwise, it is assigned to the node that owns n 2.,, and are as described in Section The Vertex-Cut policies could also have used X instead of to further partition the edges of some vertices, but we chose to show that can do well at scale even with a simple Edge-Cut. Table 5: Dynamic load balance: maximum-by-mean computation time on 256 KNL hosts. bc bfs cc pr sssp kron3 wdc12 X X X OOM OOM OOM OOM OOM OOM 1.94 OOM OOM OOM Partitioning Time Although the time to partition graphs is an overhead in distributedmemory graph analytics systems, shared-memory systems must also take time to load the graph from disk and construct it in memory. For example, Galois [36] and other shared-memory systems like Ligra [4] take roughly 2 minutes to load and construct rmat28 (35GB), which is relatively small compared to the graphs used in this study. This time should be used as a point of reference when considering the graph partitioning times shown in Table 4 for different partitioning policies and inputs on 256 KNL hosts. X excludes time to write partitions from XtraPulp to disk, load it in D-Galois, and construct the graph. All other policies include graph loading and construction time. Note that some partitioning policies run out-of-memory (OOM). is a lower bound for partitioning as each host loads its partition of the graph directly from disk in parallel. 
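The degree-threshold rule that HVC applies, as described in Section 2.3 for graphs with skewed in-degree and above for graphs with skewed out-degree, can be summarized by the sketch below; the owner map, degree arrays, threshold value, and function names are illustrative assumptions rather than D-Galois internals.

```python
# Hybrid Vertex-Cut edge assignment. The threshold and all names are
# illustrative; owner_of maps a node to the host owning its master proxy.
def hvc_owner_out_skew(edge, owner_of, out_degree, threshold=100):
    """Rule for skewed out-degree (e.g., kron30-style graphs): edges of
    low out-degree sources stay with the source's host; edges of
    high out-degree sources are spread to their destinations' hosts."""
    n1, n2 = edge
    return owner_of[n1] if out_degree[n1] <= threshold else owner_of[n2]

def hvc_owner_in_skew(edge, owner_of, in_degree, threshold=100):
    """Dual rule (Section 2.3) for graphs with skewed in-degree: edges of
    low in-degree destinations stay with the destination's host."""
    n1, n2 = edge
    return owner_of[n2] if in_degree[n2] <= threshold else owner_of[n1]
```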
Vertex-Cut policies are slower than Edge-Cut policies because they involve communication and more analysis on top of. Comparing partitioning time for different partitioning strategies is not the focus of this work. In particular, graphs can be partitioned once, and different applications can run using these partitions. The main takeaway for partitioning time is that it can finish within a few minutes. partitioning time is better than or similar to that of X. takes around 5 minutes, 1 minutes, and 4 minutes for directed kron3,, and wdc12, respectively. 4.3 Load Balance We first compare the quality of the resulting partitions. The number of edges assigned to a host represents the static load of that host. Table 4 shows the maximum to mean ratio of edges assigned to hosts. We see that static load balance not only varies with different policies but also with the input graphs. For both directed and undirected graphs, kron3 is very well balanced for all policies. For and wdc12, is well balanced while the rest are not. Dynamic load may be quite different from static load, especially for data-driven algorithms in which computation changes over rounds. To measure the dynamic load balance, we measure the computation time of each round on each host and calculate the maximum and mean across hosts. Summing up over rounds, we get the maximum and mean computation time respectively. Table 5 shows the maximum-by-mean computation time for all policies, benchmarks, and inputs on 256 hosts. Note that bc, bfs, and sssp use the same

8 Table 6: Execution time (sec) for different partitioning policies, benchmarks, and inputs on KNL hosts. 32 hosts 256 hosts X X kron3 bc bfs cc pr sssp bc bfs cc OOM OOM pr OOM sssp Table 7: Execution statistics of wdc12 on 256 KNL hosts. bfs Execution cc Time (sec) pr sssp bfs Replication cc Factor pr sssp Total bfs Communication cc Volume (GB) pr sssp Table 8: Execution time (sec) on Skylake hosts for and. 8 hosts 256 hosts bc bfs kron3 cc pr sssp bc bfs cc pr sssp partitions. Firstly, it is clear that although kron3 is statically well balanced for all policies, it is not dynamically load balanced. Moreover, the load balance depends on the dynamic nature of the algorithm. Even though bc and bfs use the same graph, their dynamic load balance is different. on is well-balanced for bfs, but highly imbalanced for bc. The policy significantly impacts dynamic load balance, but this need not directly correlate with their static load balance. For example, for bfs and sssp, although on is statically severely imbalanced while is not, both and are fairly well balanced at runtime. This demonstrates that dynamic load balance is difficult to achieve since it depends on the interplay between policy, algorithm, and graph. 4.4 Execution Time Table 6 shows the execution time of D-Galois with all policies on 32 and 256 KNL hosts for kron3 and. Table 7 shows the execution time for wdc12 on 256 KNL hosts (all policies run out of memory on 32 hosts; X and run out of memory on 256 hosts too). These tables also highlight which policy performs best for a given application, input, and number of hosts (scale). It is clear that the best performing policy is dependent on the application, the input, and the number of hosts (scale). Although there is no clear winner for all cases, performs the best on 256 hosts for almost all applications and inputs. Table 8 shows the execution time of D-Galois with and on 8 and 256 Skylake hosts for kron3 and. Even on these Skylake hosts, the best performing policy depends on the application, the input, and the number of hosts (scale), without any clear winner. Nonetheless, similar to KNL hosts, performs the best on 256 hosts for almost all applications and inputs, whereas performs the best on 8 hosts for almost all applications and inputs. This suggests that the relative merits of the partitioning policies are not specific to the CPU architecture. 4.5 Strong Scaling We measure the time for computation of each round or iteration on each host and take the maximum across hosts. Summing up over iterations, we get the computation time. Figure 5 shows the strong scaling of execution time (left) and computation time (right) for kron3 and (most partitioning policies run out of memory for wdc12 on less than 256 hosts). Some general trends are apparent. At small scale on 32 hosts, performs fairly well for almost all the benchmarks and is comparable to the best, but as we go to higher number of hosts,, X, and do not scale. On the other hand, all 2D partitioning policies scale relatively better for almost all benchmarks and inputs. Among them, scales better than, and scales better than. In most cases, has the best performance at scale and scales best. does not scale well for bfs and sssp on due to computation time not scaling well. In all other cases, both computation time and execution time scale. For bfs and sssp, is still better than the others in execution time since computation times of all policies do not scale. 
This is likely due to the computation being too small for it to scale (both execute more than 18 iterations, so each iteration has little to compute on each host at scale). Computation time of all policies scales similarly in most cases. However, their execution time scaling differs. This is most evident for EC and CVC. This indicates that the difference among the policies arises from the communication. We analyze this next.
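The computation-time and load-balance metrics used in Sections 4.3 and 4.5 can be restated compactly in code; the sketch below is our restatement of that methodology (names are ours), not the measurement harness used for the experiments.

```python
# Aggregating per-round, per-host compute times as described in Sections
# 4.3 and 4.5: computation time is the per-round maximum across hosts,
# summed over rounds; dynamic load balance is the ratio of that sum to the
# summed per-round means. round_times[r][h] = host h's compute time in round r.
def computation_time(round_times):
    return sum(max(per_host) for per_host in round_times)

def max_by_mean(round_times):
    summed_max = sum(max(per_host) for per_host in round_times)
    summed_mean = sum(sum(per_host) / len(per_host) for per_host in round_times)
    return summed_max / summed_mean

# Example with 3 rounds on 4 hosts (seconds).
rounds = [[1.0, 1.2, 0.9, 1.1], [0.5, 2.0, 0.6, 0.7], [0.2, 0.3, 0.2, 0.3]]
print(computation_time(rounds), round(max_by_mean(rounds), 2))  # 3.5 1.56
```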

[Figure 5: Execution time (left) and compute time (right), in seconds, of D-Galois with different partitioning policies for bc, bfs, cc, pr, and sssp on kron3 and clueweb12 as the number of KNL hosts increases.]

[Figure 6: Replication factor (left) and communication volume in GB (right) of D-Galois with different partitioning policies for bc, bfs, cc, pr, and sssp on kron3 and clueweb12 as the number of KNL hosts increases.]

[Figure 7: Breakdown of execution time on 256 hosts into computation and non-overlapping communication: kron3 (left) and clueweb12 (right).]

[Figure 8: Breakdown of execution time on 256 hosts into computation and non-overlapping communication for wdc12 (X and the CheckerBoard Vertex-Cut run out of memory).]

[Table 9: Communication volume estimated by the model vs. observed communication volume on 128 KNL hosts for kron3: replication factor, estimated volume (GB), and observed volume (GB) for bfs, cc, pr, and sssp under IEC, HVC, and CVC.]

4.6 Communication Analysis

The total communication volume is the sum of the number of bytes sent from one host to another during execution. Table 7 shows the replication factor and total communication volume for wdc12 on 256 hosts (X and the CheckerBoard Vertex-Cut run out of memory). Figure 6 shows how the replication factor (left) and the total communication volume (right) scale as we increase the number of hosts for kron3 and clueweb12. The replication factor increases with the number of hosts for all partitioning policies, as expected. Similarly, the total communication volume increases with the number of hosts, as more data needs to be exchanged to synchronize the proxy nodes across hosts. However, the differences in replication factor across policies can vary from the differences in communication volume. This is most evident for kron3: although EC has a much higher replication factor than the other policies, the communication volume of EC is close to that of the others. This demonstrates that the replication factor need not be the sole determinant of the communication volume. For kron3, EC corresponds to IEC, so there are no updates sent from mirrors to masters; only masters send updates to mirrors. Vertex-Cut policies like HVC and CVC, however, send from mirrors to masters and then from masters to mirrors. Thus, even if EC has roughly twice the replication factor of such a Vertex-Cut policy, it would still communicate roughly the same volume. This can be analyzed using our analytical model in Section 3.2. We estimate the communication volume for EC, HVC, and CVC using Equations 2, 4, and 5, respectively. The values to use for the replication factor r and the size of the data b are straightforward to determine. We assume that if a node is updated, all its mirrors are updated, so we use a value of 1 for f. These equations estimate the volume for a given round or iteration. To get the total communication volume, we could sum over all rounds. We instead replace the number of updates u in a round with the total number of updates performed on the graph across all the rounds to estimate the total communication volume. Table 9 presents the estimated communication volume of different applications and policies for kron3 along with the replication factor and the observed communication volume. The estimated volume can be more than the observed volume because we assume f is 1, which is an over-approximation. The observed volume can be higher than the estimated volume because the estimation does not account for metadata like node IDs communicated along with node values or updates, which is an under-approximation. Such approximations are fine because the relative ordering of the communication volume among different partitioning policies is what matters, not the absolute values.

From the estimated volumes for all the benchmarks, we can see that the policy our analytical model predicts to communicate the least volume is not necessarily the one with the lowest replication factor. A similar pattern can be seen in the observed communication volumes, where the policy with the minimum volume does not necessarily have the minimum replication factor. This validates our analytical model: the replication factor is not the sole determinant of communication volume.

In Figure 6, we see that CVC may have a higher replication factor and communication volume than the other policies, yet it performs better than the other policies in most cases. Figures 7 and 8 show the breakdown of execution time into computation time and non-overlapped communication time (the rest of the execution time) on 256 hosts. It is clear that CVC is doing better (except pagerank on kron3) because its communication time is lower. We can see that more communication volume does not always imply more communication time. For example, in pr and sssp on clueweb12, CVC has a higher replication factor and more communication volume than EC and HVC but lower communication time. Figure 4 shows the percentage of rounds in those applications and policies that have low (≤ 1 MB), medium (> 1 MB and ≤ 100 MB), and high (> 100 MB) communication volume. CVC increases the number of high-volume rounds of pr and sssp over EC on both 32 and 256 hosts. Our micro-benchmarking in Section 3.3 shows that CVC yields significant speedups over EC on 256 hosts in both low and high volumes, which outweighs the increase in high-volume rounds. In contrast, the increase in high-volume rounds on 32 hosts for CVC over EC causes a slowdown in communication time, since there is very little difference between CVC and EC at this scale for the same communication volume. This validates the claim that the communication time depends on both the communication volume and the communication pattern, and it shows that CVC has much less communication overhead than other policies at large scale.

[Figure 9: Decision tree to choose a partitioning policy, based on whether the input is partitioned offline (PI = Yes/No), the estimated execution time (T = Long/Short), and the cluster size (CS = #Hosts), with CS > 32 as the threshold when PI == Yes or T == Long.]

5. CHOOSING A PARTITIONING POLICY

Based on the results presented in Section 4, we present a decision tree (Figure 9) to help users choose the most suitable partitioning strategy based on the following parameters:

1. Whether the input is partitioned offline: The time it takes to partition and load the graph (online) depends on the complexity of the partitioning strategy. EC takes the least amount of time, whereas strategies like X [41] and CVC [7] take more time as they involve analysis and communication during partitioning. It makes sense to use complex partitioning strategies if the benefits gained from them outweigh the time spent in partitioning. D-Galois also supports direct loading of offline partitioned graphs, in which case partitioning time is not a factor as the partitioned graph is on disk.

2. Whether the execution time is estimated to be long or short: The amount of time spent in execution of the application plays a vital role in determining if it makes sense to invest time in a good partitioning strategy. Spending more time in partitioning makes more sense for long-running applications, as they involve more rounds of communication which can benefit from a well-chosen partitioning strategy. The user can easily identify whether the application is expected to run for a long time (e.g., by using algorithm complexity). Applications such as bc tend to be more complex as they have multiple phases within each round of computation, and they involve floating-point operations; therefore, they are expected to have longer execution times. On the other hand, applications like bfs are relatively less complex and can be classified as short-running applications.

3. Cluster size: Our results show that the performance of different partitioning strategies also depends on the number of hosts on which the application is run.

Figure 9 illustrates our decision tree to choose a partitioning strategy. For short-running applications, if the graph is not already partitioned, we recommend using simple EC. Additionally, if the graph is already partitioned and the number of hosts is less than 32, simple EC should suffice. For long-running applications, the decision to use EC or CVC primarily depends on the cluster size. If the number of machines is more than 32, it makes sense to invest partitioning time in CVC. Otherwise, EC is recommended.

Choosing the best partitioning strategy is a difficult problem, as it depends on several factors such as the properties of the input graphs, the applications, the number of hosts (scale), etc. Therefore, the decision tree may not always suggest the best partitioning strategy. Tables 10 and 11 illustrate this point by showing the percentage difference in the application execution time between the chosen partitioning strategy and the optimal one.

Table 10: % difference in execution time (excluding partitioning time) on KNL hosts between the partitioning strategy chosen by the decision tree and the optimal one (wdc12 is omitted because the chosen one is always optimal; 0% means that the chosen one is optimal).
kron3: bc 44.74% 0% 0% 0%; bfs 13.33% 13.68% 0% 0%; cc 12.5% 26.63% 0% 0%; pr 9.31% 36.38% 3.63% 2.99%; sssp 21.31% 23.26% 0% 0%.
clueweb12: bc 68.98% 6.59% 21.32% 0%; bfs 0% 37.24% 0% 0%; cc 55.62% 52.16% 51.35% 0%; pr 11.89% 18.69% 26.9% 3.62%; sssp 0% 43.83% 17.33% 0%.

Table 11: % difference in execution time (excluding partitioning time) on Skylake hosts (8 and 256 hosts) between the partitioning strategy chosen by the decision tree and the optimal one.
kron3: bc 21.79% 0%; bfs 0% 0%; cc 0% 0%; pr 0% 0%; sssp 0% 0%.
clueweb12: bc 0% 0%; bfs 0% 0%; cc 12.84% 0%; pr 0% 0%; sssp 0% 11.34%.
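The decision tree of Figure 9 is simple enough to write down directly; the following sketch is our reading of it, with hypothetical function and policy names rather than an interface provided by D-Galois.

```python
# Our reading of the Figure 9 decision tree: CVC is worth its higher
# partitioning cost only for long-running applications (or when a
# partitioned graph is already on disk) and clusters larger than 32 hosts;
# otherwise a simple Edge-Cut suffices. Names are ours, not a D-Galois API.
def choose_policy(num_hosts, long_running, partitioned_offline=False):
    if (long_running or partitioned_offline) and num_hosts > 32:
        return "CVC"   # Cartesian Vertex-Cut
    return "EC"        # Edge-Cut

print(choose_policy(num_hosts=16, long_running=False))    # EC
print(choose_policy(num_hosts=256, long_running=True))    # CVC
print(choose_policy(num_hosts=256, long_running=False,
                    partitioned_offline=True))            # CVC
```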


Algorithm Engineering with PRAM Algorithms Algorithm Engineering with PRAM Algorithms Bernard M.E. Moret moret@cs.unm.edu Department of Computer Science University of New Mexico Albuquerque, NM 87131 Rome School on Alg. Eng. p.1/29 Measuring and

More information

Extreme-scale Graph Analysis on Blue Waters

Extreme-scale Graph Analysis on Blue Waters Extreme-scale Graph Analysis on Blue Waters 2016 Blue Waters Symposium George M. Slota 1,2, Siva Rajamanickam 1, Kamesh Madduri 2, Karen Devine 1 1 Sandia National Laboratories a 2 The Pennsylvania State

More information

arxiv: v1 [cs.dc] 27 Sep 2018

arxiv: v1 [cs.dc] 27 Sep 2018 Performance of MPI sends of non-contiguous data Victor Eijkhout arxiv:19.177v1 [cs.dc] 7 Sep 1 1 Abstract We present an experimental investigation of the performance of MPI derived datatypes. For messages

More information

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS Chapter 6 Indexing Results 6. INTRODUCTION The generation of inverted indexes for text databases is a computationally intensive process that requires the exclusive use of processing resources for long

More information

AN EXPERIMENTAL COMPARISON OF PARTITIONING STRATEGIES IN DISTRIBUTED GRAPH PROCESSING SHIV VERMA THESIS

AN EXPERIMENTAL COMPARISON OF PARTITIONING STRATEGIES IN DISTRIBUTED GRAPH PROCESSING SHIV VERMA THESIS c 2017 Shiv Verma AN EXPERIMENTAL COMPARISON OF PARTITIONING STRATEGIES IN DISTRIBUTED GRAPH PROCESSING BY SHIV VERMA THESIS Submitted in partial fulfillment of the requirements for the degree of Master

More information

GPU ACCELERATED SELF-JOIN FOR THE DISTANCE SIMILARITY METRIC

GPU ACCELERATED SELF-JOIN FOR THE DISTANCE SIMILARITY METRIC GPU ACCELERATED SELF-JOIN FOR THE DISTANCE SIMILARITY METRIC MIKE GOWANLOCK NORTHERN ARIZONA UNIVERSITY SCHOOL OF INFORMATICS, COMPUTING & CYBER SYSTEMS BEN KARSIN UNIVERSITY OF HAWAII AT MANOA DEPARTMENT

More information

Introduction to parallel Computing

Introduction to parallel Computing Introduction to parallel Computing VI-SEEM Training Paschalis Paschalis Korosoglou Korosoglou (pkoro@.gr) (pkoro@.gr) Outline Serial vs Parallel programming Hardware trends Why HPC matters HPC Concepts

More information

Pregel. Ali Shah

Pregel. Ali Shah Pregel Ali Shah s9alshah@stud.uni-saarland.de 2 Outline Introduction Model of Computation Fundamentals of Pregel Program Implementation Applications Experiments Issues with Pregel 3 Outline Costs of Computation

More information

GPU Sparse Graph Traversal. Duane Merrill

GPU Sparse Graph Traversal. Duane Merrill GPU Sparse Graph Traversal Duane Merrill Breadth-first search of graphs (BFS) 1. Pick a source node 2. Rank every vertex by the length of shortest path from source Or label every vertex by its predecessor

More information

CS 475: Parallel Programming Introduction

CS 475: Parallel Programming Introduction CS 475: Parallel Programming Introduction Wim Bohm, Sanjay Rajopadhye Colorado State University Fall 2014 Course Organization n Let s make a tour of the course website. n Main pages Home, front page. Syllabus.

More information

Mosaic: Processing a Trillion-Edge Graph on a Single Machine

Mosaic: Processing a Trillion-Edge Graph on a Single Machine Mosaic: Processing a Trillion-Edge Graph on a Single Machine Steffen Maass, Changwoo Min, Sanidhya Kashyap, Woonhak Kang, Mohan Kumar, Taesoo Kim Georgia Institute of Technology Best Student Paper @ EuroSys

More information

Lecture 4: Principles of Parallel Algorithm Design (part 4)

Lecture 4: Principles of Parallel Algorithm Design (part 4) Lecture 4: Principles of Parallel Algorithm Design (part 4) 1 Mapping Technique for Load Balancing Minimize execution time Reduce overheads of execution Sources of overheads: Inter-process interaction

More information

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs B. Barla Cambazoglu and Cevdet Aykanat Bilkent University, Department of Computer Engineering, 06800, Ankara, Turkey {berkant,aykanat}@cs.bilkent.edu.tr

More information

6. Parallel Volume Rendering Algorithms

6. Parallel Volume Rendering Algorithms 6. Parallel Volume Algorithms This chapter introduces a taxonomy of parallel volume rendering algorithms. In the thesis statement we claim that parallel algorithms may be described by "... how the tasks

More information

Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures

Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures Rolf Rabenseifner rabenseifner@hlrs.de Gerhard Wellein gerhard.wellein@rrze.uni-erlangen.de University of Stuttgart

More information

A Framework for Processing Large Graphs in Shared Memory

A Framework for Processing Large Graphs in Shared Memory A Framework for Processing Large Graphs in Shared Memory Julian Shun Based on joint work with Guy Blelloch and Laxman Dhulipala (Work done at Carnegie Mellon University) 2 What are graphs? Vertex Edge

More information

Efficient and Simplified Parallel Graph Processing over CPU and MIC

Efficient and Simplified Parallel Graph Processing over CPU and MIC Efficient and Simplified Parallel Graph Processing over CPU and MIC Linchuan Chen Xin Huo Bin Ren Surabhi Jain Gagan Agrawal Department of Computer Science and Engineering The Ohio State University Columbus,

More information

Scalable GPU Graph Traversal!

Scalable GPU Graph Traversal! Scalable GPU Graph Traversal Duane Merrill, Michael Garland, and Andrew Grimshaw PPoPP '12 Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming Benwen Zhang

More information

Blocking SEND/RECEIVE

Blocking SEND/RECEIVE Message Passing Blocking SEND/RECEIVE : couple data transfer and synchronization - Sender and receiver rendezvous to exchange data P P SrcP... x : =... SEND(x, DestP)... DestP... RECEIVE(y,SrcP)... M F

More information

GridGraph: Large-Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning. Xiaowei ZHU Tsinghua University

GridGraph: Large-Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning. Xiaowei ZHU Tsinghua University GridGraph: Large-Scale Graph Processing on a Single Machine Using -Level Hierarchical Partitioning Xiaowei ZHU Tsinghua University Widely-Used Graph Processing Shared memory Single-node & in-memory Ligra,

More information

Automatic Scaling Iterative Computations. Aug. 7 th, 2012

Automatic Scaling Iterative Computations. Aug. 7 th, 2012 Automatic Scaling Iterative Computations Guozhang Wang Cornell University Aug. 7 th, 2012 1 What are Non-Iterative Computations? Non-iterative computation flow Directed Acyclic Examples Batch style analytics

More information

A POWER CHARACTERIZATION AND MANAGEMENT OF GPU GRAPH TRAVERSAL

A POWER CHARACTERIZATION AND MANAGEMENT OF GPU GRAPH TRAVERSAL A POWER CHARACTERIZATION AND MANAGEMENT OF GPU GRAPH TRAVERSAL ADAM MCLAUGHLIN *, INDRANI PAUL, JOSEPH GREATHOUSE, SRILATHA MANNE, AND SUDHKAHAR YALAMANCHILI * * GEORGIA INSTITUTE OF TECHNOLOGY AMD RESEARCH

More information

GraphBLAS Mathematics - Provisional Release 1.0 -

GraphBLAS Mathematics - Provisional Release 1.0 - GraphBLAS Mathematics - Provisional Release 1.0 - Jeremy Kepner Generated on April 26, 2017 Contents 1 Introduction: Graphs as Matrices........................... 1 1.1 Adjacency Matrix: Undirected Graphs,

More information

Chapter 5: Analytical Modelling of Parallel Programs

Chapter 5: Analytical Modelling of Parallel Programs Chapter 5: Analytical Modelling of Parallel Programs Introduction to Parallel Computing, Second Edition By Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar Contents 1. Sources of Overhead in Parallel

More information

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION 6.1 INTRODUCTION Fuzzy logic based computational techniques are becoming increasingly important in the medical image analysis arena. The significant

More information

Dr. Amotz Bar-Noy s Compendium of Algorithms Problems. Problems, Hints, and Solutions

Dr. Amotz Bar-Noy s Compendium of Algorithms Problems. Problems, Hints, and Solutions Dr. Amotz Bar-Noy s Compendium of Algorithms Problems Problems, Hints, and Solutions Chapter 1 Searching and Sorting Problems 1 1.1 Array with One Missing 1.1.1 Problem Let A = A[1],..., A[n] be an array

More information

Bipartite Perfect Matching in O(n log n) Randomized Time. Nikhil Bhargava and Elliot Marx

Bipartite Perfect Matching in O(n log n) Randomized Time. Nikhil Bhargava and Elliot Marx Bipartite Perfect Matching in O(n log n) Randomized Time Nikhil Bhargava and Elliot Marx Background Matching in bipartite graphs is a problem that has many distinct applications. Many problems can be reduced

More information

A New Parallel Algorithm for Connected Components in Dynamic Graphs. Robert McColl Oded Green David Bader

A New Parallel Algorithm for Connected Components in Dynamic Graphs. Robert McColl Oded Green David Bader A New Parallel Algorithm for Connected Components in Dynamic Graphs Robert McColl Oded Green David Bader Overview The Problem Target Datasets Prior Work Parent-Neighbor Subgraph Results Conclusions Problem

More information

Review: Creating a Parallel Program. Programming for Performance

Review: Creating a Parallel Program. Programming for Performance Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)

More information

Parallel Performance Studies for a Clustering Algorithm

Parallel Performance Studies for a Clustering Algorithm Parallel Performance Studies for a Clustering Algorithm Robin V. Blasberg and Matthias K. Gobbert Naval Research Laboratory, Washington, D.C. Department of Mathematics and Statistics, University of Maryland,

More information

CS/ECE 566 Lab 1 Vitaly Parnas Fall 2015 October 9, 2015

CS/ECE 566 Lab 1 Vitaly Parnas Fall 2015 October 9, 2015 CS/ECE 566 Lab 1 Vitaly Parnas Fall 2015 October 9, 2015 Contents 1 Overview 3 2 Formulations 4 2.1 Scatter-Reduce....................................... 4 2.2 Scatter-Gather.......................................

More information

Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU

Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU Lifan Xu Wei Wang Marco A. Alvarez John Cavazos Dongping Zhang Department of Computer and Information Science University of Delaware

More information

CSC630/CSC730 Parallel & Distributed Computing

CSC630/CSC730 Parallel & Distributed Computing CSC630/CSC730 Parallel & Distributed Computing Analytical Modeling of Parallel Programs Chapter 5 1 Contents Sources of Parallel Overhead Performance Metrics Granularity and Data Mapping Scalability 2

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

Architectural Differences nc. DRAM devices are accessed with a multiplexed address scheme. Each unit of data is accessed by first selecting its row ad

Architectural Differences nc. DRAM devices are accessed with a multiplexed address scheme. Each unit of data is accessed by first selecting its row ad nc. Application Note AN1801 Rev. 0.2, 11/2003 Performance Differences between MPC8240 and the Tsi106 Host Bridge Top Changwatchai Roy Jenevein risc10@email.sps.mot.com CPD Applications This paper discusses

More information

A Distributed Multi-GPU System for Fast Graph Processing

A Distributed Multi-GPU System for Fast Graph Processing A Distributed Multi-GPU System for Fast Graph Processing Zhihao Jia Stanford University zhihao@cs.stanford.edu Pat M c Cormick LANL pat@lanl.gov Yongkee Kwon UT Austin yongkee.kwon@utexas.edu Mattan Erez

More information

Parallel Programming Concepts. Parallel Algorithms. Peter Tröger

Parallel Programming Concepts. Parallel Algorithms. Peter Tröger Parallel Programming Concepts Parallel Algorithms Peter Tröger Sources: Ian Foster. Designing and Building Parallel Programs. Addison-Wesley. 1995. Mattson, Timothy G.; S, Beverly A.; ers,; Massingill,

More information

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters 1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk

More information

Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory

Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory Roshan Dathathri Thejas Ramashekar Chandan Reddy Uday Bondhugula Department of Computer Science and Automation

More information

A new edge selection heuristic for computing the Tutte polynomial of an undirected graph.

A new edge selection heuristic for computing the Tutte polynomial of an undirected graph. FPSAC 2012, Nagoya, Japan DMTCS proc. (subm.), by the authors, 1 12 A new edge selection heuristic for computing the Tutte polynomial of an undirected graph. Michael Monagan 1 1 Department of Mathematics,

More information

Determining the Number of CPUs for Query Processing

Determining the Number of CPUs for Query Processing Determining the Number of CPUs for Query Processing Fatemah Panahi Elizabeth Soechting CS747 Advanced Computer Systems Analysis Techniques The University of Wisconsin-Madison fatemeh@cs.wisc.edu, eas@cs.wisc.edu

More information

High Performance Computing

High Performance Computing The Need for Parallelism High Performance Computing David McCaughan, HPC Analyst SHARCNET, University of Guelph dbm@sharcnet.ca Scientific investigation traditionally takes two forms theoretical empirical

More information

What is Parallel Computing?

What is Parallel Computing? What is Parallel Computing? Parallel Computing is several processing elements working simultaneously to solve a problem faster. 1/33 What is Parallel Computing? Parallel Computing is several processing

More information

GPU Sparse Graph Traversal

GPU Sparse Graph Traversal GPU Sparse Graph Traversal Duane Merrill (NVIDIA) Michael Garland (NVIDIA) Andrew Grimshaw (Univ. of Virginia) UNIVERSITY of VIRGINIA Breadth-first search (BFS) 1. Pick a source node 2. Rank every vertex

More information

An Algorithmic Approach to Communication Reduction in Parallel Graph Algorithms

An Algorithmic Approach to Communication Reduction in Parallel Graph Algorithms An Algorithmic Approach to Communication Reduction in Parallel Graph Algorithms Harshvardhan, Adam Fidel, Nancy M. Amato, Lawrence Rauchwerger Parasol Laboratory Dept. of Computer Science and Engineering

More information

Chapter 4: Implicit Error Detection

Chapter 4: Implicit Error Detection 4. Chpter 5 Chapter 4: Implicit Error Detection Contents 4.1 Introduction... 4-2 4.2 Network error correction... 4-2 4.3 Implicit error detection... 4-3 4.4 Mathematical model... 4-6 4.5 Simulation setup

More information

Designing High-Performance MPI Collectives in MVAPICH2 for HPC and Deep Learning

Designing High-Performance MPI Collectives in MVAPICH2 for HPC and Deep Learning 5th ANNUAL WORKSHOP 209 Designing High-Performance MPI Collectives in MVAPICH2 for HPC and Deep Learning Hari Subramoni Dhabaleswar K. (DK) Panda The Ohio State University The Ohio State University E-mail:

More information

Parallel Poisson Solver in Fortran

Parallel Poisson Solver in Fortran Parallel Poisson Solver in Fortran Nilas Mandrup Hansen, Ask Hjorth Larsen January 19, 1 1 Introduction In this assignment the D Poisson problem (Eq.1) is to be solved in either C/C++ or FORTRAN, first

More information

Efficient graph computation on hybrid CPU and GPU systems

Efficient graph computation on hybrid CPU and GPU systems J Supercomput (2015) 71:1563 1586 DOI 10.1007/s11227-015-1378-z Efficient graph computation on hybrid CPU and GPU systems Tao Zhang Jingjie Zhang Wei Shu Min-You Wu Xiaoyao Liang Published online: 21 January

More information

Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206

Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206 Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206 1 Today Characteristics of Tasks and Interactions (3.3). Mapping Techniques for Load Balancing (3.4). Methods for Containing Interaction

More information

Design of Parallel Algorithms. Models of Parallel Computation

Design of Parallel Algorithms. Models of Parallel Computation + Design of Parallel Algorithms Models of Parallel Computation + Chapter Overview: Algorithms and Concurrency n Introduction to Parallel Algorithms n Tasks and Decomposition n Processes and Mapping n Processes

More information

Lecture 9: Group Communication Operations. Shantanu Dutt ECE Dept. UIC

Lecture 9: Group Communication Operations. Shantanu Dutt ECE Dept. UIC Lecture 9: Group Communication Operations Shantanu Dutt ECE Dept. UIC Acknowledgement Adapted from Chapter 4 slides of the text, by A. Grama w/ a few changes, augmentations and corrections Topic Overview

More information

I/O CANNOT BE IGNORED

I/O CANNOT BE IGNORED LECTURE 13 I/O I/O CANNOT BE IGNORED Assume a program requires 100 seconds, 90 seconds for main memory, 10 seconds for I/O. Assume main memory access improves by ~10% per year and I/O remains the same.

More information

Graphs. The ultimate data structure. graphs 1

Graphs. The ultimate data structure. graphs 1 Graphs The ultimate data structure graphs 1 Definition of graph Non-linear data structure consisting of nodes & links between them (like trees in this sense) Unlike trees, graph nodes may be completely

More information

Chapter 18: Parallel Databases

Chapter 18: Parallel Databases Chapter 18: Parallel Databases Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 18: Parallel Databases Introduction I/O Parallelism Interquery Parallelism Intraquery

More information

Chapter 18: Parallel Databases. Chapter 18: Parallel Databases. Parallelism in Databases. Introduction

Chapter 18: Parallel Databases. Chapter 18: Parallel Databases. Parallelism in Databases. Introduction Chapter 18: Parallel Databases Chapter 18: Parallel Databases Introduction I/O Parallelism Interquery Parallelism Intraquery Parallelism Intraoperation Parallelism Interoperation Parallelism Design of

More information

Enabling Loop Parallelization with Decoupled Software Pipelining in LLVM: Final Report

Enabling Loop Parallelization with Decoupled Software Pipelining in LLVM: Final Report Enabling Loop Parallelization with Decoupled Software Pipelining in LLVM: Final Report Ameya Velingker and Dougal J. Sutherland {avelingk, dsutherl}@cs.cmu.edu http://www.cs.cmu.edu/~avelingk/compilers/

More information

An Experimental Comparison of Partitioning Strategies in Distributed Graph Processing

An Experimental Comparison of Partitioning Strategies in Distributed Graph Processing An Experimental Comparison of Partitioning Strategies in Distributed Graph Processing ABSTRACT Shiv Verma 1, Luke M. Leslie 1, Yosub Shin, Indranil Gupta 1 1 University of Illinois at Urbana-Champaign,

More information

LA3: A Scalable Link- and Locality-Aware Linear Algebra-Based Graph Analytics System

LA3: A Scalable Link- and Locality-Aware Linear Algebra-Based Graph Analytics System LA3: A Scalable Link- and Locality-Aware Linear Algebra-Based Graph Analytics System Yousuf Ahmad, Omar Khattab, Arsal Malik, Ahmad Musleh, Mohammad Hammoud Carnegie Mellon University in Qatar {myahmad,

More information

Current and Future Challenges of the Tofu Interconnect for Emerging Applications

Current and Future Challenges of the Tofu Interconnect for Emerging Applications Current and Future Challenges of the Tofu Interconnect for Emerging Applications Yuichiro Ajima Senior Architect Next Generation Technical Computing Unit Fujitsu Limited June 22, 2017, ExaComm 2017 Workshop

More information

Parallel Programming in the Age of Ubiquitous Parallelism. Keshav Pingali The University of Texas at Austin

Parallel Programming in the Age of Ubiquitous Parallelism. Keshav Pingali The University of Texas at Austin Parallel Programming in the Age of Ubiquitous Parallelism Keshav Pingali The University of Texas at Austin Parallelism is everywhere Texas Advanced Computing Center Laptops Cell-phones Parallel programming?

More information

Deterministic Parallel Graph Coloring with Symmetry Breaking

Deterministic Parallel Graph Coloring with Symmetry Breaking Deterministic Parallel Graph Coloring with Symmetry Breaking Per Normann, Johan Öfverstedt October 05 Abstract In this paper we propose a deterministic parallel graph coloring algorithm that enables Multi-Coloring

More information

TELCOM2125: Network Science and Analysis

TELCOM2125: Network Science and Analysis School of Information Sciences University of Pittsburgh TELCOM2125: Network Science and Analysis Konstantinos Pelechrinis Spring 2015 2 Part 4: Dividing Networks into Clusters The problem l Graph partitioning

More information

! Parallel machines are becoming quite common and affordable. ! Databases are growing increasingly large

! Parallel machines are becoming quite common and affordable. ! Databases are growing increasingly large Chapter 20: Parallel Databases Introduction! Introduction! I/O Parallelism! Interquery Parallelism! Intraquery Parallelism! Intraoperation Parallelism! Interoperation Parallelism! Design of Parallel Systems!

More information

Chapter 20: Parallel Databases

Chapter 20: Parallel Databases Chapter 20: Parallel Databases! Introduction! I/O Parallelism! Interquery Parallelism! Intraquery Parallelism! Intraoperation Parallelism! Interoperation Parallelism! Design of Parallel Systems 20.1 Introduction!

More information

Chapter 20: Parallel Databases. Introduction

Chapter 20: Parallel Databases. Introduction Chapter 20: Parallel Databases! Introduction! I/O Parallelism! Interquery Parallelism! Intraquery Parallelism! Intraoperation Parallelism! Interoperation Parallelism! Design of Parallel Systems 20.1 Introduction!

More information

Scalable Algorithmic Techniques Decompositions & Mapping. Alexandre David

Scalable Algorithmic Techniques Decompositions & Mapping. Alexandre David Scalable Algorithmic Techniques Decompositions & Mapping Alexandre David 1.2.05 adavid@cs.aau.dk Introduction Focus on data parallelism, scale with size. Task parallelism limited. Notion of scalability

More information

Some aspects of parallel program design. R. Bader (LRZ) G. Hager (RRZE)

Some aspects of parallel program design. R. Bader (LRZ) G. Hager (RRZE) Some aspects of parallel program design R. Bader (LRZ) G. Hager (RRZE) Finding exploitable concurrency Problem analysis 1. Decompose into subproblems perhaps even hierarchy of subproblems that can simultaneously

More information

Computation and Communication Efficient Graph Processing with Distributed Immutable View

Computation and Communication Efficient Graph Processing with Distributed Immutable View Computation and Communication Efficient Graph Processing with Distributed Immutable View Rong Chen, Xin Ding, Peng Wang, Haibo Chen, Binyu Zang, Haibing Guan Shanghai Key Laboratory of Scalable Computing

More information

CSE 599 I Accelerated Computing - Programming GPUS. Parallel Patterns: Graph Search

CSE 599 I Accelerated Computing - Programming GPUS. Parallel Patterns: Graph Search CSE 599 I Accelerated Computing - Programming GPUS Parallel Patterns: Graph Search Objective Study graph search as a prototypical graph-based algorithm Learn techniques to mitigate the memory-bandwidth-centric

More information

A TIMING AND SCALABILITY ANALYSIS OF THE PARALLEL PERFORMANCE OF CMAQ v4.5 ON A BEOWULF LINUX CLUSTER

A TIMING AND SCALABILITY ANALYSIS OF THE PARALLEL PERFORMANCE OF CMAQ v4.5 ON A BEOWULF LINUX CLUSTER A TIMING AND SCALABILITY ANALYSIS OF THE PARALLEL PERFORMANCE OF CMAQ v4.5 ON A BEOWULF LINUX CLUSTER Shaheen R. Tonse* Lawrence Berkeley National Lab., Berkeley, CA, USA 1. INTRODUCTION The goal of this

More information

Distributed minimum spanning tree problem

Distributed minimum spanning tree problem Distributed minimum spanning tree problem Juho-Kustaa Kangas 24th November 2012 Abstract Given a connected weighted undirected graph, the minimum spanning tree problem asks for a spanning subtree with

More information

Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research

Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research Joel Hestness jthestness@uwalumni.com Lenni Kuff lskuff@uwalumni.com Computer Science Department University of

More information

Homework # 2 Due: October 6. Programming Multiprocessors: Parallelism, Communication, and Synchronization

Homework # 2 Due: October 6. Programming Multiprocessors: Parallelism, Communication, and Synchronization ECE669: Parallel Computer Architecture Fall 2 Handout #2 Homework # 2 Due: October 6 Programming Multiprocessors: Parallelism, Communication, and Synchronization 1 Introduction When developing multiprocessor

More information

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004 A Study of High Performance Computing and the Cray SV1 Supercomputer Michael Sullivan TJHSST Class of 2004 June 2004 0.1 Introduction A supercomputer is a device for turning compute-bound problems into

More information

Irregular Graph Algorithms on Parallel Processing Systems

Irregular Graph Algorithms on Parallel Processing Systems Irregular Graph Algorithms on Parallel Processing Systems George M. Slota 1,2 Kamesh Madduri 1 (advisor) Sivasankaran Rajamanickam 2 (Sandia mentor) 1 Penn State University, 2 Sandia National Laboratories

More information

Scalability of Heterogeneous Computing

Scalability of Heterogeneous Computing Scalability of Heterogeneous Computing Xian-He Sun, Yong Chen, Ming u Department of Computer Science Illinois Institute of Technology {sun, chenyon1, wuming}@iit.edu Abstract Scalability is a key factor

More information

CuSha: Vertex-Centric Graph Processing on GPUs

CuSha: Vertex-Centric Graph Processing on GPUs CuSha: Vertex-Centric Graph Processing on GPUs Farzad Khorasani, Keval Vora, Rajiv Gupta, Laxmi N. Bhuyan HPDC Vancouver, Canada June, Motivation Graph processing Real world graphs are large & sparse Power

More information