A Study of Partitioning Policies for Graph Analytics on Large-scale Distributed Platforms


Gurbinder Gill, Roshan Dathathri, Loc Hoang, Keshav Pingali
Department of Computer Science, University of Texas at Austin

ABSTRACT

Distributed-memory clusters are used for in-memory processing of very large graphs with billions of nodes and edges. This requires partitioning the graph among the machines in the cluster. When a graph is partitioned, a node in the graph may be replicated on several machines, and communication is required to keep these replicas synchronized. Good partitioning policies attempt to reduce this synchronization overhead while keeping the computational load balanced across machines. A number of recent studies have looked at ways to control replication of nodes, but these studies are not conclusive because they were performed on small clusters with eight to sixteen machines, did not consider work-efficient data-driven algorithms, or did not optimize communication for the partitioning strategies they studied. This paper presents an experimental study of partitioning strategies for work-efficient graph analytics applications on large KNL and Skylake clusters with up to 256 machines using the Gluon communication runtime, which implements partitioning-specific communication optimizations. Evaluation results show that although simple partitioning strategies like Edge-Cuts perform well on a small number of machines, an alternative partitioning strategy called Cartesian Vertex-Cut (CVC) performs better at scale, even though paradoxically it has a higher replication factor and performs more communication than Edge-Cut partitioning does. Results from communication micro-benchmarks resolve this paradox by showing that communication overhead depends not only on communication volume but also on the communication pattern among the partitions. These experiments suggest that high-performance graph analytics systems should support multiple partitioning strategies, like Gluon does, as no single graph partitioning strategy is best for all cluster sizes. For such systems, a decision tree for selecting a good partitioning strategy based on characteristics of the computation and the cluster is presented.

PVLDB Reference Format: Gurbinder Gill, Roshan Dathathri, Loc Hoang, and Keshav Pingali. A Study of Partitioning Policies for Graph Analytics on Large-scale Distributed Platforms. PVLDB, 12(4), 2018.

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment. Proceedings of the VLDB Endowment, Vol. 12, No. 4. ISSN 2150-8097.

1. INTRODUCTION

Graph analytics systems must handle extremely large graphs with billions of nodes and trillions of edges [3]. Graphs of this size do not fit in the main memory of a single machine, so systems like Pregel [32], PowerGraph [18], PowerLyra [12], Gemini [51], and D-Galois [15] use distributed-memory clusters. Since each host in the cluster can address only its own memory, it is necessary to partition the graph among hosts and use a communication layer like MPI to perform data transfers between hosts [7, 8, 11, 12, 18, 23, 41, 42, 44]. Graph partitioning must balance two concerns. The first concern is computational load balance.
For the graph algorithms considered in this paper, computation is performed in rounds: in each round, active nodes in the graph are visited, and a computation is performed by reading and writing the immediate neighbors of the active node [3]. For simple topology-driven graph algorithms in which all nodes are active at the beginning of a round, the computational load of a host is proportional to the numbers of nodes and edges assigned to that host, so by dividing nodes and edges evenly among hosts, it is possible to achieve load balance [12]. However, work-efficient graph algorithms like the ones considered in this paper are data-driven: nodes become active in data-dependent, statically unpredictable ways, so a statically balanced partition of the graph does not necessarily result in computational load balance. The second concern is communication overhead. We consider graph partitioning with replication: when a graph is partitioned, its edges are divided up among the hosts, and if edge (n 1 n 2) is assigned to a host h, proxy nodes are created for n 1 and n 2 on host h and connected by an edge. A given node in the original graph may have proxies on several hosts in the partitioned graph, so updates to proxies must be synchronized during execution using inter-host communication. This communication entirely dominates the execution time of graph analytics applications on large scale clusters as shown by the results in Section 4, so optimizing communication is the key to high performance. Substantial effort has gone into designing graph partitioning strategies that reduce communication overhead. Overlapping communication with computation (as done in HPC applications) would reduce its relative overhead, but there is relatively little computation in graph analytics applications. Therefore, reducing the volume of communication has been the focus of much effort in this area. Since communication is needed to synchronize proxies, reducing the average number of proxies per node (known in the literature as the average replication factor) while also ensuring computational load balance can reduce communication [8, 23, 41, 42, 44]. This is one of the driving principles behind the Vertex-Cut partitioning strategy (Vertex-Cuts and other partitioning strategies are described in

2 detail in Section 2) used in PowerGraph [18] and PowerLyra [12]. Partitioning policies developed for efficient distribution of sparse matrix computation such as 2D block partitioning [7, 11] can also be used since sparse graphs are isomorphic to sparse matrices. Several papers [3,7,12,45,51] have studied how the performance of graph analytics applications changes when different partitioning policies are used, and they have advocated particular partitioning policies based on their results. We believe these studies are not conclusive for the following reasons. Most of the evaluations were done for small graphs on small clusters, so it is not clear whether their conclusions extend to large graphs and large clusters. In some cases, only topology-driven algorithms were evaluated, so it is not clear whether their conclusions extend to work-efficient data-driven algorithms. The distributed graph analytics systems used in the studies optimize communication only for particular partitioning strategies, putting other partitioning strategies at a disadvantage. This paper makes the following contributions: 1. We present the first detailed performance analysis of stateof-the-art, work-efficient graph analytics applications using different graph partitioning strategies including Edge-Cuts, 2D block partitioning strategies, and general Vertex-Cuts on large-scale clusters, including one with 256 machines and roughly 69K threads. These experiments use a system called D-Galois, a distributed-memory version of the Galois system [36] based on the Gluon communication runtime [15]. Gluon performs communication optimizations that are specific to each partitioning strategy. Our results show that although Edge-Cuts perform well on small-scale clusters, a 2D partitioning policy called Cartesian Vertex-Cut () [7] performs the best at scale even though it results in higher replication factors and higher communication volumes than the other partitioning strategies. 2. We present an analytical model for estimating communication volumes for different partitioning strategies, and an empirical study using micro-benchmarks to estimate communication times. These help estimate and understand the performance differences in communication required by the partitioning strategies. In particular, these models explain why at scale, has lower communication overhead even though it performs more communication. 3. We give a simple decision tree that can be used by the user to select a partitioning strategy at runtime given an application and the number of distributed hosts. Although the chosen policy might not be the best in some cases, we show that the application s performance using the chosen policy is almost as good as using the best policy. This paper s contributions include some important lessons for designers of high-performance graph analytics systems. 1. It is desirable to support optimized implementations of multiple partitioning policies including Edge-Cuts and Cartesian Vertex-Cuts, like D-Galois does. Existing systems either support general partitioning, using approaches like gather-applyscatter (PowerGraph, PowerLyra), without optimizing communication for particular partitioning policies like Edge-Cuts, or support only Edge-Cuts (Gemini). Neither approach is flexible enough. 2. 
An important lesson for designers of efficient graph partitioning policies is that the communication overhead in graph analytics applications depends on not only the communication volume but also the communication pattern among the partitions as explained in Section 2. The replication factor and the number of edges/vertices split between partitions [12,18,41, 44] are not adequate proxies for communication overhead. The rest of this paper is organized as follows. Section 2 describes the graph partitioning strategies used in this study, including Edge- Cuts, 2D block partitioning strategies, and general Vertex-Cuts. Section 3 presents an analytical model for estimating the communication volumes required by different partitioning strategies and an empirical study using micro-benchmarks to model communication time. Section 4 presents detailed experimental results using stateof-the-art graph analytics benchmarks including data-driven algorithms for betweenness centrality, breadth-first search, connected components, page-rank, and single-source shortest-path. Section 5 presents a decision tree for choosing a partitioning strategy. Section 6 puts the contributions of this paper in the perspective of related work. Section 7 concludes the paper. 2. PARTITIONING POLICIES The graph partitioning policies considered in this work divide a graph s edges among hosts and creating proxy nodes on each host for the endpoints of the edges assigned to that host. If edge e : (n 1 n 2) is assigned to a host h, h creates proxy nodes on itself for nodes n 1 and n 2, and adds an edge between them. For each node in the graph, one proxy is made the master, and the other proxies are made mirrors. Intuitively, the master holds the canonical value of the node during the computation, and it communicates that value to the mirrors as needed. Each partitioning strategy represents choices made along two dimensions: (i) how edges are partitioned among hosts and (ii) how the master is chosen from the proxies of a given node. To understand these choices, it is useful to consider both the graph-theoretic (i.e., nodes and edges) and the adjacency matrix representation of a graph D Partitions In 1D partitioning, nodes are partitioned among hosts. If a host owns node n, all outgoing (or incoming) edges connected to n are assigned to that host, and the corresponding proxy for node n is made the master. In the graph analytical literature, this partitioning strategy is called outgoing (or incoming) Edge-Cut [23, 41, 42, 51]. In matrix-theoretic terms, this corresponds to assigning rows (or columns) to hosts. 1D partitioning is commonly used in stencil codes in computational science applications; the mirror nodes are often referred to as halo nodes in that context. Stencil codes use topology-driven algorithms on meshes, which are uniform-degree graphs, and computational load balance can be accomplished by assigning roughly equal numbers of nodes to all hosts. For powerlaw graphs, the adjacency matrix is irregular, and ensuring load balance for data-driven graph algorithms through static partitioning is difficult. A number of policies are used in practice. 1. Balanced nodes: Assign roughly equal numbers of nodes to all hosts. 2. Balanced edges: Assign nodes such that all hosts have roughly the same number of edges. 3. Balanced nodes and edges [51]: Assign nodes such that a given linear combination of the number of nodes and edges on a host has roughly the same value on all hosts.
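The last two of these policies can be implemented with a prefix sum of node degrees. As an illustration, the following sketch (a simplification under our own naming, not the partitioner used by any of the systems studied here) builds contiguous node blocks with roughly equal outgoing-edge counts:

```python
# A minimal sketch (not the code of any system evaluated here) of the
# "balanced edges" flavor of 1D partitioning: nodes are split into
# contiguous blocks so that each host receives roughly the same number of
# outgoing edges, using a prefix sum of out-degrees. All names are ours.
from bisect import bisect_left

def balanced_edge_blocks(out_degrees, num_hosts):
    """Return, for each host, the contiguous [start, end) range of nodes it owns."""
    prefix = [0]                                  # prefix[i] = edges with source < i
    for d in out_degrees:
        prefix.append(prefix[-1] + d)
    total_edges = prefix[-1]
    blocks, start = [], 0
    for h in range(1, num_hosts + 1):
        target = h * total_edges / num_hosts      # ideal cumulative edge count
        end = len(out_degrees) if h == num_hosts else bisect_left(prefix, target, lo=start)
        blocks.append((start, end))
        start = end
    return blocks

# Example: a skewed degree sequence split across 4 hosts.
print(balanced_edge_blocks([100, 3, 2, 1, 1, 50, 1, 1, 2, 40], 4))
# -> [(0, 1), (1, 2), (2, 6), (6, 10)]
```

A linear combination of node and edge counts (the third policy) can be handled the same way by replacing each degree with alpha + degree in the prefix sum.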

3 (a) CheckerBoard Vertex-Cut () (b) Cartesian Vertex-Cut () (c) Jagged Vertex-Cut () Figure 1: 2D partitioning policies. Figure 2: Communication patterns in and policies: red arrows are reductions and blue arrows are broadcasts for host 4. The first policy is the simplest and does not require any analysis of the graph. However, it may result in substantial load imbalance for power-law graphs since there is usually a large variation in node degrees. The other two policies require computing the degree of each vertex and the prefix-sums of these degrees to determine how to partition the set of nodes D Block Partitions In 2D block partitioning, the adjacency matrix is blocked along both dimensions, and each host is assigned some of the blocks. Unlike in 1D partitioning, both outgoing and incoming edges of a given node may be distributed among different hosts. In the graph analytical literature, such partitioning strategies are called Vertex- Cuts. 2D block partitioning can be viewed as a restricted form of Vertex-Cuts in which the adjacency matrix is blocked. This paper explores three alternative implementations of 2D block partitioning, illustrated in Figure 1 using a cluster of eight hosts in a 4 2 grid. The descriptions below assume that hosts are organized in a grid of size p r p c. 1. CheckerBoard Vertex-Cuts () [11, 27]: The nodes of the graph are partitioned into equal size blocks and assigned to the hosts. Masters are created on each host for its block of nodes. The matrix is partitioned into contiguous blocks of size N/p r N/p c, and each host is assigned one block of edges as shown in Figure 1(a). This approach is used by CombBLAS [9] for sparse matrix-vector multiplication (SpMV) on graphs. There are several variations to this approach; the one used in this study is shown in Figure 1(a). 2. Cartesian Vertex-Cuts () [7]: Nodes are partitioned among hosts using any 1D block partitioning policy, and masters are created on hosts for the nodes assigned to it. Unlike, these node partitions need not be of the same size, as shown in Figure 1(b). This can happen, for example, if the 1D block partitioning assigns nodes to hosts such that the number of edges is balanced among hosts. The columns are then partitioned into same sized blocks as the rows. Therefore, blocks along the diagonal will be square, but other blocks may be rectangular unlike in. These blocks of edges can be distributed among hosts in different ways. This study uses a block distribution along rows and a cyclic distribution along columns, as shown in Figure 1(b). 3. Jagged Vertex-Cuts () [11]: The first stage of is similar to. However, instead of partitioning columns into same sized blocks as the rows, each block of rows can be partitioned independently into blocks for a more balanced number of edges (non-zeros) in edge blocks. Therefore, blocks will not be aligned among the column dimension. These edge blocks can be assigned to hosts in any fashion, but in this study, the host assignment is kept the same as as shown in Figure 1(c). Although these 2D block partitioning strategies seem similar, they have different communication requirements. Consider a push-style graph algorithm in which active nodes perform computation on their own labels and push values to their immediate outgoing neighbors, where they are reduced to compute labels for the next round. In a distributed-memory setting, a proxy with outgoing edges of a node n on host h push values to its proxy neighbors also present on h. 
These proxies may be masters or mirrors, and it is useful to distinguish two classes of mirrors for a node n: in-mirrors, mirrors that have incoming edges, and out-mirrors, mirrors that have outgoing edges. For a push-style algorithm in the distributed setting, the computation at an active node in the original graph is performed partially at each in-mirror, the intermediate values are reduced to the master, and the final value is broadcast to the out-mirrors. Therefore, the communication pattern can be described as broadcast along the row dimension and reduce along the column dimension of the adjacency matrix. The hosts that participate in broadcast and/or reduce depends on the 2D partitioning policy and the assignment of edge blocks to the hosts. To illustrate this, consider Figure 2 which shows the 4 2 grid of hosts and the reduce and broadcast partners for host 4 in and. For, by walking down the fourth block column

of the matrix in Figure 1(a), we see that incoming edges to the nodes owned by host 4 may be mapped to hosts {1,3,5,7}, so labels of mirrors on these hosts must be communicated to host 4 during the reduce phase. Similarly, the labels of mirrors on host 4 must be communicated to masters on hosts {5,6,7,8}. Once the values have been reduced at masters on host 4, they must be sent to hosts that own out-mirrors for nodes owned by host 4. Walking over the fourth block row of the matrix, we see that only host 3 is involved in this communication. Similarly, the labels of masters on host {3} must be sent to host 4. The same analysis can be done for the CVC and JVC partitionings. For JVC, a given host may need to involve all other hosts in reductions and broadcasts, leading to larger communication requirements than CVC.

2.3 Unrestricted Vertex-Cut Partitions

Unrestricted or general Vertex-Cuts are partitioning strategies that assign edges to hosts without restriction, and they do not correspond to 1D or 2D blocked partitions. The partitioning strategies used in the PowerGraph [18] and PowerLyra [12] systems are examples of this strategy. For example, in PowerLyra's Hybrid Vertex-Cut (HVC), nodes are assigned to hosts in a manner similar to an Edge-Cut. However, high-degree nodes are treated differently from low-degree nodes to avoid assigning all edges connected to a high-degree node to the same host, which creates load imbalance. If (n1, n2) is an edge and n2 has low in-degree (based on some threshold), the edge is assigned to the host that owns n2; otherwise, it is assigned to the host that owns n1.

2.4 Discussion

A small detail in 2D block partitioning is that when the number of processors is not a perfect square, we factorize the number into a product p_x × p_y and assign the larger factor to the row dimension rather than the column dimension. This heuristic reduces the number of broadcast partners because the reduce phase communicates only updated values from proxies to the master, and the number of these updates decreases over iterations, whereas the broadcast phase in each round sends the updated canonical value from the master to all its proxies. In our studies, we observed that for Edge-Cuts and HVC, almost all pairs of hosts have proxies in common for some number of nodes. Therefore, for these partitioning policies, Edge-Cut partitioning performs all-to-all communication while HVC performs two rounds of all-to-all communication (from mirrors to masters followed by masters to mirrors). On the other hand, CVC has p_c parallel all-to-all communications, each among p_r of the P hosts, followed by p_r parallel all-to-all communications, each among p_c of the P hosts. Therefore, each host in CVC sends fewer messages and has fewer communication partners than in Edge-Cuts and HVC, which can help both at the application level (a host need not wait for another host in another row) and at the hardware level (less contention for network resources). The next section explores these differences among partitioning policies quantitatively.

3. PERFORMANCE MODEL FOR COMMUNICATION

This section presents performance models to estimate communication volume and communication time for three partitioning policies: Edge-Cut, Hybrid Vertex-Cut, and Cartesian Vertex-Cut. We formally define the replication factor (Section 3.1), describe a simple analytical model for estimating the communication volume using the replication factor (Section 3.2), and use micro-benchmarks to estimate the communication time from the communication volume (Section 3.3).
3.1 Replication Factor

Given a graph G = (V, E) and a strategy S for partitioning the graph among P hosts, let v be a node in V and let r_{P,S}(v) denote the number of proxies of v created when the graph is partitioned. r_{P,S}(v) is referred to as the replication factor of v, and it is an integer between 1 and P. The average replication factor across all nodes is a number between 1 (no node is replicated) and P (all nodes are replicated on all hosts), and it is given by the following equation:

$$r_{P,S} = \frac{\sum_{v \in V} r_{P,S}(v)}{|V|} \quad (1)$$

3.2 Estimating Communication Volume

For simplicity in the communication volume model, assume that the data flow in the algorithm is from the source to the destination of an edge. A similar analysis can be performed for other cases. Assume that u nodes (u ≤ |V|) are updated in a given round and that the size of the node data (update) is b bytes.

3.2.1 Edge-Cut (EC)

Consider an incoming Edge-Cut (IEC) strategy (1D column partitioning), and let r_{P,E}(v) be the replication factor for a node v. In IEC, the destination of an edge is always a master, so only the master node is updated during computation. At the end of a round, the master node updates its mirrors on other hosts. Since an update requires b bytes, the master node for v in the graph must send (r_{P,E}(v) - 1) · b bytes. If u nodes are updated in the round, the volume of communication is the following:

$$C_{IEC}(G, P) \approx (r_{P,E} - 1) \cdot u \cdot b \quad (2)$$

For an outgoing Edge-Cut (OEC) strategy (1D row partitioning), the source of an edge is always a master, so the mirrors must send the updates to their master, and only the master needs the updated value. The communication volume remains the same as IEC's only if all the mirrors of the u nodes are updated in the round. In practice, only some mirrors of the nodes in u are updated. Let f_E denote the average fraction of mirrors of nodes in u that are updated. The communication volume in the round is the following:

$$C_{OEC}(G, P) \approx (f_E \cdot r_{P,E} - 1) \cdot u \cdot b \quad (3)$$

3.2.2 Hybrid Vertex-Cut (HVC)

In a general Vertex-Cut strategy like Hybrid Vertex-Cut, the mirrors must communicate with their master, and the masters must communicate with their mirrors. If r_{P,H} is the average replication factor, the volume of communication in a round in which an average fraction f_H of mirrors are updated is

$$C_{HVC}(G, P) \approx ((f_H \cdot r_{P,H} - 1) + (r_{P,H} - 1)) \cdot u \cdot b \quad (4)$$

Comparing this with (2) and (3), we see that the volume of communication for OEC and IEC can be much lower than HVC's if they have the same replication factor, and it will be greater than HVC's only if their replication factor is almost twice that of HVC.

3.2.3 Cartesian Vertex-Cut (CVC)

Consider a Cartesian Vertex-Cut with P = p_r × p_c. Unlike HVC, at most p_r mirrors must communicate with their master, and the masters must communicate with at most p_c of their mirrors. Let r_{P,C} be the average replication factor, and let f_C^r and f_C^c be the fractions of mirrors that are active along the rows and columns, respectively. Then, the volume of communication in a round is

$$C_{CVC}(G, P) \approx ((f_C^r \cdot r_{P,C} - 1) + (f_C^c \cdot r_{P,C} - 1)) \cdot u \cdot b \quad (5)$$

Comparing this with (4), we see that the communication volume for CVC can be less than the volume for HVC even with the same replication factor. This illustrates that the replication factor is not the sole determinant of communication volume.
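As a restatement of the model, the short sketch below transcribes Equations (2), (4), and (5) directly into code; the function and parameter names are ours and it is only an illustration of the equations, not part of any system evaluated here.

```python
# Direct transcription of the per-round communication volume model of
# Section 3.2. r = average replication factor, u = nodes updated in the
# round, b = bytes per update, f* = fraction of mirrors that are active.
# Names are ours, not the paper's or D-Galois's.
def vol_iec(r, u, b):
    """Eq. (2): incoming Edge-Cut; masters broadcast updates to their mirrors."""
    return (r - 1) * u * b

def vol_hvc(r, u, b, f=1.0):
    """Eq. (4): Hybrid Vertex-Cut; reduce from mirrors to masters, then broadcast back."""
    return ((f * r - 1) + (r - 1)) * u * b

def vol_cvc(r, u, b, f_row=1.0, f_col=1.0):
    """Eq. (5): Cartesian Vertex-Cut; reduce and broadcast each restricted to one grid dimension."""
    return ((f_row * r - 1) + (f_col * r - 1)) * u * b

# Illustrative values only: same replication factor for all three policies.
r, u, b = 8.0, 1_000_000, 8
print(vol_iec(r, u, b))            # 56,000,000 bytes: one-directional updates
print(vol_hvc(r, u, b))            # 112,000,000 bytes with all mirrors active
print(vol_cvc(r, u, b, 0.5, 0.5))  # 48,000,000 bytes with half the mirrors active
```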

[Figure 3: Communication time (msec) versus communication volume (MB) for different communication patterns on different numbers of (a) KNL hosts and (b) Skylake hosts on Stampede.]

[Figure 4: Percentage of computation rounds of pr and sssp in each communication volume range (≤ 1 MB, > 1 MB and ≤ 100 MB, > 100 MB) for different partitioning policies on 32 and 256 KNL hosts.]

3.3 Micro-benchmarking to Estimate Communication Time

To relate communication volumes under different communication patterns to communication time, we adapted the MVAPICH2 all-to-allv micro-benchmark. For a given total communication volume across all hosts, we simulate the communication patterns of the three strategies (the message sizes differ for different strategies): (1) IEC/OEC: one round of all-to-all communication among all hosts; (2) HVC: two rounds of all-to-all communication among all hosts; (3) CVC: a round of all-to-all communication among all row hosts followed by a round of all-to-all communication among all column hosts. In a given strategy, the same message size is used between every pair of hosts. In practice, the message sizes usually differ, and point-to-point communication (instead of a collective) is typically used to dynamically manage buffers and to parallelize serialization and deserialization of data.

Figures 3a and 3b show the communication time of the different strategies on different numbers of KNL hosts and Skylake hosts, respectively, on the Stampede cluster [43] (described in Section 4) as the total communication volume increases. As expected, the communication time depends not only on the communication volume but also on the communication pattern. The performance difference among the communication patterns grows as the number of hosts increases. While EC, HVC, and CVC perform similarly on 32 KNL hosts, CVC gives a geomean speedup of 4.6× and 5.6× over EC and HVC, respectively, to communicate the same volume on 256 KNL hosts; the speedup is higher for low volume (≤ 1 MB) than for medium volume (> 1 MB and ≤ 100 MB). The communication volume in work-efficient graph analytical applications changes over rounds, and several rounds have low communication volume in practice. For example, Figure 4 shows the percentage of computation rounds of pagerank and sssp with communication volume in the low, medium, and high ranges for different partitioning policies on 32 and 256 KNL hosts of Stampede. In Figure 3, CVC communicates more data in the same amount of time on 256 KNL hosts; e.g., CVC can synchronize 128 to 256 MB of data in the same time that EC and HVC can synchronize 1 MB of data. Similar behavior is also observed on Skylake hosts. Thus, CVC can reduce communication time over EC and HVC for the same communication volume, or it can increase the volume without an increase in time.

3.4 Discussion

The analytical model and micro-benchmark results described in this section show that although communication time depends on the communication volume (which in turn depends on the average replication factor), it is incorrect to conclude from this alone that partitioning strategies with larger average replication factors or communication volumes will perform worse than those with smaller replication factors or communication volumes. In particular, CVC might be expected to perform better at scale because it requires fewer messages and fewer pairs of processors to communicate than other partitioning strategies like EC and HVC do. While the micro-benchmarks used MPI collectives with the same message sizes, most distributed graph analytics systems use MPI or similar interfaces to perform point-to-point communication with variable message sizes. For example, the D-Galois [15] system uses LCI [14] (an alternative to MPI that has been shown to perform better than MPI for graph analytical applications) instead of MPI for sending and receiving point-to-point messages between hosts. Even in such systems, CVC can be expected to perform better at scale than EC and HVC because it needs fewer messages and fewer pairs of processors to communicate. We study this quantitatively using the D-Galois system in the next section.
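To make the message-count argument concrete, the sketch below counts, for one host, the distinct communication partners and messages per synchronization round under the idealized patterns of Sections 2.4 and 3.3; it is a back-of-the-envelope illustration under our own naming and assumed grid shapes, not a measurement from the study.

```python
# Per-host communication partners and messages per synchronization round,
# assuming the idealized patterns of Sections 2.4 and 3.3: EC does one
# all-to-all, HVC does two (reduce then broadcast), and CVC does an
# all-to-all within its grid column followed by one within its grid row.
def per_host_costs(p_r, p_c):
    P = p_r * p_c
    return {                      # (distinct partners, messages sent per round)
        "EC":  (P - 1, P - 1),
        "HVC": (P - 1, 2 * (P - 1)),
        "CVC": ((p_r - 1) + (p_c - 1), (p_r - 1) + (p_c - 1)),
    }

# Grid shapes are assumptions, with the larger factor on the rows (Section 2.4).
for p_r, p_c in [(8, 4), (16, 16)]:
    print(f"{p_r * p_c:3d} hosts: {per_host_costs(p_r, p_c)}")
```

Under these assumptions, at 256 hosts arranged as a 16 × 16 grid, a host under CVC exchanges messages with only 30 other hosts per round instead of 255, which is consistent with the micro-benchmark trends in Figure 3.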

6 Table 1: Inputs and their properties. kron3 wdc12 V 173M 978M 3,563M E 1,791M 42,574M 128,736M E / V Max D out 3.2M 7,447 55,931 Max D in 3.2M 75M 95M Size on disk 136GB 325GB 986GB Table 2: Execution time of Gemini and D-Galois with. 32 hosts Gemini (sec) D-Galois (sec) kron3 bfs cc pr sssp bfs cc pr sssp EXPERIMENTAL RESULTS All experiments were conducted on the Texas Advanced Computing Center s Stampede Cluster (Stampede2) [2, 43]. Stampede2 has 2 distributed clusters: one with Intel Knights Landing (KNL) nodes and another with Intel Skylake nodes. Each KNL cluster host has GHz cores with 4 hardware threads per core, 96 GB RAM, and 16 GB MC-DRAM, which serves as a direct-mapped L3 cache. Each Skylake cluster host has GHz cores on 2 sockets (24 cores per socket) with 2 hardware threads per core and 192 GB RAM. Both clusters use 1Gb/sec Intel Omni-Path (OPA) network. We limit our experiments to 256 hosts on both the clusters and we use 272 threads per host (69632 threads in total) on the KNL cluster and 48 threads per host (12288 threads in total) on the Skylake cluster. All code was compiled using g Unless otherwise stated, all results are presented using the KNL cluster. We used three graphs in our evaluation: synthetically generated randomized power-law graph kron3 (using kron [31] generator with weights of.57,.19,.19, and.5, as suggested by graph5 [1]) and the largest publicly available web-crawl graphs, [5, 6, 38] and wdc12 (Web Data Commons) [33, 34]; we present results for wdc12 only at scale (256 hosts). Properties of these graphs are listed in Table 1. To study the impact of graph partitioning on application execution time, we use five graph analytics applications: betweenness centrality ( bc ), breadth-first search ( bfs ), connected components ( cc ), pagerank ( pr ), and single-source shortest path ( sssp ). We implement topology-driven algorithm for pagerank and data-driven algorithms for the rest. We run bc with only one source. The source nodes for bc, bfs, and sssp are the maximum out-degree node. The directed graph is given as input to bc, bfs, pagerank, and sssp (bc and sssp use a weighted graph), while an undirected graph (we make the directed graph symmetric by adding reverse edges) is given as input to cc. We present the mean execution time of 3 runs (each being maximum across all hosts) excluding graph partitioning time. All algorithms are run until convergence except for pagerank, which is run for up to 1 rounds or iterations. 4.1 Implementation Existing graph analytics systems implement a single partitioning strategy, so they are not suitable for comparative studies of graph partitioning. Moreover, these frameworks do not optimize communication patterns to exploit structure in the partitioning policy. Table 3: Different initial Edge-Cut policies used for different benchmarks, inputs, and partitioning policies: IE: Incoming Edge- Cut, OE: Outgoing Edge-Cut, UE: Undirected Edge-Cut. bc/bfs/sssp cc pr kron3 wdc12 X OE UE IE IE UE IE IE UE IE OE UE IE OE UE IE OE UE IE X OE UE IE OE UE OE OE UE OE OE UE IE OE UE IE OE UE IE X OE UE IE OE UE OE OE UE OE OE UE IE OE UE IE OE UE IE Therefore, we used a system called D-Galois: it uses the sharedmemory Galois system [3, 36] for computing on each host and the Gluon communication runtime [15] for inter-host communication and synchronization. Gluon is an efficient bulk-synchronous communication substrate, which enables existing shared-memory CPU and GPU graph analytics frameworks to run on distributed heterogeneous clusters. 
Gluon is partition-aware and optimizes communication for particular partitioning strategies by exploiting their structural invariants. Instead of naively reducing from mirrors to masters and broadcasting from masters to mirrors during synchronization (as done in gather-apply-scatter model), it avoids redundant reductions or broadcasts by exploiting structural invariants in the partitioning strategy, as described in Section 3. Gluon also exploits the fact that the partitioning of the graph does not change during computation and it uses this temporal invariance to reduce the overhead and volume of communication by optimizing metadata like node IDs that needs to be communicated along with the node values or updates. To ensure that D-Galois is a suitable platform for this study, we compared it with Gemini [51], a state-of-the-art system that uses only Edge-Cuts. Table 2 shows the execution times for Gemini versions of our benchmarks on 32 KNL hosts; wdc12 is omitted because Gemini runs out of memory while partitioning it (even on 256 hosts). Since Gemini supports only Edge-Cuts, we compare it with D-Galois using Edge-Cut (). We see that D-Galois outperforms Gemini. Gemini also does not scale beyond 32 hosts [15]. These results support the claim that D-Galois is a reasonable stateof-the-art platform for performing studies of graph partitioning. For our study, graphs are stored on disk in CSR and CSC formats (size for directed, weighted graphs in Table 1). The synthetic graphs (kron3) are stored after randomizing the vertices (randomized order) while the web-crawl graphs ( and wdc12) are stored in their natural (crawled) vertex order. D-Galois s distributed graph partitioner reads the input graph such that each host only reads a distinct portion of the file on disk once. D-Galois also supports loading partitions directly from disk, and we use this to evaluate XtraPulp [41], the state-of-the-art graph partitioner for large scale-free graphs. XtraPulp generates (using command line parameters -e 1.1 -v 1. -c -d -s and 272 OpenMP threads) partitions for kron3 and, and it runs out of memory for wdc12 (the authors have been in-

7 Table 4: Graph partitioning time (includes time to load and construct graph) and static load balance of edges assigned to hosts on 256 KNL hosts. Partitioning Max-by-mean time (sec) edges kron3 wdc12 bc bc bfs cc pr bfs cc pr sssp sssp X X X OOM OOM OOM OOM OOM OOM OOM OOM OOM OOM formed about this). We write these partitions to disk and load it in D-Galois. We term this the XtraPulp Edge-Cut (X) policy. We implement five partitioning policies in D-Galois: (1) Edge- Cut (), (2) Hybrid Vertex-Cut (), (3) CheckerBoard Vertex- Cut (), (4) Jagged Vertex-Cut (), and (5) Cartesian Vertex- Cut (). Direction of edges traversed during computation is application dependent. bc, bfs, and sssp traverse outgoing edges of a directed graph, pr traverses incoming edges of a directed graph, and cc traverses outgoing edges of a symmetric, directed graph (Gluon handles undirected graphs in this way). Due to this, each application might prefer a different policy: Outgoing Edge-Cut (OE): Nodes of a directed graph are partitioned into contiguous blocks while trying to balance outgoing edges and all outgoing edges of a node are assigned to the same block as the node, like in Gemini [51]. Incoming Edge-Cut (IE): Nodes of a directed graph are partitioned into contiguous blocks while balancing incoming edges and all incoming edges are assigned to the same block. Undirected Edge-Cut (UE): Nodes of a symmetric, directed graph are partitioned into contiguous blocks while trying to balance outgoing edges and all outgoing edges are assigned to the same block. does not communicate during graph partitioning as each process reads its portion of the graph directly from the file. All Vertex- Cut policies use some 2 to further partition the edges of some vertices, which are sent to the respective hosts. Table 3 shows the initial Edge-Cut used by the partitioning policies for each benchmark and input. as described in Section 2.3 is used for graphs with skewed in-degree (e.g., wdc12 and ). For graphs with skewed out-degree (e.g., kron3), if (n 1 n 2) is an edge and n 1 has low out-degree (based on some threshold), the edge is assigned to the host that owns n 1; otherwise, it is assigned to the node that owns n 2.,, and are as described in Section The Vertex-Cut policies could also have used X instead of to further partition the edges of some vertices, but we chose to show that can do well at scale even with a simple Edge-Cut. Table 5: Dynamic load balance: maximum-by-mean computation time on 256 KNL hosts. bc bfs cc pr sssp kron3 wdc12 X X X OOM OOM OOM OOM OOM OOM 1.94 OOM OOM OOM Partitioning Time Although the time to partition graphs is an overhead in distributedmemory graph analytics systems, shared-memory systems must also take time to load the graph from disk and construct it in memory. For example, Galois [36] and other shared-memory systems like Ligra [4] take roughly 2 minutes to load and construct rmat28 (35GB), which is relatively small compared to the graphs used in this study. This time should be used as a point of reference when considering the graph partitioning times shown in Table 4 for different partitioning policies and inputs on 256 KNL hosts. X excludes time to write partitions from XtraPulp to disk, load it in D-Galois, and construct the graph. All other policies include graph loading and construction time. Note that some partitioning policies run out-of-memory (OOM). is a lower bound for partitioning as each host loads its partition of the graph directly from disk in parallel. 
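The degree-threshold rule that HVC applies, as described in Section 2.3 for graphs with skewed in-degree and above for graphs with skewed out-degree, can be summarized by the sketch below; the owner map, degree arrays, threshold value, and function names are illustrative assumptions rather than D-Galois internals.

```python
# Hybrid Vertex-Cut edge assignment. The threshold and all names are
# illustrative; owner_of maps a node to the host owning its master proxy.
def hvc_owner_out_skew(edge, owner_of, out_degree, threshold=100):
    """Rule for skewed out-degree (e.g., kron30-style graphs): edges of
    low out-degree sources stay with the source's host; edges of
    high out-degree sources are spread to their destinations' hosts."""
    n1, n2 = edge
    return owner_of[n1] if out_degree[n1] <= threshold else owner_of[n2]

def hvc_owner_in_skew(edge, owner_of, in_degree, threshold=100):
    """Dual rule (Section 2.3) for graphs with skewed in-degree: edges of
    low in-degree destinations stay with the destination's host."""
    n1, n2 = edge
    return owner_of[n2] if in_degree[n2] <= threshold else owner_of[n1]
```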
Vertex-Cut policies are slower than Edge-Cut policies because they involve communication and more analysis on top of. Comparing partitioning time for different partitioning strategies is not the focus of this work. In particular, graphs can be partitioned once, and different applications can run using these partitions. The main takeaway for partitioning time is that it can finish within a few minutes. partitioning time is better than or similar to that of X. takes around 5 minutes, 1 minutes, and 4 minutes for directed kron3,, and wdc12, respectively. 4.3 Load Balance We first compare the quality of the resulting partitions. The number of edges assigned to a host represents the static load of that host. Table 4 shows the maximum to mean ratio of edges assigned to hosts. We see that static load balance not only varies with different policies but also with the input graphs. For both directed and undirected graphs, kron3 is very well balanced for all policies. For and wdc12, is well balanced while the rest are not. Dynamic load may be quite different from static load, especially for data-driven algorithms in which computation changes over rounds. To measure the dynamic load balance, we measure the computation time of each round on each host and calculate the maximum and mean across hosts. Summing up over rounds, we get the maximum and mean computation time respectively. Table 5 shows the maximum-by-mean computation time for all policies, benchmarks, and inputs on 256 hosts. Note that bc, bfs, and sssp use the same

8 Table 6: Execution time (sec) for different partitioning policies, benchmarks, and inputs on KNL hosts. 32 hosts 256 hosts X X kron3 bc bfs cc pr sssp bc bfs cc OOM OOM pr OOM sssp Table 7: Execution statistics of wdc12 on 256 KNL hosts. bfs Execution cc Time (sec) pr sssp bfs Replication cc Factor pr sssp Total bfs Communication cc Volume (GB) pr sssp Table 8: Execution time (sec) on Skylake hosts for and. 8 hosts 256 hosts bc bfs kron3 cc pr sssp bc bfs cc pr sssp partitions. Firstly, it is clear that although kron3 is statically well balanced for all policies, it is not dynamically load balanced. Moreover, the load balance depends on the dynamic nature of the algorithm. Even though bc and bfs use the same graph, their dynamic load balance is different. on is well-balanced for bfs, but highly imbalanced for bc. The policy significantly impacts dynamic load balance, but this need not directly correlate with their static load balance. For example, for bfs and sssp, although on is statically severely imbalanced while is not, both and are fairly well balanced at runtime. This demonstrates that dynamic load balance is difficult to achieve since it depends on the interplay between policy, algorithm, and graph. 4.4 Execution Time Table 6 shows the execution time of D-Galois with all policies on 32 and 256 KNL hosts for kron3 and. Table 7 shows the execution time for wdc12 on 256 KNL hosts (all policies run out of memory on 32 hosts; X and run out of memory on 256 hosts too). These tables also highlight which policy performs best for a given application, input, and number of hosts (scale). It is clear that the best performing policy is dependent on the application, the input, and the number of hosts (scale). Although there is no clear winner for all cases, performs the best on 256 hosts for almost all applications and inputs. Table 8 shows the execution time of D-Galois with and on 8 and 256 Skylake hosts for kron3 and. Even on these Skylake hosts, the best performing policy depends on the application, the input, and the number of hosts (scale), without any clear winner. Nonetheless, similar to KNL hosts, performs the best on 256 hosts for almost all applications and inputs, whereas performs the best on 8 hosts for almost all applications and inputs. This suggests that the relative merits of the partitioning policies are not specific to the CPU architecture. 4.5 Strong Scaling We measure the time for computation of each round or iteration on each host and take the maximum across hosts. Summing up over iterations, we get the computation time. Figure 5 shows the strong scaling of execution time (left) and computation time (right) for kron3 and (most partitioning policies run out of memory for wdc12 on less than 256 hosts). Some general trends are apparent. At small scale on 32 hosts, performs fairly well for almost all the benchmarks and is comparable to the best, but as we go to higher number of hosts,, X, and do not scale. On the other hand, all 2D partitioning policies scale relatively better for almost all benchmarks and inputs. Among them, scales better than, and scales better than. In most cases, has the best performance at scale and scales best. does not scale well for bfs and sssp on due to computation time not scaling well. In all other cases, both computation time and execution time scale. For bfs and sssp, is still better than the others in execution time since computation times of all policies do not scale. 
This is likely due to the computation being too small for it to scale (both execute more than 18 iterations, so each iteration has little to compute on each host at scale). Computation time of all policies scales similarly in most cases. However, their execution time scaling differs. This is most evident for EC and CVC. This indicates that the difference among the policies arises from the communication. We analyze this next.
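The computation-time and load-balance metrics used in Sections 4.3 and 4.5 can be restated compactly in code; the sketch below is our restatement of that methodology (names are ours), not the measurement harness used for the experiments.

```python
# Aggregating per-round, per-host compute times as described in Sections
# 4.3 and 4.5: computation time is the per-round maximum across hosts,
# summed over rounds; dynamic load balance is the ratio of that sum to the
# summed per-round means. round_times[r][h] = host h's compute time in round r.
def computation_time(round_times):
    return sum(max(per_host) for per_host in round_times)

def max_by_mean(round_times):
    summed_max = sum(max(per_host) for per_host in round_times)
    summed_mean = sum(sum(per_host) / len(per_host) for per_host in round_times)
    return summed_max / summed_mean

# Example with 3 rounds on 4 hosts (seconds).
rounds = [[1.0, 1.2, 0.9, 1.1], [0.5, 2.0, 0.6, 0.7], [0.2, 0.3, 0.2, 0.3]]
print(computation_time(rounds), round(max_by_mean(rounds), 2))  # 3.5 1.56
```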

[Figure 5: Execution time (left) and compute time (right), in seconds, of D-Galois with different partitioning policies for bc, bfs, cc, pr, and sssp on kron3 and clueweb12 as the number of KNL hosts increases.]

[Figure 6: Replication factor (left) and communication volume in GB (right) of D-Galois with different partitioning policies for bc, bfs, cc, pr, and sssp on kron3 and clueweb12 as the number of KNL hosts increases.]

[Figure 7: Breakdown of execution time on 256 hosts into computation and non-overlapping communication: kron3 (left) and clueweb12 (right).]

[Figure 8: Breakdown of execution time on 256 hosts into computation and non-overlapping communication for wdc12 (X and the CheckerBoard Vertex-Cut run out of memory).]

[Table 9: Communication volume estimated by the model vs. observed communication volume on 128 KNL hosts for kron3: replication factor, estimated volume (GB), and observed volume (GB) for bfs, cc, pr, and sssp under IEC, HVC, and CVC.]

4.6 Communication Analysis

The total communication volume is the sum of the number of bytes sent from one host to another during execution. Table 7 shows the replication factor and total communication volume for wdc12 on 256 hosts (X and the CheckerBoard Vertex-Cut run out of memory). Figure 6 shows how the replication factor (left) and the total communication volume (right) scale as we increase the number of hosts for kron3 and clueweb12. The replication factor increases with the number of hosts for all partitioning policies, as expected. Similarly, the total communication volume increases with the number of hosts, as more data needs to be exchanged to synchronize the proxy nodes across hosts. However, the differences in replication factor across policies can vary from the differences in communication volume. This is most evident for kron3: although EC has a much higher replication factor than the other policies, the communication volume of EC is close to that of the others. This demonstrates that the replication factor need not be the sole determinant of the communication volume. For kron3, EC corresponds to IEC, so there are no updates sent from mirrors to masters; only masters send updates to mirrors. Vertex-Cut policies like HVC and CVC, however, send from mirrors to masters and then from masters to mirrors. Thus, even if EC has roughly twice the replication factor of such a Vertex-Cut policy, it would still communicate roughly the same volume. This can be analyzed using our analytical model in Section 3.2. We estimate the communication volume for EC, HVC, and CVC using Equations 2, 4, and 5, respectively. The values to use for the replication factor r and the size of the data b are straightforward to determine. We assume that if a node is updated, all its mirrors are updated, so we use a value of 1 for f. These equations estimate the volume for a given round or iteration. To get the total communication volume, we could sum over all rounds. We instead replace the number of updates u in a round with the total number of updates performed on the graph across all the rounds to estimate the total communication volume. Table 9 presents the estimated communication volume of different applications and policies for kron3 along with the replication factor and the observed communication volume. The estimated volume can be more than the observed volume because we assume f is 1, which is an over-approximation. The observed volume can be higher than the estimated volume because the estimation does not account for metadata like node IDs communicated along with node values or updates, which is an under-approximation. Such approximations are fine because the relative ordering of the communication volume among different partitioning policies is what matters, not the absolute values.

From the estimated volumes for all the benchmarks, we can see that the policy our analytical model predicts to communicate the least volume is not necessarily the one with the lowest replication factor. A similar pattern can be seen in the observed communication volumes, where the policy with the minimum volume does not necessarily have the minimum replication factor. This validates our analytical model: the replication factor is not the sole determinant of communication volume.

In Figure 6, we see that CVC may have a higher replication factor and communication volume than the other policies, yet it performs better than the other policies in most cases. Figures 7 and 8 show the breakdown of execution time into computation time and non-overlapped communication time (the rest of the execution time) on 256 hosts. It is clear that CVC is doing better (except pagerank on kron3) because its communication time is lower. We can see that more communication volume does not always imply more communication time. For example, in pr and sssp on clueweb12, CVC has a higher replication factor and more communication volume than EC and HVC but lower communication time. Figure 4 shows the percentage of rounds in those applications and policies that have low (≤ 1 MB), medium (> 1 MB and ≤ 100 MB), and high (> 100 MB) communication volume. CVC increases the number of high-volume rounds of pr and sssp over EC on both 32 and 256 hosts. Our micro-benchmarking in Section 3.3 shows that CVC yields significant speedups over EC on 256 hosts in both low and high volumes, which outweighs the increase in high-volume rounds. In contrast, the increase in high-volume rounds on 32 hosts for CVC over EC causes a slowdown in communication time, since there is very little difference between CVC and EC at this scale for the same communication volume. This validates the claim that the communication time depends on both the communication volume and the communication pattern, and it shows that CVC has much less communication overhead than other policies at large scale.

[Figure 9: Decision tree to choose a partitioning policy, based on whether the input is partitioned offline (PI = Yes/No), the estimated execution time (T = Long/Short), and the cluster size (CS = #Hosts), with CS > 32 as the threshold when PI == Yes or T == Long.]

5. CHOOSING A PARTITIONING POLICY

Based on the results presented in Section 4, we present a decision tree (Figure 9) to help users choose the most suitable partitioning strategy based on the following parameters:

1. Whether the input is partitioned offline: The time it takes to partition and load the graph (online) depends on the complexity of the partitioning strategy. EC takes the least amount of time, whereas strategies like X [41] and CVC [7] take more time as they involve analysis and communication during partitioning. It makes sense to use complex partitioning strategies if the benefits gained from them outweigh the time spent in partitioning. D-Galois also supports direct loading of offline partitioned graphs, in which case partitioning time is not a factor as the partitioned graph is on disk.

2. Whether the execution time is estimated to be long or short: The amount of time spent in execution of the application plays a vital role in determining if it makes sense to invest time in a good partitioning strategy. Spending more time in partitioning makes more sense for long-running applications, as they involve more rounds of communication which can benefit from a well-chosen partitioning strategy. The user can easily identify whether the application is expected to run for a long time (e.g., by using algorithm complexity). Applications such as bc tend to be more complex as they have multiple phases within each round of computation, and they involve floating-point operations; therefore, they are expected to have longer execution times. On the other hand, applications like bfs are relatively less complex and can be classified as short-running applications.

3. Cluster size: Our results show that the performance of different partitioning strategies also depends on the number of hosts on which the application is run.

Figure 9 illustrates our decision tree to choose a partitioning strategy. For short-running applications, if the graph is not already partitioned, we recommend using simple EC. Additionally, if the graph is already partitioned and the number of hosts is less than 32, simple EC should suffice. For long-running applications, the decision to use EC or CVC primarily depends on the cluster size. If the number of machines is more than 32, it makes sense to invest partitioning time in CVC. Otherwise, EC is recommended.

Choosing the best partitioning strategy is a difficult problem, as it depends on several factors such as the properties of the input graphs, the applications, the number of hosts (scale), etc. Therefore, the decision tree may not always suggest the best partitioning strategy. Tables 10 and 11 illustrate this point by showing the percentage difference in the application execution time between the chosen partitioning strategy and the optimal one.

Table 10: % difference in execution time (excluding partitioning time) on KNL hosts between the partitioning strategy chosen by the decision tree and the optimal one (wdc12 is omitted because the chosen one is always optimal; 0% means that the chosen one is optimal).
kron3: bc 44.74% 0% 0% 0%; bfs 13.33% 13.68% 0% 0%; cc 12.5% 26.63% 0% 0%; pr 9.31% 36.38% 3.63% 2.99%; sssp 21.31% 23.26% 0% 0%.
clueweb12: bc 68.98% 6.59% 21.32% 0%; bfs 0% 37.24% 0% 0%; cc 55.62% 52.16% 51.35% 0%; pr 11.89% 18.69% 26.9% 3.62%; sssp 0% 43.83% 17.33% 0%.

Table 11: % difference in execution time (excluding partitioning time) on Skylake hosts (8 and 256 hosts) between the partitioning strategy chosen by the decision tree and the optimal one.
kron3: bc 21.79% 0%; bfs 0% 0%; cc 0% 0%; pr 0% 0%; sssp 0% 0%.
clueweb12: bc 0% 0%; bfs 0% 0%; cc 12.84% 0%; pr 0% 0%; sssp 0% 11.34%.
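The decision tree of Figure 9 is simple enough to write down directly; the following sketch is our reading of it, with hypothetical function and policy names rather than an interface provided by D-Galois.

```python
# Our reading of the Figure 9 decision tree: CVC is worth its higher
# partitioning cost only for long-running applications (or when a
# partitioned graph is already on disk) and clusters larger than 32 hosts;
# otherwise a simple Edge-Cut suffices. Names are ours, not a D-Galois API.
def choose_policy(num_hosts, long_running, partitioned_offline=False):
    if (long_running or partitioned_offline) and num_hosts > 32:
        return "CVC"   # Cartesian Vertex-Cut
    return "EC"        # Edge-Cut

print(choose_policy(num_hosts=16, long_running=False))    # EC
print(choose_policy(num_hosts=256, long_running=True))    # CVC
print(choose_policy(num_hosts=256, long_running=False,
                    partitioned_offline=True))            # CVC
```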


Algorithm Engineering with PRAM Algorithms Algorithm Engineering with PRAM Algorithms Bernard M.E. Moret moret@cs.unm.edu Department of Computer Science University of New Mexico Albuquerque, NM 87131 Rome School on Alg. Eng. p.1/29 Measuring and

More information

Extreme-scale Graph Analysis on Blue Waters

Extreme-scale Graph Analysis on Blue Waters Extreme-scale Graph Analysis on Blue Waters 2016 Blue Waters Symposium George M. Slota 1,2, Siva Rajamanickam 1, Kamesh Madduri 2, Karen Devine 1 1 Sandia National Laboratories a 2 The Pennsylvania State

More information

arxiv: v1 [cs.dc] 27 Sep 2018

arxiv: v1 [cs.dc] 27 Sep 2018 Performance of MPI sends of non-contiguous data Victor Eijkhout arxiv:19.177v1 [cs.dc] 7 Sep 1 1 Abstract We present an experimental investigation of the performance of MPI derived datatypes. For messages

More information

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS

6.2 DATA DISTRIBUTION AND EXPERIMENT DETAILS Chapter 6 Indexing Results 6. INTRODUCTION The generation of inverted indexes for text databases is a computationally intensive process that requires the exclusive use of processing resources for long

More information

AN EXPERIMENTAL COMPARISON OF PARTITIONING STRATEGIES IN DISTRIBUTED GRAPH PROCESSING SHIV VERMA THESIS

AN EXPERIMENTAL COMPARISON OF PARTITIONING STRATEGIES IN DISTRIBUTED GRAPH PROCESSING SHIV VERMA THESIS c 2017 Shiv Verma AN EXPERIMENTAL COMPARISON OF PARTITIONING STRATEGIES IN DISTRIBUTED GRAPH PROCESSING BY SHIV VERMA THESIS Submitted in partial fulfillment of the requirements for the degree of Master

More information

GPU ACCELERATED SELF-JOIN FOR THE DISTANCE SIMILARITY METRIC

GPU ACCELERATED SELF-JOIN FOR THE DISTANCE SIMILARITY METRIC GPU ACCELERATED SELF-JOIN FOR THE DISTANCE SIMILARITY METRIC MIKE GOWANLOCK NORTHERN ARIZONA UNIVERSITY SCHOOL OF INFORMATICS, COMPUTING & CYBER SYSTEMS BEN KARSIN UNIVERSITY OF HAWAII AT MANOA DEPARTMENT

More information

Introduction to parallel Computing

Introduction to parallel Computing Introduction to parallel Computing VI-SEEM Training Paschalis Paschalis Korosoglou Korosoglou (pkoro@.gr) (pkoro@.gr) Outline Serial vs Parallel programming Hardware trends Why HPC matters HPC Concepts

More information

Pregel. Ali Shah

Pregel. Ali Shah Pregel Ali Shah s9alshah@stud.uni-saarland.de 2 Outline Introduction Model of Computation Fundamentals of Pregel Program Implementation Applications Experiments Issues with Pregel 3 Outline Costs of Computation

More information

GPU Sparse Graph Traversal. Duane Merrill

GPU Sparse Graph Traversal. Duane Merrill GPU Sparse Graph Traversal Duane Merrill Breadth-first search of graphs (BFS) 1. Pick a source node 2. Rank every vertex by the length of shortest path from source Or label every vertex by its predecessor

More information

CS 475: Parallel Programming Introduction

CS 475: Parallel Programming Introduction CS 475: Parallel Programming Introduction Wim Bohm, Sanjay Rajopadhye Colorado State University Fall 2014 Course Organization n Let s make a tour of the course website. n Main pages Home, front page. Syllabus.

More information

Mosaic: Processing a Trillion-Edge Graph on a Single Machine

Mosaic: Processing a Trillion-Edge Graph on a Single Machine Mosaic: Processing a Trillion-Edge Graph on a Single Machine Steffen Maass, Changwoo Min, Sanidhya Kashyap, Woonhak Kang, Mohan Kumar, Taesoo Kim Georgia Institute of Technology Best Student Paper @ EuroSys

More information

Lecture 4: Principles of Parallel Algorithm Design (part 4)

Lecture 4: Principles of Parallel Algorithm Design (part 4) Lecture 4: Principles of Parallel Algorithm Design (part 4) 1 Mapping Technique for Load Balancing Minimize execution time Reduce overheads of execution Sources of overheads: Inter-process interaction

More information

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs B. Barla Cambazoglu and Cevdet Aykanat Bilkent University, Department of Computer Engineering, 06800, Ankara, Turkey {berkant,aykanat}@cs.bilkent.edu.tr

More information

6. Parallel Volume Rendering Algorithms

6. Parallel Volume Rendering Algorithms 6. Parallel Volume Algorithms This chapter introduces a taxonomy of parallel volume rendering algorithms. In the thesis statement we claim that parallel algorithms may be described by "... how the tasks

More information

Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures

Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures Rolf Rabenseifner rabenseifner@hlrs.de Gerhard Wellein gerhard.wellein@rrze.uni-erlangen.de University of Stuttgart

More information

A Framework for Processing Large Graphs in Shared Memory

A Framework for Processing Large Graphs in Shared Memory A Framework for Processing Large Graphs in Shared Memory Julian Shun Based on joint work with Guy Blelloch and Laxman Dhulipala (Work done at Carnegie Mellon University) 2 What are graphs? Vertex Edge

More information

Efficient and Simplified Parallel Graph Processing over CPU and MIC

Efficient and Simplified Parallel Graph Processing over CPU and MIC Efficient and Simplified Parallel Graph Processing over CPU and MIC Linchuan Chen Xin Huo Bin Ren Surabhi Jain Gagan Agrawal Department of Computer Science and Engineering The Ohio State University Columbus,

More information

Scalable GPU Graph Traversal!

Scalable GPU Graph Traversal! Scalable GPU Graph Traversal Duane Merrill, Michael Garland, and Andrew Grimshaw PPoPP '12 Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Programming Benwen Zhang

More information

Blocking SEND/RECEIVE

Blocking SEND/RECEIVE Message Passing Blocking SEND/RECEIVE : couple data transfer and synchronization - Sender and receiver rendezvous to exchange data P P SrcP... x : =... SEND(x, DestP)... DestP... RECEIVE(y,SrcP)... M F

More information

GridGraph: Large-Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning. Xiaowei ZHU Tsinghua University

GridGraph: Large-Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning. Xiaowei ZHU Tsinghua University GridGraph: Large-Scale Graph Processing on a Single Machine Using -Level Hierarchical Partitioning Xiaowei ZHU Tsinghua University Widely-Used Graph Processing Shared memory Single-node & in-memory Ligra,

More information

Automatic Scaling Iterative Computations. Aug. 7 th, 2012

Automatic Scaling Iterative Computations. Aug. 7 th, 2012 Automatic Scaling Iterative Computations Guozhang Wang Cornell University Aug. 7 th, 2012 1 What are Non-Iterative Computations? Non-iterative computation flow Directed Acyclic Examples Batch style analytics

More information

A POWER CHARACTERIZATION AND MANAGEMENT OF GPU GRAPH TRAVERSAL

A POWER CHARACTERIZATION AND MANAGEMENT OF GPU GRAPH TRAVERSAL A POWER CHARACTERIZATION AND MANAGEMENT OF GPU GRAPH TRAVERSAL ADAM MCLAUGHLIN *, INDRANI PAUL, JOSEPH GREATHOUSE, SRILATHA MANNE, AND SUDHKAHAR YALAMANCHILI * * GEORGIA INSTITUTE OF TECHNOLOGY AMD RESEARCH

More information

GraphBLAS Mathematics - Provisional Release 1.0 -

GraphBLAS Mathematics - Provisional Release 1.0 - GraphBLAS Mathematics - Provisional Release 1.0 - Jeremy Kepner Generated on April 26, 2017 Contents 1 Introduction: Graphs as Matrices........................... 1 1.1 Adjacency Matrix: Undirected Graphs,

More information

Chapter 5: Analytical Modelling of Parallel Programs

Chapter 5: Analytical Modelling of Parallel Programs Chapter 5: Analytical Modelling of Parallel Programs Introduction to Parallel Computing, Second Edition By Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar Contents 1. Sources of Overhead in Parallel

More information

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION 6.1 INTRODUCTION Fuzzy logic based computational techniques are becoming increasingly important in the medical image analysis arena. The significant

More information

Dr. Amotz Bar-Noy s Compendium of Algorithms Problems. Problems, Hints, and Solutions

Dr. Amotz Bar-Noy s Compendium of Algorithms Problems. Problems, Hints, and Solutions Dr. Amotz Bar-Noy s Compendium of Algorithms Problems Problems, Hints, and Solutions Chapter 1 Searching and Sorting Problems 1 1.1 Array with One Missing 1.1.1 Problem Let A = A[1],..., A[n] be an array

More information

Bipartite Perfect Matching in O(n log n) Randomized Time. Nikhil Bhargava and Elliot Marx

Bipartite Perfect Matching in O(n log n) Randomized Time. Nikhil Bhargava and Elliot Marx Bipartite Perfect Matching in O(n log n) Randomized Time Nikhil Bhargava and Elliot Marx Background Matching in bipartite graphs is a problem that has many distinct applications. Many problems can be reduced

More information

A New Parallel Algorithm for Connected Components in Dynamic Graphs. Robert McColl Oded Green David Bader

A New Parallel Algorithm for Connected Components in Dynamic Graphs. Robert McColl Oded Green David Bader A New Parallel Algorithm for Connected Components in Dynamic Graphs Robert McColl Oded Green David Bader Overview The Problem Target Datasets Prior Work Parent-Neighbor Subgraph Results Conclusions Problem

More information

Review: Creating a Parallel Program. Programming for Performance

Review: Creating a Parallel Program. Programming for Performance Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)

More information

Parallel Performance Studies for a Clustering Algorithm

Parallel Performance Studies for a Clustering Algorithm Parallel Performance Studies for a Clustering Algorithm Robin V. Blasberg and Matthias K. Gobbert Naval Research Laboratory, Washington, D.C. Department of Mathematics and Statistics, University of Maryland,

More information

CS/ECE 566 Lab 1 Vitaly Parnas Fall 2015 October 9, 2015

CS/ECE 566 Lab 1 Vitaly Parnas Fall 2015 October 9, 2015 CS/ECE 566 Lab 1 Vitaly Parnas Fall 2015 October 9, 2015 Contents 1 Overview 3 2 Formulations 4 2.1 Scatter-Reduce....................................... 4 2.2 Scatter-Gather.......................................

More information

Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU

Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU Parallelization of Shortest Path Graph Kernels on Multi-Core CPUs and GPU Lifan Xu Wei Wang Marco A. Alvarez John Cavazos Dongping Zhang Department of Computer and Information Science University of Delaware

More information

CSC630/CSC730 Parallel & Distributed Computing

CSC630/CSC730 Parallel & Distributed Computing CSC630/CSC730 Parallel & Distributed Computing Analytical Modeling of Parallel Programs Chapter 5 1 Contents Sources of Parallel Overhead Performance Metrics Granularity and Data Mapping Scalability 2

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

Architectural Differences nc. DRAM devices are accessed with a multiplexed address scheme. Each unit of data is accessed by first selecting its row ad

Architectural Differences nc. DRAM devices are accessed with a multiplexed address scheme. Each unit of data is accessed by first selecting its row ad nc. Application Note AN1801 Rev. 0.2, 11/2003 Performance Differences between MPC8240 and the Tsi106 Host Bridge Top Changwatchai Roy Jenevein risc10@email.sps.mot.com CPD Applications This paper discusses

More information

A Distributed Multi-GPU System for Fast Graph Processing

A Distributed Multi-GPU System for Fast Graph Processing A Distributed Multi-GPU System for Fast Graph Processing Zhihao Jia Stanford University zhihao@cs.stanford.edu Pat M c Cormick LANL pat@lanl.gov Yongkee Kwon UT Austin yongkee.kwon@utexas.edu Mattan Erez

More information

Parallel Programming Concepts. Parallel Algorithms. Peter Tröger

Parallel Programming Concepts. Parallel Algorithms. Peter Tröger Parallel Programming Concepts Parallel Algorithms Peter Tröger Sources: Ian Foster. Designing and Building Parallel Programs. Addison-Wesley. 1995. Mattson, Timothy G.; S, Beverly A.; ers,; Massingill,

More information

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters

On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters 1 On the Comparative Performance of Parallel Algorithms on Small GPU/CUDA Clusters N. P. Karunadasa & D. N. Ranasinghe University of Colombo School of Computing, Sri Lanka nishantha@opensource.lk, dnr@ucsc.cmb.ac.lk

More information

Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory

Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory Roshan Dathathri Thejas Ramashekar Chandan Reddy Uday Bondhugula Department of Computer Science and Automation

More information

A new edge selection heuristic for computing the Tutte polynomial of an undirected graph.

A new edge selection heuristic for computing the Tutte polynomial of an undirected graph. FPSAC 2012, Nagoya, Japan DMTCS proc. (subm.), by the authors, 1 12 A new edge selection heuristic for computing the Tutte polynomial of an undirected graph. Michael Monagan 1 1 Department of Mathematics,

More information

Determining the Number of CPUs for Query Processing

Determining the Number of CPUs for Query Processing Determining the Number of CPUs for Query Processing Fatemah Panahi Elizabeth Soechting CS747 Advanced Computer Systems Analysis Techniques The University of Wisconsin-Madison fatemeh@cs.wisc.edu, eas@cs.wisc.edu

More information

High Performance Computing

High Performance Computing The Need for Parallelism High Performance Computing David McCaughan, HPC Analyst SHARCNET, University of Guelph dbm@sharcnet.ca Scientific investigation traditionally takes two forms theoretical empirical

More information

What is Parallel Computing?

What is Parallel Computing? What is Parallel Computing? Parallel Computing is several processing elements working simultaneously to solve a problem faster. 1/33 What is Parallel Computing? Parallel Computing is several processing

More information

GPU Sparse Graph Traversal

GPU Sparse Graph Traversal GPU Sparse Graph Traversal Duane Merrill (NVIDIA) Michael Garland (NVIDIA) Andrew Grimshaw (Univ. of Virginia) UNIVERSITY of VIRGINIA Breadth-first search (BFS) 1. Pick a source node 2. Rank every vertex

More information

An Algorithmic Approach to Communication Reduction in Parallel Graph Algorithms

An Algorithmic Approach to Communication Reduction in Parallel Graph Algorithms An Algorithmic Approach to Communication Reduction in Parallel Graph Algorithms Harshvardhan, Adam Fidel, Nancy M. Amato, Lawrence Rauchwerger Parasol Laboratory Dept. of Computer Science and Engineering

More information

Chapter 4: Implicit Error Detection

Chapter 4: Implicit Error Detection 4. Chpter 5 Chapter 4: Implicit Error Detection Contents 4.1 Introduction... 4-2 4.2 Network error correction... 4-2 4.3 Implicit error detection... 4-3 4.4 Mathematical model... 4-6 4.5 Simulation setup

More information

Designing High-Performance MPI Collectives in MVAPICH2 for HPC and Deep Learning

Designing High-Performance MPI Collectives in MVAPICH2 for HPC and Deep Learning 5th ANNUAL WORKSHOP 209 Designing High-Performance MPI Collectives in MVAPICH2 for HPC and Deep Learning Hari Subramoni Dhabaleswar K. (DK) Panda The Ohio State University The Ohio State University E-mail:

More information

Parallel Poisson Solver in Fortran

Parallel Poisson Solver in Fortran Parallel Poisson Solver in Fortran Nilas Mandrup Hansen, Ask Hjorth Larsen January 19, 1 1 Introduction In this assignment the D Poisson problem (Eq.1) is to be solved in either C/C++ or FORTRAN, first

More information

Efficient graph computation on hybrid CPU and GPU systems

Efficient graph computation on hybrid CPU and GPU systems J Supercomput (2015) 71:1563 1586 DOI 10.1007/s11227-015-1378-z Efficient graph computation on hybrid CPU and GPU systems Tao Zhang Jingjie Zhang Wei Shu Min-You Wu Xiaoyao Liang Published online: 21 January

More information

Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206

Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206 Principle Of Parallel Algorithm Design (cont.) Alexandre David B2-206 1 Today Characteristics of Tasks and Interactions (3.3). Mapping Techniques for Load Balancing (3.4). Methods for Containing Interaction

More information

Design of Parallel Algorithms. Models of Parallel Computation

Design of Parallel Algorithms. Models of Parallel Computation + Design of Parallel Algorithms Models of Parallel Computation + Chapter Overview: Algorithms and Concurrency n Introduction to Parallel Algorithms n Tasks and Decomposition n Processes and Mapping n Processes

More information

Lecture 9: Group Communication Operations. Shantanu Dutt ECE Dept. UIC

Lecture 9: Group Communication Operations. Shantanu Dutt ECE Dept. UIC Lecture 9: Group Communication Operations Shantanu Dutt ECE Dept. UIC Acknowledgement Adapted from Chapter 4 slides of the text, by A. Grama w/ a few changes, augmentations and corrections Topic Overview

More information

I/O CANNOT BE IGNORED

I/O CANNOT BE IGNORED LECTURE 13 I/O I/O CANNOT BE IGNORED Assume a program requires 100 seconds, 90 seconds for main memory, 10 seconds for I/O. Assume main memory access improves by ~10% per year and I/O remains the same.

More information

Graphs. The ultimate data structure. graphs 1

Graphs. The ultimate data structure. graphs 1 Graphs The ultimate data structure graphs 1 Definition of graph Non-linear data structure consisting of nodes & links between them (like trees in this sense) Unlike trees, graph nodes may be completely

More information

Chapter 18: Parallel Databases

Chapter 18: Parallel Databases Chapter 18: Parallel Databases Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 18: Parallel Databases Introduction I/O Parallelism Interquery Parallelism Intraquery

More information

Chapter 18: Parallel Databases. Chapter 18: Parallel Databases. Parallelism in Databases. Introduction

Chapter 18: Parallel Databases. Chapter 18: Parallel Databases. Parallelism in Databases. Introduction Chapter 18: Parallel Databases Chapter 18: Parallel Databases Introduction I/O Parallelism Interquery Parallelism Intraquery Parallelism Intraoperation Parallelism Interoperation Parallelism Design of

More information

Enabling Loop Parallelization with Decoupled Software Pipelining in LLVM: Final Report

Enabling Loop Parallelization with Decoupled Software Pipelining in LLVM: Final Report Enabling Loop Parallelization with Decoupled Software Pipelining in LLVM: Final Report Ameya Velingker and Dougal J. Sutherland {avelingk, dsutherl}@cs.cmu.edu http://www.cs.cmu.edu/~avelingk/compilers/

More information

An Experimental Comparison of Partitioning Strategies in Distributed Graph Processing

An Experimental Comparison of Partitioning Strategies in Distributed Graph Processing An Experimental Comparison of Partitioning Strategies in Distributed Graph Processing ABSTRACT Shiv Verma 1, Luke M. Leslie 1, Yosub Shin, Indranil Gupta 1 1 University of Illinois at Urbana-Champaign,

More information

LA3: A Scalable Link- and Locality-Aware Linear Algebra-Based Graph Analytics System

LA3: A Scalable Link- and Locality-Aware Linear Algebra-Based Graph Analytics System LA3: A Scalable Link- and Locality-Aware Linear Algebra-Based Graph Analytics System Yousuf Ahmad, Omar Khattab, Arsal Malik, Ahmad Musleh, Mohammad Hammoud Carnegie Mellon University in Qatar {myahmad,

More information

Current and Future Challenges of the Tofu Interconnect for Emerging Applications

Current and Future Challenges of the Tofu Interconnect for Emerging Applications Current and Future Challenges of the Tofu Interconnect for Emerging Applications Yuichiro Ajima Senior Architect Next Generation Technical Computing Unit Fujitsu Limited June 22, 2017, ExaComm 2017 Workshop

More information

Parallel Programming in the Age of Ubiquitous Parallelism. Keshav Pingali The University of Texas at Austin

Parallel Programming in the Age of Ubiquitous Parallelism. Keshav Pingali The University of Texas at Austin Parallel Programming in the Age of Ubiquitous Parallelism Keshav Pingali The University of Texas at Austin Parallelism is everywhere Texas Advanced Computing Center Laptops Cell-phones Parallel programming?

More information

Deterministic Parallel Graph Coloring with Symmetry Breaking

Deterministic Parallel Graph Coloring with Symmetry Breaking Deterministic Parallel Graph Coloring with Symmetry Breaking Per Normann, Johan Öfverstedt October 05 Abstract In this paper we propose a deterministic parallel graph coloring algorithm that enables Multi-Coloring

More information

TELCOM2125: Network Science and Analysis

TELCOM2125: Network Science and Analysis School of Information Sciences University of Pittsburgh TELCOM2125: Network Science and Analysis Konstantinos Pelechrinis Spring 2015 2 Part 4: Dividing Networks into Clusters The problem l Graph partitioning

More information

! Parallel machines are becoming quite common and affordable. ! Databases are growing increasingly large

! Parallel machines are becoming quite common and affordable. ! Databases are growing increasingly large Chapter 20: Parallel Databases Introduction! Introduction! I/O Parallelism! Interquery Parallelism! Intraquery Parallelism! Intraoperation Parallelism! Interoperation Parallelism! Design of Parallel Systems!

More information

Chapter 20: Parallel Databases

Chapter 20: Parallel Databases Chapter 20: Parallel Databases! Introduction! I/O Parallelism! Interquery Parallelism! Intraquery Parallelism! Intraoperation Parallelism! Interoperation Parallelism! Design of Parallel Systems 20.1 Introduction!

More information

Chapter 20: Parallel Databases. Introduction

Chapter 20: Parallel Databases. Introduction Chapter 20: Parallel Databases! Introduction! I/O Parallelism! Interquery Parallelism! Intraquery Parallelism! Intraoperation Parallelism! Interoperation Parallelism! Design of Parallel Systems 20.1 Introduction!

More information

Scalable Algorithmic Techniques Decompositions & Mapping. Alexandre David

Scalable Algorithmic Techniques Decompositions & Mapping. Alexandre David Scalable Algorithmic Techniques Decompositions & Mapping Alexandre David 1.2.05 adavid@cs.aau.dk Introduction Focus on data parallelism, scale with size. Task parallelism limited. Notion of scalability

More information

Some aspects of parallel program design. R. Bader (LRZ) G. Hager (RRZE)

Some aspects of parallel program design. R. Bader (LRZ) G. Hager (RRZE) Some aspects of parallel program design R. Bader (LRZ) G. Hager (RRZE) Finding exploitable concurrency Problem analysis 1. Decompose into subproblems perhaps even hierarchy of subproblems that can simultaneously

More information

Computation and Communication Efficient Graph Processing with Distributed Immutable View

Computation and Communication Efficient Graph Processing with Distributed Immutable View Computation and Communication Efficient Graph Processing with Distributed Immutable View Rong Chen, Xin Ding, Peng Wang, Haibo Chen, Binyu Zang, Haibing Guan Shanghai Key Laboratory of Scalable Computing

More information

CSE 599 I Accelerated Computing - Programming GPUS. Parallel Patterns: Graph Search

CSE 599 I Accelerated Computing - Programming GPUS. Parallel Patterns: Graph Search CSE 599 I Accelerated Computing - Programming GPUS Parallel Patterns: Graph Search Objective Study graph search as a prototypical graph-based algorithm Learn techniques to mitigate the memory-bandwidth-centric

More information

A TIMING AND SCALABILITY ANALYSIS OF THE PARALLEL PERFORMANCE OF CMAQ v4.5 ON A BEOWULF LINUX CLUSTER

A TIMING AND SCALABILITY ANALYSIS OF THE PARALLEL PERFORMANCE OF CMAQ v4.5 ON A BEOWULF LINUX CLUSTER A TIMING AND SCALABILITY ANALYSIS OF THE PARALLEL PERFORMANCE OF CMAQ v4.5 ON A BEOWULF LINUX CLUSTER Shaheen R. Tonse* Lawrence Berkeley National Lab., Berkeley, CA, USA 1. INTRODUCTION The goal of this

More information

Distributed minimum spanning tree problem

Distributed minimum spanning tree problem Distributed minimum spanning tree problem Juho-Kustaa Kangas 24th November 2012 Abstract Given a connected weighted undirected graph, the minimum spanning tree problem asks for a spanning subtree with

More information

Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research

Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research Joel Hestness jthestness@uwalumni.com Lenni Kuff lskuff@uwalumni.com Computer Science Department University of

More information

Homework # 2 Due: October 6. Programming Multiprocessors: Parallelism, Communication, and Synchronization

Homework # 2 Due: October 6. Programming Multiprocessors: Parallelism, Communication, and Synchronization ECE669: Parallel Computer Architecture Fall 2 Handout #2 Homework # 2 Due: October 6 Programming Multiprocessors: Parallelism, Communication, and Synchronization 1 Introduction When developing multiprocessor

More information

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004 A Study of High Performance Computing and the Cray SV1 Supercomputer Michael Sullivan TJHSST Class of 2004 June 2004 0.1 Introduction A supercomputer is a device for turning compute-bound problems into

More information

Irregular Graph Algorithms on Parallel Processing Systems

Irregular Graph Algorithms on Parallel Processing Systems Irregular Graph Algorithms on Parallel Processing Systems George M. Slota 1,2 Kamesh Madduri 1 (advisor) Sivasankaran Rajamanickam 2 (Sandia mentor) 1 Penn State University, 2 Sandia National Laboratories

More information

Scalability of Heterogeneous Computing

Scalability of Heterogeneous Computing Scalability of Heterogeneous Computing Xian-He Sun, Yong Chen, Ming u Department of Computer Science Illinois Institute of Technology {sun, chenyon1, wuming}@iit.edu Abstract Scalability is a key factor

More information

CuSha: Vertex-Centric Graph Processing on GPUs

CuSha: Vertex-Centric Graph Processing on GPUs CuSha: Vertex-Centric Graph Processing on GPUs Farzad Khorasani, Keval Vora, Rajiv Gupta, Laxmi N. Bhuyan HPDC Vancouver, Canada June, Motivation Graph processing Real world graphs are large & sparse Power

More information