
The Pennsylvania State University
The Graduate School
College of Engineering

FAST PARALLEL TRIAD CENSUS AND TRIANGLE LISTING ON SHARED-MEMORY PLATFORMS

A Thesis in Computer Science and Engineering
by
Sindhuja Parimalarangan

2016 Sindhuja Parimalarangan

Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science

May 2016

The thesis of Sindhuja Parimalarangan was reviewed and approved by the following:

Kamesh Madduri, Assistant Professor, Department of Computer Science and Engineering, Thesis Advisor
Mahmut Taylan Kandemir, Professor, Department of Computer Science and Engineering, Director of Graduate Studies
John Hannan, Associate Professor, Department of Computer Science and Engineering, Interim Associate Department Head

Signatures are on file in the Graduate School.

Abstract

Triad census and triangle counting are essential graph analysis measures used in areas such as social network analysis and systems biology. Triad census is a graph analytic used for comparative network analysis and to characterize local structure in directed networks. For large sparse graphs, an algorithm by Batagelj and Mrvar is considered the state of the art for computing the triad census. In this work, we present a new parallel algorithm for triad census. Our algorithm takes advantage of a specific graph vertex identifier ordering to reduce the operation count. We also develop several new variants for exact triangle counting and triangle listing in large sparse, undirected graphs. Further, we implement a parallel sampling-based algorithm for approximate triangle counting. We show that our parallel triangle counting variants outperform other recently developed triangle counting methods on current Intel multicore and manycore processors. We also achieve good strong scaling for both triad census and triangle counting on these platforms.

Table of Contents

List of Figures
List of Tables
Acknowledgments

Chapter 1. Introduction
Chapter 2. Background
    2.1 Triangle Counting
    2.2 Triad Census
Chapter 3. New Serial and Parallel Algorithms
    3.1 Triangle Counting
    3.2 Approximate Triangle Counting
    3.3 Triad Census
Chapter 4. Performance Discussion
    4.1 Experimental Methodology
    4.2 Results and Performance Analysis
        Triad Census
        Triangle Counting
        Approximate Triangle Counting
        Performance Scaling
        Impact of ordering on overall performance
        Performance comparisons to prior work
Chapter 5. Conclusions and Future Work
Bibliography

List of Figures

2.1 A directed graph representation for a canonical triangle
2.2 16 isomorphism classes for triads in directed graphs
3.1 Illustrating all possible triangle counting algorithm variants
4.1 Triad census analysis of various graphs
4.2 Parallel scaling of triangle counting methods on SNB and KNC processors
4.3 Parallel scaling of triad census methods on SNB and KNC processors
4.4 Parallel scaling of approximate triangle counting on a single node of Lion-XG (SNB)

List of Tables

3.1 Operation counts for all the counting variants
4.1 Directed graphs used to evaluate performance of our new triad census approaches. TC indicates total number of connected triads
4.2 Undirected graphs used to evaluate performance of our new triangle counting approaches. TC indicates total triangle count
4.3 Triad census execution times (in seconds)
4.4 Triangle counting performance on a Sandy Bridge node
4.5 Triangle counting performance on KNC
4.6 Serial approximate triangle counting performance on a single compute node of Lion-XG (SNB)
4.7 Parallel approximate triangle counting performance on a single compute node of Lion-XG (SNB)
4.8 Performance impact of ordering strategy (NAT: natural, SF: smaller degree first, LF: larger degree first) for parallel triad census and parallel triangle counting on SNB (16 cores). Table values are performance improvements over random vertex ordering (higher values are better)

Acknowledgments

I would like to heartily thank my advisor, Prof. Kamesh Madduri, for his guidance and support in my research. I extend my warm regards to my family and friends for their unwavering support. This work is supported by the US National Science Foundation grants ACI and CCF .

Chapter 1

Introduction

A triad is a subgraph on three vertices. There are different connectivity patterns that can occur between three nodes in a graph [1]. In an undirected graph, there are four possible triads: all three vertices connected to each other; the null triad, with no connections between the three vertices; the one-edge subgraph; and the two-edge subgraph. In a directed graph, sixteen nonisomorphic triads can be enumerated based on the direction of edges and the number of vertices connected to each other. Those that involve only two vertices connected by one or more directed edges are also referred to as dyads. The triad census of a directed graph refers to determining the number of each of the 16 kinds of triads, i.e., determining the frequencies of isomorphic triads [2]. It is possible, and relevant for certain applications, to count only one or a few of the 16 triad types.

When all three vertices are connected in an undirected graph, the triad is called a triangle. A triangle is considered to be the most basic non-trivial subgraph. The fundamental triangle-related problems are triangle finding, triangle counting (determining the number of triangles), and triangle listing (reporting the vertices involved in each triangle found) [3] [4]. Triangle counting is generally classified into two types: global triangle counting and local triangle counting. Global triangle counting involves finding the total number of triangles in an undirected graph, while local triangle counting calculates the number of triangles incident to each node. The latter can be used to yield results for the former, but there are more efficient methods for global triangle counting. Approximation procedures and heuristics have been developed for both triad census and triangle counting [5] [6]. Depending on the application, either exact or approximate methods of triangle counting are used.

The notion of a triad has its roots in sociology and social network analysis,

with work on the triadic closure concept motivating it [7]. Triad census is a graph analytic used for comparative network analysis and to characterize local structure in directed networks [8]. The aforementioned 16 triads can be further classified into mutual, asymmetric, and null triads (none of the three vertices connected by an edge), each with an associated attribute of up, down, cyclical, or transitive. The mutual attribute depicts a bidirectional edge, while asymmetric denotes a unidirectional edge between two vertices. This classification aids in identifying the required pattern for a specific application.

Triad census helps in detecting various motifs and structural properties between three nodes of a network. For example, in a friendship network, a fully connected triad would depict three mutual friends, increasing the likelihood of their neighbors becoming friends too. Triad census gives a measure of reciprocity of relations within a network, which can lead to deductions about the stability and hierarchy of the network. Triad distribution and density can be used to detect strongly connected components on the internet, which can further aid congestion control and bandwidth allocation. Triadic transitivity (or intransitivity) analysis in directed graphs provides key information about graph equilibrium and the direction of graph growth. Too little transitivity could indicate a disorganized portion of the graph, while high transitivity would mean that the portion is internally highly clustered but isolated from the rest of the graph.

Statistics of triangles in an undirected graph are elemental for complex network analysis attributes like the clustering coefficient and transitivity ratio [9]. Triangle counting plays an important role in security graph applications. Triangle distributions can be used to classify spam and non-spam hosts on a network.
Triangles in a graph of web pages denote mutual recognition and are used as a seed to identify thematic structures in the graph. Triangles (like triads) depict homophily and transitivity in social networks [9]. They contribute to the computation of the Jaccard index, a measure of difference (or similarity) between communities. This information, in conjunction with other data, can predict future vertex or edge additions. Triangles are abundant in protein-interaction networks [10], which facilitates linkage between structural and functional properties of these biological networks.

For large (and usually sparse) real-world networks, naive triad census and triangle counting algorithms are inefficient with respect to space, time, or both. Information from triad census and triangle counting is widely used in various

applications. A major part of these algorithms involves loop computations with high potential for parallelization. Vertex ordering strategies and smart data structures have a sizeable impact on the time and space complexity of the algorithms. This motivates implementing the algorithms in a faster and more space-efficient way, and is a driving factor for the approximate algorithms and parallelization strategies implemented in both shared and distributed memory. With the advent and availability of multicore and manycore processors, implementations are shaped to increase performance by utilizing the inherent parallel architecture and cache structure of these processors. The simplicity and scalability of the OpenMP standard promote its use for parallelized execution on shared-memory platforms. Exploring such optimization methods has been the focus of this thesis.

Chapter 2

Background

Numerous applications have motivated the investigation of a multitude of algorithmic approaches for triangle counting and triad census [11]. The algorithms vary in memory constraints and time complexity, as well as in parallelization strategies for implementation on shared- and distributed-memory platforms.

2.1 Triangle Counting

The fastest-known algorithms for triangle counting are based on matrix multiplication, with running time Θ(n^γ), γ < 2.376. The ones frequently used in practice have been described by Latapy [4]. Let G(V, E) be an undirected graph with n = |V| vertices and m = |E| edges. Further assume that G is a simple graph (no multiple edges) with no self-loops. Let Adj(v) be the set of adjacencies of v, Adj(v) = {u : (v, u) ∈ E}. The degree of v, d_v, is the size of Adj(v). For undirected graphs, Σ_{v∈V} d_v = 2m. Let d_max = max_{v∈V} d_v. Crude triangle counting algorithms take Θ(n³) time and Θ(n²) space.

Matthieu Latapy [4] describes a series of space-efficient and time-efficient algorithms for triangle counting. The first algorithm counts triangles by determining the number of triangles in two parts of the graph and then merging the obtained results. The graph is partitioned by a threshold K, where K ∈ Θ(m^((ω−1)/(ω+1))). The triangles formed by vertices {v : d_v < K} are counted based on the intersection of their adjacency lists, along with constraints to ensure that the same triangle is not counted twice. The number of triangles formed by vertices {v : d_v > K} is determined by a fast matrix product of the adjacency matrix. The two results are merged to get a total count. Although such an algorithm can take advantage of sparse matrix computations, it has a restrictive space complexity in addition to error-prone fast matrix implementations. It is demonstrated that, given the adjacency matrix representation, crude triangle listing can be achieved in Θ(n³) time and Θ(n²) space. The main drawback of this is that the time complexity is the same for sparse and dense graphs. Listing procedures based on vertex iterators and edge iterators improve upon the time complexity to Θ(m·d_max) for sparse graphs by using adjacency array representations. However, the performance of these algorithms degrades when the maximum degree is unbounded.

An improved algorithm for sparse graphs, referred to as the forward algorithm and proposed earlier by Schank and Wagner [2], is presented with Θ(m^(3/2)) time and Θ(m) space complexity. It uses only the adjacency array representation and an injective function η() such that for any two vertices u, v ∈ V, η(u) < η(v) if d(u) > d(v). Every vertex u is associated with an array A(u) = {v ∈ Adj(u) : η(v) < η(u)}. Walking the graph in ascending order of η() and looking for intersections in the A() lists ensures that there are no duplicates in triangle counting.

Algorithm 1 Compact-forward algorithm for triangle listing.
1: procedure CompactForward(V, E)
2:    Renumber the vertices of G according to η(), where η(u) < η(v) if d(u) > d(v)
3:    Sort the adjacency array associated with each vertex by η()
4:    for each vertex v ∈ V, taken in order of η() do
5:        for each vertex u ∈ Adj(v) with η(u) > η(v) do
6:            u′ ← first neighbor of u; v′ ← first neighbor of v
7:            while there are untreated neighbors of u and v and η(u′) < η(v) and η(v′) < η(v) do
8:                if η(u′) < η(v′) then set u′ to the next neighbor of u
9:                else if η(u′) > η(v′) then set v′ to the next neighbor of v
10:               else list {v, u, u′} as a triangle; set u′ to the next neighbor of u; set v′ to the next neighbor of v

The forward algorithm has been further improved upon for space complexity.
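For concreteness, the forward strategy just described can be sketched in a few lines of Python. This is a minimal rendering, not the thesis implementation: it uses Python set intersection in place of the sorted-merge step of the compact-forward variant, and the function name is ours.

```python
def forward_triangles(adj):
    """List each triangle of an undirected graph exactly once.

    adj: dict mapping vertex -> set of neighbors.
    """
    # eta assigns smaller labels to higher-degree vertices:
    # eta(u) < eta(v) if d(u) > d(v)
    order = sorted(adj, key=lambda v: -len(adj[v]))
    eta = {v: i for i, v in enumerate(order)}
    A = {v: [] for v in adj}   # A[v]: neighbors of v already processed
    triangles = []
    for s in order:            # walk vertices in ascending eta order
        for t in adj[s]:
            if eta[t] > eta[s]:
                # every common vertex of A[s] and A[t] closes a triangle
                for w in set(A[s]) & set(A[t]):
                    triangles.append((w, s, t))
                A[t].append(s)
    return triangles

# A 4-clique contains exactly 4 triangles, each listed once:
k4 = {v: {u for u in range(4) if u != v} for v in range(4)}
print(len(forward_triangles(k4)))  # 4
```

Because a vertex s is appended to A(t) only after all of its lower-η neighbors have been processed, every triangle is reported exactly once, matching the duplicate-avoidance argument above.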
The forward algorithm requires the arrays η() and A() to be stored in memory throughout the graph computations. In the newer compact-forward algorithm, the vertices are renumbered by η(), and there is no need to maintain the η() array after that. Vertex ordering by degree in this algorithm yields a time complexity of Θ(m^(3/2)) and a space complexity of Θ(2m + 2n).

Another technique proposed by Schank and Wagner to bring down execution

time [2] is a variant of the edge-iterator algorithm that attempts to cut down on time by using hash containers for the adjacencies of each vertex. An intersection of adjacencies can be determined by checking, for each element of the smaller container, whether it is present in the larger container. It has a time complexity of Θ(Σ_{{u,v}∈E} min{d_u, d_v}). This is in fact a precursor to the forward algorithm described above, and the forward algorithm also has a hashed implementation. Such algorithms, however, require complex data structures and hash functions for their implementations.

Ortmann and Brandes [3] classify triangle counting algorithms into two types: neighborhood intersection and adjacency testing. In neighborhood intersection, all edges are iterated over, and an intersection is checked for between the adjacencies of the incident vertices. The edge-iterator, forward, and compact-forward algorithms discussed earlier fall in this category. Adjacency testing comprises two stages: marking the adjacencies of a vertex, and scanning the neighbors of each adjacency to look for a marked vertex. It can be faster than neighborhood intersection, as the scanning step is performed via bit vectors. However, it requires additional space to store these bit vectors and poses a latency challenge when accessing their elements. Node-iterator algorithms fall in this category. This classification is the same as our adjacency intersection (AI)-based and adjacency marking (AM)-based methods. Algorithms 2 and 3 give the templates for each of these methods. The key distinction is that the AM-based methods use additional storage for adjacency lookups and perform faster set intersections in comparison to the AI-based methods.

Algorithm 2 The general structure of adjacency intersection-based triangle counting algorithms.
1: procedure TriCount-AI(V, E)
2:    tc ← 0
3:    for all v ∈ V do
4:        for all u ∈ Adj(v) do
5:            tc ← tc + |Adj(v) ∩ Adj(u)|
6:    return tc / 3

Performance of triangle counting algorithms can also be improved by vertex reordering and renumbering. Degree ordering (vertices sorted by ascending or descending order of their degree) and smallest-first ordering (used commonly in graph coloring) are some of the common techniques. In Figure 2.1, we

Algorithm 3 The general structure of adjacency marking-based triangle counting algorithms.
1: procedure TriCount-AM(V, E)
2:    tc ← 0
3:    for all v ∈ V do
4:        for all u ∈ Adj(v) do
5:            mark u
6:        for all u ∈ Adj(v) do
7:            for all w ∈ Adj(u) do
8:                if w is marked then tc ← tc + 1
9:        for all u ∈ Adj(v) do
10:           unmark u
11:   return tc / 3

show one way of generating a canonical representation, which is by selecting only the triples u, v, w that satisfy u < v < w. Note that this canonical representation can also be used to convert an undirected triangle to a directed one, with edges oriented from vertices with lower labels to vertices with higher labels. The edges can be labeled as two short edges (S1, S2) and one long edge (L). Both orderings follow the rationale that the time complexity of choosing the next vertex for a triangle check is bounded by the maximum degree of the graph G. It has been observed that determining smallest-first ordering takes a little more time, and thus degree ordering has an overall better time complexity. However, smallest-first ordering is more suitable for small-world graphs, while degree ordering is more suitable for graphs with skewed degree distributions.

Figure 2.1: A directed graph representation for a canonical triangle.

Large graph data has posed a challenge in terms of storing and processing

it in a time- and space-efficient manner. This has been dealt with via a combination of three methods: graph partitioning [12], approximation techniques [13], and parallelization [14]. Prudent graph partitioning practices take into account the cache structure of the implementation platform and store only a chunk of the graph data in main memory at any point in time. The MapReduce approach is commonly used for parallelizing graph partitioning algorithms for triangle counting [15] [16]. Suri and Vassilvitskii [17] present a MapReduce algorithm that distributes work according to the memory available in each node. Their algorithm is agnostic of the sequential triangle counting algorithm itself. Park et al. [18] extend this algorithm to larger graphs by increasing the maximum load each reducer node can handle.

Approximate triangle counting has been approached using a variety of strategies, each with trade-offs in memory, speed, and accuracy. A parallel multilevel shared-memory implementation for subgraph enumeration called FASCIA (Fast Approximate Subgraph Counting and Enumeration) [6] performs approximate triangle counting using color-coding techniques. This tool has support for both shared- and distributed-memory systems.

Shun and Tangwongsan [19] recently developed several shared-memory parallel schemes for triangle listing and related problems. They address the load balancing difficulties in triangle counting by resorting to dynamic multithreading. The parallel algorithms designed are cache-oblivious [20], eliminating complex tuning requirements. They report parallel performance results of their implementations on a quad-socket, 40-core Intel server. Their main approach is also based on the compact-forward algorithm. They cover two approaches for exact triangle counting, namely the merge approach and the hash approach, to arrive at the intersection between the adjacency lists of the two vertices concerned.
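The two intersection kernels mentioned above can be sketched as follows. This is illustrative Python with names of our choosing, not code from [19]; the merge variant assumes the adjacency arrays are sorted.

```python
def intersect_merge(a, b):
    """Count common elements of two sorted adjacency arrays by striding."""
    i = j = count = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            count += 1
            i += 1
            j += 1
    return count

def intersect_hash(a, b):
    """Count common elements by probing a hash set built from the smaller list."""
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    lookup = set(small)
    return sum(1 for x in large if x in lookup)

print(intersect_merge([1, 3, 5, 7], [3, 4, 5, 9]))  # 2
```

The merge kernel touches both lists once (Θ(p + q) for lists of sizes p and q) and needs no extra memory, while the hash kernel trades auxiliary space for constant-time membership tests.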
This work also reviews several prior approaches to exact and approximate triangle counting. We present a detailed empirical comparison of our new methods to the fastest approach from [19].

J. Kim et al. [21] propose an Overlapped and Parallel Triangulation framework for multicore platforms called OPT. Their algorithm is a triangle listing algorithm based on the edge-iterator and vertex-iterator procedures. The paper deals with large-scale graphs by dividing them into smaller graphs. Each small graph is loaded into memory and the algorithm is executed over it; this is done repeatedly until the entire graph is covered. They also have a parallel implementation of the algorithm on multicore platforms. The algorithm overlaps triangle computations in internal

and external memory. Their work addresses triangle listing in dynamic graph processing in light of parallel in-memory computations. Chu and Cheng [22] also deal with exact triangle listing for graphs that cannot fit in main memory. Local triangle counts of the partitioned graphs are combined to provide the global triangle count. Their focus is on avoiding random memory accesses.

For applications that entail local triangle statistics, L. Becchetti et al. [23] describe two semi-streaming approximate triangle counting algorithms. Other implementations of approximate triangle counting include those based on the eigenvalues of the adjacency matrix of a graph [24]. These have shown improved speed compared to algorithms that rely on intersecting the adjacency lists of the involved vertices. Tsourakakis et al. [25] propose a parallelizable preprocessing method called DOULION. This approximate triangle estimation algorithm selects the vertices and edges to be checked for triangle incidence on the basis of a sparsification parameter. Yet another probabilistic model for estimating the global triangle count uses results from the birthday paradox [26]. This is a streaming algorithm, and the birthday paradox is used to predict the number of closed wedges from the stream of edges. It is more space efficient, requiring Θ(√n) space under the conditions of constant transitivity and a higher number of edges compared to wedges.

Wedge sampling has been used to provide triangle counts with high accuracy [5]. This is based on the observation that triangles are closed wedges, so estimating the number of closed wedges yields the global clustering coefficient and hence the global triangle count. The approximate count is determined by inspecting k sampled wedges for closure.

Algorithm 4 An outline of the wedge sampling algorithm for approximate triangle counting.
1: procedure WedgeSampling(V, E)
2:    Determine the wedge probability distribution W_v
3:    Select k wedge centers at random according to W_v
4:    for each selected center v do
5:        Choose two vertices u1, u2 ∈ Adj(v) uniformly at random
6:        Check whether the wedge (u1, v, u2) is closed by an edge between u1 and u2
7:    return estimate of triangle count X

The value of k is determined by two constants δ and ε: k = 0.5 ε⁻² ln(2/δ).
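A minimal Python sketch of this sampling scheme follows. It assumes an adjacency-set graph representation; the function name and parameter defaults are illustrative choices of ours, not the thesis implementation.

```python
import random
from math import ceil, log

def wedge_sample_triangle_estimate(adj, eps=0.1, delta=0.05, seed=1):
    """Estimate the global triangle count by sampling wedges."""
    k = ceil(0.5 * eps**-2 * log(2 / delta))   # k = 0.5 * eps^-2 * ln(2/delta)
    rng = random.Random(seed)
    # number of wedges centered at v: d_v choose 2
    wedges = {v: len(adj[v]) * (len(adj[v]) - 1) // 2 for v in adj}
    total_wedges = sum(wedges.values())
    centers = [v for v in adj if wedges[v] > 0]
    weights = [wedges[v] for v in centers]
    closed = 0
    for _ in range(k):
        v = rng.choices(centers, weights=weights)[0]  # center picked ∝ its wedge count
        u1, u2 = rng.sample(sorted(adj[v]), 2)        # a uniform wedge at v
        if u2 in adj[u1]:                             # wedge is closed: a triangle
            closed += 1
    kappa = closed / k                 # estimated global clustering coefficient
    return kappa * total_wedges / 3    # each triangle closes exactly 3 wedges
```

On a 4-clique every sampled wedge is closed, so the estimate is exact: 12 wedges / 3 = 4 triangles.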

With this value of k, Algorithm 4 outputs an estimate X of the triangle count TC such that the relative error |X − TC|/TC is less than ε with probability greater than (1 − δ). Such computations make k independent of graph size; a common k value suffices for medium-sized as well as large graphs, which greatly reduces the relative number of operations. However, all m edges are used to compute the wedge probability distribution W_v. Although only k vertices are considered in computing the clustering coefficient (and hence the triangle count), tuning the parameters ε and δ can help achieve extremely high accuracy. This is also because these k vertices are picked according to the wedge probability distribution of the graph. Seshadhri et al. [5] report an accuracy of 99.9% using the wedge sampling algorithm.

2.2 Triad Census

As shown in Figure 2.2, triads in directed graphs can be divided into 16 isomorphism classes. When considering any three vertices in a directed graph, we can have one of three cases: a null triad, with none of the vertices connected to one another; dyadic triads, with only two of the three vertices connected; and connected triads, where all three vertices are connected to one or more of the other two vertices. The connections can be asymmetric or mutual (unidirectional or bidirectional for the vertices concerned). The patterns in the figure are further labeled based on the edge direction (U: up, D: down), transitivity (T: transitive), and cyclicity (C) [27]. The naming of the triad types follows a three-number rule: the number of mutual dyads, the number of asymmetric dyads, and the number of null dyads.

Like triangle counting algorithms, sequential triad census algorithms also use a canonical ordering of vertices [28] [29] to avoid repeated lookups. Moody [1] provides matrix-based equations to compute each type of triad separately with O(n²) complexity.
These equations are useful for scanning the graph for a specific triad type, but simultaneously determining the counts of all triads reduces the total amount of data accessed and increases reuse. Batagelj and Mrvar [30] present a subquadratic triad census algorithm of complexity O(m·d_max) that is suitable for graphs with a low d_max. This algorithm is implemented in the Pajek [31] graph analysis software package. Apart from providing visualization features, Pajek moves away from matrix-representation-based network

Figure 2.2: 16 isomorphism classes for triads in directed graphs.

analysis to more sophisticated techniques. While the asymptotic bounds for triad census are the same as for triangle counting, this algorithm greatly reduces the number of adjacency intersections required to perform the census. Algorithm 5 lists the main routine. N(v) denotes the set of all neighbors of v; in addition to adjacencies from outgoing arcs, incoming arcs are also considered in this set. Canonical ordering is used to ensure every triad is counted only once. The algorithm uses a simple subroutine called TriCode and an array called TriTypes to reduce the number of conditional statements. TriCode inspects the specific connectivity pattern of u, v, and w and assigns a value between 0 and 63 to the triple currently being inspected. This value is then used to index the TriTypes array to determine the triad type corresponding to this pattern. Connected triads are identified in the main

Algorithm 5 Batagelj–Mrvar [30] triad census algorithm.
1: procedure TriadCensus-BM(V, E)
2:    for i ← 1 to 16 do
3:        C[i] ← 0
4:    for all v ∈ V do
5:        for all u ∈ N(v) do
6:            if v < u then
7:                S ← (N(u) ∪ N(v)) \ {u, v}
8:                if v ∈ Adj(u) and u ∈ Adj(v) then
9:                    tritype ← 3
10:               else
11:                   tritype ← 2
12:               C[tritype] ← C[tritype] + n − |S| − 2
13:               for all w ∈ S do
14:                   if (u < w) or (v < w and w < u and w ∉ N(v)) then
15:                       tricode ← TriCode(v, u, w)
16:                       tritype ← TriTypes[tricode]
17:                       C[tritype] ← C[tritype] + 1
18:   sum ← 0
19:   for i ← 2 to 16 do
20:       sum ← sum + C[i]
21:   C[1] ← n(n−1)(n−2)/6 − sum
22:   return C

algorithmic nested loop (line 13), while dyadic triads are counted based on the number of transitive edges (lines 8 to 12). Finally, null triads are not explicitly computed, but instead determined using the total triad count of n(n−1)(n−2)/6. Batagelj and Mrvar applied this algorithm to internet routing data and reported their triad census results. It is considered the state of the art, and we use it as the basis for the new algorithms described in Chapter 3.

Chin et al. [32, 33] discuss parallelizations of the Batagelj–Mrvar sequential algorithm on shared-memory architectures (Cray XMT) and evaluate performance with loop futures and interleaved scheduling techniques. Seshadhri et al. [5] present a novel strategy to approximately count each triad pattern. They adapt the wedge sampling strategy for this purpose, which they also use to determine triangle counts.
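The TriCode/TriTypes idea can be demonstrated in Python. The bit order and class numbering below are our own illustrative choices (Batagelj and Mrvar fix a specific encoding); the class table is derived generically by canonicalizing the 64 codes under vertex permutations, which recovers exactly the 16 isomorphism classes.

```python
from itertools import permutations

# bit order: one bit per possible arc among the abstract triple (0, 1, 2)
ARCS = [(0, 1), (1, 0), (0, 2), (2, 0), (1, 2), (2, 1)]

def tricode(adj, v, u, w):
    """6-bit connectivity code (0..63) of the ordered triple (v, u, w)."""
    name = {0: v, 1: u, 2: w}
    code = 0
    for bit, (a, b) in enumerate(ARCS):
        if name[b] in adj[name[a]]:   # arc name[a] -> name[b] present?
            code |= 1 << bit
    return code

def build_tritypes():
    """Map each of the 64 codes to an isomorphism-class index."""
    def canon(code):
        arcs = [ARCS[b] for b in range(6) if code >> b & 1]
        # canonical form: lexicographically smallest arc set over the
        # 6 permutations of the three vertices
        return min(tuple(sorted((p[a], p[b]) for a, b in arcs))
                   for p in permutations(range(3)))
    classes, table = {}, [0] * 64
    for code in range(64):
        table[code] = classes.setdefault(canon(code), len(classes))
    return table

tritypes = build_tritypes()
print(len(set(tritypes)))  # 16 -- the triad isomorphism classes
```

Two triples related by a vertex permutation receive codes that map to the same class, which is exactly what lets the census replace per-triple conditionals with one table lookup.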

Chapter 3

New Serial and Parallel Algorithms

We first introduce the new parallel approaches for triangle counting, followed by triad census, with implementation details. These new algorithms aim to improve both time and space efficiency. This is achieved by experimenting with various existing techniques and enhancing them with vertex ordering and cache-efficient parallelization procedures.

3.1 Triangle Counting

We use the adjacency intersection and adjacency marking classification from the previous section, and systematically list all possible variants for canonical triangle counting and triangle listing; see Figure 3.1. We define Adj⁺(v) to be the subset of Adj(v) comprising vertices w with w > v, and d⁺_v to be the cardinality of Adj⁺(v). Adj⁻(v) and d⁻_v are defined similarly. For an undirected graph with no self-loops, d_v = d⁻_v + d⁺_v and Adj⁺(v) ∩ Adj⁻(v) = ∅. Intersection-based algorithms are typically structured as shown in Algorithm 2, and marking-based algorithms are similar to Algorithm 3. The three AI algorithms shown in the figure differ by the adjacency sets that are tested for common elements. All three variants search for the canonical triangle (v, u, w), v < u < w, thus avoiding duplicate counting. The six AM variants differ by which pair of adjacency sets are involved in the marking and scanning process. Again, they all maintain the same vertex ordering.

In our implementation of adjacency intersection-based methods, we perform a set

Table 3.1: Operation counts for all the counting variants.

Variant   Operation count                          Comments
AI1       O(Σ_v d⁺_v + Σ_v d⁺_v·d_v)               use with low-degree-first ordering
AI2       O(Σ_v d⁻_v + Σ_v d⁻_v·d_v)               use high-degree-first ordering to get compact-forward [4]
AI3       O(Σ_v (d⁺_v)² + Σ_v (d⁻_v)²)             better to use AI1/AI2 with degree-based ordering
AM1       O(Σ_v d⁺_v + Σ_v d⁺_v·d⁻_v)              does not exploit ordering
AM2       O(Σ_v d⁻_v + Σ_v (d⁻_v)²)                use with high-degree-first ordering
AM3       O(Σ_v d⁺_v + Σ_v (d⁻_v)²)                similar to AM2
AM4       O(Σ_v d⁺_v + Σ_v (d⁺_v)²)                use with low-degree-first ordering
AM5       O(Σ_v d⁻_v + Σ_v d⁺_v·d⁻_v)              similar to AM1
AM6       O(Σ_v d⁻_v + Σ_v (d⁺_v)²)                similar to AM4

intersection by striding through the sorted sequences of adjacencies and identifying common entries. This is similar to the merge routine used in the merge sort algorithm. For two sorted lists of sizes p and q, determining their intersection requires Θ(p + q) operations using this simple merge-like routine. The main difference between the three intersection-based algorithms is the choice of adjacency lists to intersect. Again, we use the ordering v < u < w to guide us. We can also precisely determine the operation count for the overall algorithm in terms of vertex degrees. For instance, each intersection in AI1 requires O(d⁺_v + d⁺_u) operations, and so the overall operation count is O(Σ_{v∈V} d⁺_v + Σ_{(v,u)∈E, v<u} (d⁺_v + d⁺_u)). The second term simplifies to O(Σ_{v∈V} d⁺_v·d_v), giving the overall bounds shown in Table 3.1. We similarly derive the bounds for the two other AI variants.

The marking-based approaches essentially perform set intersections using O(n) auxiliary space. This reduces the operation counts to the ones given in

Figure 3.1: Illustrating all possible triangle counting algorithm variants. Adjacency intersection-based: (1) Adj⁺[v] ∩ Adj⁺[u]; (2) Adj⁻[u] ∩ Adj⁻[w]; (3) Adj⁺[v] ∩ Adj⁻[w]. Adjacency marking-based: (1) mark Adj⁺[v], scan Adj⁺[u]; (2) mark Adj⁻[u], scan Adj⁻[w]; (3) mark Adj⁺[v], scan Adj⁻[w]; (4) mark Adj⁺[u], scan Adj⁺[v]; (5) mark Adj⁻[w], scan Adj⁻[u]; (6) mark Adj⁻[w], scan Adj⁺[v].

Table 3.1. Note that all these variants count correctly for any vertex labeling. However, some variants perform fewer operations than others for certain vertex orderings. Consider graphs with highly skewed vertex degree distributions. If we reorder the vertices such that vertices of lower degree are assigned lower vertex identifiers, we reduce Σ_v (d⁺_v)². This is the theoretical justification for using the AI1 and AM4 variants and reordering vertices such that lower-degree vertices are assigned low vertex identifiers, directly reducing the operation count. A further improvement over this simple degree-based vertex ordering is to use a core number-based vertex ordering, as suggested by Ortmann and Brandes [3]. Ortmann and Brandes further show that the running time bound under favorable orderings is O(m·α(G)), where α(G) is the arboricity of the graph. The arboricity of real-world sparse graphs is typically low [29], and hence these ordering heuristics work well in practice.

In Algorithms 6 and 7, we list the pseudocode for our parallel implementations

Algorithm 6 Our parallel adjacency intersection (AI)-based triangle counting algorithm (variant 1), exploiting vertex ordering and avoiding redundant counting.
1: procedure TriCountOptPar-AI(V, E)
2:   tc_l ← 0                              ▷ thread-local count
3:   for all v ∈ V pardo                   ▷ Parallelize
4:     for all u ∈ Adj⁺(v) do
5:       tc_l ← tc_l + |Adj⁺(v) ∩ Adj⁺(u)|
6:   tc ← sum of all thread-local counts tc_l
7:   return tc

of AI1 and AM4, respectively. On shared-memory systems, the outer loop over the vertices can be partitioned among multiple threads, with each thread updating a local variable to track the triangle count. These values are aggregated at the end to obtain the global triangle count. This scheme is thus fairly simple to parallelize, with the only synchronization required at the end. However, note that with the degree-based ordering of vertices, the operation counts and running time increase as v increases. A few threads may have to work longer than the rest (for example, scanning more vertices because a thread is assigned a high-degree vertex), leaving the other threads idle, since the final triangle count cannot be computed until all threads have contributed their individual counts. A naive static partitioning of the outer loop across threads may thus lead to considerable load imbalance. We explore different loop scheduling strategies in our empirical evaluation of these methods. Algorithm 7 lists the pseudocode for the AM4 variant. In our implementation, we use a per-thread bit vector to mark adjacencies. For a graph with 50 million vertices and 250 threads of execution, this scheme requires about 1.5 GB of additional memory to store the bit vectors, and so the AM variants remain applicable to a large class of graphs. This approach trades strided adjacency array accesses for potentially random memory lookups. However, if the bit vector can be cached and reused, the random memory access latency can be amortized.
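As a concrete illustration, a minimal C/OpenMP sketch of the AI1 scheme of Algorithm 6 follows, assuming a CSR representation whose adjacency arrays store only higher-numbered neighbors (Adj⁺) in sorted order. The function and array names are illustrative, not taken from the thesis code.

```c
#include <assert.h>

/* Merge-like intersection of two sorted adjacency lists (Theta(p+q) work).
 * Returns the number of common elements, i.e., triangles closed here. */
static long intersect_count(const int *a, int p, const int *b, int q) {
    long count = 0;
    int i = 0, j = 0;
    while (i < p && j < q) {
        if (a[i] < b[j])      i++;
        else if (a[i] > b[j]) j++;
        else { count++; i++; j++; }
    }
    return count;
}

/* AI1-style triangle count over a CSR graph restricted to higher-numbered
 * neighbors (Adj+); adjacencies are assumed sorted by vertex identifier.
 * The OpenMP reduction plays the role of the thread-local counts tc_l. */
long tricount_ai1(int n, const int *xadj, const int *adjncy) {
    long tc = 0;
    #pragma omp parallel for schedule(dynamic, 50) reduction(+ : tc)
    for (int v = 0; v < n; v++) {
        for (int e = xadj[v]; e < xadj[v + 1]; e++) {
            int u = adjncy[e];
            tc += intersect_count(adjncy + xadj[v], xadj[v + 1] - xadj[v],
                                  adjncy + xadj[u], xadj[u + 1] - xadj[u]);
        }
    }
    return tc;
}
```

The `schedule(dynamic, 50)` clause mirrors the loop scheduling choices evaluated later; without OpenMP enabled, the pragma is simply ignored and the code runs serially.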
Again, the use of degree ordering with the AM4 variant permits this, as the marked vector is reused across multiple iterations of the loop over v (the loop at line 8 of Algorithm 7). While the algorithms do not show the steps for triangle listing, the actual implementation is straightforward: each thread maintains a large in-memory buffer storing the triple of vertex identifiers for each triangle, and the buffer is written to disk when it is full.
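A corresponding C sketch of the AM4 marking scheme of Algorithm 7, shown serial for brevity (in the parallel version each thread owns its own mark array, as described above). The CSR layout and names are illustrative assumptions, not the thesis code.

```c
#include <assert.h>
#include <stdlib.h>

/* AM4-style marking-based triangle count. out_xadj/out_adj hold the
 * higher-numbered (Adj+) neighbors and in_xadj/in_adj the lower-numbered
 * (Adj-) neighbors, both in CSR form. */
long tricount_am4(int n, const int *out_xadj, const int *out_adj,
                  const int *in_xadj, const int *in_adj) {
    unsigned char *mark = calloc((size_t)n, 1); /* thread-local in parallel */
    long tc = 0;
    for (int u = 0; u < n; u++) {
        /* Mark Adj+(u). */
        for (int e = out_xadj[u]; e < out_xadj[u + 1]; e++)
            mark[out_adj[e]] = 1;
        /* For each v in Adj-(u), scan Adj+(v) against the marks. */
        for (int e = in_xadj[u]; e < in_xadj[u + 1]; e++) {
            int v = in_adj[e];
            for (int f = out_xadj[v]; f < out_xadj[v + 1]; f++)
                if (mark[out_adj[f]]) tc++;
        }
        /* Unmark Adj+(u) before moving to the next vertex. */
        for (int e = out_xadj[u]; e < out_xadj[u + 1]; e++)
            mark[out_adj[e]] = 0;
    }
    free(mark);
    return tc;
}
```

Each triangle v < u < w is counted exactly once, at its middle vertex u, which is why no redundant counting occurs.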

Algorithm 7 Our parallel adjacency marking (AM)-based triangle counting algorithm (variant 4), exploiting vertex ordering and avoiding redundant counting.
1: procedure TriCountOptPar-AM(V, E)
2:   tc_l ← 0                        ▷ thread-local count
3:   for i ← 1, n do                 ▷ thread-local mark array
4:     M[i] ← 0
5:   for all u ∈ V pardo             ▷ Parallelize
6:     for all w ∈ Adj⁺(u) do
7:       M[w] ← 1
8:     for all v ∈ Adj⁻(u) do
9:       for all w ∈ Adj⁺(v) do
10:        if M[w] = 1 then
11:          tc_l ← tc_l + 1
12:    for all w ∈ Adj⁺(u) do
13:      M[w] ← 0
14:  tc ← sum of all thread-local counts tc_l
15:  return tc

3.2 Approximate Triangle Counting

Our approximate triangle counting algorithm is an extension of the wedge sampling method of Seshadhri et al. [5]. We choose k random vertices from the wedge probability distribution of the given graph. For each of these vertices, we select two of its adjacencies uniformly at random and check whether the resulting wedge is closed (forming a triangle). Parallelization and optimization strategies are used to improve its time and space complexity. The algorithm offers ample scope for parallelization at different levels: determining the wedge probability distribution, extracting the k vertices, and estimating the number of closed wedges. In contrast to common methodology, this method interestingly uses clustering coefficient estimates to determine the triangle count. The total number of wedges centered at a vertex v is C(d_v, 2). This information is used to create a wedge probability distribution over the graph's vertices. When k vertices are chosen from this distribution, we indirectly give preference to vertices with higher degrees. This process of randomly selecting k vertices is easily parallelized. Two adjacencies of each of the k vertices are then selected at random to form a wedge. The approximation strategies end here. The possibility of a closed wedge between the

Algorithm 8 Parallel wedge sampling algorithm.
1: procedure WedgeCountPar(V, E)
2:   k ← 0.5·ε⁻²·ln(2/δ)
3:   totalW ← 0                                   ▷ total possible wedges
4:   for all v ∈ V pardo                          ▷ wedge probability distribution
5:     wedgeTotal_v ← C(d_v, 2)                   ▷ number of possible wedges centered at v
6:     totalW ← totalW + wedgeTotal_v
7:   meanCC ← 0                                   ▷ mean local clustering coefficient
8:   kVertices ← k vertices sampled uniformly at random from the wedge distribution   ▷ Parallelized
9:   for all v ∈ kVertices pardo                  ▷ estimation of closed wedges
10:    r1, r2 ← two adjacencies of v selected uniformly at random
11:    for all w ∈ Adj(r1) do
12:      if w = r2 then meanCC ← meanCC + 1
13:  meanCC ← meanCC/k
14:  tc ← meanCC · totalW/3
15:  return tc

chosen adjacencies is determined by intersecting their respective adjacency lists, with no explicit array stored in memory for this purpose. The resulting clustering coefficient estimates are combined to output an approximate triangle count. We optimize this algorithm for locality by ordering the vertices and adjacency lists by degree [34]. This also reduces memory traffic, as the adjacency intersection used to determine closed wedge counts is similar to the triangle counting AI variants described above. The vertex ordering lets us inspect only the part of r1's adjacency list that can contain r2, and we use binary search to determine the existence of r2. This eliminates a large number of iterations in the nested loop. With appropriately small values of δ and ε, this parallel implementation obtains highly accurate triangle counts. From the loop used to estimate the number of closed wedges, we can see that the smaller the value of k, the fewer vertices are examined, but accuracy also falls. Note also, from the formula used to determine k, that it is independent of the size or type of the graph. We performed experiments to arrive at an acceptable value of k.
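The sampling scheme of Algorithm 8 can be sketched as follows (serial, with the binary search refinement omitted for brevity; the function names, the linear-scan sampling of the wedge distribution, and the CSR layout are illustrative assumptions rather than the thesis code).

```c
#include <assert.h>
#include <stdlib.h>

/* Wedge-sampling triangle count estimator, in the spirit of Seshadhri et
 * al.; the graph is CSR over the full (undirected) adjacency lists. */
double tricount_wedge_sample(int n, const int *xadj, const int *adj, int k) {
    /* Total number of wedges: sum over v of C(d_v, 2). */
    double total_w = 0.0;
    for (int v = 0; v < n; v++) {
        double d = xadj[v + 1] - xadj[v];
        total_w += d * (d - 1) / 2.0;
    }
    long closed = 0;
    for (int s = 0; s < k; s++) {
        /* Sample a wedge center proportional to its wedge count. */
        double target = (double)rand() / RAND_MAX * total_w, acc = 0.0;
        int v = 0;
        for (; v < n - 1; v++) {
            double d = xadj[v + 1] - xadj[v];
            acc += d * (d - 1) / 2.0;
            if (acc >= target) break;
        }
        int d = xadj[v + 1] - xadj[v];
        if (d < 2) continue;
        /* Pick two distinct adjacencies r1, r2 uniformly at random. */
        int i = rand() % d, j = rand() % (d - 1);
        if (j >= i) j++;
        int r1 = adj[xadj[v] + i], r2 = adj[xadj[v] + j];
        /* The wedge closes iff r2 appears in Adj(r1). */
        for (int e = xadj[r1]; e < xadj[r1 + 1]; e++)
            if (adj[e] == r2) { closed++; break; }
    }
    /* Mean closure probability times total wedges, divided by 3, since
     * each triangle contributes three closed wedges. */
    return (double)closed / k * total_w / 3.0;
}
```

On a complete graph every sampled wedge closes, so the estimate is exact regardless of the random choices, which makes a convenient sanity check.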

3.3 Triad Census

We next discuss our new methods for triad census. Our main contribution is to combine the vertex reordering strategies with the Batagelj-Mrvar algorithm, in order to reduce operation counts. We also further simplify the algorithm to remove extraneous conditions. We count and list canonical triads only, again given by the triple v < u < w. We first define N⁺(v) and N⁻(v), analogous to the definition of Adj(v) in the undirected case: N⁺(v) comprises all vertices k such that there is a directed edge from v to k, or from k to v, or both, and k > v. Since N⁺(v) and N⁻(v) rely on the ordering of vertices from both the incoming and outgoing arcs of v, we found it best to combine these two lists. We use an adjacency array representation of the graph. To quickly detect the direction of an arc given just the adjacency, we use an optimization proposed by Chin et al. [33]: two bits of the 32- or 64-bit word holding the adjacency identifier compactly store the edge direction. We set the bits to 11 if there are arcs in both directions, and to 01 (10) for just an outgoing (incoming) arc. This compact scheme avoids unnecessary adjacency lookups and permits both AI- and AM-based implementations. A second optimization is to implicitly determine the vertices that are added to S (see Algorithm 5), without actually creating the array. This is made possible by using a merge-like routine to stride through the sorted adjacency lists of u and w, similar to the step in AI. We refer to our implementation of the census algorithm with these two changes (two bits for adjacency direction, implicit S) as the baseline version of the census routine. Building on the baseline, we develop two optimized variants (AM and AI). The AM-based approach is listed in Algorithm 9. To simplify the pseudocode, we remove the lines corresponding to counting the dyadic triads. After we do so, we can exploit the sorted ordering of the adjacencies.
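The two-bit direction encoding described above might look like the following sketch: a 32-bit adjacency word reserves its top two bits for the arc direction, limiting vertex identifiers to 2^30. The macro and function names, and the exact bit positions, are illustrative assumptions; the thesis code may lay the bits out differently.

```c
#include <assert.h>
#include <stdint.h>

/* Direction codes stored in the top two bits of an adjacency word:
 * 01 = outgoing arc only, 10 = incoming arc only, 11 = both directions. */
#define DIR_OUT  0x1u
#define DIR_IN   0x2u
#define DIR_BOTH (DIR_OUT | DIR_IN)

static inline uint32_t pack_adj(uint32_t vertex_id, uint32_t dir) {
    /* Keep the low 30 bits for the identifier, the top 2 for direction. */
    return (dir << 30) | (vertex_id & 0x3FFFFFFFu);
}

static inline uint32_t adj_vertex(uint32_t packed) {
    return packed & 0x3FFFFFFFu;
}

static inline uint32_t adj_dir(uint32_t packed) {
    return packed >> 30;
}
```

Because the direction travels with the adjacency entry itself, checking whether an arc is outgoing, incoming, or reciprocal needs no second lookup into the other endpoint's list.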
S is again maintained implicitly, and is different from the one in Algorithm 5. Each thread uses a local array to mark vertices and simplify adjacency intersections. The total triad count is computed by summing all individual thread triad counts after all threads complete. We parallelize this approach in a shared-memory environment by distributing

Algorithm 9 Counting all connected triads using our optimized parallel approach (adjacency marking variant).
1: procedure TriadCensusOptPar-AM(V, E)
2:   for i ← 4, 16 do                   ▷ thread-local census counts
3:     C_l[i] ← 0
4:   for i ← 1, n do                    ▷ thread-local mark array
5:     M[i] ← 0
6:   for all v ∈ V pardo               ▷ Parallelize
7:     for all u ∈ N⁺(v) do M[u] ← 1
8:     for all u ∈ N⁺(v) do
9:       for all w ∈ N⁻(u) with w > v do
10:        if M[w] = 0 then
11:          tricode ← TriCode(v, u, w)
12:          tritype ← TriTypes[tricode]
13:          C_l[tritype] ← C_l[tritype] + 1
14:      S ← N⁺(u) ∪ N⁺(v) \ {x : x ∈ N⁺(u) and x > v}   ▷ S maintained implicitly
15:      for all w ∈ S do
16:        tricode ← TriCode(v, u, w)
17:        tritype ← TriTypes[tricode]
18:        C_l[tritype] ← C_l[tritype] + 1
19:    for all u ∈ N⁺(v) do M[u] ← 0
20:  C ← sum of all thread-local counts C_l
21:  return C

iterations of the outer loop to multiple threads. Notice again that minimal communication and synchronization are required for both counting and listing, as threads only need to update local counts. Both variants we implement work for any vertex ordering, but a degree-based ordering certainly benefits graphs with skewed degree distributions, as the loops have operation counts proportional to Σ_v (d⁺_v)². As in the triangle counting case, a naive outer loop parallelization will lead to significant load imbalance, since we use a specific vertex ordering. The operation count analysis of this algorithm is similar to the previous case of triangle counting. Step 14 of the algorithm iterates more times than in the triangle counting variants, so the operation count is Σ_v (d⁺_v)² instead of Σ_v d⁺_v. Moreover, the larger number of patterns that we track increases the complexity. Specifically, we need to determine the connectivity of every triple in both directions to update the

appropriate triad count. This means we need to inspect a larger range for w (see line 9 of the algorithm) compared to a similar triangle counting algorithm. A minor additional optimization over the baseline is a simplified TriCode routine (not shown here): we reduce conditional checks for the existence of edges using bitwise operations. We parallelize optimized implementations of the AI and AM variants of triad census to obtain performance results. AI is more space-efficient, as we do not store S explicitly as in the baseline version; instead, the relevant adjacencies are scanned and compared directly for common elements. AM has to reserve extra space for the bit vectors used to mark adjacencies. Both AI and AM use the TriCode subroutine, which we optimize to remove branching operations that can hinder parallelization. In general, the code has minimal conditional operations; where possible, they are replaced by logical loop limits or bitwise operations. Proper loop start and end points are determined through binary search to avoid redundant iterations. Random pointer references and accesses are curtailed by copying arrays locally. The code outputs only the connected triads, as they are the ones that are challenging to optimize; the other types of triads can be calculated in constant time using formulas.
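The branch-free TriCode idea can be sketched as follows: given the two-bit direction codes for the pairs (v, u), (v, w), and (u, w), as stored in the adjacency words, the arc-existence bits are packed into a single table index with shifts and ORs rather than conditionals. The packing order is an illustrative assumption; the thesis code and its TriTypes lookup table may differ.

```c
#include <assert.h>

/* Each pair's direction code uses two bits (01 = forward arc, 10 = reverse
 * arc, 11 = both), as in the encoding described earlier. Packing the three
 * pair codes yields a 6-bit tricode in [0, 63] that can index a precomputed
 * 64-entry TriTypes table mapping codes to the 16 triad isomorphism
 * classes, with no branching at all. */
static inline unsigned tricode(unsigned dir_vu, unsigned dir_vw,
                               unsigned dir_uw) {
    return (dir_vu & 0x3u) | ((dir_vw & 0x3u) << 2) | ((dir_uw & 0x3u) << 4);
}
```

Replacing six edge-existence branches with this single expression is the kind of conditional-to-bitwise substitution the text describes.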

Chapter 4

Performance Discussion

We evaluate the performance of our optimized triad census and triangle counting variants on several large-scale graphs. Both sequential and parallel implementations are assessed for efficiency on multicore and manycore platforms.

4.1 Experimental Methodology

Table 4.1 lists all the directed graph instances that we use. We chose these graphs from the Koblenz network repository [35] and the UFL Sparse Matrix Collection [36]. Most of them are crawls of the web or of social networks, and the original sources of these graphs are also listed in the table. We removed any self loops and parallel edges present in the original graphs. The table also lists the total count of all connected triads (isomorphism classes 4 to 16 in Figure 2.2) in these test graphs. We omit the counts of the dyadic patterns and the null triad, as these counts are indirectly derived from the rest, and they are several orders of magnitude larger than the count of the connected triads. The total raw counts of connected triads vary by nearly six orders of magnitude among the input graphs, from 320 million to 123 trillion. We also report d_max for each graph. As mentioned earlier, the skew in degree distribution motivates the vertex ordering schemes. We also use several undirected graphs, listed in Table 4.2, to evaluate the performance of the triangle counting variants. Of the 10 input graphs, 6 are derived from the directed graphs in Table 4.1 by symmetrizing them and removing parallel edges. The input graphs are ordered by increasing connected triad/triangle counts, as we note that the operation counts, as well as the algorithm running times, appear to scale linearly with the total count.

We preprocess all the data to reorder the vertex labels in non-decreasing degree order, and write the graphs to disk. We also evaluate the impact of alternate vertex orderings on overall performance. Note that the running times reported in this section include neither the time to read the graph from disk nor the time to reorder the graph. For large graphs, the reordering time is negligible compared to the overall census running time. For instance, reordering the twitter graph takes less than 1 minute in serial, whereas the census method running time in parallel is nearly 7.25 hours. However, initial graph I/O time and reordering time may not be negligible for small graphs. We evaluate performance on a single compute node of the TACC Stampede supercomputer. Each compute node has two 8-core Intel Xeon E5 processors (Sandy Bridge, SNB, microarchitecture) and one Intel Xeon Phi SE10P coprocessor (Knights Corner, KNC, microarchitecture). The Xeon processors can access 32 GB of DDR3 memory. The Xeon Phi coprocessor has 61 cores and 8 GB of GDDR5 memory. We compile our programs (written in C and using OpenMP) using the Intel C/C++ compilers (v15.0.2) with -O3 optimization. We bind threads to cores using the KMP_AFFINITY and MIC_KMP_AFFINITY environment variables; we use the compact strategy on the Xeon E5s and the balanced strategy on the Xeon Phi. To compare our implementations to prior work, we run Shun and Tangwongsan's ordered-merge variant [19] for exact triangle counting, which is similar to our AI1 variant. We built this code using the same version of the Intel C++ compilers, and control parallelism using the CILK_NWORKERS and MIC_CILK_NWORKERS variables. We did not find any publicly available parallel triadic census codes or implementations, and so most of our comparisons are relative to the baseline version (i.e., our parallel implementation of the Batagelj-Mrvar algorithm).
To analyze the performance of the approximate triangle counting algorithms, we ran them on a single compute node of the Lion-XG cluster at Penn State. A compute node of Lion-XG has two 8-core Intel Xeon E5 processors (SNB microarchitecture). Processor cores are clocked at 2.6 GHz, and each server has 32 GB of main memory. The code is compiled using C/C++ compilers with -O3 optimization and OpenMP support.

Table 4.1: Directed graphs used to evaluate performance of our new triad census approaches. For each graph, the table lists n, m (×10⁶), d_max, TC (the total number of connected triads), and sources: patentcite [35, 37], cage15 [36], soc-pokec [11, 35], soc-livejournal [8, 35], HV15R [36], flickr [35, 38], indochina-2004 [36, 39, 40], arabic [36, 39-41], it-2004 [36, 39, 40], and twitter [35, 42].

4.2 Results and Performance Analysis

4.2.1 Triad Census

In Table 4.3, we report the parallel performance of our three census variants on the dual-socket, 8-core Intel Xeon (SNB) and the Xeon Phi (KNC). The results are obtained using OpenMP dynamic scheduling with a chunk size of 10 on both SNB and KNC for all the graphs. We also experimented with static scheduling with chunk sizes of 1 and 10, and dynamic scheduling with chunk sizes of 50 and 100. We found that dynamic scheduling with a chunk size of 10 gave the best results for the majority of the graphs, and so we selected these settings. We could not execute census counts for some graphs on KNC due to memory limitations. The first observation from these running time results is that our optimized AI and AM variants provide a significant improvement over the baseline: on SNB, AM is 2.37× faster than the baseline, and on KNC, AI is 1.77× faster. Using the total connected triad count information from Table 4.1, we can compute a performance rate, the number of connected triads counted per second (TCPS). For 16-way threading on SNB, this value ranges from 867 million TCPS (for patentcite)

Table 4.2: Undirected graphs used to evaluate performance of our new triangle counting approaches. For each graph, the table lists n, m (×10⁶), d_max, TC (the total triangle count), and sources: hugetrace [36], soc-pokec [11, 35], cage15 [36], soc-livejournal [8, 35], rgg_n_2_24_s [36], orkut [9, 35], HV15R [36], kron_g500-logn [36], twitter [35, 42], and indochina-2004 [36, 39, 40].

to 5665 billion TCPS (for it-2004). For KNC, the performance ranges from 1604 (patentcite) to 3000 TCPS (HV15R). On both systems, this performance rate increases with problem size. The next important observation is that

Table 4.3: Triad census execution times (in seconds) for the Base, AI, and AM variants on SNB (16 cores) and KNC (61 cores), for the graphs of Table 4.1. The largest instances take the longest: the it-2004 baseline exceeds 1 hour, and for twitter the Base and AI variants exceed 8 hours while AM completes in 7.25 hours.

different algorithmic variants are faster on each system. On SNB, AM is faster for most of the instances, whereas AI is consistently faster on KNC. Further, there is a considerable gap between AI and AM performance on KNC. This can be attributed to the memory access patterns of the AM variants: the regular merge-like intersection routine is a better fit for KNC than the random-access-based AM routine. On SNB, however, the large last-level cache and the relatively lower thread count mean that the overhead of random reads and writes to the bit vectors can be amortized by caching. We also collected cache performance statistics for the AI and AM variants on SNB. With the twitter graph, on a single compute node of Lion-XG, the L3 cache misses for AM are around 17% of the cache accesses, whereas for AI they are around 25%. For some of the smaller graphs, KNC is actually slightly faster than the dual-socket SNB. We also observe a noticeable performance impact of dynamic scheduling for some of the regular graphs (cage15, HV15R); we achieve better performance for these instances with larger dynamic chunk sizes or with static scheduling.

Figure 4.1: Triad census analysis of various graphs (normalized relative frequency of each connected triad type, on a log scale, for flickr, HV15R, indochina-2004, soc-LiveJournal1, and soc-pokec).

Figure 4.1 provides an example illustration of the types of analytics that are possible with exact triad census. We compare the relative frequency of each

connected triad, normalized to the total connected triad count. The Y axis is on a log scale, and we note a range of relative frequencies spanning six orders of magnitude. For the indochina-2004 web crawl, pattern 5 has the largest count, indicating the presence of vertices with very high in-degree and low out-degree. We also observe that patterns 10 and 14 are highly underrepresented in indochina-2004. Such observations have applications in social media analysis, where the graph structure has direct implications for growth and community formation in the network.

4.2.2 Triangle Counting

We next report the sequential and parallel performance of our implementations for triangle counting. Table 4.4 gives the performance of AI and AM on SNB. We also give the parallel performance achieved with Shun and Tangwongsan's (ST) ordered-merge code. We use OpenMP dynamic scheduling with a chunk size of 50 on SNB, and also experimented with chunk sizes of 10 and 100. The AM variant is fastest for 9 out of 10 graph instances. As problem size increases, the performance of AM relative to AI also increases. The serial time corresponds to the running time of either the AI or AM variant (whichever is faster) without OpenMP pragmas; hence, the parallel speedup reported in the table is absolute speedup. We notice that the speedup is better for larger graph instances. The performance of AI is comparable to ST for a few instances, but is better for most others, such as orkut and soc-LiveJournal1. We were unable to run ST on the largest instance (twitter). We also refer to the serial and parallel running times reported on various Intel platforms in [3, 19] and find that our approaches are comparable to, or faster than, the times reported in these papers. For example, on KNC, our AI implementation is 179× faster than ST for the orkut graph. The performance rates achieved, in terms of triangles counted per second, are comparable to those of the census implementation.
In Table 4.5, we give the performance of triangle counting on KNC. These results were obtained with dynamic scheduling and a chunk size of 10. We again notice that AI outperforms AM, similar to the census case. ST is dominated by AI on all the graphs, and there are a few instances where ST fails to execute. Also, for a majority of the graphs, AI on KNC is faster than AM on SNB, which is a notable result. The

Table 4.4: Triangle counting performance on a Sandy Bridge node: serial time, parallel (16-core) times for ST, AI, and AM, and the absolute parallel speedup of AM, for hugetrace, soc-pokec, cage15, soc-livejournal, rgg_n_2_24_s, orkut, HV15R, kron_g500-logn, twitter (on which ST fails), and indochina-2004.

Table 4.5: Triangle counting performance on KNC (61 cores): serial time, times for ST, AI, and AM, and the absolute parallel speedup of AI; ST fails on rgg_n_2_24_s, HV15R, and indochina-2004.

variation in normalized performance is not as high on KNC as it is on SNB. We also note very high absolute speedups, although single-threaded performance is admittedly very low due to high instruction and memory latencies. We observe better speedups for social networks and web crawls than for the structured sparse matrices, indicating that there is likely more room for tuning and improvement on these data sets.

Table 4.6: Serial approximate triangle counting performance on a single compute node of Lion-XG (SNB): exact and approximate counting times and the resulting speedup for soc-pokec, orkut, indochina-2004, twitter, and friendster.

4.2.3 Approximate Triangle Counting

We now report the performance of our approximate triangle counting implementation. We recorded results for both serial and parallel implementations: we first compare the serial performance of the exact and approximate triangle algorithms, and then look at the parallel execution times. Our implementation provides an average accuracy of 99.6% with proper tuning of the δ and ε values used to obtain k for a graph. We observed that, in general, δ = 0.1 and ε = 0.1 produce results of high accuracy. Keeping in mind a requirement for high accuracy, we study the serial implementation results. From Table 4.6, we can see that the approximate serial algorithm is considerably faster than the exact counting algorithm. This is mainly because a much smaller number of vertices is considered for triangle counting. soc-pokec shows almost no speedup, as its serial time is too small to really improve upon: the major contributors to the execution time dominate equally in the approximate and exact approaches. It is important to note that since k is agnostic of the size of the graph, it yields different speedups depending on the graph structure. Once the graph size crosses a certain threshold, the speedup appears roughly constant, at about 6×. Another factor contributing to the lower execution time of the approximate algorithm is that it involves many constant-time operations, such as selecting vertices randomly. Apart from the loop computing the probability distribution, the sizes of the other loops are determined by the value of k (which is fixed even if the graph is very large) and the (larger) degree of the vertex adjacencies being examined for closed wedges. Since

Table 4.7: Parallel approximate triangle counting performance on a single compute node of Lion-XG (SNB): times on 1, 2, 4, 8, and 16 cores and the resulting speedup for soc-pokec, orkut, indochina-2004, twitter, and friendster.

most real-world graphs share a characteristic skewed degree distribution with very few vertices of high degree, the probability of choosing a vertex that leads to a large number of iterations is very low. Thus, in an amortized sense, most of the time consumed depends on k, which is in the user's control. Table 4.7 shows good speedups after parallelization. We use OpenMP pragmas to parallelize three main steps: computing the probability distribution, computing k, and determining the number of closed wedges. For the first two steps, the main operation is choosing vertices at random, so static partitioning suffices, as work is well balanced across cores.

4.3 Performance Scaling

In Figures 4.2 and 4.3, we plot the relative speedups achieved by each of the variants on two graphs, for triangle counting and triad census, respectively. For counting, AI shows the best scaling on both SNB and KNC; this behavior is due to the space-efficient nature of AI. ST's scaling is comparable to AM for soc-pokec, but is slightly lower for soc-LiveJournal1 on both platforms. Figure 4.3 shows that triad census scaling is comparable for all three variants, which is as expected. Thus, the overall running time improvements for triad census are due to the algorithmic changes in our new variants in comparison to the baseline scheme. Load imbalance is also not significant for these graphs, probably due to dynamic scheduling and the choice of a reasonable chunk size. Figure 4.4 and Table 4.7 show the speedups achieved by our approximate triangle

Figure 4.2: Parallel scaling of triangle counting methods (AI, AM, and ST) with increasing thread counts on SNB and KNC processors, for soc-pokec and soc-LiveJournal1.

Figure 4.3: Parallel scaling of triad census methods (AI, AM, and Base) with increasing thread counts on SNB and KNC processors, for patentcite and soc-pokec.

implementation over 16 cores. Scaling is good up to 8 cores; going from 8 to 16 cores makes little difference, probably because there is not enough work for 16 cores. For smaller graphs like soc-pokec, speedup stops even at 8 cores, as the overheads of assigning tasks to 8 cores exceed the actual per-core execution time. There are few overheads associated with synchronization, with only a few reductions performed across cores while computing the total number of closed wedges.

4.4 Impact of ordering on overall performance

We next study the impact of ordering on parallel performance. We report the performance of triad census and triangle counting with various ordering schemes, normalized to the performance with random ordering. We see that the performance of the triad census variants does not vary much, but the performance of the counting variants is significantly affected. Triad census remains largely agnostic to vertex reordering because of the many complex operations performed in every outer loop iteration. For indochina-2004, ordering makes a substantial difference, greater than an order

Figure 4.4: Parallel scaling of approximate triangle counting on a single node of Lion-XG (SNB).

of magnitude in AM performance with SF ordering.

Table 4.8: Performance impact of ordering strategy (NAT: natural, SF: smaller degree first, LF: larger degree first) for parallel triad census and parallel triangle counting on SNB (16 cores). Table values are performance improvements over random vertex ordering (higher values are better); census results are given for the AI and AM variants on patentcite, flickr, and indochina-2004, and counting results for the AI and AM variants on soc-livejournal, orkut, and indochina-2004.


More information

Mosaic: Processing a Trillion-Edge Graph on a Single Machine

Mosaic: Processing a Trillion-Edge Graph on a Single Machine Mosaic: Processing a Trillion-Edge Graph on a Single Machine Steffen Maass, Changwoo Min, Sanidhya Kashyap, Woonhak Kang, Mohan Kumar, Taesoo Kim Georgia Institute of Technology Best Student Paper @ EuroSys

More information

Algorithms for Grid Graphs in the MapReduce Model

Algorithms for Grid Graphs in the MapReduce Model University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln Computer Science and Engineering: Theses, Dissertations, and Student Research Computer Science and Engineering, Department

More information

Order or Shuffle: Empirically Evaluating Vertex Order Impact on Parallel Graph Computations

Order or Shuffle: Empirically Evaluating Vertex Order Impact on Parallel Graph Computations Order or Shuffle: Empirically Evaluating Vertex Order Impact on Parallel Graph Computations George M. Slota 1 Sivasankaran Rajamanickam 2 Kamesh Madduri 3 1 Rensselaer Polytechnic Institute, 2 Sandia National

More information

A subquadratic triad census algorithm for large sparse networks with small maximum degree

A subquadratic triad census algorithm for large sparse networks with small maximum degree A subquadratic triad census algorithm for large sparse networks with small maximum degree Vladimir Batagelj and Andrej Mrvar University of Ljubljana Abstract In the paper a subquadratic (O(m), m is the

More information

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark by Nkemdirim Dockery High Performance Computing Workloads Core-memory sized Floating point intensive Well-structured

More information

Algorithm Design (8) Graph Algorithms 1/2

Algorithm Design (8) Graph Algorithms 1/2 Graph Algorithm Design (8) Graph Algorithms / Graph:, : A finite set of vertices (or nodes) : A finite set of edges (or arcs or branches) each of which connect two vertices Takashi Chikayama School of

More information

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Accelerating PageRank using Partition-Centric Processing Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18 Outline Introduction Partition-centric Processing Methodology Analytical Evaluation

More information

Advanced Database Systems

Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

Introduction to Graph Theory

Introduction to Graph Theory Introduction to Graph Theory Tandy Warnow January 20, 2017 Graphs Tandy Warnow Graphs A graph G = (V, E) is an object that contains a vertex set V and an edge set E. We also write V (G) to denote the vertex

More information

Trees. 3. (Minimally Connected) G is connected and deleting any of its edges gives rise to a disconnected graph.

Trees. 3. (Minimally Connected) G is connected and deleting any of its edges gives rise to a disconnected graph. Trees 1 Introduction Trees are very special kind of (undirected) graphs. Formally speaking, a tree is a connected graph that is acyclic. 1 This definition has some drawbacks: given a graph it is not trivial

More information

Basic Search Algorithms

Basic Search Algorithms Basic Search Algorithms Tsan-sheng Hsu tshsu@iis.sinica.edu.tw http://www.iis.sinica.edu.tw/~tshsu 1 Abstract The complexities of various search algorithms are considered in terms of time, space, and cost

More information

Multicore Triangle Computations Without Tuning

Multicore Triangle Computations Without Tuning Multicore Triangle Computations Without Tuning Julian Shun, Kanat Tangwongsan 2 Computer Science Department, Carnegie Mellon University, USA 2 Computer Science Program, Mahidol University International

More information

A CSP Search Algorithm with Reduced Branching Factor

A CSP Search Algorithm with Reduced Branching Factor A CSP Search Algorithm with Reduced Branching Factor Igor Razgon and Amnon Meisels Department of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva, 84-105, Israel {irazgon,am}@cs.bgu.ac.il

More information

Chapter 13: Query Processing

Chapter 13: Query Processing Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing

More information

On Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs

On Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs On Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs Sungpack Hong 2, Nicole C. Rodia 1, and Kunle Olukotun 1 1 Pervasive Parallelism Laboratory, Stanford University

More information

Extreme-scale Graph Analysis on Blue Waters

Extreme-scale Graph Analysis on Blue Waters Extreme-scale Graph Analysis on Blue Waters 2016 Blue Waters Symposium George M. Slota 1,2, Siva Rajamanickam 1, Kamesh Madduri 2, Karen Devine 1 1 Sandia National Laboratories a 2 The Pennsylvania State

More information

Analysis of Algorithms. Unit 4 - Analysis of well known Algorithms

Analysis of Algorithms. Unit 4 - Analysis of well known Algorithms Analysis of Algorithms Unit 4 - Analysis of well known Algorithms 1 Analysis of well known Algorithms Brute Force Algorithms Greedy Algorithms Divide and Conquer Algorithms Decrease and Conquer Algorithms

More information

Characterizing Graphs (3) Characterizing Graphs (1) Characterizing Graphs (2) Characterizing Graphs (4)

Characterizing Graphs (3) Characterizing Graphs (1) Characterizing Graphs (2) Characterizing Graphs (4) S-72.2420/T-79.5203 Basic Concepts 1 S-72.2420/T-79.5203 Basic Concepts 3 Characterizing Graphs (1) Characterizing Graphs (3) Characterizing a class G by a condition P means proving the equivalence G G

More information

A Comparative Study on Exact Triangle Counting Algorithms on the GPU

A Comparative Study on Exact Triangle Counting Algorithms on the GPU A Comparative Study on Exact Triangle Counting Algorithms on the GPU Leyuan Wang, Yangzihao Wang, Carl Yang, John D. Owens University of California, Davis, CA, USA 31 st May 2016 L. Wang, Y. Wang, C. Yang,

More information

EE/CSCI 451 Midterm 1

EE/CSCI 451 Midterm 1 EE/CSCI 451 Midterm 1 Spring 2018 Instructor: Xuehai Qian Friday: 02/26/2018 Problem # Topic Points Score 1 Definitions 20 2 Memory System Performance 10 3 Cache Performance 10 4 Shared Memory Programming

More information

Unit 4: Formal Verification

Unit 4: Formal Verification Course contents Unit 4: Formal Verification Logic synthesis basics Binary-decision diagram (BDD) Verification Logic optimization Technology mapping Readings Chapter 11 Unit 4 1 Logic Synthesis & Verification

More information

Chapter 4: Implicit Error Detection

Chapter 4: Implicit Error Detection 4. Chpter 5 Chapter 4: Implicit Error Detection Contents 4.1 Introduction... 4-2 4.2 Network error correction... 4-2 4.3 Implicit error detection... 4-3 4.4 Mathematical model... 4-6 4.5 Simulation setup

More information

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and

More information

Chapter 13: Query Processing Basic Steps in Query Processing

Chapter 13: Query Processing Basic Steps in Query Processing Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION

CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called cluster)

More information

On the Approximability of Modularity Clustering

On the Approximability of Modularity Clustering On the Approximability of Modularity Clustering Newman s Community Finding Approach for Social Nets Bhaskar DasGupta Department of Computer Science University of Illinois at Chicago Chicago, IL 60607,

More information

Database System Concepts

Database System Concepts Chapter 13: Query Processing s Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2008/2009 Slides (fortemente) baseados nos slides oficiais do livro c Silberschatz, Korth

More information

Chapter 12: Query Processing

Chapter 12: Query Processing Chapter 12: Query Processing Overview Catalog Information for Cost Estimation $ Measures of Query Cost Selection Operation Sorting Join Operation Other Operations Evaluation of Expressions Transformation

More information

Graph Data Management

Graph Data Management Graph Data Management Analysis and Optimization of Graph Data Frameworks presented by Fynn Leitow Overview 1) Introduction a) Motivation b) Application for big data 2) Choice of algorithms 3) Choice of

More information

Parallel Graph Algorithms

Parallel Graph Algorithms Parallel Graph Algorithms Design and Analysis of Parallel Algorithms 5DV050/VT3 Part I Introduction Overview Graphs definitions & representations Minimal Spanning Tree (MST) Prim s algorithm Single Source

More information

17/05/2018. Outline. Outline. Divide and Conquer. Control Abstraction for Divide &Conquer. Outline. Module 2: Divide and Conquer

17/05/2018. Outline. Outline. Divide and Conquer. Control Abstraction for Divide &Conquer. Outline. Module 2: Divide and Conquer Module 2: Divide and Conquer Divide and Conquer Control Abstraction for Divide &Conquer 1 Recurrence equation for Divide and Conquer: If the size of problem p is n and the sizes of the k sub problems are

More information

Efficient Counting of Network Motifs

Efficient Counting of Network Motifs Efficient Counting of Network Motifs Dror Marcus School of Computer Science Tel-Aviv University, Israel Email: drormarc@post.tau.ac.il Yuval Shavitt School of Electrical Engineering Tel-Aviv University,

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 10. Graph databases Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Graph Databases Basic

More information

Algorithm Engineering with PRAM Algorithms

Algorithm Engineering with PRAM Algorithms Algorithm Engineering with PRAM Algorithms Bernard M.E. Moret moret@cs.unm.edu Department of Computer Science University of New Mexico Albuquerque, NM 87131 Rome School on Alg. Eng. p.1/29 Measuring and

More information

Link Prediction in Graph Streams

Link Prediction in Graph Streams Peixiang Zhao, Charu C. Aggarwal, and Gewen He Florida State University IBM T J Watson Research Center Link Prediction in Graph Streams ICDE Conference, 2016 Graph Streams Graph Streams arise in a wide

More information

Chapter 12: Query Processing

Chapter 12: Query Processing Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Overview Chapter 12: Query Processing Measures of Query Cost Selection Operation Sorting Join

More information

Evaluation of Relational Operations

Evaluation of Relational Operations Evaluation of Relational Operations Chapter 14 Comp 521 Files and Databases Fall 2010 1 Relational Operations We will consider in more detail how to implement: Selection ( ) Selects a subset of rows from

More information

Locality. Christoph Koch. School of Computer & Communication Sciences, EPFL

Locality. Christoph Koch. School of Computer & Communication Sciences, EPFL Locality Christoph Koch School of Computer & Communication Sciences, EPFL Locality Front view of instructor 2 Locality Locality relates (software) systems with the physical world. Front view of instructor

More information

STINGER: Spatio-Temporal Interaction Networks and Graphs (STING) Extensible Representation

STINGER: Spatio-Temporal Interaction Networks and Graphs (STING) Extensible Representation STINGER: Spatio-Temporal Interaction Networks and Graphs (STING) Extensible Representation David A. Bader Georgia Institute of Technolgy Adam Amos-Binks Carleton University, Canada Jonathan Berry Sandia

More information

MCL. (and other clustering algorithms) 858L

MCL. (and other clustering algorithms) 858L MCL (and other clustering algorithms) 858L Comparing Clustering Algorithms Brohee and van Helden (2006) compared 4 graph clustering algorithms for the task of finding protein complexes: MCODE RNSC Restricted

More information

Using Statistics for Computing Joins with MapReduce

Using Statistics for Computing Joins with MapReduce Using Statistics for Computing Joins with MapReduce Theresa Csar 1, Reinhard Pichler 1, Emanuel Sallinger 1, and Vadim Savenkov 2 1 Vienna University of Technology {csar, pichler, sallinger}@dbaituwienacat

More information

arxiv: v1 [cs.ds] 23 Jul 2014

arxiv: v1 [cs.ds] 23 Jul 2014 Efficient Enumeration of Induced Subtrees in a K-Degenerate Graph Kunihiro Wasa 1, Hiroki Arimura 1, and Takeaki Uno 2 arxiv:1407.6140v1 [cs.ds] 23 Jul 2014 1 Hokkaido University, Graduate School of Information

More information

PuLP. Complex Objective Partitioning of Small-World Networks Using Label Propagation. George M. Slota 1,2 Kamesh Madduri 2 Sivasankaran Rajamanickam 1

PuLP. Complex Objective Partitioning of Small-World Networks Using Label Propagation. George M. Slota 1,2 Kamesh Madduri 2 Sivasankaran Rajamanickam 1 PuLP Complex Objective Partitioning of Small-World Networks Using Label Propagation George M. Slota 1,2 Kamesh Madduri 2 Sivasankaran Rajamanickam 1 1 Sandia National Laboratories, 2 The Pennsylvania State

More information

Graph Algorithms. Definition

Graph Algorithms. Definition Graph Algorithms Many problems in CS can be modeled as graph problems. Algorithms for solving graph problems are fundamental to the field of algorithm design. Definition A graph G = (V, E) consists of

More information

Large Scale Complex Network Analysis using the Hybrid Combination of a MapReduce Cluster and a Highly Multithreaded System

Large Scale Complex Network Analysis using the Hybrid Combination of a MapReduce Cluster and a Highly Multithreaded System Large Scale Complex Network Analysis using the Hybrid Combination of a MapReduce Cluster and a Highly Multithreaded System Seunghwa Kang David A. Bader 1 A Challenge Problem Extracting a subgraph from

More information

Indexing and Hashing

Indexing and Hashing C H A P T E R 1 Indexing and Hashing This chapter covers indexing techniques ranging from the most basic one to highly specialized ones. Due to the extensive use of indices in database systems, this chapter

More information

Chapter 12: Indexing and Hashing. Basic Concepts

Chapter 12: Indexing and Hashing. Basic Concepts Chapter 12: Indexing and Hashing! Basic Concepts! Ordered Indices! B+-Tree Index Files! B-Tree Index Files! Static Hashing! Dynamic Hashing! Comparison of Ordered Indexing and Hashing! Index Definition

More information

On the Relationships between Zero Forcing Numbers and Certain Graph Coverings

On the Relationships between Zero Forcing Numbers and Certain Graph Coverings On the Relationships between Zero Forcing Numbers and Certain Graph Coverings Fatemeh Alinaghipour Taklimi, Shaun Fallat 1,, Karen Meagher 2 Department of Mathematics and Statistics, University of Regina,

More information

Fast algorithms for max independent set

Fast algorithms for max independent set Fast algorithms for max independent set N. Bourgeois 1 B. Escoffier 1 V. Th. Paschos 1 J.M.M. van Rooij 2 1 LAMSADE, CNRS and Université Paris-Dauphine, France {bourgeois,escoffier,paschos}@lamsade.dauphine.fr

More information

Treewidth and graph minors

Treewidth and graph minors Treewidth and graph minors Lectures 9 and 10, December 29, 2011, January 5, 2012 We shall touch upon the theory of Graph Minors by Robertson and Seymour. This theory gives a very general condition under

More information

Algorithm Design and Analysis

Algorithm Design and Analysis Algorithm Design and Analysis LECTURE 3 Data Structures Graphs Traversals Strongly connected components Sofya Raskhodnikova L3.1 Measuring Running Time Focus on scalability: parameterize the running time

More information

Chapter 12: Indexing and Hashing

Chapter 12: Indexing and Hashing Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL

More information

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Query Processing Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Slides re-used with some modification from www.db-book.com Reference: Database System Concepts, 6 th Ed. By Silberschatz,

More information

Extreme-scale Graph Analysis on Blue Waters

Extreme-scale Graph Analysis on Blue Waters Extreme-scale Graph Analysis on Blue Waters 2016 Blue Waters Symposium George M. Slota 1,2, Siva Rajamanickam 1, Kamesh Madduri 2, Karen Devine 1 1 Sandia National Laboratories a 2 The Pennsylvania State

More information

A New Parallel Algorithm for Connected Components in Dynamic Graphs. Robert McColl Oded Green David Bader

A New Parallel Algorithm for Connected Components in Dynamic Graphs. Robert McColl Oded Green David Bader A New Parallel Algorithm for Connected Components in Dynamic Graphs Robert McColl Oded Green David Bader Overview The Problem Target Datasets Prior Work Parent-Neighbor Subgraph Results Conclusions Problem

More information

Solutions to Exam Data structures (X and NV)

Solutions to Exam Data structures (X and NV) Solutions to Exam Data structures X and NV 2005102. 1. a Insert the keys 9, 6, 2,, 97, 1 into a binary search tree BST. Draw the final tree. See Figure 1. b Add NIL nodes to the tree of 1a and color it

More information

ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL

ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL ADAPTIVE TILE CODING METHODS FOR THE GENERALIZATION OF VALUE FUNCTIONS IN THE RL STATE SPACE A THESIS SUBMITTED TO THE FACULTY OF THE GRADUATE SCHOOL OF THE UNIVERSITY OF MINNESOTA BY BHARAT SIGINAM IN

More information

Extremal Graph Theory: Turán s Theorem

Extremal Graph Theory: Turán s Theorem Bridgewater State University Virtual Commons - Bridgewater State University Honors Program Theses and Projects Undergraduate Honors Program 5-9-07 Extremal Graph Theory: Turán s Theorem Vincent Vascimini

More information

Extracting Information from Complex Networks

Extracting Information from Complex Networks Extracting Information from Complex Networks 1 Complex Networks Networks that arise from modeling complex systems: relationships Social networks Biological networks Distinguish from random networks uniform

More information

Samuel Coolidge, Dan Simon, Dennis Shasha, Technical Report NYU/CIMS/TR

Samuel Coolidge, Dan Simon, Dennis Shasha, Technical Report NYU/CIMS/TR Detecting Missing and Spurious Edges in Large, Dense Networks Using Parallel Computing Samuel Coolidge, sam.r.coolidge@gmail.com Dan Simon, des480@nyu.edu Dennis Shasha, shasha@cims.nyu.edu Technical Report

More information

Introduction III. Graphs. Motivations I. Introduction IV

Introduction III. Graphs. Motivations I. Introduction IV Introduction I Graphs Computer Science & Engineering 235: Discrete Mathematics Christopher M. Bourke cbourke@cse.unl.edu Graph theory was introduced in the 18th century by Leonhard Euler via the Königsberg

More information

Distributed Data Structures and Algorithms for Disjoint Sets in Computing Connected Components of Huge Network

Distributed Data Structures and Algorithms for Disjoint Sets in Computing Connected Components of Huge Network Distributed Data Structures and Algorithms for Disjoint Sets in Computing Connected Components of Huge Network Wing Ning Li, CSCE Dept. University of Arkansas, Fayetteville, AR 72701 wingning@uark.edu

More information

Unit 9 : Fundamentals of Parallel Processing

Unit 9 : Fundamentals of Parallel Processing Unit 9 : Fundamentals of Parallel Processing Lesson 1 : Types of Parallel Processing 1.1. Learning Objectives On completion of this lesson you will be able to : classify different types of parallel processing

More information

Simplicity is Beauty: Improved Upper Bounds for Vertex Cover

Simplicity is Beauty: Improved Upper Bounds for Vertex Cover Simplicity is Beauty: Improved Upper Bounds for Vertex Cover Jianer Chen, Iyad A. Kanj, and Ge Xia Department of Computer Science, Texas A&M University, College Station, TX 77843 email: {chen, gexia}@cs.tamu.edu

More information

Lecture 4: Graph Algorithms

Lecture 4: Graph Algorithms Lecture 4: Graph Algorithms Definitions Undirected graph: G =(V, E) V finite set of vertices, E finite set of edges any edge e = (u,v) is an unordered pair Directed graph: edges are ordered pairs If e

More information

Symmetric Product Graphs

Symmetric Product Graphs Rochester Institute of Technology RIT Scholar Works Theses Thesis/Dissertation Collections 5-20-2015 Symmetric Product Graphs Evan Witz Follow this and additional works at: http://scholarworks.rit.edu/theses

More information

Discrete mathematics , Fall Instructor: prof. János Pach

Discrete mathematics , Fall Instructor: prof. János Pach Discrete mathematics 2016-2017, Fall Instructor: prof. János Pach - covered material - Lecture 1. Counting problems To read: [Lov]: 1.2. Sets, 1.3. Number of subsets, 1.5. Sequences, 1.6. Permutations,

More information

File Structures and Indexing

File Structures and Indexing File Structures and Indexing CPS352: Database Systems Simon Miner Gordon College Last Revised: 10/11/12 Agenda Check-in Database File Structures Indexing Database Design Tips Check-in Database File Structures

More information

Parallelization of Graph Isomorphism using OpenMP

Parallelization of Graph Isomorphism using OpenMP Parallelization of Graph Isomorphism using OpenMP Vijaya Balpande Research Scholar GHRCE, Nagpur Priyadarshini J L College of Engineering, Nagpur ABSTRACT Advancement in computer architecture leads to

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

DESIGN AND OVERHEAD ANALYSIS OF WORKFLOWS IN GRID

DESIGN AND OVERHEAD ANALYSIS OF WORKFLOWS IN GRID I J D M S C L Volume 6, o. 1, January-June 2015 DESIG AD OVERHEAD AALYSIS OF WORKFLOWS I GRID S. JAMUA 1, K. REKHA 2, AD R. KAHAVEL 3 ABSRAC Grid workflow execution is approached as a pure best effort

More information

Bandwidth Avoiding Stencil Computations

Bandwidth Avoiding Stencil Computations Bandwidth Avoiding Stencil Computations By Kaushik Datta, Sam Williams, Kathy Yelick, and Jim Demmel, and others Berkeley Benchmarking and Optimization Group UC Berkeley March 13, 2008 http://bebop.cs.berkeley.edu

More information

Application of the Computer Capacity to the Analysis of Processors Evolution. BORIS RYABKO 1 and ANTON RAKITSKIY 2 April 17, 2018

Application of the Computer Capacity to the Analysis of Processors Evolution. BORIS RYABKO 1 and ANTON RAKITSKIY 2 April 17, 2018 Application of the Computer Capacity to the Analysis of Processors Evolution BORIS RYABKO 1 and ANTON RAKITSKIY 2 April 17, 2018 arxiv:1705.07730v1 [cs.pf] 14 May 2017 Abstract The notion of computer capacity

More information

Pregel. Ali Shah

Pregel. Ali Shah Pregel Ali Shah s9alshah@stud.uni-saarland.de 2 Outline Introduction Model of Computation Fundamentals of Pregel Program Implementation Applications Experiments Issues with Pregel 3 Outline Costs of Computation

More information

Association Pattern Mining. Lijun Zhang

Association Pattern Mining. Lijun Zhang Association Pattern Mining Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction The Frequent Pattern Mining Model Association Rule Generation Framework Frequent Itemset Mining Algorithms

More information

CSI 604 Elementary Graph Algorithms

CSI 604 Elementary Graph Algorithms CSI 604 Elementary Graph Algorithms Ref: Chapter 22 of the text by Cormen et al. (Second edition) 1 / 25 Graphs: Basic Definitions Undirected Graph G(V, E): V is set of nodes (or vertices) and E is the

More information

Community Detection. Community

Community Detection. Community Community Detection Community In social sciences: Community is formed by individuals such that those within a group interact with each other more frequently than with those outside the group a.k.a. group,

More information

Query Processing & Optimization

Query Processing & Optimization Query Processing & Optimization 1 Roadmap of This Lecture Overview of query processing Measures of Query Cost Selection Operation Sorting Join Operation Other Operations Evaluation of Expressions Introduction

More information

Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins

Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Intel Many Integrated Core (MIC) Matt Kelly & Ryan Rawlins Outline History & Motivation Architecture Core architecture Network Topology Memory hierarchy Brief comparison to GPU & Tilera Programming Applications

More information

Planar Graphs with Many Perfect Matchings and Forests

Planar Graphs with Many Perfect Matchings and Forests Planar Graphs with Many Perfect Matchings and Forests Michael Biro Abstract We determine the number of perfect matchings and forests in a family T r,3 of triangulated prism graphs. These results show that

More information

Mapping Vector Codes to a Stream Processor (Imagine)

Mapping Vector Codes to a Stream Processor (Imagine) Mapping Vector Codes to a Stream Processor (Imagine) Mehdi Baradaran Tahoori and Paul Wang Lee {mtahoori,paulwlee}@stanford.edu Abstract: We examined some basic problems in mapping vector codes to stream

More information

Measurements on (Complete) Graphs: The Power of Wedge and Diamond Sampling

Measurements on (Complete) Graphs: The Power of Wedge and Diamond Sampling Measurements on (Complete) Graphs: The Power of Wedge and Diamond Sampling Tamara G. Kolda plus Grey Ballard, Todd Plantenga, Ali Pinar, C. Seshadhri Workshop on Incomplete Network Data Sandia National

More information

CS6702 GRAPH THEORY AND APPLICATIONS 2 MARKS QUESTIONS AND ANSWERS

CS6702 GRAPH THEORY AND APPLICATIONS 2 MARKS QUESTIONS AND ANSWERS CS6702 GRAPH THEORY AND APPLICATIONS 2 MARKS QUESTIONS AND ANSWERS 1 UNIT I INTRODUCTION CS6702 GRAPH THEORY AND APPLICATIONS 2 MARKS QUESTIONS AND ANSWERS 1. Define Graph. A graph G = (V, E) consists

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information