IEEE TRANSACTIONS ON MULTI-SCALE COMPUTING SYSTEMS

Finding and counting tree-like subgraphs using MapReduce

Zhao Zhao, Langshi Chen, Mihai Avram, Meng Li, Guanying Wang, Ali Butt, Maleq Khan, Madhav Marathe, Judy Qiu, Anil Vullikanti

Abstract: Several variants of the subgraph isomorphism problem, e.g., finding, counting and estimating frequencies of subgraphs in networks, arise in a number of real-world applications, such as genetic network analysis in bioinformatics, web analysis, disease diffusion prediction and social network analysis. These problems are computationally challenging to scale to very large networks with millions of nodes. In this paper, we present SAHAD, a MapReduce-based algorithm for detecting and counting trees of bounded size using the elegant color coding technique developed by N. Alon, R. Yuster and U. Zwick (Journal of the ACM). SAHAD is a randomized algorithm, and we show rigorous bounds on the approximation quality and the performance. We implement SAHAD on two different frameworks: the standard Hadoop model and Harp, which is closer to a high performance computing environment, and evaluate its performance on a variety of synthetic and real networks. SAHAD scales to very large networks and to tree-like (acyclic) templates with up to 12 nodes. Further, we extend our results by implementing our algorithm using the Harp framework. The new implementation gives two orders of magnitude improvement in performance over the standard Hadoop implementation and achieves comparable or even better performance than a state-of-the-art MPI solution.

Index Terms: subgraph isomorphism, graph partitioning, MapReduce, Hadoop, Harp

Author affiliations: Zhao Zhao, Ali Butt, Madhav Marathe and Anil Vullikanti are with the Network Dynamics and Simulation Science Laboratory, Biocomplexity Institute & Department of Computer Science, Virginia Tech, VA 24061. E-mail: zhaozhao@vt.edu, butta@cs.vt.edu, mmarathe@vt.edu, vsakumar@vt.edu. Maleq Khan is with the Department of Electrical Engineering and Computer Science, Texas A&M University-Kingsville. E-mail: maleq.khan@tamuk.edu. Langshi Chen, Meng Li, and Mihai Avram are with the Computer Science Department, Indiana University. E-mail: lc37@indiana.edu, l56@umail.iu.edu, mavram@umail.iu.edu. Judy Qiu is with the Intelligent Systems Engineering Department, Indiana University. E-mail: xqiu@indiana.edu. Guanying Wang is with Google Inc. E-mail: wang.guanying@gmail.com.

1 INTRODUCTION

GIVEN two graphs G and H, the subgraph isomorphism problem asks if H is isomorphic to a subgraph of G. The counting problem associated with this seeks to count the number of copies of H in G. These and other variants are fundamental problems in Network Science and have a wide range of applications in areas such as bioinformatics, social networks, the semantic web, transportation and public health. Analysts in these areas tend to search for meaningful patterns in networked data, and these patterns are often specific subgraphs such as trees.

Three different variants of subgraph analysis problems have been studied extensively. The first involves counting specific subgraphs, which has applications in bioinformatics [4], [16]. The second involves finding the most frequent subgraphs, either in a single network or in a family of networks; this has been used for finding patterns in bioinformatics (e.g., []), recommendation networks [], chemical structure analysis [3], and detecting memory leaks [5]. The third involves finding subgraphs which are either over-represented or under-represented compared to random networks with similar properties; such subgraphs are referred to as motifs. Milo et al.
[6] identify motifs in many networks, such as protein-protein interaction (PPI) networks, ecosystem food webs and neuronal connectivity networks. Subgraph counts have also been used in characterizing networks [8].

The subgraph isomorphism problem and its variants are well known to be computationally challenging. In general, the decision version of the problem is NP-hard, and the counting problem is #P-hard. Extensive work has been done in theoretical computer science on this problem; we refer the reader to the recent papers [1], [1], [4] for an extensive discussion of the decision and counting complexity of the problem and of tractable results for various parameterized versions of the problem.

The primary focus of this paper is on the three variants of the subgraph isomorphism problem mentioned above when k, the number of nodes in the template H, is fixed. Letting n be the number of nodes in G, one immediately gets simple algorithms with running time O(n^k) to find and count the number of copies of template H in G. Note that in this paper we focus on non-induced subgraph matching. When the template is a tree or has bounded treewidth, Alon et al. [4] present an elegant randomized approximation algorithm, based on the color coding technique, with running time O(2^{2k} |E| e^k log(1/δ) / ε^2), where ε and δ are error and confidence parameters, respectively. Their result was significantly improved by Koutis and Williams [19], who gave an algorithm with running time O(2^k |E|). A number of practical heuristics have also been developed for various versions of these problems, especially for the frequent subgraph mining problem. An example is the Apriori method, which uses a level-wise exploration of the template [18], [] in generating candidates for subgraphs at each level.

These have been made to run faster by better pruning and exploration techniques, e.g., [], [], [4]. Other approaches in relational databases and data mining involve queries for specifically labeled subgraphs, and have combined relational database techniques with careful depth-first exploration, e.g., [8], [31], [3]. Most of these approaches are sequential, and generally scale to modest-size graphs G and templates H.

Parallelism is necessary to scale to much larger networks and templates. In general, these approaches are hard to parallelize, as it is difficult to decompose the task into independent subtasks. Furthermore, it is not clear whether candidate generation approaches [], [], [4] can be parallelized and scaled to large graphs and computing clusters. Two recent approaches for parallel algorithms related to this work are [8], [41]. The approach of Bröcheler et al. [8] requires a complex preprocessing and enumeration process, which has a high end-to-end time, while the approach of [41] involves an MPI-based implementation with a very high communication overhead for larger templates. Two other papers [7], [36] develop MapReduce-based algorithms for approximately counting the number of triangles with a work complexity bound of O(|E|). The development of parallel algorithms for subgraph analysis with rigorous polynomial work complexity, implementable on heterogeneous computing resources, remains an open problem. Due to the complexity of enumerating subgraphs, some work instead computes metrics of the subgraph that are anti-monotone in the subgraph size. The algorithm reported in [3] is capable of computing subgraph support on large networks with up to 1 billion edges. However, it requires each machine to have a copy of the graph in memory, which limits its scalability to larger graphs; additionally, computing support requires much less computational effort than counting subgraphs. Another recent work employs MapReduce to match subgraphs [35] and scales to networks with up to 3 million edges. Other approaches studied in the context of data mining and databases, e.g., [8], [31], [3], are capable of processing large networks, but are usually slow due to the limitations of database techniques for processing networks.

Our contributions. In this paper, we present SAHAD, a new algorithm for Subgraph Analysis using Hadoop, with rigorously provable polynomial work complexity for several variants of the subgraph isomorphism problem when H is a tree. SAHAD scales to very large graphs and, because of the Hadoop implementation, runs flexibly on a variety of computing resources, including the Amazon EC2 cloud. We also adapt SAHAD to the Harp [9] framework to utilize its advanced MPI-like collective communication; it scales to graphs with more than a billion edges. Our specific contributions are discussed below.

1. SAHAD is the first MapReduce-based algorithm for finding and counting labeled trees in very large networks. The only prior Hadoop-based approaches have been for triangles [7], [36], [37] on very large networks, or for more general subgraphs on relatively small networks [3]. Our main technical contribution is the development of a Hadoop version of the color coding algorithm of Alon et al. [4], [5], which is a (sequential) randomized approximation algorithm for subgraph counting: for any ε, δ, it gives a (1±ε)-approximation to the number of embeddings with probability at least 1-δ. We prove that the work complexity of SAHAD is O(k 2^{2k} |E_G| e^k log(1/δ) / ε^2), which exceeds the running time of the sequential algorithm of [4] by just a factor of k.
2. We demonstrate our results on instances generated using the Erdős-Rényi random graph model and the Chung-Lu random graph model, and on synthetic social contact graphs for the cities of Miami and Chicago (with 5.7 and 68.9 million edges, respectively), constructed using the methodology of [7]. We study the performance of counting unlabeled/labeled templates with up to 12 nodes. The total running times for 12-node templates on the Miami and Chicago networks are both under 35 minutes; note that these are total end-to-end times, which do not require any additional pre-processing (unlike, e.g., [8]).

3. We discuss how our basic algorithms for counting subgraphs can be extended to compute supervised motifs and graphlet frequency distributions. They can also be extended to count labeled subgraphs.

4. SAHAD runs easily on heterogeneous computing resources; e.g., it scales well when we request up to 16 nodes on a medium-size cluster with 32 cores per node. Our Hadoop-based implementation is also amenable to running on public clouds, e.g., Amazon EC2 [6], except for a 12-node template, which produces an extremely large amount of data and incurs an I/O bottleneck on the virtual disks of EC2. It is worth noting that the performance of SAHAD on EC2 is otherwise almost the same as on the local cluster. This would enable researchers to perform useful queries even if they do not have access to large resources, such as those required to run previously proposed querying infrastructures. We believe this aspect is unique to SAHAD and lowers the barrier to entry for scientific researchers to utilize advanced computing resources.

5. We study the performance improvement in extensions of the standard Hadoop framework; the enhanced algorithm is called EN-SAHAD. We consider techniques to explicitly control the sorting and inter-partition communication in Hadoop. We find that reducing the sorting step by pre-allocating reducers can improve the performance by about 20%, but improved partitioning does not seem to help.

6. Finally, we implement SAHAD within the Harp [9] framework; the new algorithm is called HARPSAHAD+. HARPSAHAD+ yields an order of magnitude improvement in performance, as a result of its flexibility in task scheduling, data flow control and in-memory caching. We are therefore able to scale to networks with billions of edges using HARPSAHAD+ and obtain performance comparable to a state-of-the-art MPI/C++ implementation.

Organization. Section 3 introduces the background for the subgraph counting problem and for MapReduce, its open-source implementation Hadoop, and the Harp system. In Section 4, we give a brief overview of the color coding algorithm proposed by Alon et al. in [4]. In Section 5 we present our MapReduce implementations, and in Section 6 we study the computation cost of our algorithm.

Section 7 proposes several variations of the subgraph counting problem that can be computed using our framework, while Section 8 discusses experimental results for SAHAD, EN-SAHAD and HARPSAHAD+. Finally, Section 9 concludes the paper.

Extension from the conference version. The SAHAD algorithm appeared in [4]. The results on EN-SAHAD and HARPSAHAD+ are new additions. Since the publication of [4], there has been more work on parallelizing the color coding technique, e.g., [33], [34]. However, none of these are based on MapReduce and its generalizations.

2 RELATED WORK

As mentioned earlier, the subgraph isomorphism problem and its variants have been studied extensively by theoretical computer scientists; see [1], [1], [13], [17], [4], [38] for complexity-theoretic results. Marx and Pilipczuk [4] undertake a comprehensive study of the decision problem and provide strong lower bounds, including fixed parameter intractability results. They also study the complexity of the problem as a function of structural properties of G and H.

A variety of different algorithms and heuristics have been developed for different domain-specific versions of subgraph isomorphism problems. One version involves finding frequent subgraphs, and many approaches for this problem use the Apriori method from frequent item set mining [14], [18], []. These approaches involve candidate generation during a breadth-first search on the subset lattice and a determination of the support of item sets by a subset test. A variety of optimizations have been developed, e.g., using a DFS order to avoid the cost of candidate generation [], [4] or pruning techniques, e.g., [].

A related problem is computing the graphlet frequency distribution, which generalizes the degree distribution [8]. Another class of results for frequent subgraph finding is based on the powerful technique of color coding (which also forms the basis of our paper), e.g., [4], [16], [41], which has been used for approximating the number of embeddings of templates that are trees or tree-like. In [4], Alon et al. use color coding to compute the distribution of treelets of sizes 8, 9 and 10 on the protein-protein interaction network of yeast. The color coding technique is further explored and improved in [16], in terms of worst-case performance and practical considerations. For example, by increasing the number of colors, they speed up the color coding algorithm by orders of magnitude. They also reduce the memory usage for minimum-weight path finding by carefully removing unsatisfied candidates and reducing the color set storage. A recent work by Venkatesan et al. [?] extends color coding to subgraphs with treewidth up to 2, and they scale their algorithm to graphs with up to .7 million edges. Most of these approaches in bioinformatics applications involve small templates, and have only been scaled to relatively small graphs with at most 10^4 nodes (apart from [41], which shows scaling to much larger graphs by means of a parallel implementation).

Other settings in relational databases and data mining have involved queries for specific labeled subgraphs. Some of the approaches for these problems have combined relational database techniques, based on careful indexing and translation of queries, with a depth-first exploration strategy distributed over different partitions of the graph, e.g., [8], [31], [3], and scale to very large graphs. For instance, Bröcheler et al. [8] demonstrate labeled subgraph queries with up to 7-node templates on graphs with over half a billion edges, by carefully partitioning the massive network using minimum edge cuts and distributing the partitions over computing nodes. A shared-memory parallelization with an OpenMP implementation of the color coding approach is given in [33].
This algorithm achieves a speedup of 1 on a graph with 1.5 million nodes and 31 million edges. A more recent work [34] parallelizes the dynamic processing of the color coding algorithm to enumerate subgraphs and is able to handle networks with billions of edges, with template sizes up to 12.

3 BACKGROUND

3.1 Preliminaries and problem statement

We consider labeled graphs G = (V_G, E_G, L, l_G), where V_G and E_G are the sets of nodes and edges, L is a set of labels and l_G : V_G -> L is a labeling of the nodes. A graph H = (V_H, E_H, L, l_H) is a non-induced subgraph of G if V_H is a subset of V_G and E_H is a subset of E_G. We say that a template graph T = (V_T, E_T, L, l_T) is isomorphic to a non-induced subgraph H = (V_H, E_H, L, l_H) of G if there exists a bijection f : V_T -> V_H such that: (i) for each (u, v) in E_T, we have (f(u), f(v)) in E_H, and (ii) for each v in V_T, we have l_T(v) = l_H(f(v)). In this paper, we assume T is a tree. We consider trees to be rooted, and use ρ = ρ(T) in V_T to denote the root of T, which is chosen arbitrarily. If T is isomorphic to a non-induced subgraph H under the mapping f(.), we also say that H is a non-induced embedding of T with the root ρ(T) mapped to node f(ρ(T)). Figure 1 shows an example of a non-induced embedding of a template T in a graph G. Let emb(T, G) denote the number of all embeddings of template T in graph G. Here, we focus on approximating emb(T, G).

Fig. 1: The shaded subgraph is a non-induced embedding of T; the mapping of the template to the subgraph is indicated by arrows.

An (ε, δ)-approximation to emb(T, G). We say that a randomized algorithm A produces an (ε, δ)-approximation to emb(T, G) if the estimate Z produced by A satisfies Pr[ |Z - emb(T, G)| > ε emb(T, G) ] <= δ; in other words, A is required to produce an estimate that is close to emb(T, G) with high probability.
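To make the definition of a non-induced embedding concrete, the following minimal Python sketch checks conditions (i) and (ii) for a candidate mapping f. It is an illustration of the definitions only, not part of SAHAD; the helper name and the toy graph are ours.

```python
def is_noninduced_embedding(T_edges, T_labels, G_adj, G_labels, f):
    """Check conditions (i) and (ii) from Section 3.1 for a candidate mapping f
    from template nodes to graph nodes (brute-force illustration only)."""
    if len(set(f.values())) != len(f):          # f must be injective
        return False
    for u, v in T_edges:                        # (i) every template edge maps to a graph edge
        if f[v] not in G_adj[f[u]]:
            return False
    for u, lab in T_labels.items():             # (ii) node labels must be preserved
        if G_labels[f[u]] != lab:
            return False
    return True

# Template: a 3-node labeled path u1-u2-u3; a small made-up graph stands in
# for the graph G of Figure 1, which is not reproduced here.
T_edges = [("u1", "u2"), ("u2", "u3")]
T_labels = {"u1": "adult", "u2": "kid", "u3": "adult"}
G_adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
G_labels = {1: "adult", 2: "kid", 3: "adult", 4: "senior"}
print(is_noninduced_embedding(T_edges, T_labels, G_adj, G_labels,
                              {"u1": 1, "u2": 2, "u3": 3}))   # True
print(is_noninduced_embedding(T_edges, T_labels, G_adj, G_labels,
                              {"u1": 3, "u2": 2, "u3": 1}))   # True: a second embedding
```

Enumerating all such mappings explicitly is exactly what becomes infeasible at scale, which motivates the approximation guarantee defined above.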

Problems studied. We consider the following two problems:

1) Subgraph counting: Given a template T and a graph G, compute an (ε, δ)-approximation to emb(T, G). When the labels can be disregarded, we refer to this as the Unlabeled Subgraph Counting problem; otherwise, it is referred to as the Labeled Subgraph Counting problem.

2) Graphlet Frequency Distribution (GFD) [8]: a graphlet is another name for a subgraph. We say a node touches a graphlet T if it is contained in an embedding of T in the graph G. The graphlet degree of a node v is the number of graphlets it touches. Given a size parameter k, the GFD of a graph G is the frequency distribution of the graphlet degrees of all nodes with respect to all graphlets of size up to k. The specific problem is to obtain an approximation to the GFD. In this paper, we focus on treelets, i.e., we only consider trees of size up to k.

3.2 MapReduce, Hadoop and Harp

MapReduce and its extensions have become a dominant computation model in big data analysis. It involves two stages of data processing: (a) dividing the input into distinct map tasks and distributing them to multiple computing entities, and (b) merging the results of the individual computing entities in reduce tasks to produce the final output [11]. The MapReduce model processes data in the form of key-value pairs <k, v>. An application first takes pairs of the form <k1, v1> as input to the map function, which produces one or more <k2, v2> pairs for each input pair. MapReduce then re-organizes all <k2, v2> pairs and aggregates all items v2 associated with the same key k2, which are then processed by a reduce function.

Hadoop [39] is an open-source implementation of MapReduce. By defining application-specific map and reduce functions, the user can employ Hadoop to manage and allocate appropriate resources to perform the tasks, without dealing with the complexity of load balancing, communication and task scheduling. Due to its reliability and scalability in handling vast amounts of computation in parallel, Hadoop has become a de facto solution for large parallel computing tasks. Hadoop falls short in two respects, though: (i) the high I/O cost within the mapper, shuffling and reducer stages, since data is always read from and written to disk in every stage of a Hadoop job; and (ii) the global synchronization of mappers and reducers, i.e., reducers can start only when all mappers have completed their tasks and vice versa, which reduces the efficient usage of computing resources.

To overcome the problems Hadoop faces, we further extend our work to use the Harp platform [9]. Harp introduces full collective communication (broadcast, reduce, allgather, allreduce, rotation, regroup or push & pull), adding a separate communication abstraction. The advantage of using in-memory collective communication in place of the shuffling phase is that fine-grained data alignment and data transfer for many synchronization patterns can be optimized. Harp categorizes four types of computation models (Locking, Rotation, Allreduce, Asynchronous), based on the synchronization patterns and the effectiveness of the model parameter update. They provide the basis for a systematic approach to parallelizing iterative algorithms. Figure 2 shows the four categories of the computing model.

Fig. 2: Harp has 4 computation models: (A) Locking, (B) Rotation, (C) AllReduce, (D) Asynchronous.

The Harp framework has been used by 35 students at Indiana University for their course projects. It has now been released as an open source project available in the public GitHub domain [1]. Harp provides a collection of iterative machine learning and data analysis algorithms
(e.g., K-means, Multi-class Logistic Regression, Random Forests, Support Vector Machines, Neural Networks, Latent Dirichlet Allocation, Matrix Factorization, Multi-Dimensional Scaling) that have been tested and benchmarked on the OpenStack Cloud and on HPC platforms including the Haswell and Knights Landing architectures. It has also been used for subgraph mining, force-directed graph drawing, and image classification applications.

4 THE SEQUENTIAL ALGORITHM: COLOR CODING

TABLE 1: Notation
G: the graph | T, T1, T2: template and sub-templates
n, m: number of nodes, number of edges | k: number of nodes in T
ρ: root of T | S, s_i: color set, the i-th color
d(v): degree of node v | N(v): neighbors of node v

We briefly introduce the color coding algorithm for subgraph counting [5], which gives a randomized approximation scheme for counting trees in a graph. Some of the notation used in the paper is listed in Table 1.

High level description. There are two main ideas underlying the color coding algorithm of [5].

1) Colorful embeddings: Color the nodes of the graph with k colors, where k >= |V_T|, and only count colorful embeddings: an embedding H of the template T is colorful if each node in H has a distinct color. The advantage is that the number of colorful embeddings can be counted by a simple and natural dynamic program.

a) In particular, let C(v, T(ρ), S) be the number of colorful embeddings of T with node v in V_G mapped to the root ρ, using the color set S, where |V_T| = |S|.

b) Suppose (ρ = u1, u2) is an edge incident on the root node ρ in T. Let the tree T be partitioned into trees T1 and T2 when the edge (u1, u2) is removed, with roots ρ1 = u1 and ρ2 = u2 of the trees T1 and T2, respectively.

c) Suppose S1 and S2 are disjoint subsets of colors such that |S1| = |V_{T1}| and |S2| = |V_{T2}|. Let H1 and H2 be two colorful embeddings of T1 and T2 using color sets S1 and S2, respectively, with ρ1 and ρ2 mapped to neighboring nodes v1 in V_G and v2 in V_G, respectively. Then H1 and H2 must be non-overlapping, because they have distinct colors.

d) Therefore, C(v1, T, S) = Σ_{S = S1 ∪ S2} Σ_{v2 in N(v1)} C(v1, T1(v1), S1) · C(v2, T2(v2), S2), where the first summation is over all partitions S1 ∪ S2 of S and the second summation is over all neighbors v2 of v1.

2) Random colorings: If the coloring is done uniformly at random with k = |V_T| colors, there is a reasonable probability, k!/k^k, that an embedding is colorful; this allows us to get a good approximation of the number of embeddings.

Algorithm 1 The sequential color coding algorithm.
1: Input: Graph G = (V, E) and template T = (V_T, E_T)
2: Output: Approximation to emb(T, G)
3:
4: For each v in V_G, pick a color c(v) in S = {1,..., k} uniformly at random, where k = |V_T|.
5: Partition the tree T into subtrees recursively to form a set T using algorithm PARTITION(T(ρ)). Each tree T' in T has a root ρ'. Furthermore, if |V_{T'}| > 1, T' is partitioned into two trees T'1, T'2 with roots ρ'1 = ρ' and τ', respectively, which are referred to as the active and passive children of T'.
6: For each v in V_G, T' in T with root ρ', and subset S' of S with |S'| = |T'|, compute C(v, T'(ρ'), S') using recurrence (1) below:

   C(v, T'(ρ'), S') = (1/d) Σ_{u in N(v)} Σ_{S' = S'1 ∪ S'2} C(v, T'1(ρ'), S'1) · C(u, T'2(τ'), S'2),   (1)

   where d equals one plus the number of siblings of τ' that are roots of subtrees isomorphic to T'2(τ').
7: For the j-th random coloring, let

   C^(j) = (1/q) (k^k / k!) Σ_{v in V_G} C(v, T(ρ), S),   (2)

   where q denotes the number of nodes ρ' in V_T such that T is isomorphic to itself when ρ is mapped to ρ'.
8: Repeat the above steps N = O(e^k log(1/δ) / ε^2) times, and partition the N estimates C^(1),..., C^(N) into t = O(log(1/δ)) sets. Let Z_j be the average of set j. Output the median of Z_1,..., Z_t.

Algorithm 2 PARTITION(T(ρ))
1: if T not in T then
2:   if |V_T| = 1 then
3:     Add T to T
4:   else
5:     Add T to T
6:     Pick τ in N(ρ), the set of neighbors of ρ, and partition T into two sub-templates by cutting the edge (ρ, τ)
7:     Let T1 be the sub-template containing ρ (named the active child) and T2 the other (named the passive child)
8:     PARTITION(T1(ρ))
9:     PARTITION(T2(τ))

Algorithm 1 describes the sequential color coding algorithm, and Figure 3 gives an example of computing Eq. 1.

Fig. 3: The example shows one step of the dynamic programming in color coding. T in Figure 1 is split into T1 and T2. To count C(w1, T(v1), S), i.e., the number of embeddings of T(v1) rooted at w1 using the color set S = {red, yellow, blue, purple, green}, we first obtain C(w1, T1(v1), {r, y, b}) = 2 and C(w5, T2(v3), {p, g}) = 1. Then C(w1, T(v1), S) = C(w1, T1(v1), {r, y, b}) · C(w5, T2(v3), {p, g}) = 2. The embeddings of T are the subgraphs with nodes {w3, w4, w1, w5, w6} and {w3, w2, w1, w5, w6}. Here s, c, b represent the labels of the nodes. Details of labeled subgraph counting can be found in [4].
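To make the two ideas above concrete before turning to the parallel algorithms, the following self-contained Python sketch (ours, not the paper's code) specializes the dynamic program to an unlabeled k-node path rooted at one endpoint, so that the correction factor d in Eq. (1) is always 1 and the rooted-automorphism count q is 2. It compares the color-coding estimate against brute force on a tiny random graph.

```python
import math
import random
from itertools import combinations

def colorful_path_counts(adj, color, k):
    """C[v][S] = number of colorful embeddings of a k-node path whose root
    endpoint is mapped to v and whose nodes use exactly the color set S
    (the path specialization of the recurrence in Eq. (1), with d = 1)."""
    C = {v: {frozenset([color[v]]): 1} for v in adj}        # 1-node paths
    for _ in range(k - 1):
        nxt = {v: {} for v in adj}
        for v in adj:
            for u in adj[v]:                                # passive child rooted at a neighbor u
                for S, cnt in C[u].items():
                    if color[v] in S:
                        continue                            # color reused: not colorful
                    S2 = S | {color[v]}
                    nxt[v][S2] = nxt[v].get(S2, 0) + cnt
        C = nxt
    return C

def estimate_path_embeddings(adj, k, iterations=300, seed=0):
    """Color-coding estimate of the number of k-node simple paths in the graph."""
    rng = random.Random(seed)
    scale = k ** k / math.factorial(k)                      # 1 / Pr[a fixed embedding is colorful]
    total = 0.0
    for _ in range(iterations):
        color = {v: rng.randrange(k) for v in adj}          # random coloring (Algorithm 1, line 4)
        C = colorful_path_counts(adj, color, k)
        colorful = sum(sum(table.values()) for table in C.values())
        total += colorful * scale / 2                       # q = 2: a path may be rooted at either end
    return total / iterations

def brute_force_paths(adj, k):
    """Exact count of k-node simple paths, for checking the estimate."""
    def extend(path):
        if len(path) == k:
            return 1
        return sum(extend(path + [u]) for u in adj[path[-1]] if u not in path)
    return sum(extend([v]) for v in adj) // 2               # each path is found from both ends

if __name__ == "__main__":
    rng = random.Random(1)
    adj = {v: set() for v in range(12)}
    for u, v in combinations(range(12), 2):                 # small random test graph
        if rng.random() < 0.3:
            adj[u].add(v)
            adj[v].add(u)
    print("exact:", brute_force_paths(adj, 4),
          "color-coding estimate:", round(estimate_path_embeddings(adj, 4)))
```

SAHAD distributes exactly this kind of join over MapReduce by keying the partial counts on the root vertex, as described next.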
5 PARALLEL ALGORITHMS

In this section, we present a parallelization of the color coding approach using the MapReduce framework. We first describe SAHAD [4], followed by EN-SAHAD and HARPSAHAD+, respectively.

5.1 SAHAD

SAHAD takes as input the family of sub-templates T = {T'1, T'2, ...} generated by partitioning T using Algorithm 2, and performs a MapReduce variation of Algorithm 1 to compute the number of embeddings of T.

As shown in Equation 1, the count of all colorful embeddings isomorphic to T' rooted at a single node v is computed by aggregating the same quantity for the two sub-templates, with T'1 rooted at v and T'2 rooted at u in N(v). We can parallelize the color coding algorithm by distributing the computation among multiple machines and sending the data related to v and N(v) to a single computation unit for the aggregation. In our MapReduce algorithm, we manage this by assigning v as the key both for the counts of T'1 rooted at v and for the counts of T'2 rooted at v's neighbors, such that all data required for computing the counts of T' rooted at v has the same key and is handled by a single reduce function.

Let X_{T',v} be a sequence of color-count pairs (S_0, c_0), (S_1, c_1), ..., where each S_i = {s_i1, s_i2, ..., s_ik'} is a color set containing k' colors and c_i is the count of the subgraphs isomorphic to T' that are rooted at v and colored exactly by S_i. Here k' = |V(T')|, and each such subgraph is a colorful match.

There are three types of Hadoop jobs in SAHAD: 1) the colorer (Algorithm 3), which performs line 4 of Algorithm 1; 2) the counter (Algorithms 4, 5), which performs line 6 of Algorithm 1; and 3) the finalizer (Algorithms 6, 7), which performs line 7 of Algorithm 1. The first step is to randomly color the network G with k colors. The map function is described in Algorithm 3:

Algorithm 3 mapper(v, N(v))
1: Pick s_i in {s_1,..., s_k} uniformly at random
2: Color v with s_i
3: Let T0 be the single-node template
4: Let c(v, T0, {s_i}) = 1, since v is the only colorful matching
5: X_{T0,v} <- {({s_i}, 1)}
6: Collect(key v, value <X_{T0,v}, N(v)>)

Here Collect is a standard MapReduce operation that emits key-value pairs to the global space for further processing such as shuffling, sorting or I/O, and N(v) represents the neighbors of v. Note that template T0 is a single node, so X_{T0,v} contains only a single color-count pair ({s_v}, 1).

According to Equation 1, to compute X_{T',v} we need X_{T'1,v} for sub-template T'1 and X_{T'2,u} for all u in N(v) for sub-template T'2. We implement this with a mapper and a reducer function, shown in Algorithms 4 and 5, respectively.

Algorithm 4 mapper(v, X_{t,v}, N(v))
1: if t is T'1 then
2:   Collect(key v, value <X_{t,v}, flag 1>)
3: else
4:   for u in N(v) do
5:     Collect(key u, value <X_{t,v}, flag 2>)

Note that in Algorithm 4, the second Collect emits X_{T'2,v} to all of v's neighbors. Therefore, as shown in Algorithm 5, X_{T'1,v} and X_{T'2,u} from all u in N(v) are handled by the same reducer, which is sufficient for computing Eq. 1. Also note that, for a given node v, the number of entries with flag 1 is 1, and the number of entries with flag 2 equals |N(v)|.

Algorithm 5 reducer(v, (X, flag), (X, flag), ...)
1: Pick X_1 where flag = 1
2: for all color sets S from X_1 do
3:   for each X other than X_1 do
4:     for all color sets S' from X do
5:       if S ∩ S' = ∅ then
6:         c(v, T', S ∪ S') += c(S) · c(S')
7: Collect(key v, value <X_{T',v}, N(v)>)

The last step is to compute the total count described in Eq. 2, shown in Algorithms 6 and 7.

Algorithm 6 mapper(v, X_{T,v}, N(v))
1: Collect(key "sum", value X_{T,v})

Algorithm 7 reducer("sum", X_{T,v1}, X_{T,v2}, ...)
1: Y = (1/q) (k^k / k!) Σ_{v in V_G} X_{T,v}
2: Collect(key "sum", value Y)

Note that in Algorithm 6, X_{T,v} contains only one element, which is the count corresponding to the entire color set. In the reducer shown in Algorithm 7, all the counts are added together and properly scaled to obtain the final count. For a comprehensive description of the MapReduce version of color coding, please refer to [4]. (An illustrative sketch of the color-set join performed by Algorithm 5 is given after Section 5.2 below.)

5.2 EN-SAHAD

For a general MapReduce problem, the set of keys processed in the Mapper and Reducer varies among different jobs. Therefore, MapReduce uses external shuffling and sorting between Mappers and Reducers to deploy the keys to computing nodes. In our algorithm, however, the dynamic program aggregates counts based on the root node of the subtree, and therefore the key is always the node index v. In EN-SAHAD, we use this prior knowledge to predefine a reducer that corresponds to a set of nodes. We also assign the predefined reducers to computing nodes before the dynamic program begins. Therefore, a data entry with key v is sent directly to the corresponding computing node and processed by the designated Reducer. Using this mechanism, we can reduce the cost of shuffling and sorting in the intermediate stage of the Hadoop jobs.
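As the sketch referenced above, the following minimal Python fragment illustrates the color-set join of Algorithm 5: it combines the color-count table of the active child rooted at v with the tables of the passive child received from v's neighbors. It is an illustration under our own naming, not SAHAD's implementation, and it omits the division by d from Eq. (1), which only matters when the passive child is isomorphic to sibling subtrees.

```python
from itertools import product

def join_color_counts(active_table, passive_tables):
    """Combine X_{T'1,v} (active child rooted at v) with the X_{T'2,u} tables
    received from v's neighbors u, mirroring lines 1-6 of Algorithm 5.
    Tables map frozenset-of-colors -> count of colorful embeddings using
    exactly that color set. Returns the table X_{T',v} for the joined T'."""
    joined = {}
    for (S1, c1), table in product(active_table.items(), passive_tables):
        for S2, c2 in table.items():
            if S1 & S2:
                continue                        # overlapping color sets: not colorful
            S = S1 | S2
            joined[S] = joined.get(S, 0) + c1 * c2
    return joined

# Example: v has two neighbors; the active child T'1 is a single node, the
# passive child T'2 is an edge, so T' is a 3-node path rooted at v.
X_active = {frozenset({"red"}): 1}                    # X_{T'1,v}
X_from_u1 = {frozenset({"blue", "green"}): 2}         # X_{T'2,u1}
X_from_u2 = {frozenset({"red", "green"}): 1}          # overlaps with "red": contributes nothing
print(join_color_counts(X_active, [X_from_u1, X_from_u2]))
# -> {frozenset({'red', 'blue', 'green'}): 2}
```

In SAHAD this join runs inside a reducer keyed on v; EN-SAHAD keeps the same join but routes the inputs to pre-allocated reducers to avoid the shuffle and sort.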
5.3 HARPSAHAD+

HARPSAHAD+ is built upon the Harp framework [?], [?], which adopts a variety of advanced technologies from research on high performance Java. HARPSAHAD+ has the following optimizations over the MapReduce (SAHAD) version: 1) It uses a two-level parallel programming model: at the inter-node level, the workload is distributed by Harp mappers; at the intra-node level, the local workload is divided and assigned to multiple Java threads. 2) For inter-node communication, it utilizes an MPI-Alltoall-like regroup operation provided by Harp. 3) For intra-node computation, it utilizes the Habanero Java thread library from Rice University [?] and adopts a Long-Running-Thread programming style [?] to unleash the potential performance of the Java language.

5.3.1 Inter-Node Communication

In SAHAD, the template counts of a vertex v and of all of its neighbours N(v) are assigned the same key value v; therefore, they are shuffled into the same reducer to complete the counting process. In HARPSAHAD+, we remove the reducer module and replace it with a user-defined mapper function. The whole set of vertices V is distributed and cached in the memory space of p Harp mappers. Each mapper i holds a subset of vertices V_i, with s_i = |V_i|.

In the mapper function, we create a table LTable with s_i entries, and each entry j < s_i serves as a reducer for vertex v_j. HARPSAHAD+ then uses a regroup operation to shuffle the data within memory, but in a collective way. Each mapper function creates another Harp Table object, RTable, containing multiple partitions, to transfer the data. A preprocessing function is fired to record re-usable information required by the regroup operations in each iteration. In the preprocessing stage, each mapper i obtains, via an allgather communication operation, a copy of all the vertex IDs v together with the mapper ID j such that v in V_j. The mapper then parses the neighbour lists N(v) of all the local vertices V_i and labels each vertex u, with u in N(v) but u not in V_i, with the mapper ID j such that u in V_j. Therefore, each mapper i keeps a queue Q_{i,j} of vertex IDs for each mapper j, with v in Q_{i,j} implying v in V_j. By sending Q_{i,j} to mapper j, each mapper j finally obtains a sending queue Q_{j,i} of vertices.

In each iteration of HARPSAHAD+, the regroup operation fired by mapper i has three steps: 1) For each sending queue Q_{i,j}, load the sub-template counts of the vertices v in Q_{i,j} into a partition Par_{i,j} of RTable. 2) The sender and receiver mapper identities, i and j, are coded into a single partition ID for Par_{i,j}; during the collective regrouping, a designed Harp partitioner decodes the partition ID and delivers the partition Par_{i,j} to the receiver mapper j. 3) After the regroup operation, the Harp table RTable of each mapper i contains the counts of the vertices u in N(v) needed to update the sub-template counts of the local vertices v in LTable.

5.3.2 Intra-Node Computation

HARPSAHAD+ extends the MapReduce framework by taking advantage of the multi-threading programming model within a shared-memory node. We favor Habanero Java threads over the java.lang.Thread implementation because they allow users to set thread affinity on multicore/many-core processors. We also embrace the so-called Long-Running-Thread programming style, where we create the threads in the outermost loop and keep them running until the end of the program. This approach avoids the overhead of frequently creating and destroying threads; instead, it uses a java.util.concurrent.CyclicBarrier object to synchronize threads when required.

6 PERFORMANCE ANALYSIS

In this section, we discuss the performance of SAHAD in terms of the overall work and time complexity. Throughout this section, we denote the number of nodes and edges in the network by n and m, respectively, and we use k to denote the number of nodes in the template. We write C(k, k_i) for the binomial coefficient "k choose k_i".

Lemma 6.1. For a template T', suppose the sizes of the two sub-templates T'1 and T'2 are k_1 and k_2, respectively. Then the sizes of the input and output of Algorithm 4 for a node v are O(C(k,k_1) + C(k,k_2) + d(v)) and O(C(k,k_2) d(v)), respectively, and the size of the input to Algorithm 5 is O(C(k,k_2) d(v)).

Proof. For a node v, the input to Algorithm 4 involves the corresponding X_{T'1,v} and X_{T'2,v} for T'1 and T'2, as well as N(v), which together have size O(C(k,k_1) + C(k,k_2) + d(v)). If the input is for T'2, Algorithm 4 generates multiple key-value pairs for node v, one for each node u in N(v); therefore, the output has size O(C(k,k_2) d(v)). For a given v, the input to Algorithm 5 is the combination of the above and therefore has size O(C(k,k_2) d(v)).

Lemma 6.2. The total work complexity is O(k 2^{2k} |E_G| e^k log(1/δ) / ε^2).

Proof. For a node v and each neighbor u in N(v), Algorithm 5 aggregates every pair of the form (S_a, C_a) in X_{T'1,v} and (S_b, C_b) in X_{T'2,u}, which leads to a work complexity of O(C(k,k_1) C(k,k_2) d(v)).
Since |T| <= k, the total work over all nodes and sub-templates is at most

O( Σ_{v,T'} C(k,k_1) C(k,k_2) d(v) ) = O( Σ_v k 2^{2k} d(v) ) = O(k 2^{2k} |E_G|).   (3)

Since O(e^k log(1/δ) / ε^2) iterations are performed in order to get the (ε, δ)-approximation, the lemma follows.

Time complexity. We use P to denote the number of machines. We assume each machine is configured to run a maximum of M Mappers and R Reducers simultaneously. Finally, we assume a uniform partitioning, so that each machine processes n/P nodes.

Lemma 6.3. The time complexity of Algorithms 3 and 4 is O(n/(P M)) and O(m/(P M)), respectively.

Proof. We first consider Algorithm 3, which takes as input an entry of the form (v, N(v)) for some node v and performs constant work. There are n/P entries processed by each machine. Since M Mappers run simultaneously, this gives a running time of O(n/(P M)). Next, we consider Algorithm 4. Each Mapper outputs (v, X) for input T'1, and d entries, one for each u in N(v), for input T'2, where d is the degree of v. Therefore, each computing node performs O(Σ_{i=1}^{n/P} d_i) = O(m/P) steps, where d_i is the degree of node v_i. Again, since M Mappers run simultaneously, the total running time is O(m/(P M)).

Lemma 6.4. The time complexity of Algorithm 5 is O(m 2^{2k} / (P R)).

Proof. Suppose |S| = k_1 and |S'| = k_2. The numbers of possible color sets S and S' are C(k,k_1) and C(k,k_2), respectively. Line 2 of Algorithm 5 involves O(C(k,k_1)) = O(2^k) steps; similarly, line 4 also involves O(2^k) steps, and line 3 involves O(d) steps. Therefore, the total running time is O(d 2^{2k}). Each machine processes n/P entries corresponding to different nodes, leading to a total of O(Σ_i d_i 2^{2k} / P) = O(m 2^{2k} / P) steps. Since R reducers run in parallel on each machine, this leads to a total time of O(m 2^{2k} / (P R)).

Lemma 6.5. The time complexity of Algorithms 6 and 7 is O(n/(P M)) and O(n), respectively.

Proof. Algorithm 6 emits a single entry for each input. Following the same outline as the proof of Lemma 6.3, its running time is O(n/(P M)).

Algorithm 7 takes O(n) time, since there is only one key, "sum", and only one Reducer is assigned the summation over all v in V(G), which takes O(n) time.

Lemma 6.6. The overall running time of SAHAD is bounded by

O( (k 2^{2k} m / P) (1/M + 1/R) e^k log(1/δ) / ε^2 ).   (4)

Proof. Algorithm 3 takes O(n/(P M)) time. Algorithms 4 and 5 run for each step of the dynamic programming, i.e., joining two sub-templates into a larger template as shown in Figure 3. Since the total number of sub-templates is O(k) when T is a tree, Algorithms 4 and 5 run O(k) times. Therefore, the total time is O(k (m/(P M) + m 2^{2k}/(P R))) = O(k 2^{2k} (m/P)(1/M + 1/R)). Finally, the entire algorithm has to be repeated O(e^k log(1/δ)/ε^2) times in order to get the (ε, δ)-approximation, and the lemma follows.

6.1 Performance Analysis of the Intermediate Stage

With SAHAD, a major bottleneck of a Hadoop job in terms of running time is the shuffling and sorting cost in the intermediate stage between Mapper and Reducer, due to the high I/O and synchronization costs, as shown by the black bar in Figure 4.

Fig. 4: Time spent in each stage of a Hadoop job that produces the color-counts for a 5-node template by aggregating the 2-node and 3-node sub-trees. The black bar is the time for the intermediate stage, which performs shuffling and sorting.

We observe that the external shuffling and sorting stage takes roughly twice the time of the reducing stage, which dramatically increases the overall running time. Given that the keys in the Mappers and Reducers are always the indices of the nodes v in V(G), we can enhance SAHAD by removing the shuffling and sorting in the intermediate stages; instead, we can designate Reducers and send the data directly to the corresponding Reducers.

7 VARIATIONS OF SUBGRAPH ISOMORPHISM PROBLEMS

So far we have discussed the basic framework of the algorithm, including how to compute the total number of subgraph embeddings in Algorithm 7. We now discuss a set of problems closely related to the subgraph isomorphism problem, including finding supervised motifs and computing the graphlet frequency distribution, which can be computed using our framework. Note that our algorithm is particularly suitable for computing over multiple templates when they share common sub-templates, since those common sub-templates only need to be computed once. This is the case in many problems, where common sub-templates such as single nodes, edges, or simple paths are shared.

7.1 Supervised Motif Finding

Motifs of a real-world network are specific templates whose embeddings occur with much higher frequencies than in random networks; they are referred to as the building blocks of networks and have been found in many real-world networks [6]. Our algorithm reduces the computational cost for a group of templates since the common sub-templates are computed only once; therefore, this approach is amenable to supervised motif finding.

7.2 Graphlet Frequency Distribution

The graphlet frequency distribution has been proposed as a way of measuring the similarity of protein-protein interaction networks [8], where common properties such as degree distribution, diameter, etc., may not suffice. Unlike motifs, the graphlet frequency distribution is computed over all selected small subgraphs, regardless of whether they appear frequently or not. The graphlet frequency distribution D(i, T) measures the number of nodes that touch i graphlets isomorphic to T. The number of graphlets touching a single node v can be computed from the counts of the same template T with the root placed at different nodes of T.
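Before turning to the experiments, it helps to see the magnitudes behind the bounds above. The short sketch below (ours, for illustration; the hidden constants in the O(.) terms are taken as 1) evaluates the colorful-embedding probability k!/k^k from Section 4 and the nominal iteration count N = e^k ln(1/δ)/ε^2 from Algorithm 1, line 8, for a few template sizes.

```python
import math

def colorful_probability(k):
    """Probability that a fixed k-node embedding becomes colorful under a
    uniform random k-coloring (Section 4): k!/k^k."""
    return math.factorial(k) / k ** k

def nominal_iterations(k, eps, delta):
    """Worst-case iteration count N = e^k * ln(1/delta) / eps^2 from Algorithm 1,
    line 8, with the hidden constant taken as 1 for illustration."""
    return math.ceil(math.exp(k) * math.log(1 / delta) / eps ** 2)

for k in (5, 7, 10, 12):
    p = colorful_probability(k)
    print(f"k={k:2d}  P[colorful]={p:.4f}  scale k^k/k!={1/p:10.1f}  "
          f"N(eps=0.1, delta=0.05)={nominal_iterations(k, 0.1, 0.05):,}")
```

As Section 8.2.1 shows, in practice a single coloring already gives sub-percent error on the networks studied, far below what this worst-case bound suggests.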
8 EXPERIMENTAL ANALYSIS OF SAHAD, EN-SAHAD & HARPSAHAD+

We carry out a detailed experimental analysis of SAHAD, EN-SAHAD and HARPSAHAD+, focusing on four aspects:

(i) Quality of the solution: We compare the color coding results with exact counts on small graphs in order to measure the empirical approximation error of our algorithms, and show that the error is very small (less than 0.5% with one iteration, as shown in Figure 7); in the following experiments we therefore run the program for a single iteration.

(ii) Scalability of the algorithms as a function of template size, graph size and computing resources: We carried out experiments using templates with sizes ranging from 3 to 12 nodes, including both labeled and unlabeled templates. The graphs we use range from several hundred thousand nodes to tens of millions. We also study how our algorithm scales with computing resources, including the number of threads per node, the number of computing nodes, and different settings of mappers and reducers.

(iii) Variations of the problem: Our framework extends to a variety of measurements related to the subgraph counting problem. In the experiments, we show unlabeled/labeled subgraph counting and graphlet distribution results.

(iv) Enhancing overall performance by system tuning.

We also investigate the different components of the system and their impact on the overall performance. For example, EN-SAHAD addresses the communication and sorting cost in the intermediate stage of the system and gives approaches for improvement. We also propose a degree-based graph partitioning scheme that can improve the performance of Harp by imposing better load balancing of the computation within each partition. Table 2 highlights the main results we obtained with the various methods.

TABLE 2: Comparison of SAHAD, EN-SAHAD and HARPSAHAD+
Method | Networks | Templates | Performance
SAHAD | 68M edges | 12 nodes | tens of minutes for a 7-node template on Chicago
EN-SAHAD | 1M edges | 5 nodes | 20% improvement over SAHAD
HARPSAHAD+ | more than a billion edges | up to 12 nodes | 10-20 times faster than SAHAD

Fig. 5: Templates used in the experiments. The unlabeled templates are U5-1, U5-2, U5-3, U7-1 and U1-1; the labeled templates are L7-1, L1-1 and L1-1, with node labels combining gender (m, f) and age group (k, y, a, s).

8.1 Experiment Design

Datasets. For our experiments, we use synthetic social contact networks of the following cities and regions: Miami, Chicago, New River Valley (NRV), and New York City (NYC) (see [7] for details). We consider demographic labels {kid, youth, adult, senior} based on the age, together with the gender, of individuals. We also run experiments on a G(n, p) graph (denoted GNP1) with n nodes, where each pair of nodes is connected with probability p and nodes are randomly assigned labels. We additionally experiment on a few other networks: Web-Google [], RoadNet (rnet) [], Twitter [1] and Chung-Lu random graphs [9]. Table 3 summarizes the characteristics of the networks.

TABLE 3: Networks used in the experiments (numbers of nodes and edges, in millions, for Twitter, Miami, Chicago, NYC, NRV, rnet, GNP1 and Web-Google).

Templates. The templates we use in the experiments are shown in Figure 5. The templates vary in size from 5 to 12 nodes; U5-1, ..., U1-1 are the unlabeled templates and L7-1 and L1-1 as well as L1-1 are the labeled templates. In the labels, m, f, k, y, a and s stand for male, female, kid, youth, adult and senior, respectively.

Computing Environment. For experiments with SAHAD, we use the computing cluster Athena, which has 42 computing nodes and a large RAM footprint.
Each node has a quad-socket AMD 2.3 GHz Magny-Cours 8-core processor, i.e., 32 cores per node or 1344 cores in total, and 64 GB of RAM (1.4 TFLOP peak). The local disk available on each node is 750 GB; therefore, we have a maximum of 31.5 TB of storage for HDFS. In most of our experiments we use up to 16 nodes, which give up to 12 TB of capacity for the computation. Although the number of cores and the RAM capacity of each node can support a large number of mappers/reducers, the availability of only a single disk on each node limits the aggregate I/O bandwidth of all parallel processes on the node. To make matters worse, the aggregate I/O of parallel processes performing sequential I/O can result in many extra disk seeks and hurt overall performance. Therefore, disk bandwidth is the bottleneck limiting further parallelism within each node; this limitation is discussed further in Section 8.2.2. We also use the public Amazon Elastic Compute Cloud (EC2) for some of our experiments. EC2 enables customers to instantly obtain cheap yet powerful computing resources and start the computing process with no upfront hardware cost. We allocated 4 High-CPU Extra-Large instances from EC2; each instance has 8 cores, 7 GB of RAM, and two 5 GB virtual disks (Elastic Block Store volumes). For experiments with HARPSAHAD+, we use the Juliet cluster (Intel Haswell architecture) with 1, 2, 4, 8 and 16 nodes. The Juliet cluster contains 32 nodes, each with two 18-core, 36-thread Intel Xeon E5-2699 processors, and 96 nodes, each with two 12-core, 24-thread Intel Xeon E5-2670 processors. All the nodes used in the experiments have Intel Xeon E5-2670 processors and 128 GB of memory, and all the experiments are performed over an InfiniBand FDR interconnect.

Performance metrics. We carry out experiments on SAHAD, EN-SAHAD and HARPSAHAD+. For SAHAD, we measure the approximation bounds, the impact of the Hadoop configuration (including the number of Mappers/Reducers), and the performance of queries for various templates and graphs. For EN-SAHAD, we measure the performance improvement gained by eliminating the sorting in the intermediate stage, and the impact of different partitioning schemes. For Harp, similarly to SAHAD, we measure the performance impact of various templates and graphs, as well as the system performance with respect to the number of computing nodes. We also compare HARPSAHAD+ with SAHAD to study the improvement Harp brings.

8.2 Performance of SAHAD

In this section, we evaluate various aspects of the performance. Our main conclusions are summarized below, and Table 4 summarizes the different experiments we perform, which are discussed in greater detail later.

1. Approximation bounds: While the worst-case bounds for the algorithm imply O(e^k log(1/δ)/ε^2) rounds to get an (ε, δ)-approximation (see Lemma 6.2), in practice we find that far fewer iterations are needed.

2. System performance: We run our algorithm on a diverse set of computing resources, including the publicly available Amazon EC2 cloud. We find that our algorithm scales well with the number of nodes, and that disk I/O is one of the main bottlenecks. We posit that employing multiple disks per node (a rising trend in Hadoop) or using I/O caching would help mitigate this bottleneck and boost performance even further.

3. Performance on various queries: We evaluate the performance on templates with sizes ranging from 5 to 12. We find that labeled queries are significantly faster than unlabeled ones, and the overall running time is under 35 minutes for these queries on our computing cluster (described above). We also get comparable performance on EC2.

8.2.1 Approximation bounds

As discussed in Section 3, the color coding algorithm averages the estimates over multiple iterations. Figure 6 shows the error at each iteration when counting U5-1 on Miami and Web-Google, respectively. We observe that the standard deviation of the error is 0.2% and 0.4% for Miami and Web-Google, which is very small.

Fig. 6: Error in counting U5-1 across iterations on (a) Miami (standard deviation 0.2%) and (b) Web-Google (standard deviation 0.4%).

In Figure 7, we show that the approximation error is below 0.5% for template U7-1 on the GNP1 graph, even with one iteration. The figure also plots results based on using more than 7 colors, which can sometimes improve the running time, as discussed in [16]. In the rest of the experiments, we only use the estimate from one iteration, because of the small error shown in this section. The error after i iterations is computed as |Z_i - emb(T, G)| / emb(T, G).

8.2.2 Performance Analysis

We now study how the running time is affected by the total number of computing nodes and by the number of reducers/mappers per node. We carry out two sets of experiments: (i) how the total running time scales with the number of computing nodes; (ii) how the running time is affected by varying the assignment of mappers/reducers per node.

1. Varying the number of computing nodes. Figure 8 shows that the running time for Miami drops from over four hours to less than 30 minutes when the number of computing nodes increases from 3 to 13. However, the curve for GNP1 does not show good scaling. The reason is that the actual computation for GNP1 consumes only a small portion of the running time, and there is overhead from managing the mappers/reducers. In other words, the curve for GNP1 shows a lower bound on the running time of our algorithm.

2. Varying the number of mappers/reducers per node. Here we consider two cases.

2.a. Varying the number of reducers per node. Figure 9 shows the running time on Athena when we vary the number of reducers per node; here we fix the number of nodes at 16 and the number of mappers per node at 4. We find that running 3 reducers concurrently on each node minimizes the total running time. In addition, we find that although increasing the number of reducers per node can reduce the time of the Reduce stage for a single job, the running time increases sharply in the Map and Shuffle stages. As a result, the total running time increases with the number of reducers. This can be explained by the visible I/O bottleneck for concurrent accesses on Athena, since Athena has only 1 disk per node.
This phenomenon is not present on EC2, as seen in Figure 11b, indicating that EC2 is better optimized for concurrent disk access in cloud usage.

Fig. 9: Running time vs. the number of reducers per node: (a) total running time; (b) running time of the job stages (map, shuffle and sort, reduce).

Fig. 7: Approximation error in counting U7-1 on GNP1 vs. the number of iterations, for color-set sizes of 7 and larger.

Fig. 8: Running time for counting U1-1 vs. the number of computing nodes, for Miami and GNP1.

2.b. Varying the number of mappers per node. Figure 10 shows the running time on Athena when we vary the number of mappers per node while fixing the number of reducers at 7 per node. We find that varying the number of mappers per node does not affect the performance. This is also validated on EC2, as shown in Figure 11.

2.c. Distribution of reducer running times. Figure 12 shows the distribution of the reducers' running times on Athena. We observe that as we increase the number of reducers per node, the distribution becomes more volatile; for example, at the largest setting of concurrent reducers per node, the reducers' completion times vary from a few minutes to more than 10 minutes.

TABLE 4: Summary of the experimental results (refer to Section 8.1 for the terminology used in the table)
Experiment | Computing resource | Template & Network | Key Observations
Approximation bounds | Athena | U7-1 & GNP1 | error well below 0.5%
Impact of the number of data nodes | Athena | U1-1 & Miami, GNP1 | scales from 4 hours to 30 minutes as data nodes go from 3 to 13
Impact of the number of concurrent reducers | Athena & EC2 | U1-1 & Miami | performance worsens on Athena
Impact of the number of concurrent mappers | Athena & EC2 | U1-1 & Miami | no apparent performance change
Unlabeled/labeled template counting | Athena & EC2 | templates from Figure 5 and networks from Table 3 | all tasks complete in less than 35 minutes
Graphlet frequency distribution | Athena | U5-1 & Miami, Chicago | completes in less than 35 minutes

Fig. 10: Running time vs. the number of mappers per node: (a) total running time; (b) running time of the job stages.

Fig. 11: Running time with respect to the number of (a) mappers and (b) reducers per node on EC2.

This also indicates the poor I/O performance of Athena under concurrent access.

8.2.3 Illustrative applications

In this section, we illustrate the performance on three different kinds of queries. We use Athena and assign 16 nodes as data nodes; for each node, we assign a maximum of 4 mappers and 3 reducers. Our experiments on EC2 for some of these queries are discussed later, in Section 8.2.4.

1. Unlabeled subgraph queries: Here we compute the counts of templates U5-1, U7-1 and U1-1 on GNP1 and Miami, as well as the running time, as shown in Figure 13. We observe that for unlabeled templates with up to 12 nodes on the Miami graph, the algorithm runs in less than 25 minutes.

2. Labeled subgraph queries: Here we count the total number of embeddings of templates L7-1, L1-1 and L1-1 on Miami and Chicago. Figure 14b shows that counting templates with up to 12 nodes on Miami takes less time than the 35 minutes needed for Chicago. The running time is much lower for the labeled subgraph queries than for the unlabeled subgraph queries. This is due to the fact that labeled templates have far fewer embeddings because of the label constraints.

Fig. 12: Distribution of reducer completion times with (a) 3, (b) 7, (c) 11 and (d) more reducers per computing node.

Fig. 13: Querying unlabeled subgraphs on GNP1 and Miami: (a) counts of the unlabeled subgraphs; (b) running time for counting unlabeled subgraphs.

3. Computing the graphlet frequency distribution: Figure 15 shows the graphlet frequency distribution for the Miami and Chicago networks, respectively. Using template U5-1 for this experiment, we observe that it takes less than 35 minutes to compute the graphlet frequency distribution on both Miami and Chicago.

8.2.4 Performance Study with Amazon EC2

On EC2, we run unlabeled and labeled subgraph queries on Miami and GNP1 for templates U5-1, U7-1, U1-1, L7-1, L1-1 and L1-1.

We use the same 4 EC2 instances discussed previously, and each node runs up to 8 reducers (and a fixed maximum number of mappers) concurrently. As shown in Figure 16, the running time on EC2 is comparable to that on Athena, except for U1-1 on Miami, which takes roughly 2.5 hours to finish on EC2 but only 25 minutes on Athena. This is because, for large templates and graphs as large as Miami, the input/output data as well as the I/O pressure on the disks is tremendous. EC2 uses virtual disks as local storage, which hurts overall performance when dealing with such a large amount of data.

Fig. 14: Querying labeled subgraphs on Miami and Chicago: (a) counts of the labeled subgraphs; (b) running time for counting labeled subgraphs.

Fig. 15: Graphlet distribution on (a) Miami and (b) Chicago (number of nodes vs. number of graphlets adjacent to a node).

Fig. 16: Running time for various unlabeled and labeled templates on EC2: (a) GNP1; (b) Miami.

8.3 Performance of EN-SAHAD

In this section, we test our algorithms on two real-world networks, NRV and RoadNet, and a number of their shuffled versions. We generate shuffled networks with 20, 40, 60, 80 and 100 percent shuffling ratios, and name them nrv20 to nrv100 and rnet20 to rnet100.

As discussed in Section 5.2, a major factor that affects the overall performance is the heavy shuffling and sorting cost in the intermediate stage of a Hadoop job. We mitigate this by designating node indices v to Reducers and pre-allocating the Reducers among the computing nodes. In this way, the key-value pairs from the Mappers can be sent directly to the corresponding Reducers without being shuffled and sorted. Figure 17 shows the overall running time of our algorithm on NRV, RoadNet and their variations. Here we generate the variations of a graph by shuffling a proportion of its edges; e.g., nrv40 is NRV with 40% of its edges shuffled. We observe that pre-allocating Reducers delivers roughly a 20% performance improvement.

Fig. 17: SAHAD vs. EN-SAHAD on (a) NRV and its variations and (b) RoadNet and its variations.

8.4 Performance of HARPSAHAD+

In the following experiments, we evaluate the performance of HARPSAHAD+ by comparing it with a state-of-the-art MPI subgraph counting program called MPI-Fascia. MPI-Fascia was developed by Slota et al. [34] and implements the same color coding algorithm as SAHAD and HARPSAHAD+. MPI-Fascia uses an MPI+OpenMP programming model; in our tests, it is compiled with g++ and the -O3 compiler option, together with OpenMPI. Also, we choose InfiniBand instead of Ethernet as the interconnect for testing MPI-Fascia and HARPSAHAD+, which makes the comparison more challenging for the Java-based communication operations of HARPSAHAD+.

8.4.1 Execution Time

In Figure 18a, we observe that HARPSAHAD+ has a 10x to 20x speedup over SAHAD on a single Haswell node. This tremendous improvement comes from two sources: 1) HARPSAHAD+ makes better use of the hardware resources (logical cores) through Habanero Java threads and affinity binding; 2) compared to the disk-based shuffle process of SAHAD, HARPSAHAD+ caches all of the data in main memory, which significantly reduces the overhead of data access.
8.4 Performance of HARPSAHAD+
In the following experiments, we evaluate the performance of HARPSAHAD+ by comparing it with a state-of-the-art MPI subgraph counting program called MPI-Fascia. MPI-Fascia was developed by Slota et al. [34] and implements the same color coding algorithm as SAHAD and HARPSAHAD+. MPI-Fascia uses an MPI+OpenMP programming model. In our tests, it is compiled with g++ using the -O3 compiler option, against OpenMPI. Also, we choose InfiniBand instead of Ethernet as the interconnect for testing MPI-Fascia and HARPSAHAD+, which poses more of a challenge to the Java based communication operations of HARPSAHAD+.

8.4.1 Execution Time
In Figure 18a, we observe that HARPSAHAD+ achieves a speedup of up to two orders of magnitude over SAHAD on a single Haswell node. This tremendous improvement comes from two sources: 1) HARPSAHAD+ makes better use of the hardware resources (logical cores) by using Habanero Java threads and affinity binding. 2) Compared to the disk based shuffle process of SAHAD, HARPSAHAD+ caches all of the data in main memory, which significantly reduces the overhead of data access.
In Figure 18b, we compare HARPSAHAD+ with MPI-Fascia on a Twitter dataset with large templates in a distributed environment of 16 Haswell nodes. HARPSAHAD+ achieves comparable or even slightly better performance than MPI-Fascia, which comes from its optimized communication operations. Figure 19 illustrates a breakdown of the execution time into computation and communication for one of the large Twitter runs. Because of the highly intensive computation workload, MPI-Fascia consumes less time in computation thanks to the compiler-level O3 optimization. However, HARPSAHAD+, as a pure Java implementation, still achieves almost the same total counting time with the help of optimized collective communication operations.

Fig. 18: (a) SAHAD vs. HARPSAHAD+ on a single Haswell node (templates U5-1 and U7-1 on Miami and Web-Google); (b) HARPSAHAD+ vs. MPI-Fascia on 16 Haswell nodes (large templates on NYC and Twitter).

Fig. 19: Breakdown of the counting time into computation and communication for a large template on Twitter, on 16 nodes.
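A breakdown like the one in Figure 19 is typically obtained by timing the two phases separately in each iteration and accumulating the results. The sketch below is a minimal way to do that; the phase callbacks computeLocalCounts and exchangeCountTables are placeholders, not HARPSAHAD+ APIs.

// Accumulates per-phase wall-clock time so that the total counting time can be
// split into computation and communication, as in the Figure 19 breakdown.
public class PhaseTimer {
    private long computeNanos = 0;
    private long commNanos = 0;

    public void timeIteration(Runnable computeLocalCounts, Runnable exchangeCountTables) {
        long t0 = System.nanoTime();
        computeLocalCounts.run();          // local color-coding table updates
        long t1 = System.nanoTime();
        exchangeCountTables.run();         // collective exchange of partial counts
        long t2 = System.nanoTime();
        computeNanos += t1 - t0;
        commNanos += t2 - t1;
    }

    public double computeSeconds()       { return computeNanos / 1e9; }
    public double communicationSeconds() { return commNanos / 1e9; }
}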

8.4.2 Problem Size Scaling
Next we study the performance of HARPSAHAD+ by controlling the number of nodes in a graph while increasing the number of edges. In this experiment, we use the Chung-Lu model [9] to generate a series of random graphs given the degree sequences of Miami and NYC and their variations. The average degree of the generated random graphs ranges from 5 to 20 for Miami and from 10 to 100 for NYC. In Figure 20, the running time generally increases with the number of edges, which agrees with the time complexity we derive in Section 6. For Miami, when the average degree increases from 5 to 20, the running time increases by only 1.7x. Also, a tenfold (10x) increase in average degree for the NYC graph accounts for less than a 2x increase in running time. This indicates that our HARPSAHAD+ implementation maintains good performance when computing the neighbours of vertices in parallel, which is due to the high efficiency of Java threads.

Fig. 20: Problem size scaling on the Chung-Lu graphs CL0 through CL9 generated from the degree sequences of (a) the Miami dataset and (b) the NYC dataset, with template U10-1 on Haswell nodes.
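For reference, a minimal (and deliberately unoptimized) Chung-Lu generator is sketched below: each edge (u, v) is kept independently with probability proportional to the product of the target degrees, and a single scale factor raises or lowers the resulting average degree. This is a sketch of the model under those assumptions, not the generator used in our experiments.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Minimal Chung-Lu generator: edge (u, v) is included with probability
// min(1, scale * w[u] * w[v] / W), where w is the target degree sequence
// and W is its sum. Increasing 'scale' increases the average degree.
public class ChungLu {
    public static List<int[]> generate(double[] w, double scale, long seed) {
        Random rng = new Random(seed);
        double totalWeight = 0;
        for (double wi : w) totalWeight += wi;

        List<int[]> edges = new ArrayList<>();
        for (int u = 0; u < w.length; u++) {
            for (int v = u + 1; v < w.length; v++) {
                double p = Math.min(1.0, scale * w[u] * w[v] / totalWeight);
                if (rng.nextDouble() < p) {
                    edges.add(new int[] {u, v});   // undirected edge u-v
                }
            }
        }
        return edges;
    }
}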
Varying number of computing nodes
In this section, we study the performance of HARPSAHAD+ as a function of the computing resources, i.e., computing nodes and threads per node. In Figure 21, we compare the inter-node strong scaling results of HARPSAHAD+ and MPI-Fascia. For the NYC dataset, we ran strong scaling tests on three templates; the value on the y-axis is the speedup on N nodes, obtained by dividing the time on a single node by the time on N nodes. Since the NYC dataset is relatively small for HARPSAHAD+ and MPI-Fascia, neither implementation is bound by the computation overhead, which prevents them from achieving linear speedup. However, HARPSAHAD+ (solid lines) still obtains better strong scalability than MPI-Fascia (dashed lines). Furthermore, MPI-Fascia could not run on two nodes due to a memory capacity bottleneck, and it shows no scalability beyond 4 nodes. For the Twitter dataset, HARPSAHAD+ again outperforms MPI-Fascia after 4 nodes. The speedup is also improved, because Twitter gives a much larger workload than NYC and HARPSAHAD+ becomes more bound by the computation overhead.

Fig. 21: Strong scaling of HARPSAHAD+ (solid lines) vs. MPI-Fascia (dashed lines). (a) NYC dataset with templates U5-1, U7-1 and U10-1; (b) Twitter dataset with templates U3-1, U5-1 and U7-1.

Degree based partitioning schemes
In the above experiments, we partition the graphs evenly, without considering the nature of the problem or the structure of the graphs; in that naive approach, each partition has the same number of vertices. In this section, we experiment with a new partitioning scheme based on a degree-related metric D_p, shown in Equation 5. Given a vertex with degree d, there are in total \binom{d}{2} different pairs of edges on which the sub-templates τ1 and τ2 can reside, i.e., O(d^2) ways to join the sub-templates at that vertex. Hence, in order to induce a roughly equal computational cost within each partition p, we partition the graph such that each partition has a similar value of

D_p = \sum_{v \in p} d_v^2,    (5)

where d_v is the degree of node v. We expect the computation in each partition to be roughly the same under this partitioning scheme, which reduces the overhead due to synchronization and unbalanced loads.
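A simple way to realize this scheme is a greedy load-balancing pass: process vertices in decreasing order of their cost d_v^2 and assign each one to the partition whose accumulated D_p is currently smallest. The sketch below is a minimal illustration of that idea under these assumptions, not the partitioner used in HARPSAHAD+, which must also keep each partition's adjacency data co-located with the worker that owns it.

import java.util.Arrays;
import java.util.Comparator;

// Greedy degree-based partitioning: vertices are assigned so that the
// per-partition metric D_p = sum of d_v^2 stays roughly balanced.
public class DegreePartitioner {
    // degrees[v] is the degree of vertex v; returns the partition id of each vertex.
    public static int[] partition(int[] degrees, int numPartitions) {
        Integer[] order = new Integer[degrees.length];
        for (int v = 0; v < degrees.length; v++) order[v] = v;
        // Placing the largest joining costs first makes the greedy balance tighter.
        Arrays.sort(order, Comparator.comparingLong((Integer v) -> -(long) degrees[v] * degrees[v]));

        long[] load = new long[numPartitions];      // current D_p of each partition
        int[] assignment = new int[degrees.length];
        for (int v : order) {
            int best = 0;
            for (int p = 1; p < numPartitions; p++) {
                if (load[p] < load[best]) best = p;
            }
            assignment[v] = best;
            load[best] += (long) degrees[v] * degrees[v];  // add this vertex's d_v^2
        }
        return assignment;
    }
}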

Fig. 22: Even partitioning vs. degree-based partitioning on the Chung-Lu graphs CL0 through CL9 derived from (a) the Miami dataset and (b) the NYC dataset.

In Figure 22(a), the benefit of using the degree-partitioned Miami dataset is merely around 5% on average, which is largely due to the relatively small size of the graph and of its computational cost. In contrast, the degree partition on the NYC dataset yields a substantially larger improvement on average, and an even larger one in the best case. This shows that for larger graphs, which incur high computational costs, the partitioning scheme plays a major role by reducing load imbalance.

9 CONCLUSION
In this paper we described an efficient parallel algorithm for computing the number of isomorphic embeddings of a subgraph in very large networks using MapReduce and the color coding technique. We first develop SAHAD, a Hadoop based implementation, and provide a performance analysis in terms of work and time complexity. After observing large sorting and communication costs in SAHAD, we explore two approaches to remedy these problems. The first approach, called EN-SAHAD, entails tightly coupling the graph vertices to the mappers and reducers, so as to reduce the sorting and shuffling phases of the MapReduce jobs. The second approach is an implementation of the color coding algorithm using the Harp framework, called HARPSAHAD+, which employs collective communication and shared memory to better facilitate computation and communication. Our experiments show that HARPSAHAD+ improves performance over SAHAD by almost two orders of magnitude, and simultaneously achieves comparable or even better execution time and scalability than a state-of-the-art MPI solution. HARPSAHAD+ can process networks with more than a billion edges and templates with up to 12 nodes. We also explore the performance of these implementations on different cluster architectures such as EC2 on-demand nodes and Intel Haswell nodes. Finally, we introduce a novel graph-load partitioning scheme which improves the performance on large graphs and templates. As directions for future research, it would be interesting to devise new algorithms that scale to larger instances. Additionally, it would be interesting to implement variants of these algorithms for restricted classes of networks.

REFERENCES
[1] Harp.
[2] SNAP: Stanford Network Analysis Project.
[3] E. Abdelhamid, I. Abdelaziz, P. Kalnis, Z. Khayyat, and F. Jamour. ScaleMine: scalable parallel frequent subgraph mining in a single large graph. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, page 61. IEEE Press, 2016.
[4] N. Alon, P. Dao, I. Hajirasouliha, F. Hormozdiari, and S. Sahinalp. Biomolecular network motif counting and discovery by color coding. Bioinformatics, 24(13):i241–i249, 2008.
[5] N. Alon, R. Yuster, and U. Zwick. Color-coding. Journal of the ACM (JACM), 42(4):844–856, 1995.
[6] Amazon. Elastic Compute Cloud (EC2). aws.amazon.com/ec2.
[7] C. Barrett, R. Beckman, M. Khan, V. Kumar, M. Marathe, P. Stretz, T. Dutta, and B. Lewis. Generation and analysis of large synthetic social contact networks. In Winter Simulation Conference, 2009.
[8] M. Bröcheler, A. Pugliese, and V. Subrahmanian. COSI: Cloud oriented subgraph identification in massive social networks. In 2010 International Conference on Advances in Social Networks Analysis and Mining. IEEE, 2010.
[9] F. Chung and L. Lu. Connected components in random graphs with given expected degree sequences. Annals of Combinatorics, 6(2):125–145, 2002.
[10] R. Curticapean and D. Marx. Complexity of counting subgraphs: Only the boundedness of the vertex-cover number counts. In Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on. IEEE, 2014.
[11] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[12] J. Flum and M. Grohe. The parameterized complexity of counting problems. SIAM Journal on Computing, 33(4):892–922, 2004.
[13] F. V. Fomin, D. Lokshtanov, V. Raman, S. Saurabh, and B. R. Rao. Faster algorithms for finding and counting subgraphs. Journal of Computer and System Sciences, 78(3):698–706, 2012.
[14] L. Getoor and C. Diehl. Link mining: a survey. ACM SIGKDD Explorations Newsletter, 7(2):3–12, 2005.
[15] J. Huan, W. Wang, J. Prins, and J. Yang. SPIN: mining maximal frequent subgraphs from graph databases. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2004.
[16] F. Hüffner, S. Wernicke, and T. Zichner. Algorithm engineering for color-coding with applications to signaling pathway detection. Algorithmica, 52(2):114–132, 2008.
[17] H. B. Hunt III, M. V. Marathe, V. Radhakrishnan, and R. E. Stearns. The complexity of planar counting problems. SIAM Journal on Computing, 27(4), 1998.
[18] A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for mining frequent substructures from graph data. Principles of Data Mining and Knowledge Discovery, pages 13–23, 2000.
[19] I. Koutis and R. Williams. Limits and applications of group algebras for parameterized problems. In Proc. ICALP, 2009.
[20] M. Kuramochi and G. Karypis. Finding frequent patterns in a large sparse graph. Data Mining and Knowledge Discovery, 11(3):243–271, 2005.
[21] H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a social network or a news media? In WWW '10: Proceedings of the 19th international conference on World Wide Web, pages 591–600, New York, NY, USA, 2010. ACM.
[22] J. Leskovec, A. Singh, and J. Kleinberg. Patterns of influence in a recommendation network. Advances in Knowledge Discovery and Data Mining, 2006.
[23] Y. Liu, X. Jiang, H. Chen, J. Ma, and X. Zhang. MapReduce-based pattern finding algorithm applied in motif detection for prescription compatibility network. Advanced Parallel Processing Technologies, 2009.
[24] D. Marx and M. Pilipczuk. Everything you always wanted to know about the parameterized complexity of subgraph isomorphism (but were afraid to ask). In 31st International Symposium on Theoretical Aspects of Computer Science, 2014.
[25] E. Maxwell, G. Back, and N. Ramakrishnan. Diagnosing memory leaks using graph mining on heap dumps. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2010.
[26] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon. Network motifs: simple building blocks of complex networks. Science, 298(5594):824–827, 2002.
[27] R. Pagh and C. Tsourakakis. Colorful triangle counting and a MapReduce implementation. arXiv preprint, 2011.
[28] N. Pržulj. Biological network comparison using graphlet degree distribution. Bioinformatics, 23(2):e177–e183, 2007.
[29] J. Qiu, S. Jha, A. Luckow, and G. C. Fox. Towards HPC-ABDS: an initial high-performance big data stack. Building Robust Big Data Ecosystem, ISO/IEC JTC 1 Study Group on Big Data, pages 18–21, 2014.

[30] J. Raymond and P. Willett. Maximum common subgraph isomorphism algorithms for the matching of chemical structures. Journal of Computer-Aided Molecular Design, 16(7):521–533, 2002.
[31] R. Ronen and O. Shmueli. Evaluating very large datalog queries on social networks. In Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology. ACM, 2009.
[32] S. Sakr. GraphREL: A decomposition-based and selectivity-aware relational framework for processing sub-graph queries. In Database Systems for Advanced Applications. Springer, 2009.
[33] G. M. Slota and K. Madduri. Fast approximate subgraph counting and enumeration. In Parallel Processing (ICPP), 2013 42nd International Conference on. IEEE, 2013.
[34] G. M. Slota and K. Madduri. Parallel color-coding. Parallel Computing, 47:51–69, 2015.
[35] B. Suo, Z. Li, Q. Chen, and W. Pan. Towards scalable subgraph pattern matching over big graphs on MapReduce. In Parallel and Distributed Systems (ICPADS), 2016 IEEE 22nd International Conference on. IEEE, 2016.
[36] S. Suri and S. Vassilvitskii. Counting triangles and the curse of the last reducer. In Proceedings of the 20th international conference on World Wide Web. ACM, 2011.
[37] C. Tsourakakis, U. Kang, G. Miller, and C. Faloutsos. DOULION: Counting triangles in massive graphs with a coin. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2009.
[38] L. G. Valiant. The complexity of enumeration and reliability problems. SIAM Journal on Computing, 8(3):410–421, 1979.
[39] T. White. Hadoop: The definitive guide. Yahoo Press.
[40] X. Yan, X. Zhou, and J. Han. Mining closed relational graphs with connectivity constraints. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. ACM, 2005.
[41] Z. Zhao, M. Khan, V. Kumar, and M. Marathe. Subgraph enumeration in large social contact networks using parallel color coding and streaming. In Parallel Processing (ICPP), 2010 39th International Conference on. IEEE, 2010.
[42] Z. Zhao, G. Wang, A. R. Butt, M. Khan, V. Kumar, and M. V. Marathe. SAHAD: Subgraph analysis in massive networks using Hadoop. In Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International. IEEE, 2012.

Zhao Zhao is pursuing his Ph.D. degree in Computer Science at Virginia Tech. He is also a Software Engineer in Verisign Labs, Verisign Inc. His research interests are in Network Science and analytics, especially in the design and analysis of parallel graph algorithms.

Langshi Chen is a Postdoctoral researcher at the School of Informatics and Computing at Indiana University. His research interests include linear solvers for HPC systems, energy efficiency of HPC applications, data intensive machine learning applications on many-core architectures, and so forth.

Mihai Avram is currently a Masters student studying Computer Science at Indiana University. His research interests involve applying various CS sub-domains such as Big Data, High Performance Computing, IoT, Machine Learning, HCI, and Data Mining to solve large scale social problems.

Meng Li is a Computer Science Ph.D. student in the School of Informatics and Computing at Indiana University. His advisor is Prof. Judy Qiu. His research interest is distributed systems and parallel computing.

Guanying Wang earned his PhD in Computer Science from Virginia Tech. He is now a software engineer at Google.

Ali Butt received his Ph.D. degree in Electrical & Computer Engineering from Purdue University in 2006.
He is a recipient of an NSF CAREER Award, IBM Faculty Awards, a VT College of Engineering (COE) Dean's award for Outstanding New Assistant Professor, and NetApp Faculty Fellowships. Ali's research interests are in distributed computing systems and I/O systems.

Maleq Khan is an Assistant Professor in the Department of Electrical Engineering and Computer Science at Texas A&M University-Kingsville. He received his Ph.D. in Computer Science from Purdue University. His research interests are in parallel and distributed computing, big data analytics, high performance computing, and data mining.

Madhav Marathe is a professor of Computer Science and the Director of the Network Dynamics and Simulation Science Laboratory, Biocomplexity Institute, Virginia Tech. His research interests include high performance computing, modeling and simulation, theoretical computer science and socio-technical systems. He is a fellow of the IEEE, ACM and AAAS.

Judy Qiu is an associate professor of Intelligent Systems Engineering in the School of Informatics and Computing at Indiana University. Her research interests are parallel and distributed systems, cloud computing, and high-performance computing. Her research has been funded by NSF, NIH, Intel, Microsoft, Google, and Indiana University. Judy Qiu leads the Intel Parallel Computing Center (IPCC) site at IU. She is a recipient of an NSF CAREER Award.

Anil Vullikanti is an Associate Professor in the Department of Computer Science and the Biocomplexity Institute of Virginia Tech. His interests are in the areas of approximation and randomized algorithms, distributed computing, graph dynamical systems and their applications to epidemiology, social networks and wireless networks. He is a recipient of the NSF and DOE CAREER awards.
