Exploring Communities in Large Profiled Graphs

Yankai Chen, Yixiang Fang, Reynold Cheng, Member, IEEE, Yun Li, Xiaojun Chen, Jie Zhang

Y. Chen, Y. Fang, and R. Cheng are with the Department of Computer Science, The University of Hong Kong, Hong Kong. E-mail: {ykchen, yxfang, ckcheng}@cs.hku.hk
Y. Li is with the Department of Computer Science and Technology, Nanjing University, China. E-mail: liycse@gmail.com
X. Chen is with the College of Computer Science and Software, Shenzhen University, China. E-mail: xjchen@szu.edu.cn
J. Zhang is with the School of Computer Science and Engineering, Nanyang Technological University, Singapore. E-mail: zhangj@ntu.edu.sg
Manuscript received March 20, 2018.

arXiv:1901.05451v1 [cs.DB] 16 Jan 2019

Abstract

Given a graph G and a vertex $q \in G$, the community search (CS) problem aims to efficiently find a subgraph of G whose vertices are closely related to q. Communities are prevalent in social and biological networks, and can be used in product advertisement and social event recommendation. In this paper, we study profiled community search (PCS), where CS is performed on a profiled graph. This is a graph in which each vertex has labels arranged in a hierarchical manner. Extensive experiments show that PCS can identify communities with themes that are common to their vertices, and is more effective than existing CS approaches. As a naive solution for PCS is highly expensive, we have also developed a tree index, which facilitates efficient and online solutions for PCS.

Index Terms: community search, social networks, graph queries, profiled graph

1 INTRODUCTION

Due to the recent developments of gigantic social networks (e.g., Flickr, Facebook, and Twitter), topics of graph queries have attracted attention from industry and research areas [1-7]. Communities, which are often found in large graphs, can be used in various applications, such as social event setting, friend recommendation, and research collaboration analysis [8-12]. Given a graph G and a query vertex $q \in G$, the goal of community search (CS) is to extract communities, or densely connected subgraphs of G that contain q, in an online manner.

[Figure 1: (a) a profiled graph; (b) a subtree of CCS; (c) abbreviations: CM = Computing Methodology, ML = Machine Learning, AI = Artificial Intelligence, IS = Information Systems, DMS = Data Management System, HW = Hardware.]
Fig. 1. A profiled graph, a subtree of CCS and meanings of terms.

In this paper, we investigate the CS problem for a profiled graph. This is essentially a kind of attributed graph, where each graph vertex is associated with a set of labels arranged in a hierarchical manner called a P-tree. Fig. 1(a) shows a profiled graph, which is a computer science collaboration network; each vertex represents a researcher, and a link between two vertices indicates that the two corresponding researchers have worked together before. Each vertex is associated with a P-tree, which describes the expertise of the researcher. Fig. 1(c) shows the meanings of the terms in each P-tree, following the ACM Computing Classification System (CCS, http://www.acm.org/publications/class-2012), which is partially presented in Fig. 1(b). For instance, vertex B denotes a researcher whose research domain is computing methodology (CM), with specific interest in machine learning (ML) and artificial intelligence (AI).

Profiled graphs are informative and can be found in various graph applications (e.g., knowledge bases, social and collaboration networks). Moreover, the P-trees of profiled graphs systematically organize labels related to a vertex (e.g., hierarchical and interrelated knowledge in knowledge bases, affiliation, expertise, and locations in social and collaboration networks), reflecting the semantic relationship among them. For example, in a P-tree, label London can be a child node of UK, because London is a UK city.

[Figure 2: (a) two PCs; (b) {B, C, D}, with common subtree CM, ML, AI; (c) {A, D, E}, with common subtree IS, DMS.]
Fig. 2. Illustrating profiled community search (PCS).

Prior works. Methods for retrieving communities can generally be classified into community detection (CD) methods and community search (CS) methods. In general, the aim of CD algorithms is to retrieve all communities of a graph [13-20]. Note that these solutions are not query-based: given a user-specified query vertex, they are not customized for the query request. As a result, these algorithms normally take a long time to find all the communities of a large graph, and so CD algorithms are not suitable for quick or online retrieval of communities.

To solve these problems, CS solutions have been recently proposed [8, 10, 21-24]. Compared with CD solutions, CS approaches are query-based, and thus are suitable for deriving communities in an online manner. However, to the best of our knowledge, previous CS algorithms are not designed for profiled graphs. Early solutions (e.g., [8-10]) often only consider graph topology (e.g., a k-core is a community such that each vertex is connected to k or more vertices). They did not consider the use of vertex labels. As pointed out in [11], the communities returned by those solutions are often huge (e.g., a community can easily contain over 1,000 vertices). Moreover, the vertices included in the communities were not closely related. Recent works, such as ACQ [11] and ATC [12], propose to use both graph structure and vertex label information. While these works have been shown to be more effective than CS solutions that do not utilize vertex labels, they did not employ the hierarchical relationship among labels (e.g., P-trees in Fig. 1(a)). This may lead to suboptimal results.

In Fig. 1(a), suppose that a renowned expert D wants to organize a seminar where researchers are closely related to each other. Based on the ACQ solution [11], with k = 2, only one 2-core is returned (Fig. 2(b)), whose vertices {B, C, D} have several labels (i.e., CM, ML, AI) in common. However, it fails to return the community in Fig. 2(c), whose vertices are also highly similar. For these two communities, the shared labels as well as their relationships in the P-tree are very different. Therefore, both communities can be presented to the organizer for further selection.

Profiled community search. In this paper, we study profiled community search (PCS), which aims to find profiled communities, or PCs, for a profiled graph. To obtain high-quality communities, we use structure cohesiveness and profile cohesiveness to constrain PCs. We adopt the widely used metric minimum degree [8, 25-29] to measure the structure cohesiveness. Note that in the PCS problem, the minimum degree metric can be replaced by other useful metrics, e.g., k-truss [10] and k-clique [22], to fit other possible application scenarios.
In a profiled graph, each vertex is associated with a P-tree. To measure the profile cohesiveness, we fully utilize the information in P-trees. Conceptually, a PC is a group of densely connected vertices whose P-trees have the largest degree of overlap. This overlapping part is the largest common subtree shared by all the vertices. Fig. 2(a) illustrates two PCs in the profiled graph of Fig. 1, namely {B, C, D} and {A, D, E}. Fig. 2(b) and Fig. 2(c) show the two PCs together with their largest common subtrees. For example, in Fig. 2(c), vertices A, D, and E all possess the subtree containing the nodes IS and DMS. Notice that these three vertices also form a 2-core of D, and the common subtree among them is the largest. The common subtree sufficiently reflects the theme of the community. In the PC of Fig. 2(b), all the researchers involved share interest in machine learning and artificial intelligence, whereas in Fig. 2(c), the researchers are all interested in information systems and data management systems.

Personalization. The PCS problem allows a query user to search for communities that exhibit both structure cohesiveness and profile cohesiveness. The parameter k controls how densely the community is connected. The profile cohesiveness constrains the community members to be as semantically similar as possible. For instance, PCS methods can answer questions such as "who are my close friends with whom I have strong connections and common interests and expertise?" In contrast, existing CD methods [30-32] often use some global criterion (e.g., modularity), where the graph is partitioned a priori with no reference to the particular query vertices. Thus existing CD methods are not suitable for personalized queries.

Online search. Similar to other online CS approaches, our PCS method is able to find PCs from a large-scale profiled graph effectively and efficiently. Existing CD methods, in contrast, are generally slower for graph query problems, mainly because they are designed to retrieve all the communities of an entire graph.

Contributions. As we will explain, a simple solution to the PCS problem is extremely expensive. To improve the efficiency of finding PCs (so that they can be used in online applications), we first introduce an anti-monotonicity property, which allows the candidates for a PC to be pruned efficiently. We further develop the CP-tree index, which systematically organizes the graph vertices and P-trees of a profiled graph. The CP-tree index enables the development of two fast PC discovery algorithms. We experimentally evaluate our solutions on two real large profiled graphs and two synthetic profiled graphs. Our results show that PCs are better representations of communities, and the CP-tree based algorithms are up to 4 orders of magnitude faster than the basic solution.

Organization. We review the related work in Section 2. Section 3 presents the PCS problem and a basic solution. Section 4 discusses the CP-tree and its related solutions. We report the experimental results in Section 5, and conclude in Section 6.

2 RELATED WORK

In the literature, there are two kinds of work related to the retrieval of communities, namely community detection (CD) and community search (CS).

Community detection (CD) aims to obtain all the communities of a given graph. Earlier works [16, 33] use link-based analysis to obtain these communities. However, they do not consider the textual information associated with graphs. Recent works focus on attributed graphs and use advanced techniques, such as clustering, to identify communities. However, these studies often assume that the attribute of a vertex is a set of keywords, and do not consider the hierarchical relationship among them. For example, Zhou et al. [20] used keywords to describe vertices and computed the vertices' pairwise similarities to cluster the graph. Qi et al. [34] studied the problem of dynamically maintaining communities of moving objects using their trajectories. Ruan et al. [35] proposed a method called CODICIL.
Based on content similarity, CODICIL augments the original graphs by creating new edges, and then uses an effective graph sampling to boost the efficiency of clustering. Another widely used approach is based on topic models [18, 36]. Essentially, these methods still analyze one-dimensional content to obtain the communities. The Link-PLSA-LDA [14] and Topic-Link LDA [37] models jointly model vertices' links and content based on the LDA model. In [6], the communities are clustered based on probabilistic inference. In [38], information such as topics, interaction types and social connections is considered to explore the communities. CESNA [19] detects overlapping communities by assuming that communities generate both the links and the content. As we introduced before, CD solutions are typically time consuming, and they may not be suitable for online applications that require fast retrieval of communities. It is also interesting to examine how our PCS solutions can be extended to support CD.

Community search (CS) returns the communities of a given graph vertex in a fast and online manner. Most existing CS solutions [8-10, 25, 26] only consider graph topologies, but not the labels associated with the vertices. To define the structure cohesiveness of the community, the minimum degree is often used [8, 25, 26]. Sozio et al. [8] proposed the first algorithm, Global, to find the k-ĉore containing the query vertex. Cui et al. [25] proposed Local, which uses local expansion techniques to improve on Global. We will compare these two solutions in our experiments. Other definitions, such as k-clique [9], k-truss [10] and edge connectivity [39], have been considered for searching meaningful communities. Recent CS solutions, such as ACQ [11, 40] and ATC [12], make use of both vertex labels and graph structure to find communities. Since CS is query-based, it is much more suitable for fast and online querying of communities on large-scale profiled graphs.

However, none of the above works is designed for profiled graphs, and they do not consider the hierarchical relationship among vertex labels. Thus in this paper, we propose methods to solve the community search problem on profiled graphs. We have performed detailed experiments on real datasets (Section 5). As we will show, our algorithms yield better communities than state-of-the-art CS solutions do.

3 PROBLEM DEFINITION AND BASIC SOLUTION

In this section, we first formally introduce the PCS problem, and then give a basic solution to the PCS problem. Table 1 lists all notations used in this paper.

TABLE 1
Notations and meanings.

Notation       Meaning
$G(V, E)$      A profiled graph with vertex set V and edge set E
$n$            The number of vertices in V
$m$            The number of edges in E
$deg_G(v)$     The degree of vertex v in G
$T(v)$         The P-tree of vertex v
$M(G_q)$       The maximal common subtree of $G_q$
$G[T]$         The largest connected subgraph of G s.t. $q \in G[T]$ and $\forall v \in G[T]$, $T \subseteq T(v)$
$G_k[T]$       The largest connected subgraph of $G[T]$ s.t. $q \in G_k[T]$ and $deg_{G_k}(v) \ge k$

3.1 The PCS Problem

A profiled community is a subgraph of G that firstly satisfies the structure cohesiveness (i.e., the vertices in this community are connected to each other in some way).
The formal definition will be introduced later. A common notion of structure cohesiveness is that the minimum degree of all the vertices in the community has to be at least k [8, 25-29]. This is used in the k-core and the PC. Let us discuss the k-core first.

Definition 1 (k-core [27, 41]). Given an integer k ($k \ge 0$), the k-core of G is the largest subgraph of G such that, $\forall v \in$ k-core, the degree of v is at least k.

Notice that the k-core may not be connected [27]. Its connected components, denoted by k-ĉore, are the communities retrieved by k-core search algorithms. We use Example 1 to illustrate it.

Example 1. In Figure 2(a), each dashed circle represents a 2-core and also a 2-ĉore. Vertices {A, B, D, E} form a 3-ĉore, and vertices {A, B, C, D, E} form a 2-ĉore because C only has a degree of 2, even though the other vertices have a higher degree.

A profiled graph G(V, E) is an undirected graph with vertex set V and edge set E. Each vertex $v \in V$ is associated with a profiled tree (P-tree) that describes v's hierarchical attributes.

Definition 2 (P-tree). The P-tree of vertex q, denoted by $T(q) = (V_{T(q)}, E_{T(q)})$, is a rooted ordered tree, where $V_{T(q)}$ is the set of attribute labels and $E_{T(q)}$ is the set of edges between labels. A P-tree satisfies the following constraints: (1) there is only one root node $r \in V_{T(q)}$; (2) $\forall (x, y) \in E_{T(q)}$, the edge is directed and y is the child attribute label of x; and (3) $\forall y \in V_{T(q)}$ with $y \ne r$, there is one and only one $x \in V_{T(q)}$ s.t. $(x, y) \in E_{T(q)}$.

In practice, labels in the upper levels of the P-tree are semantically more general than those in lower levels. All edges in $E_{T(q)}$ preserve the semantic relationships among labels in $V_{T(q)}$.

Definition 3 (induced rooted subtree). Given two P-trees $S = (V_S, E_S)$ and $T = (V_T, E_T)$, S is an induced rooted subtree of T, denoted by $S \subseteq T$, if $V_S \subseteq V_T$ and $E_S \subseteq E_T$.

Essentially, an induced rooted subtree defines an inclusion relationship between two P-trees. Unless otherwise specified, we use subtree to mean induced rooted subtree.
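Definitions 2 and 3 can be made concrete with a small Python sketch. The encoding is an assumption made here for illustration, not something the paper prescribes: a P-tree is stored as a map from each label to its parent label, with the root mapped to `None`. The function names and the example labels (loosely modelled on Fig. 1, with a hypothetical root `CCS`) are likewise illustrative.

```python
from typing import Dict, Optional

# Assumed encoding: label -> parent label; the root maps to None.
PTree = Dict[str, Optional[str]]


def is_valid_ptree(tree: PTree) -> bool:
    """Check the constraints of Definition 2 on a parent map."""
    roots = [x for x, parent in tree.items() if parent is None]
    if len(roots) != 1:  # constraint (1): exactly one root
        return False
    # Constraint (3), a unique parent per non-root label, holds by the
    # encoding. It remains to check that every label reaches the root by
    # following parent links, without leaving the label set or looping.
    for label in tree:
        seen = set()
        node = label
        while tree[node] is not None:
            if node in seen or tree[node] not in tree:
                return False
            seen.add(node)
            node = tree[node]
    return True


def is_subtree(s: PTree, t: PTree) -> bool:
    """S is an induced rooted subtree of T (Definition 3).

    Every label of S must occur in T with the same parent, which gives
    V_S <= V_T and E_S <= E_T; the roots must coincide as well.
    """
    return all(label in t and t[label] == parent for label, parent in s.items())


# Hypothetical P-trees.
t_full = {"CCS": None, "CM": "CCS", "ML": "CM", "AI": "CM", "IS": "CCS"}
t_part = {"CCS": None, "CM": "CCS", "ML": "CM"}
t_other = {"CCS": None, "ML": "CCS"}  # ML hangs under a different parent
```

Here `is_subtree(t_part, t_full)` holds, while `is_subtree(t_other, t_full)` does not, because the edge (CCS, ML) is not an edge of `t_full`.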
We call the union of all vertices' P-trees the Global P-tree (GP-tree), which usually corresponds to a taxonomy system in practice.

Definition 4 (maximal common subtree). Given a profiled graph G, the maximal common subtree of G, denoted by $M(G)$, has the following properties: (1) $\forall v \in G$, $M(G) \subseteq T(v)$; (2) there exists no other common subtree $M'(G)$ such that $M(G) \subset M'(G)$.

The common subtree depicts the common hierarchical part among all P-trees in a subgraph. We use the maximal structure $M(G)$ to consider both the high-level and low-level labels, so that it fully mines the common features of this subgraph. As a result, by using the maximal common subtree, we can maximize the vertices' common profiles, including the topology and semantics of users' profiles. Next, we formally introduce the PCS problem.

Problem 1 (PCS). Given a profiled graph G(V, E), a positive integer k, and a query node $q \in G$, find a set $\mathcal{G}$ of graphs such that $\forall G_q \in \mathcal{G}$, the following properties hold:
• Connectivity. $G_q \subseteq G$ is connected and contains q;
• Structure cohesiveness. $\forall v \in G_q$, $deg_{G_q}(v) \ge k$, where $deg_{G_q}(v)$ denotes the degree of v in $G_q$;
• Profile cohesiveness. There exists no other $G'_q \subseteq G$ satisfying the above two constraints such that $M(G_q) \subset M(G'_q)$;
• Maximal structure. There exists no other $G'_q$ satisfying the above properties such that $G_q \subset G'_q$ and $M(G'_q) = M(G_q)$.

Essentially, a profiled community (PC) is a subgraph of G in which vertices are closely related in both structure and semantics. In Problem 1, the first two properties and the last property ensure the structure cohesiveness, as shown in the literature [26, 40]. The unique property, profile cohesiveness, captures the maximal shared profile among all the vertices of $G_q$. Moreover, since the shared subtree $M(G_q)$ shows the common hierarchical attributes, it can well explain the semantic theme of the community.

3.2 A Basic Solution

Since vertices in the PCs share a common subtree of the query vertex q's P-tree, a straightforward method is to enumerate all the subtrees of q's P-tree and find the corresponding PCs. However, as illustrated in Lemma 1, the search space may be exponentially large, and the computation overhead renders this method impractical. To alleviate this issue, we iteratively perform the two steps described after Lemma 1.

Lemma 1. The maximum number of subtrees of a P-tree with x nodes is $2^{x-1} + 1$.

[Figure 3: (a) a special case, a root with children $p_1, p_2, \ldots, p_{x-1}$; (b) a general case, a tree split into parts with i and x-i nodes.]
Fig. 3. A P-tree with x nodes.

Proof. Let $f(x) = \max\{L \in \mathbb{N} \mid L$ is the number of subtrees of a tree with x nodes$\}$. As shown in Fig. 3(a), $p_i$ denotes the i-th child of the P-tree. It is not hard to see that there are $2^{x-1} + 1$ subtrees, including the empty tree (which contains no P-tree node). So $f(x) \ge 2^{x-1} + 1$. In this case, we do not need to worry about the parent-child relationship between P-tree nodes, so $2^{x-1} + 1$ is also an upper bound of $f(x)$. We can therefore infer that $f(x) = 2^{x-1} + 1$.

More formally, we can verify the correctness of this formula. As shown in Fig. 3(b), the left triangle (including r) denotes the subtree with i nodes and the right one represents the subtree with x-i nodes. We present Equation (1) below. Note that the empty tree should be included, and thus $f(0) = 1$. We can construct different subtrees by combining subtrees in the left and right parts, so we can compute $f(x)$ from $f(i)$ and $f(x-i)$. Note that the empty tree should not be included in both the left and the right part simultaneously. Finally, we add 1 to $f(x)$ to represent the empty tree.

\[
f(x) =
\begin{cases}
1, & x = 0 \\
\max_{i=0}^{x} \{ f(i) \cdot [f(x-i) - 1] \} + 1, & x \ge 1,\ x \in \mathbb{N}
\end{cases}
\tag{1}
\]

We can directly verify that $f(x) = 2^{x-1} + 1$ satisfies the equation, and this completes the proof.

Step 1: candidate subtree generation. To generate the candidate subtrees, the key problem is how to avoid redundancy in the subtree enumeration. In [42], Asai et al. introduced a tree pattern enumeration strategy, which is based on the following two concepts: (1) the rightmost leaf is the last P-tree node according to the depth-first traversal order; (2) the rightmost path is the path from the root node to the rightmost leaf.
Given a tree T, a new subtree T' can only be generated by adding a new node t to T such that the following hold: (1) t's parent node is on the rightmost path of T; (2) t is the rightmost leaf of T'. As shown in [42], this generation strategy guarantees that all the subtrees of the P-tree will be enumerated without repetition. Thus, we follow this strategy to generate the candidate subtrees.

Step 2: community verification. After a candidate subtree T has been generated, we verify the existence of the corresponding community. We use $G_k[T]$ to represent the largest connected subgraph of G containing q in which each vertex has at least k neighbors and contains the subtree T. We say that T is feasible if $G_k[T]$ exists. The verification step is mainly based on the following results.

Proposition 1. Given a profiled graph G, two P-trees T and T', and the query vertex q, if $T' \subseteq T$, then $G_k[T] \subseteq G_k[T']$.

Proof. As we defined before, $G_k[T]$ denotes the k-ĉore containing q in which each vertex contains the subtree T. (1) If $G_k[T] = \emptyset$, $G_k[T] \subseteq G_k[T']$ always holds. (2) If $G_k[T] \ne \emptyset$, we have $\forall v \in G_k[T]$, $T \subseteq T(v)$. Then from $T' \subseteq T$, we can infer $\forall v \in G_k[T]$, $T' \subseteq T(v)$. This means each vertex $v \in G_k[T]$ also contains the P-tree T'. Thus if $G_k[T] \ne \emptyset$, $G_k[T] \subseteq G_k[T']$. In summary, Proposition 1 holds.

Lemma 2 (Anti-monotonicity). Given a subtree T, if $G_k[T] \ne \emptyset$, then $\forall T' \subseteq T$, $G_k[T'] \ne \emptyset$.

Proof. From Proposition 1, we know that $\forall T' \subseteq T$, $G_k[T] \subseteq G_k[T']$. Since $G_k[T] \ne \emptyset$, we have $\forall T' \subseteq T$, $G_k[T'] \ne \emptyset$.

By the contrapositive of Lemma 2, if a subtree T is infeasible, then every subtree that extends T is infeasible as well, so we can stop generating subtrees from T. The basic method begins by generating a subtree from the root node. Then, it iteratively performs the two steps above to retrieve all the feasible $G_k[T]$s, until no larger subtrees can be generated. The pseudocode of basic is given in Algorithm 1.

Complexity analysis. Let m be the number of edges in G. In the worst case, all edges are traversed to compute each $G_k[T]$ and all the subtrees are verified. As a result, basic completes in $O(2^{|T(q)|} \cdot m)$ time, where $|T(q)|$ denotes the number of nodes of T(q). In practice, the value of $2^{|T(q)|}$ could be exponentially large, and this makes basic impractical. To alleviate this issue, we propose more efficient index-based solutions in the next section.

Algorithm 1 presents basic. We first initialize the result set $\mathcal{G}$ and load q's P-tree T(q) (line 2). Then we compute $G_k$, the largest connected subgraph of G containing q in which each vertex has degree at least k (line 3). In each iteration, we generate new subtrees from the current subtree T'. For each new subtree T, we verify the existence of $G_k[T]$ (lines 4-10). If $G_k[T]$ exists, we add T to Ψ (lines 11-12); otherwise, if no subtree can be generated from T' or all subtrees generated from T' are infeasible, we add $G_k[T']$ to $\mathcal{G}$ if T' is maximal (lines 13-14). Finally, all PCs are returned (line 15).

Algorithm 1 basic query algorithm
 1: function QUERY(G, q, k)
 2:   $\mathcal{G} \leftarrow \emptyset$, load T(q) from G;
 3:   compute $G_k$ from G;
 4:   if $G_k \ne \emptyset$ then
 5:     Ψ ← GENERATESUBTREE(∅, T(q));
 6:     while Ψ ≠ ∅ do
 7:       T' ← Ψ.pop(); flag ← true;
 8:       Φ ← GENERATESUBTREE(T', T(q));
 9:       for each T ∈ Φ do
10:         compute $G_k[T]$ from $G_k$;
11:         if $G_k[T] \ne \emptyset$ then
12:           flag ← false; Ψ.push(T);
13:       if flag = true and T' is maximal then
14:         $\mathcal{G} \leftarrow \mathcal{G} \cup G_k[T']$;
15:   return $\mathcal{G}$;
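To make the basic method more tangible, the following is a minimal Python sketch under several stated assumptions; it is not the authors' implementation. P-trees are stored as parent maps (label to parent label, root to `None`) and are assumed to be subtrees of one GP-tree, so a subtree can be identified by its label set. Instead of the rightmost-path extension of [42], the sketch grows subtrees by any child of a contained node and removes duplicates with a `seen` set, which is simpler but does more work. The toy graph, its labels, and all function names are hypothetical.

```python
from collections import deque
from typing import Dict, FrozenSet, Optional, Set

Graph = Dict[str, Set[str]]        # vertex -> neighbours (undirected)
PTree = Dict[str, Optional[str]]   # label -> parent label, root -> None


def connected_k_core(adj: Graph, q: str, k: int, keep: Set[str]) -> Set[str]:
    """Connected k-core containing q in the subgraph induced by `keep`."""
    alive = set(keep)
    deg = {v: len(adj[v] & alive) for v in alive}
    queue = deque(v for v in alive if deg[v] < k)
    while queue:                   # repeatedly peel vertices of degree < k
        v = queue.popleft()
        if v not in alive:
            continue
        alive.discard(v)
        for u in adj[v] & alive:
            deg[u] -= 1
            if deg[u] < k:
                queue.append(u)
    if q not in alive:
        return set()
    comp, stack = {q}, [q]         # keep only the component of q
    while stack:
        v = stack.pop()
        for u in adj[v] & alive:
            if u not in comp:
                comp.add(u)
                stack.append(u)
    return comp


def basic_query(adj: Graph, ptrees: Dict[str, PTree], q: str, k: int):
    """Map each maximal feasible subtree (a label set) to its community."""
    tq = ptrees[q]
    root = next(x for x, parent in tq.items() if parent is None)

    def community(labels: FrozenSet[str]) -> Set[str]:
        # G_k[T]: vertices whose P-tree holds every label of T, then k-core.
        keep = {v for v, t in ptrees.items() if labels.issubset(t)}
        return connected_k_core(adj, q, k, keep)

    start = frozenset([root])
    results = {}
    if not community(start):
        return results
    seen, stack = {start}, [start]
    while stack:
        cur = stack.pop()
        maximal = True
        for label, parent in tq.items():
            if label in cur or parent not in cur:
                continue           # not a one-node extension of cur
            nxt = cur | {label}
            if community(nxt):     # feasible; infeasible ones are pruned
                maximal = False
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        if maximal:
            results[cur] = community(cur)
    return results


# Hypothetical toy data: two triangles sharing vertex D.
adj = {"A": {"D", "E"}, "B": {"C", "D"}, "C": {"B", "D"},
       "D": {"A", "B", "C", "E"}, "E": {"A", "D"}}
ml = {"r": None, "CM": "r", "ML": "CM", "AI": "CM"}
db = {"r": None, "IS": "r", "DMS": "IS"}
ptrees = {"A": db, "B": ml, "C": ml, "D": {**ml, **db}, "E": db}
result = basic_query(adj, ptrees, "D", 2)
```

On this toy input, `result` contains two communities for query vertex D with k = 2: {B, C, D} with common labels r, CM, ML, AI, and {A, D, E} with common labels r, IS, DMS. A subtree is reported only when none of its one-node extensions is feasible, which by Lemma 2 means no larger feasible subtree contains it.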

4 INDEX-BASED SOLUTIONS

We first introduce some preliminaries and the proposed CP-tree index, and then discuss the index-based query algorithms.

4.1 k-core and CL-tree

k-core. In line with existing CS work [11, 26], we use the k-core to satisfy the constraints of minimum degree and maximal structure of a PC. Given an integer k ($k \ge 0$), the k-core of G, denoted by $G_k$, is the largest subgraph of G such that $\forall v \in G_k$, $deg_{G_k}(v) \ge k$. Since $G_k$ may be disconnected, we use k-ĉore to denote one of its connected components. An important property of the k-core is the nested property: given two integers i and j, j-core $\subseteq$ i-core if i < j. In Fig. 4(a), the 0-core represents the whole graph, and the 3-core is nested in the 2-core. Computing all the k-cores of a graph G, known as core decomposition, can be completed by an O(m) algorithm [27], where m is the number of edges in G.

CL-tree. Since k-cores are nested, all the k-cores of a graph can be organized into a tree structure, called the CL-tree [11]. In this paper, we adopt it, but skip the labels on the tree. The CL-tree of the graph in Fig. 4(a) is shown in Fig. 4(b). Clearly, the vertices in each CL-tree node together with the vertices in all its descendant nodes represent a k-ĉore. For example, vertex C and the vertices {A, B, D, E} in its child node compose a 2-ĉore. Since each vertex appears only once, the space cost of the CL-tree is O(n), where n is the number of vertices in G. In addition, we maintain a map vertexNodeMap, where the key is a vertex and the value is the CL-tree node containing it; this allows us to locate the k-ĉore containing any query vertex efficiently.

[Figure 4: (a) k-cores; (b) CL-tree with nodes 0:#, 2:C, 3:ABDE, 2:FGH and the vertexNodeMap.]
Fig. 4. k-cores, CL-tree.

4.2 CP-tree Index

Index Overview. We build the Core Profiled tree (CP-tree) index by considering both the P-tree structure and k-cores. We depict an example CP-tree in Fig. 5 using the profiled graph in Fig. 1(a).

[Figure 5: CP-tree nodes for the labels CM, ML, AI, IS, DMS, HW, each with its CL-tree, and the headMap.]
Fig. 5. CP-tree index.

Each CP-tree node corresponds to a label and stores the k-ĉores sharing this label.
To summaize, each node p consists of following fou elements: (1) label: the attibute label; (2) paentnode: the paent node of p; (3) childlist: a list of child CP-tee nodes of p; and (4) vetexnodemap: a map that stoes the CL-tee. In addition, we maintain a map headmap, whee the key is a vetex v, and the value is a list of CP-tee nodes, each of which coesponds to a leaf node of v s P-tee. Main advantages of CPtee ae listed below. Restoe P-tees. By utilizing the headmap, each vetex s P-tee can be estoed by tavesing the leaf nodes up to the oot node. Locating k-ĉoe. Given an intege k, a quey vetex q and a CP-tee node t, using vetexnodemap, we design a function get(k,q,t) to get the k- coe containing q whee each vetex contains the label t.label in constant time cost. Quey efficiency. As discussed above, the label infomation of each vetex s P-tee can be efficiently accessed using the headmap. Index Constuction. We incementally ceate CP-tee nodes and then link them up to build the CP-tee index. Pseudocodes of CPtee index constuction ae pesented in Algoithm 2. Fo each vetex v, we ead T(v) and ceate new CP-tee nodes (lines 2-5). Fo each CP-tee node t, we add v in t fo late CL-tee constuction (lines 6, 9). If P-tee nodexis a leaf node, we update headmap (line 7). Then we link up all CP-tee nodes accoding to the GP-tee stuctue. Note that if GP-tee is unknown, we can simultaneously unify it whiling eading P-tees in the pevious step (line 10). Finally, I is etuned (line 11). Algoithm 2 CP-tee index constuction 1: function BUILDINDEX(G(V,E)) 2: fo each v V do 3: fo each x T(v) do 4: t a CP-tee node in I such that t.label = x.label; 5: if t = null then ceate a CP-tee node t and add it in I; 6: add v in t; 7: if x is the leaf node of T(v) then headmap.put(v,t); 8: fo each t I do 9: Build CL-tee fo the subgaph of t; 10: link to its paent and child nodes; 11: etun I; Complexity analysis. Obviously, lines 2-7 take the linea time. 
The time complexity of building a CL-tree is $O(m \cdot \alpha(n))$ [11, 40], where $m$ is the number of edges in $G$ and $\alpha(n)$, the inverse Ackermann function, is less than 5 for large values of $n$. Thus the time complexity of building the CP-tree is $O(|P| \cdot m \cdot \alpha(n))$, which is linear in the size of $G$. The space cost of the CP-tree is $O(|P| \cdot n)$, where $|P|$ denotes the number of labels in $G$. The space cost of the headMap is $O(\hat{l} \cdot n)$, where $\hat{l}$ denotes the average number of leaf nodes in each vertex's P-tree and $\hat{l} < |P|$. Therefore, the total space complexity is $O(|P| \cdot n)$, which is linear in the size of $G$.

4.3 Index-based Query Algorithms

Now we present our index-based query solutions. The first one follows the framework of basic, and it incrementally generates and verifies the subtrees of the P-tree (from smaller subtrees to larger ones). Thus we call it incre. The advanced methods borrow some ideas from MARGIN [43], an algorithm for mining maximal frequent subgraphs. As we will explain later, the advanced methods can find all PCs by examining a small fraction of subtrees, resulting in high efficiency. Their time complexities are $O(2^{|T(q)|} \cdot m)$, because in the worst case all the subtrees are verified. However, as we will show in Section 5.4, in practice they are much more efficient than these worst-case time complexities suggest.

4.3.1 The Method incre

We begin with an interesting lemma, which greatly accelerates the verification step.

Lemma 3. Given a CP-tree index $I$, a subtree $T'$ and a new subtree $T$ which is generated from $T'$ by adding a new P-tree node, we have $G_k[T] \subseteq G_k[T'] \cap I.\text{get}(k, q, T \setminus T')$, where $T \setminus T'$ denotes the newly added node.

Proof. $T = T' \cup t$, so we have $T' \subseteq T$. Based on Proposition 1, we know $G_k[T] \subseteq G_k[T']$. Similarly, $t \subseteq T$, so we have $G_k[T] \subseteq I.\text{get}(k, q, T \setminus T')$, where $I.\text{get}(k, q, T \setminus T')$ is the k-ĉore containing the query vertex q and the P-tree node $T \setminus T'$. Hence $G_k[T] \subseteq G_k[T'] \cap I.\text{get}(k, q, T \setminus T')$.

Since incre searches for communities only within the subgraphs found in the former iteration, the query efficiency is improved. We present incre in Algorithm 3.

Algorithm 3 incre query algorithm
 1: function QUERY($I$, $q$, $k$)
 2:   restore $T(q)$ using I.headMap;
 3:   $\mathcal{G} \leftarrow \emptyset$, $\Psi \leftarrow$ GENERATESUBTREE($\emptyset$, $T(q)$);
 4:   while $\Psi \neq \emptyset$ do
 5:     $T' \leftarrow \Psi$.pop(); flag $\leftarrow$ true;
 6:     $\Phi \leftarrow$ GENERATESUBTREE($T'$, $T(q)$);
 7:     for each $T \in \Phi$ do
 8:       compute $G_k[T]$ from $G_k[T'] \cap I.\text{get}(k, q, T \setminus T')$;
 9:       if $G_k[T] \neq \emptyset$ then
10:         flag $\leftarrow$ false; $\Psi$.push($T$);
11:     if flag = true and $T'$ is maximal then
12:       $\mathcal{G} = \mathcal{G} \cup G_k[T']$;
13:   return $\mathcal{G}$;

We first use headMap to locate the leaf nodes of $T(q)$ and then restore $T(q)$ (line 2). We initialize $\Psi$ by using $T(q)$ (line 3). In each iteration, for the current subtree $T'$, we generate new subtrees. For each new subtree $T$, we verify the existence of $G_k[T]$ using the index (lines 4-8). If $G_k[T]$ exists, we add $T$ to $\Psi$ (lines 9-10); otherwise, if no subtree can be generated from $T'$ or all subtrees generated from $T'$ are infeasible, we add $G_k[T']$ to $\mathcal{G}$ if $T'$ is maximal (lines 11-12). Finally, all PCs are returned (line 13).

4.3.2 The Advanced Methods

The method incre follows the Apriori-based approach, which explores all possible subtrees by traversing the search space from smaller subtrees to larger ones. However, as demonstrated in Section 5.1, the maximal feasible subtrees often lie in the middle of the search space, which implies that most of the exploration may be avoided. Based on this observation, we adapt MARGIN [43] to tackle PCS.
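The incremental verification of Lemma 3 (line 8 of Algorithm 3) can be sketched as follows. This is a minimal Python illustration, not the paper's Java implementation: `index_get` is a hypothetical stand-in for I.get(k, q, t) that recomputes the answer instead of reading it from a CP-tree, P-trees are flattened to plain label sets, and the graph and labels are invented toy data.

```python
from collections import deque

def k_core_within(adj, candidates, q, k):
    """Connected k-core containing q in the subgraph induced by `candidates`."""
    alive = set(candidates)
    deg = {v: sum(1 for u in adj[v] if u in alive) for v in alive}
    stack = [v for v in alive if deg[v] < k]
    while stack:                               # peel vertices of degree < k
        v = stack.pop()
        if v not in alive:
            continue
        alive.remove(v)
        for u in adj[v]:
            if u in alive:
                deg[u] -= 1
                if deg[u] < k:
                    stack.append(u)
    if q not in alive:
        return set()
    seen, queue = {q}, deque([q])
    while queue:                               # keep only q's connected component
        v = queue.popleft()
        for u in adj[v]:
            if u in alive and u not in seen:
                seen.add(u)
                queue.append(u)
    return seen

def index_get(adj, labels, q, k, label):
    """Hypothetical stand-in for I.get(k, q, t): k-core of q among holders of `label`."""
    holders = {v for v in adj if label in labels[v]}
    return k_core_within(adj, holders, q, k)

def extend_subtree(adj, labels, q, k, prev_community, new_label):
    """G_k[T] for T = T' plus new_label, searched only inside the Lemma 3 bound."""
    candidates = prev_community & index_get(adj, labels, q, k, new_label)
    return k_core_within(adj, candidates, q, k)

# Invented toy data: a 4-clique whose vertices hold different label sets.
adj = {v: {u for u in "ABCD" if u != v} for v in "ABCD"}
labels = {"A": {"CM", "ML", "AI"}, "B": {"CM", "ML", "AI"},
          "C": {"CM", "ML"}, "D": {"CM"}}
```

With q = A and k = 2, extending the community for {CM} by the label ML only inspects the intersection of the previous community and the ML holders, rather than the whole graph; extending further by AI yields an empty set, so that subtree is infeasible.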
MARGIN. It does not perform a bottom-up (or top-down) traversal of the search space; instead, it narrows the search space by examining only subgraphs that lie on the border between frequent and infrequent subgraphs. It first finds an initial pair of graphs (CR, R), where R is frequent and CR is not. In addition, CR is the child subgraph of R (i.e., CR contains R and they differ by exactly one edge). Similarly, R is the parent subgraph of CR. (CR, R) is called a cut, and from this cut MARGIN expands and finds all other cuts by adding or deleting an edge to obtain new adjacent subgraphs. MARGIN defines this function as ExpandCut, and Thomas et al. [43] have proved that ExpandCut is able to find all maximal frequent subgraphs. Inspired by MARGIN, we design the following functions.

1. Function expandPtree. This function is adapted from ExpandCut [43], and the main modifications are as follows. (1) We dynamically obtain child subgraphs and parent subgraphs, which are called child subtrees and parent subtrees in our case, using the parentNodes and childLists of CP-tree nodes, instead of pre-computing all subtrees in the search space as MARGIN does. (2) We define a pair of P-trees (IF, F) as a cut, where IF is the child subtree of F, and F is feasible while IF is not. (3) We dynamically verify whether a feasible subtree is maximal. We develop a function verifyPtree to verify the feasibility.

Algorithm 4 expandPtree
 1: function EXPANDPTREE($IF$, $F$, $\mathcal{G}$)
 2:   if $IF = \emptyset$ and $F \neq \emptyset$ then update $\mathcal{G}$;
 3:   else
 4:     $Q \leftarrow \emptyset$; $Q$.push(($IF$, $F$));
 5:     while $Q \neq \emptyset$ do
 6:       ($IF$, $F$) $\leftarrow Q$.pop();
 7:       for each parent $Y_i$ of $IF$ do
 8:         if $Y_i$ is feasible then
 9:           update $\mathcal{G}$ if $Y_i$ is maximal;
10:           for each child $K$ of $Y_i$ do
11:             if $K$ is infeasible then $Q$.push(($K$, $Y_i$));
12:             if $K$ is feasible then
13:               find common child $C$ of $K$ and $IF$;
14:               $Q$.push(($C$, $K$));
15:         else
16:           for each parent $K$ of $Y_i$ do
17:             if $K$ is feasible then $Q$.push(($Y_i$, $K$));
18:   return $\mathcal{G}$;

We now illustrate expandPtree in Algorithm 4. As we will introduce later, if $IF = \emptyset$ and $F \neq \emptyset$, we can directly update $\mathcal{G}$, because $F$ is already the maximal common subtree (line 2).
Otherwise, we first use ($IF$, $F$) to initialize the queue $Q$ (line 4). Then, for each pair, we iteratively verify its adjacent pairs (lines 5-17). If the parent subtree $Y_i$ of $IF$ is feasible, $G_k[Y_i]$ may still not be a final result. This is because subtrees are not enumerated in a regular order, and thus $Y_i$ may be only temporarily maximal, so we need to verify it repeatedly. If there exist other feasible subtrees verified in previous steps that are subtrees of $Y_i$, we need to replace their corresponding subgraphs with $G_k[Y_i]$ (line 9). Finally, we return $\mathcal{G}$ (line 18).

Lemma 4. Given a P-tree pair ($IF$, $F$), expandPtree can find all feasible subtrees for a PCS query.

The proof of Lemma 4 is based on the following preliminaries.

Fig. 6. The lattice and the Upper-$\Diamond$-Property [43]: (a) lattice; (b) Upper-$\Diamond$-Property.

The lattice is essentially a pre-processed data structure in which all possible subgraphs of a given graph are enumerated. Taking the graph in Fig. 6(a) as an example, the subgraphs in each level have the same size (i.e., number of edges). The bottom level (level 0) corresponds to the empty graph, and level $i$ lists all size-$i$ subgraphs. In the lattice, each subgraph is linked to its parent graphs (i.e., subgraphs of this graph that differ from it by exactly one edge) and its children (i.e., super-graphs of this graph that differ from it by exactly one edge). We can observe that the P-tree can directly replace the graph to construct the lattice.
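The lattice over P-tree subtrees described above can be sketched in Python as follows. This is an illustrative sketch under assumptions rather than the CP-tree implementation: a subtree is modeled as a frozenset of node labels that contains the root, `parent_of` is a hypothetical child-to-parent map for T(q), and the labels are invented. Child subtrees add one node whose parent is already present; parent subtrees delete one leaf. The last function checks the common-child condition of the Upper-◇-Property for a given subtree.

```python
from itertools import combinations

def child_subtrees(subtree, parent_of):
    """Subtrees obtained by adding exactly one node whose parent is already present."""
    return {subtree | {n} for n, p in parent_of.items()
            if n not in subtree and p in subtree}

def parent_subtrees(subtree, parent_of, root):
    """Subtrees obtained by deleting exactly one leaf; the root is never deleted."""
    internal = {parent_of[n] for n in subtree if n != root}
    return {subtree - {n} for n in subtree if n != root and n not in internal}

def upper_diamond_holds(subtree, parent_of):
    """True if every two child subtrees of `subtree` share a common child subtree."""
    kids = child_subtrees(subtree, parent_of)
    return all(
        (ci | cj) in (child_subtrees(ci, parent_of) & child_subtrees(cj, parent_of))
        for ci, cj in combinations(kids, 2)
    )

# Hypothetical T(q): root "r" with children "CM" and "IS"; "CM" has child "ML".
parent_of = {"CM": "r", "IS": "r", "ML": "CM"}
P = frozenset({"r"})
```

Here the two child subtrees of {r}, namely {r, CM} and {r, IS}, share the common child {r, CM, IS}, which is exactly their union. Generating neighbours on demand in this way avoids materializing the whole lattice.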

Property 1 (Upper-$\Diamond$-Property [43]). Any two child subgraphs $C_i$, $C_j$ of a graph $P$ will have a common child subgraph $A$.

In Property 1, $C_i$, $C_j$, $P$ and $A$ are four subgraphs. $C_i$ and $C_j$ are two child subgraphs of $P$ (i.e., they extend $P$ by one edge, $e_1$ and $e_2$ respectively). Then there must exist one subgraph $A$ such that $A$ is the child subgraph of both $C_i$ and $C_j$. Property 1 is very intuitive in graphs. With Proposition 2, we prove that the Upper-$\Diamond$-Property can be simply adapted to fit P-tree models.

Proposition 2. P-trees satisfy the Upper-$\Diamond$-Property.

Proof. In P-trees, $e_1$ and $e_2$ can be two P-tree nodes such that the subtrees $C_i = P \cup e_1$ and $C_j = P \cup e_2$. There must exist a P-tree $A = P \cup e_1 \cup e_2 = (P \cup e_1) \cup e_2 = (P \cup e_2) \cup e_1$. Thus $A = C_i \cup e_2 = C_j \cup e_1$, which means $A$ is the common child subtree of $C_i$ and $C_j$.

Now we formally give the proof of Lemma 4.

Proof. The method expandPtree is mainly adapted from MARGIN. As mentioned in MARGIN, correctness holds when the adapted problem satisfies the following constraints [43]:
(1) The search space is a subset of the lattice.
(2) The Upper-$\Diamond$-Property holds.
(3) The anti-monotone property is satisfied.
(4) A candidate set can be defined which is a boundary set such that every element in the set satisfies a given user constraint and there exists an immediate child in the lattice that does not satisfy the constraint because of the anti-monotone property. For every element in the set, there exists an immediate parent that does not satisfy the constraint for the monotone property.
(5) Solution sets can be generated from the candidate sets.

For the PCS problem, the element in constraint (1) is the P-tree, and obviously constraint (1) is satisfied. Proposition 2 has proved that constraint (2) is satisfied. The anti-monotonicity property has been proved in Lemma 2, and thus constraint (3) is also satisfied. In MARGIN, the user constraint of constraint (4) is whether, given a threshold, a graph is frequent or not. Here, the user constraint of constraint (4) is whether a P-tree is feasible.
For instance, suppose a P-tree $T$ is feasible, which means $G_k[T]$ exists, while $T'$, a child of $T$, is not feasible (i.e., $G_k[T']$ does not exist). Then $T$ can be placed in this boundary set, since its immediate child $T'$ does not satisfy the user constraint, by the anti-monotone property. Hence constraint (4) holds. Once a P-tree is added to the candidate set, we need to verify whether it is maximal. This means the solution set is a subset of the candidate set. Thus constraint (5) is satisfied. In conclusion, Lemma 4 holds.

2. Function verifyPtree. Given a subtree $T$, let $T_{child}$ and $T_{parent}$ denote a child subtree and a parent subtree of $T$. Let $l$ denote the number of leaf nodes of $T_{parent}$, and let $t_{n_i}$ represent the $i$-th leaf node of $T_{parent}$. Derived from Lemma 3, we have

$$G_k[T_{child}] \subseteq G_k[T] \cap I.\text{get}(k, q, T_{child} \setminus T),$$

$$G_k[T_{parent}] \subseteq \bigcap_{i=1}^{l} I.\text{get}(k, q, t_{n_i}).$$

Since all P-trees are subtrees of the GP-tree, if a P-tree has the attribute $t$, then $t$'s parent attribute $t'$ is also included. Thus, $I.\text{get}(k, q, t) \subseteq I.\text{get}(k, q, t')$. For a special subtree $T_i$ (a path from the leaf node $t_{n_i}$ to the root node $r$), we finally get $G_k[T_i] = I.\text{get}(k, q, t_{n_i})$. Note that $T_{parent}$ can be seen as several such paths, and thus we get $G_k[T_{parent}] \subseteq \bigcap_{i=1}^{l} I.\text{get}(k, q, t_{n_i})$. Based on the CP-tree, verifyPtree can efficiently verify subtrees.

Next we discuss three methods to find the initial cut.

3. Function find-i. We can adapt incre to find the initial cut. As shown in Algorithm 5, we incrementally enumerate subtrees and verify the existence of the corresponding communities. Once we find a subtree which is feasible while its child subtree is not, we can regard them as an initial cut (lines 2-15).
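The leaf-only bound used by verifyPtree can be illustrated with a short Python sketch. It assumes the results of I.get(k, q, t) are already available as a plain dictionary `get_by_label` (a hypothetical stand-in for the CP-tree lookup) and that `parent_of` is a child-to-parent map of T(q); all names and data are invented for illustration. Because I.get(k, q, t) is contained in I.get(k, q, t') for the parent label t', intersecting over the leaves gives the same candidate set as intersecting over every node.

```python
def leaves(subtree, parent_of):
    """Nodes of `subtree` that have no child inside `subtree`."""
    internal = {parent_of[n] for n in subtree if n in parent_of}
    return {n for n in subtree if n not in internal}

def candidate_vertices(subtree, parent_of, get_by_label):
    """Intersect the per-label k-cores over the leaf nodes only.

    The result is only an upper bound on G_k[subtree]; a k-core check inside
    this candidate set is still needed to decide feasibility.
    """
    result = None
    for leaf in leaves(subtree, parent_of):
        core = get_by_label.get(leaf, set())
        result = set(core) if result is None else result & core
        if not result:
            return set()          # early exit: the subtree cannot be feasible
    return result or set()

# Invented toy data that respects the nesting I.get(t) within I.get(parent of t).
parent_of = {"CM": "r", "IS": "r", "ML": "CM"}
get_by_label = {
    "r": {"A", "B", "C", "D", "E"},
    "CM": {"A", "B", "C", "D"},
    "ML": {"A", "B", "C"},
    "IS": {"A", "B", "D"},
}
```

For the subtree {r, CM, ML, IS} only the two leaves ML and IS are consulted, and an empty intersection rejects a subtree without any graph traversal.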
Algorithm 5 Find the initial cut: find-i
 1: function FIND-I($I$, $S$, $q$, $k$)
 2:   restore $T(q)$ using I.headMap;
 3:   $IF \leftarrow \emptyset$; $F = T(q)$;
 4:   $\Psi \leftarrow$ GENERATESUBTREE($\emptyset$, $T(q)$);
 5:   while $\Psi \neq \emptyset$ do
 6:     $T' \leftarrow \Psi$.pop(); flag $\leftarrow$ true;
 7:     $\Phi \leftarrow$ GENERATESUBTREE($T'$, $T(q)$);
 8:     for each $T \in \Phi$ do
 9:       compute $G_k[T]$ from $G_k[T'] \cap I.\text{get}(k, q, T \setminus T')$;
10:       if $G_k[T] \neq \emptyset$ then
11:         flag $\leftarrow$ false; $\Psi$.push($T$);
12:     if flag = true and $T'$ is maximal then
13:       $F = T'$; $IF = T$;
14:       break;
15:   return ($IF$, $F$);

4. Function find-d. We can decrementally generate subtrees from larger subtrees to smaller ones. We present the find-d pseudocode in Algorithm 6. First, if $G_k[T(q)]$ exists, we can directly return it as a qualified community (lines 2-4). In each step, for an infeasible subtree $T'$, we remove one of its leaf nodes and verify the feasibility of the new subtrees (lines 6-11). Once there is a new feasible subtree, we treat $T'$ and this new subtree as the initial cut (lines 12-17).

Algorithm 6 Find the initial cut: find-d
 1: function FIND-D($I$, $S$, $q$, $k$)
 2:   $IF \leftarrow \emptyset$; $F \leftarrow \emptyset$;
 3:   restore $T(q)$ using I.headMap;
 4:   if $G_k[T(q)] \neq \emptyset$ then $F = T(q)$;
 5:   else
 6:     $\Psi$.push($T(q)$);
 7:     while $\Psi \neq \emptyset$ do
 8:       $T' \leftarrow \Psi$.pop(); $IF = T'$;
 9:       $\Theta \leftarrow$ all leaf nodes of $T'$;
10:       for each $t' \in \Theta$ do
11:         compute $G_k[T' \setminus t']$ from $G$;
12:         if $G_k[T' \setminus t'] \neq \emptyset$ then
13:           $F = T' \setminus t'$;
14:           break;
15:         else
16:           $\Psi$.push($T' \setminus t'$);
17:   return ($IF$, $F$);

5. Function find-p. We can find the initial cut by directly verifying subtrees instead of verifying nodes one by one. Intuitively, a P-tree can be divided into several paths (from leaf nodes to the root). According to Lemma 2, these paths can be further verified by checking the corresponding leaf nodes. We call this method find initial cut by path (find-p). We present the pseudocode of find-p in Algorithm 7. $S$ denotes a P-tree node set. Initially, it consists of all leaf nodes of $T(q)$. If there does not exist a feasible node in $S$, we trace up to verify their parent nodes (lines 13-14). Next, we iteratively check the nodes in $S$. If we find a node $t$ such that $G_k[F \cup t]$ exists, we update $F$ (lines 5-6). Let $t_{parent}$ denote the parent node of $t$. If we find a node $t$ such that $G_k[F \cup t]$ does not exist, we trace up to find the boundary where $G_k[t'_{parent}]$ exists while $G_k[t']$ does not, and thus we find an initial pair (lines 8-11). Note that at this stage, $IF$ and $F$ may not be complete subtrees. Thus, for the nodes in $IF$ and $F$, we need to include all their ancestor nodes and then return ($IF$, $F$) as a cut (lines 15-16).

Algorithm 7 Find the initial cut: find-p
 1: function FIND-P($I$, $S$, $q$, $k$)
 2:   $IF \leftarrow \emptyset$; $F \leftarrow$ find a leaf node $t \in S$ s.t. $I.\text{get}(k, q, t) \neq \emptyset$;
 3:   if $F \neq \emptyset$ then
 4:     for each $t \in S$ do
 5:       compute $G_k[F \cup t]$ from $G_k[F] \cap I.\text{get}(k, q, t)$;
 6:       if $G_k[F \cup t] \neq \emptyset$ then $F = F \cup t$;
 7:       else
 8:         path $\leftarrow$ trace a path from $t$ to $r$ in $I$;
 9:         find $t'$, $t'_{parent}$ on path s.t. $G_k[t'] = \emptyset$, $G_k[t'_{parent}] \neq \emptyset$;
10:         $IF = F \cup t'$; $F = F \cup t'_{parent}$;
11:         break;
12:   else
13:     for each $t \in S$ do $S$.replace($t$, $t$.parent);
14:     FIND-P($I$, $S$, $q$, $k$);
15:   complete subtrees $IF$, $F$;
16:   return ($IF$, $F$);

Algorithm 8 gives the overall advanced method. Notice that there are three functions for finding the initial cut, i.e., find-i, find-d, and find-p, so we have three variants of advanced, denoted by adv-i, adv-d and adv-p respectively.

Algorithm 8 Advanced method
 1: function QUERY($I$, $q$, $k$)
 2:   $\mathcal{G} \leftarrow \emptyset$;
 3:   ($IF$, $F$) $\leftarrow$ FIND($I$, $S$, $q$, $k$);
 4:   EXPANDPTREE($IF$, $F$, $\mathcal{G}$);
 5:   return $\mathcal{G}$;

5 EXPERIMENTS

5.1 Setup

We consider two real datasets (ACMDL and PubMed) and two synthetic datasets (Flickr and DBLP). ACMDL² and PubMed³ are the co-authorship networks of researchers in computer science and biomedical areas respectively. Each vertex represents an author, and an edge is a co-authorship between two authors. For each author, her papers have been categorized by a hierarchical subject classification system (ACM CCS or Medical Subject Headings (MeSH)⁴), so we build the P-tree by unifying the categorization information of all her papers. For Flickr⁵ [44], each vertex represents a user and each edge denotes a follow relationship between two users. For DBLP⁶, a vertex is an author and an edge represents a co-authorship relationship.
For each user, we use a hash function to map the associated textual content to subjects of CCS to synthesize a P-tree. By doing this, the same textual contents are mapped to the same nodes in P-trees. Table 2 shows the statistics of the datasets, including the numbers of vertices and edges, the vertices' average degree d, the average number of labels in P-trees P, and the average number of labels in the GP-tree.

2. https://dl.acm.org/
3. https://www.nlm.nih.gov
4. https://meshb.nlm.nih.gov/
5. https://www.flickr.com/
6. http://dblp.uni-trier.de/xml/

To evaluate PCS queries, in line with [11], we set the default value of k to 6. For each dataset, we randomly select 100 query vertices from the 6-core. We implement all the algorithms in Java, and run experiments on a machine with an eight-core Intel 3.40GHz processor and 16GB of memory, with Ubuntu installed.

TABLE 2
Datasets used in our experiments.

Dataset   Vertices   Edges       d       P       GP-tree
ACMDL     107,656    717,958     13.34   11.54   1,908
Flickr    581,099    4,972,274   17.11   26.63   1,908
PubMed    716,459    4,742,606   13.22   27.10   10,132
DBLP      977,288    6,864,546   14.04   37.98   1,908

We consider all four datasets and check where the maximal feasible subtrees of 100 communities lie in the search space of each dataset. In our experiments, because the search space may be very large, we divide it evenly by depth into 5 levels. Notice that, in this case, level 3 represents the middle of the search space. The experimental results are reported in Table 3. For example, 43% of the maximal feasible subtrees lie in the middle of the search space in PubMed. This supports the above view and explains the motivation for the advanced methods.

TABLE 3
Locations of maximal feasible subtrees.

          ACMDL   Flickr   PubMed   DBLP
Level 1   3%      8%       11%      5%
Level 2   15%     23%      5%       13%
Level 3   18%     32%      43%      37%
Level 4   26%     25%      24%      31%
Level 5   38%     12%      17%      14%

5.2 PCS Effectiveness

As mentioned before, the existing CS methods mainly focus on non-attributed graphs. A recent work, ACQ [11, 40], investigates CS on attributed graphs.
In ACQ, each vertex in the attributed graph is associated with a set of keywords. Communities retrieved by ACQ should satisfy structure cohesiveness (the k-core constraint) and keyword cohesiveness [11, 40], i.e., the number of common keywords shared by all vertices in a community should be maximum. We compare PCS with ACQ. To run ACQ queries, we set each vertex's attribute as a set of keywords, which are the keywords in its P-tree. In the following, we first present a case study, and then show the quality and diversity of communities.

A Case Study: We perform a case study on the ACMDL dataset and consider a renowned researcher: Jim Gray. We set k = 4 here. We present Jim's two PCs, i.e., PC1 and PC2, with different research areas in Fig. 7 and Fig. 8. Notice that ACQ only finds one community, PC1, shown in Fig. 7(a). This is because ACQ maximizes the number of shared keywords, so PC2 shown in Fig. 8(a), which has five shared keywords, cannot be returned. In addition, as shown in Fig. 7(b), all shared keywords of PC1 are organized in a tree with few branches, which implies that the semantics of the keywords highly overlap with each other. In contrast, the shared subtree of PC2 shown in Fig. 8(b) has multiple branches, so the semantics of its keywords are very different and diversified. Hence, PCS is more effective than ACQ for extracting communities from profiled graphs.

Fig. 7. One PC of Jim Gray: (a) PC1; (b) the maximal common subtree of PC1.

Fig. 8. Another PC of Jim Gray: (a) PC2; (b) the maximal common subtree of PC2.

Community Pairwise Similarity (CPS): We compare PCS with three classic CS methods using the minimum degree definition: ACQ [11], Global [8] and Local [25]. We use Tree Edit Distance (TED) to compute the similarity between the P-trees of any pair of vertices in community $G_l$. Let $T_i$ be the P-tree of the $i$-th vertex in $G_l$. The CPS is then the average similarity over all pairs of $G_l$'s vertices, and all communities of $\mathcal{G}$:

$$CPS(\mathcal{G}) = \frac{1}{|\mathcal{G}|} \sum_{l=1}^{|\mathcal{G}|} \left[ \frac{1}{|G_l|^2} \sum_{i=1}^{|G_l|} \sum_{j=1}^{|G_l|} \frac{TED(T_i, T_j)}{|T_i \cup T_j|} \right]$$

The value of $CPS(\mathcal{G})$ ranges between 0 and 1. The higher the value is, the more cohesive the community is. In Fig. 9(a), PCs denotes the communities that only PCS can find, and P-ACs represents those returned by both PCS and ACQ. P-ACs have the most P-tree nodes (i.e., keywords in the ACQ definition) in common, and the fewest vertices. Thus they have the highest CPS values. Note that PCs have CPS values close to those of P-ACs, which implies that these unique PCs are also of high quality.

Level-diversity ratio (LDR): To further measure the quality of PCs, we define a metric, called level-diversity ratio (LDR), to measure the diversity of attributes level by level in the shared subtrees. $F$ denotes the method that we compare with PCS. Given a query vertex $q$, we use $T_{(F,q,j)}$ to represent the maximal common P-tree of the $j$-th community returned by the method $F$. $L$ is the number of levels in the P-tree $T(q)$. $L_i(T)$ is the number of unique labels in the $i$-th level of P-tree $T$. $H$ and $J$ denote the numbers of communities returned by the method $F$ and PCS respectively. A lower LDR value implies that the method $F$ is less diverse than PCS.
$$LDR(q, F) = \frac{1}{L} \sum_{i=1}^{L} \frac{L_i\left[\bigcup_{h=1}^{H} T_{(F,q,h)}\right]}{L_i\left[\bigcup_{j=1}^{J} T_{(PCS,q,j)}\right]}$$

Intuitively, LDR reflects the proportion of unique labels in each level. The experimental results are depicted in Fig. 9(b), which shows that communities returned by ACQ can only cover 40% to 60% of the labels of PCs in each level. This implies that PCs found by PCS have higher diversity than those of ACQ, because PCS focuses on maximizing the common structure of P-trees, rather than the number of common keywords. As a result, all communities with the semantically maximal properties can be found, and the communities are of high diversity.

Community numbers: Fig. 10(a) reports the average number of communities returned per query by these methods. From the results, we can see that PCS finds more communities than the others. This is because only PCS focuses on profiled graphs and uses the hierarchical information in P-trees to retrieve communities. Compared with other methods, PCS is able to extract communities with more semantic focuses.

Community P-tree Frequency (CPF): CPF is inspired by the document frequency measure. Let $fre_{i,j}$ represent the number of vertices in $G_i$ whose P-tree contains the $j$-th P-tree node of $T(q)$. We use CPF to compute the occurrence frequency over all nodes in $T(q)$ and all communities in $\mathcal{G}$:

$$CPF(q) = \frac{1}{|\mathcal{G}| \cdot |T(q)|} \sum_{i=1}^{|\mathcal{G}|} \sum_{j=1}^{|T(q)|} \frac{fre_{i,j}}{|G_i|}$$

Note that $CPF(q)$ ranges from 0 to 1, and a higher value implies better cohesiveness. As shown in Fig. 10(b), compared with the communities retrieved by both PCS and ACQ, the unique PCs also have a high degree of cohesiveness.

Fig. 9. Comparing PCS with CS methods: (a) CPS; (b) LDR.

Fig. 10. Comparing PCS with CS methods: (a) community number; (b) CPF.

F1-score: Here we use Facebook ego-networks⁷ to evaluate the accuracy. We use FBX to denote the X-th network. Each ego-network has several overlapping ground-truth communities, called friendship circles [45]. As shown in Table 4, each vertex has real profiles, such as political, education, etc.
Similar to Flickr, we build each P-tree by using a hash function to map the real profiles to CCS subjects. We randomly select 100 query vertices in these ground-truth communities and compute the F1-scores⁸ of the different methods. The F1-scores of all methods over the three networks are shown in Fig. 11. The experimental results show that, compared with other methods, PCS can stably extract communities with high accuracy over the three real networks.

7. http://snap.stanford.edu/
8. https://en.wikipedia.org/wiki/F1_score

TABLE 4
Facebook datasets.

Dataset   Vertices   Edges    d       P
FB1       1,233      11,972   19.41   34.54
FB2       1,447      17,533   24.23   29.12
FB3       982        10,112   20.59   31.10

Fig. 11. F1-scores over three networks.

Fig. 12. Evaluation on ACMDL, PubMed datasets: (a) CPS; (b) LDR; (c) community number; (d) CPF.

5.3 Comparison with Other Definition Metrics

In this section, we compare several potential metrics for defining the PCS problem. Generally, a good community should be a group of users who are cohesive in both structure and profiles. To measure structure cohesiveness, we use the minimum degree metric, which is in line with existing works [8, 11, 12, 25, 26]. To measure profile cohesiveness, we have tried a list of possible metrics, including: (a) common nodes of P-trees; (b) common paths of P-trees (from the P-tree leaf to the root); (c) common subtree of P-tree structures; (d) similarity of vertex P-trees. We compare these four metrics over two real datasets (ACMDL and PubMed). As shown in Fig. 12, compared with the other metrics, Metric (c) achieves the highest scores over the four indices. We now discuss the reason for such differences.

In a recent work, ACQ [40], the authors define the vertex attribute as a set of keywords and use the number of shared keywords to constrain the communities. Thus, in our PCS problem, it is natural to use the number of common P-tree nodes to measure the profile cohesiveness, and to require the number of common nodes to be the largest. However, as we have analyzed before, this ignores the interrelations among the nodes and violates the basic motivation for the PCS problem. Thus Metric (a) is not suitable for the PCS definition.

Metric (b) is defined by common paths (i.e., a common path from the P-tree root to a leaf node) shared by all the vertices in the returned community. Intuitively, we can require the number of common paths to be maximum.
This metric still has some inadequacies, as it amounts to maximizing the number of common leaf nodes, which will miss meaningful communities with fewer common leaves. As a result, based on the discussion above, we think Metric (b) is also not suitable for the PCS problem definition.

Metric (c) focuses on the common subtree of all P-trees. Clearly, a subtree consists of a set of nodes and their hierarchical relationships. Compared with the metrics above, the common subtree of the P-tree structure is more suitable for measuring the profile cohesiveness of a community, as it can adequately present the commonalities of vertex P-trees.

For Metric (d), inspired by another recent community search work [12], we tried to use the similarity of P-trees to define the problem. That is, given a threshold, we find all vertices within a budgeted similarity score. However, it is still not suitable for the PCS problem. This is because, normally, if two P-trees are compared by some similarity method, the diversity of these P-trees will nevertheless be regarded as dissimilarity. Thus, based on the above discussion and the experimental results in Fig. 12, we adopt Metric (c) in our PCS problem definition.

5.4 Results of Efficiency Evaluation

In this section, we show the efficiency results of index construction and PCS queries.

1. Index construction. Fig. 13(a)-13(b) show the scalability of the CP-tree index construction method. To evaluate the scalability of the index construction method w.r.t. the dataset size, for each dataset, we randomly select 20%, 40%, 60% and 80% of its vertices to obtain four sub-datasets respectively. As shown in Fig. 13(a), we observe that the time cost of index construction is linear in the size of the profiled graphs, which confirms our earlier analysis. Furthermore, to evaluate the scalability of the index construction method over different P-tree sizes of vertices and over different fractions of the GP-tree size, we obtain four sub-datasets in a similar way. As shown in Fig. 13(b) and Fig. 13(c), the time cost of index construction is linear in the size of P-trees and GP-trees.

2. Query efficiency. We vary the value of k and show the query efficiency of the different algorithms in Fig. 14(a)-14(d). The method incre is 100 times faster than the basic method, but slower than the method adv-i. Further, adv-d and adv-p are 10 times faster than incre. The reason is that, compared with incre, the advanced methods narrow the search space by verifying a smaller fraction of subtrees. Also, the efficiency gap in finding an initial cut results in the slightly different performance of the advanced methods. Thus, the index-based methods run fast and adv-p stably scales the best. Note that the three advanced methods perform similarly on Flickr. This is because the initial cut results are in the