CONE: Community Oriented Network Embedding

Size: px

Start display at page:

Download "CONE: Community Oriented Network Embedding"

Clarissa Carr
6 years ago
Views:

: Community Oriented Network Embedding Carl Yang University of Illinois, Urbana Champaign 201 N. Goodwin Ave Urbana, IL 61801, USA jiyang3@illinois.

Goodwin Ave Urbana, IL 61801, USA kcchang@illinois.edu ABSTRACT Detecting underlying communities has long been a popular topic in the research on networks.

In this work, we doubt the universality of these widely adopted assumptions and compare human labeled communities with machine predicted ones obtained via various mainstream algorithms.

1 : Community Oriented Network Embedding Carl Yang University of Illinois, Urbana Champaign 201 N. Goodwin Ave Urbana, IL 61801, USA Hanqing Lu Zhejiang University 38 Zheda Rd Hangzhou, , China Kevin Chen-Chuan Chang University of Illinois, Urbana Champaign 201 N. Goodwin Ave Urbana, IL 61801, USA ABSTRACT Detecting underlying communities has long been a popular topic in the research on networks. It is usually modeled as an unsupervised clustering problem on graphs, based on some heuristic assumptions about the characteristics of communities, such as edge density and node homogeneity. In this work, we doubt the universality of these widely adopted assumptions and compare human labeled communities with machine predicted ones obtained via various mainstream algorithms. Based on supportive results, we argue that communities are defined by various social patterns and unsupervised learning based on heuristics is incapable of capturing all such patterns. Therefore, we propose to inject supervision into community detection through Community Oriented Network Embedding (), which leverages limited community examples to learn the latent representations of nodes that are aware of the underlying social patterns.specifically, a deep architecture is developed based on Recurrent Neural Networks (RNN) and improved towards learning an embedding that captures important link structures and node attributes that lead to the formation of ground-truth communities. Generic clustering algorithms in the embedded space then effectively reveals more communities that share similar social patterns with the ground-truth ones. Categories and Subject Descriptors H.4.0 [Information Systems]: General Keywords Community detection, supervised learning, network embedding, deep architecture 1. INTRODUCTION One of the most popular topics in the research on networks is identifying network communities. On the one hand, Equal contribution Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright 20XX ACM X-XXXXX-XX-X/XX/XX...$ Student PhD Secretory Professor Dean Figure 1: Toy example of a CS research group as networks are growing larger than ever before, it is efficient and necessary to look into smaller sub-networks, which consist of specific groups of interacting objects (i.e., nodes) and their links (i.e., edges). On the other hand, the knowledge of community structures allows us to better understand the status of an object within a group and the relationships between it and its peers, so as to enable multiple benefits including the discovery of functionally related objects [23], the study of interactions between modules [2], the inference of missing attribute values [17], the prediction of unobserved connections [8] and so on. Many algorithms aim to solve this problem [3, 7, 9, 18, 31, 33, 35]. However, they are all formulated based on the heuristic assumptions of edge density and node homogeneity, i.e., communities are constructed by densely connected nodes and nodes within one community are all homogeneous in some ways. Most surprisingly, even the ground-truth community labels used for evaluation in many standard datasets are generated by machines based on these two assumptions, rather than humans [32]. In this work, we doubt the universality of the two widely trusted assumptions. To see how these two assumptions do not necessarily hold true, consider the scenario in Figure 1, where the community is a computer science (CS) research group including students, staff and faculties. This community is naturally a valid community as everyone is affiliated to the latent group. However, it is not densely connected as it is highly likely that there are people who have different schedules and thus have never met each other. The community is not homogeneous in a specific way either. Students may have similar age range, salary range and come from the same hometown, but these are quite different from those of the staff and faculties. Therefore, unsupervised community detection algorithms based on edge density and node homogeneity can hardly detect communities like this. Although neither of the density and homogeneity charac-

2 teristics is prominent in the community of Figure 1, there are indeed some interesting social patterns. E.g., a pattern of star is clearly formed among the PhD and the students around, and the secretory obviously acts as a connecting bridge. Those patterns are characterized by both link structures and node attributes. E.g., the PhD as a star center is different in ages from the students around but may share similar other attributes such as major and research field. The PhD is also densely connected with the students around, but the students may not know each other. On the other hand, the secretory as a bridge is different in attributes like job from any other members in the community, but connects to many of them. These sub-structures are prominent and valuable for defining and detecting communities of specific patterns, e.g., research groups in CS or any other discipline may well share the similar patterns. How can we automatically capture the important social patterns underlying network communities? As we see in Figure 1, the patterns can be complex and different across networks. It is impractical to enumerate all possible ones and design an unsupervised algorithm that can detect communities of all patterns. In this work, we propose to inject weak supervision into community detection. Instead of taking pre-defined assumptions about community characteristics, it is more reliable to let the data tell us what the communities look like. Specifically, given a network, we propose to leverage a few labeled communities as examples, and automatically learn their social patterns as described by some latent features. Those features can then be leveraged for the detection of similar other communities. While it is intuitive to learn the latent features via embedding, a supervised embedding directed by complex social patterns on networks is non-trivial and has never been done. In this work, we develop a deep architecture of Community Oriented Network Embedding (). To capture high dimensional noisy node attributes, we translate them into binary sequences of fixed lengths and leverage the successful recurrent neural networks (RNN) from natural language processing (NLP). To account for link structures, we design a random-walk layer, which fundamentally reconstructs the attribute embeddings according to neighborhood structures. To leverage supervision, we connect a softmax supervision layer and evaluate the empirical loss w.r.t. ground-truth community labels in an end-to-end framework. Unlike other network embedding techniques, decomposes the embedding process into multiple nonlinear layers of a deep architecture. The key advantages of are: Expressive: the deep architecture of makes it powerful in exploring various combinations of attribute values in a high-dimensional space and the global consistency among different ground-truth communities. Linkage aware: the random-walk layer of makes it aware of neighborhood structures as expressed by a high-order stationary distribution, so that link structures are leveraged to capture important sub-networks. Community oriented: is an end-to-end embedding framework, so the supervision of ground-truth communities can be properly propagated to the attribute embedding layer via the random-walk layer, which explores their mutual reinforcement and reliably captures social patterns. Out-of-sample: The output of is an embedding model, rather than the embeddings of a specific set of nodes. Therefore, it is able to handle the out-of-sample problem, which addresses the challenge of limited supervision and dynamic network. 2. MOTIVATING STUDY In this section, we show the deficiency of traditional unsupervised community detection algorithms by quantitatively examining the density and homogeneity assumptions they adopt and their consequences. We conduct a set of analysis using the Facebook dataset described in [18]. This dataset includes 10 ego-networks, consisting of 193 social circles and 4,039 nodes. 10 ego users have manually identified all the communities to which their friends belong. The average number of social circles in each ego-network is 19, with an average community size of 22 users. These social circles were used as the ground-truth communities. Let a graph G = {V, E} represent a network with a vertex set V and an edge set E. A community can be represented as a sub-graph C = {V C, E C} with V C V and E C E. Attr(v) represents the node attribute vector of a vertex v V. We evaluate the density and homogeneity of both ground-truth communities and communities detected by the state-of-the-art algorithms. Metrics. We use a density metric D( ) and a homogeneity metric H( ) defined as following. D(C) = 2 E C V C ( V C 1). (1) The value of D(C) ranges from 0 (a graph without edges) to 1 (a complete clique). It is a standard measure of the density of a network widely used in related literature [32]. Attr(x) Attr(y) σ( H(C) = x,y V C,x<y avg(attr(x) Attr(y)) ) V C ( V C 1). (2) avg(attr(x) Attr(y)) is the average of attribute similarity among all pairs of nodes in the whole network G, and σ(t) = 1 e t 1+e t is the adjusted half sigmoid function in the range of [0, 1]. The value of H(C) also ranges from 0 (none of the nodes share the same attributes) to 1 (all nodes share the same attributes), while the value increases rapidly as the similarity of Attr( ) in C is small compared with the average attribute similarity in the whole network G. Compared algorithms. We use three mainstream community detection algorithms within the state-of-the-art: Min- Cut [9] (based on network modularity), CESNA [33] (based on probabilistic generative model) and InfoMaps [21] (based on information theory). Density and homogeneity analysis. Assumption 1 : nodes within the same community are densely connected. Conflicting evidence: In Figure 2 (a), we present D(C) computed on the ground-truth communities as well as the communities detected by the state-of-the-art algorithms. It can be observed that ground-truth communities do not always have a high density. In fact, in some ego-networks like the one we show, there are more low-density communities

3 than high-density ones. The evidence directly contradicts with Assumption 1. Based on this inaccurate assumption, traditional unsupervised community detection algorithms tend to detect communities with high density. As an example, the distributions of D(C) computed on the detected communities in ego-network 686 are presented in Figure 2 (b-d). It can be observed that most of the detected communities have a high density (D(C) > 0.5), which is inconsistent with the density distribution of the ground-truth communities. It suggests that the inaccurate assumption of edge density indeed leads to unreliable community detection results. (a)ground-truth (c)cesna (b)mincut (d)infomap Figure 2: Community density in ego-net 686. We conduct the same homogeneity analysis on communities detected by the same group of state-of-the-art unsupervised algorithms. As can be seen in Figure 3 (c-d), since they assume node homogeneity in communities, CESNA and InfoMaps do detect communities of slightly higher homogeneity, which diverge from the ground-truth ones. 3. The analysis in Sec 2 clearly demonstrates the deficiency of existing unsupervised community detection algorithms, which is a consequence of their falsifiable assumptions of edge density and node homogeneity. To deal with this deficiency, we propose to inject supervision into community detection, so as to leverage the various social patterns underlying communities in different networks. 3.1 Challenges While it is intuitive to automatically learn the latent features that effectively describe and detect communities via an embedding model that is aware of social patterns, the task is non-trivial and posts some unique challenges: Exploring various combinations of high-dimensional noisy node attributes. Utilizing complex link structures. Leveraging the supervision of community labels. Dealing with limited supervision and dynamic networks. To deal with all challenges above, we develop Community Oriented Network Embedding (), which coherently combines node attributes, link structures and community labels to effectively capture important social patterns. 3.2 Framework (a)ground-truth (b)mincut Problem formulation: Our input is a network modeled by G = {V, E, A, F}, where V is the set of n nodes, E is the set of observed links among V, A is the set of observed node attributes, and F is a set of example community memberships. In order to detect more communities on G, we firstly learn the latent features that describe communities in G. Specifically, we learn an embedding model that transforms each node v V into a p-dimensional vector. In the p-dimensional embedded space, nodes forming each community that shares the similar social patterns with the ones indicated by F are close. Therefore, new communities can then be detected through generic clustering algorithms such as k-means on the p-dimensional features. Overlapping communities can also be discovered by algorithms like MOC [4]. (c)cesna (d)infomaps Figure 3: Community homogeneity in ego-net 698. Assumption 2 : nodes within the same community are homogeneous in some ways. Conflicting evidence: As an example, the distribution of H(C) computed on the ground-truth communities in egonetwork 698 is contradictory to Assumption 2. As shown in Figure 3 (a), few communities are indeed homogeneous (e.g., the one with homogeneity higher than 0.8), while the majority are much less homogeneous. Given σ(1) 0.46, the results indicate that most communities have a homogeneity similar to that of the whole network. Modeling node attributes with recurrent neural network: In real world networks, node attributes A can be noisy, high-dimensional and of various forms. E.g., in textrich networks like Twitter, A can be the set of recent tweets; in tag-rich networks like Flickr, A can be the set of frequently used hashtags; in networks with explicit attributes like Facebook and LinkedIn, A can be a set of categorical variables like Birthday, School, Occupation and etc. An intelligent model should be expressive enough to explore various combinations of such attributes so as to identify the important social patterns. The task of modeling such complex information reminds us of the challenge in natural language processing (NLP),

4 and specifically, paragraph embedding, where the frequency, ordering and co-occurrence of words all matter when interpreting the whole paragraph. While there is no previous work that models network vertexes (such as users in social networks) as paragraphs, we find it intuitive and rewarding to do so, because a i A is basically a description of node v i, and it is usually easy to translate a i into a paragraph. Specifically, tweets, with topics hidden in the words, are naturally appropriate to be modeled as sentences; hashtags and explicit categorical attributes, of which the frequency, ordering and co-occurrence of certain values carries rich semantics, can also be properly modeled as sentences (e.g., a Facebook attribute like birthday: July naturally equals to a sentence like the user s birthday is in July ). Just like the task of NLP, the length of the description A is naturally different for each node (paragraph), while the total number of possible values of words in tweets, popular hashtags and informative attributes is limited (vocabulary). To deeply understand the semantics of paragraphs, recurrent neural network (RNN) has been proven advantageous over various other methods, basically due to its encoding of sentences as sequences of words and supreme expressive power within the neural networks, both of which are intriguing considering our task of modeling high-dimensional noisy node attributes. Moreover, the deep learning framework naturally allows us to inject supervision into the model learning process and provides us with the flexibility of modifying the network architecture in order to leverage other information like link structures. Given those beneficial properties, we adopt RNN to model node attributes. Given node v i s d-dimensional attribute vector a i, we first transform it into a sequence {j x i,j = 1}. E.g., a 5-dimensional user attribute vector (0, 1, 0, 0, 1) T will be transformed into a sequence {1, 4}. Then we use this sequence as the input of RNN. As mentioned in [24], simple RNN would be difficult to train due to the resulting long term dependencies. Therefore, we choose long-short term memory (LSTM) [15] instead to model node attributes by learning a proper embedding through the following equations: i t = δ(w ia t + G ih t 1 + b i), Ĉ t = tanh(w ca t + G f h t 1 + b f ), f t = δ(w f a t + G f h t 1 + b f ), C t = i t Ĉt + ft Ct, o t = δ(w oa t + G oh t 1 + V oc t + b o) h t = o t tanh(c t). (3) where δ represents the sigmoid activation function; W i, W c, W f, W o, G i, G f, G o, and V o are weight matrices. b i, b f, and b o are bias vectors. The gates in LSTM cell can modulate the interactions between the memory cell itself and its environment. The input gates allow incoming signal to alter the state of the memory cell or block it and the output gates allow the state of the memory cell to have an effect on other neurons or prevent it. Moreover, the forget gates allow the cell to remember or forget its previous state. The architecture and implementation of LSTM can be found in the public website. After we input one element, there will be one output of the LSTM cell. As a result, there will be d output embedding vector {h 1, h 2,..., h d }. We take the average of them, i.e., ĥi = (h h d)/d, as the semantic embedding of the input sequence {x 1, x 2,..., x d }. This process is named as mean pooling. Supposing that the embedding size of user is m, we use H R n m to represent the node embeddings, where the i-th row of H equals to ĥi. Leveraging link structures with a random-walk layer: Another important characteristic of network data is the abundance of interactions among nodes, which is also significant for understanding social patterns. E.g., if some nodes are connected in a specific way, they are more likely to be in the same community. In order to leverage the interactions among nodes in networks, we introduce a novel randomwalk layer into RNN, which extends the modeling of node attributes to link structures. The random-walk layer is implemented by computing the one-step random transition matrix S based on network interactions. E.g., for networks with edge weights w ij (the weights are binary if we consider links such as friendships in social networks), S is computed as s ij = w ij/d i, where d i = j wij. A random-walk layer performing one step of the random walk process can be implemented by multiplying S to the attribute embedding H of RNN, i.e., T = SH. In order to simulate t steps of random walks, we can simply perform t random transitions, which can be implemented as T = S t H. In Sec 4, we experiment on the impact of different numbers of random transitions. To leverage link structures in the feature learning process, we connect the random-walk layer to the LSTM layer. Intuitively, the random-walk layer reconstructs the embedding of each node w.r.t. its neighbors, according to the distance of each neighbor measured by random walks of certain steps. This is done before the softmax prediction of community memberships and the evaluation towards ground-truth labels, so that the prediction is made on the reconstructed embeddings and the supervision is done with the consideration of link structures. Specifically, as supervision requires the embeddings of nodes within the same communities to be close, the random-walk layer allows the attribute embeddings to be not so close. As a consequence, only nodes within neighborhoods of both specific attributes and structures indicated by supervision will be embedded as close. Exploiting community supervision through end-toend embedding: To learn the essential features that describe social patterns as embeddings guided by labeled communities, we incorporate a softmax supervision layer into an end-to-end deep learning framework. By incorporating, is able Figure 4 illustrates the architecture of. A random-walk layer of several random transitions is connected to the attribute embedding layer and a softmax supervision layer is connected to exploit community labels. Suppose that the predicted labels are in matrix P, and the ground-truth labels are in matrix G. We adopt the following cross-entropy loss as our objective function: L = n G ilog(p i). (4) i=1 To optimization Eq. 4, we employ the stochastic gradient descent (SGD) with the diagonal variant of AdaGrad in [20].

5 Node Attrib utes A LSTM h1 LSTM h2 LSTM h... Attribute Embedding Matrix H Random Transition Matrix S Random Transition Matrix S Reconstructed Embedding Matrix T Softmax Prediction P Ground Truth Labels G (1) Attribute Embedding Layer (2) Random-walk Layer (3) Softmax Supervision Layer Figure 4: The deep architecture of Community Oriented Network Embedding (). At the t-th step, the parameters Θ is updated by: ρ Θ t Θ t 1 t g t, (5) i=1 g2 i where ρ is the initial learning rate and g t is the sub-gradient at time t. Since our proposed random-walk layer is implemented as a matrix multiplication, the gradients can be efficiently back-propagated to the attribute embedding layer. Thus the overall embedding framework is end-to-end and the mutual reinforcement between node attributes and link structures is leveraged. Out-of-sample community detection with : Our framework is designed to deal with the out-of-sample problem, i.e., an embedding model is learned once on the labeled data and can be used multiple times on unlabeled data. Unlike many network embedding algorithms that learn a specific embedding for nodes in the network by enforcing small distance among certain nodes [12, 19, 26], uses community labels to learn the model and store the parameters in a deep architecture. In this way, it memorizes the specific social patterns indicated by labeled communities and is able to apply them on new nodes. Such out-of-sample property is plausible in at least three folds. 1. Saving supervision. While supervision is usually expensive and limited, can be learned on a relatively very small training set and utilized on the rest of the large network. 2. Crossing networks. Learned on a specific network, can be used to compute embeddings on any other networks, assuming similar social patterns (such as from one ego-network to another), as long as the attribute set remains the same, or a mapping between attribute sets can be provided. 3. Excelling on dynamic networks. does not need to be re-learned for new nodes or links added to the network over time. When detecting communities in a network, embeddings of nodes computed by the model learned on labeled data are fed into generic clustering algorithms like k-means. We apply cross-validation to automatically choose the optimal number of communities as done similarly in [33]. The efficiency of : Since is an end-to-end deep learning framework implemented using TensorFlow, it is easy to run efficiently on GPU EXPERIMENTS In this section, we evaluate on the task of community detection with extensive experiments on real-world network datasets. 4.1 Experimental Settings Datasets. We use three real-world network datasets. The first is the Facebook dataset we used in Sec 2 for data analysis, which consists of 10 ego-networks with community labels explicitly provided by the ego users [18]. The attributes in this dataset are well-defined user profiles such as education, work and location, and links are undirected friendships among users. The second is a Flickr dataset collected by [29]. The community labels are generated from the groups joined by users. The attributes are the tags aggregated on users posted photos and the links are undirected friendships. The third is a Twitter dataset also collected by [18], which consists of 973 ego-networks. The community labels are generated from friend circles (or lists), i.e., friends put into the same list by the ego user are regarded as within one community. The attributes are the hashtags and popular account mentions within users recent tweets and the links are directed followings. Detailed statistics of the three datasets we use are shown in Table 1. In the model selection experiments, we vary the size of training and testing sets to show the ability of to leverage limited supervision. Dataset #nodes #comms #links #attrs Facebook 4, , Flickr 10, ,814 77,260 Twitter 5, ,556 12,274 Table 1: Statistics of 3 real-world network datasets. Compared algorithms. We compare with two groups of algorithms from the state-of-the-art to comprehensively evaluate the performance of. Traditional unsupervised community detection algorithms. Some of these algorithms are based on the edge density assumption alone, while others also assume node homogeneity. We compare with this group of algorithms to show the advantage of abandoning these inaccurate assumptions and leveraging the supervision of example communities. MinCut [9]: a classic community detection algorithm based on modularity.

6 BigClam [31]: an advanced community detection algorithm solely based on network structure. Circles [18]: a generative model of edges w.r.t. attribute similarity to detect communities. CESNA [33]: a generative model of edges and attributes to detect communities. Network embedding algorithms adapted to leverage community labels. While we find it intuitive to leverage community labels by learning embeddings, we compare with the stateof-the-art network embedding algorithms to show that the task is non-trivial and is advantageous. Inspired by [25], we adapt those network embedding algorithms to accommodate extra information by building a heterogeneous network augmented by node attributes and community labels. The embeddings learned by these algorithms are fed into the same generic clustering algorithm as to yield community detection results. [19]: we generate truncated random walks on the augmented heterogeneous network. node2vec [12]: we conduct the 2nd order random walks on the augmented heterogeneous network. [26]: we compute the model on the augmented heterogeneous network as what is done in [25]. Metrics. Two widely used metrics for evaluating community detection results are used in our experiments. For a detected community c i and a ground-truth community c i, the F1 similarity and similarity are defined as F 1 = 2 precision recall precision + recall, = ci c i c i c i, where precision = c i c i c i, recall = c i c i c i. For a set of ground-truth communities {c i } M i=1 and a set of detected communities {c i} N i=1, we compute the score as 1 2 C c i C max eval(ci, c c i ) + 1 i C 2 C c i C max c i C eval(ci, c i ), where eval(c i, c i ) can be either replaced by F 1 or. 4.2 Performance Comparison with Baselines We quantitively evaluate against all baselines on community detection. We randomly split the labeled communities into training sets and testing sets. For, we use 10% of the labeled communities as training set to learn the embedding model. For the adaptions of other embedding algorithms, the 10% community labels are used to build the heterogeneous networks for generating vertex sequences and computing neighborhood similarities. All compared algorithms including are then evaluated on the rest 90% labeled communities. To observe significant difference in performance, we split the training and testing sets 10 times and conduct statistical t-tests with p-value Table 2 and 3 show the average F1 and scores evaluated on all compared algorithms over the same 10 random splits of training and testing sets. The results all passed Algorithm Facebook Flickr Twitter MinCut BigClam CESNA Circles node2vec Table 2: Performance comparison using F1. Algorithm Facebook Flickr Twitter MinCut BigClam CESNA Circles node2vec Table 3: Performance comparison using. our t-tests. The parameters of baselines are all set to the default values as suggested by the original works. For, we intuitively use two random transitions in the randomwalk layer to stress network locality and leverage the link structures in close neighborhoods. reaches around 30% relative improvements on the Facebook dataset and more than 10% improvements on the other two datasets, compared with the second-runner among the 7 compared algorithms on both F1 and scores, while they have varying performances. This indicates the robustness and general advantages of our proposed approach. Taking a closer look at the scores, we observe that the scores on the Facebook dataset look much better than those on the other two datasets. This is due to the good quality of both attributes and links in the dataset. Since the Facebook dataset is the only one that has explicit human labels of communities, the large improvements on it indicate the advantage of in leveraging user provided community examples to understand specific social patterns. The scores of network embedding algorithms are lower than traditional community detection algorithms on the Facebook dataset, probably because the attributes are few and clean and traditional models work well on such settings. However, as the attributes become high-dimensional and noisy like in the Flickr and Twitter datasets, network embedding algorithms originally designed to deal with NLP problems excel. This indicates the advantage of leveraging embeddings for traditional social network tasks and further proves the efficacy of, especially in exploring highdimensional noisy attributes. 4.3 Model Selection We comprehensively evaluate the performance of when different amounts of supervision are available and study the impact of different numbers of random-walk transitions and embedding dimensions. Impact of supervision. We study the impact of supervision on the performance of by varying the training set portion. Figure 6 shows the results compared with all

7 baselines. s advantage of leveraging labeled data is significant on all three datasets. As can be observed, efficiently leverages supervision by learning from very small amounts of labeled data, and the performance converges rapidly as the training set portion reaches around 11% on the Facabook and Flickr datasets. On datasets with large amounts of labeled data such as Twitter, around 6% of labeled data are enough to learn a good model. Note that the other three network embedding algorithms do not leverage supervision effectively, basically because they do not deal with the out-of-sample problem and the social patterns they learn on labeled data do not generalize trivially to unlabeled ones. Impact of parameters. We study the impact of parameters inside by varying the number of random-walk transitions and embedding dimensions. Figure 7 shows the results compared with all baseline algorithms. The number of random-walk transitions has a large impact on the performance of, especially when the number is small. This demonstrates the utility of our novelly designed randomwalk layer. As the number of transitions grows larger and the stationary distribution of random walk is approximated, the performance becomes stable. Note that large numbers of transitions do not necessarily lead to optimal performance, because community structures are often well described within small neighborhoods. The number of embedding dimensions does not have a significant impact, indicating the robustness of inherited from RNN. 4.4 Case Study To understand the advantage of, we conduct the same density and homogeneity analysis on our detected communities and present the results on the same ego-networks we showed in Sec 2. Comparing Figure 5(a) and 5(b) with Figure 2(a) and 3(a), we observe that indeed finds communities that are similar in distributions of density and homogeneity as the ground-truth, which indicates its effectiveness in coherently leveraging community labels, node attributes and link structures. (a)density in ego 686 (b)homogeneity in ego 698 Figure 5: Communities detected by. 5. RELATED WORK We stress the novelty of this work and differentiate it with existing literature in three folds by comparing with community detection, network embedding and node classification. 5.1 Community detection Traditional community detection algorithms are mostly unsupervised, based on either of the edge density or node homogeneity assumptions, or both. Therefore, the communities they detect are only constructed by densely connected or homogeneous nodes. Among them, many are based on probabilistic generative models. For instance, COCOMP [35] extends LDA [6] by taking community assignments as latent variables. It leverages node homogeneity by assuming that each community is a group of people that use a specific distribution of words. BigClam [31] does not make use of node content at all. It leverages edge density by modeling the probabilities of edges solely based on the existing connections in the network. A variation of BigClam called CESNA [33] is later proposed, with additional consideration of binaryvalued node attributes. It leverages both edge density and node homogeneity by assuming communities to be formed by densely connected homogeneous nodes. Another method within the state-of-the-art, Circles [18], also leverages the same assumptions with a slightly different model. Many non-generative algorithms have also been proposed. Among them, some are based on graph metrics such as modularity [9] and maximal clique [11]. They are fundamentally leveraging the edge density assumption. As for node homogeneity, some algorithms firstly augment the graph with content links and then do clustering on the augmented graph [22, 34], while some others attempt to find partition that yields the minimum encoding cost based on information theory [21, 3]. They do not scale trivially to networks with high-dimensional noisy attributes. 5.2 Network embedding As we discuss in Sec 1, we aim to find node features that are suitable for the task of community detection. However, unlike the the techniques that generate features from graphs based on network structures [14, 27, 28], we frame the feature creation procedure as learning a community oriented representation of each node. Moreover, we aim to inject supervision into the learning procedure, instead of the unsupervised ways based on relational data in the networks [1, 30] or structural relationships between concepts [10, 16]. We note that while it is not a trivial task, some recent network embedding algorithms like [12, 19, 26] are flexible enough to consider node attribute and the supervision of community memberships. E.g., we can build a heterogeneous network with attributes and labels as augmented nodes and then apply the truncated random walks of [19, 12] on it, or learn the bipartite network embeddings on it as done in [25]. However, those algorithms can not handle the out-of-sample situations, so the label information is not properly leveraged for the embedding of unlabeled nodes. Since the supervision available is usually limited and only on a small part of the data, the performance is expected to be inferior. Finally we argue that the extension of their framework into to our situation leads to networks that are too large for the algorithms to be efficient, especially when the number of attributes are large. 5.3 Node classification Traditional supervised learning on networks is essentially node classification [5, 13, 27, 36], which leverages node attributes (labels) and network structures for the tasks like attribute prediction. However, although we utilize supervision, since our final goal is to perform unsupervised detection of communities based on the community oriented features we learn under supervision, our problem is still quite different from node classification. To be more specific, our predictions are new clusters of nodes, for which no predefined categories

8 % 7% 8% 9% 10% 11% 12% 13% 14% 15% (a)f1 on Facebook 6% 7% 8% 9% 10% 11% 12% 13% 14% 15% (d) on Flickr % 7% 8% 9% 10% 11% 12% 13% 14% 15% (b) on Facebook 1% 2% 3% 4% 5% 6% 7% 8% 9% 10% (e)f1 on Twitter Figure 6: Varying the training set portion % 7% 8% 9% 10% 11% 12% 13% 14% 15% 5 (c)f1 on Flickr 1% 2% 3% 4% 5% 6% 7% 8% 9% 10% (f) on Twitter (a)f1 on Facebook 0.02 (d) on Flickr (b) on Facebook (e)f1 on Twitter F1-score (c)f1 on Flickr (f) on Twitter Figure 7: Varying the number of random-walk transitions and the embedding dimensions. or labeled data are available at all. As a consequence, node classification algorithms are not applicable to our problem. 6. CONCLUSION In this paper, we doubt the generality of the two widely trusted assumptions for all existing community detection methods, i.e., the edge density assumption and node homogeneity assumption. We further prove that those assumptions have led to inaccurate community detection results for various mainstream algorithms within the state-ofthe-art through data analysis on a real-world social network dataset with explicit manual community labels. To deal with this deficiency, we propose to leverage limited amount of labeled communities as examples. For this purpose, we design to inject supervision into the problem of community detection by learning an embedding model for the whole network, which captures community shapes in terms of both node attributes and link structures, as indicated by some community examples. Generic clustering algorithms performed on the embeddings then yield reliable community detection results on the whole network. Recently, there has been a trend in applying successful NLP algorithms, especially those involving deep learning architectures, to solve social network tasks. To the best of our knowledge, we are the first to successfully apply RNN to social networks. Our key leverage is the expressiveness of RNN in modeling high-dimensional noisy node attributes and the coherent combination with random-walk layers to incorporate link structures. The out-of-sample embedding model is learned in an end-to-end deep framework to effectively leverage community supervision and the mutual reinforcement between attributes and links. It is implemented using TensorFlow and runs efficiently on GPU. While is originally designed for leveraging supervision for robust community detection free from falsifiable assumptions, it can be easily applied to many network learning tasks to coherently leverage node attributes, link structures and various kinds of supervision. As we observe from the experiments, is especially advantageous in dealing with high-dimensional noisy attributes, such as bags of hashtags and even raw texts.

9 7. REFERENCES [1] A. Ahmed, N. Shervashidze, S. Narayanamurthy, V. Josifovski, and A. J. Smola. Distributed large-scale natural graph factorization. In WWW, pages ACM, [2] E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing. Mixed membership stochastic blockmodels. JMLR, 9(Sep): , [3] L. Akoglu, H. Tong, B. Meeder, and C. Faloutsos. Pics: Parameter-free identification of cohesive subgroups in large attributed graphs. In SDM, pages SIAM, [4] A. Banerjee, C. Krumpelman, J. Ghosh, S. Basu, and R. J. Mooney. Model-based overlapping clustering. In SIGKDD, pages ACM, [5] S. Bhagat, G. Cormode, and S. Muthukrishnan. Node classification in social networks. In Social network data analytics, pages Springer, [6] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. JMLR, 3: , [7] T. Chakraborty, S. Patranabis, P. Goyal, and A. Mukherjee. On the formation of circles in co-authorship networks. In SIGKDD, pages ACM, [8] J. Chang and D. M. Blei. Relational topic models for document networks. In AIStats, volume 9, pages 81 88, [9] A. Clauset, M. E. Newman, and C. Moore. Finding community structure in very large networks. Physical review E, 70(6):066111, [10] G. E. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30 42, [11] N. Du, B. Wu, X. Pei, B. Wang, and L. Xu. Community detection in large-scale social networks. In WebKDD, pages ACM, [12] A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. [13] J. He, M. Li, H.-J. Zhang, H. Tong, and C. Zhang. Manifold-ranking based image retrieval. In Proceedings of the 12th annual ACM international conference on Multimedia, pages ACM, [14] K. Henderson, B. Gallagher, L. Li, L. Akoglu, T. Eliassi-Rad, H. Tong, and C. Faloutsos. It s who you know: graph mining using recursive structural features. In SIGKDD, pages ACM, [15] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8): , [16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages , [17] R. Li, C. Wang, and K. C.-C. Chang. User profiling in an ego network: co-profiling attributes and relationships. In WWW, pages ACM, [18] J. J. McAuley and J. Leskovec. Learning to discover social circles in ego networks. In NIPS, volume 2012, pages , [19] B. Perozzi, R. Al-Rfou, and S. Skiena. Deepwalk: Online learning of social representations. In SIGKDD, pages ACM, [20] X. Qiu and X. Huang. Convolutional neural tensor network architecture for community-based question answering. In IJCAI, pages , [21] M. Rosvall and C. T. Bergstrom. Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences, 105(4): , [22] Y. Ruan, D. Fuhry, and S. Parthasarathy. Efficient community detection in large networks using content and links. In WWW, pages , [23] A. P. Streich, M. Frank, D. Basin, and J. M. Buhmann. Multi-assignment clustering for boolean data. In ICML, pages ACM, [24] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, pages , [25] J. Tang, M. Qu, and Q. Mei. Pte: Predictive text embedding through large-scale heterogeneous text networks. In SIGKDD, pages ACM, [26] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. Line: Large-scale information network embedding. In WWW, pages ACM, [27] L. Tang and H. Liu. Relational learning via latent social dimensions. In SIGKDD, pages ACM, [28] L. Tang and H. Liu. Leveraging social media networks for classification. Data Mining and Knowledge Discovery, 23(3): , [29] X. Wang, L. Tang, H. Liu, and L. Wang. Learning with multi-resolution overlapping communities. Knowledge and Information Systems (KAIS), [30] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, and S. Lin. Graph embedding and extensions: a general framework for dimensionality reduction. IEEE transactions on pattern analysis and machine intelligence, 29(1):40 51, [31] J. Yang and J. Leskovec. Overlapping community detection at scale: a nonnegative matrix factorization approach. In WSDM, pages ACM, [32] J. Yang and J. Leskovec. Defining and evaluating network communities based on ground-truth. Knowledge and Information Systems, 42(1): , [33] J. Yang, J. McAuley, and J. Leskovec. Community detection in networks with node attributes. In ICDM, pages IEEE, [34] T. Yang, R. Jin, Y. Chi, and S. Zhu. Combining link and content for community detection: a discriminative approach. In SIGKDD, pages ACM, [35] W. Zhou, H. Jin, and Y. Liu. Community discovery and profiling with social messages. In SIGKDD, pages ACM, [36] X. Zhu, Z. Ghahramani, J. Lafferty, et al. Semi-supervised learning using gaussian fields and harmonic functions. In ICML, volume 3, pages , 2003.

arxiv: v1 [cs.mm] 12 Jan 2016

arxiv: v1 [cs.mm] 12 Jan 2016 Learning Subclass Representations for Visually-varied Image Classification Xinchao Li, Peng Xu, Yue Shi, Martha Larson, Alan Hanjalic Multimedia Information Retrieval Lab, Delft University of Technology