Multimedia Social Event Detection in Microblog


Yue Gao 1, Sicheng Zhao 2, Yang Yang 1, and Tat-Seng Chua 1
1 School of Computing, National University of Singapore, Singapore
2 School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
{dcsgaoy,dcsyangy,dcscts}@nus.edu.sg, zsc@hit.edu.cn

Abstract. Event detection on social media platforms has become an important task. It facilitates the exploration and browsing of events and supports early planning of preventive measures. The main challenges in event detection lie in the characteristics of social media data, which are short and conversational, heterogeneous, and live. Most existing methods rely only on textual information and ignore the visual content as well as the intrinsic correlation among the heterogeneous social media data. In this paper, we propose an event detection method that generates an intermediate semantic entity, named microblog clique (MC), to explore the highly correlated information among noisy and short microblogs. The heterogeneous social media data are formulated as a hypergraph, and the highly correlated items are grouped to generate the MCs. Based on these MCs, a bipartite graph is constructed and partitioned to detect social events. The proposed method has been evaluated on the Brand-Social-Net dataset. Experimental results and comparison with state-of-the-art methods demonstrate the effectiveness of the proposed approach. Further evaluation shows that the use of visual content can significantly improve event detection performance.

Keywords: Event detection, Microblog clique, Live data, Multimedia.

1 Introduction

Social media platforms [11], such as Twitter, Facebook, and Sina Weibo, have become important real-time information resources and host a huge amount of user-generated content (UGC). The rapid development of social media platforms has led to continuously increasing data, which plays an important role in information sharing and diffusion.
An example is the Super Bowl 2013, which attracted up to 24 million tweets in total, with tweets about the blackout alone exceeding 231k per minute. Users can employ these platforms to report real-life events, which may spread quickly and widely across the entire social network. This live information in social media streams requires effective techniques for organization and management. Events in social media refer to observable occurrences of people, places, times and activities [8]. As introduced in [16], an event can be regarded as a single episode of a large story, and event detection can benefit social media content analysis and enable powerful event browsing.

X. He et al. (Eds.): MMM 2015, Part I, LNCS 8935, © Springer International Publishing Switzerland 2015

In recent years, event detection on social media platforms has attracted extensive research attention [14,15]. Most existing event detection methods consider only the textual content and the social connections, while ignoring the visual content, which has been growing in importance in social media. Detecting events in social media is challenging due to the following three characteristics.

First, social media posts tend to be short and conversational in nature. Thus, the contents and vocabularies used in these posts tend to change rapidly. Under this circumstance, a single post may not be adequate to reflect meaningful content, and the exploration of highly correlated posts on the same topic becomes an urgent requirement.

Second, the content of social media posts has become increasingly heterogeneous and multimedia. A social post may contain not only text and images, but also a time-stamp, location, social connections, user preferences, and other metadata. Our recent investigation shows that about 30% of microblog posts now contain images, and this number is still increasing. Therefore, visual content is becoming more important.

Third, social media content comes in the form of streams. The amount of social media data is not only enormous but also growing every minute. Such live data make it hard to detect new events and to handle increasingly large-scale data.

In this paper, we aim to detect events from social media posts. To address the problem of short and conversational posts, we propose to generate microblog cliques (MCs), each of which is a group of highly correlated microblogs. MCs help to enrich the content of a single post and tackle the data sparseness issue. To address the heterogeneity of microblog data, we propose to jointly employ the textual and visual content of microblogs for analysis, which can better exploit the intrinsic correlation among the heterogeneous data.
Figure 1 presents the framework of the proposed event detection method.

Fig. 1. The framework of the proposed event detection method in microblog

The proposed event detection method can be briefly described as follows. Given a set of microblogs, we first construct a heterogeneous microblog hypergraph, in which the distance between two microblogs is measured by multiple facets, including textual content, visual content, location, time-stamp and user connection. The heterogeneous hypergraph is then partitioned into small sub-graphs, which are denoted as the MCs. Each MC comprises a group of highly correlated microblogs, such as near-duplicates or reposted microblogs. We summarize each MC by selecting several representative microblogs. These MCs are then used to construct an MC graph, and K-way segmentation on the MC graph is conducted using the transfer cut to generate the K events for the given microblogs. In our method, the Bayesian information criterion (BIC) is employed to select the optimal number of events. The proposed event detection method has been evaluated on the Brand-Social-Net dataset [6], which includes 20 saga events, i.e., saga stories, of different types. Experimental results and comparison with state-of-the-art methods demonstrate the effectiveness of the proposed method.

The remainder of this paper is organized as follows. Section 2 reviews related work on event detection in social media platforms. The detailed algorithms, including MC generation and event detection, are elaborated in Section 3 and Section 4, respectively. Section 5 presents the experimental results, followed by conclusions and discussions of further work in Section 6.

2 Related Work

In this section, we briefly review related work on event detection in social media platforms. In a typical incremental scheme, given new incoming data, the similarity between the new data and each existing event is computed first, and the event with the maximal similarity is selected. When all the similarities are below a predefined threshold, the new data is considered a new event. A modified TF/IDF and a time-based threshold are employed in [2] to measure the relevance between events and documents, in which an auxiliary dataset is used to estimate the IDF because future documents are unknown. An incremental IDF is introduced in [20], which considers a time window and a decay factor to measure the similarity between documents and events. Fung et al. [5] modeled word appearance with a binomial distribution and identified word bursts by a heuristic with thresholds. The frequency domain of textual content has also been investigated. Wavelet-based signal processing was introduced in [18] to detect events, in which the cross correlation between word appearances is measured using wavelet-based features. Reuter and Cimiano [12] proposed an event classification method to deal with incremental data in social media streams. In their method, a candidate retrieval step is first performed to gather related events by using the capture time, upload time, geographic location, tags, titles and the description.
Then the similarities between the document and each of the top retrieved events are measured based on nine features, including temporal, geographical and textual information. The probabilities of the document belonging to each event or to a new event can then be computed by a trained Support Vector Machine. A threshold is empirically selected using a gradient descent method on a split of the training data. Becker et al. [4] introduced learned similarity metrics to identify events in social media streams, in which the event identification task is formulated as a clustering problem. In this method, each event is denoted by a document cluster, and the scalable clustering is evaluated using normalized mutual information and B-cubed [3]. Considering the different information in social media documents, such as the textual feature and the location data, different similarities are combined in an ensemble-clustering procedure. To classify new data into existing events, a group of training samples is first selected from labeled data, and logistic regression and SVM are employed as classifiers; the resulting methods, CLASS-LR and CLASS-SVM, show the best performance in the reported experiments.

Most of the existing methods are based on the textual content associated with the time-stamp. With the increasing amount of multimedia content in social media streams, such as images and videos, it is important to further explore the role of visual content in microblogs for event analysis.
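The threshold-based assignment scheme described at the start of this section can be sketched as follows. This is only an illustration under stated assumptions: it uses raw term counts with cosine similarity rather than the modified TF/IDF of [2], and the function names and the threshold value of 0.3 are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def assign_incrementally(docs, threshold=0.3):
    """Assign each document to its most similar event, or open a new event."""
    events = []   # each event is a Counter of summed term counts (assumption)
    labels = []
    for text in docs:
        vec = Counter(text.lower().split())
        sims = [cosine(vec, ev) for ev in events]
        best = max(range(len(sims)), key=sims.__getitem__) if sims else -1
        if best >= 0 and sims[best] >= threshold:
            events[best].update(vec)        # merge the document into the event
            labels.append(best)
        else:
            events.append(Counter(vec))     # all similarities below threshold
            labels.append(len(events) - 1)
    return labels


labels = assign_incrementally([
    "windows 8 preview released today",
    "microsoft windows 8 preview download",
    "chongqing auto expo opens",
])
```

The first two posts share three terms and end up in the same event, while the third opens a new one.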

3 Microblog Clique Generation

Most social media posts are short and conversational. Therefore, it is difficult to extract useful information from the limited and noisy content of a single microblog. On the other hand, most microblogs do not stand alone, owing to the conversational nature of social media. For instance, highly correlated reposts and/or comments can be exploited as a valuable resource for enriching the original microblog post. Under this circumstance, we propose to generate a middle-level object, termed microblog clique (MC), to represent the grouped microblogs. Here an MC is a set of highly correlated microblogs, which are all related to the same topic within a short time window. Each MC combines several relevant microblogs and is therefore more informative than any one of them. In this way, MCs allow a set of microblogs, instead of a single microblog, to serve as the basic unit of analysis. Figure 2 illustrates the workflow of the proposed MC generation method. To formulate the relationship among microblogs, a heterogeneous microblog hypergraph is constructed, and hierarchical bi-partition on the hypergraph is conducted under the constraint of the Bayesian information criterion.

Fig. 2. The framework for microblog clique generation

3.1 Microblog Hypergraph Construction

Considering the multi-modal data in microblogs, such as the textual information, visual information, social connection, and location, a hypergraph is a good structure to formulate the microblog relationship. A hypergraph [22] is able to handle heterogeneous data and has been extensively employed in many data mining and information retrieval tasks [9,7] due to its superiority in high-order relationship modeling.

Given a group of microblogs $M = \{m_1, m_2, \ldots, m_n\}$, a microblog hypergraph $G_H = \{V, E, W\}$ is constructed. In $G_H$, each vertex denotes one microblog and there are $n$ vertices in total.
To generate the edges $E$ linking different vertices, the heterogeneous data of the microblogs are employed to measure the distance between each pair of microblogs.

Textual information: The textual information of each microblog is described by TF-IDF, and the cosine similarity is employed to measure the pairwise microblog textual distance.

Visual content: Given two microblogs with images, the visual content distance can be measured. Here a spatial pyramid image feature [19] is extracted for each image, which is highly discriminative with respect to spatial layout and local information. The dense SIFT feature is extracted for each image and a visual dictionary of size 1,024 is learnt. The spatial pyramid structure includes three levels, i.e., $1 \times 1$, $2 \times 2$ and $4 \times 4$, and a 21,504-D feature is generated for each image. A 200-D feature is further extracted by using PCA as the visual feature for each image.

Location: The geographical similarity between two microblogs (if available) is measured by using the Haversine formula [13].

Social connection: When two microblogs share the same owner, or the two corresponding owners are connected in the social media platform, such as through a follower/followee relation, these microblogs are close in social space. We measure the social similarity between two microblogs by:

$$ s_s(m_i, m_j) = \begin{cases} 1, & u_i = u_j; \\ 0.5, & u_i \sim u_j; \\ 0, & \text{otherwise,} \end{cases} \tag{1} $$

where $u_i$ and $u_j$ are the owners of $m_i$ and $m_j$, and $u_i \sim u_j$ indicates that $u_i$ and $u_j$ are connected in the social media platform.

Temporal information: When two microblogs are posted within a short time gap, there is a high probability that the two microblogs are related. Here the temporal similarity between two microblogs $m_i$ and $m_j$ is measured by:

$$ s_t(m_i, m_j) = 1 - \frac{|t_i - t_j|}{\tau}, \tag{2} $$

where $t_i$ and $t_j$ are the time-stamps of $m_i$ and $m_j$ respectively, and $\tau$ is a normalization factor.

Each microblog $m_i$ is regarded as a centroid, and its top $N$ nearest microblogs are selected to generate an edge, separately for the textual information and for the visual content. For the location information and the temporal information, each microblog $m_i$ is connected with its neighbors within a geo-distance threshold and a time-distance threshold, respectively. For the social connection, each user generates one edge, which connects all the microblogs of this user and of the related users. Figure 3 illustrates the hypergraph construction procedure.

The incidence matrix $H$ of the microblog hypergraph $G_H$ is generated by:

$$ H(v, e) = \begin{cases} 1 & \text{if } v \in e \\ 0 & \text{if } v \notin e. \end{cases} \tag{3} $$

The vertex degree of a vertex $v \in V$ is defined by:

$$ d(v) = \sum_{e \in E} w(e) H(v, e), \tag{4} $$

and the edge degree of an edge $e \in E$ is defined by:

$$ \delta(e) = \sum_{v \in V} H(v, e). \tag{5} $$
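Equations (1) to (5) can be illustrated with a minimal sketch. The function names, the toy hyperedges and the edge weights below are hypothetical; how the hyperedge weights $w(e)$ are chosen is not specified in the text, so they are simply given as inputs here.

```python
def social_similarity(u_i, u_j, connected):
    """Eq. (1): 1 for the same owner, 0.5 for connected owners, 0 otherwise."""
    if u_i == u_j:
        return 1.0
    if (u_i, u_j) in connected or (u_j, u_i) in connected:
        return 0.5
    return 0.0


def temporal_similarity(t_i, t_j, tau):
    """Eq. (2): 1 - |t_i - t_j| / tau."""
    return 1.0 - abs(t_i - t_j) / tau


def incidence_matrix(n_vertices, hyperedges):
    """Eq. (3): H[v][e] = 1 if vertex v belongs to hyperedge e, else 0."""
    return [[1 if v in e else 0 for e in hyperedges] for v in range(n_vertices)]


def vertex_degrees(H, weights):
    """Eq. (4): weighted number of hyperedges containing each vertex."""
    return [sum(w * h for w, h in zip(weights, row)) for row in H]


def edge_degrees(H):
    """Eq. (5): number of vertices in each hyperedge."""
    return [sum(row[e] for row in H) for e in range(len(H[0]))]


# Toy hypergraph: 4 microblogs, 3 hyperedges (vertex index sets).
hyperedges = [{0, 1, 2}, {1, 3}, {2, 3}]
weights = [1.0, 0.5, 2.0]
H = incidence_matrix(4, hyperedges)
d_v = vertex_degrees(H, weights)
delta_e = edge_degrees(H)
```

In a full pipeline, each facet (text, image, location, time, user) would contribute its own set of hyperedges to the same list before the incidence matrix is built.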

Fig. 3. Illustration of microblog hypergraph construction

3.2 Microblog Hypergraph Segmentation

Given the microblog hypergraph, we aim to generate the MCs, which are groups of microblogs on the same topic. We partition the hypergraph using the hypergraph cut approach, which has been widely investigated in recent years. Let $S$ and $\bar{S}$ denote a two-way partition of $G_H$. The hypergraph cut can be defined as

$$ \mathrm{Cut}_H(S, \bar{S}) := \sum_{e \in \partial S} w(e) \frac{|e \cap S| \, |e \cap \bar{S}|}{\delta(e)}, \tag{6} $$

where $\partial S$ is the hyperedge boundary, defined by $\partial S := \{ e \in E \mid e \cap S \neq \emptyset, \; e \cap \bar{S} \neq \emptyset \}$. The two-way normalized hypergraph partition can be defined as:

$$ \mathrm{NCut}_H(S, \bar{S}) := \mathrm{Cut}_H(S, \bar{S}) \left( \frac{1}{\mathrm{vol}(S)} + \frac{1}{\mathrm{vol}(\bar{S})} \right), \tag{7} $$

where $\mathrm{vol}(S) = \sum_{v \in S} d(v)$ and $\mathrm{vol}(\bar{S}) = \sum_{v \in \bar{S}} d(v)$ are the volumes of $S$ and $\bar{S}$ respectively. Following [22], the normalized cut can be relaxed to a real-valued optimization task, which can be solved by using the eigenvector associated with the smallest nonzero eigenvalue of the hypergraph Laplacian, i.e., $\Delta = I - D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2}$, where $D_v$, $D_e$ and $W$ are the diagonal matrices of vertex degrees, edge degrees and hyperedge weights.

In this way, the input microblogs $M = \{m_1, m_2, \ldots, m_n\}$ are partitioned into two parts, and the two-way partition is then conducted within each new part. This procedure continues until the optimal results are achieved. Here we employ the Bayesian information criterion (BIC) [17] to evaluate the partition results and to determine whether to accept each two-way partition.
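One relaxed two-way cut can be sketched with NumPy as below. This is a sketch under assumptions rather than the authors' implementation: it assumes a connected hypergraph in which every vertex has positive degree, and it thresholds the eigenvector at zero to obtain the two parts, a rounding rule the text does not specify. The function name and the toy data are hypothetical.

```python
import numpy as np


def hypergraph_bipartition(H, w):
    """Split the vertices in two using the relaxed normalized hypergraph cut."""
    H = np.asarray(H, dtype=float)
    w = np.asarray(w, dtype=float)
    d_v = H @ w                       # vertex degrees, Eq. (4)
    delta_e = H.sum(axis=0)           # edge degrees, Eq. (5)
    dv_inv_sqrt = 1.0 / np.sqrt(d_v)
    # Theta = D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2}
    theta = (dv_inv_sqrt[:, None] * H) @ np.diag(w / delta_e) @ (H.T * dv_inv_sqrt[None, :])
    laplacian = np.eye(H.shape[0]) - theta
    _, eigvecs = np.linalg.eigh(laplacian)   # eigenvalues in ascending order
    second = eigvecs[:, 1]            # eigenvector of the smallest nonzero eigenvalue
    return second >= 0                # assumed rounding rule: split at zero


# Toy example: two tight groups {0,1,2} and {3,4,5} joined by one weak hyperedge.
H = [[1, 1, 0, 0, 0],
     [1, 1, 0, 0, 0],
     [1, 0, 0, 0, 1],
     [0, 0, 1, 0, 1],
     [0, 0, 1, 1, 0],
     [0, 0, 1, 1, 0]]
w = [1.0, 1.0, 1.0, 1.0, 0.1]
part = hypergraph_bipartition(H, w)
```

Applying the same function recursively to each part, and stopping when the BIC test rejects a split, would give the hierarchical bi-partition described above.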

3.3 Bayesian Information Criterion

To identify an optimal partition, we should measure the representability of different partition results. Here the Bayesian information criterion (BIC) [17] is employed to evaluate the representation of a model, i.e., the selected representative microblogs from each partition. Given a group of partitions $P = \{P_1, P_2, \ldots, P_m\}$ for the data $M = \{m_1, m_2, \ldots, m_n\}$, the BIC value is calculated by:

$$ \mathrm{BIC} = \delta(M) - \frac{N_p}{2} \log n, \tag{8} $$

where $N_p$ is the number of parameters, which can be regarded as the feature dimension for microblog description, $\delta(M)$ is the log-likelihood of the microblog data under the partition $P$ at the maximum-likelihood estimate, and $n$ is the number of microblogs being processed. In our experiments, the maximum-likelihood estimate of the variance is calculated by:

$$ \hat{\theta}^2 = \frac{1}{n} \sum_{m_i} d(m_i, c_{m_i})^2, \tag{9} $$

where $d(m_i, c_{m_i})$ is the distance between $m_i$ and the corresponding representative microblog $c_{m_i}$. The log-likelihood of the microblog data can be measured by:

$$ \delta(M) = \sum_i \left( \log \frac{1}{\sqrt{2\pi}\, \hat{\theta}^{N_p}} - \frac{1}{2 \hat{\theta}^2} d(m_i, c_{m_i})^2 + \log \frac{n_i}{n} \right), \tag{10} $$

where $n_i$ is the number of microblogs in the partition containing $m_i$.

Here we take the bi-partition as an example. The partitions to be compared are $\{P_0\}$ and $\{P_1, P_2\}$, where $P_0 = P_1 + P_2$ indicates that $P_1$ and $P_2$ are the partition results of $P_0$. The BIC values for these two partition results are calculated, and the partition result with the higher BIC value is taken as the final result.

3.4 The MC Representation

With the microblog hypergraph partition results, all the input microblogs are divided into a group of clusters. These sets of microblogs are regarded as the MCs, which are used in the subsequent event detection procedure. For each MC, we use the combination of the textual content and visual content of all its microblogs to represent the MC. After removing duplicate textual and visual content, the enriched textual content and the combined images are employed for MC description. In comparison with a single microblog, an MC provides enriched information from a set of highly relevant microblogs, which can yield more meaningful content for microblog analysis.

4 Event Detection with MC

The generated MCs can be regarded as intermediate-level semantic entities for the event detection task. The task is to explore the relations among the MCs, $MC = \{MC_1, MC_2, \ldots, MC_p\}$, and the corresponding microblogs, $M = \{m_1, m_2, \ldots, m_n\}$, and to infer events from them. The objective here is to further partition the MCs and the microblogs into event clusters. We formulate the MCs and the corresponding microblogs in a bipartite graph $G_B = \{X, Y, B\}$, where the vertex set $X = MC \cup M$, the vertex set $Y = MC$, and $B$ is the across-affinity matrix linking $X$ and $Y$. $B$ is defined as follows:

$$ B_{ij} = \begin{cases} \eta, & \text{if } x_i \in M \text{ and } x_i \in y_j; \\ e^{-\gamma d_{ij}}, & \text{if } x_i \in MC \text{ and } y_j \in MC; \\ 0, & \text{otherwise,} \end{cases} \tag{11} $$

where $\eta$ and $\gamma$ are two parameters to balance the inner-MC correlation and the between-MC smoothness, and $d_{ij}$ is the distance between two MCs, which can be calculated by combining the textual distance and, if images are available, the visual distance.

To partition the MCs, the transfer cut method [10] is employed, which can be summarized as follows. Given the bipartite graph $G_B$ and the required number of partitions $K$, we first generate $D_X = \mathrm{diag}(B\mathbf{1})$ and $D_Y = \mathrm{diag}(B^T\mathbf{1})$. As $|X|$ is much larger than $|Y|$, we first focus on the smaller graph $G_{BY} = \{Y, W_Y\}$, which only contains the MC vertices, with $W_Y = B^T D_X^{-1} B$. The graph Laplacian of $G_{BY}$ can be calculated by $L_Y = D_Y - W_Y$, and the $K$ bottom eigenpairs $\{\lambda_i, v_i\}_{i=1}^{K}$ of $G_{BY}$ can be obtained. As proved in [10], the bottom $K$ eigenpairs $\{\xi_i, f_i\}_{i=1}^{K}$ of $G_B$ can be calculated by:

$$ 0 \le \xi_i < 1, \qquad \xi_i (2 - \xi_i) = \lambda_i, \qquad u_i = \frac{1}{1 - \xi_i} D_X^{-1} B v_i, \qquad f_i = (u_i^T, v_i^T)^T. \tag{12} $$

Then $\{f_1, f_2, \ldots, f_K\}$ can be used for spectral clustering [21] on the bipartite graph $G_B$, and $K$ microblog clusters can be obtained. Due to the noise in microblogs, some small clusters are formed; these are disregarded as noise, and only clusters containing more than 2% of the microblogs are selected as detected events. Selecting an optimal value of $K$ is nontrivial, so we further employ BIC to evaluate the selection of $K$.
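The transfer-cut steps around Eq. (12) can be sketched with NumPy as follows. This is a hedged illustration, not the authors' code: the function name, the toy distances and the values of $\eta$ and $\gamma$ are assumptions, the generalized eigenproblem $L_Y v = \lambda D_Y v$ is solved through its symmetric normalized form, and the final grouping uses the sign of the second eigenvector for $K = 2$ in place of the spectral clustering step of [21].

```python
import numpy as np


def transfer_cut_embedding(B, K):
    """Return the K bottom eigenvectors f_i = (u_i^T, v_i^T)^T and the xi_i."""
    B = np.asarray(B, dtype=float)
    d_x = B.sum(axis=1)                     # D_X = diag(B 1)
    d_y = B.sum(axis=0)                     # D_Y = diag(B^T 1)
    W_y = B.T @ (B / d_x[:, None])          # W_Y = B^T D_X^{-1} B
    L_y = np.diag(d_y) - W_y                # L_Y = D_Y - W_Y
    # Solve L_Y v = lambda D_Y v via the symmetric form D_Y^{-1/2} L_Y D_Y^{-1/2}.
    s = 1.0 / np.sqrt(d_y)
    lam, Z = np.linalg.eigh(s[:, None] * L_y * s[None, :])
    lam = np.clip(lam[:K], 0.0, 1.0 - 1e-12)    # guard against round-off
    V = s[:, None] * Z[:, :K]
    xi = 1.0 - np.sqrt(1.0 - lam)           # root of xi (2 - xi) = lambda in [0, 1)
    U = (B / d_x[:, None]) @ V / (1.0 - xi)[None, :]   # u_i of Eq. (12)
    return np.vstack([U, V]), xi


# Toy data: 4 MCs forming two events, 2 microblogs per MC (all values assumed).
eta, gamma = 1.0, 1.0
d = np.array([[0.0, 0.1, 3.0, 3.0],
              [0.1, 0.0, 3.0, 3.0],
              [3.0, 3.0, 0.0, 0.1],
              [3.0, 3.0, 0.1, 0.0]])
membership = [0, 0, 1, 1, 2, 2, 3, 3]       # MC index of each microblog
B_micro = np.zeros((len(membership), 4))
B_micro[np.arange(len(membership)), membership] = eta
B = np.vstack([B_micro, np.exp(-gamma * d)])    # rows: microblogs, then MCs

F, xi = transfer_cut_embedding(B, K=2)
side = F[:, 1] > 0                          # two-way split for K = 2
```

Rows 0 to 7 of `F` correspond to the microblogs and rows 8 to 11 to the MC vertices in $X$; the trailing four rows are the MC vertices in $Y$. For larger $K$, the rows of `F` would be clustered, for example with k-means, instead of being split by sign.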
We assume that the number of existing events is $K_0$, which is initialized as 0, and that the number of events after new data arrive is no more than $K_0 + n_{new}/t_m$, where $n_{new}$ is the number of new incoming microblogs and $t_m$ is a threshold on the minimum number of microblogs required for an event, which is set to 50 in our experiments. The bipartite graph is partitioned $n_{new}/t_m + 1$ times, and the partition result with the highest BIC value is selected as the event detection output.

Let $\{\Gamma_1, \Gamma_2, \ldots, \Gamma_K\}$ be the $K$ events detected in the previous step. The description of each $\Gamma_i$ is based on its MCs, where an MC selection is conducted to find the key MCs for the event. Here the weight of each MC is measured by its importance, such as the number of microblogs, reposts, and comments. Then the top $n_s$ MCs are selected, where $n_s$ is set to 3 in our experiments.
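The BIC comparison used here and in Section 3.3 can be sketched in plain Python. The sketch assumes the Gaussian form of Eqs. (8) to (10), uses one-dimensional points for brevity, and takes each cluster mean as its representative, whereas the paper uses a representative microblog; the function name and the toy data are hypothetical. It also assumes a nonzero variance estimate.

```python
import math


def bic_score(partitions, n_params=1):
    """BIC of a partition of 1-D points, each cluster represented by its mean."""
    n = sum(len(p) for p in partitions)
    per_point = []    # (squared distance to representative, cluster size)
    for p in partitions:
        c = sum(p) / len(p)
        per_point.extend(((x - c) ** 2, len(p)) for x in p)
    theta2 = sum(d2 for d2, _ in per_point) / n           # Eq. (9)
    theta = math.sqrt(theta2)
    loglik = sum(
        -math.log(math.sqrt(2 * math.pi) * theta ** n_params)
        - d2 / (2 * theta2)
        + math.log(n_i / n)
        for d2, n_i in per_point
    )                                                     # Eq. (10)
    return loglik - n_params / 2 * math.log(n)            # Eq. (8)


points = [0.0, 0.2, 0.1, 5.0, 5.1, 4.9]
bic_whole = bic_score([points])
bic_split = bic_score([points[:3], points[3:]])
accept_split = bic_split > bic_whole
```

For these two well-separated groups the split receives the higher score, so the bi-partition would be accepted; the same comparison over candidate values of $K$ would pick the event number.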

5 Experiments

5.1 Experimental Settings

The Testing Dataset. In the experiments, we employ the Brand-Social-Net dataset [6]. This dataset consists of 3 million microblogs with 1.3 million images from Sina Weibo in June and July 2012. Each microblog contains the text description, the image if available, the owner information, posting time, geo location and user connections on Sina Weibo. The number of users is 1 million. This dataset contains 20 saga events, such as Windows 8 Preview, Chongqing Auto Expo, and Honda Elysion. These saga events happened during June and July 2012, and the number of relevant microblogs for each saga event ranges from hundreds to thousands. Given the microblogs of each saga event, event detection is conducted to explore the sub-events within these saga events.

Compared Methods. To evaluate the proposed event detection method, the following methods are employed for comparison.

The Candidate-Ranking method [12] (CR). The Candidate-Ranking method first retrieves several promising events; the probability that the incoming document belongs to each of these events or to a new event is then measured by an SVM classifier.

The Candidate-Ranking method [12] with visual content (CR+V). We further implement the Candidate-Ranking method by incorporating visual content analysis.

The CLASS-SVM method [4] (CS). CS is an incremental clustering method which employs SVM as the classifier to identify whether a new document belongs to an existing event or a new event.

The CLASS-SVM method [4] with visual content (CS+V). We further implement the CLASS-SVM method with visual content analysis.

The proposed method, denoted by Proposed.

The proposed method without visual content (Proposed-V). In this method, the visual content of microblogs is not taken into consideration.

The proposed method without MC (Proposed-MC). In this method, the MC generation process is removed.

Evaluation Criteria.
Event detection is conducted on all related data. To evaluate the event detection performance, two types of ground truth are manually annotated: the summarized ground truth, consisting of tens of microblogs and reflecting most of the main content in the data, and the top-ranking content, consisting of the 10 microblogs that are the most important content in the data. Three students were employed to manually select the summarized ground truth and the top-ranking content from all related microblogs. We adopt the following performance evaluation measures.

Recall measures the data coverage of the generated events, Precision evaluates the event detection accuracy, and F-Measure is a joint measure of Precision and Recall.

Average normalized modified retrieval rank (ANMRR) [1] is a rank-based measure, which considers the ranking information of microblogs. In our experiments, the selected top-ranking microblogs are regarded as the positive samples, and ANMRR evaluates the ranking results. A lower ANMRR value indicates better performance, i.e., the important microblogs are listed at the top positions.

5.2 Comparison with the State-of-the-Art

We first compare the proposed method with the state-of-the-art methods, i.e., CR [12] and CS [4]. The average performance of the different methods on event detection is presented in Figure 4.

Fig. 4. The performance comparison on static event detection among different methods: (a) Recall, (b) Precision, (c) F, (d) ANMRR

From Figure 4, we observe that the proposed method achieves better results than CR and CS in the event detection task. The proposed method achieves an improvement of 40.9%, 46.1%, 43.4%, and 19.4% in terms of Recall, Precision, F, and ANMRR, respectively, as compared to CR, and an improvement of 50.7%, 55.5%, 53.0%, and 20.9% as compared to CS. These results demonstrate the effectiveness of the proposed method on event detection. The better results stem from the proposed intermediate semantic level, i.e., the MCs, which jointly explore the highly related microblogs to address the inadequate information issue. In our method, the visual content is exploited in both the MC generation and the event detection procedures, which also contributes substantially to the event detection performance. In the next two subsections, we elaborate on the effects of visual content and MC on event detection.

5.3 On Visual Content

We evaluate the influence of visual content on event detection by comparing the performance of CR, CS and the Proposed method with and without visual content. Figure 5 presents the experimental results, which show that the use of visual content can significantly improve the event detection performance.
In comparison with the textual content, the visual content has shown its superiority in information spreading on social media platforms. The visual content in microblogs has been found to enrich the short and conversational textual data.
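As a rough illustration of the Recall, Precision and F-Measure values reported in Figures 4 and 5, the sketch below scores a set of detected microblogs against a ground-truth set. It assumes exact matching by microblog ID, which the text does not specify; the function name and the IDs are hypothetical.

```python
def precision_recall_f(detected, ground_truth):
    """Set-based Precision, Recall and F-Measure over microblog IDs."""
    detected, ground_truth = set(detected), set(ground_truth)
    hits = len(detected & ground_truth)
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(ground_truth) if ground_truth else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f_measure


# Four detected microblogs, five in the ground truth, two in common.
p, r, f = precision_recall_f({1, 2, 3, 4}, {2, 3, 5, 6, 7})
```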


Fig. 5. The performance comparison with and without visual content on static event detection: (a) Recall, (b) Precision, (c) F, (d) ANMRR

5.4 On MC Performance

As an intermediate concept-level representation, MC aims to enrich the microblog information from a small group of highly correlated microblogs. Here we compare the performance of the proposed method with Proposed-MC, which removes the MC generation step from event detection. Experimental results are shown in Figure 6, which indicates that the event detection performance degrades without MC. The use of MC achieves an improvement of 18.2%, 23.2%, 20.6%, and 14.8% in terms of Recall, Precision, F, and ANMRR, respectively. These results indicate that the proposed MC is effective and essential for the event detection task.

The advantage of MC comes from its intermediate concept-level representation, which goes beyond what a single microblog can convey. An MC is composed of a group of highly correlated microblogs, which may share similar textual content, similar visual content, close geographical locations and connected owners, such as reposted microblogs and the corresponding comments. These microblogs reinforce each other as new content arrives from reposts and/or comments, which addresses the information sparseness issue in microblogs.

Fig. 6. The performance comparison with and without MC on static event detection

6 Conclusion

In this paper, we proposed an event detection method for microblogs. An intermediate concept level, i.e., the microblog clique, is introduced to explore the highly correlated microblogs and enrich event representation. To tackle the heterogeneous data in microblogs, the microblogs are formulated in a hypergraph structure and hierarchical bi-partition is conducted to generate MCs.
A bipartite graph is then constructed using the MCs and the corresponding microblogs, and the bipartite graph partition is performed to detect events. The proposed method has been evaluated on the Brand-Social-Net dataset.

From the experimental results and comparisons with the state-of-the-art methods, we can draw the following conclusions.

The proposed event detection method outperforms the existing state-of-the-art methods on all evaluation criteria, which demonstrates the superiority of the proposed method.

The evaluation of the proposed intermediate concept level, i.e., MC, confirms that MC is able to extract richer information from highly correlated microblogs and thus leads to better event detection performance.

The evaluation of the visual content shows that it also improves the event detection performance.

Several difficult problems remain in event detection on social media platforms. First, most existing methods directly combine the multi-modal data in social media posts, such as the textual and visual content. These heterogeneous data may be highly correlated, and how to jointly investigate the multi-modal data in microblogs requires further attention. Second, the social network can reveal important latent information about the social events behind the microblogs and the users, which is another topic for future research.

Acknowledgements. This research is supported by the Singapore National Research Foundation under its International Research Centre @ Singapore Funding Initiative and administered by the IDM Programme Office.

References

1. Description of core experiments for MPEG-7 color/texture descriptors. In: Standard ISO/MPEGJTC1/SC29/WG11 MPEG98/M2819 (1999)
2. Allan, J., Papka, R., Lavrenko, V.: On-line new event detection and tracking. In: ACM SIGIR (1998)
3. Amigó, E., Gonzalo, J., Artiles, J., Verdejo, F.: A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information Retrieval 12(4)
4. Becker, H., Naaman, M., Gravano, L.: Learning similarity metrics for event identification in social media. In: WSDM (2010)
5. Fung, G.P.C., Yu, J.X., Yu, P.S., Lu, H.: Parameter free bursty events detection in text streams. In: VLDB (2005)
6. Gao, Y., Wang, F., Luan, H., Chua, T.-S.: Brand data gathering from social media streams. In: Proceedings of ACM Conference on Multimedia Retrieval (2014)
7. Gao, Y., Wang, M., Zha, Z.-J., Shen, J., Li, X., Wu, X.: Visual-textual joint relevance learning for tag-based social image search. IEEE Transactions on Image Processing 22(1) (2013)
8. Hearst, M.: Search user interfaces. Cambridge University Press (2009)
9. Huang, Y., Liu, Q., Zhang, S., Metaxas, D.: Image retrieval via probabilistic hypergraph ranking. In: CVPR (2010)
10. Li, Z., Wu, X.M., Chang, S.F.: Segmentation using superpixels: A bipartite graph partitioning approach. In: CVPR (2012)
11. Naveed, N., Gottron, T., Kunegis, J., Alhadi, A.C.: Searching microblogs: coping with sparsity and document quality. In: Proceedings of CIKM (2011)
12. Reuter, T., Cimiano, P.: Event-based classification of social media streams. In: Proceedings of the 2nd ACM International Conference on Multimedia Retrieval (2012)
13. Reuter, T., Cimiano, P., Drumond, L., Buza, K., Schmidt-Thieme, L.: Scalable event-based clustering of social media via record linkage techniques. In: ICWSM (2011)
14. Ritter, A., Etzioni, O., Clark, S., et al.: Open domain event extraction from twitter. In: KDD. ACM (2012)
15. Rozenshtein, P., Anagnostopoulos, A., Gionis, A., Tatti, N.: Event detection in activity networks. In: KDD. ACM (2014)
16. Sayyadi, H., Hurst, M., Maykov, A.: Event detection and tracking in social streams. In: WSDM (2009)
17. Schwarz, G.: Estimating the dimension of a model. Ann. Statist. 6 (1978)
18. Weng, J.S., Lee, B.S.: Event detection in twitter. In: ICWSM (2011)
19. Yang, J., Yu, K., Gong, Y., Huang, T.: Linear spatial pyramid matching using sparse coding for image classification. In: CVPR (2009)
20. Yang, Y., Pierce, T., Carbonell, J.G.: A study on retrospective and on-line event detection. In: ACM SIGIR (1998)
21. Yang, Y., Yang, Y., Shen, H.T., Zhang, Y., Du, X., Zhou, X.: Discriminative nonnegative spectral clustering with out-of-sample extension. IEEE Transactions on Knowledge and Data Engineering 25(8) (2013)
22. Zhou, D., Huang, J., Schölkopf, B.: Learning with hypergraphs: Clustering, classification, and embedding. In: NIPS (2007)
