Feature identification for topical relevance assessment in feed search engines 1

Size: px
Start display at page:

Download "Feature identification for topical relevance assessment in feed search engines 1"

Transcription

1 Intelligent Data Analysis 17 (2013) DI /IDA IS Press Feature identification for topical relevance assessment in feed search engines 1 Yongwook Shin and Jonghun Park Department of Industrial Engineering, Seoul National University, Seoul, Korea Abstract. Feed has become a popular way to effectively distribute and acquire information on the web. The explosive growth of feeds demands a search engine that can help users quickly discover feeds of their interests. Retrieval effectiveness of feed search engine highly depends on a relevance assessment method that determines candidates for ranking query results. However, existing relevance assessment approaches proposed for web page retrieval may produce unsatisfactory result due to the different characteristics of feeds from traditional web pages. Compared to web pages, feed is a dynamic document since it continually generates information on some specific topics. In addition, it is a structured document that consists of several data elements such as title and description. Accordingly, the relevance assessment method for feed retrieval needs to effectively address these unique characteristics of feeds. This paper considers a problem of identifying significant features which are a feature set created from feed data elements, with the aim of improving effectiveness of feed retrieval while at the same time reducing computational cost. We conducted extensive experiments to investigate the problem using support vector machine on real-world data sets, and found the significant features that can be employed for feed search services. Keywords: Feed search, feed, feature identification, relevance assessment, text classification, search engine, information retrieval 1. Introduction Recently, the number of sites publishing feeds is dramatically increasing since feed is gaining an attention from people as an effective way to diffuse and consume information. Feed is an ML document published by a web site to facilitate syndication of its contents to subscribers. It provides an efficient mechanism that keeps its subscribers up-to-date with constantly changing information [29]. News and blogs are common sources for feeds, and feeds are used to deliver various types of structured information, ranging from weather data to search results [23]. More recently, social media and microblogging sites are increasingly delivering information through feeds. Two popular feed formats are RSS (Really Simple Syndication) [2] and Atom [18]. In order to address the challenges imposed by recent explosion of feeds, a feed search engine that can help users effectively access and find feeds becomes highly necessary. Feeds tend to be dedicated to some specific topic as they are often published by individual sites or groups of common interest. Users subscribe to feeds on their interests to keep themselves updated with recent information. 1 A preliminary version of this paper was presented at the SIGIR 2010 Workshop on Feature Generation and Selection for Information Retrieval. Corresponding author: Jonghun Park, Department of Industrial Engineering, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul, , Korea. Tel.: ; Fax: ; jonghun@snu.ac.kr /13/$27.50 c 2013 IS Press and the authors. All rights reserved

2 718 Y. Shin and J. Park / Feature identification for topical relevance assessment in feed search engines Accordingly, it is desirable for feed subscribers to discover relevant feeds to their queries through a feed search engine. Feed search engine takes a query from users and generates ranking candidates which are feeds assessed to be relevant to the query. Afterwards, it returns a ranked list of feeds by scoring the ranking candidates based on a specific feed retrieval model. Therefore, a relevance assessment method that determines the ranking candidates plays an important role in enhancing feed search quality. Under the assumption of binary relevance assessment, feeds are classified as being relevant or nonrelevant against a user query to construct the ranking candidates [28]. Feature-based model is one of the most popular and effective models for such a classification task [5], and it uses features to represent a user query and document. Specifically, the relevance is measured in terms of feature sets constructed from a user query as well as from a feed after defining specific features suited to feed relevance assessment. Unfortunately, existing relevance assessment methods for web page retrieval may not be appropriate for feed retrieval since they deal with different available features. Specifically, two main distinct properties of feeds make relevance assessment problem for feed retrieval challenging, compared to the web page retrieval. First, feed is a structured document composed of several data elements. Feed contains a feed title, a feed description, and multiple entries as major data elements. In addition, entry title and entry description are included within an entry. With this structural nature, it is necessary to investigate which data elements need to be selected to define a feature set for relevance assessment. Second, feed is a time-varying dynamic document that continually publishes new entries over time, whereas web pages tend to be rather static. This temporal characteristic of feed raises a problem of determining how many entries need to be considered for relevance assessment. To the best of authors knowledge, there have been very few studies that attempted to define features for feed relevance assessment by considering the unique structural and temporal characteristics of feed mentioned above. In this paper, we attempt to identify significant features for feed relevance judgment with respect to a user query. The significant features are constructed from feed data elements, and they are the features that can maximize effectiveness of feed relevance assessment while not hurting efficiency in terms of computational cost. The rest of the paper is organized as follows. In Section 2, related work is presented, and the problem and our proposed approach are presented in Section 3. Section 4 describes the data set for experiments and baseline features. In Section 5, we report experimental results in order to identify the significant features. Finally, we conclude the paper in Section Related work Several blog ranking methods that take account of unique properties of feeds have been studied previously. A ranking function for blog feed retrieval model was proposed by considering the structural characteristic of a feed. Research in [8,9,16] investigated whether it is more effective for feed retrieval to view a feed as a single document composed of entries or multiple individual documents. Similarly, research results on blog retrieval presented in [22] examined a ranking model for blog search. However, they did not pay attention to constructing feature set for feed relevance assessment. In addition, the temporal nature of feed and structural elements such as the feed title and feed description were not considered in [8,9,22]. An enhanced ranking model for blog posts, named PTRank, was proposed to improve search quality beyond the simple keyword matching through utilizing various features available from blog feeds [10]. PTRank, however, focused on ranking the individual blog posts to retrieve topically relevant posts. It

3 Y. Shin and J. Park / Feature identification for topical relevance assessment in feed search engines 719 suggested a scoring function to measure the degree of relevance between a query and blog posts, instead of identifying the significant features for feed relevance judgment. Furthermore, the temporal characteristic of feed was not considered by PTRank. Feature identification for blog classification and ranking function for blog search are related to our problem since a blog has similar temporal characteristics to a feed in that a blog is a web site consisting of dated posts listed in reverse chronological order on a single page. Identifying novel features from a blog has been addressed for spam blog classification [12,15]. It was attempted to find the most effective feature types among the bag-of-words, bag-of-urls, and bag-of-anchors by regarding a blog as a single document [12]. In [15], temporal features based on the relationship among blog posts were examined to identify the publishing patterns of spam blogs. Text classification experiments were performed to categorize blogs into topics by using TF-IDF (Term Frequency Inverse Document Frequency) measure and weighted linguistic features such as the title of individual posts and the anchor text from incoming links while a blog was regarded as a single document without considering its temporal characteristics [20]. For blog genre analysis, various features related to the temporal and structural characteristics of blogs were used in order to classify blogs into several genres [11]. Their temporal features included blog post update frequency and age of blog. Structural features were also created from content, image, comment, and advertisement. In contrast to the above mentioned blog classification problems, we attempt to identify the significant features to measure the relevance for a feed retrieval task through feed classification using two topiclabeled feed data sets. Support vector machine is used to classify feeds on their corresponding topic for our experiments. 3. Proposed approach Feed is an ML file that contains partial or full descriptions of web page articles along with links to the original contents and other information such as author, title, and date. Figure 1 shows a typical example that illustrates major elements of an Atom feed [18], in which a feed contains entries corresponding to web page articles, and the entries are arranged in a reverse chronological order. Specifically, the schema of Atom 1.0 contains two core elements, namely <feed> and <entry>. <feed> is the root element of an Atom document, which declares a feed with appropriate namespaces and contains feed title and description. <feed> defines metadata of an information source that provides feed, and should have at least one <entry> element. In addition, <feed> element has mandatory subelements, <title> and <subtitle>. <entry> element specifies metadata of a web page article, and it includes <title>, <summary>,and <updated> elements. While <title> contains the title of the article, <summary> element has a short summary or full body of the article. <updated> gives the publication time of a particular <entry>. Among the various feed elements, we consider five elements in identifying the significant features for feed relevance assessment. These selected elements are summarized in Table 1. We refer to <title> element in <feed> as a feed title, and <subtitle> element in <feed> as a feed description. We also refer to <title> element of <entry> as an entry title, <summary> element of <entry> as an entry description, and <updated> element of <entry> as an entry publication time. This study concentrates on identification of the significant features for feature-based relevance assessment method in feed retrieval. The significant features are defined as a feature set created from data elements available from a feed, which can maximize effectiveness of relevance assessment. Specifically,

4 720 Y. Shin and J. Park / Feature identification for topical relevance assessment in feed search engines Table 1 Elements of Atom 1.0 feed considered in the proposed approach Element Definition Feed element <title> Title of a feed <subtitle> Description of a feed Entry element <title> Title of an entry <summary> Description of an entry. Contains the content of a web page article <updated> Time and date when an entry was published <?xml version="1.0" encoding="utf-8"?> <feed xmlns=" <title>example Feed</title> <subtitle>a subtitle.</subtitle> <link href=" rel="self" /> <link href=" /> <id>urn:uuid:60a76c80-d399-11d9-b91c e0af6</id> <updated> t18:30:02z</updated> <author> <name>john Doe</name> < >johndoe@example.com</ > </author> <entry> <title>atom-powered Robots Run Amok</title> <link href=" /> <link rel="alternate" type="text/html" href=" <link rel="edit" href=" <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id> <updated> t18:30:02z</updated> <summary>some text.</summary> </entry> </feed> Fig. 1. An example Atom feed. we use a vector space model for representing the features. The notations used in our problem are presented as follows. Given the set of m feeds, F = {f 1,f 2,...,f m }, and the set of terms, T,letE i = {e i,j j = 1,...,n(f i )} represent the set of entries for f i,i =1,...,m, constructed by choosing the most recent n(f i ) entries from the set of all entries published from f i,wheree i,j is more recent than e i,j+1. We base our vector space model on a bag of terms representation, and denote the term bags of feed title and feed description as ft i and fd i, respectively. The term bag constructed from the title in e i,j is represented as et i,j, and the term bag constructed from the description in e i,j is represented as ed i,j. We also let gt i,k = j=g k j=g (k 1)+1 et i,j denote the term bags of aggregated entry titles from a group of entries, and gd i,k = j=g k j=g (k 1)+1 ed i,j the term bags of aggregated entry descriptions, for feed f i, k =1,...,r i,wherer i = n(f i )/g. g denotes a grouping factor to make aggregated term bags from entry titles and descriptions, and it is an integer such that 1 g n(f i ). For instance, if g = 1, gt i,1 = et i,1,...,gt i,ri = et i,ri and gd i,1 = ed i,1,...,gd i,ri = ed i,ri where r i = n(f i ). We do not consider other feature representations, such as the bag of anchors or URLs, since many feeds do not contain anchor texts and URLs in their entry descriptions but instead publish entry descriptions

5 Y. Shin and J. Park / Feature identification for topical relevance assessment in feed search engines 721 as short snippets. Another reason for not considering URLs as feature representation is to prevent a bias toward the feeds publishing entries with shortened URLs, in which terms tend to be not useful for relevance judgment. Let Q denote a set of queries, and q Q a term set representation of a query. We also use notation B for representing a bag of terms, and B for the total number of term occurrences in B. In addition, tf(t, B) represents frequency of term t T in term bag B. In case of normalized term frequency (TF), scoring function in terms of query q and term bag B, σ(q,b) is defined as: { t q tf(t, B)/ B σ(q,b) = if q B 0 otherwise In contrast, σ(q, B) for the case of binary indicator is defined as: { 1 if σ(q,b) = t q tf(t, B) > 0 0 otherwise Given query q and E i, f i is modeled as a feature vector represented by (σ(q,ft i ), σ(q,fd i ), σ(q,gt i,1 ),...,σ(q, gt i,ri ), σ(q, gd i,1 ),...,σ(q,gd i,ri )). σ(q,ft i ), σ(q,fd i ), σ(q,gt i,k ),andσ(q,gd i,k ) are scoring functions defined in terms of the normalized TF or binary indicator presented above, and they indicate feature scores specified for feed title, feed description, entry title, and entry description, respectively. We consider normalized TF and binary indicator among the other alternatives, since it is reported that they are effective in many text classification problems [7,12,14]. We respectively refer to the normalized TF and the binary indicator as TF and BI in the following, and also call σ(q,ft i ) and σ(q,fd i ) as feed level feature scores, and σ(q, gt i,k ) and σ(q,gd i,k ) as entry level feature scores. The significant features are constructed from feed data elements by considering various methods for aggregating and sampling entries. Among the possible feature sets, the significant features are the ones that produce the best performance on a specific performance evaluation metric. We attempt to find the significant features through examining various alternatives for constructing the feature vector, considering the unique properties of feed mentioned in the previous section. First, feed data elements for the significant features are identified by investigating several combinations of structural data elements in feature vector construction. For instance, when the relevance assessment using feature vector (σ(q, ft i ),σ(q, fd i )) turns out to have better performance than (σ(q,fd i )), it can be concluded that a combination of feed title and feed description is more appropriate than feed description alone as feed data elements for the significant features. Second, the number of entries considered for the feature vector is identified by varying n(f i ) to address the temporal characteristics of feed. n(f i ) can be determined by two ways: by the number of entries or by the time horizon of entry publication. Let s c and s t respectively denote the number of entries and the time horizon of entries considered for constructing a feature vector. We also let p i,j represent the publication time of entry e i,j. In case that n(f i ) is determined by s c, n(f i ) is set to s c, while for the case of s t, n(f i ) is defined as the number of entries published from f i during the time period from p i,1 to (p i,1 s t ),wherep i,1 is the publication time of the most recent entry. Third, to examine the effect of the number of entries considered for an entry level feature score on the performance of relevance assessment method, we construct a feature vector by varying g. It is expected that performance of relevance assessment method changes along with g since sparseness and size of the feature vector decrease as g increases from 1 to n(f i ). Note that the case of g = 1 corresponds to the small document model and g = n(f i ) to the large document model in [8,9]. Finally, we investigate whether or not the topics of recent entries are more influential to the feed s topic than those of old entries. To deal with time sensitivity of feed s topic, we adopt an exponential

6 722 Y. Shin and J. Park / Feature identification for topical relevance assessment in feed search engines decay rule for entry level feature scoring [1,6,13]. Let tf d (t, gt i,k,p) denote time decayed tf(t, gt i,k ) at time p. Given E i and g, tf d (t, gt i,k,p) is defined as: tf d (t, gt i,k,p)= j=g k j=g (k 1)+1 e λ(p pi,j) tf(t, et i,j ), p > p i,j where λ is time decaying factor obtained from the half-life decay time δ, (i.e., the time required by TF to halve its value), with the relation e λδ = 1 2 [6]. We remark that tf d(t, gd i,k,p) is also defined in the same way. In this paper, topical relevance between a query and a feed is assessed under a text classification framework based on the binary relevance judgment that determines whether the feed is relevant to the query or not. We employ support vector machine (SVM) for our feed classification task, since it is reported that SVM is one of the most popular and effective supervised learning methods for text classification problems such as topic classification, spam blog, and sentiment classification [3,7,15,19,27]. SVM treats feed classification problem as the one involving binary class labels, which we refer to as non-relevant and relevant. We are given with labeled training data represented by {(x 1,y 1 ),...,(x h,y h )} where x l is a feature vector modeling a feed and y l is a class label of x l, l =1,...,h.Giveng and the set of queries Q = {q 1,...,q v }, each of which represents a specific topic, x l =(σ(q u,ft i ), σ(q u,fd i ), σ(q u,gt i,1 ),..., σ(q u,gt i,ri ), σ(q u,gd i,1 ),..., σ(q u,gd i,ri )), u =1,...,v. SVM builds a model using the given labeled training data, and predicts whether an instance of feature vector of a feed falls into non-relevant or relevant. Formally, two classes y l { 1, 1}, where 1 represents non-relevant and 1 represents relevant to a given topic, are to be predicted by the model. If y l s are linearly non-separable, then SVM finds the optimal weight vector w by solving the following convex optimization problem. min w w 2 + C 2 h l=1 ξ l subject to y l (w x l + b) 1 ξ l,, l =1, 2,...,h. where C and ξ l respectively represent a cost and a slack variable, and w and b are parameters of the model. To evaluate effectiveness of relevance assessment, we use precision, recall, and F measure which are commonly used to compare the relative performances with different features [5]. Precision, P,isameasure of the usefulness of feeds evaluated to be relevant and recall, R, is a measure of the completeness of a relevance assessment method. # of Relevant Feeds Retrieved P = (1) # of Retrieved Feeds # of Relevant Feeds Retrieved R = (2) # of Relevant Feeds where Retrieved in Eqs (1) and (2) means classified to be relevant. F measure, FM,isdefinedasa harmonic mean of recall and precision, which is: FM = 2 R P (R + P ) FM is an effectiveness measure used for evaluating text classification performance and it provides a single metric for comparison across different experiments.

7 Y. Shin and J. Park / Feature identification for topical relevance assessment in feed search engines 723 Table 2 Data set statistics Google data set Twitter data set Source Google reader Wefollow Number of topics Number of feeds 3,050 4,354 Average number of feeds per topic Data sets for experiments and baseline features Two topic labeled feed data sets are selected for our experiments, which were collected from popular web services. The first data set is Bundles from Google available from Google Reader [21], a Google s web-based feed reading service. In Bundles from Google, several feeds for a topic are bundled in order to provide feed recommendations to Google reader users. We call Bundles from Google as Google data set. Google data set is chosen for our experiment as it covers various topics and broad types of feeds from public media sites, blogs, podcasts, and social media sites. The comprehensive nature in terms of topic coverage and feed source types makes this data set useful for our experiments since the narrow topic coverage and limited feed sources may produce biased experimental results. The second data set is Twitter user directory from Wefollow [25] which is one of the most popular Twitter [24] directory services. We call this Twitter data set. In Twitter data set, several Twitter users are grouped by a tag that represents a specific topic. We consider a Twitter user as a Twitter feed since every Twitter user automatically publishes his/her own Twitter feed that posts all tweets as entries. The Twitter data set is included in our experiments not only to identify the significant features of the feeds publishing the entries with small number of terms but also to examine the characteristics of social media feeds that may have different compositional patterns under a stringent limitation on the entry length. (The number of characters per entry of a Twitter feed is not allowed to exceed 140.) Twitter data set was built by collecting the most influential 25 Twitter users per a tag. Data sets for our experiments were collected from Google, Wefollow, and Twitter during January Table 2 presents brief statistics for our two data sets. Baseline features are selected from the features available from feed under the assumption that relevance assessment method using as many as feed data elements and entries possible would be the most effective. Similar to the significant features, baseline features are constructed by using feed data elements, based on two extreme grouping factors. Given F, we select all of ft i, fd i, et i,j,anded i,j, i =1,...,m, j = 1,...,n(f i ) as structural data elements for baseline features, and set s c = 200 for the time span of feed since the largest number of entries per feed available from our data sets is 200. Also, we set g = 1orn(f i ) for constructing the feature vector of baseline features in order to study whether or not the performance of relevance assessment method under 2 g (n(f i ) 1) is better than those of the small and large document models in [8,9]. 5. Experiment results For our experiments, we used the libsvm toolkit using a linear kernel [4]. We have tried various versions of the SVM kernel functions including higher order polynomials and radial basis functions, and

8 724 Y. Shin and J. Park / Feature identification for topical relevance assessment in feed search engines Table 3 Best performance results of each feed data element and their combinations using Google data set (a) Best precision result TF BI P R FM P R FM FT FD FT+FD ET ED ET+ED FT+ET FT+ED FT+ET+ED FD+ET FD+ED FD+ET+ED FT+FD+ET FT+FD+ED FT+FD+ET+ED (b) Best F measure result TF BI P R FM P R FM FT FD FT+FD ET ED ET+ED FT+ET FT+ED FT+ET+ED FD+ET FD+ED FD+ET+ED FT+FD+ET FT+FD+ED FT+FD+ET+ED found that the linear kernel produces the best performance for our data sets. A five fold cross-validation technique is used to evaluate performance. In our experiments based on the two data sets, the topical tags manually annotated to feeds in Bundles from Google and Wefollow were used as queries. That is, when we use a specific tag as a query, only the feeds labeled with the same tag are judged to be relevant against the query while all the other feeds are classified to be not relevant for the query. In general, text classification problem is characterized by unbalanced data sets in which one of binary class labels is represented by a large portion of all the instances, while the other class label has only a small portion of the instances. We chose the method of under-sampling the majority class, by which we sample instances from majority class such that the number of training instances in both classes are made equal [17]. While there are several recommendations in the literature on how to sample, we employed the method of random sampling as there is an empirical evidence that this method does better than or as good as the other existing ones [26].

9 Y. Shin and J. Park / Feature identification for topical relevance assessment in feed search engines 725 Table 4 Results for Google data set under TF P R FM s c g = g = g = ET ED FT+FD ET FT+FD ED FT+FD ET+ED Table 3 summarizes experimental results based on Google data set in order to investigate which feed data elements need to be employed for the significant features. In Tables 3-(a), precision results of several combinations of feed data elements show the maximum precision values along with the resulting recall and F measure performances achieved under TF and BI. The maximum precision results were taken among several configurations of s c and g values. In Tables 3-(b), F measure results for several combinations of feed data elements show the maximum F measure values along with corresponding precision and recall results produced in our experiments under TF and BI. The maximum F measure results were also taken among several configurations of s c and g values. Table 6 was constructed for Twitter data set in the same way as Table 3. Bold numbers in Tables 3 and 6 represent the maximum performances of precision or F measure under TF or BI. Tables 4, 5, 7, and 8 show our experiment design and results using Google data set or Twitter data set under TF or BI when entries are included in feature vector construction and n(f i ) is determined by s c. In particular, Tables 4 and 5 contain representative results selected from complete results, which include performances of the significant features, the baseline features, and several variations of feed data element combinations. In the result tables, FT stands for feed title, FD for feed description, ET for entry title, and ED for entry description. The first column on the left side in the result tables indicates feed data elements and their combinations in which we denote combination of feed data elements used for the feature vector with +. For instance, FT+FD+ET indicates that the feature vector is constructed by using feed title, feed description, and entry title, and it is represented by (σ(q,ft i ), σ(q,fd i ), σ(q,gt i,1 ),...,σ(q,gt i,4 ))

10 726 Y. Shin and J. Park / Feature identification for topical relevance assessment in feed search engines Table 5 Results for Google data set under BI P R FM s c g = g = g = ET ED FT+FD ET FT+FD ED FT+FD ET+ED without features for entry description when s c = 40 and g = 10, where 40 entries are used in feature vector construction and 10 entries are grouped together for computing an entry level feature score. In Tables 4, 5, 7, and 8, numbers in gray boxes are F measure results of the baseline features defined earlier, and bold numbers correspond to F measure values of the significant features. We found that the significant features outperform the baseline features in all experiment results as these tables show. We did not include ED in experiments using Twitter data set as ET and ED are identical in Twitter feeds. Accordingly, ED is not contained in feed data element combinations in Tables 6, 7, and 8. We present a detail analysis for the experiment results with respect to precision and F measure in the following Google data set Effects of feed data elements Tables 3 (a) shows that FT achieves the best precision under TF and BI. Also, it was found that precision values of FD, ET, and ED are almost same under TF, whereas FD produces the second best precision under BI. From these observations, we can conclude that FT and FD are more competitive than the other feed data elements in terms of the precision performance of feed relevance assessment. n the other hand, Tables 3-(b) tells us that FT+FD+ED produces the best result for F measure under TF, while FT+FD+ET+ED achieves the best performance for F measure under BI. It is interesting to see that use of FT or FD alone is less effective in terms of F measure than FT+ET+ED or FD+ET+ED, due to lower recall results. Furthermore, Tables 3-(b) indicates that FD and ED respectively produce

11 Y. Shin and J. Park / Feature identification for topical relevance assessment in feed search engines 727 Table 6 Best performance results of each feed data element and their combinations using Twitter data set (a) Best precision result TF BI P R FM P R FM FT FD FT+FD ET FT+ET FD+ET FT+FD+ET (b) Best F measure result TF BI P R FM P R FM FT FD FT+FD ET FT+ET FD+ET FT+FD+ET Table 7 Results for Twitter data set under TF P R FM s c g = g = g = ET FT+ET FD+ET FT+FD ET higher F measure results than FT and ET under TF, mainly due to the fact that FD and ED respectively have more terms than FT and ET. From the fact that FT and FD achieve low recall but high precision results, it can be concluded that many feeds do not put their topic into FT and FD, but it is highly likely that the terms in FT and FD

12 728 Y. Shin and J. Park / Feature identification for topical relevance assessment in feed search engines Table 8 Results for Twitter data set under BI P R FM s c g = g = g = ET FT+ET FD+ET FT+FD ET represent feed s topic. In addition, FT+FD+ET+ED does not completely outperform FT+FD+ET for F measure, while ED requires relatively higher cost for computing feature scores than the other individual features. The differences between F measure values of FT+FD+ET+ED and those of FT+FD+ET are only 0.01 under TF and under BI, suggesting that FT+FD+ET produces competitive performance compared to FT+FD+ET+ED, considering computation cost for feature score of ED Effects of g and s c Tables 4 and 5 indicate that the significant features found correspond to FT+FD+ED based on 200 entries and g = 40 for F measure under TF, whereas they correspond to FT+FD+ET+ED based on 200 entries and g = 20 under BI. Furthermore, we examine effects of s c and g on the performance in Tables 4 and 5. In general, small g s yield higher precision values than large g s. Contrary to g, large s c s produce better precision results than small s c s. It turns out from these observations that relevance assessment method with high dimension of the feature vector contributes to improvement of precision results. In Tables 4 and 5, however, effects of s c and g on F measure performance turn out to be different from those on precision. This difference is mainly due to recall results. In contrast to the case of the baseline features in which g = 1 or 200, the maximum F measure results for each considered feed data element combination were observed at g = 40 and 200 under TF (underlined in Table 4), while they were found at g = 10, 20, and 40 under BI (underlined in Table 5). Large s c s still produce better results on F measure than small s c s for most g values. However, Fig. 2 depicts that F measure performances of most g values hardly increase beyond s c of 120, when F measure results were produced by the significant features (i.e., FT+FD+ED with s c = 200 and g = 40 under TF, and FT+FD+ET+ED with s c = 200 and g = 20 under BI). Accordingly, it can be deduced that a large number of entries containing old entries are not necessary for constructing a feature vector, but a small

13 Y. Shin and J. Park / Feature identification for topical relevance assessment in feed search engines F measure 0.75 F measure g=1 g=10 g=20 g=40 g=200 g=1 g=10 g=20 g=40 g= Sc (a) TF Sc (b) BI Fig. 2. F measure results of the significant features for Google data set. (Colours are visible in the online version of the article; number of recent entries are enough to assess relevance of feed. Therefore, we can conclude that feed relevance assessment method constructed by using FT+FD+ET and small values of s c and g produces comparable performance in terms of F measure to that of the significant features, able to reduce cost for computing feature scores Twitter data set Effects of feed data elements Distinct characteristics of Twitter feeds make experimental results for Twitter data set different from those for Google data set. FT achieves the maximum precision in Table 6-(a). Precision result of FT significantly outperforms that of ET under BI, whereas FT does not produce higher precision result than ET under TF. In contrast, FD results in lower precision than ET. FT and FD do not contribute to improvement of precision in spite of combining with ET, differently from the results of Google data set. A combination of all feed data elements has the maximum F measure performance under TF, while FT+ET produces the best F measure performance under BI as in Table 6-(b). FD alone yields relatively high precision and recall at the same time, and it improves F measure through combining it with other feed data elements such as FT and ET under TF. In particular, it is interesting to see that F measure result of FT+FD is only lower than that of FT+FD+ET under TF. As a result, it appears that FT+FD is a satisfactory combination for relevance assessment of Twitter feeds, considering high precision of FT+FD as well as cost for computing feature scores related to ET Effects of g and s c As illustrated in Tables 7 and 8, the significant features are the ones constructed from FT+FD+ET by using 160 entries and g = 40 for F measure under TF, whereas they consist of FT+ET based on 200 entries and g = 200 under BI. ET and FT+ET respectively produce the maximum values for F measure

14 730 Y. Shin and J. Park / Feature identification for topical relevance assessment in feed search engines Table 9 Additional experiment results for Google data set under TF (g = 1) (a) Determination of n(f i) using s t (b) Time decayed feature score (c) Non-time decayed feature score s t P R FM s t P R FM s t P R FM F measure 0.40 g=1 g=10 g=20 g=40 g=ld F measure 0.50 g=1 g=10 g=20 g=40 g=ld Sc (a) TF Sc (b) BI Fig. 3. F measure results of the significant features for Twitter data set. (Colours are visible in the online version of the article; when g = 200 with help of high recall results in spite of quite low precision results under BI, and they yield the best F measure performance when g = 40 under TF. In Tables 7 and 8, the maximum F measure values for each considered feed data element combinations are underlined. Moreover, Tables 7 and 8 reveal that large s c s have higher F measure results than small s c s under TF and BI. It was found that high F measure results of large g s and s c s for Twitter data set are mainly because Twitter feeds more frequently publish entries not relevant to their topics than feeds in Google data set. This characteristic of Twitter feeds results from the fact that Twitter users usually publish short entries on their private life activities and merely refer URLs to resources they are interested in without topical terms. Figure 3 shows how F measure values are changed as s c varies for the significant features (i.e., FT+FD+ET with s c = 160 and g = 40 under TF, and FT+ET with s c = 200 and g = 200 under BI). It indicates that F measure results for most g values have no remarkable difference as s c increases in case of TF, and they seldom increase beyond s c of 160 in case of BI. Finally, it can be observed from Table 6 that F measure result of FT+FD is almost same as that of FT+FD+ET in assessing relevance of Twitter feeds under TF. In contrast, it was observed in Table 8 that FT+ET achieves the best F measure performance by using the entire entries, but when FD is combined with FT+ET, a small number of entries (i.e., s c = 80) can also produce comparable performance.

15 Y. Shin and J. Park / Feature identification for topical relevance assessment in feed search engines Additional experiments We conducted some additional experiments to investigate the effects of n(f i ) determined by s t as well as the time decayed feature score on the performance of relevance assessment method. Tables 9- (a) and (b) show the results of additional experiments using FT+FD+ET+ED when g = 1 under TF in case of Google data set. In particular, we present Table 9-(c) by selecting experiment results using FT+FD+ET+ED when g = 1 from Table 4 in order to compare the additional experiment results with the results based on the non-time decayed feature scores and s c. Table 9-(a) represents experiment results based on s t where time unit is a day. We see from Tables 9-(a) and (c) that precision results in cases of determining the horizon by means of s t are improved compared to the cases of s c, but most F measure performances drop due to lower recall results than in the cases of s c. It seems that these results are caused by high variability of n(f i ) when it is determined by s t. This variability is due to the different entry publishing frequencies of feeds. Consequently, we propose to construct a feature set for feed relevance assessment by using a specific number of most recent entries rather than the recent entries selected by a specific time horizon, in order to achieve higher F measure performance. Table 9-(b) shows experiment results using the time decayed feature scores. As presented in Tables 9- (b) and (c), the time decayed feature scores do not yield better performance than the original non-time decayed feature scores. It was found that feed s topic does not more strongly depend on the topic of recent entries than that of old ones, since the feeds in Google data set tend to continually publish entries relevant to their topics for a long period, without frequently changing their topics. 6. Conclusions In this paper, we investigated the significant features for assessing relevance between feeds and a user query, using two topic-labeled feed data sets. A new feature set construction approach for feed relevance assessment was proposed based on the vector space model, considering unique characteristics of feeds. ur approach was proposed in the expectation that the use of a large number of feed data elements and entries that require high cost for computing feature scores may not produce the best performance, but that feed relevance can be effectively assessed by using only a small portion of them. There were three main methods proposed for identifying the significant features. First, we identified feed data elements for feed relevance assessment by investigating several combinations of feed data elements in feature vector construction. Four data elements, namely feed title, feed description, entry title, and entry description, were selected as candidate feed data elements. Second, we investigated the number of entries necessary for relevance assessment through constructing feature vectors of different sizes using various numbers of entries. Third, we examined how many entries need to be grouped for computing an entry level feature score. Furthermore, two additional methods were tested. The time decayed feature scores were employed in order to study time sensitivity of a feed s topic. The other we attempted to find out was the effect of different entry selection schemes that respectively choose entries for feature scoring based on the number of entries and on the publication time span. From the experimental results conducted using two topic-labeled feed data sets, we found that when the sizes of entry group for computing the entry level feature score are small, the combination of all feed data elements with a large number of entries (i.e., s c 160) tends to constitute the significant features for all data sets with respect to F measure. In contrast, the feed title alone was able to achieve the highest

Intro to the Atom Publishing Protocol. Joe Gregorio Google

Intro to the Atom Publishing Protocol. Joe Gregorio Google Intro to the Atom Publishing Protocol Joe Gregorio Google Atom Atom Syndication Feed example Feed 2003-12-13t18:30:02z

More information

RSS to ATOM. ATOM to RSS

RSS to ATOM. ATOM to RSS RSS----------------------------------------------------------------------------- 2 I. Meta model of RSS in KM3--------------------------------------------------------------- 2 II. Graphical Meta model

More information

RSS - FEED ELEMENTS. It indicates the last time the Feed was modified in a significant way. All timestamps in Atom must conform to RFC 3339.

RSS - FEED ELEMENTS. It indicates the last time the Feed was modified in a significant way. All timestamps in Atom must conform to RFC 3339. http://www.tutorialspoint.com/rss/feed.htm RSS - FEED ELEMENTS Copyright tutorialspoint.com Feed ID: It identifies the Feed using a universally unique and permanent URI. If you have a long-term, renewable

More information

Comment Extraction from Blog Posts and Its Applications to Opinion Mining

Comment Extraction from Blog Posts and Its Applications to Opinion Mining Comment Extraction from Blog Posts and Its Applications to Opinion Mining Huan-An Kao, Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University, Taipei, Taiwan

More information

Chapter 6: Information Retrieval and Web Search. An introduction

Chapter 6: Information Retrieval and Web Search. An introduction Chapter 6: Information Retrieval and Web Search An introduction Introduction n Text mining refers to data mining using text documents as data. n Most text mining tasks use Information Retrieval (IR) methods

More information

REST-based Integration Architecture for a Financial Business Service. Phillip Ghadir, innoq

REST-based Integration Architecture for a Financial Business Service. Phillip Ghadir, innoq REST-based Integration Architecture for a Financial Business Service Phillip Ghadir, innoq REST-based Integration Architecture for a Financial Business Service When we started out building a large-scale

More information

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems Anestis Gkanogiannis and Theodore Kalamboukis Department of Informatics Athens University of Economics

More information

Semantic Annotation using Horizontal and Vertical Contexts

Semantic Annotation using Horizontal and Vertical Contexts Semantic Annotation using Horizontal and Vertical Contexts Mingcai Hong, Jie Tang, and Juanzi Li Department of Computer Science & Technology, Tsinghua University, 100084. China. {hmc, tj, ljz}@keg.cs.tsinghua.edu.cn

More information

Automated Tagging for Online Q&A Forums

Automated Tagging for Online Q&A Forums 1 Automated Tagging for Online Q&A Forums Rajat Sharma, Nitin Kalra, Gautam Nagpal University of California, San Diego, La Jolla, CA 92093, USA {ras043, nikalra, gnagpal}@ucsd.edu Abstract Hashtags created

More information

Information Retrieval

Information Retrieval Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,

More information

ResPubliQA 2010

ResPubliQA 2010 SZTAKI @ ResPubliQA 2010 David Mark Nemeskey Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary (SZTAKI) Abstract. This paper summarizes the results of our first

More information

REMIT. Guidance on the implementation of web feeds for Inside Information Platforms

REMIT. Guidance on the implementation of web feeds for Inside Information Platforms REMIT Guidance on the implementation of web feeds for Inside Information Platforms Version 2.0 13 December 2018 Agency for the Cooperation of Energy Regulators Trg Republike 3 1000 Ljubljana, Slovenia

More information

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman

Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Semantic Extensions to Syntactic Analysis of Queries Ben Handy, Rohini Rajaraman Abstract We intend to show that leveraging semantic features can improve precision and recall of query results in information

More information

Publishing and Edi.ng Web Resources Atom Publishing Protocol. CS 431 Spring 2008 Cornell University Carl Lagoze 03/12/08

Publishing and Edi.ng Web Resources Atom Publishing Protocol. CS 431 Spring 2008 Cornell University Carl Lagoze 03/12/08 Publishing and Edi.ng Web Resources Atom Publishing Protocol CS 431 Spring 2008 Cornell University Carl Lagoze 03/12/08 Acknowledgments Dan Diephouse netzooid.org Elizabeth Fisher Colorado Ibm.com/developerworks

More information

Link Prediction for Social Network

Link Prediction for Social Network Link Prediction for Social Network Ning Lin Computer Science and Engineering University of California, San Diego Email: nil016@eng.ucsd.edu Abstract Friendship recommendation has become an important issue

More information

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google,

In the recent past, the World Wide Web has been witnessing an. explosive growth. All the leading web search engines, namely, Google, 1 1.1 Introduction In the recent past, the World Wide Web has been witnessing an explosive growth. All the leading web search engines, namely, Google, Yahoo, Askjeeves, etc. are vying with each other to

More information

Domain Specific Search Engine for Students

Domain Specific Search Engine for Students Domain Specific Search Engine for Students Domain Specific Search Engine for Students Wai Yuen Tang The Department of Computer Science City University of Hong Kong, Hong Kong wytang@cs.cityu.edu.hk Lam

More information

Robust PDF Table Locator

Robust PDF Table Locator Robust PDF Table Locator December 17, 2016 1 Introduction Data scientists rely on an abundance of tabular data stored in easy-to-machine-read formats like.csv files. Unfortunately, most government records

More information

A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics

A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics Helmut Berger and Dieter Merkl 2 Faculty of Information Technology, University of Technology, Sydney, NSW, Australia hberger@it.uts.edu.au

More information

A New Measure of the Cluster Hypothesis

A New Measure of the Cluster Hypothesis A New Measure of the Cluster Hypothesis Mark D. Smucker 1 and James Allan 2 1 Department of Management Sciences University of Waterloo 2 Center for Intelligent Information Retrieval Department of Computer

More information

Automated Online News Classification with Personalization

Automated Online News Classification with Personalization Automated Online News Classification with Personalization Chee-Hong Chan Aixin Sun Ee-Peng Lim Center for Advanced Information Systems, Nanyang Technological University Nanyang Avenue, Singapore, 639798

More information

Sentiment analysis under temporal shift

Sentiment analysis under temporal shift Sentiment analysis under temporal shift Jan Lukes and Anders Søgaard Dpt. of Computer Science University of Copenhagen Copenhagen, Denmark smx262@alumni.ku.dk Abstract Sentiment analysis models often rely

More information

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM CHAPTER THREE INFORMATION RETRIEVAL SYSTEM 3.1 INTRODUCTION Search engine is one of the most effective and prominent method to find information online. It has become an essential part of life for almost

More information

Digital Libraries. Patterns of Use. 12 Oct 2004 CS 5244: Patterns of Use 1

Digital Libraries. Patterns of Use. 12 Oct 2004 CS 5244: Patterns of Use 1 Digital Libraries Patterns of Use Week 10 Min-Yen Kan 12 Oct 2004 CS 5244: Patterns of Use 1 Two parts: Integrating information seeking and HCI in the context of: Digital Libraries The Web 12 Oct 2004

More information

Search Engines. Information Retrieval in Practice

Search Engines. Information Retrieval in Practice Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Web Crawler Finds and downloads web pages automatically provides the collection for searching Web is huge and constantly

More information

Multiple Classifier Fusion using k-nearest Localized Templates

Multiple Classifier Fusion using k-nearest Localized Templates Multiple Classifier Fusion using k-nearest Localized Templates Jun-Ki Min and Sung-Bae Cho Department of Computer Science, Yonsei University Biometrics Engineering Research Center 134 Shinchon-dong, Sudaemoon-ku,

More information

An Empirical Study of Lazy Multilabel Classification Algorithms

An Empirical Study of Lazy Multilabel Classification Algorithms An Empirical Study of Lazy Multilabel Classification Algorithms E. Spyromitros and G. Tsoumakas and I. Vlahavas Department of Informatics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece

More information

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl

More information

Journal of Asian Scientific Research FEATURES COMPOSITION FOR PROFICIENT AND REAL TIME RETRIEVAL IN CBIR SYSTEM. Tohid Sedghi

Journal of Asian Scientific Research FEATURES COMPOSITION FOR PROFICIENT AND REAL TIME RETRIEVAL IN CBIR SYSTEM. Tohid Sedghi Journal of Asian Scientific Research, 013, 3(1):68-74 Journal of Asian Scientific Research journal homepage: http://aessweb.com/journal-detail.php?id=5003 FEATURES COMPOSTON FOR PROFCENT AND REAL TME RETREVAL

More information

Information Retrieval Spring Web retrieval

Information Retrieval Spring Web retrieval Information Retrieval Spring 2016 Web retrieval The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically infinite due to the dynamic

More information

SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT

SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT SYSTEMS FOR NON STRUCTURED INFORMATION MANAGEMENT Prof. Dipartimento di Elettronica e Informazione Politecnico di Milano INFORMATION SEARCH AND RETRIEVAL Inf. retrieval 1 PRESENTATION SCHEMA GOALS AND

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Queries on streams

More information

SPSS INSTRUCTION CHAPTER 9

SPSS INSTRUCTION CHAPTER 9 SPSS INSTRUCTION CHAPTER 9 Chapter 9 does no more than introduce the repeated-measures ANOVA, the MANOVA, and the ANCOVA, and discriminant analysis. But, you can likely envision how complicated it can

More information

Unsupervised Feature Selection for Sparse Data

Unsupervised Feature Selection for Sparse Data Unsupervised Feature Selection for Sparse Data Artur Ferreira 1,3 Mário Figueiredo 2,3 1- Instituto Superior de Engenharia de Lisboa, Lisboa, PORTUGAL 2- Instituto Superior Técnico, Lisboa, PORTUGAL 3-

More information

Whitepaper Spain SEO Ranking Factors 2012

Whitepaper Spain SEO Ranking Factors 2012 Whitepaper Spain SEO Ranking Factors 2012 Authors: Marcus Tober, Sebastian Weber Searchmetrics GmbH Greifswalder Straße 212 10405 Berlin Phone: +49-30-3229535-0 Fax: +49-30-3229535-99 E-Mail: info@searchmetrics.com

More information

Informativeness for Adhoc IR Evaluation:

Informativeness for Adhoc IR Evaluation: Informativeness for Adhoc IR Evaluation: A measure that prevents assessing individual documents Romain Deveaud 1, Véronique Moriceau 2, Josiane Mothe 3, and Eric SanJuan 1 1 LIA, Univ. Avignon, France,

More information

Identifying Important Communications

Identifying Important Communications Identifying Important Communications Aaron Jaffey ajaffey@stanford.edu Akifumi Kobashi akobashi@stanford.edu Abstract As we move towards a society increasingly dependent on electronic communication, our

More information

Search Engines. Information Retrieval in Practice

Search Engines. Information Retrieval in Practice Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Classification and Clustering Classification and clustering are classical pattern recognition / machine learning problems

More information

CS6200 Information Retreival. Crawling. June 10, 2015

CS6200 Information Retreival. Crawling. June 10, 2015 CS6200 Information Retreival Crawling Crawling June 10, 2015 Crawling is one of the most important tasks of a search engine. The breadth, depth, and freshness of the search results depend crucially on

More information

Application of Support Vector Machine Algorithm in Spam Filtering

Application of Support Vector Machine Algorithm in  Spam Filtering Application of Support Vector Machine Algorithm in E-Mail Spam Filtering Julia Bluszcz, Daria Fitisova, Alexander Hamann, Alexey Trifonov, Advisor: Patrick Jähnichen Abstract The problem of spam classification

More information

On the automatic classification of app reviews

On the automatic classification of app reviews Requirements Eng (2016) 21:311 331 DOI 10.1007/s00766-016-0251-9 RE 2015 On the automatic classification of app reviews Walid Maalej 1 Zijad Kurtanović 1 Hadeer Nabil 2 Christoph Stanik 1 Walid: please

More information

Overview of the INEX 2009 Link the Wiki Track

Overview of the INEX 2009 Link the Wiki Track Overview of the INEX 2009 Link the Wiki Track Wei Che (Darren) Huang 1, Shlomo Geva 2 and Andrew Trotman 3 Faculty of Science and Technology, Queensland University of Technology, Brisbane, Australia 1,

More information

Extraction of Semantic Text Portion Related to Anchor Link

Extraction of Semantic Text Portion Related to Anchor Link 1834 IEICE TRANS. INF. & SYST., VOL.E89 D, NO.6 JUNE 2006 PAPER Special Section on Human Communication II Extraction of Semantic Text Portion Related to Anchor Link Bui Quang HUNG a), Masanori OTSUBO,

More information

Information Discovery, Extraction and Integration for the Hidden Web

Information Discovery, Extraction and Integration for the Hidden Web Information Discovery, Extraction and Integration for the Hidden Web Jiying Wang Department of Computer Science University of Science and Technology Clear Water Bay, Kowloon Hong Kong cswangjy@cs.ust.hk

More information

Whitepaper Italy SEO Ranking Factors 2012

Whitepaper Italy SEO Ranking Factors 2012 Whitepaper Italy SEO Ranking Factors 2012 Authors: Marcus Tober, Sebastian Weber Searchmetrics GmbH Greifswalder Straße 212 10405 Berlin Phone: +49-30-3229535-0 Fax: +49-30-3229535-99 E-Mail: info@searchmetrics.com

More information

Songtao Hei. Thesis under the direction of Professor Christopher Jules White

Songtao Hei. Thesis under the direction of Professor Christopher Jules White COMPUTER SCIENCE A Decision Tree Based Approach to Filter Candidates for Software Engineering Jobs Using GitHub Data Songtao Hei Thesis under the direction of Professor Christopher Jules White A challenge

More information

CS229 Final Project: Predicting Expected Response Times

CS229 Final Project: Predicting Expected  Response Times CS229 Final Project: Predicting Expected Email Response Times Laura Cruz-Albrecht (lcruzalb), Kevin Khieu (kkhieu) December 15, 2017 1 Introduction Each day, countless emails are sent out, yet the time

More information

CMPSCI 646, Information Retrieval (Fall 2003)

CMPSCI 646, Information Retrieval (Fall 2003) CMPSCI 646, Information Retrieval (Fall 2003) Midterm exam solutions Problem CO (compression) 1. The problem of text classification can be described as follows. Given a set of classes, C = {C i }, where

More information

Letter Pair Similarity Classification and URL Ranking Based on Feedback Approach

Letter Pair Similarity Classification and URL Ranking Based on Feedback Approach Letter Pair Similarity Classification and URL Ranking Based on Feedback Approach P.T.Shijili 1 P.G Student, Department of CSE, Dr.Nallini Institute of Engineering & Technology, Dharapuram, Tamilnadu, India

More information

Improving Synoptic Querying for Source Retrieval

Improving Synoptic Querying for Source Retrieval Improving Synoptic Querying for Source Retrieval Notebook for PAN at CLEF 2015 Šimon Suchomel and Michal Brandejs Faculty of Informatics, Masaryk University {suchomel,brandejs}@fi.muni.cz Abstract Source

More information

Chapter 2. Architecture of a Search Engine

Chapter 2. Architecture of a Search Engine Chapter 2 Architecture of a Search Engine Search Engine Architecture A software architecture consists of software components, the interfaces provided by those components and the relationships between them

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

Evaluating the suitability of Web 2.0 technologies for online atlas access interfaces

Evaluating the suitability of Web 2.0 technologies for online atlas access interfaces Evaluating the suitability of Web 2.0 technologies for online atlas access interfaces Ender ÖZERDEM, Georg GARTNER, Felix ORTAG Department of Geoinformation and Cartography, Vienna University of Technology

More information

Design and Implementation of Search Engine Using Vector Space Model for Personalized Search

Design and Implementation of Search Engine Using Vector Space Model for Personalized Search Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 1, January 2014,

More information

Automatically Generating Queries for Prior Art Search

Automatically Generating Queries for Prior Art Search Automatically Generating Queries for Prior Art Search Erik Graf, Leif Azzopardi, Keith van Rijsbergen University of Glasgow {graf,leif,keith}@dcs.gla.ac.uk Abstract This report outlines our participation

More information

ALL content should be created with findability in mind.

ALL content should be created with findability in mind. Agenda Harnessing the Power of RSS Harnessing the Power of RSS Amy Greer & Roger Gilliam Closed Loop Marketing Web Builder 2.0 Las Vegas December 4, 2007 Build the Foundation What is RSS? RSS Feed Elements

More information

Whitepaper US SEO Ranking Factors 2012

Whitepaper US SEO Ranking Factors 2012 Whitepaper US SEO Ranking Factors 2012 Authors: Marcus Tober, Sebastian Weber Searchmetrics Inc. 1115 Broadway 12th Floor, Room 1213 New York, NY 10010 Phone: 1 866-411-9494 E-Mail: sales-us@searchmetrics.com

More information

Keyword Extraction by KNN considering Similarity among Features

Keyword Extraction by KNN considering Similarity among Features 64 Int'l Conf. on Advances in Big Data Analytics ABDA'15 Keyword Extraction by KNN considering Similarity among Features Taeho Jo Department of Computer and Information Engineering, Inha University, Incheon,

More information

Modelling Structures in Data Mining Techniques

Modelling Structures in Data Mining Techniques Modelling Structures in Data Mining Techniques Ananth Y N 1, Narahari.N.S 2 Associate Professor, Dept of Computer Science, School of Graduate Studies- JainUniversity- J.C.Road, Bangalore, INDIA 1 Professor

More information

A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH

A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH A RECOMMENDER SYSTEM FOR SOCIAL BOOK SEARCH A thesis Submitted to the faculty of the graduate school of the University of Minnesota by Vamshi Krishna Thotempudi In partial fulfillment of the requirements

More information

Maintaining Mutual Consistency for Cached Web Objects

Maintaining Mutual Consistency for Cached Web Objects Maintaining Mutual Consistency for Cached Web Objects Bhuvan Urgaonkar, Anoop George Ninan, Mohammad Salimullah Raunak Prashant Shenoy and Krithi Ramamritham Department of Computer Science, University

More information

Extracting Visual Snippets for Query Suggestion in Collaborative Web Search

Extracting Visual Snippets for Query Suggestion in Collaborative Web Search Extracting Visual Snippets for Query Suggestion in Collaborative Web Search Hannarin Kruajirayu, Teerapong Leelanupab Knowledge Management and Knowledge Engineering Laboratory Faculty of Information Technology

More information

HTML 5 and CSS 3, Illustrated Complete. Unit M: Integrating Social Media Tools

HTML 5 and CSS 3, Illustrated Complete. Unit M: Integrating Social Media Tools HTML 5 and CSS 3, Illustrated Complete Unit M: Integrating Social Media Tools Objectives Understand social networking Integrate a Facebook account with a Web site Integrate a Twitter account feed Add a

More information

Automatic Query Type Identification Based on Click Through Information

Automatic Query Type Identification Based on Click Through Information Automatic Query Type Identification Based on Click Through Information Yiqun Liu 1,MinZhang 1,LiyunRu 2, and Shaoping Ma 1 1 State Key Lab of Intelligent Tech. & Sys., Tsinghua University, Beijing, China

More information

Internet-Draft September 1, 2005 Expires: March 5, Feed History: Enabling Incremental Syndication draft-nottingham-atompub-feed-history-04

Internet-Draft September 1, 2005 Expires: March 5, Feed History: Enabling Incremental Syndication draft-nottingham-atompub-feed-history-04 Network Working Group M. Nottingham Internet-Draft September 1, 2005 Expires: March 5, 2006 Status of this Memo Feed History: Enabling Incremental Syndication draft-nottingham-atompub-feed-history-04 By

More information

Self-tuning ongoing terminology extraction retrained on terminology validation decisions

Self-tuning ongoing terminology extraction retrained on terminology validation decisions Self-tuning ongoing terminology extraction retrained on terminology validation decisions Alfredo Maldonado and David Lewis ADAPT Centre, School of Computer Science and Statistics, Trinity College Dublin

More information

A Content Vector Model for Text Classification

A Content Vector Model for Text Classification A Content Vector Model for Text Classification Eric Jiang Abstract As a popular rank-reduced vector space approach, Latent Semantic Indexing (LSI) has been used in information retrieval and other applications.

More information

IMAGE RETRIEVAL SYSTEM: BASED ON USER REQUIREMENT AND INFERRING ANALYSIS TROUGH FEEDBACK

IMAGE RETRIEVAL SYSTEM: BASED ON USER REQUIREMENT AND INFERRING ANALYSIS TROUGH FEEDBACK IMAGE RETRIEVAL SYSTEM: BASED ON USER REQUIREMENT AND INFERRING ANALYSIS TROUGH FEEDBACK 1 Mount Steffi Varish.C, 2 Guru Rama SenthilVel Abstract - Image Mining is a recent trended approach enveloped in

More information

TISA Methodology Threat Intelligence Scoring and Analysis

TISA Methodology Threat Intelligence Scoring and Analysis TISA Methodology Threat Intelligence Scoring and Analysis Contents Introduction 2 Defining the Problem 2 The Use of Machine Learning for Intelligence Analysis 3 TISA Text Analysis and Feature Extraction

More information

DATA MODELS FOR SEMISTRUCTURED DATA

DATA MODELS FOR SEMISTRUCTURED DATA Chapter 2 DATA MODELS FOR SEMISTRUCTURED DATA Traditionally, real world semantics are captured in a data model, and mapped to the database schema. The real world semantics are modeled as constraints and

More information

INFORMATION RETRIEVAL SYSTEM: CONCEPT AND SCOPE

INFORMATION RETRIEVAL SYSTEM: CONCEPT AND SCOPE 15 : CONCEPT AND SCOPE 15.1 INTRODUCTION Information is communicated or received knowledge concerning a particular fact or circumstance. Retrieval refers to searching through stored information to find

More information

Rank Measures for Ordering

Rank Measures for Ordering Rank Measures for Ordering Jin Huang and Charles X. Ling Department of Computer Science The University of Western Ontario London, Ontario, Canada N6A 5B7 email: fjhuang33, clingg@csd.uwo.ca Abstract. Many

More information

Random projection for non-gaussian mixture models

Random projection for non-gaussian mixture models Random projection for non-gaussian mixture models Győző Gidófalvi Department of Computer Science and Engineering University of California, San Diego La Jolla, CA 92037 gyozo@cs.ucsd.edu Abstract Recently,

More information

Ranking Algorithms For Digital Forensic String Search Hits

Ranking Algorithms For Digital Forensic String Search Hits DIGITAL FORENSIC RESEARCH CONFERENCE Ranking Algorithms For Digital Forensic String Search Hits By Nicole Beebe and Lishu Liu Presented At The Digital Forensic Research Conference DFRWS 2014 USA Denver,

More information

Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track

Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track Alejandro Bellogín 1,2, Thaer Samar 1, Arjen P. de Vries 1, and Alan Said 1 1 Centrum Wiskunde

More information

An Information Retrieval Approach for Source Code Plagiarism Detection

An Information Retrieval Approach for Source Code Plagiarism Detection -2014: An Information Retrieval Approach for Source Code Plagiarism Detection Debasis Ganguly, Gareth J. F. Jones CNGL: Centre for Global Intelligent Content School of Computing, Dublin City University

More information

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK Qing Guo 1, 2 1 Nanyang Technological University, Singapore 2 SAP Innovation Center Network,Singapore ABSTRACT Literature review is part of scientific

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

Evaluation Methods for Focused Crawling

Evaluation Methods for Focused Crawling Evaluation Methods for Focused Crawling Andrea Passerini, Paolo Frasconi, and Giovanni Soda DSI, University of Florence, ITALY {passerini,paolo,giovanni}@dsi.ing.unifi.it Abstract. The exponential growth

More information

A PRELIMINARY STUDY ON THE EXTRACTION OF SOCIO-TOPICAL WEB KEYWORDS

A PRELIMINARY STUDY ON THE EXTRACTION OF SOCIO-TOPICAL WEB KEYWORDS A PRELIMINARY STUDY ON THE EXTRACTION OF SOCIO-TOPICAL WEB KEYWORDS KULWADEE SOMBOONVIWAT Graduate School of Information Science and Technology, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-0033,

More information

Exploiting Internal and External Semantics for the Clustering of Short Texts Using World Knowledge

Exploiting Internal and External Semantics for the Clustering of Short Texts Using World Knowledge Exploiting Internal and External Semantics for the Using World Knowledge, 1,2 Nan Sun, 1 Chao Zhang, 1 Tat-Seng Chua 1 1 School of Computing National University of Singapore 2 School of Computer Science

More information

Using PageRank in Feature Selection

Using PageRank in Feature Selection Using PageRank in Feature Selection Dino Ienco, Rosa Meo, and Marco Botta Dipartimento di Informatica, Università di Torino, Italy fienco,meo,bottag@di.unito.it Abstract. Feature selection is an important

More information

Northeastern University in TREC 2009 Million Query Track

Northeastern University in TREC 2009 Million Query Track Northeastern University in TREC 2009 Million Query Track Evangelos Kanoulas, Keshi Dai, Virgil Pavlu, Stefan Savev, Javed Aslam Information Studies Department, University of Sheffield, Sheffield, UK College

More information

NUSIS at TREC 2011 Microblog Track: Refining Query Results with Hashtags

NUSIS at TREC 2011 Microblog Track: Refining Query Results with Hashtags NUSIS at TREC 2011 Microblog Track: Refining Query Results with Hashtags Hadi Amiri 1,, Yang Bao 2,, Anqi Cui 3,,*, Anindya Datta 2,, Fang Fang 2,, Xiaoying Xu 2, 1 Department of Computer Science, School

More information

CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets

CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets Arjumand Younus 1,2, Colm O Riordan 1, and Gabriella Pasi 2 1 Computational Intelligence Research Group,

More information

Competitor Identification Based On Organic Impressions

Competitor Identification Based On Organic Impressions Technical Disclosure Commons Defensive Publications Series September 22, 2017 Competitor Identification Based On Organic Impressions Yunting Sun Aiyou Chen Abinaya Selvam Follow this and additional works

More information

Information Retrieval May 15. Web retrieval

Information Retrieval May 15. Web retrieval Information Retrieval May 15 Web retrieval What s so special about the Web? The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically

More information

Supervised Models for Multimodal Image Retrieval based on Visual, Semantic and Geographic Information

Supervised Models for Multimodal Image Retrieval based on Visual, Semantic and Geographic Information Supervised Models for Multimodal Image Retrieval based on Visual, Semantic and Geographic Information Duc-Tien Dang-Nguyen, Giulia Boato, Alessandro Moschitti, Francesco G.B. De Natale Department of Information

More information

From Passages into Elements in XML Retrieval

From Passages into Elements in XML Retrieval From Passages into Elements in XML Retrieval Kelly Y. Itakura David R. Cheriton School of Computer Science, University of Waterloo 200 Univ. Ave. W. Waterloo, ON, Canada yitakura@cs.uwaterloo.ca Charles

More information

Implementation Techniques

Implementation Techniques V Implementation Techniques 34 Efficient Evaluation of the Valid-Time Natural Join 35 Efficient Differential Timeslice Computation 36 R-Tree Based Indexing of Now-Relative Bitemporal Data 37 Light-Weight

More information

Predictive Indexing for Fast Search

Predictive Indexing for Fast Search Predictive Indexing for Fast Search Sharad Goel, John Langford and Alex Strehl Yahoo! Research, New York Modern Massive Data Sets (MMDS) June 25, 2008 Goel, Langford & Strehl (Yahoo! Research) Predictive

More information

Classification with Class Overlapping: A Systematic Study

Classification with Class Overlapping: A Systematic Study Classification with Class Overlapping: A Systematic Study Haitao Xiong 1 Junjie Wu 1 Lu Liu 1 1 School of Economics and Management, Beihang University, Beijing 100191, China Abstract Class overlapping has

More information

Feature Selecting Model in Automatic Text Categorization of Chinese Financial Industrial News

Feature Selecting Model in Automatic Text Categorization of Chinese Financial Industrial News Selecting Model in Automatic Text Categorization of Chinese Industrial 1) HUEY-MING LEE 1 ), PIN-JEN CHEN 1 ), TSUNG-YEN LEE 2) Department of Information Management, Chinese Culture University 55, Hwa-Kung

More information

Video annotation based on adaptive annular spatial partition scheme

Video annotation based on adaptive annular spatial partition scheme Video annotation based on adaptive annular spatial partition scheme Guiguang Ding a), Lu Zhang, and Xiaoxu Li Key Laboratory for Information System Security, Ministry of Education, Tsinghua National Laboratory

More information

Developing Focused Crawlers for Genre Specific Search Engines

Developing Focused Crawlers for Genre Specific Search Engines Developing Focused Crawlers for Genre Specific Search Engines Nikhil Priyatam Thesis Advisor: Prof. Vasudeva Varma IIIT Hyderabad July 7, 2014 Examples of Genre Specific Search Engines MedlinePlus Naukri.com

More information

Predicting connection quality in peer-to-peer real-time video streaming systems

Predicting connection quality in peer-to-peer real-time video streaming systems Predicting connection quality in peer-to-peer real-time video streaming systems Alex Giladi Jeonghun Noh Information Systems Laboratory, Department of Electrical Engineering Stanford University, Stanford,

More information

Proceedings of NTCIR-9 Workshop Meeting, December 6-9, 2011, Tokyo, Japan

Proceedings of NTCIR-9 Workshop Meeting, December 6-9, 2011, Tokyo, Japan Read Article Management in Document Search Process for NTCIR-9 VisEx Task Yasufumi Takama Tokyo Metropolitan University 6-6 Asahigaoka, Hino Tokyo 191-0065 ytakama@sd.tmu.ac.jp Shunichi Hattori Tokyo Metropolitan

More information

SA-IFIM: Incrementally Mining Frequent Itemsets in Update Distorted Databases

SA-IFIM: Incrementally Mining Frequent Itemsets in Update Distorted Databases SA-IFIM: Incrementally Mining Frequent Itemsets in Update Distorted Databases Jinlong Wang, Congfu Xu, Hongwei Dan, and Yunhe Pan Institute of Artificial Intelligence, Zhejiang University Hangzhou, 310027,

More information

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University

CS473: Course Review CS-473. Luo Si Department of Computer Science Purdue University CS473: CS-473 Course Review Luo Si Department of Computer Science Purdue University Basic Concepts of IR: Outline Basic Concepts of Information Retrieval: Task definition of Ad-hoc IR Terminologies and

More information

Automatic Domain Partitioning for Multi-Domain Learning

Automatic Domain Partitioning for Multi-Domain Learning Automatic Domain Partitioning for Multi-Domain Learning Di Wang diwang@cs.cmu.edu Chenyan Xiong cx@cs.cmu.edu William Yang Wang ww@cmu.edu Abstract Multi-Domain learning (MDL) assumes that the domain labels

More information