Feature identification for topical relevance assessment in feed search engines 1


Intelligent Data Analysis 17 (2013), IOS Press

Yongwook Shin and Jonghun Park
Department of Industrial Engineering, Seoul National University, Seoul, Korea

Abstract. Feeds have become a popular way to distribute and acquire information on the web. The explosive growth of feeds demands a search engine that helps users quickly discover feeds matching their interests. The retrieval effectiveness of a feed search engine depends heavily on the relevance assessment method that determines the candidates for ranking query results. However, existing relevance assessment approaches proposed for web page retrieval may produce unsatisfactory results because feeds differ from traditional web pages. Compared to a web page, a feed is a dynamic document, since it continually generates information on some specific topics. In addition, it is a structured document that consists of several data elements such as title and description. Accordingly, a relevance assessment method for feed retrieval needs to address these unique characteristics of feeds. This paper considers the problem of identifying significant features, a feature set created from feed data elements, with the aim of improving the effectiveness of feed retrieval while at the same time reducing computational cost. We conducted extensive experiments to investigate the problem using a support vector machine on real-world data sets, and found significant features that can be employed for feed search services.

Keywords: Feed search, feed, feature identification, relevance assessment, text classification, search engine, information retrieval

1. Introduction

Recently, the number of sites publishing feeds has increased dramatically, as feeds are gaining attention as an effective way to diffuse and consume information.
A feed is an XML document published by a web site to facilitate syndication of its contents to subscribers. It provides an efficient mechanism that keeps its subscribers up to date with constantly changing information [29]. News sites and blogs are common sources of feeds, and feeds are used to deliver various types of structured information, ranging from weather data to search results [23]. More recently, social media and microblogging sites are increasingly delivering information through feeds. Two popular feed formats are RSS (Really Simple Syndication) [2] and Atom [18]. To address the challenges imposed by the recent explosion of feeds, a feed search engine that helps users effectively access and find feeds has become highly necessary. Feeds tend to be dedicated to some specific topic, as they are often published by individual sites or groups of common interest. Users subscribe to feeds matching their interests to keep themselves updated with recent information.

1 A preliminary version of this paper was presented at the SIGIR 2010 Workshop on Feature Generation and Selection for Information Retrieval.
Corresponding author: Jonghun Park, Department of Industrial Engineering, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul, Korea. E-mail: jonghun@snu.ac.kr.
(c) 2013 IOS Press and the authors. All rights reserved

Accordingly, it is desirable for feed subscribers to be able to discover feeds relevant to their queries through a feed search engine. A feed search engine takes a query from a user and generates ranking candidates, which are the feeds assessed to be relevant to the query. It then returns a ranked list of feeds by scoring the ranking candidates with a specific feed retrieval model. Therefore, the relevance assessment method that determines the ranking candidates plays an important role in enhancing feed search quality.

Under the assumption of binary relevance assessment, feeds are classified as relevant or non-relevant to a user query in order to construct the ranking candidates [28]. The feature-based model is one of the most popular and effective models for such a classification task [5], and it uses features to represent a user query and a document. Specifically, relevance is measured in terms of feature sets constructed from a user query and from a feed, after defining specific features suited to feed relevance assessment.

Unfortunately, existing relevance assessment methods for web page retrieval may not be appropriate for feed retrieval, since the available features differ. Specifically, two distinct properties of feeds make the relevance assessment problem for feed retrieval challenging compared to web page retrieval. First, a feed is a structured document composed of several data elements. A feed contains a feed title, a feed description, and multiple entries as its major data elements. In addition, each entry includes an entry title and an entry description. Given this structure, it is necessary to investigate which data elements should be selected to define a feature set for relevance assessment.
Second, a feed is a time-varying, dynamic document that continually publishes new entries over time, whereas web pages tend to be rather static. This temporal characteristic raises the problem of determining how many entries need to be considered for relevance assessment.

To the best of the authors' knowledge, very few studies have attempted to define features for feed relevance assessment by considering the unique structural and temporal characteristics of feeds mentioned above. In this paper, we attempt to identify significant features for feed relevance judgment with respect to a user query. The significant features are constructed from feed data elements, and they are the features that maximize the effectiveness of feed relevance assessment without hurting efficiency in terms of computational cost.

The rest of the paper is organized as follows. In Section 2, related work is presented, and the problem and our proposed approach are presented in Section 3. Section 4 describes the data sets for experiments and the baseline features. In Section 5, we report experimental results in order to identify the significant features. Finally, we conclude the paper in the last section.

2. Related work

Several blog ranking methods that take account of the unique properties of feeds have been studied previously. A ranking function for a blog feed retrieval model was proposed by considering the structural characteristic of a feed. Research in [8,9,16] investigated whether it is more effective for feed retrieval to view a feed as a single document composed of entries or as multiple individual documents. Similarly, the research on blog retrieval presented in [22] examined a ranking model for blog search. However, these studies did not address the construction of a feature set for feed relevance assessment. In addition, the temporal nature of feeds and structural elements such as the feed title and feed description were not considered in [8,9,22].
An enhanced ranking model for blog posts, named PTRank, was proposed to improve search quality beyond the simple keyword matching through utilizing various features available from blog feeds [10]. PTRank, however, focused on ranking the individual blog posts to retrieve topically relevant posts. It

suggested a scoring function to measure the degree of relevance between a query and blog posts, instead of identifying the significant features for feed relevance judgment. Furthermore, the temporal characteristic of feeds was not considered by PTRank.

Feature identification for blog classification and ranking functions for blog search are related to our problem, since a blog has temporal characteristics similar to those of a feed: a blog is a web site consisting of dated posts listed in reverse chronological order on a single page. Identifying novel features from a blog has been addressed for spam blog classification [12,15]. In [12], an attempt was made to find the most effective feature type among bag-of-words, bag-of-URLs, and bag-of-anchors by regarding a blog as a single document. In [15], temporal features based on the relationship among blog posts were examined to identify the publishing patterns of spam blogs. In [20], text classification experiments were performed to categorize blogs into topics by using the TF-IDF (Term Frequency-Inverse Document Frequency) measure and weighted linguistic features such as the titles of individual posts and the anchor text of incoming links, while a blog was regarded as a single document without considering its temporal characteristics. For blog genre analysis, various features related to the temporal and structural characteristics of blogs were used to classify blogs into several genres [11]. The temporal features included blog post update frequency and the age of the blog. Structural features were also created from content, images, comments, and advertisements.

In contrast to the above-mentioned blog classification problems, we attempt to identify the significant features for measuring relevance in a feed retrieval task through feed classification using two topic-labeled feed data sets.
A support vector machine is used to classify feeds by their corresponding topics in our experiments.

3. Proposed approach

A feed is an XML file that contains partial or full descriptions of web page articles, along with links to the original contents and other information such as author, title, and date. Figure 1 shows a typical example that illustrates the major elements of an Atom feed [18], in which a feed contains entries corresponding to web page articles, and the entries are arranged in reverse chronological order. Specifically, the schema of Atom 1.0 contains two core elements, namely <feed> and <entry>. <feed> is the root element of an Atom document; it declares a feed with appropriate namespaces and contains the feed title and description. <feed> defines metadata of the information source that provides the feed, and should have at least one <entry> element. In addition, the <feed> element has mandatory subelements, <title> and <subtitle>. The <entry> element specifies metadata of a web page article, and it includes <title>, <summary>, and <updated> elements. While <title> contains the title of the article, the <summary> element holds a short summary or the full body of the article. <updated> gives the publication time of a particular <entry>.

Among the various feed elements, we consider five elements in identifying the significant features for feed relevance assessment. These selected elements are summarized in Table 1. We refer to the <title> element in <feed> as a feed title, and the <subtitle> element in <feed> as a feed description. We also refer to the <title> element of <entry> as an entry title, the <summary> element of <entry> as an entry description, and the <updated> element of <entry> as an entry publication time.

This study concentrates on the identification of the significant features for a feature-based relevance assessment method in feed retrieval.
The significant features are defined as a feature set created from data elements available from a feed, which can maximize effectiveness of relevance assessment. Specifically,

we use a vector space model for representing the features.

Table 1
Elements of Atom 1.0 feed considered in the proposed approach

  Element                       Definition
  Feed element    <title>       Title of a feed
                  <subtitle>    Description of a feed
  Entry element   <title>       Title of an entry
                  <summary>     Description of an entry. Contains the content of a web page article
                  <updated>     Time and date when an entry was published

    <?xml version="1.0" encoding="utf-8"?>
    <feed xmlns="...">
      <title>Example Feed</title>
      <subtitle>A subtitle.</subtitle>
      <link href="..." rel="self" />
      <link href="..." />
      <id>urn:uuid:60a76c80-d399-11d9-b91c e0af6</id>
      <updated>...T18:30:02Z</updated>
      <author>
        <name>John Doe</name>
        <email>johndoe@example.com</email>
      </author>
      <entry>
        <title>Atom-Powered Robots Run Amok</title>
        <link href="..." />
        <link rel="alternate" type="text/html" href="..." />
        <link rel="edit" href="..." />
        <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
        <updated>...T18:30:02Z</updated>
        <summary>Some text.</summary>
      </entry>
    </feed>

Fig. 1. An example Atom feed.

The notations used in our problem are as follows. Given the set of $m$ feeds, $F = \{f_1, f_2, \ldots, f_m\}$, and the set of terms, $T$, let $E_i = \{e_{i,j} \mid j = 1, \ldots, n(f_i)\}$ represent the set of entries for $f_i$, $i = 1, \ldots, m$, constructed by choosing the most recent $n(f_i)$ entries from the set of all entries published by $f_i$, where $e_{i,j}$ is more recent than $e_{i,j+1}$. We base our vector space model on a bag-of-terms representation, and denote the term bags of the feed title and feed description as $ft_i$ and $fd_i$, respectively. The term bag constructed from the title in $e_{i,j}$ is represented as $et_{i,j}$, and the term bag constructed from the description in $e_{i,j}$ is represented as $ed_{i,j}$.
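The term bags for the feed title, feed description, entry titles, and entry descriptions could be built from an Atom document along the following lines. This is only a minimal sketch: the tokenizer (`term_bag`), the function name `parse_feed`, and the sample document with its dates are illustrative assumptions, not part of the paper's method.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom 1.0 XML namespace

def term_bag(text):
    """Assumed tokenizer: lower-case, then keep runs of letters and digits."""
    return Counter(re.findall(r"[a-z0-9]+", (text or "").lower()))

def parse_feed(xml_text):
    """Return (ft, fd, entries); entries are (updated, et, ed), newest first."""
    root = ET.fromstring(xml_text)
    ft = term_bag(root.findtext(ATOM + "title"))
    fd = term_bag(root.findtext(ATOM + "subtitle"))
    entries = []
    for entry in root.findall(ATOM + "entry"):
        updated = entry.findtext(ATOM + "updated") or ""
        et = term_bag(entry.findtext(ATOM + "title"))
        ed = term_bag(entry.findtext(ATOM + "summary"))
        entries.append((updated, et, ed))
    # Timestamps in the same format and time zone sort chronologically as strings.
    entries.sort(key=lambda e: e[0], reverse=True)
    return ft, fd, entries

# Hypothetical sample feed; titles follow Fig. 1, dates are made up.
sample = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Feed</title>
  <subtitle>A subtitle.</subtitle>
  <entry>
    <title>Older post</title>
    <updated>2003-12-13T18:30:02Z</updated>
    <summary>Some text.</summary>
  </entry>
  <entry>
    <title>Atom-Powered Robots Run Amok</title>
    <updated>2003-12-14T18:30:02Z</updated>
    <summary>Some more text.</summary>
  </entry>
</feed>"""

ft, fd, entries = parse_feed(sample)
```

Sorting by publication time places $e_{i,1}$ (the most recent entry) first, matching the ordering assumed by the notation.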
We also let $gt_{i,k} = \bigcup_{j=g(k-1)+1}^{g k} et_{i,j}$ denote the term bag of aggregated entry titles from a group of entries, and $gd_{i,k} = \bigcup_{j=g(k-1)+1}^{g k} ed_{i,j}$ the term bag of aggregated entry descriptions, for feed $f_i$, $k = 1, \ldots, r_i$, where $r_i = n(f_i)/g$. Here $g$ denotes a grouping factor used to build aggregated term bags from entry titles and descriptions, and it is an integer such that $1 \le g \le n(f_i)$. For instance, if $g = 1$, then $gt_{i,1} = et_{i,1}, \ldots, gt_{i,r_i} = et_{i,r_i}$ and $gd_{i,1} = ed_{i,1}, \ldots, gd_{i,r_i} = ed_{i,r_i}$, where $r_i = n(f_i)$.

We do not consider other feature representations, such as the bag of anchors or URLs, since many feeds do not contain anchor texts and URLs in their entry descriptions but instead publish entry descriptions

as short snippets. Another reason for not considering URLs as a feature representation is to prevent a bias toward feeds publishing entries with shortened URLs, whose terms tend not to be useful for relevance judgment.

Let $Q$ denote a set of queries, and $q \in Q$ a term set representation of a query. We also use the notation $B$ for a bag of terms, and $|B|$ for the total number of term occurrences in $B$. In addition, $tf(t, B)$ represents the frequency of term $t \in T$ in term bag $B$. In the case of normalized term frequency (TF), the scoring function in terms of query $q$ and term bag $B$, $\sigma(q, B)$, is defined as:

$$\sigma(q, B) = \begin{cases} \sum_{t \in q} tf(t, B) / |B| & \text{if } q \cap B \neq \emptyset \\ 0 & \text{otherwise} \end{cases}$$

In contrast, $\sigma(q, B)$ for the case of the binary indicator is defined as:

$$\sigma(q, B) = \begin{cases} 1 & \text{if } \sum_{t \in q} tf(t, B) > 0 \\ 0 & \text{otherwise} \end{cases}$$

Given query $q$ and $E_i$, $f_i$ is modeled as the feature vector $(\sigma(q, ft_i), \sigma(q, fd_i), \sigma(q, gt_{i,1}), \ldots, \sigma(q, gt_{i,r_i}), \sigma(q, gd_{i,1}), \ldots, \sigma(q, gd_{i,r_i}))$. Here $\sigma(q, ft_i)$, $\sigma(q, fd_i)$, $\sigma(q, gt_{i,k})$, and $\sigma(q, gd_{i,k})$ are scoring functions defined in terms of the normalized TF or the binary indicator presented above, and they indicate the feature scores for feed title, feed description, entry title, and entry description, respectively. We consider normalized TF and the binary indicator among the alternatives, since they are reported to be effective in many text classification problems [7,12,14]. In the following, we refer to the normalized TF and the binary indicator as TF and BI, respectively, and call $\sigma(q, ft_i)$ and $\sigma(q, fd_i)$ feed level feature scores, and $\sigma(q, gt_{i,k})$ and $\sigma(q, gd_{i,k})$ entry level feature scores.

The significant features are constructed from feed data elements by considering various methods for aggregating and sampling entries. Among the possible feature sets, the significant features are the ones that produce the best performance on a specific performance evaluation metric.
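The grouping of entries and the two scoring functions described above might be sketched as follows. Term bags are modeled with `collections.Counter`. The function names, the treatment of a final group that has fewer than $g$ entries, and the reading of the TF condition as "at least one query term occurs in the bag" are assumptions made for illustration.

```python
from collections import Counter

def group_bags(bags, g):
    """Merge consecutive runs of g term bags (newest first) by adding counts.
    Assumption: a trailing run shorter than g forms its own group."""
    return [sum(bags[k:k + g], Counter()) for k in range(0, len(bags), g)]

def score_tf(query, bag):
    """Normalized TF: occurrences of query terms divided by the bag size |B|."""
    hits = sum(bag[t] for t in query)
    return hits / sum(bag.values()) if hits > 0 else 0.0

def score_bi(query, bag):
    """Binary indicator: 1 if any query term occurs in the bag, else 0."""
    return 1 if any(bag[t] > 0 for t in query) else 0

def feature_vector(query, ft, fd, ets, eds, g, score):
    """(sigma(q,ft), sigma(q,fd), sigma(q,gt_1..r), sigma(q,gd_1..r))."""
    gts, gds = group_bags(ets, g), group_bags(eds, g)
    return ([score(query, ft), score(query, fd)]
            + [score(query, b) for b in gts]
            + [score(query, b) for b in gds])

# Hypothetical feed with four entries, newest first.
query = {"python"}
ft = Counter(["python", "tips"])
fd = Counter(["news", "about", "programming"])
ets = [Counter(["python", "release"]), Counter(["cooking"]),
       Counter(["python", "python", "web"]), Counter(["travel"])]
eds = [Counter(["python"]), Counter(["x"]), Counter(["y"]), Counter(["z"])]

vec_tf = feature_vector(query, ft, fd, ets, eds, g=2, score=score_tf)
vec_bi = feature_vector(query, ft, fd, ets, eds, g=2, score=score_bi)
```

With four entries and $g = 2$, each vector has two feed level scores followed by two grouped entry title scores and two grouped entry description scores.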
We attempt to find the significant features by examining various alternatives for constructing the feature vector, considering the unique properties of feeds mentioned in the previous section.

First, the feed data elements for the significant features are identified by investigating several combinations of structural data elements in feature vector construction. For instance, when relevance assessment using the feature vector $(\sigma(q, ft_i), \sigma(q, fd_i))$ turns out to perform better than $(\sigma(q, fd_i))$, it can be concluded that the combination of feed title and feed description is more appropriate than the feed description alone as feed data elements for the significant features.

Second, the number of entries considered for the feature vector is identified by varying $n(f_i)$ to address the temporal characteristics of feeds. $n(f_i)$ can be determined in two ways: by the number of entries or by the time horizon of entry publication. Let $s_c$ and $s_t$ respectively denote the number of entries and the time horizon of entries considered for constructing a feature vector. We also let $p_{i,j}$ represent the publication time of entry $e_{i,j}$. When $n(f_i)$ is determined by $s_c$, $n(f_i)$ is set to $s_c$, while for the case of $s_t$, $n(f_i)$ is defined as the number of entries published by $f_i$ during the time period from $p_{i,1}$ back to $(p_{i,1} - s_t)$, where $p_{i,1}$ is the publication time of the most recent entry.

Third, to examine the effect of the number of entries considered for an entry level feature score on the performance of the relevance assessment method, we construct feature vectors by varying $g$. The performance of the relevance assessment method is expected to change with $g$, since the sparseness and size of the feature vector decrease as $g$ increases from 1 to $n(f_i)$. Note that the case of $g = 1$ corresponds to the small document model and $g = n(f_i)$ to the large document model in [8,9].
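The two ways of fixing $n(f_i)$ can be illustrated with a short sketch. The function names, the `(published, payload)` entry layout, the inclusive boundary at $p_{i,1} - s_t$, and the sample dates are all assumptions made for the example.

```python
from datetime import datetime, timedelta

def select_by_count(entries, s_c):
    """Keep the s_c most recent entries; entries are sorted newest first."""
    return entries[:s_c]

def select_by_horizon(entries, s_t):
    """Keep entries published between p_{i,1} - s_t and p_{i,1}.
    Assumption: the boundary p_{i,1} - s_t itself is included."""
    if not entries:
        return []
    newest = entries[0][0]  # p_{i,1}
    return [e for e in entries if newest - e[0] <= s_t]

# Hypothetical entries as (publication time, label), newest first.
entries = [
    (datetime(2011, 1, 10), "e1"),
    (datetime(2011, 1, 9), "e2"),
    (datetime(2011, 1, 3), "e3"),
    (datetime(2010, 12, 1), "e4"),
]

n_by_count = len(select_by_count(entries, 2))
n_by_horizon = len(select_by_horizon(entries, timedelta(days=7)))
```

For feeds that publish at very different rates, a fixed count and a fixed horizon select quite different numbers of entries, which is why the two options are treated separately.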
Finally, we investigate whether or not the topics of recent entries are more influential on the feed's topic than those of old entries. To deal with the time sensitivity of a feed's topic, we adopt an exponential

decay rule for entry level feature scoring [1,6,13]. Let $tf_d(t, gt_{i,k}, p)$ denote the time-decayed $tf(t, gt_{i,k})$ at time $p$. Given $E_i$ and $g$, $tf_d(t, gt_{i,k}, p)$ is defined as:

$$tf_d(t, gt_{i,k}, p) = \sum_{j=g(k-1)+1}^{g k} e^{-\lambda (p - p_{i,j})} \, tf(t, et_{i,j}), \quad p > p_{i,j}$$

where $\lambda$ is the time decay factor obtained from the half-life decay time $\delta$ (i.e., the time required for TF to halve its value) through the relation $e^{-\lambda \delta} = \frac{1}{2}$ [6]. We remark that $tf_d(t, gd_{i,k}, p)$ is defined in the same way.

In this paper, topical relevance between a query and a feed is assessed under a text classification framework based on binary relevance judgment, which determines whether the feed is relevant to the query or not. We employ a support vector machine (SVM) for our feed classification task, since SVM is reported to be one of the most popular and effective supervised learning methods for text classification problems such as topic classification, spam blog classification, and sentiment classification [3,7,15,19,27].

SVM treats the feed classification problem as one involving binary class labels, which we refer to as non-relevant and relevant. We are given labeled training data $\{(x_1, y_1), \ldots, (x_h, y_h)\}$, where $x_l$ is a feature vector modeling a feed and $y_l$ is the class label of $x_l$, $l = 1, \ldots, h$. Given $g$ and the set of queries $Q = \{q_1, \ldots, q_v\}$, each of which represents a specific topic, $x_l = (\sigma(q_u, ft_i), \sigma(q_u, fd_i), \sigma(q_u, gt_{i,1}), \ldots, \sigma(q_u, gt_{i,r_i}), \sigma(q_u, gd_{i,1}), \ldots, \sigma(q_u, gd_{i,r_i}))$, $u = 1, \ldots, v$. SVM builds a model from the given labeled training data, and predicts whether the feature vector of a feed falls into the non-relevant or the relevant class. Formally, two classes $y_l \in \{-1, 1\}$, where $-1$ represents non-relevant and $1$ represents relevant to a given topic, are to be predicted by the model.
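As a side illustration of the exponential decay rule defined earlier in this section, the sketch below derives $\lambda$ from a half-life and evaluates the time-decayed term frequency for one group of entries. Times are plain numbers (for example, days), and the function names, the 30-day half-life, and the sample data are assumptions chosen for the example.

```python
import math

def decay_rate(half_life):
    """Solve exp(-lambda * delta) = 1/2 for lambda."""
    return math.log(2) / half_life

def decayed_tf(term, group, p, lam):
    """Sum of exp(-lambda * (p - p_ij)) * tf(term, bag) over a group.
    group: list of (p_ij, bag) pairs, with bag a dict of term counts."""
    return sum(math.exp(-lam * (p - p_ij)) * bag.get(term, 0)
               for p_ij, bag in group)

lam = decay_rate(30.0)  # hypothetical half-life of 30 days

# Two entries in one group: published on day 100 and day 70.
group = [(100.0, {"python": 2}), (70.0, {"python": 4, "web": 1})]

# Evaluated at day 130, the entries are 30 and 60 days old, so their
# counts are weighted by 1/2 and 1/4 respectively.
value = decayed_tf("python", group, p=130.0, lam=lam)
```

Here both entries contribute 1.0 to the decayed count: the older entry mentions the term twice as often but is one half-life older.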
If the $y_l$ are not linearly separable, SVM finds the optimal weight vector $w$ by solving the following convex optimization problem:

$$\min_{w} \; \frac{\|w\|^2}{2} + C \sum_{l=1}^{h} \xi_l \quad \text{subject to} \quad y_l (w \cdot x_l + b) \ge 1 - \xi_l, \; \xi_l \ge 0, \; l = 1, 2, \ldots, h$$

where $C$ and $\xi_l$ respectively represent a cost and a slack variable, and $w$ and $b$ are parameters of the model.

To evaluate the effectiveness of relevance assessment, we use precision, recall, and F measure, which are commonly used to compare relative performance with different features [5]. Precision, $P$, is a measure of the usefulness of the feeds evaluated to be relevant, and recall, $R$, is a measure of the completeness of a relevance assessment method.

$$P = \frac{\#\text{ of Relevant Feeds Retrieved}}{\#\text{ of Retrieved Feeds}} \qquad (1)$$

$$R = \frac{\#\text{ of Relevant Feeds Retrieved}}{\#\text{ of Relevant Feeds}} \qquad (2)$$

where "Retrieved" in Eqs (1) and (2) means classified as relevant. F measure, $FM$, is defined as the harmonic mean of recall and precision:

$$FM = \frac{2 R P}{R + P}$$

$FM$ is an effectiveness measure used for evaluating text classification performance, and it provides a single metric for comparison across different experiments.
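The three measures can be computed directly from true and predicted labels. The sketch below is illustrative only: the function name is hypothetical, labels use $+1$ for relevant and $-1$ for non-relevant, and returning 0 when a denominator is zero is an assumed convention.

```python
def precision_recall_f(y_true, y_pred):
    """Precision, recall, and F measure for the relevant class (+1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    retrieved = sum(1 for p in y_pred if p == 1)  # classified as relevant
    relevant = sum(1 for t in y_true if t == 1)   # truly relevant
    precision = tp / retrieved if retrieved else 0.0
    recall = tp / relevant if relevant else 0.0
    fm = 2 * recall * precision / (recall + precision) if tp else 0.0
    return precision, recall, fm

# Hypothetical labels for eight feeds.
y_true = [1, 1, 1, -1, -1, -1, -1, 1]
y_pred = [1, 1, -1, 1, -1, -1, -1, -1]

p, r, fm = precision_recall_f(y_true, y_pred)
```

In this example two of the three retrieved feeds are relevant and two of the four relevant feeds are retrieved, so precision is 2/3, recall is 1/2, and the F measure is 4/7.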

Table 2
Data set statistics

                                      Google data set    Twitter data set
  Source                              Google Reader      Wefollow
  Number of topics
  Number of feeds                     3,050              4,354
  Average number of feeds per topic

4. Data sets for experiments and baseline features

Two topic-labeled feed data sets, collected from popular web services, are selected for our experiments. The first data set is Bundles from Google, available from Google Reader [21], Google's web-based feed reading service. In Bundles from Google, several feeds on a topic are bundled in order to provide feed recommendations to Google Reader users. We call Bundles from Google the Google data set. The Google data set is chosen for our experiments as it covers various topics and broad types of feeds from public media sites, blogs, podcasts, and social media sites. This comprehensive coverage of topics and feed source types makes the data set useful for our experiments, since narrow topic coverage and limited feed sources may produce biased experimental results.

The second data set is the Twitter user directory from Wefollow [25], which is one of the most popular Twitter [24] directory services. We call this the Twitter data set. In the Twitter data set, Twitter users are grouped by a tag that represents a specific topic. We consider a Twitter user as a Twitter feed, since every Twitter user automatically publishes his or her own Twitter feed that posts all tweets as entries. The Twitter data set is included in our experiments not only to identify the significant features of feeds publishing entries with a small number of terms, but also to examine the characteristics of social media feeds that may have different compositional patterns under a stringent limitation on the entry length. (The number of characters per entry of a Twitter feed is not allowed to exceed 140.)
The Twitter data set was built by collecting the 25 most influential Twitter users per tag. The data sets for our experiments were collected from Google, Wefollow, and Twitter during January. Table 2 presents brief statistics for our two data sets.

Baseline features are selected from the features available from a feed under the assumption that a relevance assessment method using as many feed data elements and entries as possible would be the most effective. Similar to the significant features, baseline features are constructed from feed data elements, based on the two extreme grouping factors. Given $F$, we select all of $ft_i$, $fd_i$, $et_{i,j}$, and $ed_{i,j}$, $i = 1, \ldots, m$, $j = 1, \ldots, n(f_i)$, as structural data elements for baseline features, and set $s_c = 200$ for the time span of a feed, since the largest number of entries per feed available from our data sets is 200. Also, we set $g = 1$ or $n(f_i)$ for constructing the feature vector of baseline features, in order to study whether or not the performance of the relevance assessment method under $2 \le g \le n(f_i) - 1$ is better than those of the small and large document models in [8,9].

5. Experiment results

For our experiments, we used the libsvm toolkit with a linear kernel [4]. We tried various SVM kernel functions, including higher order polynomials and radial basis functions, and

found that the linear kernel produces the best performance for our data sets. Five-fold cross-validation is used to evaluate performance.

Table 3
Best performance results of each feed data element and their combinations using Google data set: (a) best precision result, (b) best F measure result. Columns: P, R, and FM under TF and under BI. Rows: FT, FD, FT+FD, ET, ED, ET+ED, FT+ET, FT+ED, FT+ET+ED, FD+ET, FD+ED, FD+ET+ED, FT+FD+ET, FT+FD+ED, FT+FD+ET+ED.

In our experiments based on the two data sets, the topical tags manually annotated to feeds in Bundles from Google and Wefollow were used as queries. That is, when we use a specific tag as a query, only the feeds labeled with the same tag are judged to be relevant to the query, while all the other feeds are treated as not relevant to the query. In general, text classification problems are characterized by unbalanced data sets, in which one of the binary class labels accounts for a large portion of all the instances, while the other class label has only a small portion of the instances. We chose the method of under-sampling the majority class, by which we sample instances from the majority class such that the numbers of training instances in both classes are made equal [17]. While there are several recommendations in the literature on how to sample, we employed random sampling, as there is empirical evidence that this method performs as well as or better than the other existing ones [26].
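The tag-based labeling and the random under-sampling of the majority class described above could look roughly like this. The sketch assumes one tag per feed and uses hypothetical feed identifiers and function names; it is not the authors' implementation.

```python
import random

def label_feeds(feed_tags, query_tag):
    """+1 for feeds labeled with the query tag, -1 for all other feeds."""
    return {feed: (1 if tag == query_tag else -1)
            for feed, tag in feed_tags.items()}

def undersample(labels, rng):
    """Randomly drop majority-class feeds until both classes are equal in size."""
    pos = [f for f, y in labels.items() if y == 1]
    neg = [f for f, y in labels.items() if y == -1]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = rng.sample(majority, len(minority))
    return sorted(minority + kept)

# Hypothetical topic-labeled feeds (assumption: one tag per feed).
feed_tags = {"f1": "tech", "f2": "tech", "f3": "food",
             "f4": "travel", "f5": "food", "f6": "music"}

labels = label_feeds(feed_tags, "tech")
balanced = undersample(labels, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible across cross-validation runs.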

Table 4
Results for Google data set under TF. Columns: P, R, and FM for each value of s_c and g. Rows: ET, ED, FT+FD+ET, FT+FD+ED, FT+FD+ET+ED.

Table 3 summarizes the experimental results on the Google data set, which are used to investigate which feed data elements need to be employed for the significant features. In Table 3(a), the precision results for the combinations of feed data elements show the maximum precision values, along with the resulting recall and F measure, achieved under TF and BI. The maximum precision results were taken over several configurations of $s_c$ and $g$ values. In Table 3(b), the F measure results for the combinations of feed data elements show the maximum F measure values, along with the corresponding precision and recall, produced in our experiments under TF and BI. The maximum F measure results were also taken over several configurations of $s_c$ and $g$ values. Table 6 was constructed for the Twitter data set in the same way as Table 3. Bold numbers in Tables 3 and 6 represent the maximum precision or F measure under TF or BI.

Tables 4, 5, 7, and 8 show our experiment design and results using the Google data set or the Twitter data set under TF or BI when entries are included in feature vector construction and $n(f_i)$ is determined by $s_c$. In particular, Tables 4 and 5 contain representative results selected from the complete results, which include the performances of the significant features, the baseline features, and several variations of feed data element combinations. In the result tables, FT stands for feed title, FD for feed description, ET for entry title, and ED for entry description. The first column on the left side of the result tables lists feed data elements and their combinations, in which the combination of feed data elements used for the feature vector is denoted with +.
For instance, FT+FD+ET indicates that the feature vector is constructed by using feed title, feed description, and entry title, and it is represented by $(\sigma(q, ft_i), \sigma(q, fd_i), \sigma(q, gt_{i,1}), \ldots, \sigma(q, gt_{i,4}))$

without features for entry description when $s_c = 40$ and $g = 10$, where 40 entries are used in feature vector construction and 10 entries are grouped together for computing an entry level feature score. In Tables 4, 5, 7, and 8, numbers in gray boxes are F measure results of the baseline features defined earlier, and bold numbers correspond to F measure values of the significant features. As these tables show, the significant features outperform the baseline features in all experiment results. We did not include ED in the experiments using the Twitter data set, as ET and ED are identical in Twitter feeds. Accordingly, ED does not appear in the feed data element combinations in Tables 6, 7, and 8. We present a detailed analysis of the experiment results with respect to precision and F measure in the following.

Table 5
Results for Google data set under BI. Columns: P, R, and FM for each value of s_c and g. Rows: ET, ED, FT+FD+ET, FT+FD+ED, FT+FD+ET+ED.

Google data set

Effects of feed data elements

Table 3(a) shows that FT achieves the best precision under TF and BI. Also, the precision values of FD, ET, and ED are almost the same under TF, whereas FD produces the second best precision under BI. From these observations, we can conclude that FT and FD are more competitive than the other feed data elements in terms of the precision of feed relevance assessment. On the other hand, Table 3(b) tells us that FT+FD+ED produces the best F measure under TF, while FT+FD+ET+ED achieves the best F measure under BI. It is interesting to see that the use of FT or FD alone is less effective in terms of F measure than FT+ET+ED or FD+ET+ED, due to lower recall. Furthermore, Table 3(b) indicates that FD and ED respectively produce

higher F measure results than FT and ET under TF, mainly due to the fact that FD and ED respectively have more terms than FT and ET.

Table 6
Best performance results of each feed data element and their combinations using Twitter data set: (a) best precision result, (b) best F measure result. Columns: P, R, and FM under TF and under BI. Rows: FT, FD, FT+FD, ET, FT+ET, FD+ET, FT+FD+ET.

Table 7
Results for Twitter data set under TF. Columns: P, R, and FM for each value of s_c and g. Rows: ET, FT+ET, FD+ET, FT+FD+ET.

From the fact that FT and FD achieve low recall but high precision, it can be concluded that many feeds do not put their topic into FT and FD, but it is highly likely that the terms in FT and FD

represent the feed's topic. In addition, FT+FD+ET+ED does not completely outperform FT+FD+ET on F measure, while ED requires a relatively higher cost for computing feature scores than the other individual features. The differences between F measure values of FT+FD+ET+ED and those of FT+FD+ET are only 0.01 under TF and under BI, suggesting that FT+FD+ET is competitive with FT+FD+ET+ED once the cost of computing the feature score of ED is taken into account.

Table 8
Results for Twitter data set under BI. Columns: P, R, and FM for each value of s_c and g. Rows: ET, FT+ET, FD+ET, FT+FD+ET.

Effects of g and s_c

Tables 4 and 5 indicate that the significant features correspond to FT+FD+ED based on 200 entries and $g = 40$ for F measure under TF, whereas they correspond to FT+FD+ET+ED based on 200 entries and $g = 20$ under BI. Furthermore, we examine the effects of $s_c$ and $g$ on performance in Tables 4 and 5. In general, small values of $g$ yield higher precision than large ones. In contrast, large values of $s_c$ produce better precision than small ones. These observations indicate that a high-dimensional feature vector contributes to improved precision.

In Tables 4 and 5, however, the effects of $s_c$ and $g$ on F measure turn out to be different from those on precision. This difference is mainly due to recall. In contrast to the case of the baseline features, in which $g = 1$ or 200, the maximum F measure results for each considered feed data element combination were observed at $g = 40$ and 200 under TF (underlined in Table 4), while they were found at $g = 10$, 20, and 40 under BI (underlined in Table 5). Large values of $s_c$ still produce better F measure results than small ones for most $g$ values. However, Fig.
2 depicts that F measure performances of most g values hardly increase beyond s c of 120, when F measure results were produced by the significant features (i.e., FT+FD+ED with s c = 200 and g = 40 under TF, and FT+FD+ET+ED with s c = 200 and g = 20 under BI). Accordingly, it can be deduced that a large number of entries containing old entries are not necessary for constructing a feature vector, but a small

number of recent entries are enough to assess the relevance of a feed. Therefore, we can conclude that a feed relevance assessment method constructed by using FT+FD+ET and small values of s_c and g produces F measure performance comparable to that of the significant features while reducing the cost of computing feature scores.

[Fig. 2. F measure results of the significant features for Google data set: (a) TF; (b) BI. Each panel plots F measure against s_c, with one curve per g value (g = 1, 10, 20, 40, 200). (Colours are visible in the online version of the article.)]

Twitter data set

Effects of feed data elements

Distinct characteristics of Twitter feeds make the experimental results for the Twitter data set different from those for the Google data set. FT achieves the maximum precision in Table 6-(a). The precision of FT significantly outperforms that of ET under BI, whereas FT does not produce higher precision than ET under TF. In contrast, FD results in lower precision than ET. Unlike the results for the Google data set, FT and FD do not improve precision when combined with ET. A combination of all feed data elements has the maximum F measure under TF, while FT+ET produces the best F measure under BI, as shown in Table 6-(b). FD alone yields relatively high precision and recall at the same time, and it improves F measure when combined with other feed data elements such as FT and ET under TF. In particular, it is interesting to see that the F measure of FT+FD is lower only than that of FT+FD+ET under TF.
As a result, it appears that FT+FD is a satisfactory combination for relevance assessment of Twitter feeds, considering the high precision of FT+FD as well as the cost of computing feature scores related to ET.

Effects of g and s_c

As illustrated in Tables 7 and 8, the significant features are the ones constructed from FT+FD+ET by using 160 entries and g = 40 for F measure under TF, whereas they consist of FT+ET based on 200 entries and g = 200 under BI. ET and FT+ET respectively produce the maximum values for F measure

when g = 200 under BI, helped by high recall in spite of quite low precision, and they yield the best F measure performance when g = 40 under TF. In Tables 7 and 8, the maximum F measure values for each considered feed data element combination are underlined. Moreover, Tables 7 and 8 reveal that large values of s_c have higher F measure results than small ones under both TF and BI. The high F measure results for large values of g and s_c on the Twitter data set arise mainly because Twitter feeds publish entries not relevant to their topics more frequently than feeds in the Google data set. This characteristic of Twitter feeds results from the fact that Twitter users usually publish short entries on their private life activities and merely refer to URLs of resources they are interested in, without topical terms.

[Table 9. Additional experiment results for Google data set under TF (g = 1): (a) Determination of n(f_i) using s_t; (b) Time decayed feature score; (c) Non-time decayed feature score. Each panel lists P, R, and FM.]

[Fig. 3. F measure results of the significant features for Twitter data set: (a) TF; (b) BI. Each panel plots F measure against s_c, with one curve per g value. (Colours are visible in the online version of the article.)]

Figure 3 shows how F measure values change as s_c varies for the significant features (i.e., FT+FD+ET with s_c = 160 and g = 40 under TF, and FT+ET with s_c = 200 and g = 200 under BI). It indicates that F measure results for most g values show no remarkable difference as s_c increases in the case of TF, and that they seldom increase beyond s_c = 160 in the case of BI. Finally, it can be observed from Table 6 that the F measure of FT+FD is almost the same as that of FT+FD+ET in assessing the relevance of Twitter feeds under TF.
In contrast, Table 8 shows that FT+ET achieves the best F measure performance by using all of the entries, but when FD is combined with FT+ET, a small number of entries (i.e., s_c = 80) can also produce comparable performance.
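The roles of s_c and g in the discussion above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that entries are ordered newest first, that the s_c most recent entries are split into consecutive groups of g entries, and that each group contributes one feature, scored as the count of query-term occurrences (TF) or as a 0/1 indicator (BI). The function name `group_feature_scores` and the whitespace tokenization are hypothetical choices made for illustration.

```python
from typing import List, Sequence


def group_feature_scores(
    entry_texts: Sequence[str],
    query_terms: Sequence[str],
    s_c: int,
    g: int,
    scheme: str = "TF",
) -> List[float]:
    """Score the s_c most recent entries in consecutive groups of g entries.

    Assumption: entry_texts is ordered newest first. Each group yields one
    feature: the number of query-term occurrences (TF) or a 0/1 flag (BI).
    """
    if s_c <= 0 or g <= 0:
        raise ValueError("s_c and g must be positive")
    if scheme not in ("TF", "BI"):
        raise ValueError("scheme must be 'TF' or 'BI'")

    wanted = {t.lower() for t in query_terms}
    recent = list(entry_texts[:s_c])  # keep only the s_c most recent entries
    scores: List[float] = []
    for start in range(0, len(recent), g):
        # Concatenate the g entries of this group and tokenize on whitespace.
        tokens = " ".join(recent[start:start + g]).lower().split()
        hits = sum(1 for tok in tokens if tok in wanted)
        scores.append(float(hits) if scheme == "TF" else float(hits > 0))
    return scores


# Toy entries, invented for illustration only.
entries = [
    "python tips for feeds",
    "lunch photos",
    "python python release notes",
    "weekend trip",
]
print(group_feature_scores(entries, ["python"], s_c=4, g=1))  # [1.0, 0.0, 2.0, 0.0]
print(group_feature_scores(entries, ["python"], s_c=4, g=2))  # [1.0, 2.0]
print(group_feature_scores(entries, ["python"], s_c=4, g=4, scheme="BI"))  # [1.0]
```

Under these assumptions the vector has roughly s_c / g components, so a small g or a large s_c gives a higher-dimensional feature vector, while a large g pools many entries into a single score.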

Additional experiments

We conducted additional experiments to investigate how n(f_i) determined by s_t, as well as the time decayed feature score, affects the performance of the relevance assessment method. Tables 9-(a) and (b) show the results of these experiments using FT+FD+ET+ED when g = 1 under TF for the Google data set. To compare them with the results based on the non-time decayed feature scores and s_c, Table 9-(c) reproduces from Table 4 the results for FT+FD+ET+ED when g = 1.

Table 9-(a) presents results based on s_t, where the time unit is a day. We see from Tables 9-(a) and (c) that precision improves when the horizon is determined by s_t rather than by s_c, but most F measure values drop because recall is lower than with s_c. These results seem to be caused by the high variability of n(f_i) when it is determined by s_t. This variability is due to the different entry publishing frequencies of feeds. Consequently, we propose to construct a feature set for feed relevance assessment by using a specific number of the most recent entries rather than the recent entries selected by a specific time horizon, in order to achieve higher F measure performance.

Table 9-(b) shows results using the time decayed feature scores. As presented in Tables 9-(b) and (c), the time decayed feature scores do not yield better performance than the original non-time decayed feature scores. It was found that a feed's topic does not depend more strongly on the topic of recent entries than on that of old ones, since the feeds in the Google data set tend to keep publishing entries relevant to their topics for a long period, without frequently changing their topics.

6. Conclusions

In this paper, we investigated the significant features for assessing relevance between feeds and a user query, using two topic-labeled feed data sets. A new feature set construction approach for feed relevance assessment was proposed based on the vector space model, considering the unique characteristics of feeds. Our approach was proposed in the expectation that the use of a large number of feed data elements and entries, which require a high cost for computing feature scores, may not produce the best performance, and that feed relevance can be effectively assessed by using only a small portion of them.

Three main methods were proposed for identifying the significant features. First, we identified feed data elements for feed relevance assessment by investigating several combinations of feed data elements in feature vector construction. Four data elements, namely feed title, feed description, entry title, and entry description, were selected as candidate feed data elements. Second, we investigated the number of entries necessary for relevance assessment by constructing feature vectors of different sizes using various numbers of entries. Third, we examined how many entries need to be grouped for computing an entry level feature score.

Furthermore, two additional methods were tested. The time decayed feature scores were employed in order to study the time sensitivity of a feed's topic. We also examined the effect of different entry selection schemes that choose entries for feature scoring based, respectively, on the number of entries and on the publication time span.

From the experimental results on the two topic-labeled feed data sets, we found that when the size of the entry group for computing the entry level feature score is small, the combination of all feed data elements with a large number of entries (i.e., s_c ≥ 160) tends to constitute the significant features for all data sets with respect to F measure.
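The two entry selection schemes and the time decayed score summarized above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: the `Entry` tuple shape, the function names, and the reference date are hypothetical, and the exponential half-life form of `decay_weight` is an assumed decay function chosen only to show the idea of down-weighting old entries.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# Hypothetical entry shape: (publication time, entry text).
Entry = Tuple[datetime, str]


def select_by_count(entries: List[Entry], s_c: int) -> List[Entry]:
    """Keep the s_c most recently published entries."""
    return sorted(entries, key=lambda e: e[0], reverse=True)[:s_c]


def select_by_time(entries: List[Entry], s_t: int, now: datetime) -> List[Entry]:
    """Keep entries published within the last s_t days (time unit: a day)."""
    cutoff = now - timedelta(days=s_t)
    kept = [e for e in entries if e[0] >= cutoff]
    return sorted(kept, key=lambda e: e[0], reverse=True)


def decay_weight(published: datetime, now: datetime, half_life_days: float) -> float:
    """Assumed exponential decay: the weight halves every half_life_days."""
    age_days = (now - published).total_seconds() / 86400.0
    return 0.5 ** (max(age_days, 0.0) / half_life_days)


now = datetime(2013, 1, 31)  # arbitrary reference date for the toy data
busy = [(now - timedelta(days=d), f"busy {d}") for d in range(30)]  # one entry per day
quiet = [(now - timedelta(days=d), f"quiet {d}") for d in (0, 20)]  # two entries in total

# Count-based selection caps every feed at s_c entries.
print(len(select_by_count(busy, 10)), len(select_by_count(quiet, 10)))  # 10 2
# A 14-day horizon gives a very different number of entries per feed.
print(len(select_by_time(busy, 14, now)), len(select_by_time(quiet, 14, now)))  # 15 1
print(decay_weight(now - timedelta(days=7), now, half_life_days=7.0))  # 0.5
```

The toy feeds show why a time horizon makes n(f_i) vary with publishing frequency: the number of selected entries depends on how often each feed posts, whereas a count-based horizon bounds it by s_c.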
In contrast, the feed title alone was able to achieve the highest
