ijade Reporter An Intelligent Multi-agent Based Context Aware News Reporting System

Eddie C.L. Chan and Raymond S.T. Lee
The Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong

Abstract. This paper presents ijade Reporter, an intelligent context-aware news reporting system. The system focuses on how context mining techniques can be applied to news reporting under a multi-agent architecture, and on how news content can be categorized with an information retrieval algorithm. The paper also investigates how to improve the similarity measurement between documents by using an ontology built on the WordNet graph structure of words. For web querying, a common information retrieval algorithm, Term Frequency with Inverse Document Frequency (TFIDF), is used to cluster news content. The proposed system provides simple, fast and efficient querying of the WWW, and it uses multi-agent technology to increase its scalability and efficiency. Using the TFIDF algorithm and multi-agent based techniques, an online news reporter that is updated from popular websites such as BBC and CNN is developed.

1 Introduction
Retrieving, categorizing and reporting useful news from the web is one of the most challenging problems in machine learning. The common interest among researchers working in diverse fields is motivated by our remarkable innate ability to study and to report news in daily life. The news retrieval offered by current search engines such as Google or Yahoo! lacks a logical categorization, which makes the results difficult to read. In this paper, an intelligent multi-agent based context-aware news reporting system called ijade Reporter is presented. For system implementation, ijade [8] (intelligent Java Agent Development Environment) is adopted to provide an intelligent agent-based platform for implementing various AI functionalities.
R. Khosla et al. (Eds.): KES 2005, LNAI 3681, Springer-Verlag Berlin Heidelberg 2005

2 Web Context Mining (WCM): An Overview
In web content mining, popular search engines (such as Lycos, WebCrawler, Infoseek, and Alta Vista) provide some basic web searching functionality. However, they fail to provide concrete and structural information [10]. In recent years, interest has focused on how to provide a higher-level (semantic-level) organization for semi-structured or even unstructured information on the Web using AI-based Web mining techniques.

Agent-based systems such as Harvest [2], FAQ-Finder [4], Information Manifold [11], OCCAM [9], and Parasite [12] rely either on pre-specified domain-specific information or on hard-coded information sources to retrieve and interpret documents. For instance, the Harvest system [2] relies on semi-structured documents to improve its ability to extract information. Although it knows how to find author and title information in LaTeX documents and how to strip position information from PostScript files, it fails to discover new documents or to learn new document structures. Similarly, FAQ-Finder [6] relies on a priori knowledge to extract answers to frequently asked questions (FAQs) from FAQ files available on the web.

A web page ontology [3] can be defined in different ways depending on the objective of the ontology. Most web sources carry their own semantic meaning. Techniques for ontology generation, ontology mediation, ontology population and reasoning from the Semantic Web have all been major areas of focus. Most web documents are organized in a content hierarchy, with more general nodes placed closer to the root of the hierarchy. Each node is labeled with a set of keywords describing the content of the documents placed in that node. Each document is described by a one-sentence summary that includes a hyperlink pointing to the actual Web document located somewhere on the Web.

3 ijade Reporter: A System Overview
The ijade Reporter proposed in this paper consists of four types of ijade agents (Fig. 1):
1) ijade Search Agent
2) ijade Categorize Agent
3) ijade Update Agent
4) ijade Report Agent

[Figure: the BBC online search engine, the WordNet database, the ijade Search Agent, the ijade Categorize Agent, the ijade Update Agent, the database, and the Report agent invoked by a servlet on user request]
Fig. 1. System Overview of ijade Reporter

3.1 ijade Search Agent
This mobile ijade agent searches for news on popular news websites such as BBC [1] and CNN [3]. It connects to several popular news search engines, combines their results into search lists, and integrates the search engines by using the WordNet [14] dictionary to reconstruct and interpret the news query.

3.2 ijade Categorize Agent
This stationary ijade agent categorizes and clusters news into different regions. Categorization is based on the similarity between web documents, calculated with the TFIDF (Term Frequency with Inverse Document Frequency) method. TFIDF is a simple but powerful algorithm that lets a machine learning system capture the semantics of a document from the frequencies of the words it contains. The Vector Space Model (VSM) is used for document representation.

3.2.1 Term Frequency with Inverse Document Frequency (TFIDF) Algorithm
TFIDF [13] is an information retrieval algorithm that assigns a numerical value to the semantic relationship between words and documents. Although simple, it is powerful enough to express the abstract idea of semantic meaning. The VSM is adopted to represent web documents, and the documents together constitute the whole vector space. TFIDF is used as the weight of a term in a document. If a term $t_i$ occurs in document $d$, its weight is

$$w_{di} = tf_{di} \times \log(N / idf_{di}) \quad (1)$$

where $t_i$ is a word (or term) in the document collection, $w_{di}$ is the weight of $t_i$ in $d$, $tf_{di}$ is the term frequency of $t_i$ (the number of times it occurs in $d$), $N$ is the total number of documents in the collection, and $idf_{di}$ is the number of documents in which $t_i$ appears.

3.2.2 Similarity Between Two Documents
Each document $d$ is represented by a vector

$$V_d = (t_1, w_{d1}; \ldots; t_i, w_{di}; \ldots; t_n, w_{dn})$$

where $t_i$ is a word (or term) in the document collection and $w_{di}$ is the TFIDF value of $t_i$ in $d$.
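The weighting of Eq. (1) and the per-document vector defined above can be sketched in a few lines of Python. This is a minimal illustration, not the ijade implementation: the function name `tfidf_vectors`, the toy documents, the pre-tokenized input and the choice of the natural logarithm are all assumptions, since the paper states neither the log base nor the tokenization.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: weight} dict per tokenized document, following Eq. (1)."""
    n_docs = len(documents)  # N: total number of documents
    # Number of documents in which each term appears (idf_di in the paper's notation).
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        term_freq = Counter(doc)  # tf_di: raw count of each term in this document
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])  # natural log is an assumption
            for term, count in term_freq.items()
        })
    return vectors


# Hypothetical toy collection of three already-tokenized news snippets.
docs = [
    ["storm", "damage", "city"],
    ["storm", "market", "market"],
    ["market", "city", "election"],
]
vectors = tfidf_vectors(docs)
```

With this weighting, a term that occurs in every document receives weight zero, because log(N / N) = 0, while a term that is frequent in one document and rare in the collection receives a large weight.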
The similarity of two documents can then be computed from their two vectors in Euclidean space:

$$\mathrm{Sim}(d_1, d_2) = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \, \lVert d_2 \rVert} \quad (2)$$

3.2.3 Modified TFIDF by Ontology
In this paper, we propose an ontology-based term frequency (otf) to construct the web ontology for news categorization and retrieval. Assuming that each word is related to other words, a relationship graph can be constructed, as shown in Fig. 3.
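The document similarity of Eq. (2) can be sketched as follows, under the assumption that Eq. (2) denotes the normalized inner product (cosine) of the two TFIDF weight vectors. The function name `cosine_similarity`, the sparse `{term: weight}` representation and the example weights are illustrative choices, not taken from the paper.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Normalized inner product of two sparse {term: weight} vectors."""
    # Only terms present in both vectors contribute to the inner product.
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty or all-zero vectors
    return dot / (norm_a * norm_b)


# Hypothetical weight vectors for three documents.
doc_a = {"storm": 1.0, "damage": 1.0}
doc_b = {"storm": 1.0, "flood": 1.0}
doc_c = {"market": 1.0, "shares": 1.0}

sim_ab = cosine_similarity(doc_a, doc_b)  # one of two terms shared
sim_ac = cosine_similarity(doc_a, doc_c)  # no shared terms
```

For non-negative weights the value lies between 0 (no shared terms) and 1 (identical direction), which makes it convenient to compare against a fixed threshold.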

Fig. 2. Euclidean distance of two vectors

Fig. 3. The word graph example of Destroy and Damage

In Fig. 3, "destroy" and "damage" are similar from the ontology point of view, as they are at the same level of the hierarchical structure and are related to each other. The similarity of two words can be measured by their distance in the tree structure. By comparing the meanings of two terms, the ontology-based term frequency (otf) can be obtained:

$$otf_1 = tf_1 \times (1 + (1/D(t_1, t_2)))\, tf_2 \quad (3)$$

where $t_1$ and $t_2$ are different terms; $otf_1$ is the ontology-based term frequency of $t_1$; $tf_1$ and $tf_2$ are the term frequencies of $t_1$ and $t_2$ respectively; and $D(t_1, t_2)$ is the depth between $t_1$ and $t_2$, which can be calculated by using WordNet.

For example, assume that the term frequencies of "destroy" and "damage" are 3 and 2 respectively and that the depth between "destroy" and "damage" is 3. The ontology-based term frequency of "destroy" will be $otf_1 = 3 \times (1 + (1/3))\, 2$, and the ontology-based term frequency of "damage" will be $otf_2 = 2 \times (1 + (1/3))\, 3$. This adjustment increases the term frequency of both terms, so that the two related terms become more significant once TFIDF is computed.

3.2.4 Clustering Technique
In this system, a hierarchical clustering technique [7] is adopted to assign each uncategorized news item to the category or region whose news items are nearest to it. In hierarchical clustering, there is a set of documents W = {w1, w2, ..., wi}, and every document wi in W is initially considered to be a cluster ci, such that C = {c1, c2, ..., ci}. Two clusters ci and cj are randomly chosen and their similarity sim(ci, cj) is calculated. They are merged if the similarity value is greater than the threshold value. This step is repeated until the termination condition is reached.

Fig. 4. Cluster Process

3.3 ijade Update Agent
This mobile ijade agent updates and collects news from popular news websites. When the user clicks the update button for a category, news from the different news websites is obtained, re-categorized and stored in the user's local storage. In addition, this agent calculates a preliminary analysis result based on the semantic relationship between words in a document. It does so by extracting HTML tags using web structure mining techniques. The metadata (keywords and title of the news) of the HTML documents is captured and used to obtain related news links. Once the list of links is available, each news item is explored further to capture related picture links and content, and the term frequency of the news is then calculated. In each update process, a region or a category is chosen for information update. Even when the client side goes offline, the agent continues to perform its job until the job is finished.

3.4 ijade Report Agent
This stationary ijade agent is responsible for news reporting. It provides a vector list of news with headers, a short introduction, and content with highlighted keywords. Graphics and sound are added to make the news reporting more attractive.
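The threshold-based merging used by the Categorize Agent's clustering technique (Section 3.2) can be sketched as below. Several choices here are assumptions made only for illustration: instead of picking two clusters at random, the sketch deterministically merges the most similar pair; cluster similarity is taken as the average pairwise similarity of the member documents; and the loop stops when no pair exceeds the threshold. The names `threshold_clustering` and `cluster_similarity`, the cosine measure and the toy vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_similarity(c1, c2, vectors):
    """Average pairwise similarity between members (average linkage, an assumption)."""
    pairs = [(i, j) for i in c1 for j in c2]
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)


def threshold_clustering(vectors, threshold):
    """Start with one cluster per document and merge while a pair exceeds the threshold."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        # Find the most similar pair of clusters (deterministic variant).
        best_sim, (a, b) = max(
            (cluster_similarity(clusters[x], clusters[y], vectors), (x, y))
            for x, y in combinations(range(len(clusters)), 2)
        )
        if best_sim <= threshold:
            break  # termination: no remaining pair is similar enough to merge
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return clusters


# Hypothetical weight vectors: two weather stories and two business stories.
vectors = [
    {"storm": 1.0, "damage": 1.0},
    {"storm": 1.0, "flood": 1.0},
    {"market": 1.0, "shares": 1.0},
    {"market": 1.0, "profit": 1.0},
]
clusters = threshold_clustering(vectors, threshold=0.3)
```

The threshold controls granularity: a higher value leaves more, smaller clusters, while a lower value merges more aggressively.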
4 Experiments
In this section, the precision rate of news categorization is tested on a news database consisting of over 5000 records. The news articles are subdivided into six categories: Business, Health, Education, Science, Technology and Entertainment. A test set containing over 100 uncategorized news items is used to evaluate the performance of the proposed model. Human judgment is used to determine whether each categorization is correct.

Fig. 5. Reporting Result

Tables 1 and 2 show that ijade Reporter achieves a precision rate of over 95% with a clustering time of around 1185 seconds. Compared with other methods, namely an FFBP neural network and TFIDF with a hierarchical clustering technique, ijade Reporter gives a better precision rate for categorizing news with a reasonable clustering time.

Table 1. Comparison of the Precision Rate

Class/category    FFBP NN (3-Layer)    TFIDF + hierarchical clustering    ijade Reporter
Business          68.40%               90.30%                             95.13%
Health            58.45%               93.37%                             96.72%
Education         73.43%               93.88%                             95.14%
Science/Nature    67.00%               94.33%                             97.44%
Technology        70.22%               93.78%                             94.38%
Entertainment     50.23%               91.67%                             95.24%
Average           64.62%               93.72%                             95.68%

Table 2. Comparison of time taken for clustering 3 different sets of total 100 news

Set        FFBP NN (3-Layer)    TFIDF + hierarchical clustering    ijade Reporter
Set 1           seconds         1114 seconds                       1175 seconds
Set 2           seconds         1123 seconds                       1186 seconds
Set 3           seconds         1176 seconds                       1195 seconds
Average    5295 seconds         1138 seconds                       1185 seconds

5 Conclusion
In this paper, ijade Reporter, an intelligent agent-based context-aware news reporting system, is proposed. Experiments show that ijade Reporter provides an effective and efficient solution for news categorization. By integrating with different popular news websites (e.g. BBC and CNN news) to collect, categorize and analyze news, it provides a convenient, semantic-based news retrieval and reporting solution.

Acknowledgment
This work was partially supported by the ijade projects B-Q569, A-PF74 and Cogito ijade project PG50 of the Hong Kong Polytechnic University.

References
1. BBC Website
2. C. M. Brown and B. B. Danzig, The harvest information discovery and access system, In Proc. 2nd International World Wide Web Conference
3. CNN Website
4. R. B. Doorenbos, O. Etzioni and D. S. Weld, A scalable comparison shopping agent for the world wide web, Technical Report TR, University of Minnesota
5. O. Etzioni, D. S. Weld, and R. B. Doorenbos, A Scalable Comparison-Shopping Agent for the World Wide Web, Univ. Washington, Dept. Comput. Sci., Seattle, Tech. Rep. TR, P17
6. J. O. Everett, D. G. Bobrow, R. Stolle, R. S. Crouch, V. Paiva, C. Condoravdi, M. Berg, L. Polanyi, Making ontologies work for resolving redundancies across documents, Communications of the ACM 45 (2): 55-60
7. J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers
8. ijade official site
9. C. Kwok and D. Weld, Planning to gather information, In Proc. 14th Nat. Conf. AI
10. H. V. Leighton and J. Srivastava, Precision among WWW Search Services (Search Engines): Alta Vista, Excite, Hotbot, Infoseek, Lycos
11. A. Y. Levy, T. Kirk and Y. Sagiv, The information manifold, AAAI Spring Symposium on Information Gathering From Heterogeneous Distributed Environments
12. E. Spertus, Parasite: Mining structural information on the web, In Proc. 6th WWW Conf.
13. C. W. Wen, H. Liu, W. X. Wen and J. Zheng, A Distributed Hierarchical Clustering System for Web Mining, WAIM2002, LNCS 2118, Springer-Verlag Berlin Heidelberg
14. WordNet
