A Method of Hot Topic Detection in Blogs Using N-gram Model

Similar documents
Cluster Analysis of Electrical Behavior

Tsinghua University at TAC 2009: Summarizing Multi-documents by Information Distance

An Optimal Algorithm for Prufer Codes *

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task

UB at GeoCLEF Department of Geography Abstract

A CALCULATION METHOD OF DEEP WEB ENTITIES RECOGNITION

Available online at Available online at Advanced in Control Engineering and Information Science

Query Clustering Using a Hybrid Query Similarity Measure

An Image Fusion Approach Based on Segmentation Region

The Research of Support Vector Machine in Agricultural Data Classification

Network Intrusion Detection Based on PSO-SVM

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers

A Fast Visual Tracking Algorithm Based on Circle Pixels Matching

Classifier Selection Based on Data Complexity Measures *

Alignment Results of SOBOM for OAEI 2010

Load Balancing for Hex-Cell Interconnection Network

Parallelism for Nested Loops with Non-uniform and Flow Dependences

Enhancement of Infrequent Purchased Product Recommendation Using Data Mining Techniques

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems

Description of NTU Approach to NTCIR3 Multilingual Information Retrieval

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters

On-line Hot Topic Recommendation Using Tolerance Rough Set Based Topic Clustering

An Indian Journal FULL PAPER ABSTRACT KEYWORDS. Trade Science Inc.

An Improved Image Segmentation Algorithm Based on the Otsu Method

Virtual Machine Migration based on Trust Measurement of Computer Node

Improvement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration

A Novel Adaptive Descriptor Algorithm for Ternary Pattern Textures

FINDING IMPORTANT NODES IN SOCIAL NETWORKS BASED ON MODIFIED PAGERANK

BioTechnology. An Indian Journal FULL PAPER. Trade Science Inc.

Determining the Optimal Bandwidth Based on Multi-criterion Fusion

Study of Data Stream Clustering Based on Bio-inspired Model

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur

A Deflected Grid-based Algorithm for Clustering Analysis

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

Solving two-person zero-sum game by Matlab

Collaboratively Regularized Nearest Points for Set Based Recognition

A Hybrid Re-ranking Method for Entity Recognition and Linking in Search Queries

Steps for Computing the Dissimilarity, Entropy, Herfindahl-Hirschman and. Accessibility (Gravity with Competition) Indices

Using Fuzzy Logic to Enhance the Large Size Remote Sensing Images

Skew Angle Estimation and Correction of Hand Written, Textual and Large areas of Non-Textual Document Images: A Novel Approach

Simulation Based Analysis of FAST TCP using OMNET++

Concurrent Apriori Data Mining Algorithms

MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION

Combining Multiple Resources, Evidence and Criteria for Genomic Information Retrieval

A Binarization Algorithm specialized on Document Images and Photos

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS

Pictures at an Exhibition

Learning the Kernel Parameters in Kernel Minimum Distance Classifier

Domain Thesaurus Construction from Wikipedia *

Machine Learning. Topic 6: Clustering

TECHNIQUE OF FORMATION HOMOGENEOUS SAMPLE SAME OBJECTS. Muradaliyev A.Z.

The Study of Remote Sensing Image Classification Based on Support Vector Machine

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1

Document Representation and Clustering with WordNet Based Similarity Rough Set Model

Keywords - Wep page classification; bag of words model; topic model; hierarchical classification; Support Vector Machines

BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION

Arabic Text Classification Using N-Gram Frequency Statistics A Comparative Study

Face Recognition Based on SVM and 2DPCA

A fast algorithm for color image segmentation

Object-Based Techniques for Image Retrieval

KIDS Lab at ImageCLEF 2012 Personal Photo Retrieval

Application of Clustering Algorithm in Big Data Sample Set Optimization

BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET

Web Document Classification Based on Fuzzy Association

A Feature-Weighted Instance-Based Learner for Deep Web Search Interface Identification

A Clustering Algorithm Solution to the Collaborative Filtering

Professional competences training path for an e-commerce major, based on the ISM method

Audio Content Classification Method Research Based on Two-step Strategy

A PATTERN RECOGNITION APPROACH TO IMAGE SEGMENTATION

Semantic Image Retrieval Using Region Based Inverted File

Ontology Generator from Relational Database Based on Jena

Application of k-nn Classifier to Categorizing French Financial News

Related-Mode Attacks on CTR Encryption Mode

Recommended Items Rating Prediction based on RBF Neural Network Optimized by PSO Algorithm

An Approach to Real-Time Recognition of Chinese Handwritten Sentences

Classic Term Weighting Technique for Mining Web Content Outliers

Keyword-based Document Clustering

Performance Evaluation of Information Retrieval Systems

A Similarity Measure Method for Symbolization Time Series

Problem Definitions and Evaluation Criteria for Computational Expensive Optimization

Data Preprocessing Based on Partially Supervised Learning Na Liu1,2, a, Guanglai Gao1,b, Guiping Liu2,c

Clustering Algorithm Combining CPSO with K-Means Chunqin Gu 1, a, Qian Tao 2, b

A Novel Optimization Technique for Translation Retrieval in Networks Search Engines

HIGH-LEVEL SEMANTICS OF IMAGES IN WEB DOCUMENTS USING WEIGHTED TAGS AND STRENGTH MATRIX

Text Similarity Computing Based on LDA Topic Model and Word Co-occurrence

Classifying Acoustic Transient Signals Using Artificial Intelligence

A Novel Term_Class Relevance Measure for Text Categorization

Novel Pattern-based Fingerprint Recognition Technique Using 2D Wavelet Decomposition

Optimizing Document Scoring for Query Retrieval

Algorithm for Human Skin Detection Using Fuzzy Logic

DYNAMIC NETWORK OF CONCEPTS FROM WEB-PUBLICATIONS

Gender Classification using Interlaced Derivative Patterns

Cross-Language Information Retrieval

Positive Semi-definite Programming Localization in Wireless Sensor Networks

SURFACE PROFILE EVALUATION BY FRACTAL DIMENSION AND STATISTIC TOOLS USING MATLAB

A Knowledge Management System for Organizing MEDLINE Database

Analysis of Continuous Beams in General

Load-Balanced Anycast Routing

Scheduling Remote Access to Scientific Instruments in Cyberinfrastructure for Education and Research

Clustering Algorithm of Similarity Segmentation based on Point Sorting

Transcription:

84 JOURNAL OF SOFTWARE, VOL. 8, NO. 1, JANUARY 2013

A Method of Hot Topic Detection in Blogs Using N-gram Model

Xiaodong Wang
College of Computer and Information Technology, Henan Normal University, Xinxiang, China
Email: wangxiaodong.wang@yahoo.com.cn

Juan Wang
College of Computer and Information Technology, Henan Normal University, Xinxiang, China
Email: juaner.50@163.com

Abstract: Over the last few years, blogs (web logs) have gained massive popularity and have become one of the most important web social media, through which people can get and release information. Hot topic detection in blogs is most commonly used in analyzing network public opinion. A method of hot topic detection using an n-gram model and an evaluation of topic hotness is proposed in this paper. Our approach consists of three steps. First, keywords during a given time period are obtained by calculating each word's weight, and hot keywords are collected by combining keywords. Second, based on the hot keywords, hot keyword groups are extracted using the n-gram model. In the third step, hot topics are detected from the hot keyword groups. The hotness of a hot topic is evaluated by the value of the keywords' weight obtained in the second step. Evaluations on a Chinese corpus show that the proposed method is most effective when the size of n for the n-gram is five.

Index Terms: n-gram model; blog; hot keyword; hot keyword group; hot topic

I. INTRODUCTION

With the development of Web 2.0, blogs have gained massive popularity and have become one of the most influential web social media of our times. Anyone with an internet connection can conveniently publish topics. According to one study [1], over 75,000 new blogs are created per day by people all over the world, on a great variety of subjects. The huge growth of blogs provides a wealth of information waiting to be extracted, and blogs are becoming an extremely relevant resource for many kinds of studies focused on useful applications.
Accordingly, blogs offer a rich opportunity for detecting hot topics that may not be covered in traditional newswire text. Unlike news reports, blog articles express a wide range of topics, opinions, vocabulary and writing styles. The relaxed editorial requirements allow blog authors to comment freely on local, national and international issues while still expressing their personal sentiment. These forms of self-published media might also allow topic detection systems to identify developing topics before official news reports can be written.

Most previous approaches to hot topic detection are based on clustering technologies. Li and Wu [2] used an improved algorithm based on the K-means clustering method and support vector machines to group forums into various clusters. Hao and Hu [3] proposed a single-pass clustering method for topic detection oriented to BBS (Bulletin Board Systems). Dai [4] and Wang [5] proposed hierarchical clustering methods. Zheng and Fang [6] used a clustering method and aging theory to detect hot topics on BBS.

Another trend of research is to use Natural Language Processing technology. Two representative methods, Chinese Word Segmentation Technology (CWST) and Named Entity Recognition [7], are utilized. Zhu and Wu [8] used CWST to dig out hot topics based on combinations of multiple keywords. Chen [9] presented a noise-filtered model to extract outburst topics from web forums using terms and the participation of users. K. Chen and L. Luesukprasert [10] extracted hot terms by mapping their distribution over time, identified key sentences through the hot terms, and then used multidimensional sentence vectors to group the key sentences into clusters that represented hot topics. M. Platakis, D. Kotsakos and D. Gunopulos [1] proposed a hot topic detection method over a time interval based on bursty discovery. Yadong Zhou, Xiaohong Guan et al. [11] utilized statistics and correlations of popular words in network traffic content to extract popular topics on the Internet. Two or three keywords can generally express a topic [12].
Merging multiple keywords [13] can reflect the hot news events in a certain period of time. Zhou et al. [15] constructed a keyword network based on word frequency and co-occurrences to detect hot topics on professional blogs. Unfortunately, the frequencies of the terms describing the same hot topic differ. If the threshold value is not high enough, some impactful terms will be filtered out; as a result, the detected keywords will not effectively represent the hot topic.

How to evaluate the hotness of a topic is one of the important problems in hot topic detection. Jianjiang Li, Xuechun Zhang et al. [14] evaluated blog hotness based on text opinion analysis. They took the number of reviews, the comments and the publication time of the

doi:10.4304/jsw.8.1.84-91

blog topic, and the comment opinion into account. Lan You, Yongping Du et al. [16] evaluated the hotness of a topic through its popularity, quality and message distribution using a Back-Propagation Neural Network based classification algorithm. Tingting He, Guozhong Qiu et al. [17] presented a semi-automatic hot event detection approach: they first evaluated the activity of events, then filtered and sorted the events according to their activity, and finally obtained the hot events. The methods above can detect hot topics and evaluate the hotness of a topic effectively. However, blogs have particular structural features of their own, so these methods may not be completely suitable for hot topic detection in blogs.

In this paper, employing CWST and taking into account the several words in a sentence that lie to the left and right of the keywords, we propose a method to detect hot topics in blogs based on an n-gram model. First, hot keywords are extracted based on the co-occurrence information in n-gram items, which provides more information about a hot topic. Then, hot keyword groups are collected by calculating the similarity of the n-gram items containing the keywords. Finally, hot topics are detected by computing the similarity between keywords and n-gram items to decide whether a keyword group represents a hot topic. The hotness of a topic is evaluated by the value of the keywords' weight.

This paper is organized as follows: Section 2 introduces the n-gram model and gives some related concepts. Section 3 presents our approach for hot topic detection and topic hotness evaluation in blogs. Section 4 presents the experimental results on a Chinese corpus. Finally, we summarize and discuss future work.

II. N-GRAM MODEL AND RELATED CONCEPTS

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n words from a given sequence of text or speech. N-gram models are widely used in statistical natural language processing and approximate matching.
Statistical n-gram models capturing patterns of local co-occurrence of contiguous words in sentences have been used in various hybrid implementations of Natural Language Processing and Machine Translation systems [18-22]. In this paper, one contiguous sequence of n words is referred to as an n-gram item, and one n-gram item has n entries. We use the n-gram model to represent the sentences that contain a keyword. Words related to the keywords can be found in the n-gram items, so the detailed information of the topic that the keywords depict can be detected. The keywords depicting one topic can be detected by calculating the similarity of the n-gram items that contain the keywords.

A blog article consists of three parts: the title, the content and the replies. For a sentence in a blog article, we first define a stop-word list to remove the words that are irrelevant to the theme of the article. Then we do word segmentation and Part-of-Speech tagging. Finally, the sentence is represented by the n-gram model. An example of the five-gram model is given in Fig. 1. Every five-gram item has five entries, all of which share some co-occurrence information, and every sentence has several five-gram items. In the process of generating five-gram items, if the number of entries in the last five-gram item is smaller than five, we fill it with stop words. The n-grams can effectively capture the relationships between words that share co-occurrence information with the keywords.

III. HOT TOPIC DETECTION METHOD IN BLOGS

Two or three keywords can generally express a topic, but they cannot provide detailed topic information such as time, place and related people. A topic in blogs is a cluster composed of a number of blog articles sharing a theme. Each blog article has one or more keywords depicting that theme. For a blog article, in a sentence containing keywords, the words that are not far from the keywords generally provide the topic information.
So, for the sake of effectiveness, we use the n-gram model to represent sentences, set a distance window of n entries, and take into account only the n-gram items that contain keywords. The hot topic detection method in blogs is composed of three parts: the hot keyword extraction method, the hot keyword group extraction method and the hot topic detection algorithm. The general process of the method is given in Fig. 2. The first step is to collect blog articles and then do preprocessing and word segmentation. Next, we extract keywords by calculating each word's weight, represent the sentences containing keywords with the n-gram model, and discover the frequency of every k-gram (k ≤ n) by scanning the training dataset and estimating precision and recall for each. In the third step, we build the hot keyword extraction algorithm, the hot keyword group extraction algorithm and the hot topic detection algorithm using the n-grams.

Sentence: The fourth anniversary of the Wen Chuan earthquake
Word segmentation: The/ fourth/ anniversary/ of/ the/ Wen/ Chuan/ earthquake
Five-gram items: The-fourth-anniversary-of-the; fourth-anniversary-of-the-Wen; anniversary-of-the-Wen-Chuan; of-the-Wen-Chuan-earthquake

Figure 1. An example of the five-gram model
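The windowing just described, including the Figure 1 example, can be sketched in Python. This is a minimal illustration, not the paper's code; the single `pad_token` placeholder standing in for the paper's stop-word filler is an assumption.

```python
def ngram_items(words, n=5, pad_token="<stop>"):
    """Slide a window of n entries over a segmented sentence.

    The paper pads the final item with stop words when fewer than n
    entries remain; here one placeholder token stands in for them.
    """
    if len(words) < n:
        words = list(words) + [pad_token] * (n - len(words))
        return [tuple(words)]
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# The segmented example sentence from Figure 1:
words = ["The", "fourth", "anniversary", "of", "the",
         "Wen", "Chuan", "earthquake"]
items = ngram_items(words, n=5)
# Yields the four five-gram items listed in Figure 1.
```

With n = 5 and the eight-word sentence above, the window produces exactly the four items shown in Figure 1, the last one ending at "earthquake".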

Figure 2. The framework of our approach (blog article set → preprocessing → word segmentation → keyword extraction → possible n-grams → feature selection → selected n-grams → hot keyword extraction → hot keyword group extraction → hot topic detection)

A. Hot Keyword Extraction Method

Because words in the title, the text and the replies have different influences on the theme of a blog article, we set a weight for each keyword that reflects the importance and hotness of the keyword. The weight is calculated as

weight_i = α · tf_ttl_i + β · tf_txt_i + γ · tf_rly_i    (1)

where α + β + γ = 1, and tf_ttl_i, tf_txt_i and tf_rly_i are the word's frequency in the title, the text and the replies, respectively. In this paper we set α = 0.5, β = 0.3, γ = 0.2. A word whose weight exceeds the predefined threshold_hot is considered a keyword and put into a keyword list; we create one keyword list for each blog article. The data structure of a node in the keyword list is defined as follows:

typedef struct {
    String keyword;
    Float weight;
    String pos[];
}

Every node in the keyword list has three elements: keyword; weight, which represents the weight of the keyword; and pos[], whose members are the n-gram items containing the keyword. We use the n-gram model to represent the sentences that contain keywords. For simplicity, we do not consider all the n-gram items of a sentence; only the n-gram items that contain keywords are considered. The keyword, the keyword's weight and the n-gram items containing the keyword are stored in the keyword list.

After calculating the keyword list for each blog article, we get the hot keywords by merging the keyword lists. The algorithm to merge keyword lists is described as follows.

Algorithm 1: Keyword Combination Algorithm
Input: the keyword list La of blog article a, the keyword list Lb of blog article b.
Output: merged keyword list.
1) For (i = 1; i ≤ La.Length; i++)
2)   For (j = 1; j ≤ Lb.Length; j++)
3)     If (La[i].keyword == Lb[j].keyword)
4)       Merge(La[i], Lb[j]);
5)   End For
6) End For
7) If (Lb.Length > 0)
8)   Put all the remaining nodes of Lb into La and delete Lb.

In Algorithm 1, La.Length and Lb.Length are the lengths of the keyword lists of articles a and b. The steps of Merge(La[i], Lb[j]) are defined as follows:

Step 1: Update the keyword's weight: La[i].weight = La[i].weight + Lb[j].weight;
Step 2: Put all the n-gram items in Lb[j].pos[] into La[i].pos[];
Step 3: Remove the repeated n-gram items in La[i].pos[];
Step 4: Delete Lb[j] from Lb.

Suppose the number of collected blog articles is n; then there are n keyword lists. We use Algorithm 1 to merge all the keyword lists and obtain the merged keyword list of all n blog articles. If a new blog article arrives, it is combined with the merged keyword list in the same way.
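As a rough illustration, the keyword weighting of Eq. (1) and the list merging of Algorithm 1 might look like the following. The dictionary-based list and the field names (`weight`, `ngrams`) are illustrative stand-ins for the paper's struct, not its actual code.

```python
def keyword_weight(tf_ttl, tf_txt, tf_rly, alpha=0.5, beta=0.3, gamma=0.2):
    """Eq. (1): weight_i = alpha*tf_ttl + beta*tf_txt + gamma*tf_rly."""
    return alpha * tf_ttl + beta * tf_txt + gamma * tf_rly

def merge_keyword_lists(la, lb):
    """Algorithm 1: fold keyword list lb into la.

    Each list maps keyword -> {"weight": float, "ngrams": set of n-gram
    items}. Matching keywords have their weights summed and their n-gram
    items unioned (which also removes duplicates, as in the Merge steps);
    unmatched nodes of lb are appended to la.
    """
    for kw, node in lb.items():
        if kw in la:
            la[kw]["weight"] += node["weight"]
            la[kw]["ngrams"] |= node["ngrams"]
        else:
            la[kw] = node
    return la

# Two toy per-article keyword lists sharing the keyword "earthquake":
la = {"earthquake": {"weight": 0.4,
                     "ngrams": {("Wen", "Chuan", "earthquake")}}}
lb = {"earthquake": {"weight": 0.3,
                     "ngrams": {("fourth", "anniversary", "of")}},
      "Mother": {"weight": 0.2, "ngrams": set()}}
merged = merge_keyword_lists(la, lb)
```

After the merge, "earthquake" carries the summed weight 0.7 and the union of both articles' n-gram items, while the unmatched "Mother" node is carried over unchanged.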

For convenience, the value of each keyword's weight is normalized to between 0 and 1:

weight(j) = weight(j) / max_{1≤i≤l}(weight(i))    (2)

where weight(j) is the weight of the keyword in the jth node of the merged keyword list, and l is the length of the merged keyword list. We use a polling method to look over the weight of each keyword in the merged keyword list; if a keyword's weight value is greater than the predefined threshold_hot, the keyword is considered a hot keyword. All the hot keywords are stored in a hot keyword list, which has the same data structure as the keyword list. For each node in the hot keyword list, the weight value measures the hotness of its keyword; that is to say, the hotness of a hot keyword is represented by the value of its weight. We rank all the keywords according to their weight values, so the hottest keyword is arranged in the first node of the hot keyword list.

B. Hot Keyword Group Extraction Method

One hot keyword cannot depict a hot topic comprehensively, so a hot topic usually has several hot keywords, all contained in the hot keyword list. In this paper, we combine the hot keywords that describe the same hot topic to form a hot keyword group. The algorithm to combine the hot keywords describing one hot topic is as follows.

Algorithm 2: Hot Keyword Combination Algorithm
Input: hot keyword list Lh.
Output: combined hot keyword list.
1) For (i = 1; i < Lh.Length; i++)
2)   For (j = i + 1; j ≤ Lh.Length; j++)
3)     Compute sim(Lh[i].pos[], Lh[j].pos[]);
4)     If (sim(Lh[i].pos[], Lh[j].pos[]) > threshold_sm)
5)       Merge(Lh[i], Lh[j]);
6)   End For
7) End For

Here sim(Lh[i].pos[], Lh[j].pos[]) represents the similarity of Lh[i].pos[] and Lh[j].pos[]: the higher the similarity value, the more information Lh[i].pos[] and Lh[j].pos[] share, and the greater the probability that Lh[i].keyword and Lh[j].keyword depict the same hot topic.

Let the number of n-gram items in Lh[i].pos[] be m and the number of n-gram items in Lh[j].pos[] be k. The similarity sim(Lh[i].pos[], Lh[j].pos[]) is computed as follows:

1) Let th = 0;
2) For (i = 1; i ≤ m; i++)
3)   For (j = 1; j ≤ k; j++)
4)     Compute sim(pos[i], pos[j]);
5)     th = th + sim(pos[i], pos[j]);
6)   End For
7) End For

where th accumulates the values of sim(pos[i], pos[j]) and is set to zero at the beginning. Finally, th is reset with the normalization formula

th = th / (m × k)    (3)

Here sim(pos[i], pos[j]) is the similarity between the ith n-gram item in Lh[i].pos[] and the jth n-gram item in Lh[j].pos[], calculated as follows:

Step 1: Compare the n contiguous entries of pos[j] with those of pos[i]; if they match, sim_n(pos[i], pos[j]) = 1, else sim_n(pos[i], pos[j]) = 0.
Step 2: Cut the last entry of pos[j] and compare the remaining n−1 contiguous entries with the n contiguous entries of pos[i]. If the (n−1)-gram of pos[j] is part of the n-gram of pos[i], sim_{n−1}(pos[i], pos[j]) = (n−1)/n, else sim_{n−1}(pos[i], pos[j]) = 0.
Step 3: Repeat Step 2 until all the remaining contiguous entries of pos[j] have been compared. In each step, if they match, sim_p(pos[i], pos[j]) = p/n, else sim_p(pos[i], pos[j]) = 0, where p runs from n−2 down to 1.
Step 4: The similarity sim(pos[i], pos[j]) is defined as

sim(pos[i], pos[j]) = Σ_{p=1}^{n} sim_p(pos[i], pos[j])    (4)

In Algorithm 2, Merge(Lh[i], Lh[j]) is computed as follows:
(1) Put the hot keywords of Lh[j] into Lh[i];
(2) Let Lh[i].weight = Lh[i].weight + Lh[j].weight;
(3) Put the n-gram items of Lh[j].pos[] into Lh[i].pos[] and remove the repeated n-gram items in Lh[i].pos[];
(4) Delete Lh[j] from Lh.

A hot keyword group in the combined hot keyword list is the combination of several hot keywords that depict the same topic. Let L[i] be a node in the combined hot keyword list L. Then L[i].keyword contains the keywords that share information in their n-gram items, L[i].weight is the sum of the weights of all the keywords in L[i].keyword, and L[i].pos[] contains the n-gram items containing those keywords. The n-gram items in L[i].pos[] contain the context information of the keywords in the sentences that include them, so the n-gram items describe the time and place information of the hot topic in detail.
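The item similarity of Steps 1-4 / Eq. (4) and the pairwise accumulation feeding Eq. (3) can be sketched as follows. This is only an illustration; in particular, the printed normalization in Eq. (3) is terse, and dividing by the number of item pairs (m × k) is one plausible reading, not a confirmed detail of the paper.

```python
def item_similarity(a, b, n=5):
    """Eq. (4): sum sim_p over p = 1..n, where sim_p = p/n when the
    first p entries of item b occur contiguously somewhere in item a,
    and 0 otherwise (Steps 1-4 of Section III-B)."""
    total = 0.0
    for p in range(n, 0, -1):
        prefix = b[:p]
        if any(a[k:k + p] == prefix for k in range(n - p + 1)):
            total += p / n
    return total

def list_similarity(pos_i, pos_j, n=5):
    """Accumulate sim(pos[i], pos[j]) over all m*k item pairs (the loop
    in Section III-B); normalizing by m*k is an assumption."""
    th = sum(item_similarity(a, b, n) for a in pos_i for b in pos_j)
    return th / (len(pos_i) * len(pos_j))

a = ("of", "the", "Wen", "Chuan", "earthquake")
b = ("the", "Wen", "Chuan", "earthquake", "hits")
# Prefixes of b of lengths 1-4 occur contiguously in a, the full
# five-gram does not, so sim(a, b) = 4/5 + 3/5 + 2/5 + 1/5 = 2.0.
```

Note that under Eq. (4) two identical five-gram items score 1 + 4/5 + 3/5 + 2/5 + 1/5 = 3.0, i.e. the raw item score is not bounded by 1, which is why the aggregate th needs the normalization of Eq. (3).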

Due to the complexity of Chinese sentences, a detected keyword may not reflect the blog article's theme, and the keywords in L[i].keyword may not describe a hot topic. If a keyword in L[i].keyword does reflect a topic, then the keyword will appear in more than one n-gram item in L[i].pos[]. So a keyword group only represents a hot topic candidate, and we need to calculate the similarity between the keyword group and its n-gram items to decide whether the keyword group represents a hot topic.

C. Hot Topic Detection Algorithm

Let K = {k_1, k_2, ..., k_n} represent the combined hot keyword list. Suppose the number of keywords in node k_i is m, the number of n-gram items in k_i.pos[] is p, and the set of keywords is S = {s_1, s_2, s_3, ..., s_m}. The method to detect hot topics is described as follows.

Algorithm 3: Hot Topic Detection Algorithm
Input: combined hot keyword list.
Output: hot topics.
Step 1: Calculate the similarity between s_i and k_i.pos[] using

sim_s(s_i, k_i.pos[]) = p_i / p    (5)

where p_i is the number of n-gram items in k_i.pos[] containing s_i.
Step 2: Repeat Step 1 until the similarity with k_i.pos[] has been calculated for all the keywords in {s_1, s_2, s_3, ..., s_m}.
Step 3: The similarity between k_i.keyword and k_i.pos[] is defined as

sim(k_i.keyword, k_i.pos[]) = Σ_{i=1}^{m} sim_s(s_i, k_i.pos[])    (6)

Step 4: If sim(k_i.keyword, k_i.pos[]) is greater than the predefined threshold_tpc, then k_i represents a hot topic.

A detected hot topic is represented by its keyword group in the combined keyword list, and the hotness of the topic is measured by the value of the keywords' weight in the keyword list. We rank the detected hot topics according to their hotness. Suppose the number of keywords in a keyword group is m; the hotness of the topic that the keyword group represents is defined as

hotness = Σ_{i=1}^{m} weight(keyword_i)    (7)

where weight(keyword_i) is the value of the ith keyword's weight in the keyword group.

IV. EXPERIMENT ANALYSIS AND RESULTS

At present there is no public blog corpus, so we designed a crawler to get blog articles from Sina Blog and Tencent Blog, two of the most popular blog sites in China. All the data were published from May 1, 2012 to May 14, 2012; there are 7980 articles in total. All articles were preprocessed with word segmentation, Part-of-Speech tagging and unknown-word recognition. The corpus was divided into two data sets:

Training Set: published from May 1 to May 7, containing 4480 articles.
Testing Set: published from May 8 to May 14, containing 3500 articles.

The experiment contains two steps. In the first step, we use the 4480 training articles to set the size of n for the n-gram and the values of threshold_hot, threshold_top, threshold_sm and threshold_tpc. In the second step, we use the remaining 3500 articles to test and evaluate the parameters. We designed sixteen sets of experiments to finalize the parameters in the algorithms above; all the cases are shown in TABLE I. For every case we ran the method described in Section 3 to find the best parameter values, using precision and recall to evaluate performance. The results are shown in TABLE II.

TABLE I. THE VALUES OF THE PARAMETERS

Case | N | THD_hot | THD_top | THD_sm | THD_tpc
1  | 3 | 0.25 | 0.30 | 0.30 | 0.50
2  | 3 | 0.30 | 0.35 | 0.35 | 0.55
3  | 3 | 0.35 | 0.40 | 0.40 | 0.60
4  | 3 | 0.40 | 0.45 | 0.45 | 0.65
5  | 4 | 0.25 | 0.30 | 0.30 | 0.50
6  | 4 | 0.30 | 0.35 | 0.35 | 0.55
7  | 4 | 0.35 | 0.40 | 0.40 | 0.60
8  | 4 | 0.40 | 0.45 | 0.45 | 0.65
9  | 5 | 0.25 | 0.30 | 0.30 | 0.50
10 | 5 | 0.30 | 0.35 | 0.35 | 0.55
11 | 5 | 0.35 | 0.40 | 0.40 | 0.60
12 | 5 | 0.40 | 0.45 | 0.45 | 0.65
13 | 6 | 0.25 | 0.30 | 0.30 | 0.50
14 | 6 | 0.30 | 0.35 | 0.35 | 0.55
15 | 6 | 0.35 | 0.40 | 0.40 | 0.60
16 | 6 | 0.40 | 0.45 | 0.45 | 0.65

Remark: THD stands for threshold.

TABLE II. EVALUATION OF THE PARAMETERS

Case | Precision | Recall
1  | 63.4%  | 68.36%
2  | 65.68% | 67.64%
3  | 69.89% | 66.99%
4  | 70.26% | 63.26%
5  | 62.8%  | 70.42%
6  | 67.70% | 69.68%
7  | 71.8%  | 68.09%
8  | 72.24% | 65.38%
9  | 73.4%  | 78.36%
10 | 75.68% | 77.64%
11 | 79.89% | 76.99%
12 | 80.26% | 73.26%
13 | 70.20% | 73.42%
14 | 72.73% | 74.7%
15 | 74.85% | 76.98%
16 | 75.34% | 76.9%

As shown in TABLE II, Case 11 gives the best result: N = 5, threshold_hot = 0.35, threshold_top = 0.40, threshold_sm = 0.45, threshold_tpc = 0.60.

In the second step, we use the remaining 3500 articles to test. In addition, we implemented other methods as comparison experiments: the Multiple Keywords Combination (MKC) method and the single-pass clustering (SPC) method.

TABLE III. EVALUATION OF THE COMPARED METHODS

Approach | Num | Precision | Recall
SPC | 46 | 74.39% | 75.93%
MKC | 52 | 82.32% | 79.54%
Our approach | 56 | 83.32% | 79.98%

Remark: Num is the number of detected hot topics.

As TABLE III shows, under the same experimental conditions the MKC method improves effectiveness compared with SPC, and our approach achieves the best performance. Fig. 3 compares our approach with the other two approaches, with feature selection, on the testing data under the tri-gram, four-gram, five-gram and six-gram models. In Fig. 3 we can see that our approach outperforms the SPC and MKC approaches in most cases, and the best results are obtained with the five-gram model. That is to say, five-grams can well reflect the relationships of keywords, so the keywords depicting a topic can be effectively clustered together.
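The detection and ranking step evaluated above (Algorithm 3 with Eqs. (5)-(7)) can be sketched as follows. The example keyword group and the 0.60 cutoff (the best threshold_tpc found in training) are illustrative, not taken from the paper's actual data.

```python
def topic_similarity(keywords, pos_items):
    """Eqs. (5)-(6): for each keyword s_i, sim_s = p_i / p, where p_i
    counts the n-gram items containing s_i and p is the total number of
    items; the group score is the sum of sim_s over all keywords."""
    p = len(pos_items)
    return sum(
        sum(1 for item in pos_items if s in item) / p
        for s in keywords
    )

def hotness(keyword_weights):
    """Eq. (7): topic hotness is the sum of the keywords' weights."""
    return sum(keyword_weights)

# A toy keyword group with two n-gram items and two keywords:
pos_items = [("Wen", "Chuan", "earthquake", "fourth", "anniversary"),
             ("the", "fourth", "anniversary", "of", "the")]
keywords = ["earthquake", "anniversary"]
score = topic_similarity(keywords, pos_items)  # 1/2 + 2/2 = 1.5
is_hot = score > 0.60  # threshold_tpc from the best training case
```

Here "earthquake" appears in one of the two items (sim_s = 1/2) and "anniversary" in both (sim_s = 1), so the group scores 1.5, exceeds the threshold, and would be reported as a hot topic ranked by Eq. (7).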

Figure 3. Comparison of our approach with SPC and MKC using feature selection on the testing data: (a) tri-grams, (b) four-grams, (c) five-grams, (d) six-grams

The top four blog topics detected by the proposed method based on the five-gram model are shown in TABLE IV. The list is sorted in descending order by weight, which represents the hotness of the topic.

TABLE IV. TOP FOUR HOT TOPICS IN BLOGS DURING 1 MAY TO 14 MAY

Rank | Detected topic
1 | The fourth anniversary of the Wen Chuan earthquake
2 | What will you do on Mother's Day
3 | Why oppose the war in the South China Sea between China and the Philippines
4 | Refined oil price falls for the first time

V. CONCLUSION AND FUTURE WORK

This paper presents a hot topic detection and topic hotness evaluation method using the n-gram model, which can improve hot topic retrieval in blogs. The main contributions of the paper are as follows. First, we analyze the n-gram model. Second, we propose an effective way to validate the importance of words in a blog article according to the features of the article. Third, we apply the n-gram model to design the algorithms for detecting hot topics. Experimental results on a Chinese corpus show that the proposed method is promising.

However, there are still some shortcomings. First, the experimental data tend to be incomplete for real-life applications. Second, the parameters are optimized only through repeated experiments. Third, the user participation degree and the opinion communication degree of a blog article should be considered. Improving the coverage of the experimental data, finding an optimization algorithm to adjust the thresholds, and taking more features of blog articles into account will be addressed in future work.

ACKNOWLEDGMENT

This work was supported by the Science and Technology Project of Henan Province of China (No. 023004098 and No. 0820220007).

REFERENCES

[1] M. Platakis, D. Kotsakos, and D. Gunopulos. Discovering Hot Topics in the Blogosphere. In: Proceedings of the 2nd Panhellenic Scientific Student Conference on Informatics, Related Technologies and Applications (EUREKA), pp. 22-32, 2008
[2] N. Li and D. D. Wu. Using text mining and sentiment analysis for online forums hotspot detection and forecast. Decision Support Systems, 48(2), 2010, 354-368
[3] X. Hao and Y. Hu. Topic detection and tracking oriented to BBS. In: Proceedings of the 2010 International Conference on Computer, Mechatronics, Control and Electronic Engineering (CMCE), 4(2010), 54-57
[4] X. Y. Dai, Q. C. Chen, X. L. Wang, and J. Xu. Online topic detection and tracking of financial news based on hierarchical clustering. In: Proceedings of the Ninth International Conference on Machine Learning and Cybernetics (ICMLC), 6(2010), 3341-3346
[5] C. H. Wang, M. Zhang, S. P. Ma, and L. Y. Ru. Automatic hot event detection using both media and user attention. Journal of Computational Information Systems, 4(3), 2008, 985-992
[6] Y. Yang, J. Zhang, J. Carbonell, and C. Jin. Topic-conditioned Novelty Detection. In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '02), 2002, 668-693
[7] D. H. Zheng and F. Li. Hot Topic Detection on BBS Using Aging Theory. Web Information Systems and Mining, Lecture Notes in Computer Science, 5854 (2009), 29-38
[8] S. D. Zhu, X. H. Wu, and J. P. Fan. Analysis of Bulletin Board System Hot Topic Based on Multiple Keywords Combination. In: Management and Service Science (MASS), 2011 International Conference, 8(2011), 1-4
[9] Y. Chen, X. Q. Cheng, and S. Yang. Outburst Topic Detection for Web Forums. Journal of Chinese Information Processing, 24(3), 2010, 29-36
[10] K. Chen and L. Luesukprasert. Hot topic extraction based on timeline analysis and multidimensional sentence modeling. IEEE Transactions on Knowledge and Data Engineering, vol. 19, pp. 1016-1025, Aug. 2007
[11] Y. D. Zhou, X. H. Guan et al. Approach to extracting hot topics based on network traffic content. Frontiers of Electrical and Electronic Engineering, vol. 4, pp. 20-23, 2009
[12] H. X. Li, H. P. Zhang et al. Keywords based hot topic detection on Internet. In: Proceedings of the 5th CCIR, 2009, 34-43 (in Chinese)

[13] K. Zheng, X. M. Shu, and H. Y. Yuan. Hot Spot Information Auto-detection Method of Network Public Opinion. Computer Engineering, 36(3), 2010, 4-6
[14] J. J. Li, X. H. Zhang et al. Blog Hotness Evaluation Model Based on Text Opinion Analysis. In: Proceedings of the 2009 Eighth IEEE International Conference on Dependable, Autonomic and Secure Computing, pp. 235-240, December 12-14, 2009
[15] E. Z. Zhou, N. Zhong, and Y. F. Li. Hot Topic Detection in Professional Blogs. Lecture Notes in Computer Science, 6890 (2011), 141-152
[16] L. You, Y. P. Du et al. BBS Based Hot Topic Retrieval Using Back-Propagation Neural Network. In: Proceedings of the 1st International Joint Conference on Natural Language Processing, China, Springer-Verlag, 2005, pp. 39-48
[17] T. T. He, G. Z. Qiu et al. Semi-automatic Hot Event Detection. In: Proceedings of the 2nd International Conference on Advanced Data Mining and Applications, 2006, LNAI 4093, pp. 1008-1016
[18] Knight, K., Hatzivassiloglou, V. Two-Level, Many-Paths Generation. In: Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL-95), Cambridge, MA (1995), 252-260
[19] Brown, R., Frederking, R. Applying Statistical English Language Modeling to Symbolic Machine Translation. In: Proceedings of the Sixth International Conference on Theoretical and Methodological Issues in Machine Translation, Leuven, Belgium (1995), 221-239
[20] Langkilde, I., Knight, K. Generating Word Lattices from Abstract Meaning Representation. Technical report, Information Sciences Institute, University of Southern California (1998)
[21] Bangalore, S., Rambow, O. Corpus-Based Lexical Choice in Natural Language Generation. In: Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL 2000), Hong Kong, China (2000)
[22] Habash, N., Dorr, B., Traum, D. Hybrid Natural Language Generation from Lexical Conceptual Structures. Machine Translation 17 (2003)

Xiaodong Wang received his M.E. degree in Computer Science from Tsinghua University, China in 1993 and his Ph.D. degree in Information Technology in Education from East China Normal University, China in 2003. He is an associate professor in the College of Computer Science, Henan Normal University, China. His current areas of interest include Ontology and Knowledge Engineering. Email: wangxiaodong.wang@yahoo.com.cn

Juan Wang majors in Computer Application Technology. She is an undergraduate student at Henan Normal University, China. Her research interests include Natural Language Processing, Ontology and Knowledge Engineering. Email: juaner.50@163.com