Vehicle Fault Diagnostics Using Text Mining, Vehicle Engineering Structure and Machine Learning

Size: px

Start display at page:

Download "Vehicle Fault Diagnostics Using Text Mining, Vehicle Engineering Structure and Machine Learning"

Denis Tate
6 years ago
Views:

1 Internatonal Journal of Intellgent Informaton Systems 205; 4(3): Publshed onlne July 8, 205 ( do: 0.648/.s ISSN: (Prnt); ISSN: (Onlne) Vehcle Fault Dagnostcs Usng Text Mnng, Vehcle Engneerng Structure and Machne Learnng Y Lu Murphey, Lpng Huang, Hao Xng Wang, Ynghao Huang Department of Electrcal and Computer Engneerng, Unversty of Mchgan-Dearborn, Dearborn, USA Emal address: ylu@umch.edu (Y. L. Murphey) To cte ths artcle: Y Lu Murphey, Lpng Huang, HaoXng Wang, Ynghao Huang. Vehcle Fault Dagnostcs Usng Text Mnng, Vehcle Engneerng Structure and Machne Learnng. Internatonal Journal of Intellgent Informaton Systems. Vol. 4, No. 3, 205, pp do: 0.648/.s Abstract: Ths paper presents an ntellgent vehcle fault dagnostcs system, SeaProSel(Search-Prompt-Select). SeaProSel takes a casual descrpton of vehcle problems as nput and searches for a dagnostc code that accurately matches the problem descrpton. SeaProSel was developed usng automatc text classfcaton and machne learnng technques combned wth a prompt-and-select technque based on the vehcle dagnostc engneerng structure to provde robust classfcaton of the dagnostc code that accurately matches the problem descrpton. Machne learnng algorthms are developed to automatcally learn words and terms, and ther varatons commonly used n verbal descrptons of vehcle problems, and to buld a TCW(Term-Code-Weght) matrx that s used for measurng smlarty between a document vector and a dagnostc code class vector. When no exactly matched dagnostc code s found based on the drect search usng the TCW matrx, the SeaProSel system wll search the vehcle fault dagnostc structure for the proper questons to pose to the user n order to obtan more detals about the problem. A LSI (Latent Semantc Indexng) model s also presented and analyzed n the paper. The performances of the LSI model and TCW models are presented and dscussed. An n-depth study of dfferent term weght functons and ther performances are presented. All experments are conducted on real-world vehcle dagnostc data, and the results show that the proposed SeaProSel system generates accurate results effcently for vehcle fault dagnostcs. Keywords: Vehcle Fault Dagnostcs, Text Data Mnng, Machne Learnng, Vehcle Dagnostc Engneerng Structure, TCW, LSI. Introducton As computers and networks grow more powerful and data storage devces become more plentful and less costly, the amount of nformaton n dgtal form s exploded. The maorty of such dgtal data are n text form. Text data mnng has many applcatons ncludng text document search and categorzaton, webste search, customer servces, and automatc dagnostc systems [~4]. In ths research we focus on text documents that are casually typed or recorded. Many text mnng applcatons requre processng casual text data, whch often are n semstructured or unstructured text, such as clncal document analyss [3, 5], emals, nstant messages, free-text of medcal records, operatonal notes, emals, nstant messages, etc., and the applcaton of ths research s n automotve dagnostc text mnng. In automotve ndustry there are abundant nformaton avalable n casual natural language descrpton form that contan valuable vehcle fault dagnostc knowledge, marketng nformaton, consumer evaluaton or satsfacton of certan vehcle models, styles, accessores, etc [6]. For example, several thousands of vehcle problems are reported daly to varous auto servce shops. It s mportant to fnd root-cause of a vehcle problem quckly and accurately. In a typcal vehcle fault dagnostc process, vehcle problems are frst descrbed by customers n casual words and terms. A servce advsor records the customer s complants or descrpton of symptoms verbatm on the repar order, and then searches for a dagnostc code that matches the descrpton. The dagnostc code s used to gude the dagnoss and reparng processes. Due to the complexty of modern vehcles, the number of dagnostc codes can be n hundreds, whch makes manual searchng of correct dagnostc code dffcult and may lead to a lengthy and less accurate dagnoss and repar process, and, possbly, unnecessary part replacements. In order to mprove the

2 59 Y Lu Murphey et al.: Vehcle Fault Dagnostcs Usng Text Mnng, Vehcle Engneerng Structure and Machne Learnng accuracy and effcency of vehcle fault dagnostcs, t s mportant to develop an automated system that can help customers to report problems descrbed n casual words and terms, and techncans to quckly fnd the correct dagnostc code,.e., the root cause of the problems. There are several challenges nvolved n ths problem. The descrptons of vehcle problems provded by customers are often ll-structured. Most of such descrptons do not follow the Englsh grammar, and contan many msspelled words, self-nvented acronyms and shorthand descrptons. The descrptons of a problem by dfferent people vary based on the educaton and/or cultural background of the customers, and ther famlarty of vehcle termnologes and knowledge of automotve engneerng. For example, the term trunk used Unted States means the same thng as the term boot used n UK. One faulty symptom, for example, a nose s heard from the engne and the engne runs rough can be descrbed by customers n varous ways, such as engne knocks, hood squeak, engne msses dle, engne lopes, etc. The followng are examples of customer descrptons of the same vehcle problem: Customer : WENT ON A SALES ROAD TEST WITH CUST, VEHICLE WOULDNOT START, Customer 2: CHECK CAR WONT START, Customer 3: CK BATTERY HARD TO START. Hgh dmensons of terms and document classes. Snce there are typos and self-nvented acronyms and abbrevatons frequently occurrng n customer descrptons, the number of dstnct terms used n these documents s several tmes more than formally prnted documents. For example, the word engne has more than 20 dfferent spellngs n customers descrptons n our data collectons. Snce the output vector represents all the dagnostc codes used by a car manufacturng company, there can easly be several hundreds of dfferent document classes. These two hgh dmenson ssues pose challenges for generatng effcent and effectve response for a gven problem descrpton. In ths paper we present an ntellgent vehcle fault dagnostcs system, Search-Prompt-Select(SeaProSel), whch s developed by combnng automatc text categorzaton technques wth vehcle engneerng structure and machne learnng to provde effectve search functons for the dagnostc code that accurately matches a gven problem descrpton. The SeaProSel system uses machne learnng technques to automatcally learn words, terms and ther varatons commonly used n verbal descrpton of vehcle problems from tranng data, and ncorporate a vehcle fault dagnostc engneerng structure nto the search process to provde accurate dagnostc code that matches the problem descrpton. Ths paper s organzed as follows. Secton 2 presents a bref overvew of the state-of-art technologes for text mnng and document categorzaton, Secton 3 presents the proposed system, SeaProSel, Secton 4 presents the experment results generated from real-world vehcle dagnostc data, and Secton 5 concludes the paper. 2. Research n Text Mnng and Document Classfcaton Untl the late 80s, the most popular approach to text categorzaton are based on knowledge engneerng [7~9]. These approaches usually consst of a set of predefned rules that are encoded wth expert knowledge. Each rule s represented as a dsunctve normal form (DNF formula) followed by a category name. A document s classfed under a specfc category f t satsfes the DNF formula of the category. Ths DNF expresson s mostly defned by doman experts. If categores are updated or ported to a dfferent doman, doman experts need to ntervene to redefne DNF expressons for new categores from scratch. In recent years most technques used n text document classfcaton and categorzaton are developed based on machne learnng technologes, whch automatcally buld document classfers by learnng the characterstc of document categores from a set of tranng documents. The advantage of machne learnng s that t does not heavly rely on manual labors durng the model constructon stage. Its effectveness level, n many cases, s superor to that of professonal human work. Consequently, automatc text categorzaton has become a maor research area of machne learnng. Many text categorzaton systems have been developed usng dfferent machne learnng algorthms, ncludng k-nearest neghbor (K-NN), neural networks, latent semantc ndexng, probablstc models, support vector machne, and etc. The ntal applcaton of k-nearest Neghbor (K-NN) to text document categorzaton and classfcaton was ntroduced by Masand and hs colleagues [7, 0], and later t became a wdely used method n text classfcaton [~3]. In text document classfcaton and categorzaton, a document s often represented as a vector composed of a seres of selected words called as feature vector. A K-NN based text categorzaton system s to fnd the K documents n the tranng data that are most smlar to an nput unknown document. The category contans the maorty among the K best matched documents s consdered as the category to whch the unknown document belongs. The smlarty between the unknown document and each tranng document s measured by a smlarty functon, whch s crtcal n generatng accurate results [~3]. Neural networks (NNs) have been popular n text categorzaton and document retreval [4~6]. The most popular neural network archtecture for text classfcaton s a multlayer neural network traned wth the well-known backpropagaton algorthm usng supervsed learnng [7~22]. Self-Organzed Map(SOM), also known as the Kohonen network [23], s a popular unsupervsed neural network used n text classfcaton. A SOM network attempts to cluster the tranng data whle preservng the topologcal propertes of the nput space. Durng the tranng process, t bulds the network,.e. the map, by applyng a compettve process to nput examples. A fully traned SOM network can be used as a pattern classfer [3]. SOM has also been used n feature

3 Internatonal Journal of Intellgent Informaton Systems 205; 4(3): selecton for text categorzaton [24] and text clusterng [25, 26]. Probablstc Modelng has been used n text document classfcaton. A wdely used framework of probablstc model for text document classfcaton s derved from the Bayesan theorem of condtonal probablty [8, 27]: P( c) P( d c) P( c d ) =, P( d ) where d s the nput document, d s the feature vector that represents d, c s the th document class, P( d ) s the probablty that a document d represented by vector d occurs randomly, P( c ) the probablty of a randomly pcked document belongs to category c, P( d c) s the probablty of document d occurrng gven that document d s n document class c, and P( c d ) s the probablty of d, represented by vector d belongng to document class c. To smplfy the calculaton of the condtonal probablty, Naïve Bayes (NB) classfer has been appled to document classfcaton [9]. A Naïve Bayes classfer assumes that the condtonal probablty of each term n the feature vector for a gven class s ndependent of the condtonal probablty of other terms n the feature vector for a gven class. Ths assumpton s called class condtonal ndependence. It makes the computaton of the NB classfer far more effcent than the exponental complexty of a non-naïve Bayes approach snce t does not requre the term combnaton as predctors. Studes comparng dfferent classfcaton models have shown the performance of Naïve Bayes classfer s comparable wth neural network classfers and batch lnear classfer [28, 29]. Support vector machne (SVM) approach was developed based on the structural rsk mnmzaton theores n statstcal learnng [30]. A SVM maps the nput feature space to a hgh dmensonal space through a kernel functon. It then chooses the hyperplane wth the maxmum margn that can separate the postve from negatve examples n the feature space. Accordng to the structural rsk mnmzaton, the generalzaton error s bounded by the sum of the tranng set error and a term derved from the Vapnk-Chervonenks (VC) dmenson of the learnng machne. Unlke tradtonal artfcal neural networks (ANNs), whch mnmze the emprcal tranng error, SVM ams at mnmzng the upper bound of the generalzaton error, whch represents the error on unseen data for a classfer. Thus hgh generalzaton performance can be acheved. SVM can potentally learn a larger set of patterns and be able to scale better than artfcal neural networks [3]. Many publshed lteratures show that SVM learnng can lead to hgh performance n a broad range of pattern classfcaton applcatons [3~34]. SVMs have been popular n text classfcaton and categorzaton [35, 36]. SVM s desgned for two-class pattern classfcaton. However n text document classfcaton, most applcatons nvolve more than two categores of documents. Therefore a text categorzaton system developed usng SVMs usually use one of the followng two approaches to buld a multclass SVM system. Let N > 2 be the number of text categores. The frst approach would desgn N SVM classfers, each of whch dscrmnates the kth class aganst the remanng N- classes, k = to N. The SVM assocated wth the class k seeks a decson surface n the feature space that separates class k from all other classes. Collectvely the N SVM models result n N decson boundares [37]. When a new document x s submtted to the system, all N SVMs are appled to x, and the class represented by the SVM that generates the largest output value s assgned to the nput document x. The second approach s to tran N(N-)/2 SVMs, each of whch s traned to par-wsely separate two dfferent classes n the tranng data set. Dfferent votng strateges, such as Max-Wns [37] or drected acyclc graph (DAG) can be used to make the fnal classfcaton decson based on the results from the N(N-)/2 par-wse SVMs [38]. Even wth the advanced technologes dscussed above text mnng contnuous to be a challengng research area. In ths paper we present an nnovatve technque that combnes automated text document classfcaton wth doman knowledge to derve a search result that precsely matches the nput query. 3. SeaProSel: an Intellgent Vehcle Fault Dagnostc System All automotve companes develop ts own dagnostc codes that are used n ther vehcle fault dagnostc processes. Some companes may have several sets of dagnostc codes wth names such as CSC (Customer Symptom Codes), CCC (Customer Concern Codes), and etc. Wthout losng generalty, we refer to such a code system as a vehcle dagnostc code (VDC). The SeaProSel system s desgned to map a query descrpton to a specfc VDC that accurately matches the problem descrpton. Ths query-to-vdc mappng s a M-to-M mappng. Multple descrptons can be mapped to the same VDC, and one query can be mapped to multple VDCs due to ambguty n language. For example, the three customer descrptons gven n Secton have the same dagnostc code that represent the problem of ENGINE WOULD NOT START. In addton to the synonymy and polysemy problems exstng n general text documents, the documents occurrng n vehcle fault dagnostcs pose partcular challenges: typos, grammar errors, self-nvented terms and acronyms, napproprate usages of punctuatons, and etc. The proposed SeaProSel s desgned to deal wth these challengng ssues n order to generate a unque VDC that accurately matches the nput query. SeaProSel has a herarchcal matchng and searchng system that uses text data mnng technology to quckly retreve dagnostc code that accurately matches the

4 6 Y Lu Murphey et al.: Vehcle Fault Dagnostcs Usng Text Mnng, Vehcle Engneerng Structure and Machne Learnng query,.e. the problem descrpton provded by a user. In the cases that no unque dagnostc code s found, the system wll follow the gven automotve dagnostc system to prompt questons to user n an attempt to obtan more nformaton from the user about the vehcle problem. It then uses the answer provded by the user to search for the correct dagnostc code. Ths prompt and search process can be repeated untl a unque dagnostc code s found. Fgure llustrates the system archtecture of the SeaProSel. At the frst stage the SeaProSel drectly searches for a VDC that matches the nput query by usng a Vector Space Model. We present two approaches, frst s a Term- Code-Weght (TCW) model and the second a latent semantc ndexng (LSI) model. The TCW model s a weghted matrx that s obtaned through a machne learnng algorthm. The LSI model uses the reduced-rank matrces to approxmate the orgnal TCW matrx. Each category and query s converted nto a low-dmensonal vector n a LSI space. Both models along wth the crtcal research ssues related to the two models, such as term selecton and weght functons, are dscusson n depth n secton 3.A. If no exactly matched VDC s found, and the system outputs a lst of best matched VDCs and then enters the second stage, Prompt & Select, whch follows the vehcle dagnostc engneerng herarchy to prompt questons for user to select the more detaled and better descrbed vehcle problem. Based on the user s answers, the system generates a new lst, VDC_lst2, whch s a sublst of VDC_lst and contans the VDCs that satsfy the user selected descrptons. The Prompt & Select process s repeated untl a satsfactory VDC s found. Ths process s descrbed n Secton 3.B. Fgure. Overvew of SeaProSel System. A. Buldng a dagnostc document classfcaton model usng machne learnng In text data mnng, a Vector Space Model (VSM) s an algebrac representaton of text documents that contans vectors of dentfers or ndex terms such as words or phrases [39, 40]. In a VSM, all documents are represented n term weghted vectors. In ths research we nvestgate two VSM models, TCW(Term-Code-Weght) matrx and LSI matrx. Both models are bult usng machne learnng algorthms to represent vehcle fault dagnostc knowledge. The two-dmensonal TCW matrx, denoted as A MxN, s generated from a tranng data set Tr, where M s the number of effectve terms n Tr, and N s the number of dagnostc categores n Tr. The tranng data set contans samples of customer descrptons of vehcle problems, and each descrpton s assocated wth a correct dagnostc code assgned by automotve dagnostc experts. The machne learnng algorthm conssts of two maor computatonal components: Term Extracton, and TCW Matrx Constructon. The Term Extracton process nvolves the detecton of a lst of dstnctve and effectve terms, removal of punctuatons and stoppng words, word stemmng and word varaton detecton. The TCW matrx contans M terms generated by the Term Extracton process, and N vehcle dagnostc codes, whch represent document classes. An entry n a TCW, A MxN (,), represents the weght of the th term assocated wth the th dagnostc code, whch s generated based on the statstcal analyss of the th term occurrng n the tranng documents labeled wth th dagnostc code. The TCW matrx s used to map drectly from a problem descrpton to the best matched dagnostc codes. The TCW matrx s bult usng a machne learnng algorthm that contans the followng maor processes, Document preprocessng Document ndexng Term weght generaton Once we obtan the TCW matrx, the VDC that matches an nput document can be generated by the process, Vehcle Fault Dagnostcs usng a smlarty functon. ) Document Preprocessng: Two data nose problems assocate wth casual text documents, one s mproper use of punctuatons and specal symbols, and another mslabelng document categores n tranng data. The msuse of punctuatons n the text documents makes the text categorzaton less accurate. For example, the punctuatons n ACCEL., Dead.NEEDS make the two terms dfferent from ther correct forms, ACCEL and Dead and NEEDS. The exstence of such napproprate use of punctuaton results n a lot of addtonal entres n the term lst that actually should not exst, and cause msleadng statstcs. Three dfferent types of processes are mplemented. Frst type of processes nvolves the search for specal symbols or punctuatons appearng at one end or both ends of a word, for example, ACCEL., *DIESEL*, ****

5 Internatonal Journal of Intellgent Informaton Systems 205; 4(3): TOW, and IN ****. These symbols can be removed drectly wthout any possblty of alternatng the meanng of the word. The second type of processes s to search for specal symbols or punctuatons such as., (, ), &, *, /, +, etc, These symbols are ether removed or replaced by a space. The thrd type of processes s more sophstcated. It manly concerns the punctuatons such as,, :,., etc. When they occur as a part of ntals, numercal or tme/date formats, such as A.C., P.I.D., 2,000, 6.00, 4:00, 4 4, they are kept as part of the orgnal strng. If they occur between two words for example, engne,check, the punctuatons are replaced wth a space. In supervsed machne learnng, each tranng data sample s assgned a target class code,.e. dagnostc code n our applcaton. The class code assgnment s stll been done largely by dagnostc experts. Some documents may have the class code mssng, others may be assgned of multple codes because the person who assgn the class codes s not sure whch one s correct. We developed the followng procedure to deal wth ths problem. For all the documents wth mssng labels, we buld a standard dagnostc code matrx, DC, based on standard descrptons of dagnostc code, whch are avalable n mechancs handbook. We extract the tranng documents wth specfc dagnostc codes to form a subset of tranng data, denoted as TrC, whch s then used to generate the TCW p q matrx C R and term lst T_Lc, where p s the number of ndex terms, and q s the number of codes. For a document q wth n labels, X, X2,, Xn, the relevance between q and X s calculated usng the cosne smlarty functon shown below: s = sm( q, CX ) = Let the smlarty scores between q and the n dagnostc code vectors be s, s 2,, s n, and S max = Max{s =,, n }. The dagnostc code correspondng to S max s assgned to document q. In the cases that multple dagnostc codes are useful, we set a threshold th, and f the dfference between S max and s s smaller than th, dagnostc code X s also assgned to the document q. 2) Document Indexng: The terms used n the TCW matrx need to be derved automatcally from tranng documents, and carefully selected so they effectvely represent document contents. We developed the followng document ndexng algorthm to extract effectve ndexng terms automatcally from tranng data. Let us assume a collecton of documents are to be classfed nto N dagnostc codes or categores, (C, C 2, C N ), and we have tranng documents Tr, Tr 2,, Tr N, where Tr contans the tranng documents belongng to category, =,, N. The obectve of the followng algorthm s to generate a lst of ndexng p = q ( C ) X p p 2 2 ( q) (( CX ) ) = = 2. terms, T_L, where each term t T_ L that effectvely represents the contents n the documents contaned n Tr, where Tr = Tr... Tr N. The document ndexng algorthm contans the followng maor computatonal components. Step : Extract all dstnct terms from Tr to form an ntal term lst T_L. Step 2: Generate a stop word lst, stop_word lst, whch s used to make sure those words do not occur n the term lst T_L. T_L contans the words, such as the, about, an, and, etc. that provde lttle nformaton for document class dscrmnaton. It also contans words have no specfc meanng n a gven applcaton doman. For example, n vehcle fault dagnostc documents, terms such as customer, states, sad, ck, cust, drvng, etc. occur n documents of all classes. Step 3: We mplemented the well-known Porter Stemmng algorthm (or Porter stemmer ) [4] and appled t to the tranng data to generate groups of words that have the same stem, and the varant word forms s represented by one root word. The Porter Stemmng algorthm s based on the dea that the suffxes n the Englsh language mostly consst of a combnaton of smaller and smpler suffxes. It has fve computatonal processes. In each step, f a suffx rule matches wth a word, then the condtons attached to that rule are tested on the resultng stem. A condton, for example, may be the number of stem length after suffx removal must be greater than the threshold. For example, the suffx of ng can be safely removed from the word sngng, and the remanng part,.e. the stem sng replaces the orgnal word. Stemmng word processng reduces the dmensonalty to the word lst sgnfcantly. Step 4: Elmnatng low-frequency words. Two frequency thresholds d_th, w_th, are defned to remove words occurrng nfrequently. A term s removed from T_L f ts occurrng frequency n the number of dfferent vehcle dagnostc code categores s less than d_th or ts occurrng frequency n all tranng documents s less than w_th tmes. The optmal values for d_th and w_th can be obtaned through experments. Step 5: Elmnatng words evenly dstrbuted across all categores. Words that have even dstrbutons among all document categores are also removed from T_L, snce they appear n the same frequency over all document categores. Step 6. Output T_L, whch s used for buldng the TCW matrx descrbed below. 3) Modelng dagnostc documents usng a TCW Matrx: An entry n a TCW matrx A MxN, denoted as a,, s the weght of the th term n T_L belongng to the th VDC, for =,, M and =,, N. The weght n each entry n A MxN s a functon of the occurrence frequency of a term wth respect to a category. The functon s referred to a weght functon. Term weghtng s an mportant component for mprovng performance n the VSM based text mnng [42]. Terms need to be weghted accordng to ther mportance for a partcular

6 63 Y Lu Murphey et al.: Vehcle Fault Dagnostcs Usng Text Mnng, Vehcle Engneerng Structure and Machne Learnng document category and for the whole document collecton. A useful ndex term must fulfll a dual functon: t occurs n the documents of the same category wth hgh frequency so as to render the document retrevable, and t s useful to dstngush the documents of one category from the others. A term weght functon s usually a combnaton of a local weght and a global weght functon. The followng descrbes three popular local weght functons. Term Frequency: l = tf, whch s the occurrence frequency of term wthn document category, and 0 f tf = 0 Bnary: B =, f > 0 Log functon: l = Log 2 (tf +). A local weght functon provdes a measure of how well that a term descrbes the document contents n a partcular category. However, usng only local weght s not enough to evaluate the mportance of a term n the document classfcaton. Some terms, due to ther rarty use n a partcular category of documents, are more mportant n dentfyng these documents than others do. Some terms, however, because they appear n many documents, are not useful to dscrmnate documents n one category from the others. A global weght measure s used to reflect the overall mportance of the ndex term n the entre document collecton. Four well-known global weghts ntroduced by Dumas [43] are: Table. Popular weght functons. Normal: 2, tf gf GfIdf:, df ndocs Idf: log2 +, df plog( p) tf Entropy: where p =, log( ndocs) gf where df, the document frequency, s the total number of documents n the document collecton,.e. tranng data, that contan term, gf, the global frequency, s the frequency of term occurrng n the entre document collecton, and ndocs s the total number of documents n the document collecton. Dfferent weght functons transform the raw occurrence frequency of a term n a document to dfferent weghts. In general, the entry a, of a TCW matrx A s a functon of a local and a global weght components. The most commonly used term weght functons are lsted n Table. These weght functons have been evaluated through extensve experments and the results are dscussed n Secton 4. Based on these experments, the proposed SeaProSel system uses the tf-df weght functon n ts VCD drect search component. Entropy Gfdf Normal tf-df B-df tf * tf *( p ) tf * ndocs ndocs log( p ) gf 2 tf * log + B * log + log( ndocs) tf df df df B-normal Log-df Log-entropy log-gfdf log-norm B * log(tf +) * log(tf +) * log(tf +) * gf ndocs ( ) log(tf +) * 2 p (log + log( p) tf ) df log( ndocs) df tf 2 Fgure 2. A rank-k approxmaton matrx.

7 Internatonal Journal of Intellgent Informaton Systems 205; 4(3): ) Modelng dagnostc documents usng a LSI Matrx: A popular varant of TCW matrx s constructed usng latent semantc ndexng (LSI) method [44~46]. It uses the reduced-rank matrces to approxmate the orgnal TCW matrx. Each category and query s converted nto a low-dmenson vector and mapped nto the LSI space. Relevance measures for the user query are also performed n ths space. A LSI matrx s bult from the TCW matrx. A s T decomposed nto the product of three matrces A = UΣV, T T where U U = V V = I n, and Σ = dag( σ,..., σ n ), σ >0 for r, σ = 0for r+. Matrces U and V contan left and rght sngular vectors of A, respectvely, and dagonal matrx Σ contans sngular values. A rank-k approxmaton to A s T represented Ak = UkΣkVk, where Uk and Vk are constructed by takng only the k largest sngular values of Σ along wth ther correspondng columns n the matrces U and V respectvely. Ak s the unque matrx of rank k that s closest n the least squares sense to A. Fg. 2 llustrates the relatonshp between A and Ak. The SVD method attempts to capture most mportant underlyng structure n the assocaton of terms and documents. Snce k s usually much smaller than the number of terms m, some nose are elmnated by deletng low rankng columns. Because SVD s a strctly mathematcal method, the contents of the matrces are not nterpretable wth respect to the documents or terms t analyzes. The best rank k n the SVD model depends on the tranng data, whch wll be further dscussed n the experment secton. However, t s a powerful technque to reduce the dmenson of any term-by-document matrx. 5) Vehcle Fault Dagnostcs usng TCW and LSI matrces: The obectve of vehcle fault dagnostcs s to classfy the user query to a dagnostc code that accurately matches the nput query. Vehcle Fault Dagnostcs usng TCW algorthm conssts of two processes, formulatng query vector, and measurng smlarty between the term vector and a column vector n the TCW. An nput problem descrpton d s frstly preprocessed usng the same procedures as descrbed earler, ncludng removng unnecessary punctuatons, stop words and word stemmng, etc. It s then transformed nto a term vector q wth the same length M of T_L. Let q = ( q,..., q ) T M, where q s the frequency of the th term on T_L occurred n the query document d, =,, M. The classfcaton decson on whch VDC category that best matches wth q s made based on the smlarty measure between q and each column vector of A. Let the column vectors of A be a (,..., ) T = a am, =,, N. We use the followng cosne based smlarty measure to generate a smlarty score between the vector qand the column vector a, q a c r( q, a) = = q a After smlarty score s obtaned for every VDC category and the nput query, there are two approaches by whch our system can use to determne f the dagnostc codes wth the best smlarty score should be returned as matched code class. One method s to use a threshold: all dagnostc categores wth smlarty scores larger than the threshold are regarded as relevant and assgned to the query. The second method s to output the VDC category represented by the column vector n the TCW matrx that has the hghest smlarty score wth the nput query vector. In the LSI model, a user s query s represented by a vector n the reduced-rank space. From a user query, we frst construct the same term vector q as n the TCW model. Then q s converted nto the vector n the reduced-rank T space by the followng formula: qk = q UkΣ k. The classfcaton decson on whch VDC category that best matches wth s made based on the smlarty measure q k q k between and each column vector of Ak, the same process as n the TCW classfcaton process descrbed above. Both TCW based and LSI based VDC drect search system wll be evaluated n Secton 4. B. Integratng automatc search wth vehcle engneerng structure The proposed vehcle fault dagnostc system, SeaProSel, s an ntegraton of drect search usng the TCW matrx and the progressve prompt and select process based on a herarchcal vehcle fault dagnostc engneerng structure. A dagnostc code system s usually organzed n a herarchcal structure that contans multple levels of functonal descrptons, and each level provdes descrptons about a class of symptoms, specfc functon or component faults, condtons, and etc. In ths representaton, the vehcle fault dagnostc codes are represented n the leaf nodes, the root of the tree s the entre vehcle system, and the subsequent levels represent the herarches of subsystems, components or devces. Fgure 3 shows an example. The hghest layer has three functon groups. Under each functon group, there are sub-functon groups. For each functon group at level 2, there are component groups. Under each component group, there are dfferent categores of devatons, under each of whch, there s a layer of condtons. Each node n the tree s accompaned wth a bref descrpton. For example, a descrpton for a functon group could be Engne wth mountngs and equpment, a descrpton for a component category under the functon group could be startng, the descrptons for condtons under such functon group could be engne turns, cold start or unsure when, and a descrpton for a VDC could be, ENGINE WOULD NOT START. M qa M M 2 2 q a = =

8 65 Y Lu Murphey et al.: Vehcle Fault Dagnostcs Usng Text Mnng, Vehcle Engneerng Structure and Machne Learnng The SeaProSel system uses the TCW matrx to drectly obtan hghly matched dagnostc codes, nteracts wth user by promptng dagnostc questons based on the vehcle engneerng structure, takes user s selecton/answer to ether generate a dagnostc code that accurately match the user s answers or lead to the next level of functonal prompts. Fgure 4 shows the archtecture of the SeaProSel system developed based on the vehcle fault dagnostc system llustrated n Fgure 3. In Fgure 4, SQ represents the nput problem descrpton, SQ 2 represents the selected level 2 functon group, and SQ 3 represents the selected component/functonalty. The SeaProSel algorthm has the followng maor computatonal steps. Fgutre 3. A herarchcal vehcle fault dagnostc system archtecture. Fgure 4. Overvew of processes n SeaProSel for vehcle fault dagnostcs. Step : process the nput query document SQ usng the procedure, VDC Drect Search usng TCW. Let the output of the procedure be F (SQ) = {VDC, conf,, VDC k, conf k }, where conf conf 2, conf k Step 2: If = conf - conf 2 s hgh, then output VDC and ext Step 3: Fnd c k such that c = conf c conf c+ s hgh Step 4: Fnd the node H n the tree structure such that

9 Internatonal Journal of Intellgent Informaton Systems 205; 4(3): Found_Codes = {VDC,, VDC c } are all H s descendants, and no other nodes n the tree has ths property except H s parent nodes. Example : If Found_Codes = {VDC 8, VDC 9, VDC 0 }, the H code s Devaton 3. Example 2: If Found_Codes = {VDC 6, VDC 9, VDC 0 }, the H code s Functon Group. Step 5: Call the followng Prompt & Select procedure. Step 5. Followng the H node s drect descendants, and present the descrptons assocated wth the descendants to the user. For example, the descrptons of Condton, Condton 2 and Condton 3 under Devaton 3 are presented to the user Step 5.2 Based on the user selecton, f the unque VCD s found, output the VCD and ext the program. Step 5.3 Follow the descendant node selected by the user and fnd the VCDs that match the user s selectons, and denote them F2 (SQ2) = {VDC, conf,, VDC k, conf k }, where conf conf 2, conf k Step 5.4 If 2 = conf conf 2, s hgh, then output VDC and ext Step 5.5: goto Step Experments We were provded by an automotve company wth the herarchcal vehcle fault dagnostc system llustrated n Fgure 3. The herarchcal vehcle fault dagnostc system has 540 vehcle dagnostc codes, and each node s accompaned wth a general descrpton of the vehcle problems the node covers. We conducted three dfferent sets of experments to evaluate, respectvely, dfferent weght functons, TCW matrx verse the LSI matrx, and the entre SeaProSel system. The TCW and LSI matrces were all traned on a data set of 200,000 real-world customer descrptons of vehcle problems. After removng extraneous documents and elmnatng documents wth wrong labels, we had 99,552 vald documents as tranng data. After data preprocessng such as punctuaton preprocessng, Portel stemmng and typo removal, the number of ndex terms generated from the tranng data were reduced from 7033 to The TCW and LSI components as well as the weght functons were evaluated on TEST6K, a testng set of 6000 vehcle dagnostc documents collected from dfferent retaler servce shops n USA durng one week tme perod. All test documents were labeled wth true dagnostc codes by auto techncans. The followng evaluaton crtera are used to analyze system performances. For each nput query, two levels of matchng accuracy are measured: the low level of matchng (LLM) and hgh level matchng (HLM). If the VDC Drect Search system usng ether TCW or LSI matrx returns the correct VDC code,.e. t exactly matches one of the leaf nodes n the herarchcal vehcle fault dagnostc system shown Fgure 3, then t s a low level matchng. In ths case, the ProSeaSel system wll output the VDC code and termnate the search. If the VDC Drect Search system returns the code that does not match any of the leaf nodes but matches the correct hgher level categores n the herarchcal system shown n Fgure 3, then t s a hgh level matchng. For example, f an nput query s true VDC s vdc, but the output of the VDC Drect Search system s vdc5. Snce vdc and vdc5 have the same devaton category, the system has a wrong LLM, but a correct HLM. Based on these two types of matchng crtera, we defne two accuracy measures of system performances when a batch of test queres s used as test data, Exact Match Rate (EMR) and Category Match Rate (CMR). EMR s defned as the number of correct matched outputs n LLM over the sze of the tranng data, and CMR the number of correct outputs n HLM over the sze of the tranng data. EMR 90.00% 80.00% 70.00% 60.00% 50.00% 40.00% 30.00% 20.00% 0.00% 0.00% VSM(Cosne) VSM(Spearman) entropy Gfdf normal tf-df B-df B-normal log-df logentropy TCW(Cosne) TCW(Pearson) VSM(Pearson) TCW(Spearman) LSI(cosne) log-gfdf log-norm weght schemes Fgure 5. Effects of Weght Schemes and Smlarty Functons.

10 67 Y Lu Murphey et al.: Vehcle Fault Dagnostcs Usng Text Mnng, Vehcle Engneerng Structure and Machne Learnng Fgure 6. Performances of the LSI systems of varous K-values. (a) (b) Fgure 7. Two examples of query processes by SeaProSel system

11 Internatonal Journal of Intellgent Informaton Systems 205; 4(3): A. Evaluaton of weght functons As descrbed before, several weght schemes can be used n both TCW and LSI models. All weght functons shown n Table were mplemented n both the VSM and the LSI models. We used three dfferent smlarty functons: Cosne, Pearson, and Spearman n the query classfcaton processes. The results are shown n Fgure 5. It shows that the best result s generated by the TCW model that uses the tf-df weght functon combned wth cosne smlarty functon. Pearson smlarty measure used n the TCW model combned wth the tf-df weght functon acheved the smlar performance to that of cosne measure. Both of them are better than the Spearman smlarty functon. B. Evaluaton of TCW and LSI models The accuracy of the LSI model s heavly affected by value of rank, K. Snce the results above ndcates that the weght scheme tf-df combned wth cosne measure produced the hghest EMR, they are used n both TCW and LSI systems. In order to explore the effects of parameter K, we appled varous K values to construct the sngular value T decomposton matrces, Ak = UkΣ kvk, and compared the performances of these SVD matrces wth the TCW matrx. Theoretcally, the range of K-value s between to the number of columns of TCW matrx. However, our experments shows that EMR degrades rapdly when K-value s larger than 40 n ths applcaton. Fgure 6 showed that the CMR and EMR of LSI models wth K values between and 40, as well as the performances of the TCW classfcaton system. It appears that the best performance has been acheved wth K equal to 7. But the TCW outperformed the best LSI system by more than %. C. Evaluaton of SeaProSel We tested the entre SeaProSel system on a set of 3273 query examples, none of whch were ncluded n the tranng data. The performances are analyzed as follows. 97% of the test queres were answered wth unque and correct VDCs by the VDC drect search usng TCW Model wthout gong to the Prompt & Select process. The other 3% of the test queres were processed through subsequent Prompt & Select processes. At the end of the processes, correct VDCs were found for all test queres. Overall the prompt and select process was used at the rate of 0.065/query. Fgure 7 shows the processes of two test queres by the SeaProSel system. In Fgure 7 (a), the test query was Instrumentaton warnng lghts and chmes. The VDC drect search component returned multple mached VDCs. By fndng the H node of these VDCs, the Prompt & Select component present to the user the descrptons of dfferent problems at the component functonalty level (see Fgure 3) related to the nput query. After the user selected Text wndow and warmng symbol, the SeaProSel went on to the questons at the devaton level. When the user selected Red symbol and text message, a unque VDC code s found that matches the nput query as well as the answers selected by the user durng the Prompt & Select processes. In the second example (see Fgure 7 (b)), the test query s Engne would not start. The VDC drect search component returned multple VDCs, whch s represented as VDC_Lst. By fndng the H node of these VDCs on the VDC_lst, the Prompt & Select component gave descrptons of dfferent problems at level 2 functon groups related to the VDC_Lst. After the user selected Engne wth mountngs and equpment, the SeaProSel went on to dsplay the questons under the selected node at the Component Functonalty level. When the user selected Engne, the SeaProSel went on to dsplay the questons under the selected node at the Devaton level. When the user selected Engne turns, the SeaProSel went on to dsplay the questons under the selected node at the Condton level. When the user selected clckng sound at start attempt, a unque VDC, E2, s found that matches the nput query as well as the answers selected by the user durng the Prompt & Select processes. 5. Concluson We have presented an ntellgent vehcle fault dagnostc system, SeaProSel. SeaProSel conssts of two maor components, VDC Drect Search usng a VSM, and Prompt & Select. Two VSM technologes were developed, mplemented and evaluated, a TCW model and a LSI model. Both models were developed based on machne learnng and text mnng technques. The Prompt & Select component s a system that s bult upon a vehcle fault dagnostc engneerng structure wth a progressve process of query, select, and search to acheve effcent and accurate classfcaton of vehcle problem descrptons. We also presented algorthms for preprocessng text documents that contans spellng errors, typos and self-nvented terms, choosng effectve weght functons, and buldng an effectve TCW matrx and LSI matrces from a gven tranng data set. We have conducted extensve experments to evaluate the algorthms and the entre SeaProSel system. Based on our expermental results we conclude that, n the applcaton doman of vehcle fault dagnostc text documents, the tf-df weght functon gves the best performance when t s used n ether TCW or LSI models wth the smlarty functon beng ether Cosne or Pearson. In terms of the optmal ranks n the LSI systems, our experments show that the optmal K values are n the range of K=3 through K=9, wth K=7 gvng the best performance. When we compare the performances of the TCW model wth the best LSI model,.e. the LSI system used K=7, we notce that the TCW model outperforms the best LSI system by more than %. The SeaProSel system s mplemented based on a realworld vehcle dagnostc code system wth 540 dfferent code classes, and s evaluated on 3273 query documents, whch are verbatm vehcle problem descrptons by customers. The SeaProSel system acheved 97% accuracy n fndng the dagnostc codes drectly based on the VDC Drect Search usng TCW. Through the nnovatve processes of Prompt and Select procedure, the SeaProSel was able to fnd the correct dagnostc code 00% for all test queres.

12 69 Y Lu Murphey et al.: Vehcle Fault Dagnostcs Usng Text Mnng, Vehcle Engneerng Structure and Machne Learnng Our maor contrbutons are summarzed as follows. () Presented an nnovatve computatonal framework, SeaProSel, that combnes automatc search wth engneerng structural search through a Prompt & Select strategy. Expermental results show that SeaProSel s effectve n searchng for dagnostc code accurately matchng a gven problem descrpton. (2) Presented new algorthms for learnng automotve dagnostc code usng TCW matrx and LSI model. Based on our expermental results, TCW s more effectve n the applcaton of casual text document categorzaton (3) Presented new algorthms for preprocessng engneerng dagnostc documents. Although the applcaton doman we presented s n the area of vehcle fault dagnostc documents, some technques we presented are applcable to other applcatons that nvolve processng casual text documents such as automated queston answerng servces, and classfcaton of Tweets, nstant messages, and e-mal messages. References [] Fang, J., Guo, L., Wang, X. D., & Yang, N Ontology- Based Automatc Classfcaton and Rankng for Web Documents. Fourth Internatonal Conference on Fuzzy Systems and Knowledge Dscovery -FSKD, [2] Zhuang, F. Z.; Luo, P.; Shen, Z. Y.; He, Q.; Xong, Y. H.; Sh, Z. Z. & Xong, H Mnng Dstncton and Commonalty across Multple Domans Usng Generatve Model for Text Classfcaton. IEEE Transactons on Knowledge and Data Engneerng, Volume: 24, Issue:, Page(s): , 202. [3] Huang, Y.H., Selya, N., Murphey, Y. L., & Fredenthal, R. B Classfyng Independent Medcal Examnaton Reports usng SOM networks. Proceedng of the 6th Internatonal conference on Data Mnng, Las Vegas, Nevada, USA, 200, p [4] Mencıa, E. L., Park, S. H., & Fürnkranz, J Effcent votng predcton for parwse multlabel classfcaton. Neuro computng 73 pp.64 76, 200. [5] Zeng, Q.; Zhang, X.; Zhang, W.; L, Z. & Lu, L Extractng Clncal Informaton from Free-text of Pathology and Operaton Notes va Chnese Natural Language Processng. 200 IEEE Internatonal Conference on Bonformatcs and Bomedcne Workshops, pp , Hong Kong, 200. [6] Huang, Y. H., Murphey, Y. L., & Ge, Y Automotve dagnoss typo correcton usng doman knowledge and machne learnng. IEEE Symposum Seres on Computatonal Intellgence, 203. [7] Creecy, R.M., Masand, B. M., Smth, S. J., and Waltz, D. L Tradng MIPS and memory for knowledge engneerng: classfyng census returns on the Connecton Machne, Communcatons of the ACM, 35(8): p , 992. [8] Sebastan, F Machne learnng n automated text categorzaton. ACM Computng Surveys, (): p [9] Yang, Y. & Lu, X A re-examnaton of text categorzaton methods. Proc. 22th ACM Int. Conf. on Research and Development n Informaton Retreval (SIGIR'99) Berkeley, CA. [0] Masand, B., Lnoff, G., & Waltz, D Classfyng news stores usng memory based reasonng. Development n Informaton Retreval, 992: ACM Press, New York, US. [] Radovanovć, M. & Ivanovć, M Text mnng: approaches and applcatons, Nov Sad J. Math. Vol. 38, No. 3, 2008, [2] Lu, F. & Ba, Q. Y Refned weghted K-Nearest Neghbors algorthm for text categorzaton. Internatonal Conference on Intellgent Systems and Knowledge Engneerng (ISKE), 200. [3] Balwan. V., Kumar, V., Kumar, P., & Pascual, J KNN based Machne Learnng Approach for Text and Document Mnng. Internatonal Journal of Database Theory and Applcaton, Vol. 7, No., 204, pp [4] Baeza-Yates, R., Rbero-Neto, B., Modern Informaton Retreval, 999: Addson Wesley. [5] Syu, I., Lang, S.D. & Deo, N.; 996. Incorporatng latent semantc ndexng nto a neural network model for nformaton retreval. Proceedngs of the ffth nternatonal conference on Informaton and knowledge management, 996. [6] Chen. Z.H., N, C. W. and Murphey, Y. L., Neural Network Approaches for Text Document Categorzaton. IEEE Internatonal Jont Conference on Neural Networks, July, [7] Zhang, M.L. and Zhou, Z. H Multlabel Neural Networks wth Applcatons to Functonal Genomcs and Text Categorzaton. IEEE Transacton n Knowledge and Data Engneerng, Vol. 8, Issue 0, Oct [8] Cho, S.B. and Lee, J. H., Learnng Neural Network Ensemble for Practcal Text Classfcaton. Lecture Notes n Computer Scence, Volume 2690, Pages , [9] Yu, B.; Xu, Z. B. & L, C. H Latent semantc analyss for text categorzaton usng neural network. Knowledge- Based Systems, 2- pp , 2008 [20] Th, H. N. T.; Huu, O. N. & Ngoc, T. N. T.;203. A supervsed learnng method combne wth dmensonalty reducton n Vetnamese text summarzaton. IEEE Computng, Communcatons and IT Applcatons Conference (ComComAp), 203. [2] Vnodhn, G. & Chandrasekaran, R.M.; 204. Sentment classfcaton usng prncpal component analyss based neural network model. 204 Internatonal Conference on Informaton Communcaton and Embedded Systems (ICICES), 204 [22] L, C. H. and Park, S. C., An effcent document classfcaton model usng an mproved back propagaton neural network and sngular value decomposton. Expert Systems wth Applcatons, 36, pp , [23] Kohonen, T The self-organzng map. Proc. of the IEEE, 9, , 990.

13 Internatonal Journal of Intellgent Informaton Systems 205; 4(3): [24] Manomasupat, P., and Abmad k. Feature Selecton for text Categorzaton Usng Self Orgnzng Map. 2nd Internatonal Conference on Neural Network and Bran, 2005, IEEE press Vol 3, pp , [25] Lu, Y.C.; Wang, X.L.; & Wu, C.; ConSOM: A conceptonal self-organzng map model for text clusterng. Neurocomputng, 7(4-6), , [26] Lu, Y.C., Wu, C., & Lu, M. 20. Research of fast SOM clusterng for text nformaton. Expert Systems wth Applcatons, 38(8), , 20. [27] Lews, D.D Nave (Bayes) at forty:the ndependence assumpton n nformaton retreval. Proceedngs of ECML-98. Sprnger Verlag, Hedelberg, 998. [28] Fredman, N.; Geger, D.; Goldszmdt. M.; 997. Bayesan Network Classfers. Machne Learnng, November 997, Volume 29, Issue 2-3, pp [29] Theodords, S.; 205. Machne Learnng: A Bayesan and Optmzaton Perspectve. Academc Press, 205. [30] Vapnk, V.; 995. The Nature of Statstcal Learnng Theory. Sprnger Verlag, New York, 995. [3] Mukkamala, S., Janosk, G., Sung, A H Intruson Detecton Usng Neural Networks and Support Vector Machnes. Proceedngs of IEEE Internatonal Jont Conference on Neural Networks, IEEE Computer Socety Press, pp [32] Murphey, Y.L.; Chen, Z.H.; Putrus, M. & Feldkamp, L.A SVM learnng from large tranng data set. IEEE Internatonal Jont Conference on Neural Networks, July, [33] Hong, H.B.; Murphey, Y.L.; Gutchess, D. & Chang, T.S Identfyng knowledge doman and ncremental new class learnng n SVM. IEEE Internatonal Jont Conference on Neural Networks, July, [34] Chapelle, O. & Vapnk, V Model selecton for support vector machnes. In S.A. Solla, T.K. Leen, and K.R. Muller, edtors, Advances n Neural Informaton Processng Systems, volume 2. MIT Press, Cambrdge, MA, [35] Zhang, W.; Yoshda, T.; & Tang, X Text Classfcaton based on Mult-word wth Support Vector Machne. Knowledge-Based Systems, vol. 2, [36] Fenerer, I. & Karatzoglou, A., 200. Support Vector Machnes for Large Scale Text Mnng n R. 9th Internatonal Conference on Computatonal Statstcs, 200. [37] Hsu, Chh-We and Ln, Chh-Jen, A Comparson of Methods for Multclass Support Vector Machnes. IEEE Transactons On Neural Networks, VOL. 3, NO. 2, MARCH [38] Platt, J. C., Crstann, N., and Shawe-Taylor, J., Large margn DAG s for multclass classfcaton. Advances n Neural Informaton Processng Systems. Cambrdge, MA: MIT Press, vol. 2, pp , [39] Huang, L.P Intellgent Systems for text categorzaton and retreval. M.S. Thess, Department of Electrcal and Computer Engneerng, Unversty of Mchgan-Dearborn, [40] Raghavan, V.V., & Wong, S.K.M A Crtcal Analyss of Vector Space Model for Informaton Retreval. Journal of the Amerca Socety for Informaton Scence, (5): [4] Porter, M.F An algorthm for suffx strppng. Readngs n Informaton Retreval, 997. Morgan Kaufmann Publshers Inc. San Francsco, CA, USA. [42] Dumas, S.T., 99. Improvng the retreval of nformaton from external sources. Behavor Research Methods, Instruments and Computers, (2): p [43] Dumas, S.T., 990. Enhancng performance n latent semantc ndexng (LSI) retreval. Techncal Report Techncal Memorandum, Bellcore, 990. [44] Dumas, S.T., Furnas, G. W., Landauer, T. K. and Deerwester, S Usng latent semantc analyss to mprove nformaton retreval,. In Proceedngs of CHI'88: Conference on Human Factors n Computng New York: ACM. [45] Jessup, E. R., & Martn, J.H., 200. Takng a new look at the latent semantc analyss approach to nformaton retreval. Computatonal nformaton retreval, 200: p [46] Sebastan, F. & Rcerche, C. N., Machne learnng n automated text categorzaton. Journal of ACM Computng Surveys, Volume 34, Issue, March 2002.

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task

Term Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task Proceedngs of NTCIR-6 Workshop Meetng, May 15-18, 2007, Tokyo, Japan Term Weghtng Classfcaton System Usng the Ch-square Statstc for the Classfcaton Subtask at NTCIR-6 Patent Retreval Task Kotaro Hashmoto