Relational Lasso: An Improved Method Using the Relations among Features

Kotaro Kitagawa    Kumiko Tanaka-Ishii
Graduate School of Information Science and Technology, The University of Tokyo

Abstract

Relational lasso is a method that incorporates feature relations within machine learning. By using automatically obtained noisy relations among features, relational lasso learns an additional penalty parameter per feature, which is then incorporated as a regularizer within the target optimization function. Relational lasso has been tested on three different tasks: text categorization, polarity estimation, and parsing, where it was compared with conventional lasso and adaptive lasso (Zou, 2006) when using a multi-class logistic regression optimization method. Relational lasso outperformed these other lasso methods in the tests.

1 Introduction

As machine learning methods scale up and now deal with millions of features, we ideally want to add all possible features without having to manually verify or consider the effectiveness of each feature with respect to the performance. In other words, we need an automatic way to exploit any usable information that can be obtained from features. However, with current machine learning methods, adding noisy features can lower performance, so the user still has to decide which features are worth adding.

Regularization methods have recently received greater interest because of this need. Regularization is expressed as a constraint term within an optimization function, where the term is given as a function of the importance weight of each feature. Regularization thus provides a means of importance control embedded within the target optimization problem. Among the various regularizers, lasso, proposed by Tibshirani (1996), is the most widely used because of its mathematical comprehensiveness.

This paper describes a method, which we call relational lasso, that improves upon the conventional lasso method. We show that relational lasso improves the overall performance of classification compared with that of other lasso methods.

This study was motivated by a limitation we observed in conventional lasso: features can be inter-related, but such dependences are not incorporated within the current regularization. Consequently, the conventional method tends to favor correlated features, which can lead to the importance of non-correlated features being neglected.

Relational lasso overcomes this limitation by introducing an additional penalty parameter for each feature; this parameter is estimated automatically given the noisy relations among features, where the relations themselves are also automatically generated. While the proposed method does not add to the computational complexity of the conventional regularization method, it improves the quality of classification. We explain the method and show empirical results from tests based on three different text classification and parsing tasks.

Regularization was originally proposed as a way to avoid over-fitting by favoring some small number of features. Our method exploits this approach further and attempts to estimate the importance of each feature within the relations it has with other features. As explained in more detail in the following section, attempts along the same line have been made to incorporate underlying relations among features, such as by fused lasso through the ordering among features (Tibshirani et al., 2005), or by group lasso through groups among features (Yuan and Lin, 2006).
However, fused lasso assumes problems with features that can be ordered in some meaningful way, and group lasso requires underlying group information to be configured before the method is applied.

The work most closely related to ours is adaptive lasso (Zou, 2006), which introduces an additional parameter per feature. However, since adaptive lasso does not explicitly process relations among features, the estimation of this additional parameter is completely different from our method. Moreover, as will be shown empirically, our method outperforms adaptive lasso, which in fact performed worse than even the conventional method.

Related work on regularization techniques has shown the potential of regularization not only to prevent over-fitting but also to serve as a kind of feature selection. Relational lasso provides another step along this line. Here, we show that our method outperforms other lasso methods in different classification tasks. Moreover, it works well even when noisy features are added, something that degrades the performance of other lasso methods.

2 Related Work

As explained, feature selection is the key to our method's effectiveness, and here we summarize the related work along this line. A substantial number of studies have been done on feature selection techniques, which can be classified into three categories according to Guyon and Elisseeff (2003): wrapper methods, filter methods, and embedded methods.

The wrapper is the most naïve way of selecting a subset of features through predictive accuracy (Kohavi and John, 1997). The user searches the possible feature space greedily using an induction algorithm and selects the subset with the best predictive accuracy. Wrappers with greedy algorithms can be computationally expensive, and the stepwise selection is often trapped in a locally optimal solution. Since the number of possible subsets of a set is exponential in the number of features in the original set, in practice the user typically defines the subset arbitrarily depending on the category of features, as was tested by Scott and Matwin (1999). This makes any fine adjustment of which individual features to use impossible.

Filter methods, on the other hand, select good features according to some criterion, thus providing the means for selecting individual features. These methods have been extensively studied, and a good overview is available in (Manning and Schuetze, 1999). Representative evaluation functions for choosing good/bad features are the chi-square statistic and mutual information; features having higher scores under these functions are considered good features. While filtering methods are effective and therefore often used, they are independent of the learning method that they are used with. Moreover, the performance is not guaranteed to improve even though feature selection is used.

The last category is embedded methods, where the feature selection is embedded within the overall classification problem. The decision tree is an example of an embedded method, and machine learning techniques using pruning steps have been studied (Perkins et al., 2003).

Lasso, an acronym for least absolute shrinkage and selection operator, using the L1 norm (Tibshirani, 1996), is a computationally efficient method for simultaneously achieving estimation and feature selection. Although lasso helps achieve an effective model, the L1 norm can cause biased estimation among features (Knight and Fu, 2000) by not being able to distinguish between truly significant and noisy features. To cope with this problem, Zou and Hastie (2005) proposed the elastic net method, which is expressed as the conjunction of the L1 and L2 norms of the feature weights.
They show that this method works when the number of features substantially exceeds the number of learning data, and also when there is strong correlation between some features. Another proposal is fused lasso, which incorporates the order of features (as found in their indices, such as when each image pixel value forms a feature) (Tibshirani et al., 2005). Here, the target applications are protein mass spectroscopy and gene expression data, and the method is only applicable to a target where the order among features is explicit, as in the case of genes or image pixels. Yuan and Lin (2006) proposed grouped lasso, which incorporates underlying groups among features. The fused and grouped lasso methods require configuration of the structure among features.

Recently, a new approach called weighted lasso has been proposed, which calculates the L1 norm over features, each of which is weighted. As one such method, Zou (2006) proposed a two-step approach called adaptive lasso.

This paper proposes an alternative weighted lasso method, in which the estimation of the weights is processed differently from that of adaptive lasso. Although the procedure of adaptive lasso is the closest to relational lasso, the learning of adaptive lasso does not explicitly handle the relations among features.

All these methods are attempts to incorporate the relations, or structure, among features (such as dependence, ordering, and groups) into machine learning through the framework of regularization. Although such dependence is not always given or tractable, we believe this information can be learned from some automatically generated noisy relation among features.

3 L1-Regularization of Multi-Class Logistic Regression

Before going on to the main points of relational lasso, let us summarize the regularization framework that we adopt. Regularization is a general method used in classification. The target function has two terms, one for fitting and another for regularization. The second term penalizes the weights acquired by each feature, typically by adding their norms to the target function of the classifier. This prevents the target function from becoming over-fitted through favoring some specific sets of features. In this sense, the regularization term can be considered as serving for feature selection. Of the various ways to define the target function, in this paper we focus on the multi-class logistic regression model and L1-regularization; namely, the lasso method.

The fitting function adopted in this paper is a multi-class logistic regression model, denoted as LR in the following. LR is used to model the relationship between input vectors x \in R^n and labels y \in Y. The conditional probability of a label y given x is defined as

    p(y \mid x; w) = \frac{1}{Z(x)} \exp(w^\top \phi(x, y)), \qquad Z(x) = \sum_{y'} \exp(w^\top \phi(x, y')),

where \phi(x, y) \in R^m is the feature vector and w \in R^m is the weight vector. When the training examples {(x_i, y_i)} (i = 1, ..., l) are given, minimization of the loss function

    L(w) = -\sum_{i=1}^{l} \log p(y_i \mid x_i; w)

is equivalent to maximum likelihood estimation.

Regularization makes it possible to obtain a good model for LR, without restricting the number of features, by imposing appropriate restrictions on the weights w. Of the different ways of regularization, in this paper we adopt Tibshirani (1996)'s method of imposing an L1 norm on the parameters because of its mathematical simplicity, especially when applied with LR. This method is called lasso and facilitates both estimation and automatic variable selection. When applying lasso to logistic regression, the MAP estimation of the weights is given by the following formula, the target function to be optimized:

    \hat{w} = \arg\min_w L(w) + \lambda \sum_i |w_i|,    (1)

where \lambda is the parameter defining the strength of the regularization term's influence on the optimization.

Although our proposal applies in general to various types of target function, in this paper we examine its effectiveness within this particular target function. This target function was chosen because the target function of LR-lasso is mathematically comprehensive, and multi-class logistic regression and lasso are widely applied. Further investigation to determine whether our method works well for other target functions will be part of our future work.
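For concreteness, the following is a minimal NumPy sketch of the loss L(w) and the lasso objective of formula (1), assuming the usual realization of \phi(x, y) as one weight row per class; it is ours, not the authors' solver (which is a modified LIBLINEAR), and the function names are our own.

import numpy as np

def lr_loss(W, X, y):
    # W: (n_classes, m) weights, X: (l, m) features, y: (l,) integer labels.
    # Negative log-likelihood L(w) = -sum_i log p(y_i | x_i; w).
    scores = X @ W.T                             # w^T phi(x, y) for each class
    scores -= scores.max(axis=1, keepdims=True)  # shift for numerical stability
    log_Z = np.log(np.exp(scores).sum(axis=1))   # log of the partition Z(x)
    return -(scores[np.arange(len(y)), y] - log_Z).sum()

def lasso_objective(W, X, y, lam):
    # Formula (1): L(w) + lambda * sum_i |w_i|.
    return lr_loss(W, X, y) + lam * np.abs(W).sum()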
4 Relational Lasso: The Proposed Method

The limitation of conventional lasso is that relations among features cannot be incorporated. For example, highly correlated features which lead towards a higher performance could all acquire relatively large weights. This would lead to favoring a single aspect that counts for the classification and neglecting other minor but still important aspects which would enable better classification. This happens when a high correlation is found among features. Therefore, when one feature is favored, the other correlated features must be heavily penalized, so that features which count for classification from a different aspect are more favored.

To express this, we adopt the weighted lasso approach, so an additional penalizing parameter \alpha_i for each feature is introduced in the second term of formula (1):

    \hat{w} = \arg\min_w L(w) + \lambda \sum_i \alpha_i |w_i|.    (2)

The solution found here, subject to the L1 regularizer, is equivalent to the solution obtained from the constrained optimization problem:

    \min_w L(w) \quad \text{s.t.} \quad \sum_i \alpha_i |w_i| \le \gamma.

The parameter \gamma corresponds to \lambda of formula (2). Each parameter \alpha_i determines the penalty for w_i and directly affects the importance of the i-th feature.

Previous work on adaptive lasso (Zou, 2006) also introduces an additional parameter such as \alpha in addition to the weight parameter. Adaptive lasso focuses on the presence of oracle properties [footnote 1] and works in two stages of optimization. First, the weight w is learned with conventional lasso. Second, the L1 norm is re-weighted with the parameters \alpha_i = 1 / |\hat{w}_i|^\delta, set from the initial lasso estimator \hat{w} using a parameter \delta, and the optimization problem is processed using \alpha. Therefore, their way of learning this additional parameter does not explicitly concern the exploitation of information different from the original weights w. Our \alpha, on the other hand, is estimated given a noisy relation among features, and thus plays a different role from w. In this sense, the way relational lasso handles this additional parameter is completely different from adaptive lasso. In other words, the originality of our method lies in using \alpha to express the relations between underlying features.

[Footnote 1: The oracle property (Fan and Li, 2001) is satisfied if the optimization problem can correctly select the nonzero weights with probability converging to one, and the estimators of the nonzero weights are asymptotically normal with the same means and covariance that they would have if the zero coefficients were known in advance.]

The relations between features are denoted as R, which is provided to the proposed algorithm and used to estimate \alpha. R denotes a pairwise dependence relation between features; that is, if there are m features in total, R \subseteq {1, ..., m} x {1, ..., m}. Unlike previous work such as fused lasso (Tibshirani et al., 2005) and grouped lasso (Yuan and Lin, 2006), R in our work is a noisy relation which can be automatically obtained by scanning through the features.

Although there are various possibilities for obtaining R, one way is through the inclusion relation among features. Given a pair of features p and q, the feature q includes p if, in every instance of the learning data in which the value of feature p is non-zero, the value of feature q is also non-zero. For example, in the case of the adjective "economic" and its stem "econom", the latter includes the former, while also being a stem for other terms such as "economy" and "economist". In the final classification, it is unknown which of "econom" and "economic" counts. For part-of-speech tagging, "econom" would not provide much information since it does not have a complete form, but for topic estimation, "econom" might provide sufficient information by representing the terms "economic", "economy", and "economist". In both cases, when the two words appear as features, they share a tight relation, and when one is given high importance, the other will be as well in conventional lasso. In relational lasso, if one representative is selected, then other similar features in the same group will acquire less importance by having a larger penalty.
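Under this definition, q includes p exactly when the support of p (the set of instances where it fires) is contained in the support of q. The following quadratic-time sketch recovers R from a feature matrix; the dense pairwise scan and the (p, q) pair orientation are our assumptions, not the paper's implementation.

import numpy as np

def inclusion_relation(X):
    # X: (l, m) feature matrix over the learning data.
    # Emits (p, q) whenever feature q includes feature p, i.e. every
    # instance with X[:, p] != 0 also has X[:, q] != 0.
    nz = X != 0
    m = X.shape[1]
    R = []
    for p in range(m):
        for q in range(m):
            if p != q and np.all(nz[:, q] >= nz[:, p]):  # boolean implication
                R.append((p, q))
    return R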
The overall procedure is shown in Procedure 1. The procedure takes three kinds of input: the learning data, a relation among the features, and parameter values. Before optimizing the target function, shown in the second line from the bottom, the procedure calculates \alpha depending on the given relation R among the features.

Procedure 1: Relational Lasso
    Input: (x_i, y_i) (i = 1, ..., n)
           parameters \lambda, a_1, a_2
           relation among m features R \subseteq {1, ..., m} x {1, ..., m}
    \alpha = 1, w = 0, F = {}
    v = \partial L(w) / \partial w
    while |F| < m do
        k = arg max_{k \notin F} v_k
        for all k' \notin F with (k', k) \in R do
            \alpha_{k'} = \alpha_{k'} + a_1
        end for
        for all k' \notin F with (k, k') \in R do
            \alpha_{k'} = \alpha_{k'} + a_2
        end for
        F = F \cup {k}
    end while
    w = \arg\min_w L(w) + \lambda \sum_k \alpha_k |w_k|
    return w

This procedure is expressed in terms of a while-structure, which enables the adjustment of penalty parameters for highly correlated features. The processed features are held in the set F to avoid any duplicate processing of features. In the while-structure, features are selected one at a time in the order of larger values of \partial L(w) / \partial w, with w being the zero vector [footnote 2]. The index of the selected feature is denoted as k. Then, for every k' related to k in R, \alpha_{k'} is enlarged by a certain constant a_1 or a_2, depending on whether the pair appears as (k', k) or as (k, k') in R. When the while loop ends, F includes all the features. Finally, w is estimated in terms of LR-lasso, where the second term is further weighted with the thus-estimated \alpha. When some specific w_k becomes zero, this means that the feature is considered as not selected for the classification task.

[Footnote 2: There are other possibilities for this order of processing features, such as randomizing the order. In the Grafting method (Perkins et al., 2003), the processed feature is selected by recalculating \partial L(w) / \partial w every time in the while-structure, which is also possible with relational lasso. This, however, remains as future work.]
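A direct transcription of the \alpha computation in Procedure 1 might look as follows; the function name is ours, and ranking features by the absolute value of the gradient is our reading of "larger values of \partial L(w) / \partial w".

import numpy as np

def relational_lasso_alpha(m, R, v, a1=1.0, a2=1.0):
    # m: number of features; R: iterable of related pairs (p, q);
    # v: gradient dL(w)/dw evaluated at w = 0, used to order the features.
    alpha = np.ones(m)
    F = set()
    order = np.argsort(-np.abs(v))   # larger gradient magnitudes first
    for k in order:
        for p, q in R:
            if q == k and p != k and p not in F:
                alpha[p] += a1       # case (k', k) in R
            if p == k and q != k and q not in F:
                alpha[q] += a2       # case (k, k') in R
        F.add(k)
    return alpha

The final line of Procedure 1 is then an ordinary weighted lasso fit; one way to reuse an unweighted L1 solver for it is to divide the k-th feature column by \alpha_k before fitting and divide the learned w_k by \alpha_k afterwards.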

One further improvement that might be possible for the above procedure is to repeat the while-structure and the estimation of w, so that \alpha and w perform co-training; this also remains for our future work. Moreover, the procedure presented here remains an ad hoc modification of conventional lasso based on our motivation. A more proper mathematical reformulation of this method will be part of our future work.

5 Evaluation

5.1 Experimental Settings

We consider the following two feature sets:

- standard features;
- standard and additional features.

Here, the additional feature set is introduced so that it can be noisy with respect to the classification. The interest therefore lies in whether the performance is better with the additional features than with the standard features only.

We consider three methods:

- conventional lasso (Tibshirani, 1996);
- adaptive lasso (Zou, 2006);
- relational lasso (the proposed method).

We are interested in whether relational lasso performs better than the other methods. In practice the best \lambda parameter for the lasso methods can be selected using cross-validation, but here we examined multiple values of \lambda, which determines the strength of the regularizer's influence as shown in formulas (1) and (2) of Sections 3 and 4. The other parameters introduced in Section 4 are set as follows. Adaptive lasso has the parameter \delta = 1, the common choice as in (Krämer et al., 2009), and the parameter \alpha_i is defined from the initial estimator \hat{w} as

    \alpha_i = \max{ 1 / |\hat{w}_i|, 1 }.

For relational lasso, the parameters a_1 and a_2 were each set to 1. For L1-regularized LR, a coordinate descent method is implemented by modifying LIBLINEAR [footnote 3]. Coordinate descent methods have been widely applied elsewhere because of their suitability for higher-order problems (Yuan et al., 2010).

[Footnote 3: cjlin/liblinear/]

For each pair of a feature set and a method, we considered the following three problems of text classification, polarity estimation, and statistical parsing. The next three sections explain the standard and additional feature sets, the relation R among features, and the evaluation scores.

Task 1: Text Classification. Twenty Newsgroups (20NG) [footnote 4] was used as the dataset for text classification. This collection contains 18,846 English documents partitioned across 20 different newsgroups. The data was sorted by date, with the first 60% used as a training set and the remaining 40% used as a test set. A simple bag of words was used as the standard feature set, whereas the stems of all words were used as additional features. R was defined as the relation between each word and its stem. A multi-class classification task is typically evaluated by macro and micro F1 values, so we provide these values as well.

[Footnote 4: provided by Jason Rennie, csail.mit.edu/jrennie/20newsgroups/]
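For Task 1 (and likewise Task 2 below), R links each surface word to its stem. A minimal sketch of this construction follows; NLTK's Porter stemmer and the "STEM_" prefix (used to keep stem features distinct from surface words) are our assumptions, since the paper does not state which stemmer was used.

from nltk.stem import PorterStemmer

def word_stem_relation(vocabulary):
    # vocabulary: list of word features. Stem features are appended after
    # the word features, and R relates each word p to its stem q (the stem
    # "includes" the word, in the sense of Section 4).
    stemmer = PorterStemmer()
    words = list(vocabulary)
    stems = sorted({"STEM_" + stemmer.stem(w) for w in words})
    index = {f: i for i, f in enumerate(words + stems)}
    R = [(index[w], index["STEM_" + stemmer.stem(w)]) for w in words]
    return index, R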

Table 1: Features used for dependency parsing

    unigram          for w in w_{i-1}, w_i, w_j, w_{j+1}, w_{j+2}, w_{j+3}:
                         pos(w), lex(w)
                     for w in w_{i-1}, w_i, w_j:
                         pos(w_left), lex(w_left)
                     for w in w_{i-1}, w_i:
                         pos(w_right), lex(w_right), pos(w_head), lex(w_head)
    bigram           for (v, w) in (w_i, w_j), (w_{i-1}, w_j):
                         pos(v)pos(w), pos(v)lex(w), lex(v)pos(w), lex(v)lex(w)
    add bigram       for (v, w) in (w_i, w_{j+1}), (w_j, w_{j+1}):
                         pos(v)pos(w), pos(v)lex(w), lex(v)pos(w), lex(v)lex(w)
    add preposition  for w in w_{j+1}, w_{j+2}, w_{j+3} (if w_j is a preposition):
                         lex(w_i)lex(w_j)pos(w), pos(w_i)lex(w_j)lex(w)

Task 2: Polarity Estimation. Polarity dataset v2.0 [footnote 5] was used as the second dataset. Each datum is a movie review in text form, tagged with a positive or negative sentiment. The data consisted of 1,000 positive and 1,000 negative reviews. Since the dataset was small, the average accuracy was obtained through 10-fold cross-validation. The feature sets were basically the same as for Task 1: the standard set was a bag of words, the additional set consisted of word stems, and the relation R related words and their stems. The evaluation was based on the accuracy of the binary classification into positive/negative.

[Footnote 5: provided by Bo Pang, edu/people/pabo/movie-review-data/]

Task 3: Parsing. We also tested the methods on a parsing task, which differs drastically from Tasks 1 and 2. We used CoNLL-X formatted sentences from the Wall Street Journal section of the Penn Treebank. Sections 2-21 were used as training data (39,832 sentences), and Section 23 was used as test data (2,416 sentences). The parsing algorithm we tested is the standard shift-reduce parsing proposed by Nivre (2003), where parsing proceeds by successively determining the relation between two words (denoted as w_i and w_j). Each determination is a 4-class classification problem that is modeled and learned by LR, augmented by the three lasso methods being evaluated.

The standard features used are listed in Table 1. Here, pos(w) indicates the part of speech of the word w, w_i indicates the i-th word of a given sentence, and w_left indicates the already-parsed dependent word of w placed farthest to its left side. The additional feature set included all dependent words involving w_i and w_j, and all bigrams concerning words used as features in the standard set. In our dependency parsing task, we measured word accuracy, defined as the number of words assigned their correct heads divided by the total number of words.
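The tasks above are scored with micro/macro F1 (Task 1), accuracy (Task 2), and word accuracy (Task 3). As a reference point for the two F1 variants on single-label multi-class output, here is a minimal sketch; the helper name is ours, not the paper's evaluation script.

import numpy as np

def micro_macro_f1(y_true, y_pred, n_classes):
    # Per-class counts of true positives, false positives, false negatives.
    tp = np.array([np.sum((y_pred == c) & (y_true == c)) for c in range(n_classes)])
    fp = np.array([np.sum((y_pred == c) & (y_true != c)) for c in range(n_classes)])
    fn = np.array([np.sum((y_pred != c) & (y_true == c)) for c in range(n_classes)])
    # Macro F1 averages the per-class F1 = 2TP / (2TP + FP + FN).
    per_class = 2 * tp / np.maximum(2 * tp + fp + fn, 1)
    # Micro F1 pools the counts first; for single-label tasks it equals accuracy.
    micro = 2 * tp.sum() / max(2 * tp.sum() + fp.sum() + fn.sum(), 1)
    return micro, per_class.mean()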

5.2 Results

Figures 1 and 2 show the results for Task 1, Figure 3 those for Task 2, and Figure 4 those for Task 3. The horizontal axes show the number of features and the vertical axes show accuracy. Each graph has three lines, indicating the conventional, adaptive, and relational lasso methods applied to the standard and additional features all together. Each line has five points, each corresponding to a different value of \lambda. The horizontal coordinate was determined by counting how many features remained non-zero for each value of \lambda.

[Figure 1: Task 1: Micro F1 values for the number of non-zero features.]
[Figure 2: Task 1: Macro F1 values for the number of non-zero features.]

Overall, all figures except Figure 3 show that relational lasso outperformed the adaptive and conventional lasso methods. This was to be expected, since relational lasso has the relation R as input, unlike the conventional lasso method. It confirms that information from the underlying R does improve lasso performance. Curiously, the performance of adaptive lasso in some figures was lower than that of the conventional method; the reason for this is given later in this section.

As Figures 3 and 4 show, the performance was competitive among the three lasso methods when the number of features was small. With a large number of features, however, relational lasso generally outperforms the other lassos.

[Figure 3: Task 2: Accuracy for the number of non-zero features.]
[Figure 4: Task 3: Word accuracy for the number of non-zero features.]

It is difficult to compare performance directly, since different \lambda values lead to different levels of performance, so the maximum performance obtained by varying \lambda is shown in Table 2. Columns correspond to the different lasso methods with the standard and additional feature sets, whereas rows represent the different tasks. Note that the best value of \lambda differs depending on the pair of method and feature set.

Overall, the last column presents the highest performance in each row, thus suggesting the effectiveness of relational lasso. For Task 2, when the standard and additional feature sets were used together, the performance of the conventional method decreased compared to when only the standard set was used. This can happen when the additional feature set is noisy and the regularizer cannot exploit the useful information in the additional features. On the other hand, the performance of relational and adaptive lasso on the same task improved by extracting the useful information; that is, the performance was higher than when using only the standard features. This shows that the use of underlying information among features enhances the overall performance. For Tasks 1 and 3, adding features led to better performance than using only the standard set. Note, though, that the performance increase was greatest for relational lasso. Thus, relational lasso is the best among the three lasso methods at exploiting information, and accordingly performs better in terms of accuracy.

In this table, too, we see that for Task 2 the performance of adaptive lasso is below that of conventional lasso. We consider the reason to be as follows. Since the optimization for adaptive lasso is done in two stages, some of the features are dropped within the first stage; in the second stage, these features can never re-acquire any importance. In other words, feature selection must be done at the very end in order to preserve the possibility for features to re-acquire importance through learning.
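Each plotted point in Figures 1-4 pairs the sparsity obtained under one \lambda with the resulting score. A sketch of that bookkeeping follows, where fit and score are hypothetical stand-ins for the solver and the task metric, not interfaces from the paper.

def sparsity_curve(lambdas, fit, score):
    # For each regularization strength lambda, fit the model, count the
    # weights that stay non-zero (the x-coordinate in Figures 1-4), and
    # record the task score (the y-coordinate).
    points = []
    for lam in lambdas:
        W = fit(lam)                       # hypothetical solver call
        nonzero = int((abs(W) > 0).sum())  # surviving features under this lambda
        points.append((lam, nonzero, score(W)))
    return points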

Before ending, we must note the impact of relational lasso on the speed of overall processing. The pre-processing to obtain \alpha is very fast, since it only scans the features once. The bottleneck of the procedure lies in the estimation of w, since this requires convergence through a repetitive procedure. Therefore, the computational complexity of relational lasso does not change even with \alpha inside the regularizer, and the overall speed of relational lasso is almost the same as that of the conventional method. In contrast, adaptive lasso requires twice as much time, since the bottleneck part is executed twice. In this sense, our method outperforms adaptive lasso in speed and is not significantly slower than the conventional method, at least for the settings we have examined in this section.

Table 2: Maximum performance among various values of \lambda for the three lasso methods

                        Conventional          Adaptive              Relational
                        Std       Std+Add     Std       Std+Add     Std       Std+Add
    Task 1 (micro)      79.67%    81.03%      76.78%    77.16%      78.87%    81.81%
    Task 1 (macro)      79.43%    80.69%      76.53%    77.67%      78.52%    81.72%
    Task 2              -         84.95%      85.56%    86.1%       84.82%    86.45%
    Task 3              -         88.81%      74.64%    87.71%      75.97%    89.23%

6 Conclusion

Relational lasso utilizes relations among features to better exploit information through regularization, especially through lasso methods. The conventional lasso method is not designed to incorporate relations among features, and this leads to biased weighting of a group of features having similar behavior. Relational lasso controls such relations underlying the features by introducing a penalty parameter for each feature. The penalty increases when a feature is related to some other feature having less of a penalty. This parameter is incorporated in the regularization term of the target machine learning function for the optimization objective.

We compared relational lasso to the conventional method and the adaptive lasso proposed by Zou (2006), which also uses an additional parameter per feature. We evaluated the methods on three tasks: text categorization, polarity estimation, and parsing. Relational lasso outperformed the other lasso methods in these tasks. Moreover, the performance of the conventional lasso method deteriorated when noisy features were added, while relational lasso successfully extracted useful information from these features and its performance improved.

As part of our future work, we plan to investigate whether our method works for other tasks, such as tagging, and with other target functions. Moreover, there are many directions we can take to further improve the method, such as through co-training. Last, it will be interesting to see how our method can be mathematically reformulated.

References

Jianqing Fan and Runze Li. 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, vol. 96.

Isabelle Guyon and André Elisseeff. 2003. An introduction to variable and feature selection. The Journal of Machine Learning Research, vol. 3.

G. V. Kass. 1980. An exploratory technique for investigating large quantities of categorical data. Journal of the Royal Statistical Society, Series C, vol. 29.

Keith Knight and Wenjiang Fu. 2000. Asymptotics for lasso-type estimators. The Annals of Statistics, vol. 28.

Ron Kohavi and George H. John. 1997. Wrappers for feature subset selection. Artificial Intelligence, vol. 97.

Nicole Krämer, Juliane Schäfer, and Anne-Laure Boulesteix. 2009. Regularized estimation of large-scale gene association networks using graphical Gaussian models. BMC Bioinformatics, vol. 10.

Christopher Manning and Hinrich Schuetze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.

Joakim Nivre. 2003. An efficient algorithm for projective dependency parsing. Proceedings of the 8th International Workshop on Parsing Technologies.
Simon Perkins, Kevin Lacker, and James Theiler. 2003. Grafting: fast, incremental feature selection by gradient descent in function space. The Journal of Machine Learning Research, vol. 3.

Yvan Saeys, Iñaki Inza, and Pedro Larrañaga. 2007. A review of feature selection techniques in bioinformatics. Bioinformatics, vol. 23, no. 19.

Sam Scott and Stan Matwin. 1999. Feature engineering for text classification. Proceedings of ICML-99, 16th International Conference on Machine Learning.

Robert Tibshirani. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, vol. 58.

Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. 2005. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society, Series B, vol. 67.

Yiming Yang and Jan O. Pedersen. 1997. A comparative study on feature selection in text categorization. Proceedings of the Fourteenth International Conference on Machine Learning.

Guo-Xun Yuan, Kai-Wei Chang, Cho-Jui Hsieh, and Chih-Jen Lin. 2010. A comparison of optimization methods and software for large-scale L1-regularized linear classification. The Journal of Machine Learning Research, vol. 11.

Ming Yuan and Yi Lin. 2006. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, vol. 68.

Hui Zou. 2006. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, vol. 101.

Hui Zou and Trevor Hastie. 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, vol. 67.


More information

Analysis of Continuous Beams in General

Analysis of Continuous Beams in General Analyss of Contnuous Beams n General Contnuous beams consdered here are prsmatc, rgdly connected to each beam segment and supported at varous ponts along the beam. onts are selected at ponts of support,

More information

CSCI 5417 Information Retrieval Systems Jim Martin!

CSCI 5417 Information Retrieval Systems Jim Martin! CSCI 5417 Informaton Retreval Systems Jm Martn! Lecture 11 9/29/2011 Today 9/29 Classfcaton Naïve Bayes classfcaton Ungram LM 1 Where we are... Bascs of ad hoc retreval Indexng Term weghtng/scorng Cosne

More information

Spam Filtering Based on Support Vector Machines with Taguchi Method for Parameter Selection

Spam Filtering Based on Support Vector Machines with Taguchi Method for Parameter Selection E-mal Spam Flterng Based on Support Vector Machnes wth Taguch Method for Parameter Selecton We-Chh Hsu, Tsan-Yng Yu E-mal Spam Flterng Based on Support Vector Machnes wth Taguch Method for Parameter Selecton

More information

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,

More information

Comparing High-Order Boolean Features

Comparing High-Order Boolean Features Brgham Young Unversty BYU cholarsarchve All Faculty Publcatons 2005-07-0 Comparng Hgh-Order Boolean Features Adam Drake adam_drake@yahoo.com Dan A. Ventura ventura@cs.byu.edu Follow ths and addtonal works

More information

CAN COMPUTERS LEARN FASTER? Seyda Ertekin Computer Science & Engineering The Pennsylvania State University

CAN COMPUTERS LEARN FASTER? Seyda Ertekin Computer Science & Engineering The Pennsylvania State University CAN COMPUTERS LEARN FASTER? Seyda Ertekn Computer Scence & Engneerng The Pennsylvana State Unversty sertekn@cse.psu.edu ABSTRACT Ever snce computers were nvented, manknd wondered whether they mght be made

More information

Rules for Using Multi-Attribute Utility Theory for Estimating a User s Interests

Rules for Using Multi-Attribute Utility Theory for Estimating a User s Interests Rules for Usng Mult-Attrbute Utlty Theory for Estmatng a User s Interests Ralph Schäfer 1 DFKI GmbH, Stuhlsatzenhausweg 3, 66123 Saarbrücken Ralph.Schaefer@dfk.de Abstract. In ths paper, we show that Mult-Attrbute

More information

A Robust LS-SVM Regression

A Robust LS-SVM Regression PROCEEDIGS OF WORLD ACADEMY OF SCIECE, EGIEERIG AD ECHOLOGY VOLUME 7 AUGUS 5 ISS 37- A Robust LS-SVM Regresson József Valyon, and Gábor Horváth Abstract In comparson to the orgnal SVM, whch nvolves a quadratc

More information

Query Clustering Using a Hybrid Query Similarity Measure

Query Clustering Using a Hybrid Query Similarity Measure Query clusterng usng a hybrd query smlarty measure Fu. L., Goh, D.H., & Foo, S. (2004). WSEAS Transacton on Computers, 3(3), 700-705. Query Clusterng Usng a Hybrd Query Smlarty Measure Ln Fu, Don Hoe-Lan

More information

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers

Content Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers IOSR Journal of Electroncs and Communcaton Engneerng (IOSR-JECE) e-issn: 78-834,p- ISSN: 78-8735.Volume 9, Issue, Ver. IV (Mar - Apr. 04), PP 0-07 Content Based Image Retreval Usng -D Dscrete Wavelet wth

More information

Conditional Speculative Decimal Addition*

Conditional Speculative Decimal Addition* Condtonal Speculatve Decmal Addton Alvaro Vazquez and Elsardo Antelo Dep. of Electronc and Computer Engneerng Unv. of Santago de Compostela, Span Ths work was supported n part by Xunta de Galca under grant

More information

Fitting: Deformable contours April 26 th, 2018

Fitting: Deformable contours April 26 th, 2018 4/6/08 Fttng: Deformable contours Aprl 6 th, 08 Yong Jae Lee UC Davs Recap so far: Groupng and Fttng Goal: move from array of pxel values (or flter outputs) to a collecton of regons, objects, and shapes.

More information

BioTechnology. An Indian Journal FULL PAPER. Trade Science Inc.

BioTechnology. An Indian Journal FULL PAPER. Trade Science Inc. [Type text] [Type text] [Type text] ISSN : 0974-74 Volume 0 Issue BoTechnology 04 An Indan Journal FULL PAPER BTAIJ 0() 04 [684-689] Revew on Chna s sports ndustry fnancng market based on market -orented

More information

Performance Evaluation of Information Retrieval Systems

Performance Evaluation of Information Retrieval Systems Why System Evaluaton? Performance Evaluaton of Informaton Retreval Systems Many sldes n ths secton are adapted from Prof. Joydeep Ghosh (UT ECE) who n turn adapted them from Prof. Dk Lee (Unv. of Scence

More information

CHAPTER 3 SEQUENTIAL MINIMAL OPTIMIZATION TRAINED SUPPORT VECTOR CLASSIFIER FOR CANCER PREDICTION

CHAPTER 3 SEQUENTIAL MINIMAL OPTIMIZATION TRAINED SUPPORT VECTOR CLASSIFIER FOR CANCER PREDICTION 48 CHAPTER 3 SEQUENTIAL MINIMAL OPTIMIZATION TRAINED SUPPORT VECTOR CLASSIFIER FOR CANCER PREDICTION 3.1 INTRODUCTION The raw mcroarray data s bascally an mage wth dfferent colors ndcatng hybrdzaton (Xue

More information

Overlapping Clustering with Sparseness Constraints

Overlapping Clustering with Sparseness Constraints 2012 IEEE 12th Internatonal Conference on Data Mnng Workshops Overlappng Clusterng wth Sparseness Constrants Habng Lu OMIS, Santa Clara Unversty hlu@scu.edu Yuan Hong MSIS, Rutgers Unversty yhong@cmc.rutgers.edu

More information

Range images. Range image registration. Examples of sampling patterns. Range images and range surfaces

Range images. Range image registration. Examples of sampling patterns. Range images and range surfaces Range mages For many structured lght scanners, the range data forms a hghly regular pattern known as a range mage. he samplng pattern s determned by the specfc scanner. Range mage regstraton 1 Examples

More information