Fast and Scalable Training of Semi-Supervised CRFs with Application to Activity Recognition
Maryam Mahdaviani
Computer Science Department
University of British Columbia
Vancouver, BC, Canada

Tanzeem Choudhury
Intel Research
1100 NE 45th Street
Seattle, WA 98105, USA

Abstract

We present a new and efficient semi-supervised training method for parameter estimation and feature selection in conditional random fields (CRFs). In real-world applications such as activity recognition, unlabeled sensor traces are relatively easy to obtain, whereas labeled examples are expensive and tedious to collect. Furthermore, the ability to automatically select a small subset of discriminatory features from a large pool can be advantageous in terms of computational speed as well as accuracy. In this paper, we introduce the semi-supervised virtual evidence boosting (sVEB) algorithm for training CRFs, a semi-supervised extension to the recently developed virtual evidence boosting (VEB) method for feature selection and parameter learning. The objective function of sVEB combines the unlabeled conditional entropy with the labeled conditional pseudo-likelihood. It reduces the overall system cost as well as the human labeling cost required during training, both important considerations in building real-world inference systems. Experiments on synthetic data and real activity traces collected from wearable sensors illustrate that sVEB benefits from both the use of unlabeled data and automatic feature selection, and outperforms other semi-supervised approaches.

1 Introduction

Conditional random fields (CRFs) are undirected graphical models that have been successfully applied to the classification of relational and temporal data [1]. Training complex CRF models with large numbers of input features is slow, and exact inference is often intractable. The ability to select the most informative features as needed can reduce the training time and the risk of over-fitting of parameters. Furthermore, in complex modeling tasks, obtaining the large amount of labeled data necessary for training can be impractical.
On the other hand, large unlabeled datasets are often easy to obtain, making semi-supervised learning methods appealing in various real-world applications. The goal of our work is to build an activity recognition system that is not only accurate but also scalable, efficient, and easy to train and deploy. An important application domain for activity recognition technologies is health care, especially in supporting elder care, managing cognitive disabilities, and monitoring long-term health. Activity recognition systems will also be useful in smart environments, surveillance, emergency, and military missions. Some of the key challenges faced by current activity inference systems are the amount of human effort spent in labeling and feature engineering, and the computational complexity and cost associated with training. Data labeling also has privacy implications because it often requires human observers or the recording of video. In this paper, we introduce a fast and scalable semi-supervised training algorithm for CRFs and evaluate its classification performance on extensive real-world activity traces gathered using wearable sensors. In addition to being computationally efficient, our proposed method reduces the amount of labeling required during training, which makes it appealing for use in real-world applications.
Several supervised techniques have been proposed for feature selection in CRFs. For discrete features, McCallum [2] suggested an efficient method for feature induction by iteratively increasing conditional log-likelihood. Dietterich [3] applied gradient tree boosting to select features in CRFs by combining boosting with parameter estimation for 1D linear-chain models. Boosted random fields (BRFs) [4] combine boosting and belief propagation for feature selection and parameter estimation for densely connected graphs that have weak pairwise connections. Recently, Liao et al. [5] developed a more general version of BRFs, called virtual evidence boosting (VEB), that does not make any assumptions about graph connectivity or the strength of pairwise connections. The objective function in VEB is a soft version of maximum pseudo-likelihood (MPL), where the goal is to maximize the sum of local log-likelihoods given soft evidence from the neighbors. This objective function is similar to that used in boosting, which makes it suitable for unified feature selection and parameter estimation. The approximation applies to any CRF structure and leads to a significant reduction in training complexity and time.

Semi-supervised training techniques have been extensively explored in the case of generative models and fit naturally under the expectation maximization framework [6]. However, it is not straightforward to incorporate unlabeled data in discriminative models using the traditional conditional likelihood criteria. A few semi-supervised training methods for CRFs have been proposed that introduce dependencies between nearby data points [7, 8]. More recently, Grandvalet and Bengio [9] proposed a minimum entropy regularization framework for incorporating unlabeled data. Jiao et al. [10] used this framework and proposed an objective function that combines the conditional likelihood of the labeled data with the conditional entropy of the unlabeled data to train 1D CRFs, which was extended to 2D lattice structures by Lee et al. [11].
In our work, we combine the minimum entropy regularization framework for incorporating unlabeled data with VEB for training CRFs. The contributions of our work are: (i) semi-supervised virtual evidence boosting (sVEB), an efficient technique for simultaneous feature selection and semi-supervised training of CRFs, which to the best of our knowledge is the first method of its kind; (ii) experimental results that demonstrate the strength of sVEB, which consistently outperforms other training techniques on synthetic data and real-world activity classification tasks; and (iii) an analysis of the time and complexity requirements of our algorithm, and a comparison with existing techniques that highlights the significant computational advantages of our approach. The sVEB algorithm is fast and easy to implement and has the potential of being broadly applicable.

2 Approaches to training of Conditional Random Fields

Maximum likelihood parameter estimation in CRFs involves maximizing the overall conditional log-likelihood, where $\mathbf{x}$ is the observation sequence and $\mathbf{y}$ is the hidden state sequence:

$$L(\theta) = \log p(\mathbf{y} \mid \mathbf{x}, \theta) = \log \frac{\exp\left(\sum_{k=1}^{K} \theta_k f_k(\mathbf{x}, \mathbf{y})\right)}{\sum_{\mathbf{y}'} \exp\left(\sum_{k=1}^{K} \theta_k f_k(\mathbf{x}, \mathbf{y}')\right)} - \frac{\|\theta\|^2}{2\sigma^2} \quad (1)$$

The conditional distribution is defined by a log-linear combination of $K$ feature functions $f_k$, each associated with a weight $\theta_k$. A regularizer on $\theta$ is used to keep the weights from getting too large and to avoid overfitting.¹ For large CRFs exact inference is often intractable, and approximate methods such as mean field approximation or loopy belief propagation [12, 13] are used.

An alternative to approximating the conditional likelihood is to change the objective function. MPL [14] and VEB [5] are such techniques. For MPL the CRF is cut into a set of independent patches; each patch consists of a hidden node or class label $y_i$, the true value of its direct neighbors, and the observations, i.e., the Markov blanket $MB(y_i)$ of the node.
The parameter estimation then becomes maximizing the pseudo log-likelihood:

$$L_{pseudo}(\theta) = \sum_{i=1}^{N} \log p(y_i \mid MB(y_i), \theta) = \sum_{i=1}^{N} \log \frac{\exp\left(\sum_{k=1}^{K} \theta_k f_k(MB(y_i), y_i)\right)}{\sum_{y'_i} \exp\left(\sum_{k=1}^{K} \theta_k f_k(MB(y_i), y'_i)\right)}$$

MPL has been known to over-estimate the dependency parameters in some cases, and there is no general guideline on when it can be safely used [15].

¹ When a prior is used in the maximum likelihood objective function as a regularizer (the second term in eq. (1)), the method is in fact called maximum a posteriori.
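To make the contrast between the two objectives concrete, the following sketch computes both for a tiny binary chain by brute force. This is a toy illustration under our own assumptions, not the paper's implementation: the feature functions, weights, and helper names are hypothetical.

```python
import itertools
import math

def log_conditional_likelihood(theta, feats, x, y, states, sigma2=10.0):
    # Eq. (1): log p(y|x) minus the regularizer, with the partition
    # function computed by enumerating all label sequences (toy sizes only).
    def score(lbl):
        return sum(t * f(x, lbl) for t, f in zip(theta, feats))
    log_z = math.log(sum(math.exp(score(yp))
                         for yp in itertools.product(states, repeat=len(y))))
    return score(tuple(y)) - log_z - sum(t * t for t in theta) / (2.0 * sigma2)

def pseudo_log_likelihood(theta, x, y, states):
    # Pseudo log-likelihood of a chain: each label is conditioned on its
    # Markov blanket, i.e. the two neighbouring true labels plus x_i.
    th_obs, th_pair = theta
    total = 0.0
    for i in range(len(y)):
        def local(lbl):
            s = th_obs * (x[i] == lbl)
            if i > 0:
                s += th_pair * (y[i - 1] == lbl)
            if i < len(y) - 1:
                s += th_pair * (y[i + 1] == lbl)
            return s
        total += local(y[i]) - math.log(sum(math.exp(local(c)) for c in states))
    return total

# hypothetical chain features: label-observation agreement and label persistence
f_obs = lambda x, y: sum(x[t] == y[t] for t in range(len(y)))
f_trans = lambda x, y: sum(y[t] == y[t + 1] for t in range(len(y) - 1))
```

The pseudo-likelihood trades the single global partition function for per-node normalizers over the Markov blanket, which is what makes MPL (and VEB's soft variant) cheap to evaluate.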
2.1 Virtual evidence boosting

By extending the standard LogitBoost algorithm [16], VEB integrates boosting-based feature selection into CRF training. The objective function used in VEB is very similar to MPL, except that VEB uses the messages from the neighboring nodes as virtual evidence instead of using the true labels of the neighbors. The use of virtual evidence helps to reduce over-estimation of neighborhood dependencies. We briefly explain the approach here; please refer to [5] for more detail. VEB incorporates two types of observation nodes: (i) hard evidence corresponding to the observations, $ve(x_i)$, which are indicator functions at the observation values, and (ii) soft evidence corresponding to the messages from neighboring nodes, $ve(n(y_i))$, which are discrete distributions over the hidden states. Let $ve_i \in \{ve(x_i), ve(n(y_i))\}$. The objective function of VEB is as follows:

$$L_{VEB}(\theta) = \sum_{i=1}^{N} \log p(y_i \mid ve_i, \theta), \ \text{where } p(y_i \mid ve_i, \theta) = \frac{\sum_{ve_i} ve_i \exp\left(\sum_{k=1}^{K} \theta_k f_k(ve_i, y_i)\right)}{\sum_{y'_i} \sum_{ve_i} ve_i \exp\left(\sum_{k=1}^{K} \theta_k f_k(ve_i, y'_i)\right)} \quad (2)$$

VEB learns a set of weak learners $f_t$ iteratively and estimates the combined feature $F_t = F_{t-1} + f_t$ by solving the following weighted least-square error (WLSE) problem:

$$f_t = \arg\min_f \sum_{i=1}^{N} w_i\, E\left[(f(ve_i) - z_i)^2\right] = \arg\min_f \sum_{i=1}^{N} \sum_{ve_i} w_i\, p(ve_i)\,(f(ve_i) - z_i)^2 \quad (3)$$

$$\text{where } w_i = p(y_i \mid ve_i)\,(1 - p(y_i \mid ve_i)), \qquad z_i = \frac{y_i - 0.5}{p(y_i \mid ve_i)} \quad (4)$$

The $w_i$ and $z_i$ in equation (4) are the boosting weight and working response, respectively, for the $i$-th data point, exactly as in LogitBoost. However, the least-square problem for VEB (eq. 3) involves an expectation over the virtual evidence values of each of the $N$ points, as opposed to $N$ fixed points in LogitBoost. Although eq. (4) is given for the binary case (i.e., $y_i \in \{0, 1\}$), it is easily extendible to the multi-class case, and we have done that in our experiments. At each iteration, $ve_i$ is updated as the messages from $n(y_i)$ change with the addition of new features. We run belief propagation (BP) to obtain the virtual evidence before each iteration.
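As a minimal sketch of the binary case, assuming $p(y_i \mid ve_i)$ has already been obtained from BP, the boosting terms of eq. (4) and the closed-form WLSE solution for a single indicator feature look like this. The function names and the `(w, z, n)` triple layout are our own illustrative choices, not the paper's code.

```python
def boosting_terms(p_true, y):
    # Eq. (4): LogitBoost weight w_i and working response z_i from the
    # local probability p(y_i | ve_i) of the true binary label y_i.
    w = p_true * (1.0 - p_true)
    z = (y - 0.5) / p_true
    return w, z

def wlse_weight(points):
    # Closed-form weighted least-squares solution for one indicator
    # feature: points is a list of (w_i, z_i, n_ik) triples, where n_ik
    # is the count of feature k in data instance i.
    num = sum(w * z * n for w, z, n in points)
    den = sum(w * n for w, z, n in points)
    return num / den
```

Because the minimizer is a ratio of weighted counts, parameter estimation reduces to feature counting, which is the source of VEB's speed advantage discussed later.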
The CRF feature weights $\theta_k$ are computed by solving the WLSE problem, where for the local features $n_{i,k}$ is the count of feature $k$ in data instance $i$, and for the compatibility features $n_{i,k}$ is the virtual evidence from the neighbors: $\theta_k = \sum_{i=1}^{N} w_i z_i n_{i,k} \,/\, \sum_{i=1}^{N} w_i n_{i,k}$.

2.2 Semi-supervised training

For semi-supervised training of CRFs, Jiao et al. [10] have proposed an algorithm that utilizes unlabeled data via entropy regularization, an extension of the approach proposed by [9] to structured CRF models. The objective function that is maximized during semi-supervised training of CRFs is given below, where $(\mathbf{x}_l, \mathbf{y}_l)$ and $(\mathbf{x}_u, \mathbf{y}_u)$ represent the labeled and unlabeled data, respectively:

$$L_{SS}(\theta) = \log p(\mathbf{y}_l \mid \mathbf{x}_l, \theta) + \alpha \sum_{\mathbf{y}_u} p(\mathbf{y}_u \mid \mathbf{x}_u, \theta) \log p(\mathbf{y}_u \mid \mathbf{x}_u, \theta) - \frac{\|\theta\|^2}{2\sigma^2}$$

By minimizing the conditional entropy of the unlabeled data, the algorithm will generally find a labeling of the unlabeled data that mutually reinforces the supervised labels. One drawback of this objective function is that it is no longer concave, and in general there will be local maxima. The authors [10] showed that this method is still effective in improving an initial supervised model.

3 Semi-supervised virtual evidence boosting

In this work, we develop semi-supervised virtual evidence boosting (sVEB), which combines feature selection with semi-supervised training of CRFs. sVEB extends the VEB framework to take advantage of unlabeled data via minimum entropy regularization, similar to [9, 10, 11]. The new objective function $L_{sVEB}$ we propose is as follows, where $i = 1 \ldots N$ are labeled and $i = N+1 \ldots M$ are unlabeled examples:

$$L_{sVEB} = \sum_{i=1}^{N} \log p(y_i \mid ve_i) + \alpha \sum_{i=N+1}^{M} \sum_{y_i} p(y_i \mid ve_i) \log p(y_i \mid ve_i) \quad (5)$$
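The entropy term that both $L_{SS}$ and the unlabeled part of eq. (5) add can be computed directly from per-instance posteriors. A minimal sketch, with names of our own choosing and the posteriors assumed to come from inference (BP in the paper):

```python
import math

def neg_conditional_entropy(dist):
    # sum_y p(y) log p(y): the negated entropy of one posterior
    # distribution; larger (closer to 0) means a more confident model.
    return sum(p * math.log(p) for p in dist if p > 0.0)

def semi_supervised_objective(logp_labeled, unlabeled_posteriors, alpha):
    # Labeled conditional log-likelihood plus alpha times the negated
    # conditional entropy of the unlabeled data (regularizer omitted).
    return (sum(logp_labeled)
            + alpha * sum(neg_conditional_entropy(d) for d in unlabeled_posteriors))
```

Because the entropy enters with a plus sign on $p \log p$, maximizing the objective pushes the unlabeled posteriors toward confident, low-entropy labelings.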
The sVEB algorithm, similar to VEB, maximizes the conditional soft pseudo-likelihood of the labeled data but in addition minimizes the conditional entropy over the unlabeled data. The $\alpha$ is a tuning parameter that controls how much influence the unlabeled data will have. By considering the soft pseudo-likelihood in $L_{sVEB}$ and using BP to estimate $p(y_i \mid ve_i)$, sVEB can use boosting to learn the parameters of CRFs. The virtual evidence from the neighboring nodes captures the label dependencies. Three different types of feature functions are used: for continuous observations, the weak learner $f_1(x_i)$ is a linear combination of decision stumps; for discrete observations, the weak learner $f_2(x_i)$ is expressed as indicator functions; and for virtual evidence, the weak learner $f_3$ is the weighted sum of two indicator functions (in the binary case). These functions are computed as follows, where $\delta$ is an indicator function, $h$ is a threshold for the decision stump, and $D$ is the number of dimensions of the observations:

$$f_1(x_i) = \theta_1 \delta(x_i \geq h) + \theta_2 \delta(x_i < h), \quad f_2(x_i) = \sum_{d=1}^{D} \theta_d \delta(x_i = d), \quad f_3(y_i) = \sum_{k=0}^{1} \theta_k \delta(y_i = k) \quad (6)$$

Similar to LogitBoost and VEB, the sVEB algorithm estimates a combined feature function $F$ that maximizes the objective by sequentially learning a set of weak learners $f_t$ (i.e., iteratively selecting features). In other words, sVEB solves the following weighted least-square error (WLSE) problem to learn $f_t$:

$$f_t = \arg\min_f \left[ \sum_{i=1}^{N} \sum_{ve_i} w_i\, p(ve_i)\,(f(ve_i) - z_i)^2 + \sum_{i=N+1}^{M} \sum_{y_i} \sum_{ve_i} w_i\, p(ve_i)\,(f(ve_i) - z_i)^2 \right] \quad (7)$$

For labeled data (the first term in eq. 7), the boosting weights $w_i$ and working responses $z_i$ are computed as described in equation (4). But for the case of unlabeled data, the expressions for $w_i$ and $z_i$ become more complicated because of the entropy term.
We present the equations for $w_i$ and $z_i$ below; please refer to the Appendix for the derivations:

$$w_i = \frac{\alpha}{2}\,(1 - p(y_i \mid ve_i))\left[p(y_i \mid ve_i)(1 - p(y_i \mid ve_i)) + \log p(y_i \mid ve_i)\right]$$

$$z_i = \frac{(y_i - 0.5)\, p(y_i \mid ve_i)\,(1 - \log p(y_i \mid ve_i))}{\alpha\left[p(y_i \mid ve_i)(1 - p(y_i \mid ve_i)) + \log p(y_i \mid ve_i)\right]} \quad (8)$$

The soft evidence corresponding to messages from the neighboring nodes is obtained by running BP on the entire training dataset (labeled and unlabeled). The CRF feature weights $\theta_k$ are computed by solving the WLSE problem (eq. (7)): $\theta_k = \sum_{i=1}^{M} w_i z_i n_{i,k} \,/\, \sum_{i=1}^{M} w_i n_{i,k}$.

Algorithm 1 gives the pseudo-code for sVEB. The main difference between VEB and sVEB lies in steps 7-10, where we compute $w_i$ and $z_i$ for all possible values of $y_i$ based on the virtual evidence and observations of the unlabeled training cases. These boosting weights and working responses are computed using equation (8). The weighted least-square error (WLSE) problem of eq. (7), solved when obtaining the weak learner, is different from that of VEB, and its solution results in slightly different CRF feature weights $\theta$. One of the major advantages of VEB and sVEB over ML and sML is that the parameter estimation is done mainly by feature counting. Unlike ML and sML, we do not need to use an optimizer to learn the model parameters, which results in a huge reduction in the time required to train the CRF models. Please refer to the complexity analysis section for details.

4 Experiments

We conduct two sets of experiments to evaluate the performance of the sVEB method for training CRFs and the advantage of performing feature selection as part of semi-supervised training. In the first set of experiments, we analyze how much the complexity of the underlying CRF and the tuning parameter $\alpha$ affect the performance, using synthetic data. In the second set of experiments, we evaluate the benefit of feature selection and of using unlabeled data on two real-world activity datasets. We compare the performance of the semi-supervised virtual evidence boosting (sVEB) presented in this paper to the semi-supervised maximum likelihood (sML) method [10].
In addition, for the activity datasets, we also evaluate an alternative approach (sML+Boost), where a subset of features is selected in advance using boosting. To benchmark the performance of the semi-supervised techniques, we also evaluate three different supervised training approaches, namely the maximum likelihood method using all observed features (ML), (ML+Boost) using a subset of features selected in advance, and virtual evidence boosting (VEB). All the learned models are tested using the standard maximum a posteriori (MAP) estimate and belief propagation. We used an $\ell_2$-norm shrinkage prior as a regularizer for the ML and sML methods.

Algorithm 1: Training CRFs using semi-supervised VEB
inputs: structure of the CRF and training data $(x_i, y_i)$, with $y_i \in \{0, 1\}$, $1 \leq i \leq M$, and $F_0 = 0$
output: learned $F_T$ and the corresponding weights $\theta$
1:  for $t = 1, 2, \ldots, T$ do
2:    Run BP using $F_{t-1}$ to get the virtual evidences $ve_i$;
3:    for $i = 1, 2, \ldots, N$ do
4:      Compute the likelihood $p(y_i \mid ve_i)$;
5:      Compute $w_i$ and $z_i$ using equation (4);
6:    end
7:    for $i = N+1, \ldots, M$ and $y_i = 0, 1$ do
8:      Compute the likelihood $p(y_i \mid ve_i)$;
9:      Compute $w_i$ and $z_i$ using equation (8);
10:   end
11:   Obtain the best weak learner $f_t$ according to equation (7) and update $F_t = F_{t-1} + f_t$;
12: end

Figure 1: Accuracy of sML and sVEB for (a) different numbers of local features, (b) different numbers of states, and (c) different values of $\alpha$.

4.1 Synthetic data

The synthetic data is generated using a first-order Markov chain with self-transition probabilities set to 0.9. For each model, we generate five sequences of length 4,000 and divide each trace into sequences of length 200. We randomly choose 50% of them as labeled and the other 50% as unlabeled training data. We perform leave-one-out cross-validation and report the average accuracies. To measure how the complexity of the CRFs affects the performance of the different semi-supervised methods, we vary the number of local features and the number of states. First, we compare the performance of sVEB and sML on CRFs with an increasing number of features. The number of states is set to 10 and the number of observation features is varied from 20 to 400. Figure (1a) shows the average accuracy for the two semi-supervised training methods and their confidence intervals.
The experimental results demonstrate that sVEB outperforms sML as we increase the dimension of the observations (i.e., the number of local features). In the second experiment, we increase the number of classes and keep the dimension of the observations fixed at 100. Figure (1b) demonstrates that sVEB again outperforms sML as we increase the number of states. Given the same amount of training data, sVEB is less likely to overfit because of the feature selection step. In both these experiments we set the value of the tuning parameter $\alpha$ to 1.5. To explore the effect of the tuning parameter, we vary the value of $\alpha$ from 0.1 to 10, while setting the number of states to 10 and the number of dimensions to 100. Figure (1c) shows that the performance of both sML and sVEB depends on the value of $\alpha$, and that the accuracy decreases for large $\alpha$, similar to the sML results presented in [10].
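To make the training loop used in these experiments concrete, one boosting iteration of Algorithm 1 can be sketched as below, with BP abstracted away (its per-instance posteriors `ve` are passed in), a single decision stump standing in for the weak learners of eq. (6), and eqs. (4) and (8) written out directly. This is an illustrative reconstruction under those assumptions, not the authors' implementation.

```python
import math

def fit_stump(points):
    # WLSE weak learner, continuous case of eq. (6): search thresholds h
    # and pick leaf values minimizing sum_i w_i * (f(x_i) - z_i)^2,
    # where points is a list of (x_i, w_i, z_i) triples.
    best = None
    for h in sorted(set(x for x, _, _ in points)):
        hi = [(w, z) for x, w, z in points if x >= h]
        lo = [(w, z) for x, w, z in points if x < h]
        t_hi = sum(w * z for w, z in hi) / sum(w for w, _ in hi) if hi else 0.0
        t_lo = sum(w * z for w, z in lo) / sum(w for w, _ in lo) if lo else 0.0
        err = sum(w * ((t_hi if x >= h else t_lo) - z) ** 2 for x, w, z in points)
        if best is None or err < best[0]:
            best = (err, h, t_hi, t_lo)
    return best  # (error, threshold, theta1, theta2)

def sveb_iteration(ve, xs, labeled, unlabeled, alpha):
    # One iteration of Algorithm 1 (steps 2-11): build the weighted
    # regression points from labeled (eq. 4) and unlabeled (eq. 8)
    # instances, then fit the weak learner.
    pts = []
    for i, y in labeled:                      # eq. (4)
        p = ve[i][y]
        pts.append((xs[i], p * (1 - p), (y - 0.5) / p))
    for i in unlabeled:                       # eq. (8), one term per label
        for y in (0, 1):
            p = ve[i][y]
            b = p * (1 - p) + math.log(p)
            pts.append((xs[i], 0.5 * alpha * (1 - p) * b,
                        (y - 0.5) * p * (1 - math.log(p)) / (alpha * b)))
    return fit_stump(pts)
```

Note that the eq. (8) weights can be negative; the sketch applies them verbatim, with the paper's sign conventions carried over from its Newton-step derivation.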
Figure 2: An example of a sensor trace and a classification trace.

Table 1: Accuracy ± 95% confidence interval of the supervised algorithms (ML with all observations, ML+Boost, and VEB) on activity datasets 1 and 2, for increasing fractions of labeled training data.

4.2 Activity dataset

We collected two activity datasets using wearable sensors, which include audio, acceleration, light, temperature, pressure, and humidity. The first dataset contains instances of 8 basic physical activities (e.g., walking, running, going up/down stairs, going up/down an elevator, sitting, standing, and brushing teeth) from 7 different users. There is on average 30 minutes of data per user, for a total of 3.5 hours of data that is manually labeled for training and testing purposes. The data is segmented into 0.25s chunks, and for each chunk we compute 651 features, which include signal energy in log and linear frequency bands, autocorrelation, different entropy measures, means, variances, etc. The features are chosen based on what is used in the existing activity recognition literature, plus a few additional ones that we felt could be useful. During training, the data from each person is divided into sequences of length 200 and fed into linear-chain CRFs as observations. The second dataset contains instances of 5 different indoor activities (e.g., computer usage, meal, meeting, watching TV, and sleeping) from a single user. We recorded 15 hours of sensor traces over 12 days. As this set contains longer time-scale activities, the data is segmented into 1-minute chunks and 321 different features are computed, similarly to the first dataset. There are a total of 907 data points. These features are fed into CRFs as observations; one linear-chain CRF is created per day. We evaluate the performance of supervised and semi-supervised training algorithms on these two datasets.
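As an illustration of the kind of per-chunk features described above, a minimal sketch of our own simplified selection (mean, variance, signal energy, and lag-1 autocorrelation; the real pipeline computes 651 such features, including frequency-band energies):

```python
def chunk_features(samples):
    # A hypothetical minimal feature vector for one sensor chunk:
    # [mean, variance, signal energy, lag-1 autocorrelation].
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    energy = sum(s * s for s in samples) / n
    num = sum((samples[i] - mean) * (samples[i + 1] - mean) for i in range(n - 1))
    acf1 = num / (n * var) if var > 0 else 0.0
    return [mean, var, energy, acf1]
```

Each 0.25s (or 1-minute) chunk would be mapped to such a vector, and the resulting feature sequences become the CRF observations.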
For the semi-supervised case, we randomly select 40% of the sequences for a given person or a given day as labeled, and a different subset as the unlabeled training data. We compare the performance of sML and sVEB as we incorporate more unlabeled data (20%, 40%, and 60%) into the training process. We also compare the supervised techniques, ML, ML+Boost, and VEB, with increasing amounts of labeled data. For all the experiments, the tuning parameter $\alpha$ is set to 1.5. We perform leave-one-person-out cross-validation on dataset 1 and leave-one-day-out cross-validation on dataset 2, and report the average accuracies. The number of features chosen (i.e., the number of boosting iterations) is set to 50 for both datasets; including more features did not significantly improve the classification performance.

For both datasets, incorporating more unlabeled data improves accuracy. The sML estimate of the CRF parameters performs the worst. Even with the shrinkage prior, the high dimensionality can still cause over-fitting and lower the accuracy, whereas parameter estimation and feature selection via sVEB consistently results in the highest accuracy. The (sML+Boost) method performs better than sML but does not perform as well as when feature selection and parameter estimation are done within a unified framework, as in sVEB. Table 2 summarizes these results.

Table 2: Accuracy ± 95% confidence interval of the semi-supervised algorithms (sML with all observations, sML+Boost, and sVEB) on activity datasets 1 and 2, for 20%, 40%, and 60% unlabeled training data.

The results of the supervised learning algorithms are presented in Table 1. Similar to the semi-supervised results, the VEB method performs the best, ML is the worst performer, and the accuracy numbers for the (ML+Boost) method are in between. The accuracy increases as we incorporate more labeled data during training. To evaluate sVEB when only a small amount of labeled data is available, we performed another set of experiments on datasets 1 and 2, where only 5% and 20% of the training data is labeled, respectively. We used all the available unlabeled data during training. The results are shown in Table 3. These experiments clearly demonstrate that although adding more unlabeled data is not as helpful as incorporating more labeled data, the use of cheap unlabeled data along with feature selection can significantly boost the performance of the models.

Table 3: Accuracy ± 95% confidence interval of the semi-supervised algorithms on activity datasets 1 and 2, with 5% and 20% of the training data labeled, respectively.

4.3 Complexity Analysis

The sVEB and VEB algorithms are significantly faster than ML and sML because they do not need to use optimizers such as quasi-Newton methods to learn the weight parameters. For each training iteration in sML the cost of running BP is $O(c_l n s^2 + c_u n^2 s^3)$ [10], whereas the cost of each boosting iteration in sVEB is $O((c_l + c_u) n s^2)$. An efficient entropy gradient computation is proposed in [17], which reduces the cost of sML to $O((c_l + c_u) n s^2)$ but still requires an optimizer to maximize the log-likelihood. Moreover, the number of training iterations needed is usually much higher than the number of boosting iterations, because optimizers such as L-BFGS require many more iterations to reach convergence in high-dimensional spaces. For example, for dataset 1, we needed about 1000 iterations for sML to converge, but we ran sVEB for only 50 iterations.
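Plugging representative values into the stated per-iteration costs makes the gap concrete. This is a toy comparison with hypothetical constants, ignoring the constant factors hidden by the O-notation:

```python
def bp_cost_sml(c_l, c_u, n, s):
    # Per-iteration BP cost of sML as stated above: O(c_l*n*s^2 + c_u*n^2*s^3).
    return c_l * n * s ** 2 + c_u * n ** 2 * s ** 3

def bp_cost_sveb(c_l, c_u, n, s):
    # Per-iteration boosting cost of sVEB: O((c_l + c_u) * n * s^2).
    return (c_l + c_u) * n * s ** 2
```

With n = 200 and s = 10, the unlabeled $n^2 s^3$ term dominates sML's per-iteration cost, and sVEB additionally needs far fewer iterations (50 vs. roughly 1000 above).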
Table 4 shows the time for performing the experiments on the activity datasets, as described in the previous section.² On the other hand, the space complexity of sVEB is linearly smaller than that of sML and ML. Similar to ML, sML has a space complexity of $O(n s^2 D)$ in the best case [10]. VEB and sVEB have a lower space cost of $O(n s^2 D_b)$ because of the feature selection step, as usually $D_b \ll D$. Therefore, the difference becomes significant when we are dealing with high-dimensional data, particularly if it includes a large number of redundant features.

Notation: $n$ is the length of a training sequence, $c_l$ the number of labeled training sequences, $c_u$ the number of unlabeled training sequences, $s$ the number of states, and $D$, $D_b$ the dimensions of the observations.

Table 4: Training time (hours) of the different algorithms (ML, ML+Boost, VEB, sML, sML+Boost, sVEB) on datasets 1 and 2.

5 Conclusion

We presented sVEB, a new semi-supervised training method for CRFs, that can simultaneously select discriminative features via a modified LogitBoost and utilize unlabeled data via minimum-entropy regularization. Our experimental results demonstrate that sVEB significantly outperforms other training techniques in real-world activity recognition problems. The unified framework for feature selection and semi-supervised training presented in this paper reduces the computational and human labeling costs, which are often the major bottlenecks in building large classification systems.

Acknowledgments

The authors would like to thank Nando de Freitas and Lin Liao for many helpful discussions. This work was supported by the NSF under grant number IIS and an NSERC Canada Graduate Scholarship.

² The experiments were run in a Matlab environment and as a result they took longer.

References

[1] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of the International Conference on Machine Learning (ICML), 2001.
[2] Andrew McCallum. Efficiently inducing features of conditional random fields. In Proc. of the Conference on Uncertainty in Artificial Intelligence (UAI), 2003.
[3] T. Dietterich, A. Ashenfelter, and Y. Bulatov. Training conditional random fields via gradient tree boosting. In Proc. of the International Conference on Machine Learning (ICML), 2004.
[4] A. Torralba, K. P. Murphy, and W. T. Freeman. Contextual models for object detection using boosted random fields. In Advances in Neural Information Processing Systems (NIPS), 2004.
[5] L. Liao, T. Choudhury, D. Fox, and H. Kautz. Training conditional random fields using virtual evidence boosting. In Proc. of the International Joint Conference on Artificial Intelligence (IJCAI), 2007.
[6] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 2000.
[7] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proc. of the International Conference on Machine Learning (ICML), 2003.
[8] W. Li and A. McCallum. Semi-supervised sequence modeling with syntactic topic models. In Proc. of the National Conference on Artificial Intelligence (AAAI), 2005.
[9] Y. Grandvalet and Y. Bengio. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems (NIPS), 2004.
[10] F. Jiao, S. Wang, C. H. Lee, R. Greiner, and D. Schuurmans. Semi-supervised conditional random fields for improved sequence segmentation and labeling. In Proc. of the International Committee on Computational Linguistics and the Association for Computational Linguistics (COLING/ACL), 2006.
[11] C. H. Lee, S. Wang, F. Jiao, D. Schuurmans, and R. Greiner. Learning to model spatial dependency: Semi-supervised discriminative random fields. In NIPS, 2006.
[12] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51(7), 2005.
[13] Y. Weiss. Comparing mean field method and belief propagation for approximate inference in MRFs. 2001.
[14] J. Besag. Statistical analysis of non-lattice data. The Statistician, 24, 1975.
[15] C. J. Geyer and E. A. Thompson. Constrained Monte Carlo maximum likelihood for dependent data. Journal of the Royal Statistical Society, 1992.
[16] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28(2):337-407, 2000.
[17] G. Mann and A. McCallum. Efficient computation of entropy gradient for semi-supervised conditional random fields. In Human Language Technologies (HLT-NAACL), 2007.

Appendix

In this section, we show how we derived the equations for $w_i$ and $z_i$ (eq. 8):

$$L_F = L_{sVEB} = L_{VEB} - \alpha H_{emp} = \sum_{i=1}^{N} \log p(y_i \mid ve_i) + \alpha \sum_{i=N+1}^{M} \sum_{y_i} p(y_i \mid ve_i) \log p(y_i \mid ve_i)$$

As in LogitBoost, the likelihood function $L_F$ is maximized by learning an ensemble of weak learners. We start with an empty ensemble $F = 0$ and iteratively add the next best weak learner $f_t$ by computing the Newton update $s/H$, where $s$ and $H$ are the first and second derivatives, respectively, of $L_F$ with respect to $f(ve_i, y_i)$:

$$F(ve_i, y_i) \leftarrow F(ve_i, y_i) + \frac{s}{H}, \quad \text{where } s = \frac{\partial L_{F+f}}{\partial f}\bigg|_{f=0} \ \text{and} \ H = \frac{\partial^2 L_{F+f}}{\partial f^2}\bigg|_{f=0}$$

$$s = \sum_{i=1}^{N} 2(2y_i - 1)(1 - p(y_i \mid ve_i)) + \alpha \sum_{i=N+1}^{M} \sum_{y_i} 2(2y_i - 1)(1 - p(y_i \mid ve_i))\, p(y_i \mid ve_i)\,(1 - \log p(y_i \mid ve_i))$$

$$H = \sum_{i=1}^{N} 4\, p(y_i \mid ve_i)(1 - p(y_i \mid ve_i))(2y_i - 1)^2 + \frac{\alpha}{2} \sum_{i=N+1}^{M} \sum_{y_i} 4 (2y_i - 1)^2 (1 - p(y_i \mid ve_i))\left[p(y_i \mid ve_i)(1 - p(y_i \mid ve_i)) + \log p(y_i \mid ve_i)\right]$$

This yields the update

$$F \leftarrow F + \frac{\sum_{i=1}^{N} z_i w_i + \sum_{i=N+1}^{M} \sum_{y_i} z_i w_i}{\sum_{i=1}^{N} w_i + \sum_{i=N+1}^{M} \sum_{y_i} w_i}$$

where

$$z_i = \begin{cases} \dfrac{y_i - 0.5}{p(y_i \mid ve_i)} & \text{if } 1 \leq i \leq N \quad \text{(eq. 4)} \\[2mm] \dfrac{(y_i - 0.5)\, p(y_i \mid ve_i)\,(1 - \log p(y_i \mid ve_i))}{\alpha\left[p(y_i \mid ve_i)(1 - p(y_i \mid ve_i)) + \log p(y_i \mid ve_i)\right]} & \text{if } N < i \leq M \quad \text{(eq. 8)} \end{cases}$$

$$w_i = \begin{cases} p(y_i \mid ve_i)\,(1 - p(y_i \mid ve_i)) & \text{if } 1 \leq i \leq N \quad \text{(eq. 4)} \\[2mm] \dfrac{\alpha}{2}\,(1 - p(y_i \mid ve_i))\left[p(y_i \mid ve_i)(1 - p(y_i \mid ve_i)) + \log p(y_i \mid ve_i)\right] & \text{if } N < i \leq M \quad \text{(eq. 8)} \end{cases}$$

At iteration $t$ we get the best weak learner $f_t$ by solving the WLSE problem in eq. 7.
More informationMachine Learning 9. week
Machne Learnng 9. week Mappng Concept Radal Bass Functons (RBF) RBF Networks 1 Mappng It s probably the best scenaro for the classfcaton of two dataset s to separate them lnearly. As you see n the below
More informationSLAM Summer School 2006 Practical 2: SLAM using Monocular Vision
SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,
More informationX- Chart Using ANOM Approach
ISSN 1684-8403 Journal of Statstcs Volume 17, 010, pp. 3-3 Abstract X- Chart Usng ANOM Approach Gullapall Chakravarth 1 and Chaluvad Venkateswara Rao Control lmts for ndvdual measurements (X) chart are
More informationOutline. Type of Machine Learning. Examples of Application. Unsupervised Learning
Outlne Artfcal Intellgence and ts applcatons Lecture 8 Unsupervsed Learnng Professor Danel Yeung danyeung@eee.org Dr. Patrck Chan patrckchan@eee.org South Chna Unversty of Technology, Chna Introducton
More informationA Binarization Algorithm specialized on Document Images and Photos
A Bnarzaton Algorthm specalzed on Document mages and Photos Ergna Kavalleratou Dept. of nformaton and Communcaton Systems Engneerng Unversty of the Aegean kavalleratou@aegean.gr Abstract n ths paper, a
More informationDiscriminative Dictionary Learning with Pairwise Constraints
Dscrmnatve Dctonary Learnng wth Parwse Constrants Humn Guo Zhuoln Jang LARRY S. DAVIS UNIVERSITY OF MARYLAND Nov. 6 th, Outlne Introducton/motvaton Dctonary Learnng Dscrmnatve Dctonary Learnng wth Parwse
More informationAdaptive Transfer Learning
Adaptve Transfer Learnng Bn Cao, Snno Jaln Pan, Yu Zhang, Dt-Yan Yeung, Qang Yang Hong Kong Unversty of Scence and Technology Clear Water Bay, Kowloon, Hong Kong {caobn,snnopan,zhangyu,dyyeung,qyang}@cse.ust.hk
More informationParallelism for Nested Loops with Non-uniform and Flow Dependences
Parallelsm for Nested Loops wth Non-unform and Flow Dependences Sam-Jn Jeong Dept. of Informaton & Communcaton Engneerng, Cheonan Unversty, 5, Anseo-dong, Cheonan, Chungnam, 330-80, Korea. seong@cheonan.ac.kr
More informationNUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS
ARPN Journal of Engneerng and Appled Scences 006-017 Asan Research Publshng Network (ARPN). All rghts reserved. NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS Igor Grgoryev, Svetlana
More informationEXTENDED BIC CRITERION FOR MODEL SELECTION
IDIAP RESEARCH REPORT EXTEDED BIC CRITERIO FOR ODEL SELECTIO Itshak Lapdot Andrew orrs IDIAP-RR-0-4 Dalle olle Insttute for Perceptual Artfcal Intellgence P.O.Box 59 artgny Valas Swtzerland phone +4 7
More informationTerm Weighting Classification System Using the Chi-square Statistic for the Classification Subtask at NTCIR-6 Patent Retrieval Task
Proceedngs of NTCIR-6 Workshop Meetng, May 15-18, 2007, Tokyo, Japan Term Weghtng Classfcaton System Usng the Ch-square Statstc for the Classfcaton Subtask at NTCIR-6 Patent Retreval Task Kotaro Hashmoto
More informationA Statistical Model Selection Strategy Applied to Neural Networks
A Statstcal Model Selecton Strategy Appled to Neural Networks Joaquín Pzarro Elsa Guerrero Pedro L. Galndo joaqun.pzarro@uca.es elsa.guerrero@uca.es pedro.galndo@uca.es Dpto Lenguajes y Sstemas Informátcos
More informationModeling Waveform Shapes with Random Effects Segmental Hidden Markov Models
Modelng Waveform Shapes wth Random Effects Segmental Hdden Markov Models Seyoung Km, Padhrac Smyth Department of Computer Scence Unversty of Calforna, Irvne CA 9697-345 {sykm,smyth}@cs.uc.edu Abstract
More informationClassifier Selection Based on Data Complexity Measures *
Classfer Selecton Based on Data Complexty Measures * Edth Hernández-Reyes, J.A. Carrasco-Ochoa, and J.Fco. Martínez-Trndad Natonal Insttute for Astrophyscs, Optcs and Electroncs, Lus Enrque Erro No.1 Sta.
More informationBAYESIAN MULTI-SOURCE DOMAIN ADAPTATION
BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION SHI-LIANG SUN, HONG-LEI SHI Department of Computer Scence and Technology, East Chna Normal Unversty 500 Dongchuan Road, Shangha 200241, P. R. Chna E-MAIL: slsun@cs.ecnu.edu.cn,
More informationA Robust Method for Estimating the Fundamental Matrix
Proc. VIIth Dgtal Image Computng: Technques and Applcatons, Sun C., Talbot H., Ourseln S. and Adraansen T. (Eds.), 0- Dec. 003, Sydney A Robust Method for Estmatng the Fundamental Matrx C.L. Feng and Y.S.
More informationHybridization of Expectation-Maximization and K-Means Algorithms for Better Clustering Performance
BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 16, No 2 Sofa 2016 Prnt ISSN: 1311-9702; Onlne ISSN: 1314-4081 DOI: 10.1515/cat-2016-0017 Hybrdzaton of Expectaton-Maxmzaton
More informationy and the total sum of
Lnear regresson Testng for non-lnearty In analytcal chemstry, lnear regresson s commonly used n the constructon of calbraton functons requred for analytcal technques such as gas chromatography, atomc absorpton
More informationSmoothing Spline ANOVA for variable screening
Smoothng Splne ANOVA for varable screenng a useful tool for metamodels tranng and mult-objectve optmzaton L. Rcco, E. Rgon, A. Turco Outlne RSM Introducton Possble couplng Test case MOO MOO wth Game Theory
More informationUnsupervised Learning
Pattern Recognton Lecture 8 Outlne Introducton Unsupervsed Learnng Parametrc VS Non-Parametrc Approach Mxture of Denstes Maxmum-Lkelhood Estmates Clusterng Prof. Danel Yeung School of Computer Scence and
More informationJoint Recognition of Multiple Concurrent Activities using Factorial Conditional Random Fields
Jont Recognton of Multple Concurrent Actvtes usng Factoral Condtonal Random Felds Tsu-yu Wu and Cha-chun Lan and Jane Yung-jen Hsu Department of Computer Scence and Informaton Engneerng Natonal Tawan Unversty
More informationNetwork Intrusion Detection Based on PSO-SVM
TELKOMNIKA Indonesan Journal of Electrcal Engneerng Vol.1, No., February 014, pp. 150 ~ 1508 DOI: http://dx.do.org/10.11591/telkomnka.v1.386 150 Network Intruson Detecton Based on PSO-SVM Changsheng Xang*
More informationMultiple Frame Motion Inference Using Belief Propagation
Multple Frame Moton Inference Usng Belef Propagaton Jang Gao Janbo Sh The Robotcs Insttute Department of Computer and Informaton Scence Carnege Mellon Unversty Unversty of Pennsylvana Pttsburgh, PA 53
More informationCollaboratively Regularized Nearest Points for Set Based Recognition
Academc Center for Computng and Meda Studes, Kyoto Unversty Collaboratvely Regularzed Nearest Ponts for Set Based Recognton Yang Wu, Mchhko Mnoh, Masayuk Mukunok Kyoto Unversty 9/1/013 BMVC 013 @ Brstol,
More informationImprovement of Spatial Resolution Using BlockMatching Based Motion Estimation and Frame. Integration
Improvement of Spatal Resoluton Usng BlockMatchng Based Moton Estmaton and Frame Integraton Danya Suga and Takayuk Hamamoto Graduate School of Engneerng, Tokyo Unversty of Scence, 6-3-1, Nuku, Katsuska-ku,
More informationAn Application of the Dulmage-Mendelsohn Decomposition to Sparse Null Space Bases of Full Row Rank Matrices
Internatonal Mathematcal Forum, Vol 7, 2012, no 52, 2549-2554 An Applcaton of the Dulmage-Mendelsohn Decomposton to Sparse Null Space Bases of Full Row Rank Matrces Mostafa Khorramzadeh Department of Mathematcal
More informationDomain-Constrained Semi-Supervised Mining of Tracking Models in Sensor Networks
Doman-Constraned Sem-Supervsed Mnng of Trackng Models n Sensor Networks Rong Pan 1, Junhu Zhao 2, Vncent Wenchen Zheng 1, Jeffrey Junfeng Pan 1, Dou Shen 1, Snno Jaln Pan 1 and Qang Yang 1 1 Hong Kong
More informationLecture 5: Multilayer Perceptrons
Lecture 5: Multlayer Perceptrons Roger Grosse 1 Introducton So far, we ve only talked about lnear models: lnear regresson and lnear bnary classfers. We noted that there are functons that can t be represented
More informationA Fast Content-Based Multimedia Retrieval Technique Using Compressed Data
A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,
More informationVirtual Machine Migration based on Trust Measurement of Computer Node
Appled Mechancs and Materals Onlne: 2014-04-04 ISSN: 1662-7482, Vols. 536-537, pp 678-682 do:10.4028/www.scentfc.net/amm.536-537.678 2014 Trans Tech Publcatons, Swtzerland Vrtual Machne Mgraton based on
More informationSum of Linear and Fractional Multiobjective Programming Problem under Fuzzy Rules Constraints
Australan Journal of Basc and Appled Scences, 2(4): 1204-1208, 2008 ISSN 1991-8178 Sum of Lnear and Fractonal Multobjectve Programmng Problem under Fuzzy Rules Constrants 1 2 Sanjay Jan and Kalash Lachhwan
More informationUsing Neural Networks and Support Vector Machines in Data Mining
Usng eural etworks and Support Vector Machnes n Data Mnng RICHARD A. WASIOWSKI Computer Scence Department Calforna State Unversty Domnguez Hlls Carson, CA 90747 USA Abstract: - Multvarate data analyss
More informationHelsinki University Of Technology, Systems Analysis Laboratory Mat Independent research projects in applied mathematics (3 cr)
Helsnk Unversty Of Technology, Systems Analyss Laboratory Mat-2.08 Independent research projects n appled mathematcs (3 cr) "! #$&% Antt Laukkanen 506 R ajlaukka@cc.hut.f 2 Introducton...3 2 Multattrbute
More informationA Robust LS-SVM Regression
PROCEEDIGS OF WORLD ACADEMY OF SCIECE, EGIEERIG AD ECHOLOGY VOLUME 7 AUGUS 5 ISS 37- A Robust LS-SVM Regresson József Valyon, and Gábor Horváth Abstract In comparson to the orgnal SVM, whch nvolves a quadratc
More informationOnline Detection and Classification of Moving Objects Using Progressively Improving Detectors
Onlne Detecton and Classfcaton of Movng Objects Usng Progressvely Improvng Detectors Omar Javed Saad Al Mubarak Shah Computer Vson Lab School of Computer Scence Unversty of Central Florda Orlando, FL 32816
More informationSimulation: Solving Dynamic Models ABE 5646 Week 11 Chapter 2, Spring 2010
Smulaton: Solvng Dynamc Models ABE 5646 Week Chapter 2, Sprng 200 Week Descrpton Readng Materal Mar 5- Mar 9 Evaluatng [Crop] Models Comparng a model wth data - Graphcal, errors - Measures of agreement
More informationFuzzy Filtering Algorithms for Image Processing: Performance Evaluation of Various Approaches
Proceedngs of the Internatonal Conference on Cognton and Recognton Fuzzy Flterng Algorthms for Image Processng: Performance Evaluaton of Varous Approaches Rajoo Pandey and Umesh Ghanekar Department of
More informationMathematics 256 a course in differential equations for engineering students
Mathematcs 56 a course n dfferental equatons for engneerng students Chapter 5. More effcent methods of numercal soluton Euler s method s qute neffcent. Because the error s essentally proportonal to the
More informationMULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION
MULTISPECTRAL IMAGES CLASSIFICATION BASED ON KLT AND ATR AUTOMATIC TARGET RECOGNITION Paulo Quntlano 1 & Antono Santa-Rosa 1 Federal Polce Department, Brasla, Brazl. E-mals: quntlano.pqs@dpf.gov.br and
More informationJournal of Chemical and Pharmaceutical Research, 2014, 6(6): Research Article
Avalable onlne www.jocpr.com Journal of Chemcal and Pharmaceutcal Research, 2014, 6(6):2512-2520 Research Artcle ISSN : 0975-7384 CODEN(USA) : JCPRC5 Communty detecton model based on ncremental EM clusterng
More informationMixed Linear System Estimation and Identification
48th IEEE Conference on Decson and Control, Shangha, Chna, December 2009 Mxed Lnear System Estmaton and Identfcaton A. Zymns S. Boyd D. Gornevsky Abstract We consder a mxed lnear system model, wth both
More informationIncremental Learning with Support Vector Machines and Fuzzy Set Theory
The 25th Workshop on Combnatoral Mathematcs and Computaton Theory Incremental Learnng wth Support Vector Machnes and Fuzzy Set Theory Yu-Mng Chuang 1 and Cha-Hwa Ln 2* 1 Department of Computer Scence and
More informationPositive Semi-definite Programming Localization in Wireless Sensor Networks
Postve Sem-defnte Programmng Localzaton n Wreless Sensor etworks Shengdong Xe 1,, Jn Wang, Aqun Hu 1, Yunl Gu, Jang Xu, 1 School of Informaton Scence and Engneerng, Southeast Unversty, 10096, anjng Computer
More informationHigh-Boost Mesh Filtering for 3-D Shape Enhancement
Hgh-Boost Mesh Flterng for 3-D Shape Enhancement Hrokazu Yagou Λ Alexander Belyaev y Damng We z Λ y z ; ; Shape Modelng Laboratory, Unversty of Azu, Azu-Wakamatsu 965-8580 Japan y Computer Graphcs Group,
More informationOptimizing Document Scoring for Query Retrieval
Optmzng Document Scorng for Query Retreval Brent Ellwen baellwe@cs.stanford.edu Abstract The goal of ths project was to automate the process of tunng a document query engne. Specfcally, I used machne learnng
More informationModeling Inter-cluster and Intra-cluster Discrimination Among Triphones
Modelng Inter-cluster and Intra-cluster Dscrmnaton Among Trphones Tom Ko, Bran Mak and Dongpeng Chen Department of Computer Scence and Engneerng The Hong Kong Unversty of Scence and Technology Clear Water
More informationEYE CENTER LOCALIZATION ON A FACIAL IMAGE BASED ON MULTI-BLOCK LOCAL BINARY PATTERNS
P.G. Demdov Yaroslavl State Unversty Anatoly Ntn, Vladmr Khryashchev, Olga Stepanova, Igor Kostern EYE CENTER LOCALIZATION ON A FACIAL IMAGE BASED ON MULTI-BLOCK LOCAL BINARY PATTERNS Yaroslavl, 2015 Eye
More informationThe Codesign Challenge
ECE 4530 Codesgn Challenge Fall 2007 Hardware/Software Codesgn The Codesgn Challenge Objectves In the codesgn challenge, your task s to accelerate a gven software reference mplementaton as fast as possble.
More informationFuzzy Modeling of the Complexity vs. Accuracy Trade-off in a Sequential Two-Stage Multi-Classifier System
Fuzzy Modelng of the Complexty vs. Accuracy Trade-off n a Sequental Two-Stage Mult-Classfer System MARK LAST 1 Department of Informaton Systems Engneerng Ben-Guron Unversty of the Negev Beer-Sheva 84105
More informationGSLM Operations Research II Fall 13/14
GSLM 58 Operatons Research II Fall /4 6. Separable Programmng Consder a general NLP mn f(x) s.t. g j (x) b j j =. m. Defnton 6.. The NLP s a separable program f ts objectve functon and all constrants are
More informationBiostatistics 615/815
The E-M Algorthm Bostatstcs 615/815 Lecture 17 Last Lecture: The Smplex Method General method for optmzaton Makes few assumptons about functon Crawls towards mnmum Some recommendatons Multple startng ponts
More informationA Fast Visual Tracking Algorithm Based on Circle Pixels Matching
A Fast Vsual Trackng Algorthm Based on Crcle Pxels Matchng Zhqang Hou hou_zhq@sohu.com Chongzhao Han czhan@mal.xjtu.edu.cn Ln Zheng Abstract: A fast vsual trackng algorthm based on crcle pxels matchng
More informationSome Advanced SPC Tools 1. Cumulative Sum Control (Cusum) Chart For the data shown in Table 9-1, the x chart can be generated.
Some Advanced SP Tools 1. umulatve Sum ontrol (usum) hart For the data shown n Table 9-1, the x chart can be generated. However, the shft taken place at sample #21 s not apparent. 92 For ths set samples,
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS46: Mnng Massve Datasets Jure Leskovec, Stanford Unversty http://cs46.stanford.edu /19/013 Jure Leskovec, Stanford CS46: Mnng Massve Datasets, http://cs46.stanford.edu Perceptron: y = sgn( x Ho to fnd
More informationA Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems
A Unfed Framework for Semantcs and Feature Based Relevance Feedback n Image Retreval Systems Ye Lu *, Chunhu Hu 2, Xngquan Zhu 3*, HongJang Zhang 2, Qang Yang * School of Computng Scence Smon Fraser Unversty
More informationAdditive Groves of Regression Trees
Addtve Groves of Regresson Trees Dara Sorokna, Rch Caruana, and Mrek Redewald Department of Computer Scence, Cornell Unversty, Ithaca, NY, USA {dara,caruana,mrek}@cs.cornell.edu Abstract. We present a
More informationOutline. Self-Organizing Maps (SOM) US Hebbian Learning, Cntd. The learning rule is Hebbian like:
Self-Organzng Maps (SOM) Turgay İBRİKÇİ, PhD. Outlne Introducton Structures of SOM SOM Archtecture Neghborhoods SOM Algorthm Examples Summary 1 2 Unsupervsed Hebban Learnng US Hebban Learnng, Cntd 3 A
More informationClassifying Acoustic Transient Signals Using Artificial Intelligence
Classfyng Acoustc Transent Sgnals Usng Artfcal Intellgence Steve Sutton, Unversty of North Carolna At Wlmngton (suttons@charter.net) Greg Huff, Unversty of North Carolna At Wlmngton (jgh7476@uncwl.edu)
More informationLECTURE : MANIFOLD LEARNING
LECTURE : MANIFOLD LEARNING Rta Osadchy Some sldes are due to L.Saul, V. C. Raykar, N. Verma Topcs PCA MDS IsoMap LLE EgenMaps Done! Dmensonalty Reducton Data representaton Inputs are real-valued vectors
More informationFixing Max-Product: Convergent Message Passing Algorithms for MAP LP-Relaxations
Fxng Max-Product: Convergent Message Passng Algorthms for MAP LP-Relaxatons Amr Globerson Tomm Jaakkola Computer Scence and Artfcal Intellgence Laboratory Massachusetts Insttute of Technology Cambrdge,
More informationCHAPTER 3 SEQUENTIAL MINIMAL OPTIMIZATION TRAINED SUPPORT VECTOR CLASSIFIER FOR CANCER PREDICTION
48 CHAPTER 3 SEQUENTIAL MINIMAL OPTIMIZATION TRAINED SUPPORT VECTOR CLASSIFIER FOR CANCER PREDICTION 3.1 INTRODUCTION The raw mcroarray data s bascally an mage wth dfferent colors ndcatng hybrdzaton (Xue
More informationEECS 730 Introduction to Bioinformatics Sequence Alignment. Luke Huan Electrical Engineering and Computer Science
EECS 730 Introducton to Bonformatcs Sequence Algnment Luke Huan Electrcal Engneerng and Computer Scence http://people.eecs.ku.edu/~huan/ HMM Π s a set of states Transton Probabltes a kl Pr( l 1 k Probablty
More informationLearning-Based Top-N Selection Query Evaluation over Relational Databases
Learnng-Based Top-N Selecton Query Evaluaton over Relatonal Databases Lang Zhu *, Wey Meng ** * School of Mathematcs and Computer Scence, Hebe Unversty, Baodng, Hebe 071002, Chna, zhu@mal.hbu.edu.cn **
More informationConcurrent Apriori Data Mining Algorithms
Concurrent Apror Data Mnng Algorthms Vassl Halatchev Department of Electrcal Engneerng and Computer Scence York Unversty, Toronto October 8, 2015 Outlne Why t s mportant Introducton to Assocaton Rule Mnng
More informationUser Authentication Based On Behavioral Mouse Dynamics Biometrics
User Authentcaton Based On Behavoral Mouse Dynamcs Bometrcs Chee-Hyung Yoon Danel Donghyun Km Department of Computer Scence Department of Computer Scence Stanford Unversty Stanford Unversty Stanford, CA
More informationAn Image Fusion Approach Based on Segmentation Region
Rong Wang, L-Qun Gao, Shu Yang, Yu-Hua Cha, and Yan-Chun Lu An Image Fuson Approach Based On Segmentaton Regon An Image Fuson Approach Based on Segmentaton Regon Rong Wang, L-Qun Gao, Shu Yang 3, Yu-Hua
More informationAccurate Information Extraction from Research Papers using Conditional Random Fields
Accurate Informaton Extracton from Research Papers usng Condtonal Random Felds Fuchun Peng Department of Computer Scence Unversty of Massachusetts Amherst, MA 01003 fuchun@cs.umass.edu Andrew McCallum
More informationEdge Detection in Noisy Images Using the Support Vector Machines
Edge Detecton n Nosy Images Usng the Support Vector Machnes Hlaro Gómez-Moreno, Saturnno Maldonado-Bascón, Francsco López-Ferreras Sgnal Theory and Communcatons Department. Unversty of Alcalá Crta. Madrd-Barcelona
More informationSVM-based Learning for Multiple Model Estimation
SVM-based Learnng for Multple Model Estmaton Vladmr Cherkassky and Yunqan Ma Department of Electrcal and Computer Engneerng Unversty of Mnnesota Mnneapols, MN 55455 {cherkass,myq}@ece.umn.edu Abstract:
More informationFast Feature Value Searching for Face Detection
Vol., No. 2 Computer and Informaton Scence Fast Feature Value Searchng for Face Detecton Yunyang Yan Department of Computer Engneerng Huayn Insttute of Technology Hua an 22300, Chna E-mal: areyyyke@63.com
More informationA Graphical Model Framework for Coupling MRFs and Deformable Models
A Graphcal Model Framework for Couplng MRFs and Deformable Models Ru Huang, Vladmr Pavlovc, and Dmtrs N. Metaxas Dvson of Computer and Informaton Scences, Rutgers Unversty {ruhuang, vladmr, dnm}@cs.rutgers.edu
More informationAn efficient method to build panoramic image mosaics
An effcent method to buld panoramc mage mosacs Pattern Recognton Letters vol. 4 003 Dae-Hyun Km Yong-In Yoon Jong-Soo Cho School of Electrcal Engneerng and Computer Scence Kyungpook Natonal Unv. Abstract
More informationA Simple and Efficient Goal Programming Model for Computing of Fuzzy Linear Regression Parameters with Considering Outliers
62626262621 Journal of Uncertan Systems Vol.5, No.1, pp.62-71, 211 Onlne at: www.us.org.u A Smple and Effcent Goal Programmng Model for Computng of Fuzzy Lnear Regresson Parameters wth Consderng Outlers
More informationContent Based Image Retrieval Using 2-D Discrete Wavelet with Texture Feature with Different Classifiers
IOSR Journal of Electroncs and Communcaton Engneerng (IOSR-JECE) e-issn: 78-834,p- ISSN: 78-8735.Volume 9, Issue, Ver. IV (Mar - Apr. 04), PP 0-07 Content Based Image Retreval Usng -D Dscrete Wavelet wth
More informationRelevance Assignment and Fusion of Multiple Learning Methods Applied to Remote Sensing Image Analysis
Assgnment and Fuson of Multple Learnng Methods Appled to Remote Sensng Image Analyss Peter Bajcsy, We-Wen Feng and Praveen Kumar Natonal Center for Supercomputng Applcaton (NCSA), Unversty of Illnos at
More informationThe Greedy Method. Outline and Reading. Change Money Problem. Greedy Algorithms. Applications of the Greedy Strategy. The Greedy Method Technique
//00 :0 AM Outlne and Readng The Greedy Method The Greedy Method Technque (secton.) Fractonal Knapsack Problem (secton..) Task Schedulng (secton..) Mnmum Spannng Trees (secton.) Change Money Problem Greedy
More informationAn Ensemble Learning algorithm for Blind Signal Separation Problem
An Ensemble Learnng algorthm for Blnd Sgnal Separaton Problem Yan L 1 and Peng Wen 1 Department of Mathematcs and Computng, Faculty of Engneerng and Surveyng The Unversty of Southern Queensland, Queensland,
More informationCourse Introduction. Algorithm 8/31/2017. COSC 320 Advanced Data Structures and Algorithms. COSC 320 Advanced Data Structures and Algorithms
Course Introducton Course Topcs Exams, abs, Proects A quc loo at a few algorthms 1 Advanced Data Structures and Algorthms Descrpton: We are gong to dscuss algorthm complexty analyss, algorthm desgn technques
More informationProper Choice of Data Used for the Estimation of Datum Transformation Parameters
Proper Choce of Data Used for the Estmaton of Datum Transformaton Parameters Hakan S. KUTOGLU, Turkey Key words: Coordnate systems; transformaton; estmaton, relablty. SUMMARY Advances n technologes and
More informationHigh resolution 3D Tau-p transform by matching pursuit Weiping Cao* and Warren S. Ross, Shearwater GeoServices
Hgh resoluton 3D Tau-p transform by matchng pursut Wepng Cao* and Warren S. Ross, Shearwater GeoServces Summary The 3D Tau-p transform s of vtal sgnfcance for processng sesmc data acqured wth modern wde
More information