Language Understanding in the Wild: Combining Crowdsourcing and Machine Learning

Edwin Simpson, University of Oxford, UK (edwin@robots.ox.ac.uk); Pushmeet Kohli, Microsoft Research, Cambridge, UK (pkohli@microsoft.com); Matteo Venanzi, University of Southampton, UK (mv1g10@ecs.soton.ac.uk); John Guiver, Microsoft Research, Cambridge, UK (joguiver@microsoft.com); Nicholas R. Jennings, University of Southampton, UK (nrj@ecs.soton.ac.uk); Steven Reece, University of Oxford, UK (reece@robots.ox.ac.uk); Stephen J. Roberts, University of Oxford, UK (sjrob@robots.ox.ac.uk)

ABSTRACT
Social media has led to the democratisation of opinion sharing. A wealth of information about public opinions, current events, and authors' insights into specific topics can be gained by understanding the text written by users. However, there is wide variation in the language used by different authors in different contexts on the web. This diversity in language makes interpretation an extremely challenging task. Crowdsourcing presents an opportunity to interpret the sentiment, or topic, of free text. However, the subjectivity and bias of human interpreters raise challenges in inferring the semantics expressed by the text. To overcome this problem, we present a novel Bayesian approach to language understanding that relies on aggregated crowdsourced judgements. Our model encodes the relationships between labels and text features in documents, such as tweets, web articles, and blog posts, accounting for the varying reliability of human labellers. It allows inference of annotations that scales to arbitrarily large pools of documents. Our evaluation using two challenging crowdsourcing datasets shows that by efficiently exploiting language models learnt from aggregated crowdsourced labels, we can provide up to 25% improved classifications when only a small portion, less than 4%, of documents has been labelled. Compared to the six state-of-the-art methods, we reduce by up to 67% the number of crowd responses required to achieve comparable accuracy. Our method was a joint winner of the CrowdFlower - CrowdScale 2013 Shared Task challenge at the Conference on Human Computation and Crowdsourcing (HCOMP 2013).

Copyright is held by the International World Wide Web Conference Committee (IW3C2). IW3C2 reserves the right to provide a hyperlink to the author's site if the Material is used in electronic media. WWW 2015, May 18-22, 2015, Florence, Italy.

General Terms
Crowdsourcing, machine learning, variational Bayes, classifier combination, text classification, sentiment analysis, human computation

1. INTRODUCTION
Social media provides an increasingly rich source of information about public opinion and current events, which can be valuable to professionals across a wide range of industries. For example, Twitter can reflect the public's sentiment about the weather, such as in the data collected during the CrowdScale 2013 Shared Task challenge, opinion of major health emergencies such as the H1N1 flu pandemic [6], or knowledge of disaster events such as Typhoon Haiyan [5]. Mining this large body of unstructured data requires an understanding of the language used in each specific context. For example, the sentiment of a document, which reflects the author's attitudes or opinion of a subject, is captured in the language they use. However, the relationship between sentiment and language typically depends on factors such as the viewpoint and gender of the authors and the context of their writing. For example, distinctive terms such as "love" and "dude" are more frequently used by female and male Twitter users, respectively, to refer to the same concept of a friend or a family member [15]. Similarly, reports posted by members of the public to Ushahidi after the 2010 Haiti earthquake used a type of language that is significantly different to that seen in other locations and other types of emergency [22].
This diversity in social media text inhibits the performance of any generic method for automated document classification in the wild. However, this problem can be alleviated by human interpreters who can use their background knowledge and natural language understanding skills to recognise the sentiment of documents and adapt to the diverse language used in different contexts. Interpreting the sentiment or relevance of a piece of text is highly subjective and, along with variations in annotators' skill levels, it can result in disagreement. To overcome this problem, existing methods for crowdsourced document classification require labels from multiple annotators for every document in the corpus [28, 26], which can be prohibitively costly or time consuming [22]. Fortunately, the occurrence of certain terms in each document also provides weak indications of the sentiment of a document, which can be used to reduce the cost of employing human interpreters to annotate the entire corpus. Therefore, we propose a hybrid approach to large-scale document classification that integrates human intelligence with automated analysis of text.

In this paper we present Bayesian Classifier Combination with Words (BCCWords), a framework for combining annotations from a crowd of workers with text features to classify a corpus of documents. This approach is an example of an emerging research area known as human-agent collectives [12]. We introduce a scalable Bayesian inference mechanism for BCCWords, which learns posterior distributions over the workers' reliability and the document classifications, given the documents' text features and a set of crowdsourced annotations. Our method not only allows us to handle the varying error rates and bias of individual members of the crowd, but also allows us to annotate an entire set of documents when only a subset have been labelled by the crowd, by leveraging the inferred language model to automatically annotate the remaining documents. In more detail, we make the following contributions to the state-of-the-art:

1. We present a novel generative model, BCCWords, that combines human and computer interpretations of free-text documents and infers their sentiment.
2. We present a novel scalable variational Bayes inference algorithm, BCCWords-VB, for training the BCCWords model. This algorithm was first demonstrated at the CrowdScale 2013 Shared Task Challenge and was a joint winner.
3. We derive an efficient inference decomposition method that allows our algorithm to perform batch inference over hundreds of thousands of documents, and demonstrate inference with 569,786 crowdsourced sentiment judgements for 98,979 documents in approximately 20 minutes on a standard laptop.
4. We present an exhaustive evaluation of our algorithm on two real datasets of text annotations and compare it against six state-of-the-art methods for crowd-based text classification and data aggregation. Specifically, our evaluation shows that our algorithm is up to 25% more accurate when only a small portion, less than 4%, of the documents have been labelled, and that our algorithm reduces by up to 67% the amount of crowd labels required to achieve comparable accuracy with standard methods.

The paper is structured as follows. We review the literature on language modelling and aggregation models for crowdsourced judgements in Section 2. Section 3 presents our model in detail, then Section 4 provides mathematical details for our variational inference algorithm. Section 5 demonstrates the efficacy of our approach by comparing it against state-of-the-art benchmarks on two real-world crowdsourcing datasets. Finally, we conclude and discuss future work in Section 6.

2. AGGREGATING JUDGEMENTS
Many applications in the literature have employed crowdsourcing, whereby multiple people process each document or data point [13, 1, 33]. A key challenge in such crowdsourcing applications is to mitigate the bias of subjective labellers. Previous work has addressed this problem by combining crowd responses to obtain reliable aggregate classifications. However, as yet, these methods have not exploited the language used in the text to further assist in interpreting the text. We propose to use the variations in language associated with sentiment to reduce the errors and bias that arise when employing members of the public to perform labelling tasks.
A further challenge with real-world applications of document crowdsourcing is the cost of employing a sufficient number of annotators to rapidly label a large dataset. For example, the Ushahidi dataset comprises at least 40,000 text messages which had to be interpreted in the first month after the earthquake in Haiti, which proved to be infeasible [22]. However, a suitable language model would enable automated analysis at much greater scale and allows the annotators to focus their efforts on the most difficult documents. We therefore propose a learning method for harnessing the skills of human labellers to learn a bespoke language model from much larger sets of documents.

A number of methods have been used in the literature to address the challenge of aggregating annotations from the crowd, including the simple technique of majority voting [18]. However, simple majority voting treats all annotators as equally reliable and does not provide any meaningful measure of confidence in the combined decision to account for conflicts in judgement or low annotator skill levels. To overcome this problem, probabilistic methods have been developed which learn the skill levels or bias of each annotator and aggregate their decisions accordingly [26, 7, 32, 25, 31]. These methods are prone to error when only small amounts of gold labels are available, as they do not consider uncertainty in skill levels and other model parameters. For example, when only one label is obtained from a worker, these methods may infer that the worker is either perfectly reliable or totally incompetent when, in reality, the worker is neither. This is a common problem with approaches to inference that use maximum likelihood or maximum a-posteriori solutions [4]. In order to overcome this limitation, algorithms for aggregating crowdsourced data, including SFilter [8] and Bayesian Classifier Combination (BCC) [14, 30], capture the uncertainty in the workers' skill levels or bias, as well as the uncertainty in the aggregated labels. Unfortunately, these methods do not exploit the text features of documents, and consequently require each document to be labelled by the crowd, often multiple times, to obtain confident classifications.

Previous work has introduced methods for automatic text classification based solely on word content, such as the bag-of-words classifier [11]. Although such methods have been applied to automated sentiment analysis, they need a language model for each application context [21]. This often requires large amounts of training data and substantial effort by the system designer to cope with the diversity in language [17]. In contrast, our approach uses a crowd of human annotators to learn a language model rapidly and cheaply. In the following sections we develop the BCCWords model and then demonstrate its efficacy against benchmark methods.

3. THE BCCWORDS MODEL
In this section we describe our novel BCCWords model. This model is an extension of the independent Bayesian classifier combination (IBCC) model presented in [28], which classified data points using only crowdsourced labels. BCCWords models the crowd as multiple heterogeneous classifiers, and uses both the crowd's responses and the word structure of documents to classify them. An advantage of BCCWords is that it can be inferred in a semi-supervised or unsupervised manner. It does not require separate training and test phases but uses a single, combined learning phase over all available data. The semi-supervised approach simultaneously learns from labelled training data and the latent structure in the entire dataset, making it particularly suitable when gold-standard data is limited.

We start by introducing our notation. There is a crowd of $K$ annotators expressing their judgement about the correct classification of $N$ documents over a range of $C$ possible classes. The classes may represent sentiment classes, topic labels, or other types of annotation. Each document, $i$, has an unknown true class $t_i \in C$. The judgement of annotator $k$ for document $i$ is denoted as $l_i^{(k)}$, where $l_i^{(k)} \in C$. Also, we assume that the $n$th word, $w_{i,n}$, of document $i$ takes a value $d$ from a dictionary of size $D$ words. For notational simplicity, we assume a dense set of judgements in which each annotator rates all $N$ documents. However, as will become clear in Section 4.1, our model naturally supports sparsity in the dataset, which is the case for the CrowdFlower dataset used in Section 5.

Figure 1: The factor graph of BCCWords. The circular, shaded nodes represent observed variables and the square, shaded nodes represent the hyperparameters. The plates describe (i) the set of $K$ annotators, (ii) the $N$ documents, (iii) the $C$ possible true values and (iv) the $D$ words contained in the dictionary of terms used in the documents.

The factor graph of our Bayesian combination model, BCCWords, is shown in Figure 1, and the model is described as follows. We assume that each annotator draws judgements for documents of class $t_i = c$ from a categorical distribution with parameters $\pi_c^{(k)}$:
$$l_i^{(k)} \mid \pi^{(k)}, t_i = c \sim \mathrm{Cat}\big(\pi_c^{(k)}\big),$$
where $\pi_c^{(k)}$ is the accuracy vector of annotator $k$ for documents of class $c$. That is, each element of $\pi_c^{(k)}$ specifies the probability that annotator $k$ will give judgement $j$ when presented with a document whose true class is $c$:
$$\pi_{c,j}^{(k)} = p\big(l_i^{(k)} = j \mid t_i = c\big).$$
The set of accuracy vectors for all $c$ is called the confusion matrix $\pi^{(k)}$, representing $k$'s reliability. In Figure 1, the annotator confusion matrices are shown in the left-hand plate for all $K$ annotators, depicting how the response of an annotator depends on the true class $t_i$ of the document they are judging. The use of confusion matrices allows our model to combine annotators of very different skill levels, and to handle those who make random guesses or whose responses are the opposite of what we expect. Furthermore, a confusion matrix accounts for the personal bias of an annotator, since a tendency to select a more positive or negative judgement, $j$, than other members of the crowd, when presented with documents of true class $c$, will result in an increased likelihood $\pi_{c,j}^{(k)}$. A personal bias toward selecting judgement $j$ for all documents will result in high likelihoods $\pi_{c,j}^{(k)}$ for all true classes $c$, thus the model will learn that the label $j$ from annotator $k$ is less strongly discriminative.

Our language model is defined as follows. Given a document $i$ of class $c$, we assume that the probability that the $n$th word is $d$ (i.e. $w_{i,n} = d$) follows a categorical distribution with parameters $\omega_c = \{\omega_{c,d} \;\forall d\}$:
$$w_{i,n} \mid \omega_c, t_i = c \sim \mathrm{Cat}(\omega_c),$$
where $\omega_{c,d} = p(w_{i,n} = d \mid t_i = c)$ is the probability that a randomly-drawn word from a document of class $c$ is the word $d$. This probabilistic representation of text in documents corresponds to a mixture of bag-of-words models [11], where each mixture component is a bag-of-words model associated with one particular object class. The word distributions are represented in Figure 1 in the right-hand plate, showing the variables corresponding to the $D$ words in the dictionary.

We assume that the true class for each document, $t_i$, is drawn from a categorical distribution with parameters $\rho$: $t_i \mid \rho \sim \mathrm{Cat}(\rho)$. The parameters $\rho$ can be regarded as the proportions of documents in each class, so that $\rho_c = p(t_i = c \mid \rho)$. These parameters are shown at the top of Figure 1.

To model the uncertainty in the latent variables in our model, we assign conjugate Dirichlet prior distributions to $\pi_c^{(k)}$, $\omega_c$ and $\rho$, for each class $c \in C$ and annotator $k \in K$:
$$\pi_c^{(k)} \mid \alpha_{0,c}^{(k)} \sim \mathrm{Dir}\big(\alpha_{0,c}^{(k)}\big), \qquad \rho \mid \beta_0 \sim \mathrm{Dir}(\beta_0), \qquad \omega_c \mid \gamma_{0,c} \sim \mathrm{Dir}(\gamma_{0,c}),$$
where $\alpha_{0,c}^{(k)}$ is the per-annotator confusion matrix hyperparameter, and $\gamma_{0,c}$ is the hyperparameter for the bag-of-words distribution for each class. These hyperparameters have intuitive interpretations as prior pseudo-counts, meaning that their values are equivalent to a number of prior observations, which represent the strength of prior beliefs. When implementing BCCWords, the diagonal values of the hyperparameters $\alpha_{0,c}^{(k)}$ of the confusion matrices can be set to higher values than the off-diagonals, encoding the prior belief that annotators are expected to be better than random.
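To make the generative process above concrete, the following is a minimal NumPy sketch that samples synthetic data from it. The sizes, random seed and variable names are our own illustrative choices, not values from the paper.

```python
# Minimal sketch of the BCCWords generative process (illustrative sizes only).
import numpy as np

rng = np.random.default_rng(0)
N, K, C, D, W = 50, 5, 3, 100, 10   # documents, annotators, classes, dictionary size, words per document

beta0 = np.ones(C)                                   # uniform prior over class proportions rho
gamma0 = np.ones((C, D))                             # uniform priors over per-class word distributions omega_c
alpha0 = np.ones((K, C, C)) + 1.5 * np.eye(C)        # diagonal pseudo-counts favour correct labels

rho = rng.dirichlet(beta0)                                                          # class proportions
omega = np.array([rng.dirichlet(gamma0[c]) for c in range(C)])                      # word distributions
pi = np.array([[rng.dirichlet(alpha0[k, c]) for c in range(C)] for k in range(K)])  # confusion matrices

t = rng.choice(C, size=N, p=rho)                                                    # true classes t_i
words = np.array([rng.choice(D, size=W, p=omega[t[i]]) for i in range(N)])          # words w_{i,n}
labels = np.array([[rng.choice(C, p=pi[k, t[i]]) for k in range(K)] for i in range(N)])  # crowd labels l_i^(k)
```

In real crowdsourcing data the label matrix would be sparse, since each annotator only rates a subset of documents; the dense form here simply mirrors the notational simplification made above.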

The hyperparameters for the word distributions, $\gamma_{0,c}$, and the class proportions, $\beta_0$, can both be set so that the priors are uniform. This reflects an initial lack of information about the word structure of the documents and the class distribution of the documents.

To enable us to perform Bayesian inference over our model, we first specify the complete joint distribution:
$$p\big(l, t, \pi_1^{(1)}, \ldots, \pi_C^{(K)}, \omega_1, \ldots, \omega_C, \rho \mid \alpha_{0,1}^{(1)}, \ldots, \alpha_{0,C}^{(K)}, \beta_0, \gamma_{0,1}, \ldots, \gamma_{0,C}\big) = \prod_{i=1}^{N} \Big\{ \rho_{t_i} \prod_{k=1}^{K} \pi_{t_i, l_i^{(k)}}^{(k)} \prod_{n=1}^{W_i} \omega_{t_i, w_{i,n}} \Big\} \, p(\rho \mid \beta_0) \prod_{c=1}^{C} \Big\{ p(\omega_c \mid \gamma_{0,c}) \prod_{k=1}^{K} p\big(\pi_c^{(k)} \mid \alpha_{0,c}^{(k)}\big) \Big\}, \qquad (1)$$
where $l$ is the set of labels from all annotators for all documents, and $W_i$ is the number of words in document $i$. The next section describes a method that uses this joint distribution to estimate the posterior distribution over the unknown variables in our model.

4. EFFICIENT VARIATIONAL INFERENCE
The BCCWords model presented in the previous section is tuned, or inferred, by learning the parameters of the posterior distribution over its unknown variables, so that the model fits the data. In this section we describe an efficient method for inference using variational Bayes (VB) [4]. The next subsection presents details of the VB algorithm for BCCWords. Then, Section 4.2 describes how this BCCWords-VB algorithm can be extended to batch processing and subsequently scaled to large datasets when computer memory capacity is limited.

Variational Bayesian inference is an approximate method for obtaining a strict lower bound on the true (log) joint posterior. VB explicitly takes uncertainty into account at all levels of inference, allowing us to marginalise (albeit under the VB approximations) over unknown variables, rather than selecting the single most likely value. The approximation offers huge speed-ups over Monte Carlo sampling based Bayesian methods [28], and the performance degradation appears small in BCC models. Hence, VB is our preferred algorithm for working with potentially large sets of documents. In our experiments we implement our VB algorithm using Infer.NET [20], which is a framework that enables rapid development and running of Bayesian inference in graphical models. In particular, the Infer.NET inference engine enables us to switch between alternative inference algorithms for BCCWords, including Gibbs sampling [9] and expectation propagation [19], that are potentially more accurate but much slower than VB and less suitable for performing inference over large-scale datasets.

The variational Bayesian inference algorithm uses an approximation to the joint probability distribution, $q(t, \theta) = q(t)q(\theta)$, that factorises between the true classes of the documents, $t = \{t_i\}$, and the set of model parameters, $\theta = \{\pi^{(k)} \;\forall k, \omega, \rho\}$. The algorithm iterates between updating the approximate posterior distribution over the true classes of the documents, $q(t)$, and the model parameters, $q(\theta)$, until it converges. The theory behind variational inference guarantees that each iteration reduces the Kullback-Leibler divergence [16] between the approximate solution and the true posterior, so that the approximation becomes closer to the exact solution with each iteration. The updates can be viewed as passing messages between the true class labels $t$ and the model parameters $\theta$. As we will see in the detailed explanation below, the specific forms of the factors $q(t)$ and $q(\theta)$ arise naturally from the BCCWords model and its choice of distributions. The conditional independence relations in our model allow us to further factorise the distribution over the model parameters, $q(\theta)$, without additional approximation:
$$q(\theta) = q(\rho) \prod_{c \in C} \Big\{ q(\omega_c) \prod_{k \in K} q\big(\pi_c^{(k)}\big) \Big\}.$$
This means we can update each subset of model parameters separately, and each of these factors will exchange different messages with $q(t)$.
This type of algorithm is also known as variational message passing (VMP) and has an efficient, scalable implementation, which is described in Section 4.2.

4.1 The BCCWords VB Algorithm
We now present the implementation of the variational inference approach to BCCWords described in the previous section. Specifically, we describe the details of the iterative updates within a step-by-step description of the inference algorithm, based on the VB equations that we derive for BCCWords.

Inputs: the algorithm takes as input a data set of annotators' responses, $l$, and, where available, a set of known target labels, which are gold-standard training labels. We note that gold labels are not necessary and the algorithm can operate in unsupervised mode. To run the algorithm, we must also select values for $\alpha_{0,c}^{(k)} \;\forall k$, $\beta_0$ and $\gamma_{0,c}$, as described above. A number of techniques can be used to initialise these hyperparameters when the choice of values is unclear [3].

Step 1. Initialisation: initialise the approximate posterior distributions over the model parameters, $\theta$. The choice of initial distributions affects the number of iterations required for convergence. In our implementation we initialise the posterior distributions over the model parameters by setting them to their prior distributions.

Step 2. Update true class predictions: update the approximate posterior $q(t_i)$ over the class of each document $i$. For any document which has a gold label, the value of $t_i$ is known, so we do not need to update $q(t_i)$. Instead we set $q(t_i = c) = 1$, where $c$ is the observed value of $t_i$, and $q(t_i) = 0$ for all other class values. For all other documents, we obtain the current estimate $q^*(t_i)$ of the probability that the true class of $i$ is $c$:
$$q^*(t_i = c) = \frac{r_{i,c}}{\sum_{c' \in C} r_{i,c'}}, \qquad (2)$$
where $r_{i,c}$ is the exponentiated expectation of the log likelihood, given by
$$\ln r_{i,c} = \mathbb{E}_{\rho,\pi,\omega}\big[\ln p(t_i = c, \rho, \pi, l_i, \omega)\big] = \mathbb{E}_{\rho}[\ln \rho_c] + \sum_{k=1}^{K} \mathbb{E}_{\pi^{(k)}}\big[\ln \pi_{c, l_i^{(k)}}^{(k)}\big] + \sum_{n=1}^{W_i} \mathbb{E}_{\omega}\big[\ln \omega_{c, w_{i,n}}\big]. \qquad (3)$$
The expectations in this equation are found using the current estimates of the distributions over the model parameters, and are defined explicitly in the subsequent steps of the algorithm.

These terms can be seen as messages from the model parameters to the true class labels, $t$. Equation (2) can then be used to determine the messages to pass to the model parameters $\theta$, which are expectations over the sufficient statistics of the set of true class labels for all documents. The message for $\rho$ contains the expected counts of each true class:
$$N_c = \sum_{i=1}^{N} q(t_i = c).$$
The message for the confusion matrices contains the counts of each judgement $j \in C$ given the true class $c \in C$:
$$N_{c,j}^{(k)} = \sum_{i=1}^{N} \delta_{l_i^{(k)}, j}\, q(t_i = c), \qquad (4)$$
where $\delta_{l_i^{(k)}, j}$ is the Kronecker delta, which is unity if $l_i^{(k)} = j$ and zero otherwise. Similarly, the message for the word distributions contains the counts of word occurrences in each class:
$$N_{c,d} = \sum_{i=1}^{N} \sum_{n=1}^{W_i} \delta_{w_{i,n}, d}\, q(t_i = c). \qquad (5)$$

Step 3. Update confusion matrices: update the approximate posterior $q(\pi_c^{(k)})$ for each class $c \in C$ and each annotator $k \in K$. The prior distributions over the confusion matrices are Dirichlet distributions, which are conjugate to the categorical distributions. This means that the posterior distributions over the confusion matrices are also Dirichlets, with updated parameters:
$$q^*\big(\pi_c^{(k)}\big) = \mathrm{Dir}\big(\pi_c^{(k)} \mid \alpha_{c,1}^{(k)}, \ldots, \alpha_{c,L}^{(k)}\big), \qquad (6)$$
where $L$ is the cardinality of $C$ and $\alpha_c^{(k)}$ is calculated by adding the counts from the true class message to the prior pseudo-counts $\alpha_{0,c}^{(k)}$:
$$\alpha_{c,j}^{(k)} = \alpha_{0,c,j}^{(k)} + N_{c,j}^{(k)}. \qquad (7)$$
A more detailed derivation of these iterative update equations can be found in [28]. We can now calculate the message to send back to the true labels, which is the expectation term required for Equation (3):
$$\mathbb{E}\big[\ln \pi_{c,j}^{(k)}\big] = \Psi\big(\alpha_{c,j}^{(k)}\big) - \Psi\Big(\sum_{b=1}^{L} \alpha_{c,b}^{(k)}\Big), \qquad (8)$$
where $\Psi(\cdot)$ is the standard digamma function.

Step 4. Update word distributions: update the approximate posterior $q(\omega_c)$ for each row $c \in C$ to the current estimate $q^*(\omega_c)$. Again, we have a posterior Dirichlet distribution due to the use of conjugate exponential-family distributions in our model:
$$q^*(\omega_c) = \mathrm{Dir}\big(\omega_c \mid \gamma_{c,1}, \ldots, \gamma_{c,D}\big), \qquad (9)$$
where the parameters are updated by $\gamma_{c,d} = \gamma_{0,c,d} + N_{c,d}$. The message to the true class labels, which is required for Equation (3), contains the terms:
$$\mathbb{E}[\ln \omega_{c,d}] = \Psi(\gamma_{c,d}) - \Psi\Big(\sum_{d'=1}^{D} \gamma_{c,d'}\Big). \qquad (10)$$

Step 5. Update class proportions: update the approximate posterior $q(\rho)$ using the Dirichlet parameter update:
$$q^*(\rho) = \mathrm{Dir}(\rho \mid \beta_1, \ldots, \beta_C), \qquad (11)$$
where the parameters are updated by $\beta_c = \beta_{0,c} + N_c$. The message from this parameter is:
$$\mathbb{E}[\ln \rho_c] = \Psi(\beta_c) - \Psi\Big(\sum_{b=1}^{C} \beta_b\Big). \qquad (12)$$
So, for one iteration of the algorithm, we calculate the updated parameters, distributions and expectation terms defined in Steps 2 to 5.

Step 6. Check convergence: if the target distributions $q^*(t_i = c)$ have not converged to a stable solution within a given tolerance, repeat the algorithm from Step 2.

Outputs: predictions of the document class labels, given by the current estimates of $q(t_i = c)$, their posterior expectations. The algorithm also outputs the approximate posterior distributions over the model parameters, $q(\pi_c^{(k)})$, $q(\omega_c)$ and $q(\rho)$, for each row $c \in C$ and each annotator $k \in K$.
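The following is a minimal NumPy/SciPy sketch of the update cycle in Steps 1-6, using dense arrays. It is our own reading of Equations (2)-(12), not the authors' Infer.NET implementation; the function signature, array shapes and convergence test are our own assumptions.

```python
# Sketch of the BCCWords VB updates (Eqs. 2-12) over dense label and word-count matrices.
import numpy as np
from scipy.special import digamma

def bccwords_vb(labels, word_counts, alpha0, beta0, gamma0, gold=None, max_iter=100, tol=1e-4):
    """labels: (N, K) int array of crowd labels, -1 where a worker gave no response.
    word_counts: (N, D) array of word counts per document.
    alpha0: (K, C, C), beta0: (C,), gamma0: (C, D) prior pseudo-counts.
    gold: optional (N,) int array of gold classes, -1 where unknown."""
    N, K = labels.shape
    C = beta0.shape[0]

    # Step 1: initialise the parameter posteriors at their priors and q(t) uniformly.
    qt = np.full((N, C), 1.0 / C)
    alpha, beta, gamma = alpha0.astype(float), beta0.astype(float), gamma0.astype(float)

    for _ in range(max_iter):
        # Messages from the parameters to q(t): expected log parameters (Eqs. 8, 10, 12).
        Elog_pi = digamma(alpha) - digamma(alpha.sum(axis=2, keepdims=True))      # (K, C, C)
        Elog_omega = digamma(gamma) - digamma(gamma.sum(axis=1, keepdims=True))   # (C, D)
        Elog_rho = digamma(beta) - digamma(beta.sum())                            # (C,)

        # Step 2: update q(t_i) (Eqs. 2-3).
        log_r = Elog_rho[None, :] + word_counts @ Elog_omega.T                    # (N, C)
        for k in range(K):
            obs = labels[:, k] >= 0
            log_r[obs] += Elog_pi[k][:, labels[obs, k]].T                         # E[ln pi^(k)_{c, l_i^(k)}]
        qt_new = np.exp(log_r - log_r.max(axis=1, keepdims=True))
        qt_new /= qt_new.sum(axis=1, keepdims=True)
        if gold is not None:                                                      # clamp gold-labelled documents
            known = gold >= 0
            qt_new[known] = np.eye(C)[gold[known]]

        # Count messages from q(t) (Eqs. 4-5) and conjugate Dirichlet updates (Eqs. 7, 9, 11).
        beta = beta0 + qt_new.sum(axis=0)                                         # class proportions
        gamma = gamma0 + qt_new.T @ word_counts                                   # word distributions
        for k in range(K):
            obs = labels[:, k] >= 0
            one_hot = np.eye(C)[labels[obs, k]]                                   # indicator of the given label j
            alpha[k] = alpha0[k] + qt_new[obs].T @ one_hot                        # [c, j] = sum_i q(t_i=c) 1[l_i=j]

        # Step 6: check convergence of q(t).
        if np.max(np.abs(qt_new - qt)) < tol:
            qt = qt_new
            break
        qt = qt_new

    return qt, alpha, beta, gamma
```

Documents with gold labels stay clamped while all others are re-estimated at every iteration, which mirrors the single, combined semi-supervised learning phase described in Section 3.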
4.2 Scalability Through Inference Decomposition
Performing a task such as sentiment analysis or disaster report analysis can require us to work with extremely large datasets with vast memory requirements. A major source of memory usage is the large set of annotator confusion matrices that the inference algorithm must iteratively update. For example, the Ushahidi dataset, gathered after the Haiti earthquake, was interpreted by approximately 700 workers [22]. To resolve the memory exhaustion difficulties of VB inference at scale, this section proposes a scalable version of the BCCWords-VB algorithm, Scalable BCCWords (ScalBCCWords), which can be run on a single computer. ScalBCCWords is identical to BCCWords except that we decompose the entire data set into a set of batches by distributing the annotators across $P$ partitions.

During each iteration of the VB inference algorithm, ScalBCCWords switches each batch in and then out of memory in turn. Batches produce messages that summarise each portion of the data and occupy considerably less computer memory than the entire data set. We chose to distribute the workers between the batches so that each batch contains all the responses from the workers in that batch. This partitioning criterion is sensible as each batch only has to represent a subset of annotators, and thus only represents a small set of confusion matrices. The splits can be chosen to meet memory constraints. The corresponding factor graph is shown in Figure 2, showing how all partitions share the same class distribution of documents and the word distributions conditioned on document class.

Figure 2: Factor graph for Scalable BCCWords. The four plates included in the graph describe (i) the set of workers in batch $p$, (ii) the set of workers in batch $p'$, (iii) the $C$ possible true values and (iv) the $D$ words contained in the dictionary of terms used in the tweets. The plate for the $N$ documents is omitted for simplification.

When batch $p$ is processed, the pseudo-counts $N_{c,j}^{(k)}$ are calculated using Equation (4), for all $k \in p$. The log confusion matrix messages, $M_{p,i,c}$, for batch $p$ for each class and document are calculated as follows:
$$M_{p,i,c} = \sum_{k \in p} \mathbb{E}\big[\ln \pi_{c, l_i^{(k)}}^{(k)}\big],$$
using Equation (8) to calculate the expected log confusion matrix. The log confusion matrix for batch $p$ is then deleted from memory. Once a message is obtained for each batch, the log true class prediction probability is calculated using
$$\ln r_{i,c} = \mathbb{E}_{\rho}[\ln \rho_c] + \sum_{p \in P} M_{p,i,c} + \sum_{n=1}^{W_i} \mathbb{E}_{\omega}\big[\ln \omega_{c, w_{i,n}}\big],$$
as per Equation (3). Hence, ScalBCCWords is mathematically equivalent to BCCWords and both methods converge to the same solution. The remaining steps of the ScalBCCWords algorithm are identical to those of BCCWords. We note that ScalBCCWords may process the batches in any order, and not all the batches need be updated during each iteration of the VB algorithm. In our experiments, we provide an empirical evaluation of both our algorithms showing the advantage of ScalBCCWords-VB in memory occupancy.
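A sketch of the batched accumulation of the confusion-matrix messages $M_{p,i,c}$ is given below. The round-robin partitioning and the in-memory dictionary of per-worker Dirichlet counts are our own simplifications of the disk-swapping scheme described above.

```python
# Sketch of the ScalBCCWords message decomposition: workers are split into P partitions and
# their contribution to Equation (3) is accumulated one batch at a time.
import numpy as np
from scipy.special import digamma

def batched_confusion_messages(alpha_by_worker, labels, C, P=4):
    """alpha_by_worker: dict worker_id -> (C, C) Dirichlet counts alpha^(k).
    labels: (N, K) int array of crowd labels, -1 where missing.
    Returns M of shape (N, C), the summed messages sum_p M_{p,i,c}."""
    N, K = labels.shape
    workers = list(alpha_by_worker)
    partitions = [workers[p::P] for p in range(P)]       # simple round-robin split of the workers

    M = np.zeros((N, C))
    for batch in partitions:                             # process one partition at a time
        M_p = np.zeros((N, C))                           # message summarising this batch
        for k in batch:
            alpha_k = alpha_by_worker[k]                 # in ScalBCCWords this would be swapped in from storage
            Elog_pi = digamma(alpha_k) - digamma(alpha_k.sum(axis=1, keepdims=True))  # Eq. (8)
            obs = labels[:, k] >= 0
            M_p[obs] += Elog_pi[:, labels[obs, k]].T     # E[ln pi^(k)_{c, l_i^(k)}]
        M += M_p                                         # only M_p needs to remain in memory afterwards
    return M
```

Only one partition's expected log confusion matrices are materialised at a time, which is the source of the memory saving reported in Section 5.5.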

5. EMPIRICAL EVALUATION
We evaluate the efficacy of our approach, ScalBCCWords, using two real-world datasets, comparing performance against the following six rival benchmark approaches. We note that BCCWords-VB and ScalBCCWords produce the same classifications as they are mathematically equivalent, and therefore only ScalBCCWords results are shown.

Majority Voting (MV) is a popular and simple algorithm for obtaining a single decision from multiple opinions provided by a crowd [18, 29]. MV greedily assigns a class to each document by choosing the label with the most votes from the crowd. All votes are considered with uniform weight, thus treating all annotators as equally reliable. Typically, no measure of uncertainty in the final decision is provided.

Vote Distribution treats the fraction of votes in support of each class as the probability of that class. It therefore represents a simple technique for estimating the empirical probability that a document has a particular true label, assuming that all annotators are equally reliable. A brief sketch of both vote-based baselines is given after this list of methods.

Bag-of-words Classifier + MV trains a bag-of-words classifier by treating the majority vote as gold-labelled training data [11]. Therefore, this approach learns a language model that can be used to classify documents that have not yet been labelled by the crowd, but does not account for the varying reliability of the crowd labellers when training the model.

Dawid & Skene is a model for combining labels from multiple classifiers, using confusion matrices to model the reliability of individual labellers [7]. The learning algorithm for Dawid & Skene does not account for uncertainty in the confusion matrices or other model parameters, which can lead to errors when gold-labelled data is limited.

Independent Bayesian Classifier Combination (IBCC) learns the confusion matrices using variational Bayes (VB). Therefore, in contrast to Dawid & Skene, it handles model uncertainty and can operate in unsupervised settings when gold-labelled examples are unavailable. However, this model does not consider text features and relies solely on the labels provided by the crowd.

Community-Based Bayesian Classifier Combination (CBCC) is an extension of IBCC that models communities of workers with similar confusion matrices. It learns both the confusion matrices of each community and each worker but, like IBCC, it does not account for text features in the documents [30]. We run CBCC with three communities, as suggested in the original paper, for both CF and SP.
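As a concrete illustration of the two vote-based baselines described above, here is a minimal NumPy sketch (our own implementation, not code from the paper or the shared task); the tie-breaking rule and the uniform fallback for documents with no votes are our own assumptions.

```python
# Minimal sketches of the Majority Voting and Vote Distribution baselines.
import numpy as np

def _vote_counts(labels, C):
    """labels: (N, K) int array of crowd labels, -1 where missing. Returns (N, C) vote counts."""
    return np.stack([(labels == j).sum(axis=1) for j in range(C)], axis=1).astype(float)

def majority_vote(labels, C):
    """Class with the most votes per document; ties are broken towards the lowest class index."""
    return _vote_counts(labels, C).argmax(axis=1)

def vote_distribution(labels, C):
    """Fraction of votes per class, used as a class probability estimate; uniform if no votes."""
    counts = _vote_counts(labels, C)
    totals = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, totals, out=np.full_like(counts, 1.0 / C), where=totals > 0)
```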
In our experiments, we set the priors for IBCC, CBCC and ScalBCCWords as follows. For the class proportions, $\rho$, we used unbiased priors by setting the values of $\beta_{0,c}$ to be equal for all classes. For the workers' confusion matrices, we used informative priors, setting the diagonal counts $\alpha_{0,c,c}^{(k)}$ to $C + 1.5$, with the off-diagonals set to 1. This means that workers are initially assumed to be reasonably accurate. For ScalBCCWords, the word distributions were given uninformative priors, by setting uniform values for $\gamma_{0,c,d}$ for all words $d \in D$.

5.1 Datasets
We evaluate our approach using two crowdsourcing datasets, which provide real sentiment judgements obtained from human workers. The two datasets demonstrate our approach on two very different kinds of document, with distinct sentiment analysis problems.

The CrowdFlower dataset (CF) was provided by CrowdFlower as part of the 2013 Crowdsourcing at Scale shared task challenge. The dataset contains 569,375 judgements for 98,980 tweets. This dataset includes 300 tweets with gold-standard sentiment labels, which correspond to 1,720 judgements from 461 workers. The judgements reflect the sentiment of tweets discussing the weather, and can take values from five sentiment categories: negative (0), neutral (1), positive (2), tweet not related to weather (4) and cannot tell (5). This dataset therefore concerns a multi-class labelling problem.

The Sentiment Polarity dataset (SP) contains annotations for a set of 5,000 sentences from movie reviews, extracted by [23] from the website RottenTomatoes. This dataset has gold-standard sentiment labels for all the movie reviews assigned by the website, which marked them as either fresh (positive) or rotten (negative). A set of 27,747 sentiment judgements were collected from 203 workers using Amazon Mechanical Turk (AMT) by [27]. The SP dataset therefore presents a binary sentiment analysis problem, with workers forced to select either positive (1) or negative (0), with no option to express their uncertainty.

The vocabulary of a real-world text corpus is often extremely large, so most practical deployments of language modelling methods employ a set of heuristic pre-processing steps to remove noisy data that would otherwise add unnecessary computation and memory costs. In our experiments, the dictionary for both datasets was obtained using the standard approach of stemming the text with Porter's stemming algorithm [24], then removing common stop words, before extracting the 300 words with the highest term frequency × inverse document frequency (TF-IDF) score [2]. TF-IDF is a heuristic for selecting words that are important in distinguishing documents within a corpus, where term frequency is the number of occurrences of word $d$ within the corpus, and inverse document frequency $= \log(N/N_d)$, where $N_d$ is the number of documents containing $d$. While ScalBCCWords is agnostic to the type of features supplied, these standard text pre-processing steps are used to allow the experiments to focus on comparing crowdsourced sentiment classification methods.
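The dictionary-building heuristic just described can be sketched as follows. This is a minimal illustration under our own assumptions (NLTK's PorterStemmer as the stemmer, a tiny illustrative stop-word list, and a simple regular-expression tokeniser), not the exact pipeline used in the paper.

```python
# Sketch of the pre-processing heuristic: stem, drop stop words, keep the 300 terms
# with the highest corpus-level TF-IDF score.
import math, re
from collections import Counter
from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "i", "rt"}  # illustrative only

def build_dictionary(documents, vocab_size=300):
    stemmer = PorterStemmer()
    tokenised = []
    for text in documents:
        tokens = [stemmer.stem(w) for w in re.findall(r"[a-z']+", text.lower())
                  if w not in STOP_WORDS]
        tokenised.append(tokens)

    term_freq = Counter(t for doc in tokenised for t in doc)        # occurrences of each term in the corpus
    doc_freq = Counter(t for doc in tokenised for t in set(doc))    # number of documents containing the term
    n_docs = len(documents)
    tfidf = {t: term_freq[t] * math.log(n_docs / doc_freq[t]) for t in term_freq}

    vocab = sorted(tfidf, key=tfidf.get, reverse=True)[:vocab_size]
    return {term: idx for idx, term in enumerate(vocab)}, tokenised
```

The returned mapping from term to column index can then be used to build the (N, D) word-count matrix consumed by the model.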

Figure 3: Accuracy of seven methods measured with increasing proportions of labels for both datasets: (a) CrowdFlower (CF), (b) Sentiment Polarity (SP).

5.2 Performance Comparisons
We investigated how the effectiveness of the language model learnt by ScalBCCWords varies with the number of labels supplied by the crowd. To do this we compared the performance of the alternative methods on our two datasets and evaluated the efficacy of each using four standard metrics:

Accuracy is the proportion of documents that were correctly labelled. For methods such as ScalBCCWords that output probabilities, we assign the label with the highest probability.

Average recall is the mean across all classes of the recall rate, defined as the fraction of positive instances of a given class that were correctly labelled.

Negative log probability density (NLPD) is an error measure (the lower the better) based on how much weight a classifier gave to the correct class of each document, as defined in [30].

AUC is the area under the curve of the receiver operating characteristic (ROC), which is the probability that a randomly-chosen positive example is assigned a higher probability than a randomly-chosen negative example [28]. This is a measure of an algorithm's ability to differentiate between classes, regardless of whether the classes are imbalanced. For the CF dataset, we provide the mean AUC over pairs of classes, using the method of [10].
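A sketch of how these four metrics might be computed is shown below. This is our own implementation: the multi-class AUC call uses scikit-learn's one-vs-one macro averaging, which follows the generalisation of [10], and it assumes the columns of the probability matrix are ordered by class index.

```python
# Sketch of the four evaluation metrics used in the comparison.
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(probs, y_true):
    """probs: (N, C) predicted class probabilities; y_true: (N,) gold classes as integers 0..C-1."""
    y_true = np.asarray(y_true)
    y_pred = probs.argmax(axis=1)
    accuracy = np.mean(y_pred == y_true)

    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]       # per-class recall
    avg_recall = np.mean(recalls)

    eps = 1e-12                                                          # avoid log(0) for overconfident mistakes
    nlpd = -np.mean(np.log(probs[np.arange(len(y_true)), y_true] + eps))

    if len(classes) == 2:
        auc = roc_auc_score(y_true, probs[:, 1])
    else:
        auc = roc_auc_score(y_true, probs, multi_class="ovo", average="macro")   # mean AUC over class pairs
    return {"accuracy": accuracy, "avg_recall": avg_recall, "nlpd": nlpd, "auc": auc}
```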
The experiment is run iteratively, starting by running each method with 2% randomly-chosen judgements from the crowd, then evaluating the classification efficacy. We then increase the number of judgements by adding a further 2% of randomly-selected labels from the crowd and re-running all the methods. This process is repeated until all crowdsourced labels have been used by the prediction methods. Figure 3 shows the accuracy of the seven methods on both datasets, which improves for all methods as they get more data. In particular, ScalBCCWords has the highest accuracy for a small number of labels, demonstrating the added value of the language model. ScalBCCWords maintains the highest accuracy throughout, although IBCC, CBCC and Dawid & Skene catch up for large numbers of crowdsourced labels.

Table 1: Accuracy, average recall and negative log probability density score (NLPD) for the CF and the SP datasets for the seven tested methods (one for each row) when using 20% of the crowdsourced labels. Using this subset of labels, 70% of the documents in both datasets have at least one crowdsourced label.

Table 2: Accuracy, average recall and negative log probability density score (NLPD) for the CF and the SP datasets for the seven tested methods (one for each row) when using all available crowdsourced labels.

The accuracy of ScalBCCWords is 25% higher than that of CBCC after 20,000 labels in CF and 8% higher than CBCC after 110 labels in SP. Importantly, in order to achieve the same accuracy, ScalBCCWords requires up to 56,935 fewer labels in CF and up to 440 fewer labels in SP compared to the benchmarks. Furthermore, Dawid & Skene initially infers a poor model of worker accuracy due to scarce labels, which leads to poor classification performance. Such a cold-start phase is mitigated in the BCC methods by accounting for uncertainty in the workers' accuracy. Both MV and Vote Distribution are more accurate than Dawid & Skene in the initial phase, but they are less accurate than all the other methods once a larger amount of crowd labels has been used.

Table 1 shows prediction metrics for both datasets when only 20% of the complete set of crowdsourced labels were used. Using this subset reduces the number of labels available for each document, so that only 70% of documents have one or more crowdsourced labels. To classify the 30% of documents with no crowdsourced labels, ScalBCCWords and the bag-of-words classifier apply their language models, while the other methods have no information about these documents and assign a default category to all unlabelled documents. ScalBCCWords has the highest accuracy and average recall on both datasets, significantly outperforming the bag-of-words model, which also uses a language model but does not model annotator bias and error rates.

Table 2 shows prediction metrics for both datasets when all crowdsourced labels were used. ScalBCCWords has the highest accuracy and average recall on both datasets, and competitive AUC and NLPD on the SP dataset. This demonstrates that ScalBCCWords performs well even when labels are plentiful (in this case, on average 6 labels per document).

5.3 Language Model
The language model inferred by ScalBCCWords represents the probabilities of each word in the dictionary conditioned on the sentiment classes. In Figure 4, the top row shows the Wordles (word clouds) with the most probable 27 words in the five sentiment classes of the CF dataset. ScalBCCWords is able to identify words that discriminate the sentiment classes, such as "love" and "perfect", which are more likely to occur in the positive sentiment class, whereas words such as "cold" and "hate" are more likely to appear in the negative class. We note that common words such as "day" are highly likely in both positive and negative classes and are therefore not good discriminators in this dataset. However, ScalBCCWords naturally uses the most discriminative words to infer the sentiment class through Equation (2). In the second row of Figure 4, the wordles show the words that most strongly indicate the class, i.e. the words $d$ with the highest probability $p(t_i = c \mid w_{i,n} = d)$ for class $c$. Here we can see that "day" is not indicative of sentiment class and there is little overlap between the word clouds for each class.
Figure 5 shows the Wordles for the SP dataset, with the word "good" being equally likely in both sentiment classes, suggesting that words that seem intuitively positive may be poor discriminators, possibly because their meaning is highly context-dependent.

To validate the quality of the language model inferred by ScalBCCWords using the crowd labels, we compare it to a language model learned by training ScalBCCWords on the gold-standard labels. For both models, we rank words by their probabilities $\omega_{c,d}$ in each class, to examine which terms the model has inferred are important to each class. Using the non-parametric Kendall's $\tau$ rank correlation test, we find a significant positive correlation between the rankings obtained from the model trained on the crowdsourced data and the model trained on gold labels, as shown in Table 3 for both datasets. This indicates that ScalBCCWords can effectively train a language model using crowd labels when gold-standard data is unavailable. Kendall's $\tau$ is much higher for SP, which reflects the higher accuracy of ScalBCCWords on the SP dataset shown in Table 2. This suggests that the crowd's decisions may more accurately reflect the gold-labelled data when classifying the SP dataset, which has only two diametrically opposed classes, rather than the five less easily distinguished classes of CF.
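A minimal sketch of this validation step follows, assuming the two trained models expose their word-probability matrices $\omega$ as (C, D) arrays; the function name and array layout are our own.

```python
# Sketch of the rank-correlation check between crowd-trained and gold-trained word distributions.
from scipy.stats import kendalltau

def compare_language_models(omega_crowd, omega_gold):
    """omega_*: (C, D) word probabilities per class from the two trained models.
    Returns a list of (tau, p_value) pairs, one per class."""
    results = []
    for c in range(omega_crowd.shape[0]):
        tau, p_value = kendalltau(omega_crowd[c], omega_gold[c])   # tau is rank-based, so raw probabilities suffice
        results.append((tau, p_value))
    return results
```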

Figure 4: Top row, (a) to (e) (Positive, Negative, Neutral, Not related, Unknown): word clouds of the most probable 27 words from ScalBCCWords for the sentiment classes of the CF dataset. Word size is proportional to the estimated likelihood given the true class. Second row, (f) to (j): word clouds of the most discriminative 27 words for the classes of the CF dataset. Word size is proportional to the class probability given the word. Colours are for legibility only.

Figure 5: Word clouds of the most probable 27 words from ScalBCCWords for the sentiment classes of the SP dataset: (a) Positive, (b) Negative. Word size is proportional to the estimated likelihood given the true class. Colours are for legibility only.

Table 3: The Kendall's $\tau$ rank correlation coefficients ($p < 10^{-5}$) for the word distributions, $\omega_{c,d}$, estimated by BCCWords using gold labels and crowd judgements. Colour intensity indicates the correlation strength.

5.4 Profiling Crowd Workers
Besides predicting the document classifications and the language model, ScalBCCWords learns the confusion matrices that characterise the workers' skill levels across the sentiment classes. Figure 6 shows example crowd members with very different confusion matrices. For example, subfigure (a) shows a competent annotator who provides accurate labels across all sentiment classes, hence the highly peaked likelihoods along the diagonals. Subfigure (b) shows an annotator whose reliability is inconsistent across the classes, with varying likelihoods of incorrect worker labels. This figure shows that we are able to detect annotators with very different behaviour within our two real-world datasets. The BCCWords model captures not just the overall skill level, but also the accuracy and bias of the annotator for each specific class, shown by the distribution in each row of Figure 6.
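The worker profiles discussed here can be read directly off the inferred Dirichlet parameters: the posterior mean of each confusion-matrix row is the expected label distribution for that true class. A minimal sketch, assuming the per-worker parameters are available as (C, C) arrays of counts (variable names are our own):

```python
# Sketch of summarising a worker's profile from the inferred Dirichlet counts.
import numpy as np

def expected_confusion_matrix(alpha_k):
    """alpha_k: (C, C) posterior Dirichlet counts for worker k (rows indexed by true class c)."""
    return alpha_k / alpha_k.sum(axis=1, keepdims=True)   # Dirichlet posterior mean per row

def worker_accuracy(alpha_k):
    """Overall skill summary: mean diagonal of the expected confusion matrix."""
    return np.mean(np.diag(expected_confusion_matrix(alpha_k)))
```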

Figure 6: Confusion matrices of four workers, with the likelihood of the worker's label given the true class on the vertical axis: (a), (b) CF dataset; (c), (d) SP dataset. These profiles show two workers who are very well calibrated in their opinions (left) and two workers who provide less accurate labels (right).

5.5 Memory Usage
We compared the memory usage of BCCWords-VB and ScalBCCWords on the CF dataset. Figure 7 shows a plot of memory demand when running the BCCWords-VB algorithm with increasing subsets of labels. In particular, we measured memory demand through the standard memory profiler available in .NET (CLR Profiler, clrprofiler.codeplex.com), which provides the approximate memory allocated on the garbage collection heaps to store the model instances of BCCWords. Despite the high noise of these counters, which explains the variability of the curves in the graph, we can still observe the general increasing trend of memory usage when using more labels. As shown in Figure 7, the ScalBCCWords algorithm uses up to 80% less memory than the standard implementation of BCCWords (1 GB vs. 200 MB) when the dataset includes 50,000 labels.

Figure 7: Memory usage of BCCWords (orange line) and the scalable implementation of BCCWords (blue line) measured on real data.

6. CONCLUSIONS
This paper presents BCCWords, a novel algorithm for combining crowdsourced annotations with text features in order to determine the sentiment of documents. We presented a scalable variational Bayes inference algorithm for BCCWords and demonstrated how it can be implemented for a large corpus annotated by crowd workers. Our analysis demonstrated that BCCWords is able to identify key differentiating text features, which produce more accurate sentiment classifications when crowdsourced labels are scarce. It is able to classify short messages such as tweets, despite the limited number of text features in these short messages.

We compared our algorithm with six benchmark methods on two real-world crowdsourcing datasets and showed that our method can improve accuracy by 25% over both standard text classifiers and prominent aggregation models for crowdsourced data when annotations are available for only a small portion of documents. Furthermore, our approach significantly reduces, by up to 67%, the amount of labels that must be obtained through crowdsourcing to achieve comparable accuracy with rival methods.

We are currently investigating other prominent applications of our method: identifying aid requirements in disaster response using reports from members of the public and first responders; evaluating investor sentiment towards companies expressed in free-text reports; and determining student sentiment from online forum postings to aid pastoral care. These domains provide vast amounts of unstructured data that can benefit from the insights provided by human annotators and the scalability of automated methods. Future work will evaluate BCCWords with the different types of features available in these domains, including alternative text features, document metadata, and image features. The way that BCCWords learns the annotator confusion matrices could be modified for problems with ordinal classes, such as those representing different strengths of sentiment, to take advantage of relationships between neighbouring classes when samples of annotator behaviour for each label and true class value are sparse.

7. ACKNOWLEDGMENTS
We thank Gabriella Kazai, Lumi, London, UK for initial discussions. This work was funded by the EPSRC ORCHID programme grant (EP/I011587/1) and Microsoft Research, Cambridge, UK.

8. REFERENCES
[1] Y. Bachrach, T. Graepel, T. Minka, and J. Guiver. How To Grade a Test Without Knowing the Answers: A Bayesian Graphical Model for Adaptive Crowdsourcing and Aptitude Testing. In Proc. of the 29th Int. Conf. on Machine Learning. ACM.

[2] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley.
[3] J. Bergstra and Y. Bengio. Random Search for Hyper-Parameter Optimization. The Journal of Machine Learning Research, 13.
[4] C. Bishop. Pattern Recognition and Machine Learning. Springer, 4th edition.
[5] D. Butler. Crowdsourcing Goes Mainstream in Typhoon Haiyan Response. Nature News.
[6] C. Chew and G. Eysenbach. Pandemics in the Age of Twitter: Content Analysis of Tweets during the 2009 H1N1 Outbreak. PLoS One, 5(11):e14118.
[7] A. P. Dawid and A. M. Skene. Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics), 28(1):20-28.
[8] P. Donmez, J. Carbonell, and J. Schneider. A Probabilistic Framework to Learn from Multiple Annotators with Time-Varying Accuracy. In Proc. of the Int. Conf. on Data Mining.
[9] A. Gelfand and A. Smith. Sampling-Based Approaches to Calculating Marginal Densities. Journal of the American Statistical Association, 85(410).
[10] D. J. Hand and R. J. Till. A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems. Machine Learning, 45(2).
[11] Z. S. Harris. Distributional Structure. Word.
[12] N. R. Jennings, L. Moreau, D. Nicholson, S. D. Ramchurn, S. Roberts, T. Rodden, and A. Rogers. On Human-Agent Collectives. Communications of the ACM.
[13] E. Kamar, S. Hacker, and E. Horvitz. Combining Human and Machine Intelligence in Large-Scale Crowdsourcing. In Proc. of the 11th Int. Conf. on Autonomous Agents and Multiagent Systems.
[14] H. Kim and Z. Ghahramani. Bayesian Classifier Combination. In Proc. of the 15th Int. Conf. on Artificial Intelligence and Statistics, page 619.
[15] F. Kivran-Swaine, S. Brody, and M. Naaman. Effects of Gender and Tie Strength on Twitter Interactions. First Monday, 18(9).
[16] S. Kullback and R. A. Leibler. On Information and Sufficiency. The Annals of Mathematical Statistics, 22(1):79-86.
[17] A. Levenberg, S. Pulman, K. Moilanen, E. Simpson, and S. Roberts. Predicting Economic Indicators from Web Text Using Sentiment Composition. Int. Journal of Computer and Communication Engineering.
[18] N. Littlestone and M. Warmuth. The Weighted Majority Algorithm. In 30th Annual Symposium on Foundations of Computer Science. IEEE.
[19] T. Minka. Expectation Propagation for Approximate Bayesian Inference. In Proc. of the 17th Conf. on Uncertainty in Artificial Intelligence, page 362.
[20] T. Minka, J. Winn, J. Guiver, and D. Knowles. Infer.NET 2.5. Microsoft Research Cambridge. See microsoft.com/infernet.
[21] K. Moilanen and S. Pulman. Sentiment Composition. In Proc. of the Recent Advances in Natural Language Processing Int. Conf.
[22] N. Morrow, N. Mock, A. Papendieck, and N. Kocmich. Independent Evaluation of the Ushahidi Haiti Project. Development Information Systems, 8:2011.
[23] B. Pang and L. Lee. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts. In Proc. of the 42nd Annual Meeting of the Association for Computational Linguistics, page 271.
[24] M. F. Porter. An Algorithm for Suffix Stripping. Program: Electronic Library and Information Systems, 14(3).
[25] V. C. Raykar and S. Yu. Eliminating Spammers and Ranking Annotators for Crowdsourced Labeling Tasks. Journal of Machine Learning Research, 13.
[26] V. C. Raykar, S. Yu, L. H. Zhao, G. H. Valadez, C. Florin, L. Bogoni, and L. Moy. Learning From Crowds. Journal of Machine Learning Research, 11.
[27] F. Rodrigues, F. Pereira, and B. Ribeiro. Learning from Multiple Annotators: Distinguishing Good from Random Labelers. Pattern Recognition Letters, 34(12).
[28] E. Simpson, S. Roberts, I. Psorakis, and A. Smith. Dynamic Bayesian Combination of Multiple Imperfect Classifiers. In Intelligent Systems Reference Library series: Decision Making and Imperfection. Springer.
[29] L. Tran-Thanh, M. Venanzi, A. Rogers, and N. R. Jennings. Efficient Budget Allocation with Accuracy Guarantees for Crowdsourcing Classification Tasks. In Proc. of the 12th Int. Conf. on Autonomous Agents and Multiagent Systems.
[30] M. Venanzi, J. Guiver, G. Kazai, P. Kohli, and M. Shokouhi. Community-Based Bayesian Aggregation Models for Crowdsourcing. In Proc. of the 23rd Int. Conf. on World Wide Web.
[31] P. Welinder, S. Branson, P. Perona, and S. J. Belongie. The Multidimensional Wisdom of Crowds. In Advances in Neural Information Processing Systems.
[32] J. Whitehill, T.-F. Wu, J. Bergsma, J. R. Movellan, and P. L. Ruvolo. Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise. In Advances in Neural Information Processing Systems.
[33] K. W. Willett, C. J. Lintott, S. P. Bamford, K. L. Masters, B. D. Simmons, K. R. Casteels, E. M. Edmondson, L. F. Fortson, S. Kaviraj, W. C. Keel, et al. Galaxy Zoo 2: Detailed Morphological Classifications for 304,122 Galaxies from the Sloan Digital Sky Survey. Monthly Notices of the Royal Astronomical Society, 435(4).


More information

Scalable Parametric Runtime Monitoring

Scalable Parametric Runtime Monitoring Salable Parametr Runtme Montorng Dongyun Jn Patrk O Nel Meredth Grgore Roşu Department of Computer Sene Unversty of Illnos at Urbana Champagn Urbana, IL, U.S.A. {djn3, pmeredt, grosu}@s.llnos.edu Abstrat

More information

Pixel-Based Texture Classification of Tissues in Computed Tomography

Pixel-Based Texture Classification of Tissues in Computed Tomography Pxel-Based Texture Classfaton of Tssues n Computed Tomography Ruhaneewan Susomboon, Danela Stan Rau, Jaob Furst Intellgent ultmeda Proessng Laboratory Shool of Computer Sene, Teleommunatons, and Informaton

More information

Three supervised learning methods on pen digits character recognition dataset

Three supervised learning methods on pen digits character recognition dataset Three supervsed learnng methods on pen dgts character recognton dataset Chrs Flezach Department of Computer Scence and Engneerng Unversty of Calforna, San Dego San Dego, CA 92093 cflezac@cs.ucsd.edu Satoru

More information

Support Vector Machines

Support Vector Machines /9/207 MIST.6060 Busness Intellgence and Data Mnng What are Support Vector Machnes? Support Vector Machnes Support Vector Machnes (SVMs) are supervsed learnng technques that analyze data and recognze patterns.

More information

A Semi-parametric Approach for Analyzing Longitudinal Measurements with Non-ignorable Missingness Using Regression Spline

A Semi-parametric Approach for Analyzing Longitudinal Measurements with Non-ignorable Missingness Using Regression Spline Avalable at http://pvamu.edu/aam Appl. Appl. Math. ISSN: 93-9466 Vol., Issue (June 5), pp. 95 - Applatons and Appled Mathemats: An Internatonal Journal (AAM) A Sem-parametr Approah for Analyzng Longtudnal

More information

Machine Learning: Algorithms and Applications

Machine Learning: Algorithms and Applications 14/05/1 Machne Learnng: Algorthms and Applcatons Florano Zn Free Unversty of Bozen-Bolzano Faculty of Computer Scence Academc Year 011-01 Lecture 10: 14 May 01 Unsupervsed Learnng cont Sldes courtesy of

More information

Bayesian Aggregation of Categorical Distributions with Applications in Crowdsourcing

Bayesian Aggregation of Categorical Distributions with Applications in Crowdsourcing Bayesan Aggregaton of Categorcal Dstrbutons wth Applcatons n Crowdsourcng Alexandry Augustn Southampton Unversty Southampton, UK aa7e14@ecs.soton.ac.uk Matteo Venanz Mcrosoft London, UK mavena@mcrosoft.com

More information

Optimal shape and location of piezoelectric materials for topology optimization of flextensional actuators

Optimal shape and location of piezoelectric materials for topology optimization of flextensional actuators Optmal shape and loaton of pezoeletr materals for topology optmzaton of flextensonal atuators ng L 1 Xueme Xn 2 Noboru Kkuh 1 Kazuhro Satou 1 1 Department of Mehanal Engneerng, Unversty of Mhgan, Ann Arbor,

More information

Computing Cloud Cover Fraction in Satellite Images using Deep Extreme Learning Machine

Computing Cloud Cover Fraction in Satellite Images using Deep Extreme Learning Machine Computng Cloud Cover Fraton n Satellte Images usng Deep Extreme Learnng Mahne L-guo WENG, We-bn KONG, Mn XIA College of Informaton and Control, Nanjng Unversty of Informaton Sene & Tehnology, Nanjng Jangsu

More information

Performance Evaluation of Information Retrieval Systems

Performance Evaluation of Information Retrieval Systems Why System Evaluaton? Performance Evaluaton of Informaton Retreval Systems Many sldes n ths secton are adapted from Prof. Joydeep Ghosh (UT ECE) who n turn adapted them from Prof. Dk Lee (Unv. of Scence

More information

Connectivity in Fuzzy Soft graph and its Complement

Connectivity in Fuzzy Soft graph and its Complement IOSR Journal of Mathemats (IOSR-JM) e-issn: 2278-5728, p-issn: 2319-765X. Volume 1 Issue 5 Ver. IV (Sep. - Ot.2016), PP 95-99 www.osrjournals.org Connetvty n Fuzzy Soft graph and ts Complement Shashkala

More information

An Entropy-Based Approach to Integrated Information Needs Assessment

An Entropy-Based Approach to Integrated Information Needs Assessment Dstrbuton Statement A: Approved for publc release; dstrbuton s unlmted. An Entropy-Based Approach to ntegrated nformaton Needs Assessment June 8, 2004 Wllam J. Farrell Lockheed Martn Advanced Technology

More information

A MPAA-Based Iterative Clustering Algorithm Augmented by Nearest Neighbors Search for Time-Series Data Streams

A MPAA-Based Iterative Clustering Algorithm Augmented by Nearest Neighbors Search for Time-Series Data Streams A MPAA-Based Iteratve Clusterng Algorthm Augmented by Nearest Neghbors Searh for Tme-Seres Data Streams Jessa Ln 1, Mha Vlahos 1, Eamonn Keogh 1, Dmtros Gunopulos 1, Janwe Lu 2, Shouan Yu 2, and Jan Le

More information

Microprocessors and Microsystems

Microprocessors and Microsystems Mroproessors and Mrosystems 36 (2012) 96 109 Contents lsts avalable at SeneDret Mroproessors and Mrosystems journal homepage: www.elsever.om/loate/mpro Hardware aelerator arhteture for smultaneous short-read

More information

A Binarization Algorithm specialized on Document Images and Photos

A Binarization Algorithm specialized on Document Images and Photos A Bnarzaton Algorthm specalzed on Document mages and Photos Ergna Kavalleratou Dept. of nformaton and Communcaton Systems Engneerng Unversty of the Aegean kavalleratou@aegean.gr Abstract n ths paper, a

More information

Modeling Radiometric Uncertainty for Vision with Tone-mapped Color Images

Modeling Radiometric Uncertainty for Vision with Tone-mapped Color Images 1 Modelng Radometr Unertanty for Vson wth Tone-mapped Color Images Ayan Chakrabart, Yng Xong, Baohen Sun, Trevor Darrell, Danel Sharsten, Todd Zkler, and Kate Saenko arxv:1311.6887v [s.cv] 9 Apr 14 Abstrat

More information

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems

A Unified Framework for Semantics and Feature Based Relevance Feedback in Image Retrieval Systems A Unfed Framework for Semantcs and Feature Based Relevance Feedback n Image Retreval Systems Ye Lu *, Chunhu Hu 2, Xngquan Zhu 3*, HongJang Zhang 2, Qang Yang * School of Computng Scence Smon Fraser Unversty

More information

Avatar Face Recognition using Wavelet Transform and Hierarchical Multi-scale LBP

Avatar Face Recognition using Wavelet Transform and Hierarchical Multi-scale LBP 2011 10th Internatonal Conferene on Mahne Learnng and Applatons Avatar Fae Reognton usng Wavelet Transform and Herarhal Mult-sale LBP Abdallah A. Mohamed, Darryl D Souza, Naouel Bal and Roman V. Yampolsky

More information

An Optimal Algorithm for Prufer Codes *

An Optimal Algorithm for Prufer Codes * J. Software Engneerng & Applcatons, 2009, 2: 111-115 do:10.4236/jsea.2009.22016 Publshed Onlne July 2009 (www.scrp.org/journal/jsea) An Optmal Algorthm for Prufer Codes * Xaodong Wang 1, 2, Le Wang 3,

More information

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields

A mathematical programming approach to the analysis, design and scheduling of offshore oilfields 17 th European Symposum on Computer Aded Process Engneerng ESCAPE17 V. Plesu and P.S. Agach (Edtors) 2007 Elsever B.V. All rghts reserved. 1 A mathematcal programmng approach to the analyss, desgn and

More information

S1 Note. Basis functions.

S1 Note. Basis functions. S1 Note. Bass functons. Contents Types of bass functons...1 The Fourer bass...2 B-splne bass...3 Power and type I error rates wth dfferent numbers of bass functons...4 Table S1. Smulaton results of type

More information

Parallelism for Nested Loops with Non-uniform and Flow Dependences

Parallelism for Nested Loops with Non-uniform and Flow Dependences Parallelsm for Nested Loops wth Non-unform and Flow Dependences Sam-Jn Jeong Dept. of Informaton & Communcaton Engneerng, Cheonan Unversty, 5, Anseo-dong, Cheonan, Chungnam, 330-80, Korea. seong@cheonan.ac.kr

More information

Gabor-Filtering-Based Completed Local Binary Patterns for Land-Use Scene Classification

Gabor-Filtering-Based Completed Local Binary Patterns for Land-Use Scene Classification Gabor-Flterng-Based Completed Loal Bnary Patterns for Land-Use Sene Classfaton Chen Chen 1, Lbng Zhou 2,*, Janzhong Guo 1,2, We L 3, Hongjun Su 4, Fangda Guo 5 1 Department of Eletral Engneerng, Unversty

More information

DETECTING AND ANALYZING CORROSION SPOTS ON THE HULL OF LARGE MARINE VESSELS USING COLORED 3D LIDAR POINT CLOUDS

DETECTING AND ANALYZING CORROSION SPOTS ON THE HULL OF LARGE MARINE VESSELS USING COLORED 3D LIDAR POINT CLOUDS ISPRS Annals of the Photogrammetry, Remote Sensng and Spatal Informaton Senes, Volume III-3, 2016 XXIII ISPRS Congress, 12 19 July 2016, Prague, Czeh Republ DETECTING AND ANALYZING CORROSION SPOTS ON THE

More information

A Robust Method for Estimating the Fundamental Matrix

A Robust Method for Estimating the Fundamental Matrix Proc. VIIth Dgtal Image Computng: Technques and Applcatons, Sun C., Talbot H., Ourseln S. and Adraansen T. (Eds.), 0- Dec. 003, Sydney A Robust Method for Estmatng the Fundamental Matrx C.L. Feng and Y.S.

More information

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS

A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Proceedngs of the Wnter Smulaton Conference M E Kuhl, N M Steger, F B Armstrong, and J A Jones, eds A MOVING MESH APPROACH FOR SIMULATION BUDGET ALLOCATION ON CONTINUOUS DOMAINS Mark W Brantley Chun-Hung

More information

International Journal of Pharma and Bio Sciences HYBRID CLUSTERING ALGORITHM USING POSSIBILISTIC ROUGH C-MEANS ABSTRACT

International Journal of Pharma and Bio Sciences HYBRID CLUSTERING ALGORITHM USING POSSIBILISTIC ROUGH C-MEANS ABSTRACT Int J Pharm Bo S 205 Ot; 6(4): (B) 799-80 Researh Artle Botehnology Internatonal Journal of Pharma and Bo Senes ISSN 0975-6299 HYBRID CLUSTERING ALGORITHM USING POSSIBILISTIC ROUGH C-MEANS *ANURADHA J,

More information

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS

NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS ARPN Journal of Engneerng and Appled Scences 006-017 Asan Research Publshng Network (ARPN). All rghts reserved. NUMERICAL SOLVING OPTIMAL CONTROL PROBLEMS BY THE METHOD OF VARIATIONS Igor Grgoryev, Svetlana

More information

Private Information Retrieval (PIR)

Private Information Retrieval (PIR) 2 Levente Buttyán Problem formulaton Alce wants to obtan nformaton from a database, but she does not want the database to learn whch nformaton she wanted e.g., Alce s an nvestor queryng a stock-market

More information

Determining the Optimal Bandwidth Based on Multi-criterion Fusion

Determining the Optimal Bandwidth Based on Multi-criterion Fusion Proceedngs of 01 4th Internatonal Conference on Machne Learnng and Computng IPCSIT vol. 5 (01) (01) IACSIT Press, Sngapore Determnng the Optmal Bandwdth Based on Mult-crteron Fuson Ha-L Lang 1+, Xan-Mn

More information

A Toolbox for Easily Calibrating Omnidirectional Cameras

A Toolbox for Easily Calibrating Omnidirectional Cameras A oolbox for Easly Calbratng Omndretonal Cameras Davde Saramuzza, Agostno Martnell, Roland Segwart Autonomous Systems ab Swss Federal Insttute of ehnology Zurh EH) CH-89, Zurh, Swtzerland {davdesaramuzza,

More information

The Simulation of Electromagnetic Suspension System Based on the Finite Element Analysis

The Simulation of Electromagnetic Suspension System Based on the Finite Element Analysis 308 JOURNAL OF COMPUTERS, VOL. 8, NO., FEBRUARY 03 The Smulaton of Suspenson System Based on the Fnte Element Analyss Zhengfeng Mng Shool of Eletron & Mahanal Engneerng, Xdan Unversty, X an, Chna Emal:

More information

Classifying Acoustic Transient Signals Using Artificial Intelligence

Classifying Acoustic Transient Signals Using Artificial Intelligence Classfyng Acoustc Transent Sgnals Usng Artfcal Intellgence Steve Sutton, Unversty of North Carolna At Wlmngton (suttons@charter.net) Greg Huff, Unversty of North Carolna At Wlmngton (jgh7476@uncwl.edu)

More information

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters

Proper Choice of Data Used for the Estimation of Datum Transformation Parameters Proper Choce of Data Used for the Estmaton of Datum Transformaton Parameters Hakan S. KUTOGLU, Turkey Key words: Coordnate systems; transformaton; estmaton, relablty. SUMMARY Advances n technologes and

More information

Some Advanced SPC Tools 1. Cumulative Sum Control (Cusum) Chart For the data shown in Table 9-1, the x chart can be generated.

Some Advanced SPC Tools 1. Cumulative Sum Control (Cusum) Chart For the data shown in Table 9-1, the x chart can be generated. Some Advanced SP Tools 1. umulatve Sum ontrol (usum) hart For the data shown n Table 9-1, the x chart can be generated. However, the shft taken place at sample #21 s not apparent. 92 For ths set samples,

More information

Session 4.2. Switching planning. Switching/Routing planning

Session 4.2. Switching planning. Switching/Routing planning ITU Semnar Warsaw Poland 6-0 Otober 2003 Sesson 4.2 Swthng/Routng plannng Network Plannng Strategy for evolvng Network Arhtetures Sesson 4.2- Swthng plannng Loaton problem : Optmal plaement of exhanges

More information

Edge Detection in Noisy Images Using the Support Vector Machines

Edge Detection in Noisy Images Using the Support Vector Machines Edge Detecton n Nosy Images Usng the Support Vector Machnes Hlaro Gómez-Moreno, Saturnno Maldonado-Bascón, Francsco López-Ferreras Sgnal Theory and Communcatons Department. Unversty of Alcalá Crta. Madrd-Barcelona

More information

BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION

BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION BAYESIAN MULTI-SOURCE DOMAIN ADAPTATION SHI-LIANG SUN, HONG-LEI SHI Department of Computer Scence and Technology, East Chna Normal Unversty 500 Dongchuan Road, Shangha 200241, P. R. Chna E-MAIL: slsun@cs.ecnu.edu.cn,

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervsed Learnng and Clusterng Why consder unlabeled samples?. Collectng and labelng large set of samples s costly Gettng recorded speech s free, labelng s tme consumng 2. Classfer could be desgned

More information

FULLY AUTOMATIC IMAGE-BASED REGISTRATION OF UNORGANIZED TLS DATA

FULLY AUTOMATIC IMAGE-BASED REGISTRATION OF UNORGANIZED TLS DATA FULLY AUTOMATIC IMAGE-BASED REGISTRATION OF UNORGANIZED TLS DATA Martn Wenmann, Bors Jutz Insttute of Photogrammetry and Remote Sensng, Karlsruhe Insttute of Tehnology (KIT) Kaserstr. 12, 76128 Karlsruhe,

More information

On the End-to-end Call Acceptance and the Possibility of Deterministic QoS Guarantees in Ad hoc Wireless Networks

On the End-to-end Call Acceptance and the Possibility of Deterministic QoS Guarantees in Ad hoc Wireless Networks On the End-to-end Call Aeptane and the Possblty of Determnst QoS Guarantees n Ad ho Wreless Networks S. Srram T. heemarjuna Reddy Dept. of Computer Sene Dept. of Computer Sene and Engneerng Unversty of

More information

Mathematics 256 a course in differential equations for engineering students

Mathematics 256 a course in differential equations for engineering students Mathematcs 56 a course n dfferental equatons for engneerng students Chapter 5. More effcent methods of numercal soluton Euler s method s qute neffcent. Because the error s essentally proportonal to the

More information

Self-tuning Histograms: Building Histograms Without Looking at Data

Self-tuning Histograms: Building Histograms Without Looking at Data Self-tunng Hstograms: Buldng Hstograms Wthout Lookng at Data Ashraf Aboulnaga Computer Scences Department Unversty of Wsconsn - Madson ashraf@cs.wsc.edu Surajt Chaudhur Mcrosoft Research surajtc@mcrosoft.com

More information

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur

FEATURE EXTRACTION. Dr. K.Vijayarekha. Associate Dean School of Electrical and Electronics Engineering SASTRA University, Thanjavur FEATURE EXTRACTION Dr. K.Vjayarekha Assocate Dean School of Electrcal and Electroncs Engneerng SASTRA Unversty, Thanjavur613 41 Jont Intatve of IITs and IISc Funded by MHRD Page 1 of 8 Table of Contents

More information

Minimize Congestion for Random-Walks in Networks via Local Adaptive Congestion Control

Minimize Congestion for Random-Walks in Networks via Local Adaptive Congestion Control Journal of Communatons Vol. 11, No. 6, June 2016 Mnmze Congeston for Random-Walks n Networks va Loal Adaptve Congeston Control Yang Lu, Y Shen, and Le Dng College of Informaton Sene and Tehnology, Nanjng

More information

Discriminative Dictionary Learning with Pairwise Constraints

Discriminative Dictionary Learning with Pairwise Constraints Dscrmnatve Dctonary Learnng wth Parwse Constrants Humn Guo Zhuoln Jang LARRY S. DAVIS UNIVERSITY OF MARYLAND Nov. 6 th, Outlne Introducton/motvaton Dctonary Learnng Dscrmnatve Dctonary Learnng wth Parwse

More information

A Flexible Solution for Modeling and Tracking Generic Dynamic 3D Environments*

A Flexible Solution for Modeling and Tracking Generic Dynamic 3D Environments* A Flexble Soluton for Modelng and Trang Gener Dynam 3D Envronments* Radu Danesu, Member, IEEE, and Sergu Nedevsh, Member, IEEE Abstrat The traff envronment s a dynam and omplex 3D sene, whh needs aurate

More information

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1

Outline. Discriminative classifiers for image recognition. Where in the World? A nearest neighbor recognition example 4/14/2011. CS 376 Lecture 22 1 4/14/011 Outlne Dscrmnatve classfers for mage recognton Wednesday, Aprl 13 Krsten Grauman UT-Austn Last tme: wndow-based generc obect detecton basc ppelne face detecton wth boostng as case study Today:

More information

Steps for Computing the Dissimilarity, Entropy, Herfindahl-Hirschman and. Accessibility (Gravity with Competition) Indices

Steps for Computing the Dissimilarity, Entropy, Herfindahl-Hirschman and. Accessibility (Gravity with Competition) Indices Steps for Computng the Dssmlarty, Entropy, Herfndahl-Hrschman and Accessblty (Gravty wth Competton) Indces I. Dssmlarty Index Measurement: The followng formula can be used to measure the evenness between

More information

TN348: Openlab Module - Colocalization

TN348: Openlab Module - Colocalization TN348: Openlab Module - Colocalzaton Topc The Colocalzaton module provdes the faclty to vsualze and quantfy colocalzaton between pars of mages. The Colocalzaton wndow contans a prevew of the two mages

More information

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision

SLAM Summer School 2006 Practical 2: SLAM using Monocular Vision SLAM Summer School 2006 Practcal 2: SLAM usng Monocular Vson Javer Cvera, Unversty of Zaragoza Andrew J. Davson, Imperal College London J.M.M Montel, Unversty of Zaragoza. josemar@unzar.es, jcvera@unzar.es,

More information

Wishing you all a Total Quality New Year!

Wishing you all a Total Quality New Year! Total Qualty Management and Sx Sgma Post Graduate Program 214-15 Sesson 4 Vnay Kumar Kalakband Assstant Professor Operatons & Systems Area 1 Wshng you all a Total Qualty New Year! Hope you acheve Sx sgma

More information

Design Level Performance Modeling of Component-based Applications. Yan Liu, Alan Fekete School of Information Technologies University of Sydney

Design Level Performance Modeling of Component-based Applications. Yan Liu, Alan Fekete School of Information Technologies University of Sydney Desgn Level Performane Modelng of Component-based Applatons Tehnal Report umber 543 ovember, 003 Yan Lu, Alan Fekete Shool of Informaton Tehnologes Unversty of Sydney Ian Gorton Paf orthwest atonal Laboratory

More information

BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET

BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET 1 BOOSTING CLASSIFICATION ACCURACY WITH SAMPLES CHOSEN FROM A VALIDATION SET TZU-CHENG CHUANG School of Electrcal and Computer Engneerng, Purdue Unversty, West Lafayette, Indana 47907 SAUL B. GELFAND School

More information

Elsevier Editorial System(tm) for NeuroImage Manuscript Draft

Elsevier Editorial System(tm) for NeuroImage Manuscript Draft Elsever Edtoral System(tm) for NeuroImage Manusrpt Draft Manusrpt Number: Ttle: Comparson of ampltude normalzaton strateges on the auray and relablty of group ICA deompostons Artle Type: Tehnal Note Seton/Category:

More information

EECS 730 Introduction to Bioinformatics Sequence Alignment. Luke Huan Electrical Engineering and Computer Science

EECS 730 Introduction to Bioinformatics Sequence Alignment. Luke Huan Electrical Engineering and Computer Science EECS 730 Introducton to Bonformatcs Sequence Algnment Luke Huan Electrcal Engneerng and Computer Scence http://people.eecs.ku.edu/~huan/ HMM Π s a set of states Transton Probabltes a kl Pr( l 1 k Probablty

More information

Parallel Numerics. 1 Preconditioning & Iterative Solvers (From 2016)

Parallel Numerics. 1 Preconditioning & Iterative Solvers (From 2016) Technsche Unverstät München WSe 6/7 Insttut für Informatk Prof. Dr. Thomas Huckle Dpl.-Math. Benjamn Uekermann Parallel Numercs Exercse : Prevous Exam Questons Precondtonng & Iteratve Solvers (From 6)

More information

Problem Set 3 Solutions

Problem Set 3 Solutions Introducton to Algorthms October 4, 2002 Massachusetts Insttute of Technology 6046J/18410J Professors Erk Demane and Shaf Goldwasser Handout 14 Problem Set 3 Solutons (Exercses were not to be turned n,

More information

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points;

Subspace clustering. Clustering. Fundamental to all clustering techniques is the choice of distance measure between data points; Subspace clusterng Clusterng Fundamental to all clusterng technques s the choce of dstance measure between data ponts; D q ( ) ( ) 2 x x = x x, j k = 1 k jk Squared Eucldean dstance Assumpton: All features

More information

Smoothing Spline ANOVA for variable screening

Smoothing Spline ANOVA for variable screening Smoothng Splne ANOVA for varable screenng a useful tool for metamodels tranng and mult-objectve optmzaton L. Rcco, E. Rgon, A. Turco Outlne RSM Introducton Possble couplng Test case MOO MOO wth Game Theory

More information

Intelligent Information Acquisition for Improved Clustering

Intelligent Information Acquisition for Improved Clustering Intellgent Informaton Acquston for Improved Clusterng Duy Vu Unversty of Texas at Austn duyvu@cs.utexas.edu Mkhal Blenko Mcrosoft Research mblenko@mcrosoft.com Prem Melvlle IBM T.J. Watson Research Center

More information

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour

6.854 Advanced Algorithms Petar Maymounkov Problem Set 11 (November 23, 2005) With: Benjamin Rossman, Oren Weimann, and Pouya Kheradpour 6.854 Advanced Algorthms Petar Maymounkov Problem Set 11 (November 23, 2005) Wth: Benjamn Rossman, Oren Wemann, and Pouya Kheradpour Problem 1. We reduce vertex cover to MAX-SAT wth weghts, such that the

More information

Meta-heuristics for Multidimensional Knapsack Problems

Meta-heuristics for Multidimensional Knapsack Problems 2012 4th Internatonal Conference on Computer Research and Development IPCSIT vol.39 (2012) (2012) IACSIT Press, Sngapore Meta-heurstcs for Multdmensonal Knapsack Problems Zhbao Man + Computer Scence Department,

More information

Monte Carlo Integration

Monte Carlo Integration Introducton Monte Carlo Integraton Dgtal Image Synthess Yung-Yu Chuang 11/9/005 The ntegral equatons generally don t have analytc solutons, so we must turn to numercal methods. L ( o p,ωo) = L e ( p,ωo)

More information

Fast Sparse Gaussian Processes Learning for Man-Made Structure Classification

Fast Sparse Gaussian Processes Learning for Man-Made Structure Classification Fast Sparse Gaussan Processes Learnng for Man-Made Structure Classfcaton Hang Zhou Insttute for Vson Systems Engneerng, Dept Elec. & Comp. Syst. Eng. PO Box 35, Monash Unversty, Clayton, VIC 3800, Australa

More information