Debiasing Crowdsourced Batches


Honglei Zhuang, Aditya Parameswaran, Dan Roth and Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
{hzhuang3, adityagp, danr,

ABSTRACT

Crowdsourcing is the de facto standard for gathering annotated data. In theory, data annotation tasks are assumed to be attempted by workers independently. In practice, they are often grouped into batches that are presented to and annotated by workers together, in order to save the time or cost overhead of providing instructions or necessary background. Thus, even though independence is usually assumed between annotations on data items within the same batch, in most cases a worker's judgment on a data item can still be affected by other data items within the batch, leading to additional errors in the collected labels. In this paper, we study the data annotation bias that arises when data items are presented as batches to be judged by workers simultaneously. We propose a novel worker model to characterize the annotating behavior on small data batches, and present how to train the worker model on annotation data sets. We also present a debiasing technique that removes the adverse effect of such annotation bias on the accuracy of the labels obtained. Our experimental results on both synthetic data and real-world data demonstrate the effectiveness of our proposed method.

1. INTRODUCTION

Crowdsourcing provides an efficient method to annotate data on a large scale for various machine learning tasks by employing a massive workforce drawn from global Internet users. Popular online crowdsourcing platforms include Amazon Mechanical Turk and CrowdFlower. However, while crowdsourcing is relatively cheap compared to employing experts, getting large quantities of training data annotated by crowds (say thousands, or millions of data items) can be rather expensive.
A key mechanism, often employed in practice for reducing costs, is batching, i.e., grouping multiple data items (to be annotated together) into one single task as a batch. Batching can save significant monetary costs, since the necessary instructions and background for completing the task need to be provided just once for the entire batch. Thus, the worker will spend less time on reviewing these instructions and more time on annotating data items, and therefore will be able to annotate more data items within the same time. For instance, consider a scenario where a worker has to judge whether a comment is relevant to a document. Without batching, making a judgment for each comment requires reading through the entire document. With batching, the worker only needs to read the entire document once, and can then make a judgment for all the comments in the batch. In fact, even from the workers' point of view, it is more attractive to label batches of data items, as they can save time on switching between different tasks. Furthermore, once they start on a batch, they no longer have to fight for other high-paying or attractive tasks.

However, even though batching is an attractive option in practice due to its cost and time savings, having workers annotate batches can lead to severe correlation between annotations within batches. For example, say we have a task of annotating whether a review of the movie The Imitation Game crawled from IMDb is positive.
As illustrated in Figure 1(a), if we only show one review to be judged as part of each crowdsourcing unit task, workers will have to spend some time reading the instructions and possibly looking up the movie before they can make a single judgment on a review. Although the judgments are likely to be independent, this way of assigning work is too costly to be practical. Instead, if we assemble multiple reviews of the same movie into a batch, as shown in Figure 1(b), workers can make multiple judgments after they read the instructions. Nevertheless, in this case, the annotations of different reviews might interfere with each other. For example, the review "Average In The Extreme" does not seem like a positive review per se (cf. top right of Figure 1(a)), but when grouped with the review "Stack of Lies", it looks much more like a positive review (cf. top of Figure 1(b)). Similarly, the review "Good enough but historically sketchy" looks quite positive by itself (cf. bottom left of Figure 1(a)), but it does not look as positive next to a strongly effusive review simply saying "Great movie", as shown at the bottom of Figure 1(b). Overall, these effects might be undesirable and misleading, as they are inconsistent with the case when workers make independent judgments. Therefore, it is challenging to ascertain the true labels of data items in batches.

[Figure 1: Example of correlation between annotations on data items in the same batch. (a) Data items judged independently: "Stack of Lies" and "Average In The Extreme" are marked negative, while "Good enough but historically sketchy" and "Great movie, worth a watch" are marked positive. (b) Data items judged in a batch: "Stack of Lies" is marked negative next to "Average In The Extreme" marked positive, and "Good enough but historically sketchy" is marked negative next to "Great movie, worth a watch" marked positive. Workers are asked to label whether a review of the movie The Imitation Game crawled from IMDb is positive. Assigning each review-movie pair to workers separately can be costly, while assigning a batch of reviews together with a movie to workers might affect workers' judgments.]

So far, there has been little to no work exploring the possible annotation error introduced by grouping data items into batches. Although batching data items has been adopted in many crowdsourced tasks such as sorting [17], object recognition [32] or clustering [10], and is anecdotally very widely used in practice, the assumption is often that the annotations are collected independently, which is not the case. While there is limited work on judging data items in sequence [18, 26, 27], it is not directly applicable to our setting, where a batch of data items is presented and annotated in parallel. Our previous research [36] also noticed this specific type of annotation bias, but instead of focusing on debiasing, we exploited the bias to develop an active learning algorithm aiming to improve the performance of a certain classifier. We defer the detailed discussion of related work to Section 7.

There are several research challenges in solving this problem. First, how do we model workers' behavior when they make judgments in batches? Second, how do we leverage the model to debias the crowdsourced annotation of data batches? We make the following contributions in answering these questions:

1. Proposing an interpretable worker annotation model on small batches of data. We propose a novel worker model for binary annotation behavior with data items presented as batches. The model incorporates independent judgments and batch judgments based on ranking. Different from the factor graph model in our previous work [36], we focus on obtaining the true labels of data items instead of improving classifier performance.

2. Debiasing annotation data obtained as batches. Based on our proposed worker model, we provide an algorithm to debias the inferred labels when they are collected from data items in small batches.

3. Conducting experiments on a real-world crowdsourcing platform.
We conduct experiments on both synthetic and real-world crowdsourcing data sets to verify the effectiveness of our proposed model and debiasing strategies. Experimental results show the effectiveness of our debiasing method over other baselines.

The rest of this paper is organized as follows: Section 2 introduces the background of crowdsourcing and formalizes the research problem; Section 3 proposes the worker model for annotating small batches of data; Section 4 presents a strategy to debias batch annotations; Section 5 describes experimental results; Section 6 discusses extensions of our proposed method; Section 7 presents related work and Section 8 concludes.

2. PRELIMINARIES

In this section, we formally define the concepts and notations we use in this paper; we then formalize the problem of debiasing crowdsourced batches.

2.1 Basic Concepts and Terminology

First we need to formalize several basic concepts in a crowdsourcing platform. Suppose we are given a set of data items $X = \{x_i\}$, where $i = 1, \dots, n$. Each data item is associated with a label $y_i \in \mathcal{Y}$, and we write $Y = \{y_i\}_{i=1}^{n}$ for the set of all labels. In the following discussion, we focus on a binary classification task, where $\mathcal{Y} = \{0, 1\}$, but our framework generalizes to multi-class or rating cases seamlessly. Following a standard formalization in learning theory for binary classification, we suppose each $(x_i, y_i)$ is generated from a joint probability distribution $P_{XY}$. We define $\eta_i$ to be the conditional probability $P(y_i = 1 \mid x_i)$.

In a job or task submitted to a crowdsourcing platform, we can assemble several data items into a batch. Each batch $b_j$ is represented by a set of indices of data items in the batch, denoted as $\{b_{j1}, \dots, b_{jk}\}$, where $k$ is the size of a batch. Strictly speaking, data items in the batch should be represented by $x_j = \{x_{b_{j1}}, \dots, x_{b_{jk}}\}$. However, for simplicity, we denote the data items in the batch specified by $b_j$ as $\{x_{j1}, \dots, x_{jk}\}$.
Similarly, we define $y_j = \{y_{j1}, \dots, y_{jk}\}$ to be the true labels associated with data items in $x_j$, where $y_{jl}$ is the true label of $x_{jl}$ according to $Y$ for $1 \le l \le k$.

In CrowdFlower language, a batch corresponds to a single unit, where a worker has to judge the entire unit at the same time; in Mechanical Turk language, a batch corresponds to a single HIT (short for Human Intelligence Task). Usually, data items in the same batch share the same context, background, or instruction, in order to reduce the overhead. For example, if one is asked to judge whether a review about a restaurant is positive or negative, it might save time for workers if reviews of the same restaurant are grouped into the same batch, as they only need to read the description of the restaurant once before they can make multiple judgments on different reviews.

As we assemble data items into batches, each worker has to judge the entire batch as a single judgment. Given a batch $b_j$, the judgment provided by a worker can be represented as $y'_j = (y'_{j1}, \dots, y'_{jk})$, where $y'_{jl} \in \mathcal{Y}$ is the annotation of the data item $x_{jl}$ provided by the worker. Note that the worker annotation $y'_j$ can be different from the true label $y_j$. We refer to a worker annotation simply as an annotation, while the ground-truth label is referred to simply as the label.

In CrowdFlower, as a judgment can only be made based on a unit, workers are not allowed to submit partial results on a batch (as with Mechanical Turk). However, one can always add an "unknown" option for every data item, so that the workers can provide partial results on a batch. For simplicity, we consider no partial judgments in the rest of the paper.

Now, we are in a position to give a formal definition for a batch of data items:

DEFINITION 1 (BATCH). Given a data set $(X, Y)$, a batch of data items with size $k$ extracted from the given data set can be represented as $(b_j, x_j, y_j, y'_j)$, where $b_j = (b_{j1}, \dots, b_{jk})$ is a set of indices for $X$ and $Y$; $x_j = \{x_{j1}, \dots, x_{jk}\}$ is the set of data items indexed by $b_j$ (i.e. $x_{jl} = x_{b_{jl}}$); $y_j = \{y_{j1}, \dots, y_{jk}\}$ consists of the corresponding true labels of the data items in $x_j$, also indexed by $b_j$; and $y'_j = \{y'_{j1}, \dots, y'_{jk}\}$ is the worker annotation on the batch.

Additionally, a set of batches can be defined as:

DEFINITION 2. Given a data set $(X, Y)$, a set of batches extracted from the given data set is denoted as $A = (B, X_B, Y_B, Y'_B)$, where $B = \{b_j\}$ consists of the indices of each batch; $X_B = \{x_j\}$ is the set of data item batches, with their corresponding true labels $Y_B = \{y_j\}$ and worker annotations $Y'_B = \{y'_j\}$.

Remarks. 1) Notice that a data item $x_i \in X$ may appear in multiple batches in $A$. If data items in two different batches share the same index, as indicated by the corresponding entry in $b_j$, they refer to the same data item in $X$; 2) For the sake of fully utilizing the workforce of crowds, and without loss of generality, we focus on the scenario where all batches have the identical size $k$.
However, our model generalizes to the case where batches have different sizes; 3) In some real-world crowdsourcing platforms, a batch can actually be judged by multiple workers, which means there could be multiple $y'_j$'s associated with a single $(b_j, x_j, y_j)$; for instance, this is referred to as multiple assignments on Mechanical Turk. However, for the purposes of debiasing, it is equivalent to regard a single batch as multiple identical batches, and associate each batch with a unique judgment made by a different worker.

2.2 Problem Definition

Based on the concepts described thus far, we can formalize the problem of debiasing crowdsourced batches as follows:

PROBLEM 1 (DEBIASING CROWDSOURCED BATCHES). Suppose we have a labeled data set $(X_L, Y_L)$ with $Y_L$ known, as well as its extracted batches and their crowdsourced annotation $(B_L, X_{B_L}, Y_{B_L}, Y'_{B_L})$. If we are then given another unlabeled data set $X_U$, as well as its extracted batches and crowdsourced annotation $(B_U, X_{B_U}, Y'_{B_U})$, the objective is to infer the true labels $Y_U$ associated with $X_U$ from the crowdsourced annotation.

Notice that our problem formulation as described above requires labeled and annotated data items as input for training purposes. In practice, the labeled data for training can be collected from the test questions with ground-truth labels that the crowdsourcing platform inserts for quality control and monitoring of workers. The usage of test questions is standard practice: as an example, in CrowdFlower, all workers have to attempt a certain number of test questions with correct labels and need to achieve an accuracy over a certain threshold (e.g. 70%) before they can proceed to work on the regular task(s). Also, additional hidden data items with known labels can be inserted into the regular tasks to monitor the accuracy of workers. In our setting, worker behavior on these test questions or labeled data can additionally be used for training purposes.

Table 1: Notation description.

Notation    Description
$X$         Set of all the data items $\{x_i\}_{i=1}^{n}$
$Y$         Set of all true labels associated with data items in $X$
$b_j$       A set of data item indices $\{b_{jl}\}_{l=1}^{k}$
$x_j$       A data item batch $\{x_{jl}\}_{l=1}^{k}$, where $x_{jl}$ is the data item in $X$ with the index specified by $b_{jl}$
$y_j$       A label batch consisting of the true labels $\{y_{jl}\}_{l=1}^{k}$ associated with data items in $x_j$
$y'_j$      Worker annotation collected from a crowdsourcing platform for data items in $x_j$
$B$         Set of all the batches $\{b_j\}_{j=1}^{m}$
$X_B$       Set of all the data item batches
$Y_B$       Set of all the true labels associated with data item batches in $X_B$
$Y'_B$      Set of all the worker annotations from crowds on data item batches in $X_B$

Also notice that in this version of our problem formulation, we assume identical worker behavior. In practice, it is more common to assume identical worker behavior, as there is usually not enough work done by each worker to ascertain individual behavior. It is also straightforward to extend our model to the case where different workers behave differently when working on tasks.

3. CROWDSOURCING WORKER ANNOTATION MODEL ON BATCHES

In this section, we first describe our model for workers' annotation behavior on a small batch of data items; then we introduce how to train the model based on a training data set.

Our key intuition is as follows: when a worker judges a batch of data items, she can either: 1) choose to judge data items independently, as if they were presented alone; or 2) rank all the data items according to their relative inherent values and annotate the top several items as positive, leaving the rest of the batch as negative.
Plackett-Luce model. Before we delve into our model, we first recap a probability model for generating rankings based on scores associated with items, namely the classical Plackett-Luce model [15, 21] introduced in the 70s. Without loss of generality, suppose we are given a set of items $x_1, \dots, x_k$. Each item $x_i$ is associated with a certain score $s(x_i) > 0$. Here the score $s(x_i)$ models the tendency of ranking $x_i$ higher in a randomly generated ranking and can be viewed as a measure of the inherent goodness of the item. A ranking of these items can be represented as a bijection $\pi : \{i\}_{i=1}^{k} \to \{x_i\}_{i=1}^{k}$ that maps $i$ to the data item at the $i$-th position in the ranking. The corresponding ranking list can be represented as $\pi(1) \succ \cdots \succ \pi(k)$. In the Plackett-Luce model, the probability of generating a ranking $\pi$ is:

$$P(\pi) = \prod_{i=1}^{k} \frac{s(x_i)}{\sum_{r=i}^{k} s(x_r)} \tag{1}$$

The equation above can be interpreted as the following process:

Initially, we have a pool $A$ of all the data items. Each time, one picks an item $x_i$ from the pool $A$ with a probability proportional to its score, namely:

$$P(\text{picking } x_i \text{ from } A) = \frac{s(x_i)}{\sum_{x_r \in A} s(x_r)}$$

This item is then removed from the pool $A$ and placed at the next position in the ranking. This operation is repeated until $A$ becomes empty. The probability of generating a ranking list according to this process is equivalent to the probability described in the Plackett-Luce model.

Worker model. We now introduce our worker model for annotating small batches of data items. Again, without loss of generality, suppose we are given a batch $x_j$ where $x_{jl} = x_l$, so that the given data item batch can be denoted as $x_j = \{x_1, \dots, x_k\}$. Also, recall that for each data item $x_i$, we denote $P(y_i = 1 \mid x_i)$ as its inherent score $\eta_i$, which is not explicitly known. When a worker starts to work on a certain batch of data items, they may choose to use one of two strategies:

Independent judging. If the worker is making judgments based on the absolute value of $\eta_i$ for each data item, we suppose the worker judges each data item $x_i \in x_j$ independently by drawing the annotation $y'_i = 1$ with probability $\eta_i$ and $y'_i = 0$ with probability $(1 - \eta_i)$.

Relative judging. If the worker is making judgments based on comparing data items within the same batch without knowing the absolute value of $\eta_i$, we suppose the worker first ranks all the data items based on their probability of being positive, then annotates the top-$\tau$ items in the ranking as positive, leaving the other items annotated as negative. To be precise, the worker generates a ranking $\pi$ for the $k$ items in the batch according to the Plackett-Luce model, with the scoring function defined as $s(x_i) = \eta_i$. Then the worker draws an integer $0 \le \tau \le k$ from a certain distribution, where $p_\tau$ denotes the probability of drawing the integer $\tau$.
For data items ranked in the top $\tau$ of the ranking, denoted as $x_i \in \{\pi^{-1}(1), \dots, \pi^{-1}(\tau)\}$ (which could be empty if $\tau = 0$), the worker annotates them as $y'_i = 1$, while the other data items, not within the top $\tau$ of the ranking $\pi$, are annotated as $y'_i = 0$.

To combine these two different scenarios, we suppose the worker is able to judge independently with a certain probability $0 < \lambda < 1$, while with probability $(1 - \lambda)$ the worker can only make relative judgments.

The intuition of this model is to capture two behavior patterns of workers. In the independent judging scenario, workers remain independent in judging different data items in the same batch, with each data item being judged based on its inherent score $\eta_i$. Nevertheless, sometimes workers might judge data items within a batch by comparison. In the relative judging scenario, workers still have the ability to judge the relative relationships between data items in the same batch, which is captured by the Plackett-Luce model for generating the ranking. In order to determine the labels of data items, they rely on an expectation of the label distribution, which is reflected by the distribution generating $\tau$, as it characterizes the probability of having $\tau$ positives within $k$ data items. For instance, if workers expect there to be few positive items, then the probability of $\tau$ being low is high, while if workers expect the batches to be balanced, then the probability of $\tau$ being close to $k/2$ is high compared to other values of $\tau$. However, this distribution does not necessarily reflect the correct label distribution. When workers try to apply their expectation of the label distribution to the batch, bias might occur.

We summarize the process of generating the annotation for a batch of data items in our proposed model as below:

1. Toss a coin $Z \sim \text{Bernoulli}(\lambda)$. If $Z = 1$, go to Step 2; otherwise go to Step 3.
2. For each $x_i$, generate $y'_i \sim \text{Bernoulli}(\eta_i)$. Output the results and exit.
3. Generate a ranking $\pi$ based on the Plackett-Luce model for the data items $x_i$ in the batch.
4. Draw $\tau \sim \text{Mult}(p_\tau)$.
5. For the top-$\tau$ items in ranking $\pi$, generate $y'_i = 1$; otherwise generate $y'_i = 0$. Output the results and exit.

Model learning. The parameters that need to be determined in this worker model include: the probability of making independent judgments $\lambda$, and the distribution of the number of positive annotations when making relative judgments, represented by $p_0, \dots, p_k$, where $0 \le p_\tau \le 1$ and $\sum_{\tau} p_\tau = 1$. We assume these two parameters are fixed for each new application of our techniques. However, for different applications, these parameters might be different; for instance, the parameters for content moderation may be different from those for spam identification or sentiment analysis.

Suppose we are given a set of $n_L$ items $X_L$ with their true labels $Y_L$, or more ideally, their inherent scores $\{\eta_i\}_{x_i \in X_L}$. If the inherent score $\eta_i$ of a data item is not given, but only $y_i \in \{0, 1\}$ is known, as a heuristic we can estimate $\eta_i$ by $\eta_i = (y_i + \epsilon)/(1 + 2\epsilon)$, where $\epsilon$ is a small constant, which is set to $10^{-3}$ in our experiments. Then, we form them into $m_L$ batches $B_L$, send them to the crowds, and obtain their annotations from workers, denoted as $Y'_{B_L}$. For simplicity, we write $x_{b_{jt}}$ as $x_{jt}$, and similarly for $y_{jt}$ and $\eta_{jt}$. For each batch $b_j \in B_L$, we denote the set of items annotated by workers as positive as $X_j^1 = \{x_{jt} \mid y'_{jt} = 1\}$, and the set of items annotated as negative as $X_j^0 = \{x_{jt} \mid y'_{jt} = 0\}$.

We train the model by maximum likelihood estimation. The likelihood of the obtained annotation can be written as:

$$L = \prod_{j=1}^{m_L} \Big[ \underbrace{\lambda \prod_{t=1}^{k} \eta_{jt}^{y'_{jt}} (1 - \eta_{jt})^{(1 - y'_{jt})}}_{\text{independent judging}} + \underbrace{(1 - \lambda)\, p_{\tau_j}\, P(X_j^1 \succ X_j^0)}_{\text{relative judging}} \Big] \tag{2}$$

where $\tau_j = |X_j^1|$ is the number of positive annotations in batch $b_j$; $P(X_j^1 \succ X_j^0)$ denotes the probability of generating any ranking $\pi$ that ranks the items in $X_j^1$ higher than any item in $X_j^0$, namely:

$$P(X_j^1 \succ X_j^0) = \sum_{\pi \in R(X_j^1, X_j^0)} P(\pi)$$

where $R(X^1, X^0) = \{\pi \mid \pi^{-1}(x_0) > \pi^{-1}(x_1),\ \forall x_1 \in X^1, x_0 \in X^0\}$; and $P(\pi)$ is defined by the Plackett-Luce model, as presented in (1).
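As a rough illustration of the worker model and the likelihood in (2), the following Python sketch samples one annotation by the five generative steps and evaluates the likelihood term of a single batch. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, and $P(X_j^1 \succ X_j^0)$ is computed by brute-force enumeration of rankings, which is only practical for small $k$.

```python
import itertools
import math
import random


def plackett_luce_prob(scores):
    """Probability of a ranking whose items have these scores, listed top to bottom (Eq. 1)."""
    prob = 1.0
    for i in range(len(scores)):
        prob *= scores[i] / sum(scores[i:])
    return prob


def prob_positives_on_top(eta, y):
    """P(X^1 > X^0): total probability of rankings with every positive above every negative."""
    pos = [e for e, label in zip(eta, y) if label == 1]
    neg = [e for e, label in zip(eta, y) if label == 0]
    total = 0.0
    for top in itertools.permutations(pos):
        for bottom in itertools.permutations(neg):
            total += plackett_luce_prob(list(top) + list(bottom))
    return total


def batch_likelihood(eta, y, lam, p_tau):
    """Likelihood of one annotated batch: the bracketed mixture term of Eq. (2)."""
    independent = math.prod(e if label == 1 else 1.0 - e for e, label in zip(eta, y))
    relative = p_tau[sum(y)] * prob_positives_on_top(eta, y)
    return lam * independent + (1.0 - lam) * relative


def annotate_batch(eta, lam, p_tau, rng):
    """Sample one worker annotation for a batch with inherent scores eta."""
    k = len(eta)
    if rng.random() < lam:
        # Independent judging: one Bernoulli(eta_i) draw per item.
        return [1 if rng.random() < e else 0 for e in eta]
    # Relative judging: build a Plackett-Luce ranking by repeated weighted picks.
    pool = list(range(k))
    ranking = []
    while pool:
        pick = rng.choices(pool, weights=[eta[i] for i in pool])[0]
        pool.remove(pick)
        ranking.append(pick)
    # Draw the number of positives tau from p_tau and mark the top-tau items.
    tau = rng.choices(range(k + 1), weights=p_tau)[0]
    y = [0] * k
    for i in ranking[:tau]:
        y[i] = 1
    return y
```

A useful sanity check on this sketch is that, for fixed scores, `batch_likelihood` summed over all $2^k$ possible annotations equals 1, since both mixture components are proper distributions over annotations.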
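Applying an EM algorithm, at the E-step we have:

$$\hat{\lambda}_j = \frac{\hat{\lambda} \prod_{t=1}^{k} \eta_{jt}^{y'_{jt}} (1 - \eta_{jt})^{(1 - y'_{jt})}}{\hat{\lambda} \prod_{t=1}^{k} \eta_{jt}^{y'_{jt}} (1 - \eta_{jt})^{(1 - y'_{jt})} + (1 - \hat{\lambda})\, \hat{p}_{\tau_j}\, P(X_j^1 \succ X_j^0)} \tag{3}$$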

At the M-step, we update the parameters $\hat{\lambda}$ and $\hat{p}_\tau$ by

$$\hat{\lambda} = \frac{1}{m_L} \sum_{j=1}^{m_L} \hat{\lambda}_j, \qquad \hat{p}_\tau = \frac{1}{\hat{Z}} \sum_{j=1}^{m_L} (1 - \hat{\lambda}_j)\, \mathbb{1}\{|X_j^1| = \tau\} \tag{4}$$

where $\hat{Z} = \sum_{j=1}^{m_L} (1 - \hat{\lambda}_j)$.

4. DEBIASING ANNOTATION

In this section, we introduce our method that debiases annotations collected for small batches of data, given the trained worker model. More precisely, given a set of $n_U$ unlabeled data items $X_U$, assembled into $m_U$ batches represented by $B_U$, as well as their annotations obtained from the crowds $Y'_U$, how do we infer their true labels $Y_U$?

The basic idea is to infer $\eta_i$ for each $x_i \in X_U$ based on the given worker model. Then, we simply apply the Bayes classifier to determine the inferred label, which yields $\hat{y}_i = 1$ if $\eta_i > 0.5$, or $\hat{y}_i = 0$ if $\eta_i \le 0.5$.

We again adopt a maximum likelihood estimation technique. The log-likelihood of the obtained annotation is:

$$\log L(\eta) = \sum_{j=1}^{m_U} \log \Big[ \hat{\lambda} \prod_{x_{jt_1} \in X_j^1} \eta_{jt_1} \prod_{x_{jt_0} \in X_j^0} (1 - \eta_{jt_0}) + (1 - \hat{\lambda})\, \hat{p}_{\tau_j}\, P(X_j^1 \succ X_j^0) \Big]$$

Notice that $\hat{\lambda}$ and $\hat{p}_{\tau_j}$ are the parameters learned in Section 3, and $P(X_j^1 \succ X_j^0)$ is also a function of the $\eta_i$'s. Similar to the previous section, we apply an EM algorithm here by first calculating $\hat{\lambda}_j$ for each batch at the E-step according to (3), with $\lambda$ and $p_\tau$ replaced by the values we learned during the training step. Then, up to an additive term $C$ that does not depend on $\eta$, we have:

\begin{align}
\log L(\eta) \ge{} & \sum_{j=1}^{m_U} \hat{\lambda}_j \Big[ \sum_{x_{jt_1} \in X_j^1} \log \eta_{jt_1} + \sum_{x_{jt_0} \in X_j^0} \log (1 - \eta_{jt_0}) \Big] \tag{5} \\
& + \sum_{j=1}^{m_U} (1 - \hat{\lambda}_j) \Big[ \log \hat{p}_{\tau_j} + \log P(X_j^1 \succ X_j^0) \Big] + C \tag{6}
\end{align}

where the second term includes $\log P(X_j^1 \succ X_j^0)$, which is hard to optimize. We apply the idea of the EM algorithm again here. We use the notation $R_j$ to represent $R(X_j^1, X_j^0)$. For each $\pi \in R_j$, we can calculate its conditional probability given $X_j^1 \succ X_j^0$, denoted as $\hat{q}_\pi$, by:

$$\hat{q}_\pi = P(\pi \mid X_j^1 \succ X_j^0; \hat{\eta}) = \frac{P(\pi; \hat{\eta})}{\sum_{\pi' \in R_j} P(\pi'; \hat{\eta})}$$

which is the E-step. According to Jensen's inequality we have:

\begin{align}
\log P(X_j^1 \succ X_j^0) &= \log \sum_{\pi \in R_j} P(\pi) \tag{7} \\
&\ge \sum_{\pi \in R_j} \hat{q}_\pi \log P(\pi) \tag{8}
\end{align}

where the last inequality yields the objective function we want to optimize. The correctness of the EM algorithm guarantees the convergence of optimizing this function.
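The E-step (3) and M-step (4) used to train the worker model, which the debiasing step reuses to compute $\hat{\lambda}_j$, can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' code: the function names are hypothetical, the inherent scores $\eta$ of the training items are assumed known and strictly between 0 and 1, $P(X_j^1 \succ X_j^0)$ is enumerated by brute force (small $k$ only), and $\hat{p}_\tau$ starts from a uniform distribution instead of the random initialization used in the paper.

```python
import itertools
import math


def pl_prob(scores):
    """Plackett-Luce probability of a ranking given its scores from top to bottom."""
    prob = 1.0
    for i in range(len(scores)):
        prob *= scores[i] / sum(scores[i:])
    return prob


def prob_positives_on_top(eta, y):
    """P(X^1 > X^0) by enumerating every ranking that puts positives first."""
    pos = [e for e, lab in zip(eta, y) if lab == 1]
    neg = [e for e, lab in zip(eta, y) if lab == 0]
    return sum(pl_prob(list(t) + list(b))
               for t in itertools.permutations(pos)
               for b in itertools.permutations(neg))


def train_worker_model(batches, k, iterations=100):
    """Fit lambda and p_tau by EM.

    batches: list of (eta, y) pairs, each of length k, where eta holds the
    inherent scores and y the 0/1 worker annotations of one batch.
    Returns (lam, p_tau, log_likelihood_history).
    """
    # Per-batch quantities that do not depend on lam or p_tau.
    indep = [math.prod(e if lab else 1.0 - e for e, lab in zip(eta, y))
             for eta, y in batches]
    rel = [prob_positives_on_top(eta, y) for eta, y in batches]
    taus = [sum(y) for _, y in batches]

    lam = 0.5
    p_tau = [1.0 / (k + 1)] * (k + 1)  # assumption: uniform start
    history = []
    for _ in range(iterations):
        # E-step (3): posterior probability that batch j was judged independently.
        mix = [(lam * a, (1.0 - lam) * p_tau[t] * b)
               for a, b, t in zip(indep, rel, taus)]
        history.append(sum(math.log(u + v) for u, v in mix))
        resp = [u / (u + v) for u, v in mix]
        # M-step (4): re-estimate lam and the distribution of tau.
        lam = sum(resp) / len(resp)
        z = sum(1.0 - r for r in resp)
        p_tau = [sum(1.0 - r for r, t in zip(resp, taus) if t == tau) / z
                 for tau in range(k + 1)]
    return lam, p_tau, history
```

The returned history records the training log-likelihood before each update, so it can be used both as a stopping criterion and to check that the likelihood does not decrease across iterations, as EM guarantees.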
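Furthermore, following the minorization-maximization (MM) algorithm used in [11], we obtain a lower bound for $\log P(\pi)$, which is defined by the Plackett-Luce model. Up to an additive term $C'$ that does not depend on $\eta$:

\begin{align}
\log P(\pi) &= \sum_{t=1}^{k-1} \Big[ \log \eta_{\pi^{-1}(t)} - \log \sum_{s=t}^{k} \eta_{\pi^{-1}(s)} \Big] \notag \\
&\ge \sum_{t=1}^{k-1} \Big[ \log \eta_{\pi^{-1}(t)} - \frac{\sum_{s=t}^{k} \eta_{\pi^{-1}(s)}}{\sum_{s=t}^{k} \hat{\eta}_{\pi^{-1}(s)}} \Big] + C' \tag{9}
\end{align}

where $\hat{\eta}_i$ is the parameter estimated in the last iteration. By combining (6), (8) and (9), we obtain the objective function to optimize as:

\begin{align}
Q(\eta) ={} & \sum_{j=1}^{m_U} \hat{\lambda}_j \Big[ \sum_{x_{jt_1} \in X_j^1} \log \eta_{jt_1} + \sum_{x_{jt_0} \in X_j^0} \log (1 - \eta_{jt_0}) \Big] \notag \\
& + \sum_{j=1}^{m_U} (1 - \hat{\lambda}_j) \sum_{\pi \in R_j} \hat{q}_\pi \sum_{t=1}^{k-1} \Big[ \log \eta_{\pi^{-1}(t)} - \frac{\sum_{s=t}^{k} \eta_{\pi^{-1}(s)}}{\sum_{s=t}^{k} \hat{\eta}_{\pi^{-1}(s)}} \Big] \tag{10}
\end{align}

Notice that $Q(\eta)$ is actually a lower bound of the original log-likelihood function. Moreover, for the two EM steps and one MM step we apply in deriving the Q-function, it is proven that by improving $Q(\eta)$ over its value $Q(\hat{\eta})$ at the current iteration, the improvement of the log-likelihood is no less than the improvement we achieve on the Q-function. Therefore, optimizing $Q(\eta)$ also optimizes the log-likelihood. Taking the derivative, we obtain

$$\frac{\partial Q(\eta)}{\partial \eta_i} = \sum_{j \in M_1(i)} \frac{1}{\eta_i} - \sum_{j \in M_0(i)} \frac{\hat{\lambda}_j}{1 - \eta_i} - \sum_{j=1}^{m_U} (1 - \hat{\lambda}_j) \sum_{\pi \in R_j} \hat{q}_\pi \Big[ \sum_{t=1}^{|X_j^1|} \frac{\mathbb{1}\{\pi^{-1}(i) \ge t\}}{\sum_{s=t}^{k} \hat{\eta}_{\pi^{-1}(s)}} \Big] \tag{11}$$

where $M_1(i)$ and $M_0(i)$ are defined as $M_y(i) = \{j : x_i \in X_j^y\}$ for $y \in \{0, 1\}$. The updating rule can be obtained by solving $\partial Q(\eta) / \partial \eta_i = 0$, namely

$$\hat{\eta}_i = \frac{T_1 + T_2 + T_3 - \sqrt{(T_1 + T_2 + T_3)^2 - 4 T_1 T_3}}{2 T_3} \tag{12}$$

where

$$T_1 = |M_1(i)|, \qquad T_2 = \sum_{j \in M_0(i)} \hat{\lambda}_j, \qquad T_3 = \sum_{j=1}^{m_U} (1 - \hat{\lambda}_j) \sum_{\pi \in R_j} \hat{q}_\pi \Big[ \sum_{t=1}^{|X_j^1|} \frac{\mathbb{1}\{\pi^{-1}(i) \ge t\}}{\sum_{s=t}^{k} \hat{\eta}_{\pi^{-1}(s)}} \Big]$$

By iteratively updating the scores to optimize the likelihood of the annotation on the test data, we can obtain the inferred $\eta_i$ of each item. Based on this, we can determine the inferred binary label for each data item by assigning $y_i = 1$ if $\hat{\eta}_i > 0.5$, or $y_i = 0$ otherwise. Notice that we do not further tune the threshold in this step, as the scores we learned here are expected to be a reasonable estimate of the true $\eta_i$'s.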
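To make the closed-form update concrete: setting the derivative to zero in terms of $T_1$, $T_2$, $T_3$ gives $T_1/\eta_i - T_2/(1 - \eta_i) - T_3 = 0$, i.e. the quadratic $T_3 \eta_i^2 - (T_1 + T_2 + T_3)\eta_i + T_1 = 0$, whose smaller root is (12). The Python sketch below, with hypothetical function names, evaluates that root and checks it against the gradient; it assumes $T_3 > 0$ and that the three aggregates have already been computed for the item.

```python
import math


def update_eta(t1, t2, t3):
    """Smaller root of T3*eta^2 - (T1 + T2 + T3)*eta + T1 = 0, as in Eq. (12).

    Assumes t3 > 0; t1 and t2 are non-negative aggregates for one item.
    """
    s = t1 + t2 + t3
    return (s - math.sqrt(s * s - 4.0 * t1 * t3)) / (2.0 * t3)


def q_gradient(eta, t1, t2, t3):
    """Derivative of Q with respect to eta_i, written in terms of T1, T2, T3."""
    return t1 / eta - t2 / (1.0 - eta) - t3


def bayes_label(eta_hat):
    """Threshold the inferred score at 0.5 to obtain the binary label."""
    return 1 if eta_hat > 0.5 else 0
```

The smaller root is the one that lies in $[0, 1]$: the quadratic equals $T_1 \ge 0$ at $\eta_i = 0$ and $-T_2 \le 0$ at $\eta_i = 1$, so it crosses zero between the two.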
Indeed, if the inherent scores are known, learning theory guarantees that the Bayes classifier (namely, taking 0.5 as the threshold) achieves the best expected performance in terms of square loss.

5. EXPERIMENTAL RESULTS

In this section, we conduct experiments on a synthetic data set and a real data set to verify the effectiveness of our proposed worker model and debiasing technique.

Input: Data batches $B_U = \{b_j\}$, crowdsourced annotation $Y'_U = \{y'_j\}$; training data batches $B_L$, crowdsourced annotation $Y'_L$, true labels $Y_L$.
Output: Inferred labels $Y_U$.
    // Train the worker model
    $\hat{\lambda} \leftarrow 0.5$; initialize $\hat{p}_\tau$ with random values
    repeat
        Update $\hat{\lambda}_j$ for $1 \le j \le m_L$ by (3)
        Update $\hat{\lambda}$ and $\hat{p}_\tau$ by (4)
    until $L$ converged
    // Calculate debiased labels
    repeat
        Update $\eta_i$ for $1 \le i \le n_U$ by (12)
    until $Q(\eta)$ converged
    $y_i \leftarrow \mathbb{1}\{\eta_i > 0.5\}$ for $1 \le i \le n_U$
    Output $Y_U$
Algorithm 1: Debiasing crowdsourced annotation on batches of data items.

5.1 Experimental Data Sets

We first introduce the data sets we used in these experiments. A summary of the data sets is provided in Table 2.

Synthetic data set. We construct synthetic data sets following the worker annotation model we propose in Section 3. Suppose we have $n$ items in $X$. We first generate the inherent score $\eta_i$ of each $x_i \in X$ from a Beta distribution $\text{Beta}(\alpha, \beta)$, then generate the true labels $Y$ by drawing $y_i$ from a Bernoulli distribution parameterized by $\eta_i$ for each $i$. In our synthetic data set, we set $\alpha = 2$ and $\beta = 4$ to simulate the case where negative data items overwhelm positive data items. Then, we generate batches of size $k$ by sampling without replacement for each batch. By "without replacement" we mean that there are no identical data items within the same batch, while the same item can still appear in multiple batches, as we do put the items back into the pool after a batch is generated. Thereby we obtain the set of batches $B$. For each $b_j$ in $B$, we generate the workers' annotation $y'_j$ from our proposed batch annotation model. The probability of making independent judgments $\lambda$ is set to 0.5. The distribution determining the number of positive annotations $p_\tau$ is assigned to be:

$$p_\tau \propto (\tau + 1)^{-\rho}$$

where $\rho$ is a positive constant, set to 2 in our experiments.

Comments data set. We utilize a real-world crowdsourcing data set for annotating comments, which is used in [36].
The original crowdsourcing task was to identify inappropriate comments on LinkedIn posts published by companies or LinkedIn influencers. Inappropriate comments are defined as comments containing promotional, profane, blatant soliciting, or random greeting content, as well as comments with only web links and contact information.

In order to collect annotations of comments, for each post, $k$ comments are sampled and sent to CrowdFlower as a batch (unit). Workers are also provided with a codebook (i.e., a sequence of instructions) explaining how to annotate the data items. Each comment is regarded as a data item and can be annotated as positive (inappropriate comment) or negative (acceptable comment). Each batch is annotated by 5 or more workers.

In order to provide test questions and track the performance of each worker, some of the batches are annotated by 9 trained LinkedIn employees (experts) with the same codebook and interface as used for crowd workers. The average Cohen's kappa for all expert pairs is [value missing]. For these experiments, we only adopt the batches with all of their data items annotated by both crowds and experts, as we can use the experts' annotation as ground truth (aggregated by majority voting). Out of these batches, the 1,099 batches that are annotated before a worker actually starts on the job are utilized as the training data set $B_L$, while the other 5,267 batches are utilized as the test data set $B_U$ to infer the 651 data items that the 5,267 batches cover.

Table 2: Data set statistics.

Data set     $k$    $n_L$              $m_L$     $n_U$    $m_U$
Synthetic    5      1,000              10,000    5,000    50,000
Comments     5      [value missing]    1,099     651      5,267

5.2 Experimental Setup

Methods evaluated. We compare the performance of our proposed method with several baselines:

Majority Voting (MV). For each data item in the test data set, simply determine its inferred label by the way it is annotated by the majority of workers. This aggregation strategy is often used in practice [28].

Majority Voting with Tuned Threshold.
Instead of siply applying ajority voting, we calculate the ratio of positive annotation on each ite as a score, and tune the threshold for deterining the binary inferred label. Based on a given training set of annotation and true labels, we find the threshold yielding the best F 1-score on training data set, and apply the sae threshold on the test data set. Plackett-Luce Model (PL). A strategy is to fit the Plackett- Luce odel on the test data by inferring the scores s(x i) associated with each data ites. We apply a Bayesian regularization on the inferred scores to confine it to be 0 s(x i) 1. We then infer a positive label to each data ite with an inferred score s(x i) > 0.5 and a negative label otherwise. * Batch Annotation Model (). The debiasing strategy proposed in Section 3 and 4. Evaluation ethodology. For baselines without training, we directly apply the on the test data set; for our proposed ethod as well as, we first train the worker odel on the training data set, then apply the debiasing strategy based on the trained worker odel on the test data set. We copare the inferred labels to the ground-truth and evaluate the perforance in ters of accuracy, precision, recall and F 1-score. Trials and setup. For our proposed odel, in training phase, we initialize λ to be 0.5 and p τ to be a rando value between 0 and 1; in debiasing phase, we initialize the all the inferred scores as 0.1. For training the worker odel, we set a fixed nuber of iteration as 100. Our experiental results presented later show the odel converges within a nuber of iterations uch fewer than 100. For debiasing, we calculate the log-likelihood of the odel and stop when the relative change of log-likelihood is within Experiental Results Now we present the experiental results. We first verify the learning algorith of our odel on the synthetic data set, then present the learned odel paraeters on a real data set; we also

7 p τ Generative paraeters Learned paraeters τ (a) Paraeter coparison of p τ s Log-likelihood (10^3) Iteration nuber (b) Convergence analysis p τ Learned paraeters τ (a) Paraeter analysis of p τ s Log-likelihood Iteration nuber (b) Convergence analysis Figure 2: Learning worker odel fro the synthetic training data set. Figure 4: Learning worker odel fro the coents training data set. ˆλ λ λ (a) Estiation error of λ ˆp τ p τ ρ (b) Estiation error of p τ Figure 3: Analysis of estiation error of paraeters in the worker odel under different configurations. evaluate the effectiveness of our debiasing strategy on both synthetic data set and real data set, which deonstrates an iproveent in ters of F 1-score; finally we conduct a study on different configurations of experients as a guideline for setting up a batched crowdsourcing task. Worker odel learning. We first verify the effectiveness of learning our proposed worker odel. On our synthetic data set, the true value of probability of aking independent judgents λ is set to 0.5. We learn the odel fro the synthetic training data and obtain the inferred ˆλ as , which reasonably recovers the original value. We also copare the original odel paraeters p τ s to the inferred paraeters in Figure 2(a). The black dashed line represents the original paraeters used for generating synthetic annotation data, while the red solid line shows the inferred paraeters of worker odel, which sees as a precise fit of the original paraeter. We also show the curve of log-likelihood of the training data set, which sees to converge within 20 iterations. To further confir the robustness of our learning ethod, we odify the configuration of synthetic data generation, and train the worker odel on different data sets to check if they can recover the original paraeters. We still take the sae configuration of n L = 1, 000 and L = 10, 000. The estiation error analysis is shown in Figure 3. 
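The two error measures plotted in Figure 3 can be made concrete with a short sketch. The exact parametrization of the power-law p_τ is not given in this section, so the form below (weights proportional to (τ + 1)^(−ρ) for τ = 0, ..., k) is an assumption for illustration only, and the function names are hypothetical rather than taken from the paper.

```python
import numpy as np


def power_law_p_tau(k, rho):
    """Assumed power-law distribution over tau = 0..k (number of positives in a batch).

    ASSUMPTION: weight(tau) = (tau + 1) ** (-rho); the paper's exact form may differ.
    """
    weights = (np.arange(k + 1) + 1.0) ** (-rho)
    return weights / weights.sum()


def lambda_error(lam_hat, lam_true):
    """Absolute difference between the learned and the true lambda."""
    return abs(lam_hat - lam_true)


def p_tau_error(p_hat, p_true):
    """l2 norm of the difference between learned and true p_tau."""
    diff = np.asarray(p_hat, dtype=float) - np.asarray(p_true, dtype=float)
    return float(np.linalg.norm(diff))


if __name__ == "__main__":
    p_true = power_law_p_tau(k=5, rho=2.0)
    # A made-up "learned" distribution: the true one, slightly perturbed and renormalized.
    p_hat = p_true + np.array([0.01, -0.01, 0.0, 0.0, 0.0, 0.0])
    p_hat = p_hat / p_hat.sum()
    print("p_tau error:", p_tau_error(p_hat, p_true))
    print("lambda error:", lambda_error(0.45, 0.5))
```

Because p_τ is learned as a free categorical distribution over τ, the same error function applies unchanged to any other generating distribution.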
Figure 3(a) shows the difference between the inferred parameter λ̂ and the true parameter λ, given the annotation data generated by λ varying from 0.1 to 0.9. It can be observed that the error is reasonably small, basically within 0.1. Figure 3(b) shows the ℓ2 norm of the difference between the estimated distribution p̂_τ and the true distribution p_τ, when p_τ is generated with respect to different ρ varying from 1 to 3. In most of the settings, the error is below [...], which is fairly low. Although we only instantiate p_τ's using a power-law distribution, as the learning method does not confine the learned distribution to be parametric, it can be directly applied to any other type of distribution.

Learned model on real data set. With the effectiveness of our learning method verified, we apply the worker model to fit the data set of annotating inappropriate comments. The learned probability of a worker making independent judgments λ̂ is [...]. The learned distribution for determining the number of positive annotations in a batch is presented in Figure 4(a). It shows that a worker tends to annotate the entire batch as negative (i.e., acceptable comment) with a probability over 0.6, while picking only 1 of them as positive (i.e., inappropriate comment) also occurs with a relatively high probability, around [...]. The workers seem to be reluctant to annotate more than 1 comment as positive in a size-5 batch. This is coherent with most people's intuition that inappropriate comments are rare compared to the entire set of comments. The convergence analysis is shown in Figure 4(b). The model converges within 50 iterations.

Performance comparison. We proceed to evaluate the performance of different aggregation strategies on both data sets. The overall performance results are shown in Table 3. In both data sets, our proposed debiasing strategy is a clear winner in terms of F1-score, and also achieves the best accuracies.
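For reference, the four measures used in this comparison can be computed from binary labels as in the following minimal sketch. The function name and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration, not part of the original evaluation code.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # ASSUMPTION: undefined ratios (zero denominator) are reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Skewed toy data: 2 positives out of 10 items, everything predicted negative.
    y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    y_pred = [0] * 10
    print(binary_metrics(y_true, y_pred))
```

On the skewed toy data in the usage example, predicting every item as negative yields an accuracy of 0.8 but an F1-score of 0, which is why F1-score is the more informative measure when positive items are rare.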
In the synthetic data set, majority voting, without tuning the threshold (set to 0.5 by default), fails to identify most of the positive data items, and therefore achieves an extremely low recall. Only after the threshold is tuned on a training data set can it achieve a reasonable F1-score of 71%. The PL model, in contrast, achieves a relatively low precision of 52%. Our proposed method is able to achieve the best overall performance in terms of F1-score and accuracy, and the precision and recall achieved by our method are also relatively balanced. Notice that we do not apply any threshold tuning for our method and simply take the threshold as 0.5.

In the comments data set, the naïve majority voting strategy again obtains a poor recall, below 80%. After tuning the threshold, its recall rises to around 85%, but is still lower than that of our proposed method. The scores learned by the PL model yield a recall comparable to majority voting with tuned threshold, but fail to achieve a high precision. Our proposed method achieves a comparable precision of 93% and a higher recall of 87%, and therefore beats all the other baselines in terms of F1-score (90%).

Table 3: Performance comparison of Majority Voting (MV), Plackett-Luce Model (PL) and Batch Annotation Model. All results are shown as percents. Columns: Data set, Method, Acc., Prc., Rcl., F1; rows cover the Synthetic and Comments data sets.

Batch number m vs. item number n. An interesting question to study is, for a certain number of items, how many (random) batches of data items one needs to label to obtain an aggregated result that is accurate enough. We study this question by generating synthetic data sets with different numbers of batches m_L and m_U while the numbers of data items n_L and n_U are fixed. In these experiments, we set n_L = 1,000 and n_U = 5,000, and generate synthetic data sets with m_L/n_L = m_U/n_U = 2, 5, 10, and 20. We then apply all the strategies on these data sets. To minimize randomness, for each setting we repeat the data generation and application of debiasing strategies 10 times, then report the average performance. Results are shown in Figure 5. As we can observe, under all the different settings, the proposed method consistently outperforms other baselines, in terms of both accuracy and F1-score. Majority voting with tuned threshold is able to achieve results comparable to our proposed method when m/n is large enough (e.g., m/n = 20). However, when m/n is relatively small, our proposed method can achieve much better results than most of the other baselines. When m/n = 2, it achieves an accuracy approximately 9% higher than majority voting with tuned threshold, and an F1-score around 5% higher. An exception is the naïve majority voting strategy, which achieves the best accuracy when m/n = 2. This is due to the skewed distribution of data labels: simply labeling all the data items as negative yields an accuracy of approximately 80%. In comparison, the F1-score of MV is only around 30%. Another observation that we can make about Figure 5(b) is that the performance of majority voting drops as m/n increases. This result indicates that when workers are biased and no debiasing techniques are applied, increasing the quantity of labels collected does not help.

Figure 5: Performance of debiasing strategies on synthetic data sets generated by setting both m_L/n_L and m_U/n_U as 2, 5, 10, 20 respectively. (a) Accuracy with different m/n ratio. (b) F1-score with different m/n ratio.

Size of training data set. As our method requires a small set of training data, there might be some concerns about how large a training data set is sufficient. We test the performance of the two methods that rely on training data sets, majority voting with tuned threshold and our proposed method, on synthetic data sets and the comments data set.
For the synthetic data set, we keep the size of the test data set as n_U = 5,000 and m_U = 50,000, and vary the size of the training data set by setting n_L as 10, 20, 50, 100, 200, 500, and 1,000, while setting m_L as 10 n_L. For each configuration, we generate synthetic data sets 10 times and utilize the average performance on these 10 data sets to evaluate the debiasing performance. For the comments data set, we randomly sample m_L batches from the training data set, where m_L is set to 100, 200, ... Again, for each configuration of training data size, we repeat the random sampling 10 times and report the average performance.

Figure 6: Performance of debiasing strategies on synthetic data sets generated by different sizes of training data set n_L (m_L = 10 n_L), while the size of the testing data set remains n_U = 5,000 and m_U = 50,000. (a) Accuracy with different sizes of training data set. (b) F1-score with different sizes of training data set.

Figure 7: Performance of debiasing strategies on the comments data set where the training data set is randomly sampled from the original training set with different sizes m_L, while the testing data set remains the same. (a) Accuracy with different sizes of training data set. (b) F1-score with different sizes of training data set.

The results of the synthetic data set are shown in Figure 6. As observed, when the training data set is extremely small (e.g., n_L = 10), the performance of majority voting with tuned threshold drops in terms of both accuracy and F1-score (73% and 56% respectively). As the size of the training data set increases, its performance becomes comparable to our proposed method. However, the performance of our proposed method is surprisingly stable, even when there are only 10 items and 100 batches as training data, which is 1/500 as large as the data set used for testing. The results imply that our proposed method can obtain very high performance with a small cost of labeling ground-truth data for collecting training data.
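The dependence of the tuned-threshold baseline on training data follows from how it is fit: a single cut-off on the positive-annotation ratio is chosen to maximize F1-score on the labeled training items. The sketch below illustrates that procedure under stated assumptions: the function names are hypothetical, candidate thresholds are restricted to the observed scores, and an item is labeled positive when its score is greater than or equal to the threshold (the paper does not specify how ties are handled).

```python
def positive_ratio(annotations):
    """Fraction of workers who labeled an item positive (annotations are 0/1)."""
    return sum(annotations) / len(annotations)


def f1_score(y_true, y_pred):
    """F1 for 0/1 labels; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def apply_threshold(scores, threshold):
    """ASSUMPTION: score >= threshold is labeled positive."""
    return [1 if s >= threshold else 0 for s in scores]


def tune_threshold(train_scores, train_labels):
    """Pick the observed score that maximizes F1 on the training items."""
    best_threshold, best_f1 = 0.5, -1.0
    for candidate in sorted(set(train_scores)):
        f1 = f1_score(train_labels, apply_threshold(train_scores, candidate))
        if f1 > best_f1:
            best_threshold, best_f1 = candidate, f1
    return best_threshold


if __name__ == "__main__":
    # Toy training data: five items, five worker annotations each.
    train_annotations = [[1, 0, 0, 0, 0], [0, 0, 0, 0, 0], [1, 1, 0, 0, 0],
                         [0, 0, 0, 0, 0], [1, 0, 0, 0, 0]]
    train_labels = [1, 0, 1, 0, 0]
    train_scores = [positive_ratio(a) for a in train_annotations]
    threshold = tune_threshold(train_scores, train_labels)
    test_scores = [positive_ratio(a) for a in [[0, 1, 0, 0, 0], [0, 0, 0, 0, 0]]]
    print(threshold, apply_threshold(test_scores, threshold))
```

With only a handful of training items, the chosen cut-off is determined by very few positive examples, which is consistent with the instability observed for this baseline at small training sizes.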
The results for the comments data set are shown in Figure 7. Again, when the training data size is extremely small (e.g., m_L = 100), the performance of majority voting with tuned threshold drops substantially (89% in accuracy and 75% in F1-score), while its performance gets more and more comparable to our method as the training data size increases. In contrast, our proposed method maintains a fairly stable performance for different sizes of the training data set. This verifies again the ability of our proposed method to yield high-quality results with a sufficiently small training data set.

6. EXTENSIONS

In this section, we discuss two straightforward extensions of our proposed worker model and debiasing strategies, with respect to two useful applications other than binary classification: multi-class classification, and rating estimation.

Rating estimation. In rating estimation, each data item x_i is no longer associated with a discrete label from a finite set of labels, but instead with a real value y_i ∈ R. Suppose workers are asked to provide ratings on each data item in batches, and our goal is to identify the true (unknown) rating of each data item. A naïve way to aggregate their ratings on the same item is to take the average value of their ratings as the estimated rating. However, when multiple data items are grouped into batches, there can be bias similar to the one introduced in this paper. Although we do not explicitly formalize our problem for a rating task, with some straightforward modifications, our techniques can still be applied. Without loss of generality, we can assume 0 < y_i < ∞. If the actual rating can be negative, we can always apply a certain sigmoid function to normalize the scores to be positive values. For independent judging, we can design a well-regularized distribution with expectation y_i for a worker to draw a rating, e.g., a Gaussian distribution centered at y_i. For relative judging, we can still assume the worker generates a ranking from the Plackett-Luce model with parameters y_i's, and introduce distributions for generating ratings of data items at different positions in the ranking, which are to be learned from the training data. For example, workers may tend to generate a rating from a Gaussian distribution centered at µ_1 = 5.0 for a top-ranked data item π(1), but generate a rating from another Gaussian distribution centered at µ_5 = 1.0 for a data item ranked fifth, π(5). If the training data is sparse and the ratings can take on any real values, it is probably preferable to employ some parametric distributions. Once the design of the model is accomplished, it is straightforward to apply the same technique described in this paper to derive the debiasing strategy by maximizing the likelihood of observed annotations to estimate the underlying ratings for unrated data items.

Multi-class classification. In a multi-class classification problem, the label set Y may contain more than 2 possible labels. Workers are usually requested to assign data items different labels. This is a natural extension of the binary classification problem. If the labels in Y have an order, for example, judging whether a review is very helpful, helpful or not helpful, the problem reduces to a rating estimation problem, where the possible values of the rating are discrete. We can simply apply the extended strategy described above. If the labels in Y do not have an order, the problem can be reduced to several binary classification problems, to which it is straightforward to apply our strategy for debiasing workers' annotations.

7. RELATED WORK

In this section, we first introduce existing studies on annotation bias of crowds, when data items are presented either independently, or in a sequence or batches; we then introduce rank aggregation techniques as well as their application to crowdsourced ranking or rating.

Annotation bias in independent judgments. A number of studies have been conducted on verifying and quantifying annotation bias of crowd workers. Snow et al. [29] explore the performance of annotations by non-expert workers for several NLP tasks. Demeester et al. [8] discuss the disagreement between different users on assessment of web search results. There are also extensive studies on modeling worker behaviors. Raykar et al. [23, 24, 25] study how to learn a model with noisy labeling. Specifically, they employ a logistic regression classifier, and insert hidden variables indicating whether a worker tells the truth. Karger et al. [12] propose an iterative algorithm to infer workers' reliability and aggregate their answers. Whitehill et al. [35] model the annotator ability and data item difficulty, and infer the true label from the crowds in a unified model. Most of these works also propose various generative models to capture worker behavior.
However, they assume judgments on different data items are independent, which is not necessarily true when data items are grouped into batches. Venanzi et al. [33] propose a community-based label aggregation model to identify different types of workers, and correct their labels correspondingly. Das et al. [7] address the interactions of opinions between people connected by networks. They focus on another aspect of dependencies, namely the dependencies between workers, while in our study we are more concerned with dependencies between data items and their judgments.

Annotation bias in sequential and batch judgments. A few researchers also notice the correlation between judgments on different data items, but their work is mainly developed in the setting where data items are reviewed in a sequence. Scholer et al. [26, 27] study the annotation disagreements in a relevance assessment data set. They discover correlations between annotations of similar data items. They also explore threshold priming in annotation, where the annotators tend to make similar judgments or apply similar standards on consecutive data items they review. However, their work focuses on the scenario where data items are organized in a long sequence. It confines the dependencies to exist only between consecutive data items. Also, they focus more on qualitative conclusions, without a quantitative model to characterize and measure the discovered factors. Carterette et al. [4] provide several assessor models for the TREC data set. Mozer et al. [18] study a similar relativity-of-judgments phenomenon on sequential tasks instead of small batches. Again, their focus is more on data items presented as a long sequence, while we focus more on data items presented in batches simultaneously. Our recent work [36] also considers a similar setting where data items are organized in small batches; we verify the existence of annotation bias caused by batching data items.
Our focus in that paper was to design an active learning algorithm to smartly assemble batches, aiming to improve the performance of the classifier trained on these annotation batches. Our focus was not on improving the quality of labels collected, and we still used majority voting to obtain labels for data items. In this paper, we focus on debiasing the obtained labels directly; the higher label quality can benefit tasks including training and/or evaluating classifiers, with a broader range of applications.

Crowdsourced ranking and rating. In our model, we employ the Plackett-Luce model to capture worker behavior, and aggregate worker annotations on batches as rankings in order to infer true labels. There is a related thread of work on rank aggregation; however, to the best of our knowledge, we are the first to model crowds' annotating behavior on batches by ranking, and to propose a debiasing strategy. Studies on aggregating multiple rankings into a consistent ranking can be dated back to the seminal work of Arrow [2]. Negahban et al. [19] study how to aggregate pairwise comparisons into a ranking by utilizing the Bradley-Terry model [3], which is a simplified version of the Plackett-Luce model utilized in this paper. Hunter et al. [11] propose the minorization-maximization (MM) algorithm to infer the Plackett-Luce model from multiple partial orderings. Soufiani et al. [30] generalize Negahban et al.'s work and propose a class of generalized method-of-moments (GMM) algorithms to infer parameters of the Plackett-Luce model from multiple orderings, and compare the performance against the MM algorithm. They then further extend their algorithm to be applied to a more general class of ranking models called random utility models (RUMs) [31]. In ad-
