1 Introduction

In this module, we will simulate the draft lottery used by the National Basketball Association (NBA). Each year, the 14 worst teams are entered into a drawing to determine who will get the first three picks in the upcoming draft. The lottery is designed so that the worse a team's record is, the better its chances of getting the first pick. You can find more information on how the lottery works here. For clarity, we label the 14 teams in the lottery 1, 2, ..., 14, with 1 corresponding to the team with the best chance of landing the first pick in the draft. Table 1 shows the probability that each team wins the lottery and gets the first pick in the draft.

Team   Prob.     Team   Prob.
   1   0.250        8   0.028
   2   0.199        9   0.017
   3   0.156       10   0.011
   4   0.119       11   0.008
   5   0.088       12   0.007
   6   0.063       13   0.006
   7   0.043       14   0.005

Table 1: Probability of getting the first pick in the draft

At this point, it is important to formalize what we mean when we say that team 1 wins the lottery with probability 0.250. We will adopt the frequency definition (see footnote 1). Briefly, suppose we repeatedly perform a random experiment, which can result in one of many outcomes, and record the number of times each outcome occurs. With these counts, we can compute the proportion of times each outcome occurred by dividing its count by the number of trials. Now imagine that we were able to repeat the experiment for infinitely many trials. The probability of any particular outcome would be the proportion of trials that resulted in that outcome. So when we say the probability that team 1 will win the lottery is 25%, we really mean that if we were to run the lottery infinitely many times, team 1 would win in a quarter of these trials. Unfortunately, we cannot actually perform any experiment infinitely often, and we are limited to finitely many trials.
Luckily, if we perform a random experiment for a sufficiently large number of trials, the resulting empirical proportion of an outcome will be very close to the true probability. In the context of the NBA draft lottery, this means that if we were to simulate the lottery, say, 1,000,000 times, the number of times team 1 won the lottery would be very close to 0.25 × 1,000,000 = 250,000 (see footnote 2).

1 Interpreting probabilities is a fundamental philosophical problem. For more information, check out the entry on interpreting probability in the Stanford Encyclopedia of Philosophy.
2 This is essentially what the Law of Large Numbers guarantees.
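To make the frequency definition concrete, here is a minimal sketch that estimates the chance that team 1 wins from increasingly many simulated lotteries. It uses the sample() function, which Section 2 introduces. The seed, the variable names n.trials and wins, and the reduction of each lottery to a win/lose outcome with probability 0.25 are assumptions made for illustration.

```r
# Sketch: watch the empirical proportion approach 0.25 as trials increase.
# Assumption: each lottery is reduced to "team 1 wins" (TRUE) or not (FALSE).
set.seed(1)   # arbitrary seed, used only so the run is reproducible
n.trials <- c(100, 10000, 1000000)
for (n in n.trials) {
  wins <- sample(c(TRUE, FALSE), size = n, replace = TRUE, prob = c(0.25, 0.75))
  # mean() of a logical vector is the proportion of TRUE values
  print(paste0("n = ", format(n, scientific = FALSE),
               ": proportion of wins = ", mean(wins)))
}
```

The printed proportions should fluctuate noticeably for small n and settle near 0.25 for large n.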
2 Generating Random Numbers in R

One extremely useful feature of R is its ability to draw random numbers from a wide variety of probability distributions. For instance, let's say that we wish to pick a random number from the set {1, 2, 3, 4} uniformly (i.e., each number is equally likely to be picked). We can do this using the sample() function.

[1] 1
[1] 1
[1] 4
[1] 1

Looking at these examples, however, it is not immediately obvious that each number is equally likely to be picked. With so few draws this is unsurprising: the chance that 2 is never picked in 8 repeated uniform drawings from {1, 2, 3, 4} is (3/4)^8, or about 10%. To check whether sample() is actually picking the numbers uniformly at random, we can ask it to simulate this random drawing 1,000,000 times and make a histogram to measure the relative frequency with which each number is picked. The following code does just that and produces the image in Figure 1.

> x <- sample(1:4, size = 1000000, replace = TRUE)
> hist(x, breaks = seq(0, 4, by = 1), freq = FALSE, ylim = c(0, 0.3))
> abline(h = 0.25, col = 'red')
Figure 1: Histogram showing results of repeatedly sampling uniformly from {1, 2, 3, 4}.

There are a few things to observe in the code above. First, when we call the sample() function, we must specify replace = TRUE. If we had left replace = FALSE (which is the default setting), we would have gotten an error:

> sample(1:4, size = 1000000, replace = FALSE)
Error in sample.int(length(x), size, replace, prob) :
  cannot take a sample larger than the population when 'replace = FALSE'

After defining the vector x, we create a histogram. We have manually set the breaks argument so that the bins in our histogram go from 0 to 1, 1 to 2, etc. We have also set the freq argument to FALSE, yielding a density histogram. Finally, we have added a red line at height 0.25. It is quite reassuring to see that each number was picked approximately 25% of the time!

Up to this point, we have only seen how sample() can be used to draw from a set uniformly at random. In order to properly simulate the draft lottery process, we need to draw numbers from a non-uniform distribution. This is where the prob argument in sample() comes in handy.

Exercise
1. Just like we did in Figure 1, verify that when we set the argument prob = c(0.5, 0.25, 0.15, 0.1) in sample(1:4, replace = TRUE), the number 1 is picked about 50% of the time, 2 is picked about 25% of the time, 3 is picked about 15% of the time, and 4 is picked about 10% of the time.
2. Create a vector named lottery.probs that contains the probability, listed in Table 1, that each team in the lottery wins. Pass this vector as the prob argument to sample() to simulate the winner of the lottery 1,000,000 times. Make a histogram similar to Figure 1 based on these results.
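As a numerical complement to the histogram in Figure 1, the relative frequencies of the uniform draws can also be tabulated directly. This is only a sketch: the seed and the name rel.freq are illustrative choices, and table() and prop.table() are base R functions that count values and convert counts to proportions.

```r
# Sketch: tabulate relative frequencies instead of reading them off a histogram.
set.seed(2)   # arbitrary seed, used only so the run is reproducible
x <- sample(1:4, size = 1000000, replace = TRUE)
rel.freq <- prop.table(table(x))   # counts divided by the total number of draws
print(round(rel.freq, 3))
```

Each of the four entries should be close to 0.25, which matches the red reference line in the histogram.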
3 Beyond the first pick

Admittedly, the last exercise was a bit anticlimactic in light of our frequentist interpretation of probability. Using just the values in Table 1, however, it is not trivial to compute the probability that a team gets the 2nd pick in the draft. In order to simulate the full draft order, we'll need to do a bit more work.

The actual lottery is performed as follows: each team is assigned a set of four-number combinations from {1, 2, ..., 14}, with the team with the worst record receiving 250 combinations, the team with the second-worst record receiving 199 combinations, and so on. To determine the first pick, four balls are selected uniformly at random, without replacement, from a bin of balls labelled 1, 2, ..., 14 (ignoring the order of the balls, there are 1,001 such combinations, of which 1,000 are assigned to teams). The team that holds the drawn combination is awarded the first pick. Then, the process of drawing four balls from the bin is repeated, and the team with the resulting combination is awarded the second pick. Since no team can be awarded multiple picks through the lottery, if the team with the first pick also owns the second four-number combination drawn, the process of drawing four balls is repeated until we get a combination that does not belong to the team with the first pick. A similar procedure is followed to award the third pick. After the first three picks have been awarded, the remainder of the draft order is set by team record.

Rather than simulate the actual process of drawing four balls and looking up which teams own which combinations, we will simulate an equivalent process. In particular, we will use sample() to pick a number from {1, 2, ..., 14} with the probabilities listed in Table 1 to determine which team gets the first pick. We'll then pick another number from {1, 2, ..., 14} with the same probabilities. If the two numbers are equal, we will keep drawing until we get a new number to determine who gets the second pick. In order to program the process of repeatedly drawing until we get a new number, we need to use a while() loop.
A while() loop consists of two parts: a logical condition and a block of code. The loop starts by checking the condition, and if it is TRUE, it executes the block of code. It then repeatedly executes the block of code until the condition is no longer TRUE. Here's a really basic example of a while() loop:

> x <- 0
> while (x < 5) {
+   print(paste0("x = ", x))
+   x <- x + 1
+ }
[1] "x = 0"
[1] "x = 1"
[1] "x = 2"
[1] "x = 3"
[1] "x = 4"
> x
[1] 5

In this example, the loop checks whether x is less than 5. If x < 5, then we print the current value of x, add one to x, and check the condition again. From the printed statements, we see that the block of code within the loop (i.e., between the curly braces) is executed until x = 5.

Note that a while() loop will continue executing the block of code until the condition is no longer TRUE, meaning that there is a potential for a while() loop to continue indefinitely. Such infinite loops are highly problematic, and care must be taken to avoid them. If you find yourself stuck in an infinite loop, you should halt the execution by pressing the ESC key. To avoid getting stuck in an infinite loop, you need to make sure that you haven't used a condition that is always TRUE. Additionally, you need to make sure that the block of code being executed updates the expression or quantities being checked by the condition. For instance, in the above example, if we had neglected to include x <- x + 1, then there would be no way for the condition x < 5 to fail.

To simulate awarding the first and second picks of the draft, we can do

> first.pick <- sample(1:14, size = 1, replace = TRUE, prob = lottery.probs)
> second.pick <- sample(1:14, size = 1, replace = TRUE, prob = lottery.probs)
> while (second.pick == first.pick) {
+   print("first.pick = second.pick! need to re-draw for 2nd pick")
+   second.pick <- sample(1:14, size = 1, replace = TRUE, prob = lottery.probs)
+ }
>
> first.pick
> second.pick
[1] 4

In this example, we first drew the first and second picks and stored their values as first.pick and second.pick. We then wrote a while() loop to re-draw the second pick if necessary. Notice that in the while() loop we included a print statement. This lets us know that we have started executing the block of code contained in the while() loop.
The fact that nothing was printed when we executed the code in this example means that, in this case, we did not have to re-draw for the second pick (we confirm this at the end of the example).

Exercise
1. Execute the code in the example above several times. Keep track of the number of times that you had to re-draw the second pick.

Hint: It'd be helpful to write this code in a script. Then you can select all of the lines and execute them at once with Command + Enter on a Mac or Control + R on Windows.
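The earlier advice on avoiding infinite loops can be sketched in code. One common safeguard, shown here as an illustration rather than as part of the lottery simulation, is to cap the number of iterations. The names n.iter and max.iter and the cap of 1000 are arbitrary assumptions.

```r
# Sketch: a while() loop with a safety cap on the number of iterations.
# n.iter and max.iter are illustrative names, not part of the original example.
x <- 0
n.iter <- 0
max.iter <- 1000
while (x < 5) {
  x <- x + 1              # update the quantity tested by the condition
  n.iter <- n.iter + 1    # count how many times the block has run
  if (n.iter >= max.iter) {
    warning("reached max.iter; stopping the loop early")
    break                 # leave the loop even though the condition is still TRUE
  }
}
print(x)
print(n.iter)
```

If the line x <- x + 1 were accidentally removed, the cap would still end the loop after max.iter iterations instead of letting it run forever.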
The previous exercise had you repeatedly execute a block of code by hand. This process, however, is not scalable, especially if you wish to simulate drawing the first two picks 1,000,000 times. This brings us to for() loops. Basically, a for() loop allows us to execute a block of code a fixed number of times. Like a while() loop, a for() loop consists of two parts: a vector of iterators and a block of code. For each iterator in the vector, the loop will execute the block of code.

> for (i in 1:4) {
+   first.pick <- sample(1:14, size = 1, replace = TRUE, prob = lottery.probs)
+   print(first.pick)
+ }
[1] 2
[1] 8

In this example, we have drawn the first pick of the draft 4 separate times, each time printing the result. Typically, when we use a for() loop to simulate a random process, we'd like to save the results in a matrix or data.frame. In this case, we'll use a matrix.

> picks.matrix <- matrix(nrow = 5, ncol = 2, dimnames = list(c(), c("First Pick", "Second Pick")))
> for (i in 1:5) {
+   first.pick <- sample(1:14, size = 1, replace = TRUE, prob = lottery.probs)
+   second.pick <- sample(1:14, size = 1, replace = TRUE, prob = lottery.probs)
+   while (second.pick == first.pick) {
+     second.pick <- sample(1:14, size = 1, replace = TRUE, prob = lottery.probs)
+   }
+   picks.matrix[i, "First Pick"] <- first.pick
+   picks.matrix[i, "Second Pick"] <- second.pick
+ }

In this example, we first created a matrix picks.matrix to store the results from our simulation. Then we ran the code we wrote earlier to simulate the first two picks (note that we removed the print() statement from the while() loop for ease of presentation). The resulting picks were
     First Pick Second Pick
[1,]          2          14
[2,]          2           1
[3,]          6           1
[4,]          1           2
[5,]          4           1

We are now equipped to answer the following question: "What is the probability that team 1 gets the second pick of the draft?" Above, we see that team 1 received the second pick 3 times out of 5 (i.e., 60% of the time). Unfortunately, 5 simulations is nowhere near enough to determine this probability accurately. We need, instead, to do something like 1,000,000 simulations, and in that case we cannot rely on printed output to find the probability. Instead, we can do the following:

> mean(picks.matrix[, "Second Pick"] == 1)
[1] 0.6

This line of code is doing several things. First, the expression picks.matrix[, "Second Pick"] == 1 creates a logical vector. R treats TRUE as 1 and FALSE as 0 in arithmetic, so when we pass this vector to mean(), we are simply computing the proportion of TRUE values. This is precisely the proportion of simulations in which the second pick went to team 1.

Exercises
1. Modify the code in the example above to simulate drawing the first two draft picks 1,000,000 times.
2. For each team, compute the probability that it is awarded the second pick in the draft.

At this point, you may be wondering how we can modify our code to answer questions such as "What is the probability that team 1 does not get either the first pick or the second pick in the draft?" and "What is the probability that team 5 gets either the first pick or the second pick?" To answer these, we need logical operators like AND and OR. Indeed, the first probability is the proportion of times that picks.matrix[, "First Pick"] != 1 AND picks.matrix[, "Second Pick"] != 1, while the second probability is the proportion of times that picks.matrix[, "First Pick"] == 5 OR picks.matrix[, "Second Pick"] == 5. In R, the AND operator is denoted & and the OR operator is denoted |. We can use them as follows:

> mean(picks.matrix[, "Second Pick"] != 1 & picks.matrix[, "First Pick"] != 1)
> mean(picks.matrix[, "Second Pick"] == 5 | picks.matrix[, "First Pick"] == 5)
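Simulated second-pick frequencies can be cross-checked against an exact calculation. The sketch below rests on one observation: re-drawing until the second pick differs from the first is the same as conditioning on that event, so given that team i has the first pick, team j gets the second pick with probability p_j / (1 - p_i). Summing over all i other than j gives the probability that team j picks second. The name second.pick.exact is an illustrative choice, and lottery.probs is assumed to hold the Table 1 values in team order.

```r
# Sketch: exact second-pick probabilities under the re-draw scheme.
# Assumption: lottery.probs holds the Table 1 probabilities, in team order.
lottery.probs <- c(0.250, 0.199, 0.156, 0.119, 0.088, 0.063, 0.043,
                   0.028, 0.017, 0.011, 0.008, 0.007, 0.006, 0.005)

# P(second = j) = sum over i != j of p_i * p_j / (1 - p_i)
second.pick.exact <- sapply(1:14, function(j) {
  p.first <- lottery.probs[-j]   # first-pick probabilities of every team except j
  sum(p.first * lottery.probs[j] / (1 - p.first))
})

print(round(second.pick.exact, 3))   # team 1 comes out at roughly 0.215
print(sum(second.pick.exact))        # the 14 probabilities should sum to 1
```

With 1,000,000 simulated lotteries, the proportion computed by mean(picks.matrix[, "Second Pick"] == 1) should agree with the first entry to about two or three decimal places.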
4 The first 3 picks

Up to this point, we've only focused on the first two picks of the draft. In order to simulate drawing the third pick, we can add to the code we've written before. First, we can create a new variable called third.pick and initialize it like we did with second.pick. We can then go ahead and re-draw second.pick as necessary until we know that first.pick != second.pick. We can then write another while() loop to re-draw the third pick as necessary. The condition in this loop, however, is a little more complex than in the previous loop. In particular, we need to re-draw third.pick if third.pick == first.pick OR third.pick == second.pick. We can add the OR operator to our condition with the symbol |:

> while (third.pick == first.pick | third.pick == second.pick) { ... }

Exercise
1. Fill in the ... in the above example with the code necessary to re-draw third.pick. At this point, you should have a single block of code that will simulate drawing the first three picks of the draft.
2. Wrap this block of code into a for() loop and simulate drawing the first three picks 100 times. Be sure to save the resulting picks in a matrix like the one we created above.
3. Using this matrix, answer the following questions:
   (a) What is the probability that team 1 gets the second or third pick?
   (b) What is the probability that team 14 gets one of the top three picks?
   (c) What is the probability that team 1 does not get any of the top 3 picks?
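Once the while() loop version is working, one way to sanity-check it is to compare its results with a single call to sample() that draws without replacement. According to R's documentation for sample(), when replace = FALSE the prob weights are applied sequentially, so each draw is made from the remaining teams with probability proportional to their weights. That matches the re-draw procedure. This is a sketch offered as a cross-check, not a replacement for the exercise. The name top.three and the seed are illustrative, and lottery.probs is assumed to hold the Table 1 values in team order.

```r
# Sketch: draw the top three picks in a single call, as a cross-check.
# Assumption: lottery.probs holds the Table 1 probabilities, in team order.
lottery.probs <- c(0.250, 0.199, 0.156, 0.119, 0.088, 0.063, 0.043,
                   0.028, 0.017, 0.011, 0.008, 0.007, 0.006, 0.005)

set.seed(3)   # arbitrary seed, used only so the run is reproducible
# With replace = FALSE, each chosen team is removed before the next draw,
# and the remaining teams are weighted by their entries in lottery.probs.
top.three <- sample(1:14, size = 3, replace = FALSE, prob = lottery.probs)
print(top.three)   # first, second, and third pick, in that order
```

Repeating this call inside a for() loop and tabulating the results should give proportions close to those from the while() loop simulation.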