Statistics I. Practice 2 Notes. Probability and probabilistic models; introduction to statistical inference

1. Simulation of random variables

In Excel we can simulate values of random variables (discrete or continuous). The simulation tool is part of the Análisis de datos (data analysis) add-in that we installed in the first computer class. The steps for simulating values are similar for all types of variables.

First, we open Excel and select Datos in the top menu, where we find the Análisis de datos add-in. There we look for the function called Generación de números aleatorios. Once it is selected, a new window opens with the following fields:

Número de variables: the number of variables that we want to simulate, usually 1.
Cantidad de números aleatorios: the sample size.
Distribución: the distribution of our variable, either discrete (Bernoulli, Binomial) or continuous (Uniforme, Normal).
Parámetros: the parameters of the distribution.
Iniciar con: left blank.

Opciones de salida: selects where the output is written, either a range in the current sheet or a new sheet. We can also give the new sheet a name, for example one that depends on the distribution we are using.

1.1. Discrete random variables: Bernoulli and Binomial

1.1.1. First, we simulate a sample of n = 50 observations from a Bernoulli distribution. We open the simulation window as described above, fill in the fields and click on Aceptar. In column A we get a simple random sample from a Bernoulli distribution with parameter p = 0.4. We know that $E[X] = p$ and $\mathrm{Var}[X] = p(1-p)$; then $E[X] = 0.4$ and $\mathrm{Var}[X] = 0.24$. We compute the sample mean and variance using the Excel functions PROMEDIO and VAR, and compare these sample quantities with their population counterparts.

Important: each student will obtain different results because the simulated values are random.

1.1.2. Following the same steps, we simulate a sample of size n = 100 from a Binomial distribution.

We compute the population mean and variance and compare them with the sample mean and variance.

1.2. Continuous random variables: Normal

We want to generate a sample of size n = 20 from a Normal distribution, X ~ N(μ, σ), where μ and σ are the population mean and standard deviation. We follow the same steps as before and compute the sample mean and standard deviation. Are the sample values close to the population parameters? What would happen if, instead of n = 20, we took n = 1000?

2. Point estimation and fitting

2.1. Quantile-quantile plot (QQ-plot) for a Normal distribution

We use the same data that we generated from the Normal distribution. First, we insert an additional row at the top with the names of the columns. Then we select all the data and sort them from smallest to largest using Datos in the top menu.

The next step is to compute the sample quantiles. For that purpose, we first assign a rank to each observation. Select cell B2 and type 1, which means that the number in A2 is the first (smallest) observation. In B3, enter the formula =B2+1 and copy it down to the end of the column.

Next, we compute the sample quantile levels in the third column. Select cell C2 and enter the formula =(B2-0.5)/20 (recall that 20 is the sample size). Copy this formula down to the end of the column. To check that the quantile levels have been obtained properly, note that the median should be at position (20+1)/2 = 10.5, between observations 10 and 11. Indeed, the level 50% falls between the values for positions 10 and 11 (0.475 and 0.525).

Finally, we compute the values of the estimated Normal distribution associated with each quantile level: $x = \bar{x} + s\,z$, where $\bar{x}$ and $s$ are the sample mean and standard deviation. To do this, we first compute the z-scores, which are the quantiles of the standard Normal distribution at each level. Select cell D2, enter the Excel function =DISTR.NORM.ESTAND.INV(C2), and copy this formula down to the end of the column. To convert these z-scores to the scale of the original sample, we apply the inverse standardization: multiply each z-score by the sample standard deviation and add the sample mean. The results, stored in column E, are called x-scores.
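The spreadsheet columns described above can be cross-checked outside Excel. The following is a minimal Python sketch, not part of the original lab: the sample is a hypothetical set of 20 draws from N(0, 1) with a fixed seed, and the variable names are illustrative. It mirrors columns A to E (sorted data, rank, quantile level, z-score, x-score).

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(1)  # assumption: fixed seed only so the sketch is reproducible
n = 20
# Hypothetical sample standing in for column A, sorted from smallest to largest.
data = sorted(random.gauss(0, 1) for _ in range(n))

x_bar = mean(data)  # counterpart of PROMEDIO
s = stdev(data)     # counterpart of DESVEST (n - 1 in the denominator)

rows = []
for rank, value in enumerate(data, start=1):  # column B: rank 1..n
    level = (rank - 0.5) / n                  # column C: =(B2-0.5)/20
    z = NormalDist().inv_cdf(level)           # column D: DISTR.NORM.ESTAND.INV
    x_score = x_bar + s * z                   # column E: inverse standardization
    rows.append((rank, value, level, z, x_score))

for rank, value, level, z, x_score in rows:
    print(f"{rank:2d}  {value:8.4f}  {level:5.3f}  {z:8.4f}  {x_score:8.4f}")
```

Plotting the x-scores against the sorted data gives the QQ-plot; points close to the diagonal suggest that the Normal model fits.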

Now we have all the information needed to draw the QQ-plot. First, copy column A with the original data to the right of column E (the x-scores), so that Excel can recognize which data go on the x axis and which on the y axis. Then select the two columns and click on Insertar in the top menu, then on Dispersión, and choose the type of plot that we want (points only).

To change the size and style of the points, place the cursor on one point, right-click and select Dar formato a serie de datos, Opciones de marcador.

If the data have been generated from the distribution under consideration, the points in the plot should lie along a straight line. To draw this line, we copy the x-scores into column G, select the three columns and repeat: Insertar, Dispersión. Excel then plots the straight line. Be careful when copying and pasting the x-scores, because the cells contain formulas: right-click, select Pegado Especial and then Sólo Valores.

When the plot appears, we change the style of the x-score points to turn them into a straight line: place the cursor on a point, right-click and select Dar formato a serie de datos, Opciones de marcador: ninguno, Color de línea: Línea sólida. In the resulting plot, the points lie along the straight line. This means that the distribution fits the data well.

2.2. Graphical fitting: histograms with area 1 (on a density scale) and density curves

We again use data generated from a Normal distribution. For this example, we generate a new sample of 20 observations. To create the histogram with area 1 (on a density scale), we need the following information, as explained in Lab 1:

Number of observations: 20
Minimum value: -3,470255928 (approximately -3,4)
Maximum value: 3,70535465 (approximately 3,8)
Range: 7,2
Number of classes: 20^(1/2) = 4,472135955 (approximately 4 or 5 classes)

The steps are the following:

1.- Suppose that we use 5 classes. Following the steps explained in Lab 1, we compute the length of the intervals (range / number of classes = 7,2 / 5 = 1,44) and the upper limits of the classes, starting from the minimum value and adding the interval length to the previous limit.

2.- Once the upper limits of the classes are obtained, we create the histogram by selecting Análisis de datos in Datos, then Histograma, and clicking on Aceptar. This gives the absolute frequency of each interval.

3.- We calculate the relative frequency of each interval (relative frequency fi = absolute frequency / n).

4.- To create a histogram with area 1 (on a density scale), we divide the relative frequencies by the length of the intervals (fi / ai) to obtain the heights of the bars. The histogram on a density scale is then plotted by replacing the column of absolute frequencies with these heights. We also remove the space between bars.

5.- Once the histogram with area 1 is obtained, the Normal density curve can be added. To draw the density of the N(μ, σ), the values on the OX axis are taken as the midpoints between the lower and upper limits of the intervals.

6.- We calculate the value of the Normal density at each midpoint and add it to the histogram as the density curve. This requires the mean and standard deviation of the simulated values, which we can obtain, for example, with the statistical functions PROMEDIO and DESVEST. The density is calculated with the DISTR.NORM function:

DISTR.NORM(punto central;PROMEDIO(A$2:A$21);DESVEST(A$2:A$21);0)

To add the density curve to the histogram with area 1 (on a density scale), click on the chart, right-click, and choose Seleccionar datos, Agregar, nombre de la serie (for example, curva) and valores de la serie (select the density values). The densities are first added as bars in another color. To draw them as a curve, change the chart type of this series to a line without markers (Cambiar tipo de gráfico, Líneas).
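Steps 1 to 6 can be sketched in Python as a cross-check. This is an illustrative sketch under stated assumptions, not the lab's own procedure: the sample is a hypothetical set of 20 draws from N(0, 2), the class limits start at the exact minimum and maximum instead of the rounded values, and classes are closed on the left (Excel's Histograma tool closes them on the right).

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(2)  # assumption: fixed seed only for reproducibility
n = 20
data = [random.gauss(0, 2) for _ in range(n)]  # hypothetical sample

k = 5                                   # number of classes (sqrt(20) is about 4.47)
lo, hi = min(data), max(data)
width = (hi - lo) / k                   # interval length ai = range / classes
edges = [lo + i * width for i in range(k + 1)]

# Step 2: absolute frequencies; the maximum is placed in the last class.
counts = [0] * k
for x in data:
    idx = min(int((x - lo) / width), k - 1)
    counts[idx] += 1

rel_freq = [c / n for c in counts]      # step 3: fi = absolute frequency / n
heights = [f / width for f in rel_freq] # step 4: bar heights fi / ai

# Steps 5-6: Normal density at the class midpoints, using the sample mean and
# standard deviation, like DISTR.NORM(midpoint; PROMEDIO; DESVEST; 0).
centers = [(edges[i] + edges[i + 1]) / 2 for i in range(k)]
fitted = NormalDist(mean(data), stdev(data))
density = [fitted.pdf(c) for c in centers]

area = sum(h * width for h in heights)  # total area of the bars
print("heights:", [round(h, 4) for h in heights])
print("density:", [round(d, 4) for d in density])
print("area:", round(area, 6))
```

The total area of the bars equals 1 by construction, which is what makes the bar heights directly comparable with the density values.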

3. Confidence intervals

To calculate a confidence interval we can use the statistical function INTERVALO.CONFIANZA.

INTERVALO.CONFIANZA is used to build the confidence interval for the mean μ of a Normally distributed population. Its arguments are:

Alfa: the significance level used to calculate the confidence level. The confidence level is equal to 100 * (1 - alfa)%, i.e., an alfa of 0.05 corresponds to a 95% confidence level.
Desv_estándar: the standard deviation of the population, which is assumed to be known.
Tamaño: the sample size.

The function returns the margin of error. The confidence interval for the population mean, at the given significance level, is obtained by subtracting this value from the sample mean (lower limit) and adding it to the sample mean (upper limit).
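In formula form, and assuming the usual known-σ Normal interval that this description corresponds to, the value returned by the function and the resulting interval can be written as follows, where $z_{1-\alpha/2}$ denotes the standard Normal quantile of level $1-\alpha/2$ (notation introduced here for illustration):

```latex
\[
  \text{margin} = z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}},
  \qquad
  \left( \bar{x} - z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}},\;
         \bar{x} + z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}} \right)
\]
```

Here $\alpha$ is Alfa, $\sigma$ is Desv_estándar, $n$ is Tamaño and $\bar{x}$ is the sample mean.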

Example

To estimate the average grade of a given subject at a university, a sample of 35 student marks has been obtained. It is known from other courses that the grades of this subject follow a Normal distribution, N(μ, σ), with a standard deviation of 2,41 points. Given that the average score in the sample is 5,02, find:

a) A 90% confidence interval for the mean based on the sample.

INTERVALO.CONFIANZA(0,1;2,41;35) = 0,67005473

So the confidence interval is (5,02 - 0,67005473; 5,02 + 0,67005473) = (4,34994527; 5,69005473).

b) A 95% confidence interval for the mean based on the sample.

INTERVALO.CONFIANZA(0,05;2,41;35) = 0,798419

So the confidence interval is (5,02 - 0,798419; 5,02 + 0,798419) = (4,221581; 5,818419).
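The margins in this example can be recomputed outside Excel. The sketch below assumes that INTERVALO.CONFIANZA returns the standard Normal quantile of level 1 - alfa/2 multiplied by σ/√n; the helper name confidence_margin is hypothetical and only used for illustration.

```python
from math import sqrt
from statistics import NormalDist


def confidence_margin(alpha, sigma, n):
    """Margin of error z_(1 - alpha/2) * sigma / sqrt(n) for a known sigma."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * sigma / sqrt(n)


x_bar, sigma, n = 5.02, 2.41, 35  # values taken from the example
intervals = {}
for alpha in (0.10, 0.05):
    margin = confidence_margin(alpha, sigma, n)
    intervals[alpha] = (x_bar - margin, x_bar + margin)
    level = round(100 * (1 - alpha))
    print(f"{level}% CI: margin = {margin:.6f}, "
          f"interval = ({x_bar - margin:.6f}, {x_bar + margin:.6f})")
```

A higher confidence level gives a larger quantile and therefore a wider interval, so the 95% interval must contain the 90% interval.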

4. Exercises (hand in to the professor at the end of the class, with the answers written on the last page)

4.1. Simulate a sample of size n = 150 from the Uniform distribution X ~ U(3, 12), compute the sample mean, variance and standard deviation and their population counterparts, and write the results in Table 1.

4.2. Simulate a sample of size n = 50 from the Normal distribution X ~ N(4, 2).
a. Compute the sample mean, variance and standard deviation and their population counterparts, and write the results in Table 2.
b. Draw the QQ-plot for this fit and explain the results.
c. Draw the corresponding histogram with area 1 and the density curve.
d. Find a 98% confidence interval for the mean, considering a random sample of size 250.

Answers to part 4.

Name:
NIU:
Degree:
Group:

Table 1. Results for n = 150, X ~ U(3, 12)

X                     Sample      Population
Mean
Variance
Standard deviation

Table 2. Results for n = 50, X ~ N(4, 2)

X                     Sample      Population
Mean
Variance
Standard deviation

Explain the results from the QQ-plot:

A 98% confidence interval (CI) considering a random sample of size 250. Fill in the statistical function and the results:

INTERVALO.CONFIANZA( ; ; ) =

So the CI is ( , ).