
Lecture 1

Q: Which month has the lowest sales? Q: There are three consecutive months in which sales grew. What are they? Q: Which month experienced the biggest drop in sales?

Q: Just above November there is a light blue bar. What is its (approximate) height? Q: What is the interpretation of this number? (One short sentence at most.) Q: Which month has historically had the highest rainfall? Q: Which month experienced the largest drop in precipitation compared to its historical value?

Q: Which country has the highest number? Answer: Q: Which country has the second highest number? Answer: Q: Which countries have the numbers 35 and 31? Answer:
[Chart labels: China 789, Saudi Arabia 152]

Q: What do the numbers 789, 152, 35 and 31 represent?
[Chart labels: China 789, Saudi Arabia 152]

Q: What is the slope of the line? Q: What is the Y-intercept? Q: What is the approximate weight of a person who spends 50 minutes in the gym weekly? Q: What about the weight of a person who spends zero minutes in the gym? Q: If a person decides to spend an additional 20 minutes in the gym, what will happen to his weight? The person would lose about ____ pounds.
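These questions all read the chart's straight line as weight = intercept + slope * minutes. A minimal Python sketch of that reading follows. The slope and intercept below are made-up placeholders, not the values from the chart, which must be read off the figure itself.

```python
# Hypothetical line: both numbers are placeholders, NOT the values from the chart.
SLOPE = -0.5        # pounds per additional weekly gym minute (assumed)
INTERCEPT = 200.0   # predicted weight at zero gym minutes (assumed)


def predicted_weight(minutes):
    """Predicted weight for a given number of weekly gym minutes."""
    return INTERCEPT + SLOPE * minutes


def weight_change(extra_minutes):
    """Change in predicted weight when gym time changes by extra_minutes."""
    return SLOPE * extra_minutes


print(predicted_weight(50))   # weight at 50 minutes
print(predicted_weight(0))    # weight at zero minutes equals the intercept
print(weight_change(20))      # effect of 20 extra minutes equals slope * 20
```

Note that the answer to the last question depends only on the slope: the change is the same whatever the person's starting number of minutes.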

Types of Data
In statistics we distinguish data by type: Categorical or Numerical. Numerical data are further divided into Discrete data (integers such as 1, 5, 14, ...) and Continuous data (decimals such as 2.33 or 4.35). An example of discrete data is the number of credits a student has taken; an example of continuous data is the student's GPA. Categorical data are non-numerical. For example, imagine that someone wants to make a statistical analysis of students' names. The list of all these names is categorical data. Typically, one would transform the names into a list of counts: for example, there are 45 Roberts, 22 Tonys, etc.
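Turning a list of names into counts can be sketched in a few lines of Python. The names below are made up for illustration; only the idea of counting categories comes from the slide.

```python
from collections import Counter

# Hypothetical list of student first names (categorical data).
names = ["Robert", "Tony", "Robert", "Maria", "Tony", "Robert"]

# Counter maps each distinct category to the number of times it occurs,
# which turns the categorical list into numerical (discrete) counts.
counts = Counter(names)

for name, count in counts.most_common():
    print(name, count)
```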

The charts shown earlier for Sales, Precipitation and International Students are each based on some data. Answer the following questions: Q1. Which of the three data files was of Numerical, continuous (i.e. decimal) type? Q2. One of the charts is based on Categorical data. Which one? Q3. What type of data was used in making the Precipitation chart in Task 2?

Lecture 2 (uploading data) Review Lab 1
Motivation: We cannot do statistics without data. Real data are often not given in Excel format but rather in plain text format (i.e. with the extension .txt). Such files are easy to save and open using Notepad or WordPad.
Opening with Excel: Open Excel, click the icon in the top-left corner and choose Open. Trick: Before browsing, at the bottom of the window, under Files of type, click on the circled region; a list will open, and you need to select Text Files. (Otherwise your file will not be visible.) Browse through the computer until you find the folder where you saved the data, Height and Weight.txt, and click on your file.

Lecture 2 (uploading data) Preview for Lab 1
The formatting window will pop up. Under Choose the file type, click on Delimited; do not choose Fixed. Click on Next, then again on Next, and then on Finish.
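For comparison, the same "Delimited" idea can be sketched outside Excel. The Python example below splits each line of a text file on a delimiter character. The column names, the values and the tab delimiter are all assumptions made for illustration; they are not taken from Height and Weight.txt.

```python
import csv
import io

# Hypothetical stand-in for a delimited text file. The column names and
# values are made up, and the tab delimiter is an assumption.
SAMPLE_TEXT = "Height\tWeight\n65.5\t140.2\n70.1\t182.4\n68.0\t155.0\n"


def read_delimited(text, delimiter="\t"):
    """Split delimited text into rows, converting every field to a number."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [{key: float(value) for key, value in row.items()} for row in reader]


rows = read_delimited(SAMPLE_TEXT)
print(len(rows))
print(rows[0])
```

For a real file, `open(path, newline="")` would take the place of `io.StringIO(text)`. If the columns come out fused together, the delimiter guess is wrong, which is the same symptom one sees in the Excel import preview.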

Large data files: Splitting the screen
a) UPLOADING: Use the Gender data. Save the data as a txt file at any location on your computer. Next, open the data in Excel by following the steps from the previous slide.
b) SPLITTING: Click on any cell in the far-left column (suggestion: somewhere halfway, say row 10, column A). Next, on the top menu, click on View and then on Split (see chart).
c) SCROLLING: This operation splits the screen, and on your right you will find two scrollers (see figure) for moving up and down the data list. Scroll the bottom screen to the bottom of the data file.

Uploading a very large data file
Use the data US_CRime. The file contains data related to US crimes by state. A) Save as text. B) Open in Excel. C) Cut and paste the Connecticut data. The result is a new Excel document containing only the data related to Connecticut (see the table to the right).
The first trick has to do with delimitation. The file is in text format, but one has to experiment a bit with the formatting (do not use Space or Tab).
The second trick is about splitting the screen. To extract the Connecticut data, one only needs to highlight the relevant portion and cut and paste it into a new document. Splitting the screen helps a lot here.
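The extraction step amounts to keeping only the rows that belong to one state. A rough Python sketch of that idea follows. The column names, the comma delimiter and the numbers are placeholders invented for illustration; the actual layout of the US_CRime file is not assumed here.

```python
import csv
import io

# Made-up placeholder rows. The real file's columns, values and delimiter
# are not known here; a comma delimiter is assumed.
SAMPLE_TEXT = (
    "State,Year,Count\n"
    "Colorado,2000,1\n"
    "Connecticut,2000,2\n"
    "Connecticut,2001,3\n"
    "Delaware,2000,4\n"
)


def rows_for_state(text, state, delimiter=","):
    """Return only the rows whose State column equals the given state."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [row for row in reader if row["State"] == state]


connecticut = rows_for_state(SAMPLE_TEXT, "Connecticut")
for row in connecticut:
    print(row)
```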

Random Number Generator
Click on the tab labeled Data and then on Data Analysis (top right). If Data Analysis is missing, you need to load the Analysis ToolPak add-in. Next, follow the instructions (see figures) and click the OK button when done. The result should contain a column of 62 random numbers. The meaning and significance of the words Normal, Distribution, and Random Seed will be explained later. The main point for now is that we can create, at will, random data with a desired structure and of a desired size.
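As an illustration of the same idea outside Excel, the Python sketch below produces a column of 62 normally distributed random numbers from a fixed seed. The mean, standard deviation and seed are assumed values, since the dialog settings appear only in the figures. Python's generator differs from Excel's, so the numbers themselves will not match.

```python
import random


def normal_column(count, mean, stdev, seed):
    """Return `count` normally distributed random numbers.

    A fixed seed makes the column reproducible: the same seed always
    yields the same numbers.
    """
    rng = random.Random(seed)
    return [rng.gauss(mean, stdev) for _ in range(count)]


# Assumed settings: 62 numbers, mean 0, standard deviation 1, seed 1.
column = normal_column(count=62, mean=0.0, stdev=1.0, seed=1)
print(len(column))
print(column[:3])
```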

Data Collection: There are various ways data are collected:
Non-random, or census-type, data collection.
Random-sampling data collection.
Census data are typically collected from a site in full, without any random selection mechanism. Example: Take all the students enrolled at a university and analyze the data on the classes they have taken.
Sampling data: Randomly pick a subset of the students at a university and then collect the relevant data.
Given the two data sets uploaded in this lecture, the Height-Weight-Gender data and the US Crime data, answer the following questions: Q1. Which, if any, of the two data files was collected using the Random sampling method? Q2. Which, if any, of the two data files was collected using the Census method?
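The difference between the two collection methods can be sketched in Python. The student identifiers, the population size of 200 and the sample size of 20 are all hypothetical.

```python
import random

# Hypothetical population: 200 enrolled students.
students = [f"student_{i:03d}" for i in range(1, 201)]

# Census: every member of the population is included, with no random selection.
census = list(students)

# Random sample: a subset chosen by a random mechanism, without repeats.
rng = random.Random(7)  # fixed seed only so the example is reproducible
sample = rng.sample(students, k=20)

print(len(census), len(sample))
print(sample[:5])
```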

Q: What is the slope of the line? Q: What is the Y-intercept? Q: For a person who takes more math classes, does the equation suggest that his salary will increase or decrease? Is this expected? Q: Person A has taken X math credits during his college years, while person B has taken 10 more credits than A. What can you conclude regarding their respective salaries? Note: You are not asked what the salary will be; you are asked how much the salary changes when a person has taken 10 more math credits.

Q: What is the approximate average weight of the subjects presented on this chart? Q: What is the approximate lowest height of the subjects presented here? Q: Do the data show a positive trend, a negative trend, or no trend at all? Q: What is the approximate range of the X-axis data? Q: What is the approximate number for the Y-axis?

Q: Which of the two data files has the higher Average? Data A, Data B or Approximately equal? Q: Which of the two data files has the higher Variation? Data A, Data B or Approximately equal?

Q: Which of the two data files has the higher Median? Data A, Data B or Approximately equal? Q: Which of the two data files has the wider Range? Data A, Data B or Approximately equal?

Q: Which of the two data files has the higher Average? Data A, Data B or Approximately equal? Q: Which of the two data files has the higher Variation? Data A, Data B or Approximately equal?

Q: Which of the two data files has the higher Median? Data A, Data B or Approximately equal? Q: Which of the two data files has the wider Range? Data A, Data B or Approximately equal?

Q: Which of the two data files has the higher Average? Data A, Data B or Approximately equal? Q: Which of the two data files has the higher Variation? Data A, Data B or Approximately equal?

Q: Which of the two data files has the higher Median? Data A, Data B or Approximately equal? Q: Which of the two data files has the wider Range? Data A, Data B or Approximately equal? COMMENT: Be careful when you judge the charts! The y-axes were on different scales, which can mislead readers.

In modern statistics, two main numbers describe a data set: the Mean and the Standard Deviation. Imagine that you are given a file containing weekly sales for a few hundred locations your company supervises. The list looks like an endless, uninformative collection of numbers: $4453.9, $5263.9, $899.8, ..., and thousands more just like these. What would be the first thing to do here? What would be the first number to compute in order to describe these sales? Clearly, one would compute the Average, which in statistics we often refer to as the Mean. Imagine, for a moment, that in this fictitious case the average is $3222.5. Now we have some idea about the data: across hundreds of weeks and hundreds of different shops, the average sale was slightly larger than $3000. This is informative and intuitive. In statistics we refer to the Mean as a center of the data: a number that captures the middle of the data set. Moreover, this number is easy to compute.
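The computation of the Average is just the sum of the values divided by how many there are. A small Python sketch follows, using only the three sale figures quoted in the text. Their average naturally differs from the fictitious $3222.5, which stands for the whole list.

```python
def mean(values):
    """Average of a list of numbers: the sum divided by the count."""
    if not values:
        raise ValueError("the mean of an empty list is undefined")
    return sum(values) / len(values)


# The three sale figures quoted in the text (the full list is not available).
sales = [4453.9, 5263.9, 899.8]
print(round(mean(sales), 2))
```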

As informative as this number is, very often it is not sufficient. Why? A quick glance at the list reveals that the numbers are haphazard and somewhat random: $4453.9, $5263.9, $899.8. The average might be $3222.5, but the second number on our list, $5263.9, is considerably higher than the average, and the third number, $899.8, is just a fraction of the second. And who knows what the other thousands of numbers look like. Thus, we need to characterize this haphazardness, this variation, and for this we use the Standard Deviation. The Standard Deviation, or StDev as we will call it here, is computed by a specific formula designed to capture the variability of the data. The formula is rather involved, and so is the mathematical theory behind it. Nevertheless, modern software has a built-in function that computes this number for us, and in this course we are more concerned with interpretation than with computation. The intuition is as follows: StDev tells us, roughly, how much the data deviate from the average.
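The text does not state which formula it has in mind. The Python sketch below assumes the usual sample standard deviation (squared deviations from the average, summed, divided by n - 1, then square-rooted) and compares it with the built-in `statistics.stdev`, which computes the same quantity.

```python
import math
import statistics


def sample_stdev(values):
    """Sample standard deviation (divides by n - 1; an assumed convention)."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are needed")
    average = sum(values) / n
    # How far each value lies from the average, squared so signs do not cancel.
    squared_deviations = [(x - average) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / (n - 1))


# The three sale figures quoted in the text.
sales = [4453.9, 5263.9, 899.8]
print(sample_stdev(sales))
print(statistics.stdev(sales))  # built-in function, should agree
```

The larger the typical distance of the values from the average, the larger the result, which is the intuition the slide describes.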

For typical data sets the following rules of thumb hold:
About 68% of the data are within the interval [Average-StDev, Average+StDev].
About 95% of the data are within the interval [Average-2*StDev, Average+2*StDev].
This rule is best understood via an example. Imagine that for the above fictitious data set, the Standard Deviation is equal to $1200. The rule of thumb then implies that about 68% of all data on this list are between [Average-StDev, Average+StDev] = [$3222.5-$1200, $3222.5+$1200] = [$2022.5, $4422.5]. Similarly, we can estimate that about 95% of all data are between [$3222.5-$2400, $3222.5+$2400], which is the interval [$822.5, $5622.5]. This alone is rather striking! Without actually counting and comparing the thousands of data points, we can say that it is likely that about 95% of all the sales are between $822.5 and $5622.5!
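The interval arithmetic can be reproduced, and the rule of thumb checked, with a short Python sketch. The simulated sales are an assumption: they are drawn from a bell-shaped (normal) distribution with the fictitious average and StDev, as a stand-in for "typical" data. On such data roughly two thirds of the values should fall within one StDev of the average and roughly 95% within two.

```python
import random


def rule_of_thumb_interval(average, stdev, k):
    """Interval [average - k*stdev, average + k*stdev]."""
    return (average - k * stdev, average + k * stdev)


def fraction_inside(values, interval):
    """Share of the values that fall inside the interval (ends included)."""
    low, high = interval
    return sum(low <= x <= high for x in values) / len(values)


AVERAGE, STDEV = 3222.5, 1200.0  # the fictitious figures from the text
one_sd = rule_of_thumb_interval(AVERAGE, STDEV, 1)
two_sd = rule_of_thumb_interval(AVERAGE, STDEV, 2)
print(one_sd)  # (2022.5, 4422.5)
print(two_sd)  # (822.5, 5622.5)

# Assumption: simulated bell-shaped sales stand in for the real list.
rng = random.Random(0)
simulated = [rng.gauss(AVERAGE, STDEV) for _ in range(10_000)]
print(fraction_inside(simulated, one_sd))
print(fraction_inside(simulated, two_sd))
```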

The Median. This is another statistical tool designed to characterize the center of the data. Its computation is straightforward: given a data set, say 5, 9, 3, 4, 5, 7, 11, one first sorts the data, 3, 4, 5, 5, 7, 9, 11, and then picks the middle point, which in this case is 5. Thus the Median = 5. If we have an even number of data points, we sort them, pick the middle two and average them. Example: 4, 2, 6, 7, 12, 5, 4, 3 after sorting becomes 2, 3, 4, 4, 5, 6, 7, 12, and the Median = (4+5)/2 = 4.5. Why the Median? It turns out that for some data the center is not well characterized by the Average. These are data that exhibit outliers, that is, a few observations that are much larger or much smaller than the majority. A few examples will help.
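The sort-then-pick-the-middle procedure can be written as a short Python function. This is a sketch of the steps described in the text, run on the same two data sets.

```python
def median(values):
    """Middle value of the sorted data; average of the middle two if even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty list is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd count: a single middle point exists.
        return ordered[middle]
    # Even count: average the two middle points.
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median([5, 9, 3, 4, 5, 7, 11]))     # 5
print(median([4, 2, 6, 7, 12, 5, 4, 3]))  # 4.5
```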

Examples:
Annual income: Clearly, the average annual income for a town or a city would be greatly swayed if a billionaire moved into the neighborhood. However, the median income would barely change.
House price: A similar argument applies. Imagine a city where the vast majority of houses, say 95% of them, cost less than $500,000. Clearly, the sale of a 50-million-dollar mansion would inflate the average sale price and give a very distorted picture of house pricing in this city. But the median sale price would hardly be affected.
For these reasons, in the literature, in newspapers and in business reports, house prices and incomes are typically characterized by their medians and not their averages.
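The income example can be made concrete with a small Python sketch. The incomes are made up for illustration. With so few residents the median shifts by a small amount when the billionaire arrives, while the average is changed beyond recognition; in a real town with thousands of residents the shift in the median would be smaller still.

```python
import statistics

# Hypothetical annual incomes for a small neighborhood.
incomes = [42_000, 48_000, 51_000, 55_000, 60_000, 63_000, 70_000]

# The same neighborhood after one billionaire moves in (a single outlier).
with_billionaire = incomes + [1_000_000_000]

for label, data in [("before", incomes), ("after", with_billionaire)]:
    print(label, "mean:", statistics.mean(data), "median:", statistics.median(data))
```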