Math 183 Statistical Methods

Similar documents
Table of Contents (As covered from textbook)

Statistical Methods. Instructor: Lingsong Zhang. Any questions, ask me during the office hour, or me, I will answer promptly.

Chapter 2: Understanding Data Distributions with Tables and Graphs

Course Syllabus - CNT 4703 Design and Implementation of Computer Communication Networks Fall 2011

CSC 111 Introduction to Computer Science (Section C)

1. Textbook #1: Our Digital World (ODW). 2. Textbook #2: Guidelines for Office 2013 (GFO). 3. SNAP: Assessment Software

Further Maths Notes. Common Mistakes. Read the bold words in the exam! Always check data entry. Write equations in terms of variables

Frequency Distributions

1.3 Graphical Summaries of Data

College Algebra. Cartesian Coordinates and Graphs. Dr. Nguyen August 22, Department of Mathematics UK

Measures of Central Tendency

The Normal Distribution & z-scores

Math 2280: Introduction to Differential Equations- Syllabus

Homework Packet Week #3

Chapter 3 - Displaying and Summarizing Quantitative Data

Linear Algebra Math 203 section 003 Fall 2018

Course Syllabus. Course Information

Chapter 1. Looking at Data-Distribution

Chapter 2: Frequency Distributions

Statistics Lecture 6. Looking at data one variable

Lecture 1: Exploratory data analysis

The Normal Distribution & z-scores

MATH& 146 Lesson 10. Section 1.6 Graphing Numerical Data

Chapter 2 Organizing and Graphing Data. 2.1 Organizing and Graphing Qualitative Data

B.2 Measures of Central Tendency and Dispersion

Data Statistics Population. Census Sample Correlation... Statistical & Practical Significance. Qualitative Data Discrete Data Continuous Data

Visualizing Data: Freq. Tables, Histograms

LESSON 3: CENTRAL TENDENCY

ESET 369 Embedded Systems Software, Fall 2017

The Normal Distribution & z-scores

UNIT 1A EXPLORING UNIVARIATE DATA

STA 570 Spring Lecture 5 Tuesday, Feb 1

Bluman & Mayer, Elementary Statistics, A Step by Step Approach, Canadian Edition

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display

+ Statistical Methods in

Measures of Central Tendency. A measure of central tendency is a value used to represent the typical or average value in a data set.

Sections Graphical Displays and Measures of Center. Brian Habing Department of Statistics University of South Carolina.

EXPLORATORY DATA ANALYSIS. Introducing the data

Vocabulary. 5-number summary Rule. Area principle. Bar chart. Boxplot. Categorical data condition. Categorical variable.

Introduction to Computer Systems

Sections 4.3 and 4.4

1. Textbook #1: Our Digital World (ODW). 2. Textbook #2: Guidelines for Office 2013 (GFO). 3. SNAP: Assessment Software

Central Washington University Department of Computer Science Course Syllabus

CS157a Fall 2018 Sec3 Home Page/Syllabus

San José State University Department of Computer Science CS-144, Advanced C++ Programming, Section 1, Fall 2017

General course information

STA 490H1S Initial Examination of Data

CS 200, Section 1, Programming I, Fall 2017 College of Arts & Sciences Syllabus

SYS 6021 Linear Statistical Models

San José State University Department of Computer Science CS049J, Programming in Java, Section 2, Fall, 2016

Chapter 11. Worked-Out Solutions Explorations (p. 585) Chapter 11 Maintaining Mathematical Proficiency (p. 583)

CS 0449 Intro to Systems Software Fall Term: 2181

8. MINITAB COMMANDS WEEK-BY-WEEK

Syllabus CSCI 405 Operating Systems Fall 2018

Part I, Chapters 4 & 5. Data Tables and Data Analysis Statistics and Figures

CLOVIS WEST DIRECTIVE STUDIES P.E INFORMATION SHEET

San Jose State University College of Science Department of Computer Science CS151, Object-Oriented Design, Sections 1, 2, and 3, Spring 2018

AP Statistics Assignments Mr. Kearns José Martí MAST 6-12 Academy

Averages and Variation

ESET 349 Microcontroller Architecture, Fall 2018

Announcements. Unit 4: Inference for Numerical Data Lecture 1: Bootstrapping. Rotten horrors. Data

STP 226 ELEMENTARY STATISTICS NOTES

Introduction to Computer Systems

Computer Science Technology Department

12. A(n) is the number of times an item or number occurs in a data set.

IT 341 Fall 2017 Syllabus. Department of Information Sciences and Technology Volgenau School of Engineering George Mason University

Univariate Statistics Summary

Computer Science Technology Department

Introduction to Databases

Create a bar graph that displays the data from the frequency table in Example 1. See the examples on p Does our graph look different?

VE281 Data Structures and Algorithms. Introduction and Asymptotic Algorithm Analysis

ENEE 457: Computer Systems Security 8/27/18. Lecture 1 Introduction to Computer Systems Security

Basic Statistical Terms and Definitions

Welcome to class! Put your Create Your Own Survey into the inbox. Sign into Edgenuity. Begin to work on the NC-Math I material.

MATH 1070 Introductory Statistics Lecture notes Descriptive Statistics and Graphical Representation

EECE 321: Computer Organization

Office Hours Time: Monday/Wednesday 12:45PM-1:45PM (SB237D) More Information: Office Hours Time: Monday/Thursday 3PM-4PM (SB002)

Announcements. 1. Forms to return today after class:

Course and Contact Information. Course Description. Course Objectives

Acquisition Description Exploration Examination Understanding what data is collected. Characterizing properties of data.

Displaying Distributions - Quantitative Variables

DEPARTMENT OF BUSINESS AND OFFICE ADMINISTRATION COURSE OUTLINE FALL 2017 OA 1145 B2 3( ) Excel and Access, Core 67.5 Hours

Syllabus Revised 08/21/17

DM4U_B P 1 W EEK 1 T UNIT

Chapter 2 - Descriptive Statistics

Syllabus CS476 COMPUTER GRAPHICS Fall 2009

CIS*1500 Introduction to Programming

INF 315E Introduction to Databases School of Information Fall 2015

IT 403 Practice Problems (1-2) Answers

Section 3.2 Measures of Central Tendency MDM4U Jensen

: Dimension. Lecturer: Barwick. Wednesday 03 February 2016

Female Brown Bear Weights

CHAPTER 3: Data Description

View a Students Schedule Through Student Services Trigger:

Quantitative - One Population

San Jose State University College of Science Department of Computer Science CS151, Object-Oriented Design, Sections 1,2 and 3, Spring 2017

Chapter 6: DESCRIPTIVE STATISTICS

COMP251: Algorithms and Data Structures. Jérôme Waldispühl School of Computer Science McGill University

2.1: Frequency Distributions and Their Graphs

San José State University Department of Computer Science CS166, Information Security, Section 1, Fall, 2018

Transcription:

Math 183 Statistical Methods Eddie Aamari S.E.W. Assistant Professor eaamari@ucsd.edu math.ucsd.edu/~eaamari/ AP&M 5880A 1 / 24

Math 183 Statistical Methods Eddie Aamari S.E.W. Assistant Professor eaamari@ucsd.edu math.ucsd.edu/~eaamari/ AP&M 5880A Today: Presentation of the course Chapter 1: Introduction to data 1 / 24

Course Home Instructor s webpage: www.math.ucsd.edu/~eaamari/ Lecture slides Homework sets Course syllabus Provisional course calendar Links to R and RStudio Office hour info and location 2 / 24

How the Course is Graded The one following formula giving you the better result will be used: Formula 1 20% Homework 20% Midterm Exam 1 20% Midterm Exam 2 40% Final Exam Formula 2 20% Homework 20% Best Midterm Exam 60% Final Exam Your worst homework grade will be dropped for computing your final Homework score. No makeup exams. The grading scheme will be curved and scaled to the best student in class. 3 / 24

Homework Homework is due weekly on Friday s lecture. Late assignments will not be accepted. Your worst homework grade will be dropped. Randomly selected problems on the assignment will be graded. Tacit homework: Read the textbook! Homework handed back on Discussion sections. No homework re-grading will be allowed after the section ends. This means that if you come back after you went out the room, your grade is fixed and your homework will not be regraded. Complaints/reclamation during the section will be considered with concern. 4 / 24

Class Calendar Monday Tuesday Wednesday Thursday Friday Week 1 Week 2 Week 3 October 2 Chapter 1 9 Chapter 2 16 Chapter 3 3 4 Chapter 2 10 11 Chapter 2 17 18 Chapter 3 5 Discussion 12 Discussion 19 Discussion 6 Chapter 2 HW 1 due 13 Chapter 3 HW 2 due 20 Midterm Exam I Week 4 Week 5 Week 6 23 Chapter 3 30 Chapter 4 6 Review 24 25 Chapter 3 31 November 1 Chapter 4 7 8 Midterm Exam II 26 Discussion 2 Discussion 9 Discussion 27 Chapter 4 HW 3 due 3 Chapter 4 HW 4 due 10 Veterans Day Week 7 Week 8 13 Chapter 5 20 Chapter 5 14 15 Chapter 5 21 22 Chapter 6 16 Discussion 23 Thanksgiving 17 Chapter 5 HW 5 due 24 Thanksgiving Week 9 27 Chapter 6 28 29 Chapter 6 30 Discussion December 1 HW 6 due Week 10 Final Week 4 Chapter 7 5 6 Chapter 7 7 Discussion 11 12 13 14 Final Exam 11:30am-2:29pm 8 Review HW 7 due 15 5 / 24

Components you Need Textbook: OpenIntro Statistics, Third edition, by Diez, Barr, & Cetinkaya-Rundel Free pdf available online. Software: R and RStudio Open-source statistical programming language and development environment used in data analysis. Calculators: Used on exams and homework Need not be graphics, nor have statistical functions Cannot be your phone or computer (for exams) 6 / 24

Content of this Course The lectures will cover Chapters 1 to 7 of the textbook. Introduction to data Introduction to probability: - Discrete and continuous random variables - Binomial, Poisson and Gaussian distributions - Central limit theorem Data analysis and inferential statistics: - Graphical techniques - Confidence intervals, hypothesis testing - Curve fitting (regression) 7 / 24

Before Carrying On... Any questions so far? 8 / 24

What Does Data Look Like? 9 / 24

What Does Data Look Like? observation (case) variable 9 / 24

Describing a Data Set spam Indicator for whether the email was spam. to_multiple Indicator for whether the email was addressed to more than one recipient. from Whether the message was listed as from anyone (this is usually set by default for regular outgoing email). cc Indicator for whether anyone was CCed. sent_email Indicator for whether the sender had been sent an email in the last 30 days. time Time at which email was sent. image The number of images attached. attach The number of attached files. dollar The number of times a dollar sign or the word dollar appeared in the email. winner Indicates whether winner appeared in the email. inherit The number of times inherit (or an extension, such as inheritance ) appeared in the email. viagra The number of times viagra appeared in the email. password The number of times password appeared in the email. num_char The number of characters in the email, in thousands. line_breaks The number of line breaks in the email (does not count text wrapping). Types of variables: Numeric (= quantitative) Discrete (space between possible values) Continuous (real numbers) Categorical (= qualitative) Nominal (with no natural ordering) Ordinal (with a natural ordering) 10 / 24

Describing a Data Set spam Indicator for whether the email was spam. to_multiple Indicator for whether the email was addressed to more than one recipient. from Whether the message was listed as from anyone (this is usually set by default for regular outgoing email). cc Indicator for whether anyone was CCed. sent_email Indicator for whether the sender had been sent an email in the last 30 days. time Time at which email was sent. image The number of images attached. attach The number of attached files. dollar The number of times a dollar sign or the word dollar appeared in the email. winner Indicates whether winner appeared in the email. inherit The number of times inherit (or an extension, such as inheritance ) appeared in the email. viagra The number of times viagra appeared in the email. password The number of times password appeared in the email. num_char The number of characters in the email, in thousands. line_breaks The number of line breaks in the email (does not count text wrapping). Types of variables: Numeric (= quantitative) Discrete (space between possible values) Continuous (real numbers) Categorical (= qualitative) Nominal (with no natural ordering) Ordinal (with a natural ordering) 10 / 24

Describing a Data Set spam Indicator for whether the email was spam. to_multiple Indicator for whether the email was addressed to more than one recipient. from Whether the message was listed as from anyone (this is usually set by default for regular outgoing email). cc Indicator for whether anyone was CCed. sent_email Indicator for whether the sender had been sent an email in the last 30 days. time Time at which email was sent. image The number of images attached. attach The number of attached files. dollar The number of times a dollar sign or the word dollar appeared in the email. winner Indicates whether winner appeared in the email. inherit The number of times inherit (or an extension, such as inheritance ) appeared in the email. viagra The number of times viagra appeared in the email. password The number of times password appeared in the email. num_char The number of characters in the email, in thousands. line_breaks The number of line breaks in the email (does not count text wrapping). Types of variables: Numeric (= quantitative) Discrete (space between possible values) Continuous (real numbers) Categorical (= qualitative) Nominal (with no natural ordering) Ordinal (with a natural ordering) 10 / 24

Describing a Data Set spam Indicator for whether the email was spam. to_multiple Indicator for whether the email was addressed to more than one recipient. from Whether the message was listed as from anyone (this is usually set by default for regular outgoing email). cc Indicator for whether anyone was CCed. sent_email Indicator for whether the sender had been sent an email in the last 30 days. time Time at which email was sent. image The number of images attached. attach The number of attached files. dollar The number of times a dollar sign or the word dollar appeared in the email. winner Indicates whether winner appeared in the email. inherit The number of times inherit (or an extension, such as inheritance ) appeared in the email. viagra The number of times viagra appeared in the email. password The number of times password appeared in the email. num_char The number of characters in the email, in thousands. line_breaks The number of line breaks in the email (does not count text wrapping). Types of variables: Numeric (= quantitative) Discrete (space between possible values) Continuous (real numbers) Categorical (= qualitative) Nominal (with no natural ordering) Ordinal (with a natural ordering) 10 / 24

Be Careful About Data Types Some variables encoded with numbers are not numeric. - 0/1 for TRUE/FALSE - ZIP codes (92093) Not all numeric variables look like numbers. - Dates (Friday the 13th, 2017) - GPS coordinates (40 26 46 N 79 58 56 W) 11 / 24

Be Careful About Data Types Some variables encoded with numbers are not numeric. - 0/1 for TRUE/FALSE - ZIP codes (92093) Not all numeric variables look like numbers. - Dates (Friday the 13th, 2017) - GPS coordinates (40 26 46 N 79 58 56 W) Data types determine how you analyze them. Tools are specifically suited to on of them. 1 variable 2 variables Categorical Bar chart Contigency table Numeric Histogram Scatterplot Various situations come various visualizations 11 / 24

One Categorical Variable See the "attachment" variable as categorical. 1 > email50 $ attach 2 [ 1 ] 0 0 2 0 0 0 0 0 0 0 0 0 0 2 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 > t a b l e ( email50 $ a t tach ) 4 0 1 2 5 47 1 2 12 / 24

One Categorical Variable See the "attachment" variable as categorical. 1 > email50 $ attach 2 [ 1 ] 0 0 2 0 0 0 0 0 0 0 0 0 0 2 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 > t a b l e ( email50 $ a t tach ) 4 0 1 2 5 47 1 2 1 p i e ( t a b l e ( email50 $ a t tach ) ) 2 b a r p l o t ( t a b l e ( email50 $ attach ) ) 0 1 2 0 10 20 30 40 0 1 2 12 / 24

Two Categorical Variables Contingency table: correlation between frequencies of two categorical variables. 1 > addmargins ( t a b l e ( email50 $ attach, email50 $ to_m u l t i p l e ) ) 2 0 1 Sum 3 0 41 6 47 4 1 1 0 1 5 2 1 1 2 6 Sum 43 7 50 Rows: {0, 1, 2} are the number of attachment Columns: {0, 1} (= {No, Y es}) indicates if the email was sent to several addresses. 13 / 24

Two Categorical Variables Contingency table: correlation between frequencies of two categorical variables. 1 > addmargins ( t a b l e ( email50 $ attach, email50 $ to_m u l t i p l e ) ) 2 0 1 Sum 3 0 41 6 47 4 1 1 0 1 5 2 1 1 2 6 Sum 43 7 50 Rows: {0, 1, 2} are the number of attachment Columns: {0, 1} (= {No, Y es}) indicates if the email was sent to several addresses. Use contingency tables to apprehend 2 categorical variables: % { 1 - sended email with attachment} = 1+1 43 4.65%. % { 1 > - sended email with attachment} = 0+1 7 14.28%. 13 / 24

Two Categorical Variables Contingency table: correlation between frequencies of two categorical variables. 1 > addmargins ( t a b l e ( email50 $ attach, email50 $ to_m u l t i p l e ) ) 2 0 1 Sum 3 0 41 6 47 4 1 1 0 1 5 2 1 1 2 6 Sum 43 7 50 Rows: {0, 1, 2} are the number of attachment Columns: {0, 1} (= {No, Y es}) indicates if the email was sent to several addresses. Use contingency tables to apprehend 2 categorical variables: % { 1 - sended email with attachment} = 1+1 43 4.65%. % { 1 > - sended email with attachment} = 0+1 7 14.28%. Be careful with sample size! (We only have one email user.) 13 / 24

Numerical Data: Histograms 1.0 0e+00 1e+05 2e+05 3e+05 4e+05 Annual Income 14 / 24

Numerical Data: Histograms 1.0 0e+00 1e+05 2e+05 3e+05 4e+05 Annual Income Group points into bins to get an histogram (R function hist). Frequency 0 400 800 1400 0e+00 1e+05 2e+05 3e+05 4e+05 Annual Income 14 / 24

Numerical Data: Histograms The choice of bin size influences crudely the histogram plot. 10 bins 100 bins 1000 bins Frequency 0 400 800 1200 Frequency 0 200 400 600 800 Frequency 0 200 400 600 0e+00 2e+05 4e+05 Annual Income 0e+00 2e+05 4e+05 Annual Income 0e+00 2e+05 4e+05 Annual Income By default, R tries its best to display an informative histogram. 15 / 24

0.0 0.2 0.4 0.6 0.8 1.0 10 5 0 5 0.0 0.2 0.4 0.6 0.8 1.0 10 0 10 20 Numerical Data Vocabulary: Modes We say an histogram has a mode when it is peaked somewhere. 0 20 40 60 80 100 120 0 200 400 600 (a) No mode (b) Unimodal (1 mode) 0 500 1000 1500 2000 2500 3000 3500 0 1000 2000 3000 4000 5000 6000 7000 (c) Bimodal (2 modes) (d) Multimodal (> 2) 16 / 24

0.0 0.2 0.4 0.6 0.8 1.0 10 0 10 20 10 5 0 5 10 Numerical Data Vocabulary: Symmetry An histogram is symmetric if both sides of mode look the same. 0 200 400 600 0 5000 10000 15000 20000 (a) Symmetric (b) Symmetric 0 1000 2000 3000 4000 5000 6000 7000 (c) Not symmetric 17 / 24

Numerical Data Vocabulary: Tails The tails of an histogram are the parts away from the center. Tails 18 / 24

0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 Numerical Data Vocabulary: Skewness When an histogram is not symmetric, we can describe further its asymmetry by saying it is Skewed left: if the left tail is longer than the right tail. Skewed right: if the right tail is longer than the left tail. 0 200 400 600 800 1000 0 5000 10000 15000 20000 (a) Skewed left (b) Skewed right Skewed left/right = the left/right tail stretches out longer. 19 / 24

Numerical Data Vocabulary: Outlier An outlier is an observation that appears extreme relative to the rest of the data. (= Not conventional) Outlier Examples: Extreme values in precision measurements for astrophysics Trolls answers in online questionnaires Sometimes outliers are informative, sometimes just annoying. 20 / 24

Practice Describe this histogram. Frequency 0 40 80 120 0e+00 1e+05 2e+05 3e+05 4e+05 Monthly Income 21 / 24

Practice Describe this histogram. Frequency 0 40 80 120 0e+00 1e+05 2e+05 3e+05 4e+05 Monthly Income Unimodal, (strongly) skewed right, some outliers. 21 / 24

Practice Describe this histogram. Heights of NBA players from the 2008 9 season Frequency 0 20 40 60 80 1.8 1.9 2.0 2.1 2.2 2.3 Height (in m) 22 / 24

Practice Describe this histogram. Heights of NBA players from the 2008 9 season Frequency 0 20 40 60 80 1.8 1.9 2.0 2.1 2.2 2.3 Height (in m) Unimodal, skewed left, no outlier. 22 / 24

Practice Describe these histograms. 23 / 24

Practice Describe these histograms. Multimodal, skewed right (up), no outlier. 23 / 24

What you Should Do After the Lecture Read Chapter 1 Start Homework 1! Turn in the following exercises of Chapter 1 in the Textbook: 1.6 (Stealers, study components) 1.14 (Cats on Youtube) 1.38 (Mammal life span) 1.50 (Mix-and-match) 1.58 (Exam scores) 1.66 (Views on immigration) Due date is Friday 6th, October on lecture. Out of the 6 exercises here, 4 will be randomly chosen to be graded. Make sure you have written down your full name, the PID and the TA s name of your section. 24 / 24