CS CS 5623 Simulation Techniques

Similar documents
LAB #2: SAMPLING, SAMPLING DISTRIBUTIONS, AND THE CLT

Chapter 3. Bootstrap. 3.1 Introduction. 3.2 The general idea

Computer exercise 1 Introduction to Matlab.

A Quick Introduction to R

Objectives. 1 Basic Calculations. 2 Matrix Algebra. Physical Sciences 12a Lab 0 Spring 2016

Constructing Statistical Tolerance Limits for Non-Normal Data. Presented by Dr. Neil W. Polhemus

Lecture 11: Distributions as Models October 2014

Fathom Dynamic Data TM Version 2 Specifications

Page 1. Graphical and Numerical Statistics

Corso di Identificazione dei Modelli e Analisi dei Dati

CREATING THE DISTRIBUTION ANALYSIS

CHAPTER 6. The Normal Probability Distribution

SF1901 Probability Theory and Statistics: Autumn 2016 Lab 0 for TCOMK

Statistics I 2011/2012 Notes about the third Computer Class: Simulation of samples and goodness of fit; Central Limit Theorem; Confidence intervals.

10.4 Linear interpolation method Newton s method

You will begin by exploring the locations of the long term care facilities in Massachusetts using descriptive statistics.

Chapter 6: Simulation Using Spread-Sheets (Excel)

Probability Models.S4 Simulating Random Variables

(1) Generate 1000 samples of length 1000 drawn from the uniform distribution on the interval [0, 1].

Computing With R Handout 1

MATLAB TUTORIAL WORKSHEET

MAT 142 College Mathematics. Module ST. Statistics. Terri Miller revised July 14, 2015

Getting to Know Your Data

Chapter 6: DESCRIPTIVE STATISTICS

Probabilistic Models of Software Function Point Elements

What s Normal Anyway?

Stat 302 Statistical Software and Its Applications SAS: Distributions

Practice in R. 1 Sivan s practice. 2 Hetroskadasticity. January 28, (pdf version)

Bi 1x Spring 2014: Plotting and linear regression

A Brief Introduction to MATLAB

Matlab Introduction. Scalar Variables and Arithmetic Operators

Getting Started. Chapter 1. How to Get Matlab. 1.1 Before We Begin Matlab to Accompany Lay s Linear Algebra Text

APPM 2460 PLOTTING IN MATLAB

IT 403 Practice Problems (1-2) Answers

Week 4: Describing data and estimation

StatsMate. User Guide

R Programming Basics - Useful Builtin Functions for Statistics

Use of Extreme Value Statistics in Modeling Biometric Systems

1 Introduction to Matlab

Matlab Tutorial: Basics

Introduction to MATLAB Practical 1

Ch. 1.4 Histograms & Stem-&-Leaf Plots

Introduction to Fitting ASCII Data with Errors: Single Component Source Models

EXCEL SKILLS. Selecting Cells: Step 1: Click and drag to select the cells you want.

CHAPTER 3: Data Description

Clustering Images. John Burkardt (ARC/ICAM) Virginia Tech... Math/CS 4414:

Chapter 5snow year.notebook March 15, 2018

Frequency Distributions

Grace days can not be used for this assignment

Excel 2010 with XLSTAT

Hello Earth! A grounded introduction to Matlab. Frederik J Simons. Christopher Harig. Adam C. Maloof Princeton University

INSTRUCTIONS FOR USING MICROSOFT EXCEL PERFORMING DESCRIPTIVE AND INFERENTIAL STATISTICS AND GRAPHING

Week 7: The normal distribution and sample means

AMS 27L LAB #2 Winter 2009

Introduction to Matlab to Accompany Linear Algebra. Douglas Hundley Department of Mathematics and Statistics Whitman College

CHAPTER 1. Introduction. Statistics: Statistics is the science of collecting, organizing, analyzing, presenting and interpreting data.

Data and Function Plotting with MATLAB (Linux-10)

Introduction to RStudio

Econ 3790: Business and Economics Statistics. Instructor: Yogesh Uppal

An Introduction to Numerical Methods

Applied Calculus. Lab 1: An Introduction to R

Prepare a stem-and-leaf graph for the following data. In your final display, you should arrange the leaves for each stem in increasing order.

Today. Lecture 4: Last time. The EM algorithm. We examine clustering in a little more detail; we went over it a somewhat quickly last time

Machine Learning for Pre-emptive Identification of Performance Problems in UNIX Servers Helen Cunningham

Density Curve (p52) Density curve is a curve that - is always on or above the horizontal axis.

Introduction to Matlab

1 More configuration model

Stat 528 (Autumn 2008) Density Curves and the Normal Distribution. Measures of center and spread. Features of the normal distribution

Stokes Modelling Workshop

1 Pencil and Paper stuff

CSCI 204 Introduction to Computer Science II

ENGR Fall Exam 1

How to learn MATLAB? Some predefined variables

Instructions for Using ABCalc James Alan Fox Northeastern University Updated: August 2009

A Short Introduction to R

Student Version 8 AVERILL M. LAW & ASSOCIATES

CSC D84 Assignment 2 Game Trees and Mini-Max

Introduction to MATLAB: Graphics

Exploring Fractals through Geometry and Algebra. Kelly Deckelman Ben Eggleston Laura Mckenzie Patricia Parker-Davis Deanna Voss

Optimization and Simulation

Install RStudio from - use the standard installation.

Introduction to Matlab

CHAPTER 2 DESCRIPTIVE STATISTICS

9.8 Rockin the Residuals

Statistics for HEP Hands-on Tutorial #2. Nicolas Berger (LAPP)

CHAPTER 2: Describing Location in a Distribution

CDA6530: Performance Models of Computers and Networks. Chapter 8: Statistical Simulation --- Discrete-Time Simulation

Math 14 Lecture Notes Ch. 6.1

Tutorial 3: Probability & Distributions Johannes Karreth RPOS 517, Day 3

A very brief Matlab introduction

Introduction to MATLAB

Box-Cox Transformation

Introduction to MatLab. Introduction to MatLab K. Craig 1

Chapter 2: The Normal Distributions

Machine Learning for Software Engineering

INTRODUCTORY NOTES ON MATLAB

The first few questions on this worksheet will deal with measures of central tendency. These data types tell us where the center of the data set lies.

On the Parameter Estimation of the Generalized Exponential Distribution Under Progressive Type-I Interval Censoring Scheme

COMPUTER SCIENCE LARGE PRACTICAL.

Naïve Bayes Classification. Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others

Transcription:

CS 4633 - CS 5623 Simulation Techniques How to model data using matlab Instructor Dr. Turgay Korkmaz This tutorial along with matlab s statistical toolbox manual will be useful for HW 5. Data collection I got npd-routes-.tar.gz data from http://ita.ee.lbl.gov/html/contrib/npd-routes.html Basically, they collected data using traceroute. After un-tar ing the data on a linux machine, I used the followings to extract the round-trip-time that I want to model > cat r*/* awk {if (NF ==8) print $, $3, $5, $7 } > tall The file tall has the following format hopcount time time2 time3 where time, time2, and time3 denote the round-trip-times between the source and the router which is hopcount hops away from the source node. After cleaning some invalid rows manually, I selected the rows with hopcount=5 and saved them into t5 using > cat tall awk {if (($==5)) print $, $2, $3, $4 } > t5 Now we have some data to analyze! Start Matlab and do the followings: We can load our data in matlab as follows >> load t5 Assume that we consider only the time data which can be accessed using t5(:,2) We can compute sample mean, variance, median, min, max for time data >> min(t5(:,2)) >> max(t5(:,2)) >> median(t5(:,2)) >> mean(t5(:,2)) >> var(t5(:,2)) To get help about any matlab function use >> help func-name >> help mean

2 Identify the prob dist family In this part, we need to have a frequency or histogram diagram. For this, matlab has a function called hist which (by default) puts data points into equally spaced intervals. hist(t5(:,2)); ylabel( frequency ); xlabel( intervals: values of data points ); These three commands will give the histogram in Figure. This histogram shows that the distribution of 25 2 5 frequency 5 5 5 2 25 intervals: values of data points Figure : Histogram within intervals (default). our data looks like an exponential distribution, right! But this will not be a good fit because default hist hides so much valuable data. Remember the recommended number of intervals is n, where n is the number of data points. To determine the number of data points and the number of intervals, we can use the followings in matlab: n=size(t5(:,2)); ninterval=ceil(sqrt(n())); Then, we can draw the histogram using hist(t5(:,2),ninterval); These three commands will give the histogram in Figure 2. Now if you run disttool in matlab and consider various distributions and their pdfs, you will see that histogram looks like gamma (Weibull also seems good), as shown in Figure 3. Accordingly, we can select 2

8 7 6 5 4 3 2 5 5 2 25 Figure 2: Histogram within n intervals (recommended)..2.5..5 2 3 4 5 6 Figure 3: The disttool showing the pdf of gamma distribution. 3

family of gamma dist in this step as the best possible model for our data. Actually, matlab also provides another nice function to make the first guess accurately. For example, we can also draw the empirical pdf to have better idea in deciding which distribution family to choose. For this, we can use ksdensity function as follows. [f,x, u ]= ksdensity(t5(:,2)); plot(x,f, r ) hold on [f,x ] = ksdensity(t5(:,2), width,2*u); plot(x,f, b ) [f,x ] = ksdensity(t5(:,2), width,4*u); plot(x,f, g ) [f,x ] = ksdensity(t5(:,2), width,6*u); plot(x,f, y ) This function produces the smoothed forms of the empirical distribution, as shown in Figure 4. This figure 8 x 3 7 6 5 4 3 2 5 5 5 2 25 Figure 4: Empirical distribution. also shows that the empirical pdf of our data looks like gamma distribution, right!. So, we assume that our data has gamma distribution. 4

3 Determine parameters for the chosen distribution Matlab has several functions to estimate parameters of several distributions and the 95% confidence intervals for those estimations. For our data, if we chose exp, we would use [muhat, cihat]=expfit(t5(:,2)) But, since we selected gamma, we will use [phat, pcihat]=gamfit(t5(:,2)) phat = 2.757 87.6674 % means A and B of gamma distribution, respectively. pcihat = 2.682 84.938 2.2833 9.43 So, we assume that our data has gamma distribution with A(θ in our book)=phat()=2.757 and B(β in our book)=phat(2)=87.6674). Theoretical pdf of gamma distribution with these parameters can be drawn as follows x=:.:25; y = gampdf(x,phat(), phat(2)); plot(x,y) As a result, we get the pdf of gamma with the given parameters as shown in Figure 5. 4 x 3 3.5 3 2.5 2.5.5 5 5 2 25 Figure 5: Theoretical pdf of gamma distribution. 5

4 Evaluate how good the chosen distribution In matlab, we can do several graphical comparisons between the theoretically assumed distribution and the empirical distribution. So let s first do graphical comparisons. We can then talk about formal statistical tests. 4. Graphical (heuristic) tests Using matlab, we can draw the theoretical pdf and cdf of gamma dist with estimated parameters A=phat() and B=phat(2). We can also draw empirical pdf and cdf. We can then compare them, respectively. To compare pdfs, we can use [f,x,u ]= ksdensity(t5(:,2)); [f,x ]= ksdensity(t5(:,2), width,5*u); plot(x,f, r ) hold on x=:.:2; y = gampdf(x,phat(), phat(2)); plot(x,y) As a result, we will get Figure 6. 4 x 3 3.5 3 2.5 2.5.5 5 5 5 2 25 Figure 6: Compare theoretical and empirical pdfs in case of gamma distribution. To compare cdfs, we can use hold off [F,x]=ecdf(t5(:,2)); 6

plot(x,f, r ); hold on x=:.:2; y=gamcdf(x,phat(), phat(2)); plot(x,y) As a result, we will get Figure 7..9.8.7.6.5.4.3.2. 5 5 2 25 Figure 7: Compare theoretical and empirical cdfs in case of gamma distribution. How do they look? They seem similar, right! But we cannot trust these similarities. So we need to look at the qqplot, ppplot, scater, to check if the assumed and empirical distributions are really similar To draw qqplot, you need to use followings. clear load t5 n=size(t5(:,2)); ninterval=ceil(sqrt(n())); [phat, pcihat]=gamfit(t5(:,2)) x=[]; a=sort(t5(:,2)); prev=-; j=; for i=:n(), if prev~=a(i) j = j+; 7

x(j) = a(i); prev=a(i); end end for i=:j, y(i)=gaminv(((i-.5)/j), phat(), phat(2)); end plot(x,y,. ); hold on The above procedure might take too much time. Another simple approach could be the following. Simply, generate some random data from the given theoretical gamma distribution and then use the qqplot in matlab to draw the qqplot between the randomly generated data and the empirical data. y = gamrnd(phat(), phat(2),,n()); qqplot(t5(:,2),y); legend( Actual procedure, Simple procedure ); As a result, we get approximately the same qqplot when we use the actual and simple approach, as shown in Figure 8. As shown in qqplot, the assumed gamma is not a good fit at the tail of the distribution while 35 3 Actual procedure Simple procedure 25 2 Y Quantiles 5 5 5 5 5 2 25 X Quantiles Figure 8: qqplot using actual and a simple procedures. it is a good fit in the main part of the distribution. We can also look at the pp-plot. For pp-plot we can use the followings [ef,x]=ecdf(t5(:,2)); 8

tf = gamcdf(x, phat(), phat(2)); plot(ef,tf,. ); As a result, we get the pp-plot in Figure 9..9.8.7.6.5.4.3.2...2.3.4.5.6.7.8.9 Figure 9: pp-plot. pp-plot also shows that gamma is a better fit in the middle of dist than the tail. But still the fit is not exact. At this point, there is no need to use chi-square or KS statistical tests. We need to consider another distribution like Weibull or simply use empirical distribution. 4.2 Goodness-of-fit tests left as an exercise. 5 Learn by doing it Download tall and then do the followings.. For the same data used in this tutorial, consider the Weibull distribution and repeat the above operations. 2. Consider time data when hopcount= and repeat the above operations. 9