Evaluating Robot Systems

Evaluating Robot Systems
November 6, 2008

"There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies. And the other way is to make it so complicated that there are no obvious deficiencies." -- C.A.R. Hoare

Objectives

Understand techniques for evaluating robot control code:
- Theoretical / Analytical
- Empirical

Techniques for Evaluating Robot Control Code

Theoretical / Analytical:
- Proofs of correctness, stability, liveness, etc.
- Computational complexity comparisons

Empirical (this is today's focus):
- Selection of application domain/benchmark
- Selection (or definition) of metrics of evaluation
- Multiple experimental runs, to eliminate the effect of specific starting conditions, environments, parameter settings, etc.
- Average experiments over many trials
- Maintain data on averages and standard deviations
- Number of experiments needed: enough to ensure statistical significance

Empirical Evaluation and Validation of Robot Control Code

Main point: since it is difficult to sufficiently model the robot's environment and interactions, it is better to validate robot control code empirically. However, in empirical evaluations, you must ensure that the performance evaluation is not skewed by limited test case selection. Must generalize over:
- Range of parameter settings
- Range of robot environments (or, at least, across a certain class)
- (Perhaps) numbers of robots
- Etc.

Typical Experimental Evaluation Approach

1. Define important variables/parameters of the system
2. Determine a range of values for variables/parameters to be tested
3. Define metrics to be used for evaluating the system
4. Determine characteristics of the environment which the algorithm is designed to handle
5. Generate a variety of environments within this class to test
   - Best if environments are randomly selected
   - For example, generate obstacle size and location based upon a particular distribution
6. Determine how many experimental runs are needed to generate statistical significance
7. Determine a baseline performance to which the new algorithm will be compared
8. Run multiple experiments, varying the starting position of the robot randomly
9. Collect data over multiple experimental runs and average the data, also maintaining the standard deviation

1. Define important variables/parameters of the system

Want to define any variables/parameters that influence the outcome of the experiment. Example variables/parameters:
- Weights on force vectors
- Robot speed/velocity profile
- Robot sensing range
- Robot communications range
- Communications failure rate
- Other application-specific parameters (e.g., in swarming, have to define the preferred distance between robots)
- Size of the environment
- Etc.

2. Determine a range of values for variables/parameters to be tested

For all the variables/parameters identified in step 1, you now must decide what the range of values should be. For example:
- Number of robots: vary from 1 to 10 (or 100, or 1000), in increments of 1 (or 10, or 100)
- Sensing range: vary from 1 to 10, in increments of 1 (or 2, or 3, ...)
- Communications range: vary from 1 to 10, in increments of 1 (or 2, or 3, ...)
- Communications failure rate: vary from 0 to 100, in increments of 10 (or 20, ...)
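
To make the sweep concrete, here is a minimal Python sketch of enumerating every combination of the ranges above as a full factorial grid. The parameter names and the run_trial() stub are hypothetical placeholders, not from the lecture.

```python
# Hypothetical sketch: sweep a full factorial grid of parameter settings.
import itertools

param_grid = {
    "num_robots":         range(1, 11),        # 1 to 10, increments of 1
    "sensing_range":      range(1, 11),        # 1 to 10, increments of 1
    "comms_range":        range(1, 11),        # 1 to 10, increments of 1
    "comms_failure_rate": range(0, 101, 10),   # 0 to 100%, increments of 10
}

def run_trial(config):
    """Hypothetical stand-in for one experimental run; returns a metric value."""
    raise NotImplementedError

# Enumerate every combination of parameter settings.
names = list(param_grid)
for values in itertools.product(*param_grid.values()):
    config = dict(zip(names, values))
    # result = run_trial(config)   # run the experiment for this setting
```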

3. Define metrics to be used for evaluating system

Metrics are often application-specific. For example, what metric do we want to use here? (video clip)

Formation-keeping: metric = average formation error:

$$\sum_{t=0}^{t_{\max}} \sum_{i \neq \text{leader}} d_i(t)$$
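
As a worked illustration, here is a minimal Python sketch of this metric, assuming the per-robot formation errors d_i(t) are stored in an array indexed by follower and time step. The data is made up.

```python
# Minimal sketch: cumulative formation error over all followers and time steps.
import numpy as np

# d[i, t] = distance of follower i from its desired formation position at time t
d = np.abs(np.random.randn(4, 300))   # 4 followers, 300 time steps (dummy data)

formation_error = d.sum()             # sum over all robots i != leader and all t
print(formation_error)
```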

3. Define metrics to be used for evaluating system (con't.)

Consider the DARPA Grand Challenge application/task (DARPA_.avi movie). What characteristics do we want to measure?

3. Define metrics to be used for evaluating system (con't.)

What metric would you use here? (video clip)

Box pushing: metric = perpendicular distance pushed per unit time: $d(t)/t$

3. Define metrics to be used for evaluating system (con't.)

What about here? (video clip)

3. Define metrics to be used for evaluating system (con't.)

Cooperative target tracking: metric = average number of targets observed (collectively) per unit time (with m robots and n targets):

$$A = \frac{\sum_{t=1}^{T} \sum_{j=1}^{n} g(B(t), j)}{T}$$

where $B(t) = [b_{ij}(t)]_{m \times n}$ such that

$$b_{ij}(t) = \begin{cases} 1 & \text{if robot } i \text{ is observing target } j \text{ at time } t \\ 0 & \text{otherwise} \end{cases}$$

$$g(B(t), j) = \begin{cases} 1 & \text{if } \exists\, i \text{ such that } b_{ij}(t) = 1 \\ 0 & \text{otherwise} \end{cases}$$
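
Here is a minimal Python sketch of computing A, assuming the observation indicators b_ij(t) are stored in a boolean array B of shape (T, m, n). The example data is randomly generated.

```python
# Minimal sketch: average number of targets observed per unit time.
import numpy as np

T, m, n = 100, 3, 5                   # time steps, robots, targets (dummy sizes)
B = (np.random.rand(T, m, n) < 0.2)   # B[t, i, j] = robot i sees target j at t

# g(B(t), j) = 1 if at least one robot observes target j at time t
g = B.any(axis=1)                     # shape (T, n)

A = g.sum() / T                       # average targets observed per unit time
print(A)
```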

Metrics: Application-Dependent vs. Application-Independent

Easy to measure application-dependent characteristics:
- Numbers of objects moved
- Distance traveled
- Number of targets within view
- Time of task completion

Harder to derive application-independent measures:
- Issues of robustness, reliability, adaptivity, etc. are hidden
- E.g., difficulties in adaptivity ==> measured as fewer objects moved

4. Determine characteristics of environment which algorithm is designed to handle

Characteristics of the environment:
- Wide open space?
- Open space with clutter?
- Structured?

4. Determine characteristics of environment which algorithm is designed to handle (con't.)

How can you quantify the complexity of the environment? One measure (Cazals and Sbert, 1997): generate uniformly distributed random rays and compute statistics such as:
- To measure occlusion: average number (and standard deviation) of times rays intersect with objects in the environment
- To measure how large open spaces are: average length (and standard deviation) of uninterrupted lines
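
Below is a rough Python sketch of this kind of ray statistic. As a crude stand-in for Cazals and Sbert's uniformly distributed rays, it casts horizontal random chords and samples densely along each one; the world size and obstacle list are made-up example values.

```python
# Rough sketch: estimate occlusion and open-space statistics with random rays.
import random

WIDTH, HEIGHT = 100.0, 100.0
obstacles = [(20, 20, 15, 10), (60, 50, 20, 20)]   # (x, y, w, h) boxes

def occupied(px, py):
    return any(x <= px <= x + w and y <= py <= y + h
               for (x, y, w, h) in obstacles)

def random_ray():
    """A random chord: a point on the left edge to a point on the right edge."""
    return (0.0, random.uniform(0, HEIGHT)), (WIDTH, random.uniform(0, HEIGHT))

hits, free_runs = [], []
for _ in range(1000):
    (x0, y0), (x1, y1) = random_ray()
    steps = 1000
    seg = ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5 / steps  # step length
    run, crossings, inside = 0.0, 0, False
    for k in range(steps + 1):
        t = k / steps
        occ = occupied(x0 + t * (x1 - x0), y0 + t * (y1 - y0))
        if occ and not inside:        # ray enters an obstacle
            crossings += 1
            free_runs.append(run)     # record the uninterrupted free length
            run = 0.0
        elif not occ:
            run += seg
        inside = occ
    free_runs.append(run)
    hits.append(crossings)

mean_hits = sum(hits) / len(hits)             # occlusion measure
mean_free = sum(free_runs) / len(free_runs)   # open-space measure
print(mean_hits, mean_free)
```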

5. Generate a variety of environments within this class to test

For example, you can randomly generate the locations of obstacles in an a x b environment. To do this:
- Generate random number x in (0, a)
- Generate random number y in (0, b)
- Generate random number L1 in (0, max_obstacle_size)
- Generate random number L2 in (0, max_obstacle_size)
- Add an obstacle with corner at (x, y) of size L1 x L2

Can vary the approach for different obstacle shapes; can add rotation.
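
A minimal Python sketch of this recipe follows; the environment dimensions and max_obstacle_size are example values.

```python
# Minimal sketch: randomly generate axis-aligned box obstacles in an a x b world.
import random

a, b = 100.0, 80.0            # environment width and height (example values)
max_obstacle_size = 10.0

def random_environment(num_obstacles):
    obstacles = []
    for _ in range(num_obstacles):
        x = random.uniform(0, a)                    # corner x in (0, a)
        y = random.uniform(0, b)                    # corner y in (0, b)
        L1 = random.uniform(0, max_obstacle_size)   # obstacle width
        L2 = random.uniform(0, max_obstacle_size)   # obstacle height
        obstacles.append((x, y, L1, L2))            # box with corner at (x, y)
    return obstacles

print(random_environment(5))
```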

6. Determine how many experimental runs are needed to generate statistical significance

Example: (Figure: dot plot of normalized cumulative formation error for Strategies I through IV, on an error axis from 0 to 300.)

Does strategy IV have a statistically different performance than strategy III? What about II and III? III and IV? How do you tell, in general?

6. Determine how many experimental runs are needed to generate statistical significance (con't.)

Obviously, one run is never enough if there is noise in the system: you might be really lucky, or really unlucky, on a single run. What we want to know about is likely performance, so you need to collect sufficient data to have confidence that the results are meaningful (i.e., significant).
- In general, the more data you collect, the more confident you can be in your results.
- In general, the closer the performance of two approaches, or the more variation there is in the performance, the more data you'll need to determine if there is a statistical difference.

6. Determine how many experimental runs are needed to generate statistical significance (con't.)

Example #1: we ran 3 experiments with each algorithm (Alg. #1 and Alg. #2), measuring a metric such as time. Are we confident that the algorithms perform differently? Yes! The average performance is very different, and the standard deviation of each performance is small.

Example #2: we again ran 3 experiments with each algorithm. Are we confident that the algorithms perform differently? No! The average performance is somewhat different, but the standard deviation of each performance is large.

6. Determine how many experimental runs are needed to generate statistical significance (con't.)

What if we ran more experiments on Example #2? Are we confident now that the algorithms perform differently? Yes! The larger number of samples shows us that these results are coming from two different distributions.

6. Determine how many experimental runs are needed to generate statistical significance (con't.)

How do we determine the number of runs needed for statistical significance? Use a statistical test, such as Student's t-test. Student's t distribution is the probability distribution that arises when estimating the mean of a normally distributed population from a small sample; here, the standard deviation is unknown and has to be estimated from the data. What we want to do is determine whether the two distributions (from the performance of two different algorithms) are different.

6. Determine how many experimental runs are needed to generate statistical significance (con't.)

Compare the performance of 2 algorithms as follows. Define $\mu_1$ and $\mu_2$ as the means (i.e., averages) of the two distributions, and define a null hypothesis $H_0$:

$H_0$: $\mu_1 = \mu_2$, and there is essentially no difference between the two policies.

Let $\mu_1$ = mean of group 1; $\mu_2$ = mean of group 2; $S_1$ = standard deviation of group 1; $S_2$ = std. dev. of group 2; $n_1$ = # of samples of group 1; $n_2$ = # of samples of group 2. Compute the following:

$$T = \frac{\mu_1 - \mu_2}{\sigma \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \quad \text{where} \quad \sigma = \sqrt{\frac{n_1 S_1^2 + n_2 S_2^2}{n_1 + n_2 - 2}}$$
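
A minimal Python sketch of this computation on made-up sample data follows. Note that numpy's default std() divides by n, matching the $S_1$, $S_2$ in the formula above.

```python
# Minimal sketch: the pooled two-sample T statistic from the slide's formula.
import numpy as np

g1 = np.array([12.1, 11.8, 12.5, 11.9, 12.3])   # metric values, algorithm 1
g2 = np.array([13.0, 12.7, 13.4, 12.9, 13.1])   # metric values, algorithm 2

n1, n2 = len(g1), len(g2)
mu1, mu2 = g1.mean(), g2.mean()
S1, S2 = g1.std(), g2.std()            # population std (divides by n)

sigma = np.sqrt((n1 * S1**2 + n2 * S2**2) / (n1 + n2 - 2))
T = (mu1 - mu2) / (sigma * np.sqrt(1/n1 + 1/n2))
print(T)
```

For equal-variance samples this gives the same statistic as scipy.stats.ttest_ind(g1, g2), which pools the variances by default.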

6. Determine how many experimental runs are needed to generate statistical significance (con't.)

Then, on the basis of a two-tailed test at a 0.01 level of significance, we would reject $H_0$ if $T$ were outside the range $-t_{.995}$ to $t_{.995}$. To calculate the values of $-t_{.995}$ and $t_{.995}$, we need to know the number of degrees of freedom. Example: for $n_1 + n_2 - 2 = 250 + 250 - 2 = 498$ degrees of freedom, we look up these values in a statistical table: the range $-t_{.995}$ to $t_{.995}$ is $-2.58$ to $2.58$. If our calculated $T$ is outside this range, then we declare the two algorithms different. Repeat this process for each pair of algorithms, if you have more than two.
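
Rather than a printed table, the critical value can be computed directly; here is a minimal sketch using scipy (for 498 degrees of freedom it gives about 2.586, which rounds to the 2.58 above).

```python
# Minimal sketch: two-tailed critical value at the 0.01 significance level.
from scipy import stats

dof = 250 + 250 - 2                 # 498 degrees of freedom
t_crit = stats.t.ppf(0.995, dof)    # upper critical value, about 2.59
print(t_crit)

# Reject H0 if |T| > t_crit, i.e., T falls outside (-t_crit, +t_crit).
```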

6. Determine how many experimental runs are needed to generate statistical significance (con't.)

When plotting averaged data, you need to show error bars, which show the standard deviation. The error bars provide information on the variation in performance. Example: (Figure: plot of averaged data with standard-deviation error bars.)
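
Here is a minimal matplotlib sketch of such a plot, with error bars showing one standard deviation; the data arrays are made-up example values.

```python
# Minimal sketch: averaged metric values with standard-deviation error bars.
import numpy as np
import matplotlib.pyplot as plt

num_robots = np.array([1, 2, 4, 8])
mean_time  = np.array([95.0, 60.0, 42.0, 35.0])   # averaged metric per setting
std_time   = np.array([8.0, 6.5, 5.0, 4.8])       # standard deviation per setting

plt.errorbar(num_robots, mean_time, yerr=std_time, capsize=4, marker="o")
plt.xlabel("Number of robots")
plt.ylabel("Time to complete task")
plt.show()
```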

7. Determine a baseline performance to which the new algorithm will be compared

When developing a new algorithm, how do we assess the quality of its performance? We may know actual values of metrics (e.g., time), but how do we know if those answers are good or bad? Typically: use a baseline algorithm against which to compare your new results. A baseline algorithm is an "obvious choice" algorithm: the most reasonable alternative approach that you can think of. For example:
- Established alternative approaches in the field (e.g., by others who have solved similar problems in the past)
- A pure random approach (in cases where we are the first to address this problem)
- A pure greedy approach (i.e., where the robot does what looks best at the moment)
- Others

8. Run multiple experiments, and 9. Collect data over multiple experimental runs

Here's where you vary all the parameters that matter, the environment features, the robot starting position, etc. Collect data according to your metrics, analyze whether the results are significant, and continue until you get statistically significant results.

Remember: Key Points of Empirical Robot Control Evaluation

1. Define important variables/parameters of the system
2. Determine a range of values for variables/parameters to be tested
3. Define metrics to be used for evaluating the system
4. Determine characteristics of the environment which the algorithm is designed to handle
5. Generate a variety of environments within this class to test
6. Determine how many experimental runs are needed to generate statistical significance
7. Determine a baseline performance to which the new algorithm will be compared
8. Run multiple experiments, varying the starting position of the robot randomly
9. Collect data over multiple experimental runs and average the data, also maintaining the standard deviation