Applied Data Mining (Mineração de Dados Aplicada)


Data Exploration
August 9th, 2017
DCC ICEx UFMG

Summary of the last session

Data mining
- Data mining is an empirical discipline.
- It can be seen as a generalization of querying.
- It lacks a unified theory.
- It implies trade-offs between quality and computational complexity.
- It should always be practiced in an ethical way (privacy concerns).

The pattern discovery process
- The pattern discovery process is iterative.
- It is interactive, hence statistics and visualization techniques, on both the data and the patterns, are essential.
- It involves pre-processing steps to get a dataset that:
  - is relevant to the analysis;
  - can be processed by the chosen data mining algorithms.

Outline
1 Data structure
2 Data exploration and completion
3 Analyzing one single attribute

1 Data structure

Classical data structure
Most datasets can be represented as tables that describe objects (the rows, whose order is meaningless) with attributes (the columns, whose order is meaningless):

        a_1      a_2      ...   a_n
o_1     d_{1,1}  d_{1,2}  ...   d_{1,n}
o_2     d_{2,1}  d_{2,2}  ...   d_{2,n}
...     ...      ...      ...   ...
o_m     d_{m,1}  d_{m,2}  ...   d_{m,n}

m is called the size of the dataset; n is its dimensionality.
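As an illustration of this table view, here is a minimal Python sketch. The attribute names and values are invented for the example; only the notions of size m, dimensionality n and cell d_{i,j} come from the slide.

```python
# Hypothetical toy dataset: each row is an object, each column an attribute.
attributes = ["age", "city", "income"]      # a_1 ... a_n
objects = [
    (34, "Belo Horizonte", 4200.0),         # o_1
    (51, "Recife", 3100.0),                 # o_2
    (27, "Manaus", 2800.0),                 # o_3
    (45, "Curitiba", 5000.0),               # o_4
]

m = len(objects)       # size of the dataset (number of objects)
n = len(attributes)    # dimensionality (number of attributes)


def value(i, j):
    """Return d_{i,j}, using the 1-based indices of the table above."""
    return objects[i - 1][j - 1]


print(m, n)            # 4 3
print(value(2, 3))     # 3100.0
```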

Data streams
A dataset with an infinite size is a data stream. The complexity of an algorithm processing it cannot depend on the number of objects seen so far.
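One way to meet this constraint, sketched below, is to keep a constant-size summary that is updated once per incoming object. The sketch uses Welford's online algorithm for the mean and variance; the slides do not name a specific technique, so this choice and the class name `RunningStats` are assumptions made for illustration.

```python
class RunningStats:
    """Welford's online algorithm: O(1) memory and O(1) time per object."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        """Fold one new value into the summary without storing it."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    def variance(self):
        """Sample variance (n - 1 denominator); needs at least two values."""
        if self.count < 2:
            raise ValueError("variance needs at least two values")
        return self._m2 / (self.count - 1)


stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # stands in for a stream
    stats.update(x)

print(stats.mean)        # 5.0
print(stats.variance())  # about 4.571 (32 / 7)
```

The cost of `update` is the same for the first object and for the billionth, which is the property the slide asks for.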

The dataset is a sample
A dataset with a finite size is seen as a sample, i.e., not all objects of study (usually infinitely many) are in the dataset.

It is almost always assumed that, in the sample, the values taken by an attribute are independent and identically distributed and that they follow a known distribution whose parameters are estimated from the sample.

It is important to understand whether (or to what extent) this assumption holds. If it does not, many analyses do not apply.
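As a small sketch of "parameters estimated from the sample", the following assumes, purely for illustration, that the attribute follows a normal distribution and uses `statistics.NormalDist.from_samples` from the Python standard library. The sample values are invented.

```python
from statistics import NormalDist

# Hypothetical sample of one attribute, assumed i.i.d. and normally distributed.
sample = [4.9, 5.1, 5.0, 4.8, 5.2, 5.0]

# Estimate the two parameters of the assumed distribution from the sample.
fitted = NormalDist.from_samples(sample)

print(fitted.mean)   # estimated mean, about 5.0
print(fitted.stdev)  # estimated standard deviation, about 0.141
```

If the normality or independence assumption does not hold for the attribute, the fitted parameters still get computed but say little about the objects outside the sample.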

More structured datasets
Patterns can be searched inside one large word, sequence, or graph (directed or not, weighted or not, labeled or not), in data with timestamps (including sounds), with spatial positions (including pictures), etc.

Studying and using a method that processes such data can be the non-trivial step of your project.

The structured dataset can be broken into components, i.e., objects that are individually described. Attributes can be derived from the initial structure (degree in the graph, distance to a reference object, statistics on neighbors, etc.).
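The idea of deriving attributes from the structure can be sketched as follows: an undirected graph is broken into its vertices, and each vertex gets a degree and a statistic on its neighbors. The edge list and attribute names are hypothetical.

```python
from collections import defaultdict

# Hypothetical undirected graph given as an edge list.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]

# Build the adjacency sets.
neighbors = defaultdict(set)
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)

# One row per vertex (the objects), with attributes derived from the structure.
table = {
    vertex: {
        "degree": len(adjacent),
        "mean_neighbor_degree": sum(len(neighbors[w]) for w in adjacent) / len(adjacent),
    }
    for vertex, adjacent in neighbors.items()
}

print(table["c"])  # {'degree': 3, 'mean_neighbor_degree': 1.6666666666666667}
```

The resulting table has the classical objects-by-attributes shape, so methods designed for tabular data become applicable, at the price of discarding part of the structure.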

2 Data exploration and completion

Difficulties
Data usually are:
- incomplete;
- inconsistent;
- uncertain/noisy or even plainly wrong;
- with some exceptions.

Never assume the data are perfect. Detect the problems (basic statistics and visualizations help a lot) and understand the limitations of the data generation/acquisition process.

Looking at specific objects
It may be interesting to look at particular objects:
- those taking uncommon or extreme values;
- those that you know particularly well;
- those that everybody knows.

To do so, you can run SQL queries (if the dataset is in a database) or POSIX commands (if it is in text files).
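A minimal sketch of the SQL route, using Python's built-in `sqlite3` module and an in-memory database. The table name `objects`, its columns and its rows are invented for the example.

```python
import sqlite3

# Hypothetical dataset loaded into an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (name TEXT, income REAL)")
conn.executemany(
    "INSERT INTO objects VALUES (?, ?)",
    [("o1", 4200.0), ("o2", 3100.0), ("o3", 2800.0), ("o4", 95000.0), ("o5", 3900.0)],
)

# Objects taking extreme values: the two largest incomes.
top = conn.execute(
    "SELECT name, income FROM objects ORDER BY income DESC LIMIT 2"
).fetchall()

# An object that you know particularly well, looked up by name.
known = conn.execute(
    "SELECT name, income FROM objects WHERE name = ?", ("o3",)
).fetchone()

print(top)    # [('o4', 95000.0), ('o1', 4200.0)]
print(known)  # ('o3', 2800.0)
```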

Using background knowledge
It is essential to understand the application domain and to complement the dataset with additional attributes:
- taken from other sources;
- derived from existing attributes.

Example: spam detection
Given an email, its date and the IP addresses sending and receiving it, the following may help its classification as spam/ham:
- the presence of some words ("viagra", "pr0n", etc.);
- the country of the sender;
- the match/mismatch between the language of the email and that of the two countries;
- whether the day is a holiday;
- the number of emails from the sender in the dataset;
- ...
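Some of the derived attributes from the spam example can be sketched in a few lines. Everything concrete here (the field names, the word list, the holiday calendar, the emails) is hypothetical; a real project would take the holiday calendar and the country of an IP address from external sources.

```python
from collections import Counter
from datetime import date

SUSPICIOUS_WORDS = {"viagra", "pr0n"}   # hypothetical word list
HOLIDAYS = {date(2017, 9, 7)}           # hypothetical holiday calendar

emails = [
    {"sender": "a@example.com", "date": date(2017, 9, 7), "body": "Cheap VIAGRA today"},
    {"sender": "a@example.com", "date": date(2017, 8, 9), "body": "Meeting notes attached"},
    {"sender": "b@example.com", "date": date(2017, 8, 9), "body": "Lunch tomorrow?"},
]

# Attribute derived from the whole dataset: number of emails per sender.
emails_per_sender = Counter(email["sender"] for email in emails)


def derive(email):
    """Return additional attributes derived from the existing ones."""
    words = set(email["body"].lower().split())
    return {
        "has_suspicious_word": bool(words & SUSPICIOUS_WORDS),
        "is_holiday": email["date"] in HOLIDAYS,
        "emails_from_sender": emails_per_sender[email["sender"]],
    }


derived = [derive(email) for email in emails]
print(derived[0])
# {'has_suspicious_word': True, 'is_holiday': True, 'emails_from_sender': 2}
```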

3 Analyzing one single attribute

Typology of an attribute
In 1946, Stanley Smith Stevens proposed to categorize attributes into four types:
- Nominal (Boolean when there are only two categories);
- Ordinal (with a partial or a total order);
- Interval-scaled (differences make sense, but the 0 is arbitrary);
- Ratio-scaled (ratios make sense).

Identifying the type of every attribute is essential. It tells which statistics and data mining algorithms are applicable and which operations are allowed to derive new attributes.

Centrality, dispersion and skewness
Many (but not all) attributes take values that are distributed around one center.

If the attribute is interval-scaled, the dispersion is a measure of how much its values deviate from the center.

If the distribution is asymmetric around the center, it is said to be skewed.

Skewness
[Figure: negative skew and positive skew.]
© 2008 Rodolfo Hermans (from Wikimedia Commons). These diagrams are licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License.

Basic statistics
Estimating from the sample the center and the dispersion of the values taken by an attribute is very useful. The applicable statistics depend on the type of the attribute:
- nominal: the mode, i.e., the most common value;
- ordinal: the mode, the median (50% of the values are smaller, 50% greater), min and max;
- interval-scaled: all the above plus the arithmetic mean, the range (max − min) and the standard deviation;
- ratio-scaled: all the above plus the geometric and harmonic means (for rates), the studentized range (difference of the z-scores of the largest and the smallest values), the coefficient of variation (ratio of the standard deviation to the mean), etc.
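The list above can be tried out with Python's standard `statistics` module. This is only a sketch: the four toy attributes are invented, and each is assigned a type by assumption.

```python
import statistics as st

colors = ["red", "blue", "red", "green"]   # nominal
grades = [1, 2, 2, 3, 5]                   # ordinal (ranks with a total order)
temps_c = [18.0, 20.0, 22.0, 24.0]         # interval-scaled (degrees Celsius)
speeds = [40.0, 60.0]                      # ratio-scaled rates (km/h)

# Nominal: only the mode is meaningful.
print(st.mode(colors))                     # red

# Ordinal: mode, median, min and max.
print(st.median(grades), min(grades), max(grades))   # 2 1 5

# Interval-scaled: arithmetic mean, range and standard deviation as well.
mean_temp = st.mean(temps_c)
temp_range = max(temps_c) - min(temps_c)
temp_stdev = st.stdev(temps_c)             # sample standard deviation
print(mean_temp, temp_range)               # 21.0 6.0

# Ratio-scaled: harmonic mean (suited to rates) and coefficient of variation.
harmonic_speed = st.harmonic_mean(speeds)  # about 48.0
cv_speed = st.stdev(speeds) / st.mean(speeds)
```

The coefficient of variation is not computed for `temps_c` on purpose: with an arbitrary 0, dividing by the mean has no meaning.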

Robustness of a statistic
A statistic is robust if extreme values (called outliers) have little or no effect on it. A non-robust statistic computed from a sample may be very different from the one that would be obtained from all objects.

The median is a robust statistic of centrality, whereas the arithmetic mean is not. The trimmed mean is the arithmetic mean computed after discarding the most extreme values (a small fraction to be chosen). It is a robust statistic.

The range is not a robust statistic of dispersion. The interquartile range (IQR) is the range of the middle 50% of the values. It is a robust statistic.
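The effect of a single outlier on these statistics can be checked with a short sketch. The helper names `trimmed_mean` and `iqr` and the data are assumptions for the example; the quartiles come from `statistics.quantiles`, whose default interpolation method is only one of several conventions.

```python
import statistics as st


def trimmed_mean(values, fraction):
    """Arithmetic mean after discarding `fraction` of the values at each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)          # number of values cut per end
    return st.mean(ordered[k:len(ordered) - k])


def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = st.quantiles(values, n=4)
    return q3 - q1


clean = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
dirty = clean[:-1] + [1000]                   # same data with one outlier

# Non-robust statistics change a lot.
print(st.mean(clean), st.mean(dirty))                            # 14.5 112.6
print(max(clean) - min(clean), max(dirty) - min(dirty))          # 9 990

# Robust statistics are unchanged on this example.
print(st.median(clean), st.median(dirty))                        # 14.5 14.5
print(trimmed_mean(clean, 0.1), trimmed_mean(dirty, 0.1))        # 14.5 14.5
print(iqr(clean) == iqr(dirty))                                  # True
```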

Estimating the skewness
Several statistics aim to measure the skewness of an interval-scaled attribute. The most common ones are Pearson's skewness statistics:
- $(\text{mean} - \text{mode}) / \text{standard deviation}$;
- $(\text{mean} - \text{median}) / \text{standard deviation}$;
- Pearson's moment coefficient of skewness (a more complicated formula with better statistical foundations).
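The three statistics can be sketched as follows. The function names are invented for the example. The moment coefficient is written in its population form (third central moment divided by the second central moment raised to the power 1.5), which is an assumption since the slides do not spell the formula out. Note also that many textbooks multiply the median-based statistic by 3; the version below follows the slide.

```python
import statistics as st


def pearson_mode_skewness(values):
    """(mean - mode) / standard deviation."""
    return (st.mean(values) - st.mode(values)) / st.stdev(values)


def pearson_median_skewness(values):
    """(mean - median) / standard deviation, without the factor 3 some texts use."""
    return (st.mean(values) - st.median(values)) / st.stdev(values)


def moment_skewness(values):
    """Moment coefficient of skewness, population form: m3 / m2 ** 1.5."""
    n = len(values)
    mu = st.fmean(values)
    m2 = sum((x - mu) ** 2 for x in values) / n   # second central moment
    m3 = sum((x - mu) ** 3 for x in values) / n   # third central moment
    return m3 / m2 ** 1.5


right_tailed = [1, 2, 2, 2, 3, 3, 4, 10]   # long tail towards large values
symmetric = [1, 2, 3, 4, 5]

# All three are positive for a right-tailed (positively skewed) sample.
print(pearson_mode_skewness(right_tailed) > 0)     # True
print(pearson_median_skewness(right_tailed) > 0)   # True
print(moment_skewness(right_tailed) > 0)           # True
```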

Basic visualizations
A histogram graphically represents the distribution of the values taken by an attribute (whatever its type). It requires partitioning the domain into intervals, which are the bases of rectangles whose areas are proportional to the number of values in the interval. Do not use pie charts.

A box plot provides a simpler visualization of the distribution of an interval-scaled attribute. It shows the boundaries of the four quartiles, and either the min/max or the values 1.5 × IQR below the first quartile and above the third quartile, together with the individual values exceeding those thresholds. Those values are outliers (but their definition can be tuned by modifying the 1.5 coefficient).

Histogram
[Figure: example histogram.]
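The counting step behind a histogram can be sketched without any plotting library. The function below assumes equal-width intervals, which is a common choice but not the only one, and requires at least two distinct values; the data and the function name are invented.

```python
def histogram(values, bins):
    """Partition [min, max] into `bins` equal-width intervals and count values."""
    low, high = min(values), max(values)
    width = (high - low) / bins               # assumes high > low
    counts = [0] * bins
    for x in values:
        # The maximum falls on the last edge, so clamp it into the last bin.
        index = min(int((x - low) / width), bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(bins + 1)]
    return edges, counts


values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
edges, counts = histogram(values, bins=4)

# With equal widths, bar height is proportional to the count, hence to the area.
for left, right, count in zip(edges, edges[1:], counts):
    print(f"[{left:.0f}, {right:.0f}) {'#' * count}")
```

The choice of the number of intervals changes the picture considerably, so it is worth trying several values of `bins`.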

Box plot
[Figure: example box plot.]
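The numbers that a box plot draws can be computed as in the sketch below. It assumes the common convention where the whiskers stop at the most extreme values lying within `coefficient` times the IQR of the box, and it uses the "inclusive" quartile method of `statistics.quantiles`; plotting libraries may use slightly different quartile conventions. The function name and data are invented.

```python
import statistics as st


def box_plot_summary(values, coefficient=1.5):
    """Return the quartiles, whisker ends and outliers of a box plot."""
    ordered = sorted(values)
    q1, median, q3 = st.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - coefficient * iqr
    high_fence = q3 + coefficient * iqr
    inside = [x for x in ordered if low_fence <= x <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],      # smallest value within the fences
        "whisker_high": inside[-1],    # largest value within the fences
        "outliers": [x for x in ordered if x < low_fence or x > high_fence],
    }


values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
summary = box_plot_summary(values)
print(summary["q1"], summary["median"], summary["q3"])   # 2.25 3.0 4.0
print(summary["outliers"])                                # [9]

# Tuning the coefficient changes what counts as an outlier.
print(box_plot_summary(values, coefficient=5)["outliers"])   # []
```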

Structured attributes
Objects can be described with attributes that are matrices/tensors, words, sequences, graphs (directed or not, weighted or not, labeled or not), data with timestamps or with spatial positions, pictures, sounds, videos, etc.

Studying and using a method that processes such data can be the non-trivial step of your project.

Properties of a structured attribute (its size, counts of patterns in it, dominant color, BPM, etc.) can substitute for it, with loss of information. Metadata (author, creation date, tags, length, etc.) are valuable too.
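As a small sketch of such a substitution, a sequence-valued attribute can be replaced by a few scalar properties. The function name, the chosen properties and the example sequence are hypothetical; `str.count` counts non-overlapping occurrences.

```python
def describe_sequence(sequence, pattern):
    """Replace a sequence by simple properties (with loss of information)."""
    return {
        "length": len(sequence),
        "pattern_count": sequence.count(pattern),   # non-overlapping occurrences
        "distinct_symbols": len(set(sequence)),
    }


properties = describe_sequence("ACGTACGTTTAC", "AC")
print(properties)
# {'length': 12, 'pattern_count': 3, 'distinct_symbols': 4}
```

Two different sequences can share the same properties, which is the loss of information mentioned above.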

License
© 2011–2017. These slides are licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.