Data Exploration
August 9th, 2017
DCC ICEx UFMG
Summary of the last session Data mining
Data mining is an empirical discipline.
It can be seen as a generalization of querying.
It lacks a unified theory.
It implies trade-offs between quality and computational complexity.
It should always be practiced in an ethical way (privacy concerns).
Summary of the last session The pattern discovery process
The pattern discovery process is iterative.
It is interactive, hence statistics and visualization techniques, on both the data and the patterns, are essential.
It involves pre-processing steps to get a dataset that is relevant to the analysis and can be processed by the chosen data mining algorithms.
Outline
1. Data structure
2. Data exploration and completion
3. Analyzing one single attribute
Data structure Classical data structure
Most datasets can be represented as tables that describe objects (the rows, whose order is meaningless) with attributes (the columns, whose order is meaningless):

        a_1      a_2      ...   a_n
o_1     d_1,1    d_1,2    ...   d_1,n
o_2     d_2,1    d_2,2    ...   d_2,n
...     ...      ...      ...   ...
o_m     d_m,1    d_m,2    ...   d_m,n

m is called the size of the dataset; n is its dimensionality.
Data structure Data streams
A dataset with an infinite size is a data stream. The complexity of an algorithm processing it (time per object and memory) cannot depend on the number of objects seen so far.
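As an illustration of that constraint, here is a minimal sketch of a running mean and standard deviation computed with Welford's online method, which keeps only three numbers regardless of how many objects have been seen. The class name `RunningStats` and the sample values are assumptions made for this example, not part of the original slides.

```python
import math


class RunningStats:
    """Running mean and standard deviation in constant memory (Welford's method)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, value):
        """Process one object of the stream in constant time."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    def sample_std(self):
        """Sample standard deviation (n - 1 denominator) of the values seen so far."""
        if self.count < 2:
            return float("nan")
        return math.sqrt(self._m2 / (self.count - 1))


# Hypothetical stream, consumed one value at a time.
stats = RunningStats()
for value in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0):
    stats.update(value)

print(stats.count, stats.mean, stats.sample_std())
```

Each call to `update` does a fixed amount of work, so the cost per object does not grow with the length of the stream.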
Data structure The dataset is a sample
A dataset with a finite size is seen as a sample, i.e., not all objects of study (usually infinitely many) are in the dataset.
It is almost always assumed that, in the sample, the values taken by an attribute are independent and identically distributed and that they follow a known distribution whose parameters are estimated from the sample.
It is important to understand whether (or to what extent) this assumption holds. If it does not, many analyses do not apply.
Data structure More structured datasets
Patterns can be searched for inside one large word, sequence, or graph (directed or not, weighted or not, labeled or not), in data with timestamps (including sounds), with spatial positions (including pictures), etc.
Studying and using a method that processes such data can be the non-trivial step of your project.
The structured dataset can be broken into components, i.e., objects that are individually described. Attributes can be derived from the initial structure (degree in the graph, distance to a reference object, statistics on neighbors, etc.).
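A minimal sketch of that decomposition, assuming a small undirected graph given as an edge list: every vertex becomes one object, described by two derived attributes (its degree and the mean degree of its neighbors). The graph and the attribute names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical undirected graph given as a list of edges.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]

# Adjacency sets: each edge is recorded in both directions.
neighbors = defaultdict(set)
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)

# One row per vertex: the graph is broken into individually described objects.
table = {}
for vertex in sorted(neighbors):
    degree = len(neighbors[vertex])
    mean_neighbor_degree = sum(len(neighbors[n]) for n in neighbors[vertex]) / degree
    table[vertex] = {"degree": degree, "mean_neighbor_degree": mean_neighbor_degree}

for vertex, row in table.items():
    print(vertex, row)
```

The resulting table has the classical objects-by-attributes shape, so any method designed for tabular data can then be applied to it.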
Data exploration and completion Difficulties
Data usually are: incomplete; inconsistent; uncertain/noisy or even plainly wrong; with some exceptions.
Never assume the data are perfect. Detect the problems (basic statistics and visualizations help a lot) and understand the limitations of the data generation/acquisition process.
Data exploration and completion Looking at specific objects
It may be interesting to look at particular objects: those taking uncommon or extreme values; those that you know particularly well; those that everybody knows.
To do so, you can run SQL queries (if the dataset is in a database) or POSIX commands (if it is in text files).
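The SQL route can be sketched with Python's built-in `sqlite3` module and an in-memory database. The table `objects`, its columns and its rows are hypothetical; a real dataset would already live in its own database.

```python
import sqlite3

# In-memory database holding a made-up dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (name TEXT, value REAL)")
conn.executemany(
    "INSERT INTO objects VALUES (?, ?)",
    [("o1", 3.2), ("o2", 2.9), ("o3", 41.0), ("o4", 3.1), ("o5", -7.5)],
)

# Objects taking the most extreme values, at both ends.
largest = conn.execute(
    "SELECT name, value FROM objects ORDER BY value DESC LIMIT 2"
).fetchall()
smallest = conn.execute(
    "SELECT name, value FROM objects ORDER BY value ASC LIMIT 2"
).fetchall()

# An object you know well, looked up by name.
known = conn.execute(
    "SELECT name, value FROM objects WHERE name = ?", ("o2",)
).fetchone()

print(largest, smallest, known)
conn.close()
```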
Data exploration and completion Using background knowledge
It is essential to understand the application domain and to complement the dataset with additional attributes: taken from other sources; derived from existing attributes.
Spam detection
Given an email, its date and the IP addresses sending and receiving it, the presence of some words (viagra, pr0n, etc.), the country of the sender, the match/mismatch between the language of the email and that of the two countries, whether the day is a holiday, the number of emails from the sender in the dataset, ... may help its classification as spam/ham.
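A minimal sketch of how some of those attributes could be added to each email. Everything concrete here is an assumption for illustration: the field names, the three emails, the IP-to-country dictionary (standing in for an external geolocation source) and the holiday set (standing in for a calendar).

```python
from collections import Counter
from datetime import date

# Hypothetical dataset: field names and values are made up.
emails = [
    {"sender_ip": "203.0.113.5", "date": date(2017, 12, 25), "body": "Cheap viagra here"},
    {"sender_ip": "198.51.100.7", "date": date(2017, 8, 9), "body": "Meeting at 10am"},
    {"sender_ip": "203.0.113.5", "date": date(2017, 8, 9), "body": "pr0n for free"},
]

# Stand-ins for other sources (assumptions, not real lookups).
COUNTRY_OF_IP = {"203.0.113.5": "XX", "198.51.100.7": "BR"}
HOLIDAYS = {date(2017, 12, 25)}
SUSPICIOUS_WORDS = ("viagra", "pr0n")

# Attribute derived from the dataset itself: emails per sender.
emails_per_sender = Counter(email["sender_ip"] for email in emails)

for email in emails:
    words = email["body"].lower().split()
    # Derived from existing attributes.
    email["has_suspicious_word"] = any(word in words for word in SUSPICIOUS_WORDS)
    email["sender_email_count"] = emails_per_sender[email["sender_ip"]]
    # Taken from other sources.
    email["sender_country"] = COUNTRY_OF_IP.get(email["sender_ip"])
    email["is_holiday"] = email["date"] in HOLIDAYS

for email in emails:
    print(email)
```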
Analyzing one single attribute Typology of an attribute
In 1946, Stanley Smith Stevens proposed to categorize attributes into four types:
Nominal (Boolean when there are only two categories);
Ordinal (with a partial or a total order);
Interval-scaled (differences make sense but the zero is arbitrary);
Ratio-scaled (ratios make sense).
Identifying the type of every attribute is essential. It tells which statistics and data mining algorithms are applicable and which operations are allowed to derive new attributes.
Analyzing one single attribute Centrality, dispersion and skewness
Many (but not all) attributes take values that are distributed around one center.
If the attribute is interval-scaled, the dispersion is a measure of how much its values deviate from the center.
If the distribution is asymmetric around the center, it is said to be skewed.
Analyzing one single attribute Skewness
[Figure: two diagrams, labeled "Negative Skew" and "Positive Skew". © 2008 Rodolfo Hermans (from Wikimedia Commons). These diagrams are licensed under the Creative Commons Attribution ShareAlike 3.0 Unported License.]
Analyzing one single attribute Basic statistics
Estimating from the sample the center and the dispersion of the values taken by an attribute is very useful. The applicable statistics depend on the type of the attribute:
nominal: the mode, i.e., the most common value;
ordinal: the mode, the median (50% of the values are smaller, 50% greater), the min and the max;
interval-scaled: all the above plus the arithmetic mean, the range (max − min) and the standard deviation;
ratio-scaled: all the above plus the geometric and harmonic means (for rates), the studentized range (difference of the z-scores of the largest and the smallest values), the coefficient of variation (ratio of the standard deviation to the mean), etc.
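Most of the statistics listed above are available in Python's standard `statistics` module. The sketch below applies to each made-up sample only the statistics that its attribute type allows; the sample values and variable names are assumptions for this example, and the studentized range is left out.

```python
import statistics

# Made-up samples, one per attribute type.
colors = ["red", "blue", "red", "green"]   # nominal
grades = [1, 2, 2, 3, 5]                   # ordinal (ranks)
temps_c = [18.0, 21.0, 19.0, 22.0, 20.0]   # interval-scaled (degrees Celsius)
speeds = [40.0, 60.0]                      # ratio-scaled (km/h)

# Nominal: only the mode is meaningful.
mode = statistics.mode(colors)

# Ordinal: the median, min and max become available.
median = statistics.median(grades)
lowest, highest = min(grades), max(grades)

# Interval-scaled: arithmetic mean, range and standard deviation.
mean = statistics.mean(temps_c)
value_range = max(temps_c) - min(temps_c)
std = statistics.stdev(temps_c)  # sample standard deviation

# Ratio-scaled: geometric and harmonic means, coefficient of variation.
harmonic = statistics.harmonic_mean(speeds)
geometric = statistics.geometric_mean(speeds)
coefficient_of_variation = statistics.stdev(speeds) / statistics.mean(speeds)

print(mode, median, lowest, highest)
print(mean, value_range, std)
print(harmonic, geometric, coefficient_of_variation)
```

The harmonic mean of 40 and 60 km/h is 48 km/h, which is the average speed over two legs of equal distance: this is the sense in which it suits rates.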
Analyzing one single attribute Robustness of a statistic
A statistic is robust if extreme values (called outliers) hardly affect it. A non-robust statistic computed from a sample may be very different from the one that would be obtained from all objects.
The median is a robust statistic of centrality, whereas the arithmetic mean is not. The trimmed mean is the arithmetic mean computed after discarding the most extreme values (a small fraction to be chosen). It is a robust statistic.
The range is not a robust statistic of dispersion. The interquartile range (IQR) is the range of the middle 50% of the values. It is a robust statistic.
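The contrast can be checked on a small made-up sample in which one value is replaced by an outlier. The helper names `trimmed_mean` and `interquartile_range` are chosen for this sketch, and the quartiles use the "inclusive" convention of `statistics.quantiles`, which is one of several conventions in use.

```python
import statistics


def trimmed_mean(values, fraction):
    """Arithmetic mean after discarding `fraction` of the values at each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    return statistics.mean(ordered[k:len(ordered) - k])


def interquartile_range(values):
    """Difference between the third and the first quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


clean = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
dirty = clean[:-1] + [1000]  # the largest value is replaced by an outlier

for name, sample in (("clean", clean), ("dirty", dirty)):
    print(
        name,
        "mean:", statistics.mean(sample),
        "median:", statistics.median(sample),
        "trimmed mean:", trimmed_mean(sample, 0.1),
        "range:", max(sample) - min(sample),
        "IQR:", interquartile_range(sample),
    )
```

On this sample, the mean and the range change drastically with the outlier, while the median, the 10% trimmed mean and the IQR stay the same.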
Analyzing one single attribute Estimating the skewness
Several statistics aim to measure the skewness of an interval-scaled attribute. The most common ones are Pearson's skewness statistics:
$\frac{\text{mean} - \text{mode}}{\text{standard deviation}}$;
$\frac{\text{mean} - \text{median}}{\text{standard deviation}}$;
Pearson's moment coefficient of skewness (a more complicated formula with better statistical foundations).
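A minimal sketch of the three statistics, with some assumptions worth stating: the function names are made up, the first two use the sample standard deviation (the slides do not say which estimator is meant), and the moment coefficient is computed in its usual form, the third central moment divided by the cube of the population standard deviation. Some references multiply the median-based statistic by 3; the version here follows the formula given above.

```python
import statistics


def pearson_mode_skewness(values):
    """(mean - mode) / standard deviation."""
    return (statistics.mean(values) - statistics.mode(values)) / statistics.stdev(values)


def pearson_median_skewness(values):
    """(mean - median) / standard deviation."""
    return (statistics.mean(values) - statistics.median(values)) / statistics.stdev(values)


def moment_skewness(values):
    """Third central moment divided by the cubed population standard deviation."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


# Made-up samples: one with a long right tail, one symmetric.
right_tailed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]
symmetric = [1, 2, 3, 3, 3, 4, 5]

for name, sample in (("right-tailed", right_tailed), ("symmetric", symmetric)):
    print(
        name,
        pearson_mode_skewness(sample),
        pearson_median_skewness(sample),
        moment_skewness(sample),
    )
```

All three statistics are positive on the right-tailed sample and zero on the symmetric one, which matches the positive/negative skew terminology.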
Analyzing one single attribute Basic visualizations
A histogram graphically represents the distribution of the values taken by an attribute (whatever its type). It requires partitioning the domain into intervals, each being the base of a rectangle whose area is proportional to the number of values in the interval. Do not use pie charts.
A box plot provides a simpler visualization of the distribution of an interval-scaled attribute. It shows the boundaries of the four quartiles, and either the min/max or the thresholds 1.5 × IQR below the first quartile and above the third quartile, together with the values exceeding those thresholds. Those values are outliers (but their definition can be tuned by modifying the 1.5 coefficient).
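The numbers behind both plots can be computed without any plotting library. The sketch below assumes equal-width intervals for the histogram (so that heights and areas are proportional) and the "inclusive" quartile convention of `statistics.quantiles`; the function names and the sample are made up for illustration.

```python
import statistics


def histogram_counts(values, bin_count):
    """Count the values falling in each of `bin_count` equal-width intervals.

    Assumes the values are not all equal (otherwise the width would be zero).
    """
    low, high = min(values), max(values)
    width = (high - low) / bin_count
    counts = [0] * bin_count
    for value in values:
        # The maximum is placed in the last interval.
        index = min(int((value - low) / width), bin_count - 1)
        counts[index] += 1
    return counts


def box_plot_summary(values, coefficient=1.5):
    """Quartiles, thresholds at `coefficient` * IQR, and the values beyond them."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - coefficient * iqr
    upper = q3 + coefficient * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return {
        "q1": q1, "median": median, "q3": q3,
        "lower_threshold": lower, "upper_threshold": upper,
        "outliers": outliers,
    }


values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 12]
counts = histogram_counts(values, bin_count=4)
summary = box_plot_summary(values)
print(counts)
print(summary)
```

Raising `coefficient` makes the outlier definition stricter: with `coefficient=5`, the value 12 is no longer beyond the upper threshold.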
Analyzing one single attribute Histogram
[Figure: histogram]
Analyzing one single attribute Box plot
[Figure: box plot]
Analyzing one single attribute Structured attributes
Objects can be described with attributes that are matrices/tensors, words, sequences, graphs (directed or not, weighted or not, labeled or not), with timestamps, with spatial positions, pictures, sounds, videos, etc.
Studying and using a method that processes such data can be the non-trivial step of your project.
Properties of a structured attribute (its size, counts of patterns in it, dominant color, BPM, etc.) can substitute for it, with loss of information. Metadata (author, creation date, tags, length, etc.) are valuable too.
License
© 2011–2017 These slides are licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.