Model Based Symbolic Description for Big Data Analysis


*Carlo Drago, **Carlo Lauro and **Germana Scepi
*University of Rome Niccolo Cusano, **University of Naples Federico II
COMPSTAT, International Conference on Computational Statistics

Outline
- The Statistical Problem
- Beanplot Time Series: Definition
- Kernel and Bandwidth Choice
- Beanplot Characteristics and Robustness
- Parameterization
- Beanplot Modelling
- Multiple Beanplot Time Series
- Beanplot Multiple Factor Analysis
- Beanplot Clustering (using the Beanplot Model Distance)
- Beanplot Constrained Clustering (using the Beanplot Model Distance)
- Beanplot Forecasting

Big Data
Recent technological advances have brought many innovations in data; in particular, the number of large available data sets has exploded. Big data is the term frequently used today for any collection of data sets so large and complex that it becomes difficult to process with on-hand database management tools or traditional data processing applications. Big data are characterized by:
- high volume
- high velocity
- high variety
This type of data usually has a temporal dimension as well.

Financial Big Data
Big data is especially promising and differentiating for financial services companies. Financial businesses cope with hundreds of millions of daily transactions and use big data to transform their processes and organizations and to obtain competitive advantages in financial markets. Financial firms must be able to collect, store, and analyze this rapidly changing type of data in order to maximize profits, reduce risk, and meet increasingly stringent regulatory requirements. The extraction of insights from such complex, and frequently unstructured, data is a very important step in this process, and the statistical approach can give a fundamental contribution in this sense.

Financial Big Data
We consider as big data observations on financial variables taken daily or at a finer time scale, often irregularly spaced over time, and usually exhibiting periodic (intra-day and intra-week) patterns. High-frequency data possess these peculiar features and can be considered an example of big data in financial markets: records of transactions and quotes for stocks, bonds, currencies, and so on. These time series are difficult to visualize, and analyzing them through an aggregated index leads to an evident loss of information.

The Frequency Domain
A time series of distributions offers a more informative representation than other forms of aggregated time series. In order to analyze these data we consider them not in the temporal domain of the time series but in the frequency domain (taking, for example, the day as the unit): we count the number of occurrences over time of each specific value. Doing so has several advantages:
- We can easily detect patterns in the data, such as the most recurrent observations in the temporal interval.
- We can detect inter-temporal seasonalities occurring in the temporal interval.
- We can observe similarities between different series.

From Financial Big Data to Symbolic Data
From the initial financial big data we obtain a symbolic data table in which each datum can be represented as a distribution. At this point we can:
- represent each distribution as a beanplot datum;
- choose an adequate data model;
- parameterize the data model and obtain the relevant parameters.
The final parameters are the relevant big data representation and can be used in clustering and forecasting.

From Financial Big Data to Symbolic Data
Figure: From Financial Big Data to Symbolic Data (the first graph is from Martinaitis (2012))

Methods
Figure: Methods

Beanplot Time Series (BTS)
A beanplot time series can be defined as an ordered sequence of beanplot data (Kampstra 2008) over time. The advantage of the beanplot is its capacity to represent the intra-period data structure at time t. In a beanplot time series, the density datum at time t, with t = 1...T, is defined as

\hat{b}_{k,h,t}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) = \frac{1}{nh}\left[K\!\left(\frac{x - x_1}{h}\right) + K\!\left(\frac{x - x_2}{h}\right) + \cdots + K\!\left(\frac{x - x_n}{h}\right)\right]   (1)

where K is a kernel function, h is a smoothing parameter called the bandwidth, and n is the number of intra-period observations x_i.
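A minimal sketch of Eq. (1) in Python, assuming only NumPy; the toy observations, grid and bandwidth are illustrative, not the authors' data:

```python
import numpy as np

def beanplot_density(grid, obs, h):
    """Gaussian-kernel estimate b_hat(x) = (1/(n h)) * sum_i K((x - x_i)/h)."""
    obs = np.asarray(obs, dtype=float)
    u = (grid[:, None] - obs[None, :]) / h            # scaled distances, shape (grid, n)
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)  # Gaussian kernel K(u)
    return K.sum(axis=1) / (obs.size * h)

# Toy intra-period observations for a single beanplot
obs = np.array([10.0, 10.2, 10.1, 10.4, 9.9])
grid = np.linspace(9.0, 11.5, 501)
dens = beanplot_density(grid, obs, h=0.2)
area = np.trapz(dens, grid)  # a density trace should integrate to about 1
```

Evaluating the estimate on a fine grid and checking that it integrates to one is a quick sanity test for any kernel/bandwidth pair.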

Beanplot Taxonomies
We can detect some typical taxonomies in the beanplots:
A) Unimodality: data tend to gather around one mode in a regular way.
B) Multimodality: data tend to gather around two (or more) modes.
C) Break: data tend to gather around two modes, but there is at least one break between the observations.
Figure: Beanplot Taxonomy

Identifying Intra-Period Breaks
A beanplot can be characterized by some groups of internal outlier observations (more than one), whose final result is a break in the data structure. In order to detect the intra-period breaks:
- we sort the observations from the highest to the lowest;
- we compute the first differences Δ_i, with i = 1...n-1, and their mean Δ̄ = Σ Δ_i / (n-1);
- we consider relevant the values above a specified threshold, for example Δ_i > 3Δ̄.
In particular, these values need to break the internal patterns considered. It is relevant to take into account that we can weight the internal outliers detected: in this way the beanplot is represented by a suitable weighting system.
Figure: Intra-period breaks
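The three-step break-detection rule above can be sketched as follows (a hedged illustration; the threshold multiplier of 3 follows the slide, while the toy values are assumptions):

```python
import numpy as np

def intra_period_breaks(obs, k=3.0):
    """Sort descending, take the gaps between consecutive sorted values,
    and flag gaps larger than k times the mean gap as intra-period breaks."""
    s = np.sort(np.asarray(obs, dtype=float))[::-1]  # highest to lowest
    gaps = -np.diff(s)                               # positive first differences
    return np.where(gaps > k * gaps.mean())[0]

# Two tight clusters of values separated by one large gap -> one break
obs = [10.0, 10.1, 10.2, 14.9, 15.0, 15.1]
breaks = intra_period_breaks(obs)  # flags the gap after the 3rd sorted value
```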

Kernels
Various kernels K can generally be chosen: Gaussian, uniform, Epanechnikov, triweight, exponential, cosine, among others. The kernel is chosen to represent the density function adequately, and K needs to satisfy

\int_{-\infty}^{+\infty} K(u)\,du = 1   (2)

Uniform:
K(u) = \frac{1}{2}\,\mathbf{1}(|u| \le 1)   (3)

Epanechnikov:
K(u) = \frac{3}{4}(1 - u^2)\,\mathbf{1}(|u| \le 1)   (4)

Triweight:
K(u) = \frac{35}{32}(1 - u^2)^3\,\mathbf{1}(|u| \le 1)   (5)

Gaussian:
K(u) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}u^2}   (6)
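As a quick numerical check of condition (2), the bounded kernels above and the Gaussian all integrate to 1 (a sketch assuming NumPy; trapezoidal integration is used for simplicity):

```python
import numpy as np

u = np.linspace(-1.0, 1.0, 20001)  # support of the bounded kernels
kernels = {
    "uniform":      np.full_like(u, 0.5),
    "epanechnikov": 0.75 * (1.0 - u ** 2),
    "triweight":    (35.0 / 32.0) * (1.0 - u ** 2) ** 3,
}
areas = {name: np.trapz(k, u) for name, k in kernels.items()}

ug = np.linspace(-8.0, 8.0, 20001)  # effectively the whole real line
areas["gaussian"] = np.trapz(np.exp(-0.5 * ug ** 2) / np.sqrt(2.0 * np.pi), ug)
```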

Kernel Properties
A kernel function K(u) is nonnegative and needs to fulfill (Racine 2008):

\int K(u)\,du = 1   (8)

K(u) = K(-u)   (9)

\int u^2 K(u)\,du = \kappa_2 > 0   (10)

Kernel Selection
"It turns out that a range of kernel functions result in estimators having similar relative efficiencies, one could choose the kernel based on computational considerations, the Gaussian kernel being a popular choice..." (Racine 1986). In order to approximate our data we choose the Gaussian kernel:

K(u) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}u^2}   (11)

When dealing with big data, the Gaussian kernel is the simplest to interpret. "...unlike choosing a kernel function, however, choosing an appropriate bandwidth is a crucial aspect of sound nonparametric analysis" (Racine 1986).

Kernel Selection
Figure: Kernel Choice and Kernel Density Estimation. The figure shows the kernel density estimate computed using a Gaussian kernel and a bandwidth of h = 0.3 (R code by François 2012).

BTS: Bandwidth Selection
We show the impact of different selected bandwidths (using three choices: low, high and Sheather-Jones) on the beanplot time series. In the example we consider a yearly interval for the beanplot observation, related to the Dow Jones Index. This interval can be validated by considering the temporal horizons over which events in these data (stocks) can occur: in risk management applications the relevant interval is the year (to take into account the risk of financial crises). Considering the bandwidth, we can observe:
- Low bandwidth: tends to show many bumps, or to maximize the number of bumps per beanplot.
- High bandwidth: tends to give a more regular shape of the density traces; however, the risk here is losing some information.
- Sheather-Jones method: the bandwidth changes beanplot by beanplot, so the bandwidth itself becomes an indicator of variability.
Usually the impact of both bandwidth selection and kernel selection is assessed by simulation.
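The low-versus-high bandwidth behaviour described above can be illustrated numerically: on bimodal data a small h yields a bumpy density trace and a large h a smooth one. This is a sketch with simulated data; h = 0.1 and h = 3.0 are arbitrary choices, and total variation is only a simple bumpiness proxy:

```python
import numpy as np

def kde(grid, obs, h):
    u = (grid[:, None] - np.asarray(obs)[None, :]) / h
    return (np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)).sum(axis=1) / (len(obs) * h)

rng = np.random.default_rng(0)
obs = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(6.0, 1.0, 200)])
grid = np.linspace(-5.0, 11.0, 1000)

dens_low = kde(grid, obs, h=0.1)   # many bumps
dens_high = kde(grid, obs, h=3.0)  # regular shape, but detail is lost

def total_variation(d):
    """Sum of absolute increments: larger values mean a wigglier trace."""
    return np.abs(np.diff(d)).sum()
```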

BTS: Bandwidth Selection
Figure: Dow Jones BTS bandwidth selection. Yearly beanplot time series on Dow Jones daily data. Different bandwidth choices on the beanplot time series: low bandwidth h = 8; high bandwidth h = 102; Sheather and Jones method (which uses a pilot estimate of derivatives to choose the bandwidth). Kernel selected: Gaussian.

The Impact of Kernel and Bandwidth Selection
It is possible to explore the beanplot data characteristics using different kernels and bandwidths. We choose to use the Gaussian kernel (for its flexibility) and the bandwidth obtained by the Sheather-Jones method (to explore the data structure).

Beanplot Time Series: Characteristics
- Beanline: the mean or the median.
- Beanplot lower and upper bounds: [X]_t = [X_{t,L}, X_{t,U}] with -\infty < X_{t,L} \le X_{t,U} < +\infty.
- Beanplot center and radius: [X]_t = (X_{t,C}, X_{t,R}), where X_{t,C} = (X_{t,L} + X_{t,U})/2 and X_{t,R} = (X_{t,U} - X_{t,L})/2.
- Quantiles.
Main characteristics:
- Location: the beanline (mean), the beanplot center.
- Size: the beanplot radius, lower and upper bounds.
- Shape: the parameter h regulates the density trace; the lower the bandwidth, the wigglier the density function. The parameter h can be obtained using the Sheather-Jones method (see Kampstra (2008)). There are relevant effects on the kurtosis as well.

Beanplot Time Series: Characteristics
Intra-period and inter-period variability: the yearly beanplot time series on Dow Jones daily data allows the identification of structural changes and intra-period variability patterns. The kernel chosen is the Gaussian; the bandwidth is obtained by means of the Sheather-Jones method.

Beanplot Modeling: Choosing the Class of the Model
- We adopt the symbolic aggregation approach, taking the day as the temporal interval.
- We work in the frequency domain in order to extract the relevant daily patterns.
- At this point we choose the class of the model: in particular, the number of mixture components, the distributions considered, and so on. In our case we choose two components, because the goodness-of-fit (gof) indexes show a good approximation of the data; at the same time, the Gaussian distribution maximizes the gof index in the experiments we have performed on the data.
- From the relevant daily data we extract the relevant parameters by the parameterization procedure: in particular, we consider a finite mixture model for each density function.

Beanplot Parameterization
In order to compare and analyse the beanplot time series we need to parameterize the different beanplots. The aims of the parameterization are:
- synthesizing the beanplot observations;
- comparing, analysing and interpreting the beanplot observations;
- storing big data.
In this sense:
- We consider a kernel density estimation of the density function (a bandwidth h and a kernel K). We obtain: B^K_t.
- We fit a finite mixture model to the density function. We obtain: B^M_t.
- We perform model diagnostics and assess model fit.

Beanplot by Mixture Models
Parameterization is important because the stored relevant information of the beanplots can be used in clustering and forecasting. With the aim of parameterization we estimate the model parameters of a finite mixture density:

B^M_t = \sum_{j=1}^{J} \pi_j f(x \mid \theta_j)   (12)

where \pi_1, \ldots, \pi_J are scalars and \theta_1, \ldots, \theta_J are vectors of parameters, with 0 \le \pi_j \le 1 and \pi_1 + \pi_2 + \cdots + \pi_J = 1. We therefore obtain A^\mu_t (means), A^\sigma_t (standard deviations) and A^\pi_t (weights). We use Gaussian distributions for their flexibility, and Maximum Likelihood Estimation for the estimation of the parameters.
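A minimal EM fit of the two-component Gaussian mixture in Eq. (12), written from scratch (an illustrative sketch, not the authors' estimation code; the simulated series and the crude initialization are assumptions):

```python
import numpy as np

def em_gmm2(x, iters=200):
    """EM for B = pi1*N(mu1, s1^2) + pi2*N(mu2, s2^2), by maximum likelihood."""
    x = np.asarray(x, dtype=float)
    mu = np.array([x.min(), x.max()])       # crude initialization at the extremes
    s = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        pdf = np.exp(-0.5 * ((x[:, None] - mu) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))
        r = pi * pdf
        r /= r.sum(axis=1, keepdims=True)   # E-step: responsibilities
        nk = r.sum(axis=0)                  # M-step: update pi, mu, sigma
        pi = nk / x.size
        mu = (r * x[:, None]).sum(axis=0) / nk
        s = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return pi, mu, s

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(8.0, 1.0, 500)])
pi, mu, s = em_gmm2(x)  # should recover means near 0 and 8, weights near 0.5
```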

B^M_t Parameters Interpretation
The parameters can be interpreted in this way:
- \mu_j represents the main intra-period characteristics; for example, in the financial context, the values around which the price of a stock has gathered over time. Changes in \mu_j can occur in the presence of structural changes.
- \sigma_j represents the intra-period variability, which in financial terms can mean higher volatility. Changes in \sigma_j can occur in the presence of financial news (higher or lower intra-period volatility).
- \pi_j represents the relative weight of each distinct group of observations. Changes in \pi_j are related to intra-period changes.

Number of B^M_t Parameters
The number of parameters to estimate depends on the number of components C in the mixture. A feasible solution needs to be a compromise between comparability, simplicity and usability. After the estimation of the model it is necessary to consider the quality of the fit.
Figure: Beanplot Model with C = 2

Weighting
For every finite mixture model we measure the fit by a goodness-of-fit index, which measures how well the model fits the initial data: 1 represents the highest level of fit and 0 the minimum. This index is used to weight the observations in all the different models of models, so that observations whose models do not represent the data adequately are weighted less, while observations with a higher goodness of fit are weighted more.
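The weighting rule can be stated very simply: normalize the gof indexes and use them as observation weights, so a poorly fitting model contributes less (the numbers below are purely illustrative, not results from the slides):

```python
import numpy as np

gof = np.array([0.95, 0.90, 0.40, 0.85])  # hypothetical fit indexes in [0, 1]
stat = np.array([1.0, 2.0, 3.0, 4.0])     # some per-model statistic of interest

w = gof / gof.sum()                       # normalized weights summing to 1
weighted = float(w @ stat)                # down-weights the third, badly fitting model
plain = float(stat.mean())
```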

Multiple Beanplot Time Series (MBTS)
Here, with the aim of creating a representative market index, we consider a beanplot in order to take into account the intra-period variation. In particular, we construct a beanplot market index to represent the overall market risk. A beanplot market index can have relevant applications in risk management, to anticipate risk over time. At the same time, a beanplot market index can reflect the state of an economy and the sentiment of the investors, and help investment decisions. In this sense we extend our previous approach for single beanplot analysis to the case of multiple beanplot time series.

Multiple Beanplot Time Series
A multiple beanplot time series can be defined as the simultaneous observation of more than one beanplot time series: for example, the beanplot time series related to more than one financial market. By considering the multiple beanplot time series related to a market, the resulting synthesis will be a beanplot representing the entire market (an index of the entire market, such as the FTSE MIB in the Italian case). Possible real applications:
- exploratory time series analysis;
- constructing composite indicators based on multiple beanplot time series;
- portfolio selection;
- change point detection;
- forecasting.

Multiple Beanplot Time Series Analysis
We consider four different methods with different aims:
- Multiple Factor Analysis, with the aim of seeking the common structure of the blocks describing the multiple beanplot time series.
- Clustering, with the aim of detecting relevant subgroups over time and finding similar beanplot observations. These observations can be related to different stocks, and the results can be used in portfolio selection strategies.
- Constrained clustering, with the aim of detecting relevant subperiods in a beanplot time series. These relevant subperiods, represented by groups of beanplots over time, can be used to detect market change points.
- Forecasting, with the aim of predicting the observations over time. The models can be used in trading.

Beanplot Multiple Factor Analysis (BMFA)
The aim of the method is to synthesize the different beanplot multiple time series in order to obtain indexes of the market or the portfolio over time; the indexes can be used to support decisions. One of the most important elements in building the index is the gof, as the capacity of the models to approximate the original data. We parameterize the different beanplot time series, obtaining the parameters related to the weights, the means and the variances for each datum. In this example we visualize the first parameter (the weight of the first mixture component):
[Table: first-component weights p1 for models m1-m7 over time; values omitted in the transcription]

Beanplot Multiple Factor Analysis
Here we visualize the matrix of the weights of the second mixture component:
[Table: second-component weights p2 for models m1-m7 over time; values omitted in the transcription]

Beanplot Multiple Factor Analysis
The mean parameter of the first mixture component:
[Table: first-component means m1 for models m1-m7 over time; values omitted in the transcription]

Beanplot Multiple Factor Analysis
The mean parameter of the second mixture component:
[Table: second-component means m2 for models m1-m7 over time; values omitted in the transcription]

Beanplot Multiple Factor Analysis
The variance parameter of the first mixture component:
[Table: first-component standard deviations s1 for models m1-m7 over time; values omitted in the transcription]

Beanplot Multiple Factor Analysis
The variance parameter of the second mixture component:
[Table: second-component standard deviations s2 for models m1-m7 over time; values omitted in the transcription]

Beanplot Multiple Factor Analysis
We also obtain the gof index for each mixture: each model is represented by its parameters and by its gof index. The gof index is necessary in order to down-weight, in the different models, the observations which have a lower gof.
[Table: gof index for models m1-m7 over time; values omitted in the transcription]

Beanplot Multiple Factor Analysis
We can obtain the index as beanplots from the block PCA, weighting by the gof index. At the end of the procedure we obtain the beanplot prototype time series. The global PCA is performed on a matrix of the merged initial datasets (Abdi and Valentin 2007).
Figure: MFA Beanplot Prototype Time Series
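A block-PCA sketch of the MFA step: each parameter block is centered, scaled by its first singular value, optionally reweighted (here by hypothetical gof values), and a global PCA is run on the merged matrix. This follows the standard MFA recipe in outline only; the data and weights are assumptions, not the authors' implementation:

```python
import numpy as np

def mfa_scores(blocks, weights=None):
    """MFA sketch: normalize each block by its first singular value, merge,
    then take global PCA scores via SVD."""
    scaled = []
    for j, B in enumerate(blocks):
        B = B - B.mean(axis=0)                        # center columns
        s1 = np.linalg.svd(B, compute_uv=False)[0]    # first singular value
        w = 1.0 if weights is None else weights[j]
        scaled.append(np.sqrt(w) * B / s1)
    X = np.hstack(scaled)                             # merged dataset
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    return U * S                                      # global factor scores

rng = np.random.default_rng(2)
blocks = [rng.normal(size=(8, 3)) for _ in range(3)]  # 3 models, 8 periods, 3 params
F = mfa_scores(blocks, weights=[0.9, 0.95, 0.5])      # gof-style block weights
```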

Beanplot Multiple Factor Analysis
In the correlation circle we can observe the variables of high-performing stocks (represented by higher means) versus the characteristics of low means (x-axis). At the same time we are able to see characterizations of higher volatility on the y-axis.
Figure: Correlation Circle

Beanplot Multiple Factor Analysis
We also obtain the individual factor maps and the groups representation. These results can be interpreted financially:
- Individual factor map (1) shows the characteristics of the different temporal observations: we can observe the dynamics over time of the market as a whole.
- Individual factor map (2) shows the way the different stocks (represented by the different models) perform over time. It is possible to see that some stocks tend to grow more than others, so they seem to be good opportunities (model 2 and model 5).
- The groups representation shows the portfolio selection obtained by considering the different performances of the stocks (or models). In this context a reasonable strategy seems to be picking first of all stocks 5 and 7, then 1 and 2: overall, these stocks seem convenient considering their performances over time. The plot is useful in order to discriminate good stocks from the others.
We use the gof index in order to weight the observations accordingly.

Beanplot Multiple Factor Analysis
Figure: Individual Factor Map (1)

Beanplot Multiple Factor Analysis
Figure: Individual Factor Map (2)

Beanplot Multiple Factor Analysis
Figure: Groups representation

Beanplot Clustering
The aim of the clustering procedure is to find groups of different beanplot models, or stocks, which are most similar on a given day; the procedure can be very useful in stock-picking processes. In this context the relevant distance used is the model distance of Lauro, Romano and Giordano (2006). By using this distance we discover that stocks 2 and 3 behave very peculiarly within the group of stocks considered, while stocks 1 and 7 together show a very low gof. Finally, we are able to discriminate the different stock typologies.
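A hierarchical clustering sketch on hypothetical model parameters; plain Euclidean distance between parameter vectors stands in for the model distance of Lauro, Romano and Giordano (2006), whose exact form is not reproduced here:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical (p1, m1, m2, s1, s2) vectors for seven fitted beanplot models
params = np.array([
    [0.50, 10.0, 12.0, 1.0, 1.2],
    [0.60, 30.0, 35.0, 2.0, 2.1],
    [0.55, 29.0, 36.0, 2.2, 2.0],
    [0.50, 10.5, 12.5, 1.1, 1.1],
    [0.45, 11.0, 12.2, 0.9, 1.3],
    [0.60, 50.0, 55.0, 3.0, 3.1],
    [0.50, 10.2, 11.8, 1.0, 1.2],
])
Z = linkage(pdist(params), method="ward")        # hierarchical tree
labels = fcluster(Z, t=3, criterion="maxclust")  # cut into three groups of models
```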

Beanplot Clustering
[Table: estimated mixture parameters (p1, p2, m1, m2, s1, s2) and gof for models m1-m7; values omitted in the transcription]

Beanplot Clustering
Figure: Clustering

Beanplot Constrained Clustering
The aim of the constrained clustering procedure is to find groups of beanplots (or models) which are similar over time; the final results can be used to detect relevant change points. Also in this case the relevant distance used is the model distance of Lauro, Romano and Giordano (2006). The results show a very unstable situation for the first three observations: in this context we can detect three change points in the first three observations. The periods 4-5 and 6-8 then show relevant similarities. Overall, the periods 1, 2 and 3 are very risky, because their gof level is comparatively not so high.
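Constrained (time-contiguous) clustering differs from ordinary clustering only in that merges are restricted to adjacent periods. A minimal agglomerative sketch, using centroid distance on toy one-dimensional parameters with two clear regime shifts (not the authors' algorithm):

```python
import numpy as np

def constrained_cluster(X, k):
    """Merge only adjacent segments (smallest centroid distance first)
    until k time-contiguous segments remain."""
    segs = [[i] for i in range(len(X))]
    while len(segs) > k:
        cents = [X[s].mean(axis=0) for s in segs]
        d = [np.linalg.norm(cents[i + 1] - cents[i]) for i in range(len(segs) - 1)]
        j = int(np.argmin(d))
        segs[j] = segs[j] + segs.pop(j + 1)  # merge the closest adjacent pair
    return segs

# Hypothetical parameter series over 8 periods with breaks after t=2 and t=5
X = np.array([[0.0], [0.2], [0.1], [5.0], [5.2], [5.1], [9.9], [10.0]])
segments = constrained_cluster(X, k=3)
```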

Beanplot Constrained Clustering
[Table: mixture parameters (p1, p2, m1, m2, s1, s2) and gof over time t; values omitted in the transcription]

Beanplot Constrained Clustering
Figure: Constrained Clustering

Beanplot Forecasting
In order to predict adequately the observations related to the beanplot models over time, we can use a forecasting procedure based on a VAR model. The aim of the procedure is to predict each observation over time by choosing an adequate VAR model; the models take into account the weights based on the gof. The predicted parameters allow us to obtain the predicted models.
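A least-squares VAR(1) sketch for forecasting the parameter series (the dynamics, coefficients and series below are fabricated for illustration; the gof weighting step is omitted):

```python
import numpy as np

def var1_forecast(Y):
    """Fit Y_t = c + A Y_{t-1} + e_t by least squares, forecast one step ahead."""
    Y = np.asarray(Y, dtype=float)
    X = np.hstack([np.ones((len(Y) - 1, 1)), Y[:-1]])  # intercept + lagged values
    B, *_ = np.linalg.lstsq(X, Y[1:], rcond=None)      # stacked coefficients
    return np.hstack([1.0, Y[-1]]) @ B                 # one-step-ahead prediction

# A toy bivariate parameter series generated by an exact VAR(1)
A = np.array([[0.5, 0.1], [0.0, 0.7]])
c = np.array([1.0, 0.5])
Y = [np.array([4.0, 8.0])]
for _ in range(12):
    Y.append(A @ Y[-1] + c)
Y = np.array(Y)
pred = var1_forecast(Y)  # should recover A @ Y[-1] + c up to numerical error
```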

Beanplot Forecasting
[Table: VAR prediction of the model parameters V1-V8; values omitted in the transcription]

Beanplot Forecasting
Figure: Forecasting: real beanplot to predict (left) and the forecast (right)

Conclusions
- The application of beanplots as symbolic data seems very fruitful for financial big data.
- The use of models based on the beanplots allows us to retain the relevant information through the parameters of the models as well.
- A fundamental point is using the error to weight the different models and observations: in this context we have shown that the use of the error improves the results.
- The different models allow the detection of relevant patterns in the data which can be exploited in various financial operations such as trading and risk management.
- As a future development we will consider these methodologies in other contexts, for example control charts, in order to evaluate the stability of the markets and to build relevant alert systems.


More information

Lecture 25 Nonlinear Programming. November 9, 2009

Lecture 25 Nonlinear Programming. November 9, 2009 Nonlinear Programming November 9, 2009 Outline Nonlinear Programming Another example of NLP problem What makes these problems complex Scalar Function Unconstrained Problem Local and global optima: definition,

More information

Bayesian Spherical Wavelet Shrinkage: Applications to Shape Analysis

Bayesian Spherical Wavelet Shrinkage: Applications to Shape Analysis Bayesian Spherical Wavelet Shrinkage: Applications to Shape Analysis Xavier Le Faucheur a, Brani Vidakovic b and Allen Tannenbaum a a School of Electrical and Computer Engineering, b Department of Biomedical

More information

Clustering Part 4 DBSCAN

Clustering Part 4 DBSCAN Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

Expectation Maximization (EM) and Gaussian Mixture Models

Expectation Maximization (EM) and Gaussian Mixture Models Expectation Maximization (EM) and Gaussian Mixture Models Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer 1 2 3 4 5 6 7 8 Unsupervised Learning Motivation

More information

When, Where & Why to Use NoSQL?

When, Where & Why to Use NoSQL? When, Where & Why to Use NoSQL? 1 Big data is becoming a big challenge for enterprises. Many organizations have built environments for transactional data with Relational Database Management Systems (RDBMS),

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

Customer Clustering using RFM analysis

Customer Clustering using RFM analysis Customer Clustering using RFM analysis VASILIS AGGELIS WINBANK PIRAEUS BANK Athens GREECE AggelisV@winbank.gr DIMITRIS CHRISTODOULAKIS Computer Engineering and Informatics Department University of Patras

More information

Salford Systems Predictive Modeler Unsupervised Learning. Salford Systems

Salford Systems Predictive Modeler Unsupervised Learning. Salford Systems Salford Systems Predictive Modeler Unsupervised Learning Salford Systems http://www.salford-systems.com Unsupervised Learning In mainstream statistics this is typically known as cluster analysis The term

More information

Latent Variable Models and Expectation Maximization

Latent Variable Models and Expectation Maximization Latent Variable Models and Expectation Maximization Oliver Schulte - CMPT 726 Bishop PRML Ch. 9 2 4 6 8 1 12 14 16 18 2 4 6 8 1 12 14 16 18 5 1 15 2 25 5 1 15 2 25 2 4 6 8 1 12 14 2 4 6 8 1 12 14 5 1 15

More information

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned

More information

Machine Learning Lecture 3

Machine Learning Lecture 3 Machine Learning Lecture 3 Probability Density Estimation II 19.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Announcements Exam dates We re in the process

More information

Points Lines Connected points X-Y Scatter. X-Y Matrix Star Plot Histogram Box Plot. Bar Group Bar Stacked H-Bar Grouped H-Bar Stacked

Points Lines Connected points X-Y Scatter. X-Y Matrix Star Plot Histogram Box Plot. Bar Group Bar Stacked H-Bar Grouped H-Bar Stacked Plotting Menu: QCExpert Plotting Module graphs offers various tools for visualization of uni- and multivariate data. Settings and options in different types of graphs allow for modifications and customizations

More information

Introduction to Nonparametric/Semiparametric Econometric Analysis: Implementation

Introduction to Nonparametric/Semiparametric Econometric Analysis: Implementation to Nonparametric/Semiparametric Econometric Analysis: Implementation Yoichi Arai National Graduate Institute for Policy Studies 2014 JEA Spring Meeting (14 June) 1 / 30 Motivation MSE (MISE): Measures

More information

Machine Learning Lecture 3

Machine Learning Lecture 3 Many slides adapted from B. Schiele Machine Learning Lecture 3 Probability Density Estimation II 26.04.2016 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course

More information

University of Florida CISE department Gator Engineering. Clustering Part 2

University of Florida CISE department Gator Engineering. Clustering Part 2 Clustering Part 2 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville Partitional Clustering Original Points A Partitional Clustering Hierarchical

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10. Cluster

More information

Computer Vision 6 Segmentation by Fitting

Computer Vision 6 Segmentation by Fitting Computer Vision 6 Segmentation by Fitting MAP-I Doctoral Programme Miguel Tavares Coimbra Outline The Hough Transform Fitting Lines Fitting Curves Fitting as a Probabilistic Inference Problem Acknowledgements:

More information

Unit 7 Statistics. AFM Mrs. Valentine. 7.1 Samples and Surveys

Unit 7 Statistics. AFM Mrs. Valentine. 7.1 Samples and Surveys Unit 7 Statistics AFM Mrs. Valentine 7.1 Samples and Surveys v Obj.: I will understand the different methods of sampling and studying data. I will be able to determine the type used in an example, and

More information

Analysis of Functional MRI Timeseries Data Using Signal Processing Techniques

Analysis of Functional MRI Timeseries Data Using Signal Processing Techniques Analysis of Functional MRI Timeseries Data Using Signal Processing Techniques Sea Chen Department of Biomedical Engineering Advisors: Dr. Charles A. Bouman and Dr. Mark J. Lowe S. Chen Final Exam October

More information

Outline. Topic 16 - Other Remedies. Ridge Regression. Ridge Regression. Ridge Regression. Robust Regression. Regression Trees. Piecewise Linear Model

Outline. Topic 16 - Other Remedies. Ridge Regression. Ridge Regression. Ridge Regression. Robust Regression. Regression Trees. Piecewise Linear Model Topic 16 - Other Remedies Ridge Regression Robust Regression Regression Trees Outline - Fall 2013 Piecewise Linear Model Bootstrapping Topic 16 2 Ridge Regression Modification of least squares that addresses

More information

ChristoHouston Energy Inc. (CHE INC.) Pipeline Anomaly Analysis By Liquid Green Technologies Corporation

ChristoHouston Energy Inc. (CHE INC.) Pipeline Anomaly Analysis By Liquid Green Technologies Corporation ChristoHouston Energy Inc. () Pipeline Anomaly Analysis By Liquid Green Technologies Corporation CHE INC. Overview: Review of Scope of Work Wall thickness analysis - Pipeline and sectional statistics Feature

More information

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for

More information

The Ohio State University Columbus, Ohio, USA Universidad Autónoma de Nuevo León San Nicolás de los Garza, Nuevo León, México, 66450

The Ohio State University Columbus, Ohio, USA Universidad Autónoma de Nuevo León San Nicolás de los Garza, Nuevo León, México, 66450 Optimization and Analysis of Variability in High Precision Injection Molding Carlos E. Castro 1, Blaine Lilly 1, José M. Castro 1, and Mauricio Cabrera Ríos 2 1 Department of Industrial, Welding & Systems

More information

COMPUTATIONAL STATISTICS UNSUPERVISED LEARNING

COMPUTATIONAL STATISTICS UNSUPERVISED LEARNING COMPUTATIONAL STATISTICS UNSUPERVISED LEARNING Luca Bortolussi Department of Mathematics and Geosciences University of Trieste Office 238, third floor, H2bis luca@dmi.units.it Trieste, Winter Semester

More information

Aaron Daniel Chia Huang Licai Huang Medhavi Sikaria Signal Processing: Forecasting and Modeling

Aaron Daniel Chia Huang Licai Huang Medhavi Sikaria Signal Processing: Forecasting and Modeling Aaron Daniel Chia Huang Licai Huang Medhavi Sikaria Signal Processing: Forecasting and Modeling Abstract Forecasting future events and statistics is problematic because the data set is a stochastic, rather

More information

What s New in Spotfire DXP 1.1. Spotfire Product Management January 2007

What s New in Spotfire DXP 1.1. Spotfire Product Management January 2007 What s New in Spotfire DXP 1.1 Spotfire Product Management January 2007 Spotfire DXP Version 1.1 This document highlights the new capabilities planned for release in version 1.1 of Spotfire DXP. In this

More information

8/3/2017. Contour Assessment for Quality Assurance and Data Mining. Objective. Outline. Tom Purdie, PhD, MCCPM

8/3/2017. Contour Assessment for Quality Assurance and Data Mining. Objective. Outline. Tom Purdie, PhD, MCCPM Contour Assessment for Quality Assurance and Data Mining Tom Purdie, PhD, MCCPM Objective Understand the state-of-the-art in contour assessment for quality assurance including data mining-based techniques

More information

Computer Experiments: Space Filling Design and Gaussian Process Modeling

Computer Experiments: Space Filling Design and Gaussian Process Modeling Computer Experiments: Space Filling Design and Gaussian Process Modeling Best Practice Authored by: Cory Natoli Sarah Burke, Ph.D. 30 March 2018 The goal of the STAT COE is to assist in developing rigorous,

More information

Lecture 9: Hough Transform and Thresholding base Segmentation

Lecture 9: Hough Transform and Thresholding base Segmentation #1 Lecture 9: Hough Transform and Thresholding base Segmentation Saad Bedros sbedros@umn.edu Hough Transform Robust method to find a shape in an image Shape can be described in parametric form A voting

More information

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A. 205-206 Pietro Guccione, PhD DEI - DIPARTIMENTO DI INGEGNERIA ELETTRICA E DELL INFORMAZIONE POLITECNICO DI BARI

More information

Analysing Search Trends

Analysing Search Trends Data Mining in Business Intelligence 7 March 2013, Ben-Gurion University Analysing Search Trends Yair Shimshoni, Google R&D center, Tel-Aviv. shimsh@google.com Outline What are search trends? The Google

More information

Data transformation in multivariate quality control

Data transformation in multivariate quality control Motto: Is it normal to have normal data? Data transformation in multivariate quality control J. Militký and M. Meloun The Technical University of Liberec Liberec, Czech Republic University of Pardubice

More information

Applying Supervised Learning

Applying Supervised Learning Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains

More information

The Curse of Dimensionality

The Curse of Dimensionality The Curse of Dimensionality ACAS 2002 p1/66 Curse of Dimensionality The basic idea of the curse of dimensionality is that high dimensional data is difficult to work with for several reasons: Adding more

More information

Clustering: Classic Methods and Modern Views

Clustering: Classic Methods and Modern Views Clustering: Classic Methods and Modern Views Marina Meilă University of Washington mmp@stat.washington.edu June 22, 2015 Lorentz Center Workshop on Clusters, Games and Axioms Outline Paradigms for clustering

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-31-017 Outline Background Defining proximity Clustering methods Determining number of clusters Comparing two solutions Cluster analysis as unsupervised Learning

More information

Introduction to Trajectory Clustering. By YONGLI ZHANG

Introduction to Trajectory Clustering. By YONGLI ZHANG Introduction to Trajectory Clustering By YONGLI ZHANG Outline 1. Problem Definition 2. Clustering Methods for Trajectory data 3. Model-based Trajectory Clustering 4. Applications 5. Conclusions 1 Problem

More information

Vocabulary. 5-number summary Rule. Area principle. Bar chart. Boxplot. Categorical data condition. Categorical variable.

Vocabulary. 5-number summary Rule. Area principle. Bar chart. Boxplot. Categorical data condition. Categorical variable. 5-number summary 68-95-99.7 Rule Area principle Bar chart Bimodal Boxplot Case Categorical data Categorical variable Center Changing center and spread Conditional distribution Context Contingency table

More information

Solution Sketches Midterm Exam COSC 6342 Machine Learning March 20, 2013

Solution Sketches Midterm Exam COSC 6342 Machine Learning March 20, 2013 Your Name: Your student id: Solution Sketches Midterm Exam COSC 6342 Machine Learning March 20, 2013 Problem 1 [5+?]: Hypothesis Classes Problem 2 [8]: Losses and Risks Problem 3 [11]: Model Generation

More information

Some questions of consensus building using co-association

Some questions of consensus building using co-association Some questions of consensus building using co-association VITALIY TAYANOV Polish-Japanese High School of Computer Technics Aleja Legionow, 4190, Bytom POLAND vtayanov@yahoo.com Abstract: In this paper

More information

Active Appearance Models

Active Appearance Models Active Appearance Models Edwards, Taylor, and Cootes Presented by Bryan Russell Overview Overview of Appearance Models Combined Appearance Models Active Appearance Model Search Results Constrained Active

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-25-2018 Outline Background Defining proximity Clustering methods Determining number of clusters Other approaches Cluster analysis as unsupervised Learning Unsupervised

More information

User Behaviour and Platform Performance. in Mobile Multiplayer Environments

User Behaviour and Platform Performance. in Mobile Multiplayer Environments User Behaviour and Platform Performance in Mobile Multiplayer Environments HELSINKI UNIVERSITY OF TECHNOLOGY Systems Analysis Laboratory Ilkka Hirvonen 51555K 1 Introduction As mobile technologies advance

More information

2014 Stat-Ease, Inc. All Rights Reserved.

2014 Stat-Ease, Inc. All Rights Reserved. What s New in Design-Expert version 9 Factorial split plots (Two-Level, Multilevel, Optimal) Definitive Screening and Single Factor designs Journal Feature Design layout Graph Columns Design Evaluation

More information

3 Graphical Displays of Data

3 Graphical Displays of Data 3 Graphical Displays of Data Reading: SW Chapter 2, Sections 1-6 Summarizing and Displaying Qualitative Data The data below are from a study of thyroid cancer, using NMTR data. The investigators looked

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to JULY 2011 Afsaneh Yazdani What motivated? Wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge What motivated? Data

More information

Machine Learning for Pre-emptive Identification of Performance Problems in UNIX Servers Helen Cunningham

Machine Learning for Pre-emptive Identification of Performance Problems in UNIX Servers Helen Cunningham Final Report for cs229: Machine Learning for Pre-emptive Identification of Performance Problems in UNIX Servers Helen Cunningham Abstract. The goal of this work is to use machine learning to understand

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

Exploratory data analysis for microarrays

Exploratory data analysis for microarrays Exploratory data analysis for microarrays Jörg Rahnenführer Computational Biology and Applied Algorithmics Max Planck Institute for Informatics D-66123 Saarbrücken Germany NGFN - Courses in Practical DNA

More information

Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga.

Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga. Americo Pereira, Jan Otto Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga. ABSTRACT In this paper we want to explain what feature selection is and

More information

Chapter 10. Conclusion Discussion

Chapter 10. Conclusion Discussion Chapter 10 Conclusion 10.1 Discussion Question 1: Usually a dynamic system has delays and feedback. Can OMEGA handle systems with infinite delays, and with elastic delays? OMEGA handles those systems with

More information

ELEC Dr Reji Mathew Electrical Engineering UNSW

ELEC Dr Reji Mathew Electrical Engineering UNSW ELEC 4622 Dr Reji Mathew Electrical Engineering UNSW Review of Motion Modelling and Estimation Introduction to Motion Modelling & Estimation Forward Motion Backward Motion Block Motion Estimation Motion

More information

Data Mining and Analytics. Introduction

Data Mining and Analytics. Introduction Data Mining and Analytics Introduction Data Mining Data mining refers to extracting or mining knowledge from large amounts of data It is also termed as Knowledge Discovery from Data (KDD) Mostly, data

More information

Graph Structure Over Time

Graph Structure Over Time Graph Structure Over Time Observing how time alters the structure of the IEEE data set Priti Kumar Computer Science Rensselaer Polytechnic Institute Troy, NY Kumarp3@rpi.edu Abstract This paper examines

More information

* Hyun Suk Park. Korea Institute of Civil Engineering and Building, 283 Goyangdae-Ro Goyang-Si, Korea. Corresponding Author: Hyun Suk Park

* Hyun Suk Park. Korea Institute of Civil Engineering and Building, 283 Goyangdae-Ro Goyang-Si, Korea. Corresponding Author: Hyun Suk Park International Journal Of Engineering Research And Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 13, Issue 11 (November 2017), PP.47-59 Determination of The optimal Aggregation

More information

CHAPTER-13. Mining Class Comparisons: Discrimination between DifferentClasses: 13.4 Class Description: Presentation of Both Characterization and

CHAPTER-13. Mining Class Comparisons: Discrimination between DifferentClasses: 13.4 Class Description: Presentation of Both Characterization and CHAPTER-13 Mining Class Comparisons: Discrimination between DifferentClasses: 13.1 Introduction 13.2 Class Comparison Methods and Implementation 13.3 Presentation of Class Comparison Descriptions 13.4

More information

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining. About the Tutorial Data Mining is defined as the procedure of extracting information from huge sets of data. In other words, we can say that data mining is mining knowledge from data. The tutorial starts

More information

Lecture 27, April 24, Reading: See class website. Nonparametric regression and kernel smoothing. Structured sparse additive models (GroupSpAM)

Lecture 27, April 24, Reading: See class website. Nonparametric regression and kernel smoothing. Structured sparse additive models (GroupSpAM) School of Computer Science Probabilistic Graphical Models Structured Sparse Additive Models Junming Yin and Eric Xing Lecture 7, April 4, 013 Reading: See class website 1 Outline Nonparametric regression

More information

3 Feature Selection & Feature Extraction

3 Feature Selection & Feature Extraction 3 Feature Selection & Feature Extraction Overview: 3.1 Introduction 3.2 Feature Extraction 3.3 Feature Selection 3.3.1 Max-Dependency, Max-Relevance, Min-Redundancy 3.3.2 Relevance Filter 3.3.3 Redundancy

More information

Data Polygamy: The Many-Many Relationships among Urban Spatio-Temporal Data Sets. Fernando Chirigati Harish Doraiswamy Theodoros Damoulas

Data Polygamy: The Many-Many Relationships among Urban Spatio-Temporal Data Sets. Fernando Chirigati Harish Doraiswamy Theodoros Damoulas Data Polygamy: The Many-Many Relationships among Urban Spatio-Temporal Data Sets Fernando Chirigati Harish Doraiswamy Theodoros Damoulas Juliana Freire New York University New York University University

More information

The Perils of Unfettered In-Sample Backtesting

The Perils of Unfettered In-Sample Backtesting The Perils of Unfettered In-Sample Backtesting Tyler Yeats June 8, 2015 Abstract When testing a financial investment strategy, it is common to use what is known as a backtest, or a simulation of how well

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

Application of Clustering Techniques to Energy Data to Enhance Analysts Productivity

Application of Clustering Techniques to Energy Data to Enhance Analysts Productivity Application of Clustering Techniques to Energy Data to Enhance Analysts Productivity Wendy Foslien, Honeywell Labs Valerie Guralnik, Honeywell Labs Steve Harp, Honeywell Labs William Koran, Honeywell Atrium

More information

Introduction to Mobile Robotics

Introduction to Mobile Robotics Introduction to Mobile Robotics Clustering Wolfram Burgard Cyrill Stachniss Giorgio Grisetti Maren Bennewitz Christian Plagemann Clustering (1) Common technique for statistical data analysis (machine learning,

More information

Welcome to Analytics. Welcome to Applause! Table of Contents:

Welcome to Analytics. Welcome to Applause! Table of Contents: Welcome to Applause! Your success is our priority and we want to make sure Applause Analytics (ALX) provides you with actionable insight into what your users are thinking and saying about their experiences

More information

Applied Bayesian Nonparametrics 5. Spatial Models via Gaussian Processes, not MRFs Tutorial at CVPR 2012 Erik Sudderth Brown University

Applied Bayesian Nonparametrics 5. Spatial Models via Gaussian Processes, not MRFs Tutorial at CVPR 2012 Erik Sudderth Brown University Applied Bayesian Nonparametrics 5. Spatial Models via Gaussian Processes, not MRFs Tutorial at CVPR 2012 Erik Sudderth Brown University NIPS 2008: E. Sudderth & M. Jordan, Shared Segmentation of Natural

More information

Automate Transform Analyze

Automate Transform Analyze Competitive Intelligence 2.0 Turning the Web s Big Data into Big Insights Automate Transform Analyze Introduction Today, the web continues to grow at a dizzying pace. There are more than 1 billion websites

More information

MATH3016: OPTIMIZATION

MATH3016: OPTIMIZATION MATH3016: OPTIMIZATION Lecturer: Dr Huifu Xu School of Mathematics University of Southampton Highfield SO17 1BJ Southampton Email: h.xu@soton.ac.uk 1 Introduction What is optimization? Optimization is

More information

Locating Salient Object Features

Locating Salient Object Features Locating Salient Object Features K.N.Walker, T.F.Cootes and C.J.Taylor Dept. Medical Biophysics, Manchester University, UK knw@sv1.smb.man.ac.uk Abstract We present a method for locating salient object

More information

Predict Outcomes and Reveal Relationships in Categorical Data

Predict Outcomes and Reveal Relationships in Categorical Data PASW Categories 18 Specifications Predict Outcomes and Reveal Relationships in Categorical Data Unleash the full potential of your data through predictive analysis, statistical learning, perceptual mapping,

More information

High-Dimensional Incremental Divisive Clustering under Population Drift

High-Dimensional Incremental Divisive Clustering under Population Drift High-Dimensional Incremental Divisive Clustering under Population Drift Nicos Pavlidis Inference for Change-Point and Related Processes joint work with David Hofmeyr and Idris Eckley Clustering Clustering:

More information

CS Introduction to Data Mining Instructor: Abdullah Mueen

CS Introduction to Data Mining Instructor: Abdullah Mueen CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen LECTURE 8: ADVANCED CLUSTERING (FUZZY AND CO -CLUSTERING) Review: Basic Cluster Analysis Methods (Chap. 10) Cluster Analysis: Basic Concepts

More information

Conditional Volatility Estimation by. Conditional Quantile Autoregression

Conditional Volatility Estimation by. Conditional Quantile Autoregression International Journal of Mathematical Analysis Vol. 8, 2014, no. 41, 2033-2046 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ijma.2014.47210 Conditional Volatility Estimation by Conditional Quantile

More information

Network Heartbeat Traffic Characterization. Mackenzie Haffey Martin Arlitt Carey Williamson Department of Computer Science University of Calgary

Network Heartbeat Traffic Characterization. Mackenzie Haffey Martin Arlitt Carey Williamson Department of Computer Science University of Calgary Network Heartbeat Traffic Characterization Mackenzie Haffey Martin Arlitt Carey Williamson Department of Computer Science University of Calgary What is a Network Heartbeat? An event that occurs repeatedly

More information

INDEX UNIT 4 PPT SLIDES

INDEX UNIT 4 PPT SLIDES INDEX UNIT 4 PPT SLIDES S.NO. TOPIC 1. 2. Screen designing Screen planning and purpose arganizing screen elements 3. 4. screen navigation and flow Visually pleasing composition 5. 6. 7. 8. focus and emphasis

More information

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach, Kumar Data Preprocessing Aggregation Sampling Dimensionality Reduction Feature subset selection Feature creation

More information