Model Based Symbolic Description for Big Data Analysis


*Carlo Drago, **Carlo Lauro and **Germana Scepi
*University of Rome Niccolo Cusano, **University of Naples Federico II
COMPSTAT, International Conference on Computational Statistics

Outline
- The Statistical Problem
- Beanplot Time Series: Definition
- Kernel and Bandwidth Choice
- Beanplot Characteristics and Robustness
- Parameterization
- Beanplot Modelling
- Multiple Beanplot Time Series
- Beanplot Multiple Factor Analysis
- Beanplot Clustering (using the Beanplot Model Distance)
- Beanplot Constrained Clustering (using the Beanplot Model Distance)
- Beanplot Forecasting

Big Data
Recent technological advances have brought many innovations in data; in particular, the number of large available data sets has exploded. Big data is the term frequently used today for any collection of data sets so large and complex that it becomes difficult to process with on-hand database management tools or traditional data processing applications. Big data are characterized by:
- high volume
- high velocity
- high variety
This type of data usually has a temporal dimension as well.

Financial Big Data
Big data is especially promising and differentiating for financial services companies. Financial businesses cope with hundreds of millions of daily transactions and use big data to transform their processes and organizations and to obtain competitive advantages in financial markets. Financial firms must be able to collect, store, and analyze this rapidly changing type of data in order to maximize profits, reduce risk, and meet increasingly stringent regulatory requirements. The extraction of insights from such complex, and frequently unstructured, data is a very important step in this process, and the statistical approach can give a fundamental contribution in this sense.

Financial Big Data
We consider as big data observations on financial variables taken daily or at a finer time scale, often irregularly spaced over time, and usually exhibiting periodic (intra-day and intra-week) patterns. High-frequency data possess these peculiar features and can be considered an example of big data in financial markets: records of transactions and quotes for stocks, bonds, currencies, and so on. These time series are difficult to visualize, and analyzing them through an aggregated index leads to an evident loss of information.

The Frequency Domain
A time series of distributions offers a more informative representation than other forms of aggregated time series. In order to analyze these data we consider them not in the temporal domain of the time series but in the frequency domain (taking, for example, the day as the unit): we count the number of occurrences over time of each specific value. Doing so has several advantages:
- We can easily detect patterns in the data, such as the most recurrent observations in the temporal interval.
- We can detect inter-temporal seasonalities occurring in the temporal interval.
- We can observe similarities between different series.

From Financial Big Data to Symbolic Data
From the initial financial big data we obtain a symbolic data table in which each datum can be represented as a distribution. At this point we can:
- represent each distribution as a beanplot datum;
- choose an adequate data model;
- parameterize the data model and obtain the relevant parameters.
The final parameters are the relevant big data representation and can be used in clustering and forecasting.

From Financial Big Data to Symbolic Data
Figure: From Financial Big Data to Symbolic Data (the first graph is from Martinaitis (2012))

Methods
Figure: Methods

Beanplot Time Series (BTS)
A beanplot time series can be defined as an ordered sequence of beanplot data (Kampstra 2008) over time. The advantage of the beanplot is its capacity to represent the intra-period data structure at time t. In a beanplot time series, the density datum at time t, with t = 1...T, is defined as

\hat{b}_{k,h,t}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) = \frac{1}{nh}\left[K\!\left(\frac{x - x_1}{h}\right) + K\!\left(\frac{x - x_2}{h}\right) + \cdots + K\!\left(\frac{x - x_n}{h}\right)\right]   (1)

where K is a kernel function, h is a smoothing parameter called the bandwidth, and n is the number of intra-period observations x_i.
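A minimal sketch of Eq. (1) in Python, assuming only NumPy; the toy observations, grid and bandwidth are illustrative, not the authors' data:

```python
import numpy as np

def beanplot_density(grid, obs, h):
    """Gaussian-kernel estimate b_hat(x) = (1/(n h)) * sum_i K((x - x_i)/h)."""
    obs = np.asarray(obs, dtype=float)
    u = (grid[:, None] - obs[None, :]) / h            # scaled distances, shape (grid, n)
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)  # Gaussian kernel K(u)
    return K.sum(axis=1) / (obs.size * h)

# Toy intra-period observations for a single beanplot
obs = np.array([10.0, 10.2, 10.1, 10.4, 9.9])
grid = np.linspace(9.0, 11.5, 501)
dens = beanplot_density(grid, obs, h=0.2)
area = np.trapz(dens, grid)  # a density trace should integrate to about 1
```

Evaluating the estimate on a fine grid and checking that it integrates to one is a quick sanity test for any kernel/bandwidth pair.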

Beanplot Taxonomies
We can detect some typical taxonomies in the beanplots:
A) Unimodality: data tend to gather around one mode in a regular way.
B) Multimodality: data tend to gather around two (or more) modes.
C) Break: data tend to gather around two modes, but there is at least one break between the observations.
Figure: Beanplot Taxonomy

Identifying Intra-Period Breaks
A beanplot can be characterized by some groups of internal outlier observations (more than one), whose final result is a break in the data structure. In order to detect the intra-period breaks:
- we sort the observations from the highest to the lowest;
- we compute the first differences Δ_i, with i = 1...n-1, and their mean Δ̄ = Σ Δ_i / (n-1);
- we consider relevant the values above a specified threshold, for example Δ_i > 3Δ̄.
In particular, these values need to break the internal patterns considered. It is relevant to take into account that we can weight the internal outliers detected: in this way the beanplot is represented by a suitable weighting system.
Figure: Intra-period breaks
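The three-step break-detection rule above can be sketched as follows (a hedged illustration; the threshold multiplier of 3 follows the slide, while the toy values are assumptions):

```python
import numpy as np

def intra_period_breaks(obs, k=3.0):
    """Sort descending, take the gaps between consecutive sorted values,
    and flag gaps larger than k times the mean gap as intra-period breaks."""
    s = np.sort(np.asarray(obs, dtype=float))[::-1]  # highest to lowest
    gaps = -np.diff(s)                               # positive first differences
    return np.where(gaps > k * gaps.mean())[0]

# Two tight clusters of values separated by one large gap -> one break
obs = [10.0, 10.1, 10.2, 14.9, 15.0, 15.1]
breaks = intra_period_breaks(obs)  # flags the gap after the 3rd sorted value
```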

Kernels
Various kernels K can generally be chosen: Gaussian, uniform, Epanechnikov, triweight, exponential, cosine, among others. The kernel is chosen to represent the density function adequately, and K needs to satisfy

\int_{-\infty}^{+\infty} K(u)\,du = 1   (2)

Uniform:
K(u) = \frac{1}{2}\,\mathbf{1}(|u| \le 1)   (3)

Epanechnikov:
K(u) = \frac{3}{4}(1 - u^2)\,\mathbf{1}(|u| \le 1)   (4)

Triweight:
K(u) = \frac{35}{32}(1 - u^2)^3\,\mathbf{1}(|u| \le 1)   (5)

Gaussian:
K(u) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}u^2}   (6)
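As a quick numerical check of condition (2), the bounded kernels above and the Gaussian all integrate to 1 (a sketch assuming NumPy; trapezoidal integration is used for simplicity):

```python
import numpy as np

u = np.linspace(-1.0, 1.0, 20001)  # support of the bounded kernels
kernels = {
    "uniform":      np.full_like(u, 0.5),
    "epanechnikov": 0.75 * (1.0 - u ** 2),
    "triweight":    (35.0 / 32.0) * (1.0 - u ** 2) ** 3,
}
areas = {name: np.trapz(k, u) for name, k in kernels.items()}

ug = np.linspace(-8.0, 8.0, 20001)  # effectively the whole real line
areas["gaussian"] = np.trapz(np.exp(-0.5 * ug ** 2) / np.sqrt(2.0 * np.pi), ug)
```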

Kernel Properties
A kernel function K(u) is nonnegative and needs to fulfill (Racine 2008):

\int K(u)\,du = 1   (8)

K(u) = K(-u)   (9)

\int u^2 K(u)\,du = \kappa_2 > 0   (10)

Kernel Selection
"It turns out that a range of kernel functions result in estimators having similar relative efficiencies, one could choose the kernel based on computational considerations, the Gaussian kernel being a popular choice..." (Racine 1986). In order to approximate our data we choose the Gaussian kernel:

K(u) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}u^2}   (11)

When dealing with big data, the Gaussian kernel is the simplest to interpret. "...unlike choosing a kernel function, however, choosing an appropriate bandwidth is a crucial aspect of sound nonparametric analysis" (Racine 1986).

Kernel Selection
Figure: Kernel Choice and Kernel Density Estimation. The figure shows the kernel density estimate computed using a Gaussian kernel and a bandwidth of h = 0.3 (R code by François 2012).

BTS: Bandwidth Selection
We show the impact of different selected bandwidths (using three choices: low, high and Sheather-Jones) on the beanplot time series. In the example we consider a yearly interval for the beanplot observation, related to the Dow Jones Index. This interval can be validated by considering the temporal horizons over which events in these data (stocks) can occur: in risk management applications the relevant interval is the year (to take into account the risk of financial crises). Considering the bandwidth, we can observe:
- Low bandwidth: tends to show many bumps, or to maximize the number of bumps per beanplot.
- High bandwidth: tends to give a more regular shape of the density traces; however, the risk here is losing some information.
- Sheather-Jones method: the bandwidth changes beanplot by beanplot, so the bandwidth itself becomes an indicator of variability.
Usually the impact of both bandwidth selection and kernel selection is assessed by simulation.
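The low-versus-high bandwidth behaviour described above can be illustrated numerically: on bimodal data a small h yields a bumpy density trace and a large h a smooth one. This is a sketch with simulated data; h = 0.1 and h = 3.0 are arbitrary choices, and total variation is only a simple bumpiness proxy:

```python
import numpy as np

def kde(grid, obs, h):
    u = (grid[:, None] - np.asarray(obs)[None, :]) / h
    return (np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)).sum(axis=1) / (len(obs) * h)

rng = np.random.default_rng(0)
obs = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(6.0, 1.0, 200)])
grid = np.linspace(-5.0, 11.0, 1000)

dens_low = kde(grid, obs, h=0.1)   # many bumps
dens_high = kde(grid, obs, h=3.0)  # regular shape, but detail is lost

def total_variation(d):
    """Sum of absolute increments: larger values mean a wigglier trace."""
    return np.abs(np.diff(d)).sum()
```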

BTS: Bandwidth Selection
Figure: Dow Jones BTS bandwidth selection. Yearly beanplot time series on Dow Jones daily data. Different bandwidth choices on the beanplot time series: low bandwidth h = 8; high bandwidth h = 102; Sheather and Jones method (which uses a pilot estimate of derivatives to choose the bandwidth). Kernel selected: Gaussian.

The Impact of Kernel and Bandwidth Selection
It is possible to explore the beanplot data characteristics using different kernels and bandwidths. We choose to use the Gaussian kernel (for its flexibility) and the bandwidth obtained by the Sheather-Jones method (to explore the data structure).

Beanplot Time Series: Characteristics
- Beanline: the mean or the median.
- Beanplot lower and upper bounds: [X]_t = [X_{t,L}, X_{t,U}] with -\infty < X_{t,L} \le X_{t,U} < +\infty.
- Beanplot center and radius: [X]_t = (X_{t,C}, X_{t,R}), where X_{t,C} = (X_{t,L} + X_{t,U})/2 and X_{t,R} = (X_{t,U} - X_{t,L})/2.
- Quantiles.
Main characteristics:
- Location: the beanline (mean), the beanplot center.
- Size: the beanplot radius, lower and upper bounds.
- Shape: the parameter h regulates the density trace; the lower the bandwidth, the wigglier the density function. The parameter h can be obtained using the Sheather-Jones method (see Kampstra (2008)). There are relevant effects on the kurtosis as well.

Beanplot Time Series: Characteristics
Intra-period and inter-period variability: the yearly beanplot time series on Dow Jones daily data allows the identification of structural changes and intra-period variability patterns. The kernel chosen is the Gaussian; the bandwidth is obtained by means of the Sheather-Jones method.

Beanplot Modeling: Choosing the Class of the Model
- We adopt the symbolic aggregation approach, taking the day as the temporal interval.
- We work in the frequency domain in order to extract the relevant daily patterns.
- At this point we choose the class of the model: in particular, the number of mixture components, the distributions considered, and so on. In our case we choose two components, because the goodness-of-fit (gof) indexes show a good approximation of the data; at the same time, the Gaussian distribution maximizes the gof index in the experiments we have performed on the data.
- From the relevant daily data we extract the relevant parameters by the parameterization procedure: in particular, we consider a finite mixture model for each density function.

Beanplot Parameterization
In order to compare and analyse the beanplot time series we need to parameterize the different beanplots. The aims of the parameterization are:
- synthesizing the beanplot observations;
- comparing, analysing and interpreting the beanplot observations;
- storing big data.
In this sense:
- We consider a kernel density estimation of the density function (a bandwidth h and a kernel K). We obtain: B^K_t.
- We fit a finite mixture model to the density function. We obtain: B^M_t.
- We perform model diagnostics and assess model fit.

Beanplot by Mixture Models
Parameterization is important because the stored relevant information of the beanplots can be used in clustering and forecasting. With the aim of parameterization we estimate the model parameters of a finite mixture density:

B^M_t = \sum_{j=1}^{J} \pi_j f(x \mid \theta_j)   (12)

where \pi_1, \ldots, \pi_J are scalars and \theta_1, \ldots, \theta_J are vectors of parameters, with 0 \le \pi_j \le 1 and \pi_1 + \pi_2 + \cdots + \pi_J = 1. We therefore obtain A^\mu_t (means), A^\sigma_t (standard deviations) and A^\pi_t (weights). We use Gaussian distributions for their flexibility, and Maximum Likelihood Estimation for the estimation of the parameters.
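A minimal EM fit of the two-component Gaussian mixture in Eq. (12), written from scratch (an illustrative sketch, not the authors' estimation code; the simulated series and the crude initialization are assumptions):

```python
import numpy as np

def em_gmm2(x, iters=200):
    """EM for B = pi1*N(mu1, s1^2) + pi2*N(mu2, s2^2), by maximum likelihood."""
    x = np.asarray(x, dtype=float)
    mu = np.array([x.min(), x.max()])       # crude initialization at the extremes
    s = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        pdf = np.exp(-0.5 * ((x[:, None] - mu) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))
        r = pi * pdf
        r /= r.sum(axis=1, keepdims=True)   # E-step: responsibilities
        nk = r.sum(axis=0)                  # M-step: update pi, mu, sigma
        pi = nk / x.size
        mu = (r * x[:, None]).sum(axis=0) / nk
        s = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return pi, mu, s

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(8.0, 1.0, 500)])
pi, mu, s = em_gmm2(x)  # should recover means near 0 and 8, weights near 0.5
```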

B^M_t Parameters Interpretation
The parameters can be interpreted in this way:
- \mu_j represents the main intra-period characteristics; for example, in the financial context, the values around which the price of a stock has gathered over time. Changes in \mu_j can occur in the presence of structural changes.
- \sigma_j represents the intra-period variability, which in financial terms can mean higher volatility. Changes in \sigma_j can occur in the presence of financial news (higher or lower intra-period volatility).
- \pi_j represents the relative weight of each distinct group of observations. Changes in \pi_j are related to intra-period changes.

Number of B^M_t Parameters
The number of parameters to estimate depends on the number of components C in the mixture. A feasible solution needs to be a compromise between comparability, simplicity and usability. After the estimation of the model it is necessary to consider the quality of the fit.
Figure: Beanplot Model with C = 2

Weighting
For every finite mixture model we measure the fit by a goodness-of-fit index, which measures how well the model fits the initial data: 1 represents the highest level of fit and 0 the minimum. This index is used to weight the observations in all the different models of models, so that observations whose models do not represent the data adequately are weighted less, while observations with a higher goodness of fit are weighted more.
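The weighting rule can be stated very simply: normalize the gof indexes and use them as observation weights, so a poorly fitting model contributes less (the numbers below are purely illustrative, not results from the slides):

```python
import numpy as np

gof = np.array([0.95, 0.90, 0.40, 0.85])  # hypothetical fit indexes in [0, 1]
stat = np.array([1.0, 2.0, 3.0, 4.0])     # some per-model statistic of interest

w = gof / gof.sum()                       # normalized weights summing to 1
weighted = float(w @ stat)                # down-weights the third, badly fitting model
plain = float(stat.mean())
```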

Multiple Beanplot Time Series (MBTS)
Here, with the aim of creating a representative market index, we consider a beanplot in order to take into account the intra-period variation. In particular, we construct a beanplot market index to represent the overall market risk. A beanplot market index can have relevant applications in risk management, to anticipate risk over time. At the same time, a beanplot market index can reflect the state of an economy and the sentiment of the investors, and help investment decisions. In this sense we extend our previous approach for single beanplot analysis to the case of multiple beanplot time series.

Multiple Beanplot Time Series
A multiple beanplot time series can be defined as the simultaneous observation of more than one beanplot time series: for example, the beanplot time series related to more than one financial market. By considering the multiple beanplot time series related to a market, the resulting synthesis will be a beanplot representing the entire market (an index of the entire market, such as the FTSE MIB in the Italian case). Possible real applications:
- exploratory time series analysis;
- constructing composite indicators based on multiple beanplot time series;
- portfolio selection;
- change point detection;
- forecasting.

Multiple Beanplot Time Series Analysis
We consider four different methods with different aims:
- Multiple Factor Analysis, with the aim of seeking the common structure of the blocks describing the multiple beanplot time series.
- Clustering, with the aim of detecting relevant subgroups over time and finding similar beanplot observations. These observations can be related to different stocks, and the results can be used in portfolio selection strategies.
- Constrained clustering, with the aim of detecting relevant subperiods in a beanplot time series. These relevant subperiods, represented by groups of beanplots over time, can be used to detect market change points.
- Forecasting, with the aim of predicting the observations over time. The models can be used in trading.

Beanplot Multiple Factor Analysis (BMFA)
The aim of the method is to synthesize the different beanplot multiple time series in order to obtain indexes of the market or the portfolio over time; the indexes can be used to support decisions. One of the most important elements in building the index is the gof, as the capacity of the models to approximate the original data. We parameterize the different beanplot time series, obtaining the parameters related to the weights, the means and the variances for each datum. In this example we visualize the first parameter (the weight of the first mixture component):
[Table: first-component weights p1 for models m1-m7 over time; values omitted in the transcription]

Beanplot Multiple Factor Analysis
Here we visualize the matrix of the weights of the second mixture component:
[Table: second-component weights p2 for models m1-m7 over time; values omitted in the transcription]

Beanplot Multiple Factor Analysis
The mean parameter of the first mixture component:
[Table: first-component means m1 for models m1-m7 over time; values omitted in the transcription]

Beanplot Multiple Factor Analysis
The mean parameter of the second mixture component:
[Table: second-component means m2 for models m1-m7 over time; values omitted in the transcription]

Beanplot Multiple Factor Analysis
The variance parameter of the first mixture component:
[Table: first-component standard deviations s1 for models m1-m7 over time; values omitted in the transcription]

Beanplot Multiple Factor Analysis
The variance parameter of the second mixture component:
[Table: second-component standard deviations s2 for models m1-m7 over time; values omitted in the transcription]

Beanplot Multiple Factor Analysis
We also obtain the gof index for each mixture: each model is represented by its parameters and by its gof index. The gof index is necessary in order to down-weight, in the different models, the observations which have a lower gof.
[Table: gof index for models m1-m7 over time; values omitted in the transcription]

Beanplot Multiple Factor Analysis
We can obtain the index as beanplots from the block PCA, weighting by the gof index. At the end of the procedure we obtain the beanplot prototype time series. The global PCA is performed on a matrix of the merged initial datasets (Abdi and Valentin 2007).
Figure: MFA Beanplot Prototype Time Series
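A block-PCA sketch of the MFA step: each parameter block is centered, scaled by its first singular value, optionally reweighted (here by hypothetical gof values), and a global PCA is run on the merged matrix. This follows the standard MFA recipe in outline only; the data and weights are assumptions, not the authors' implementation:

```python
import numpy as np

def mfa_scores(blocks, weights=None):
    """MFA sketch: normalize each block by its first singular value, merge,
    then take global PCA scores via SVD."""
    scaled = []
    for j, B in enumerate(blocks):
        B = B - B.mean(axis=0)                        # center columns
        s1 = np.linalg.svd(B, compute_uv=False)[0]    # first singular value
        w = 1.0 if weights is None else weights[j]
        scaled.append(np.sqrt(w) * B / s1)
    X = np.hstack(scaled)                             # merged dataset
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    return U * S                                      # global factor scores

rng = np.random.default_rng(2)
blocks = [rng.normal(size=(8, 3)) for _ in range(3)]  # 3 models, 8 periods, 3 params
F = mfa_scores(blocks, weights=[0.9, 0.95, 0.5])      # gof-style block weights
```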

Beanplot Multiple Factor Analysis
In the correlation circle we can observe the variables of high-performing stocks (represented by higher means) versus the characteristics of low means (x-axis). At the same time we are able to see characterizations of higher volatility on the y-axis.
Figure: Correlation Circle

Beanplot Multiple Factor Analysis
We also obtain the individual factor maps and the groups representation. These results can be interpreted financially:
- Individual factor map (1) shows the characteristics of the different temporal observations: we can observe the dynamics over time of the market as a whole.
- Individual factor map (2) shows the way the different stocks (represented by the different models) perform over time. It is possible to see that some stocks tend to grow more than others, so they seem to be good opportunities (model 2 and model 5).
- The groups representation shows the portfolio selection obtained by considering the different performances of the stocks (or models). In this context a reasonable strategy seems to be picking first of all stocks 5 and 7, then 1 and 2: overall, these stocks seem convenient considering their performances over time. The plot is useful in order to discriminate good stocks from the others.
We use the gof index in order to weight the observations accordingly.

Beanplot Multiple Factor Analysis
Figure: Individual Factor Map (1)

Beanplot Multiple Factor Analysis
Figure: Individual Factor Map (2)

Beanplot Multiple Factor Analysis
Figure: Groups representation

Beanplot Clustering
The aim of the clustering procedure is to find groups of different beanplot models, or stocks, which are most similar on a given day; the procedure can be very useful in stock-picking processes. In this context the relevant distance used is the model distance of Lauro, Romano and Giordano (2006). By using this distance we discover that stocks 2 and 3 behave very peculiarly within the group of stocks considered, while stocks 1 and 7 together show a very low gof. Finally, we are able to discriminate the different stock typologies.
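A hierarchical clustering sketch on hypothetical model parameters; plain Euclidean distance between parameter vectors stands in for the model distance of Lauro, Romano and Giordano (2006), whose exact form is not reproduced here:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical (p1, m1, m2, s1, s2) vectors for seven fitted beanplot models
params = np.array([
    [0.50, 10.0, 12.0, 1.0, 1.2],
    [0.60, 30.0, 35.0, 2.0, 2.1],
    [0.55, 29.0, 36.0, 2.2, 2.0],
    [0.50, 10.5, 12.5, 1.1, 1.1],
    [0.45, 11.0, 12.2, 0.9, 1.3],
    [0.60, 50.0, 55.0, 3.0, 3.1],
    [0.50, 10.2, 11.8, 1.0, 1.2],
])
Z = linkage(pdist(params), method="ward")        # hierarchical tree
labels = fcluster(Z, t=3, criterion="maxclust")  # cut into three groups of models
```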

Beanplot Clustering
[Table: estimated mixture parameters (p1, p2, m1, m2, s1, s2) and gof for models m1-m7; values omitted in the transcription]

Beanplot Clustering
Figure: Clustering

Beanplot Constrained Clustering
The aim of the constrained clustering procedure is to find groups of beanplots (or models) which are similar over time; the final results can be used to detect relevant change points. Also in this case the relevant distance used is the model distance of Lauro, Romano and Giordano (2006). The results show a very unstable situation for the first three observations: in this context we can detect three change points in the first three observations. The periods 4-5 and 6-8 then show relevant similarities. Overall, the periods 1, 2 and 3 are very risky, because their gof level is comparatively not so high.
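Constrained (time-contiguous) clustering differs from ordinary clustering only in that merges are restricted to adjacent periods. A minimal agglomerative sketch, using centroid distance on toy one-dimensional parameters with two clear regime shifts (not the authors' algorithm):

```python
import numpy as np

def constrained_cluster(X, k):
    """Merge only adjacent segments (smallest centroid distance first)
    until k time-contiguous segments remain."""
    segs = [[i] for i in range(len(X))]
    while len(segs) > k:
        cents = [X[s].mean(axis=0) for s in segs]
        d = [np.linalg.norm(cents[i + 1] - cents[i]) for i in range(len(segs) - 1)]
        j = int(np.argmin(d))
        segs[j] = segs[j] + segs.pop(j + 1)  # merge the closest adjacent pair
    return segs

# Hypothetical parameter series over 8 periods with breaks after t=2 and t=5
X = np.array([[0.0], [0.2], [0.1], [5.0], [5.2], [5.1], [9.9], [10.0]])
segments = constrained_cluster(X, k=3)
```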

Beanplot Constrained Clustering
[Table: mixture parameters (p1, p2, m1, m2, s1, s2) and gof over time t; values omitted in the transcription]

Beanplot Constrained Clustering
Figure: Constrained Clustering

Beanplot Forecasting
In order to predict adequately the observations related to the beanplot models over time, we can use a forecasting procedure based on a VAR model. The aim of the procedure is to predict each observation over time by choosing an adequate VAR model; the models take into account the weights based on the gof. The predicted parameters allow us to obtain the predicted models.
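A least-squares VAR(1) sketch for forecasting the parameter series (the dynamics, coefficients and series below are fabricated for illustration; the gof weighting step is omitted):

```python
import numpy as np

def var1_forecast(Y):
    """Fit Y_t = c + A Y_{t-1} + e_t by least squares, forecast one step ahead."""
    Y = np.asarray(Y, dtype=float)
    X = np.hstack([np.ones((len(Y) - 1, 1)), Y[:-1]])  # intercept + lagged values
    B, *_ = np.linalg.lstsq(X, Y[1:], rcond=None)      # stacked coefficients
    return np.hstack([1.0, Y[-1]]) @ B                 # one-step-ahead prediction

# A toy bivariate parameter series generated by an exact VAR(1)
A = np.array([[0.5, 0.1], [0.0, 0.7]])
c = np.array([1.0, 0.5])
Y = [np.array([4.0, 8.0])]
for _ in range(12):
    Y.append(A @ Y[-1] + c)
Y = np.array(Y)
pred = var1_forecast(Y)  # should recover A @ Y[-1] + c up to numerical error
```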

Beanplot Forecasting
[Table: VAR prediction of the model parameters V1-V8; values omitted in the transcription]

Beanplot Forecasting
Figure: Forecasting: real beanplot to predict (left) and the forecast (right)

Conclusions
- The application of beanplots as symbolic data seems very fruitful for financial big data.
- The use of models based on the beanplots allows us to retain the relevant information through the parameters of the models as well.
- A fundamental point is using the error to weight the different models and observations: in this context we have shown that the use of the error improves the results.
- The different models allow the detection of relevant patterns in the data which can be exploited in various financial operations such as trading and risk management.
- As a future development we will consider these methodologies in other contexts, for example control charts, in order to evaluate the stability of the markets and to build relevant alert systems.


More information

Lecture 25 Nonlinear Programming. November 9, 2009

Lecture 25 Nonlinear Programming. November 9, 2009 Nonlinear Programming November 9, 2009 Outline Nonlinear Programming Another example of NLP problem What makes these problems complex Scalar Function Unconstrained Problem Local and global optima: definition,

More information

Bayesian Spherical Wavelet Shrinkage: Applications to Shape Analysis

Bayesian Spherical Wavelet Shrinkage: Applications to Shape Analysis Bayesian Spherical Wavelet Shrinkage: Applications to Shape Analysis Xavier Le Faucheur a, Brani Vidakovic b and Allen Tannenbaum a a School of Electrical and Computer Engineering, b Department of Biomedical

More information

Clustering Part 4 DBSCAN

Clustering Part 4 DBSCAN Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

Expectation Maximization (EM) and Gaussian Mixture Models

Expectation Maximization (EM) and Gaussian Mixture Models Expectation Maximization (EM) and Gaussian Mixture Models Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer 1 2 3 4 5 6 7 8 Unsupervised Learning Motivation

More information

When, Where & Why to Use NoSQL?

When, Where & Why to Use NoSQL? When, Where & Why to Use NoSQL? 1 Big data is becoming a big challenge for enterprises. Many organizations have built environments for transactional data with Relational Database Management Systems (RDBMS),

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

Customer Clustering using RFM analysis

Customer Clustering using RFM analysis Customer Clustering using RFM analysis VASILIS AGGELIS WINBANK PIRAEUS BANK Athens GREECE AggelisV@winbank.gr DIMITRIS CHRISTODOULAKIS Computer Engineering and Informatics Department University of Patras

More information

Salford Systems Predictive Modeler Unsupervised Learning. Salford Systems

Salford Systems Predictive Modeler Unsupervised Learning. Salford Systems Salford Systems Predictive Modeler Unsupervised Learning Salford Systems http://www.salford-systems.com Unsupervised Learning In mainstream statistics this is typically known as cluster analysis The term

More information

Latent Variable Models and Expectation Maximization

Latent Variable Models and Expectation Maximization Latent Variable Models and Expectation Maximization Oliver Schulte - CMPT 726 Bishop PRML Ch. 9 2 4 6 8 1 12 14 16 18 2 4 6 8 1 12 14 16 18 5 1 15 2 25 5 1 15 2 25 2 4 6 8 1 12 14 2 4 6 8 1 12 14 5 1 15

More information

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned

More information

Machine Learning Lecture 3

Machine Learning Lecture 3 Machine Learning Lecture 3 Probability Density Estimation II 19.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Announcements Exam dates We re in the process

More information

Points Lines Connected points X-Y Scatter. X-Y Matrix Star Plot Histogram Box Plot. Bar Group Bar Stacked H-Bar Grouped H-Bar Stacked

Points Lines Connected points X-Y Scatter. X-Y Matrix Star Plot Histogram Box Plot. Bar Group Bar Stacked H-Bar Grouped H-Bar Stacked Plotting Menu: QCExpert Plotting Module graphs offers various tools for visualization of uni- and multivariate data. Settings and options in different types of graphs allow for modifications and customizations

More information

Introduction to Nonparametric/Semiparametric Econometric Analysis: Implementation

Introduction to Nonparametric/Semiparametric Econometric Analysis: Implementation to Nonparametric/Semiparametric Econometric Analysis: Implementation Yoichi Arai National Graduate Institute for Policy Studies 2014 JEA Spring Meeting (14 June) 1 / 30 Motivation MSE (MISE): Measures

More information

Machine Learning Lecture 3

Machine Learning Lecture 3 Many slides adapted from B. Schiele Machine Learning Lecture 3 Probability Density Estimation II 26.04.2016 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course

More information

University of Florida CISE department Gator Engineering. Clustering Part 2

University of Florida CISE department Gator Engineering. Clustering Part 2 Clustering Part 2 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville Partitional Clustering Original Points A Partitional Clustering Hierarchical

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10. Cluster

More information

Computer Vision 6 Segmentation by Fitting

Computer Vision 6 Segmentation by Fitting Computer Vision 6 Segmentation by Fitting MAP-I Doctoral Programme Miguel Tavares Coimbra Outline The Hough Transform Fitting Lines Fitting Curves Fitting as a Probabilistic Inference Problem Acknowledgements:

More information

Unit 7 Statistics. AFM Mrs. Valentine. 7.1 Samples and Surveys

Unit 7 Statistics. AFM Mrs. Valentine. 7.1 Samples and Surveys Unit 7 Statistics AFM Mrs. Valentine 7.1 Samples and Surveys v Obj.: I will understand the different methods of sampling and studying data. I will be able to determine the type used in an example, and

More information

Analysis of Functional MRI Timeseries Data Using Signal Processing Techniques

Analysis of Functional MRI Timeseries Data Using Signal Processing Techniques Analysis of Functional MRI Timeseries Data Using Signal Processing Techniques Sea Chen Department of Biomedical Engineering Advisors: Dr. Charles A. Bouman and Dr. Mark J. Lowe S. Chen Final Exam October

More information

Outline. Topic 16 - Other Remedies. Ridge Regression. Ridge Regression. Ridge Regression. Robust Regression. Regression Trees. Piecewise Linear Model

Outline. Topic 16 - Other Remedies. Ridge Regression. Ridge Regression. Ridge Regression. Robust Regression. Regression Trees. Piecewise Linear Model Topic 16 - Other Remedies Ridge Regression Robust Regression Regression Trees Outline - Fall 2013 Piecewise Linear Model Bootstrapping Topic 16 2 Ridge Regression Modification of least squares that addresses

More information

ChristoHouston Energy Inc. (CHE INC.) Pipeline Anomaly Analysis By Liquid Green Technologies Corporation

ChristoHouston Energy Inc. (CHE INC.) Pipeline Anomaly Analysis By Liquid Green Technologies Corporation ChristoHouston Energy Inc. () Pipeline Anomaly Analysis By Liquid Green Technologies Corporation CHE INC. Overview: Review of Scope of Work Wall thickness analysis - Pipeline and sectional statistics Feature

More information

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for

More information

The Ohio State University Columbus, Ohio, USA Universidad Autónoma de Nuevo León San Nicolás de los Garza, Nuevo León, México, 66450

The Ohio State University Columbus, Ohio, USA Universidad Autónoma de Nuevo León San Nicolás de los Garza, Nuevo León, México, 66450 Optimization and Analysis of Variability in High Precision Injection Molding Carlos E. Castro 1, Blaine Lilly 1, José M. Castro 1, and Mauricio Cabrera Ríos 2 1 Department of Industrial, Welding & Systems

More information

COMPUTATIONAL STATISTICS UNSUPERVISED LEARNING

COMPUTATIONAL STATISTICS UNSUPERVISED LEARNING COMPUTATIONAL STATISTICS UNSUPERVISED LEARNING Luca Bortolussi Department of Mathematics and Geosciences University of Trieste Office 238, third floor, H2bis luca@dmi.units.it Trieste, Winter Semester

More information

Aaron Daniel Chia Huang Licai Huang Medhavi Sikaria Signal Processing: Forecasting and Modeling

Aaron Daniel Chia Huang Licai Huang Medhavi Sikaria Signal Processing: Forecasting and Modeling Aaron Daniel Chia Huang Licai Huang Medhavi Sikaria Signal Processing: Forecasting and Modeling Abstract Forecasting future events and statistics is problematic because the data set is a stochastic, rather

More information

What s New in Spotfire DXP 1.1. Spotfire Product Management January 2007

What s New in Spotfire DXP 1.1. Spotfire Product Management January 2007 What s New in Spotfire DXP 1.1 Spotfire Product Management January 2007 Spotfire DXP Version 1.1 This document highlights the new capabilities planned for release in version 1.1 of Spotfire DXP. In this

More information

8/3/2017. Contour Assessment for Quality Assurance and Data Mining. Objective. Outline. Tom Purdie, PhD, MCCPM

8/3/2017. Contour Assessment for Quality Assurance and Data Mining. Objective. Outline. Tom Purdie, PhD, MCCPM Contour Assessment for Quality Assurance and Data Mining Tom Purdie, PhD, MCCPM Objective Understand the state-of-the-art in contour assessment for quality assurance including data mining-based techniques

More information

Computer Experiments: Space Filling Design and Gaussian Process Modeling

Computer Experiments: Space Filling Design and Gaussian Process Modeling Computer Experiments: Space Filling Design and Gaussian Process Modeling Best Practice Authored by: Cory Natoli Sarah Burke, Ph.D. 30 March 2018 The goal of the STAT COE is to assist in developing rigorous,

More information

Lecture 9: Hough Transform and Thresholding base Segmentation

Lecture 9: Hough Transform and Thresholding base Segmentation #1 Lecture 9: Hough Transform and Thresholding base Segmentation Saad Bedros sbedros@umn.edu Hough Transform Robust method to find a shape in an image Shape can be described in parametric form A voting

More information

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A. 205-206 Pietro Guccione, PhD DEI - DIPARTIMENTO DI INGEGNERIA ELETTRICA E DELL INFORMAZIONE POLITECNICO DI BARI

More information

Analysing Search Trends

Analysing Search Trends Data Mining in Business Intelligence 7 March 2013, Ben-Gurion University Analysing Search Trends Yair Shimshoni, Google R&D center, Tel-Aviv. shimsh@google.com Outline What are search trends? The Google

More information

Data transformation in multivariate quality control

Data transformation in multivariate quality control Motto: Is it normal to have normal data? Data transformation in multivariate quality control J. Militký and M. Meloun The Technical University of Liberec Liberec, Czech Republic University of Pardubice

More information

Applying Supervised Learning

Applying Supervised Learning Applying Supervised Learning When to Consider Supervised Learning A supervised learning algorithm takes a known set of input data (the training set) and known responses to the data (output), and trains

More information

The Curse of Dimensionality

The Curse of Dimensionality The Curse of Dimensionality ACAS 2002 p1/66 Curse of Dimensionality The basic idea of the curse of dimensionality is that high dimensional data is difficult to work with for several reasons: Adding more

More information

Clustering: Classic Methods and Modern Views

Clustering: Classic Methods and Modern Views Clustering: Classic Methods and Modern Views Marina Meilă University of Washington mmp@stat.washington.edu June 22, 2015 Lorentz Center Workshop on Clusters, Games and Axioms Outline Paradigms for clustering

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-31-017 Outline Background Defining proximity Clustering methods Determining number of clusters Comparing two solutions Cluster analysis as unsupervised Learning

More information

Introduction to Trajectory Clustering. By YONGLI ZHANG

Introduction to Trajectory Clustering. By YONGLI ZHANG Introduction to Trajectory Clustering By YONGLI ZHANG Outline 1. Problem Definition 2. Clustering Methods for Trajectory data 3. Model-based Trajectory Clustering 4. Applications 5. Conclusions 1 Problem

More information

Vocabulary. 5-number summary Rule. Area principle. Bar chart. Boxplot. Categorical data condition. Categorical variable.

Vocabulary. 5-number summary Rule. Area principle. Bar chart. Boxplot. Categorical data condition. Categorical variable. 5-number summary 68-95-99.7 Rule Area principle Bar chart Bimodal Boxplot Case Categorical data Categorical variable Center Changing center and spread Conditional distribution Context Contingency table

More information

Solution Sketches Midterm Exam COSC 6342 Machine Learning March 20, 2013

Solution Sketches Midterm Exam COSC 6342 Machine Learning March 20, 2013 Your Name: Your student id: Solution Sketches Midterm Exam COSC 6342 Machine Learning March 20, 2013 Problem 1 [5+?]: Hypothesis Classes Problem 2 [8]: Losses and Risks Problem 3 [11]: Model Generation

More information

Some questions of consensus building using co-association

Some questions of consensus building using co-association Some questions of consensus building using co-association VITALIY TAYANOV Polish-Japanese High School of Computer Technics Aleja Legionow, 4190, Bytom POLAND vtayanov@yahoo.com Abstract: In this paper

More information

Active Appearance Models

Active Appearance Models Active Appearance Models Edwards, Taylor, and Cootes Presented by Bryan Russell Overview Overview of Appearance Models Combined Appearance Models Active Appearance Model Search Results Constrained Active

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-25-2018 Outline Background Defining proximity Clustering methods Determining number of clusters Other approaches Cluster analysis as unsupervised Learning Unsupervised

More information

User Behaviour and Platform Performance. in Mobile Multiplayer Environments

User Behaviour and Platform Performance. in Mobile Multiplayer Environments User Behaviour and Platform Performance in Mobile Multiplayer Environments HELSINKI UNIVERSITY OF TECHNOLOGY Systems Analysis Laboratory Ilkka Hirvonen 51555K 1 Introduction As mobile technologies advance

More information

2014 Stat-Ease, Inc. All Rights Reserved.

2014 Stat-Ease, Inc. All Rights Reserved. What s New in Design-Expert version 9 Factorial split plots (Two-Level, Multilevel, Optimal) Definitive Screening and Single Factor designs Journal Feature Design layout Graph Columns Design Evaluation

More information

3 Graphical Displays of Data

3 Graphical Displays of Data 3 Graphical Displays of Data Reading: SW Chapter 2, Sections 1-6 Summarizing and Displaying Qualitative Data The data below are from a study of thyroid cancer, using NMTR data. The investigators looked

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to JULY 2011 Afsaneh Yazdani What motivated? Wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge What motivated? Data

More information

Machine Learning for Pre-emptive Identification of Performance Problems in UNIX Servers Helen Cunningham

Machine Learning for Pre-emptive Identification of Performance Problems in UNIX Servers Helen Cunningham Final Report for cs229: Machine Learning for Pre-emptive Identification of Performance Problems in UNIX Servers Helen Cunningham Abstract. The goal of this work is to use machine learning to understand

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

Exploratory data analysis for microarrays

Exploratory data analysis for microarrays Exploratory data analysis for microarrays Jörg Rahnenführer Computational Biology and Applied Algorithmics Max Planck Institute for Informatics D-66123 Saarbrücken Germany NGFN - Courses in Practical DNA

More information

Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga.

Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga. Americo Pereira, Jan Otto Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga. ABSTRACT In this paper we want to explain what feature selection is and

More information

Chapter 10. Conclusion Discussion

Chapter 10. Conclusion Discussion Chapter 10 Conclusion 10.1 Discussion Question 1: Usually a dynamic system has delays and feedback. Can OMEGA handle systems with infinite delays, and with elastic delays? OMEGA handles those systems with

More information

ELEC Dr Reji Mathew Electrical Engineering UNSW

ELEC Dr Reji Mathew Electrical Engineering UNSW ELEC 4622 Dr Reji Mathew Electrical Engineering UNSW Review of Motion Modelling and Estimation Introduction to Motion Modelling & Estimation Forward Motion Backward Motion Block Motion Estimation Motion

More information

Data Mining and Analytics. Introduction

Data Mining and Analytics. Introduction Data Mining and Analytics Introduction Data Mining Data mining refers to extracting or mining knowledge from large amounts of data It is also termed as Knowledge Discovery from Data (KDD) Mostly, data

More information

Graph Structure Over Time

Graph Structure Over Time Graph Structure Over Time Observing how time alters the structure of the IEEE data set Priti Kumar Computer Science Rensselaer Polytechnic Institute Troy, NY Kumarp3@rpi.edu Abstract This paper examines

More information

* Hyun Suk Park. Korea Institute of Civil Engineering and Building, 283 Goyangdae-Ro Goyang-Si, Korea. Corresponding Author: Hyun Suk Park

* Hyun Suk Park. Korea Institute of Civil Engineering and Building, 283 Goyangdae-Ro Goyang-Si, Korea. Corresponding Author: Hyun Suk Park International Journal Of Engineering Research And Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 13, Issue 11 (November 2017), PP.47-59 Determination of The optimal Aggregation

More information

CHAPTER-13. Mining Class Comparisons: Discrimination between DifferentClasses: 13.4 Class Description: Presentation of Both Characterization and

CHAPTER-13. Mining Class Comparisons: Discrimination between DifferentClasses: 13.4 Class Description: Presentation of Both Characterization and CHAPTER-13 Mining Class Comparisons: Discrimination between DifferentClasses: 13.1 Introduction 13.2 Class Comparison Methods and Implementation 13.3 Presentation of Class Comparison Descriptions 13.4

More information

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining. About the Tutorial Data Mining is defined as the procedure of extracting information from huge sets of data. In other words, we can say that data mining is mining knowledge from data. The tutorial starts

More information

Lecture 27, April 24, Reading: See class website. Nonparametric regression and kernel smoothing. Structured sparse additive models (GroupSpAM)

Lecture 27, April 24, Reading: See class website. Nonparametric regression and kernel smoothing. Structured sparse additive models (GroupSpAM) School of Computer Science Probabilistic Graphical Models Structured Sparse Additive Models Junming Yin and Eric Xing Lecture 7, April 4, 013 Reading: See class website 1 Outline Nonparametric regression

More information

3 Feature Selection & Feature Extraction

3 Feature Selection & Feature Extraction 3 Feature Selection & Feature Extraction Overview: 3.1 Introduction 3.2 Feature Extraction 3.3 Feature Selection 3.3.1 Max-Dependency, Max-Relevance, Min-Redundancy 3.3.2 Relevance Filter 3.3.3 Redundancy

More information

Data Polygamy: The Many-Many Relationships among Urban Spatio-Temporal Data Sets. Fernando Chirigati Harish Doraiswamy Theodoros Damoulas

Data Polygamy: The Many-Many Relationships among Urban Spatio-Temporal Data Sets. Fernando Chirigati Harish Doraiswamy Theodoros Damoulas Data Polygamy: The Many-Many Relationships among Urban Spatio-Temporal Data Sets Fernando Chirigati Harish Doraiswamy Theodoros Damoulas Juliana Freire New York University New York University University

More information

The Perils of Unfettered In-Sample Backtesting

The Perils of Unfettered In-Sample Backtesting The Perils of Unfettered In-Sample Backtesting Tyler Yeats June 8, 2015 Abstract When testing a financial investment strategy, it is common to use what is known as a backtest, or a simulation of how well

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

Application of Clustering Techniques to Energy Data to Enhance Analysts Productivity

Application of Clustering Techniques to Energy Data to Enhance Analysts Productivity Application of Clustering Techniques to Energy Data to Enhance Analysts Productivity Wendy Foslien, Honeywell Labs Valerie Guralnik, Honeywell Labs Steve Harp, Honeywell Labs William Koran, Honeywell Atrium

More information

Introduction to Mobile Robotics

Introduction to Mobile Robotics Introduction to Mobile Robotics Clustering Wolfram Burgard Cyrill Stachniss Giorgio Grisetti Maren Bennewitz Christian Plagemann Clustering (1) Common technique for statistical data analysis (machine learning,

More information

Welcome to Analytics. Welcome to Applause! Table of Contents:

Welcome to Analytics. Welcome to Applause! Table of Contents: Welcome to Applause! Your success is our priority and we want to make sure Applause Analytics (ALX) provides you with actionable insight into what your users are thinking and saying about their experiences

More information

Applied Bayesian Nonparametrics 5. Spatial Models via Gaussian Processes, not MRFs Tutorial at CVPR 2012 Erik Sudderth Brown University

Applied Bayesian Nonparametrics 5. Spatial Models via Gaussian Processes, not MRFs Tutorial at CVPR 2012 Erik Sudderth Brown University Applied Bayesian Nonparametrics 5. Spatial Models via Gaussian Processes, not MRFs Tutorial at CVPR 2012 Erik Sudderth Brown University NIPS 2008: E. Sudderth & M. Jordan, Shared Segmentation of Natural

More information

Automate Transform Analyze

Automate Transform Analyze Competitive Intelligence 2.0 Turning the Web s Big Data into Big Insights Automate Transform Analyze Introduction Today, the web continues to grow at a dizzying pace. There are more than 1 billion websites

More information

MATH3016: OPTIMIZATION

MATH3016: OPTIMIZATION MATH3016: OPTIMIZATION Lecturer: Dr Huifu Xu School of Mathematics University of Southampton Highfield SO17 1BJ Southampton Email: h.xu@soton.ac.uk 1 Introduction What is optimization? Optimization is

More information

Locating Salient Object Features

Locating Salient Object Features Locating Salient Object Features K.N.Walker, T.F.Cootes and C.J.Taylor Dept. Medical Biophysics, Manchester University, UK knw@sv1.smb.man.ac.uk Abstract We present a method for locating salient object

More information

Predict Outcomes and Reveal Relationships in Categorical Data

Predict Outcomes and Reveal Relationships in Categorical Data PASW Categories 18 Specifications Predict Outcomes and Reveal Relationships in Categorical Data Unleash the full potential of your data through predictive analysis, statistical learning, perceptual mapping,

More information

High-Dimensional Incremental Divisive Clustering under Population Drift

High-Dimensional Incremental Divisive Clustering under Population Drift High-Dimensional Incremental Divisive Clustering under Population Drift Nicos Pavlidis Inference for Change-Point and Related Processes joint work with David Hofmeyr and Idris Eckley Clustering Clustering:

More information

CS Introduction to Data Mining Instructor: Abdullah Mueen

CS Introduction to Data Mining Instructor: Abdullah Mueen CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen LECTURE 8: ADVANCED CLUSTERING (FUZZY AND CO -CLUSTERING) Review: Basic Cluster Analysis Methods (Chap. 10) Cluster Analysis: Basic Concepts

More information

Conditional Volatility Estimation by. Conditional Quantile Autoregression

Conditional Volatility Estimation by. Conditional Quantile Autoregression International Journal of Mathematical Analysis Vol. 8, 2014, no. 41, 2033-2046 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/ijma.2014.47210 Conditional Volatility Estimation by Conditional Quantile

More information

Network Heartbeat Traffic Characterization. Mackenzie Haffey Martin Arlitt Carey Williamson Department of Computer Science University of Calgary

Network Heartbeat Traffic Characterization. Mackenzie Haffey Martin Arlitt Carey Williamson Department of Computer Science University of Calgary Network Heartbeat Traffic Characterization Mackenzie Haffey Martin Arlitt Carey Williamson Department of Computer Science University of Calgary What is a Network Heartbeat? An event that occurs repeatedly

More information

INDEX UNIT 4 PPT SLIDES

INDEX UNIT 4 PPT SLIDES INDEX UNIT 4 PPT SLIDES S.NO. TOPIC 1. 2. Screen designing Screen planning and purpose arganizing screen elements 3. 4. screen navigation and flow Visually pleasing composition 5. 6. 7. 8. focus and emphasis

More information

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining Data Mining: Data Lecture Notes for Chapter 2 Introduction to Data Mining by Tan, Steinbach, Kumar Data Preprocessing Aggregation Sampling Dimensionality Reduction Feature subset selection Feature creation

More information