
DI TRANSFORM MVstats™ Algorithm Overview

July 2, 2015

Summary

The DI Transform Multivariate Statistics (MVstats™) package includes five algorithm options that operate on most types of geologic, geophysical, and engineering data. Three classification tools use different machine learning algorithms to sort data into clusters based on similarity and return a class assignment for each data point. These options are unsupervised, supervised, and hierarchical classification. All three are used to identify features, explore properties, and determine the location of data (if the input data has a spatial component).

The MVstats™ package also includes two predictive multivariate regression tools: linear regression and nonlinear regression. The regression analyses identify relationships between predictor variables and a response variable to construct a model that you can use to predict the value of the response variable where it is unknown. In addition, the nonlinear regression model has applications beyond basic variable prediction because it includes simulation tools that let you perform what-if queries of the model. Both regression tools have out-of-sample model validation features that make it easy to assess the accuracy of the model.

Finally, you can obtain high-quality results faster from any MVstats™ algorithm if you use the outlier and multicollinearity analysis data preparation tools before constructing a model. These tools are part of the MVstats™ package.

Classification Methods

Classification is used to explore data and identify features. With geophysical data, you can use classification to identify facies in a volume from seismic attributes. The same approach works with well logs for facies identification in a vertical profile.
You might use classification to analyze large volumes of completions and production information to identify the most effective completion design. When a classification model is applied, a class assignment is determined for each data point, which could be a well, a well log measured depth, or a location within a seismic volume.

Unsupervised Classification

The unsupervised classification tool does not require training data and is often the best option for exploring large datasets because the algorithm operates efficiently on the raw data. Unsupervised classification uses k-means¹ clustering, which partitions the data into a specified number of mutually exclusive clusters. These clusters are optimized so that the data points within each cluster are as close to one another as possible but as far as possible from data points in other clusters. Each cluster is represented by a centroid, and a centroid value is reported for each input variable. The centroid values describe the properties of each cluster. These values are calculated at the location within the cluster where the sum of distances from all data points is minimized.

Hierarchical Classification

The hierarchical classification algorithm² identifies classes that have a genetic relationship to one another. An advantage of this approach is control: you can direct the model to search for smaller, more nuanced classes contained within a larger group. The algorithm starts with a single originating class that is subdivided into child classes. These can then be further subdivided to form a tree. Child classes of the same parent are more similar to each other than to child classes of a different parent. The lowest-level classes, those that are children but not parents, are the ones defined in the final model. Hierarchical classification is sensitive to outliers, so it is important to perform Outlier Analysis prior to modeling.

¹ You can find a technical description of the k-means algorithm in the following: Ding, C. and He, X. (2004). K-means Clustering via Principal Component Analysis. Proceedings of the 21st International Conference on Machine Learning. Banff, Canada.

² Additional algorithm details are found here: Luo, F., Khan, L., Bastani, F., Yen, I., and Zhou, J. (2004). A dynamically growing self-organizing tree for hierarchical clustering gene expression profiles. Bioinformatics Advance Access.

Supervised Classification

A training dataset is required to perform supervised classification, which is also known as discriminant analysis.
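To make the k-means partitioning described under Unsupervised Classification more concrete, here is a minimal sketch of the textbook algorithm (Lloyd's iteration) in Python with NumPy. It is not the DI Transform implementation. The function names, the synthetic two-variable data, and the choice to pass in starting centroids are all assumptions made for illustration. This textbook variant places each centroid at the mean of its members, which minimizes the sum of squared Euclidean distances.

```python
import numpy as np

def assign(data, centroids):
    """Return the index of the nearest centroid for every data point."""
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(data, init_centroids, n_iter=100):
    """Textbook k-means: alternate assignment and centroid update."""
    centroids = np.asarray(init_centroids, dtype=float)
    for _ in range(n_iter):
        labels = assign(data, centroids)
        new_centroids = centroids.copy()
        for j in range(len(centroids)):
            members = data[labels == j]
            if len(members):  # leave an empty cluster's centroid where it is
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return assign(data, centroids), centroids

# Hypothetical data: two well-separated groups, two input variables each.
rng = np.random.default_rng(1)
points = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2)),
])
# Start from one point of each group so the example is deterministic.
labels, centroids = kmeans(points, init_centroids=points[[0, -1]])
print(labels[:5], labels[-5:])
print(centroids)  # one row per cluster, one centroid value per input variable
```

Each row of `centroids` holds one value per input variable, which matches the way centroid values are reported per variable in the description above.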
Currently, DI Transform only supports the use of facies logs for training a supervised model; this limits the tool to well log analysis. In addition to a facies log, you must supply a set of standard well logs (for example, gamma ray and resistivity). These logs are analyzed to describe each facies class, with the ultimate goal of producing a model that can identify facies from a set of standard well logs alone. If a facies log is available, supervised classification is a powerful tool for well log classification because the model sees the answer and is allowed to work backwards from the desired results.

Supervised classification is accomplished in four steps. First, the facies log supplies the model with a class assignment for every measured depth. Then, the discriminant analysis is performed on the data within each class to produce characteristic parameters describing the class. Next, the tool examines the standard well log values at every measured depth and assigns the class whose characteristic parameters most closely match the data. Finally, differences between the original facies log and the modeled facies are reported in a table and can be examined visually with a side-by-side comparison of the logs. These differences are a signal that additional information is needed to distinguish facies of interest.

Predictive Methods

Regression models analyze data collected in the past to identify relationships to apply in the future or to fill gaps in data. A geologist might use a regression model to predict porosity or pore pressure from well logs. An engineer might use a regression model to predict production from completions parameters and geologic characteristics.

DI Transform offers linear and nonlinear regression modeling tools. With both approaches, relationships between multiple independent predictor variables and a single dependent response variable are identified and combined linearly to produce a model that predicts the response variable. Both models search for the best combination of regression coefficients to apply to the predictor variables so that the error between the model's prediction of the response variable and the actual value is minimized. The major difference between the two methods is the shape that the relationships between predictor and response variables are allowed to take. With linear regression, relationships must be linear; with nonlinear regression, relationships can be more complex.

Out-of-sample validation tools are offered for both linear and nonlinear regression. These tools withhold a portion of the available regression data, build a model with the remainder, and compare the model prediction of the withheld data to the actual values. The N folds tool divides the regression data into N portions and then performs the out-of-sample analysis N times, once with each fold withheld. The leave-one-out method withholds a single regression sample, with the out-of-sample analysis performed as many times as the user specifies. The average absolute error and error standard deviation of the out-of-sample analyses are reported for both methods.

Linear Principal Components Regression Analysis

DI Transform linear regression harnesses the power of principal components analysis (PCA). The advantage of this approach is that results are not negatively affected when redundant variables are included in a model. This makes it a good option for well log analysis, where certain logs might track one another within different materials.
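The N folds validation described above can be illustrated with a short Python sketch. It uses an ordinary least-squares model as a stand-in, since the details of the DI Transform regression tools are not given here. The function name `n_folds_validation`, the synthetic data, and the reading of "error standard deviation" as the standard deviation of the signed out-of-sample errors are assumptions.

```python
import numpy as np

def n_folds_validation(X, y, n_folds=5, seed=0):
    """Withhold each fold once, fit on the rest, and collect the errors."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))          # shuffle before splitting
    folds = np.array_split(order, n_folds)   # N roughly equal portions
    errors = []
    for held_out in folds:
        train = np.setdiff1d(order, held_out)
        # Least-squares fit with an intercept column (stand-in model).
        A_train = np.column_stack([np.ones(len(train)), X[train]])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        A_test = np.column_stack([np.ones(len(held_out)), X[held_out]])
        errors.append(A_test @ coef - y[held_out])
    errors = np.concatenate(errors)
    return np.abs(errors).mean(), errors.std()

# Hypothetical regression data: three predictors, one response.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = 2.0 + X @ np.array([1.5, -0.5, 0.25]) + rng.normal(scale=0.1, size=100)

mean_abs_error, error_std = n_folds_validation(X, y, n_folds=5)
print(mean_abs_error, error_std)
```

Because every sample is withheld exactly once, the two reported numbers summarize how the model performs on data it did not see during fitting.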
PCA optimally fits a series of orthogonal vectors through the multidimensional cloud of input data and describes it in the most efficient way possible. The first eigenvector, or principal component, is fit through the data cloud in its widest direction, so it explains the largest possible variance in the data. The second principal component, which must be orthogonal to the first, describes the largest amount of remaining variance. More components are added until the data is sufficiently explained or until the number of components equals the number of variables. A regression model is then built using the principal components. When the model is applied, the predictor variable values are mapped onto the coordinate system of the principal components, and the response variable is predicted from the principal component regression model.

Nonlinear Regression

Nonlinear regression allows for complex transformations of the predictor variables. This increases the predictive power of the model because it is better able to use information from predictor variables that do not have a linear relationship with the response variable. It is also purposefully designed not to be a black box. The optimal transformations identified by the model are displayed so that you can exercise your expertise and intuition to evaluate and tune the model. This helps ensure that the model is built on physically reasonable relationships and is not biased by unique features of the regression data. This is not the case with neural network-based prediction models, which do not allow for expert override and are vulnerable to over-fitting unless analyses are performed using very large datasets. The transparency of the DI Transform approach also lets you pull meaningful information from the variable transforms, including optimal predictor variable values and points of diminishing returns.

A weakness of the nonlinear regression method, however, is that it is sensitive to data redundancy; this can produce unintuitive predictor variable transforms. We recommend performing multicollinearity analysis before running nonlinear regression to safeguard against that possibility.

The first step in the nonlinear regression algorithm is to convert the response variable data to a standard normal distribution. This entails subtracting the mean from each data point and dividing the result by the standard deviation of the data. Then the predictor variable data is also transformed to have mean values of zero, sorted from smallest to largest, and scaled. Point-wise continuous transforms are applied to the predictor variables within the allowed relationships (linear, monotonic, higher order, or periodic) using a proprietary method. The algorithm iterates among the different transform options to minimize the error between the model prediction of the response variable and the actual value. This is a data-driven, non-parametric approach, meaning that no single equation describes the transform applied to a given predictor variable.

The model returns a validation plot comparing the model prediction of the response variable values to the actual values. The model also returns significance and sensitivity values for each predictor variable. The sensitivity value reports how much the model correlation coefficient would change if the variable were not included in the model.
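Two of the ideas above lend themselves to a small illustration: standardizing the response variable, and measuring sensitivity by refitting without one predictor. The sketch below is only an approximation of the idea. The nonlinear transforms are proprietary, so an ordinary least-squares fit is swapped in as the model, and the function names and synthetic data are hypothetical.

```python
import numpy as np

def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    return (values - values.mean()) / values.std()

def model_correlation(X, y):
    """Correlation between a least-squares prediction and the actual values."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.corrcoef(A @ coef, y)[0, 1]

def sensitivities(X, y):
    """Drop in model correlation when each predictor is left out in turn."""
    full = model_correlation(X, y)
    return np.array([
        full - model_correlation(np.delete(X, j, axis=1), y)
        for j in range(X.shape[1])
    ])

# Hypothetical data: the first predictor dominates the response.
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(scale=0.1, size=200)

z = standardize(y)          # response with zero mean and unit spread
sens = sensitivities(X, z)  # one sensitivity value per predictor
print(sens)
```

In this example the first predictor receives a much larger sensitivity than the second, because removing it leaves the model with almost nothing to explain the response.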
The significance value is the ratio of the range of the predictor variable in its transformed space to the range of the response variable in its transformed space. Large values indicate that a change in the predictor variable has a large impact on the value of the response variable.

Predictor variable contribution to the model is further examined in transformation plots. The model produces transformation plots for every predictor variable and for the response variable; these display the original variable values compared with the transformed values. Because the model is built in standard normal data space, the transformed variable axes are shown in relative units representing the contribution of the predictor variable to the prediction of the response variable, unless a simulation is performed. When a simulation is performed on a particular predictor variable, discrete values or data ranges of the other predictor variables are supplied to the model. The response variable is then predicted in physical units, for example barrels of oil (bbls), using the supplied values over the full range of the predictor variable. Specifying predictor variable values lets you query the model with what-if scenarios.

Data Preparation Tools

Outlier Analysis

Outliers make fundamental patterns and relationships in data difficult to identify. A model built on data that contains outliers will underperform at best and produce completely incorrect predictions at worst. We recommend removing outliers prior to any modeling effort. DI Transform includes an outlier analysis tool to make that process fast and straightforward.

Outlier analysis is launched from any correlation table; the analysis is performed only on the data in the table. A probability distribution function (PDF), which represents the probability of a random sample having a particular value, is calculated for each variable from the supplied data using the mean and standard deviation. A smoothing factor lets you control whether the PDF tracks the actual data distribution or a more idealized distribution. You specify an alpha, which controls when data is flagged as an outlier. For example, if alpha is set to 0.01, data points that fall under the PDF curve at or beyond the 0.5% probability cut-off level in either tail (high or low) are flagged as outliers. You can then decide whether to remove the flagged data points from the correlation table or retain them. Data is only removed from the correlation table; it is not removed from the database.

Multicollinearity Analysis

Multicollinearity analysis determines when two variables contain redundant information. Redundant information supplied to the nonlinear regression tool can produce unintuitive predictor variable transforms and should be avoided.

Multicollinearity analysis is launched from any correlation table. First, a maximum multiple correlation coefficient (RSQMAX) is specified. Then, the multiple correlation coefficient is calculated for different combinations of variables within the correlation table. If the multiple correlation coefficient exceeds RSQMAX, the variable with the highest pair-wise correlation with other variables is flagged as a candidate for rejection. You determine which variables to reject or retain. A variable that is rejected using the multicollinearity analysis tool is removed only from the correlation table, not from the database.

Conclusion

DI Transform offers a variety of multivariate analysis tools to take your geophysical, geological, or engineering workflow to a higher level without the pain of exporting information into a statistical software package.

Copyright 2015, Drillinginfo, Inc. All rights reserved.
