July 2, 2015

DI Transform MVstats™ Algorithm Overview

Summary

The DI Transform Multivariate Statistics (MVstats™) package includes five algorithm options that operate on most types of geologic, geophysical, and engineering data. There are three classification tools that use different machine learning algorithms to sort data into clusters based on similarity and return a class assignment for each data point. These options include unsupervised, supervised, and hierarchical classification. All are used to identify features, explore properties, and determine the location of data (if the input data has a spatial component). The MVstats™ package also includes two predictive multivariate regression tools: linear regression and nonlinear regression. The regression analyses identify relationships between predictor variables and a response variable to construct a model that you can use to predict the value of the response variable where it is unknown. In addition, the nonlinear regression model has applications beyond basic variable prediction: it includes simulation tools that allow you to perform what-if queries of the model. Both regression tools have out-of-sample model validation features that make it easy to assess the accuracy of the model. Finally, you can obtain high-quality results faster from any MVstats™ algorithm if the outlier and multicollinearity analysis data preparation tools are used prior to model construction. These tools are part of the MVstats™ package.

Classification Methods

Classification is used to explore data and identify features. With respect to geophysical data, you can identify facies in a volume from seismic attributes using classification. The same approach works with well logs for facies identification in a vertical profile. You might also use classification to analyze large volumes of completions and production information to identify the most effective completion design. When a classification model is applied, a class assignment is determined for each data point, which could be a well, a well log measured depth, or a location within a seismic volume.

Unsupervised Classification

The unsupervised classification tool does not require training data and is often the best option for exploring large datasets because the algorithm operates efficiently on the raw data. Unsupervised classification uses k-means¹ clustering, which partitions the data into a specified number of mutually exclusive clusters. These clusters are optimized so that the data points within each cluster are as close to one another as possible but as far as possible from data points in other clusters. Each cluster is represented by a centroid, and a centroid value is reported for each input variable. The centroid values describe the properties of each cluster; they are calculated at the location within the cluster where the sum of distances from all data points is minimized.
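
As a rough illustration of this approach, the sketch below partitions synthetic data with scikit-learn's k-means. The data, cluster count, and library choice are illustrative assumptions, not DI Transform's internal implementation.

```python
# Minimal k-means sketch; illustrates the kind of partitioning described
# above, not DI Transform's internal implementation.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical input: rows are data points (e.g., seismic-attribute
# samples), columns are input variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))

# Standardize so no single variable dominates the distance metric.
X_std = StandardScaler().fit_transform(X)

# Partition into a user-specified number of mutually exclusive clusters.
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_std)

labels = km.labels_               # class assignment for each data point
centroids = km.cluster_centers_   # one centroid value per input variable
print(centroids)
```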

Hierarchical Classification

The hierarchical classification algorithm² identifies classes that have a genetic relationship to one another. An advantage of this approach is control: you can direct the model to search for smaller, more nuanced classes contained within a larger group. The algorithm starts with a single originating class that is subdivided into child classes, which can then be further subdivided to form a tree. Child classes of the same parent are more similar to each other than child classes of a different parent. The lowest-level classes, those that are children but not parents, are the ones defined in the final model. Hierarchical classification is sensitive to outliers, so it is important to perform Outlier Analysis prior to modeling.
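
DI Transform's algorithm is a divisive, top-down self-organizing tree (see note 2). The sketch below instead uses SciPy's agglomerative linkage, purely to illustrate the parent/child tree structure and the idea of cutting the tree into final classes; it is a stand-in, not the DI Transform algorithm.

```python
# Hierarchical clustering sketch with SciPy; illustrative stand-in for the
# divisive self-organizing tree described in the text.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))  # hypothetical data points

# Build the cluster tree; Ward linkage joins the most similar groups first.
Z = linkage(X, method="ward")

# Cut the tree so the lowest-level branches become the final classes.
labels = fcluster(Z, t=5, criterion="maxclust")
print(np.bincount(labels))  # size of each final class
```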

Supervised Classification

A training dataset is required to perform supervised classification, which is also known as discriminant analysis. Currently, DI Transform only supports the use of facies logs for training a supervised model; this limits the tool to well log analysis. In addition to a facies log, you must supply a set of standard well logs (for example, gamma ray and resistivity) that are analyzed to describe each facies class, with the ultimate goal of producing a model that can identify facies from a set of standard well logs alone. If a facies log is available, supervised classification is a powerful tool for well log classification because the model sees the answer and is allowed to work backwards from the desired results.

Supervised classification is accomplished in four steps. First, the facies log supplies the model with a class assignment for every measured depth. Second, discriminant analysis is performed on the data within each class to produce characteristic parameters describing the class. Third, the tool examines the standard well log values at every measured depth and assigns the class whose characteristic parameters most closely match the data. Finally, differences between the original facies log and the modeled facies are reported in a table and can be examined visually with a side-by-side comparison of the logs. These differences are a signal that additional information is needed to distinguish the facies of interest.
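
A minimal sketch of this four-step idea follows, using scikit-learn's linear discriminant analysis as a stand-in classifier; the log names, facies data, and library are illustrative assumptions.

```python
# Sketch of the supervised (discriminant-analysis) workflow: learn facies
# classes from standard well logs. Data and classifier are illustrative.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
# Hypothetical training data: one row per measured depth.
logs = rng.normal(size=(1000, 2))        # e.g., gamma ray and resistivity
facies = rng.integers(0, 3, size=1000)   # class labels from the facies log

# Step 2: characterize each class from the data within it.
lda = LinearDiscriminantAnalysis().fit(logs, facies)

# Step 3: assign the closest-matching class at every measured depth.
modeled = lda.predict(logs)

# Step 4: report differences between the original and modeled facies.
mismatch = np.mean(modeled != facies)
print(f"fraction of depths where modeled facies differ: {mismatch:.2%}")
```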

Predictive Methods

Regression models analyze data collected in the past to identify relationships that can be applied in the future or used to fill gaps in data. A geologist might use a regression model to predict porosity or pore pressure from well logs. An engineer might use a regression model to predict production from completions parameters and geologic characteristics. DI Transform offers linear and nonlinear regression modeling tools. With both approaches, relationships between multiple independent predictor variables and a single dependent response variable are identified and combined linearly to produce a model that predicts the response variable. Both models search for the combination of regression coefficients that minimizes the error between the model's prediction of the response variable and the actual value. The major difference between the two methods is the shape the relationships between predictor and response variables are allowed to take: with linear regression, relationships must be linear; with nonlinear regression, relationships can be more complex.

Out-of-sample validation tools are offered for both linear and nonlinear regression. These tools withhold a portion of the possible regression data, build a model with the remainder, and compare the model's prediction of the withheld data to the actual values. The N-folds tool divides the regression data into N portions and then performs the out-of-sample analysis N times, once with each fold withheld. The leave-one-out method withholds a single regression sample, with the out-of-sample analysis performed as many times as the user specifies. The average absolute error and error standard deviation of the out-of-sample analyses are reported for both methods.
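
The sketch below illustrates both withholding schemes on synthetic data, reporting the average absolute error and its standard deviation as described above. It uses scikit-learn's splitters and runs the full leave-one-out pass for simplicity, rather than a user-specified number of repeats.

```python
# Out-of-sample validation sketch: N-fold and leave-one-out withholding.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))                        # predictor variables
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=100)

def out_of_sample_errors(splitter):
    """Withhold each test portion, fit on the rest, collect abs errors."""
    errors = []
    for train, test in splitter.split(X):
        model = LinearRegression().fit(X[train], y[train])
        errors.extend(np.abs(model.predict(X[test]) - y[test]))
    return np.mean(errors), np.std(errors)

print("N folds      :", out_of_sample_errors(KFold(n_splits=5, shuffle=True, random_state=0)))
print("leave-one-out:", out_of_sample_errors(LeaveOneOut()))
```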

Linear Principal Components Regression Analysis

DI Transform linear regression harnesses the power of principal components analysis (PCA). The advantage of this approach is that results are not negatively affected when redundant variables are included in a model, which makes it a good option for well log analysis, where certain logs might track one another within different materials. PCA optimally fits a series of orthogonal vectors through the multidimensional cloud of input data and describes it in the most efficient way possible. The first eigenvector, or principal component, is fit through the data cloud in its widest direction, so it explains the largest possible variance in the data. The second principal component, which must be orthogonal to the first, describes the largest amount of remaining variance. More components are added until the data is sufficiently explained or until the number of components equals the number of variables. A regression model is then built using the principal components. When the model is applied, the predictor variable values are mapped onto the coordinate system of the principal components, and the response variable is predicted from the principal component regression model.
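
A compact sketch of principal components regression in this spirit follows, on synthetic data with one deliberately redundant predictor; the 95% explained-variance cut-off and the scikit-learn pipeline are illustrative choices, not DI Transform's settings.

```python
# Principal components regression sketch: fit orthogonal components, then
# regress the response on them. Redundant predictors do no harm.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 5))                         # predictor variables
X[:, 4] = X[:, 0] + 0.01 * rng.normal(size=300)       # a redundant variable
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=300)

# Keep components until the data is "sufficiently explained" (here, 95%).
pcr = make_pipeline(StandardScaler(), PCA(n_components=0.95), LinearRegression())
pcr.fit(X, y)

# Applying the model maps predictor values onto the principal-component
# coordinate system before predicting the response.
print(pcr.predict(X[:3]))
```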

Nonlinear Regression

Nonlinear regression allows for complex transformations of the predictor variables. This increases the predictive power of the model because it is better able to utilize information from predictor variables that do not have a linear relationship with the response variable. It is also purposefully designed not to be a black box: the optimal transformations identified by the model are displayed so that you can exercise your expertise and intuition to evaluate and tune the model. This ensures that the model is built on physically reasonable relationships and is not biased by unique features of the regression data. That is not the case with neural network-based prediction models, which do not allow for expert override and are vulnerable to over-fitting unless analyses are performed on very large datasets. The transparency of the DI Transform approach also lets you pull meaningful information from the variable transforms, including optimal predictor variable values and points of diminishing returns. A weakness of the nonlinear regression method, however, is that it is sensitive to data redundancy, which can produce unintuitive predictor variable transforms. We recommend performing multicollinearity analysis before running nonlinear regression to safeguard against that possibility.

The first step in the nonlinear regression algorithm is to convert the response variable data to a standard normal distribution by subtracting the mean from each data point and dividing by the standard deviation of the data. The predictor variable data is likewise transformed to have mean values of zero, sorted from smallest to largest, and scaled. Point-wise continuous transforms are then applied to the predictor variables within the allowed relationships (linear, monotonic, higher order, or periodic) using a proprietary method. The algorithm iterates among the different transform options to minimize the error between the model's prediction of the response variable and the actual value. This is a data-driven, non-parametric approach, meaning that no single equation describes the transform applied to a given predictor variable.
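
The conditioning step can be sketched directly; the optimal-transform search itself is proprietary and is not reproduced here. The data below is hypothetical.

```python
# Sketch of the data conditioning described above: z-score the response and
# center/scale the predictors. The point-wise optimal transforms that follow
# in DI Transform are proprietary and are not reproduced here.
import numpy as np

rng = np.random.default_rng(5)
X = rng.lognormal(size=(200, 3))            # hypothetical predictors
y = rng.normal(loc=50, scale=8, size=200)   # hypothetical response

# Response: subtract the mean, divide by the standard deviation.
y_std = (y - y.mean()) / y.std()

# Predictors: zero mean and unit scale, column by column.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(round(float(y_std.mean()), 6), round(float(y_std.std()), 6))  # ~0, ~1
```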

The model returns a validation plot comparing the model's prediction of the response variable values to the actual values. The model also returns significance and sensitivity values for each predictor variable. The sensitivity value reports how much the model correlation coefficient would change if the variable were not included in the model. The significance value is the ratio of the range of the predictor variable in its transformed space to the range of the response variable in its transformed space, with large values indicating that a change in the predictor variable has a large impact on the value of the response variable.

Predictor variable contribution to the model is further examined in transformation plots. The model produces transformation plots for every predictor variable and the response variable, which display the original variable values compared with the transformed values. Because the model is built in standard normal data space, the transformed variable axes are shown in relative units representing the contribution of the predictor variable to the prediction of the response variable, unless a simulation is performed. When a simulation is performed on a particular predictor variable, discrete values or data ranges of the other predictor variables are supplied to the model. The response variable is then predicted in physical units, for example barrels of oil (bbls), using the supplied values over the full range of the predictor variable. Specifying predictor variable values lets you query the model with what-if scenarios.

Data Preparation Tools

Outlier Analysis

Outliers make fundamental patterns and relationships in data difficult to identify. A model built on data that contains outliers will underperform at best and produce completely incorrect predictions at worst. We recommend removing outliers prior to any modeling effort, and DI Transform includes an outlier analysis tool to make that process fast and straightforward. Outlier analysis is launched from any correlation table; the analysis is performed only on the data in the table. A probability distribution function (PDF), which represents the probability of a random sample having a particular value, is calculated for each variable from the supplied data using the mean and standard deviation. A smoothing factor lets you control whether the PDF tracks the actual data distribution or that of a more idealized distribution. You specify an alpha, which controls when data is flagged as an outlier. For example, if alpha is set to 0.01, data points that fall under the PDF curve at or below the two 0.5% probability cut-off levels (high or low) are flagged as outliers. You can then decide whether to remove the flagged data points from the correlation table or retain them. Data is only removed from the correlation table; it is not removed from the database.
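
A simplified rendition of this flagging rule on synthetic data follows; the smoothing factor is not modeled, and the normal-PDF fit from the sample mean and standard deviation is an assumption for illustration.

```python
# Outlier-flagging sketch: fit a normal PDF from the mean and standard
# deviation, then flag points in the two alpha/2 probability tails.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = np.concatenate([rng.normal(size=500), [8.0, -9.5]])  # planted outliers

alpha = 0.01
dist = stats.norm(loc=x.mean(), scale=x.std())

# With alpha = 0.01, the cut-offs sit at the 0.5% and 99.5% levels.
lo, hi = dist.ppf(alpha / 2), dist.ppf(1 - alpha / 2)
flagged = (x < lo) | (x > hi)
print(x[flagged])  # candidates; the user decides whether to remove them
```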

Multicollinearity Analysis

Multicollinearity analysis determines when two variables contain redundant information. Redundant information supplied to the nonlinear regression tool can produce unintuitive predictor variable transforms and should be avoided. Multicollinearity analysis is launched from any correlation table. First, a maximum multiple correlation coefficient (RSQMAX) is specified. Then, the multiple correlation coefficient is calculated for different combinations of variables within the correlation table. If the multiple correlation coefficient exceeds RSQMAX, the variable with the highest pair-wise correlation with the other variables is flagged as a candidate for rejection. You determine which variables to reject or retain. A variable rejected using the multicollinearity analysis tool is removed only from the correlation table, not from the database.
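
A simplified rendition of this check follows: each variable is regressed on the others, and any variable whose multiple correlation coefficient exceeds RSQMAX is reported as a rejection candidate. The threshold and data are illustrative, and the pair-wise tie-breaking rule described above is omitted for brevity.

```python
# Multicollinearity sketch: compute each variable's multiple correlation
# (R^2 against all other variables) and flag those exceeding RSQMAX.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 4))
X[:, 3] = 0.95 * X[:, 0] + 0.05 * rng.normal(size=300)  # planted redundancy

RSQMAX = 0.9
for j in range(X.shape[1]):
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    if r2 > RSQMAX:
        print(f"variable {j}: R^2 = {r2:.3f} exceeds RSQMAX; rejection candidate")
```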

Conclusion

DI Transform offers a variety of multivariate analysis tools to take your geophysical, geological, or engineering workflow to a higher level without the pain of exporting information into a statistical software package.

Notes

1 A technical description of the k-means algorithm can be found in: Ding C. and He X. (2004). K-means Clustering via Principal Component Analysis. Proceedings of the 21st International Conference on Machine Learning, Banff, Canada.

2 Additional algorithm details can be found in: Luo F., Khan L., Bastani F., Yen I., and Zhou J. (2004). A dynamically growing self-organizing tree for hierarchical clustering gene expression profiles. Bioinformatics Advance Access.

Copyright 2015, Drillinginfo, Inc. All rights reserved.