Priyank Srivastava (PE 5370: Mid-Term Project Report)


Contents

Executive Summary
PART 1: Identify Electrofacies from Given Logs Using Data Mining Algorithms
    Selection of wells
    Data cleaning and preparation of data for input to data mining
    Selection of data mining technique and workflow
    Mathematical background of PCA and K-means clustering
    Interpretation of results
    Relationship of predicted electrofacies with original variables
    The folly of trusting data mining
PART 2: Clustering using SOM and R
    Clustering and SOM in R
PART 3: Clustering using the merged dataset of all wells
Conclusion
Appendix A: R Code for Part III

Executive Summary

The objective of the present project is to prepare a data mining model to estimate electrofacies from a set of open-hole well logs. The trained model can then be used as a predictive tool for estimating unknown logs at any new location. The present workflow uses principal component analysis (PCA) and the K-means clustering algorithm to build the model. This report is divided into three parts.

In Part 1, the data mining algorithm is run on individual wells, using different attributes for each well depending on availability. The resulting clusters are mapped back to the individual wells based on gamma ray values, which broadly shows Facies 1 as high gamma ray, Facies 3 as a mixed sand-shale sequence, and Facies 2 as low gamma ray. The presence of these facies is then correlated with the corresponding production rates from different wells to assess the reservoir quality of each facies. Although K-means always converges, the answer it gives depends on the initial centers, and it returns centers that are averages of data points. Some of the wells (Young Joe; Flamik Randal) that do not have complete datasets do not show any clusters, so it is difficult to generalize the interpretation from this model. This part ends with a discussion of the various disadvantages of K-means clustering. The present data mining model could be used to predict the unknown logs in these wells, but that is outside the scope of this project.

The process of data mining helps uncover hidden patterns in the dataset by exposing relationships between attributes. The issue is that it also uncovers many patterns that are not useful; it is up to the domain expert to filter through the patterns and accept the ones that validly answer the objective question. Thus, in Part 2 some of the wells are clustered using self-organizing maps (SOM).

In Part 3, five attributes (GR, AT90, PEF, RHOB and NPHI) are merged for all ten selected wells and a similar workflow (PCA + K-means) is run to generate a generalized three-cluster model, from which the different facies and their characteristics are identified. Based on the study in Part 3, the findings can be summarized as follows:

Cluster 1: Shales/sands with low porosity (0.09) and low resistivity (9.12 ohm-m). Probably tight shales with high clay-bound water (high NPHI, 0.289).
Cluster 3: Shales/sands with very low porosity (0.038) but higher resistivity (16.26 ohm-m) and grain density than Facies 1. Probably contains hydrocarbon saturation and less water.
Cluster 2: Probably the most prospective zone in this region, with good porosity and high hydrocarbon saturation; the well with the highest proportion of Facies 2 should therefore be the most prolific producer.

PART 1: Identify Electrofacies from Given Logs Using Data Mining Algorithms

Selection of wells

I chose the wells according to their API numbers, so ten wells in Parker County (API: 42-367) were selected. Not all wells carry the same amount of data: some wells have processed logs while others do not. The table below gives the API numbers with the corresponding well names and production rates for the chosen wells.

API             Well name           Production rate* (Mscf/day)
42-367-34050    Moore               --
42-367-34447    Deaton              202
42-367-34576    Frank-Mask          830
42-367-34094    Sugar Tree          532
42-367-34227    Westhoff John       1029
42-367-34343    Flamik Randal       201
42-367-34385    Young Joe           779
42-367-34438    Kinyon              493
42-367-34744    Hagler              1365
42-367-34883    Lake Wheatherford   965

*From DrillingInfo.com

Based on the production rate, the wells can be divided into three categories. The goals of this project are (1) to classify each well into electrofacies, and (2) to relate the performance of each well to the newly classified electrofacies.

Data cleaning and preparation of data for input to data mining

The logs given to us were processed and contain many redundant and missing parameters, so it is imperative to select and clean the data before choosing the attributes we want as input to the data mining algorithms. We want to develop electrofacies for the upper Barnett and lower Barnett zones; the local stratigraphy of the subsurface is given in Figure 1. As observed there, the Barnett Shale is divided into two parts by the Forestburg Limestone, so before inputting data into any data mining algorithm we need to remove these limestone zones. Since the mud resistivity in all of the given logs is of the order of 0.4 ohm-m, we can be sure that all the wells were drilled with water-based muds, and hence we can use the photoelectric (PE) log as a lithology indicator, since carbonates usually have high PE values of about 5. We therefore screen out all depths where the PE log reads 4 or higher, keeping only depths with PE < 4. Additional filtering is done by screening out all depths with density (RHOB) > 2.7 g/cc. Figure 2 shows the workflow used for cleaning and filtering so that the final output contains depths and parameters only from the upper and lower Barnett Shale.

Figure 3 contains the list of attributes selected for each well. It can be observed that the Flamik Randal and Young Joe wells contain the fewest attributes.

Figure 1: General stratigraphy of the Ordovician to Pennsylvanian section in the Fort Worth Basin (Loucks & Ruppel, 2007).

Figure 2: Workflow for data cleaning:
1. Select all depths with PEF < 4.
2. Select all depths with non-zero GR, RHOB, AT90 and 0 < NPHI < 1.
3. Normalize every parameter by its mean and variance.
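For concreteness, the Figure 2 workflow can be sketched in a few lines of base R. This is a minimal sketch under assumptions: the input file name ("well_logs.csv") and the column names (GR, PEF, AT90, NPHI, RHOB) are placeholders, not the actual files used in this study.

# Minimal sketch of the Figure 2 cleaning workflow (file and column names assumed)
logs <- read.csv("well_logs.csv")

# Keep non-carbonate depths (PEF < 4) and physically plausible readings,
# and drop dense (RHOB > 2.7 g/cc) limestone intervals
clean <- subset(logs,
                PEF < 4 &
                GR > 0 & RHOB > 0 & AT90 > 0 &
                NPHI > 0 & NPHI < 1 &
                RHOB <= 2.7)

# Normalize every parameter by its mean and variance (z-scores);
# drop any depth/identifier columns before scaling if present
clean_scaled <- as.data.frame(scale(clean))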

Figure 3: Summary of the meaningful curves that could be extracted from each well.

Moore (9 attributes): GR (max 368; min 18), PEF (max 6.2; min 2.2), AT90 (max 862; min 0.68), NPHI (max 0.397; min 0.002), RHOB (max 2.76; min 2.34), WCLC (avg 0.183), WILL (avg 0.69), WQUA (avg 0.471), VCL (avg 0.332)

Deaton (11 attributes): GR (max 337; min 12), PEF (max 5.18; min 1.8), AT90, NPHI (max 0.374; min 0), RHOB (max 2.825; min 2.39), WCLC (avg 0.176), WDOL (avg 0.096), WILL (avg 0.136), WQUA (avg 0.474), WTOC (avg 0.022), VCL (avg 0.237)

Frank-Mask (6 attributes): GR (max 346; min 0), NPHI (max 0.30; min 0), RHOB (max 2.705; min 0), VCL (avg 0.289), PR (avg 0.227), CB (avg 0.205)

Sugar Tree (10 attributes): GR (max 201; min 0), PEF (max 9.776; min 0), AT90 (max 173; min 0.224), NPHI (max 0.569; min -0.014), RHOB (max 2.75; min 0.30), WILL, WQUA, VCL, PR, BULKMOD

Westhoff John (9 attributes): GR (max 368; min 18), PEF (max 6.234; min 2.28), NPHI (max 0.397; min 0.002), RHOB (max 2.76; min 2.34), WCAR (avg 0.025), WCLC (avg 0.183), WILL (avg 0.311), WQUA (avg 0.471), VCL (avg 0.332)

Flamik Randal (5 attributes): GR (max 883; min 0), PEF (max 11.54; min 0), AT90 (max 927; min 0), NPHI (max 2.7; min 0), RHOB (max 164; min 0)

Young Joe (5 attributes): GR (max 883; min 0), PEF (max 11.54; min 0), AT90, NPHI, RHOB

Kinyon (7 attributes): GR, PEF, AT90, NPHI, RHOB, PR, YME

Hagler (9 attributes): GR, PEF, AT90, NPHI, RHOB, WCLC, WILL, WQUA, VCL

Lake Wheatherford (8 attributes): GR, PEF, AT90, NPHI, RHOB, WILL, WQUA, WPYR

Selection of data mining technique and workflow

Due to the high volume of log data, it is desirable to choose unsupervised data mining techniques first, to find out whether the data contain any hidden trends or patterns. Since many wells carry as many as 200 log attributes, it is necessary to reduce the dimensionality of the data before applying any clustering algorithm. I use principal component analysis (PCA) to first reduce the data to three principal components, and then apply the K-means clustering algorithm to generate and optimize clusters in the data. Figure 4 gives the PCA and clustering density plots for the different wells in sequence. Clustering is done using the X-means algorithm, which automatically optimizes the number of clusters by iteration. However, given the uneven cluster sizes shown in Figure 4, it can be argued that this method is not giving us the right clusters: in its quest to minimize the within-cluster sum of squared errors, X-means gives more weight to larger clusters. This clustering technique is therefore a poor fit here, since K-means implicitly assumes that each cluster has roughly the same number of observations. Also, PCA is only meaningful when applied to correlated attributes, since it relies on the variance being concentrated in a few directions; if the data show no correlation, applying PCA accomplishes little.

Table 1: Parameters used in X-means clustering and PCA

PCA
    Number of components: selected to retain 90% of the variance
X-means clustering
    Min. clusters: 2
    Max. clusters: 60
    Distance measure: Euclidean
    Max. runs: 10
    Max. optimization steps: 100
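The PCA-plus-clustering step itself is compact in R. The sketch below keeps the smallest number of principal components that explain 90% of the variance (Table 1) and then runs base R's kmeans(); note that X-means, as used in the actual workflow, is not part of base R, so a fixed k = 3 stands in for it here. clean_scaled is the hypothetical cleaned data from the previous sketch.

# PCA keeping 90% of the variance, followed by K-means (stand-in for X-means)
pca <- prcomp(clean_scaled, center = TRUE, scale. = TRUE)

# Smallest number of components whose cumulative variance reaches 90%
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_comp  <- which(cum_var >= 0.90)[1]

scores <- pca$x[, 1:n_comp, drop = FALSE]   # projected data

set.seed(42)                                # K-means is sensitive to initial centers
fit <- kmeans(scores, centers = 3, iter.max = 100, nstart = 50)
table(fit$cluster)                          # cluster sizes (cf. the uneven-size caveat)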

Figure 4: PCA density plots with X-means clustering for the following wells, in order from top left: 1. Moore, 2. Deaton, 3. Frank-Mask, 4. Sugar Tree, 5. Westhoff John, 6. Flamik Randal, 7. Young Joe, 8. Kinyon, 9. Hagler, 10. Lake Wheatherford. With X-means clustering, most of the wells can be described by three clusters in the PCA data, but wells 6 and 7 do not display any distinct clusters.

Mathematical background of PCA and K-means clustering

PCA is a dimensionality reduction technique for datasets with correlated attributes. The first principal component is the direction of maximum variance in the data, and each subsequent principal component is orthogonal to (and independent of) the previous ones. Every attribute needs to be scaled before the PCA algorithm is applied. PCA is a very useful tool for exploratory data analysis and predictive modelling of high-dimensional datasets. While PCA helps reveal internal patterns in the data, the next step in data mining is clustering. Although the literature is rich with different algorithms for efficient clustering, the fundamental workflow is shown in Table 2.

Table 2: Workflow for clustering algorithms
1. Determine the number of clusters (centroids) to be placed.
2. Find the distance of each data point to each centroid and assign each data point to the centroid that minimizes that distance.
3. Recompute the centroid of each cluster and reclassify each data point, again minimizing the sum of distances from the centroids.
4. Iterate until the assignments converge and the number of clusters is optimized.

Interpretation of results

Since the principal components as such do not have any physical meaning, the predicted clusters have to be transformed back to the original data. The table below gives the distribution of data points among clusters for all the analyzed wells:

Well name           Data points after cleaning   Cluster 1   Cluster 2   Cluster 3   Cluster 4
Moore               1884                         629         467         788         --
Deaton              2264                         1729        125         410         --
Frank-Mask          3212                         2642        570         --          --
Sugar Tree          925                          581         56          288         --
Westhoff John       8016                         6539        1477        --          --
Flamik Randal       500                          240         115         124         21
Young Joe           80                           37          8           35          --
Kinyon              6462                         538         1680        4244        --
Hagler              2535                         1211        801         523         --
Lake Wheatherford   5085                         1178        3121        786         --
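Mapping the clusters back is a one-liner once the cluster labels are attached to the original curves. Below is a hedged sketch, reusing the hypothetical clean data frame and fit object from the earlier sketches.

# Attach cluster labels to the original (unscaled) curves and summarize
facies <- data.frame(clean, facies = fit$cluster)

# Per-facies averages of the original attributes (GR, NPHI, RHOB, ...)
aggregate(. ~ facies, data = facies, FUN = mean)

# Number of data points per cluster, as tabulated above
table(facies$facies)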

Relationship of predicted electrofacies with original variables

[Figure: GR and electrofacies vs. depth (approximately 4400-5600) for the Moore well, with Facies 1-, Facies 3-, and Facies 2-dominated intervals annotated from top to bottom.]

Figure 5: The Moore well can be subdivided into three electrofacies using data mining, which can be correlated with gamma ray values. Facies 1 shows high gamma ray and is most probably a shale interval, while Facies 2 is less radioactive than Facies 1. Facies 3 has the lowest gamma ray reading.

[Figure: GR and electrofacies vs. depth for the Deaton well (approximately 4900-6100) and the Frank-Mask well (approximately 5400-6800), with facies-dominated intervals annotated.]

Figure 6: The Deaton well seems to contain only Facies 1 and Facies 3; the amount of Facies 2 is very small. In the Frank-Mask well only two types of facies are present, and it is not easy to classify them based on the gamma ray log alone.

[Figure: GR and electrofacies vs. depth (approximately 5600-7000) for the Kinyon and Hagler wells; both show Facies 1, Facies 2, and Facies 3 intervals from top to bottom.]

The folly of trusting data mining

Most data mining algorithms are heuristic processes: no physical understanding is needed to apply them, and the process is supposed to reveal hidden trends. However, applying any data mining task blindly can lead to completely wrong outputs. Below are some caveats of applying K-means clustering to a real-life dataset:
1. K-means assumes that the variance of each attribute's distribution is spherical.
2. It does not work well on non-spherical datasets, and in general the higher the dimensionality of the data, the harder it is to apply K-means efficiently.
3. The curse of unevenly sized clusters: K-means assumes the prior probability of all K clusters is the same, i.e., that each cluster has roughly the same number of observations, which is clearly not the case for our dataset.

PART 2: Clustering using SOM and R

Figure 7 shows the self-organizing map (SOM) U-matrix plot with K-means clustering for the wells, using the same attributes as in Part 1 (a reproducible sketch in R follows the figure).

Figure 7: SOM clustering for the Moore well.
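For readers who want to reproduce a U-matrix in R, the CRAN kohonen package is one option; the original Figure 7 was produced with different software, so this is an assumption-laden sketch, again using the hypothetical clean_scaled matrix from the earlier sketch.

# SOM sketch with the `kohonen` package; grid size and rlen are arbitrary choices
library(kohonen)

som_grid  <- somgrid(xdim = 10, ydim = 10, topo = "hexagonal")
som_model <- som(as.matrix(clean_scaled), grid = som_grid, rlen = 100)

# U-matrix: mean distance of each map unit to its neighbours
plot(som_model, type = "dist.neighbours")

# K-means on the learned codebook vectors groups the SOM units into facies
km <- kmeans(getCodes(som_model), centers = 3, nstart = 25)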

However, it remains difficult to evaluate the accuracy of the clustering.

Clustering and SOM in R

R provides some flexibility and quality checks for clustering (one such check is sketched below). The filtered data obtained from the Part 1 cleaning workflow, with the additional constraint GR > 120, were used as input to R, and I applied K-means clustering to see how it performs. This was done for the following wells: Moore, Deaton, Frank-Mask, and Kinyon. This section describes the results.

Figure 8: Clustering optimization for the Moore well.

Figure 9: Clustering optimization for the Deaton well.
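One quality check that R makes convenient is the average silhouette width from the cluster package, which measures how well each point sits inside its assigned cluster. A short sketch, reusing the hypothetical scores and fit objects from the Part 1 sketch:

# Silhouette check on a K-means result (values near 1 = tight, well-separated clusters)
library(cluster)

sil <- silhouette(fit$cluster, dist(scores))
summary(sil)              # per-cluster average silhouette widths
mean(sil[, "sil_width"])  # overall average silhouette width
plot(sil)                 # silhouette plot, one bar per observation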

Figure 10: Clustering optimization for the Frank-Mask well.

Figure 11: Clustering optimization for the Kinyon well.

Figure 12: Clustering optimization for the Hagler well.

PART 3: Clustering using the merged dataset of all wells

This time I used only the wells that contain all five curves: GR, AT90, PEF, NPHI, and RHOB. The following wells were selected for the analysis:
Bonds Ranch C-1
Hyder 1H
Jerome Russell
John W Porter 3
Massey Unit
McFarland-Dixon
Moore-Price
Sol Carpenter Heirs
Sugar Tree
Upham Joe Johnson

Applying the same workflow to the merged dataset gives the three clusters shown in Figure 13 (a sketch of the merging step is given after the figure).

Figure 13: PCA clusters for the merged dataset.
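The merging step can be sketched as below; the directory name, file layout, and the added well-identifier column are assumptions for illustration.

# Sketch: stack the five shared curves from every well file into one dataset
curves <- c("GR", "AT90", "PEF", "NPHI", "RHOB")
files  <- list.files("wells", pattern = "\\.csv$", full.names = TRUE)

merged <- do.call(rbind, lapply(files, function(f) {
  w <- read.csv(f)[, curves]                         # keep only the shared curves
  w$well <- tools::file_path_sans_ext(basename(f))   # tag rows with the well name
  w
}))

# Scale the curves (not the well tag) before the PCA + K-means workflow
merged_scaled <- scale(merged[, curves])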

The table below gives the centroid of each cluster:

Cluster   PC1       PC2       Avg. GR (API)   Avg. DPHI   Avg. PEF   Avg. AT90 (ohm-m)   Avg. RHOB (g/cc)   Avg. NPHI
2         -1.455    0.08      154             0.124       3.13       152                 2.49               0.177
3         1.5113    0.8375    137             0.038       3.19       16.26               2.64               0.191
1         1.2253    -1.647    134             0.09        3.33       9.12                2.55               0.289

Conclusion

The clusters can be interpreted as follows:

Cluster 1: Shales/sands with low porosity (0.09) and low resistivity (9.12 ohm-m). Probably tight shales with high clay-bound water (high NPHI, 0.289).
Cluster 3: Shales/sands with very low porosity (0.038) but higher resistivity (16.26 ohm-m) and grain density than Facies 1. Probably contains hydrocarbon saturation and less water.
Cluster 2: Probably the most prospective zone in this region, with good porosity and high hydrocarbon saturation; the well with the highest proportion of Facies 2 should therefore be the most prolific producer.

Appendix A: R Code for Part III

setwd("C:/Users/priya/Desktop/DMP_Midterm/R")
ms <- read.table("Book1_final.csv", header = TRUE, sep = ",")

# Replace missing values with 0
ms[is.na(ms)] <- 0
ls.str(ms)

# Keep only the curve columns of interest
ms <- ms[, c(1, 2, 4, 5, 6, 7, 8)]

# Keep only depths with PEF < 4 and GR > 110
msfilter <- ms[(ms$PEF < 4 & ms$GR > 110), ]

## K-means clustering in R
par(mfrow = c(1, 3), mar = c(4, 4, 2, 1))

## PCA on scaled variables
msPCA <- prcomp(msfilter, center = TRUE, scale. = TRUE, retx = TRUE)
fulldata <- data.frame(msfilter, msPCA$x)
mydata <- msPCA$x

# Determine the number of clusters (elbow plot of within-groups sum of squares)
wss <- (nrow(mydata) - 1) * sum(apply(mydata, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers = i)$withinss)
plot(1:15, wss, type = "b",
     xlab = "Number of Clusters", ylab = "Within-groups sum of squares")
dev.copy(pdf, "myplot.pdf")
dev.off()

fit <- kmeans(mydata, 3, iter.max = 100, nstart = 50)

# Get cluster means in principal-component space
aggregate(mydata, by = list(fit$cluster), FUN = mean)

# Append the cluster assignment to the original attributes
mydata <- data.frame(fulldata, fit$cluster)

library(cluster)
clusplot(mydata, fit$cluster, color = TRUE, shade = TRUE, labels = 0, lines = 0)

write.table(mydata, "C:/Users/priya/Desktop/DMP_Midterm/R/mergeddata.txt", sep = "\t")