Contents

Executive Summary
PART 1: Identify Electrofacies from Given Logs Using Data Mining Algorithms
  Selection of Wells
  Data Cleaning and Preparation of Data for Input to Data Mining
  Selection of Data Mining Technique & Workflow
  Mathematical Background of PCA and K-Means Clustering
  Interpretation of Results
  Relationship of Predicted Electrofacies with Original Variables
  The Folly of Trusting Data Mining
PART 2: Clustering Using SOM and R
  Clustering and SOM in R
PART 3: Clustering Using the Merged Dataset of All Wells
Conclusion
Appendix A: R Code for Part III
Executive Summary

The objective of this project is to build a data mining model that estimates electrofacies from a set of open-hole well logs. The trained model can then be used as a predictive tool for estimating unknown logs at any new location. The workflow uses principal component analysis (PCA) and the K-means clustering algorithm.

This report is divided into three parts. In Part 1, the data mining algorithm is run on individual wells, using different attributes for each well depending on availability. The resulting clusters are mapped back to each well using gamma ray values, which broadly identifies Facies 1 as high gamma ray, Facies 2 as low gamma ray, and Facies 3 as a mixed sand-shale sequence. The presence of these facies is then correlated with the production rates of the different wells to assess the reservoir quality of each facies. Although K-means always converges, the answer it gives depends on the initial centers, and the centers it returns are simply averages of data points. As a result, some wells with incomplete datasets (Young Joe, Flamik Randal) do not show distinct clusters, which makes it difficult to generalize the interpretation from this model. Part 1 ends with a discussion of the main disadvantages of K-means clustering. The present model could also be used to predict the unknown logs in these wells, but that is outside the scope of this project.

Data mining helps uncover hidden patterns in a dataset by exposing relationships between attributes, but it also uncovers many patterns that are not useful. It is up to the domain expert to filter through the patterns and accept the ones that validly answer the objective question. In Part 2, therefore, some of the wells are clustered using self-organizing maps (SOM).
In Part 3, five attributes (GR, AT90, PEF, RHOB, and NPHI) are merged across the ten selected wells, and the same workflow (PCA + K-means) is run to generate a generalized three-cluster model from which the different facies and their characteristics are identified. Based on the Part 3 study, the findings can be summarized in the following table:

Cluster   Interpretation
1         Shales/sands with low porosity (0.09) and low resistivity (9.12 ohm-m). Probably tight shales with high clay-bound water (high NPHI, 0.289).
3         Shales/sands with very low porosity (0.038) but higher resistivity (16.26 ohm-m) and grain density than Facies 1. Probably contain hydrocarbon saturation and less water.
2         Probably the sweet spot in this region, with good porosity and high hydrocarbon saturation. The well with the largest amount of Facies 2 should therefore be the most prolific producer.
PART 1: Identify Electrofacies from Given Logs Using Data Mining Algorithms

Selection of Wells

Wells were chosen by API number: ten wells in Parker County (API prefix 42-367). Not all wells carry the same amount of data; some have processed logs and some do not. The table below gives the API number, well name, and production rate for each chosen well.

API            Well name          Production rate* (Mscf/day)
42-367-34050   Moore              --
42-367-34447   Deaton             202
42-367-34576   Frank-Mask         830
42-367-34094   Sugar Tree         532
42-367-34227   Westhoff John      1029
42-367-34343   Flamik Randal      201
42-367-34385   Young Joe          779
42-367-34438   Kinyon             493
42-367-34744   Hagler             1365
42-367-34883   Lake Wheatherford  965

*From Drillinginfo.com

Based on production rate, the wells can be divided into three categories. The goals of this project are to (1) classify each well into electrofacies and (2) relate each well's performance to the newly classified electrofacies.

Data Cleaning and Preparation of Data for Input to Data Mining

The logs provided are processed and contain many redundant and missing parameters, so the data must be cleaned and the input attributes selected before running any data mining algorithm. We want to develop electrofacies for the upper Barnett and lower Barnett zones; the local stratigraphy of the subsurface is given in Figure 1, which shows the Barnett Shale divided into two parts by the Forestburg Limestone. Before feeding data into any data mining algorithm, these limestone intervals must be removed. Since the mud resistivity in all of the given logs is on the order of 0.4 ohm-m, we can be sure that all the wells were drilled with water-based muds, so the photoelectric (PE) log can be used as a lithology indicator: carbonates typically show high PE values around 5, so only depths with PE < 4 are retained. Additional filtering removes all depths with density (RHOB) > 2.7 g/cc.
Figure 2 shows the workflow used for cleaning and filtering the depths, so that the final output contains only the depths and parameters of the upper and lower Barnett Shale.
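As a rough illustration, the Figure 2 cleaning workflow can be sketched in Python with NumPy. The arrays below are synthetic stand-ins for the exported log curves (the curve names GR, PEF, AT90, RHOB, NPHI follow the report; the thresholds are those stated above):

```python
import numpy as np

# Synthetic stand-ins for one well's log curves (not real data).
rng = np.random.default_rng(0)
n = 1000
gr   = rng.uniform(0, 350, n)      # gamma ray, API
pef  = rng.uniform(1.5, 6.5, n)    # photoelectric factor
at90 = rng.uniform(0.5, 900, n)    # deep resistivity, ohm-m
rhob = rng.uniform(2.3, 2.85, n)   # bulk density, g/cc
nphi = rng.uniform(-0.05, 0.6, n)  # neutron porosity, v/v

# Figure 2 workflow: keep depths with PEF < 4 (rejects limestone, PE ~ 5),
# non-zero GR/RHOB/AT90, 0 < NPHI < 1, and RHOB <= 2.7 g/cc.
mask = ((pef < 4) & (gr > 0) & (rhob > 0) & (at90 > 0)
        & (nphi > 0) & (nphi < 1) & (rhob <= 2.7))
logs = np.column_stack([gr, pef, at90, rhob, nphi])[mask]

# Normalize every parameter with its mean and variance (z-score).
z = (logs - logs.mean(axis=0)) / logs.std(axis=0)
print(logs.shape, z.mean(axis=0).round(6), z.std(axis=0).round(6))
```

Each column of `z` then has zero mean and unit variance, as required before PCA.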
Figure 3 lists the attributes selected for each well. The Flamik Randal and Young Joe wells contain the fewest attributes.
Figure 1: General stratigraphy of the Ordovician to Pennsylvanian section in the Fort Worth Basin (Loucks & Ruppel, 2007)

Figure 2: Workflow for data cleaning:
1. Select all depths with PEF < 4.
2. Select all depths with non-zero GR, RHOB, AT90 and 0 < NPHI < 1.
3. Normalize every parameter with its mean and variance.
Figure 3: Summary of the meaningful curves that could be extracted from each well.

Moore (9 attributes): GR (18-368), PEF (2.2-6.2), AT90 (0.68-862), NPHI (0.002-0.397), RHOB (2.34-2.76), WCLC (avg 0.183), WILL (avg 0.69), WQUA (avg 0.471), VCL (avg 0.332)
Deaton (11 attributes): GR (12-337), PEF (1.8-5.18), AT90, NPHI (0-0.374), RHOB (2.39-2.825), WCLC (avg 0.176), WDOL (avg 0.096), WILL (avg 0.136), WQUA (avg 0.474), WTOC (avg 0.022), VCL (avg 0.237)
Frank Mask (6 attributes): GR (0-346), NPHI (0-0.30), RHOB (0-2.705), VCL (avg 0.289), PR (avg 0.227), CB (0.205)
Sugar Tree (10 attributes): GR (0-201), PEF (0-9.776), AT90 (0.224-173), NPHI (-0.014-0.569), RHOB (0.30-2.75), WILL, WQUA, VCL, PR, BULKMOD
Westhoff John (9 attributes): GR (18-368), PEF (2.28-6.234), NPHI (0.002-0.397), RHOB (2.34-2.76), WCAR (avg 0.025), WCLC (avg 0.183), WILL (avg 0.311), WQUA (avg 0.471), VCL (avg 0.332)
Flamik Randal (5 attributes): GR (0-883), PEF (0-11.54), AT90 (0-927), NPHI (0-2.7), RHOB (0-164)
Young Joe (5 attributes): GR (0-883), PEF (0-11.54), AT90, NPHI, RHOB
Kinyon (7 attributes): GR, PEF, AT90, NPHI, RHOB, PR, YME
Hagler (9 attributes): GR, PEF, AT90, NPHI, RHOB, WCLC, WILL, WQUA, VCL
Lake Wheatherford (8 attributes): GR, PEF, AT90, NPHI, RHOB, WILL, WQUA, WPYR
Selection of Data Mining Technique & Workflow

Because of the high volume of log data, it is desirable to start with unsupervised data mining techniques to find out whether the data contain any hidden trends or patterns. Many wells carry a large number of log attributes (as many as 200), so the dimensionality of the data must first be reduced before applying any clustering algorithm. I use principal component analysis (PCA) to reduce the data to three principal components and then apply K-means clustering to generate clusters. Figure 4 gives the PCA and clustering density plots for the different wells in sequence. Clustering is done with the X-means algorithm, which automatically optimizes the number of clusters by iteration.

However, the uneven cluster sizes shown in Figure 4 argue that this method is not giving the clusters we want: in the quest to minimize the within-cluster sum of squared error, X-means gives more weight to the larger clusters. K-means implicitly assumes that each cluster has roughly the same number of observations, which is not the case here. Also, PCA is only meaningful for correlated attributes, since it relies on the variance concentrating in a few directions; if the data show no correlation, applying PCA achieves little.

Table 1: Parameters used in X-means clustering and PCA analysis

PCA: number of components selected to keep 90% of the variance
X-means clustering: min. clusters 2; max. clusters 60; numerical measure Euclidean distance; max. runs 10; max. optimization steps 100
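The PCA step of Table 1, keeping just enough components to explain 90% of the variance, can be sketched with a plain SVD. This is an illustrative Python/NumPy sketch; the matrix `X` is a synthetic stand-in for the scaled log attributes:

```python
import numpy as np

# Synthetic stand-in for the cleaned, z-scored log matrix
# (rows = depths, columns = attributes); one attribute is made correlated.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=500)   # correlated attribute
X = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD of the standardized matrix.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
var_ratio = s**2 / np.sum(s**2)                   # variance explained per component
k = int(np.searchsorted(np.cumsum(var_ratio), 0.90) + 1)  # components for 90%
scores = X @ Vt[:k].T    # projection of each depth onto the kept components
print(k, scores.shape)
```

The rows of `scores` are what the clustering algorithm then operates on, in place of the original (correlated) attributes.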
Figure 4: PCA density plots with X-means clustering for the following wells, in order from top left: 1. Moore, 2. Deaton, 3. Frank Mask, 4. Sugar Tree, 5. Westhoff John, 6. Flamik Randal, 7. Young Joe, 8. Kinyon, 9. Hagler, 10. Lake Wheatherford. With X-means clustering, most wells can be described by three clusters in the PCA data, but wells 6 and 7 do not display any distinct clusters.
Mathematical Background of PCA and K-Means Clustering

PCA is a dimensionality reduction technique for datasets with correlated attributes. The first principal component is the direction of maximum variance in the data, and each subsequent component is orthogonal to (and independent of) the previous ones. Every attribute must be scaled before applying the PCA algorithm. PCA is a very useful tool for exploratory data analysis and for predictive modeling of high-dimensional datasets.

While PCA helps reveal the internal patterns in the data, the next step in data mining is clustering. Although the literature is rich with algorithms for efficient clustering, the fundamental workflow is shown in Table 2.

Table 2: Workflow for clustering algorithms
1. Choose the number of clusters and place the initial centroids.
2. Compute the distance from each data point to each centroid and assign each point to the centroid that minimizes that distance.
3. Recompute each centroid as the mean of the points assigned to it, and reassign the points to their nearest centroid.
4. Iterate steps 2-3 until the assignments converge.

Interpretation of Results

Since principal components as such have no physical meaning, the predicted clusters must be transformed back to the original data. The table below gives the distribution of data points among the clusters for all analyzed wells:

Well name          Points after cleaning   Cluster 1   Cluster 2   Cluster 3   Cluster 4
Moore              1884                    629         467         788         --
Deaton             2264                    1729        125         410         --
Frank Mask         3212                    2642        570         --          --
Sugar Tree         925                     581         56          288         --
Westhoff John      8016                    6539        1477        --          --
Flamik Randal      500                     240         115         124         21
Young Joe          80                      37          8           35          --
Kinyon             6462                    538         1680        4244        --
Hagler             2535                    1211        801         523         --
Lake Wheatherford  5085                    1178        3121        786         --
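The clustering loop of Table 2 (Lloyd's algorithm) can be sketched in a few lines of Python/NumPy. The three well-separated synthetic blobs below stand in for PCA scores, and the printed per-cluster counts are the analogue of the distribution table above:

```python
import numpy as np

def kmeans(X, centers, max_iter=100):
    """Table 2 workflow: assign points to nearest centroid, recompute
    centroids as cluster means, repeat until converged."""
    for _ in range(max_iter):
        # distance of each data point to each centroid
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                 # nearest-centroid assignment
        new = np.array([X[labels == j].mean(axis=0)
                        for j in range(len(centers))])
        if np.allclose(new, centers):             # assignments stopped moving
            break
        centers = new
    return labels, centers

# Three synthetic, well-separated blobs standing in for PC scores.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 0.3, size=(100, 2))
               for m in [(0, 0), (4, 0), (0, 4)]])
labels, centers = kmeans(X, X[[0, 100, 200]].astype(float))  # one seed per blob
print(np.bincount(labels))   # points per cluster, as in the table above
```

With well-separated blobs and one seed point per blob, the loop recovers the three groups; as noted elsewhere in this report, poorer initial centers can converge to a worse answer.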
Priyank Srivastava (PE 5370: Mid-Term Project Report)

Relationship of Predicted Electrofacies with Original Variables

[Figure: GR log (0-400 API) and electrofacies vs. depth (4400-5600 ft) for the Moore well, with intervals marked as Facies 1 dominated, Facies 3 dominated, and Facies 2 dominated.]

Figure 5: The Moore well can be subdivided into three electrofacies using data mining, and these correlate with gamma ray values. Facies 1 shows high gamma ray and is most probably a shale interval, Facies 2 is less radioactive than Facies 1, and Facies 3 has the lowest gamma ray reading.
[Figure: GR logs (0-400 API) and electrofacies vs. depth for the Deaton well (4900-6100 ft) and the Frank Mask well (5400-6800 ft), annotated with facies-dominated intervals.]

Figure 6: The Deaton well seems to contain only Facies 1 and Facies 3, with very little Facies 2. In the Frank Mask well only two facies are present, and they are not easy to classify from the gamma ray log alone.
[Figure: GR logs (0-400 API) and electrofacies vs. depth (5600-7000 ft) for the Kinyon and Hagler wells, each showing Facies 1, Facies 2, and Facies 3 intervals from top to bottom.]
The Folly of Trusting Data Mining

Most data mining algorithms are heuristic processes that require no physical understanding of the problem. Data mining is supposed to reveal hidden trends, but applying it blindly can lead to completely wrong outputs. Below are some caveats of applying K-means clustering to a real-life dataset.

1. K-means assumes the variance of the distribution of each attribute is spherical, so it performs poorly when the clusters are not spherical.
2. The higher the dimensionality of the data, the harder it is to apply K-means efficiently.
3. The curse of unevenly sized clusters: K-means assumes the prior probability of all K clusters is the same, i.e., that each cluster has roughly the same number of observations, which is clearly not true of our dataset.

PART 2: Clustering Using SOM and R

Figure 7 shows self-organizing map U-matrix plots with K-means clustering for all the wells, using the same attributes as in Part 1.

Figure 7: SOM clustering for the Moore well
However, it is again difficult to evaluate the accuracy of this clustering.

Clustering and SOM in R

R provides some flexibility and quality checks for clustering, so the filtered data from the Part 1 cleaning workflow, with the additional constraint GR > 120, were used as input to R, where I applied K-means clustering to see how it performs. This was done for the following wells: Moore, Deaton, Frank Mask, and Kinyon. This section describes the results of using R.

Figure 8: Clustering optimization for the Moore well

Figure 9: Clustering optimization for the Deaton well
Figure 10: Clustering optimization for the Frank Mask well

Figure 11: Clustering optimization for the Kinyon well
Figure 12: Clustering optimization for the Hagler well

PART 3: Clustering Using the Merged Dataset of All Wells

For this part, only wells containing all five curves (GR, AT90, PEF, NPHI, and RHOB) were used. The following wells were selected for the analysis:

- Bonds Ranch C-1
- Hyder 1H
- Jerome Russell
- John W Porter 3
- Massey Unit
- McFarland-Dixon
- Moore-Price
- Sol Carpenter Heirs
- Sugar Tree
- Upham Joe Johnson

Applying the same workflow to the merged dataset gives the three clusters shown in Figure 13.

Figure 13: PCA clusters for the merged dataset
The table below gives the centroid of each cluster:

Cluster   PC1      PC2      Avg. GR (API)   Avg. DPHI   Avg. PEF   Avg. AT90   Avg. RHOB   Avg. NPHI
2         -1.455   0.08     154             0.124       3.13       152         2.49        0.177
3         1.5113   0.8375   137             0.038       3.19       16.26       2.64        0.191
1         1.2253   -1.647   134             0.09        3.33       9.12        2.55        0.289

Conclusion

The clusters can be interpreted as follows:

Cluster   Interpretation
1         Shales/sands with low porosity (0.09) and low resistivity (9.12 ohm-m). Probably tight shales with high clay-bound water (high NPHI, 0.289).
3         Shales/sands with very low porosity (0.038) but higher resistivity (16.26 ohm-m) and grain density than Facies 1. Probably contain hydrocarbon saturation and less water.
2         Probably the sweet spot in this region, with good porosity and high hydrocarbon saturation. The well with the largest amount of Facies 2 should therefore be the most prolific producer.
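Because the principal components themselves carry no physical units, a centroid table like the one above is built by averaging each original curve within each cluster. A minimal sketch (Python/NumPy; the `logs` matrix and `labels` vector are synthetic stand-ins for the merged curves and the K-means cluster assignments):

```python
import numpy as np

# Synthetic stand-ins: 300 depths x 5 curves, and a cluster label per depth.
rng = np.random.default_rng(3)
logs = rng.uniform([0, 2, 0.5, 2.3, 0.0],
                   [350, 6, 900, 2.85, 0.6], size=(300, 5))  # GR, PEF, AT90, RHOB, NPHI
labels = rng.integers(0, 3, size=300)                        # cluster of each depth

# Average each curve within each cluster -> one centroid row per cluster.
names = ["GR", "PEF", "AT90", "RHOB", "NPHI"]
for k in range(3):
    means = logs[labels == k].mean(axis=0)
    print(f"cluster {k + 1}: " +
          ", ".join(f"{n}={m:.2f}" for n, m in zip(names, means)))
```

With real data the labels come from K-means on the PC scores, and the per-cluster averages are what the facies interpretation is read from.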
Appendix A: R Code for Part III

setwd("c:/users/priya/desktop/dmp_midterm/r")
ms <- read.table("book1_final.csv", header = TRUE, sep = ",")
ms[is.na(ms)] <- 0                       # replace missing values with 0
attach(ms)
ls.str(ms)

# keep depth plus the log curves used in Part III
# (column names assumed to match the curve mnemonics, e.g. GR, PEF)
ms <- ms[, c(1, 2, 4, 5, 6, 7, 8)]

# filtering: keep PEF < 4 and GR > 110
msfilter <- ms[(ms$PEF < 4 & ms$GR > 110), ]

## K-means clustering in R
par(mfrow = c(1, 3), mar = c(4, 4, 2, 1))

## PCA on scaled variables
mspca <- prcomp(msfilter, center = TRUE, scale. = TRUE, retx = TRUE)
fulldata <- data.frame(msfilter, mspca$x)
mydata <- mspca$x

# determine the number of clusters (elbow plot of within-groups SS)
wss <- (nrow(mydata) - 1) * sum(apply(mydata, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers = i)$withinss)
dev.copy(pdf, "myplot.pdf")
plot(1:15, wss, type = "b",
     xlab = "Number of Clusters", ylab = "Within groups sum of squares")
dev.off()

# final model: 3 clusters, 50 random starts
fit <- kmeans(mydata, 3, iter.max = 100, nstart = 50)

# get cluster means
aggregate(mydata, by = list(fit$cluster), FUN = mean)

# append cluster assignment
mydata <- data.frame(fulldata, fit$cluster)

library(cluster)
clusplot(mydata, fit$cluster, color = TRUE, shade = TRUE, labels = 0, lines = 0)

write.table(mydata, "c:/users/priya/desktop/dmp_midterm/r/mergeddata.txt", sep = "\t")
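A footnote on the `nstart = 50` argument in the code above: K-means always converges, but only to a local minimum of the within-groups sum of squares that depends on the initial centers, which is why multiple random starts are used. A tiny deterministic illustration (Python for brevity; the four points and both sets of starting centers are contrived):

```python
import numpy as np

def kmeans(X, centers, max_iter=50):
    """Plain Lloyd iteration; returns final labels and within-cluster SS."""
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0)
                        for j in range(len(centers))])
        if np.allclose(new, centers):
            break
        centers = new
    wss = ((X - centers[labels]) ** 2).sum()
    return labels, wss

# Four corners of a 10 x 4 rectangle. A "vertical" 2-cluster split has
# WSS = 16; a "horizontal" split is a stable but worse local minimum (WSS = 100).
X = np.array([[0.0, 0.0], [0.0, 4.0], [10.0, 0.0], [10.0, 4.0]])
_, wss_bad  = kmeans(X, np.array([[0.0, 0.0], [0.0, 4.0]]))   # converges to horizontal split
_, wss_good = kmeans(X, np.array([[0.0, 0.0], [10.0, 0.0]]))  # converges to vertical split
print(wss_bad, wss_good)   # different answers from different initial centers
```

Both runs converge, yet one ends in a local minimum six times worse than the other; restarting from many random centers and keeping the best result (what `nstart` does in R's `kmeans`) guards against this.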