Priyank Srivastava (PE 5370: Mid-Term Project Report)


Contents

Executive Summary
PART-1: Identify Electrofacies from Given Logs Using Data Mining Algorithms
    Selection of Wells
    Data Cleaning and Preparation of Data for Input to Data Mining
    Selection of Data Mining Technique & Workflow
    Mathematical Background of PCA and K-Means Clustering
    Interpretation of Results
    Relationship of Predicted Electrofacies with Original Variables
    The Folly of Trusting Data Mining
PART-2: Clustering Using SOM and the R Package
    Clustering and SOM in R
PART-3: Clustering Using the Merged Dataset of All Wells
Conclusion
Appendix A: R Code for Part III

Executive Summary

The objective of the present project is to prepare a data mining model that estimates electrofacies from a set of open-hole well logs. The trained model can then be used as a predictive tool for estimating unknown logs at a new location. The present workflow uses principal component analysis (PCA) and the K-means clustering algorithm to build the model.

This report is divided into three parts. In Part 1 the data mining algorithm is run on individual wells, using different attributes for each well depending on availability. The resulting clusters are mapped back to the individual wells based on gamma ray values, which broadly show Facies 1 as high gamma ray, Facies 3 as a mixed sand-shale sequence, and Facies 2 as low gamma ray. The presence of these facies is then correlated with the corresponding production rates from the different wells to assess the reservoir quality of each facies. Although K-means always converges, the answer it gives depends on the initial centers, and the centers it returns are simply averages of data points. Some of the wells (Young Joe, Flamik Randal) that lack a complete dataset do not show any clusters, so it is difficult to generalize the interpretation from this model. Part 1 ends by discussing several disadvantages of K-means clustering. The unknown logs in these wells could be predicted with the present data mining model, but that is out of the scope of this project.

Data mining helps uncover hidden patterns in a dataset by exposing the relationships between attributes. The issue is that it also uncovers many patterns that are not useful; it is up to the domain expert to filter through the patterns and accept the ones that validly answer the objective question. In Part 2, therefore, some of the wells are clustered using self-organizing maps (SOM). In Part 3, five attributes (GR, AT90, PEF, RHOB and NPHI) are merged for all 10 selected wells and the same workflow (PCA + K-means) is run to generate a generalized three-cluster model, from which the different facies and their characteristics are identified.

Based on the study in Part 3, my findings can be summarized as follows:

Cluster 1: Shales/sands with low porosity (0.09) and low resistivity (9.12). Probably tight shales with high clay-bound water (high NPHI, 0.289).
Cluster 3: Shales/sands with very low porosity (0.038) but higher resistivity (16.26) and higher grain density than Facies 1. Probably contains hydrocarbon saturation and less water.
Cluster 2: Probably the hottest spot in this region, with good porosity and high hydrocarbon saturation; the well with the highest proportion of Facies 2 should be the most prolific producer.

PART-1: Identify Electrofacies from Given Logs Using Data Mining Algorithms

Selection of Wells

I chose the wells according to their API numbers, so 10 wells in Parker County (API: ) were chosen. Not all wells carry the same amount of data: some have processed logs and some do not. The table below gives the API numbers with the corresponding well names and production rates for the chosen wells.

API    Well name            Production rate* (Mscf/day)
       Moore
       Deaton
       Frank-Mask
       Sugar Tree
       Westhoff John
       Flamik Randal
       Young Joe
       Kinyon
       Hagler
       Lake Wheatherford    965

*From drillinginfo.com

Based on the production rates, the wells can be divided into three categories. Our goals in this project are to (1) classify each well into electrofacies and (2) relate the performance of each well to the newly classified electrofacies.

Data Cleaning and Preparation of Data for Input to Data Mining

The logs given to us were processed and contain many redundant and missing parameters, so the data must be selected and cleaned to obtain the attributes we want as input to the data mining algorithms. We want to develop electrofacies for the upper Barnett and lower Barnett zones; the local stratigraphy of the subsurface is given in Figure 1. As observed there, the Barnett Shale is divided into two parts by the Forestburg Limestone, so before feeding the data into any data mining algorithm we need to remove these limestone zones. Since the mud resistivity in all of the given logs is of the order of 0.4 ohm-m, we can be sure that all the wells were drilled with water-based muds; hence we can use the photoelectric (PE) log as a lithology indicator, since carbonates usually have high PE values of about 5. We can therefore screen out all depths where the log shows PE > 4, keeping only PE < 4. Additional filtering is done by screening out all depths with density (RHOB) > 2.7 g/cc. Figure 2 shows the workflow used for cleaning and filtering the depths, so that the final output contains the depths and parameters of only the upper and lower Barnett Shale.

Figure 3 lists the attributes selected for each well. It can be observed that the Flamik Randal and Young Joe wells contain the fewest attributes.

Figure 1: General stratigraphy of the Ordovician to Pennsylvanian section in the Fort Worth Basin (Loucks & Ruppel, 2007)

Figure 2: Workflow for data cleaning.
1. Select all depths with PEF < 4.
2. Select all depths with non-zero GR, RHOB and AT90, and 0 < NPHI < 1.
3. Normalize every parameter with its mean and variance.
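The workflow in Figure 2 reduces to a short R filter. Below is a minimal sketch, assuming one well's samples sit in a data frame called logs with columns DEPT, GR, PEF, AT90, NPHI and RHOB; these names, and the function name clean_logs, are illustrative rather than part of the original workflow:

# Minimal sketch of the Figure 2 cleaning workflow. 'logs' and its
# column names (DEPT, GR, PEF, AT90, NPHI, RHOB) are assumptions.
clean_logs <- function(logs) {
  keep <- logs$PEF < 4 &                     # screen out carbonate intervals (high PE)
    logs$RHOB > 0 & logs$RHOB < 2.7 &        # drop nulls and high-density carbonates
    logs$GR > 0 & logs$AT90 > 0 &            # drop null readings
    logs$NPHI > 0 & logs$NPHI < 1
  out <- logs[keep, ]
  curve_cols <- setdiff(names(out), "DEPT")
  out[curve_cols] <- scale(out[curve_cols])  # normalize: zero mean, unit variance
  out
}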

Figure 3: Summary of the meaningful curves that could be extracted from each well.

Moore (9 attributes): GR (Max: 368; Min: 18), PEF (Max: 6.2; Min: 2.2), AT90 (Max: 862; Min: 0.68), NPHI (Max: 0.397; Min: 0.002), RHOB (Max: 2.76; Min: 2.34), WCLC (Avg: 0.183), WILL (Avg: 0.69), WQUA (Avg: 0.471), VCL (Avg: 0.332)

Deaton (11 attributes): GR (Max: 337; Min: 12), PEF (Max: 5.18; Min: 1.8), AT90, NPHI (Max: 0.374; Min: 0), RHOB (Max: 2.825; Min: 2.39), WCLC (Avg: 0.176), WDOL (Avg: 0.096), WILL (Avg: 0.136), WQUA (Avg: 0.474), WTOC (Avg: 0.022), VCL (Avg: 0.237)

Frank Mask (6 attributes): GR (Max: 346; Min: 0), NPHI (Max: 0.30; Min: 0), RHOB (Max: 2.705; Min: 0), VCL (Avg: 0.289), PR (Avg: 0.227), CB (Avg: 0.205)

Sugar Tree (10 attributes): GR (Max: 201; Min: 0), PEF (Min: 0; Max: 9.776), AT90 (Min: 0.224; Max: 173), NPHI (Min: -0.014; Max: 0.569), RHOB (Min: 0.30; Max: 2.75), WILL, WQUA, VCL, PR, BULKMOD

Westhoff John (9 attributes): GR (Max: 368; Min: 18), PEF (Min: 2.28; Max: 6.234), NPHI (Min: 0.002; Max: 0.397), RHOB (Max: 2.76; Min: 2.34), WCAR (Avg: 0.025), WCLC (Avg: 0.183), WILL (Avg: 0.311), WQUA (Avg: 0.471), VCL (Avg: 0.332)

Flamik Randal (5 attributes): GR (Min: 0; Max: 883), PEF (Min: 0; Max: 11.54), AT90 (Min: 0; Max: 927), NPHI (Min: 0; Max: 2.7), RHOB (Min: 0; Max: 164)

Young Joe (5 attributes): GR (Min: 0; Max: 883), PEF (Min: 0; Max: 11.54), AT90, NPHI, RHOB

Kinyon (7 attributes): GR, PEF, AT90, NPHI, RHOB, PR, YME

Hagler (9 attributes): GR, PEF, AT90, NPHI, RHOB, WCLC, WILL, WQUA, VCL

Lake Wheatherford (8 attributes): GR, PEF, AT90, NPHI, RHOB, WILL, WQUA, WPYR

Selection of Data Mining Technique & Workflow

Because of the high volume of log data, it is desirable to first use unsupervised data mining techniques to find out whether the data contain any hidden trends or patterns. Since some wells have as many as 200 log attributes, it is necessary to reduce the dimensionality of the data before applying any clustering algorithm. I use principal component analysis (PCA) to reduce the data to three principal components and then apply the K-means clustering algorithm to optimize and generate clusters in the data. Figure 4 gives the PCA and clustering density plots for the different wells in sequence. Clustering is done with the X-means algorithm, which automatically optimizes the number of clusters by iteration. However, given the uneven cluster sizes shown in Figure 4, it can be argued that this method is not giving us the clusters we want: in the quest to minimize the within-cluster sum of squared errors, X-means gives more weight to larger clusters. This clustering technique therefore could not be applied here as-is, since K-means assumes that each cluster has roughly the same number of observations. Also, PCA is a methodology for correlated attributes, since it relies on variance being concentrated along particular directions; if the data show no correlation, applying PCA is not a meaningful task.

Table 1: Parameters used in X-means clustering and PCA analysis

PCA
    Number of components: selected to keep 90% of the variance
X-means clustering
    Min. clusters: 2
    Max. clusters: 60
    Numerical measure: Euclidean distance
    Max. runs: 10
    Max. optimization steps: 100
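The 90% variance rule in Table 1 is easy to express with base R's prcomp. A minimal sketch, assuming x is the cleaned, all-numeric attribute matrix for one well:

# Keep the smallest number of principal components that together
# explain at least 90% of the variance ('x' is an assumed input).
pca <- prcomp(x, center = TRUE, scale. = TRUE)
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(cum_var >= 0.90)[1]
scores <- pca$x[, 1:k, drop = FALSE]   # reduced data passed on to clustering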

Figure 4: PCA density plots with X-means clustering for the following wells, in order from top left: 1. Moore, 2. Deaton, 3. Frank Mask, 4. Sugar Tree, 5. Westhoff John, 6. Flamik Randal, 7. Young Joe, 8. Kinyon, 9. Hagler, 10. Lake Wheatherford. Under X-means clustering most of the wells can be described by three clusters in the PCA data, but wells 6 and 7 do not display any specific clusters.

Mathematical Background of PCA and K-Means Clustering

PCA is a dimensionality reduction technique for datasets with correlated attributes. The 1st principal component is the direction of maximum variance in the data, and each principal component is independent of and orthogonal to the others. Every attribute needs to be scaled before the PCA algorithm is applied. PCA is a very useful tool for exploratory data analysis and predictive modeling of high-dimensional datasets.

While PCA helps to expose internal patterns in the data, the next step in data mining is clustering. Although the literature is rich with different algorithms for doing clustering efficiently, the fundamental workflow is shown in Table 2.

Table 2: Workflow for clustering algorithms
1. Determine the number of clusters (centroids) to be placed.
2. Find the distance of each data point to each centroid and assign every data point to the centroid that minimizes that distance.
3. Recompute the centroids of the clusters formed in the previous step and reclassify each data point to its nearest centroid, again minimizing the sum of distances.
4. Iterate until the assignments converge and the number of clusters is optimized.

Interpretation of Results

Since principal components as such do not have any physical meaning, I have to transform the predicted clusters back to the original data. The table below gives the distribution of data points across the clusters for all the analyzed wells:

Well name            Points after cleaning   Cluster 1   Cluster 2   Cluster 3   Cluster 4
Moore
Deaton
Frank Mask
Sugar Tree
Westhoff John
Flamik Randal
Young Joe
Kinyon
Hagler
Lake Wheatherford
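A minimal sketch of this back-transformation, assuming logs holds the original (unscaled) curves for the rows that survived cleaning and fit is the K-means result computed on their PCA scores (both names are illustrative):

# Map clusters back to physical log responses ('logs' and 'fit' are assumed).
facies <- fit$cluster

# Count of data points per cluster, as tabulated above
table(facies)

# Per-facies means of the original curves give each electrofacies a
# physical signature (e.g. Facies 1 = high mean GR)
aggregate(logs, by = list(Facies = facies), FUN = mean)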

Relationship of Predicted Electrofacies with Original Variables

Figure 5: GR and electrofacies versus depth for the Moore well, with Facies 1-dominated, Facies 3-dominated and Facies 2-dominated intervals marked. The Moore well can be subdivided into three electrofacies using data mining, and these can be correlated with gamma ray values: Facies 1 shows high gamma ray and is most probably a shale interval, Facies 2 has less radioactivity than Facies 1, and Facies 3 has the lowest gamma ray reading.

Figure 6: GR and electrofacies versus depth for the Deaton well (left) and the Frank Mask well (right). The Deaton well seems to contain only Facies 1 and Facies 3, with very little Facies 2. In the Frank Mask well only two facies types are present, and they are not easy to classify from the gamma ray log alone.

GR and electrofacies versus depth for the Kinyon well (left) and the Hagler well (right).

The Folly of Trusting Data Mining

Most data mining algorithms are heuristic processes that require no physical understanding to apply. The process of data mining is supposed to reveal hidden trends, but applying a data mining task blindly can lead to completely wrong outputs. Some caveats of applying K-means clustering to real-life datasets:

1. K-means assumes the variance of the distribution of each attribute is spherical, so it does not work well on non-spherical clusters.
2. The higher the dimensionality of the data, the more difficult it is to apply K-means efficiently.
3. The curse of unevenly sized clusters: K-means assumes the prior probability of all K clusters is the same, i.e. that each cluster has roughly the same number of observations, which is obviously not true of our dataset.

PART-2: Clustering Using SOM and the R Package

Figure 7 shows a self-organizing map (SOM) U-matrix plot with K-means clustering for the wells, using the same attributes as in Part 1.

Figure 7: SOM clustering for the Moore well
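A U-matrix plot like the one in Figure 7 can be produced with the kohonen package. The sketch below is one possible way, not the exact settings used here; the grid size, training length, three-cluster choice and the scaled attribute matrix x are all assumptions:

library(kohonen)

# 'x', the 10x10 grid, rlen = 200 and k = 3 are illustrative assumptions.
set.seed(1)
som_fit <- som(as.matrix(x),
               grid = somgrid(xdim = 10, ydim = 10, topo = "hexagonal"),
               rlen = 200)

# U-matrix: distance of each map unit to its neighbours
plot(som_fit, type = "dist.neighbours", main = "U-matrix")

# K-means on the learned codebook vectors, then colour map units by cluster
unit_cluster <- kmeans(getCodes(som_fit), centers = 3, nstart = 25)$cluster
plot(som_fit, type = "mapping", bgcol = rainbow(3)[unit_cluster])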

Even so, it is again difficult to evaluate the accuracy of the clustering.

Clustering and SOM in R

Since R provides some flexibility and quality checks for clustering, the filtered data obtained from the Part 1 data cleaning workflow, with the additional constraint GR > 120, were used as input to R, where I applied the K-means clustering technique to see how it performs. This was done for the following four wells: Moore, Deaton, Frank Mask and Kinyon. This section describes the results of using R; the optimization curves are shown in the figures below.
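One such quality check is the average silhouette width from the cluster package, computed alongside the elbow (within-groups sum of squares) curve shown in the optimization figures. A minimal sketch, assuming x is the filtered, scaled attribute matrix for one well:

library(cluster)

# Elbow and silhouette curves for k = 2..10 ('x' is an assumed input)
wss <- numeric(10)
sil <- numeric(10)
d <- dist(x)
for (k in 2:10) {
  fit <- kmeans(x, centers = k, nstart = 25)
  wss[k] <- fit$tot.withinss                       # within-groups sum of squares
  sil[k] <- mean(silhouette(fit$cluster, d)[, 3])  # average silhouette width
}
plot(2:10, wss[2:10], type = "b", xlab = "Number of clusters",
     ylab = "Within groups sum of squares")
plot(2:10, sil[2:10], type = "b", xlab = "Number of clusters",
     ylab = "Average silhouette width")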

Figure 8: Clustering optimization for the Moore well

Figure 9: Clustering optimization for the Deaton well

Figure 10: Clustering optimization for the Frank Mask well

Figure 11: Clustering optimization for the Kinyon well

Figure 12: Clustering optimization for the Hagler well

PART-3: Clustering Using the Merged Dataset of All Wells

This time I used only the wells that contain all five curves: GR, AT90, PEF, NPHI and RHOB. The following wells were selected for the analysis:

Bonds Ranch C-1
Hyder 1H
Jerome Russell
John W Porter 3
Massey Unit
McFarland-Dixon
Moore-Price
Sol Carpenter Heirs
Sugar Tree
Upham Joe Johnson

Applying the same workflow to the merged dataset gives the three clusters shown in Figure 13.

Figure 13: PCA clusters for the merged dataset
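A minimal sketch of the merging step, assuming one CSV per selected well in a wells/ directory, each containing at least the five shared curves (the file layout and column names are assumptions):

# Merge the five common curves from every well ('wells/' layout assumed).
curves <- c("GR", "AT90", "PEF", "NPHI", "RHOB")
files <- list.files("wells", pattern = "\\.csv$", full.names = TRUE)

merged <- do.call(rbind, lapply(files, function(f) {
  w <- read.csv(f)[, curves]    # keep only the shared curves
  w$WELL <- basename(f)         # remember which well each row came from
  w
}))

# 'merged' then goes through the same cleaning, scaling, PCA and
# K-means workflow as in Part 1 (see Appendix A).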

The table below gives the centroid for each cluster:

Cluster   PC1   PC2   Avg. GR (API)   Avg. DPHI   Avg. PEF   Avg. AT90   Avg. RHOB   Avg. NPHI
1
2
3

Conclusion

The clusters can be interpreted as follows:

Cluster 1: Shales/sands with low porosity (0.09) and low resistivity (9.12). Probably tight shales with high clay-bound water (high NPHI, 0.289).
Cluster 3: Shales/sands with very low porosity (0.038) but higher resistivity (16.26) and higher grain density than Facies 1. Probably contains hydrocarbon saturation and less water.
Cluster 2: Probably the hottest spot in this region, with good porosity and high hydrocarbon saturation; the well with the highest proportion of Facies 2 should be the most prolific producer.

Appendix A: R Code for Part III

setwd("c:/users/priya/desktop/dmp_midterm/r")

# Read the merged dataset and replace missing values with 0
ms <- read.table("book1_final.csv", header = TRUE, sep = ",")
ms[is.na(ms)] <- 0
attach(ms)
ls.str(ms)

# Keep only the curve columns of interest
ms <- ms[, c(1, 2, 4, 5, 6, 7, 8)]

# Keep depths with PEF < 4 and GR > 110
msfilter <- ms[(ms$pef < 4 & ms$gr > 110), ]

## K-means clustering in R
par(mfrow = c(1, 3), mar = c(4, 4, 2, 1))

## Apply PCA to the scaled variables
mspca <- prcomp(msfilter, center = TRUE, scale. = TRUE, retx = TRUE)
fulldata <- data.frame(msfilter, mspca$x)
mydata <- mspca$x

# Determine the number of clusters (elbow plot)
wss <- (nrow(mydata) - 1) * sum(apply(mydata, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers = i)$withinss)
dev.copy(pdf, "myplot.pdf")
plot(1:15, wss, type = "b", xlab = "Number of clusters",
     ylab = "Within groups sum of squares")
dev.off()   # close the PDF device so the plot file is written

# Final three-cluster fit
fit <- kmeans(mydata, 3, iter.max = 100, nstart = 50)

# Get cluster means
aggregate(mydata, by = list(fit$cluster), FUN = mean)

# Append the cluster assignment and plot the clusters
mydata <- data.frame(fulldata, fit$cluster)
library(cluster)
clusplot(mydata, fit$cluster, color = TRUE, shade = TRUE, labels = 0, lines = 0)
write.table(mydata, "c:/users/priya/desktop/dmp_midterm/r/mergeddata.txt", sep = "\t")
