Data Mining Introduction



The University of Iowa Power Plant is conducting a study of its Boiler 11. It wants to use the collected data to predict boiler efficiency and to assess classification accuracy.

Problem Definition

- Data has been collected on 14 different variables within the boiler.
- The original data set was mined, but the rules determined from it consider only the static data.
- The project tries to determine how classification accuracy (CA) changes when the rules consider the trends in the data instead of only the value at a particular time.

Project Goals

- Determine whether boiler efficiency can be predicted more accurately when the dynamic nature of the process is considered.
- Create new features based on trends and current values.
- Find which discretization level yields the best classification accuracy.
- Learn how to use data mining software.

Data Collection

- Raw data from the Iowa City power plant.
- 14 different variables, e.g. Boiler Master, Air Fuel Ratio, etc.
- About 10,000 observations for each variable.
- Data was collected at one-minute intervals on randomly selected days.
- Trends were created by looking at 20-minute intervals.

Slope Calculation

- Calculate the slope of the raw data over time.
  - Shows how the data was changing from point to point.
  - Used 20-minute intervals.
- Standardize the slope data.
  - Eased comparison.
  - Improved the transformation process.
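The slope and standardization steps can be sketched as follows. The slides do not say how the slope was computed, so this sketch assumes a least-squares slope over each 20-sample window (one sample per minute) and a z-score for the standardization. The function names are hypothetical.

```python
from statistics import mean, pstdev

WINDOW = 20  # minutes; one sample per minute, as stated on the slide


def window_slope(values):
    """Least-squares slope of values against sample index (units per minute).

    Assumption: the slides only say "calculate slope"; a regression slope
    is one plausible reading, not necessarily what the authors used.
    """
    n = len(values)
    t_mean = (n - 1) / 2
    v_mean = mean(values)
    num = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den


def rolling_slopes(series, window=WINDOW):
    """Slope of each trailing window; the first window - 1 samples yield no slope."""
    return [
        window_slope(series[i - window + 1 : i + 1])
        for i in range(window - 1, len(series))
    ]


def standardize(xs):
    """Z-score: subtract the mean and divide by the (population) standard deviation."""
    mu = mean(xs)
    sigma = pstdev(xs)
    if sigma == 0:
        return [0.0 for _ in xs]
    return [(x - mu) / sigma for x in xs]
```

Standardizing puts the slopes of all 14 variables on a common scale, which is what makes one set of discretization thresholds usable for every variable.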

Slope Discretization

- Discretize the standardized slope data.
- The levels describe how much the standardized slope was changing:
  - Decreasing rapidly
  - Decreasing
  - No change
  - Increasing
  - Increasing rapidly

Feature Discretization

- Discretization count: the values were assigned based on the mean and standard deviation.
- Example: SA Fan Flow, Mean = 55.67, Std Dev = 6.98

  IF(B2>65,65,IF(B2>60,60,IF(B2>57,57,IF(B2>55,55,IF(B2>52,52,IF(B2>50,50,IF(B2>45,45,IF(B2>40,40,35))))))))

- The discretized actual value and the discretized standardized slope are joined into one feature value, e.g. 57_3.
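The nested spreadsheet IF above maps a reading to the largest listed threshold it exceeds, falling back to 35. A minimal Python equivalent is shown here. The thresholds come from the slide, while the function and constant names are hypothetical.

```python
# Thresholds copied from the spreadsheet formula on the slide, highest first.
SA_FAN_FLOW_BINS = [65, 60, 57, 55, 52, 50, 45, 40]
FLOOR_BIN = 35  # value returned when the reading exceeds no threshold


def discretize_value(x, bins=SA_FAN_FLOW_BINS, floor=FLOOR_BIN):
    """Return the first (largest) threshold that x strictly exceeds."""
    for b in bins:
        if x > b:
            return b
    return floor


def combined_feature(value_bin, slope_level):
    """Join the value bin and the slope level with an underscore, e.g. 57_3."""
    return f"{value_bin}_{slope_level}"
```

For example, `combined_feature(discretize_value(58.2), 3)` gives `"57_3"`. The comparison is strict, as in the formula, so a reading of exactly 55 falls into the 52 bin.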

Combined feature values shown on the slide (actual-value bin, then slope level):

65_3 65_-2 65_2 65_-1 65_1 65_ _0.5 65_0
60_-3 60_3 60_-2 60_2 60_-1 60_1 60_ _0.5 60_0
57_-3 57_3 57_-2 57_2 57_-1 57_1 57_ _0.5 57_0

Feature Discretization

High Discretization

- Decreasing rapidly: -2S
- Decreasing: -1S
- No change: 0
- Increasing: 1S
- Increasing rapidly: 2S

Low Discretization

Same process as before, except ONLY 3 values are used:

- Decreasing rapidly: -1S
- No change: 0
- Increasing: 1S
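The two discretization levels can be compared with a small sketch. It assumes that S denotes one standard deviation of the slope, so a standardized slope is already expressed in units of S, and that each slope is assigned to the nearest listed level. The slides do not state the assignment rule, so nearest-level is an assumption, and the names below are hypothetical.

```python
# Levels in units of S, with the labels used on the slides.
HIGH_LEVELS = {
    -2: "Decreasing rapidly",
    -1: "Decreasing",
    0: "No change",
    1: "Increasing",
    2: "Increasing rapidly",
}
LOW_LEVELS = {
    -1: "Decreasing rapidly",
    0: "No change",
    1: "Increasing",
}


def discretize_slope(z, levels):
    """Map a standardized slope z to the nearest level (assumed rule)."""
    return min(levels, key=lambda level: abs(z - level))
```

With these definitions a standardized slope of 1.7 becomes level 2 under high discretization but level 1 under low discretization, so low discretization produces fewer distinct feature values.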

Experiment Setup

Moving Average

- The first data set used was the moving average of the raw data.
- This was a bad idea because the moving average takes out all the noise and extreme points.
- It was impossible to get a classification accuracy.
- Therefore we used the raw data.

Solution Approach

Data Mining

- We are interested in using data to predict efficiency based on rule sets.
- This is supervised learning, because our training set had labels indicating the classes of the observations.
- From the rule sets we extract, we should be able to predict efficiency levels for newly measured data.
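The moving-average problem described under Experiment Setup can be illustrated with a toy series. The window length of 5 and the readings are made up for illustration; the slides do not give the window that was used.

```python
def moving_average(series, window):
    """Trailing moving average; the first window - 1 samples yield no value."""
    return [
        sum(series[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(series))
    ]


# A flat signal with one extreme point (hypothetical readings).
raw = [50.0] * 10 + [70.0] + [50.0] * 10
smoothed = moving_average(raw, window=5)

print(max(raw), max(smoothed))  # prints "70.0 54.0"
```

The single spike of 70 is reduced to 54, which is the kind of extreme-point information the raw data keeps and the averaged data loses.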

Solution Approach

- Prepared the data as discussed in the model formulation.
- Ran the data through data mining software to determine rules.
  - Weka: free data mining software.
- In each data set, the trend under test replaced the original variable.
  - Example: Air Fuel Ratio Trend replaced Air Fuel Ratio.

Computational Study

- The software uses machine learning algorithms to solve data mining problems.
- Used the j48.J48 and j48.PART algorithms to analyze the data sets.
- Both are based on the C4.5 data mining algorithm learned in class.

Classifiers Used: j48.J48

- Decision tree algorithm; builds decision trees.
- 10-fold cross-validation to determine classification accuracy.
- Unfortunately, too many branches would need to be created for this data, so the software could not handle it.

Classifiers Used: j48.PART

- Based on the C4.5 algorithm.
- Creates rule sets.
- 10-fold cross-validation to determine classification accuracy.

te/lecture/datamining.pdf
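The 10-fold procedure used to measure classification accuracy can be sketched in plain Python. This is not Weka's implementation: the fold assignment, the function names, and the majority-class stand-in classifier are all assumptions made to keep the example self-contained.

```python
import random
from collections import Counter


def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validated_accuracy(rows, labels, train, k=10):
    """Train on k-1 folds, test on the held-out fold, and pool the results."""
    correct = 0
    for test_idx in k_fold_indices(len(rows), k):
        held_out = set(test_idx)
        train_rows = [r for i, r in enumerate(rows) if i not in held_out]
        train_labels = [y for i, y in enumerate(labels) if i not in held_out]
        predict = train(train_rows, train_labels)
        correct += sum(predict(rows[i]) == labels[i] for i in test_idx)
    return correct / len(rows)


def train_majority(rows, labels):
    """Stand-in learner: always predicts the most common training label."""
    most_common = Counter(labels).most_common(1)[0][0]
    return lambda row: most_common
```

Any learner with the same `train(rows, labels) -> predict` shape could replace `train_majority`. Every observation is tested exactly once, by a model that never saw it during training.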

Results

Table columns: % increase in rules from raw data; % increase in rules from low to high variable discretization; % increase in CA; number of rules.

raw data, no trend: % 881
ave mid temp trend, high: % % 38.5%
ave mid temp trend, low: % %
air fuel ratio trend, high: % % 37.4%
air fuel ratio trend, low: % %
biomass feedrate trend, high: % % 21.3%
biomass feedrate trend, low: % %

Results

- Approximately 1,000 rules and about 10,000 data points.
- Each rule on average describes only 10 points.
- Raw data: each rule describes on average 11.4 data points (10000/880).
- Low discretized data: each rule describes on average 10 data points (10000/1000).
- High discretized data: each rule describes on average 7.1 data points (10000/1400).
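The points-per-rule figures are a simple division of the data points by the rule count. The snippet below reproduces them from the rounded counts quoted on the slide; the variable names are hypothetical.

```python
DATA_POINTS = 10_000  # approximate number of observations, per the slide

# Approximate rule counts quoted on the slide.
RULE_COUNTS = {
    "raw data": 880,
    "low discretization": 1000,
    "high discretization": 1400,
}

points_per_rule = {
    name: round(DATA_POINTS / rules, 1) for name, rules in RULE_COUNTS.items()
}
print(points_per_rule)
```

The fewer points a rule covers, the more specific it is, so high discretization yields the narrowest rules of the three data sets.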

Conclusions

- Combining trends into the data set can increase the classification accuracy.
- Low discretization lowers the number of rules formed, but adding trends increases the number of rules formed.
- Low discretization is better than high discretization for two reasons:
  - Higher classification accuracy (in most cases)
  - Fewer rules formed

Any questions?
