Data Imbalance Problem solving for SMOTE Based Oversampling: Study on Fault Detection Prediction Model in Semiconductor Manufacturing Process


Vol. 133 (Information Technology and Computer Science 2016), pp. 79-84
http://dx.doi.org/10.14257/astl.2016

Data Imbalance Problem Solving for SMOTE Based Oversampling: Study on Fault Detection Prediction Model in Semiconductor Manufacturing Process

Jaekwon Kim 1, Youngshin Han 2* and Jongsik Lee 1*

1 Dept. of Computer Science and Information Engineering, Inha University, South Korea (Jaekwon Kim, Jongsik Lee: jslee@inha.ac.kr)
2 Dept. of Computer Engineering, Sungkyul University, South Korea (Youngshin Han: hanys@sungkyul.ac.kr)

* Corresponding authors: Youngshin Han and Jongsik Lee.

Abstract. Fault detection prediction for the FAB (wafer fabrication) process in semiconductor manufacturing can improve product quality and reliability, depending on its classification performance. However, faults occur only occasionally in the FAB process and most products pass, so the pass/fail classes are imbalanced. When the data are imbalanced, prediction models have difficulty predicting the fail class because the bias toward the majority (pass) class increases. In this paper, we propose a SMOTE (Synthetic Minority Over-sampling Technique) based oversampling method to solve the data imbalance problem. The proposed method resolves the imbalance between pass and fail by oversampling the minority fail class. In addition, we measure its performance by applying it to a fault detection prediction model.

Keywords: Semiconductor manufacturing process, Fault detection prediction, Oversampling, SMOTE

1 Introduction

The probe test is the step that classifies a wafer as pass/fail (regular/irregular) after the FAB process is finished [1]. Until now, the semiconductor manufacturing process has predicted semiconductor yield using the FAB process and the probe test, but this approach causes lead time and cost problems, because the level of manufacturing technology keeps rising and so does the number of chips on a wafer. Predicting the final test yield in the semiconductor industry therefore requires a way to reduce lead time and cost. The complex wafer manufacturing process can introduce defects that cause products to fail, so the semiconductor manufacturing process needs fault detection and classification methods. In other words, a fault detection prediction model can quickly predict the final product and thereby improve its quality and reliability [2].

Resolving the data imbalance improves the classification accuracy of a fault detection prediction model [3]. Because fault cases in the semiconductor manufacturing process are rare, an imbalance arises between the pass and fail classes of the final product. The prediction model therefore needs a data sampling method that can resolve this imbalance. In general, under-sampling or oversampling is used, depending on the degree of imbalance. However, when the dataset is imbalanced and some classes contain overlapping records, the classification result is strongly affected by both the amount of overlap and the degree of imbalance, so an oversampling method that also addresses the overlap problem is required. In this paper, we propose SMOTE (Synthetic Minority Over-sampling Technique) [4] based oversampling for the data imbalance in the semiconductor manufacturing process. The proposed method resolves the imbalance between the classes to improve the accuracy of the fault detection prediction model. This study uses the SECOM dataset [5] and builds data preprocessing and prediction models.

2 Method

In this paper, a SMOTE based sampling technique is used to improve the performance of the predictive model. SMOTE generates new minority class data using KNN (k-nearest neighbors) and thereby balances the minority and majority classes. The framework for generating the fault detection prediction model is shown in Fig. 1.

Fig. 1. Framework

The proposed framework consists of two phases. The first phase is the preprocessing step that prepares the SECOM dataset for the prediction models; it consists of data cleaning and feature selection.
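As an illustration, the following is a minimal sketch of this preprocessing phase, assuming pandas and scikit-learn are used; the file and column names are hypothetical, mean imputation of the remaining missing values is an assumption not stated in the paper, and a plain PCA projection stands in for the paper's PCA-based feature selection. The 60% missing-value threshold and the 35 retained features are taken from Sections 2 and 3.

    # Sketch of the first (preprocessing) phase, assuming pandas and scikit-learn.
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.model_selection import train_test_split

    # Load SECOM sensor measurements and pass/fail labels (hypothetical file and column names).
    X = pd.read_csv("secom_features.csv")
    y = pd.read_csv("secom_labels.csv")["pass_fail"]   # assumed coding: 0 = pass, 1 = fail

    # Data cleaning: drop attributes whose missing-value ratio exceeds 60%,
    # then impute remaining missing values with the column mean (assumption).
    missing_ratio = X.isna().mean()
    X = X.loc[:, missing_ratio <= 0.60]
    X = X.fillna(X.mean())

    # Feature selection: PCA projection to 35 components, standing in for
    # the paper's PCA-based feature selection.
    pca = PCA(n_components=35)
    X_reduced = pca.fit_transform(X)

    # Split into 70% training / 30% testing sets (stratified split assumed
    # so that the fail ratio is preserved in both sets).
    X_train, X_test, y_train, y_test = train_test_split(
        X_reduced, y, test_size=0.30, stratify=y, random_state=0)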

The SECOM dataset is divided into a training set (70%) and a testing set (30%). Oversampling uses SMOTE, configured to a 1:2 balance (minority class: 33.4%, majority class: 66.6%). The second phase generates the prediction models and evaluates them. From the training set, four prediction models are built: LR (Logistic Regression), ANN (Artificial Neural Network), DT (Decision Tree C4.5) and RF (Random Forest). The confusion matrix is used to evaluate the prediction models.

The procedure for generating the fault prediction model, including SMOTE based oversampling, is as follows (a Python sketch of steps 3)-5) is given after Fig. 3):

Data cleaning
1) Count the not-available or missing values in each attribute. If an attribute is missing in more than 60% of the records, remove that attribute.

Feature selection
2) Apply PCA (Principal Component Analysis) based feature selection.

Oversampling
3) Balance the pass/fail classes using SMOTE based oversampling. The SMOTE pseudocode is shown in Table 1.

Table 1. SMOTE pseudocode (adapted from [4])

    Start
    for each minority class sample i
        compute its k nearest minority class neighbors (k = 10) and save their indices
    end for
    while more synthetic samples are needed
        choose a random number between 1 and k, call it nn   // picks one of the k nearest neighbors of sample i
        for j <- 1 to number of attributes
            dif = MinorityClassSample[nn][j] - MinorityClassSample[i][j]
            gap = rand()                                      // random number between 0 and 1
            NewClassSample[newindex][j] = MinorityClassSample[i][j] + gap * dif
        end for
        newindex <- newindex + 1
    end while
    End

Prediction model build
4) Build a fault prediction model with LR, ANN, DT (C4.5) and RF.
5) Using the confusion matrix, compare the precision, recall (sensitivity) and F-measure. The confusion matrix layout is shown in Fig. 3 (TP: True Positive, FP: False Positive, FN: False Negative, TN: True Negative).

Fig. 3. Confusion matrix
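The following is a minimal sketch of steps 3)-5), under stated assumptions: it relies on the imbalanced-learn and scikit-learn packages (not named in the paper), uses scikit-learn's CART decision tree in place of C4.5, and reuses the X_train/X_test variables and the 0 = pass / 1 = fail coding from the preprocessing sketch above.

    # Sketch of the second phase: SMOTE to a 1:2 ratio, model building, evaluation.
    from imblearn.over_sampling import SMOTE
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

    # Oversample only the training set so the minority (fail) class reaches half
    # the size of the majority class, i.e. the paper's 1:2 (33.4% / 66.6%) balance.
    smote = SMOTE(sampling_strategy=0.5, k_neighbors=10, random_state=0)
    X_res, y_res = smote.fit_resample(X_train, y_train)

    models = {
        "LR": LogisticRegression(max_iter=1000),
        "ANN": MLPClassifier(max_iter=1000),
        "DT": DecisionTreeClassifier(),     # CART, used here as a substitute for C4.5
        "RF": RandomForestClassifier(),
    }

    for name, model in models.items():
        model.fit(X_res, y_res)
        y_pred = model.predict(X_test)
        # Confusion matrix with the fail class treated as the positive class (label 1 assumed).
        tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
        print(name, "TP/FP/FN/TN:", tp, fp, fn, tn,
              "precision:", precision_score(y_test, y_pred, zero_division=0),
              "recall:", recall_score(y_test, y_pred),
              "F-measure:", f1_score(y_test, y_pred))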

3 Experimental

We used the SECOM dataset [5] for the experiment. The SECOM dataset consists of 1567 records and 590 attributes; 104 records belong to the fail class and 1463 to the pass class (about 6.6% fail, 93.4% pass). Through data cleaning, 271 of the 590 attributes, those with more than 60% 'NaN (not available)' missing values, were removed, and 309 attributes were finally used. Feature selection with PCA then reduced these 309 features to a final 35 features. The SECOM dataset was split into a 70% training set (1099 records; pass: 1026, fail: 73) and a 30% testing set (468 records; pass: 437, fail: 31). Using the training set, fault prediction models were built with LR, ANN, DT and RF. For the oversampling comparison, SMOTE 1:2 is compared with RUS (Random Under-Sampling) [6] 1:2. The confusion matrices of the results are shown in Table 2, and the results of each model are shown in Fig. 4.

Table 2. Confusion matrix results

Sampling method   Prediction model   TP   FP   FN   TN
SMOTE 1:2         LR                 14   16   61   377
                  ANN                 5   25   27   411
                  DT                  5   25   69   369
                  RF                  7   23   27   411
RUS 1:2           LR                 17   13  107   331
                  ANN                18   12  173   265
                  DT                  7   23   69   369
                  RF                  5   25   20   418
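For reference, the following small sketch shows how the measures discussed below can be computed from the TP/FP/FN/TN counts in Table 2, treating the fail class as the positive class (an assumption); the per-model values obtained this way will not necessarily reproduce the averaged figures quoted below, which depend on how the counts in Table 2 are oriented.

    # Standard measures from confusion-matrix counts (fail class assumed positive).
    def measures(tp, fp, fn, tn):
        sensitivity = tp / (tp + fn)                 # recall for the fail class
        specificity = tn / (tn + fp)
        accuracy = (tp + tn) / (tp + fp + fn + tn)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        f_measure = (2 * precision * sensitivity / (precision + sensitivity)
                     if (precision + sensitivity) else 0.0)
        return sensitivity, specificity, accuracy, precision, f_measure

    # Example: the SMOTE 1:2 LR row of Table 2 (TP=14, FP=16, FN=61, TN=377).
    print(measures(14, 16, 61, 377))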

Fig. 4. Performance measure

The average sensitivity over all models was higher for RUS (SMOTE 0.259, RUS 0.392). The average specificity was SMOTE 0.895 and RUS 0.789; the average accuracy was SMOTE 0.854 and RUS 0.764; the average precision was SMOTE 0.154 and RUS 0.131; and the average F-measure was SMOTE 0.186 and RUS 0.175. Apart from sensitivity, SMOTE therefore performs better than RUS. Although the results differ depending on the classification model, SMOTE is generally effective for configuring the fault detection prediction model, so SMOTE based oversampling can be used effectively in the semiconductor manufacturing process.

4 Conclusion

The semiconductor manufacturing process incurs large costs that depend on the pass/fail classification. In this study, we proposed SMOTE (Synthetic Minority Over-sampling Technique) based oversampling to solve the data imbalance between pass and fail. The proposed method was applied to the SECOM dataset [5], and the classification models used were LR, ANN, DT and RF. SMOTE based oversampling offered better performance than the RUS-based alternative. Future studies should investigate ways to further increase the accuracy of the classification prediction.

Acknowledgment. This work was funded by the Ministry of Science, ICT and Future Planning (NRF-2015R1C1A2A01051452).

References

1. Kim, K.-H., Baek, J.: A Prediction of Chip Quality using OPTICS (Ordering Points to Identify the Clustering Structure)-based Feature Extraction at the Cell Level. Journal of the Korean Institute of Industrial Engineers, vol. 40, no. 3, pp. 257--266 (2014)
2. Kerdprasop, K., Kerdprasop, N.: Feature Selection and Boosting Techniques to Improve Fault Detection Accuracy in the Semiconductor Manufacturing Process. Proc. of the International MultiConference of Engineers and Computer Scientists, vol. 1 (2011)
3. Liu, J., Hu, Q., Yu, D.: A Comparative Study on Rough Set Based Class Imbalance Learning. Knowledge-Based Systems, vol. 21, no. 8, pp. 753--763 (2008)
4. Chawla, N., Bowyer, K., Hall, L., Kegelmeyer, W. P.: SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, vol. 16, pp. 321--357 (2002)
5. SECOM (SEmi COnductor Manufacturing) dataset (2010), http://www.causality.inf.ethz.ch/repository.php
6. Witten, I. H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann (2000)