Data Imbalance Problem solving for SMOTE Based Oversampling: Study on Fault Detection Prediction Model in Semiconductor Manufacturing Process


Vol. 133 (Information Technology and Computer Science 2016), pp. 79-84
http://dx.doi.org/10.14257/astl.2016

Data Imbalance Problem Solving for SMOTE Based Oversampling: Study on Fault Detection Prediction Model in Semiconductor Manufacturing Process

Jaekwon Kim 1, Youngshin Han 2* and Jongsik Lee 1*

1 Dept. of Computer Science and Information Engineering, Inha University, South Korea (Jaekwon Kim, Jongsik Lee: jslee@inha.ac.kr)
2 Dept. of Computer Engineering, Sungkyul University, South Korea (Youngshin Han: hanys@sungkyul.ac.kr)

* Corresponding authors: Youngshin Han and Jongsik Lee.

Abstract. Fault detection prediction for the FAB (wafer fabrication) process in semiconductor manufacturing can improve product quality and reliability, depending on its classification performance. However, faults occur only occasionally in the FAB process and most products pass, so the pass/fail classes are imbalanced. When the data are imbalanced, prediction models have difficulty predicting the fail class because the bias toward the majority (pass) class increases. In this paper, we propose a SMOTE (Synthetic Minority Over-sampling Technique) based oversampling method to solve the data imbalance problem. The proposed method resolves the imbalance between pass and fail by oversampling the minority fail class. In addition, we measure its performance by applying it to a fault detection prediction model.

Keywords: Semiconductor manufacturing process, Fault detection prediction, Oversampling, SMOTE

1 Introduction

The probe test is the step that classifies a wafer as pass/fail (regular/irregular) after the FAB process is finished [1]. Until now, the semiconductor manufacturing process has predicted semiconductor yield using the FAB process and the probe test, but this approach causes lead time and cost problems, because the level of manufacturing technology keeps rising and so does the number of chips on a wafer. Predicting the final test yield in the semiconductor industry therefore requires a way to reduce lead time and cost. The complex wafer manufacturing process can introduce defects that cause products to fail, so the semiconductor manufacturing process needs fault detection and classification methods. In other words, a fault detection prediction model can quickly predict the final product and thereby improve its quality and reliability [2].

Resolving the data imbalance improves the classification accuracy of a fault detection prediction model [3]. Because fault cases in the semiconductor manufacturing process are rare, an imbalance arises between the pass and fail classes of the final product. The prediction model therefore needs a data sampling method that can resolve this imbalance. In general, under-sampling or oversampling is used, depending on the degree of imbalance. However, when the dataset is imbalanced and some classes contain overlapping records, the classification result is strongly affected by both the amount of overlap and the degree of imbalance, so an oversampling method that also addresses the overlap problem is required. In this paper, we propose SMOTE (Synthetic Minority Over-sampling Technique) [4] based oversampling for the data imbalance in the semiconductor manufacturing process. The proposed method resolves the imbalance between the classes to improve the accuracy of the fault detection prediction model. This study uses the SECOM dataset [5] and builds data preprocessing and prediction models.

2 Method

In this paper, a SMOTE based sampling technique is used to improve the performance of the predictive model. SMOTE generates new minority class data using KNN (k-nearest neighbors) and thereby balances the minority and majority classes. The framework for generating the fault detection prediction model is shown in Fig. 1.

Fig. 1. Framework

The proposed framework consists of two phases. The first phase is the preprocessing step that prepares the SECOM dataset for the prediction models; it consists of data cleaning and feature selection.
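As an illustration, the following is a minimal sketch of this preprocessing phase, assuming pandas and scikit-learn are used; the file and column names are hypothetical, mean imputation of the remaining missing values is an assumption not stated in the paper, and a plain PCA projection stands in for the paper's PCA-based feature selection. The 60% missing-value threshold and the 35 retained features are taken from Sections 2 and 3.

    # Sketch of the first (preprocessing) phase, assuming pandas and scikit-learn.
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.model_selection import train_test_split

    # Load SECOM sensor measurements and pass/fail labels (hypothetical file and column names).
    X = pd.read_csv("secom_features.csv")
    y = pd.read_csv("secom_labels.csv")["pass_fail"]   # assumed coding: 0 = pass, 1 = fail

    # Data cleaning: drop attributes whose missing-value ratio exceeds 60%,
    # then impute remaining missing values with the column mean (assumption).
    missing_ratio = X.isna().mean()
    X = X.loc[:, missing_ratio <= 0.60]
    X = X.fillna(X.mean())

    # Feature selection: PCA projection to 35 components, standing in for
    # the paper's PCA-based feature selection.
    pca = PCA(n_components=35)
    X_reduced = pca.fit_transform(X)

    # Split into 70% training / 30% testing sets (stratified split assumed
    # so that the fail ratio is preserved in both sets).
    X_train, X_test, y_train, y_test = train_test_split(
        X_reduced, y, test_size=0.30, stratify=y, random_state=0)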

The SECOM dataset is divided into a training set (70%) and a testing set (30%). Oversampling uses SMOTE, configured to a 1:2 balance (minority class: 33.4%, majority class: 66.6%). The second phase generates the prediction models and evaluates them. From the training set, four prediction models are built: LR (Logistic Regression), ANN (Artificial Neural Network), DT (Decision Tree C4.5) and RF (Random Forest). The confusion matrix is used to evaluate the prediction models.

The procedure for generating the fault prediction model, including SMOTE based oversampling, is as follows (a Python sketch of steps 3)-5) is given after Fig. 3):

Data cleaning
1) Count the not-available or missing values in each attribute. If an attribute is missing in more than 60% of the records, remove that attribute.

Feature selection
2) Apply PCA (Principal Component Analysis) based feature selection.

Oversampling
3) Balance the pass/fail classes using SMOTE based oversampling. The SMOTE pseudocode is shown in Table 1.

Table 1. SMOTE pseudocode (adapted from [4])

    Start
    for each minority class sample i
        compute its k nearest minority class neighbors (k = 10) and save their indices
    end for
    while more synthetic samples are needed
        choose a random number between 1 and k, call it nn   // picks one of the k nearest neighbors of sample i
        for j <- 1 to number of attributes
            dif = MinorityClassSample[nn][j] - MinorityClassSample[i][j]
            gap = rand()                                      // random number between 0 and 1
            NewClassSample[newindex][j] = MinorityClassSample[i][j] + gap * dif
        end for
        newindex <- newindex + 1
    end while
    End

Prediction model build
4) Build a fault prediction model with LR, ANN, DT (C4.5) and RF.
5) Using the confusion matrix, compare the precision, recall (sensitivity) and F-measure. The confusion matrix layout is shown in Fig. 3 (TP: True Positive, FP: False Positive, FN: False Negative, TN: True Negative).

Fig. 3. Confusion matrix
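The following is a minimal sketch of steps 3)-5), under stated assumptions: it relies on the imbalanced-learn and scikit-learn packages (not named in the paper), uses scikit-learn's CART decision tree in place of C4.5, and reuses the X_train/X_test variables and the 0 = pass / 1 = fail coding from the preprocessing sketch above.

    # Sketch of the second phase: SMOTE to a 1:2 ratio, model building, evaluation.
    from imblearn.over_sampling import SMOTE
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

    # Oversample only the training set so the minority (fail) class reaches half
    # the size of the majority class, i.e. the paper's 1:2 (33.4% / 66.6%) balance.
    smote = SMOTE(sampling_strategy=0.5, k_neighbors=10, random_state=0)
    X_res, y_res = smote.fit_resample(X_train, y_train)

    models = {
        "LR": LogisticRegression(max_iter=1000),
        "ANN": MLPClassifier(max_iter=1000),
        "DT": DecisionTreeClassifier(),     # CART, used here as a substitute for C4.5
        "RF": RandomForestClassifier(),
    }

    for name, model in models.items():
        model.fit(X_res, y_res)
        y_pred = model.predict(X_test)
        # Confusion matrix with the fail class treated as the positive class (label 1 assumed).
        tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
        print(name, "TP/FP/FN/TN:", tp, fp, fn, tn,
              "precision:", precision_score(y_test, y_pred, zero_division=0),
              "recall:", recall_score(y_test, y_pred),
              "F-measure:", f1_score(y_test, y_pred))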

3 Experimental

We used the SECOM dataset [5] for the experiment. The SECOM dataset consists of 1567 records and 590 attributes; 104 records belong to the fail class and 1463 to the pass class (about 6.6% fail, 93.4% pass). Through data cleaning, 271 of the 590 attributes, those with more than 60% 'NaN (not available)' missing values, were removed, and 309 attributes were finally used. Feature selection with PCA then reduced these 309 features to a final 35 features. The SECOM dataset was split into a 70% training set (1099 records; pass: 1026, fail: 73) and a 30% testing set (468 records; pass: 437, fail: 31). Using the training set, fault prediction models were built with LR, ANN, DT and RF. For the oversampling comparison, SMOTE 1:2 is compared with RUS (Random Under-Sampling) [6] 1:2. The confusion matrices of the results are shown in Table 2, and the results of each model are shown in Fig. 4.

Table 2. Confusion matrix results

Sampling method   Prediction model   TP   FP   FN   TN
SMOTE 1:2         LR                 14   16   61   377
                  ANN                 5   25   27   411
                  DT                  5   25   69   369
                  RF                  7   23   27   411
RUS 1:2           LR                 17   13  107   331
                  ANN                18   12  173   265
                  DT                  7   23   69   369
                  RF                  5   25   20   418
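For reference, the following small sketch shows how the measures discussed below can be computed from the TP/FP/FN/TN counts in Table 2, treating the fail class as the positive class (an assumption); the per-model values obtained this way will not necessarily reproduce the averaged figures quoted below, which depend on how the counts in Table 2 are oriented.

    # Standard measures from confusion-matrix counts (fail class assumed positive).
    def measures(tp, fp, fn, tn):
        sensitivity = tp / (tp + fn)                 # recall for the fail class
        specificity = tn / (tn + fp)
        accuracy = (tp + tn) / (tp + fp + fn + tn)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        f_measure = (2 * precision * sensitivity / (precision + sensitivity)
                     if (precision + sensitivity) else 0.0)
        return sensitivity, specificity, accuracy, precision, f_measure

    # Example: the SMOTE 1:2 LR row of Table 2 (TP=14, FP=16, FN=61, TN=377).
    print(measures(14, 16, 61, 377))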

Fig. 4. Performance measure

The average sensitivity over all models was higher for RUS (SMOTE 0.259, RUS 0.392). The average specificity was SMOTE 0.895 and RUS 0.789; the average accuracy was SMOTE 0.854 and RUS 0.764; the average precision was SMOTE 0.154 and RUS 0.131; and the average F-measure was SMOTE 0.186 and RUS 0.175. Apart from sensitivity, SMOTE therefore performs better than RUS. Although the results differ depending on the classification model, SMOTE is generally effective for configuring the fault detection prediction model, so SMOTE based oversampling can be used effectively in the semiconductor manufacturing process.

4 Conclusion

The semiconductor manufacturing process incurs large costs that depend on the pass/fail classification. In this study, we proposed SMOTE (Synthetic Minority Over-sampling Technique) based oversampling to solve the data imbalance between pass and fail. The proposed method was applied to the SECOM dataset [5], and the classification models used were LR, ANN, DT and RF. SMOTE based oversampling offered better performance than the RUS-based alternative. Future studies should investigate ways to further increase the accuracy of the classification prediction.

Acknowledgment. This work was funded by the Ministry of Science, ICT and Future Planning (NRF-2015R1C1A2A01051452).

References

1. Kim, K.-H., Baek, J.: A Prediction of Chip Quality using OPTICS (Ordering Points to Identify the Clustering Structure)-based Feature Extraction at the Cell Level. Journal of the Korean Institute of Industrial Engineers, vol. 40, no. 3, pp. 257--266 (2014)
2. Kerdprasop, K., Kerdprasop, N.: Feature Selection and Boosting Techniques to Improve Fault Detection Accuracy in the Semiconductor Manufacturing Process. Proc. of the International MultiConference of Engineers and Computer Scientists, vol. 1 (2011)
3. Liu, J., Hu, Q., Yu, D.: A Comparative Study on Rough Set Based Class Imbalance Learning. Knowledge-Based Systems, vol. 21, no. 8, pp. 753--763 (2008)
4. Chawla, N., Bowyer, K., Hall, L., Kegelmeyer, W. P.: SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, vol. 16, pp. 321--357 (2002)
5. SECOM (SEmi COnductor Manufacturing) dataset (2010), http://www.causality.inf.ethz.ch/repository.php
6. Witten, I. H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann (2000)