Cluster-Based Profile Monitoring in Phase I Analysis. Yajuan Chen. Doctor of Philosophy In Statistics


Cluster-Based Profile Monitoring in Phase I Analysis

Yajuan Chen

Dissertation submitted to the faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Statistics

Jeffery B. Birch
Pang Du
Inyoung Kim
William H. Woodall

1/31/2014
Blacksburg, VA

Keywords: Cluster, Mixed Model, Phase I, Phase II, Robust, $T^2$ Statistic

ABSTRACT

Profile monitoring is a well-known approach used in statistical process control where the quality of the product or process is characterized by a profile or a relationship between a response variable and one or more explanatory variables. Profile monitoring is conducted over two phases, labeled as Phase I and Phase II. In Phase I profile monitoring, regression methods are used to model each profile and to detect the possible presence of out-of-control profiles in the historical data set (HDS). The out-of-control profiles can be detected by using the $T^2$ statistic. However, previous methods of calculating the $T^2$ statistic are based on using all the data in the HDS, including the data from the out-of-control process. Consequently, the effectiveness of this method can be distorted if the HDS contains data from the out-of-control process. This work provides a new profile monitoring methodology for Phase I analysis. The proposed method, referred to as the cluster-based profile monitoring method, incorporates a cluster analysis phase before calculating the $T^2$ statistic.

Before introducing our proposed cluster-based method in profile monitoring, this cluster-based method is demonstrated to work efficiently in robust regression, where it is referred to as cluster-based bounded influence regression or CBI. It will be demonstrated that the CBI method provides a robust, efficient and high breakdown regression parameter estimator. The CBI method first represents the data space via a special set of points, referred to as anchor points. Then a collection of single-point-added ordinary least squares regression estimators forms the basis of a metric used in defining the similarity between any two observations. Cluster analysis then yields a main cluster containing at least half the observations, with the remaining observations comprising one or more minor clusters. An initial regression estimator arises from the main cluster, with a group-additive DFFITS argument used to carefully activate the minor clusters through a bounded influence regression framework. CBI achieves a 50% breakdown point, is regression equivariant, scale and affine equivariant, and is asymptotically normal in distribution. Case studies and Monte Carlo results demonstrate the performance advantage of CBI over other popular robust regression procedures regarding coefficient stability, scale estimation and standard errors.

The cluster-based method in Phase I profile monitoring first replaces the data from each sampled unit with an estimated profile, using some appropriate regression method. The estimated parameters for the parametric profiles are obtained from parametric models while the estimated parameters for the nonparametric profiles are obtained from the p-spline model. The cluster phase clusters the profiles based on their estimated parameters and this yields an initial main cluster which contains at least half the profiles. The initial estimated parameters for the population average (PA) profile are obtained by fitting a mixed model (parametric or nonparametric) to those profiles in the main cluster. Profiles that are not contained in the initial main cluster are iteratively added to the main cluster provided their $T^2$ statistics are small, and the mixed model (parametric or nonparametric) is used to update the estimated parameters for the PA profile. Those profiles contained in the final main cluster are considered as resulting from the in-control process while those not included are considered as resulting from an out-of-control process. This cluster-based method has been applied to monitor both parametric and nonparametric profiles. A simulated example, a Monte Carlo study and an application to a real data set demonstrate the details of the algorithm, and the performance advantage of this proposed method over a non-cluster-based method is demonstrated with respect to more accurate estimates of the PA parameters and improved classification performance criteria. When the profiles can be represented by $m$ $p \times 1$ vectors, the profile monitoring process is equivalent to the detection of multivariate outliers. For this reason, we also compared our proposed method to a popular method used to identify outliers when dealing with a multivariate response. Our study demonstrated that when the out-of-control process corresponds to a sustained shift, the cluster-based method using the successive difference estimator is clearly the superior method, among those methods we considered, based on all performance criteria.

In addition, the influence of accurate Phase I estimates on the performance of Phase II control charts is presented to show the further advantage of the proposed method. A simple example and Monte Carlo results show that more accurate estimates from Phase I would provide more efficient Phase II control charts.

Acknowledgments

I would like to express my wholehearted gratitude to my advisor Dr. Jeffery B. Birch for his invaluable guidance, advice and countless hours of editing to help me complete this dissertation. His encouragement, support and thoughtfulness are highly appreciated. As a professor, Dr. Birch is an expert in his statistical areas, and as a person, Dr. Birch is a very kind, organized and positive man. I have learned a lot from him during my graduate study, and he is the guide in my future life. Also, I would like to thank Dr. Birch for his excellent work as the director of our graduate program. As a graduate student in the Department of Statistics at Virginia Tech, I feel so lucky to have him here as he cares about every aspect of each student's life and provides the help needed to make each graduate student's experience in our department an enjoyable one.

I would like to thank my committee members: Dr. Pang Du, Dr. Inyoung Kim, and Dr. William H. Woodall. They were very helpful and provided me positive input to further improve this work. I thank Dr. Woodall for his great editing and technical advice for my dissertation, papers and presentations. I thank Dr. Kim for her guidance in nonparametric analysis and I thank Dr. Du for his guidance in functional data analysis.

Many people on the faculty, staff and among the graduate students of the Department of Statistics assisted and encouraged me in various ways. I would like to thank them. I would especially like to thank our faculty for providing excellent courses and consulting experiences that have prepared me for my future career. Also, I thank all my friends in Blacksburg, whose friendship has supported me and has brought me a lot of good memories in my life.

I thank my parents and my brothers. They have been very supportive and encouraging in my life. I want to especially thank my brother Chuanwen, who gave me much help whenever it was needed. Further, I thank my parents-in-law for helping us to take care of Ella so I have time to finish this dissertation.

Last but not least, I would like to thank my sweet husband Yi and my lovely daughter Ella. I thank them for coming into my life, and for bringing me such joy and happiness. I thank Yi for his love, support, patience and help during my graduate studies.

Contents

Acknowledgments
Contents
List of Tables
List of Figures
Acronyms
Nomenclature

Chapter 1. Introduction and Motivation
  Robust Estimation in Regression
  Robust estimation in SPC
  Phase I and Phase II in SPC
  Robust estimation in Phase I
  Profile Monitoring in SPC
  Motivation

Chapter 2. Cluster-Based Bounded Influence Regression
  Review of Robust Regressions
  Review of Selected Robust Regression Methods
  Cluster-Based Bounded Influence Regression
  Case Studies and Comparison
  Monte Carlo Study
  Chapter Summary

Chapter 3. Profile Monitoring Literature
  Phase I and Phase II
  Profile Monitoring Literature Review
  Multivariate $T^2$ Statistics
  Profile Monitoring for Mixed Model
  Linear Mixed Models and its Parametric Estimation
  Nonparametric Mixed Regression and P-spline Estimation
  Detecting the Out-of-control Process
  Detecting the Out-of-control Process Using LMM
  Detecting Out-of-control Process Using the P-spline Mixed Model
  Chapter Summary

Chapter 4. Cluster-Based Profile Monitoring in Phase I
  Motivation
  Proposed Cluster-based Profile Monitoring Method
  Detailed Simple Example
  Automobile Engine Application
  A Monte Carlo Study
  Further Analysis based on the Monte Carlo Study
  Chapter Summary

Chapter 5. Phase II Control Charts based on Phase I Analysis
  Profile Monitoring in Phase II
  Detailed Simple Example
  ARL based on Monte Carlo Study
  Chapter Summary

Chapter 6. Cluster-Based Nonparametric Profile Monitoring
  Cluster-Based Nonparametric Profile Monitoring
  An Automobile Engine Application
  A Monte Carlo Study
  Conclusion

Chapter 7. Conclusions and Outlook for Future Work
  Conclusions
  Outlook for Future Work

References
Appendix

List of Tables

Table 2.1: Summary of the CBI regression analysis of the PH dataset
Table 2.2: Robust analysis of parameter estimate summary of PH dataset
Table 2.3: CBI analysis of parameter estimate summary of HBK dataset
Table 2.4: Robust analysis of parameter estimate summary of HBK dataset
Table 2.5: Simulation results for Monte Carlo study
Table 2.6: Standardized average weight for observations
Table 4.1: Dataset for the example
Table 4.2: $12 \times 3$ $\hat{B}$ matrix; the parameter estimates for 12 profiles
Table 4.3: Similarity matrix using $s_{ij} = (\hat{\beta}_i - \hat{\beta}_j)^T \hat{V}_D^{-1} (\hat{\beta}_i - \hat{\beta}_j)$
Table 4.4: Cluster history for example data
Table 4.5: Eblups for the profiles in $C_{final}$
Table 4.6: The Automotive Industry Data, Torque (T) vs. RPM
Table 4.7: The parameter estimates for 20 engines
Table 4.8: Cluster history for 20 engines
Table 4.9: Classification table for Phase I analysis
Table 4.10: Average of performances based on a Monte Carlo study
Table 4.11: Average of PA parameter estimates based on a Monte Carlo study
Table 4.12: Classification table for non-cluster-based method (shift=0.05)
Table 4.13: Classification table for cluster-based method (shift=0.05)
Table 4.14: Classification table for cluster-based method (shift=0.175)
Table 4.15: Classification table for cluster-based method (shift=0.175)
Table 4.16: The parameter estimates for 30 profiles (shift=0.3)
Table 4.17: Classification table for cluster-based method (shift=0.3)
Table 4.18: Classification table for cluster-based method (shift=0.3)
Table 4.19: Performance of one simulation study with different shift
Table 5.1: ARL_CB and ARL_NCB with ARL
Table 5.2: ARL_CB and ARL_NCB with Phase I shift=0.05, ARL
Table 5.3: ARL_CB and ARL_NCB with Phase I shift=0.075, ARL
Table 5.4: ARL_CB and ARL_NCB with Phase I shift=0.1, ARL
Table 5.5: ARL_CB and ARL_NCB with Phase I shift=0.125, ARL
Table 5.6: ARL_C and ARL_NCB with Phase I shift=0.15, ARL
Table 5.7: ARL_CB and ARL_NCB with Phase I shift=0.175, ARL
Table 5.8: ARL_CB and ARL_NCB with Phase I shift=0.2, ARL
Table 5.9: ARL_CB and ARL_NCB with Phase I shift=0.225, ARL
Table 5.10: ARL_CB and ARL_NCB with Phase I shift=0.25, ARL
Table 5.11: ARL_CB and ARL_NCB with Phase I shift=0.275, ARL
Table 5.12: ARL_CB and ARL_NCB with Phase I shift=0.3, ARL
Table 6.1: Estimated coefficients, $1, \ldots, 14$, for each engine
Table 6.2: Cluster history based on eblups for 20 engines

List of Figures

Figure 1.1: The plot of 12 true profiles
Figure 2.1: The fitted line of the different robust methods
Figure 2.2: Cluster dendrogram and final observation weights of PH dataset
Figure 2.3: The final CBI regression observation weights of HBK dataset
Figure 4.1: Plot of 12 observed profiles
Figure 4.2: Dendrogram for clustering of example dataset
Figure 4.3: The raw data set for 20 automobile engines
Figure 4.4: Dendrogram for clustering of 20 engines
Figure 4.5: Plot of true profiles with shift=
Figure 4.6: Plot of true profiles with shift=
Figure 4.7: Plot of true profiles with shift=
Figure 4.8: 3D Plot of estimated PA and PS parameter vectors when the shift=
Figure 4.9: 3D Plot of estimated PA and PS parameters
Figure 6.1: Dendrogram for clustering of 20 engines by nonparametric approach
Figure 6.2: Plot of PA profile with different values
Figure 6.3: FCC for different shift values with =
Figure 6.4: FCC for different shift values with =
Figure 6.5: FCC for different shift values with =
Figure 6.6: FPR for different shift values with =
Figure 6.7: FPR for different shift values with =
Figure 6.8: FPR for different shift values with =

Acronyms

ARL       Average Run Length
ARL_1     Out-of-Control ARL
ARL_0     In-Control ARL
ARL_C     ARL based on the cluster-based $T^2$ control chart
ARL_NCB   ARL based on the non-cluster-based $T^2$ control chart
BI        Bounded Influence
blup      Best Linear Unbiased Predictor
CBI       Cluster-based Bounded Influence Regression
eblup     Estimated Best Linear Unbiased Predictor
FCC       Fraction Correctly Classified
FNR       False Negative Rate
FPR       False Positive Rate
HDS       Historical Data Set
HBK       Hawkins DM, Bradu D, Gordon VK
hip       High Influence Point
LMM       Linear Mixed Model
LMS       Least Median Squares
LTS       Least Trimmed Squares
L-W       Laird-Ware
MCD       Minimum Covariance Determinant
MLE       Maximum Likelihood Estimator
MRPM      Model Robust Profile Monitoring
MVE       Minimum Volume Ellipsoid
M1S       Mallows 1-step
OLS       Ordinary Least Squares
PA        Population Average
PH        Pendleton and Hocking
POS       Probability of Signal
PS        Profile Specific
REMLE     Restricted Maximum Likelihood Estimator
REWLS     Robust and Efficient Weighted Least Square
RMCD      Reweighted Minimum Covariance Determinant
SPC       Statistical Process Control
S1S       Schweppe's 1-step

Nomenclature

$\beta$    A $p \times 1$ unknown parameter vector
$\hat{\beta}$    A $p \times 1$ estimated parameter vector
$\hat{\beta}_M$    The $p \times 1$ estimated parameter vector by using M regression
$\hat{\beta}_{CBI}$    The $p \times 1$ estimated parameter vector by using CBI regression
$\hat{\beta}_{P,i}$    The $p \times 1$ estimated parameter vector for the $i$th profile
$\hat{\beta}_{LMM}$    The $p \times 1$ estimated parameter vector for the PA profile by LMM approach
$\hat{\beta}_{REWLS}$    The $p \times 1$ estimated parameter vector by using REWLS regression
$\beta_0$    The intercept parameter for the in-control PA
$\beta_1$    The slope parameter for the in-control PA
$\beta_2$    The quadratic parameter for the in-control PA
$\beta_0^{*}$    The intercept parameter for the out-of-control PA
$\beta_1^{*}$    The slope parameter for the out-of-control PA
$\beta_2^{*}$    The quadratic parameter for the out-of-control PA
$b_i$    A vector of random effects that represent the eblups for the $i$th profile
$b_{0i}$    The random intercept effect for the $i$th profile
$b_{1i}$    The random slope effect for the $i$th profile
$b_{2i}$    The random quadratic effect for the $i$th profile
$f(\cdot)$    True PA profile function
$f_i(\cdot)$    True $i$th profile function
$G$    The $q \times q$ covariance matrix for the random effects $b_i$
$H$    Hat matrix of $X$
$h_{ii}$    The $i$th diagonal element of $H$
$MVE_1(Z_y)$    The MVE center estimator for $Z_y$
$MVE_2(Z_y)$    The MVE scale matrix estimator for $Z_y$
$m_1$    Number of the in-control profiles
$m$    Number of the total profiles in HDS
$n$    Number of observations
$R$    A $n \times n$ covariance matrix for the random error $\varepsilon$
$r$    A $n \times 1$ vector of residuals
$r_i$    The residual for the $i$th observation
$rs_i$    The absolute scaled residual for the $i$th observation
$T^2_i$    Hotelling's $T^2$ statistic for the $i$th time period
$T^2_{MVE,i}$    Hotelling's $T^2$ statistic for the $i$th time period based on MVE
$T^2_{MCD,i}$    Hotelling's $T^2$ statistic for the $i$th time period based on MCD
$T^2_{P1,i}$    Hotelling's $T^2$ statistic based on parametric fitted value
$T^2_{P2,i}$    Hotelling's $T^2$ statistic based on parametric eblups
$T^2_{NP2,i}$    Hotelling's $T^2$ statistic based on P-spline fitted value
$T^2_{NP1,i}$    Hotelling's $T^2$ statistic based on P-spline eblups
$t_i$    A vector of random effects that represents the coefficients for the spline component
$\hat{V}$    Estimator of variance-covariance matrix
$\hat{V}_D$    The successive difference estimator of variance-covariance matrix
$V$    The $n \times n$ covariance matrix for the response vector $y$
$\hat{V}_{MVE}$    Estimator of variance-covariance matrix based on MVE
$\hat{V}_{MCD}$    Estimator of variance-covariance matrix based on MCD
$\hat{V}_P$    The pooled sample estimator of variance-covariance matrix
$W$    The $n \times n$ diagonal weight matrix
$w_i$    The $i$th diagonal element of $W$
$X$    A design matrix
$x_i^T$    The $i$th row of design matrix $X$
$X_i$    The $n \times p$ matrix of explanatory variables for the $i$th profile
$x_{ij}$    The fixed regressor for the $j$th observation from the $i$th profile
$\hat{\mu}_i$    A $n \times 1$ estimated mean vector for the $i$th time period
$\hat{\mu}_{MCD}$    An estimated mean vector for the $i$th time period based on MCD
$\hat{\mu}_{MVE}$    An estimated mean vector for the $i$th time period based on MVE
$\hat{y}^{P}_{PS,i}$    A vector of parametric fitted values for the $i$th profile
$\hat{y}^{p\text{-}s}_{PS,i}$    A vector of p-spline fitted values for the $i$th profile
$\hat{y}^{p\text{-}s}_{PA}$    A vector of p-spline fitted values for the PA profile
$\hat{y}_{PA}$    A vector of parametric fitted values for the PA profile
$y_i$    The $n \times 1$ response vector for the $i$th profile
$y_{ij}$    The response value for the $j$th observation from the $i$th profile
$Z$    The $n \times k$ matrix containing only the $k$ regressor variables
$Z_y$    The $n \times p$ matrix formed by augmenting the vector $y$ to $Z$
$Z_i$    The $n \times q$ matrix of explanatory variables for the $i$th profile
$z_i^T$    The $i$th row of matrix $Z$
$z_{y,i}^T$    The $i$th row of matrix $Z_y$
True $i$th profile smooth function
$\varepsilon_i$    The random error term for the $i$th profile
$\varepsilon_{ij}$    The random error for the $j$th observation from the $i$th profile
The smooth parameter for the penalized regression

Chapter 1. Introduction and Motivation

In statistical analysis, the observed data often do not fully conform to statistical model assumptions. For example, as stated in Hampel et al. (1986), routine data are thought to contain 1% to 10% gross errors. Because of these errors, robust estimation plays an important role in statistical analysis. For example, in a regression study, abnormal data points can severely distort the estimates and the true relationship between the covariates and the response. Consequently, the model's prediction ability is similarly distorted. In other applications, such as statistical process control (SPC), robust statistics are also used so that the control limits based on these statistics are not distorted by abnormal measurements in the historical data set (HDS).

1.1 Robust Estimation in Regression

It is known that the ordinary least squares (OLS) estimator lacks resistance to as little as one unusual observation. The corresponding coefficients and their standard errors, predictions, diagnostics, hypothesis tests, and other numerical measures can all become very misleading due to a single anomalous observation. Robust procedures are designed to capture the general trend of the data in the presence of unusual data. Most of the robust regression methodologies were provided by the early 1980's. For example, M regression (Huber (1981)) and bounded influence (BI) regression (Huber (1981)) work well in the presence of low leverage outliers and at most one high influence point, respectively. However, they are unable to combat a small percentage of outliers. Repeated sampling based methods, such as least median squares (LMS) regression (Rousseeuw (1984)) and least trimmed squares (LTS) regression (Ruppert and Carroll (1980)), on the other hand, are examples of high breakdown estimators, as they possess the ability to provide reasonable parameter estimates with as much as 50% of the data being contaminated. Poor efficiency and numerical/computational sensitivity with large datasets have typically led to their primary use as an initial estimator feeding into other robust procedures such as M or BI estimators. Examples include Mallows 1-step (M1S) regression (Simpson et al. (1992)) and Schweppe's 1-step (S1S) regression (Coakley and Hettmansperger (1993)), which are one-step adjustments of LTS that increase efficiency versus the LTS estimator. However, two virtually identical LTS estimates may yield dramatically different M1S (or S1S) estimators (Lawrence (2003)), thereby illustrating a potential negative issue with repeated sampling based methods.

Another high breakdown one-step estimation method is due to Gervini and Yohai (2002). Their robust and efficient weighted least square estimate (REWLS) procedure attains full asymptotic efficiency under the assumption of normally distributed random errors. However, the REWLS, on average, fails to correctly identify the good and bad high leverage points when the error term is not ideally normally distributed (Lawrence (2003)). A robust, efficient, high breakdown regression methodology was proposed by Lawrence (2003), called the cluster-based bounded influence regression (CBI) method, which combines a suitable clustering method with the bounded influence regression method. In this research, a revised version of this method is presented and evaluated in Chapter 2.

1.2 Robust Estimation in SPC

As previously mentioned, statistical data sets frequently contain errors. Not surprisingly, such data anomalies can occur in the statistical process control setting. Robust estimation methods have also been proposed in SPC to avoid the misleading results from these errors.

1.2.1 Phase I and Phase II in SPC

SPC involves two phases, Phase I and Phase II. In Phase I, a HDS is analyzed to determine which data points are from an in-control process and which ones, if any, are from an out-of-control process. Data points determined to be from an out-of-control process are usually removed and the remaining data points from the in-control process are used to calculate the statistics needed for computing control limits used in Phase II analysis. In Phase II, future observations are monitored by using the control limits calculated from Phase I estimates to determine if the process continues to be in-control. The control limits obtained in Phase I directly affect the performance of the Phase II analysis, so accurate control limits from Phase I are desirable. This research also focuses on the estimation in Phase I and how these estimates affect performance in Phase II.

1.2.2 Robust Estimation in Phase I

Recall that the purpose of the Phase I analysis is to examine the HDS and obtain control limits that are sufficiently accurate for Phase II monitoring. However, like the estimates of regression analysis, the statistics used for control limits obtained from the HDS can be pulled in the direction of the multivariate outliers if the HDS contains data from an out-of-control process. Robust estimation techniques are used to obtain control limits that are not unduly influenced by unusual data points. Consequently, the control limits will be more accurate and effective in Phase II analysis.

In most previous studies, products and processes were characterized by either univariate quality control data or multivariate quality control data. Robust estimation methods for univariate quality control data (such as those based on a median or trimmed mean) are straightforward and have received attention in past research (Rocke (1989); Tatum (1997); de Mast and Roes (2004); Cali Manning and Adams (2005)). Robust methods for multivariate quality control data are not as straightforward, nor as easily implemented.

When dealing with multivariate quality control data, it is assumed that the HDS consists of $m$ time-ordered vectors that are independent of each other. Frequently the Hotelling's $T^2$ statistic is used to determine if a multivariate data point results from an out-of-control process. In particular, if each vector is of dimension $p$ and if $\hat{\mu}_i$ denotes a vector containing $p$ elements for the $i$th time period, then the Hotelling's $T^2$ statistic for the $i$th time period is defined as

$$T^2_i = (\hat{\mu}_i - \bar{\mu})^T \hat{V}^{-1} (\hat{\mu}_i - \bar{\mu}), \quad i = 1, 2, \ldots, m, \qquad (1.1)$$

where $\bar{\mu} = \frac{1}{m}\sum_{i=1}^{m} \hat{\mu}_i$ and $\hat{V}$ is an estimator of the variance-covariance matrix $V$ of $\hat{\mu}_i$ (see Chapter 3 for more details). The Hotelling's $T^2$ is an example of a statistic that is not robust to outlying observations.

One commonly used robust $T^2$ statistic for multivariate data results from replacing the moment-based estimator of $V$, the one typically used, by an estimator based on the minimum volume ellipsoid (MVE) estimator (Vargas (2003)). The MVE estimator, first proposed by Rousseeuw (1984), has been frequently used for the detection of multivariate outliers. The MVE estimator seeks the ellipsoid of minimum volume that covers a subset of at least half of the total data points. One well-known algorithm for the MVE estimator, provided by Rousseeuw and Leroy (1987), is an approximate method using a sub-sampling procedure. However, the problem with this sub-sampling algorithm is that it lacks repeatability and results in estimates with poor efficiency. An exact method to calculate the MVE estimator was later proposed by Cook et al. (1993) to avoid the repeatability problem. However, this exact method is only computationally feasible for small datasets (Cook et al. (1993)). Other computationally feasible methods to find an approximate MVE have been proposed. For example, Hawkins (1994) proposed a feasible solution algorithm (FSA). Also, methods to find the MVE based on heuristic search algorithms were proposed by Woodruff and Rocke (1993). The $T^2$ statistic for the $i$th time period based on the MVE is denoted by $T^2_{MVE,i}$ (Vargas (2003)),

$$T^2_{MVE,i} = (\hat{\mu}_i - \hat{\mu}_{MVE})^T \hat{V}_{MVE}^{-1} (\hat{\mu}_i - \hat{\mu}_{MVE}), \qquad (1.2)$$

where $\hat{\mu}_{MVE}$ is the MVE estimator of the multivariate mean and $\hat{V}_{MVE}$ is the MVE estimator of the multivariate variance-covariance matrix.

Another frequently used robust $T^2$ statistic for multivariate data is based on the minimum covariance determinant (MCD) estimator, which was also proposed by Rousseeuw (1984). The MCD estimator is obtained by finding the half set of the data points that gives the minimum value of the determinant of the variance-covariance matrix. Similar to the MVE estimator, there are both approximate methods and exact methods to obtain the MCD estimates. For example, MCD estimates can be computed via the exact method provided by Cook et al. (1993). The sub-sampling approach of Rousseeuw and Leroy (1987) can be used to get an approximate MCD estimate, which would have the same repeatability issue. The feasible solution algorithm of Hawkins (1993) can be implemented for the MCD, as shown by Hawkins (1994). An improved version of the feasible solution algorithm for the MCD was proposed by Hawkins and Olive (1999). The $T^2$ statistic for the $i$th time period based on the MCD is denoted by $T^2_{MCD,i}$ (Vargas (2003)),

$$T^2_{MCD,i} = (\hat{\mu}_i - \hat{\mu}_{MCD})^T \hat{V}_{MCD}^{-1} (\hat{\mu}_i - \hat{\mu}_{MCD}), \qquad (1.3)$$

where $\hat{\mu}_{MCD}$ is the MCD estimator of the multivariate mean and $\hat{V}_{MCD}$ is the MCD estimator of the multivariate covariance matrix. Estimators based on the MVE and MCD are powerful in detecting a reasonable number of outliers, as demonstrated by Jensen et al. (2007) and Yanez et al. (2010).

Other robust estimators have been proposed for the multivariate SPC setting. Yanez et al. (2010) proposed a $T^2$ statistic using S estimators based on the biweight function for the location and dispersion parameters when monitoring multivariate individual observations. They showed that this method outperforms the MVE estimators for a small number of observations. Other robust estimators defined using trimming, proposed by Alfaro and Ortega (2008) and Chenouri et al. (2009) and referred to as reweighted minimum covariance determinant (RMCD) estimators, have been shown to provide highly robust and efficient estimators of the mean vector and covariance matrix.

1.3 Profile Monitoring in SPC

Another more recent approach to SPC applies when the product or process can be characterized by a profile, or a relationship between a response variable and one or more explanatory variables, instead of univariate or multivariate vectors. The profile monitoring process in Phase I first represents the profiles in the HDS by some proper modeling technique, and then uses some appropriate method to identify those profiles from the in-control process and those, if any, from the out-of-control process. As a final step, the in-control profiles are used to obtain the control limits for future profile monitoring in real time during Phase II. Further details concerning the profile monitoring literature will be given in Chapter 3.

This dissertation focuses on monitoring profiles using the mixed model, where the mixed model is first fit to the profiles in the HDS to estimate the population average (PA) profile and the proper variance-covariance matrix. See Chapter 3 for more details concerning the mixed model applied to profile monitoring.

1.4 Motivation

When using the mixed model technique for profile monitoring, the first step is to estimate the PA profile and use this estimate in calculating the Hotelling's $T^2$ for each profile to determine whether this profile results from the in-control process. However, in the typical mixed model analysis, the estimated PA profile is based on all profiles in the HDS, including the profiles from the out-of-control process. For example, if there are a large number of profiles from the out-of-control process, or a small number that are far from the in-control process, the estimated PA profile based on the HDS would likely be pulled in the direction of the out-of-control process. Additionally, the corresponding variance-covariance matrix, needed in computing the $T^2$ statistic for each profile, will be similarly distorted. Consequently, the $T^2$ statistics will be misleading and the control limits used in Phase I will be unable to properly separate those profiles belonging to the in-control process from those belonging to the out-of-control process.

In this research, a new profile monitoring methodology in Phase I which incorporates a cluster method will be utilized to obtain the estimated PA profile. This new cluster-based method will be demonstrated to be more robust to the profiles from the out-of-control process than the existing non-cluster-based method (see Jensen et al. (2008) for a thorough discussion of the non-cluster-based method).

Further, it is known that the performance of the Phase I analysis can be measured in terms of correctly identifying the unstable process or, equivalently, the presence of profiles from the out-of-control process in the HDS. An important criterion used to measure the success of a Phase I method at detecting an unstable process is the probability of signal (POS), the probability of detecting at least one profile from the out-of-control process in the HDS. However, the POS only measures the ability to detect the presence of a profile from the out-of-control process in the HDS and does not give any information about whether the classification of profiles into the two categories of in-control and out-of-control is correct.

A simple example is presented to illustrate that the POS is not sufficient to measure the performance of Phase I analysis. In this example, it is assumed that there are a total of $m = 12$ profiles in the HDS, of which nine are from the in-control process while the other three are from the out-of-control process. The in-control profiles were generated from the linear mixed model

$$y_{ij} = (\beta_0 + b_{0i}) + (\beta_1 + b_{1i}) x_{ij} + (\beta_2 + b_{2i}) x_{ij}^2 + \varepsilon_{ij}, \quad i = 1, 2, \ldots, m_1, \ j = 1, 2, \ldots, n, \qquad (1.4)$$

and the out-of-control profiles were generated as

$$y_{ij} = (\beta_0^{*} + b_{0i}) + (\beta_1^{*} + b_{1i}) x_{ij} + (\beta_2^{*} + b_{2i}) x_{ij}^2 + \varepsilon_{ij}, \quad i = m_1 + 1, \ldots, m, \ j = 1, 2, \ldots, n, \qquad (1.5)$$

where the random effects are defined as $b_i = (b_{0i}, b_{1i}, b_{2i})^T \sim MN(0, G)$ and $\varepsilon_i \sim MN(0, \sigma^2 I)$ (here MN represents the multivariate normal distribution), with fixed effects $\beta^T = (12.5, 7, 2)$ for the profiles from the in-control process and $\beta^{*T} = (12.875, 14.5, 3.5)$ for the profiles from the out-of-control process. Additionally, $m_1 = 9$, $m = 12$, $n = 8$, and $\sigma^2 = 4$. Thus, profiles 1 through 9 represent profiles from the in-control process and profiles 10, 11, and 12 represent profiles from the out-of-control process. The 12 true profiles, based on the actual parameter values and random effects, are plotted in Figure 1.1, where the blue curves represent the profiles from the in-control process while the red curves represent the profiles from the out-of-control process. It is difficult to distinguish the profiles from the in-control and out-of-control processes by looking only at the plot.
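The simulated example above can be sketched in Python to make the data-generating model of equations (1.4) and (1.5) concrete. This is a minimal sketch under stated assumptions, not the dissertation's code: the fixed effects are read from the text as (12.5, 7, 2) and (12.875, 14.5, 3.5), with nine in-control profiles, 12 profiles in total, 8 observations per profile and error variance 4, while the x grid, the random-effects covariance `G`, the seed, and the function names `simulate_profiles` and `fit_profiles` are hypothetical choices made only for illustration.

```python
import numpy as np


def simulate_profiles(seed=0):
    """Generate 12 quadratic profiles: 9 in-control, 3 out-of-control."""
    rng = np.random.default_rng(seed)
    m1, m, n = 9, 12, 8
    beta_in = np.array([12.5, 7.0, 2.0])       # in-control fixed effects (as read from the text)
    beta_out = np.array([12.875, 14.5, 3.5])   # out-of-control fixed effects (as read from the text)
    sigma2 = 4.0                               # error variance (as read from the text)
    G = np.diag([1.0, 1.0, 0.1])               # ASSUMPTION: random-effects covariance
    x = np.linspace(0.0, 10.0, n)              # ASSUMPTION: regressor grid
    X = np.column_stack([np.ones(n), x, x ** 2])

    out_of_control = np.arange(m) >= m1        # last three profiles are shifted
    Y = np.empty((m, n))
    for i in range(m):
        beta = beta_out if out_of_control[i] else beta_in
        b_i = rng.multivariate_normal(np.zeros(3), G)         # profile-specific random effects
        eps = rng.normal(0.0, np.sqrt(sigma2), size=n)        # independent errors
        Y[i] = X @ (beta + b_i) + eps
    return X, Y, out_of_control


def fit_profiles(X, Y):
    """Per-profile OLS estimates; returns an m x p matrix of coefficients."""
    coef, *_ = np.linalg.lstsq(X, Y.T, rcond=None)
    return coef.T


if __name__ == "__main__":
    X, Y, ooc = simulate_profiles(seed=1)
    B_hat = fit_profiles(X, Y)
    print(np.round(B_hat, 3))
    print("out-of-control flags:", ooc.astype(int))
```

Calling `fit_profiles(X, Y)` returns one row of estimated quadratic coefficients per profile, which is the kind of per-profile summary on which a clustering step or a $T^2$ statistic could then operate.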

Figure 1.1: The plot of 12 true profiles (y versus x)

Using the $T^2$ statistic, both the existing non-cluster-based method and the proposed cluster-based method signaled, indicating that both methods detected a change in the process. However, the non-cluster-based method signaled because it misclassified the 6th profile as out-of-control. The cluster-based method, on the other hand, correctly classified the 10th, 11th and 12th profiles as coming from the out-of-control process and classified the other nine profiles as coming from the in-control process. The estimates of the PA parameters from the non-cluster-based method (Jensen et al. (2008)) are $\hat{\beta}^T = (16.61, 9.709, .178)$, while the estimates of the PA parameters from the proposed method are $\hat{\beta}^T = (\ldots, \ldots, .07)$. Compared to the true PA parameters, $\beta^T = (12.5, 7, 2)$, the estimates of the non-cluster-based method (Jensen et al. (2008)) are severely distorted, while the proposed method provided PA estimates much closer to the true values, as expected.
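The $T^2$ statistic of equation (1.1) that underlies this comparison can be sketched as follows. The sketch is illustrative only: the function names, the toy parameter vectors and their noise levels are assumptions, and the successive difference estimator is written in its usual form, $\hat{V}_D = \frac{1}{2(m-1)}\sum_{i=1}^{m-1}(\hat{\mu}_{i+1}-\hat{\mu}_i)(\hat{\mu}_{i+1}-\hat{\mu}_i)^T$, which is assumed here to match the estimator named in the nomenclature.

```python
import numpy as np


def pooled_cov(M):
    """Pooled sample covariance of the rows of M (divisor m - 1)."""
    d = M - M.mean(axis=0)
    return d.T @ d / (M.shape[0] - 1)


def successive_difference_cov(M):
    """Successive difference estimator built from consecutive rows of M."""
    diffs = np.diff(M, axis=0)
    return diffs.T @ diffs / (2 * (M.shape[0] - 1))


def hotelling_t2(M, V):
    """T^2_i = (mu_i - mu_bar)^T V^{-1} (mu_i - mu_bar) for every row mu_i of M."""
    d = M - M.mean(axis=0)
    return np.einsum("ij,ji->i", d, np.linalg.solve(V, d.T))


def demo_vectors(seed=0):
    """ASSUMPTION: toy parameter vectors, 9 in-control followed by 3 shifted."""
    rng = np.random.default_rng(seed)
    sd = np.array([1.0, 1.0, 0.3])
    in_control = np.array([12.5, 7.0, 2.0]) + sd * rng.standard_normal((9, 3))
    shifted = np.array([12.875, 14.5, 3.5]) + sd * rng.standard_normal((3, 3))
    return np.vstack([in_control, shifted])


if __name__ == "__main__":
    M = demo_vectors()
    print("T2, pooled estimator      :", np.round(hotelling_t2(M, pooled_cov(M)), 2))
    print("T2, successive differences:",
          np.round(hotelling_t2(M, successive_difference_cov(M)), 2))
```

With the pooled estimator every vector in the HDS, including any out-of-control ones, contributes to both the center and the covariance matrix, which is the distortion described above. The successive difference estimator is typically less inflated by a sustained shift because only the differences that straddle the shift are affected.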

Chapter 2. Cluster-Based Bounded Influence Regression

Recall that robust regression estimation plays an important role in statistical analysis. In this chapter, a new robust and efficient regression method, called the cluster-based bounded influence (CBI) regression (Lawrence (2003)), will be reviewed. Additionally, the CBI regression algorithm will be updated by using the modern R package and compared to other existing robust regression methods.

2.1 Review of Robust Regressions

The detection of observations not conforming to a given statistical model is a common goal of the data analyst. Many methods have been proposed to aid in the detection of such nonconforming observations or outliers. For example, in a recent paper by Fan et al. (2012a), a hierarchical clustering method was employed that greatly improves the ability of certain multivariate control chart techniques at detecting the presence of multivariate outliers. Detecting unusual observations in the multiple regression setting is a far more complicated process, however, and many techniques have been introduced (see Section 2.2) for this purpose. As in the Fan et al. (2012a) paper, the use of clustering methodology can improve the ability of a technique to identify unusual data points in the multiple regression setting. The use of clustering to improve the properties of the bounded-influence regression method is demonstrated in this chapter.

In building a linear regression model, a single unusual observation can dramatically influence ordinary least squares (OLS) estimation. With OLS, a single low leverage outlier can have a dramatic effect on the estimation of the general trend, especially concerning the intercept. However, a single high influence point, or hip, can have a dramatic effect on any or all parameter estimates. The joint influence of several hips can have an even greater deleterious impact on parameter estimates. These coefficients and their standard errors, along with predictions, diagnostics, hypothesis tests, and other numerical measures, can each become very misleading without a thorough exploratory data analysis accompanying them.

This chapter focuses on the study of robust, high breakdown linear regression modeling. As this discipline is extremely computationally intensive, much of the published work in this area has occurred since the early 1980s. Of course, some ideas were proposed much earlier, but they were generally limited in actual application. Methods such as M regression (Huber and Ronchetti (2009)) and bounded influence (BI) regression (Huber and Ronchetti (2009)) work well in the presence of low leverage outliers and at most one hip, respectively. However, they are unable to combat a small percentage of outliers. Least median of squares (LMS) (Rousseeuw (1984)) regression and least trimmed squares (LTS) (Ruppert and Carroll (1980)) regression, on the other hand, are examples of high breakdown estimators, as they possess the ability to provide parameter estimates with as much as 50% of the data being contaminated. Poor efficiency and numerical/computational sensitivity with large datasets have typically led to their primary use as an initial estimator feeding into other robust procedures such as M or BI estimators. Examples include Mallows 1-step (M1S) regression (Simpson et al. (1992)) and Schweppe's 1-step (S1S) regression (Coakley and Hettmansperger (1993)), which are one-step adjustments of LTS that increase efficiency versus the LTS estimator. However, two virtually identical LTS estimates may yield dramatically different M1S (or S1S) estimators (Lawrence (2003)), thereby illustrating a potential negative issue with repeated sampling based methods.

Another high breakdown one-step estimation method is due to Gervini and Yohai (2002). Their robust and efficient weighted least squares estimate (REWLS) procedure attains full asymptotic efficiency under the assumption of normally distributed random errors. However, according to the Monte Carlo study in Section 2.5, the REWLS, on the average, fails to correctly identify the good and bad high leverage points when the error term is not ideally normally distributed.

The CBI method was introduced by Lawrence et al. (2013) as a new regression methodology that obtains competitive, robust, efficient, high breakdown regression parameter estimates. Additionally, this method provides an informative summary regarding possible multiple outlier structure. A simple example below compares the CBI regression method to several existing robust procedures when the data has more than one high leverage point. The data set has 11 observations with observations 1-8 generated from the linear model where ~ 0, 5, and with the regressor variable generated via ~10,0. Observations 9-11 were arbitrarily added to reflect a mild influence point and two hips, respectively.

Figure 2.1: The fitted lines of the different robust methods (axes: x, y; fits shown: CBI, LS using data 1-8, REWLS, LTS, S1S, LS)

The data are plotted in Figure 2.1, where the outlier (9) and the two hips (10, 11) are clearly seen. Regarding the collection of fits also displayed in Figure 2.1, only the proposed method (CBI) detects the correct trend of the uncontaminated data. Each of the other estimators was dramatically misled by the joint influence of these three arbitrary points, resulting in a positive slope estimate when the true underlying slope is negative.

2.2 Review of Selected Robust Regression Methods

As the basis for linear regression analysis, the statistical model is restricted to be of the form $y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon$, with the response variable, $y$, being explained as a linear function of the $k$ regressor variables, $x_j$, $j = 1, \ldots, k$, plus a random error component, $\varepsilon$, for each of the $n$ observations, $i = 1, \ldots, n$. Given the computational nature of the proposed method, clarity in notation becomes quite important and, therefore, this paper offers sufficient detail. The linear model can also be written in matrix form as

$$y = X\beta + \varepsilon$$

or element-wise as

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \varepsilon_i,$$

where $\varepsilon_i$ is the random error, with scale parameter $\sigma$. There are $p = k + 1$ unknown parameters that form the $p \times 1$ parameter vector $\beta$, which is to be estimated by the $p \times 1$ vector $\hat\beta$. This subsequently yields the estimated fits as $\hat y = X\hat\beta$. Further, the $n \times 1$ vector of residuals is computed as $e = y - \hat y$, with $e_i$ representing the residual for the $i$-th observation. Also, define $\tilde X$ as the $n \times k$ matrix containing only the regressor variables, with $X$ representing the $n \times p$ matrix formed by augmenting the vector $\mathbf{1}$ to $\tilde X$. To accommodate reference to individual observations, let the $i$-th row of $\tilde X$ be denoted by the $1 \times k$ row vector $\tilde x_i'$, and let the $1 \times p$ row vector $x_i'$ denote the $i$-th row of $X$. When the response variable is included, the notation for row $i$ of $Z = [\tilde X, y]$ is $z_i' = (\tilde x_i', y_i)$.

Consider the objective function

$$\min_\beta \sum_{i=1}^{n} (y_i - x_i'\beta)^2$$

for the OLS estimator, which may be written as $\min_\beta \sum_{i=1}^{n} \rho(y_i - x_i'\beta)$, with $\rho(t) = t^2$. In robust regression, the function $\rho$ can be selected to either down weight or bound any argument arising from unusual observations. This becomes the basis for M regression (Huber and Ronchetti (2009)), which has the objective function

$$\min_\beta \sum_{i=1}^{n} \rho\!\left(\frac{y_i - x_i'\beta}{\hat\sigma}\right),$$

where the derivative $\psi = \rho'$ is chosen to be bounded and odd-symmetric, $\beta$ represents an arbitrary point in the $p$-dimensional estimation space, and $\hat\sigma$ is some appropriately chosen estimate of $\sigma$. The choice for $\hat\sigma$ is generally limited to robust measures of scale. One such estimator that is frequently used is the median absolute deviation (MAD), where

$$\mathrm{MAD} = \operatorname{med}_i \left| e_i - \operatorname{med}_j (e_j) \right|.$$
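The MAD scale estimate and the M regression objective described above can be illustrated with a minimal Python sketch. The source does not name a specific $\rho$ function, so Huber's $\rho$ and the tuning constant `c = 1.345` are assumptions made here for illustration; the function names `mad`, `huber_rho`, and `m_objective` are likewise hypothetical.

```python
import numpy as np


def mad(residuals):
    """Median absolute deviation about the median (no consistency constant)."""
    r = np.asarray(residuals, dtype=float)
    return float(np.median(np.abs(r - np.median(r))))


def huber_rho(t, c=1.345):
    """Huber's rho (assumed choice): quadratic near zero, linear in the tails.

    Its derivative psi is bounded by c and odd-symmetric.
    """
    a = np.abs(np.asarray(t, dtype=float))
    return np.where(a <= c, 0.5 * a**2, c * a - 0.5 * c**2)


def m_objective(beta, X, y, sigma_hat, c=1.345):
    """Sum of rho over residuals scaled by a fixed robust scale estimate."""
    e = np.asarray(y, dtype=float) - np.asarray(X, dtype=float) @ np.asarray(beta, dtype=float)
    return float(np.sum(huber_rho(e / sigma_hat, c)))
```

Because the linear tails of this $\rho$ grow more slowly than $t^2$, one very large residual adds far less to the objective than it would under OLS, which is the down weighting behaviour the text refers to.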

Taking derivatives with respect to $\beta$ leads to solving the altered normal equations,

$$\sum_{i=1}^{n} x_i\, \psi\!\left(\frac{y_i - x_i'\beta}{\hat\sigma}\right) = 0,$$

where $\psi = \rho'$ and $\hat\beta_M$ is the solution for $\beta$. These altered normal equations form a system of nonlinear equations that may be solved by a number of popular numerical methods including (1) Newton-Raphson and (2) iteratively reweighted least squares (IRLS), the latter used in this paper. At convergence, IRLS produces the M regression parameter estimator

$$\hat\beta_M = (X'WX)^{-1} X'Wy,$$

where $W$ is the diagonal weight matrix, with diagonal elements denoted as $w_i$. Each weight, $w_i$, determines how much emphasis the regression will place on a particular observation. A large weight (near 1) should indicate a good observation. An outlier or a hip, on the other hand, should get a reduced weight or perhaps even a zero weight. In M regression the weight is calculated as

$$w_i = \frac{\psi(e_i/\hat\sigma)}{e_i/\hat\sigma},$$

a function of the residual. Typically, the larger the residual, the smaller the weight.

A single hip will pull the fitted M regression line toward it to make the corresponding residual small, and thus that weight will be large. This means that M regression can be dominated by a single hip. One solution to this problem is to use bounded influence (BI) regression. Here, the name refers to bounding the influence that the point has in the regressor-space. One altered normal equation form, called the Schweppe form (Staudte (1990)), is written as

$$\sum_{i=1}^{n} \pi_i\, x_i\, \psi\!\left(\frac{y_i - x_i'\beta}{\hat\sigma\, \pi_i}\right) = 0.$$
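The IRLS iteration for the M estimator described above can be sketched in Python as follows. This is only an illustration under stated assumptions: Huber's $\psi$ with tuning constant `c = 1.345`, an OLS starting value, and a MAD scale that is recomputed at every iteration are choices made here, not details confirmed by the source, and the names `huber_weight` and `irls_m_regression` are hypothetical.

```python
import numpy as np


def huber_weight(u, c=1.345):
    """IRLS weight psi(u)/u for Huber's psi (assumed); equals 1 when |u| <= c."""
    a = np.abs(np.asarray(u, dtype=float))
    return np.where(a <= c, 1.0, c / np.maximum(a, 1e-12))


def irls_m_regression(X, y, c=1.345, tol=1e-8, max_iter=100):
    """M regression by iteratively reweighted least squares.

    X is the n x p model matrix (including the column of ones), y the response.
    Returns the coefficient vector and the weights from the last iteration.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS starting value (assumption)
    w = np.ones(len(y))
    for _ in range(max_iter):
        e = y - X @ beta
        sigma = np.median(np.abs(e - np.median(e)))  # MAD of current residuals
        if sigma == 0.0:
            break  # exact fit to at least half the data; weights are undefined
        w = huber_weight(e / sigma, c)
        XtW = X.T * w  # X'W without forming the diagonal matrix
        beta_new = np.linalg.solve(XtW @ X, XtW @ y)  # (X'WX)^{-1} X'Wy
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta, w
```

With a single low leverage outlier in the response, the weight for that observation shrinks toward zero while the remaining weights stay near 1, so the fitted intercept is pulled far less than under OLS. The same loop does not protect against a hip, because a point with extreme regressor values can keep its own residual small.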

Here, $\pi_i$ is chosen so that the effect of a large residual $e_i$ is reduced if $(x_i', y_i)$ is a hip. One choice is to have $\pi_i = \sqrt{1 - h_{ii}}$, where $h_{ii}$ is the $i$-th diagonal element of the so-called hat matrix, $H$, with $H = X(X'X)^{-1}X'$. The value $\pi_i$ is referred to as the BI weight. The BI regression estimator can be obtained in exactly the same manner as the M estimator via IRLS, as

$$\hat\beta_{BI} = (X'WX)^{-1} X'Wy.$$

However, the weight now has the form

$$w_i = \frac{\psi\!\left(e_i/(\hat\sigma\,\pi_i)\right)}{e_i/(\hat\sigma\,\pi_i)},$$

where $e_i/(\hat\sigma\,\pi_i)$ is the scaled residual. Specifically, the BI weight depends on both the residual and the location of $x_i$ in the regressor-space.

While M and BI estimators provide an improvement over OLS if the data has an outlier or hip, respectively, they cannot provide protection against data with even modest amounts of contamination. Ruppert and Carroll (1980) introduced LTS to combat this situation, defining the objective function as

$$\min_\beta \sum_{i=1}^{h} e_{(i)}^2,$$

representing the sum of the $h$ smallest squared residuals, $e_{(1)}^2 \le \cdots \le e_{(n)}^2$, where $h$ is generally taken to be an integer slightly larger than $n/2$, defined through the greatest integer function $[\cdot]$. Since this objective function is not differentiable, no closed-form expression exists for the LTS estimator. However, algorithms are available that give the exact LTS estimator for the location model, the exact LTS estimator for the regression model based on small data sets, and a relatively accurate LTS estimator for large data sets. The algorithmic details may be found in Rousseeuw and Van Driessen (2006). Historically, methods like LTS (and its predecessor LMS) had involved repeated sampling computational methods incorporating probabilistic arguments.

One problem with high breakdown estimators such as LTS is poor efficiency due to large variability associated with the estimated coefficients. The remedy for this poor efficiency is to use the LTS estimator, or another high breakdown estimator, as an initial estimator, with the generalized M estimator form, to obtain a one-step generalized M estimator. The S1S estimator is one such estimator and results from solving the altered normal equations

$$\sum_{i=1}^{n} \pi_i\, x_i\, \psi\!\left(\frac{y_i - x_i'\beta}{\hat\sigma\, \pi_i}\right) = 0.$$

A Gauss-Newton approximation using a first-order Taylor series expansion about the initial estimate $\hat\beta_0$ yields a one-step improvement of the form

$$\hat\beta_{S1S} = \hat\beta_0 + \hat\sigma \left[\sum_{i=1}^{n} x_i x_i'\, \psi'\!\left(\frac{y_i - x_i'\hat\beta_0}{\hat\sigma\,\pi_i}\right)\right]^{-1} \sum_{i=1}^{n} \pi_i\, x_i\, \psi\!\left(\frac{y_i - x_i'\hat\beta_0}{\hat\sigma\,\pi_i}\right).$$

None of the above estimators achieves full efficiency at the normal distribution while simultaneously maintaining a breakdown bound close to 50%. Gervini and Yohai (2002) proposed an adaptive one-step estimation method that attains full asymptotic efficiency at the normal error distribution while at the same time having a high breakdown bound and small maximum bias. Their method, referred to as the REWLS estimator, is a weighted LS estimator computed from an initial high breakdown estimate, $\hat\beta_0$, and a robust scale estimate, $\hat\sigma$, such as MAD. However, rather than deleting those observations whose absolute scaled residuals are greater than a given value, the procedure will keep a number $N$ of observations, corresponding to the $N$ smallest values of the absolute scaled residual, $t_i = |e_i|/\hat\sigma$, $i = 1, \ldots, n$. The number $N$ has the property that in large samples under normality it will have $N/n \to 1$, which means a vanishing fraction of observations will

be deleted and full efficiency will be attained (Maronna et al. (2006)). The REWLS estimator can be obtained as

$$\hat\beta_{REWLS} = \begin{cases} (X'WX)^{-1}X'Wy, & \hat\sigma > 0, \\ \hat\beta_0, & \hat\sigma = 0, \end{cases}$$

where $W$ is the diagonal matrix with $w_i = 1$ for the $N$ retained observations and $w_i = 0$ otherwise.

2.3 Cluster-Based Bounded Influence Regression

The CBI regression methodology offers a new philosophical approach to the robust regression arena and consists of two primary phases, the cluster phase and the regression phase. First, an initial high-breakdown regression estimator is produced via a sophisticated clustering algorithm. Second, refinement of this initial regression estimator is investigated and possibly implemented under a carefully structured use of BI regression. The rationale behind this second phase is to allow for a possible improvement in efficiency, especially when the level of data contamination does not come close to approaching 50%. The CBI regression method has been named cluster-based bounded influence regression, or CBI for short, to reflect the nature of its two-phase computation process.

The cluster phase begins with high-breakdown location and scale estimation of the $(k+1)$-dimensional regressor-response space. A special set of points, referred to as the set of anchor points, is computed that together represent the general trend of the data. Each observation is then characterized by the OLS regression fit that would occur if this individual observation were augmented to the anchor points. High breakdown location and scale estimation of this set of $n$ OLS coefficient vectors provides the foundation for the construction of the similarity matrix (technically, a distance matrix). The desire for a tight, compact sphere of similar coefficients exhibiting a common trend description is the basis for the selection of complete linkage hierarchical clustering (Lawrence (2003)) as the default method, and clustering is performed until an initial main cluster of at least half the observations is formed. Two aspects worth mentioning are that (1) the OLS sensitivity to a single point is being exploited to our advantage in evaluating the data, and (2) the anchor points serve to alleviate repeated sampling (as required by other 50% breakdown point estimators such as LTS) and the use of minimal sized elemental subsets that must be in general position (i.e. no singularity issues). A simple OLS fit to this main cluster is used as the basis for the possible adjustment of the anchor set metric to more directly relate to the general trend. A revised similarity matrix is constructed, with a second cluster analysis yielding a revised, final main cluster and minor clusters. The determination of this cluster classification structure completes the cluster phase.

To begin the regression phase, the initial CBI estimator is simply the OLS estimate of the main cluster observations. A high breakdown scale estimate is then computed. High breakdown BI leverage weights are computed from the regressor-space only. Using only the main cluster, a BI regression updates the initial CBI estimator. To this point, the minor clusters have not been utilized in the computation of the CBI regression estimator and their observations are said to be inactive. The activation process for these remaining observations has two primary stages. First, a statistic $D_j$ is computed for each of the $m$ minor clusters, where $j = 1, 2, \ldots, m$. A candidate minor cluster is one such that $D_j$ does not exceed the cutoff value $\delta$. Then, a single statistic, denoted by $D_A$, is computed for the union of all candidate minor clusters. If $D_A$ is small enough, then the final CBI estimator is determined from this activation process (provided at least one minor cluster observation obtained a nonzero weight). Otherwise, the minor clusters do not play an active role (i.e. all of their observations possess a zero weight) and there is no further update to the current CBI regression

estimator. A final CBI scale estimate is computed once the final CBI regression estimator has been determined.

The detailed algorithm, consisting of ten interrelated steps for the CBI estimator, is presented below. Steps 1 through 3 represent the cluster phase and steps 4 through 10 represent the regression phase. Notation is introduced as needed.

Step 1. Perform minimum volume ellipsoid (MVE) estimation (see Rousseeuw and Leroy (2003)) of the regressor-response data $Z = [\tilde X, y]$; determine the anchor point matrix, $\Omega$. These points include $m_Z$, the MVE location vector for $Z$, and the end points of the ellipsoid of constant distance $c$ from $m_Z$ based on the metric $C_Z$, the MVE scale matrix estimator for $Z$. The $j$-th pair of end points is determined by the expression $m_Z \pm c\sqrt{\lambda_j}\, v_j$, where $\lambda_j$ and $v_j$ are the $j$-th eigenvalue and (unit length) eigenvector of $C_Z$, respectively.

Step 2. Determine the base regression estimator matrix $B$. The $i$-th row of $B$, denoted by the $1 \times p$ vector $b_i'$, is defined as the estimator that results from an OLS regression analysis of the set of anchor points supplemented by the addition of the $i$-th observation in the dataset. Perform an MVE estimation of $B$, treating each row of $B$ as an observation in $p$ dimensions.

Step 3. Using the MVE scale matrix estimator for $B$ as the distance metric, compute a similarity matrix $S$ whose elements $s_{ij}$ are defined to be the distance between $b_i$ and $b_j$ under this metric.

Perform a cluster analysis on the dataset given the similarity matrix $S$ and using complete linkage to obtain the tightest cluster of $b_i$ vectors. The initial main cluster, $C_0$, is defined at the first instance at which a single cluster consists of at least half the observations. The remaining observations fall into one of $m$ minor clusters that are labeled as $C_1, C_2, \ldots, C_m$.

Step 4. Compute the OLS estimate $\hat\beta_0$ using the data points in $C_0$. A preliminary estimate of scale, $\hat\sigma_0$, is defined to be the MAD of all residuals $e_i$, where $e_i = y_i - x_i'\hat\beta_0$. Determine the set of observations, $A$, from these residuals and $\hat\sigma_0$.

Step 5. Using the data points in $A$, compute the $k \times 1$ mean vector and the covariance matrix of the regressor data in $A$, using standard moment estimators, and define the $n \times 1$ robust regressor distance vector containing the distance of each observation's regressor vector from this mean under this covariance metric.

Step 6. Mimic Step 1 to Step 3 by replacing the MVE statistics with the weighted mean and covariance estimates for the data to get the new initial main cluster and minor clusters. The weight for each data point takes the value 1 or 0. Compute the initial CBI estimator using WLS, and subsequently update the scale estimate as the MAD of all new residuals.

Step 7. Determine the $n \times 1$ BI leverage weight vector, $\pi$, whose elements are defined piecewise, equal to 1 in one case and to the minimum of 1 and a second quantity otherwise. Perform BI regression using only the main cluster to obtain, at convergence of IRLS, the estimate $\hat\beta_M$.

Step 8. Let $C_j$ represent any minor cluster and $n_j$ be the size of $C_j$, and let $\pi_{M,j}$ be the subvector of $\pi$ that corresponds only to the main cluster and $C_j$ observations. Perform the BI regression with these data points and the leverage weight vector $\pi_{M,j}$ to obtain the estimate $\hat\beta_{M,j}$ at convergence. A statistic, $D_j$, comparing the two sets of fits is then computed, where $\hat y_{M,j}$ represents the fits when using both the main cluster and $C_j$ observations and $\hat y_M$ represents the fits when using just the main cluster observations. This statistic is computed for each of the $m$ minor clusters.

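Step 9. Define the scalar $\delta$ to represent the maximum allowable statistic. Then, let $C_A$ represent the union of all activation candidate minor sets, i.e. those minor clusters $C_j$ with $D_j \le \delta$ that contain at least one observation with a nonzero weight. Provided that $C_A$ is not empty, perform BI regression with the main cluster observations, the $C_A$ observations, and the corresponding leverage weights as inputs to obtain the BI regression estimate and the statistic $D_A$. The default value of $\delta$ is 4.

Step 10. The final CBI estimator is the estimate from Step 9 if $D_A \le \delta$ and at least one observation in $C_A$ has a nonzero weight; otherwise it remains the estimate from Step 7. The CBI scale estimator is then updated as the MAD of the new residuals. The final CBI weights for the individual observations are simply the observations' weights at convergence of the BI regression used to compute the final CBI estimator.

Three scale estimators are provided by the CBI procedure. The first is the MAD of the CBI residuals. Given the CBI scale estimate, the BI leverage weight vector, and the remaining CBI quantities, a robust mean square error that mimics the robust ANOVA scale estimate introduced by Birch (1992) is found. Using the effective sample size (Birch (2010)), a modified version of the robust analysis of variance scale estimate then becomes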
