Pre-processing method minimizing the need for reference analyses

JOURNAL OF CHEMOMETRICS
J. Chemometrics 2001; 15:

Per Waaben Hansen*
Foss Electric A/S, Slangerupgade 69, DK-3400 Hillerød, Denmark

SUMMARY

A new pre-processing method called independent interference reduction (IIR) is proposed. It is based on modelling of the interferences by use of principal component analysis (PCA) using samples containing no variation in the parameter of interest. This is followed by subtraction of the modelled effects from the calibration matrix. It is useful in cases where the parameter of interest gives only minor contributions to the calibration matrix and where, at the same time, samples not containing variation in this parameter are easy to obtain. In such a case the method reduces the number of reference analyses required for establishing a reliable calibration. When applied to acetone determination in milk samples by use of Fourier transform infrared (FT-IR) spectroscopy, the method provides a treatment of the data which yields more stable models in the subsequent calibration step. Copyright 2000 John Wiley & Sons, Ltd.

KEY WORDS: pre-processing method; independent interference reduction; infrared spectroscopy; reference analysis

1. INTRODUCTION

When performing a calibration, a data set with a high signal-to-noise ratio (S/N) is sometimes encountered. This is a favourable case, as it makes extraction of minor features from the data set possible. In spectroscopy, such examples are seen occasionally with mid-infrared (mid-IR) data, especially when the spectra are generated by Fourier transform infrared (FT-IR) spectroscopy. Dimension-reducing methods such as principal component regression (PCR) or partial least squares (PLS) regression [1] are necessary for making the information useful if the component of interest only gives rise to minor spectral features, e.g. because it is present in very low concentrations. In such cases, models of very high complexity result. FT-IR examples include urea [2] and acetone [3] in milk, where PLS models based on factors are common. Such models require a high number of calibration samples (i.e. hundreds) to ensure a stable predictive model. As a result, calculation of a reliable predictive model is very resource-demanding because of the amount of work and reagents required for obtaining reference analyses.

* Correspondence to: Per Waaben Hansen, Foss Electric A/S, Slangerupgade 69, DK-3400 Hillerød, Denmark. E-mail: PWH@foss-electric.dk
The work was carried out partly at The Royal Veterinary and Agricultural University, Department of Dairy and Food Science, Rolighedsvej 30, DK-1958 Frederiksberg C, Denmark.
Contract/grant sponsor: The Danish Academy of Technical Sciences (Lyngby, Denmark)
Contract/grant sponsor: Foss Electric A/S (Hillerød, Denmark)

In the case of a component occurring in very low concentrations, this may seem a waste of good data, as the first PLS factors (or principal components if PCR is used) usually describe the major constituents of the samples. This is because the weak spectral features from a low-concentration component may not be visible to the algorithm before the major interferences have been removed. Thus expensive reference-analysed data are wasted on the determination of scores and loadings for interfering compounds. Therefore a method of removing all (or a major part of) the interferences before commencing regression on the reference-analysed samples would be useful. Usually, spectral data from a sample are easy and inexpensive to obtain; the reference analysis is the resource-demanding step. This is the basis for the method proposed in this paper, independent interference reduction (IIR). It is useful in cases where a low-concentration component is determined in a sample matrix with many interferences at high concentrations. The method is applied to an example from FT-IR spectroscopy where the above-mentioned requirements are met.

2. DESCRIPTION OF THE METHOD

IIR is a pre-processing method with similarities to two recently proposed methods, orthogonal signal correction (OSC) [4] and direct orthogonalization (DO) [5]. Both model the interferences and scatter effects by building a model of the data matrix with the information on the component of interest removed. This requires that (1) all samples are reference analysed and that (2) the reference results are of a certain quality. If the latter requirement is not met, both methods might remove potentially useful information on the component of interest. Acting as pre-processing methods, they remove scatter effects more efficiently than other schemes, e.g. multiplicative signal correction (MSC) [1], since they make a detailed model of the background. Furthermore, they provide models that are more interpretable, as interferences are removed prior to the calibration step. This is because the loadings are divided into those related to the interferences and those related to the useful parts of the data.

IIR also removes interferences and scatter effects, and it does so by using two data sets: one set (matrix X_simple) containing a large number of samples showing variation in all parameters except the one of interest, used for pre-processing; and another set (matrices X_special and Y_special) containing reference-analysed samples with variation in all parameters, the one of interest in particular, used for calibration. No reference results are needed for the samples contained in X_simple. The method thus requires that samples without variation in the parameter of interest are easily obtained, which is often the case when spectroscopic methods are used for predictive purposes. This is the major advantage of IIR when compared to DO and OSC, since the latter two methods require all samples to be associated with their reference results. On the other hand, IIR will not be useful when the component of interest accounts for the majority of the variation in X_simple, as IIR pre-processing would remove useful information in such a case.

Constituents in milk and dairy products, such as compounds added during milk processing or components occurring only as a result of a disease, may fulfil the requirement that samples (almost) without variation in the parameter of interest are easy to obtain. In the case where a compound not present in ordinary milk samples is added, e.g. during a dairy process, the samples for X_simple can be chosen among all samples measured before addition of the compound of interest.

The basic steps in IIR are as follows.

1. A data matrix X_simple showing typical sample variation in all parameters apart from the component of interest is obtained. This matrix is modelled using principal component analysis (PCA) with m principal components. If X_simple is large enough, this results in a number of well-defined and accurate loadings.
2. Another matrix X_special containing a wide variation in the component of interest is projected onto the loadings obtained in step 1, giving m sample scores. This model (based on m scores and loadings) of the interference part of X_special is subtracted from X_special, yielding a new matrix X_special−m.
3. This interference-reduced X_special−m is used with its associated reference matrix Y_special to generate a calibration model for predictive purposes, e.g. using PLS, PCR or multiple linear regression (MLR).

In step 1 the interferences contained in X_simple are modelled. It is therefore necessary that the component of interest is not present (or is constant) in X_simple. Deviations from this constraint are not likely to be serious if the component is present in low concentrations (or shows weak spectral features), as the first principal components will not describe it anyway. In such a case, if the prediction error (e.g. root mean square error of prediction (RMSEP)) is used for determining the optimal model in step 3, then problems originating from inclusion of spectral information stemming from the component of interest will be revealed, as this will result in an increased prediction error. The data contained in X_special and Y_special should fulfil the usual requirements that apply to calibration sets (i.e. that the calibration data must represent future samples).

The steps are similar to the ones performed in PCA data pre-treatment (PCA-DP) [6], but where PCA-DP only models instrumental (baseline) effects, IIR is extended to model interferences as well. A detailed description of IIR is given below.

2.1. IIR during calibration

The matrices X_simple, X_special and Y_special are required to perform the pre-processing and calibration steps. For the method to be useful, X_simple should contain more samples than X_special.

1. X_simple is centred to give X^c_simple, the mean spectrum being x_m1.
2. PCA is performed using m principal components to give
   $$X^{c}_{\mathrm{simple}} = T_{\mathrm{simple}} P_{\mathrm{simple}}^{T} \tag{1}$$
3. x_m1 is subtracted from X_special to give X^s_special.
4. The scores of X^s_special on P_simple are calculated:
   $$T_{\mathrm{special}} = X^{s}_{\mathrm{special}} P_{\mathrm{simple}} \tag{2}$$
5. X^s_special is reduced according to this model:
   $$X^{s}_{\mathrm{special}-m} = X^{s}_{\mathrm{special}} - T_{\mathrm{special}} P_{\mathrm{simple}}^{T} \tag{3}$$
6. X^s_special−m is mean centred (the mean spectrum being x_m2) and regression against Y_special is performed. A PLS regression using n factors could be used in this step.

The mean centring in step 6 is performed since X^s_special−m is the residual after the projection of X^s_special onto the PCA model of X_simple, which does not necessarily have the same mean as X_special.

2.2. IIR during prediction

The vectors x_pred (a new spectrum), x_m1, P_simple and x_m2 as well as a regression model (calculated during calibration) are required in the prediction step.
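The calibration steps above, and the way the stored quantities x_m1, P_simple and x_m2 are reused for a new spectrum, can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the author's implementation: the function names (`iir_fit`, `iir_reduce`, `iir_calibrate`, `iir_predict`) and the synthetic data are hypothetical, the PCA is computed via a singular value decomposition, and an ordinary least-squares fit (MLR) stands in for the PLS regression that the text suggests for step 6.

```python
import numpy as np

def iir_fit(X_simple, m):
    """Calibration steps 1-2: centre X_simple and keep m PCA loadings."""
    x_m1 = X_simple.mean(axis=0)
    # Rows of Vt are the orthonormal PCA loadings of the centred matrix.
    _, _, Vt = np.linalg.svd(X_simple - x_m1, full_matrices=False)
    return x_m1, Vt[:m].T  # P_simple has shape (n_variables, m)

def iir_reduce(X, x_m1, P_simple):
    """Steps 3-5: subtract x_m1, compute scores on P_simple, remove that model."""
    Xs = np.atleast_2d(X) - x_m1
    T = Xs @ P_simple           # scores, as in equation (2)
    return Xs - T @ P_simple.T  # residual, as in equation (3)

def iir_calibrate(X_simple, X_special, y_special, m):
    """Step 6, with least squares (MLR) as an assumed stand-in for PLS."""
    x_m1, P_simple = iir_fit(X_simple, m)
    X_red = iir_reduce(X_special, x_m1, P_simple)
    x_m2 = X_red.mean(axis=0)
    y_mean = y_special.mean()
    # rcond truncates numerically null directions of the reduced matrix.
    b, *_ = np.linalg.lstsq(X_red - x_m2, y_special - y_mean, rcond=1e-10)
    return {"x_m1": x_m1, "P_simple": P_simple, "x_m2": x_m2,
            "b": b, "y_mean": y_mean}

def iir_predict(model, x_pred):
    """Apply the stored pre-processing and regression to new spectra."""
    x_red = iir_reduce(x_pred, model["x_m1"], model["P_simple"])
    return (x_red - model["x_m2"]) @ model["b"] + model["y_mean"]

# Hypothetical noise-free data: three strong interferents, one weak analyte.
rng = np.random.default_rng(0)
interferents = rng.normal(size=(3, 40))
analyte = rng.normal(size=40)
X_simple = rng.normal(size=(200, 3)) @ interferents
y_special = rng.uniform(0.0, 2.0, size=30)
X_special = (rng.normal(size=(30, 3)) @ interferents
             + 0.01 * np.outer(y_special, analyte))

model = iir_calibrate(X_simple, X_special, y_special, m=3)
y_fit = iir_predict(model, X_special)
```

Written this way, the reduction is the projection of each centred spectrum onto the orthogonal complement of the loadings, X^s(I − P_simple P_simple^T), so only x_m1, P_simple, x_m2 and the regression coefficients need to be kept for later use.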

1. x_m1 is subtracted from x_pred to give x^s_pred.
2. The scores of x^s_pred on P_simple are calculated:
   $$t_{\mathrm{pred}} = x^{s}_{\mathrm{pred}} P_{\mathrm{simple}} \tag{4}$$
3. x^s_pred is reduced according to this model:
   $$x^{s}_{\mathrm{pred}-m} = x^{s}_{\mathrm{pred}} - t_{\mathrm{pred}} P_{\mathrm{simple}}^{T} \tag{5}$$
4. x_m2 is subtracted from x^s_pred−m and prediction is performed using the regression model calculated in calibration step 6.

The dimensions of the initial PCA model (m) and the final PLS calibration (n) both influence the results in terms of the prediction error (RMSEP). Therefore validation should be performed by calculating the prediction error using various values of m and n. From experience a closure similar to the one observed with DO [5] and PCA-DP [6] seems to exist, i.e. the sum of m and n (the overall dimensionality of the model) is constant. The application of IIR presented in the next section shows an example of this observation. It could suggest that nothing is gained from use of IIR, but this is not the case: the first m loadings are generated from more samples (in X_simple) than if only the samples from X_special had been used. Thus they are expected to be better defined and less noisy. Furthermore, the calibration is generated from a data set (X_special) which is not diluted by samples not including variation from the component of interest. This may avoid the problem of underestimation of high samples and overestimation of low samples which is common in inverse calibration methods such as PLS [1].

3. APPLICATION TO FT-IR DATA

The application presented here relates to acetone determination in milk from individual cows using FT-IR spectroscopy. This application fulfils the requirements of IIR, namely that (1) acetone is found in milk in low concentrations (e.g. mM) and that (2) it is easy to obtain samples known not to contain acetone. The latter requirement is met, since acetone is only present when a cow suffers from ketosis (a metabolic disease), so the milk samples for X_simple can be obtained from healthy cows. A detailed description of the data set can be found elsewhere [3]. Only the information necessary for understanding the present application of IIR will be given here.

3.1. Experimental

The calibration set consisted of 310 samples from individual cows collected in Norway, Sweden and Denmark. Eight outliers were removed from the set by use of PCA. The remaining samples were divided into two sets: 198 samples containing less than 0.1 mM acetone (i.e. from cows not suffering from ketosis) made up the matrix X_simple, while the remaining samples (104 samples, still containing some samples below 0.1 mM) made up X_special. The latter sample set was used for building a predictive model (i.e. steps 5 and 6 during calibration).

The test set consisted of 58 individual cow samples collected and measured in New Zealand. They were collected in a way that ensured a high proportion of acetone-containing samples. This set was used for testing the models. The rationale for using this set for testing purposes was that if IIR leads to more stable predictive models, then using a test set consisting of samples obtained in another part of the world and measured on another instrument (of the same type) would prove it.

All samples were measured on MilkoScan FT 120 FT-IR transmission instruments (Foss Electric A/S, Hillerød, Denmark) located in either Denmark or New Zealand. The instruments were
standardized [7] before use. The samples were analysed for their acetone content using the reference method [3].

3.2. Calculations

The data analysis and calibration work was performed on a PC using Matlab software (Version 5.2.1; The MathWorks, Inc., Natick, MA). The calibration routines were either programmed by the author or taken from the PLS_Toolbox (Version 1.5.1; Eigenvector Technologies, Manson, WA).

Repeatability is expressed as the mean standard deviation (s_r) of multiple determinations performed under identical conditions and is calculated as

$$s_r = \sqrt{\frac{1}{q(n-1)} \sum_{j=1}^{q} \sum_{i=1}^{n} \left(x_{j,i} - \bar{x}_j\right)^2}$$

where q is the number of samples, n is the number of replicates, x_{j,i} is the result of the ith replicate of the jth sample and \bar{x}_j is the average result of the jth sample.

Accuracy is expressed as the root mean square error of prediction (RMSEP) and is calculated as

$$\mathrm{RMSEP} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(x_{i,\mathrm{reference}} - x_{i,\mathrm{predicted}}\right)^2}$$

where N is the number of determinations (number of samples (q) times number of replicates (n) from above) and x_{i,reference} and x_{i,predicted} are the reference and predicted values corresponding to the ith determination respectively.

When a bias (mean difference between reference results and predictions) is observed, the standard error of prediction (SEP) is used. It is calculated as

$$\mathrm{SEP} = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} \left(x_{i,\mathrm{reference}} - x_{i,\mathrm{predicted}} - \mathrm{bias}\right)^2}$$

4. RESULTS AND DISCUSSION

Calibrations were carried out for all values of m (PCA dimension) from 0 (ordinary PLS) to 25 and with overall model complexities (i.e. m + n, n being the number of PLS factors in the final regression step) from m + 1 to 30. Cross-validation was used for selecting the optimal model complexity, using two, four, six, eight and 10 cross-validation segments. The average model complexity from these five trials was used when the final model was calculated. Note that the 198 samples containing less than 0.1 mM acetone were used for IIR pre-processing only, i.e. the PLS model was calculated using the 104 calibration samples irrespective of the value of m.

The cross-validated prediction errors using 10 cross-validation segments for various values of m are shown in Figure 1. From these plots it is clear that the overall model complexity stabilizes at a value of approximately 25, irrespective of the value of m. This does not apply when m = 25. In that case the same RMSEP is reached when m + n = 30, so when m becomes high, the optimal overall model complexity is pushed upwards. Thus there seems to exist a lower limit for n, and this value must
be n = 5 in the present case. Regarding m, using too high a value for this parameter does not seem to affect the predictive ability of the model. The prediction error does not seem to improve when IIR is used, but it does not become poorer either. The main advantage of IIR is that it results in a smoother decay of the prediction error, which facilitates the location of the optimal number of PLS factors. Furthermore, the PLS loadings (Figure 2) show a closer relation to the component of interest when IIR is used. With an appropriate choice of m (e.g. m = 20) the first PLS loading contains features similar to the spectrum of pure acetone. This is not the case when m = 0 (i.e. without IIR pre-processing), where the first PLS loading mainly contains information about the milk fat in the samples. Thus the application of IIR pre-processing uncovers the relevant acetone information from a complex spectral matrix containing numerous interferences.

Figure 1. Cross-validation error for acetone in milk (calculated using 10 cross-validation segments) against model complexity for various dimensions (m) of the initial PCA in IIR.

Figure 2. First PLS loadings for models without (m = 0) and with (m = 20) IIR pre-processing, overlaid with the FT-IR spectrum of pure acetone in water. The PLS loading with m = 0 is shifted by 0.3 units for the sake of clarity. Only the range from 1000 to 1800 cm⁻¹ is shown. The range from 1581 to 1697 cm⁻¹ was not used in the calculations owing to a strong water band in the area.

Table I. Results on the independent test set using various combinations of dimensions m (principal components used for pre-processing) and n (PLS factors). Columns: m, m + n, RMSEP, SEP, s_r.

When the final models are tested using the test set (with m + n determined by cross-validation), there seems to be a benefit from using IIR. The results for various values of m are presented in Table I. Both RMSEP and SEP are stated, as SEP is independent of a possible bias, while RMSEP is not. In terms of the SEP the result is independent of the value of m, i.e. the use of IIR. The overall dimensionality of the models ends up at 24 or 25, apart from the case when m = 25 itself. In this light, nothing is gained by using IIR. When looking at the RMSEP, on the other hand, a significant improvement is seen when m is increased from 0 to 20. When m = 20, the RMSEP is not significantly different from the SEP, and the bias obviously present at lower values of m has disappeared.
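The distinction between RMSEP and SEP drawn in this comparison follows directly from their definitions in the Calculations section: with N determinations, RMSEP² = bias² + ((N − 1)/N)·SEP², so the two measures coincide (for large N) only when the bias vanishes. The following NumPy sketch illustrates this. The function names and the small example arrays are hypothetical and are not taken from the paper's data.

```python
import numpy as np

def bias(reference, predicted):
    """Mean difference between reference results and predictions."""
    d = np.asarray(reference, dtype=float) - np.asarray(predicted, dtype=float)
    return d.mean()

def rmsep(reference, predicted):
    """Root mean square error of prediction (sensitive to bias)."""
    d = np.asarray(reference, dtype=float) - np.asarray(predicted, dtype=float)
    return np.sqrt(np.mean(d ** 2))

def sep(reference, predicted):
    """Standard error of prediction (bias-corrected, N - 1 in the denominator)."""
    d = np.asarray(reference, dtype=float) - np.asarray(predicted, dtype=float)
    return np.sqrt(np.sum((d - d.mean()) ** 2) / (d.size - 1))

def repeatability(replicates):
    """s_r for an array of shape (q samples, n replicates)."""
    r = np.asarray(replicates, dtype=float)
    q, n = r.shape
    within = r - r.mean(axis=1, keepdims=True)
    return np.sqrt(np.sum(within ** 2) / (q * (n - 1)))

# Hypothetical example: predictions that are offset by a constant 0.5 units.
reference = np.array([1.0, 2.0, 3.0, 4.0])
predicted = reference - 0.5

b = bias(reference, predicted)
e_rmsep = rmsep(reference, predicted)
e_sep = sep(reference, predicted)
```

In this constructed case the whole error is bias, so the RMSEP equals the offset while the SEP is zero, which mirrors the pattern described for low values of m.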
The bias could be a result of not all sample matrices being represented in X_special, which is a common problem when the calibration set is too small. The application of IIR using the larger matrix X_simple for pre-processing removes this problem by adding a wider range of sample types, i.e. also milk samples collected in a geographical region other than that of the calibration samples, so the predictive model becomes more stable. The repeatability s_r, which is a measure of the noise in the experiment, is not affected by the use of IIR: it is stable at a value of mM. When m becomes greater than the apparent overall dimension, the RMSEP increases while the SEP remains the same. The overall result is that IIR tends to remove a bias, but it requires careful selection of the value of the parameter m, as the bias increases when m becomes too high. In the present example the optimal model is obtained when m = 20 and n = 5. The predicted versus measured plot for the test set for this case is shown in Figure 3.

The fact that the prediction error for the New Zealand samples in terms of the RMSEP exhibits a minimum at a certain combination of m and n is particularly promising, as it means that with IIR pre-processing, the calibration set does not necessarily need to contain local samples from the geographical region where the calibration model is to be used. It does, however, require that the samples contained in X_simple are representative of the variations observed in the local samples.

A further prospect of the method is that only the PCA loadings calculated from X_simple need to be stored. Thus a set of e.g. 20 loadings (calculated from a large number of samples) could be saved and
used for pre-processing in the future even when predictive models for parameters other than acetone are sought.

Figure 3. Reference versus predicted acetone in milk for test samples using a model with IIR pre-processing, where m = 20 and n = 5. RMSEP = 0.31 mM and s_r = 0.17 mM.

5. CONCLUSIONS

IIR is a new pre-processing method which can be useful for specific applications where the following requirements are met: (1) the parameter of interest only gives minor contributions to the X matrix (e.g. a component in low concentrations); and (2) samples in which the parameter of interest is constant should be easy to obtain. The benefits of the method are that (1) it reduces the need for resource-demanding experiments to obtain Y data for calibration (i.e. reference analyses) and that (2) the resulting calibration model is more interpretable owing to the splitting of the loadings into those related to the interferences and those related to the parameter of interest. The main disadvantage is that two parameters, m and n, must be optimized when PLS regression is used for calibration.

In the specific case where acetone is determined by use of FT-IR, an improved predictive model could be obtained by use of IIR. On this data set the major improvement over the ordinary PLS calibration was that future predictions calculated using the model became more stable (i.e. had a lower bias when used on samples from another geographical region measured on different instruments at different locations).

ACKNOWLEDGEMENTS

The author would like to thank The Danish Academy of Technical Sciences (Lyngby, Denmark) and Foss Electric A/S (Hillerød, Denmark) for providing the funding for the present work. It was carried
out in co-operation with Food Technology at the Department of Dairy and Food Science (The Royal Veterinary and Agricultural University, Frederiksberg, Denmark). Dairying Research Corporation Ltd (Hamilton, New Zealand) and New Zealand Dairy Research Institute (Palmerston North, New Zealand) are thanked for providing samples and equipment for the practical work. Claus A. Andersson, Rasmus Bro, Henrik V. Juhl, Lars Nørgaard and Carsten Ridder are thanked for useful discussions.

REFERENCES

1. Martens H, Næs T. Multivariate Calibration (2nd edn). Wiley: Chichester,
2. Hansen PW. Milchwissenschaft 1998; 53:
3. Hansen PW. J. Dairy Sci. 1999; 82:
4. Wold S, Antti H, Lindgren F, Öhman J. Chemometrics Intell. Lab. Syst. 1998; 44:
5. Andersson CA. Chemometrics Intell. Lab. Syst. 1999; 47:
6. Sun J. J. Chemometrics 1997; 11:
7. Andersen HV, Kjær L, Hansen PW, Ridder C. US Patent , 1999.
