NAG Fortran Library Chapter Introduction. G10 Smoothing in Statistics

Contents

1 Scope of the Chapter
2 Background to the Problems
    2.1 Smoothing Methods
    2.2 Smoothing Splines and Regression Models
    2.3 Density Estimation
    2.4 Smoothers for Time Series
3 Recommendations on Choice and Use of Available Routines
4 Routines Withdrawn or Scheduled for Withdrawal
5 References

[NP3546/20A]

1 Scope of the Chapter

This chapter is concerned with methods for smoothing data. Included are methods for density estimation, smoothing time series data, and statistical applications of splines. These methods may also be viewed as nonparametric modelling.

2 Background to the Problems

2.1 Smoothing Methods

Many of the methods used in statistics involve fitting a model, the form of which is determined by a small number of parameters, for example, a distribution model like the gamma distribution, a linear regression model or an autoregression model in time series. In these cases the fitting involves the estimation of the small number of parameters from the data. In modelling data with these models there are two important stages in addition to the estimation of the parameters; these are the identification of a suitable model, for example, the selection of a gamma distribution rather than a Weibull distribution, and the checking to see if the fitted model adequately fits the data. While these parametric models can be fairly flexible, they will not adequately fit all data sets, especially if the number of parameters is to be kept small.

Alternative models based on smoothing can be used. These models will not be written explicitly in terms of parameters. They are sufficiently flexible for a much wider range of situations than parametric models. The main requirement for such a model to be suitable is that the underlying models would be expected to be smooth, so excluding those situations where, for example, a step function would be expected.

These smoothing methods can be used in a variety of ways, for example:

1. producing smoothed plots to aid understanding;
2. identifying a suitable parametric model from the shape of the smoothed data;
3. eliminating complex effects that are not of direct interest so that attention can be focused on the effects of interest.

Several smoothing techniques make use of a smoothing parameter which can be either chosen by the user or estimated from the data.
The smoothing parameter balances the two criteria of smoothness of the fitted model and the closeness of the fit of the model to the data. Generally, the larger the smoothing parameter is, the smoother the fitted model will be, but for small values of the smoothing parameter the model will closely follow the data, and for large values the fit will be poorer. The smoothing parameter can be either chosen using previous experience of a suitable value for such data, or estimated from the data. The estimation can be either formal, using a criterion such as cross-validation, or informal by trying different values and examining the result by means of suitable graphs.

Smoothing methods can be used in three important areas of statistics: regression modelling, distribution modelling and time series modelling.

2.2 Smoothing Splines and Regression Models

For a set of $n$ observations $(y_i, x_i)$, $i = 1, 2, \ldots, n$, the spline provides a flexible smooth function for situations in which a simple polynomial or nonlinear regression model is not suitable. Cubic smoothing splines arise as the function, $f$, with continuous first derivative which minimizes

$$\sum_{i=1}^{n} w_i \left( y_i - f(x_i) \right)^2 + \rho \int_{-\infty}^{\infty} \left( f''(x) \right)^2 \, dx,$$

where $w_i$ is the (optional) weight for the $i$th observation and $\rho$ is the smoothing parameter. This criterion consists of two parts: the first measures the fit of the curve and the second the smoothness of the curve. The value of the smoothing parameter, $\rho$, weights these two aspects: larger values of $\rho$ give a smoother fitted curve but, in general, a poorer fit.

Splines are linear smoothers since the fitted values, $\hat{y} = (\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n)^T$, can be written as a linear function of the observed values $y = (y_1, y_2, \ldots, y_n)^T$, that is,

$$\hat{y} = Hy$$

for a matrix $H$. The degrees of freedom for the spline is $\mathrm{trace}(H)$, giving residual degrees of freedom

$$\mathrm{trace}(I - H) = \sum_{i=1}^{n} (1 - h_{ii}).$$

The diagonal elements of $H$, $h_{ii}$, are the leverages.

The parameter $\rho$ can be estimated in a number of ways. In the criteria below, $r_i$ is the residual for the $i$th observation.

1. The degrees of freedom for the spline can be specified, i.e., find $\rho$ such that $\mathrm{trace}(H) = \nu_0$ for given $\nu_0$.
2. Minimize the cross-validation (CV), i.e., find $\rho$ such that the CV is minimized, where
   $$\mathrm{CV} = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{r_i}{1 - h_{ii}} \right]^2.$$
3. Minimize generalised cross-validation (GCV), i.e., find $\rho$ such that the GCV is minimized, where
   $$\mathrm{GCV} = n \left[ \frac{\sum_{i=1}^{n} r_i^2}{\left( \sum_{i=1}^{n} (1 - h_{ii}) \right)^2} \right].$$

2.3 Density Estimation

The object of density estimation is to produce from a set of observations a smooth nonparametric estimate of the unknown density function from which the observations were drawn. That is, given a sample of $n$ observations, $x_1, x_2, \ldots, x_n$, from a distribution with unknown density function, $f(x)$, find an estimate of the density function, $\hat{f}(x)$. The simplest form of density estimator is the histogram; this may be defined by

$$\hat{f}(x) = \frac{1}{nh} n_j, \quad a + (j-1)h < x < a + jh, \quad j = 1, 2, \ldots, n_s,$$

where $n_j$ is the number of observations falling in the interval $a + (j-1)h$ to $a + jh$, $a$ is the lower bound of the histogram and $b = a + n_s h$ is the upper bound. The value $h$ is known as the window width. A simple development of this estimator would be the running histogram estimator

$$\hat{f}(x) = \frac{1}{2nh} n_x, \quad a \le x \le b,$$

where $n_x$ is the number of observations falling in the interval $[x - h : x + h]$. This estimator can be written as

$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} w\!\left( \frac{x - x_i}{h} \right)$$

for a function $w$ where

$$w(x) = \begin{cases} \frac{1}{2} & \text{if } -1 < x < 1, \\ 0 & \text{otherwise.} \end{cases}$$

The function $w$ can be considered as a kernel function. To produce a smoother density estimate, a kernel function, $K(t)$, which satisfies the following conditions can be used:

$$\int_{-\infty}^{\infty} K(t) \, dt = 1 \quad \text{and} \quad K(t) \ge 0.$$

The kernel density estimator is therefore defined as

$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left( \frac{x - x_i}{h} \right).$$

The choice of $K(\cdot)$ is usually not important, but to ease the computational burden use can be made of the Gaussian kernel defined as

$$K(t) = \frac{1}{\sqrt{2\pi}} e^{-t^2/2}.$$
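
The kernel density estimator with the Gaussian kernel can be evaluated directly from its definition. The following Python sketch is an illustration only: the function names `gaussian_kernel` and `kernel_density` and the sample values are assumptions made for this example, and the direct summation shown here is not the FFT-based computation that a library routine may use.

```python
import math


def gaussian_kernel(t):
    """Gaussian kernel K(t) = exp(-t^2 / 2) / sqrt(2 * pi)."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)


def kernel_density(sample, h, points):
    """Evaluate f_hat(x) = (1 / (n h)) * sum_i K((x - x_i) / h) at each x in points."""
    if h <= 0.0:
        raise ValueError("window width h must be positive")
    n = len(sample)
    if n == 0:
        raise ValueError("sample must not be empty")
    return [
        sum(gaussian_kernel((x - xi) / h) for xi in sample) / (n * h)
        for x in points
    ]


# Illustrative data (made up for this sketch): two clusters near -2 and +2.
sample = [-2.1, -1.9, -2.0, 1.8, 2.0, 2.3]
grid = [-4.0 + 0.5 * k for k in range(17)]    # -4.0, -3.5, ..., 4.0
narrow = kernel_density(sample, 0.5, grid)    # small h: both modes remain visible
wide = kernel_density(sample, 3.0, grid)      # large h: the two modes merge into one
```

With these made-up values, the estimate for the small window width is lower at $x = 0$ than at either cluster, while the estimate for the large window width is highest near $x = 0$.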

The smoothness of the estimator, $\hat{f}(x)$, depends on the window width, $h$. In general, the larger the value $h$ is, the smoother the resulting density estimate is. There is, however, the problem of oversmoothing when the value of $h$ is too large and essential features of the distribution function are removed. For example, if the distribution was bimodal, a large value of $h$ may result in a unimodal estimate. The value of $h$ has to be chosen such that the essential shape of the distribution is retained while effects due only to the observed sample are smoothed out. The choice of $h$ can be aided by looking at plots of the density estimate for different values of $h$, or by using cross-validation methods; see Silverman (1990).

Silverman (1990) shows how the Gaussian kernel density estimator can be computed using a fast Fourier transform (FFT).

2.4 Smoothers for Time Series

If the data consists of a sequence of $n$ observations recorded at equally spaced intervals, usually a time series, several robust smoothers are available. The fitted curve is intended to be robust to any outlying observations in the sequence, hence the techniques employed primarily make use of medians rather than means. These ideas come from the field of exploratory data analysis (EDA); see Tukey (1977) and Velleman and Hoaglin (1981). The smoothers are based on the use of running medians to summarize overlapping segments; these provide a simple but flexible curve.

In EDA terminology, the fitted curve and the residuals are called the smooth and the rough respectively, so that

    Data = Smooth + Rough.

Using the notation of Tukey, one of the smoothers commonly used is 4253H,twice. This consists of a running median of 4, then 2, then 5, then 3. This is then followed by what is known as hanning. Hanning is a running weighted mean, the weights being 1/4, 1/2 and 1/4. The result of this smoothing is then reroughed. This involves computing residuals from the computed fit, applying the same smoother to the residuals and adding the result to the smooth of the first pass.
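
The steps of 4253H,twice can be sketched in Python as follows. This is a simplified illustration, not the algorithm of any particular library routine: all function names are hypothetical, and the treatment of the end values (shrinking the median window near the ends and copying the end values in the hanning step) is an assumption of this sketch, since the text above does not specify end-point rules.

```python
from statistics import median


def running_median_4_then_2(x):
    """Running median of 4, then a running median of 2 to re-centre the result."""
    n = len(x)
    # z[i] sits halfway between x[i-1] and x[i]; the window shrinks near the ends.
    z = [x[0]]
    for i in range(1, n):
        k = min(2, i, n - i)
        z.append(median(x[i - k:i + k]))
    z.append(x[-1])
    # The median of 2 is the mean of adjacent values, which restores the alignment.
    return [(z[i] + z[i + 1]) / 2 for i in range(n)]


def running_median_odd(x, span):
    """Running median of odd span; the window shrinks symmetrically near the ends."""
    n = len(x)
    half = span // 2
    out = []
    for i in range(n):
        k = min(half, i, n - 1 - i)   # largest symmetric half-width that fits
        out.append(median(x[i - k:i + k + 1]))
    return out


def hanning(x):
    """Running weighted mean with weights 1/4, 1/2, 1/4; end values are copied."""
    out = list(x)
    for i in range(1, len(x) - 1):
        out[i] = 0.25 * x[i - 1] + 0.5 * x[i] + 0.25 * x[i + 1]
    return out


def smooth_4253h(x):
    """One pass: medians of 4, 2, 5 and 3, followed by hanning."""
    y = running_median_4_then_2(x)
    y = running_median_odd(y, 5)
    y = running_median_odd(y, 3)
    return hanning(y)


def smooth_4253h_twice(x):
    """Smooth, then smooth the rough and add it back (reroughing)."""
    x = [float(v) for v in x]
    if not x:
        return []
    smooth = smooth_4253h(x)
    rough = [xi - si for xi, si in zip(x, smooth)]
    rough_smooth = smooth_4253h(rough)
    return [si + ri for si, ri in zip(smooth, rough_smooth)]


# Made-up series: a constant level with one outlying observation.
series = [1.0] * 5 + [100.0] + [1.0] * 5
smoothed = smooth_4253h_twice(series)   # the isolated spike is removed entirely
```

Because every step before hanning is a median, the single outlier in this made-up series never enters the smooth, which illustrates the robustness that motivates using medians rather than means.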

3 Recommendations on Choice and Use of Available Routines

Note: refer to the Users' Note for your implementation to check that a routine is available.

The following routines fit smoothing splines:

G10ABF computes a cubic smoothing spline for a given value of the smoothing parameter. The results returned include the values of leverages and the coefficients of the cubic spline. Options allow only parts of the computation to be performed when the routine is used to estimate the value of the smoothing parameter or when it is part of an iterative procedure such as that used in fitting generalized additive models; see Hastie and Tibshirani (1990).

G10ACF estimates the value of the smoothing parameter using one of three criteria and fits the cubic smoothing spline using that value.

G10ABF and G10ACF require the $x_i$ to be strictly increasing. If two or more observations have the same $x_i$-value then they should be replaced by a single observation with $y_i$ equal to the (weighted) mean of the $y$-values and weight, $w_i$, equal to the sum of the weights. This operation can be performed by G10ZAF.

The following routine produces an estimate of the density function:

G10BAF computes a density estimate using a Normal kernel.

The following routine produces a smoothed estimate for a time series:

G10CAF computes a smoothed series using running median smoothers.

The following service routine is also available:

G10ZAF orders and weights the $(x, y)$ input data to produce a data set strictly monotonic in $x$.
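
The replacement of tied observations described above, sorting by $x$ and merging equal $x$-values into a single weighted observation, can be sketched in Python. The function name `merge_tied_x` is hypothetical and this is not the interface of G10ZAF; the sketch also assumes positive weights and treats two $x$-values as tied only when they compare exactly equal.

```python
def merge_tied_x(x, y, w=None):
    """Sort by x and merge observations that share an x-value.

    Each group of ties becomes one observation whose y is the weighted mean
    of the group's y-values and whose weight is the sum of the group's weights.
    Assumes all weights are positive; unit weights are used if w is omitted.
    """
    if w is None:
        w = [1.0] * len(x)
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y and w must have the same length")
    xs, ys, ws = [], [], []
    for xi, yi, wi in sorted(zip(x, y, w)):
        if xs and xi == xs[-1]:
            # Fold this observation into the running weighted mean of its group.
            total = ws[-1] + wi
            ys[-1] = (ws[-1] * ys[-1] + wi * yi) / total
            ws[-1] = total
        else:
            xs.append(xi)
            ys.append(yi)
            ws.append(wi)
    return xs, ys, ws


# Made-up data: x = 2.0 occurs twice, with weights 1 and 3.
xs, ys, ws = merge_tied_x([2.0, 1.0, 2.0], [4.0, 1.0, 6.0], [1.0, 1.0, 3.0])
# xs is strictly increasing; the tied pair becomes y = (1*4 + 3*6) / 4 with weight 4.
```

The output is strictly increasing in $x$, which is the condition stated above for the input to the spline-fitting routines.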

4 Routines Withdrawn or Scheduled for Withdrawal

None.

5 References

Hastie T J and Tibshirani R J (1990) Generalized Additive Models Chapman and Hall
Silverman B W (1990) Density Estimation Chapman and Hall
Tukey J W (1977) Exploratory Data Analysis Addison-Wesley
Velleman P F and Hoaglin D C (1981) Applications, Basics, and Computing of Exploratory Data Analysis Duxbury Press, Boston, MA