A Semi-parametric Regression Model to Estimate Variability of NO2


Environment and Pollution; Vol. 2, No. 1; 2013
ISSN 1927-0909 E-ISSN 1927-0917
Published by Canadian Center of Science and Education

Mieczysław Szyszkowicz 1, Mamun Mahmud 1 & Neil Tremblay 1
1 Health Canada, Population Studies Division, Ottawa, Canada

Correspondence: Mieczysław Szyszkowicz, Health Canada, Population Studies Division, 269 Laurier Avenue, Ottawa K1A 0K9, Canada. Tel: 1-613-946-3542. E-mail: mietek.szyszkowicz@hc-sc.gc.ca

Received: July 9, 2012 Accepted: August 9, 2012 Online Published: December 4, 2012
doi:10.5539/ep.v2n1p46 URL: http://dx.doi.org/10.5539/ep.v2n1p46

Abstract
The purpose of this analysis was to derive a land-use regression (LUR) model using a semi-parametric method (based on penalized splines) to estimate the geographical characteristics that influence ambient concentrations of nitrogen dioxide (NO2) in Montreal, Quebec, Canada. Such estimations are often used to assess exposure to traffic-related pollution in epidemiologic studies. In May 2003, levels of NO2 were measured for 14 consecutive days at 67 sites across the city, using Ogawa passive-diffusion samplers. Concentrations ranged from 4.9 to 21.2 ppb (median 11.8 ppb). This work is a re-analysis of these data. Linear and semi-parametric multivariate regression analyses were conducted to assess the dependency between logarithms of concentrations of NO2 and land-use variables. In the published multiple linear regression analyses for this study, distance from the nearest highway, length of highways and major roads within 100 m, traffic count on the nearest highway, and population density showed significant associations with NO2. The best-fitting linear model had R^2 = 0.54. The most important variable in the model was traffic count on the nearest highway. The next most important variable was distance from the nearest highway, which has a negative association with NO2 concentration. This work used a semi-parametric model with a nonparametric part incorporating the variables area of open space within 100 m and length of minor roads within 500 m. These variables were non-significant in the linear regression model and showed nonlinear associations with the level of NO2. The semi-parametric model improves the fit of the model for land-use regression when comparing observed and predicted results.

Keywords: ambient air pollution, LUR, nitrogen dioxide, regression, road, traffic

1. Introduction
This work is an extension of the published analysis conducted by other researchers using the same data (Gilbert, Goldberg, Beckerman, Brook, & Jerrett, 2005). In their methodology, the authors of that publication applied only a linear multivariate model. This is a standard approach for geographic information system (GIS) modelling. In this article, a semi-parametric model is proposed. More recently, land-use regression (LUR) methods using GIS modelling have been developed and widely used. These methods have been used mostly to estimate the impact of traffic-related pollution on human health. The respective literature is presented in the original publication (Gilbert et al., 2005). The current study was conducted to assess the feasibility of a large-scale monitoring campaign, in conjunction with land-use regression models, to estimate recent and past levels of pollution related to traffic. The results are applied in a major cancer case-control study that was conducted in Montreal over recent years.

In general, the land-use regression approach has two important features: a) building a good fit (LUR model) to the measured data (an interpolation process), and b) generating a good prediction (an extrapolation process). The prediction involves applying the model to a new set of data points. Thus, the process is to build a LUR model very specific to the measured data and to use it later to predict values for the new dataset. Using the available measured data, it is possible to construct a model that provides the best fit for a given criterion. This holds true mainly when a semi-parametric approach is used and nonlinearity has been incorporated into the constructed model. A nonparametric part of the model can easily fit nonlinear relations in the measured data. From another point of view, the classical linear land-use regression model ignores all nonlinearity in the data; therefore, nonlinearity and scatter smoothing may present a major opportunity to construct the best fit to the measured data. Such a precisely fitted model may not necessarily be a good universal predictor. In other words, good interpolation for given data is not necessarily best for the extrapolation process for a new set of data. Rather,

some balance and proper interpretation of variables should be done before accepting the specific model. This suggests that a semi-parametric method, a mixture of linearity and nonlinearity, may be a good approach to use in GIS modelling.

2. Materials and Methods
The methodology for monitoring nitrogen dioxide (NO2) in Montreal is described by the authors in the original publication (Gilbert et al., 2005). In their study the monitors were set up at sites for a period of two weeks. Their paper describes the distribution of NO2 concentrations and the land-use variables selected, and also provides more information on the study. Here, to avoid repetition, many details are not included.

Different linear multivariate models were fitted to the original data. To assess the validity and the robustness of the approach, one-tenth of the data observations were selected systematically (e.g., 1st, 11th, 21st observation) and excluded. Thus, two models were used: one was fitted to the whole dataset (N=67 points), and then the model was fitted to the reduced data (N=60 points). The model obtained for the reduced data was used to predict values for the 7 removed points. This approach was applied in the original work (Gilbert et al., 2005).

A semi-parametric model was fitted using the SemiPar package from the R statistical system (Wand et al., 2005). SemiPar is free software and a simple tool to construct a nonlinear regression. In a univariate case, a fully nonparametric regression model has the form

$$y_i = f(x_i) + \varepsilon_i,$$

where $(x_i, y_i)$, $1 \le i \le n$, are the scatter plot data, $\varepsilon_i$ are zero-mean random variables with variance $\sigma_\varepsilon^2$, and $f(x) = E(y \mid x)$ is a smooth function. In the SemiPar package, f is estimated using penalized spline smoothing. To fit a semi-parametric model, the function spm (from the package SemiPar) was used. For example, the fitlogNO2 model (which is linear with respect to the disth (distance from nearest highway) variable and nonparametric with respect to the areas (area of open space within 100 m) variable) is constructed by invoking the following command:

fitlogNO2 <- spm(logNO2 ~ disth + f(areas))
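To make the idea of a model that is linear in one covariate and smooth in another more concrete, the following is a minimal Python sketch (the paper's own analysis was done in R with SemiPar). It assumes a truncated-linear spline basis with a ridge penalty on the knot coefficients and a smoothing parameter `lam` fixed by hand; this is a textbook form of penalized spline regression and is not claimed to be the basis or the smoothing-parameter selection that `spm` uses internally. The function names, the knot positions, and the synthetic data are all hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def design_row(x_lin, x_smooth, knots):
    """Row of the design matrix: intercept, linear covariate, and the
    smooth covariate expanded as x plus truncated lines (x - knot)_+."""
    return [1.0, x_lin, x_smooth] + [max(0.0, x_smooth - k) for k in knots]


def fit_semiparametric(x_lin, x_smooth, y, knots, lam):
    """Penalized least squares: only the knot coefficients are shrunk by lam."""
    rows = [design_row(a, b, knots) for a, b in zip(x_lin, x_smooth)]
    p = len(rows[0])
    ctc = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    cty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    for k in range(3, p):  # columns 0-2 (intercept and linear terms) stay unpenalized
        ctc[k][k] += lam
    return solve(ctc, cty)


def predict(theta, x_lin, x_smooth, knots):
    return sum(t * c for t, c in zip(theta, design_row(x_lin, x_smooth, knots)))


# Synthetic illustration (not the Montreal data): linear in x1, kinked in x2.
x2 = [i / 19 for i in range(20)]
x1 = [((7 * i) % 20) / 20 for i in range(20)]
y = [1 + 2 * a + abs(b - 0.5) for a, b in zip(x1, x2)]
knots = [0.25, 0.5, 0.75]

flexible = fit_semiparametric(x1, x2, y, knots, lam=0.0)
stiff = fit_semiparametric(x1, x2, y, knots, lam=1e9)
print("knot coefficients, no penalty:   ", [round(u, 4) for u in flexible[3:]])
print("knot coefficients, heavy penalty:", [round(u, 4) for u in stiff[3:]])
```

With a very large penalty the knot coefficients are driven to zero and the fit collapses to an ordinary linear regression in both covariates, which is the behaviour the paper later reports for the reduced data set.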
A univariate, non-parametric model was constructed separately with each single independent variable used in the land-use multivariate linear regression model, to assess its nonlinearity in relation to the levels of NO2. Two variables were classified to be in the nonparametric part of the semi-parametric model. The criterion of nonlinearity was used to classify the variables (Wand et al., 2005). A full semi-parametric model was developed using the same variables as in the linear regression model. Both models, linear and semi-parametric, were compared by assessing their fit to the full data.

3. Results
Table 1 presents the results of the multivariate linear regression. The results are the same as those obtained by the original authors (Gilbert et al., 2005).

Table 1. The results from linear regression models with N=67 and N=60 observations. The beta coefficients (Beta) are shown for the standardized data (N=67)

| Covariates | Beta (N=67) | B (N=67) | p-value | B (N=60) | p-value |
|---|---|---|---|---|---|
| Intercept | | 0.745 | 0.000 | 0.757 | 0.000 |
| Distance from nearest highway | -0.306 | -0.0254 | 0.003 | -0.026 | 0.004 |
| Traffic count on nearest highway | 0.358 | 1.61 x 10^-6 | 0.002 | 1.56 x 10^-6 | 0.004 |
| Length of highways within 100 m | 0.228 | 0.132 | 0.017 | 0.122 | 0.029 |
| Length of major roads within 100 m | 0.242 | 0.138 | 0.018 | 0.104 | 0.095 |
| Length of minor roads within 500 m | 0.183 | 6.38 x 10^-3 | 0.106 | 7.32 x 10^-3 | 0.087 |
| Area of open space within 100 m | -0.184 | -0.027 | 0.093 | -0.020 | 0.272 |
| Population density within 2000 m | 0.284 | 1.25 x 10^-5 | 0.039 | 1.11 x 10^-5 | 0.079 |

The table represents the best-fitting linear regression model with a determination coefficient R^2 = 0.54. The results are shown in the following columns: calculated coefficients (B) and p-values (p), both for the full data set (N=67 observations) and for the reduced data set (N=60 observations). Here, as an additional result, the beta

coefficients (Beta) are shown for the full data set. The beta coefficients are the regression coefficients calculated after standardizing all variables to a mean of 0 and a standard deviation of 1. Thus, the advantage of beta coefficients (as compared to the B coefficients, which are not standardized) is that their magnitude allows for a clearer comparison of the relative contribution of each independent variable to the prediction of the dependent variable. The traffic count for the nearest highway (Beta=0.358) and the distance from the nearest highway (Beta=-0.306) are the two most important and statistically significant variables in the model (presented in Table 1).

The results show that two variables (length of minor roads within 500 m, area of open space within 100 m) in the linear regression model are not statistically significant. In the fitted model for the reduced data (only 60 observations), two additional variables are not statistically significant, demonstrating some sort of instability. This issue will be discussed later.

Figure 1 shows the results for a few variables used in a univariate nonparametric regression. The SemiPar package provides information on the degrees of freedom for each fitted nonparametric element. This value can be used to classify the degree of nonlinearity (Wand et al., 2005).

Figure 1. Semi-parametric models for log10(NO2) using a single variable (a few chosen to illustrate)

Thus, potential candidates to be included in the nonparametric part of the semi-parametric model are: distance from nearest highway, length of minor roads within 500 m, and area of open space within 100 m. These variables in the univariate model have the following degrees of freedom (df): 4.2, 3.6 and 3.0, respectively. The value of df=1 indicates that the variable has a linear dependency; values greater than 1 suggest nonlinearity (Wand et al., 2005). In the construction of the semi-parametric model, only two variables were included in its nonparametric part: length of minor roads within 500 m, and area of open space within 100 m.
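The link between the degrees of freedom and the amount of nonlinearity can be illustrated with a small Python sketch. It assumes a univariate penalized spline with a truncated-linear basis and takes the df of the smooth term as the trace of the smoother matrix minus one for the intercept, so that a straight line gives df=1 as in the text. This convention, the knot positions, and the function names are assumptions made for illustration; SemiPar's own df calculation may differ in detail.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def smooth_term_df(x, knots, lam):
    """Effective df of a penalized spline in x, excluding the intercept.

    Uses trace((C'C + lam*D)^-1 C'C), where C holds the columns
    1, x, (x - knot)_+ and D penalizes only the knot columns.
    """
    rows = [[1.0, xi] + [max(0.0, xi - k) for k in knots] for xi in x]
    p = len(rows[0])
    ctc = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    penalized = [row[:] for row in ctc]
    for k in range(2, p):
        penalized[k][k] += lam
    trace = 0.0
    for j in range(p):
        column = [ctc[i][j] for i in range(p)]
        trace += solve(penalized, column)[j]  # j-th diagonal entry of the product
    return trace - 1.0


# Hypothetical covariate values and knots, for illustration only.
x = [i / 29 for i in range(30)]
knots = [0.2, 0.4, 0.6, 0.8]
for lam in (0.0, 0.1, 10.0, 1e8):
    print(lam, round(smooth_term_df(x, knots, lam), 3))
```

Under these assumptions the df falls from the number of basis functions (no penalty) toward 1 (heavy penalty), which is why a fitted df close to 1 can be read as an essentially linear relation.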
The results are shown in Table 2.

Table 2. The results from the semi-parametric model with N=67 and N=60 observations

| Covariates | B (N=67) | p-value | B (N=60) | p-value |
|---|---|---|---|---|
| Intercept | 0.888 | 0.000 | 0.757 | 0.000 |
| Distance from nearest highway | -0.026 | 0.002 | -0.026 | 0.004 |
| Traffic count on nearest highway | 1.66 x 10^-6 | 0.001 | 1.56 x 10^-6 | 0.004 |
| Length of highways within 100 m | 0.107 | 0.044 | 0.121 | 0.029 |
| Length of major roads within 100 m | 0.124 | 0.028 | 0.104 | 0.095 |
| Length of minor roads within 500 m | df=3.13 | | df=1.00 | |
| Area of open space within 100 m | df=1.68 | | df=1.01 | |
| Population density within 2000 m | 1.15 x 10^-5 | 0.043 | 1.11 x 10^-5 | 0.079 |

The variable corresponding to distance from nearest highway (with high df=4.2) was not added to the nonparametric part. For this variable it was observed that, for values greater than 4 km, the level of NO2 begins to increase. This suggests that there may have been another source of NO2 during measurements that was not identified. For data restricted to distances of less than 4 km, the relation to NO2 is linear.

In Table 2 the last two columns correspond to the subset of the data with 7 points removed. In this case, the two nonlinear components in the fitted semi-parametric model become linear (df close to 1). The model is then the same as that constructed by multivariate linear regression. The last two columns in Table 1 confirm this, and Figure 2 illustrates this phenomenon.

Figure 2. Two components of the semi-parametric model for log10(NO2) with N=67 points (left) and N=60 points (right). The results relate to the values in Table 2
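The reduced data set with N=60 points results from the systematic exclusion described in Section 2 (1st, 11th, 21st observation, and so on). A minimal Python sketch of such a split is given below; the function name and the zero-based indexing are illustrative assumptions, and the ordering of the 67 sites in the original data file is not known from the text.

```python
def systematic_holdout(n_obs, step=10, start=0):
    """Split indices 0..n_obs-1 into kept and held-out sets,
    holding out every `step`-th observation beginning at `start`."""
    held_out = list(range(start, n_obs, step))
    held_set = set(held_out)
    kept = [i for i in range(n_obs) if i not in held_set]
    return kept, held_out


kept, held_out = systematic_holdout(67)
print(len(kept), len(held_out))
print([i + 1 for i in held_out])  # one-based positions of the removed points
```

For 67 observations this rule removes 7 points and keeps 60, matching the sample sizes reported in Tables 1 and 2.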

The 7 removed points are specific points that strongly affect the regression model, and the fitted models are sensitive to these points. The linear regression for the predicted NO2 values was calculated for both models. These values are used as the dependent variable, while the actual measured values of nitrogen dioxide are used as the independent variable. For the original multivariate linear model, the coefficient for the measured NO2 is 0.50, R^2=0.43, correlation=0.66. For the semi-parametric model, the coefficient is 0.56, R^2=0.53, correlation=0.72. This demonstrates that the semi-parametric model provides a better fit to the data than does the multivariate linear model.

4. Conclusions
In this paper it is shown that a semi-parametric model is a good alternative to a linear model. The main lesson is that the fitted models are very sensitive to the data. The dataset with N=67 points and its subset composed of N=60 points show a dependency characteristic that can change from nonlinear to linear. This change also emphasizes another problem: the model best fitted to the data is not necessarily a good tool for an extrapolation process. The proposed semi-parametric model fits the original data well, adapts to nonlinearity, and becomes equivalent to linear regression if the data do not show nonlinear relations.

Acknowledgments
The authors thank Nicolas Gilbert for providing the data related to this study.

References
Gilbert, N. L., Goldberg, M. S., Beckerman, B., Brook, J. R., & Jerrett, M. (2005). Assessing spatial variability of ambient nitrogen dioxide in Montreal, Canada, with a land use regression model. J. Air & Waste Manage. Assoc, 55, 1059-1063. http://dx.doi.org/10.1080/10473289.2005.10464708
Wand, M. P., Coull, B. A., French, J. L., Ganguli, B., Kamman, E. E., Staudenmayer, J., & Zanobetti, A. (2005). SemiPar v.1.0. R. 2.5.1. The R Foundation for Statistical Computing. Retrieved from http://www.r-project.org/