Bandwidth selection for estimating the two-point correlation function of a spatial point pattern using AMSE

Similar documents
Controlled Information Maximization for SOM Knowledge Induced Learning

Segmentation of Casting Defects in X-Ray Images Based on Fractal Dimension

On Error Estimation in Runge-Kutta Methods

Point-Biserial Correlation Analysis of Fuzzy Attributes

IP Network Design by Modified Branch Exchange Method

Image Enhancement in the Spatial Domain. Spatial Domain

Journal of World s Electrical Engineering and Technology J. World. Elect. Eng. Tech. 1(1): 12-16, 2012

A Two-stage and Parameter-free Binarization Method for Degraded Document Images

A modal estimation based multitype sensor placement method

Concomitants of Upper Record Statistics for Bivariate Pseudo Weibull Distribution

An Unsupervised Segmentation Framework For Texture Image Queries

Detection and Recognition of Alert Traffic Signs

Frequency Domain Approach for Face Recognition Using Optical Vanderlugt Filters

Optical Flow for Large Motion Using Gradient Technique

Gravitational Shift for Beginners

Strictly as per the compliance and regulations of:

Color Correction Using 3D Multiview Geometry

Illumination methods for optical wear detection

A Mathematical Implementation of a Global Human Walking Model with Real-Time Kinematic Personification by Boulic, Thalmann and Thalmann.

= dv 3V (r + a 1) 3 r 3 f(r) = 1. = ( (r + r 2

A Shape-preserving Affine Takagi-Sugeno Model Based on a Piecewise Constant Nonuniform Fuzzification Transform

Lecture # 04. Image Enhancement in Spatial Domain

Cardiac C-Arm CT. SNR Enhancement by Combining Multiple Retrospectively Motion Corrected FDK-Like Reconstructions

DISTRIBUTION MIXTURES

Positioning of a robot based on binocular vision for hand / foot fusion Long Han

Title. Author(s)NOMURA, K.; MOROOKA, S. Issue Date Doc URL. Type. Note. File Information

Fifth Wheel Modelling and Testing

Communication vs Distributed Computation: an alternative trade-off curve

RANDOM IRREGULAR BLOCK-HIERARCHICAL NETWORKS: ALGORITHMS FOR COMPUTATION OF MAIN PROPERTIES

Extract Object Boundaries in Noisy Images using Level Set. Final Report

Assessment of Track Sequence Optimization based on Recorded Field Operations

A Memory Efficient Array Architecture for Real-Time Motion Estimation

Multi-azimuth Prestack Time Migration for General Anisotropic, Weakly Heterogeneous Media - Field Data Examples

HISTOGRAMS are an important statistic reflecting the

4.2. Co-terminal and Related Angles. Investigate

A Novel Automatic White Balance Method For Digital Still Cameras

Comparisons of Transient Analytical Methods for Determining Hydraulic Conductivity Using Disc Permeameters

Spiral Recognition Methodology and Its Application for Recognition of Chinese Bank Checks

SYSTEM LEVEL REUSE METRICS FOR OBJECT ORIENTED SOFTWARE : AN ALTERNATIVE APPROACH

Topological Characteristic of Wireless Network

Topic -3 Image Enhancement

Transmission Lines Modeling Based on Vector Fitting Algorithm and RLC Active/Passive Filter Design

ANALYTIC PERFORMANCE MODELS FOR SINGLE CLASS AND MULTIPLE CLASS MULTITHREADED SOFTWARE SERVERS

Modeling spatially-correlated data of sensor networks with irregular topologies

(a, b) x y r. For this problem, is a point in the - coordinate plane and is a positive number.

Improved Fourier-transform profilometry

INFORMATION DISSEMINATION DELAY IN VEHICLE-TO-VEHICLE COMMUNICATION NETWORKS IN A TRAFFIC STREAM

Generalized Grey Target Decision Method Based on Decision Makers Indifference Attribute Value Preferences

A VECTOR PERTURBATION APPROACH TO THE GENERALIZED AIRCRAFT SPARE PARTS GROUPING PROBLEM

arxiv: v2 [physics.soc-ph] 30 Nov 2016

Clustering Interval-valued Data Using an Overlapped Interval Divergence

CLUSTERED BASED TAKAGI-SUGENO NEURO-FUZZY MODELING OF A MULTIVARIABLE NONLINEAR DYNAMIC SYSTEM

Prof. Feng Liu. Fall /17/2016

Information Retrieval. CS630 Representing and Accessing Digital Information. IR Basics. User Task. Basic IR Processes

Shortest Paths for a Two-Robot Rendez-Vous

Methods for history matching under geological constraints Jef Caers Stanford University, Petroleum Engineering, Stanford CA , USA

ANN Models for Coplanar Strip Line Analysis and Synthesis

Massachusetts Institute of Technology Department of Mechanical Engineering

EYE DIRECTION BY STEREO IMAGE PROCESSING USING CORNEAL REFLECTION ON AN IRIS

Monte Carlo Techniques for Rendering

Keith Dalbey, PhD. Sandia National Labs, Dept 1441 Optimization & Uncertainty Quantification

AUTOMATED LOCATION OF ICE REGIONS IN RADARSAT SAR IMAGERY

Voting-Based Grouping and Interpretation of Visual Motion

Ranking Visualizations of Correlation Using Weber s Law

Modeling Spatially Correlated Data in Sensor Networks

XFVHDL: A Tool for the Synthesis of Fuzzy Logic Controllers

High performance CUDA based CNN image processor

A Minutiae-based Fingerprint Matching Algorithm Using Phase Correlation

Analysis of uniform illumination system with imperfect Lambertian LEDs

Bilateral Filter Based Selective Unsharp Masking Using Intensity and/or Saturation Components

vaiation than the fome. Howeve, these methods also beak down as shadowing becomes vey signicant. As we will see, the pesented algoithm based on the il

A New and Efficient 2D Collision Detection Method Based on Contact Theory Xiaolong CHENG, Jun XIAO a, Ying WANG, Qinghai MIAO, Jian XUE

User Specified non-bonded potentials in gromacs

Adaptation of Motion Capture Data of Human Arms to a Humanoid Robot Using Optimization

QUANTITATIVE MEASURES FOR THE EVALUATION OF CAMERA STABILITY

Improvement of First-order Takagi-Sugeno Models Using Local Uniform B-splines 1

Towards Adaptive Information Merging Using Selected XML Fragments

Input Layer f = 2 f = 0 f = f = 3 1,16 1,1 1,2 1,3 2, ,2 3,3 3,16. f = 1. f = Output Layer

Conservation Law of Centrifugal Force and Mechanism of Energy Transfer Caused in Turbomachinery

A Recommender System for Online Personalization in the WUM Applications

UCLA Papers. Title. Permalink. Authors. Publication Date. Localized Edge Detection in Sensor Fields.

Approximating Euclidean Distance Transform with Simple Operations in Cellular Processor Arrays

Experimental and numerical simulation of the flow over a spillway

AN ANALYSIS OF COORDINATED AND NON-COORDINATED MEDIUM ACCESS CONTROL PROTOCOLS UNDER CHANNEL NOISE

Obstacle Avoidance of Autonomous Mobile Robot using Stereo Vision Sensor

Conversion Functions for Symmetric Key Ciphers

Tissue Classification Based on 3D Local Intensity Structures for Volume Rendering

New Algorithms for Daylight Harvesting in a Private Office

Drag Optimization on Rear Box of a Simplified Car Model by Robust Parameter Design

Also available at ISSN (printed edn.), ISSN (electronic edn.) ARS MATHEMATICA CONTEMPORANEA 3 (2010)

COLOR EDGE DETECTION IN RGB USING JOINTLY EUCLIDEAN DISTANCE AND VECTOR ANGLE

An Extension to the Local Binary Patterns for Image Retrieval

Embeddings into Crossed Cubes

Adaptation of TDMA Parameters Based on Network Conditions

A Texture Feature Extraction Based On Two Fractal Dimensions for Content_based Image Retrieval

A New Finite Word-length Optimization Method Design for LDPC Decoder

Separability and Topology Control of Quasi Unit Disk Graphs

Effects of Model Complexity on Generalization Performance of Convolutional Neural Networks

ADDING REALISM TO SOURCE CHARACTERIZATION USING A GENETIC ALGORITHM

Elliptic Generation Systems

Transcription:

Bandwidth selection fo estimating the two-point coelation function of a spatial point patten using A Ji Meng Loh Dept of Mathematical Sciences New Jesey Institute of Technology Newak, New Jesey Woncheol Jang Dept of Statistics Seoul National Univesity Republic of Koea Mach 21, 2014 Abstact We intoduce an asymptotic mean squaed eo (A) appoach to obtain closedfom expessions of the adaptive optimal bandwidths fo estimating the two-point coelation function of a homogeneous spatial point patten. The two-point coelation function is one of seveal second-ode measues of clusteing of a point patten and is commonly used in astonomy whee cosmologists have found a elationship between the clusteing of galaxies and the evolution of the univese. The A appoach is adapted fom an AMISE appoach that is well-known in density estimation and ou appoach povides a simple and quick method fo optimal bandwidth selection fo estimating the two-point coelation function. Using optimal bandwidths fo estimation will allow moe infomation about clusteing to be extacted fom the data. Results fom numeical studies suggest that the mean squaed eo of estimates obtained using A optimal bandwidths is competitive with those obtained using moe computationally intensive methods and is close to the empiical optimal bandwidths. We illustate its use in an application to a galaxy cluste catalog fom the Sloan Digital Sky Suvey. Keywods: Bandwidth selection, second-ode clusteing, spatial point pattens, twopoint coelation function Email: loh@njit.edu; Tel: 973-596-2949; Fax: 973-596-5591 Email: wcjang@stats.snu.ac.k 1

1 Intoduction In the analysis of spatial point pattens, it is often of inteest to quantify the clusteing of the points, i.e. the degee of clumpiness o egulaity obseved in the point patten. The clusteing in the points often eflect the undelying pocess diving the placement of points. Fo example, the clusteing of galaxies is believed to be due to the e ects of gavitation ove an extemely long peiod of time. In ecological studies, the pesence of egulaity of tees in afoestmaybeduetothecompetitionfosoilesouces. In this pape, we conside the poblem of finding an optimal bin size o bandwidth fo estimating the two-point coelation function (2PCF) of an obseved spatial point patten. The two-point coelation function is a measue of second-ode clusteing and is outinely used by astonomes to quantify the clusteing of astonomical objects such as galaxies and quasi-stella objects (Matínez and Saa, 2001). We focus on estimatos which astonomes have developed that empiically account fo bounday e ects, and use an asymptotic mean squaed eo (A) appoach to find optimal bandwidths fo these estimatos. Although thid and highe-ode measues of clusteing have been studied in astonomy and in othe aeas (Kim et al., 2011; Szapudi et al., 2001), second-ode measues and especially the 2PCF in astonomy ae still the nom. Ou method fo finding the optimal bandwidth can be extended to apply to thid and highe-ode measues. Second-ode measues elate to the behavio of point pais and is quantified by the distibutions of inte-point distances. Vaious measues have been used to measue secondode clusteing of a spatial point patten, e.g. the second-ode poduct density (2), Ripley s K function, the pai coelation function g and the two-point coelation function. Each quantity is a function of the inte-point distance. Intuitively, the second-ode poduct density (2) is elated to the pobability of finding a pai of points, one in each of two small volumes ds 1 and ds 2 sepaated by distance. Moe fomally, if N(dS) epesentsthenumbeofpointsinthevolumeds, (2) () = N(dS)N(dS + ) lim ds!0 ds 2, whee ds epesents the volume of ds and ds + is ds tanslated by. If the spatial point pocess is isotopic, (2) () = (2) (), whee =. The vaious second-ode measues ae elated to each othe as follows: Z K() = 4 u 2 g(u) du in R 3, 0 g() = (2) ()/ 2, and () = g() 1. The 2PCF is popula in astonomy because the cosmological models descibing the evolution of the univese link the two-point coelation of astonomical objects such as galaxies to cosmological paametes that contol the univese s evolution. A well-known success stoy 2

in cosmological modeling is the pediction of a bump in the 2PCF due to an event called ecombination that was subsequently obseved in the empiical 2PCF estimated fom data (Ryden, 2003). Estimating quantities such as the two-point coelation function fom data equies counting pais of points that ae distance apat. This equies specifying a bandwidth, the focus of this pape. The e ects of the obsevation egion bounday also has to be taken into account, since the neighbohood of a point close to the bounday is not fully obseved. Thee ae seveal analytical methods fo dealing with edge e ects (Baddeley et al., 1993; Illian et al, 2008). Pobably due to the iegula boundaies of the obsevation egions, astonomes tend to use numeical methods. Specifically, suppose we use D and R to epesent the data and a andomly geneated set of points, with N D and N R numbe of points espectively. Define DD = DD( 0 )= X X 1{ 0 h apple x y apple 0 + h}/n D (N D 1), x2d y2d DR = DR( 0 )= X X 1{ 0 h apple x y apple 0 + h}/n D N R, x2d y2r RR = RR( 0 )= X X 1{ 0 h apple x y apple 0 + h}/n R (N R 1). x2r y2r Then the Landy-Szalay estimato (Landy and Szalay, 1993) of the two-point coelation function is given by ˆ LS ( 0 )=(DD 2DR + RR)/RR. Thee ae othe estimatos of, such as the Hamilton, the Peebles and the Hewitt estimatos. See Kesche et al. (2000) fo a eview of these estimatos. The Landy-Szalay and Hamilton estimatos ae geneally consideed to be the bette estimatos, with smalle standad eos, with the Landy-Szalay estimato being moe commonly used. These two estimatos ae simila to the estimatos constucted by Loh et al. (2001) fo the pupose of educing standad eos. We popose using an appoach simila to the asymptotic mean integated squaed eo (AMISE) in the density estimation liteatue to obtain optimal bandwidths fo the estimation of the 2PCF. We descibe the AMISE method fo density estimation in the subsection below. In Section 2 we descibe the application of the AMISE appoach to the 2PCF, emoving the integation fom the oiginal AMISE appoach to obtain an A method that allows us to select adaptive bandwidths depending on. Section 3 descibes the esults of a simulation study compaing the bandwidths obtained using A with empiically optimal bandwidths. In Section 4 we apply the A method to a Sloan Digital Sky Suvey dataset of galaxy clustes. Section 5 concludes this pape. 3

1.1 Bandwidth selection fo density estimation In this section, we give an oveview of bandwidth selection in density estimation. Suppse we have n independent identically distibuted andom vaiables, X 1,...,X n with an unknown density function f. The kenel density estimato of f(x) isgivenby ˆf h (x) = 1 nh nx x K i=1 Xi h whee h is the bandwidth and K is a kenel function. We assume that K satisfies Z Z Z K(x)dx =1, xk(x) =0, and x 2 K(x) < 1. The most common citeion fo measuing the global accuacy of ˆf, an estimato of f, isthe mean integated squae eo (MISE): MISE( ˆf) =E Z ˆf(x) f(x) 2 dx It is well known that the choice of kenel K is not cucial in density estimation wheeas the choice of bandwidth h is impotant. Vaious data-diven bandwidth selectos have been poposed and most of them ae motivated by di eent ways of minimizing the MISE. To see how the choice of the bandwidth make influence on the MISE, one consides the appoximation of MISE with leading vaiance and bias tems: Z Z MISE( ˆf) = E( ˆf(x)) 2 f(x) dx + V a( ˆf(x))dx = 1 Z K(t) 2 dt + 1 Z nh 4 h4 µ 2 (K) 2 f 00 (x) 2 dx + o 1 nh + h4, whee µ 2 (K) = R t 2 K 2 (t)dt. See Silveman (1986) fo details. We can then define the asymptotic mean integated squae eo as follows: AMISE( ˆf) = 1 nh Z K(t) 2 dt + 1 4 h4 µ 2 (K) 2 Z f 00 (x) 2 dx, ignoing the smalle ode tems. With simple algeba, the optimal bandwidth that minimizes the AMISE is given by R(K) 1/5 h opt = µ 2 (K) 2 R(f 00 )n whee R(g) = R g 2 (x)dx. Howeve, R(f 00 )isunknownsowecannotcalculatetheabove optimal bandwidth in pactice. A easy way to addess this issue is to assume f to be a 4

specific density function. Fo example, in the nomal efeence ule, f is assumed to be the nomal density with vaiance 2, and one can easily show that h nomal =1.06 n 1/5. A moe sophisticated way of estimating R(f 00 )istousethekeneldensityestimato. This, howeve, equies the expectation of the 4th deivative of f. This latte quantity can also be obtained using a kenel estimato, but its optimal bandwidth depends on the expectation of the 6th deivative of f. Usually, at this point, one applies the nomal efeence ule to select the optimal bandwidth fo the kenel estimato of the 6th deivative of f. Othe popula bandwidth selectos include least squaes coss-validation and likelihood coss-validation. See Silveman (1986). Howeve, the afoementioned plug-in bandwidth selecto povides a closed fom of the optimal bandwidth and these plug-in methods have an advantage in computation ove coss-validation based bandwidth selectos. 2 A fo the 2PCF To estimate the 2PCF, we need a bandwidth h fo computing DD, DR and RR. Often an ad-hoc appoach is used to select the optimal bandwidth. Fo example, Stoyan et al. (1995) poposed using the bandwidth h = c 1/d fo the Epanechnikov kenel whee d epesents the dimension of the obsevational aea and c is a constant. Fo plana point patten of 50-300 points, they suggested c 2 (0.1, 0.2) with c =.15 being a common choice (Guan, 2007). Instead of the Epanechnikov kenel, Stoyan (2006) ecommended the box kenel, but did not povide guidelines on how to select the optimal bandwidth with the box kenel. Fo 3 dimensions, Pons-Bodeia et al. (1999) ecommended c = 0.05 and 0.1 foclusteedand Poisson point pocesses espectively. Thei appoach is simila to the nomal efeence ule. While these appoaches povide a closed-fom expession fo the optimal bandwidth, the optimal bandwidth is deived based on the stong assumption on the undelying pocess. Theefoe, we develop a simple bandwidth selection method which is simila to the plugin bandwidth in density estimation so the selection of bandwidth can be less sensitive to the assumption on the undelying pocess. Futhemoe, ou appoach allows fo adaptive bandwidth selection depending of. Let the kenel K h (t) bedefinedasaboxcakenel: K h (t) = 2h 1 1{ t apple h}, so that DD = DD( 0 )=2h P P x y K h( 0 x y ) andsimilalyfodr and RR. Note that R hh K h (t) dt =1. Wethenhave Z Z E(RR) = 2h K h ( 0 x y ) 2 R dy dx AZ AZ 1 Z = 2h 2 R K h ( 0 )1 A {x +(, )} 2 d d dx, A 0 Z Z 1 applez = 2h 2 R K h ( 0 ) 1 A {x +(, )} 2 d d dx, A 0 Z Z 1 = 2h 2 R K h ( 0 )C() d dx, A 0 5

whee in the second line above we have expessed y in spheical coodinates: y = x +(, ), with 2 (0, 1), 2, andincludedtheindicatofunction1 A {x +(, )} to ensue that y 2 A. The quantity C() epesentstheedgee ect due to the bounday of A. It is equal to 4 2 in thee dimensions if x is moe than away fom the bounday of A. Fo small h, 0 and theefoe C() C( 0 ) C 0. Replacing C() withc( 0 )abovehelpssimplify the expession: Z Z 1 E(RR) 2h 2 R C 0 dx K h ( 0 ) d A 0 = 2h 2 R A C 0, if 0 >h. since R 1 0 K h ( 0 ) d = R 0 +h 0 h 1{ 0 apple h}/2hd =1if 0 >h.similaly,wehave E(DR) 2h D R A C 0, Z 1 E(DD) 2h 2 D A C 0 K h ( 0 0 )g() d. Note that if we set R = D, then E(DR) =E(RR), and the di eence between E(DD) and E(RR) isthepesenceofg() intheintegand.theefoe, E[ˆ LS ( 0 )] 1 Z 1 0 K g() d 1, when h 0 h 0 >h, whee K(t) = 2 1 1{ t apple 1}. The appoximation above is obtained by taking the atio of the expected values, and noting that K h (t) =K(t/h)/h. Using a Taylo expansion of g() about 0,wefindthatthebiasis Bias(ˆ LS ( 0 )) = E[ˆ LS ( 0 )] 1 ( 0 ) = 1 Z 1 0 K g() d 1 ( h 0 h 0 ) Z 1 = K(t)g( 0 + th) dt 1 ( 0 ) (Hee t ( 0 )/h) 0 /h Z 1 apple K(t) g( 0 )+g 0 ( 0 )th + 1 0 /h 2 g00 ( 0 )t 2 h 2 dt 1 ( 0 ) Z 1 Z 1 = g( 0 ) K(t) dt 1 ( 0 )+g 0 ( 0 )h tk(t) dt 0 /h 0 /h + g00 ( 0 )h 2 Z 1 t 2 K(t) dt 2 0 /h = g00 ( 0 )h 2 Z Z µ 2 2 (K)+o(h 2 ) because K(t) dt =1, tk(t) dt =0when 0 >h = g00 ( 0 )h 2 + o(h 2 ). 6 6

In the last line above, we have witten µ 2 (K) = R 1 0 /h t2 K(t) dt =1/3 if 0 >h. Landy and Szalay (1993) deived an appoximation fo the vaiance of ˆ LS. Specifically, Va[ ˆ LS ( 0 )] = E[1 + ˆ LS ( 0 )] 2 /(2h 2 A C 0 ), so that its leading tem is g( 0 ) 2 /(2h 2 A C 0 ). Togethe with the bias, we have, ( 0 ) [g00 ( 0 )] 2 h 4 1 + 36 h g( 0) 2 2 2. (1) A C 0 We can obtain an optimal bandwidth fom (1) by minimizing the expession with espect to h. Di eentiating (1) with espect to h we get the optimal bandwidth as h opt ( 0 ) = apple 9g( 0 ) 2 2 2 A C 0 g 00 ( 0 ) 2 1/5. (2) Note that the expession depends on the unknown function g =1+ as well as its second deviative. A simple pocedue to get an optimal bandwidth fom the above is to use a plug-in appoach and assume a specific fom fo g (o ). We choose as the expession fo g that of the Thomas modified pocess (Thomas, 1949). The Thomas pocess is a Neyman-Scott pocess whee the points ae the o sping of paent points. The paent points follow a homogeneous Poisson pocess with intensity. Each paent point has a Poisson numbe of o sping points (with mean numbe µ) that,fothe modifed Thomas pocess ae distibuted about the paents accoding to a Gaussian density with standad deviation. Fo the modified Thomas pocess, g() =1+ µ 4 2 e 2 /4 2. Using the above expession fo g, we get an optimal bandwidth h opt given by 0 " h opt ( 0 )=@ 72 8 4 2 + µe 2 /4 2 # 2 11/5 2 A C 0 µ( 2 2 2 )e 2 /4 2 A. (3) Altenatively, in astonomy, a commonly used functional fom fo the 2PCF is the powelaw model, () =(/s 0 ), which is known to fit a wide ange of empiical data well. If we use this model in the expession (2), we get h opt ( 0 ) = apple 9 2 2 A C 2 0 ( +1) 2 1+ 0 s 0 2 4 0! 1/5. (4) We note that this optimal bandwidth is an adaptive bandwidth, i.e. a value is obtained fo evey value of 0. This is in contast to the plug-in optimal bandwidths in kenel density 7

estimation. We also note that since h is usually elatively small, the assumption of 0 >h is not vey estictive. In an actual application, instead of 2 A C 0, we can eplace it with the aveage of RR/2h evaluated ove a ange of values of h, whee RR is as defined befoe, using R = D. Values of and s 0 can be obtained fom fits of the powe-law model to the data. 3 Simulation study We pefomed a simulation study to show the pefomance of the A method fo bandwidth selection to estimate the 2PCF. We compae it with (a) the simple ule of thumb c 1/d (Stoyan et al., 1995) fo the bandwidth, using c =.15 and d =2,and(b)theoptimal bandwidths obtained empiically (descibed below). The steps in ou simulation study ae as follows: 1. We conside the Thomas modified pocess with 12 di eent sets of paametes (see Table 1). Fo each paamete set, we simulate 500 ealizations and estimate the 2PCF, ˆ b i (),i=1,...,500, ove a set of values each using a ange of bandwidths. 2. Fo each value of, we find the bandwidth b that minimizes the mean squaed eo () = P i [ˆ i b () ()]2. The bandwidth b is then the empiically obtained optimal bandwidth fo estimating (). 3. Next, fo each paamete set, we simulate a new set of 500 ealizations and fo each ealization, estimate (), using the optimal bandwidths b obtained above, the A bandwidth using (3) and Stoyan s bandwidth. We conside two methods fo obtaining the A and Stoyan bandwidths, one based on the tue paametes of the pocess, and the othe using estimated paametes. Fo the A bandwidth, the thomas.estk function in the spatstat R package was used to obtain estimates of the paametes. Fo the Stoyan bandwidth, the estimated intensity is used. Thus, 5 bandwidths ae used. We compute () fo each of them. We also consideed the pefomance of the method when the undelying pocess is di eent fom the Thomas pocess. We used A bandwidths based on the Thomas pocess to estimate the 2PCFs fo ealizations of the Matén cluste pocess. Fo bandwidths based on estimated paametes, we estimate paametes using thomas.estk, i.e. we use the incoect model. Tables 1 and 2 list the paametes of the Thomas and Matén cluste pocess used, espectively. 3.1 Results Figues 1 and 2 show the esults of ou simulation study. Each figue shows plots of the mean squaed eo fo estimates of the two-point coelation function at distance. Each of the 12 plots in Figue 1 coespond to one of the paamete sets in Table 1. The of estimates 8

Thomas model 1 2 3 4 5 6 7 8 9 10 11 12 apple 50 50 50 50 50 50 100 100 100 100 100 100.05.05.1.1.2.2.05.05.1.1.2.2 µ 2 8 2 8 2 8 1 4 1 4 1 4 Table 1: Paametes used fo the Thomas modified pocess in the simulation study Matén 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 apple 50 50 50 50 50 50 50 50 100 100 100 100 100 100 100 100 R.05.05.1.1.2.2.4.4.05.05.1.1.2.2.4.4 µ 2 8 2 8 2 8 2 8 1 4 1 4 1 4 1 4 Table 2: Paametes used fo the Matén cluste pocess used in the simulation study obtained with the empiically obtained optimal bandwidths (black long-dashed lines) ae the smallest in each case. s of estimates using bandwidths obtained by ou A method ae shown as gay dashed lines. The thick dashed lines epesent the case when the tue model paametes ae used while fo the thin dashed lines, estimated paametes ae used. We find that these ae close to the s of estimatos based on the empiically obtained optimal bandwidths. Using estimated paametes esults in a highe, but not by much. s obtained fo estimates using Stoyan s simple fomula fo the bandwidth ae the highest and ae shown in Figue 1 as the solid gay lines. These latte s wee almost identical whethe the tue o estimated intensity was used. Figue 2 shows simila plots whee the data is geneated fom the Matén cluste pocess, but the Thomas pocess is used fo g in (2). Hee, the A bandwidths ae obtained using paametes estimated by fitting the Thomas model (i.e. a wong model) to the simulated ealizations. We find that the esults ae vey simila: the A bandwidths achieve s close to the empiical optimal bandwidths and smalle than those fo the Stoyan bandwidth. This suggests that the identified optimal bandwidths using ou A appoach ae not sensitive to the paticula point model used. This is encouaging, since the optimal bandwidth h opt fo the 2PCF using (2) is much moe complicated fo the Matén cluste pocess than fo the modified Thomas pocess. 4 Application to SDSS data The Sloan Digital Sky Suvey (Yok et al., 2000) is a majo astonomical suvey that began in 2000, coveing about 35% of the sky, and has collected obsevations on moe than a million objects consisting of di eent types of astonomical objects such as galaxies and quasas. Goto et al. (2002) intoduced a cut-and-enhance method fo selecting clustes of galaxies fom aw SDSS data and Basilakos and Plionis (2004) analyzed a subset of 200 of 9

(κ, σ, µ) = (50, 0.05, 2) (κ, σ, µ) = (50, 0.05, 8) (κ, σ, µ) = (50, 0.1, 2) (κ, σ, µ) = (50, 0.1, 8) (κ, σ, µ) = (50, 0.2, 2) (κ, σ, µ) = (50, 0.2, 8) (κ, σ, µ) = (100, 0.05, 1) (κ, σ, µ) = (100, 0.05, 4) (κ, σ, µ) = (100, 0.1, 1) (κ, σ, µ) = (100, 0.1, 4) (κ, σ, µ) = (100, 0.2, 1) (κ, σ, µ) = (100, 0.2, 4) Figue 1: Plots showing the mean squaed eo () of estimates of the two-point coelation function of the Thomas pocess using empiically obtained optimal bandwidths (black long-dashed lines), bandwidths obtained using ou A method (gay thick and thin dashed lines) and using the Stoyan s ule of thumb (gay solid lines). The thick and thin dashed lines epesent s fom using, espectively, the tue and estimated model paametes in the A method. 10

(κ, σ, µ) = (50, 0.05, 2) (κ, σ, µ) = (50, 0.05, 8) (κ, σ, µ) = (50, 0.1, 2) (κ, σ, µ) = (50, 0.1, 8) 0 5 10 15 0 5 10 15 (κ, σ, µ) = (50, 0.2, 2) (κ, σ, µ) = (50, 0.2, 8) (κ, σ, µ) = (50, 0.4, 2) (κ, σ, µ) = (50, 0.4, 8) (κ, σ, µ) = (100, 0.05, 1) (κ, σ, µ) = (100, 0.05, 4) (κ, σ, µ) = (100, 0.1, 1) (κ, σ, µ) = (100, 0.1, 4) 0 5 10 15 0 5 10 15 (κ, σ, µ) = (100, 0.2, 1) (κ, σ, µ) = (100, 0.2, 4) (κ, σ, µ) = (100, 0.4, 1) (κ, σ, µ) = (100, 0.4, 4) Figue 2: Plots showing the mean squaed eo () of estimates of the two-point coelation function of the Matén cluste pocess using empiically obtained optimal bandwidths (black long-dashed lines), bandwidths obtained using ou A method based on the incoect Thomas model (gay dashed lines) and using the Stoyan s ule of thumb (gay solid lines). 11

these galaxies. Loh and Jang (2010) also used this data set in conjunction with a bootstap bandwidth selection pocedue. Moe detailed desciption of the galaxy catalog, such as the egions in the sky in which they ae located, can be found in Goto et al. (2002); Basilakos and Plionis (2004); Loh and Jang (2010). In this section, we obtain esults fom applying ou A bandwidth selection pocedue to this galaxy catalog. Specifically, we use (4) to obtain adaptive bandwidths fo obtaining the Landy-Szalay estimato of. We consideed distances fom 5 to 100 h 1 Mpc and used values of 20.7 and 1.6 fo s 0 and espectively. These values wee obtained as estimates of s 0 and by Basilakos and Plionis (2004) though fitting a powe-law model fo. Figue 3 shows the bandwidths found fom (4) and fom the bootstap bandwidth pocedue of Loh and Jang (2010), and estimates of using these bandwidths. We find that the adaptive bootstap bandwidths using the method in Loh and Jang (2010) (black dots in Figue 3) ae moe vaiable, but also tend to be lage, poducing much smoothe estimates of. The oveall bootstap bandwidth, epesented in the left plot of Figue 3 by a hoizontal dashed line, is a single value applying to all values of. Its value of 7 is on the lowe end of the ange of the adaptive bootstap bandwidths and poduces a moe jagged cuve fo. The A bandwidths fall between these two. The bandwidths incease as incease, with smalle bandwidths than both the oveall and adaptive bootstap bandwidths fo < 40h 1 Mpc. Fo > 40h 1 Mpc, the A bandwidths become lage than the oveall bootstap bandwidth, having values compaable to those of the adaptive bootstap bandwidths. This behavio in the values of the A bandwidths is eflected in the esulting estimate of, shown on the log scale in the ight-hand plot of Figue 3. The estimate is slightly moe jagged fo <40h 1 Mpc, but still compaable with the estimate obtained with the oveall bootstap bandwidth. Its smoothness fo > 40h 1 Mpc lies between the two bootstap bandwidth vesions, close to the estimate obtained using the adaptive bootstap bandwidths. Hence, we find that the pocedue pefoms easonably well. Recall that the A bandwidths ae obtained using the closed-fom expession in (4) and thus is vey easy to obtain, unlike the bootstap optimal bandwidths which equie a moe computationally expensive bootstap pocedue. 5 Conclusion We intoduced the use of the A appoach, common in density estimation, as a method fo obtaining optimal bandwidths fo estimating the 2PCF, which is widely used in astonomy. Loh and Jang (2010) intoduced an optimal bandwidth selection method that uses bootstap. Howeve, this latte method is computationally intensive, and with the vey lage data sizes now common in astonomy, a quick way to obtain optimal bandwidths fo estimating the 2PCF will be vey useful. The A method intoduced hee povides a closed fom solution that is easily computed. It is not much moe complicated than Stoyan s ule of thumb fo the bandwidth, yet ou simulation studies suggest that it can poduce estimates of the 2PCF with substantially smalle mean squae eos. Also, the method natually yields optimal 12

40 30 bw 20 10 0 5 20 40 60 80 100 ξ 10 8 6 4 2 0 A bw adaptive bootstap bw oveall bootstap bw 5 10 20 50 100 Figue 3: Plots showing, on the left, the oveall (dashed line) and adaptive bandwidths (clea dots) obtained by a bootstap bandwidth pocedue (Loh and Jang, 2010) and adaptive bandwidths using A (solid dots), and, on the ight, estimates of the two-point coelation function using these bandwidths, on the log scale. adaptive bandwidths, i.e. optimal bandwidths ae obtained fo each distance of inteest. This can be useful in astonomy applications since the ange of distances consideed ae often vey lage. Refeences Baddeley, A., Moyeed, R., Howad, C. and Boyde, A. (1993). Analysis of a thee-dimensional point patten with eplication. Jounal of the Royal Statistical Society, Seies C, 42, 641 668. Basilakos, S. and Plionis, M. (2004). Modeling the two-point coelation function of galaxy clustes in the Sloan Digital Sky Suvey. Monthly Notices of the Royal Astonomical Society, 349, 882-888. Goto, T., Sekiguchi, M., Nichol, R.C., Bahcall, N.A., Kim, R.S.J., Annis, J., Ivezic, Z., Binkmann, J., Hennessy, G.S., Szokoly, G.P. and Tucke, D.L. (2002). The cut-an-enhance method: selecting clustes of galaxies fom the Sloan Digital Sky Suvey commissioning data. Astonomical Jounal, 123, 1807 1825. Guan, Y. (2007). A least squaes coss-validation bandwidth selection appoach in pai coelation measues. Statistics and Pobability Lettes, 77, 1722 1729. Illian,J., Penttinen, A., Stoyan, H. and Stoyan, D. (2008). Statistical analysis and modelling of spatial point pattens. New Yok: John Wiley and Sons. 13

Kesche, M., Szapudi, I., and Szalay, A. S. (2000). A compaison of estimatos fo the two point coelation function. Astophysical Jounal Lettes, 535, 13 16. Kim, S., Nowozin, S., Kohli, P., and Yoo, C. D. (2011). Highe-ode coelation clusteing fo image segmentation, In Advances in Neual Infomation Pocessing Systems, 24, 1530 1538. Landy, S. D. and Szalay, A. S. (1993). Bias and vaiance of angula coelation functions. Astophysical Jounal, 412, 64 71. Loh, J. M. and Jang, W. (2008). Estimating a cosmological mass bias paamete with bootstap bandwidth selection. Jounal of the Royal Statistical Society, Seies C, 59, 761 779. Loh, J. K., Quashnock, J. M. and Stein, M. L. (2001). A Measuement of the Theedimensional Clusteing of C IV Absoption-Line Systems on Scales of 51000 h 1 Mpc. Astophysical Jounal, 560, 606 616. Matínez, V. J. and Saa, E. (2001). Statistics of the galaxy distibution. New Yok: Chapman & Hall/CRC Pess. Pons-Bodeia, M.-J., Mainez, V. J., Stoyan, D., Stoyan, H. and Saa, E. (1999). Compaing estimatos of the galaxy coelation function. Astophysical Jounal, 523, 480 491 Ryden, B. (2003). Intoduction to Cosmology. Boston: Addison-Wesley. Silveman, B. W. (1986). Density Estimation fo Statistics and Data Analysis. New Yok: Chapman & Hall/CRC Pess. Stoyan, D. (2006). Fundamentals of point pocess statistics. In Case Studies in Spatial Point Pocess Modeling (eds Baddeley, A., Gegoi, P., Mateu, J., Stoica, R. and Stoyan, D.), 3 22. New Yok: Spinge. Stoyan, D., Kendall, W. S. and Mecke, J. (1995). Stochastic Geomety and Its Applications, 2nd edition, New Yok: Wiley Szapudi, I., Postman, M., Laue, T. R. and Oegele, William (2001). Obsevational constaints on highe ode clusteing up to z ' 1 Astophysical Jounal, 548, 114 126. Thomas, M. (1949). A genealisation of Poisson s binomial limit fo use in ecology. Biometika, 36, 18 25. Yok, D.G., Adelman, J., Andeson, J.E., et al. (2000). The Sloan Digital Sky Suvey: Technical summay. Astonomical Jounal, 120, 1579 1587. 14