On the Computation and Application of Prototype Point Patterns

Size: px

Start display at page:

Download "On the Computation and Application of Prototype Point Patterns"

Henry Goodman
6 years ago
Views:

1 On the Coputation and Application of Prototype Point Patterns Katherine E. Tranbarger Freier 1 and Frederic Paik Schoenberg 2 Abstract This work addresses coputational probles related to the ipleentation of Victor and Purpura s spike- tie distance etric for teporal point process data. Three coputational algoriths are presented that facilitate the calculation of spike-tie distance. In addition, recoendations for penalty paraeters are provided, and several properties and extensions of the spike-tie etric, and of point pattern distance etrics ore generally, are discussed. Applications include the foration of prototype point patterns that can be used for describing a typical point pattern in a collection of point patterns, and various clustering algoriths that can be odified for application to point process data through the use of spiketie distance and prototype patterns. Extensions of these techniques to u ltidiensional p o int p atterns are also addressed. Key words: distance etric, point process, prototype pattern. 1 Babson College Math and Science Division, 231 Forrest Street, Babson Park, MA Departent of Statistics, 8125 Math-Science Building, University of California, Los Angeles,

2 1 INTRODUCTION Teporal point process data arise naturally in the analysis of a wide variety of phenoena, including incidence of disease, sightings or births of a species, firings or neurons, or the occurrences of environental disasters such as earthquakes, tsunais, wildfires, and volcanic eruptions [1-4]. Often one obtains ultiple realizations of a point process, and probles that arise frequently in applications include characterizing particular realizations of the point process as outliers, suarizing a collection of point patterns by indicating the behavior of a typical realization, and organizing groups of point process realizations into clusters of siilar patterns of observations. All of these probles require the definition and coputation of the distance between two point process realizations. For instance, Victor and Purpura [5] analyzed collections of neuronal firings using the spike-tie distance etric between point patterns, and Schoenberg and Tranbarger [6] analyzed the behavior of typical aftershock sequences following a large earthquake using prototypes based on this spike-tie etric. Unfortunately, the spike-tie etric is extreely cubersoe to copute using present ethods, and this renders such applications for collections of large point process datasets infeasible. The central purpose of this paper is to develop coputational algoriths for the efficient coputation of spike-tie distances between teporal point patterns, to enable coputation of useful suaries such as the prototype of a collection of point process realizations as well as clustering, classification, and other applications of point process distance etrics. The spike-tie distance etric proposed by Victor and Purpura [5] involves the realignent, addition, and deletion of points in point pattern X to atch those in point pattern Y. The etric is soewhat analogous to several c l a s s i c a l a n d w e l l - u n d e r s t o o d distance etrics coonly used in coputer iaging and other areas. For instance, Earth Movier s Distance (EMD) introduced by Rubner, Toasi, and Guibas [7-8] evolved fro the Hitchcock solution [9] to the original transportation proble first discussed by Monge in 1781[10]. EMD easures the aount of work required to transfor a histogra of the values in one iage into that of the values in another iage using basic operations [8], and was shown by Levina and Bickel [11] to be equivalent to Mallows

3 distance on probability distributions when the two iage signatures in question are appropriately weighted by their sizes. In contrast to EMD, the spike-tie distance of Victor and Purpura [5] focuses on atching the points of two point processes rather than the suary histogras of an iage process, and thus requires soe fundaentally different algoriths for its coputation. In Section 2, we review soe of the atheatical history behind point pattern distance etrics and explore various properties of the spike-tie etric that c a n aid in its coputation. The priary focus of this section is dedicated to presenting three different algoriths that ay be used to copute spike-tie distance given repeated observations of a teporal point process. Section 3 describes point pattern prototypes along with issues relevant to their coputation and approxiation. Section 4 presents algoriths for clustering collections of observed point patterns based on their spiketie distances and prototype point patterns. A discussion and suggestions for further research are presented in Section 5. 2 SPIKE-TIME DISTANCE Let X and Y be teporal point patterns, i.e. each is a collection of points on the real line, and corresponds to a σ-finite non-negative integer-valued easure on R [1]. (We refer to such a collection of points as a point pattern, as distinguished fro a point process, which is a rando variable whose outcoes are point patterns.) The spike-tie distance between X and Y is defined as the total cost of transforing X into Y using a series of basic o v e e n t, a d d i t i o n, a n d d e l e t i o n operations [5]. Specifically, points fro X can be deleted at a cost p d, added at a cost of p a, or oved a distance at a cost of p. The iniu su of costs for the deletion, addition, and oving operations necessary to transfor point pattern X into point pattern Y is the spike-tie distance between X and Y. Note that in order for this to be a distance etric, it ust be syetric and thus the constraint p a = p d is typically iposed. While in soe applications it ight be desirable for p a and p d to take on different values, this paper focuses on the case where the spike-tie distance is a syetric distance etric.

4 Hence in what follows we assue p a = p d. One algorith for deterining the inial sequence of deletion, addition and oveent operations needed for transforing one point pattern into another is the Single-Unit (SU) algorith discussed in Aronov [12]. We will briefly review the SU algorith based on dynaic prograing algoriths introduced by Sellers [13] and show a odification on this approach that reduces coputation run tie fro O(n 2 ) to O(n 4/3 ). Also presented here is the Mutual Best Match (MBM) algorith that is useful for deterining distances between patterns of shorter lengths and which extends readily to ultiple diensions as discussed in Section Single-Unit algorith In general, the inial sequence of operations required to transfor point pattern X into point pattern Y can be difficult to deterine. In the special case where no points of X or Y are added or deleted, however, the coputation is quite trivial in view of the following result that is critical in the application of all three algoriths discussed. Theore 1. Suppose that teporal point patterns X and Y each consist of exactly n points and that p << p a = p d, so that in effect addition and deletion of points are not peritted. Then d(x,y) = p n x i y i where x 1, x 2,..., x n and y 1, y 2,..., y n are the sorted points of X and Y, respectively. Proof. The stateent is trivial for n = 1. Suppose n = 2, and without loss of generality assue x 1 y 1. If also x 2 y 2, then x 1 y 2 + x 2 y 1 x 1 y 1 + x 2 y 2 = x 1 y 1 + x 2 y 2, since both (x 1 y 1 ) and (x 2 y 2 ) a r e n o n - n e g a t i v e. Alternatively, if x 2 < y 2, then y 2 x 1 y 2 x 2 > 0, since x 1 x 2 < y 2. Siilarly, 0 y 1 x 1 y 1 x 2. Therefore, y 1 x 1 y 1 x 2 and y 2 x 2 y 2 x 1, so y 1 x 1 + y 2 x 2 y 1 x 2 + y 2 x 1. Hence

5 the basic operations that ove the point x 1 to y 2 and ove the point x 2 to y 1 have a total penalty that is at least as large as the penalties for oving x 1 to y 1 and oving x 2 to y 2. Now suppose n = k + 1 and that the result in Theore 1 holds for n = k. Consider any sequence of eleentary operations that includes oving x k+1 to y i and oving x j to y k+1, where i, j < k + 1. By the n = 2 case proven above, x k+1 y i + x j y k+1 x j y i + x k+1 y k+1. That is, the eleentary operations considered have cost at least as large as those obtained by oving x k +1 to y k +1 and oving x j to y i. Hence the inial total cost in aligning X and Y is obtained by the eleentary operations that involve oving x k+1 to y k+1 ; the cost of aligning the other k points of X with the other k points of Y is given by (1) with n = k. The result follows by induction. Using this result, an inductive approach to spike-tie distance calculation is possible. The SU algorith of Aronov [12] deterines the inial distance between i = 1, 2..., n points of pattern X and j = 1, 2..., points of pattern Y by building on the distance result for points i = 1, 2..., n 1 of X and points j = 1, 2..., 1 of Y. Let D i,j denote the spike-tie distance between the first i points of pattern X and first j points of pattern Y. To deterine the inial distance between patterns X and Y siple calculations to coplete an n by atrix of D i,j values are necessary. At each step D i,j is the iniu of D i 1,j + p d, D i,j 1 + p d, and D i 1,j 1 + (p x i y j ) (corresponding to the options of deletion of x i, deletion of y j, and the pairing of x i to y j, respectively). In the trivial cases of i = 0 or j = 0, D i,j = p d ax(i, j). 2.2 Modified Single-Unit algorith In spike-tie distance calculations, no ove greater than (p a + p d )/p ay take place since such a ove would be greater in cost than reoval and re-insertion of one of the points involved. With this fact in ind, the patterns X and Y can each be broken into patterns of shorter lengths whenever a gap of greater than (p a + p d )/p is present. In this odified SU (MSU) approach, coputations involve the sorting of points x 1, x 2,...x n and points y 1, y 2,...y cobined to for one pattern, Z, of

6 length n +. Instances in which z k -z k 1 is greater than (p a + p d )/p are easily identified to deterine how X and Y can be partitioned into patterns with fewer points. For exaple, if such a gap was present at z k =3, pattern X 1 would consist of all points in X less than 3 just as pattern Y 1 would consist of all points in Y less than 3. The SU algorith as earlier described is then applied to each pair of X i, Y i patterns fored in the partitioning process and the final distances for each pair are sued. Advantages of this odification are discussed in g r e a t e r d e t a i l i n Section Mutual Best Match algorith In light of Theore 1, the proble of deterining the best sequence of operations to transfor teporal (i.e. one diensional) point pattern X into point pattern Y reduces to the proble of deterining whether each point in the point patterns will be reoved (i.e. deleted fro one string or equivalently added to the other), or whether it will be kept, i.e. paired to a point in the other string. Once the points to be kept are deterined, the best approach for atching the kept points in X to those in Y is siply sequential, based on Theore 1. Evaluation of each potential set of kept points is thus rearkably straightforward. In fact, each potential distance is siply the nuber of points reoved sued with the integrated difference between the cuulative functions associated with the reaining points in the two point patterns [6]. Given two point patterns whose spike-tie distance is sought, the following results are useful in deterining which points will be kept and which will be reoved. Lea 2. Any point that is ore than (pa + pd )/p = 2pd /p away fro its nearest neighbor in the alternate point pattern will be reoved. Proof. The proof is iediate: for such a point x in X, it is ore costly to ove x to a point y in Y than to delete x and add to X a point at y. Theore 3. Any pair of points x i and y j satisfying the following condition will be kept: x i y i < in{ x k y j } < k i in{ x i y k } < 2p d k j p

7 Proof. There are three possible outcoes for points x i and y j satisfying the condition above. Either both points are reoved, one point is kept while the other is reoved, or both points are kept. We will show that the first two of these outcoes are excluded. Suppose that both points are reoved. Then the distance penalty function includes the penalty 2p d associated with those deletions. Since xi yj < 2p d /p, the penalty could be reduced by keeping both points and oving the to each other, at a cost of xi yj p. Therefore, reoving both x i and y j cannot yield the iniu penalty. Suppose that one point is kept while the other is reoved. Assue without loss of generality that x i is kept while y j is reoved. Then xi is paired with soe point y k such that y k = y j and the spike-tie distance includes both penalty p d for reoval of y j and x i y k p for the ove of point x i to y k. Since x i y k > x i y j the total spiketie distance penalty could be reduced by reoving point y k rather than point y j and oving x i to y j. Therefore, reoving one point while keeping the other cannot yield iniu penalty. The only option that reains is that both x i and y j are kept in the sequence of oves yielding inial total cost. The Mutual Best Match (MBM) algorith, naed for the property shown in Theore 3, uses the three discussed properties of the optial sequence of operations in the spike-tie distance etric to deterine the distance between point patterns X and Y. With these properties, the proble of deterining the distance between point pattern X and point pattern Y is siplified to identifying points known to be kept or deleted, then checking the total cost associated with each possible cobination of potentially kept points. Calculating the total cost associated with each potential cobination of kept points in the two point patterns is not coputationally tie prohibitive, as the sequential ordering prescribed by Theore 1 akes this calculation extreely straight-forward. 2.4 Application and penalty selection

8 It is not uncoon in point process applications to observe a collection of independent point patterns resulting fro a coon point process of interest. An exaple discussed in [6] is the collection of observed aftershock sequences following strong earthquakes in distinct regions, occurring within a fixed period of tie and space of each ain shock. Throughout this work, we refer to such a collection of point processes siply as a point process dataset. The nuber and proxiity of points in each point pattern observed in a point process dataset are iportant to consider when deterining the penalties used in the spike-tie distance etric. When selecting these penalties, it is the ratio of the deletion (addition) penalty p d to p that governs the setting p a = p d = 1 and deterining p by exaining the spread of the points in the data. If p is too large, then little oveent of points will take place in the coputation of distances between point patterns, as the cost of deletion and addition will be less than ost potential oves. Alternatively, if p is too sall, then oves will be ade between points that are not very close at all. With either extree, the resulting spike-tie distance will easure little ore than the su of, or difference between, the nuber of points in the two observed point patterns. Specifically, the spike-tie distances will approach the su of the two point pattern lengths for very large values of p, and will approach the absolute difference between these lengths for very sall p values. With these extrees in ind, one option is to set p to a value such that points closer than the typical inter-point distance are paired, and points further than this are reoved and t h e n reintroduced in the spike-tie distance calculations. This leads to selecting penalties such that: (T/M )p 2p d, (3) where M is the edian nuber of points per observed point pattern and T is the size of the range over which the points are observed. In siulation studies on point patterns generated fro a Poisson process, the precise value of the penalties selected were shown to have little influence on the overall distribution of pattern distances. Figure 1 shows the results fro calculating the 4,950 possible pair-wise distances between

9 100 siulated Poisson point patterns. The results are shown for three different values of p : the value as deterined by (3), 75% of the penalty recoended in (3), and 125% of the penalty found by (3). In each case, though the ean and standard deviation of the distances has changed (as would be expected), the distribution of distances is approxiately noral for all three. Noral Q-Q plots are provided below each histogra of distances to further illustrate how closely each follows the noral distribution. Since spike-tie distance is useful only for coparisons rather than as a raw easureent tool (where specific distance values have eaning out of the coparison context) the fact that the distribution of distances reains virtually unchanged in the face of rather substantial alterations to the penalty values suggests that investigations of differences and siilarities of point patterns within a dataset ay not be overly sensitive to the choice of penalty values, provided the penalties are within soe reasonable range, at least in so far as the shape of the distribution of the distances is concerned. Calculation tie is of course a key consideration in selection of which algorith to use with a particular data set. The ain advantage of the MSU approach is that it will often require significantly less tie to calculate the spike-tie distance of interest. This advantage over the standard SU algorith can be clearly seen in Figure 2. Here, for lengths n = 2, 3, 4..., 29, 30, 40, 50, 75, 100, 150, 200, distances were calculated between 1000 pairs of two randoly generated patterns of n points each using both the SU and MSU algoriths prograed in R. For the odified version, additional results for n = 300, 400, 500 are also shown. The patterns used t o c r e a t e F i g u r e 2 were generated by taking a rando saple of size n fro a unifor[0,10] distribution. Points in black are the resulting ean CPU ties using the standard SU algorith. The black line included is the best fit least-squared odel using n 2 as the only predictor, deonstrating the O(n 2 ) run tie achieved by the SU algorith. Points in red are fit with a linear odel and are the result fro using the MSU algorith. Though the MSU algorith should run O(n 4/3 ) due to the sorting calculation [14], up to n = 500 a linear odel fits the siulation results quite well as the sorting represents such a sall part of all coputations involved. In all cases, the penalties were fit as recoended above in (3). The ease of atrix anipulation in R is

10 the ain reason R was selected for ipleentation of the SU and MSU algoriths. The run tie for calculation of the spike-tie distance using the MBM algorith is approxiately O(2 n 1 +n 2), where n1 and n 2 are the lengths of the two point patterns. This is because there are on the order of O(2 n 1 +n 2) cobinations of points which ight be kept, and each of these is exained individually. Green points in Figure 3 illustrate this for the case n = n 1 = n 2, in which case the run tie increases approxiately as O(4 n ). In Figure 3, tiing siulations using the MBM approach prograed in C are included with a subset of the sae results shown in Figure 2. The overlaid green line corresponds to the best 4 n fit. C was selected for the MBM algorith because of the looping that takes place in the MBM distance calculation. All of these coputations were perfored using a 1.33GHz PowerPC G4. 3 PROTOTYPE POINT PATTERNS With a distance etric clearly defined, it becoes possible to identify a prototype point pattern that can be used for describing a typical observation within a point process dataset. We define this prototype to be the point pattern Y such that the su n d(x i,y) (4) is iniized, where X i, i = 1...n, are the n observed point patterns in the dataset. 3.1 Basic properties of prototype points A convenient feature aking prototypes easy to identify is described in Theore 4. Before stating this stating a proving this theore, we first turn to the definition of the edian of a sorted list of nubers z = {z 1, z 2,..., z }, whose length is even. Many texts define the edian of such a list as the ean of the two entries z /2 and z /2+1. Instead, let us refer to any value M such that z /2 M z /2+1 as a edian of z. With this convention in ind, we return to the proble of deterining prototypes. Suppose that Y is the prototype of a point process dataset consisting of n point patterns X 1,..., X n. For any point p in the prototype, consider the collection of points in the point process dataset z p = {z 1, z 2,..., z } to which

11 p is paired. That is, each eleent of z p is associated with a different X i point pattern within the coplete dataset and specifically, each point z i is the point to which point p is oved to when the spike-tie distance between Y and X i is deterined. Note that n, since p ight not be kept in the spike-tie distance between the prototype and soe of the point patterns in the dataset. Theore 4. Any point p in the prototype is a edian of z p. Proof. Fix any prototype point p and the list z p = {z 1,..., z } of points in the dataset to which p is paired. Note that, in order for p to be a point of the prototype, the su of distances fro z i to p ust be less than or equal to the su of distances fro z i to any other point q. Let q be a edian of z p, and suppose that p is not a edian of z p. We will show that the su of distances fro z i to q is less than the su of distances fro fro z i to p, contradicting the assuption that p is a point of the prototype. First suppose that the length of z p is odd, and without loss of generality assue p < q. The su of all distances fro z i to q is: ( 1)/ 2 = z i q + z i q i=( +3)/ 2 z i q = z i z ( / 2) + q z / 2 (5) since z (+1)/2 q = 0. The su of the distances fro z i to p is: z i p = ( z i q (q p) ) + z i q + (q p) ( 1)/ 2 i=( +3)/ 2 = z i q 1 ( q p) + 1 q p 2 2 = z i q + ( z ( +1)/ 2 p) ( ) + ( z ( +1)/ 2 p) ( ) ( ) + z ( +1)/ 2 p Therefore, the su of distances fro z i to p in (6) is greater than the su of distances to q, which is a contradiction. If is even, then without loss of generality assue that p < z /2. The su of all distances fro z i to q is: ( / 2) 1 + z i z ( / 2)+1 + z ( / 2)+1 q ( / 2)+1 ( ( )) + z i q ( ( )) i=( / 2)+2 i=( / 2) ( ) (6)

12 ( / 2) 1 ( ) + z ( / 2)+1 z / 2 = z i z ( / 2) + z 2 ( / 2)+1 z / 2 ( ) and the su of distances fro z to p is: ( / 2) 1 ( ) ( ) + z i z ( / 2)+1 z i p = z i z ( / 2) z / 2 p + z i z ( / 2)+1 + z ( / 2)+1 p i=( / 2)+2 ( / 2)+1 ( ( )) + z i p ( ( )) i=( / 2)+2 ( / 2) 1 ( ) + z ( / 2)+1 z / 2 = z i z ( / 2) ( ) z i z ( / 2)+1 i=( / 2)+2 ( ) + z / 2 p i=( / 2) ( ) z ( / 2)+1 z / 2 ( ) ( ) (7) = z i q + ( z / 2 p) which again contradicts the assuption that p is a point of the prototype. (8) A consequence of Theore 4 is that, for any point process dataset, there exists a prototype coprised entirely of points observed in the dataset. That is, in searching for a prototype, one ay liit one s search to all possible cobinations of entries in the dataset. 3.2 Prototype deterination In addition to the result shown in Theore 4, note that each point of the prototype ust be within (p a + p d )/p of a fraction p a /(p a + p d ) of the dataset s point patterns, as discussed in [6]. Let n be the nuber of point patterns in the dataset. Then each point z for which at least np a /[2(p a + p d )] other points fro distinct point patterns in the dataset are in the range [z (p a + p d )/p, z] and at least np a /[2(p a + p d )] other points fro distinct point patterns in the dataset fall in the range [z, z + (p a + p d )/p ] is a candidate for inclusion in the prototype. For the case where p a = p d, this is siply n/4 points fro distinct point patterns in the dataset occurring on either side of z, within a distance 2 p d /p. In practice, this observation will likely significantly decrease the nuber of candidates to be considered as potential points in the prototype.

13 3.2.1 Direct algoriths for prototype deterination With a finite nuber of candidate prototype points (n cp ) to consider, one ay copute the su in equation (4) for each possible prototype, i.e. each possible collection of points that are potentially in the prototype. For sufficiently sall oveent penalties, the prototype will generally have length equal to the edian length M of the n point patterns, so one ay liit one s search the n cp M collections of length M as candidates for the prototype. Depending on the size of the dataset of interest, this ay be a reasonable task to undertake. Alternatively, if p is not sall enough to allow for a prototype of length M, then the best prototype of length k can be iteratively sought for k = 1, 2,..., M. In this case, the prototype ay be found by considering the optial prototype candidate of length k = 1 and progressively increasing k, ending the search once the su in (4) for length k + 1 is greater than the su (4) for the best prototype of length k Forward and reverse stepwise approxiation Finding the prototype of a set of point patterns as prescribed in Section ay prove to be prohibitively tedious for datasets with a large nuber of point patterns, or for point patterns with a large edian nuber of points. In this case, stepwise ethods can be ipleented to attain an approxiate prototype solution. A reverse stepwise approach can be ipleented by first finding the optial prototype of edian length M. Possible prototypes of shorter length can then be found by eliinating points fro the length M prototype one at a tie. This strategy continues until the optial prototype of length k 1 has a larger su of distances (4) than the best prototype candidate of length k. While the reverse stepwise approach will entail fewer coputations than finding the prototype directly, even the reverse stepwise approxiation can prove prohibitively cubersoe due to the long length of the prototype coputed in the first stage. An alternative is to use a forward stepwise approach, which can further reduce coputation tie by beginning with a short prototype candidate and

14 progressively expanding it by one point at each step. In each iteration, the su of pattern-prototype distances (4) for the prototype candidate of length k is copared with the su for the prototype candidate of length k 1. Figure 4 illustrates the speed of the forward stepwise prototype approxiation approach using the MBM algorith for distance calculations, again using a 1.33GHz PowerPC G4 processor. Here, N = 5, 10,..., 50 Poisson processes of ean length n = 1, 2,...10 were siulated and the prototypes of the N patterns were deterined using the forward stepwise algorith. Mean CPU ties were recorded after 100 repetitions at each level of N and n. As seen in Figure 4, the nuber of point patterns in a dataset and the average length of those patterns have quite different effects on the calculation tie required for prototype deterination. While the large perspective plot of Figure 4 shows these two factors together, the two saller plots to the right illustrate the factors effects on tie separately by averaging the results first over the N levels of nuber of point patterns and secondly for the n levels of pattern ean length. These arginal plots show that while the ean pattern length has a relationship to run tie very siilar to that seen in Figure 3 for the tie required for MBM distance calculations, the relationship between the nuber of point patterns and run tie is nearly linear. This result is not surprising in light of the fact that collections of longer point patterns will have a prototype longer in length than collections of shorter point patterns. Longer prototypes require additional steps in the stepwise algorith and the CPU tie required for each step increases rapidly as the algorith continues. 3.3 Penalty selection considerations Penalties selected for p, p a, and p d play a significant role in prototype deterination. As stated in section 3.2, a point at tie t can only be part of the prototype if at least p a /(p a + p d ) of the point patterns in the dataset contain a point within the interval [t 2p a /p, t + 2p a /p ]. Selection of oving penalties that are too large will lead to prototypes that see unusually short, since the range t ± 2p a /p will be quite sall. When the priary purpose of the prototype pattern is to serve as an exaple of a typical point pattern, it is useful to set p to be quite sall copared to what ight be advisable for distance calculations. Setting p to a sall positive value when p a = p d enables the prototype to take on the edian nuber of points in the dataset. Because each prototype point ust be atched to points within at least p a

15 /(p a + p d ) of the observed point patterns, it is typically not possible to achieve a prototype of greater than edian length while aintaining the p a = p d restriction. Figure 5 illustrates how the length of the prototype is related to the relative size of the selected p and p a = p d values. For the first plot on the left, prototypes were calculated for 100 siulated datasets, each consisting of 20 stationary Poisson processes of rate 1 on [0, T ], using the forward stepwise algorith and 100 different p values ranging fro to 2.0. In all cases, p a = p d = 1. The ean prototype length for each penalty value is seen along the y-axis and is plotted against the corresponding oveent penalty value. As expected, saller values of p correspond to longer prototype lengths, with a leveling off across very sall p values where the axiu prototype length (the edian) has been achieved. The plot on the right of Figure 5 shows the 100 prototypes calculated for one randoly selected siulated dataset used for the illustration on the left. Of particular interest is the observation that while the nuber of points in the prototype decreases with increases in p, the actual location of any of the prototype points reains essentially constant across a wide range of penalty values. This suggests that for any particular dataset, the locations of the points in the prototype will be rather robust to choices of p and p a = p d. Sall changes in the ratio p /p a appear to have surprisingly little effect on either the prototype pattern or the overall distribution of distances as seen earlier in Figure 1. 4 CLASSIFICATION THROUGH CLUSTERING Using the spike-tie distance etric and prototype, various clustering algoriths established for standard ultivariate data can be odified for use in point process applications. This section describes how three such ethods can be adapted and discusses issues in the selection of addition, deletion, and oveent penalties when these clustering algoriths are used. 4.1 HMEANS, KMEANS and Aggloerative clustering HMEANS and KMEANS clustering are two closely related iterative techniques useful for clustering ulti-variate data. Both HMEANS and KMEANS clustering begin by randoly assigning each observation to one of c clusters and finding the center of each cluster. While cluster centers are conventionally defined as centroids in standard HMEANS and KMEANS clustering of ultivariate data

16 [15], one ay instead use the prototype as the definition of the center of a cluster of point patterns when analyzing a point process dataset. A third option is an aggloerative approach that assigns each observed point pattern to its own cluster and then systeatically cobines clusters. This approach is especially useful in cases where HMEANS and KMEANS fail to converge after nuerous iterations. With each observed point pattern assigned to a cluster of its own, each cluster s prototype is siply the eber pattern. The pair-wise distances between each cluster prototype can be found and the nearest two clusters joined to for one larger two-pattern cluster. Fro here, the new cluster prototype is deterined, and the new pair-wise distances between prototypes are calculated. The process repeats until only c desired clusters reain. If the aount of data akes coputing all pair-wise distances unfeasible, a siilar approach is to consider pair-wise distances for only one point pattern at a tie in either a randoly assigned order, or in order of length fro the longest pattern to the shortest. Our investigations suggest that this procedure of ordering the point patterns fro longest to shortest and then progressively erging the point patterns iniizes probles that can occur if the p a, p d, and p penalties are set such that pattern length overshadows other features in the distance easures. 4.2 Penalty choice considerations While setting p to a sall value is useful in deterining prototypes, very sall p values are not desirable in clustering applications. As discussed in Section 2.4, sall penalties for oveent of points will lead to distances that approach the absolute difference in pattern lengths, while large oveent penalties lead to distance calculations that approach the su of point pattern lengths. In clustering, either extree will lead to clusters assigned by grouping patterns of siilar lengths rather than patterns with siilarly placed points. For the siulation presented in Figure 6, ten point patterns were created for each of two different ethods. Ten realizations of a rate 1 Poisson process on [0,6], and another ten point patterns were each created by sapling n points independently fro a noral rando variable with ean 3.0 and standard deviation 0.25, where n is a Poisson rando variable with ean 6.0. Patterns of length zero were excluded. Plotted in Figure 6 is the ean failure rate for HMEANS in correctly identifying the two different types of patterns over 125 siulations. The ean rate of failure is shown for twenty different p values ranging fro 5% to 200% of the penalty recoended in (3) for distance calculations. It is clear in

17 these results that extreely sall values of p are less successful at clustering than values siilar to that advised in (3) for distance easureents. 5 MULTI-DIMENSIONAL EXTENSIONS AND DISCUSSION The spike-tie distance etric as exained in this work, and as originally proposed by Victor and Purpura [5], has thus far been applied priarily to teporal point patterns as defined in Section 2. This is just the beginning of ways prototypes and the spike-tie etric can be used though as the spike-tie etric, the related prototype pattern technique, and related clustering algoriths can all be extended to point process data with points occurring in R d. In the extension of the definition of the spike-tie distance etric to R d, points can be added with penalty p a, deleted with penalty p d, or oved along the i th axis a distance of at a cost of p (i). Moving penalties p (1), p (2) (d ),...p ay be set independently for oveent along each of the d axes as in (3) for distance calculations and clustering application settings. As with the one-diensional case, saller values for oveent penalties are again useful for prototype deterination to enable longer length prototype point patterns. While the ratio of p a and p d to oving penalty p was of priary iportance for teporal point process work, the relative values of the d oveent penalties ust also be considered when ultiple diensions are involved. These ratios ust be considered so as to avoid (or allow) inter-point distances along one or ore diensions being ore heavily weighted in distance calculations. As ight be expected, spike-tie distance calculations are far ore cubersoe in R d, as the result of Theore 1 does not hold in ore than one diension. Without the result of Theore 1, neither the SU nor MSU approaches can be ipleented. Also, using MBM, ultiple pairing arrangeents ust be considered for each possible set of kept points. The process to deterine which points will be kept and which will be reoved under the MBM algorith reains unchanged as Lea 2 and Theore 3 extend iediately to ultiple diensions. For prototype pattern deterination, a odified version of Theore 4 applies to the ultidiensional setting. With ore than one diension, rather than each prototype point p being a edian of the points (z 1, z 2,..., z ) it is paired with, each coordinate of prototype point p will be a edian of the

18 corresponding coordinate of the points (z 1, z 2,..., z ). Therefore, while the prototype ay not contain points in the dataset, there exists a prototype ade entirely of points such that each coordinate of each prototype point is a coordinate of one of the points observed in the dataset. Clustering algoriths discussed in Section 4 can be applied to ulti-diensional point pattern data without odification. For soe datasets, the coputation tie involved in prototype deterination ipedes the use of the clustering algoriths as described in Section 4 and slight odifications can be ade, such as considering only one of the d diensions at a tie in each step of the prototype and/or distance calculation. Such a onothetic approach is used in [6] for clustering earthquake aftershock activity considering the tie, agnitude, and location of each aftershock one at a tie. While the spike-tie distance etric is only one of countless distance etrics for point pattern data [5], the concept of a prototype sequence is one that exists independently of distance etric specifics. Therefore, though any of the results presented here apply solely to the spike-tie etric, the approaches to deterining prototypes and clusters derived fro the ability to easure distances between point patterns should reain useful in conjunction with other point pattern distance etrics.

19 6 FIGURES

20 Figure 2

21 Figure 3

22 Figure 4

23 Figure 5

24 Figure 6

25 7 ACKNOWLEDGEMENTS We thank Chuck Woody for introducing us to the work of Victor and Purpura, and we thank Pepe Segundo for introducing us to Chuck Woody. This aterial is based upon work supported by the National Science Foundation under Grant No Any opinions, findings, and conclusions or recoendations expressed in this aterial are those of the authors and do not necessarily reflect the views of the National Science Foundation. This research was also supported by the Southern California Earthquake Center. SCEC is funded by NSF Cooperative Agreeent EAR and USGS Cooperative Agreeent 07HQAG0008. The SCEC contribution nuber for this paper is 1283.

26 8 REFERENCES [1] D. Daley and D. Vere-Jones, An Introduction to the Theory of Point Processes, Volue 1: Eleentary Theory and Methods, 2nd ed. Springer-Verlag, New York, [2] F.P. Schoenberg, D.R. Brillinger, and P.M. Guttorp, Point processes, spatial-teporal in Encyclopedia of Environetrics, Abdel El-Shaarawi and Walter Piegorsch, editors. Wiley, NY, vol. 3, pp , [3] P. Guttorp and M. Thopson, Nonparaetric estiation of intensities for sapled counting processes, Journal of the Royal Statistical Society, Series B 52, , [4] D.L. Synder and M.I. Miller, Rando Point Processes in Tie and Space, Wiley, New York, [5] J. Victor and K. Purpura, Metric-space analysis of spike trains: theory, algoriths and application, Network: Coputation in Neural Systes vol. 8, pp , [6] F.P. Schoenberg, and K.E. Tranbarger, Description of earthquake aftershock sequences using prototype point processes, Environetrics vol. 19, pp , [7] Y. Rubner, C. Toasi, and L. Guibas, A Metric for Distributions with Applications to Iage Databases, in IEEE International Conference on Coputer Vision, pp , [8] Y. Rubner, C. Toasi, and L. Guibas, The Earth Mover s Distance as a Metric for Iage Retrieval, International Journal of Coputer Vision vol. 40(2), pp , [9] F. L. Hitchcock, The distribution of a product fro several sources to nuerous localities, Journal of Matheatical Physics, vol. 20, pp , [10] G. Monge, Méoire sur la Théorie des Déblais et des Reblais, Histoire de l Acadéie Royale des Sciences, Paris, [11] E. Levina and P.J. Bickel, The Earth Mover s Distance is the Mallows Distance: Soe Insights fro Statistics, in Proceedings of ICCV Vancouver, Canada, 2001, pp [12] D. Aronov, Fast algorith for the etric-space analysis of siultaneous responses of ultiple single neurons Journal of Neuroscience Methods vol. 124, pp , [13] P.H. Sellers, On the theory and coputation of evolutionary distances SIAM J. Appl. Math. vol. 26, pp , [14] R. Sedgewick, A new upper bound for Shell sort, Journal of Algoriths vol. 7, pp , [15] H. Spath, Cluster Analysis Algoriths for Data Reduction and Classification of Objects, E. Horwood, Chichester and Halsted Press, New York, [16] D. Reich, F. Mechler, K. Purpura, and J. Victor, Interspike intervals, receptive fields and inforation encoding in priary visual cortex,journal of Neuroscience vol. 20 (5), pp , 2000.

Clustering. Cluster Analysis of Microarray Data. Microarray Data for Clustering. Data for Clustering

Clustering. Cluster Analysis of Microarray Data. Microarray Data for Clustering. Data for Clustering Clustering Cluster Analysis of Microarray Data 4/3/009 Copyright 009 Dan Nettleton Group obects that are siilar to one another together in a cluster. Separate obects that are dissiilar fro each other into