
WARSAW UNIVERSITY OF TECHNOLOGY
Faculty of Electronics and Information Technology

Ph.D. THESIS

Piotr Lasek, M.Sc.

Efficient Density-Based Clustering

Supervisor
Professor Marzena Kryszkiewicz, Ph.D., D.Sc.

Warsaw, 2011


Abstract

This thesis is concerned with efficient density-based clustering using algorithms such as DBSCAN and NBC, as well as the application of indices and the triangle inequality property in order to make these algorithms faster. A new LVA-Index is proposed, together with methods for building it and for searching for nearest neighbors. The LVA-Index combines some features of the VA-File and the NBC algorithm: it uses the idea of approximation vectors and a layer-based approach for determining nearest neighbors. The characteristic feature of the LVA-Index is that it does not require all cells to be checked in order to determine nearest neighbors. Contrary to the NBC approach, the LVA-Index was adapted to search for nearest neighbors within layers with numbers greater than 1. Another key feature of the LVA-Index is that, while the index is being built, representations of the closest layers containing non-empty cells are stored in each cell. This feature significantly speeds up the search for nearest neighbors, because only the closest layers are scanned. These layers are stored in memory, so they can be accessed very quickly. In this thesis, we also present our proposal of using the triangle inequality property to increase the efficiency of density-based clustering algorithms. We present the results of the experiments we performed to examine the proposed solution with respect to the number of dimensions, the number of data objects and the number of reference points used for determining distances between data points. It was experimentally shown that, compared to density-based clustering algorithms using spatial indices such as R-Tree or VA-File, the algorithms we propose, which use the triangle inequality property, are capable of efficiently clustering data even with a large number of dimensions.


Streszczenie

This thesis is devoted to efficient density-based clustering of data using algorithms such as DBSCAN and NBC, and to the application of indices and the triangle inequality property in order to improve their efficiency. A new LVA index is proposed in the thesis, together with methods for building it and for searching for nearest neighbors with its use. The LVA index combines selected features of the VA-File index and the NBC algorithm, namely it uses the concept of approximation vectors and a layered approach to searching for nearest neighbors. A characteristic feature of the LVA index is that it does not need to access all cells of the space in order to find nearest neighbors. Moreover, in contrast to the solution proposed in the NBC algorithm, the LVA index has been adapted to searching for neighbors in layers with numbers greater than 1. Another important feature of the LVA index is that, while the index is being built, representations of the closest layers containing non-empty cells are stored for each cell. This feature significantly speeds up the search for the nearest neighbors of a given point, because only the closest layers of the cell to which the point belongs are searched. Since the layers are stored in memory, access to them is very fast. The thesis also presents a proposal for using the triangle inequality property to increase the efficiency of density-based data clustering algorithms. Results of experiments are presented whose aim was to examine the efficiency of the new solution depending on the number of data dimensions, the amount of data and the number of reference points used for inferring distances between points on the basis of the triangle inequality property. It was shown experimentally that, compared to density-based clustering algorithms using spatial indices such as the R-Tree or the VA-File, the proposed density-based clustering algorithms that use the triangle inequality property are able to efficiently cluster data even with a large number of dimensions.


To my Parents
Rodzicom


Acknowledgements

I would like to express my deepest gratitude and thanks to Professor Marzena Kryszkiewicz, who has been my supervisor since the beginning of my Ph.D. studies. She provided me with invaluable help and support, and without her guidance, corrections and constant encouragement, this thesis would not have been possible. I am also very grateful to my family, especially my parents Krystyna and Eugeniusz, and my wife Agnieszka, who supported me while I was working on this thesis.


Contents

1 Introduction
  1.1 Clustering
    1.1.1 Taxonomy of Clustering Algorithms
    1.1.2 Stages of Clustering
    1.1.3 Distance Metrics
  1.2 Indices
  1.3 Statement of the Problem and Theses of Dissertation
  1.4 Content Outline
2 DBSCAN and NBC
  2.1 DBSCAN
    2.1.1 Definitions of Terms Used in DBSCAN
    2.1.2 Clustering with DBSCAN
  2.2 NBC
    2.2.1 Definitions of Terms Used in NBC
    2.2.2 The NBC Algorithm
3 VA-File and LVA-Index
  3.1 VA-File
    3.1.1 Definitions of Terms Used in VA-File
    3.1.2 Structure of VA-File
    3.1.3 Simple Search Algorithm
  3.2 LVA-Index
    3.2.1 Definitions of a Cell and a Layer
    3.2.2 Structure of LVA-Index
    3.2.3 Building LVA-Index
    3.2.4 Searching Neighbors Using LVA-Index
4 Clustering Using Triangle Inequality
  4.1 Triangle Inequality in DBSCAN
    4.1.1 Using Triangle Inequality Property to Determine ε-Neighborhood
    4.1.2 Optimizing DBSCAN by Using Triangle Inequality wrt. a Reference Point
    4.1.3 Optimizing DBSCAN by Using Triangle Inequality wrt. Many Reference Points
  4.2 Triangle Inequality in NBC
    4.2.1 Efficient Determination of k-Neighborhoods
    4.2.2 Building k-Neighborhood Index by Using Triangle Inequality
5 Experiments
  5.1 The LVA-Index
    5.1.1 Building LVA-Index
    5.1.2 Searching Neighbors by Means of LVA-Index
    5.1.3 Clustering LVA-Index in NBC
  5.2 TI-DBSCAN
  5.3 TI-NBC
6 Summary and Further Works
Bibliography
A Clustering Framework
B Results of Experiments

1 Introduction

1.1 Clustering

The modern world is full of digital data. The amount of information is overwhelming. Databases, data warehouses and other repositories store an enormous number of records related to crucial human activities such as science, medicine and economics. One of the ways of dealing with such a large amount of information is to classify it or to group it into sets of meaningful categories. This is where clustering algorithms come in useful.

Clustering data into meaningful groups is an important task of both artificial intelligence and data mining. Clustering is considered to be an unsupervised classification of data. The results of the task depend on the algorithm used. A number of clustering algorithms have been proposed in the literature. Some of them are capable of discovering a proper clustering of data only when the number of clusters is known in advance. Other algorithms are capable of discovering clusters of particular shapes only. There are also algorithms that are able to identify noise data.

1.1.1 Taxonomy of Clustering Algorithms

Since there exists a great variety of clustering algorithms, it is difficult to present a single taxonomy for all of them. In the literature, one can find several taxonomies which are based on the following criteria (Jain, Topchy, Law, & Buhmann, 2004):
- the representation of input data,
- the representation of output (e.g., a hierarchy of partitions),
- the probability model,
- the search process,
- the clustering direction.
A sample taxonomy is illustrated in Figure 1.1. Other sample taxonomies can also be found, for example, in (Jain, Topchy, Law, & Buhmann, 2004) and (Jain, Murty, & Flynn, 1999).

In this subchapter, we will focus on types of algorithms which use mainly numerical attributes and which were found to work well for geographical databases. These algorithms can be divided into four groups: hierarchical, partitioning, density-based and grid-based algorithms (Berkhin, 2002). Each group is briefly described below.

Figure 1.1. A sample taxonomy of clustering algorithms (after (Berkhin, 2002)): hierarchical methods (agglomerative and divisive algorithms), partitioning methods (relocation algorithms, probabilistic clustering, k-medoids/k-means methods), density-based algorithms (e.g. NBC, DBSCAN) and grid-based methods

Hierarchical Algorithms

Some examples of hierarchical clustering algorithms are BIRCH (Zhang, Ramakrishnan, & Livny, 1996) and CURE (Guha, Rastogi, & Shim, 1998). The former uses so-called clustering features and a clustering feature tree (CF-tree) to represent clusters. It is quite efficient, but can only find spherical clusters. The latter achieves better clustering quality. To model a cluster and compute distances between clusters, CURE uses so-called representative points. By using these points, CURE is able to discover clusters of any shape.

Hierarchical algorithms produce a dendrogram which represents a nested grouping of objects. Clusters can be obtained by cutting the dendrogram at some level. For example, in Figure 1.2, the dashed line represents the division of the dendrogram, so that three clusters are created (Figure 1.3). If the line were moved down, more but smaller clusters would be created. However, the similarities between points within such clusters would be greater.

Figure 1.2. A dendrogram obtained from the dataset presented in Figure 1.3
Figure 1.3. Points assigned to three clusters
Figure 1.4. Distance in single-link algorithms
Figure 1.5. Distance in complete-link algorithms

Hierarchical algorithms can be divided into three groups:
- single-link hierarchical algorithms,
- complete-link algorithms,
- minimum-variance algorithms.
The most popular are the single-link (Sneath & Sokal, 1973) and complete-link (King, 1967) algorithms. The difference between these two types of algorithms lies in the method of characterizing the similarity between clusters. In the single-link algorithms, the distance between two clusters is the minimum of the distances between all pairs of objects drawn from the two clusters (Figure 1.4). In the complete-link algorithms, the distance between two clusters equals the maximum of all pairwise distances between objects in the clusters (Figure 1.5).
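To make the distinction between the single-link and complete-link inter-cluster distances concrete, the following minimal sketch computes both values for two small point sets. It is an illustrative example only; the point coordinates and function names are assumptions, not material taken from the thesis.

```python
import numpy as np

def pairwise_distances(a, b):
    """All Euclidean distances between points of cluster a and cluster b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def single_link(a, b):
    # single-link: the minimum over all pairwise distances
    return pairwise_distances(a, b).min()

def complete_link(a, b):
    # complete-link: the maximum over all pairwise distances
    return pairwise_distances(a, b).max()

cluster_1 = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])   # hypothetical points
cluster_2 = np.array([[4.0, 4.0], [5.0, 3.5]])

print(single_link(cluster_1, cluster_2))    # distance used by single-link algorithms
print(complete_link(cluster_1, cluster_2))  # distance used by complete-link algorithms
```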

Clusters created by the complete-link algorithm are tightly bound or compact (Baeza-Yates, 1992), in contrast to the chaining effect (the tendency to produce elongated or straggly clusters) observed in single-link algorithms (Nagy, 1968). The goal of the minimum-variance algorithms is to minimize the sum-of-squared-error criterion function. In such algorithms the number of clusters is known in advance; the most famous representative of this group is the k-means algorithm.

Figure 1.6. A sample centroid
Figure 1.7. A sample medoid

Partitioning Algorithms

The family of partitioning clustering algorithms can be divided into k-means algorithms and k-medoids algorithms. In k-means algorithms, each cluster is represented by the gravity center of the cluster, the so-called centroid (Figure 1.6); in k-medoids algorithms, each cluster is represented by its center point belonging to the dataset, the so-called medoid (Figure 1.7). One of the partitioning clustering algorithms is CLARANS (Ng & Han, 2002), which is an improved k-medoids algorithm.

Partitioning clustering algorithms produce a single partition of the data instead of a structure such as the dendrogram created by hierarchical clustering algorithms. A major problem with partitioning algorithms is selecting an appropriate number of output clusters. Some advice on this problem is given in (Dubes R., 1987). The simplest algorithm of this type is the k-means algorithm with a squared-error criterion (McQueen, 1967). However, there exist a number of variants of this algorithm (Anderberg, 1973), (Ball & Hall, 1965), (Diday, 1973), (Symon, 1977).

Figure 1.8. Clustering using MST (edges of maximal length, exceeding a threshold, are removed)

There also exist hierarchical clustering approaches which employ graph theory. The most commonly used algorithm in this group is the algorithm based on the minimal spanning tree (MST) (Zahn, 1971). A sample result of clustering using this algorithm is illustrated in Figure 1.8. The lines (edges) in the figure constitute the minimum spanning tree. The idea behind clustering based on MST is that, in order to separate clusters, edges having lengths greater than a specific threshold are removed. Thus, as a result of removing the edges indicated by arrows in Figure 1.8, three clusters will be created.
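The MST-based idea described above can be sketched in a few lines: build the minimum spanning tree of the complete Euclidean graph, drop edges longer than a threshold, and label the remaining connected components. This is only a hedged illustration using SciPy helpers and made-up data, not the implementation used in the thesis.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_clustering(points, threshold):
    """Cluster points by cutting MST edges longer than `threshold`."""
    dist = squareform(pdist(points))                 # complete Euclidean distance graph
    mst = minimum_spanning_tree(dist).toarray()      # weighted edges of the MST
    mst[mst > threshold] = 0.0                       # remove the long edges
    n_clusters, labels = connected_components(mst, directed=False)
    return n_clusters, labels

# Two well-separated groups plus one far-away point (hypothetical data).
pts = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [10, 0]], dtype=float)
print(mst_clustering(pts, threshold=2.0))            # -> (3, array of cluster labels)
```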

Some papers related both to hierarchical and graph-theoretic clustering, in which the authors express clustering-related notions in terms of graph theory notation, are: (Gower & Ross, 1969), (Gotlieb & Kumar, 1968), (Backer & Hubert, 1976), (Augustson & Minker, 1970), (Raghavan & Yu, 1981), (Ozawa, 1985), (Toussaint, 1980).

Density-Based Algorithms

In contrast to most partitioning methods, density-based algorithms use a density function in order to locate clusters. In these algorithms, clusters are regarded as dense regions separated by noise (regions of low density) or empty space (Han, Kamber, & Tung, 2001). The clusters produced by density-based algorithms can be of arbitrary shape, and these algorithms are capable of finding outliers (noise).

The DBSCAN algorithm (Density-Based Spatial Clustering of Applications with Noise) (Ester, Kriegel, Sander, & Xu, 1996) is recognized as a high-quality, scalable clustering algorithm. It has been proved to show good performance for low-dimensional data. In the DBSCAN algorithm, the density function is defined in such a way that an object must fulfill the following basic criterion in order to be a member of a cluster: a minimum number of objects must be located within the object's neighborhood of a given radius. The drawback of DBSCAN is that it is not capable of distinguishing dense clusters adhering to sparse clusters (Figure 1.9). DBSCAN is described in more detail in the next part of this thesis. There also exists an extension of DBSCAN, OPTICS (Ankerst, Breunig, Kriegel, & Sander, 1999), which addresses the above-mentioned shortcoming of DBSCAN, namely the determination of clusters of varying density.

The Neighborhood-Based Clustering (NBC) algorithm also belongs to the group of density-based clustering algorithms. The characteristic feature of NBC is that it discovers clusters of arbitrary shape and different densities (Figure 1.10), requires fewer input parameters than the existing algorithms and can cluster both large and high-dimensional datasets efficiently. NBC is described in more detail in Subchapter 2.2.

Figure 1.9. A result of clustering of a sample dataset using DBSCAN; only one cluster (0) was found
Figure 1.10. A result of clustering of a sample dataset using NBC; three clusters of different size and density (0, 1, 2) were found

Another example of a density-based algorithm is DENCLUE (Hinneburg & Keim, 2002), which is based on a set of density distribution functions. The basic ideas behind this method are: the impact of a data point on its neighborhood can be described mathematically using a so-called influence function; the sum of the influence functions of all points gives the overall density of the data space; local maxima of the overall density function, which are called density attractors, determine clusters.

Other distinct representative algorithms of this class are: DBCLASD (Xu, Ester, Kriegel, & Sander, 1984), O-Cluster (Milenova & Campus, 2002) and OPTICS (Ankerst, Breunig, Kriegel, & Sander, 1999). The characteristic feature of density-based algorithms is that they distinguish between areas of high and low density. An area is said to be of high density if it contains a large number of data points per area unit; otherwise, it is of low density. Under this understanding of space, a cluster is an area whose density exceeds a required threshold value or is greater than the density of the enclosing space. The areas that do not constitute clusters are considered to be noise.

Grid-Based Algorithms

The typical grid-based algorithms are: STING (Wang, Yang, & Muntz, 1997), which divides the data space into rectangular cells using a hierarchical structure, WaveCluster (Sheikholeslami, Chatterjee, & Zhang, 2000), which employs the wavelet transform, and CLIQUE (Agrawal, Gehrke, Gunopulos, & Raghavan, 2005), which represents both a grid- and density-based approach. Grid-based approaches are more efficient than density-based ones in high-dimensional space. High efficiency is achieved in grid-based algorithms by using a grid data structure and quantizing the data space into a specific number of cells forming a grid. When data is organized in this way, processing is fast and its time depends on the number of cells in each dimension.

The Statistical Information Grid (STING) approach uses a multiresolution approach to analyzing clusters and, therefore, its quality depends on the lowest level of the grid structure. Despite low processing time, STING produces clusters having horizontal or vertical boundaries only, which may negatively affect the quality and accuracy of clustering. The computational complexity of STING is O(n), where n is the number of points in the processed dataset. However, the quality of clustering is rather low, as the relationship between neighboring cells is not taken into account.
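The quantization step shared by grid-based methods can be illustrated with a short sketch that maps each point to an integer cell index per dimension. This is a generic, hedged example (equal-width cells, made-up data and function names), not the STING or WaveCluster implementation.

```python
import numpy as np

def assign_to_grid(points, cells_per_dim):
    """Quantize each point into an equal-width grid cell (one integer index per dimension)."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    width = np.where(hi > lo, (hi - lo) / cells_per_dim, 1.0)   # guard against a degenerate dimension
    idx = np.floor((points - lo) / width).astype(int)
    return np.clip(idx, 0, cells_per_dim - 1)    # points on the upper boundary go to the last cell

pts = np.random.default_rng(0).uniform(0, 8, size=(10, 2))      # hypothetical 2-D data
print(assign_to_grid(pts, cells_per_dim=4))      # grid coordinates; equal rows share a grid cell
```

Points sharing the same pair of indices fall into the same cell, so the subsequent clustering work depends on the number of cells rather than directly on the number of points.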

WaveCluster first summarizes the data by imposing a multidimensional grid structure onto the data space; then a wavelet transformation is used to transform the feature space, and dense regions are found in the transformed space. The wavelet transformation is very useful, because outliers can be automatically removed and the clusters automatically determined. Wavelet-based clustering is fast, works well with large datasets, discovers clusters of arbitrary shape and is able to handle data with up to 20 dimensions (Murtagh & Berry, 2000).

CLIQUE is capable of discovering clusters in subspaces of the data. Hence, it is useful for sparse datasets. It is both a density- and grid-based algorithm. It works by moving from a lower- to a higher-dimensional data space. In other words, in the k-dimensional space (e.g. when searching for dense units which are used to define clusters) it uses information retrieved from clustering in the (k-1)-dimensional space. Such an approach is very similar to the Apriori property (Agrawal & Srikant, 1994), which in CLIQUE states: if a k-dimensional unit is dense, then so are its projections in the (k-1)-dimensional space. CLIQUE is not sensitive to the order of input data; it scales linearly and has good scalability.

Other Types of Clustering Algorithms

As opposed to traditional methods of clustering, which partition the data space into clusters so that every object belongs to exactly one cluster, there also exist algorithms which do not fulfil this assumption. In fuzzy clustering algorithms, objects are associated with clusters using a membership function (Zadeh, 1965). The theory of fuzzy sets was first applied to clustering in 1969 by Ruspini (Ruspini, 1969). The most widely implemented fuzzy clustering algorithm is FCM (the fuzzy c-means); its generalization was proposed in (Bezdek, 1981), which is also a good source of knowledge on fuzzy clustering. Another fuzzy approach to clustering, the c-shell algorithm, was proposed in 1992 by Dave (Dave, 1992). A sample result of using a fuzzy clustering algorithm is presented in Figure 1.11.

Artificial neural networks (ANNs) (Hertz, Krogh, & Palmer, 1991) have also been applied in clustering and classification algorithms (Seti & Jain, 1991), (Jain & Mao, 1994). The following features of ANNs are important in the context of clustering:
- processed data vectors must be numerical,
- artificial neural networks are parallel architectures,
- weights in ANNs can be learned adaptively (Jain & Mao, 1996), (Oja, 1982).
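The membership-function idea behind fuzzy clustering can be illustrated with the standard fuzzy c-means membership formula, where each point gets a degree of membership in every cluster. This is a generic sketch of FCM (the fuzzifier m, the cluster centers and the data are assumed for illustration), not code from the thesis.

```python
import numpy as np

def fcm_memberships(points, centers, m=2.0):
    """Standard fuzzy c-means membership degrees u[i, j] of point i in cluster j."""
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
    d = np.fmax(d, 1e-12)                        # avoid division by zero for points lying on a center
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)               # each row sums to 1: degrees of membership

pts = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])     # hypothetical points
ctrs = np.array([[0.5, 0.0], [5.0, 5.0]])                # hypothetical cluster centers
print(fcm_memberships(pts, ctrs))
```

A point close to one center receives a membership near 1 for that cluster and near 0 for the others, while a point between centers is shared between clusters, which is exactly the departure from hard partitioning described above.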

Figure 1.11. The result of clustering using a fuzzy clustering algorithm

Approaches such as competitive neural networks (Jain & Mao, 1996), Kohonen's learning vector quantization (LVQ) (Kohonen, 1989) and the self-organizing map (SOM) (Carpenter & Grossberg, 1990) have been applied to clustering, for example in (Pal, Bezdek, & Tsao, 1993) and (Moor, 1988). As well as fuzzy and neural network approaches, evolutionary techniques such as genetic algorithms (GAs) (Holland, 1975), (Goldberg, 1989), (Raghavan & Birchand, 1979), evolution strategies (ESs) (Shwefel, 1981) and evolutionary programming (EP) (Fogel, Owens, & Walsh, 1965), (Fogel, Fogel, & Eds., 1994) have been applied to clustering.

1.1.2 Stages of Clustering

The following steps are typical in clustering (Jain & Dubes, 1988):
1. Extracting data features
2. Selecting a distance metric
3. Clustering
4. Describing clusters (if needed)
5. Assessing the output (if needed)

In the first step, the input dataset is preprocessed so that only the desired features are taken for further analysis. Next, the distance function has to be chosen. Usually the Euclidean distance is selected. However, a variety of different distance measures are in use in various areas (Anderberg, 1973), (Jain & Dubes, 1988), (Diday & Simon, 1976). Some widely used distance functions are described further in this section.

The results of the clustering step can vary significantly between clustering algorithms. The results can be exact or fuzzy. Nested series of partitions are produced by hierarchical clustering algorithms, whereas partitional clustering algorithms identify clusters that optimize a certain clustering criterion. Other techniques include probabilistic (Brailovsky, 1991) and graph-theoretic (Zahn, 1971) clustering methods.

After the data is clustered, a compact description can be added to each cluster. Such a description is generated in terms of cluster representative patterns, e.g. the centroid (Diday & Simon, 1976). In other words, generating a description of a cluster is the process of extracting a simple representation of a dataset so that it can be analyzed by a computer efficiently or represented in an intuitive way that is easy for a human to comprehend.

1.1.3 Distance Metrics

Since there exists a variety of different types of data, a number of distance measures have been introduced. The most commonly used is the Euclidean distance, which is defined by the following equation:

dist(x, y) = (Σj (xj - yj)²)^(1/2),

where x and y are data points, xj and yj are their coordinates in the j-th dimension, and the sum runs over all d dimensions. The Euclidean distance is a special case of the Minkowski metric. The Minkowski metric is defined as follows:

dist(x, y) = (Σj |xj - yj|^p)^(1/p),

where p > 0. The Minkowski metric is typically used with values of p equal to 1 or 2. The advantage of the Euclidean distance is that it is intuitive and works well for compact or isolated clusters (Mao & Jain, 1996). A drawback of the Minkowski metric is that the features of the largest scale tend to dominate the others. This can be solved, for example, by using the squared Mahalanobis distance:

dist(x, y) = (x - y)ᵀ Σ⁻¹ (x - y),

where Σ is the covariance matrix.

When computing distances between objects with non-continuous attributes, the situation is more complicated, since it is not an easy task to compare different types of features. However, several solutions have been proposed, for example: mapping nominal attributes into binary features (Kaufman & Rousseeuw, 1990) or using the matching criterion (Everitt, 1993), using continuous dissimilarity measures (Kaufman & Rousseeuw, 1990), and using an edit distance for alphabetic sequences (Gusfield, 1997). Also, for objects represented by strings (Fu & Lu, 1977) or tree structures (Knuth, 1973), appropriate solutions have been introduced. Similarity measures that can be used for strings or trees were described in (Baeza-Yates, 1992) and (Zhang K., 1995), respectively. Syntactic methods of clustering in which strings are used were presented in (Tanaka, 1995).

There are some interesting distance measures which take into account neighboring points (Gowda & Krishna, 1977), (Michalski, Stepp, & Diday, 1983). A measure that uses this concept is the mutual neighbor distance (MND) (Gowda & Krishna, 1977). MND is given by:

MND(x, y) = NN(x, y) + NN(y, x),

where NN(x, y) denotes which neighbor of x the point y is. For example, NN(x, y) returns 1 if y is the first (closest) neighbor of x. Although MND is not a metric, it has been applied in clustering applications with success (Gowda & Diday, 1992).
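The following sketch shows the Minkowski, squared Mahalanobis and mutual neighbor distances on a small made-up dataset. The function names, data and the brute-force neighbor ranking are assumptions made for illustration only.

```python
import numpy as np

def minkowski(x, y, p=2.0):
    """Minkowski distance; p=2 gives the Euclidean distance, p=1 the city-block distance."""
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

def mahalanobis_sq(x, y, cov):
    """Squared Mahalanobis distance with covariance matrix `cov`."""
    diff = x - y
    return float(diff @ np.linalg.inv(cov) @ diff)

def neighbor_rank(data, i, j):
    """NN(x_i, x_j): position of point j when the other points are sorted by distance from point i."""
    d = np.linalg.norm(data - data[i], axis=1)
    order = [idx for idx in np.argsort(d) if idx != i]   # closest first, the point itself excluded
    return order.index(j) + 1

def mnd(data, i, j):
    """Mutual neighbor distance: NN(x_i, x_j) + NN(x_j, x_i)."""
    return neighbor_rank(data, i, j) + neighbor_rank(data, j, i)

data = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [3.5, 0.5]])   # hypothetical points
print(minkowski(data[0], data[2]), minkowski(data[0], data[2], p=1))
print(mahalanobis_sq(data[0], data[2], np.cov(data.T)))
print(mnd(data, 0, 1), mnd(data, 2, 3))
```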

Some other widely known distance metrics are presented in Table 1.1.

Table 1.1. Some distance metrics (based on (Xu & Wunsch, 2005))

Distance | Formula dist(x, y) | Description
Minkowski | (Σj |xj - yj|^p)^(1/p) | Invariant to translations and rotations only for p = 2.
Euclidean | (Σj (xj - yj)²)^(1/2) | Most commonly used. Case of the Minkowski metric for p = 2.
City-block (also called Manhattan) | Σj |xj - yj| | Case of the Minkowski metric for p = 1.
Sup | maxj |xj - yj| | Case of the Minkowski metric for p → ∞.
Mahalanobis | (x - y)ᵀ Σ⁻¹ (x - y) | Σ is a covariance matrix calculated based on all objects. For non-correlated objects, it is equivalent to the squared Euclidean distance.
Cosine | cos θ = x·y / (‖x‖ ‖y‖) | Commonly used in text document clustering. Independent of vector length and invariant to rotation. θ is the angle between x and y.

In Figure 1.12, we present a graphical comparison of the Euclidean and Manhattan distances. The Euclidean distance (route 1) between the two points (black dots) is approximately equal to 7.1, whereas a Manhattan distance of length 10 is represented in Figure 1.12 by routes 2 and 3. In Figure 1.13, the cosine metric is used to determine distances between points. In this case, since cos α0 > cos α1, we can conclude that point p0 is more similar to p1 than to p2. The cosine metric is often used in text mining to compare documents.
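A few lines of code reproduce the kind of comparison made above. The endpoints are assumed (chosen so that the displacement is (5, 5), which yields a Euclidean distance of about 7.1 and a Manhattan distance of 10, matching the values quoted), and the three vectors used for the cosine comparison are likewise hypothetical.

```python
import numpy as np

a = np.array([0.0, 0.0])          # assumed endpoints with a displacement of (5, 5)
b = np.array([5.0, 5.0])

euclidean = np.linalg.norm(b - a)             # ~7.1 (the straight route)
manhattan = np.abs(b - a).sum()               # 10.0 (routes made of axis-parallel segments)

p0 = np.array([2.0, 1.0])                     # hypothetical vectors for the cosine comparison
p1 = np.array([4.0, 2.1])
p2 = np.array([1.0, 3.0])

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(euclidean, manhattan)
print(cosine(p0, p1), cosine(p0, p2))         # the larger cosine indicates the more similar vector
```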

Figure 1.12. Euclidean distance (marked with a dotted line) versus Manhattan distance (marked with solid lines)
Figure 1.13. An example of the cosine distance: point p0 is more similar to p1 than to p2, since cos α0 > cos α1

1.2 Indices

The problem of searching data in multidimensional data spaces has been investigated over the past few decades. For spaces with fewer than five dimensions, fast solutions exist. However, when the number of dimensions and the size of a dataset increase, the search for the nearest neighbors of a given point becomes much more complex; this is related to the so-called curse of dimensionality. Over the years, a number of algorithms for searching data in multidimensional data spaces have been developed. Several of them are described below.

In 1974, Finkel and Bentley proposed the Quadtree (Finkel & Bentley, 1974). The main idea was to divide a two-dimensional space with two orthogonal lines into four areas. Splitting the space was applied recursively and resulted in a hierarchical structure. This solution was easy to implement and efficient for two-dimensional spaces. However, when files were used to store data, the performance of this solution was poor. One year later, Bentley proposed the k-d tree (Bentley, 1975). The k-d tree was also easy to implement and had good performance when it was used for searching the nearest neighbors of a given point in low-dimensional spaces. However, the k-d tree is usually not balanced, because its structure depends on the sequence of insertion of points, and it is therefore inappropriate for higher dimensions.
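As a point of reference for the kind of query these indices accelerate, the snippet below builds a k-d tree with SciPy and asks for the nearest neighbors of a query point. The dataset, sizes and query point are assumptions for illustration; the thesis does not use this library.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
points = rng.uniform(0, 100, size=(10_000, 2))   # hypothetical low-dimensional dataset

tree = cKDTree(points)                           # build the k-d tree index
query = np.array([50.0, 50.0])
dist, idx = tree.query(query, k=5)               # distances and indices of the 5 nearest neighbors
print(idx, dist)
```

In low dimensions such a query touches only a small fraction of the points; in high dimensions the pruning degrades, which is the motivation for the approximation-based approaches discussed next.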

Several years later, in 1981, Nievergelt and Hinterberger developed the Gridfile index (Nievergelt, Hinterberger, & Sevcik, 1984). Simple implementation and good performance are the advantages of this index, but the structure of the Gridfile usually becomes unbalanced quickly, which negatively affects search times. Moreover, search performance is poor in the case of datasets with large empty spaces, and the size of the index grows linearly with the number of points.

Voronoi diagrams (Nievergelt, Hinterberger, & Sevcik, 1984) also perform well only in low-dimensional spaces. In high-dimensional spaces, the computation of Voronoi diagrams is expensive, as is the cost of storing the cells. The retrieval times of nearest neighbors are not better than those of the methods described above.

In 1984, Guttman proposed the R-Tree index (Guttman, 1984), which was similar to the B-Tree index (Bayer, 1971). In an R-Tree, each node contains at least a certain minimal number of points, and all leaves are at the same depth level. The R-Tree is always perfectly balanced, which implies that the height of the tree is logarithmic in the number of points. It is also fast in low-dimensional spaces. However, the structure of the R-Tree requires rebuilding from time to time, basic operations are very expensive, and performance in high-dimensional spaces is also poor. A number of extensions to the R-Tree have been proposed to improve its performance, including: R+-Tree (Sellis, Roussopoulos, & Faloustos, 1987), R*-Tree (Beckmann & Kriegel, 1990), TV-Tree (Lin, Jagadish, & Faloutsos, 1994), vp-tree (Chiueh, 1994), X-Tree (Berchtold, Keim, & Kriegel, 1996), SS-Tree (White & Jain, 1996), SS+-Tree (Kurniawati, Jin, & Shepherd, 1997), SR-Tree (Katayama & Satoh, 1997), M-Tree (Ciaccia, Patella, & Zezula, 1997), DABS-Tree (Böhm & Kriegel, 2000), GiST (Hellerstein, Naughton, & Pfeffer, 1995).

Finally, in 1997, at ETH in Zurich, the Vector Approximation File (VA-File) (Weber, Schek, & Blott, 1998) was developed. This simple, yet efficient, index is based on a crucial mechanism of approximations. The VA-File is an array of approximations (bit strings of a specific length) of data vectors. By using the array of approximations when searching through data vectors, a large number of irrelevant vectors can easily be excluded. Thanks to the mechanism of approximations there is no need to build sophisticated hierarchical structures, and the problem of the curse of dimensionality is addressed. The VA-File and the algorithms for building it and for searching nearest neighbors with it are described in Chapter 3.

The mechanism of approximations and the idea of layers enabled us to design the LVA-Index (Lasek, LVA-Index: An Efficient Way to Determine Nearest Neighbors, 2008), an index with a layered structure combining some features of the VA-File and the NBC algorithm. We examined the index by comparing it with the VA-File. The results of the experiments showed that the search for nearest neighbors in the LVA-Index is much faster than in the VA-File.

1.3 Statement of the Problem and Theses of Dissertation

The density-based clustering algorithms offered in the literature are inefficient in the case of datasets with a large number of points as well as in the case of high-dimensional data. In this thesis, we formulate and prove the following theses related to density-based clustering:
- An application of the layer-based LVA-Index increases the efficiency of density-based clustering of low-dimensional data by means of the NBC algorithm.
- It is possible to efficiently build any k-th layer of the LVA-Index without unnecessary scanning of a large number of cells.
- Applying the triangle inequality property significantly improves the efficiency of density-based clustering based on the DBSCAN and NBC approaches for both low-dimensional and high-dimensional large datasets.

1.4 Content Outline

In Chapter 2, two important representatives of density-based clustering algorithms (DBSCAN and NBC) are described. Basic definitions, which are used in further chapters, are also given. In Chapter 3, we present our new indices (LVA-Index and ELVA-Index), as well as methods of building them and searching for nearest neighbors using these indices. The triangle inequality property (TI) is presented in Chapter 4, where we also describe our two new algorithms, namely TI-DBSCAN and TI-NBC, and discuss their variants. The experimental results concerning the presented indices and algorithms are shown and discussed in Chapter 5. The thesis is concluded and summarized in Chapter 6. Appendix A contains a description of the clustering framework we implemented for the purpose of verifying the solutions presented in this thesis. Appendix B contains an extended set of the obtained experimental results.

2 DBSCAN and NBC

This chapter presents two density-based clustering algorithms: DBSCAN (Density-Based Spatial Clustering of Applications with Noise) (Ester, Kriegel, Sander, & Xu, 1996) and NBC (A Neighborhood-Based Clustering Algorithm) (Zhou, Zhao, Guan, & Huang, 2005). The crucial subtask of these algorithms is to determine nearest neighbors, that is, points located within specific neighborhoods. There exists a diversity of definitions of neighborhoods, and different neighborhood types are used in clustering algorithms. For example, in DBSCAN, the nearest points are determined so that they are located within a neighborhood of a given radius ε, whereas in NBC, a particular number of points is searched for. The definitions of neighborhoods are given below, together with examples and figures clarifying concepts related to neighborhoods. For the sake of simple graphical presentation, the Euclidean distance will be used from now on in this thesis. Several other distance functions were presented in Section 1.1.3, while more of them are described, for example, in (Han & Kamber, 2000). Throughout the thesis, we assume that neighborhoods are searched for within a given dataset D.

Definition 2.1. (ε-neighborhood, Nε(p) in brief)
The ε-neighborhood of a point p (Nε(p)) is the set of all points q in dataset D that are distant from p by no more than ε; that is, Nε(p) = {q ∈ D | dist(p, q) ≤ ε}.

Definition 2.2. (k-neighborhood, kNN(p) in brief)
The k-neighborhood of a point p (kNN(p)) is a set of k (k > 0) points satisfying the following conditions:
a) kNN(p) ⊆ D,
b) ∀q ∈ kNN(p), ∀r ∈ D \ kNN(p): dist(p, q) ≤ dist(p, r).

Example 2.1. According to Definition 2.2, the cardinality of a kNN(p) cannot be greater than k. However, there may be a situation in which several different points are located at the same distance from point p. Consider, for example, Figure 2.1.

Figure 2.1. ε-neighborhood of point p
Figure 2.2. 6-neighborhood of point p containing point q4, k = 6
Figure 2.3. 6-neighborhood of point p containing point q5, k = 6
Figure 2.4. 6-neighborhood of point p containing point q6, k = 6

Figures 2.2-2.4 demonstrate three possible sets of points constituting 6-neighborhoods of point p. In all the cases, points p, q0, q1, q2 and q3 are assigned to kNN(p). Moreover, depending on the order of points processing, kNN(p) may include either point q4 or q5 or q6.

And thus, kNN(p) in Figure 2.2 includes q4, in Figure 2.3 it includes q5 instead of q4, and in Figure 2.4 it includes q6 instead of q4, respectively. Please note that q ∈ kNN(p) does not guarantee that p ∈ kNN(q).

The k+-neighborhood of a point p will be defined in terms of the ε-neighborhood, based on the observation that for each point p in dataset D there exists a point q ∈ kNN(p) such that dist(p, q) = max{dist(p, r) | r ∈ kNN(p)}.

Definition 2.3. (k+-neighborhood, k+NN(p) in brief)
The k+-neighborhood of a point p (k+NN(p)) is equal to Nε'(p), where ε' = max{dist(p, q) | q ∈ kNN(p)}.

Figure 2.5. k+-neighborhood of point p, k = 6
Figure 2.6. A sample dataset

Example 2.2. Figure 2.5 illustrates the k+-neighborhood of point p for k = 6. The following points were assigned to k+NN(p): p, q0, q1, q2, q3, q4, q5, q6. This means that the cardinality of k+NN(p) is equal to 8, which is consistent with Definition 2.3, since the cardinality of k+NN(p) can be greater than or equal to k.

The neighborhoods presented so far are also called non-punctured neighborhoods. Punctured neighborhoods are introduced in Definition 2.4.

Definition 2.4. (punctured neighborhood)
The punctured neighborhood of a point p, denoted by N(p⁻), is equal to N(p) \ {p}, where N(p) stands for Nε(p), kNN(p) or k+NN(p), respectively.

Example 2.3. Table 2.1 presents examples of the different kinds of neighborhoods given the dataset D, the radius ε and k = 5, as illustrated in Figure 2.6.

Table 2.1. Examples of punctured and non-punctured neighborhoods (for each point of the dataset in Figure 2.6, the punctured neighborhoods Nε(p⁻), kNN(p⁻), k+NN(p⁻) and the corresponding non-punctured neighborhoods are listed)
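The neighborhood notions above can be computed directly by brute force, as in the sketch below. The data and function names are assumptions for illustration; as noted in Example 2.1, when several points are equally distant the k-neighborhood returned here depends on an arbitrary tie-break.

```python
import numpy as np

def eps_neighborhood(data, p_idx, eps):
    """N_eps(p): all points of the dataset within distance eps of point p (p itself included)."""
    d = np.linalg.norm(data - data[p_idx], axis=1)
    return set(np.flatnonzero(d <= eps).tolist())

def k_neighborhood(data, p_idx, k):
    """kNN(p): k points closest to p (p itself counts, since dist(p, p) = 0); ties broken arbitrarily."""
    d = np.linalg.norm(data - data[p_idx], axis=1)
    return set(np.argsort(d)[:k].tolist())

def punctured(neighborhood, p_idx):
    """N(p-): the same neighborhood with p itself removed."""
    return neighborhood - {p_idx}

data = np.array([[0, 0], [0, 1], [1, 1], [5, 5], [5, 6], [6, 5]], dtype=float)  # hypothetical points
print(eps_neighborhood(data, 0, eps=1.5))
print(k_neighborhood(data, 0, k=3))
print(punctured(k_neighborhood(data, 0, k=3), 0))
```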

2.1 DBSCAN

This section focuses on DBSCAN, a well-known density-based clustering algorithm, which was introduced in (Ester, Kriegel, Sander, & Xu, 1996). The main feature of this algorithm is that each point of a cluster must contain at least a certain number of points (MinPts) within its ε-neighborhood. In other words, the density in the ε-neighborhood of a point belonging to a cluster has to be greater than or equal to a predefined threshold. The clustering process in DBSCAN is based on the following relations between points: direct density-reachability and density-reachability. Moreover, DBSCAN discerns three types of points: core points, border points and noise points. These concepts and types of points are defined in the next section.

2.1.1 Definitions of Terms Used in DBSCAN

A cluster in the context of the DBSCAN algorithm is a region of high density. Regions of low density constitute noise. A point in space is considered a member of a cluster if there is a sufficient number of points within a given distance from it. Definitions and notions related to the DBSCAN algorithm are given below.

Definition 2.5. (a core point)
A point p is a core point with respect to ε and MinPts if its ε-neighborhood contains at least MinPts points; that is, if |Nε(p)| ≥ MinPts.

Definition 2.6. (directly density-reachable points)
A point p is directly density-reachable from a point q with respect to ε and MinPts if the following two conditions are satisfied:
a) p ∈ Nε(q),
b) q is a core point.

Figure 2.7. p1 is directly density-reachable from core point p0; p2 is density-reachable from p0 (MinPts = 6)

Example 2.4. Let MinPts = 6. Point p0 in Figure 2.7 has 6 neighbors (including itself) in its non-punctured ε-neighborhood, so it is a core point. There are only 2 points in Nε(p1), so p1 is not a core point. However, since point p1 belongs to Nε(p0), it is directly density-reachable from p0. On the other hand, despite p0 belonging to Nε(p1), p0 is not directly density-reachable from p1.

Definition 2.7. (density-reachable points)
A point p is density-reachable from a point q with respect to ε and MinPts if there is a sequence of points p1, ..., pn such that p1 = q, pn = p, and pi+1 is directly density-reachable from pi for 1 ≤ i < n.

Example 2.5. Figure 2.8 illustrates an example of density-reachable points. In this example, p1 is directly density-reachable from p0 and p2 is directly density-reachable from p1. Hence, point p2 is density-reachable from point p0, although p2 is not a core point.

Figure 2.8. Both p1 and p2 are density-reachable from point p0, so p1 and p2 belong to the set of points density-reachable from p0 (MinPts = 6)

Definition 2.8. (a border point)
A point p is a border point if it is not a core point and is density-reachable from a core point.

The above definition implies that a point is a border point if it is not a core point, but belongs to the ε-neighborhood of some core point. Let us consider the set of all points in D that are density-reachable from a point o. If o is not a core point, then this set is empty. In Figure 2.8, p1 and p2 belong to this set for o = p0, since points p1 and p2 are density-reachable from core point p0.

Definition 2.9. (cluster)
A cluster is a non-empty set of all points in D which are density-reachable from the same core point.

Although Definition 2.9 is formulated differently than the definition provided in (Ester, Kriegel, Sander, & Xu, 1996), the resulting clusters are identical in both cases. According to these definitions, if p is a core point, then the set of all points density-reachable from p is a cluster.

Theorem 2.1. (Ester, Kriegel, Sander, & Xu, 1996) If p and q are core points which belong to the same cluster, then the set of points density-reachable from p is equal to the set of points density-reachable from q.

By Theorem 2.1, two core points belonging to the same cluster determine the same cluster. This implies that each core point belongs to exactly one cluster. However, a border point may belong to more than one cluster (Kryszkiewicz & Skonieczny, 2005).

Definition 2.10. (noise)
Noise is the set of all points in D that are not density-reachable from any core point.

Noise contains points that are neither core nor border points. In other words, noise points are the points in D which do not belong to any cluster.

2.1.2 Clustering with DBSCAN

In this section, the DBSCAN clustering algorithm is recalled. The algorithm takes three input parameters, namely: D - the set of data points, ε - the radius of the neighborhood, and MinPts - the minimal number of points within the ε-neighborhood (methods for determining the values of the ε and MinPts parameters are described e.g. in (Ankerst, Breunig, Kriegel, & Sander, 1999) and (Ester, Kriegel, Sander, & Xu, 1996)). Each point in D has an attribute called ClusterId which stores the cluster's identifier and is initially equal to UNCLASSIFIED.

Firstly, the algorithm generates a label for the first cluster to be found. Next, the points in D are read. The value of the ClusterId attribute of the first point read is equal to UNCLASSIFIED.

While the algorithm analyzes point after point, it may occur that the ClusterId attributes of some points change before these points are actually analyzed. Such a case may occur when a point is density-reachable from a core point examined earlier. Such density-reachable points will be assigned to the cluster of that core point and will not be analyzed later. If the currently analyzed point p has not been classified yet (the value of its ClusterId attribute is equal to UNCLASSIFIED), then the ExpandCluster function (please see Function 2.1) is called for this point. If p is a core point, then all points density-reachable from p are assigned by the ExpandCluster function to the cluster with a label equal to the current cluster label. Next, a new cluster label is generated by DBSCAN. Otherwise, if p is not a core point, the ClusterId attribute of point p is set to NOISE, which means that point p will be tentatively treated as noise. After analyzing all points in D, each point's ClusterId attribute stores a respective cluster label or is equal to NOISE. In other words, D contains only points which have been assigned to particular clusters or are noise.

Algorithm 2.1. DBSCAN(set of points D, ε, MinPts)
ClusterId = label of a first cluster;
for each point p in set D do
    if (p.ClusterId = UNCLASSIFIED) then
        if ExpandCluster(D, p, ClusterId, ε, MinPts) then
            ClusterId = NextId(ClusterId)
        endif
    endif
endfor

Function 2.1. ExpandCluster(D, point p, ClId, ε, MinPts)
seeds = Neighborhood(D, p, ε);
if |seeds| < MinPts then
    p.ClusterId = NOISE;
    return FALSE
else
    for each point q in seeds do // including point p
        q.ClusterId = ClId;
    endfor
    delete p from seeds;
    while |seeds| > 0 do
        curPoint = first point in seeds;
        curSeeds = Neighborhood(D, curPoint, ε);
        if |curSeeds| >= MinPts then
            for each point q in curSeeds do
                if q.ClusterId = UNCLASSIFIED then
                    /* Nε(q) has not been evaluated yet, so q is added to seeds */
                    q.ClusterId = ClId;
                    append q to seeds;

                elseif q.ClusterId = NOISE then
                    /* Nε(q) has been evaluated already, so q is not added to seeds */
                    q.ClusterId = ClId;
                endif
            endfor
        endif
        delete curPoint from seeds;
    endwhile
    return TRUE
endif

The ExpandCluster function takes five parameters: D - a set of points, p - the point currently being processed, ClId - the current value of ClusterId, ε - the radius of the neighborhood, and MinPts - the minimum number of points in the ε-neighborhood required to form a cluster. The function starts by calculating the ε-neighborhood of point p. If the cardinality of the ε-neighborhood of point p is less than MinPts, then p is not a core point. Additionally, the value of its ClusterId attribute is temporarily set to NOISE and ExpandCluster reports the failure of creating a cluster. Otherwise, if the number of points in the ε-neighborhood of p is sufficient, p is recognized as a core point. Hence, all points density-reachable from p will constitute a cluster. Having determined the ε-neighborhood of p, all points in this neighborhood become members of the currently built cluster (the ClId label is assigned to the ClusterId fields of these points). The ε-neighborhood of point p (except for p) is stored in the seeds collection. The neighborhood of each seed point which is a core point will augment the seeds. Points which belong to those ε-neighborhoods of seed points and were earlier classified as noise are now assigned to the current cluster. Border points may belong to many clusters. Although DBSCAN assigns such points to only one cluster, it would be possible to change the algorithm so that border points are assigned to all possible clusters. ExpandCluster processes all points contained in the seeds collection. Each of these points is removed from the collection after being processed. When seeds is empty (meaning that all points found as cluster seeds have been checked), the function ends. Please note that, because the ε-neighborhood of the point p passed as a parameter of the ExpandCluster function may contain points classified earlier as noise, some calculations in ExpandCluster are redundant (Kryszkiewicz & Skonieczny, 2005). Such points will be processed once again.
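For readers who prefer executable code, the following is a compact, brute-force Python sketch mirroring Algorithm 2.1 and Function 2.1 above. It computes every ε-neighborhood by scanning the whole dataset (no index), and the dataset and parameter values at the bottom are made up for illustration.

```python
import numpy as np

UNCLASSIFIED, NOISE = -2, -1

def neighborhood(data, i, eps):
    """Indices of all points within distance eps of point i (point i included)."""
    return np.flatnonzero(np.linalg.norm(data - data[i], axis=1) <= eps).tolist()

def dbscan(data, eps, min_pts):
    labels = np.full(len(data), UNCLASSIFIED)
    cluster_id = 0
    for i in range(len(data)):
        if labels[i] != UNCLASSIFIED:
            continue
        seeds = neighborhood(data, i, eps)
        if len(seeds) < min_pts:
            labels[i] = NOISE                    # tentatively noise; may become a border point later
            continue
        labels[seeds] = cluster_id               # p is a core point: label its whole eps-neighborhood
        seeds = [s for s in seeds if s != i]
        while seeds:
            cur = seeds.pop(0)
            cur_seeds = neighborhood(data, cur, eps)
            if len(cur_seeds) >= min_pts:        # cur is a core point: keep expanding the cluster
                for q in cur_seeds:
                    if labels[q] == UNCLASSIFIED:
                        labels[q] = cluster_id
                        seeds.append(q)
                    elif labels[q] == NOISE:     # a point classified earlier as noise is a border point
                        labels[q] = cluster_id
        cluster_id += 1
    return labels

data = np.vstack([np.random.default_rng(0).normal(0, 0.3, (20, 2)),
                  np.random.default_rng(1).normal(5, 0.3, (20, 2)),
                  [[10.0, 10.0]]])              # two dense groups plus one outlier (hypothetical)
print(dbscan(data, eps=1.0, min_pts=5))
```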

Figures 2.9-2.14 illustrate a sample execution of the DBSCAN algorithm.

Figure 2.9. The neighborhood of the first core point is assigned to a cluster
Figure 2.10. Subsequent assignment of density-reachable points forms the first cluster; initial seeds are determined for the second cluster
Figure 2.11. The second cluster reaches its maximum size; the initial seeds are determined for the third cluster
Figure 2.12. The third cluster reaches its maximum size; the initial seeds are determined for the fourth cluster
Figure 2.13. The final clustering result with DBSCAN
Figure 2.14. Noise (empty dots)

2.2 NBC

The Neighborhood-Based Clustering (NBC) algorithm (Zhou, Zhao, Guan, & Huang, 2005) belongs to the group of density-based clustering algorithms. The characteristic feature of NBC is the ability to measure relative local densities. Hence, it is capable of discovering clusters of different local densities and of arbitrary shape. NBC requires only one input parameter (the value of k). In the sequel, we present definitions of basic NBC notions such as: the reversed punctured k+-neighborhood of a point, the neighborhood-based density factor of a point, local sparse point, local even point, local core point, directly neighborhood-based density-reachable points, neighborhood-based density-reachable points, cluster and noise.

2.2.1 Definitions of Terms Used in NBC

Definition 2.11. (reversed punctured k+-neighborhood of a point p)
The reversed punctured k+-neighborhood of a point p (Rk+NN(p⁻)) is the set of all points q in dataset D such that p belongs to k+NN(q⁻); that is: Rk+NN(p⁻) = {q ∈ D | p ∈ k+NN(q⁻)}.

Definition 2.12. (neighborhood-based density factor of a point)
The neighborhood-based density factor of a point p (NDF(p)) is defined as NDF(p) = |Rk+NN(p⁻)| / |k+NN(p⁻)|. NDF(p) is understood as a measure of the local density of point p.

Definition 2.13. (local sparse point, in brief SP)
A point p is a local sparse point if NDF(p) < 1.

Definition 2.14. (local even point, in brief EP)
A point p is a local even point if NDF(p) is equal to 1.

Definition 2.15. (local core point)
A point p is a local core point if NDF(p) ≥ 1.

Example 2.6. A sample dataset is given in Figure 2.15. It contains point p and points q0, ..., q6. For point p and for each point qi, i = 0, ..., 6, belonging to the dataset, the punctured k+-neighborhood was determined for k = 2. If point p was included in k+NN(qi⁻), then qi was marked with black color; otherwise, qi was marked with white color. The points marked black constitute the reversed punctured k+-neighborhood of point p, which is illustrated in Figure 2.15.h. Having determined Rk+NN(p⁻), it is possible to calculate the neighborhood-based density factor of p, which is equal to the ratio of the cardinality of Rk+NN(p⁻) to the cardinality of k+NN(p⁻). In our case, NDF(p) = 2 ≥ 1, therefore we can conclude that p is a local core point.

Definition 2.16. (directly neighborhood-based density-reachable points)
A point p is directly neighborhood-based density-reachable from a point q with respect to k if the following two conditions are satisfied:
1) p ∈ k+NN(q⁻),
2) q is a local core point.

Definition 2.17. (neighborhood-based density-reachable points)
A point p is neighborhood-based density-reachable from a point q with respect to k if there is a sequence of points p1, ..., pn such that p1 = q, pn = p, and pi+1 is directly neighborhood-based density-reachable from pi for 1 ≤ i < n.

Figure 2.15. a)-g) Visualization of the determination of the reversed punctured k+-neighborhood of point p (k = 2); h) the reversed punctured k+-neighborhood of point p (k = 2)

Figure 2.16. q4 is directly neighborhood-based density-reachable from p: q4 ∈ k+NN(p⁻) and p is a local core point (NDF(p) ≥ 1); k = 6
Figure 2.17. q5 is neighborhood-based density-reachable from p, since q4 is directly neighborhood-based density-reachable from p and q5 is directly neighborhood-based density-reachable from q4; k = 6
Figure 2.18. A local border point: it belongs to the punctured k+-neighborhood of a local core point but is not a local core point itself; k = 6
Figure 2.19. Clusters and noise

Example 2.7. Figure 2.16 shows that point q4 is directly neighborhood-based density-reachable from p, which can be concluded from the fact that p is a local core point and q4 ∈ k+NN(p⁻). Similarly, point q5 is directly neighborhood-based density-reachable from point q4. Hence, q5 is neighborhood-based density-reachable from p (Figure 2.17).

Definition 2.18. (local border point)
A point p is a local border point if it is not a local core point and belongs to k+NN(q⁻) of some local core point q.

Example 2.8. Figure 2.18 presents a point which is a local border point: it belongs to the punctured k+-neighborhood of a local core point (Figure 2.17), and it is not a local core point itself (its NDF is equal to 3/5 < 1).

Definition 2.19. (neighborhood-based density connected)
Points p and q which belong to dataset D are neighborhood-based density connected with respect to k if p is neighborhood-based density-reachable from q, or q is neighborhood-based density-reachable from p, or there exists another point, say r, such that both p and q are neighborhood-based density-reachable from r.

Definition 2.20. (cluster)
A cluster C is a maximal non-empty subset of D such that:
a) for two points p and q in the cluster, p and q are neighborhood-based density-reachable from some local core point with respect to k, and
b) if p belongs to cluster C and q is neighborhood-based density connected with p with respect to k, then q belongs to C.

Definition 2.21. (noise)
Noise is the set of all points in D that do not belong to any cluster. In other words, noise is the set of all points in D that are not neighborhood-based density-reachable from any local core point.

Figure 2.19 illustrates clusters and noise. The clusters have been marked with numbers from 0 to 4 and different colors; noise points have been marked with gray color.
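The definitions above can be checked experimentally with a brute-force sketch of kNN, reversed kNN and NDF. As a simplification it uses the plain punctured k-nearest-neighbor set rather than the k+-neighborhood of Definition 2.3 (so ties are broken arbitrarily), and the data points are hypothetical.

```python
import numpy as np

def knn_punctured(data, i, k):
    """Indices of the k points nearest to point i, the point itself excluded (ties broken arbitrarily)."""
    d = np.linalg.norm(data - data[i], axis=1)
    order = [j for j in np.argsort(d) if j != i]
    return set(order[:k])

def reversed_knn(data, i, k):
    """R-kNN(i): points that count i among their own k nearest neighbors."""
    return {j for j in range(len(data)) if j != i and i in knn_punctured(data, j, k)}

def ndf(data, i, k):
    """Neighborhood-based density factor: |R-kNN(i)| / |kNN(i)|."""
    return len(reversed_knn(data, i, k)) / len(knn_punctured(data, i, k))

data = np.array([[0, 0], [0.5, 0], [0, 0.5], [0.5, 0.5], [5, 5]], dtype=float)  # hypothetical points
for i in range(len(data)):
    print(i, round(ndf(data, i, k=2), 2))   # the isolated point gets NDF = 0 (a local sparse point)
```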

2.2.2 The NBC Algorithm

The NBC algorithm begins with the CalcNDF function described beneath.

The CalcNDF Function

The CalcNDF function calculates k+NN(p⁻), Rk+NN(p⁻) and NDF(p) for each point p in dataset D. The authors of NBC employ a cell-based approach to calculating k+NN(p⁻) by cutting the data space into (high-dimensional) cells and using the VA-File (Weber, Schek, & Blott, 1998) to organize the cells (Zhou, Zhao, Guan, & Huang, 2005). As follows from the code of NBC obtained from its authors (Xiaojun, 2008), the neighborhood of any point p is built only from the points belonging to the same cell as p, say cell c, and from the points belonging to the so-called first layer of cell c, which contains all cells adhering to c. Please note that the neighborhood calculated in this way may differ from k+NN(p⁻). Clearly, the larger the size of a cell, the smaller the differences between the calculated neighborhood of p and k+NN(p⁻).

Example 2.9. The difference between a punctured k+-neighborhood and a neighborhood calculated using the described cell-based approach is shown in Figure 2.20. As one can see, only two points are common to both neighborhoods of point q.

Figure 2.20. The difference between a punctured k+-neighborhood and a neighborhood of point q calculated using a cell-based approach limited to the first layer; k = 4

The NBC Algorithm

The pseudo-code of the NBC algorithm is given below.

Algorithm 2.2. NBC(D, k)
for each point p in D do
    p.ClusterId = UNCLASSIFIED      // initialize the cluster number of each object
endfor
CalcNDF(D, k)                       // calculate NDF
ClusterId = label of a first cluster    // set the first cluster number to 0
for each point p in D do            // scan the dataset
    if (p.ClusterId != UNCLASSIFIED or p.NDF < 1) then
        continue
    endif
    p.ClusterId = ClusterId         // label a new cluster
    DPSet.empty()                   // initialize DPSet
    for each point q in k+NN(p⁻) do
        q.ClusterId = ClusterId;
        if (q.NDF >= 1) then DPSet.add(q) endif
    endfor
    while (DPSet is not empty) do   // expanding the cluster
        r = first point in DPSet;
        for each point q in k+NN(r⁻) do
            if (q.ClusterId != UNCLASSIFIED) then
                continue
            endif
            q.ClusterId = ClusterId;
            if (q.NDF >= 1) then DPSet.add(q) endif
        endfor
        DPSet.remove(r);
    endwhile
    ClusterId = NextId(ClusterId)
endfor
for each point p in D do            // label noise
    if (p.ClusterId = UNCLASSIFIED) then p.ClusterId = NOISE endif
endfor

After calculating the NDF factors for each point in the database D, the clustering process is performed. For each point p, it is checked whether p.NDF is less than 1. If p.NDF < 1, then p is omitted at this moment and the next point is checked. If p.NDF ≥ 1, then p, as a local core point, is assigned to the currently created cluster identified by the current value of ClusterId. Next, the DPSet variable, which stores points that are directly neighborhood-based density-reachable from point p, is cleared, and each point q belonging to k+NN(p⁻) is assigned to the currently created cluster. Moreover, if q.NDF is greater than or equal to 1, then point q is added to DPSet. Next, DPSet is analyzed.

For each point r taken from DPSet, each point q belonging to k+NN(r⁻) is checked: if q has already been assigned to a cluster, the next point is processed; if it has not, q is assigned to the current cluster (q.ClusterId = ClusterId) and, if q.NDF ≥ 1, q is also added to DPSet. Finally, all unclassified points are marked as noise by setting the value of their ClusterId attribute to NOISE.

Example 2.10. Let us consider the example dataset given in Figure 2.21. This dataset consists of 14 points. For each point pi, i = 0, ..., 13, the NDF factors were determined using the CalcNDF function (please refer to Table 2.2). As a result of clustering this dataset with the NBC algorithm, we get two sets of points belonging to separate clusters and a set of noise points. The result of the clustering is presented in a graphical form in Figure 2.22.

Figure 2.21. A sample dataset for illustrating the NBC algorithm

Table 2.2. The result (NDF) of the CalcNDF function for the dataset from Figure 2.21 (k = 2); the NDF values obtained for the 14 points are: 2.00, 1.50, 1.50, 0.00, 2.00, 1.50, 1.00, 0.00, 1.00, 1.00, 1.50, 0.00, 1.00, 1.00

Figure 2.22. The result of the clustering using NBC for the dataset presented in Figure 2.21 (two clusters, clst_id = 0 and clst_id = 1, and noise)
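A simplified, executable sketch of the NBC procedure is given below. It mirrors Algorithm 2.2 but replaces the cell-based CalcNDF with a brute-force computation over plain punctured k-nearest neighbors, so results may differ slightly when distances tie; the dataset is made up for illustration.

```python
import numpy as np

NOISE, UNCLASSIFIED = -1, -2

def knn(data, i, k):
    d = np.linalg.norm(data - data[i], axis=1)
    return [j for j in np.argsort(d) if j != i][:k]

def nbc(data, k):
    n = len(data)
    neigh = [knn(data, i, k) for i in range(n)]
    rknn_sizes = np.zeros(n)
    for i in range(n):
        for j in neigh[i]:
            rknn_sizes[j] += 1                     # point i counts j among its k nearest neighbors
    ndf = rknn_sizes / k                           # |R-kNN(p)| / |kNN(p)|
    labels = np.full(n, UNCLASSIFIED)
    cluster_id = 0
    for p in range(n):
        if labels[p] != UNCLASSIFIED or ndf[p] < 1:
            continue
        labels[p] = cluster_id                     # p is a local core point: start a new cluster
        dp_set = []
        for q in neigh[p]:
            labels[q] = cluster_id
            if ndf[q] >= 1:
                dp_set.append(q)
        while dp_set:                              # expand the cluster from core/even points
            r = dp_set.pop(0)
            for q in neigh[r]:
                if labels[q] != UNCLASSIFIED:
                    continue
                labels[q] = cluster_id
                if ndf[q] >= 1:
                    dp_set.append(q)
        cluster_id += 1
    labels[labels == UNCLASSIFIED] = NOISE         # the remaining points are noise
    return labels

rng = np.random.default_rng(2)
data = np.vstack([rng.normal(0, 0.5, (15, 2)), rng.normal(6, 1.5, (15, 2)), [[20.0, 20.0]]])
print(nbc(data, k=5))
```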


3 VA-File and LVA-Index

3.1 VA-File

The VA-File index was proposed in order to reduce the amount of data that must be read during similarity searches (Weber, Schek, & Blott, 1998). The method of building the VA-File is based on the idea of approximations, which are computed using approximation functions. In this section, we recall definitions related to the VA-File, as well as its structure and the Simple Search Algorithm (SSA) (Weber, 2000) for searching nearest neighbors using the VA-File. In the last part of this subchapter, we also describe the layered approach which was applied in the NBC algorithm (Zhou, Zhao, Guan, & Huang, 2005).

3.1.1 Definitions of Terms Used in VA-File

Concepts used in the VA-File, such as a mark, a slice, an approximation, lower and upper marks, as well as a lower bound, are presented below.

Definition 3.1. (mark)
The l-th mark in the j-th dimension, where 0 ≤ j ≤ d - 1, 0 ≤ l ≤ 2^bj and bj is the number of bits per j-th dimension, is denoted by m(j, l) and defined as follows:
- m(j, 0) is the minimum value of the j-th dimension of points from D,
- m(j, 2^bj) is the maximum value of the j-th dimension of points from D,
- m(j, l) = m(j, 0) + l · (m(j, 2^bj) - m(j, 0)) / 2^bj otherwise (0 < l < 2^bj).
2^bj is meant to be the number of intervals into which the data space in the j-th dimension is divided, and it thereby determines the granularity of the data space division in the j-th dimension.

Example 3.1. In Figure 3.1, we have plotted a sample two-dimensional dataset containing 8 points: p0, ..., p7, with the following coordinates: p0 = (0, 3), p1 = (1, 0), p2 = (8, 5), p3 = (3, 8), p4 = (6, 4), p5 = (3, 3), p6 = (5, 2), p7 = (1, 6). In this example, the minimum values of the points' coordinates are equal to 0 in both dimensions (for points p0 and p1, respectively). The maximum values of the points' coordinates are equal to 8 (for points p2 and p3, respectively). The number of bits per dimension (bj) has been set to 2 for dimension 0 as well as for dimension 1.

Figure 3.1. Determining marks

Given the minimum values of the marks in dimensions 0 and 1: m(0, 0) = 0 and m(1, 0) = 0, the maximum values of the marks in dimensions 0 and 1: m(0, 4) = 8 and m(1, 4) = 8, and b0 = b1 = 2, one can compute the values of the remaining marks, namely: m(0, 1) = 2, m(0, 2) = 4, m(0, 3) = 6, and m(1, 1) = 2, m(1, 2) = 4, m(1, 3) = 6.

Definition 3.2. (slice) The l-th slice in the j-th dimension, denoted s(j, l), where 0 ≤ j ≤ d-1 and 0 ≤ l ≤ 2^{b_j} - 1, is defined as follows:

s(j, l) = { x : m(j, l) ≤ x < m(j, l+1) }, if 0 ≤ l < 2^{b_j} - 1,
s(j, l) = { x : m(j, l) ≤ x ≤ m(j, l+1) }, if l = 2^{b_j} - 1.

Example 3.2. In Figures 3.2 a) - b), we have graphically presented the slices s(0, 0), s(0, 1), s(0, 2), s(0, 3), s(1, 0), s(1, 1), s(1, 2) and s(1, 3).

Figure 3.2. Illustration of slices determined for the dataset from Figure 3.1

Definition 3.3. (approximation function) The approximation function of a point p in the j-th dimension, a(p, j), is the index of the slice in the j-th dimension to which point p belongs, namely: a(p, j) = l, where p_j ∈ s(j, l).

Example 3.3. In Figure 3.3, we have drawn both the data points and the slices. In Table 3.1, we present the result of applying the approximation function to all data points from this figure.

Figure 3.3. A sample dataset of points divided into slices

Table 3.1. The results of computing the approximation function for the points from the sample dataset from Figure 3.3 (for each point p_i, the values a(p_i, 0) and a(p_i, 1) are given)

Definition 3.4. (approximation of a point) The approximation of a point p, denoted a(p), is the concatenation of the results of the approximation function a(p, j), for j = 0, ..., d-1, written in binary form:

a(p) = bin(a(p, 0)) & bin(a(p, 1)) & ... & bin(a(p, d-1)),

where bin returns the binary form of a given number using b_j bits and & is the concatenation operator.

Example 3.4. Having determined the values of the approximation functions for all points from the sample dataset (Figure 3.3), the approximations of the points can be computed. The approximations of the points computed using the above definition are presented in Table 3.2.

Using the notions of the approximation of a point and the mark, the lower mark and the upper mark of a point can be defined.

Table 3.2. Computing the approximations of points (* - binary form, ** - decimal form)

Definition 3.5. (lower mark of a point) The lower mark in the j-th dimension for point p, denoted as lm(p, j), is defined as: lm(p, j) = m(j, a(p, j)).

Definition 3.6. (upper mark of a point) The upper mark in the j-th dimension for point p, denoted as um(p, j), is given by the following formula: um(p, j) = m(j, a(p, j) + 1).
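The sketch below ties Definitions 3.3-3.6 together for the equally spaced marks used above: it computes the slice index a(p, j), the concatenated binary approximation a(p), and the lower and upper marks of a point. The helper names are illustrative.

def slice_index(x, dim_marks):
    """a(p, j): the index of the slice into which coordinate x falls."""
    for l in range(len(dim_marks) - 1):
        if dim_marks[l] <= x < dim_marks[l + 1]:
            return l
    return len(dim_marks) - 2               # x equal to the maximum mark: last slice

def approximation(point, all_marks, bits):
    """a(p): concatenation of the slice indices, written on `bits` bits each."""
    return ''.join(format(slice_index(x, m), '0{}b'.format(bits))
                   for x, m in zip(point, all_marks))

def lower_upper_marks(point, all_marks):
    """(lm(p, j), um(p, j)) for every dimension j."""
    return [(m[slice_index(x, m)], m[slice_index(x, m) + 1])
            for x, m in zip(point, all_marks)]

marks_2d = [[0, 2, 4, 6, 8], [0, 2, 4, 6, 8]]        # the marks of Example 3.1
print(approximation((6, 4), marks_2d, 2))            # '1110': slices 3 and 2
print(lower_upper_marks((6, 4), marks_2d))           # [(6, 8), (4, 6)]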

Example 3.5. In Table 3.3, we present the results of computing the lower and upper marks for the points plotted in Figure 3.3.

Table 3.3. Lower and upper marks of the points from the dataset presented in Figure 3.3

Definition 3.7. (lower bound) The lower bound of the distance from point p to point q, denoted as lb(p, q), is defined in terms of lower and upper marks as:

lb(p, q) = sqrt( lb_0(p, q)^2 + lb_1(p, q)^2 + ... + lb_{d-1}(p, q)^2 ), where

lb_j(p, q) = lm(p, j) - q_j, if q_j < lm(p, j),
lb_j(p, q) = 0, if lm(p, j) ≤ q_j ≤ um(p, j),
lb_j(p, q) = q_j - um(p, j), if q_j > um(p, j),

and p_j and q_j are the values of the coordinates of point p and point q in the j-th dimension, respectively.

Example 3.6. In Figures 3.4 a) - c), we have plotted the three cases of computing lower bounds in a given dimension j, namely: a) q_j < lm(p, j), b) lm(p, j) ≤ q_j ≤ um(p, j), c) q_j > um(p, j).

In case a), the lower mark is taken for computing the lower bound of the distance from p to q. In case b), when points p and q are located in the same slice, the distance between the points in the given dimension is set to zero. In case c), the upper mark of point p is used for the computation of the lower bound of the distance between points p and q.

Figure 3.4. Different cases in the calculation of lower bounds

3.1.2 Structure of VA-File

VA-File comprises an array of references to points from a given dataset and an array of approximations of these points. In VA-File, each point is represented by its approximation. Both for the array of point references and for the array of approximations, special structures called headers are stored. The header of the array of references to points contains information about the number of dimensions (d) and the number of points in the dataset (n). The header of the approximations stores the number of bits per dimension (b_j, for j = 0, ..., d-1) as well as the marks along each dimension (m(j, l), where 0 ≤ j ≤ d-1 and 0 ≤ l ≤ 2^{b_j}).

Example 3.7. In Figure 3.5, we have illustrated a sample VA-File, including a header of a dataset as well as a header of approximations. In the approximation array, the approximations (both in binary and decimal format) of four points are presented.

Figure 3.5. The structure of VA-File (a dataset of n = 4 two-dimensional points p_0 = (0.1, 0.9), p_1 = (0.6, 0.8), p_2 = (0.1, 0.4), p_3 = (0.9, 0.1), with b_j = 2 bits per dimension, together with the points header, the approximations header and the array of approximations in binary and decimal format)

3.1.3 Simple Search Algorithm

The SSA algorithm presented below returns the punctured k-neighborhood of a query point q.

Algorithm 3.1 SSA(D, q, k)
/* D - set of points
   q - query point for which the neighbors are to be found
   k - number of neighbors to be found
   neighbors - k-element array for storing coordinates of points
   dst - k-element array of real values for storing distances of neighbors to q */
max = InitCandidates(neighbors, dst, k, ∞); // the radius is set to ∞, so the area of search is not limited
/* assert: neighbors is ordered with respect to distance */
for each point p in D do // the main loop of SSA
   if lb(p, q) < max then
      δ = dist(p, q);
      max = Candidate(p, δ, neighbors, dst);
   endif
endfor
return neighbors having all coordinates different from ∞;

The SSA algorithm iterates through all points in D. For each point p, the lower bound lb(p, q) is computed and, if its value is less than the maximal distance max from a point in the neighbors array to q, then the current point p is likely to be closer to the query point q than that point.

The purpose of the Candidate function is to check whether a currently processed point p has a chance to be one of the k nearest neighbors of point q. If so, the Candidate function adds p to the neighbors list and sorts the list.

Function 3.1 Candidate(p, δ, neighbors, dst)
/* p - a point possibly located closer to q than the last point in neighbors
   δ - the distance between points p and q
   neighbors - k-element array which will contain the nearest neighbors of q */
if δ < dst[k-1] then
   neighbors[k-1] = p;
   dst[k-1] = δ;
   sort neighbors and dst with respect to distance from q;
endif
return dst[k-1]; // the distance of the last (farthest) point in neighbors

The InitCandidates function initializes the dst array storing the distances from the points in the neighbors array to point q.

Function 3.2 InitCandidates(neighbors, dst, k, r)
/* r - the radius within which the neighbors are searched */
neighbors = k-element array;
// set the coordinates of the points and the elements of the dst array to r
for i = 0 to k-1 do
   dst[i] = r;
   for j = 0 to d-1 do
      neighbors[i][j] = r;
   endfor
endfor
return r;
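A compact, runnable sketch of the SSA idea: the k best candidates are kept sorted by distance, and the exact distance to a point is computed only when the lower bound of Definition 3.7 does not already rule it out. It is a simplification of Algorithm 3.1 (InitCandidates and Candidate are folded into the loop) and the names are illustrative.

import math

def exact_dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def lower_bound(q, bounds):
    """lb(p, q) computed from q's coordinates and p's (lm, um) pairs in `bounds`."""
    return math.sqrt(sum(max(lm - x, x - um, 0.0) ** 2
                         for x, (lm, um) in zip(q, bounds)))

def ssa_knn(points, point_bounds, q, k):
    """points[i] is a data point, point_bounds[i] its list of (lm, um) pairs."""
    best = [(math.inf, None)] * k                       # (distance, point), kept sorted
    for p, bounds in zip(points, point_bounds):
        if lower_bound(q, bounds) < best[-1][0]:        # cheap filter on the approximation
            d = exact_dist(p, q)                        # exact distance only for survivors
            if d < best[-1][0]:
                best[-1] = (d, p)
                best.sort(key=lambda e: e[0])
    return [p for d, p in best if p is not None]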

3.2 LVA-Index

This subchapter is devoted to our LVA-Index, which was designed for efficient searching of nearest neighbors in multidimensional datasets. In the first part of this subchapter, we provide basic definitions and describe the structure of the LVA-Index. Then we introduce methods of building it and, finally, we present methods of searching for nearest neighbors using LVA-Index. In this subchapter, we also offer the ELVA-Index. Its main feature is that the number of layers stored in each non-empty cell is not fixed.

3.2.1 Definitions of a Cell and a Layer

Definition 3.8. (cell) A cell having coordinates x_0, ..., x_{d-1} is defined as follows:

c(x_0, ..., x_{d-1}) = { p : for each j, 0 ≤ j ≤ d-1, m(j, x_j) ≤ p_j < m(j, x_j + 1) },

where d is the number of dimensions, x_j ≥ 0 and 0 ≤ j ≤ d-1.

Example 3.8. In Figure 3.6, we have marked a sample cell c(1, 2), which has been determined using Definition 3.8 (d = 2) as follows: c(1, 2) = { p : m(0, 1) ≤ p_0 < m(0, 2) and m(1, 2) ≤ p_1 < m(1, 3) }.

Figure 3.6. A cell defined in terms of marks
Figure 3.7. Determining the cell to which a given point belongs

Example 3.9. (Determining the coordinates of the cell to which a given point belongs.) It is possible to determine the coordinates of the cell to which a given point belongs by means of the approximation function. Let us consider a point p from the sample dataset presented in Figure 3.7 and apply the approximation function to it (d = 2):

a(p, 0) = 2, a(p, 1) = 3.

Using the above computed values of the approximations of point p in dimensions 0 and 1, we can state that point p belongs to c(2, 3).

Definition 3.9. (layer) The n-th layer of a cell c(x_0, ..., x_{d-1}), denoted as L_n(c(x_0, ..., x_{d-1})), is defined recursively as follows:

L_0(c(x_0, ..., x_{d-1})) = { c(x_0, ..., x_{d-1}) };
L_n(c(x_0, ..., x_{d-1})) = { c(x_0 + i_0, ..., x_{d-1} + i_{d-1}) : -n ≤ i_j ≤ n for j = 0, ..., d-1 } \ ( L_0(c(x_0, ..., x_{d-1})) ∪ ... ∪ L_{n-1}(c(x_0, ..., x_{d-1})) ).

The number of cells belonging to the n-th layer, |L_n(c)|, is given by the following formula:

|L_n(c)| = (2n+1)^d - (2n-1)^d for n ≥ 1, and |L_0(c)| = 1,

where d is the number of dimensions and 2n+1 and 2n-1 are the edge lengths (in cells) of the corresponding d-dimensional cubes.
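The formula follows from subtracting the volume (in cells) of the inner (2n-1)-cube from that of the outer (2n+1)-cube; a quick numerical check:

def layer_size(n, d):
    """Number of cells in the n-th layer of a cell in a d-dimensional grid."""
    return 1 if n == 0 else (2 * n + 1) ** d - (2 * n - 1) ** d

print(layer_size(1, 2))   # 8  -- the first layer of a cell in 2 dimensions (Figure 3.8)
print(layer_size(2, 3))   # 98 -- 125 - 27 cells in 3 dimensions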

Figure 3.8. Determining the number of cells in the first layer (n = 1) of the cell c(1, 2); 2n-1 = 1 and 2n+1 = 3

Example 3.10. In Figure 3.8, we have illustrated the way of determining the number of cells in the first layer of the cell c(1, 2).

Example 3.11. In Figures 3.9 a) - b), we have illustrated two layers of cell c(1, 2): L_0(c(1, 2)) and L_1(c(1, 2)), respectively. L_0(c(1, 2)) contains cell c(1, 2) itself and L_1(c(1, 2)) contains the following cells: c(0, 1), c(1, 1), c(2, 1), c(2, 2), c(2, 3), c(1, 3), c(0, 3), c(0, 2).

Figure 3.9. A layer defined in terms of cells

3.2.2 Structure of LVA-Index

LVA-Index

The LVA-Index is designed so that each non-empty cell representation contains the list of references to the points belonging to it and the list of structures representing its nearest layers (L_i, i = 1, 2, ..., l). Simultaneously, each point representation stores the reference to the cell to which the point belongs. In what follows, depending on the context, we use terms such as cell, layer, point, etc., interchangeably with the representation of a cell, the representation of a layer, the representation of a point, etc.

The number of nearest layers l, stored for each non-empty cell, is determined experimentally and depends on the number of dimensions and the density of the dataset, so that the number of points in the neighboring layers is equal to or greater than k, where k is the number of nearest neighbors to be found.

Figure 3.10. The structure of LVA-Index for a two-dimensional data space (d = 2, l = 2)

In our implementation of LVA-Index, only non-empty cells are stored in the structure.

Example 3.12. In Figure 3.10, we present the structure of LVA-Index for a two-dimensional dataset. In this case, the dataset contains points p_0, p_1, p_2, p_3, p_4, which belong to four non-empty cells: c(0, 3), c(2, 3), c(0, 1), c(3, 0). For each of these cells, LVA-Index contains a list of the points belonging to the cell and a list of the cell's nearest non-empty neighbor layers L_i, i = 1, 2. The 0-th layers are not stored in LVA-Index.

When building LVA-Index, the following property is used.

Property 3.1. If a cell c'(x'_0, ..., x'_{d-1}), where x'_0, ..., x'_{d-1} are the coordinates of cell c', belongs to the k-th layer of another cell c(x_0, ..., x_{d-1}), where x_0, ..., x_{d-1} are the coordinates of cell c, then cell c belongs to the k-th layer of cell c'.
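A minimal sketch of how Property 3.1 can be exploited while building the index: whenever two non-empty cells are found to lie in each other's k-th layers, both references are stored in one step. The Cell class and the register helper are illustrative, not the thesis implementation.

from collections import defaultdict

class Cell:
    def __init__(self, coords):
        self.coords = coords              # grid coordinates of the cell
        self.points = []                  # points assigned to the cell
        self.layers = defaultdict(list)   # layer number -> non-empty cells in that layer

def register(c1, c2):
    """By Property 3.1, membership of c2 in the k-th layer of c1 implies
    membership of c1 in the k-th layer of c2, so both lists are updated at once."""
    k = max(abs(a - b) for a, b in zip(c1.coords, c2.coords))   # the layer number
    if k > 0:
        c1.layers[k].append(c2)
        c2.layers[k].append(c1)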

Example 3.13. In Figure 3.11, we have illustrated Property 3.1 for a sample two-dimensional dataset. Since the cell c(0, 2) belongs to L_2(c(2, 0)), c(2, 0) belongs to L_2(c(0, 2)).

Figure 3.11. An illustration of Property 3.1

ELVA-Index

There is no difference between the structures of LVA-Index and ELVA-Index. However, when building the latter, the number of layers to be stored is determined during index building. For more information about building ELVA-Index, please refer to the subchapter describing the IterativeELVABuild function.

3.2.3 Building LVA-Index

In this subchapter, we provide methods of building LVA-Index as well as ELVA-Index. The first method (SemiLVABuild) is a semi-naïve method that uses VA-File to determine the cells belonging to a given layer (Lasek, 2008), (Lasek, 2009a). The second one (IterativeLVABuild) is an efficient method which employs our iterative algorithm for determining the cells belonging to a given layer (Lasek, 2009b). The iterative method does not use VA-File.

Function SemiLVABuild

The semi-naïve method employs VA-File for determining the cells belonging to the i-th layer of a given cell.

Function 3.3 SemiLVABuild(D, LVA)
/* D - a set of data points
   LVA - a reference to the LVA-Index
   l - the number of layers stored in each cell */
for each point p in D do // main loop
   c = LVA.DetermineCell(p); // determine the cell for the given point
   p.cell = c; // save the reference to c in p
   c.points.add(p); // assign point p to c
   /* apply Property 3.1 */
   for (i = 1; i <= l; i++) do // the nested loop for updating the layers nearest to c
      L_i(c) = SemiGetLayerCells(LVA, c, i); // return the non-empty cells from the i-th layer of cell c
      forall c' in L_i(c) do
         L' = SemiGetLayerCells(LVA, c', i); // get the cells of the i-th layer with respect to cell c'
         if (c ∈ L') then // check if cell c belongs to the i-th layer of c'
            L_i(c').addNonEmptyCell(c); // add a reference to the non-empty cell c to the i-th layer of c'
         endif
      endfor
   endfor
endfor

The SemiLVABuild function takes two parameters: D (the dataset) and LVA (a reference to an empty LVA-Index). The function is composed of the main loop and the nested loop. For each point p from D, the cell c to which p belongs is determined. Then p is assigned to c, and the nearest layers of c are determined and updated according to Property 3.1. This is performed in the nested loop.

Definition 3.10. (the distance between an approximation of a point and a cell) The distance between an approximation a of a point and a cell c is denoted by celldist(a, c) and defined as follows:

celldist(a, c) = max( |a_0 - x_0|, |a_1 - x_1|, ..., |a_{d-1} - x_{d-1}| ),

where d is the number of dimensions, and a_j and x_j (0 ≤ j ≤ d-1) are the values of the approximation a and the coordinates of cell c in the j-th dimension, respectively. Since the approximation of a point is equivalent to the coordinates of the cell to which the point belongs, the function can also be used for computing distances between cells.

Function 3.4 CellDist(a, c)
/* d - the number of dimensions */
return max( |a_0 - c_0|, |a_1 - c_1|, ..., |a_{d-1} - c_{d-1}| );
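Definition 3.10 is simply the Chebyshev (maximum-coordinate) distance between coordinate vectors; a one-line sketch:

def cell_dist(a, c):
    """celldist(a, c): the maximum absolute coordinate difference between an
    approximation (or cell) a and a cell c (Definition 3.10)."""
    return max(abs(aj - cj) for aj, cj in zip(a, c))

print(cell_dist((1, 3), (3, 0)))   # max(|3 - 1|, |0 - 3|) = 3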

Example 3.14. In Figure 3.12, we have illustrated how the distance between given cells is computed using the CellDist function. Let us consider two two-dimensional cells: c(1, 3) and c(3, 0). In this case, the result of the CellDist function is as follows: celldist((1, 3), (3, 0)) = max(|3 - 1|, |0 - 3|) = max(2, 3) = 3.

The SemiGetLayerCells function returns the set of cells which belong to the k-th layer of a given cell. It takes several parameters, namely: a reference to the dataset D, a cell c, the number k of the layer to be returned, and a reference to the LVA-Index.

Figure 3.12. An illustration of computing the distance between given cells

Function 3.5 SemiGetLayerCells(D, c, k, LVA)
/* D - a dataset containing data points
   VA - a reference to the VA-File
   LVA - a reference to the LVA-Index
   k - the index of the layer to be returned */
LC = {};
for each approximation a in VA do
   l = CellDist(a, c); // get the lower bound distance
   if l = k then
      d = LVA.GetCell(a.coordinates); // get the cell from LVA by coordinates
      if d.points ≠ {} then // check if cell d is not empty
         LC.AddCell(d); // add a reference to cell d to the k-th layer of cell c
      endif
   endif
endfor
return LC;

The SemiGetLayerCells function is designed so that it uses VA-File to determine the cells belonging to the k-th layer of a given cell. This is achieved by scanning all approximations stored in VA-File and checking whether the result of the CellDist function is equal to k. CellDist uses Definition 3.10 for computing the value of the distance.

Function IterativeLVABuild

The IterativeLVABuild function is based on the following property.

Property 3.2. A cell c'(x'_0, ..., x'_{d-1}) with coordinates satisfying x_j - k ≤ x'_j ≤ x_j + k for each j belongs to the k-th layer of a cell c(x_0, ..., x_{d-1}) if the total number of its coordinates x'_j whose values are equal to either x_j - k or x_j + k is not less than 1, where d is the number of dimensions and 0 ≤ j ≤ d-1.

Figure 3.13. An illustration of computing the distance between given cells
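A sketch of an equivalent membership test: a cell lies in the k-th layer of c exactly when every coordinate stays within k of the corresponding coordinate of c and at least one coordinate hits a boundary value x_j - k or x_j + k.

def in_kth_layer(candidate, centre, k):
    """True if `candidate` belongs to the k-th layer of the cell `centre`."""
    diffs = [abs(a - b) for a, b in zip(candidate, centre)]
    return max(diffs) <= k and k in diffs     # within the cube and on its boundary

# Second layer of c(1, 3):
print(in_kth_layer((-1, 5), (1, 3), 2))   # True  -- both coordinates are boundary values
print(in_kth_layer((0, 2), (1, 3), 2))    # False -- this cell lies in the first layer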

Example 3.15. Let us consider the cells in Figure 3.13 and determine which of them belong to the second layer of c(1, 3). In this example, the values of the corresponding variables are as follows: d = 2, k = 2, x_0 = 1, x_1 = 3, x_0 - k = -1, x_0 + k = 3, x_1 - k = 1, x_1 + k = 5. According to the above calculations, if the value of a cell's coordinate is equal to -1 or 3 for j = 0, or to 1 or 5 for j = 1, then the cell belongs to the second layer of c(1, 3). In Table 3.4, we have presented 16 cells and, for each of them, calculated the number of such boundary coordinates. As a result, we have determined 9 out of the 16 cells as belonging to L_2(c(1, 3)).

Table 3.4. Checking which cells belong to the second layer of cell c(1, 3)

Below we present the IterativeLVABuild function, which differs from the SemiLVABuild function by using the IterativeGetLayerCells function in place of the SemiGetLayerCells function (Lasek, 2009b).

Function 3.6 IterativeLVABuild(D, LVA)
/* D - a set of data points
   LVA - a reference to the LVA-Index
   LVA.l - the number of layers stored in each cell */
for each point p in D do // main loop
   c = LVA.DetermineCell(p); // determine the cell for the given point
   p.cell = c; // save the reference to c in p
   c.points.add(p); // assign point p to c
   /* apply Property 3.1 */
   for (i = 1; i <= LVA.l; i++) do // the nested loop for updating the layers nearest to c
      L_i(c) = IterativeGetLayerCells(LVA, c, i); // return the non-empty cells from the i-th layer of cell c
      forall c' in L_i(c) do
         L' = IterativeGetLayerCells(LVA, c', i); // get the cells of the i-th layer with respect to cell c'
         if (c ∈ L') then // check if cell c belongs to the i-th layer of c'
            L_i(c').addNonEmptyCell(c); // add a reference to cell c to the i-th layer of c'
         endif
      endfor
   endfor
endfor

In order to determine the coordinates of the cells belonging to a given layer, the IterativeGetLayerCells function is called.

Function 3.7 IterativeGetLayerCells(k, c, d)
/* k - the number of the layer to be determined
   c - a given cell
   d - the number of dimensions
   LayerCells - a set storing the cells belonging to the k-th layer; empty initially
   T - a table of size d for storing temporary coordinates of candidate cells */
kcount = d; // a temporary variable
NumberOfCells = (2k+1)^d - (2k-1)^d; // the number of cells to be determined
InitTable(k, d, T); // initialization of T
for i = 0 to NumberOfCells do // the main loop
   if kcount != 0 then // the coordinates have at least one boundary value
      LayerCells.add(c(T[0] + c[0], ..., T[d-1] + c[d-1])); // a reference to a new cell having coordinates
   endif; // equal to T + c is added to the layer
   if kcount == 1 and T[0] == -k then // if kcount equals 1 and the value of the first coordinate
      T[0] = k; // in T is equal to -k, then the first coordinate is set to k
      continue; // and the current iteration of the main loop is stopped
   endif
   for j = 0 to d-1 do // T-loop; in this loop the table T is updated
      if T[j] == -k then // updating the value of kcount
         kcount = kcount - 1
      elseif T[j] == k - 1 then
         kcount = kcount + 1
      endif;
      if Increment(T, j, k) == false then // break the current loop
         break;
      endif
   endfor;
endfor;
return LayerCells;

IterativeGetLayerCells starts with the initialization of the table T (the InitTable function) and a special variable kcount. kcount stores the current number of coordinates in T having values equal to either -k or k, i.e. lying on the boundary of the k-th layer of cell c, for which the layer is currently determined.

Function 3.8 InitTable(k, d, T)
/* k - the number of the layer
   d - the number of dimensions
   T - the table to be initialized */
T = new table of size d;
for j = 0 to d-1 do
   T[j] = -k;
endfor;
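As a cross-check of the iterative enumeration, the same set of k-th-layer coordinates can be produced with a simple product-and-filter sketch (less economical than the Increment-based scheme above, but convenient for testing):

from itertools import product

def kth_layer(centre, k):
    """All cell coordinates of the k-th layer of `centre` (len(centre) dimensions)."""
    if k == 0:
        return [tuple(centre)]
    return [tuple(c + o for c, o in zip(centre, offsets))
            for offsets in product(range(-k, k + 1), repeat=len(centre))
            if max(abs(o) for o in offsets) == k]        # keep only the boundary of the cube

print(len(kth_layer((1, -1), 2)))   # 16 = (2*2 + 1)^2 - (2*2 - 1)^2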

IterativeGetLayerCells computes the coordinates of consecutive cells by using the Increment function, which increments the values of the elements of T depending on the values of k. In Example 3.16, we present a graphical explanation of how the Increment function is used in IterativeGetLayerCells.

Function 3.9 Increment(T, j, k)
/* k - the layer number
   T - a vector of coordinates
   j - the dimension index */
T[j] += 1;
if (T[j] > k) then
   T[j] = -k;
   return true;
else
   return false;
endif;

Figure 3.14. An illustration of generating the coordinates of the cells of the second layer (k = 2) of cell c(1, -1); when kcount = 1 and the first generated coordinate has its minimum value -k + c[0] = -1, the algorithm omits the cells that do not belong to the layer by jumping directly to the value k + c[0] = 3

Example 3.16. (generating layer cells coordinates) Given the number of dimensions d = 2, let us determine the coordinates of the cells of the second layer (k = 2) of the cell c(1, -1).

The IterativeGetLayerCells function starts with the cell having coordinates (-1, -3). In the next few steps, as shown in Figure 3.14, the coordinates (0, -3), (1, -3), (2, -3) and (3, -3) are generated. Then, after (3, -3) has been generated, the first coordinate wraps around, the value of T[1] is incremented, and the coordinates (-1, -2) are generated. Next, the condition kcount == 1 and T[0] == -k is fulfilled and thus the next generated coordinates are equal to (3, -2). Then, the two previous steps are repeated two more times and the following coordinates are generated: (-1, -1), (3, -1), (-1, 0), (3, 0). Finally, the 5 last cells' coordinates are created: (-1, 1), (0, 1), (1, 1), (2, 1) and (3, 1).

The IterativeELVABuild Function

We have also designed the ELVA-Index, in which the number of neighbor layers stored in each cell is not fixed, but depends on the local densities of the dataset. In other words, the number of layers stored in a cell is smaller if the neighborhood of the given cell is dense, and greater if the neighborhood of the given cell is sparse. The results of experiments using this version of the LVA-Index are presented and discussed in Subchapter 5.1. Below we present the IterativeELVABuild function, which is used to build the ELVA-Index. This function is similar to the previously presented functions for building LVA-Index.

Function 3.10 IterativeELVABuild(D, LVA, k)
/* k - the number of nearest neighbors to be found
   d - the number of dimensions
   D - the dataset */
forall p in D do
   cell = LVA.DetermineCell(p);
   AssignVector(cell, p);
   neighborsCount = 0;
   li = 0; // layer index
   while (neighborsCount < k and LC ≠ {}) do
      LC = IterativeGetLayerCells(LVA, cell, li);
      forall c' in LC do
         neighborsCount += |c'.points|;
         L' = IterativeGetLayerCells(LVA, c', li);
         if (cell ∈ L') then
            L_li(c').AddNonEmptyCell(cell);
         endif
      endfor
      li = li + 1;
   endwhile
endfor;

The difference is that the nested loop over the layers nearest to c:

for (i = 1; i <= LVA.l; i++) do

is replaced by the following while loop:

while (neighborsCount < k and LC ≠ {}) do

which iterates as long as the number of neighbors found so far (neighborsCount) is less than k and the set of cells belonging to the current layer is not empty. This implies that the number of iterations can vary from 1 (if the first layer contains enough points to satisfy neighborsCount >= k) to the maximum number of slices (when the dataset is sparse).

Example 3.17. In Figure 3.15, we have illustrated the difference between LVA-Index and ELVA-Index with respect to the number of scanned nearest layers.

Figure 3.15. The difference in building LVA-Index (a) and ELVA-Index (b): a) the number of nearest layers to be scanned is fixed; b) the number of nearest layers is determined dynamically

3.2.4 Searching for Neighbors Using LVA-Index

Searching in both LVA-Index and ELVA-Index is performed using the LVA-Index Simple Search Algorithm (LSSA). The idea of this function is the same as in SSA: the InitCandidates function initializes the array of length k which will contain the neighbor points after the search is finished, and the Candidate function checks whether a currently processed point is really a candidate to be one of the k nearest neighbors of point q. If the currently analyzed point is a candidate, then the Candidate function updates the neighbors list. However, contrary to SSA, the main loop of LSSA iterates only through the layers of the cell to which point q belongs, whereas previously the search function iterated through all approximations in VA-File.

Function 3.11 LSSA(D, q, k)
/* D - a dataset of points
   q - the query point for which the neighbors are to be found
   k - the number of neighbors to be found
   neighbors - a list of neighbors to be determined */
max = InitCandidates(neighbors, k, ∞); // the radius is set to ∞, so the area of search is not limited
foreach layer in q.cell.L do // the main loop of LSSA
   foreach cell in layer do
      p = first point in cell;
      if (lb(p, q) < max) then
         foreach r in cell do // r is a candidate point
            δ = dist(r, q);
            max = Candidate(r, δ, neighbors);
         endfor
      endif
   endfor
endfor
return neighbors having all coordinates different from ∞;


4 Clustering Using Triangle Inequality

In this chapter, we offer two clustering algorithms, namely TI-DBSCAN and TI-NBC. In both of them, we applied the triangle inequality property to improve the efficiency of calculating the neighborhoods of points.

4.1 Triangle Inequality in DBSCAN

In this section, we propose a new clustering algorithm called TI-DBSCAN (Kryszkiewicz & Lasek, 2010b,c). The result of clustering produced by our algorithm is the same as the one produced by DBSCAN. Taking into account that the most time-consuming operation in DBSCAN is the calculation of the ε-neighborhood of points, we propose to use the triangle inequality property for efficient exclusion of points that have no chance of belonging to the ε-neighborhood N_ε(p) of a given point p. In addition, we adopt the solution from (Kryszkiewicz & Skonieczny, 2005) that consists in removing a point from the analyzed set D as soon as it is found to be a core point. In TI-DBSCAN, each analyzed point is removed, irrespective of whether it is a core point or not.

4.1.1 Using the Triangle Inequality Property to Determine the ε-Neighborhood

Property 4.1. (triangle inequality property) For any three points p, q, r: dist(p, r) ≤ dist(p, q) + dist(q, r).

Let D be a set of points. It follows from the definition of the ε-neighborhood of a point in set D that q ∈ N_ε(p) if and only if dist(p, q) ≤ ε. Taking into account this observation and Property 4.1, we obtain the following theorem:

Theorem 4.1. (Kryszkiewicz & Lasek, 2010b,c) For any two points p, q ∈ D and any point r: if dist(q, r) - dist(p, r) > ε, then q ∉ N_ε(p).

Proof. Let dist(q, r) - dist(p, r) > ε (*). By Property 4.1, dist(q, r) ≤ dist(q, p) + dist(p, r), that is, dist(q, r) - dist(p, r) ≤ dist(p, q) (**). By (*) and (**), dist(p, q) > ε. Hence, q ∉ N_ε(p).

Example 4.1. In Figure 4.1, we have illustrated Theorem 4.1. It can be seen that, by the condition dist(q, r) - dist(p, r) > ε, q does not belong to N_ε(p).

Figure 4.1. The triangle inequality property: dist(q, r) - dist(p, r) > ε implies q ∉ N_ε(p)
Figure 4.2. An illustration of Theorem 4.2

It follows from Theorem 4.1 that when we know that the difference of the distances of two points p and q to some point r is greater than ε, then we are able to conclude that q does not belong to the non-punctured ε-neighborhood of point p (N_ε(p)) without calculating the actual distance between p and q. In the next theorem, we show how to exclude many points from checking whether they belong to the non-punctured ε-neighborhood of a given point.

Theorem 4.2. (Kryszkiewicz & Lasek, 2010b,c) Let r be any point and D be a set of points ordered in a non-decreasing way with respect to their distances to r. Let p be any point in D, q_f be a point following point p in D such that dist(q_f, r) - dist(p, r) > ε, and q_b be a point preceding point p in D such that dist(p, r) - dist(q_b, r) > ε. Then:
a) q_f and all points following q_f in D do not belong to N_ε(p).
b) q_b and all points preceding q_b in D do not belong to N_ε(p).
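A two-line sketch of the test that Theorem 4.1 allows: given only the precomputed distances of p and q to a reference point r, q can sometimes be rejected without ever computing dist(p, q). The numeric values below are illustrative.

def surely_not_in_eps_neighborhood(dist_q_r, dist_p_r, eps):
    """Theorem 4.1: if dist(q, r) - dist(p, r) > eps, then q is outside N_eps(p);
    by symmetry, the same holds when dist(p, r) - dist(q, r) > eps."""
    return abs(dist_q_r - dist_p_r) > eps

print(surely_not_in_eps_neighborhood(4.5, 3.2, 0.5))   # True: q can be discarded
print(surely_not_in_eps_neighborhood(3.1, 3.2, 0.5))   # False: dist(p, q) must be computed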

Figure 4.3. The sample set D of points

Table 4.1. Ordered set of points from Figure 4.3 with their distances to the reference point r = (0, 0)

q    X    Y    distance(q, r)
K    0.9  0.0  0.9
L    1.0  1.5  1.8
G    0.0  2.4  2.4
H    2.4  2.0  3.1
F    1.1  3.0  3.2
C    2.8  3.5  4.5
A    4.2  4.0  5.8
B    5.9  3.9  7.1

Example 4.2. Figure 4.3 shows a sample set D of two-dimensional points. Table 4.1 illustrates the same set ordered in a non-decreasing way with respect to the distance of its points to the point r = (0, 0). Let us consider point F. We will be interested in determining the points that cannot belong to N_ε(F), provided ε = 0.5. We note that dist(F, r) = 3.2. Then the first point following point F in D such that dist(q, r) - dist(F, r) > ε is point

C (4.5 - 3.2 > 0.5), and the first point preceding point F in D such that dist(F, r) - dist(q, r) > ε is G (3.2 - 2.4 > 0.5). By Theorem 4.1, points C and G do not belong to N_ε(F). In addition, by Theorem 4.2, we also know without any checking that all points following point C in set D (that is, points A and B) and all points preceding point G in set D (that is, points L and K) do not belong to N_ε(F). In addition, H is the only point that has a chance to belong to N_ε(F), and it is the only point for which it is necessary to calculate its actual distance to F in order to determine N_ε(F) properly. In the sequel, a point to which the distances from all points in D are known will be called a reference point.

4.1.2 Optimizing DBSCAN by Using the Triangle Inequality with Respect to a Reference Point

In general, the layout of our TI-DBSCAN algorithm is similar to the layout of DBSCAN. The differences between TI-DBSCAN and DBSCAN are as follows:
- TI-DBSCAN calculates the distances of all points in a given set of points D to some reference point r, e.g. the point with all coordinates equal to 0;
- TI-DBSCAN stores the number of neighbors of each point in D in the field NeighborsNo (initialized to 1 to indicate that a point itself belongs to its ε-neighborhood);
- TI-DBSCAN stores for each point in D the information about points that turned out not to be core, but neighboring. The information is stored in a point's field called Border;
- TI-DBSCAN sorts all points in D in a non-decreasing way with respect to their distance to the reference point r;
- TI-DBSCAN invokes the TI-ExpandCluster function instead of ExpandCluster in order to create clusters.
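Before the pseudo-code, a minimal Python sketch of the ε-neighborhood query that TI-Neighborhood performs on the set sorted by distance to the reference point. It is a simplification of the scheme below (it does not move points to D' and does not maintain the NeighborsNo and Border fields); the helper names are illustrative.

import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def ti_neighborhood(ordered, i, eps):
    """Points within eps of ordered[i][1] (excluding it), where `ordered` is a list of
    (dist_to_reference, point) pairs sorted non-decreasingly by the first component."""
    d_p, p = ordered[i]
    neighbors = []
    for j in range(i - 1, -1, -1):             # backward scan
        d_q, q = ordered[j]
        if d_p - d_q > eps:                    # Theorem 4.2: stop, all earlier points excluded
            break
        if dist(p, q) <= eps:
            neighbors.append(q)
    for j in range(i + 1, len(ordered)):       # forward scan
        d_q, q = ordered[j]
        if d_q - d_p > eps:                    # Theorem 4.2: stop, all later points excluded
            break
        if dist(p, q) <= eps:
            neighbors.append(q)
    return neighbors

# Points from Table 4.1, sorted by distance to r = (0, 0); F is at index 4
pts = [(0.9, 0.0), (1.0, 1.5), (0.0, 2.4), (2.4, 2.0), (1.1, 3.0),
       (2.8, 3.5), (4.2, 4.0), (5.9, 3.9)]
ordered = sorted((dist(p, (0.0, 0.0)), p) for p in pts)
print(ti_neighborhood(ordered, 4, 0.5))        # [] -- only H is even tested, as in Example 4.2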

Algorithm 4.1 TI-DBSCAN(set of points D, ε, MinPts)
/* assert: r denotes a reference point, */
/* e.g. the point with all coordinates equal to 0 */
ClusterId = label of first cluster;
for each point p in set D do
   p.ClusterId = UNCLASSIFIED;
   p.dist = Distance(p, r);
   p.NeighborsNo = 1;
   p.Border = {};
endfor
sort all points in D non-decreasingly with respect to attribute dist;
for each point p in the ordered set D starting from the first point until the last point in D do
   if TI-ExpandCluster(D, D', p, ClusterId, ε, MinPts) then
      ClusterId = NextId(ClusterId)
   endif
endfor
return D' // D' is a clustered set of points

The TI-ExpandCluster function is equivalent to DBSCAN's ExpandCluster function. The main differences between TI-ExpandCluster and ExpandCluster are as follows:
- TI-ExpandCluster requires the set of points D to be ordered in a non-decreasing way with respect to the distances of the points in D from a reference point r.
- TI-ExpandCluster calls the TI-Neighborhood function to determine the ε-neighborhood of a point p in D (more precisely, TI-Neighborhood returns N_ε(p) \ {p} in D). TI-Neighborhood uses the ordering of the points in D to efficiently identify points that are likely to belong to the ε-neighborhood of p by means of the triangle inequality property.
- TI-ExpandCluster moves each analyzed point p from the set D to another set D' after its ε-neighborhood in D is found. As a result, the ε-neighborhood of each analyzed point is calculated in a more and more reduced set. Clearly, an ε-neighborhood determined in a reduced set will not contain the neighboring points that were already moved from D to D'. In order to determine the real size of the ε-neighborhood of a point in a reduced set correctly, its auxiliary NeighborsNo field is used. Whenever a point is moved from D to D', the NeighborsNo field of each of its neighboring points in D is incremented. As a result, the sum of the size of the ε-neighborhood of a point p found in the reduced D and the value of the NeighborsNo field of point p equals the size of N_ε(p) in the original non-reduced set.
- TI-ExpandCluster treats in a special way points that turned out not to be core ones, but are not guaranteed to be noise; that is, the points passed to TI-ExpandCluster that turn out to be non-core. Such points temporarily get a label called NOISE. Some of them will later be identified as border points of some cluster. In order not to lose the information

about such points after moving them from D to D', the information about them is stored in the Border fields of the points belonging to their ε-neighborhoods. The label NOISE is removed as soon as a point having such a label is assigned to a respective cluster.
- TI-ExpandCluster calculates the ε-neighborhood of each point only once.

The TI-Neighborhood function takes the ordered point set D, a point p in D, and ε as input parameters. It returns N_ε(p) \ {p} as the set-theoretical union of the point sets found by the TI-Backward-Neighborhood and TI-Forward-Neighborhood functions.

Function 4.1 TI-ExpandCluster(D, D', point p, ClId, ε, MinPts)
/* assert: TI-Neighborhood does not include p */
seeds = TI-Neighborhood(D, p, ε);
p.NeighborsNo = p.NeighborsNo + |seeds|; // include p itself
if p.NeighborsNo < MinPts then
   p.ClusterId = NOISE;
   for each point q in seeds do
      append p to q.Border;
      q.NeighborsNo = q.NeighborsNo + 1;
   endfor
   move p from D to D'; // D' stores analyzed points
   return false
else
   p.ClusterId = ClId;
   for each point q in seeds do
      q.ClusterId = ClId;
      q.NeighborsNo = q.NeighborsNo + 1;
   endfor
   for each point q in p.Border do
      D'.q.ClusterId = ClId; // assign cluster id to q in D'
      delete q from p.Border;
   endfor
   move p from D to D'; // D' stores analyzed points
   while |seeds| > 0 do
      curPoint = first point in seeds;
      curSeeds = TI-Neighborhood(D, curPoint, ε);
      curPoint.NeighborsNo = curPoint.NeighborsNo + |curSeeds|;
      if curPoint.NeighborsNo < MinPts then
         for each point q in curSeeds do
            q.NeighborsNo = q.NeighborsNo + 1
         endfor
         for each point q in curPoint.Border do
            delete q from curPoint.Border;
         endfor
      else
         for each point q in curSeeds do
            q.NeighborsNo = q.NeighborsNo + 1;
            if q.ClusterId = UNCLASSIFIED then
               q.ClusterId = ClId;
               move q from curSeeds to seeds;
            else
               delete q from curSeeds;
            endif
         endfor
         for each point q in curPoint.Border do
            D'.q.ClusterId = ClId; // assign ClusterId to q in D'

            delete q from curPoint.Border;
         endfor
      endif
      move curPoint from D to D'; // D' stores analyzed points
      delete curPoint from seeds;
   endwhile
   return true;
endif

TI-Backward-Neighborhood examines the points preceding the currently analyzed point p, for which the ε-neighborhood is to be determined. The function applies Theorem 4.1 to identify the first point, say q_b, preceding p in D such that dist(p, r) - dist(q_b, r) > ε. All points preceding q_b in D are not checked any more, since they are guaranteed not to belong to N_ε(p) (by Theorem 4.2). The points that precede p and follow q_b in D have a chance to belong to N_ε(p). For these points, it is necessary to calculate their actual distance to p. If the Euclidean distance is used, the square of the distance and the square of ε are applied, to make the calculations more efficient. The TI-Backward-Neighborhood function returns all such points with a distance to p not exceeding ε. The TI-Forward-Neighborhood function is analogous to TI-Backward-Neighborhood. Unlike TI-Backward-Neighborhood, TI-Forward-Neighborhood examines the points following the currently analyzed point p, for which the ε-neighborhood is to be determined. The TI-Forward-Neighborhood function returns all points with a distance to p not exceeding ε.

Function 4.2 TI-Neighborhood(D, point p, ε)
return TI-Backward-Neighborhood(D, p, ε) ∪ TI-Forward-Neighborhood(D, p, ε)

Function 4.3 TI-Backward-Neighborhood(D, point p, ε)
/* assert: D is ordered non-decreasingly with respect to dist */
seeds = {};
backwardThreshold = p.dist - ε;
for each point q in the ordered set D starting from the point immediately preceding point p until the first point in D do
   if q.dist < backwardThreshold then // p.dist - q.dist > ε?
      break;
   endif
   if dist(q, p) ≤ ε then
      append q to seeds;
   endif
endfor
return seeds

Function 4.4 TI-Forward-Neighborhood(D, point p, ε)
/* assert: D is ordered non-decreasingly with respect to dist */
seeds = {};
forwardThreshold = ε + p.dist;
for each point q in the ordered set D starting from the point immediately following point p until the last point in D do
   if q.dist > forwardThreshold then // q.dist - p.dist > ε?
      break;
   endif
   if dist(q, p) ≤ ε then
      append q to seeds;
   endif
endfor
return seeds

4.1.3 Optimizing DBSCAN by Using the Triangle Inequality with Respect to Many Reference Points

One may consider the usage of many reference points instead of one for reasoning about the distances among pairs of points by means of the triangle inequality property. In fact, the changes are required only in the functions TI-DBSCAN, TI-Backward-Neighborhood and TI-Forward-Neighborhood. Below we provide the modified versions of these functions. The introduced changes are highlighted in the code.

In our version of TI-DBSCAN with many reference points (TI-DBSCAN-REF), all reference points are stored in the array RefPoints. In addition, each point p in the set of points D to be clustered is associated with the array Dists, storing the distances between p and the reference points stored in RefPoints. The first element of Dists (that is, Dists[1]) plays the same role as the field dist in the version of TI-DBSCAN with one reference point. Consequently, all points in D are sorted in a non-decreasing way with respect to their distances Dists[1] to the first reference point RefPoints[1].

Algorithm 4.2 TI-DBSCAN-REF(set of points D, ε, MinPts)
store the reference points in table RefPoints;
ClusterId = label of first cluster;
for each point p in set D do
   p.ClusterId = UNCLASSIFIED;
   for i = 1 to |RefPoints| do
      p.Dists[i] = dist(p, RefPoints[i]);
   endfor
   p.NeighborsNo = 1;
   p.Border = {};
endfor

sort all points in D non-decreasingly with respect to attribute Dists[1];
for each point p in the ordered set D starting from the first point until the last point in D do
   if TI-ExpandCluster(D, D', p, ClusterId, ε, MinPts) then
      ClusterId = NextId(ClusterId)
   endif
endfor
return D' // D' is a clustered set of points

Function 4.5 TI-Backward-Neighborhood(D, point p, ε, MinPts)
/* assert: D is ordered non-decreasingly w.r.t. Dists[1] */
seeds = {};
backwardThreshold = p.Dists[1] - ε;
for each point q in the ordered set D starting from the point immediately preceding point p until the first point in D do
   if q.Dists[1] < backwardThreshold then
      break;
   endif
   candidateNeighbor = true;
   i = 2;
   while candidateNeighbor and (i ≤ |p.Dists|) do
      if |q.Dists[i] - p.Dists[i]| > ε then
         candidateNeighbor = false
      else
         i = i + 1
      endif
   endwhile
   if candidateNeighbor then
      if dist(q, p) ≤ ε then
         append q to seeds;
      endif
   endif
endfor
return seeds

The many-reference-points version of the TI-Backward-Neighborhood function differs from the one-reference-point version only in the treatment of the points in D that were not excluded from the analysis carried out by means of the triangle inequality property applied to the first reference point RefPoints[1], a given point p, and a point q being a candidate for a neighbor of point p; that is, based on the Dists[1] fields of p and q. In the one-reference-point version of TI-Backward-Neighborhood, checking the neighborhood between point p and a non-excluded point q requires the calculation of the actual distance between the two points. In the many-reference-points version, this calculation is deferred; namely, the actual distance calculation between p and q is carried out only when the usage of the triangle inequality property with respect to all remaining reference points in RefPoints does not invalidate q as a neighbor of p (based on the respective Dists[i] fields of p and q, i ≥ 2). Otherwise, if at least one of the remaining reference points in

RefPoints invalidates q as a neighbor of p, then q is found not to be a neighbor of p and the distance between p and q need not be calculated.

The many-reference-points version of the TI-Forward-Neighborhood function applies an analogous adaptation as in the case of TI-Backward-Neighborhood.

Function 4.6 TI-Forward-Neighborhood(D, point p, ε, MinPts)
/* assert: D is ordered non-decreasingly with respect to Dists[1] */
seeds = {};
forwardThreshold = ε + p.Dists[1];
for each point q in the ordered set D starting from the point immediately following point p until the last point in D do
   if q.Dists[1] > forwardThreshold then
      break;
   endif
   candidateNeighbor = true;
   i = 2;
   while candidateNeighbor and (i ≤ |p.Dists|) do
      if |q.Dists[i] - p.Dists[i]| > ε then
         candidateNeighbor = false
      else
         i = i + 1
      endif
   endwhile
   if candidateNeighbor then
      if dist(q, p) ≤ ε then
         append q to seeds;
      endif
   endif
endfor
return seeds

4.2 Triangle Inequality in NBC

In this section, we propose a new clustering algorithm called TI-NBC (Kryszkiewicz & Lasek, 2010a), (Kryszkiewicz & Lasek, 2011). TI-NBC produces the same clustering as NBC, but much more quickly. Unlike NBC, TI-NBC does not use VA-File. In order to reduce the neighborhood search space, it employs the triangle inequality property.

4.2.1 Efficient Determination of k-Neighborhoods

Now, we will present a theoretical basis useful for determining the punctured k-neighborhood kNB(p) of any point p.

Theorem 4.3. (Kryszkiewicz & Lasek, 2010a) Let r be any point and D be a set of points ordered in a non-decreasing way with respect to their distances to r. Let p be any point in D, ε be a value such that the punctured ε-neighborhood of p contains at least k points, q_f be a point following point p in D such that dist(q_f, r) - dist(p, r) > ε, and q_b be a point preceding point p in D such that dist(p, r) - dist(q_b, r) > ε. Then:
a) q_f and all points following q_f in D do not belong to kNB(p).
b) q_b and all points preceding q_b in D do not belong to kNB(p).

Proof. Let r be any point and D be a set of points ordered in a non-decreasing way with respect to their distances to r.
a) Let p be any point in D, ε be a value such that the punctured ε-neighborhood of p contains at least k points (*), and q_f be a point following point p in D such that dist(q_f, r) - dist(p, r) > ε. Then, by Theorem 4.2a, q_f and all points following q_f in D do not belong to N_ε(p). Hence, there are at least k points different from p that are distant from p by no more than ε (by *), while q_f and all points following q_f in D are distant from p by more than ε. Thus, q_f and all points following q_f in D do not belong to kNB(p).
b) The proof is analogous to the proof of Theorem 4.3a.

Example 4.3. Let r be the point (0, 0). Figure 4.3 shows a sample set D of two-dimensional points. Table 4.1 illustrates the same set ordered in a non-decreasing way with respect to the distance of its points to point r. Let us consider the determination of kNB(F) for k = 3. Let us assume that we have calculated the distances between F and points H, G and L, respectively, and they are as follows: dist(F, H) = 1.64, dist(F, G) = 1.25, dist(F, L) = 1.50. Let ε = max(dist(F, H), dist(F, G), dist(F, L)); i.e., ε = 1.64. This means that H, G, L ∈ N_ε(F) \ {F}. Thus, F has at least 3 points different from itself in

its punctured ε-neighborhood. Now, we note that the first point following point F in D such that dist(q_f, r) - dist(F, r) > ε is point A (5.8 - 3.2 > 1.64), and the first point preceding point F in D such that dist(F, r) - dist(q_b, r) > ε is K (3.2 - 0.9 > 1.64). By Theorem 4.3, the points A and K, as well as the points that follow A in D (here, point B) and precede K in D (here, no point precedes K), do not belong to kNB(F).

4.2.2 Building a k-Neighborhood Index by Using the Triangle Inequality

In this section, we present the TI-k-Neighborhood-Index algorithm, which uses Theorem 4.3 to determine the punctured k-neighborhoods of all points in a given dataset D and store them as a k-neighborhood index. The algorithm starts with calculating the distance of each point in D to a reference point r, e.g. to the point with all coordinates equal to 0. Then the points are sorted with respect to their distance to r. Next, for each point p in D, the TI-k-Neighborhood function is called, which returns kNB(p). The function first identifies those k points following and preceding point p in D for which the difference between dist(p, r) and their own distance to r is least. These points are considered as candidates for the k nearest neighbors of p. Then the radius ε is calculated as the maximum of the real distances of these points to p. It is guaranteed that the real k nearest neighbors lie within this radius from point p. Then the remaining points preceding and following point p (starting from those whose distance to r is closest to dist(p, r)) are checked as potential nearest neighbors of p until the conditions specified in Theorem 4.3 are fulfilled. If so, no other points in D are checked, as they are guaranteed not to belong to kNB(p). In order to speed up the algorithm, the value of ε is modified each time a new candidate for a nearest neighbor is identified.

Algorithm 4.3 TI-k-Neighborhood-Index(set of points D, k)
/* assert: r denotes a reference point, e.g. with all coordinates = 0 */
/* assert: there are more than k points in D */
foreach point p in set D do
   p.dist = Distance(p, r)
endfor;
sort all points in D non-decreasingly with respect to attribute dist;
foreach point p in the ordered set D starting from the first point until the last point in D do
   insert (position of point p, TI-k-Neighborhood(D, p, k)) into the k-neighborhood index
endfor
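A simplified, runnable sketch of what TI-k-Neighborhood computes for a single point: gather the k candidates closest in the reference ordering, take ε as the largest real distance among them, and then verify the remaining points until Theorem 4.3 allows the scan to stop. It keeps exactly k candidates and ignores the tie handling of Functions 4.13-4.14; the names are illustrative.

import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def ti_knn(ordered, i, k):
    """k nearest neighbors of ordered[i][1]; `ordered` holds (dist_to_reference, point)
    pairs sorted non-decreasingly by the first component."""
    d_p, p = ordered[i]
    cand, lo, hi = [], i - 1, i + 1
    while len(cand) < k and (lo >= 0 or hi < len(ordered)):
        if hi >= len(ordered) or (lo >= 0 and d_p - ordered[lo][0] <= ordered[hi][0] - d_p):
            cand.append(ordered[lo][1]); lo -= 1      # take the nearer side in the ordering
        else:
            cand.append(ordered[hi][1]); hi += 1
    best = sorted((dist(p, q), q) for q in cand)      # (distance, point), ascending
    eps = best[-1][0]                                 # all true k-NN lie within eps of p
    while lo >= 0 and d_p - ordered[lo][0] <= eps:    # backward verification (Theorem 4.3)
        d = dist(p, ordered[lo][1])
        if d < eps:
            best[-1] = (d, ordered[lo][1]); best.sort(); eps = best[-1][0]
        lo -= 1
    while hi < len(ordered) and ordered[hi][0] - d_p <= eps:   # forward verification
        d = dist(p, ordered[hi][1])
        if d < eps:
            best[-1] = (d, ordered[hi][1]); best.sort(); eps = best[-1][0]
        hi += 1
    return [q for _, q in best]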

Function 4.7 TI-k-Neighborhood(D, point p, k)
b = p; f = p;
backwardSearch = PrecedingPoint(D, b);
forwardSearch = FollowingPoint(D, f);
k-Neighborhood = {};
i = 0;
Find-First-k-Candidate-Neighbours-Forward&Backward(D, p, b, f, backwardSearch, forwardSearch, k-Neighborhood, k, i);
Find-First-k-Candidate-Neighbours-Backward(D, p, b, backwardSearch, k-Neighborhood, k, i);
Find-First-k-Candidate-Neighbours-Forward(D, p, f, forwardSearch, k-Neighborhood, k, i);
p.eps = max({e.dist | e ∈ k-Neighborhood});
Verify-k-Candidate-Neighbours-Backward(D, p, b, backwardSearch, k-Neighborhood, k);
Verify-k-Candidate-Neighbours-Forward(D, p, f, forwardSearch, k-Neighborhood, k);
return k-Neighborhood;

Function 4.8 PrecedingPoint(D, var point p)
if there is a point in D preceding p then
   p = the point immediately preceding p in D;
   backwardSearch = true
else
   backwardSearch = false
endif
return backwardSearch;

Function 4.9 FollowingPoint(D, var point p)
if there is a point in D following p then
   p = the point immediately following p in D;
   forwardSearch = true
else
   forwardSearch = false
endif
return forwardSearch;

Function 4.10 Find-First-k-Candidate-Neighbours-Forward&Backward(D, var point p, var point b, var point f, var backwardSearch, var forwardSearch, var k-Neighborhood, k, var i)
while backwardSearch and forwardSearch and (i < k) do
   if p.dist - b.dist < f.dist - p.dist then
      dist = Distance(b, p); i = i + 1;
      insert element e = (position of b, dist) into k-Neighborhood, keeping it sorted with respect to e.dist;
      backwardSearch = PrecedingPoint(D, b)
   else
      dist = Distance(f, p); i = i + 1;
      insert element e = (position of f, dist) into k-Neighborhood, keeping it sorted with respect to e.dist;
      forwardSearch = FollowingPoint(D, f)
   endif
endwhile

Function 4.11 Find-First-k-Candidate-Neighbours-Backward(D, var point p, var point b, var backwardSearch, var k-Neighborhood, k, var i)
while backwardSearch and (i < k) do
   dist = Distance(b, p); i = i + 1;
   insert element e = (position of b, dist) into k-Neighborhood, keeping it sorted with respect to e.dist;
   backwardSearch = PrecedingPoint(D, b)
endwhile

Function 4.12 Find-First-k-Candidate-Neighbours-Forward(D, var point p, var point f, var forwardSearch, var k-Neighborhood, k, var i)
while forwardSearch and (i < k) do
   dist = Distance(f, p); i = i + 1;
   insert element e = (position of f, dist) into k-Neighborhood, keeping it sorted with respect to e.dist;
   forwardSearch = FollowingPoint(D, f)
endwhile

Function 4.13 Verify-k-Candidate-Neighbours-Backward(D, var point p, var point b, var backwardSearch, var k-Neighborhood, k)
while backwardSearch and ((p.dist - b.dist) ≤ p.eps) do
   dist = Distance(b, p);
   if dist < p.eps then
      i = |{e ∈ k-Neighborhood : e.dist = p.eps}|;
      if |k-Neighborhood| - i ≥ k - 1 then
         delete each element e with e.dist = p.eps from k-Neighborhood;
         insert element e = (position of b, dist) into k-Neighborhood, keeping it sorted with respect to e.dist;
         p.eps = max({e.dist | e ∈ k-Neighborhood});
      else
         insert element e = (position of b, dist) into k-Neighborhood, keeping it sorted with respect to e.dist;
      endif
   elseif dist = p.eps then
      insert element e = (position of b, dist) into k-Neighborhood, keeping it sorted with respect to e.dist
   endif
   backwardSearch = PrecedingPoint(D, b)
endwhile

Function 4.14 Verify-k-Candidate-Neighbours-Forward(D, var point p, var point f, var forwardSearch, var k-Neighborhood, k)
while forwardSearch and ((f.dist - p.dist) ≤ p.eps) do
   dist = Distance(f, p);
   if dist < p.eps then
      i = |{e ∈ k-Neighborhood : e.dist = p.eps}|;
      if |k-Neighborhood| - i ≥ k - 1 then
         delete each element e with e.dist = p.eps from k-Neighborhood;
         insert element e = (position of f, dist) into k-Neighborhood, keeping it sorted with respect to e.dist;
         p.eps = max({e.dist | e ∈ k-Neighborhood});
      else
         insert element e = (position of f, dist) into k-Neighborhood, keeping it sorted with respect to e.dist;
      endif
   elseif dist = p.eps then
      insert element e = (position of f, dist) into k-Neighborhood, keeping it sorted with respect to e.dist
   endif
   forwardSearch = FollowingPoint(D, f)
endwhile

As in the case of the TI-DBSCAN algorithm, many reference points can also be used in TI-NBC. The application of many reference points would require introducing changes in the Verify-k-Candidate-Neighbours-Backward and Verify-k-Candidate-Neighbours-Forward functions analogous to the changes made in the TI-DBSCAN functions, namely TI-Backward-Neighborhood and TI-Forward-Neighborhood.

TI-NBC may also be optimized by applying the following observation (Kryszkiewicz & Lasek, 2011) related to estimating the distance within which all k nearest neighbors are guaranteed to be found: let p, q be points in dataset D and ε_p be the distance from p to its most distant nearest neighbor among its k nearest neighbors. Then all k nearest neighbors of q lie within the distance ε_p + dist(p, q) from q.


5 Experiments

This chapter is divided into several parts in which we describe the results of the experimental evaluation of the following tasks: comparison of the methods of building LVA-Index, searching for nearest neighbors using LVA-Index, using LVA-Index in the NBC clustering algorithm, and using the triangle inequality property in DBSCAN and NBC.

In this chapter, we adopted the convention that the names of datasets consist of three parts: a prefix which denotes the way of generating the dataset or is its name, the number of dimensions (2d, 3d, etc.) and the number of points in the dataset. For example: syn_2d_100 denotes that the dataset is two-dimensional and contains 100 points whose coordinates were randomly generated; sequoia_2d_1000 denotes that the name of the dataset is sequoia and it contains 1000 two-dimensional points.

For the purpose of all experiments presented in this chapter, the Euclidean distance measure was used.

5.1 The LVA-Index

Building LVA-Index

In the experiments, we have tested the two methods we proposed for building the LVA-Index: SemiLVABuild, the semi-naïve method that uses VA-File to determine the cells belonging to a given layer, and IterativeLVABuild, a more efficient method which employs our iterative algorithm for determining the cells belonging to a given layer (Lasek, 2008). In order to determine whether the IterativeLVABuild method is more efficient than SemiLVABuild, we have performed a number of experiments. It can be seen in Figure 5.1 and Figure 5.2 that the IterativeLVABuild method is 8 times faster than SemiLVABuild. This is related to the fact that SemiLVABuild uses VA-File, which

has to be scanned entirely many times, whereas IterativeLVABuild determines the cells belonging to a given layer using the efficient iterative approach.

Table 5.1. The runtimes [ms] of building LVA-Index using the SemiLVABuild and IterativeLVABuild methods for two- and three-dimensional synthetic datasets (syn_2d_* and syn_3d_*) of various sizes; the number of layers stored for each cell is equal to 3

Figure 5.1. Runtimes [ms] of building LVA-Index for subsets of dataset syn_2d_5000 (l = 3) using the SemiLVABuild and IterativeLVABuild methods, plotted against the number of points

Figure 5.2. Runtimes [ms] of building LVA-Index for subsets of dataset syn_3d_5000 (l = 3) using the SemiLVABuild and IterativeLVABuild methods, plotted against the number of points

Searching for Neighbors by Means of LVA-Index

The LVA-Index Simple Search Algorithm (LSSA) (Lasek, 2008) was proposed for searching for nearest neighbors in LVA-Index. In this subsection, we compare the efficiency of LSSA and of the Simple Search Algorithm (SSA) (Weber, Schek, & Blott, 1998), which was designed for searching in VA-File.

Table 5.2. Runtimes of searching for the punctured k-neighborhood using the LSSA and SSA methods (l = 3); for each of the test datasets (two-, three- and four-dimensional synthetic datasets of various sizes, syn_2d_*, syn_3d_* and syn_4d_*) and for k = 5, 10, 15, 20, 25, the runtimes of LSSA and SSA in milliseconds and their ratio are reported

We performed two series of experiments on the datasets from Table 5.2. First, we examined the performance of LSSA when changing the number of dimensions, and next, when increasing the number of points. The value of k was changed from 5 to 25 for each dataset. As shown in Table 5.2, increasing the value of k adversely affects the efficiency of both LSSA and SSA; the former is affected more.

5.2 Clustering

LVA-Index in NBC

This section presents the results of the experiments that we carried out in order to examine the efficiency of LVA-Index when used in the NBC algorithm. Due to the fact that the LVA-Index takes three parameters, namely: k, the number of nearest neighbors to be found, b, the number of bits per dimension, and l, the maximum number of layers used for storing the nearest layers, we had to perform a large number of experiments. For this reason, only the most interesting results are presented in this section. Additional results are reported in Appendix A.

Figure 5.3. The runtimes of clustering the test dataset using the NBC algorithm and different indices (original index, LVA-Index, VA-File, R-Tree), k = 20; for LVA-Index l = 3

We have also performed experiments for other indices, namely VA-File and R-Tree. These experiments were run on the dataset manual_2d_2658. The runtimes measured and presented in Figure 5.3 correspond to the clustering results presented in Figure 5.4. It can be seen that the shortest clustering time was achieved using the NBC implementation with LVA-Index.

Figure 5.4. The results of clustering the dataset manual_2d_2658 using NBC and different indices, k = 20: a) original index, b = 5, t = 3328 ms; b) LVA-Index, b = 7, n = 3, t = 813 ms; c) VA-File, b = 6; d) R-Tree, t = 2422 ms

In each of the experiments performed, which tested the original implementation of NBC and its versions using different indices (LVA-Index, VA-File, R-Tree), four clusters were found. There were minor differences between the clusters resulting from specific features and parameters of the indices, such as the number of bits per dimension (b), the maximum number of layers stored in the index (n), and the different index structures (R-Tree). Table 5.3 presents the parameters and the runtimes measured. We ran two series of experiments using the two-dimensional dataset: in the first series we varied the value of l from 3 to 1; in the second, we varied the value of b from 5 to 7.

Figure 5.5. The results of clustering the dataset manual_2d_2658, k = 20: a) b = 7, l = 3, t = 813 ms; b) b = 7, l = 2, t = 484 ms; c) b = 7, l = 1, t = 187 ms; d) b = 5, l = 1, t = 3328 ms; e) b = 6, l = 1, t = 828 ms; f) b = 7, l = 1, t = 187 ms


More information

DS504/CS586: Big Data Analytics Big Data Clustering II

DS504/CS586: Big Data Analytics Big Data Clustering II Welcome to DS504/CS586: Big Data Analytics Big Data Clustering II Prof. Yanhua Li Time: 6pm 8:50pm Thu Location: KH 116 Fall 2017 Updates: v Progress Presentation: Week 15: 11/30 v Next Week Office hours

More information

DBSCAN. Presented by: Garrett Poppe

DBSCAN. Presented by: Garrett Poppe DBSCAN Presented by: Garrett Poppe A density-based algorithm for discovering clusters in large spatial databases with noise by Martin Ester, Hans-peter Kriegel, Jörg S, Xiaowei Xu Slides adapted from resources

More information

A Review on Cluster Based Approach in Data Mining

A Review on Cluster Based Approach in Data Mining A Review on Cluster Based Approach in Data Mining M. Vijaya Maheswari PhD Research Scholar, Department of Computer Science Karpagam University Coimbatore, Tamilnadu,India Dr T. Christopher Assistant professor,

More information

Data Mining 4. Cluster Analysis

Data Mining 4. Cluster Analysis Data Mining 4. Cluster Analysis 4.5 Spring 2010 Instructor: Dr. Masoud Yaghini Introduction DBSCAN Algorithm OPTICS Algorithm DENCLUE Algorithm References Outline Introduction Introduction Density-based

More information

Clustering in Data Mining

Clustering in Data Mining Clustering in Data Mining Classification Vs Clustering When the distribution is based on a single parameter and that parameter is known for each object, it is called classification. E.g. Children, young,

More information

d(2,1) d(3,1 ) d (3,2) 0 ( n, ) ( n ,2)......

d(2,1) d(3,1 ) d (3,2) 0 ( n, ) ( n ,2)...... Data Mining i Topic: Clustering CSEE Department, e t, UMBC Some of the slides used in this presentation are prepared by Jiawei Han and Micheline Kamber Cluster Analysis What is Cluster Analysis? Types

More information

CS249: ADVANCED DATA MINING

CS249: ADVANCED DATA MINING CS249: ADVANCED DATA MINING Vector Data: Clustering: Part I Instructor: Yizhou Sun yzsun@cs.ucla.edu April 26, 2017 Methods to Learn Classification Clustering Vector Data Text Data Recommender System Decision

More information

Lesson 3. Prof. Enza Messina

Lesson 3. Prof. Enza Messina Lesson 3 Prof. Enza Messina Clustering techniques are generally classified into these classes: PARTITIONING ALGORITHMS Directly divides data points into some prespecified number of clusters without a hierarchical

More information

COMPARISON OF DENSITY-BASED CLUSTERING ALGORITHMS

COMPARISON OF DENSITY-BASED CLUSTERING ALGORITHMS COMPARISON OF DENSITY-BASED CLUSTERING ALGORITHMS Mariam Rehman Lahore College for Women University Lahore, Pakistan mariam.rehman321@gmail.com Syed Atif Mehdi University of Management and Technology Lahore,

More information

Cluster Analysis. Outline. Motivation. Examples Applications. Han and Kamber, ch 8

Cluster Analysis. Outline. Motivation. Examples Applications. Han and Kamber, ch 8 Outline Cluster Analysis Han and Kamber, ch Partitioning Methods Hierarchical Methods Density-Based Methods Grid-Based Methods Model-Based Methods CS by Rattikorn Hewett Texas Tech University Motivation

More information

Lecture 3 Clustering. January 21, 2003 Data Mining: Concepts and Techniques 1

Lecture 3 Clustering. January 21, 2003 Data Mining: Concepts and Techniques 1 Lecture 3 Clustering January 21, 2003 Data Mining: Concepts and Techniques 1 What is Cluster Analysis? Cluster: a collection of data objects High intra-class similarity Low inter-class similarity How to

More information

Lecture 7 Cluster Analysis: Part A

Lecture 7 Cluster Analysis: Part A Lecture 7 Cluster Analysis: Part A Zhou Shuigeng May 7, 2007 2007-6-23 Data Mining: Tech. & Appl. 1 Outline What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering

More information

CS570: Introduction to Data Mining

CS570: Introduction to Data Mining CS570: Introduction to Data Mining Scalable Clustering Methods: BIRCH and Others Reading: Chapter 10.3 Han, Chapter 9.5 Tan Cengiz Gunay, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber & Pei.

More information

Data Mining Algorithms

Data Mining Algorithms for the original version: -JörgSander and Martin Ester - Jiawei Han and Micheline Kamber Data Management and Exploration Prof. Dr. Thomas Seidl Data Mining Algorithms Lecture Course with Tutorials Wintersemester

More information

Unsupervised Learning

Unsupervised Learning Outline Unsupervised Learning Basic concepts K-means algorithm Representation of clusters Hierarchical clustering Distance functions Which clustering algorithm to use? NN Supervised learning vs. unsupervised

More information

Lezione 21 CLUSTER ANALYSIS

Lezione 21 CLUSTER ANALYSIS Lezione 21 CLUSTER ANALYSIS 1 Cluster Analysis What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods Hierarchical Methods Density-Based

More information

K-DBSCAN: Identifying Spatial Clusters With Differing Density Levels

K-DBSCAN: Identifying Spatial Clusters With Differing Density Levels 15 International Workshop on Data Mining with Industrial Applications K-DBSCAN: Identifying Spatial Clusters With Differing Density Levels Madhuri Debnath Department of Computer Science and Engineering

More information

CS570: Introduction to Data Mining

CS570: Introduction to Data Mining CS570: Introduction to Data Mining Clustering: Model, Grid, and Constraintbased Methods Reading: Chapters 10.5, 11.1 Han, Chapter 9.2 Tan Cengiz Gunay, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han,

More information

Clustering in Ratemaking: Applications in Territories Clustering

Clustering in Ratemaking: Applications in Territories Clustering Clustering in Ratemaking: Applications in Territories Clustering Ji Yao, PhD FIA ASTIN 13th-16th July 2008 INTRODUCTION Structure of talk Quickly introduce clustering and its application in insurance ratemaking

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

COMP5331: Knowledge Discovery and Data Mining

COMP5331: Knowledge Discovery and Data Mining COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified by Dr. Lei Chen based on the slides provided by Jiawei Han, Micheline Kamber, and Jian Pei 2012 Han, Kamber & Pei. All rights

More information

CS Data Mining Techniques Instructor: Abdullah Mueen

CS Data Mining Techniques Instructor: Abdullah Mueen CS 591.03 Data Mining Techniques Instructor: Abdullah Mueen LECTURE 6: BASIC CLUSTERING Chapter 10. Cluster Analysis: Basic Concepts and Methods Cluster Analysis: Basic Concepts Partitioning Methods Hierarchical

More information

DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm

DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm DATA MINING LECTURE 7 Hierarchical Clustering, DBSCAN The EM Algorithm CLUSTERING What is a Clustering? In general a grouping of objects such that the objects in a group (cluster) are similar (or related)

More information

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Cluster Analysis Mu-Chun Su Department of Computer Science and Information Engineering National Central University 2003/3/11 1 Introduction Cluster analysis is the formal study of algorithms and methods

More information

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A

MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A MultiDimensional Signal Processing Master Degree in Ingegneria delle Telecomunicazioni A.A. 205-206 Pietro Guccione, PhD DEI - DIPARTIMENTO DI INGEGNERIA ELETTRICA E DELL INFORMAZIONE POLITECNICO DI BARI

More information

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining, 2 nd Edition

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining, 2 nd Edition Data Mining Cluster Analysis: Advanced Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar Outline Prototype-based Fuzzy c-means

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Clustering: Part 1 Instructor: Yizhou Sun yzsun@ccs.neu.edu October 30, 2013 Announcement Homework 1 due next Monday (10/14) Course project proposal due next

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Chapter 10: Cluster Analysis: Basic Concepts and Methods Instructor: Yizhou Sun yzsun@ccs.neu.edu April 2, 2013 Chapter 10. Cluster Analysis: Basic Concepts and Methods Cluster

More information

Clustering Algorithms for High Dimensional Data Literature Review

Clustering Algorithms for High Dimensional Data Literature Review Clustering Algorithms for High Dimensional Data Literature Review S. Geetha #, K. Thurkai Muthuraj * # Department of Computer Applications, Mepco Schlenk Engineering College, Sivakasi, TamilNadu, India

More information

Working with Unlabeled Data Clustering Analysis. Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan

Working with Unlabeled Data Clustering Analysis. Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan Working with Unlabeled Data Clustering Analysis Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan chanhl@mail.cgu.edu.tw Unsupervised learning Finding centers of similarity using

More information

Density-Based Clustering of Polygons

Density-Based Clustering of Polygons Density-Based Clustering of Polygons Deepti Joshi, Ashok K. Samal, Member, IEEE and Leen-Kiat Soh, Member, IEEE Abstract Clustering is an important task in spatial data mining and spatial analysis. We

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Clustering: Part 1 Instructor: Yizhou Sun yzsun@ccs.neu.edu February 10, 2016 Announcements Homework 1 grades out Re-grading policy: If you have doubts in your

More information

Clustering Algorithms In Data Mining

Clustering Algorithms In Data Mining 2017 5th International Conference on Computer, Automation and Power Electronics (CAPE 2017) Clustering Algorithms In Data Mining Xiaosong Chen 1, a 1 Deparment of Computer Science, University of Vermont,

More information

UNIT V CLUSTERING, APPLICATIONS AND TRENDS IN DATA MINING. Clustering is unsupervised classification: no predefined classes

UNIT V CLUSTERING, APPLICATIONS AND TRENDS IN DATA MINING. Clustering is unsupervised classification: no predefined classes UNIT V CLUSTERING, APPLICATIONS AND TRENDS IN DATA MINING What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other

More information

CHAPTER 4: CLUSTER ANALYSIS

CHAPTER 4: CLUSTER ANALYSIS CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis

More information

Cluster Analysis. Ying Shen, SSE, Tongji University

Cluster Analysis. Ying Shen, SSE, Tongji University Cluster Analysis Ying Shen, SSE, Tongji University Cluster analysis Cluster analysis groups data objects based only on the attributes in the data. The main objective is that The objects within a group

More information

Data Clustering With Leaders and Subleaders Algorithm

Data Clustering With Leaders and Subleaders Algorithm IOSR Journal of Engineering (IOSRJEN) e-issn: 2250-3021, p-issn: 2278-8719, Volume 2, Issue 11 (November2012), PP 01-07 Data Clustering With Leaders and Subleaders Algorithm Srinivasulu M 1,Kotilingswara

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Introduction Often, we have only a set of features x = x 1, x 2,, x n, but no associated response y. Therefore we are not interested in prediction nor classification,

More information

Course Content. What is Classification? Chapter 6 Objectives

Course Content. What is Classification? Chapter 6 Objectives Principles of Knowledge Discovery in Data Fall 007 Chapter 6: Data Clustering Dr. Osmar R. Zaïane University of Alberta Course Content Introduction to Data Mining Association Analysis Sequential Pattern

More information

Heterogeneous Density Based Spatial Clustering of Application with Noise

Heterogeneous Density Based Spatial Clustering of Application with Noise 210 Heterogeneous Density Based Spatial Clustering of Application with Noise J. Hencil Peter and A.Antonysamy, Research Scholar St. Xavier s College, Palayamkottai Tamil Nadu, India Principal St. Xavier

More information

Scalable Varied Density Clustering Algorithm for Large Datasets

Scalable Varied Density Clustering Algorithm for Large Datasets J. Software Engineering & Applications, 2010, 3, 593-602 doi:10.4236/jsea.2010.36069 Published Online June 2010 (http://www.scirp.org/journal/jsea) Scalable Varied Density Clustering Algorithm for Large

More information

CHAPTER 4 AN IMPROVED INITIALIZATION METHOD FOR FUZZY C-MEANS CLUSTERING USING DENSITY BASED APPROACH

CHAPTER 4 AN IMPROVED INITIALIZATION METHOD FOR FUZZY C-MEANS CLUSTERING USING DENSITY BASED APPROACH 37 CHAPTER 4 AN IMPROVED INITIALIZATION METHOD FOR FUZZY C-MEANS CLUSTERING USING DENSITY BASED APPROACH 4.1 INTRODUCTION Genes can belong to any genetic network and are also coordinated by many regulatory

More information

CS490D: Introduction to Data Mining Prof. Chris Clifton. Cluster Analysis

CS490D: Introduction to Data Mining Prof. Chris Clifton. Cluster Analysis CS490D: Introduction to Data Mining Prof. Chris Clifton February, 004 Clustering Cluster Analysis What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods

More information

Chapter ML:XI (continued)

Chapter ML:XI (continued) Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained

More information

A Comparative Study of Various Clustering Algorithms in Data Mining

A Comparative Study of Various Clustering Algorithms in Data Mining Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IMPACT FACTOR: 6.017 IJCSMC,

More information

Unsupervised Learning. Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team

Unsupervised Learning. Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team Unsupervised Learning Andrea G. B. Tettamanzi I3S Laboratory SPARKS Team Table of Contents 1)Clustering: Introduction and Basic Concepts 2)An Overview of Popular Clustering Methods 3)Other Unsupervised

More information

Distance-based Methods: Drawbacks

Distance-based Methods: Drawbacks Distance-based Methods: Drawbacks Hard to find clusters with irregular shapes Hard to specify the number of clusters Heuristic: a cluster must be dense Jian Pei: CMPT 459/741 Clustering (3) 1 How to Find

More information

Machine Learning (BSMC-GA 4439) Wenke Liu

Machine Learning (BSMC-GA 4439) Wenke Liu Machine Learning (BSMC-GA 4439) Wenke Liu 01-25-2018 Outline Background Defining proximity Clustering methods Determining number of clusters Other approaches Cluster analysis as unsupervised Learning Unsupervised

More information

A Review: Techniques for Clustering of Web Usage Mining

A Review: Techniques for Clustering of Web Usage Mining A Review: Techniques for Clustering of Web Usage Mining Rupinder Kaur 1, Simarjeet Kaur 2 1 Research Fellow, Department of CSE, SGGSWU, Fatehgarh Sahib, Punjab, India 2 Assistant Professor, Department

More information

Course Content. Classification = Learning a Model. What is Classification?

Course Content. Classification = Learning a Model. What is Classification? Lecture 6 Week 0 (May ) and Week (May 9) 459-0 Principles of Knowledge Discovery in Data Clustering Analysis: Agglomerative,, and other approaches Lecture by: Dr. Osmar R. Zaïane Course Content Introduction

More information

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Classification Classification systems: Supervised learning Make a rational prediction given evidence There are several methods for

More information

Efficient Parallel DBSCAN algorithms for Bigdata using MapReduce

Efficient Parallel DBSCAN algorithms for Bigdata using MapReduce Efficient Parallel DBSCAN algorithms for Bigdata using MapReduce Thesis submitted in partial fulfillment of the requirements for the award of degree of Master of Engineering in Software Engineering Submitted

More information

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering Team 2 Prof. Anita Wasilewska CSE 634 Data Mining All Sources Used for the Presentation Olson CF. Parallel algorithms

More information

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM.

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM. Center of Atmospheric Sciences, UNAM November 16, 2016 Cluster Analisis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster)

More information

Metodologie per Sistemi Intelligenti. Clustering. Prof. Pier Luca Lanzi Laurea in Ingegneria Informatica Politecnico di Milano Polo di Milano Leonardo

Metodologie per Sistemi Intelligenti. Clustering. Prof. Pier Luca Lanzi Laurea in Ingegneria Informatica Politecnico di Milano Polo di Milano Leonardo Metodologie per Sistemi Intelligenti Clustering Prof. Pier Luca Lanzi Laurea in Ingegneria Informatica Politecnico di Milano Polo di Milano Leonardo Outline What is clustering? What are the major issues?

More information

Clustering Algorithms for Data Stream

Clustering Algorithms for Data Stream Clustering Algorithms for Data Stream Karishma Nadhe 1, Prof. P. M. Chawan 2 1Student, Dept of CS & IT, VJTI Mumbai, Maharashtra, India 2Professor, Dept of CS & IT, VJTI Mumbai, Maharashtra, India Abstract:

More information

An Enhanced Density Clustering Algorithm for Datasets with Complex Structures

An Enhanced Density Clustering Algorithm for Datasets with Complex Structures An Enhanced Density Clustering Algorithm for Datasets with Complex Structures Jieming Yang, Qilong Wu, Zhaoyang Qu, and Zhiying Liu Abstract There are several limitations of DBSCAN: 1) parameters have

More information

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned

More information

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University

More information

Kapitel 4: Clustering

Kapitel 4: Clustering Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases WiSe 2017/18 Kapitel 4: Clustering Vorlesung: Prof. Dr.

More information

A Survey on Data Clustering

A Survey on Data Clustering Vol. 2, No. 8, pp. 183-188, 2017 DOI: https://doi.org/10.24999/ijoaem/02080042 A Survey on Data Clustering Garima Singhal and Sahadev Roy Abstract Data Clustering is a method used to group the unlabelled

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

Keywords Clustering, Goals of clustering, clustering techniques, clustering algorithms.

Keywords Clustering, Goals of clustering, clustering techniques, clustering algorithms. Volume 3, Issue 5, May 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Survey of Clustering

More information

On Clustering Validation Techniques

On Clustering Validation Techniques On Clustering Validation Techniques Maria Halkidi, Yannis Batistakis, Michalis Vazirgiannis Department of Informatics, Athens University of Economics & Business, Patision 76, 0434, Athens, Greece (Hellas)

More information

CHAPTER 7. PAPER 3: EFFICIENT HIERARCHICAL CLUSTERING OF LARGE DATA SETS USING P-TREES

CHAPTER 7. PAPER 3: EFFICIENT HIERARCHICAL CLUSTERING OF LARGE DATA SETS USING P-TREES CHAPTER 7. PAPER 3: EFFICIENT HIERARCHICAL CLUSTERING OF LARGE DATA SETS USING P-TREES 7.1. Abstract Hierarchical clustering methods have attracted much attention by giving the user a maximum amount of

More information

Introduction to Clustering

Introduction to Clustering Introduction to Clustering Ref: Chengkai Li, Department of Computer Science and Engineering, University of Texas at Arlington (Slides courtesy of Vipin Kumar) What is Cluster Analysis? Finding groups of

More information

What is Cluster Analysis? COMP 465: Data Mining Clustering Basics. Applications of Cluster Analysis. Clustering: Application Examples 3/17/2015

What is Cluster Analysis? COMP 465: Data Mining Clustering Basics. Applications of Cluster Analysis. Clustering: Application Examples 3/17/2015 // What is Cluster Analysis? COMP : Data Mining Clustering Basics Slides Adapted From : Jiawei Han, Micheline Kamber & Jian Pei Data Mining: Concepts and Techniques, rd ed. Cluster: A collection of data

More information

Clustering Algorithm (DBSCAN) VISHAL BHARTI Computer Science Dept. GC, CUNY

Clustering Algorithm (DBSCAN) VISHAL BHARTI Computer Science Dept. GC, CUNY Clustering Algorithm (DBSCAN) VISHAL BHARTI Computer Science Dept. GC, CUNY Clustering Algorithm Clustering is an unsupervised machine learning algorithm that divides a data into meaningful sub-groups,

More information

A New Approach to Determine Eps Parameter of DBSCAN Algorithm

A New Approach to Determine Eps Parameter of DBSCAN Algorithm International Journal of Intelligent Systems and Applications in Engineering Advanced Technology and Science ISSN:2147-67992147-6799 www.atscience.org/ijisae Original Research Paper A New Approach to Determine

More information

Data Mining: Concepts and Techniques. Chapter March 8, 2007 Data Mining: Concepts and Techniques 1

Data Mining: Concepts and Techniques. Chapter March 8, 2007 Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques Chapter 7.1-4 March 8, 2007 Data Mining: Concepts and Techniques 1 1. What is Cluster Analysis? 2. Types of Data in Cluster Analysis Chapter 7 Cluster Analysis 3. A

More information

2. Background. 2.1 Clustering

2. Background. 2.1 Clustering 2. Background 2.1 Clustering Clustering involves the unsupervised classification of data items into different groups or clusters. Unsupervised classificaiton is basically a learning task in which learning

More information

CS Introduction to Data Mining Instructor: Abdullah Mueen

CS Introduction to Data Mining Instructor: Abdullah Mueen CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen LECTURE 8: ADVANCED CLUSTERING (FUZZY AND CO -CLUSTERING) Review: Basic Cluster Analysis Methods (Chap. 10) Cluster Analysis: Basic Concepts

More information

Unsupervised Learning : Clustering

Unsupervised Learning : Clustering Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex

More information

INF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22

INF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22 INF4820 Clustering Erik Velldal University of Oslo Nov. 17, 2009 Erik Velldal INF4820 1 / 22 Topics for Today More on unsupervised machine learning for data-driven categorization: clustering. The task

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 10

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 10 Data Mining: Concepts and Techniques (3 rd ed.) Chapter 10 Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University 2011 Han, Kamber & Pei. All rights

More information

University of Florida CISE department Gator Engineering. Clustering Part 5

University of Florida CISE department Gator Engineering. Clustering Part 5 Clustering Part 5 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville SNN Approach to Clustering Ordinary distance measures have problems Euclidean

More information

Data Mining: Concepts and Techniques. Chapter 7 Jiawei Han. University of Illinois at Urbana-Champaign. Department of Computer Science

Data Mining: Concepts and Techniques. Chapter 7 Jiawei Han. University of Illinois at Urbana-Champaign. Department of Computer Science Data Mining: Concepts and Techniques Chapter 7 Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign www.cs.uiuc.edu/~hanj 6 Jiawei Han and Micheline Kamber, All rights reserved

More information

Data Mining. Clustering. Hamid Beigy. Sharif University of Technology. Fall 1394

Data Mining. Clustering. Hamid Beigy. Sharif University of Technology. Fall 1394 Data Mining Clustering Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1394 1 / 31 Table of contents 1 Introduction 2 Data matrix and

More information

ECLT 5810 Clustering

ECLT 5810 Clustering ECLT 5810 Clustering What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Grouping

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 4th, 2014 Wolf-Tilo Balke and José Pinto Institut für Informationssysteme Technische Universität Braunschweig The Cluster

More information

GRID BASED CLUSTERING

GRID BASED CLUSTERING Cluster Analysis Grid Based Clustering STING CLIQUE 1 GRID BASED CLUSTERING Uses a grid data structure Quantizes space into a finite number of cells that form a grid structure Several interesting methods

More information

Chapter 4: Text Clustering

Chapter 4: Text Clustering 4.1 Introduction to Text Clustering Clustering is an unsupervised method of grouping texts / documents in such a way that in spite of having little knowledge about the content of the documents, we can

More information