Distributed Weighted Fuzzy C-Means Clustering Method with Encoding Based Search Structure for Large Multidimensional Dynamic Indexes

Size: px

Start display at page:

Download "Distributed Weighted Fuzzy C-Means Clustering Method with Encoding Based Search Structure for Large Multidimensional Dynamic Indexes"

Millicent Wilkins
6 years ago
Views:

1 World Engineering & Applied Sciences Journal 8 (2): 57-65, 2017 ISSN IDOSI Publications, 2017 DOI: /idosi.weasj Distributed Weighted Fuzzy C-Means Clustering Meod wi Encoding Based Search Structure for Large Multidimensional Dynamic Indexes Sunil Sunny Chalakkal, M. Rajalakshmi and R. Vijayakumar 1 Department of Computer Science, St Thomas College, Trichur, Kerala, India 2 Department of CSE / IT, Coimbatore Institute of Technology, Coimbatore, India 3 School of Computer Science, M.G.University, Kottayam, India Abstract: Searching of data in huge dimensional space is a major issue now days. In is paper different index structure and improved cluster Tree++ structures are used to speed up e search process. Wi e help of high dimensional spaces wi Distributed Weighted Fuzzy C-Means (DWFCM) clustering algorim. All data points are grouped into clusters rough a DWFCM clustering algorim. Dual distance Driven encoding (DDE) approach is used to acquire e uniform ID number of every data point. Where every cluster is partitioned in to two based on e centroid distance. Finally, a uniform index key is obtained by integrating e unique ID number and centroid displacement. In addition, encoding based indexing approach is added to e improved Cluster Tree++ for distributed multidimensional data point. Evaluation on e effectiveness and efficiency of e proposed scheme is done rough effective assessments, correspondingly e results states at is meod is much better an e existing algorims of Distributed Weighted Possibility-Means (DWPCM) and Weighted Possiblistic C-Means (WPCM). Key words: Clustering Indexing Encoding Fuzzy c-means Index key Dual-distance-Driven B+-tree Cluster Tree++ distributed Multi-dimensional data sets INTRODUCTION Latest review states, large volumes of data wi high dimensionality are generated in different fields. A lot of meods have been proposed for effectively indexing high-dimensional datasets for organized querying. When dimensionality scales up higher, e performance reduces wi e present meods which can support nearest neighbor search for low dimensional datasets. Along wi it, e process of adding new data can lead original structures to no longer maintain e dataset effectively, since it might considerably increase e amount of data accessed for a particular query. Amongst all ese meods, Cluster Tree [1] is e startup work towards e direction of creating an efficient index structure from clustering for high dimensional datasets. An analysis scheme for finding out e interesting data distributions and patterns in e underlying dataset is termed as clustering. Given a set of ndata points in a d dimensional space, e clustering meod provides e data points to groups in relation to e computation of e degree of similarity among data points in a group are extremely similar to each oer compared to e data points in different groups. Each and every group is a cluster. In order to help well-organized queries e Cluster Tree scheme constructs an index structure on e cluster structures. At e present time, a huge part of e data set is time associated; in addition to it availability of e outdated or unnecessary data in e data set shall critically degrade e dataset processing. On e oer hand, few schemes are formulated in order to device e management of e complete data set to keep away from e outdated data and maintain e dataset constantly in a better and updated position for e easiness and effectiveness of dataset process such as query and insertion. Wiout having to linearly search for e high-dimensional dataset The Cluster Tree can support e retrieval of e Corresponding Auor: Sunil Sunny Chalakkal, Department of Computer Science, St Thomas College, Trichur, Kerala, India. 57

2 neighbors effectively. The key aim is to reduce e dimensional point data structures and its variants are: The response time to e entire users query. It is e first hb-tree [6], e BD-Tree [7], e hybrid tree [8] and e scheme at uses e Cluster Tree for e purpose of quad tree. generating effective index Structure for high dimensional The binary search tree model includes The K-D Tree datasets. which initiates a recursive subdivision of e data spaceto In is paper, a symmetrical encoding meod is partitions rough (d-1)-dimensional hyper planes.the created by double partitioning. To deliver a key division most common problem to all e K-D schemes is at for of e uniform index key e uniform ID number of each some distributions hyper planes at partitions e data point can be gained. In order to facilitate highly efficient objects equally are not found. Using e is-oriented hyper DWFCM search a symmetrical encoding based Improved planes e K-D-Tree, e quad-tree [9] decomposes e cluster Tree++ (EIC-Tree++) is proposed. Using a universe. A major difference is at e quad-trees are not DWFCM clustering algorim all data are partitioned in to binary trees.until e amount of objects in every partition different clusters. Where all cluster spheres is segmented is under a given reshold e subspaces are e twice in relation wi e two distances of start and subspaces are disintegrated. Quad trees are not stabilized centroid distance. and all e sub trees of populated trees need to deeper Section 2 reviews e related work cluster tree and an sparingly populated areas leading a bad worst case index structure designs and clustering meods. Section behavior [10]. 3 describes innovative symmetrical encoding-based index Pyramid Technique [11] is e transformation based structure wi clustering approach suitable to distributed high dimensional indexing meod. Works better incase of multidimensional datasets. Section 4 focuses on e window based queries, here NB-tree [12, 13] and idistance processing of Cluster Tree results and section 5 presents [14] are used to manage e search process. For example e conclusion and future research progress. NB-Tree is a kind of single reference point based meod. A B+-tree is used for indexing e distances on Related Work: Different types of schemes are being which all e operations are carried out. framed in order to manage effectively wi The drawback of NB-Tree is at its inability to multidimensional data. In particular, Space Filling Curve reduce e region of search, when e volume of data (SFC) schemes take part an important role. Space Filling increase as well as e dimensionality shoot-up it will Curve (SFC) meods include Z-order curve [2], Hilbert directly affect e searching process, e efficiency also Curve [3] and gray codes [4] where data is divided onto decrees. idistance [14] is framed by selecting certain multidimensional pixels in relation wi e bottom reference points wi e intention of furer pruning. granularity and uses a curve at passes rough all e To increase e efficiency we are using e M-Tree [10], pixels in multidimensional space. An order of pixels in Omni-family [12] and empirically [14]. The efficiency of space is generated by is curve. This meod (ordering) data searching is mainly depends on e portion permits usage existing one dimensional access structures. mechanism, segmenting and clustering. For example, The B++Trees. Data representation on a part of e curve is enabled by e leaf pages of e access Proposed Clustering Based Indexing Structure: In is structure, generating a primary index where close by data section, e proposed clustering based indexed structure are clustered wi a high probability. The key drawback of is explained.the purpose of clustering e similar data e SFC scheme is at e CPU utilization rate is very points is done by e Distributed Weighted Fuzzy high and ey have issues such as high overlap among C-Means (DWFCM) algorim which includes encoding pages and e query interval. based indexing scheme employed for serving search For multidimensional data The UB-Tree [5] integrates query results are detailed. a space filling curve and a B+-Tree generates e primary The major concept of Clustering-based indexing index. Each leaf node indicates a block of data on a part of structures initially make use of clustering schemes wi e curve since it s a kind of paginated index. It divides e idea of clustering data points and make use of e space into linear segments of a Z Curve(or any estimation during search phase in such a way at e SFC).The problem of e UB-Tree is at it requires search shall be completed in e obtained clusters which amendments to e DBMS kernel for e purpose of likely includes e nearest neighbor of e query point. integration and like oer SFC s e partitions are not The architecture proposed for clustering based naturally not hyper cubic and ey may even represent structures is shown in e below figure.1 which includes disjoint space. The K-D-Tree is e most famous d- two phases e clustering phase and e search phase. 58

3 The proposed architecture of e clustering-based structures is shown in e following Fig. 1 Clusteringbased indexing structures includes two phases: clustering and search phase. Clustering Phase Data Distributed Multi dimensional Database Distributed weighted fuzzy c-means Clustering algorim (DWFCM) User Query World Eng. & Appl. Sci. J., 8 (2): 57-65, 2017 Clustered Result Search Phase (Indexed rough eir Centroid and indexing structure done using Tree approach) Cluster Tree++ Indexing approach wi encoding scheme Query based Result Weighted Point Calculation: The FCM is defined as; Calculation can be done using v i Presents e i cluster centroid and n presents e amount of examples and c indicates e volume of 2 clusters. d ik (x k, v i) = x k v i Presents e rule.at is stage e Euclidean distance meod is used. A weighted FCM algorim is taken into account which takes e weight of examples, clusters data into c partitions. Using Fig. 1: Overall architecture diagram of proposed system n d examples examples which are loaded in memory in e d as partial data access (PDA). Clustering Phase: In is section, a Distributed WFCM Case 1: d=1: When e value of d is equitable to one, algorim (DWFCM) is formulated in accordance wi in e first PDA, in is situation, As FCM WFCM is Map Reduce in is section. Based on e steps of identical. Successively after clustering, consider vi DWFCM, ere are two most important operations: indicate e cluster centroids, where 1 i c. Considering computing e degree of membership µ ij and computing U ij shows e membership values, where 1 i c and e clustering centers v i. In e map phase, e Map 1 i n d. Consider w shows e weights of e points in function is intended to compute e degree of membership e memory. In such a situation, all e entire n d points µ ij. have weight 1 since weighted points from previous PDC Loading a huge amount of data in to memory makes do not exist.at is point of time e memory is let free some problems as same as in [15]. In e case only a rough reducing e clustering result in to c weighted particular amount of data is loaded in to memory. points, which are indicated by e c cluster centroids v i, Suppose we are loading 1% of data into memory, we have where 1 i c and eir weights are shown below: to search 100 times scan to complete e entire search process. This meod is called partial data access (PDA). (3) The volume of data loaded for a given time will decide e amount of partial data access. In is meod subsequent to e first PDA, Using Fuzzy C Means (FCM) data is (4) grouped in to different clusters and reduced its weighted points. The additional mechanism we are going to Weights of e c points, after e process of implement is PDA. PDA is directly connected wi reduction e clustering results, in memory is shown weighted points. The major difference between e crisp below: followinge process of condensing e clustering [15] scenario and is meod is e absence of clusteringresults, in memory is as given below: fuzzy membership. For every PDA, new singleton points are added into memory. This process will continue till e (5) data is searched ones. The function of FCM is to accommodate weights. To get accurate clustering Result d It is observed at when n, In all e succeeding each data set will be closely connecting wi centroids. PDA (d > 1) new singleton points are added. Their indices This process of knowledge transmission permits for faster related wi w shall begin at c + 1 and finally end at n d + merging. c i.e. (1) (2) 59

4 For each n we are assigning a value of one. d World Eng. & Appl. Sci. J., 8 (2): 57-65, 2017 (6) Case 2: d > 1: In is case, On singleton points clustering will be applied by newly loading in e d PDA along wi (12) cweighted points following e condensation from (d 1) PDC. Resulting in n d + c points in e memory for For e purpose of e encoding scheme data point clustering by means of WFCM. The derived new nd clustering results are taken. In is meod, uniform, ID singleton points have weight of one.after e clustering number of all data point is taken wi e help of a process, e data in e memory (togeersingletons encoding mechanism. wi e help of unique ID and and weighted points) is reduced to c new weighted distance from e centroid, we create unique index key points. All e new weighted points are indicated rough B+tree. e c cluster v i, where 1 i c and eir weights are formulated as below: Symmetrical Dual Distance Encoding Scheme: Many observations has led to e initiation on design of improved cluster tree++. First we will measure e (7) unlikeness of e data points in association wi eire distance. Using e value we can create B+tree structure. To formulate e improved version of cluster tree++ we Subsequently e memory is going to free, we can can combine two distance metrics. Through e use of a revise wi is equation: novel symmetrical encoding approach for e purpose of getting a uniform index key expression, to furer diminish (8) e search region. Distance function, shown as d(v i, V j) starting Weighted Fuzzy C Mean: To take into consideration of distance is e distance between it and V o(0,0,..0), e weighted points e main objective of FCM is naturally given as; modified [16]. Each clustering group should have e c weighted point to accomplish perfect PDC. The closed SD(V i) = d(v i, V o) centroids and e WFCM calculated as follows: It is observed at n data points are clustered into T clusters, e O j of each and every cluster C j is derived in (9) e beginning, where j [1,T]. For calculating centroid distance, V i is e data point and distance from e centriod is O j. CD(V ) = d(v, O ) i i j where i [1, C ] and j [1, T]. (10) Dual distance encoding scheme DDE is used to get Wi e focus of reviewing cluster wi e objective a uniform index key by means of double slicing (t) (t) of cluster centers in parallel, two parameters, i and i is mechanism. Wi e assistance of a DWFCM clustering introduced, where t shows e serial number of data node. algorim, e start and centroid distances of every data Following After e process of calculating e point is computed, as a result V i can be designed as a four (t) (t) membership µ ij, Map function calculates i and i by tuple: using equation (11) and (12); V i :: = < i, CID, SD, CD > Here i indicates to e i data pointand CID shows (11) e cluster id. 60

5 Start Slice: Fora given cluster sphere (O, j CR) j e start slice of it is shown as SS(, j) centroid slice is a cluster sphere (O j, CR j) where µ [1, ] here, V :: = < i, CID, SD_ID, CD_ID > i In e given equation i is representing e ID number Vi, CID is e cluster ID, SD-ID means e starting slice ID number and CD-ID is e centoid distance ID number. From at we can form e equation given below; when shows an even integer and /2+1=. This results, a uniform ID number of V which can be indicated as linearly integrating esd_id and CD_ID as shown i below: UID(V ) = c SD_ID(V ) + CD_ID(V ) i i i In relation to e encoding scheme, e uniform encoding identifier of each data point can be obtained rough Algorim 1. Algorim 1. Problem: Calculate e points DDE (1 to n) from e set, Starting cluster is, e centroid slice is. Steps: 1. Using DWFCM clustering algorim, e data points in e set are grouped into T number of clusters. 2. Set j=1 3. Repeat steps 4 rough 9 until j<=t 4. Repeat steps 5 rough 8 while V i (O j, CR j) 5. Compute e centroid distance and start distance of each point. 6. finding out e result of e Eq.(13) 7. construct e unique ID of V i(dde(v i)). 8. Take next V i 9. Increment j by one 8. End The data structure is given in relation wi e above algorim.wi e above definitions, obtain e index key of V as given below: i (13) (14) The above Eq.(14) shows e index key expression of V. On e oer side, in relation wi e symmetrical encoding i scheme, ere exists a chance of availability f two different data points (e.g., V i and V j) wiin e identical cluster sphere, where its index keys are equal. Bo e two data points must meet e condition at: (SD(O ) SD(V )) (SD(O ) j i j SD(V )) < 0. j 61

6 In order to keep away from e overlapping of e index we using e equation below: (15) Finally, n index keys obtained by means of Eq. (15) is inserted into a partitioned B+- tree, which will be discussed in e search phase. L field of e entries in e leaf node of improved Cluster Tree++ must be pointed to e created leaf node in simple prefix B+ tree. Search Phase: In is section, e hierarchical structure Algorim 2: Improved clustertree++ Index Construction of e improved cluster++ is presented. Then a scheme is Problem: Given e data point set, e algorim presented for breaking up a cluster inosubclusters and e constructs e index bt(1 to T) for Improved clustertree++ scheme for e purpose generating a clustertree++ by means of decomposing a cluster into subclusters and a Steps: meod for e purpose of building a ClusterTree++ by 1. Using DWFCM clustering algorim, e data points emans of decomposing clusters repeatedly. The enhanced in e set µ are grouped into T number of clusters. Cluster Tree++ is used to encode for e purpose of 2. Set j = 1 reducing e dead space and searching time. The key 3. Repeat steps 4 rough 11 until j<=t advantage of is scheme is at e loading process goes 4. Create index header file, bt(j) new tree File() rapidly because e output can be given sequentially and 5. Repeat steps 6 rough 9 for each data point Vi in e it makes only one pass over e data and no blocks are j- cluster required to be rearranged. Benefits in e tree are loaded 6. In is step we are calculating e centroid distance. and en e blocks are 100% complete. Sequential 7. Using algorim 1 obtain e ID number of Vi. loading generates a degree of spatial locality inside e 8. Set Key (V i) = TransValue(V i) file==>. 9. Insert it to B+-tree using e function Binsert Seeking can be reduced. The key ree conditions (Key(V i), bt(j)) are: When blocks are split in e sequence set, a new 10. Return bt(j) separator has to be inserted into e index set. Then when 11. Increment j by one blocks are combined in e sequence set, a separator has 12. End to be eliminated from e index set. When records are re-distributed among blocks in e sequence set, e value Processing of e ClusterTree: Insertion, Query and of separator in e index set has to be changed. Deletion are e ree main process in e ClusterTree++: Construction: The building of improved Cluster Tree++is based on e generation of Cluster Tree++ and e construction of simple prefix B+ tree in parallel wi encoding. Using a hierarchical structure and T sliced indexes. During e construction of ClusterTree++ all internal node and leaf node must set e ct (current time) and ht (historic time) as e current time as similar to e current time. Since e insertion time of e original data points are not known in e dataset, resulting in it is necessary to set e insertion times of e entire original data points as e current one. When e new data points are inserted into e structure, different times can be recorded into e structure. During e meantime, a leaf node is developed in simple prefix B+tree which includes e current time data. The entire L field of e entries in e leaf node of improved clustertree ++ must be pointed to e created leaf node in simple prefix+ tree. The entire Insertion: For new incoming data point, its classified into one ree categories: Clusterpoints: In a cluster inside a given reshold extremely near to certain data pointsclose by points: The data points which are neighbors to certain points in e clusters inside a specific reshold. Random points: are e data points which are far from all e clusters and cannot be bounded rough any bounding sphere of e cluster Tree, or shall be included in e bounding spheres of e clustering during every level, but ey do not have any nearest cluster points inside a particular reshold.as a result in agreement wi e category of e new coming data point, e insertion algorim of Cluster Tree ++ shall be applied repeatedly to put in data point to a specific leaf node of cluster Tree++along wi e insertion time in e T(time) field of e new entry in e leaf node, at e same time including a new entry which includes e insertion time details into 62

7 e simple prefix B+tree. Subsequently, point e L(link) field of e new entry in e certain leaf node in ClusterTree++ to e new entry in e leaf node of simple prefix B+ tree and point e L(link) field of e new entry in e particular leaf node in simple prefix B+ tree to e new entry in e leaf node of ClusterTree++. Query: The time-irrelevant queries and time-associated queries are e two categories of queries. The timeirrelevant queries comprise e range queries and p Nearest Neighbor queries. The time-associated queries comprise certain time period condition.for instance, certain clients possibly will request e nearby clients to a particular data point at are added into e structure, approximately e same time of at data point insertion. The original Cluster Tree query algorim can be employed to resolve e previous category of queries. The second category of queries is solved by using e algorim time-related-queries. Deletion: Wiin several systems e superseded or unnecessary data must be removed once in a while. Managers might desire to remove e dataset at are included, during e past, or ey desire to remove ose data inserted at some stage in a certain period. We can use time -related-deletion algorim to solve is problem. In addition, ey can just point out to e data system to automatically regulate itself. The ClusterTree+ can support such user specified deletions. Using e recursive automatic adjustment algorim we can make new ClusterTree. EXPERIMENTAL RESULTS AND DISCUSSION Here, distributed multi dimensional dataset is taken as input. The real-world datasets are utilized for carrying out e tests: CAR comprises 2, 249, 727 road segments of California obtained from Tiger/Line datasets; (b) HYD comprises 40, 995, 718 line segments symbolizing China rivers and (c) TLK includes up to 157, 425, 887 points obtained from e China s elevation data. Memory Usage: Given at DWFCM process data as amount of chunks calculated as e memory utilization of every chunk independently and get e largest value as e closing memory consumption for DWFCM. In view of e fact at e dataset is streaming in character, it is not necessary for DWFCM to access over one chunk at a time. Figure 2 shows e percentage of improvement in terms of memory consumption by proposed (DWFCM) as compared against e Baseline Algorim (DWPCM) and (WPCM). Fig. 2: Memory Usage Comparison It is clear from Table 1, e improvement is in excess of 97% for e entire ree datasets. WPCM makes use of e complete dataset at a time and at's why it needs adequate memory to hold e entire dataset. This is e cause why WPCM needs much higher memory an e proposed algorim. Table 1: Memory Usage Comparison DWPCM wi Input Dataset DWFCM (%) Map reduce (%) WPCM (%) CAR HYD TLK Selectivity: The time and e quantity of page writes for e purpose of bounding boxes insertion by means of several data chunks are illustrated in Figure 3. Since, only a predetermined size dataset undergoes partitioning, e size of e data chunk reduces, ereby increasing e amount of leaf nodes in e indexing trees. If e data chunk size is (or ), approximately 40, 000 data chunks are inserted into e indexing trees, however when e data chunk size is , approximately 560, 000 data chunks can be inserted. As per fact, file pages accession reduces e response time to a definite query and increase e selectivity. Hence multi-dimensional indexing structures should supply low amount during e explore for accessing a file. It is clear from e table 2 at e proposed indexing scheme of DWFCM wi cluster tree++ encoding is perform well an e DWPCM wi cluster tree++ and WPCM wi cluster tree+. 63

8 Fig. 3: Selectivity Comparison Table 2: Selectivity Comparison Number of Data Chunks DWFCM wi Cluster++ encoding DWPCM wi Cluster++2 WPCM wi Cluster Scalability wi Query: Figure 4 demonstrates e query experiment result. It is obvious from e results at Cluster Tree+ can resolve e multi-dimensional query inefficient setback; however Cluster Tree++ encoding execute much better an Cluster Tree++. Fig. 4: Scalability Comparison Table 3 shows e values of scalability of e meods. It proves at e multi-dimensional distributed index scales almost linearly wi e number of nodes in e system. Table 3: Scalability Comparison Number of Data Chunks DWFCM wi Cluster++ encoding DWPCM wi Cluster++ WPCM wi Cluster Table 3 shows e values of scalability of e meods. Wi e number of nodes in e system, e multidimensional distributed index confirms e rank wi approximate linear. 64

9 CONCLUSION 5. Berchtold, S. and C. Bohm, Implementation of multidimensional index structures for knowledge This paper presents a novel high-dimensional discovery in relational databases-international indexing in e midst of clustering scheme. This is conference on Data Warehousing (). phrased as symmetrical Encoding-based improved 6. Lomet, D. and B. Salzberg, The Hb-Tree: A cluster-tree++ (EIC-Tree++). To build an EIC-Tree++, robust multiattribute search structure Proceedings process carries out various steps. In e first step, wi in IEEE international conference on Data engineering e support of a DWFCM algorim, e complete (n) data (1989) Y. Ohsawa, M.S.: Bd-tree: A new n- points are clustered into T clusters. In e second step, dimensional data structure wi efficient dynamic rough an undersized encoding effort, e uniform ID characteristics. Proceedings of e Nin World number of each point is gained, where every cluster Computer Congress, IFIP, pp: sphere is segregated two times in accordance wi e 7. Chakrabarti, K., The Hybrid tree-an index dual distances of start- and centroid-distance. Through structure for high dimensional feature spacese merging process of uique ID and centroid distanc Proceedings of e fifteen international conference Finally we getting e unique index key. The uniform index on Data Engineering, pp: keys are rough a partitioned B+-tree. Consequently, 8. Samet, H., The quadtree and related an advanced encoded ClusterTree++ will be capable of hierarchical data structures. ACM Computing unvaryingly maintain e most renewed status of e Surveys 16(2): dataset in order to sustain e competence and efficiency 9. Ciaccia, P., M. Patella and P. Zezula, M-treesof data manipulations like insertion, modification and Efficent access meod for similarity search in metric deletion. An advanced encoded ClusterTree++of rd space. Proceedings of e 23 VLDB Conference, hierarchy of clusters and sub clusters at integrates e pp: representation of clusters into e index structure in order 10. Berchtold, S., C. Bohm and C. Bohm, The to accomplish e retrieval effectually and proficiently. pyramid technique: towards breaking e curse of The experimental outcomes reveal at e proposed dimensionaliy. Proceedings of SIGMOD conference. meod do one better an e existing Distributed 11. Filho, R., A. Traina and C. Faloutsos, Similarity Weighted Possibilistic C-Means (DWPCM) Algorims search wiout tears-the omni family of all purpose and Weighted Possibilistic C-Means (WPCM) access meods.in proceedings of ICDE Conference, Algorims. Future research must be concentrated on e pp: storage and also must enhance e performance of e 12. Fonseca, M.J. and J.A. George, Indexing Hightree structure. Dimensional data for Content Based retrieval in large databases.in proceedings of e DASSFA REFERENCES Conference, Kyoto, Japan, pp: Jagadish, H.V., B.C. Ooi and K.L. Tani, Yu, Dantong and Aidong Zhang, Integration of Distance: An adaptive B+-Tree based Indexing Cluster representation and Nearest Neighbor search Meod for Nearest Neighbor search, ACM for Image Databases. IEEE international Conference Transactions on Database Systems, pp: on Multimedia, New York. 14. Fredrik Farnstorm, Jameslewis and Charles Elkan, 2. Orenstein and J.A. Merrett, A class of data Scalability for Clustering algorims revisisted, structures for associative searching, Proceedings of ACM SIGKDD Explorations, 2: rd 3 ACM symposium on principles of database 15. Steven Eschrich, Jingweike Lawrence, O. Hall and systems, New York. B. Dmitry, Fast accurate Fuzzy Clustering 3. Faloutsos, C. and S. Roseman, Fractals for secondary rough data reduction, IEEE Transactions on Fuzzy Key retrieval in e proceedings of e 8 ACM Systems, 11: SIGACT-SIGMOD symposium on principles of data base. 4. Faloutsos, C., Mutihashing using gray codes, SIGMOD 86 Proceedings of e 1986 ACM international conference on Management of Data, NewYork, ACM Press, pp:

An encoding-based dual distance tree high-dimensional index

Science in China Series F: Information Sciences 2008 SCIENCE IN CHINA PRESS Springer www.scichina.com info.scichina.com www.springerlink.com An encoding-based dual distance tree high-dimensional index