PATENT DATA CLUSTERING: A MEASURING UNIT FOR INNOVATORS


International Journal of Computer Engineering and Technology (IJCET), ISSN (Print), ISSN (Online), Volume 1, Number 1, May - June (2010), pp, IAEME

M. Pratheeban
Research Scholar, Anna University of Technology, Coimbatore
E-mail id: pratheeban_mca@yahoo.co.in

Dr. S. Balasubramanian
Former Director - IPR, Anna University of Technology, Coimbatore
E-mail id: s_balasubramanian@rediffmail.com

ABSTRACT

As software applications increase in volume, grouping the applications into smaller, more manageable components is often proposed as a means of assisting software maintenance activities. One of the thrust areas in software development is Patent Data Clustering. The key challenge of Patent Data Clustering is how to cluster patent data and how to improve searching for it in repositories. In this paper, we propose a new clustering algorithm that provides improved clustering facilities for patent data.

INTRODUCTION

Patent Data Clustering is a method for grouping patent-related data. Clustering of patent documents (such as titles, abstracts and claims) has been used to bring out the importance of patents for researchers. Cluster analysis is an unsupervised process that divides a set of objects into homogeneous groups. It relies on measured or perceived intrinsic characteristics or similarities among patents. Patent clustering speeds up sifting through large sets of patent data for analysis, which helps people identify competitive and technology trends. The need for academic researchers to retrieve patents is increasing, because applying for patents is now considered an important research activity [6].

PATENT INFORMATION

Patents are an important source of scientific and technical information. For anyone planning to apply for a patent, a search is crucial to identify the existence of prior art, which affects the patentability of an invention. For researchers, patents can be important as they are often the only published information on specific topics, and can provide insight into research directions. Patents are also used by marketing and competitive intelligence professionals to find out about work being done by others.

PATENT DATABASE

Information that may be provided in patent databases

Patent data may relate to unexamined and examined patent applications, and includes:

- Titles and abstracts in English (if the patent is in another language)
- Inventor's name
- Patent assignee
- Patent publication data
- Images
- Full text (sometimes this is available through a separate database, or must be ordered)
- International Patent Classification (IPC) codes

The IPC is used by over 70 patent authorities to classify and index the subject matter of published patent specifications. It is presumably based on literary warrant, and sections range from the very broad to the specific [2].

PATENT ASSESSMENT AND TECHNOLOGY AREA ASSESSMENT

Currently, high-quality valuation of patents and patent applications, and the assessment of technology areas with respect to their potential to give rise to patent applications, is done mainly manually, which is very costly and time consuming. We are developing techniques that use statistical and semantic information from patents, as well as user-based data on market aspects, to make prognoses about a patent.

MINING PATENT

A clear and effective IP strategy critically incorporates a clear and effective strategy for managing an organization's patent portfolio [7]. It means the analysis of all patents that can directly revolutionize business and technology development practice. Patent mining is a deliberate, core function of any IP-centric business: it secures technology development and provides a foundation that helps administrators make planning decisions regarding technology development. Today, patent management applications and robust search engines allow internal IP managers to quickly pull together organized sets of patents from within their own portfolios, from those of specific competitors, and from patents citing relevant technical or industry terms. Companies once interested only in understanding the patents within their own portfolio are now interested in knowing about the patents held by competitors [8].

BASICS OF CLUSTERING

Clustering is a division of data into groups of similar objects. Each group, called a cluster, consists of objects that are similar to one another and dissimilar to objects of other groups [1]. It groups a set of data in a way that maximizes the similarity within clusters and minimizes the similarity between two different clusters. These discovered clusters can help explain the characteristics of the underlying data distribution and serve as the foundation for other data mining and analysis techniques [5]. The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns. The quality of a clustering result depends on both the similarity measure used by the method and its implementation [3].

CLUSTERING ALGORITHMS

Most existing clustering algorithms find clusters that fit some static model.
Although effective in some cases, these algorithms can break down (that is, cluster the data incorrectly) if the user doesn't select appropriate static-model parameters. Sometimes the model cannot adequately capture the clusters' characteristics. Most of these algorithms break down when the data contains clusters of diverse shapes, densities, and sizes [5]. Cluster analysis is the organization of a collection of patterns into clusters based on similarity [4].

LIMITATIONS OF TRADITIONAL CLUSTERING ALGORITHMS

Partition-based clustering techniques such as K-Means and Clarans attempt to break a data set into K clusters such that the partition optimizes a given criterion. These algorithms assume that clusters are hyper-ellipsoidal and of similar sizes. They can't find clusters that vary in size or that have concave shapes [9].

DBScan (Density-Based Spatial Clustering of Applications with Noise), a well-known spatial clustering algorithm, can find clusters of arbitrary shapes. DBScan defines a cluster to be a maximal set of density-connected points, which means that every core point in a cluster must have at least a minimum number of points (MinPts) within a given radius (Eps) [10]. DBScan assumes that all points within genuine clusters can be reached from one another by traversing a path of density-connected points, and that points across different clusters cannot. DBScan can find arbitrarily shaped clusters if the cluster density can be determined beforehand and the cluster density is uniform [10].

Hierarchical clustering algorithms produce a nested sequence of clusters, with a single, all-inclusive cluster at the top and single-point clusters at the bottom. Agglomerative hierarchical algorithms start with each data point as a separate cluster. Each step of the algorithm involves merging the two clusters that are the most similar. After each merger, the total number of clusters decreases by one. Users can repeat these steps until they obtain the desired number of clusters or until the distance between the two closest clusters goes above a certain threshold. A drawback is that most hierarchical algorithms do not revisit (intermediate) clusters once they are constructed in order to improve them [1].
In agglomerative hierarchical clustering, no provision can be made for relocating objects that may have been 'incorrectly' grouped at an early stage. The result should be examined closely to ensure it makes sense. Using different distance metrics for measuring distances between clusters may generate different results. Performing multiple experiments and comparing the results is recommended to support the veracity of the original results [11].
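The agglomerative procedure and its sensitivity to the inter-cluster distance can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and does not come from the paper: the functions `linkage_distance` and `agglomerate`, the use of Euclidean distance on numeric tuples, and the choice of single and complete linkage as the two distance definitions being compared.

```python
import math
from itertools import combinations


def linkage_distance(a, b, linkage):
    """Distance between clusters a and b (lists of numeric tuples)."""
    pair_distances = [math.dist(p, q) for p in a for q in b]
    if linkage == "single":      # closest pair across the two clusters
        return min(pair_distances)
    if linkage == "complete":    # farthest pair across the two clusters
        return max(pair_distances)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerate(points, n_clusters, linkage="single"):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage_distance(
                clusters[ij[0]], clusters[ij[1]], linkage
            ),
        )
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]                          # one cluster fewer
    return clusters


# Hypothetical one-dimensional data with steadily growing gaps.
points = [(0,), (10,), (22,), (36,), (52,)]
by_single = agglomerate(points, 2, "single")
by_complete = agglomerate(points, 2, "complete")
```

On this toy data the two linkage choices return different two-cluster partitions, which is the kind of disagreement that makes it worth repeating an experiment under several distance definitions. Note that a merge, once made, is never undone in this sketch either.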

The many variations of agglomerative hierarchical algorithms primarily differ in how they update the similarity between existing and merged clusters. In some hierarchical methods, each cluster is represented by a centroid or medoid (a data point that is the closest to the center of the cluster), and the similarity between two clusters is measured by the similarity between the centroids/medoids. Both of these schemes fail for data in which points in a given cluster are closer to the center of another cluster than to the center of their own cluster.

ROCK, a recently developed algorithm that operates on a derived similarity graph, scales the aggregate interconnectivity with respect to a user-specified interconnectivity model. However, the major limitation of all such schemes is that they assume a static, user-supplied interconnectivity model. Such models are inflexible and can easily lead to incorrect merging decisions when the model underestimates or overestimates the interconnectivity of the data set. Although some schemes allow the connectivity to vary for different problem domains, it is still the same for all clusters irrespective of their densities and shapes [12].

CURE measures the similarity between two clusters by the similarity of the closest pair of points belonging to different clusters. Unlike centroid/medoid-based methods, CURE can find clusters of arbitrary shapes and sizes, as it represents each cluster via multiple representative points. Shrinking the representative points toward the centroid allows CURE to avoid some of the problems associated with noise and outliers. However, these techniques fail to account for special characteristics of individual clusters. They can make incorrect merging decisions when the underlying data does not follow the assumed model or when noise is present. In some algorithms, the similarity between two clusters is captured by the aggregate of the similarities among pairs of items belonging to different clusters [13].
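The centroid/medoid representation described above, and the way it can fail, can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `centroid` and `medoid` and the sample coordinates are hypothetical, points are numeric tuples compared with Euclidean distance, and the medoid follows the wording used in this paper (the data point closest to the cluster center).

```python
import math


def centroid(points):
    """Coordinate-wise mean of a cluster of numeric tuples."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def medoid(points):
    """Actual data point that lies closest to the cluster center."""
    center = centroid(points)
    return min(points, key=lambda p: math.dist(p, center))


# Hypothetical data: an elongated cluster next to a small compact one.
elongated = [(0, 0), (2, 0), (5, 0), (8, 0), (10, 0)]
compact = [(11, 1), (12, 1), (11, 2), (12, 2)]

edge_point = (10, 0)  # belongs to the elongated cluster
to_own_center = math.dist(edge_point, centroid(elongated))
to_other_center = math.dist(edge_point, centroid(compact))
```

Here `edge_point` is nearer to the center of the compact cluster than to the center of its own cluster, which is exactly the situation in which a single centroid or medoid per cluster gives a misleading similarity.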
Existing algorithms use a static model of the clusters and do not use information about the nature of individual clusters as they are merged. Furthermore, one set of schemes ignores the information about the aggregate interconnectivity of items in two clusters. The other set of schemes ignores information about the closeness of two clusters as defined by the similarity of the closest items across two clusters. By only considering either interconnectivity or closeness, these algorithms can easily select and merge the wrong pair of clusters.

USAGE OF ALGORITHMS

The most standard approach to document classification in recent years has been to apply machine learning, such as support vector machines or Naïve Bayes. However, this approach is not easy to apply to the patent mining task, because the number of classes is large and this results in a high calculation cost [6]. So we propose a new algorithm rather than machine learning algorithms.

OUR APPROACH

We propose a new dynamic algorithm that accounts for both interlink and nearness in identifying the most similar pair of clusters. Thus, it does not depend on a static, user-supplied model and can automatically adapt to the internal characteristics of the merged clusters. In the above algorithm, we replaced Chameleon with a suitable k-medoids, which may give a better result for interlink than interlink using k-means. From various comparisons, we came to know that the average time taken by the K-Means algorithm is greater than the time taken by the K-Medoids algorithm for the same set of data, and also that the K-Means algorithm is efficient for smaller data sets while the K-Medoids algorithm seems to perform better for large data sets [14].

For interlinks of patents:

1. Randomly choose k objects from the data set to be the cluster medoids at the initial state. Collect the patent data related to a particular field or to all fields.
2. For each pair of non-selected object h and selected object i, calculate the total swapping cost Tih.
3. For each pair of i and h, if Tih < 0, i is replaced by h. Then assign each non-selected object to the most similar representative object.
4. Repeat steps 2 and 3 until no change happens.
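The four steps above can be sketched as a small swap-based k-medoids routine. This is a sketch under stated assumptions, not the authors' implementation (the paper leaves the practical implementation to future work): the names `total_cost` and `k_medoids`, the `seed` parameter, the use of Euclidean distance, and the representation of each object as a distinct numeric tuple are all hypothetical. Real patent documents would first have to be turned into feature vectors, which is not shown.

```python
import math
import random


def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)


def k_medoids(points, k, seed=0):
    """Swap-based k-medoids over distinct numeric tuples."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)          # step 1: random initial medoids
    changed = True
    while changed:                           # step 4: repeat until no change
        changed = False
        for i in list(medoids):              # selected object i
            for h in points:                 # non-selected object h
                if h in medoids:
                    continue
                candidate = [h if m == i else m for m in medoids]
                # Step 2: total swapping cost T_ih of replacing i by h.
                t_ih = total_cost(points, candidate) - total_cost(points, medoids)
                if t_ih < 0:                 # step 3: the swap lowers the cost
                    medoids = candidate
                    changed = True
                    break                    # i is no longer a medoid
    # Step 3 (continued): assign each object to its most similar medoid.
    clusters = {m: [] for m in medoids}
    for p in points:
        nearest = min(medoids, key=lambda m: math.dist(p, m))
        clusters[nearest].append(p)
    return medoids, clusters


# Hypothetical data: two well-separated groups of three objects each.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, clusters = k_medoids(data, k=2)
```

Each accepted swap strictly lowers the total cost, so the loop terminates. Recomputing the full cost for every candidate swap keeps the sketch short but is expensive; a practical version would update the cost incrementally.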

Absolute nearness of two clusters is normalized by the internal nearness of the clusters. During the calculation of nearness, the algorithm finds the genuine clusters by repeatedly combining these sub-clusters.

CONCLUSION

The methodology of dynamic modeling of clusters in agglomerative hierarchical methods is applicable to all types of data as long as a similarity measure is available. Even though we chose to model the data using k-medoids in this paper, it is entirely possible to use other algorithms suitable for patent mining domains. Our future research work includes the practical implementation of this algorithm for better results in patent mining.

REFERENCE

[1] Pavel Berkhin, Survey of Clustering Data Mining Techniques, Accrue Software, Inc.
[2]
[3] Dr. Osmar R. Zaïane, Principles of Knowledge Discovery in Databases, University of Alberta, CMPUT690.
[4] Cheng-Fa Tsai, Han-Chang Wu, Chun-Wei Tsai, A New Data Clustering Approach for Data Mining in Large Database, International Symposium on Parallel Architectures, Algorithms and Networks (ISPAN'02).
[5] George Karypis, Eui-Hong (Sam) Han, Vipin Kumar, Chameleon: Hierarchical Clustering Using Dynamic Modeling. karypis99.pdf
[6] Hidetsugu Nanba, Hiroshima City University at NTCIR-7 Patent Mining Task, Proceedings of NTCIR-7 Workshop Meeting, December 16-19, 2008, Tokyo, Japan.
[7] Bob Stembridge, Breda Corish, Patent data mining and effective patent portfolio management, Intellectual Asset Management, October/November 2004.
[8] Edward Khan, Patent mining in a changing world of technology and product development, Intellectual Asset Management, July/August

[9] Raymond T. Ng, Jiawei Han, Efficient and Effective Clustering Methods for Spatial Data Mining, Proceedings of the 20th VLDB Conference, Santiago, Chile.
[10] Martin Ester, Hans-Peter Kriegel, Jorg Sander, Xiaowei Xu, A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96).
[11] ive_ Hierarchical_ Clustering_Overview.htm
[12] S. Guha, R. Rastogi, and K. Shim, ROCK: A Robust Clustering Algorithm for Categorical Attributes, Proc. 15th Int'l Conf. Data Eng., IEEE CS Press, Los Alamitos, Calif., 1999, pp
[13] S. Guha, R. Rastogi, and K. Shim, CURE: An Efficient Clustering Algorithm for Large Databases, Proc. ACM SIGMOD Int'l Conf. Management of Data, ACM Press, New York, 1998, pp
[14] T. Velmurugan and T. Santhanam, Computational Complexity between K-Means and K-Medoids Clustering Algorithms for Normal and Uniform Distributions of Data Points, Journal of Computer Science 6 (3): , 2010 ISSN , 2010 Science Publications
