PATENT DATA CLUSTERING: A MEASURING UNIT FOR INNOVATORS


International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 1, Number 1, May-June 2010, pp. 158-165, IAEME, http://www.iaeme.com/ijcet.html

M. Pratheeban, Research Scholar, Anna University of Technology Coimbatore. E-mail: pratheeban_mca@yahoo.co.in
Dr. S. Balasubramanian, Former Director (IPR), Anna University of Technology Coimbatore. E-mail: s_balasubramanian@rediffmail.com

ABSTRACT

As software applications increase in volume, grouping an application into smaller, more manageable components is often proposed as a means of assisting software maintenance activities. One emerging focus in software development is patent data clustering. Its key challenge is how to cluster patent data and thereby improve searching of patent repositories. In this paper, we propose a new clustering algorithm that provides improved clustering facilities for patent data.

INTRODUCTION

Patent data clustering is a method for grouping patent-related data. Clustering of patent documents (such as titles, abstracts, and claims) has been used to bring out the importance of patents for researchers. Cluster analysis is an unsupervised process that divides a set of objects into homogeneous groups based on measured or perceived intrinsic characteristics or similarities among patents. Patent clustering speeds up sifting through large sets of patent data, helping analysts identify competitive and technology trends. The need for academic researchers to retrieve patents is increasing, because applying for patents is now considered an important research activity [6].

PATENT INFORMATION

Patents are an important source of scientific and technical information. For anyone planning to apply for a patent, a search is crucial to identify the existence of prior art, which affects the patentability of an invention. For researchers, patents can be important as they are often the only published information on specific topics, and can provide insight into research directions. Patents are also used by marketing and competitive intelligence professionals to find out about work being done by others.

PATENT DATABASE

Patent data may relate to unexamined and examined patent applications, and includes:
- Titles and abstracts in English (if the patent is in another language)
- Inventor's name
- Patent assignee
- Patent publication data
- Images
- Full text (sometimes available through a separate database, or it must be ordered)
- International Patent Classification (IPC) codes

The IPC is used by over 70 patent authorities to classify and index the subject matter of published patent specifications. It is presumably based on literary warrant, and its sections range from the very broad to the specific [2].

PATENT ASSESSMENT AND TECHNOLOGY AREA ASSESSMENT

Currently, high-quality valuation of patents and patent applications, and the assessment of technology areas with respect to their potential to give rise to patent applications, is done mainly manually, which is very costly and time-consuming. We are developing techniques that use statistical and semantic information from patents, as well as user-based data on market aspects, to prognosticate patent value.

PATENT MINING

A clear and effective IP strategy critically incorporates a clear and effective strategy for managing an organization's patent portfolio [7]. It means analyzing all patents that can directly revolutionize business and technology development practice. Patent mining is a premeditated and core function for any IP-centric business: it secures technology development and provides a foundation that helps administrators make planning decisions regarding technology development. Today, patent management applications and robust search engines allow internal IP managers to quickly pull together organized sets of patents from within their own portfolios, from those of specific competitors, and from patents citing relevant technical or industry terms. Companies once interested only in understanding the patents within their own portfolio are now interested in knowing about the patents held by competitors [8].

BASICS OF CLUSTERING

Clustering is a division of data into groups of similar objects. Each group, called a cluster, consists of objects that are similar to one another and dissimilar to objects of other groups [1]. Clustering groups a set of data in a way that maximizes the similarity within clusters and minimizes the similarity between two different clusters. The discovered clusters can help explain the characteristics of the underlying data distribution and serve as the foundation for other data mining and analysis techniques [5]. The quality of a clustering method is measured by its ability to discover some or all of the hidden patterns; the quality of a clustering result also depends on both the similarity measure used by the method and its implementation [3].

CLUSTERING ALGORITHMS

Most existing clustering algorithms find clusters that fit some static model. Although effective in some cases, these algorithms can break down, that is, cluster the data incorrectly, if the user does not select appropriate static-model parameters, or if the model cannot adequately capture the clusters' characteristics. Most of these algorithms break down when the data contains clusters of diverse shapes, densities,

and sizes [5]. Cluster analysis is the organization of a collection of patterns into clusters based on similarity [4].

LIMITATIONS OF TRADITIONAL CLUSTERING ALGORITHMS

Partition-based clustering techniques such as k-means and CLARANS attempt to break a data set into k clusters such that the partition optimizes a given criterion. These algorithms assume that clusters are hyper-ellipsoidal and of similar sizes, so they cannot find clusters that vary in size or that have concave shapes [9].

DBSCAN (Density-Based Spatial Clustering of Applications with Noise), a well-known spatial clustering algorithm, can find clusters of arbitrary shapes. DBSCAN defines a cluster as a maximal set of density-connected points, which means that every core point in a cluster must have at least a minimum number of points (MinPts) within a given radius (Eps) [10]. DBSCAN assumes that all points within genuine clusters can be reached from one another by traversing a path of density-connected points, while points in different clusters cannot. DBSCAN can find arbitrarily shaped clusters if the cluster density can be determined beforehand and is uniform [10].

Hierarchical clustering algorithms produce a nested sequence of clusters, with a single all-inclusive cluster at the top and single-point clusters at the bottom. Agglomerative hierarchical algorithms start with each data point as a separate cluster; each step merges the two most similar clusters, so after each merger the total number of clusters decreases by one. Users can repeat these steps until they obtain the desired number of clusters, or until the distance between the two closest clusters exceeds a certain threshold. However, most hierarchical algorithms do not revisit already constructed (intermediate) clusters with the purpose of improving them [1].
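The DBSCAN behavior described above (core points, the Eps radius, the MinPts threshold, and expansion along density-connected points) can be illustrated with a minimal sketch; this is a toy implementation for intuition, not the reference algorithm of [10], and all names here are illustrative:

```python
import math

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: label each point with a cluster id, or -1 for noise."""
    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:        # not a core point:
            labels[i] = -1                  # tentatively noise
            continue
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:                        # expand by density-connectivity
            j = seeds.pop()
            if labels[j] == -1:             # border point reached from a core
                labels[j] = cluster_id
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts: # j is itself a core point
                seeds.extend(j_neighbors)
        cluster_id += 1
    return labels

# Two dense blobs plus one isolated outlier
pts = [(0, 0), (0, 0.5), (0.5, 0), (10, 10), (10, 10.5), (10.5, 10), (50, 50)]
print(dbscan(pts, eps=1.0, min_pts=2))  # [0, 0, 0, 1, 1, 1, -1]
```

The isolated point (50, 50) has fewer than MinPts neighbors within Eps, so it is left as noise (-1), exactly the case the text's definition of a core point excludes.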
In agglomerative hierarchical clustering, provision can be made to relocate objects that may have been 'incorrectly' grouped at an early stage; the result should be examined closely to ensure it makes sense. Using different distance metrics for measuring distances between clusters may generate different results, so performing multiple experiments and comparing the results is recommended to support the veracity of the original results [11].
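The agglomerative merge loop described above can be sketched with the closest-pair (single-link) cluster distance, which is also the similarity measure attributed to CURE later in the text. This is a naive O(n^3) illustration under assumed Euclidean distance, not a production implementation:

```python
import math

def single_link_distance(c1, c2):
    """Cluster similarity = distance of the closest pair of points across clusters."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerative(points, target_k):
    """Start with singleton clusters; repeatedly merge the two closest clusters."""
    clusters = [[p] for p in points]     # every point begins as its own cluster
    while len(clusters) > target_k:      # each merge reduces the count by one
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: single_link_distance(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(sorted(len(c) for c in agglomerative(pts, 2)))  # [2, 2]
```

Swapping single_link_distance for a centroid-based distance reproduces the centroid/medoid scheme the next paragraph criticizes; only the cluster-distance function changes.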

The many variations of agglomerative hierarchical algorithms primarily differ in how they update the similarity between existing and merged clusters. In some hierarchical methods, each cluster is represented by a centroid or a medoid (a data point closest to the center of the cluster), and the similarity between two clusters is measured by the similarity between their centroids or medoids. Both of these schemes fail for data in which points in a given cluster are closer to the center of another cluster than to the center of their own.

ROCK, a recently developed algorithm that operates on a derived similarity graph, scales the aggregate interconnectivity with respect to a user-specified interconnectivity model. However, the major limitation of all such schemes is that they assume a static, user-supplied interconnectivity model. Such models are inflexible and can easily lead to incorrect merging decisions when the model underestimates or overestimates the interconnectivity of the data set. Although some schemes allow the connectivity to vary for different problem domains, it is still the same for all clusters irrespective of their densities and shapes [12].

CURE measures the similarity between two clusters by the similarity of the closest pair of points belonging to different clusters. Unlike centroid/medoid-based methods, CURE can find clusters of arbitrary shapes and sizes, as it represents each cluster via multiple representative points; shrinking the representative points toward the centroid allows CURE to avoid some of the problems associated with noise and outliers. However, these techniques fail to account for special characteristics of individual clusters and can make incorrect merging decisions when the underlying data does not follow the assumed model or when noise is present. In some algorithms, the similarity between two clusters is captured by the aggregate of the similarities among pairs of items belonging to different clusters [13].

Existing algorithms use a static model of the clusters and do not use information about the nature of individual clusters as they are merged. Furthermore, one set of schemes ignores the aggregate interconnectivity of items in two clusters, while the other set ignores the closeness of two clusters as defined by the similarity of the closest items across the two clusters. By considering only either interconnectivity or closeness, these algorithms can easily select and merge the wrong pair of clusters.

USAGE OF ALGORITHMS

The most standard approach to document classification in recent years is to apply machine learning, such as support vector machines or Naïve Bayes. However, this approach is not easy to apply to the patent mining task, because the number of classes is large and it incurs a high computational cost [6]. We therefore propose a new algorithm rather than a machine learning algorithm.

OUR APPROACH

We propose a new dynamic algorithm that accounts for both interlink (interconnectivity) and nearness (closeness) in identifying the most similar pair of clusters. Thus, it does not depend on a static, user-supplied model and can automatically adapt to the internal characteristics of the merged clusters. In this algorithm we replace the k-means step used in Chameleon with k-medoids, which may give better interlink results than those obtained using k-means. From various comparisons we came to know that the average time taken by the k-means algorithm is greater than the time taken by the k-medoids algorithm for the same set of data; k-means is efficient for smaller data sets, while k-medoids seems to perform better for large data sets [14].

For interlinks of patents:

1. Randomly choose k objects from the data set to be the cluster medoids in the initial state, collecting patent data related to a particular field or to all fields.
2. For each pair of a non-selected object h and a selected object (medoid) i, calculate the total swapping cost Tih.
3. For each pair of i and h, if Tih < 0, replace i by h; then assign each non-selected object to the most similar representative object.
4. Repeat steps 2 and 3 until no change happens.
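The four steps above follow the PAM-style k-medoids procedure. A minimal sketch, assuming Euclidean distance over 2-D points and computing the swap cost Tih as the change in total distance to the nearest medoid (a simplification of the classical PAM cost bookkeeping; all names are illustrative):

```python
import math
import random

def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def k_medoids(points, k, seed=0):
    """PAM-style k-medoids following the four steps in the text."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)                # step 1: random initial medoids
    while True:
        current = total_cost(points, medoids)
        best_swap, best_delta = None, 0.0
        for i in range(len(medoids)):              # step 2: cost T_ih of every
            for h in points:                       # swap (medoid i, non-medoid h)
                if h in medoids:
                    continue
                candidate = medoids[:i] + [h] + medoids[i + 1:]
                delta = total_cost(points, candidate) - current
                if delta < best_delta:             # step 3: keep swaps with T_ih < 0
                    best_swap, best_delta = (i, h), delta
        if best_swap is None:                      # step 4: stop when no change
            break
        i, h = best_swap
        medoids[i] = h
    # assign each point to its most similar representative object
    clusters = {m: [] for m in medoids}
    for p in points:
        clusters[min(medoids, key=lambda m: math.dist(p, m))].append(p)
    return medoids, clusters

pts = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9), (9, 8)]
medoids, clusters = k_medoids(pts, 2)
print(sorted(len(v) for v in clusters.values()))  # [3, 3]
```

Because medoids are actual data points rather than computed means, the swap search is what makes k-medoids less sensitive to outliers than k-means, at the price of evaluating every (i, h) pair per iteration.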

The absolute nearness of two clusters is normalized by the internal nearness of the clusters. During the calculation of nearness, the algorithm finds the genuine clusters by repeatedly combining these sub-clusters.

CONCLUSION

The methodology of dynamic modeling of clusters in agglomerative hierarchical methods is applicable to all types of data as long as a similarity measure is available. Even though we chose to model the data using k-medoids in this paper, it is entirely possible to use other algorithms suited to patent mining domains. Our future research includes the practical implementation of this algorithm for better results in patent mining.

REFERENCES

[1] Pavel Berkhin, "Survey of Clustering Data Mining Techniques", Accrue Software, Inc. http://www.ee.ucr.edu/~barth/ee242/clustering_survey.pdf
[2] http://www.wipo.int/classifications/ipc/en/
[3] Osmar R. Zaïane, "Principles of Knowledge Discovery in Databases", CMPUT 690, University of Alberta.
[4] Cheng-Fa Tsai, Han-Chang Wu, Chun-Wei Tsai, "A New Data Clustering Approach for Data Mining in Large Database", International Symposium on Parallel Architectures, Algorithms and Networks (ISPAN '02).
[5] George Karypis, Eui-Hong (Sam) Han, Vipin Kumar, "Chameleon: Hierarchical Clustering Using Dynamic Modeling". http://www-leibniz.imag.fr/apprentissage/depot/selection/karypis99.pdf
[6] Hidetsugu Nanba, "Hiroshima City University at NTCIR-7 Patent Mining Task", Proceedings of the NTCIR-7 Workshop Meeting, December 16-19, 2008, Tokyo, Japan.
[7] Bob Stembridge, Breda Corish, "Patent data mining and effective patent portfolio management", Intellectual Asset Management, October/November 2004.
[8] Edward Khan, "Patent mining in a changing world of technology and product development", Intellectual Asset Management, July/August 2003.
[9] Raymond T. Ng, Jiawei Han, "Efficient and Effective Clustering Methods for Spatial Data Mining", Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994.
[10] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96).
[11] http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Agglomerative_Hierarchical_Clustering_Overview.htm
[12] S. Guha, R. Rastogi, and K. Shim, "ROCK: A Robust Clustering Algorithm for Categorical Attributes", Proc. 15th Int'l Conf. Data Engineering, IEEE CS Press, Los Alamitos, Calif., 1999, pp. 512-521.
[13] S. Guha, R. Rastogi, and K. Shim, "CURE: An Efficient Clustering Algorithm for Large Databases", Proc. ACM SIGMOD Int'l Conf. Management of Data, ACM Press, New York, 1998, pp. 73-84.
[14] T. Velmurugan and T. Santhanam, "Computational Complexity between K-Means and K-Medoids Clustering Algorithms for Normal and Uniform Distributions of Data Points", Journal of Computer Science 6(3): 363-368, 2010. ISSN 1549-3636, Science Publications.