A Survey On Different Text Clustering Techniques For Patent Analysis

Similar documents
Discovery of Agricultural Patterns Using Parallel Hybrid Clustering Paradigm

Iteration Reduction K Means Clustering Algorithm

An Improved Document Clustering Approach Using Weighted K-Means Algorithm

Keywords Clustering, Goals of clustering, clustering techniques, clustering algorithms.

A SURVEY- WEB MINING TOOLS AND TECHNIQUE

An Efficient Approach for Color Pattern Matching Using Image Mining

The Clustering Validity with Silhouette and Sum of Squared Errors

Text Document Clustering Using DPM with Concept and Feature Analysis

Machine Learning (BSMC-GA 4439) Wenke Liu

Analyzing Outlier Detection Techniques with Hybrid Method

Keywords: clustering algorithms, unsupervised learning, cluster validity

Review on Text Mining

An Enhanced K-Medoid Clustering Algorithm

KEYWORDS: Clustering, RFPCM Algorithm, Ranking Method, Query Redirection Method.

A Review of K-mean Algorithm

Hybrid Fuzzy C-Means Clustering Technique for Gene Expression Data

CLUSTERING. CSE 634 Data Mining Prof. Anita Wasilewska TEAM 16

Classifying Twitter Data in Multiple Classes Based On Sentiment Class Labels

Techniques for Mining Text Documents

University of Florida CISE department Gator Engineering. Clustering Part 2

LOG FILE ANALYSIS USING HADOOP AND ITS ECOSYSTEMS

ARTICLE; BIOINFORMATICS Clustering performance comparison using K-means and expectation maximization algorithms

Research and Improvement on K-means Algorithm Based on Large Data Set

Machine Learning (BSMC-GA 4439) Wenke Liu

PROJECT PERIODIC REPORT

International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.7, No.3, May Dr.Zakea Il-Agure and Mr.Hicham Noureddine Itani

Data Mining Technology Based on Bayesian Network Structure Applied in Learning

Domestic electricity consumption analysis using data mining techniques

Dr. Chatti Subba Lakshmi

Hybrid Models Using Unsupervised Clustering for Prediction of Customer Churn

What to come. There will be a few more topics we will cover on supervised learning

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

ISSN: (Online) Volume 3, Issue 9, September 2015 International Journal of Advance Research in Computer Science and Management Studies

NORMALIZATION INDEXING BASED ENHANCED GROUPING K-MEAN ALGORITHM

SK International Journal of Multidisciplinary Research Hub Research Article / Survey Paper / Case Study Published By: SK Publisher

A Comparative study of Clustering Algorithms using MapReduce in Hadoop

Adaptive and Personalized System for Semantic Web Mining

Clustering of Data with Mixed Attributes based on Unified Similarity Metric

Comparative Study of Clustering Algorithms using R

Application of Data Mining in Manufacturing Industry

Clustering in Data Mining

APPLICATION OF MULTIPLE RANDOM CENTROID (MRC) BASED K-MEANS CLUSTERING ALGORITHM IN INSURANCE A REVIEW ARTICLE

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

Filtering of Unstructured Text

New Approach for K-mean and K-medoids Algorithm

A Hybrid Unsupervised Web Data Extraction using Trinity and NLP

CSE 5243 INTRO. TO DATA MINING

A Comparative Study of Data Mining Process Models (KDD, CRISP-DM and SEMMA)

A Survey Of Issues And Challenges Associated With Clustering Algorithms

A Survey on k-means Clustering Algorithm Using Different Ranking Methods in Data Mining

Data Management Glossary

Exploratory Analysis: Clustering

Normalization based K means Clustering Algorithm

Nearest Clustering Algorithm for Satellite Image Classification in Remote Sensing Applications

An Efficient Clustering for Crime Analysis

Hard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering

International Journal of Advance Engineering and Research Development. A Survey on Data Mining Methods and its Applications

A Framework for Hierarchical Clustering Based Indexing in Search Engines

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

Patent Classification Using Ontology-Based Patent Network Analysis

Best Practices to Ensure Comprehensive Prior Art Searches

Efficient Clustering of Web Documents Using Hybrid Approach in Data Mining

A Framework on Ontology Based Classification and Clustering for Grouping Research Proposals

LITERATURE SURVEY ON SEARCH TERM EXTRACTION TECHNIQUE FOR FACET DATA MINING IN CUSTOMER FACING WEBSITE

A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING

A Fuzzy System Modeling Algorithm for Data Analysis and Approximate Reasoning

CSE 5243 INTRO. TO DATA MINING

Enhanced Cluster Centroid Determination in K- means Algorithm

WEB PAGE RE-RANKING TECHNIQUE IN SEARCH ENGINE

Enhancing Forecasting Performance of Naïve-Bayes Classifiers with Discretization Techniques

A CRITIQUE ON IMAGE SEGMENTATION USING K-MEANS CLUSTERING ALGORITHM

Introduction to Machine Learning CMU-10701

Efficiency of k-means and K-Medoids Algorithms for Clustering Arbitrary Data Points

Dynamic Clustering of Data with Modified K-Means Algorithm

Mine Blood Donors Information through Improved K- Means Clustering Bondu Venkateswarlu 1 and Prof G.S.V.Prasad Raju 2

Chapter 1, Introduction

Context Based Web Indexing For Semantic Web

A STUDY OF SOME DATA MINING CLASSIFICATION TECHNIQUES

A Semantic Model for Concept Based Clustering

Modelling Structures in Data Mining Techniques

Data Mining of Web Access Logs Using Classification Techniques

International Journal of Scientific & Engineering Research, Volume 6, Issue 10, October ISSN

Keywords Data alignment, Data annotation, Web database, Search Result Record

Technique For Clustering Uncertain Data Based On Probability Distribution Similarity

Maximizing the Value of STM Content through Semantic Enrichment. Frank Stumpf December 1, 2009

A Review: Content Base Image Mining Technique for Image Retrieval Using Hybrid Clustering

Educational Data Mining: Performance Evaluation of Decision Tree and Clustering Techniques using WEKA Platform

HFCT: A Hybrid Fuzzy Clustering Method for Collaborative Tagging

[Raghuvanshi* et al., 5(8): August, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116

CS Introduction to Data Mining Instructor: Abdullah Mueen

Data Mining Chapter 9: Descriptive Modeling Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University

ANALYSIS AND REASONING OF DATA IN THE DATABASE USING FUZZY SYSTEM MODELLING

Semi-Supervised Clustering with Partial Background Information

DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm

CHAPTER 4 FUZZY LOGIC, K-MEANS, FUZZY C-MEANS AND BAYESIAN METHODS

Relevance Feature Discovery for Text Mining

Effective Clustering Algorithms for VLSI Circuit. Partitioning Problems

Overview of Web Mining Techniques and its Application towards Web

Redefining and Enhancing K-means Algorithm

Preprocessing of Stream Data using Attribute Selection based on Survival of the Fittest

Transcription:

A Survey On Different Text Clustering Techniques For Patent Analysis Abhilash Sharma Assistant Professor, CSE Department RIMT IET, Mandi Gobindgarh, Punjab, INDIA ABSTRACT Patent analysis is a management tool in order to confront the management of product or service development process and organization s technology. Patent documents contain novel ideas, inventions and important research results. The analysis of these patents can be valuable to various sectors such as industry, business, law and policy-making communities in order to assess latest technological trends and to forecast new technologies. This work has been carried out with an aim to review various text clustering techniques for effective patent analysis. 2. METHODOLOGY This paper illustrates the study of various text clustering methodologies for patent analysis that can help out many companies for improving their competitiveness. The main methodology used for this work was by examining the publications, journals and reviews in the field of text clustering, patent analysis and patent documents over the times. 3. RESEARCH FINDINGS Keywords Text Clustering; Patent Analysis; Fuzzy Approach; Bayesian Method; k-means Clustering Algorithm. 1. INTRODUCTION A Patent is basically a type of Intellectual Property. Patent analysis is one of the ways of recognizing the advancements in technologies. But patent documents consist of large technical and legal terminology and it is difficult for nonspecialists to interpret the inventions and technologies mentioned in the patents. So, some simple methods are required to deduce the valuable information from the patent Text Clustering, also referred to as Document Clustering, is closely related to the concept of data clustering. Text clustering is a more specific technique for unsupervised document organization, automatic topic extraction and fast information retrieval or filtering. This clustering technique involves the use of descriptors and descriptor extraction. Descriptors are sets of words that describe the contents within the cluster. 3.1 K-Means Clustering Algorithm Young Gil Kim, et. al. [2008] proposed a new visualization method for patent analysis. In this technique, initially keywords are collected from the patent documents of a particular technology field. After that, clusters of patent documents are generated using k-means algorithm. With the clustering results, a semantic network of keywords is formed without respect of filing dates. Then, a patent map is made by rearranging each keyword node of the semantic network according to its earliest filing date and frequency in patent Extracting keywords from patents related to a particular technology field Implementing k-means Algorithm Patent analyses based on structured information such as filing dates, assignees, or citations have been the major approaches in practice and in the literature for many years. A typical Patent Analysis Model [7] is shown in Figure 2. But there is need of some effective techniques for analysing the patents. Our work attempts to analyse various text clustering techniques for reviewing the patent data and that data can be beneficial for various companies to understand the present technologies, to predict the future technologies and to plan for potential competition based on new technologies. Generating Semantic Network of Keywords Forming Patent Map Figure 1: Flowchart of visualization method for patent analysis 1

Figure 1 depicts the flowchart of the proposed methodology. A patent map is the visualized expression of total patent analysis results to understand complex patent information easily and effectively. And it is generated by collecting related patent documents of a target technology field, processing, and analyzing them. In general, a patent document consists of structured and unstructured data. 3.2 Document Clustering and Time Series Analysis In this research work, a new Patent Analysis Model is proposed for Technology Forecasting [1]. Most of the techniques which were developed earlier for Patent Analysis were based on one analytical approach such as clustering, classification and citation analyses. But, they had some limitations to predict the future state of a technology because they were dependent on only one result of a Technology Forecasting method. Sang Sung Park, et. al. [2012] attempted to minimize this problem by combining two analytical approaches which are patent document clustering and time series model. K-means clustering algorithm has been used as a patent clustering method and Time Series Regression (TSR) as a time series model. K-means clustering is a clustering method for finding K clusters and assigning all points to each cluster by Euclidean distance measure. TSR is a time series method to model the function of dependent variable Y and independent variable X, where X is time and Y is the number of issued patent Patent Analysis Model Task Identification Searching Segmentation Abstracting Clustering Visualization Interpretation Figure 2: A Typical Patent Analysis Model 3.3 Distance Determination Approach This research work proposed a new clustering algorithm for automatic discovery of data clusters [2]. Iterative Relocating Technique of Partitional Clustering, i.e. k-means and k- medoids have been used in this work. The algorithms have been implemented in C++ language. A partitioning based algorithm, D-M (Density Means), clustering automatically generates the clusters in this distance determination approach. K-Means Step: The basic step of k-means clustering is to give the number of clusters k and consider first k objects from data set D as clusters & their centroid. After that, K-means algorithm s steps will be performed. K-Medoids or Partition around Medoid (PAM) Step: The basic step of K-Medoid or PAM clustering is to give the number of clusters k and consider first k objects from data set Dn as clusters & their medoid. Then the k-medoid or PAM algorithm will perform its steps. Table 1 and Table 2 depict the Comparison of algorithm s running time and Comparison of SSE of algorithms respectively. Algorithm k-means Computational Complexity O(nkt) k-medoids O(k(n-k) 2 ) D-M O(n i ) Table 1: Comparison of algorithm s running time 3.4 Bayesian Approach The patent documents include the data which is of highly dimensional structure. It is difficult to cluster the document data because of their dimensional problem. Therefore, Bayesian approach was adopted to solve this problem of dimensionality [3]. 2

Earlier, clustering algorithms were based on similarity or distance measures, but Bayesian clustering used the probability distribution of the data. The distribution family of Bayesian model has Gaussian and Laplace in this model. The proposed method is a two step process, i.e. Initialization and Repetition (Clustering) for the following input and output: Input: (1) Given data, X={x 1, x 2,, x n } (2) Prior distribution, p(θ) (3) Likelihood function, l(x θ) Output: (1) Posterior distribution, p(θ x) (2) Updated parameters of hierarchical Bayesian model (3) Dendrogram of Bayesian clustering Algorithm s Name Two Four k-means 1053.6022 681.4046 k-medoids 1082.002 464.5469 D-M 1053.6022 392.6812 Table 2: Comparison of SSE (Sum of Square Error) of Algorithms 3.5 Fuzzy Logic Control Approach This research work presented a novel hierarchical clustering approach for patent analysis [4]. Keyword-based methodologies for analysis tend to be inconsistent and ineffective when partial meanings of the technical content are used for cluster analysis. Thus, a new methodology has been presented to automatically interpret and cluster knowledge documents using an ontology schema. Moreover, a fuzzy logic control approach is used to match suitable document clusters for given patents based on their derived ontological semantic webs. Fuzzy ontological document clustering (FODC) uses the following methodology: Initially, domain experts define the domain ontology using a knowledge ontology building and RDF editing tool called Protégé, and the words and phrases (e.g., speech, chunks, and lemmas) of the patent documents are mapped to the corresponding domain ontology concepts. The experts also create a training set of patents using a free and easy-to-use natural language processing and tagging tool. Afterwards, the probabilities of the concepts in given document chunks are computed. The concept probabilities calculated in any given patent document are then used for clustering the patents with fuzzy logic inferences. Hence, the hierarchical clustering algorithm is refined by adapting fuzzy logic to the process of ontological concept derivation. Table 3 depicts the differences between FODC and Key Phrase K-Means Clustering. No. FODC Key-Phrase K-Means Clustering 1. Extracts representative information Extracts general phrases 2. Ontological structure is applied to present knowledge 3. Ontology carries meanings and relations 4. FODC s F-Measure is high 5. Documents provide various views including the main concepts and details Key phrase sentence fragments are used to present knowledge Key phrases are less meaningful K-Mean s F-Measure is low Documents provide a key phrase view point Table 3: Differences between FODC and Key-Phrase K- Means Clustering 4. RESULTS For better analysis, the methodologies proposed and inferences from each Text Clustering Technique have been shown separately. Different methodologies provide different types of analysis depending upon the user demands. Moreover, generation of patent maps and semantic networks also help in reducing analysis time. 5. CONCLUSIONS The objective of our work is to provide a study of different text clustering techniques that can be used for efficient patent analysis. These techniques can be beneficial for various types of analysis such as visual analysis, analysis which include high dimensional data, etc. Hence, this study helps to prove effective for efficient analysis in a way that analyst can choose the appropriate Text Clustering Technique depending upon the requirements. 6. REFERENCES [1] Sang Sung Park, et. al., A Patent Analysis Model Combining Document Clustering and Time Series Analysis ; Brain Korea Project, 2012. [2] Bhanu Sukhija, Sukhvir Singh, Improved K-Means Clustering Technique using distance determination approach ; ISSN: 2277 9043, International Journal of Advanced Research in Computer Science and Electronics Engineering Volume 1, Issue 5, July 2012. [3] Sunghae Jun, A Clustering Method of Highly Dimensional Patent Data using Bayesian Approach ; IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 1, No 1, January 2012, ISSN: 1694-0814. 3

[4] Amy J.C. Trappey, et. al., A Fuzzy Ontological Knowledge Document Clustering Methodology ; IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS PART B: CYBERNETICS, VOL. 39, NO. 3, JUNE 2009. [5] Young Gil Kim, et. al., Visualization of patent analysis for emerging technology ; ScienceDirect, Expert Systems with Applications 34 (2008) 1804 1812. [8] Michele Fattori, et. al., Text mining applied to patent mapping: a practical business case ; World Patent Information 25 (2003) 335 342. [9] https://sites.google.com/site/analyzingpatenttrends/home/ what-is-patent-analysis [10] http://www.patentinsightpro.com/documents/text%20 Clustering%20on%20Patents.pdf [6] Khaled Khelif, et. al, Semantic Patent Clustering for Biomedical Communities ; 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology. [7] Yuen-Hsien Tseng, et. al., Text Mining Techniques for Patent Analysis ; ScienceDirect, Information Processing and Management 43 (2007) 1216 1247. 4