A Modified Fuzzy Relational Clustering Approach for Sentence-Level Text

Proc. of the IEEE 2015 2nd International Conference on Electrical, Information and Communication Technology (EICT 2015), Khulna, Bangladesh, December 10-12, 2015. Note: this is the pre-print version.

A Modified Fuzzy Relational Clustering Approach for Sentence-Level Text

Sikder Tahsin Al-Amin, Mahade Hasan, and M. M. A. Hashem
Department of Computer Science and Engineering
Khulna University of Engineering and Technology
Khulna-9203, Bangladesh
stahsin.cse@gmail.com, mahade0@gmail.com, mma.hashem@outlook.com

Abstract: This paper proposes a fuzzy relational clustering (FRC) approach to find similar sentences in a set of sentences and to group them into clusters. To find similar sentences, FRC uses both word-to-word similarity and order similarity. For word-to-word similarity, FRC uses the Jiang and Conrath (JnC) similarity measure with the help of the WordNet database. Order similarity is calculated from the joint word set. Because a sentence may relate to more than one theme, FRC uses a fuzzy clustering approach, adopting the FRECCA algorithm for sentence clustering. The algorithm works by Expectation-Maximization, where the importance of a sentence is expressed by its PageRank score, which is treated as a likelihood. The PageRank scores and mixing coefficients are initialized with a uniform random number generation technique. Applying this method to a quotation dataset of different classes, we found that it is capable of identifying and grouping similar sentences into clusters. FRC was also applied to a news article dataset with admirable results.

Keywords: Fuzzy Clustering, Sentence Similarity, Relational Data.

I. INTRODUCTION

The amount of data being generated and stored is growing exponentially, which presents new opportunities and challenges for unlocking the information embedded within it. The modern field of data mining can be used to extract important knowledge from such data [1]. The present method focuses on text data only: FRC uses pre-processed text data of quotations, and it can further be applied to pre-processed text data from newspapers, websites, online blogs, and similar sources. One application is the following: on any given day many newspapers are published, and some news articles appear in all or most of them, written in different formats or language even though the underlying story is the same. The main objective is therefore to find the similarity between the sentences of such news articles.

Several works have been done on fuzzy clustering. The first was the Relational Fuzzy c-Means of Hathaway et al. [2]; ARCA [8] and fuzzy k-medoids [3] were proposed later. These algorithms have limitations. Relational Fuzzy c-Means can operate on relational data, but distances between data points cannot be measured directly. If the size of the dataset grows very large, the ARCA algorithm fails. K-medoids is affected by its poor initialization, which is done randomly.

A sentence is likely to be related to more than one theme or topic present within a document. However, because most sentence similarity measures do not represent sentences in a common metric space, a bag-of-words approach [5] or conventional fuzzy clustering approaches based on prototypes or mixtures of Gaussians are generally not applicable to sentence clustering [6]. Hence there is a need for a fuzzy clustering algorithm that operates on relational input data, i.e., data in the form of a square matrix of pairwise similarities between data objects.
FRC finds similarities between sentences and puts them into clusters by calculating cluster membership values. As the approach is fuzzy, a sentence may reside in more than one cluster with some amount of membership. The membership values of a sentence over all clusters sum to 1, and a sentence is considered to belong to the cluster for which its membership value is largest.

This paper also improves the sentence similarity measurement. The sentence similarity method measures the similarity between two sentences based on both word-to-word similarity and order similarity. Word-to-word similarity measures the similarity between two words with the help of the lexical database WordNet [7], while order similarity accounts for the positioning of words within a sentence. In addition, the proposed method, Sentence Clustering Based on FRC (SBFRC), improves the initialization step of the algorithm: where the FRECCA algorithm initializes the membership values and PageRank values using a simple random number generation technique, SBFRC uses a uniform random number generation technique.
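As a small illustration of this fuzzy membership representation (each sentence's memberships over all clusters sum to 1, and the sentence is reported under its largest-membership cluster), the following Java sketch is given. It is our own illustrative code with assumed names, not part of the paper.

import java.util.Arrays;

// Illustrative sketch: fuzzy cluster memberships and the derived hard assignment.
public class MembershipSketch {

    // A sentence is considered to be in the cluster for which its membership value is largest.
    static int[] hardAssignment(double[][] mu) {
        int[] cluster = new int[mu.length];
        for (int i = 0; i < mu.length; i++) {
            int best = 0;
            for (int c = 1; c < mu[i].length; c++)
                if (mu[i][c] > mu[i][best]) best = c;
            cluster[i] = best;
        }
        return cluster;
    }

    public static void main(String[] args) {
        // Toy membership matrix for 3 sentences over 2 clusters; each row sums to 1.
        double[][] mu = { {0.9, 0.1}, {0.3, 0.7}, {0.55, 0.45} };
        for (double[] row : mu) {
            double sum = Arrays.stream(row).sum();
            System.out.printf("memberships %s sum to %.2f%n", Arrays.toString(row), sum);
        }
        System.out.println("hard assignment: " + Arrays.toString(hardAssignment(mu)));
    }
}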

The rest of the paper is organized as follows. Section 2 summarizes related works. Section 3 describes the methodology of SBFRC. Section 4 illustrates the experimental studies. Finally, Section 5 concludes the paper.

II. RELATED WORKS

Hathaway et al.'s Relational Fuzzy c-Means (RFCM) algorithm [2] is considered the first successful fuzzy relational clustering algorithm. Although RFCM operates on relational input data, it requires the relation expressed by these data to be Euclidean. Despite its success, this Euclidean requirement was considered limiting, and various alternatives have been proposed. For instance, the ARCA algorithm [8] uses an attribute-based representation; a limitation of this method is the high dimensionality caused by representing objects in terms of their similarity with all other objects. The k-medoid family is popular in clustering, and fuzzy versions of k-medoids have also been introduced [3]. Like k-means, k-medoids is highly sensitive to the initial selection of medoids, which is done randomly, and it often requires running the algorithm several times from different random initializations. Spectral clustering algorithms that can be applied to sentence clustering have been proposed by Zha [9] and Wang et al. [10], and Wang et al. have applied a closely related non-negative matrix factorization [11] technique to sentence clustering in the context of multi-document summarization.

Our proposed approach, Sentence Clustering Based on FRC (SBFRC), is based on the fuzzy relational clustering algorithm known as FRECCA [6]. Here, the cluster membership values of each node represent the degree to which the object (a sentence) belongs to each of the respective clusters, and the mixing coefficients represent the probability of an object being in a cluster [6].

III. SENTENCE-CLUSTERING BASED ON FRC (SBFRC)

To overcome the shortcomings of the above approaches, SBFRC is proposed in this paper to group similar sentences together in clusters. It consists of several steps, described as follows; Fig. 1 shows the flowchart of the steps of SBFRC.

Fig. 1. Flow chart of SBFRC

Step 1: The input set is produced by pre-processing a text document. The sentences are extracted from the paragraphs and the number of sentences is determined.

Step 2: The second step of SBFRC is to measure word-to-word similarity. Sentence similarity measures play an important role in text-related research. Existing methods for measuring sentence similarity have been adopted from approaches used for long text documents; these methods process sentences in a very high-dimensional space and are consequently inefficient and not adaptable to some application domains [12]. The proposed sentence similarity method derives the similarity between two sentences using both word-to-word similarity and order similarity, as illustrated in Fig. 2. A text is considered to be a sequence of words, and the words, along with their combination structure, convey a specific meaning. Unlike existing methods, the proposed method calculates both word-to-word similarity and order similarity to compute the final sentence similarity.

Fig. 2. Measuring Sentence Similarity

For word-to-word similarity, the method uses a knowledge-based measure: the WordNet [7] based measure of Jiang and Conrath [13]. The Jiang-Conrath measure is based on the idea that the degree to which two words are similar is proportional to the amount of information they share. The similarity between words w1 and w2 is defined as in equation (1):

sim(w1, w2) = 1 / ( IC(w1) + IC(w2) - 2 IC(lso(w1, w2)) )    (1)

where lso(w1, w2) is the word that is the deepest common ancestor of words w1 and w2, and IC(w) is the information content of word w, defined as IC(w) = -log p(w), where p(w) is the probability that word w appears in a large corpus [12].

Step 3: The third step of SBFRC is to measure order similarity. For each sentence, an order vector is derived from the joint word set [12], and the order similarity is calculated using the two order vectors.
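To illustrate Steps 2 and 3, the following Java sketch computes the Jiang-Conrath word similarity of equation (1) from information-content values, builds semantic and order vectors over the joint word set in the spirit of [12], and derives S_word and S_order; the two components are then combined in Step 4 below. It is an illustration only, not the authors' code: the cosine aggregation into S_word and the toy exact-match word similarity used in main are our assumptions, and a real implementation would obtain IC values and lowest common ancestors from WordNet.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.BiFunction;

// Illustrative sketch of the word-to-word and order similarity components (not the authors' code).
public class SentenceSimilaritySketch {

    // Eq. (1): Jiang-Conrath similarity from information-content (IC) values.
    static double jcnSimilarity(double icW1, double icW2, double icLso) {
        double dist = icW1 + icW2 - 2.0 * icLso;     // JnC semantic distance
        return dist <= 0.0 ? 1.0 : 1.0 / dist;       // identical senses give maximal similarity
    }

    // Semantic vector over the joint word set: best similarity of each joint word to any word of the sentence.
    static double[] semanticVector(List<String> joint, List<String> sentence,
                                   BiFunction<String, String, Double> wordSim) {
        double[] v = new double[joint.size()];
        for (int i = 0; i < joint.size(); i++) {
            double best = 0.0;
            for (String w : sentence) best = Math.max(best, wordSim.apply(joint.get(i), w));
            v[i] = best;
        }
        return v;
    }

    // Order vector: 1-based position of each joint word in the sentence, 0 if absent.
    static double[] orderVector(List<String> joint, List<String> sentence) {
        double[] r = new double[joint.size()];
        for (int i = 0; i < joint.size(); i++) r[i] = sentence.indexOf(joint.get(i)) + 1;
        return r;
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
        return (na == 0 || nb == 0) ? 0.0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Order similarity from the two order vectors: 1 - |r1 - r2| / |r1 + r2|, as in [12].
    static double orderSimilarity(double[] r1, double[] r2) {
        double diff = 0, sum = 0;
        for (int i = 0; i < r1.length; i++) {
            diff += (r1[i] - r2[i]) * (r1[i] - r2[i]);
            sum  += (r1[i] + r2[i]) * (r1[i] + r2[i]);
        }
        return sum == 0 ? 1.0 : 1.0 - Math.sqrt(diff) / Math.sqrt(sum);
    }

    public static void main(String[] args) {
        System.out.printf("JnC with made-up IC values: %.3f%n", jcnSimilarity(7.2, 6.8, 5.9));
        List<String> s1 = Arrays.asList("nature", "is", "reckless", "of", "the", "individual");
        List<String> s2 = Arrays.asList("the", "individual", "is", "nothing", "to", "nature");
        Set<String> jointSet = new LinkedHashSet<>(s1);
        jointSet.addAll(s2);
        List<String> joint = new ArrayList<>(jointSet);
        // Toy word similarity: exact match only; a real system calls jcnSimilarity with WordNet-derived IC values.
        BiFunction<String, String, Double> wordSim = (a, b) -> a.equals(b) ? 1.0 : 0.0;
        double sWord  = cosine(semanticVector(joint, s1, wordSim), semanticVector(joint, s2, wordSim));
        double sOrder = orderSimilarity(orderVector(joint, s1), orderVector(joint, s2));
        double r = 0.8;                              // r > 0.5, as required in Step 4 / Eq. (2)
        System.out.printf("S_word=%.3f  S_order=%.3f  S=%.3f%n", sWord, sOrder, r * sWord + (1 - r) * sOrder);
    }
}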

Step 4: The fourth step is to calculate the overall similarity, as shown in Fig. 2. Word-to-word similarity represents the lexical similarity, while word order similarity provides information about the relationships between words: which words come before or after other words. The overall sentence similarity is derived by combining word-to-word similarity and order similarity, and the value is stored in the corresponding position of the similarity matrix. Overall sentence similarity is calculated using equation (2) [12]:

S = r S_word + (1 - r) S_order    (2)

where r <= 1 decides the relative contributions of word-to-word and word order information to the overall similarity computation [12]. Since word-to-word similarity is more important and plays a vital role in the overall sentence similarity, r should be a value greater than 0.5.

Step 5: The fifth step is to form the similarity matrix. The similarity method is applied to all possible pairs of sentences. After calculating the similarity values between all the sentences in a document, the values are stored in a matrix; if there are N sentences in the document, the matrix is of size N x N.

Step 6: The sixth step is to apply the PageRank algorithm [14], [15]. Unlike Gaussian mixture models, which use a likelihood function parameterized by the means and covariances of the mixture components, this method uses the PageRank score of an object within a cluster as a measure of its centrality to that cluster. These PageRank values are then treated as likelihoods. The PageRank value of object i in cluster c is calculated using equation (3):

P_i^c = (1 - d)/N + d Σ_j ( w_ji^c / Σ_k w_jk^c ) P_j^c    (3)

Here, P_i^c is the PageRank score of object i in cluster c, w_ij^c is the weight between objects i and j in cluster c (calculated as described below), and d is the damping factor. The damping factor affects the fuzziness of the clustering but generally does not affect the number of clusters, provided that its value is above approximately 0.8; in general, the higher the value of d, the harder the clustering, with cluster membership values being close to either zero or one. We have used a value of 0.8. After the PageRank scores are calculated, they are treated as likelihoods and used to calculate the cluster membership values, which are obtained using equation (4):

μ_i^c = π^c L_i^c / Σ_c' ( π^c' L_i^c' )    (4)

where π^c is the mixing coefficient for cluster c and L_i^c is the likelihood of object i in cluster c, obtained from the PageRank values. Membership values are normalized so that the memberships of an object sum to 1 over all clusters [6].

Step 7: The seventh step is the Expectation-Maximization (EM) step, which has two parts. The E-step calculates the membership values for each cluster. For each pair of sentences in a cluster, a weight is calculated using values from the similarity matrix and the membership values, as in equation (5):

w_ij^c = s_ij μ_i^c μ_j^c    (5)

In equation (5), w_ij^c is the weight between sentences i and j in cluster c, s_ij is the similarity between sentences i and j taken from the sentence similarity matrix, and μ_i^c and μ_j^c are the respective membership values of sentences i and j in cluster c [6]. With the help of this step, once the PageRank scores are calculated they are treated as likelihoods and the cluster membership values are calculated afterwards. In the M-step, the mixing coefficients are updated based on the membership values, and the procedure is repeated until convergence. The value of the mixing coefficient is the same for every sentence in a cluster, as in equation (6):

π^c = (1/N) Σ_i μ_i^c    (6)

where π^c is the mixing coefficient for cluster c, μ_i^c are the membership values calculated in the expectation step, and N is the total number of sentences.
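To make equations (3) to (6) concrete, the sketch below implements one EM pass in Java: per-cluster weights from equation (5), a power-iteration PageRank per cluster (equation (3)) treated as the likelihood, the membership update of equation (4), and the mixing-coefficient update of equation (6), with memberships and mixing coefficients initialized from uniform random numbers as proposed for SBFRC. It is a minimal sketch under our own naming and fixed iteration counts, not the authors' implementation.

import java.util.Random;

// Minimal illustrative sketch of the clustering iteration (not the authors' code).
public class FuzzyRelationalClusteringSketch {

    static final double DAMPING = 0.8;   // damping factor d used in the paper

    // One EM pass. sim is the N x N sentence similarity matrix, mu[i][c] the memberships,
    // pi[c] the mixing coefficients (updated in place). Returns the updated memberships.
    static double[][] emIteration(double[][] sim, double[][] mu, double[] pi) {
        int n = sim.length, k = pi.length;
        double[][] likelihood = new double[n][k];
        for (int c = 0; c < k; c++) {
            double[][] w = new double[n][n];                    // Eq. (5): w_ij^c = s_ij * mu_i^c * mu_j^c
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    w[i][j] = sim[i][j] * mu[i][c] * mu[j][c];
            double[] pr = pageRank(w);                          // Eq. (3): PageRank within cluster c
            for (int i = 0; i < n; i++) likelihood[i][c] = pr[i];   // treated as likelihoods
        }
        double[][] newMu = new double[n][k];
        for (int i = 0; i < n; i++) {                           // Eq. (4): membership update
            double z = 0;
            for (int c = 0; c < k; c++) { newMu[i][c] = pi[c] * likelihood[i][c]; z += newMu[i][c]; }
            for (int c = 0; c < k; c++) newMu[i][c] = (z > 0) ? newMu[i][c] / z : 1.0 / k;
        }
        for (int c = 0; c < k; c++) {                           // Eq. (6): M-step, mixing coefficients
            double s = 0;
            for (int i = 0; i < n; i++) s += newMu[i][c];
            pi[c] = s / n;
        }
        return newMu;
    }

    // Weighted PageRank by power iteration (a standard formulation, assumed here).
    static double[] pageRank(double[][] w) {
        int n = w.length;
        double[] rowSum = new double[n];
        for (int j = 0; j < n; j++)
            for (int m = 0; m < n; m++) rowSum[j] += w[j][m];
        double[] pr = new double[n];
        java.util.Arrays.fill(pr, 1.0 / n);
        for (int iter = 0; iter < 50; iter++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                double rank = 0;
                for (int j = 0; j < n; j++)
                    if (rowSum[j] > 0) rank += w[j][i] / rowSum[j] * pr[j];
                next[i] = (1 - DAMPING) / n + DAMPING * rank;
            }
            pr = next;
        }
        return pr;
    }

    // Uniformly distributed random values normalized to sum to 1 (SBFRC's uniform random initialization).
    static double[] randomSimplex(int k, Random rng) {
        double[] v = new double[k];
        double sum = 0;
        for (int c = 0; c < k; c++) { v[c] = rng.nextDouble(); sum += v[c]; }
        for (int c = 0; c < k; c++) v[c] /= sum;
        return v;
    }

    public static void main(String[] args) {
        double[][] sim = { {1.0, 0.8, 0.1, 0.1},
                           {0.8, 1.0, 0.2, 0.1},
                           {0.1, 0.2, 1.0, 0.9},
                           {0.1, 0.1, 0.9, 1.0} };              // toy 4 x 4 similarity matrix
        int k = 2;
        Random rng = new Random(7);
        double[][] mu = new double[sim.length][];
        for (int i = 0; i < sim.length; i++) mu[i] = randomSimplex(k, rng);  // memberships: uniform random, rows sum to 1
        double[] pi = randomSimplex(k, rng);                                  // mixing coefficients: uniform random
        for (int t = 0; t < 30; t++) mu = emIteration(sim, mu, pi);           // repeat E/M updates (convergence is Step 8)
        for (int i = 0; i < sim.length; i++)
            System.out.printf("sentence %d: memberships %.2f / %.2f%n", i + 1, mu[i][0], mu[i][1]);
    }
}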
Step 8: The eighth and last step is to find the convergence point of the clusters. Convergence is achieved when the memberships of the sentences in the clusters no longer change, or when the difference from the previous values is very small.

IV. EXPERIMENTAL ANALYSIS

A. Experimental Setup

SBFRC has been developed in Java on an Intel Core i-series processor at 2.30 GHz with .0 GB of RAM, running the Windows operating system. The WordNet database has been used for measuring word-to-word similarity. The quotation dataset in Table 1 [16] and the news article dataset in Table 3 [6] are partial listings of the full datasets.

Table 1: Quotation dataset

Knowledge Class
1. Our knowledge can only be finite, while our ignorance must necessarily be infinite.
2. Everybody gets so much common information all day long that they lose their common sense.
...
Marriage Class
11. A husband is what is left of a lover, after the nerve has been extracted.
12. Marriage has many pains, but celibacy has no pleasures.
...
Nature Class
21. I have called this principle by which each slight variation if useful is preserved by the term natural selection.
22. Nature is reckless of the individual. When she has points to carry, she carries them.
...
Peace Class
31. There is no such thing as inner peace; there is only nervousness and death.
32. Once you hear the details of victory, it is hard to distinguish it from a defeat.
...
Food Class
41. Food is an important part of a balanced diet.
42. To eat well in England you should have breakfast three times a day.
...

B. Experimental Results and Comparisons

Table 2 shows the results of applying our method, ARCA, Spectral Clustering, and k-medoids to the quotation dataset, evaluated using the external measures. Our method requires that an initial number of clusters be specified. This number was varied from 3 to 8, running repeated trials for each case, each trial commencing from a different random initialization of the membership values. In each case the same affinity matrix was used, with pairwise similarities calculated as described before. However, only three unique clusterings were found, each containing a different number of clusters, ranging from four to seven.

Table 2: Evaluation of quotation dataset: Purity, Entropy, Rand Index, and F-Measure of SBFRC, ARCA, Spectral Clustering, and K-Medoids for each number of clusters

To demonstrate how the algorithm may perform in more general text mining activities, the system is also applied to clustering sentences from a news article. Table 3 shows sentences from an article about President Barack Obama's presidency.

Table 3: News article dataset
1. President Barack Obama on Tuesday championed nuclear energy expansion as the latest way that feuding parties can move beyond the broken politics...
12. That mission, however, remains in doubt.
1. In Saginaw, Biden insisted the stimulus is working even as he acknowledged it's going to take us a while to get us out of this ditch.
...
2. It includes more direct and rapid response to criticism, more events at which the president speaks directly to the public without the filter of the media.
28. The intended narrative is one in which Obama hears people's frustrations and is working directly to end them.
29. There is little doubt the public is angry.
30. A CBS News poll in early February found eighty-one percent saying it's time to elect new people to Congress.

The result for the news article dataset is shown in Table 4.

Table 4: Results of news article dataset (sentences grouped by cluster)
Cluster 1: 3, , , 9, 10, 13, 1, 1, 1, 1, 19, 21, 2
Cluster 2: 2, , 12, 18, 20, 22, 2, 2, 2, 28, 29
Cluster 3: 1, 8, 23, 30

C. Discussions

Our method is evaluated using the performance measures Purity, Entropy [17], Rand Index, and F-Measure [18]. The algorithm is run repeatedly on the quotation dataset and the best results are chosen, since the method is unsupervised and there is no fixed output. As the four performance measures are not always consistent as to which algorithm achieves the best performance for a given number of clusters, the values for which SBFRC attains the best result are indicated in boldface in Table 2. For example, for the Rand Index our method achieves a value of 0.9, which is greater than that achieved by the other algorithms (0.8, 0.8, and 0.33), and for entropy the proposed method achieves a lower (better) value than the other methods, hence these values are shown in boldface. It can be seen from the table that the algorithm performs better for some cluster numbers than for others, as measured by the external cluster evaluation criteria.
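For reference, two of these external measures, purity and entropy, can be computed from hard cluster assignments and gold class labels as in the generic Java sketch below; this illustrates the standard definitions only and is not the evaluation code behind Table 2.

import java.util.Arrays;

// Illustrative sketch of two external cluster evaluation measures.
public class ClusterEvaluationSketch {

    // Purity: fraction of sentences assigned to the majority class of their cluster.
    static double purity(int[] cluster, int[] label, int k, int numClasses) {
        int n = cluster.length, correct = 0;
        for (int c = 0; c < k; c++) {
            int[] counts = new int[numClasses];
            for (int i = 0; i < n; i++) if (cluster[i] == c) counts[label[i]]++;
            correct += Arrays.stream(counts).max().orElse(0);
        }
        return (double) correct / n;
    }

    // Entropy: cluster-size-weighted entropy of the class distribution inside each cluster (lower is better).
    static double entropy(int[] cluster, int[] label, int k, int numClasses) {
        int n = cluster.length;
        double total = 0;
        for (int c = 0; c < k; c++) {
            int[] counts = new int[numClasses];
            int size = 0;
            for (int i = 0; i < n; i++) if (cluster[i] == c) { counts[label[i]]++; size++; }
            if (size == 0) continue;
            double h = 0;
            for (int count : counts) if (count > 0) {
                double p = (double) count / size;
                h -= p * (Math.log(p) / Math.log(2));
            }
            total += (double) size / n * h;
        }
        return total;
    }

    public static void main(String[] args) {
        // Toy example: 6 sentences, 2 gold classes, hard clusters taken as argmax memberships.
        int[] cluster = {0, 0, 0, 1, 1, 1};
        int[] label   = {0, 0, 1, 1, 1, 0};
        System.out.printf("purity=%.3f entropy=%.3f%n",
                purity(cluster, label, 2, 2), entropy(cluster, label, 2, 2));
    }
}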
Looking at the result achieved on the news article dataset, it is observed that in the first cluster, sentence 1 has the highest PageRank score, so sentence 1 goes into cluster 1. The sentences 3, , , 9, 10, 13, 1, 1, 1, 19, 21, 2 are also in cluster 1; their PageRank scores are close to that of sentence 1, and these sentences have similar meanings. For the second cluster, sentence 2 has the highest PageRank, and close to it are the sentences 2, , 12, 18, 20, 22, 2, 2, 2, 28, 29. Most of them are similar in that they carry a negative sense (criticism, anger, frustration).

V. CONCLUSION

An obvious potential application of the algorithm is document classification and summarization. Like any other clustering algorithm, the performance of this method ultimately depends on the quality of the input set, and for sentence clustering the performance can be improved with better sentence similarity measures. Although the cluster number is provided initially, the algorithm appears to be able to converge to an appropriate number of clusters. The idea can be expanded into a hierarchical fuzzy relational clustering algorithm, and the proposed method can also be applied to a cloud trust system to measure its performance by analyzing user feedback comments.

REFERENCES

[1] G. M. Weiss and B. D. Davison, "Data Mining," in The Handbook of Technology Management, H. Bidgoli, Ed., John Wiley and Sons, 2010.
[2] R. J. Hathaway, J. W. Davenport, and J. C. Bezdek, "Relational Duals of the c-Means Clustering Algorithms," Pattern Recognition, vol. 22, no. 2, pp. 205-212, 1989.
[3] R. Krishnapuram, A. Joshi, and L. Yi, "A Fuzzy Relative of the k-Medoids Algorithm with Application to Web Document and Snippet Clustering," Proc. IEEE Int'l Fuzzy Systems Conf., 1999.
[4] T. Geweniger, D. Zühlke, B. Hammer, and T. Villmann, "Median Fuzzy C-Means for Clustering Dissimilarity Data," Neurocomputing, vol. 73, nos. 7-9, 2010.
[5] Wikipedia: The Free Encyclopedia, Wikimedia Foundation Inc., updated 2 August, accessed 2 September 2015.
[6] A. Skabar and K. Abdalgader, "Clustering Sentence-Level Text Using a Novel Fuzzy Relational Clustering Algorithm," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 1, pp. 62-75, 2013.
[7] C. Fellbaum, WordNet: An Electronic Lexical Database. MIT Press, 1998.
[8] P. Corsini, F. Lazzerini, and F. Marcelloni, "A New Fuzzy Relational Clustering Algorithm Based on the Fuzzy C-Means Algorithm," Soft Computing, vol. 9, pp. 439-447, 2005.
[9] H. Zha, "Generic Summarization and Keyphrase Extraction Using Mutual Reinforcement Principle and Sentence Clustering," Proc. 25th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, 2002.
[10] D. Wang, T. Li, S. Zhu, and C. Ding, "Multi-Document Summarization via Sentence-Level Semantic Analysis and Symmetric Matrix Factorization," Proc. 31st Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, 2008.
[11] D. Lee and H. Seung, "Algorithms for Non-Negative Matrix Factorization," Advances in Neural Information Processing Systems, vol. 13, pp. 556-562, 2001.
[12] Y. Li, D. McLean, Z. A. Bandar, J. D. O'Shea, and K. Crockett, "Sentence Similarity Based on Semantic Nets and Corpus Statistics," IEEE Trans. Knowledge and Data Eng., vol. 18, no. 8, pp. 1138-1150, Aug. 2006.
[13] J. J. Jiang and D. W. Conrath, "Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy," Proc. 10th Int'l Conf. Research in Computational Linguistics, 1997.
[14] S. Brin and L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine," Computer Networks and ISDN Systems, vol. 30, pp. 107-117, 1998.
[15] Wikipedia: The Free Encyclopedia, Wikimedia Foundation Inc., updated 30 August, accessed 2 September 2015.
[16] Online resource, accessed 30 August 2015.
[17] Wikipedia: The Free Encyclopedia, Wikimedia Foundation Inc., updated 1 September, accessed 2 September 2015.
[18] Wikipedia: The Free Encyclopedia, Wikimedia Foundation Inc., updated 2 July, accessed 2 September 2015.
