CHAPTER 5 CLUSTERING USING MUST LINK AND CANNOT LINK ALGORITHM
- Ethelbert Tyrone Heath
5.1 INTRODUCTION

In this phase, the prime attribute taken into consideration is the high dimensionality of the document space. The proposed system employs three mechanisms. First, related words in a document are identified using the MLCL (Must Link and Cannot Link) algorithm, and the relevant keywords are grouped into clusters using three main equations. The clusters thus formed are then optimized using Gaussian parameters, which capture the word patterns and the standard deviation of each cluster. Finally, the words are regrouped in accordance with the Gaussian outputs and colligated into clusters with reference to the documents.

5.2 SYSTEM ARCHITECTURE

Figure 5.1 shows the architecture of the entire system. The keywords extracted by the genetic algorithm process are clustered using the Must Link and Cannot Link algorithm, and the resulting clusters are then optimized using Gaussian parameters.
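The three stages above can be sketched as a simple data-flow chain. This is a hypothetical orchestration for illustration only; the stage functions passed in are invented stand-ins, not the thesis implementation.

```python
def pipeline(documents, extract_keywords, mlcl_cluster, gaussian_optimize):
    """Chain the three stages described above; each stage is supplied as a
    callable so the toy stand-ins below can demonstrate the data flow."""
    keywords = extract_keywords(documents)   # stage 1: keyword extraction (genetic algorithm)
    clusters = mlcl_cluster(keywords)        # stage 2: Must Link / Cannot Link clustering
    return gaussian_optimize(clusters)       # stage 3: Gaussian-parameter optimization

# Toy stand-ins, only to show the shape of the data at each stage:
result = pipeline(
    ["doc one", "doc two"],
    lambda docs: [d.split() for d in docs],   # keyword list per document
    lambda kws: [sorted(set(sum(kws, [])))],  # one naive cluster of all keywords
    lambda cls: [c for c in cls if c],        # drop empty clusters
)
print(result)
```

The point of the sketch is only that each stage consumes the previous stage's output; the real extraction, clustering, and optimization steps are described in the sections that follow.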
Figure 5.1 Overall System Architecture

5.3 MUST LINK AND CANNOT LINK ALGORITHM

The extracted keywords are passed to the MLCL algorithm, which identifies and eliminates the terms that do not correlate with the other terms. Each document is considered an individual cluster of key terms. The equations used to calculate the similarity between the key terms of each document are based on the principles of cosine similarity. Equations (5.1) and (5.2) compute the values D_min and D_max, which are compared against the threshold condition of Equation (5.3) to check whether a relationship can be established between the words. The related keywords are then grouped to form a cluster. The Must Link and Cannot Link algorithm
has three different phases for the formation of clusters. Equations (5.1) to (5.3) are given as follows:

    D_{min} = 2 \cdot W_1(t,d) \cdot \log \frac{W_1(t,d)}{\min(W(t,d))}        (5.1)

    D_{max} = \frac{1}{2} \cdot W_2(t,d) \cdot \log \frac{W_2(t,d)}{\max(W(t,d))}        (5.2)

    Val(D_{min}) \geq Val(D_{max})        (5.3)

The decision of whether the documents are to be grouped is taken with respect to their key terms. Expressing the key terms in numerical form boosts the rates of accuracy without making the clustering process complex. The grouping of documents into clusters is based on Equations (5.4) to (5.6), which depend on the weight of each term; the small variances identify the apt words and place them in the accurate clusters. The three equations are as follows:

    E_1 = \log \frac{W_1 \cdot W_2}{Avg(W_1, W_2)}        (5.4)

    E_2 = \frac{Low(W_1, W_2)}{Avg(W_1, W_2)}        (5.5)

    E_3 = \frac{\log(Low(W_1, W_2))}{\log(W_1) + \log(W_2)}        (5.6)
W1 and W2 are the weights of terms in two different documents, D1 and D2, respectively. Equation (5.4) gives a numerical relationship between the terms w1 and w2 of the two documents. Equation (5.5) adopts the lowest weight amongst the two words w1 and w2 to form a relationship between the two documents. Equation (5.6) is a combined representation of E1 and E2, and gives the platform over which the relationship of the two terms can be calibrated. Based on these scores, the inequalities (5.7) and (5.8) are formed:

    E_3 \leq E_2 \leq E_1  or  E_2 \leq E_3 \leq E_1        (5.7)

    E_3 \leq E_1 \leq E_2  or  E_1 \leq E_3 \leq E_2        (5.8)

Three different cases are considered based on these inequalities.

CASE 1: Documents with matching keywords
- The two documents are grouped together.
- The average weight of the two key terms is computed in the final cluster.

CASE 2: Clusters with matching sub-keywords
- Clusters with matching sub-keywords are identified.
- The matched clusters are grouped together.
- The average weight of the two key terms is computed in the final cluster.

CASE 3: No matching terms
- Do nothing.
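The scores E1 to E3 and the case analysis above can be sketched as follows. This is a minimal illustration of the reconstructed Equations (5.4) to (5.8), not the thesis implementation, and the sample weights 0.4 and 0.6 are invented for demonstration.

```python
import math

def clustering_scores(w1, w2):
    """E1-E3 from Equations (5.4)-(5.6); w1 and w2 are the (positive)
    weights of a term in documents D1 and D2."""
    avg = (w1 + w2) / 2.0                                 # Avg(W1, W2)
    low = min(w1, w2)                                     # Low(W1, W2)
    e1 = math.log((w1 * w2) / avg)                        # Equation (5.4)
    e2 = low / avg                                        # Equation (5.5)
    e3 = math.log(low) / (math.log(w1) + math.log(w2))    # Equation (5.6)
    return e1, e2, e3

def merge_decision(e1, e2, e3):
    """Case analysis following the inequalities (5.7) and (5.8)."""
    if e3 <= e2 <= e1 or e2 <= e3 <= e1:        # inequality (5.7) -> Case 1
        return "group documents"
    if e3 <= e1 <= e2 or e1 <= e3 <= e2:        # inequality (5.8) -> Case 2
        return "group matching sub-clusters"
    return "no action"                          # Case 3

e1, e2, e3 = clustering_scores(0.4, 0.6)
print(merge_decision(e1, e2, e3))
```

Note that the logarithms require strictly positive weights below 1 to be interpreted consistently; a production version would guard against zero weights and the degenerate case w1 * w2 = 1.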
5.4 ALGORITHM FOR OPTIMIZATION OF CLUSTERS

    Let C be the total number of clusters
    Let count(i) be the number of words in cluster i
    Let word[x, i] be the x-th word of cluster i
    Let S(x) be a marker for word x, initialised to 0

    for i from 1 to C do
        for j from i+1 to C do
            for x from 1 to count(i) do
                for y from 1 to count(j) do
                    if word[x, i] == word[y, j] then
                        if S(x) == S(y) then
                            mark S(y) to 1
                        end if
                    end if
                end for
            end for
        end for
    end for

    for i from 1 to C do
        flag = 0
        for x from 1 to count(i) do
            get word[x, i]
            if S(x) == 0 then
                break
            else
                flag = 1
            end if
        end for
        if flag == 1 then
            delete cluster i
        end if
    end for
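The optimization step can be sketched in runnable form as follows. This is one interpretation of the pseudocode above, under the assumption that S marks words occurring in more than one cluster and that a cluster consisting entirely of marked words is deleted; it is not the thesis implementation.

```python
def optimize_clusters(clusters):
    """Mark words that occur in more than one cluster (the role of S in the
    pseudocode), then delete any cluster made up entirely of marked words,
    so that each word is left in one cluster only."""
    seen, marked = set(), set()
    for cluster in clusters:
        for word in set(cluster):
            if word in seen:
                marked.add(word)          # duplicate across clusters -> S = 1
            seen.add(word)
    kept = []
    for cluster in clusters:
        if cluster and all(word in marked for word in cluster):
            continue                      # flag == 1: delete the cluster
        kept.append(cluster)
    return kept

print(optimize_clusters([["alpha", "beta"], ["beta"]]))
```

In this sketch the cluster ["beta"] is dropped because its only word also appears elsewhere, while ["alpha", "beta"] survives on the strength of its unique word.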
The algorithm given above is used to optimize the clusters. It shows how the standard deviation and the word pattern can be used to form clusters: terms with a similar standard deviation are grouped into a single cluster, and the output is thereby optimized. Some documents can appear in more than one cluster; in such cases the clusters are optimized so that each document appears in one cluster alone.

5.5 CLUSTER OUTPUT

Scatter Plots of Reuters-21578

The process of testing MLCL clustering revolved around aspects of the document test space. One of the key objectives was to keep up the performance even as the document size increases. For the testing of Reuters, a single category of documents was taken into consideration. As indicated in the test plan, the output had to give a much higher value for the micro measures when compared with the existing method of fuzzy clustering. Since a single category of topic was pushed into the algorithms, the algorithm has to group as many documents as possible into a single cluster, because all the documents are of the same genre. The figures consist of rectangular and oval shaped regions to indicate the distinguishable documents; the total area covered by the shapes gives an account of the state space employed by the phases of MLCL clustering.

20 Documents

When the number of documents was 20, as indicated in Figure 5.1, the state space covered by the novel MLCL clustering method was much less than that of fuzzy logic. The MLCL method formed only two different clusters, in an individual segment in which most of the documents were placed; this satisfied the prime need of precision. The fuzzy algorithm formed three different clusters, each of which had an even distribution of
documents. Nevertheless, this attribute does not satisfy the requirement of one-topic clustering. The novel algorithm stood ahead of the existing one when tested against 20 documents under a single topic, as shown in Figure 5.2.

Figure 5.1 MLCL 20 Documents (Reuters-21578)

Figure 5.2 Fuzzy 20 Documents (Reuters-21578)
50 Documents

Figures 5.3 and 5.4 give an account of how clustering was done when the number of documents increased to 50.

Figure 5.3 MLCL 50 Documents (Reuters-21578)

Figure 5.4 Fuzzy 50 Documents (Reuters-21578)
The novel method adhered to the prime requirement: the size of the state space did increase, but the documents were all clustered together, so the creation of false clusters was prevented. The fuzzy algorithm produced an extra cluster for 50 documents, giving rise to a corresponding decrease in the overall precision and accuracy of the method. Its clusters covered a greater area and compromised on the actual requirement of single clustering.

100 Documents

The test on 100 documents offered an extreme view of how clustering happens in both algorithms, as shown in Figures 5.5 and 5.6. Based on the monotonous approach of clustering in accordance with the genre, the number of documents in the individual cluster tended to favour the MLCL methodology, while the existing technique produced more diverse clusters.

Figure 5.5 MLCL 100 Documents (Reuters-21578)
Figure 5.6 Fuzzy 100 Documents (Reuters-21578)

Scatter Plots of the Brown Corpus

The next phase of testing addresses the other aspect of clustering: work on documents with individual topics. It deals with documents of different topics, which are combined together and tested for the effective formation of clusters. Here, each test comprises five different document topics, and the proposed method works with the intention of forming five distinguished clusters.

20 Documents

Initially, when 20 documents with five different topics were combined, the algorithm had to give a few distinguished clusters. As aimed, the process of MLCL clustering produced a few different clusters within a given state space, as shown in Figures 5.7 and 5.8. The output was similar to that of fuzzy logic, but it met the prime requisite of document space management: the amount of state space covered by fuzzy logic was much more than that of MLCL. In the novel method,
documents formed an even distribution amongst the clusters, and none of the documents was abandoned, whereas in fuzzy logic there are a few clusters with just two documents, which brings down the weightage of the methodology.

Figure 5.7 MLCL 20 Documents (Brown Corpus)

Figure 5.8 Fuzzy 20 Documents (Brown Corpus)
50 Documents

A marked increase in performance can be seen in Figures 5.9 and 5.10 while testing with 50 documents.

Figure 5.9 MLCL 50 Documents (Brown Corpus)

Figure 5.10 Fuzzy 50 Documents (Brown Corpus)
The new MLCL algorithm adhered to the formation of a reasonable number of clusters; the documents were well distributed and distinct. The fuzzy logic method did not work to satisfactory levels: the documents were sparse and were grouped into an inappropriate number of clusters, and the output gives a clear impression of its poor performance, which highlights the newer method of MLCL clustering.

100 Documents

Figures 5.11 and 5.12 give an account of the performance on 100 documents of five different categories. The grouping in MLCL clustering was appealingly different: though the graphical representation on the scatter plot demonstrated a specific color, the plotting is diversified into distinct spaces, showing the possibility of two different clusters. The intermediate areas are framed with other clusters, showing the presence of two different document topics. Apart from this, the output of fuzzy logic was also lower: the documents were sparse and distributed, and only two different clusters were formed from 100 documents.

Figure 5.11 MLCL 100 Documents (Brown Corpus)
Figure 5.12 Fuzzy 100 Documents (Brown Corpus)

5.6 TESTING

The performance of the clusters is evaluated using micro measures: Micro Averaged Precision (MicroP), Micro Averaged Recall (MicroR), Micro Averaged F-Measure (MicroF) and Micro Averaged Accuracy (MicroA).

Micro Averaged Precision (sklearn.metrics.precision_recall_fscore_support.html, wikipedia.org/wiki/f1_score) is defined as the number of true positives (correct results) divided by the number of all returned results, and is given in Equation (5.9). It is the ability of the algorithm not to label as positive a sample that is negative.

    MicroP = \frac{\sum_{i=1}^{p} TP_i}{\sum_{i=1}^{p} (TP_i + FP_i)}        (5.9)
Micro Averaged Recall [20, 21] is defined as the number of true positives (correct results) divided by the number of results that should have been returned. It is the ability of the algorithm to find all the positive samples. Equation (5.10) is given as follows:

    MicroR = \frac{\sum_{i=1}^{p} TP_i}{\sum_{i=1}^{p} (TP_i + FN_i)}        (5.10)

Micro Averaged F-Measure [20, 21] is interpreted as the harmonic mean of precision and recall, and is given in Equation (5.11) as follows:

    MicroF = \frac{2 \cdot MicroP \cdot MicroR}{MicroP + MicroR}        (5.11)

Micro Averaged Accuracy [22] is defined as the portion of all decisions that were correct. It is defined in Equation (5.12) as follows:

    MicroA = \frac{\sum_{i=1}^{p} (TP_i + TN_i)}{\sum_{i=1}^{p} (TP_i + FP_i + TN_i + FN_i)}        (5.12)

Two different datasets, Reuters-21578 and the Brown Corpus, were used for the study.

Reuters-21578

Reuters has been widely used for testing clustering algorithms. Reuters-21578 is an experimental data collection that appeared on the Reuters newswire in the year 1987. The dataset is publicly available for download.
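The micro-averaged measures defined in Equations (5.9) to (5.12) can be sketched as a small helper. The per-cluster count lists below are invented for illustration; this is not the evaluation code used in the thesis.

```python
def micro_measures(tp, fp, fn, tn):
    """Micro-averaged measures of Equations (5.9)-(5.12); each argument is
    a list of per-cluster counts (true/false positives and negatives)."""
    TP, FP, FN, TN = sum(tp), sum(fp), sum(fn), sum(tn)
    micro_p = TP / (TP + FP)                               # Equation (5.9)
    micro_r = TP / (TP + FN)                               # Equation (5.10)
    micro_f = 2 * micro_p * micro_r / (micro_p + micro_r)  # Equation (5.11)
    micro_a = (TP + TN) / (TP + FP + TN + FN)              # Equation (5.12)
    return micro_p, micro_r, micro_f, micro_a

# Illustrative counts for two clusters (invented numbers):
p, r, f, a = micro_measures(tp=[8, 2], fp=[1, 1], fn=[0, 2], tn=[5, 5])
print(round(p, 3), round(r, 3), round(f, 3), round(a, 3))
```

Because the counts are summed over all clusters before the ratios are taken, large clusters dominate the micro measures; this is the intended behaviour of micro averaging, as opposed to macro averaging, which averages the per-cluster ratios.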
Table 5.1 MicroR values (Reuters-21578): No. of Documents / MLCL Clustering / Fuzzy Logic

Table 5.2 MicroP values (Reuters-21578): No. of Documents / MLCL Clustering / Fuzzy Logic

Table 5.3 MicroF values (Reuters-21578): No. of Documents / MLCL Clustering / Fuzzy Logic
Table 5.4 MicroA values (Reuters-21578): No. of Documents / MLCL Clustering / Fuzzy Logic

Figure 5.13 MicroR Values for the Reuters Dataset
Figure 5.14 MicroP Values for the Reuters Dataset

Figure 5.15 MicroF Values for the Reuters Dataset
Figure 5.16 MicroA Values for the Reuters Dataset

The values of MicroR, MicroP, MicroF and MicroA in Tables 5.1 to 5.4 show numerically how the results of MLCL clustering stood ahead of fuzzy logic. This concludes the testing of documents in a large data space with a common topic of identification. Figures 5.13 to 5.16 show the diagrammatic representation of the same.

Brown Corpus

The next phase of testing addresses the other aspect of clustering: work on documents with individual topics. It deals with documents of different topics, which are combined together and tested for the effective formation of clusters. Here, each test comprises five different document topics, and the proposed method works with the intention of forming five distinguished clusters.
The Brown Corpus has 500 sample English documents. The documents contain tagged words, which can be used to identify the tense of each word. The text samples are distributed over 15 different genres; each sample starts at a random sentence boundary and continues to the next boundary.

Table 5.5 MicroR values (Brown Corpus): No. of Documents / MLCL Clustering / Fuzzy Logic

Table 5.6 MicroP values (Brown Corpus): No. of Documents / MLCL Clustering / Fuzzy Logic
Table 5.7 MicroF values (Brown Corpus): No. of Documents / MLCL Clustering / Fuzzy Logic

Table 5.8 MicroA values (Brown Corpus): No. of Documents / MLCL Clustering / Fuzzy Logic
Figure 5.17 MicroR Values for the Brown Corpus Dataset

Figure 5.18 MicroP Values for the Brown Corpus Dataset
Figure 5.19 MicroF Values for the Brown Corpus Dataset

Figure 5.20 MicroA Values for the Brown Corpus Dataset
The numerical values in Tables 5.5 to 5.8 show that the new MLCL algorithm outperforms fuzzy logic on the Brown Corpus dataset as well. Figures 5.17 to 5.20 show the diagrammatic representation of the same.

5.7 SUMMARY

In this work, clustering is done in two phases. First, the dimensionality of the text document is decreased by selecting the important keywords. Then the selected keywords are clustered using the MLCL algorithm. The novel method incorporates various computations to find the similarity between the words and the documents: the relationship between the words of a document is calculated using the MLCL algorithm, the similarity measures are used to identify the initial clusters, and the clustering process continues until all the documents are clustered. Finally, the clusters are optimized using Gaussian parameters. The entire process is tested for its effectiveness with two different benchmark datasets. The new MLCL clustering algorithm is compared against fuzzy self-constructing feature clustering, and the novel method was found to outperform the existing algorithm consistently. The next chapter discusses sentence-based clustering, an extended version of the current work that helps to improve the process of text summarization and clustering.
More informationTHE preceding chapters were all devoted to the analysis of images and signals which
Chapter 5 Segmentation of Color, Texture, and Orientation Images THE preceding chapters were all devoted to the analysis of images and signals which take values in IR. It is often necessary, however, to
More informationOnline Pattern Recognition in Multivariate Data Streams using Unsupervised Learning
Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning Devina Desai ddevina1@csee.umbc.edu Tim Oates oates@csee.umbc.edu Vishal Shanbhag vshan1@csee.umbc.edu Machine Learning
More informationPrincipal Component Image Interpretation A Logical and Statistical Approach
Principal Component Image Interpretation A Logical and Statistical Approach Md Shahid Latif M.Tech Student, Department of Remote Sensing, Birla Institute of Technology, Mesra Ranchi, Jharkhand-835215 Abstract
More informationYelp Restaurant Photo Classification
Yelp Restaurant Photo Classification Rajarshi Roy Stanford University rroy@stanford.edu Abstract The Yelp Restaurant Photo Classification challenge is a Kaggle challenge that focuses on the problem predicting
More informationA Robust Wipe Detection Algorithm
A Robust Wipe Detection Algorithm C. W. Ngo, T. C. Pong & R. T. Chin Department of Computer Science The Hong Kong University of Science & Technology Clear Water Bay, Kowloon, Hong Kong Email: fcwngo, tcpong,
More information8 th Grade Pre Algebra Pacing Guide 1 st Nine Weeks
8 th Grade Pre Algebra Pacing Guide 1 st Nine Weeks MS Objective CCSS Standard I Can Statements Included in MS Framework + Included in Phase 1 infusion Included in Phase 2 infusion 1a. Define, classify,
More informationClustering will not be satisfactory if:
Clustering will not be satisfactory if: -- in the input space the clusters are not linearly separable; -- the distance measure is not adequate; -- the assumptions limit the shape or the number of the clusters.
More informationTag-based Social Interest Discovery
Tag-based Social Interest Discovery Xin Li / Lei Guo / Yihong (Eric) Zhao Yahoo!Inc 2008 Presented by: Tuan Anh Le (aletuan@vub.ac.be) 1 Outline Introduction Data set collection & Pre-processing Architecture
More informationBusiness Club. Decision Trees
Business Club Decision Trees Business Club Analytics Team December 2017 Index 1. Motivation- A Case Study 2. The Trees a. What is a decision tree b. Representation 3. Regression v/s Classification 4. Building
More informationContent-based Dimensionality Reduction for Recommender Systems
Content-based Dimensionality Reduction for Recommender Systems Panagiotis Symeonidis Aristotle University, Department of Informatics, Thessaloniki 54124, Greece symeon@csd.auth.gr Abstract. Recommender
More informationExtracting Rankings for Spatial Keyword Queries from GPS Data
Extracting Rankings for Spatial Keyword Queries from GPS Data Ilkcan Keles Christian S. Jensen Simonas Saltenis Aalborg University Outline Introduction Motivation Problem Definition Proposed Method Overview
More informationDI TRANSFORM. The regressive analyses. identify relationships
July 2, 2015 DI TRANSFORM MVstats TM Algorithm Overview Summary The DI Transform Multivariate Statistics (MVstats TM ) package includes five algorithm options that operate on most types of geologic, geophysical,
More informationEvaluating Segmentation
Evaluating Segmentation David Martin dmartin@cs.bc.edu Computer Science Department Boston College CVPR 2004 Graph-Based Image Segmentation Tutorial 1 How do you know when a segmentation algorithm is good?
More informationKeyword Extraction by KNN considering Similarity among Features
64 Int'l Conf. on Advances in Big Data Analytics ABDA'15 Keyword Extraction by KNN considering Similarity among Features Taeho Jo Department of Computer and Information Engineering, Inha University, Incheon,
More informationInformation Retrieval and Web Search Engines
Information Retrieval and Web Search Engines Lecture 7: Document Clustering May 25, 2011 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig Homework
More informationImproving Positron Emission Tomography Imaging with Machine Learning David Fan-Chung Hsu CS 229 Fall
Improving Positron Emission Tomography Imaging with Machine Learning David Fan-Chung Hsu (fcdh@stanford.edu), CS 229 Fall 2014-15 1. Introduction and Motivation High- resolution Positron Emission Tomography
More informationCHAPTER 4 FUZZY LOGIC, K-MEANS, FUZZY C-MEANS AND BAYESIAN METHODS
CHAPTER 4 FUZZY LOGIC, K-MEANS, FUZZY C-MEANS AND BAYESIAN METHODS 4.1. INTRODUCTION This chapter includes implementation and testing of the student s academic performance evaluation to achieve the objective(s)
More informationExample 1: Give the coordinates of the points on the graph.
Ordered Pairs Often, to get an idea of the behavior of an equation, we will make a picture that represents the solutions to the equation. A graph gives us that picture. The rectangular coordinate plane,
More informationCombinatorial PCA and SVM Methods for Feature Selection in Learning Classifications (Applications to Text Categorization)
Combinatorial PCA and SVM Methods for Feature Selection in Learning Classifications (Applications to Text Categorization) Andrei V. Anghelescu Ilya B. Muchnik Dept. of Computer Science DIMACS Email: angheles@cs.rutgers.edu
More informationPrivacy Preserving Probabilistic Record Linkage
Privacy Preserving Probabilistic Record Linkage Duncan Smith (Duncan.G.Smith@Manchester.ac.uk) Natalie Shlomo (Natalie.Shlomo@Manchester.ac.uk) Social Statistics, School of Social Sciences University of
More informationRefinement of Web Search using Word Sense Disambiguation and Intent Mining
International Journal of Information and Computation Technology. ISSN 974-2239 Volume 4, Number 3 (24), pp. 22-23 International Research Publications House http://www. irphouse.com /ijict.htm Refinement
More informationDigital Image Processing. Prof. P.K. Biswas. Department of Electronics & Electrical Communication Engineering
Digital Image Processing Prof. P.K. Biswas Department of Electronics & Electrical Communication Engineering Indian Institute of Technology, Kharagpur Image Segmentation - III Lecture - 31 Hello, welcome
More informationCIS UDEL Working Notes on ImageCLEF 2015: Compound figure detection task
CIS UDEL Working Notes on ImageCLEF 2015: Compound figure detection task Xiaolong Wang, Xiangying Jiang, Abhishek Kolagunda, Hagit Shatkay and Chandra Kambhamettu Department of Computer and Information
More informationInclusion of Aleatory and Epistemic Uncertainty in Design Optimization
10 th World Congress on Structural and Multidisciplinary Optimization May 19-24, 2013, Orlando, Florida, USA Inclusion of Aleatory and Epistemic Uncertainty in Design Optimization Sirisha Rangavajhala
More informationInformation Retrieval and Web Search Engines
Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 4th, 2014 Wolf-Tilo Balke and José Pinto Institut für Informationssysteme Technische Universität Braunschweig The Cluster
More informationCSE 7/5337: Information Retrieval and Web Search Document clustering I (IIR 16)
CSE 7/5337: Information Retrieval and Web Search Document clustering I (IIR 16) Michael Hahsler Southern Methodist University These slides are largely based on the slides by Hinrich Schütze Institute for
More informationA Genetic Algorithm for Graph Matching using Graph Node Characteristics 1 2
Chapter 5 A Genetic Algorithm for Graph Matching using Graph Node Characteristics 1 2 Graph Matching has attracted the exploration of applying new computing paradigms because of the large number of applications
More informationCategorical Data in a Designed Experiment Part 2: Sizing with a Binary Response
Categorical Data in a Designed Experiment Part 2: Sizing with a Binary Response Authored by: Francisco Ortiz, PhD Version 2: 19 July 2018 Revised 18 October 2018 The goal of the STAT COE is to assist in
More informationAn Object Oriented Runtime Complexity Metric based on Iterative Decision Points
An Object Oriented Runtime Complexity Metric based on Iterative Amr F. Desouky 1, Letha H. Etzkorn 2 1 Computer Science Department, University of Alabama in Huntsville, Huntsville, AL, USA 2 Computer Science
More informationFACE DETECTION AND RECOGNITION OF DRAWN CHARACTERS HERMAN CHAU
FACE DETECTION AND RECOGNITION OF DRAWN CHARACTERS HERMAN CHAU 1. Introduction Face detection of human beings has garnered a lot of interest and research in recent years. There are quite a few relatively
More informationAn Overview of various methodologies used in Data set Preparation for Data mining Analysis
An Overview of various methodologies used in Data set Preparation for Data mining Analysis Arun P Kuttappan 1, P Saranya 2 1 M. E Student, Dept. of Computer Science and Engineering, Gnanamani College of
More informationAssembly dynamics of microtubules at molecular resolution
Supplementary Information with: Assembly dynamics of microtubules at molecular resolution Jacob W.J. Kerssemakers 1,2, E. Laura Munteanu 1, Liedewij Laan 1, Tim L. Noetzel 2, Marcel E. Janson 1,3, and
More informationDensity Based Clustering using Modified PSO based Neighbor Selection
Density Based Clustering using Modified PSO based Neighbor Selection K. Nafees Ahmed Research Scholar, Dept of Computer Science Jamal Mohamed College (Autonomous), Tiruchirappalli, India nafeesjmc@gmail.com
More informationRecommendation System for Location-based Social Network CS224W Project Report
Recommendation System for Location-based Social Network CS224W Project Report Group 42, Yiying Cheng, Yangru Fang, Yongqing Yuan 1 Introduction With the rapid development of mobile devices and wireless
More information