Multiple Layer Clustering of Large Software Systems


Bill Andreopoulos, Department of Computer Science and Engineering, York University, 4700 Keele Street, Toronto, M3J 1P3
Aijun An, Department of Computer Science and Engineering, York University, 4700 Keele Street, Toronto, M3J 1P3
Vassilios Tzerpos, Department of Computer Science and Engineering, York University, 4700 Keele Street, Toronto, M3J 1P3
Xiaogang Wang, Department of Mathematics and Statistics, York University, 4700 Keele Street, Toronto, M3J 1P3

Abstract

Software clustering algorithms presented in the literature rarely incorporate dynamic information, such as the number of function invocations during runtime, in the clustering process. Moreover, the structure of a software system is often multi-layered, while existing clustering algorithms often create flat system decompositions. This paper presents a software clustering algorithm called MULICsoft that incorporates both static and dynamic information in the clustering process. MULICsoft produces layered clusters, with the core elements of each cluster assigned to the top layer. We present experimental results of applying MULICsoft to a large open-source system. Comparison with existing software clustering algorithms indicates that MULICsoft is able to produce decompositions that are close to those created by system experts.

1. Introduction

Reverse engineering is the process of analyzing a system's internal elements and external behavior in order to create a structural view of the system. Automatically constructing a structural view of a large legacy system significantly facilitates the developers' understanding of how the system works. In legacy systems the original source code is often the only available source of information about the system, and it is very time consuming to study. Software clustering techniques aim to decompose a software system into meaningful subsystems, to help new developers understand the system. Clustering is applied to large software systems in order to partition the source files of the system into clusters, such that files containing source code with similar functionality are placed in the same cluster, while files in different clusters contain source code that performs dissimilar functions. Software clustering can be done automatically or manually. Automatic clustering of a large software system using a clustering tool is especially useful in the absence of experts or accurate design documentation. It is desirable to have a software clustering tool that can consider both static and dynamic system information. Automatic clustering techniques generally employ criteria such as low coupling and high cohesion to decompose a software system into subsystems [9, 8, 7]. Manual decomposition of the system is done by software engineers; however, it is time consuming and requires full knowledge of the system. We propose the MULICsoft software clustering algorithm, which is based on the MULIC categorical clustering algorithm described in [2]. MULICsoft differs from MULIC in that it incorporates both static and dynamic information (i.e., the number of function calls during run time) in the software clustering process. MULICsoft handles dynamic information by associating weights with file dependencies and incorporating the weights in the clustering process through special similarity metrics.
We showed in [2] that MULIC clustering results are of higher quality than those of other categorical clustering algorithms, such as k-modes, ROCK, AutoClass, CLOPE and others. Characteristics of MULIC and MULICsoft include: (a) the algorithm does not sacrifice the quality of the resulting clusters for a desired number of clusters; instead, it produces as many clusters as naturally exist in the data set; (b) each cluster consists of layers formed gradually through iterations, by relaxing the similarity criterion for inserting objects into the layers of a cluster at successive iterations.

Section 2 describes the Mozilla data set used for clustering. Section 3 gives an overview of previous software clustering tools. Section 4 describes the MULICsoft clustering algorithm. Section 5 describes the experimental results on the Mozilla system. Section 6 discusses inputting additional data to MULICsoft. Section 7 discusses the runtime performance. Section 8 concludes the paper and discusses future work.

2. Description of Data Sets

Static information on a software system represents dependencies between the objects to be clustered. The objects to be clustered are source files, while the dependencies are procedure calls and variable references. Static information on software systems is categorical, meaning that the objects have attribute values taken from a set of discrete values with no specified ordering. We represent static information as a categorical data set by creating an N x N matrix, where N is the number of files. Each row of the matrix represents a file i of the software system, along with the files that i may call or reference during execution. The categorical attribute value (CA) in cell (i,j) of the matrix is zero or one, where one represents that file i calls or references file j and zero represents that it does not.

Dynamic information on a software system contains the results of profiling an execution of the system, representing how many times each file called procedures in other files during run time. Each row of the data set represents a file x of the software system, along with the files that x called and how many times x called them during the profiled run. We represent dynamic information by associating a weight with each CA in the matrix, in the range 0.0 to 1.0, where 1.0 represents that file i called file j the maximum number of times during the run and 0.0 represents that file i did not call file j. Figure 1 shows an example of a software data set in the form of a matrix. The weights were derived by normalizing the numbers of procedure calls observed during an execution profiling: all numbers of calls in a column are divided by the maximum number of calls in that column. Thus, the weights are real values in the range 0.0 to 1.0 and there is at least one weight with a value of 1.0 in each column. The rationale behind normalizing the weights this way is that some helper functions get called thousands of times, but we do not want them to have a stronger influence on the clustering process than other important files that get called fewer times.

Figure 1. Cells representing file dependencies have values {CA: zero or one, weight: 0.0 to 1.0}.

We have applied the MULICsoft algorithm to cluster the Mozilla software system. The process of extracting static and dynamic information for Mozilla was presented in [3]. There are 1202 objects in the Mozilla data set, corresponding to 1202 source files of the Mozilla system. We use both the categorical data and the weights in clustering. The CAs are boolean values of zero or one describing the dependencies between the Mozilla files. A weight in the range 0.0 to 1.0 is attached to each CA to indicate how strongly the corresponding CA should influence the clustering process.

3. Related Work

Several clustering algorithms for software have been presented in the literature [8, 9, 3].
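The construction of the CA matrix and the column-wise weight normalization described above can be summarized in a few lines. The following is a minimal sketch, not the authors' extraction pipeline; the input arrays and the function name are hypothetical stand-ins for the data produced by the static and dynamic extractors.

```python
import numpy as np

# Hypothetical inputs standing in for the extracted data:
#   static_deps[i, j]  = 1 if file i statically calls or references file j
#   call_counts[i, j]  = number of times file i called file j during profiling
def build_data_set(static_deps: np.ndarray, call_counts: np.ndarray):
    ca = (static_deps > 0).astype(int)           # categorical attribute values (0/1)
    col_max = call_counts.max(axis=0).astype(float)
    col_max[col_max == 0] = 1.0                  # uncalled columns: avoid division by zero
    weights = call_counts / col_max              # each profiled column gets a 1.0 at its max
    return ca, weights
```

Dividing by the column maximum, rather than a global maximum, is what keeps heavily called helper files from dominating the weights of every other file.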
Some of the previous software clustering tools can consider dynamic information (i.e., the number of function calls during run time) in the clustering process [12, 3], but most cannot. In this section, we describe three established clustering algorithms: Bunch, ACDC and LIMBO.

Bunch is a clustering tool intended to aid the software developer and maintainer in understanding, verifying and maintaining a source code base [8]. The input to Bunch is a Module Dependency Graph (MDG). Figure 2 shows an MDG graph. Bunch views the clustering problem as trying to find a good partition of an MDG graph, where a "good partition" is one in which highly interdependent modules are grouped in the same cluster (representing a subsystem) and independent modules are assigned to separate clusters. Figure 2b shows a good partitioning of Figure 2a. Finding a good graph partition involves systematically navigating through a very large search space of all possible partitions for that graph. Bunch treats graph partitioning (clustering) as an optimization problem. The goal of the optimization is to maximize the value of an objective function, called Modularization Quality (MQ) [8].
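The MQ formula itself is not reproduced in this paper. As a point of reference, here is a small sketch of one published variant (the TurboMQ-style cluster factor used in later versions of Bunch); the graph representation and function names are our own, and this should not be read as the exact objective of [8].

```python
from collections import defaultdict

def modularization_quality(edges, assignment):
    """TurboMQ-style MQ: sum over clusters of 2*mu_k / (2*mu_k + eps_k), where
    mu_k counts intra-cluster edges and eps_k counts edges crossing the cluster
    boundary. `edges` is a list of (source, target) module pairs; `assignment`
    maps each module to its cluster id."""
    intra = defaultdict(int)   # mu_k
    cross = defaultdict(int)   # eps_k, counted for both endpoint clusters
    for u, v in edges:
        cu, cv = assignment[u], assignment[v]
        if cu == cv:
            intra[cu] += 1
        else:
            cross[cu] += 1
            cross[cv] += 1
    clusters = set(assignment.values())
    return sum(2 * intra[k] / (2 * intra[k] + cross[k])
               for k in clusters
               if intra[k] or cross[k])   # clusters with no edges contribute 0
```

Bunch's hill-climbing and genetic search then explore the space of `assignment`s for one that maximizes this value.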

Figures 2a,b. An MDG graph and a good partitioning of it, from [8].

ACDC works in a different way from other algorithms. Most software clustering algorithms identify clusters by utilizing criteria such as the maximization of cohesion, the minimization of coupling, or some combination of the two. ACDC performs the task of clustering in two stages. In the first stage, it creates a skeleton of the final decomposition by identifying subsystems that resemble established subsystem patterns, such as the body-header pattern and the subgraph dominator pattern [9]. Depending on the pattern used, the subsystems are given appropriate names. In the second stage, ACDC completes the decomposition by using an extended version of a technique known as Orphan Adoption [11]. Orphan Adoption is an incremental clustering technique based on the assumption that the existing structure is well established. It attempts to place each newly introduced resource (called an orphan) in the subsystem that seems most appropriate, usually a subsystem that has a larger amount of connectivity to the orphan than any other subsystem.

LIMBO is introduced in [4] as a scalable hierarchical categorical clustering algorithm that builds on the Information Bottleneck (IB) framework for quantifying the relevant information preserved when clustering. LIMBO has been successfully applied to the software clustering problem [3]. LIMBO's goal is to create clusters whose features contain as much information as possible about the features of their contents. LIMBO considers weights representing dynamic dependencies in the software clustering process.

4. The MULICsoft Clustering Algorithm

MULICsoft is an extension of the k-modes clustering algorithm for categorical data sets [6]. The k-modes clustering algorithm requires the user to specify the number of clusters to be produced, and the algorithm builds and refines the specified number of clusters. Each cluster has a mode associated with it. Assuming that the objects in the data set are described by m categorical attributes, the mode of a cluster is a vector Q = {q_1, q_2, ..., q_m} where q_i is the most frequent value for the ith attribute in the given cluster. MULICsoft makes substantial changes to the k-modes algorithm. The purpose of the MULICsoft clustering algorithm is to maximize the similarity between each object and the mode of the cluster in which the object is inserted:

\sum_{i=1}^{N} similarity(o_i, mode_i)    (1)

where o_i is the ith object in the data set and mode_i is the mode of the ith object's cluster. The various options for the similarity metric are described in Section 4.4. Maximizing formula (1) ensures that, once the objects are clustered, all objects are as similar to their clusters' modes as possible. The MULICsoft algorithm has the following characteristics. First, the number of clusters is not specified by the user; clusters are created, removed or merged during the clustering process, as the need arises. Second, it is possible for all objects to be assigned to clusters of size two or greater by the end of the process; however, outliers are assigned to separate clusters of size one. Third, clusters are layered.

Input: (1) a set S of objects.
Parameters: (1) δφ: the increment for φ; (2) threshold for φ: the maximum number of values that can differ between an object and the mode of its cluster.
Default parameter values: (1) δφ = 1; (2) threshold = the number of categorical attributes m.
Output: a set of clusters.
Method:
1. Insert the first object into a new cluster, use the object as the mode of the cluster, and remove the object from S;
2. Initialize φ to 1;
3. Loop through the following until S is empty or φ is greater than the specified threshold:
   a. For each object o in S:
      i. Find o's closest cluster c by using the similarity metric to compare o with the modes of all existing clusters;
      ii. If the number of different values between o and c's mode is larger than φ, insert o into a new cluster;
      iii. Otherwise, insert o into c and update c's mode;
      iv. Remove object o from S;
   b. For each cluster c, if there is only one object in c, remove c and put the object back in S;
   c. If in this loop no objects were placed in a cluster with size > 1, increment φ by δφ.

Figure 3. The MULICsoft clustering algorithm.

Figure 3 shows the main part of the MULICsoft clustering algorithm. The algorithm starts by reading all objects from the input file and storing them in S. The first object is inserted into a new cluster, the object becomes the mode of the cluster, and the object is removed from S.

The algorithm then iterates over all objects that have not yet been assigned to clusters, finding the closest cluster for each. In all iterations, the closest cluster for an unclassified object is the cluster with the highest similarity between the cluster's mode and the object, as computed by the similarity metric. The variable φ indicates how strong the similarity has to be between an object and the closest cluster's mode for the object to be inserted into the cluster; initially φ equals 1, meaning that the similarity has to be very strong. If the number of different values between the object and the closest cluster's mode is greater than φ, the object is inserted into a new cluster on its own; otherwise, the object is inserted into the closest cluster and the mode is updated. At the end of each iteration, all objects assigned to clusters of size one have their clusters removed, so that these objects will be re-clustered at the next iteration. This ensures that the only clusters that persist through the process are those containing at least two objects for which the required similarity can be found. Objects assigned to clusters with size greater than one are removed from the set of unclassified objects S, so those objects will not be re-clustered. At the end of each iteration, if no objects have been inserted into clusters of size greater than one, the variable φ is incremented by δφ; thus, at the next iteration the criterion for inserting objects into clusters is more flexible. The iterative process stops when all objects are classified in clusters of size greater than one, or when φ exceeds a user-specified threshold. If the threshold equals its default value of the number of attributes m, the process stops when all objects are assigned to clusters of size greater than one. MULICsoft can eventually classify all objects in clusters, even if the closest cluster to an object is not very similar, because φ can continue increasing until all objects are classified. Even in the extreme cases where an object o with m attributes has only zero or one value in common with the mode of the closest cluster, it can still be classified when φ = m or φ = m-1, respectively.

Figure 4 illustrates what the results of MULICsoft look like. Each cluster consists of many different "layers" of objects. The layer of an object represents how strong the object's similarity was to the mode of the cluster when the object was assigned to the cluster. The cluster layer in which an object is inserted depends on the value of φ. Lower layers have a lower coherence, meaning a lower average similarity between all pairs of objects in the layer, and correspond to higher values of φ. MULICsoft starts by inserting as many objects as possible in top layers, such as layer 1, and then moves to lower layers, creating them as φ increases.

Figure 4. MULICsoft results. Each cluster consists of one or more layers representing different similarities of the objects attached to the cluster.

If an unclassified object has equal similarity to the modes of the two or more closest clusters, the algorithm tries to resolve the tie by comparing the object to the mode of the top layer of each of these clusters; the top layer of a cluster may be layer 1 or 2 and so on. Each cluster's top-layer mode was stored by MULICsoft when the cluster was created, so it does not need to be recomputed.
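To make the control flow of Figure 3 concrete, here is a compact Python sketch of the layered loop. It is our own illustrative rendering, not the authors' code: the data layout (objects as tuples of attribute values, dictionaries for clusters), all helper names, and the arbitrary tie-breaking are assumptions, the per-layer bookkeeping of Figure 4 and the top-layer mode caching are omitted, and the plain most-frequent-value mode of MULIC is used rather than the MULICsoft software mode of Section 4.3.

```python
from collections import Counter

def num_diffs(obj, mode):
    # Number of attribute positions at which the object disagrees with the mode.
    return sum(1 for a, b in zip(obj, mode) if a != b)

def update_mode(cluster):
    # Plain MULIC mode: the most frequent value per position over the members.
    # (MULICsoft instead sets a position to one if any member has a CA of one
    # there; see Section 4.3.)
    cols = zip(*cluster["members"])
    cluster["mode"] = [Counter(col).most_common(1)[0][0] for col in cols]

def mulic(objects, delta_phi=1, threshold=None):
    m = len(objects[0])
    threshold = m if threshold is None else threshold
    S = list(objects)
    first = S.pop(0)
    clusters = [{"members": [first], "mode": list(first)}]
    phi = 1
    while S and phi <= threshold:
        grew = False                              # did any cluster reach size > 1?
        for o in list(S):
            c = min(clusters, key=lambda cl: num_diffs(o, cl["mode"]), default=None)
            if c is None or num_diffs(o, c["mode"]) > phi:
                clusters.append({"members": [o], "mode": list(o)})  # new singleton
            else:
                c["members"].append(o)            # o joins its closest cluster
                update_mode(c)
                grew = True
            S.remove(o)
        # Dissolve singleton clusters; their objects are re-clustered later.
        for c in [c for c in clusters if len(c["members"]) == 1]:
            clusters.remove(c)
            S.append(c["members"][0])
        if not grew:
            phi += delta_phi                      # relax the similarity criterion
    # With threshold < m, anything left in S becomes an outlier in its own cluster.
    clusters += [{"members": [o], "mode": list(o)} for o in S]
    return clusters
```

With threshold left at its default of m, the loop only ends once S is empty, so the final outlier step never triggers, matching the observation that all objects are eventually classified.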
If the object has equal similarity to the top-layer modes of all of its closest clusters, the object is assigned to the cluster with the highest bottom layer. If all of these clusters have the same bottom layer, the object is assigned to the first cluster, since there is insufficient information for selecting the best cluster. The complexity of MULICsoft is O(N^2), where N is the number of objects. Most of our trials had a runtime of less than 30 seconds. Increasing δφ or decreasing threshold reduces the runtime, often without hurting the quality of the results [2].

4.1. Merging of clusters

We should generally avoid the situation where the similarity between the top layers of two different clusters is stronger than the similarity between the top and bottom layers of the same cluster. To avoid this, after the clustering process MULICsoft can merge pairs of clusters whose top-layer modes' dissimilarity is less than the maximum layer depth of the two clusters. For this purpose, MULICsoft preserves the modes of the top layers of all clusters. This process reduces the total number of clusters and may improve the quality of the results. The process is described as follows:

for (c = first cluster to last cluster)
    for (d = c+1 to last cluster)
        if the dissimilarity between c's top-layer mode and d's top-layer mode is less than
        the maximum layer depth of c and d, merge c into d and break the inner loop;

where the dissimilarity between two modes Q_c = {q_{c1}, ..., q_{cm}} and Q_d = {q_{d1}, ..., q_{dm}} is defined as:

dissimilarity(Q_c, Q_d) = \sum_{i=1}^{m} \delta(q_{ci}, q_{di}), where \delta(q_{ci}, q_{di}) = 0 if q_{ci} = q_{di}, and 1 if q_{ci} \neq q_{di}.

4.2. Dealing with outliers

MULICsoft will eventually put all the objects in clusters if the threshold for φ equals its default value of the number of attributes m. When φ equals m, any object that remains unclassified will be inserted in the lowest layer of a cluster. This is undesirable if the object is an outlier and has little similarity with any cluster. The user can prevent this situation by specifying a value for threshold that is less than m. In this case, when φ exceeds the maximum allowed value specified by threshold, any remaining objects are treated as outliers by classifying each of them in a separate cluster of size one. We showed in [2] that top layers are more reliable than lower layers.

4.3. MULICsoft characteristics for software clustering

MULICsoft includes characteristics specific to software clustering, allowing the incorporation of both static and dynamic system information in the clustering process. All categorical attribute values (CAs) of an object have "weights" in the range 0.0 to 1.0 associated with them, which represent dynamic information derived from profiling the execution of the system. The weights were extracted as described in Section 2. We represent the weights of an object o as a vector w_o. A position of the mode of a cluster is set to one if there is at least one object in the cluster that has a CA of one in the corresponding position, or a weight greater than 0.0 at the corresponding position. We do not use the most frequent value for each position of the mode because, with our software data set, most or all values of the mode would then be set to zero. When calculating the similarity between a mode and an object, pairs of zero attribute values between mode and object are ignored. Besides storing the boolean values of zero or one for the mode µ of a cluster, we also store a real number for each position of the mode, representing the sum of all weights at that position over all objects allocated to the cluster. We represent this special mode as w_µ. Special similarity metrics are used to compute the similarity between a mode and an object.

4.4. Similarity metrics for comparison of objects to modes

A similarity metric is used to find the closest cluster to an object, by computing the similarity between the cluster's mode and the object. MULICsoft handles dynamic information by associating weights with CAs and incorporating these weights in the clustering process through special similarity metrics that consider both CAs and weights. The similarity metrics use the weight vectors w_o and/or w_µ. The function σ returns 1 if an object o and a mode µ have identical CAs of one at a position, and returns 0 otherwise:

\sigma(o_i, \mu_i) = 1 if o_i = \mu_i = 1, and 0 otherwise.

Similarity metric 1. The first similarity metric uses both the weights of the object and the weights of the mode:

similarity(o, \mu) = \sum_{i=1}^{m} w_{o_i} \cdot w_{\mu_i} \cdot \sigma(o_i, \mu_i)

Similarity metric 2. The second similarity metric uses only the weights of the object:

similarity(o, \mu) = \sum_{i=1}^{m} w_{o_i} \cdot \sigma(o_i, \mu_i)

Similarity metric 3.
The third similarity metric amplifies the weights of the object as follows:

similarity(o, \mu) = \sum_{i=1}^{m} \frac{x^{4 w_{o_i}}}{5^{4 w_{o_i}}} \cdot \sigma(o_i, \mu_i)

The parameter x takes an integer value greater than 5. This similarity metric places more importance on high weights (1.0) than on low weights (0.0). The intuition for this formula is that for each pair of CAs with identical values of one between o and µ, the contribution to the similarity result should be at least 1.0, for the lowest weight of 0.0. The maximum contribution, for the highest weight of 1.0, depends on the integer value of x. For example, for x=6 the contribution to the similarity result ranges from 1.2 for a low weight of 0.1 to 2.0 for a high weight of 1.0. For x=9 the contribution to the similarity result ranges from 1.8 for a low weight of 0.1 to 5.0 for a high weight of 1.0.
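For concreteness, the three metrics can be written down directly. The sketch below is ours: it assumes the CA vectors and weight vectors are plain Python lists, reconstructs the mode pair (µ, w_µ) by the rule given in Section 4.3, and renders the amplification in metric 3 as (x/5)^(4·w), which is our reading of the formula above.

```python
def software_mode(members_ca, members_w):
    """MULICsoft mode (Section 4.3): a position is 1 if any member has a CA of
    one (or a positive weight) there; w_mu accumulates the weights per position."""
    m = len(members_ca[0])
    mu, w_mu = [0] * m, [0.0] * m
    for ca, w in zip(members_ca, members_w):
        for i in range(m):
            if ca[i] == 1 or w[i] > 0.0:
                mu[i] = 1
            w_mu[i] += w[i]
    return mu, w_mu

def sigma(o_i, mu_i):
    # 1 only when object and mode both have a CA of one at this position.
    return 1 if o_i == 1 and mu_i == 1 else 0

def metric1(o, w_o, mu, w_mu):
    return sum(w_o[i] * w_mu[i] * sigma(o[i], mu[i]) for i in range(len(o)))

def metric2(o, w_o, mu, w_mu):
    return sum(w_o[i] * sigma(o[i], mu[i]) for i in range(len(o)))

def metric3(o, w_o, mu, w_mu, x=9):
    # (x/5)**(4*w) equals 1.0 at w=0.0 and grows with the weight, so matching
    # positions with high dynamic weights dominate the similarity score.
    return sum((x / 5) ** (4 * w_o[i]) * sigma(o[i], mu[i])
               for i in range(len(o)))
```

Note that under all three metrics a pair of zeros contributes nothing, matching the rule that zero-zero attribute pairs are ignored.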

Figure 5. The function surface of the 3rd similarity metric, for x=6 and weight values between 0.0 and 1.0.

Figure 5 shows the shape of the values returned by the 3rd similarity metric for x=6. Each object in this example has 10 CAs and weights. The graph shows that an object is more likely to be assigned to a cluster if all CAs match the mode with high weights of 1.0 than with medium weights of 0.5, more likely with medium weights than with low weights of 0.1, more likely if all CAs match than if only one CA matches with a high weight, and more likely in that case than if one CA matches with a low weight.

5. Results for Clustering Mozilla with MULICsoft

In order to evaluate the applicability of MULICsoft to the software clustering problem, we applied it to the Mozilla software system and compared its output to that of other well-established software clustering algorithms. We experimented with Mozilla version 1.3, which was released in March 2003. It contains approximately four million lines of C and C++ source code. We built Mozilla under Linux and extracted its static dependency graph using CPPX and a dynamic dependency graph using jprof. A decomposition of the Mozilla source files for version M9 was presented in [5]. For the evaluation portion of our work, we used an updated authoritative decomposition for version 1.3 [12]. We have placed all of our detailed results online.

We compared MULICsoft to the following software clustering algorithms: ACDC [9], BUNCH [8] and LIMBO [4]. To evaluate the clustering results, we compared them with the authoritative manual decomposition using the MoJo distance measure [10, 9]; a Java implementation of MoJo is available for download. MoJo measures the distance between two decompositions of the same software system by computing the number of Move and Join operations needed to transform one into the other. Intuitively, the smaller the distance of a proposed decomposition to the authoritative one, the more effective the algorithm that produced it.

MULICsoft clusters the 1202 Mozilla files into clusters whose sizes, before merging, range from 3 to 37 files. The results indicate that MULICsoft outperforms other software clustering algorithms, such as LIMBO, BUNCH and ACDC. The MoJo distances for ACDC, BUNCH and LIMBO applied to clustering the Mozilla software system are shown in Table 1. MULICsoft clusters all Mozilla system files, without treating any as outliers, achieving MoJo distances such as 388, 397, 399 and 424, as explained in the next section.

Table 1. ACDC, BUNCH and LIMBO results for clustering Mozilla (MoJo distance, number of clusters and number of files classified for each algorithm).

5.1. Results for different similarity metrics

Table 2 shows the results for each of the three similarity metrics. The experiments use a linear increase of φ, setting it to an initial value of 1 and increasing it by a constant value δφ after each loop in which no object was classified in a cluster of size greater than one. We set threshold equal to its default value of the number of attributes m, so that no objects are treated as outliers and all 1202 files are clustered. We do not merge the clusters after the clustering process.

Table 2. MULICsoft results for clustering Mozilla with different similarity metrics (MoJo distance, number of clusters and δφ for the 1st, 2nd and 3rd metrics with various values of x).
As Table 2 shows, for all of our similarity metrics the MoJo distance to the authoritative manual decomposition is significantly lower than the distances of ACDC, BUNCH and LIMBO.

The 3rd similarity metric produces the best results. For the 3rd similarity metric with x=9, the results are especially good, with a MoJo distance of 388 to the authoritative decomposition. The reason why x=9 produces the best results is that it amplifies significantly the effect of the high weights on the clustering process. We tried setting x to even higher values, such as 10, 12 and 15, but the MoJo distance no longer decreased; thus, a high value of the parameter x improves the results only up to a specific point. As x decreases to 7 and 6, the results are still good, with MoJo distances of 397 and 399 respectively. For the 1st and 2nd similarity metrics the results are also better than those of ACDC, BUNCH and LIMBO. Note that for the 1st and 2nd similarity metrics a lower δφ value of 80 is used than for the 3rd metric. The reason for this is that with the 3rd metric more files are classified in the correct cluster during the first and second iterations, because of the amplified effect of the weights on the clustering process. With the 1st and 2nd metrics, on the other hand, fewer files are classified correctly during the first and second iterations, and the lower value of δφ allows some of the files to be considered at later iterations instead.

5.2. MULICsoft with linear and exponential growth of φ

We experimented with increasing the variable φ linearly, by setting it to an initial value of 1 and increasing it by a constant value δφ after each loop at which no object was placed in a cluster of size greater than one. We also experimented with increasing the variable φ exponentially, by setting it to an initial value of 1 and multiplying it by 2 after each loop at which no object was classified in a cluster of size greater than one. Table 3 shows our results for both a linear and an exponential increase of φ. We let threshold have its default value equal to the number of attributes m, so that no objects are treated as outliers and all 1202 files are clustered. No merging is done on the clusters after the clustering process.

Table 3. MULICsoft results for clustering Mozilla with linear and exponential growth of φ, starting from an initial value of 1 (MoJo distance, number of clusters and δφ for the 3rd similarity metric with x=6 and x=9).

As Table 3 shows, for the 3rd similarity metric a value of δφ between 70 and 150 gives the best results overall, with MoJo distances as low as 388, 391 and 399. The reason why such a high value of δφ is used is that sufficient files should be clustered at each iteration so that the modes of the clusters are given the opportunity to change, as opposed to remaining static. For the 3rd similarity metric and x=6, a value of δφ between 100 and 110 gives the best results, while for x=9 a value of δφ between 130 and 150 gives the best results. The reason for this is that with x=9 more files are classified in the correct cluster during the first and second iterations, because of the amplified effect of the weights on the clustering process. With x=6, on the other hand, fewer files are classified correctly during the first and second iterations, and the lower value of δφ allows some of the files to be considered at later iterations instead. For the exponential growth of φ, the MoJo distance increases to 456.
Even though the exponential growth of φ does not produce the best results in this case, it can still produce good results when some objects are treated as outliers, as described in the next section.
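The two schedules differ only in how φ is advanced after an iteration that makes no progress. A one-function sketch (ours, with hypothetical parameter names):

```python
def next_phi(phi, growth="linear", delta_phi=130):
    """Advance phi after an iteration that placed no object in a cluster of
    size > 1: linear growth adds the constant delta_phi, exponential doubles."""
    return phi + delta_phi if growth == "linear" else phi * 2
```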

5.3. Treating objects as outliers by setting a threshold for φ

Sometimes it may be desirable to treat the objects in the bottom layers of clusters as outliers. Objects are treated as outliers by setting the threshold for φ to a value less than the number of attributes m, as discussed in Section 4.2. When φ exceeds the maximum allowed value specified by threshold, any remaining objects are treated as outliers by classifying them independently in clusters of size one. For example, setting the threshold for φ to the value 150 means that clustering will stop at layer 150, and any objects that would have been clustered in layers greater than 150 are treated as outliers. We showed in [2] that lower layers are less reliable than higher layers. We experiment with various thresholds for φ, for both linear and exponential growth of φ. We use the 3rd similarity metric with x=7. No merging is done on the clusters after the clustering process. Table 4 shows the results.

Table 4. MULICsoft results when setting a threshold for φ and treating some files as outliers (threshold for φ, MoJo distance, number of clusters and number of outliers, for linear and exponential growth of φ). The initial value of φ is 1; the 3rd similarity metric is used with x=7.

As Table 4 shows, the MoJo distance increases after treating objects as outliers. This increase is related to the fact that each object treated as an outlier is placed in an independent cluster of size one; many Moves and Joins then need to be performed for the computed decomposition to reach the authoritative manual decomposition, so the MoJo distance is expected to increase. We do not interpret the increase in MoJo distance as a decline in the quality of the results, but as a sign that we should treat outliers differently from placing them in independent clusters. The MoJo distance would decrease significantly if all outliers were instead placed together in one cluster. Furthermore, the MoJo distance decreases even more if the outliers are simply ignored and the distance is computed on the intersection of the files in the computed decomposition with the files in the authoritative manual decomposition. A different distance measure could be useful for computing the distance between a computed decomposition and an authoritative decomposition when outliers are involved.

5.4. Merging of clusters

MULICsoft provides the capability to merge clusters that are very similar after the clustering process, for the purpose of reducing the number of clusters, as described in Section 4.1. Table 5 presents the results after merging the clusters. For this experiment we ignore the files that are outliers, since they are classified independently in clusters of size one and are not merged.

Table 5. MULICsoft results after merging clusters: merging reduces the number of clusters from 250 to 100, 90 and 80, while the MoJo distance increases from 320 (up to 384). The initial value of φ is 1 and it grows exponentially (multiplying φ by 2); the threshold is 32; the 3rd similarity metric is used with x=7.

As Table 5 shows, in all cases the initial number of clusters is 250. We merge clusters until the number of clusters decreases to 100, 90 and 80. The MoJo distance increases, but this may be because the MoJo distance metric is not able to capture some structural change in the clusters caused by the merging.
In any case, the slight increase in MoJo distance caused by merging clusters is an interesting observation that we will investigate further. Perhaps the harm from the increase in MoJo distance is less than the benefit of decreasing the number of clusters, a tradeoff that MoJo is not able to capture.

6. Inputting Additional Categorical Data

We integrated the following categorical data sets with the Mozilla file data set, in order to produce improved results when MULICsoft clustering is applied to the integrated data sets.

Developers (Dev): The ownership information, i.e., the names of the developers involved in the implementation of the file. In case no developer was known, a unique dummy value for each file is used.

Directory Path (Dir): The full directory path for each file. In order to increase the similarity of files residing in similar directory paths, the set of all subpaths of each path is included.

Lines of Code (Loc): The number of lines of code for each file. The values are discretized by dividing the full range of loc values into the intervals {0; 100}, {100; 200}, {200; 300}, etc. Each file is given a feature such as RANGE1, RANGE2, RANGE3, etc.

Time of Last Update (Tim): The time stamp of each file on disk. Only the month and year are included.

Table 6 shows the MULICsoft MoJo distances to the authoritative decomposition for Mozilla after inputting the additional categorical data sets.

Table 6. MULICsoft results for clustering Mozilla with additional categorical data (MoJo distance and number of clusters for the combinations Dev+Dir+LocEQ+Tim, Dev+Dir+LocEQ, Dev+Dir+Tim and Dev+Dir). The initial value of φ is 1 and it grows linearly; the 3rd similarity metric is used with x=9; δφ=130; threshold has its default value.

As Table 6 shows, the MoJo distance to the authoritative decomposition does not improve significantly after inputting additional categorical data sets to MULICsoft. After inputting all four additional data sets (Dev+Dir+LocEQ+Tim), the result is 387, which is just slightly better than the best previous result of 388 obtained without any additional categorical data. As fewer additional categorical data sets are input, the MoJo distance increases slightly: for three additional categorical data sets the distance increases to 407, and for two additional categorical data sets it increases again to 409. The reason why we do not observe a significant improvement is likely that we have set all weights for the additional data sets to a default value of 1.0, implying that all of the additional data influence the clustering process equally. However, different additional data should have a different effect on the process. For example, a directory path containing a large number of files should be more influential in clustering any of its files than a directory path with only a few files. These experiments are only preliminary, as we have not evaluated MULICsoft on all types and combinations of additional data sets, and we still need to assign different weights to the data.

7. Runtime Evaluation

Our experiments were performed on a Sun Ultra 60 with 256 MB of memory and a 300 MHz processor. Table 7 shows the times it took MULICsoft to cluster the files of the Mozilla system.

Table 7. MULICsoft runtimes in seconds (time, similarity metric, δφ and threshold for each run, using the 3rd similarity metric with x=9); the measured runtimes range from 12 to 29 seconds. The initial value of φ is 1 and it grows linearly.

8. Conclusion and Future Work

We have presented a software clustering algorithm named MULICsoft. MULICsoft creates decompositions that are close to manually created ones. MULICsoft does not sacrifice the quality of the results for the number of clusters, which in k-modes is defined strictly before the process [2].
For each cluster, MULICsoft forms layers of varying interdependence between the files. It starts by forming a first layer of highly interdependent files, using strict criteria concerning which files to insert in the layer. The first layer of a cluster contains the highly interdependent files that are at the core of a subsystem. As the process continues, MULICsoft relaxes its criteria, forming layers with files that are less interdependent than those in the previous layers. This multi-layer structure is well suited to clustering software system data: MULICsoft clusters are representative of the underlying patterns in a software system, because differing layers of interdependence exist within a cluster. The MULICsoft similarity metrics consider dynamic system information by incorporating weights on the file interdependencies that are derived from a runtime profiling of the system.

In the end, the human expert has the option of merging clusters that are very similar, to build larger clusters and reduce the number of clusters. We have evaluated the quality of the MULICsoft results on the Mozilla software system, for which an expert-defined authoritative system decomposition exists. On this data set, the MULICsoft distance to the authoritative decomposition was lower than the distances of LIMBO [4], BUNCH [8] and ACDC [9]. Finally, we showed that the runtime of MULICsoft was satisfactory, as it took between 10 and 30 seconds.

Future work will include improving the method for merging similar clusters into larger clusters after the clustering process. We are currently implementing and testing an improved merging method, which will hopefully produce better MoJo distance results than the current one. The results for inputting additional categorical data sets to MULICsoft are only preliminary; thorough experiments with more types of data are needed. We intend to conduct more experiments with additional categorical data, such as data on the developers of the source files, the time stamps of the source files, the directory paths of the source files, etc. Furthermore, we will be setting the weights for the additional data sets to values other than the current default of 1.0, depending on the effect that the data should have on the clustering process. The results for identifying outliers indicate that the distance to the authoritative decomposition often increases when outliers are identified. However, this is related to the fact that in the computed decomposition each outlier is inserted in an independent cluster of size one. We will be experimenting with different ways of handling outliers, such as inserting them all in one cluster; this would decrease the MoJo distance, since fewer Moves and Joins would need to be performed for the computed decomposition to reach the authoritative manual decomposition. Furthermore, we are designing a different distance measure for computing the distance between a computed decomposition and an authoritative decomposition when outliers are involved.

9. References

[1] B. Andreopoulos, A. An and X. Wang. BILCOM: Bi-level Clustering of Mixed Categorical and Numerical Biological Data. Technical Report, Department of Computer Science and Engineering, York University, 2005.

[2] B. Andreopoulos, A. An and X. Wang. MULIC: Multi-Layer Increasing Coherence Clustering of Categorical Data Sets. Technical Report, Department of Computer Science and Engineering, York University, 2004.

[3] P. Andritsos and V. Tzerpos. Information-Theoretic Software Clustering. IEEE Transactions on Software Engineering, 31(2), February 2005.

[4] P. Andritsos, P. Tsaparas, R. J. Miller and K. C. Sevcik. LIMBO: Scalable Clustering of Categorical Data. In Proceedings of the Ninth International Conference on Extending Database Technology (EDBT), March 2004.

[5] M. W. Godfrey and E. H. S. Lee. Secrets from the Monster: Extracting Mozilla's Software Architecture. In Proceedings of the Second International Symposium on Constructing Software Engineering Tools (CoSET), 2000.

[6] Z. Huang. Extensions to the k-means Algorithm for Clustering Large Data Sets with Categorical Values. Data Mining and Knowledge Discovery, 2(3):283-304, 1998.

[7] B. S. Mitchell and S. Mancoridis. Comparing the Decompositions Produced by Software Clustering Algorithms using Similarity Measurements. In Proceedings of the International Conference on Software Maintenance (ICSM), 2001.
[8] S. Mancoridis, B. S. Mitchell, Y. Chen and E. R. Gansner. Bunch: A Clustering Tool for the Recovery and Maintenance of Software System Structures. In Proceedings of the International Conference on Software Maintenance (ICSM), 1999.

[9] V. Tzerpos and R. C. Holt. ACDC: An Algorithm for Comprehension-Driven Clustering. In Proceedings of the Seventh Working Conference on Reverse Engineering (WCRE), 2000.

[10] V. Tzerpos and R. C. Holt. MoJo: A Distance Metric for Software Clusterings. In Proceedings of the Sixth Working Conference on Reverse Engineering (WCRE), pages 187-193, 1999.

[11] V. Tzerpos and R. C. Holt. The Orphan Adoption Problem in Architecture Maintenance. In Proceedings of the Fourth Working Conference on Reverse Engineering (WCRE), Amsterdam, October 1997.

[12] C. Xiao and V. Tzerpos. Software Clustering Based on Dynamic Dependencies. In Proceedings of the Ninth European Conference on Software Maintenance and Reengineering (CSMR), to appear.


More information

WITHOUT insight into a system s design, a software

WITHOUT insight into a system s design, a software IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 32, NO. 3, MARCH 2006 1 On the Automatic Modularization of Software Systems Using the Bunch Tool Brian S. Mitchell, Member, IEEE, and Spiros Mancoridis,

More information

Mining Data Streams. Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction. Summarization Methods. Clustering Data Streams

Mining Data Streams. Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction. Summarization Methods. Clustering Data Streams Mining Data Streams Outline [Garofalakis, Gehrke & Rastogi 2002] Introduction Summarization Methods Clustering Data Streams Data Stream Classification Temporal Models CMPT 843, SFU, Martin Ester, 1-06

More information

Keywords Clustering, Goals of clustering, clustering techniques, clustering algorithms.

Keywords Clustering, Goals of clustering, clustering techniques, clustering algorithms. Volume 3, Issue 5, May 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Survey of Clustering

More information

Text Documents clustering using K Means Algorithm

Text Documents clustering using K Means Algorithm Text Documents clustering using K Means Algorithm Mrs Sanjivani Tushar Deokar Assistant professor sanjivanideokar@gmail.com Abstract: With the advancement of technology and reduced storage costs, individuals

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

Handling Missing Values via Decomposition of the Conditioned Set

Handling Missing Values via Decomposition of the Conditioned Set Handling Missing Values via Decomposition of the Conditioned Set Mei-Ling Shyu, Indika Priyantha Kuruppu-Appuhamilage Department of Electrical and Computer Engineering, University of Miami Coral Gables,

More information

A Scalable Approach for Packet Classification Using Rule-Base Partition

A Scalable Approach for Packet Classification Using Rule-Base Partition CNIR Journal, Volume (5), Issue (1), Dec., 2005 A Scalable Approach for Packet Classification Using Rule-Base Partition Mr. S J Wagh 1 and Dr. T. R. Sontakke 2 [1] Assistant Professor in Information Technology,

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

HIMIC : A Hierarchical Mixed Type Data Clustering Algorithm

HIMIC : A Hierarchical Mixed Type Data Clustering Algorithm HIMIC : A Hierarchical Mixed Type Data Clustering Algorithm R. A. Ahmed B. Borah D. K. Bhattacharyya Department of Computer Science and Information Technology, Tezpur University, Napam, Tezpur-784028,

More information

Software Architecture Recovery based on Dynamic Analysis

Software Architecture Recovery based on Dynamic Analysis Software Architecture Recovery based on Dynamic Analysis Aline Vasconcelos 1,2, Cláudia Werner 1 1 COPPE/UFRJ System Engineering and Computer Science Program P.O. Box 68511 ZIP 21945-970 Rio de Janeiro

More information

Introduction. Linear because it requires linear functions. Programming as synonymous of planning.

Introduction. Linear because it requires linear functions. Programming as synonymous of planning. LINEAR PROGRAMMING Introduction Development of linear programming was among the most important scientific advances of mid-20th cent. Most common type of applications: allocate limited resources to competing

More information

Clustering: Overview and K-means algorithm

Clustering: Overview and K-means algorithm Clustering: Overview and K-means algorithm Informal goal Given set of objects and measure of similarity between them, group similar objects together K-Means illustrations thanks to 2006 student Martin

More information

Study and Implementation of CHAMELEON algorithm for Gene Clustering

Study and Implementation of CHAMELEON algorithm for Gene Clustering [1] Study and Implementation of CHAMELEON algorithm for Gene Clustering 1. Motivation Saurav Sahay The vast amount of gathered genomic data from Microarray and other experiments makes it extremely difficult

More information

Cluster Analysis. Angela Montanari and Laura Anderlucci

Cluster Analysis. Angela Montanari and Laura Anderlucci Cluster Analysis Angela Montanari and Laura Anderlucci 1 Introduction Clustering a set of n objects into k groups is usually moved by the aim of identifying internally homogenous groups according to a

More information

Efficiency of k-means and K-Medoids Algorithms for Clustering Arbitrary Data Points

Efficiency of k-means and K-Medoids Algorithms for Clustering Arbitrary Data Points Efficiency of k-means and K-Medoids Algorithms for Clustering Arbitrary Data Points Dr. T. VELMURUGAN Associate professor, PG and Research Department of Computer Science, D.G.Vaishnav College, Chennai-600106,

More information

EE 701 ROBOT VISION. Segmentation

EE 701 ROBOT VISION. Segmentation EE 701 ROBOT VISION Regions and Image Segmentation Histogram-based Segmentation Automatic Thresholding K-means Clustering Spatial Coherence Merging and Splitting Graph Theoretic Segmentation Region Growing

More information

Clustering Part 4 DBSCAN

Clustering Part 4 DBSCAN Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

Clustering: Overview and K-means algorithm

Clustering: Overview and K-means algorithm Clustering: Overview and K-means algorithm Informal goal Given set of objects and measure of similarity between them, group similar objects together K-Means illustrations thanks to 2006 student Martin

More information

Enhancing Clustering Results In Hierarchical Approach By Mvs Measures

Enhancing Clustering Results In Hierarchical Approach By Mvs Measures International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 10, Issue 6 (June 2014), PP.25-30 Enhancing Clustering Results In Hierarchical Approach

More information

What to come. There will be a few more topics we will cover on supervised learning

What to come. There will be a few more topics we will cover on supervised learning Summary so far Supervised learning learn to predict Continuous target regression; Categorical target classification Linear Regression Classification Discriminative models Perceptron (linear) Logistic regression

More information

A Virtual Laboratory for Study of Algorithms

A Virtual Laboratory for Study of Algorithms A Virtual Laboratory for Study of Algorithms Thomas E. O'Neil and Scott Kerlin Computer Science Department University of North Dakota Grand Forks, ND 58202-9015 oneil@cs.und.edu Abstract Empirical studies

More information

SA-IFIM: Incrementally Mining Frequent Itemsets in Update Distorted Databases

SA-IFIM: Incrementally Mining Frequent Itemsets in Update Distorted Databases SA-IFIM: Incrementally Mining Frequent Itemsets in Update Distorted Databases Jinlong Wang, Congfu Xu, Hongwei Dan, and Yunhe Pan Institute of Artificial Intelligence, Zhejiang University Hangzhou, 310027,

More information

Automatic Modularization of ANNs Using Adaptive Critic Method

Automatic Modularization of ANNs Using Adaptive Critic Method Automatic Modularization of ANNs Using Adaptive Critic Method RUDOLF JAKŠA Kyushu Institute of Design 4-9-1 Shiobaru, Minami-ku, Fukuoka, 815-8540 JAPAN Abstract: - We propose automatic modularization

More information

On Distributed Algorithms for Maximizing the Network Lifetime in Wireless Sensor Networks

On Distributed Algorithms for Maximizing the Network Lifetime in Wireless Sensor Networks On Distributed Algorithms for Maximizing the Network Lifetime in Wireless Sensor Networks Akshaye Dhawan Georgia State University Atlanta, Ga 30303 akshaye@cs.gsu.edu Abstract A key challenge in Wireless

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2008 CS 551, Spring 2008 c 2008, Selim Aksoy (Bilkent University)

More information

CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION

CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION CHAPTER VII INDEXED K TWIN NEIGHBOUR CLUSTERING ALGORITHM 7.1 INTRODUCTION Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called cluster)

More information

Topological map merging

Topological map merging Topological map merging Kris Beevers Algorithmic Robotics Laboratory Department of Computer Science Rensselaer Polytechnic Institute beevek@cs.rpi.edu March 9, 2005 Multiple robots create topological maps

More information

Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005

Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005 Data Mining with Oracle 10g using Clustering and Classification Algorithms Nhamo Mdzingwa September 25, 2005 Abstract Deciding on which algorithm to use, in terms of which is the most effective and accurate

More information

Cluster quality 15. Running time 0.7. Distance between estimated and true means Running time [s]

Cluster quality 15. Running time 0.7. Distance between estimated and true means Running time [s] Fast, single-pass K-means algorithms Fredrik Farnstrom Computer Science and Engineering Lund Institute of Technology, Sweden arnstrom@ucsd.edu James Lewis Computer Science and Engineering University of

More information

CS Introduction to Data Mining Instructor: Abdullah Mueen

CS Introduction to Data Mining Instructor: Abdullah Mueen CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen LECTURE 8: ADVANCED CLUSTERING (FUZZY AND CO -CLUSTERING) Review: Basic Cluster Analysis Methods (Chap. 10) Cluster Analysis: Basic Concepts

More information

Workload Characterization Techniques

Workload Characterization Techniques Workload Characterization Techniques Raj Jain Washington University in Saint Louis Saint Louis, MO 63130 Jain@cse.wustl.edu These slides are available on-line at: http://www.cse.wustl.edu/~jain/cse567-08/

More information

Parallel Evaluation of Hopfield Neural Networks

Parallel Evaluation of Hopfield Neural Networks Parallel Evaluation of Hopfield Neural Networks Antoine Eiche, Daniel Chillet, Sebastien Pillement and Olivier Sentieys University of Rennes I / IRISA / INRIA 6 rue de Kerampont, BP 818 2232 LANNION,FRANCE

More information

Analyzing Dshield Logs Using Fully Automatic Cross-Associations

Analyzing Dshield Logs Using Fully Automatic Cross-Associations Analyzing Dshield Logs Using Fully Automatic Cross-Associations Anh Le 1 1 Donald Bren School of Information and Computer Sciences University of California, Irvine Irvine, CA, 92697, USA anh.le@uci.edu

More information

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering SYDE 372 - Winter 2011 Introduction to Pattern Recognition Clustering Alexander Wong Department of Systems Design Engineering University of Waterloo Outline 1 2 3 4 5 All the approaches we have learned

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 5243 INTRO. TO DATA MINING Cluster Analysis: Basic Concepts and Methods Huan Sun, CSE@The Ohio State University 09/25/2017 Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han 2 Chapter 10.

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 4th, 2014 Wolf-Tilo Balke and José Pinto Institut für Informationssysteme Technische Universität Braunschweig The Cluster

More information

Getting to Know Your Data

Getting to Know Your Data Chapter 2 Getting to Know Your Data 2.1 Exercises 1. Give three additional commonly used statistical measures (i.e., not illustrated in this chapter) for the characterization of data dispersion, and discuss

More information

MODULAR PARTITIONING FOR INCREMENTAL COMPILATION

MODULAR PARTITIONING FOR INCREMENTAL COMPILATION MODULAR PARTITIONING FOR INCREMENTAL COMPILATION Mehrdad Eslami Dehkordi, Stephen D. Brown Dept. of Electrical and Computer Engineering University of Toronto, Toronto, Canada email: {eslami,brown}@eecg.utoronto.ca

More information

Supervised classification of law area in the legal domain

Supervised classification of law area in the legal domain AFSTUDEERPROJECT BSC KI Supervised classification of law area in the legal domain Author: Mees FRÖBERG (10559949) Supervisors: Evangelos KANOULAS Tjerk DE GREEF June 24, 2016 Abstract Search algorithms

More information

Advanced Algorithms Class Notes for Monday, October 23, 2012 Min Ye, Mingfu Shao, and Bernard Moret

Advanced Algorithms Class Notes for Monday, October 23, 2012 Min Ye, Mingfu Shao, and Bernard Moret Advanced Algorithms Class Notes for Monday, October 23, 2012 Min Ye, Mingfu Shao, and Bernard Moret Greedy Algorithms (continued) The best known application where the greedy algorithm is optimal is surely

More information

On the evaluation of the Bunch search-based software modularization algorithm

On the evaluation of the Bunch search-based software modularization algorithm Soft Comput DOI.7/s-7--3 FOCUS On the evaluation of the Bunch search-based software modularization algorithm Brian S. Mitchell Spiros Mancoridis Springer-Verlag 7 Abstract The first part of this paper

More information

Efficient Re-construction of Document Versions Based on Adaptive Forward and Backward Change Deltas

Efficient Re-construction of Document Versions Based on Adaptive Forward and Backward Change Deltas Efficient Re-construction of Document Versions Based on Adaptive Forward and Backward Change Deltas Raymond K. Wong Nicole Lam School of Computer Science & Engineering, University of New South Wales, Sydney

More information

A Design Recovery View - JFace vs. SWT. Abstract

A Design Recovery View - JFace vs. SWT. Abstract A Design Recovery View - JFace vs. SWT Technical Report 2009-564 Manar Alalfi School of computing- Queen s University Kingston, Ontario, Canada alalfi@cs.queensu.ca Abstract This paper presents an experience

More information

6. NEURAL NETWORK BASED PATH PLANNING ALGORITHM 6.1 INTRODUCTION

6. NEURAL NETWORK BASED PATH PLANNING ALGORITHM 6.1 INTRODUCTION 6 NEURAL NETWORK BASED PATH PLANNING ALGORITHM 61 INTRODUCTION In previous chapters path planning algorithms such as trigonometry based path planning algorithm and direction based path planning algorithm

More information

UNSUPERVISED STATIC DISCRETIZATION METHODS IN DATA MINING. Daniela Joiţa Titu Maiorescu University, Bucharest, Romania

UNSUPERVISED STATIC DISCRETIZATION METHODS IN DATA MINING. Daniela Joiţa Titu Maiorescu University, Bucharest, Romania UNSUPERVISED STATIC DISCRETIZATION METHODS IN DATA MINING Daniela Joiţa Titu Maiorescu University, Bucharest, Romania danielajoita@utmro Abstract Discretization of real-valued data is often used as a pre-processing

More information

Using Heuristic Search Techniques to Extract Design Abstractions from Source Code

Using Heuristic Search Techniques to Extract Design Abstractions from Source Code Using Heuristic Search Techniques to Extract Design Abstractions from Source Code Brian S. Mitchell and Spiros Mancoridis Department of Mathematics & Computer Science Drexel University, Philadelphia, PA

More information

University of Florida CISE department Gator Engineering. Clustering Part 4

University of Florida CISE department Gator Engineering. Clustering Part 4 Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information

CCP: Conflicts Check Protocol for Bitcoin Block Security 1

CCP: Conflicts Check Protocol for Bitcoin Block Security 1 CCP: Conflicts Check Protocol for Bitcoin Block Security Chen Yang Peking University, China yc900@pku.edu.cn Abstract In this work, we present our early stage results on a Conflicts Check Protocol (CCP)

More information