Filtering Redundant Features by Similarity Comparisons


IT Degree Project (Examensarbete), 30 credits, August 2017

Filtering Redundant Features by Similarity Comparisons

Fangming Lan

Department of Information Technology (Institutionen för informationsteknologi), Uppsala University


Abstract

Filtering Redundant Features by Similarity Comparisons

Fangming Lan

In the field of biology, exposure to diseases depends on several factors that are intricately correlated. Statistical analyses of these diseases often suffer from the large number of factors involved, which makes the analyses complex. To address this, we propose a program that identifies correlated factors and removes insignificant ones, thus reducing the total number of factors and improving computational efficiency. The new algorithms detailed here are based on Apriori and Chi-square tests; they work by reducing the size of the initial data. Two such algorithms have been implemented, and the insignificant features in the test dataset were efficiently removed. The results show that suitable algorithms can improve the efficiency and performance of statistical analyses.

Supervisor (Handledare): Nicholas Baltzer
Reviewer (Ämnesgranskare): Jan Komorowski
Examiner (Examinator): Mats Daniels


Table of Contents

1 Background
2 Introduction
  2.1 Datasets
  2.2 Correlation
  2.3 Odds Ratio
    2.3.1 Defining Odds Ratio
    2.3.2 Calculating Odds Ratio
    2.3.3 Calculating Confidence Intervals
  2.4 Synthetic Dataset Generator
  2.5 Feature Synthesis
3 Algorithms
  3.1 Monte Carlo Feature Selection
  3.2 Lovelace Composite Generator
  3.3 Apriori
    3.3.1 Introduction
    3.3.2 Calculating Confidence
  3.4 Chi-Square Test
    3.4.1 Introduction
    3.4.2 Calculating Chi-square Value
4 Implementations
  4.1 Chi-square Test Implementation
    4.1.1 Calculating Chi-square Value
    4.1.2 Feature Deletion
  4.2 Apriori Implementation
    4.2.1 Setting the Input Parameters
    4.2.2 Establishing the Apriori Table
    4.2.3 Calculating Confidence Value
    4.2.4 Implementing Feature Deletion
5 Results
6 Conclusions
7 Acknowledgements
8 References


List of Figures

Figure 1: Block diagram of the main step of the MCFS procedure
Figure 2: Diagram of the Lovelace composite generator

List of Tables

Table 1: An example data set
Table 2: An example data set in diabetes
Table 3: An example of correlated features
Table 4: Frequency table
Table 5: An example of a synthetic dataset
Table 6: An example dataset with 4 customers and 4 items
Table 7: An example data for the Chi-square test
Table 8: Objects = 1000, single max correlation = 0.15, pair-wise max correlation = 0.25
Table 9: Objects = 1000, single max correlation = 0.15, pair-wise max correlation = 0.55
Table 10: Objects = 1000, single max correlation = 0.15, pair-wise max correlation = 0.85
Table 11: Dataset from the synthetic dataset generator (objects = 1000, start correlation = 0.95, end correlation = 0.05, decrement = 0.05)

1 Background

Interactions exist in every aspect of life. From the way taste buds cooperate on the tongue to the way transcription factors co-bind the genome to regulate cells, interactions are significant to biological processes. Consequently, to comprehend biology, there is a need to first understand these types of interactions.

Previous projects at the Komorowski Lab have strived to find these interactions on a limited scale. The last project produced a program capable of searching datasets for interactions and making them easily visible when found. However, this program was limited in scope and application: it lacked certain statistical and algorithmic implementations needed for more general adoption. The efficiency of the program is therefore of crucial concern.

The efficiency of the program is first evaluated by the time needed to complete a run: under the same conditions, a program that runs faster is the more efficient one. As the data sample grows, the limits of the program become apparent. The interaction search takes longer, the computational cost increases, and the practical usability of the program is greatly reduced. For instance, data from 200 observations take about 10 seconds for the program to complete, while data from 1000 observations take about 2 hours. This indicates a strong dependence between the efficiency of the program and the input sizes it can reasonably compute. It is therefore important to optimise the program to handle larger inputs. One way of doing so is to implement algorithms that reduce the size of the data, so that the computation reaches its result earlier and faster. The initial project that preceded this program was titled Using Feature Synthesis to Discern Non-linear Interactions via Composites, and is available on the IBG portal at Uppsala University.

2 Introduction

An important issue in bioinformatics is finding relevant features to identify relationships between different biological components. Much like the relationship between DNA and proteins, the relationship between proteins and the bio-membrane system requires a deep understanding of the various aspects of the system. When the number of aspects of a given system is huge, it becomes almost impractical to find clear pathways from the observed aspects to the known outcome of the system. As a result, the system ends up becoming a black box, much like the human body; nonetheless, we have learned to identify its individual components and functions.

The primary objective of this master's thesis is to implement algorithms (Apriori and the Chi-square test) which may be of use in reducing the number of aspects needed for these systems. This objective is pursued in three main steps. First, a literature review covers the necessary prior knowledge about merging these aspects into more compact versions and about how these aspects may relate to one another in non-linear ways, together with the statistics needed to manipulate and advance existing methods, and the contemporary algorithms applicable to the problem. Second, this knowledge is used to implement the algorithms found to be of interest and superior to the traditional ones. Lastly, the resulting data are analysed for accuracy and validity to verify the usefulness of the present study.

The data used in this study come from two distinct sources: a synthetic generator producing sets with known interactions for validation purposes, and datasets containing both single and pair-wise correlated features. They are used to test the results against the current literature.

The structure of this thesis is as follows: the remainder of this section describes the background information and key concepts of the research; Section 3 lists the algorithms used in the study; Sections 4 and 5 present the implementations and the results of the present study, respectively; and Section 6 discusses the conclusions and recommendations for future studies on this topic.

2.1 Datasets

A dataset is a system comprising objects (also called observations), features (also called attributes), and decisions (decision attributes). An object is a sample or observed case in the system, such as a patient in a disease system. A feature is a measurable characteristic of the object, such as the heart rate of a patient. In any given decision system, there is an outcome for every object, referred to as the decision feature.

Formally, the dataset is defined over a universe $U = \{O_1, O_2, O_3, \dots, O_N\}$, a non-empty finite set of all objects, and $A = \{F_1, F_2, F_3, \dots, F_M\}$, a non-empty finite set of all features, such that $a : U \rightarrow V_a$ for every $a \in A$. The set $V_a$ is called the value set of $a$.

As shown in Table 1, the first column lists the objects $\{O_1, \dots, O_{10}\}$, while the first row gives the names of the features $\{F_1, \dots, F_5\}$. The last column holds the decision attribute and its observed value for each object. As this is a binary decision system, the value of the decision feature is either 0 or 1.

Table 1: An example data set

       F1   F2   F3   F4   F5   Decision
O1     …    …    …    …    …    …
O2     …    …    …    …    …    …
O3     …    …    …    …    …    …
O4     …    …    …    …    …    …
O5     …    …    …    …    …    …
O6     …    …    …    …    …    …
O7     …    …    …    …    …    …
O8     …    …    …    …    …    …
O9     …    …    …    …    …    …
O10    …    …    …    …    …    …
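A decision system of this kind maps naturally onto a tabular structure. The following is a minimal sketch using pandas; the 0/1 values are hypothetical, chosen only for illustration:

```python
import pandas as pd

# A small binary decision system: rows are objects (observations),
# columns F1..F5 are features, and "Decision" is the decision attribute.
# The 0/1 values below are hypothetical.
data = pd.DataFrame(
    {
        "F1": [0, 1, 1, 0, 1],
        "F2": [1, 0, 0, 1, 0],
        "F3": [0, 0, 1, 1, 1],
        "F4": [1, 1, 0, 0, 1],
        "F5": [0, 1, 0, 1, 1],
        "Decision": [0, 1, 1, 0, 1],
    },
    index=["O1", "O2", "O3", "O4", "O5"],
)

# The value set V_a of a feature a is simply the set of values it takes.
value_sets = {a: set(data[a]) for a in data.columns}
print(value_sets)  # every V_a here is {0, 1}
```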

Table 2 depicts a sample table with real-world data on diabetes.

Table 2: An example data set in diabetes

            Family History   High Fat   Overweight   High Blood Pressure   Sedentary Lifestyle   Diabetes
Patient 1   …                …          …            …                     …                     No
Patient 2   …                …          …            …                     …                     No
Patient 3   …                …          …            …                     …                     Yes
Patient 4   …                …          …            …                     …                     Yes
Patient 5   …                …          …            …                     …                     Yes

2.2 Correlation

Correlation is a statistical measurement that shows the level of linear dependency between two variables. Here, linear means that any change in one variable causes an equal change (up to a factor) in the correlated variable. For instance, suppose the sale of sunglasses is correlated with the sale of ice cream in the summer; the first variable is sales of sunglasses, and the second is sales of ice cream. There is then an association between the two variables: when one increases, the other does too (by some, potentially negative, factor). Notably, in this instance both sales seem to depend on the sun, and one can, to some extent, use the sales of ice cream to approximate the sales of sunglasses and vice versa.

Correlation can be either positive or negative. Positively correlated features increase together and decrease together. Negatively correlated features swap this behaviour around: when one feature increases, the other decreases, and vice versa. Uncorrelated features show neither pattern, because their values are not dependent on each other.

In the Pearson correlation formula (Pearson, 1900), the correlation is represented by the strength of a linear relationship between two variables through Pearson's correlation coefficient, whose value lies between -1 and 1. Here $x_i$ and $y_i$ are the paired variables in the sample, while $\bar{x}$ and $\bar{y}$ are the mean values of $x$ and $y$, respectively. The formula is:

$$r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{(n-1)\,s_x s_y}$$

The simplified expression is:

$$r_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

where $s_x$ is the standard deviation, $\bar{x}$ is the sample mean, $\frac{x_i - \bar{x}}{s_x}$ is the standard score for $x$, and $\frac{y_i - \bar{y}}{s_y}$ is the standard score for $y$.

As an illustration, in Table 3 the column F1 takes the same values as the decision feature, suggesting a positive correlation between the two. Column F2 takes the opposite values of the decision feature, suggesting a negative correlation. The last column, F3, shows no pattern with respect to the decision feature values.

Table 3: An example of correlated features

       F1 (correlated)   F2 (anti-correlated)   F3 (un-correlated)   Decision
O1     …                 …                      …                    …
O2     …                 …                      …                    …
O3     …                 …                      …                    …
O4     …                 …                      …                    …
O5     …                 …                      …                    …
O6     …                 …                      …                    …
O7     …                 …                      …                    …
O8     …                 …                      …                    …
O9     …                 …                      …                    …
O10    …                 …                      …                    …
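Both forms of the coefficient are easy to check numerically. A minimal Python sketch, with hypothetical sample values, uses the standard-score form:

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient via the standard-score form."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sample standard deviations (the n-1 denominator matches the formula).
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    # Average product of the standard scores.
    return sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical paired samples, for illustration only.
ice_cream = [12, 15, 20, 24, 30]
sunglasses = [10, 14, 19, 26, 29]
print(round(pearson_r(ice_cream, sunglasses), 3))  # close to 1: strong positive correlation
```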

2.3 Odds Ratio

2.3.1 Defining Odds Ratio

In case-control studies, the Odds Ratio (OR) (Cornfield, 1951) is a relative measure of the strength of association between an exposure, such as radiation or genetic flaws, and a disease, such as cancer or Alzheimer's. The OR compares the likelihood of an outcome given an exposure across two groups, the experimental and control groups. For instance, an OR can express the ratio between the odds of exposure among those with a disease and the odds of exposure among those without the disease.

2.3.2 Calculating Odds Ratio

A simple example for calculating the odds ratio (OR) is illustrated in Table 4. The OR is defined as:

$$OR = \frac{A/C}{B/D} = \frac{AD}{BC}$$

where
A = number of exposed cases
B = number of exposed controls
C = number of unexposed cases
D = number of unexposed controls

In this case-control setting, the odds ratio is the odds of exposure in cases over the odds of exposure in controls:

Odds of exposure in cases = (cases with exposure) / (cases without exposure) = A / C

Odds of exposure in controls = (controls with exposure) / (controls without exposure) = B / D

Table 4: Frequency table

            Disease (cases)   No disease (controls)
Exposed     A                 B
Unexposed   C                 D

If the odds ratio is one, the odds of disease given the exposure are the same as the odds of disease without it; the exposure does not affect the odds of acquiring the disease. When the OR is greater than one, the exposure is associated with a susceptibility to the examined disease. Likewise, when the OR is less than one, the exposure is associated with a lowered risk of disease.

2.3.3 Calculating Confidence Intervals

The Confidence Interval (CI) is the range within which any sampled value in the population is anticipated to fall on the basis of the study results. The 95% CI is commonly used to estimate the precision of the OR: informally, a wide CI indicates a low level of precision of the OR, and vice versa. The 95% CI is calculated as:

$$\text{Upper 95\% CI} = e^{\ln(OR) + 1.96\sqrt{\frac{1}{A} + \frac{1}{B} + \frac{1}{C} + \frac{1}{D}}}$$

$$\text{Lower 95\% CI} = e^{\ln(OR) - 1.96\sqrt{\frac{1}{A} + \frac{1}{B} + \frac{1}{C} + \frac{1}{D}}}$$
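A minimal Python sketch of the two calculations; the 2x2 cell counts below are hypothetical:

```python
import math

def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """OR = (A/C)/(B/D) = AD/BC with its 95% confidence interval."""
    or_value = (a * d) / (b * c)
    # Standard error of ln(OR): sqrt(1/A + 1/B + 1/C + 1/D).
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_value) - z * se)
    upper = math.exp(math.log(or_value) + z * se)
    return or_value, lower, upper

# Hypothetical frequency table: 20 exposed cases, 10 exposed controls,
# 15 unexposed cases, 30 unexposed controls.
print(odds_ratio_with_ci(20, 10, 15, 30))  # OR = 4.0 with its 95% CI bounds
```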

2.4 Synthetic Dataset Generator

For validation purposes, it is essential to provide datasets for testing. The synthetic dataset generator was implemented to provide sets with features of both known correlation and known interactive behaviour. Table 5 shows an example of a generated dataset, in which the first row gives the correlation between each feature and the decision. The first value, 652, denotes that the listed feature is 65.2% correlated with the decision feature. The decision column holds the outcome for each observation. (A minimal sketch of such a generator is given at the end of this section.)

Table 5: An example of a synthetic dataset

652   …     …     Decision
…     …     …     …
…     …     …     …

2.5 Feature Synthesis

Feature synthesis is the generation of features from other features or properties, with the intention of directing, tilting, or reducing noise while retaining the desired information and an accurate correlation to some outcome (Rudnicki, 2004). Grouping properties according to how they relate to this outcome can reduce the size of a dataset and increase the visibility of features or properties. Composite feature generation, in turn, is tilted towards non-linear interactivity: it creates a single feature from multiple features so as to emphasize the interactive capacity of the constituents.
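The generator's implementation is not reproduced in this thesis. The sketch below shows one plausible construction, under the assumption that a binary feature with target correlation c can be produced by copying the decision value with probability (1 + c) / 2 and flipping it otherwise:

```python
import random

def synthetic_dataset(n_objects, correlations, seed=0):
    """Generate a binary decision column plus one feature per target correlation.

    Each feature copies the decision with probability (1 + c) / 2, so matching
    and mismatching values balance out to roughly correlation c. This scheme
    is an assumption for illustration, not the thesis generator itself.
    """
    rng = random.Random(seed)
    decisions = [rng.randint(0, 1) for _ in range(n_objects)]
    features = []
    for c in correlations:
        column = [d if rng.random() < (1 + c) / 2 else 1 - d for d in decisions]
        features.append(column)
    return features, decisions

# Features at roughly 65.2%, 40% and 5% correlation with the decision.
features, decisions = synthetic_dataset(1000, [0.652, 0.40, 0.05])
```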

3 Algorithms

Four algorithms are applied in this study: Monte Carlo Feature Selection (MCFS), the Lovelace Composite Generator (LLCG), Apriori, and the Chi-square test. MCFS is used to check which features are significant, whereas the LLCG is a composite feature synthesizer used to combine patterns. Apriori and the Chi-square test are both used to remove features that can be deduced or predicted from other features.

3.1 Monte Carlo Feature Selection

The computational cost of data analysis grows with the size of the dataset. Monte Carlo Feature Selection (MCFS) (Dramiński et al., 2008) is applied to make classification faster. In essence, it removes features that are not beneficial to the classification and ranks the remaining features. A sharp reduction of the number of features is often necessary to get any sensible classification out of large-scale data such as genomics data, where datasets commonly have 50,000 features or more. The ranking of the remaining features also indicates which features are critical and which only add a specific detail to the classification. This makes MCFS a powerful tool when considering the inferences and processes that underpin the data, such as biological pathways or climate patterns.

MCFS works by randomly selecting subsets of features from the dataset (s subsets are generated, as illustrated in Figure 1); each feature can appear in multiple subsets. Each subset is then split several times (t splits) into training and test parts. Each training part is used to construct a decision tree, which then classifies the corresponding test part. The classification performance of these trees is measured, and the features used are weighted by the performances. After all subsets have been measured and the scores normalized, the final scores determine which features are deemed significant for classification and which can be eliminated or discarded from the dataset without losing performance.
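A compact sketch of this loop using scikit-learn decision trees is shown below. The subset count s, split count t, subset size, and 0.66 train fraction are illustrative placeholders, and the real MCFS weighting scheme is more elaborate than plain accuracy:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def mcfs_scores(X, y, s=200, t=5, subset_size=5, seed=0):
    """Score features by how well random subsets of them classify y."""
    rng = np.random.default_rng(seed)
    scores = np.zeros(X.shape[1])
    for _ in range(s):                        # s random feature subsets
        subset = rng.choice(X.shape[1], size=subset_size, replace=False)
        for split in range(t):                # t train/test splits per subset
            X_tr, X_te, y_tr, y_te = train_test_split(
                X[:, subset], y, train_size=0.66, random_state=split)
            tree = DecisionTreeClassifier().fit(X_tr, y_tr)
            # Credit every feature in the subset with the tree's accuracy.
            scores[subset] += tree.score(X_te, y_te)
    return scores / scores.max()              # normalized relative importance

# Usage: rank the features of a binary dataset and keep the top scorers.
# X = np.array(...); y = np.array(...)
# print(np.argsort(mcfs_scores(X, y))[::-1])
```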

In the present study, the results from Apriori and the Chi-square test are compared with the results from MCFS to ascertain their validity.

Figure 1: Block diagram of the main step of the MCFS procedure

3.2 Lovelace Composite Generator

The Lovelace Composite Generator (LLCG) is used to create composite features by feature synthesis. A composite feature represents a relationship between two features. The main advantage of merging features in this way is that the relationship pattern is no longer hidden across different features. The generator computes the Lovelace composites for the dataset using the specified parameters and then generates a new dataset with composite features wherever such relationships are found.

Figure 2 summarises an example of the process. Feature 1 and Feature 2 produce a composite feature; in subsequent steps, this composite may merge further with other features. The example ends with all features combined into one. In common scenarios, however, there are usually only one or two composite features, while the remaining features stay unchanged.

Figure 2: Diagram of the Lovelace composite generator
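The LLCG algorithm itself is not listed in this thesis. As a rough illustration of what a composite feature does, the sketch below merges two binary features by pairing their values; this pairing scheme is an assumption for illustration, not the actual Lovelace computation:

```python
def composite_feature(f1, f2):
    """Merge two feature columns into one composite column.

    Each composite value encodes the *pair* of constituent values, so a
    pattern that exists only jointly in f1 and f2 becomes visible as a
    single feature value. Illustrative stand-in only, not the real LLCG.
    """
    return [f"{a}|{b}" for a, b in zip(f1, f2)]

feature_1 = [0, 0, 1, 1, 0]
feature_2 = [1, 0, 1, 0, 1]
print(composite_feature(feature_1, feature_2))
# ['0|1', '0|0', '1|1', '1|0', '0|1']
```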

3.3 Apriori

3.3.1 Introduction

The Apriori algorithm was proposed by Agrawal and Srikant in 1994 as a method for finding relations between features in a dataset. These relations are similar to correlation, but work in logistic value domains: there is no need for an increasing or decreasing trend. The only thing that matters is that a specific value from one feature can be consistently mapped to some value of the other feature. This makes the relationship asymmetric, since five values from one feature can map to a single value of the other feature, whereas the reverse mapping would provide no consistency.

For instance, the markers for depression originate from various environments; they can be genetic, environmental, medical, or physiological. By applying the Apriori algorithm to a dataset that contains all of these markers, we can test relations between the markers. This also allows us to identify similarities between different markers, such as lack of exposure to sunlight and stress.

As much as this does not provide any causative evidence (lack of sunlight exposure can be the result of stressful days spent in an office; anti-social behaviour can cause the stress of failing relationships), it certainly shows which feature values can be used to infer other feature values. This deduction also permits us to eliminate inferable features from the dataset.

In Apriori, confidence and support are the most significant parameters. Support is the probability of two variables occurring at the same time, that is, $P(A \cap B)$. Confidence is the probability that one variable occurs given that another occurs, such as $P(B \mid A)$, the probability of variable B occurring in the event that variable A occurs.

3.3.2 Calculating Confidence

Table 6: An example dataset with 4 customers and 4 items

Customer   milk   bread   butter   beer
c1         …      …       …        …
c2         …      …       …        …
c3         …      …       …        …
c4         …      …       …        …

As illustrated in Table 6, the first row gives the item names, forming the set of binary attributes $I = \{i_1, i_2, \dots, i_n\}$; in this case, $I = \{\text{milk}, \text{bread}, \text{butter}, \text{beer}\}$. The first column lists the customers, $C = \{c_1, c_2, \dots, c_n\}$, where each customer row records a purchase history. A relation rule is expressed as:

$$X \Rightarrow Y, \text{ where } X, Y \subseteq I$$

The support of $X$ with respect to the set of transactions $T$ is defined as the proportion of transactions $c$ in the dataset that contain the item set $X$. Formally,

$$\text{Support}(X) = \frac{|\{c \in T : X \subseteq c\}|}{|C|}$$

Support indicates how frequently the items appear in the dataset. The confidence of a rule, $X \Rightarrow Y$, with respect to a set of transactions $T$, is the proportion of the transactions containing $X$ that also contain $Y$. Formally,

$$\text{Confidence}(X \Rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X)}$$

Confidence indicates how often the rule is found to be true.
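A small sketch of both measures over a list of transactions; the purchase histories below are hypothetical:

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, x, y):
    """Confidence(X => Y) = Support(X u Y) / Support(X)."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)

# Hypothetical purchase histories for four customers.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "beer"},
    {"milk", "butter"},
]
print(support(transactions, {"milk", "bread"}))       # 0.5
print(confidence(transactions, {"milk"}, {"bread"}))  # 0.666...
```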

3.4 Chi-Square Test

3.4.1 Introduction

The Chi-square test (Greenwood, 1996) is a statistical method for testing a stated hypothesis by comparing the observed data with the expected data. For instance, to estimate the likelihood of disease, the rate of tuberculosis in a population might be broken down by the proportion of smoking to non-smoking people; a Chi-square test can then establish whether smoking is related to the development of the disease.

3.4.2 Calculating Chi-square Value

The Chi-square statistic ($\chi^2$) is expressed as:

$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

The value of the Chi-square statistic is compared against a static table to determine whether the observed behaviour is significantly different from the expected behaviour. The value to compare against is determined by the value domains of the observed and expected variables, through the degrees of freedom, formally

$$DF = (r - 1)(c - 1), \qquad r = |V_d(\text{observed})|, \quad c = |V_d(\text{expected})|$$

where $r$ and $c$ are the cardinalities of the value domains of the two variables. The algorithm's pseudo-code and a worked instance are detailed in the implementation section.
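A minimal sketch of the statistic and its degrees of freedom for an r x c contingency table, in pure Python:

```python
def chi_square(observed):
    """Chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count under independence: row total * column total / grand total.
            exp = row_totals[i] * col_totals[j] / grand
            chi2 += (obs - exp) ** 2 / exp
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df

# The 2x2 example worked through in Section 4.1.1.
print(chi_square([[5, 1], [1, 3]]))  # (3.40..., 1)
```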

4 Implementations

4.1 Chi-square Test Implementation

4.1.1 Calculating Chi-square Value

Table 7 gives example data for calculating the Chi-square value.

Table 7: An example data for the Chi-square test

C00_0   S00_0
0       0
0       0
0       0
0       0
0       0
0       1
1       0
1       1
1       1
1       1

I. Establishing the test hypotheses

$H_0$: $n_1 = n_2$ (C00_0 is independent of S00_0)
$H_a$: $n_1 \neq n_2$ (C00_0 is not independent of S00_0)

II. Generating the observed table

             S00_0 = 0   S00_0 = 1   Total
C00_0 = 0    5           1           6
C00_0 = 1    1           3           4
Total        6           4           10

III. Generating the expected table

The expected value for each cell is the total for its row multiplied by the total for its column, divided by the total for the table, that is:

Expected = (RowTotal x ColTotal) / GridTotal

Therefore, in the table above, the expected counts are:

cell (1,1) = 6 x 6 / 10 = 3.6
cell (1,2) = 6 x 4 / 10 = 2.4

cell (2,1) = 4 x 6 / 10 = 2.4
cell (2,2) = 4 x 4 / 10 = 1.6

             S00_0 = 0   S00_0 = 1   Total
C00_0 = 0    3.6         2.4         6
C00_0 = 1    2.4         1.6         4
Total        6           4           10

IV. Calculating the Chi-square value and finding the p-value

The alpha level of significance used here is 0.05, and the degrees of freedom is 1 (as shown in Section 3.4.2). The critical Chi-square value, obtained from the distribution for df = 1 and p = 0.05, is 3.84. Practically, this means the probability of obtaining a result at least as extreme as 3.84 is 5%.

$$\chi^2_c = \frac{(5 - 3.6)^2}{3.6} + \frac{(1 - 2.4)^2}{2.4} + \frac{(1 - 2.4)^2}{2.4} + \frac{(3 - 1.6)^2}{1.6} = 3.40$$

$$df = (2 - 1)(2 - 1) = 1$$

The example above has a Chi-square value of 3.40, which is below the critical value. Hence, the two features do not have significantly different distributions, and one can be estimated from the other.
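The same numbers can be checked with scipy, disabling Yates' continuity correction so the result matches the hand calculation:

```python
from scipy.stats import chi2_contingency

observed = [[5, 1], [1, 3]]
chi2, p, df, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 2), df, round(p, 3))  # 3.4  1  0.065
print(expected)                         # [[3.6 2.4] [2.4 1.6]]
```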

4.1.2 Feature Deletion

Reducing the number of features shortens the time required to compute all the potential composite features of a dataset. Where two features are established to have no significant differences, one can be represented by the other, meaning one of the features can be discarded without losing significant information.

4.2 Apriori Implementation

4.2.1 Setting the Input Parameters

At the beginning, the user sets the minimum support value and the minimum confidence value. By default, the confidence threshold is set at 0.75, while the minimum support is half the number of objects.

4.2.2 Establishing the Apriori Table

As illustrated in Table 7, the two features C00_0 and S00_0 appear to be related. The feature C00_0 is therefore subdivided into two indicator features, C00_0 = 0 and C00_0 = 1; likewise, S00_0 is subdivided into the indicator features S00_0 = 0 and S00_0 = 1. The restructured data are:

             C00_0 = 0   C00_0 = 1
S00_0 = 0    5           1
S00_0 = 1    1           3

In this form, it is simple to enumerate the corresponding relations. For C00_0 ⇒ S00_0, where the appearance of C00_0 denotes some correlation with S00_0:

1. C00_0 = 0 ⇒ S00_0 = 0
2. C00_0 = 0 ⇒ S00_0 = 1
3. C00_0 = 1 ⇒ S00_0 = 0
4. C00_0 = 1 ⇒ S00_0 = 1

For S00_0 ⇒ C00_0, where the appearance of S00_0 denotes some correlation with C00_0:

1. S00_0 = 0 ⇒ C00_0 = 0
2. S00_0 = 0 ⇒ C00_0 = 1
3. S00_0 = 1 ⇒ C00_0 = 0
4. S00_0 = 1 ⇒ C00_0 = 1

4.2.3 Calculating Confidence Value

For each relation between C00_0 and S00_0, the confidence is calculated. For example:

Support(C00_0 = 0 ∧ S00_0 = 0) = 5

Confidence(C00_0 = 0 ⇒ S00_0 = 0) = 5/6 ≈ 0.83

The rest are calculated in the same way:

Relation                  Support   Confidence
C00_0 = 0 ⇒ S00_0 = 0     5         0.83
C00_0 = 0 ⇒ S00_0 = 1     1         0.17
C00_0 = 1 ⇒ S00_0 = 0     1         0.25
C00_0 = 1 ⇒ S00_0 = 1     3         0.75
S00_0 = 0 ⇒ C00_0 = 0     5         0.83
S00_0 = 0 ⇒ C00_0 = 1     1         0.17
S00_0 = 1 ⇒ C00_0 = 0     1         0.25
S00_0 = 1 ⇒ C00_0 = 1     3         0.75

4.2.4 Implementing Feature Deletion

On the basis of the calculations above, the maximum confidence is 0.83, which exceeds the confidence threshold of 0.75 set at the beginning, and the corresponding support also exceeds the minimum support value. This shows that C00_0 and S00_0 are correlated, and C00_0 can therefore represent S00_0. Nevertheless, S00_0 cannot represent C00_0, since the support is insufficient when the features are compared in the opposite direction.
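Putting the pieces of Section 4.2 together, the following compact sketch enumerates the value-level rules for the two columns of Table 7 (as reconstructed above) and applies the thresholds of Section 4.2.1:

```python
from itertools import product

def value_rules(col_a, col_b, name_a, name_b):
    """All rules name_a=u => name_b=v with their support counts and confidence."""
    rules = []
    for u, v in product(sorted(set(col_a)), sorted(set(col_b))):
        support = sum(a == u and b == v for a, b in zip(col_a, col_b))
        antecedent = sum(a == u for a in col_a)
        conf = support / antecedent if antecedent else 0.0
        rules.append((f"{name_a}={u} => {name_b}={v}", support, conf))
    return rules

c00_0 = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
s00_0 = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1]

min_support = len(c00_0) // 2   # half the number of objects
min_conf = 0.75                 # default confidence threshold

for rule, sup, conf in value_rules(c00_0, s00_0, "C00_0", "S00_0"):
    passes = sup >= min_support and conf >= min_conf
    print(f"{rule}: support={sup}, confidence={conf:.2f}, passes={passes}")
# Only C00_0=0 => S00_0=0 passes (support 5, confidence 0.83), so S00_0
# becomes a deletion candidate with C00_0 kept as its representative.
```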

5 Results

For the validation tests, we utilised four distinct datasets. The first was generated by the synthetic dataset generator. The remaining three were test sets for VisuNet, a tool used to display feature-value relationships in networks. All of the datasets contained both single and pair-wise correlated features.

Table 8: Objects = 1000, single max correlation = 0.15, pair-wise max correlation = 0.25

                                LLCG   Apriori   Apriori + LLCG   Chi-Square   Chi-Square + LLCG   Significant Features (MCFS)
Objects                         1000   1000      1000             1000         1000                1000
Original Features               …      …         …                …            …                   …
Composite Features              3      …         …                …            2                   …
Total Features after running    …      …         …                …            …                   …
Features removed                …      0         …                2            …                   …

Table 8 shows a dataset with a single max correlation of 0.15, a pair-wise max correlation of 0.25, and 1000 objects. In this example, MCFS was unable to find any significant features in the original dataset. Running LLCG generated only 3 composite features; running MCFS afterwards discovered 2 significant features. The Apriori algorithm was also unable to identify any relationships between the features in the dataset. The Chi-square test, however, found 2 significantly correlated features, supporting the removal of two features. Combining the Chi-square test with LLCG generated 2 composite features from the remaining dataset.

Table 9: Objects = 1000, single max correlation = 0.15, pair-wise max correlation = 0.55

                                LLCG   Apriori   Apriori + LLCG   Chi-Square   Chi-Square + LLCG   Significant Features (MCFS)
Objects                         1000   1000      1000             1000         1000                1000
Original Features               …      …         …                …            …                   …
Composite Features              …      …         …                …            1                   …
Total Features after running    …      …         …                …            …                   …
Features removed                …      0         …                …            …                   …

Table 9 illustrates a dataset with a single max correlation of 0.15, a pair-wise max correlation of 0.55, and 1000 objects. MCFS identified 3 significant features in the original dataset. The Apriori algorithm was again unable to identify any related features, while the Chi-square test discovered 4 correlated features. Compared against the MCFS result, no significant features were deleted by the Chi-square test. Combining the Chi-square test with LLCG generated only 1 composite feature.

Table 10: Objects = 1000, single max correlation = 0.15, pair-wise max correlation = 0.85

                                LLCG   Apriori   Apriori + LLCG   Chi-Square   Chi-Square + LLCG   Significant Features (MCFS)
Objects                         1000   1000      1000             1000         1000                1000
Original Features               …      …         …                …            …                   …
Composite Features              …      …         …                …            4                   …
Total Features after running    …      …         …                …            …                   …
Features removed                …      0         …                …            …                   …

Table 10 shows a dataset with a single max correlation of 0.15, a pair-wise max correlation of 0.85, and 1000 objects. On this dataset, MCFS discovered 5 significant features. The Apriori algorithm identified no pair-wise related features, yet the Chi-square test found 5 correlated features; out of the 5 significant features identified by MCFS, the Chi-square test removed 2. When the Chi-square test was combined with LLCG, 4 composite features were identified.

Table 11: Dataset from the synthetic dataset generator (objects = 1000, start correlation = 0.95, end correlation = 0.05, decrement = 0.05)

                                LLCG   Apriori   Apriori + LLCG   Chi-Square   Chi-Square + LLCG   Significant Features (MCFS)
Objects                         1000   1000      1000             1000         1000                1000
Original Features               …      …         …                …            …                   …
Composite Features              152    …         44               …            …                   …
Total Features after running    …      …         …                …            …                   …
Features removed                …      8         …                …            …                   …

Table 11 summarises a dataset from the synthetic dataset generator with 1000 objects, a starting correlation of 0.95, an ending correlation of 0.05, and a decrement of 0.05. Here, MCFS found 3 significant features. Running LLCG alone generated 152 composite features. The Apriori algorithm discovered 8 related features; after these redundant features were deleted, 44 composite features were generated. Compared with the MCFS result, none of the significant features had been deleted, meaning that the redundant features were removed successfully. The Chi-square test, however, deleted all the features.

6 Conclusions

In this study, Apriori and the Chi-square test were used to identify potentially similar features. The results in Tables 8 to 10 show that the Apriori algorithm cannot reliably identify insignificant features. The Chi-square test performed well at detecting and deleting insignificant features on these datasets; however, when compared with the MCFS results, some of the removed features were in fact significant, showing that the Chi-square test is not a safe way of reducing features. The results in Table 11 show that 8 redundant or non-informative features were discovered by the Apriori algorithm.

In addition, MCFS validated these 8 features as insignificant, so they could be deleted safely. This confirms that the Apriori algorithm can find representatives for other features in a dataset.

The Chi-square test was used to test the independence of features. In this program, the Chi-square test served as a goodness-of-fit test, helping to decide whether there is any difference between the observed (experimental) values and the expected (theoretical) values. In the experiments, the Chi-square test provided correlation information about the dataset.

As indicated in the results of the implementation section, the Apriori algorithm can help reduce the number of features in a dataset, since the features it removes do not conflict with the results from MCFS. When the dataset is large and contains many features, the Apriori algorithm can remove redundant features. Nevertheless, its limitation is that when the number of features is large, parsing takes considerable time. The disadvantage of this program is therefore that running the Apriori and Chi-square tests takes a long time: as the number of iterations increases, the processing time grows, and a dataset with 100 features and 1000 objects might take up to 2 hours to compute. In the future, better program optimisation or improved algorithms would be needed to achieve faster computation.

7 Acknowledgements

I am grateful to my supervisor, MSc Nicholas Baltzer, for giving me this precious opportunity and for taking the time to guide me through this project. I would also like to extend my gratitude to Prof. Jan Komorowski for providing me with a good working environment in his laboratory. I am also indebted to all my friends for their support and encouragement. Finally, I am deeply appreciative of my family, particularly my parents, for providing the financial and moral support I needed to reach this stage of my studies. Without their encouragement and support, this work would not have been possible. Thank you.

8 References

Cornfield, J. (1951). A method for estimating comparative rates from clinical data. Applications to cancer of the lung, breast, and cervix. Journal of the National Cancer Institute.

Greenwood, P. E., & Nikulin, M. S. (1996). A Guide to Chi-Squared Testing. Wiley, New York.

Dramiński, M., Rada-Iglesias, A., Enroth, S., Wadelius, C., Koronacki, J., & Komorowski, J. (2008). Monte Carlo feature selection for supervised classification. Bioinformatics, 24(1), 110–117. doi:10.1093/bioinformatics/btm486

Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine.

Rudnicki, W., et al. (2004). Feature synthesis and extraction for the construction of generalized properties of amino acids. In Rough Sets and Current Trends in Computing. Springer.


More information

Climate Precipitation Prediction by Neural Network

Climate Precipitation Prediction by Neural Network Journal of Mathematics and System Science 5 (205) 207-23 doi: 0.7265/259-529/205.05.005 D DAVID PUBLISHING Juliana Aparecida Anochi, Haroldo Fraga de Campos Velho 2. Applied Computing Graduate Program,

More information

8 th Grade Pre Algebra Pacing Guide 1 st Nine Weeks

8 th Grade Pre Algebra Pacing Guide 1 st Nine Weeks 8 th Grade Pre Algebra Pacing Guide 1 st Nine Weeks MS Objective CCSS Standard I Can Statements Included in MS Framework + Included in Phase 1 infusion Included in Phase 2 infusion 1a. Define, classify,

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Multivariate Conditional Distribution Estimation and Analysis

Multivariate Conditional Distribution Estimation and Analysis IT 14 62 Examensarbete 45 hp Oktober 14 Multivariate Conditional Distribution Estimation and Analysis Sander Medri Institutionen för informationsteknologi Department of Information Technology Abstract

More information

Graph Structure Over Time

Graph Structure Over Time Graph Structure Over Time Observing how time alters the structure of the IEEE data set Priti Kumar Computer Science Rensselaer Polytechnic Institute Troy, NY Kumarp3@rpi.edu Abstract This paper examines

More information

DISCOVERING ACTIVE AND PROFITABLE PATTERNS WITH RFM (RECENCY, FREQUENCY AND MONETARY) SEQUENTIAL PATTERN MINING A CONSTRAINT BASED APPROACH

DISCOVERING ACTIVE AND PROFITABLE PATTERNS WITH RFM (RECENCY, FREQUENCY AND MONETARY) SEQUENTIAL PATTERN MINING A CONSTRAINT BASED APPROACH International Journal of Information Technology and Knowledge Management January-June 2011, Volume 4, No. 1, pp. 27-32 DISCOVERING ACTIVE AND PROFITABLE PATTERNS WITH RFM (RECENCY, FREQUENCY AND MONETARY)

More information

Cross-validation and the Bootstrap

Cross-validation and the Bootstrap Cross-validation and the Bootstrap In the section we discuss two resampling methods: cross-validation and the bootstrap. 1/44 Cross-validation and the Bootstrap In the section we discuss two resampling

More information

Snapshot Algorithm Animation with Erlang

Snapshot Algorithm Animation with Erlang IT 13 077 Examensarbete 15 hp November 2013 Snapshot Algorithm Animation with Erlang Fredrik Bryntesson Institutionen för informationsteknologi Department of Information Technology Abstract Snapshot Algorithm

More information

CHAPTER 5 GENERATING TEST SCENARIOS AND TEST CASES FROM AN EVENT-FLOW MODEL

CHAPTER 5 GENERATING TEST SCENARIOS AND TEST CASES FROM AN EVENT-FLOW MODEL CHAPTER 5 GENERATING TEST SCENARIOS AND TEST CASES FROM AN EVENT-FLOW MODEL 5.1 INTRODUCTION The survey presented in Chapter 1 has shown that Model based testing approach for automatic generation of test

More information

ViTraM: VIsualization of TRAnscriptional Modules

ViTraM: VIsualization of TRAnscriptional Modules ViTraM: VIsualization of TRAnscriptional Modules Version 2.0 October 1st, 2009 KULeuven, Belgium 1 Contents 1 INTRODUCTION AND INSTALLATION... 4 1.1 Introduction...4 1.2 Software structure...5 1.3 Requirements...5

More information

Role of Association Rule Mining in DNA Microarray Data - A Research

Role of Association Rule Mining in DNA Microarray Data - A Research Role of Association Rule Mining in DNA Microarray Data - A Research T. Arundhathi Asst. Professor Department of CSIT MANUU, Hyderabad Research Scholar Osmania University, Hyderabad Prof. T. Adilakshmi

More information

Combinatorial Search; Monte Carlo Methods

Combinatorial Search; Monte Carlo Methods Combinatorial Search; Monte Carlo Methods Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico May 02, 2016 CPD (DEI / IST) Parallel and Distributed

More information

Frequency Distributions

Frequency Distributions Displaying Data Frequency Distributions After collecting data, the first task for a researcher is to organize and summarize the data so that it is possible to get a general overview of the results. Remember,

More information

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani Feature Selection CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Dimensionality reduction Feature selection vs. feature extraction Filter univariate

More information

Data Mining. 2.4 Data Integration. Fall Instructor: Dr. Masoud Yaghini. Data Integration

Data Mining. 2.4 Data Integration. Fall Instructor: Dr. Masoud Yaghini. Data Integration Data Mining 2.4 Fall 2008 Instructor: Dr. Masoud Yaghini Data integration: Combines data from multiple databases into a coherent store Denormalization tables (often done to improve performance by avoiding

More information

Module 1 Lecture Notes 2. Optimization Problem and Model Formulation

Module 1 Lecture Notes 2. Optimization Problem and Model Formulation Optimization Methods: Introduction and Basic concepts 1 Module 1 Lecture Notes 2 Optimization Problem and Model Formulation Introduction In the previous lecture we studied the evolution of optimization

More information

Artificial Intelligence. Programming Styles

Artificial Intelligence. Programming Styles Artificial Intelligence Intro to Machine Learning Programming Styles Standard CS: Explicitly program computer to do something Early AI: Derive a problem description (state) and use general algorithms to

More information

ECLT 5810 Data Preprocessing. Prof. Wai Lam

ECLT 5810 Data Preprocessing. Prof. Wai Lam ECLT 5810 Data Preprocessing Prof. Wai Lam Why Data Preprocessing? Data in the real world is imperfect incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate

More information

Data mining with Support Vector Machine

Data mining with Support Vector Machine Data mining with Support Vector Machine Ms. Arti Patle IES, IPS Academy Indore (M.P.) artipatle@gmail.com Mr. Deepak Singh Chouhan IES, IPS Academy Indore (M.P.) deepak.schouhan@yahoo.com Abstract: Machine

More information

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET) INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET) International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print), ISSN 0976 6367(Print) ISSN 0976 6375(Online)

More information

HARNESSING CERTAINTY TO SPEED TASK-ALLOCATION ALGORITHMS FOR MULTI-ROBOT SYSTEMS

HARNESSING CERTAINTY TO SPEED TASK-ALLOCATION ALGORITHMS FOR MULTI-ROBOT SYSTEMS HARNESSING CERTAINTY TO SPEED TASK-ALLOCATION ALGORITHMS FOR MULTI-ROBOT SYSTEMS An Undergraduate Research Scholars Thesis by DENISE IRVIN Submitted to the Undergraduate Research Scholars program at Texas

More information

Quick Start Guide Jacob Stolk PhD Simone Stolk MPH November 2018

Quick Start Guide Jacob Stolk PhD Simone Stolk MPH November 2018 Quick Start Guide Jacob Stolk PhD Simone Stolk MPH November 2018 Contents Introduction... 1 Start DIONE... 2 Load Data... 3 Missing Values... 5 Explore Data... 6 One Variable... 6 Two Variables... 7 All

More information

Sentiment analysis under temporal shift

Sentiment analysis under temporal shift Sentiment analysis under temporal shift Jan Lukes and Anders Søgaard Dpt. of Computer Science University of Copenhagen Copenhagen, Denmark smx262@alumni.ku.dk Abstract Sentiment analysis models often rely

More information

CHAPTER 5 WEIGHTED SUPPORT ASSOCIATION RULE MINING USING CLOSED ITEMSET LATTICES IN PARALLEL

CHAPTER 5 WEIGHTED SUPPORT ASSOCIATION RULE MINING USING CLOSED ITEMSET LATTICES IN PARALLEL 68 CHAPTER 5 WEIGHTED SUPPORT ASSOCIATION RULE MINING USING CLOSED ITEMSET LATTICES IN PARALLEL 5.1 INTRODUCTION During recent years, one of the vibrant research topics is Association rule discovery. This

More information

2. On classification and related tasks

2. On classification and related tasks 2. On classification and related tasks In this part of the course we take a concise bird s-eye view of different central tasks and concepts involved in machine learning and classification particularly.

More information