Filtering Redundant Features by Similarity Comparisons


IT Degree Project (Examensarbete), 30 credits, August 2017

Filtering Redundant Features by Similarity Comparisons

Fangming Lan

Department of Information Technology (Institutionen för informationsteknologi), Uppsala University


Abstract

Filtering Redundant Features by Similarity Comparisons

Fangming Lan

In the field of biology, exposure to diseases depends on several factors that are intricately correlated. Statistical analyses of these diseases often suffer from the large number of factors involved, which makes the analyses complex. To address this, we propose a program that identifies correlated factors and removes insignificant ones, thus reducing the total number of factors and improving computational efficiency. The new algorithms detailed here are based on Apriori and Chi-square tests; they work by reducing the size of the initial data. Two such algorithms have been implemented, and the insignificant features in the test dataset were efficiently removed. The results show that suitable algorithms can improve the efficiency and performance of statistical analyses.

Supervisor (Handledare): Nicholas Baltzer
Reviewer (Ämnesgranskare): Jan Komorowski
Examiner (Examinator): Mats Daniels


Table of Contents

1 Background
2 Introduction
  2.1 Datasets
  2.2 Correlation
  2.3 Odds Ratio
    2.3.1 Defining Odds Ratio
    2.3.2 Calculating Odds Ratio
    2.3.3 Calculating Confidence Intervals
  2.4 Synthetic Dataset Generator
  2.5 Feature Synthesis
3 Algorithms
  3.1 Monte Carlo Feature Selection
  3.2 Lovelace Composite Generator
  3.3 Apriori
    3.3.1 Introduction
    3.3.2 Calculating Confidence
  3.4 Chi-Square Test
    3.4.1 Introduction
    3.4.2 Calculating Chi-square Value
4 Implementations
  4.1 Chi-square Test Implementation
    4.1.1 Calculating Chi-square Value
    4.1.2 Feature Deletion
  4.2 Apriori Implementation
    4.2.1 Setting the Input Parameters
    4.2.2 Establishing the Apriori Table
    4.2.3 Calculating Confidence Value
    4.2.4 Implementing Feature Deletion
5 Results
6 Conclusions
7 Acknowledgements
8 References


List of Figures

Figure 1: Block diagram of the main step of the MCFS procedure
Figure 2: Diagram of the Lovelace composite generator

List of Tables

Table 1: An example data set
Table 2: An example data set in diabetes
Table 3: An example of correlated features
Table 4: Frequency table
Table 5: An example of a synthetic dataset
Table 6: An example dataset with 4 customers and 4 items
Table 7: An example data for the Chi-square test
Table 8: Objects = 1000, single max correlation = 0.15, pair-wise max correlation = 0.25
Table 9: Objects = 1000, single max correlation = 0.15, pair-wise max correlation = 0.55
Table 10: Objects = 1000, single max correlation = 0.15, pair-wise max correlation = 0.85
Table 11: Dataset from the synthetic dataset generator (objects = 1000, start correlation = 0.95, end correlation = 0.05, decrement = 0.05)

1 Background

Interactions exist in every aspect of life. From the way taste buds cooperate on the tongue to the way transcription factors co-bind the genome to regulate cells, interactions are significant to biological processes. Consequently, to comprehend biology, there is a need to first understand these types of interactions.

Previous projects at the Komorowski Lab have strived to find these interactions on a limited scale. The last project produced a program capable of searching datasets for interactions and making them easily visible when found. However, this program was limited in scope and application: it lacked certain statistical and algorithmic implementations needed for more general adoption. The efficiency of the program is therefore of crucial concern.

The efficiency of the program is first evaluated by the time needed to complete a run: under the same conditions, a program that runs faster is the more efficient one. As the data sample grows, the limits of the program become apparent. The interaction search takes longer, the computational cost increases, and the practical usability of the program is greatly reduced. For instance, data from 200 observations take about 10 seconds for the program to complete, while data from 1000 observations take about 2 hours. This indicates a strong dependence between the efficiency of the program and the input sizes it can reasonably compute. It is therefore important to optimise the program to handle larger inputs. One way of doing so is to implement algorithms that reduce the size of the data, so that the computation reaches its result earlier and faster. The initial project that preceded this program was titled Using Feature Synthesis to Discern Non-linear Interactions via Composites, and is available on the IBG portal at Uppsala University.

2 Introduction

An important issue in bioinformatics is finding relevant features to identify relationships between different biological components. Much like the relationship between DNA and proteins, the relationship between proteins and the bio-membrane system requires a deep understanding of the various aspects of the system. When the number of aspects of a given system is huge, it becomes almost impractical to find clear pathways from the observed aspects to the known outcome of the system. As a result, the system ends up becoming a black box, much like the human body; nonetheless, we have learned to identify its individual components and functions.

The primary objective of this master's thesis is to implement algorithms (Apriori and the Chi-square test) which may be of use in reducing the number of aspects needed for these systems. This objective is pursued in three main steps. First, a literature review covers the necessary prior knowledge about merging these aspects into more compact versions and about how these aspects may relate to one another in non-linear ways, together with the statistics needed to manipulate and advance existing methods, and the contemporary algorithms applicable to the problem. Second, this knowledge is used to implement the algorithms found to be of interest and superior to the traditional ones. Lastly, the resulting data are analysed for accuracy and validity to verify the usefulness of the present study.

The data used in this study come from two distinct sources: a synthetic generator producing sets with known interactions for validation purposes, and datasets containing both single and pair-wise correlated features. They are used to test the results against the current literature.

The structure of this thesis is as follows: the remainder of this section describes the background information and key concepts of the research; Section 3 lists the algorithms used in the study; Sections 4 and 5 present the implementations and the results of the present study, respectively; and Section 6 discusses the conclusions and recommendations for future studies on this topic.

2.1 Datasets

A dataset is a system comprising objects (also called observations), features (also called attributes), and decisions (decision attributes). An object is a sample or observed case in the system, such as a patient in a disease system. A feature is a measurable characteristic of the object, such as the heart rate of a patient. In any given decision system, there is an outcome for every object, referred to as the decision feature.

Formally, the dataset is defined over a universe $U = \{O_1, O_2, O_3, \dots, O_N\}$, a non-empty finite set of all objects, and $A = \{F_1, F_2, F_3, \dots, F_M\}$, a non-empty finite set of all features, such that $a : U \rightarrow V_a$ for every $a \in A$. The set $V_a$ is called the value set of $a$.

As shown in Table 1, the first column lists the objects $\{O_1, \dots, O_{10}\}$, while the first row gives the names of the features $\{F_1, \dots, F_5\}$. The last column holds the decision attribute and its observed value for each object. As this is a binary decision system, the value of the decision feature is either 0 or 1.

Table 1: An example data set

       F1   F2   F3   F4   F5   Decision
O1     …    …    …    …    …    …
O2     …    …    …    …    …    …
O3     …    …    …    …    …    …
O4     …    …    …    …    …    …
O5     …    …    …    …    …    …
O6     …    …    …    …    …    …
O7     …    …    …    …    …    …
O8     …    …    …    …    …    …
O9     …    …    …    …    …    …
O10    …    …    …    …    …    …
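A decision system of this kind maps naturally onto a tabular structure. The following is a minimal sketch using pandas; the 0/1 values are hypothetical, chosen only for illustration:

```python
import pandas as pd

# A small binary decision system: rows are objects (observations),
# columns F1..F5 are features, and "Decision" is the decision attribute.
# The 0/1 values below are hypothetical.
data = pd.DataFrame(
    {
        "F1": [0, 1, 1, 0, 1],
        "F2": [1, 0, 0, 1, 0],
        "F3": [0, 0, 1, 1, 1],
        "F4": [1, 1, 0, 0, 1],
        "F5": [0, 1, 0, 1, 1],
        "Decision": [0, 1, 1, 0, 1],
    },
    index=["O1", "O2", "O3", "O4", "O5"],
)

# The value set V_a of a feature a is simply the set of values it takes.
value_sets = {a: set(data[a]) for a in data.columns}
print(value_sets)  # every V_a here is {0, 1}
```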

Table 2 depicts a sample table with real-world data on diabetes.

Table 2: An example data set in diabetes

            Family History   High Fat   Overweight   High Blood Pressure   Sedentary Lifestyle   Diabetes
Patient 1   …                …          …            …                     …                     No
Patient 2   …                …          …            …                     …                     No
Patient 3   …                …          …            …                     …                     Yes
Patient 4   …                …          …            …                     …                     Yes
Patient 5   …                …          …            …                     …                     Yes

2.2 Correlation

Correlation is a statistical measurement that shows the level of linear dependency between two variables. Here, linear means that any change in one variable causes an equal change (up to a factor) in the correlated variable. For instance, suppose the sale of sunglasses is correlated with the sale of ice cream in the summer; the first variable is sales of sunglasses, and the second is sales of ice cream. There is then an association between the two variables: when one increases, the other does too (by some, potentially negative, factor). Notably, in this instance both sales seem to depend on the sun, and one can, to some extent, use the sales of ice cream to approximate the sales of sunglasses and vice versa.

Correlation can be either positive or negative. Positively correlated features increase together and decrease together. Negatively correlated features swap this behaviour around: when one feature increases, the other decreases, and vice versa. Uncorrelated features show neither pattern, because their values are not dependent on each other.

In the Pearson correlation formula (Pearson, 1900), the correlation is represented by the strength of a linear relationship between two variables through Pearson's correlation coefficient, whose value lies between -1 and 1. Here $x_i$ and $y_i$ are the paired variables in the sample, while $\bar{x}$ and $\bar{y}$ are the mean values of $x$ and $y$, respectively. The formula is:

$$r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{(n-1)\,s_x s_y}$$

The simplified expression is:

$$r_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

where $s_x$ is the standard deviation, $\bar{x}$ is the sample mean, $\frac{x_i - \bar{x}}{s_x}$ is the standard score for $x$, and $\frac{y_i - \bar{y}}{s_y}$ is the standard score for $y$.

As an illustration, in Table 3 the column F1 takes the same values as the decision feature, suggesting a positive correlation between the two. Column F2 takes the opposite values of the decision feature, suggesting a negative correlation. The last column, F3, shows no pattern with respect to the decision feature values.

Table 3: An example of correlated features

       F1 (correlated)   F2 (anti-correlated)   F3 (un-correlated)   Decision
O1     …                 …                      …                    …
O2     …                 …                      …                    …
O3     …                 …                      …                    …
O4     …                 …                      …                    …
O5     …                 …                      …                    …
O6     …                 …                      …                    …
O7     …                 …                      …                    …
O8     …                 …                      …                    …
O9     …                 …                      …                    …
O10    …                 …                      …                    …
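Both forms of the coefficient are easy to check numerically. A minimal Python sketch, with hypothetical sample values, uses the standard-score form:

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient via the standard-score form."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sample standard deviations (the n-1 denominator matches the formula).
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    # Average product of the standard scores.
    return sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical paired samples, for illustration only.
ice_cream = [12, 15, 20, 24, 30]
sunglasses = [10, 14, 19, 26, 29]
print(round(pearson_r(ice_cream, sunglasses), 3))  # close to 1: strong positive correlation
```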

2.3 Odds Ratio

2.3.1 Defining Odds Ratio

In case-control studies, the Odds Ratio (OR) (Cornfield, 1951) is a relative measure of the strength of association between an exposure, such as radiation or genetic flaws, and a disease, such as cancer or Alzheimer's. The OR compares the likelihood of an outcome given an exposure across two groups, the experimental and control groups. For instance, an OR can express the ratio between the odds of exposure among those with a disease and the odds of exposure among those without the disease.

2.3.2 Calculating Odds Ratio

A simple example for calculating the odds ratio (OR) is illustrated in Table 4. The OR is defined as:

$$OR = \frac{A/C}{B/D} = \frac{AD}{BC}$$

where
A = number of exposed cases
B = number of exposed controls
C = number of unexposed cases
D = number of unexposed controls

In this case-control setting, the odds ratio is the odds of exposure in cases over the odds of exposure in controls:

Odds of exposure in cases = (cases with exposure) / (cases without exposure) = A / C

Odds of exposure in controls = (controls with exposure) / (controls without exposure) = B / D

Table 4: Frequency table

            Disease (cases)   No disease (controls)
Exposed     A                 B
Unexposed   C                 D

If the odds ratio is one, the odds of disease given the exposure are the same as the odds of disease without it; the exposure does not affect the odds of acquiring the disease. When the OR is greater than one, the exposure is associated with a susceptibility to the examined disease. Likewise, when the OR is less than one, the exposure is associated with a lowered risk of disease.

2.3.3 Calculating Confidence Intervals

The Confidence Interval (CI) is the range within which any sampled value in the population is anticipated to fall on the basis of the study results. The 95% CI is commonly used to estimate the precision of the OR: informally, a wide CI indicates a low level of precision of the OR, and vice versa. The 95% CI is calculated as:

$$\text{Upper 95\% CI} = e^{\ln(OR) + 1.96\sqrt{\frac{1}{A} + \frac{1}{B} + \frac{1}{C} + \frac{1}{D}}}$$

$$\text{Lower 95\% CI} = e^{\ln(OR) - 1.96\sqrt{\frac{1}{A} + \frac{1}{B} + \frac{1}{C} + \frac{1}{D}}}$$
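A minimal Python sketch of the two calculations; the 2x2 cell counts below are hypothetical:

```python
import math

def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """OR = (A/C)/(B/D) = AD/BC with its 95% confidence interval."""
    or_value = (a * d) / (b * c)
    # Standard error of ln(OR): sqrt(1/A + 1/B + 1/C + 1/D).
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_value) - z * se)
    upper = math.exp(math.log(or_value) + z * se)
    return or_value, lower, upper

# Hypothetical frequency table: 20 exposed cases, 10 exposed controls,
# 15 unexposed cases, 30 unexposed controls.
print(odds_ratio_with_ci(20, 10, 15, 30))  # OR = 4.0 with its 95% CI bounds
```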

2.4 Synthetic Dataset Generator

For validation purposes, it is essential to provide datasets for testing. The synthetic dataset generator was implemented to provide sets with features of both known correlation and known interactive behaviour. Table 5 shows an example of a generated dataset, in which the first row gives the correlation between each feature and the decision. The first value, 652, denotes that the listed feature is 65.2% correlated with the decision feature. The decision column holds the outcome for each observation. (A minimal sketch of such a generator is given at the end of this section.)

Table 5: An example of a synthetic dataset

652   …     …     Decision
…     …     …     …
…     …     …     …

2.5 Feature Synthesis

Feature synthesis is the generation of features from other features or properties, with the intention of directing, tilting, or reducing noise while retaining the desired information and an accurate correlation to some outcome (Rudnicki, 2004). Grouping properties according to how they relate to this outcome can reduce the size of a dataset and increase the visibility of features or properties. Composite feature generation, in turn, is tilted towards non-linear interactivity: it creates a single feature from multiple features so as to emphasize the interactive capacity of the constituents.
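The generator's implementation is not reproduced in this thesis. The sketch below shows one plausible construction, under the assumption that a binary feature with target correlation c can be produced by copying the decision value with probability (1 + c) / 2 and flipping it otherwise:

```python
import random

def synthetic_dataset(n_objects, correlations, seed=0):
    """Generate a binary decision column plus one feature per target correlation.

    Each feature copies the decision with probability (1 + c) / 2, so matching
    and mismatching values balance out to roughly correlation c. This scheme
    is an assumption for illustration, not the thesis generator itself.
    """
    rng = random.Random(seed)
    decisions = [rng.randint(0, 1) for _ in range(n_objects)]
    features = []
    for c in correlations:
        column = [d if rng.random() < (1 + c) / 2 else 1 - d for d in decisions]
        features.append(column)
    return features, decisions

# Features at roughly 65.2%, 40% and 5% correlation with the decision.
features, decisions = synthetic_dataset(1000, [0.652, 0.40, 0.05])
```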

3 Algorithms

Four algorithms are applied in this study: Monte Carlo Feature Selection (MCFS), the Lovelace Composite Generator (LLCG), Apriori, and the Chi-square test. MCFS is used to check which features are significant, whereas the LLCG is a composite feature synthesizer used to combine patterns. Apriori and the Chi-square test are both used to remove features that can be deduced or predicted from other features.

3.1 Monte Carlo Feature Selection

The computational cost of data analysis grows with the size of the dataset. Monte Carlo Feature Selection (MCFS) (Dramiński et al., 2008) is applied to make classification faster. In essence, it removes features that are not beneficial to the classification and ranks the remaining features. A sharp reduction of the number of features is often necessary to get any sensible classification out of large-scale data such as genomics data, where datasets commonly have 50,000 features or more. The ranking of the remaining features also indicates which features are critical and which only add a specific detail to the classification. This makes MCFS a powerful tool when considering the inferences and processes that underpin the data, such as biological pathways or climate patterns.

MCFS works by randomly selecting subsets of features from the dataset (s subsets are generated, as illustrated in Figure 1); each feature can appear in multiple subsets. Each subset is then split several times (t splits) into training and test parts. Each training part is used to construct a decision tree, which then classifies the corresponding test part. The classification performance of these trees is measured, and the features used are weighted by the performances. After all subsets have been measured and the scores normalized, the final scores determine which features are deemed significant for classification and which can be eliminated or discarded from the dataset without losing performance.
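A compact sketch of this loop using scikit-learn decision trees is shown below. The subset count s, split count t, subset size, and 0.66 train fraction are illustrative placeholders, and the real MCFS weighting scheme is more elaborate than plain accuracy:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def mcfs_scores(X, y, s=200, t=5, subset_size=5, seed=0):
    """Score features by how well random subsets of them classify y."""
    rng = np.random.default_rng(seed)
    scores = np.zeros(X.shape[1])
    for _ in range(s):                        # s random feature subsets
        subset = rng.choice(X.shape[1], size=subset_size, replace=False)
        for split in range(t):                # t train/test splits per subset
            X_tr, X_te, y_tr, y_te = train_test_split(
                X[:, subset], y, train_size=0.66, random_state=split)
            tree = DecisionTreeClassifier().fit(X_tr, y_tr)
            # Credit every feature in the subset with the tree's accuracy.
            scores[subset] += tree.score(X_te, y_te)
    return scores / scores.max()              # normalized relative importance

# Usage: rank the features of a binary dataset and keep the top scorers.
# X = np.array(...); y = np.array(...)
# print(np.argsort(mcfs_scores(X, y))[::-1])
```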

In the present study, the results from Apriori and the Chi-square test are compared with the results from MCFS to ascertain their validity.

Figure 1: Block diagram of the main step of the MCFS procedure

3.2 Lovelace Composite Generator

The Lovelace Composite Generator (LLCG) is used to create composite features by feature synthesis. A composite feature represents a relationship between two features. The main advantage of merging features in this way is that the relationship pattern is no longer hidden across different features. The generator computes the Lovelace composites for the dataset using the specified parameters and then generates a new dataset with composite features wherever such relationships are found.

Figure 2 summarises an example of the process. Feature 1 and Feature 2 produce a composite feature; in subsequent steps, this composite may merge further with other features. The example ends with all features combined into one. In common scenarios, however, there are usually only one or two composite features, while the remaining features stay unchanged.

Figure 2: Diagram of the Lovelace composite generator
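The LLCG algorithm itself is not listed in this thesis. As a rough illustration of what a composite feature does, the sketch below merges two binary features by pairing their values; this pairing scheme is an assumption for illustration, not the actual Lovelace computation:

```python
def composite_feature(f1, f2):
    """Merge two feature columns into one composite column.

    Each composite value encodes the *pair* of constituent values, so a
    pattern that exists only jointly in f1 and f2 becomes visible as a
    single feature value. Illustrative stand-in only, not the real LLCG.
    """
    return [f"{a}|{b}" for a, b in zip(f1, f2)]

feature_1 = [0, 0, 1, 1, 0]
feature_2 = [1, 0, 1, 0, 1]
print(composite_feature(feature_1, feature_2))
# ['0|1', '0|0', '1|1', '1|0', '0|1']
```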

3.3 Apriori

3.3.1 Introduction

The Apriori algorithm was proposed by Agrawal and Srikant in 1994 as a method for finding relations between features in a dataset. These relations are similar to correlation, but work in logistic value domains: there is no need for an increasing or decreasing trend. The only thing that matters is that a specific value from one feature can be consistently mapped to some value of the other feature. This makes the relationship asymmetric, since five values from one feature can map to a single value of the other feature, whereas the reverse mapping would provide no consistency.

For instance, the markers for depression originate from various environments; they can be genetic, environmental, medical, or physiological. By applying the Apriori algorithm to a dataset that contains all of these markers, we can test relations between the markers. This also allows us to identify similarities between different markers, such as lack of exposure to sunlight and stress.

As much as this does not provide any causative evidence (lack of sunlight exposure can be the result of stressful days spent in an office; anti-social behaviour can cause the stress of failing relationships), it certainly shows which feature values can be used to infer other feature values. This deduction also permits us to eliminate inferable features from the dataset.

In Apriori, confidence and support are the most significant parameters. Support is the probability of two variables occurring at the same time, that is, $P(A \cap B)$. Confidence is the probability that one variable occurs given that another occurs, such as $P(B \mid A)$, the probability of variable B occurring in the event that variable A occurs.

3.3.2 Calculating Confidence

Table 6: An example dataset with 4 customers and 4 items

Customer   milk   bread   butter   beer
c1         …      …       …        …
c2         …      …       …        …
c3         …      …       …        …
c4         …      …       …        …

As illustrated in Table 6, the first row gives the item names, forming the set of binary attributes $I = \{i_1, i_2, \dots, i_n\}$; in this case, $I = \{\text{milk}, \text{bread}, \text{butter}, \text{beer}\}$. The first column lists the customers, $C = \{c_1, c_2, \dots, c_n\}$, where each customer row records a purchase history. A relation rule is expressed as:

$$X \Rightarrow Y, \text{ where } X, Y \subseteq I$$

The support of $X$ with respect to the set of transactions $T$ is defined as the proportion of transactions $c$ in the dataset that contain the item set $X$. Formally,

$$\text{Support}(X) = \frac{|\{c \in T : X \subseteq c\}|}{|C|}$$

Support indicates how frequently the items appear in the dataset. The confidence of a rule, $X \Rightarrow Y$, with respect to a set of transactions $T$, is the proportion of the transactions containing $X$ that also contain $Y$. Formally,

$$\text{Confidence}(X \Rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X)}$$

Confidence indicates how often the rule is found to be true.
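A small sketch of both measures over a list of transactions; the purchase histories below are hypothetical:

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, x, y):
    """Confidence(X => Y) = Support(X u Y) / Support(X)."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)

# Hypothetical purchase histories for four customers.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "beer"},
    {"milk", "butter"},
]
print(support(transactions, {"milk", "bread"}))       # 0.5
print(confidence(transactions, {"milk"}, {"bread"}))  # 0.666...
```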

3.4 Chi-Square Test

3.4.1 Introduction

The Chi-square test (Greenwood, 1996) is a statistical method for testing a stated hypothesis by comparing the observed data with the expected data. For instance, to estimate the likelihood of disease, the rate of tuberculosis in a population might be broken down by the proportion of smoking to non-smoking people; a Chi-square test can then establish whether smoking is related to the development of the disease.

3.4.2 Calculating Chi-square Value

The Chi-square statistic ($\chi^2$) is expressed as:

$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}$$

The value of the Chi-square statistic is compared against a static table to determine whether the observed behaviour is significantly different from the expected behaviour. The value to compare against is determined by the value domains of the observed and expected variables, through the degrees of freedom, formally

$$DF = (r - 1)(c - 1), \qquad r = |V_d(\text{observed})|, \quad c = |V_d(\text{expected})|$$

where $r$ and $c$ are the cardinalities of the value domains of the two variables. The algorithm's pseudo-code and a worked instance are detailed in the implementation section.
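A minimal sketch of the statistic and its degrees of freedom for an r x c contingency table, in pure Python:

```python
def chi_square(observed):
    """Chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count under independence: row total * column total / grand total.
            exp = row_totals[i] * col_totals[j] / grand
            chi2 += (obs - exp) ** 2 / exp
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df

# The 2x2 example worked through in Section 4.1.1.
print(chi_square([[5, 1], [1, 3]]))  # (3.40..., 1)
```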

4 Implementations

4.1 Chi-square Test Implementation

4.1.1 Calculating Chi-square Value

Table 7 gives example data for calculating the Chi-square value.

Table 7: An example data for the Chi-square test

C00_0   S00_0
0       0
0       0
0       0
0       0
0       0
0       1
1       0
1       1
1       1
1       1

I. Establishing the test hypotheses

$H_0$: $n_1 = n_2$ (C00_0 is independent of S00_0)
$H_a$: $n_1 \neq n_2$ (C00_0 is not independent of S00_0)

II. Generating the observed table

             S00_0 = 0   S00_0 = 1   Total
C00_0 = 0    5           1           6
C00_0 = 1    1           3           4
Total        6           4           10

III. Generating the expected table

The expected value for each cell is the total for its row multiplied by the total for its column, divided by the total for the table, that is:

Expected = (RowTotal x ColTotal) / GridTotal

Therefore, in the table above, the expected counts are:

cell (1,1) = 6 x 6 / 10 = 3.6
cell (1,2) = 6 x 4 / 10 = 2.4

cell (2,1) = 4 x 6 / 10 = 2.4
cell (2,2) = 4 x 4 / 10 = 1.6

             S00_0 = 0   S00_0 = 1   Total
C00_0 = 0    3.6         2.4         6
C00_0 = 1    2.4         1.6         4
Total        6           4           10

IV. Calculating the Chi-square value and finding the p-value

The alpha level of significance used here is 0.05, and the degrees of freedom is 1 (as shown in Section 3.4.2). The critical Chi-square value, obtained from the distribution for df = 1 and p = 0.05, is 3.84. Practically, this means the probability of obtaining a result at least as extreme as 3.84 is 5%.

$$\chi^2_c = \frac{(5 - 3.6)^2}{3.6} + \frac{(1 - 2.4)^2}{2.4} + \frac{(1 - 2.4)^2}{2.4} + \frac{(3 - 1.6)^2}{1.6} = 3.40$$

$$df = (2 - 1)(2 - 1) = 1$$

The example above has a Chi-square value of 3.40, which is below the critical value. Hence, the two features do not have significantly different distributions, and one can be estimated from the other.
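The same numbers can be checked with scipy, disabling Yates' continuity correction so the result matches the hand calculation:

```python
from scipy.stats import chi2_contingency

observed = [[5, 1], [1, 3]]
chi2, p, df, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 2), df, round(p, 3))  # 3.4  1  0.065
print(expected)                         # [[3.6 2.4] [2.4 1.6]]
```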

4.1.2 Feature Deletion

Reducing the number of features shortens the time required to compute all the potential composite features of a dataset. Where two features are established to have no significant differences, one can be represented by the other, meaning one of the features can be discarded without losing significant information.

4.2 Apriori Implementation

4.2.1 Setting the Input Parameters

At the beginning, the user sets the minimum support value and the minimum confidence value. By default, the confidence threshold is set at 0.75, while the minimum support is half the number of objects.

4.2.2 Establishing the Apriori Table

As illustrated in Table 7, the two features C00_0 and S00_0 appear to be related. The feature C00_0 is therefore subdivided into two indicator features, C00_0 = 0 and C00_0 = 1; likewise, S00_0 is subdivided into the indicator features S00_0 = 0 and S00_0 = 1. The restructured data are:

             C00_0 = 0   C00_0 = 1
S00_0 = 0    5           1
S00_0 = 1    1           3

In this form, it is simple to enumerate the corresponding relations. For C00_0 ⇒ S00_0, where the appearance of C00_0 denotes some correlation with S00_0:

1. C00_0 = 0 ⇒ S00_0 = 0
2. C00_0 = 0 ⇒ S00_0 = 1
3. C00_0 = 1 ⇒ S00_0 = 0
4. C00_0 = 1 ⇒ S00_0 = 1

For S00_0 ⇒ C00_0, where the appearance of S00_0 denotes some correlation with C00_0:

1. S00_0 = 0 ⇒ C00_0 = 0
2. S00_0 = 0 ⇒ C00_0 = 1
3. S00_0 = 1 ⇒ C00_0 = 0
4. S00_0 = 1 ⇒ C00_0 = 1

4.2.3 Calculating Confidence Value

For each relation between C00_0 and S00_0, the confidence is calculated. For example:

Support(C00_0 = 0 ∧ S00_0 = 0) = 5

Confidence(C00_0 = 0 ⇒ S00_0 = 0) = 5/6 ≈ 0.83

The rest are calculated in the same way:

Relation                  Support   Confidence
C00_0 = 0 ⇒ S00_0 = 0     5         0.83
C00_0 = 0 ⇒ S00_0 = 1     1         0.17
C00_0 = 1 ⇒ S00_0 = 0     1         0.25
C00_0 = 1 ⇒ S00_0 = 1     3         0.75
S00_0 = 0 ⇒ C00_0 = 0     5         0.83
S00_0 = 0 ⇒ C00_0 = 1     1         0.17
S00_0 = 1 ⇒ C00_0 = 0     1         0.25
S00_0 = 1 ⇒ C00_0 = 1     3         0.75

4.2.4 Implementing Feature Deletion

On the basis of the calculations above, the maximum confidence is 0.83, which exceeds the confidence threshold of 0.75 set at the beginning, and the corresponding support also exceeds the minimum support value. This shows that C00_0 and S00_0 are correlated, and C00_0 can therefore represent S00_0. Nevertheless, S00_0 cannot represent C00_0, since the support is insufficient when the features are compared in the opposite direction.
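Putting the pieces of Section 4.2 together, the following compact sketch enumerates the value-level rules for the two columns of Table 7 (as reconstructed above) and applies the thresholds of Section 4.2.1:

```python
from itertools import product

def value_rules(col_a, col_b, name_a, name_b):
    """All rules name_a=u => name_b=v with their support counts and confidence."""
    rules = []
    for u, v in product(sorted(set(col_a)), sorted(set(col_b))):
        support = sum(a == u and b == v for a, b in zip(col_a, col_b))
        antecedent = sum(a == u for a in col_a)
        conf = support / antecedent if antecedent else 0.0
        rules.append((f"{name_a}={u} => {name_b}={v}", support, conf))
    return rules

c00_0 = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
s00_0 = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1]

min_support = len(c00_0) // 2   # half the number of objects
min_conf = 0.75                 # default confidence threshold

for rule, sup, conf in value_rules(c00_0, s00_0, "C00_0", "S00_0"):
    passes = sup >= min_support and conf >= min_conf
    print(f"{rule}: support={sup}, confidence={conf:.2f}, passes={passes}")
# Only C00_0=0 => S00_0=0 passes (support 5, confidence 0.83), so S00_0
# becomes a deletion candidate with C00_0 kept as its representative.
```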

5 Results

For the validation tests, we utilised four distinct datasets. The first was generated by the synthetic dataset generator. The remaining three were test sets for VisuNet, a tool used to display feature-value relationships in networks. All of the datasets contained both single and pair-wise correlated features.

Table 8: Objects = 1000, single max correlation = 0.15, pair-wise max correlation = 0.25

                                LLCG   Apriori   Apriori + LLCG   Chi-Square   Chi-Square + LLCG   Significant Features (MCFS)
Objects                         1000   1000      1000             1000         1000                1000
Original Features               …      …         …                …            …                   …
Composite Features              3      …         …                …            2                   …
Total Features after running    …      …         …                …            …                   …
Features removed                …      0         …                2            …                   …

Table 8 shows a dataset with a single max correlation of 0.15, a pair-wise max correlation of 0.25, and 1000 objects. In this example, MCFS was unable to find any significant features in the original dataset. Running LLCG generated only 3 composite features; running MCFS afterwards discovered 2 significant features. The Apriori algorithm was also unable to identify any relationships between the features in the dataset. The Chi-square test, however, found 2 significantly correlated features, supporting the removal of two features. Combining the Chi-square test with LLCG generated 2 composite features from the remaining dataset.

Table 9: Objects = 1000, single max correlation = 0.15, pair-wise max correlation = 0.55

                                LLCG   Apriori   Apriori + LLCG   Chi-Square   Chi-Square + LLCG   Significant Features (MCFS)
Objects                         1000   1000      1000             1000         1000                1000
Original Features               …      …         …                …            …                   …
Composite Features              …      …         …                …            1                   …
Total Features after running    …      …         …                …            …                   …
Features removed                …      0         …                …            …                   …

Table 9 illustrates a dataset with a single max correlation of 0.15, a pair-wise max correlation of 0.55, and 1000 objects. MCFS identified 3 significant features in the original dataset. The Apriori algorithm was again unable to identify any related features, while the Chi-square test discovered 4 correlated features. Compared against the MCFS result, no significant features were deleted by the Chi-square test. Combining the Chi-square test with LLCG generated only 1 composite feature.

Table 10: Objects = 1000, single max correlation = 0.15, pair-wise max correlation = 0.85

                                LLCG   Apriori   Apriori + LLCG   Chi-Square   Chi-Square + LLCG   Significant Features (MCFS)
Objects                         1000   1000      1000             1000         1000                1000
Original Features               …      …         …                …            …                   …
Composite Features              …      …         …                …            4                   …
Total Features after running    …      …         …                …            …                   …
Features removed                …      0         …                …            …                   …

Table 10 shows a dataset with a single max correlation of 0.15, a pair-wise max correlation of 0.85, and 1000 objects. On this dataset, MCFS discovered 5 significant features. The Apriori algorithm identified no pair-wise related features, yet the Chi-square test found 5 correlated features; out of the 5 significant features identified by MCFS, the Chi-square test removed 2. When the Chi-square test was combined with LLCG, 4 composite features were identified.

Table 11: Dataset from the synthetic dataset generator (objects = 1000, start correlation = 0.95, end correlation = 0.05, decrement = 0.05)

                                LLCG   Apriori   Apriori + LLCG   Chi-Square   Chi-Square + LLCG   Significant Features (MCFS)
Objects                         1000   1000      1000             1000         1000                1000
Original Features               …      …         …                …            …                   …
Composite Features              152    …         44               …            …                   …
Total Features after running    …      …         …                …            …                   …
Features removed                …      8         …                …            …                   …

Table 11 summarises a dataset from the synthetic dataset generator with 1000 objects, a starting correlation of 0.95, an ending correlation of 0.05, and a decrement of 0.05. Here, MCFS found 3 significant features. Running LLCG alone generated 152 composite features. The Apriori algorithm discovered 8 related features; after these redundant features were deleted, 44 composite features were generated. Compared with the MCFS result, none of the significant features had been deleted, meaning that the redundant features were removed successfully. The Chi-square test, however, deleted all the features.

6 Conclusions

In this study, Apriori and the Chi-square test were used to identify potentially similar features. The results in Tables 8 to 10 show that the Apriori algorithm cannot reliably identify insignificant features. The Chi-square test performed well at detecting and deleting insignificant features on these datasets; however, when compared with the MCFS results, some of the removed features were in fact significant, showing that the Chi-square test is not a safe way of reducing features. The results in Table 11 show that 8 redundant or non-informative features were discovered by the Apriori algorithm.

In addition, MCFS validated these 8 features as insignificant, so they could be deleted safely. This confirms that the Apriori algorithm can find representatives for other features in a dataset.

The Chi-square test was used to test the independence of features. In this program, the Chi-square test served as a goodness-of-fit test, helping to decide whether there is any difference between the observed (experimental) values and the expected (theoretical) values. In the experiments, the Chi-square test provided correlation information about the dataset.

As indicated in the results of the implementation section, the Apriori algorithm can help reduce the number of features in a dataset, since the features it removes do not conflict with the results from MCFS. When the dataset is large and contains many features, the Apriori algorithm can remove redundant features. Nevertheless, its limitation is that when the number of features is large, parsing takes considerable time. The disadvantage of this program is therefore that running the Apriori and Chi-square tests takes a long time: as the number of iterations increases, the processing time grows, and a dataset with 100 features and 1000 objects might take up to 2 hours to compute. In the future, better program optimisation or improved algorithms would be needed to achieve faster computation.

7 Acknowledgements

I am grateful to my supervisor, MSc Nicholas Baltzer, for giving me this precious opportunity and for taking the time to guide me through this project. I would also like to extend my gratitude to Prof. Jan Komorowski for providing me with a good working environment in his laboratory. I am also indebted to all my friends for their support and encouragement. Finally, I am deeply appreciative of my family, particularly my parents, for providing the financial and moral support I needed to reach this stage of my studies. Without their encouragement and support, this work would not have been possible. Thank you.

8 References

Cornfield, J. (1951). A method for estimating comparative rates from clinical data. Applications to cancer of the lung, breast, and cervix. Journal of the National Cancer Institute.

Greenwood, P. E., & Nikulin, M. S. (1996). A Guide to Chi-Squared Testing. Wiley, New York.

Dramiński, M., Rada-Iglesias, A., Enroth, S., Wadelius, C., Koronacki, J., & Komorowski, J. (2008). Monte Carlo feature selection for supervised classification. Bioinformatics, 24(1), 110–117. doi:10.1093/bioinformatics/btm486

Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine.

Rudnicki, W., et al. (2004). Feature synthesis and extraction for the construction of generalized properties of amino acids. In Rough Sets and Current Trends in Computing. Springer.


More information

Climate Precipitation Prediction by Neural Network

Climate Precipitation Prediction by Neural Network Journal of Mathematics and System Science 5 (205) 207-23 doi: 0.7265/259-529/205.05.005 D DAVID PUBLISHING Juliana Aparecida Anochi, Haroldo Fraga de Campos Velho 2. Applied Computing Graduate Program,

More information

8 th Grade Pre Algebra Pacing Guide 1 st Nine Weeks

8 th Grade Pre Algebra Pacing Guide 1 st Nine Weeks 8 th Grade Pre Algebra Pacing Guide 1 st Nine Weeks MS Objective CCSS Standard I Can Statements Included in MS Framework + Included in Phase 1 infusion Included in Phase 2 infusion 1a. Define, classify,

More information

Gene Clustering & Classification

Gene Clustering & Classification BINF, Introduction to Computational Biology Gene Clustering & Classification Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Introduction to Gene Clustering

More information

Multivariate Conditional Distribution Estimation and Analysis

Multivariate Conditional Distribution Estimation and Analysis IT 14 62 Examensarbete 45 hp Oktober 14 Multivariate Conditional Distribution Estimation and Analysis Sander Medri Institutionen för informationsteknologi Department of Information Technology Abstract

More information

Graph Structure Over Time

Graph Structure Over Time Graph Structure Over Time Observing how time alters the structure of the IEEE data set Priti Kumar Computer Science Rensselaer Polytechnic Institute Troy, NY Kumarp3@rpi.edu Abstract This paper examines

More information

DISCOVERING ACTIVE AND PROFITABLE PATTERNS WITH RFM (RECENCY, FREQUENCY AND MONETARY) SEQUENTIAL PATTERN MINING A CONSTRAINT BASED APPROACH

DISCOVERING ACTIVE AND PROFITABLE PATTERNS WITH RFM (RECENCY, FREQUENCY AND MONETARY) SEQUENTIAL PATTERN MINING A CONSTRAINT BASED APPROACH International Journal of Information Technology and Knowledge Management January-June 2011, Volume 4, No. 1, pp. 27-32 DISCOVERING ACTIVE AND PROFITABLE PATTERNS WITH RFM (RECENCY, FREQUENCY AND MONETARY)

More information

Cross-validation and the Bootstrap

Cross-validation and the Bootstrap Cross-validation and the Bootstrap In the section we discuss two resampling methods: cross-validation and the bootstrap. 1/44 Cross-validation and the Bootstrap In the section we discuss two resampling

More information

Snapshot Algorithm Animation with Erlang

Snapshot Algorithm Animation with Erlang IT 13 077 Examensarbete 15 hp November 2013 Snapshot Algorithm Animation with Erlang Fredrik Bryntesson Institutionen för informationsteknologi Department of Information Technology Abstract Snapshot Algorithm

More information

CHAPTER 5 GENERATING TEST SCENARIOS AND TEST CASES FROM AN EVENT-FLOW MODEL

CHAPTER 5 GENERATING TEST SCENARIOS AND TEST CASES FROM AN EVENT-FLOW MODEL CHAPTER 5 GENERATING TEST SCENARIOS AND TEST CASES FROM AN EVENT-FLOW MODEL 5.1 INTRODUCTION The survey presented in Chapter 1 has shown that Model based testing approach for automatic generation of test

More information

ViTraM: VIsualization of TRAnscriptional Modules

ViTraM: VIsualization of TRAnscriptional Modules ViTraM: VIsualization of TRAnscriptional Modules Version 2.0 October 1st, 2009 KULeuven, Belgium 1 Contents 1 INTRODUCTION AND INSTALLATION... 4 1.1 Introduction...4 1.2 Software structure...5 1.3 Requirements...5

More information

Role of Association Rule Mining in DNA Microarray Data - A Research

Role of Association Rule Mining in DNA Microarray Data - A Research Role of Association Rule Mining in DNA Microarray Data - A Research T. Arundhathi Asst. Professor Department of CSIT MANUU, Hyderabad Research Scholar Osmania University, Hyderabad Prof. T. Adilakshmi

More information

Combinatorial Search; Monte Carlo Methods

Combinatorial Search; Monte Carlo Methods Combinatorial Search; Monte Carlo Methods Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico May 02, 2016 CPD (DEI / IST) Parallel and Distributed

More information

Frequency Distributions

Frequency Distributions Displaying Data Frequency Distributions After collecting data, the first task for a researcher is to organize and summarize the data so that it is possible to get a general overview of the results. Remember,

More information

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani

Feature Selection. CE-725: Statistical Pattern Recognition Sharif University of Technology Spring Soleymani Feature Selection CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Dimensionality reduction Feature selection vs. feature extraction Filter univariate

More information

Data Mining. 2.4 Data Integration. Fall Instructor: Dr. Masoud Yaghini. Data Integration

Data Mining. 2.4 Data Integration. Fall Instructor: Dr. Masoud Yaghini. Data Integration Data Mining 2.4 Fall 2008 Instructor: Dr. Masoud Yaghini Data integration: Combines data from multiple databases into a coherent store Denormalization tables (often done to improve performance by avoiding

More information

Module 1 Lecture Notes 2. Optimization Problem and Model Formulation

Module 1 Lecture Notes 2. Optimization Problem and Model Formulation Optimization Methods: Introduction and Basic concepts 1 Module 1 Lecture Notes 2 Optimization Problem and Model Formulation Introduction In the previous lecture we studied the evolution of optimization

More information

Artificial Intelligence. Programming Styles

Artificial Intelligence. Programming Styles Artificial Intelligence Intro to Machine Learning Programming Styles Standard CS: Explicitly program computer to do something Early AI: Derive a problem description (state) and use general algorithms to

More information

ECLT 5810 Data Preprocessing. Prof. Wai Lam

ECLT 5810 Data Preprocessing. Prof. Wai Lam ECLT 5810 Data Preprocessing Prof. Wai Lam Why Data Preprocessing? Data in the real world is imperfect incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate

More information

Data mining with Support Vector Machine

Data mining with Support Vector Machine Data mining with Support Vector Machine Ms. Arti Patle IES, IPS Academy Indore (M.P.) artipatle@gmail.com Mr. Deepak Singh Chouhan IES, IPS Academy Indore (M.P.) deepak.schouhan@yahoo.com Abstract: Machine

More information

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET) INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET) International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367(Print), ISSN 0976 6367(Print) ISSN 0976 6375(Online)

More information

HARNESSING CERTAINTY TO SPEED TASK-ALLOCATION ALGORITHMS FOR MULTI-ROBOT SYSTEMS

HARNESSING CERTAINTY TO SPEED TASK-ALLOCATION ALGORITHMS FOR MULTI-ROBOT SYSTEMS HARNESSING CERTAINTY TO SPEED TASK-ALLOCATION ALGORITHMS FOR MULTI-ROBOT SYSTEMS An Undergraduate Research Scholars Thesis by DENISE IRVIN Submitted to the Undergraduate Research Scholars program at Texas

More information

Quick Start Guide Jacob Stolk PhD Simone Stolk MPH November 2018

Quick Start Guide Jacob Stolk PhD Simone Stolk MPH November 2018 Quick Start Guide Jacob Stolk PhD Simone Stolk MPH November 2018 Contents Introduction... 1 Start DIONE... 2 Load Data... 3 Missing Values... 5 Explore Data... 6 One Variable... 6 Two Variables... 7 All

More information

Sentiment analysis under temporal shift

Sentiment analysis under temporal shift Sentiment analysis under temporal shift Jan Lukes and Anders Søgaard Dpt. of Computer Science University of Copenhagen Copenhagen, Denmark smx262@alumni.ku.dk Abstract Sentiment analysis models often rely

More information

CHAPTER 5 WEIGHTED SUPPORT ASSOCIATION RULE MINING USING CLOSED ITEMSET LATTICES IN PARALLEL

CHAPTER 5 WEIGHTED SUPPORT ASSOCIATION RULE MINING USING CLOSED ITEMSET LATTICES IN PARALLEL 68 CHAPTER 5 WEIGHTED SUPPORT ASSOCIATION RULE MINING USING CLOSED ITEMSET LATTICES IN PARALLEL 5.1 INTRODUCTION During recent years, one of the vibrant research topics is Association rule discovery. This

More information

2. On classification and related tasks

2. On classification and related tasks 2. On classification and related tasks In this part of the course we take a concise bird s-eye view of different central tasks and concepts involved in machine learning and classification particularly.

More information