User Guide Written By Yasser EL-Manzalawy


Copyright Gennotate development team

Introduction

As large amounts of genome sequence data become available, the development of reliable and efficient genome annotation tools that assign biological interpretation to DNA sequences becomes increasingly desirable. Although several computational genome annotation tools have been proposed, accurate and scalable genome annotation remains a major challenge. A variety of knowledge-based, statistical, and machine learning methods have been developed for many genome annotation tasks. They differ in the training data sets used to train the predictive models, the data representations (e.g., sequence features) used to encode the inputs and outputs (class labels) of the predictive models, the algorithms used to build the predictors, and the validation data sets and performance metrics used to assess the effectiveness of the predictors. Often, the data sets, implementations of algorithms, and data representations used are simply not available to the research community in a form that allows rigorous comparison of alternative approaches. Yet such comparisons are essential for determining the strengths and limitations of existing approaches so that further research can focus on improving these methods. For example, some methods are accessible via the Internet as online Web servers. Comparing the underlying computational methods implemented by such servers is not straightforward in the absence of access to implementations of the algorithms and the precise data sets and data representations used. This is further complicated by the fact that some servers update their predictors periodically using newly available data, newer computational methods, or new data representations, making it difficult to determine whether reported or measured changes in predictive accuracy stem from improvements in the methods, the data representations, or better data sets.

What is Gennotate?
Gennotate is a platform for sharing data representations, predictors, and machine learning algorithms for a broad range of gene structure prediction tasks. Gennotate has two main components (see Figure 1): 1) Model builder, an application for building and evaluating predictors and serializing these models in a binary format (model files). 2) Predictor, an application for applying a model to test data (e.g., sequences to be annotated). The model builder application is an extension of WEKA [1], a widely used machine learning workbench supporting many standard machine learning algorithms. WEKA provides tools for data pre-processing, classification, regression, clustering, validation, and visualization. Furthermore, WEKA provides a framework for implementing new machine learning methods and data pre-processors. The model builder extends WEKA by adding a suite of data pre-processors (called filters in WEKA) for converting molecular sequences into vectors of numerical features so that WEKA-supported methods can be applied to the data. The current implementation supports filters for generating several of the widely used data representations of molecular sequences. Once the sequences are converted into numeric or nominal features, any suitable WEKA learner can be trained and evaluated on that data set.

Figure 1: Gennotate model builder (left) and predictor (right).

Model builder

The model builder extends WEKA with a variety of DNA sequence preprocessors (WEKA filters) and a number of classification algorithms (e.g., classifiers based on Markov models). With very few exceptions, machine learning algorithms supported in WEKA cannot be applied directly to DNA sequence data; a preprocessing step that extracts features from the sequences is often required. The Gennotate model builder provides more than 30 implemented sequence- and structure-based DNA feature extraction methods. Additionally, a filter called ConcatenateFilter generates new features by combining any set of Gennotate features. Table 1 summarizes the currently implemented Gennotate filters. For detailed information about these filters, please check the Gennotate API documentation on the project web site. Once the features have been extracted from DNA sequences, many WEKA-supported machine learning algorithms can be applied (including state-of-the-art algorithms for classification, regression, clustering, and feature selection). In addition to the WEKA-supported implementations, Gennotate can run any third-party extension of WEKA; the procedure is as simple as adding the extra jar files to your CLASSPATH when running Gennotate.
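To make the feature-extraction idea concrete, the following is a minimal, self-contained sketch of the kind of computation a k-mer filter performs: counting the frequency of each length-k substring of a DNA sequence. This is an illustration only, not the Gennotate implementation; the class and method names are our own.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration of what a k-mer feature extractor computes:
// the frequency of each length-k substring of a DNA sequence.
public class KmerSketch {
    public static Map<String, Integer> kmerCounts(String seq, int k) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + k <= seq.length(); i++) {
            String kmer = seq.substring(i, i + k);
            counts.merge(kmer, 1, Integer::sum);
        }
        return counts;
    }
}
```

For k = 3, a sequence of length n yields n - 2 overlapping 3-mers; a filter like KMerFilter would expose such counts (over the full 4^k alphabet) as numeric attributes that any WEKA learner can consume.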

The current implementation of Gennotate enriches WEKA with a number of classification algorithms, summarized in Table 2.

Table 1: List of Gennotate supported filters

Filter                  Description
DDNAFilter              A filter for extracting dinucleotide structure features from DNA sequences.
DNA2Filter              A filter for converting a DNA sequence into a new sequence over an alphabet of all dinucleotide symbols.
DNASeqToNominalFilter   A filter to convert a string attribute of a DNA sequence into nominal attributes.
DNCFilter               A filter to convert a string attribute into 400 features representing compositions of dinucleotides.
KMerFilter              A filter to convert a string attribute into numeric features representing the frequencies of its k-mer substrings.
MonoHBondFilter         A filter for extracting hydrogen-bond-based DNA structure features.
NCFilter                A filter to convert a string attribute into numeric features representing compositions of nucleotides.
SubSequenceFilter       A filter for extracting a substring from the DNA sequence.
TRIDNAFilter            A filter for extracting tri-nucleotide structure features from DNA sequences.
ConcatenateFilter       A filter for concatenating multiple filters.

Table 2: List of Gennotate classification algorithms

Classifier          Description
HMMClassifier       A classifier implementing a Hidden Markov Model over sequence data.
IMMClassifier       A classifier implementing an Interpolated Markov Model over sequence data.
MMClassifier        A classifier implementing a Markov Model over sequence data.
BalancedClassifier  A meta classifier for training a base classifier on an unbalanced data set.
ModelBased          A meta classifier for performing classification/regression using a specified model file.

Predictor

The Predictor is a graphical user interface (GUI) for applying a saved prediction model to a test data set.
Specifically, the user supplies the model file, the test data file, the output file name, the format of the test data (DNA fragments, one fragment per line, or FASTA sequences), the type of problem (peptide-based or nucleotide-based), and the length of the peptide/window sequence. The output of the Predictor is a summary of the input model (model name, model parameters, and the name of the data set used to build the model) followed by the predictions. The predictions are written as four tab-separated columns (see Figure 2). The first column is the sequence identifier. The second and third columns are the position and the sequence of the predicted peptide/nucleotide. The last column is the prediction score.
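For downstream processing, the four-column output described above can be parsed with a few lines of code. This is a hypothetical post-processing sketch; the class and field names are ours, not part of Gennotate.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for the Predictor's tab-separated output:
// identifier, position, predicted sequence, score.
public class PredictionParser {
    public record Prediction(String id, int position, String sequence, double score) {}

    public static List<Prediction> parse(List<String> lines) {
        List<Prediction> out = new ArrayList<>();
        for (String line : lines) {
            String[] f = line.split("\t");
            out.add(new Prediction(f[0], Integer.parseInt(f[1]),
                                   f[2], Double.parseDouble(f[3])));
        }
        return out;
    }
}
```

A parser like this makes it easy to, for example, sort predictions by score or filter them by a threshold before further analysis.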

Figure 2: Example Predictor output.

Installing and running Gennotate

Gennotate is platform-independent since it is implemented in Java. To install Gennotate, download it from the project web site and unzip the compressed file. To run Gennotate, add all the jar files included in the lib folder to the CLASSPATH and run the gennotate.jar file. For example, the following command sets the CLASSPATH and runs Gennotate on Windows machines:

java -Xmx1024m -classpath "gennotate.jar;weka.jar" gennotate.gui.maingui

For Linux machines, replace ; with :.

Using Gennotate

In this section, we show several examples of how to use Gennotate to develop predictors from DNA sequence data. For this purpose, we use two in-house data sets for predicting sigma 70 promoters in E. coli: 1) Sigma70.arff is a non-redundant data set extracted from RegulonDB on June 24, containing 579 promoter sequences published before April. None of the 579 promoter sequences shares more than 45% similarity with any other sequence in the promoter data. There are also 579 non-promoter sequences, none of which shares more than 45% similarity with any promoter or non-promoter sequence. 2) Sigma70_test is a non-redundant data set extracted from RegulonDB on June 24; all its promoter sequences were published after April. The data set has 792 promoter and 792 non-promoter sequences, none of which shares more than 45% similarity with any other sequence. The test data is provided in two formats: 1) standard WEKA format (file Sigma70_test.arff); 2) one fragment per line format (file Sigma70_test.txt).

Building your first predictor

Here, we show how to build your first predictor using the Sigma70.arff data and HMMClassifier and store it for future use on test data.

1. Run Gennotate.
2. Go to the Application menu and select the model builder application.
3. In the model builder window (the WEKA explorer augmented with Gennotate filters and prediction methods), click Open and select the file /Example/Data/Sigma70.arff.
4. Click the Classify tab.
5. In the classifier panel, click Choose and browse for HMMClassifier.
6. The HMMClassifier has two parameters: the input data alphabet (default ACGTN) and whether the input sequences have gaps (default false). Keep the default parameters and click OK.
7. Having specified both the data set and the classification algorithm, we are ready to build the model and evaluate it using 10-fold cross-validation. Click the Start button and wait for the 10-fold cross-validation procedure to finish. The classifier output shows several statistical estimates for HMMClassifier under 10-fold cross-validation; for example, the accuracy and AUC of the model are 72.8% and 0.81, respectively.
8. To save the model, right-click on the model in the Result list panel and select Save model. Save your model as /Examples/Models/Sigma70HMM.model.
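The intuition behind the Markov-model classifiers can be sketched in a few lines: train one Markov chain per class from transition counts, then score a test sequence by its log-likelihood under each chain. The following is a simplified, hypothetical first-order illustration of this idea, not the Gennotate implementation (which also supports hidden and interpolated Markov models):

```java
import java.util.HashMap;
import java.util.Map;

// Simplified first-order Markov chain over the DNA alphabet,
// illustrating the idea behind Markov-model sequence classifiers.
public class MarkovSketch {
    private final Map<String, Integer> pairCounts = new HashMap<>();
    private final Map<Character, Integer> charCounts = new HashMap<>();

    public void train(String seq) {
        for (int i = 0; i + 1 < seq.length(); i++) {
            pairCounts.merge(seq.substring(i, i + 2), 1, Integer::sum);
            charCounts.merge(seq.charAt(i), 1, Integer::sum);
        }
    }

    // Log-likelihood of seq under the trained chain (Laplace-smoothed).
    public double logLikelihood(String seq) {
        double ll = 0.0;
        for (int i = 0; i + 1 < seq.length(); i++) {
            int pair = pairCounts.getOrDefault(seq.substring(i, i + 2), 0);
            int prev = charCounts.getOrDefault(seq.charAt(i), 0);
            ll += Math.log((pair + 1.0) / (prev + 4.0)); // +4 for the ACGT alphabet
        }
        return ll;
    }
}
```

A two-class classifier would train one such chain on promoter sequences and one on non-promoters, and predict the class whose chain assigns the higher log-likelihood to the test sequence.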

Applying your model to test data

There are several methods for applying your model to test data. First, if your test data are stored in a WEKA format, you can use the model builder directly to apply the model to the test data and get predictions and some performance measures. To do that, follow these steps:

1. In the Test options panel, click Supplied test set and click Set to specify the test data file /Examples/Models/Sigma70_test.arff.
2. Right-click on the Result list panel and select Load model to load /Examples/Models/Sigma70HMM.model. After the model loads successfully, the classifier output shows information about the training data, the algorithm, and its parameters.

3. By default, the WEKA explorer does not output predictions. To output them, click More options and check the Output predictions option.
4. Click Start and wait for the model to be evaluated on the test data. The classifier output panel will then display the predictions and some performance evaluation measures.

Second, if your test data are in a Gennotate-supported format (e.g., FASTA or a single DNA fragment per line), you can use the Predictor application to apply a saved model and get predictions. For example, to apply Sigma70HMM.model to the test data in /Examples/Data/Sigma70_test.txt, follow these steps:

1. Run Predictor from the Application menu.

2. Specify your input and output files as in the figure below.
3. Click Predict and wait to see the output in the Predictions panel and in the output file /Examples/Output/Sigma70_test_out.txt.

Case Study 1: Predicting promoter regions in E. coli using sequence and structure features

In the previous section, we showed how to build an HMM model for predicting sigma 70 promoters in E. coli. A major difference between HMMClassifier and traditional classifiers such as Naïve Bayes (NB) and Random Forest is that HMMClassifier can be applied directly to sequence data, while traditional classifiers expect the data to be in the form of feature vectors extracted from the original sequences. Here, we show how to simultaneously extract features from sequence data and build/test a model, thanks to WEKA's FilteredClassifier, which lets us specify a machine learning algorithm together with a filter to be applied on the fly before the data are fed to the predictor. To build an NB classifier using 3-mer features, follow these steps:

1. Run Gennotate.
2. Go to the Application menu and select the model builder application.
3. In the model builder window (the WEKA explorer augmented with Gennotate filters and prediction methods), click Open and select the file /Example/Data/Sigma70.arff.
4. Click the Classify tab.
5. In the classifier panel, click Choose and browse for weka.classifiers.meta.FilteredClassifier.
6. Click on the classifier schema in the classifier panel to get the following window.

7. Change the classifier to weka.classifiers.bayes.NaiveBayes (with its default parameters) and the filter to gennotate.filters.unsupervised.KMerFilter (set the k parameter to 3). Click OK.
8. Click Start to run the 10-fold cross-validation experiment. The following figure shows the result of our experiment.

You can repeat the preceding procedure for different choices of classifiers and Gennotate filters. Table 3 compares Naïve Bayes (NB) and Random Forest with 50 trees (RF50) for k = 1, 2, 3, and 4. Interestingly, neither classifier is competitive with the HMM classifier, which achieved an AUC of 0.81 on the same data set.

Table 3: Performance (in terms of AUC score) of NB and RF50 on the Sigma70 data using different sequence-based features.

Features   NB   RF50
1-mer
2-mer
3-mer
4-mer

To build models using structure features, follow the preceding procedure and replace KMerFilter with DDNAFilter, which allows us to experiment with 12 different dinucleotide structure-based features [2] (see the Gennotate API documentation for detailed information about these methods). Table 4 compares NB and RF50 using 10-fold cross-validation and different structure-based features extracted from the Sigma70.arff data. In several cases, structure-based features helped us reach a performance competitive with the HMM classifier. RF50 seems to do better than NB; however, it should be noted that the number of trees was arbitrarily set to 50, and there could be room for improvement with larger numbers of trees (we leave this as an exercise for the user). For future experiments, let's save the best model in Table 4 as /Examples/Models/Sigma70_Stability_RF50.model.

Table 4: Performance (in terms of AUC score) of NB and RF50 on the Sigma70 data using twelve different dinucleotide structure-based features.

Features   NB   RF50
DI_APHYLICITY
DI_BDNATWISTOHLER
DI_BDNATWISTOLSON
DI_DNABENDSTIFF
DI_DNADENATURE
DI_ZDNASTABENERGY
DI_DUPLEXSTAB_DISRUPTENERGY
DI_DUPLEXSTAB_FREEENERGY
DI_PINDUCEDDEFORM
DI_PROPELLERTWIST
DI_PROTEINDNATWIST
DI_STACKINGENERGY
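To give an idea of what a dinucleotide structure-based encoding looks like, here is a minimal sketch: each dinucleotide is mapped through a conversion table of physical-property values, and a sequence becomes the vector of values of its successive dinucleotides. The numeric values below are placeholders only, not any real conversion table, and the class is our own illustration rather than the DDNAFilter implementation.

```java
import java.util.Map;

// Hypothetical sketch of a dinucleotide structure feature encoding.
// The values in TABLE are placeholders, NOT a real conversion table.
public class DinucleotideFeatureSketch {
    static final Map<String, Double> TABLE = Map.of(
        "AA", 1.0, "AC", 2.0, "CA", 3.0, "CC", 4.0); // placeholder values

    // A sequence of length n yields n - 1 dinucleotide feature values.
    public static double[] encode(String seq) {
        double[] features = new double[seq.length() - 1];
        for (int i = 0; i + 1 < seq.length(); i++) {
            features[i] = TABLE.getOrDefault(seq.substring(i, i + 2), 0.0);
        }
        return features;
    }
}
```

Swapping the conversion table (as DDNAFilter's ConversionTable parameter does) changes which physical property the resulting feature vector describes, while the encoding procedure stays the same.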

Case Study 2: Improved prediction of promoter regions in E. coli

In Case Study 1, we evaluated the prediction of sigma 70 promoters in E. coli using twelve different methods for extracting dinucleotide features. In general, better performance can be achieved by: 1) combining a set of these features; 2) building an ensemble of classifiers in which each base classifier is trained on a different structure-based feature set; 3) combining all 12 sets of structure features and using a feature selection method to find an optimal subset of features. Here, we show how to use Gennotate to build improved predictors using these three approaches.

Concatenating features

To build a single classifier that takes as input the features extracted using the twelve different dinucleotide feature sets, use gennotate.filters.ConcatenateFilter.

1. Run Gennotate.
2. Go to the Application menu and select the model builder application.
3. In the model builder window (the WEKA explorer augmented with Gennotate filters and prediction methods), click Open and select the file /Example/Data/Sigma70.arff.
4. Click the Classify tab.
5. In the classifier panel, click Choose and browse for weka.classifiers.meta.FilteredClassifier.
6. Click on the classifier schema in the classifier panel to get the following window.

7. Change the classifier to weka.classifiers.trees.RandomForest (set the number of trees to 50) and the filter to gennotate.filters.ConcatenateFilter.
8. Click on the ConcatenateFilter and input the twelve filters (e.g., DDNAFilter with twelve different selections of the ConversionTable parameter).

9. Click Start to run the 10-fold cross-validation experiment. The following figure shows the cross-validation performance of the predictor using the twelve combined sets of structure features. The result is better than that of any single set of structure features.

Concatenating features and selecting an optimal subset of features

In the preceding experiment, we showed that working with a concatenation of twelve sets of features can improve the performance of the resulting model. In general, this high-dimensional feature space may contain several irrelevant and/or redundant features. Here, we show how to use WEKA feature selection together with our ConcatenateFilter to further improve the performance of the resulting model.

1. Follow steps 1-6 in the previous experiment.
2. In the FilteredClassifier window, choose RF50 as your classifier and choose weka.filters.MultiFilter as your filter.
3. Click on the MultiFilter and input two filters: i) a ConcatenateFilter with twelve DDNAFilters, each with a different choice of ConversionTable; ii) an AttributeSelection filter.
4. For the AttributeSelection filter, you can experiment with the large pool of WEKA-provided feature selection methods and search algorithms. For our experiment, we ranked all features by information gain and used the 20 top-ranked features.
5. Click Start and wait for the cross-validation results. The output will show the top-ranked features used to build the model (see Table 5 for the top 20 features) as well as some performance measures. Interestingly, this model has a better AUC (0.83) than the model that uses the full set of features (0.80).
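For readers curious about what ranking by information gain computes, here is a minimal, self-contained sketch for a binary feature and binary class labels: the entropy of the class distribution minus the weighted entropy after splitting on the feature. This is our own illustration of the criterion, not WEKA's implementation.

```java
// Minimal information-gain computation for a binary feature and binary
// class labels; an illustration of the ranking criterion, not WEKA code.
public class InfoGainSketch {
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    static double entropy(double p) {
        if (p == 0.0 || p == 1.0) return 0.0;
        return -p * log2(p) - (1 - p) * log2(1 - p);
    }

    // feature[i] and label[i] are each 0 or 1.
    public static double infoGain(int[] feature, int[] label) {
        int n = feature.length, n1 = 0, pos = 0, pos1 = 0, pos0 = 0;
        for (int i = 0; i < n; i++) {
            if (label[i] == 1) pos++;
            if (feature[i] == 1) { n1++; if (label[i] == 1) pos1++; }
            else if (label[i] == 1) pos0++;
        }
        int n0 = n - n1;
        double h = entropy((double) pos / n);
        double h1 = n1 == 0 ? 0.0 : entropy((double) pos1 / n1);
        double h0 = n0 == 0 ? 0.0 : entropy((double) pos0 / n0);
        return h - ((double) n1 / n) * h1 - ((double) n0 / n) * h0;
    }
}
```

A feature that perfectly predicts the class has information gain equal to the class entropy (1 bit for balanced binary labels), while an uninformative feature has gain 0; the ranker simply sorts features by this score.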

Table 5: List of top 20 structure features

DI_DUPLEXSTAB_FREEENERGY48
DI_DNABENDSTIFF48
DI_DUPLEXSTAB_DISRUPTENERGY48
DI_DUPLEXSTAB_FREEENERGY49
DI_PINDUCEDDEFORM49
DI_DNABENDSTIFF49
DI_DUPLEXSTAB_FREEENERGY50
DI_DUPLEXSTAB_FREEENERGY47
DI_DNABENDSTIFF47
DI_STACKINGENERGY48
DI_DUPLEXSTAB_DISRUPTENERGY49
DI_DNADENATURE48
DI_DNABENDSTIFF50
DI_ZDNASTABENERGY49
DI_ZDNASTABENERGY50
DI_DUPLEXSTAB_DISRUPTENERGY47
DI_STACKINGENERGY47
DI_DUPLEXSTAB_DISRUPTENERGY50
DI_DNADENATURE50
DI_STACKINGENERGY49

Improved prediction of sigma 70 promoters using an ensemble of classifiers

In this experiment, we build a number of classifiers using RF50 and different choices of the dinucleotide structure-based features. The base classifiers are combined using a second-stage WEKA Logistic classifier.

1. Run Gennotate.
2. Go to the Application menu and select the model builder application.
3. In the model builder window (the WEKA explorer augmented with Gennotate filters and prediction methods), click Open and select the file /Example/Data/Sigma70.arff.
4. Click the Classify tab.
5. In the classifier panel, click Choose and browse for weka.classifiers.meta.Stacking. Set numFolds to 3, set the metaClassifier to weka.classifiers.functions.Logistic, and input 12 classifiers, each a FilteredClassifier with RF50 and a different choice of ConversionTable for DDNAFilter.

6. Click Start to run the 10-fold cross-validation experiment. The following figure shows the result of our experiment.

Case Study 3: Improved prediction of promoter regions in E. coli using meta-predictors

An interesting property of Gennotate is that it allows sharing not only data sets but also the learned models. Once you have a number of different predictors for the same classification task, you can: 1) use the Predictor application to apply any of these predictors to some test data; 2) rebuild a prediction model using updated/different training data; 3) build a consensus or hybrid predictor that combines these predictors. The first usage was shown earlier. The second can be done simply by loading the new training data, loading the current model, and performing a cross-validation experiment; the results will show the performance of the new model, which can also be saved as a model file for further use. The third usage is the focus of this case study. To facilitate the development of a consensus/hybrid predictor that relies on existing predictors, not necessarily developed by the same user, Gennotate provides a meta-classifier called ModelBased. In the following experiment, we show how to use the ModelBased classifier to build a consensus predictor combining the Sigma70HMM and Sigma70_Stability_RF50 models developed earlier.

1. Run Gennotate.
2. Go to the Application menu and select the model builder application.
3. In the model builder window (the WEKA explorer augmented with Gennotate filters and prediction methods), click Open and select the file /Example/Data/Sigma70_test.arff. Please note that our goal is to combine existing models, so there is no need to retrain them; instead, we use the test data to evaluate the combination of these predictors.
4. Click the Classify tab.
5. In the classifier panel, click Choose and browse for weka.classifiers.meta.Vote.

6. Input two classifiers, each using gennotate.classifiers.meta.ModelBased, and set the modelFile parameter as shown in the following figure.
7. In the Test options panel, choose Use training set. Note that the ModelBased classifier does not perform any training; it just loads the model and keeps it for predictions. Hence, what is reported is the performance of applying the models encapsulated within the ModelBased classifiers to what appears to WEKA as training data.

The performance obtained by the consensus predictor combining the HMM model and the RF50 model is almost the same as that of the HMM alone (AUC equals 0.81). In practice, we expect improvements in performance when we combine several (not just two) predictors. Please note that we can build a hybrid model from the HMM and RF50 models simply by following the preceding procedure and replacing the Vote classifier with the Stacking classifier. In that case, however, the user might perform a cross-validation test, and the result should be handled with caution because the test data have been used to train the meta-predictor in the Stacking classifier.
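The consensus combination performed by a Vote-style meta-classifier can be sketched in a few lines: average the positive-class probability estimates of the saved models and threshold the result. This is a hypothetical illustration of the averaging rule, not WEKA's Vote implementation; the model scores are stand-ins for the probabilities the saved models would return.

```java
// Hypothetical sketch of Vote-style consensus: average the positive-class
// probabilities produced by several saved models and threshold at 0.5.
public class ConsensusSketch {
    public static double consensusScore(double[] modelScores) {
        double sum = 0.0;
        for (double s : modelScores) sum += s;
        return sum / modelScores.length;
    }

    public static boolean predictPromoter(double[] modelScores) {
        return consensusScore(modelScores) >= 0.5;
    }
}
```

With only two models, the average is dominated by whichever model is more confident, which is consistent with the observation above that the two-model consensus performs about the same as the HMM alone; averaging over more diverse predictors is where such a consensus typically helps.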

Extending Gennotate

Gennotate is extensible, in the sense that anyone can add extra filters or classification methods. To add your own classification methods or filters, follow the procedure described in the WEKA documentation on how to write your own classifier and your own filter. Once you have a jar file containing your added components, just add it to your CLASSPATH when running Gennotate and enjoy your customized version of Gennotate.

References

[1] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1).
[2] Gan, Y., Guan, J., & Zhou, S. (2012). A comparison study on feature selection of DNA structural properties for promoter prediction. BMC Bioinformatics, 13(1).

Copyright Gennotate development team

More information

IEE 520 Data Mining. Project Report. Shilpa Madhavan Shinde

IEE 520 Data Mining. Project Report. Shilpa Madhavan Shinde IEE 520 Data Mining Project Report Shilpa Madhavan Shinde Contents I. Dataset Description... 3 II. Data Classification... 3 III. Class Imbalance... 5 IV. Classification after Sampling... 5 V. Final Model...

More information

WEKA: Practical Machine Learning Tools and Techniques in Java. Seminar A.I. Tools WS 2006/07 Rossen Dimov

WEKA: Practical Machine Learning Tools and Techniques in Java. Seminar A.I. Tools WS 2006/07 Rossen Dimov WEKA: Practical Machine Learning Tools and Techniques in Java Seminar A.I. Tools WS 2006/07 Rossen Dimov Overview Basic introduction to Machine Learning Weka Tool Conclusion Document classification Demo

More information

Analyzing HTTP requests for web intrusion detection

Analyzing HTTP requests for web intrusion detection Kennesaw State University DigitalCommons@Kennesaw State University KSU Proceedings on Cybersecurity Education, Research and Practice 2017 KSU Conference on Cybersecurity Education, Research and Practice

More information

Finding data. HMMER Answer key

Finding data. HMMER Answer key Finding data HMMER Answer key HMMER input is prepared using VectorBase ClustalW, which runs a Java application for the graphical representation of the results. If you get an error message that blocks this

More information

Classification Algorithms in Data Mining

Classification Algorithms in Data Mining August 9th, 2016 Suhas Mallesh Yash Thakkar Ashok Choudhary CIS660 Data Mining and Big Data Processing -Dr. Sunnie S. Chung Classification Algorithms in Data Mining Deciding on the classification algorithms

More information

Weka VotedPerceptron & Attribute Transformation (1)

Weka VotedPerceptron & Attribute Transformation (1) Weka VotedPerceptron & Attribute Transformation (1) Lab6 (in- class): 5 DIC 2016-13:15-15:00 (CHOMSKY) ACKNOWLEDGEMENTS: INFORMATION, EXAMPLES AND TASKS IN THIS LAB COME FROM SEVERAL WEB SOURCES. Learning

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Image Data: Classification via Neural Networks Instructor: Yizhou Sun yzsun@ccs.neu.edu November 19, 2015 Methods to Learn Classification Clustering Frequent Pattern Mining

More information

WEKA A Machine Learning Workbench for Data Mining

WEKA A Machine Learning Workbench for Data Mining Chapter 1 WEKA A Machine Learning Workbench for Data Mining Eibe Frank, Mark Hall, Geoffrey Holmes, Richard Kirkby, Bernhard Pfahringer, Ian H. Witten Department of Computer Science, University of Waikato,

More information

Practical Data Mining COMP-321B. Tutorial 1: Introduction to the WEKA Explorer

Practical Data Mining COMP-321B. Tutorial 1: Introduction to the WEKA Explorer Practical Data Mining COMP-321B Tutorial 1: Introduction to the WEKA Explorer Gabi Schmidberger Mark Hall Richard Kirkby July 12, 2006 c 2006 University of Waikato 1 Setting up your Environment Before

More information

Lecture 5: Markov models

Lecture 5: Markov models Master s course Bioinformatics Data Analysis and Tools Lecture 5: Markov models Centre for Integrative Bioinformatics Problem in biology Data and patterns are often not clear cut When we want to make a

More information

Attribute Discretization and Selection. Clustering. NIKOLA MILIKIĆ UROŠ KRČADINAC

Attribute Discretization and Selection. Clustering. NIKOLA MILIKIĆ UROŠ KRČADINAC Attribute Discretization and Selection Clustering NIKOLA MILIKIĆ nikola.milikic@fon.bg.ac.rs UROŠ KRČADINAC uros@krcadinac.com Naive Bayes Features Intended primarily for the work with nominal attributes

More information

Predicting Popular Xbox games based on Search Queries of Users

Predicting Popular Xbox games based on Search Queries of Users 1 Predicting Popular Xbox games based on Search Queries of Users Chinmoy Mandayam and Saahil Shenoy I. INTRODUCTION This project is based on a completed Kaggle competition. Our goal is to predict which

More information

Effective Classifiers for Detecting Objects

Effective Classifiers for Detecting Objects Effective Classifiers for Detecting Objects Michael Mayo Dept. of Computer Science University of Waikato Private Bag 3105, Hamilton, New Zealand mmayo@cs.waikato.ac.nz Abstract Several state-of-the-art

More information

B-kNN to Improve the Efficiency of knn

B-kNN to Improve the Efficiency of knn Dhrgam AL Kafaf, Dae-Kyoo Kim and Lunjin Lu Dept. of Computer Science & Engineering, Oakland University, Rochester, MI 809, U.S.A. Keywords: Abstract: Efficiency, knn, k Nearest Neighbor. The knn algorithm

More information

Topics In Feature Selection

Topics In Feature Selection Topics In Feature Selection CSI 5388 Theme Presentation Joe Burpee 2005/2/16 Feature Selection (FS) aka Attribute Selection Witten and Frank book Section 7.1 Liu site http://athena.csee.umbc.edu/idm02/

More information

Computing a Gain Chart. Comparing the computation time of data mining tools on a large dataset under Linux.

Computing a Gain Chart. Comparing the computation time of data mining tools on a large dataset under Linux. 1 Introduction Computing a Gain Chart. Comparing the computation time of data mining tools on a large dataset under Linux. The gain chart is an alternative to confusion matrix for the evaluation of a classifier.

More information

ClaNC: The Manual (v1.1)

ClaNC: The Manual (v1.1) ClaNC: The Manual (v1.1) Alan R. Dabney June 23, 2008 Contents 1 Installation 3 1.1 The R programming language............................... 3 1.2 X11 with Mac OS X....................................

More information

Data Mining With Weka A Short Tutorial

Data Mining With Weka A Short Tutorial Data Mining With Weka A Short Tutorial Dr. Wenjia Wang School of Computing Sciences University of East Anglia (UEA), Norwich, UK Content 1. Introduction to Weka 2. Data Mining Functions and Tools 3. Data

More information

Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers

Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers James P. Biagioni Piotr M. Szczurek Peter C. Nelson, Ph.D. Abolfazl Mohammadian, Ph.D. Agenda Background

More information

10601 Machine Learning. Model and feature selection

10601 Machine Learning. Model and feature selection 10601 Machine Learning Model and feature selection Model selection issues We have seen some of this before Selecting features (or basis functions) Logistic regression SVMs Selecting parameter value Prior

More information

Table Of Contents: xix Foreword to Second Edition

Table Of Contents: xix Foreword to Second Edition Data Mining : Concepts and Techniques Table Of Contents: Foreword xix Foreword to Second Edition xxi Preface xxiii Acknowledgments xxxi About the Authors xxxv Chapter 1 Introduction 1 (38) 1.1 Why Data

More information

A Comparative Study of Locality Preserving Projection and Principle Component Analysis on Classification Performance Using Logistic Regression

A Comparative Study of Locality Preserving Projection and Principle Component Analysis on Classification Performance Using Logistic Regression Journal of Data Analysis and Information Processing, 2016, 4, 55-63 Published Online May 2016 in SciRes. http://www.scirp.org/journal/jdaip http://dx.doi.org/10.4236/jdaip.2016.42005 A Comparative Study

More information

An Introduction to WEKA Explorer. In part from: Yizhou Sun 2008

An Introduction to WEKA Explorer. In part from: Yizhou Sun 2008 An Introduction to WEKA Explorer In part from: Yizhou Sun 2008 What is WEKA? Waikato Environment for Knowledge Analysis It s a data mining/machine learning tool developed by Department of Computer Science,,

More information

Louis Fourrier Fabien Gaie Thomas Rolf

Louis Fourrier Fabien Gaie Thomas Rolf CS 229 Stay Alert! The Ford Challenge Louis Fourrier Fabien Gaie Thomas Rolf Louis Fourrier Fabien Gaie Thomas Rolf 1. Problem description a. Goal Our final project is a recent Kaggle competition submitted

More information

Fast or furious? - User analysis of SF Express Inc

Fast or furious? - User analysis of SF Express Inc CS 229 PROJECT, DEC. 2017 1 Fast or furious? - User analysis of SF Express Inc Gege Wen@gegewen, Yiyuan Zhang@yiyuan12, Kezhen Zhao@zkz I. MOTIVATION The motivation of this project is to predict the likelihood

More information

Tutorial on Machine Learning Tools

Tutorial on Machine Learning Tools Tutorial on Machine Learning Tools Yanbing Xue Milos Hauskrecht Why do we need these tools? Widely deployed classical models No need to code from scratch Easy-to-use GUI Outline Matlab Apps Weka 3 UI TensorFlow

More information

WEKA: A Dynamic Software Suit for Machine Learning & Exploratory Data Analysis

WEKA: A Dynamic Software Suit for Machine Learning & Exploratory Data Analysis , pp-01-05 WEKA: A Dynamic Software Suit for Machine Learning & Exploratory Data Analysis P.B.Khanale 1, Vaibhav M. Pathak 2 1 Department of Computer Science,Dnyanopasak College,Parbhani 431 401 e-mail

More information

Combination of PCA with SMOTE Resampling to Boost the Prediction Rate in Lung Cancer Dataset

Combination of PCA with SMOTE Resampling to Boost the Prediction Rate in Lung Cancer Dataset International Journal of Computer Applications (0975 8887) Combination of PCA with SMOTE Resampling to Boost the Prediction Rate in Lung Cancer Dataset Mehdi Naseriparsa Islamic Azad University Tehran

More information

A Feature Generation Algorithm for Sequences with Application to Splice-Site Prediction

A Feature Generation Algorithm for Sequences with Application to Splice-Site Prediction A Feature Generation Algorithm for Sequences with Application to Splice-Site Prediction Rezarta Islamaj 1, Lise Getoor 1, and W. John Wilbur 2 1 Computer Science Department, University of Maryland, College

More information

Scalable Machine Learning in R. with H2O

Scalable Machine Learning in R. with H2O Scalable Machine Learning in R with H2O Erin LeDell @ledell DSC July 2016 Introduction Statistician & Machine Learning Scientist at H2O.ai in Mountain View, California, USA Ph.D. in Biostatistics with

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Sources Hastie, Tibshirani, Friedman: The Elements of Statistical Learning James, Witten, Hastie, Tibshirani: An Introduction to Statistical Learning Andrew Ng:

More information

Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique

Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique Anotai Siltepavet 1, Sukree Sinthupinyo 2 and Prabhas Chongstitvatana 3 1 Computer Engineering, Chulalongkorn University,

More information

Chapter 10. Conclusion Discussion

Chapter 10. Conclusion Discussion Chapter 10 Conclusion 10.1 Discussion Question 1: Usually a dynamic system has delays and feedback. Can OMEGA handle systems with infinite delays, and with elastic delays? OMEGA handles those systems with

More information

Applied Machine Learning

Applied Machine Learning Applied Machine Learning Lab 3 Working with Text Data Overview In this lab, you will use R or Python to work with text data. Specifically, you will use code to clean text, remove stop words, and apply

More information

Semi-supervised Learning

Semi-supervised Learning Semi-supervised Learning Piyush Rai CS5350/6350: Machine Learning November 8, 2011 Semi-supervised Learning Supervised Learning models require labeled data Learning a reliable model usually requires plenty

More information

WRAPPER feature selection method with SIPINA and R (RWeka package). Comparison with a FILTER approach implemented into TANAGRA.

WRAPPER feature selection method with SIPINA and R (RWeka package). Comparison with a FILTER approach implemented into TANAGRA. 1 Topic WRAPPER feature selection method with SIPINA and R (RWeka package). Comparison with a FILTER approach implemented into TANAGRA. Feature selection. The feature selection 1 is a crucial aspect of

More information

ECLT 5810 Evaluation of Classification Quality

ECLT 5810 Evaluation of Classification Quality ECLT 5810 Evaluation of Classification Quality Reference: Data Mining Practical Machine Learning Tools and Techniques, by I. Witten, E. Frank, and M. Hall, Morgan Kaufmann Testing and Error Error rate:

More information

Feature Selection Using Modified-MCA Based Scoring Metric for Classification

Feature Selection Using Modified-MCA Based Scoring Metric for Classification 2011 International Conference on Information Communication and Management IPCSIT vol.16 (2011) (2011) IACSIT Press, Singapore Feature Selection Using Modified-MCA Based Scoring Metric for Classification

More information

AI32 Guide to Weka. Andrew Roberts 1st March 2005

AI32 Guide to Weka. Andrew Roberts   1st March 2005 AI32 Guide to Weka Andrew Roberts http://www.comp.leeds.ac.uk/andyr 1st March 2005 1 Introduction Weka is an excellent system for learning about machine learning techniques. Of course, it is a generic

More information

MetaPhyler Usage Manual

MetaPhyler Usage Manual MetaPhyler Usage Manual Bo Liu boliu@umiacs.umd.edu March 13, 2012 Contents 1 What is MetaPhyler 1 2 Installation 1 3 Quick Start 2 3.1 Taxonomic profiling for metagenomic sequences.............. 2 3.2

More information

Contents. ACE Presentation. Comparison with existing frameworks. Technical aspects. ACE 2.0 and future work. 24 October 2009 ACE 2

Contents. ACE Presentation. Comparison with existing frameworks. Technical aspects. ACE 2.0 and future work. 24 October 2009 ACE 2 ACE Contents ACE Presentation Comparison with existing frameworks Technical aspects ACE 2.0 and future work 24 October 2009 ACE 2 ACE Presentation 24 October 2009 ACE 3 ACE Presentation Framework for using

More information

Technical Brief: Domain Risk Score Proactively uncover threats using DNS and data science

Technical Brief: Domain Risk Score Proactively uncover threats using DNS and data science Technical Brief: Domain Risk Score Proactively uncover threats using DNS and data science 310 Million + Current Domain Names 11 Billion+ Historical Domain Profiles 5 Million+ New Domain Profiles Daily

More information

MOA: {M}assive {O}nline {A}nalysis.

MOA: {M}assive {O}nline {A}nalysis. MOA: {M}assive {O}nline {A}nalysis. Albert Bifet Hamilton, New Zealand August 2010, Eindhoven PhD Thesis Adaptive Learning and Mining for Data Streams and Frequent Patterns Coadvisors: Ricard Gavaldà and

More information

CS249: ADVANCED DATA MINING

CS249: ADVANCED DATA MINING CS249: ADVANCED DATA MINING Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu April 24, 2017 Homework 2 out Announcements Due May 3 rd (11:59pm) Course project proposal

More information

DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES

DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES EXPERIMENTAL WORK PART I CHAPTER 6 DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES The evaluation of models built using statistical in conjunction with various feature subset

More information

Community edition(open-source) Enterprise edition

Community edition(open-source) Enterprise edition Suseela Bhaskaruni Rapid Miner is an environment for machine learning and data mining experiments. Widely used for both research and real-world data mining tasks. Software versions: Community edition(open-source)

More information

Prognosis of Lung Cancer Using Data Mining Techniques

Prognosis of Lung Cancer Using Data Mining Techniques Prognosis of Lung Cancer Using Data Mining Techniques 1 C. Saranya, M.Phil, Research Scholar, Dr.M.G.R.Chockalingam Arts College, Arni 2 K. R. Dillirani, Associate Professor, Department of Computer Science,

More information

IMPLEMENTATION OF CLASSIFICATION ALGORITHMS USING WEKA NAÏVE BAYES CLASSIFIER

IMPLEMENTATION OF CLASSIFICATION ALGORITHMS USING WEKA NAÏVE BAYES CLASSIFIER IMPLEMENTATION OF CLASSIFICATION ALGORITHMS USING WEKA NAÏVE BAYES CLASSIFIER N. Suresh Kumar, Dr. M. Thangamani 1 Assistant Professor, Sri Ramakrishna Engineering College, Coimbatore, India 2 Assistant

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 08: Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu October 24, 2017 Learnt Prediction and Classification Methods Vector Data

More information

Perform the following steps to set up for this project. Start out in your login directory on csit (a.k.a. acad).

Perform the following steps to set up for this project. Start out in your login directory on csit (a.k.a. acad). CSC 458 Data Mining and Predictive Analytics I, Fall 2017 (November 22, 2017) Dr. Dale E. Parson, Assignment 4, Comparing Weka Bayesian, clustering, ZeroR, OneR, and J48 models to predict nominal dissolved

More information

A Study of Random Forest Algorithm with implemetation using Weka

A Study of Random Forest Algorithm with implemetation using Weka A Study of Random Forest Algorithm with implemetation using Weka 1 Dr. N.Venkatesan, Associate Professor, Dept. of IT., Bharathiyar College of Engg. & Technology, Karikal, India 2 Mrs. G.Priya, Assistant

More information

A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion

A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion Ye Tian, Gary M. Weiss, Qiang Ma Department of Computer and Information Science Fordham University 441 East Fordham

More information

DATA ANALYSIS WITH WEKA. Author: Nagamani Mutteni Asst.Professor MERI

DATA ANALYSIS WITH WEKA. Author: Nagamani Mutteni Asst.Professor MERI DATA ANALYSIS WITH WEKA Author: Nagamani Mutteni Asst.Professor MERI Topic: Data Analysis with Weka Course Duration: 2 Months Objective: Everybody talks about Data Mining and Big Data nowadays. Weka is

More information

Neural Networks and Machine Learning Applied to Classification of Cancer. Sachin Govind, Advisor: Namrata Pandya, IMSA

Neural Networks and Machine Learning Applied to Classification of Cancer. Sachin Govind, Advisor: Namrata Pandya, IMSA Neural Networks and Machine Learning Applied to Classification of Cancer Sachin Govind, Advisor: Namrata Pandya, IMSA Cancer Screening Current methods Invasive techniques (biopsy, colonoscopy, etc.) Helical

More information

Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique

Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique www.ijcsi.org 29 Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique Anotai Siltepavet 1, Sukree Sinthupinyo 2 and Prabhas Chongstitvatana 3 1 Computer Engineering, Chulalongkorn

More information

MIT Samberg Center Cambridge, MA, USA. May 30 th June 2 nd, by C. Rea, R.S. Granetz MIT Plasma Science and Fusion Center, Cambridge, MA, USA

MIT Samberg Center Cambridge, MA, USA. May 30 th June 2 nd, by C. Rea, R.S. Granetz MIT Plasma Science and Fusion Center, Cambridge, MA, USA Exploratory Machine Learning studies for disruption prediction on DIII-D by C. Rea, R.S. Granetz MIT Plasma Science and Fusion Center, Cambridge, MA, USA Presented at the 2 nd IAEA Technical Meeting on

More information

Contents. Foreword to Second Edition. Acknowledgments About the Authors

Contents. Foreword to Second Edition. Acknowledgments About the Authors Contents Foreword xix Foreword to Second Edition xxi Preface xxiii Acknowledgments About the Authors xxxi xxxv Chapter 1 Introduction 1 1.1 Why Data Mining? 1 1.1.1 Moving toward the Information Age 1

More information

Tutorial on Machine Learning. Impact of dataset composition on models performance. G. Marcou, N. Weill, D. Horvath, D. Rognan, A.

Tutorial on Machine Learning. Impact of dataset composition on models performance. G. Marcou, N. Weill, D. Horvath, D. Rognan, A. Part 1. Tutorial on Machine Learning. Impact of dataset composition on models performance G. Marcou, N. Weill, D. Horvath, D. Rognan, A. Varnek 1 Introduction Predictive performance of QSAR model depends

More information

Effect of Principle Component Analysis and Support Vector Machine in Software Fault Prediction

Effect of Principle Component Analysis and Support Vector Machine in Software Fault Prediction International Journal of Computer Trends and Technology (IJCTT) volume 7 number 3 Jan 2014 Effect of Principle Component Analysis and Support Vector Machine in Software Fault Prediction A. Shanthini 1,

More information

Software Defect Prediction System Decision Tree Algorithm With Two Level Data Preprocessing

Software Defect Prediction System Decision Tree Algorithm With Two Level Data Preprocessing Software Defect Prediction System Decision Tree Algorithm With Two Level Data Preprocessing Reena P Department of Computer Science and Engineering Sree Chitra Thirunal College of Engineering Thiruvananthapuram,

More information

Author Verification: Exploring a Large set of Parameters using a Genetic Algorithm

Author Verification: Exploring a Large set of Parameters using a Genetic Algorithm Author Verification: Exploring a Large set of Parameters using a Genetic Algorithm Notebook for PAN at CLEF 2014 Erwan Moreau 1, Arun Jayapal 2, and Carl Vogel 3 1 moreaue@cs.tcd.ie 2 jayapala@cs.tcd.ie

More information

Combining Neural Networks and Log-linear Models to Improve Relation Extraction

Combining Neural Networks and Log-linear Models to Improve Relation Extraction Combining Neural Networks and Log-linear Models to Improve Relation Extraction Thien Huu Nguyen and Ralph Grishman Computer Science Department, New York University {thien,grishman}@cs.nyu.edu Outline Relation

More information

On Classification: An Empirical Study of Existing Algorithms Based on Two Kaggle Competitions

On Classification: An Empirical Study of Existing Algorithms Based on Two Kaggle Competitions On Classification: An Empirical Study of Existing Algorithms Based on Two Kaggle Competitions CAMCOS Report Day December 9th, 2015 San Jose State University Project Theme: Classification The Kaggle Competition

More information

Contents. Preface to the Second Edition

Contents. Preface to the Second Edition Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................

More information

WEKA Waikato Environment for Knowledge Analysis Performing Classification Experiments Prof. Pietro Ducange

WEKA Waikato Environment for Knowledge Analysis Performing Classification Experiments Prof. Pietro Ducange WEKA Waikato Environment for Knowledge Analysis Performing Classification Experiments Prof. Pietro Ducange 1 The Knowledge Flow Interface It provides an alternative to the Explorer interface The user can

More information

Automatic Linguistic Indexing of Pictures by a Statistical Modeling Approach

Automatic Linguistic Indexing of Pictures by a Statistical Modeling Approach Automatic Linguistic Indexing of Pictures by a Statistical Modeling Approach Abstract Automatic linguistic indexing of pictures is an important but highly challenging problem for researchers in content-based

More information

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X Analysis about Classification Techniques on Categorical Data in Data Mining Assistant Professor P. Meena Department of Computer Science Adhiyaman Arts and Science College for Women Uthangarai, Krishnagiri,

More information

Preface to the Second Edition. Preface to the First Edition. 1 Introduction 1

Preface to the Second Edition. Preface to the First Edition. 1 Introduction 1 Preface to the Second Edition Preface to the First Edition vii xi 1 Introduction 1 2 Overview of Supervised Learning 9 2.1 Introduction... 9 2.2 Variable Types and Terminology... 9 2.3 Two Simple Approaches

More information

Finding and Exporting Data. BioMart

Finding and Exporting Data. BioMart September 2017 Finding and Exporting Data Not sure what tool to use to find and export data? BioMart is used to retrieve data for complex queries, involving a few or many genes or even complete genomes.

More information

Topic Classification in Social Media using Metadata from Hyperlinked Objects

Topic Classification in Social Media using Metadata from Hyperlinked Objects Topic Classification in Social Media using Metadata from Hyperlinked Objects Sheila Kinsella 1, Alexandre Passant 1, and John G. Breslin 1,2 1 Digital Enterprise Research Institute, National University

More information

3 Ways to Improve Your Regression

3 Ways to Improve Your Regression 3 Ways to Improve Your Regression Introduction This tutorial will take you through the steps demonstrated in the 3 Ways to Improve Your Regression webinar. First, you will be introduced to a dataset about

More information

Subject. Dataset. Copy paste feature of the diagram. Importing the dataset. Copy paste feature into the diagram.

Subject. Dataset. Copy paste feature of the diagram. Importing the dataset. Copy paste feature into the diagram. Subject Copy paste feature into the diagram. When we define the data analysis process into Tanagra, it is possible to copy components (or entire branches of components) towards another location into the

More information

KNIME Enalos+ Molecular Descriptor nodes

KNIME Enalos+ Molecular Descriptor nodes KNIME Enalos+ Molecular Descriptor nodes A Brief Tutorial Novamechanics Ltd Contact: info@novamechanics.com Version 1, June 2017 Table of Contents Introduction... 1 Step 1-Workbench overview... 1 Step

More information