Epitopes Toolkit (EpiT) Yasser EL-Manzalawy August 30, 2016


What is EpiT?

Epitopes Toolkit (EpiT) is a platform for developing epitope prediction tools. An EpiT developer can distribute a predictor as a serialized Java object (a model file). This allows other EpiT users to run the predictor on their own machines, rebuild it on other datasets, or combine it with other predictors to obtain a customized hybrid or consensus predictor.

Overview of EpiT

EpiT has two main components:

i. Model builder, an application for building and evaluating epitope predictors and serializing these models in a binary format (model files).
ii. Predictor, an application for applying a model to test data (e.g., a set of epitopes or protein sequences).

Model builder

The model builder application is an extension of Weka [1], a well-known machine learning workbench supporting many standard machine learning algorithms. Weka provides tools for data pre-processing, classification, regression, clustering, validation, and visualization. Furthermore, Weka provides a framework for implementing new machine learning methods and data pre-processors. The model builder in EpiT offers the following extensions to Weka:

i) A suite of data pre-processors (called filters in Weka) for converting epitope sequences into vectors of numerical features so that Weka-supported methods can be applied to the data. The current implementation supports filters for converting epitope sequences into amino acid compositions, dipeptide compositions, amino acid pair propensities [2], composition-transition-distribution (CTD) [3,4], and nominal attributes. Once epitope sequences have been converted into numeric or nominal features, any suitable Weka learner can be trained and evaluated on that dataset.

ii) A number of methods that can be trained and evaluated directly (without applying any filters) for qualitative and quantitative epitope prediction. The current implementation of EpiT provides classifiers for propensity scale methods (e.g., Parker's hydrophilicity scale [5]), position-specific scoring matrix (PSSM) [6], and a method for predicting MHC class II binding affinity using multiple-instance regression [7]. In addition, EpiT provides a meta-classifier for building a consensus predictor that combines a group of predictors, and a meta-classifier for building epitope predictors from highly unbalanced training datasets by randomly under-sampling instances from the majority class. More information about these extensions is provided in the EpiT API documentation.

Predictor

The Predictor is a graphical user interface (GUI) for applying a model to a test dataset. Specifically, the user inputs the model file, the test data file, the output file name, the format of the test data (set of epitopes or FASTA sequences), the type of the problem (peptide-based or residue-based) [8], and the length of the peptide/window sequence. The output of the Predictor is a summary of the input model (model name, model parameters, and the name of the dataset used to build the model) followed by the predictions. The predictions are four tab-separated columns: the first column is the epitope/antigen identifier; the second and third columns are the position and the sequence of the predicted peptide/residue; the last column is the predicted score.

Installing and using EpiT

EpiT is platform-independent since it is implemented in Java. To install EpiT, download it from the project web site and unzip the compressed file. To run EpiT, add all the jar files included in the lib folder to the CLASSPATH and run the epit.jar file (see RunEpiT.bat as an example). The following command sets the CLASSPATH and runs EpiT:

java -Xmx512m -classpath "./epit.jar;./lib/weka.jar;./lib/readseq.jar;./lib/swing-layout jar;./lib/swing-worker-1.2.jar;." epit.gui.maingui
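To illustrate the kind of transformation EpiT's sequence filters perform before a Weka learner sees the data, here is a minimal sketch of the amino acid composition and dipeptide composition features. This is a conceptual stand-in, not EpiT's actual filter code; the function names are hypothetical.

```python
# Sketch of sequence-to-feature conversion (illustration only, not
# EpiT's implementation): an epitope sequence becomes a fixed-length
# numeric vector that standard learners can consume.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(sequence):
    """Fraction of each of the 20 standard amino acids in `sequence`
    (a 20-dimensional feature vector)."""
    sequence = sequence.upper()
    n = len(sequence)
    return [sequence.count(aa) / n for aa in AMINO_ACIDS]

def dipeptide_composition(sequence):
    """Fraction of each of the 400 ordered amino acid pairs
    (a 400-dimensional feature vector)."""
    sequence = sequence.upper()
    pairs = [sequence[i:i + 2] for i in range(len(sequence) - 1)]
    total = len(pairs)
    return [pairs.count(a + b) / total
            for a in AMINO_ACIDS for b in AMINO_ACIDS]

# every residue occurs once in this 20-mer, so each fraction is 1/20
features = aa_composition("ACDEFGHIKLMNPQRSTVWY")
```

Either vector can then be paired with a class label to form one training instance for any Weka classifier.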

Example 1: Predicting linear B-cell epitopes using the FBCPred model

FBCPred [9] is a recent method for predicting flexible-length linear B-cell epitopes using the subsequence kernel. An implementation of this method is available on the BCPREDS web server; however, users are restricted to submitting one protein sequence at a time. In this example, we demonstrate how to use the Predictor application in EpiT and the FBCPred model file provided in the Examples folder to predict potential linear B-cell epitopes.

1. Run EpiT.
2. Go to the Application menu and select the Predictor application.
3. Press the Model button to open a file dialog and use it to enter ./examples/models/fbcpred.model.
4. Press the Test button to open a file dialog and use it to enter the file containing the test sequences in FASTA format, ./examples/data/test.fasta.txt.
5. Press the Output button to open a save file dialog and use it to specify the path and name of the file that the predictions will be written to (e.g., ./examples/fbcpred.test.out.txt).
6. Set the peptide length to 14 (the default value for the FBCPred method).
7. Press the Predict button to get the predictions (see Figure 1).
8. Change the test file to ./examples/data/abcpred.blind.txt. This is the blind test set published by Saha et al. [10].
9. Set the output file to ./examples/data/fbcpred.abcpred.out.txt.
10. Change the input format to epitopes list. Note that the peptide length will change to -1. This means that full-length test epitopes will be fed to the model for prediction, without applying a sliding window to fix the length of the test peptides submitted to the classifier.
11. Press the Predict button to get the predictions (see Figure 2).
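The two input modes above (sliding window over FASTA sequences versus full-length epitopes at length -1) can be sketched as follows. The `score` argument is a hypothetical stand-in for a loaded model; the toy scorer is an illustration only, not FBCPred.

```python
# Sketch of the Predictor's sliding-window scan (assumption: `score`
# stands in for a loaded model file). Each yielded row mirrors the
# four tab-separated output columns: id, position, peptide, score.

def scan(antigen_id, sequence, score, peptide_length=14):
    """Yield one prediction per window; peptide_length == -1 means
    score the full sequence as a single peptide (epitopes-list mode)."""
    if peptide_length == -1:
        yield (antigen_id, 1, sequence, score(sequence))
        return
    for start in range(len(sequence) - peptide_length + 1):
        peptide = sequence[start:start + peptide_length]
        yield (antigen_id, start + 1, peptide, score(peptide))

# toy scorer: fraction of a few polar residues (illustration only)
toy_score = lambda p: sum(p.count(aa) for aa in "DEKNQRS") / len(p)

for row in scan("antigen1", "MKLVDERKNSTAQPLV", toy_score):
    print("\t".join(str(x) for x in row))
```

A 16-residue sequence thus yields three 14-mer predictions, while epitopes-list mode yields exactly one row per input epitope.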

Figure 1: Output predictions of applying the FBCPred model to antigen sequences in test.fasta.txt.

Figure 2: Output predictions of applying the FBCPred model to the ABCPred blind test set in abcpred.blind.txt.

Example 2: Developing a Position Specific Scoring Matrix (PSSM) for predicting 20-mer linear B-cell epitope peptides

1- Run EpiT.
2- Go to the Application menu and select the Model builder application. A modified version of the Weka explorer will be displayed.
3- Press the open file button and use the open file dialog to open ./examples/data/bcpred20.nr80.arff. This is the dataset that was used to develop the 20-mer peptide classifier for the BCPred method [11]. Each instance is 20 residues in length and is associated with a binary label indicating whether the corresponding peptide is a linear B-cell epitope or not. Figure 3 provides some useful information about this dataset.
4- Click the Classify tab.
5- Click the Choose button to select the classification method and select epit.classifiers.matrix.pssmclassifier (see Figure 4).
6- Click the Start button to begin a 10-fold cross-validation test evaluating the PSSM classifier on the BCPred 20-mer dataset. At the end, the program will output the PSSM matrix constructed from the entire training dataset, along with several performance metrics obtained from the cross-validation test. For more details, please see the Weka explorer tutorial.
7- In the result panel, right-click on the classifier name, select Save model from the popup menu, and save the model as ./examples/models/pssm.model (see Figure 5).
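A minimal sketch of how a PSSM of this kind can be built: per-position log-odds of observed amino acid frequencies (with pseudocounts) against background probabilities. This is a generic PSSM construction under stated assumptions, not EpiT's exact pssmclassifier; passing `background=None` corresponds to the uniform-background variant discussed below.

```python
# Generic PSSM sketch (not EpiT's implementation): log-odds of
# position-specific frequencies in positive peptides versus a
# background distribution (e.g., estimated from negative data).
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(positives, background=None, pseudocount=1.0):
    """Build a PSSM from equal-length positive peptides.
    `background` maps residue -> probability; None means uniform."""
    length = len(positives[0])
    if background is None:
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    pssm = []
    for pos in range(length):
        column = [p[pos] for p in positives]
        denom = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({aa: math.log(((column.count(aa) + pseudocount) / denom)
                                  / background[aa])
                     for aa in AMINO_ACIDS})
    return pssm

def pssm_score(pssm, peptide):
    """Score a peptide as the sum of per-position log-odds."""
    return sum(col[aa] for col, aa in zip(pssm, peptide))
```

A peptide that matches the positionally preferred residues of the positive training set receives a higher score than one that does not.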

Figure 3: EpiT model builder, an extended version of the Weka GUI explorer.

Figure 4: Selecting the PSSM classifier.

Figure 5: Saving the PSSM model.

Note that the default setting for the PSSM method is to use negative data for estimating background probabilities. Alternatively, one can disable this option and assume uniform background probabilities. The performance of the PSSM model in that case is lower than that obtained using negative training data to estimate the background probabilities (see Figure 6).

Figure 6: Poor performance of the PSSM model built using positive information only, assuming uniform background probabilities.

Example 3: Developing a propensity scale based method for predicting linear B-cell epitopes

1- Run EpiT.
2- Go to the Application menu and select the Model builder application. A modified version of the Weka explorer will be displayed.
3- Press the open file button and use the open file dialog to open ./examples/data/bcpred20.nr80.arff.
4- Click the Classify tab.
5- Click the Choose button to select the classification method and select epit.classifiers.propensity.propensityscale. The default parameter settings for this method are: the standard 20 amino acid alphabet, Parker's hydrophilicity scale, and window size = -1.
6- Click the Start button to begin a 10-fold cross-validation test evaluating the PropensityScale classifier on the BCPred 20-mer dataset.

7- In the result panel, right-click on the classifier name, select Save model from the popup menu, and save the model as ./examples/models/parker.model.

It should be mentioned that the EpiT distribution includes 544 amino acid propensity scales extracted from AAIndex. Any of these scales can be used with the PropensityScale classifier instead of the default Parker's hydrophilicity scale.

Example 4: Peptide-based and residue-based linear B-cell epitope prediction using Parker's propensity scale

1. Run EpiT.
2. Go to the Application menu and select the Predictor application.
3. Press the Model button to open a file dialog and use it to enter ./examples/models/parker.model.
4. Press the Test button to open a file dialog and use it to enter the file containing the test sequences in FASTA format, ./examples/data/test.fasta.txt.
5. Press the Output button to open a save file dialog and use it to specify the path and name of the file that the predictions will be written to (e.g., ./examples/parker.test.peptide.out.txt).
6. Set the peptide length to 14. Note that setting the window size to -1 when building parker.model allows us to evaluate it using any peptide/window length. Otherwise, we would have to use the exact size specified during the training of the model.
7. Press the Predict button to get predictions for each 14-mer peptide in the test sequences.
8. Change the instance type to residue-based.
9. Set the window length to 7 (it has to be an odd number).
10. Set the output file to parker.test.residue.out.txt.
11. Press the Predict button to get prediction scores for each residue in the test sequences (see Figure 7).
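The peptide-based and residue-based modes of a propensity-scale method can be sketched as follows. The scale values below are made-up stand-ins, not Parker's actual published values; peptide mode averages the scale over the whole peptide, while residue mode averages over an odd-length window centered on each residue.

```python
# Sketch of propensity-scale prediction (illustrative scale values
# only -- NOT Parker's published hydrophilicity scale).

TOY_SCALE = {"A": 2.1, "C": 1.4, "D": 10.0, "E": 7.8, "G": 5.7,
             "K": 10.1, "L": -9.2, "S": 6.5, "V": -3.7, "R": 4.2}

def peptide_score(peptide, scale=TOY_SCALE):
    """Peptide-based mode: mean scale value over the whole peptide."""
    return sum(scale.get(aa, 0.0) for aa in peptide) / len(peptide)

def residue_scores(sequence, window=7, scale=TOY_SCALE):
    """Residue-based mode: one score per residue, the mean scale
    value over a window centered on that residue (window must be
    odd; windows are truncated at the sequence ends)."""
    assert window % 2 == 1, "window length has to be an odd number"
    half = window // 2
    return [peptide_score(sequence[max(0, i - half): i + half + 1], scale)
            for i in range(len(sequence))]
```

Swapping in a different AAIndex scale amounts to replacing the score dictionary; the two scoring modes are otherwise unchanged.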

Figure 7: Residue-based classification using parker.model.

Example 5: Developing a Naïve Bayes classifier for predicting linear B-cell epitopes using amino acid composition information

Because the majority of Weka's implemented algorithms, including the Naïve Bayes classifier, are not applicable to datasets with string attributes, EpiT provides a set of filters for converting epitope sequences into feature vectors.

1- Run EpiT.
2- Go to the Application menu and select the Model builder application. A modified version of the Weka explorer will be displayed.
3- Press the open file button and use the open file dialog to open ./examples/data/bcpred20.nr80.arff.
4- Click the Classify tab.
5- Click the Choose button to select the classification method and select weka.classifiers.meta.FilteredClassifier.

6- Left-click on the classifier name to edit the FilteredClassifier properties. Set the classifier to weka.classifiers.bayes.NaiveBayes. Set the filter to epit.filters.unsupervised.attribute.sequencecomposition. Click OK to close the properties window.
7- Click the Start button to begin a 10-fold cross-validation test evaluating the model on the BCPred 20-mer dataset.
8- In the result panel, right-click on the classifier name, select Save model from the popup menu, and save the model as ./examples/models/nbac.model.

Example 6: Developing a consensus predictor for predicting flexible-length linear B-cell epitopes

Let's assume that we have several models for predicting flexible-length linear B-cell epitopes. Our goal is to combine the predictions of these models into a consensus prediction. In general, we expect a consensus method combining several methods to outperform any individual method. There are two ways of obtaining consensus predictions. First, one can use the Predictor application to apply every individual model to the test data and then combine the output predictions into a consensus prediction (e.g., by importing the predictions into an Excel sheet and combining them, or by writing a simple script). Second, one can use the weka.classifiers.meta.Vote classifier together with epit.classifiers.meta.modelbased to build a consensus predictor and use the Predictor application to apply it to the test data.

1- Run EpiT.
2- Go to the Application menu and select the Model builder application. A modified version of the Weka explorer will be displayed.
3- Press the open file button and use the open file dialog to open ./examples/data/bcpred20.nr80.arff.
4- Click the Classify tab.
5- Click the Choose button to select the classification method and select weka.classifiers.meta.Vote.
6- Left-click on the classifier name to edit the Vote classifier properties. For the classifiers property, add two epit.classifiers.meta.modelbased classifiers and set their ModelFile property to ./examples/models/fbcpred.model and ./examples/models/parker.model, respectively.

7- Select "Use training set" as the test option and click the Start button to begin evaluating the consensus model on the BCPred 20-mer dataset. It should be noted that fbcpred.model was built using the FBCPred dataset, while in this example the consensus model is evaluated on the BCPred 20-mer dataset. Because both datasets were extracted from the BciPep database, the reported performance is expected to be overoptimistic. If your goal is to evaluate a consensus model combining FBCPred and Parker's hydrophilicity scale, then you should use Vote to combine an SMO classifier with the subsequence kernel (the FBCPred method) and a PropensityScale classifier.
8- In the result panel, right-click on the classifier name, select Save model from the popup menu, and save the model as ./examples/models/consensus.model.

Figure 8: Setting the properties of the Vote classifier.
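Conceptually, the Vote-based route above amounts to score averaging over the base predictors. The sketch below illustrates this combination rule; the two lambda "models" are hypothetical stand-ins for saved model files, not the actual FBCPred or Parker methods.

```python
# Conceptual sketch of a consensus (Vote-style) predictor: average
# the scores of several base models and threshold the result. The
# toy models stand in for loaded model files (illustration only).

def consensus(models, peptide):
    """Average-of-scores combination rule over base predictors."""
    scores = [model(peptide) for model in models]
    return sum(scores) / len(scores)

def consensus_label(models, peptide, threshold=0.5):
    """Turn the averaged score into a binary decision."""
    return "epitope" if consensus(models, peptide) >= threshold else "non-epitope"

model_a = lambda p: 0.8   # stand-in for one saved model
model_b = lambda p: 0.4   # stand-in for another
```

The first route described above (combining Predictor output files in a spreadsheet or script) computes exactly the same quantity, just outside EpiT.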

Example 7: Using EpiT to build a hybrid predictor

Briefly, you can follow the approach described in Example 6 to use any Weka meta-classifier to build a hybrid model combining several existing models (each model encapsulated in a ModelBased classifier), or to build and evaluate a hybrid model combining several prediction methods.

Example 8: Using EpiT to build semi-supervised predictors

Semi-supervised learning offers a powerful approach for leveraging (often large amounts of) unlabeled data U together with modest amounts of labeled data L to train predictive models that are more accurate than those that could be trained using only the available labeled data. In this example, we show how to use the semi-supervised self-training algorithm to build a linear B-cell epitope prediction model that outperforms its supervised counterpart. We also demonstrate how to use potentially labeled data (e.g., expert-annotated data with no experimental validation) to further improve the performance of self-training semi-supervised predictors. More details about these two algorithms are provided in [12].

1- Run EpiT.
2- Go to the Application menu and select the Model builder application. A modified version of the Weka explorer will be displayed.
3- Press the open file button and use the open file dialog to open ./examples/data/ssl/BCPred16.nr80-L.arff. This is the labeled dataset for predicting linear B-cell epitopes.
4- Click the Classify tab.
5- Click the Choose button to select the classification method and select epit.classifiers.ssl.selftrain.
6- Click the classifier panel to set the parameters of the self-training classifier as shown in Figure 9. Briefly, set the baseclassifier and finalclassifier to weka.classifiers.trees.RandomForest and set unlabeleddata to the full path of the file ./examples/data/ssl/BCPred16.nr80-U.arff. Then click OK.
7- In the Test options panel, select Supplied test set and set the test set to ./examples/data/ssl/BCPred16.nr80-U.arff.
8- Click the Start button to train a semi-supervised model using the labeled data, BCPred16.nr80-L.arff, and the unlabeled data, BCPred16.nr80-U.arff. The learned model will then be evaluated on the unlabeled data BCPred16.nr80-U.arff and the evaluation performance will be reported (see Figure 10).

Figure 9: Setting the parameters for the SelfTrain classifier.
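The self-training loop can be sketched generically as follows. This is a standard self-training scheme, not EpiT's exact SelfTrain implementation: train a base classifier on the labeled data, pseudo-label the unlabeled instances it is most confident about, add them to the training set, and retrain. The `fit` interface and the 1-D toy learner are hypothetical.

```python
# Generic self-training sketch (standard scheme; not EpiT's exact
# SelfTrain implementation). `fit(labeled)` returns score(x) in
# [0, 1]; unlabeled items scored confidently high/low get
# pseudo-labels each round and join the training set.

def self_train(fit, labeled, unlabeled, rounds=5, threshold=0.8):
    """`labeled` is a list of (x, y) pairs with y in {0, 1}."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        score = fit(labeled)
        confident = [(x, 1 if score(x) >= threshold else 0)
                     for x in pool
                     if score(x) >= threshold or score(x) <= 1 - threshold]
        if not confident:
            break
        taken = {id(x) for x, _ in confident}
        pool = [x for x in pool if id(x) not in taken]
        labeled.extend(confident)
    return fit(labeled)          # the "final classifier"

# toy base learner on 1-D points: score 1.0 if x is closer to the
# positive class mean than to the negative one (illustration only)
def toy_fit(labeled):
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    mp, mn = sum(pos) / len(pos), sum(neg) / len(neg)
    return lambda x: 1.0 if abs(x - mp) < abs(x - mn) else 0.0

final = self_train(toy_fit, [(0.0, 0), (10.0, 1)], [1.0, 9.0, 2.0, 8.0])
```

With only two labeled points, the loop absorbs the four unlabeled points as pseudo-labeled examples, so the final classifier's class means are estimated from six points instead of two.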

Figure 10: Performance of the SelfTrain classifier trained using labeled and unlabeled data.

In cases where potentially labeled data is available, the SelfTrain algorithm can be configured to leverage it to improve predictive performance. To build self-training classifiers using labeled, unlabeled, and potentially labeled data, follow the preceding procedure, but in step 6 provide the full paths for both the unlabeled and the potentially labeled data (see Figure 11). The improved result is shown in Figure 12.

Figure 11: Updating the parameters of the SelfTrain classifier to use potentially labeled data.

Figure 12: Performance of the SelfTrain classifier trained using labeled, unlabeled, and potentially labeled data.

Updating an existing model

A useful feature of EpiT is that it allows anyone to rebuild an existing model. Assume that you have augmented the FBCPred dataset with newly reported epitope data and your goal is to rebuild your own FBCPred model on the modified dataset. Note that in Figure 1, the Predictor application reports the classification method and the parameters that were used to build the original FBCPred model. Therefore, to build your own updated FBCPred model, you can use this information together with the Model builder application to evaluate and build your own model.

Extending EpiT

EpiT is an open source project under the GNU General Public License (GPL). This ensures that anyone can freely extend or change this software, as long as the modified software is licensed under the GNU GPL. We encourage bioinformatics developers to participate in EpiT by contributing new components (e.g., filters or machine learning methods), new epitope datasets in Weka-accepted formats, or new epitope prediction tools in the form of model files.

References

[1] Witten, I., Frank, E. Data mining: Practical machine learning tools and techniques, 2nd Edition. Morgan Kaufmann.
[2] Chen, J., Liu, H., Yang, J., Chou, K. Prediction of linear B-cell epitopes using amino acid pair antigenicity scale. Amino Acids 33.
[3] Cui, J., Han, L., Lin, H., Tan, Z., Jiang, L., Cao, Z., Chen, Y. MHC-BPS: MHC binder prediction server for identifying peptides of flexible lengths from sequence derived physicochemical properties. Immunogenetics 58.
[4] EL-Manzalawy, Y., Dobbs, D., Honavar, V., 2008a. On evaluating MHC-II binding peptide prediction methods. PLoS ONE 3.
[5] Parker, J., Guo, D., Hodges, R. New hydrophilicity scale derived from high-performance liquid chromatography peptide retention data: correlation of predicted surface residues with antigenicity and X-ray-derived accessible sites. Biochemistry 25.
[6] Henikoff, J., Henikoff, S. Using substitution probabilities to improve position-specific scoring matrices. Bioinformatics 12.
[7] EL-Manzalawy, Y., Dobbs, D., Honavar, V. Predicting MHC-II binding affinity using multiple instance regression. Submitted to IEEE/ACM Trans Comput Biol Bioinform.

[8] EL-Manzalawy, Y., Dobbs, D., Honavar, V., 2008c. Predicting linear B-cell epitopes using evolutionary information. IEEE International Conference on Bioinformatics and Biomedicine.
[9] EL-Manzalawy, Y., Dobbs, D., Honavar, V., 2008b. Predicting flexible length linear B-cell epitopes. 7th International Conference on Computational Systems Bioinformatics.
[10] Saha, S., Raghava, G., 2006b. Prediction of continuous B-cell epitopes in an antigen using recurrent neural network. Proteins 65.
[11] EL-Manzalawy, Y., Dobbs, D., Honavar, V., 2008d. Predicting linear B-cell epitopes using string kernels. J. Mol. Recognit. 21.
[12] El-Manzalawy, Y., Munoz, E., Lindner, S., Honavar, V. PlasmoSEP: Predicting surface exposed proteins on the malaria parasite using semi-supervised self-training and expert-annotated data. Submitted.


More information

Semi-Supervised Abstraction-Augmented String Kernel for bio-relationship Extraction

Semi-Supervised Abstraction-Augmented String Kernel for bio-relationship Extraction Semi-Supervised Abstraction-Augmented String Kernel for bio-relationship Extraction Pavel P. Kuksa, Rutgers University Yanjun Qi, Bing Bai, Ronan Collobert, NEC Labs Jason Weston, Google Research NY Vladimir

More information

MS1b Statistical Data Mining Part 3: Supervised Learning Nonparametric Methods

MS1b Statistical Data Mining Part 3: Supervised Learning Nonparametric Methods MS1b Statistical Data Mining Part 3: Supervised Learning Nonparametric Methods Yee Whye Teh Department of Statistics Oxford http://www.stats.ox.ac.uk/~teh/datamining.html Outline Supervised Learning: Nonparametric

More information

Bio3D: Interactive Tools for Structural Bioinformatics.

Bio3D: Interactive Tools for Structural Bioinformatics. Bio3D: Interactive Tools for Structural Bioinformatics http://thegrantlab.org/bio3d/ What is Bio3D A freely distributed and widely used R package for structural bioinformatics. Provides a large number

More information

Data mining: concepts and algorithms

Data mining: concepts and algorithms Data mining: concepts and algorithms Practice Data mining Objective Exploit data mining algorithms to analyze a real dataset using the RapidMiner machine learning tool. The practice session is organized

More information

WEKA KnowledgeFlow Tutorial for Version 3-5-6

WEKA KnowledgeFlow Tutorial for Version 3-5-6 WEKA KnowledgeFlow Tutorial for Version 3-5-6 Mark Hall Peter Reutemann June 1, 2007 c 2007 University of Waikato Contents 1 Introduction 2 2 Features 3 3 Components 4 3.1 DataSources..............................

More information

RAPIDMINER FREE SOFTWARE FOR DATA MINING, ANALYTICS AND BUSINESS INTELLIGENCE

RAPIDMINER FREE SOFTWARE FOR DATA MINING, ANALYTICS AND BUSINESS INTELLIGENCE RAPIDMINER FREE SOFTWARE FOR DATA MINING, ANALYTICS AND BUSINESS INTELLIGENCE Luigi Grimaudo (luigi.grimaudo@polito.it) DataBase And Data Mining Research Group (DBDMG) Summary RapidMiner project Strengths

More information

Summary. RapidMiner Project 12/13/2011 RAPIDMINER FREE SOFTWARE FOR DATA MINING, ANALYTICS AND BUSINESS INTELLIGENCE

Summary. RapidMiner Project 12/13/2011 RAPIDMINER FREE SOFTWARE FOR DATA MINING, ANALYTICS AND BUSINESS INTELLIGENCE RAPIDMINER FREE SOFTWARE FOR DATA MINING, ANALYTICS AND BUSINESS INTELLIGENCE Luigi Grimaudo (luigi.grimaudo@polito.it) DataBase And Data Mining Research Group (DBDMG) Summary RapidMiner project Strengths

More information

Lecture #11: The Perceptron

Lecture #11: The Perceptron Lecture #11: The Perceptron Mat Kallada STAT2450 - Introduction to Data Mining Outline for Today Welcome back! Assignment 3 The Perceptron Learning Method Perceptron Learning Rule Assignment 3 Will be

More information

Feature Selection Using Modified-MCA Based Scoring Metric for Classification

Feature Selection Using Modified-MCA Based Scoring Metric for Classification 2011 International Conference on Information Communication and Management IPCSIT vol.16 (2011) (2011) IACSIT Press, Singapore Feature Selection Using Modified-MCA Based Scoring Metric for Classification

More information

Automatic Labeling of Issues on Github A Machine learning Approach

Automatic Labeling of Issues on Github A Machine learning Approach Automatic Labeling of Issues on Github A Machine learning Approach Arun Kalyanasundaram December 15, 2014 ABSTRACT Companies spend hundreds of billions in software maintenance every year. Managing and

More information

Contents. ACE Presentation. Comparison with existing frameworks. Technical aspects. ACE 2.0 and future work. 24 October 2009 ACE 2

Contents. ACE Presentation. Comparison with existing frameworks. Technical aspects. ACE 2.0 and future work. 24 October 2009 ACE 2 ACE Contents ACE Presentation Comparison with existing frameworks Technical aspects ACE 2.0 and future work 24 October 2009 ACE 2 ACE Presentation 24 October 2009 ACE 3 ACE Presentation Framework for using

More information

Fraud Detection Using Random Forest Algorithm

Fraud Detection Using Random Forest Algorithm Fraud Detection Using Random Forest Algorithm Eesha Goel Computer Science Engineering and Technology, GZSCCET, Bhatinda, India eesha1992@rediffmail.com Abhilasha Computer Science Engineering and Technology,

More information

Machine Learning Techniques for Data Mining

Machine Learning Techniques for Data Mining Machine Learning Techniques for Data Mining Eibe Frank University of Waikato New Zealand 10/25/2000 1 PART VII Moving on: Engineering the input and output 10/25/2000 2 Applying a learner is not all Already

More information

Feature Selection in Learning Using Privileged Information

Feature Selection in Learning Using Privileged Information November 18, 2017 ICDM 2017 New Orleans Feature Selection in Learning Using Privileged Information Rauf Izmailov, Blerta Lindqvist, Peter Lin rizmailov@vencorelabs.com Phone: 908-748-2891 Agenda Learning

More information

Efficient Pairwise Classification

Efficient Pairwise Classification Efficient Pairwise Classification Sang-Hyeun Park and Johannes Fürnkranz TU Darmstadt, Knowledge Engineering Group, D-64289 Darmstadt, Germany Abstract. Pairwise classification is a class binarization

More information

EECS730: Introduction to Bioinformatics

EECS730: Introduction to Bioinformatics EECS730: Introduction to Bioinformatics Lecture 06: Multiple Sequence Alignment https://upload.wikimedia.org/wikipedia/commons/thumb/7/79/rplp0_90_clustalw_aln.gif/575px-rplp0_90_clustalw_aln.gif Slides

More information

Practical Data Mining COMP-321B. Tutorial 4: Preprocessing

Practical Data Mining COMP-321B. Tutorial 4: Preprocessing Practical Data Mining COMP-321B Tutorial 4: Preprocessing Shevaun Ryan Mark Hall June 30, 2008 c 2006 University of Waikato 1 Introduction For this tutorial we will be using the Preprocess panel, the Classify

More information

Evaluation of different biological data and computational classification methods for use in protein interaction prediction.

Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Yanjun Qi, Ziv Bar-Joseph, Judith Klein-Seetharaman Protein 2006 Motivation Correctly

More information

Performance Evaluation of Different Classifier for Big data in Data mining Industries

Performance Evaluation of Different Classifier for Big data in Data mining Industries Journal of Engineering and Science Research, 2 (1): 11-17, 2018 e-issn:2289-7127 RMPPublications, 2018 DOI:10.26666/rmp.jesr.2018.1.3 Performance Evaluation of Different Classifier for Big data in Data

More information

Gnome Data Mine Tools Evaluation Report

Gnome Data Mine Tools Evaluation Report Gnome Data Mine Tools Evaluation Report CMPUT695 Assignment 2 Haobin Li, Junfeng Wu Thursday, November 04, 2004 Overview The gnome-data-mine-tools (GDataMine) is an open source data mining tool set which

More information

Tutorial 2: Analysis of DIA/SWATH data in Skyline

Tutorial 2: Analysis of DIA/SWATH data in Skyline Tutorial 2: Analysis of DIA/SWATH data in Skyline In this tutorial we will learn how to use Skyline to perform targeted post-acquisition analysis for peptide and inferred protein detection and quantification.

More information

Implementing Breiman s Random Forest Algorithm into Weka

Implementing Breiman s Random Forest Algorithm into Weka Implementing Breiman s Random Forest Algorithm into Weka Introduction Frederick Livingston A classical machine learner is developed by collecting samples of data to represent the entire population. This

More information

New String Kernels for Biosequence Data

New String Kernels for Biosequence Data Workshop on Kernel Methods in Bioinformatics New String Kernels for Biosequence Data Christina Leslie Department of Computer Science Columbia University Biological Sequence Classification Problems Protein

More information

Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers

Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers James P. Biagioni Piotr M. Szczurek Peter C. Nelson, Ph.D. Abolfazl Mohammadian, Ph.D. Agenda Background

More information

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction CHAPTER 5 SUMMARY AND CONCLUSION Chapter 1: Introduction Data mining is used to extract the hidden, potential, useful and valuable information from very large amount of data. Data mining tools can handle

More information

Data Mining: Classifier Evaluation. CSCI-B490 Seminar in Computer Science (Data Mining)

Data Mining: Classifier Evaluation. CSCI-B490 Seminar in Computer Science (Data Mining) Data Mining: Classifier Evaluation CSCI-B490 Seminar in Computer Science (Data Mining) Predictor Evaluation 1. Question: how good is our algorithm? how will we estimate its performance? 2. Question: what

More information

NUS-I2R: Learning a Combined System for Entity Linking

NUS-I2R: Learning a Combined System for Entity Linking NUS-I2R: Learning a Combined System for Entity Linking Wei Zhang Yan Chuan Sim Jian Su Chew Lim Tan School of Computing National University of Singapore {z-wei, tancl} @comp.nus.edu.sg Institute for Infocomm

More information

Data Mining Practical Machine Learning Tools And Techniques With Java Implementations The Morgan Kaufmann Series In Data Management Systems

Data Mining Practical Machine Learning Tools And Techniques With Java Implementations The Morgan Kaufmann Series In Data Management Systems Data Mining Practical Machine Learning Tools And Techniques With Java Implementations The Morgan Kaufmann We have made it easy for you to find a PDF Ebooks without any digging. And by having access to

More information

The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval. Kevin C. O'Kane. Department of Computer Science

The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval. Kevin C. O'Kane. Department of Computer Science The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval Kevin C. O'Kane Department of Computer Science The University of Northern Iowa Cedar Falls, Iowa okane@cs.uni.edu http://www.cs.uni.edu/~okane

More information

Collective Intelligence in Action

Collective Intelligence in Action Collective Intelligence in Action SATNAM ALAG II MANNING Greenwich (74 w. long.) contents foreword xv preface xvii acknowledgments xix about this book xxi PART 1 GATHERING DATA FOR INTELLIGENCE 1 "1 Understanding

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 08: Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu October 24, 2017 Learnt Prediction and Classification Methods Vector Data

More information

Classification Algorithms in Data Mining

Classification Algorithms in Data Mining August 9th, 2016 Suhas Mallesh Yash Thakkar Ashok Choudhary CIS660 Data Mining and Big Data Processing -Dr. Sunnie S. Chung Classification Algorithms in Data Mining Deciding on the classification algorithms

More information

Comparative Study of Instance Based Learning and Back Propagation for Classification Problems

Comparative Study of Instance Based Learning and Back Propagation for Classification Problems Comparative Study of Instance Based Learning and Back Propagation for Classification Problems 1 Nadia Kanwal, 2 Erkan Bostanci 1 Department of Computer Science, Lahore College for Women University, Lahore,

More information

Performance Analysis of Data Mining Classification Techniques

Performance Analysis of Data Mining Classification Techniques Performance Analysis of Data Mining Classification Techniques Tejas Mehta 1, Dr. Dhaval Kathiriya 2 Ph.D. Student, School of Computer Science, Dr. Babasaheb Ambedkar Open University, Gujarat, India 1 Principal

More information

Multiple-Choice Questionnaire Group C

Multiple-Choice Questionnaire Group C Family name: Vision and Machine-Learning Given name: 1/28/2011 Multiple-Choice naire Group C No documents authorized. There can be several right answers to a question. Marking-scheme: 2 points if all right

More information

Automatic Domain Partitioning for Multi-Domain Learning

Automatic Domain Partitioning for Multi-Domain Learning Automatic Domain Partitioning for Multi-Domain Learning Di Wang diwang@cs.cmu.edu Chenyan Xiong cx@cs.cmu.edu William Yang Wang ww@cmu.edu Abstract Multi-Domain learning (MDL) assumes that the domain labels

More information

A Comparative Study of Locality Preserving Projection and Principle Component Analysis on Classification Performance Using Logistic Regression

A Comparative Study of Locality Preserving Projection and Principle Component Analysis on Classification Performance Using Logistic Regression Journal of Data Analysis and Information Processing, 2016, 4, 55-63 Published Online May 2016 in SciRes. http://www.scirp.org/journal/jdaip http://dx.doi.org/10.4236/jdaip.2016.42005 A Comparative Study

More information

Semi-supervised Learning

Semi-supervised Learning Semi-supervised Learning Piyush Rai CS5350/6350: Machine Learning November 8, 2011 Semi-supervised Learning Supervised Learning models require labeled data Learning a reliable model usually requires plenty

More information

A Survey on Postive and Unlabelled Learning

A Survey on Postive and Unlabelled Learning A Survey on Postive and Unlabelled Learning Gang Li Computer & Information Sciences University of Delaware ligang@udel.edu Abstract In this paper we survey the main algorithms used in positive and unlabeled

More information

Function Algorithms: Linear Regression, Logistic Regression

Function Algorithms: Linear Regression, Logistic Regression CS 4510/9010: Applied Machine Learning 1 Function Algorithms: Linear Regression, Logistic Regression Paula Matuszek Fall, 2016 Some of these slides originated from Andrew Moore Tutorials, at http://www.cs.cmu.edu/~awm/tutorials.html

More information

A Critical Study of Selected Classification Algorithms for Liver Disease Diagnosis

A Critical Study of Selected Classification Algorithms for Liver Disease Diagnosis A Critical Study of Selected Classification s for Liver Disease Diagnosis Shapla Rani Ghosh 1, Sajjad Waheed (PhD) 2 1 MSc student (ICT), 2 Associate Professor (ICT) 1,2 Department of Information and Communication

More information

Multi-label classification using rule-based classifier systems

Multi-label classification using rule-based classifier systems Multi-label classification using rule-based classifier systems Shabnam Nazmi (PhD candidate) Department of electrical and computer engineering North Carolina A&T state university Advisor: Dr. A. Homaifar

More information

Classification and Regression

Classification and Regression Classification and Regression Announcements Study guide for exam is on the LMS Sample exam will be posted by Monday Reminder that phase 3 oral presentations are being held next week during workshops Plan

More information

Programming assignment for the course Sequence Analysis (2006)

Programming assignment for the course Sequence Analysis (2006) Programming assignment for the course Sequence Analysis (2006) Original text by John W. Romein, adapted by Bart van Houte (bart@cs.vu.nl) Introduction Please note: This assignment is only obligatory for

More information

Noise-based Feature Perturbation as a Selection Method for Microarray Data

Noise-based Feature Perturbation as a Selection Method for Microarray Data Noise-based Feature Perturbation as a Selection Method for Microarray Data Li Chen 1, Dmitry B. Goldgof 1, Lawrence O. Hall 1, and Steven A. Eschrich 2 1 Department of Computer Science and Engineering

More information

ECLT 5810 Evaluation of Classification Quality

ECLT 5810 Evaluation of Classification Quality ECLT 5810 Evaluation of Classification Quality Reference: Data Mining Practical Machine Learning Tools and Techniques, by I. Witten, E. Frank, and M. Hall, Morgan Kaufmann Testing and Error Error rate:

More information

Practical Data Mining COMP-321B. Tutorial 5: Article Identification

Practical Data Mining COMP-321B. Tutorial 5: Article Identification Practical Data Mining COMP-321B Tutorial 5: Article Identification Shevaun Ryan Mark Hall August 15, 2006 c 2006 University of Waikato 1 Introduction This tutorial will focus on text mining, using text

More information

AI32 Guide to Weka. Andrew Roberts 1st March 2005

AI32 Guide to Weka. Andrew Roberts   1st March 2005 AI32 Guide to Weka Andrew Roberts http://www.comp.leeds.ac.uk/andyr 1st March 2005 1 Introduction Weka is an excellent system for learning about machine learning techniques. Of course, it is a generic

More information

An Adaptive Framework for Multistream Classification

An Adaptive Framework for Multistream Classification An Adaptive Framework for Multistream Classification Swarup Chandra, Ahsanul Haque, Latifur Khan and Charu Aggarwal* University of Texas at Dallas *IBM Research This material is based upon work supported

More information

Machine Learning Practical NITP Summer Course Pamela K. Douglas UCLA Semel Institute

Machine Learning Practical NITP Summer Course Pamela K. Douglas UCLA Semel Institute Machine Learning Practical NITP Summer Course 2013 Pamela K. Douglas UCLA Semel Institute Email: pamelita@g.ucla.edu Topics Covered Part I: WEKA Basics J Part II: MONK Data Set & Feature Selection (from

More information

Data Imbalance Problem solving for SMOTE Based Oversampling: Study on Fault Detection Prediction Model in Semiconductor Manufacturing Process

Data Imbalance Problem solving for SMOTE Based Oversampling: Study on Fault Detection Prediction Model in Semiconductor Manufacturing Process Vol.133 (Information Technology and Computer Science 2016), pp.79-84 http://dx.doi.org/10.14257/astl.2016. Data Imbalance Problem solving for SMOTE Based Oversampling: Study on Fault Detection Prediction

More information

Jue Wang (Joyce) Department of Computer Science, University of Massachusetts, Boston Feb Outline

Jue Wang (Joyce) Department of Computer Science, University of Massachusetts, Boston Feb Outline Learn to Use Weka Jue Wang (Joyce) Department of Computer Science, University of Massachusetts, Boston Feb-09-2010 Outline Introduction of Weka Explorer Filter Classify Cluster Experimenter KnowledgeFlow

More information

Taxonomically Clustering Organisms Based on the Profiles of Gene Sequences Using PCA

Taxonomically Clustering Organisms Based on the Profiles of Gene Sequences Using PCA Journal of Computer Science 2 (3): 292-296, 2006 ISSN 1549-3636 2006 Science Publications Taxonomically Clustering Organisms Based on the Profiles of Gene Sequences Using PCA 1 E.Ramaraj and 2 M.Punithavalli

More information

CS 8803 AIAD Prof Ling Liu. Project Proposal for Automated Classification of Spam Based on Textual Features Gopal Pai

CS 8803 AIAD Prof Ling Liu. Project Proposal for Automated Classification of Spam Based on Textual Features Gopal Pai CS 8803 AIAD Prof Ling Liu Project Proposal for Automated Classification of Spam Based on Textual Features Gopal Pai Under the supervision of Steve Webb Motivations and Objectives Spam, which was until

More information

CS570: Introduction to Data Mining

CS570: Introduction to Data Mining CS570: Introduction to Data Mining Classification Advanced Reading: Chapter 8 & 9 Han, Chapters 4 & 5 Tan Anca Doloc-Mihu, Ph.D. Slides courtesy of Li Xiong, Ph.D., 2011 Han, Kamber & Pei. Data Mining.

More information

.. Fall 2011 CSC 570: Bioinformatics Alexander Dekhtyar..

.. Fall 2011 CSC 570: Bioinformatics Alexander Dekhtyar.. .. Fall 2011 CSC 570: Bioinformatics Alexander Dekhtyar.. PAM and BLOSUM Matrices Prepared by: Jason Banich and Chris Hoover Background As DNA sequences change and evolve, certain amino acids are more

More information

Didacticiel - Études de cas

Didacticiel - Études de cas Subject In some circumstances, the goal of the supervised learning is not to classify examples but rather to organize them in order to point up the most interesting individuals. For instance, in the direct

More information

A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion

A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion Ye Tian, Gary M. Weiss, Qiang Ma Department of Computer and Information Science Fordham University 441 East Fordham

More information

Panorama Sharing Skyline Documents

Panorama Sharing Skyline Documents Panorama Sharing Skyline Documents Panorama is a freely available, open-source web server database application for targeted proteomics assays that integrates into a Skyline proteomics workflow. It has

More information

GLOBEX Bioinformatics (Summer 2015) Multiple Sequence Alignment

GLOBEX Bioinformatics (Summer 2015) Multiple Sequence Alignment GLOBEX Bioinformatics (Summer 2015) Multiple Sequence Alignment Scoring Dynamic Programming algorithms Heuristic algorithms CLUSTAL W Courtesy of jalview Motivations Collective (or aggregate) statistic

More information