User Guide Written By Yasser EL-Manzalawy


Copyright Gennotate development team

Introduction

As large amounts of genome sequence data become available, the development of reliable and efficient genome annotation tools that assign biological interpretation to DNA sequences becomes increasingly desirable. Although several computational genome annotation tools have been proposed, accurate and scalable genome annotation remains a major challenge. A variety of knowledge-based, statistical, and machine learning methods have been developed for many genome annotation tasks. They differ in the training data sets used to train the predictive models, the data representations (e.g., sequence features) used to encode the inputs and outputs (class labels) of the predictive models, the algorithms used to build the predictors, and the validation data sets and performance metrics used to assess the effectiveness of the predictors. Often, the data sets, implementations of algorithms, and data representations used are simply not available to the research community in a form that allows rigorous comparison of alternative approaches. Yet such comparisons are essential for determining the strengths and limitations of existing approaches so that further research can focus on improving these methods. For example, some methods are accessible via the Internet as online Web servers. Comparing the underlying computational methods implemented by such servers is not straightforward in the absence of access to implementations of the algorithms and the precise data sets and data representations used. This is further complicated by the fact that some servers update their predictors periodically using newly available data, newer computational methods, or new data representations, making it difficult to determine whether reported or measured changes in predictive accuracy stem from improvements in the methods, the data representations, or better data sets.

What is Gennotate?
Gennotate is a platform for sharing data representations, predictors, and machine learning algorithms for a broad range of gene structure prediction tasks. Gennotate has two main components (see Figure 1): 1) Model builder, an application for building and evaluating predictors and serializing these models in a binary format (model files). 2) Predictor, an application for applying a model to test data (e.g., sequences to be annotated). The model builder application is an extension of WEKA [1], a widely used machine learning workbench supporting many standard machine learning algorithms. WEKA provides tools for data pre-processing, classification, regression, clustering, validation, and visualization. Furthermore, WEKA provides a framework for implementing new machine learning methods and data pre-processors. The model builder extends WEKA by adding a suite of data pre-processors (called filters in WEKA) for converting molecular sequences into vectors of numerical features so that WEKA-supported methods can be applied to the data. The current implementation supports filters for generating several of the widely used data representations of molecular sequences. Once the sequences are converted into numeric or nominal features, any suitable WEKA learner can be trained and evaluated on that data set.

Figure 1: Gennotate model builder (left) and predictor (right).

Model builder

The model builder extends WEKA with a variety of DNA sequence preprocessors (WEKA filters) and a number of classification algorithms (e.g., classifiers based on Markov models). With very few exceptions, machine learning algorithms supported in WEKA cannot be applied directly to DNA sequence data; a preprocessing step that extracts features from the sequences is often required. The Gennotate model builder provides more than 30 implemented sequence- and structure-based DNA feature extraction methods. Additionally, a filter called ConcatenateFilter generates new features by combining any set of Gennotate features. Table 1 summarizes the currently implemented Gennotate filters. For detailed information about these filters, please check the Gennotate API documentation on the project web site. Once the features have been extracted from DNA sequences, many WEKA-supported machine learning algorithms can be applied (including state-of-the-art algorithms for classification, regression, clustering, and feature selection). In addition to the WEKA-supported implementations, Gennotate can run any third-party extension of WEKA; the procedure is as simple as adding the extra jar files to your CLASSPATH when running Gennotate.
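To make the feature-extraction idea concrete, the following is a minimal, self-contained sketch of the kind of computation a k-mer filter performs: counting the frequency of each length-k substring of a DNA sequence. This is an illustration only, not the Gennotate implementation; the class and method names are our own.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration of what a k-mer feature extractor computes:
// the frequency of each length-k substring of a DNA sequence.
public class KmerSketch {
    public static Map<String, Integer> kmerCounts(String seq, int k) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + k <= seq.length(); i++) {
            String kmer = seq.substring(i, i + k);
            counts.merge(kmer, 1, Integer::sum);
        }
        return counts;
    }
}
```

For k = 3, a sequence of length n yields n - 2 overlapping 3-mers; a filter like KMerFilter would expose such counts (over the full 4^k alphabet) as numeric attributes that any WEKA learner can consume.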

The current implementation of Gennotate enriches WEKA with a number of classification algorithms, summarized in Table 2.

Table 1: List of Gennotate supported filters

Filter                  Description
DDNAFilter              A filter for extracting dinucleotide structure features from DNA sequences.
DNA2Filter              A filter for converting a DNA sequence into a new sequence over an alphabet of all dinucleotide symbols.
DNASeqToNominalFilter   A filter to convert a string attribute of a DNA sequence into nominal attributes.
DNCFilter               A filter to convert a string attribute into 400 features representing compositions of dinucleotides.
KMerFilter              A filter to convert a string attribute into numeric features representing the frequencies of its k-mer substrings.
MonoHBondFilter         A filter for extracting hydrogen-bond-based DNA structure features.
NCFilter                A filter to convert a string attribute into numeric features representing compositions of nucleotides.
SubSequenceFilter       A filter for extracting a substring from the DNA sequence.
TRIDNAFilter            A filter for extracting tri-nucleotide structure features from DNA sequences.
ConcatenateFilter       A filter for concatenating multiple filters.

Table 2: List of Gennotate classification algorithms

Classifier          Description
HMMClassifier       A classifier implementing a Hidden Markov Model over sequence data.
IMMClassifier       A classifier implementing an Interpolated Markov Model over sequence data.
MMClassifier        A classifier implementing a Markov Model over sequence data.
BalancedClassifier  A meta classifier for training a base classifier on an unbalanced data set.
ModelBased          A meta classifier for performing classification/regression using a specified model file.

Predictor

The Predictor is a graphical user interface (GUI) for applying a saved prediction model to a test data set.
Specifically, the user supplies the model file, the test data file, the output file name, the format of the test data (DNA fragments, one fragment per line, or FASTA sequences), the type of problem (peptide-based or nucleotide-based), and the length of the peptide/window sequence. The output of the Predictor is a summary of the input model (model name, model parameters, and the name of the data set used to build the model) followed by the predictions. The predictions are written as four tab-separated columns (see Figure 2). The first column is the sequence identifier. The second and third columns are the position and the sequence of the predicted peptide/nucleotide. The last column is the prediction score.
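For downstream processing, the four-column output described above can be parsed with a few lines of code. This is a hypothetical post-processing sketch; the class and field names are ours, not part of Gennotate.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical parser for the Predictor's tab-separated output:
// identifier, position, predicted sequence, score.
public class PredictionParser {
    public record Prediction(String id, int position, String sequence, double score) {}

    public static List<Prediction> parse(List<String> lines) {
        List<Prediction> out = new ArrayList<>();
        for (String line : lines) {
            String[] f = line.split("\t");
            out.add(new Prediction(f[0], Integer.parseInt(f[1]),
                                   f[2], Double.parseDouble(f[3])));
        }
        return out;
    }
}
```

A parser like this makes it easy to, for example, sort predictions by score or filter them by a threshold before further analysis.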

Figure 2: Example Predictor output.

Installing and running Gennotate

Gennotate is platform-independent since it is implemented in Java. To install Gennotate, download it from the project web site and unzip the compressed file. To run Gennotate, add all the jar files included in the lib folder to the CLASSPATH and run the gennotate.jar file. For example, the following command sets the CLASSPATH and runs Gennotate on Windows machines:

java -Xmx1024m -classpath "gennotate.jar;weka.jar" gennotate.gui.maingui

For Linux machines, replace ; with :.

Using Gennotate

In this section, we show several examples of how to use Gennotate to develop predictors from DNA sequence data. For this purpose, we use two in-house data sets for predicting sigma 70 promoters in E. coli: 1) Sigma70.arff is a non-redundant data set extracted from RegulonDB on June 24, containing 579 promoter sequences published before April. None of the 579 promoter sequences shares more than 45% similarity with any other sequence in the promoter data. There are also 579 non-promoter sequences, none of which shares more than 45% similarity with any promoter or non-promoter sequence. 2) Sigma70_test is a non-redundant data set extracted from RegulonDB on June 24; all its promoter sequences were published after April. The data set has 792 promoter and 792 non-promoter sequences, none of which shares more than 45% similarity with any other sequence. The test data is provided in two formats: 1) standard WEKA format (file Sigma70_test.arff); 2) one fragment per line format (file Sigma70_test.txt).

Building your first predictor

Here, we show how to build your first predictor using the Sigma70.arff data and HMMClassifier and store it for future use on test data.

1. Run Gennotate.
2. Go to the Application menu and select the model builder application.
3. In the model builder window (the WEKA explorer augmented with Gennotate filters and prediction methods), click Open and select the file /Example/Data/Sigma70.arff.
4. Click the Classify tab.
5. In the classifier panel, click Choose and browse for HMMClassifier.
6. The HMMClassifier has two parameters: the input data alphabet (default ACGTN) and whether the input sequences have gaps (default false). Keep the default parameters and click OK.
7. Having specified both the data set and the classification algorithm, we are ready to build the model and evaluate it using 10-fold cross-validation. Click the Start button and wait for the 10-fold cross-validation procedure to finish. The classifier output shows several statistical estimates for HMMClassifier under 10-fold cross-validation; for example, the accuracy and AUC of the model are 72.8% and 0.81, respectively.
8. To save the model, right-click on the model in the Result list panel and select Save model. Save your model as /Examples/Models/Sigma70HMM.model.
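The intuition behind the Markov-model classifiers can be sketched in a few lines: train one Markov chain per class from transition counts, then score a test sequence by its log-likelihood under each chain. The following is a simplified, hypothetical first-order illustration of this idea, not the Gennotate implementation (which also supports hidden and interpolated Markov models):

```java
import java.util.HashMap;
import java.util.Map;

// Simplified first-order Markov chain over the DNA alphabet,
// illustrating the idea behind Markov-model sequence classifiers.
public class MarkovSketch {
    private final Map<String, Integer> pairCounts = new HashMap<>();
    private final Map<Character, Integer> charCounts = new HashMap<>();

    public void train(String seq) {
        for (int i = 0; i + 1 < seq.length(); i++) {
            pairCounts.merge(seq.substring(i, i + 2), 1, Integer::sum);
            charCounts.merge(seq.charAt(i), 1, Integer::sum);
        }
    }

    // Log-likelihood of seq under the trained chain (Laplace-smoothed).
    public double logLikelihood(String seq) {
        double ll = 0.0;
        for (int i = 0; i + 1 < seq.length(); i++) {
            int pair = pairCounts.getOrDefault(seq.substring(i, i + 2), 0);
            int prev = charCounts.getOrDefault(seq.charAt(i), 0);
            ll += Math.log((pair + 1.0) / (prev + 4.0)); // +4 for the ACGT alphabet
        }
        return ll;
    }
}
```

A two-class classifier would train one such chain on promoter sequences and one on non-promoters, and predict the class whose chain assigns the higher log-likelihood to the test sequence.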

Applying your model to test data

There are several methods for applying your model to test data. First, if your test data are stored in a WEKA format, you can use the model builder directly to apply the model to the test data and get predictions and some performance measures. To do that, follow these steps:

1. In the Test options panel, click Supplied test set and click Set to specify the test data file /Examples/Models/Sigma70_test.arff.
2. Right-click on the Result list panel and select Load model to load /Examples/Models/Sigma70HMM.model. After the model loads successfully, the classifier output shows information about the training data, the algorithm, and its parameters.

3. By default, the WEKA explorer does not output predictions. To output them, click More options and check the Output predictions option.
4. Click Start and wait for the model to be evaluated on the test data. The classifier output panel will then display the predictions and some performance evaluation measures.

Second, if your test data are in a Gennotate-supported format (e.g., FASTA or a single DNA fragment per line), you can use the Predictor application to apply a saved model and get predictions. For example, to apply Sigma70HMM.model to the test data in /Examples/Data/Sigma70_test.txt, follow these steps:

1. Run Predictor from the Application menu.

2. Specify your input and output files as in the figure below.
3. Click Predict and wait to see the output in the Predictions panel and in the output file /Examples/Output/Sigma70_test_out.txt.

Case Study 1: Predicting promoter regions in E. coli using sequence and structure features

In the previous section, we showed how to build an HMM model for predicting sigma 70 promoters in E. coli. A major difference between HMMClassifier and traditional classifiers such as Naïve Bayes (NB) and Random Forest is that HMMClassifier can be applied directly to sequence data, while traditional classifiers expect the data to be in the form of feature vectors extracted from the original sequences. Here, we show how to simultaneously extract features from sequence data and build/test a model, thanks to WEKA's FilteredClassifier, which lets us specify a machine learning algorithm together with a filter to be applied on the fly before the data are fed to the predictor. To build an NB classifier using 3-mer features, follow these steps:

1. Run Gennotate.
2. Go to the Application menu and select the model builder application.
3. In the model builder window (the WEKA explorer augmented with Gennotate filters and prediction methods), click Open and select the file /Example/Data/Sigma70.arff.
4. Click the Classify tab.
5. In the classifier panel, click Choose and browse for weka.classifiers.meta.FilteredClassifier.
6. Click on the classifier schema in the classifier panel to get the following window.

7. Change the classifier to weka.classifiers.bayes.NaiveBayes (with its default parameters) and the filter to gennotate.filters.unsupervised.KMerFilter (set the k parameter to 3). Click OK.
8. Click Start to run the 10-fold cross-validation experiment. The following figure shows the result of our experiment.

You can repeat the preceding procedure for different choices of classifiers and Gennotate filters. Table 3 compares Naïve Bayes (NB) and Random Forest with 50 trees (RF50) for k = 1, 2, 3, and 4. Interestingly, neither classifier is competitive with the HMM classifier, which achieved an AUC of 0.81 on the same data set.

Table 3: Performance (in terms of AUC score) of NB and RF50 on the Sigma70 data using different sequence-based features.

Features   NB   RF50
1-mer
2-mer
3-mer
4-mer

To build models using structure features, follow the preceding procedure and replace KMerFilter with DDNAFilter, which allows us to experiment with 12 different dinucleotide structure-based features [2] (see the Gennotate API documentation for detailed information about these methods). Table 4 compares NB and RF50 using 10-fold cross-validation and different structure-based features extracted from the Sigma70.arff data. In several cases, structure-based features helped us reach a performance competitive with the HMM classifier. RF50 seems to do better than NB; however, it should be noted that the number of trees was arbitrarily set to 50, and there could be room for improvement with larger numbers of trees (we leave this as an exercise for the user). For future experiments, let's save the best model in Table 4 as /Examples/Models/Sigma70_Stability_RF50.model.

Table 4: Performance (in terms of AUC score) of NB and RF50 on the Sigma70 data using twelve different dinucleotide structure-based features.

Features   NB   RF50
DI_APHYLICITY
DI_BDNATWISTOHLER
DI_BDNATWISTOLSON
DI_DNABENDSTIFF
DI_DNADENATURE
DI_ZDNASTABENERGY
DI_DUPLEXSTAB_DISRUPTENERGY
DI_DUPLEXSTAB_FREEENERGY
DI_PINDUCEDDEFORM
DI_PROPELLERTWIST
DI_PROTEINDNATWIST
DI_STACKINGENERGY
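To give an idea of what a dinucleotide structure-based encoding looks like, here is a minimal sketch: each dinucleotide is mapped through a conversion table of physical-property values, and a sequence becomes the vector of values of its successive dinucleotides. The numeric values below are placeholders only, not any real conversion table, and the class is our own illustration rather than the DDNAFilter implementation.

```java
import java.util.Map;

// Hypothetical sketch of a dinucleotide structure feature encoding.
// The values in TABLE are placeholders, NOT a real conversion table.
public class DinucleotideFeatureSketch {
    static final Map<String, Double> TABLE = Map.of(
        "AA", 1.0, "AC", 2.0, "CA", 3.0, "CC", 4.0); // placeholder values

    // A sequence of length n yields n - 1 dinucleotide feature values.
    public static double[] encode(String seq) {
        double[] features = new double[seq.length() - 1];
        for (int i = 0; i + 1 < seq.length(); i++) {
            features[i] = TABLE.getOrDefault(seq.substring(i, i + 2), 0.0);
        }
        return features;
    }
}
```

Swapping the conversion table (as DDNAFilter's ConversionTable parameter does) changes which physical property the resulting feature vector describes, while the encoding procedure stays the same.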

Case Study 2: Improved prediction of promoter regions in E. coli

In Case Study 1, we evaluated the prediction of sigma 70 promoters in E. coli using twelve different methods for extracting dinucleotide features. In general, better performance can be achieved by: 1) combining a set of these features; 2) building an ensemble of classifiers in which each base classifier is trained on a different structure-based feature set; 3) combining all 12 sets of structure features and using a feature selection method to find an optimal subset of features. Here, we show how to use Gennotate to build improved predictors using these three approaches.

Concatenating features

To build a single classifier that takes as input the features extracted using the twelve different dinucleotide feature sets, use gennotate.filters.ConcatenateFilter.

1. Run Gennotate.
2. Go to the Application menu and select the model builder application.
3. In the model builder window (the WEKA explorer augmented with Gennotate filters and prediction methods), click Open and select the file /Example/Data/Sigma70.arff.
4. Click the Classify tab.
5. In the classifier panel, click Choose and browse for weka.classifiers.meta.FilteredClassifier.
6. Click on the classifier schema in the classifier panel to get the following window.

7. Change the classifier to weka.classifiers.trees.RandomForest (set the number of trees to 50) and the filter to gennotate.filters.ConcatenateFilter.
8. Click on the ConcatenateFilter and input the twelve filters (e.g., DDNAFilter with twelve different selections of the ConversionTable parameter).

9. Click Start to run the 10-fold cross-validation experiment. The following figure shows the cross-validation performance of the predictor using the twelve combined sets of structure features. The result is better than that of any single set of structure features.

Concatenating features and selecting an optimal subset of features

In the preceding experiment, we showed that working with a concatenation of twelve sets of features can improve the performance of the resulting model. In general, this high-dimensional feature space may contain several irrelevant and/or redundant features. Here, we show how to use WEKA feature selection together with our ConcatenateFilter to further improve the performance of the resulting model.

1. Follow steps 1-6 in the previous experiment.
2. In the FilteredClassifier window, choose RF50 as your classifier and choose weka.filters.MultiFilter as your filter.
3. Click on the MultiFilter and input two filters: i) a ConcatenateFilter with twelve DDNAFilters, each with a different choice of ConversionTable; ii) an AttributeSelection filter.
4. For the AttributeSelection filter, you can experiment with the large pool of WEKA-provided feature selection methods and search algorithms. For our experiment, we ranked all features by information gain and used the 20 top-ranked features.
5. Click Start and wait for the cross-validation results. The output will show the top-ranked features used to build the model (see Table 5 for the top 20 features) as well as some performance measures. Interestingly, this model has a better AUC (0.83) than the model that uses the full set of features (0.80).
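For readers curious about what ranking by information gain computes, here is a minimal, self-contained sketch for a binary feature and binary class labels: the entropy of the class distribution minus the weighted entropy after splitting on the feature. This is our own illustration of the criterion, not WEKA's implementation.

```java
// Minimal information-gain computation for a binary feature and binary
// class labels; an illustration of the ranking criterion, not WEKA code.
public class InfoGainSketch {
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    static double entropy(double p) {
        if (p == 0.0 || p == 1.0) return 0.0;
        return -p * log2(p) - (1 - p) * log2(1 - p);
    }

    // feature[i] and label[i] are each 0 or 1.
    public static double infoGain(int[] feature, int[] label) {
        int n = feature.length, n1 = 0, pos = 0, pos1 = 0, pos0 = 0;
        for (int i = 0; i < n; i++) {
            if (label[i] == 1) pos++;
            if (feature[i] == 1) { n1++; if (label[i] == 1) pos1++; }
            else if (label[i] == 1) pos0++;
        }
        int n0 = n - n1;
        double h = entropy((double) pos / n);
        double h1 = n1 == 0 ? 0.0 : entropy((double) pos1 / n1);
        double h0 = n0 == 0 ? 0.0 : entropy((double) pos0 / n0);
        return h - ((double) n1 / n) * h1 - ((double) n0 / n) * h0;
    }
}
```

A feature that perfectly predicts the class has information gain equal to the class entropy (1 bit for balanced binary labels), while an uninformative feature has gain 0; the ranker simply sorts features by this score.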

Table 5: List of top 20 structure features

DI_DUPLEXSTAB_FREEENERGY48
DI_DNABENDSTIFF48
DI_DUPLEXSTAB_DISRUPTENERGY48
DI_DUPLEXSTAB_FREEENERGY49
DI_PINDUCEDDEFORM49
DI_DNABENDSTIFF49
DI_DUPLEXSTAB_FREEENERGY50
DI_DUPLEXSTAB_FREEENERGY47
DI_DNABENDSTIFF47
DI_STACKINGENERGY48
DI_DUPLEXSTAB_DISRUPTENERGY49
DI_DNADENATURE48
DI_DNABENDSTIFF50
DI_ZDNASTABENERGY49
DI_ZDNASTABENERGY50
DI_DUPLEXSTAB_DISRUPTENERGY47
DI_STACKINGENERGY47
DI_DUPLEXSTAB_DISRUPTENERGY50
DI_DNADENATURE50
DI_STACKINGENERGY49

Improved prediction of sigma 70 promoters using an ensemble of classifiers

In this experiment, we build a number of classifiers using RF50 and different choices of the dinucleotide structure-based features. The base classifiers are combined using a second-stage WEKA Logistic classifier.

1. Run Gennotate.
2. Go to the Application menu and select the model builder application.
3. In the model builder window (the WEKA explorer augmented with Gennotate filters and prediction methods), click Open and select the file /Example/Data/Sigma70.arff.
4. Click the Classify tab.
5. In the classifier panel, click Choose and browse for weka.classifiers.meta.Stacking. Set numFolds to 3, set the metaClassifier to weka.classifiers.functions.Logistic, and input 12 classifiers, each a FilteredClassifier with RF50 and a different choice of ConversionTable for DDNAFilter.

6. Click Start to run the 10-fold cross-validation experiment. The following figure shows the result of our experiment.

Case Study 3: Improved prediction of promoter regions in E. coli using meta-predictors

An interesting property of Gennotate is that it allows sharing not only data sets but also the learned models. Once you have a number of different predictors for the same classification task, you can: 1) use the Predictor application to apply any of these predictors to some test data; 2) rebuild a prediction model using updated/different training data; 3) build a consensus or hybrid predictor that combines these predictors. The first usage was shown earlier. The second can be done simply by loading the new training data, loading the current model, and performing a cross-validation experiment; the results will show the performance of the new model, which can also be saved as a model file for further use. The third usage is the focus of this case study. To facilitate the development of a consensus/hybrid predictor that relies on existing predictors, not necessarily developed by the same user, Gennotate provides a meta-classifier called ModelBased. In the following experiment, we show how to use the ModelBased classifier to build a consensus predictor combining the Sigma70HMM and Sigma70_Stability_RF50 models developed earlier.

1. Run Gennotate.
2. Go to the Application menu and select the model builder application.
3. In the model builder window (the WEKA explorer augmented with Gennotate filters and prediction methods), click Open and select the file /Example/Data/Sigma70_test.arff. Please note that our goal is to combine existing models, so there is no need to retrain them; instead, we use the test data to evaluate the combination of these predictors.
4. Click the Classify tab.
5. In the classifier panel, click Choose and browse for weka.classifiers.meta.Vote.

6. Input two classifiers, each using gennotate.classifiers.meta.ModelBased, and set the modelFile parameter as shown in the following figure.
7. In the Test options panel, choose Use training set. Note that the ModelBased classifier does not perform any training; it just loads the model and keeps it for predictions. Hence, what is reported is the performance of applying the models encapsulated within the ModelBased classifiers to what appears to WEKA as training data.

The performance obtained by the consensus predictor combining the HMM model and the RF50 model is almost the same as that of the HMM alone (AUC equals 0.81). In practice, we expect improvements in performance when we combine several (not just two) predictors. Please note that we can build a hybrid model from the HMM and RF50 models simply by following the preceding procedure and replacing the Vote classifier with the Stacking classifier. In that case, however, the user might perform a cross-validation test, and the result should be handled with caution because the test data have been used to train the meta-predictor in the Stacking classifier.
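The consensus combination performed by a Vote-style meta-classifier can be sketched in a few lines: average the positive-class probability estimates of the saved models and threshold the result. This is a hypothetical illustration of the averaging rule, not WEKA's Vote implementation; the model scores are stand-ins for the probabilities the saved models would return.

```java
// Hypothetical sketch of Vote-style consensus: average the positive-class
// probabilities produced by several saved models and threshold at 0.5.
public class ConsensusSketch {
    public static double consensusScore(double[] modelScores) {
        double sum = 0.0;
        for (double s : modelScores) sum += s;
        return sum / modelScores.length;
    }

    public static boolean predictPromoter(double[] modelScores) {
        return consensusScore(modelScores) >= 0.5;
    }
}
```

With only two models, the average is dominated by whichever model is more confident, which is consistent with the observation above that the two-model consensus performs about the same as the HMM alone; averaging over more diverse predictors is where such a consensus typically helps.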

Extending Gennotate

Gennotate is extensible, in the sense that anyone can add extra filters or classification methods. To add your own classification methods or filters, follow the procedure described in the WEKA documentation on how to write your own classifier and your own filter. Once you have a jar file containing your added components, just add it to your CLASSPATH when running Gennotate and enjoy your customized version of Gennotate.

References

[1] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1).
[2] Gan, Y., Guan, J., & Zhou, S. (2012). A comparison study on feature selection of DNA structural properties for promoter prediction. BMC Bioinformatics, 13(1).

Copyright Gennotate development team

More information

IEE 520 Data Mining. Project Report. Shilpa Madhavan Shinde

IEE 520 Data Mining. Project Report. Shilpa Madhavan Shinde IEE 520 Data Mining Project Report Shilpa Madhavan Shinde Contents I. Dataset Description... 3 II. Data Classification... 3 III. Class Imbalance... 5 IV. Classification after Sampling... 5 V. Final Model...

More information

WEKA: Practical Machine Learning Tools and Techniques in Java. Seminar A.I. Tools WS 2006/07 Rossen Dimov

WEKA: Practical Machine Learning Tools and Techniques in Java. Seminar A.I. Tools WS 2006/07 Rossen Dimov WEKA: Practical Machine Learning Tools and Techniques in Java Seminar A.I. Tools WS 2006/07 Rossen Dimov Overview Basic introduction to Machine Learning Weka Tool Conclusion Document classification Demo

More information

Analyzing HTTP requests for web intrusion detection

Analyzing HTTP requests for web intrusion detection Kennesaw State University DigitalCommons@Kennesaw State University KSU Proceedings on Cybersecurity Education, Research and Practice 2017 KSU Conference on Cybersecurity Education, Research and Practice

More information

Finding data. HMMER Answer key

Finding data. HMMER Answer key Finding data HMMER Answer key HMMER input is prepared using VectorBase ClustalW, which runs a Java application for the graphical representation of the results. If you get an error message that blocks this

More information

Classification Algorithms in Data Mining

Classification Algorithms in Data Mining August 9th, 2016 Suhas Mallesh Yash Thakkar Ashok Choudhary CIS660 Data Mining and Big Data Processing -Dr. Sunnie S. Chung Classification Algorithms in Data Mining Deciding on the classification algorithms

More information

Weka VotedPerceptron & Attribute Transformation (1)

Weka VotedPerceptron & Attribute Transformation (1) Weka VotedPerceptron & Attribute Transformation (1) Lab6 (in- class): 5 DIC 2016-13:15-15:00 (CHOMSKY) ACKNOWLEDGEMENTS: INFORMATION, EXAMPLES AND TASKS IN THIS LAB COME FROM SEVERAL WEB SOURCES. Learning

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Image Data: Classification via Neural Networks Instructor: Yizhou Sun yzsun@ccs.neu.edu November 19, 2015 Methods to Learn Classification Clustering Frequent Pattern Mining

More information

WEKA A Machine Learning Workbench for Data Mining

WEKA A Machine Learning Workbench for Data Mining Chapter 1 WEKA A Machine Learning Workbench for Data Mining Eibe Frank, Mark Hall, Geoffrey Holmes, Richard Kirkby, Bernhard Pfahringer, Ian H. Witten Department of Computer Science, University of Waikato,

More information

Practical Data Mining COMP-321B. Tutorial 1: Introduction to the WEKA Explorer

Practical Data Mining COMP-321B. Tutorial 1: Introduction to the WEKA Explorer Practical Data Mining COMP-321B Tutorial 1: Introduction to the WEKA Explorer Gabi Schmidberger Mark Hall Richard Kirkby July 12, 2006 c 2006 University of Waikato 1 Setting up your Environment Before

More information

Lecture 5: Markov models

Lecture 5: Markov models Master s course Bioinformatics Data Analysis and Tools Lecture 5: Markov models Centre for Integrative Bioinformatics Problem in biology Data and patterns are often not clear cut When we want to make a

More information

Attribute Discretization and Selection. Clustering. NIKOLA MILIKIĆ UROŠ KRČADINAC

Attribute Discretization and Selection. Clustering. NIKOLA MILIKIĆ UROŠ KRČADINAC Attribute Discretization and Selection Clustering NIKOLA MILIKIĆ nikola.milikic@fon.bg.ac.rs UROŠ KRČADINAC uros@krcadinac.com Naive Bayes Features Intended primarily for the work with nominal attributes

More information

Predicting Popular Xbox games based on Search Queries of Users

Predicting Popular Xbox games based on Search Queries of Users 1 Predicting Popular Xbox games based on Search Queries of Users Chinmoy Mandayam and Saahil Shenoy I. INTRODUCTION This project is based on a completed Kaggle competition. Our goal is to predict which

More information

Effective Classifiers for Detecting Objects

Effective Classifiers for Detecting Objects Effective Classifiers for Detecting Objects Michael Mayo Dept. of Computer Science University of Waikato Private Bag 3105, Hamilton, New Zealand mmayo@cs.waikato.ac.nz Abstract Several state-of-the-art

More information

B-kNN to Improve the Efficiency of knn

B-kNN to Improve the Efficiency of knn Dhrgam AL Kafaf, Dae-Kyoo Kim and Lunjin Lu Dept. of Computer Science & Engineering, Oakland University, Rochester, MI 809, U.S.A. Keywords: Abstract: Efficiency, knn, k Nearest Neighbor. The knn algorithm

More information

Topics In Feature Selection

Topics In Feature Selection Topics In Feature Selection CSI 5388 Theme Presentation Joe Burpee 2005/2/16 Feature Selection (FS) aka Attribute Selection Witten and Frank book Section 7.1 Liu site http://athena.csee.umbc.edu/idm02/

More information

Computing a Gain Chart. Comparing the computation time of data mining tools on a large dataset under Linux.

Computing a Gain Chart. Comparing the computation time of data mining tools on a large dataset under Linux. 1 Introduction Computing a Gain Chart. Comparing the computation time of data mining tools on a large dataset under Linux. The gain chart is an alternative to confusion matrix for the evaluation of a classifier.

More information

ClaNC: The Manual (v1.1)

ClaNC: The Manual (v1.1) ClaNC: The Manual (v1.1) Alan R. Dabney June 23, 2008 Contents 1 Installation 3 1.1 The R programming language............................... 3 1.2 X11 with Mac OS X....................................

More information

Data Mining With Weka A Short Tutorial

Data Mining With Weka A Short Tutorial Data Mining With Weka A Short Tutorial Dr. Wenjia Wang School of Computing Sciences University of East Anglia (UEA), Norwich, UK Content 1. Introduction to Weka 2. Data Mining Functions and Tools 3. Data

More information

Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers

Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers James P. Biagioni Piotr M. Szczurek Peter C. Nelson, Ph.D. Abolfazl Mohammadian, Ph.D. Agenda Background

More information

10601 Machine Learning. Model and feature selection

10601 Machine Learning. Model and feature selection 10601 Machine Learning Model and feature selection Model selection issues We have seen some of this before Selecting features (or basis functions) Logistic regression SVMs Selecting parameter value Prior

More information

Table Of Contents: xix Foreword to Second Edition

Table Of Contents: xix Foreword to Second Edition Data Mining : Concepts and Techniques Table Of Contents: Foreword xix Foreword to Second Edition xxi Preface xxiii Acknowledgments xxxi About the Authors xxxv Chapter 1 Introduction 1 (38) 1.1 Why Data

More information

A Comparative Study of Locality Preserving Projection and Principle Component Analysis on Classification Performance Using Logistic Regression

A Comparative Study of Locality Preserving Projection and Principle Component Analysis on Classification Performance Using Logistic Regression Journal of Data Analysis and Information Processing, 2016, 4, 55-63 Published Online May 2016 in SciRes. http://www.scirp.org/journal/jdaip http://dx.doi.org/10.4236/jdaip.2016.42005 A Comparative Study

More information

An Introduction to WEKA Explorer. In part from: Yizhou Sun 2008

An Introduction to WEKA Explorer. In part from: Yizhou Sun 2008 An Introduction to WEKA Explorer In part from: Yizhou Sun 2008 What is WEKA? Waikato Environment for Knowledge Analysis It s a data mining/machine learning tool developed by Department of Computer Science,,

More information

Louis Fourrier Fabien Gaie Thomas Rolf

Louis Fourrier Fabien Gaie Thomas Rolf CS 229 Stay Alert! The Ford Challenge Louis Fourrier Fabien Gaie Thomas Rolf Louis Fourrier Fabien Gaie Thomas Rolf 1. Problem description a. Goal Our final project is a recent Kaggle competition submitted

More information

Fast or furious? - User analysis of SF Express Inc

Fast or furious? - User analysis of SF Express Inc CS 229 PROJECT, DEC. 2017 1 Fast or furious? - User analysis of SF Express Inc Gege Wen@gegewen, Yiyuan Zhang@yiyuan12, Kezhen Zhao@zkz I. MOTIVATION The motivation of this project is to predict the likelihood

More information

Tutorial on Machine Learning Tools

Tutorial on Machine Learning Tools Tutorial on Machine Learning Tools Yanbing Xue Milos Hauskrecht Why do we need these tools? Widely deployed classical models No need to code from scratch Easy-to-use GUI Outline Matlab Apps Weka 3 UI TensorFlow

More information

WEKA: A Dynamic Software Suit for Machine Learning & Exploratory Data Analysis

WEKA: A Dynamic Software Suit for Machine Learning & Exploratory Data Analysis , pp-01-05 WEKA: A Dynamic Software Suit for Machine Learning & Exploratory Data Analysis P.B.Khanale 1, Vaibhav M. Pathak 2 1 Department of Computer Science,Dnyanopasak College,Parbhani 431 401 e-mail

More information

Combination of PCA with SMOTE Resampling to Boost the Prediction Rate in Lung Cancer Dataset

Combination of PCA with SMOTE Resampling to Boost the Prediction Rate in Lung Cancer Dataset International Journal of Computer Applications (0975 8887) Combination of PCA with SMOTE Resampling to Boost the Prediction Rate in Lung Cancer Dataset Mehdi Naseriparsa Islamic Azad University Tehran

More information

A Feature Generation Algorithm for Sequences with Application to Splice-Site Prediction

A Feature Generation Algorithm for Sequences with Application to Splice-Site Prediction A Feature Generation Algorithm for Sequences with Application to Splice-Site Prediction Rezarta Islamaj 1, Lise Getoor 1, and W. John Wilbur 2 1 Computer Science Department, University of Maryland, College

More information

Scalable Machine Learning in R. with H2O

Scalable Machine Learning in R. with H2O Scalable Machine Learning in R with H2O Erin LeDell @ledell DSC July 2016 Introduction Statistician & Machine Learning Scientist at H2O.ai in Mountain View, California, USA Ph.D. in Biostatistics with

More information

Network Traffic Measurements and Analysis

Network Traffic Measurements and Analysis DEIB - Politecnico di Milano Fall, 2017 Sources Hastie, Tibshirani, Friedman: The Elements of Statistical Learning James, Witten, Hastie, Tibshirani: An Introduction to Statistical Learning Andrew Ng:

More information

Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique

Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique Anotai Siltepavet 1, Sukree Sinthupinyo 2 and Prabhas Chongstitvatana 3 1 Computer Engineering, Chulalongkorn University,

More information

Chapter 10. Conclusion Discussion

Chapter 10. Conclusion Discussion Chapter 10 Conclusion 10.1 Discussion Question 1: Usually a dynamic system has delays and feedback. Can OMEGA handle systems with infinite delays, and with elastic delays? OMEGA handles those systems with

More information

Applied Machine Learning

Applied Machine Learning Applied Machine Learning Lab 3 Working with Text Data Overview In this lab, you will use R or Python to work with text data. Specifically, you will use code to clean text, remove stop words, and apply

More information

Semi-supervised Learning

Semi-supervised Learning Semi-supervised Learning Piyush Rai CS5350/6350: Machine Learning November 8, 2011 Semi-supervised Learning Supervised Learning models require labeled data Learning a reliable model usually requires plenty

More information

WRAPPER feature selection method with SIPINA and R (RWeka package). Comparison with a FILTER approach implemented into TANAGRA.

WRAPPER feature selection method with SIPINA and R (RWeka package). Comparison with a FILTER approach implemented into TANAGRA. 1 Topic WRAPPER feature selection method with SIPINA and R (RWeka package). Comparison with a FILTER approach implemented into TANAGRA. Feature selection. The feature selection 1 is a crucial aspect of

More information

ECLT 5810 Evaluation of Classification Quality

ECLT 5810 Evaluation of Classification Quality ECLT 5810 Evaluation of Classification Quality Reference: Data Mining Practical Machine Learning Tools and Techniques, by I. Witten, E. Frank, and M. Hall, Morgan Kaufmann Testing and Error Error rate:

More information

Feature Selection Using Modified-MCA Based Scoring Metric for Classification

Feature Selection Using Modified-MCA Based Scoring Metric for Classification 2011 International Conference on Information Communication and Management IPCSIT vol.16 (2011) (2011) IACSIT Press, Singapore Feature Selection Using Modified-MCA Based Scoring Metric for Classification

More information

AI32 Guide to Weka. Andrew Roberts 1st March 2005

AI32 Guide to Weka. Andrew Roberts   1st March 2005 AI32 Guide to Weka Andrew Roberts http://www.comp.leeds.ac.uk/andyr 1st March 2005 1 Introduction Weka is an excellent system for learning about machine learning techniques. Of course, it is a generic

More information

MetaPhyler Usage Manual

MetaPhyler Usage Manual MetaPhyler Usage Manual Bo Liu boliu@umiacs.umd.edu March 13, 2012 Contents 1 What is MetaPhyler 1 2 Installation 1 3 Quick Start 2 3.1 Taxonomic profiling for metagenomic sequences.............. 2 3.2

More information

Contents. ACE Presentation. Comparison with existing frameworks. Technical aspects. ACE 2.0 and future work. 24 October 2009 ACE 2

Contents. ACE Presentation. Comparison with existing frameworks. Technical aspects. ACE 2.0 and future work. 24 October 2009 ACE 2 ACE Contents ACE Presentation Comparison with existing frameworks Technical aspects ACE 2.0 and future work 24 October 2009 ACE 2 ACE Presentation 24 October 2009 ACE 3 ACE Presentation Framework for using

More information

Technical Brief: Domain Risk Score Proactively uncover threats using DNS and data science

Technical Brief: Domain Risk Score Proactively uncover threats using DNS and data science Technical Brief: Domain Risk Score Proactively uncover threats using DNS and data science 310 Million + Current Domain Names 11 Billion+ Historical Domain Profiles 5 Million+ New Domain Profiles Daily

More information

MOA: {M}assive {O}nline {A}nalysis.

MOA: {M}assive {O}nline {A}nalysis. MOA: {M}assive {O}nline {A}nalysis. Albert Bifet Hamilton, New Zealand August 2010, Eindhoven PhD Thesis Adaptive Learning and Mining for Data Streams and Frequent Patterns Coadvisors: Ricard Gavaldà and

More information

CS249: ADVANCED DATA MINING

CS249: ADVANCED DATA MINING CS249: ADVANCED DATA MINING Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu April 24, 2017 Homework 2 out Announcements Due May 3 rd (11:59pm) Course project proposal

More information

DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES

DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES EXPERIMENTAL WORK PART I CHAPTER 6 DESIGN AND EVALUATION OF MACHINE LEARNING MODELS WITH STATISTICAL FEATURES The evaluation of models built using statistical in conjunction with various feature subset

More information

Community edition(open-source) Enterprise edition

Community edition(open-source) Enterprise edition Suseela Bhaskaruni Rapid Miner is an environment for machine learning and data mining experiments. Widely used for both research and real-world data mining tasks. Software versions: Community edition(open-source)

More information

Prognosis of Lung Cancer Using Data Mining Techniques

Prognosis of Lung Cancer Using Data Mining Techniques Prognosis of Lung Cancer Using Data Mining Techniques 1 C. Saranya, M.Phil, Research Scholar, Dr.M.G.R.Chockalingam Arts College, Arni 2 K. R. Dillirani, Associate Professor, Department of Computer Science,

More information

IMPLEMENTATION OF CLASSIFICATION ALGORITHMS USING WEKA NAÏVE BAYES CLASSIFIER

IMPLEMENTATION OF CLASSIFICATION ALGORITHMS USING WEKA NAÏVE BAYES CLASSIFIER IMPLEMENTATION OF CLASSIFICATION ALGORITHMS USING WEKA NAÏVE BAYES CLASSIFIER N. Suresh Kumar, Dr. M. Thangamani 1 Assistant Professor, Sri Ramakrishna Engineering College, Coimbatore, India 2 Assistant

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 08: Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu October 24, 2017 Learnt Prediction and Classification Methods Vector Data

More information

Perform the following steps to set up for this project. Start out in your login directory on csit (a.k.a. acad).

Perform the following steps to set up for this project. Start out in your login directory on csit (a.k.a. acad). CSC 458 Data Mining and Predictive Analytics I, Fall 2017 (November 22, 2017) Dr. Dale E. Parson, Assignment 4, Comparing Weka Bayesian, clustering, ZeroR, OneR, and J48 models to predict nominal dissolved

More information

A Study of Random Forest Algorithm with implemetation using Weka

A Study of Random Forest Algorithm with implemetation using Weka A Study of Random Forest Algorithm with implemetation using Weka 1 Dr. N.Venkatesan, Associate Professor, Dept. of IT., Bharathiyar College of Engg. & Technology, Karikal, India 2 Mrs. G.Priya, Assistant

More information

A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion

A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion Ye Tian, Gary M. Weiss, Qiang Ma Department of Computer and Information Science Fordham University 441 East Fordham

More information

DATA ANALYSIS WITH WEKA. Author: Nagamani Mutteni Asst.Professor MERI

DATA ANALYSIS WITH WEKA. Author: Nagamani Mutteni Asst.Professor MERI DATA ANALYSIS WITH WEKA Author: Nagamani Mutteni Asst.Professor MERI Topic: Data Analysis with Weka Course Duration: 2 Months Objective: Everybody talks about Data Mining and Big Data nowadays. Weka is

More information

Neural Networks and Machine Learning Applied to Classification of Cancer. Sachin Govind, Advisor: Namrata Pandya, IMSA

Neural Networks and Machine Learning Applied to Classification of Cancer. Sachin Govind, Advisor: Namrata Pandya, IMSA Neural Networks and Machine Learning Applied to Classification of Cancer Sachin Govind, Advisor: Namrata Pandya, IMSA Cancer Screening Current methods Invasive techniques (biopsy, colonoscopy, etc.) Helical

More information

Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique

Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique www.ijcsi.org 29 Improving Quality of Products in Hard Drive Manufacturing by Decision Tree Technique Anotai Siltepavet 1, Sukree Sinthupinyo 2 and Prabhas Chongstitvatana 3 1 Computer Engineering, Chulalongkorn

More information

MIT Samberg Center Cambridge, MA, USA. May 30 th June 2 nd, by C. Rea, R.S. Granetz MIT Plasma Science and Fusion Center, Cambridge, MA, USA

MIT Samberg Center Cambridge, MA, USA. May 30 th June 2 nd, by C. Rea, R.S. Granetz MIT Plasma Science and Fusion Center, Cambridge, MA, USA Exploratory Machine Learning studies for disruption prediction on DIII-D by C. Rea, R.S. Granetz MIT Plasma Science and Fusion Center, Cambridge, MA, USA Presented at the 2 nd IAEA Technical Meeting on

More information

Contents. Foreword to Second Edition. Acknowledgments About the Authors

Contents. Foreword to Second Edition. Acknowledgments About the Authors Contents Foreword xix Foreword to Second Edition xxi Preface xxiii Acknowledgments About the Authors xxxi xxxv Chapter 1 Introduction 1 1.1 Why Data Mining? 1 1.1.1 Moving toward the Information Age 1

More information

Tutorial on Machine Learning. Impact of dataset composition on models performance. G. Marcou, N. Weill, D. Horvath, D. Rognan, A.

Tutorial on Machine Learning. Impact of dataset composition on models performance. G. Marcou, N. Weill, D. Horvath, D. Rognan, A. Part 1. Tutorial on Machine Learning. Impact of dataset composition on models performance G. Marcou, N. Weill, D. Horvath, D. Rognan, A. Varnek 1 Introduction Predictive performance of QSAR model depends

More information

Effect of Principle Component Analysis and Support Vector Machine in Software Fault Prediction

Effect of Principle Component Analysis and Support Vector Machine in Software Fault Prediction International Journal of Computer Trends and Technology (IJCTT) volume 7 number 3 Jan 2014 Effect of Principle Component Analysis and Support Vector Machine in Software Fault Prediction A. Shanthini 1,

More information

Software Defect Prediction System Decision Tree Algorithm With Two Level Data Preprocessing

Software Defect Prediction System Decision Tree Algorithm With Two Level Data Preprocessing Software Defect Prediction System Decision Tree Algorithm With Two Level Data Preprocessing Reena P Department of Computer Science and Engineering Sree Chitra Thirunal College of Engineering Thiruvananthapuram,

More information

Author Verification: Exploring a Large set of Parameters using a Genetic Algorithm

Author Verification: Exploring a Large set of Parameters using a Genetic Algorithm Author Verification: Exploring a Large set of Parameters using a Genetic Algorithm Notebook for PAN at CLEF 2014 Erwan Moreau 1, Arun Jayapal 2, and Carl Vogel 3 1 moreaue@cs.tcd.ie 2 jayapala@cs.tcd.ie

More information

Combining Neural Networks and Log-linear Models to Improve Relation Extraction

Combining Neural Networks and Log-linear Models to Improve Relation Extraction Combining Neural Networks and Log-linear Models to Improve Relation Extraction Thien Huu Nguyen and Ralph Grishman Computer Science Department, New York University {thien,grishman}@cs.nyu.edu Outline Relation

More information

On Classification: An Empirical Study of Existing Algorithms Based on Two Kaggle Competitions

On Classification: An Empirical Study of Existing Algorithms Based on Two Kaggle Competitions On Classification: An Empirical Study of Existing Algorithms Based on Two Kaggle Competitions CAMCOS Report Day December 9th, 2015 San Jose State University Project Theme: Classification The Kaggle Competition

More information

Contents. Preface to the Second Edition

Contents. Preface to the Second Edition Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................

More information

WEKA Waikato Environment for Knowledge Analysis Performing Classification Experiments Prof. Pietro Ducange

WEKA Waikato Environment for Knowledge Analysis Performing Classification Experiments Prof. Pietro Ducange WEKA Waikato Environment for Knowledge Analysis Performing Classification Experiments Prof. Pietro Ducange 1 The Knowledge Flow Interface It provides an alternative to the Explorer interface The user can

More information

Automatic Linguistic Indexing of Pictures by a Statistical Modeling Approach

Automatic Linguistic Indexing of Pictures by a Statistical Modeling Approach Automatic Linguistic Indexing of Pictures by a Statistical Modeling Approach Abstract Automatic linguistic indexing of pictures is an important but highly challenging problem for researchers in content-based

More information

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X Analysis about Classification Techniques on Categorical Data in Data Mining Assistant Professor P. Meena Department of Computer Science Adhiyaman Arts and Science College for Women Uthangarai, Krishnagiri,

More information

Preface to the Second Edition. Preface to the First Edition. 1 Introduction 1

Preface to the Second Edition. Preface to the First Edition. 1 Introduction 1 Preface to the Second Edition Preface to the First Edition vii xi 1 Introduction 1 2 Overview of Supervised Learning 9 2.1 Introduction... 9 2.2 Variable Types and Terminology... 9 2.3 Two Simple Approaches

More information

Finding and Exporting Data. BioMart

Finding and Exporting Data. BioMart September 2017 Finding and Exporting Data Not sure what tool to use to find and export data? BioMart is used to retrieve data for complex queries, involving a few or many genes or even complete genomes.

More information

Topic Classification in Social Media using Metadata from Hyperlinked Objects

Topic Classification in Social Media using Metadata from Hyperlinked Objects Topic Classification in Social Media using Metadata from Hyperlinked Objects Sheila Kinsella 1, Alexandre Passant 1, and John G. Breslin 1,2 1 Digital Enterprise Research Institute, National University

More information

3 Ways to Improve Your Regression

3 Ways to Improve Your Regression 3 Ways to Improve Your Regression Introduction This tutorial will take you through the steps demonstrated in the 3 Ways to Improve Your Regression webinar. First, you will be introduced to a dataset about

More information

Subject. Dataset. Copy paste feature of the diagram. Importing the dataset. Copy paste feature into the diagram.

Subject. Dataset. Copy paste feature of the diagram. Importing the dataset. Copy paste feature into the diagram. Subject Copy paste feature into the diagram. When we define the data analysis process into Tanagra, it is possible to copy components (or entire branches of components) towards another location into the

More information

KNIME Enalos+ Molecular Descriptor nodes

KNIME Enalos+ Molecular Descriptor nodes KNIME Enalos+ Molecular Descriptor nodes A Brief Tutorial Novamechanics Ltd Contact: info@novamechanics.com Version 1, June 2017 Table of Contents Introduction... 1 Step 1-Workbench overview... 1 Step

More information