RAMMCAP The Rapid Analysis of Multiple Metagenomes with a Clustering and Annotation Pipeline


Weizhong Li, liwz@sdsc.edu
CAMERA project

Contents:
1. Introduction
2. Implementation
3. Tools
   3.1. CD-HIT-454
   3.2. rrna-hmm
   3.3. rrna-finder
   3.4. trna-scan
   3.5. CD-HIT-EST
   3.6. ORF-finder
   3.7. Metagene
   3.8. CD-HIT
   3.9. Annotation of ORFs
   3.10. GO annotation
   3.11. EC number
   3.12. FASTA file summary
4. RAMMCAP under CAMERA web portal
5. Running RAMMCAP locally
   5.1. Download RAMMCAP package
   5.2. Installation
   5.3. Programs
   5.4. Example on a desktop computer
   5.5. Example on a computer cluster
6. Statistical comparison of metagenomes
7. References

Last modified on Nov 2,

1. Introduction

The remarkable advance of metagenomics creates new challenges in data analysis. The Rapid Analysis of Multiple Metagenomes with a Clustering and Annotation Pipeline (RAMMCAP) was implemented within the CAMERA project to address the computational challenges imposed by the huge size and great diversity of metagenomic data (Li, 2009). RAMMCAP was developed using an ultra-fast sequence clustering algorithm, fast protein family annotation tools, and other supporting tools. It can cluster and annotate 1 million metagenomic reads with only hundreds of CPU hours.

An online RAMMCAP service is available from the CAMERA portal. The standalone RAMMCAP software, supporting reference databases, and examples are available for download.

2. Implementation

RAMMCAP is implemented as the pipeline shown in Figure 1. This is an updated version of the original RAMMCAP workflow published in BMC Bioinformatics (Li, 2009).

Figure 1 Metagenomic data analysis pipeline RAMMCAP.

The input of RAMMCAP is a FASTA file of reads from one or more metagenomes. If the input is 454 data, the reads can be pre-processed with the program cd-hit-454 to remove artificial duplicates. Low-quality reads can also be filtered out at this stage. Then, rRNAs are identified with rrna_hmm, a program we developed for identifying rRNAs in metagenomic fragments with high sensitivity and specificity (Huang, et al., 2009). The rRNAs are masked from the input reads, and tRNAs are predicted with trna-scan. tRNAs are also masked from the input reads. The masked reads are used in all further analyses.

Using the DNA version of CD-HIT (Li and Godzik, 2006; Li, et al., 2001; Li, et al., 2002), the masked reads are clustered at 95% sequence identity over 80% of the length of the shorter sequence (clustering parameters can be adjusted by users) to identify clusters of unique read sequences, referred to as read clusters. It takes ~1 hour to cluster a million 200-base reads.

ORFs are collected from reads with ORF_finder, an ORF calling program implemented here by six reading frame translation. Within each reading frame, an ORF starts at the beginning of a read or at the first ATG after a previous stop codon; it ends at the first stop codon or at the end of that read. The minimal length of ORFs can be specified by users. ORFs can also be called from reads with the program Metagene (Noguchi, et al., 2006). Since these sequence reads are short, a predicted ORF may be only a portion of a complete ORF. An ORF may also be a translation from a non-coding frame: such an ORF is called a spurious ORF, as defined in the original GOS study (Yooseph, et al., 2007).
The GOS study also introduced a spurious ORF detection method based on a nonsynonymous to synonymous substitution test, which is available along with a recent GOS clustering study (Yooseph, et al., 2008). This method is not integrated within RAMMCAP, but it can be used independently to identify spurious ORFs predicted here.

ORFs are first clustered at 90-95% identity to identify the non-redundant sequences, which are further clustered into families (ORF clusters) at a conservative threshold, so that each cluster contains sequences of the same or similar function. A 30% sequence identity indicates significant similarity for full-length proteins. ORFs from Sanger reads are long enough to be clustered at 30% identity over 80% of ORF length. ORFs from 454 reads are shorter; they are clustered at 60% identity over 80% of ORF length. ORF clusters are used for functional studies. The size of an ORF cluster is the number of its non-redundant sequences. For one million ORFs, it takes a few CPU hours to cluster at 60% identity and ~100 CPU hours at 30% identity. This type of multiple-step clustering has been used in an independent analysis of GOS ORFs (Li, et al., 2008). Please note that the current online RAMMCAP service from the CAMERA portal does not support clustering at 30% identity, which needs to be performed with the standalone CD-HIT program.

ORFs are annotated from Pfam and Tigrfam with HMMER (Eddy, 1998), accelerated with Hammerhead (Portugaly, et al., 2007), and from COG with RPS-BLAST (Altschul, et al., 1997). Hits must have E-values ≤ 0.001 and, for HMMER searches, meet the significance score thresholds. This annotation process takes only ~100 CPU hours for one million ORFs.

3. Tools

3.1. CD-HIT-454

454 pyrosequencing reads contain artificial duplicates, which might lead to misleading conclusions. CD-HIT-454 is a fast program to identify exact and nearly identical duplicates, that is, reads that begin at the same position but may vary in length or bear mismatches. CD-HIT-454 can process a typical 454 dataset in ~10 minutes. It also provides a consensus sequence for each group of duplicates.

Duplicates are either exactly identical or meet all of these criteria:
   (1) they start at the same position;
   (2) their lengths can be different, but the shorter one must be fully aligned with the longer one (the seed);
   (3) they can have at most 4% mismatches (insertion, deletion, and substitution); and
   (4) only 1 base is allowed per insertion or deletion.
Here, (3) and (4) can be adjusted by users.

Input and output

CD-HIT-454 takes a FASTA file as input and outputs two files. One file lists the clusters of duplicates with similarity information.
The other file, in FASTA format, contains the read sequences with duplicates removed. CD-HIT-454 can also provide a consensus sequence for each group of duplicates.

Parameters

Default command line options for cd-hit-454 for RAMMCAP:
   -g 1 -c 0.96 -D 1
where
   -c 0.96 means 96% identity
   -D 1 means only 1 base is allowed per insertion or deletion

Detailed usage of CD-HIT-454 can be found at http://cd-hit.org/. Other options:
   -i   input filename in fasta format, required
   -o   output filename, required
   -c   sequence identity threshold, default 0.98
        this is a "global sequence identity" calculated as:
        number of identical amino acids in alignment
        divided by the full length of the shorter sequence + gaps
   -b   band_width of alignment, default 10
   -M   max available memory (Mbyte), default

   -n   word_length, default 10, see user's guide for choosing it
   -aL  alignment coverage for the longer sequence, default 0.0
        if set to 0.9, the alignment must cover 90% of the sequence
   -AL  alignment coverage control for the longer sequence, default
        if set to 60, and the length of the sequence is 400,
        then the alignment must be >= 340 (400-60) residues
   -aS  alignment coverage for the shorter sequence, default 0.0
        if set to 0.9, the alignment must cover 90% of the sequence
   -AS  alignment coverage control for the shorter sequence, default
        if set to 60, and the length of the sequence is 400,
        then the alignment must be >= 340 (400-60) residues
   -B   1 or 0, default 0, by default, sequences are stored in RAM
        if set to 1, sequences are stored on hard drive
        it is recommended to use -B 1 for huge databases
   -g   1 or 0, default 0
        by cd-hit's default algorithm, a sequence is clustered to the first
        cluster that meets the threshold (fast cluster). If set to 1, the program
        will cluster it into the most similar cluster that meets the threshold
        (accurate but slow mode)
        but either 1 or 0 won't change the representatives of final clusters
   -D   max size per indel, default

3.2. rrna-hmm

This is an HMM-based method developed in CAMERA to identify rRNAs from fragmented sequence reads (Huang, et al., 2009). The performance of Meta_RNA is shown in Table 1.

Table 1 Sensitivity and specificity of Meta_RNA and the BLASTN-based rrna finder

Input and output

The input is a FASTA file of reads. The first output is a FASTA file of rRNAs. The start position, end position, strand, frame, and length are reported in the FASTA deflines. The second output is a FASTA file of the input reads with the identified rRNAs masked with the letter N. A third file lists the coordinates of the rRNAs in the input reads.

3.3. rrna-finder

rRNAs are predicted from reads by BLASTN against the rRNA database Silva at e-value 1e

Input and output

The input and output of rrna-finder are the same as those of rrna-hmm.

3.4. trna-scan

tRNAs are predicted from the input reads with the program trnascan, using the general model with default parameters.
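Both the rRNA and the tRNA steps produce a copy of the input reads in which the predicted RNA regions are masked with the letter N. The masking idea can be sketched as follows. This is only an illustration, not the pipeline's own code: the function name `mask_regions` is hypothetical, and the coordinates are assumed to be 1-based and inclusive, which the guide does not state.

```python
def mask_regions(seq, regions, mask_char="N"):
    """Return seq with every (start, end) region replaced by mask_char.

    Assumption: coordinates are 1-based and inclusive. A region given as
    (end, start), e.g. for a minus-strand hit, is handled by swapping.
    """
    chars = list(seq)
    for start, end in regions:
        lo, hi = min(start, end), max(start, end)
        lo = max(lo, 1)            # clip to the read boundaries
        hi = min(hi, len(chars))
        for i in range(lo - 1, hi):
            chars[i] = mask_char
    return "".join(chars)


# Mask bases 3..6 of a 10-base read.
masked = mask_regions("ACGTACGTAC", [(3, 6)])
print(masked)
```

Masking rather than deleting keeps the read length and the coordinates of everything downstream unchanged, so later results still refer to positions in the original read.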

Input and output

The input and output of trna-scan are the same as those of rrna-hmm.

3.5. CD-HIT-EST, clustering of reads

Cluster analysis is a key approach in this pipeline. The ultra-fast sequence clustering algorithm CD-HIT (Li and Godzik, 2006; Li, et al., 2001; Li, et al., 2002) was modified to handle large metagenomic datasets. Using the DNA version of CD-HIT, CD-HIT-EST, the raw metagenomic reads from one or more metagenomes are clustered together at 95% sequence identity over 80% of the length of the shorter sequence (clustering parameters can be adjusted by users) to identify clusters of unique read sequences, or read clusters.

Input and output

The input is a FASTA file of raw reads. Two output files are generated: (1) a FASTA file of representative sequences, and (2) a text file (.clstr) that lists the clusters.

Parameters

Default command line options for CD-HIT-EST for RAMMCAP:
   -d 0 -n 10 -l 11 -r 1 -p 1 -g 1 -G 0 -c 0.95 -aS 0.8
where
   -c 0.95 means 95% identity
   -aS 0.8 means the alignment covers 80% of the shorter sequence

Detailed usage of CD-HIT-EST can be found at http://cd-hit.org/. Other options:
   -i   input filename in fasta format, required
   -o   output filename, required
   -c   sequence identity threshold, default 0.9
        this is the default cd-hit's "global sequence identity" calculated as:
        number of identical amino acids in alignment
        divided by the full length of the shorter sequence
   -G   use global sequence identity, default 1
        if set to 0, then use local sequence identity, calculated as:
        number of identical amino acids in alignment
        divided by the length of the alignment
        NOTE!!! don't use -G 0 unless you use alignment coverage controls
        see options -aL, -AL, -aS, -AS
   -b   band_width of alignment, default 20
   -n   word_length, default 8, see user's guide for choosing it
   -l   length of throw_away_sequences, default 10
   -d   length of description in .clstr file, default 20
        if set to 0, it takes the fasta defline and stops at first space
   -s   length difference cutoff, default 0.0
        if set to 0.9, the shorter sequences need to be
        at least 90% length of the representative of the cluster
   -S   length difference cutoff in amino acid, default
        if set to 60, the length difference between the shorter sequences
        and the representative of the cluster can not be bigger than 60
   -aL  alignment coverage for the longer sequence, default 0.0
        if set to 0.9, the alignment must cover 90% of the sequence
   -AL  alignment coverage control for the longer sequence, default
        if set to 60, and the length of the sequence is 400,
        then the alignment must be >= 340 (400-60) residues
   -aS  alignment coverage for the shorter sequence, default 0.0
        if set to 0.9, the alignment must cover 90% of the sequence
   -AS  alignment coverage control for the shorter sequence, default
        if set to 60, and the length of the sequence is 400,
        then the alignment must be >= 340 (400-60) residues
   -A   minimal alignment coverage control for both sequences, default 0
        alignment must cover >= this value for both sequences
   -B   1 or 0, default 0, by default, sequences are stored in RAM
        if set to 1, sequences are stored on hard drive
        it is recommended to use -B 1 for huge databases
   -p   1 or 0, default 0
        if set to 1, print alignment overlap in .clstr file
   -g   1 or 0, default 0
        by cd-hit's default algorithm, a sequence is clustered to the first
        cluster that meets the threshold (fast cluster). If set to 1, the program
        will cluster it into the most similar cluster that meets the threshold
        (accurate but slow mode)
        but either 1 or 0 won't change the representatives of final clusters
   -r   1 or 0, default 0, by default only +/+ strand alignment
        if set to 1, do both +/+ & +/- alignments

Format of the cluster output file

One output file (.clstr, in native cd-hit format) lists all the clusters. A line with a leading ">" starts a cluster, and the sequences of that cluster are listed below it. The sequence marked with a "*" is the representative of the cluster. The others are redundant sequences, and the similarity between each redundant sequence and the representative is printed. Example:

   >Cluster
      nt, > at 1:53:2:54/+/100%
   1  194nt, > *
   2  101nt, > at 1:101:3:103/+/96%
   3  54nt, > at 3:54:3:54/+/100%
   4  66nt, > at 1:66:2:67/+/100%
   5  90nt, > at 1:85:2:86/+/97%
   6  71nt, > at 8:71:2:65/+/100%
   7  57nt, > at 5:57:2:54/+/100%
   8  103nt, > at 9:103:2:96/+/95%
   9  58nt, > at 6:58:2:54/+/100%

The sequence marked with "*" is the representative of this cluster. A record such as "at 9:103:2:96/+/95%" means that the sequence is 95% identical to the representative, that 9-103 is the overlapping region on the sequence, and that 2-96 is the overlapping region on the representative. "+" means plus/plus strand.

3.6. ORF_finder

ORFs are collected from raw read sequences with ORF_finder, an ORF calling program implemented here by six reading frame translation, in a similar way to the GOS study (Yooseph, et al., 2007).
Within each reading frame, an ORF starts at the beginning of a read or at the first ATG after a previous stop codon; it then ends at the first stop codon or at the end of that read. The minimal length of ORFs can be specified by users. Since the raw reads are short and fragmented, a predicted ORF may be only a portion of a complete ORF. An ORF may also be a translation from a non-coding frame; such an ORF is called a spurious ORF, as defined in the GOS study (Yooseph, et al., 2007). Some spurious ORFs can further be identified through downstream clustering analysis and annotation. The sensitivity and specificity of ORF_finder and Metagene (next component) are listed in Table 2. Results are based on 4 simulated datasets of 100, 200, 400, and 800 bp.

Table 2 Sensitivity and specificity of ORF_finder and Metagene
Values are sensitivity / specificity (%).

   Method                      Sim100   Sim200   Sim400   Sim800
   Metagene
   ORF_finder, cut off=30aa
   ORF_finder, cut off=40aa
   ORF_finder, cut off=50aa
   ORF_finder, cut off=60aa

Input and output

The input is a FASTA file of raw reads. The output is a FASTA file of called ORFs. The start position, end position, strand, frame, and length are reported in the FASTA deflines.

Parameters

Default command line options for ORF_finder for RAMMCAP:
   -l 30 -L 30 -t 11

Other options:
   -i   input_file, default stdin
   -o   output_file, default stdout
   -l   minimal length of orf, default 20
   -L   minimal length of orf between two stop codons, default 40
   -M   max_dna_len, in MB, length of the longest input DNA sequence, default 10
   -X   max letter X (ratio), default 0.1
   -t   translation table, default 1
   -b   ORF begin option: default 2
        1: start at the beginning of DNA sequence or after previous stop codon
        2: start with the first ATG if there is a stop codon upstream
        We don't know which ATG is the real start, but for prokaryotic DNA,
        a fragment between a stop codon and the first ATG can not be part of
        a real gene. Therefore, -b 2 is recommended for prokaryotic DNA
   -e   ORF end option: default 1
        1: end at the end of DNA sequence or at a stop codon
        2: must end at a stop codon

3.7. Metagene

ORFs can also be called from raw reads with the program Metagene (Noguchi, et al., 2006).

Input and output

The input is a FASTA file of raw reads. The output is a FASTA file of called ORFs. The start position, end position, strand, frame, and length are reported in the FASTA deflines.
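The ORF_finder rule described in Section 3.6 (six-frame translation, start at the read beginning or at the first ATG after a stop codon, end at a stop codon or at the read end) can be illustrated with a short Python sketch. This is not the ORF_finder source code: all function names are hypothetical, the standard codon assignments are assumed, and options such as -t, -L, -X, and -e 2 are not modeled.

```python
from itertools import product

# Standard codon assignments in TCAG order (assumption: the -t translation
# table option is not modeled here).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Reverse-complement a DNA string; other letters become N."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp.get(b, "N") for b in reversed(dna.upper()))


def translate(dna, offset=0):
    """Translate one reading frame; codons with ambiguous bases give X."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(offset, len(dna) - 2, 3)
    )


def orfs_in_frame(protein, min_len):
    """Apply the start/end rule to one translated frame ('*' marks a stop)."""
    orfs = []
    for idx, segment in enumerate(protein.split("*")):
        if idx > 0:
            # After a stop codon, the ORF must begin at the first ATG (M).
            start = segment.find("M")
            if start < 0:
                continue
            segment = segment[start:]
        # The first segment starts at the beginning of the read as-is.
        if len(segment) >= min_len:
            orfs.append(segment)
    return orfs


def six_frame_orfs(read, min_len=30):
    """Return (strand, frame, peptide) for ORFs in all six reading frames."""
    results = []
    for strand, dna in (("+", read), ("-", reverse_complement(read))):
        for frame in range(3):
            for orf in orfs_in_frame(translate(dna, frame), min_len):
                results.append((strand, frame, orf))
    return results


# Tiny demonstration read; min_len is lowered so that short ORFs show up.
for hit in six_frame_orfs("ATGAAATAGCCATGGGTTTT", min_len=2):
    print(hit)
```

With the RAMMCAP default of -l 30, `min_len=30` would be the matching setting; the demonstration lowers it only because the example read is 20 bases long.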

3.8. CD-HIT - Clustering of ORFs

ORFs are first clustered at 90-95% identity to identify the non-redundant sequences, which are further clustered into families (ORF clusters) at a conservative threshold, 60% identity over 80% of the length of ORFs. ORF clusters are used for functional studies.

Input and output

There are two runs of CD-HIT: cd_hit_90 and cd_hit_60. The input for the first run is the ORF FASTA file. Two output files are generated: (1) a FASTA file of representative sequences, and (2) a text file (.clstr) that lists the clusters. The FASTA file of representative sequences from the first run is the input of the second CD-HIT run, which produces another two files.

Parameters

Default command line options for CD-HIT:
   First run (service cd_hit_90)
   -d 0 -n 5 -p 1 -g 1 -G 0 -c <identity threshold> -aS 0.8
   Second run (service cd_hit_60)
   -d 0 -n 4 -p 1 -g 1 -G 0 -c <identity threshold> -aS 0.8
The -c identity threshold follows the run: about 90% identity for cd_hit_90 and 60% identity for cd_hit_60.

Detailed usage of CD-HIT can be found at http://cd-hit.org/. Other options:
   -i   input filename in fasta format, required
   -o   output filename, required
   -c   sequence identity threshold, default 0.9
        this is the default cd-hit's "global sequence identity" calculated as:
        number of identical amino acids in alignment
        divided by the full length of the shorter sequence
   -G   use global sequence identity, default 1
        if set to 0, then use local sequence identity, calculated as:
        number of identical amino acids in alignment
        divided by the length of the alignment
        NOTE!!! don't use -G 0 unless you use alignment coverage controls
        see options -aL, -AL, -aS, -AS
   -b   band_width of alignment, default 20
   -n   word_length, default 5, see user's guide for choosing it
   -l   length of throw_away_sequences, default 10
   -t   tolerance for redundance, default 2
   -d   length of description in .clstr file, default 20
        if set to 0, it takes the fasta defline and stops at first space
   -s   length difference cutoff, default 0.0
        if set to 0.9, the shorter sequences need to be
        at least 90% length of the representative of the cluster
   -S   length difference cutoff in amino acid, default
        if set to 60, the length difference between the shorter sequences
        and the representative of the cluster can not be bigger than 60
   -aL  alignment coverage for the longer sequence, default 0.0
        if set to 0.9, the alignment must cover 90% of the sequence
   -AL  alignment coverage control for the longer sequence, default
        if set to 60, and the length of the sequence is 400,
        then the alignment must be >= 340 (400-60) residues
   -aS  alignment coverage for the shorter sequence, default 0.0
        if set to 0.9, the alignment must cover 90% of the sequence
   -AS  alignment coverage control for the shorter sequence, default
        if set to 60, and the length of the sequence is 400,
        then the alignment must be >= 340 (400-60) residues
   -A   minimal alignment coverage control for both sequences, default 0
        alignment must cover >= this value for both sequences
   -B   1 or 0, default 0, by default, sequences are stored in RAM
        if set to 1, sequences are stored on hard drive
        it is recommended to use -B 1 for huge databases
   -p   1 or 0, default 0
        if set to 1, print alignment overlap in .clstr file
   -g   1 or 0, default 0
        by cd-hit's default algorithm, a sequence is clustered to the first
        cluster that meets the threshold (fast cluster). If set to 1, the program
        will cluster it into the most similar cluster that meets the threshold
        (accurate but slow mode)
        but either 1 or 0 won't change the representatives of final clusters

Merge two clustering runs

The .clstr file of the first run lists all the ORFs in clusters, but the .clstr file of the second run lists only the representative sequences from the first run. The results of these two runs are therefore merged by a script. The merged .clstr file contains all the ORFs for the clusters of the second run. This script also gives the cluster distributions (Figure 2).

Figure 2 Cluster distribution. The x axis is the cluster size x, the number of sequences in a cluster. The red line is plotted against the left y axis, which is the number of clusters of at least size x. The green line is plotted against the right y axis, which is the % of sequences found in clusters of at least size x.

3.9. Annotation of ORFs

ORFs are annotated from Pfam and Tigrfam with HMMER (Eddy, 1998), accelerated with Hammerhead (Portugaly, et al., 2007), and from COG with RPS-BLAST (Altschul, et al., 1997). Hits must have E-values ≤ 0.001 and, for HMMER searches, meet the significance score thresholds. This annotation process takes only ~100 CPU hours for 1 million ORFs.

Input and output

The input is a FASTA file of ORFs. The output is a TAB-delimited text file that lists all hits with e-value, alignment positions, etc., such as:

   ID   Pfam   Evalue   Start Query   End Query   Start Model   End Model   Length Model   *

   DICE_READ_   PF   e
   DICE_READ_   PF   e
   DICE_READ_   PF   e
   DICE_READ_   PF   e
   DICE_READ_   PF   e
   DICE_READ_   PF   e
   DICE_READ_   PF   e
   DICE_READ_   PF   e
   DICE_READ_   PF   e
   DICE_READ_   PF   e
   DICE_READ_   PF   e
   DICE_READ_   PF   e

* Unique hit flag: if a Pfam family is hit more than once for a query, the best hit has flag 1 and the other hits have a different flag value.

3.10. GO annotation

GO annotations are derived from the Pfam and TIGRfam annotations through Pfam-to-GO or TIGRfam-to-GO mappings. The output is a TAB-delimited file that shows the ORF_ID, the GO_ID, the Pfam/TIGRfam family it was inferred from, and the e-value of the original Pfam/TIGRfam hit, such as:

   DICE_READ_   GO:   TIGRFAM   TIGR
   DICE_READ_   GO:   TIGRFAM   TIGR
   DICE_READ_   GO:   TIGRFAM   TIGR   e-13
   DICE_READ_   GO:   TIGRFAM   TIGR   e-13
   DICE_READ_   GO:   TIGRFAM   TIGR   e-13
   DICE_READ_   GO:   TIGRFAM   TIGR   e-13
   DICE_READ_   GO:   TIGRFAM   TIGR   e-13
   DICE_READ_   GO:   TIGRFAM   TIGR   e-13
   DICE_READ_   GO:   TIGRFAM   TIGR   e-11
   DICE_READ_   GO:   TIGRFAM   TIGR   e-11
   DICE_READ_   GO:   TIGRFAM   TIGR   e-11
   DICE_READ_   GO:   TIGRFAM   TIGR   e-11
   DICE_READ_   GO:   TIGRFAM   TIGR   e-11
   DICE_READ_   GO:   TIGRFAM   TIGR   e-11
   DICE_READ_   GO:   TIGRFAM   TIGR   e

3.11. EC number

EC numbers are derived from GO through a GO-to-EC mapping.

3.12. FASTA file summary

FASTA files are scanned and analyzed. Summaries, such as the number of sequences, the distribution of lengths, and the distribution of GC bins, are reported.

4. RAMMCAP under CAMERA web portal

In CAMERA, RAMMCAP is available under the CAMERA Portal (Figure 3).


Figure 3 Snapshot of the workflow page within the CAMERA 2.0 portal. The highlighted workflow is the RAMMCAP pipeline.

4.1. Running the annotation workflows

To run the annotation workflows, select the workflow and click the Start button, then specify parameters and upload a FASTA file on the following page (Figure 4). Users can specify parameters for the following components in the full metagenomic data annotation and clustering workflow:
   ORF clustering, 1st run: clustering of ORFs at ~90% identity
   ORF clustering, 2nd run: clustering of ORFs at ~60% identity
   Read clustering: clustering of reads at ~95% identity
   ORF finder: 6 reading frame ORF translation

The default parameters are supplied on the portal. Clicking the icon will pop up a window for parameter setup (Figure 4). Please refer to the Tools section of this document for how to specify detailed parameters. To submit the job, click the submit workflow button.


Figure 4 Snapshot of the workflow configuration and submission page

4.2. Retrieve the annotation results

Click the Current Jobs link in the right column (Figure 3) to check the status of submitted jobs. When a job is done, click the result link to view, browse, and download the annotation results (Figure 5).

Figure 5 Results of annotation workflow

5. Running RAMMCAP locally

5.1. Download RAMMCAP package

The RAMMCAP software is available for download. The package includes all the programs, scripts, and databases needed to run RAMMCAP.

5.2. Installation

RAMMCAP runs under Linux. After unpacking the package, change to that directory and compile CD-HIT, CD-HIT-454, and ORF_finder:

   $ cd ./cd-hit          # please read the README file there
   $ make
   $ cd ..
   $ cd ./cd-hit-454      # please read the README file there
   $ make
   $ cd ..
   $ cd ./orf_finder      # please read the README file there
   $ make
   $ cd ..

Gnuplot 4.2 is needed to generate graphic files with some of the scripts. A gnuplot binary is included in the RAMMCAP package. If it won't run on your computer, please install this package.

ImageMagick is also needed for graphic files. You will need two packages similar to
   ImageMagick perl c4
   ImageMagick c4

The most recent version of CD-HIT is available from http://cd-hit.org and can be downloaded separately.

If you want to use the utilities in the qsub-scripts directory to submit annotation jobs to a computer cluster, please edit qsub-scripts/service-local.pl to
   modify $SL_rammcap_dir, which should be the installation directory of the RAMMCAP package
   modify the qsub command call

5.3. Programs

After unpacking the RAMMCAP package, you will find the following directories:

   bin                        binary files
   blast                      BLAST program and databases, including COG
   cd-hit-454                 CD-HIT-454 program
   cd-hit                     CD-HIT
   GO                         GO data
   hmmer2.3.2 hmmerhead_0.5   HMMER program with Hmmerhead
   HMM_lib                    HMM library of Pfam and Tigrfam
   Metagene                   Metagene program and wrapping scripts
   orf_finder                 ORF_finder program
   qsub-scripts               Example scripts to submit jobs to computer cluster
   rammcap                    Scripts for statistical comparison of metagenomes
   rrna_hmm_fs                rrna_hmm program
   rrna_package               rrna_blastn program
   scripts                    wrapping scripts, utility scripts, parsing scripts, etc.
   trna_package               trnascan program and wrapping scripts
   work_space                 example

Here, cd-hit-454, cd-hit, orf_finder, rrna_hmm_fs, rrna_package, and trna_package are all independent programs developed by us. Each of them has a separate README file and example; please read and test them.

5.4. Example on a desktop computer

A test database and a README file are provided in the ./example_desktop directory.

5.5. Example on a computer cluster

A test database and a README file are provided in the ./example_computer_cluster directory.

6. Statistical comparison of metagenomes

A detailed guide and examples for statistical comparison of metagenomes are provided under the ./rammcap directory.

7. References

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res, 25,

Eddy, S.R. (1998) Profile hidden Markov models, Bioinformatics, 14,

Huang, Y., Gilna, P. and Li, W.Z. (2009) Identification of ribosomal RNA genes in metagenomic fragments, Bioinformatics, 25,

Li, W. (2009) Analysis and comparison of very large metagenomes with fast clustering and functional annotation, BMC Bioinformatics, 10, 359.

Li, W. and Godzik, A. (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences, Bioinformatics, 22,

Li, W., Jaroszewski, L. and Godzik, A. (2001) Clustering of highly homologous sequences to reduce the size of large protein databases, Bioinformatics, 17,

Li, W., Jaroszewski, L. and Godzik, A. (2002) Tolerating some redundancy significantly speeds up clustering of large protein databases, Bioinformatics, 18,

Li, W., Wooley, J.C. and Godzik, A. (2008) Probing metagenomics by rapid cluster analysis of very large datasets, PLoS ONE, 3, e3375.

Noguchi, H., Park, J. and Takagi, T. (2006) MetaGene: prokaryotic gene finding from environmental genome shotgun sequences, Nucleic Acids Res, 34,

Portugaly, E., Johnson, S., Ninio, M. and Eddy, S. (2007) Improved HMMERHEAD for Better Sensitivity, RECOMB 07 Poster.

Yooseph, S., Li, W. and Sutton, G. (2008) Gene identification and protein classification in microbial metagenomic sequence data via incremental clustering, BMC Bioinformatics, 9, 182.

Yooseph, S., Sutton, G., Rusch, D.B., Halpern, A.L., Williamson, S.J., Remington, K., Eisen, J.A., Heidelberg, K.B., Manning, G., Li, W., Jaroszewski, L., Cieplak, P., Miller, C.S., Li, H., Mashiyama, S.T., Joachimiak, M.P., van Belle, C., Chandonia, J.M., Soergel, D.A., Zhai, Y., Natarajan, K., Lee, S., Raphael, B.J., Bafna, V., Friedman, R., Brenner, S.E., Godzik, A., Eisenberg, D., Dixon, J.E., Taylor, S.S., Strausberg, R.L., Frazier, M. and Venter, J.C. (2007) The Sorcerer II Global Ocean Sampling Expedition: Expanding the Universe of Protein Families, PLoS Biol, 5, e
