Community analysis of 16S rRNA amplicon sequencing data with Chipster. Eija Korpelainen, CSC IT Center for Science, Finland

2 What will I learn?
- How to operate the Chipster software
- Community analysis of 16S rRNA amplicon sequencing data
  - Central concepts
  - Analysis steps
  - File formats

3 Introduction to Chipster

4 Chipster
- Provides easy access to over 360 analysis tools
  - Command line tools
  - R/Bioconductor packages
- Free, open source software
- What can I do with Chipster?
  - Analyze and integrate high-throughput data
  - Visualize data efficiently
  - Share analysis sessions
  - Save and share automatic workflows

5 Analysis tool overview
- 160 NGS tools for: RNA-seq, miRNA-seq, exome/genome-seq, ChIP-seq, FAIRE/DNase-seq, CNA-seq, 16S rRNA amplicon seq, single cell RNA-seq
- 140 microarray tools for: gene expression, miRNA expression, protein expression, aCGH, SNP, integration of different data
- 60 tools for sequence analysis: BLAST, EMBOSS, MAFFT, Phylip

6 Tools for community analysis of amplicon sequencing data
- Quality control with FastQC, trimming with Trimmomatic
- Preprocessing and taxonomy assignment with the Mothur package
  - Trim primers and barcodes and filter reads
  - Combine paired reads to contigs (for MiSeq data)
  - Screen sequences for several criteria
  - Extract unique sequences
  - Align sequences to 16S rRNA reference alignment
  - Remove empty alignment columns
  - Precluster aligned sequences
  - Remove chimeric sequences
  - Classify sequences to taxonomic units
- Statistical analyses using R
  - Compare sample groups using several ANOVA-type analyses
- Visualization using R
  - Rarefaction curve, rank-abundance curve, RDA plot

7 Chipster: technical aspects
- Client-server system
  - Enough CPU and memory for large analysis jobs
  - Centralized maintenance
- Easy to install
  - Client uses Java Web Start
  - Server available as a virtual machine

10 Mode of operation
- Select data, then tool category, then tool; run; visualize

11 Job manager
- You can run many analysis jobs at the same time
- Use Job manager to
  - view status
  - cancel jobs
  - view time
  - view parameters

12 Analysis sessions
- Remember to save the analysis session.
- Session includes all the files, their relationships and metadata (what tool and parameters were used to produce each file).
- Session is a single .zip file.
- You can save a session locally (on your computer) and in the cloud, but note that cloud sessions are not stored forever!

13 Workflow panel
- Shows the relationships of the files
- You can move the boxes around, and zoom in and out
- Several files can be selected by keeping the Ctrl key down
- Right clicking on the data file allows you to
  - Save an individual result file (Export)
  - Delete
  - Link to another data file
  - Save workflow

14 Workflow: reusing and sharing your analysis pipeline
- You can save your analysis steps as a reusable automatic macro, which you can apply to another dataset
- When you save a workflow, all the analysis steps and their parameters are saved as a script file, which you can share with other users

15 Visualizing the data
- Data visualization panel
  - Maximize and redraw for better viewing
  - Detach = open in a separate window, allows you to view several images at the same time
- Two types of visualizations
  1. Interactive visualizations produced by the client program
     - Select the visualization method from the pulldown menu
     - Save by right clicking on the image
  2. Static images produced by analysis tools
     - Select from Analysis tools / Visualisation
     - View by double clicking on the image file
     - Save by right clicking on the file name and choosing Export

16 Options for importing data to Chipster
- Import files / Import folder
- Import from URL
  - Utilities / Download file from URL directly to server
- Open an analysis session
  - Files / Open session
- Import from SRA database
  - Utilities / Retrieve FASTQ or BAM files from SRA
- Import from Ensembl database
  - Utilities / Retrieve data for a given organism in Ensembl
- What kind of data files can I use in Chipster?
  - Compressed files (.gz) are ok
  - FASTQ, FASTA, SFF

17 Problems?
- Send us a support request
  - The request includes the error message and a link to the analysis session (optional)

18 Acknowledgements to Chipster users and contributors

19 More info
- Chipster tutorials on YouTube

20 Community analysis of 16S rRNA data

21 Main sections of community analysis
- Preprocessing
  - Clean sequences and align them to 16S rRNA reference alignment
- Classification
  - Taxonomic assignment of sequences
- Community analysis and visualization
  - Compare sample groups
  - Indicator species analysis

22 Analysis workflow for MiSeq data
- Check the base quality of the reads
- Make a Tar package of your fastq files
- Combine paired reads to contigs
- Screen sequences for length and ambiguous bases
- Remove identical sequences
- Align sequences to reference alignment
- Screen aligned sequences for alignment position, homopolymers
- Filter alignment for empty columns and overhangs, remove new identical sequences
- Precluster very similar sequences
- Remove chimeric sequences
- Classify sequences to taxonomic units, remove unwanted lineages
- Count species per sample
- Statistical analyses, visualization

24 What and why?
- Potential problems
  - low confidence bases, Ns
  - adapters
- Knowing about potential problems in your data allows you to
  - correct for them before you spend a lot of time on analysis
  - take them into account when interpreting results

25 Raw reads: FASTQ file format
- Four lines per read:
    @read name
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
    + read name
    !''*((((***+))%%%++)(%%%%).1***-+*''))**55ccf>>>>>>ccccccc65
- Attention: Do not unzip FASTQ files. Chipster's analysis tools can cope with zipped files (.gz)
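
The four-line record layout can be read directly from a gzipped file without unzipping it first. The following is a minimal sketch in Python, not part of Chipster; the function name `read_fastq` and the in-memory test data are assumptions made for illustration.

```python
import gzip
import io

def read_fastq(handle):
    """Yield (name, sequence, quality) from a text handle, four lines per read."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, optionally repeating the name
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed record: {header!r}")
        yield header[1:], seq, qual

# Hypothetical two-read file, gzipped in memory to mimic a .fastq.gz file.
raw = b"@read1\nGATTTGGG\n+\n!''*((((\n@read2\nACGT\n+\nIIII\n"
with gzip.open(io.BytesIO(gzip.compress(raw)), "rt") as fh:
    records = list(read_fastq(fh))

for name, seq, qual in records:
    print(name, len(seq), qual)
```

The sequence and quality lines must have the same length, which is the one sanity check the sketch performs.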

26 Base qualities
- If the quality of a base is 20, the probability that it is wrong is 0.01
- Phred quality score Q = -10 * log10(probability that the base is wrong)
- Sanger encoding: numbers are shown as ASCII characters so that 33 is added to the Phred score
  - E.g. 39 is encoded as H, the 72nd ASCII character (39 + 33 = 72)
- Note that older Illumina data uses different encoding
  - Illumina 1.3: add 64 to Phred
  - Illumina: add 64 to Phred, and ASCII 66 (B) means that the whole read segment has low quality
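
The relation between Phred scores, error probabilities and ASCII characters can be checked with a few lines of Python. This is an illustrative sketch; the function names are assumptions, and the offsets 33 and 64 are the ones given on the slide.

```python
import math

def error_probability(q):
    """Probability that a base with Phred score q is wrong."""
    return 10 ** (-q / 10)

def phred_score(p):
    """Phred score for an error probability p."""
    return -10 * math.log10(p)

def encode_quality(q, offset=33):
    """ASCII character for a Phred score (offset 33 = Sanger, 64 = older Illumina)."""
    return chr(q + offset)

def decode_qualities(qual, offset=33):
    """Phred scores for a FASTQ quality string."""
    return [ord(c) - offset for c in qual]

print(error_probability(20))           # Q20 corresponds to one error in 100
print(encode_quality(39))              # Sanger encoding of Q39
print(decode_qualities("h", offset=64))  # the same character means something else with offset 64
```

Decoding with the wrong offset shifts every score by 31, which is why the encoding has to be known before quality trimming.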

27 Base quality encoding systems

28 How to check sequence quality?
- You can use either FastQC or PRINSEQ tools in Chipster (tool category Quality control)
  - Both provide graphical reports, FastQC is faster
  - They check many things, including base quality and composition, duplication, Ns, k-mers and adaptors
- Note that you can run the analysis in parallel for max 10 fastq files (select the files and click Run for each)
- Currently an individual report is produced for each sample, but we are in the process of integrating the MultiQC tool in Chipster, which will provide a summary report showing results for each individual sample.

29 Per position base quality (FastQC): examples of good, ok and bad quality

30 Per position base quality (FastQC)

31 What if there is a quality problem?
- You can either trim or filter sequences with Trimmomatic or PRINSEQ (tool category Preprocessing)
  - Trimmomatic is faster
  - Both offer numerous options

32 Trimmomatic options in Chipster
- Adapters
- Minimum quality
  - Per base, one base at a time or in a sliding window, from the 3' or 5' end
  - Per base adaptive quality trimming (balance length and errors)
  - Minimum (mean) read quality
- Trim x bases from left / right
- Minimum read length after trimming
- Copes with paired end data

33 Filtering vs trimming
- Filtering removes the entire read
- Trimming removes only the bad quality bases
  - It can remove the entire read, if all bases are bad
- Trimming makes reads shorter
  - This might not be optimal for some applications
- Paired end data: the matching order of the reads in the two files has to be preserved
  - If a read is removed, its pair has to be removed as well
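
The difference between trimming and filtering, and the need to drop both mates of a pair together, can be sketched as follows. This is a simplified illustration and not Trimmomatic's actual implementation; the function names, the window size and the thresholds are assumptions.

```python
def sliding_window_trim(seq, quals, window=4, min_mean=20):
    """Cut the read at the first window whose mean quality is below min_mean."""
    for start in range(len(quals) - window + 1):
        if sum(quals[start:start + window]) / window < min_mean:
            return seq[:start], quals[:start]
    return seq, quals

def trim_and_filter_pairs(pairs, min_len=3, **trim_args):
    """Trim both mates; keep a pair only if both mates are still long enough."""
    kept = []
    for (seq1, q1), (seq2, q2) in pairs:
        mate1 = sliding_window_trim(seq1, q1, **trim_args)
        mate2 = sliding_window_trim(seq2, q2, **trim_args)
        if len(mate1[0]) >= min_len and len(mate2[0]) >= min_len:
            kept.append((mate1, mate2))  # order of the pairs is preserved
    return kept

good = [30] * 8
bad_tail = [30, 30, 30, 30, 30, 5, 5, 5]
all_bad = [5] * 8
pairs = [
    (("ACGTACGT", good), ("TTGCATGC", bad_tail)),  # second mate gets trimmed
    (("ACGTACGT", good), ("TTGCATGC", all_bad)),   # second mate fails, pair is dropped
]
kept = trim_and_filter_pairs(pairs)
print(kept)
```

Because the two mates are handled in the same loop iteration, the two output files can never get out of step.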

35 Make a Tar package of fastq files
- Typically each sample has two fastq files. When you have a lot of samples, Chipster's workflow view can become crowded.
- In order to keep the view clearer, you can put all the fastq files in one Tar package
  - Use the tool Utilities / Make Tar package
  - Fastq files can be zipped
- When your Tar package is ready, you can delete the original fastq files from your Chipster session
- If you want to look at the individual fastq files later, you can always open the Tar package using the tool Utilities / Extract .tar.gz file

37 Combine paired reads to contigs
- Reads are joined to contigs using the Mothur make.contigs tool, which
  - creates a reverse complement of the reverse read and performs a Needleman-Wunsch alignment of the two reads
  - if one read has a base and the other has a gap, the quality of the base has to be at least 25 for the base to be kept
  - if the bases differ, the quality difference has to be at least 6. If it is less, the consensus base is set to N.
- Input file: Tar package of fastq files
- Output files
  - contigs.fasta.gz = contig sequences
  - contigs.groups = assignment of contigs to samples
  - contig.numbers.txt = number of contig sequences in each sample
  - contigs.summary.tsv = sequence information
  - samples.fastqs.txt = fastq file assignment to samples
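
The consensus rules listed above can be written out for already aligned reads. This Python sketch only illustrates the decision rule; it is not Mothur's code, the alignment step is omitted, and the function names and the use of `None` as the quality of a gap are assumptions.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Reverse complement, as applied to the reverse read before alignment."""
    return seq.translate(COMPLEMENT)[::-1]

def consensus_base(b1, q1, b2, q2, gap_min_q=25, min_delta=6):
    """Consensus for one alignment column; '-' marks a gap."""
    if b1 == "-" and b2 == "-":
        return ""
    if b1 == "-":
        return b2 if q2 >= gap_min_q else ""   # base vs gap: keep only a confident base
    if b2 == "-":
        return b1 if q1 >= gap_min_q else ""
    if b1 == b2:
        return b1
    if abs(q1 - q2) >= min_delta:              # disagreement: trust the clearly better base
        return b1 if q1 > q2 else b2
    return "N"

def build_contig(read1, quals1, read2, quals2):
    """Join two aligned reads of equal length into one contig sequence."""
    return "".join(consensus_base(*column)
                   for column in zip(read1, quals1, read2, quals2))

contig = build_contig("AC-GT", [30, 30, None, 30, 12],
                      "ACTGA", [30, 30, 20, 30, 10])
print(contig)
```

In the example the low quality T opposite the gap is dropped, and the final column becomes N because the two qualities differ by only 2.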

38 Mothur's make.contigs tool is not ideal
- Problems with read pairs with short or bad quality overlap
- Low quality ends with the MiSeq 2x300 chemistry
  - Sequence only short regions (~250 recommended by Patrick Schloss) so that you get full overlap of the reads
- USEARCH fastq_mergepairs followed by fastq_filter might work better (not possible to offer in Chipster due to licensing)

39 samples.fastqs.txt
- The Make contigs tool in Chipster assigns fastq files to each sample. You can check this in the output file samples.fastqs.txt.
- If the assignment is wrong, you can make a samples.fastqs.txt file yourself and give it as input.

40 summary.tsv
- Number of sequences, total (and unique)
- Min, max, mean, median and quantiles of
  - start and end positions
  - number of bases and ambiguous bases
  - homopolymer length

41 group file
- Sequence name and sample assignment

42 Contig.numbers.txt
- Number of contig sequences per sample and in total

44 Screen sequences for several criteria
- Filters sequences for length, ambiguous bases, homopolymers
- You can either set the minimum and maximum sequence length manually, or select optimize and tell what percentage of sequences should be kept.
- Based on Mothur tool screen.seqs.
- Input file: Fasta file and group file
- Output files
  - screened.fasta.gz = screened sequences
  - screened.groups = assignment of sequences to samples
  - summary.screened.tsv = sequence information
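
The three screening criteria can be expressed as a simple predicate. This is a sketch of the idea only, not Mothur's screen.seqs; the function names and default thresholds are assumptions.

```python
import re

def longest_homopolymer(seq):
    """Length of the longest run of one repeated base."""
    return max((len(m.group(0)) for m in re.finditer(r"(.)\1*", seq)), default=0)

def passes_screen(seq, min_len, max_len, max_ambig=0, max_homop=8):
    """True if the sequence meets the length, ambiguity and homopolymer limits."""
    ambiguous = sum(1 for base in seq.upper() if base not in "ACGT")
    return (min_len <= len(seq) <= max_len
            and ambiguous <= max_ambig
            and longest_homopolymer(seq) <= max_homop)

reads = {"ok": "ACGTACGT", "short": "ACG", "ambiguous": "ACGTNACGT", "homopolymer": "AAAAAAAAAC"}
kept = [name for name, seq in reads.items() if passes_screen(seq, min_len=5, max_len=20)]
print(kept)
```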

46 Remove identical sequences
- Many sequences are identical. It would be computationally wasteful to align the same sequence to the reference many times
- Remove identical sequences and keep only one representative in the fasta file
  - keep track of how many sequences it represents, and store this info in a count_table file
  - Alternatively we could list the names of each represented sequence, but this names file would be very large as sequence names are long
- Based on Mothur tools unique.seqs and count.seqs.
- Input file: Fasta file and group file
- Output files
  - unique.fasta = unique sequences
  - unique.count_table = how many represented sequences are in each sample
  - unique.summary.tsv = sequence information

47 Count_table file
- Rows are names of unique sequences
- Columns are samples
- Cells show how many times each sequence occurs in each sample
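
Dereplication and the resulting count table can be sketched with two dictionaries. This is an illustration of the bookkeeping, not the Mothur implementation; the function name and the choice of the first read seen as the representative are assumptions.

```python
from collections import Counter

def dereplicate(seqs, groups):
    """seqs: read name -> sequence, groups: read name -> sample.

    Returns the unique sequences (keyed by representative read name) and a
    count table: representative name -> Counter(sample -> number of reads).
    """
    representative = {}  # sequence -> name of the first read with that sequence
    count_table = {}
    for name, seq in seqs.items():
        rep = representative.setdefault(seq, name)
        count_table.setdefault(rep, Counter())[groups[name]] += 1
    unique = {rep: seq for seq, rep in representative.items()}
    return unique, count_table

seqs = {"r1": "ACGT", "r2": "ACGT", "r3": "GGGG", "r4": "ACGT"}
groups = {"r1": "sampleA", "r2": "sampleB", "r3": "sampleA", "r4": "sampleA"}
unique, count_table = dereplicate(seqs, groups)
for rep, per_sample in count_table.items():
    print(rep, dict(per_sample))
```

Only the representative names appear in the table, which is what keeps it much smaller than a file listing every read name.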

49 Align sequences to reference alignment
- Aligns sequences to reference 16S rRNA alignment
  - Chipster offers the full Silva 16S rRNA reference set and its bacterial subsection. You can also provide your own reference alignment in fasta format.
  - Indicate the region of the reference alignment which matches the region that you amplified.
- K-mer searching with 8mers is followed by Needleman-Wunsch pairwise alignment.
- Speed depends on the number and length of the sequences.
- Result is given in fasta format. Periods '.' lead up to the first base in the sequence and follow the last base.
- Based on Mothur tools align.seqs and pcr.seqs
- Input file: Fasta file and count_table file
- Output files
  - aligned.fasta.gz = aligned sequences
  - custom.reference.summary.tsv = information on the region of the reference used
  - aligned-summary.tsv = aligned sequence information

50 Alignment output files: Align.fasta, aligned-summary.tsv, custom.reference.summary.tsv

52 Screen aligned sequences for alignment position and homopolymers
- All the sequences should overlap the same alignment coordinates
  - Remove deviants by filtering based on the alignment start and end position
- Also remove sequences which contain homopolymers longer than those in the reference
- Based on Mothur tool screen.seqs.
- Input file: Fasta file and count_table file
- Output files
  - screened.fasta.gz = screened sequences
  - screened.count_table = updated count_table
  - summary.screened.tsv = sequence information

54 Filter alignment for empty columns and overhangs, remove identical sequences
- Sequences should overlap only the common alignment region, without overhangs, so we need to trim the ends
  - remove alignment columns containing terminal gap characters '.'
- Also remove alignment columns which contain only gaps '-'
- Removing alignment columns can create identical sequences, so we need to extract unique sequences again
- Based on Mothur tools filter.seqs and unique.seqs.
- Input file: Fasta file and count_table file
- Output files
  - filtered-unique.fasta = trimmed aligned sequences
  - filtered-log.txt = how many alignment columns were removed
  - filtered-unique.count_table = updated count_table
  - filtered-unique-summary.tsv = sequence information
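
The column filtering can be illustrated on a toy alignment. This is a sketch of the two rules described above, not Mothur's filter.seqs; the function name and the example sequences are assumptions.

```python
def filter_alignment(aligned):
    """aligned: name -> aligned sequence, all of the same length.

    Drops every column that contains a terminal gap '.' in any sequence,
    and every column that consists only of gaps '-'.
    """
    names = list(aligned)
    columns = list(zip(*(aligned[name] for name in names)))
    keep = [col for col in columns
            if "." not in col and any(char != "-" for char in col)]
    return {name: "".join(col[i] for col in keep)
            for i, name in enumerate(names)}

aligned = {
    "s1": "..AC-GT.",
    "s2": ".TAC-GTA",
    "s3": "..AC-GA.",
}
filtered = filter_alignment(aligned)
print(filtered)
```

After filtering, s1 and s2 are identical although they differed before, which is why the unique sequences have to be extracted again.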

55 Alignment before and after filtering

57 Precluster very similar sequences
- Removes sequences that are likely to contain sequencing errors
  - assumes that abundant sequences are more likely to generate erroneous copies than rare sequences
  - ranks sequences in order of their abundance and then walks through the list of sequences looking for rarer sequences which differ by at most x bases from the original sequence. Those that are within the threshold are merged.
  - allow 1 mismatch for every 100 bp of sequence
- Based on Mothur tool precluster.seqs
- Input file: Fasta file and count_table file
- Output files
  - preclustered.fasta = trimmed aligned sequences
  - preclustered.count_table = updated count_table
  - preclustered-summary.tsv = sequence information
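
The abundance-ranked merging can be sketched as follows. This is a simplified illustration of the idea, not Mothur's precluster.seqs; it works on total counts rather than per sample, and the function names are assumptions.

```python
def mismatches(a, b):
    """Number of differing positions between two aligned sequences of equal length."""
    return sum(x != y for x, y in zip(a, b))

def precluster(counts, max_diff):
    """counts: aligned sequence -> abundance. Merges rare sequences into abundant ones."""
    ranked = sorted(counts, key=lambda seq: (-counts[seq], seq))
    merged = {}
    absorbed = set()
    for i, seq in enumerate(ranked):
        if seq in absorbed:
            continue
        merged[seq] = counts[seq]
        for rare in ranked[i + 1:]:
            if (rare not in absorbed and len(rare) == len(seq)
                    and mismatches(seq, rare) <= max_diff):
                merged[seq] += counts[rare]   # the rare sequence is treated as an error copy
                absorbed.add(rare)
    return merged

counts = {"ACGTACGT": 100, "ACGTACGA": 3, "TTTTACGT": 10, "ACGAACGA": 1}
merged = precluster(counts, max_diff=1)
print(merged)
```

With the rule of 1 mismatch per 100 bp, `max_diff` would be 2 for an alignment of about 250 bp.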

59 Remove chimeric sequences
- Removes sequences that are likely to be chimeras (PCR artifacts combining parts of two different parent sequences)
  - you can use either the full Silva Gold 16S rRNA reference set or the bacterial subsection of it
  - if you set reference = none, Mothur will use the more abundant sequences in your data as the reference
  - the parameter Dereplicate specifies if a chimera should be removed from all the samples (false), or only from the sample it was discovered in (true)
- Based on Mothur tools chimera.uchime and chimera.vsearch (faster)
- Input file: Fasta file and count_table file
- Output files
  - chimeras.removed.fasta = aligned sequences
  - chimeras.removed.count_table = updated count_table
  - chimeras.removed.summary.tsv = sequence information

61 Classify sequences to taxonomic units and remove unwanted lineages
- Based on the Mothur tool classify.seqs and the Wang method
  - calculates the probability that a query sequence would be in a given taxonomy based on the k-mers it contains
  - uses bootstrapping to find the confidence limit of the assignment by randomly choosing 1/8 of the k-mers in the query
  - you can use either the full Silva reference set and its taxonomy file, or the bacterial subsection of it
- If you discover unwanted lineages, you can remove them: list them in the text field and run the tool again. For example: Chloroplast-mitochondria-Archaea-Eukaryota-unknown
- Input file: Fasta file and count_table file
- Output files
  - reads-taxonomy-assignment.txt = sequence name and taxonomy
  - classification-summary.tsv = indicates the number of sequences that were found at each level
  - picked.fasta and picked.count_table = kept sequences
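
Removing unwanted lineages amounts to dropping every read whose taxonomy contains one of the listed names. The Python sketch below assumes a tab-separated taxonomy line of the form `name<TAB>Taxon(confidence);Taxon(confidence);` and case-insensitive matching; both are assumptions made for illustration, as are the function names and example reads.

```python
import re

def parse_taxonomy_line(line):
    """Split 'name<TAB>Bacteria(100);Firmicutes(98);' into name and taxon list."""
    name, lineage = line.rstrip("\n").split("\t")
    taxa = [re.sub(r"\(\d+\)$", "", taxon)
            for taxon in lineage.strip(";").split(";")]
    return name, taxa

def remove_lineages(lines, unwanted):
    """Return names of reads whose lineage contains none of the unwanted taxa.

    unwanted is a dash-separated list, as in the text field example.
    """
    drop = {taxon.lower() for taxon in unwanted.split("-")}
    kept = []
    for line in lines:
        name, taxa = parse_taxonomy_line(line)
        if not any(taxon.lower() in drop for taxon in taxa):
            kept.append(name)
    return kept

lines = [
    "read1\tBacteria(100);Firmicutes(98);",
    "read2\tBacteria(100);Cyanobacteria(100);Chloroplast(99);",
    "read3\tArchaea(100);Euryarchaeota(90);",
]
kept = remove_lineages(lines, "Chloroplast-mitochondria-Archaea-Eukaryota-unknown")
print(kept)
```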

62 Classification output files: reads-taxonomy-assignment.txt, classification-summary.tsv

64 Count species per sample
- For statistical analysis and visualization we need a table which lists the frequency of the different taxa in each sample
- We also need a file which allows us to indicate the experimental design
  - Assign samples to experimental groups
  - Other experimental factors such as time, gender, age
- This tool is based on an R script by Jarno Tuimala
  - We will integrate Mothur's OTU-based approach in September (dist.seqs, cluster using opticlust, make.shared, classify.otu)
- Input file: sequences-taxonomy-assignment.txt and group file
  - NOTE that it doesn't currently take Mothur's count file as input!
- Output files
  - counttable.tsv = rows are samples, columns are taxa
  - phenodata.tsv = allows you to assign samples to groups
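
The layout of the count table (samples as rows, taxa as columns) can be illustrated with a small Python sketch. The actual tool is an R script; the function name and the dictionary inputs used here are assumptions.

```python
from collections import Counter, defaultdict

def count_taxa_per_sample(taxonomy, groups):
    """taxonomy: read name -> taxon, groups: read name -> sample.

    Returns the sorted taxon names (columns) and a dict sample -> list of counts.
    """
    table = defaultdict(Counter)
    for read, taxon in taxonomy.items():
        table[groups[read]][taxon] += 1
    taxa = sorted({taxon for counter in table.values() for taxon in counter})
    rows = {sample: [table[sample][taxon] for taxon in taxa]
            for sample in sorted(table)}
    return taxa, rows

taxonomy = {"r1": "Bacteroides", "r2": "Bacteroides", "r3": "Prevotella", "r4": "Prevotella"}
groups = {"r1": "sampleA", "r2": "sampleA", "r3": "sampleA", "r4": "sampleB"}
taxa, rows = count_taxa_per_sample(taxonomy, groups)
print("\t".join(["sample"] + taxa))
for sample, counts in rows.items():
    print("\t".join([sample] + [str(n) for n in counts]))
```

A taxon that is absent from a sample gets a zero, so every row has the same columns.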

65 Counttable.tsv

66 phenodata.tsv

67 Phenodata file: describe the experiment
- Describe experimental groups, time, gender etc. with numbers
  - e.g. 1 = control, 2 = treatment
- Define sample names for visualizations in the Description column

69 Statistical analysis and visualization tool
- Visual analysis of data
  - Did you sample both groups equally well (rarefaction curve)?
  - How different are the two groups in terms of species richness and relative abundance (rank abundance curve)?
  - Does the group variable explain some of the difference between the samples (RDA plot)?
- Statistical analysis of data
  - Do the groups differ in species composition (AMOVA etc)?
  - What species differentiate the groups best (indicator species analysis)?
  - How do species contribute to the diversity (contribution diversity approach)?
- Two input files
  - Counttable.tsv (the frequency of the different taxa in each sample)
  - Phenodata.tsv (sample assignment to different groups)
- Based on an R script by Jarno Tuimala

70 Rarefaction curve
- Used for checking sampling efficiency: Did you sample both groups equally well (did you take enough samples)?
- Plots the rarefied number of species (y) against the number of samples (x).
- Lines are sample groups and clouds are confidence intervals.
- The curves should be pretty similar. If confidence intervals overlap, the sampling efficiency in both groups is similar.

71 Rank abundance curve
- How different are the two groups in terms of species richness and relative abundance?
- Y-axis is relative abundance, i.e. how many sequences you have per species. The species that has the most sequences is plotted at the top left.
- X-axis is abundance rank, based on the number of sequences for each species
- Species evenness is depicted by the shape of the curve
  - a flat line means that all species are equally abundant
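
The values behind a rank abundance curve can be computed in a few lines. This Python sketch only prepares the (rank, relative abundance) pairs and leaves out plotting; the function name and example counts are assumptions.

```python
def rank_abundance(counts):
    """counts: taxon -> number of sequences.

    Returns (rank, taxon, relative abundance) with the most abundant taxon at rank 1.
    """
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, taxon, n / total)
            for rank, (taxon, n) in enumerate(ranked, start=1)]

uneven = rank_abundance({"taxonA": 80, "taxonB": 15, "taxonC": 5})
even = rank_abundance({"taxonA": 10, "taxonB": 10, "taxonC": 10})
for rank, taxon, share in uneven:
    print(rank, taxon, share)
```

For the even community all relative abundances are equal, which corresponds to the flat line mentioned above.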

72 Redundancy Analysis (RDA)
- Does the group variable explain (some of) the difference between the samples?
- Constrained ordination approach, uses an explanatory variable (group)
- Data contains many zeros, so Hellinger transformation is needed
- How to read the plot:
  - Dot = sample, colored by group
  - Small gray dot = species
  - Group's value increases in the direction of the arrow (group1 on the right)
  - P-value for the group's effect
  - Percentage of variance explained by the group
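
The Hellinger transformation replaces each count by the square root of its relative abundance within the sample. A minimal Python sketch for one sample (one row of the count table) follows; the function name is an assumption.

```python
import math

def hellinger(row):
    """Square root of the relative abundance of each taxon in one sample."""
    total = sum(row)
    if total == 0:
        return [0.0] * len(row)  # empty sample: nothing to transform
    return [math.sqrt(count / total) for count in row]

transformed = hellinger([1, 3, 0, 0])
print(transformed)
```

Zeros stay zero, and the squared values of a non-empty row sum to 1, so samples with very different sequencing depths become comparable.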

73 Indicator species analysis
- What are the taxa that differentiate the groups best?
- Two analysis tools
  - Dufrene-Legendre Indicator Species Analysis
    - Calculates the indicator value (fidelity and relative abundance) of species
    - Gives p-values for taxa separating the specified groups
  - Indicator Species Analysis Minimizing Intermediate Occurrences
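
The indicator value combines relative abundance and fidelity. The sketch below uses the common Dufrene-Legendre definition (mean abundance in the group divided by the sum of group means, multiplied by the fraction of the group's samples in which the taxon is present). It is an assumption that the Chipster tool uses exactly this form, and the p-value step (permutation of group labels) is left out.

```python
def indicator_value(abundances, groups, target):
    """abundances: counts of one taxon per sample, groups: group label per sample."""
    by_group = {}
    for count, group in zip(abundances, groups):
        by_group.setdefault(group, []).append(count)
    means = {group: sum(values) / len(values) for group, values in by_group.items()}
    total = sum(means.values())
    if total == 0:
        return 0.0  # taxon absent everywhere
    relative_abundance = means[target] / total
    fidelity = sum(1 for count in by_group[target] if count > 0) / len(by_group[target])
    return relative_abundance * fidelity

groups = [1, 1, 2, 2]
print(indicator_value([10, 20, 0, 0], groups, target=1))  # found only, and always, in group 1
print(indicator_value([4, 0, 2, 2], groups, target=1))    # shared between the groups
```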

74 Do groups differ in species composition?
- Three different statistical tools
  - Analysis of molecular variance (AMOVA)
  - Permutational multivariate analysis of variance using distance matrices
  - Multivariate homogeneity of groups dispersions (variances)
- They use slightly different tests and methods to calculate p-values
- How to read the output of AMOVA:
  - SSD = sum of squared deviations (variation), total and explained by the group
  - MSD = mean squared deviation for the grouping
  - P.value (Pr in other tests)
