School of Energy and Environment, City University of Hong Kong, Hong Kong. SeqMatic LLC, Fremont, CA, 94539, United States of America


Skin fungal community: the effects of hosts, co-colonizing bacteria, and environmental fungi in shaping and expanding the continental pan-mycobiome

Marcus H. Y. Leung 1, Kelvin C. K. Chan 2, and Patrick K. H. Lee 1 *

1 School of Energy and Environment, City University of Hong Kong, Hong Kong
2 SeqMatic LLC, Fremont, CA, 94539, United States of America

*Correspondence: B5423-AC1, School of Energy and Environment, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong; patrick.kh.lee@cityu.edu.hk; Tel: (852) ; Fax: (852)

This document contains the original in-house code and scripts used to generate the results (including Fig. 1, 2, 3, 5 and Additional Files 1-5 and 7-8) described in the manuscript. The document is divided into the following sections:

1) In-house script for read quality-filtering
2) In-house scripts for OTU-clustering and quality control, including chimera and contaminants removal
3) In-house scripts for alpha-diversity analyses
4) In-house scripts for beta-diversity analysis
5) In-house scripts for cross-kingdom alpha/beta-diversity comparisons
6) In-house script for taxonomic analysis
7) In-house scripts for Malassezia species-level taxonomic analysis
8) In-house script for multi-study comparison plot generation

Please note that while the in-house scripts required for generating the results and figures are presented in this document, some minor tasks (such as table reformatting) were performed in Microsoft Excel, as described below. For example, the average alpha-diversity values for each sample following ten rounds of rarefaction were determined using the Pivot Table function in Excel. Also note that these exact scripts will not necessarily run unmodified on other computers. Readers are responsible for understanding the scripts included here and modifying them accordingly.
Some of the R packages required for the following scripts are:
- devtools
- wilkoxmisc
- ggplot2
- reshape2
- plyr
- pgirmess

They can be installed from within R, for example:

    install.packages("pgirmess")

1) In-house script for read quality-filtering

Fastq/fasta read preparation for read quality control and OTU clustering. Following alignment of the forward and reverse reads of each sample with FLASH, the merged reads from all samples (one merged-reads file per sample) are concatenated into one fastq file:

1.1)
    cat *.fastq > combined.fastq

The resulting combined fastq file was used as input for the read quality control steps using usearch commands, as described in the Methods section of the main text.

2) In-house scripts for OTU-clustering and quality control, including chimera and contaminants removal

OTU-clustering: following read quality-filtering and demultiplexing with usearch, the usearch cluster_otus command was used to generate the OTU fasta file. Using the OTU fasta file as input, the following perl script was used to generate a fasta file with the OTUs renamed as numbers (OTUs_numbered.fasta) and a table mapping each OTU number to its reference sequence (OTU_to_reference_sequence.tidy.txt):

2.1) assign_otu_numbers.pl

    #!/usr/bin/perl
    use Modern::Perl 2014;
    use autodie;
    $|++;   # unbuffered output for the progress counter

    open IN, '<', 'OTUs.fasta';
    open OUT, '>', 'OTUs_numbered.fasta';
    open MAP, '>', 'OTU_to_reference_sequence.tidy.txt';
    say MAP "OTU\tReferenceSequence\tSize";
    my $OTU = 0;
    while (<IN>) {
      chomp;
      print "\r$. lines processed" unless $. % 1000;
      # Header lines: record the mapping and write the renamed header
      if (/^>(?<read>.+);size=(?<size>\d+)$/) {
        $OTU++;
        say MAP "$OTU\t$+{read}\t$+{size}";
        say OUT ">OTU_$OTU";
      }
      # Sequence lines: copy unchanged
      else {
        say OUT;
      }
    }
    say "\r$. lines processed";
    close IN;
    close OUT;
    close MAP;

The numbered fasta file was used to perform taxonomic classification using QIIME's assign_taxonomy.py script. Following chimera detection using the usearch uchime_ref command, the script below was used to generate a txt file containing a list of chimeric OTUs:

2.2) make_chimeras_list.pl

    #!/usr/bin/perl
    use Modern::Perl 2014;
    use autodie;
    $|++;

    open IN, '<', './chimeras.fasta';
    open OUT, '>', 'chimeras.tidy.txt';
    say OUT "OTU";
    while (<IN>) {
      chomp;
      next unless /^>/;               # only header lines carry OTU names
      (my $OTU) = $_ =~ /^>(.+)/;
      say OUT $OTU;
    }
    close IN;
    close OUT;

The OTU table was prepared by compiling the following input files:
-OTU_to_reference_sequence.tidy.txt (output from 2.1)
-OTUs_numbered_tax_assignments.txt (output of the assign_taxonomy.py QIIME script)
-readmap.uc (output of the usearch usearch_global command)

The script generates the following output files:
-OTU_table.tidy.txt (OTU table, still including OTUs that are chimeric or contaminant)
-singletons.txt (txt file containing a list of singleton OTUs)

2.3) prepare_otu_table.pl

    #!/usr/bin/perl
    use Modern::Perl 2014;
    use autodie;
    $|++;

    # Load OTU reference sequences
    say "Loading OTU reference sequences";
    open OTUREFMAP, '<', './OTU_to_reference_sequence.tidy.txt';
    my %OTUofRefSeq;
    while (<OTUREFMAP>) {
      next if $. == 1;   # skip header
      chomp;
      print "\r$. lines processed" unless $. % 1000;
      (my $OTU, my $refSeq) = split(/\t/, $_);
      $OTUofRefSeq{$refSeq} = $OTU;
    }
    say "\r$. lines processed";
    close OTUREFMAP;

    # Load OTU taxonomies
    # NOTE: the array names and the end of the substitution in this block were
    # lost from the source text; @line, @{ $taxonomy{$OTU} } and the prefix
    # pattern below are assumed reconstructions
    say "Loading OTU taxonomies";
    open TAX, '<', './OTUs_numbered_tax_assignments.txt';
    my %taxonomy;
    while (<TAX>) {
      chomp;
      print "\r$. lines processed" unless $. % 1000;
      my @line = split /\t/, $_;
      my $OTU = $line[0];
      if ($line[1] eq 'Unassigned') {
        @{ $taxonomy{$OTU} } = ('') x 7;
      }
      else {
        @{ $taxonomy{$OTU} } = split(/;\s/, $line[1]);
        # Assumed: strip rank prefixes such as "k__"
        s/^.__// for @{ $taxonomy{$OTU} };
      }
    }
    say "\r$. lines processed";
    close TAX;

    # Count reads for each OTU
    say "Counting reads for each OTU";
    open READMAP, '<', 'readmap.uc';
    my %readcount;
    my %OTUReadCount;
    while (<READMAP>) {
      chomp;
      print "\r$. lines processed" unless $. % 1000;
      next if /^N/;   # skip reads with no hit
      my @line = split(/\t/, $_);
      # Assumed reconstruction: in .uc files the query label is field 9
      # and the target label is field 10 (indices 8 and 9)
      (my $read, my $OTU) = @line[8, 9];
      (my $sample) = $read =~ /^([^\ ]+)/;
      $readcount{$OTU}{$sample}++;
      $OTUReadCount{$OTU}++;
    }
    say "\r$. lines processed";
    close READMAP;

    # Produce list of singletons
    say "Producing list of singletons";
    my %singletons;
    open SINGLETONS, '>', 'singletons.txt';
    say SINGLETONS "OTU";
    foreach my $OTU (keys %OTUReadCount) {
      if ($OTUReadCount{$OTU} == 1) {
        say SINGLETONS $OTU;
        $singletons{$OTU} = 1;
      }
    }
    close SINGLETONS;

    # Produce OTU table
    say "Producing OTU table";
    open OTUTABLE, '>', 'OTU_table.tidy.txt';
    say OTUTABLE "OTU\tSample\tCount\tKingdom\tPhylum\tClass\tOrder\tFamily\tGenus\tSpecies";
    foreach my $OTU (sort keys %readcount) {
      # Skip singletons
      next if exists $singletons{$OTU};
      foreach my $sample (sort keys %{ $readcount{$OTU} }) {
        # The end of this statement is truncated in the source; joining the
        # taxonomy ranks with tabs is an assumed completion
        say OTUTABLE "$OTU\t$sample\t$readcount{$OTU}{$sample}\t",
          join("\t", @{ $taxonomy{$OTU} });
      }
    }
    close OTUTABLE;

Following creation of the OTU table in 2.3), contaminant OTUs need to be identified so that they can later be removed from the table. This is done by detecting lineages that account for more than 5% of the reads in the negative controls. The script below takes the OTU_table.tidy.txt file from 2.3) as input and generates two output files:

-contaminant_lineages.tidy.txt (a list of lineages deemed contaminants)
-contaminants.txt (a list of OTUs deemed contaminants)

2.4) classify_contaminants.r

    # Libraries
    library(wilkoxmisc)
    library(plyr)   # provides ddply

    # List of blank samples
    BlankSamples <- c("name_of_negative_sample(s)")

    # Read in OTU table
    OTUTable <- read.tidy("OTU_table.tidy.txt")

    # Collapse by lineage
    OTUTable <- within(OTUTable, Lineage <- factor(paste(Kingdom, Phylum, Class, Order, Family, Genus, Species)))
    OTUsByLineage <- unique(OTUTable[, c("OTU", "Lineage")])
    OTUTable <- ddply(OTUTable, .(Sample, Lineage), summarise, Count = sum(Count), .progress = "time")

    # Select blank samples
    Blank <- subset(OTUTable, Sample %in% BlankSamples)

    # Add relative abundances
    Blank <- add.relative.abundance(Blank)

    # Aggregate
    Blank <- ddply(Blank, .(Lineage), summarise, RelativeAbundance = sum(RelativeAbundance))

    # Calculate value for cutoff
    Cutoff <- sum(Blank$RelativeAbundance) * 0.05

    # Trim contaminant list to lineages above cutoff
    Blank <- Blank[which(Blank$RelativeAbundance > Cutoff), ]

    # Write contaminant lineages to file
    write.tidy(Blank, "contaminant_lineages.tidy.txt")

    # Sort OTUs into Contaminant/Noncontaminant
    Contaminants <- within(OTUsByLineage, Contaminant <- ifelse(Lineage %in% Blank$Lineage, "Contaminant", "Noncontaminant"))
    Contaminants$Lineage <- NULL

    # Write to file
    write.tidy(Contaminants, "contaminants.txt")

Having identified the chimeric OTUs (chimeras.tidy.txt from 2.2) and the contaminant OTUs (contaminants.txt from 2.4), these files are used to identify the OTUs to be removed from OTU_table.tidy.txt. The output file is OTU_table_clean.tidy.txt, containing high-quality, non-chimeric, and non-contaminating OTUs.

2.5) clean_otu_table.r

    # Libraries
    library(wilkoxmisc)

    # Read in OTU table
    OTUTable <- read.tidy("OTU_table.tidy.txt")

    # Add fate column
    OTUTable$Fate <- rep(NA, nrow(OTUTable))

    # BLANK SAMPLES
    BlankSamples <- c("Blk", "Blk2")

    # Remove blank samples
    OTUTable <- OTUTable[which(! OTUTable$Sample %in% BlankSamples), ]

    # CHIMERAS
    ## Read in list of chimeras
    Chimeras <- read.tidy("chimeras.tidy.txt")
    Chimeras <- as.character(Chimeras$OTU)

    # Mark chimeric OTUs
    OTUTable$Fate <- ifelse(
      OTUTable$OTU %in% Chimeras & is.na(OTUTable$Fate),
      'Chimera',
      OTUTable$Fate
    )

    # CONTAMINANTS
    ## Read in list of contaminants
    Contaminants <- read.tidy("contaminants.txt")
    Contaminants <- as.character(Contaminants[which(Contaminants$Contaminant == 'Contaminant'), "OTU"])

    # Mark contaminant OTUs
    OTUTable$Fate <- ifelse(
      OTUTable$OTU %in% Contaminants & is.na(OTUTable$Fate),
      'Contaminant',
      OTUTable$Fate
    )

    ## OUTPUT
    # Summarise OTUs by fate and write to file
    OTUFates <- unique(OTUTable[c("OTU", "Fate")])
    write.tidy(OTUFates, "OTU_fates.tidy.txt")

    # Remove failures from OTU table and write to file
    OTUTable <- OTUTable[which(is.na(OTUTable$Fate)), ]
    OTUTable$Fate <- NULL
    write.tidy(OTUTable, "OTU_table_clean.tidy.txt")

The clean OTU table is then converted to a format readable by the biom_convert command in QIIME:

2.6) cast_otu_table.r

    # Libraries
    library(wilkoxmisc)
    library(reshape2)

    # Read in OTU table
    OTUCounts <- read.tidy("OTU_table_clean.tidy.txt")

    # Cast to a wide table: one row per OTU, one column per sample
    OTUCounts <- dcast(OTUCounts, OTU ~ Sample, value.var = "Count", fill = 0)

    # Write
    write.tidy(OTUCounts, "OTU_table_clean.cast.txt")

The output OTU_table_clean.cast.txt can be used as input for the biom_convert command in QIIME.
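The 5% contaminant rule from 2.4) and the removal step from 2.5) can be illustrated on a toy table. This is a minimal base-R sketch, not the authors' script: the sample names, lineage names and counts are invented, and relative abundance is assumed to be normalised within each blank sample (the exact behaviour of add.relative.abundance in wilkoxmisc is not shown in this document).

```r
# Toy tidy table: one row per lineage per sample (all values hypothetical)
otu <- data.frame(
  Sample  = c("Blank1", "Blank1", "Blank1", "Skin1", "Skin1"),
  Lineage = c("Fungi A", "Fungi B", "Fungi C", "Fungi A", "Fungi D"),
  Count   = c(90, 8, 2, 40, 60),
  stringsAsFactors = FALSE
)
blank_samples <- c("Blank1")

# Keep only the negative controls
blank <- otu[otu$Sample %in% blank_samples, ]

# Assumed normalisation: fraction of reads within each blank sample
blank$RelativeAbundance <- blank$Count / ave(blank$Count, blank$Sample, FUN = sum)

# Sum relative abundance per lineage across all blanks
by_lineage <- aggregate(RelativeAbundance ~ Lineage, data = blank, FUN = sum)

# Lineages above 5% of the total blank signal are flagged as contaminants
cutoff <- sum(by_lineage$RelativeAbundance) * 0.05
contaminants <- as.character(by_lineage$Lineage[by_lineage$RelativeAbundance > cutoff])

# Drop blank samples and contaminant lineages from the table
clean <- otu[!(otu$Sample %in% blank_samples) & !(otu$Lineage %in% contaminants), ]
print(contaminants)
print(clean)
```

In this toy case the low-abundance blank lineage ("Fungi C", 2% of blank reads) is not flagged, so it would be retained if it also occurred in a skin sample.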

3) In-house scripts for alpha-diversity analyses

Following rarefaction using the QIIME scripts multiple_rarefaction.py and alpha_diversity.py, ten separate rarefied alpha-diversity txt files were created. The cat command was used to combine all ten files into a single file:

3.1)
    cat alpha_diversity_rarefied_files_*.txt > alpha_diversity_combined.txt

The table in alpha_diversity_combined.txt was reorganized in Microsoft Excel to obtain the average alpha-diversity measurements (Observed OTUs/Chao1/Simpson's/Shannon) for each sample. This file of averages was saved as alpha_average_1175.txt. The data in Additional File 1 originate from this file.

For the Mann-Whitney and Kruskal-Wallis statistical tests (data shown in Additional File 2), the R commands wilcox.test and kruskal.test were used (these commands are not in-house). The script also requires a Metadata txt file, which was created manually in Microsoft Excel during sample collection.

3.2) alpha_statistical_test.r

    library(wilkoxmisc)
    library(devtools)
    library(ggplot2)

    #Open data and meta files and merge
    Alpha <- read.tidy("alpha_average_1175.txt")
    Meta <- read.tidy("metadata.txt")
    Merge <- merge(Alpha, Meta, by = "Sample", all.x = TRUE)

    #Perform Kruskal-Wallis test or Mann-Whitney test for statistical significance of average alpha-diversities between variables
    wilcox.test(Observed_OTU ~ Gender, data = Merge)
    kruskal.test(Observed_OTU ~ Age_Group, data = Merge)

    #For significant Kruskal-Wallis comparisons, the kruskalmc command in the pgirmess package is used to perform multiple pairwise comparisons
    library(pgirmess)
    kruskalmc(Merge$Observed_OTU, Merge$Age_Group)

Using alpha_average_1175.txt as input, the following R script was used to generate plots with ggplot. The script generates the plots in Additional File 3. It also requires the Metadata txt file, which was created manually in Microsoft Excel during sample collection.
3.3) make_alpha_plot.r

    library(wilkoxmisc)
    library(devtools)
    library(ggplot2)

    #Open data and meta files and merge
    Alpha <- read.tidy("alpha_average_1175.txt")
    Meta <- read.tidy("metadata.txt")
    Merge <- merge(Alpha, Meta, by = "Sample", all.x = TRUE)

    #Plot boxplot based on comparisons (single example provided)
    Plot <- ggplot(Merge, aes(x = Age_Group, y = Observed_f))
    Plot <- Plot + geom_boxplot()
    Plot <- Plot + coord_flip()
    Plot <- Plot + xlab(paste0("Age Group")) + ylab(paste0("Observed Number of OTUs"))
    Plot <- Plot + theme_classic()
    Plot <- Plot + theme(legend.title = element_blank())
    Plot <- Plot + theme(axis.title = element_text(size = 18, face = "bold"))
    Plot <- Plot + theme(axis.text = element_text(size = 14, face = "bold"))
    Plot <- Plot + theme(legend.text = element_text(size = 14, face = "bold"))
    ggsave("age_group_observed_1175.png")
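The averaging over ten rarefactions, which this section describes as an Excel Pivot Table step, could also be done directly in R. The following is a hedged sketch on simulated data, not the authors' workflow: the column names Sample, Iteration and Observed_OTU and the age groups are assumptions made for illustration.

```r
set.seed(1)

# Simulated combined table: 6 samples x 10 rarefaction rounds (hypothetical values)
alpha <- expand.grid(Sample = paste0("S", 1:6), Iteration = 1:10,
                     stringsAsFactors = FALSE)
alpha$Observed_OTU <- rpois(nrow(alpha), lambda = 50)

# Mean per sample across the rarefaction rounds (the pivot-table step)
alpha_avg <- aggregate(Observed_OTU ~ Sample, data = alpha, FUN = mean)

# Hypothetical metadata with one grouping variable
meta <- data.frame(
  Sample    = paste0("S", 1:6),
  Age_Group = factor(rep(c("Child", "Adult", "Senior"), each = 2)),
  stringsAsFactors = FALSE
)

# Merge and test, mirroring the structure of script 3.2
merged <- merge(alpha_avg, meta, by = "Sample", all.x = TRUE)
kw <- kruskal.test(Observed_OTU ~ Age_Group, data = merged)
print(alpha_avg)
print(kw$p.value)
```

Doing the averaging in the script keeps the whole path from the combined file to the statistical test reproducible without a manual spreadsheet step.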

4) In-house scripts for beta-diversity analysis

The QIIME script beta_diversity.py generates two output matrix files, one for Bray-Curtis (abundance-weighted, taxonomic-based beta-diversity analysis) and one for Jaccard (abundance-nonweighted, taxonomic-based beta-diversity analysis). For each matrix file, the following script was used to compute ANOSIM and the statistical significance for predictive variables (data in Additional File 2).

4.1) anosim.r

    #Load libraries (these packages are required to draw PCoA plot and run ANOSIM)
    library(wilkoxmisc)
    library(ggplot2)
    library(devtools)
    library(vegan)   # anosim() is provided by the vegan package

    #Open beta diversity output matrix file, as well as metadata file
    Beta <- read.dist("bray_curtis_rarefied_otu_table_clean.txt")
    Meta <- read.tidy("../metadata/metadata_rarefied_1175_dot.txt")
    BetaMatrix <- as.matrix(Beta)

    #Perform ANOSIM
    #First list samples from both beta diversity matrix file and metadata, to check if the two lists are identical
    AllSamples <- data.frame(Sample = row.names(BetaMatrix))
    AllSamples <- merge(AllSamples, Meta, by = "Sample", all.x = TRUE)

    #Check that the two lists are identical (output is either TRUE or FALSE)
    sum(row.names(BetaMatrix) == AllSamples$Sample) == length(AllSamples$Sample)

    #Perform ANOSIM based on grouping of your choice (e.g. Gender/Household/Age_Group etc.)
    Beta <- as.dist(Beta)
    ANOSIM <- anosim(Beta, grouping = AllSamples$Gender)

    #Show ANOSIM statistic and significance
    ANOSIM$statistic
    ANOSIM$signif

4.2) inter_vs_intra_comparison.r

    #This script rearranges beta-diversity distance matrix files to compare intra/inter-group community dissimilarities
    library(wilkoxmisc)   # read.dist, read.tidy, write.tidy
    library(reshape2)

    #Load binary Jaccard distances
    Beta <- as.matrix(read.dist("binary_jaccard_rarefied_otu_table_clean.txt"))

    #Melt
    Beta <- melt(Beta, value.name = "Distance")
    names(Beta)[1:2] <- c("Sample1", "Sample2")

    #Add household and type
    Samples <- read.tidy("metadata.txt")[c("Sample", "Location", "Individual", "Area", "Anatomy", "Age_Group", "Gender")]
    Beta <- merge(Beta, Samples, by.x = "Sample1", by.y = "Sample", all.x = TRUE)
    names(Beta)[4:5] <- c("Location1", "Individual1")
    Beta <- merge(Beta, Samples, by.x = "Sample2", by.y = "Sample", all.x = TRUE)
    names(Beta)[6:7] <- c("Area1", "Anatomy1")
    names(Beta)[8:9] <- c("Age_Group1", "Gender1")
    names(Beta)[10:11] <- c("Location2", "Individual2")
    names(Beta)[12:13] <- c("Area2", "Anatomy2")
    names(Beta)[14:15] <- c("Age_Group2", "Gender2")

    #Remove self-self samples
    Beta <- Beta[which(! Beta$Sample1 == Beta$Sample2), ]

    #Label each pair as within-group or between-group for every variable
    Beta$HouseholdType <- ifelse(Beta$Location1 == Beta$Location2, "Within households", "Between households")
    Beta$IndividualType <- ifelse(Beta$Individual1 == Beta$Individual2, "Within Individuals", "Between Individuals")
    Beta$SiteType <- ifelse(Beta$Area1 == Beta$Area2, "Same Site", "Different Site")
    Beta$AnatomyType <- ifelse(Beta$Anatomy1 == Beta$Anatomy2, "Same Anatomical Site", "Different Anatomical Site")
    Beta$AgeType <- ifelse(Beta$Age_Group1 == Beta$Age_Group2, "Same Age Group", "Different Age Group")
    Beta$GenderType <- ifelse(Beta$Gender1 == Beta$Gender2, "Same Gender", "Different Gender")
    write.tidy(Beta, "Beta_Comparison_Jaccard.txt")

The resulting Beta_Comparison_Jaccard.txt file contains the Jaccard distance for each pair of samples, together with columns indicating whether the two samples belong to the same group for each variable. The average values for each group can be determined in Microsoft Excel, and the statistical significance of differences in mean beta dissimilarity between comparison groups can be determined with the wilcox.test and kruskal.test commands.

Density plots can be constructed from Beta_Comparison_Jaccard.txt and the corresponding Bray-Curtis file. An example of the density plot shown in Fig. 1b is given below.
4.3) draw_density_plot.r

    library(wilkoxmisc)
    library(ggplot2)
    library(plyr)

    Beta <- read.tidy("Beta_Comparison_Jaccard.txt")

    #Mean distance per comparison group (used for the vertical lines and labels below)
    mu <- ddply(Beta, "IndividualType", summarise, grp.mean = mean(Distance))

    #The levels must match the labels present in the IndividualType column
    Beta$IndividualType <- factor(Beta$IndividualType, levels = c("Within Individuals", "Between Cohabitants", "Between Households"))

    Plot <- ggplot(Beta, aes(x = Distance, colour = IndividualType))
    Plot <- Plot + stat_density(aes(group = IndividualType, color = IndividualType), position = "identity", geom = "line", size = 2) + scale_y_continuous(expand = c(0, 0), limits = c(0, 10))

    #Vertical lines at the group means
    Plot <- Plot + geom_segment(aes(x = 0.744, y = 0, xend = 0.744, yend = 4.5), size = 1, linetype = "dotdash", color = "#e41a1c")
    Plot <- Plot + geom_segment(aes(x = 0.792, y = 0, xend = 0.792, yend = 7.0), size = 1, linetype = "dotdash", color = "#377eb8")
    Plot <- Plot + geom_segment(aes(x = 0.822, y = 0, xend = 0.822, yend = 7.0), size = 1, linetype = "dotdash", color = "#4daf4a")

    Plot <- Plot + theme_classic()
    ylab <- paste0("Density (%)")
    xlab <- paste0("Normalized Binary Jaccard Distance")
    Plot <- Plot + ylab(ylab) + xlab(xlab)
    Plot <- Plot + theme(axis.title = element_text(size = 20))
    Plot <- Plot + theme(legend.title = element_blank())
    Plot <- Plot + theme(legend.text = element_text(size = 14, face = "bold"))
    Plot <- Plot + theme(axis.text = element_text(size = 16, face = "bold"))
    Plot <- Plot + theme(legend.position = "bottom")
    Plot <- Plot + scale_colour_brewer(palette = "Set1")
    Plot <- Plot + guides(fill = guide_legend(override.aes = list(colour = NULL)))

    #Labels for the group means (geom_text takes fontface, not face)
    Plot <- Plot + geom_text(aes(0.70, 6, label = "mean = 0.744", angle = 315), color = "#e41a1c", size = 6, fontface = "bold")
    Plot <- Plot + geom_text(aes(0.75, 8.5, label = "mean = 0.792", angle = 315), color = "#377eb8", size = 6, fontface = "bold")
    Plot <- Plot + geom_text(aes(0.80, 8.5, label = "mean = 0.822", angle = 315), color = "#4daf4a", size = 6, fontface = "bold")
    ggsave("density_individual_jaccard.png", width = 8, height = 7)

To sub-select data based on a particular factor (e.g. to divide the density plot data by skin site), create separate txt files containing only the data for that factor (e.g. Additional Files 4 and 5).
4.4) subdivide_beta_data.r

    library(wilkoxmisc)

    Beta <- read.tidy("Beta_Comparison_Jaccard.txt")

    #Keep only pairs sampled from the same site, then split by site
    SameSite <- Beta[which(Beta$SiteType == "Same Site"), ]
    Forehead <- SameSite[which(SameSite$Site1 == "Forehead"), ]
    LeftForearm <- SameSite[which(SameSite$Site1 == "Left Forearm"), ]
    RightForearm <- SameSite[which(SameSite$Site1 == "Right Forearm"), ]
    LeftPalm <- SameSite[which(SameSite$Site1 == "Left Palm"), ]
    RightPalm <- SameSite[which(SameSite$Site1 == "Right Palm"), ]

    write.tidy(Forehead, "Jaccard_Forehead.txt")
    write.tidy(LeftForearm, "Jaccard_LF.txt")
    write.tidy(RightForearm, "Jaccard_RF.txt")
    write.tidy(LeftPalm, "Jaccard_LP.txt")
    write.tidy(RightPalm, "Jaccard_RP.txt")

Each resulting output file can be plugged into 4.3) to generate the data and plots shown in Additional Files 2, 4, and 5.
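For readers unfamiliar with the statistic reported by anosim in 4.1), the sketch below computes the ANOSIM R value by hand in base R: R = (mean rank of between-group distances - mean rank of within-group distances) / (n(n-1)/4), with a label-permutation p-value. It is an illustration only; the function name anosim_r, the coordinates and the groups are invented, and the analysis in this document used the anosim command rather than this code.

```r
# ANOSIM R statistic from a distance object and a grouping vector
# (hypothetical helper written for illustration)
anosim_r <- function(d, grouping) {
  m <- as.matrix(d)
  n <- nrow(m)
  lower <- lower.tri(m)
  pairs <- which(lower, arr.ind = TRUE)   # row/column index of each sample pair
  ranks <- rank(m[lower])                 # rank all pairwise distances
  within <- grouping[pairs[, 1]] == grouping[pairs[, 2]]
  (mean(ranks[!within]) - mean(ranks[within])) / (n * (n - 1) / 4)
}

# Toy data: two well-separated groups of three samples each
coords <- c(0, 0.1, 0.2, 5, 5.1, 5.2)
groups <- rep(c("A", "B"), each = 3)
d <- dist(coords)

r_obs <- anosim_r(d, groups)

# Permutation test: shuffle group labels and recompute R
set.seed(42)
r_perm <- replicate(999, anosim_r(d, sample(groups)))
p_value <- (sum(r_perm >= r_obs) + 1) / (length(r_perm) + 1)

print(r_obs)
print(p_value)
```

An R value near 1 indicates that between-group distances are consistently larger than within-group distances, while a value near 0 indicates no grouping effect.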

5) In-house scripts for cross-kingdom alpha/beta-diversity comparisons

Cross-domain alpha analysis was computed by first merging the bacterial alpha-diversity data from a previous publication [1] with the fungal data from this work. The plot shown in Fig. 2a is produced by the code below. The Spearman correlation values shown in Additional File 1, tab Cross-Domain Alpha Correlation, were computed as shown below.

[1] Supplementary Data 5 file from Leung MHY, Wilkins D, and Lee PKH. Sci Rep. 2015;5:

5.1) cross_domain_alpha.r

    library(wilkoxmisc)   # read.tidy, write.tidy
    library(ggplot2)

    Fungus <- read.tidy("alpha_average_1175.txt")
    Bacteria <- read.tidy("alpha_average_bacteria.txt")

    #Merge fungal and bacterial data, and include only samples with alpha data for both domains
    #(some samples were removed from both studies following insufficient reads for normalization)
    Table <- merge(Fungus, Bacteria, by = "Sample", all.x = FALSE)

    #Add metadata
    Meta <- read.tidy("metadata.txt")
    Table <- merge(Table, Meta, by = "Sample", all.x = FALSE)
    write.tidy(Table, "fun_bac_alpha.txt")

    #Plot
    Plot <- ggplot(Table, aes(x = Observed_f, y = Observed_b, colour = Anatomy))
    Plot <- Plot + geom_point(size = 4)
    Plot <- Plot + theme_classic()
    Plot <- Plot + xlab(paste0("Observed Number of Fungal OTUs")) + ylab(paste0("Observed Number of Bacterial OTUs"))
    Palette <- c("#9b30ff", "#20b2aa", "#ff4500")
    Plot <- Plot + scale_colour_manual(values = Palette)

    #Calculate slope, intercept, and draw linear regression line
    coef(lm(Observed_b ~ Observed_f, data = Table))

    (Intercept) Observed_f

    Plot <- Plot + geom_abline(intercept = 963, slope = 7.5, colour = "blue", size = 1)
    Plot <- Plot + theme(legend.title = element_blank())
    Plot <- Plot + theme(axis.title = element_text(size = 18, face = "bold"))
    Plot <- Plot + theme(axis.text = element_text(size = 14, face = "bold"))
    Plot <- Plot + theme(legend.text = element_text(size = 14, face = "bold"))
    ggsave("fun_bac_correlation.png")

    #Test for Spearman correlation
    cor.test(~ Observed_b + Observed_f, Table, method = "spearman")

    Spearman's rank correlation rho
    data: Observed_b and Observed_f
    S = , p-value = 2.724e-07
    alternative hypothesis: true rho is not equal to 0
    sample estimates:
    rho

The correlation values and significance for the other comparisons (e.g. Chao1, Shannon, etc.) shown in Additional File 1 were computed in the same way, substituting the corresponding alpha indices in the cor.test command:

    cor.test(~ Chao1_b + Shannon_f, Table, method = "spearman")

    Spearman's rank correlation rho
    data: Chao_b and Simpson_f
    S = , p-value =
    alternative hypothesis: true rho is not equal to 0
    sample estimates:
    rho

Similarly, for the cross-kingdom beta analysis, bacterial pairwise UniFrac data from the previous study (raw data from ) were combined with the fungal data to perform correlation analysis. The files containing the bacterial and fungal beta-diversity dissimilarities for each sample pair are merged into one combined file:

5.2) cross_domain_beta.r

    library(wilkoxmisc)

    #Open fungal Bray-Curtis sample pairwise dissimilarity file
    betafung <- read.tidy("beta_comparison_bray.txt")

    #Add column Comparison. This will be used as the common column for merging with the bacterial beta-dissimilarity pairwise file
    betafung$Comparison <- paste0(betafung$Sample1, " vs. ", betafung$Sample2)
    write.tidy(betafung, "Beta_Comparison_Fungus.txt")

    #Open bacterial UniFrac sample pairwise dissimilarity file
    betabac <- read.tidy("beta_comparison_bacteria.txt")

    #Add column Comparison. This will be used as the common column for merging with the fungal beta-distance pairwise file
    betabac$Comparison <- paste0(betabac$Sample1, " vs. ", betabac$Sample2)
    write.tidy(betabac, "Beta_Comparison_Bacteria.txt")

In Microsoft Excel, keep only the columns Distance_b and Comparison of Beta_Comparison_Bacteria.txt, and save the result as Beta_Comparison_Bacteria_Simple.txt. Return to R, and merge Beta_Comparison_Fungus.txt and Beta_Comparison_Bacteria_Simple.txt.

    #Open fungal Bray-Curtis sample pairwise dissimilarity file (with the Comparison column added in 5.2)
    betafung <- read.tidy("Beta_Comparison_Fungus.txt")

    #Open bacterial UniFrac sample pairwise dissimilarity file
    betabac <- read.tidy("Beta_Comparison_Bacteria_Simple.txt")

    Merge <- merge(betafung, betabac, by = "Comparison", all.x = TRUE)
    write.tidy(Merge, "Bacteria_Fungus_BC_Merged.txt")

The file Bacteria_Fungus_BC_Merged.txt is Additional File 1, tab Cross-Domain Beta Data.

To calculate the cross-domain beta-diversity Spearman correlation for Within Individuals, Within Household, and Between Household, the cross-domain beta data need to be divided into these three groups. In all cases, Bacteria_Fungus_BC_Merged.txt was used as input:

5.3) cross_domain_beta_subselect.r

    library(wilkoxmisc)

    Table <- read.tidy("Bacteria_Fungus_BC_Merged.txt")

    #Within Individual
    Individual <- Table[which(Table$IndividualType == "Within Individuals"), ]

    #Within Household
    Household <- Table[which(Table$IndividualType == "Between Individuals Within Households"), ]

    #Between household
    Different <- Table[which(Table$IndividualType == "Between Households"), ]

    #Save each file
    write.tidy(Individual, "spearman_individual.txt")
    write.tidy(Household, "spearman_household.txt")
    write.tidy(Different, "spearman_different.txt")

For each output file, remove all columns except the two containing the actual beta values (one for bacteria, one for fungi). Rename the two columns col1 and col2, and save under the same file name with _simple.txt appended (e.g. spearman_different_simple.txt). Return to R, and calculate the Spearman correlation for each comparison group:

5.4) cross_domain_spearman.r

    Ind <- read.tidy("spearman_individual_simple.txt")
    cor.test(Ind$col1, Ind$col2, method = "spearman")

    Spearman's rank correlation rho
    data: Ind$col1 and Ind$col2
    S = , p-value = 1.172e-12
    alternative hypothesis: true rho is not equal to 0
    sample estimates:
    rho

    House <- read.tidy("spearman_household_simple.txt")
    cor.test(House$col1, House$col2, method = "spearman")

    Spearman's rank correlation rho
    data: House$beta_f and House$beta_b
    S = , p-value = 8.074e-11
    alternative hypothesis: true rho is not equal to 0
    sample estimates:
    rho

    Diff <- read.tidy("spearman_different_simple.txt")
    cor.test(Diff$col1, Diff$col2, method = "spearman")

    Spearman's rank correlation rho
    data: Diff$col1 and Diff$col2
    S = e+12, p-value < 2.2e-16
    alternative hypothesis: true rho is not equal to 0
    sample estimates:
    rho

For the other group comparisons and their correlations shown in Additional File 1, tab Cross-Domain Beta Correlation, sub-select the comparison group in question and repeat script 5.3).

To determine the linear regression slope and intercept, and to plot Fig. 2b-d (the example below is for Fig. 2d):

5.5) plot_beta_regression.r

    library(wilkoxmisc)
    library(ggplot2)

    Diff <- read.tidy("spearman_different_simple.txt")
    coef(lm(col2 ~ col1, data = Diff))

    (Intercept) col1

    Different <- read.tidy("spearman_different.txt")
    Plot <- ggplot(Different, aes(x = beta_f, y = beta_b, colour = IndividualType))
    Plot <- Plot + geom_point(size = 1)
    Palette <- c("orange")
    Plot <- Plot + scale_colour_manual(values = Palette)
    Plot <- Plot + theme_classic()
    xlab <- paste0("Fungal Bray-Curtis Dissimilarity Between Samples")
    ylab <- paste0("Bacterial UniFrac Distance Between Samples")
    Plot <- Plot + xlab(xlab) + ylab(ylab)
    Plot <- Plot + theme(legend.title = element_blank())
    Plot <- Plot + theme(axis.title = element_text(size = 18, face = "bold"))
    Plot <- Plot + theme(axis.text = element_text(size = 14, face = "bold"))
    Plot <- Plot + theme(legend.text = element_text(size = 14, face = "bold"))
    Plot <- Plot + geom_abline(intercept = 0.236, slope = , colour = "purple", size = 1)
    Plot <- Plot + theme(legend.position = "bottom")
    ggsave("spearman_bacteria_fungus_different.png")
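The scripts in this section copy the regression intercept and slope by hand from the coef output into geom_abline. As a hedged alternative, the sketch below shows how both the Spearman correlation and the regression coefficients can be kept as R objects and reused. The six dissimilarity values are invented for illustration and the variable names are assumptions, not columns from the authors' files.

```r
# Hypothetical pairwise dissimilarities for six sample pairs
fungal    <- c(0.52, 0.61, 0.70, 0.74, 0.83, 0.90)
bacterial <- c(0.30, 0.36, 0.33, 0.45, 0.47, 0.55)

# Spearman's rho is the Pearson correlation of the ranks
rho        <- cor(fungal, bacterial, method = "spearman")
rho_manual <- cor(rank(fungal), rank(bacterial))

# The same estimate, with a p-value, from cor.test
ct <- cor.test(fungal, bacterial, method = "spearman")

# Regression coefficients kept as a named vector instead of being retyped
fit <- coef(lm(bacterial ~ fungal))
intercept <- fit[["(Intercept)"]]
slope     <- fit[["fungal"]]

print(rho)
print(ct$p.value)
print(c(intercept = intercept, slope = slope))

# With ggplot2 loaded, the line could then be drawn as:
# Plot <- Plot + geom_abline(intercept = intercept, slope = slope)
```

Passing the fitted values straight to the plotting call avoids transcription errors when the input data change.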

6) In-house script for taxonomic analysis

The R script takes the clean OTU table from 2.5 as input and creates a txt file indicating the top taxa at a particular taxonomic rank. This output can then be used to construct a plot in ggplot. The script also requires the Metadata txt file containing sample information, which was created in Microsoft Excel during sample collection.

6.1) R < make_tax_plot.r

    library(wilkoxmisc)
    library(reshape2)
    library(plyr)   # provides ddply
    library(ggplot2)
    library(grid)   # provides unit()

    #Open taxonomy OTU table
    OTU <- read.tidy("OTU_table_clean.tidy.txt")
    Meta <- read.tidy("metadata.txt")

    #Tabulate read counts by genus
    OTU <- ddply(OTU, .(Sample, Genus), summarise, Count = sum(Count))

    #Convert count to relative abundance and add column
    OTU <- ddply(OTU, .(Sample), mutate, RelativeAbundance = (Count * 100) / sum(Count))

    #Collapse taxa table to only the top 5 or 8 phyla, genera, families, etc. (requires reshape2)
    OTUTable <- collapse.taxon.table(OTU, n = 10, Rank = "Genus")

    #Merge relative abundance table and metadata table together
    OTUTable <- merge(OTUTable, Meta, by = "Sample", all.x = TRUE)
    write.tidy(OTUTable, "Top10Genus.txt")

    #Plot
    OTUTable <- read.tidy("Top10Genus.txt")
    OTUTable$Genus <- factor(OTUTable$Genus, levels = c("Aspergillus", "Candida", "Cryptococcus", "Malassezia", "Penicillium", "Sporobolomyces", "Unclassified Basidiomycota Genus", "Unclassified Saccharomycetales Genus", "Unclassified Sporidiobolales Genus", "Minor/Unclassified"))
    Plot <- ggplot(OTUTable, aes(x = Sample, y = RelativeAbundance, fill = Genus, order = -as.numeric(Genus)))
    Plot <- Plot + geom_bar(aes(x = as.factor(Sample)), stat = "identity", width = 1.5) + scale_y_continuous(expand = c(0, 0))
    Plot <- Plot + facet_grid(Anatomy ~ Location, scales = "free_x")
    Plot <- Plot + xlab(paste0("Sample by Household and Anatomy"))
    Plot <- Plot + theme_classic()
    Plot <- Plot + scale_fill_brewer(palette = "Set3")
    Plot <- Plot + theme(axis.text.y = element_text(size = 8))
    Plot <- Plot + theme(axis.text.x = element_blank())
    Plot <- Plot + theme(axis.ticks.x = element_blank())
    #panel.margin was renamed panel.spacing in ggplot2 2.2.0 and later
    Plot <- Plot + theme(panel.margin = unit(0.5, "lines"))
    Plot <- Plot + theme(axis.title = element_text(size = 16))
    Plot <- Plot + theme(legend.title = element_text(size = 12))
    Plot <- Plot + theme(legend.text = element_text(size = 10))
    ggsave("taxonomy_by_anatomy_location.png")

Plot from this script was as shown in Fig
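The two tabulation steps in 6.1) (conversion to relative abundance, then collapsing to the top taxa) can be illustrated in base R. This is a sketch on invented counts: it assumes that taxa are ranked by their total relative abundance across samples and that everything else is pooled as "Minor/Unclassified", which may differ from what collapse.taxon.table in wilkoxmisc actually does.

```r
# Toy genus-level counts for two samples (hypothetical values)
counts <- data.frame(
  Sample = rep(c("S1", "S2"), each = 4),
  Genus  = rep(c("Malassezia", "Candida", "Aspergillus", "Penicillium"), times = 2),
  Count  = c(70, 20, 6, 4, 50, 10, 30, 10),
  stringsAsFactors = FALSE
)

# Relative abundance (%) within each sample
counts$RelativeAbundance <- 100 * counts$Count / ave(counts$Count, counts$Sample, FUN = sum)

# Assumed ranking rule: total relative abundance across all samples
n_top  <- 2
totals <- sapply(split(counts$RelativeAbundance, counts$Genus), sum)
top    <- names(sort(totals, decreasing = TRUE))[seq_len(n_top)]

# Pool everything outside the top genera into one category
counts$Genus <- ifelse(counts$Genus %in% top, counts$Genus, "Minor/Unclassified")
collapsed <- aggregate(RelativeAbundance ~ Sample + Genus, data = counts, FUN = sum)
print(collapsed)
```

After collapsing, the relative abundances of each sample still sum to 100, so the stacked bars in the plot remain full height.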

7) In-house scripts for Malassezia species-level taxonomic analysis

OTUs that were classified as Malassezia according to the curated database were identified using Microsoft Excel's pivot table function, where OTUs were selected as rows and the data table was filtered to show only OTUs where Genus = Malassezia. This list of OTUs was saved to a txt file (Malassezia_OTU.txt). Reads clustered into these OTUs then need to be separated from the non-Malassezia reads in the OTU fasta file. This can be performed using an in-house perl script. In order for the script to work properly, the following input files are required:
- list of OTUs to be selected out (Malassezia OTUs)
- original OTU fasta file (containing all OTUs, output of usearch cluster_otus command)
The output will be a fasta file containing only OTUs classified as Malassezia. Please note that this script was used either when only analyzing Hong Kong data, or in conjunction with data from United States during the multi-study analysis (Additional Files 7 and 8).

7.1) select_otus.pl

    #!/usr/bin/perl
    use Modern::Perl 2014;
    use autodie;
    use Getopt::Long;
    use File::Slurp qw(read_file);

    #Unbuffered STDOUT so the progress counter updates in place
    $|++;

    my $USAGE = q/usage: perl select_otus.pl -o <list of OTUs> -f <reads fasta file> -u <output fasta file>
    /;

    my $OTUFile;
    my $fastafile;
    my $outputfile;

    GetOptions (
        'o=s' => \$OTUFile,
        'f=s' => \$fastafile,
        'u=s' => \$outputfile,
    ) or die $USAGE;

    die $USAGE unless $OTUFile && $fastafile && $outputfile;

    say "List of OTUs: $OTUFile";
    say "Input fasta: $fastafile";
    say "Output fasta: $outputfile";

    #Read list of wanted OTUs (one identifier per line)
    say "Reading list of wanted OTUs...";
    my %OTUs = map { $_ => 1 } read_file($OTUFile, chomp => 1);
    say scalar keys %OTUs, " OTUs wanted";

    #Extract wanted reads
    say "Extracting wanted reads from fasta file...";
    open FASTA, '<', $fastafile;
    open OUT, '>', $outputfile;
    my $read;
    while (<FASTA>) {
        chomp;
        print "$. lines processed\r" unless $. % 1000;
        #Header line: remember the identifier of the current record
        if (/^>(.+)/) {
            $read = $1;
        }
        #Write header and sequence lines only for wanted OTUs
        next unless exists $OTUs{$read};
        say OUT;
    }
    say "$. lines processed";
    close FASTA;
    close OUT;
    say "Wanted reads written to $outputfile";

The output of this script was then used to perform taxonomic classification using USEARCH as described in the main text Methods section. The OTU table containing read counts for each sample and each Malassezia OTU is merged with the usearch output file containing the species-level classification for each Malassezia OTU. The example script below contains inputs from sequences of the Hong Kong and United States studies.

7.2) taxonomy_species.r

    library(plyr)      #provides ddply()
    library(reshape2)
    library(ggplot2)

    #Open OTU table with only OTUs classified as Malassezia
    Table <- read.tidy("findley_otu_table_w_malassezia_only.txt")

    #Open Table with usearch species-level information for each OTU
    OTU <- read.tidy("usearch_results_findley_malassezia97_blast.txt")
    OTU <- merge(Table, OTU, by = "OTU", all.x = TRUE)
    write.tidy(OTU, "findley_otu_w_malassezia_species.txt")

    head(OTU)
      OTU Sample    Study Count Kingdom        Phylum
    1 SRR    SRR Bethesda     8   Fungi Basidiomycota
    2 SRR    SRR Bethesda   105   Fungi Basidiomycota
    3 SRR    SRR Bethesda    53   Fungi Basidiomycota
    4 SRR    SRR Bethesda    52   Fungi Basidiomycota
    5 SRR    SRR Bethesda   106   Fungi Basidiomycota
    6 SRR    SRR Bethesda    23   Fungi Basidiomycota
                                   Class         Order
    1 Basidiomycota_class_incertae_sedis Malasseziales
    2 Basidiomycota_class_incertae_sedis Malasseziales
    3 Basidiomycota_class_incertae_sedis Malasseziales
    4 Basidiomycota_class_incertae_sedis Malasseziales
    5 Basidiomycota_class_incertae_sedis Malasseziales
    6 Basidiomycota_class_incertae_sedis Malasseziales
                                   Family      Genus BLASTSpecies
    1 Malasseziales_family_incertae_sedis Malassezia   M. globosa
    2 Malasseziales_family_incertae_sedis Malassezia   M. globosa
    3 Malasseziales_family_incertae_sedis Malassezia   M. globosa
    4 Malasseziales_family_incertae_sedis Malassezia   M. globosa
    5 Malasseziales_family_incertae_sedis Malassezia   M. globosa
    6 Malasseziales_family_incertae_sedis Malassezia   M. globosa

    #Tabulate read counts by species
    #(the head() output above names the species column BLASTSpecies;
    #use whichever column name the merged table actually carries)
    OTU <- ddply(OTU, .(Sample, Species), summarise, Count = sum(Count))

    #Convert count to relative abundance (percent per sample) and add column
    OTU <- ddply(OTU, .(Sample), mutate, RelativeAbundance = (Count * 100) / sum(Count))

    #Collapse taxa table to only 5 or 8 top phyla, genus, family, etc (require reshape2).
    OTUTable <- collapse.taxon.table(OTU, n = 7, Rank = "Species")

    #Open metatable
    Meta <- read.tidy("Metadata.txt")

    #Merge relative abundance table and metatable together
    OTUTable <- merge(OTUTable, Meta, by = "Sample", all.x = TRUE)
    write.tidy(OTUTable, "TopMalasseziaSpecies.txt")

The resulting output contains the relative abundance values for each sample, together with metadata containing sample information. Data was subsequently reorganized for Additional Files 7 and 8.
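For readers who want to check the tabulation and relative-abundance steps of taxonomy_species.r on a small example, the following is a minimal sketch in base R. It swaps the ddply calls for aggregate() and ave(), so it does not reproduce the in-house script itself. The column names follow the script; the OTU identifiers, sample names, and counts are made up for illustration.

```r
# Toy long-format table. Column names follow taxonomy_species.r;
# all values are hypothetical and only serve to illustrate the arithmetic.
otu <- data.frame(
  OTU     = c("OTU_1", "OTU_2", "OTU_3", "OTU_1", "OTU_3"),
  Sample  = c("S1", "S1", "S1", "S2", "S2"),
  Count   = c(30, 10, 60, 5, 15),
  Species = c("M. globosa", "M. globosa", "M. restricta",
              "M. globosa", "M. restricta"),
  stringsAsFactors = FALSE
)

# Sum read counts per sample and species (stands in for ddply + summarise).
by_species <- aggregate(Count ~ Sample + Species, data = otu, FUN = sum)

# Express each count as a percentage of its sample total
# (stands in for ddply + mutate).
sample_totals <- ave(by_species$Count, by_species$Sample, FUN = sum)
by_species$RelativeAbundance <- 100 * by_species$Count / sample_totals

by_species <- by_species[order(by_species$Sample, by_species$Species), ]
print(by_species)
```

Within each sample the RelativeAbundance values should sum to 100, which is a quick sanity check to run on the real table before collapsing to the top species.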

8) In-house script for multi-study comparison plot generation

Following OTU table construction for each of the Hong Kong and United States studies, taxonomic information at the genus, family, order, and class levels was used to generate the pan-microbiome plot (Fig. 5). The number of different taxa at these ranks was manually tabulated, and a txt file panmicrobiome.txt was constructed in Microsoft Excel. This txt file is subsequently used as input to generate Fig. 5.

8.1) draw_pan_plot.r

    library(ggplot2)

    Table <- read.tidy("panmicrobiome.txt")

    head(Table)
                           Comparison   Type Number
    1                  Hong Kong only  Genus
    2            Hong Kong + Bethesda  Genus
    3 Hong Kong + Bethesda + Berkeley  Genus
    4                  Hong Kong only Family     67
    5            Hong Kong + Bethesda Family     74
    6 Hong Kong + Bethesda + Berkeley Family     78

    #Fix the plotting order of ranks and study combinations
    Table$Type <- factor(Table$Type, levels = c("Genus","Family","Order","Class"))
    Table$Comparison <- factor(Table$Comparison, levels = c("Hong Kong only","Hong Kong + Bethesda", "Hong Kong + Bethesda + Berkeley"))

    #Grouped bar plot: one bar per rank within each study combination
    Plot <- ggplot(Table, aes(x=Comparison, y=Number, fill=Type))
    Plot <- Plot + geom_bar(stat="identity", position="dodge")
    Plot <- Plot + theme_classic()
    Plot <- Plot + theme(axis.text.x = element_text(angle=45, hjust=1, size=14, face="bold"))
    xlab <- paste0("Study Included")
    ylab <- paste0("Total Number of Taxa")
    Plot <- Plot + xlab(xlab) + ylab(ylab)
    Plot <- Plot + theme(axis.text.y = element_text(size=14, face="bold"))
    Plot <- Plot + theme(axis.title = element_text(size=18, face="bold"))
    Plot <- Plot + theme(legend.text = element_text(size=14, face="bold"))
    Plot <- Plot + theme(legend.title = element_blank())

    #Save the plot object explicitly rather than relying on the last plot
    ggsave("panmicrobiome.png", plot = Plot)
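The taxa counts in panmicrobiome.txt were tabulated manually, but the same numbers could in principle be derived by script. The following is a minimal sketch in base R, assuming the taxa detected in each study at a given rank are available as character vectors. The genus lists are hypothetical placeholders and not results from the manuscript.

```r
# Hypothetical genus lists per study (placeholder values only).
genera <- list(
  "Hong Kong" = c("Malassezia", "Candida", "Aspergillus"),
  "Bethesda"  = c("Malassezia", "Cladosporium"),
  "Berkeley"  = c("Candida", "Penicillium", "Cladosporium")
)

# Cumulative union: the distinct taxa seen after adding each study in turn.
cumulative <- Reduce(union, genera, accumulate = TRUE)

# One row per study combination, in the layout used by panmicrobiome.txt.
pan <- data.frame(
  Comparison = c("Hong Kong only", "Hong Kong + Bethesda",
                 "Hong Kong + Bethesda + Berkeley"),
  Type       = "Genus",
  Number     = vapply(cumulative, length, integer(1)),
  stringsAsFactors = FALSE
)
print(pan)
```

Repeating the same steps for the family, order, and class lists and stacking the resulting data frames with rbind() would give a table in the shape that draw_pan_plot.r expects.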
