Supplementary Figure 1

Alignment speed and sensitivity of spliced alignment software for 40 million error-free simulated paired-end reads (100 bp long, 20 million pairs).

[Figure: the left plot gives pairs processed per second; the right plot gives the percentage of pairs that are correctly and uniquely mapped, correctly mapped (multimapped), incorrectly mapped, or unmapped. Each plot has one panel per pair type: all, M, 2M_gt_15, 2M_8_15, 2M_1_7, gt_2m.]

This figure shows the alignment speed and sensitivity for each type of pair (all, M, 2M_gt_15, 2M_8_15, 2M_1_7, gt_2m), where "all" includes all the pair types. Since a pair consists of left and right reads, the type of a pair is determined by the more difficult read type. The read types, ordered from easiest to most difficult, are: M, 2M_gt_15, 2M_8_15, 2M_1_7, and gt_2m. The plot on the left shows the alignment speed of the programs in terms of the number of pairs processed per second. The right plot shows alignment sensitivity. Pairs are categorized as: (1) correctly and uniquely mapped, (2) correctly mapped (multi-mapped), (3) incorrectly mapped, and (4) unmapped. Case (2) covers instances where an aligner mapped a pair to multiple locations and one of the locations was correct. These four categories encompass all of the pairs. The numbers in the right plot represent the percentages of case (1). The numbers inside the parentheses represent the percentages of cases (1) and (2) combined.

Nature Methods doi:10.1038/nmeth.3317
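The pair-typing rule and the two percentages described in the caption can be sketched in a few lines of Python. This is only an illustration: the function names, the category labels, and the example counts are assumptions and are not taken from the evaluation scripts used in the study.

```python
# Read types ordered from easiest to most difficult, as listed in the caption.
DIFFICULTY = ["M", "2M_gt_15", "2M_8_15", "2M_1_7", "gt_2m"]


def pair_type(left_type, right_type):
    """Return the type of a pair: the more difficult of its two read types."""
    return max(left_type, right_type, key=DIFFICULTY.index)


def summarize(counts):
    """Return (case 1 %, cases 1 and 2 combined %) from category counts.

    `counts` maps four hypothetical category labels to numbers of pairs.
    """
    total = sum(counts.values())
    unique = 100.0 * counts["correct_unique"] / total
    combined = 100.0 * (counts["correct_unique"] + counts["correct_multi"]) / total
    return round(unique, 1), round(combined, 1)


# Made-up counts, purely for illustration.
example = {"correct_unique": 970, "correct_multi": 15, "incorrect": 10, "unmapped": 5}
```

With these made-up counts, `summarize(example)` yields the number printed in the plot (case 1) and the number printed in parentheses (cases 1 and 2 combined).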

Supplementary Figure 2

Alignment speed and sensitivity of spliced alignment software for 20 million simulated single-end reads with a mismatch rate of 0.5% (100 bp long).

[Figure: the left plot gives reads processed per second; the right plot gives the percentage of reads that are correctly and uniquely mapped, correctly mapped (multimapped), incorrectly mapped, or unmapped. Each plot has one panel per read type: all, M, 2M_gt_15, 2M_8_15, 2M_1_7, gt_2m.]

This figure shows the alignment speed and sensitivity for each type of read (all, M, 2M_gt_15, 2M_8_15, 2M_1_7, gt_2m), where "all" includes all the read types. The plot on the left shows the alignment speed of the programs in terms of the number of reads processed per second. The right plot shows alignment sensitivity. Reads are categorized as: (1) correctly and uniquely mapped, (2) correctly mapped (multi-mapped), (3) incorrectly mapped, and (4) unmapped. Case (2) covers instances where an aligner mapped a read to multiple locations and one of the locations was correct. These four categories encompass all of the reads. The numbers in the right plot represent the percentages of case (1). The numbers inside the parentheses represent the percentages of cases (1) and (2) combined. Viewed together, the two plots show the tradeoffs between alignment speed and sensitivity.

Supplementary Figure 3

Alignment results for 109 million reads, each 101 bp long, from a human sample.

[Figure: four panels, labelled 0, 1, 2, and 3 by edit distance; the y-axis is the cumulative number of alignments.]

Shown are the cumulative numbers of alignments up to a given edit distance. Edit distance is defined here simply as the number of differences ("edits") between the read and the reference sequence. The leftmost panel shows reads that matched exactly (with an edit distance of 0). The next panel (labelled 1) shows the number of reads that aligned with either 0 or 1 mismatches; similarly for the panels labelled 2 and 3. Note that two of the programs report soft-clipped alignments, in which bases on the ends of reads are left unaligned. To compute edit distances for these alignments, we re-aligned the soft-clipped bases to their corresponding locations in the reference genome and calculated the number of mismatches.
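The treatment of soft-clipped alignments and the cumulative counts can be sketched as follows. This is a simplified illustration, not the script used for the figure: it assumes an ungapped, unspliced alignment, 0-based coordinates, and hypothetical function names.

```python
from collections import Counter


def edit_distance_with_softclips(read, reference, ref_start, left_clip):
    """Count mismatches after placing soft-clipped bases back on the reference.

    `ref_start` is the 0-based reference position of the first aligned
    (non-clipped) base, and `left_clip` is the number of soft-clipped bases
    at the left end of the read. Right-clipped bases need no offset.
    """
    start = ref_start - left_clip  # where the whole read would begin
    window = reference[start:start + len(read)]
    return sum(1 for r, g in zip(read, window) if r != g)


def cumulative_counts(distances, max_distance=3):
    """Return the number of alignments with edit distance <= d, for d = 0..max."""
    per_distance = Counter(distances)
    totals, running = [], 0
    for d in range(max_distance + 1):
        running += per_distance[d]
        totals.append(running)
    return totals
```

Alignments with an edit distance above `max_distance` are simply not counted, which mirrors the panels stopping at an edit distance of 3.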

Supplementary Figure 4

Alignment results of spliced alignment software for 109 million real reads (101 bp long).

[Figure: four panels, labelled 0, 1, 2, and 3 by edit distance; the y-axis is the cumulative number of spliced alignments.]

This figure shows the cumulative number of spliced alignments up to a given edit distance (0, 1, 2, and 3) whose splice sites are known in gene annotations.

Supplementary Figure 5

Alignment speed and sensitivity of spliced alignment software for 40 million simulated paired-end reads with a mismatch rate of 0.5% (100 bp long, 20 million pairs).

[Figure: the left plot gives pairs processed per second; the right plot gives the percentage of pairs that are correctly and uniquely mapped, correctly mapped (multimapped), incorrectly mapped, or unmapped. Each plot has one panel per pair type: all, M, 2M_gt_15, 2M_8_15, 2M_1_7, gt_2m.]

This figure shows the alignment speed and sensitivity for each type of pair (all, M, 2M_gt_15, 2M_8_15, 2M_1_7, gt_2m), where "all" includes all the pair types. Since a pair consists of left and right reads, the type of a pair is determined by the more difficult read type. The read types, ordered from easiest to most difficult, are: M, 2M_gt_15, 2M_8_15, 2M_1_7, and gt_2m. The plot on the left shows the alignment speed of the programs in terms of the number of pairs processed per second. The right plot shows alignment sensitivity. Pairs are categorized as: (1) correctly and uniquely mapped, (2) correctly mapped (multi-mapped), (3) incorrectly mapped, and (4) unmapped. Case (2) covers instances where an aligner mapped a pair to multiple locations and one of the locations was correct. These four categories encompass all of the pairs. The numbers in the right plot represent the percentages of case (1). The numbers inside the parentheses represent the percentages of cases (1) and (2) combined.

Supplementary Figure 6

Alignment results of spliced alignment software for ~218 million real paired-end reads (~109 million pairs).

[Figure: two plots, each with four panels labelled 0, 1, 2, and 3 by edit distance; the y-axes are the cumulative number of alignments and the cumulative number of spliced alignments.]

This figure shows two plots: (1) the cumulative number of alignments up to a given edit distance (0, 1, 2, and 3) and (2) the cumulative number of spliced alignments whose splice sites are known in gene annotations. Note that these alignments are pair alignments, with the edit distance combined from the left and the right alignments. Spliced alignments are those whose read alignment is a spliced alignment.

Supplementary Figure 7

Alignment results of spliced alignment software for ~126 million real paired-end reads (~63 million pairs).

[Figure: two plots, each with four panels labelled 0, 1, 2, and 3 by edit distance; the y-axes are the cumulative number of alignments and the cumulative number of spliced alignments.]

This figure shows two plots: (1) the cumulative number of alignments up to a given edit distance (0, 1, 2, and 3) and (2) the cumulative number of spliced alignments whose splice sites are known in gene annotations. Note that these alignments are pair alignments, with the edit distance combined from the left and the right alignments. Spliced alignments are those whose read alignment is a spliced alignment.

Supplementary Figure 8

Three working examples demonstrating how HISAT applies its hierarchical indexing for fast and sensitive alignment.

[Figure: panels a, b, and c show reads aligned to two exons on chr22, e1 (24,447,287 to 24,447,436) and e2 (24,451,336 to 24,451,622), using a local FM index for the surrounding region of chr22. Legend: read, exon, intron, global search, local search, extension; steps are labelled (1), (2), (3), and "x" marks a mismatch.]

The examples include alignment of one exonic read and two junction reads (one an intermediate-anchored read and the other a long-anchored read). Reads are error-free and 100-bp long.

Supplementary Figure 9

Two-step approach version of HISAT to allow alignment of junction reads with small anchors.

[Figure: the first run of HISAT discovers splice sites, leaving some reads mapped and others unmapped; the second run of HISAT aligns reads by making use of the list of splice sites collected in the first run. Legend: read, exon, intron, global search, local search, extension, junction extension.]

This figure shows how to align reads with short anchors (1-7 bp) by making use of splice sites found by reads with long anchors.

Supplementary Figure 10

Alignment of junction reads in the presence of processed pseudogenes.

[Figure: panels a and b show a junction read spanning two exons on Chr1 (e1 at 65,656,393 to 65,656,512; e2 starting at 65,684,437) and a processed pseudogene copy (e1 + e2) on Chr17 that differs by one base. Legend: read, exon, intron, global search, local search, extension; "x" marks a mismatch.]

This figure shows how to correctly align reads that would otherwise be mapped incorrectly to processed pseudogenes.

Supplementary Figure 11

Three more examples demonstrating how HISAT applies its hierarchical indexing for reads involving mismatches, indels and three exons.

[Figure: panels a, b, and c, with numbered steps showing the order of operations. Legend: read, exon, intron, global search, local search, extension, gap closure; "x" marks a mismatch.]

The examples include alignment of one exonic read with one mismatch, one exonic read with an indel, and one read that spans three exons with small anchors on both sides. Reads are 100-bp long.

Supplementary Table 1

Program   No. of splice sites reported   No. of true splice sites reported   Sensitivity (%)   Precision (%)
x1        95,732                         91,5                                97.7              95.1
          94,217                         91,21                               97.9              96.8
          94,121                         91,159                              97.8              96.9
          96,326                         9,535                               97.1              94.
          95,44                          9,59                                97.2              95.
          97,385                         91,171                              97.8              93.6
          91,17                          87,71                               94.1              96.3
          19,276                         84,866                              91.1              77.7

Sensitivity and precision of splice sites reported by spliced alignment software for 40M simulated error-free paired-end reads (20 million pairs) from the entire human genome. The number of known splice sites included in the simulated paired-end reads (20 million pairs) is 93,199. Sensitivity is the percentage of true splice sites found out of the total that were present. Precision (or positive predictive value) is the percentage of reported splice sites that are correct.
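The two definitions in the table note translate directly into set arithmetic. The sketch below is illustrative only; representing a splice site as a `(chromosome, donor, acceptor)` tuple and the function name are assumptions.

```python
def splice_site_metrics(reported, truth):
    """Return (sensitivity %, precision %) for a set of reported splice sites.

    Both arguments are iterables of hashable splice-site identifiers,
    for example (chromosome, donor, acceptor) tuples.
    """
    reported, truth = set(reported), set(truth)
    true_reported = reported & truth  # reported sites that really exist
    sensitivity = 100.0 * len(true_reported) / len(truth)
    precision = 100.0 * len(true_reported) / len(reported)
    return sensitivity, precision


# Toy data: four true sites, five reported sites, three of them true.
truth = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 90), ("chr2", 500, 900)]
reported = truth[:3] + [("chr1", 101, 200), ("chr3", 10, 20)]
```

On the toy data, three of the four true sites are found (sensitivity) and three of the five reported sites are correct (precision).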

Supplementary Table 2

Program   Correctly and uniquely aligned reads   Correctly mapped reads (multi-mapped)   Incorrectly mapped reads   Unmapped reads
x1        92.1%                                  1.5%                                    3.7%                       2.8%
          97.6%                                  1.7%                                    0.2%                       0.5%
          97.6%                                  1.7%                                    0.3%                       0.5%
          88.9%                                  1.6%                                    9.2%                       0.3%
          96.7%                                  2.3%                                    0.9%                       0.2%
          92.3%                                  1.5%                                    6.0%                       0.1%
          91.6%                                  0.0%                                    4.4%                       4.0%
          94.4%                                  3.1%                                    0.1%                       2.4%

Alignment results of spliced alignment software for 20 million simulated 100-bp reads with a mismatch rate of 0.5%. This table shows the alignment results for all the reads (all, M, 2M_gt_15, 2M_8_15, 2M_1_7, gt_2m). Reads are categorized as: (1) correctly and uniquely mapped, (2) correctly mapped (multi-mapped), (3) incorrectly mapped, and (4) unmapped. Case (2) covers instances where an aligner mapped a read to multiple locations and one of the locations was correct. These four categories encompass all of the reads.

Supplementary Table 3

Program   Alignment precision for non-GT/AG splice sites (exact matching)   Alignment precision for non-GT/AG splice sites (±5-bp window)
x1        49,244 (39%)                                                      66,112 (52%)
          72,12 (57%)                                                       95,72 (%)
          73,738 (58%)                                                      96,83 (%)
          49,276 (39%)                                                      69,52 (55%)
          68,337 (54%)                                                      91,398 (72%)
          64,688 (51%)                                                      69,572 (55%)
          68,863 (54%)                                                      7,934 (56%)

Alignment precision involving non-GT/AG splice sites reported by spliced alignment software for 20 million simulated single-end reads from the human genome, with a mismatch rate of 0.5%. There were 127,287 reads spanning non-GT/AG splice sites in the simulated 20 million single-end reads. Alignment precision measures the percentage of reads that are aligned correctly. Reads that mapped to multiple locations were considered correct if one of the mapped locations was correct. Column 2 shows the precision of each program when alignments were required to map precisely to the non-consensus splice site. Column 3 shows precision with a relaxed criterion, counting alignments as correct if they match within 5 bp of the non-consensus splice site. Note that one program is not included in the table because it does not predict non-GT/AG splice sites.
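The difference between the exact and the relaxed criterion can be sketched as below. This is a hedged illustration: the data layout, the function names, and the choice to apply the window to both ends of the junction are assumptions, since the table note does not spell out those details.

```python
def junction_matches(reported, true, window=0):
    """Check whether a reported (donor, acceptor) pair lies within `window` bp
    of the true junction at both ends (window=0 means exact matching)."""
    return (abs(reported[0] - true[0]) <= window
            and abs(reported[1] - true[1]) <= window)


def alignment_precision(alignments, window=0):
    """Return the percentage of reads aligned correctly.

    `alignments` is a list of (reported_locations, true_junction) items.
    A multi-mapped read counts as correct if any reported location matches.
    """
    correct = sum(
        1 for locations, true in alignments
        if any(junction_matches(loc, true, window) for loc in locations)
    )
    return 100.0 * correct / len(alignments)


# Toy reads: exact hit, near miss, multi-mapped with one exact hit, far miss.
toy = [
    ([(100, 200)], (100, 200)),
    ([(103, 198)], (100, 200)),
    ([(110, 200), (100, 200)], (100, 200)),
    ([(120, 230)], (100, 200)),
]
```

Calling `alignment_precision(toy)` applies the exact criterion of column 2, and `alignment_precision(toy, window=5)` applies the relaxed criterion of column 3.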

Supplementary Table 4

Program   No. of splice sites reported   No. of true splice sites reported   Sensitivity (%)   Precision (%)
x1        98,345                         91,59                               97.3              92.6
          96,246                         91,3                                97.6              94.8
          96,186                         91,29                               97.5              94.8
          12,488                         9,624                               96.9              88.4
          98,496                         9,683                               96.9              92.1
          1,741                          91,58                               97.8              9.8
          93,851                         88,171                              94.3              93.9
          114,36                         84,995                              9.9               74.5

Sensitivity and precision of splice sites reported by spliced alignment software for 40M simulated paired-end reads (20 million pairs) from the entire human genome, with a mismatch rate of 0.5%. The number of known splice sites included in the simulated paired-end reads (20 million pairs) is 93,543. Sensitivity is the percentage of true splice sites found out of the total that were present. Precision (or positive predictive value) is the percentage of reported splice sites that are correct.

Supplementary Table 5

Program   Run time (minutes)
x1        46.2
          96.7
          55.8
          52.8
          147.4
          1365.3
          1978.7
          2416.5

Run-time of the alignment software for ~218 million real paired-end reads (~109 million pairs).

Supplementary Table 6

Program   Run time (minutes)
x1        31
          64.5
          34.6
          35.8
          93.1
          88.5
          1187.9
          166.

Run-time of the alignment software for ~126 million real paired-end reads (~63 million pairs).

Supplementary Table 7

Program parameters for running simulated and real reads.

HISAT (x1, default, and two-pass runs), version .1.2-beta
    hisat-align -p 3 --no-temp-splicesite <index> -1 <read_1> -2 <read_2>
    hisat-align -p 3 <index> -1 <read_1> -2 <read_2>
    First pass:  hisat-align -p 3 --novel-splicesite-outfile splicesites.txt <index> -1 <read_1> -2 <read_2>
    Second pass: hisat-align -p 3 --novel-splicesite-infile splicesites.txt <index> -1 <read_1> -2 <read_2>

TopHat2, version 2..11
    tophat -p 3 --read-edit-dist 3 --no-sort-bam --read-realign-edit-dist --keep-tmp <index> <read_1> <read_2>

OLego, version 1.1.2
    Left read:  olego --num-reads-batch 124 -t 3 -M 3 -o out1.sam <index> <read_1>
    Right read: olego --num-reads-batch 124 -t 3 -M 3 -o out2.sam <index> <read_2>
    Pairing alignments of left and right reads: mergepesam.pl out1.sam out2.sam out.sam

GSNAP, version 214-5-3
    gsnap -A sam -t 3 --max-mismatches=3 -D. -N 1 -d <index> <read_1> <read_2>

STAR, version 2.4.a (August 214)
    One pass:    STAR --runThreadN 3 --genomeDir <index> --genomeLoad NoSharedMemory --readFilesIn <read_1> <read_2> --outFilterMismatchNmax 6
    First pass:  STAR --runThreadN 3 --genomeDir <index> --genomeLoad NoSharedMemory --readFilesIn <read_1> <read_2> --outFilterMismatchNmax 6
    Indexing:    STAR --genomeDir <new_index> --runMode genomeGenerate --genomeFastaFiles genome.fa --sjdbFileChrStartEnd SJ.out.tab.Pass1.sjdb --sjdbOverhang 99 --runThreadN 3
    Second pass: STAR --runThreadN 3 --genomeDir <new_index> --genomeLoad NoSharedMemory --alignSJDBoverhangMin 1 --readFilesIn <read_1> <read_2> --outFilterMismatchNmax 6

The last column (Parameters) shows the specific parameters for each program that allow a pair to be aligned with edit distances of 0, 1, 2, and 3. For single-end reads, the <read_2> field is not needed, --outFilterMismatchNmax 3 is used in the STAR runs, and only the "Left read" command is needed for OLego. We ran the programs on a Mac Pro with a 3.7 GHz Quad-Core (Intel Xeon E5) processor and 64 GB of RAM (1866 MHz DDR3 ECC memory).

Supplementary Note

(1) Details on simulated data sets

Rather than precisely imitating real RNA-seq experiments, we generated reads specifically for the purpose of testing the alignment programs. To simulate reads, we used the transcript expression model from the Flux simulator (ref. 1). First, we randomly ranked the transcripts from the protein coding genes found in the Ensembl human gene annotation (release 66). Then we modeled the expression levels of the transcripts as follows. The expression level y of a transcript is defined as

$$y = \left(\frac{x}{x_0}\right)^{k} e^{-\frac{x}{x_1} - \left(\frac{x}{x_1}\right)^{2}},$$

where x is the rank number of a transcript, $x_0 = 5 \times 10^{7}$, $x_1 = 9{,}500$, and $k = -0.6$.

Fragment lengths are chosen according to a normal distribution (mean: bp and s.d.: 4 bp), and fragments are generated from the transcripts with their 3′ and 5′ positions randomly selected according to a uniform distribution. Left and right reads (100-bp long) are generated from the fragments. To create reads with mismatches, we replaced each nucleotide of a read with a different base with a probability of 0.5%. The maximum number of mismatches allowed in each read is 3. All these procedures are implemented in the TuxSim simulation program, which was originally developed by Cole Trapnell and slightly modified by us to allow reads with mismatches.

To run TuxSim, download tuxsim-.1.tar.gz from http://www.ccb.jhu.edu/software/hisat/downloads/hisat-suppl/tuxsim-.1.tar.gz and follow the instructions below. (Version 1.38.0 or higher of the boost library, http://www.boost.org, is required.)
i) tar xvzf tuxsim-.1.tar.gz
ii) cd tuxsim-.1
iii) ./configure --with-boost=/path/to/boost_prefix_dir
iv) make

We used release GRCh37 of the human genome, available from many sites including http://ccb.jhu.edu/software/hisat/downloads/hisat-suppl/genome.fa

The human gene annotation used for our simulation is available at http://ccb.jhu.edu/software/hisat/downloads/hisat-suppl/genes.gtf

The 40 million error-free 100-bp reads (20 million pairs) are available for download at http://ccb.jhu.edu/software/hisat/downloads/hisat-suppl/reads_perfect.tar.gz

Alternatively, these reads can be re-generated as follows:

i) tuxsim-.1/src/tuxsim sim_perfect.cfg (sim_perfect.cfg is available at http://ccb.jhu.edu/software/hisat/downloads/hisat-suppl/sim_perfect.cfg)
ii) python tuxsim-.1/src/extract_and_shuffle.py sim_2m 2
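The mismatch step of the simulation (each base replaced by a different base with a fixed probability, with at most three mismatches per read) can be sketched as follows. This is not the TuxSim code: the function name is hypothetical, and stopping once the cap is reached is an assumption about how the limit of three is enforced.

```python
import random

BASES = "ACGT"


def add_mismatches(read, rate=0.005, max_mismatches=3, rng=random):
    """Return (mutated_read, number_of_mismatches).

    Each base is replaced by a different base with probability `rate`.
    Assumption: once `max_mismatches` substitutions have been made, the
    rest of the read is left unchanged.
    """
    out = list(read)
    mismatches = 0
    for i, base in enumerate(out):
        if mismatches >= max_mismatches:
            break
        if rng.random() < rate:
            # Choose uniformly among the bases that differ from the original.
            out[i] = rng.choice([b for b in BASES if b != base])
            mismatches += 1
    return "".join(out), mismatches
```

Passing a seeded `random.Random` instance as `rng` makes a simulated data set reproducible.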

The 40 million 100-bp reads (20 million pairs) with mismatches at a rate of 0.5% are available for download at http://www.ccb.jhu.edu/software/hisat/downloads/hisat-suppl/reads_mismatch.tar.gz

This data set can be re-generated as follows:

i) tuxsim-.1/src/tuxsim sim_mismatch.cfg (sim_mismatch.cfg is available at http://www.ccb.jhu.edu/software/hisat/downloads/hisat-suppl/sim_mismatch.cfg)
ii) python tuxsim-.1/src/extract_and_shuffle.py sim_2m 2

(2) Downloading the real data sets used in this study

For the real data in the main text [GEO accession number: GSM981249], the reads are available to download at
ftp://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodedcc/wgencodecshllongrnaseq/wgencodecshllongrnaseqimr9cellpapfastqrd1rep1.fastq.gz
ftp://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodedcc/wgencodecshllongrnaseq/wgencodecshllongrnaseqimr9cellpapfastqrd1rep2.fastq.gz

For Supplementary Fig. 6 and Supplementary Table 5, we used the same data set as in the main text: 217,498,662 paired-end reads (108,749,331 pairs) from fetal lung fibroblasts, sequenced using Illumina HiSeq 2000 [GEO accession number: GSM981249]. All reads are 101 bp in length.

For Supplementary Fig. 7 and Supplementary Table 6, we used RNA-seq reads gathered across a time course experiment reported by Chen et al. (ref. 2) [GEO accession number: GSM818581]. This data set includes 125,642,456 paired-end reads in 62,821,228 pairs. All reads are 101 bp in length.

Details on how HISAT handles alignment involving mismatches, indels, and short anchors on both ends of a read

Here we illustrate the algorithm in greater detail through the use of three examples: (a) a read with one mismatch, (b) a read with one indel, and (c) a read with short anchors on both ends (Supplementary Fig. 11). For case (b), it does not matter for the sake of discussion whether the indel is an insertion or a deletion. As with the examples given in the main text, reads are assumed to be 100-bp long. In addition to the global search, local search, and directed read extension strategies presented in the main text, HISAT also includes a gap closure operation, which combines two partial alignments and fills gaps if any exist. Also, the extension operation allows mismatches while extending the alignment of a read and, unlike the index-based operations (global and local search), the extension operation is bi-directional. We illustrate these relatively simple cases to provide some insight into how HISAT aligns more complicated cases. In the following description, we avoid excessive details so that we may convey the key ideas about the core alignment algorithms of HISAT.

(a) A read with one mismatch. Similar to the examples in Supplementary Fig. 8, we first search the read from its right end using the global FM-index. Once we find a uniquely aligned anchor, we extend the alignment until we encounter a mismatch (Supplementary Fig. 11a). At this point, we might try either a local search (assuming an intron is present) or an extension operation to align the remaining part of the read. Suppose we use the extension operation with one mismatch allowed. Since the read has only one mismatch and is included entirely within an exon (e1), the extension succeeds and sweeps across the rest of the read. Note that the sequence of operations applied to align this read is shown just below the operation arrows in the figure.

(b) A read with an indel. Suppose the indel in Supplementary Fig. 11b is 2 bp long.
As in the previous case, we first use the global index to anchor the read and extend the alignment until we encounter a mismatch (due to the indel). We then try a local search just after the mismatched base, but the local search fails because the indel is two bases long and the second base of the indel is included in the local search. We also try a directed extension operation with one mismatch allowed, but the extension fails because the operation does not allow gaps. The next step is to skip a certain number of bases (e.g., 8 bp) and retry the local search, producing a partial alignment on the left side of the indel. We now have two partial alignments, one on each side of the indel, which we combine using the gap closure operation, producing the alignment with the indel as shown in the figure. The current version of HISAT allows only one indel per gap closure operation.

(c) A read with short anchors (10-bp each) on both ends. Suppose the read in Supplementary Fig. 11c spans three exons (10 bp on the left exon, 80 bp on the middle one, and 10 bp on the right one). HISAT first searches the read from its right end, but fails to anchor it using the global index because the right segment of the read spans two exons. Note that this failure will also happen if the segment includes mismatches or indels. Next, HISAT tries to anchor the read beginning at the 18th base (this parameter can be adjusted), and this time it succeeds in anchoring the read. HISAT then extends the anchored alignment in both directions as shown in the figure. The left extension and the right extension stop at the 91st base and the 10th base, respectively, because of the exons (e1 and e3) that the read spans at its two ends. The left 10-bp segment of the read is aligned using local search. HISAT also uses the local index to align the right 10-bp segment. Note that the direction of the last local index search is right-to-left, aligning the small anchor from its right end.

HISAT provides several parameters with which users can customize its alignment strategy, including adjustable penalties for mismatches, indels, and non-canonical splice sites. The default penalty for a mismatch ranges from 2 (minimum: MN) to 6 (maximum: MX) depending on the quality score (Q) at the mismatched base: MN + floor((MX - MN) * (MIN(Q, 40.0) / 40.0)). The default gap opening and extension penalties are 3 and 5, respectively. The default penalty score for each non-canonical splice site is 12. The default minimum score for an alignment to be reported is -18. For example, if a read's alignment has one mismatch (assuming Q is 20) and involves one non-canonical splice site, the alignment score is -16, the sum of -4 (for the mismatch) and -12 (for the non-canonical splice site). Since the alignment score is greater than the default minimum score (-18), the alignment will be reported.
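The scoring example can be checked with a short Python sketch. It is an illustration, not HISAT's implementation: the function names are hypothetical, gap penalties are left out, and treating the minimum score as an inclusive threshold is an assumption. It also assumes that the quality cap in the formula is 40 and that the example quality is 20, the values under which the mismatch penalty comes to the 4 stated in the text.

```python
import math


def mismatch_penalty(q, mn=2, mx=6):
    """Mismatch penalty MN + floor((MX - MN) * (min(Q, 40) / 40))."""
    return mn + math.floor((mx - mn) * (min(q, 40.0) / 40.0))


def alignment_score(mismatch_qualities, noncanonical_sites=0,
                    noncanonical_penalty=12):
    """Return the (negative) score for mismatches and non-canonical sites.

    `mismatch_qualities` holds the quality score at each mismatched base.
    Gap open and extension penalties are omitted from this sketch.
    """
    penalty = sum(mismatch_penalty(q) for q in mismatch_qualities)
    penalty += noncanonical_penalty * noncanonical_sites
    return -penalty


def is_reported(score, min_score=-18):
    """Assumption: an alignment is reported when score >= min_score."""
    return score >= min_score
```

For the example in the text, `alignment_score([20], noncanonical_sites=1)` sums a penalty of 4 for the mismatch and 12 for the splice site, and `is_reported` then compares the result with the default minimum of -18.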

References

1. Griebel T, Zacher B, Ribeca P, Raineri E, Lacroix V, Guigo R, Sammeth M: Modelling and simulating generic RNA-Seq experiments with the flux simulator. Nucleic Acids Research 2012.
2. Chen R, Mias GI, Li-Pook-Than J, Jiang L, Lam HY, Miriami E, Karczewski KJ, Hariharan M, Dewey FE, Cheng Y, et al: Personal omics profiling reveals dynamic molecular and medical phenotypes. Cell 2012, 148:1293-1307.