Maruyama et al. SUPPLEMENTARY SCRIPTS. Script S1: PeakMarker.plx Script S2: SiteWriter_CFD.plx


To use: cut all text between (but not including) tracts and paste it into a new file using the code/text editor of your choice. Save As using the script name. Create /in and /out directories and edit the paths in the SET THE VARIABLES BELOW AS REQUIRED section. Set other variables as required.

Script S1:

#!/usr/bin/perl
# Written: Nick Kent, Aug 2010
# Last updated: Nick Kent, 8th Apr 2012
# USAGE:- perl PeakMarker.plx
#
# This script takes an .sgr file as an input, and calls peak centre/summit bins
# above a single, but scalable, noise threshold. It is, therefore, a very simple
# peak calling program. It outputs an .sgr listing these bin positions with a
# y-axis value proportional to the scaled summit bin read frequency. The scaling
# value can be altered to reflect differences in read depth between two
# experiments.

use strict;
use warnings;
use Math::Round;

# SET THE VARIABLES BELOW AS REQUIRED
# $indir_path   - The directory containing the .sgr files to be processed
# $outdir_path  - The directory to store the .sgr peak output files
# $thresh       - The aligned read number noise threshold value
# $scale_factor - A proportion based on differences in read depth

my $indir_path ="/sgr_in";
my $outdir_path ="/peaks_out";

my $thresh = 10;
my $scale_factor = 1.00;

# MAIN PROGRAM

# define some variables
my (@files, $infile, $outfile);

# store input file names in an array
opendir(DIR, $indir_path) || die "Unable to access file at: $indir_path $!\n";
@files = readdir(DIR);

# process each input file within the indir_path in turn
foreach $infile (@files){

  # ignore hidden files and only get those ending .sgr
  if (($infile !~ /^\.+/) && ($infile =~ /.*\.sgr/)){

    # define outfile name from infile name
    $outfile = substr($infile,0,-4)."_peak_t".$thresh;
    $outfile .= '.sgr';

    # print out some useful info
    print ("\nprocessing '".$infile."'\n");

    open(IN, "$indir_path/$infile") || die "Unable to open $infile: $!";

    # define three new arrays to store required values from infile
    my (@chr, @bins, @freq);

    # loop through infile to get values
    while(<IN>){

      # split line by delimiter and store elements in an array
      my @line = split('\t',$_);

      # store the columns we want in three new arrays
      push(@chr,$line[0]);
      push(@bins,$line[1]);
      push(@freq,$line[2]);

    }

    # close in file handle
    close(IN);

    # store size of array
    my $size = @freq;

    # try and open output file
    open(OUT,"> $outdir_path/$outfile") || die "Unable to open $outfile: $!";

    # need a variable to store line count
    my $count = 0;

    # this calls the peaks - giving an x-axis bin value ONLY for the peak
    # centre and a y-axis value as the peak height scaled to some value
    # proportionate to relative read depth for a relevant pair-wise
    # comparison. The logic here is the most simple definition of a "peak".
    # You can fiddle here to make the rules stricter.
    while ($count < $size){

      if (($freq[$count]>=$freq[$count-1]) &&
          ($freq[$count]>=$freq[$count+1]) &&
          ($freq[$count]*$scale_factor>=$thresh)){

        print(OUT $chr[$count]."\t".
                  $bins[$count]."\t".
                  round($freq[$count]*$scale_factor)."\n");
        $count++;
      }
      else{
        $count++;
      }
    }

    # close out file handle
    close(OUT);
  }
}

Script S2:

#!/usr/bin/perl
# Written: Nick Kent, 12th Sept 2010
# Last updated: Nick Kent, 19th Apr 2012
# USAGE:- perl SiteWriter_CFD.plx
#
# FUNCTION:
# This script takes .txt files containing a list of sites/genomic features
# (these could be TSSs or TF sites or whatever you want) and compares it with
# whole-genome, Partn.sgr files. It then outputs CUMULATIVE FREQUENCY
# DISTRIBUTION values over a user-specified bin range centered on, and
# surrounding, the sites. The output file can be used to plot average chromatin
# particle environments for different sorts of TSS, for example. Sites close to
# chromosome ends, which would not yield the full range of data, are ignored,
# but reported at the command line.
#
# INPUT AND OUTPUT (all tab-delimited):
# The input .txt files should have four columns:
#   chrn; Site ID; site dyad pos; strand.
# The input .sgr files should have three columns:
#   chrn; bin pos; paired-read dyad freq.
# The output .txt file has an input file header and column headers and returns
# 5 columns:
#   Bin (relative to Site); F strand cumulative freq; R strand cumulative freq;
#   summed F+R cumulative freq; normalised cumulative freq.
# The idea is to plot the first and last columns as a line graph to produce a
# TREND GRAPH for the data. Each bin's F+R frequencies are normalised to the
# average F+R frequency for the entire bin window.
#
# Note: Use multiple .sgrs and then cat the CFD .txt files for processing in R
# or Excel - particularly useful for plotting surface landscape graphs.
#
# Note: The script handles F and R strand data separately. If you give it all F
# (or all R) strand sites it will work just fine, however, it will also throw a
# load of uninitialised variable warnings at the command line. If you find this
# upsetting, stick a # in front of use warnings (below)

#
# For development see: Kent et al., (2011) Chromatin particle spectrum
# analysis: a method for comparative chromatin structure analysis using
# paired-end mode next-generation DNA sequencing. NAR 39: e26.

use strict;
use warnings;
use Cwd;
use List::Util;

# SET THE VARIABLES BELOW AS REQUIRED
# $sgr_indir_path    - The directory containing the full genome Partn.sgr files
# $siteid_indir_path - The directory containing the site list .txt file
# $outdir_path       - The directory to store the output files
# $bin_window        - number of bins surrounding the site of interest. E.g. if
#                      you set this to 40 then you will get 40 bins either side
#                      of your site - 400bp if you were using 10bp binned data.
# $bin_size          - binning interval of .sgr file in base pairs.
# $output_scale      - controls how many bins are included in the output file.
#                      If set to 1 you will get every bin (use this). Set to 3
#                      to output only every third bin in the series. You can use
#                      this feature to scale output files derived from input
#                      .sgr data with different bin intervals.

my $sgr_indir_path ="/Sgr_in";
my $siteid_indir_path ="/Site_in";
my $outdir_path ="/CFD_out";
my $bin_window = 40;
my $bin_size = 10;
my $output_scale = 1;

# MAIN PROGRAM

# define some variables
my $cwd = getcwd;
my $infile_sgr;
my $infile_siteid;
my $cfd_outfile;
my $sgr_size;
my $F_siteID_size;
my $R_siteID_size;
my %bin_map;
my $chr_count;
my $descriptor;

# Get site list and write to an array - from .txt format with four columns:
# chrn; siteid; site position; F/R

# store input file name in an array
opendir(DIR,$siteid_indir_path) || die "Unable to access file at: $siteid_indir_path $!\n";
my @files_siteid = readdir(DIR);

# process the input file within siteid_indir_path
foreach $infile_siteid (@files_siteid){

  # ignore hidden files and only get those ending .txt
  if (($infile_siteid !~ /^\.+/) && ($infile_siteid =~ /.*\.txt/)){

    $descriptor = substr($infile_siteid,0, -4);

    print "Found, and processing, $infile_siteid \n";

    open(IN, "$siteid_indir_path/$infile_siteid") || die "Unable to open $infile_siteid: $!";

    # define strand-specific arrays to store site chromosome no., and position
    my (@F_site_chr, @F_site_pos, @R_site_chr, @R_site_pos);

    # loop through infile to get values
    while(<IN>){

      chomp;

      # split line by delimiter and store elements in an array
      my @line_siteid = split('\t',$_);

      # store the required chrn, position in two pairs of strand-specific arrays
      if($line_siteid[3] =~ "F"){ # infile if 1
        push(@F_site_chr,$line_siteid[0]);
        push(@F_site_pos,$line_siteid[2]);
      }
      elsif($line_siteid[3] =~ "R"){
        push(@R_site_chr,$line_siteid[0]);
        push(@R_site_pos,$line_siteid[2]);
      }
      else{
        print "Failed to match strand at $line_siteid[0], $line_siteid[1], $line_siteid[2]\n";
      } # infile if 1 closer
    }

    # close in file handle
    close(IN);
    closedir(DIR);

    # store sizes of the arrays
    $F_siteID_size = @F_site_chr;
    $R_siteID_size = @R_site_chr;

    print "Contains: $F_siteID_size forward strand site IDs; $R_siteID_size reverse strand site IDs\n";

    # Read in the .sgr file values to three enormous arrays

    opendir(DIR,$sgr_indir_path) || die "Unable to access file at: $sgr_indir_path $!\n";
    my @files_sgr = readdir(DIR);

    # process the input file within sgr_indir_path
    foreach $infile_sgr (@files_sgr){

      # define some arrays that will be reset during each iteration
      my (@F_cfd_freqsum, @R_cfd_freqsum);

      # ignore hidden files and only get those ending .sgr
      if (($infile_sgr !~ /^\.+/) && ($infile_sgr =~ /.*\.sgr/)){

        print "Found, and processing, $infile_sgr \n";

        open(IN, "$sgr_indir_path/$infile_sgr") || die "Unable to open $infile_sgr: $!";

        # define three new arrays to store the .sgr values from infile
        my (@sgr_chr, @sgr_bin, @sgr_freq);

        # loop through infile to get values
        while(<IN>){

          chomp;

          # split line by delimiter and store elements in an array
          my @line_sgr = split('\t',$_);

          # store the columns we want in the three new arrays
          push(@sgr_chr,$line_sgr[0]);
          push(@sgr_bin,$line_sgr[1]);
          push(@sgr_freq,$line_sgr[2]);
        }

        # close in file handle
        close(IN);

        # store size of bin array
        $sgr_size = @sgr_bin;

        print "Contains a whopping: $sgr_size bin values\n";

        # BUILD THE BIN MAP

        my $map_count = 0; # a counter variable

        # Set bottom
        $bin_map{$sgr_chr[$map_count]} = 0;
        $map_count ++;

        # scan through array and mark the bins where each new chromosome starts
        until ($map_count == $sgr_size){

          if ($sgr_chr[$map_count] ne $sgr_chr[$map_count-1]){
            $bin_map{$sgr_chr[$map_count]} = $map_count;
            $map_count ++;
          }
          else{
            $map_count ++;
          }
        }

        # output the number of chromosome types found as the number of hash keys.
        $chr_count = keys %bin_map;
        print "The sgr file contains values for: $chr_count chromosomes\n";

        # FORWARD STRAND .sgr calculations:

        # some counter variables
        my $site_count = 0; # Counter for each site ID
        my $bin_count = 0;  # Counter .sgr bin numbers
        my $cfd_count = 0;  # Counter for the cfd arrays
        my $top_limit = 0;  # A top limit for $bin_window
        my @F_out_chr;      # F .sgr output array for chr
        my @F_out_bin;      # F .sgr output array for bin pos
        my @F_out_freq;     # F .sgr output array for read freq
        my $F_out_size = 0; # Size of F .sgr output arrays
        my $i=0;            # An iterator variable

        until ($site_count == $F_siteID_size){ # until 1

          # Use %bin_map to jump to correct region of sgr arrays

          $bin_count = (int($F_site_pos[$site_count]/$bin_size) +
                        $bin_map{$F_site_chr[$site_count]}) - 3;

          # this looks mad, but it allows me to recycle all the code from the
          # last version, and takes up any rounding slack which would come from
          # different $bin_size values

          # find an .sgr bin which contains the current site
          until ($F_site_chr[$site_count] eq $sgr_chr[$bin_count] &&
                 $F_site_pos[$site_count] >= $sgr_bin[$bin_count] &&
                 $F_site_pos[$site_count] < $sgr_bin[$bin_count +1]){ # until 2

            $bin_count ++;
          } # until 2 closer

          # now that we've found the match, let's write values to the output files

          # set the bin_counter BACK $bin_window places and set the $top_limit
          $bin_count -= $bin_window;
          $top_limit = $bin_count + ($bin_window*2);

          # Better test to see if match is close to ends of a chromosome. If so,
          # the reported bins and read freqs will be chimeric - we don't want
          # this so we will ditch such matches
          if($F_site_chr[$site_count] ne $sgr_chr[$bin_count] ||
             $F_site_chr[$site_count] ne $sgr_chr[$top_limit]){ # if 1

            print "Can't output forward strand values for $F_site_chr[$site_count] site: $F_site_pos[$site_count]\n";
          } # if 1 closer
          else { # else 1

            # Push the chrn, bin and freq values to the F .sgr arrays and add
            # values to F cfd freq array
            until ($bin_count == $top_limit+1){ # until 3

              push (@F_out_chr,$sgr_chr[$bin_count]);
              push (@F_out_bin,$sgr_bin[$bin_count]);
              push (@F_out_freq,$sgr_freq[$bin_count]);
              $F_cfd_freqsum[$cfd_count] += $sgr_freq[$bin_count];

              $bin_count ++;
              $cfd_count ++;
            } # until 3 closer
          } # else 1 closer

          $cfd_count = 0;
          $bin_count = 0;
          $site_count ++;
        } # until 1 closer

        $F_out_size = @F_out_chr;

        # REVERSE STRAND .sgr calculations:

        # reset the counter variables and define some more arrays
        $site_count = 0;    # Counter for each site ID
        $cfd_count = 0;     # Counter for the cfd arrays
        $bin_count = 0;
        my @R_out_chr;      # R .sgr output array for chr
        my @R_out_bin;      # R .sgr output array for bin pos
        my @R_out_freq;     # R .sgr output array for read freq
        my $R_out_size = 0; # Size of R .sgr output arrays

        until ($site_count == $R_siteID_size){ # until 1

          # Use %bin_map to jump to correct region of sgr arrays
          $bin_count = (int($R_site_pos[$site_count]/$bin_size) +
                        $bin_map{$R_site_chr[$site_count]}) - 3;

          # find an .sgr bin which contains the current site
          until ($R_site_chr[$site_count] eq $sgr_chr[$bin_count] &&
                 $R_site_pos[$site_count] >= $sgr_bin[$bin_count] &&
                 $R_site_pos[$site_count] < $sgr_bin[$bin_count +1]){ # until 2

            $bin_count ++;
          } # until 2 closer

          # now that we've found the match, let's write values to the output files

          # set the bin_counter BACK $bin_window places and set the $top_limit
          $bin_count -= $bin_window;
          $top_limit = $bin_count + ($bin_window*2);

          # Better test to see if match is close to ends of a chromosome. If so,
          # the reported bins and read freqs will be chimeric - we don't want
          # this so we will ditch such matches
          if($R_site_chr[$site_count] ne $sgr_chr[$bin_count] ||
             $R_site_chr[$site_count] ne $sgr_chr[$top_limit]){ # if 1

            print "Can't output reverse strand values for $R_site_chr[$site_count] site: $R_site_pos[$site_count]\n";
          } # if 1 closer
          else { # else 1

            # Push the chrn, bin and freq values to the R .sgr arrays and add
            # values to R cfd freq array
            until ($bin_count == $top_limit+1){ # until 3

              push (@R_out_chr,$sgr_chr[$bin_count]);
              push (@R_out_bin,$sgr_bin[$bin_count]);
              push (@R_out_freq,$sgr_freq[$bin_count]);
              $R_cfd_freqsum[$cfd_count] += $sgr_freq[$bin_count];
              $bin_count ++;
              $cfd_count ++;
            } # until 3 closer
          } # else 1 closer

          $cfd_count = 0;
          $bin_count = 0;
          $site_count ++;
        } # until 1 closer

        $R_out_size = @R_out_chr;

        # The output file

        # define outfile name and set correct endings
        $cfd_outfile = substr($infile_sgr,0,-4)."_".$descriptor."_cfd";

        $cfd_outfile .= '.txt';

        # try and open the .cfd output file
        open(OUT,"> $outdir_path/$cfd_outfile") || die "Unable to open $cfd_outfile: $!";

        print "Have just created $cfd_outfile\n";

        # Set counter variables and define new arrays
        $bin_count = 0;
        $cfd_count = 0;
        my $cfd_sum = 0;     # a sum of sums for normalizing the data
        my $norm_factor = 0; # calced from $cfd_sum
        my $R_cfd_count = $bin_window*2;
        my @FandR_cfd;       # array to hold summed F and R strand CFD values
        my @R_cfd;           # array to hold ordered R strand CFD values

        $bin_count -= $bin_window;

        until ($bin_count == $bin_window+1){ # until 4

          # re-order reverse strand cfd freqsum values
          push (@R_cfd, $R_cfd_freqsum[$R_cfd_count]);

          # calculate summed value for both F and R cfd freqsums
          push (@FandR_cfd, $F_cfd_freqsum[$cfd_count] + $R_cfd_freqsum[$R_cfd_count]);

          $bin_count ++;
          $cfd_count ++;
          $R_cfd_count --;
        } # until 4 closer

        # Need to find average read values over bin_window to normalize data
        $cfd_sum += $_ for @FandR_cfd;
        $norm_factor = $cfd_sum/(($bin_window*2)+1);

        # reset counters once more
        $bin_count = (0-$bin_window);
        $cfd_count = 0;

        # print a header for the CFD .txt file so you can read it in Excel
        print (OUT "Values from $cfd_outfile\n");
        print (OUT "CFD sum: $cfd_sum\n");
        print (OUT "Normalization Factor: $norm_factor\n");

        # print column headers

        print (OUT "Bin"."\t"."F Freq"."\t"."R Freq"."\t"."Comb Freq"."\t"."Norm Freq"."\n");

        # print data values
        until ($bin_count == $bin_window+1){ # until 5

          print(OUT $bin_count*$bin_size."\t".
                    $F_cfd_freqsum[$cfd_count]."\t".
                    $R_cfd[$cfd_count]."\t".
                    $FandR_cfd[$cfd_count]."\t".
                    $FandR_cfd[$cfd_count]/$norm_factor."\n");

          $bin_count += $output_scale;
          $cfd_count += $output_scale;
        } # until 5 closer

        # close .cfd out file handle
        close(OUT);
      }
    }
  }
}
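
The last stage of Script S2 reverses the R strand sums, adds them to the F strand sums bin by bin, and divides each combined value by the mean combined value over the window. The following is a minimal sketch of that arithmetic on toy data, not the published code: the helper name normalise_cfd, its argument layout, and the choice to treat missing values as zero are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the published script. Takes references to
# the F and R cumulative sums (same length, in array order) and returns the
# combined values, the normalised values and the normalisation factor.
sub normalise_cfd {
    my ($f_sums, $r_sums) = @_;
    my $n = @$f_sums;
    die "Empty window\n" unless $n;
    die "F and R arrays must be the same length\n" unless $n == @$r_sums;

    # Read the R strand sums back to front so both strands share the same
    # orientation relative to the site.
    my @r_ordered = reverse @$r_sums;

    # Sum the two strands bin by bin (assumption: missing values count as 0).
    my @combined = map { ($f_sums->[$_] // 0) + ($r_ordered[$_] // 0) } 0 .. $n - 1;

    # The normalisation factor is the mean combined value over the window.
    my $total = 0;
    $total += $_ for @combined;
    my $norm_factor = $total / $n;
    die "Cannot normalise: window sums to zero\n" if $norm_factor == 0;

    my @normalised = map { $_ / $norm_factor } @combined;
    return (\@combined, \@normalised, $norm_factor);
}

# Toy data: a window of 2 bins either side of the site (5 bins in total).
my @f = (1, 2, 6, 2, 1);
my @r = (0, 1, 3, 4, 2);
my ($combined, $normalised, $norm) = normalise_cfd(\@f, \@r);
printf "%d\t%.3f\n", $combined->[$_], $normalised->[$_] for 0 .. $#f;
```

With this toy input the combined values are 3, 6, 9, 3, 1 and the normalisation factor is 4.4, so a normalised value above 1 marks a bin with more signal than the window average.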

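
The peak rule in Script S1 compares each bin with both of its neighbours. At the first bin, $freq[$count-1] is a negative index and so reads the last element of the array; at the last bin, $freq[$count+1] is undefined and triggers a warning. The sketch below shows one way to apply the same rule while skipping the two edge bins. It is an illustration under stated assumptions, not the published method: the helper name call_peaks is hypothetical, and floor($x + 0.5) from the core POSIX module stands in for Math::Round's round, which gives the same result for non-negative values.

```perl
use strict;
use warnings;
use POSIX qw(floor);

# Hypothetical helper, not part of the published script. Returns a list of
# [index, scaled height] pairs for bins that satisfy the S1 peak rule.
sub call_peaks {
    my ($freq, $thresh, $scale_factor) = @_;
    my @peaks;

    # Start at 1 and stop one short of the end so both neighbours exist.
    for my $i (1 .. $#$freq - 1) {
        my $scaled = $freq->[$i] * $scale_factor;

        # Same three tests as S1: not lower than either neighbour, and the
        # scaled value reaches the noise threshold.
        next unless $freq->[$i] >= $freq->[$i - 1];
        next unless $freq->[$i] >= $freq->[$i + 1];
        next unless $scaled >= $thresh;

        # Round half up (valid here because read counts are non-negative).
        push @peaks, [ $i, floor($scaled + 0.5) ];
    }
    return @peaks;
}

my @freq = (3, 12, 15, 15, 4, 9, 20, 7);
for my $peak (call_peaks(\@freq, 10, 1.00)) {
    print "$peak->[0]\t$peak->[1]\n";
}
```

On this toy input the rule reports indices 2, 3 and 6. Both bins of the 15/15 plateau are called because the comparison is >= on each side, which is the kind of rule the script's comment suggests tightening if stricter peaks are wanted.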