Package gpart. November 19, 2018

Similar documents
Package snplist. December 11, 2017

Package genotypeeval

Package seqcat. March 25, 2019

Package SNPediaR. April 9, 2019

Package FitHiC. October 16, 2018

Package icnv. R topics documented: March 8, Title Integrated Copy Number Variation detection Version Author Zilu Zhou, Nancy Zhang

Package ChIPseeker. July 12, 2018

Package nngeo. September 29, 2018

Package multihiccompare

Package FunciSNP. November 16, 2018

Package omicplotr. March 15, 2019

Package methylgsa. March 7, 2019

Package ccmap. June 17, 2018

Package SIMLR. December 1, 2017

Package MODA. January 8, 2019

Package geojsonsf. R topics documented: January 11, Type Package Title GeoJSON to Simple Feature Converter Version 1.3.

Package dmrseq. September 14, 2018

Package validara. October 19, 2017

Package Numero. November 24, 2018

Package gggenes. R topics documented: November 7, Title Draw Gene Arrow Maps in 'ggplot2' Version 0.3.2

Package ECctmc. May 1, 2018

Package Organism.dplyr

Package cattonum. R topics documented: May 2, Type Package Version Title Encode Categorical Features

Package semisup. March 10, Version Title Semi-Supervised Mixture Model

Package yarn. July 13, 2018

Package sigqc. June 13, 2018

Package bnbc. R topics documented: April 1, Version 1.4.0

Package kdtools. April 26, 2018

Package haplor. November 1, 2017

Package TMixClust. July 13, 2018

Package fastdummies. January 8, 2018

Package DMCHMM. January 19, 2019

Package compartmap. November 10, Type Package

Package readxl. April 18, 2017

Package MEAL. August 16, 2018

Package metagene. November 1, 2018

Package VSE. March 21, 2016

Package nmslibr. April 14, 2018

Package strat. November 23, 2016

Package dupradar. R topics documented: July 12, Type Package

Package docxtools. July 6, 2018

Package LOLA. December 24, 2018

Package semver. January 6, 2017

Package RTCGAToolbox

Package phantasus. March 8, 2019

Package igc. February 10, 2018

Package MIRA. October 16, 2018

Package bisect. April 16, 2018

Package mimager. March 7, 2019

Package fastrtext. December 10, 2017

Package SEMrushR. November 3, 2018

Package CoGAPS. January 24, 2019

Package BubbleTree. December 28, 2018

Package geneslope. October 26, 2016

Package CTDquerier. March 8, 2019

Package sigmanet. April 23, 2018

Package mixsqp. November 14, 2018

Package QUBIC. September 1, 2018

Package rsppfp. November 20, 2018

Package readobj. November 2, 2017

Package TFutils. October 11, 2018

Package triebeard. August 29, 2016

Package mfa. R topics documented: July 11, 2018

Package EMDomics. November 11, 2018

Package DRIMSeq. December 19, 2018

Package BDMMAcorrect

Package githubinstall

Package methyvim. April 5, 2019

Package detectruns. February 6, 2018

Package BiocNeighbors

Package NFP. November 21, 2016

Package queuecomputer

Package datasets.load

Package GARS. March 17, 2019

Package BASiNET. October 2, 2018

Package tibble. August 22, 2017

Package RNAseqNet. May 16, 2017

Package packcircles. April 28, 2018

Package TnT. November 22, 2017

Package tiler. June 9, 2018

Package ClusterSignificance

ChromHMM: automating chromatin-state discovery and characterization

Package rgreat. June 19, 2018

Package EventPointer

Package rtext. January 23, 2019

Package weco. May 4, 2018

Package rprojroot. January 3, Title Finding Files in Project Subdirectories Version 1.3-2

Package Linnorm. October 12, 2016

Package scmap. March 29, 2019

Package bigreadr. R topics documented: August 13, Version Date Title Read Large Text Files

Package rqt. November 21, 2017

Package CNTools. November 6, 2018

Package scruff. November 6, 2018

Package GreyListChIP

Package climber. R topics documented:

Package muscle. R topics documented: March 7, Type Package

Package edfreader. R topics documented: May 21, 2017

Package canvasxpress

Package mmpa. March 22, 2017

Package cregulome. September 13, 2018

Transcription:

Package gpart November 19, 2018 Title Human genome partitioning of dense sequencing data by identifying haplotype blocks Version 1.0.0 Depends R (>= 3.5.0), grid, Homo.sapiens, TxDb.Hsapiens.UCSC.hg38.knownGene, we provide a new SNP sequence partitioning method which partitions the whole SNP sequence based on not only LD block structures but also gene location information. The LD block construction for GPART is performed using Big-LD algorithm, with additional improvement from previous version reported in Kim et al.(2017). We also add a visualization tool to show the LD heatmap with the information of LD block boundaries and gene locations in the package. License MIT + file LICENSE Encoding UTF-8 LazyData true biocviews Software, Clustering Imports igraph, biomart, Rcpp, data.table, OrganismDbi, AnnotationDbi, grdevices, stats, utils, GenomicRanges, IRanges LinkingTo Rcpp RoxygenNote 6.1.0 Suggests knitr, rmarkdown, BiocStyle, testthat VignetteBuilder knitr git_url https://git.bioconductor.org/packages/gpart git_branch RELEASE_3_8 git_last_commit 1a7b9d8 git_last_commit_date 2018-10-30 Date/Publication 2018-11-18 Author Sun Ah Kim [aut, cre, cph], Yun Joo Yoo [aut, cph] Maintainer Sun Ah Kim <sunnyeesl@gmail.com> 1

2 BigLD R topics documented: BigLD............................................ 2 CLQD............................................ 4 convert2grange...................................... 5 geneinfo........................................... 6 geno............................................. 6 GPART........................................... 7 LDblockHeatmap...................................... 8 SNPinfo........................................... 11 Index 12 BigLD Estimation of LD block regions BigLD returns the estimation of LD block regions of given data. BigLD(geno=NULL, SNPinfo=NULL,genofile=NULL, SNPinfofile=NULL, cutbyforce=null, LD=c("r2", "Dprime"), CLQcut=0.5, clstgap=40000, CLQmode=c("density", "maximal"), leng=200, subtasksize=1500, MAFcut=0.05, appendrare=false, hrsttype=c("near-nonhrst", "fast", "nonhrst"), hrstparam=200, chrn=null, startbp=-inf, endbp=inf) Arguments geno SNPinfo genofile SNPinfofile cutbyforce LD Data frame or matrix of additive genotype data, each column is additive genotype of each SNP. Data frame or matrix of SNPs information. 1st column is rsid and 2nd column is bp position. Character constant; Genotype data file name (supporting format:.txt,.ped,.raw,.traw,.vcf). Character constant; SNPinfo data file name (supporting format:.txt,.map). Data frame; information of SNPs which are forced to be the last SNP LD block boundary. Character constant; LD measure to use, r2 or Dprime. Default "r2". CLQcut Numeric constant; a threshold for the LD measure value r, between 0 to 1. Default 0.5. clstgap Numeric constant; a threshold of physical distance (bp) between two consecutive SNPs which do not belong to the same clique, i.e., if a physical distance between two consecutive SNPs in a clique greater than clstgap, then the algorithm split the cliques satisfying each clique do not contain such consecutive SNPs. Default 40000.

BigLD 3 Value CLQmode leng subtasksize Character constant; the way to give priority among detected cliques. if CLQmode = "density" then the algorithm gives priority to the clique of largest value of (NumberofSNP s)/(rangeofcliqu else if CLQmode = "maximal", then the algorithm gives priority to the largest clique. The default is "density". Numeric constant; the number of SNPs in a preceding and a following region of each sub-region boundary, every SNP in a preceding and every SNP in a following region need to be in weak LD. Default 200. Numeric constant; upper bound of the number of SNPs in a one-take sub-region. Default 1500. MAFcut Numeric constant; the MAF threshold. Default 0.05. appendrare hrsttype hrstparam chrn startbp endbp logical; if TRUE, the function append rare variants with MAF<MAFcut to the constructed blocks. Character constant; heuristic methods. If you want to do not use heuristic algorithm, set hrsttype = "nonhrst". If you want to use heuristic algorithm suggested in Kim et al.,(2017), set hrsttype = "fast". That algorithm is fastest heuristic algorithm and suitable when your memory capacity is not greater than 8GB. If you want to obtain the results similar to the that of non-heuristic algorithm, set hrsttype = "near-nonhrst". Numeric constant; parameter for heuristic algorithm "near-nonhrst". Default is 200. It is recommended that you set the parameter to greater than 150. Numeric(or Character) constant (or vector); chromosome number to use. Numeric constant; starting bp position of the chrn. Default -Inf. Numeric constant; last bp position of the chrn. Default Inf. A data frame of block estimation result. Each row of data frame shows the starting SNP and end SNP of each estimated LD block. Author(s) Sun-Ah Kim <sunny03@snu.ac.kr>, Yun Joo Yoo <yyoo@snu.ac.kr> See Also CLQD, LDblockHeatmap Examples data(geno) data(snpinfo) BigLD(geno[,1:100], SNPinfo[1:100,]) ## Not run: BigLD(geno, SNPinfo, LD = "Dprime") BigLD(geno, SNPinfo, CLQcut = 0.5, clstgap = 40000, leng = 200, subtasksize = 1500) ## End(Not run)

4 CLQD CLQD partitioning into cliques CLQD partitioning the given data into subgroups that contain SNPs which are highly correlated. CLQD(geno, SNPinfo, CLQcut=0.5, clstgap=40000, hrsttype=c("near-nonhrst", "fast", "nonhrst"), hrstparam=200, CLQmode=c("density", "maximal"), LD=c("r2", "Dprime")) Arguments geno SNPinfo Data frame or matrix of additive genotype data, each column is additive genotype of each SNP. (Use data of non-monomorphic SNPs) Data frame or matrix of SNPs information. 1st column is rsid and 2nd column is bp position. CLQcut Numeric constant; a threshold for the LD measure value r, between 0 to 1. Default 0.5. clstgap hrsttype hrstparam CLQmode LD Numeric constant; a threshold of physical distance (bp) between two consecutive SNPs which do not belong to the same clique, i.e., if a physical distance between two consecutive SNPs in a clique greater than clstgap, then the algorithm split the cliques satisfying each clique do not contain such consecutive SNPs. Default 40000. Character constant; heuristic methods. If you want to do not use heuristic algorithm, set hrsttype = "nonhrst". If you want to use heuristic algorithm suggested in Kim et al.,(2017), set hrsttype = "fast". That algorithm is fastest heuristic algorithm and suitable when your memory capacity is not greater than 8GB. If you want to obtain the results similar to the that of non-heuristic algorithm, set hrsttype = "near-nonhrst". Numeric constant; parameter for heuristic algorithm "near-nonhrst". Default is 200. It is recommended that you set the parameter to greater than 150. Character constant; the way to give priority among detected cliques. if CLQmode = "density" then the algorithm gives priority to the clique of largest value of (NumberofSNP s)/(rangeofcliqu else if CLQmode = "maximal", then the algorithm gives priority to the largest clique. The default is "density". Character constant; LD measure to use, "r2" or "Dprime". Default "r2". Value A vector of cluster numbers of all SNPs (NA represents singleton cluster). Author(s) Sun-Ah Kim <sunny03@snu.ac.kr>, Yun Joo Yoo <yyoo@snu.ac.kr>

convert2grange 5 See Also BigLD Examples data(geno) data(snpinfo) CLQD(geno=geno[,1:100],SNPinfo=SNPinfo[100,]) CLQD(geno=geno[,1:100],SNPinfo=SNPinfo[100,], CLQmode = 'maximal') CLQD(geno=geno[,1:100],SNPinfo=SNPinfo[100,], LD='Dprime') convert2grange convert output of BigLD/GPART to "GRangesList" object convert2grange convert a BigLD or GPART output to a data of GRangesList object. convert2grange(blockresult) Arguments blockresult BigLD or GPART output Value GRangeList object including the BigLD or GPART output Author(s) Sun-Ah Kim <sunny03@snu.ac.kr>, Yun Joo Yoo <yyoo@snu.ac.kr> Examples testbigld <- BigLD(geno=geno[,1:100], SNPinfo=SNPinfo[1:100,]) testbigld_grange <- convert2grange(testbigld)

6 geno geneinfo gene information data Format Source This data set gives gene information in chromosome 21. data(geneinfo) A data frame with 736 rows and 4 columns for genename, chromosome name, start bp and end bp of each gene. <http://grch37.ensembl.org/index.html> References Zerbino, Daniel R., et al. "Ensembl 2018." Nucleic acids research 46.D1 (2017): D754-D761. geno genotype data Format Source This data set gives genotype data that are subset of 1000 Genomes Project phase1 release 3 genotype data with 286 individuals from JPT, CHB and CHS populations (1000 Genomes Project Consortium, 2012). The dataset contains 9000 SNPs in chromosome 21 data(geno) A data frame with 286 rows and 9000 columns 1000 genomes project Phase 1 dataset <http://www.internationalgenome.org/> References 1000 Genomes Project Consortium. "An integrated map of genetic variation from 1,092 human genomes." Nature 491.7422 (2012): 56.

GPART 7 GPART Partitioning genodata based on the result obtained by using Big-LD and gene region information. GPART partition the given genodata using the result obtained by Big-LD and gene region information. The algorithm partition the whole sequence into sub sequences of which size do not exceed the given threshold. GPART(geno=NULL, SNPinfo=NULL, geneinfo=null, genofile=null, SNPinfofile=NULL, geneinfofile=null, genedb = c("ensembl","ucsc"), assembly = c("grch38", "GRCh37"), geneid = "hgnc_symbol",ensbversion = NULL, chrn=null, startbp=-inf, endbp=inf, BigLDresult=NULL, minsize=4, maxsize=50, LD=c("r2", "Dprime"), CLQcut=0.5, CLQmode=c("density", "maximal"), MAFcut = 0.05, GPARTmode=c("geneBased", "LDblockBased"), Blockbasedmode=c("onlyBlocks", "usegeneregions")) Arguments geno SNPinfo geneinfo genofile SNPinfofile geneinfofile genedb assembly geneid ensbversion Data frame or matrix of additive genotype data, each column is additive genotype of each SNP. Data frame or matrix of SNPs information. 1st column is rsid and 2nd column is bp position. Data frame or matrix of Gene info data. (1st col : Genename, 2nd col : chromosome, 3rd col : start bp, 4th col : end bp) Character constant; Genotype data file name (supporting format:.txt,.ped,.raw,.traw,.vcf). Character constant; SNPinfo data file name (supporting format:.txt,.map). A Character constant; file containing the gene information (1st col : Genename, 2nd col : chromosome, 3rd col : start bp, 4th col : end bp) A Character constant; database type for gene information. Set "ensembl" to get gene info from "Ensembl", or set "ucsc" to get gene info from "UCSC genome browser" (See package "biomart" or package "homo.sapiens"/"txdb.hsapiens.ucsc.hg38.knownge "TxDb.Hsapiens.UCSC.hg19.knownGene" for details.) A character constant; set "GRCh37" for GRCh37, or set "GRCh38" for GRCh38 A character constant; When you use the gene information by genedb. specity the symbol for gene name to use. default is "hgnc_symbol". (eg. ensembl_gene_id for genedb = "ensembl", "ENTREZID"/"ENSEMBL"/"REFSEQ"/"TXNAME" for genedb="ucsc". See package biomart or package Homo.sapiens for details) a integer constant; you can set the release version of ensembl when you use the gene information by using genedb='emsembl' and assembly='grch38' chrn Numeric(or Character) constant (or vector); chromosome number to use. If NULL(default), we use all chromosome. startbp Numeric constant; starting bp position of the chrn. Default -Inf.

8 LDblockHeatmap then the algorithm gives priority to the clique of largest value of (NumberofSNP s)/(rangeofcliqu endbp Numeric constant; last bp position of the chrn. Default Inf. BigLDresult Data frame; a result obtained by BigLD function. If NULL(default), the GPART function first excute BigLD function to obtain LD blocks estimation result. minsize Numeric constant; the lower bound of number of SNPs in a partition. maxsize Numeric constant; the upper bound of number of SNPs in a partition. LD Character constant; LD measure to use, r2 or Dprime. CLQcut Numeric constant; threshold for the correlation value r, between 0 to 1. CLQmode Character constant; the way to give priority among detected cliques. if CLQmode = "density" else if CLQmode = "maximal", then the algorithm gives priority to the largest clique. The default is "density". MAFcut Numeric constant; the MAF threshold. Default 0.05. GPARTmode Character constant; GPART algorithm methods to use, "genebased" or "LDblockBased". Default is??genebased?? Blockbasedmode Character constant; When you set GPARTmode = "LDblockBased", specify LDblock based method as "onlyblocks"("ldblock based only" algorithm) or "usegeneregions"(ldblock based and also use gene info algorithm). Value GPART returns data frame which contains 9 information of each partition (chromosome, index number of the first SNP and last SNP, rsid of the first SNP and last SNP, basepair position of the first SNP and last SNP, blocksize, Name of a block) Author(s) Sun Ah Kim <sunny03@snu.ac.kr>, Yun Joo Yoo <yyoo@snu.ac.kr> See Also BigLD Examples data(geno) data(snpinfo) data(geneinfo) GPART(geno=geno[,1:100], SNPinfo=SNPinfo[1:100,], geneinfo=geneinfo) LDblockHeatmap visualization of LD block structure LDblockHeatmap visualize the LD structure or LD block results of the inputed data. LDblock- Heatmap <- function(geno=null, SNPinfo=NULL, genofile=null, SNPinfofile=NULL,geneinfo=NULL, geneinfofile = NULL, genedb = c("ensembl","ucsc","file"), assembly = c("grch38", "GRCh37"), geneid = "hgnc_symbol", ensbversion = NULL, chrn=null,startbp=-inf, endbp=inf, blockresult=null, blocktype=c("bigld", "gpart"), minsize=4, maxsize=50, LD=c("r2", "Dprime", "Dpstr"), MAFcut=0.05,CLQcut=0.5, CLQmode=c("density", "maximal"), CLQshow=FALSE, type=c("png", "tif"), filename="heatmap", res=300, onlyheatmap=false)

LDblockHeatmap 9 LDblockHeatmap(geno = NULL, SNPinfo = NULL, genofile = NULL, SNPinfofile = NULL, geneinfo = NULL, geneinfofile = NULL, genedb = c("ensembl", "ucsc", "file"), assembly = c("grch38", "GRCh37"), geneid = "hgnc_symbol", ensbversion = NULL, geneshow = TRUE, chrn = NULL, startbp = -Inf, endbp = Inf, blockresult = NULL, blocktype = c("bigld", "gpart"), minsize = 4, maxsize = 50, LD = c("r2", "Dprime", "Dp-str"), MAFcut = 0.05, CLQcut = 0.5, CLQmode = c("density", "maximal"), CLQshow = FALSE, type = c("png", "tif"), filename = "heatmap", res = 300, onlyheatmap = FALSE) Arguments geno SNPinfo genofile SNPinfofile geneinfo geneinfofile genedb assembly geneid ensbversion geneshow chrn startbp endbp blockresult blocktype minsize Data frame or matrix of additive genotype data, each column is additive genotype of each SNP. Data frame or matrix of SNPs information. 1st column is rsid and 2nd column is bp position. Character constant; Genotype data file name (supporting format:.txt,.ped,.raw,.traw,.vcf). Character constant; SNPinfo data file name (supporting format:.txt,.map). Data frame or matrix of Gene info data. (1st col : Genename, 2nd col : chromosome, 3rd col : start bp, 4th col : end bp) A Character constant; file containing the gene information (1st col : Genename, 2nd col : chromosome, 3rd col : start bp, 4th col : end bp) A Character constant; database type for gene information. Set "ensembl" to get gene info from "Ensembl", or set "ucsc" to get gene info from "UCSC genome browser" (See package "biomart" or package "homo.sapiens"/"txdb.hsapiens.ucsc.hg38.knownge "TxDb.Hsapiens.UCSC.hg19.knownGene" for details.) A character constant; set "GRCh37" for GRCh37, or set "GRCh38" for GRCh38 A character constant; When you use the gene information by genedb. specity the symbol for gene name to use. default is "hgnc_symbol". (eg. ensembl_gene_id for genedb = "ensembl", "ENTREZID"/"ENSEMBL"/"REFSEQ"/"TXNAME" for genedb="ucsc". See package biomart or package Homo.sapiens for details) a integer constant; you can set the release version of ensembl when you use the gene information by using genedb='emsembl' and assembly='grch38' logical; do not show the gene information if geneshow=false. Default is geneshow=true Numeric(or Character) constant ; chromosome number to use. If the data contains more than one chromosome, you need to specify the chromosome to show. Numeric constant; starting bp position of the chrn. Numeric constant; last bp position of the chrn. Data frame; a result obtained by BigLD function or GPART. If NULL(default), the function first excute BigLD orgpart to obtain block estimation result denpending on the blocktype. Character constant; "bigld" for Big-LD or "gpart" for GPART. Default is "gpart". Integer constant; when blockresult=null, blocktype="gpart" specify the threshold for minsize of a block obtained by GPART

10 LDblockHeatmap maxsize LD Integer constant; when blockresult=null, blocktype="gpart" specify the threshold for maxsize of a block obtained by GPART Character constant; LD measure to use, "r2" or "Dprime" or "Dp-str". LD measures for LD heatmap (and BigLD execution when BigLDresult=NULL). When LD = "Dp-str", heatmap shows only two cases, "weak LD or not-informative" and "strong LD". When LD= Dprime heatmap shows the estimated D measures. MAFcut Numeric constant; MAF threshold of SNPs to use. Default 0.05 CLQcut CLQmode CLQshow type filename Numeric constant; threshold for the correlation value r, between 0 to 1. Default 0.5. Character constant; the way to give priority among detected cliques. if CLQmode = "density" then the algorithm gives priority to the clique of largest value of (NumberofSNP s)/(rangeofcliqu else if CLQmode = "maximal", then the algorithm gives priority to the largest clique. Default "density". logical; Show the LD bin structures, i.e. CLQ results, in each LD blocks. Notice that if the region to show includes more than 200 SNPs, the function do now show LD bin structures automatically. Default false. Character constant; file format of output image file. set png for PNG format file, or tif for TIFF format file. Character constant; filename of output image file. Default "heatmap". res Numeric constant; resolution of image. Default 300. onlyheatmap logical; show the LD heatmap without the LD block boundaries. Default false. Value GPART returns data frame which contains 9 information of each partition (chromosome, index number of the first SNP and last SNP, rsid of the first SNP and last SNP, basepair position of the first SNP and last SNP, blocksize, Name of a block) Author(s) Sun-Ah Kim <sunny03@snu.ac.kr>, Yun Joo Yoo <yyoo@snu.ac.kr> See Also BigLD Examples LDblockHeatmap(geno=geno[,1:100], SNPinfo=SNPinfo[1:100,], geneinfo=geneinfo, filename="chr21heatmap")

SNPinfo 11 SNPinfo SNP information data Format Source This data set gives information data of SNPs in geno The dataset contains chromosome name, rsid and bp information of the 9000 SNPs in geno data(snpinfo) A data frame with 9000 rows and 3 columns for chromosome name, rsid and bp 1000 genomes project Phase 1 dataset <http://www.internationalgenome.org/> References 1000 Genomes Project Consortium. "An integrated map of genetic variation from 1,092 human genomes." Nature 491.7422 (2012): 56. @keywords datasets

Index Topic datasets geneinfo, 6 geno, 6 BigLD, 2, 5, 8, 10 CLQD, 3, 4 convert2grange, 5 geneinfo, 6 geno, 6 GPART, 7 LDblockHeatmap, 3, 8 SNPinfo, 11 12