
pygenbank Documentation
Release 0.0.1
Matthieu Bruneaux
February 06, 2017

Contents

1 Description
2 Contents
    2.1 Installation
    2.2 genbank module
    2.3 Command-line scripts
3 Indices and tables
Python Module Index

CHAPTER 1 Description

This Python module provides light wrapper functions around Biopython tools to offer a simplified interface to NCBI's GenBank server. These functions are accessible from the genbank module. In addition, a few scripts that can be run from the command line are provided.


CHAPTER 2 Contents

2.1 Installation

2.1.1 Easy install with pip

Local install (no sudo rights, recommended)

To install the module and the command line tools without sudo rights, type (thanks to this post from Kaz'hack for installing without sudo rights):

    pip install --user --upgrade git+https://github.com/matthieu-bruneaux/pygenbank

This will install the executables in ~/.local/bin/ and the Python module in ~/.local/lib/python2.7/site-packages/.

To run the executables from the command line, you might need to add ~/.local/bin/ to the $PATH variable in your ~/.bashrc file:

    # Lines to add in your ~/.bashrc file, if needed
    PATH=$PATH:~/.local/bin
    export PATH

System-wide install (with sudo rights, use with care!)

If you want to install the module and the command line tools system-wide (i.e. accessible to every user), you can run pip with sudo rights (this is not the recommended way to install!):

    sudo pip install --upgrade git+https://github.com/matthieu-bruneaux/pygenbank

Test installation

You can test whether the installation worked with:

    pygenbank-search -h
    pygenbank-extract-cds -h

and from Python:

    import genbank as gb
    dir(gb)
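The two checks above can also be scripted. The following is a minimal sketch and is not part of pygenbank: the helper name check_install is hypothetical, and it assumes a Python 3 interpreter (importlib.util.find_spec and shutil.which are not available in Python 2.7).

```python
import importlib.util
import shutil


def check_install(module_name, script_names):
    """Report whether a module is importable and whether scripts are on PATH.

    Hypothetical helper for illustration; not provided by pygenbank.
    """
    report = {
        # find_spec returns None when a top-level module cannot be located
        "module": importlib.util.find_spec(module_name) is not None,
        "scripts": {},
    }
    for name in script_names:
        # shutil.which returns None when no executable of that name is on PATH
        report["scripts"][name] = shutil.which(name) is not None
    return report


if __name__ == "__main__":
    print(check_install("genbank", ["pygenbank-search", "pygenbank-extract-cds"]))
```

If a script is reported as missing while the module is found, the most likely cause is that ~/.local/bin is not in $PATH, as described above.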

Uninstall

To remove the module and the command line tools, type:

    pip uninstall pygenbank -y

if pygenbank was installed without sudo rights. If it was installed with sudo rights, you will need to use:

    sudo pip uninstall pygenbank -y

2.1.2 Installation from a cloned repository

First, clone the GitHub repository:

    git clone https://github.com/matthieu-bruneaux/pygenbank
    cd pygenbank

A Makefile is provided in the Python project folder. Type make to get a short summary of the available targets.

To install the module:

    make install

This will install the module for the current user (using pip install --user -e .). See above how to add ~/.local/bin to the PATH variable if the command line tools are not accessible after this install.

To remove the module:

    make uninstall

You can also run some tests and regenerate the documentation:

    make tests
    make doc

2.2 genbank module

2.2.1 Description

genbank provides some light wrappers around Biopython functions to download records from the GenBank server. The module can be used from within Python, or through the command line tool pygenbank.

2.2.2 Tutorial

This is a simple tutorial to learn how to use the genbank module.

Setup the environment

    import genbank as gb
    # Setup your email address for Entrez
    gb.Entrez.email = "yourname@youraddress"

Search GenBank and retrieve record ids

Performing a GenBank search is as simple as:

    # Perform a GenBank search
    mysearch = gb.search(term = "hemocyanin")

The returned value is a Bio.Entrez.Parser.DictionaryElement which contains information about the returned results:

    mysearch.keys()               # Available information
    mysearch["Count"]             # Number of entries found
    mysearch["QueryTranslation"]  # How the query was understood by GenBank
    mysearch["IdList"]            # List of GenBank identifiers returned

Any query string that you would use on the GenBank web page can be used as the term:

    mysearch = gb.search(term = "hemocyanin AND lito* [ORGN]")
    mysearch["Count"]
    mysearch = gb.search(term = "citrate synthase AND mus m* [ORGN]")

To get more details about the returned entries, you can fetch the record summaries using the previous search result:

    summaries = gb.getdocsum(mysearch)

The summaries can then be used to apply some simple filtering on the record ids before proceeding to the actual record downloading:

    # Get the summaries from the results
    summaries = gb.getdocsum(mysearch)
    # Extract the ids of interest ("Gi" field)
    myid = [x["Gi"] for x in summaries if int(x["Length"]) < 10000]
    len(myid)

Download the GenBank records

    # Download the GenBank records
    gb.downloadrecords(idlist = myid, destdir = ".", batchsize = 20)
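The batchsize argument in the call above means that ids are requested in groups, with a waiting delay between groups (see downloadrecords() in the function reference). The sketch below illustrates that batching idea only; it is not pygenbank's implementation. The names iter_batches and download_in_batches are hypothetical, and the fetch argument stands in for whatever function actually retrieves one batch.

```python
import time


def iter_batches(ids, batchsize):
    """Yield successive slices of `ids`, each holding at most `batchsize` items."""
    if batchsize < 1:
        raise ValueError("batchsize must be a positive integer")
    for start in range(0, len(ids), batchsize):
        yield ids[start:start + batchsize]


def download_in_batches(ids, batchsize, fetch, delay=30, sleep=time.sleep):
    """Call `fetch` on each batch of ids, pausing `delay` seconds between batches.

    `fetch` is a placeholder callable (an assumption for this sketch); `sleep`
    can be replaced in tests to avoid real waiting. Returns the batch count.
    """
    batches = list(iter_batches(ids, batchsize))
    for index, batch in enumerate(batches):
        fetch(batch)
        # No need to wait after the last batch
        if index < len(batches) - 1:
            sleep(delay)
    return len(batches)


if __name__ == "__main__":
    fake_ids = [str(i) for i in range(5)]
    download_in_batches(fake_ids, 2, fetch=print, delay=0)
```

Keeping batches small and spacing them out is a common way to stay within the usage limits of a remote server such as NCBI's.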

2.2.3 Functions

    search
    downloadrecords
    downloadwgs
    getdocsum
    getdocsumfromid
    writedocsums
    _downloadwgs
    _downloadbatch
    _makewgsurl
    _parsedocsumxml
    _getrecordbatch
    _recordiswgs
    _main_search
    _processargstologic_search
    _getdocsumxml
    _checkretmax
    _makeparser_search
    _checkemailoption
    _filelinestolist
    _processargstologic_extract_cds
    _processoutfmtarg
    _main_extract_cds
    _makeparser_extract_cds
    _getproteinhashfromcds
    _summarizerecord
    _makesummaryforcds

_CDSinfo(CDS, outfmt, fmtdictcds=None, fmtdictrecord=None, parentrecord=None, hashconstructor=None)
    Get some information about a GenBank CDS (stored as a Bio.SeqFeature.SeqFeature object of type CDS).

    Parameters:
      - CDS (Bio.SeqFeature.SeqFeature of type CDS): GenBank CDS
      - outfmt (list of str): List of information keys
      - fmtdictcds (dict): Dictionary mapping the information keys to simple functions to retrieve the corresponding information (for CDS attributes). If None, default is _GB_CDS_FMTDICT.
      - fmtdictrecord (dict): Dictionary mapping the information keys to simple functions to retrieve the corresponding information (for record attributes). If None, default is _GB_RECORD_FMTDICT.
      - parentrecord (Bio.SeqRecord.SeqRecord): Parent record, needed to extract the nucleotide sequence and other record-related information
      - hashconstructor (function): Hash algorithm to be used (from the hashlib module)
    Returns: Dictionary containing the information corresponding to each key in outfmt
    Return type: dict

_checkemailoption(args, stderr=None)
    Check that an email option was provided and set up the Entrez email; produce a message and exit if not.

    Parameters:
      - args (namespace): Output from parser.parse_args()
      - stderr (file): Writable stderr stream (if None, use sys.stderr)
    Returns: Nothing, but sets Entrez.email, or exits the program with a message to stderr if no email was provided

_checkretmax(retmax, stderr)

_downloadbatch(idbatch, destdir, downloadfullwgs=False)
    Download a batch of GenBank records to a destination directory. You should not call this function directly, but rather use downloadrecords() (which itself calls _downloadbatch()) for your downloads.
    _downloadbatch() calls _getrecordbatch() to download data from GenBank, and then takes care of separating individual records and writing them to files.

    Parameters:
      - idbatch (list of str): List of GenBank ids
      - destdir (str): Path to the folder where the records will be saved
      - downloadfullwgs (boolean): If True, also download the full GenBank files corresponding to GenBank records with WGS trace reference
    Returns: Nothing, but saves the GenBank records in the destination folder specified by destdir.

_downloadwgs(wgsurl)
    Download a WGS GenBank file. The output is an uncompressed version of the file.

    Parameters:
      - wgsurl (str): Url to download the gzip file, output from _makewgsurl()
    Returns: Uncompressed GenBank file content, or None if there was a problem during gzip decompression.
    Return type: str

_filelinestolist(filename)
    Simple function to get a list of stripped lines from a file.

    Parameters:
      - filename (str): Path to the file to read
    Returns: A list containing all non-empty, whitespace-stripped lines from the file
    Return type: list of str

_getdocsumxml(searchresult, retmax=None)
    Fetch the document summaries in XML format for the entries from an Entrez.esearch.

    Parameters:
      - searchresult (Bio.Entrez.Parser.DictionaryElement): Object containing the result of an Entrez.esearch, typically the output from search() (or at minimum a dictionary with WebEnv and QueryKey entries)
      - retmax (int): Maximum number of document summaries to get. If no number is given, uses the RetMax element from searchresult.
    Returns: A string containing the summaries in XML format
    Return type: str

_getfullrecord(searchresult, retmax=1)
    Fetch the full GenBank records for the entries from an Entrez.esearch.

    Parameters:
      - searchresult (Bio.Entrez.Parser.DictionaryElement): Object containing the result of an Entrez.esearch, typically the output from search() (or at minimum a dictionary with WebEnv and QueryKey entries).
      - retmax (int): Maximum number of full records to get. Default is 1, which is a safe approach since individual records can sometimes be very large (e.g. chromosomes).
    Returns: A string containing the full records in XML format
    Return type: str

_getproteinhashfromcds(cds, hashconstructor)
    Extract the protein sequence from a Bio.SeqFeature.SeqFeature CDS object and determine its hash value.

    Parameters:
      - cds (Bio.SeqFeature.SeqFeature): CDS of interest
      - hashconstructor (function): Hash algorithm to be used (from the hashlib module)
    Returns: A tuple containing the protein sequence and the hash value. If there is no translation available in the CDS qualifiers, returns NA as the protein sequence. If several translations are available, they are joined with commas.
    Return type: (str, str)

_getrecordbatch(idlist)
    Retrieve the GenBank records for a list of GenBank ids (GIs). This is a relatively low-level function that only gets data from GenBank but does not manage batches or write files. You should use the higher-level wrapper downloadrecords() for your own downloads.

    Parameters:
      - idlist (list of str): List of GenBank ids
    Returns: A string with all the data (can be large if many ids are provided)
    Return type: str

_main_extract_cds(args=None, stdout=None, stderr=None, gb_record_fmtdict=None, gb_cds_fmtdict=None)
    Main function, used by the command line script pygenbank-extract-cds. This function sends the arguments to _processargstologic_extract_cds() to determine which actions must be performed, and then performs the actions.

    Parameters:
      - args (namespace): Namespace with script arguments; parse the command line arguments if None
      - stdout (file): Writable stdout stream (if None, use sys.stdout)
      - stderr (file): Writable stderr stream (if None, use sys.stderr)
      - gb_record_fmtdict (dict): Dictionary mapping outfmt specifiers to functions to extract the corresponding information from GenBank records. If None, default is _GB_RECORD_FMTDICT.
      - gb_cds_fmtdict (dict): Dictionary mapping outfmt specifiers to functions to extract the corresponding information from GenBank CDS. If None, default is _GB_CDS_FMTDICT.
    Returns: None

_main_search(args=None, stdout=None, stderr=None)
    Main function, used by the command line script pygenbank-search. This function sends the arguments to _processargstologic_search() to determine which actions must be performed, and then performs the actions.

    Parameters:
      - args (namespace): Namespace with script arguments; parse the command line arguments if None
      - stdout (file): Writable stdout stream (if None, use sys.stdout)
      - stderr (file): Writable stderr stream (if None, use sys.stderr)
    Returns: None

_makeparser_extract_cds()
    Build the argument parser for the main script pygenbank-extract-cds.

    Returns: An argument parser object ready to be used to parse the command line arguments
    Return type: argparse.ArgumentParser object

_makeparser_search()
    Build the argument parser for the main script pygenbank-search.

    Returns: An argument parser object ready to be used to parse the command line arguments
    Return type: argparse.ArgumentParser object

_makesummaryforcds(record, CDS, hstr, summaryformat, getattrfuncs=None)
    Make a summary for one CDS feature object.

    Parameters:
      - record (Bio.SeqRecord.SeqRecord): Parent record
      - CDS (Bio.SeqFeature.SeqFeature): CDS of interest
      - hstr (str): Hash string for the protein sequence of the CDS
      - summaryformat (list of str): List of attribute descriptors determining the columns of the summary table
      - getattrfuncs (dict of functions): Dictionary mapping the attribute descriptors to functions of the form f(cds, record, hstr). If None (default), use the module-defined GET_ATTR_FUNCS dictionary.
    Returns: Summary string for this CDS
    Return type: str

_makewgsurl(wgsline)
    Prepare the url to download the GenBank records corresponding to one WGS line. The WGS line is the output from _recordiswgs().

    Parameters:
      - wgsline (str): WGS line from a GenBank record, output from _recordiswgs().
    Returns: Url to download the record
    Return type: str

_parsedocsumxml(xmlcontent)
    Parse the document summaries from XML format into a list of dictionaries.

    Parameters:
      - xmlcontent (string): Document summaries in XML format (note: this is a string, not a file name). This is typically the output from _getdocsumxml().

    Returns: A list of dictionaries containing the document summaries, or an empty list if no entry was found.
    Return type: list of dictionaries

_processargstologic_extract_cds(args, stdout, stderr, gb_record_fmtdict, gb_cds_fmtdict)
    Process the command line arguments and determine the action logic for the _main_extract_cds() function.

    Parameters:
      - args (namespace): Argument namespace
      - stdout (file): stdout stream
      - stderr (file): stderr stream
      - gb_record_fmtdict (dict): Dictionary mapping outfmt specifiers to functions to extract the corresponding information from GenBank records
      - gb_cds_fmtdict (dict): Dictionary mapping outfmt specifiers to functions to extract the corresponding information from GenBank CDS
    Returns: The argument namespace given as input in args, with added flags for actions
    Return type: namespace

_processargstologic_search(args, stdout, stderr)
    Process the command line arguments and determine the action logic for the _main_search() function.

    Parameters:
      - args (namespace): Argument namespace
      - stdout (file): stdout stream
      - stderr (file): stderr stream
    Returns: The argument namespace given as input in args, with added flags for actions
    Return type: namespace

_processoutfmtarg(outfmt, stderr, gb_record_fmtdict, gb_cds_fmtdict)
    Check that all outfmt specifiers are allowed and return the split outfmt specifiers.

    Parameters:
      - outfmt (str): String from the command line argument --outfmt
      - stderr (file): stderr stream
      - gb_record_fmtdict (dict): Dictionary mapping outfmt specifiers to functions to extract the corresponding information from GenBank records
      - gb_cds_fmtdict (dict): Dictionary mapping outfmt specifiers to functions to extract the corresponding information from GenBank CDS
    Returns: List of outfmt specifiers ready to pass to _CDSinfo()
    Return type: list

_recordinfo(record, outfmt, fmtdict=None)
    Get some information about a GenBank record (stored as a Bio.SeqRecord.SeqRecord object).

    Parameters:
      - record (Bio.SeqRecord.SeqRecord): GenBank record

      - outfmt (list of str): List of information keys
      - fmtdict (dict): Dictionary mapping the information keys to simple functions to retrieve the corresponding information. If None, use the default _GB_RECORD_FMTDICT
    Returns: List containing the information corresponding to each key in outfmt
    Return type: list

_recordiswgs(recordstr)
    Check if a GenBank record is from a whole genome shotgun project. This is done by searching for the WGS string at the beginning of a line.

    Parameters:
      - recordstr (str): Content of a GenBank record
    Returns: WGS line or False
    Return type: str or False

_summarizerecord(record, summaryformat, hashconstructor, existinghashes={})
    Produce a tabular summary of all the CDS features present in a GenBank record and a dictionary containing hashes of the unique sequences (hash, protein sequence). If a dictionary of pre-existing hashes is given, update this one. Checks for collisions in the hash dictionary.

    Parameters:
      - record (Bio.SeqRecord.SeqRecord): GenBank record (Biopython object)
      - summaryformat (list of str): List of attribute descriptors determining the columns of the summary table
      - hashconstructor (function): Hash algorithm to be used (from the hashlib module)
      - existinghashes (dict): Dictionary (k, v) = (hash, protein sequences). This is updated with the CDS hashes from the input record and checked for collisions.
    Returns: A tuple containing the following objects:
      - string: tabular summary for all CDS features in the GenBank record, ready to be written to an output stream
      - dictionary (hash, protein sequences): dictionary given in input as existinghashes and updated with the hashes from record. This is the dictionary one can pass to another call to _summarizerecord() in order to progressively build a complete dictionary of all hashes for several GenBank records.
      - dictionary (hash, protein sequences): dictionary containing only the new hashes not already present in existinghashes. This is useful if one wants to update an output stream with the unique hashes after each call to _summarizerecord() when processing several records.
    Return type: (str, dict, dict)

downloadrecords(idlist, destdir, batchsize, delay=30, forcedownload=False, downloadfullwgs=False)
    Download the GenBank records for a list of IDs and save them in a destination folder. This is the function to use to download data from GenBank. It applies a waiting delay between each batch download.
    Each record is saved as a file with name id + ".gb". Note that a record is not downloaded if a file with the expected name already exists, except if forcedownload is True.
    The downloading itself is performed by _downloadbatch() and _getrecordbatch().

    Some GenBank records do not contain actual sequence data but a reference to a WGS (whole genome shotgun sequencing) project. For those, setting downloadfullwgs to True is necessary to download another GenBank file with the actual sequence data. Note that if a GenBank record was first downloaded without this option and actually contains a WGS reference, then the forcedownload option must be enabled (or the file must be removed) for the WGS file to be downloaded as well in a new call of this function.

    Parameters:
      - idlist (list of str): List of GenBank ids
      - destdir (str): Path to the folder where the records will be saved
      - forcedownload (boolean): Should records for which a destination file already exists be downloaded anyway? (default: False)
      - downloadfullwgs (boolean): If True, also download the full GenBank files corresponding to GenBank records with WGS trace reference
    Returns: Nothing, but saves the GenBank records in the destination folder specified by destdir.

downloadwgs(gbrecord, destdir)
    Download and save the WGS GenBank file corresponding to a GenBank record with a WGS reference.

    Parameters:
      - gbrecord (str): Text content of a GenBank record with WGS reference
      - destdir (str): Path to the directory where the GenBank file will be saved
    Returns: Nothing, but saves the complete GenBank file corresponding to the WGS reference into the specified folder. The file name is the GI number plus WGS

getdocsum(searchresult, retmax=None)
    Fetch the document summaries for the entries from an Entrez.esearch.

    Parameters:
      - searchresult (Bio.Entrez.Parser.DictionaryElement): Object containing the result of an Entrez.esearch, typically the output from search() (or at minimum a dictionary with WebEnv and QueryKey entries)
      - retmax (int): Maximum number of document summaries to get. If no number is given, uses the RetMax element from searchresult.
    Returns: A list of dictionaries containing the document summaries
    Return type: list of dictionaries

getdocsumfromid(listid, retmax=None)
    Fetch the document summaries from a list of GenBank identifiers.

    Parameters:
      - listid (list of str): A list of GenBank identifiers
      - retmax (int): Maximum number of document summaries to get. If None (default), returns all the document summaries.
    Returns: A list of dictionaries containing the document summaries
    Return type: list of dictionaries

search(term, retmax=None)
    Search GenBank for a given query string. Performs a Bio.Entrez.esearch on db="nuccore", using history.

    Parameters:
      - term (str): Query to submit to GenBank
      - retmax (int): Maximum number of returned ids
    Returns: The result of the search, with detailed information accessible as if it were a Python dictionary. Keys are: "Count", "RetMax", "IdList", "TranslationStack", "QueryTranslation", "TranslationSet", "RetStart", "QueryKey", "WebEnv". This object can be used with other functions of this module to get the actual data (getdocsum() and _getfullrecord()).
    Return type: Bio.Entrez.Parser.DictionaryElement

writedocsums(docsums, handle)
    Write the document summaries in a tabular format to any handle with a write method.

    Parameters:
      - docsums (list of dictionaries or None): A list of dictionaries containing the document summaries, typically the output from _parsedocsumxml(). If it is an empty list, the function will not write anything.
      - handle (open file, or similar to a file handle): Handle object with a write method (e.g. sys.stdout, StringIO object)
    Returns: Nothing, but writes the document summaries in a tabular format to the specified handle.

2.3 Command-line scripts

2.3.1 pygenbank-search

pygenbank-search is a tool to perform searches on GenBank and to retrieve GenBank records, either as document summaries or as full records. The user has to provide an email address for use of the Entrez resource.

Type pygenbank-search --help for detailed usage.

See http://www.ncbi.nlm.nih.gov/books/nbk49540/ for more details on GenBank search queries.

Examples

Please use your own email address as the --email argument.

Search GenBank and retrieve document summaries:

    pygenbank-search --query "hemoglobin AND mammal" --retmax 10000 --email "name@address" > mysearch
    less -S mysearch

Search GenBank and retrieve full records:

    mkdir gbresults # Records will be saved here
    pygenbank-search -q "myoglobin AND sperm whale" -e "name@address" -d -o gbresults

Specify a length range in the GenBank query:

    pygenbank-search -q "carcinus maenas" -e "name@address" -r 10000 > mysearch
    pygenbank-search -q "carcinus maenas AND 1000:100000[SLEN]" -e "name@address" -r 10000 > mysearch

Specify a taxon in the GenBank query:

    MY_QUERY="complete genome AND staphylococcus aureus [PORGN]"
    pygenbank-search -q "$MY_QUERY" -e "name@address" -r 10000 > mysearch

More complex query to get all complete Staphylococcus aureus genomes (up to 10000):

    MY_QUERY="complete genome AND staphylococcus aureus [PORGN] AND 1000000:10000000 [SLEN]"
    echo $MY_QUERY
    pygenbank-search -q "$MY_QUERY" -e "name@address" -r 10000 > mysearch

2.3.2 pygenbank-extract-cds

pygenbank-extract-cds is a tool to extract CDS summaries from GenBank records and, if needed, to produce a fasta file with unique amino-acid sequences (e.g. to prepare a clustering analysis).

Type pygenbank-extract-cds --help for detailed usage.

Examples

Get CDS summaries for all GenBank files in the current directory:

    pygenbank-extract-cds *.gb > mysummaries

2.3.3 How to profile command-line script execution

cprofilev is a convenient tool to visualize the results of a profiling run of a Python script:

    sudo pip install cprofilev

cprofilev can be used to profile the execution of a script this way:

    python -m cprofilev myscript.py [args]

The output is visible at the address http://localhost:4000.

Using cprofilev with the command-line scripts

pygenbank-search and pygenbank-extract-cds use entry points in the genbank.py module, and cannot be called directly with the Python interpreter to use the cprofilev module at the same time (at least I didn't find a way to do it for now).

To solve this problem, a bit of code was added at the end of the genbank.py module to make it callable from the Python interpreter. The module can then be called with:

    python genbank.py search [args]
    python genbank.py extract-cds [args]

where [args] are passed to the corresponding _main_... functions. For example:

    python genbank.py search -q "hemocyanin" > summaries
    python genbank.py extract-cds --help

Note that the full path to genbank.py must be provided (here we assume that the profiling run is launched from within the module folder).

To perform a profiling run:

    python -m cprofilev genbank.py extract-cds -u toto.fasta *.gb > summaries

and then visit http://localhost:4000 with a web browser.
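When a web view is not needed, a similar profiling run can be done with the standard library alone. The following is a minimal sketch using cProfile and pstats instead of cprofilev; the helper name profile_call is hypothetical and not part of pygenbank, and it assumes that the function to profile can be imported and called directly from Python.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func(*args, **kwargs) under cProfile.

    Returns a tuple (result, report) where report is a text table of the
    most expensive calls, sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Limit the report to the 10 entries with the largest cumulative time
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


if __name__ == "__main__":
    # Toy workload standing in for a real function of interest
    total, report = profile_call(sum, range(100000))
    print(total)
    print(report)
```

The text report contains the same kind of per-function timing information that cprofilev displays in the browser.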


CHAPTER 3 Indices and tables

- genindex
- modindex
- search


Python Module Index

g
    genbank

