Textual Description of webbioc

Similar documents
Appendix A GLOSSARY. SYS-ED/ Computer Education Techniques, Inc.

Automatic Dependency Management for Scientific Applications on Clusters. Ben Tovar*, Nicholas Hazekamp, Nathaniel Kremer-Herman, Douglas Thain

Eli System Administration Guide

Outlook. File-System Interface Allocation-Methods Free Space Management

Using Java to Front SAS Software: A Detailed Design for Internet Information Delivery

File Systems Management and Examples

NBIC TechTrack PBS Tutorial

The Grid Monitor. Usage and installation manual. Oxana Smirnova

CHAPTER 2. Troubleshooting CGI Scripts

CS370: System Architecture & Software [Fall 2014] Dept. Of Computer Science, Colorado State University

Installation and Maintenance Instructions for SAS 9.2 Installation Kit for Basic DVD Installations on z/os

CGI Architecture Diagram. Web browser takes response from web server and displays either the received file or error message.

NBIC TechTrack PBS Tutorial. by Marcel Kempenaar, NBIC Bioinformatics Research Support group, University Medical Center Groningen

A file system is a clearly-defined method that the computer's operating system uses to store, catalog, and retrieve files.

Agent Teamwork Research Assistant. Progress Report. Prepared by Solomon Lane

R version has been released on (Linux source code versions)

Chapter 7: File-System

IBM. Bulk Load Utilities Guide. IBM Emptoris Contract Management SaaS

Outline. ASP 2012 Grid School

An Introduction to Unix Power Tools

Getting Started with Linux

Introduction to High-Performance Computing (HPC)

500K Data Analysis Workflow using BRLMM

Table of Contents. Table of Contents Job Manager for remote execution of QuantumATK scripts. A single remote machine

Introduction. Secondary Storage. File concept. File attributes

TECHNICAL NOTE. Technical Note P/N REV A01

The UNIX Time- Sharing System

Installing and Configuring Worldox/Web Mobile

Project 1: Implementing a Shell

Geant4 on Azure using Docker containers

The CartIt Commerce System Installation Guide

Bioconductor tutorial

Site Owners: Cascade Basics. May 2017

DupScout DUPLICATE FILES FINDER

A Hands-On Tutorial: RNA Sequencing Using High-Performance Computing

User Manual. Admin Report Kit for IIS 7 (ARKIIS)

Part Four - Storage Management. Chapter 10: File-System Interface

There is a general need for long-term and shared data storage: Files meet these requirements The file manager or file system within the OS

Dynamic Documents. Kent State University Dept. of Math & Computer Science. CS 4/55231 Internet Engineering. What is a Script?

Linux Operating System

Practical 5. Linux Commands: Working with Files

Step by Step Installation of CentOS Linux 7 and Active Circle

C HAPTER F OUR F OCUS ON THE D ATABASE S TORE

Introduction to High-Performance Computing (HPC)

Technical Intro Part 1

Intellicus Cluster and Load Balancing- Linux. Version: 18.1

Read Source Code the HTML Way

Chapter Two. Lesson A. Objectives. Exploring the UNIX File System and File Security. Understanding Files and Directories

Lecture 5. Essential skills for bioinformatics: Unix/Linux

File System: Interface and Implmentation

DiskSavvy Disk Space Analyzer. DiskSavvy DISK SPACE ANALYZER. User Manual. Version Dec Flexense Ltd.

File Systems. What do we need to know?

To the Cloud and Back: A Distributed Photo Processing Pipeline

3. When you process a largest recent earthquake query, you should print out:

NCSA Security R&D Last Updated: 11/03/2005

Getting started with the CEES Grid

Linux Essentials. Smith, Roderick W. Table of Contents ISBN-13: Introduction xvii. Chapter 1 Selecting an Operating System 1

Backpacking with Code: Software Portability for DHTC Wednesday morning, 9:00 am Christina Koch Research Computing Facilitator

Chapter 11: File-System Interface

Secondary Storage (Chp. 5.4 disk hardware, Chp. 6 File Systems, Tanenbaum)

Simics Installation Guide for Linux/Solaris

GMI-Cmd.exe Reference Manual GMI Command Utility General Management Interface Foundation

Grid Computing Competence Center Large Scale Computing Infrastructures (MINF 4526 HS2011)

CS370: Operating Systems [Spring 2017] Dept. Of Computer Science, Colorado State University

Chapter 10: File-System Interface. Operating System Concepts with Java 8 th Edition

File Management By : Kaushik Vaghani

Lecture 3. Essential skills for bioinformatics: Unix/Linux

Report Commander 2 User Guide

cron How-To How to use cron to Schedule rsync Synchronizations September 29, 2004

Installing Ubuntu 8.04 for use with ESP-r 8 May 2009 Jon W. Hand, ESRU, Glasgow, Scotland

Real-Time Monitoring Configuration Utility

Hadoop On Demand: Configuration Guide

CS3600 SYSTEMS AND NETWORKS

Unix/Linux: History and Philosophy

OpenServer Kernel Personality (OKP)

A Graphical User Interface for Job Submission and Control at RHIC/STAR using PERL/CGI

Master Syndication Gateway V2. User's Manual. Copyright Bontrager Connection LLC

Hands-On Introduction to Queens College Web Sites

Real-Time Monitoring Configuration Utility

HORTICOPIA Professional

Files & I/O. Today. Comp 104: Operating Systems Concepts. Operating System An Abstract View. Files and Filestore Allocation

UoW HPC Quick Start. Information Technology Services University of Wollongong. ( Last updated on October 10, 2011)

Chapter 11: File System Implementation. Objectives

UNIVERSITY OF NEBRASKA AT OMAHA Computer Science 4500/8506 Operating Systems Fall Programming Assignment 1 (updated 9/16/2017)

Chapter 1 - Introduction. September 8, 2016

HarePoint Analytics. For SharePoint. User Manual

W E B M I N : A W E B - B A S E D S Y S T E M A D M I N I S T R AT I O N T O O L F O R U N I X

Bioinformatics Introduction. Sebastian Schmeier

AdventNet ManageEngine OpManager Installation Guide. Table Of Contents INTRODUCTION... 2 INSTALLING OPMANAGER Windows Installation...

Files and the Filesystems. Linux Files

USER MANUAL. SEO Hub TABLE OF CONTENTS. Version: 0.1.1

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "

CSCI 4152/6509 Natural Language Processing Lecture 6: Regular Expressions; Text Processing in Perl

REFERENCE ARCHITECTURE Quantum StorNext and Cloudian HyperStore

DATA STRUCTURES USING C

What is an Operating System? A Whirlwind Tour of Operating Systems. How did OS evolve? How did OS evolve?

INSTALLATION AND CONFIGURATION GUIDE R SOFTWARE for PIPELINE PILOT 2016

UNIT V SECONDARY STORAGE MANAGEMENT

/ Computational Genomics. Normalization

Transcription:

Textual Description of webbioc Colin A. Smith October 13, 2014 Introduction webbioc is a web interface for some of the Bioconductor microarray analysis packages. It is designed to be installed at local sites as a shared bioinformatics resource. Unfortunately, webbioc currently provides only Affymetrix-based analysis. (We would certainly like to collaborate with someone interested in creating a cdna module.) The existing modules provide a workflow that takes users from CEL files to differential expression with multiple hypothesis testing control, and finally to metadata annotation of the gene lists. 1 User Interface Goals This package aims to address a number of issues facing prospective users: ˆ Ease of use. Using the web interface, the user does not need to know how to use either a command line interface or the R language. Depending on the computer savvy of the user, R tends to have a somewhat steep learning curve. webbioc has a very short learning curve and should be usable by any biologist. ˆ Ease of installation. After an initial installation by a system administrator, there is no need to install additional software on user computers. Installing and maintaining an R installation with all the Bioconductor packages can be a daunting task, often best suited to a system administrator. Using webbioc, only one such installation needs to be maintained. ˆ Discoverability. Graphical user interfaces are significantly more discoverable than command line interfaces. That is, a user browsing around a software package is much more likely to discover and use new features if they are graphically presented. Additionally, a unified user interface for the different Bioconductor packages can help show how they can be used together in a data-processing pipeline. Ideally, a user should be able to start using the web interface without reading external documentation. 1

ˆ Documentation. Embedding context-sensitive online help into the interface helps first-time users make good decisions about which statistical approaches to take. Because of its power, Bioconductor includes a myriad of options for analysis. Helping the novice statistician wade through that pool of choices is an important aspect of webbioc. 2 Architecture webbioc is written in a combination of Perl, R, and shell scripts. For most processing, the Perl-driven web interface dynamically creates an R script, which is processed in batch mode. A shell script controls the execution of R and catches any errors that result. The web interface can be configured to execute the shell script in one of two ways. In the single-machine configuration, Perl forks off an auxiliary thread which then runs the script. However, webbioc can also be configured to use PBS (Portable Batch System) to send the job to another computer for processing. If used in a cluster configuration, webbioc depends on having a shared partition between the web server and all the compute nodes. This is typically accomplished with NFS. The shared partition allows the compute nodes to push results directly out to the user without directly interacting with the web server. Microarray analysis is very data intensive and tends to produce large files. Thus special care must be taken to efficiently move that data back and forth between the web-client and the server. webbioc uses two different systems for exchanging data with the client, one for input and the other for output. The Upload Manager handles all files to be processed by Bioconductor. When users start a session with the upload manager, they are granted a unique token that identifies the session. Using that short random string of letters and numbers, they may access their uploaded files from any of the Bioconductor tools. In that way, users must only upload files a single time. Upload manager sessions are meant to be temporary, with any files being automatically deleted after a given amount of time. For output of results, webbioc creates static HTML pages on the fly that contain relevant images or files. Like the upload manager, job results are uniquely identified and are only temporarily stored. Because saving old results could potentially fill up the disk, those too should be automatically purged on a regular basis. The results of one job may need to be fed into the input of another. For instance, expression summary data created by affy needs to be supplied to multtest for detection of differentially expressed genes. To facilitate that data exchange, the web interface will optionally copy results back to the upload manager for processing by other packages. The data interchange format used by webbioc is the standard R data file. By convention, each file contains only one object. Additionally, instead of using the common.rda or.rdata extensions, webbioc will use the class of the stored object as the file extension. This provides a useful abstraction that many computer users have come to 2

understand and expect. The extensions are merely for the benefit of the user as the web interface ignores them. 3 System Requirements ˆ R 1.8.0: webbioc has only been tested with R 1.8.0 and later and there is no guarantee that it will work with any version earlier than that. ˆ Biobase, affy, multtest, annaffy, vsn, gcrma, qvalue: webbioc directly invokes these packages for processing work. Additionally, those packages depend on others that are not listed here but must also be installed. ˆ Bioconductor metadata packages: Much of the computation done by affy, annaffy, and gcrma depends on pre-built metadata packages available on the Bioconductor web site. We recommend that all of the metadata packages be installed so that users are given the maximum flexibility with which chips they may process. As of this writing, a full install of Bioconductor and all data packages uses approximately 1.25 GB of disk space. ˆ Unix: webbioc was written in a Unix environment and depends on many Unix conventions including path names, directory structure, and interprocess communication. It is doubtful whether it will run under Windows and has not been tested as such. However, an ambitious person could port webbioc to Windows. The community would greatly appreciate such an effort. ˆ Perl 5.6: Development and testing has been done with both Perl 5.6 and 5.8. It will likely work with either. Perl 5.6 may however require installation of some modules that do not ship as part of the default installation. webbioc currently uses the following modules: CGI, Digest::MD5, File::stat, IPC::Open3, and POSIX. ˆ Ghostscript: The web interface uses Ghostscript to produce raw graphics. It must be installed. ˆ Netpbm: The Netpbm series of programs handle graphics manipulation. If you install it using the RPMs, make sure to install both netpbm and netpbmprogs. ˆ SGE or PBS (Optional): webbioc has been developed and tested with two batch queueing systems, the Sun Grid Engine and the Portable Batch System. (Only PBS Pro has been tested, although OpenPBS should also work.) If you will be using a batch queueing system, webbioc depends on having a shared filesystem mounted in the same place on the web server as well as all the compute nodes. Please note that these are optional as webbioc can also run jobs by forking to the background. Adding support for another batch queueing system is a fairly trivial matter. Please contact the author if you are interested in using a different system. 3

4 Installation Installing the webbioc files is relatively straightforward. First create a directory within the web server s CGI directory that will hold all the Perl scripts. One might typically call that directory bioconductor. Copy all the files from /R lib dir/webbioc/cgi/ into that directory. Next verify that the permissions are correct on the copied files. Change go to the directory to which you copied the files. Use the command chmod 755 *.cgi to make the CGI scripts executable and the command chmod 644 *.pm to make the support modules readable. You must decide how you want to link the Bioconductor web interface into your existing web site. It was designed to integrate seamlessly into an existing site design. webbioc includes a very rudimentary home page in /R lib dir/webbioc/www/. You may either use this as a starting point or create your own. Please note that you may have to change the links slightly depending on where you place the CGI scripts. The last step of the installation is to create two directories where the web interface can store files. Remember that if using a batch queueing system, these two directories must be shared between the web server and all compute nodes via NFS (or some other file sharing mechanism). If other users have access to any of the machines running the web interface, we recommend setting the permissions of these directories so that only the web sever user can read them. The first directory will be for storing uploaded files. The largest type of file uploaded will probably be CEL files. They are typically around 10 MB each. Therefore, depending on expected server usage, the directory should be kept on a partition with hundreds of megabytes to gigabytes of free space. It can be stored anywhere and does not necessarily have to be web-accessible. The second directory will be used for storing the results of jobs that clients submit. Job results will typically be anywhere from 5 MB to 5 KB. The free space necessary for this directory is again subject to usage. This directory must be accessible via the web server to allow results to be delivered asynchronously. Both directories should be regularly purged of old files. In the future we hope to provide scripts that will help you do that. Until then, you will have to handle that yourself. We would appreciate the contribution of any scripts towards that end. To facilitate installation and updating of metadata packages, webbioc includes a function to download and install every metadata package from the Bioconductor web site. The following will install all metadata packages and update any out-of-date metadata packages. (Change the path name to match your system.) library(webbioc) installreps("/library/install/path") Make sure to run R with a user who has permission to write to your library directory. Depending on your site, you may wish to set up a cron job to execute this code approximately once per month to check for updates or additions to the medadata packages. If 4

packages are already up-to-date, it will not waste bandwidth nor CPU by re-installing them. 5 Configuration Beyond putting the CGI scripts and HTML page in the right places and setting up directories to receive files, all configuration is done through the Site.pm file. The configuration options are discussed here. ˆ UPLOAD DIR This is the directory where uploaded files are stored. You should set this to the absolute path name of the upload directory created earlier. ˆ RESULT DIR This is the directory where the web interface will place all result files. You should set this to the absolute path name of the results directory created earlier. ˆ RESULT URL This is the absolute URL from which the above directory can be accessed through your web server. In almost all cases this is different than the filesystem path. ˆ BIOC URL This is the absolute URL to the CGI scripts installed earlier. ˆ SITE URL This is the site URL including only the domain name. It combined with the above URL to create links within the system. ˆ R BINARY The path to the R executable you wish to use with the web interface. If using PBS, R must be placed in the same location on both the web server and computed nodes. ˆ R LIBS If you store R libraries outside the default location, enter the paths to those directories here as a colon-delimited list. Otherwise leave this as an empty string. ˆ DEBUG Turn on or off debug mode. If debug mode is on, the web interface will leave scripts, output, and other files behind for inspection. Otherwise, those files will be deleted before a job completes. ˆ SH HEADER This option is for appending lines to the top of all shell scripts. It can be very useful for setting environment variables such as PATH. (PATH must include the directories containing standard shell tools, Ghostscript, Netpbm binaries, and sendmail.) Depending on your installation you may have to set the GS LIB variable so Ghostscript can find its libraries and fonts. ˆ BATCH SYSTEM This option selects the type of batch queueing system to use. Currently supported options are fork, sge, and pbs. 5

ˆ BATCH ENV This hash allows you to set environment variables necessary for running the batch queueing system. SGE typically needs SGE ROOT and possibly SGE CELL or COMMD PORT. PBS typically needs PBS HOME, PBS EXEC, and PBS SERVER. ˆ BATCH BIN This specifies the location of the batch queueing system bin directory. ˆ BATCH ARG This specifies any additional arguments to be passed to the job submission program that are necessary for job routing, accounting, etc. It can be left blank. ˆ site header This subroutine is called to produce the site header HTML for all CGI pages. It is implemented as a function to allow you to include other files and/or do any other crazy programming you wish. If you make changes, make sure to set the title of the page to the $title Perl variable. ˆ site footer This subroutine is called to produce the site footer HTML for all CGI pages. It is your responsibility to close off any HTML tags here that you open in site footer. 6