Two Examples of Datanomic. David Du Digital Technology Center Intelligent Storage Consortium University of Minnesota


Datanomic Computing (Autonomic Storage)
- System behavior driven by characteristics of the data and the changing environment
- Automatic optimization to ever-changing data requirements
- Allocate resources according to increases in demand for the data
- Transform data formats to support different applications
- Seamless data access from anywhere at any time
- Location- and context-aware access to data
- Content-based search
- Adaptive performance
- Consistent view of each user's data
- Independent of platforms, operating systems, and data formats
- Exploit active objects and active, intelligent disks
- Solve data explosion and provenance issues

Three Possible Approaches
- Semantic Web: the Web is the key
- Grid Computing: services offered by middleware
- Intelligent Storage Devices: reduce layers by adding features to storage devices

Two Examples
- E2E QoS Provisioning for Network-Attached Storage Systems
- Solutions to the Data Provenance Problem

Motivation of E2E QoS Provisioning
- OSD supports diverse applications
- Different applications require different performance guarantees: bandwidth, response time, throughput
- Objects in OSD carry application semantics
- Objects in OSD have full knowledge of their current storage condition

QoS Challenges in Network-Attached Storage
- QoS requirements come from applications
- Data are accessed from remote storage devices via IP network connections
- How to ensure QoS delivery within storage devices?
- How to ensure QoS over networks?
- How to ensure QoS E2E?

Feedback-Based QoS Control
- Use a controller between clients and the storage server (client side vs. ISP)
- Clients provide performance goals
- A feedback mechanism in the controller compares the measured performance with the expected performance
- The controller throttles user requests if a performance goal is violated

[Figure: A Feedback Control. Labels: Perf Goal, Requests, Diff, Controller, Adjusted Req rate, TCP/IP Network, Storage Server, Computed Performance Measurement]
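The feedback loop above can be sketched in a few lines. This is only an illustration: the class name, the response-time goal, and the adjustment rule (halve the admitted rate on a violation, probe upward by a fixed step otherwise) are assumptions made for the example, not the controller used in the project.

```python
from dataclasses import dataclass


@dataclass
class FeedbackThrottle:
    """Hypothetical controller; the adjustment rule is an assumption."""
    goal_ms: float           # performance goal supplied by the client
    rate: float              # currently admitted request rate (requests/s)
    min_rate: float = 1.0
    max_rate: float = 1000.0
    decrease: float = 0.5    # multiplicative cut when the goal is violated
    increase: float = 10.0   # additive step when the goal is met

    def update(self, measured_ms: float) -> float:
        # "Diff" in the figure: measured performance minus the goal.
        diff = measured_ms - self.goal_ms
        if diff > 0:
            # Goal violated: throttle the user requests.
            self.rate = max(self.min_rate, self.rate * self.decrease)
        else:
            # Goal met: cautiously admit more requests.
            self.rate = min(self.max_rate, self.rate + self.increase)
        return self.rate


ctl = FeedbackThrottle(goal_ms=20.0, rate=100.0)
after_violation = ctl.update(35.0)   # measured 35 ms against a 20 ms goal
after_recovery = ctl.update(15.0)    # measured 15 ms, goal met
```

The controller only needs the measured performance and the client's goal, which is why it can sit either on the client side or at the ISP.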

[Figure: Possible Control Along the End-User Access Data Path. Labels: Client (1), ISP (2), TCP/IP Network, Storage Server (3)]

Storage QoS Control
- Motivation
  - Multimedia applications require guaranteed timely delivery
  - Different applications have different QoS requirements
  - Storage access has a lot of variation
- QoS provisioning
  - QoS-aware disk scheduling to guarantee the real-time requirements

[Figure: A Storage Brick. Labels: Initiator 1 to Initiator n with bandwidths BW 1 to BW n, TCP Network, Target, Gigabit Ethernet (x2), Brick Controller (CPU, Memory), HBA (SATA) (x2), disks d1 to d8]

Challenges in QoS Provisioning in Storage Brick
- Storage bricks often use striping and replication to improve performance
- RAID systems make disk scheduling more difficult and previous algorithms inappropriate
- Client connections have different network bandwidths
- iSCSI has upper-level flow control

Issue 1: Network-Aware Scheduling
- Goal: exploit knowledge of the underlying network conditions to schedule object requests efficiently
- Environment: storage brick attached to the Internet
- Assumptions:
  - Multiple initiators with different bandwidths access the storage brick through iSCSI
  - Each session's bandwidth can be acquired
  - Objects are striped over multiple disks in the brick for performance and load balancing
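As a rough illustration of using per-session bandwidth in scheduling, the sketch below orders pending object requests by how quickly each session's link can drain the data. The function name, the request tuple layout, and the shortest-drain-time-first policy are all assumptions made for this example; the slides do not specify the actual scheduling algorithm.

```python
def network_aware_order(requests, session_bw):
    """Order requests by estimated network drain time (shortest first).

    requests:   list of (request_id, session_id, size_bytes)
    session_bw: dict mapping session_id -> acquired bandwidth in bytes/s
    Both structures are hypothetical, chosen only for illustration.
    """
    def drain_time(request):
        _, session_id, size_bytes = request
        return size_bytes / session_bw[session_id]

    # sorted() is stable, so requests with equal drain time keep arrival order.
    return sorted(requests, key=drain_time)


pending = [
    ("a", "s1", 8_000_000),   # 0.8 s on a 10 MB/s session
    ("b", "s2", 1_000_000),   # 2.0 s on a 0.5 MB/s session
    ("c", "s3", 4_000_000),   # 0.4 s on a 10 MB/s session
]
bandwidth = {"s1": 10_000_000, "s2": 500_000, "s3": 10_000_000}
order = [request_id for request_id, _, _ in network_aware_order(pending, bandwidth)]
```

The point of the sketch is only that disk work can be ordered with the network in mind, so that a slow session does not hold brick buffers while faster sessions wait.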

Issue 2: QoS-Aware Storage Scheduling
- Motivations:
  - Different QoS requirements from different applications
  - Different network bandwidths from different sessions
  - Different RAID configurations in the storage brick
- Objectives:
  - Propose a framework to support different application QoS for different sessions

Scheduling with Object Replicas
- Previous scheduling assumes there is only one copy of the requested object
- An object can have multiple replicas
- Locate the most favorable replica of the requested object
- Schedule disk access on the favorable replica
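A minimal sketch of replica-aware scheduling follows. It assumes that "most favorable" means the replica on the disk with the least pending work; that criterion, the function names, and the data structures are illustrative assumptions, since the slides do not define how favorability is measured.

```python
def pick_replica(replica_disks, pending_bytes):
    """Return the disk holding a replica with the least pending work.

    Ties are broken by disk name so the choice is deterministic.
    """
    return min(replica_disks, key=lambda disk: (pending_bytes.get(disk, 0), disk))


def schedule(requests, replicas, pending_bytes=None):
    """Assign each (object_id, size_bytes) request to one replica's disk.

    replicas: dict mapping object_id -> list of disks holding a copy.
    """
    pending = dict(pending_bytes or {})
    plan = []
    for object_id, size_bytes in requests:
        disk = pick_replica(replicas[object_id], pending)
        pending[disk] = pending.get(disk, 0) + size_bytes  # account for the new work
        plan.append((object_id, disk))
    return plan


replicas = {"o1": ["d1", "d2"], "o2": ["d1", "d3"], "o3": ["d2"]}
plan = schedule([("o1", 100), ("o2", 50), ("o3", 10)], replicas)
```

Here the second request avoids `d1` because the first request already queued work there, which is the effect the replica choice is meant to achieve.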

Issue 3: End-to-End QoS Support
- Storage QoS support only provides guarantees within the storage devices
- The TCP/IP network is a best-effort network; no hard guarantee is provided
- The TCP/IP network is shared by a variety of users, not just storage access users
- Feedback control is not practical given the variety of clients and their diverse distribution
- Integrate network QoS and storage QoS to provide true end-to-end QoS

A List of Related Projects
- iSCSI Performance Study
- iSCSI Simulation Implementation and Study
- Adaptive iSCSI Storage Access
- QoS for iSCSI with OSD Support
- Implementation and Evaluation of DMAPI-Based Data Backup Prototype
- Network-Aware Resource Scheduling
- QoS Support for OSD Implementation

What is data provenance?
- Provenance is a relationship between data objects that explains how a particular object has been derived.
- A workflow of data processes usually explains this relationship.
- Using provenance, a user can trace the workflow that led to the aggregation of processes producing a particular object.
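Tracing provenance amounts to walking a derivation graph backwards from an object. The sketch below assumes a hypothetical `derived_from` mapping from each derived object to the process and inputs that produced it; the three-step chain is made up for illustration and only borrows tool names that appear in the pipeline.

```python
def trace(obj, derived_from):
    """Return the processing steps that led to `obj`, earliest first.

    derived_from: dict mapping object -> (process, [input objects]).
    Objects missing from the mapping are treated as primary data.
    """
    steps, seen = [], set()

    def visit(current):
        if current in seen or current not in derived_from:
            return
        seen.add(current)
        process, inputs = derived_from[current]
        for source in inputs:
            visit(source)           # record upstream steps first
        steps.append((process, tuple(inputs), current))

    visit(obj)
    return steps


# Hypothetical derivation chain, for illustration only.
derived_from = {
    "masked_contig": ("RepeatMasker", ["contig"]),
    "gene_calls": ("Genscan", ["masked_contig"]),
    "annotations": ("BLASTX", ["gene_calls", "protein_db"]),
}
workflow = trace("annotations", derived_from)
```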

EnsEMBL Pipeline (Workflow)
[Figure: pipeline diagram. Recoverable stages and notes:]
- Genomic Sequence Data
  - Regular (daily) addition of new data
  - Occasional updates to existing data
  - Corrections take the form of updates
  - Also, assembly data (partial chromosome locations)
- Download & Import Scripts
- Primary Data: Contig, Clone, Assembly, dna
- Search Targets / Models / Parameter Sets
  - Examples: NCBI NR, PFAM Models
  - Some updates (BLAST targets) are additive
  - Some represent retraining and cannot be easily added (new models for HMMs, contig sets from TIGR)
  - Update frequency currently driven by computational limits
- Update Scripts
- Target Sets
- Preliminary Pipeline
  - Features: Dna_align_feature, protein_align_feature
- External Gene Calls & xrefs
- Gene Calling / Protein Annotations Pipeline
  - EnsEMBL Genes: Transcript, Exon, Gene, Xref, Protein

[Figure: CCGB-DeCIFR annotation pipeline, from GenBank HTG FASTA to the public database. The markers 1, 2a, 2b, and 3 label the suggested provenance development points. Recoverable steps:]
- [1] Nightly download of new phase3 and phase2 (2 ordered pieces at most) HTG sequences from GenBank (FASTA)
- University of Minnesota Mt BAC Registry (MtBR), Young lab
  - Linkage group
  - BAC ordering (to come)
- Check GenBank accession and version numbers against CCGB-DeCIFR contents to avoid duplication
  - If an Acc# is already present in CCGB-DeCIFR with an earlier version#, drop all its analysis results from the database tables
  - For the same Acc#, keep only the latest version to perform analysis on
- Use MtBR linkage group information to assign BAC display to chromosome
- Create pipeline analysis queuing job: SubmitContig, ContigStartState
- CCGB-DeCIFR analysis pipeline
  1. Repeat masker
  2. Genscan (Ath smat) [2b]
  3. Fgenesh (Dicots smat) [2a]
  4. BLASTX vs PIR-NREF (soon to be replaced by UniProt)
  5. BLASTN vs
     - NCBI_dbEST, NCBI_nt, NCBI Mt cdna, NCBI Ath genome, NCBI Lj HTGs
     - TIGR latest unigenes: Arabidopsis thaliana, Lotus japonicus, Glycine max, Medicago truncatula
     - CCGB unigenes: Medicago truncatula, Peanut, Robinia pseudoacacia (Black locust)
- [3] Incremental BLAST: target update, query update
- CCGB DeCIFR private: upon completion of all analyses for a BAC, push that BAC's analysis results to the production database instance (public)
- CCGB DeCIFR public: http://decifr.ccgb.umn.edu

[Figure: project status page, http://decifr.ccgb.umn.edu/medicago_truncatula/project_status, annotated with development points 1, 2b, 2a, and 3. Point 1: nightly download of genomic sequences that are to be put into the pipeline]

Suggestions for test development for provenance using the CCGB-DeCIFR genome annotation pipeline

As the annotation pipeline currently stands, three development points in the pipeline are suggested. The first two are immediately available. The third will be available in the near future: it requires us to write a fair amount of new code, and that project needs to be integrated into our development schedule.

1. Provenance of sequences downloaded from NCBI on a nightly basis
   Every night a cron job is run to check for the NCBI release of new Medicago genome sequences that fit specific criteria. A list of the sequence IDs (Acc# and GI) is made and compared with the content of the CCGB-DeCIFR database. Sequences that are downloaded are:
   - New accessions (that fit the specific criteria)
   - Old accessions with a new GI [sequence updates]
2. Provenance of gene prediction analysis (result features, parameters used, DAS source(?))
   Gene prediction programs may have been trained on different training sets (different research groups: US, EU). Focus on FGENESH (trained for dicots) [2a] and Genscan (trained for Arabidopsis) [2b].
3. Provenance for incremental update of target databases for homology searches [BLAST, HMM]
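The nightly check in point 1, combined with the accession and version rule from the pipeline diagram (keep only the latest version, drop old results when a newer version arrives), could look roughly like the sketch below. The function name, the `ACCESSION.VERSION` string input, and the in-memory `stored` dictionary are assumptions for illustration; the accessions are made up, and the real pipeline compares against database tables.

```python
def plan_nightly_import(downloaded, stored):
    """Decide which sequences to analyze and whose old results to drop.

    downloaded: list of "ACCESSION.VERSION" strings from the nightly list
                (assumes every ID carries a numeric version suffix)
    stored:     dict mapping accession -> version already in the database
    Returns (to_analyze, to_drop).
    """
    latest = {}
    for seq_id in downloaded:
        accession, _, version = seq_id.partition(".")
        version = int(version)
        # For the same accession, keep only the latest version.
        if version > latest.get(accession, 0):
            latest[accession] = version

    to_analyze, to_drop = [], []
    for accession, version in sorted(latest.items()):
        old = stored.get(accession)
        if old is None:
            to_analyze.append(f"{accession}.{version}")    # new accession
        elif version > old:
            to_drop.append(accession)                      # drop stale analysis results
            to_analyze.append(f"{accession}.{version}")    # sequence update
        # Same or older version: already present, skip to avoid duplication.
    return to_analyze, to_drop


# Hypothetical accessions, for illustration only.
nightly = ["AC000001.1", "AC000002.3", "AC000002.2", "AC000003.1"]
in_db = {"AC000002": 2, "AC000003": 1}
to_analyze, to_drop = plan_nightly_import(nightly, in_db)
```

Recording the returned lists each night would already give a basic provenance trail for which sequence versions entered the pipeline and when.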

How to solve data provenance in bioinformatics?
- Workflow of functional genomics
- Data-dependent relationships between data objects
- Analysis tools take several input data objects and a set of parameter values to produce a version of an output data object
- Results and generated knowledge are presented as annotations and fed back to the system

Generalized Black Box for an Analysis Tool
[Figure: black-box diagram. Labels:]
- Input with metadata: any object (target/db/query...)
- Parameters @ analysis instance
- Analysis model with db (gene calling algorithm / matching algorithm / filters / general db search / user scripts...)
- Output with metadata
- Notes: all necessary configuration sets, e.g. version information; includes intermediate data
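One way to read the black-box model is as a record captured each time an analysis tool runs: which inputs it read, which parameters and tool version were used, and which output it produced. The sketch below is an assumption-laden illustration; `AnalysisRecord`, `run_analysis`, and the toy length filter are hypothetical names, not part of the pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalysisRecord:
    """Provenance metadata for one analysis instance (hypothetical layout)."""
    tool: str
    tool_version: str
    input_ids: tuple      # identifiers of the input objects
    parameters: tuple     # sorted (name, value) pairs
    output_id: str


def run_analysis(tool, tool_version, func, inputs, parameters, output_id, log):
    """Run `func` as a black box and append a provenance record to `log`."""
    result = func(inputs, **parameters)
    log.append(AnalysisRecord(
        tool=tool,
        tool_version=tool_version,
        input_ids=tuple(sorted(inputs)),
        parameters=tuple(sorted(parameters.items())),
        output_id=output_id,
    ))
    return result


def length_filter(inputs, min_len):
    """Toy stand-in for an analysis tool: keep sequences of at least min_len."""
    return [name for name, seq in sorted(inputs.items()) if len(seq) >= min_len]


log = []
kept = run_analysis(
    "length_filter", "0.1", length_filter,
    inputs={"contig_1": "ACGTACGT", "contig_2": "AC"},
    parameters={"min_len": 4},
    output_id="filtered_set_1",
    log=log,
)
```

Because the record is stored alongside the output, the provenance travels with the data object as metadata rather than living in a separate system.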

Our Proposed Solution
- Take the intelligent storage approach to demonstrate its power
- Provenance information is part of the metadata or attributes associated with the data
- An infinite number of versions of a data object can exist
- What is an efficient way to store and maintain these many versions?
- How does a change to one object affect the others?
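The last question, how a change to one object affects the others, can be illustrated with a small dependency walk: given which objects each derived object was computed from, find everything downstream of a changed object. The mapping and object names below are hypothetical, and a real system would also have to decide whether to recompute or version the affected objects.

```python
from collections import deque


def affected_by(changed, derived_from):
    """Return every object derived, directly or indirectly, from `changed`.

    derived_from: dict mapping derived object -> list of its input objects.
    """
    # Invert the mapping: input object -> objects computed from it.
    children = {}
    for obj, inputs in derived_from.items():
        for source in inputs:
            children.setdefault(source, []).append(obj)

    stale, queue = set(), deque([changed])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in stale:
                stale.add(child)
                queue.append(child)
    return stale


# Hypothetical dependencies, for illustration only.
derived_from = {
    "masked": ["contig"],
    "genes": ["masked"],
    "hits": ["genes", "target_db"],
}
after_db_update = affected_by("target_db", derived_from)
after_contig_update = affected_by("contig", derived_from)
```

An update to a search target only invalidates the homology hits, while a corrected contig invalidates everything computed from it.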