Two Examples of Datanomic
David Du
Digital Technology Center
Intelligent Storage Consortium
University of Minnesota
Datanomic Computing (Autonomic Storage)
- System behavior driven by the characteristics of the data and the changing environment
- Automatic optimization to ever-changing data requirements
  - Allocate resources according to increases in data demand
  - Transform data formats to support different applications
- Seamless data access from anywhere at any time
  - Location- and context-aware access to data
  - Content-based search
  - Adaptive performance
- Consistent view of each user's data
  - Independent of platforms, operating systems, and data formats
- Exploit active objects, active and intelligent disks
- Solve data explosion and provenance issues
Three Possible Approaches
- Semantic Web: the Web is the key
- Grid Computing: services offered by middleware
- Intelligent Storage Devices: reduce layers by adding features to storage devices
Two Examples
- E2E QoS Provisioning for Network-Attached Storage Systems
- Solutions to the Data Provenance Problem
Motivation of E2E QoS Provisioning
- OSD supports diverse applications
- Different applications require different performance guarantees: bandwidth, response time, throughput
- Objects in OSD carry application semantics
- Objects in OSD have full knowledge of their current storage condition
QoS Challenges in Network-Attached Storage
- QoS requirements come from applications
- Data are accessed from remote storage devices via IP network connections
- How to ensure QoS delivery within storage devices?
- How to ensure QoS over networks?
- How to ensure QoS end-to-end (E2E)?
Feedback-Based QoS Control
- Use a controller between the clients and the storage server (client side vs. ISP)
- Clients provide performance goals
- A feedback mechanism in the controller compares the measured performance against the expected performance
- The controller throttles the user requests if a performance goal is violated
A Feedback Control
[Diagram: client requests pass through a controller that compares the computed performance measurement against the performance goal and adjusts the request rate accordingly before the requests traverse the TCP/IP network to the storage server.]
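The control loop in the diagram can be sketched as a simple proportional controller. This is a minimal illustrative sketch, not the actual system: the class name, the latency-based goal, and the gain value are all assumptions made for the example.

```python
class FeedbackThrottle:
    """Proportional feedback throttle (hypothetical sketch): adjusts a
    client's admitted request rate toward a latency goal."""

    def __init__(self, goal_latency_ms, initial_rate, gain=0.5,
                 min_rate=1.0, max_rate=10_000.0):
        self.goal = goal_latency_ms      # performance goal from the client
        self.rate = initial_rate         # current admitted requests/sec
        self.gain = gain                 # proportional gain (assumed value)
        self.min_rate = min_rate
        self.max_rate = max_rate

    def update(self, measured_latency_ms):
        # Positive error => measured latency exceeds the goal => throttle down;
        # negative error => we are beating the goal => admit more requests.
        error = (measured_latency_ms - self.goal) / self.goal
        self.rate *= (1.0 - self.gain * error)
        self.rate = max(self.min_rate, min(self.max_rate, self.rate))
        return self.rate
```

With a 10 ms goal and a measured latency of 20 ms, the controller halves the admitted rate; once the measurement matches the goal, the rate holds steady.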
Possible Control Along the End-User Access Data Path
[Diagram: control can be placed at (1) the clients, (2) the ISPs, or (3) the storage server, along the access path through the TCP/IP network.]
Storage QoS Control
- Motivation
  - Multimedia applications require guaranteed timely delivery
  - Different applications have different QoS requirements
  - Storage access times show a lot of variation
- QoS provisioning
  - QoS-aware disk scheduling to guarantee the real-time requirements
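One common way to honor real-time delivery requirements in disk scheduling is earliest-deadline-first (EDF). The sketch below is illustrative only; the slides do not name a specific algorithm, and the request representation here is an assumption.

```python
import heapq

def edf_schedule(requests):
    """Order disk requests earliest-deadline-first (illustrative sketch).

    `requests` is a list of (deadline_ms, request_id) pairs; the returned
    list gives the dispatch order that serves the tightest deadline first.
    """
    heap = list(requests)
    heapq.heapify(heap)          # min-heap keyed on deadline
    order = []
    while heap:
        _deadline, rid = heapq.heappop(heap)
        order.append(rid)
    return order
```

For example, `edf_schedule([(30, 'a'), (10, 'b'), (20, 'c')])` dispatches `b` first because its 10 ms deadline is the tightest.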
Storage Brick
[Diagram: a storage brick containing a brick controller (CPU and memory), two SATA HBAs attached to disks d1-d8, and a Gigabit Ethernet interface; initiators 1..n access the brick as a target over a TCP network, each with its own bandwidth BW1..BWn.]
Challenges in QoS Provisioning in a Storage Brick
- A storage brick often uses striping and replication to improve performance
- RAID systems make disk scheduling more difficult and render previous algorithms inappropriate
- Client connections have different network bandwidths
- iSCSI has upper-level flow control
Issue 1: Network-Aware Scheduling
- Goal: exploit knowledge of the underlying network conditions to efficiently schedule object requests
- Environment: a storage brick attached to the Internet
- Assumptions:
  - Multiple initiators with different bandwidths access the storage brick through iSCSI
  - Each session's bandwidth can be acquired
  - Objects are striped over multiple disks in the brick for performance and load balancing
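Given the assumption that each session's bandwidth is known, one network-aware policy is to split each dispatch batch among sessions in proportion to their measured bandwidth, so a slow session cannot monopolize the disks with data it cannot drain. This is a hypothetical sketch of such a policy, not the scheduler from the slides.

```python
def bandwidth_proportional_quota(session_bw, batch_size):
    """Split a dispatch batch among iSCSI sessions in proportion to their
    measured network bandwidth (hypothetical policy sketch).

    `session_bw` maps session id -> bandwidth (e.g. Mb/s); returns a map
    of session id -> number of requests to dispatch this round.
    """
    total = sum(session_bw.values())
    quotas = {}
    for sid, bw in session_bw.items():
        # Every session gets at least one slot so none starves.
        quotas[sid] = max(1, round(batch_size * bw / total))
    return quotas
```

A session with 300 Mb/s of the total 400 Mb/s thus receives three quarters of an 8-request batch.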
Issue 2: QoS-Aware Storage Scheduling
- Motivations:
  - Different QoS requirements from different applications
  - Different network bandwidths from different sessions
  - Different RAID configurations in the storage brick
- Objective:
  - Propose a framework to support different application QoS for different sessions
Scheduling with Object Replicas
- Previous scheduling assumes there is only one copy of the requested object
- An object can have multiple replicas
- Locate the most favorable replica of a requested object
- Schedule disk access on that favorable replica
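"Most favorable" can be defined in several ways; one simple heuristic is to pick the replica residing on the disk with the shallowest pending queue. The function below is an illustrative sketch of that heuristic only, with assumed names.

```python
def pick_replica(replica_disks, queue_depth):
    """Pick the most favorable replica of an object: here, the replica on
    the disk with the fewest queued requests (illustrative heuristic).

    `replica_disks` lists the disks holding a replica; `queue_depth`
    maps disk id -> number of pending requests.
    """
    return min(replica_disks, key=lambda disk: queue_depth.get(disk, 0))
```

Other definitions of "favorable" (shortest seek distance, per-disk bandwidth headroom) would slot into the same interface by changing the key function.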
Issue 3: End-to-End QoS Support
- Storage QoS support only provides guarantees within the storage devices
- The TCP/IP network is a best-effort network; no hard guarantee is provided
- The TCP/IP network is shared by a variety of users, not just storage-access users
- Feedback control alone is not practical given the variety of clients and their diverse distribution
- Integrate network QoS and storage QoS to provide true end-to-end QoS
A List of Related Projects
- iSCSI Performance Study
- iSCSI Simulation Implementation and Study
- Adaptive iSCSI Storage Access
- QoS for iSCSI with OSD Support
- Implementation and Evaluation of a DMAPI-Based Data Backup Prototype
- Network-Aware Resource Scheduling
- QoS Support for OSD Implementation
What is data provenance?
- Provenance is a relationship between data objects that explains how a particular object has been derived
- A workflow of data processes usually captures this relationship
- Using provenance, a user can trace the workflow, i.e., the aggregation of processes, that produced a particular object
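Tracing "the workflow that led to a particular object" amounts to walking a derivation graph backwards. A minimal sketch, assuming provenance is recorded as a mapping from each object to its direct sources:

```python
def trace_provenance(derived_from, obj):
    """Return every ancestor object that contributed to `obj` (sketch).

    `derived_from` maps an object id -> list of the objects it was
    directly derived from; the traversal collects all transitive sources.
    """
    seen = set()
    stack = [obj]
    while stack:
        current = stack.pop()
        for source in derived_from.get(current, []):
            if source not in seen:
                seen.add(source)
                stack.append(source)
    return seen
```

For instance, if gene calls were derived from alignment features, which in turn came from contigs and models, tracing the gene calls surfaces all three ancestors.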
EnsEMBL Pipeline (Workflow)
[Diagram: genomic sequence data flows through download & import scripts into the primary data (contig, clone, assembly, dna), with regular (daily) addition of new data and occasional updates to existing data; corrections take the form of updates, as does assembly data (partial chromosome locations). Search targets / models / parameter sets (examples: NCBI NR, PFAM models) pass through update scripts into target sets; some updates (BLAST targets) are additive, while others represent retraining and cannot easily be added (new models for HMMs, contig sets from TIGR), and update frequency is currently driven by computational limits. The preliminary pipeline produces features (dna_align_feature, protein_align_feature); these, together with external gene calls & xrefs and protein annotations, feed the gene-calling pipeline, which produces the EnsEMBL genes (Transcript, Exon, Gene, Xref).]
CCGB-DeCIFR Annotation Pipeline
[Diagram: (1) Nightly download from GenBank HTG (FASTA) of new phase-3 and phase-2 (at most 2 ordered pieces) HTG sequences. GenBank accession and version numbers are checked against the CCGB-DeCIFR contents to avoid duplication: if an Acc# is already present with an earlier version#, all analysis results for it are dropped from the database tables, and for the same Acc# only the latest version is kept for analysis. The University of Minnesota Mt BAC Registry (Young lab, MtBR: linkage groups, BAC ordering to come) supplies linkage-group information used to assign each BAC display to a chromosome, and a pipeline analysis queuing job is created (SubmitContig, ContigStartState). (2a, 2b) The private CCGB-DeCIFR analysis pipeline runs: 1. RepeatMasker; 2. Genscan (Ath smat); 3. Fgenesh (dicots smat); 4. BLASTX vs. PIR-NREF (soon to be replaced by UniProt); 5. BLASTN vs. NCBI dbEST, NCBI nt, NCBI Mt cDNA, NCBI Ath genome, NCBI Lj HTGs. (3) Incremental BLAST against the latest TIGR unigenes (Arabidopsis thaliana, Lotus japonicus, Glycine max, Medicago truncatula; target update) and CCGB unigenes (Medicago truncatula, peanut, Pseudorobinia accacia (black locust); query update). Upon completion of all analyses for a BAC, that BAC's analysis results are pushed to the public production database instance, CCGB DeCIFR public: http://decifr.ccgb.umn.edu.]
http://decifr.ccgb.umn.edu/medicago_truncatula/project_status
[Screenshot of the project-status page; the labels 1, 2a, 2b, and 3 mark the pipeline steps described on the previous slide.] Nightly download of genomic sequences that are to be put into the pipeline.
Suggestions for test development for provenance using the CCGB-DeCIFR genome annotation pipeline

As the annotation pipeline currently stands, three development points in the pipeline are suggested. The first two are immediately available; the third will be available in the near future, since it requires us to write a fair amount of new code, and that particular project needs to be integrated into our development schedule.

1. Provenance of sequences downloaded from NCBI on a nightly basis
   - Every night a cron job checks for the NCBI release of new Medicago genome sequences that fit specific criteria. A list of the sequence IDs (Acc# and GI) is made and compared with the contents of the CCGB-DeCIFR database.
   - Sequences that are downloaded are:
     - New accessions (that fit the specific criteria)
     - Old accessions with a new GI [sequence updates]
2. Provenance of gene-prediction analysis (result features, parameters used, DAS source(?))
   - Gene-prediction programs may have been trained on different training sets (different research groups: US, EU)
   - Focus on FGENESH (trained for dicots) [2a] and Genscan (trained for Arabidopsis) [2b]
3. Provenance for incremental updates of target databases for homology searches [BLAST, HMM]
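The nightly download decision in point 1 — fetch new accessions, refetch old accessions whose version has advanced, and invalidate stale analysis results — can be sketched as follows. Function and variable names are illustrative, not taken from the actual cron job.

```python
def select_downloads(ncbi_entries, local_entries):
    """Sketch of the nightly dedup check: decide which sequences to
    download and which stale analyses to drop.

    Both arguments map Acc# -> version number (NCBI release vs. what the
    CCGB-DeCIFR database currently holds). Returns (to_download,
    to_invalidate).
    """
    to_download, to_invalidate = [], []
    for acc, version in ncbi_entries.items():
        local_version = local_entries.get(acc)
        if local_version is None:
            to_download.append(acc)        # new accession
        elif version > local_version:
            to_download.append(acc)        # sequence update (new GI)
            to_invalidate.append(acc)      # drop old analysis results
        # else: same or older version -> keep only the latest, skip
    return to_download, to_invalidate
```

Capturing these two lists each night is exactly the provenance record point 1 asks for: what entered the pipeline, and which earlier results it superseded.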
How to solve data provenance in bioinformatics?
- Workflow of functional genomics
- Data-dependent relationships between data objects
- Analysis tools: take several input data objects and a set of parameter values to produce a version of an output data object
- Results and generated knowledge are presented as annotations and fed back into the system
Generalized Black Box for an Analysis Tool
[Diagram: any object (target/db/query/...) enters as input with metadata; together with the parameters at an analysis instance, it passes through an analysis model with its database (gene-calling algorithm / matching algorithm / filters / general db search / user scripts / ...) plus all necessary configuration sets (e.g., version information), producing output with metadata that includes intermediate data.]
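The black-box model can be captured as an immutable provenance record per analysis run. This is only one possible encoding; the field names below are illustrative assumptions, not a schema from the talk.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AnalysisInstance:
    """Provenance record for one run of an analysis tool, following the
    black-box model above (field names are illustrative)."""
    tool: str          # analysis model, e.g. a gene-calling algorithm
    tool_version: str  # configuration set / version information
    inputs: tuple      # input object ids (target/db/query) with metadata
    parameters: tuple  # (name, value) pairs at this analysis instance
    outputs: tuple     # produced object ids, incl. intermediate data
```

Because the record is frozen, it is hashable and can serve directly as a node key in the provenance graph linking inputs to outputs.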
Our Proposed Solution
- Take the intelligent-storage approach to demonstrate its power
- Provenance information is part of the metadata or attributes associated with data
- A potentially unbounded number of versions of a data object can exist
  - What is an efficient way to store and maintain these many versions?
  - How does a change to one object affect the others?