Yogesh Simmhan, eScience Group, Microsoft Research


External Research. Yogesh Simmhan, eScience Group, Microsoft Research; Catharine van Ingen, Roger Barga, Microsoft Research; Alex Szalay, Johns Hopkins University; Jim Heasley, University of Hawaii.

Science is producing vast quantities of data. Data needs to be ingested & published for community use. Different user roles bring different domain/IT expertise. Provenance, reliability, semantics are key! Commodity hardware. Roles along the data pipeline: Producer → Raw Data → Valet → Verified Data → Curator → Ingested Data → Publisher → Published Data → Consumer.

Scientific workflows are popular with scientists: composable & reusable pipelines supporting data flows; diverse execution environments, tracking & monitoring. Workflows are suitable for data ingest, but reliability is important for valet workflows. Workflow design models for reliability in a distributed environment; workflow framework features for fault tolerance.

Panoramic Survey Telescope And Rapid Response System: discover dangerous asteroids & comets; use the data for Solar System and cosmology studies. One of the largest visible-light telescopes; 2/3rds of the sky scanned thrice a month. Large, growing data collection. Annual: 1PB of raw images, 30TB of community data, 5.5B objects, 350B detections. Daily: 2.5TB of image data, 100GB of shared data, 1000 images, 150M detections. The PS1 Telescope is at Haleakala, Maui, Hawaii. Data flow: Raw Digital Images [~1PB/yr] → Image Processing Pipeline (IPP) on a Unix cluster → Detections, Objects, Metadata (comma-separated value files) [~100TB/yr] → Object Data Manager (ODM), MS SQL databases on Windows 2008 servers → SQL queries and results via the CASJobs batch submission portal → ~300 astronomers.

Data is partitioned as a distributed database with views on a Windows HPC cluster running MS SQL, with 3 copies: hot/cold/warm. CSV files arrive from the Image Pipeline (~1k daily); each file is loaded into one Load database. Load DBs from each week are merged with prior data in a Cold database, and copies of the new Cold DB are surfaced to users. Workflows for data ops (Load, Merge, Copy, Flip, running on Trident): the Load Workflow (900/night) loads a CSV batch into a new Load DB (~100MB); the Merge Workflow (16/week) merges Load DBs into a Cold DB spatial partition (~4TB) on offline servers backed by RAID storage; the Copy Workflow (32/week) network-copies a merged Cold DB to replace the corresponding Hot/Warm DB (~4TB) on the online servers; the Flip Workflow (2/week) pauses CASJobs & recreates the Distributed Partition View (Hot DPV, Warm DPV) over the 16 Hot/Warm DBs. [Diagram: CSV batches are data products on the file system logical store; each Load/Cold/Hot/Warm DB spatial partition (S1–S16) is a database logical store.]
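To make the four data-ops workflows concrete, here is a minimal sketch of how Load, Merge, Copy and Flip chain together over a weekly cycle. This is illustrative Python (the production pipeline uses Trident workflows over MS SQL); the in-memory dicts, function names and toy CSV rows are all stand-ins rather than the actual implementation.

```python
# Illustrative sketch of the Load/Merge/Copy/Flip data ops (not the real Trident
# workflows): plain Python over in-memory dicts stands in for MS SQL databases.

cold_db, hot_db, warm_db = {}, {}, {}        # spatial partitioning (S1..S16) elided

def load_workflow(csv_batch):
    """~900/night: load one CSV batch (~100 MB) into a fresh Load DB."""
    load_db = {}                              # "create DB"
    for row in csv_batch:                     # "insert CSV batch"
        load_db[row["object_id"]] = row
    assert all("ra" in r and "dec" in r for r in load_db.values())   # "validate"
    return load_db

def merge_workflow(load_dbs, cold):
    """~16/week: merge the week's Load DBs with prior data into a Cold DB (~4 TB)."""
    merged = dict(cold)
    for db in load_dbs:
        merged.update(db)
    return merged

def copy_workflow(cold, target):
    """~32/week: network-copy the merged Cold DB to replace a Hot/Warm DB."""
    target.clear()
    target.update(cold)

def flip_workflow(partitions):
    """~2/week: pause CASJobs and recreate the distributed partition view."""
    return {k: v for part in partitions for k, v in part.items()}

# One toy weekly cycle:
batches = [[{"object_id": i, "ra": 1.0, "dec": 2.0}] for i in range(3)]
cold_db = merge_workflow([load_workflow(b) for b in batches], cold_db)
copy_workflow(cold_db, hot_db)
copy_workflow(cold_db, warm_db)
dpv = flip_workflow([hot_db, warm_db])
print(len(dpv), "objects visible through the toy distributed partition view")
```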

Well-defined, well-tested workflows: valet workflows run repeatedly, so their impact is cumulative; workflows and activities do not change often, and changes need to be synchronized with the repository schedule; testing is crucial (e.g. the Load workflow runs 900 times/day, so faults can compound). Granular, reusable workflows: avoid custom activities for ad hoc tasks; a library of core activities is easier to test and maintain. Separate policy from mechanism, e.g. static n/w files; Copy & Flip are separate workflows, and the preamble for Copy is separate from the actual copy action.
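One reading of "separate policy from mechanism" is to keep the copy mechanism generic and drive it from static configuration. The sketch below is a hedged illustration in Python; the policy fields, paths and partition names are invented for the example and do not reflect the actual Pan-STARRS configuration.

```python
import json
import pathlib
import shutil

# Hypothetical policy, kept as static data rather than hard-coded in the workflow.
POLICY = json.loads("""{
  "partitions": ["S1", "S2", "S3"],
  "source_root": "/offline/cold",
  "target_root": "/online/warm",
  "max_retries": 3
}""")

def copy_partition(name, policy):
    """Mechanism: copy one partition; it knows nothing about schedules or targets,
    which come entirely from the policy."""
    src = pathlib.Path(policy["source_root"]) / name
    dst = pathlib.Path(policy["target_root"]) / name
    for _ in range(policy["max_retries"]):
        try:
            shutil.copytree(src, dst, dirs_exist_ok=True)
            return True
        except OSError:
            continue                      # transient failure: retry per policy
    return False

if __name__ == "__main__":
    for part in POLICY["partitions"]:
        print(part, "copied" if copy_partition(part, POLICY) else "failed")
```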

Workflows as data state machines: the state of the repository depends on the state of its constituent resources. Workflows operate on state machines: data containers have states, and workflows and activities cause state transitions. This lets us instantly determine the state of the system and makes it easier to interpret workflow progress and complex interactions. The impact of different workflows on states differs, e.g. Load vs. Merge workflow failures. Fault paths are easily defined based on data state. Goal: recovery from a fault state to a clean state.

Workflows as data state machines: simple state definitions depending on the stage of loading data and the recovery model used; activities should cause at most one state change; states are used for the recovery workflow, policy workflow & display. State transitions take place when activities execute. [Figure: Load Workflow activities — Drop DB (if exists), Create DB, Insert CSV Batch, Validate Data — and the state machine for a Load DB: Does Not Exist → Created → Loading → Load Complete, with failed states for Drop DB, Create DB, Insert CSV and Validate; the Load DB then continues in the Merge workflow. Legend: activity types are mainline code (blue/solid) or recovery preamble (purple/dashed); success states are clean (green/solid) or in-flight (yellow/dashed); failed states are recoverable (orange/dotted) or unrecoverable (red/dotted).]
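A minimal sketch of how the Load DB state machine read off the figure above might be encoded so that each activity causes at most one state transition. Python enums are used purely for illustration (the production workflows ran on Trident/.NET), and the transition table is an assumption reconstructed from the figure, not the actual code.

```python
from enum import Enum, auto

class LoadDBState(Enum):
    DOES_NOT_EXIST = auto()
    CREATED = auto()
    LOADING = auto()
    LOAD_COMPLETE = auto()      # clean; the Load DB continues in the Merge workflow
    CREATE_FAILED = auto()      # recoverable
    INSERT_FAILED = auto()      # recoverable
    VALIDATE_FAILED = auto()    # recoverable
    DROP_FAILED = auto()        # unrecoverable without manual intervention

# (current_state, activity, succeeded) -> next_state; one transition per activity.
TRANSITIONS = {
    (LoadDBState.DOES_NOT_EXIST, "create_db", True):  LoadDBState.CREATED,
    (LoadDBState.DOES_NOT_EXIST, "create_db", False): LoadDBState.CREATE_FAILED,
    (LoadDBState.CREATED, "insert_csv", True):        LoadDBState.LOADING,
    (LoadDBState.CREATED, "insert_csv", False):       LoadDBState.INSERT_FAILED,
    (LoadDBState.LOADING, "validate", True):          LoadDBState.LOAD_COMPLETE,
    (LoadDBState.LOADING, "validate", False):         LoadDBState.VALIDATE_FAILED,
    # Recovery preamble: dropping a failed DB returns it to the initial state.
    (LoadDBState.CREATE_FAILED, "drop_db", True):     LoadDBState.DOES_NOT_EXIST,
    (LoadDBState.INSERT_FAILED, "drop_db", True):     LoadDBState.DOES_NOT_EXIST,
    (LoadDBState.VALIDATE_FAILED, "drop_db", True):   LoadDBState.DOES_NOT_EXIST,
    (LoadDBState.CREATE_FAILED, "drop_db", False):    LoadDBState.DROP_FAILED,
}

def apply(state, activity, succeeded):
    return TRANSITIONS[(state, activity, succeeded)]

s = LoadDBState.DOES_NOT_EXIST
for step in [("create_db", True), ("insert_csv", True), ("validate", True)]:
    s = apply(s, *step)
print(s)   # LoadDBState.LOAD_COMPLETE
```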

Workflow recovery baked into design: faults are a fact of life in distributed systems (hardware loss, transient failures, data errors). Making recovery part of the normal workflow makes handling faults a routine action and allows a coarser (simpler) approach to recovery. Different recovery designs: Re-Execute Idempotent, Resume Idempotent, Recover & Resume, Independent Recovery.

Workflow recovery baked into design. Re-Execute Idempotent recovery: idempotent workflows that can be rerun without side effects; input data states stay valid and there are no in-place updates; retry limits bound the reruns. E.g. the Flip workflow for DPV reconstruction. Resume Idempotent recovery: idempotent activities allow a "goto" to the point of failure when the workflow is rerun; weigh the cost and complexity of resume vs. re-execute. E.g. the Copy workflow for parallel file copies.
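The two idempotent recovery designs can be sketched as thin wrappers around a workflow or its activities. The sketch below is illustrative Python; the retry limit, activity names and the idea of reconstructing the completed set from provenance are examples, not the actual Trident mechanisms.

```python
import logging

logging.basicConfig(level=logging.INFO)

def run_with_reexecute(workflow, max_retries=3):
    """Re-Execute Idempotent: the whole workflow has no side effects on rerun,
    so on any fault we simply run it again, up to a retry limit."""
    for attempt in range(1, max_retries + 1):
        try:
            return workflow()
        except Exception as exc:
            logging.warning("attempt %d failed: %s", attempt, exc)
    raise RuntimeError(f"workflow failed after {max_retries} attempts")

def run_with_resume(activities, completed):
    """Resume Idempotent: individual activities are idempotent, so a rerun skips
    straight to the first activity that has not yet completed."""
    for name, activity in activities:
        if name in completed:            # already done on a previous attempt
            continue
        activity()
        completed.add(name)

# Toy usage; in practice 'completed' would be reconstructed from provenance records.
run_with_reexecute(lambda: "flip ok")
done = {"copy_S1"}                        # first copy survived the earlier failure
run_with_resume([("copy_S1", lambda: None), ("copy_S2", lambda: None)], done)
print(sorted(done))                       # ['copy_S1', 'copy_S2']
```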

Workflow recovery baked into design. Recover & Resume: separate activity(s) roll back to the initial state, reducing the problem to resume/re-execute. Passive recovery: rollback activity(s) at the start of the workflow, e.g. the Load workflow dropping its database. Active recovery: a rollback workflow captures operations to undo dirty state, e.g. the Merge workflow on machine failure; fail-fast vs. fault-tolerant activities. Independent Recovery: complex faults requiring global synchronization; sanity check, cleanup, consistent states; manually coordinated, e.g. loss of a merged Cold DB.
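A hedged sketch of the two Recover & Resume styles: a passive rollback preamble at the start of the Load workflow, and an active rollback workflow that undoes the dirty state left by a failed Merge. The in-memory "databases" dict, journal format and names are stand-ins for illustration only.

```python
# In-memory stand-in for the SQL Server instance; all names are illustrative.
databases = {}

def load_workflow(db_name, rows):
    """Passive recovery: a rollback preamble at the start of the workflow drops any
    half-built Load DB left by a previous failed run, then simply re-executes."""
    databases.pop(db_name, None)          # "DROP DATABASE IF EXISTS"
    databases[db_name] = list(rows)       # create + insert + validate (elided)

def recover_failed_merge(journal):
    """Active recovery: a separate rollback workflow replays the journal of what a
    failed Merge did and undoes it, restoring a clean state before resuming."""
    for operation, db_name in reversed(journal):
        if operation == "created":
            databases.pop(db_name, None)

# Pretend a Merge died after creating a temporary Cold DB, then recover and reload.
databases["ColdDB_tmp"] = []
recover_failed_merge([("created", "ColdDB_tmp")])
load_workflow("LoadDB_0421", [{"object_id": 1}])
print(sorted(databases))                  # ['LoadDB_0421']
```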

Data valets have specific requirements of a workflow system, e.g. tracking is more important than ad hoc composition. Provenance collection: track operations across data, activities and workflows, distributed temporally & spatially; record the inputs and outputs of activities. Valet workflows need fine-grained, system-level provenance (logging) and reliable, scalable provenance collection.
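As an illustration of recording activity inputs and outputs, the sketch below wraps each activity so that every execution appends a provenance record. The in-memory list, field names and decorator approach are assumptions for the example; in production the records would go to a reliable store such as a SQL table.

```python
import functools
import time

PROVENANCE_LOG = []   # stand-in; production would use a reliable store (e.g. SQL)

def record_provenance(activity):
    """Append a record of each activity execution: its name, inputs, outputs or
    error, and timestamps, so repository state can later be reconstructed."""
    @functools.wraps(activity)
    def wrapper(*args, **kwargs):
        entry = {"activity": activity.__name__,
                 "inputs": [repr(a) for a in args],
                 "t_start": time.time()}
        try:
            result = activity(*args, **kwargs)
            entry.update(status="success", outputs=repr(result))
            return result
        except Exception as exc:
            entry.update(status="failed", error=str(exc))
            raise
        finally:
            entry["t_end"] = time.time()
            PROVENANCE_LOG.append(entry)
    return wrapper

@record_provenance
def validate_data(load_db_name):
    return f"{load_db_name}: validation passed"

validate_data("LoadDB_0421")
print(PROVENANCE_LOG[-1]["activity"], PROVENANCE_LOG[-1]["status"])
```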

Provenance collection for data state tracking: provenance provides a record of state transitions and can be mined for the current state of data resources, with recovery based on it. Direct state recording: activities expose the state transition they effect as inputs/outputs. Pro: easily queried; Con: reusability, local view. Indirect state recording: an external mapping from activity execution to state transition. Pro: reusable, workflow-level; Con: complexity. Pan-STARRS uses the indirect model.
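A small sketch of indirect state recording: the activities themselves know nothing about data states, and an external table maps (activity, status) pairs to state transitions, so the current state of a Load DB can be mined by folding its time-ordered provenance records. The state names follow the Load DB state machine above; the record format is an assumption for illustration.

```python
# External mapping from activity execution to state transition; the activities
# never touch these state names (that is what makes the recording "indirect").
STATE_MAP = {
    ("create_db", "success"):  "Created",
    ("create_db", "failed"):   "CreateFailed",
    ("insert_csv", "success"): "Loading",
    ("insert_csv", "failed"):  "InsertFailed",
    ("validate", "success"):   "LoadComplete",
    ("validate", "failed"):    "ValidateFailed",
}

def current_state(provenance_records):
    """Mine time-ordered provenance records for the Load DB's current state."""
    state = "DoesNotExist"
    for rec in provenance_records:
        state = STATE_MAP.get((rec["activity"], rec["status"]), state)
    return state

records = [{"activity": "create_db", "status": "success"},
           {"activity": "insert_csv", "status": "success"},
           {"activity": "validate", "status": "failed"}]
print(current_state(records))    # ValidateFailed -> a recovery workflow can act on it
```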

Provenance collection for forensics: investigate the cause of distributed, repetitive faults; low-level system provenance is akin to monitoring, logging and tracing of disks, machines, services and I/O. Scalable & resilient provenance collection: long-running, continuous workflow execution; fine-grained collection; stored until data release, possibly for the instrument's lifetime (e.g. the status of 30,000 Load workflows); efficient querying for recovery and current state. Provenance loss means a doubtful repository state. Pan-STARRS uses real-time SQL replication of provenance.

Support for workflow recovery models: re-run a previous workflow with the same inputs; well-defined exceptions & recovery execution paths (e.g. Recover & Resume workflows); identify the exception source and transmit it, reliably, to initiate recovery; low-level system provenance akin to monitoring, logging and tracing of disks, machines, services and I/O. Fail-fast guarantees: workflows and activities must fail fast, and the framework must halt on errors (timeout, liveness, network connectivity); early, accurate detection enables early recovery. For Pan-STARRS, the Trident workflow is configured to fail fast, with synchronous provenance logging to the Registry and HPC fault event monitoring.
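A minimal sketch of a fail-fast wrapper that surfaces an error as soon as an activity exceeds its deadline, so recovery can start early instead of the workflow hanging. This is illustrative Python, not the Trident fault-handling API; the timeout value and activity are invented for the example.

```python
import concurrent.futures
import time

def run_fail_fast(activity, timeout_s, *args):
    """Fail-fast wrapper: raise as soon as an activity misses its deadline or
    fails, so the framework halts and recovery can begin early."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(activity, *args)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        raise RuntimeError(f"{activity.__name__} missed its {timeout_s}s deadline")
    finally:
        pool.shutdown(wait=False)

def copy_partition(name):
    time.sleep(0.01)                       # stand-in for a network copy
    return f"copied {name}"

print(run_fail_fast(copy_partition, 5, "S1"))
```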

The reliability of repository data management is less studied compared to other aspects. Data reliability: replication in Grids and Clouds, replicated file systems and databases; transparency vs. update and recovery overhead; specialization for repositories that are operated on only by workflows, not general-purpose applications. Application reliability: over-provisioning workflows is infeasible; checkpoint-restart is insufficient; transactional workflows do not support data states.

Data management in eScience is complex, and large shared data repositories on commodity clusters are increasingly common. Data valets have special needs from workflows. A goal-driven approach to workflow design using data state machines, and simple models for reliability & recovery, go a long way. The Trident workflow system provides tools to support both scientist & valet users.

Acknowledgements: Maria Nieto-Santisteban, Richard Wilton, Sue Werner (Johns Hopkins University); Conrad Holmberg (University of Hawaii); Dean Guo, Nelson Araujo, Jared Jackson (Microsoft Research); the Trident Workflow Team. www.ps1sc.org http://research.microsoft.com/en-us/collaboration/tools/trident.aspx External Research