Amara's law: overestimating the effects of a technology in the short run and underestimating the effects in the long run


Cloud Archiving and Data Mining: Operational and Research Examples
John Horel*, Brian Blaylock, Chris Galli*
Department of Atmospheric Sciences, University of Utah
john.horel@utah.edu
*Synoptic Data Corporation
Amara's law: overestimating the effects of a technology in the short run and underestimating the effects in the long run

MesoWest/SynopticLabs
https://synopticlabs.org/
~40 billion observations
Objective: access, archive, and disseminate publicly accessible provisional environmental data
Transitioned over the last several years from the University of Utah to cloud-based IT infrastructure

Public Cloud Infrastructure
Cloud archiving: highly efficient MySQL/TokuDB databases
Data mining via API: over a million requests to download over 10 billion data values per day

Type               Quantity   Monthly Cost
Server Instances   29         $1800
Disk               3.5 TB     $350
Data Transfers     Egress     $350
Total                         $2500

Number & Types of Server Instances
13   Ingest, Processing, Load Balancing
6    Database
2    Real-time data checking
8    Web services/API/alerts

Workflow: Acquire, Process, Archive, QC, Disseminate

Issues with Cloud Archiving/Data Mining
Critical to manage costs and only use resources that are absolutely necessary
Balancing data storage vs. data access costs is challenging
Cheaper solutions exist to store large amounts of data, but may result in large latency to retrieve that data
Compute nodes required for data mining need to be close to the data archive to avoid data transfer costs

Cloud Archiving and Data Mining of Forecast Model Output: High Resolution Rapid Refresh (HRRR)
No retrospective archive of HRRR model output is accessible externally at this time at NCEI, NCEP, or ESRL
What we needed for WRF model initialization and HRRR model validation:
Efficient and expandable archival storage for thousands of large GRIB2 files
Fast retrieval of 2D fields within those files
Ability to make data publicly accessible to other researchers
Solution: object storage is an affordable, usable, and reliable long-term archive approach
See poster by Brian Blaylock & access archive via: http://hrrr.chpc.utah.edu/
Manuscript submitted: Blaylock, B., J. Horel, and S. Liston, 2017: Cloud Archiving and Data Mining of High Resolution Rapid Refresh Forecast Model Output. Computers and Geosciences. In review.

Object Store Options
Public cloud (Amazon Web Services, Microsoft Azure, etc.)
Private cloud (University of Utah Data Center)
Our choice: Private Cloud @ University of Utah
Disk-based S3-like object storage at the University of Utah's Data Center, managed by the Center for High Performance Computing (CHPC)
Configured with 6+3 erasure coding: objects are broken into 9 pieces (6 data pieces and 3 redundancy pieces)
No data loss, even if every disk fails on three servers
New hardware and expanded storage can easily be added or replaced over time

Cloud Storage Service   Cost over 5 Years
CHPC Pando              $120/TB
Amazon Glacier          $540/TB + $30/TB download
Microsoft Azure         $600/TB + $10/TB download
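The five-year costs listed above can be compared for a given archive size with a short calculation. The following is only a rough sketch, not any provider's pricing model: the per-TB figures are copied from the slide, the download charge is assumed to apply once per TB retrieved, Pando is assumed to have no download charge because none is listed, and the names `FIVE_YEAR_PRICING` and `five_year_cost` are hypothetical.

```python
# Per-TB figures copied from the slide; everything else is an assumption.
FIVE_YEAR_PRICING = {
    # No download charge is listed for Pando, so 0 is assumed here.
    "CHPC Pando": {"storage_per_tb": 120, "download_per_tb": 0},
    "Amazon Glacier": {"storage_per_tb": 540, "download_per_tb": 30},
    "Microsoft Azure": {"storage_per_tb": 600, "download_per_tb": 10},
}


def five_year_cost(service, stored_tb, downloaded_tb=0.0):
    """Estimate the 5-year cost in dollars for one storage service.

    stored_tb: size of the archive in TB.
    downloaded_tb: total TB retrieved over the 5 years (assumed billed per TB).
    """
    pricing = FIVE_YEAR_PRICING[service]
    storage = stored_tb * pricing["storage_per_tb"]
    download = downloaded_tb * pricing["download_per_tb"]
    return storage + download


if __name__ == "__main__":
    # Hypothetical archive: 100 TB stored, 20 TB downloaded over 5 years.
    for name in FIVE_YEAR_PRICING:
        print(name, five_year_cost(name, stored_tb=100, downloaded_tb=20))
```

The archive size and download volume in the usage lines are made-up inputs; substituting real figures shows how quickly retrieval charges add to the storage cost.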

Model Data Flow
HRRR GRIB2 files (from NOMADS/ESRL)
UU local NFS
UU Private Cloud: 3 monitors, 9 object storage device servers (each with 16 8 TB drives), RADOS gateway node
Retrospective HRRR
External users
User local NFS
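The hardware counts above, combined with the 6+3 erasure coding described earlier, give a rough estimate of raw versus usable capacity. This is a back-of-the-envelope sketch under stated assumptions: every drive is fully dedicated to object storage, file system and metadata overhead are ignored, and each of the 9 pieces of an object lands on a different server. The function name `erasure_coded_capacity` is hypothetical.

```python
def erasure_coded_capacity(servers, drives_per_server, drive_tb,
                           data_chunks, parity_chunks):
    """Estimate capacity of an erasure-coded object store.

    Assumes all drives are dedicated to storage and ignores any
    file system or metadata overhead.
    """
    raw_tb = servers * drives_per_server * drive_tb
    total_chunks = data_chunks + parity_chunks
    return {
        "raw_tb": raw_tb,
        # Only the data chunks hold unique content.
        "usable_tb": raw_tb * data_chunks / total_chunks,
        # Bytes written to disk per byte of user data.
        "overhead": total_chunks / data_chunks,
        # Chunks that can be lost without losing data (one per server assumed).
        "tolerated_failures": parity_chunks,
    }


if __name__ == "__main__":
    # 9 servers, 16 drives each, 8 TB per drive, 6+3 erasure coding.
    print(erasure_coded_capacity(9, 16, 8, data_chunks=6, parity_chunks=3))
```

With these inputs the raw capacity is 1152 TB, of which two thirds (768 TB) is usable, at a storage overhead of 1.5 bytes on disk per byte stored.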

HRRR Output Being Archived (~70 GB/day)

Operational HRRR
  2D fields, f00-f18
  3D fields, f00
  Subhourly fields, f00-f18
  First date available: analyses April 18, 2015; forecasts July 27, 2016; subhourly May 11, 2017

Experimental HRRR
  2D fields, f00
  First date available: December 1, 2016

Experimental HRRR Alaska
  2D fields, f00-f36
  3D fields, f00
  First date available: September 1, 2016

Data mining of HRRR output
Users may download the entire field with wget or curl, or download a single 2D surface if the byte range is known
Possible to use multithreaded Python to simultaneously return thousands of grids in a small amount of time
Applications: HRRR model validation (wind gust 95th percentile over 2 years); WRF initialization
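Byte-range retrieval and multithreaded downloads can be sketched with the Python standard library. The slide only says the byte range must be known; this sketch assumes the range comes from a wgrib2-style inventory file whose lines look like `message:start_byte:date:variable:level:forecast:`, which may not match how the archive is actually indexed. All function names are hypothetical, and no real file URL is implied.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen


def parse_index(index_text):
    """Turn inventory lines into fields with start/end byte offsets.

    Assumed line format: message:start_byte:date:variable:level:forecast:
    The last field has no known end, so its end is None (read to end of file).
    """
    rows = []
    for line in index_text.strip().splitlines():
        parts = line.split(":")
        rows.append((int(parts[1]), parts[3], parts[4]))
    fields = []
    for i, (start, variable, level) in enumerate(rows):
        end = rows[i + 1][0] - 1 if i + 1 < len(rows) else None
        fields.append({"name": f"{variable}:{level}", "start": start, "end": end})
    return fields


def range_header(start, end):
    """Build an HTTP Range header value; an open end means 'to end of file'."""
    return f"bytes={start}-{'' if end is None else end}"


def fetch_range(url, start, end):
    """Download only the requested bytes of a remote object."""
    request = Request(url, headers={"Range": range_header(start, end)})
    with urlopen(request) as response:
        return response.read()


def fetch_many(jobs, fetch=fetch_range, max_workers=8):
    """Run (url, start, end) jobs in a thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda job: fetch(*job), jobs))
```

A caller would first read the inventory for a file, pick the field of interest from `parse_index`, and pass one `(url, start, end)` tuple per grid to `fetch_many`. Threads suit this workload because each job spends most of its time waiting on the network.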

Issues with Cloud Archiving/Data Mining
S3-type objects must be downloaded to a local disk to process the data
Data download costs are incurred from a public cloud provider
Red Hat now supports the POSIX-compliant Ceph File System to handle objects in the cloud
To avoid excessive data downloading, the GRIB2 format allows selecting by byte range and returning only the requested fields within an object
Other file formats such as HDF5 may eventually allow subsetting of S3 objects by variable, region, single grid point, all vertical levels at a point, etc.
Siphon, NetCDF subsetting service

NSF & other funding agencies require data management plans
While geoscience data repositories exist, they have strict standards
Academic institutions need to meet data stewardship requirements
Will institutions subsidize the costs to maintain large archives?
UU Hive has a 500 GB limit

Cloud Archiving and Data Mining: Operational and Research Examples
HRRR Archive
MesoWest/SynopticLabs

Questions?
https://synopticlabs.org/
http://hrrr.chpc.utah.edu/

MesoWest/SynopticLabs
https://synopticlabs.org/
Objective: access, archive, and disseminate publicly accessible provisional environmental data rapidly
Key drivers: National Weather Service, Research & Public Service, Fire Weather, Air Quality, Commercial Applications, Road Weather
Workflow: Acquire, Process, Archive, QC, Disseminate