Data Analytics and Storage System (DASS): Mixing POSIX and Hadoop Architectures. 13 November 2016


National Aeronautics and Space Administration
Data Analytics and Storage System (DASS): Mixing POSIX and Hadoop Architectures
13 November 2016
Carrie Spear (carrie.e.spear@nasa.gov), HPC Architect/Contractor at the NASA Center for Climate Simulation (NCCS)
www.nasa.gov

DASS Concept

Data Analytics and Storage System (DASS): ~10 PB of centralized storage shared by the NCCS services. Note that more than likely all the services will still have local file systems to enable local writes within their respective security domains.

ADAPT
- Read access from all nodes within the ADAPT system
- Serve data to data portal services
- Serve data to virtual machines for additional processing
- Mixing model and observations

HyperWall
- Read access from the HyperWall to facilitate visualizing model outputs quickly after they have been created

Mass Storage
- Read and write access from mass storage
- Stage data into and out of the centralized storage environment as needed

Climate Analytics as a Service
- Analytics through web services or higher-level APIs are executed and passed down into the centralized storage environment for processing; answers are returned
- Only those analytics that we have written are exposed

HPC (Discover)
- Write and read access from all nodes within Discover
- Models write data into GPFS, which is then staged into the centralized storage (burst-buffer-like)
- Initial data sets could include: Nature Run, downscaling results, reanalysis (MERRA, MERRA2), high-resolution reanalysis

Data Analytics and Storage System (DASS)

Data movement and sharing of data across services within the NCCS is still a challenge:
- Large data sets are created on Discover (HPC)
- Users perform many analyses on them
- They may not be in a NASA Distributed Active Archive Center (DAAC)

Goal: create a true centralized combination of storage and compute capability that
- Has the capacity to store many PBs of data for long periods of time
- Is architected to scale both horizontally (compute and bandwidth) and vertically (storage capacity)
- Can easily share data to different services within the NCCS
- Frees up high-speed disk capacity within Discover
- Enables both traditional and emerging analytics
- Requires no modification of the data; native scientific formats are used

Initial DASS Capability Overview

- Initial capacity: 20.832 PB raw data storage (2,604 x 8 TB SAS drives)
- 14 units, 28 servers
- 896 cores, 14,336 GB memory (16 GB/core)
- 37 TF of compute, roughly equivalent to the compute capacity of the NCCS just 6 years ago
- Designed to easily scale both horizontally (compute) and vertically (storage)
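The headline figures are internally consistent; a quick arithmetic check (a minimal sketch using only the numbers quoted on this and the following slide) is shown below.

    # Sanity check of the DASS capacity figures quoted above.
    drives = 2604                  # 8 TB SAS drives across all 14 units
    raw_tb = drives * 8            # 8 TB per drive
    print(raw_tb, "TB =", raw_tb / 1000, "PB raw")   # 20832 TB = 20.832 PB

    servers = 28                   # 2 servers per unit x 14 units
    cores = servers * 2 * 16       # two 16-core processors per server -> 896 cores
    memory_gb = cores * 16         # 16 GB per core -> 14,336 GB
    print(cores, "cores,", memory_gb, "GB of memory")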

DASS Compute/Storage Units

HPE Apollo 4520 (initial quantity of 14), each with:
- Two (2) ProLiant XL450 servers, each with:
  - Two (2) 16-core Intel Xeon E5-2697A v4 2.6 GHz processors
  - 256 GB of RAM
  - Two (2) SSDs for the operating system
  - Two (2) SSDs for metadata
  - One (1) Smart Array P841/4G controller
  - One (1) HBA
  - One (1) InfiniBand FDR / 40 GbE 2-port adapter
  - Redundant power supplies
- 46 x 8 TB SAS drives in the Apollo 4520 chassis (368 TB)
- Two (2) D6000 JBOD shelves per Apollo 4520, each with 70 x 8 TB SAS drives (560 TB per shelf)

DASS Compute/Storage Units

Traditional: data moved from storage to compute
- POSIX interface: open, read, write; MPI, C code, Python, etc.

Emerging: analytics moved from the servers to the storage
- MapReduce, Spark, machine learning, etc.
- RESTful interface, custom APIs, notebooks
- Cloudera and SIA

Both paths, connected over InfiniBand and Ethernet, reach the same shared parallel file system (GPFS); a Hadoop connector presents GPFS to the Cloudera stack. Native scientific data is stored in HPC storage or on commodity servers and storage.

Open-source software stack on the DASS servers:
- CentOS operating system
- Software RAID
- Linux Storage Enclosure Services
- Pacemaker
- Corosync
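To make the dual access paths concrete, the sketch below reads the same native NetCDF data from the shared file system first through the POSIX interface and then from a Spark job. The file paths, the temperature variable name "T", and the assumption that the Hadoop connector lets Spark tasks see the same GPFS path are illustrative, not taken from the DASS configuration.

    # POSIX path: read a native NetCDF file directly from GPFS.
    from netCDF4 import Dataset

    path = "/dass/gpfs/merra/monthly/MERRA_1980_01.nc4"   # hypothetical path
    with Dataset(path) as nc:
        print(nc.variables["T"][:].mean())                # "T" is an assumed variable name

    # Hadoop/Spark path: distribute the same files across executors; each task
    # opens its file with the same POSIX call the serial code used.
    from pyspark import SparkContext

    sc = SparkContext(appName="dass-dual-access-sketch")
    files = ["/dass/gpfs/merra/monthly/MERRA_1980_%02d.nc4" % m for m in range(1, 13)]

    def mean_temp(fname):
        with Dataset(fname) as nc:                        # netCDF4 assumed installed on executors
            return float(nc.variables["T"][:].mean())

    print(sc.parallelize(files).map(mean_temp).collect())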

Spatiotemporal Index Approach (SIA) and Hadoop

- Use what we know about the structured scientific data
- Create a spatiotemporal query model to connect the array-based data model with the key-value-based MapReduce programming model, using a grid concept
- Build a spatiotemporal index to:
  - Link the logical location of the data to its physical location
  - Make use of an array-based data model within HDFS
- Develop a grid partition strategy to:
  - Keep high data locality for each map task
  - Balance the workload across cluster nodes

Reference: "A spatiotemporal indexing approach for efficient processing of big array-based climate data with MapReduce," Zhenlong Li, Fei Hu, John L. Schnase, Daniel Q. Duffy, Tsengdar Lee, Michael K. Bowen, and Chaowei Yang, International Journal of Geographical Information Science, 2016. http://dx.doi.org/10.1080/13658816.2015.1131830
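As a rough illustration of the indexing idea (a minimal sketch under assumed names and key layout, not the published SIA implementation), the index below maps a logical (variable, time, grid cell) key to the file, byte range, and node that physically hold that chunk, which is what lets a map task be scheduled where its data already resides.

    from collections import namedtuple

    # One index entry: where a (variable, time step, grid cell) chunk physically lives.
    Entry = namedtuple("Entry", ["path", "offset", "length", "node"])

    index = {}

    def add_entry(var, t, lat_cell, lon_cell, path, offset, length, node):
        """Record the physical location of one logical array chunk."""
        index[(var, t, lat_cell, lon_cell)] = Entry(path, offset, length, node)

    def locate(var, t, lat_cell, lon_cell):
        """Logical-to-physical lookup used when scheduling a map task."""
        return index[(var, t, lat_cell, lon_cell)]

    # Hypothetical grid partition: 10-degree cells, so nearby points share a chunk
    # and each map task reads one contiguous byte range from one node.
    def grid_cell(lat, lon, cell_size=10):
        return int(lat // cell_size), int(lon // cell_size)

    add_entry("T", "1980-01", *grid_cell(38.0, -76.0),
              path="/dass/gpfs/merra/MERRA_1980_01.nc4",   # hypothetical file
              offset=4096, length=1 << 20, node="dass-node-07")
    print(locate("T", "1980-01", *grid_cell(38.0, -76.0)))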

Analytics Infrastructure Testbed

Test Cluster 1: SIA + Cloudera + HDFS
- 20 nodes (compute and storage)
- Cloudera on HDFS
- Sequenced data and native NetCDF data: put only

Test Cluster 2: SIA + Cloudera + Hadoop Connector + GPFS
- 20 nodes (compute and storage)
- Cloudera on GPFS via the Spectrum Scale Hadoop Transparency Connector
- Sequenced data: put and copy
- Native NetCDF data: put and copy

Test Cluster 3: SIA + Cloudera + Hadoop Connector + Lustre
- 20 nodes (compute and storage)
- Cloudera on Lustre via HAM and HAL
- Sequenced data: put and copy
- Native NetCDF data: put and copy

DASS Initial Serial Performance

- Compute the average temperature for every grid point (x, y, and z), varying the total number of years
- Data: MERRA monthly means (reanalysis)
- Comparison of serial C code to MapReduce code
- Comparison of traditional HDFS (Hadoop), where data is sequenced (modified), with GPFS, where data is native NetCDF (unmodified, copied in)
- Using unmodified data in GPFS with MapReduce is the fastest
- Only GPFS results are shown to compare against HDFS
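The kernel being timed is essentially a per-grid-point time average. A minimal Python sketch of the serial case over native NetCDF monthly means is shown below; the file paths and the temperature variable name are assumptions, and the actual comparison used serial C code rather than Python.

    import glob
    import numpy as np
    from netCDF4 import Dataset

    # Average temperature at every grid point (x, y, z) over the selected years,
    # reading the native (unmodified) NetCDF monthly-mean files in place on GPFS.
    files = sorted(glob.glob("/dass/gpfs/merra/monthly/MERRA_*.nc4"))   # hypothetical

    total, count = None, 0
    for fname in files:
        with Dataset(fname) as nc:
            t = np.asarray(nc.variables["T"][0])    # one monthly record: (level, lat, lon)
        total = t if total is None else total + t
        count += 1

    mean_temperature = total / count                # time mean at every grid point
    print(mean_temperature.shape)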

DASS Initial Parallel Performance

- Compute the average temperature for every grid point (x, y, and z), varying the total number of years
- Data: MERRA monthly means (reanalysis)
- Comparison of C code with MPI to MapReduce code
- Comparison of traditional HDFS (Hadoop), where data is sequenced (modified), with GPFS, where data is native NetCDF (unmodified, copied in)
- Again, using unmodified data in GPFS with MapReduce is the fastest as the number of years increases
- Only GPFS results are shown to compare against HDFS
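For the parallel case, the MPI code distributes the files (years) across ranks and combines partial sums. The sketch below uses mpi4py rather than the actual C/MPI code, and the paths and variable name are again assumptions.

    import glob
    import numpy as np
    from mpi4py import MPI
    from netCDF4 import Dataset

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # Each rank sums the monthly files assigned to it; rank 0 combines the partial sums.
    # Assumes there are at least as many files as MPI ranks.
    files = sorted(glob.glob("/dass/gpfs/merra/monthly/MERRA_*.nc4"))   # hypothetical
    my_files = files[rank::size]

    partial = None
    for fname in my_files:
        with Dataset(fname) as nc:
            t = np.asarray(nc.variables["T"][0])      # variable name assumed
        partial = t if partial is None else partial + t

    total = comm.reduce(partial, op=MPI.SUM, root=0)   # elementwise sum of the arrays
    nfiles = comm.reduce(len(my_files), op=MPI.SUM, root=0)
    if rank == 0:
        print((total / nfiles).shape)                  # time mean at every grid point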

Future of Data Analytics

[Diagram: climate/weather models on HPC produce data in memory and in the parallel file system on disk; an SIA index over each layer feeds ML, Spark, and HDFS analytics, alongside traditional C, IDL, and Python post-processing.]

- Stream data from memory
- Post-process data on disk
- Continue to enable traditional methods of post-processing

Future HPC systems must be able to efficiently transform information into knowledge using both traditional analytics and emerging machine learning techniques. This requires the ability to index data in memory and/or on disk and to perform analytics on the data where it resides, even in memory, all without having to modify the data.