Brent Welch. Director, Architecture. Panasas Technology. HPC Advisory Council Lugano, March 2011


Panasas Background
- Technology based on Object Storage concepts invented by company founder Garth Gibson, a pioneer in RAID technology
- 1990s research on Network Attached Secure Disk (NASD) led to the OSD (Object Storage Device) standards for iSCSI
- Storage systems company: integrated hardware and software storage systems (hardware, file system, RAID)
- 4th-generation blade platform; HC storage platform
- Customer base: 2/3 commercial, 1/3 research and higher education

Customer Success Highlights
- Half the Formula One teams use Panasas
- Five of the top six O&G companies in the world use Panasas
- The world's first petascale system, RoadRunner, uses Panasas
- The world's three largest genomic data centres use Panasas
- The largest aircraft manufacturer in the world uses Panasas
- Leading universities, including Cambridge, Oxford, Stanford, and Yale, use Panasas
- The world's largest hedge fund uses Panasas
Markets: Energy, Automotive, Aerospace, Bio Science, Pharma, Finance, University Research, Government

Technical Differentiation
- Scalable performance: start with 12 servers, grow to 12,000
- Novel data integrity protection: the file system and RAID are integrated, giving highly reliable data with novel data protection systems
- Maximum availability: built-in distributed system platform, tested under fire; LANL measured > 99.9% availability across its Panasas systems, counting all causes of downtime
- Simple to deploy and maintain: integrated storage system with an appliance model
- Application acceleration: customer-proven results

Data Integrity
Object RAID
- Horizontal striping with redundant data on different OSDs
- A per-file RAID equation allows multiple layouts: small files are mirrored (RAID-1); large files are RAID-5 or RAID-10; very large files use a two-level striping scheme to counter network incast
Vertical Parity
- RAID across sectors to catch silent data corruption and repair single-sector media defects
Network Parity
- Read back per-file parity to achieve true end-to-end data integrity
Background scrubbing
- Media, RAID equations, and distributed file system attributes
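
The per-file RAID equation can be pictured with a small sketch. This is an illustration only: the size thresholds and stripe widths below are hypothetical, not the actual PanFS layout policy.

# Illustration only: how a per-file RAID equation might pick a layout from the
# file size. The thresholds and stripe widths are hypothetical, not PanFS values.

SMALL_FILE_LIMIT = 64 * 1024      # hypothetical cutoff for mirrored small files
TWO_LEVEL_LIMIT = 1 << 40         # hypothetical cutoff for two-level striping

def choose_layout(file_size, pool_osds):
    """Small files mirrored, large files parity-striped, very large files
    striped in two levels to limit network incast on reads."""
    if file_size <= SMALL_FILE_LIMIT:
        return {"type": "RAID-1", "copies": 2}
    if file_size <= TWO_LEVEL_LIMIT:
        return {"type": "RAID-5", "stripe_width": min(pool_osds, 9)}
    return {"type": "RAID-5, two-level", "group_width": 9,
            "groups": max(1, pool_osds // 9)}

print(choose_layout(4 * 1024, 20))        # small file -> mirrored
print(choose_layout(10 * 1024**3, 20))    # large file -> parity stripe
print(choose_layout(2 * 1024**4, 20))     # huge file  -> two-level striping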

Declustered Object RAID
- Each file is striped across a different combination of StorageBlades
- Component objects include file data and file parity
- File attributes are replicated on the first two component objects
- Components grow, and new components are created, as data is written
- Declustered, randomized placement distributes the RAID workload
[Figure: a 20-OSD storage pool holding mirrored or 9-OSD parity stripes]
- During rebuild, read about half of each surviving OSD and write a little to each OSD
- Scales up in larger storage pools
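
A minimal sketch of the declustering idea, using the slide's 20-OSD pool and 9-OSD parity stripes (everything else below is illustrative): because each file lands on a different randomly chosen subset of OSDs, the rebuild reads after a failure are spread roughly evenly across all of the survivors.

import random
from collections import Counter

POOL = list(range(20))     # the slide's 20-OSD storage pool
STRIPE_WIDTH = 9           # 9-OSD parity stripes

def place_file(rng):
    # each file gets its own randomly chosen subset of OSDs
    return rng.sample(POOL, STRIPE_WIDTH)

rng = random.Random(0)
placements = [place_file(rng) for _ in range(10000)]

failed = 7
# Every file with a component on the failed OSD must be rebuilt; count where
# the surviving components live to see how the rebuild reads are spread.
load = Counter(osd for p in placements if failed in p
                   for osd in p if osd != failed)
print(sorted(load.items()))   # roughly uniform: every survivor reads a share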

Panasas Manageability
- Appliance-like quick setup and ease of use
- Automatic discovery of new equipment, with transparent load balancing and capacity balancing
- A "no knobs" attitude toward performance tuning
- Detailed performance charts and reporting
- Transparent software and hardware upgrades
- Fully integrated automatic failover: realm, metadata, data, network, NFS/CIFS
Continuous improvement
- This job is never done: systems get larger and new problems emerge
- Tight feedback loop between an experienced service organization and product development

Panasas Distributed System Platform
Problem: managing large numbers of hardware and software components in a highly available system
- What is the system configuration?
- What hardware elements are active in the system?
- What software services are available?
- What is the desired state of the system?
- Which components have failed?
Answer: the Panasas Realm Manager
- A 3-way or 5-way redundant master control program
- The distributed file system is one of several services managed by the RM
- Configuration management, software upgrades, failure detection, GUI/CLI management, hardware monitoring
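
As a rough illustration of why the Realm Manager is replicated 3-way or 5-way, the sketch below shows simple majority-quorum arithmetic: a change to the realm's configuration is durable only once a majority of replicas accept it. This stands in for, but is not, the actual voting/PTP protocol the Realm Manager implements.

def quorum(replicas):
    # smallest majority of the replica set
    return replicas // 2 + 1

def commit(acks, replicas):
    """A configuration update is durable once a majority of replicas ack it."""
    return acks >= quorum(replicas)

for n in (3, 5):
    print(n, "replicas: quorum =", quorum(n),
          "| tolerates", n - quorum(n), "failed replica(s)")

print(commit(2, 3), commit(2, 5))   # True with 3 replicas, False with 5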

Configuring the RM Replication Set

System View
- Out-of-band architecture with direct, parallel paths from clients to the storage nodes
- Compute nodes (up to 12,000) run either an NFS/CIFS client or the iSCSI/OSD PanFS client
- Manager nodes (100+) run the SysMgr and are reached over RPC
- Storage nodes (1,000+) run OSDFS
- Internal cluster management makes a large collection of blades work as a single system
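
The out-of-band data path can be sketched as follows; all class and method names are hypothetical stand-ins, not the DirectFLOW client API. The point is that metadata comes from a manager node once, while file data flows directly, and in parallel, between the client and the storage nodes.

from concurrent.futures import ThreadPoolExecutor

class StubOSD:
    """Hypothetical storage node holding named objects in memory."""
    def __init__(self, objects):
        self.objects = objects
    def read_object(self, oid):
        return self.objects[oid]

class StubManager:
    """Hypothetical manager node that hands out file layouts (metadata only)."""
    def __init__(self, layouts):
        self.layouts = layouts
    def get_layout(self, path):
        return self.layouts[path]

def read_file_parallel(manager, osds, path):
    layout = manager.get_layout(path)        # one metadata RPC to a manager node
    def fetch(comp):                         # data flows client <-> storage node
        return comp["index"], osds[comp["osd"]].read_object(comp["oid"])
    with ThreadPoolExecutor(max_workers=len(layout)) as pool:
        parts = dict(pool.map(fetch, layout))
    return b"".join(parts[i] for i in sorted(parts))

osds = {0: StubOSD({"a": b"hello "}), 1: StubOSD({"b": b"world"})}
mgr = StubManager({"/panfs/f": [{"index": 0, "osd": 0, "oid": "a"},
                                {"index": 1, "osd": 1, "oid": "b"}]})
print(read_file_parallel(mgr, osds, "/panfs/f"))     # b'hello world'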

Functional Architecture
- Parallel file system: DirectFLOW client, (p)NFS, CIFS, async mirror, snapshots, File Mgr, Cache Mgr, Lock Mgr, NDMP gateway, Perf Mgr, Storage Mgr, Object RAID
- Object storage: iSCSI target, OSD v2 (security, QoS), OSDFS (block allocation, copy-on-write, extensible attributes, vertical parity)
- Management UI: HTTP, SNMP, CLI; monitoring and reporting
- Distributed system platform: service management, resource discovery, voting (PTP), ConfigDB
- Common: TxnLog, hardware management, RPC, OS (BSD)

Managing metadata services

PAS-12: 4th Generation Blade (Panasas ActiveScale)
- 2002: 850 MHz CPU / PC100 memory, 80 GB PATA
- 2004: 1.2 GHz / PC100, 250 GB SATA, 330 MB/sec
- 2006: 1.5 GHz / DDR 400, 500 GB SATA, 400 MB/sec
- 2008: 10 GE shelf switch, 750 GB SATA, 600 MB/sec
- 2009: SSD hybrid, 1000 GB SATA + 32 GB SSD, 600 MB/sec
- 2010: 1.67 GHz / DDR3 800, 2000 GB SATA (64 GB SSD), 1.5 GB/sec
[Shelf layout: power supplies PS1 and PS2, battery, network modules NET1 and NET2, 11 blades]

PAS-12 Blade Configurations
- DirectorBlade: dual-port 10 GbE NIC, 3 memory channels, multi-core i7 CPU
- StorageBlade: 2 memory channels, single-core i7 CPU, 4.0 TB (2 TB HDD x 2)
- SSD options: with SSD, or no SSD

Managing Hardware

Panasas Features
- Object RAID (2003-2004)
- NFS with multiprotocol file locking (2005)
- Replicated cluster management (2006)
- Declustered parallel Object RAID rebuild (2006)
- Metadata failover (2007)
- Snapshots, NFS failover, Tiered Parity (2008)
- Async mirror, data migration (2009)
- SSD/SATA hybrid blade (2009)
- 64-bit multicore (2010)
- User/group quota (2010)
Dates are the year the feature shipped in a production release.

Scaling Performance with Capacity
[Chart: shelf scaling, throughput in MB/sec (0-2500) versus number of IOR processes (0-140). Series: read and write for 1 shelf with 1 client, 1 shelf with 8 clients, 2 shelves with 8 clients, and 4 shelves with 16 clients. From 3.4 release testing, December 2008, PAS 8, 10GE.]

Scaling Clients (100 shelves, 1,000 blades)
[Chart: throughput in MB/sec (0-60,000) versus number of RoadRunner compute nodes (0-3,240), LANL Roadrunner, release 3.2.3. Series: read, write, and one anecdotal point. The anecdote is a 55 GB/sec observation during a full-machine checkpoint restart; the read/write results are from early tests when about half the machine was available.]

Technical Philosophy
- Storage is hard: never fail, ever scale, wire-speed goals, built from low-cost, flaky hardware
- Fault handling is the key to building large systems
- Performance comes naturally if you can scale up
- Panasas layers its parallel file system on top of its distributed system platform
- Data integrity becomes more critical in today's very large systems: Object RAID provides per-file protection and fault isolation, vertical parity provides protection from media defects and firmware bugs, and network parity provides end-to-end data integrity
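
The network-parity point deserves a concrete picture. Below is a minimal sketch, assuming a simple XOR parity layout (the real per-file RAID equations are richer): the reader re-checks the parity it fetched along with the data, so corruption anywhere on the path from disk platter to client memory is detected.

from functools import reduce

def xor_parity(chunks):
    # column-wise XOR of equal-length data chunks
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))

def verified_read(data_chunks, parity_chunk):
    """Return the reassembled data only if the parity equation still holds."""
    if xor_parity(data_chunks) != parity_chunk:
        raise IOError("end-to-end parity check failed: corruption on the path")
    return b"".join(data_chunks)

chunks = [b"abcd", b"efgh", b"ijkl"]
parity = xor_parity(chunks)
print(verified_read(chunks, parity))   # b'abcdefghijkl' -- check passes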

Thank You

Panasas Global Storage Model
[Diagram: client nodes reach multiple Panasas systems over a TCP/IP network. Panasas System A contains BladeSet 1 and BladeSet 2, holding volumes including VolX, VolY, VolZ, home, and delta; Panasas System B contains BladeSet 3, holding volumes VolN, VolM, and VolL. A BladeSet is a physical storage pool, a volume is a logical quota tree, and each system name is a DNS name, giving global names such as /panfs/sysa/delta/file2 and /panfs/sysb/volm/proj38.]

LANL Petascale Network (Red), FY08
[Diagram: the site-wide shared global parallel file system (650-1500 TB, 50-200 GB/s, roughly 4 Gbit/s per 5-8 TB, spanning lanes) sits on a storage area network alongside the NFS complex and other network services, the WAN, file transfer agents (FTAs), and the archive.
Roadrunner Base: 70 TF, 144-node units, 12 I/O nodes per unit, 4-socket dual-core AMD nodes with 32 GB of memory each, full fat tree, 14 units, with acceleration to 1 PF sustained; 156 I/O nodes with one 10 Gbit link each (195 GB/sec) are planned for the accelerated Roadrunner.
Bolt: 20 TF, 2-socket single/dual-core AMD, 256-node units, reduced fat tree, 1,920 nodes.
Lightning: 14 TF, 2-socket single-core AMD, 256-node units, full fat tree, 1,608 nodes.
I/O units feed the storage area network: one group of 96 I/O nodes with two 1 Gbit links each (24 GB/sec) and, on the Myrinet side, 64 I/O nodes with two 1 Gbit links each (16 GB/sec). Compute units attach through InfiniBand 4X and Myrinet fat trees; lane switches provide 6 x 105 = 630 GB/s, and lane pass-through switches connect the legacy Lightning/Bolt systems. If more bandwidth is needed, adding lanes and storage scales nearly linearly, not N^2 like a fat tree.]

Storage Server Platform (PAS-HC)
- Panasas architecture mapped from blades to servers
- The director function still runs on DirectorBlades; an integrated battery provides NVRAM for fast in-memory journals
- The number of DirectorBlades is flexible and orthogonal to data storage
- Object storage (OSDFS) is stored in LUNs managed by dual-redundant block-based RAID controllers (e.g., LSI 9700)
- Front-end servers run OSDFS and implement the iSCSI/OSD target
- Each LUN stores one instance of OSDFS (i.e., one OSD, or OST)
- Any server attached to the controller can manage the OSDs, as directed by the Realm Manager
Data striping
- Panasas files are striped RAID-0 across different OSDs (on any/all controllers)
- Panasas directories are replicated RAID-1 on two different OSDs, so the file system can be scavenged if a LUN is lost
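
The RAID-0 data-striping rule above reduces to simple arithmetic. The sketch below maps a file byte offset to an OSD and an offset within that OSD's component object; the 64 KB stripe-unit size is an example value chosen for illustration, not a documented PanFS parameter.

STRIPE_UNIT = 64 * 1024    # example stripe-unit size, not a documented value

def locate(offset, osd_list):
    """Map a file byte offset to (OSD, offset within that OSD's component)."""
    unit = offset // STRIPE_UNIT                   # which stripe unit overall
    osd_index = unit % len(osd_list)               # round-robin across the OSDs
    within = (unit // len(osd_list)) * STRIPE_UNIT + offset % STRIPE_UNIT
    return osd_list[osd_index], within

osds = ["osd0", "osd1", "osd2", "osd3"]
print(locate(0, osds))                     # ('osd0', 0)
print(locate(3 * STRIPE_UNIT + 10, osds))  # ('osd3', 10)
print(locate(4 * STRIPE_UNIT, osds))       # ('osd0', 65536): second pass around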

Storage Server Platform
[Diagram: DirectorBlades and 2 to 4 redundant storage servers attach to a 10 GE switch infrastructure via dual 10 GE links. The storage servers connect over 8 Gb/s Fibre Channel, with multi-path protection, to a dual-redundant RAID controller; each RAID-6 protected LUN is formatted as OSDFS.]