Evolving HPC Solutions Using Open Source Software & Industry-Standard Hardware

CLUSTER TO CLOUD
Carl Trieloff, cctrieloff@redhat.com, Red Hat, Technical Director
Lee Fisher, lee.fisher@hp.com, Hewlett-Packard, WW FSI-HPC Business Development

Financial compute to cloud example
[Diagram labels: Trader, Latency, Trade execution, Scale up, Scale out, Grid, Internal grid, Scheduler, Messaging, Internal private cloud, External public cloud]

What are some of the requirements?
Cloud computing is a hot topic, but many people have important questions and challenges they need addressed before they can adopt cloud:
How do I build an internal cloud?
How do I avoid lock-in to a single cloud?
How do I deal with homogeneous and non-homogeneous hardware requirements?
How do I mix, match, and blend different cloud resources, including internal and external clouds?
How do I manage a variety of applications and groups with different SLAs, priorities, and resource requirements across clouds?
How do I manage software licensing and hardware limits?
How do I abstract resource management, accounting, and permissions?

Red Hat Enterprise MRG
Integrated platform for high performance distributed computing
Messaging: high speed, interoperable, open standard
Realtime: deterministic, low-latency kernel
Grid: high performance and throughput computing; scheduler for distributed workloads and cloud computing

AMQP Messaging on 8-node HP Nehalem, Infiniband 40 Gbps: > 11 M messages/s
[Chart: messages/sec (0 to 7,000,000) by number of brokers per server (4, 2, 1), comparing HP G6 Nehalem with HP G5 Harpertown; improvement ratio on a second axis, with data labels 3.11, 3.08, and 3.13.]

AMQP, HP Performance, scale up
Single HP Nehalem BL460c, 40G Infiniband, AMQP Perftest
[Chart: messages/sec (0 to 12M) by number of brokers on the server (8, 4, 2, 1), for message sizes of 8, 64, 256, and 1024 bytes.]
Test configuration:
Servers: two BL460c G6 with two Intel(R) Xeon(R) X5570 CPUs per blade (Nehalem 2.93 GHz, 8MB L3 cache, 95W)
Memory: 24GB (6x4GB), DDR3-1333, HT, Turbo 2/2/3/3
Infiniband: 4X QDR IB dual-port mezzanine HCAs (1 port connected)
Infiniband switch: BLc 4X QDR IB Switch

KVM Performance: within 5% of bare metal
AMQP Messaging, Intel Nehalem, 2 x 10 Gbit, VT-d: > 1 M messages/s
RHEL 5.4 KVM, AMQP, 2 guests
[Chart: messages/sec (left axis, 0 to 1.2M) and throughput in MB/sec (right axis, 0 to 900) by message size (16, 32, 64, 128, 256, 512, 1024, 2048, 4096 bytes); message-rate data labels range from 210,634 to 1,046,081 messages/sec.]
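The chart above plots message rate and MB/sec against message size. The relationship between the two axes is simple arithmetic, sketched below. The function name and the sample data points are illustrative assumptions, not values read from the chart, and 1 MB is taken as 10^6 bytes.

```python
def throughput_mb_per_sec(messages_per_sec: float, message_size_bytes: int) -> float:
    """Convert a message rate into payload throughput in MB/s (1 MB = 10**6 bytes)."""
    return messages_per_sec * message_size_bytes / 1_000_000


# Hypothetical data points, chosen only to show the trend: small messages give
# a high message rate but low byte throughput, large messages the reverse.
for rate, size in [(1_000_000, 16), (200_000, 4096)]:
    mb = throughput_mb_per_sec(rate, size)
    print(f"{size:>5} bytes at {rate:>9,} msg/s = {mb:7.1f} MB/s")
```

This is why a message-rate curve can fall with message size while the MB/sec curve on the same chart rises.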

MRG scheduling resources: http://www.youtube.com/watch?v=osm7ff8kkjk

MRG Messaging Infiniband RDMA Latency: under 40 microseconds, reliably acknowledged
MRG Messaging latency test on HP BL460c G6, Infiniband, 100K message rate
[Chart: average latency in ms (axis 0.0340 to 0.0480) over samples 1 to 99, for 32-byte, 256-byte, and 1024-byte messages over RDMA on Nehalem.]

Components of the Solution Stack
Solutions still matter in an industry-standard, open source world.
FSI-HPC Solution Stack layers: Application Environment, Workload Middleware, Integrated Systems, Server Interconnect, L2 Fabric, Operating System, BIOS, X86-64 Server Architecture
Components: tuning and working in labs; Red Hat MRG tuning tools; Red Hat MRG & RHEV (Messaging/Grid/Virt); HP Voltaire / Red Hat RDMA; Red Hat MRG Realtime & KVM; HP reduced-SMI BIOSes; HP compute & storage
Also labeled: Services (Red Hat / HP), Systems, Users
Determinism and performance need to work at each layer; HP and Red Hat are partnered across the stack.

Hardware matters
Scale-up: blades
Scale-out: rack-optimized SL6000
Today's RFP metrics: performance/watt, performance/BTU, performance/rack
HP Low Latency Lab with MRG + Red Hat MRG Lab with HP BL460/BL685 & IB

Dealing with SMIs
HP BIOS option for low-latency applications: disable the frequent SMIs (System Management Interrupts) used for Dynamic Power Savings Mode, CPU utilization monitoring, P-state monitoring, and ECC reporting.
Benefits both RHEL and MRG operating environments.
[Figures: latency spikes with standard BIOS settings; latencies when SMIs are disabled in BIOS.]
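Latency spikes of the kind described above are usually found by spinning on a clock and looking for gaps in which the CPU was taken away from the program. The sketch below shows the idea only; the function name, duration, and threshold are assumptions for illustration. A Python loop cannot tell an SMI apart from scheduler preemption or interpreter pauses, so real measurements need a dedicated low-level tool on an otherwise idle, isolated core.

```python
import time


def find_latency_spikes(duration_s: float = 0.2, threshold_us: float = 100.0):
    """Spin on a monotonic clock; return every gap (in microseconds) above threshold_us."""
    spikes = []
    previous = time.perf_counter_ns()
    deadline = previous + int(duration_s * 1e9)
    while previous < deadline:
        now = time.perf_counter_ns()
        gap_us = (now - previous) / 1000.0
        # A large gap between two back-to-back clock reads means something
        # interrupted the loop (SMI, other interrupt, preemption, ...).
        if gap_us > threshold_us:
            spikes.append(gap_us)
        previous = now
    return spikes


if __name__ == "__main__":
    found = find_latency_spikes()
    print(f"{len(found)} gaps over threshold; worst = {max(found, default=0.0):.1f} us")
```

Running the same loop before and after changing the BIOS setting, on the same machine and load, gives a rough before/after comparison.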

MRG Realtime: RHEL on HP systems
Enables applications and transactions to run predictably, with guaranteed response times.
Upgrades RHEL 5 to a realtime OS
Provides a replacement kernel for RHEL 5; x86/x86_64
Preserves RHEL application compatibility
Certified on HP hardware; see Red Hat / HP certifications
[Chart: response time over time.]

MRG Realtime Scheduling Latency
Vanilla: Min 1, Max 2857, Mean 11.47, Mode 9.00, Median 9.00, Std. Deviation 54.94
MRG RT: Min 4, Max 43, Mean 8.34, Mode 8.00, Median 8.00, Std. Deviation 1.49
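The summary statistics above can be reproduced from raw latency samples with the standard library. The sketch below assumes a plain list of samples; the sample values are made up for illustration, and the use of the sample standard deviation is an assumption, since the slide does not say which form was used.

```python
import statistics


def summarize(latencies):
    """Return the same summary fields the slide reports for a list of latency samples."""
    return {
        "min": min(latencies),
        "max": max(latencies),
        "mean": statistics.mean(latencies),
        "mode": statistics.mode(latencies),
        "median": statistics.median(latencies),
        "stdev": statistics.stdev(latencies),  # sample std. deviation (assumed)
    }


# Hypothetical samples: mostly tight around 8, with one outlier.
samples = [8, 8, 8, 9, 7, 8, 43, 4]
for name, value in summarize(samples).items():
    print(f"{name:>6}: {value}")
```

For determinism, the maximum and standard deviation matter more than the mean: a single outlier barely moves the median or mode but dominates the worst case.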

Networking matters
Voltaire DDR and QDR InfiniBand switch: 36 QDR QSFP ports, Ethernet management port, LEDs, USB port, serial port
Test configuration: two Nehalem-based servers with ConnectX PCI-E HCAs, back-to-back; QDR ConnectX HCA running at QDR; DDR ConnectX HCA running at DDR; RHEL 5 Update 2; Mellanox verbs performance test
RoEE (RDMA on Enhanced Ethernet): RoEE is defined to be a verbs-compliant IB transport running over the emerging IEEE Converged Enhanced Ethernet standard.
www.openfabrics.org/archives/spring2009sonoma/monday/grun.pdf

Building Cloud capabilities with MRG
Scalable virtualization: schedule VMs directly as jobs via libvirt; provision VMs via Red Hat Enterprise Virtualization; inject jobs into VMs
Resource accounting and SLAs: track resources via Condor's resource accounting; apply priorities and policies; apply security: authentication (e.g. SSL), integrity, encryption
Powerful policies: VMs run multiple concurrent instances, start on Black Friday or semi-monthly, re-run after fault; machines only run VMs from the owner's group between 9 and 5, and everyone else has a low-priority shot from 5 to 9; global control limiters (e.g. NFS mount users, licenses)
Various cloud services: IaaS clouds run all workloads as VMs; PaaS clouds leverage job scheduling with VM scheduling
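The time-of-day rule above (owner's group only between 9 and 5, everyone else at low priority from 5 to 9) can be written down as a small decision function. This is a sketch in Python for illustration only; in MRG Grid such a rule would live in the scheduler's policy configuration, not in application code. The function name, the return labels, and the choice to give the owner's group normal priority outside working hours are assumptions.

```python
from datetime import time

WORK_START = time(9, 0)   # 9 am
WORK_END = time(17, 0)    # 5 pm


def vm_priority(vm_group: str, owner_group: str, now: time) -> str:
    """Return 'normal', 'low', or 'reject' for a VM asking to run on a machine."""
    if vm_group == owner_group:
        # Assumption: the owner's group is always welcome on its own machines.
        return "normal"
    in_work_hours = WORK_START <= now < WORK_END
    # Outsiders are shut out during working hours and get a low-priority
    # shot at the machine from 5 pm until 9 am.
    return "reject" if in_work_hours else "low"


print(vm_priority("quant", "quant", time(10, 0)))
print(vm_priority("risk", "quant", time(10, 0)))
print(vm_priority("risk", "quant", time(20, 0)))
```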

Aggregating & Bridging Clouds
MRG includes the ability to schedule jobs and applications to multiple clouds, based on policy.
MRG has the ability to send VMs to other resource managers.
MRG becomes the unified interface to many types of resources: internal VM resources and multiple external clouds.
MRG's life-cycle management, accounting, and policy benefits are still available.
Use cases include:
Manage overflow/spillover
Access to specialized resource managers
Transformation between VM types/systems
Allow a single app/stack to bridge multiple clouds
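The overflow/spillover use case can be pictured as an ordered list of resource pools, where work goes to the first pool with free capacity. The sketch below is a toy model under that assumption; the `Pool` class, the `place` function, and the pool names are hypothetical and do not represent MRG's interfaces.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Pool:
    """A hypothetical resource pool with a fixed number of slots."""
    name: str
    capacity: int
    running: int = 0

    def has_room(self) -> bool:
        return self.running < self.capacity


def place(pools: List[Pool]) -> Optional[str]:
    """Place one job in the first pool with room; return its name, or None if all are full."""
    for pool in pools:
        if pool.has_room():
            pool.running += 1
            return pool.name
    return None


# Preference order expresses the policy: internal resources first, then spill over.
pools = [Pool("internal-private-cloud", capacity=2), Pool("external-public-cloud", capacity=1)]
print([place(pools) for _ in range(4)])
```

Because the caller talks only to `place`, the same submission path serves internal and external resources, which is the point of a unified interface.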

MRG Cloud Aggregation Architecture
Schedd accepts jobs over SOAP, AMQP, and CLI.
GAHP (Grid ASCII Helper Protocol): an adapter to an external resource manager. GAHPs exist for many batch systems and for EC2-like resource managers, and are extensible to new resource managers.
Job Router transforms job types, e.g. stack to VM to EC2 AMI.
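The Job Router idea of transforming a job from one type to another (stack to VM to EC2 AMI) can be sketched as a table of transforms applied in sequence. Everything below is a toy model for illustration: the job fields, type names, and transform functions are hypothetical and are not Condor's Job Router configuration or API.

```python
from typing import Callable, Dict, Tuple

Job = Dict[str, str]
Transform = Callable[[Job], Job]


def stack_to_vm(job: Job) -> Job:
    """Hypothetical transform: wrap an application stack in a VM image."""
    return {**job, "type": "vm", "image": f"{job['stack']}.img"}


def vm_to_ami(job: Job) -> Job:
    """Hypothetical transform: map a VM image to an EC2-style machine image."""
    return {**job, "type": "ec2_ami", "ami": f"ami-for-{job['image']}"}


# Routing table keyed by (source type, destination type).
ROUTES: Dict[Tuple[str, str], Transform] = {
    ("stack", "vm"): stack_to_vm,
    ("vm", "ec2_ami"): vm_to_ami,
}


def route(job: Job, target_type: str) -> Job:
    """Apply registered transforms one step at a time until the job has target_type."""
    seen = set()
    while job["type"] != target_type:
        if job["type"] in seen:
            raise ValueError(f"routing loop at type {job['type']!r}")
        seen.add(job["type"])
        step = next((key for key in ROUTES if key[0] == job["type"]), None)
        if step is None:
            raise ValueError(f"no route from {job['type']!r} to {target_type!r}")
        job = ROUTES[step](job)
    return job


print(route({"type": "stack", "stack": "risk-app"}, "ec2_ami"))
```

Supporting a new resource manager in this model means registering one more transform, which mirrors the "extensible to new resource managers" point above.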

Durable Messaging Throughput Comparison
MRG durable messaging throughput across different storage types
Test configuration: Intel 16-CPU Harpertown, 12GB memory, 667 memory speed, Intel 82571EB Gigabit Ethernet, HP IO Accelerator (Fusion-io), 32-byte messages
[Chart: message rate (0 to 70,000) over samples 1 to 99, for four series: 1 NIC; 1 NIC durable, IO Fusion card; 1 NIC durable, fiber disk; 1 NIC durable, internal SCSI drive.]

Durable Messaging Latency Comparison
Latency test with durable store on different storage types
Test configuration: Intel 16-CPU Harpertown, 12GB memory, 667 memory speed, Intel 82571EB Gigabit Ethernet, HP IO Accelerator (Fusion-io), 32-byte messages
[Chart: average latency in ms (0.000 to 1.400) over samples 1 to 99, for four series: 1 NIC, not durable; 1 NIC, IO Fusion durable; 1 NIC, fiber durable; 1 NIC, SATA durable.]

MRG template for HP Matrix
HP BladeSystem Matrix enables:
Automated provisioning to speed deployment
Capacity planning to optimize workloads dynamically
Simplified disaster recovery
Red Hat and HP are developing an MRG template for Matrix to quickly stand up 'internal cloud' deployments with workflows, scripts, and best-practice templates.
www.hp.com/go/matrixtemplates

Testing and developing solutions, working together
Delivered in reference papers and certifications
Red Hat / HP White Paper:
[Chart: throughput and memory usage (cache, buff, free; axis 60 to 74) across interconnects: 1-GigE, 10-GigE, IPoIB, IB SDP, IB RDMA.]

Additional Information
www.redhat.com/mrg
www.hp.com/go/realtimelinux

THANK YOU