High Performance and Productivity Computing with Windows HPC
George Yan, Group Manager, Windows HPC, Microsoft China

HPC at Microsoft
1997: NCSA deploys the first Windows clusters on NT4
2000: Windows Server 2000 ships
2001: Microsoft Computational Clustering Preview kit and the book Beowulf Cluster Computing with Windows released
2002: Cornell Theory Center migrates to an all-Windows infrastructure, eventually reaching over 600 nodes and 1,200 user accounts; first Top500 appearance
2003: Argonne National Labs releases MPICH on Windows

HPC at Microsoft (continued)
2004: Windows HPC team established in both Redmond and Shanghai
2005: Microsoft launches its HPC entry at SC 05 in Seattle with a Bill Gates keynote
2006: Windows Compute Cluster Server 2003 ships
2007: Microsoft named one of the Top 5 companies to watch in HPC at SC 07
2008: Windows HPC Server 2008 ships

Top500 progression
Windows HPC Server 2008:
- Spring 2008, NCSA, #23: 9,472 cores, 68.5 TF, 77.7% efficiency
- Spring 2008, Umea, #40: 5,376 cores, 46 TF, 85.5% efficiency
- Spring 2008, Aachen, #100: 2,096 cores, 18.8 TF, 76.5% efficiency
- Fall 2007, Microsoft, #116: 2,048 cores, 11.8 TF, 77.1% efficiency (a 30% efficiency improvement over Spring 2007)
Windows Compute Cluster Server 2003:
- Spring 2007, Microsoft, #106: 2,048 cores, 9 TF, 58.8% efficiency
- Spring 2006, NCSA, #130: 896 cores, 4.1 TF
- Winter 2005, Microsoft: 4 processors, 9.46 GFlops

HPC Clusters in Every Lab (x64 servers)

Parallelism Everywhere
[Chart: power density (W/cm²) of Intel processors from the 4004 through the Pentium line, climbing past "hot plate" toward "nuclear reactor", "rocket nozzle", and "sun's surface" levels; heat is becoming an unmanageable problem for today's architectures. A companion chart shows single-threaded performance growing only about 10% per year, while many-core peak parallel throughput scales from 16 to 32,768 GOPS between 2004 and 2015, an 80x parallelism opportunity.]
To grow, to keep up, we must embrace parallel computing.
"...we see a very significant shift in what architectures will look like in the future... fundamentally the way we've begun to look at doing that is to move from instruction level concurrency to multiple cores per die. But we're going to continue to go beyond there. And that just won't be in our server lines in the future; this will permeate every architecture that we build. All will have massively multicore implementations."
Pat Gelsinger, Chief Technology Officer and Senior Vice President, Intel Corporation, Intel Developer Forum, Spring 2004 (February 19, 2004)

Today's Environment
- Infrastructure: high-speed networking, corporate infrastructure, clusters/supercomputers, storage
- Users: engineers, scientists, financial analysts, information workers
- Tools: specialized languages, mainstream technologies, compilers, debuggers

High Productivity Computing
- Combined infrastructure
- Integrated desktop and HPC environment
- Unified development environment

Microsoft's Productivity Vision
Windows HPC allows you to accomplish more, in less time, with reduced effort, by leveraging users' existing skills and integrating with the tools they are already using.
Administrator: integrated turnkey solution; simplified setup and deployment; built-in diagnostics; efficient cluster utilization; integrates with IT infrastructure and policies.
Application developer: highly productive parallel programming frameworks; service-oriented HPC applications; support for key HPC development standards; Unix application migration.
End user: seamless integration with workstation applications; integrated collaboration and workflow solutions; secure job execution and data access; world-class performance.

Industry-Focused Solutions: academia, aerospace, automotive, financial services, geo services, government, life sciences

Windows HPC Server 2008
Systems management:
- Rapid large-scale deployment and a built-in diagnostics suite
- Integrated monitoring, management, and reporting
- Familiar UI and rich scripting interface
Job scheduling:
- Integrated security via Active Directory
- Support for batch, interactive, and service-oriented applications
- High-availability scheduling
- Interoperability via OGF's HPC Basic Profile
Storage:
- Access to SQL, Windows, and Unix file servers
- Key parallel file system vendor support (GPFS, Lustre, Panasas)
- In-memory caching options
MPI:
- MS-MPI stack based on the MPICH2 reference implementation
- Performance improvements for RDMA networking and multi-core shared memory
- MS-MPI integrated with Windows Event Tracing
A sketch of submitting a job through the scheduler API follows this list.
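To make the job-scheduling side concrete, here is a minimal sketch of submitting a batch MPI job through the HPC Pack .NET scheduler API (Microsoft.Hpc.Scheduler). This sketch is not from the deck: the head-node name, executable, and resource counts are illustrative placeholders, and exact property names may differ between SDK versions.

    using Microsoft.Hpc.Scheduler;
    using Microsoft.Hpc.Scheduler.Properties;

    class SubmitMpiJob
    {
        static void Main()
        {
            // Connect to the cluster head node ("HEADNODE" is a placeholder).
            IScheduler scheduler = new Scheduler();
            scheduler.Connect("HEADNODE");

            // Create a job that requests 8 to 16 cores.
            ISchedulerJob job = scheduler.CreateJob();
            job.Name = "MyMpiJob";
            job.UnitType = JobUnitType.Core;
            job.MinimumNumberOfCores = 8;
            job.MaximumNumberOfCores = 16;

            // One task that launches the MPI application via mpiexec.
            ISchedulerTask task = job.CreateTask();
            task.CommandLine = "mpiexec MyApp.exe";
            job.AddTask(task);

            // Submit; passing null credentials uses (or prompts for)
            // the current user's cached credentials.
            scheduler.SubmitJob(job, null, null);
        }
    }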

Ease of Deployment

Comprehensive Diagnostics Suite

Single Management Console

Integrated Monitoring

Built-in Reporting

Integrated Job Scheduling

Service-Oriented HPC
[Architecture diagram: a client submits jobs containing user-defined functions (UDFs) to the head node, which handles job scheduling, cluster management, and resource management; compute nodes execute the UDFs inside the user application (with MPI available for job execution) and return results to the client.]

HPC SOA Programming Model

Sequential:

    for (int i = 0; i < 100000000; i++)
    {
        r[i] = worker.DoWork(dataset[i]);
    }
    reduce(r);

Parallel, using an HPC session and asynchronous service calls:

    Session session = new Session(startInfo);
    PricingClient client = new PricingClient(binding, session.EndpointAddress);

    for (int i = 0; i < 100000000; i++)
    {
        client.BeginDoWork(dataset[i], new AsyncCallback(callback), i);
    }

    void callback(IAsyncResult handle)
    {
        r = client.EndDoWork(handle);
        // aggregate results
        reduce(r);
    }
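The snippet above is the client side only; the DoWork calls target a WCF service deployed to the cluster's compute nodes. Purely as an illustrative assumption (the contract name and the pricing logic are invented here, not taken from the deck), the service side could look like this:

    using System.ServiceModel;

    // Hypothetical contract matching the client's BeginDoWork/EndDoWork calls.
    [ServiceContract]
    public interface IPricingService
    {
        [OperationContract]
        double DoWork(double[] dataPoint);
    }

    // Stateless implementation: the cluster fans independent DoWork
    // calls out across the allocated compute cores.
    public class PricingService : IPricingService
    {
        public double DoWork(double[] dataPoint)
        {
            // Placeholder computation standing in for a real pricing model.
            double sum = 0.0;
            foreach (double x in dataPoint)
                sum += x * x;
            return sum;
        }
    }

Because each call is independent and stateless, scaling out becomes a deployment decision rather than a code change.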

Placement via Job Context: node grouping, job templates, and filters
- Application aware: an ISV application (e.g., MATLAB) requires nodes where the application is installed.
- Capacity aware: a multi-threaded application requires a machine with many cores; a big model requires large-memory machines.
- NUMA aware: e.g., a 4-way structural analysis MPI job placed with attention to core, memory, and I/O topology on quad-core versus 32-core machines.
A sketch of expressing these constraints through the scheduler API follows this list.
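Assuming the same Microsoft.Hpc.Scheduler API sketched earlier (the node-group name is hypothetical, and property names may vary by SDK version), these placement constraints map onto job properties:

    // Application aware: run only on nodes where the ISV app is installed.
    // "MatlabNodes" is a hypothetical node group defined by the administrator.
    ISchedulerJob job = scheduler.CreateJob();
    job.NodeGroups.Add("MatlabNodes");

    // Capacity aware: reserve enough cores for a multi-threaded application.
    job.UnitType = JobUnitType.Core;
    job.MinimumNumberOfCores = 32;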

NetworkDirect
A new RDMA networking interface built for speed and stability:
- 2 µs latency, 2 GB/s bandwidth on ConnectX
- OpenFabrics driver for Windows includes support for NetworkDirect, Winsock Direct, and IPoIB protocols
- Verbs-based design for a close fit with native, high-performance networking interfaces
- Equal to hardware-optimized stacks for MPI micro-benchmarks
[Stack diagram: socket-based apps reach the networking hardware through Windows Sockets over TCP/IP/NDIS and a miniport driver, or through the Winsock Direct provider; MPI apps reach it through MS-MPI over the NetworkDirect provider and a user-mode access layer, bypassing the kernel for RDMA. Components are labeled as ISV app, CCP component, OS component, or IHV component.]

Partnering for Performance
- Networking hardware vendors: NetworkDirect design review; NetworkDirect and Winsock Direct provider development
- Windows Core Networking team
- Commercial software vendors: Win64 best practices, MPI usage patterns, collaborative performance tuning
- Four benchmarking centers online: IBM, HP, Dell, SGI; now working with Cray!

Devs can't tune what they can't see
- MS-MPI integrated with Event Tracing for Windows (ETW): a single, time-correlated log of OS, driver, MPI, and application events
- CCS-specific additions: high-precision CPU clock correction; log consolidation from multiple compute nodes into a single record of the parallel app's execution
- Dual purpose: performance analysis and application troubleshooting
- Trace data display: Visual Studio and Windows ETW tools, Intel Trace Collector/Analyzer, Vampir, Jumpshot

HPC Storage Solutions
[Chart: aggregate throughput per core (Mb/s/core) versus number of cores in the cluster, for Windows Server 2003 and Windows Server 2008.]
Partner parallel file systems: IBM GPFS, Panasas ActiveScale, HP PolyServe, Sun Lustre, Ibrix Fusion, Quantum StorNext, Sanbolic Melio file system

Unix Application Porting
- Windows Subsystem for UNIX-based Applications: a complete SVR-5 and BSD UNIX environment with over 300 commands, utilities, shell scripts, and compilers
- Visual Studio extensions for debugging POSIX applications
- Support for 32- and 64-bit applications
Recent port of the WRF weather model:
- 350K lines of Fortran 90 and C, using MPI and OpenMP
- Traditionally developed for Unix HPC systems
- Two dynamical cores, a full range of physics options
Porting experience:
- Fewer than 750 lines of code changed, primarily in the build mechanism (Makefiles, scripts)
- Level of effort and nature of tasks not unlike porting to any new version of UNIX
- Performance on par with Linux systems

F# is... a functional, object-oriented, imperative, and explorative programming language for .NET.

Example: Taming Asynchronous I/O (C#, processing 200 images in parallel)

    using System;
    using System.IO;
    using System.Threading;

    public class BulkImageProcAsync
    {
        public const String ImageBaseName = "tmpImage-";
        public const int numImages = 200;
        public const int numPixels = 512 * 512;

        // ProcessImage has a simple O(N) loop, and you can vary the number
        // of times you repeat that loop to make the application more
        // CPU-bound or more IO-bound.
        public static int processImageRepeats = 20;

        // Threads must decrement NumImagesToFinish, and protect
        // their access to it through a mutex.
        public static int NumImagesToFinish = numImages;
        public static Object[] NumImagesMutex = new Object[0];
        // WaitObject is signalled when all image processing is done.
        public static Object[] WaitObject = new Object[0];

        public class ImageStateObject
        {
            public byte[] pixels;
            public int imageNum;
            public FileStream fs;
        }

        // CPU-bound stand-in: the deck references ProcessImage without
        // showing its body, so this stub is supplied for completeness.
        public static void ProcessImage(byte[] pixels, int imageNum)
        {
            for (int r = 0; r < processImageRepeats; r++)
                for (int i = 0; i < numPixels; i++)
                    pixels[i] = (byte)(pixels[i] + 1);
        }

        public static void ReadInImageCallback(IAsyncResult asyncResult)
        {
            ImageStateObject state = (ImageStateObject)asyncResult.AsyncState;
            Stream stream = state.fs;
            int bytesRead = stream.EndRead(asyncResult);
            if (bytesRead != numPixels)
                throw new Exception(String.Format(
                    "In ReadInImageCallback, got the wrong number of " +
                    "bytes from the image: {0}.", bytesRead));
            ProcessImage(state.pixels, state.imageNum);
            stream.Close();

            // Now write out the image.
            // Using asynchronous I/O here appears not to be best practice.
            // It ends up swamping the threadpool, because the threadpool
            // threads are blocked on I/O requests that were just queued to
            // the threadpool.
            FileStream fs = new FileStream(ImageBaseName + state.imageNum +
                ".done", FileMode.Create, FileAccess.Write, FileShare.None,
                4096, false);
            fs.Write(state.pixels, 0, numPixels);
            fs.Close();

            // This application model uses too much memory.
            // Releasing memory as soon as possible is a good idea,
            // especially global state.
            state.pixels = null;
            fs = null;

            // Record that an image is finished now.
            lock (NumImagesMutex)
            {
                NumImagesToFinish--;
                if (NumImagesToFinish == 0)
                {
                    Monitor.Enter(WaitObject);
                    Monitor.Pulse(WaitObject);
                    Monitor.Exit(WaitObject);
                }
            }
        }

        public static void ProcessImagesInBulk()
        {
            Console.WriteLine("Processing images...");
            long t0 = Environment.TickCount;
            NumImagesToFinish = numImages;
            AsyncCallback readImageCallback = new
                AsyncCallback(ReadInImageCallback);
            for (int i = 0; i < numImages; i++)
            {
                ImageStateObject state = new ImageStateObject();
                state.pixels = new byte[numPixels];
                state.imageNum = i;
                // Very large items are read only once, so you can make the
                // buffer on the FileStream very small to save memory.
                FileStream fs = new FileStream(ImageBaseName + i + ".tmp",
                    FileMode.Open, FileAccess.Read, FileShare.Read, 1, true);
                state.fs = fs;
                fs.BeginRead(state.pixels, 0, numPixels, readImageCallback,
                    state);
            }

            // Determine whether all images are done being processed.
            // If not, block until all are finished.
            bool mustBlock = false;
            lock (NumImagesMutex)
            {
                if (NumImagesToFinish > 0)
                    mustBlock = true;
            }
            if (mustBlock)
            {
                Console.WriteLine("All worker threads are queued. " +
                    "Blocking until they complete. numLeft: {0}",
                    NumImagesToFinish);
                Monitor.Enter(WaitObject);
                Monitor.Wait(WaitObject);
                Monitor.Exit(WaitObject);
            }
            long t1 = Environment.TickCount;
            Console.WriteLine("Total time processing images: {0}ms",
                (t1 - t0));
        }
    }

Example: Taming Asynchronous I/O, the equivalent F# code (same performance)

    let ProcessImageAsync (i) =
        // the async block coordinates the whole workflow;
        // "!" marks an asynchronous step
        async { // open the file, synchronously
                let  inStream  = File.OpenRead(sprintf "source%d.jpg" i)
                // read from the file, asynchronously
                let! pixels    = inStream.ReadAsync(numPixels)
                let  pixels'   = TransformImage(pixels, i)
                let  outStream = File.OpenWrite(sprintf "result%d.jpg" i)
                // write the result, asynchronously
                do!  outStream.WriteAsync(pixels')
                do   Console.WriteLine "done!" }

    // generate the tasks and queue them in parallel
    let ProcessImagesAsync () =
        Async.Run (Async.Parallel
            [ for i in 1 .. numImages -> ProcessImageAsync(i) ])

Microsoft HPC++ Experience
- Application benefits: the most productive distributed application development environment
- Cluster benefits: a complete HPC cluster platform integrated with the enterprise infrastructure
- System benefits: a cost-effective, reliable, and high-performance server operating system

Resources
www.microsoft.com/hpc
www.microsoft.com/science
www.microsoft.com/servers
www.microsoft.com/sql
www.microsoft.com/excel
research.microsoft.com/fsharp
www.osl.iu.edu/research/mpi.net
www.microsoft.com/msdn
www.microsoft.com/technet

Thank you!