High Performance and Productivity Computing with Windows HPC
George Yan, Group Manager, Windows HPC, Microsoft China
HPC at Microsoft
1997: NCSA deploys first Windows clusters on NT4
2000: Windows Server 2000 ships
2001: Microsoft Computational Clustering Preview kit and Beowulf Cluster Computing with Windows book released
2002: Cornell Theory Center migrates to all-Windows infrastructure, eventually reaching over 600 nodes and 1,200 user accounts; first Top500 appearance
2003: Argonne National Laboratory releases MPICH on Windows
HPC at Microsoft
2004: Windows HPC team established in both Redmond and Shanghai
2005: Microsoft launches HPC entry at SC05 in Seattle with Bill Gates keynote
2006: Windows Compute Cluster Server 2003 ships
2007: Microsoft named one of the Top 5 companies to watch in HPC at SC07
2008: Windows HPC Server 2008
Top500 progression:
Winter 2005, Microsoft: 4 procs, 9.46 GFlops
Spring 2006, NCSA, #130: 896 cores, 4.1 TF (Windows Compute Cluster Server 2003)
Spring 2007, Microsoft, #106: 2048 cores, 9 TF, 58.8% efficiency
Fall 2007, Microsoft, #116: 2048 cores, 11.8 TF, 77.1% efficiency (a 30% efficiency improvement)
Spring 2008, NCSA, #23: 9472 cores, 68.5 TF, 77.7% efficiency (Windows HPC Server 2008)
Spring 2008, Umea, #40: 5376 cores, 46 TF, 85.5% efficiency
Spring 2008, Aachen, #100: 2096 cores, 18.8 TF, 76.5% efficiency
HPC Clusters in Every Lab: x64 servers
Parallelism Everywhere
Power density chart (W/cm2, from the 4004 through the Pentium processors, 1970s to 2000s): with today's architectures, heat is becoming an unmanageable problem, with power densities approaching those of a hot plate, a nuclear reactor, a rocket nozzle, and the Sun's surface.
Many-core peak parallel GOPs chart (2004 to 2015): an 80x parallelism opportunity from many-core scaling, versus roughly 10% per year growth in single-threaded performance. To grow, to keep up, we must embrace parallel computing.
"...we see a very significant shift in what architectures will look like in the future... fundamentally the way we've begun to look at doing that is to move from instruction level concurrency to multiple cores per die. But we're going to continue to go beyond there. And that just won't be in our server lines in the future; this will permeate every architecture that we build. All will have massively multicore implementations."
Pat Gelsinger, Chief Technology Officer, Senior Vice President, Intel Corporation, Intel Developer Forum, Spring 2004 (February 19, 2004)
Today's Environment
Infrastructure: high-speed networking, clusters/supercomputers, storage, corporate infrastructure
Users: engineers, scientists, financial analysts, information workers
Tools: specialized languages, mainstream technologies, compilers, debuggers
High Productivity Computing Combined Infrastructure Integrated Desktop and HPC Environment Unified Development Environment
Microsoft's Productivity Vision
Windows HPC allows you to accomplish more, in less time, with reduced effort, by leveraging users' existing skills and integrating with the tools they are already using.
Administrator: integrated turnkey solution; simplified setup and deployment; built-in diagnostics; efficient cluster utilization; integration with IT infrastructure and policies
Application Developer: highly productive parallel programming frameworks; service-oriented HPC applications; support for key HPC development standards; Unix application migration
End User: seamless integration with workstation applications; integrated collaboration and workflow solutions; secure job execution and data access; world-class performance
Industry Focused Solutions Academia Aerospace Automotive Financial Services Geo Services Government Life Sciences
Windows HPC Server 2008
Systems Management: rapid large-scale deployment and built-in diagnostics suite; integrated monitoring, management and reporting; familiar UI and rich scripting interface
Job Scheduling: integrated security via Active Directory; support for batch, interactive and service-oriented applications; high-availability scheduling; interoperability via OGF's HPC Basic Profile
Storage: access to SQL, Windows and Unix file servers; key parallel file system vendor support (GPFS, Lustre, Panasas); in-memory caching options
MPI: MS-MPI stack based on the MPICH2 reference implementation; performance improvements for RDMA networking and multi-core shared memory; MS-MPI integrated with Windows Event Tracing
Ease of Deployment
Comprehensive Diagnostics Suite
Single Management Console
Integrated Monitoring
Built-in Reporting
Integrated Job Scheduling
Service-Oriented HPC
Clients submit jobs containing UDFs to the scheduler. The head node (job scheduling, cluster management, resource management) dispatches the UDFs to compute nodes (job execution: user app, MPI) and returns the results.
HPC SOA Programming Model

Sequential:

    for (i = 0; i < 100000000; i++)
    {
        r[i] = worker.DoWork(dataset[i]);
    }
    reduce(r);

Parallel:

    Session session = new Session(startInfo);
    PricingClient client = new PricingClient(binding, session.EndpointAddress);
    for (i = 0; i < 100000000; i++)
    {
        client.BeginDoWork(dataset[i], new AsyncCallback(callback), i);
    }

    void callback(IAsyncResult handle)
    {
        r = client.EndDoWork(handle);
        // aggregate results
        reduce(r);
    }
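The Begin/End callback pattern above is language-agnostic: fire off many independent requests, collect each result in a callback, then reduce. A minimal sketch in Python, with a thread pool standing in for the service session and a hypothetical do_work standing in for the pricing service:

```python
from concurrent.futures import ThreadPoolExecutor

def do_work(x):
    # Stand-in for the service-side DoWork call (hypothetical workload).
    return x * x

def process(dataset):
    results = [0] * len(dataset)

    def make_callback(index):
        # Analogous to the EndDoWork callback: collect one result.
        def done(future):
            results[index] = future.result()
        return done

    with ThreadPoolExecutor(max_workers=8) as pool:
        for i, item in enumerate(dataset):
            # Analogous to BeginDoWork: submit and register a callback.
            pool.submit(do_work, item).add_done_callback(make_callback(i))
    # All futures have completed when the pool exits; aggregate (reduce).
    return sum(results)

print(process(range(10)))  # prints 285
```

Each callback writes to a distinct index, so no extra locking is needed; the pool's shutdown on exit plays the role of waiting for all outstanding Begin calls to complete.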
Placement via Job Context
Node grouping, job templates, filters
Application aware: an ISV application such as MATLAB requires nodes where the application is installed
Capacity aware: a multi-threaded application requires a machine with many cores; a big model requires large-memory machines
NUMA aware: a 4-way structural analysis MPI job placed with processes and memory aligned across sockets (e.g. quad-core vs 32-core nodes)
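The node-grouping and filtering idea above can be sketched as a simple eligibility check. This is an illustration only, not the HPC Server API; the node attributes and job fields are hypothetical:

```python
# Context-aware placement sketch: keep only nodes that satisfy the job's
# requirements (installed application, core count, memory).
def eligible_nodes(job, nodes):
    return [n["name"] for n in nodes
            if job["app"] in n["apps"]              # application aware
            and n["cores"] >= job["min_cores"]      # capacity aware: cores
            and n["mem_gb"] >= job["min_mem_gb"]]   # capacity aware: memory

nodes = [
    {"name": "n1", "apps": {"MATLAB"}, "cores": 4,  "mem_gb": 16},
    {"name": "n2", "apps": {"MATLAB"}, "cores": 32, "mem_gb": 128},
    {"name": "n3", "apps": set(),      "cores": 32, "mem_gb": 128},
]
job = {"app": "MATLAB", "min_cores": 8, "min_mem_gb": 64}
print(eligible_nodes(job, nodes))  # prints ['n2']
```

A real scheduler layers templates and admin-defined filters on top of checks like these before ranking the surviving nodes.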
NetworkDirect
A new RDMA networking interface built for speed and stability
2 usec latency, 2 GB/sec bandwidth on ConnectX
Verbs-based design for a close fit with native, high-performance networking interfaces
Equal to hardware-optimized stacks for MPI micro-benchmarks
OpenFabrics driver for Windows includes support for NetworkDirect, Winsock Direct and IPoIB protocols
Architecture: a socket-based app traverses Windows Sockets (Winsock + WSD) and the TCP/IP/NDIS stack to the networking hardware, or skips TCP via the Winsock Direct provider; an MPI app goes from MS-MPI through the NetworkDirect provider (or the user-mode access layer) with kernel bypass, straight to RDMA networking hardware. Components are split across the ISV app, CCP components, OS components, and IHV drivers.
Partnering for Performance
Networking hardware vendors: NetworkDirect design review; NetworkDirect and Winsock Direct provider development; Windows core networking team
Commercial software vendors: Win64 best practices; MPI usage patterns; collaborative performance tuning
4 benchmarking centers online: IBM, HP, Dell, SGI; now working with Cray!
Devs can't tune what they can't see
MS-MPI integrated with Event Tracing for Windows: a single, time-correlated log of OS, driver, MPI, and app events
CCS-specific additions: high-precision CPU clock correction; log consolidation from multiple compute nodes into a single record of parallel app execution
Dual purpose: performance analysis and application troubleshooting
Trace data display: Visual Studio and Windows ETW tools, Intel Collector/Analyzer, Vampir, Jumpshot
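The consolidation step, correcting each node's clock and then merging per-node event streams into one timeline, can be sketched as follows. This is an illustration of the idea only, not the ETW record format or MS-MPI's actual correction algorithm:

```python
import heapq

# Consolidate per-node event logs into one time-correlated record,
# applying a per-node clock offset before merging. Each node's log is
# assumed sorted by its local timestamp, so a k-way merge suffices.
def consolidate(logs, offsets):
    corrected = (
        [(t + offsets[node], node, event) for (t, event) in events]
        for node, events in logs.items()
    )
    return list(heapq.merge(*corrected))

logs = {
    "node0": [(10, "MPI_Send"), (25, "MPI_Recv")],
    "node1": [(8,  "MPI_Recv"), (30, "MPI_Send")],
}
offsets = {"node0": 0, "node1": 5}  # node1's clock runs 5 ticks behind
merged = consolidate(logs, offsets)
# merged interleaves both nodes' events on a single corrected timeline
```

Without the offset correction, node1's receive at local tick 8 would sort before node0's matching send, producing a causally impossible trace; clock correction is what makes the merged record usable for analysis.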
HPC Storage Solutions
Chart: aggregate bandwidth (Mb/s per core) versus number of cores in the cluster, Windows Server 2003 vs Windows Server 2008
Partner parallel file systems: IBM GPFS, Panasas ActiveScale, HP PolyServe, Sun Lustre, Ibrix Fusion, Quantum StorNext, Sanbolic Melio
Unix Application Porting
Windows Subsystem for Unix applications: complete SVR-5 and BSD UNIX environment with 300 commands, utilities, shell scripts, compilers
Visual Studio extensions for debugging POSIX applications
Support for 32- and 64-bit applications
Recent port of the WRF weather model: 350K lines of Fortran 90 and C using MPI and OpenMP; traditionally developed for Unix HPC systems; two dynamical cores, full range of physics options
Porting experience: fewer than 750 lines of code changed, primarily in the build mechanism (Makefiles, scripts); level of effort and nature of tasks not unlike porting to any new version of UNIX; performance on par with the Linux systems
F# is... a functional, object-oriented, imperative and explorative programming language for .NET
Example: Taming Asynchronous I/O using System; using System.IO; using System.Threading; public static void ReadInImageCallback(IAsyncResult asyncresult) { public static void ProcessImagesInBulk() ImageStateObject state = (ImageStateObject)asyncResult.AsyncState; { public class BulkImageProcAsync Stream stream = state.fs; Console.WriteLine("Processing images... "); { int bytesread = stream.endread(asyncresult); long t0 = Environment.TickCount; public const String ImageBaseName = "tmpimage-"; if (bytesread!= numpixels) NumImagesToFinish = numimages; public const int numimages = 200; throw new Exception(String.Format AsyncCallback readimagecallback = new public const int numpixels = 512 * 512; ("In ReadInImageCallback, got the wrong number of " + AsyncCallback(ReadInImageCallback); "bytes from the image: {0}.", bytesread)); for (int i = 0; i < numimages; i++) // ProcessImage has a simple O(N) loop, and ProcessImage(state.pixels, you can vary the number state.imagenum); { // of times you repeat that loop to make the stream.close(); application more CPU- ImageStateObject state = new ImageStateObject(); // bound or more IO-bound. state.pixels = new byte[numpixels]; public static int processimagerepeats = 20; // Now write out the image. state.imagenum = i; // Using asynchronous I/O here appears not to be best practice. // Very large items are read only once, so you can make the // Threads must decrement NumImagesToFinish, // It and ends protect up swamping the threadpool, because the threadpool // buffer on the FileStream very small to save memory. // their access to it through a mutex. // threads are blocked on I/O requests that were just queued tofilestream fs = new FileStream(ImageBaseName + i + ".tmp", public static int NumImagesToFinish = numimages; // the threadpool. 
FileMode.Open, FileAccess.Read, FileShare.Read, 1, true); public static Object[] NumImagesMutex = new FileStream Object[0]; fs = new FileStream(ImageBaseName + state.imagenum + state.fs = fs; // WaitObject is signalled when all image processing ".done", is FileMode.Create, done. FileAccess.Write, FileShare.None, fs.beginread(state.pixels, 0, numpixels, readimagecallback, public static Object[] WaitObject = new Object[0]; 4096, false); state); public class ImageStateObject fs.write(state.pixels, 0, numpixels); } { fs.close(); public byte[] pixels; // Determine whether all images are done being processed. public int imagenum; // If not, block until all are finished. public FileStream fs; bool mustblock = false; } lock (NumImagesMutex) } // This application model uses too much memory. // Releasing memory as soon as possible is a good idea, // especially global state. state.pixels = null; fs = null; // Record that an image is finished now. lock (NumImagesMutex) { NumImagesToFinish--; if (NumImagesToFinish == 0) { Monitor.Enter(WaitObject); Monitor.Pulse(WaitObject); Monitor.Exit(WaitObject); } } } } { if (NumImagesToFinish > 0) mustblock = true; } if (mustblock) { Processing 200 images in parallel Console.WriteLine("All worker threads are queued. " + " Blocking until they complete. numleft: {0}", NumImagesToFinish); Monitor.Enter(WaitObject); Monitor.Wait(WaitObject); Monitor.Exit(WaitObject); } long t1 = Environment.TickCount; Console.WriteLine("Total time processing images: {0}ms", (t1 - t0));
Example: Taming Asynchronous I/O
Equivalent F# code (same performance); the async { ... } object coordinates, and ! marks an asynchronous operation:

    let ProcessImageAsync (i) =
        async { let  inStream  = File.OpenRead(sprintf "source%d.jpg" i)  // open the file, synchronously
                let! pixels    = inStream.ReadAsync(numPixels)            // read from the file, asynchronously
                let  pixels'   = TransformImage(pixels, i)
                let  outStream = File.OpenWrite(sprintf "result%d.jpg" i)
                do!  outStream.WriteAsync(pixels')                        // write the result, asynchronously
                do   Console.WriteLine "done!" }

    let ProcessImagesAsync () =
        Async.Run (Async.Parallel
            [ for i in 1 .. numImages -> ProcessImageAsync(i) ])          // generate the tasks and queue them in parallel
Microsoft HPC++ Experience
Application benefits: the most productive distributed application development environment
Cluster benefits: a complete HPC cluster platform integrated with the enterprise infrastructure
System benefits: a cost-effective, reliable and high-performance server operating system
Resources
www.microsoft.com/hpc
www.microsoft.com/science
www.microsoft.com/servers
www.microsoft.com/sql
www.microsoft.com/excel
research.microsoft.com/fsharp
www.osl.iu.edu/research/mpi.net
www.microsoft.com/msdn
www.microsoft.com/technet
Thank you!