Brutus. Above and beyond Hreidar and Gonzales

1 Brutus. Above and beyond Hreidar and Gonzales
Dr. Olivier Byrde, Head of HPC Group, IT Services, ETH Zurich
Teodoro Brasacchio, HPC Group, IT Services, ETH Zurich

2 Outline
- High-performance computing at ETH (Olivier Byrde)
  - The shareholder model
  - Central clusters and their applications
  - Introducing Brutus
- A closer look at Brutus (Teodoro Brasacchio)
  - The Brutus platform
  - Key features
- Next steps (Olivier Byrde)
  - Installation status
  - Integration of Hreidar and Gonzales
  - Extension of Brutus

3 Prologue
- Not so long ago, supercomputing was reserved for an elite
  - Supercomputers were very expensive (at least $10M)
  - They were housed in big supercomputer centers
  - A small army was needed to maintain and operate them
- There was a huge demand for smaller, cheaper systems
  - Some companies made mini-supercomputers ("super-minis")
  - Others made super-workstations
  - Few of them were commercial successes
- Then came the Beowulf revolution
  - Started in 1994 by Donald Becker and Thomas Sterling, NASA
  - Quickly adopted by users and vendors (those still in business!)
  - Completely dominates the HPC landscape today

4 Part I: High-performance computing at ETH
Dr. Olivier Byrde, Head of HPC Group, IT Services, ETH Zurich

5 Shareholder model
- Professors who need lots of computing power pool their resources to finance a large, common cluster
- The IT Services take care of the purchase, maintenance and operation of the cluster
- Professors receive a share of the cluster proportional to their investment
- Shares are valid for the whole lifetime of the cluster
- Advantages
  - Cheaper than buying an individual cluster (economy of scale)
  - No need to worry about maintenance and administration
  - Better utilization of computing resources
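The proportional allocation described on this slide can be illustrated with a minimal sketch. The group names and investment figures below are hypothetical, and the slides do not say how shares are actually computed or rounded; this only shows the "share proportional to investment" idea.

```python
def allocate_shares(investments, total_cores):
    """Split total_cores among shareholders in proportion to their investment."""
    total = sum(investments.values())
    if total <= 0:
        raise ValueError("total investment must be positive")
    return {name: total_cores * amount / total
            for name, amount in investments.items()}


# Hypothetical investments; not taken from the slides.
investments = {"group_a": 300_000, "group_b": 100_000, "it_services": 100_000}
shares = allocate_shares(investments, total_cores=1216)

for name, cores in shares.items():
    print(f"{name}: {cores:.1f} cores")
```

In this sketch the IT Services entry stands for the share that is made available to the whole ETH community.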

6 Shareholder model in practice
- The central clusters of ETH are jointly owned by 22 professors* in 9 departments and the IT Services
- The share of the IT Services is made available to the whole scientific community of ETH
  - Any member of ETH can request an account on the central clusters
  - Shareholders enjoy special privileges
*) Abhari, Aebersold, Anastasiou, Bonhoeffer, Carollo, Govindjee, Gruber, Hiptmair, Katzgraber, Knutti, Koumoutsakos, Kröger, Lilly, Lüthi, Oganov, Öttinger, Parrinello, Pelkmans, Poulikakos, Stelling, Tackley and Troyer

7 Shareholders
[Figure: shareholders of Hreidar and Gonzales]

8 Central clusters
- Hreidar
  - In operation since
  - AMD Opteron processors, GHz (single-core)
  - Gigabit Ethernet network
  - Red Hat Enterprise Linux 3
- Gonzales
  - In operation since
  - AMD Opteron processors, 2.4 GHz (single-core)
  - High-performance Quadrics QsNet II network
  - SuSE Linux 9.2

9 Intended use
- Hreidar
  - Serial applications
  - Parallel applications with little communication
  - Embarrassingly parallel computations (e.g. Monte-Carlo)
- Gonzales
  - Communication-intensive parallel computations (MPI)

10 Observations
- Many users have access to only one cluster
  - Not necessarily the best system for their applications
- Those who use both Hreidar and Gonzales must cope with different:
  - Operating systems
  - Compilers, libraries and applications
  - Usage rules and policies
- These differences prevent an optimal utilization of the resources provided by the central clusters

11 Solution
- Standardize...
  - Same operating system on both clusters
  - Same compilers and libraries
  - Common user name and authentication
- ...and integrate
  - Merge the existing clusters
  - Work in progress

12 Step-by-step approach
- Develop a new cluster platform from scratch
  - Take the best from Hreidar and Gonzales
  - Learn from our mistakes
- Implement and test this platform on a new cluster
  - Easier than changing existing clusters on-the-fly (no down-time)
- Migrate users to the new cluster
  - Beta users first
  - Once the system is proven stable, normal users
- Move Hreidar and Gonzales to the new platform
  - Software upgrade (OS, cluster management, applications)
  - Hardware integration (network, storage, etc.)

13 Introducing the Brutus platform
Better Reliability and Usability Thanks to Unified System

14 Part II: A closer look at Brutus
Teodoro Brasacchio, HPC Group, IT Services, ETH Zurich

15 Brutus platform
- Common platform for existing and future clusters
  - Hardware (network, storage, etc.)
  - Software (OS, compilers, libraries, applications)
  - Batch system (policies, queues, fair-share)
- Simpler to use
  - Single user environment (NETHZ login)
- Simpler to administer
  - The same staff can manage a much larger system
- Better user support
  - The less time we need to manage the system, the more time we can spend supporting our users

16 Brutus user's view

17 Brutus behind the scenes

18 Networks
- Gigabit Ethernet
  - Backbone of the Brutus platform
  - Up to 1200 nodes
- Quadrics QsNet II
  - High bandwidth (870 MB/s sustained)
  - Low latency (1 µs)
  - Up to 512 nodes
- Management network
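To see what the Quadrics figures mean for message passing, here is a rough sketch of the usual latency-plus-bandwidth cost model, time = latency + size / bandwidth. It plugs in the two numbers quoted on the slide and assumes 1 MB = 10^6 bytes; the model is a textbook simplification, not a measurement of Brutus.

```python
LATENCY_S = 1e-6          # 1 microsecond, as quoted on the slide
BANDWIDTH_BPS = 870e6     # 870 MB/s sustained, assuming 1 MB = 10**6 bytes


def transfer_time(message_bytes, latency_s=LATENCY_S, bandwidth_bps=BANDWIDTH_BPS):
    """Estimated time in seconds to send one message (simple linear model)."""
    return latency_s + message_bytes / bandwidth_bps


# Message size at which latency and transmission time contribute equally.
half_point_bytes = LATENCY_S * BANDWIDTH_BPS

for size in (8, 1024, 1024 * 1024):
    print(f"{size:>8} bytes: {transfer_time(size) * 1e6:.2f} microseconds")
print(f"latency and bandwidth terms are equal at about {half_point_bytes:.0f} bytes")
```

Under this model, messages well below a kilobyte are dominated by latency, which is why a low-latency interconnect matters for communication-intensive MPI codes.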

19 Compute nodes
- 2008/2: 280 nodes / 1216 cores
  - 272 nodes with 4 cores (2.8 GHz)
  - 8 fat nodes with 16 cores (2.8 GHz)
- 2008/4: 756 nodes / 2168 cores
  - 272 nodes with 4 cores (2.8 GHz)
  - 8 fat nodes with 16 cores (2.8 GHz)
  - 196 nodes with 2 cores ( GHz), ex-Hreidar
  - 280 nodes with 2 cores (2.4 GHz), ex-Gonzales
- 2008/6: 968 nodes / cores
  - Same as 2008/4, plus at least 212 nodes with either 4 cores (2.8 GHz) or 8 cores (2.2 GHz)
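The node and core totals on this slide can be checked with a short tally. The sketch below uses only the partition sizes listed above; the variable names are illustrative.

```python
# Each partition is (number of nodes, cores per node), as listed on the slide.
partitions_2008_2 = [(272, 4), (8, 16)]
partitions_2008_4 = partitions_2008_2 + [(196, 2), (280, 2)]  # plus ex-Hreidar, ex-Gonzales


def totals(partitions):
    """Return (total nodes, total cores) for a list of partitions."""
    nodes = sum(count for count, _ in partitions)
    cores = sum(count * cores_per_node for count, cores_per_node in partitions)
    return nodes, cores


print("2008/2:", totals(partitions_2008_2))
print("2008/4:", totals(partitions_2008_4))

# The 2008/6 node count adds the extension of at least 212 nodes.
nodes_2008_6 = totals(partitions_2008_4)[0] + 212
print("2008/6 nodes (minimum):", nodes_2008_6)
```

The tallies agree with the slide: 280 nodes / 1216 cores, 756 nodes / 2168 cores, and 968 nodes after the extension.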

20 Storage
- SAN: ~10 TB
  - Storage for home directories
  - Subject to relatively small quota (to be defined)
  - Backed up daily (NetBackup)
- Panasas: ~40 TB
  - Medium-term storage for work directories
  - Short-term storage (scratch) for very large data sets
  - Extremely fast, ideal for I/O-intensive applications
  - No backup
- Users can rent additional disk space if needed

21 Reliability
- No single point of failure... almost
  - Redundant servers for all critical services
  - Redundant storage (SAN, Panasas)
  - Redundant network (Quadrics)
  - Redundant (uninterruptible) power supply
  - Only exception: Ethernet network
- 24x7 availability
  - Login always possible even if a login node has crashed
  - Files always accessible even if a file server has crashed
- No down-time necessary for regular maintenance
  - Redundant components can be upgraded on-the-fly

22 System
- Single access point: brutus.ethz.ch
- Single contact for support: cluster-support@id.ethz.ch
- Central user authentication: NETHZ
- Operating system: Red Hat Enterprise Linux 5
- Batch system: LSF HPC 6.2
- Compilers: GCC 4.1, Intel 10.1, PGI

23 Applications
- Brutus can handle all the applications currently running on Hreidar and Gonzales
  - Serial and embarrassingly parallel
  - Communication-intensive
- It also opens the door to a new range of applications
  - OpenMP up to 16 threads
  - Commercial applications (CFX, Fluent, MATLAB, etc.)
- The batch system takes care of the allocation of resources based on an application's requirements

24 Peak performance
[Figure: peak performance in teraflops of Hreidar, Gonzales and Brutus for 2008/2, 2008/4 and 2008/6]

25 Part III: Next steps
Dr. Olivier Byrde, Head of HPC Group, IT Services, ETH Zurich

26 Disclaimer
- All the information presented hereafter is valid as of March 5, 2008
- Things are changing extremely rapidly
- Go to for up-to-date information

27 History
- 2006
  - June: first discussions with potential shareholders
- 2007
  - January-March: definition of the cluster's specifications
  - April-May: call for tender
  - June-July: evaluation of all offers, selection of winning bid
  - October: purchase approved by Executive Board of ETH
  - December: hardware delivery and installation
- 2008
  - January: hardware tests, software installation, alpha users
  - February: acceptance tests, beta users
  - March: third-party software installation, normal users

28 Current status
- Brutus has passed all basic hardware tests
  - About 5% of the compute nodes failed during these tests
  - All have been repaired or exchanged
- Ethernet part is functional
  - Software installation and configuration is in progress
  - Compilers, libraries and some applications are available
  - Open to beta users
- Quadrics part is still in alpha stage
  - Hardware installation is not complete yet
  - Software installation has just started
  - Testing will start immediately thereafter
  - Can be used for benchmarking purposes if needed

29 Roadmap (subject to change)
- March 2008
  - Interconnection of Quadrics networks of Brutus and Gonzales
  - Gradual opening of Brutus to normal users
  - Ordering of Brutus extension (200+ nodes, Ethernet & Quadrics)
- April
  - Integration of Hreidar and Gonzales into Brutus (start)
  - Migration of all cluster users to Brutus
- May
  - Integration of Hreidar and Gonzales into Brutus (end)
- June
  - Delivery and installation of Brutus extension

30 Beta users
- A lot of software still has to be installed and/or tested
  - Compilers
  - Scientific libraries
  - Batch system
  - MPI and OpenMP applications
  - Third-party applications
- We are looking for volunteers!
  - Please contact cluster-support@id.ethz.ch to apply for a beta user account

31 User migration
- Hreidar users must request a new account
  - Some users will have a different username on Brutus
  - External users may need to apply for a NETHZ account first
  - About 50 applications have been received so far
- Gonzales users will get an account automatically if
  - Their NETHZ account is still valid
  - They ran jobs on Gonzales in the last 6 months
  - About 150 users meet these requirements
- New users
  - New accounts will be created once Brutus is fully operational
  - In the meantime new users may apply for a beta user account
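The rule for automatic Gonzales accounts combines two conditions, which can be written as a small predicate. This is only a sketch: the function name and arguments are hypothetical, and "the last 6 months" is approximated as 183 days, which is an assumption rather than the HPC group's actual cut-off.

```python
from datetime import date, timedelta

# Assumption: "last 6 months" approximated as 183 days.
ACTIVITY_WINDOW = timedelta(days=183)


def gets_automatic_account(nethz_valid, last_job_date, today):
    """True if a Gonzales user meets both conditions listed on the slide."""
    if not nethz_valid or last_job_date is None:
        return False
    return today - last_job_date <= ACTIVITY_WINDOW


today = date(2008, 3, 5)  # date of the presentation
print(gets_automatic_account(True, date(2008, 1, 10), today))   # recent job, valid account
print(gets_automatic_account(True, date(2007, 6, 1), today))    # last job too long ago
print(gets_automatic_account(False, date(2008, 1, 10), today))  # NETHZ account expired
```

Users who fail either condition would fall back to requesting an account, as Hreidar users do.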

32 Very important!
- Hreidar and Gonzales will cease to exist as independent clusters in May 2008
- It is your responsibility to verify that all your applications will run on Brutus
- We will be happy to help you, but we cannot do all the work for you
- Do not wait until the last minute!

33 Brutus extension
- We are preparing the next cluster extension and expect to place an order in March 2008
- Due to power constraints, this will be the only extension this year
- We already have firm commitments for over 200 nodes (Ethernet and Quadrics)
- About 50 nodes are still up for grabs
- If you would like to take this opportunity to become a shareholder or to increase your share, please contact Olivier Byrde, byrde@id.ethz.ch

34 Thank You
