Clustering. Research and Teaching Unit

Nathaniel Doyle

1 Clustering Research and Teaching Unit

2 Disclaimer: "...though it cannot hope to be useful or informative on all matters, it does at least make the reassuring claim, that where it is inaccurate it is at least definitively inaccurate." Douglas Adams, The Hitchhiker's Guide to the Galaxy. I don't know how the slides will transition either.

3 Overview: Quick history of cluster usage; Current hardware; Software; Usage and cost; Future options

4 Clustering - history: Initial cluster set up (at third attempt) in Dell GX WS530; added Condor pools of lab/staff desktops

5 Eddie Mk 1

6 Eddie Mk 1: University-wide service; minimal free usage; chargeable for major users; 512 cores (Intel x86_64); SL4?/gridengine/GPFS

7 Eddie Mk 2: Not biggest bang for bucks; bought to a total cost over a 5 year period; power/cooling costs included; bought in 2 stages (~2000 cores); stage quad-core Xeons; stage six-core Xeons; backed by large filestore (IO performance); small GPU provision

8 Eddie Mk 2

9 Current School provision: >184 cores (we're not really competing); 24 PE SC1425 dual core; 34 PE SC1425 quad core; small pile of desktops; 3.7TB GPFS filesystem provided by 4 file nodes (PE 750) with 1TB SATA disks; 2 head nodes + 1 scheduler node

10 (SUN^wOracle/open/s)Gridengine: Bought by Sun, Sun bought by Oracle; open sourced by Sun (took several years); closed again by Oracle; multiple forks, we run an older version of the open source code. Job scheduler: users submit jobs to queue(s); scheduler matches jobs to resources based on free slots, priority, prior usage and available licenses; can schedule parallel jobs (MPI etc.)
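The matching step on this slide can be sketched as a toy greedy pass. This is not how gridengine is implemented; the `Job` fields, the ordering rule (higher priority first, then owners with less prior usage) and the license bookkeeping are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    name: str
    owner: str
    slots: int                     # number of slots the job asks for
    priority: int = 0              # higher value is considered first
    license: Optional[str] = None  # hypothetical license token the job needs


def schedule(jobs, free_slots, prior_usage, licenses):
    """Return the names of jobs that could start now (toy greedy pass)."""
    licenses = dict(licenses)  # work on a copy so the caller's dict is untouched
    # Higher priority first; ties go to the owner with less prior usage.
    order = sorted(jobs, key=lambda j: (-j.priority, prior_usage.get(j.owner, 0.0)))
    started = []
    for job in order:
        if job.slots > free_slots:
            continue  # not enough free slots for this job
        if job.license is not None:
            if licenses.get(job.license, 0) < 1:
                continue  # no license seat left
            licenses[job.license] -= 1
        free_slots -= job.slots
        started.append(job.name)
    return started
```

A job that does not fit is simply skipped, so smaller jobs behind it can still start; a real scheduler would also need reservations so that large parallel jobs are not starved.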

11 General Parallel File System (GPFS): IBM product, originally for multimedia; commercial software, we have a historic academic license; POSIX compliant (API, ACL, locking); kernel driver with userspace daemons, presents as a mountable device; fstab: /dev/gpfsdev /gpfs gpfs noauto 0 0; df -k: /dev/gpfsdev 3.7T 2.1T 1.7T 56% /gpfs; support for HA, DMAPI, HSM and ILM

12 GPFS architecture (how IBM sell it)

13 GPFS architecture

14 GPFS Architecture: Break files down into small chunks (~1M); stripe across multiple disks/nodes; think RAID 0 but with fileservers not disks; Singular Array of Expensive Nodes?
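The "RAID 0 with fileservers" idea can be illustrated with a few lines of Python. This is only a sketch of round-robin striping under assumed names (`stripe`, `reassemble`, node labels); it says nothing about how GPFS actually lays out or locates blocks.

```python
def stripe(data, nodes, chunk_size=1024 * 1024):
    """Split data into fixed-size chunks and deal them round-robin to nodes."""
    layout = {node: [] for node in nodes}
    for index, offset in enumerate(range(0, len(data), chunk_size)):
        node = nodes[index % len(nodes)]
        # Keep the chunk index so the file can be put back in order later.
        layout[node].append((index, data[offset:offset + chunk_size]))
    return layout


def reassemble(layout):
    """Rebuild the original bytes from every node's chunks."""
    chunks = sorted(chunk for held in layout.values() for chunk in held)
    return b"".join(payload for _, payload in chunks)
```

Because consecutive chunks sit on different nodes, a large sequential read can pull from all nodes at once, and, as with RAID 0, losing one node loses part of every large file unless a second copy is kept.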

15 GPFS: Read/write data in parallel; metadata can be stored separately (SSD on node, dedicated nodes); can opt to store two copies of data to improve resilience (storage penalty); filesystems are resilient-ish, can drop a chunk of nodes out of the filesystem

16 GPFS Architecture (what we actually have)

17 Network Shared Disk

18 Our setup: All gridengine nodes in GPFS; one filesystem mounted as /gpfs using redundant storage; for admin commands root must ssh anywhere, using ssh keys and a very long passphrase; 5 NSDs: 4x1T data servers and 1x80G metadata server

19 Goodies we don't use: Storage Pools; can use to define a tiered filesystem: Tier 1 local disk on node (fast), Tier 2 diskspace on NSD (slower); policy to store most recently accessed data on tier 1 and migrate older data to tier 2 (transparently)
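The tiering policy described here amounts to a rule keyed on last access time. The sketch below is plain Python, not GPFS policy syntax (GPFS has its own rule language, which is not reproduced here); the function name, the tier labels and the age threshold are assumptions.

```python
def assign_tiers(last_access, now, max_age_seconds):
    """Map each path to 'tier1' if accessed recently enough, else 'tier2'.

    last_access: dict of path -> last access time (seconds since some epoch)
    now:         current time on the same clock
    """
    return {
        path: "tier1" if now - atime <= max_age_seconds else "tier2"
        for path, atime in last_access.items()
    }
```

A migration job would then compare this target placement with where each file currently lives and move only the ones that differ.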

20 Remote mounting of filesystems

21 Fully tiered filesystem (HSM): Data Management API (DMAPI); filesystem (jfs, xfs...); distributed filesystem (GPFS); tape storage (HSS); can move files transparently from local hard disk to tape storage (and back again) based on policy engine

22 Issues: RPM distribution is horrible; upgrades override original RPMs; huge pre and post install scripts; need for root to ssh into nodes; can do evil things to network; commercialisation means we don't get the nice new features

23 Hadoop: Framework for distributed applications; based on Google's MapReduce and Google File System; Java based; provides a filesystem and a job scheduling system

24 The Infamous map and reduce. "Map": The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. The worker node processes the smaller problem, and passes the answer back to its master node. "Reduce": The master node then collects the answers to all the sub-problems and combines them in some way to form the output.
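The two steps above can be sketched in a single process with a word count, the usual toy example. This does not use Hadoop's API; `map_phase`, `reduce_phase` and `run` are illustrative names, and the "workers" are just loop iterations.

```python
from collections import defaultdict


def map_phase(chunk):
    """Worker step: turn one piece of input into (key, value) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def reduce_phase(pairs):
    """Master step: combine all partial answers by key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run(chunks):
    """Split work across 'workers' (here a plain loop), then reduce."""
    intermediate = []
    for chunk in chunks:  # on a cluster each chunk would go to a worker node
        intermediate.extend(map_phase(chunk))
    return reduce_phase(intermediate)
```

Calling `run(["the cat", "The dog the"])` counts each word across both chunks; the map calls are independent of each other, which is what lets a cluster run them in parallel.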

25 Hadoop Architecture: Master node to keep track of HDFS data (namenode); second master handles job assignments (jobtracker)

26 HDFS: Hadoop can be made aware of the network topology of the cluster; admin breaks cluster into racks; files are broken into blocks (~64M) stored in multiple locations across the cluster (2 in same rack); metadata held on namenode; data location is exposed to jobtracker; jobtracker will try to schedule jobs near where required data is stored
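A rough sketch of the block splitting and rack-aware placement described on this slide follows. It assumes three copies per block, two in one rack and one in another, and a simple rotation to spread blocks; the real namenode uses its own placement policy, and the function names and rack layout here are hypothetical.

```python
import math


def split_into_blocks(file_size, block_size=64 * 1024 * 1024):
    """Number of blocks needed for a file of file_size bytes."""
    return math.ceil(file_size / block_size)


def place_replicas(block_index, racks):
    """Pick three nodes for a block: two in one rack, one in another.

    racks: dict of rack name -> list of node names.
    Assumes at least two racks and at least two nodes in each rack.
    """
    rack_names = sorted(racks)
    primary = rack_names[block_index % len(rack_names)]
    other = rack_names[(block_index + 1) % len(rack_names)]
    local = racks[primary]
    first = local[block_index % len(local)]
    second = local[(block_index + 1) % len(local)]
    remote = racks[other]
    third = remote[block_index % len(remote)]
    return [first, second, third]
```

With the placement known, a scheduler can prefer a node that already holds a copy of the block a task needs, falling back to another node in the same rack.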

27 Our Implementation: Runs over the 58 gridengine nodes plus the pile of desktops; both schedulers will ignore nodes which are already loaded; 1 PE 750 with 1T drive for a bit more filespace; HDFS accessible via hadoop command or via web interface at:

28 Issues: Need to shut down cluster to add/remove nodes. Authentication: HDFS access based on `whoami`; if your map tasks spawn subtasks they run as hadoop user (root equivalent on HDFS). Authors need to read xkcd: users can embed HTML in job description which gets run on jobtracker web interface.
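The last issue is a script-injection problem: user-supplied text is placed into a web page unescaped. A minimal sketch of the usual fix, using Python's standard `html.escape`, is shown below; `render_job_row` is a hypothetical helper, not part of the jobtracker (which is written in Java).

```python
import html


def render_job_row(job_name, description):
    """Build one table row, escaping user-supplied text before embedding it."""
    return "<tr><td>{}</td><td>{}</td></tr>".format(
        html.escape(job_name),
        html.escape(description),
    )
```

With escaping in place, a description such as `<script>alert(1)</script>` is displayed as text rather than executed by the browser.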

29 GPU usage: Recent interest in Nvidia; Roswell (ANC): dual GTX480 (2x480 cores); Rendlesham (ILLC): dual GTX580 (2x512 cores); Bonniebridge (ILLC): dual GTX680 (2x1536 cores); ILLC are looking to add two GTX690 (2x3072 cores) to Bonniebridge and order an additional machine; using Nvidia CUDA parallel computing platform

30 GPU usage: CUDA: parallel computing architecture developed by Nvidia for graphics processing; accessible via C, Perl, Python, Ruby,...; suits parallel processing of large blocks of data slowly rather than a stream quickly; no current school provision

31 Costs (last 5 years): Capital costs: 5x1TB disks 235 (*), 5xSATA power cables 5. Development costs: approx 15 days (install of SL5 and GPFS). Operating costs: support time ?? but not high <cough>; power, when we start paying for it. (*) currently selling for 250 on eBay

32 Power usage (not including pile of desktops): GPFS NSDs: 5xPE750 (0.45A idle, 0.69A loaded) = ; Infrastructure nodes: 2xPE750 (0.45A idle, 0.69A loaded), 1xPE1425 1cpu (0.58A idle, 0.81A loaded) = ; Worker nodes: 24xPE1425 1cpu (0.58A idle, 0.81A loaded), 34xPE1425 2cpu (0.79A idle, 1.25A loaded) = 8,300-12,600/year. Total cost: 9,100-13,000/year
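The arithmetic behind figures like these can be sketched as follows. The per-node currents for the worker nodes come from the slide; the supply voltage and the electricity tariff are not given there, so they are left as parameters the caller must supply, and the calculation ignores power factor.

```python
# (label, node count, idle amps per node, loaded amps per node), from the slide
WORKERS = [
    ("PE1425 1cpu", 24, 0.58, 0.81),
    ("PE1425 2cpu", 34, 0.79, 1.25),
]


def total_current(groups):
    """Sum idle and loaded current (amps) over all node groups."""
    idle = sum(count * idle_a for _, count, idle_a, _ in groups)
    loaded = sum(count * loaded_a for _, count, _, loaded_a in groups)
    return idle, loaded


def annual_cost(amps, volts, price_per_kwh):
    """Yearly cost of a constant draw; volts and price are caller's assumptions."""
    kilowatts = amps * volts / 1000.0
    return kilowatts * 24 * 365 * price_per_kwh
```

Feeding the idle and loaded totals from `total_current(WORKERS)` into `annual_cost` with the local mains voltage and tariff gives the lower and upper bounds of a yearly range for the worker nodes.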

33 Usage: Gridengine: ~20 users/year; ~39% usage (of total cputime); usage tends to be in spikes and is slowly falling off; tends to get used because software is not available on Eddie, Eddie is too busy, or users want sole access to the whole cluster. Hadoop: Extreme MSc computing students & degree course; research usage; about 40% usage

34 Possible savings: Bin the whole cluster: save ~11K; what about the current users? Replace with new kit: 50K+; replace with same number of cores: 30K; replace with discards/kit in storage and some spending? Leverage some/all of school share in Eddie: admin difficult with hardware at ECDF; need ~10% of Eddie nodes for at least semester 1

35 Possible savings: Double up CPUs in single CPU nodes: saves /year, but reduces the memory/CPU ratio. Bin the single CPU 1425s: saves /year, reduces the core count by 48. Switch off the cluster when not in use: saves ~5,800; hard to implement in Hadoop
