Grid Computing Competence Center Large Scale Computing Infrastructures (MINF 4526 HS2011)


1 Grid Computing Competence Center Large Scale Computing Infrastructures (MINF 4526 HS2011) Sergio Maffioletti Grid Computing Competence Centre, University of Zurich March 18, 2012

2 Overview of the course
Theme 1: From local execution to distributed computing through clouds
Theme 2: Overview of large scale infrastructures for scientific computing (grids and clouds)
Theme 3: Scientific applications' challenges
Theme 4: Security
Theme 5: Python basics (yes, you need it)
Theme 6: Data handling and information processing
Theme 7: Let's put everything together and solve some real problems

3 What will you learn here?
What characterizes a large scale computing infrastructure?
Why could such an infrastructure be beneficial for scientific research?
Why does scientific research demand large amounts of computing resources?
How do we map a scientific use case to infrastructure requirements?
What challenges need to be addressed when porting a scientific use case to a large scale infrastructure?

4 Practical information
Dates and location: Wednesday 12:00-14:00, Room BIN-2.A.10
Exercises: Thursday 16:00-18:00, Room BIN-1.D.12
Course web link: HS11/suche/e details.html
Individual projects: During the course, each attendee will develop an individual project centered on one of the themes discussed throughout the course
Exam: dates and modalities still to be defined
Learning material: At the end of each class we will provide pointers to online documentation

5 Lecture 1: From local computing to distributed systems through clouds
Let's look at the application's execution profile:
1. Local systems
2. Cluster systems
3. Distributed systems
Slides available for download from:

6 Local System
Execute an application on your personal computer.
Everything is locally available: application, input data and results
Dedicated system (most of the time, the user has 100% control)
Performance depends on the local machine
Reliability depends on the application
Sequential execution mode
No scalability issues (provided one has time to wait until all data are processed sequentially)
Exclusive access (own account): 1 username + 1 password

7 Local System
Single resource
Single owner
No particular security requirements or access policies
Reliable environment (you know your laptop!)
Resource is homogeneous
Local resource (you're sitting in front of it!)
No resource management policies
No specific network connectivity

8 Question 1
Given a single-threaded application and 10 input files to analyze (1 application execution per input file), and a 4-core machine with a single SATA disk, how can we reduce the overall execution time?
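One possible line of attack on Question 1, sketched in Python: because the application is single-threaded and the runs are independent, up to four executions can proceed at the same time, one per core. The `analyze` function and the file names below are hypothetical placeholders; in practice each worker would launch the real application binary (for example via `subprocess`). The sketch ignores the single SATA disk, which may well become the bottleneck.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def analyze(path):
    # Hypothetical stand-in for one application execution on one input file.
    return f"result for {path}"

def run_all(inputs, cores=4):
    # Run one independent execution per input file, at most `cores` at a time.
    with ThreadPoolExecutor(max_workers=cores) as pool:
        return list(pool.map(analyze, inputs))

def rounds_needed(n_inputs, cores):
    # If every run takes the same time, the work finishes in
    # ceil(n_inputs / cores) rounds instead of n_inputs sequential runs.
    return math.ceil(n_inputs / cores)

inputs = [f"input_{i}.dat" for i in range(10)]
results = run_all(inputs)
```

Under the equal-run-time assumption, `rounds_needed(10, 4)` gives 3 rounds rather than 10 sequential runs, provided the executions do not contend for the disk.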

9 Question 2
If each application execution generates an I/O throughput of 50 MB/s r/w (and assuming disk access performance is 100 MB/s r/w), how should we distribute the load to optimize the throughput?
Note: this exercise can also be done considering memory bandwidth.
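The arithmetic behind Question 2 can be explored with a deliberately simple model, sketched here in Python. It assumes the disk bandwidth is shared evenly among concurrent executions and ignores seek overhead and caching; the function names are illustrative, and only the 50 MB/s and 100 MB/s figures come from the exercise.

```python
def aggregate_throughput(n_jobs, per_job_mb_s=50, disk_mb_s=100):
    # Total throughput is capped by the disk, however many jobs run.
    return min(n_jobs * per_job_mb_s, disk_mb_s)

def per_job_throughput(n_jobs, per_job_mb_s=50, disk_mb_s=100):
    # Each job gets its full rate until the disk saturates,
    # then an equal share of the disk bandwidth.
    return aggregate_throughput(n_jobs, per_job_mb_s, disk_mb_s) / n_jobs

for n in range(1, 5):
    print(n, aggregate_throughput(n), per_job_throughput(n))
```

In this model two concurrent executions already saturate the disk, so running four at once on the 4-core machine does not raise the aggregate throughput; each run simply gets a smaller share.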

10 What do we learn here?
From these exercises we learn that we need to profile the application as well as the entire experiment (e.g. the 10 input files to analyze) against the available platform to understand which high-throughput computing (HTC) approach to follow. This also applies the other way round: the better we understand the behaviour of the application and the experiment, the better we can plan the computing and data infrastructure.

11 Cluster System
What is a cluster? A collection of parallel or distributed processing systems that are interconnected by a high-speed network and work as a single integrated computing resource.

12 Cluster System
[Figure: layered diagram. Applications; Queue 1, Queue 2, ..., Queue N; Local Resource Management System; Node1, Node2, ..., NodeK; cluster interconnection network/switch]

13 Example of PBS structure

14 Cluster System
Most of the time, everything is available on the cluster: application, input data and results
Minimal control over the system
Network file server involved (NFS, Lustre, GPFS, ...)
Execution needs to be described (i.e. resource requirements)
Performance can be tuned by adapting the execution to the hosting environment (e.g. local storage vs. network file server)

15 Cluster System, cont.
Shared access
Own account (configured by a system administrator)
1 username + 1 password (the same on each node of the cluster)
Reliability I: also depends on how the application behaves during execution
Reliability II: may be affected by the reliability of the execution node(s)
Asynchronous execution (controlled by a Local Resource Management System)
Parallel execution (having more nodes at one's disposal)
Scalability is measured against the entire system

16 Cluster System
Multiple resources
Owned by a single institution
Single security and access policies
Volatile environment (it is always better to check before starting execution)
Resources are homogeneous
Resources are within your institution's campus
Single resource management policies
May have structured network connectivity within the university campus and on the Internet

17 Goals of a batch management system
Administrative goals:
Maximize utilization and cluster responsiveness
Tune fairness policies and workload distribution
Automate time-consuming tasks
Troubleshoot job and resource failures
Integrate new hardware and cluster services into the batch system
User goals:
Manage current workload
Identify available resources
Minimize workload response time
Track historical usage
Identify effectiveness of prior submissions

18 Example of Resource requirements
cput: max CPU time used by all processes in the job
pcput: max CPU time used by any single process in the job
mem: max amount of physical memory used by the job
pmem: max amount of physical memory used by any process of the job
vmem: max amount of virtual memory used by the job
pvmem: max amount of virtual memory used by any process of the job

19 Example of Resource requirements cont.
walltime: max wall-clock time during which the job can be running
file: the largest size of any single file that may be created by the job
host: name of the host on which the job should be run
nodes: number and/or type of nodes to be reserved for exclusive use by the job
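As an illustration of how the resource requirements listed above end up in a job description, the following Python sketch assembles a PBS `#PBS -l` directive from name/value pairs. The helper name and the values are hypothetical; which resource names and value formats are accepted depends on the PBS flavour and on the site configuration.

```python
def pbs_resource_directive(**limits):
    # Build a "#PBS -l name=value,..." line from keyword arguments.
    # Resource names come from the list above; values are illustrative only.
    parts = [f"{name}={value}" for name, value in limits.items()]
    return "#PBS -l " + ",".join(parts)

directive = pbs_resource_directive(walltime="02:00:00", mem="2gb", cput="01:30:00")
print(directive)
```

The resulting line would typically be placed near the top of the job script that is handed to the batch system.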

20 Question 1
Given a homogeneous cluster with 4 nodes (4 cores per node), a pre-installed application binary, and 100 input files to analyze (1 application run per input file), how do we distribute the load?
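One way to reason about this question is a static round-robin partition of the input files over the 16 available cores, sketched below in Python. The file names are hypothetical, and the sketch assumes that all runs take roughly the same time.

```python
def partition(items, n_slots):
    # Round-robin assignment: slot k gets items k, k + n_slots, k + 2*n_slots, ...
    return [items[k::n_slots] for k in range(n_slots)]

files = [f"input_{i:03d}.dat" for i in range(100)]
slots = partition(files, n_slots=4 * 4)  # 4 nodes x 4 cores per node
sizes = [len(s) for s in slots]
```

With 100 files and 16 slots, four slots receive 7 files and twelve receive 6, so the slowest slot performs 7 runs. If run times vary a lot, submitting 100 independent jobs and letting the Local Resource Management System place them as cores become free would likely balance the load better than a fixed partition.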

21 Question 2
If the application and data are available on a network filesystem (let's say NFS) and each execution node has a local disk large enough to hold 4 input files, how can we improve the overall performance?
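A common approach to this kind of question is to stage input files from the shared filesystem to the node's local disk in batches, run the analysis on the local copies, and free the space before the next batch. The Python sketch below shows the idea under stated assumptions: the function name, the `analyze` callback and the demo files are hypothetical, and a temporary directory stands in for both the NFS share and the local scratch disk.

```python
import shutil
import tempfile
from pathlib import Path

def process_in_batches(shared_inputs, analyze, batch_size=4):
    # Stage up to `batch_size` files to local scratch, analyze the local
    # copies, then let the scratch directory be deleted before the next batch.
    results = {}
    for start in range(0, len(shared_inputs), batch_size):
        batch = shared_inputs[start:start + batch_size]
        with tempfile.TemporaryDirectory() as scratch:
            # shutil.copy returns the path of the newly created copy.
            local_copies = [Path(shutil.copy(src, scratch)) for src in batch]
            for local in local_copies:
                results[local.name] = analyze(local)
    return results

# Demo: a temporary directory plays the role of the shared filesystem.
with tempfile.TemporaryDirectory() as shared:
    inputs = []
    for i in range(6):
        p = Path(shared) / f"input_{i}.dat"
        p.write_text(f"data {i}")
        inputs.append(p)
    results = process_in_batches(inputs, analyze=lambda path: path.read_text())
```

The batch size of 4 mirrors the local disk capacity given in the question; whether staging pays off depends on how often each file is read and on how loaded the file server is.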
