Scientific Computing in practice

1 Scientific Computing in practice Kickstart 2013 Ivan Degtyarenko, Tapio Leipälä, Mikko Hakala, Janne Blomqvist School of Science, Aalto University June 10, 2013 slide 1 of 143

2 What is it? Part of the Scientific Computing in Practice: kickstart 2013 course, Mon 10.6 at 09:15-12:15. Overview of scientific computing concepts and terminology; hardware architectures, interconnects, live examples; Triton practicalities: common, jobs with SLURM, IO with Lustre, environment, Matlab; code compilation on Triton slide 2 of 143

3 Computing resources pyramid. Supercomputing in the EU: PRACE facilities. Supercomputing at CSC: Sisu, Taito, Louhi, Vuori. Midrange computing: the Aalto Triton cluster and Grid computing by Science-IT. Cloud computing and departments' local servers; on workstations: Condor, local runs slide 3 of 143

4 Science-IT at Aalto. Computational studies are within the top-5 highest priorities at Aalto. Midrange scientific computing resources for Aalto researchers: the Triton cluster. Wide collaboration between Finnish universities and CSC, part of the European grid infrastructure: collaboration started in 2005 under M-grid; FGI is a wider consortium. Joint procurement and administration collaboration, with a focus on serving the (special) needs of Aalto's computational science, i.e. raw CPU power, Matlab, big data sets, large memory servers, GPU computing, Hadoop slide 4 of 143

5 Science-IT management. Management board: Prof. Martti Puska (SCI/PHYS), Prof. Jari Saramäki (SCI/BECS), Prof. Keijo Heljanko (SCI/ICS, chair), Prof. Mikko Kurimo (ELEC). Stakeholder model: all costs are distributed to schools (departments) based on the agreed Science-IT share; key departments (BECS, PHYS, ICS) contribute 50% to the administrative personnel responsible for day-to-day tasks; a local support person is named at each participating unit; 20% of resources go to the Grid => freely available to the Finnish research community via the grid interface; a small share (5%) is kept for testing; users outside the stakeholder model. slide 5 of 143

6 Science-IT support team. The team: Tapio Leipälä (ICS/SCI), Mikko Hakala, M.Sc. (BECS/SCI), Janne Blomqvist, D.Sc. (Tech) (PHYS+BECS/SCI), Ivan Degtyarenko, D.Sc. (Tech) (PHYS/SCI); responsible for daily cluster maintenance. The wiki area at wiki.aalto.fi (look for the Triton User Guide) and the support tracker at tracker.triton.aalto.fi are easy entry points for new users. Local support persons, named by the departments as local Science-IT support team members, provide department user support close to the researchers and act as the contact persons between schools (departments) and Science-IT support. slide 6 of 143

7 Why supercomputing? Primarily to increase the problem scale, secondarily to decrease the job runtime. It also makes it possible to serve a large number of user applications in one place. Other reasons: physical constraints prevent CPU frequency scaling, a principal limitation of serial computing. Supercomputing is often the only way to achieve specific computational goals at a given time slide 7 of 143

8 Real HPC by examples ORNL's Titan, #1 of top500.org on Nov 2012 Linpack performance: 17.6 PFlop/s Theoretical performance: 27.1 PFlop/s slide 8 of 143

9 LLNL's Sequoia, #2 on Nov 2012: 64-bit IBM PowerPC A2, 1.6 GHz, 16-way SMP; 20.1 PFlop/s peak, 16.3 PFlop/s Linpack; 1.6 PByte memory; 96 racks; 98,304 compute nodes; 1,572,864 CPU cores; 5D interconnect; water cooling slide 9 of 143

10 RIKEN AICS K computer, #3 on Nov 2012: 64-bit SPARC64 VIIIfx, 2.0 GHz, 8-way SMP; 11.3 PFlop/s peak, 10.5 PFlop/s Linpack; 1.41 PByte memory; 705,024 CPU cores; 864 racks; 88,128 nodes; 6D interconnect (Tofu); water cooling slide 10 of 143

11 RIKEN AICS K computer (cont.) The only computer with its own train station 4th floor: K computer 3rd floor: Computer cooling 2nd floor: Disks 1st floor: Disk cooling slide 11 of 143

12 K computer building: earthquake security, with dampers for vertical, horizontal and wiggle movements plus flexible pipes. slide 12 of 143

13 SuperMUC, #5 on Nov 2012. Thin node: 2 x 64-bit Intel Sandy Bridge EP 2.7 GHz, 8-way SMP; Fat node: 4 x Intel Westmere EX 2.4 GHz, 10-way SMP. 3.19 PFlop/s peak, 2.9 PFlop/s Linpack; 340 TByte memory; 9,216 thin / 205 fat nodes; 155,656 total cores; InfiniBand FDR10 interconnect; warm-water cooling (in 30 °C, out 50 °C) slide 13 of 143

14 Photos: TUM Computer Science machine hall; LRZ SuperMUC water-cooling infrastructure; machine hall raised floor. slide 14 of 143

15 Top500.org snapshots slide 15 of 143

16 Local examples. At Aalto: Triton, an HP cluster (SL390s and BL465s interconnected with InfiniBand). At CSC: Sisu (Cray XC30) and Taito (HP cluster); outdated: Louhi (Cray XT4/XT5) and Vuori (HP cluster); see the CSC HPC resources pages slide 16 of 143

17 Triton: overview. 238 compute nodes with 12 CPU cores each, a mix of Xeon and Opteron nodes; InfiniBand interconnect; GPU computing (Tesla M2090 cards); fat (large memory) nodes with 1 TB of RAM; high-performance Lustre filesystem, capacity 600 TB; operating system: Scientific Linux 6 slide 17 of 143

18 Triton tech specs in detail
HP SL390s G7 nodes: 2x Xeon 2.67 GHz (6-core), 48 GB RAM; 2x local SATA disks (7.2k) -> 800 GB of local storage; 11 of these servers carry 2x Tesla M2090 GPUs (Fermi, 6 GB memory) and have 24 GB RAM
HP BL465c G6 nodes: 2x AMD Opteron 2.6 GHz (6-core); 200 GB local storage; 32/64 GB RAM
HP DL580 G7 nodes: 4x 2.67 GHz (6-core); 1300 GB local storage; 1024 GB RAM
slide 18 of 143

19 Triton tech specs in detail (cont.)
Lustre storage system: 600 TB raw capacity; 3-4 GB/s stream (read/write); DDN SFA10k system
Interconnect: SL390, DL580: QDR InfiniBand (5 GT/s); BL465: DDR InfiniBand (2.5 GT/s); 10 Gb Ethernet in/out
Operating system & software: Scientific Linux 6.x (RHEL based); CVMFS (apps over the grid sites); SLURM as the batch system; MDCS, Matlab Distributed Computing Server, with 128 seats; modules + lots of software
slide 19 of 143

20 Triton stats 2012. Users (2012): 178 active users from 10 departments and 3 schools. Jobs (6/12-12/12): jobs totalling 4.1 M CPU-hours computed; utilization ~100%. Figures (not reproduced here): right top, memory scattering of jobs; right bottom (Dec 2012), total CPU-time per job type vs. wait time in the queue, for serial, small parallel (2-12 cores) and large parallel (12+ cores) jobs. slide 20 of 143

21 Triton vs others. Triton vs CSC: no billing units, student access, faster queues, Aalto users only, Aalto-licensed software; less CPU capacity, no really large runs over thousands of CPU cores. Triton vs Condor: large MPI runs, large memory runs, wide choice of software packages, queuing. Triton vs Grid: locality, less room for debugging; queuing slide 21 of 143

22 Computing Thesaurus
Parallel Computing: using more than one processing unit to solve a computational problem
Supercomputing and High Performance Computing (HPC): use of the fastest and biggest machines with fast interconnects and large storage capabilities to solve large computational problems
Grid (aka Distributed) Computing: solving a task by simultaneous use of multiple isolated, often physically distributed heterogeneous computer systems connected with Grid middleware
Embarrassingly Parallel: solving many similar, but independent, tasks, e.g. parameter sweeps; often associated with Grid or Condor computing
GPU computing (or other accelerators): use of GPU cards (or other accelerators) to solve computational problems
Cloud computing: computing as a utility, cf. electricity, water. Role in future scientific computing?
slide 22 of 143

23 flop/s. FLOPS (or flops or flop/s), FLoating point OPerations per Second, is a measure of a computer's performance: the product of the number of cores, the clock rate each core runs at, and the number of double-precision (64-bit) floating point operations each core performs per cycle (nowadays a factor of four/eight). Theoretical peak performance of Vuori: 272 nodes x 12 cores x 2.6 GHz x 4 ≈ 34 Tflop/s; LINPACK max performance 27 Tflop/s. The top 10 of top500 are petaflop computers, i.e. they perform 10^15 floating point operations per second; your calculator is >10 flop/s, your quad-core PC about 25 Gflop/s, a single 12-core Triton compute node about 125 Gflop/s, an Nvidia Tesla K20 GPU computing processor around 1.2 Tflop/s slide 23 of 143
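
As a rough cross-check of the peak formula above, a Triton Xeon node quoted on the following slides has 12 cores at 2.67 GHz with 4 double-precision flops per cycle; the one-liner below simply multiplies those published figures (the command itself is illustrative and not part of the original slides):

$ echo "12 * 2.67 * 4" | bc -l
128.16

i.e. roughly 128 Gflop/s, consistent with the "about 125 Gflop/s" quoted above.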

24 Computer architecture: von Neumann. Virtually all computers follow this general design, comprised of four main components: memory, control unit, arithmetic logic unit, and input/output. Read/write, random access memory is used to store both program instructions and data: program instructions are coded data which tell the computer to do something, and data is the information to be used by the program. The control unit fetches instructions/data from memory, decodes the instructions and then sequentially coordinates operations to accomplish the programmed task. The arithmetic logic unit performs basic arithmetic operations. Input/output is the interface to the human operator slide 24 of 143

25 Parallel Computer Architectures. A common way to classify parallel computers is to distinguish them by the way processors can access the system's main memory: distributed memory, shared memory (UMA or NUMA), and hybrid slide 25 of 143

26 Distributed memory systems vary widely but share a common characteristic: they require a communication network to connect inter-processor memory. Processors have their own local memory; memory addresses in one processor do not map to another processor, so there is no global address space across all processors. Each processor operates independently: changes to its local memory have no effect on the memory of other processors, and the concept of cache coherency does not apply. When a processor needs access to data in another processor, it requests it through the communication network. It is the task of the programmer to explicitly define how and when data is communicated. The network "fabric" used for data transfer varies widely, though it can be as simple as Ethernet slide 26 of 143

27 Distributed memory: pros and cons Advantages: Memory is scalable with number of processors. Increase the number of processors and the size of memory increases proportionately. Each processor can rapidly access its own memory without interference and without the overhead incurred with trying to maintain cache coherency. Cost effectiveness: can use commodity, off-the-shelf processors and networking. Disadvantages: The programmer is responsible for many of the details associated with data communication between processors. It may be difficult to map existing data structures, based on global memory, to this memory organization. Non-uniform memory access (NUMA) times slide 27 of 143

28 Shared memory General Characteristics: ability for all processors to access all memory as global address space multiple processors can operate independently but share the same memory resources changes in a memory location effected by one processor are visible to all other processors. Shared memory machines can be divided into two main classes based upon memory access times: UMA and NUMA slide 28 of 143

29 Shared memory: UMA Uniform Memory Access (UMA): Most commonly represented today by Symmetric Multiprocessor (SMP) machines Identical processors Equal access and access times to memory Sometimes called CC-UMA - Cache Coherent UMA. Cache coherent means if one processor updates a location in shared memory, all the other processors know about the update. Cache coherency is accomplished at the hardware level slide 29 of 143

30 Shared memory: NUMA Non-Uniform Memory Access (NUMA): Often made by physically linking two or more SMPs One SMP can directly access memory of another SMP Not all processors have equal access time to all memories Memory access across link is slower ccnuma - Cache Coherent NUMA; if not specifically stated otherwise, NUMA means ccnuma, since nowadays the applications for non-cache coherent NUMA machines are almost non-existent, and they are a real pain to program for slide 30 of 143

31 Shared memory: pros and cons. Advantages: a global address space provides a user-friendly programming perspective to memory; data sharing between tasks is both fast and uniform due to the proximity of memory to the CPUs. Disadvantages: the primary disadvantage is the lack of scalability between memory and CPUs; adding more CPUs can geometrically increase traffic on the shared memory-CPU path and, for cache coherent systems, geometrically increase traffic associated with cache/memory management. The programmer is responsible for synchronization constructs that ensure "correct" access of global memory. Expense: it becomes increasingly difficult and expensive to design and produce shared memory machines with ever increasing numbers of processors slide 31 of 143

32 Hybrid Distributed-Shared Memory. Today's HPC installations employ both shared and distributed memory architectures. The shared memory component is usually a cache coherent NUMA machine: processors on a given SMP can address that machine's memory as global, though it will take longer to access some regions of memory than others. The distributed memory component is the networking of multiple cc-NUMA machines: a particular compute node knows only about its own memory, not the memory on other nodes, so network communications are required to move data from one node to another. Current trends seem to indicate that this type of memory architecture will continue to prevail and increase at the high end of computing for the foreseeable future. slide 32 of 143

33 Interconnects (aka networking): interconnecting distributed memory nodes. Main characteristics: latency and bandwidth. On Triton, an InfiniBand fabric for MPI and Lustre [aka $WRKDIR]: Xeon nodes (SL390, DL580): QDR InfiniBand (5 GT/s); Opteron nodes (BL465): DDR InfiniBand (2.5 GT/s); 4x spine switches are QDR (Voltaire 4036); fat-tree configuration with 2:1 subscription. Ethernet: 1G internal net for SSH and NFS [aka /home] on Triton; 10G uplink towards the Aalto net slide 33 of 143

34 Accelerating: GPU computing (the topic will be fully covered on Wed morning). GPGPU stands for General-Purpose computation on Graphics Processing Units. Graphics Processing Units (GPUs) are massively parallel many-core processors with fast memories; Triton has 20x Tesla M2090 cards on the gpu[001-010] nodes slide 34 of 143

35 More accelerators. Tesla K20: the new family of Nvidia GPUs. Xeon Phi: Intel's 60-core co-processor; a PCIe card with standard x86 processors; OpenMP programming with standard Intel development tools. As of 2013, Tesla cards and Phi take most of the co-processor market. Radeon HD: a direct competitor for Tesla on the GPU computing market, with the AMD APP SDK as a development platform. None of these newer accelerators is available on Triton yet; Tesla K20 and Radeon HD cards are available for testing purposes at PHYS. The common usage reason: more compute performance; the barrier: complexity of programming and the need for much wider adoption at the software level slide 35 of 143

36 Accelerating modes Offload: main program runs on a normal CPU and offloads computationally intensive parts to the accelerator Symmetric: workload is shared between normal CPU and co-processors Many-core: code runs on the accelerator only. Can be used on Phi only, but of limited use due to memory amount slide 36 of 143

37 Accelerators: ways to use. Existing applications tuned for a particular accelerator: examples for Tesla: Gromacs, MATLAB, LAMMPS, NAMD, Amber, Quantum Espresso, etc.; for Intel: Accelrys, Altair, CD-adapco, etc. Libraries: cuBLAS, cuFFT, Thrust for GPUs; Intel's MKL for Phi. OpenACC: compiler directives that specify regions of C/C++ and Fortran code to be offloaded from the main CPU; hopefully the standard portable approach for programming accelerators in the future; possibly to be incorporated into the OpenMP standard. Native programming: CUDA, or OpenMP + Intel dev tools slide 37 of 143

38 Triton practicalities Access & Help Login, Environment Code compilation Quick run setup Serial code optimization Compiling with Makefile slide 38 of 143

39 Using Triton in practice slide 39 of 143

40 Accessing Triton. $ ssh triton.aalto.fi. Directly reachable from the Aalto net: department workstations, wired visitor networks, wireless Aalto Open and Eduroam at Aalto, CSC servers. From outside of Aalto you must hop through the Aalto shell servers: first ssh to james.ics.hut.fi, hugo.hut.fi, etc. If your Triton login name differs from the one you have on your workstation, then $ ssh username@triton.aalto.fi slide 40 of 143
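
If you are coming from outside Aalto, the hop through a shell server can be folded into a single command with OpenSSH's ProxyCommand option; a sketch, assuming james.ics.hut.fi is the shell server your department provides and username is your Aalto account (adapt both to your own situation):

$ ssh -o ProxyCommand="ssh -W %h:%p username@james.ics.hut.fi" username@triton.aalto.fi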

41 SSH key. On the workstation from which you want to log in to Triton:
$ ssh-keygen
$ ssh-copy-id -i ~/.ssh/id_rsa.pub triton.aalto.fi
$ ssh triton.aalto.fi
A must-do step for Matlab users; recommended for the sake of security/convenience for others. The SSH key must have a secure passphrase! slide 41 of 143

42 After login cd $WRKDIR slide 42 of 143

43 Getting help. Points of search for help: the wiki at wiki.aalto.fi: look for the Triton User Guide. The mailing list: all users MUST be on it. tracker.triton.aalto.fi: for any issue; see whether it has been reported already, and if not, then file a new one; access requires a valid Aalto account; see also the FAQ over there; accessible from the Aalto net only, from outside run a proxy through your department (hugo, james, amor, etc). Your local support team member: for account and quota matters slide 43 of 143

44 Frontend: triton.aalto.fi Not a cluster or supercomputer, just one node out of 238 adapted for server needs 12x Xeon cores with 48G of memory (i.e. same as any other Xeon compute node) Please don't run large programs on it! The only difference: eth2 uplink connected to Aalto net Use it to submit/monitor your jobs slide 44 of 143

45 Frontend node: intended usage. File editing, file transfers, code compilation, job control, debugging, checking results. No multi-CPU loads, no multi-GB datasets loaded into memory; Matlab, R, IPython sessions are otherwise OK slide 45 of 143

46 Triton cluster report Ganglia monitoring tool available for all FGI resources, here is Triton's link: Shows overall cluster utilization, CPU usage, memory usage Available to all users slide 46 of 143

47 Triton nodes naming scheme. Named after the real label on each physical node. Compute nodes: cn01, cn02, cn03 ... cn224. Same for GPU nodes: gpu001, gpu002 ... gpu010. Fat nodes: fn01, fn02. Mostly administrative-purpose nodes: tb01 ... tb08. Listing notation: cn[01-224] or gpu[001-010]. Opterons: cn[01-112]; Xeons: cn[113-224]. This naming scheme is used by SLURM commands wherever a nodelist is needed, e.g. squeue -w gpu[001-010] slide 47 of 143

48 Transferring files
Network share (BECS, ICS): the /triton/ filesystem mounted on workstations
SSHFS: mount remote directories over SSH
SCP/SFTP: copy individual files and directories (inefficiently): $ scp file.txt triton:file.txt
Rsync over SSH: like scp, but avoids copying existing files again: $ rsync -ave ssh source/ triton:target/
slide 48 of 143

49 Software: modules environment
$ python3
-bash: python3: command not found
$ module load python3
$ python3
Python 3.3.1 (default, Apr ..., 16:11:51) ...
Compilers, libraries, interpreters and other software, maintained by Triton admins or FGI partners (/cvmfs). A module adds installed programs and libraries to $PATH, $LD_LIBRARY_PATH, etc. and sets up the required environment variables for the lifetime of the current shell or job slide 49 of 143

50 Module environment slide 50 of 143

51 module subcommands
module help (or man module)
module list
module purge
module avail
module whatis scalasca/1.4.2
module unload mvapich2/1.9a
module load openmpi/1.6.3-intel
module swap matlab/r2013a
module show intel
slide 51 of 143

52 Using the cluster for managing computational tasks slide 52 of 143

53 Using the cluster Each set of 3-5 slides will introduce: A new concept Explained through examples Relevant cluster utility Possible mistakes to avoid slide 53 of 143

54 What is a 'job'? Job = your program + handling instructions.
Your program: a shell script that runs your program; a single command or a complex piece of programming; submitted in a single script (.sh) file from the front node or through the Grid interface.
Handling instructions: allocation size, (nodes or CPUs) x memory x time; constraints.
slide 54 of 143

55 Job scheduler slide 55 of 143

56 Slurm job scheduler Basic units of resources: Nodes / CPU cores Memory Time Takes in cluster jobs from users Compares user-specified requirements to resources available on compute nodes Starts jobs on available machine(s) Individual computers and the variety of hardware configurations (mostly) abstracted away slide 56 of 143

57 Slurm job scheduler Tracks historical usage (walltime)...to enable hierarchical fair-share scheduling Jobs sorted in pending queue by priority Computed from user's historical usage (fair-share), job age (waiting time), and service class (QOS) Highest priority jobs are started on compute nodes when resources become available Using cluster time consumes fairshare, lowering priority of future jobs for that user & department slide 57 of 143
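
To inspect the numbers behind this priority calculation, standard Slurm ships two informative commands; the exact columns depend on the site configuration, so treat the sketch below as illustrative rather than Triton-specific output:

$ sshare -u $USER     # your share, recorded usage and resulting fair-share factor
$ sprio -l            # priority components (age, fair-share, QOS) of pending jobs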

58 Simple batch script for a 1-CPU job. Job = your program + handling instructions. Example:
#!/bin/sh
#SBATCH --time=5:00
#SBATCH --mem-per-cpu=1g
/bin/echo hello world
slide 58 of 143

59 Important! Always use the estimated time option: --time=days-hours:minutes:seconds. Three and a half days: --time=3-12. One hour: --time=1:00:00. 30 minutes: --time=30. One day, two hours and 27 minutes: --time=1-2:27:00. Otherwise the default time limit is the partition's default time limit + OverTimeLimit. OverTimeLimit is set to 60 minutes, thus on the short partition your job will run for 4 hours and 59 minutes at most slide 59 of 143

60 Submitting script file $ sbatch job.sh slide 60 of 143

61 After job submission Job waits in PENDING state until the resource manager finds available resources on matching node(s) Job script is executed on one of the nodes assigned to it; the requested memory and CPUs are reserved for it for a certain time Program output saved to '.out' and '.err' files slide 61 of 143
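
By default a stock Slurm setup writes both streams to a single slurm-<jobid>.out file; whether Triton splits them as described here depends on the site configuration. If you want predictable names either way, the standard --output/--error options can be set in the job script (the %j placeholder expands to the job ID):

#SBATCH --output=myjob_%j.out
#SBATCH --error=myjob_%j.err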

62 Job requirements vs. job behavior. Job reqs. are not programming instructions. #SBATCH --ntasks=n makes a reservation for and allows execution on n CPUs, but does not magically make programs run in parallel. #SBATCH --mem-per-cpu=x: the memory requirement becomes an enforced resource limit; attempting to allocate more results in job termination. #SBATCH --time=d-hh:mm: the job is killed after the time limit + 1 hour slide 62 of 143

63 Pitfalls: background processes Programs started on the background (myprog&) may escape job tracking when job finishes Solution: use wait at the end of the script: someprogram & otherprogram & wait slide 63 of 143

64 First test jobs slide 64 of 143

65 Cluster partitions. #SBATCH --partition=name. A partition is a group of computers. Available partitions: play, short, batch, hugemem.
play: meant for <15 min test runs (to see if the code runs)
batch: the default partition, made of 12-core machines
short: has 16 additional nodes (192 cores) for <4h jobs; you should use it if the task can finish in time, there are no downsides
hugemem: reserved for >64 GB jobs; one machine with 1 TB memory and 24 slightly slower CPUs; separate from 'batch' only to keep it immediately available to big jobs when specifically requested
slide 65 of 143
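
A quick way to see which partitions exist and what their limits are is plain sinfo (standard Slurm, independent of the Triton-specific wrappers introduced later); the format string below prints the partition name, its time limit and its node count:

$ sinfo -o "%P %l %D"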

66 Test jobs. Two dedicated machines for test runs and debugging in the play partition, cn01 and tb008, i.e. an Opteron and a Xeon respectively.
#!/bin/sh
#SBATCH --partition=play
#SBATCH --time=5:00
#SBATCH --mem-per-cpu=1g
/bin/echo hello world
slide 66 of 143

67 Checking results Check output files Check if job finished successfully with: $ slurm history slide 67 of 143

68 Slurm utility. A Triton-specific wrapper for Slurm commands (squeue, sinfo, scontrol, sacct, ...). Informative commands only (safe to test out). Shows detailed information about jobs, queues, nodes, and job history. Example: show current jobs' state:
$ slurm queue
JOBID  PARTITION  NAME  TIME  START_TIME  STATE
       batch      test  0:00  N/A         PENDING
slide 68 of 143

69 Short jobs slide 69 of 143

70 Job classes (QOS). #SBATCH --qos=name; available QOS: short, medium, normal, long_risky. A QOS represents a class of job: eligible jobs get preferential queue treatment, and short jobs are prioritized over longer ones, but only if requested with --qos=short or medium. There is a cluster-wide limit on very long jobs. A QOS cheat sheet is shown on Triton login slide 70 of 143

71 Short jobs
#!/bin/sh
#SBATCH --partition=short
#SBATCH --qos=short
#SBATCH --time=3:00:00
#SBATCH --mem-per-cpu=1g
myprogram
slide 71 of 143

72 Multiprocess job slide 72 of 143

73 Job allocation. When a job enters the RUNNING state, resources on a set of nodes have been assigned to it. The job script is executed on the first of the assigned nodes. User code is then responsible for making use of the allocated resources. The list of assigned nodes is exposed to the user program through environment variables: SLURM_CPUS_ON_NODE, SLURM_JOB_NODELIST, etc. MPI libraries handle launching processes on the correct nodes. Non-MPI programs will run only on the first node! slide 73 of 143
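
A throwaway job like the sketch below makes the allocation visible by printing the Slurm environment variables mentioned above; SLURM_JOB_NODELIST, SLURM_CPUS_ON_NODE and SLURM_NTASKS are standard Slurm variables, the rest of the script is purely illustrative:

#!/bin/sh
#SBATCH --time=5:00
#SBATCH --mem-per-cpu=100M
#SBATCH --ntasks=2
echo "Nodes allocated:   $SLURM_JOB_NODELIST"
echo "CPUs on this node: $SLURM_CPUS_ON_NODE"
echo "Total tasks:       $SLURM_NTASKS"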

74 Multiprocess job
#!/bin/sh
#SBATCH --time=5:00
#SBATCH --mem=1g
#SBATCH --nodes=1-1
#SBATCH --ntasks=2
module load tutorialtools
memoryhog 400 &
memoryhog 400 &
wait
slide 74 of 143

75 Pitfalls: multiple nodes in allocation. Using --ntasks alone does not guarantee the job a reservation on one computer: the CPUs may be spread across many nodes. This must be avoided with non-MPI programs! Solution: specify an upper limit of one (1) node:
#SBATCH --ntasks=8
#SBATCH --nodes=1-1
multithreaded_program -threads 8
slide 75 of 143

76 Large number of tasks inside a job slide 76 of 143

77 Job steps / tasks srun -n 1 myprogram & Job steps are smaller tasks started inside a job allocation Runtime resource usage can be tracked $ slurm tasks <jobid> Job accounting remembers exit status and resource usage of each individual task $ slurm history slide 77 of 143

78 Large number of tasks inside a job
#!/bin/sh
#SBATCH --time=5:00
#SBATCH --mem=1g
#SBATCH --nodes=1-1
#SBATCH --ntasks=4
module load tutorialtools
for incr in $(seq 1 20); do
    let var=$incr*100
    srun -n 1 memoryhog $var &
done
wait
slide 78 of 143

79 Multithreaded job slide 79 of 143

80 Multithreaded job
#!/bin/sh
#SBATCH --time=4:00:00
#SBATCH --mem=1g
#SBATCH --nodes=1-1
#SBATCH --ntasks=4
multithreaded_program -threads 4
slide 80 of 143

81 Pitfalls: memory limits for threads. The job option --mem-per-cpu=x means just that: the memory requirement for the job allocation is X multiplied by the number of tasks. Threads have a shared memory space, so starting more threads does not multiply the program's actual memory use. It is easy to accidentally leave overblown requirements in the job script when testing program behavior with an increasing number of threads. Solution: use the simpler --mem=x option, as sketched below slide 81 of 143
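
For a threaded program, one way to keep the request honest is to combine the total --mem limit with --cpus-per-task instead of --ntasks; a minimal sketch (the program name and thread count are placeholders, and --cpus-per-task is standard Slurm, although the slide above uses --ntasks for this case):

#!/bin/sh
#SBATCH --time=4:00:00
#SBATCH --mem=1G              # total memory for the whole job, not per CPU
#SBATCH --nodes=1-1
#SBATCH --cpus-per-task=4     # four CPUs reserved for one multithreaded task
multithreaded_program -threads 4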

82 Large parallel jobs slide 82 of 143

83 Exclusive access to nodes. #SBATCH --exclusive: the resource manager guarantees no other job runs on the same node(s). A terrible idea for <12-core jobs in general; it may be necessary for timing cache-sensitive code. Don't use it unless you need it; constraints only make the job less likely to find available resources. It makes sense for parallel jobs: moving data over the network is orders of magnitude slower than interprocess communication inside a single host. Resources are wasted if --ntasks is not a multiple of 12 slide 83 of 143

84 Constraints Limit job to running on nodes with some specified feature(s)... #SBATCH --constraint=opteron... Available: opteron,xeon Useful for multi-node parallel jobs slide 84 of 143

85 Parallel (communicating) job
#!/bin/sh
#SBATCH --time=12:00:00
#SBATCH --mem=4g
#SBATCH --exclusive
#SBATCH --ntasks=48
mpirun -o mpihello.out mpihello
slide 85 of 143

86 Embarrassingly parallel jobs slide 86 of 143

87 Large numbers of independent jobs. A few jobs can be submitted by hand, but what if there are thousands? This requires code outside the job system to automate submissions and parameter handling. Job requirements can be tuned to match input-dependent program behavior (runtime, memory use). General recommendations: make it possible to submit single jobs outside of the batch; failures should be noticeable: use a lot of logging! Organize the data first, then the code slide 87 of 143

88 External submitter script
submitter.sh:
#!/bin/sh
for dataset in $WRKDIR/d/*; do
    sbatch worker.sh $dataset
done
worker.sh:
#!/bin/sh
#SBATCH -t 1-0
#SBATCH --mem=1g
process_file $1
slide 88 of 143

89 Matlab jobs slide 89 of 143

90 Matlab environments on Triton Interactive use on fn01 (currently 1 machine) triton$ ssh -XY fn01 fn01$ module load matlab; matlab & 24 CPUs and 1TB memory shared by a number of users Noninteractive Slurm job (unlimited jobs) Matlab started from a user-submitted job script Licensing terms changed in 2013; use of Matlab in cluster and grid environments is no longer prohibited Matlab Distributed Computing Server (128 workers) Special license that extends Parallel Computing Toolbox Mainly useful for actual parallel applications Tight integration between the interactive Matlab GUI and cluster slide 90 of 143

91 Choosing the Matlab environment.
Slurm job + Matlab script: unconstrained license; batch job without GUI; simple and efficient way to run serial jobs; best use case is a series of single-threaded jobs.
Parallel Toolbox + MDCS: licensed for only 128 cores; MDCS allows large parallel jobs across multiple nodes; jobs controlled from the interactive Matlab GUI; supports the data-parallel SPMD paradigm; Parallel Toolbox brings some programming overhead; learning curve.
slide 91 of 143

92 Matlab on Triton. Primarily for serial runs: use Slurm directly, as a batch job without a graphical window. 1) Prepare an M-file for the job. 2) Prepare the Slurm submission script (example below). 3) Submit.
#!/bin/bash
#SBATCH -p short
#SBATCH --qos=short
#SBATCH -t 04:00:00
#SBATCH --mem-per-cpu=1000
module load matlab/r2013a
cd /triton/becs/scratch/myjob/
matlab -nojvm -singlecompthread -r run
slide 92 of 143

93 Matlab on Triton. Parallel Matlab runs using the Parallel Computing Toolbox (MDCS). 1) Set up the Triton environment (validation DEMO). 2) Manage your jobs within the Matlab workspace:
ClusterInfo.setWallTime('05:00');
ClusterInfo.setMemUsage('4096');
% this job requires 6 cores on the cluster
j1 = batch('script_spmd', 'matlabpool', 5);
slide 93 of 143

94 Questions about Triton job system? slide 94 of 143

95 Triton File systems, IO with files slide 95 of 143

96 File systems Motivation & basic concepts Storage systems Local disk, NFS, Lustre, ramdisk Do's and Dont's Triton best practices slide 96 of 143

97 File systems: Motivation. Case #1: Did my calculations in 10 min and spent 1 h writing my results; more jobs to go. Can I boost the performance? Case #2: I just ran my code from here; 15 min later 150 other users cannot access their files! Case #3: My directory listing takes 10 min while on my laptop it takes 3 s. Am I doing something wrong? slide 97 of 143

98 File systems: Basic concepts What is a file? File = metadata + block data Accessing block data: cat a.txt Showing some metadata: ls -l a.txt slide 98 of 143

99 File systems: Basic concepts. What is a file system? A data store that manages multiple files. Ext4 is used in Triton/Linux. Can be local (physical disks on a server) or shared over the network. Some basic operations: ls, cp, scp, rsync, rm, vi, less, ... slide 99 of 143

100 File systems: Basic concepts. Access and limits: quota enforced, mainly for groups; Linux filesystem permissions. Some departments export the filesystem to department workstations (BECS, ICS). Rsync, scp and SSHFS are always options when transferring files to/from the cluster slide 100 of 143

101 Storage systems slide 101 of 143

102 File systems: NFS Network filesystem Server/storage SPOF and bottleneck slide 102 of 143

103 File systems: Lustre Many servers & storage elements Redundancy and scaling Single file can be split between multiple servers! slide 103 of 143

104 File systems: Local storage. Storage local to the compute node: the /local directory. Best for storage during the calculation; copy relevant data off after the computation, it will be deleted on re-install. Ramdisk for fast IO operations: the /ramdisk directory (20 GB). Use case: a job spends most of its time doing file operations on millions of small files. slide 104 of 143
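
A common pattern is to stage data onto /local at the start of a job and copy the results back before it ends; a hedged sketch, assuming the input lives under $WRKDIR/myrun, fits on the node's local disk, and myprogram is a placeholder for your code:

#!/bin/sh
#SBATCH --time=1:00:00
#SBATCH --mem-per-cpu=1G
TMP=/local/$SLURM_JOB_ID            # per-job scratch directory on the compute node
mkdir -p $TMP
cp -r $WRKDIR/myrun/input $TMP/     # stage input onto fast local disk
cd $TMP
myprogram input                     # heavy IO now hits /local instead of Lustre/NFS
cp -r results $WRKDIR/myrun/        # copy results back to the work directory
rm -rf $TMP                         # clean up the local scratch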

105 File systems: Meters of performance Stream and (random) IOPS Stream measures reading large sequential data from system (~big file) IOPS measure random small reads to the system (~some metadata operations) To measure your own application, profiling or internal timers needed slide 105 of 143

106 File systems: Meters of performance, total numbers: IOPS and stream (MB/s) compared for a SATA disk (7.2k), an SSD disk, ramdisk, Triton NFS and Triton Lustre, first for a single client and then with 200 concurrent jobs using the storage (SSD: N/A in that case). DON'T run jobs from HOME! (NFS) slide 106 of 143

107 File system locations on Triton
Location   Type     Usage          Size/quota
/home      NFS      Home dir       1 GB
/local     local    Local scratch  800 GB
/triton    Lustre   Work/scratch   200 GB (varies, even several TBs)
/ramdisk   ramdisk  Local scratch  20 GB
slide 107 of 143

108 File systems: advanced Lustre. By default striping is turned off.
lfs getstripe <dir>      shows striping
lfs setstripe -c N <dir> stripe over N targets, -1 means all targets
lfs setstripe -d <dir>   revert to default
Use with care. Useful for HUGE files (>100 GB) or parallel I/O from multiple compute nodes (MPI-I/O). Real numbers from a single client (16 MB IO blocksize for 17 GB):
Striping   File size   Stream MB/s
off (1)    17 GB
           17 GB       557
max        17 GB       508
max        11 MB       55
max        200 KB      10
slide 108 of 143

109 Do's and Dont's slide 109 of 143

110 File systems: do's and don'ts. ls vs ls -la, in a directory with 1000 files: a simple ls is only a few IOPS, while ls -la stat()s each file: on a local fs that means one stat() call per file, on NFS a bit more overhead still, on Lustre (striping off) about 2000 IOPS (+RPCs), and on Lustre (striping on) far more IOPS (+RPCs) => the whole Lustre gets stuck for a while for everyone. Use ls -la and variants (ls --color) ONLY when needed slide 110 of 143

111 File systems: do's and don'ts. Lots of small files (<1 MB): a bad starting point already in general, though sometimes there is no way to improve (i.e. legacy code). /ramdisk or /local: the best place for these. Lustre: not the best place; with many users, local disk provides more IOPS and stream in general. NFS (home): a very bad idea, do not run calculations from home. The very best approach: modify your code; use large file(s) instead of many small ones, or even no files at all. Sometimes the IO is due to unnecessary checkpointing. slide 111 of 143

112 File systems: do's and don'ts. 500 GB of data: estimated read time in minutes for many small files vs. a single big file, on /local, /triton (stripe off) and /triton (stripe max). Many small files: BAD IDEA; a single big file reads in on the order of 15 minutes in the best case. Use /local or Lustre (+ maybe striping) for big files. Note that the Triton results above assume exclusive access (reality: shared with all other users)! slide 112 of 143

113 File systems: do's and dont's Databases (sqlite) These can generate a lot of small random reads (=IOPS) /ramdisk or /local: Best place for these Lustre: Not the best place. With many users local disk provides more IOPS and Stream in general NFS (Home): very Bad idea slide 113 of 143

114 File systems: best practices When unsure what is the best approach Google? Ask your supervisor. tracker.triton.aalto.fi Ask your local Triton support person trial-and-error slide 114 of 143

115 Questions or comments regarding Triton filesystems? slide 115 of 143

116 Serial programming slide 116 of 143

117 Simple C compilation example. Using the GNU C compiler:
$ gcc -O2 -g -Wall hello.c -o hello
$ ./hello
Hello world!
C++: g++. Fortran: gfortran. slide 117 of 143

118 More about compilers Intel compilers (C, C++, Fortran) also available (see intel module) C/C++/Fortran are unforgiving, use all the help you can get -Wall (Warnings), bounds checking (Fortran) Gdb (GNU debugger, remember -g flag) Valgrind (find memory leaks, use of uninitialized memory, etc.) slide 118 of 143

119 Optimizing serial code. It is often simpler to make serial code faster than to parallelize (up to a point, obviously); in other words, before you go parallel, get the most out of your serial binary first. Use optimizing switches: basic level -O2, practically always useful; -O3 -ffast-math is more aggressive and potentially dangerous. Measure, don't guess; use a profiler! There is a perf example in the Triton wiki slide 119 of 143
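
The perf workflow referenced on the wiki boils down to recording a profile of a run and then browsing it; a minimal sketch with a placeholder binary name (perf record and perf report are the standard Linux perf subcommands):

$ perf record -g ./myprog     # run the program while sampling call stacks
$ perf report                 # interactive breakdown of where the time went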

120 Optimizing serial code (cont.) Use existing high-performance libraries: BLAS (MKL, ATLAS, Goto/OpenBLAS); C++ users: check out the Eigen library! FFT (MKL, FFTW). Be aware of whether the library you use is multi-threaded, if so how that can be controlled, and how it interacts with the batch scheduler on the cluster. slide 120 of 143
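
For the common case of a threaded BLAS or OpenMP code, the thread count is controlled through environment variables; the sketch below matches it to the Slurm allocation (OMP_NUM_THREADS and MKL_NUM_THREADS are the standard OpenMP and Intel MKL controls, the rest of the script is illustrative):

#!/bin/sh
#SBATCH --nodes=1-1
#SBATCH --cpus-per-task=4
#SBATCH --time=1:00:00
#SBATCH --mem=2G
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # OpenMP-threaded code or BLAS
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK   # Intel MKL, if that is the BLAS in use
./myprogram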

121 Running in Parallel Means using more than one processing unit to solve a computational problem slide 121 of 143

122 Quick start on Triton: get or implement your code (MPI or OpenMP); link and compile against parallel libs (if needed, use optimized system-specific math libs); test/debug it on the play partition; send it to the queue with SLURM (request a CPU / node number, set OpenMP vars or use srun). Once the code is compiled and the start-up script is ready, parallel runs are no more difficult than serial ones slide 122 of 143

123 Parallel models A bit of theory: The models exist as an abstraction above hardware and memory architectures. Two main programming models: Message Passing Shared memory / Threads slide 123 of 143

124 Message Passing model A set of tasks that use their own local memory during computation. Multiple tasks can reside on the same physical machine and/or across an arbitrary number of machines. Tasks exchange data through communications by sending and receiving messages. Natural to distributed memory archs, but can be used on shared memory machine slide 124 of 143

125 MPI implementations. From a programming perspective, MPI is a library of subroutines that are embedded in source code; the programmer is responsible for determining all parallelism. In 1992, the MPI Forum was formed with the primary goal of establishing a standard interface for message passing implementations. Part 1 of the Message Passing Interface (MPI) was released in 1994 and Part 2 (MPI-2) in 1996; in 2008 an update was published as version 2.1, and MPI 3.0 was released in 2012, though for maximum portability we should limit ourselves to MPI 2.2. The MPI specifications are available on the web at www.mpi-forum.org. MPI provides Fortran 77/90 and C/C++ language bindings and is now the "de facto" industry standard for message passing, replacing virtually all other message passing implementations used for production work; almost all available open-source and commercial MPI implementations support the 2.0 standard. Open MPI is one of the most popular open-source implementations; MVAPICH is another, popular with InfiniBand interconnects. For shared memory architectures, MPI implementations usually don't use a network for task communications; instead, they use shared memory (memory copies) for performance reasons. slide 125 of 143

126 MPI basic routines. MPI consists of more than 320 functions, but realistic programs can already be developed based on no more than six:
MPI_Init initializes the library; it has to be called at the beginning of a parallel operation before any other MPI routines are executed.
MPI_Finalize frees any resources used by the library and has to be called at the end of the program.
MPI_Comm_size determines the number of processors executing the parallel program.
MPI_Comm_rank returns the unique process identifier.
MPI_Send transfers a message to a target process. This is a blocking send operation, i.e. it terminates when the message buffer can be reused, either because the message was copied to a system buffer by the library or because the message was delivered to the target process.
MPI_Recv receives a message. This routine terminates when a message has been copied into the receive buffer.
slide 126 of 143

127 MPI example
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    MPI_Init(NULL, NULL);                               // initialize the MPI library
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);         // number of processes
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);         // rank of this process
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);  // processor name
    printf("Hello world from processor %s, rank %d"
           " out of %d processors\n",
           processor_name, world_rank, world_size);
    MPI_Finalize();                                     // finalize the MPI library
    return 0;
}
slide 127 of 143

128 MPI version of example program. Load the mpi* binaries, libs and headers, then compile and run:
$ module load openmpi/1.6.3-gcc
$ mpicc hello_mpi.c -o hello_mpi
$ salloc -p play -n 2 mpirun hello_mpi
salloc: Pending job allocation
salloc: job queued and waiting for resources
salloc: job has been allocated resources
salloc: Granted job allocation
Hello world from processor tb008, rank 0 out of 2 processors
Hello world from processor tb008, rank 1 out of 2 processors
salloc: Relinquishing job allocation
Use sbatch for production runs, salloc for interactive/short test cases. slide 128 of 143

129 Shared memory or threads model Threads are commonly associated with shared memory architectures The main program a.out is scheduled normally as any sequential code, it performs some serial work, and then creates a number of tasks (threads) that can be scheduled and run by the operating system concurrently Each thread has local data, but also, shares the entire resources of a.out (memory, IO) Any thread can execute any part of program at the same time as other threads. Threads communicate with each other through global memory (updating address locations). This requires synchronization constructs to ensure that more than one thread is not updating the same global address at any time. Threads can come and go, but a.out remains present to provide the necessary shared resources until the application has completed. slide 129 of 143

130 Threads implementations. From a programming perspective, threads implementations commonly comprise either a library of subroutines that are called from within parallel source code, or a set of compiler directives embedded in either serial or parallel source code; in both cases the programmer is responsible for determining all parallelism. Two very different implementations of threads: POSIX Threads and OpenMP.
POSIX Threads: library based, requires parallel coding; specified by the IEEE POSIX 1003.1c standard (1995); C language only.
OpenMP: compiler directive based, can use serial code; jointly defined and endorsed by a group of major computer hardware and software vendors; the OpenMP Fortran API was released October 28, 1997, and the C/C++ API in late 1998; portable/multi-platform, including Unix and Windows NT platforms; available in C/C++ and Fortran implementations; can be very easy and simple to use, providing for "incremental parallelism".
Threads vs. processes: MPI processes are typically independent, while OpenMP threads exist as subsets of a process (contained inside a process) and share state as well as memory and other resources. slide 130 of 143

131 OpenMP directives
!$omp PARALLEL DO specifies a loop that can be executed in parallel.
!$omp PARALLEL SECTIONS starts a set of sections that are each executed in parallel by a team of threads.
!$omp PARALLEL introduces a code region that is executed redundantly by the threads.
!$omp DO / FOR is a work-sharing construct and may be used within a parallel region.
!$omp SECTIONS is also a work-sharing construct; it allows the current team of threads executing the surrounding parallel region to cooperate in the execution of the parallel sections.
!$omp TASK (available with the new version 3.0) greatly simplifies the parallelization of non-loop constructs by allowing portions of the program which can run independently to be specified dynamically.
slide 131 of 143

132 OpenMP features. The code is still a sequential code; there is no need to maintain separate serial and parallel versions. Thread data can be either shared or private: private data belongs to a particular thread, while shared data can be accessed by all threads. OpenMP provides special synchronization constructs: barrier synchronization and critical sections. Barrier synchronization enforces that all threads have reached some point before execution continues. Critical sections ensure that only a single thread can enter the section: race condition protection slide 132 of 143

133 OpenMP example
#include <stdio.h>

int main(void) {
    #pragma omp parallel
    printf("Hello, world.\n");
    return 0;
}
slide 133 of 143

134 OpenMP example in action
$ gcc -fopenmp hello_omp.c -o hello_omp
$ salloc -p play --nodes=1-1 --cpus-per-task=4 srun hello_omp
salloc: Pending job allocation
salloc: job queued and waiting for resources
salloc: job has been allocated resources
salloc: Granted job allocation
Hello, world.
Hello, world.
Hello, world.
Hello, world.
salloc: Relinquishing job allocation
slide 134 of 143

135 Compiling with OpenMP. GCC: -fopenmp. Intel: -openmp. PGI: -mp=nonuma. Fortran examples:
gfortran -fopenmp code.f90 -o code.exe
pgf90 -mp=nonuma code.f90 -o code.exe
See the respective compiler documentation. slide 135 of 143

136 Scalability Refers to ability to demonstrate a proportionate increase in parallel speedup with the addition of more processors. Factors that contribute to scalability include: Hardware: particularly memory-cpu bandwidths and network communications Application algorithm Parallel overhead related Characteristics of your specific application and coding slide 136 of 143

137 MPI flavors on Triton. Available through modules: OpenMPI and MVAPICH2; try both, your software may benefit from a particular flavor. Compiled against a particular compiler suite: GCC or Intel. Parallel math libs: BLACS and ScaLAPACK slide 137 of 143

138 MPI IO Parallel applications make use of the input/output (IO) subsystem to read and write data sets. IO reasons: read the input, store the output, pass info to other programs. IO can be implemented in three ways: Sequential IO. A single task is responsible to perform all the IO operations. Private IO. Each tasks accesses its own files. Parallel IO. All tasks access the same file. slide 138 of 143

139 MPI IO: factors to consider. I/O operations are generally regarded as inhibitors to parallelism. In an environment where all tasks see the same file space, write operations can result in file overwriting. Read operations can be affected by the file server's ability to handle multiple read requests at the same time. I/O that must be conducted over the network (NFS, non-local) can cause severe bottlenecks. The good news: parallel file systems are available and Triton has one. For example: GPFS, the General Parallel File System for AIX (IBM); Lustre, for Linux clusters (the one on Triton). The parallel I/O programming interface specification for MPI has been available since 1996 as part of MPI-2. Practical aspects for Triton: use the parallel file system under /triton (aka $WRKDIR); reduce overall IO as much as possible; confine IO to specific serial portions of the job, and then use parallel communications to distribute data to parallel tasks; for distributed memory systems with shared filespace, perform IO in local, non-shared filespace (for example in local /tmp on a Triton compute node) slide 139 of 143

140 Parallel debugging Debugging, tuning and analyzing performance of parallel program is significantly more of a challenge than for serial programs. A number of parallel tools for execution monitoring and program analysis are available. Debugging: TotalView, DDT Parallel performance analysis tool: Scalasca, mpip slide 140 of 143

141 Parallel computing practicalities. The MPI implementation varies: use the one suggested by the vendor; OpenMPI is the most popular of the open-source ones. Check scalability and memory requirements. Use system-optimized math libs matching the architecture and compiler suite. Check the compiler suite's OpenMP-related options and environment variables. Use compiler optimization. Mind the IO slide 141 of 143

142 Further reading George Hager, Gerhard Wellein Introduction to High Performance Computing for Scientists and Engineers slide 142 of 143

143 Sources. Crash Course on HPC lectures by Bernd Mohr (JSC). Introduction to Parallel Computing, tutorial by Blaise Barney, Lawrence Livermore National Lab. Introduction to Parallel Computing in the NIC series by Bernd Mohr. Materials from CSC courses: Advanced MPI, Parallel I/O, GPU tutorial, Grid & Clouds introduction, and others slide 143 of 143


More information

Beginner's Guide for UK IBM systems

Beginner's Guide for UK IBM systems Beginner's Guide for UK IBM systems This document is intended to provide some basic guidelines for those who already had certain programming knowledge with high level computer languages (e.g. Fortran,

More information

Lecture 3: Intro to parallel machines and models

Lecture 3: Intro to parallel machines and models Lecture 3: Intro to parallel machines and models David Bindel 1 Sep 2011 Logistics Remember: http://www.cs.cornell.edu/~bindel/class/cs5220-f11/ http://www.piazza.com/cornell/cs5220 Note: the entire class

More information

New User Seminar: Part 2 (best practices)

New User Seminar: Part 2 (best practices) New User Seminar: Part 2 (best practices) General Interest Seminar January 2015 Hugh Merz merz@sharcnet.ca Session Outline Submitting Jobs Minimizing queue waits Investigating jobs Checkpointing Efficiency

More information

The Stampede is Coming: A New Petascale Resource for the Open Science Community

The Stampede is Coming: A New Petascale Resource for the Open Science Community The Stampede is Coming: A New Petascale Resource for the Open Science Community Jay Boisseau Texas Advanced Computing Center boisseau@tacc.utexas.edu Stampede: Solicitation US National Science Foundation

More information

MPI 1. CSCI 4850/5850 High-Performance Computing Spring 2018

MPI 1. CSCI 4850/5850 High-Performance Computing Spring 2018 MPI 1 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning Objectives

More information

Lecture 9: MIMD Architecture

Lecture 9: MIMD Architecture Lecture 9: MIMD Architecture Introduction and classification Symmetric multiprocessors NUMA architecture Cluster machines Zebo Peng, IDA, LiTH 1 Introduction MIMD: a set of general purpose processors is

More information

SuperMike-II Launch Workshop. System Overview and Allocations

SuperMike-II Launch Workshop. System Overview and Allocations : System Overview and Allocations Dr Jim Lupo CCT Computational Enablement jalupo@cct.lsu.edu SuperMike-II: Serious Heterogeneous Computing Power System Hardware SuperMike provides 442 nodes, 221TB of

More information

The RWTH Compute Cluster Environment

The RWTH Compute Cluster Environment The RWTH Compute Cluster Environment Tim Cramer 29.07.2013 Source: D. Both, Bull GmbH Rechen- und Kommunikationszentrum (RZ) The RWTH Compute Cluster (1/2) The Cluster provides ~300 TFlop/s No. 32 in TOP500

More information

HPC Architectures. Types of resource currently in use

HPC Architectures. Types of resource currently in use HPC Architectures Types of resource currently in use Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

High Performance Computing (HPC) Using zcluster at GACRC

High Performance Computing (HPC) Using zcluster at GACRC High Performance Computing (HPC) Using zcluster at GACRC On-class STAT8060 Georgia Advanced Computing Resource Center University of Georgia Zhuofei Hou, HPC Trainer zhuofei@uga.edu Outline What is GACRC?

More information

ACCRE High Performance Compute Cluster

ACCRE High Performance Compute Cluster 6 중 1 2010-05-16 오후 1:44 Enabling Researcher-Driven Innovation and Exploration Mission / Services Research Publications User Support Education / Outreach A - Z Index Our Mission History Governance Services

More information

Introduction to HPC Using zcluster at GACRC

Introduction to HPC Using zcluster at GACRC Introduction to HPC Using zcluster at GACRC Georgia Advanced Computing Resource Center University of Georgia Zhuofei Hou, HPC Trainer zhuofei@uga.edu Outline What is GACRC? What is HPC Concept? What is

More information

Duke Compute Cluster Workshop. 11/10/2016 Tom Milledge h:ps://rc.duke.edu/

Duke Compute Cluster Workshop. 11/10/2016 Tom Milledge h:ps://rc.duke.edu/ Duke Compute Cluster Workshop 11/10/2016 Tom Milledge h:ps://rc.duke.edu/ rescompu>ng@duke.edu Outline of talk Overview of Research Compu>ng resources Duke Compute Cluster overview Running interac>ve and

More information

RHRK-Seminar. High Performance Computing with the Cluster Elwetritsch - II. Course instructor : Dr. Josef Schüle, RHRK

RHRK-Seminar. High Performance Computing with the Cluster Elwetritsch - II. Course instructor : Dr. Josef Schüle, RHRK RHRK-Seminar High Performance Computing with the Cluster Elwetritsch - II Course instructor : Dr. Josef Schüle, RHRK Overview Course I Login to cluster SSH RDP / NX Desktop Environments GNOME (default)

More information

Introduction to HPC Using zcluster at GACRC

Introduction to HPC Using zcluster at GACRC Introduction to HPC Using zcluster at GACRC On-class PBIO/BINF8350 Georgia Advanced Computing Resource Center University of Georgia Zhuofei Hou, HPC Trainer zhuofei@uga.edu Outline What is GACRC? What

More information

High Performance Computing Cluster Basic course

High Performance Computing Cluster Basic course High Performance Computing Cluster Basic course Jeremie Vandenplas, Gwen Dawes 30 October 2017 Outline Introduction to the Agrogenomics HPC Connecting with Secure Shell to the HPC Introduction to the Unix/Linux

More information

Using a Linux System 6

Using a Linux System 6 Canaan User Guide Connecting to the Cluster 1 SSH (Secure Shell) 1 Starting an ssh session from a Mac or Linux system 1 Starting an ssh session from a Windows PC 1 Once you're connected... 1 Ending an

More information

Getting started with the CEES Grid

Getting started with the CEES Grid Getting started with the CEES Grid October, 2013 CEES HPC Manager: Dennis Michael, dennis@stanford.edu, 723-2014, Mitchell Building room 415. Please see our web site at http://cees.stanford.edu. Account

More information

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.

More information

Hybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS

Hybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Hybrid Computing @ KAUST Many Cores and OpenACC Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Agenda Hybrid Computing n Hybrid Computing n From Multi-Physics

More information

How to Use a Supercomputer - A Boot Camp

How to Use a Supercomputer - A Boot Camp How to Use a Supercomputer - A Boot Camp Shelley Knuth Peter Ruprecht shelley.knuth@colorado.edu peter.ruprecht@colorado.edu www.rc.colorado.edu Outline Today we will discuss: Who Research Computing is

More information

Introduction to GALILEO

Introduction to GALILEO November 27, 2016 Introduction to GALILEO Parallel & production environment Mirko Cestari m.cestari@cineca.it Alessandro Marani a.marani@cineca.it SuperComputing Applications and Innovation Department

More information

Lab: Scientific Computing Tsunami-Simulation

Lab: Scientific Computing Tsunami-Simulation Lab: Scientific Computing Tsunami-Simulation Session 4: Optimization and OMP Sebastian Rettenberger, Michael Bader 23.11.15 Session 4: Optimization and OMP, 23.11.15 1 Department of Informatics V Linux-Cluster

More information

Monitoring and Trouble Shooting on BioHPC

Monitoring and Trouble Shooting on BioHPC Monitoring and Trouble Shooting on BioHPC [web] [email] portal.biohpc.swmed.edu biohpc-help@utsouthwestern.edu 1 Updated for 2017-03-15 Why Monitoring & Troubleshooting data code Monitoring jobs running

More information

Introduction to GALILEO

Introduction to GALILEO Introduction to GALILEO Parallel & production environment Mirko Cestari m.cestari@cineca.it Alessandro Marani a.marani@cineca.it Alessandro Grottesi a.grottesi@cineca.it SuperComputing Applications and

More information

ACEnet for CS6702 Ross Dickson, Computational Research Consultant 29 Sep 2009

ACEnet for CS6702 Ross Dickson, Computational Research Consultant 29 Sep 2009 ACEnet for CS6702 Ross Dickson, Computational Research Consultant 29 Sep 2009 What is ACEnet? Shared resource......for research computing... physics, chemistry, oceanography, biology, math, engineering,

More information

Duke Compute Cluster Workshop. 10/04/2018 Tom Milledge rc.duke.edu

Duke Compute Cluster Workshop. 10/04/2018 Tom Milledge rc.duke.edu Duke Compute Cluster Workshop 10/04/2018 Tom Milledge rc.duke.edu rescomputing@duke.edu Outline of talk Overview of Research Computing resources Duke Compute Cluster overview Running interactive and batch

More information

Introduction to the Cluster

Introduction to the Cluster Introduction to the Cluster Advanced Computing Center for Research and Education http://www.accre.vanderbilt.edu Follow us on Twitter for important news and updates: @ACCREVandy The Cluster We will be

More information

Sami Saarinen Peter Towers. 11th ECMWF Workshop on the Use of HPC in Meteorology Slide 1

Sami Saarinen Peter Towers. 11th ECMWF Workshop on the Use of HPC in Meteorology Slide 1 Acknowledgements: Petra Kogel Sami Saarinen Peter Towers 11th ECMWF Workshop on the Use of HPC in Meteorology Slide 1 Motivation Opteron and P690+ clusters MPI communications IFS Forecast Model IFS 4D-Var

More information

Illinois Proposal Considerations Greg Bauer

Illinois Proposal Considerations Greg Bauer - 2016 Greg Bauer Support model Blue Waters provides traditional Partner Consulting as part of its User Services. Standard service requests for assistance with porting, debugging, allocation issues, and

More information

Introduction to CINECA HPC Environment

Introduction to CINECA HPC Environment Introduction to CINECA HPC Environment 23nd Summer School on Parallel Computing 19-30 May 2014 m.cestari@cineca.it, i.baccarelli@cineca.it Goals You will learn: The basic overview of CINECA HPC systems

More information

Exercises: Abel/Colossus and SLURM

Exercises: Abel/Colossus and SLURM Exercises: Abel/Colossus and SLURM November 08, 2016 Sabry Razick The Research Computing Services Group, USIT Topics Get access Running a simple job Job script Running a simple job -- qlogin Customize

More information

Parallel Programming. Libraries and Implementations

Parallel Programming. Libraries and Implementations Parallel Programming Libraries and Implementations Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us

More information

Introduction to Joker Cyber Infrastructure Architecture Team CIA.NMSU.EDU

Introduction to Joker Cyber Infrastructure Architecture Team CIA.NMSU.EDU Introduction to Joker Cyber Infrastructure Architecture Team CIA.NMSU.EDU What is Joker? NMSU s supercomputer. 238 core computer cluster. Intel E-5 Xeon CPUs and Nvidia K-40 GPUs. InfiniBand innerconnect.

More information

Computing with the Moore Cluster

Computing with the Moore Cluster Computing with the Moore Cluster Edward Walter An overview of data management and job processing in the Moore compute cluster. Overview Getting access to the cluster Data management Submitting jobs (MPI

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

Introduction to HPC Using zcluster at GACRC

Introduction to HPC Using zcluster at GACRC Introduction to HPC Using zcluster at GACRC On-class STAT8330 Georgia Advanced Computing Resource Center University of Georgia Suchitra Pakala pakala@uga.edu Slides courtesy: Zhoufei Hou 1 Outline What

More information

Leibniz Supercomputer Centre. Movie on YouTube

Leibniz Supercomputer Centre. Movie on YouTube SuperMUC @ Leibniz Supercomputer Centre Movie on YouTube Peak Performance Peak performance: 3 Peta Flops 3*10 15 Flops Mega 10 6 million Giga 10 9 billion Tera 10 12 trillion Peta 10 15 quadrillion Exa

More information

LAB. Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers

LAB. Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers LAB Preparing for Stampede: Programming Heterogeneous Many-Core Supercomputers Dan Stanzione, Lars Koesterke, Bill Barth, Kent Milfeld dan/lars/bbarth/milfeld@tacc.utexas.edu XSEDE 12 July 16, 2012 1 Discovery

More information

The JANUS Computing Environment

The JANUS Computing Environment Research Computing UNIVERSITY OF COLORADO The JANUS Computing Environment Monte Lunacek monte.lunacek@colorado.edu rc-help@colorado.edu What is JANUS? November, 2011 1,368 Compute nodes 16,416 processors

More information

Introduction to SLURM on the High Performance Cluster at the Center for Computational Research

Introduction to SLURM on the High Performance Cluster at the Center for Computational Research Introduction to SLURM on the High Performance Cluster at the Center for Computational Research Cynthia Cornelius Center for Computational Research University at Buffalo, SUNY 701 Ellicott St Buffalo, NY

More information

Mapping MPI+X Applications to Multi-GPU Architectures

Mapping MPI+X Applications to Multi-GPU Architectures Mapping MPI+X Applications to Multi-GPU Architectures A Performance-Portable Approach Edgar A. León Computer Scientist San Jose, CA March 28, 2018 GPU Technology Conference This work was performed under

More information

Cluster Clonetroop: HowTo 2014

Cluster Clonetroop: HowTo 2014 2014/02/25 16:53 1/13 Cluster Clonetroop: HowTo 2014 Cluster Clonetroop: HowTo 2014 This section contains information about how to access, compile and execute jobs on Clonetroop, Laboratori de Càlcul Numeric's

More information

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008 SHARCNET Workshop on Parallel Computing Hugh Merz Laurentian University May 2008 What is Parallel Computing? A computational method that utilizes multiple processing elements to solve a problem in tandem

More information

School of Parallel Programming & Parallel Architecture for HPC ICTP October, Intro to HPC Architecture. Instructor: Ekpe Okorafor

School of Parallel Programming & Parallel Architecture for HPC ICTP October, Intro to HPC Architecture. Instructor: Ekpe Okorafor School of Parallel Programming & Parallel Architecture for HPC ICTP October, 2014 Intro to HPC Architecture Instructor: Ekpe Okorafor A little about me! PhD Computer Engineering Texas A&M University Computer

More information

CNAG Advanced User Training

CNAG Advanced User Training www.bsc.es CNAG Advanced User Training Aníbal Moreno, CNAG System Administrator Pablo Ródenas, BSC HPC Support Rubén Ramos Horta, CNAG HPC Support Barcelona,May the 5th Aim Understand CNAG s cluster design

More information

Introduction to High-Performance Computing (HPC)

Introduction to High-Performance Computing (HPC) Introduction to High-Performance Computing (HPC) Computer components CPU : Central Processing Unit CPU cores : individual processing units within a Storage : Disk drives HDD : Hard Disk Drive SSD : Solid

More information

Accelerator Programming Lecture 1

Accelerator Programming Lecture 1 Accelerator Programming Lecture 1 Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences, M17 manfred.liebmann@tum.de January 11, 2016 Accelerator Programming

More information

HPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Agenda

HPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Agenda KFUPM HPC Workshop April 29-30 2015 Mohamed Mekias HPC Solutions Consultant Agenda 1 Agenda-Day 1 HPC Overview What is a cluster? Shared v.s. Distributed Parallel v.s. Massively Parallel Interconnects

More information

Technology for a better society. hetcomp.com

Technology for a better society. hetcomp.com Technology for a better society hetcomp.com 1 J. Seland, C. Dyken, T. R. Hagen, A. R. Brodtkorb, J. Hjelmervik,E Bjønnes GPU Computing USIT Course Week 16th November 2011 hetcomp.com 2 9:30 10:15 Introduction

More information

How to run a job on a Cluster?

How to run a job on a Cluster? How to run a job on a Cluster? Cluster Training Workshop Dr Samuel Kortas Computational Scientist KAUST Supercomputing Laboratory Samuel.kortas@kaust.edu.sa 17 October 2017 Outline 1. Resources available

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction A set of general purpose processors is connected together.

More information

Using the computational resources at the GACRC

Using the computational resources at the GACRC An introduction to zcluster Georgia Advanced Computing Resource Center (GACRC) University of Georgia Dr. Landau s PHYS4601/6601 course - Spring 2017 What is GACRC? Georgia Advanced Computing Resource Center

More information

Trends in HPC (hardware complexity and software challenges)

Trends in HPC (hardware complexity and software challenges) Trends in HPC (hardware complexity and software challenges) Mike Giles Oxford e-research Centre Mathematical Institute MIT seminar March 13th, 2013 Mike Giles (Oxford) HPC Trends March 13th, 2013 1 / 18

More information

High Performance Computing in C and C++

High Performance Computing in C and C++ High Performance Computing in C and C++ Rita Borgo Computer Science Department, Swansea University WELCOME BACK Course Administration Contact Details Dr. Rita Borgo Home page: http://cs.swan.ac.uk/~csrb/

More information

Introduction to OpenMP. OpenMP basics OpenMP directives, clauses, and library routines

Introduction to OpenMP. OpenMP basics OpenMP directives, clauses, and library routines Introduction to OpenMP Introduction OpenMP basics OpenMP directives, clauses, and library routines What is OpenMP? What does OpenMP stands for? What does OpenMP stands for? Open specifications for Multi

More information

HPC Introductory Course - Exercises

HPC Introductory Course - Exercises HPC Introductory Course - Exercises The exercises in the following sections will guide you understand and become more familiar with how to use the Balena HPC service. Lines which start with $ are commands

More information

Analyzing the Performance of IWAVE on a Cluster using HPCToolkit

Analyzing the Performance of IWAVE on a Cluster using HPCToolkit Analyzing the Performance of IWAVE on a Cluster using HPCToolkit John Mellor-Crummey and Laksono Adhianto Department of Computer Science Rice University {johnmc,laksono}@rice.edu TRIP Meeting March 30,

More information

Choosing Resources Wisely. What is Research Computing?

Choosing Resources Wisely. What is Research Computing? Choosing Resources Wisely Scott Yockel, PhD Harvard - Research Computing What is Research Computing? Faculty of Arts and Sciences (FAS) department that handles nonenterprise IT requests from researchers.

More information

An Overview of Fujitsu s Lustre Based File System

An Overview of Fujitsu s Lustre Based File System An Overview of Fujitsu s Lustre Based File System Shinji Sumimoto Fujitsu Limited Apr.12 2011 For Maximizing CPU Utilization by Minimizing File IO Overhead Outline Target System Overview Goals of Fujitsu

More information

Genius Quick Start Guide

Genius Quick Start Guide Genius Quick Start Guide Overview of the system Genius consists of a total of 116 nodes with 2 Skylake Xeon Gold 6140 processors. Each with 18 cores, at least 192GB of memory and 800 GB of local SSD disk.

More information

Introduction to the NCAR HPC Systems. 25 May 2018 Consulting Services Group Brian Vanderwende

Introduction to the NCAR HPC Systems. 25 May 2018 Consulting Services Group Brian Vanderwende Introduction to the NCAR HPC Systems 25 May 2018 Consulting Services Group Brian Vanderwende Topics to cover Overview of the NCAR cluster resources Basic tasks in the HPC environment Accessing pre-built

More information