HIGH-THROUGHPUT COMPUTING AND YOUR RESEARCH


HIGH-THROUGHPUT COMPUTING AND YOUR RESEARCH Christina Koch, Research Computing Facilitator Center for High Throughput Computing STAT679, October 29, 2018 1

About Me I work for the Center for High Throughput Computing (CHTC) We provide resources for UW Madison faculty, students and staff whose research problems are too big for their laptops/desktops 2

Overview: Large Scale Computing; High Throughput Computing; Large Scale Computing Resources and the Center for High Throughput Computing (CHTC); Running HTC Jobs at CHTC; User Expectations; Next Steps 3

LARGE SCALE COMPUTING 4

What is large-scale computing? larger than desktop (in memory, data, processors) 5

Problem: running many computations takes a long time (running on one processor) 6

So how do you speed things up? 7

Break up the work! Use more processors! (parallelize) 8

High throughput computing (figure: many short, independent jobs spread across n processors over time) 9

High performance computing (figure: a single large job spanning n processors over time) 10

HIGH THROUGHPUT COMPUTING 11

High Throughput Examples Test many parameter combinations Analyze multiple images or datasets Do a replicate/randomized analysis Align genome/RNA sequence data 12

The Philosophy of HTC Break work into many smaller jobs: single or few CPUs, short run times, smaller input and output per job. Run on as many processors as possible: smaller and shorter jobs are best; take dependencies with you (like R). Automate as much as you can: black-box programs that use various input files; numbered files. Scale up gradually. 13
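As a concrete illustration of "numbered files" (not from the slides), one common pattern is to split a large input into many small, numbered pieces so that each HTCondor job works on one piece; the file names below are hypothetical:

  # hypothetical example: break one large input into numbered chunks, one per job
  split -d -l 1000 big_input.csv chunk_    # creates chunk_00, chunk_01, ...
  ls chunk_* > job_list.txt                # list the chunks so a submit file can queue one job per line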

CENTER FOR HIGH THROUGHPUT COMPUTING 14

High throughput computing (figure: many short, independent jobs spread across n processors over time) 15

WHAT WE NEED Lots of computers, to run multiple independent computations 16

CHTC Services Center for High Throughput Computing, est. 2006. Large-scale, campus-shared computing systems: high-throughput computing (HTC) and high-performance computing (HPC) systems; all standard services provided free of charge; hardware buy-in options; support and training for using our systems; proposal assistance. chtc.cs.wisc.edu 17

CHTC Accessible Computing CHTC: time limit <72 hrs/job, 10,000+ CPU hrs/day 18

CHTC Accessible Computing UW Grid: up to ~8 hrs/job, ~20,000 CPU hrs/day; CHTC: <72 hrs/job, 10,000+ CPU hrs/day 19

CHTC Accessible Computing Open Science Grid: up to ~4 hrs/job, ~200,000 CPU hrs/day; UW Grid: up to ~8 hrs/job, ~20,000 CPU hrs/day; CHTC: <72 hrs/job, 10,000+ CPU hrs/day 20

Quick Facts (http://chtc.cs.wisc.edu)
                        Jul'15-Jun'16   Jul'16-Jun'17   Jul'17-Jun'18
Million Hours Served         325             383             408
Research Projects            205             255             287
Individual researchers: 30 years of computing per day. Researchers who use the CHTC are located all over campus (red buildings on the campus map). 21-22

What else? Software: we can support most unlicensed/open-source software (R, Python, other projects); we can also support Matlab. Data: CHTC cannot *store* data; however, you can work with up to several TB of data on our system. 23

How it works Job submission: instead of running programs on your own computer, log in and submit jobs. A job is a single independent calculation; you can run hundreds of jobs at once (or more). HTCondor: the computer program that controls and runs jobs on CHTC's computers. 24

Getting Started Facilitators Help researchers get started Advise on best approach and resources for your research problem How do I get an account? Temporary accounts created for this class. If you want to use CHTC beyond this class for research, fill out our account request form: http://chtc.cs.wisc.edu/form 25

RUNNING A JOB ON CHTC S HIGH THROUGHPUT SYSTEM WITH HTCONDOR 26

How It Works (in CHTC) Submit jobs to a queue (on a submit server); HTCondor schedules them to run on computers that belong to CHTC (execute servers). (diagram: one submit server feeding several execute servers) 27

What is HTCondor? Software that schedules and runs computing tasks on computers. 28

Job Example Consider an imaginary program called compare_states, which compares two data files (wi.dat and us.dat) and produces a single output file (wi.dat.out). 29

  $ compare_states wi.dat us.dat wi.dat.out
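The program compare_states is imaginary; purely to make the example concrete, here is one hedged guess at what such a program could look like as a small shell script (the slides never define it, so everything below is an assumption):

  #!/bin/bash
  # Hypothetical stand-in for the imaginary compare_states program:
  # compare two data files and write the differences to an output file.
  # Usage: compare_states <file1> <file2> <outfile>
  infile1="$1"
  infile2="$2"
  outfile="$3"
  # diff exits non-zero when the files differ; "|| true" keeps the script's
  # exit code at 0 so HTCondor does not treat a difference as a job failure.
  diff "$infile1" "$infile2" > "$outfile" || true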

Submit File (job.submit):

  executable = compare_states
  arguments = wi.dat us.dat wi.dat.out

  should_transfer_files = YES
  transfer_input_files = us.dat, wi.dat
  when_to_transfer_output = ON_EXIT

  log = job.log
  output = job.out
  error = job.err

  request_cpus = 1
  request_disk = 20MB
  request_memory = 20MB

  queue 1

List your executable and any arguments it takes. Arguments are any options passed to the executable from the command line, as in: $ compare_states wi.dat us.dat wi.dat.out 30

Submit File (job.submit), continued: Indicate your input files (transfer_input_files = us.dat, wi.dat). 31

Submit File (job.submit), continued: HTCondor will transfer back all new and changed files (usually output, e.g. wi.dat.out) from the job. 32

Submit File (job.submit), continued: log is a file created by HTCondor to track job progress; output/error capture stdout and stderr. 33

Submit File (job.submit), continued: Request the appropriate resources for your job to run (request_cpus, request_disk, request_memory). queue is the keyword indicating "create a job." 34

Submitting and Monitoring To submit a job/jobs: condor_submit submit_file_name. To monitor submitted jobs, use: condor_q. 35

  $ condor_submit job.submit
  Submitting job(s).
  1 job(s) submitted to cluster 128.

  $ condor_q
  -- Schedd: submit-5.chtc.wisc.edu : <128.104.101.92:9618?... @ 05/01/17 10:35:54
  OWNER  BATCH_NAME  SUBMITTED   DONE  RUN  IDLE  TOTAL  JOB_IDS
  alice  ID: 128     5/9 11:09                1      1   128.0

  1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended

HTCondor Manual: condor_submit. HTCondor Manual: condor_q.

More about condor_q By default condor_q shows: your own jobs only (as of 8.6), summarized in batches (as of 8.6). You can constrain the output with a username, ClusterId, or full JobId, denoted [U/C/J] below. JobId = ClusterId.ProcId. 36

  $ condor_q
  -- Schedd: submit-5.chtc.wisc.edu : <128.104.101.92:9618?... @ 05/09/17 11:35:54
  OWNER  BATCH_NAME  SUBMITTED   DONE  RUN  IDLE  TOTAL  JOB_IDS
  alice  ID: 128     5/9 11:09                1      1   128.0

  1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended
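For example, those constraints look like this on the command line (standard condor_q usage, matching the user and cluster shown above):

  $ condor_q alice    # [U] all jobs owned by user alice
  $ condor_q 128      # [C] all jobs in cluster 128
  $ condor_q 128.0    # [J] just the single job 128.0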

More about condor_q To see individual job information, use: condor_q -nobatch. We will use the -nobatch option on the following slides to show extra detail about what is happening with a job. 37

  $ condor_q -nobatch
  -- Schedd: submit-5.chtc.wisc.edu : <128.104.101.92:9618?...
   ID     OWNER  SUBMITTED   RUN_TIME    ST  PRI  SIZE  CMD
  128.0   alice  5/9 11:09   0+00:00:00  I   0    0.0   compare_states wi.dat us.dat

  1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended

Job Idle 38

  $ condor_q -nobatch
  -- Schedd: submit-5.chtc.wisc.edu : <128.104.101.92:9618?...
   ID     OWNER  SUBMITTED   RUN_TIME    ST  PRI  SIZE  CMD
  128.0   alice  5/9 11:09   0+00:00:00  I   0    0.0   compare_states wi.dat us.dat

  1 jobs; 0 completed, 0 removed, 1 idle, 0 running, 0 held, 0 suspended

Submit Node (submit_dir)/: job.submit, compare_states, wi.dat, us.dat, job.log

Job Starts 39

  $ condor_q -nobatch
  -- Schedd: submit-5.chtc.wisc.edu : <128.104.101.92:9618?...
   ID     OWNER  SUBMITTED   RUN_TIME    ST  PRI  SIZE  CMD
  128.0   alice  5/9 11:09   0+00:00:00  <   0    0.0   compare_states wi.dat us.dat

  1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended

Submit Node (submit_dir)/: job.submit, compare_states, wi.dat, us.dat, job.log
Execute Node (execute_dir)/: compare_states, wi.dat, us.dat

Job Running 40

  $ condor_q -nobatch
  -- Schedd: submit-5.chtc.wisc.edu : <128.104.101.92:9618?...
   ID     OWNER  SUBMITTED   RUN_TIME    ST  PRI  SIZE  CMD
  128.0   alice  5/9 11:09   0+00:01:08  R   0    0.0   compare_states wi.dat us.dat

  1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended

Submit Node (submit_dir)/: job.submit, compare_states, wi.dat, us.dat, job.log
Execute Node (execute_dir)/: compare_states, wi.dat, us.dat, stderr, stdout, wi.dat.out

Job Completes 41

  $ condor_q -nobatch
  -- Schedd: submit-5.chtc.wisc.edu : <128.104.101.92:9618?...
   ID     OWNER  SUBMITTED   RUN_TIME    ST  PRI  SIZE  CMD
  128     alice  5/9 11:09   0+00:02:02  >   0    0.0   compare_states wi.dat us.dat

  1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended

Submit Node (submit_dir)/: job.submit, compare_states, wi.dat, us.dat, job.log, stderr, stdout, wi.dat.out
Execute Node (execute_dir)/: compare_states, wi.dat, us.dat, stderr, stdout, wi.dat.out

Job Completes (cont.) 42

  $ condor_q -nobatch
  -- Schedd: submit-5.chtc.wisc.edu : <128.104.101.92:9618?...
   ID     OWNER  SUBMITTED   RUN_TIME    ST  PRI  SIZE  CMD

  0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended

Submit Node (submit_dir)/: job.submit, compare_states, wi.dat, us.dat, job.log, job.out, job.err, wi.dat.out

TRY IT OUT See the instructions on your handout 43

Submitting Multiple Jobs Replacing single-job inputs... 44

  executable = compare_states
  arguments = wi.dat us.dat wi.dat.out
  transfer_input_files = us.dat, wi.dat
  queue 1

...with a variable of choice:

  executable = compare_states
  arguments = $(infile) us.dat $(infile).out
  transfer_input_files = us.dat, $(infile)
  queue ...

Possible Queue Statements 45

Multiple queue statements (Not Recommended):
  infile = wi.dat
  queue 1
  infile = ca.dat
  queue 1
  infile = ia.dat
  queue 1

Recommended:
  matching ... pattern:  queue infile matching *.dat
  in ... list:           queue infile in (wi.dat ca.dat ia.dat)
  from ... file:         queue infile from state_list.txt
(where state_list.txt lists wi.dat, ca.dat, ia.dat, one per line)
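Putting these pieces together with the earlier job.submit example, a complete submit file using the recommended "from ... file" form could look like the sketch below (assembled from the slide contents; the resource requests are copied from the compare_states example, and $(Process) is the same macro used later for the R job):

  executable = compare_states
  arguments = $(infile) us.dat $(infile).out
  transfer_input_files = us.dat, $(infile)

  log = $(Process).log
  output = $(Process).out
  error = $(Process).err

  request_cpus = 1
  request_disk = 20MB
  request_memory = 20MB

  queue infile from state_list.txt

This would create one job per line of state_list.txt.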

USER EXPECTATIONS 46

Be responsible! These resources are shared and you get to use them for free -- be a good citizen. Ask questions if you aren't sure about something. Don't run programs directly on the submit server. Data files should be small; talk to us if you want to submit jobs with large amounts (> 1 GB) of data. 47

Resource Request Jobs are nearly always using a part of a computer, not the whole thing. It is very important to request appropriate resources (memory, cpus, disk) for a job. (diagram: your request is a small slice of the whole computer) 48

Resource Assumptions Even if your system has default CPU, memory and disk requests, these may be too small! It is important to run test jobs and use the log file to request the right amount of resources: requesting too little causes problems for your jobs and others'; jobs might be held by HTCondor. Requesting too much means jobs will match to fewer slots. 49
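If jobs do end up held, condor_q can list them along with the reason each one was held; the -hold flag is standard HTCondor, though it is not shown on these slides:

  $ condor_q -hold    # show held jobs and their hold reasons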

Log File 000 (128.000.000) 05/09 11:09:08 Job submitted from host: <128.104.101.92&sock=6423_b881_3>... 001 (128.000.000) 05/09 11:10:46 Job executing on host: <128.104.101.128:9618&sock=5053_3126_3>... 006 (128.000.000) 05/09 11:10:54 Image size of job updated: 220 1 - MemoryUsage of job (MB) 220 - ResidentSetSize of job (KB)... 005 (128.000.000) 05/09 11:12:48 Job terminated. (1) Normal termination (return value 0) Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage 0 - Run Bytes Sent By Job 33 - Run Bytes Received By Job 0 - Total Bytes Sent By Job 33 - Total Bytes Received By Job Partitionable Resources : Usage Request Allocated Cpus : 1 1 Disk (KB) : 14 20480 17203728 Memory (MB) : 1 20 20 50

TESTING IS KEY! ALWAYS run test jobs before submitting many jobs at once. 51

Getting Help Christina (me) and Lauren Michael work for CHTC as Research Computing Facilitators. It's our job to answer questions and help people get started with computing at CHTC. Email us! chtc@cs.wisc.edu. Or come to office hours in the WID: Tues/Thurs 3:00-4:30, Wed 9:30-11:30. http://chtc.cs.wisc.edu/get-help 52

NEXT STEPS 53

Building up a workflow Try to get ONE job running, then follow the steps below: troubleshoot; check memory/disk requirements; do a small-scale test of 5-10 jobs; check memory + disk requirements *again*; run the full-scale set of jobs. 54

Special Considerations for Code Your code should be able to take in different inputs (usually via an argument). Example (in R): 55

  # analysis.r
  options <- commandArgs(trailingOnly=TRUE)
  input_file <- options[[1]]
  # continue w/ input file

  $ Rscript analysis.r input.tgz

Special Considerations for R Guide for R Jobs: chtc.cs.wisc.edu/r-jobs.shtml. Need to: 1. prepare a portable R installation; 2. create a primary executable script that will run your R code; 3. submit jobs with different input for each. 56

1. Prepare portable R (described in: chtc.cs.wisc.edu/r-jobs.shtml) 57
  1. download your required R version (.tar.gz)
  2. follow CHTC's guide to start an interactive build job: chtc.cs.wisc.edu/inter-submit.shtml
  3. install R and any libraries
  4. pack up the resulting R installation directory
  5. exit the interactive session
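Very roughly, the commands typed during that interactive build session might look like the sketch below; the R version, file names, and configure options are placeholders, and CHTC's R guide is the authoritative source for the exact steps:

  # inside the interactive build job (all names below are hypothetical)
  mkdir "$PWD/R"                      # target directory for the portable R installation
  tar -xzf R-3.5.1.tar.gz             # unpack the R source you downloaded
  cd R-3.5.1
  ./configure --prefix="$PWD/../R"    # install into the local R directory
  make
  make install
  cd ..
  # optionally start R/bin/R here and install.packages() any libraries you need
  tar -czf R.tar.gz R/                # pack up the installation (step 4)
  exit                                # leave the interactive session (step 5)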

2. Create the Job executable 58

  #!/bin/bash
  # command line script (runr) to run R
  # untar the portable R installation:
  tar xzf R.tar.gz
  # make sure the above R installation will be used:
  export PATH=$(pwd)/R/bin:$PATH
  export RHOME=$(pwd)/R
  # run R, with my R script as an argument to this executable:
  Rscript myscript.r $1

3. Create the submit file 59

  executable = runr.sh
  arguments = input.tgz

  log = $(Process).log
  error = $(Process).err
  output = $(Process).out

  transfer_input_files = R.tar.gz, input_file, myscript.r

  request_cpus = 1
  request_disk = 1GB
  request_memory = 1GB

  queue
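Submitting this works just like before; since the log/error/output names use $(Process), and job numbering within a cluster starts at 0, a single queued job produces 0.log, 0.err, and 0.out. The submit-file name below is a placeholder:

  $ condor_submit r_job.submit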