Day 9: Introduction to CHTC


Day 9: Introduction to CHTC Suggested reading: Condor 7.7 Manual (http://www.cs.wisc.edu/condor/manual/v7.7/), Chapter 1: Overview; Chapter 2: Users' Manual (at most, 2.1–2.7) 1

Turn In Homework 2

Homework Review 3

CHTC Center for High Throughput Computing 4

Science Theory Computing Experiments 5

Computing resources for researchers Right here on campus Free for UW Madison researchers Funded by UW, NSF, Dept. of Energy, NIH, … Last year: 15 million CPU hours delivered 6

High-Throughput Computing: use of many computing resources over long periods of time to accomplish a computational task Wikipedia (retrieved 7 Nov 2011) Not high-performance computing (HPC): TOP500 list of supercomputers, FLOPS (floating-point operations per second) HTC aims to maximize long-term throughput How many results this week/month/year? FLOPY = (60 × 60 × 24 × 365) × FLOPS = 31,536,000 × FLOPS 7

The Hope (& Hype) of Distributed Computing Do a lot of computing Always be available and reliable Degrade gracefully Spread the workload automatically Grow (and shrink) easily when needed Respond well to temporary overloads Adapt easily to new uses Adapted from: Enslow, P. H., Jr. (1978). What is a distributed data processing system? Computer, 11(1), 13–21. doi:10.1109/C-M.1978.217901 8

Definition of Distributed Computing Multiplicity of resources General purpose; not same, but same capabilities More replication is better Component interconnection Networked, loosely coupled Unity of control Not centralized control (single point of failure) Unified by common goal, and hence policy System transparency Whole system appears as one virtual system to user Component autonomy Autonomous (act locally) but cooperative (think globally) Enslow, P. H., Jr., & Saponas, T. G. (1981). Distributed and decentralized control in fully distributed processing systems: A survey of applicable models (GIT-ICS-81/02). Georgia Institute of Technology. 9

What CHTC Offers Computing People 10

CHTC Machines Hardware: ~160 8–12-core, 2.6–2.8 GHz Intel 64-bit, 1U servers Typical machine: 12–24 GB memory, ~350 GB disk, 1 Gbit Ethernet (good for file transfer, not MPI) Software: Scientific Linux 5 (variant of Red Hat Enterprise Linux 5) Languages: Python, C/C++, Java, Perl, Fortran, … Extra software (no licenses): R, MATLAB, Octave Location: Mostly in CompSci B240, some in WID 11

CHTC Usage Statistics ~35,000 hours per day ~1,000,000 hours per month ~15,000,000 hours per year 12

Open Science Grid HTC scaled way up Over 100 sites Mostly in U.S., plus others Past year: ~200,000,000 jobs ~514,000,000 CPU hours ~280,000 TB transferred Can submit jobs to CHTC, move to OSG http://www.opensciencegrid.org/ 13

Anyone want a tour? 14

Condor 15

History and Status History Started in 1988 as a cycle scavenger Protected interests of users and machine owners Today Expanded to become CHTC team: 20+ full-time staff Current production release: Condor 7.6.4 Condor software alone: ~700,000 lines of C/C++ code Miron Livny Professor, UW Madison CompSci Director, CHTC Dir. of Core Comp. Tech., WID/MIR Tech. Director & PI, OSG 16

What Does Condor Do? Users Define jobs, their requirements, and preferences Submit and cancel jobs Check on the state of a job Check on the state of the machines Administrators Configure and control the Condor system Declare policies on machine use, pool use, etc. Internally Match jobs to machines (enforcing all policies) Track and manage machines Track and run jobs 17

Jobs = Computer programs Not interactive (e.g., Word, Firefox, email) Batch processing: Run without human intervention Input: command-line arguments, files, downloads? Run: do stuff Output: standard output & error, files, DB update? Scheduling Reserved: Person gets time slot, computer runs then Opportunistic: Person submits job, computer decides schedule 18

Machines Terminology A machine is a physical computer (typically) May have multiple processors (computer chips) These days, each may have multiple cores (CPUs) Condor: Slot One assignable unit of a computing resource Most often, corresponds to one core Thus, typical machines today have 4–40 slots Advanced Condor feature: Can request multiple slots for a single job (that uses parallel computing) 19

Matchmaking Two-way process of matching jobs and machines Job Requirements, e.g.: OS, architecture, memory, disk Preferences, e.g.: owner, speed, memory, disk, load Machine Requirements, e.g.: submitter, time of day, usage Preferences, e.g.: submitter, memory, disk, load Administrator Preferences, e.g.: prior usage, priority, various limits Thus: Not as simple as waiting in a line! 20
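As a preview of the submit files introduced a few slides below, here is a sketch (not from the lecture) of how the job's side of matchmaking is written. requirements and rank take ClassAd expressions over standard machine attributes; the thresholds here are invented for illustration:

requirements = (OpSys == "LINUX") && (Arch == "X86_64") && (Memory >= 2048)
rank         = Memory

requirements says which slots are acceptable at all (here: 64-bit Linux with at least 2 GB of memory); rank says which of the acceptable slots are preferred (here: the more memory, the better). Machine owners and administrators express their side of the match with analogous expressions in the machines' Condor configuration.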

Running Jobs 21

Our Submit Machine Access Hostname (ssh): submit-368.chtc.wisc.edu If enrolled, get account info from me Rules Full access to all CHTC resources (i.e., machines) All UW Information Technology policies apply http://www.cio.wisc.edu/policies.aspx OK for research and training Usage is monitored Notes No backups! Keep original files elsewhere Accounts will be disabled 1 January 2012, unless … 22
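Logging in is an ordinary ssh session; for example (replace the placeholder with the account name you are given):

ssh <your-username>@submit-368.chtc.wisc.edu

All of the condor_* commands on the following slides are run from that machine.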

Viewing Slots condor_status With no arguments, lists all slots currently in the pool; summary info at end For more options: -h, Condor Manual, next class

Name               OpSys  Arch    State      Activity  LoadAv  Mem   ActvtyTime
slot6@opt-a001.cht LINUX  X86_64  Claimed    Busy      1.000   1024  0+19:09:32
slot7@opt-a001.cht LINUX  X86_64  Claimed    Busy      1.000   1024  0+19:09:31
slot8@opt-a001.cht LINUX  X86_64  Unclaimed  Idle      1.000   1024  0+17:37:54
slot9@opt-a001.cht LINUX  X86_64  Claimed    Busy      1.000   1024  0+19:09:32
slot10@opt-a002.ch LINUX  X86_64  Unclaimed  Idle      0.000   1024  0+17:55:15
slot11@opt-a002.ch LINUX  X86_64  Unclaimed  Idle      0.000   1024  0+17:55:16

               Total  Owner  Claimed  Unclaimed  Matched  Preempting  Backfill
INTEL/WINNT51      2      0        0          2        0           0         0
INTEL/WINNT61     52      2        0         50        0           0         0
X86_64/LINUX    2086    544     1258        284        0           0         0
Total           2140    546     1258        336        0           0         0
23
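A few condor_status variations that can be handy even before the fuller treatment next class (these are standard options; see condor_status -h for the complete list):

condor_status -avail      # only slots currently willing to accept a job
condor_status -claimed    # only slots currently running someone's job
condor_status -total      # skip the per-slot lines; print just the summary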

Viewing Jobs condor_q With no args, lists all jobs waiting or running here For more options: -h, Condor Manual, next class

-- Submitter: submit-368.chtc.wisc.edu : <...> : ...
 ID      OWNER    SUBMITTED     RUN_TIME   ST  PRI  SIZE  CMD
 6.0     cat      11/12 09:30   0+00:00:00  I   0   0.0   explore.py
 6.1     cat      11/12 09:30   0+00:00:00  I   0   0.0   explore.py
 6.2     cat      11/12 09:30   0+00:00:00  I   0   0.0   explore.py
 6.3     cat      11/12 09:30   0+00:00:00  I   0   0.0   explore.py
 6.4     cat      11/12 09:30   0+00:00:00  I   0   0.0   explore.py

5 jobs; 5 idle, 0 running, 0 held

condor_q owner Shows just one owner's jobs (e.g., your own)
24

Basic Submit File

executable = word_freq.py
universe = vanilla
arguments = "words.txt 1000"
output = word_freq.out
error = word_freq.err
log = word_freq.log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = words.txt
queue

executable: Program to run; must be runnable from the command line. Path is relative to the current directory when submitted.
arguments: Command-line arguments to pass to the executable; surround the contents with double quotes. [opt]
output, error: Local files that will receive the standard output and error from the run. [opt]
log: Condor's log file from running the job; very helpful, do not omit!
transfer_input_files: Comma-separated list of input files to transfer to the run machine. [opt]
queue: Must have this to run the job!
25
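To make the batch-processing pattern from slide 18 concrete, here is a minimal sketch of the kind of script such a submit file could drive. This is not the actual word_freq.py handed out in class; the counting logic and output format are assumptions for illustration. It reads its input file name and a word limit from the command-line arguments and writes results to standard output, which Condor captures into word_freq.out.

#!/usr/bin/env python
# Hypothetical word_freq.py, for illustration only: count word frequencies
# in a text file.  Invoked by Condor as: word_freq.py words.txt 1000
import sys
from collections import defaultdict

def main():
    input_path = sys.argv[1]      # "words.txt", transferred in by Condor
    max_words = int(sys.argv[2])  # "1000": how many words to report (assumed meaning)

    counts = defaultdict(int)
    with open(input_path) as f:
        for line in f:
            for word in line.split():
                counts[word.lower()] += 1

    # Most frequent words first; standard output becomes word_freq.out
    for word, n in sorted(counts.items(), key=lambda item: -item[1])[:max_words]:
        print("%s\t%d" % (word, n))

if __name__ == "__main__":
    main()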

Submit a Job condor_submit submit-file Submits job to local submit machine Use condor_q to track Submitting job(s). 1 job(s) submitted to cluster NNN. One condor_submit yields one cluster (in queue) Each queue statement yields one process condor_q: ID is cluster.process (e.g., 8.0) We will see how to set up multiple jobs next time 26
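Putting this together with the previous slides, a session might look roughly like the following; the submit file name, cluster number, and timestamps are invented for illustration:

$ condor_submit word_freq.sub
Submitting job(s).
1 job(s) submitted to cluster 7.

$ condor_q cat
-- Submitter: submit-368.chtc.wisc.edu : <...> : ...
 ID      OWNER    SUBMITTED     RUN_TIME   ST  PRI  SIZE  CMD
 7.0     cat      11/12 10:05   0+00:00:00  I   0   0.0   word_freq.py

1 jobs; 1 idle, 0 running, 0 held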

Remove a Job condor_rm cluster [...] condor_rm cluster.process [...] Removes one or more jobs from the queue Identify each removal by whole cluster or single ID Only you (or admin) can remove your own jobs Cluster NNN has been marked for removal. 27
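For example, with the cluster of explore.py jobs shown on slide 24 (cluster 6), either form works; the confirmation messages are approximately what Condor prints:

$ condor_rm 6.2
Job 6.2 marked for removal

$ condor_rm 6
Cluster 6 has been marked for removal.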

Homework 28

Homework Run a job or several! I supply a Python script a bit like homework #1 How many of your past homeworks can you run? Do you have any other jobs to run? Turn in submit file + resulting log, out, and err files In spite of the above, enjoy the Thanksgiving break! 29