glideinwms Training @ UCSD - Condor tuning
by Igor Sfiligoi (UCSD)
UCSD, Jan 18th 2012


Regulating User Priorities

User priorities

By default, the Negotiator treats all users equally, so you get fair share out of the box: if you have two users, on average each gets 1/2 of the slots. If you have different needs, the Negotiator supports:
- Priority factors: a PF reduces a user's priority. If users X and Y have PF_X = (N-1)*PF_Y, on average user X gets 1/N of the slots (and user Y the rest)
- Accounting groups (see next slide)

http://research.cs.wisc.edu/condor/manual/v7.6/3_4user_priorities.html

Accounting groups

Users can be joined into accounting groups. The Negotiator defines the groups, but jobs specify which group they belong to. Each group can be given a quota, which can be relative to the size of the pool; the sum of the group's running jobs cannot exceed it. If the quotas sum to more than 100%, they can be used as relative priorities (here higher is better): each group will be given, on average, quota_G/sum(quotas) of the slots. Jobs without any group may never get anything.

Configuring AC

To enable accounting groups, just define the quotas in the Negotiator config file. Which jobs go into which group is not defined here.

# condor_config.local
# 1.0 is 100%
GROUP_QUOTA_DYNAMIC_group_higgs = 3.2
GROUP_QUOTA_DYNAMIC_group_b = 2.2
GROUP_QUOTA_DYNAMIC_group_k = 6.5
GROUP_QUOTA_DYNAMIC_group_susy = 8.8
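
Since these quotas sum to 20.7 (well above 1.0, i.e. 100%), they act as relative priorities per the previous slide's quota_G/sum(quotas) rule: group_higgs gets roughly 3.2/20.7 ≈ 15% of the slots, group_b ≈ 11%, group_k ≈ 31% and group_susy ≈ 43%, on average and over the long term.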

Using the AC

Users must specify which group they belong to; there is no automatic mapping or validation in Condor, it is based on trust. Jobs must add to their submit file:
+AccountingGroup = "<group>.<user>"

Universe = vanilla
Executable = cosmos
Arguments = -k 1543.3
Output = cosmos.out
Input = cosmos.in
Log = cosmos.log
+AccountingGroup = "group_higgs.frieda"
Queue 1
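
As a quick sanity check (not on the original slide), you can verify that the attribute made it into the job ClassAd after submission; the output shown is what you would expect for the example job above:

$ condor_q -format "%s\n" AccountingGroup
group_higgs.frieda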

Configuring PF

Priority factors are a dynamic property, set with the command line tool condor_userprio; anyone can read them, but they can only be set by the Negotiator administrator. Although dynamic, they are persistent between Negotiator restarts. You may want to set a higher default PF (e.g. 1000): the default is 1, and a user's PF cannot go below the default. The default is regulated by the DEFAULT_PRIO_FACTOR attribute.

$ condor_userprio -all -allusers | grep user1@node1
user1@node1  8016.22  8.02  10.00  0  15780.63  11/23/2011 05:59  12/30/2011 20:37
$ condor_userprio -setfactor user1@node1 1000
The priority factor of user1@node1 was set to 1000.000000
$ condor_userprio -all -allusers | grep user1@node1
user1@node1  8016.22  8.02  1000.00  0  15780.63  11/23/2011 05:59  12/30/2011 20:37

http://research.cs.wisc.edu/condor/manual/v7.6/2_7priorities_preemption.html#sec:user-priority-explained
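
A minimal config sketch for raising the default; the knob is the one named on the slide, and the value follows the slide's "e.g. 1000" suggestion:

# condor_config.local
DEFAULT_PRIO_FACTOR = 1000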

Negotiation cycle optimization

Negotiation internals

The Negotiator does the matchmaking in loops:
- Get all Machine ClassAds (from the Collector)
- Get Job ClassAds (from the Schedds), one by one, prioritized by user
- Tell the Schedd if a match has been found
- Sleep
- Repeat

Sleep time

The Negotiator sleeps between cycles to save CPU, but will be interrupted if new jobs are submitted. This makes sense for static pools, but can be a problem for dynamic ones like glideins: the sleep will not be interrupted if new glideins join, so they sit unmatched. Keep the sleep as low as feasible via NEGOTIATOR_INTERVAL; the installer sets it to 60s.

NEGOTIATOR_INTERVAL = 60

http://www.cs.wisc.edu/condor/manual/v7.6/3_3configuration.html#param:negotiatorinterval

Preemption

By default, Condor will preempt a user as soon as jobs from a higher priority user are available. Preemption means wasted CPU; preemption means killing jobs. The glideinwms installer therefore disables preemption: once a glidein is given to a user, he can keep it until the glidein dies. Glidein lifetime is limited, so fair share is still enforced in the long term.

# Prevent preemption
PREEMPTION_REQUIREMENTS = False
# Speed up matchmaking
NEGOTIATOR_CONSIDER_PREEMPTION = False

http://www.cs.wisc.edu/condor/manual/v7.6/3_3configuration.html#param:negotiatorconsiderpreemption

Protecting against dead glideins

Glideins can die; that is just a fact of life on the Grid. But their ClassAds will stay in the Collector for about 20 mins, and if they were Idle at the time, they will match jobs! Order the Machine ClassAds by freshness, making sure that only the lowest priority users get stuck with the stale ones.

NEGOTIATOR_POST_JOB_RANK = MY.LastHeardFrom

http://www.cs.wisc.edu/condor/manual/v7.6/3_3configuration.html#param:negotiatorpostjobrank

Protecting against slow schedds

If one schedd gets stuck, you want to limit the damage and still match jobs from the other schedds. The Negotiator can limit the time spent on each schedd, and the glideinwms installer sets these limits (see the Condor manual for details). This was really important before 7.6.X... less of an issue now.

NEGOTIATOR_MAX_TIME_PER_SUBMITTER = 60
NEGOTIATOR_MAX_TIME_PER_PIESPIN = 20

http://www.cs.wisc.edu/condor/manual/v7.6/3_3configuration.html#param:negotiatormaxtimepersubmitter

One way matchmaking

The Negotiator normally sends the match to both the Schedd and the Startd, but only the Schedd strictly needs it. Given that the glideins can be behind a firewall, it is a good idea to disable this behavior.

NEGOTIATOR_INFORM_STARTD = False

Collector tuning

Collector tuning

The major tuning is the Collector tree, already discussed in the Installation section. There are three more optional tunings:
- Security session lifetime
- File descriptor limits
- Client handling parameters

Security session lifetime

All communication to the Collectors (leaves) is GSI based, but GSI handshakes are expensive! Condor thus uses GSI just to exchange a shared secret (i.e. a password) with the remote glidein, and uses this password for all further communications. The password also has a limited lifetime, to limit the damage in case it is stolen. The default lifetime is quite long (it was months in old versions of Condor!); the glideinwms installer limits it to 12h.

SEC_DAEMON_SESSION_DURATION = 50000

http://www.cs.wisc.edu/condor/manual/v7.6/3_3configuration.html#param:secdefaultsessionduration
http://iopscience.iop.org/1742-6596/219/4/042017

File descriptor limit

The Collectors (leaves) also function as CCB, and will use 5+ FDs for every glidein they serve! By default, the OS only gives 1024 FDs to each process, and Condor limits itself to ~80% of that so it has a few FDs left for internal needs (e.g. logs). You may want to increase this limit; it must be done as root (and the Collector is usually not run as root), and it is not done by the glideinwms installer (a hedged sketch follows below). The alternative is to just run more secondary collectors (recommended). And don't forget the system-wide limit in /proc/sys/fs/file-max.
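
A minimal sketch of raising the limits, assuming the Collector daemons are started from an init script that runs as root; the exact script, values and paths are assumptions and will differ per installation:

# run as root, before condor_master starts (e.g. in the condor init script)
ulimit -n 65536                 # per-process FD limit inherited by the Collector
sysctl -w fs.file-max=500000    # system-wide FD limit (see /proc/sys/fs/file-max)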

Client handling limits

Every time someone runs condor_status the Collector must act on it; a query is also done implicitly when you run condor_q -name <schedd_name>, and the Frontend processes query it as well. Since the Collector is a busy process, the Condor team added the option to fork on query. Forking will use up RAM (only if the parent memory pages change, but that is not unusual on a busy Collector), so there is a limit, COLLECTOR_QUERY_WORKERS, which defaults to 16. You may want to either lower or increase it.
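
Changing it is a one-line config edit; the value below is purely an illustrative assumption, not a recommendation from the slide:

# condor_config.local
COLLECTOR_QUERY_WORKERS = 32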

Firewall tuning

If you decided you want to keep the firewall, you may need to tune it. Open the ports for the Collector tree, and possibly for the Negotiator (if the Schedds are outside the perimeter); the Negotiator port is usually dynamic, so you will need to fix it with NEGOTIATOR_ARGS = -f -p 9617. You may also need to tune the firewall scalability knobs: stateful firewalls will have to track 5+ TCP connections per glidein.

# for iptables
sysctl -w net.ipv4.netfilter.ip_conntrack_max=385552; sysctl -p;
echo 48194 > /sys/module/ip_conntrack/parameters/hashsize
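
A sketch of the "open the ports" step, assuming the default main Collector port 9618, secondary collectors in the 9620-9700 range and the fixed Negotiator port 9617 from the slide; adjust the ranges to your own Collector tree:

# iptables rules on the collector node
iptables -A INPUT -p tcp --dport 9618 -j ACCEPT        # main Collector
iptables -A INPUT -p tcp --dport 9620:9700 -j ACCEPT   # secondary Collectors
iptables -A INPUT -p tcp --dport 9617 -j ACCEPT        # Negotiator (fixed above)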

Schedd tuning

Schedd tuning

The following knobs may need to be tuned:
- Job limits
- Startup/termination limits
- Client handling limits
- Match attributes
- Requirements defaults
- Periodic expressions

You may also need to tune the system settings.

Job limits

As you may remember, each running job uses a non-negligible amount of resources, so the Schedd has a limit on them: MAX_JOBS_RUNNING. Set it to a value that is right for your HW; you may waste some CPU on glideins, but that is better than crashing the submit node.

# glideinwms installer provided value
MAX_JOBS_RUNNING = 6000

http://www.cs.wisc.edu/condor/manual/v7.6/3_3configuration.html#param:maxjobrunning

Startup limits

Each time a job is started, the Schedd must create a shadow, the Shadow must load the input files from disk, and the Shadow must transfer the files over the network. If too many start at once, this may overload the node. There are two sets of limits (a hedged sketch follows below):
- Limit at the Schedd: JOB_START_DELAY / JOB_START_COUNT
- Limit on the network: MAX_CONCURRENT_UPLOADS

http://www.cs.wisc.edu/condor/manual/v7.6/3_3configuration.html#param:jobstartcount
http://www.cs.wisc.edu/condor/manual/v7.6/3_3configuration.html#param:maxconcurrentuploads
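
An illustrative configuration; the knobs are the ones named on the slide, but the values are assumptions to be adapted to your submit node:

# condor_config.local
JOB_START_COUNT = 20            # start at most 20 shadows per batch...
JOB_START_DELAY = 2             # ...with a 2 second pause between batches
MAX_CONCURRENT_UPLOADS = 100    # cap simultaneous input file transfers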

Termination limits

Termination can be problematic as well: output files must be transferred back, and you do not know how big they will be, so this can easily overload the submit node. If using glexec, an authentication call will also be made on each glidein, each calling the site GUMS; imagine O(1k) of those within a second! Similar limits apply (sketch below):
- Limit at the Schedd: JOB_STOP_DELAY / JOB_STOP_COUNT
- Limit on the network: MAX_CONCURRENT_DOWNLOADS

http://www.cs.wisc.edu/condor/manual/v7.6/3_3configuration.html#param:jobstopcount
http://www.cs.wisc.edu/condor/manual/v7.6/3_3configuration.html#param:maxconcurrentdownloads
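
Mirroring the startup-side sketch, again with assumed values:

# condor_config.local
JOB_STOP_COUNT = 20
JOB_STOP_DELAY = 2
MAX_CONCURRENT_DOWNLOADS = 100  # cap simultaneous output file transfers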

Client handling limits

Every time someone runs condor_q the Schedd must act on it; the Frontend processes query it as well. Since the Schedd is a busy process, the Condor team added the option to fork on query. Forking will use up RAM (only if the parent memory pages change, but that is not unusual on a busy Schedd), so there is a limit, SCHEDD_QUERY_WORKERS, which defaults to 3. You will want to increase it.

http://www.cs.wisc.edu/condor/manual/v7.6/3_3configuration.html#param:scheddqueryworkers
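
For example (the value is an assumption, not from the slide):

# condor_config.local
SCHEDD_QUERY_WORKERS = 8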

Match attributes for history

It is often useful to know where the jobs were running. By default, you get just the host name (LastRemoteHost), but it may be useful to know which glidein factory the slot came from, too (etc.). You can get the match info easily: add to the job submit files
job_attribute = "$$(glidein_attribute)"
and after matching, the Schedd adds MATCH_EXP_job_attribute to the job ClassAd.
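
A per-job sketch of the same mechanism, using one of the attribute names from the system-level example on the next slide (the MATCH_EXP_ value shown is only illustrative):

# in the job submit file
+JOB_Site = "$$(GLIDEIN_Site:Unknown)"

# after the match, the job ClassAd will contain something like
# MATCH_EXP_JOB_Site = "UCSD"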

System level addition

You can add them in the system config; this is how you push them into the user job ClassAds. The attributes below cover Factory identification, glidein identification (to give to the Factory admins), Frontend identification, and any other attribute you care about.

# condor_config.local
JOB_GLIDEIN_Factory = "$$(GLIDEIN_Factory:Unknown)"
JOB_GLIDEIN_Name = "$$(GLIDEIN_Name:Unknown)"
JOB_GLIDEIN_Entry_Name = "$$(GLIDEIN_Entry_Name:Unknown)"
JOB_GLIDEIN_ClusterId = "$$(GLIDEIN_ClusterId:Unknown)"
JOB_GLIDEIN_ProcId = "$$(GLIDEIN_ProcId:Unknown)"
JOB_GLIDECLIENT_Name = "$$(GLIDECLIENT_Name:Unknown)"
JOB_GLIDECLIENT_Group = "$$(GLIDECLIENT_Group:Unknown)"
JOB_Site = "$$(GLIDEIN_Site:Unknown)"
SUBMIT_EXPRS = $(SUBMIT_EXPRS) JOB_GLIDEIN_Factory JOB_GLIDEIN_Name JOB_GLIDEIN_Entry_Name JOB_GLIDEIN_ClusterId JOB_GLIDEIN_ProcId JOB_GLIDECLIENT_Name JOB_GLIDECLIENT_Group JOB_Site

Additional job defaults

With glideins, we often want empty user job requirements, but Condor does not allow it: it will forcibly add something about Arch (e.g. (Arch == "INTEL")), Disk (e.g. (Disk > DiskUsage)) and Memory (e.g. (Memory > ImageSize)), unless they are already there... so let's add them ourselves, phrased so that they always evaluate to True.

# condor_config.local
APPEND_REQ_VANILLA = (Memory > 1) && (Disk > 1) && (Arch =!= "fake")

Periodic expressions

Unless you have very well behaved users, some of the jobs will be pathologically broken; they will be wasting CPU cycles at best. You want to detect them and remove them from the system. Condor allows for periodic expressions, each an arbitrary boolean expression (what to look for comes with experience):
- SYSTEM_PERIODIC_REMOVE: just remove the job (a hedged sketch follows below)
- SYSTEM_PERIODIC_HOLD: leave it in the queue, but do not match it (example on the next slide)
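
A minimal remove sketch, assuming a policy of dropping jobs that have been held for more than a week; the policy and threshold are illustrative assumptions, not from the slide (JobStatus 5 means Held):

# condor_config.local
SYSTEM_PERIODIC_REMOVE = (JobStatus == 5) && ((CurrentTime - EnteredCurrentStatus) > 604800)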

Example periodic expression

In Condor 7.7.X you can also provide a human readable reason string.

# condor_config.local
# hold jobs that use too much memory, take too much disk space,
# take too long, or have been tried too many times
SYSTEM_PERIODIC_HOLD = ((ImageSize > 2000000) || (DiskSize > 1000000) || ((JobStatus == 2) && ((CurrentTime - JobCurrentStartDate) > 100000)) || (JobRunCount > 5))
SYSTEM_PERIODIC_HOLD_REASON = ifThenElse((ImageSize > 2000000), "Used too much memory", ifThenElse((JobRunCount > 5), "Tried too many times", "Reason unknown"))

http://www.cs.wisc.edu/condor/manual/v7.6/3_3configuration.html#param:systemperiodichold
http://www.cs.wisc.edu/condor/manual/v7.7/3_3configuration.html#param:systemperiodicholdreason

File descriptors and process IDs

Each Shadow keeps O(10) FDs open, and we have a shadow per running job. Make sure you have enough system-wide file descriptors, and also make sure the system can support all the shadow PIDs.

~$ echo 4849118 > /proc/sys/fs/file-max
~$ cat /proc/sys/fs/file-nr
59160 0 4849118
~$ echo 128000 > /proc/sys/kernel/pid_max

http://www.cs.wisc.edu/condor/condorg/linux_scalability.html

Firewall tuning

If you decided you want to keep the firewall, you may need to tune it. At minimum, open the port of the Shared Port daemon. You may also need to tune the firewall scalability knobs: stateful firewalls will have to track 2 connections per running job.

# for iptables
sysctl -w net.ipv4.netfilter.ip_conntrack_max=385552; sysctl -p;
echo 48194 > /sys/module/ip_conntrack/parameters/hashsize
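
A hedged sketch of pinning the Shared Port daemon to a known port so that a firewall rule can be written against it; the knobs and port number below are assumptions for 7.6-era Condor, so check your version's manual (on some versions SHARED_PORT must also be added to DAEMON_LIST):

# condor_config.local
USE_SHARED_PORT = True
SHARED_PORT_ARGS = -p 9615
# then, on the firewall:
#   iptables -A INPUT -p tcp --dport 9615 -j ACCEPT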

The End

Pointers

The official project Web page is http://tinyurl.com/glideinwms
The glideinwms development team is reachable at glideinwms-support@fnal.gov
Condor Home Page: http://www.cs.wisc.edu/condor/
Condor support: condor-user@cs.wisc.edu, condor-admin@cs.wisc.edu

Acknowledgments

The glideinwms is a CMS-led project developed mostly at FNAL, with contributions from UCSD and ISI. The glideinwms factory operations at UCSD are sponsored by OSG. The funding comes from NSF, DOE and the UC system.