PBS Pro Setup for Red
This page explains how PBS Pro 7.1.4 is set up on Red, and then goes on to explain each setting in more detail.
UNL PBS Setup
At UNL, we use PBS as our batch system. We have several queues, and we use queue priorities and fairshare to give preference to CMS jobs while still allowing others opportunistic use of the cluster.
Here is the general setup we use at UNL:
-----8<-----8<-----8<-----
#
# Create queues and set their attributes.
#
#
# Create and define queue default
#
create queue default
set queue default queue_type = Execution
set queue default Priority = 1
set queue default max_running = 400
set queue default resources_max.nodect = 444
set queue default resources_max.walltime = 24:00:00
set queue default max_user_run = 150
set queue default max_user_res.ncpus = 150
set queue default enabled = True
set queue default started = True
#
# Create and define queue osg
#
create queue osg
set queue osg queue_type = Execution
set queue osg enabled = True
set queue osg started = True
#
# Create and define queue workq
#
create queue workq
set queue workq queue_type = Execution
set queue workq Priority = 130
set queue workq max_running = 444
set queue workq resources_max.nodect = 444
set queue workq resources_max.walltime = 24:00:00
set queue workq max_user_run = 444
set queue workq max_user_res.ncpus = 444
set queue workq enabled = True
set queue workq started = True
#
# Create and define queue pushpa
#
create queue pushpa
set queue pushpa queue_type = Execution
set queue pushpa Priority = 138
set queue pushpa max_running = 20
set queue pushpa resources_max.nodect = 5
set queue pushpa resources_max.walltime = 72:00:00
set queue pushpa max_user_run = 20
set queue pushpa max_user_res.ncpus = 20
set queue pushpa enabled = True
set queue pushpa started = True
#
# Create and define queue dzero
#
create queue dzero
set queue dzero queue_type = Execution
set queue dzero Priority = 120
set queue dzero max_running = 444
set queue dzero resources_max.walltime = 24:00:00
set queue dzero max_user_run = 444
set queue dzero max_user_res.ncpus = 444
set queue dzero enabled = True
set queue dzero started = True
#
# Create and define queue zeng
#
create queue zeng
set queue zeng queue_type = Execution
set queue zeng Priority = 1
set queue zeng max_running = 16
set queue zeng resources_max.nodect = 64
set queue zeng resources_max.walltime = 24:00:00
set queue zeng acl_group_enable = True
set queue zeng acl_groups = zeng
set queue zeng max_user_run = 16
set queue zeng max_group_run = 16
set queue zeng max_user_res.ncpus = 64
set queue zeng max_group_res.ncpus = 64
set queue zeng enabled = True
set queue zeng started = True
#
# Create and define queue aarond
#
create queue aarond
set queue aarond queue_type = Execution
set queue aarond Priority = 131
set queue aarond max_running = 80
set queue aarond resources_max.nodect = 80
set queue aarond resources_max.walltime = 24:00:00
set queue aarond max_user_run = 80
set queue aarond max_user_res.ncpus = 80
set queue aarond enabled = True
set queue aarond started = True
#
# Create and define queue atlas
#
create queue atlas
set queue atlas queue_type = Execution
set queue atlas Priority = 130
set queue atlas max_running = 444
set queue atlas resources_max.nodect = 444
set queue atlas resources_max.walltime = 24:00:00
set queue atlas max_user_run = 444
set queue atlas max_user_res.ncpus = 444
set queue atlas enabled = True
set queue atlas started = True
#
# Create and define queue cmsprod
#
create queue cmsprod
set queue cmsprod queue_type = Execution
set queue cmsprod Priority = 120
set queue cmsprod max_running = 444
set queue cmsprod resources_max.walltime = 24:00:00
set queue cmsprod max_user_run = 444
set queue cmsprod max_user_res.ncpus = 444
set queue cmsprod enabled = True
set queue cmsprod started = True
#
# Create and define queue cms
#
create queue cms
set queue cms queue_type = Execution
set queue cms Priority = 135
set queue cms max_running = 444
set queue cms resources_max.nodect = 444
set queue cms resources_max.walltime = 24:00:00
set queue cms max_user_run = 444
set queue cms max_user_res.ncpus = 444
set queue cms enabled = True
set queue cms started = True
#
# Create and define queue zeng_long
#
create queue zeng_long
set queue zeng_long queue_type = Execution
set queue zeng_long max_running = 8
set queue zeng_long resources_max.nodect = 8
set queue zeng_long acl_group_enable = True
set queue zeng_long acl_groups = zeng
set queue zeng_long max_user_run = 8
set queue zeng_long max_user_res.ncpus = 8
set queue zeng_long enabled = True
set queue zeng_long started = True
#
# Create and define queue lcgadmin
#
create queue lcgadmin
set queue lcgadmin queue_type = Execution
set queue lcgadmin Priority = 135
set queue lcgadmin max_running = 4
set queue lcgadmin resources_max.nodect = 4
set queue lcgadmin resources_max.walltime = 24:00:00
set queue lcgadmin acl_group_enable = True
set queue lcgadmin acl_groups = lcgadmin
set queue lcgadmin acl_groups += uscmsPool018
set queue lcgadmin max_user_run = 4
set queue lcgadmin max_user_res.ncpus = 4
set queue lcgadmin enabled = True
set queue lcgadmin started = True
#
# Set server attributes.
#
set server scheduling = True
set server default_queue = workq
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.ncpus = 1
set server default_chunk.ncpus = 1
set server scheduler_iteration = 600
set server resv_enable = True
set server node_fail_requeue = 310
set server max_array_size = 10000
-----8<-----8<-----8<-----
We route gridex to the default queue and give it a low priority. If we have jobs that need to run and can't wait, we can give their queue a priority of 150 or higher, and its jobs will preempt jobs from lower-priority queues (preempt_queue_prio is set to 150 in sched_config, as shown further down).
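As a rough illustration of the threshold rule, here is a minimal Python sketch that flags which queues would count as preempting queues. The priorities are copied from the listing above and the threshold is the preempt_queue_prio value from sched_config further down. The function name and the "urgent" queue are hypothetical, and the sketch assumes a queue qualifies when its priority is at or above the threshold.

```python
# Queue priorities copied from the qmgr listing above (a subset).
QUEUE_PRIORITY = {
    "default": 1,
    "workq": 130,
    "pushpa": 138,
    "dzero": 120,
    "cms": 135,
}

# Value of preempt_queue_prio in our sched_config.
PREEMPT_QUEUE_PRIO = 150


def express_queues(priorities, threshold=PREEMPT_QUEUE_PRIO):
    """Return the names of queues whose priority reaches the threshold.

    Assumption: a queue is treated as a preempting queue when its
    priority is greater than or equal to the threshold.
    """
    return sorted(name for name, prio in priorities.items() if prio >= threshold)


# With the priorities above, no queue preempts.
print(express_queues(QUEUE_PRIORITY))

# Raising one queue (hypothetical name "urgent") to the threshold changes that.
print(express_queues({**QUEUE_PRIORITY, "urgent": 151}))
```

In other words, none of the queues as configured can preempt; a queue only becomes a preempting queue once its priority is raised by hand.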
For OSG, we also need to hack the pbs.pm file so that all of the jobs will find the correct queues. In our pbs.pm file (osg-041/globus/lib/perl/Globus/GRAM/JobManager/pbs.pm), after the line
print JOB "#PBS -m $email_when\n";
we added the following:
-----8<-----8<-----8<-----
# Match the routing patterns below against the job's directory.
$_ = $description->directory();
if($description->queue() ne '')
{
    # An explicitly requested queue always wins.
    print JOB '#PBS -q ', $description->queue(), "\n";
    print JOB "####PBS QUEUE WAS ASSIGNED\n";
}
elsif(/cmssoft/)
{
    # cmssoft and cmsprod must be tested before the more general cms.
    print JOB "#PBS -q cmsprod\n";
}
elsif(/cmsprod/)
{
    print JOB "#PBS -q cmsprod\n";
}
elsif(/cms/)
{
    print JOB "#PBS -q cms\n";
}
elsif(/atlas/)
{
    print JOB "#PBS -q atlas\n";
}
elsif(/dzero/)
{
    print JOB "#PBS -q dzero\n";
}
elsif(/gridex/)
{
    # gridex jobs go to the low-priority default queue.
    print JOB "#PBS -q default\n";
}
-----8<-----8<-----8<-----
NOTE: The patterns are matched against the job's directory (see the first line of the snippet). If it does not match any of them, no queue is written and the job runs in the server's default queue (workq for our site).
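To check the pattern order without touching pbs.pm, the same routing logic can be sketched in Python. This is only a model of the snippet above, not part of Globus; the function name, the ROUTES table, and the example paths are hypothetical.

```python
import re

# Patterns are tried in order, mirroring the elsif chain in pbs.pm.
# The more specific "cmssoft" and "cmsprod" must come before "cms".
ROUTES = [
    ("cmssoft", "cmsprod"),
    ("cmsprod", "cmsprod"),
    ("cms", "cms"),
    ("atlas", "atlas"),
    ("dzero", "dzero"),
    ("gridex", "default"),
]


def pick_queue(directory, requested_queue=""):
    """Return the queue name for a job, or None to use the server default."""
    if requested_queue != "":
        return requested_queue  # an explicitly requested queue always wins
    for pattern, queue in ROUTES:
        if re.search(pattern, directory):
            return queue
    return None  # no "#PBS -q" line is written, so PBS uses default_queue


# Hypothetical directories, for illustration only.
for path in ("/home/cmssoft", "/home/uscms01", "/home/usatlas1", "/home/ligo"):
    print(path, "->", pick_queue(path))
```

Because the first matching pattern wins, moving the "cms" entry to the top of the table would send cmssoft and cmsprod jobs to the cms queue instead of cmsprod.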
For fairshare, we add the following to $PBSHOME/sched_priv/pbs_resource_group:
-----8<-----8<-----8<-----
# name         id    parent  shares
cms            80    root    80
rest           20    root    20
uscmssoft      101   cms     50
uscmsprod      102   cms     50
cmsphedex      106   cms     50
cmsprod        107   cms     50
cmssoft        108   cms     50
uscms01        109   cms     50
uscms02        110   cms     50
uscmsPool001   111   cms     50
uscmsPool002   112   cms     50
uscmsPool003   113   cms     50
uscmsPool004   114   cms     50
uscmsPool005   115   cms     50
uscmsPool006   116   cms     50
cdf            9700  rest    5
condor         9701  rest    5
des            9702  rest    5
feller         9703  rest    5
fmri           9704  rest    5
gadu           9705  rest    5
glow           9706  rest    5
gpn            9707  rest    5
grase          9708  rest    5
#gridex        9709  rest    5
hacluster      9710  rest    5
helium         9711  rest    5
ivdgl          9712  rest    5
ligo           9713  rest    5
localGridUser  9714  rest    5
lsc01          9715  rest    5
mis            9716  rest    5
nanohub        9717  rest    5
ops            9718  rest    5
osg            9719  rest    5
output         9720  rest    5
sdss           9721  rest    5
star           9722  rest    5
usatlas1       9723  rest    5
usatlas2       9724  rest    5
usatlas3       9725  rest    5
dzero          9726  rest    5
pushpa         9727  cms     50
-----8<-----8<-----8<-----
Please note that there are MANY more uscmsPool users, but I left them out to save space. The four columns are the name, an ID, the parent group, and the number of shares. The IDs in the second column work like user IDs and must be unique. The first two lines define the two groups we use, "cms" and "rest", and their shares (80 and 20) are their percentages of the cluster.
Every user in the cms group has the same number of shares (50), so the group's 80% of the cluster is split evenly among them. A user can get more if there are no other jobs in the queue.
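The share arithmetic can be sketched in a few lines of Python. This is a simplified model, not the scheduler's code: it assumes an entry's fraction is its shares divided by the total shares of its siblings, times its parent's fraction, and it ignores the extra "unknown" group that unknown_shares creates. The table is a small subset of the file above, and the function name is hypothetical.

```python
# Each entry: name -> (parent, shares). A small subset of the file above.
RESOURCE_GROUP = {
    "cms": ("root", 80),
    "rest": ("root", 20),
    "uscms01": ("cms", 50),
    "uscms02": ("cms", 50),
    "cmsprod": ("cms", 50),
    "pushpa": ("cms", 50),
    "ligo": ("rest", 5),
    "dzero": ("rest", 5),
}


def cluster_fraction(name, tree=RESOURCE_GROUP):
    """Fraction of the whole cluster an entry is entitled to.

    Assumption: shares are weighed against siblings (entries with the
    same parent) and scaled by the parent's fraction. Root owns 100%.
    """
    if name == "root":
        return 1.0
    parent, shares = tree[name]
    sibling_total = sum(s for p, s in tree.values() if p == parent)
    return cluster_fraction(parent, tree) * shares / sibling_total


for entry in ("cms", "uscms01", "ligo"):
    print(entry, round(cluster_fraction(entry), 3))
```

With only four users under cms in this subset, each gets a quarter of the group's 80%; with the full list of uscmsPool users the per-user fraction is correspondingly smaller.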
All of the scheduler configuration is set in the $PBSHOME/sched_priv/sched_config file.
Here are some of the settings used on Red.
-----8<-----8<-----8<-----
round_robin: False all
-----8<-----8<-----8<-----
This is set to false so that fairshare works. If it were set to true, the scheduler would cycle through the queues in round-robin order, running one job from each in turn, which we don't want.
-----8<-----8<-----8<-----
by_queue: True prime
by_queue: True non_prime
-----8<-----8<-----8<-----
by_queue is set to true for both prime and non_prime times. If both round_robin and by_queue were set to false, the scheduler would ignore the queues and treat all jobs as one large queue, regardless of which queue each job was submitted to.
-----8<-----8<-----8<-----
strict_fifo: false ALL
-----8<-----8<-----8<-----
We don't want strict FIFO: if it is set to true, all jobs run in first-in, first-out order, regardless of their queue or the priorities they should have.
-----8<-----8<-----8<-----
help_starving_jobs: false ALL
-----8<-----8<-----8<-----
Starving job help is turned off because we want to use the priorities and fairshare that are set. If it is turned on, jobs that have been in the queue for a long time get higher priority, regardless of what shares they may/should have.
-----8<-----8<-----8<-----
max_starve: 24:00:00
-----8<-----8<-----8<-----
This doesn't make a difference with help_starving_jobs turned off.
-----8<-----8<-----8<-----
backfill: false ALL
-----8<-----8<-----8<-----
This doesn't make a difference with help_starving_jobs turned off.
-----8<-----8<-----8<-----
backfill_prime: false ALL
-----8<-----8<-----8<-----
If this is set to true, primetime jobs won't run into nonprimetime and vice versa. However, we don't have primetime or nonprimetimes so this doesn't make a difference.
-----8<-----8<-----8<-----
prime_exempt_anytime_queues: true
-----8<-----8<-----8<-----
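We have set this to true so that jobs in anytime queues (queues that are neither primetime nor non-primetime queues, which is all of ours) are exempt from the backfill_prime boundary check, so no prime/non-prime restriction applies at all.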
-----8<-----8<-----8<-----
job_sort_key: "cput LOW" ALL
-----8<-----8<-----8<-----
We use this so that job ordering everywhere (including fair share, preemption, and sorting) is based on CPU time, with the lowest cput first.
-----8<-----8<-----8<-----
node_sort_key: "sort_priority HIGH" ALL
-----8<-----8<-----8<-----
Nodes are sorted by their priority attribute, highest first. This doesn't matter much for a mostly homogeneous cluster.
-----8<-----8<-----8<-----
sort_queues: true ALL
-----8<-----8<-----8<-----
This will allow the sorting of queues by what priorities they have.
-----8<-----8<-----8<-----
resources: "ncpus, mem, arch, host"
-----8<-----8<-----8<-----
This lists the resources the scheduler takes into account when placing jobs on nodes. For our site, we only really care about the number of CPUs.
-----8<-----8<-----8<-----
load_balancing: false ALL
-----8<-----8<-----8<-----
Doesn't matter as we don't have timesharing nodes.
-----8<-----8<-----8<-----
smp_cluster_dist: pack
-----8<-----8<-----8<-----
I have set this to "pack" so that if we should have an SMP job, it has a better chance of running.
-----8<-----8<-----8<-----
fair_share: true ALL
-----8<-----8<-----8<-----
We use fair share for our cluster.
-----8<-----8<-----8<-----
unknown_shares: 10
-----8<-----8<-----8<-----
With this, anybody not in a group will still get a share of 10. If the resource groups are not set up, everybody will get equal shares.
-----8<-----8<-----8<-----
fairshare_usage_res: cput
-----8<-----8<-----8<-----
We fairshare by CPU time.
-----8<-----8<-----8<-----
fairshare_entity: euser
-----8<-----8<-----8<-----
We fairshare per user and not per group. As we discussed once, we can do one or the other, but not both.
-----8<-----8<-----8<-----
half_life: 24:00:00
-----8<-----8<-----8<-----
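I never understood why they do half life instead of full life, but if I understand correctly, a half life of 24 hours means that each user's accumulated fairshare usage is cut in half every day. Older usage therefore counts less and less instead of expiring at a fixed time: usage from two days ago counts a quarter as much as usage from right now.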
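The effect of the half life is easy to see with a small Python sketch. It models the decay as a smooth exponential, which is an assumption: the scheduler may apply the halving in discrete steps, but the two agree at whole multiples of the half life. The function name is hypothetical.

```python
HALF_LIFE_HOURS = 24.0  # half_life: 24:00:00 from sched_config


def decayed_usage(usage, hours_elapsed, half_life_hours=HALF_LIFE_HOURS):
    """Usage that still counts after decaying with the given half life."""
    return usage * 0.5 ** (hours_elapsed / half_life_hours)


# How much of 1000 units of recorded usage still counts after N hours.
for hours in (0, 24, 48, 72):
    print(hours, decayed_usage(1000.0, hours))
```

A shorter half life makes fairshare forget heavy usage sooner; a longer one penalizes recent heavy users for more days.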
-----8<-----8<-----8<-----
sync_time: 1:00:00
-----8<-----8<-----8<-----
This writes the fairshare data to disk every hour.
-----8<-----8<-----8<-----
# fairshare_enforce_no_shares: TRUE
-----8<-----8<-----8<-----
We comment this out because we want every job to be able to run, even if it has zero shares (which shouldn't happen, since unknown users get 10 shares).
-----8<-----8<-----8<-----
preemptive_sched: true ALL
-----8<-----8<-----8<-----
We have preemption turned on in case we ever need to use it.
-----8<-----8<-----8<-----
preempt_queue_prio: 150
-----8<-----8<-----8<-----
A queue's priority needs to be 150 or higher for its jobs to preempt other jobs.
-----8<-----8<-----8<-----
preempt_prio: "express_queue, normal_jobs"
-----8<-----8<-----8<-----
This sets which jobs get preempted, in backwards order (i.e. express_queue jobs, which are the preempting jobs, get preempted only after normal jobs). We can also use things like fairshare here to kill jobs that are over their fairshare limits. If we had starving turned on, jobs that have starved could also be prioritized to be killed later, since they have waited so long in the queue.
-----8<-----8<-----8<-----
preempt_order: "SCR"
-----8<-----8<-----8<-----
This sets the order in which the preemption methods are tried. SCR has no percentages after it in our case, so for every job that gets preempted the scheduler tries to suspend it first, then checkpoint it, and finally requeue it. We can also add percentages to use a different order at different points in a job's run.
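The "try each method in turn" behaviour can be sketched as follows. This is a toy model under stated assumptions: the function name and the handlers are hypothetical stand-ins, each handler simply reports whether its method worked, and the first letter whose handler succeeds ends the search.

```python
def preempt(job, order, methods):
    """Try each preemption method in order; return the letter that worked.

    `methods` maps a letter (S, C or R) to a callable that returns True
    on success. Returns None if no method succeeded.
    """
    for letter in order:
        attempt = methods.get(letter)
        if attempt is not None and attempt(job):
            return letter
    return None


# Toy handlers: pretend this job can be neither suspended nor checkpointed.
handlers = {
    "S": lambda job: False,  # suspend fails
    "C": lambda job: False,  # checkpoint fails
    "R": lambda job: True,   # requeue works
}

print(preempt("job-1", "SCR", handlers))
```

With these handlers the job ends up requeued, which is the last resort in the "SCR" order.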
-----8<-----8<-----8<-----
preempt_sort: min_time_since_start
-----8<-----8<-----8<-----
With this, preemption picks the jobs that have been running for the shortest time since they started.
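As a small illustration, choosing a victim by minimum time since start amounts to picking the most recently started job. The job IDs, start times, and function name below are made up for the example.

```python
def pick_victim(running_jobs, now):
    """Return the ID of the job with the least time since it started.

    `running_jobs` maps a job ID to its start time; `now` and the start
    times use the same unit (seconds here).
    """
    return min(running_jobs, key=lambda job_id: now - running_jobs[job_id])


# Hypothetical jobs with their start times in seconds.
jobs = {"101.red": 100, "102.red": 900, "103.red": 500}
print(pick_victim(jobs, now=1000))
```

The idea is that the job that started last has done the least work, so preempting it throws away the least.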
-----8<-----8<-----8<-----
peer_queue
-----8<-----8<-----8<-----
One thing we can play around with, because we have the t3 cluster, is peer queueing. This would allow the scheduler on red to pull jobs from the t3 cluster and run them on red.
-----8<-----8<-----8<-----
dedicated_prefix: ded
-----8<-----8<-----8<-----
We don't use dedicated time queues but can...
-----8<-----8<-----8<-----
log_filter: 1280
-----8<-----8<-----8<-----
This allows us to keep the logging simpler...although it's still very large.
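Both log_filter here and log_events in the server attributes are bitmasks, so it can help to see which bits a value is made of. The helper below (a hypothetical name) only splits a number into its powers of two; what each bit means is documented in the PBS Pro administrator guide and is not reproduced here.

```python
def set_bits(value):
    """Return the powers of two that add up to `value`, smallest first."""
    return [1 << i for i in range(value.bit_length()) if value & (1 << i)]


print(set_bits(1280))  # log_filter from sched_config
print(set_bits(511))   # log_events from the server attributes
```

So 1280 is the sum of two event classes (256 and 1024), while 511 turns on the nine lowest classes.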