PBS Pro Setup for Red

by Mako Furukawa last modified 2007-09-06 06:52

This page explains how PBS Pro 7.1.4 is set up on Red and then goes through each setting in more detail.

UNL PBS Setup

At UNL, we use PBS as our batch system. We run several queues and combine queue priorities with fairshare, so that CMS jobs get preference while other users can still make opportunistic use of the cluster.

Here is the general setup we use at UNL:

-----8<-----8<-----8<-----
#
# Create queues and set their attributes.
#
#
# Create and define queue default
#
create queue default
set queue default queue_type = Execution
set queue default Priority = 1
set queue default max_running = 400
set queue default resources_max.nodect = 444
set queue default resources_max.walltime = 24:00:00
set queue default max_user_run = 150
set queue default max_user_res.ncpus = 150
set queue default enabled = True
set queue default started = True
#
# Create and define queue osg
#
create queue osg
set queue osg queue_type = Execution
set queue osg enabled = True
set queue osg started = True
#
# Create and define queue workq
#
create queue workq
set queue workq queue_type = Execution
set queue workq Priority = 130
set queue workq max_running = 444
set queue workq resources_max.nodect = 444
set queue workq resources_max.walltime = 24:00:00
set queue workq max_user_run = 444
set queue workq max_user_res.ncpus = 444
set queue workq enabled = True
set queue workq started = True
#
# Create and define queue pushpa
#
create queue pushpa
set queue pushpa queue_type = Execution
set queue pushpa Priority = 138
set queue pushpa max_running = 20
set queue pushpa resources_max.nodect = 5
set queue pushpa resources_max.walltime = 72:00:00
set queue pushpa max_user_run = 20
set queue pushpa max_user_res.ncpus = 20
set queue pushpa enabled = True
set queue pushpa started = True
#
# Create and define queue dzero
#
create queue dzero
set queue dzero queue_type = Execution
set queue dzero Priority = 120
set queue dzero max_running = 444
set queue dzero resources_max.walltime = 24:00:00
set queue dzero max_user_run = 444
set queue dzero max_user_res.ncpus = 444
set queue dzero enabled = True
set queue dzero started = True
#
# Create and define queue zeng
#
create queue zeng
set queue zeng queue_type = Execution
set queue zeng Priority = 1
set queue zeng max_running = 16
set queue zeng resources_max.nodect = 64
set queue zeng resources_max.walltime = 24:00:00
set queue zeng acl_group_enable = True
set queue zeng acl_groups = zeng
set queue zeng max_user_run = 16
set queue zeng max_group_run = 16
set queue zeng max_user_res.ncpus = 64
set queue zeng max_group_res.ncpus = 64
set queue zeng enabled = True
set queue zeng started = True
#
# Create and define queue aarond
#
create queue aarond
set queue aarond queue_type = Execution
set queue aarond Priority = 131
set queue aarond max_running = 80
set queue aarond resources_max.nodect = 80
set queue aarond resources_max.walltime = 24:00:00
set queue aarond max_user_run = 80
set queue aarond max_user_res.ncpus = 80
set queue aarond enabled = True
set queue aarond started = True
#
# Create and define queue atlas
#
create queue atlas
set queue atlas queue_type = Execution
set queue atlas Priority = 130
set queue atlas max_running = 444
set queue atlas resources_max.nodect = 444
set queue atlas resources_max.walltime = 24:00:00
set queue atlas max_user_run = 444
set queue atlas max_user_res.ncpus = 444
set queue atlas enabled = True
set queue atlas started = True
#
# Create and define queue cmsprod
#
create queue cmsprod
set queue cmsprod queue_type = Execution
set queue cmsprod Priority = 120
set queue cmsprod max_running = 444
set queue cmsprod resources_max.walltime = 24:00:00
set queue cmsprod max_user_run = 444
set queue cmsprod max_user_res.ncpus = 444
set queue cmsprod enabled = True
set queue cmsprod started = True
#
# Create and define queue cms
#
create queue cms
set queue cms queue_type = Execution
set queue cms Priority = 135
set queue cms max_running = 444
set queue cms resources_max.nodect = 444
set queue cms resources_max.walltime = 24:00:00
set queue cms max_user_run = 444
set queue cms max_user_res.ncpus = 444
set queue cms enabled = True
set queue cms started = True
#
# Create and define queue zeng_long
#
create queue zeng_long
set queue zeng_long queue_type = Execution
set queue zeng_long max_running = 8
set queue zeng_long resources_max.nodect = 8
set queue zeng_long acl_group_enable = True
set queue zeng_long acl_groups = zeng
set queue zeng_long max_user_run = 8
set queue zeng_long max_user_res.ncpus = 8
set queue zeng_long enabled = True
set queue zeng_long started = True
#
# Create and define queue lcgadmin
#
create queue lcgadmin
set queue lcgadmin queue_type = Execution
set queue lcgadmin Priority = 135
set queue lcgadmin max_running = 4
set queue lcgadmin resources_max.nodect = 4
set queue lcgadmin resources_max.walltime = 24:00:00
set queue lcgadmin acl_group_enable = True
set queue lcgadmin acl_groups = lcgadmin
set queue lcgadmin acl_groups += uscmsPool018
set queue lcgadmin max_user_run = 4
set queue lcgadmin max_user_res.ncpus = 4
set queue lcgadmin enabled = True
set queue lcgadmin started = True
#
# Set server attributes.
#
set server scheduling = True
set server default_queue = workq
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.ncpus = 1
set server default_chunk.ncpus = 1
set server scheduler_iteration = 600
set server resv_enable = True
set server node_fail_requeue = 310
set server max_array_size = 10000
-----8<-----8<-----8<-----

We route gridex to the default queue and give it a low priority. If we have jobs that need to run and can't wait, we can give their queue a priority of 150 or higher, and its jobs will preempt jobs from lower-priority queues (preempt_queue_prio is set to 150 in sched_config, so queues at or above that priority preempt those below it).

For OSG, we also need to hack the pbs.pm file so that all of the jobs will find the correct queues. In our pbs.pm file (osg-041/globus/lib/perl/Globus/GRAM/JobManager/pbs.pm), after the line
     print JOB "#PBS -m $email_when\n";
we added the following:

-----8<-----8<-----8<-----
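    # Pick a queue from the job's directory unless the job description
    # already names one. Order matters: /cms/ must come after the more
    # specific cmssoft and cmsprod patterns.
    $_ = $description->directory();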
    if($description->queue() ne '')
    {
        print JOB '#PBS -q ', $description->queue(), "\n";
        print JOB "####PBS QUEUE WAS ASSIGNED\n";
    }
    elsif(/cmssoft/)
    {
        print JOB "#PBS -q cmsprod\n";
    }
    elsif(/cmsprod/)
    {
        print JOB "#PBS -q cmsprod\n";
    }
    elsif(/cms/)
    {
        print JOB "#PBS -q cms\n";
    }
    elsif(/atlas/)
    {
        print JOB "#PBS -q atlas\n";
    }
    elsif(/dzero/)
    {
        print JOB "#PBS -q dzero\n";
    }
    elsif(/gridex/)
    {
        print JOB "#PBS -q default\n";
    }
-----8<-----8<-----8<-----

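NOTE: The match is made against the job's directory, not against the username itself. If the directory matches none of these patterns, no "#PBS -q" line is written and the job simply runs in the server's default queue (workq for our site).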
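
The routing logic above can be tried out away from the gatekeeper with a small Python sketch. This is not part of pbs.pm; the function name pick_queue and the example directories are made up for illustration, and only the patterns and queue names come from the Perl fragment.

```python
import re

# Patterns are checked in order, mirroring the elsif chain in pbs.pm.
# The specific patterns (cmssoft, cmsprod) come before the broad "cms".
ROUTES = [
    (r"cmssoft", "cmsprod"),
    (r"cmsprod", "cmsprod"),
    (r"cms", "cms"),
    (r"atlas", "atlas"),
    (r"dzero", "dzero"),
    (r"gridex", "default"),
]


def pick_queue(directory, requested_queue=""):
    """Return the queue for the '#PBS -q' line, or None if no line is written."""
    # A queue named in the job description always wins.
    if requested_queue != "":
        return requested_queue
    for pattern, queue in ROUTES:
        if re.search(pattern, directory):
            return queue
    # No match: PBS falls back to the server's default_queue (workq here).
    return None


if __name__ == "__main__":
    # Hypothetical directories, only to show the matching order.
    for d in ["/home/cmssoft", "/home/uscmsPool001", "/home/usatlas1", "/home/ligo"]:
        print(d, "->", pick_queue(d))
```

Because "cms" is a substring match, any directory containing it (uscms01, uscmsPool001, and so on) lands in the cms queue unless one of the earlier patterns matches first.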

For fairshare, we add the following to $PBSHOME/sched_priv/pbs_resource_group:

-----8<-----8<-----8<-----
cms      80     root    80     
rest     20     root    20

uscmssoft       101     cms     50
uscmsprod       102     cms     50
cmsphedex       106     cms     50
cmsprod         107     cms     50
cmssoft         108     cms     50
uscms01         109     cms     50
uscms02         110     cms     50
uscmsPool001    111     cms     50
uscmsPool002    112     cms     50
uscmsPool003    113     cms     50
uscmsPool004    114     cms     50
uscmsPool005    115     cms     50
uscmsPool006    116     cms     50
cdf             9700     rest    5
condor          9701     rest    5
des             9702     rest    5
feller          9703     rest    5
fmri            9704     rest    5
gadu            9705     rest    5
glow            9706     rest    5
gpn             9707     rest    5
grase           9708     rest    5
#gridex         9709     rest    5
hacluster       9710     rest    5
helium          9711     rest    5
ivdgl           9712     rest    5
ligo            9713     rest    5
localGridUser   9714     rest    5
lsc01           9715     rest    5
mis             9716     rest    5
nanohub         9717     rest    5
ops             9718     rest    5
osg             9719     rest    5
output          9720     rest    5
sdss            9721     rest    5
star            9722     rest    5
usatlas1        9723     rest    5
usatlas2        9724     rest    5
usatlas3        9725     rest    5
dzero           9726     rest    5
pushpa          9727     cms     50
-----8<-----8<-----8<-----

Please note that there are MANY more uscmsPool users, but I left them out to save space. The second column is a numeric ID, similar to a user ID, and must be unique. The third column is the parent group and the fourth is the number of shares. The first two lines define the two groups we use, "cms" and "rest"; their share values, 80 and 20, are the percentages of the cluster each group gets.

The users in the cms group each have 50 shares, so they split the cms group's 80% of the cluster equally among themselves. A user can get more if there are no other jobs in the queue.
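
To see how the shares turn into fractions of the cluster, here is a rough Python sketch. It assumes that each entry's fraction is its shares divided by the total shares of its siblings, multiplied by its parent's fraction; that is my reading of how the share tree works, not code taken from PBS. The function name share_fractions and the cut-down example tree are made up for illustration.

```python
def share_fractions(entries):
    """entries: list of (name, parent, shares). Returns {name: fraction of cluster}."""
    children = {}
    for name, parent, shares in entries:
        children.setdefault(parent, []).append((name, shares))

    result = {}

    def walk(parent, parent_fraction):
        siblings = children.get(parent, [])
        total = sum(shares for _, shares in siblings)
        for name, shares in siblings:
            # A node's fraction is relative to its siblings only.
            fraction = parent_fraction * shares / total
            result[name] = fraction
            walk(name, fraction)

    walk("root", 1.0)
    return result


# A cut-down tree: two groups under root, two users in each group.
EXAMPLE = [
    ("cms", "root", 80),
    ("rest", "root", 20),
    ("cmsprod", "cms", 50),
    ("cmssoft", "cms", 50),
    ("ligo", "rest", 5),
    ("osg", "rest", 5),
]

if __name__ == "__main__":
    for name, fraction in share_fractions(EXAMPLE).items():
        print(f"{name:10s} {fraction:.2%}")
```

With only two users in each group, every cms user ends up with 40% of the cluster and every rest user with 10%. Adding more users with the same shares makes each individual slice smaller, while the group totals stay at 80% and 20%.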

The $PBSHOME/sched_priv/sched_config file holds all of the scheduler settings.

Here are some of the settings used on Red.


-----8<-----8<-----8<-----
round_robin: False      all
-----8<-----8<-----8<-----

round_robin is turned off so that fairshare works. If it were set to true, the scheduler would cycle through the queues and run one job from each in round-robin order, which we don't want.

-----8<-----8<-----8<-----
by_queue: True          prime
by_queue: True          non_prime
-----8<-----8<-----8<-----

by_queue is set to true for both prime and non-prime time. If round_robin and by_queue were both false, the scheduler would ignore the queues and treat all jobs as one large queue, regardless of which queue each job was submitted to.

-----8<-----8<-----8<-----
strict_fifo: false      ALL
-----8<-----8<-----8<-----

We don't want strict FIFO: if it is set to true, all jobs run in first-in, first-out order, regardless of their queue or priority.

-----8<-----8<-----8<-----
help_starving_jobs:     false   ALL
-----8<-----8<-----8<-----

Starving-job help is turned off because we want the configured priorities and fairshare to decide the job order. If it were turned on, jobs that have been in the queue for a long time would get a higher priority, regardless of the shares they have or should have.

-----8<-----8<-----8<-----
max_starve: 24:00:00
-----8<-----8<-----8<-----

This doesn't make a difference with help_starving_jobs turned off.

-----8<-----8<-----8<-----
backfill:       false   ALL
-----8<-----8<-----8<-----

This doesn't make a difference with help_starving_jobs turned off.

-----8<-----8<-----8<-----
backfill_prime: false   ALL
-----8<-----8<-----8<-----

If this is set to true, primetime jobs won't run into nonprimetime and vice versa. However, we don't have primetime or nonprimetimes so this doesn't make a difference.

-----8<-----8<-----8<-----
prime_exempt_anytime_queues:    true
-----8<-----8<-----8<-----

We have set this to true so that we have no backfilling at all.

-----8<-----8<-----8<-----
job_sort_key: "cput LOW"        ALL
-----8<-----8<-----8<-----

We use this so that job ordering (for fair share, preemption, and sorting alike) is based on CPU time, with the lowest values first.

-----8<-----8<-----8<-----
node_sort_key: "sort_priority HIGH"     ALL
-----8<-----8<-----8<-----

Nodes are sorted by their priority attribute, highest first. This doesn't matter much for a mostly homogeneous cluster.

-----8<-----8<-----8<-----
sort_queues:    true    ALL
-----8<-----8<-----8<-----

With this set, the scheduler sorts the queues by their priority, so higher-priority queues are considered first.
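
As a quick illustration, the Python sketch below orders our queues by the Priority values from the qmgr listing at the top of this page. The osg and zeng_long queues are left out because they have no Priority set there. How PBS breaks ties between queues with the same priority is not something I have checked; this sketch just keeps ties in listing order.

```python
# Priority values copied from the qmgr listing on this page.
QUEUE_PRIORITY = {
    "default": 1,
    "workq": 130,
    "pushpa": 138,
    "dzero": 120,
    "zeng": 1,
    "aarond": 131,
    "atlas": 130,
    "cmsprod": 120,
    "cms": 135,
    "lcgadmin": 135,
}


def queues_by_priority(priorities):
    """Return queue names sorted from highest to lowest priority."""
    # sorted() is stable, so queues with equal priority keep their input order.
    return sorted(priorities, key=lambda name: -priorities[name])


if __name__ == "__main__":
    for name in queues_by_priority(QUEUE_PRIORITY):
        print(f"{QUEUE_PRIORITY[name]:4d}  {name}")
```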

-----8<-----8<-----8<-----
resources: "ncpus, mem, arch, host"
-----8<-----8<-----8<-----

This allows us to give priorities (in order) of how to sort jobs into nodes. For our site, we only really care about the number of CPUs.

-----8<-----8<-----8<-----
load_balancing: false   ALL
-----8<-----8<-----8<-----

Doesn't matter as we don't have timesharing nodes.

-----8<-----8<-----8<-----
smp_cluster_dist: pack
-----8<-----8<-----8<-----

I have set this to "pack" so that jobs fill one node before moving on to the next. That leaves more nodes completely free, so if we should have an SMP job, it has a better chance of running.

-----8<-----8<-----8<-----
fair_share: true        ALL
-----8<-----8<-----8<-----

We use fair share for our cluster.

-----8<-----8<-----8<-----
unknown_shares: 10
-----8<-----8<-----8<-----

With this, anybody not listed in a group still gets 10 shares. If the resource groups are not set up at all, everybody gets equal shares.

-----8<-----8<-----8<-----
fairshare_usage_res: cput
-----8<-----8<-----8<-----

We fairshare by CPU time.

-----8<-----8<-----8<-----
fairshare_entity: euser
-----8<-----8<-----8<-----

We fairshare per user and not per group. As we talked about once, we can do one or the other, but not both.

-----8<-----8<-----8<-----
half_life: 24:00:00
-----8<-----8<-----8<-----

I never understood why they use a half life instead of a full life, but if I understand correctly, a half life of 24 hours means the recorded fairshare usage is cut in half every day, so past usage counts for less and less instead of dropping out all at once.
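
One way to read the half_life setting is as plain exponential decay. The Python sketch below assumes exactly that; the function name decayed_usage and the usage number are made up, and PBS may well apply the halving in discrete steps rather than continuously. At whole multiples of the half life the two readings give the same result.

```python
def decayed_usage(usage, hours_elapsed, half_life_hours=24.0):
    """Usage that still counts after hours_elapsed, halving every half_life_hours."""
    return usage * 0.5 ** (hours_elapsed / half_life_hours)


if __name__ == "__main__":
    # Hypothetical starting usage of 1000 CPU-seconds.
    for hours in (0, 24, 48, 72):
        print(f"after {hours:2d} h: {decayed_usage(1000.0, hours):7.1f}")
```

Under this assumption, half of a user's recorded usage still counts after one day, a quarter after two days, and an eighth after three.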

-----8<-----8<-----8<-----
sync_time: 1:00:00
-----8<-----8<-----8<-----

This makes the scheduler write the fairshare data to disk every hour.

-----8<-----8<-----8<-----
# fairshare_enforce_no_shares: TRUE
-----8<-----8<-----8<-----

We leave this commented out because we want every job to be able to run, even if it has zero shares (which shouldn't happen, since unknown users get 10 shares).

-----8<-----8<-----8<-----
preemptive_sched: true  ALL
-----8<-----8<-----8<-----

We have preemption turned on in case we ever need to use it.

-----8<-----8<-----8<-----
preempt_queue_prio:     150
-----8<-----8<-----8<-----

A queue's priority needs to be 150 or higher for its jobs to preempt other jobs.

-----8<-----8<-----8<-----
preempt_prio: "express_queue, normal_jobs"
-----8<-----8<-----8<-----

This lists the preemption levels from highest to lowest, so jobs are preempted in the reverse order (i.e. express_queue jobs, which are the ones doing the preempting, are preempted only after normal jobs). We can also add levels such as fairshare so that jobs over their fairshare limit are killed first. If we had starving turned on, starving jobs could likewise be prioritized so that they are killed later, since they have already waited so long in the queue.

-----8<-----8<-----8<-----
preempt_order: "SCR"
-----8<-----8<-----8<-----

This only sets the order of the methods tried when preempting a job. SCR has no percentages after it in our case, so every job that gets preempted will try suspend, then checkpoint, then requeue. We can also set this up so that the order depends on how far along a job is.
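
The "try each method in turn" behaviour can be pictured with a few lines of Python. This is only a toy model: the function name first_usable_method is made up, and the set of methods a job supports is a stand-in for whatever makes a method succeed or fail on the real system.

```python
METHOD_NAMES = {"S": "suspend", "C": "checkpoint", "R": "requeue"}


def first_usable_method(supported, order="SCR"):
    """Return the first method in `order` that the job supports, or None."""
    for letter in order:
        method = METHOD_NAMES[letter]
        if method in supported:
            return method
    return None


if __name__ == "__main__":
    # A job that can be suspended is suspended; one that cannot falls through.
    print(first_usable_method({"suspend", "requeue"}))  # suspend
    print(first_usable_method({"requeue"}))             # requeue
```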

-----8<-----8<-----8<-----
preempt_sort: min_time_since_start
-----8<-----8<-----8<-----

With this setting, the jobs preempted first are those with the least time since they started, i.e. the most recently started jobs.
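
In other words, the victim is picked roughly like this. The Python sketch is just an illustration: the job IDs and start times are made up, and the function name pick_victim is not anything from PBS.

```python
def pick_victim(start_times, now):
    """start_times: {job_id: start time in seconds}. Returns the job started most recently."""
    return min(start_times, key=lambda job_id: now - start_times[job_id])


if __name__ == "__main__":
    # Hypothetical running jobs and their start times (seconds).
    running = {"101.red": 100, "102.red": 500, "103.red": 300}
    print(pick_victim(running, now=600))  # 102.red has run for the shortest time
```

The idea behind preempting the youngest job is that the least amount of finished work is thrown away if the job ends up being requeued.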

-----8<-----8<-----8<-----
peer_queue
-----8<-----8<-----8<-----

Because we also have the t3 cluster, one thing we can experiment with is peer queueing. This would let the scheduler on Red pull jobs from the t3 cluster and run them on Red.

-----8<-----8<-----8<-----
dedicated_prefix: ded
-----8<-----8<-----8<-----

We don't use dedicated-time queues, but we could.

-----8<-----8<-----8<-----
log_filter: 1280
-----8<-----8<-----8<-----

This allows us to keep the logging simpler...although it's still very large.

