

CSF Policies: Set and Proposed

Introductions

CSF Policies — Set and Proposed


Simon Hood

simon.hood@manchester.ac.uk

Project Manager & Technical Lead
Infra Ops, IT Services



Set Policies




Contributing

Contact
computational-sf@manchester.ac.uk
Default Option: Dell C6100, 12.5k pounds + VAT + cluster "taxes"
  • Chassis containing four compute nodes, each of which has:
    • two 6-core Intel CPUs (X5650);
    • 48GB RAM;
    • two 250GB disks.
  • AMD option available.
  • Nvidia 2050/2070 GPGPU option available.
  • Less money? RQ2.



Node Lifetime



Sustainability Charge

What contributors pay, per C6100:
  • Contribution to cluster internal networks: 385 pounds
  • Contribution to filesystems: 500 pounds for the first C6100, 200 pounds per C6100 thereafter (250GB, 100GB)
  • Alces: 400(?) pounds
What is paid for by the centre:
  • Hosting, campus network, sysadmin, application support...
Costs, per C6100:
  • Annual power/cooling: 1.6k
  • Annual backups, 250GB: 1k



Job Scheduling


Questions we need to address:


First, some necessary SGE fundamentals...



Job Priority Formula

Formula Used by SGE to Determine Job Priority


  Job priority

      = urgency weight  * normalized urgency value
            # ...urgency depends on requested resources, waiting time
            #    and any deadline (qsub -dl)...

      + ticket weight   * normalized ticket value
            # ...fairshare, functional and override tickets...

      + POSIX weight    * normalized POSIX priority value
            # ...qsub -p...
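As a rough illustration of the weighted sum above (a simplified sketch, not SGE's actual implementation — the weight values and min-max normalization here are illustrative only):

```python
# Hypothetical sketch of a weighted-sum job-priority formula.
# Weights are examples, not SGE's configured defaults.

def normalize(values):
    """Min-max scale raw values into [0, 1], standing in for SGE's
    normalization of urgency, ticket and POSIX-priority contributions."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]   # all equal: no discrimination
    return [(v - lo) / (hi - lo) for v in values]

def job_priorities(urgency, tickets, posix,
                   w_urgency=0.1, w_ticket=0.9, w_posix=0.0):
    """Per pending job:
    priority = w_urgency*norm(urgency) + w_ticket*norm(tickets)
               + w_posix*norm(posix)."""
    nu = normalize(urgency)
    nt = normalize(tickets)
    npr = normalize(posix)
    return [w_urgency * u + w_ticket * t + w_posix * p
            for u, t, p in zip(nu, nt, npr)]

# Three pending jobs; with a dominant ticket weight, the first job
# (most tickets) comes out on top despite its low urgency.
prios = job_priorities(urgency=[100, 500, 300],
                       tickets=[9000, 2000, 4000],
                       posix=[0, 0, 0])
```

Note how the ticket weight dominates here: that is what lets a fairshare (ticket) policy drive scheduling while urgency and POSIX priority act as tie-breakers.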



Fair Share Job Scheduling

How does Fairshare Work?




SGE's Fair Share

Half Life
  • SGE's fair share does not use a simple interval in which all jobs count equally:
    • it uses a half-life factor — jobs completed longer ago count less against a user/project.
Compensation Factor
  • Suppose a user/project has used the system little for a while:
    • then submits many jobs;
    • this could lead to domination of the system.
  • The compensation factor limits such domination.
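The half-life idea can be sketched as follows (illustrative only — SGE's actual usage accounting is more involved than this):

```python
# Illustrative half-life decay of recorded usage, as in fair-share
# accounting: usage recorded t days ago contributes
# usage * 0.5**(t / half_life) to a user's/project's effective total.

def decayed_usage(records, half_life_days):
    """records: list of (cpu_seconds, age_in_days) pairs.
    Returns total usage with each record decayed by its age."""
    return sum(cpu * 0.5 ** (age / half_life_days)
               for cpu, age in records)

# With a 7-day half-life, week-old usage counts half as much as today's:
today     = decayed_usage([(1000.0, 0.0)], half_life_days=7)
last_week = decayed_usage([(1000.0, 7.0)], half_life_days=7)
```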



Priority with FS: Override and Urgency

How can we get urgent jobs run if we use fairshare? (1/2)

Override Policy
  • Example: a project (group) has a conference coming up...
  • Sysadmin can dynamically adjust the relative priority of a single job, or of all of a user's or project's jobs,
    • without needing to change the fair-share tree.
SGE's Urgency
Each job has an urgency value associated with it, which depends on:
  • resources requested (qsub -l resource=value) — see later
  • waiting time
  • job deadline, if any (qsub -dl)
    • users cannot specify this by default



Priority with FS: Posix Priority and Deadlines

How can we get urgent jobs run if we use fairshare? (2/2)

Posix Priority: qsub -p <integer>
  • Users can only DEcrease posix priority
    • Sysadmin can INcrease
Deadline: qsub -dl <date>
  • SGE's urgency value for deadline jobs:
        deadline_weighting   # ... specified in SGE config...
           / time until job's specified deadline
  • Users cannot submit deadline jobs by default.
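The deadline contribution above can be sketched numerically (simplified — SGE's real normalization differs; the clamp at the deadline is an assumption of this sketch):

```python
# Illustrative deadline urgency: a configured weight divided by the
# time remaining, so urgency rises sharply as the deadline approaches.

def deadline_urgency(deadline_weight, seconds_until_deadline):
    """Return the deadline contribution to a job's urgency.
    A job at or past its deadline gets the full weight (clamped)."""
    if seconds_until_deadline <= 0:
        return deadline_weight          # deadline reached: maximal urgency
    return deadline_weight / seconds_until_deadline

# Same weight, closer deadline => higher urgency:
far  = deadline_urgency(3600.0, 7200.0)   # two hours away
near = deadline_urgency(3600.0, 600.0)    # ten minutes away
```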



SGE's ARCo

SGE's Accounting and Reporting Console


We can be sure everyone is getting what they paid for, no matter what scheduling we choose to implement.




Recommendations

Job Scheduling Schemes — For Discussion

Two Schemes
  1. Fallback — everyone always gets immediate access to their contribution.
  2. A better way to run a(n increasingly large) cluster.
Info: Compute Node Genders
  • 1 * C6100: 48 cores, 8GB/core
  • 6+2/10/more... * C6100: 48 cores, 4GB/core
  • 16 * R815: 32 cores, 2GB/core, IB



Option 1/2: Fallback Scheme

Group always gets access to its own nodes via job preemption

  1. One high-priority queue for each research group's boxes (assuming all the same gender):
    • max. wall clock/queue length determined by that group;
    • only group members have access to this queue.
  2. A low-priority queue across each gender too:
    • everyone has access.
  3. Jobs in low-priority queues are subordinate to (preempted by) those in high-priority queues.
  4. Fair share across the total.
  5. Monitor (ARCo) and correct shares (override policy) if necessary.
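The preemption in step 3 could be expressed via SGE queue subordination — a sketch only; the queue and host-group names (chem-hp.q, all-lp.q, @chem_nodes) are hypothetical:

```
# Hypothetical excerpt of a high-priority queue's configuration
# (qconf -sq chem-hp.q): when one or more slots in this queue are in
# use on a host, jobs in the subordinate low-priority queue on the
# same host are suspended.
qname            chem-hp.q
hostlist         @chem_nodes
user_lists       chem_group
subordinate_list all-lp.q=1
```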


I don't like it!



Extras




Tickets

 -- ticket types: functional, fair share and override



Interactive?

 -- interactive (qrsh) work?



Resource Quotas




Licence Management

 qsub -l matlablicence=1 ...

 qstat -r ...

  -- have a killer cron job running...
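For the qsub line above to work, matlablicence would need to be defined as a consumable resource — a sketch (the resource name, shortcut and licence count are examples):

```
# Hypothetical consumable complex entry (qconf -mc):
#name          shortcut  type  relop  requestable  consumable  default  urgency
matlablicence  ml        INT   <=     YES          YES         0        0

# ...and a licence count on the global host (qconf -me global):
#   complex_values  matlablicence=10
```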



More

 -- Oracle white paper (August 2010):
    Beginner's Guide to Oracle Grid Engine 6.2

 -- Scheduler Policies for Job Prioritization in the Sun N1 Grid Engine 6 System
     -- shortcut: SP for JP in GE6

 -- http://wikis.sun.com/display/GridEngine/Managing+the+Scheduler
 -- http://wikis.sun.com/display/GridEngine/Managing+Policies
 -- http://wikis.sun.com/display/GridEngine/Managing+Resource+Quotas

 -- gridengine.info
 -- gridengine.org

 -- arc.liv.ac.uk/SGE/howto

 -- http://wiki.gridengine.info --- out of date???

