Understanding the hardware

Introduction

We have a “tick over” of approximately 200 processor cores running 24/7. Overnight and at weekends we bring a further 3000 processor cores online, and when in vacation mode these cores remain in the pool unless someone reboots the machine into Windows or switches it off. The number of machines (a Condor node is a whole PC) and the number of cores can be viewed dynamically on the status website. That page also lists other useful information, e.g. a list of held jobs and software availability.

Note: the amount of memory per node reported on the status page is the memory seen by Condor, which excludes memory used by the operating system. This value will therefore be slightly smaller than the physical memory; for example, a machine with 4GB of physical memory will show up in the 2-4GB band, not the 4-8GB band.
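One practical consequence is that a job should request slightly less memory than a machine's nominal physical figure if it is to match that machine. A minimal submit file fragment, where 3500MB is an illustrative value chosen to fit a 4GB machine rather than a measured one:

# request just under 4GB so the job can still match a machine with
# 4GB physical memory, of which Condor sees slightly less than 4096MB
request_memory = 3500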

In general there are two types of standard machine available in the pool; these are described below.

General purpose worker nodes

Most Condor nodes are teaching cluster PCs; around 800 of these machines run your jobs. During term time these machines automatically boot into our standard Linux image (to join the Condor pool) each weekday evening, at around 6pm, 7pm, 9pm or 11pm, depending upon the official opening hours of the room/building the cluster is in.

These machines then automatically boot back into Windows (“Student Desktop”) at 7:30am on weekdays and leave the pool. Because this reboot only happens on weekdays, the machines remain in the pool over weekends; they are also left in the pool all week when in vacation mode.

It is important to realise that Condor jobs can be evicted at any time, e.g. when a machine is booted into Windows or someone uses a machine to do work in a teaching cluster. There is therefore no guarantee that any particular job will run to completion. Any evicted job is placed back in the queue, and by default it starts again from scratch on the next available node, although checkpointing may be possible.
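If your job periodically writes its state to a file and can resume from it, a submit file fragment along the following lines asks Condor to save and restore that file across evictions rather than discarding it. This is a sketch, not a site-recommended recipe, and the executable and file names are illustrative:

universe                = vanilla
executable              = run_analysis.sh   # illustrative name
should_transfer_files   = YES
# transfer the listed files back when the job is evicted as well as
# when it exits, so a restarted job can pick up where it left off
when_to_transfer_output = ON_EXIT_OR_EVICT
transfer_output_files   = state.dat         # illustrative name
queue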

These machines have various specifications; in general:

  • Intel i5 processors
  • 4 cores
  • 8GB memory
  • 40GB disk storage managed by Condor (via Request_Disk) and shared by all jobs running on that machine (and also shared with /tmp)

These machines are identified by the HAS_STANDARD_IMAGE ClassAd, so include (HAS_STANDARD_IMAGE =?= True) in the requirements line of your submission file to use these machines, e.g.

requirements = (Opsys == "LINUX" && Arch == "X86_64" && (HAS_STANDARD_IMAGE=?=True) )
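The =?= operator is the ClassAd “meta-equals”: unlike ==, it evaluates to False rather than UNDEFINED when a machine does not define HAS_STANDARD_IMAGE at all, so the expression safely excludes non-standard machines that lack the attribute.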

To use all disk and memory available on these PCs, request a whole machine rather than a single core.
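Exactly how a whole machine is requested depends on how the pool's slots are configured; the sketch below assumes partitionable slots, where requesting all of a node's advertised resources has the same effect. The figures are taken from the general purpose specification above, rounded down slightly to allow for what the operating system reserves:

# illustrative whole-machine request for a general purpose node
# (4 cores, 8GB memory, 40GB Condor-managed disk); the figures are
# deliberately a little below the nominal totals
request_cpus   = 4
request_memory = 7000
request_disk   = 35GB
requirements   = (Opsys == "LINUX" && Arch == "X86_64" && (HAS_STANDARD_IMAGE =?= True))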

Backbone nodes

There are currently 14 nodes with the standard image that remain in the Condor pool continuously (unless shut down due to failure or maintenance). Each node has 8 cores sharing 64GB of memory. These nodes have more disk storage than the general purpose nodes: 132GB is shared between /tmp and all Condor jobs on each node. An additional 3 backbone nodes will soon be added to the pool.

Note: Condor does not reserve cores for multicore jobs, so jobs requesting a whole backbone machine are likely never to start.

See the separate page for further details about how to submit jobs to the backbone nodes.
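In the meantime, the following sketch shows one plausible way to steer jobs towards the backbone nodes. The TotalMemory threshold is an assumption based on their 64GB specification; the pool may instead define a dedicated ClassAd for them, so treat the backbone submission page as authoritative:

# sketch: match machines advertising roughly 64GB of total memory,
# which on this pool should be the backbone nodes (threshold assumed)
requirements = (Opsys == "LINUX" && Arch == "X86_64" && (HAS_STANDARD_IMAGE =?= True) && (TotalMemory >= 60000))
# request only part of a machine; per the note above, whole-machine
# requests on the backbone nodes are likely never to start
request_cpus = 2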

Non-standard nodes

Additional non-standard machines may be found in the pool (e.g. machines donated by research groups) which generally do not have the same libraries and software packages as our standard image.

Further information

Additional information, including details on memory, can be found on the status website.
