
 

Introductions

The Computational Shared Facility: Status and Strategy


Simon Hood

simon.hood@manchester.ac.uk

Project Manager & Technical Lead
Infra Ops, IT Services




 

Background and Context





 

Uni RC Strategy: Mi Whitepaper

  • Occasional whitepaper-related meetings since June 2008

Funding Model — IT Services-Run Comp. Shared-Fac.
  • One-off capital secured from the centre: £90k
    • Cluster infra. (head nodes, storage, network h/w...)
  • All compute nodes must be paid for by research groups
    • With a "tax" to contribute to future infrastructure
  • No contribution, (almost) no use!
From Many to One
  • Academics encouraged to buy in to the central facility
    • and strongly discouraged from buying their own (small) clusters


Whitepaper published in the latter part of 2009...




 

Status

Current Status of the CSF




 

The Present

What do we have?
  • Adoption of the Redqueen model.
  • £90k sounds small...
  • ...but, for the first time(?), both:
    • University political backing from the top
    • and University central IT support (esp. for dedicated network).




  • Much more than a replacement for Horace.
  • I'm very optimistic!




 

New System (Hardware)

Apps installed, testing finished, user accounts created:
  • 68TB parallel, high-perf scratch (Lustre)
  • 240 cores at 4GB/core
  • 48 cores at 8GB/core
  • 96 cores awaiting installation
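(Together, the nodes already in service give roughly 1.3TB of RAM: 240 × 4GB plus 48 × 8GB.)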
Much more on its way...
  • 512 cores on order!
  • 96 cores to be ordered (Monday?)
  • More expected in Spring...

In Reynolds House...




 

New System (Apps and Users)

Software:
  • Apps installed
    • Gaussian 09, Amber...
    • Matlab, R...
    • CHARMM, NAMD, Polyrate...
Users:
  • Testing by users well underway
  • Awaiting registration system...




 

[Picture slides: Pic 1, Pic 2 and Pic 3]




 

Who and What

Contributors thus far:

  Chris Taylor                  £20k    Imag., Gen. and Prot.
  Mike Sutcliffe                £55k    CEAS
  Ian Hillier                   £15k    Chemistry
  Richard Bryce                 £15k    Pharmacy
  Peter Stansby/Colin Bailey   £125k    MACE/EPS
  School contribution           £32k    CEAS

Upcoming (expected):

  £15k      Translational Med.
  £15k      FLS (Bioinf)
  £30k(?)   Mathematics
  £6k       MBS




 

GPGPUs

Pooled purchasing clearly having the desired effect!

Dell provided a loss-leader blade chassis (M1000) and two blades (M610x):

  • 12 cores, 48GB RAM,
  • plus one Nvidia 2070, each
...at a very low price...


More?

  • MBS likely to add to this very soon
  • Anyone else? — coordinate to get max. discount
    • Email us!




 

Next Steps





 

Assimilation

Tightly Integrate Clusters on Campus

Link private cluster networks (dedicated 10Gb links):
  • Share filesystems — easy to implement (with the required h/w);
    • much better for users!
  • Ultimately, one "collective" instance of the workload manager, Grid Engine (aka SGE)
  • No requirement for "grid" middleware
What? New System, RQ2 and Redqueen
  • Dedicated 10Gb link between Reynolds and Kilburn
  • Total ~2000 cores
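(Roughly: ~1000 cores on the new system once the outstanding orders arrive, plus ~300 on RQ2 and ~800 on Redqueen.)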

...details/timescales follow...




 

RQ2

The first step in assimilation...

RQ2:

  • HEP, MACE, RGF, Chemistry...
  • Same machine room as CSF (Reynolds House)
  • 300 cores?
  • Filesystems shared in four to six weeks
    • collective SGE later




 

Redqueen

The second step will take longer...

Redqueen:

What?
  • ~800 cores; 16 Nvidia 2050s
  • MACE, Economics, SEAES (Atmos), EEE, Chemistry, MBS
  • Different machine room from CSF (Kilburn)
Steps
  • Awaiting dedicated 10Gb link between Reynolds and Kilburn
  • Filesystem upgrade
    • Has ~25TB storage — upgrade some to Lustre
  • Summer?




 

Man1, Man2 and The RGF

Can we use the Revolving Green Fund?

We hope:

  • Replace Man1/Weyl (5.5 yrs old) and Man2/Noether (5 yrs old)
  • with 96 new cores (two × C6100)
  • Meeting with RGF people next week...
    • if successful: money available in July/August




 

Phase Two: Cloud

The Cloud...

Whitepaper:

"Centralised, shared facility fits well with cloud computing model."


  • Plan to add features and access to the CSF via "gateways"...
  • CSF login nodes are on 10.99.203.0/24...




 

Web Portal

Database searches...

  • Some types of computational work are easily submitted via a Web interface
    • Bioinformatics community — string matching
  • IO virtual host (Ange/Owen/MikeBT)
  • Submits to the CSF batch system in the background (see the sketch below)
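
To make the hand-off concrete, here is a minimal sketch of what the portal-side submission might look like, assuming the virtual host can reach the CSF's Grid Engine via qsub; the blast_search.sh wrapper, job name and options are hypothetical placeholders rather than the portal's actual code.

# Sketch only: a portal-side helper that writes a small Grid Engine job
# script and hands it to qsub; "blast_search.sh", the job name and the
# options are hypothetical placeholders, not the portal's actual code.
import os
import subprocess
import tempfile

def submit_search_job(query_file, database):
    """Submit a string-matching search to the CSF batch system; return the job id."""
    script = ("#!/bin/bash\n"
              "#$ -cwd\n"               # run in the submission directory
              "#$ -N portal_search\n"   # job name shown in qstat
              "blast_search.sh %s %s\n" % (query_file, database))
    fd, path = tempfile.mkstemp(suffix=".sh")
    with os.fdopen(fd, "w") as handle:
        handle.write(script)
    # qsub replies e.g.: Your job 12345 ("portal_search") has been submitted
    reply = subprocess.check_output(["qsub", path])
    return reply.decode().split()[2]

The portal would then record the returned job id and poll qstat (or rely on SGE email notification) to report completion back to the user.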




 

External Access

Basic, off-campus access...

  • Uni VPN not always the answer:
    • Non-UoM collaborators
    • If users don't want their home machine to have only a UoM IP
  • SSH gateway!
    • X.509?
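
For example, an off-campus collaborator could SSH to a hardened gateway host and hop from there to a CSF login node, with the gateway accepting key-based (or X.509 certificate) authentication only; the exact host and policy here are illustrative, not decided.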




 

Virtual Desktop Service

Start interactive work at work, finish at home?

  • Supports the (increasing) interactive (GUI-based) use of HPC clusters
  • Bonus: eliminates one reason to leave office machines on at night/weekends...
  • VNC, RDP, FreeNX?
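
For example, a user could leave a VNC desktop session running on an interactive node and later reattach to the same session from home over an SSH tunnel; FreeNX and RDP offer similar disconnect/reconnect behaviour.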




 

Condor Integration

Integrate with the other big computation resource on campus...

  • EPS Condor pool — expanding greatly as all(?) EPS public clusters now dual-boot Linux
  • Backfill the CSF with Condor...
    • Dedicated Condor gateway node
  • Web portal can submit to both dedicated hardware and Condor pools (see the sketch below).
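
For illustration, a gateway-node submission to the Condor pool might look like the following sketch; the vanilla-universe submit description is standard Condor syntax, but the executable and file names are hypothetical placeholders.

# Sketch only: hand a single vanilla-universe job to the Condor pool from a
# gateway node; "analyse.sh" and the file names are hypothetical placeholders.
import subprocess

SUBMIT_DESCRIPTION = """\
universe   = vanilla
executable = analyse.sh
arguments  = input.dat
output     = job.out
error      = job.err
log        = job.log
queue
"""

def submit_backfill_job():
    """Write a Condor submit description and pass it to condor_submit."""
    with open("backfill.sub", "w") as handle:
        handle.write(SUBMIT_DESCRIPTION)
    subprocess.check_call(["condor_submit", "backfill.sub"])

condor_q would then show the job in the pool; the same pattern, pointed at qsub instead, covers submission to the dedicated CSF hardware.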




 

Finally



computational-sf@manchester.ac.uk