
 

Introductions

Research Computing Infrastructure


Simon Hood

simon.hood@manchester.ac.uk

Project Manager & Technical Lead, Computational Shared Facility
Infra Ops, IT Services

(RCS RIP)




 

Research Computing

What do I mean by research computing?

  • HPC! (parallel)
    • Also: HTC, HPV, high-mem, huge I/O...






 

Background

Background: Uni Research Computing Strategy




Some context...




 

Recent Past Uni. HPC

Meagre HPC facilities:
  • At "merger" (2004): Cosmos, Eric and Bezier (totalling ~80 cores).
  • CSAR machines — a national, not local, service.
  • Horace (2006 – 2010) — leased by RCS (~200 cores).
Research Computing Strategy
  • ???




 

Redqueen

Summer 2008, Development One

Ad hoc contribution machine:

  • Started by RCS with 14k left in a budget (spend it or lose it)
    • Three servers, a switch and a rack
  • No official backing
  • Two and a half years:
    • hand-to-mouth existence
    • 300k(?)
    • ~800 cores (MACE, Economics, SEAS, EEE, MBS, Chemistry)




 

Manchester Informatics

Summer 2008, Development Two

  • Home Page:
  • A computational research community
  • Chris Taylor (Associate Vice President for Research, UoM)
    • Mike Sutcliffe (HoD CEAS)
    • Carmel Dickenson (Programme Manager)



  • Occasional whitepaper-related meetings since June 2008


Whitepaper published in the latter part of 2009...




 

Mi W/paper 1/2: Comp. Res. Themes

Mi: A Computational Research Umbrella

  • Mi provides research computing strategy and governance
  • Themes:
    • Nuclear power, earth systems, aerospace
    • Finance and economics, health and life sciences




 

Mi W/paper 2/2: Comp Shared-Fac

Funding Model — IT Services-Run Comp. Shared-Fac.
  • One-off capital secured from centre: 90k
    • Cluster infra. (head nodes, storage, network h/w...)
  • All compute nodes must be paid for by research groups
    • With "tax" to contribute to future infrastructure
  • No contribution, no use!
From Many to One
  • Academics encouraged to buy in to the central facility
    • strongly discouraged from buying their own (small) clusters

Campus (Research Computing) Cloud Project

This is Phase One of the Campus Cloud project...




 

The Present

The Present — Summary

What do we have?

  • Adoption of the Redqueen model.
  • 90k sounds small...
  • ...but, for the first time(?), both
    • University political backing from the top
    • and University central IT support (esp. for dedicated network).
  • Much more than a replacement for Horace.
  • I'm optimistic!




 

New System 1/2

Hardware:
  • Reynolds House
  • 68 TB parallel, high-performance scratch (Lustre)
  • 240 cores at 4 GB/core; 48 cores at 8 GB/core
  • 96 cores awaiting installation
Software and Users:
  • Apps installed
  • Testing by users underway
  • Awaiting registration system...

...continued...




 

New System 2/2

Much more on its way...
  • 512 cores on order:
    • 125k from Colin Bailey/Peter Stansby for the Modelling and Simulation Centre
  • More coming:
    • MHS: 48 cores; FLS: 48 cores; Maths: 96(?) cores
    • Dell(!): 96 cores with two Nvidia M2070 cards
OS
  • All customers want Linux;
  • possible support for legacy OSes via virtualisation.




 

Tight Integration

Phase 1.5: Tightly Integrate Clusters on Campus

Dedicated 10Gb Networks
  • Merge private cluster networks
  • Share filesystems — easy to implement (with the required h/w), a huge productivity increase for users
  • One "collective" instance of the workload manager, Grid Engine ("SGE"); see the sketch below
What? Redqueen and New System (and RQ2)
  • First, new system and RQ2 (both in Reynolds House)
  • Secondly, Redqueen (Kilburn)
    • dedicated 10Gb link between Reynolds and Kilburn
  • Total ~2000 cores
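
From the user's point of view, "one collective instance" means a single submission reaches all of the merged clusters. A minimal sketch, assuming a hypothetical parallel environment name and per-core memory limit (not the real CSF queue configuration):

import subprocess

# Hypothetical job script: the PE name ("orte") and memory request are
# illustrative only, not the actual CSF/Redqueen configuration.
job_script = """#!/bin/bash
#$ -cwd                # run from the submission directory on the shared filesystem
#$ -pe orte 16         # request 16 cores via an example parallel environment
#$ -l h_vmem=4G        # 4 GB per core, matching the standard nodes
mpirun -np 16 ./my_solver
"""

with open("solver.sge", "w") as f:
    f.write(job_script)

# A single qsub talks to the one collective scheduler; the user never has
# to choose between Redqueen, RQ2 and the new system.
subprocess.run(["qsub", "solver.sge"], check=True)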




 

Phase Two: Cloud

The Cloud...

Whitepaper:

"Centralised, shared facility fits well with cloud computing model."


  • Plan to add features and access to the CSF via "gateways"...
  • CSF login nodes are on 10.99.203.0/24...




 

Web Portal

Database searches...

  • Some types of computational work are easily submitted via a Web interface
    • Bioinformatics community — string matching
  • IO virtual host (Ange/Owen/MikeBT)
  • Submits to the CSF batch system in the background
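
A minimal sketch of that background submission step, assuming (hypothetically) that the portal shells out to qsub and runs the legacy blastall command; queue settings, paths and the handling of qsub's output are illustrative only:

import subprocess
import tempfile

def submit_search(query_file, database):
    """Write a one-off Grid Engine job for a sequence search and submit it."""
    script = (
        "#!/bin/bash\n"
        "#$ -cwd\n"
        "#$ -l h_vmem=4G\n"
        f"blastall -p blastp -d {database} -i {query_file} -o {query_file}.out\n"
    )
    with tempfile.NamedTemporaryFile("w", suffix=".sge", delete=False) as job:
        job.write(script)
        path = job.name

    # qsub prints e.g. "Your job 12345 (...) has been submitted"; keep the
    # job ID so the portal can poll qstat and report progress to the user.
    result = subprocess.run(["qsub", path], capture_output=True, text=True, check=True)
    return result.stdout.split()[2]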




 

External Access

Basic, off-campus access...

  • Uni VPN not always the answer:
    • Non-UoM collaborators
    • If you don't want your home machine to have only a UoM IP
  • SSH gateway!
    • X509?
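
One possible shape for the gateway, sketched with paramiko: an external collaborator authenticates to a hypothetical gateway host and is tunnelled through to a CSF login node (the hostname, address and username below are placeholders):

import paramiko

GATEWAY = "ssh-gateway.example.manchester.ac.uk"   # hypothetical gateway host
CSF_LOGIN = "10.99.203.10"                         # illustrative login-node address

# First hop: authenticate to the gateway (works for non-UoM collaborators too).
gw = paramiko.SSHClient()
gw.set_missing_host_key_policy(paramiko.AutoAddPolicy())
gw.connect(GATEWAY, username="collaborator")

# Open a channel from the gateway to the login node's SSH port.
channel = gw.get_transport().open_channel(
    "direct-tcpip", (CSF_LOGIN, 22), ("127.0.0.1", 0))

# Second hop: SSH to the login node over the tunnelled channel.
csf = paramiko.SSHClient()
csf.set_missing_host_key_policy(paramiko.AutoAddPolicy())
csf.connect(CSF_LOGIN, username="collaborator", sock=channel)

_, stdout, _ = csf.exec_command("qstat")   # e.g. check the job queue
print(stdout.read().decode())

csf.close()
gw.close()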




 

Virtual Desktop Service

Start interactive work at work, finish at home?

  • Supports the (increasing) interactive (GUI-based) use of HPC clusters
  • Bonus: eliminates one reason to leave office machines on at night/weekends...
  • VNC, RDP, FreeNX?
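
As a rough illustration of the VNC option, assuming a vncserver session is already running as display :1 on a hypothetical login node, a home machine could reach it over an SSH tunnel along these lines:

import subprocess

LOGIN_NODE = "csf-login.example.manchester.ac.uk"  # hypothetical hostname

# Forward local port 5901 to the VNC display (:1 = port 5901) on the login node.
tunnel = subprocess.Popen(["ssh", "-N", "-L", "5901:localhost:5901", LOGIN_NODE])

try:
    # Point a VNC viewer at the forwarded port; the desktop session, and any
    # GUI work running in it, persist on the cluster between connections.
    subprocess.run(["vncviewer", "localhost:1"])
finally:
    tunnel.terminate()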




 

Condor Integration

Integrate with the other big computation resource on campus...

  • EPS Condor pool — expanding greatly as all(?) EPS public clusters now dual-boot Linux
  • Backfill the CSF with Condor...
    • Dedicated Condor gateway node
  • The Web portal can submit to both the dedicated hardware and the Condor pools.
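
As a rough sketch of the Condor side, a gateway node (or the portal) could generate a submit description like the one below and hand it to condor_submit; the executable, requirements and task count are assumptions for illustration:

import subprocess

# Illustrative submit description for 100 independent tasks on the
# dual-boot Linux pool; file names and requirements are placeholders.
submit_description = """universe     = vanilla
executable   = sweep.sh
arguments    = $(Process)
output       = sweep.$(Process).out
error        = sweep.$(Process).err
log          = sweep.log
requirements = (OpSys == "LINUX")
queue 100
"""

with open("sweep.sub", "w") as f:
    f.write(submit_description)

# condor_submit queues the tasks; Condor backfills them onto idle machines.
subprocess.run(["condor_submit", "sweep.sub"], check=True)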




 

Finally



computational-sf@manchester.ac.uk