Feed The Monster
INteractive Computational LINux Environment — INCLINE aka iCSF (2013 May)
- Background
-
So, we have been running batch computational platforms for many years. Yet
many powerful workstations are procured each year for research groups and
individuals all over campus. Such workstations:
- are idle much of the time and therefore wasteful of financial resources;
- make user/researcher support by Research IT very inefficient: each is set up slightly differently from the others, they are geographically dispersed, and each requires its own installation of software applications.
- User Requirement
-
Researchers:
- often need to run GUI-based applications — use of qrsh on batch/queue platforms offers a far from delightful experience;
- sometimes need to develop and debug code with frequent and short code runs;
- may need to pre-process or post-process data whose main processing is performed on our batch compute systems, Redqueen and The CSF.
- Solution Overview
-
Build a load-balanced SSH Linux
Virtual Server!
- Procure compute nodes with the same specification as those used in the CSF, together with an LVS node for performing the routing and load-balancing.
- Using dedicated network infrastructure, make the same home and shared storage areas as are mounted on Redqueen and the CSF available on all compute nodes.
- Use mon to dynamically tweak the LVS config so as to perform load-balancing.
- Solution Detail: Ensure Network Traffic Forwarding is Active
-
On the LVS node:
echo 1 > /proc/sys/net/ipv4/ip_forward # ...ensure forwarding enabled...
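Writing to /proc as above takes effect immediately but does not survive a reboot; setting net.ipv4.ip_forward = 1 in /etc/sysctl.conf is the usual way to make it persistent. A minimal sketch for checking the current state follows. The function name forwarding_enabled and its optional file argument are illustrative assumptions, not part of the original setup.

```shell
#!/usr/bin/env bash
# Succeeds (exit status 0) if the given file contains "1".
# Defaults to the kernel's IPv4 forwarding switch; the optional argument
# exists only so the check can be tried against an ordinary file.
forwarding_enabled() {
    local proc_file="${1:-/proc/sys/net/ipv4/ip_forward}"
    [ "$(cat "$proc_file" 2>/dev/null)" = "1" ]
}

if forwarding_enabled; then
    echo "IPv4 forwarding is enabled"
else
    echo "IPv4 forwarding is NOT enabled" >&2
fi
```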
- Solution Detail: Install and Configure LVS:
-
On the LVS node:
yum install ipvsadm
ipvsadm --clear
ipvsadm -A -t 10.99.203.43:22 -p 86400
#
# ...persistent --- we want successive SSH connections from one client/user
#    to go to the same compute node...
#
ipvsadm -a -t 10.99.203.43:22 -r 10.7.7.1:22 -m
ipvsadm -a -t 10.99.203.43:22 -r 10.7.7.2:22 -m
#
# ...10.99.203.43 is the external IP of the headnode and
#    10.7.7.0/25 is the internal/private network...
The LVS now looks like:
ipvsadm -L
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port      Forward Weight ActiveConn InActConn
TCP  incline:ssh wlc
  -> incline01:ssh           Masq    1      1          0
  -> incline02:ssh           Masq    1      0          0
#
# ...incline01, incline02, etc. are the compute nodes...
#
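When checking how connections are spread, it can help to reduce the listing to one line per compute node. The sketch below does this with awk; the function name summarise_ipvs is an assumption, and the sample input is copied from the listing above rather than taken from a live system.

```shell
#!/usr/bin/env bash
# Reduce "ipvsadm -L" output to "<real server> weight=<w> active=<n>".
# Real-server rows start with "->"; the column-header row is skipped.
summarise_ipvs() {
    awk '$1 == "->" && $2 != "RemoteAddress:Port" {
             print $2, "weight=" $4, "active=" $5
         }'
}

# On the LVS node this would be: ipvsadm -L | summarise_ipvs
# Here the listing shown above is fed in directly.
summarise_ipvs <<'EOF'
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port      Forward Weight ActiveConn InActConn
TCP  incline:ssh wlc
  -> incline01:ssh           Masq    1      1          0
  -> incline02:ssh           Masq    1      0          0
EOF
# prints:
#   incline01:ssh weight=1 active=1
#   incline02:ssh weight=1 active=0
```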
- Solution Detail: LVS-Specific Parts of The IPTables Configuration
-
On the LVS node:
- Solution Detail: Use mon To Monitor Compute Node Availability
-
On the LVS node, use mon to check availability of nodes — extract
from /etc/mon/mon.cf:
watch incline64-ssh
    service ssh
        interval 5m
        monitor ssh-simonh.monitor
        # ...returns a list of unhappy hosts within this group in the summary line...
        period wd {Mon-Fri} hr {7am-7pm}
            alert mail.alert root@localhost
            upalert mail.alert root@localhost
            alertafter 2
            alertevery 1h
        period wd {Sat-Sun}
            alertafter 2
            alertevery 6h
            alert mail.alert root@localhost
- Solution Detail: Use mon To Load-Balance Compute Nodes
-
On the LVS node use mon
to load-balance compute nodes — extract from /etc/mon/mon.cf:
watch incline64-ssh-load
    service ssh
        interval 1m
        monitor ssh-load-simonh.monitor -V 10.99.203.43:22
        # ...returns a list of unhappy hosts within this group in the summary line...
        period wd {Mon-Fri} hr {7am-7pm}
            alertafter 2
            alertevery 5m
            # ...that is unusually frequent (every 5 mins) --- this is because we are using this
            #    for load-balancing...
            alert lvs-incline-load-simonh.alert -P tcp -V 10.99.203.43:22 -F nat -f root@localhost root@localhost
            upalert lvs-incline-load-simonh.alert -P tcp -V 10.99.203.43:22 -F nat -f root@localhost root@localhost
        period wd {Sat-Sun}
            alertafter 2
            alertevery 5m
            # ...that is unusually frequent (every 5 mins) --- this is because we are using this
            #    for load-balancing...
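The monitor and alert scripts named in this extract (ssh-load-simonh.monitor and lvs-incline-load-simonh.alert) are not shown here. The following is only a rough sketch of the general idea, not those scripts: pick out nodes whose load is above a threshold, then lower their LVS weight with ipvsadm -e. The function names, the threshold of 4 and the sample load figures are all hypothetical, and the ipvsadm command is printed rather than run.

```shell
#!/usr/bin/env bash
# Print the hosts whose load exceeds a threshold.
# Input on stdin: lines of "<host> <1-minute load>". How the loads are
# gathered (for example over ssh to each node) is left out of this sketch.
overloaded_hosts() {
    local threshold="$1"
    awk -v max="$threshold" '$2 + 0 > max + 0 { print $1 }'
}

# Print (do not run) the ipvsadm command that edits a real server's weight.
# In ipvsadm terms a weight of 0 marks the server as quiescent.
lvs_weight_cmd() {
    local vip="$1" rip="$2" weight="$3"
    echo "ipvsadm -e -t $vip -r $rip -m -w $weight"
}

# Illustrative run with made-up load figures and a made-up threshold of 4.
printf '%s\n' 'incline01 7.5' 'incline02 0.3' | overloaded_hosts 4 |
while read -r host; do
    lvs_weight_cmd 10.99.203.43:22 "$host:22" 0
done
# prints: ipvsadm -e -t 10.99.203.43:22 -r incline01:22 -m -w 0
```

Restoring the original weight once the load drops would be the matching upalert step.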
- Solution Detail: SSH Hostkeys
-
On compute nodes, to stop users' SSH clients warning about changed SSH host keys, ensure the host keys are identical on every node, e.g. copy /etc/ssh/*host* from one compute node to all the others.
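One way to confirm that the copy worked is to compare the keys each node presents. The sketch below assumes input in the "<host> <keytype> <key>" form that ssh-keyscan prints; the function name hostkeys_identical is an assumption and the keys in the demonstration are dummy placeholders, not real keys.

```shell
#!/usr/bin/env bash
# Succeeds if, for every key type seen on stdin, all hosts present the same
# key. Input lines look like "<host> <keytype> <base64 key>".
hostkeys_identical() {
    awk '
        /^#/ || NF < 3 { next }    # skip comment lines and short lines
        { pairs[$2 " " $3] = 1; types[$2] = 1 }
        END {
            np = 0; nt = 0
            for (p in pairs) np++
            for (t in types) nt++
            # one distinct key per key type means all hosts agree
            exit (nt > 0 && np == nt) ? 0 : 1
        }'
}

# Against the real nodes this would be something like:
#   ssh-keyscan incline01 incline02 2>/dev/null | hostkeys_identical
# Demonstration with placeholder key material:
hostkeys_identical <<'EOF' && echo "host keys match"
incline01 ssh-rsa AAAAB3placeholderkey
incline02 ssh-rsa AAAAB3placeholderkey
EOF
```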
- Solution Detail: SSH Keepalive
-
On compute nodes, to avoid SSH connections/sessions being broken,
set the SSH daemon to push out keepalive packets — extract from
/etc/ssh/sshd_config:
Keepalive:
ClientAliveInterval 5
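With this setting sshd sends a keepalive request through the encrypted channel whenever nothing has been received from the client for 5 seconds. A small sketch for checking what a given sshd_config file sets is below. The function name client_alive_interval is an assumption, and the sketch ignores Match blocks, Include files and the keyword=value form; it reports the first occurrence because sshd uses the first value it obtains for a keyword.

```shell
#!/usr/bin/env bash
# Print the first ClientAliveInterval value found in an sshd_config-style
# file. Keywords are case-insensitive; commented-out lines do not match.
client_alive_interval() {
    local cfg="${1:-/etc/ssh/sshd_config}"
    awk 'tolower($1) == "clientaliveinterval" { print $2; exit }' "$cfg"
}

# Demonstration against a temporary file holding the extract above.
cfg=$(mktemp)
printf '%s\n' '# Keepalive:' 'ClientAliveInterval 5' > "$cfg"
client_alive_interval "$cfg"    # prints 5
rm -f "$cfg"
```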
The LVS-specific IPTables rules on the LVS node, as referred to in the IPTables section above:
$IPT -A FORWARD -p tcp --sport 22 -i $IN2INT -o $EXTINT -s 10.7.7.0/24 -d 130.88.0.0/16 -j ACCEPT
$IPT -A FORWARD -p tcp --sport 22 -i $IN2INT -o $EXTINT -s 10.7.7.0/24 -d 10.99.0.0/16 -j ACCEPT
$IPT -t nat -A POSTROUTING -s 10.7.7.0/24 -j MASQUERADE
# ...$EXTINT is the public IP address of the LVS node, $IN2INT is the 10.7.7.0/24
#    IP address of the LVS node...