Users should first have an account on berserk. This gives them a permanent, backed-up place to keep data, and it provides much more software. I've just been copying over the relevant lines from /etc/passwd and /etc/shadow on berserk and editing /etc/group. I haven't automated this yet. You also have to manually create the home directory.
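The copying step could be made less error-prone with a small helper. This is only a sketch: the function name user_entry is made up for illustration, and it assumes you already have copies of berserk's /etc/passwd and /etc/shadow on hand, since how they get transferred is not covered here.

```shell
# Print the entry for exactly one user from a passwd- or shadow-format file.
# Comparing the whole first colon-separated field avoids partial matches
# (asking for "al" will not also return "alice").
user_entry() {
    # $1 = user name, $2 = file to search
    awk -F: -v u="$1" '$1 == u' "$2"
}

# Hypothetical usage, with placeholder file names:
#   user_entry alice berserk-passwd >> /etc/passwd
#   user_entry alice berserk-shadow >> /etc/shadow
```

Editing /etc/group and creating the home directory still have to be done by hand afterwards.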
Every few days you should run showq and checknodes and examine the output to make sure that everything is in order.
Maui uses a MySQL database to keep track of everything. Refer to the documentation for the maui database for details on all the tables and fields. I'll give a few examples here. Open the database with
mysql maui_db
Then, for example, to change the default value of max_duration in the policy_default table:
mysql> alter table policy_default alter max_duration set default 604800 ;
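The value 604800 is presumably a duration in seconds, which works out to exactly seven days. The unit is an assumption on my part, since it is not stated here; check the maui database documentation before relying on it. Under that assumption, a one-line helper (the name days_to_seconds is made up) gives the number to use for other limits:

```shell
# Convert a whole number of days to seconds (assumes max_duration is in seconds).
days_to_seconds() {
    echo $(( $1 * 24 * 60 * 60 ))
}

days_to_seconds 7    # prints 604800
```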
If a node gets corrupted or its hard drive fails, it will be necessary to reinstall the node. Because we are using the automated installer FAI (see below), this takes only about 15 minutes and requires very little intervention. However, since the network boot cannot function on a channel-bonded network, and since channel-bonding is not interoperable with standard Ethernet, the whole cluster must be shut down to reinstall any node.
Here's the procedure.
Shut down all nodes by running cshutdown -H t 0 on the master node.
Edit /etc/bootptab and uncomment the lines corresponding only to those nodes which you want to reinstall.
Switch to non-channel-bonded networking.
ifdown bond0
ifup eth1
Stop the firewall, which blocks eth1. (The firewall configuration doesn't know about eth1.)
/etc/init.d/easy-firewall stop
Restart (with the power switch) the nodes to be installed.
Wait for installation to complete and the "Press Enter to reboot" message to appear.
Restore /etc/bootptab to its former state, with all nodes commented out.
Switch back to channel-bonded networking.
ifdown eth1
ifup bond0
Restart the firewall.
/etc/init.d/easy-firewall start
Restart all nodes (power switch OK).
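The master-node side of this procedure could be wrapped in a few shell functions. This is a sketch, not something that has been run on the cluster. The function names and the DRY_RUN switch are made up for illustration, and uncomment_node assumes each /etc/bootptab entry sits on a single line that starts with "#" followed by the node name and a colon. With DRY_RUN=1 (the default here) the network commands are only printed.

```shell
# DRY_RUN=1 prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Print a bootptab-style file with the entry for one node uncommented.
# $1 = node name, $2 = file; the file itself is not modified.
uncomment_node() {
    sed "s/^#[[:space:]]*\($1:\)/\1/" "$2"
}

# Leave channel bonding and stop the firewall before reinstalling.
enter_install_mode() {
    run ifdown bond0
    run ifup eth1
    run /etc/init.d/easy-firewall stop
}

# Return to channel bonding and restart the firewall afterwards.
leave_install_mode() {
    run ifdown eth1
    run ifup bond0
    run /etc/init.d/easy-firewall start
}
```

Shutting the nodes down, power-cycling them and waiting for the installer are deliberately left as manual steps. Run with DRY_RUN=0 as root only once the printed commands look right.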
At some point it may be necessary to switch one node for another in the cluster (for ease of physical removal etc.). I think this would be quite rare, but since it has been done, the instructions are here.
It's possible that you'll also need to edit /etc/modules depending on what network cards are in the machine.
Another unusual task would be changing the number of nodes in the cluster. This could be necessary if one of them has a hardware failure for instance. The following steps need to be taken on the front-end node, piston00.
At this point you should be done. checknode will likely list any removed node as unknown, though; I'm not sure why. If checknode lists a good node as unknown, you can rsh to the node and try remounting the NFS mounts.
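The remount can be done in one command from the front-end. This sketch assumes the NFS mounts are listed in the node's /etc/fstab, so that mount -a -t nfs picks them up; the function name remount_nfs and the node name in the usage line are only examples.

```shell
# Remote shell command to use; override RSH to try this out without a cluster.
RSH="${RSH:-rsh}"

# Ask a node to mount every NFS filesystem listed in its /etc/fstab.
remount_nfs() {
    # $1 = node name
    "$RSH" "$1" mount -a -t nfs
}

# Hypothetical usage:
#   remount_nfs piston01
```

Note that mount -a skips filesystems that are already mounted, so a stale mount may need to be unmounted on the node first.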
You might also need to edit /etc/dsh/machines.list, /etc/hosts, /etc/bootptab (not really necessary), and /etc/hosts.equiv.
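To see which of these files still mention a node, something like the following may help. It is a sketch only; the function name files_mentioning and the node name in the usage line are made up, and it relies on the -w (whole word) option that GNU and BSD grep provide.

```shell
# Print each readable file from the argument list that mentions the node
# as a whole word (so "piston05" does not match "piston050").
files_mentioning() {
    node="$1"
    shift
    for f in "$@"; do
        [ -r "$f" ] || continue
        if grep -qw -- "$node" "$f"; then
            echo "$f"
        fi
    done
}

# Hypothetical usage:
#   files_mentioning piston05 /etc/dsh/machines.list /etc/hosts \
#       /etc/bootptab /etc/hosts.equiv
```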
It is probably a good idea to back up the master node from time to time. The following steps seem to have worked in the past.