1. Administration

1.1. Adding a user

Users should first have an account on berserk, which gives them a permanent, backed-up place to keep data and access to much more software. To add the user to the cluster, I've just been copying the relevant lines from /etc/passwd and /etc/shadow on berserk and editing /etc/group by hand. I haven't automated this yet. You also have to create the home directory manually.
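The copy step could be scripted along these lines. This is only a sketch: the function name copy_account_line is made up for illustration, and it assumes you have already fetched copies of berserk's /etc/passwd and /etc/shadow onto the cluster.

```shell
#!/bin/sh
# Hypothetical helper: append USER's entry from SRC to DST unless DST
# already has one. SRC is a copy of berserk's /etc/passwd or /etc/shadow;
# DST is the corresponding local file.
copy_account_line() {
    user=$1 src=$2 dst=$3
    # Compare the first colon-separated field exactly, so that "an"
    # does not match "anders".
    line=$(awk -F: -v u="$user" '$1 == u { print; exit }' "$src")
    if [ -z "$line" ]; then
        echo "no entry for $user in $src" >&2
        return 1
    fi
    # Do nothing if the user is already present in the destination.
    if awk -F: -v u="$user" '$1 == u { found = 1 } END { exit !found }' "$dst"; then
        echo "$user already present in $dst" >&2
        return 0
    fi
    printf '%s\n' "$line" >> "$dst"
}
```

It would be called once for the passwd copy and once for the shadow copy. Editing /etc/group and creating and chowning the home directory still have to be done afterwards.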

1.2. Monitoring the cluster

Every few days you should run showq and checknodes and examine the output to make sure that everything is in order.

1.3. Changing wall times or other scheduler parameters

Maui uses a MySQL database to keep track of everything. Refer to the documentation for the Maui database for details on all the tables and fields. I'll give a few examples here. Open the database with

mysql maui_db

1.3.1. Changing the default wall time

mysql> ALTER TABLE policy_default ALTER max_duration SET DEFAULT 604800;

1.3.2. Changing one user's wall time

mysql> UPDATE policy_default SET max_duration = 2419200 WHERE auth = "anders";
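
The numbers in these examples look like wall times in seconds: 604800 is 7 days and 2419200 is 28 days. Assuming max_duration really is stored in seconds (check the database documentation), a small helper like the following avoids arithmetic slips. The name days_to_seconds is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: convert a whole number of days to seconds.
days_to_seconds() {
    # 86400 = 24 hours * 60 minutes * 60 seconds
    echo $(( $1 * 86400 ))
}
```

For example, days_to_seconds 28 prints 2419200, the value used for anders above.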

1.4. Reinstalling a node

If a node gets corrupted or its hard drive fails, it will be necessary to reinstall the node. Because we are using the automated installer FAI (see below), this takes only about 15 minutes and needs very little intervention. However, since the network boot cannot function on a channel-bonded network, and since channel bonding is not interoperable with standard Ethernet, the whole cluster must be shut down to reinstall any node.

Here's the procedure.

  1. Shut down all nodes by running cshutdown -H t 0 on the master node.

  2. Edit /etc/bootptab and uncomment the lines corresponding only to those nodes which you want to reinstall.

  3. Switch to non-channel-bonded networking.

    ifdown bond0

    ifup eth1

  4. Stop the firewall, which blocks eth1 because it doesn't know about that interface.

    /etc/init.d/easy-firewall stop

  5. Restart (with the power switch) the nodes to be installed.

  6. Wait for installation to complete and the "Press Enter to reboot" message to appear.

  7. Restore /etc/bootptab to its former state, with all nodes commented out.

  8. Switch back to channel-bonded networking.

    ifdown eth1

    ifup bond0

  9. Restart the firewall.

    /etc/init.d/easy-firewall start

  10. Restart all nodes (power switch OK).
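
The two /etc/bootptab edits in this procedure could be done with a pair of helpers like the ones below. This is a sketch under assumptions: the function names are made up, and it assumes each node's entry sits on a single line starting with "name:" and is disabled with a single leading "#". If the entries span several lines, edit the file by hand instead.

```shell
#!/bin/sh
# Hypothetical helpers for toggling one node's entry in a bootptab file.
# Assumes one-line entries of the form "name:..." that are disabled by
# a single leading "#".

# Uncomment the entry for NODE in FILE.
bootptab_enable() {
    node=$1 file=$2
    sed "s/^#\($node:\)/\1/" "$file" > "$file.new" &&
        cat "$file.new" > "$file" && rm "$file.new"
}

# Comment out the entry for NODE in FILE.
bootptab_disable() {
    node=$1 file=$2
    sed "s/^\($node:\)/#\1/" "$file" > "$file.new" &&
        cat "$file.new" > "$file" && rm "$file.new"
}
```

Run bootptab_enable with the node name and /etc/bootptab before the reinstall, and bootptab_disable with the same arguments afterwards. Writing back with cat rather than mv keeps the original file's owner and permissions.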

1.5. Swapping Nodes

At some point it may be necessary to switch one node for another in the cluster (for ease of physical removal etc.). I think this would be quite rare, but since it has been done, the instructions are here.

  1. After the physical swap is completed, we need to change the hostname on the node. Edit /etc/hostname and /etc/mailname as appropriate. You might also want to update the motd.
  2. Run hostname pistonXX, replacing XX with the appropriate number.
  3. Edit /etc/network/interfaces to change the IP address (they are numbered sequentially).
  4. Run nohup /etc/init.d/networking restart. This last step is a bit of a hack. Restarting the network will kick you off (since you are likely doing all this remotely via ssh). Using nohup allows the command to complete.

It's possible that you'll also need to edit /etc/modules depending on what network cards are in the machine.

1.6. Changing the Number of Nodes

Another unusual task would be changing the number of nodes in the cluster. This could be necessary if one of them has a hardware failure for instance. The following steps need to be taken on the front-end node, piston00.

  1. Edit /etc/update-cluster/cluster.xml. Remove the node.
  2. Run update-cluster-regenerate all. You can check the result with cat /etc/mpich/machines.LINUX.
  3. Edit /etc/c3.conf (controls scripts like cexec).
  4. Run cexec /etc/init.d/maui-node stop.
  5. Run /etc/init.d/maui-control stop and then /etc/init.d/maui-control start.
  6. Run cexec /etc/init.d/maui-node start.

At this point you should be done. checknode will likely still list any removed node as unknown, though. Not sure why. If checknode lists a good node as unknown, you can rsh to that node and try remounting the NFS mounts.

You might also need to edit /etc/dsh/machines.list, /etc/hosts, /etc/hosts.equiv and /etc/bootptab (the last is not really necessary).
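
Since the node name is spread over several files, a quick check like the following can show which ones still mention a removed node. It is only a sketch, and the function name files_mentioning is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the names of the given files that still
# mention NODE. Returns non-zero if none of them do.
files_mentioning() {
    node=$1
    shift
    # -l prints only file names; -w matches whole words, so a shorter
    # name is not found inside a longer one. Errors about missing files
    # are discarded.
    grep -lw -- "$node" "$@" 2>/dev/null
}
```

For example, run files_mentioning pistonXX /etc/update-cluster/cluster.xml /etc/mpich/machines.LINUX /etc/c3.conf /etc/dsh/machines.list /etc/hosts /etc/hosts.equiv, with XX replaced by the removed node's number. No output means none of those files still refer to it.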

1.7. Backing up the master node

It is probably a good idea to back up the master node from time to time. The following steps seem to have worked in the past.

  1. Put the desired backup tape in the tape drive on berserk.
  2. Shut down the cluster.
  3. Boot the master node with Knoppix. Use the boot command knoppix 2 vga=normal to get a command line that works with the cluster monitor.
  4. Mount the master node hard drive with mount /dev/hda1 /mnt/hda1.
  5. Bring up the network with ifconfig eth0 up 142.103.140.90 (that's the master node IP address).
  6. Run tar -czvl -C /mnt/hda1 --exclude='dev/*' . | buffer | ssh -l root 142.103.140.62 "buffer -B > /dev/tapes/tape0/mt". The IP is for berserk. You can probably feel free to customize this as desired. The buffer commands may not be necessary (?). Also, you have to type in the root password for berserk, possibly after the tar file list starts to fill the screen!
  7. Shut down the master node (remember to go catch the Knoppix CD as it falls!) and reboot.
  8. Test the backup if you want. See the berserk admin guide for the use of the backup/restore commands (you might have to poke around to find the manual commands).
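
One thing to watch in the tar command of step 6 is the -l option. In older GNU tar releases it meant --one-file-system, but in newer ones it means --check-links, so the command may not behave the same way on a recent Knoppix. The sketch below spells the option out. The function name backup_tree is made up, and it has only been thought through against GNU tar.

```shell
#!/bin/sh
# Hypothetical helper: archive the tree under SRC into OUT, where OUT is
# a file, a tape device, or "-" for standard output.
backup_tree() {
    src=$1 out=$2
    # --one-file-system is written out because the short -l option means
    # --check-links in newer GNU tar releases.
    # The options come before the "." so that the exclude applies to it.
    tar -czf "$out" --one-file-system --exclude='dev/*' -C "$src" .
}
```

To reproduce the pipeline from step 6, pass - as the output and pipe the result onwards: backup_tree /mnt/hda1 - | buffer | ssh -l root 142.103.140.62 "buffer -B > /dev/tapes/tape0/mt". Dropping the -v flag also keeps the file list from scrolling past the password prompt.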