
Running "Heavy" Jobs on IBR Servers

Author: Frank Steinberg
Keywords: Simulations, Batch Jobs, Compute

Available Hosts

"Heavy Jobs" like ns-2 simulations that you might want to run on IBR hosts need lots of resources of mainly three types: (1) CPU power for short runtimes, (2) system memory during execution, and (3) disk space for results (and maybe for temporary data).

Please choose the parameters and hosts for your jobs very carefully before you start them. If a job runs out of system memory and the system starts swapping or thrashing, this does not help you, and it bothers other users and the administrators.

You can use tools like top(1), free(1), and df(1) to find out how many CPUs a host has, how much system memory is installed and available, and where large local temporary disk space (typically /opt/tmp) is available.
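If you prefer to collect these numbers from a script, the following is a minimal sketch in Python, assuming a Linux host that provides /proc/meminfo. The function names parse_meminfo and host_summary are hypothetical, and /opt/tmp is only the typical location mentioned above.

```python
import os
import shutil


def parse_meminfo(text):
    """Map /proc/meminfo field names to their numeric values (mostly KiB)."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[name.strip()] = int(parts[0])
    return info


def host_summary(tmp_dir="/opt/tmp"):
    """Summarize CPUs, memory, and free temporary disk space of this host.

    Assumption: Linux with /proc/meminfo; tmp_dir must exist on the host.
    """
    with open("/proc/meminfo") as f:
        mem = parse_meminfo(f.read())
    disk = shutil.disk_usage(tmp_dir)
    return {
        "cpus": os.cpu_count(),
        "mem_total_gib": mem["MemTotal"] / 2**20,
        # MemAvailable is missing on old kernels; fall back to MemFree.
        "mem_available_gib": mem.get("MemAvailable", mem["MemFree"]) / 2**20,
        "tmp_free_gib": disk.free / 2**30,
    }
```

Calling host_summary() on a candidate host gives a quick overview before you decide where to run a job.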

Here is some host-specific information as of 2010-12; it may change over time. The hosts predator, radiator, operator, animator, and bierator run 64-bit Linux operating systems and have large amounts of memory (6-16 GB). Of course, you have to compile your programs for this environment to make use of the large address space. The host unimator should not be used for heavy jobs, because it is usually used for short-term interactive sessions, for which people expect reasonable responsiveness.

You might also use workstations from the Linux pool.

How To Run Your Jobs

Before you start your jobs, please make sure you know where they will write their results. It is not wise to write gigabytes of data via NFS to a remote filesystem like your home directory. Use one of the local "tmp" directories mentioned above whenever possible. Please also try to estimate how much data will be written to output files, and check the available disk space in advance.
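The disk space check can be automated at the start of a job script. This is a minimal sketch in Python; the function name has_room_for and the safety factor of 1.5 are assumptions chosen for illustration, not a site policy.

```python
import shutil


def has_room_for(path, expected_bytes, safety_factor=1.5):
    """Return True if the filesystem holding `path` has enough free space.

    `expected_bytes` is your own estimate of the output size; the
    safety factor (an assumed value) leaves headroom for a bad estimate.
    """
    free = shutil.disk_usage(path).free
    return free >= expected_bytes * safety_factor
```

For example, has_room_for("/opt/tmp", 20 * 2**30) checks whether an estimated 20 GiB of output would fit with some margin.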

Please use nice(1) (or renice(1), if the process is already running) to run your heavy jobs. Always! This helps the operating system schedule processes so that CPU-intensive jobs do not disturb interactive jobs too much and the system stays reasonably responsive for people working interactively on the host. Since interactive jobs generally require very few CPU cycles compared to your heavy jobs, this slows your jobs down only very slightly (by ~1%), but it helps your colleagues significantly!
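If you start your jobs from a Python wrapper rather than from the shell, the same effect can be achieved with the standard library. A minimal sketch, assuming a POSIX host; the helper names run_niced and renice_process are hypothetical.

```python
import os
import subprocess


def run_niced(argv, niceness=19, **popen_kwargs):
    """Start `argv` with its niceness raised by `niceness`, like nice -n.

    The increment is applied in the child just before the program starts.
    """
    return subprocess.Popen(
        argv,
        preexec_fn=lambda: os.nice(niceness),
        **popen_kwargs,
    )


def renice_process(pid, niceness=19):
    """Set the niceness of an already running process, like renice."""
    os.setpriority(os.PRIO_PROCESS, pid, niceness)
```

A niceness of 19 is the lowest scheduling priority an ordinary user can request, which is usually what you want for long simulations.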

On multi-processor hosts, you might decide to run as many processes in parallel as there are CPU cores (or maybe even one or two more, if your jobs perform a significant amount of I/O). But stay aware of the system's total resource limits!
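The rule of thumb above can be sketched in Python as follows. The names worker_count and run_all are hypothetical, and the sketch only limits the number of concurrent processes; it does not check memory or disk limits for you.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def worker_count(extra_for_io=0, cores=None):
    """Number of parallel jobs: one per core, plus a few for I/O-heavy jobs."""
    cores = cores or os.cpu_count() or 1
    return cores + extra_for_io


def run_all(commands, workers):
    """Run each command (an argv list), at most `workers` at a time.

    Returns the exit codes in the same order as `commands`.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(subprocess.call, commands))
```

Threads are used here only to wait for the child processes; the actual work runs in the separate processes that they start.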

You might want to write some scripts around the command lines that start your jobs. This is not the place for a scripting how-to, but you might want to read the nohup(1) manual page and info page to find out how to start your jobs in the background, redirect their output to a file, and keep them running when you log out.
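As an illustration of the idea, here is a minimal Python sketch that starts a job in the background with its output appended to a log file. Note that this is not nohup itself: instead of ignoring the hangup signal, it places the job in a new session (start_new_session=True), so that closing your terminal does not reach it. The function name start_detached is hypothetical.

```python
import subprocess


def start_detached(argv, log_path):
    """Start `argv` in the background, detached from the terminal.

    Standard output and standard error are appended to `log_path`.
    """
    log = open(log_path, "ab")
    try:
        return subprocess.Popen(
            argv,
            stdin=subprocess.DEVNULL,   # the job must not read from the terminal
            stdout=log,
            stderr=subprocess.STDOUT,   # keep errors in the same log file
            start_new_session=True,     # new session, no controlling terminal
        )
    finally:
        log.close()  # the child keeps its own copy of the file descriptor
```

The returned object carries the process ID (its pid attribute), which you may want to write down for later use with kill(1) or renice(1).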

In some rare cases, you might want to pause and resume jobs. You can do this by sending them STOP/CONT signals; see kill(1).
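From a Python script, the same signals can be sent with os.kill. A minimal sketch, assuming a POSIX host; the function names pause_job and resume_job are hypothetical.

```python
import os
import signal


def pause_job(pid):
    """Suspend the process, like kill -STOP."""
    os.kill(pid, signal.SIGSTOP)


def resume_job(pid):
    """Let a suspended process continue, like kill -CONT."""
    os.kill(pid, signal.SIGCONT)
```

A paused job stops using CPU time but keeps its memory allocated, so pausing does not help a host that is short of memory.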

If you notice that the system starts swapping (top(1) and vmstat(8) can help you detect this), please consider terminating your job! A system that starts swapping slows down significantly, so in most cases you get results faster when you restart your job from scratch with a smaller footprint or on a larger machine.
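A script can watch for swap activity as well. The following minimal Python sketch assumes a Linux host, where /proc/vmstat exposes the cumulative counters pswpin and pswpout (pages swapped in and out); it compares two samples taken a few seconds apart. The function names and the 5-second default interval are assumptions made for illustration.

```python
import time


def swap_counters(vmstat_text):
    """Return (pages swapped in, pages swapped out) from /proc/vmstat content."""
    fields = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            fields[parts[0]] = int(parts[1])
    return fields.get("pswpin", 0), fields.get("pswpout", 0)


def is_swapping(before, after):
    """True if either counter grew between the two samples."""
    return after[0] > before[0] or after[1] > before[1]


def sample_swapping(interval=5.0, path="/proc/vmstat"):
    """Take two samples `interval` seconds apart and report swap activity."""
    with open(path) as f:
        first = swap_counters(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = swap_counters(f.read())
    return is_swapping(first, second)
```

A single positive result may be harmless; sustained swap activity over several samples is the signal to act on.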

When your jobs are finished, please clean up the temporary disk space as soon as possible! If you expect to need the immediate output of your jobs for a longer time for further processing, please try to find a way to condense it down to the relevant parts and/or to compress it.
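Compressing an output file and removing the uncompressed original can be done with a few lines of Python. This is a minimal sketch using gzip; the function name compress_and_remove is hypothetical, and other compressors may suit your data better.

```python
import gzip
import os
import shutil


def compress_and_remove(path):
    """Write a gzip copy of `path` next to it, then delete the original.

    Returns the path of the compressed file.
    """
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams the data, so large files are fine
    os.remove(path)
    return gz_path
```

Text-based simulation traces often compress very well, but the actual ratio depends on your data.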

last changed 2010-12-06, 16:28 by Frank Steinberg