Running a job on a server

This page explains how to run a job on the calculation servers of the supercomputing system.
We assume that you can log in to the front-end node of the Large-Scale Parallel Computing Server (super.sc.imr.tohoku.ac.jp) with the ssh command.

PBS job scheduler

Jobs (calculations on the supercomputer) running on MASAMUNE-IMR are controlled by the PBS Professional job scheduler.
To run a job on a server, you submit it to the corresponding job queue by running a script or command.
PBS then decides the order, start time, nodes, and so on for the queued jobs and assigns each job to calculation nodes accordingly.

Job script

To submit a job to a queue, you need the qsub command and a script file that describes what you want to do.
You can submit a job by running the following command.

$ qsub [-q queue_name] [-l select=the_number_of_nodes] [-N job_name] [-M email_address] [-m specification_of_email_notice] [-l walltime=upper_limit_of_running_time] [-l license_name=the_number_of_license] script_file


These options can also be written in the script file itself, and we strongly recommend specifying them that way.
A typical script file looks like this.

#!/bin/sh
#PBS -l select=the_number_of_nodes
#PBS -q queue_name
#PBS -N job_name

(Write what you want to do here.)


The first line means that the script is run as a shell script.
The lines from the second onward that start with “#PBS” specify PBS options.
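To show how the remaining qsub options map onto “#PBS” lines, here is a minimal sketch of a script header that also sets a walltime limit and e-mail notification. The address user@example.com and the one-hour limit are placeholders, and “abe” asks PBS to send mail when the job aborts, begins, and ends. Because “#PBS” lines are ordinary comments to the shell, the script also runs outside PBS.

```shell
#!/bin/sh
#PBS -l select=1
#PBS -q P_016
#PBS -N hello
#PBS -l walltime=01:00:00
#PBS -M user@example.com
#PBS -m abe

# PBS sets PBS_O_WORKDIR to the directory where qsub was run.
# Fall back to the current directory so the script also runs outside PBS.
start_dir=${PBS_O_WORKDIR:-$PWD}
echo "Job submitted from: $start_dir"
```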

Let's make a basic job script and run the job.

If the application you want to use is installed on the system, you can find a job script example for it in Application list / Usage.


Create a script file with the following content and save it as “hello.sh”.
This script submits a job named “hello” to the P_016 queue. The job uses 1 node.

hello.sh
#!/bin/sh
#PBS -l select=1
#PBS -q P_016
#PBS -N hello

# Copy the entire submission directory to the /work area and move into the copy.
DIRNAME=$(basename "$PBS_O_WORKDIR")
WORKDIR=/work/$USER/$PBS_JOBID
mkdir -p "$WORKDIR"
cp -raf "$PBS_O_WORKDIR" "$WORKDIR"
cd "$WORKDIR/$DIRNAME"

# Run a program.
# Standard output and error output are redirected to result.out and result.err respectively.
aprun echo "Hello world!" > result.out 2> result.err

# After running the program, copy the results back to the original directory.
# The /work copy is removed only if the copy back succeeded.
cd; if cp -raf "$WORKDIR/$DIRNAME" "$PBS_O_WORKDIR/.." ; then rm -rf "$WORKDIR"; fi

important

When you run a program on the Large-Scale Parallel Computing Server, be sure to use the aprun command.
For more information about job scripts, see Job submit/management commands.
You can check a job's progress by looking at the files in /work/$USER/$PBS_JOBID/$DIRNAME.
If the program does not finish within the walltime limit, the last line of the script above will not run and the output files will not be copied back to the original directory.
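The work directory path mentioned above can be rebuilt from the job ID that qsub prints. The following sketch defines a hypothetical helper, job_workdir, for that purpose; the job ID 123456.sdb and the directory name my_calc are illustrative values only.

```shell
#!/bin/sh
# Print the /work directory that the hello.sh example uses for a given job.
# $1: job ID as printed by qsub (e.g. 123456.sdb)
# $2: name of the directory the job was submitted from
job_workdir() {
    echo "/work/${USER:-$(id -un)}/$1/$2"
}

# Hypothetical job ID and directory name, for illustration only.
dir=$(job_workdir 123456.sdb my_calc)
echo "$dir"
```

With the path in hand, commands such as ls -l "$dir" or tail "$dir/result.out" show the partial results, including those left behind by a job that hit its walltime limit.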


Then submit the script with the qsub command as follows.
If the submission succeeds, the job ID is printed (123456.sdb in this case).

$ qsub hello.sh
123456.sdb


You can check the job status with the statj command.

$ statj

sdb: 
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
123456.sdb      username P_016    hello         --    1  72  768gb 24:00 Q   -- 


Each column shows the following information.

Column        Description
Job ID        ID assigned to the job
Username      user ID
Queue         queue name
Jobname       job name
SessID        session ID
NDS           number of occupied nodes
TSK           number of occupied CPU cores
Req'd Memory  requested memory size
Req'd Time    requested running time
S             status of the job
                Q: queued
                R: running
                E: exiting
                H: held
Elap Time     elapsed time
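If you only need the status letter of one job, the S column can be picked out of the listing with awk. This is a sketch that assumes the real statj output has the same column layout as the example above and that job names contain no spaces; job_state is a hypothetical helper name, and the sample lines are copied from the example.

```shell
#!/bin/sh
# Print the status letter (S column) of one job.
# $1: job ID; the job list is read from standard input.
# In the layout shown above, S is the 10th whitespace-separated field.
job_state() {
    awk -v id="$1" '$1 == id { print $10 }'
}

# Sample input copied from the statj example on this page.
state=$(job_state 123456.sdb <<'EOF'
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
123456.sdb      username P_016    hello         --    1  72  768gb 24:00 Q   --
EOF
)
echo "$state"
```

On the system itself the same function could be fed from the live listing, for example statj | job_state 123456.sdb.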


When the job finishes, the following files will be saved in the directory from which you submitted the job.

result.out
result.err
hello.o123456
hello.e123456


Files named “{job name}.o{job number}” and “{job name}.e{job number}” are the standard output file and the error output file, respectively.
These files will be empty because standard output and error output are redirected to result.out and result.err.
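The file names listed above (hello.o123456 and hello.e123456) combine the job name with the numeric part of the job ID. A short sketch of that derivation follows, using the values from this page. On the system the job ID could be captured with job_id=$(qsub hello.sh), since qsub prints it to standard output.

```shell
#!/bin/sh
job_name=hello             # value given to "#PBS -N"
job_id=123456.sdb          # job ID as printed by qsub
job_number=${job_id%%.*}   # strip the server suffix, leaving 123456

out_file="$job_name.o$job_number"   # standard output file written by PBS
err_file="$job_name.e$job_number"   # error output file written by PBS
echo "$out_file $err_file"
```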

attention!

When you run a program that writes a lot to standard output or error output, be sure to redirect the output to a file.
Output that is not redirected is temporarily stored on the PBS system's storage, which can degrade the system's performance.
Please understand that we may have to cancel the job if this problem occurs.
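The two common ways to redirect output are sketched below. noisy_program is a stand-in for your real program, and the scratch directory exists only for this demonstration. In an actual job script the program would be started through aprun, as in hello.sh.

```shell
#!/bin/sh
# Stand-in for a real program: writes one line to each output stream.
noisy_program() {
    echo "normal output"
    echo "error output" >&2
}

outdir=$(mktemp -d)    # scratch directory for this demonstration only

# Separate files for standard output and error output (as in hello.sh).
noisy_program > "$outdir/result.out" 2> "$outdir/result.err"

# Alternative: send both streams to a single file.
noisy_program > "$outdir/result.log" 2>&1

cat "$outdir/result.out"
```

Either form keeps the program's output out of the PBS output files.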


Let's see the result.

$ cat result.out
Hello world!
Application 5598879 resources: utime ~0s, stime ~1s, Rss ~9980, inblocks ~0, outblocks ~0


Congratulations! You can now run your programs on the MASAMUNE-IMR system.
You might also want to use other services of MASAMUNE-IMR. The User manual provides more server-specific information.
If you want to use a pre-installed application, Application list / Usage will help.


  • Last modified: 2023/04/11 05:24