====== Running a job on server ======

This page explains how to run a job on the calculation server(s) of the supercomputing system.\\
We assume that you can log in to the front-end node of the Large-Scale Parallel Computing Server (super.sc.imr.tohoku.ac.jp) with the ssh command.\\

===== Overview =====

==== PBS job scheduler ====

Jobs (calculations on the supercomputer) running on MASAMUNE-IMR are controlled by the [[https://www.altair.com/pbs-professional//|PBS Professional]] job scheduler.\\
To run a job on a server, users submit it to the corresponding job queue by running a script or command.\\
PBS then schedules the order, time, node, etc. of the queued jobs and assigns them to the calculation nodes accordingly.\\
\\

==== Job script ====

To submit a job to a queue, you need the qsub command and a script file that describes what you want to do.\\
You can submit a job by running the following command.\\

  $ qsub [-q queue_name] [-l select=the_number_of_nodes] [-N job_name] [-M email_address] [-m specification_of_email_notice] [-l walltime=upper_limit_of_running_time] [-l license_name=the_number_of_license] script_file

\\
These options can also be written in the script file itself, and we highly recommend specifying them this way.\\
A typical script file looks like this.\\

  #!/bin/sh
  #PBS -l select=the_number_of_nodes
  #PBS -q queue_name
  #PBS -N job_name
  (Write what you want to do here.)

\\
The first line means that the script is run as a shell script.\\
The second and subsequent lines beginning with "#PBS" specify PBS options.\\
\\

===== Running a job =====

Let's make a basic job script and run the job.\\
If the application you want to use is installed on the system, you can find an example job script in [[:application|Application list / Usage]].\\
\\
Make a script file as follows and save it as "hello.sh".\\
This script submits a job named "hello" to the P_016 queue. The job uses 1 node.
  #!/bin/sh
  #PBS -l select=1
  #PBS -q P_016
  #PBS -N hello
  # move entire directory to /work area, and go to the directory.
  DIRNAME=`basename $PBS_O_WORKDIR`
  WORKDIR=/work/$USER/$PBS_JOBID
  mkdir -p $WORKDIR
  cp -raf $PBS_O_WORKDIR $WORKDIR
  cd $WORKDIR/$DIRNAME
  # Run a program.
  # Standard output and error output are redirected to result.out and result.err respectively.
  aprun echo "Hello world!" > result.out 2> result.err
  # After running a program, move the results back to the original directory.
  cd; if cp -raf $WORKDIR/$DIRNAME $PBS_O_WORKDIR/.. ; then rm -rf $WORKDIR; fi

When you run a program on the Large-Scale Parallel Computing Server, be sure to use the aprun command.\\
For more information about job scripts, see [[:user_manual:supercomputer:job_submission_management|Job submit/management commands]].

You can check a job's progress by looking at the files in /work/$USER/$PBS_JOBID/$DIRNAME. \\
If the program does not finish within the walltime, the last line of the script above will not run, and the output files will not be copied back to the original directory. \\
\\
Then submit the script with the qsub command as follows.\\
If the submission succeeds, the job ID is printed (123456.sdb in this case).\\

  $ qsub hello.sh
  123456.sdb

\\
You can check the job status with the statj command.\\

  $ statj
  sdb:
                                                              Req'd  Req'd   Elap
  Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
  --------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
  123456.sdb      username P_016    hello        --     1  72  768gb 24:00 Q    --

\\
Each column shows the following information.
^Column ^Description ^
|Job ID |ID assigned to the job |
|Username |user ID |
|Queue |queue name |
|Jobname |job name |
|SessID |session ID |
|NDS |the number of occupied nodes |
|TSK |the number of occupied CPU cores |
|Req'd Memory |requested memory size |
|Req'd Time |requested running time |
|S |status of the job\\ Q: being queued\\ R: running\\ E: exiting\\ H: being held |
|Elap Time |elapsed time |

\\
When the job is finished, the following files will be saved in the current directory.

  result.out
  result.err
  hello.o123456
  hello.e123456

\\
Files named "{jobname}.o{jobID}" and "{jobname}.e{jobID}" (here hello.o123456 and hello.e123456) are the standard output file and the error output file, respectively.\\
These files will be empty because standard output and error output are redirected into result.out and result.err.\\
When you run a program that writes a lot to standard output or error output, be sure to redirect the output to file(s).\\
If it is not redirected, the output is temporarily saved on the storage of the PBS system, which can affect the system's performance.\\
Please understand that we may have to cancel the job if this problem occurs.
\\
Let's see the result.

  $ cat result.out
  Hello world!
  Application 5598879 resources: utime ~0s, stime ~1s, Rss ~9980, inblocks ~0, outblocks ~0

\\
Congratulations! Now you can run your program on the MASAMUNE-IMR system.\\
You might want to use other services of MASAMUNE-IMR. The [[:user_manual|User manual]] provides more server-specific information.\\
If you want to use a pre-installed application, [[:application|Application list / Usage]] will help.\\
\\

------

[[getting_started:transfer_file|<< transferring files]] | [[:getting_started|Getting started]]

~~NOCACHE~~