Managing your batch job

Once you have submitted your job(s) to the queue, you can check on them in several ways: e-mail notification of job completion, and command-line access to both your jobs' status and the current state of the pool.

Managing Job Status

You can monitor progress of your batch processing using the condor_status and condor_q commands. This section describes how to check the status of your processes at any time, and how to remove a process from the Condor queue.

After you submit a job for processing, you can check the status of the Condor machine pool and verify that machines are available to run your jobs.

To check the status of the Condor pool, type the command condor_status. This command returns information about the pool resources. Output lists each slot in the pool and whether it is in use, followed by summary totals. If there are no idle slots, your jobs wait in the queue after submission until a slot becomes free.

For example:

> condor_status

Name OpSys Arch State Activity LoadAv Mem ActvtyTime

vm1@mc-1-1.hm LINUX X86_64 Claimed Busy 1.060 1975 0+17:43:50
vm2@mc-1-1.hm LINUX X86_64 Claimed Busy 1.060 1975 0+17:43:48
vm1@mc-1-2.hm LINUX X86_64 Claimed Busy 1.000 1975 0+17:44:43
vm2@mc-1-2.hm LINUX X86_64 Claimed Busy 1.000 1975 0+17:44:36
vm1@mc-1-3.hm LINUX X86_64 Unclaimed Idle 0.010 1975 0+00:03:57
vm2@mc-1-3.hm LINUX X86_64 Unclaimed Idle 0.000 1975 0+00:00:04
vm1@mc-1-4.hm LINUX X86_64 Unclaimed Idle 0.000 1975 0+00:00:04

Total Owner Claimed Unclaimed Matched Preempting Backfill

X86_64/LINUX 7 0 4 3 0 0 0
Total 7 0 4 3 0 0 0
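
If you only want the number of free slots, you can filter the listing instead of reading it by eye. The following is a minimal sketch, not part of Condor: count_idle_slots is a hypothetical helper name, and it assumes the column layout shown above, where State is the fourth field and Activity is the fifth.

```shell
# Count slots reported as Unclaimed and Idle in condor_status output
# read from standard input.
# Assumption: State is field 4 and Activity is field 5, as in the
# sample listing above. Header and summary lines do not match.
count_idle_slots() {
    awk '$4 == "Unclaimed" && $5 == "Idle" { n++ } END { print n + 0 }'
}

# Typical use on a login server where Condor is installed:
#   condor_status | count_idle_slots
```

Run against the sample listing above, this would report 3.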

To check the cumulative use of resources within the Condor pool, include the option -submitter with the command condor_status. This command returns information about each user in the Condor queue. Output lists the user's name, machine in use, and current number of jobs per machine. Use this command to help determine how many resources Condor has available to run your jobs. An example is shown here:

> condor_status -submitter

Name Machine Running IdleJobs HeldJobs

mkellerm@hmdc.harvar w4.hmdc.ha 2 0 0
jgreiner@hmdc.harvar x1.hmdc.ha 9 0 0
jgreiner@hmdc.harvar x3.hmdc.ha 40 0 0
kquinn@hmdc.harvard. x5.hmdc.ha 32 0 0

RunningJobs IdleJobs HeldJobs

jgreiner@hmdc.harvar 49 0 0
kquinn@hmdc.harvard. 32 0 0
mkellerm@hmdc.harvar 2 0 0

Total 83 0 0
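
To pull a single user's total out of this listing, a small filter can help. This is a sketch under assumptions: running_jobs_for is a hypothetical helper name, and it relies on the layout shown above, where the per-user summary rows have exactly four fields and the name column starts with the user name followed by @.

```shell
# Print the RunningJobs total for one user from the summary table of
# "condor_status -submitter" output read from standard input.
# Assumption: summary rows have 4 fields (name, running, idle, held),
# while per-machine rows have 5 and are therefore skipped.
running_jobs_for() {
    awk -v user="$1" 'NF == 4 && index($1, user "@") == 1 { print $2 }'
}

# Typical use on a login server where Condor is installed:
#   condor_status -submitter | running_jobs_for "$USER"
```

Because names in the listing are truncated, the match is on the leading user name only. If the user has no jobs in the queue, nothing is printed.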

Cluster Status Summary

To view a summary of the resources available on the Condor cluster, run: rce-info.sh

To view a summary of the resources currently in use on the Condor cluster, run: rce-info.sh -t used

Removing your job

To remove a process from the queue, type the command condor_rm <cluster ID>.<process ID>. For example:

> condor_rm 9.9
Job 9.9 marked for removal

To list your jobs, type:

> condor_q $USER

To remove all jobs affiliated with a cluster, type the command condor_rm <cluster ID>. For example, the command condor_rm 4 removes all jobs assigned to cluster 4.

To remove all of your clusters' jobs from the Condor queue, type condor_rm -a. For example:

> condor_rm -a
All jobs marked for removal.

By default, jobs must be removed from the host they were submitted from.

When you run condor_q you may see multiple "Schedd" sections:

-- Schedd: HMDC.rce@rce6-1.hmdc.harvard.edu
-- Schedd: HMDC.rce@rce6-2.hmdc.harvard.edu
-- Schedd: HMDC.rce@rce6-3.hmdc.harvard.edu

Each of these sections represents a different RCE Login server.
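
To collect just the scheduler names from that output, you can filter on the section header lines. A minimal sketch, assuming each header has the form shown above, with the name as the third whitespace-separated field; list_schedds is a hypothetical helper name, not a Condor command.

```shell
# Print the scheduler name from every "-- Schedd:" header line in
# condor_q output read from standard input.
# Assumption: headers look like "-- Schedd: <name>", so the name is
# field 3; anything after the name on the same line is ignored.
list_schedds() {
    awk '$1 == "--" && $2 == "Schedd:" { print $3 }'
}

# Typical use on a login server where Condor is installed:
#   condor_q | list_schedds
```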

When you submit a job, the server you are logged in to is responsible for "scheduling" that job and keeping track of its status.

Each RCE Login server maintains this status separately, so when you want to remove a job, you must also specify the server where you started it.

The full syntax to remove a job is thus:

  condor_rm <cluster ID>[.<process ID>] -name <schedd_string>

For example:
  condor_rm 4806 -name "HMDC.rce@rce6-4.hmdc.harvard.edu"
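
If you remove jobs from several servers regularly, it can help to assemble the command from its two parts and review it before running it. The sketch below only prints the command and does not execute it. build_rm_command is a hypothetical helper name, not part of Condor.

```shell
# Print (do not run) the condor_rm command for a job and its scheduler.
#   $1: <cluster ID> or <cluster ID>.<process ID>
#   $2: scheduler string, as shown on the "-- Schedd:" line of condor_q
build_rm_command() {
    printf 'condor_rm %s -name "%s"\n' "$1" "$2"
}

# Prints: condor_rm 4806 -name "HMDC.rce@rce6-4.hmdc.harvard.edu"
build_rm_command 4806 "HMDC.rce@rce6-4.hmdc.harvard.edu"
```

Printing first gives you a chance to confirm the job ID and server name, since a removed job cannot be restored to the queue without resubmitting it.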