Troubleshooting Problems

The Condor central manager stops (evicts or preempts) a process for several reasons, including the following:

  • Another job or another user's job in the queue has a higher priority and preempts or evicts your job.

  • The pool machine on which your process is executed encounters an issue with the machine state or the machine policy.

  • You specified attributes in your submit file that cannot be processed without error.

Refer to the Condor manual for detailed information about submission, job status, and processing errors:

http://research.cs.wisc.edu/htcondor/manual/latest/2_Users_Manual.html

Note: A simple precaution can help you diagnose problems if you submit multiple jobs to Condor: specify unique file names for each job's output, history, error, and log files. If you do not specify unique file names for each submission, Condor overwrites existing files that have the same names, which can prevent you from locating information about problems that occur.
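One common convention (a sketch; the base file name myjob is hypothetical) is to use the $(Cluster) and $(Process) macros in the submit file so that every job writes to its own files:

Output = myjob.$(Cluster).$(Process).out
Error = myjob.$(Cluster).$(Process).err
Log = myjob.$(Cluster).$(Process).log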

Priorities and Preemption

Job priorities enable you to assign a priority level to each submitted Condor job. Job priorities, however, do not impact user priorities.
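For example, to raise the priority of one of your queued jobs, use the condor_prio command (the job ID 16.0 and the value 15 here are hypothetical; higher numerical values mean higher job priority, and job priority affects ordering only among your own jobs):

> condor_prio -p 15 16.0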

User priorities determine how Condor allocates resources among users. A lower numerical value for user priority means higher priority, so a user with priority 5 is allocated more resources than a user with priority 50. You can view user priorities by using the condor_userprio command. For example:

> condor_userprio -allusers
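The output resembles the following (the user names and values shown are illustrative, and the exact layout varies by Condor version):

Last Priority Update: 10/5 13:47
                          Effective
User Name                 Priority
------------------------  ---------
arose@hmdc.harvard.edu    0.50
sspade@hmdc.harvard.edu   12.34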

Condor continuously calculates each user's share of the available machines. For example, a user with a priority of 10 is allocated twice as many machines as a user with a priority of 20. New users begin with a priority of 0.5, and their priority rating rises in proportion to their usage relative to other users. Condor enforces fair sharing so that each user receives machines according to user priority and historical usage. For example, if a low-priority user is using all available machines and a higher-priority user submits a job, Condor immediately checkpoints and vacates the jobs that belong to the lower-priority user, except for that user's last job.
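To make the fair-share arithmetic concrete: in a pool of 60 machines shared by only those two users, the priority-10 user's share is 40 machines and the priority-20 user's share is 20, because allocation is inversely proportional to the priority value (1/10 : 1/20 reduces to 2 : 1).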

A user's priority rating decreases over time, returning to the baseline of 0.5 as jobs complete and the user accumulates idle time relative to other users.

Process Tracking

To track the progress of your processes:

  • Type condor_q to view the status of your processes, listed by ID.

  • Check your output directory for the time stamps of your output, log, and error files.

    If the output file and log file for a submitted process are more recent than the error file, your process is probably running without error; see the example after this list.
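For example, you can compare time stamps with a shell listing sorted by modification time (ls -lt is a generic shell command, not a Condor command; the directory and file names here are hypothetical):

> ls -lt outputdir
-rw-r--r-- 1 arose users 1024 Oct 4 11:03 myjob.9.1.out
-rw-r--r-- 1 arose users  512 Oct 4 11:03 myjob.9.1.log
-rw-r--r-- 1 arose users    0 Oct 4 11:02 myjob.9.1.err

The output and log files here are newer than the empty error file, which suggests that the process is running without error.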

Process Queue

To view detailed information about your processes, including the ClassAd requirements for your jobs, type the command condor_q -analyze.

Refer to the Condor Version 6.8.0 Manual for a description of the value that represents why a process was placed on hold or evicted. Go to the following URL for section 2.5, "Submitting a Job," and search for the text JobStatus under the heading "ClassAd Job Attributes":

http://www.cs.wisc.edu/condor/manual/v6.8.0/2_5Submitting_Job.html

For example:

> condor_q -analyze
Run analysis summary. Of 43 machines, 
43 are rejected by your job's requirements
0 are available to run your job
WARNING: Be advised:
No resources matched request's constraints
Check the Requirements expression below:
Requirements = ((Memory > 8192)) && (Disk >= DiskUsage)
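In this example, no machine in the pool offers more than 8192 MB of memory, so every machine is rejected. A sketch of a corrected submit file line (the 2048 MB threshold is hypothetical; choose a value that machines in your pool can satisfy):

Requirements = (Memory > 2048) && (Disk >= DiskUsage)

After editing the submit file, resubmit the job and run condor_q -analyze again to confirm that machines now match.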

Error Log

An error file includes information about any errors that occurred while your batch processing executed.

To view the error file for a process and determine where an error occurred, use the cat command. For example:

> cat errorfile
Error in readChar(con, 5) : cannot open the connection
In addition: Warning message:
cannot open compressed file 'Utilization1.RData'
Execution halted

History File

When batch processing completes, Condor removes the cluster from the queue and records information about the processes in the history file. History is displayed for each process on a single line. Information provided includes the following:

  • ID - The cluster and process IDs of the job

  • OWNER - The owner of the job

  • SUBMITTED - The month, day, hour, and minute at which the job was submitted to the queue

  • RUN_TIME - Remote wall clock time accumulated by the job to date, in days, hours, minutes, and seconds

  • ST - Completion status of the job, where C is completed and X is removed

  • COMPLETED - Time at which the job was completed

  • CMD - Name of the executable

To view information about processes that you executed on the Condor system, type the command condor_history. For example:

> condor_history
ID OWNER SUBMITTED RUN_TIME ST COMPLETED CMD
1.0 arose 9/26 11:45 0+00:00:00 C 9/26 11:45 /usr/bin/R --no
2.0 arose 9/26 11:48 0+00:00:01 C 9/26 11:48 /usr/bin/R --no
3.0 arose 9/26 11:49 0+00:00:00 C 9/26 11:50 /usr/bin/R --no
3.1 arose 9/26 11:49 0+00:00:01 C 9/26 11:50 /usr/bin/R --no
6.0 arose 10/3 15:52 0+00:00:00 C 10/3 15:52 /nfs/fs1/home/A
6.1 arose 10/3 15:52 0+00:00:00 C 10/3 15:52 /nfs/fs1/home/A
6.2 arose 10/3 15:52 0+00:00:00 C 10/3 15:52 /nfs/fs1/home/A
6.5 arose 10/3 15:52 0+00:00:00 C 10/3 15:52 /nfs/fs1/home/A
6.3 arose 10/3 15:52 0+00:00:00 C 10/3 15:52 /nfs/fs1/home/A
6.4 arose 10/3 15:52 0+00:00:00 C 10/3 15:52 /nfs/fs1/home/A
6.6 arose 10/3 15:52 0+00:00:01 C 10/3 15:52 /nfs/fs1/home/A
9.0 arose 10/4 11:02 0+00:00:00 C 10/4 11:02 /nfs/fs1/home/A
9.1 arose 10/4 11:02 0+00:00:01 C 10/4 11:02 /nfs/fs1/home/A
9.2 arose 10/4 11:02 0+00:00:00 X ??? /nfs/fs1/home/A
9.3 arose 10/4 11:02 0+00:00:00 X ??? /nfs/fs1/home/A
9.5 arose 10/4 11:02 0+00:00:00 X ??? /nfs/fs1/home/A
9.6 arose 10/4 11:02 0+00:00:00 X ??? /nfs/fs1/home/A
9.4 arose 10/4 11:02 0+00:00:00 X ??? /nfs/fs1/home/A

Search through the history file for your process and cluster IDs to locate information about your jobs.

To view information about all completed processes in a cluster, type the command condor_history <cluster ID>. To view information about one process, type the command condor_history <cluster ID>.<process ID>. For example:

> condor_history 9
ID OWNER SUBMITTED RUN_TIME ST COMPLETED CMD
9.0 arose 10/4 11:02 0+00:00:00 C 10/4 11:02 /nfs/fs1/home/A
9.1 arose 10/4 11:02 0+00:00:01 C 10/4 11:02 /nfs/fs1/home/A
9.2 arose 10/4 11:02 0+00:00:00 X ??? /nfs/fs1/home/A
9.3 arose 10/4 11:02 0+00:00:00 X ??? /nfs/fs1/home/A
9.5 arose 10/4 11:02 0+00:00:00 X ??? /nfs/fs1/home/A
9.6 arose 10/4 11:02 0+00:00:00 X ??? /nfs/fs1/home/A
9.4 arose 10/4 11:02 0+00:00:00 X ??? /nfs/fs1/home/A
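Similarly, to view just one of these processes (this example reuses job 9.1 from the listing above):

> condor_history 9.1
ID OWNER SUBMITTED RUN_TIME ST COMPLETED CMD
9.1 arose 10/4 11:02 0+00:00:01 C 10/4 11:02 /nfs/fs1/home/A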

Process Log File

A log file includes information about everything that occurred during your cluster processing: when it was submitted, when execution began and ended, when a process was restarted, and whether there were any issues. When processing finishes, the exit conditions for that process are noted in the log file.

Refer to the Condor Manual for a description of the entries in the process log file. Go to the following URL for section 2.6, "Managing a Job," and go to subsection 2.6.6, "In the log file":

http://research.cs.wisc.edu/htcondor/manual/latest/2_6Managing_Job.html

To view the log file for a process and determine where an error occurred, use the cat command. For example, the following log file indicates that the process completed normally:

> cat log.1
000 (012.001.000) 10/04 12:14:51 Job submitted from host: <10.0.0.47:60603>
...
001 (012.001.000) 10/04 12:15:00 Job executing on host: <10.0.0.61:37097>
...
005 (012.001.000) 10/04 12:15:00 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
    7 - Run Bytes Sent By Job
    163 - Run Bytes Received By Job
    7 - Total Bytes Sent By Job
    163 - Total Bytes Received By Job
...

Following is an example log file for a process that did not complete execution:

> cat log.4
000 (009.000.000) 09/20 14:47:31 Job submitted from host:
<x1.hmdc.harvard.edu>
...
007 (009.000.000) 09/20 15:02:10 Shadow exception!
    Error from starter on x1.hmdc.harvard.edu: Failed
    to open 'scratch.1/frieda/workspace/v67/condor-
    test/test3/run_0/b.input' as standard input: No such
    file or directory (errno 2)
    0 - Run Bytes Sent By Job
    0 - Run Bytes Received By Job
...
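When a cluster produces many log files, a generic shell search can identify which ones recorded an error event (grep is not a Condor command, and the log.* file name pattern assumes the log names used above):

> grep -l "Shadow exception" log.*
log.4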

Held Process

To view information about processes that Condor placed on hold, type condor_q -hold. For example:

> condor_q -hold

-- Submitter: vnc.hmdc.harvard.edu : <10.0.0.47:60603> : vnc.hmdc.harvard.edu
 ID OWNER HELD_SINCE HOLD_REASON
 17.0 arose 10/5 12:53 via condor_hold (by user arose)
 17.1 arose 10/5 12:53 via condor_hold (by user arose)
 17.2 arose 10/5 12:53 via condor_hold (by user arose)
 17.3 arose 10/5 12:53 via condor_hold (by user arose)
 17.4 arose 10/5 12:53 via condor_hold (by user arose)
 17.5 arose 10/5 12:53 via condor_hold (by user arose)
 17.6 arose 10/5 12:53 via condor_hold (by user arose)
 17.7 arose 10/5 12:53 via condor_hold (by user arose)
 17.9 arose 10/5 12:53 via condor_hold (by user arose)

9 jobs; 0 idle, 0 running, 9 held

Refer to the Condor Manual for a description of the value that represents why a process was placed on hold. Go to the following URL for section 2.5, "Submitting a Job," and look for subsection 2.5.2.2, "ClassAd Job Attributes." Look for the entry HoldReasonCode:

http://research.cs.wisc.edu/htcondor/manual/latest/2_5Submitting_Job.html

To place a process on hold, type the command condor_hold <cluster ID>.<process ID>. For example:

> condor_hold 8.33
Job 8.33 held

To place on hold all uncompleted processes in a cluster, type condor_hold <cluster ID>. For example:

> condor_hold 8
Cluster 8 held.

The status of those uncompleted processes in cluster 8 is now H (on hold):

> condor_q

-- Submitter: vnc.hmdc.harvard.edu : <10.0.0.47:60603> : vnc.hmdc.harvard.edu
 ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
 8.2 sspade 10/4 11:19 0+00:00:00 H 0 9.8 dwarves.pl
 8.5 sspade 10/4 11:19 0+00:00:00 H 0 9.8 dwarves.pl
 8.6 sspade 10/4 11:19 0+00:00:00 H 0 9.8 dwarves.pl

3 jobs; 0 idle, 0 running, 3 held

To release a process from hold, type the command condor_release <cluster ID>.<process ID>. For example:

> condor_release 8.33
Job 8.33 released.

To release the full cluster from hold, type the command condor_release <cluster ID>. For example:

> condor_release 8
Cluster 8 released.

You can instruct the Condor system to place your batch processing on hold if it spends a specified amount of time suspended (that is, not processing). For example, include the following attribute in your submit file to place your jobs on hold if they spend more than 50 percent of their time suspended:

Periodic_hold = CumulativeSuspensionTime > (RemoteWallClockTime / 2.0)
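A related submit file command, Periodic_release, can automatically release held jobs. A minimal sketch, assuming you want jobs released after they have been held for 30 minutes (the threshold is hypothetical; EnteredCurrentStatus is the job ClassAd attribute recording when the job entered its current state):

Periodic_release = (CurrentTime - EnteredCurrentStatus) > (30 * 60)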