RCE Powered job failures

February 27, 2014

During our scheduled LDAP outage this morning (2/27), we discovered that some RCE Powered jobs were killed due to a failure to look up user identities.  Our Condor resource manager was configured to let jobs continue even when a user lookup fails, but it failed jobs anyway when it was unable to look up user information.  This appears to be a bug in the system.

We apologize for any lost work.  We will be putting a moratorium on any updates and changes to the system that have even a remote chance of causing issues in the RCE.  RCE users expect our system to be stable and available for their important work.  We will schedule any work well in advance and announce outages ahead of time so users can plan accordingly.

Again, my apologies for the recent instability in the RCE.

Wes