
B. Fault Detection & Recovery Capabilities

B.1. Job Cancellation

Description:

  • A job can be cancelled for several reasons: for example, by the local resource management system when the job exceeds its wall time limit, or by the system administrator to preserve system performance.

Support in Last Release:

  • GW detects job cancellation when the job's exit code is not specified, and then requests migration.
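
The detection rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not GW's actual implementation: it assumes that a job wrapper writes a line such as `EXIT_STATUS=0` to its output when the job ends normally. The marker format and the names `parse_exit_code` and `next_action` are hypothetical.

```python
import re
from typing import Optional

# Hypothetical marker that a job wrapper would write on normal termination.
EXIT_MARKER = re.compile(r"^EXIT_STATUS=(-?\d+)\s*$", re.MULTILINE)


def parse_exit_code(wrapper_output: str) -> Optional[int]:
    """Return the reported exit code, or None if the job never reported one."""
    match = EXIT_MARKER.search(wrapper_output)
    return int(match.group(1)) if match else None


def next_action(wrapper_output: str) -> str:
    """Treat a missing exit code as a cancellation and ask for migration."""
    if parse_exit_code(wrapper_output) is None:
        return "migrate"
    return "done"
```

A job killed by the local resource manager never reaches the point where the exit code is written, so the absence of the marker is what separates a cancellation from a job that ran to completion with a non-zero exit code.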

B.2. Remote System Crash or Outage

Description:

  • Grid resources can fail unpredictably. These failures may involve hardware, the operating system, or Grid middleware components.

Support in Last Release:

  • GW detects a system crash when polling of the job fails, and then requests migration.
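
A polling loop of this kind might look like the sketch below. It is an illustration only: the `poll` callable, the `"DONE"` state name, and the `max_failures` and `interval` parameters are assumptions introduced here, and the source does not say how many failed polls GW tolerates before requesting migration.

```python
import time
from typing import Callable


def monitor(poll: Callable[[], str],
            max_failures: int = 3,
            interval: float = 30.0) -> str:
    """Poll a job until it finishes or polling fails repeatedly.

    Returns "done" when the job reports completion, or "migrate" after
    max_failures consecutive polling errors (an assumed threshold).
    """
    failures = 0
    while True:
        try:
            state = poll()
        except OSError:
            # Covers connection errors and timeouts raised by the poll call.
            failures += 1
            if failures >= max_failures:
                return "migrate"
        else:
            failures = 0  # a successful poll resets the counter
            if state == "DONE":
                return "done"
        time.sleep(interval)
```

From the scheduler's side, a crashed remote system and an unreachable one look the same: the poll fails. Counting consecutive failures rather than reacting to the first one is a design choice of this sketch that avoids migrating a job because of a single transient error.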

B.3. Network Disconnection

Description:

  • Grid connections can fail unpredictably. Moreover, system administrators are free to disconnect their resources, for example for local site maintenance.

Support in Last Release:

  • GW detects a network disconnection when polling of the job fails, and then requests migration.

B.4. Client Fault Tolerance

Description:

  • The system running the scheduler could fail.

Support in Last Release:

  • GW periodically saves its state so that it can recover from a local failure.
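
One common way to implement periodic state saving is an atomic checkpoint file, sketched below. The source does not describe GW's persistence format, so the JSON encoding, the file layout, and the names `save_state` and `load_state` are all assumptions made for illustration.

```python
import json
import os
import tempfile


def save_state(state: dict, path: str) -> None:
    """Write the state to a temporary file, then atomically replace the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())  # make sure the data reaches the disk
        os.replace(tmp_path, path)     # atomic rename over the previous checkpoint
    except BaseException:
        os.unlink(tmp_path)
        raise


def load_state(path: str) -> dict:
    """Return the last saved state, or an empty state if none exists yet."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {}
```

Writing to a temporary file and renaming it means that a crash in the middle of a save leaves the previous checkpoint intact, so on restart the scheduler reads either the old state or the new one, never a partial file.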