GridWay requires that the environment variables GLOBUS_LOCATION and GW_LOCATION are set to the base directories of your Globus and GridWay installations, respectively. In GT 4.2, GridWay is installed in the same place as Globus, so you can set both variables to the same location.
Important |
---|
Note that the GridWay daemon SHOULD NOT be run as root. Only part of the installation will require privileged access. |
Log in as root and follow these steps:
...
# User alias specification
...
Runas_Alias GW_USERS = %<gwgroup>
...
# GridWay entries
globus ALL=(GW_USERS) NOPASSWD: /home/gwadmin/gw/bin/gw_em_mad_prews *
globus ALL=(GW_USERS) NOPASSWD: /home/gwadmin/gw/bin/gw_em_mad_ws *
globus ALL=(GW_USERS) NOPASSWD: /home/gwadmin/gw/bin/gw_tm_mad_ftp *
Usually, sudo clears all environment variables for security reasons. However, the MADs need the GW_LOCATION and GLOBUS_LOCATION variables to be set. To preserve these variables in the MAD environment, add the following line to your “sudoers” file:
Defaults>GW_USERS env_keep="GW_LOCATION GLOBUS_LOCATION"
The following line must not appear in the sudoers file; otherwise, GridWay cannot use sudo, since it would require a tty:
Defaults requiretty
Please refer to the sudo manual page for more information. To test the sudo configuration, try to execute a MAD as a user in the “<gwgroup>” group, for example:
$ sudo -u <gw_user> /home/gwadmin/gw/bin/gw_em_mad_prews
The configuration files for GridWay are read from the following locations:
Options are defined one per line, with the following format:
<option> = [value]
If the value is missing, the option falls back to its default. Blank lines and any characters after a '#' are ignored. Note: job template options can use job or host variables to define their values; these variables are substituted at runtime with their corresponding values (see GridWay 5.4: User's Guide).
The GridWay daemon (GWD) configuration options are defined in “$GW_LOCATION/etc/gridway/gwd.conf”. The table below summarizes the configuration file options, their description and default values. Note that blank lines and any character after a '#' are ignored.
Table 2. GWD Configuration File Options.
Option | Description | Default |
---|---|---|
Connection Options | ||
GWD_PORT | TCP/IP Port where GWD will listen for client requests. If this port is in use, GWD will try to bind to the next port until it finds a free one. The TCP/IP port used by GWD can be found in “$GW_LOCATION/var/gridway/gwd.port” | 6725 |
MAX_NUMBER_OF_CLIENTS | Maximum number of simultaneous client connections. Note that only blocking client requests keep their connections open. | 20 |
Pool Options | ||
NUMBER_OF_JOBS | The maximum number of jobs that will be handled by the GridWay system | 200 |
NUMBER_OF_ARRAYS | The maximum number of array-jobs that will be handled by the GridWay system | 20 |
NUMBER_OF_HOSTS | The maximum number of hosts that will be handled by the GridWay system | 100 |
NUMBER_OF_USERS | The maximum number of different users in the GridWay system | 30 |
Intervals | ||
SCHEDULING_INTERVAL | Period (in seconds) between two scheduling actions | 30 |
DISCOVERY_INTERVAL | How often (in seconds) the information manager searches the Grid for new hosts | 300 |
MONITORING_INTERVAL | How often (in seconds) the information manager updates the information of each host | 120 |
POLL_INTERVAL | How often (in seconds) the underlying grid middleware is queried about the state of a job. | 60 |
Middleware Access Driver (MAD) Options | ||
IM_MAD | Information Manager MADs, see Section “Information Driver Configuration” | - |
TM_MAD | Transfer Manager MADs, see Section “File Transfer Driver Configuration” | - |
EM_MAD | Execution Manager MADs, see Section “Execution Driver Configuration” | - |
MAX_ACTIVE_IM_QUERIES | Maximum number (soft limit) of active IM queries (each query spawns one process) | 4 |
Scheduler Options | ||
DM_SCHED | Scheduling module, see Section “Scheduler Configuration” | - |
Here is an example of a GWD configuration file:
#--------------------------------
# Example: GWD Configuration File
#--------------------------------

GWD_PORT              = 6725
MAX_NUMBER_OF_CLIENTS = 20

NUMBER_OF_ARRAYS = 20
NUMBER_OF_JOBS   = 200
NUMBER_OF_HOSTS  = 100
NUMBER_OF_USERS  = 30

JOBS_PER_SCHED = 10
JOBS_PER_HOST  = 10
JOBS_PER_USER  = 30

SCHEDULING_INTERVAL = 30
DISCOVERY_INTERVAL  = 300
MONITORING_INTERVAL = 120
POLL_INTERVAL       = 60

IM_MAD = mds4:gw_im_mad_mds4:-s hydrus.dacya.ucm.es:gridftp:ws
TM_MAD = gridftp:gw_tm_mad_ftp:
EM_MAD = ws:gw_em_mad_ws::rsl2

DM_SCHED = flood:gw_flood_scheduler:-h 10 -u 30 -c 5
Default values for every job template option can be set in “$GW_LOCATION/etc/gridway/job_template.default”. You can use this file to set the value of advanced job configuration options and use them for all your jobs. Note that the values set in a job template file override those defined in “job_template.default”. See GridWay 5.4: User's Guide for a detailed description of each job option.
GridWay reporting and accounting facilities provide information about overall performance and help troubleshoot configuration problems. GWD generates the following logs under the “$GW_LOCATION/var” directory:
001;TIMESTAMP;JOBID;STATE;EXIT_CODE
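A record in this format can be split on the ';' separator. As an illustration (this helper and the sample record are not part of GridWay; the field interpretations are assumptions):

```python
# Illustrative parser for an accounting record of the form
#   001;TIMESTAMP;JOBID;STATE;EXIT_CODE
# This helper is not part of GridWay; the sample record below is made up,
# and the timestamp is assumed to be epoch seconds.
def parse_record(line):
    code, timestamp, job_id, state, exit_code = line.strip().split(";")
    return {
        "code": code,
        "timestamp": int(timestamp),
        "job_id": int(job_id),
        "state": state,
        "exit_code": exit_code,
    }

record = parse_record("001;1215013520;23;done;0")
print(record["job_id"], record["state"])
```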
Since GridWay 4.9, when you start the daemon, gwd tries to recover its previous state. That is, every submitted job is stored in a persistent pool, so in case of a gwd (or client machine) crash these jobs are recovered. This includes, for jobs in the wrapper state, contacting the remote jobmanager.
Recovery actions are performed by default; if you do not want to recover the previously submitted jobs, use the -c option.
For example, to start gwd in multi-user mode and clear its previous state, use:
$ gwd -c -m
Grid scheduling consists of finding a suitable (in terms of a given target) assignment between a computational workload (jobs) and computational resources. The scheduling problem has been thoroughly studied in the past and efficient algorithms have been devised for different computing platforms. Although some of the experience gained in scheduling can be applied to the Grid, it presents some characteristics that differ dramatically from classical computing platforms (i.e. clusters or MPPs), namely: different administration domains, limited control over resources, heterogeneity and dynamism.
Grid scheduling is an active research area. The Grid scheduling problem is better understood today and several heuristics, performance models and algorithms have been proposed and evaluated with the aid of simulation tools. However, current working Grid schedulers are only based on match-making, and barely consider multi-user environments.
In this section, we describe the state-of-the-art scheduling policies implemented in the GridWay system. The contents of this guide reflect the experience obtained since GridWay version 4, and a strong feedback from the GridWay user community.
The scheduler is responsible for assigning jobs to Grid resources; therefore, it decides when and where to run a job. These decisions are made periodically in an endless loop. The frequency of the scheduler interventions can be adjusted with the “SCHEDULING_INTERVAL” configuration parameter (see Section “GridWay Daemon (GWD) Configuration”).
In order to make job-to-resource assignments, the scheduler receives information from the following sources (see Figure 5, “Job Scheduling in GridWay”):
The information gathered from the previous sources is combined with a given scheduling policy to prioritize jobs and resources. Then, the scheduler dispatches the highest priority job to the best resource for it. The process continues until all jobs are dispatched, and those that could not be assigned wait for the next scheduling interval.
Figure 5. Job Scheduling in GridWay
A scheduling policy is used to assign a dispatch priority to each job and a suitability priority to each resource. Therefore, a Grid scheduling policy comprises two components:
These two top-level policies can be combined to implement a wide range of scheduling schemes (see Figure 6, “Job and resource prioritization policies in GridWay”). These scheduling policies are described in the following sections.
Figure 6. Job and resource prioritization policies in GridWay.
The job prioritization policies allow Grid administrators to influence the dispatch order of the jobs, that is, to decide which job is sent to the Grid. Traditionally, DRMS implement different policies based on the owner of the job, the resources consumed by each user or the requirements of the job. Some of these scheduling strategies can be directly applied in a Grid, while others must be adapted because of their unique characteristics: dynamism, heterogeneity, high fault rate and site autonomy.
This policy assigns a fixed priority to each job. The fixed priority ranges from 00 (lowest priority) to 19 (highest priority), so jobs with a higher priority will be dispatched first. The default priority values are assigned, by the Grid administrator, using the following criteria:
The user priority prevails over the group priority. There is also a special user (“DEFAULT”) that defines the default priority value when no other criteria apply.
Users can set the priority of their own jobs (gwsubmit -p), but without exceeding the limit set by the administrator in the scheduler configuration file.
Here is an example configuration for the fixed priority (see also Section “Built-in Scheduler Configuration File”):
# Weight for the Fixed priority policy
FP_WEIGHT = 1

# Fixed priority values for David's and Alice's jobs
FP_USER[david] = 2
FP_USER[alice] = 12

# Fixed priority for everybody in the staff group
FP_GROUP[staff] = 5

# Anyone else gets a default priority 3
FP_USER[DEFAULT] = 3
The Grid administrator can also set the fixed priority of a job to 20. When a job gets a fixed priority of 20, it becomes an urgent job. Urgent jobs are dispatched as soon as possible, bypassing all the scheduling policies.
The fair-share policy allows you to establish a dispatching ratio among the users of a scheduling domain. For example, a fair-share policy could establish that jobs from David and Alice must be dispatched to the Grid in a 2:5 ratio. In this case, the scheduler tracks the jobs submitted to the Grid by these two users and dispatches the jobs so they target a 2:5 ratio of job submissions.
This policy resembles the well-known fair-share of traditional LRMS. However, note that what GridWay users share is the ability to submit a job to the Grid and not resource usage. Resource usage share cannot be imposed at a Grid level, as Grid resources are shared with other Grid users and with local users from the remote organization. In addition, the set of resources that can be potentially used by each user is not homogeneous, as each user may belong to a different VO.
GridWay tracks the jobs submitted to the Grid by the users over time. Grid administrators can specify a timeframe over which user submissions are evaluated. The amount of time considered by GridWay is defined by a number of time intervals (“SH_WINDOW_DEPTH”) and the duration of each one (“SH_WINDOW_SIZE”, in days). The effective number of submissions in a given window is exponentially damped, so present events become more relevant.
Here is an example configuration for the share policy (see also Section “Built-in Scheduler Configuration File”):
# Weight for the Fair-share policy
SH_WEIGHT = 1

# Share values for David's and Alice's submissions
SH_USER[david] = 2
SH_USER[alice] = 5

# Anyone else gets a default share value of 1
SH_USER[DEFAULT] = 1

# Consider submissions in the last 5 days
SH_WINDOW_SIZE = 1
SH_WINDOW_DEPTH = 5
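The effect of the exponential damping on submission counts can be sketched in Python. The damping factor below (0.5 per window) is an assumption for illustration only; GridWay does not expose the constant it uses internally:

```python
# Sketch of the fair-share "effective submissions" computation.
# per_window_counts[0] is the most recent window; each window spans
# SH_WINDOW_SIZE days, and there are SH_WINDOW_DEPTH windows in total.
# The damping factor of 0.5 per window is an assumed value.
def effective_submissions(per_window_counts, damping=0.5):
    return sum(count * damping ** age
               for age, count in enumerate(per_window_counts))

# David's submissions over the last 5 windows, most recent first:
print(effective_submissions([10, 4, 0, 2, 6]))  # older windows count less
```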
The goal of this policy is to prevent low-priority jobs from starving: jobs that stay in the pending state long enough will eventually be submitted to the Grid. This policy can be found in most DRMS today. In GridWay, the priority of a job increases linearly with its waiting time.
Here is an example configuration for this policy:
# Weight for the Waiting-time policy
WT_WEIGHT = 1
GridWay includes support for specifying deadlines at job submission. The scheduler will increase the priority of a job as its deadline approaches.
Important |
---|
Note that this policy does not guarantee that a job is completed before the deadline. |
Grid administrators qualify the time remaining until a job's deadline by defining when a job should get half of the maximum priority assigned by this policy (“DL_HALF”, in days).
Here is an example configuration for the deadline policy (see also Section “Built-in Scheduler Configuration File”):
# Weight of the Deadline policy
DL_WEIGHT = 1

# Assign half of the priority two days before the deadline
DL_HALF = 2
The list of all pending jobs is sorted by the dispatch priority, which is computed as a weighted sum of the contribution from the previous policies. In this way, the Grid administrator can implement different scheduling schemes by adjusting the policy weights.
The dispatch priority (P) of a job (j) is therefore computed as:
Pj = Σi wi pij , where i={fixed, share, wait-time, deadline}
where wi is the weight for each policy (integer value) and pij is the priority (normalized) contribution from each policy.
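For instance, with all four weights set to 1 (as in the configuration examples above) and some made-up normalized contributions, the dispatch priority works out as follows:

```python
# Worked example of the dispatch priority P_j = sum_i w_i * p_ij.
# The weights follow the sched.conf examples in this chapter; the
# normalized per-policy contributions p_ij (in [0,1]) are made-up values.
weights = {"fixed": 1, "share": 1, "wait-time": 1, "deadline": 1}
contrib = {"fixed": 0.6, "share": 0.4, "wait-time": 0.1, "deadline": 0.9}

priority = sum(weights[i] * contrib[i] for i in weights)
print(priority)  # 2.0
```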
The resource prioritization policies allow Grid administrators to influence the usage of resources made by the users, that is, decide where to run a job. Usually, in classical DRMS, this resource usage is administered by means of the queue concept.
In GridWay, the scheduler builds a meta-queue (a queue consisting of the local queues of the Grid resources) for each job based on its requirements (e.g., operating system or architecture). Note that this meta-queue is not only built in terms of resource properties but is also based upon the owner of the job (as each Grid user may belong to a different VO with its own access rights and usage privileges).
The meta-queue for a job consists of the queues of those resources that meet the job requirements specified in the job template and have at least one free slot. By default, this queue is sorted in a first-discovered first-used fashion. This order can be influenced by means of the subsequent resource prioritization policies.
Usually, GridWay is configured with several Information Managers (IM). Grid administrators can prioritize resources based upon the IM that discovered the resource. Grid administrators can also assign priorities to individual resources. For example, a fixed priority policy can specify that resources from the intranet (managed by an IM driver tagged intranet) should always be used before resources from other sites (managed by an IM driver tagged grid).
The priority of a resource ranges from 01 (lowest priority) to 99 (highest priority), so resources with a higher priority will be used first. Grid administrators can also prioritize individual resources based on business decisions.
When a resource gets the priority value 00, it becomes a banned resource, and will not be used for any job. So Grid administrators can virtually unplug resources from their scheduling domain.
Example configuration for the resource Fixed Priority Policy:
# Weight for the Resource fixed priority policy
RP_WEIGHT = 1

# Fixed priority values for specific resources
RP_HOST[my.cluster.com] = 12
RP_HOST[slow.machine.com] = 02

# Fixed priority for every resource in the intranet
RP_IM[intranet] = 65

# Fixed priority for every resource discovered by the grid IM
RP_IM[grid] = 05

# Anyone else gets a default priority 01 (i.e. any other IM)
RP_IM[DEFAULT] = 01
The goal of this policy is to prioritize the resources most suitable for the job, from the job's own point of view. For example, the rank policy for a job can state that resources with faster CPUs should be used first. This policy is configured through the “RANK” attribute in the job template; please refer to GridWay 5.4: User's Guide.
Example configuration for the Rank policy:
# Weight of the Rank policy
RA_WEIGHT = 1
This policy reflects the behavior of Grid resources based on job execution statistics. So, crucial performance variables, like the average queue wait time or network transfer times, are considered when scheduling a job. This policy is derived from the sum of two contributions: history and current.
These values are used to compute an estimated execution time of a job on a given resource for a given user:
T = ( 1 - w )( Th_exe + Th_xfr + Th_que ) + w ( Tc_exe + Tc_xfr + Tc_que )
where Tc are the execution statistics of the last job (execution, transfer and queue wait-time), Th are the execution statistics based on the history data; and w is the history ratio. Those resources with a lower estimated time are used first to execute a job.
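As a worked example, with w = 0.25 (the UG_HISTORY_RATIO value used later in this chapter) and some made-up timing statistics, in seconds:

```python
# Worked example of the Usage policy estimate
#   T = (1 - w)(Th_exe + Th_xfr + Th_que) + w (Tc_exe + Tc_xfr + Tc_que)
# w is the history ratio (UG_HISTORY_RATIO); the timing figures below
# are made-up values for illustration.
w = 0.25
history = (120, 30, 300)   # Th: execution, transfer, queue times (history)
current = (90, 20, 600)    # Tc: statistics of the last job

T = (1 - w) * sum(history) + w * sum(current)
print(T)  # 515.0 — resources with a lower T are used first
```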
The Usage policy can be configured with:
Example configuration for Usage policy:
# Weight of the Usage policy
UG_WEIGHT = 1

# Number of days in the history window
UG_HISTORY_WINDOW = 3

# Accounting database to last execution ratio
UG_HISTORY_RATIO = 0.25
When a resource fails, GridWay applies an exponential back-off strategy at the resource level (and per user), so that resources with persistent failures are discarded (for a given user).
In particular, when a failure occurs a resource is banned for T seconds:
T = T∞ ( 1 - e^(-Δt/C) )
where T∞ is the maximum time that a resource can be banned, Δt is the time since last failure, and C is a constant that determines how fast the T∞ limit is reached.
The failure rate policy can be configured with the following parameters:
Example configuration for the Failure Rate policy:
# Maximum time that a resource will not be used, in seconds
FR_MAX_BANNED_TIME = 3600

# Exponential constant
FR_BANNED_C = 650
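With these configuration values, the banned time can be evaluated as follows (a sketch; the Δt values are made up):

```python
import math

# Sketch of the banned-time formula T = T_inf * (1 - e^(-dt/C)),
# using FR_MAX_BANNED_TIME = 3600 and FR_BANNED_C = 650 from the
# example above; the dt values below are made up.
T_INF = 3600   # maximum banned time, in seconds
C = 650        # exponential constant

def banned_time(dt):
    """Banned time for a resource, where dt is the time since the last failure."""
    return T_INF * (1 - math.exp(-dt / C))

print(round(banned_time(60)))   # small dt: well below the T_inf limit
print(banned_time(10 ** 9))     # as dt grows, T approaches T_inf (3600)
```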
The list of all candidate resources is sorted by the suitability priority, which is computed as a weighted sum of the contributions from the previous policies. The suitability priority of a resource (h) is therefore computed as:
Ph = Σi wi pih , where i={fixed, usage, rank}
where wi is the weight for each policy (integer value) and pih is the priority (normalized) contribution from each policy.
Also, the scheduler can migrate running jobs in the following situations:
See Section “GridWay Daemon (GWD) Configuration” and GridWay 5.4: User's Guide, for information on configuring these policies.
The built-in scheduler configuration options are defined in “$GW_LOCATION/etc/sched.conf”. The table below summarizes the configuration file options, their description and default values. Note that blank lines and any character after a '#' are ignored.
Table 3. Built-in Scheduler Configuration File Options.
Option | Description | Default |
---|---|---|
Job Scheduling Policies. Pending jobs are prioritized according to four policies: fixed (FP), share (SH), deadline (DL) and waiting-time (WT). The dispatch priority of a job is computed as a weighted sum of the contribution of each policy (normalized to one). | ||
DISPATCH_CHUNK | The maximum number of jobs that will be dispatched for each scheduling action | 15 (0 to dispatch as many jobs as possible) |
MAX_RUNNING_USER | The maximum number of simultaneous running jobs per user. | 30 (0 for no per-user limit) |
Fixed Priority (FP) Policy: Assigns a fixed priority to each job | ||
FP_WEIGHT | Weight for the policy (real numbers allowed). | 1 |
FP_USER[<username>] | Priority for jobs owned by <username>. Use the special username DEFAULT to set default priorities. Priority range [0,19] | |
FP_GROUP[<groupname>] | Priority for jobs owned by users in group <groupname>. Priority range [0,19] | |
Share (SH) Policy: Allows you to establish a dispatch ratio among users. | ||
SH_WEIGHT | Weight for the policy (real numbers allowed). | |
SH_USER[<username>] | Share for jobs owned by <username>. Use the special username DEFAULT to set default shares. | |
SH_WINDOW_DEPTH | Number of intervals (windows) to “remember” each user's dispatching history. The submissions of each window are exponentially “forgotten”. | 5, the maximum value is 10. |
SH_WINDOW_SIZE | The size of each interval in days (real numbers allowed). | 1 |
Waiting-time (WT) Policy: The priority of a job is increased linearly with the waiting time to prevent job starvation | ||
WT_WEIGHT | Weight for the policy (real numbers allowed). | 0 |
Deadline (DL) Policy: The priority of a job is increased exponentially as its deadline approaches. | ||
DL_WEIGHT | Weight for the policy (real numbers allowed). | 1 |
DL_HALF | Number of days before the deadline when the job should get half of the maximum priority. | 1 |
Resource Scheduling Policies. The resource policies allow Grid administrators to influence the usage of resources made by the users, according to four policies: fixed (RP), rank (RA), failure rate (FR), and usage (UG). The suitability priority of a resource is computed as a weighted sum of the contribution of each policy (normalized to one). | ||
MAX_RUNNING_RESOURCE | The maximum number of jobs that the scheduler submits to a given resource | 10 |
Fixed Priority (RP) Policy: Assigns a fixed priority (range [01,99]) to each resource | ||
RP_WEIGHT | Weight for the policy (real numbers allowed). | 1 (real numbers allowed) |
RP_HOST[<FQDN>] | Priority for resource <FQDN>. Those resources with priority 00 WILL NOT be used to dispatch jobs. | |
RP_IM[<im_tag>] | Priority for ALL resources discovered by the IM <im_tag> (as set in “gwd.conf”, see Section “GridWay Daemon (GWD) Configuration”). Use the special tag “DEFAULT” to set default priorities for resources. | |
Usage (UG) Policy: Resources are prioritized based on the estimated execution time of a job (on each resource). | ||
UG_WEIGHT | Weight for the policy (real numbers allowed). | 1 (real numbers allowed) |
UG_HISTORY_WINDOW | Number of days used to compute the history contribution. | 3 (real numbers allowed) |
UG_HISTORY_RATIO | Weight to compute the estimated execution time on a given resource. | 0.25 |
Rank (RA) Policy: Prioritize resources based on their RANK (as defined in the job template) | ||
RA_WEIGHT | Weight for the policy. | 0 (real numbers allowed) |
Failure Rate (FR) Policy. Resources with persistent failures are banned | ||
FR_MAX_BANNED | The maximum time a resource is banned, in seconds. Use 0 TO DISABLE this policy. | 3600 |
FR_BANNED_C | Exponential constant to compute banned time | 650 |
GridWay uses an external and selectable scheduler module to schedule jobs. The following schedulers are distributed with GridWay:
Important |
---|
The flood (user round-robin) scheduler is included as an example, and should not be used in production environments. |
The schedulers are configured with the “DM_SCHED” option in the “gwd.conf” file, with the format:
DM_SCHED = <sched_name>:<path_to_sched>:[args]
where:
By default, GridWay is configured to use the built-in policy engine described in the previous sections. If for any reason you need to recover this configuration, add the following line to “$GW_LOCATION/etc/gwd.conf”:
DM_SCHED = builtin:gw_sched:
Do not forget to adjust the scheduler policies to your needs by editing the “$GW_LOCATION/etc/sched.conf” file.
To configure the round-robin/flood scheduler, first disable the built-in policy engine in the “$GW_LOCATION/etc/sched.conf” configuration file by adding the following line:
DISABLE = yes
Then add the following line to “$GW_LOCATION/etc/gwd.conf”:
DM_SCHED = flood:gw_flood_scheduler:-h 10 -u 30 -c 5 -s 15
where:
GridWay uses several Middleware Access Drivers (MAD) to interface with different Grid services. The following MADs are part of the GridWay distribution:
These drivers are configured and selected via the GWD configuration interface described in Section “GridWay Daemon (GWD) Configuration”. Additionally, you may need to configure your environment (see Chapter 4, Testing) in order to successfully load the MADs into the GWD core. To do so, you can also use the global and per-user environment configuration files (“gwrc”).
There is one global configuration file and a per-user configuration file that can be used to set environment variables for the MADs. These files are standard shell scripts that are sourced into the MAD environment before it is loaded. They can be used, for example, to set the variable “X509_USER_PROXY” so the proxy can be located somewhere other than the standard place (“/tmp/x509_u<uid>”). Other variables can be set, and you can even source other shell scripts. For instance, you can prepare a different Globus environment for the MADs of some users, like this:
X509_USER_PROXY=$HOME/.globus/proxy.pem
GLOBUS_LOCATION=/opt/globus-4.0
. $GLOBUS_LOCATION/etc/globus-user-env.sh
The file for global MAD environment configuration is “$GW_LOCATION/etc/gridway/gwrc” and the user specific one is “$HOME/.gwrc”.
You have to take into account a couple of things:
if [ -d /opt/globus-devel ]; then
    export GLOBUS_LOCATION=/opt/globus-devel
fi
The Execution Driver interfaces with Grid Execution Services and is responsible for low-level job execution and management. The GridWay distribution includes the following Execution MADs:
Note that the use of these MADs requires a valid proxy.
Execution MADs are configured with the “EM_MAD” option in the “$GW_LOCATION/etc/gwd.conf” file, with the following format:
EM_MAD = <mad_name>:<path_to_mad>:<args>:<rsl|rsl_nsh|rsl2>
where:
For example, the following line will configure GridWay to use the Execution Driver gw_em_mad_prews using RSL syntax with name prews:
EM_MAD = prews:gw_em_mad_prews::rsl
To use WS-GRAM services, you can include the following line in your “$GW_LOCATION/etc/gwd.conf” file:
EM_MAD = ws:gw_em_mad_ws::rsl2
Note |
---|
You can simultaneously use as many Execution Drivers as you need (up to 10). So GridWay allows you to simultaneously use pre-WS and WS Globus Services. |
It is now possible to specify a gatekeeper port other than the standard one (8443) in the Web Service driver. The line configuring EM MADs in “gwd.conf” has changed so that parameters can be added to it. The parameter to change the port is “-p”, followed by the port number. For example:
EM_MAD = osg_ws:gw_em_mad_ws:-p 9443:rsl2
This line tells the EM MAD to use port 9443 to connect to the GT4 Gatekeeper.
The File Transfer Driver interfaces with Grid Data Management Services and is responsible for file staging, remote working directory set-up and remote host clean up. The GridWay distribution includes:
The use of this driver requires a valid proxy.
File Transfer Managers are configured with the TM_MAD option in the “gwd.conf” file, with the format:
TM_MAD = <mad_name>:<path_to_mad>:[arg]
where:
To configure the Transfer Driver, add a line to “$GW_LOCATION/etc/gwd.conf”, with the following format:
TM_MAD = <mad_name>:<path_to_mad>:[arguments]
The GridFTP driver does not require any command line arguments. So to configure the driver, add the following line to “$GW_LOCATION/etc/gwd.conf”:
TM_MAD = gridftp:gw_tm_mad_ftp:
The name of the driver will later be used to specify the transfer mechanism for a Grid resource.
The Dummy driver should be used with those resources (clusters) which do not have a shared home. In this case, transfer and execution are performed as follows:
The following servers can be configured to access files on the client machine:
The Dummy driver behavior is specified with the following command line arguments:
Sample configuration to use a GridFTP server:
TM_MAD = dummy:gw_tm_mad_dummy:-u gsiftp\://hostname
Important |
---|
You MUST escape the colon character in gsiftp URL. Also, “hostname” should be the host running the GridWay instance. |
Sample configuration to use GASS servers:
TM_MAD = dummy:gw_tm_mad_dummy:-g
The Information Driver interfaces with Grid Monitoring Services and is responsible for host discovery and monitoring. The following Information Drivers are included in GridWay:
To configure an Information Driver, add a line to “$GW_LOCATION/etc/gwd.conf”, with the following format:
IM_MAD = <mad_name>:<path_to_mad>:[args]:<tm_mad_name>:<em_mad_name>
where:
For example, to configure GWD to access a MDS4 hierarchical information service:
IM_MAD = mds4:gw_im_mad_mds4:-s hydrus.dacya.ucm.es:gridftp:ws
All the Information Drivers provided with GridWay use a common interface to configure their operation mode. The arguments used by the Information Drivers are:
These options allow you to configure your Information Drivers in the three operation modes, described below.
In this mode, hosts are statically discovered by reading a host list file (note: the file is re-read on each discovery). The attributes of each host are also read from files. Hint: use this mode for testing purposes, not in a production environment. To configure an Information Driver in SS mode, use the host list option, for example:
IM_MAD = static:gw_im_mad_static:-l examples/im/host.static:gridftp:ws
The host list file contains one host per line, with format:
FQDN attribute_file
where:
For example (you can find this file, “host.list”, in “$GW_LOCATION/examples/im/”)
hydrus.dacya.ucm.es examples/im/hydrus.attr
draco.dacya.ucm.es examples/im/draco.attr
The “attribute_file” includes a single line with the host information and other lines with the information of each queue (one line per queue). Use the examples below as templates for your hosts.
Example of attribute file for a PBS cluster (you can find this file in “$GW_LOCATION/examples/im/”):
HOSTNAME="hydrus.dacya.ucm.es"
ARCH="i686"
OS_NAME="Linux"
OS_VERSION="2.6.4"
CPU_MODEL="Intel(R) Pentium(R) 4 CPU 2"
CPU_MHZ=2539
CPU_FREE=098
CPU_SMP=1
NODECOUNT=4
SIZE_MEM_MB=503
FREE_MEM_MB=188
SIZE_DISK_MB=55570
FREE_DISK_MB=39193
FORK_NAME="jobmanager-fork"
LRMS_NAME="jobmanager-pbs"
LRMS_TYPE="pbs"
QUEUE_NAME[0]="q4small"
QUEUE_NODECOUNT[0]=1
QUEUE_FREENODECOUNT[0]=4
QUEUE_MAXTIME[0]=0
QUEUE_MAXCPUTIME[0]=20
QUEUE_MAXCOUNT[0]=4
QUEUE_MAXRUNNINGJOBS[0]=0
QUEUE_MAXJOBSINQUEUE[0]=1
QUEUE_STATUS[0]="enabled"
QUEUE_DISPATCHTYPE[0]="batch"
QUEUE_NAME[1]="q4medium"
QUEUE_NODECOUNT[1]=4
QUEUE_FREENODECOUNT[1]=4
QUEUE_MAXTIME[1]=0
QUEUE_MAXCPUTIME[1]=120
QUEUE_MAXCOUNT[1]=4
QUEUE_MAXRUNNINGJOBS[1]=0
QUEUE_MAXJOBSINQUEUE[1]=1
QUEUE_STATUS[1]="enabled"
QUEUE_DISPATCHTYPE[1]="batch"
Example of attribute file for a Fork Desktop (you can find this file in “$GW_LOCATION/examples/im/”):
HOSTNAME="draco.dacya.ucm.es"
ARCH="i686"
OS_NAME="Linux"
OS_VERSION="2.6-xen"
CPU_MODEL="Intel(R) Pentium(R) 4 CPU 3"
CPU_MHZ=3201
CPU_FREE=185
CPU_SMP=2
NODECOUNT=2
SIZE_MEM_MB=431
FREE_MEM_MB=180
SIZE_DISK_MB=74312
FREE_DISK_MB=40461
FORK_NAME="jobmanager-fork"
LRMS_NAME="jobmanager-fork"
LRMS_TYPE="fork"
QUEUE_NAME[0]="default"
QUEUE_NODECOUNT[0]=1
QUEUE_FREENODECOUNT[0]=1
QUEUE_MAXTIME[0]=0
QUEUE_MAXCPUTIME[0]=0
QUEUE_MAXCOUNT[0]=0
QUEUE_MAXRUNNINGJOBS[0]=0
QUEUE_MAXJOBSINQUEUE[0]=0
QUEUE_STATUS[0]="0"
QUEUE_DISPATCHTYPE[0]="Immediate"
To use the WS version of these files, just replace jobmanager-fork with Fork and jobmanager-pbs with PBS.
Hosts are discovered by reading a host list file. However, the information about each host is gathered by querying its information service (GRIS in MDS2 or the DefaultIndexService in MDS4). Hint: use this mode if the resources in your Grid do not vary too much, i.e. resources are not added or removed very often. To configure an Information Driver in SD mode, use the host list option, for example:
IM_MAD = glue:gw_im_mad_mds2_glue:-l examples/im/host.list:gridftp:prews
In this case the host list file contains one host per line, with the format:
FQDN
...
FQDN
where:
For example (you can find this file in “$GW_LOCATION/examples/im/”)
hydrus.dacya.ucm.es
ursa.dacya.ucm.es
draco.dacya.ucm.es
Note |
---|
The information services of each host (GRIS or/and DefaultIndexServices) must be properly configured to use this mode. |
Important |
---|
You can configure your IMs to work in a dynamic monitoring mode but get some static information from an attributes file (as described in the SS mode). This configuration is useful when you want to add some host attributes missing from the IndexService (like software availability, special hardware devices…). You can see a useful use of this mode in section Chapter 6, Troubleshooting. |
In this mode, hosts are discovered and monitored by directly accessing the Grid Information Service. Hint: Use this mode if the resources in your Grid vary frequently, i.e. resources are added or removed very often. To configure an Information Driver in DD mode, use the server option, for example:
IM_MAD = mds4:gw_im_mad_mds4:-s hydrus.dacya.ucm.es:gridftp:ws
Note |
---|
A hierarchical information service (GIIS and/or DefaultIndexService) must be properly configured to use this mode. |
If you are using an MDS2 information service you may need to specify the Virtual Organization name in the DN of the LDIF entries (Mds-vo-name) with the base option described above.
Note |
---|
You can simultaneously use as many Information Drivers as you need (up to 10). So GridWay allows you to simultaneously use MDS2 and MDS4 Services. You can also use resources from different Grids at the same time. |
Note |
---|
You can mix SS, SD and DD modes in the same Information Driver. |
You can specify a machine other than the gatekeeper host to be used as the gsiftp endpoint. This is useful when the CE machine does not have a gsiftp server configured but another machine acts as a Storage Element. Currently, this information can only be set statically, while the rest of the information is updated dynamically. To use this feature, create a file for each host you want to configure with extra information, and another file with pairs of host and file name (as described above for the SS mode). The file name can be a full path or a path relative to “GW_LOCATION”. Then, in the IM MAD configuration, specify the list file with “-l”, like this (in “gwd.conf”):
IM_MAD = mds4:gw_im_mad_mds4:-l etc/gridway/host.list:gridftp:ws
The file list should look like this:
wsgram-host1.domain.com etc/gridway/wsgram-host1.attr
wsgram-host2.domain.com etc/gridway/wsgram-host2.attr
And the attributes file for each node should look like this:
SE_HOSTNAME="gridftp-host1.domain.com"
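The steps above can be sketched as a small shell session. The hostnames and paths are the placeholders from the text, and the existence check at the end is our own addition, not a GridWay feature:

```shell
# Create the host list and one per-host attribute file, relative to the
# current directory (in practice, relative to $GW_LOCATION).
mkdir -p etc/gridway
cat > etc/gridway/host.list <<'EOF'
wsgram-host1.domain.com etc/gridway/wsgram-host1.attr
wsgram-host2.domain.com etc/gridway/wsgram-host2.attr
EOF
cat > etc/gridway/wsgram-host1.attr <<'EOF'
SE_HOSTNAME="gridftp-host1.domain.com"
EOF
# Warn about referenced attribute files that do not exist
# (host2's is deliberately missing here, so it is reported).
while read -r host file; do
    [ -f "$file" ] || echo "missing attribute file for $host: $file"
done < etc/gridway/host.list
```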
GridWay's flexible architecture allows it to interoperate with grids based on different middleware stacks. The following documents state how to configure GridWay for the following infrastructures:
In order to test the GridWay installation, login with your user account, in the single-user installation, or with the “<gwadmin>” account, in the multiple-user installation, and follow the steps listed below:
$ export GW_LOCATION=<path_to_GridWay_installation>
$ export PATH=$PATH:$GW_LOCATION/bin

or

$ setenv GW_LOCATION <path_to_GridWay_installation>
$ setenv PATH $PATH:$GW_LOCATION/bin

depending on the shell you are using.
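If you script this step, you can pick the right syntax for the user's login shell. A sketch, with /opt/gridway as a placeholder installation path:

```shell
# Emit the variable-setting line appropriate for the login shell.
# /opt/gridway is an example path; substitute your real installation.
GW_LOCATION=/opt/gridway
case "$SHELL" in
  */csh|*/tcsh) line="setenv GW_LOCATION $GW_LOCATION" ;;
  *)            line="export GW_LOCATION=$GW_LOCATION" ;;
esac
echo "$line"
```

The printed line can then be appended to the corresponding startup file (~/.bashrc, ~/.cshrc, etc.).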
$ grid-proxy-init
Your identity: /O=Grid/OU=GRIDWAY/CN=GRIDWAY User
Enter GRID pass phrase for this identity:
Creating proxy ................................. Done
Your proxy is valid until: Mon Oct 29 03:29:17 2005
$ gwd -v
Copyright 2002-2008 GridWay Team, Distributed Systems Architecture Group, Universidad Complutense de Madrid
GridWay 5.4.0 is distributed and licensed for use under the terms of the Apache License, Version 2.0 (www.apache.org/licenses/LICENSE-2.0).
$ gwd
$ gwps
USER JID DM EM START END EXEC XFER EXIT NAME HOST
$ gwhost
HID PRIO OS ARCH MHZ %CPU MEM(F/T) DISK(F/T) N(U/F/T) LRMS HOSTNAME
$ pkill gwd
To perform more sophisticated tests, check the GridWay 5.4: User's Guide. If you experience problems, check Troubleshooting.
GridWay is shipped with a test suite, available in the test directory. The test suite exercises different parts of GridWay and can be used to track functionality bugs. However, you need a working GridWay installation and testbed to execute the suite. Usage information is available with “gwtest -h”. Tests can be performed individually (using the test id) or all together automatically.
Table 4. GridWay tests description.
Test # | Test Name | Description |
---|---|---|
1 | Normal Execution (SINGLE) | Submits a single job and verifies it is executed correctly |
2 | Normal Execution (BULK) | Submits an array of 5 jobs and verifies that all of them are executed correctly. |
3 | Pre Wrapper | Verifies that GridWay is able to execute the pre wrapper functionality. |
4 | Prolog Fail (Fake Stdin) No Reschedule | Submits a single job that fails in the prolog state due to a wrong input file for stdin. |
5 | Prolog Fail (Fake Stdin) Reschedule | Equal to the previous one, but GridWay tries to reschedule the job up to 2 times. |
6 | Prolog Fail (Fake Input File) No Reschedule | Same as #4 with a wrong input file for the executable. |
7 | Prolog Fail (Fake Executable) No Reschedule | Same as #4 with a wrong filename for the executable. |
8 | Prolog Fail (Fake Executable) Reschedule | Same as #7, but GridWay tries to reschedule the job up to 2 times. |
9 | Prolog Fail (Fake Stdin) No Reschedule (BULK) | Same as #4 submitting an array of 5 jobs. |
10 | Execution Fail No Reschedule | Submits a single job designed to fail (bad exit code) and verifies the correctness of the final state (failed). |
11 | Execution Fail Reschedule | Same as #10 but GridWay tries to reschedule the job up to 2 times. |
12 | Hold Release | Submits a single job on hold, releases it and verifies that it is executed correctly. |
13 | Stop Resume | Submits a single job, stops it (in Wrapper state), resumes it and verifies that it is executed correctly. |
14 | Kill Sync | Submits a job and kills it using a synchronous signal. |
15 | Kill Async | Submits a job and kills it using an asynchronous signal. |
16 | Kill Hard | Submits a job and hard kills it. |
17 | Migrate | Submits a job and sends a migrate signal when it reaches the Wrapper state. It then verifies the correct execution of the job. |
18 | Checkpoint local | Submits a job which creates a checkpoint file and verifies the correct execution of the job and the correct creation of the checkpoint file. |
19 | Checkpoint remote server | Same as #18 but the checkpoint file is created in a remote gsiftp server. |
20 | Wait Timeout | Submits a job and waits for it repeatedly using short timeouts until it finishes correctly. |
21 | Wait Zerotimeout | Same as #20 but with zero timeout (effectively, an asynchronous wait). |
22 | Input Output files | Tests the different methods GridWay offers to stage files (both input and output). |
23 | Epilog Fail (Fake Output) No Reschedule | Submits a single job that fails in the epilog state due to a wrong output filename. |
24 | Epilog Fail (Fake Output) Reschedule | Same as #23 but GridWay tries to reschedule the job up to 2 times. |
25 | Epilog Fail (Fake Output) No Reschedule (BULK) | Same as #23 but submitting an array of 5 jobs. |
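The individual-test interface lends itself to scripting. The sketch below assumes “gwtest <id>” runs one test and exits nonzero on failure; check “gwtest -h” for the real invocation syntax before using it:

```shell
# Run tests 1..25 one by one and summarize the results.
# Assumes gwtest is in PATH and accepts a numeric test id.
pass=0; fail=0
for id in $(seq 1 25); do
    if gwtest "$id" >/dev/null 2>&1; then
        pass=$((pass + 1))
    else
        fail=$((fail + 1))
    fi
done
echo "passed=$pass failed=$fail"
```

On a machine without a working GridWay testbed every test is counted as failed, so treat the summary as meaningful only against a configured installation.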
GridWay also ships with a DRMAA test suite, designed to test the DRMAA Java implementation. Download and untar the following tarball, then follow the instructions found in the README file.
Access authorization to the GridWay server is based on the Unix identity of the user (whether accessing GridWay directly or through a Web Services GRAM, as in GridGateWay). Hence, security in GridWay has the same implications as that of its users' Unix accounts.
Also, GridWay uses proxy certificates to access Globus services, so the security implications of managing certificates must also be taken into account.
• Lock file exists
GridWay finishes with the following message when you try to start it:
Error! Lock file <path_to_GridWay>/var/.lock exists.
Be sure that no other GWD is running, then remove the lock file and try again.
• Error in MAD initialization
GridWay finishes with the following message, when you try to start it:
Error in Execution MAD prews initialization, exiting. Check path, you have a valid proxy...
Check that you have generated a valid proxy (for example with the grid-proxy-info command). Also, check that the directory “$GW_LOCATION/bin” is in your PATH, and that the executable names of all the MADs are defined in “gwd.conf”.
• Error contacting GWD
Client commands, like gwps, finish with the message:
connect(): Connection refused Could not connect to gwd
Be sure that GWD is running (e.g. pgrep -l gwd). If it is running, check that you can connect to it (e.g. telnet localhost `cat $GW_LOCATION/var/gwd.port`).
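The checks above can be bundled into a quick health-check script. This is an illustrative helper, not part of GridWay; the fallback path for GW_LOCATION is only an example:

```shell
# Report on the two common failure causes: a stale lock file and a
# daemon that is not running. /opt/gridway is a placeholder fallback.
GW_LOCATION=${GW_LOCATION:-/opt/gridway}
if [ -f "$GW_LOCATION/var/.lock" ]; then
    lock_status="lock file present: $GW_LOCATION/var/.lock"
else
    lock_status="no stale lock file"
fi
if pgrep gwd >/dev/null 2>&1; then
    gwd_status="gwd is running"
else
    gwd_status="gwd is not running"
fi
echo "$lock_status; $gwd_status"
```

If the lock file is present while gwd is not running, remove the lock file before restarting the daemon, as described above.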