lcg-RB - Update to version 3.0.7-0
Date | 05.02.07 |
---|---|
Priority | Normal |
Description
This update fixes various bugs; the full list is given below.
WARNING: this patch makes the RB forget all unfinished jobs.
This means that the RB should be drained sufficiently before the patch is applied. One can prevent new job submissions by letting the node's firewall refuse remote (!) connections to port 9002.
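As a rough illustration of the firewall approach, the sketch below rejects TCP connections to port 9002 unless they arrive over the loopback interface. It assumes the node's firewall is plain iptables with the default INPUT chain; the function names and the IPTABLES variable are made up for this example, and the rules are not made persistent. Sites that manage their firewall differently will need to adapt it.

```shell
#!/bin/sh
# Sketch only: assumes an iptables-based firewall on the lcg-RB node.
PORT=9002
# Hypothetical override: set IPTABLES=echo to print the rules instead of applying them.
IPTABLES="${IPTABLES:-iptables}"

block_remote_submission() {
    # Reject every TCP connection to the port...
    $IPTABLES -I INPUT 1 -p tcp --dport "$PORT" -j REJECT --reject-with tcp-reset
    # ...then insert an ACCEPT for loopback traffic above it,
    # so that local connections keep working.
    $IPTABLES -I INPUT 1 -i lo -p tcp --dport "$PORT" -j ACCEPT
}

unblock_remote_submission() {
    # Remove both rules again once the update has been applied.
    $IPTABLES -D INPUT -i lo -p tcp --dport "$PORT" -j ACCEPT
    $IPTABLES -D INPUT -p tcp --dport "$PORT" -j REJECT --reject-with tcp-reset
}
```

Calling block_remote_submission as root before draining, and unblock_remote_submission after the service has been restarted, would be one way to use it. Running with IPTABLES=echo first shows the rules without touching the firewall.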
condor-lcgrb replaces condor on lcg-RB nodes to make the Condor-G components more robust, particularly w.r.t. job proxies.
Between LCG-2_7_0 and gLite 3.0.0 the condor rpm on the lcg-RB was upgraded only because the gLite-WMS needed the newer version in the repository.
There were no significant improvements expected. Instead, there were worries that the newer version might not have undergone a comparable amount of stress testing.
Now we have encountered various problems that were not seen with the old version:
- A job may be given the wrong proxy, which either makes the job abort immediately after successful submission, or lets the job live on when it should have aborted. The latter has been an issue for SAM, while the former has been plaguing the Atlas FCR jobs and has been reported at least in GGUS tickets 12226 and 15694.
- GGUS ticket 12940 discusses how a gLite 3 RB can become unusable for some user if a proxy cannot be renewed for one of the user's jobs. An LCG-2_7_0 RB can be made to fail the same way if the condor rpm is upgraded to that of gLite 3.
- There have been more run-away gahp_server processes, dissociated from the controlling condor_gridmanager process, than observed with the old version, which had a separate condor_gridmanager instance per gahp_server.
Since the lcg-RB is not supposed to be given a lot of development or testing effort any more, it would seem best to return to the previous version of condor.
To avoid clashes with the gLite WMS, instead of downgrading
condor it seems better to introduce a new rpm that happens
to contain an older version of condor, and to remove the
dependency on condor from the lcg-RB meta rpm.
The condor-lcgrb rpm contains the condor-6.6.6-lcg3 functionality, but
relocated under /opt/condor-20.0.7 so that YAIM will consider it the highest
version of condor available.
The pre-install script will stop the Condor-G processes,
but will not restart them, since the admin will first have
to reconfigure condor on the lcg-RB, e.g. as follows:
/opt/glite/yaim/scripts/run_function \
    the-site-info.def \
    config_condor
/etc/init.d/edg-wl-jc start
A full reconfiguration will also work, of course:
/opt/glite/yaim/scripts/configure_node \
    the-site-info.def \
    lcg-RB
Fixed bugs
Number | Description |
---|---|
#13888 | VOMS Admin: Internal database inconsistency detected: Got more roles than expected for user "<my DN>" |
#15566 | VOMS Admin does not enforce the correct group semantics |
#16245 | [VOMS Admin] Removing a VO should call check_parameters() |
#16472 | VOMS Admin voms.request.webui.enabled config parameter does not work |
#17476 | voms-admin fails in creating users correctly on oracle |
#18140 | [ voms-admin ] create-group option doesn't work properly in the command line |
Updated rpms
Name | Version | Full RPM name | Description |
---|---|---|---|
condor-lcgrb | 1.0.0-3 | condor-lcgrb-1.0.0-3.i386.rpm | condor 6.6.6 with LCG patches for LCG-RB |
glite-security-voms-admin-client | 1.2.15-1 | glite-security-voms-admin-client-1.2.15-1.noarch.rpm | gLite VOMS Administration clients |
glite-security-voms-admin-interface | 1.0.5-1 | glite-security-voms-admin-interface-1.0.5-1.noarch.rpm | gLite VOMS Administration service (interface) |
lcg-RB | 3.0.7-0 | lcg-RB-3.0.7-0.noarch.rpm | lcg RB node |
The RPMs can be updated:
- via apt: apt-get dist-upgrade
- or via a download from:
http://glitesoft.cern.ch/EGEE/gLite/APT/R3.0/rhel30/RPMS.updates/
Service reconfiguration after update
Service must be reconfigured. See above for details.
Service restart after update
Service must be restarted.
How to apply the fix
- Update the RPMs (see above)
- Update configuration (see above)
- Restart the service if necessary (see above)
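The three steps can be strung together roughly as follows. This is a sketch rather than a supported procedure: the `apt-get update` and `rpm -q` calls are additions not mentioned above, the-site-info.def is the same placeholder path used earlier, and the RUN and SITE_INFO variables and the function name are invented for this example. It assumes the RB has already been drained.

```shell
#!/bin/sh
# Sketch only: run as root on a drained lcg-RB node.
SITE_INFO="${SITE_INFO:-the-site-info.def}"   # placeholder path, as in the text above
RUN="${RUN:-}"                                # hypothetical: set RUN=echo for a dry run

apply_rb_update() {
    # 1. Update the RPMs (apt-get update is an added assumption).
    $RUN apt-get update &&
    $RUN apt-get dist-upgrade &&
    # Check that the new packages are installed (added assumption).
    $RUN rpm -q condor-lcgrb lcg-RB &&
    # 2. Reconfigure condor; a full configure_node run would also work.
    $RUN /opt/glite/yaim/scripts/run_function "$SITE_INFO" config_condor &&
    # 3. Start edg-wl-jc again; the pre-install script stops the
    #    Condor-G processes without restarting them.
    $RUN /etc/init.d/edg-wl-jc start
}
```

Invoking apply_rb_update with RUN=echo prints the five commands in order without executing them, which may be useful for review before the real run.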