===============================================================================
=========================== LCG1 Installation notes ===========================
===============================================================================
========== (C) 2003 by Emanuele Leonardi - Emanuele.Leonardi@cern.ch ==========
===============================================================================
Reference tag: LCG1-1_1_3
These notes will assist you in installing the latest LCG1 tag. Read through them
carefully before starting the installation of a new site.
Release LCG1-1_1_3 is a minor update to LCG1-1_1_2 to upgrade the kernel
installed on all machines to version 2.4.20-24.7. This fixes a critical
security bug present in all previous kernel versions. Several other minor
fixes are also included. You can find a full description of the changes in
Appendix F.
If you are currently running releases LCG1-1_1_1 or LCG1-1_1_2 then you can
just do an update following the procedures detailed in Appendix E.
*** WARNING *** with the addition of the new LHCb rpms in version LCG1-1_1_2
and the new CMS rpm in LCG1-1_1_3, the disk space used on the WNs has grown by
~1.3GB, so that the total disk occupation on a WN is now ~7.8GB. Make sure
your WNs have the required amount of disk space before attempting any update.
Introduction and overall setup
==============================
In this text we will assume that you are already familiar with the LCFGng
server installation and management.
A detailed guide can be found at
http://grid-deployment.web.cern.ch/grid-deployment/gis/lcfgng-server73.pdf
Note that by following the procedure described in that document, you will
install on your LCFGng server the latest available version of each object.
In some cases this may be incompatible with the object version used on the
LCFGng client nodes. To make sure that the correct version of each object is
installed on your server, you should use the lcfgng_server_update.pl script,
available from CVS. See chapter "Preparing the installation of current tag"
below for instructions on how to obtain and run this script.
Also note that by following the instructions there, the apache server running
on your LCFG server will allow access from anywhere on the net. This is a
potential security risk, as LCFG node configuration files contain sensitive
information (e.g. encrypted root passwords). To limit access to nodes belonging
to your domain you should edit the apache configuration file,
/etc/httpd/conf/httpd.conf, and apply the following changes:
- in the section delimited by these two lines:
...
find the line "Allow from all" and replace it with "Allow from "
where is your network domain (e.g. at CERN this gives
"Allow from cern.ch").
- repeat the same operation in the section delimited by lines:
...
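The following sketch only illustrates the kind of change to make; the actual
stanzas to edit are the ones already present in your httpd.conf (the Directory
path shown here is a generic placeholder, not the real one):
@-----------------------------------------------------------------------------
<Directory "/some/path/served/by/the/LCFG/server">
    Order allow,deny
    # The original "Allow from all" line, replaced by the more restrictive
    # form below (use your own domain instead of cern.ch):
    Allow from cern.ch
</Directory>
@-----------------------------------------------------------------------------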
Files needed for the current LCG1 release are available from a CVS server at
CERN. This CVS server contains the list of rpms to install and the LCFGng
configuration files for each node type. The CVS area, called "lcg1", can be
reached from
http://lcgapp.cern.ch/cgi-bin/viewcvs/viewcvs.cgi/?cvsroot=lcgdeploy
Note1: at the same location there is another directory called "lcg-release":
this area is used for the integration and certification software, NOT for
production. Just ignore it!
Note2: documentation about access to this CVS repository can be found in
http://grid-deployment.web.cern.ch/grid-deployment/documentation/cvs-guide
In the same CVS location we created an area for each of the sites participating
in LCG1, e.g. BNL, BUDAPEST, CERN, etc. These directories (should) contain the
configuration files used to install and configure the nodes at the
corresponding site. Site managers are required to keep these directories
up-to-date by committing all changes they make to their configuration files
back to CVS, so that we can keep track of the status of each site at
any given moment. If a site reaches a consistent working configuration, site
managers can (should) create a tag which will allow them to easily recover
configuration information if needed. The tag name should follow the convention
described in
http://grid-deployment.web.cern.ch/grid-deployment/cgi-bin/index.cgi?var=lcg1Status
Note: if you have not done it yet, please get in touch with Louis Poncet
or Markus Schulz to activate
your write-enabled account on the CVS server.
Given the increasing number of sites joining the LCG production system, support
for installation at new sites is now organized in a hierarchical way: new
secondary (Tier2) sites should direct questions about installation problems to
their reference primary (Tier1) site. Primary sites will then escalate problems
when needed.
All site managers must in any case join and monitor the LCG-Rollout list
where all issues related to the LCG deployment, including announcements of
updates and security patches, are discussed. You can join this list by going to
http://cclrclsv.RL.AC.UK/archives/lcg-rollout.html
and clicking on the "Join or leave the list" link.
Preparing the installation of current tag
=========================================
The current LCG1 tag is ---> LCG1-1_1_3 <---
In the following instructions/examples, when you see the <tag> string,
you should replace it with the name of the tag defined above.
To install it, check it out on your LCFG server with
> cvs checkout -r <tag> -d <tag> lcg1
Note: the "-d <tag>" option will create a directory named <tag> and copy all
the files there. If you do not specify the -d parameter, the files will end up
in a subdirectory of the current directory named lcg1.
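For the current tag the checkout command thus reads:
> cvs checkout -r LCG1-1_1_3 -d LCG1-1_1_3 lcg1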
The default way to install the tag is to copy the content of the <tag>/rpmlist
subdirectory to the /opt/local/linux/7.3/rpmcfg directory on the LCFG server.
This directory is NFS-mounted by all client nodes and is visible as
/export/local/linux/7.3/rpmcfg
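For example, assuming you checked out the current tag into the current
directory as shown above, the copy could simply be:
> cp LCG1-1_1_3/rpmlist/* /opt/local/linux/7.3/rpmcfg/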
Now go to the directory where you keep your local configuration files. If you
want to create a new one, you can check out from CVS any of the previous tags
with:
> cvs checkout -r <site tag> -d <site dir> <your site name>
If you want the latest (HEAD) version of your config files, just omit the
"-r <site tag>" parameter.
Go to <site dir>, copy there the template files from <tag>/source,
cfgdir-cfg.h.template, local-cfg.h.template, site-cfg.h.template, and
private-cfg.h.template, rename them cfgdir-cfg.h, local-cfg.h, site-cfg.h, and
private-cfg.h, and edit their content according to the instructions in the
files.
NOTE: if you already have localized versions of these files, just compare
them with the new templates to verify that no new parameter needs to be set.
To download all the rpms needed to install this version you can use the
updaterep command. In <tag>/updaterep you can find 2 configuration
files for this script: updaterep.conf and updaterep_full.conf. The first will
tell updaterep to only download the rpms which are actually needed to install
the current tag, while updaterep_full.conf will do a full mirror of the LCG rpm
repository. Copy updaterep.conf to /etc/updaterep.conf and run the updaterep
command. By default all rpms will be copied to the /opt/local/linux/7.3/RPMS
area, which is visible from the client nodes as /export/local/linux/7.3/RPMS.
You can change the repository area by editing /etc/updaterep.conf and modifying
the REPOSITORY_BASE variable.
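Putting the above together, a typical sequence is (assuming the tag was checked
out in the current directory and that the updaterep command is in your PATH):
> cp LCG1-1_1_3/updaterep/updaterep.conf /etc/updaterep.conf
> updaterep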
IMPORTANT NOTICE: as the list and structure of the Certification Authorities
(CAs) accepted by the LCG project can change independently of the middleware
releases, the rpm list for the CA certificates and URLs has been decoupled
from the standard LCG1 release procedure. This means that the version of the
security-rpm.h file contained in the rpmlist directory associated with the
current tag might be incomplete or obsolete. Please go to
http://grid-deployment.web.cern.ch/grid-deployment/cgi-bin/index.cgi?var=lcg1Status
click on the "LCG1 CAs" link at the bottom of the page, and follow the
instructions there to update all CA-related settings. Changes and updates of
these settings will be announced on the LCG-Rollout mailing list.
To make sure that all the needed object rpms are installed on your LCFG server,
you should use the lcfgng_server_update.pl script, also located in
<tag>/updaterep. This script will report which rpms are missing or
have the wrong version and will create the /tmp/lcfgng_server_update_script.sh
script which you can then use to fix the server configuration. Run it in the
following way:
> lcfgng_server_update.pl <tag>/rpmlist/lcfgng-common-rpm.h
> /tmp/lcfgng_server_update_script.sh
> lcfgng_server_update.pl <tag>/rpmlist/lcfgng-server-rpm.h
> /tmp/lcfgng_server_update_script.sh
WARNING: please always have a look at /tmp/lcfgng_server_update_script.sh
and verify that all rpm update commands look reasonable before running it.
In the source directory you should have a look at the redhat73-cfg.h file
and check that the locations of the rpm lists (updaterpms.rpmcfgdir) and of the
rpm repository (updaterpms.rpmdir) are correct for your site (the defaults are
consistent with the instructions in this document). If needed, you can redefine
these paths from the local-cfg.h file.
In private-cfg.h you can (must!) replace the default root password with the one
you want to use for your site:
+auth.rootpwd <crypted_password>   <--- replace with your own crypted password
To obtain <crypted_password> using the MD5 encryption algorithm (stronger than
the standard crypt method) you can use the following command:
> openssl passwd -1
This command will prompt you to insert the clear text version of the password
and then print the encrypted version. E.g.
> openssl passwd -1
Password: <- write clear text password here
$1$iPJJEhjc$rtV/65l890BaPinzkb58z1 <- <crypted_password> string
To finalize the adaptation of the current tag to your site you should edit your
site-cfg.h file. You can use the site-cfg.h.template file in the source
directory as a starting point. If you already have a site-cfg.h file that you
used to install the LCG1-1_1_1 release, you can find a detailed description of
the modifications to this file needed for the new tag in Appendix E below.
WARNING: the template file site-cfg.h.template assumes you want to run the
PBS batch system without sharing the /home directory between the CE and all the
WNs. This is the highly recommended setup. If for some reason you want to
run PBS in traditional mode, i.e. with the CE exporting /home with NFS and
all the WNs mounting it, you should edit your site-cfg.h file and comment out
the following two lines:
#define NO_HOME_SHARE
...
#define CE_JM_TYPE lcgpbs
In addition to this, your WN configuration file should include this line:
#include CFGDIR/UsersNoHome-cfg.h"
just after including Users-cfg.h (please note that BOTH Users-cfg.h AND
UsersNoHome-cfg.h must be included).
WARNING: in the current default configuration the "file" protocol access to the
SE is enabled. This means that SE and WN nodes must share a disk area called
/flatfiles/SE00. This is where the various VOs will store their files, each VO
using a different subdirectory named after the VO itself (e.g.
/flatfiles/SE00/atlas), plus an extra directory named "data" (i.e.
/flatfiles/SE00/data). If you are using an external file server to hold this
area, mounting it both on the SE and on the WNs, then you should create the
<VO> and "data" subdirectories yourself, set the ownership to root:<VO>
(root:root for "data") and the access mode to 775, i.e.
> ls -l /flatfiles/SE00
drwxrwxr-x 3 root alice 4096 Sep 2 16:13 alice
drwxrwxr-x 3 root atlas 4096 Sep 2 16:13 atlas
drwxrwxr-x 3 root cms 4096 Sep 2 16:13 cms
drwxrwxr-x 3 root root 4096 Sep 2 16:13 data
drwxrwxr-x 3 root dteam 4096 Sep 2 16:13 dteam
drwxrwxr-x 3 root lhcb 4096 Sep 2 16:13 lhcb
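If you need to create these directories by hand (external file server case), a
minimal sketch is the following, assuming the five VOs listed above and that
the VO group names match the VO names, as in the listing:
@-----------------------------------------------------------------------------
#!/bin/bash
# Create the per-VO directories plus "data" under /flatfiles/SE00 and set
# the ownership and access mode described above.
cd /flatfiles/SE00 || exit 1
for vo in alice atlas cms dteam lhcb; do
    mkdir -p $vo
    chown root:$vo $vo
    chmod 775 $vo
done
mkdir -p data
chown root:root data
chmod 775 data
@-----------------------------------------------------------------------------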
If on the other hand you want to keep the /flatfiles/SE00 area on the local
disk of the SE, then you can tell LCFG to create the whole structure by
including the following line in your SE node configuration file:
#include CFGDIR/flatfiles-dirs-SECLASSIC-cfg.h"
Note: the line should be inserted close to where you include
StorageElement-cfg.h but the exact position is not important.
Node installation and configuration
===================================
In your site-specific directory you should already have the do_mkxprof.sh
script. If you do not find it, you can check out the one in the CERN CVS area
with
> cvs checkout CERN/do_mkxprof.sh
This script just calls the mkxprof command specifying that it should look for
configuration files starting from the current directory (option "-S."). Feel
free to use your preferred call to the mkxprof command but note that running
mkxprof as a daemon is NOT recommended and can easily lead to massive
catastrophes if not used with extreme care: do it at your own risk.
To create the LCFG configuration for one or more nodes you can do
> ./do_mkxprof.sh node1 [node2 node3, ...]
If you get an error status for one or more of the configurations, you can get
a detailed report on the nature of the error by looking at the URL
http://<LCFG server>/status/
and clicking on the name of the node with a faulty configuration (a small red
bug should be shown beside the node name).
Once all node configurations are correctly published, you can proceed and
install your nodes following any one of the installation procedures described
in the "LCFGng Server Installation Guide" mentioned above.
When the initial installation completes (expect two automatic reboots in the
process), each node type requires a few manual steps, detailed below, to be
completely configured. After completing these steps, some of the nodes need
a final reboot which will bring them up with all the needed services active.
The need for this final reboot is explicitly stated among the node
configuration steps below.
Note about UI installation: the current default for a UI node is to NOT install
the CERN libraries rpms (these were added by mistake in release LCG1-1_1_1). If
you wish to make the CERN libraries available from your UI you should edit the
UI rpm list, UI-rpm in the /opt/local/linux/7.3/rpmcfg directory (this is the
directory where you copied all your rpm lists), and uncomment the line
/* #include "apps_common-rpm.h" */
Common steps
------------
-- On the ResourceBroker, MyProxy, ComputingElement, and StorageElement nodes
you should install the host certificate/key files in /etc/grid-security with
names hostcert.pem and hostkey.pem. Also make sure that hostkey.pem is only
readable by root with
> chmod 400 /etc/grid-security/hostkey.pem
-- All Globus services grant access to LCG users according to the list of
certificates contained in the /etc/grid-security/grid-mapfile file.
The list of VOs included in grid-mapfile is defined in
/opt/edg/etc/edg-mkgridmap.conf. By default all VOs accepted in LCG
are included in this list. You can prevent VOs from accessing your site by
commenting out the corresponding line in edg-mkgridmap.conf.
E.g. by commenting out line
group ldap://grid-vo.nikhef.nl/ou=lcg1,o=alice,dc=eu-datagrid,dc=org .alice
on your CE you will prevent users in the Alice VO from submitting jobs to your
site.
After installing a ResourceBroker, ComputingElement, or StorageElement node
and modifying (if needed) the local edg-mkgridmap.conf file, you may force a
first creation of the grid-mapfile by running
> /opt/edg/sbin/edg-mkgridmap --output /etc/grid-security/grid-mapfile --safe
Every 6 hours a cron job will repeat this procedure and update grid-mapfile.
Important Notice: if your site is not supporting one or more of the LCG
standard VOs, please make sure you comment out the corresponding lines in
edg-mkgridmap.conf as described above. Failure to do this will result in
jobs from unsupported VOs being submitted to your site and then failing, thus
creating a grid "black hole" (jobs fall in and never come back).
UserInterface
-------------
No additional configuration steps are currently needed on a UserInterface node.
ResourceBroker
--------------
-- Configure the MySQL database. See detailed recipe in Appendix C at the end
of this document
-- Reboot the node
ComputingElement
----------------
-- Configure the PBS server. See detailed recipe in Appendix B at the end of
this document.
-- Create the first version of the /etc/ssh/ssh_known_hosts file by running
> /opt/edg/sbin/edg-pbs-knownhosts
A cron job will update this file every 6 hours.
-- If your CE is NOT sharing the /home directory with your WNs (this is the
default configuration) then you have to configure sshd to allow WNs to copy
job output back to the CE using scp. This requires the following two steps:
1) modify the sshd configuration. Edit the /etc/ssh/sshd_config file
and add these lines at the end:
HostbasedAuthentication yes
IgnoreUserKnownHosts yes
IgnoreRhosts yes
and then restart the server with
> /etc/rc.d/init.d/sshd restart
2) configure the script enabling WNs to copy output back to the CE.
- in /opt/edg/etc, copy edg-pbs-shostsequiv.conf.template to
edg-pbs-shostsequiv.conf then edit this file and change parameters to your
needs. Most sites will only have to set NODES to an empty string.
- create the first version of the /etc/ssh/shosts.equiv file by running
> /opt/edg/sbin/edg-pbs-shostsequiv
A cron job will update this file every 6 hours.
Note: every time you add or remove WNs, do not forget to run
> /opt/edg/sbin/edg-pbs-shostsequiv <--- only if you do not share /home
> /opt/edg/sbin/edg-pbs-knownhosts
on the CE or the new WNs will not work correctly till the next time cron runs
them for you.
-- The CE is supposed to export information about the hardware configuration
(i.e. CPU power, memory, disk space) of the WNs. The procedure to collect
this information and publish it is described in Appendix D of this
document.
-- Reboot the node
-- If your CE exports the /home area to all WNs, then after rebooting it make
sure that all WNs can still see this area. If this is not the case, execute
this command on all WNs:
> /etc/obj/nfsmount restart
StorageElement
--------------
-- Make sure that all subdirectories in /flatfiles/SE00 were correctly created
(see WARNING notice at the end of the "Preparing the installation of the
current tag" section).
-- Reboot the node.
-- If your SE exports the /flatfiles/SE00 area to all WNs, then after rebooting
the node make sure that all WNs can still see this area. If this is not the
case, execute this command on all WNs:
> /etc/obj/nfsmount restart
WorkerNode
----------
-- The default maximum number of open files allowed on a RedHat node is only
26213. This number might be too small if users submit file-hungry jobs (we
already had one case), so you may want to increase it on your WNs. At CERN we
currently use 256000. To set this parameter you can use this command:
> echo 256000 > /proc/sys/fs/file-max
You can make this setting reboot-proof by adding the following code at the
end of your /etc/rc.d/rc.local file:
# Increase max number of open files
if [ -f /proc/sys/fs/file-max ]; then
echo 256000 > /proc/sys/fs/file-max
fi
-- Every 6 hours each WN needs to connect to the web sites of all known CAs to
check if a new CRL (Certificate Revocation List) is available. As the script
which handles this functionality uses wget to retrieve the new CRL, you can
direct your WNs to use a web proxy. This is mandatory if your WNs sit on a
hidden network with no direct external connectivity.
To redirect your WNs to use a web proxy you should edit the /etc/wgetrc file
and add a line like:
http_proxy = http://web_proxy.cern.ch:8080/
where you should replace the node name and the port to match those of your
web proxy.
Note: I could not test this recipe directly as I am not aware of a web proxy
at CERN. If you try it and find problems, please post a message on the
lcg-rollout list.
-- If your WNs are NOT sharing the /home directory with your CE (this is the
default configuration) then you have to configure ssh to enable them to copy
job output back to the CE using scp. To this end you have to modify the ssh
client configuration file /etc/ssh/ssh_config adding these lines at the end:
Host *
HostbasedAuthentication yes
Note: the "Host *" line might already exist. In this case, just add the second
line after it.
-- Create the first version of the /etc/ssh/ssh_known_hosts file by running
> /opt/edg/sbin/edg-pbs-knownhosts
A cron job will update this file every 6 hours.
BDII Node
---------
To avoid having a single top MDS which could easily be overloaded, we have
split the information system into several (currently three) independent
information regions, each served by one or more regional MDSes.
For this scheme to work, each and every BDII on the GRID must know the name of
all region MDSes and merge the information coming from them into a single
database.
All software needed to handle this data collection is now included in an rpm
installed on the BDII node. To configure it you have to:
- go to the /opt/edg/etc directory
- copy bdii-cron.conf.template to bdii-cron.conf
The template file included in the current tag only knows about two regions so
you will have to add the third one by hand. The correct configuration as of
November 5th, 2003, is:
MDS_HOST_LIST="
adc0026.cern.ch:2135/lcg00108.grid.sinica.edu.tw:2135
lcgcs01.gridpp.rl.ac.uk:2135
wn-02-37-a.cr.cnaf.infn.it:2135/pic115.ifae.es:2135
"
Should new Regional MDS nodes appear, they will be announced on the lcg-rollout
mailing list. In this case you will have to edit bdii-cron.conf and add them
to the correct group. You can find a description of the syntax for the
MDS_HOST_LIST variable in Appendix A in this document.
If in doubt, send e-mail to the lcg-rollout mailing list asking for the
correct setting of MDS_HOST_LIST for your site.
Regional MDS Node
-----------------
No additional configuration steps are currently needed on a Regional MDS node.
Note: If your site is hosting a Regional MDS node, once in a while you will be
notified of new sites joining your region. In this case on the LCFGng server
you should modify the node configuration file for your Regional MDS and add a
line like:
EXTRA(globuscfg.allowedRegs_topmds) <new site GIIS>:2135
where <new site GIIS> is the hostname of the node hosting the GIIS for the new
site. Then you must update your Regional MDS node with mkxprof.
MyProxy Node
------------
-- Reboot the node after installing the host certificates (see "Common Steps"
above).
Testing
-------
IMPORTANT NOTICE: if /home is NOT shared between CE and WNs (this is the
default configuration) then, due to the way the new jobmanager works, a
globus-job-run command will take at least 2 minutes. Even in the configuration
with shared /home the execution time of globus-job-run will be slightly longer
than before. Keep this in mind when testing your system.
To perform the standard tests (edg-job-submit & co.) you need to have your
certificate registered in one VO and to sign the LCG usage guidelines.
Detailed information on how to do these two steps can be found at:
http://lcg-registrar.cern.ch/
If you are working in one of the four LHC experiments, then ask for
registration in the corresponding VO, otherwise you can choose the "LCG
Deployment Team" (aka DTeam) VO.
A test suite which will help you in making sure your site is correctly
configured is now available. This software provides basic functionality tests
and various utilities to run automated sequences of tests and to present
results in a common HTML format.
Extensive on-line documentation about this test suite can be found in
http://grid-deployment.web.cern.ch/grid-deployment/tstg/docs/LCG-Certification-help
Experiment software
-------------------
Pending the final agreement on a flexible, experiment-controlled system to
install and certify VO-specific application software, we have included in the
WN rpm list a few rpms for the Alice, Atlas, CMS, and LHCb collaborations, plus
the full CERN libraries, to allow an initial use of the system.
With LCG version 2 we will switch to the new system and no VO specific software
will be included in the LCG distribution anymore.
Appendix A
==========
Syntax for the MDS_HOST_LIST variable
-------------------------------------
The LCG information system is currently partitioned into several independent
regions, each served by one or more top level MDS servers.
To collect a full set of information, the BDII should contact one MDS server
per region and download from it all the information relative to that region.
The correct setting for the MDS_HOST_LIST will then consist of a list of MDS
servers organized into several groups, each group corresponding to a different
region.
The BDII will scan through this list, looping over the different groups. For
each group it will try and get the information from the first MDS server in
the group. If the server does not answer, the BDII will try the second, etc.
As soon as one of the servers in the group answers and provides information
about the region, the BDII will switch to the next group and repeat the same
operation.
If none of the servers in one group answers, no information for the
corresponding region will be retrieved and the whole region will disappear
from the information available to the Resource Broker.
The syntax for MDS_HOST_LIST is composed of a list of servers organized into
several groups. Groups of servers are separated by one or more "white space"
characters (i.e. space, tab, LF), while servers belonging to the same group are
separated by a "/" (slash) character. A server name can be followed by a ":"
(colon) character and the slapd port number. If the slapd port number is the
default one (2135) it can be omitted from the server specification.
Example:
MDS_HOST_LIST="
server1/server2:2136/server3
server4/server5:2135 server6
"
In this example the grid is organized in three regions. The first region
is served by three servers: server1, server2, and server3. On server2 slapd
is listening to the (non-standard) port 2136. The second region is served by
two servers: server4 and server5. In this case the specification of the
port for server5 is not really needed. The third region is served by a
single server, server6, where slapd is listening on the standard port 2135.
N.B.: the name of the servers must be in a format the BDII can use to
contact them. Both numeric (e.g. 141.108.5.5) and text (e.g. adc0026.cern.ch)
formats are correct. If the MDS server is local to the BDII, then the network
name can be omitted, but in general this is not recommended as it might induce
confusion if someone wants to use the same configuration from a different site.
N.B.: as explained before, the order in which the servers appear in a group
is relevant, as this is the order in which the BDII will query them, stopping
after the first successful contact. This means that listing first the MDS
servers which are closest to your site from the network efficiency point of
view may improve the update time of your BDII. Also, having different sites
use different query orderings results in a statistical balancing of the load
on the MDS servers of the same region, thus improving the general scalability
of the information system. If you are in doubt about the best way to order the
MDS servers for any of the regions, please ask on the LCG-Rollout mailing list.
At the time of writing (November 5th, 2003) three regions are active: lcgeast,
lcgwest, and lcgsouth.
- lcgeast is served by two MDS servers, one located at CERN and one in Taipei
- lcgwest is served by one MDS server located at RAL
- lcgsouth is served by two MDS servers, one located at PIC in Barcelona and
one at CNAF in Bologna
The correct setting for the MDS_HOST_LIST variable is then:
MDS_HOST_LIST="
adc0026.cern.ch:2135/lcg00108.grid.sinica.edu.tw:2135
lcgcs01.gridpp.rl.ac.uk:2135
wn-02-37-a.cr.cnaf.infn.it:2135/pic115.ifae.es:2135
"
Please choose the order between adc0026.cern.ch and lcg00108.grid.sinica.edu.tw
and that between wn-02-37-a.cr.cnaf.infn.it and pic115.ifae.es according to
your site connectivity: sites that are better connected to your site should
appear first in the lists.
In all cases the ":2135" port specification can be omitted.
Appendix B
==========
How to configure the PBS server on a ComputingElement
-----------------------------------------------------
1) load the server configuration with this command (replace <CE hostname> with
the hostname of the CE you are installing):
@-----------------------------------------------------------------------------
/usr/bin/qmgr <<EOF
set server operators = root@<CE hostname>
set server default_queue = short
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server scheduler_iteration = 600
set server default_node = lcgpro
set server node_pack = False
create queue short
set queue short queue_type = Execution
set queue short resources_max.cput = 00:15:00
set queue short resources_max.walltime = 02:00:00
set queue short enabled = True
set queue short started = True
create queue long
set queue long queue_type = Execution
set queue long resources_max.cput = 12:00:00
set queue long resources_max.walltime = 24:00:00
set queue long enabled = True
set queue long started = True
create queue infinite
set queue infinite queue_type = Execution
set queue infinite resources_max.cput = 48:00:00
set queue infinite resources_max.walltime = 72:00:00
set queue infinite enabled = True
set queue infinite started = True
EOF
@-----------------------------------------------------------------------------
Note that queues short, long, and infinite are those defined in the site-cfg.h
file and the time limits are those in use at CERN. Feel free to
add/remove/modify them to your liking but do not forget to modify site-cfg.h
accordingly.
2) edit file /var/spool/pbs/server_priv/nodes to add the list of WorkerNodes
you plan to use. CERN settings are:
@-----------------------------------------------------------------------------
lxshare0223.cern.ch np=2 lcgpro
lxshare0224.cern.ch np=2 lcgpro
lxshare0225.cern.ch np=2 lcgpro
lxshare0226.cern.ch np=2 lcgpro
lxshare0227.cern.ch np=2 lcgpro
lxshare0228.cern.ch np=2 lcgpro
lxshare0249.cern.ch np=2 lcgpro
lxshare0250.cern.ch np=2 lcgpro
lxshare0372.cern.ch np=2 lcgpro
lxshare0373.cern.ch np=2 lcgpro
@-----------------------------------------------------------------------------
where np=2 gives the number of job slots (usually equal to #CPUs) available
on the node, and lcgpro is the group name as defined in the default_node
parameter in the server configuration.
3) Restart the PBS server
> /etc/rc.d/init.d/pbs_server restart
Appendix C
==========
How to configure the MySQL database on a ResourceBroker
-------------------------------------------------------
Connect to your RB node, represented by <RB hostname> in the example, and make
sure that the MySQL server is up and running:
> /etc/rc.d/init.d/mysql start
If it was already running you will just get notified of the fact.
Now choose a DB management password (<password> in the commands below), write
it down somewhere, and then configure the server with the following commands:
> mysqladmin password <password>
> mysql --password=<password> \
  --exec "set password for root@<RB hostname>=password('<password>')" mysql
> mysqladmin --password=<password> create lbserver20
> mysql --password=<password> lbserver20 < /opt/edg/etc/server.sql
> mysql --password=<password> \
  --exec "grant all on lbserver20.* to lbserver@localhost" lbserver20
Note that the database name "lbserver20" is hardwired in the LB server code
and cannot be changed so use it exactly as shown in the commands.
Appendix D
==========
Publishing WN information from the CE
-------------------------------------
When submitting a job, users of LCG are supposed to state in their jdl the
minimal hardware resources (memory, scratch disk space, CPU time) required
to run the job. These requirements are matched by the RB with the information
on the BDII to select a set of available CEs where the job can run.
For this scheme to work, each CE must publish some information about the
hardware configuration of the WNs connected to it. This means that site
managers must collect information about WNs available at the site and insert
it in the information published by the local CE.
The procedure to do this is the following:
- choose a WN which is "representative" of your batch system (see below for a
definition of "representative") and make sure that the chosen node is fully
installed and configured. In particular, check if all expected NFS partitions
are correctly mounted.
- on the chosen WN run the following script as root, saving the output to a
file.
@-----------------------------------------------------------------------------
#!/bin/bash
echo -n 'hostname: '
host `hostname -f` | sed -e 's/ has address.*//'
echo "Dummy: `uname -a`"
echo "OS_release: `uname -r`"
echo "OS_version: `uname -v`"
cat /proc/cpuinfo /proc/meminfo /proc/mounts
df
@-----------------------------------------------------------------------------
- copy the obtained file to /opt/edg/var/info/edg-scl-desc.txt on your CE,
replacing any pre-existing version.
- restart the GRIS on the CE with
> /etc/rc.d/init.d/globus-mds restart
Definition of "representative WN": in general, WNs are added to a batch system
at different times and with heterogeneous hardware configurations. All these
WNs often end up being part of a single queue, so that when an LCG job is sent
to the batch system, there is no way to ask for a specific hardware
configuration (note: LSF and other batch systems offer ways to do this but the
current version of the Globus gatekeeper is not able to take advantage of this
possibility). This means that the site manager has to choose a single WN as
"representative" of the whole batch cluster. In general it is recommended that
this node is chosen among the "least powerful" ones, to avoid sending jobs with
heavy hardware requirements to under-spec nodes.
Appendix E
==========
Update procedure from release LCG1-1_1_1 to LCG1-1_1_3
------------------------------------------------------
Fixes to apply independently from the main update
-------------------------------------------------
- You should limit access to the apache server running on the LCFG node to
accept connection only from nodes which are on the local site. The procedure
for this is described in the "Introduction and overall setup" chapter.
- The configuration of the MySQL database server on the RB has changed to
improve security: connection to the root user of the db now always requires
a password (this was not the case with the old settings). If you already
configured the MySQL server on your RB following the old instructions, then
you should connect to your RB node and issue the following command:
> mysql --password=<password> \
  --exec "set password for root@<RB hostname>=password('<password>')" mysql
where <password> is the administrative password you gave when you configured
your MySQL server (you DID write it down, didn't you?) and <RB hostname> is the
full name of your RB, e.g. lxshare0380.cern.ch at CERN.
- You should increase the maximum number of open files on WNs following the
instructions in the WorkerNode section of the "Node installation and
configuration" chapter.
- You can instruct WNs to use a web proxy when downloading new CRLs. The
procedure for this is described in the WorkerNode section of the "Node
installation and configuration" chapter.
Before updating the nodes
-------------------------
- Modifications to your site-cfg.h file:
...
#define SITE_EDG_VERSION LCG1-1_1_3
...
#define CE_IP_RUNTIMEENV LCG-1 ALICE-3.09.06 ATLAS-6.0.4 CMKIN-1.1.0 CMKIN-VALID CMSIM-VALID CMS-OSCAR-2.4.5 LHCb_dbase_common-v3r1
...
- In your local-cfg.h replace the section defining the encrypted root password
with the following lines (see local-cfg.h.template for an example):
/* Include settings for sensible parameters from a non-CVS file */
#include "private-cfg.h"
- Create a new file named private-cfg.h using private-cfg.h.template as an
example and insert there the root password for your site, encrypted as
explained in the "Preparing the installation of current tag" section. It
is recommended that you use a new root password. A minimal example of this
file is shown after this list.
- See note in the "Node installation and configuration" section about CERN
libraries availability on your UI.
- Node configuration files should not need any change.
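As an example, a minimal private-cfg.h can contain just the root password
resource, using a hash produced with "openssl passwd -1" as described in the
"Preparing the installation of current tag" section (the hash below is the
example one from that section, NOT a password to reuse):
@-----------------------------------------------------------------------------
/* private-cfg.h - site-private settings, kept out of CVS */
+auth.rootpwd $1$iPJJEhjc$rtV/65l890BaPinzkb58z1
@-----------------------------------------------------------------------------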
Update Procedure
----------------
Run mkxprof (or do_mkxprof) for all your nodes. You may want to verify that
nodes are actually updated by looking in the /var/obj/log/client file on the
nodes themselves.
After updating the nodes
------------------------
On your CE issue the following two commands to make sure that the new runtime
environment tag and the new software version are published to the information
system:
> /etc/obj/infoproviders start
> /etc/rc.d/init.d/globus-mds restart
No other post-update procedures are needed. Just verify that:
- you can log on to your nodes using the new root password
- the LHCb software has been installed on all your WNs (see if the /opt/lhcb
directory is there)
- the GRIS on your CE is publishing the new LHCb_dbase_common-v3r1 runtime
variable. You can check this from your UI with the command:
> ldapsearch -LLL -x -H ldap://<CE hostname>:2135 -b "mds-vo-name=local,o=grid" \
  "(objectClass=GlueSubCluster)" GlueHostApplicationSoftwareRunTimeEnvironment
where <CE hostname> is your CE node.
Finally, you should execute the LCG1-1_1_2 to LCG1-1_1_3 update procedure
detailed below.
Update procedure from release LCG1-1_1_2 to LCG1-1_1_3
------------------------------------------------------
Before updating the nodes
-------------------------
- Modifications to your site-cfg.h file:
...
#define SITE_EDG_VERSION LCG1-1_1_3
...
#define CE_IP_RUNTIMEENV LCG-1 ALICE-3.09.06 ATLAS-6.0.4 CMKIN-1.1.0 CMKIN-VALID CMSIM-VALID CMS-OSCAR-2.4.5 LHCb_dbase_common-v3r1
...
- Node configuration files should not need any change.
Update Procedure
----------------
As the current update includes a replacement of the kernel on all machines,
we recommend extra care in executing the update and then checking the result
before rebooting the nodes. An error in this phase could easily leave nodes
unbootable and require a complete reinstall.
The recommended update procedure is:
- run mkxprof (or do_mkxprof) for all your nodes and give the LCFG client
program the time to execute the rpm update. At CERN this took ~5 minutes
but this time depends on the number of nodes at your site and on the access
speed of the node where your rpm repository sits. Waiting at least 10 minutes
should put you on the safe side.
- edit the redhat73-cfg.h file in the release source directory and do the
":/tmp" trick, i.e.:
1) if the directories listed in updaterpms.rpmdir at the end of the file
include ":/tmp" at the end, then remove it
2) if the directories listed in updaterpms.rpmdir at the end of the file
do NOT include ":/tmp" at the end, then add it
E.g. for case 1), change
    updaterpms.rpmdir RPMDIR/release:\
    ...
    RPMDIR/apps_common:/tmp
  into
    updaterpms.rpmdir RPMDIR/release:\
    ...
    RPMDIR/apps_common
and for case 2), change
    updaterpms.rpmdir RPMDIR/release:\
    ...
    RPMDIR/apps_common
  into
    updaterpms.rpmdir RPMDIR/release:\
    ...
    RPMDIR/apps_common:/tmp
- run mkxprof (or do_mkxprof) for all your nodes again: this apparently
redundant procedure guarantees that the new kernel rpms are really installed
on the nodes and that the kernel version used by grub is aligned with it
- to be extra-safe you should log on each node and verify that the installed
kernel is really 2.4.20-24.7 and that grub will use it when you reboot the
node:
> rpm -qa | grep kernel
kernel-2.4.20-24.7
kernel-smp-2.4.20-24.7
kernel-source-2.4.20-24.7
> grep kernel /boot/grub/grub.conf
kernel (hd0,0)/vmlinuz-2.4.20-24.7smp root=/dev/hda2
- when you are convinced that the installed kernel and grub agree on the
version to use you can proceed and reboot all your nodes. Reboot order should
take into account NFS mounting. At CERN we first rebooted our CE and SE nodes
and then all our WNs. The other nodes can be rebooted in any order.
After updating the nodes
------------------------
You can verify that your nodes are now running the new kernel with:
> uname -a
Linux lxshare0219.cern.ch 2.4.20-24.7smp #1 SMP Mon Dec 1 13:18:03 EST 2003 i686 unknown
You can also check if the new RunTimeEnvironment tag is published by the GRIS
on your CE with
> ldapsearch -LLL -x -H ldap://<CE hostname>:2135 -b "mds-vo-name=local,o=grid" \
  "(objectClass=GlueSubCluster)" GlueHostApplicationSoftwareRunTimeEnvironment
where <CE hostname> is your CE node.
The information that your WNs now have a different kernel should be published
to the information system of your site. To do this you have to repeat the
procedure described in Appendix D.
Appendix F
==========
Change History
--------------
Release LCG1-1_1_3 (04/12/2003):
- Updated kernel to version 2.4.20-24.7 to fix a critical security bug
- Removed ca_CERN-old-0.19-1 and ca_GermanGrid-0.19-1 rpms as the corresponding
CAs have recently expired
- On user request, added zsh back to the UI rpm list
- Updated myproxy-server-config-static-lcg rpm to recognize the new CERN CA
- Added oscar-dar rpm from CMS to WN
Release LCG1-1_1_2 (25/11/2003):
- Added LHCb software to WN
- Introduced private-cfg.h.template file to handle sensitive settings for the
site (only the encrypted root password, for the moment)
- Added instructions on how to use MD5 encryption for root password
- Added instructions on how to configure http server on the LCFG node to be
accessible only from nodes on site
- Fixed TCP port range setting for Globus on UI
- Removed CERN libraries installation on the UI (added by mistake in release
LCG1-1_1_1)
- Added instructions to increase maximum number of open files on WNs
- Added instructions to correctly set the root password for the MySQL server
on the RB
- Added instructions to configure WNs to use a web proxy for CRL download