===============================================================================
=========================== LCG1 Installation notes ===========================
===============================================================================
=========== C 2003 by Emanuele Leonardi - Emanuele.Leonardi@cern.ch ===========
===============================================================================

Reference tag: LCG1-1_1_4

These notes will assist you in installing the latest LCG1 tag. Read through
them carefully before starting the installation of a new site.

Release LCG1-1_1_4 is a minor update to LCG1-1_1_3 which upgrades the kernel
installed on all machines to version 2.4.20-28.7. This fixes a reported
security bug present in all previous kernel versions. If you are currently
running any previous LCG1-1_1_X release, you can simply update by following
the procedures detailed in Appendix E.

Introduction and overall setup
==============================

In this text we assume that you are already familiar with LCFGng server
installation and management. A detailed guide can be found at

  http://grid-deployment.web.cern.ch/grid-deployment/gis/lcfgng-server73.pdf

Note that by following the procedure described in that document you will
install on your LCFGng server the latest available version of each object.
In some cases this may be incompatible with the object version used on the
LCFGng client nodes. To make sure that the correct version of each object is
installed on your server, you should use the lcfgng_server_update.pl script,
available from CVS. See the chapter "Preparing the installation of current
tag" below for instructions on how to obtain and run this script.

Also note that by following the instructions there, the apache server running
on your LCFG server will allow access from anywhere on the net. This is a
potential security breach, as LCFG node configuration files contain sensitive
information (e.g. encrypted root passwords).
To limit access to nodes belonging to your domain you should edit the apache
configuration file, /etc/httpd/conf/httpd.conf, and apply the following
changes:

- in the section delimited by these two lines:
  ...
  find the line "Allow from all" and replace it with "Allow from <your
  domain>", where <your domain> is your network domain (e.g. at CERN this
  gives "Allow from cern.ch").

- repeat the same operation in the section delimited by lines:
  ...

Files needed for the current LCG1 release are available from a CVS server at
CERN. This CVS server contains the list of rpms to install and the LCFGng
configuration files for each node type. The CVS area, called "lcg1", can be
reached from

  http://lcgapp.cern.ch/cgi-bin/viewcvs/viewcvs.cgi/?cvsroot=lcgdeploy

Note1: at the same location there is another directory called "lcg-release":
this area is used for the integration and certification software, NOT for
production. Just ignore it!

Note2: documentation about access to this CVS repository can be found in

  http://grid-deployment.web.cern.ch/grid-deployment/documentation/cvs-guide

In the same CVS location we created an area for each of the sites
participating in LCG1, e.g. BNL, BUDAPEST, CERN, etc. These directories
(should) contain the configuration files used to install and configure the
nodes at the corresponding site. Site managers are required to keep these
directories up to date by committing all changes they make to their
configuration files back to CVS, so that we can keep track of the status of
each site at any given moment. When a site reaches a consistent working
configuration, site managers can (should) create a tag, which will allow them
to easily recover configuration information if needed. The tag name should
follow the convention described in

  http://grid-deployment.web.cern.ch/grid-deployment/cgi-bin/index.cgi?var=lcg1Status

Note: if you have not done it yet, please get in touch with Louis Poncet or
Markus Schulz to activate your write-enabled account on the CVS server.
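As a sketch, the affected httpd.conf sections typically look like the
following. The <Directory> path and the exact directives are assumptions for
a stock Red Hat 7.3 apache serving the LCFG profile area; apply the change to
whichever sections your own httpd.conf actually contains:

```apache
# Hypothetical example: restrict access to the LCFG web area to cern.ch.
# Adapt the directory path and domain to your site.
<Directory "/var/obj/conf/server/web">
    Order allow,deny
    # Was: Allow from all
    Allow from cern.ch
</Directory>
```

After editing, restart apache (e.g. "/etc/rc.d/init.d/httpd restart") so the
new access rules take effect.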
Given the increasing number of sites joining the LCG production system,
support for installation at new sites is now organized in a hierarchical way:
new secondary (Tier2) sites should direct questions about installation
problems to their reference primary (Tier1) site. Primary sites will then
escalate problems when needed. All site managers must in any case join and
monitor the LCG-Rollout list, where all issues related to the LCG deployment,
including announcements of updates and security patches, are discussed. You
can join this list by going to

  http://cclrclsv.RL.AC.UK/archives/lcg-rollout.html

and clicking on the "Join or leave the list" link.

Preparing the installation of current tag
=========================================

The current LCG1 tag is ---> LCG1-1_1_4 <---

In the following instructions/examples, when you see the <tag> string, you
should replace it with the name of the tag defined above.

To install it, check it out on your LCFG server with

> cvs checkout -r <tag> -d <tag> lcg1

Note: the "-d <tag>" will create a directory named <tag> and copy all the
files there. If you do not specify the -d parameter, the files will go to a
subdirectory of the current directory named lcg1.

The default way to install the tag is to copy the content of the rpmlist
subdirectory to the /opt/local/linux/7.3/rpmcfg directory on the LCFG server.
This directory is NFS-mounted by all client nodes and is visible as
/export/local/linux/7.3/rpmcfg

Now go to the directory where you keep your local configuration files. If you
want to create a new one, you can check out from CVS any of the previous tags
with:

> cvs checkout -r <tag> -d <dir> <site>

If you want the latest (HEAD) version of your config files, just omit the
"-r <tag>" parameter.

Go to <dir>, copy there the template files from <tag>/source
(cfgdir-cfg.h.template, local-cfg.h.template, site-cfg.h.template, and
private-cfg.h.template), rename them cfgdir-cfg.h, local-cfg.h, site-cfg.h,
and private-cfg.h, and edit their content according to the instructions in
the files.
NOTE: if you already have localized versions of these files, just compare them
with the new templates to verify that no new parameter needs to be set.

To download all the rpms needed to install this version you can use the
updaterep command. In <tag>/updaterep you can find 2 configuration files for
this script: updaterep.conf and updaterep_full.conf. The first will tell
updaterep to download only the rpms which are actually needed to install the
current tag, while updaterep_full.conf will do a full mirror of the LCG rpm
repository. Copy updaterep.conf to /etc/updaterep.conf and run the updaterep
command. By default all rpms will be copied to the /opt/local/linux/7.3/RPMS
area, which is visible from the client nodes as /export/local/linux/7.3/RPMS.
You can change the repository area by editing /etc/updaterep.conf and
modifying the REPOSITORY_BASE variable.

IMPORTANT NOTICE: as the list and structure of Certification Authorities (CA)
accepted by the LCG project can change independently of the middleware
releases, the rpm list related to the CA certificates and URLs has been
decoupled from the standard LCG1 release procedure. This means that the
version of the security-rpm.h file contained in the rpmlist directory
associated with the current tag might be incomplete or obsolete. Please go to
URL

  http://grid-deployment.web.cern.ch/grid-deployment/cgi-bin/index.cgi?var=lcg1Status

click on the "LCG1 CAs" link at the bottom of the page, and follow the
instructions there to update all CA-related settings. Changes and updates of
these settings will be announced on the LCG-Rollout mailing list.

To make sure that all the needed object rpms are installed on your LCFG
server, you should use the lcfgng_server_update.pl script, also located in
<tag>/updaterep. This script will report which rpms are missing or have the
wrong version and will create the /tmp/lcfgng_server_update_script.sh script,
which you can then use to fix the server configuration.
Run it in the following way:

> lcfgng_server_update.pl <tag>/rpmlist/lcfgng-common-rpm.h
> /tmp/lcfgng_server_update_script.sh
> lcfgng_server_update.pl <tag>/rpmlist/lcfgng-server-rpm.h
> /tmp/lcfgng_server_update_script.sh

WARNING: please always take a look at /tmp/lcfgng_server_update_script.sh and
verify that all rpm update commands look reasonable before running it.

In the source directory you should take a look at the redhat73-cfg.h file and
check whether the locations of the rpm lists (updaterpms.rpmcfgdir) and of
the rpm repository (updaterpms.rpmdir) are correct for your site (the
defaults are consistent with the instructions in this document). If needed,
you can redefine these paths from the local-cfg.h file.

In private-cfg.h you can (must!) replace the default root password with the
one you want to use for your site:

+auth.rootpwd <crypted_pwd>   <--- replace with your own crypted password

To obtain <crypted_pwd> using the MD5 encryption algorithm (stronger than the
standard crypt method) you can use the following command:

> openssl passwd -1

This command will prompt you for the clear text version of the password and
then print the encrypted version. E.g.

> openssl passwd -1
Password:                              <- write clear text password here
$1$iPJJEhjc$rtV/65l890BaPinzkb58z1     <- <crypted_pwd> string

To finalize the adaptation of the current tag to your site you should edit
your site-cfg.h file. You can use the site-cfg.h.template file in the source
directory as a starting point. If you already have a site-cfg.h file that you
used to install the LCG1-1_1_1 release, you can find a detailed description
of the modifications needed for the new tag in Appendix E below.

WARNING 1: the template file site-cfg.h.template assumes you want to run the
PBS batch system without sharing the /home directory between the CE and all
the WNs. This is the highly recommended setup. If for some reason you want to
run PBS in traditional mode, i.e.
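The same MD5 crypt string can also be generated non-interactively, which is
handy in scripts. The password and salt below are arbitrary examples; with no
-salt option openssl picks a random salt, so two runs produce different but
equally valid strings:

```shell
# Generate an MD5 crypt string suitable for +auth.rootpwd.
# "MySecretPw" and the salt "iPJJEhjc" are arbitrary examples.
hash=$(openssl passwd -1 -salt iPJJEhjc MySecretPw)
echo "$hash"

# An MD5 crypt string always has the form $1$<salt>$<hash>.
case "$hash" in
  '$1$iPJJEhjc$'*) echo "format OK" ;;
  *)               echo "unexpected format" ;;
esac
```

Remember to keep the clear-text password out of your shell history if you use
this form on a shared machine.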
with the CE exporting /home via NFS and all the WNs mounting it, you should
edit your site-cfg.h file and comment out the following two lines:

#define NO_HOME_SHARE
...
#define CE_JM_TYPE lcgpbs

In addition to this, your WN configuration file should include this line:

#include CFGDIR/UsersNoHome-cfg.h"

just after including Users-cfg.h (please note that BOTH Users-cfg.h AND
UsersNoHome-cfg.h must be included).

WARNING 2: note that, even if you are not running an RB service at your site,
you must nonetheless set the SITE_BDII_HOST and SITE_BDII_PORT variables to
point to an existing BDII node. This is needed because the edg-rm client,
installed on WNs and UIs, needs it as an access point to the information
system. If you cannot locate a BDII either at your site or at a nearby
primary site, you are welcome to use the BDII at CERN. At the time of writing
this is done with:

#define SITE_BDII_HOST lxshare0222.cern.ch
#define SITE_BDII_PORT 2170

but please verify on the RollOut list which node is current.

WARNING 3: in the current default configuration the "file" protocol access to
the SE is enabled. This means that SE and WN nodes must share a disk area
called /flatfiles/SE00. This is where the various VOs will store their files,
each VO using a different subdirectory named after the VO itself (e.g.
/flatfiles/SE00/atlas), plus an extra directory named "data" (i.e.
/flatfiles/SE00/data). If you are using an external file server to hold this
area, mounting it both on the SE and on the WNs, then you should create the
<VO> and "data" subdirectories yourself and set the ownership as root:<VO>
(or root:root for "data") and the access as 775, i.e.
> ls -l /flatfiles/SE00
drwxrwxr-x    3 root  alice  4096 Sep  2 16:13 alice
drwxrwxr-x    3 root  atlas  4096 Sep  2 16:13 atlas
drwxrwxr-x    3 root  cms    4096 Sep  2 16:13 cms
drwxrwxr-x    3 root  root   4096 Sep  2 16:13 data
drwxrwxr-x    3 root  dteam  4096 Sep  2 16:13 dteam
drwxrwxr-x    3 root  lhcb   4096 Sep  2 16:13 lhcb

If on the other hand you want to keep the /flatfiles/SE00 area on the local
disk of the SE, then you can tell LCFG to create the whole structure by
including the following line in your SE node configuration file:

#include CFGDIR/flatfiles-dirs-SECLASSIC-cfg.h"

Note: the line should be inserted close to where you include
StorageElement-cfg.h, but the exact position is not important.

Node installation and configuration
===================================

In your site-specific directory you should already have the do_mkxprof.sh
script. If you do not find it, you can check out the one in the CERN CVS area
with

> cvs checkout CERN/do_mkxprof.sh

This script just calls the mkxprof command, specifying that it should look
for configuration files starting from the current directory (option "-S.").
Feel free to use your preferred call to the mkxprof command, but note that
running mkxprof as a daemon is NOT recommended and can easily lead to massive
catastrophes if not used with extreme care: do it at your own risk.

To create the LCFG configuration for one or more nodes you can do

> ./do_mkxprof.sh node1 [node2 node3 ...]

If you get an error status for one or more of the configurations, you can get
a detailed report on the nature of the error by looking into URL

  http://<your LCFG server>/status/

and clicking on the name of the node with a faulty configuration (a small red
bug should be shown beside the node name).

Once all node configurations are correctly published, you can proceed and
install your nodes following any one of the installation procedures described
in the "LCFGng Server Installation Guide" mentioned above.
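Going back to WARNING 3 above: when the /flatfiles/SE00 area lives on an
external file server and you must create the tree by hand, the steps can be
sketched as follows. BASE points to a scratch directory here so the sketch
can run unprivileged; on a real SE it would be /flatfiles/SE00, and the
commented chown lines would be run as root:

```shell
# Sketch: create the VO and "data" subdirectories with mode 775.
# BASE is a stand-in for /flatfiles/SE00 so this can run as a normal user.
BASE=${BASE:-/tmp/flatfiles-demo/SE00}

for vo in alice atlas cms dteam lhcb; do
    mkdir -p "$BASE/$vo"
    chmod 775 "$BASE/$vo"
    # chown root:$vo "$BASE/$vo"    # requires root; group = VO name
done

mkdir -p "$BASE/data"
chmod 775 "$BASE/data"
# chown root:root "$BASE/data"      # requires root

ls -l "$BASE"
```

The VO list matches the one shown in the listing above; extend it if your
site supports additional VOs.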
When the initial installation completes (expect two automatic reboots in the
process), each node type requires a few manual steps, detailed below, to be
completely configured. After completing these steps, some of the nodes need a
final reboot, which will bring them up with all the needed services active.
The need for this final reboot is explicitly stated among the node
configuration steps below.

Note about UI installation: the current default for a UI node is to NOT
install the CERN library rpms (these were added by mistake to release
LCG1-1_1_1). If you wish to make the CERN libraries available from your UI,
you should edit the UI rpm list, UI-rpm, in the /opt/local/linux/7.3/rpmcfg
directory (this is the directory where you copied all your rpm lists), and
uncomment line

/* #include "apps_common-rpm.h" */

Common steps
------------

-- On the ResourceBroker, MyProxy, ComputingElement, and StorageElement nodes
   you should install the host certificate/key files in /etc/grid-security
   with names hostcert.pem and hostkey.pem. Also make sure that hostkey.pem
   is only readable by root with

   > chmod 400 /etc/grid-security/hostkey.pem

-- All Globus services grant access to LCG users according to the list of
   certificates contained in the /etc/grid-security/grid-mapfile file. The
   list of VOs included in grid-mapfile is defined in
   /opt/edg/etc/edg-mkgridmap.conf. By default all VOs accepted in LCG are
   included in this list. You can prevent VOs from accessing your site by
   commenting out the corresponding line in edg-mkgridmap.conf. E.g. by
   commenting out line

   group ldap://grid-vo.nikhef.nl/ou=lcg1,o=alice,dc=eu-datagrid,dc=org .alice

   on your CE you will prevent users in the Alice VO from submitting jobs to
   your site.
After installing a ResourceBroker, ComputingElement, or StorageElement node
and modifying (if needed) the local edg-mkgridmap.conf file, you may force a
first creation of the grid-mapfile by running

> /opt/edg/sbin/edg-mkgridmap --output /etc/grid-security/grid-mapfile --safe

Every 6 hours a cron job will repeat this procedure and update grid-mapfile.

Important Notice: if your site is not supporting one or more of the LCG
standard VOs, please make sure you comment out the corresponding lines in
edg-mkgridmap.conf as described above. Failure to do this will result in jobs
from unsupported VOs being submitted to your site and then failing, thus
creating a grid "black hole" (jobs fall in and never come back).

UserInterface
-------------

No additional configuration steps are currently needed on a UserInterface
node.

ResourceBroker
--------------

-- Configure the MySQL database. See detailed recipe in Appendix C at the end
   of this document.

-- Reboot the node.

ComputingElement
----------------

-- Configure the PBS server. See detailed recipe in Appendix B at the end of
   this document.

-- Create the first version of the /etc/ssh/ssh_known_hosts file by running

   > /opt/edg/sbin/edg-pbs-knownhosts

   A cron job will update this file every 6 hours.

-- If your CE is NOT sharing the /home directory with your WNs (this is the
   default configuration) then you have to configure sshd to allow WNs to
   copy job output back to the CE using scp. This requires the following two
   steps:

   1) modify the sshd configuration. Edit the /etc/ssh/sshd_config file and
      add these lines at the end:

      HostbasedAuthentication yes
      IgnoreUserKnownHosts yes
      IgnoreRhosts yes

      and then restart the server with

      > /etc/rc.d/init.d/sshd restart

   2) configure the script enabling WNs to copy output back to the CE.

      - in /opt/edg/etc, copy edg-pbs-shostsequiv.conf.template to
        edg-pbs-shostsequiv.conf, then edit this file and change parameters
        to your needs. Most sites will only have to set NODES to an empty
        string.
      - create the first version of the /etc/ssh/shosts.equiv file by running

        > /opt/edg/sbin/edg-pbs-shostsequiv

        A cron job will update this file every 6 hours.

   Note: every time you add or remove WNs, do not forget to run

   > /opt/edg/sbin/edg-pbs-shostsequiv   <--- only if you do not share /home
   > /opt/edg/sbin/edg-pbs-knownhosts

   on the CE, or the new WNs will not work correctly until the next time cron
   runs them for you.

-- The CE is supposed to export information about the hardware configuration
   (i.e. CPU power, memory, disk space) of the WNs. The procedure to collect
   this information and publish it is described in Appendix D of this
   document.

-- Reboot the node.

-- If your CE exports the /home area to all WNs, then after rebooting it make
   sure that all WNs can still see this area. If this is not the case,
   execute this command on all WNs:

   > /etc/obj/nfsmount restart

StorageElement
--------------

-- Make sure that all subdirectories in /flatfiles/SE00 were correctly
   created (see WARNING notice at the end of the "Preparing the installation
   of current tag" section).

-- Reboot the node.

-- If your SE exports the /flatfiles/SE00 area to all WNs, then after
   rebooting the node make sure that all WNs can still see this area. If this
   is not the case, execute this command on all WNs:

   > /etc/obj/nfsmount restart

WorkerNode
----------

-- The default maximum number of open files allowed on a RedHat node is only
   26213. This number might be too small if users submit file-hungry jobs (we
   already had one case), so you may want to increase it on your WNs. At CERN
   we currently use 256000.
   To set this parameter you can use this command:

   > echo 256000 > /proc/sys/fs/file-max

   You can make this setting reboot-proof by adding the following code at the
   end of your /etc/rc.d/rc.local file:

   # Increase max number of open files
   if [ -f /proc/sys/fs/file-max ]; then
       echo 256000 > /proc/sys/fs/file-max
   fi

-- Every 6 hours each WN needs to connect to the web sites of all known CAs
   to check if a new CRL (Certificate Revocation List) is available. As the
   script which handles this functionality uses wget to retrieve the new CRL,
   you can direct your WNs to use a web proxy. This is mandatory if your WNs
   sit on a hidden network with no direct external connectivity.

   To redirect your WNs to use a web proxy you should edit the /etc/wgetrc
   file and add a line like:

   http_proxy = http://web_proxy.cern.ch:8080/

   where you should replace the node name and the port to match those of your
   web proxy.

   Note: I could not test this recipe directly as I am not aware of a web
   proxy at CERN. If you try it and find problems, please post a message on
   the lcg-rollout list.

-- If your WNs are NOT sharing the /home directory with your CE (this is the
   default configuration) then you have to configure ssh to enable them to
   copy job output back to the CE using scp. To this end you have to modify
   the ssh client configuration file /etc/ssh/ssh_config, adding these lines
   at the end:

   Host *
       HostbasedAuthentication yes

   Note: the "Host *" line might already exist. In this case, just add the
   second line after it.

-- Create the first version of the /etc/ssh/ssh_known_hosts file by running

   > /opt/edg/sbin/edg-pbs-knownhosts

   A cron job will update this file every 6 hours.

BDII Node
---------

To avoid having a single top MDS which could be easily overloaded, we have
split the information system into several (currently three) independent
information regions, each served by one or more regional MDSes.
For this scheme to work, each and every BDII on the GRID must know the names
of all region MDSes and merge information coming from them into a single
database. All software needed to handle this data collection is now included
in an rpm installed on the BDII node. To configure it you have to:

- go to the /opt/edg/etc directory
- copy bdii-cron.conf.template to bdii-cron.conf

The template file included in the current tag only knows about two regions,
so you will have to add the third one by hand. The correct configuration as
of January 30th, 2004, is:

MDS_HOST_LIST="
adc0026.cern.ch:2135/lcg00108.grid.sinica.edu.tw:2135
lcgcs01.gridpp.rl.ac.uk:2135
wn-02-37-a.cr.cnaf.infn.it:2135/pic115.ifae.es:2135
"

Should new Regional MDS nodes appear, they will be announced on the
lcg-rollout mailing list. In this case you will have to edit bdii-cron.conf
and add them to the correct group. You can find a description of the syntax
for the MDS_HOST_LIST variable in Appendix A of this document. If in doubt,
send e-mail to the lcg-rollout mailing list asking for the correct setting of
MDS_HOST_LIST for your site.

Regional MDS Node
-----------------

No additional configuration steps are currently needed on a Regional MDS
node.

Note: if your site is hosting a Regional MDS node, once in a while you will
be notified of new sites joining your region. In this case, on the LCFGng
server you should modify the node configuration file for your Regional MDS
and add a line like:

EXTRA(globuscfg.allowedRegs_topmds) <site GIIS>:2135

where <site GIIS> is the hostname of the node hosting the GIIS for the new
site. Then you must update your Regional MDS node with mkxprof.

MyProxy Node
------------

-- Reboot the node after installing the host certificates (see "Common steps"
   above).

Testing
-------

IMPORTANT NOTICE: if /home is NOT shared between CE and WNs (this is the
default configuration), due to the way the new jobmanager works, a
globus-job-run command will take at least 2 minutes.
Even in the configuration with shared /home, the execution time of
globus-job-run will be slightly longer than before. Keep this in mind when
testing your system.

To perform the standard tests (edg-job-submit & co.) you need to have your
certificate registered in one VO and to sign the LCG usage guidelines.
Detailed information on how to do these two steps can be found in:

  http://lcg-registrar.cern.ch/

If you are working in one of the four LHC experiments, then ask for
registration in the corresponding VO; otherwise you can choose the "LCG
Deployment Team" (aka DTeam) VO.

A test suite which will help you in making sure your site is correctly
configured is now available. This software provides basic functionality tests
and various utilities to run automated sequences of tests and to present
results in a common HTML format. Extensive on-line documentation about this
test suite can be found in

  http://grid-deployment.web.cern.ch/grid-deployment/tstg/docs/LCG-Certification-help

Experiment software
-------------------

While waiting for the final agreement on a flexible, experiment-controlled
system to install and certify VO-specific application software, we included
in the WN rpm list a few rpms for the Alice, Atlas, CMS, and LHCb
collaborations, plus the full CERN libraries, to allow an initial use of the
system. With LCG version 2 we will switch to the new system and no
VO-specific software will be included in the LCG distribution anymore.

Appendix A
==========

Syntax for the MDS_HOST_LIST variable
-------------------------------------

The LCG information system is currently partitioned into several independent
regions, each served by one or more top level MDS servers. To collect a full
set of information, the BDII should contact one MDS server per region and
download from it all the information relative to that region.
The correct setting for MDS_HOST_LIST will then consist of a list of MDS
servers organized into several groups, each group corresponding to a
different region. The BDII will scan through this list, looping over the
different groups. For each group it will try to get the information from the
first MDS server in the group. If the server does not answer, the BDII will
try the second, etc. As soon as one of the servers in the group answers and
provides information about the region, the BDII will switch to the next group
and repeat the same operation. If none of the servers in a group answers, no
information for the corresponding region will be retrieved and the whole
region will disappear from the information available to the Resource Broker.

The syntax for MDS_HOST_LIST is composed of a list of servers organized into
several groups. Groups of servers are separated by one or more "white space"
characters (i.e. space, tab, LF), while servers belonging to the same group
are separated by a "/" (slash) character. A server name can be followed by a
":" (colon) character and the slapd port number. If the slapd port number is
the default one (2135) it can be omitted from the server specification.

Example:

MDS_HOST_LIST="
server1/server2:2136/server3
server4/server5:2135
server6
"

In this example the grid is organized in three regions. The first region is
served by three servers: server1, server2, and server3. On server2 slapd is
listening on the (non-standard) port 2136. The second region is served by two
servers: server4 and server5. In this case the specification of the port for
server5 is not really needed. The third region is served by a single server,
server6, where slapd is listening on the standard port 2135.

N.B.: the names of the servers must be in a format usable by the BDII to
contact them. Both numeric (e.g. 141.108.5.5) and text (e.g. adc0026.cern.ch)
formats are correct.
If the MDS server is local to the BDII, then the network name can be omitted,
but in general this is not recommended as it might induce confusion if
someone wants to use the same configuration from a different site.

N.B.: as explained before, the order in which the servers appear in a group
is relevant, as this is the order in which the BDII will query them, stopping
after the first successful contact. This means that listing first the MDS
servers which are closest to your site, from the network efficiency point of
view, may improve the update time of your BDII. Also, having different sites
use different query orderings results in a statistical balancing of the load
on the MDS servers of the same region, thus improving the general scalability
of the information system. If you are in doubt about the best way to sort the
MDS servers for any of the regions, please ask on the LCG-Rollout mailing
list.

At the time of writing (November 5th, 2003) three regions are active:
lcgeast, lcgwest, and lcgsouth.

- lcgeast is served by two MDS servers, one located at CERN and one in Taipei
- lcgwest is served by one MDS server located at RAL
- lcgsouth is served by two MDS servers, one located at PIC in Barcelona and
  one at CNAF in Bologna

The correct setting for the MDS_HOST_LIST variable is then:

MDS_HOST_LIST="
adc0026.cern.ch:2135/lcg00108.grid.sinica.edu.tw:2135
lcgcs01.gridpp.rl.ac.uk:2135
wn-02-37-a.cr.cnaf.infn.it:2135/pic115.ifae.es:2135
"

Please choose the order between adc0026.cern.ch and
lcg00108.grid.sinica.edu.tw, and that between wn-02-37-a.cr.cnaf.infn.it and
pic115.ifae.es, according to your site connectivity: servers that are better
connected to your site should appear first in the lists. In all cases the
":2135" port specification can be omitted.
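The group/server syntax described in this appendix can be sketched in shell.
This is only an illustration of the parsing rules, not the actual BDII code;
probe_server is a hypothetical stand-in for the ldapsearch-based query the
real BDII performs:

```shell
# Illustration of MDS_HOST_LIST parsing: one group per whitespace-separated
# token, servers within a group separated by "/", optional ":port" suffix
# (default slapd port 2135).
MDS_HOST_LIST="
server1/server2:2136/server3
server4/server5:2135
server6
"

probe_server() {   # hypothetical stand-in; args: host port
    echo "querying $1 on port $2"
    return 0       # pretend the first server always answers
}

for group in $MDS_HOST_LIST; do            # word-splitting separates groups
    for server in $(echo "$group" | tr '/' ' '); do
        host=${server%%:*}
        case "$server" in
            *:*) port=${server##*:} ;;
            *)   port=2135 ;;              # default slapd port
        esac
        # stop at the first server in the group that answers
        if probe_server "$host" "$port"; then
            break
        fi
    done
done
```

With the example list above, only server1, server4, and server6 are queried,
since each is the first responding server of its group.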
Appendix B
==========

How to configure the PBS server on a ComputingElement
-----------------------------------------------------

1) load the server configuration with this command (replace <CE hostname>
   with the hostname of the CE you are installing):

@-----------------------------------------------------------------------------
/usr/bin/qmgr <<EOF
set server operators = root@<CE hostname>
set server default_queue = short
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server scheduler_iteration = 600
set server default_node = lcgpro
set server node_pack = False
create queue short
set queue short queue_type = Execution
set queue short resources_max.cput = 00:15:00
set queue short resources_max.walltime = 02:00:00
set queue short enabled = True
set queue short started = True
create queue long
set queue long queue_type = Execution
set queue long resources_max.cput = 12:00:00
set queue long resources_max.walltime = 24:00:00
set queue long enabled = True
set queue long started = True
create queue infinite
set queue infinite queue_type = Execution
set queue infinite resources_max.cput = 48:00:00
set queue infinite resources_max.walltime = 72:00:00
set queue infinite enabled = True
set queue infinite started = True
EOF
@-----------------------------------------------------------------------------

Note that queues short, long, and infinite are those defined in the
site-cfg.h file, and the time limits are those in use at CERN. Feel free to
add/remove/modify them to your liking, but do not forget to modify site-cfg.h
accordingly.

2) edit file /var/spool/pbs/server_priv/nodes to add the list of WorkerNodes
   you plan to use.
   CERN settings are:

@-----------------------------------------------------------------------------
lxshare0223.cern.ch np=2 lcgpro
lxshare0224.cern.ch np=2 lcgpro
lxshare0225.cern.ch np=2 lcgpro
lxshare0226.cern.ch np=2 lcgpro
lxshare0227.cern.ch np=2 lcgpro
lxshare0228.cern.ch np=2 lcgpro
lxshare0249.cern.ch np=2 lcgpro
lxshare0250.cern.ch np=2 lcgpro
lxshare0372.cern.ch np=2 lcgpro
lxshare0373.cern.ch np=2 lcgpro
@-----------------------------------------------------------------------------

   where np=2 gives the number of job slots (usually equal to #CPUs)
   available on the node, and lcgpro is the group name as defined in the
   default_node parameter in the server configuration.

3) restart the PBS server

   > /etc/rc.d/init.d/pbs_server restart

Appendix C
==========

How to configure the MySQL database on a ResourceBroker
-------------------------------------------------------

Connect to your RB node, represented by <your RB> in the example, and make
sure that the mysql server is up and running:

> /etc/rc.d/init.d/mysql start

If it was already running you will just get notified of the fact.

Now choose a DB management password you like (write it down somewhere!),
represented by <password> in the example, and then configure the server with
the following commands:

> mysqladmin password <password>
> mysql --password=<password> \
        --exec "set password for root@<your RB>=password('<password>')" mysql
> mysqladmin --password=<password> create lbserver20
> mysql --password=<password> lbserver20 < /opt/edg/etc/server.sql
> mysql --password=<password> \
        --exec "grant all on lbserver20.* to lbserver@localhost" lbserver20

Note that the database name "lbserver20" is hardwired in the LB server code
and cannot be changed, so use it exactly as shown in the commands.

Appendix D
==========

Publishing WN information from the CE
-------------------------------------

When submitting a job, users of LCG are supposed to state in their jdl the
minimal hardware resources (memory, scratch disk space, CPU time) required
to run the job.
These requirements are matched by the RB with the information on the BDII to
select a set of available CEs where the job can run. For this scheme to work,
each CE must publish some information about the hardware configuration of the
WNs connected to it. This means that site managers must collect information
about the WNs available at the site and insert it in the information
published by the local CE.

The procedure to do this is the following:

- choose a WN which is "representative" of your batch system (see below for a
  definition of "representative") and make sure that the chosen node is fully
  installed and configured. In particular, check that all expected NFS
  partitions are correctly mounted.

- on the chosen WN run the following script as root, saving the output to a
  file:

@-----------------------------------------------------------------------------
#!/bin/bash
echo -n 'hostname: '
host `hostname -f` | sed -e 's/ has address.*//'
echo "Dummy: `uname -a`"
echo "OS_release: `uname -r`"
echo "OS_version: `uname -v`"
cat /proc/cpuinfo /proc/meminfo /proc/mounts
df
@-----------------------------------------------------------------------------

- copy the obtained file to /opt/edg/var/info/edg-scl-desc.txt on your CE,
  replacing any pre-existing version.

- restart the GRIS on the CE with

  > /etc/rc.d/init.d/globus-mds restart

Definition of "representative WN": in general, WNs are added to a batch
system at different times and with heterogeneous hardware configurations. All
these WNs often end up being part of a single queue, so that when an LCG job
is sent to the batch system, there is no way to ask for a specific hardware
configuration (note: LSF and other batch systems offer ways to do this, but
the current version of the Globus gatekeeper is not able to take advantage of
this possibility). This means that the site manager has to choose a single WN
as "representative" of the whole batch cluster.
In general it is recommended that this node be chosen among the "least
powerful" ones, to avoid sending jobs with heavy hardware requirements to
under-spec nodes.

Appendix E
==========

Update procedure from release LCG1-1_1_1 to LCG1-1_1_4
------------------------------------------------------

Fixes to apply independently from the main update
-------------------------------------------------

- You should limit access to the apache server running on the LCFG node so
  that it accepts connections only from nodes which are on the local site.
  The procedure for this is described in the "Introduction and overall
  setup" chapter.

- The configuration of the MySQL database server on the RB has changed to
  improve security: connecting as the root user of the db now always
  requires a password (this was not the case with the old settings). If you
  already configured the MySQL server on your RB following the old
  instructions, then you should connect to your RB node and issue the
  following command:

  > mysql --password=<password> \
          --exec "set password for root@<RB_node>=password('<password>')" mysql

  where <password> is the administrative password you gave when you
  configured your MySQL server (you DID write it down, didn't you?) and
  <RB_node> is the full name of your RB, e.g. lxshare0380.cern.ch at CERN.

- You should increase the maximum number of open files on WNs following the
  instructions in the WorkerNode section of the "Node installation and
  configuration" chapter.

- You can instruct WNs to use a web proxy when downloading new CRLs. The
  procedure for this is described in the WorkerNode section of the "Node
  installation and configuration" chapter.

Before updating the nodes
-------------------------

- Modifications to your site-cfg.h file:

  ...
  #define SITE_EDG_VERSION LCG1-1_1_4
  ...
  #define CE_IP_RUNTIMEENV LCG-1 ALICE-3.09.06 ATLAS-6.0.4 CMKIN-1.1.0 CMKIN-VALID CMSIM-VALID CMS-OSCAR-2.4.5 LHCb_dbase_common-v3r1
  ...
- In your local-cfg.h replace the section defining the encrypted root
  password with the following lines (see local-cfg.h.template for an
  example):

  /* Include settings for sensitive parameters from a non-CVS file */
  #include "private-cfg.h"

- Create a new file named private-cfg.h using private-cfg.h.template as an
  example and insert there the root password for your site, encrypted as
  explained in the "Preparing the installation of current tag" section. It
  is recommended that you use a new root password.

- See the note in the "Node installation and configuration" section about
  the availability of the CERN libraries on your UI.

- Node configuration files should not need any change.

Update Procedure
----------------

Run mkxprof (or do_mkxprof) for all your nodes. You may want to verify that
the nodes are actually updated by looking at the /var/obj/log/client file on
the nodes themselves.

After updating the nodes
------------------------

On your CE issue the following two commands to make sure that the new
runtime environment tag and the new software version are published to the
information system:

> /etc/obj/infoproviders start
> /etc/rc.d/init.d/globus-mds restart

No other post-update procedures are needed. Just verify that:

- you can log on to your nodes using the new root password

- the LHCb software has been installed on all your WNs (check that the
  /opt/lhcb directory is there)

- the GRIS on your CE is publishing the new LHCb_dbase_common-v3r1 runtime
  variable. You can check this from your UI with the command:

  > ldapsearch -LLL -x -H ldap://<CE_node>:2135 -b "mds-vo-name=local,o=grid" \
        "(objectClass=GlueSubCluster)" GlueHostApplicationSoftwareRunTimeEnvironment

  where <CE_node> is your CE node.

Finally, you should execute the LCG1-1_1_2/3 to LCG1-1_1_4 update procedure
detailed below.

Update procedure from release LCG1-1_1_2/3 to LCG1-1_1_4
--------------------------------------------------------

Before updating the nodes
-------------------------

- Modifications to your site-cfg.h file:

  ...
  #define SITE_EDG_VERSION LCG1-1_1_4
  ...
  #define CE_IP_RUNTIMEENV LCG-1 ALICE-3.09.06 ATLAS-6.0.4 CMKIN-1.1.0 CMKIN-VALID CMSIM-VALID CMS-OSCAR-2.4.5 LHCb_dbase_common-v3r1
  ...

- Node configuration files should not need any change.

Update Procedure
----------------

As the current update includes a replacement of the kernel on all machines,
we recommend extra care in executing the update and in checking the result
before rebooting the nodes. An error in this phase could easily make nodes
un-bootable and require a complete reinstall.

The recommended update procedure is:

- run mkxprof (or do_mkxprof) for all your nodes and give the LCFG client
  program the time to execute the rpm update. At CERN this took ~5 minutes,
  but this time depends on the number of nodes at your site and on the
  access speed of the node where your rpm repository sits. Waiting at least
  10 minutes should put you on the safe side.

- edit the redhat73-cfg.h file in the release source directory and apply the
  ":/tmp" trick, i.e.:

  1) if the directories listed in updaterpms.rpmdir at the end of the file
     include ":/tmp" at the end, then remove it

  2) if the directories listed in updaterpms.rpmdir at the end of the file
     do NOT include ":/tmp" at the end, then add it

  E.g. for case 1):

  updaterpms.rpmdir RPMDIR/release:\        updaterpms.rpmdir RPMDIR/release:\
  ...                                --->   ...
  RPMDIR/apps_common:/tmp                   RPMDIR/apps_common

  and for case 2):

  updaterpms.rpmdir RPMDIR/release:\        updaterpms.rpmdir RPMDIR/release:\
  ...                                --->   ...
  RPMDIR/apps_common                        RPMDIR/apps_common:/tmp

- run mkxprof (or do_mkxprof) for all your nodes again: this apparently
  redundant procedure guarantees that the new kernel rpms are really
  installed on the nodes and that the kernel version used by grub is aligned
  with them.

- to be extra safe, you should log on to each node and verify that the
  installed kernel really is 2.4.20-28.7 and that grub will use it when you
  reboot the node:

  > rpm -qa | grep kernel
  kernel-2.4.20-28.7
  kernel-smp-2.4.20-28.7
  kernel-source-2.4.20-28.7

  > grep kernel /boot/grub/grub.conf
          kernel (hd0,0)/vmlinuz-2.4.20-28.7smp root=/dev/hda2

- when you are convinced that the installed kernel and grub agree on the
  version to use, you can proceed and reboot all your nodes. The reboot
  order should take into account NFS mounting: at CERN we first rebooted our
  CE and SE nodes and then all our WNs. The other nodes can be rebooted in
  any order.

After updating the nodes
------------------------

You can verify that your nodes are now running the new kernel with:

> uname -a
Linux lxshare0219.cern.ch 2.4.20-28.7smp #1 SMP Mon Dec 1 13:18:03 EST 2003 i686 unknown

The information that your WNs now have a different kernel should be
published to the information system of your site. To do this you have to
repeat the procedure described in Appendix D.
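The kernel-versus-grub consistency check described in the update procedure
above can be scripted if you have many nodes. The sketch below is only an
illustration, not part of the official procedure: it assumes the standard
Red Hat 7.3 /boot/grub/grub.conf layout and, for the demo at the bottom, uses
a hypothetical sample file under /tmp instead of the real one.

```shell
#!/bin/sh
# Sketch: verify that some "kernel" entry in a grub.conf references the
# expected kernel version. Note it checks all kernel lines, not only the
# default entry, so inspect the file by hand if it reports a mismatch.
EXPECTED="2.4.20-28.7"

check_grub_kernel() {
    # $1 = path to grub.conf, $2 = expected kernel version
    if grep "^[[:space:]]*kernel" "$1" | grep -q "vmlinuz-$2"; then
        echo "OK: grub will boot kernel $2"
        return 0
    else
        echo "WARNING: no grub entry matches kernel $2"
        return 1
    fi
}

# Demo on a sample file; on a real node point it at /boot/grub/grub.conf
cat > /tmp/grub.conf.sample <<'EOF'
default=0
title Red Hat Linux (2.4.20-28.7smp)
        root (hd0,0)
        kernel (hd0,0)/vmlinuz-2.4.20-28.7smp root=/dev/hda2
        initrd /initrd-2.4.20-28.7smp.img
EOF

check_grub_kernel /tmp/grub.conf.sample "$EXPECTED"
```

Run it on each node after the second mkxprof pass; only reboot once it
reports OK everywhere.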
Appendix F
==========

Change History
--------------

Release LCG1-1_1_4 (30/01/2004):

- Updated kernel to version 2.4.20-28.7 to fix a security bug

Release LCG1-1_1_3 (04/12/2003):

- Updated kernel to version 2.4.20-24.7 to fix a critical security bug
- Removed ca_CERN-old-0.19-1 and ca_GermanGrid-0.19-1 rpms as the
  corresponding CAs have recently expired
- On user request, added zsh back to the UI rpm list
- Updated myproxy-server-config-static-lcg rpm to recognize the new CERN CA
- Added oscar-dar rpm from CMS to WN

Release LCG1-1_1_2 (25/11/2003):

- Added LHCb software to WN
- Introduced the private-cfg.h.template file to handle sensitive settings
  for the site (only the encrypted root password, for the moment)
- Added instructions on how to use MD5 encryption for the root password
- Added instructions on how to configure the http server on the LCFG node
  to be accessible only from nodes on site
- Fixed TCP port range setting for Globus on UI
- Removed CERN libraries installation from the UI (added by mistake in
  release LCG1-1_1_1)
- Added instructions to increase the maximum number of open files on WNs
- Added instructions to correctly set the root password for the MySQL
  server on the RB
- Added instructions to configure WNs to use a web proxy for CRL download