===============================================================================
=========================== LCG1 Installation notes ===========================
===============================================================================
========== (C) 2003 by Emanuele Leonardi - Emanuele.Leonardi@cern.ch ==========
===============================================================================

These notes will assist you in installing the latest LCG1 tag. Please be aware
that this file has been extensively modified and rewritten with respect to the
previous version, so please read through it carefully before starting the
installation.

If you have already installed the pre-release version of the LCG1 code, i.e.
tag lcg1_20030717_1455 in the LCG CVS repository, you may choose whether to
re-install all your nodes or to update your previous configuration. Although
both procedures should work, given the amount of changes in the software and
in the layout of the site, and the level of testing of the two procedures, we
recommend a complete re-install of the nodes: this takes at most marginally
more time than updating them and guarantees that no relics from the previous
configuration are left to confuse future checks and tests.

Introduction and overall setup
==============================

In this text we assume that you are already familiar with the LCFGng server
installation and management. Detailed guides are at

http://datagrid.in2p3.fr/distribution/datagrid/wp4/edg-lcfg/documentation

Recommended readings are:

- EDG LCFGng tutorial (basic LCFGng concepts are explained here)
- EDG LCFG(ng) server installation cookbook using RedHat 7.3
- EDG WP4 LCFGng FAQ

To help with this first install we created a CVS area at CERN containing the
list of rpms to install and the LCFGng configuration files for each node type.
This area, called "lcg1", is topologically equivalent to the edg-release area
in the EDG CVS repository and can be reached from

http://lcgapp.cern.ch/cgi-bin/viewcvs/viewcvs.cgi/?cvsroot=lcgdeploy

Note1: at the same location there is another directory called lcg-release:
       just ignore it!
Note2: documentation about access to this CVS repository can be found in
       http://grid-deployment.web.cern.ch/grid-deployment/documentation/cvs-guide

In the same CVS location we created an area for each of the sites
participating in LCG1, e.g. BNL, BUDAPEST, CERN, etc. These directories
(should) contain the configuration files used to install and configure the
nodes at each site. Site managers are required to keep these directories
up-to-date by committing all changes they make to their configuration files
back to CVS, so that they can keep track of the status of each site at any
given moment. When a site reaches a consistent working configuration, site
managers can (should) create a tag, which will allow them to easily recover
the configuration information if needed. The tag name should follow the
convention described in

http://grid-deployment.web.cern.ch/grid-deployment/cgi-bin/index.cgi?var=lcg1Status

Note: if you have not done it yet, please get in touch with Louis Poncet or
Markus Schulz to activate your write-enabled account on the CVS server.

Questions and problems you may encounter during the installation should be
addressed to the LCG-Rollout list. To join this list, go to

http://cclrclsv.RL.AC.UK/archives/lcg-rollout.html

and click on the "Join or leave the list" link.
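Purely as an illustration of this commit-and-tag workflow (the directory and
tag names below are placeholders; the actual naming convention is the one
described at the URL above), updating your site area could look like this:

> cd <your site directory>
> cvs commit -m "updated node configuration files"
> cvs tag <site-tag-name>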
Preparing the installation of current tag
=========================================

The current LCG1 tag is ---> LCG1-1_0_1 <---

In the following instructions/examples, when you see the string <tag>, you
should replace it with the name of the tag defined above.

To install it, check it out on your LCFG server with

> cvs checkout -r <tag> -d <tag> lcg1

Note: the "-d <tag>" option will create a directory named <tag> and copy all
the files there. If you do not specify the -d parameter, the files will go to
a subdirectory of the current directory named lcg1.

The default way to install the tag is to copy the content of the rpmlist
subdirectory to the /opt/local/linux/7.3/rpmcfg directory on the LCFG server.
This directory is NFS-mounted by all client nodes and is visible as
/export/local/linux/7.3/rpmcfg

Now go to the directory where you keep your local configuration files. If you
want to create a new one, you can check out from CVS any of the previous tags
with:

> cvs checkout -r <site tag> -d <dir> <site>

If you want the latest (HEAD) version of your config files, just omit the
"-r <site tag>" parameter.

Go to <dir> and edit the cfgdir-cfg.h file, setting the CFGDIR parameter to
the directory where the node configuration files are stored. If you followed
the previous instructions, the line should read

#define CFGDIR "<tag dir>/source

Note: mind that the quotes are only at the beginning of the string, not at the
end (see explanatory note in cfgdir-cfg.h).

To download all the rpms needed to install this version you can use the
updaterep command. In <tag dir>/updaterep you can find 2 configuration files
for this script: updaterep.conf and updaterep_full.conf. The first will tell
updaterep to only download the rpms which are actually needed to install the
current tag, while updaterep_full.conf will do a full mirror of the LCG rpm
repository. Copy updaterep.conf to /etc/updaterep.conf and run the updaterep
command. By default all rpms will be copied to the /opt/local/linux/7.3/RPMS
area, which is visible from the client nodes as /export/local/linux/7.3/RPMS.
You can change the repository area by editing /etc/updaterep.conf and
modifying the REPOSITORY_BASE variable.

IMPORTANT NOTICE: as the list and structure of Certification Authorities (CA)
accepted by the LCG project can change independently from the middleware
releases, the rpm list related to the CA certificates and URLs has been
decoupled from the standard LCG1 release procedure. This means that the
version of the security-rpm.h file contained in the rpmlist directory
associated with the current tag could be incomplete or obsolete. Please go to
URL

http://grid-deployment.web.cern.ch/grid-deployment/cgi-bin/index.cgi?var=lcg1Status

click on the "LCG1 CAs" link at the bottom of the page, and follow the
instructions there to update the CA-related settings. Changes and updates of
these settings will be announced on the LCG-Rollout mailing list.

If you want to make sure that all the needed object rpms are installed on your
LCFG server, you can use the lcfgng_server_update.pl script, also located in
<tag dir>/updaterep. This script will report which rpms are missing or have
the wrong version and will create the /tmp/lcfgng_server_update_script.sh
script, which you can then use to fix the server configuration. Run it in the
following way:

> lcfgng_server_update.pl <tag dir>/rpmlist/lcfgng-common-rpm.h
> /tmp/lcfgng_server_update_script.sh
> lcfgng_server_update.pl <tag dir>/rpmlist/lcfgng-server-rpm.h
> /tmp/lcfgng_server_update_script.sh

WARNING: please always take a look at /tmp/lcfgng_server_update_script.sh
before running it (just in case!).
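Once the update script has been applied, a simple optional check (illustrative
only) is to re-run the report and verify that it no longer lists missing or
mismatched rpms:

> lcfgng_server_update.pl <tag dir>/rpmlist/lcfgng-common-rpm.h
> lcfgng_server_update.pl <tag dir>/rpmlist/lcfgng-server-rpm.h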
IMPORTANT NOTICE: two LCFG-server-specific rpms have changed name since the
previous tag lcg1_20030717_1455, so you will have to remove the older versions
before running the lcfgng_server_update.pl script. The commands to do this
are:

> rpm -e edg-lcfg-infoproviders-defaults-s1-3.0.18-1
> rpm -e lcfg-client-defaults-s2-2.0.35-edg1

In the source directory you should take a look at the redhat73-cfg.h file and
check that the location of the rpm lists (updaterpms.rpmcfgdir) and of the rpm
repository (updaterpms.rpmdir) are correct for your site (the defaults are
consistent with the instructions in this document). If needed, you can
redefine these paths from the local-cfg.h file in your site's area. An example
version of this file is local-cfg.h.template in the source directory.

Also in local-cfg.h you can (must!) replace the default root password with the
one you want to use for your site:

+auth.rootpwd <crypted password>   <--- replace with your own crypted password

To obtain <crypted password> you can use the following command:

> perl -e 'print crypt("MyPassword","y3")."\n"'

where "MyPassword" should be replaced with your own clear-text root password
and "y3" is a string of two randomly chosen characters (change "y3" to your
liking).

To finalize the adaptation of the current tag to your site you should edit
your site-cfg.h file. You can use the site-cfg.h.template file in the source
directory as a starting point. If you already have a site-cfg.h file, you can
find a detailed description of the modifications to this file with respect to
the previous tag in Appendix D below.

WARNING: the site-cfg.h.template file assumes you want to run the PBS batch
system without sharing the /home directory between the CE and all the WNs.
This is the highly recommended setup. If for some reason you want to run PBS
in traditional mode, i.e. with the CE exporting /home via NFS and all the WNs
mounting it, you should edit your site-cfg.h file and comment out the
following two lines:

#define NO_HOME_SHARE
...
#define CE_JM_TYPE lcgpbs

In addition to this, your WN configuration file should include the line

#include CFGDIR/UsersNoHome-cfg.h"

just after including Users-cfg.h (please note that BOTH Users-cfg.h AND
UsersNoHome-cfg.h must be included).

Node installation and configuration
===================================

In your site-specific directory you should already have the do_mkxprof.sh
script. If you do not find it, you can check out the one in the CERN CVS area
with

> cvs checkout CERN/do_mkxprof.sh

This script just calls the mkxprof command, specifying that it should look for
configuration files starting from the current directory (option "-S."). Feel
free to use your preferred call to the mkxprof command, but note that running
mkxprof as a daemon is NOT recommended and can easily lead to massive
catastrophes if not used with extreme care: do it at your own risk.

To create the LCFG configuration for one or more nodes you can do

> ./do_mkxprof.sh node1 [node2 node3, ...]

When the initial installation completes (expect two automatic reboots in the
process), each node type requires a few manual steps, detailed below, to be
completely configured. After completing these steps, rebooting the nodes will
bring them up with all the needed services active.

Note: after the manual configuration, you do not need to reboot WorkerNodes.

Common steps
------------

-- On the ResourceBroker, MyProxy, ComputingElement, and StorageElement nodes
   you should install the host certificate/key files in /etc/grid-security
   with names hostcert.pem and hostkey.pem.
   Also make sure that hostkey.pem is only readable by root:

   > chmod 400 /etc/grid-security/hostkey.pem

-- On the ResourceBroker, ComputingElement, and StorageElement you should
   replace the default (EDG-oriented) mkgridmap configuration file in
   /opt/edg/etc/edg-mkgridmap.conf with the LCG version. A copy of this file
   can be found in Appendix A at the end of this document. Note that this file
   gets overwritten when you update the edg-mkgridmap-conf rpm, and you will
   have to replace it again. An object to handle the creation of the
   edg-mkgridmap.conf file is under test and will be included in the next tag.
   After fixing the mkgridmap configuration, you can immediately create the
   grid-mapfile by running

   > /opt/edg/sbin/edg-mkgridmap --output <grid-mapfile> --safe

   or wait till cron does this for you.

UserInterface
-------------

No configuration steps are currently needed on a UserInterface node.

ResourceBroker
--------------

-- Configure the MySQL database. See the detailed recipe in Appendix C at the
   end of this document.

-- A few scripts related to the functioning of the Workload software are
   broken. The next tag will include the fixed scripts directly in the rpms,
   but for this tag the scripts have to be replaced by hand. To install the
   fixed scripts you have to:

   - log on the RB node as root
   - download the tar archive containing the scripts with

     > wget http://cern.ch/markusw/lcg_rb_fixes.tar.gz

   - install the scripts with

     > tar -xzvf lcg_rb_fixes.tar.gz -C /

ComputingElement
----------------

-- Configure the PBS server. See the detailed recipe in Appendix B at the end
   of this document.

-- In LCG we are using a patch to the GLOBUS software to avoid having to share
   the /home directories between CE and WNs. To enable output retrieval in
   this configuration, three manual configuration steps are needed:

   1) modify the sshd configuration. Edit the /etc/ssh/sshd_config file and
      add these lines at the end:

      HostbasedAuthentication yes
      IgnoreUserKnownHosts yes
      IgnoreRhosts yes

      and then restart the server with "/etc/rc.d/init.d/sshd restart".

   2) configure the script enabling WNs to copy output back to the CE:

      - in /opt/edg/etc, copy edg-pbs-shostsequiv.conf.template to
        edg-pbs-shostsequiv.conf, then edit this file and change the
        parameters to your needs. Most sites will only have to set NODES to
        an empty string.
      - create the first version of the /etc/ssh/shosts.equiv file by running
        /opt/edg/sbin/edg-pbs-shostsequiv (or wait till cron does this for
        you).

   3) create the first version of the /etc/ssh/ssh_known_hosts file by running
      /opt/edg/sbin/edg-pbs-knownhosts (or wait till cron does this for you).

   Note: every time you add or remove WNs to/from the CE, do not forget to run
   /opt/edg/sbin/edg-pbs-shostsequiv and /opt/edg/sbin/edg-pbs-knownhosts, or
   you will have to wait till cron does this for you before the new nodes
   work correctly.

StorageElement
--------------

No configuration steps are currently needed on a StorageElement node.

WorkerNode
----------

-- In LCG we are using a patch to the GLOBUS software to avoid having to share
   the /home directories between CE and WNs. To enable output retrieval in
   this configuration, two manual configuration steps are needed:

   1) modify the ssh client configuration. Edit the /etc/ssh/ssh_config file
      and add these lines at the end:

      Host *
          HostbasedAuthentication yes

      Note: the "Host *" line should already exist. Add the second line just
      after it.

   2) create the first version of the /etc/ssh/ssh_known_hosts file by running
      /opt/edg/sbin/edg-pbs-knownhosts (or wait till cron does this for you).
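As a quick, optional sanity check of this no-shared-home setup (not part of
the official procedure; <WN hostname> and <CE hostname> are placeholders for
your own node names), you can verify that the generated files mention the
expected hosts:

On the CE:

> grep <WN hostname> /etc/ssh/shosts.equiv

On a WN:

> grep <CE hostname> /etc/ssh/ssh_known_hosts

If a host is missing, re-run /opt/edg/sbin/edg-pbs-shostsequiv and
/opt/edg/sbin/edg-pbs-knownhosts as described above.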
BDII Node
---------

To avoid having a single top MDS which could easily be overloaded, we have
split the information system into several (well, two for the moment)
independent information regions, each served by two or more region MDSes. The
default settings for your site already take this into account and register
your site GIIS with the correct MDSes.

Of course, for this schema to work, each and every BDII on the GRID must know
the names of all region MDSes and merge the information coming from them into
a single database. This requires a modification of the way the BDII populator
program is executed by the cron job.

Within the lcg1 CVS module you can find the BDII subdirectory, which contains
the bdii-cron, bdii_temp, and setup_BDII.sh files. You should copy them to the
BDII node (anywhere will do, as long as they are in the same directory) and
then execute the setup_BDII.sh script, which will copy the other two files
into place, i.e. into /opt/edg/etc/cron and /opt/edg/etc/init.d respectively.
No other changes are needed.

Be aware that bdii-cron contains an explicit list of MDS servers to query. In
future versions of the system we will provide a better way to configure the
program, if possible by providing an LCFG object for that. For the moment, if
new MDS servers appear, we will just update the bdii-cron script in the CVS
repository and create a new tag.

Regional MDS Node
-----------------

No configuration steps are currently needed on a Regional MDS node.

MyProxy Node
------------

No configuration steps are currently needed on a MyProxy node.

Testing
-------

To perform the standard tests (edg-job-submit & co.) you need to have your
certificate registered in one VO and to sign the LCG usage guidelines.
Detailed information on how to do these two steps can be found at:

http://lcg-registrar.cern.ch/

If you are working in one of the four LHC experiments, ask for registration in
the corresponding VO; otherwise you can choose the "LCG Deployment Team" (aka
DTeam) VO.

A test suite which will help you make sure your site is correctly configured
is in preparation. You may want to take a look at the Testing Group web page
to find out about the current status. This is at

http://grid-deployment.web.cern.ch/grid-deployment/cgi-bin/index.cgi?var=tstg/homepage

Recommended reading is the LCG-Certification-help document in the "Documents"
page.

Appendix A
==========

Content of file /opt/edg/etc/edg-mkgridmap.conf for LCG1 nodes:

@-----------------------------------------------------------------------------
#### GROUP: group URI [lcluser]

# LCG Standard Virtual Organizations
group ldap://grid-vo.nikhef.nl/ou=lcg1,o=alice,dc=eu-datagrid,dc=org .alice
group ldap://grid-vo.nikhef.nl/ou=lcg1,o=atlas,dc=eu-datagrid,dc=org .atlas
group ldap://grid-vo.nikhef.nl/ou=lcg1,o=cms,dc=eu-datagrid,dc=org .cms
group ldap://grid-vo.nikhef.nl/ou=lcg1,o=lhcb,dc=eu-datagrid,dc=org .lhcb
group ldap://lcg-vo.cern.ch/ou=lcg1,o=dteam,dc=lcg,dc=org .dteam

#### AUTH: authorization URI
auth ldap://lcg-registrar.cern.ch/ou=users,o=registrar,dc=lcg,dc=org
@-----------------------------------------------------------------------------

Feel free to comment out the "group" lines which refer to VOs you do not
support at your site.
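After edg-mkgridmap has been run with this configuration (see the Common steps
above), a quick illustrative sanity check is to count how many members of a
supported VO, e.g. dteam, ended up in the generated grid-mapfile. Replace
<grid-mapfile> with the file you passed to --output:

> grep -c '\.dteam' <grid-mapfile>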
Appendix B
==========

How to configure the PBS server on a ComputingElement

1) load the server configuration with this command (replace <CEhostname> with
   the hostname of the CE you are installing):

@-----------------------------------------------------------------------------
/usr/bin/qmgr <<EOF
set server operators = root@<CEhostname>
set server default_queue = short
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server scheduler_iteration = 600
set server default_node = lcgpro
set server node_pack = False
create queue short
set queue short queue_type = Execution
set queue short resources_max.cput = 00:15:00
set queue short resources_max.walltime = 02:00:00
set queue short enabled = True
set queue short started = True
create queue long
set queue long queue_type = Execution
set queue long resources_max.cput = 12:00:00
set queue long resources_max.walltime = 24:00:00
set queue long enabled = True
set queue long started = True
create queue infinite
set queue infinite queue_type = Execution
set queue infinite resources_max.cput = 48:00:00
set queue infinite resources_max.walltime = 72:00:00
set queue infinite enabled = True
set queue infinite started = True
EOF
@-----------------------------------------------------------------------------

   Note that the queues short, long, and infinite are those defined in the
   site-cfg.h file, and the time limits are those in use at CERN. Feel free to
   add/remove/modify them to your liking, but do not forget to modify
   site-cfg.h accordingly.

2) edit the file /var/spool/pbs/server_priv/nodes to add the list of
   WorkerNodes you plan to use. The CERN settings are:

@-----------------------------------------------------------------------------
lxshare0223.cern.ch np=2 lcgpro
lxshare0224.cern.ch np=2 lcgpro
lxshare0225.cern.ch np=2 lcgpro
lxshare0226.cern.ch np=2 lcgpro
lxshare0227.cern.ch np=2 lcgpro
lxshare0228.cern.ch np=2 lcgpro
lxshare0249.cern.ch np=2 lcgpro
lxshare0250.cern.ch np=2 lcgpro
lxshare0372.cern.ch np=2 lcgpro
lxshare0373.cern.ch np=2 lcgpro
@-----------------------------------------------------------------------------

   where np=2 gives the number of job slots (usually equal to #CPUs) available
   on the node, and lcgpro is the group name as defined in the default_node
   parameter in the server configuration.

3) restart the PBS server:

   > /etc/rc.d/init.d/pbs_server restart

Appendix C
==========

How to configure the MySQL database on a ResourceBroker

First make sure that the mysql server is up and running:

> /etc/rc.d/init.d/mysql start

If it was already running, you will just get notified of the fact.

Now choose a DB management password <password> you like (write it down
somewhere!) and then configure the server with the following commands:

> mysqladmin password <password>
> mysqladmin --password=<password> create lbserver20
> mysql --password=<password> lbserver20 < /opt/edg/etc/server.sql
> mysql --password=<password> \
        --exec "grant all on lbserver20.* to lbserver@localhost" lbserver20

Note that the database name "lbserver20" is hardwired in the LB server code
and cannot be changed, so use it exactly as shown in the commands.
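As an optional, illustrative check that the lbserver20 database was created
and loaded from server.sql (<password> is the DB management password chosen
above), you can list its tables:

> mysql --password=<password> --exec "show tables" lbserver20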
Appendix D
==========

Changes to site-cfg.h w.r.t. the previous tag.

1) SITE_EDG_VERSION should now be set to LCG1-1_0_0

2) the following parameters can be removed (no harm if they stay, though):

   SITE_DN_
   SITE_GIIS_ON_CE
   COUNTRY_GIIS_ON_CE
   AFS_CELL

3) RLS_LRC_DB_PASSWORD and RLS_RMC_DB_PASSWORD have been replaced with

   #define RLS_LRC_ALICE_PASSWORD lcgcertlrc
   #define RLS_LRC_ATLAS_PASSWORD lcgcertlrc
   #define RLS_LRC_CMS_PASSWORD lcgcertlrc
   #define RLS_LRC_LHCB_PASSWORD lcgcertlrc
   #define RLS_LRC_DTEAM_PASSWORD lcgcertlrc
   #define RLS_RMC_ALICE_PASSWORD lcgcertrmc
   #define RLS_RMC_ATLAS_PASSWORD lcgcertrmc
   #define RLS_RMC_CMS_PASSWORD lcgcertrmc
   #define RLS_RMC_LHCB_PASSWORD lcgcertrmc
   #define RLS_RMC_DTEAM_PASSWORD lcgcertrmc

   For the moment the password defined there is not really used and acts as a
   place-holder. We will switch to a real password sometime in the future.

4) the parameter NO_HOME_SHARE should be defined to allow WNs to use a local
   /home directory (recommended!) instead of mounting the one from the CE

5) within the "#ifdef CE_LRMS_PBS" block, add the line

   #define CE_JM_TYPE lcgpbs

   This is needed to use the modified JobManager (see point 4 above).

6) CE_USE_MDS_INFO and SE_USE_MDS_INFO can be set to 0 (no need for old-style
   MDS info)

7) add a new parameter SE_NAME and set it to "disk" to define your SE as a
   classical SE (mandatory!)

8) change SE_PROTOCOL_RFIO_PORT from 3147 to 5001. This makes the client
   compatible with existing RFIO servers (e.g. CASTOR).

9) add a new parameter MY_PROXY_SERVER and point it to the Proxy server which
   is close to the RB you defined in UI_RESBROKER. You should be able to find
   the name of the node by looking at

   http://grid-deployment.web.cern.ch/grid-deployment/cgi-bin/index.cgi?var=lcg1Status

   and clicking on the site where the RB resides. Post a message to the
   LCG-RollOut list if you have problems.
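Putting points 4 to 9 together, the corresponding fragment of an updated
site-cfg.h could look roughly like the sketch below. This is only an
illustration: check the site-cfg.h.template shipped with the tag for the exact
names and syntax, and replace <MyProxy hostname> with the Proxy server you
identified in point 9.

@-----------------------------------------------------------------------------
/* point 4: WNs use a local /home instead of mounting it from the CE */
#define NO_HOME_SHARE

#ifdef CE_LRMS_PBS
/* point 5: use the modified (lcgpbs) JobManager */
#define CE_JM_TYPE lcgpbs
#endif

/* point 6: old-style MDS info is not needed */
#define CE_USE_MDS_INFO 0
#define SE_USE_MDS_INFO 0

/* points 7 and 8: classical SE, RFIO port compatible with CASTOR */
#define SE_NAME disk
#define SE_PROTOCOL_RFIO_PORT 5001

/* point 9: MyProxy server close to the RB defined in UI_RESBROKER */
#define MY_PROXY_SERVER <MyProxy hostname>
@-----------------------------------------------------------------------------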