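This document helps new site administrators get started with grid operations. It covers EGEE operations, site responsibilities, grid monitoring, ways to solve site problems, site security, operation policy documents, APROC resources, and references.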


EGEE Operations

Responsibility for the daily operations of the EGEE grid is divided among three roles.

The Operations team is responsible for detecting problems, coordinating the diagnosis, and monitoring the problems through to resolution. The COD (CIC-on-duty) submits a GGUS ticket against the ROC (Regional Operations Centre) and CCs the site. The ticket is followed until it is solved.

Operations Escalation Procedure

  1. The COD/ROC opens a ticket in the CIC portal to track the problem and sends an email notification to the site and the ROC. The escalation period is 3 days, depending on severity. Once site admins receive the notification email, they should try to solve the problem as soon as possible through the following actions:
    • Reply to the COD/ROC by email once you start working on the site problem.
    • Check the central monitoring tools for more information about the site problem (see the 'Grid Monitoring' section to detect, diagnose and track problems).
    • Ask your ROC for help if you are not able to solve the problem (see 'Ways To Solve Site Prob' for guidance).
    • Update the COD/ROC on the progress of the problem.

  2. The COD sends a second email to the ROC/site if there is no response from the site admin.
    • At this second escalation step, the COD suggests that the site declare a downtime until the problem is solved, and the ROC has to be notified. If the site does not accept the downtime, the COD proceeds with the regular escalation procedure at the agreed deadlines.

  3. If there is still no response, the COD suggests that the site be suspended. (The site is removed from the Top Level BDII configuration, which essentially removes it from the Grid.)
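
The escalation steps above can be sketched as a small timeline calculation. This is only an illustration, not an EGEE tool: it assumes that every step lasts the 3-day escalation period mentioned in step 1, whereas the real deadlines depend on severity and on what the COD and ROC agree. The names `ESCALATION_PERIOD`, `ESCALATION_STEPS` and `escalation_step` are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: every escalation step lasts 3 days (the period named in step 1).
ESCALATION_PERIOD = timedelta(days=3)

# Hypothetical labels for the three steps of the procedure.
ESCALATION_STEPS = [
    "1st notification: ticket opened, site and ROC notified",
    "2nd notification: COD suggests declaring a downtime",
    "COD suggests site suspension",
]


def escalation_step(ticket_opened: datetime, now: datetime) -> str:
    """Return the step an unanswered ticket would be in at time `now`."""
    if now < ticket_opened:
        raise ValueError("now is earlier than the ticket opening time")
    # Number of full escalation periods that passed without a response.
    elapsed_periods = (now - ticket_opened) // ESCALATION_PERIOD
    # The procedure has no step after the suspension suggestion.
    index = min(elapsed_periods, len(ESCALATION_STEPS) - 1)
    return ESCALATION_STEPS[index]
```

For example, under these assumptions a ticket left unanswered for four days would be at the second notification. A site can use such a calculation to see how much time is left before the next escalation step.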


Site Responsibilities

Act promptly and keep the site as operational as possible!

Respond to tickets related to your site within 24 hours.

Check your site's status at least once a day.

Protect site security.

Keep your services running and configured correctly.

Schedule a downtime for your site and services in advance if needed.

Keep your middleware reasonably up to date. Updates are announced by e-mail (egee-broadcast).

Check the certificate lifetime.
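
The certificate lifetime check can be automated. The following is a minimal sketch, not an official tool: it takes the expiry date in the format printed by OpenSSL and reports the remaining days. The function names and the 30-day warning threshold are assumptions chosen for illustration.

```python
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Return the number of days left before a certificate expires.

    `not_after` is a date in the format printed by OpenSSL,
    for example "Jul 19 08:34:53 2010 GMT".
    """
    # Also accept the whole "notAfter=..." line printed by `openssl x509 -enddate`.
    if not_after.startswith("notAfter="):
        not_after = not_after[len("notAfter="):]
    expires = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch, UTC
    if now is None:
        now = time.time()
    return (expires - now) / 86400.0


def needs_renewal(not_after, warn_days=30, now=None):
    """Return True if the certificate expires within `warn_days` days.

    The 30-day default is a hypothetical threshold, not an EGEE requirement.
    """
    return days_until_expiry(not_after, now) < warn_days
```

The expiry date can be obtained with `openssl x509 -in /etc/grid-security/hostcert.pem -noout -enddate`. The path shown is a common location for a host certificate and is an assumption here, so adjust it to your installation.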


Grid Monitoring

The following monitoring tools help site admins detect site problems.

CIC Operations Portal

Operations tool for site admins

https://cic.gridops.org/index.php?section=rc

GOCDB

GOCDB collects site information and status. You have to maintain your site's information in it, including site support info, node info and CSIRT contact info.

https://goc.gridops.org/

Grid Service Availability Monitoring (SAM)

Monitored services for CE, SE, RB, etc.

https://lcg-sam.cern.ch:8443/sam/sam.py

IS Monitoring (Gstat)

Tool to display and monitor information published by Site-BDIIs

http://goc.grid.sinica.edu.tw/gstat//AsiaPacific.html

Nagios

Monitored site services

https://rocnagios.grid.sinica.edu.tw/nagios/

Job Accounting (APEL)

Accounting records for jobs which consume CPU resources on the EGEE/WLCG grid

http://www3.egee.cesga.es/acctenfor/

Grid View

Visualizes the results from SAM and Gstat

http://gridview.cern.ch/GRIDVIEW/

GridMap

Visualizing the "State" of the Grid

http://gridmap.cern.ch/gm/

Real Time Monitor

Displays job status at each site

http://gridportal.hep.ph.ic.ac.uk/rtm/


Ways To Solve Site Prob

for issues with EGEE middleware. You may search it for the error message to find a solution.


Site Security


Operation Policy Docs

Service Level Description between ROCs and sites: Procedures/EGEE_Service_Level_Description_SLD , This document formalizes the services which a site provides to its Regional Operations Centre.

Virtual Organisation Operations Policy: https://edms.cern.ch/document/853968 , by participating in the Grid as a Virtual Organisation (VO), you agree to this document.

Grid Security Traceability and Logging Policy: https://edms.cern.ch/document/428037, This policy defines the minimum requirements for traceability of actions on Grid Resources and Services as well as the production and retention of security related logging in the Grid.

Approval of Certification Authorities: https://edms.cern.ch/document/428038 , This document describes the procedure by which the list of trusted Certification Authorities for use in WLCG and EGEE should be created and maintained.

Policy on Grid Multi-User Pilot Jobs: https://edms.cern.ch/document/855383, Security policy for operation of multi-user pilot jobs.


APROC Resources

Operations Meetings Notes: http://lists.grid.sinica.edu.tw/apwiki/Reports/Operations_Meeting

Weekly reports of production site status and broadcast: http://lists.grid.sinica.edu.tw/apwiki/Reports/APROC_sites_weekly_report

APROC Website: http://aproc.twgrid.org/

APROC WIKI: http://lists.grid.sinica.edu.tw/apwiki/FrontPage

ASGCCA: http://ca.grid.sinica.edu.tw/

APESCI Virtual Organization: http://aproc.twgrid.org/index.php?option=com_content&task=view&id=17&Itemid=31

TWGrid Virtual Organization: http://aproc.twgrid.org/index.php?option=com_content&task=view&id=16&Itemid=31


Reference

SA1 Operational Procedures Manual: https://twiki.cern.ch/twiki/bin/view/EGEE/EGEEROperationalProcedures

EGEE (Enabling Grids for E-sciencE): http://public.eu-egee.org/

EGEE SA1 Activity: http://egee-sa1.web.cern.ch/egee-sa1/

gLite grid middleware: http://glite.web.cern.ch/glite/

Installation/Grid Operations Procedures (last edited 2009-07-19 08:34:53 by shuting)