EGEE Operations gives an introduction to the EGEE operations team, the current EGEE operations architecture, and the Operations Escalation Procedure (concerning the ticket process).
Site Responsibilities describes your responsibilities as a Grid production site admin.
Grid Monitoring describes the monitoring tools that detect problems at sites, such as SAM and GSTAT (GIIS Monitor).
Ways To Solve Site Prob guides site admins through solving site problems.
Site Security provides the incident response procedure.
Operation Policy lists the Grid operation policies, such as the SLD, which shall be signed between the ROC and the site, and the VO operations policy, which shall be signed for agreement by an Authorized Signatory of the Virtual Organisation.
APROC Resources provides the links to APROC resources.
The daily operations of the EGEE grid are divided among three roles:
- Operations Team – COD and support tools developers (COD: CIC-on-duty = “EGEE Grid Operations Team”)
- Regional Operations Centre – ROC Managers, ROC support staff
- Resource Centres (sites) – local support, site admins
The Operations Team is responsible for detecting problems, coordinating the diagnosis, and monitoring the problems through to resolution. COD submits a GGUS ticket against the ROC and CCs the site. The ticket is followed until it is solved.
Operations Escalation Procedure
- COD/ROC opens a ticket for problem tracking in the CIC portal and sends an email notification to the site and the ROC. The escalation period is 3 days, depending on severity. Once site admins receive the notification email, they should try to solve the problem ASAP through the following actions:
  - Respond to COD/ROC by reply email once you start to deal with the site problem.
  - Check the central monitoring tools to get more information about the site problem. (See the 'Grid Monitoring' section to detect, diagnose and track problems.)
  - Ask your ROC for help if you are not able to solve the problem. (See 'Ways To Solve Site Prob' for help with solving site problems.)
  - Update COD/ROC on the progress of the problem.
- COD will send a second email to the ROC/site if there is no response from the site admin. At this second escalation step, COD suggests that the site declare downtime until the problem is solved, and the ROC has to be notified. If the site does not accept the downtime, COD proceeds with the regular escalation procedure at the agreed deadlines.
- If there is still no response, COD suggests that the site be suspended. (The site is removed from the Top Level BDII configuration, which essentially removes it from the Grid.)
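The escalation steps above can be pictured as a simple timeline. The following is only a minimal sketch, assuming one fixed 3-day period per step and treating any site response as stopping the escalation. The function name `escalation_step` and the step labels are hypothetical, and the real deadlines are whatever COD and the ROC agree on for the given severity.

```python
from datetime import date, timedelta

# Assumption: one fixed 3-day period per step. Real deadlines depend on
# severity and on what COD and the ROC have agreed.
ESCALATION_PERIOD = timedelta(days=3)

# Hypothetical labels summarising the three steps described in the text.
STEPS = [
    "step 1: ticket opened, site and ROC notified by email",
    "step 2: second email, COD suggests declaring downtime",
    "step 3: COD suggests suspending the site",
]


def escalation_step(opened: date, today: date, site_responded: bool) -> str:
    """Return the escalation step a ticket would be at on `today`."""
    if site_responded:
        return "no escalation: site is working on the ticket"
    # Number of whole escalation periods that have passed without a response.
    periods_elapsed = max(0, (today - opened) // ESCALATION_PERIOD)
    return STEPS[min(periods_elapsed, len(STEPS) - 1)]


print(escalation_step(date(2009, 3, 2), date(2009, 3, 6), site_responded=False))
```

In this sketch a ticket left unanswered for four days is already at the second step, which is why replying to COD/ROC as soon as you start working on the problem matters.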
Act promptly and keep the site as operational as possible!
- The site must be available (UP) at least 70% of the time per month!
- Site reliability must be at least 75% per month!
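To see how the two targets differ, here is a rough sketch of the arithmetic. It assumes the commonly used definitions in which availability is uptime divided by total time, and reliability is uptime divided by total time minus scheduled downtime. This page does not define the metrics, so treat the formulas and the function names as assumptions.

```python
AVAILABILITY_TARGET = 0.70  # from the text: at least 70% per month
RELIABILITY_TARGET = 0.75   # from the text: at least 75% per month


def availability(up_hours: float, total_hours: float) -> float:
    """Fraction of the whole period during which the site was up."""
    return up_hours / total_hours


def reliability(up_hours: float, total_hours: float,
                scheduled_down_hours: float) -> float:
    """Like availability, but scheduled downtime is not counted against the site
    (assumed definition, see the lead-in)."""
    return up_hours / (total_hours - scheduled_down_hours)


def meets_targets(up_hours: float, total_hours: float,
                  scheduled_down_hours: float) -> bool:
    """True when both monthly targets from the text are met."""
    return (availability(up_hours, total_hours) >= AVAILABILITY_TARGET
            and reliability(up_hours, total_hours, scheduled_down_hours)
            >= RELIABILITY_TARGET)


# A 30-day month (720 h) with 540 h up and 60 h of scheduled downtime.
print(availability(540, 720), meets_targets(540, 720, 60))
```

Under these assumed definitions, declaring downtime in advance protects reliability but not availability, so unscheduled outages hurt both numbers.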
Respond to tickets related to your site within 24 hours.
Check your site's status at least once a day:
- Review and update open GGUS tickets.
- Look for new problems via the Grid monitoring tools below.
- You can subscribe to an RSS feed to receive alarms from the COD dashboard for your site’s services: https://cic.gridops.org/index.php?section=rc&page=alertnotification
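If you prefer to process the alarm feed with a script rather than a feed reader, something like the following may be a starting point. It is a sketch only: it assumes the feed is ordinary RSS 2.0 with `item` and `title` elements, and the sample document and host names are made up for illustration, not copied from the real COD dashboard.

```python
import xml.etree.ElementTree as ET


def alarm_titles(rss_text: str) -> list[str]:
    """Return the title of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    return [item.findtext("title", default="") for item in root.iter("item")]


# Hypothetical sample feed; the real feed content will differ.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Alarms for my site</title>
<item><title>CE test failed on ce.example.org</title></item>
<item><title>SE test failed on se.example.org</title></item>
</channel></rss>"""

for title in alarm_titles(SAMPLE_FEED):
    print(title)
```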
Protect site security:
- Make sure you know who has administrator privileges on your site.
- Keep the OS up to date with security patches.
- Keep the middleware up to date with security patches and CA RPMs (http://grid-deployment.web.cern.ch/grid-deployment/lcg2CAlist.html).
- Keep your private key very safe (not on one of your site’s machines!).
- Keep the security contact details up to date in the GOC database: https://goc.gridops.org/
Keep your services running and configured correctly!
- Schedule a downtime event for your site and services in advance if needed.
- Keep your middleware reasonably up to date. Updates are announced by e-mail (egee-broadcast).
- Learn about new Grid middleware updates: http://glite.web.cern.ch/glite/packages/R3.1/
- Check the certificate lifetime.
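One way to check a certificate's lifetime is to run `openssl x509 -in <certificate> -noout -enddate`, which prints a line such as `notAfter=Jun  1 12:00:00 2010 GMT`. The sketch below turns such a line into a number of remaining days. The function name is hypothetical, the example date is made up, and the location of your host certificate (conventionally `/etc/grid-security/hostcert.pem`) is an assumption you should check for your own installation.

```python
from datetime import datetime, timezone


def days_until_expiry(enddate_line: str, now: datetime) -> int:
    """Whole days left before the date in an `openssl x509 -enddate` line.

    A negative result means the certificate has already expired.
    """
    # Drop the "notAfter=" prefix and parse the timestamp, which openssl
    # prints in GMT.
    stamp = enddate_line.strip().split("=", 1)[1]
    expiry = datetime.strptime(stamp, "%b %d %H:%M:%S %Y GMT")
    expiry = expiry.replace(tzinfo=timezone.utc)
    return (expiry - now).days


# Made-up example line; paste the real output of the openssl command instead.
line = "notAfter=Jun  1 12:00:00 2010 GMT"
print(days_until_expiry(line, datetime.now(timezone.utc)))
```

Running a check like this from a daily cron job and warning when the result drops below a threshold of your choice leaves time to request a renewal before services start failing.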
The following monitoring tools are very useful for site admins to detect site problems.
CIC Operations Portal
Operation tool for site admin
GOCDB collects site information and status. You have to maintain your site info in it, including site support info, node info, and CSIRT contact info.
Grid Service Availability Monitoring (SAM)
Monitored services for CE, SE, RB, etc.
IS Monitoring (Gstat)
Tool to display and monitor information published by Site-BDIIs
Monitored site services
Job Accounting (APEL)
Accounting records for jobs which consume CPU resources on the EGEE/WLCG grid
Visualize the results from SAM + Gstat
Visualizing the "State" of the Grid
Real Time Monitor
Displays job status at each site
Ways To Solve Site Prob
- Contact APROC: email@example.com. You can email APROC with your questions.
- GOCWIKI lists many symptoms and solutions for EGEE middleware issues. You can search it for your error message to find a solution: http://goc.grid.sinica.edu.tw/gocwiki/SiteProblemsFollowUpFaq
- Search the LCG-ROLLOUT mailing list archives. Many operational problems are discussed there, and you can also report your problem there. Subscribe to the LCG-ROLLOUT mailing list at the following URL: http://jiscmail.ac.uk/cgi-bin/webadmin?REPORT&z=3
- To ask Grid experts for help or to report problems, you can raise a ticket with GGUS: http://www.ggus.org
Security Service Challenge: http://lists.grid.sinica.edu.tw/apwiki/Security_Service_Challenge
Security Incident Response Procedure: https://edms.cern.ch/file/867454/1/EGEE_Incident_Response_Procedure.pdf
Operation Policy Docs
Service Level Description between ROCs and sites: Procedures/EGEE_Service_Level_Description_SLD , This document formalizes the services which a site provides to its Regional Operations Centre.
Virtual Organisation Operations Policy: https://edms.cern.ch/document/853968 , By participating in the Grid as a Virtual Organisation (VO), you agree to this document.
Grid Security Traceability and Logging Policy: https://edms.cern.ch/document/428037 , This policy defines the minimum requirements for traceability of actions on Grid Resources and Services, as well as the production and retention of security-related logging in the Grid.
Approval of Certification Authorities: https://edms.cern.ch/document/428038 , This document describes the procedure by which the list of trusted Certification Authorities for use in WLCG and EGEE should be created and maintained.
Policy on Grid Multi-User Pilot Jobs: https://edms.cern.ch/document/855383 , Security policy for the operation of multi-user pilot jobs.
Operations Meetings Notes: http://lists.grid.sinica.edu.tw/apwiki/Reports/Operations_Meeting
Weekly reports of production site status and broadcast: http://lists.grid.sinica.edu.tw/apwiki/Reports/APROC_sites_weekly_report
APROC Website: http://aproc.twgrid.org/
APROC WIKI: http://lists.grid.sinica.edu.tw/apwiki/FrontPage
APESCI Virtual Organization: http://aproc.twgrid.org/index.php?option=com_content&task=view&id=17&Itemid=31
TWGrid Virtual Organization: http://aproc.twgrid.org/index.php?option=com_content&task=view&id=16&Itemid=31
SA1 Operational Procedures Manual: https://twiki.cern.ch/twiki/bin/view/EGEE/EGEEROperationalProcedures
EGEE (Enabling Grids for E-sciencE): http://public.eu-egee.org/
EGEE SA1 Activity: http://egee-sa1.web.cern.ch/egee-sa1/
gLite grid middleware: http://glite.web.cern.ch/glite/