SOP Monitoring and Performance Management
SOP Number: IT/014/R.1
SOP Title: Monitoring and Performance Management
            NAME                 TITLE            DATE
Author      Sandeep R. Yadav     System Admin     11-07-2024
Reviewer    Ashutosh Awasthi     Senior Manager   11-07-2024
Authoriser  Mahaveer Devannavar  General Manager  11-07-2024
Effective Date: 11-07-2024
Review Date: 11-07-2024
1. PURPOSE
The purpose of this document is to ensure that every activity is carried out on its scheduled date and time, with reference to the activity checklist. Alert notifications are to be routed to the IT representative for action on resource-related issues, and the performance of the infrastructure is to be monitored.
2. INTRODUCTION
The activity checklist states which activities need to be performed on a given day. Alert notifications report any abnormality in resources such as RAM, CPU and disk, so that performance can be monitored and acted upon accordingly.
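As an illustration only, the activity checklist can be held as data so that the tasks due on a given day are listed automatically. The sketch below is not part of the existing procedure: the `Task` structure, the `last_done` field, the frequency names and the day counts used for quarterly and half-yearly intervals are all assumptions, and the dates are sample values.

```python
from dataclasses import dataclass
from datetime import date

# Interval lengths in days; the month-based values are rough assumptions.
INTERVAL_DAYS = {"daily": 1, "weekly": 7, "quarterly": 90, "half_yearly": 180}

@dataclass
class Task:
    item: str         # checklist item number, e.g. "5.1.4"
    description: str
    frequency: str    # key into INTERVAL_DAYS
    last_done: date   # hypothetical field: date the task was last completed

def tasks_due(tasks, today):
    """Return the tasks whose interval has elapsed since they were last done."""
    return [t for t in tasks
            if (today - t.last_done).days >= INTERVAL_DAYS[t.frequency]]

# Sample entries with made-up completion dates.
checklist = [
    Task("5.1.1", "Host server console login", "weekly", date(2024, 7, 8)),
    Task("5.1.4", "VM server backup status check", "daily", date(2024, 7, 10)),
    Task("5.2.9", "Sensys backup drill", "quarterly", date(2024, 6, 1)),
]

for task in tasks_due(checklist, date(2024, 7, 11)):
    print(task.item, task.description)
# prints: 5.1.4 VM server backup status check
```

Tracking the last completion date, rather than fixed weekdays, is one possible design; it fits items such as "once a week" where the SOP does not name a day.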
3. SCOPE
This SOP applies to all servers (VMs, physical servers and AWS servers) and storage devices configured for running the Somaiya infrastructure.
4. RESPONSIBILITIES
4.1 Role of System Administrator:
The System Administrator is responsible for ensuring that every activity on the checklist is carried out as per schedule, and that alert notifications are configured for all critical servers so that performance issues can be addressed. The System Administrator is also responsible for escalating problems to the Senior Manager, Head of Department.
5. SPECIFIC PROCEDURE
5.1 Checklist for checking health of VMware infrastructure (Host server, Synology Storage, VM server, Backup, Replication, Windows server, Linux server):
Daily/Weekly/Monthly task:
5.1.1 Host server console login for accessibility check. (once a week)
5.1.2 Host server IMM and iLO login for hardware functionality check. (once a week)
5.1.3 VM server login for accessibility check. (once a week; every day for the Oracle database, My account and EMIS servers)
5.1.4 VM server backup status check. (every day)
5.1.5 VM server replication status check. (every day)
5.1.6 vCenter Server check of resources across the whole VMware infrastructure. (every day)
5.1.7 Oracle database listener log file size check. (every day)
5.1.8 Oracle database error log file size check. (every day)
5.1.9 Oracle database Archive, RMAN, FBRMAN backup file check. (every day)
5.1.10 Datacentre walkthrough for amber LED indication check. (every day)
5.1.11 Oracle database dmp backup check. (every day)
5.1.12 Oracle database log collection. (every Friday, involving Rahul)
5.1.13 Host PuTTY login check. (every 3 months)
Activity:
5.1.14 Listener log purging activity. (when the listener log file reaches 2 to 2.5 GB, every 2 months)
5.1.15 Active Directory health assessment report and actions. (every six months, involving partner team)
5.1.16 VMware infrastructure health assessment report and actions. (every six months, involving partner team)
5.1.17 Host server security and vulnerability patch activity. (every six months, as per patch availability)
5.1.18 VMware Windows server security and vulnerability patch activity. (every six months)
5.1.19 Data scrubbing activity on Synology device. (every six months)
5.1.20 Hard disk quick and extended scan on Synology device. (every three months)
5.1.21 Shutdown and power-on of the whole VMware infrastructure. (every 6 months)
5.1.22 Password change activity on all VM servers. (every 6 months)
5.1.23 VM server backup drill. (every 6 months)
5.1.24 VM server replication drill. (proposed, every 6 months)
5.1.25 VMware DR site visit. (every 3 months)
5.1.26 Host server BIOS and firmware upgrades. (as per compatibility, involving partner team)
5.1.27 VMware ESXi upgrade, VMware Tools upgrade, hardware compatibility upgrade. (as per compatibility and updates, involving partner team)
5.1.28 Synology NAS device patch. (as per patch availability)
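The daily listener log size check (5.1.7) and the purging trigger in the activity list above could be supported by a small script. The following is a minimal sketch, assuming that the lower end of the 2 to 2.5 GB range is the trigger and that 1 GB is counted as 1024^3 bytes; the function names are illustrative and the log file path must be supplied by the administrator.

```python
import os

GIB = 1024 ** 3
# Assumption: purge once the log reaches the lower end of the 2 to 2.5 GB range.
PURGE_THRESHOLD_BYTES = 2 * GIB

def needs_purge(size_bytes, threshold=PURGE_THRESHOLD_BYTES):
    """True when a log of the given size has reached the purge threshold."""
    return size_bytes >= threshold

def check_log(path):
    """Return (size in GiB, purge flag) for the log file at `path`."""
    size = os.path.getsize(path)
    return size / GIB, needs_purge(size)

# Example with sizes only, so that no real log file is required.
print(needs_purge(int(1.2 * GIB)))   # False
print(needs_purge(int(2.3 * GIB)))   # True
```

In practice `check_log` would be called with the actual listener log location, which this SOP does not specify.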
5.2 Checklist for checking health of AWS infrastructure:
Daily/Weekly/Monthly task:
5.2.1 AWS instance login for checking accessibility. (once a week)
5.2.2 AWS instance snapshot backup check. (every day)
5.2.3 RDS snapshot backup check. (every day)
5.2.4 Sensys backup check on AWS infra for HR data (download files to the local system). (every day)
5.2.5 Amazon GuardDuty scan check for critical AWS instances. (every day)
5.2.6 Sizing/Resizing (increase/decrease) AWS instance configuration. (as per requirement)
5.2.7 AWS instance status check (3/3 checks passed). (every day)
Activity:
5.2.8 PHP flush session activity on two AWS instances for clearing PHP sessions. (last Thursday of every month)
5.2.9 Sensys backup drill activity for data integrity check. (every 3 months)
5.2.10 AWS instance Windows security and vulnerability patch activity. (every 6 months)
5.2.11 AWS assessment report preparation and actions. (every 6 months, involving partner)
5.2.12 AWS instance reboot activity (shut down the AWS instance and power it on again). (every 6 months)
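The PHP flush session activity (5.2.8) falls on the last Thursday of the month, a date that changes every month. As a convenience, that date can be computed as sketched below; the function name is illustrative and only the Python standard library is used.

```python
import calendar
from datetime import date

def last_thursday(year, month):
    """Return the date of the last Thursday in the given month."""
    last_day = calendar.monthrange(year, month)[1]   # number of days in the month
    end = date(year, month, last_day)
    # weekday() counts Monday as 0, so Thursday is 3; step back to the nearest one.
    offset = (end.weekday() - calendar.THURSDAY) % 7
    return end.replace(day=last_day - offset)

print(last_thursday(2024, 7))   # 2024-07-25
```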
5.3 IT college physical server:
Daily/Weekly/Monthly task:
5.3.1 Physical server login for accessibility check. (once a week)
5.3.2 Physical server backup check. (every day)
5.3.3 Visit to IT College for checking physical server status. (every six months)
5.4 Backup Ananth physical server configuration file:
Daily task:
5.4.1 Ananth server configuration backup. (every day)
5.5 Syslog file check:
Daily task:
5.5.1 IT College syslog server log file check. (every day)
5.5.2 Somaiya syslog server log file check. (every day)
5.5.3 IT College syslog log file shipping. (once a week)
5.6 Hosting:
5.6.1 GoDaddy console login for verifying renewals if any. (randomly)
5.6.2 Educause console login for verifying renewals if any. (randomly)
5.6.3 ERNET console login for verifying renewals if any. (randomly)
5.6.4 PDP console login for verifying renewals if any. (randomly)
5.7 Website:
Daily/Weekly task:
5.7.1 Critical website sanity check. (every day)
5.7.2 Domain and subdomain verification. (every 6 months)
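The daily sanity check of critical websites (5.7.1) can be partly automated. The sketch below is one possible approach, not the procedure currently in use: it treats a site as healthy when it answers with an HTTP 2xx status within a timeout. The function name, the 10-second timeout and the `opener` parameter (present only so the check can be tested without network access) are assumptions; the list of critical URLs is not given in this SOP and must be supplied.

```python
import urllib.error
import urllib.request

def site_is_up(url, timeout=10, opener=urllib.request.urlopen):
    """Return True if the URL answers with an HTTP 2xx status within the timeout."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, DNS failures, refused connections and timeouts.
        return False

def failed_sites(urls, **kwargs):
    """Return the subset of URLs that did not pass the check."""
    return [url for url in urls if not site_is_up(url, **kwargs)]
```

A status check of this kind only confirms that the site responds; a manual look at page content may still be needed for a full sanity check.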
5.8 ALERTING:
5.8.1 Resource utilization alerting mechanism, notifying the IT representative of any abnormality in RAM, CPU or disk usage.
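To illustrate the alerting mechanism, the sketch below compares utilization readings with thresholds and produces one message per breach. This SOP does not state threshold values or name a monitoring tool, so the percentages, the function names and the sample readings are all assumptions; only the disk reading is taken from the live system, using the standard library.

```python
import shutil

# Hypothetical thresholds in percent; the SOP does not state actual values.
THRESHOLDS = {"cpu": 90.0, "ram": 90.0, "disk": 85.0}

def find_breaches(readings, thresholds=THRESHOLDS):
    """Return an alert message for every reading at or above its threshold."""
    return [
        f"{name} utilization {value:.1f}% >= {thresholds[name]:.1f}%"
        for name, value in readings.items()
        if name in thresholds and value >= thresholds[name]
    ]

def disk_percent(path="/"):
    """Percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

# Sample CPU and RAM readings; a real deployment would take them from a monitoring agent.
sample = {"cpu": 42.0, "ram": 93.5, "disk": 60.0}
for message in find_breaches(sample):
    print(message)
# prints: ram utilization 93.5% >= 90.0%
```

The resulting messages would then be sent to the IT representative by whichever notification channel is configured, such as email or SMS.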
6. DEFINITIONS
6.1 Resources: A resource is anything that can be used to perform a task or achieve a goal. In computing, system resources refer to things like CPU, memory, hard drive storage, network bandwidth, and battery life.
6.2 Alerts: Alerting is the capability of a monitoring system to detect meaningful events that denote a grave change of state and to notify the operators about them. The notification is referred to as an alert and is a simple message that may take multiple forms: email, SMS, instant message (IM), or a phone call.
7. FORMS/TEMPLATES TO BE USED
A standard template is used by the Department of Information Technology to record incidents.
8. CHANGE HISTORY
SOP No.     Effective Date  Significant Changes  Previous SOP No.
IT/014/R.1  01-12-2022      First version        N.A.