- Review monitoring and alerting, and recommend enhancements toward 360° coverage.
- Create dashboards, set up synthetic and real user monitoring, visualize large data sets with interactive custom dashboards, and set up alerts, reports, and self-remediation actions, leveraging the AIOps capabilities of APM tools.
- Identify manual tasks that can be automated and propose utilities, solutions, and a plan, including CI/CD implementation and best-practice enforcement.
- Review Reliability/Resiliency assessment strategy and results/observations to provide recommendations for improvement
- Support reporting and tracking of reliability defects in the management platforms
Key Skills
- Experience in non-functional requirements management: gathering, determination, enforcement, assessment, and assurance
- Should have experience handling distributed (preferably multi-cloud) infrastructure.
- Should have worked on a minimum of 3 projects in performance monitoring of the application/infrastructure domain, with deployment experience in APM tools and cloud monitoring tools
- Strong working knowledge of Git and code-review systems such as Gerrit, Bitbucket, and GitHub
- Deep understanding of SRE concepts such as SLOs, SLIs, and error budgets; knowledge of change management, Agile, ITIL concepts, SOP creation, and life cycle management is a plus.
- Deep understanding of CI/CD technologies and tools, and a good understanding of AIOps
Good to have Skills
- DevOps tools: Terraform/CloudFormation, Ansible, Chef, Puppet, Jenkins
- APM tools: AppDynamics, Dynatrace, ELK, New Relic, eG Innovations, Splunk, BMC TrueSight
- Infra tools: Micro Focus, SolarWinds
- Cloud monitoring tools: CloudWatch, Azure Application Insights, Datadog
- Scripting: JavaScript, Python, PowerShell, Unix shell
- Fundamental knowledge: Docker, Kubernetes