- Release Management: Coordinate and manage release cycles for observability
platforms. Ensure smooth and timely releases with minimal disruption to
services. Work with partners to migrate legacy monitoring to modern solutions.
Work with the observability engineering team to meet new requirements as they
arise, by leveraging existing solutions or developing new ones.
- Incident/Request Management: Troubleshoot and resolve incidents related to
observability platforms. Manage escalated customer issues and requests,
ensuring timely and effective resolution. Document incident remediation
activities and automate them where possible.
- Performance Optimization: Continuously monitor and enhance platform
performance to support growing scale and complexity.
- Collaboration and Communication: Collaborate with cross-functional
infrastructure, application, and business stakeholders to ensure observability
solutions align with the broader IT strategy and infrastructure requirements.
Communicate effectively with team members, management, and other stakeholders.
- Continuous Improvement: Identify
opportunities for process optimization and efficiency gains. Stay current with
industry trends and best practices to continuously
improve observability operations.
- Customer Focus: Ensure
high levels of customer satisfaction by effectively managing customer
relationships. Provide excellent customer
service and support for observability solutions.
- Compliance and Security: Ensure
observability platforms comply with organizational policies and security
standards. Implement tools and processes to detect and
remediate configuration drifts and security risks.
- Documentation and Reporting: Maintain comprehensive documentation of the
observability platform, Product DOU, processes, and procedures.
Technical Expertise:
- 5+ years of experience in IT operations, with significant responsibilities
in system monitoring, performance tuning, and troubleshooting enterprise
applications.
- 4+ years in a Site Reliability Engineering (SRE) role managing modern
observability solutions.
- 5+ years of development experience on enterprise-class applications:
JavaScript/Java, SQL, Spring Boot, and microservices.
- 5+ years implementing and managing observability and event management
platforms (e.g., AppDynamics, Splunk, Prometheus, Grafana).
- 5+ years of experience with cloud computing platforms (GCP) and container
orchestration (e.g., Kubernetes, Docker).
- Familiarity with CI/CD pipelines and automation tools (e.g., Jenkins,
GitLab, ArgoCD).
- Experience developing and
implementing monitoring and logging standards for infrastructure, platforms, and applications.
- Experience establishing and implementing event correlation policies and
related rules to enrich event data and reduce TTD and TTR.
Salary Range - $90,790-$110,000 a year