Senior Site Reliability Engineer
Zycus
Zycus is seeking Site Reliability Engineering Experts with a strong focus on Kubernetes and Python. Candidates should have 3-7 years of experience in application performance monitoring and automation, along with expertise in administering and scaling middleware. Proficiency with monitoring tools such as AppDynamics, Graylog, Dynatrace, and Datadog, as well as experience with Ansible, Grafana, and Prometheus, is essential. We’re looking for individuals passionate about solving complex production issues in distributed systems, multi-tenant services, and large-scale infrastructures. If you thrive in dynamic environments and are eager to tackle challenging problems, we want to hear from you!
Responsibilities: Perform performance tuning, resource trending, and capacity planning for the overall infrastructure using tools such as Grafana, Logstash, and Prometheus. Lead and train the team on site reliability best practices and set up KPIs for them. Proactively seek knowledge from the engineering team and contribute to design decisions for middleware technologies. Serve as an escalation point for the app support and system engineering teams. Identify opportunities to simplify ad hoc and daily tasks through automation. Review and influence ongoing design, architecture, standards, and methods for operating services and systems. Experiment with new and relevant technologies and tools, and drive adoption of the proposed technologies. Bring deep knowledge and hands-on experience of a wide variety of multi-tier and multi-tenant architectures. Participate in overall service capacity planning and demand forecasting, software performance analysis, and system tuning. Drive the reliability and supportability of the cloud service, including change management, triage of customer escalations, remediation plans, and DevOps automation (Kubernetes, Ansible playbooks).