About the Role
We are looking for a Resiliency Engineering Lead to enhance system resilience by proactively identifying and addressing potential failures. This role demands expertise in resiliency engineering and performance engineering, along with proficiency in chaos engineering tools such as Harness, LitmusChaos, and other native or open-source solutions. You will work with containerized and distributed architectures across AWS, Azure, and GCP, designing and executing chaos experiments, integrating resiliency testing into CI/CD, and collaborating with SRE, DevOps, and Performance Engineering teams. Additionally, you will implement observability solutions, establish best practices, and promote a failure-driven learning culture to ensure high availability, fault tolerance, and self-healing capabilities for critical systems.
Key Responsibilities:
- Develop strategies for building highly available and fault-tolerant systems by identifying and addressing single points of failure.
- Design and execute experiments that simulate real-world failures, using tools such as Chaos Monkey, Gremlin, or Litmus.
- Apply SRE principles to enhance system reliability and performance.
- Work with cloud platforms including AWS, Azure, and GCP to deploy and manage resilient applications.
- Collaborate with cross-functional teams, including DevOps, SRE, and development teams, to integrate resiliency best practices into the software development lifecycle.
- Lead post-mortem analysis of major incidents to identify root causes and create action plans to mitigate future risks.
- Provide mentorship and technical guidance to engineers on the team.
- Use New Relic, AWS X-Ray, and logs to track system behavior and find issues early.
- Build observability dashboards using the LGTM stack (Loki, Grafana, Tempo, Mimir), and implement distributed tracing and instrumentation.
- Contribute actively to CoE, continuous improvement, innovation, and research initiatives.
- Optimize cloud and large-scale systems: improve the efficiency of cloud, microservices, and containerized environments.
- Work well in a team and communicate effectively with all stakeholders.
Preferred Skills:
- Proficiency with chaos engineering tools such as Chaos Monkey, Gremlin, Litmus, or equivalent.
- Experience with AWS cloud platforms and technologies.
- Exposure to Azure and GCP cloud platforms and technologies.
- Proven experience with containerized and distributed architectures.
- Experience with JMeter, New Relic, and AWS CloudWatch.
- Exposure to GitHub, Grafana, and similar tools.
- Experience with CI/CD pipelines and Infrastructure as Code (e.g., Terraform, Ansible).
- Strong scripting and programming skills (e.g., Python, Go, or Java).
- Excellent problem-solving skills and a proactive approach to identifying and mitigating risks.
- Strong communication and collaboration skills to work effectively with stakeholders at all levels.
Qualifications:
- Bachelor’s degree in engineering or a related field.
- 10+ years of experience in performance and resiliency engineering.
- Strong experience with performance and resiliency engineering tools and methodologies.
- Experience with monitoring and tuning complex systems.
- Excellent analytical, problem-solving, and communication skills.
- Ability to work effectively in a fast-paced, collaborative environment.
Why Join Us?
- Work in a cutting-edge technology environment with a talented and passionate team.
- Opportunities for professional growth and career advancement.
- Comprehensive benefits package and competitive salary.
- Be part of a company that values diversity, inclusion, and innovation.
If you are a dedicated performance engineering leader looking for an exciting opportunity to make a significant impact, we would love to hear from you. Apply now to join our team and help us drive performance and reliability excellence!