Site Reliability Engineer.
Responsibilities
· The engineer will help clients navigate and adopt IT methodologies and operating models that drive business agility using SRE and Agile frameworks.
· As an SRE, you will work closely with our clients to define their operational and governance models.
· Design and deploy scalable, reliable, and secure SRE solutions.
· The ideal candidate will combine technical and business skills and a passion for working with clients to deliver excellence.
· This position is responsible for collaborating with teams to build tools and strategies for problem detection, prevention, and chaos testing.
· Troubleshoot production incidents in real time and conduct blameless post-mortems.
· You will work with a diverse set of projects, clients, industries, and frameworks, with opportunities to broaden your experience and reach your personal development goals.
· Mentors and coaches other members of the Agile team. Leads a small team of DevOps engineers using Agile methodology, with a focus on continuous delivery.
· Provides functional and technical expertise on monitoring, observability, and resilience.
· Drives engagement with Security and Infrastructure teams to ensure secure deployment of applications.
· Assists in production support and maintenance of applications as needed.
· Develops and maintains documentation.
Must have
· Proven experience in an SRE or similar role
· Experience with DevOps in public cloud (AWS/Azure)
· Expertise in defining SLAs, Error Budgets, and other key metrics for stable, scalable, robust application infrastructure.
· Good understanding of chaos engineering and canary deployments
· Good understanding of Cloud Infrastructure services and their limitations
· Experience configuring and monitoring application attributes and handling scale-up and scale-down scenarios in a cloud environment.
· Experience deploying and managing container orchestration, service mesh, serverless, API gateways, and observability stacks.
· Experience building and deploying applications as containers on a cloud platform using an automated CI/CD pipeline.
· Experience with network technologies and with system, security, and network monitoring tools
· Experience using Terraform for IaC automation.
· Deep knowledge of monitoring and observability tools (e.g. Prometheus, Grafana, ELK Stack, Datadog, AppDynamics, New Relic, or similar)
· Deep knowledge of alerting tools (e.g. PagerDuty, Zenduty, or similar)
· Practical scripting skills are a must.
· At least 3 years of experience working in an Agile team.
Qualifications
· Bachelor’s degree or equivalent in Computer Science, Engineering, or a related field, or additional comparable experience
· Proven experience in IT, application and infrastructure monitoring, and DevOps, including excellent knowledge of networking, computing, and storage.
· Industry certifications in Monitoring, Observability, SRE, DevOps, and Cloud services are a strong plus.
· Any SAFe certification is desirable.