Senior DevOps Engineer R&D
Location: Casablanca (onsite work mode)
Job description:
We are looking for a highly skilled and experienced Senior DevOps Engineer with experience in Site Reliability Engineering (SRE) to join our growing team. In this role, you will develop and maintain CI/CD pipelines, automate infrastructure management, and ensure the reliability, scalability, and availability of our services. You will work closely with cross-functional teams (data scientists, software engineers, DevOps engineers) to ensure the infrastructure and applications can handle high traffic volumes with minimal downtime, while also mentoring junior engineers in DevOps best practices.
Responsibilities
What you’ll do
Build, manage, and optimize continuous integration and continuous deployment (CI/CD) pipelines to automate code testing, integration, and deployment processes.
Deploy and manage infrastructure using Infrastructure as Code (IaC) tools like Terraform. Ensure environments are reproducible, scalable, and cost-effective.
Work proactively on capacity planning, identifying performance bottlenecks, and scaling systems based on traffic patterns and load expectations.
Manage infrastructure and applications deployed on cloud platforms like OCI, AWS, Azure, or GCP.
Implement containerization using Docker and manage container orchestration with Kubernetes or similar tools.
Lead initiatives to optimize application performance, including database tuning, caching strategies, and application monitoring.
Work with engineering, security, and product teams to ensure high-quality service delivery, troubleshoot issues, and ensure reliable deployments.
Implement and promote SRE best practices to ensure high availability, uptime, and reliability of systems and services across development, staging, and production environments.
Design and manage monitoring, alerting, and incident response systems to detect performance issues, security vulnerabilities, and failures in real-time. Use tools like Prometheus, Grafana, Datadog.
Define, measure, and track SLOs and SLIs for services, ensuring adherence to error budgets, and collaborating with development teams to prioritize reliability work.
Develop and implement automated recovery and self-healing mechanisms to reduce human intervention during incidents, such as auto-scaling and auto-remediation processes.
Provide mentorship and guidance to junior engineers, helping them understand DevOps best practices and supporting their technical growth.
Qualifications
Bachelor's degree in Computer Science, Engineering, or a related field.
5+ years of experience in a DevOps, SRE, or related role, with expertise in building and maintaining reliable, scalable systems.
Strong experience with cloud platforms (OCI, AWS, Azure, or GCP).
Hands-on experience with containerization (Docker, Rancher) and orchestration platforms (Kubernetes, Helm).
Expertise in monitoring and alerting tools like Prometheus, Grafana, Datadog, or similar.
Experience with Infrastructure as Code (IaC) tools such as Terraform or Ansible.
Proficiency in scripting languages (Python, Bash, etc.) for automation tasks and system configuration.
In-depth experience with CI/CD tools like Jenkins, GitLab CI, or equivalent.
Strong understanding of networking protocols, security practices, and cloud-native architectures.
Experience working in an environment that practices SRE principles, including SLOs, SLIs, and error budgets.
Strong troubleshooting skills with the ability to manage complex production issues and outages.
Experience with performance tuning and optimization techniques for both applications and infrastructure.
Strong written and verbal communication skills in English.
Strong documentation skills with the ability to clearly outline processes and procedures for others to follow.
Personal Attributes:
Strong problem-solving and analytical abilities, with the ability to work under pressure during incidents.
Strong written and verbal communication skills in English, with the ability to collaborate effectively across teams.
Strong leadership capabilities to mentor junior engineers and foster a culture of reliability and continuous improvement.
A proactive, self-motivated approach to work, with a passion for improving system reliability and performance.
Career Level - IC3