As a Site Reliability Engineer (SRE), you'll play a pivotal role in ensuring the health, reliability, performance, and scalability of our applications. You'll bridge the gap between development and operations, leveraging your technical expertise and problem-solving skills to triage production issues, automate operations, optimize processes, and maintain high availability.
Key Responsibilities:
• Steward of Application Health: Work closely with application developers to design resilient, scalable, and maintainable applications, ensuring they meet operational requirements and minimize downtime.
• Collaboration: Participate in code review; mentor and train peers; advocate DevOps principles to application developers.
• Infrastructure Automation: Develop and maintain automation and tools to streamline deployments, configuration management, and infrastructure provisioning.
• Monitoring and Alerting: Implement robust monitoring systems to proactively identify and address performance bottlenecks, anomalies, and security threats.
• Capacity Planning: Forecast resource needs and optimize infrastructure utilization to ensure high availability and performance.
• Change Management: Collaborate with development teams to ensure smooth deployment of new features and updates, minimizing disruptions.
• Security and Compliance: Adhere to security best practices and implement measures to protect systems from vulnerabilities and threats.
• Guardian of SLAs: Actively monitor and maintain the health and performance of applications, ensuring they meet Service Level Agreements. Respond to, triage, and mitigate emergent problems in production.
• On-Call Support: Participate in the on-call rotation to provide timely support and resolution for critical issues.
Required Skills and Experience:
• Strong programming skills in languages such as Python, Ruby, Bash, Rust, or Go
• Experience with cloud platforms (AWS, GCP, Azure) and infrastructure as code tools (CloudFormation, Terraform, Ansible, Chef)
• Deep understanding of containerization technologies (Docker, Kubernetes)
• Proficiency with Linux (Ubuntu)
• Proficiency in monitoring and alerting tools (CloudWatch, Prometheus, Grafana)
• Knowledge of DevOps practices and methodologies (CI/CD, Agile)
• Excellent problem-solving and troubleshooting skills
• Strong communication and collaboration abilities
Preferred Skills and Experience:
• Knowledge of asynchronous processing (Kafka, Celery)
• SQL database administration and query optimization (Postgres, MySQL)
• Experience with GitHub Actions
Why Join Us:
• Be part of a dynamic and innovative team
• Work on cutting-edge technologies and projects
• Opportunity for professional growth and development
If you are passionate about building reliable and scalable systems and have a strong foundation in DevOps, we encourage you to apply.
We are an Equal Opportunity Employer, including disability/vets.