SRE Engineer
UST Global Inc
Position: Site Reliability Engineer (SRE) – Azure AKS
Overview:
This role involves supporting the administration of Azure Kubernetes Service (AKS) clusters that run critical, always-on middleware systems handling thousands of transactions per second (TPS). The SRE will focus on achieving and maintaining a five-nines (99.999%) availability target, ensuring a scalable and reliable production environment.
Key Responsibilities:
Cluster Operations: Manage AKS cluster deployments, updates, and cutovers. Perform base image updates and test infrastructure-as-code (IaC) changes. Handle daily operations aligned with reliability goals.
Automation & Engineering: Apply software engineering practices to IT operations tasks. Develop and maintain IaC and automation scripts for creating monitoring queries and analyzing logs, conducting disaster recovery tests and incident response, and documenting code for operational and reference purposes.
Collaboration: Work with cross-functional teams to improve reliability and scalability.
Required Skills & Technologies:
Cloud Expertise: Azure (mandatory) with strong networking skills.
Programming: Proficiency in Python or Go.
Container Orchestration: Kubernetes (AKS) with Helm.
Infrastructure as Code (IaC): Terraform.
CI/CD Pipelines: Experience with any CI/CD tools (GitHub Actions preferred).
GitOps: ArgoCD.
Observability Platforms: ELK Stack or Grafana Loki.
Operating Systems: Linux with cloud networking expertise.
Preferred Skills:
Monitoring and Logging: OpenTelemetry.
Secrets Management: HashiCorp Vault.
Experience in regulated environments with sensitive data and heightened security requirements.
Education & Experience:
Academic Background: Bachelor’s degree in computer science, computer engineering, information technology, or equivalent practical experience.
Professional Experience: 4–6 years in a related field, with at least 2 years as an SRE.