## Site Reliability Engineer (SRE) - Mission-Critical SaaS Cloud Products
## Key Responsibilities
### Reliability and Performance Management
- Design, implement, and maintain highly available, scalable, and resilient cloud-native architectures for mission-critical SaaS products.
- Develop and implement SLOs, SLIs, and SLAs to measure and improve service reliability.
- Continuously optimize system performance and resource utilization across multiple cloud platforms.
- Fine-tune and optimize application performance by analyzing code, traces, and database queries.
### Incident Management and Troubleshooting
- Lead incident response efforts, effectively troubleshooting complex issues to minimize downtime and impact.
- Reduce Mean Time to Recover (MTTR) through proactive monitoring, automated alerting, and efficient problem-solving techniques.
- Conduct thorough Root Cause Analysis (RCA) for all major incidents and implement preventive measures.
### Observability and Monitoring
- Design and implement end-to-end observability solutions across our distributed systems.
- Develop and maintain comprehensive monitoring strategies using tools such as the ELK Stack, Prometheus, and Grafana.
- Create and optimize product status dashboards to provide real-time visibility into system health and performance.
### Automation and Infrastructure as Code (IaC)
- Implement Infrastructure as Code practices using tools like Terraform.
- Develop and maintain automated deployment pipelines and CI/CD workflows.
- Create self-healing systems and automate routine operational tasks to reduce manual intervention.
### Cloud-Agnostic Architecture
- Design and implement cloud-agnostic solutions that can operate efficiently across multiple cloud providers.
- Develop expertise in event-driven architectures and related technologies (e.g., Apache Kafka/Azure Event Hubs, Redis, MongoDB Atlas, Azure IoT Hub).
- Implement and manage containerized applications using Kubernetes across different cloud environments.
### Continuous Improvement
- Regularly review and refine operational practices to enhance efficiency and reliability.
- Stay updated with the latest industry trends and technologies in SRE, cloud computing, and DevOps.
- Contribute to the development of internal tools and frameworks to support SRE practices.
## Requirements
- Strong knowledge of cloud platforms, particularly Azure, and their associated services on private networks.
- Expertise in observability tools (ELK Stack, Dynatrace, Prometheus).
- Expertise in containerization technologies such as Docker and Kubernetes.
- Understanding of event-driven architecture and database technologies (MongoDB Atlas, Azure SQL, PostgreSQL).
- Proficiency in IaC and CI/CD tools such as Terraform and GitHub Actions.
- Proficiency in one or more programming languages: Python, .NET, or Java.
- Strong understanding of networking concepts, load balancing, and security practices.
## You Must Have
- Bachelor's degree with 10+ years of experience.
## We Value
- Understanding of various software development lifecycles.
- Some relevant experience.
- Knowledge of software configuration management and change management practices.
- Diverse and global teaming and collaboration.
- Effective communication skills.
- Self-motivation and the ability to work with little supervision: consistently taking the initiative to get things done, and acting before being asked by others or forced to by events.
- Ability to consistently make timely decisions even in the face of complexity, balancing systematic analysis with decisiveness.
- Ability to quickly analyze, incorporate, and apply new information and concepts.
## Additional Information
- JOB ID: HRD250587
- Category: Engineering
- Location: Lot 115 (P), Nanakramguda Village, Serilingampally Mandal, RR District, Hyderabad, Telangana State, 500019, India
- Exempt
- Early Career (ALL)