Taguig City, PHL
10 days ago
Senior Site Reliability Engineer (Intelligent Operations)
Job Location: Taguig City

Job Description

Overview of the job
As the Senior SRE in Intelligent Operations, you will be responsible for managing and optimizing internally developed Observability and Event Management platforms. You will work as a hands-on SRE and also lead the regional SRE team for Asia. You will split your time between operational work (50%) and engineering work (50%) to provide system reliability, optimize cost, meet compliance and security requirements, and deliver new functionality for the business.

Your team
You will lead a three-person SRE team in the Philippines. You will report to the IT Engineering Director, and your team will be part of a larger Engineering team consisting of:
+ 2 other regional SRE teams (Europe and Americas)
+ Core Platform Engineering team (located in Europe)
Your direct team will cooperate closely with engineers across the Engineering team, as well as with product managers, product owners, and activation leaders from the Observability and Event Management product teams.

What success looks like
+ Provide operational support for the Asia region, both through individual contribution and by managing the regional SRE team.
+ Maintain and improve system reliability. Design, maintain, and test disaster recovery and business continuity.
+ Drive platform cost visibility and cost optimization.
+ Design, implement, and drive adoption of self-service capabilities for platform users.
+ Design and develop system enhancements.

Responsibilities of the role

Operational support for the Asia region:
+ Build and lead a regional SRE team, providing strong leadership, mentorship, and accountability for its performance and results.
+ Manage the regional SRE team's work and time to ensure the right coverage.
+ Work hands-on on incident management, client onboarding, and customer requests, ensuring timely resolution and adherence to SLIs, SLOs, and SLAs.
+ Own SLOs and SLIs and continuously improve them.
+ Reduce Mean Time to Restore Service and Mean Time to Resolve Requests.

Platform reliability and resilience:
+ Drive the implementation of self-observability and self-healing capabilities, leveraging industry-leading tools and technologies.
+ Design, maintain, and regularly test business continuity processes.
+ Design, maintain, and regularly test disaster recovery mechanisms.
+ Perform changes, upgrades, and regular maintenance tasks for existing solutions, ensuring system stability and optimized performance.

Platform cost:
+ Drive visibility of cost components and their association with the pricing model.
+ Run ongoing FinOps processes; identify and execute cost-saving initiatives.

Self-service of platform capabilities:
+ Translate insights gained from customer requests and onboardings into actionable proposals for automation, new capabilities, and process changes that increase self-service among users.
+ Increase the scope of self-service capabilities and drive their adoption among the user community.
+ Automate the onboarding of new projects and plants into the platform.

Platform enhancements:
+ Collaborate closely with product and engineering teams to influence the product roadmap, provide valuable input for product increments, and align sprints with operational requirements.
+ Own the design and development of selected user stories from the product backlog.

Job Qualifications

Role Requirements

Technical Expertise and Experience:
+ Extensive knowledge and experience in IT technologies, spanning from operating systems to network infrastructure and cloud platforms.
The following technologies are required for the role (note: we are open to candidates who have a solid foundation or an incomplete mix of these skills but are determined to learn on the job):
+ Proficiency in Kubernetes, with hands-on experience in running and investigating workloads in a Kubernetes environment.
+ Strong scripting and automation skills.
+ Familiarity with GitHub.
+ Hands-on experience with cloud platforms (Azure preferred) and infrastructure provisioning.
+ Familiarity with observability ecosystem tools such as Prometheus, Thanos, Grafana, etc. would be advantageous but is not mandatory.

Soft skills:
+ Strong planning and organizational skills, enabling effective work and task management for oneself and the team.
+ Strong problem-solving and troubleshooting skills, with the ability to analyse complex issues and devise effective solutions.
+ Effective communication: conveying real-time information during incidents and translating technical issues into clear messages for non-technical stakeholders.
+ Proactive and self-motivated, with a continuous learning mindset and a drive to stay up to date with industry trends and technologies.
+ Ability to thrive under pressure and manage incidents effectively, ensuring timely resolution and minimizing downtime.

About us
We produce globally recognized brands and we grow the best business leaders in the industry. With a portfolio of trusted brands as diverse as ours, it is paramount that our leaders are able to lead with courage across the vast array of brands, categories, and functions. We serve consumers around the world with one of the strongest portfolios of trusted, quality, leadership brands, including Always®, Ariel®, Gillette®, Head & Shoulders®, Herbal Essences®, Oral-B®, Pampers®, Pantene®, Tampax® and more. Our community includes operations in approximately 70 countries worldwide. Visit http://www.pg.com to know more.

We are an equal opportunity employer and value diversity at our company. We do not discriminate against individuals on the basis of race, color, gender, age, national origin, religion, sexual orientation, gender identity or expression, marital status, citizenship, disability, HIV/AIDS status, or any other legally protected factor.

Job Schedule: Full time
Job Number: R000103344
Job Segmentation: Experienced Professionals