We are looking for a Senior Operations Engineer to join our Network Operations Center (NOC) night team. In this role, you will be a strong contributor to maintaining the operational health of complex cloud infrastructure systems. You will leverage advanced monitoring tools, including AWS CloudWatch, Datadog, and Cloudflare, to ensure the reliability, availability, and performance of customer systems. Your experience and technical expertise will help drive improvements in monitoring, incident response, and customer satisfaction, building on the processes already established for our 24x7 Virtual Operations Center. Critical tasks include preventing or mitigating customer impact by monitoring, identifying, triaging, and resolving issues. As a team member, you will work with other engineering teams on escalations to ensure our services remain available, secure, and performing at world-class levels. You’ll participate in continuous improvement initiatives by creating, updating, and evolving alerts, runbooks, and operational processes.
This role is only available outside the US and not available to any individual based within the US or any US territory.
Responsibilities
- Provide first response and act as a reference for the team in the monitoring, troubleshooting, and resolution of complex incidents within the cloud infrastructure, working closely with other Engineering teams
- Develop and implement best practices for monitoring, response, and fulfillment of our Incident, Change, and Service Request queues
- Analyze system logs and performance metrics to identify issues and improve overall system reliability
- Collaborate with engineering teams to optimize and introduce new monitoring solutions; coordinate incident response when service impacts occur and support Post-Mortem efforts to prevent recurrence
- The NOC operates 24x7; team members are required to work shifts that include nights, weekends, and holidays
- Maintain a solid understanding of cloud infrastructure and services, enhancing your technical skills over time
Requirements
- A proven track record: at least 5 years of success with highly scaled internet/mobile application environments, including 3 years working in a Security Operations Center (SOC) or Network Operations Center (NOC)
- Sense of urgency: you rapidly acknowledge and engage on alerts, maintaining our excellent team SLAs
- Knowledge of Incident and Problem Management and ITSM tools (like Jira, Zendesk, Confluence)
- Hunger to continue learning new technologies: you have UNIX/Linux and cloud system administration experience and are eager to expand that skill set
- Ability to think critically and strategically in a fast-paced, customer-centric environment
- Expert-level proficiency in industry-leading tools for infrastructure and application monitoring (like AWS CloudWatch, Datadog, Splunk, Cloudflare)
- Strong communication skills with the ability to convey complex technical issues to both technical and non-technical stakeholders (English is required)
- Customer obsessed with technical curiosity: you are skilled at breaking down complex technical issues. You enjoy using available tools and data not only to fix issues for our customers but also to prevent them from happening again
- Ability to work in a fast-paced environment and adapt to changing priorities
- Bachelor's degree in Information Systems, or equivalent experience
- Excellent high-speed connectivity from home
- AWS Certified Cloud Practitioner (CCP) certification required
- ITIL Foundation is a plus
- Excellent communication skills, both written and spoken (fluency in English required)