Irvine, CA, US
1 day ago
Software Dev Engineer II, Incident Command Systems
We're hiring a Software Development Engineer II to contribute to our Monitoring & Detection engineering efforts as part of the incident response program for Amazon's worldwide retail websites. As we reimagine incident management and response for Amazon's rapidly evolving retail operations, we need skilled engineers to help us keep pace. In this role, you will play an important part in developing and implementing key components of our strategic platform for the central incident response team. Your work will directly impact the decisions made by Amazon teams during critical incidents, where every minute counts. You'll collaborate with senior team members to analyze post-incident data, identify improvement opportunities, and address potential blind spots in our systems. This position requires a mix of technical problem-solving skills and the ability to work in a fast-paced, complex environment. While you may not lead major initiatives, you'll be deeply involved in the technical aspects of our incident management capabilities, contributing significantly to the stability and reliability of Amazon's retail platforms.

Key job responsibilities
As a Software Development Engineer II on our team, you will play an important role in building and integrating key performance indicators for various services into our incident management platform. Working within Amazon's complex architectural landscape, you'll collaborate with service owners across the organization to develop and maintain software features for our monitoring systems. Your responsibilities will include designing scalable solutions that support the monitoring of numerous services, with guidance from senior team members to ensure alignment with long-term strategies. You'll be deeply involved in the full software development lifecycle, from scoping and design to coding, testing, deployment, and maintenance. Collaborating with stakeholders to understand business and customer value will be crucial as you work to deliver appropriate solutions. You'll contribute to documentation, participate actively in code reviews, and demonstrate operational excellence in all aspects of your work. Balancing new feature development with operational needs, you'll make effective priority trade-offs and help resolve root causes of issues. As you grow in this role, you'll have opportunities to mentor junior engineers and support new team members. This position requires a passion for understanding Amazon's retail business and providing real-time visibility into its operational health. You should be comfortable working in a dynamic environment, leveraging your problem-solving skills to tackle complex challenges, and collaborating across the Amazon ecosystem.

A day in the life
A day in the life of a Software Development Engineer II on our team is filled with the challenge of navigating Amazon's complex, semi-connected systems. The scale of the company's operations presents unique technical problems that require creative problem-solving and persistence. One moment, you might be collaborating with service owners to design a scalable monitoring solution for a critical service. The next, you could be diving deep into post-incident analysis, uncovering root causes and identifying areas for improvement. Throughout the day, you'll work closely with your teammates, contributing to code reviews, documenting systems, and mentoring junior engineers. While the challenges may not be easy, you'll find immense satisfaction in knowing that your efforts directly contribute to enhancing the monitoring capabilities that are crucial for safeguarding the seamless operation of Amazon's retail experiences. By embracing these complexities and leveraging your technical expertise, you'll play a vital role in the central reliability and response efforts, helping to improve Amazon's operational resilience and responsiveness.

About the team
The Incident Command Systems team at Amazon is responsible for envisioning and building programs that consistently improve remediation times for outages. The group consists of multiple 2-pizza teams (teams of 6-10 engineers) that each own software components for monitoring and anomaly detection of website-degrading issues, as well as the incident management software used during these outages.