Irvine, CA, US
24 days ago
Sr. TPM, Incident Command Systems
We’re hiring a Technical Program Manager (TPM) to help shape and drive Incident Command Systems’ engineering efforts as part of the incident response program for the worldwide Amazon retail websites.
We are reimagining incident management and response for Amazon’s retail operations. Amazon is evolving faster than our incident management and response programs can keep up with. It’s time to change that. You’ll report directly to the single-threaded owner for Incident Command Systems and will be responsible for driving programs across multiple “two-pizza” teams. You’ll be directly responsible for helping curate complex user requirements and driving adoption of both greenfield and established tooling. When Amazon is under duress, every single minute matters. Your efforts will influence Amazon executive decisions and every team at Amazon that interacts with our centralized control centers or outage calls. You’ll be required to dive deep into post-incident analysis to understand what went wrong, how we can improve, and how blind spots will be covered in the future. Amazon incidents are complex, fast-paced, and deeply nuanced. If you like a challenge, this is certainly an eye-opening one. This role is a fascinating and critical combination of technical deep dives, high-impact judgment calls, program/product/process management, and sweeping assessments of company-wide readiness for handling and managing incidents.
As a TPM on this team, you will own the short-term (< 1 year) and long-term (3+ year) vision for your programs, influence how the team operates, and determine the most important new programs to initiate. You will define, scope, and execute multiple simultaneous end-to-end programs based on data-driven analysis, collection of business and system requirements, and creation of software architecture specifications. You will raise the bar for engineering excellence by assessing, improving, and automating existing programs, tools, processes, and operations. You will force-multiply your impact by introducing programs focused on improving monitoring and detection of customer-impacting degradations and on best-in-class incident management tooling that reduces the effort, cost, and duration of incidents.
Your ability to work independently, address ambiguity, manage dependencies, communicate effectively, and dive deep into technical details is essential. Successful candidates will be able to balance thinking big with realism and deliver achievable software plans while communicating progress to executive audiences. You will help refine and track the fine-grained metrics that best represent our efforts and progress in accelerating detection and remediation of customer-impacting events. This role has the scope to affect Amazon’s entire ecosystem of tens of thousands of services and make a difference to both our customers and developers. Enhancing Amazon’s incident response posture creates a flywheel of improvements, including the adoption of best practices across all organizations.


About the team
The Incident Command Systems team at Amazon is responsible for envisioning and building programs that consistently improve remediation times for outages. This group consists of multiple two-pizza teams (teams of 6-8 engineers) that each own software components for monitoring and anomaly detection of website-degrading issues, as well as the incident management software used during these outages.