Jersey City, NJ, USA
1 day ago
ALTS - Lead SRE

We are seeking an experienced Lead Site Reliability Engineer (SRE) to manage and guide our team. The ideal candidate will have a strong foundation in SRE, DevOps, or infrastructure engineering, with leadership skills and the ability to drive team success in a fast-paced, dynamic environment. This role involves overseeing the team's execution, risk management, and strategic initiatives while fostering a collaborative and innovative culture.

Key Responsibilities:

Team Leadership and Management:

- Lead, mentor, and develop a team of SREs, fostering a culture of collaboration and continuous improvement
- Set clear goals and expectations for the team, ensuring alignment with business objectives
- Facilitate regular team meetings and one-on-one sessions to support individual growth and team cohesion

Execution and Delivery:

- Oversee the delivery of major themes of work, ensuring high-quality execution and timely completion
- Guide the team in estimating delivery timelines and managing workloads effectively
- Provide expert guidance in debugging and systems design, encouraging innovative solutions and trade-off analysis

Risk Management:

- Assess cross-impact of team deliverables and ensure proactive communication of potential risks
- Support the team in identifying technical limitations and suggesting remediation strategies

Strategic Vision and Forward Thinking:

- Develop and implement strategic plans for building robust systems with strong contracts, anticipating future changes
- Encourage the team to propose alternative requirements and solutions that better meet organizational needs
- Set and prioritize the team's strategic book of work in line with the goals of the business

Communication and Stakeholder Engagement:

- Communicate effectively with stakeholders, providing updates on progress and raising risks that will impact delivery
- Ensure the team is aligned with the business vision and understands the importance of their contributions to the product

Qualifications:

- Experience directly leading or functioning as a lead of technical teams, with a focus on SRE, DevOps, or infrastructure engineering
- Proficiency in programming languages (Python preferred) and distributed systems (Kubernetes, Kafka, Cassandra, etc.)
- Experience with setting up and using SLOs to track system health and performance
- Excellent problem-solving skills and creativity in debugging complex issues
- Deep understanding of cloud fundamentals and infrastructure management
- Exceptional communication skills, with the ability to articulate technical problems and solutions to diverse audiences
- A strategic mindset with a keen interest in automation and learning
- A thorough understanding of the full stack of the system

 

An example of a task/problem to be tackled is below. Does leading a team solving system-wide problems excite you?

Our system has been working properly for the past few days in our UAT environment. We deployed a new version of core infrastructure that was tested in dev, found to be working, and then approved for UAT release. Suddenly, one of our services is not starting, and our product or QA team cannot test changes in this environment. We receive a ping/bug report that provides high-level information about what is happening, what the user would like to happen, and perhaps what they expect to happen. We ask you to take a look at the issue. Resolving this involves:

- Asking and communicating with the user to fully understand what the issue is
- Understanding where in the stack to begin debugging
- Constantly questioning your assumptions about the way the system should work
- Being able to ask the right questions to your peers and team to triage an issue
- Providing updates to stakeholders that are counting on you to identify or fix the problem
- Using your technical skill set to identify/reproduce the issue
- Communicating what you have found to the team so that we can best resolve the issue