As part of the Global Network Engineering organization within Oracle Cloud Infrastructure (OCI), you will manage a team of Graphics Processing Unit (GPU) engineers responsible for GPU operations supporting Artificial Intelligence/Machine Learning (AI/ML) workloads in a broadly distributed, multi-tenant cloud environment. OCI is committed to providing the best in cloud products that meet the needs of our customers who are tackling some of the world’s biggest challenges. We are looking for smart, hands-on managers to join our Global Network Operations Center (GNOC) team, which acts as the front line for physical network issues and provides 24x7x365 support. GNOC is responsible for performing data collection, triage, technical analysis, incident mitigation, and redirection as needed to maintain and optimize operations for the physical network infrastructure.
Customers demand highly available cloud services. We help Oracle support a best-in-class cloud offering by enabling our engineers to easily maintain cloud solutions.
We are looking for a Network Operations Center Senior Manager to develop and lead a new GPU network operations center in the USA. The role will lead a team of network engineers to support 24x7 network operations of Oracle’s Cloud Infrastructure as part of the Global Network Operations Center organization. We need a strong leader to build and lead an engineering organization. Our team is responsible for supporting operational functionality of GPU delivery, health monitoring, triage automation, and diagnostic services. These are essential for running distributed AI/ML/HPC workloads across thousands of GPUs, leveraging technologies like RoCE and InfiniBand.
You must be passionate about operations and the customer experience. You should be comfortable supporting distributed systems that interact with a variety of services. You should enjoy building effective organizations, coaching and mentoring engineers, and representing your organization to senior leadership. You should value simplicity and scale, work comfortably in a collaborative, agile environment, and be excited to learn. Your excellent judgment and strong communication skills will be invaluable when defining the roadmap for your areas of ownership.
The right leader for this role will make all the difference for our organization, our product, and our customers. Are you able to provide direction and structure for your teams? Do you enjoy mentoring engineers? Are you able to take feedback and learn from engineers and leaders across a large organization? Do you thrive in a fast-paced environment, and want to be an integral part of a truly great team? Come join us!
Mandatory Qualifications:
· 5+ years of experience in large-scale physical network support
· 3+ years of experience in an engineering and operations management role
· Experience in a technical leadership and management role
· Experience driving hiring, onboarding new engineers, and managing ongoing performance
· Excellent organizational, verbal, and written communication skills
· Excellent judgment to influence product roadmap direction, features, and priorities
· Bachelor’s degree in Network Engineering, Computer Science, Electrical/Hardware Engineering or related field
Preferred Qualifications:
· Prior experience with large-scale data center operations
· Working knowledge of GPU/RDMA environments
· Working knowledge of equipment supporting AI/ML
· BS or MS degree or equivalent experience relevant to functional area
Career Level - M3