Santa Clara, USA
Staff Software Engineer

We are seeking a highly skilled and motivated Senior PyTorch Internals Engineer to join our team and lead the charge in optimizing and porting PyTorch to novel hardware architectures. This role is crucial for bringing large language models (LLMs) to our next-generation platforms.

The ideal candidate will possess deep expertise in PyTorch internals, hardware architecture, and distributed training, particularly for LLMs. You will be responsible for understanding and modifying PyTorch's core components, optimizing performance for new hardware, and implementing efficient multi-chip training strategies.

Responsibilities:

PyTorch Internals Expertise: Dive deep into PyTorch's architecture, including its execution engine, autograd system, and memory management. Analyze and optimize PyTorch's performance bottlenecks.

Hardware Porting: Port PyTorch to our new accelerator architecture. Collaborate with the architecture team to understand hardware specifications and optimize PyTorch for architecture-specific features. Develop and maintain hardware-specific PyTorch backends.

Multi-Chip LLM Training: Design and implement efficient multi-chip training strategies for large language models. Apply a deep understanding of GPU inter-chip communication mechanisms, including NVIDIA NVLink and PCIe interconnects. Understand and optimize communication patterns between chips, including message passing and collective communication (illustrated in the sketch after this list). Implement and optimize data parallelism, model parallelism, and pipeline parallelism.

Compiler Integration: Work closely with the compiler team to integrate PyTorch with our hardware-specific compiler. Understand the interplay between PyTorch and compilers, including intermediate representations (IRs) and code generation.

Performance Analysis and Optimization: Profile and analyze the performance of PyTorch on target hardware. Identify and address performance bottlenecks. Develop and implement optimization techniques to improve training and inference speed.

Collaboration and Communication: Work cross-functionally with the architecture, compiler, and research teams, and communicate findings and trade-offs clearly.
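For illustration only: a minimal sketch of the collective-communication primitives this role works with (all-reduce for data-parallel gradient synchronization, all-gather for reassembling shards), written against torch.distributed with the NCCL backend. The single-node rendezvous settings and the two-GPU world size are assumptions made for the example, not a description of our platform.

    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp

    def worker(rank: int, world_size: int):
        os.environ["MASTER_ADDR"] = "127.0.0.1"  # single-node rendezvous (assumed)
        os.environ["MASTER_PORT"] = "29500"
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)

        # all_reduce: every rank ends with the elementwise sum across ranks;
        # this is the primitive behind data-parallel gradient synchronization.
        grad = torch.full((4,), float(rank + 1), device=f"cuda:{rank}")
        dist.all_reduce(grad, op=dist.ReduceOp.SUM)

        # all_gather: every rank collects every rank's shard, as when
        # reassembling sharded parameters or activations.
        shards = [torch.empty_like(grad) for _ in range(world_size)]
        dist.all_gather(shards, grad)

        dist.destroy_process_group()

    if __name__ == "__main__":
        world_size = 2  # assumes two local GPUs
        mp.spawn(worker, args=(world_size,), nprocs=world_size)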

Qualifications:

Required:
- Advanced degree (Ph.D. or Master's) in Computer Science, Electrical Engineering, or a related field.
- Extensive experience with PyTorch internals and development.
- Deep understanding of hardware architectures, including GPUs, ASICs, and other accelerators.
- Proven experience porting software to new hardware platforms.
- Experience with multi-chip communication and optimization.
- Experience with low-level communication libraries such as NCCL, including use of all-reduce, all-gather, and other collective operations.
- Experience with Transformer-based LLMs.
- Strong background in distributed training, particularly for large language models.
- Strong C++ and Python programming skills.
- Experience profiling and debugging performance bottlenecks.

Preferred:
- Experience with specific hardware architectures (e.g., NVIDIA GPUs, custom ASICs).
- Contributions to the PyTorch open-source project.
- Experience with compiler technologies and integration (a minimal illustration follows this list).
- Experience with specific compiler frameworks (e.g., LLVM, MLIR).
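For illustration only: a minimal sketch of the PyTorch compiler integration point referenced above. With torch.compile, a backend is a callable that receives the captured FX GraphModule (PyTorch's graph IR) plus example inputs and returns a compiled callable; a production backend would lower the graph to hardware-specific code, while this sketch merely prints the IR and falls back to eager execution. The name my_backend is hypothetical.

    import torch
    from typing import List

    def my_backend(gm: torch.fx.GraphModule, example_inputs: List[torch.Tensor]):
        # Inspect the captured FX IR; a real backend would lower this graph
        # to hardware-specific code instead.
        print(gm.graph)
        return gm.forward  # fall back to eager execution of the traced graph

    @torch.compile(backend=my_backend)
    def fn(x, y):
        return torch.relu(x) + y

    out = fn(torch.randn(8), torch.randn(8))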


The base salary range is $215,000-$260,000. Your base salary will be determined by your location, experience, and the pay of employees in similar positions. The successful candidate will have the opportunity to convert to a full-time regular position. We also offer new-hire RSU grants, the opportunity for annual RSU grants, and other highly competitive benefits.

