Research Internships at Microsoft provide a dynamic environment for research careers, with a network of world-class research labs led by globally recognized scientists and engineers. These teams pursue innovation across a range of scientific and technical disciplines to help solve complex challenges in diverse fields, including computing, healthcare, economics, and the environment.
Our team works on performance analysis and optimization of large language models, spanning the stack from GPU kernel implementation through to changes in model architecture. A key challenge is that quantizing models to smaller data types is only effective if the quantized formats can be dequantized and used efficiently during computation. In this Research Internship, we will tackle this problem by exploring the co-design of quantization techniques (e.g., fewer bits per weight) and kernel design for efficient decode (e.g., expanding weights to the 4-bit, 6-bit, and 8-bit floating-point formats supported by modern GPUs).
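To make the decode step concrete, the sketch below shows the kind of dequantization kernel this co-design involves: unpacking 4-bit integer-quantized weights (two per byte, with a per-group scale) into fp16 values that a standard GEMM could consume. All names, the packing layout, and the symmetric quantization scheme are illustrative assumptions rather than an existing codebase, and in practice the expanded output might instead be one of the FP8/FP6/FP4 formats consumed directly by tensor cores.

```cuda
// Minimal illustrative sketch (assumed layout): two 4-bit weights packed per
// byte, one fp16 scale per group, symmetric quantization around a zero point
// of 8. Not a production kernel.
#include <cuda_fp16.h>
#include <cstdint>

__global__ void dequant_int4_to_fp16(const uint8_t* packed,  // n/2 bytes of packed weights
                                     const half* scales,     // one scale per group
                                     half* out,              // n dequantized values
                                     int n, int group_size) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // output element index
    if (i >= n) return;

    // Unpack the 4-bit nibble for element i (low nibble holds the even index).
    uint8_t byte = packed[i >> 1];
    int q = (i & 1) ? (byte >> 4) : (byte & 0x0F);

    // Apply the per-group scale: value = (q - 8) * scale.
    float scale = __half2float(scales[i / group_size]);
    out[i] = __float2half((q - 8) * scale);
}
```

The research question is how to fuse this kind of expansion into the surrounding matrix-multiply kernels, and how to choose the quantization format so that the unpacking stays cheap relative to the memory-bandwidth savings.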