You'll design and optimize CUDA and C++ kernels that power LLMs, transformers, and generative AI, using low-precision formats, operator fusion, and advanced memory optimization.
What you'll do
- Build and optimize CUDA kernels (attention, MLP, layernorm, etc.).
- Develop FP4 and FP8 kernels and support new microscaling formats such as MXFP4 and MXFP6.
- Use CUTLASS for high-performance GEMMs and fused ops.
- Profile and tune performance across the GPU memory hierarchy.
- Integrate with PyTorch, Triton, and TensorRT.
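To give a flavor of the low-precision work above, here is a minimal CPU-side sketch of the idea behind microscaling formats like MXFP4: a block of 32 values shares one power-of-two scale, and each element is rounded to a small FP4 (E2M1) value set. This is an illustrative simplification, not the OCP MX specification; `roundToFp4` and `mxQuantizeDequantize` are hypothetical helper names.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <vector>

// Illustrative only: the representable magnitudes of an FP4 E2M1 element
// (sign handled separately), as used by MXFP4.
static const std::array<float, 8> kFp4Mags = {0.0f, 0.5f, 1.0f, 1.5f,
                                              2.0f, 3.0f, 4.0f, 6.0f};

// Round one value to the nearest FP4 magnitude (hypothetical helper).
float roundToFp4(float x) {
    float best = kFp4Mags[0];
    for (float m : kFp4Mags)
        if (std::fabs(std::fabs(x) - m) < std::fabs(std::fabs(x) - best))
            best = m;
    return std::copysign(best, x);
}

// Quantize a 32-element block with one shared power-of-two scale, then
// dequantize. Returns the reconstructed values so the rounding error is
// easy to inspect. Simplified relative to the real MX block format.
std::vector<float> mxQuantizeDequantize(const std::vector<float>& block) {
    assert(block.size() == 32);
    float maxAbs = 0.0f;
    for (float v : block) maxAbs = std::max(maxAbs, std::fabs(v));
    // Shared scale: a power of two chosen so the largest element lands
    // near the top of the FP4 range (6.0).
    float scale = (maxAbs > 0.0f)
        ? std::exp2(std::floor(std::log2(maxAbs / 6.0f)))
        : 1.0f;
    std::vector<float> out(32);
    for (size_t i = 0; i < 32; ++i)
        out[i] = roundToFp4(block[i] / scale) * scale;
    return out;
}
```

In production kernels the packed 4-bit elements and the shared scale would be consumed directly by tensor-core instructions; the sketch only shows the numerics.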
What we're looking for
- Strong CUDA and C++ expertise with deep knowledge of the NVIDIA Hopper and Blackwell architectures.
- Experience with low-precision formats, CUTLASS, and Triton.
- Skilled in operator fusion, tiling, and warp-level programming.
- Proficient with profiling tools such as Nsight Compute, Nsight Systems, and the legacy nvprof.
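The tiling skill listed above can be sketched on the CPU: the loop nest below processes the matrices in small blocks so each block pair stays cache-resident, which is the same reuse pattern a CUDA kernel achieves by staging tiles in shared memory. A minimal sketch; `tiledGemm` and the tile size are illustrative choices, not a tuned implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// CPU sketch of tiled GEMM: C = A * B for N x N row-major matrices.
// In a CUDA kernel, a thread block would stage the TILE x TILE blocks
// of A and B in shared memory; here tiling just improves cache reuse.
constexpr int TILE = 4;

std::vector<float> tiledGemm(const std::vector<float>& A,
                             const std::vector<float>& B, int N) {
    std::vector<float> C(N * N, 0.0f);
    for (int i0 = 0; i0 < N; i0 += TILE)
        for (int k0 = 0; k0 < N; k0 += TILE)
            for (int j0 = 0; j0 < N; j0 += TILE)
                // Multiply one TILE x TILE block pair; both blocks stay
                // hot in cache across the three inner loops.
                for (int i = i0; i < std::min(i0 + TILE, N); ++i)
                    for (int k = k0; k < std::min(k0 + TILE, N); ++k)
                        for (int j = j0; j < std::min(j0 + TILE, N); ++j)
                            C[i * N + j] += A[i * N + k] * B[k * N + j];
    return C;
}
```

Operator fusion extends the same idea: an epilogue (bias, activation) is applied to the tile of C while it is still resident, instead of in a separate pass over global memory.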
Preferred
- Experience with Blackwell microscaling formats.
- Open-source contributions or published work in low-precision kernels.