CUDA Engineer - Kernel & Performance Specialist

We're hiring a CUDA Engineer to push GPU performance to its limits on NVIDIA Hopper and Blackwell.

You'll design and optimize the CUDA and C++ kernels that power LLMs, transformers, and generative AI, using low-precision formats, operator fusion, and advanced memory optimization.

What you'll do

  • Build and optimize CUDA kernels (attention, MLP, layernorm, etc.).
  • Develop FP4 and FP8 kernels and support new microscaling formats such as MXFP4 and MXFP6.
  • Use CUTLASS for high-performance GEMMs and fused ops.
  • Profile and tune performance across the GPU memory hierarchy.
  • Integrate with PyTorch, Triton, and TensorRT.

What we're looking for

  • Strong CUDA and C++ expertise with deep knowledge of the Hopper and Blackwell architectures.
  • Experience with low-precision formats, CUTLASS, and Triton.
  • Skilled in operator fusion, tiling, and warp-level programming.
  • Proficient with profiling tools, including Nsight Compute and Nsight Systems.

Preferred

  • Experience with Blackwell microscaling formats.
  • Open-source contributions or published work in low-precision kernels.