Job Description
About the Role
We’re looking for an AI Engineer in Distributed LLM Training & Infrastructure to work on large-scale model training infrastructure across distributed GPU environments. This role focuses on improving how LLMs are trained at scale, optimising performance, cost, and efficiency across multi-node systems.
Responsibilities
- Build and optimise distributed LLM training pipelines using PyTorch
- Work with frameworks such as Megatron-LM and DeepSpeed for large-scale training
- Improve multi-node GPU performance (throughput, memory usage, NCCL communication)
- Design and run benchmarking frameworks (tokens/sec, cost, MFU, latency)
- Develop standardised training recipes and playbooks for production-grade environments
- Core LLM training systems (not application-layer AI)
- Distributed systems challenges across multi-GPU, multi-node setups
- Perf...