Job Description

About the Role

We’re looking for an AI Engineer in Distributed LLM Training & Infrastructure to work on large-scale model training infrastructure across distributed GPU environments. This role focuses on improving how LLMs are trained at scale, optimising performance, cost, and efficiency across multi-node systems.

Responsibilities

Build and optimise distributed LLM training pipelines using PyTorch
Work with frameworks such as Megatron-LM and DeepSpeed for large-scale training
Improve multi-node GPU performance (throughput, memory usage, NCCL communication)
Design and run benchmarking frameworks (tokens/sec, cost, MFU, latency)
Develop standardised training recipes and playbooks for production-grade environments

What You’ll Work On

Core LLM training systems (not application-layer AI)
Distributed systems challenges across multi-GPU, multi-node setups
Perf...
            

Artificial Intelligence Engineer
