Job Description

Role Overview

This role focuses on designing and building the infrastructure that enables scalable machine learning development, from training-ready datasets through to validated models deployed in production environments.

The position involves establishing core systems and architectural foundations that will support long-term scalability and performance. Key areas include training infrastructure, distributed learning frameworks, experiment management, model lifecycle management, and reliable pathways from model development to production deployment.

Key Responsibilities

Training infrastructure

Design, deploy, and operate GPU-based training environments across cloud platforms such as AWS and GCP. This includes node provisioning, workload scheduling (e.g., Kubernetes, Slurm), multi-node networking, GPU monitoring, and cost/utilization optimization.

Distributed training systems

Own and optimize di...


ML Infra Engineer (m/f/d)
