Job Description

                Elevate your career as a Senior Site Reliability Engineer in Toronto, managing cutting-edge HPC infrastructure with NVIDIA GPUs. Join a dynamic team focusing on advanced AI and ML clusters.

You will oversee the lifecycle of our high-performance computing (HPC) infrastructure. This role requires hands-on experience in planning, deploying, and maintaining resilient systems. Collaborate with engineering and research teams to optimize operations and ensure seamless performance.

Key Responsibilities: • Manage and optimize operations of HPC clusters • Deploy and maintain infrastructure-as-code solutions • Support research teams by optimizing cluster usage • Operate and troubleshoot Ceph storage clusters • Develop tooling and automation for efficiency

Requirements: • 5+ years experience in SRE or HPC operations • Proficiency in Linux systems (Ubuntu/Debian) • Experience with Kubernetes container orchestration • Knowledge of Ceph deployments over 1PB • Skilled in Pyt...

Apply for This Position

Ready to take the next step? Click the button below to submit your application.

Submit Application