Job Description

What You'll Do

Reliability & Operations
  Own availability, latency, and scalability across SaaS and AI systems
  Define and enforce SLOs, SLIs, and error budgets
  Participate in a global on-call rotation (~1 week every 4 weeks)
  Lead incident response and drive blameless postmortems with systemic fixes

Platform & Infrastructure
  Architect and operate on-premise and multi-region, multi-cloud environments
  Manage large-scale Kubernetes workloads
  Build and evolve infrastructure using Terraform and Ansible
  Improve system resilience, fault isolation, and capacity planning

AI/ML & Automation
  Build and scale agentic AI systems for triage, anomaly detection, and self-healing
  Ensure reliability of model serving infrastructure
  Operate, optimize, and scale distributed systems

What You Bring
  5+ years in SRE, Production Engineering, or Platform Engineering
  Strong experience with cloud providers (AWS/GCP/OCI), Kubernetes, and IaC (Terraform/Ansible)
  Proficiency in Pyth...
            


Site reliability engineer (.net)
