Job Description

Job Responsibilities

Own production reliability (SLOs, capacity, incident response, postmortems) and turn every incident into a durable fix in code or automation.

Build the platform and tooling that make services easy to deploy, observe, and operate: CI/CD, infrastructure-as-code, observability stacks, runbooks-as-code.

Apply AI agentically across operations (triage, root-cause analysis, remediation, change review) and contribute to our internal agentic ecosystem.

Design and integrate the systems underneath our services: messaging (e.g. Kafka), orchestration (e.g. Kubernetes), and performance-sensitive infrastructure.

Partner with product engineers on release readiness, rollout strategy, and production hardening before things ship.

Continuously reduce toil: measure it, attack it with code, and raise the floor on what "easy to maintain" looks like.

Job Requirements
 <...

Site Reliability Engineer (SRE)