Full-time Posted June 12, 2026
Job Description

Job Responsibilities


  • Own production reliability (SLOs, capacity, incident response, postmortems) and turn every incident into a durable fix in code or automation.

  • Build the platform and tooling that make services easy to deploy, observe, and operate: CI/CD, infrastructure-as-code, observability stacks, runbooks-as-code.

  • Apply agentic AI across operations (triage, root-cause analysis, remediation, change review) and contribute to our internal agentic ecosystem.

  • Design and integrate the systems underneath our services: messaging (e.g. Kafka), orchestration (e.g. Kubernetes), and performance-sensitive infrastructure.

  • Partner with product engineers on release readiness, rollout strategy, and production hardening before things ship.

  • Continuously reduce toil: measure it, attack it with code, and raise the floor on what "easy to maintain" looks like.

Job Requirements

    <...
