Job Description
Key Responsibilities
- Design, implement, and maintain highly available, scalable infrastructure on GCP, ensuring 99.99% uptime for mission‑critical services.
- Build and manage Kubernetes clusters, including deployment pipelines, rolling updates, and cluster autoscaling.
- Develop automation scripts in Python and Go to streamline operations, monitoring, and incident response.
- Configure and maintain an observability stack with Grafana, Prometheus, and logging solutions to provide real‑time insights.
- Collaborate with development teams to embed SRE best practices into CI/CD pipelines and code reviews.
- Lead root‑cause analysis, post‑mortem documentation, and continuous improvement initiatives.
Requirements
- 5+ years of experience in site reliability engineering or DevOps roles.
- Proficiency with GCP services (Compute Engine, Google Kubernetes Engine, Cloud Storage, Pub/Sub).
- Strong sc...