You will own reliability and latency for our model-serving systems, spanning Kubernetes, API gateways, monitoring, and incident response.
What you'll do
- Architect and manage cloud-native inference infrastructure (Kubernetes, Docker, API gateways, CI/CD).
- Build observability pipelines for logs, metrics, traces, and alerts using Prometheus, Grafana, ELK, Datadog, or OpenTelemetry.
- Design SLOs, error budgets, and alerting that keep uptime high and false positives low (a toy sketch of the error-budget math follows this list).
- Optimize performance to cut API latency, cold starts, and throughput bottlenecks across GPU and CPU workloads.
- Ensure security, compliance, and resilience through hardened networking, secrets management, and deliberate failure handling.
- Collaborate with ML engineers and SREs to deliver end-to-end monitoring and reliability for AI model serving.
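To give a flavor of the SLO and alerting work: a minimal sketch of the error-budget math behind a two-window burn-rate alert. The 99.9% target, the window sizes, and the 14.4 threshold are illustrative numbers borrowed from common SRE practice, not our actual policy.

```python
# Hypothetical error-budget / burn-rate math for an availability SLO.
# All numbers are illustrative assumptions, not a production policy.

SLO_TARGET = 0.999             # 99.9% availability over a 30-day window
ERROR_BUDGET = 1 - SLO_TARGET  # fraction of requests allowed to fail

def burn_rate(error_ratio: float) -> float:
    """How many times faster than 'sustainable' we are spending budget.

    A burn rate of 1.0 exhausts the budget in exactly one 30-day window;
    14.4 exhausts it in about two days.
    """
    return error_ratio / ERROR_BUDGET

def should_page(short_window_ratio: float, long_window_ratio: float) -> bool:
    """Two-window check: page only when both the fast (e.g. 5m) and slow
    (e.g. 1h) windows burn above threshold, which suppresses brief blips."""
    threshold = 14.4  # burns ~2% of a 30-day budget per hour
    return (burn_rate(short_window_ratio) > threshold
            and burn_rate(long_window_ratio) > threshold)

if __name__ == "__main__":
    print(should_page(0.005, 0.005))  # 0.5% errors -> burn rate 5.0 -> False
    print(should_page(0.02, 0.02))    # 2% errors -> burn rate 20.0 -> True
```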
What we're looking for
- Strong experience with Kubernetes, Docker, and API gateways such as Cloudflare, Kong, or Envoy.
- Deep expertise in observability stacks (Prometheus, Grafana, Datadog, ELK, OpenTelemetry).
- Proven track record in designing alerting, monitoring, and SLOs for production systems.
- Experience with high-throughput, low-latency inference systems and GPU utilization monitoring.
- Skilled in cloud infrastructure and infrastructure as code tooling (AWS, GCP, Azure, Terraform, Pulumi, or CloudFormation).
- Solid scripting or programming skills (Python, Go, or similar) for automation and tooling; the latency check sketched below is representative.
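As an example of the glue code we mean by "automation and tooling": a minimal check of p99 API latency against a budget via the Prometheus HTTP API. The server URL, the request_duration_seconds histogram, and the 250 ms budget are hypothetical placeholders, not a description of our stack.

```python
# Minimal sketch: flag p99 API latency over budget via the Prometheus
# HTTP API. URL, metric name, and budget are illustrative assumptions.
import requests

PROM_URL = "http://localhost:9090/api/v1/query"  # assumed Prometheus endpoint
QUERY = ("histogram_quantile(0.99, "
         "sum(rate(request_duration_seconds_bucket[5m])) by (le))")
P99_BUDGET_SECONDS = 0.250

def p99_latency() -> float:
    """Run an instant query and return p99 latency in seconds."""
    resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=5)
    resp.raise_for_status()
    body = resp.json()
    if body["status"] != "success" or not body["data"]["result"]:
        raise RuntimeError(f"unexpected Prometheus response: {body}")
    # Instant-vector samples come back as [timestamp, "value-as-string"].
    return float(body["data"]["result"][0]["value"][1])

if __name__ == "__main__":
    latency = p99_latency()
    status = "OK" if latency <= P99_BUDGET_SECONDS else "OVER BUDGET"
    print(f"p99 = {latency * 1000:.1f} ms ({status})")
```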
Preferred
- Experience with multi-tenant inference platforms, model drift monitoring, and ML observability (a toy drift check follows this list).
- Background in edge, gateway, or CDN technologies such as Cloudflare Workers, Envoy, or Kong.
- Contributions to open-source observability, reliability, or DevOps tooling.
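To make "model drift monitoring" concrete: a toy sketch of one common drift signal, the Population Stability Index (PSI), between a reference and a live feature distribution. The bin count and the 0.2 alert threshold are widely used rules of thumb, not our production policy.

```python
# Toy drift signal: Population Stability Index (PSI) between a
# reference and a live distribution. Thresholds are rules of thumb.
import numpy as np

def psi(reference: np.ndarray, live: np.ndarray, bins: int = 10) -> float:
    """PSI = sum((live% - ref%) * ln(live% / ref%)) over shared bins."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_counts, _ = np.histogram(reference, bins=edges)
    live_counts, _ = np.histogram(live, bins=edges)
    eps = 1e-6  # keep empty bins from producing log(0)
    ref_frac = ref_counts / max(ref_counts.sum(), 1) + eps
    live_frac = live_counts / max(live_counts.sum(), 1) + eps
    return float(np.sum((live_frac - ref_frac) * np.log(live_frac / ref_frac)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.normal(0.0, 1.0, 10_000)
    print(f"{psi(ref, rng.normal(0.0, 1.0, 10_000)):.3f}")  # ~0: no drift
    print(f"{psi(ref, rng.normal(0.5, 1.0, 10_000)):.3f}")  # > 0.2: drift
```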