You will own reliability and latency for our model-serving systems, spanning Kubernetes, API gateways, monitoring, and incident response.
What you'll do
- Architect and manage cloud-native inference infrastructure (Kubernetes, Docker, API gateways, CI/CD).
- Build observability pipelines for logs, metrics, traces, and alerts using Prometheus, Grafana, ELK, Datadog, or OpenTelemetry.
- Design SLOs, error budgets, and alerting that keep uptime high and false positives low (a toy sketch of the error-budget math follows this list).
- Optimize performance to cut API latency, cold starts, and throughput bottlenecks across GPU and CPU workloads.
- Ensure security, compliance, and resilience through hardened networking, secrets management, and deliberate failure handling.
- Collaborate with ML engineers and SREs to deliver end-to-end monitoring and reliability for AI model serving.
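To give a flavor of the SLO and alerting work: a minimal sketch of the error-budget math behind a two-window burn-rate alert. The 99.9% target, the window sizes, and the 14.4 threshold are illustrative numbers borrowed from common SRE practice, not our actual policy.

```python
# Hypothetical error-budget / burn-rate math for an availability SLO.
# All numbers are illustrative assumptions, not a production policy.

SLO_TARGET = 0.999             # 99.9% availability over a 30-day window
ERROR_BUDGET = 1 - SLO_TARGET  # fraction of requests allowed to fail

def burn_rate(error_ratio: float) -> float:
    """How many times faster than 'sustainable' we are spending budget.

    A burn rate of 1.0 exhausts the budget in exactly one 30-day window;
    14.4 exhausts it in about two days.
    """
    return error_ratio / ERROR_BUDGET

def should_page(short_window_ratio: float, long_window_ratio: float) -> bool:
    """Two-window check: page only when both the fast (e.g. 5m) and slow
    (e.g. 1h) windows burn above threshold, which suppresses brief blips."""
    threshold = 14.4  # burns ~2% of a 30-day budget per hour
    return (burn_rate(short_window_ratio) > threshold
            and burn_rate(long_window_ratio) > threshold)

if __name__ == "__main__":
    print(should_page(0.005, 0.005))  # 0.5% errors -> burn rate 5.0 -> False
    print(should_page(0.02, 0.02))    # 2% errors -> burn rate 20.0 -> True
```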
What we're looking for
- Strong experience with Kubernetes, Docker, and API gateways such as Cloudflare, Kong, or Envoy.
- Deep expertise in observability stacks (Prometheus, Grafana, Datadog, ELK, OpenTelemetry).
- Proven track record in designing alerting, monitoring, and SLOs for production systems.
- Experience with high-throughput, low-latency inference systems and GPU utilization monitoring.
- Skilled in cloud infrastructure and infrastructure as code tooling (AWS, GCP, Azure, Terraform, Pulumi, or CloudFormation).
- Solid scripting or programming skills (Python, Go, or similar) for automation and tooling; the latency check sketched below is representative.
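As an example of the glue code we mean by "automation and tooling": a minimal check of p99 API latency against a budget via the Prometheus HTTP API. The server URL, the request_duration_seconds histogram, and the 250 ms budget are hypothetical placeholders, not a description of our stack.

```python
# Minimal sketch: flag p99 API latency over budget via the Prometheus
# HTTP API. URL, metric name, and budget are illustrative assumptions.
import requests

PROM_URL = "http://localhost:9090/api/v1/query"  # assumed Prometheus endpoint
QUERY = ("histogram_quantile(0.99, "
         "sum(rate(request_duration_seconds_bucket[5m])) by (le))")
P99_BUDGET_SECONDS = 0.250

def p99_latency() -> float:
    """Run an instant query and return p99 latency in seconds."""
    resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=5)
    resp.raise_for_status()
    body = resp.json()
    if body["status"] != "success" or not body["data"]["result"]:
        raise RuntimeError(f"unexpected Prometheus response: {body}")
    # Instant-vector samples come back as [timestamp, "value-as-string"].
    return float(body["data"]["result"][0]["value"][1])

if __name__ == "__main__":
    latency = p99_latency()
    status = "OK" if latency <= P99_BUDGET_SECONDS else "OVER BUDGET"
    print(f"p99 = {latency * 1000:.1f} ms ({status})")
```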
Preferred
- Experience with multi-tenant inference platforms, model drift monitoring, and ML observability (a toy drift check follows this list).
- Background in edge, gateway, or CDN technologies such as Cloudflare Workers, Envoy, or Kong.
- Contributions to open-source observability, reliability, or DevOps tooling.
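To make "model drift monitoring" concrete: a toy sketch of one common drift signal, the Population Stability Index (PSI), between a reference and a live feature distribution. The bin count and the 0.2 alert threshold are widely used rules of thumb, not our production policy.

```python
# Toy drift signal: Population Stability Index (PSI) between a
# reference and a live distribution. Thresholds are rules of thumb.
import numpy as np

def psi(reference: np.ndarray, live: np.ndarray, bins: int = 10) -> float:
    """PSI = sum((live% - ref%) * ln(live% / ref%)) over shared bins."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_counts, _ = np.histogram(reference, bins=edges)
    live_counts, _ = np.histogram(live, bins=edges)
    eps = 1e-6  # keep empty bins from producing log(0)
    ref_frac = ref_counts / max(ref_counts.sum(), 1) + eps
    live_frac = live_counts / max(live_counts.sum(), 1) + eps
    return float(np.sum((live_frac - ref_frac) * np.log(live_frac / ref_frac)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ref = rng.normal(0.0, 1.0, 10_000)
    print(f"{psi(ref, rng.normal(0.0, 1.0, 10_000)):.3f}")  # ~0: no drift
    print(f"{psi(ref, rng.normal(0.5, 1.0, 10_000)):.3f}")  # > 0.2: drift
```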