Eigen AI | Senior Infrastructure / DevOps Engineer

Senior Infrastructure / DevOps Engineer - AI Inference & Observability

Design, build, and operate our AI inference and observability platform.

You will own reliable, low-latency systems across Kubernetes, API gateways, monitoring, and incident response.
What you'll do

  • Architect and manage cloud-native inference infrastructure (Kubernetes, Docker, API gateways, CI/CD).
  • Build observability pipelines for logs, metrics, tracing, and alerting with Prometheus, Grafana, ELK, Datadog, or OpenTelemetry.
  • Design SLOs, error budgets, and alerting systems that keep uptime high and false positives low.
  • Optimize performance to reduce API latency, cold starts, and bottlenecks across GPU and CPU workloads.
  • Ensure security, compliance, and resilience with secure networking, secrets management, and failure handling.
  • Collaborate with ML engineers and SREs to deliver end-to-end monitoring and reliability for AI model serving.

What we're looking for

  • Strong experience with Kubernetes, Docker, and API gateways such as Cloudflare, Kong, or Envoy.
  • Deep expertise in observability stacks (Prometheus, Grafana, Datadog, ELK, OpenTelemetry).
  • Proven track record in designing alerting, monitoring, and SLOs for production systems.
  • Experience with high-throughput, low-latency inference systems and GPU utilization monitoring.
  • Skilled in cloud infrastructure and infrastructure as code tooling (AWS, GCP, Azure, Terraform, Pulumi, or CloudFormation).
  • Solid scripting or programming skills (Python, Go, or similar) for automation and tooling.

Preferred

  • Experience with multi-tenant inference platforms, model drift monitoring, and ML observability.
  • Background in edge, gateway, or CDN technologies such as Cloudflare Workers, Envoy, or Kong.
  • Contributions to observability, reliability, or DevOps open source tools.