FlexAI

Staff DevOps Engineer/SRE

About FlexAI

Build and Deploy AI the right way, anywhere.

The FlexAI Compute Infrastructure Platform provides an end-to-end AI compute layer for running and managing workloads across any cloud, any GPU, and any deployment model (public, hybrid, or on-prem). It brings together 1-click simplicity for users with enterprise-grade orchestration, security, and automation under the hood.


Founded by Brijesh Tripathi, who brings experience from Nvidia, Apple, Tesla, Intel and Zoox, FlexAI is not just building a product – we’re shaping the future of AI. Our teams are strategically distributed across Silicon Valley and Bengaluru, united by a shared mission: to deliver more compute with less complexity.

If you're passionate about shaping the future of artificial intelligence, driving innovation, and contributing to a sustainable and inclusive AI ecosystem, FlexAI is the place for you!

Role Overview

FlexAI is looking for a Staff DevOps Engineer / SRE to define our infrastructure strategy, establish SRE best practices, and build systems capable of running large-scale AI workloads across distributed, multi-cloud environments.


You’ll work closely with developers to ensure our platform is reliable, performant, and scalable, without slowing down product velocity.


What You’ll Do

Own Reliability & Architecture:

  • Design and evolve the infrastructure backbone for our AI and PaaS platform
  • Build highly available, fault-tolerant, and scalable systems
  • Define and drive SRE practices (SLIs, SLOs, error budgets)

Build Infrastructure at Scale:

  • Lead Infrastructure as Code using Pulumi
  • Own and scale Kubernetes clusters and containerized workloads
  • Standardize and automate infrastructure for global deployments

CI/CD & Automation:

  • Design and scale CI/CD pipelines for fast, reliable releases
  • Build self-healing systems and automated remediation workflows
  • Drive GitOps and platform engineering practices

Observability & Performance:

  • Implement end-to-end observability (metrics, logs, traces) using VictoriaMetrics and Grafana
  • Identify and resolve performance bottlenecks (latency, throughput, cost)
  • Lead incident response, root cause analysis, and postmortems

Leadership & Collaboration:

  • Partner with backend, AI, runtime, and security teams
  • Guide infrastructure decisions and scaling strategy
  • Mentor engineers and raise the bar on reliability and engineering standards

Security & Resilience:

  • Embed security into infrastructure and deployment workflows
  • Design for resilience (disaster recovery, chaos testing, capacity planning)

What You'll Need to Be Successful

  • 8+ years of experience in DevOps, SRE, or Infrastructure Engineering
  • Proven experience operating large-scale, distributed systems in production
  • Deep expertise in:
    • Kubernetes & container orchestration
    • Pulumi (or similar IaC tools)
    • Cloud or hybrid environments (AWS, GCP, Azure, or on-prem)
    • Observability stacks (Prometheus, Grafana, OpenTelemetry)
  • Strong experience with CI/CD, automation, and release engineering
  • Proficiency in Python, Go, or Bash
  • Strong systems thinking and debugging skills in high-scale environments
  • Experience defining and operating with SLOs / SLAs
  • Experience in startup environments
  • Comfort with using AI coding tools and agents to move faster

Nice to Have

  • Experience with AI/ML infrastructure or GPU workloads
  • Familiarity with distributed or high-performance compute systems
  • Exposure to platform engineering / internal developer platforms
  • Experience scaling systems from beta to production

Why FlexAI

  • Work on cutting-edge AI infrastructure
  • Build systems that power developers and enterprises
  • High ownership, fast execution, real impact
  • Collaborative, high-caliber team

Engineering

Bangalore, India
