Staff DevOps Engineer/SRE

About FlexAI

Build and Deploy AI the right way, anywhere.

The FlexAI Compute Infrastructure Platform provides an end-to-end AI compute layer for running and managing workloads across any cloud, any GPU, and any deployment model (public, hybrid, or on-prem). It combines 1-click simplicity for users with enterprise-grade orchestration, security, and automation under the hood.

Founded by Brijesh Tripathi, who brings experience from Nvidia, Apple, Tesla, Intel, and Zoox, FlexAI is not just building a product; we’re shaping the future of AI. Our teams are strategically distributed across Silicon Valley and Bengaluru, united by a shared mission: to deliver more compute with less complexity.

If you're passionate about shaping the future of artificial intelligence, driving innovation, and contributing to a sustainable and inclusive AI ecosystem, FlexAI is the place for you!

Role Overview

FlexAI is looking for a Staff DevOps Engineer/SRE to define our infrastructure strategy, establish SRE best practices, and build systems capable of running large-scale AI workloads across distributed, multi-cloud environments.

You’ll work closely with developers to keep our platform reliable, performant, and scalable without slowing down product velocity.

What You’ll Do

Own Reliability & Architecture:

Design and evolve the infrastructure backbone for our AI and PaaS platform
Build highly available, fault-tolerant, and scalable systems
Define and drive SRE practices (SLIs, SLOs, error budgets)

Build Infrastructure at Scale:

Lead Infrastructure as Code using Pulumi
Own and scale Kubernetes clusters and containerized workloads
Standardize and automate infrastructure for global deployments

CI/CD & Automation:

Design and scale CI/CD pipelines for fast, reliable releases
Build self-healing systems and automated remediation workflows
Drive GitOps and platform engineering practices

Observability & Performance:

Implement end-to-end observability using VictoriaMetrics and Grafana (metrics, logs, traces)
Identify and resolve performance bottlenecks (latency, throughput, cost)
Lead incident response, root cause analysis, and postmortems

Leadership & Collaboration:

Partner with backend, AI, runtime, and security teams
Guide infrastructure decisions and scaling strategy
Mentor engineers and raise the bar on reliability and engineering standards

Security & Resilience:

Embed security into infrastructure and deployment workflows
Design for resilience (disaster recovery, chaos testing, capacity planning)

What You'll Need to Be Successful

8+ years of experience in DevOps, SRE, or Infrastructure Engineering
Proven experience operating large-scale, distributed systems in production
Deep expertise in:
- Kubernetes & container orchestration
- Pulumi (or similar IaC tools)
- Cloud or hybrid environments (AWS, GCP, Azure, or on-prem)
- Observability stacks (Prometheus, Grafana, OpenTelemetry)
Strong experience with CI/CD, automation, and release engineering
Proficiency in Python, Go, or Bash
Strong systems thinking and debugging skills in high-scale environments
Experience defining and operating with SLOs / SLAs
Experience in startup environments
Comfortable leveraging AI coding tools and agents to move faster

Nice to Have

Experience with AI/ML infrastructure or GPU workloads
Familiarity with distributed or high-performance compute systems
Exposure to platform engineering / internal developer platforms
Experience scaling systems from beta to production

Why FlexAI

Work on cutting-edge AI infrastructure
Build systems that power developers and enterprises
High ownership, fast execution, real impact
Collaborative, high-caliber team

Engineering

Bangalore, India