About FlexAI
Build and Deploy AI the right way, anywhere.
The FlexAI Compute Infrastructure Platform provides an end-to-end AI compute layer for running and managing workloads across any cloud, any GPU, and any deployment model (public, hybrid, or on-prem). It brings together 1-click simplicity for users with enterprise-grade orchestration, security, and automation under the hood.
Founded by Brijesh Tripathi, who brings experience from Nvidia, Apple, Tesla, Intel, and Zoox, FlexAI is not just building a product; we’re shaping the future of AI. Our teams are strategically distributed across Silicon Valley and Bengaluru, united by a shared mission: to deliver more compute with less complexity.
If you're passionate about shaping the future of artificial intelligence, driving innovation, and contributing to a sustainable and inclusive AI ecosystem, FlexAI is the place for you!
Role Overview
FlexAI is looking for a Staff DevOps / SRE Engineer to define our infrastructure strategy, establish SRE best practices, and build systems capable of running large-scale AI workloads across distributed, multi-cloud environments.
You’ll work closely with developers to ensure our platform is reliable, performant, and scalable without slowing down product velocity.
What You’ll Do
Own Reliability & Architecture:
- Design and evolve the infrastructure backbone for our AI and PaaS platform
- Build highly available, fault-tolerant, and scalable systems
- Define and drive SRE practices (SLIs, SLOs, error budgets)
Build Infrastructure at Scale:
- Lead Infrastructure as Code using Pulumi
- Own and scale Kubernetes clusters and containerized workloads
- Standardize and automate infrastructure for global deployments
CI/CD & Automation:
- Design and scale CI/CD pipelines for fast, reliable releases
- Build self-healing systems and automated remediation workflows
- Drive GitOps and platform engineering practices
Observability & Performance:
- Implement end-to-end observability using VictoriaMetrics and Grafana (metrics, logs, traces)
- Identify and resolve performance bottlenecks (latency, throughput, cost)
- Lead incident response, root cause analysis, and postmortems
Leadership & Collaboration:
- Partner with backend, AI, runtime, and security teams
- Guide infrastructure decisions and scaling strategy
- Mentor engineers and raise the bar on reliability and engineering standards
Security & Resilience:
- Embed security into infrastructure and deployment workflows
- Design for resilience (disaster recovery, chaos testing, capacity planning)
What You'll Need to Be Successful
- 8+ years of experience in DevOps, SRE, or Infrastructure Engineering
- Proven experience operating large-scale, distributed systems in production
- Deep expertise in:
  - Kubernetes & container orchestration
  - Pulumi (or similar IaC tools)
  - Cloud or hybrid environments (AWS, GCP, Azure, or on-prem)
  - Observability stacks (Prometheus, Grafana, OpenTelemetry)
- Strong experience with CI/CD, automation, and release engineering
- Proficiency in Python, Go, or Bash
- Strong systems thinking and debugging skills in high-scale environments
- Experience defining and operating with SLOs / SLAs
- Experience in startup environments
- Comfort with leveraging AI coding tools and agents to move faster
Nice to Have
- Experience with AI/ML infrastructure or GPU workloads
- Familiarity with distributed or high-performance compute systems
- Exposure to platform engineering / internal developer platforms
- Experience scaling systems from beta to production
Why FlexAI
- Work on cutting-edge AI infrastructure
- Build systems that power developers and enterprises
- High ownership, fast execution, real impact
- Collaborative, high-caliber team