Senior DevOps Engineer/SRE

About FlexAI

Build and Deploy AI the right way, anywhere.

The FlexAI Compute Infrastructure Platform provides an "end-to-end AI compute layer" for running and managing workloads across any cloud, any GPU, and any deployment model (public, hybrid, or on-prem). It brings together "1-click simplicity" for users with "enterprise-grade orchestration, security, and automation" under the hood.

Founded by Brijesh Tripathi, who brings experience from Nvidia, Apple, Tesla, Intel, and Zoox, FlexAI is not just building a product – we’re shaping the future of AI. Our teams are strategically distributed across Silicon Valley and Bengaluru, united by a shared mission: to deliver more compute with less complexity.

If you're passionate about shaping the future of artificial intelligence, driving innovation, and contributing to a sustainable and inclusive AI ecosystem, FlexAI is the place for you!

Role Overview

FlexAI is looking for a Senior DevOps / SRE Engineer to build and operate the infrastructure powering our AI and PaaS platform.

You’ll work closely with developers to ensure our systems are reliable, performant, and scalable, while enabling fast product iteration. This role is hands-on and execution-focused, with opportunities to contribute to system design and reliability practices as we scale.

What You’ll Do

Build & Operate Infrastructure:

Build and maintain infrastructure for our AI and PaaS platform
Deploy and operate Kubernetes clusters and containerized services
Implement Infrastructure as Code using Pulumi (or similar tools)

Reliability & SRE Practices:

Help define and implement SLIs, SLOs, and error budgets
Improve system reliability, availability, and performance
Participate in on-call rotations, incident response, and postmortems

CI/CD & Automation:

Build and improve CI/CD pipelines for reliable and fast releases
Automate operational workflows and reduce manual toil
Contribute to GitOps and platform engineering practices

Observability & Performance:

Implement and maintain observability (metrics, logs, traces) using VictoriaMetrics and Grafana
Monitor systems and troubleshoot performance issues (latency, throughput, cost)

Collaboration:

Work closely with developers, platform, and AI teams to support production systems
Help debug issues across infrastructure and application layers
Contribute to improving engineering productivity and developer experience

What You’ll Need to Be Successful

4+ years of experience in DevOps, SRE, or Infrastructure Engineering
Experience operating production systems at scale
Hands-on experience with:
- Kubernetes & containers
- Infrastructure as Code (Pulumi, Terraform, etc.)
- Cloud or hybrid environments (AWS, GCP, Azure, or on-prem)
- Observability tools (Prometheus, Grafana, OpenTelemetry)
Experience with CI/CD systems and automation
Proficiency in Python, Go, or Bash
Strong debugging and problem-solving skills
Familiarity with SLOs and reliability practices
Experience working in startup or fast-paced environments
Comfortable leveraging AI coding tools and agents

Nice to Have

Experience with AI/ML infrastructure or GPU workloads
Familiarity with distributed systems or compute platforms
Exposure to platform engineering concepts
Experience supporting systems from beta to production

Why FlexAI

Work on cutting-edge AI infrastructure
Build systems that power developers and enterprises
High ownership, fast execution, real impact
Collaborative, high-caliber team

Engineering

Bangalore, India
