About FlexAI
Build and Deploy AI the right way, anywhere.
The FlexAI Compute Infrastructure Platform provides an end-to-end AI compute layer for running and managing workloads across any cloud, any GPU, and any deployment model (public, hybrid, or on-prem). It combines 1-click simplicity for users with enterprise-grade orchestration, security, and automation under the hood.
Founded by Brijesh Tripathi, who brings experience from Nvidia, Apple, Tesla, Intel, and Zoox, FlexAI is not just building a product – we’re shaping the future of AI. Our teams are strategically distributed across Silicon Valley and Bengaluru, united by a shared mission: to deliver more compute with less complexity.
If you're passionate about shaping the future of artificial intelligence, driving innovation, and contributing to a sustainable and inclusive AI ecosystem, FlexAI is the place for you!
Role Overview
FlexAI is looking for a Senior DevOps / SRE Engineer to build and operate the infrastructure powering our AI and PaaS platform.
You’ll work closely with developers to ensure our systems are reliable, performant, and scalable, while enabling fast product iteration. This role is hands-on and execution-focused, with opportunities to contribute to system design and reliability practices as we scale.
What You’ll Do
Build & Operate Infrastructure:
- Build and maintain infrastructure for our AI and PaaS platform
- Deploy and operate Kubernetes clusters and containerized services
- Implement Infrastructure as Code using Pulumi (or similar tools)
Reliability & SRE Practices:
- Help define and implement SLIs, SLOs, and error budgets
- Improve system reliability, availability, and performance
- Participate in on-call rotations, incident response, and postmortems
CI/CD & Automation:
- Build and improve CI/CD pipelines for reliable and fast releases
- Automate operational workflows and reduce manual toil
- Contribute to GitOps and platform engineering practices
Observability & Performance:
- Implement and maintain observability (metrics, logs, traces) using VictoriaMetrics and Grafana
- Monitor systems and troubleshoot performance issues (latency, throughput, cost)
Collaboration:
- Work closely with developers, platform, and AI teams to support production systems
- Help debug issues across infrastructure and application layers
- Contribute to improving engineering productivity and developer experience
What You’ll Need to Be Successful
- 4+ years of experience in DevOps, SRE, or Infrastructure Engineering
- Experience operating production systems at scale
- Hands-on experience with:
  - Kubernetes & containers
  - Infrastructure as Code (Pulumi, Terraform, etc.)
  - Cloud or hybrid environments (AWS, GCP, Azure, or on-prem)
  - Observability tools (Prometheus, Grafana, OpenTelemetry)
- Experience with CI/CD systems and automation
- Proficiency in Python, Go, or Bash
- Strong debugging and problem-solving skills
- Familiarity with SLOs and reliability practices
- Experience working in startup or fast-paced environments
- Comfortable leveraging AI coding tools and agents
Nice to Have
- Experience with AI/ML infrastructure or GPU workloads
- Familiarity with distributed systems or compute platforms
- Exposure to platform engineering concepts
- Experience supporting systems from Beta to production
Why FlexAI
- Work on cutting-edge AI infrastructure
- Build systems that power developers and enterprises
- High ownership, fast execution, real impact
- Collaborative, high-caliber team