About FlexAI
Build and Deploy AI the right way, anywhere.
The FlexAI Compute Infrastructure Platform provides an end-to-end AI compute layer for running and managing workloads across any cloud, any GPU, and any deployment model (public, hybrid, or on-prem). It brings together 1-click simplicity for users with enterprise-grade orchestration, security, and automation under the hood.
Founded by Brijesh Tripathi, who brings experience from Nvidia, Apple, Tesla, Intel, and Zoox, FlexAI is not just building a product; we're shaping the future of AI. Our teams are strategically distributed across Silicon Valley and Bengaluru, united by a shared mission: to deliver more compute with less complexity.
If you're passionate about shaping the future of artificial intelligence, driving innovation, and contributing to a sustainable and inclusive AI ecosystem, FlexAI is the place for you!
Role Overview
At FlexAI, we're building a high-performance, cloud-agnostic AI compute platform designed for next-generation training and inference workloads. As a Staff AI Runtime Engineer, you'll play a pivotal role in the design, development, and optimization of the core runtime infrastructure that powers distributed training and deployment of large AI models (LLMs and beyond).
This is a hands-on leadership role, perfect for a systems-minded software engineer who thrives at the intersection of AI workloads, runtimes, and performance-critical infrastructure. You'll own critical components of our PyTorch-based stack, lead technical direction, and collaborate across engineering, research, and product to push the boundaries of elastic, fault-tolerant, high-performance model execution.
What You'll Do
Lead Runtime Design & Development:
- Own the core runtime architecture supporting AI training and inference at scale.
- Design resilient and elastic runtime features (e.g. dynamic node scaling, job recovery) within our custom PyTorch stack.
- Optimize distributed training reliability, orchestration, and job-level fault tolerance.
Drive Performance at Scale:
- Profile and enhance low-level system performance across training and inference pipelines.
- Improve packaging, deployment, and integration of customer models in production environments.
- Ensure consistent throughput, latency, and reliability metrics across multi-node, multi-GPU setups.
Build Internal Tooling & Frameworks:
- Design and maintain libraries and services that support model lifecycle: training, checkpointing, fault recovery, packaging, and deployment.
- Implement observability hooks, diagnostics, and resilience mechanisms for deep learning workloads.
- Champion best practices in CI/CD, testing, and software quality across the AI Runtime stack.
Collaborate & Mentor:
- Work cross-functionally with Research, Infrastructure, and Product teams to align runtime development with customer and platform needs.
- Guide technical discussions, mentor junior engineers, and help scale the AI Runtime team’s capabilities.
Required Qualifications
What You’ll Need to Be Successful:
- 8+ years of experience in systems/software engineering, with deep exposure to AI runtime, distributed systems, or compiler/runtime interaction.
- Experience delivering platform-as-a-service (PaaS) offerings.
- Proven experience optimizing and scaling deep learning runtimes (e.g. PyTorch, TensorFlow, JAX) for large-scale training and/or inference.
- Strong programming skills in Python and C++ (Go or Rust is a plus).
- Familiarity with distributed training frameworks, low-level performance tuning, and resource orchestration.
- Experience working with multi-GPU, multi-node, or cloud-native AI workloads.
- Solid understanding of containerized workloads, job scheduling, and failure recovery in production environments.
Nice to Have:
- Contributions to PyTorch internals or open-source DL infrastructure projects.
- Familiarity with LLM training pipelines, checkpointing, or elastic training orchestration.
- Experience with Kubernetes, Ray, TorchElastic, or custom AI job orchestrators.
- Background in systems research, compilers, or runtime architecture for HPC or ML.
- Previous startup experience.
This position is In-Person and located at our Santa Clara, CA Office.
What We Offer
- A competitive salary and benefits package, tailored to recognize your dedication and contributions.
- The opportunity to collaborate with leading experts in AI and cloud computing, learning from the best and the brightest, fostering continuous growth.
- An environment that values innovation, collaboration, and mutual respect.
- Support for personal and professional development, empowering you with the tools and resources to elevate your skills and leave a lasting impact.
- A pivotal role in the AI revolution, shaping the technologies that power the innovations of tomorrow.