FlexAI

Lead/Staff AI Runtime Engineer

About FlexAI


Build and Deploy AI the right way, anywhere.


The FlexAI Compute Infrastructure Platform provides an end-to-end AI compute layer for running and managing workloads across any cloud, any GPU, and any deployment model (public, hybrid, or on-prem). It combines 1-click simplicity for users with enterprise-grade orchestration, security, and automation under the hood.


Founded by Brijesh Tripathi, who brings experience from NVIDIA, Apple, Tesla, Intel, and Zoox, FlexAI is not just building a product: we're shaping the future of AI. Our teams are strategically distributed across Paris, Silicon Valley, and Bangalore, united by a shared mission: to deliver more compute with less complexity.

If you're passionate about shaping the future of artificial intelligence, driving innovation, and contributing to a sustainable and inclusive AI ecosystem, FlexAI is the place for you!

Position Overview:

 

At FlexAI, we’re building a high-performance, cloud-agnostic AI compute platform designed for next-generation training and inference workloads. As Lead/Staff AI Runtime Engineer, you’ll play a pivotal role in the design, development, and optimization of the core runtime infrastructure that powers distributed training and deployment of large AI models (LLMs and beyond).


This is a hands-on leadership role, perfect for a systems-minded software engineer who thrives at the intersection of AI workloads, runtimes, and performance-critical infrastructure. You'll own critical components of our PyTorch-based stack, lead technical direction, and collaborate across engineering, research, and product to push the boundaries of elastic, fault-tolerant, high-performance model execution.



What you’ll do:


Lead Runtime Design & Development


Own the core runtime architecture supporting AI training and inference at scale.

Design resilient and elastic runtime features (e.g. dynamic node scaling, job recovery) within our custom PyTorch stack.

Optimize distributed training reliability, orchestration, and job-level fault tolerance.

Drive Performance at Scale


Profile and enhance low-level system performance across training and inference pipelines.

Improve packaging, deployment, and integration of customer models in production environments.

Ensure consistent throughput, latency, and reliability metrics across multi-node, multi-GPU setups.

Build Internal Tooling & Frameworks


Design and maintain libraries and services that support model lifecycle: training, checkpointing, fault recovery, packaging, and deployment.

Implement observability hooks, diagnostics, and resilience mechanisms for deep learning workloads.

Champion best practices in CI/CD, testing, and software quality across the AI Runtime stack.

Collaborate & Mentor


Work cross-functionally with Research, Infrastructure, and Product teams to align runtime development with customer and platform needs.

Guide technical discussions, mentor junior engineers, and help scale the AI Runtime team’s capabilities.




What you’ll need to be successful:


8+ years of experience in systems/software engineering, with deep exposure to AI runtime, distributed systems, or compiler/runtime interaction.

Experience delivering PaaS offerings.

Proven experience optimizing and scaling deep learning runtimes (e.g. PyTorch, TensorFlow, JAX) for large-scale training and/or inference.

Strong programming skills in Python and C++ (Go or Rust is a plus).

Familiarity with distributed training frameworks, low-level performance tuning, and resource orchestration.

Experience working with multi-GPU, multi-node, or cloud-native AI workloads.

Solid understanding of containerized workloads, job scheduling, and failure recovery in production environments.


Bonus Points:


Contributions to PyTorch internals or open-source DL infrastructure projects.

Familiarity with LLM training pipelines, checkpointing, or elastic training orchestration.

Experience with Kubernetes, Ray, TorchElastic, or custom AI job orchestrators.

Background in systems research, compilers, or runtime architecture for HPC or ML.

Previous startup experience.


What we offer:


  • A competitive salary and benefits package, tailored to recognize your dedication and contributions.
  • The opportunity to collaborate with leading experts in AI and cloud computing, learning from the best and the brightest, fostering continuous growth.
  • An environment that values innovation, collaboration, and mutual respect.
  • Support for personal and professional development, empowering you with the tools and resources to elevate your skills and leave a lasting impact.
  • A pivotal role in the AI revolution, shaping the technologies that power the innovations of tomorrow.

Engineering

Bangalore, India
