FlexAI

Staff Linux & Systems Engineer

About FlexAI

Build and Deploy AI the right way, anywhere.

The FlexAI Compute Infrastructure Platform provides an "end-to-end AI compute layer" for running and managing workloads across any cloud, any GPU, and any deployment model (public, hybrid, or on-prem). It brings together "1-click simplicity" for users with "enterprise-grade orchestration, security, and automation" under the hood.


Founded by Brijesh Tripathi, who brings experience from NVIDIA, Apple, Tesla, Intel, and Zoox, FlexAI is not just building a product – we're shaping the future of AI. Our teams are strategically distributed across Paris, Silicon Valley, and Bangalore, united by a shared mission: to deliver more compute with less complexity.

If you're passionate about shaping the future of artificial intelligence, driving innovation, and contributing to a sustainable and inclusive AI ecosystem, FlexAI is the place for you!

Role Overview

FlexAI is seeking a Staff Linux & Systems Engineer to architect, build, and operate large-scale bare-metal AI/HPC GPU clusters. This role extends beyond hands-on systems engineering into technical leadership, platform architecture, and fleet-scale infrastructure ownership.


You will lead platform bring-up across the full stack (UEFI/BIOS → bootloaders → OS → kernel/device enablement), drive low-level networking performance (RoCEv2/InfiniBand), ensure GPU/accelerator stack readiness, and establish repeatable automation frameworks for provisioning, compliance, and reliability at scale.


This role is suited for engineers who are deeply comfortable operating across firmware, kernel, PCIe, and distributed AI infrastructure — and who can translate low-level expertise into scalable platform systems and engineering standards.




What You'll Do

Platform Architecture & Fleet Ownership:

  • Architect and lead end-to-end bring-up of AI/HPC server platforms from firmware to production cluster deployment
  • Define standards for UEFI/BIOS configuration, SecureBoot, TPM/MeasuredBoot, GRUB, PXE/iPXE provisioning workflows
  • Establish scalable patterns for fleet provisioning, configuration management, and lifecycle operations across GPU clusters
  • Own technical roadmap for bare-metal AI infrastructure and systems reliability at scale

Platform & Boot Enablement:

  • Lead server bring-up including UEFI/BIOS configuration, bootloader flows, and secure boot pipelines
  • Architect automated BMC/IPMI/Redfish workflows for out-of-band provisioning and fleet management
  • Standardize platform initialization processes across heterogeneous hardware environments
  • Diagnose and resolve complex boot, firmware, and hardware initialization issues

OS & Kernel Engineering:

  • Architect, build, and harden custom Linux (Ubuntu) images optimized for AI and HPC workloads
  • Lead kernel tuning for performance-sensitive workloads (NUMA, IRQ affinity, cgroups, namespaces)
  • Diagnose and resolve kernel and user-space performance issues using perf, ftrace, eBPF, and bpftrace
  • Drive system-level optimizations for latency, throughput, and resource utilization across clusters

PCIe, Driver & Device Enablement:

  • Lead validation of PCIe topologies and advanced features (ACS, ARI, ATS, SR-IOV, IOMMU/VFIO)
  • Own GPU/NIC driver bring-up, firmware validation, and device performance optimization
  • Root-cause complex regressions across kernel, drivers, firmware, and userspace layers
  • Partner with hardware vendors to resolve low-level device and platform issues

Provisioning & Automation at Scale:

  • Architect idempotent Ansible-based provisioning frameworks and automation pipelines
  • Build scalable golden images and repeatable provisioning workflows for large GPU fleets
  • Develop Python/Pytest validation harnesses for pre- and post-provisioning checks
  • Implement drift detection, remediation, and compliance automation across infrastructure

GPU / Accelerator & HPC Stack Readiness:

  • Lead enablement of NVIDIA CUDA, NCCL, GPUDirect RDMA and AMD ROCm stacks
  • Validate and optimize multi-GPU and multi-node distributed training performance
  • Tune NCCL, UCX, MPI (OpenMPI), and PyTorch distributed workloads (torchrun)
  • Establish performance baselines and benchmarking frameworks for AI infrastructure

High-Performance Networking:

  • Architect and tune RoCEv2 and/or InfiniBand fabrics for large-scale AI clusters
  • Validate rdma-core/libibverbs paths end-to-end across distributed environments
  • Optimize congestion control, MTU/jumbo frames, NUMA/RSS, and IRQ steering
  • Ensure consistent low-latency and high-throughput networking performance

Containers, Tooling & CI:

  • Standardize reproducible Docker images and container environments for validation and CI
  • Maintain build and automation tooling using C, Python, Go (where applicable), Make, and CMake
  • Integrate validation and automation workflows into CI pipelines (GitHub Actions/GitLab CI)
  • Improve reproducibility and reliability of system-level artifacts

Security & Compliance:

  • Define and enforce CIS hardening baselines across firmware, OS, and cluster layers
  • Maintain SecureBoot policies, measured boot attestations, and patch compliance standards
  • Implement access controls, auditability, and security automation across infrastructure
  • Lead security posture improvements for bare-metal AI infrastructure environments

Required Qualifications

What You'll Need to be Successful:

  • Bachelor’s or Master’s degree in Computer Science, Software Engineering, or a related field
  • 10+ years in Linux systems engineering, including deep kernel and userspace debugging
  • Proven ownership of bare-metal server bring-up and large-scale fleet provisioning
  • Extensive experience enabling multi-GPU and multi-node training over RoCEv2 or InfiniBand
  • Strong track record building reproducible OS images, automation pipelines, and production-grade infrastructure
  • Demonstrated experience operating and scaling AI/HPC or GPU cluster environments
  • Experience mentoring senior engineers and leading complex infrastructure initiatives

Technical Skills:

  • Platform/Boot: UEFI/BIOS, GRUB, SecureBoot, PXE/iPXE, BMC/IPMI/Redfish
  • OS/Kernel: Linux (Ubuntu), systemd/init, eBPF, perf, ftrace, bpftrace, cgroups, namespaces, NUMA, IRQ affinity
  • Drivers/PCIe: PCIe fundamentals (ACS/ARI/ATS), SR-IOV, VFIO, IOMMU, NIC/GPU drivers
  • Provisioning/Automation: Ansible, Python, Pytest, Debos, cloud-init
  • Containers: Docker, docker-compose
  • Build/Dev: C, Python, Go (preferred), Make, CMake, CI pipelines
  • Networking (HPC): RoCEv2, InfiniBand, libibverbs/rdma-core, NCCL/UCX, MPI (OpenMPI)
  • GPU/Accelerators: NVIDIA (CUDA, NCCL, GPUDirect RDMA), AMD ROCm
  • Security/Compliance: CIS hardening, SecureBoot, TPM, Measured Boot

Soft Skills:

  • Ability to lead technically while remaining hands-on in low-level systems work
  • Strong cross-functional collaboration with ML researchers, platform, and infrastructure teams
  • Clear documentation and runbook creation for scalable operational excellence
  • High ownership mindset suited for deep-tech, high-performance infrastructure environments



Preferred Qualifications

  • Experience with large-scale AI infrastructure or GPU cloud platforms
  • Familiarity with performance benchmarking and system optimization for AI workloads
  • Experience operating in deep-tech, HPC, or AI-first infrastructure environments
  • Prior experience defining infrastructure architecture and engineering standards at scale

Engineering

Bangalore, India

