FlexAI

Staff Linux & Systems Engineer

About FlexAI

Build and Deploy AI the right way, anywhere.

The FlexAI Compute Infrastructure Platform provides an "end-to-end AI compute layer" for running and managing workloads across any cloud, any GPU, and any deployment model (public, hybrid, or on-prem). It brings together "1-click simplicity" for users with "enterprise-grade orchestration, security, and automation" under the hood.


Founded by Brijesh Tripathi, who brings experience from NVIDIA, Apple, Tesla, Intel, and Zoox, FlexAI is not just building a product – we're shaping the future of AI. Our teams are strategically distributed across Paris, Silicon Valley, and Bangalore, united by a shared mission: to deliver more compute with less complexity.

If you're passionate about shaping the future of artificial intelligence, driving innovation, and contributing to a sustainable and inclusive AI ecosystem, FlexAI is the place for you!

Role Overview

FlexAI is seeking a Staff Linux & Systems Engineer to architect, build, and operate large-scale bare-metal AI/HPC GPU clusters. This role extends beyond hands-on systems engineering into technical leadership, platform architecture, and fleet-scale infrastructure ownership.


You will lead platform bring-up across the full stack (UEFI/BIOS → bootloaders → OS → kernel/device enablement), drive low-level networking performance (RoCEv2/InfiniBand), ensure GPU/accelerator stack readiness, and establish repeatable automation frameworks for provisioning, compliance, and reliability at scale.


This role is suited for engineers who are deeply comfortable operating across firmware, kernel, PCIe, and distributed AI infrastructure — and who can translate low-level expertise into scalable platform systems and engineering standards.




What You'll Do

Platform Architecture & Fleet Ownership:

  • Architect and lead end-to-end bring-up of AI/HPC server platforms from firmware to production cluster deployment
  • Define standards for UEFI/BIOS configuration, SecureBoot, TPM/MeasuredBoot, GRUB, PXE/iPXE provisioning workflows
  • Establish scalable patterns for fleet provisioning, configuration management, and lifecycle operations across GPU clusters
  • Own technical roadmap for bare-metal AI infrastructure and systems reliability at scale

Platform & Boot Enablement:

  • Lead server bring-up including UEFI/BIOS configuration, bootloader flows, and secure boot pipelines
  • Architect automated BMC/IPMI/Redfish workflows for out-of-band provisioning and fleet management
  • Standardize platform initialization processes across heterogeneous hardware environments
  • Diagnose and resolve complex boot, firmware, and hardware initialization issues

OS & Kernel Engineering:

  • Architect, build, and harden custom Linux (Ubuntu) images optimized for AI and HPC workloads
  • Lead kernel tuning for performance-sensitive workloads (NUMA, IRQ affinity, cgroups, namespaces)
  • Diagnose and resolve kernel and user-space performance issues using perf, ftrace, eBPF, and bpftrace
  • Drive system-level optimizations for latency, throughput, and resource utilization across clusters

PCIe, Driver & Device Enablement:

  • Lead validation of PCIe topologies and advanced features (ACS, ARI, ATS, SR-IOV, IOMMU/VFIO)
  • Own GPU/NIC driver bring-up, firmware validation, and device performance optimization
  • Root-cause complex regressions across kernel, drivers, firmware, and userspace layers
  • Partner with hardware vendors to resolve low-level device and platform issues

Provisioning & Automation at Scale:

  • Architect idempotent Ansible-based provisioning frameworks and automation pipelines
  • Build scalable golden images and repeatable provisioning workflows for large GPU fleets
  • Develop Python/Pytest validation harnesses for pre- and post-provisioning checks
  • Implement drift detection, remediation, and compliance automation across infrastructure

GPU / Accelerator & HPC Stack Readiness:

  • Lead enablement of NVIDIA CUDA, NCCL, GPUDirect RDMA and AMD ROCm stacks
  • Validate and optimize multi-GPU and multi-node distributed training performance
  • Tune NCCL, UCX, MPI (OpenMPI), and PyTorch distributed workloads (torchrun)
  • Establish performance baselines and benchmarking frameworks for AI infrastructure

High-Performance Networking:

  • Architect and tune RoCEv2 and/or InfiniBand fabrics for large-scale AI clusters
  • Validate rdma-core/libibverbs paths end-to-end across distributed environments
  • Optimize congestion control, MTU/jumbo frames, NUMA/RSS, and IRQ steering
  • Ensure consistent low-latency and high-throughput networking performance

Containers, Tooling & CI:

  • Standardize reproducible Docker images and container environments for validation and CI
  • Maintain build and automation tooling using C, Python, Go (where applicable), Make, and CMake
  • Integrate validation and automation workflows into CI pipelines (GitHub Actions/GitLab CI)
  • Improve reproducibility and reliability of system-level artifacts

Security & Compliance:

  • Define and enforce CIS hardening baselines across firmware, OS, and cluster layers
  • Maintain SecureBoot policies, measured boot attestations, and patch compliance standards
  • Implement access controls, auditability, and security automation across infrastructure
  • Lead security posture improvements for bare-metal AI infrastructure environments

Required Qualifications

What You'll Need to be Successful:

  • Bachelor’s or Master’s degree in Computer Science, Software Engineering, or a related field
  • 10+ years in Linux systems engineering, including deep kernel and userspace debugging
  • Proven ownership of bare-metal server bring-up and large-scale fleet provisioning
  • Extensive experience enabling multi-GPU and multi-node training over RoCEv2 or InfiniBand
  • Strong track record building reproducible OS images, automation pipelines, and production-grade infrastructure
  • Demonstrated experience operating and scaling AI/HPC or GPU cluster environments
  • Experience mentoring senior engineers and leading complex infrastructure initiatives

Technical Skills:

  • Platform/Boot: UEFI/BIOS, GRUB, SecureBoot, PXE/iPXE, BMC/IPMI/Redfish
  • OS/Kernel: Linux (Ubuntu), systemd/init, eBPF, perf, ftrace, bpftrace, cgroups, namespaces, NUMA, IRQ affinity
  • Drivers/PCIe: PCIe fundamentals (ACS/ARI/ATS), SR-IOV, VFIO, IOMMU, NIC/GPU drivers
  • Provisioning/Automation: Ansible, Python, Pytest, Debos, cloud-init
  • Containers: Docker, docker-compose
  • Build/Dev: C, Python, Go (preferred), Make, CMake, CI pipelines
  • Networking (HPC): RoCEv2, InfiniBand, libibverbs/rdma-core, NCCL/UCX, MPI (OpenMPI)
  • GPU/Accelerators: NVIDIA (CUDA, NCCL, GPUDirect RDMA), AMD ROCm
  • Security/Compliance: CIS hardening, SecureBoot, TPM, Measured Boot

Soft Skills:

  • Ability to lead technically while remaining hands-on in low-level systems work
  • Strong cross-functional collaboration with ML researchers, platform, and infrastructure teams
  • Clear documentation and runbook creation for scalable operational excellence
  • High ownership mindset suited for deep-tech, high-performance infrastructure environments



Preferred Qualifications

  • Experience with large-scale AI infrastructure or GPU cloud platforms
  • Familiarity with performance benchmarking and system optimization for AI workloads
  • Experience operating in deep-tech, HPC, or AI-first infrastructure environments
  • Prior experience defining infrastructure architecture and engineering standards at scale

Engineering

Bangalore, India

