About Us:
Positron.ai builds custom hardware systems that accelerate AI inference, delivering significant gains over traditional GPU-based systems in both performance per dollar and performance per watt. Positron exists to create the world's best AI inference systems.
Senior Software Engineer – Machine Learning Systems & High-Performance LLM Inference
We are seeking a Senior Software Engineer to help develop the high-performance software that powers execution of open-source large language models (LLMs) on our custom appliance. The appliance combines FPGAs and x86 CPUs to accelerate transformer-based models. The software stack is written primarily in modern C++ (C++17/20) and relies heavily on templates, SIMD optimizations, and efficient parallel computing techniques.
Key Areas of Focus & Responsibilities
- Design and implement high-performance inference software for LLMs on custom hardware.
- Develop and optimize C++-based libraries that efficiently utilize SIMD instructions, threading, and memory hierarchy.
- Work closely with FPGA and systems engineers to ensure efficient data movement and computational offloading between x86 CPUs and FPGAs.
- Optimize model execution via low-level optimizations, including vectorization, cache efficiency, and hardware-aware scheduling.
- Contribute to performance profiling tools and methodologies to analyze execution bottlenecks at the instruction and data flow levels.
- Apply NUMA-aware memory management techniques to optimize memory access patterns for large-scale inference workloads.
- Implement ML system-level optimizations such as token streaming, KV cache optimizations, and efficient batching for transformer execution.
- Collaborate with ML researchers and software engineers to integrate model quantization techniques, sparsity optimizations, and mixed-precision execution.
- Ensure all code contributions include unit, performance, acceptance, and regression tests as part of a continuous integration-based development process.
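To give a concrete flavor of the ML system-level work listed above, here is a minimal, hypothetical sketch of the KV-cache idea: per-head key/value buffers that grow by one entry per generated token so attention over past tokens is never recomputed. All names (`KVCache`, `append`, `tokens`) are illustrative, not part of Positron's actual codebase.

```cpp
#include <cstddef>
#include <vector>

// Minimal KV-cache sketch (illustrative only): flattened key/value buffers
// for a single attention head, extended by one row per generated token.
struct KVCache {
    std::size_t head_dim;          // feature dimension per token
    std::vector<float> keys;       // flattened [tokens x head_dim]
    std::vector<float> values;     // flattened [tokens x head_dim]

    explicit KVCache(std::size_t dim) : head_dim(dim) {}

    // Append the key/value projections of one new token; earlier rows
    // are reused by attention instead of being recomputed.
    void append(const std::vector<float>& k, const std::vector<float>& v) {
        keys.insert(keys.end(), k.begin(), k.end());
        values.insert(values.end(), v.begin(), v.end());
    }

    std::size_t tokens() const { return keys.size() / head_dim; }
};
```

In a production engine this structure would be preallocated, laid out for cache- and NUMA-friendly access, and shared across batched requests; the sketch only shows the core append-and-reuse pattern.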
Required Skills & Experience
- 7+ years of professional experience in C++ software development, with a focus on performance-critical applications.
- Strong understanding of C++ templates and modern memory management.
- Hands-on experience with SIMD programming (AVX-512, SSE, or equivalent) and intrinsics-based vectorization.
- Experience in high-performance computing (HPC), numerical computing, or ML inference optimization.
- Experience with ML model execution optimizations, including efficient tensor computations and memory access patterns.
- Knowledge of multi-threading, NUMA architectures, and low-level CPU optimization.
- Proficiency with systems-level software development, profiling tools (Perfetto, VTune, Valgrind), and benchmarking.
- Experience working with hardware accelerators (FPGAs, GPUs, or custom ASICs) and designing efficient software-hardware interfaces.
Preferred Skills (Nice to Have)
- Familiarity with LLVM/Clang or GCC compiler optimizations.
- Experience in LLM quantization, sparsity optimizations, and mixed-precision computation.
- Knowledge of distributed inference techniques and networking optimizations.
- Understanding of graph partitioning and execution scheduling for large-scale ML models.
Why Join Us?
- Work on a cutting-edge ML inference platform that redefines performance and efficiency for LLMs.
- Tackle challenging low-level performance engineering problems in AI and HPC.
- Collaborate with a team of hardware, software, and ML experts building an industry-first product.
- Opportunity to contribute to and shape the future of open-source AI inference software.