Software Engineer, LLM Inference

This is a hybrid role based in Burlingame, CA, with the team working together in person two days per week.

About Galileo

Galileo is the leading platform for GenAI evaluation and observability, with a mission to democratize building safe, reliable, and robust applications in the new era of AI-powered software development. Our founders pioneered early technology behind some of the world's most ubiquitous AI applications, including Apple's Siri and Google Speech. We firmly believe that AI developers need meticulously crafted, research-driven tools to create trustworthy, high-quality generative AI applications that will transform how we work and live.


Galileo addresses the complexities inherent in implementing, evaluating, and monitoring GenAI applications, streamlining development for individual developers and teams alike through a comprehensive platform that spans the full AI development lifecycle. Galileo bridges critical gaps, significantly enhancing developers' ability to refine and deploy reliable, precise GenAI applications.


Since its inception, Galileo has rapidly gained traction, serving Fortune 100 banks, Fortune 50 telecom companies, as well as AI teams at prominent organizations such as Reddit and Headspace Health, among dozens of others.


Galileo has AI research at its core, with founders coming from Google and Uber, where they solved challenging AI/ML problems in the speech, evaluation, and ML infrastructure domains. It is now a Series B business backed by tier-1 investors including Battery Ventures, Scale Venture Partners, and Databricks Ventures, with $68M in total funding. We are headquartered in the San Francisco Bay Area, with offices in New York and Bangalore, India as our areas of future growth.


Ideal Candidate
Someone who has built scalable machine learning compute systems and runtime microservices that serve machine learning models at scale.


You Have…

  • Worked on large-scale distributed systems.
  • Experience with high-throughput machine learning systems and platforms; bonus if you have worked directly on model-serving systems.
  • Excellent Python programming skills for low-latency systems.
  • Worked on model optimization techniques such as:
    • Dynamic batching and concurrent handling of inference requests (see the sketch after this list)
    • Optimizing models with TensorRT prior to deployment
    • Applying precision reduction, layer fusion, and kernel auto-tuning to reduce the number of kernel and memory operations
    • Low-level optimizations on GPU systems
  • Built and scaled LLM inference servers, similar to NVIDIA Triton.
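
To make the dynamic-batching expectation concrete, here is a minimal sketch, assuming a generic model object with a batched predict(inputs) method (a hypothetical stand-in, not our actual stack): requests queue up and flush either when the batch fills or a short deadline expires, trading a few milliseconds of latency for much higher GPU utilization.

    import asyncio

    MAX_BATCH = 32     # flush once this many requests are queued
    MAX_WAIT_MS = 5    # ...or once the oldest request has waited this long

    queue: asyncio.Queue = asyncio.Queue()

    async def infer(request):
        # Enqueue one request and await its individual result.
        fut = asyncio.get_running_loop().create_future()
        await queue.put((request, fut))
        return await fut

    async def batcher(model):
        # Drain the queue into batches and run one forward pass per batch.
        loop = asyncio.get_running_loop()
        while True:
            batch = [await queue.get()]                # block for the first item
            deadline = loop.time() + MAX_WAIT_MS / 1000
            while len(batch) < MAX_BATCH:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            outputs = model.predict([req for req, _ in batch])  # one batched forward pass
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)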

Bonus Skills

  • Worked with frameworks like Ray (see the sketch below)
  • Trained and run inference with deep learning models built in PyTorch, TensorFlow, Keras, and PyTorch Lightning
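
For reference, a minimal sketch of fanning inference out across workers with Ray, whose actor model maps naturally onto per-GPU model replicas. The model here is a trivial stub so the example stays self-contained and runnable on CPU:

    import ray

    ray.init()

    @ray.remote  # in production you would reserve a GPU: @ray.remote(num_gpus=1)
    class InferenceWorker:
        def __init__(self):
            # A real worker would load an LLM here; a stub keeps the sketch runnable.
            self.model = lambda batch: [x * 2 for x in batch]

        def predict(self, batch):
            return self.model(batch)

    workers = [InferenceWorker.remote() for _ in range(4)]
    batches = [[1, 2], [3, 4], [5, 6], [7, 8]]
    futures = [w.predict.remote(b) for w, b in zip(workers, batches)]
    print(ray.get(futures))  # gathers results from all four workers in parallel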

What You'll Do


You’ll be at the heart of building and scaling the systems that make large language models practical in production. This role combines deep systems engineering with applied AI performance engineering. 


In this role, you will:


  • Design and scale inference infrastructure – architect and optimize distributed systems that serve LLMs at scale, ensuring low latency, high throughput, and cost efficiency.
  • Push the limits of performance – apply techniques like dynamic batching, concurrency optimization, precision reduction, and GPU kernel tuning to maximize throughput while maintaining quality (a precision-reduction sketch follows this list).
  • Optimize model serving pipelines – work with TensorRT, layer fusion, kernel auto-tuning, and other advanced optimizations to squeeze maximum performance out of the hardware.
  • Build robust inference microservices – design runtime services (similar to NVIDIA Triton) to support multi-tenant, real-time inference workloads in production.
  • Experiment with cutting-edge frameworks – explore and integrate new technologies like Ray, distributed PyTorch/TensorFlow inference, and other emerging ML infrastructure tools.
  • Collaborate with research & product teams – translate state-of-the-art models into reliable, efficient, and observable services powering user-facing applications.
  • Shape best practices for inference at scale – help define how we run LLM workloads safely, reliably, and cost-effectively across diverse hardware environments.
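
As a flavor of the precision-reduction work mentioned above, here is a minimal PyTorch sketch that runs a layer in FP16 via autocast; the Linear layer is a stand-in, not our actual model stack, and a CUDA device is assumed:

    import torch

    # Stand-in for a real model layer; .eval() disables training-only behavior.
    model = torch.nn.Linear(4096, 4096).cuda().eval()
    x = torch.randn(8, 4096, device="cuda")

    # autocast runs matmul-heavy ops in FP16, roughly halving memory traffic;
    # inference_mode skips autograd bookkeeping entirely.
    with torch.inference_mode(), torch.autocast("cuda", dtype=torch.float16):
        y = model(x)

    print(y.dtype)  # torch.float16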

Why Galileo

  • Join a seasoned founding team that has previously led product and engineering teams from 0 to $100M+ in revenue and from 0 to 1B+ users globally
  • We obsess over our team’s culture, driven by inclusivity, empathy, and curiosity
  • We invest in our team’s development and happiness because our employees are the key to our success and to happy customers – toward that end, we offer:
    • 🌴 Unlimited PTO
    • 👶 Parental Leave (birthing & non-birthing) – 100% pay for 8 weeks
    • 🩺 Medical Insurance
    • 😁 Dental Insurance
    • 👀 Vision Insurance
    • 💰 401(k) Retirement Savings Plan
    • 📈 Pre-IPO Stock Options
    • 🚌 Commuter Benefits (pre-tax + company sponsored)
    • 🧘 Mental & Physical Wellness Stipend
    • 🍱 Daily Meals Stipend
    • 🏢 HQ in Burlingame + hub in NYC + hub in Bangalore
    • 🤝 Build the company alongside the Founders

The pay range for this role is:

180,000 – 300,000 USD per year (Hybrid: Burlingame, California, US)
