About Us
Foundation is developing the future of general-purpose robotics with the goal of addressing the labor shortage.
Our mission is to create advanced robots that can operate in complex environments, reducing human risk in conflict zones and enhancing efficiency in labor-intensive industries.
We are on the lookout for extraordinary engineers and scientists to join our team. Prior experience in robotics isn't a prerequisite; it's your talent and determination that truly count.
We expect that many of our team members will bring diverse perspectives from various industries and fields. We are looking for individuals with a proven record of exceptional ability and a history of creating things that work.
Our Culture
We like to be frank and honest about who we are so that people can decide for themselves whether this is a culture they resonate with. Please read more about our culture at https://foundation.bot/culture.
Who Should Join:
- You like working in person with a team in San Francisco.
- You deeply believe that this is the most important mission for humanity and that it needs to happen yesterday.
- You are highly technical, regardless of your role. We are building technology; you need to understand technology well.
- You care deeply about aesthetics and design. If it's not the best product ever, it bothers you, and you need to “fix” it.
- You don't need someone to motivate you; you get things done.
What You Will Do in This Role
- Develop and optimize vision-language-action models, including transformers, diffusion models, and multimodal encoders/decoders.
- Build representations for 2D/3D perception, affordances, scene understanding, and spatial reasoning.
- Integrate LLM-based reasoning with action planning and control policies.
- Design datasets for multimodal learning: video-action trajectories, instruction following, teleoperation data, and synthetic data.
- Interface VLA outputs with real-time robot control stacks (navigation, manipulation, locomotion).
- Implement grounding layers that convert natural language instructions into symbolic, geometric, or skill-level action plans.
- Deploy models to onboard or edge compute platforms, optimizing for latency, safety, and reliability.
- Build scalable pipelines for ingesting, labeling, and generating multimodal training data.
- Create simulation-to-real (Sim2Real) training workflows using synthetic environments and teleoperated demonstration data.
- Optimize training pipelines, model parallelism, and evaluation frameworks.
- Work closely with robotics, hardware, controls, and safety teams to ensure model outputs are executable, safe, and predictable.
- Collaborate with product teams to define robot capabilities and user-facing behaviors.
- Participate in user and field testing to iterate on real-world performance.
What Kind of Person We Are Looking For
- Strong experience training multimodal models, including VLAs, VLMs, vision transformers, and LLMs.
- Ability to build and iterate on large-scale training pipelines.
- Deep proficiency in PyTorch or JAX, distributed training, and GPU acceleration.
- Strong software engineering skills in Python and modern ML tooling.
- Experience with (synthetic) dataset creation and curation.
- Understanding of real-time deployment constraints on embedded hardware.
- Preferably, familiarity with robotics simulation environments (Isaac Lab, MuJoCo, or similar).
- Ideally, hands-on experience with robotics, embodied AI, or reinforcement/imitation learning.
- MSc or PhD in Computer Science, Robotics, Machine Learning, or a related field, or equivalent industry experience.
Benefits
We provide market-standard benefits (health, vision, dental, 401(k), etc.). Join us for the culture and the mission, not for the benefits.
Salary
Annual compensation is expected to be between $150,000 and $300,000. Exact compensation may vary based on skills, experience, and location.