Sanas

Staff+ Data Engineer (ML Infrastructure)

Sanas is pioneering the future of human communication. Founded by a team of Stanford researchers and entrepreneurs with deep industry experience, Sanas has developed the world's first real-time speech AI platform capable of accent translation, noise cancellation, speech enhancement, cross-language communication, and more.

Sanas makes conversations clearer, more inclusive, and more effective, removing barriers that prevent people from being understood, regardless of accent, background noise, or native language.

Sanas is currently one of the fastest-growing startups in Silicon Valley, growing from $16M to $50M ARR in 2025. The company's core business is profitable and is on track to end 2026 with >$120M ARR. Our team combines deep expertise in model innovation and systems engineering with a design-minded product engineering culture to build and ship cutting-edge AI models and experiences — entirely in-house.

Sanas is a 180-strong team, established in 2020. In this short span, we've successfully secured over $100 million in funding. Our innovation has been supported by the industry's leading investors, including Insight Partners, Google Ventures, Quadrille Capital, General Catalyst, Quiet Capital, and others. Our reputation is further solidified by collaborations with numerous Fortune 100 companies. With Sanas, you're not just adopting a product; you're investing in the future of communication.

If you’re looking to play a significant role in shaping the roadmap and driving technical direction, to ship big, challenging ideas without heavy process, and to leave your mark on an ambitious, generational mission to change how the world thinks about speech + AI, then Sanas is a well-suited place for you.

About the Role

Our models are only as good as the data that trains them. As a Staff Data Engineer, you'll own the infrastructure that takes raw audio — millions of hours across accents, languages, noise conditions, and recording environments — and turns it into clean, reproducible, training-ready data at scale. You'll work directly with AI research scientists and ML engineers to design systems that move fast without breaking the data quality guarantees our models depend on.

Job Description

Data pipeline & lakehouse architecture

  • Design and implement large-scale data pipelines that ingest, transform, validate, and serve high-quality audio and metadata for AI model training, evaluation, and product telemetry.
  • Own the lakehouse architecture — table format choices (Iceberg vs. Delta Lake), partitioning strategies, metadata management, and schema evolution — with a bias toward reproducibility and auditability.
  • Build and maintain batch and streaming pipelines using Spark, Flink, and orchestration tooling (Airflow or Dagster), with a clear-eyed view of when each is the right tool.
  • Extend and maintain feature store infrastructure to serve low-latency, versioned features for both training and real-time inference.
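To give a flavor of the reproducibility and auditability concerns these bullets describe, here is a minimal, framework-agnostic sketch of record validation and partition-key derivation for an audio-metadata table. All field names, allowed sample rates, and the partition layout are illustrative assumptions, not Sanas's actual schema:

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical schema for one audio-metadata record.
@dataclass(frozen=True)
class AudioRecord:
    clip_id: str
    language: str
    accent: str
    sample_rate_hz: int
    duration_s: float
    ingested_on: date

def validate(rec: AudioRecord) -> list[str]:
    """Return a list of data-quality violations (empty means clean)."""
    errors = []
    if rec.sample_rate_hz not in (8000, 16000, 44100, 48000):
        errors.append(f"unexpected sample rate: {rec.sample_rate_hz}")
    if not (0.1 <= rec.duration_s <= 3600):
        errors.append(f"duration out of range: {rec.duration_s}")
    if not rec.language:
        errors.append("missing language tag")
    return errors

def partition_key(rec: AudioRecord) -> str:
    """Derive a stable partition path, as an Iceberg- or Delta-style
    table layout might (language first, then ingestion date)."""
    return f"language={rec.language}/ingested_on={rec.ingested_on.isoformat()}"

rec = AudioRecord("clip-0001", "en", "en-IN", 16000, 12.5, date(2025, 1, 15))
assert validate(rec) == []
print(partition_key(rec))  # language=en/ingested_on=2025-01-15
```

In a real lakehouse this logic would live inside a Spark or Flink job and the partition spec inside the table format's metadata; the point of the sketch is only that validation rules and partitioning are explicit, versioned code rather than ad hoc conventions.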

Audio data at scale

  • Develop and maintain pipelines purpose-built for the unique challenges of audio data: large file volumes, time-series feature extraction, speaker and language metadata, and annotation versioning.
  • Build tooling that supports the full audio data lifecycle — from raw ingestion and quality filtering through augmentation, segmentation, and training split generation — with reproducibility guarantees at every stage.
  • Partner with ML engineers and research scientists to design data schemas, sampling strategies, and evaluation datasets that accurately reflect production conditions.
  • Own data pipelines that feed human-in-the-loop annotation workflows — ensuring clean round-trips between raw data, labeling platforms, and training-ready outputs.
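One common technique behind the "training split generation with reproducibility guarantees" bullet is deterministic, hash-based split assignment: a clip's split is a pure function of its ID, so splits stay stable as new data arrives and never depend on run order or random seeds. A minimal sketch (the 90/5/5 ratio and salt are illustrative assumptions):

```python
import hashlib

def split_for(clip_id: str, salt: str = "v1") -> str:
    """Deterministically assign a clip to train/val/test by hashing its ID.
    Because the assignment depends only on (salt, clip_id), a clip never
    migrates between splits across pipeline runs or machines; bumping the
    salt intentionally reshuffles every split at once."""
    h = int(hashlib.sha256(f"{salt}:{clip_id}".encode()).hexdigest(), 16)
    bucket = h % 100
    if bucket < 90:
        return "train"
    if bucket < 95:
        return "val"
    return "test"

sid = split_for("clip-0001")
assert sid == split_for("clip-0001")  # stable run-to-run
```

For speech data, production variants of this idea typically hash the speaker ID rather than the clip ID, so one speaker's clips never leak across the train/test boundary.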

Platform reliability & governance

  • Instrument pipelines with observability, data quality checks, lineage tracking, and alerting — so failures surface fast and root causes are traceable.
  • Drive build vs. buy decisions for data quality, observability, and cataloging tooling with a clear framework grounded in Sanas's scale and roadmap.
  • Own disaster recovery design for critical data assets — training datasets, evaluation benchmarks, and model checkpoints.
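The data quality checks and alerting described above often start from very simple invariants. As one hypothetical example (the threshold and check are illustrative, not a prescribed Sanas policy), a row-count drift check that compares a run's output against the last known-good baseline:

```python
def row_count_ok(current: int, baseline: int, max_drop_pct: float = 20.0) -> bool:
    """Flag a pipeline run whose output shrank sharply versus the last
    known-good run -- a cheap, high-signal check for silent upstream
    ingestion failures. Growth is always allowed; only drops beyond
    max_drop_pct fail the check."""
    if baseline == 0:
        return current == 0
    drop_pct = 100.0 * (baseline - current) / baseline
    return drop_pct <= max_drop_pct

assert row_count_ok(95, 100)        # small dip: fine
assert not row_count_ok(50, 100)    # 50% drop: alert
```

Checks like this would run as post-write assertions in the orchestrator (Airflow or Dagster), emitting lineage metadata and paging on failure rather than letting bad partitions reach training.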

Technical leadership

  • Set the technical bar for the data engineering team — review designs and code, establish patterns, and document decisions in a way that raises the floor for everyone.
  • Work cross-functionally with AI research, infrastructure, product, and legal to align data architecture with business needs and regulatory requirements.
  • Contribute to hiring — identify strong candidates, conduct technical interviews, and help define what great looks like for data engineering at Sanas.

Qualifications

  • 5+ years of experience in data engineering, ML infrastructure, or data platform roles.
  • Deep expertise building distributed batch and streaming data systems in production.
  • Strong command of data processing frameworks (Spark, Flink, Ray) and orchestrators (Airflow or Dagster).
  • Hands-on experience with cloud data platforms — Snowflake, Databricks, or ClickHouse — and object storage (S3, GCS) on AWS or GCP.
  • Solid understanding of data lifecycle management: privacy, security, compliance, and reproducibility from ingestion through model training.
  • Proven ability to work directly with ML researchers and engineers to translate model requirements into data infrastructure decisions.

Bonus

  • Direct experience with audio data pipelines — file handling at scale, time-series features, speaker metadata, or audio annotation tooling.
  • Familiarity with ASR, TTS, or speech enhancement model training workflows and the data requirements specific to each.
  • Experience with MLOps tooling — experiment tracking, dataset versioning (DVC, LakeFS), and training pipeline orchestration.

Science

Palo Alto, CA
