Senior Data Engineer, AI Platform

Kai is the AI company rebuilding cybersecurity for the machine-speed era. Founded by second-time founders and trusted by Fortune 500 enterprises, Kai is building a future where security has no categories, no silos, and no human-speed bottlenecks. The Kai Agentic AI Platform replaces fragmented, human-limited workflows with agentic AI systems that continuously contextualize, assess, reason, and execute security work at machine speed, making human defenders superhuman.


Why Join Kai

  • Well-funded: With $125M raised, we have the capital, runway, and resolve to rebuild cybersecurity from first principles.
  • Proven: We've earned the trust of Fortune 500 and Global 1000 companies, and we're just getting started. Their confidence in Kai reflects what we've built: an AI-powered cybersecurity platform that performs at the scale and speed the enterprise demands.
  • Experienced founders: Our founding team consists of second-time entrepreneurs, each with over 20 years of experience in the cybersecurity industry. Their proven expertise and vision drive our ambitious goals.
  • World-class leadership team: Our Heads of AI, Engineering, and Product bring extensive experience from some of the world’s most influential companies, ensuring top-tier mentorship, direction, and vision.
  • Frontier AI Applied Research Team: Our researchers operate at the leading edge of agentic AI systems, translating breakthrough capabilities into real-world cybersecurity applications.
  • Generous compensation: We offer highly competitive salaries, equity options, and a supportive work environment. Your contributions will be valued and rewarded as we grow together.


About the Role

We are looking for a Senior Data Engineer (AI Platform) to design and build scalable data systems that power next-generation AI and Generative AI applications.

This is a senior, hands-on technical role for someone who can operate across both classical data engineering and modern AI data infrastructure — including large-scale data pipelines, vector databases, and retrieval systems for LLM-powered applications.

You will work at the intersection of data engineering, AI infrastructure, and LLM systems, enabling high-quality data flow, retrieval, and storage for production-grade intelligence systems.

Key Responsibilities

  • Design and build scalable data pipelines for batch and real-time processing
  • Develop and maintain data infrastructure supporting AI/ML and Generative AI systems
  • Build and optimize retrieval pipelines for RAG and LLM-based applications
  • Design and manage vector data pipelines (embedding generation, indexing, storage, retrieval)
  • Implement hybrid retrieval systems (BM25 + vector search)
  • Work closely with AI/ML teams to enable training, evaluation, and inference workflows
  • Develop data models and storage systems optimized for large-scale AI applications
  • Ensure data quality, consistency, and reliability across pipelines
  • Optimize systems for performance, latency, scalability, and cost
  • Collaborate with product, engineering, and AI teams to translate requirements into data solutions

Required Qualifications

  • 4+ years of experience in Data Engineering or related fields
  • Strong experience building large-scale distributed data pipelines
  • Proficiency in Python and SQL; experience with Spark or similar frameworks
  • Experience with both batch and streaming systems (e.g., Kafka, Flink, Spark Streaming)
  • Experience working with cloud data platforms (AWS, GCP, Azure)
  • Solid understanding of data modeling, storage systems, and distributed systems
  • Experience supporting AI/ML workloads through data infrastructure
  • Strong ownership mindset and ability to operate in fast-paced environments

Preferred Qualifications

  • Experience working with LLM-powered systems and RAG pipelines
  • Familiarity with vector databases and ANN search systems
  • Experience in data systems for AI platforms or ML infrastructure
  • Background in search, recommendation systems, or information retrieval

Core Technical Expertise

Data Engineering & Pipelines

  • Batch and streaming pipelines (Spark, Flink, Kafka)
  • ETL/ELT design, data modeling, and data warehousing
  • Data quality, validation, and observability

AI Data Infrastructure

  • Data pipelines for ML training and inference
  • Feature stores and dataset versioning
  • Data preparation for LLM and GenAI systems

Vector Databases & Retrieval Systems

  • Milvus, Pinecone, Databricks Vector Search, FAISS
  • ANN algorithms (HNSW, IVF, PQ)
  • Hybrid retrieval (BM25 + vector search)
  • Embedding pipelines (text, code, image)

RAG & LLM Data Systems

  • Retrieval pipelines for LLM applications
  • Context construction and ranking
  • Data indexing and chunking strategies

Storage & Distributed Systems

  • Data lakes (S3, GCS, ADLS), Parquet, Delta Lake, Iceberg
  • Distributed systems design and scalability
  • Caching and low-latency data access

Platforms & Infrastructure

  • AWS, GCP, Azure
  • Databricks, BigQuery, Snowflake
  • Kubernetes, Ray (nice to have)

Performance & Optimization

  • Query optimization and indexing strategies
  • Cost optimization for large-scale data systems
  • Latency optimization for real-time retrieval

 

AI Research

San Jose, CA
