Senior Data Engineer, AI Platform

Kai is the AI company rebuilding cybersecurity for the machine-speed era. Founded by second-time founders and trusted by Fortune 500 enterprises, Kai is building a future where security has no categories, no silos, and no human-speed bottlenecks. The Kai Agentic AI Platform replaces fragmented, human-limited workflows with agentic AI systems that continuously contextualize, assess, reason, and execute security work at machine speed, making human defenders superhuman.


Why Join Kai

  • Well-funded: With $125M raised, we have the capital, runway, and resolve to rebuild cybersecurity from first principles.
  • Proven: We've earned the trust of Fortune 500 and Global 1000 companies, and we're just getting started. Their confidence in Kai reflects what we've built: an AI-powered cybersecurity platform that performs at the scale and speed the enterprise demands.
  • Experienced founders: Our founding team consists of second-time entrepreneurs, each with over 20 years of experience in the cybersecurity industry. Their proven expertise and vision drive our ambitious goals.
  • World-class leadership team: Our Heads of AI, Engineering, and Product bring extensive experience from some of the world’s most influential companies, ensuring top-tier mentorship, direction, and vision.
  • Frontier AI Applied Research Team: Our researchers operate at the leading edge of agentic AI systems, translating breakthrough capabilities into real-world cybersecurity applications.
  • Generous compensation: We offer highly competitive salaries, equity options, and a supportive work environment. Your contributions will be valued and rewarded as we grow together.


About the Role

We are looking for a Senior Data Engineer (AI Platform) to design and build scalable data systems that power next-generation AI and Generative AI applications.

This is a senior, hands-on technical role for someone who can operate across both classical data engineering and modern AI data infrastructure — including large-scale data pipelines, vector databases, and retrieval systems for LLM-powered applications.

You will work at the intersection of data engineering, AI infrastructure, and LLM systems, enabling high-quality data flow, retrieval, and storage for production-grade intelligence systems.

Key Responsibilities

  • Design and build scalable data pipelines for batch and real-time processing
  • Develop and maintain data infrastructure supporting AI/ML and Generative AI systems
  • Build and optimize retrieval pipelines for RAG and LLM-based applications
  • Design and manage vector data pipelines (embedding generation, indexing, storage, retrieval)
  • Implement hybrid retrieval systems (BM25 + vector search)
  • Work closely with AI/ML teams to enable training, evaluation, and inference workflows
  • Develop data models and storage systems optimized for large-scale AI applications
  • Ensure data quality, consistency, and reliability across pipelines
  • Optimize systems for performance, latency, scalability, and cost
  • Collaborate with product, engineering, and AI teams to translate requirements into data solutions

Required Qualifications

  • 4+ years of experience in Data Engineering or related fields
  • Strong experience building large-scale distributed data pipelines
  • Proficiency in Python and SQL; experience with Spark or similar frameworks
  • Experience with both batch and streaming systems (e.g., Kafka, Flink, Spark Streaming)
  • Experience working with cloud data platforms (AWS, GCP, Azure)
  • Solid understanding of data modeling, storage systems, and distributed systems
  • Experience supporting AI/ML workloads through data infrastructure
  • Strong ownership mindset and ability to operate in fast-paced environments

Preferred Qualifications

  • Experience working with LLM-powered systems and RAG pipelines
  • Familiarity with vector databases and ANN search systems
  • Experience in data systems for AI platforms or ML infrastructure
  • Background in search, recommendation systems, or information retrieval

Core Technical Expertise

Data Engineering & Pipelines

  • Batch and streaming pipelines (Spark, Flink, Kafka)
  • ETL/ELT design, data modeling, and data warehousing
  • Data quality, validation, and observability

AI Data Infrastructure

  • Data pipelines for ML training and inference
  • Feature stores and dataset versioning
  • Data preparation for LLM and GenAI systems

Vector Databases & Retrieval Systems

  • Milvus, Pinecone, Databricks Vector Search, FAISS
  • ANN algorithms (HNSW, IVF, PQ)
  • Hybrid retrieval (BM25 + vector search)
  • Embedding pipelines (text, code, image)

RAG & LLM Data Systems

  • Retrieval pipelines for LLM applications
  • Context construction and ranking
  • Data indexing and chunking strategies

Storage & Distributed Systems

  • Data lakes (S3, GCS, ADLS), Parquet, Delta Lake, Iceberg
  • Distributed systems design and scalability
  • Caching and low-latency data access

Platforms & Infrastructure

  • AWS, GCP, Azure
  • Databricks, BigQuery, Snowflake
  • Kubernetes, Ray (nice to have)

Performance & Optimization

  • Query optimization and indexing strategies
  • Cost optimization for large-scale data systems
  • Latency optimization for real-time retrieval

 

AI Research

San Jose, CA
