Senior QA Engineer

Who we are

Third Way Health helps medical practices across the United States improve the patient experience while reducing the administrative burden on practice owners. We enable practices to enhance their patients' experience by providing a leading service automation platform and a world-class team of service representatives. What unites us is our passion for supporting physicians and providing better access for patients from all backgrounds.


About the position

We’re seeking a Senior QA / Eval Engineer to own and evolve the quality and evaluation infrastructure behind our AI-powered patient engagement platform. Our eval system is multi-factorial, combining deterministic checks, LLM-based grading, and human expert review, and it runs against every voice interaction we handle. You’ll be responsible for the verification and evaluation infrastructure that determines whether each interaction meets multi-faceted quality criteria.

This is a high-impact individual contributor role. You won’t just write test cases — you’ll shape how we define “quality” across our automated and manual workflows and build the tooling that makes that definition measurable and actionable.

Responsibilities

  • Own and extend our multi-layered eval pipeline and verification portfolio: deterministic quality checks on tool calls, risk-factor heuristics, and LLM-graded transcript evaluation.
  • Advance our capabilities to evaluate end-to-end system performance (across orchestrated agents, RAG-supported responses, multi-party voice conversations) with modular and auditable verification that is independent of any single model provider.
  • Drive improvements to our observability stack to surface eval metrics, detect regressions, and enable data-driven quality decisions across the team.
  • Build real-time monitoring and verification loops that catch issues in production interactions as they happen, intervening with context and feeding findings back for system refinement.
  • Partner with ML engineers, product managers, and operations leads to translate real-world failure modes into automated checks, closing the loop between production incidents and eval coverage.
  • Build and maintain adversarial and edge-case test suites — including prompt injection resistance, guardrail robustness, and graceful degradation under ambiguous patient inputs.
  • Champion “shift-left” quality practices: embed eval criteria into prompt engineering workflows, define acceptance criteria for new agent behaviors, and make quality a first-class concern in the development cycle.
  • Contribute to the design of our QA pipeline orchestration (background processing, Slack notifications, risk assessment persistence) to improve throughput, reliability, and developer experience.
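To make the check-registry pattern behind the deterministic eval layer concrete, here is a minimal sketch of a quality check on tool calls. All names and the interaction schema are illustrative, not Third Way Health's actual codebase:

```python
from typing import Callable

# Registry mapping check names to functions that inspect one interaction
# record and return human-readable failure messages (empty list = pass).
CHECKS: dict[str, Callable[[dict], list[str]]] = {}

def check(name: str):
    """Decorator that registers a deterministic quality check."""
    def wrap(fn: Callable[[dict], list[str]]):
        CHECKS[name] = fn
        return fn
    return wrap

@check("tool_call_args_present")
def tool_call_args_present(interaction: dict) -> list[str]:
    # Every tool call in the interaction must carry non-empty arguments.
    return [
        f"tool call '{call['name']}' is missing arguments"
        for call in interaction.get("tool_calls", [])
        if not call.get("arguments")
    ]

def run_checks(interaction: dict) -> dict[str, list[str]]:
    """Run every registered check against one interaction."""
    return {name: fn(interaction) for name, fn in CHECKS.items()}
```

New checks register themselves via the decorator, so eval coverage can grow with the feature surface without touching the pipeline's core loop.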


Required skills and qualifications

  • 5+ years of software engineering or test engineering experience, with 3+ years focused on quality infrastructure for AI/ML or data-intensive systems.
  • Strong proficiency in Python, particularly for building test frameworks, eval pipelines, and API-level integration tests (e.g., pytest, FastAPI TestClient, Pydantic).
  • Demonstrated experience designing evaluation or verification systems for LLM-based applications — with a clear understanding that the model is a generation layer, not the quality layer. Comfort with both deterministic and model-graded assessment approaches, and a point of view on when each is appropriate.
  • Familiarity with the architectural tradeoffs of relying on LLM outputs in production — including variance across model versions, prompt sensitivity, and the need for external verification infrastructure that remains stable as underlying models change.
  • Experience building extensible, rule-based validation systems (check registries, plugin architectures, or similar patterns) that scale across a growing surface area of features.
  • Solid understanding of voice AI or conversational AI systems, including tool-calling patterns, transcript analysis, and interaction-level quality metrics.
  • Hands-on experience with observability and metrics instrumentation in production environments.
  • Excellent communication skills, with the ability to collaborate effectively across engineering, product, and non-technical stakeholders.
  • Strong interest in healthcare innovation and building AI systems that meaningfully improve health outcomes.
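As an illustration of the deterministic side of the assessment toolkit named above, a pytest-style sketch of a transcript-level rule. The rule and markers are hypothetical examples, not our production checks:

```python
# Hypothetical deterministic rule: a redacted transcript must not
# contain raw PHI markers. The markers below are illustrative only.
PHI_MARKERS = ("ssn:", "dob:")

def redaction_passed(transcript: str) -> bool:
    """True when no PHI marker survives in the transcript text."""
    lowered = transcript.lower()
    return not any(marker in lowered for marker in PHI_MARKERS)

# pytest-style assertions: deterministic rules give crisp pass/fail
# results that stay stable as underlying model versions change.
def test_redacted_transcript_passes():
    assert redaction_passed("Patient confirmed the Tuesday appointment.")

def test_unredacted_transcript_fails():
    assert not redaction_passed("Patient stated SSN: 123-45-6789")
```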


Desired skills and qualifications

  • Experience building QA or eval systems in healthcare or other regulated environments, including familiarity with standards such as HIPAA, GDPR, or FDA guidance.
  • Proven experience leading complex technical initiatives and mentoring junior engineers.
  • Experience building or operating systems where quality guarantees live in the verification infrastructure rather than in any single model.
  • Familiarity with risk-scoring systems, anomaly detection, or production safety nets for autonomous AI agents.
  • Experience with AI safety testing, including adversarial evaluation, jailbreak testing, and bias detection in LLM outputs.
  • Hands-on experience with CI/CD pipelines for eval automation (CircleCI, GitHub Actions, or equivalent) and infrastructure-as-code deployment patterns.
  • Experience with voice UI testing tools and platforms, with a focus on evaluating speech generation and response quality.
  • Knowledge of accessibility testing and inclusive design principles.

Product

Cambridge, MA
