AI Evaluation Specialist - QA

Kore.ai is a globally recognized leader in the conversational and generative AI space, helping enterprises deliver extraordinary experiences for their customers, employees, and contact center agents. Kore.ai’s goal is to empower businesses with effective, simple, and responsible AI solutions that create engaging interactions across sectors, serving over 100 million consumers and 500,000+ employees worldwide. With billions of interactions automated using our AI-powered technology, we have saved over $500M for these companies.


Kore.ai is one of the fastest-growing AI companies globally. We are recognized as a leader by leading technology and industry analysts, including Gartner, Forrester, IDC, ISG, Everest, and others.


Founded in 2014 by successful serial entrepreneur Raj Koneru, Kore.ai supports customers globally across offices in Orlando, Hyderabad, New York, London, Frankfurt, Dubai, Tokyo, and Seoul.


We’re reshaping the way companies harness the power of AI, simplifying and enhancing accessibility. Work alongside some of the brightest minds in the industry to pioneer safe, reliable solutions. Join the Kore.ai team and help companies of all sizes simplify the adoption of advanced AI solutions responsibly.

JOB SUMMARY:

We are seeking a Senior AI Evaluation Specialist to design and execute robust evaluation methodologies for Generative and Agentic AI systems. This role bridges AI product quality, evaluation science, and responsible AI governance — ensuring every AI feature, agent, and model release is measured, benchmarked, and validated using standardized frameworks.

 

The ideal candidate combines a QA mindset, ML evaluation rigour, and hands-on coding expertise to benchmark LLMs, multi-agent workflows, and GenAI APIs, driving consistent, measurable, and safe AI product performance.

 

RESPONSIBILITIES:

1. AI Evaluation & Benchmarking

Build and maintain end-to-end evaluation pipelines for Generative and Agentic AI features (e.g., chat, reasoning agents, RAG workflows, summarization, classification).

Implement standardized evaluation frameworks such as RAGAS, G-Eval, HELM, PromptBench, MT-Bench, or custom evaluation harnesses.

Define and measure core AI quality metrics — accuracy, groundedness, coherence, contextual recall, hallucination rate, and response time.

Create reproducible benchmarks, leaderboards, and regression tracking for models and agents across multiple releases or providers (OpenAI, Anthropic, Mistral, etc.).
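As an illustrative sketch only (the dataset, model stub, and metric choices here are hypothetical, not Kore.ai's actual harness), an end-to-end evaluation pipeline at its simplest loops over a benchmark set, calls a model, and aggregates quality and latency metrics:

```python
import time

# Hypothetical benchmark: (prompt, expected answer) pairs.
BENCHMARK = [
    ("What is 2 + 2?", "4"),
    ("Capital of France?", "Paris"),
]

def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call (OpenAI, Anthropic, Mistral, etc.)."""
    return {"What is 2 + 2?": "4", "Capital of France?": "Paris"}[prompt]

def run_benchmark(model, benchmark):
    """Score a model on exact-match accuracy and mean response time."""
    correct, latencies = 0, []
    for prompt, expected in benchmark:
        start = time.perf_counter()
        answer = model(prompt)
        latencies.append(time.perf_counter() - start)
        correct += int(answer.strip() == expected)
    return {
        "accuracy": correct / len(benchmark),
        "mean_latency_s": sum(latencies) / len(latencies),
    }

report = run_benchmark(fake_model, BENCHMARK)
```

Real pipelines would swap the stub for provider SDK calls and exact match for richer scorers (groundedness, coherence, hallucination rate), but the harness shape stays the same, which is what makes results reproducible across releases and providers.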

 

2. Agentic AI Evaluation

Evaluate multi-agent systems and autonomous AI workflows, measuring task success rates, reasoning trace quality, and tool-use efficiency.

Assess Agentic AI behaviors such as planning accuracy, goal completion rate, context handoff success, and inter-agent communication reliability.

Validate decision-making transparency and error recovery mechanisms in autonomous agent frameworks (LangGraph, AutoGen, CrewAI, etc.).

Design agent-specific evaluation scenarios — simulated environments, user-in-the-loop testing, and “mission-based” performance scoring.
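A minimal sketch of "mission-based" scoring, under assumed conventions (the trace format, step names, and efficiency formula are illustrative, not a specific framework's API): score an agent run on whether the goal completed and how close the tool-call count came to an optimal plan:

```python
# Hypothetical agent trace: ordered tool calls plus a final status.
trace = {
    "goal": "book_flight",
    "steps": ["search_flights", "select_flight", "search_flights", "pay"],
    "completed": True,
}

def score_trace(trace, optimal_steps):
    """Mission-based scoring: goal completion plus tool-use efficiency,
    computed as optimal step count divided by steps actually taken."""
    efficiency = optimal_steps / max(len(trace["steps"]), optimal_steps)
    return {
        "goal_completed": trace["completed"],
        "tool_use_efficiency": round(efficiency, 2),
    }

# Here 3 of the 4 recorded steps were necessary (one redundant search).
score = score_trace(trace, optimal_steps=3)
```

The same pattern extends to planning accuracy and context-handoff checks by scoring additional fields of the trace.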

 

3. Experimentation & Automation

Develop Python-based evaluation scripts to automate testing using OpenAI, Anthropic, and Hugging Face APIs.

Conduct large-scale comparative studies across prompts, models, and fine-tuned variants, analyzing quantitative and qualitative differences.

Integrate evaluations into CI/CD pipelines to enable continuous AI quality monitoring.

Visualize results using dashboards (Plotly, Streamlit, Dash, or Grafana).
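One common way to wire evaluations into CI/CD is a regression gate, sketched below with made-up baseline numbers and metric names: the build fails whenever a metric drops more than a tolerance below its stored baseline.

```python
# Hypothetical baseline metrics stored from the last accepted release.
BASELINE = {"accuracy": 0.92, "groundedness": 0.88}

def regression_gate(current, baseline, tolerance=0.02):
    """Return the list of metrics that regressed beyond tolerance.
    An empty list means the candidate release passes the gate."""
    failures = []
    for metric, base in baseline.items():
        if current.get(metric, 0.0) < base - tolerance:
            failures.append(metric)
    return failures

# accuracy improved, but groundedness fell below 0.88 - 0.02 = 0.86.
failed = regression_gate({"accuracy": 0.93, "groundedness": 0.85}, BASELINE)
```

In a pipeline, a non-empty failure list would simply exit non-zero, blocking the release until the regression is investigated.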

 

4. Quality Governance & Reporting

Define and enforce AI acceptance thresholds before deployment.

Collaborate with Responsible AI teams to evaluate bias, fairness, safety, and privacy implications.

Produce detailed evaluation reports and audit logs for model releases and governance boards.

Present findings to Product, Data Science, and Executive stakeholders — transforming metrics into actionable insights.

 

5. Collaboration & Continuous Improvement

Work closely with Prompt Engineers, ML Scientists, and QA Engineers to close the loop between testing and improvement.

Support Product teams in defining evaluation-driven release criteria.

Mentor junior evaluators in AI testing methodologies, benchmarking, and analysis.

Keep abreast of advances in LLM evaluation research, Agentic AI frameworks, and tool-calling reliability testing.

 

MUST HAVE SKILLS:

Programming: Python (Pandas, NumPy, LangChain, LangGraph, OpenAI/Anthropic SDKs)

Evaluation Frameworks: RAGAS, HELM, G-Eval, MT-Bench, PromptBench, custom scoring pipelines

GenAI APIs: OpenAI GPT-4/5, Claude, Gemini, Mistral, Azure OpenAI

Agentic AI: Understanding of multi-agent orchestration, tool use, reasoning traces, and planning frameworks (AutoGen, CrewAI, LangGraph)

Metrics Knowledge: BLEU, ROUGE, cosine similarity, factuality, coherence, bias, toxicity, reasoning success rate

Data & Analytics: JSON parsing, prompt dataset curation, result visualization

Tooling: Git, Jupyter/Colab, Jira, Confluence, evaluation dashboards

Soft Skills: Analytical communication, documentation excellence, cross-team collaboration
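Of the metrics listed above, cosine similarity is the most self-contained to sketch; the vectors below are toy values standing in for real embedding outputs:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors:
    dot product divided by the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy 3-dimensional "embeddings"; real ones come from an embedding model.
sim = cosine_similarity([1.0, 0.0, 1.0], [1.0, 1.0, 0.0])
```

In evaluation work this is typically applied to embedding vectors of a model answer and a reference answer to approximate semantic agreement.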

 

 

EXPERIENCE REQUIRED:

·       5 to 10 years of total experience, with 3+ years in AI evaluation, GenAI QA, or LLM quality analysis.

·       Strong understanding of AI/ML model lifecycle, prompt engineering, and RAG or agentic architectures.

·       Experience contributing to AI safety, reliability, or responsible AI initiatives.

 

EDUCATION QUALIFICATION

·       Bachelor’s degree in Computer Science Engineering / Information Technology, or a Master’s in Computer Applications

 

Why Join Us?

At Kore.ai, you won't be maintaining quality for conventional software—you'll be defining what quality means for an entirely new category of platform technology that enables enterprise-scale agentic applications. Your work will directly influence how the world's leading organizations build, deploy, and trust AI systems, establishing standards that could transform the industry.

Join us in building not just a better platform, but the frameworks that ensure enterprise agentic applications deliver on their transformative promise safely, effectively, and responsibly at scale.

Product & Engineering

Hyderabad, India
