Job Description

Data Scientist
Design and implement end-to-end evaluation frameworks to assess performance, reliability, and safety of multi-agent AI systems
Lead experimentation and A/B testing efforts to systematically test hypotheses, validate model improvements, and track performance across agent iterations
Curate and maintain high-quality ground truth datasets to enable accurate, reproducible evaluation of multi-agent outputs
Identify and address reliability and accuracy gaps across agent workflows, failure modes, and edge cases in production-like environments
Stay current on emerging research in agentic AI, LLM evaluation, and multi-agent coordination to continuously improve framework design
Technical Skills
Proficiency in Python and ML frameworks
Hands-on experience with LLM APIs and agentic frameworks (LangChain, LlamaIndex, Semantic Kernel)
Familiarity with evaluation tooling (Ragas, DeepEval, L...
            
