Description
About Wizard
Wizard is the top-performing AI Shopping Agent, delivering the best products from across the web with unmatched accuracy, quality, and trust.
The Role
We’re looking for an Applied Scientist to own how we measure, understand, and improve the accuracy of our AI agent. This role sits at the intersection of applied ML, evaluation science, and product.
You’ll define what “good” looks like for our agent, build the systems to measure it, and lead the science work to improve it, including fine-tuning the LLM judges that power our evaluation pipeline. You’ll partner with ML Engineering and AI Engineering.
What you will do is bring scientific rigor to the most important question at Wizard: is our agent getting better, and how do we know? This is a foundational hire on our science team. Evaluation is the starting point, and the role is scoped to grow into broader applied science work as the surface area of the agent expands (recommendations, personalization, ranking, multimodal, conversational understanding).
What You’ll Do
Define and evolve accuracy metrics across the full shopping experience (retrieval, ranking, recommendations, outcomes)
Design and run experiments to measure improvements and regressions
Build and maintain evaluation datasets, benchmarks, and scoring frameworks
Improve the LLM judges that power our evaluation pipeline: prompting, calibration, and fine-tuning where it matters
Translate ambiguous product questions into clear, measurable hypotheses and analysis
Partner with ML Engineers to validate model changes and guide iteration
Identify failure modes and edge cases, and drive im