
Combining theoretical rigour with empirical investigation to understand how AI models reason, solve complex problems, and collaborate with humans.
We study the science of LLM evaluation through systematic reviews, benchmark analysis, and statistical modelling. We develop new benchmarks that probe the limits of LLM reasoning, especially in adversarial, interactive, and low-resource language settings.
We build agentic systems that automate and augment key stages of the scientific process: literature discovery, evidence synthesis, hypothesis generation, and decision support. We design these agents to be reliable, transparent, and grounded in domain expertise.
From bias and toxicity to misalignment in agentic systems: we investigate the harms advanced AI may pose to individuals and society, and develop technical mitigation methods alongside research on AI governance.
Large-scale empirical studies of how people use and respond to AI systems in real-world decision-making contexts.
llm-evaluation · benchmarking · ai-safety · agentic-ai · human-ai-interaction · reasoning · nlp · alignment · bias · governance · low-resource-nlp · scientific-discovery