Research
I work on AI safety — how humans and weaker agents maintain autonomy and calibration when interacting with more capable systems. Specifically: evaluations of gradual disempowerment and cognitive offloading in agentic settings (what happens to human judgement when it becomes optional?), sycophancy and calibration failures in RLHF-tuned models, and capability-asymmetric multi-agent deliberation (scalable oversight in reverse).
Publications
Confident, Calibrated, or Complicit: Probing the Trade-offs between Safety Alignment and Ideological Bias in Language Models
A study of RLHF side-effects framed as an NLP benchmark. Evaluates six frontier LLMs on hate-speech classification: safety-aligned models reach higher mean accuracy (78.7% vs 64.1%) but exhibit ideological rigidity under persona attacks, systematic overconfidence as measured by Expected Calibration Error (ECE), and fairness disparities across protected groups.
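For readers unfamiliar with the metric, ECE bins predictions by confidence and takes the frequency-weighted gap between each bin's accuracy and its mean confidence. A minimal sketch (the binning scheme and toy numbers here are illustrative, not the paper's setup):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean |accuracy - confidence| over confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight bin by its share of samples
    return ece

# Toy overconfident classifier: near-certain predictions, imperfect accuracy.
conf = np.array([0.95, 0.90, 0.92, 0.88, 0.97, 0.91])
hit  = np.array([1, 0, 1, 0, 1, 1])
print(round(expected_calibration_error(conf, hit), 3))
```

An overconfident model scores a large ECE even when raw accuracy looks acceptable, which is exactly the failure mode the paper probes.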
Epistemic Capture in Multi-Agent LLM Deliberation
Scalable oversight in reverse: whether weaker agents can supervise a stronger one. Across 144 boardroom-style deliberations (GPT-5.4 vs GPT-4o-mini), groups defer to the stronger agent 100% of the time when it is wrong (0/8 recovery) despite diagnostic evidence surfacing in 92% of episodes. Structured round-robin evidence protocols recover 8.3 percentage points in asymmetric groups. Framed as an agent-to-agent analogue of gradual disempowerment.
Activation-Space Personality Steering: Hybrid Layer Selection for Stable Trait Control in LLMs
Proposes a pipeline for steering LLM personality along Big Five traits by extracting activation directions and identifying optimal injection layers. Achieves stable trait control while preserving fluency and general capabilities.
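The core mechanics of activation steering can be sketched in a few lines: extract a difference-of-means direction from activations on contrasting prompts, then add a scaled copy of it to a layer's hidden state at inference. Everything below is a toy illustration (random stand-in activations, arbitrary dimensions), not the paper's pipeline or its layer-selection method:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy hidden dimension

# Hypothetical cached activations at one layer for prompts exhibiting
# high vs low levels of a Big Five trait (from real LLM forward passes
# in practice; random placeholders here).
acts_high = rng.normal(0.5, 1.0, size=(100, d))
acts_low = rng.normal(-0.5, 1.0, size=(100, d))

# Difference-of-means steering direction, unit-normalised.
direction = acts_high.mean(axis=0) - acts_low.mean(axis=0)
direction /= np.linalg.norm(direction)

def steer(hidden_state, direction, alpha):
    """Inject the trait direction into a layer's hidden state."""
    return hidden_state + alpha * direction

h = rng.normal(size=d)
h_steered = steer(h, direction, alpha=3.0)

# The projection onto the trait axis shifts by exactly alpha.
print(round((h_steered - h) @ direction, 6))
```

The layer-selection question the paper addresses is where to apply `steer`: too early and the shift washes out, too late and fluency degrades.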
Do LLMs Use Cultural Knowledge Without Being Told? A Multilingual Evaluation of Implicit Pragmatic Adaptation
Evaluates whether LLMs adapt their pragmatic behaviour to implicit cultural cues. Across four deployed LLMs, five languages (English, German, Hindi, Nepali, Urdu), and 60 culturally grounded scenarios, models recover only ~20% of the pragmatic shift they produce under explicit instruction (PCS mean 0.196). A paired Hindi/Urdu natural control indicates that models respond primarily to linguistic structure rather than cultural associations; hedging density shows negative explicit gaps across all five languages, suggesting alignment training actively suppresses the target behaviour. Frames multilingual cultural pragmatics as an explicit-vs-implicit deployment problem, not a knowledge problem.
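The "~20% recovery" framing suggests a ratio of the behavioural shift produced by implicit cues to the shift produced by explicit instruction. A hypothetical formulation of such a score (the name `pragmatic_consistency` and this exact formula are my inference from the summary, not the paper's definition):

```python
def pragmatic_consistency(baseline, implicit, explicit, eps=1e-9):
    """Fraction of the explicit-instruction shift recovered from implicit
    cues alone: 1.0 means implicit cues fully reproduce the explicit shift,
    0.0 means no adaptation. Hypothetical reconstruction of the PCS idea."""
    return (implicit - baseline) / (explicit - baseline + eps)

# Toy numbers: some pragmatic feature rate under three prompt conditions.
print(round(pragmatic_consistency(baseline=0.1, implicit=0.2, explicit=0.6), 3))
```

Under this reading, a mean of 0.196 says models produce only a fifth of the culturally appropriate shift when the cue is implicit, even though they demonstrably can produce the full shift when told to.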
Thesis
Beyond Words: Harnessing Large Language Models for Detecting Implicit Hate Speech
Honours thesis (HD 80/100) on LLM-based detection of implicit hate speech. Subsequently rewritten and extended into the ACL 2026 paper above.
Current Work
Research assistant at the NASIM lab (UWA), building stochastic multi-agent simulations of social discourse and opinion propagation, and deploying a RAG pipeline for confidential research data. In parallel, inheriting a UWA Medical School collaboration on paediatric penicillin allergy de-labelling — extending a prior random-forest + SHAP baseline with emphasis on honest calibration and failure-mode reporting.