Research

I work on AI safety — how humans and weaker agents maintain autonomy and calibration when interacting with more capable systems. Specifically: evaluations of gradual disempowerment and cognitive offloading in agentic settings (what happens to human judgement when it becomes optional?), sycophancy and calibration failures in RLHF-tuned models, and capability-asymmetric multi-agent deliberation (scalable oversight in reverse).

Publications

Confident, Calibrated, or Complicit: Probing the Trade-offs between Safety Alignment and Ideological Bias in Language Models

S. Selvaganapathy, M. Nasim. ACL 2026 (Main Track).

A study of RLHF side-effects framed as an NLP benchmark. Evaluates six frontier LLMs on hate-speech classification: safety-aligned models reach higher mean accuracy (78.7% vs 64.1%) but exhibit ideological rigidity under persona attacks, systematic overconfidence as measured by Expected Calibration Error (ECE), and fairness disparities across protected groups.
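
The overconfidence finding is stated in terms of binned ECE; a minimal sketch of that metric for readers who haven't met it (illustrative only, not the paper's evaluation harness):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean gap between stated confidence and empirical accuracy.

    confidences: model probability for its predicted label, shape (N,)
    correct:     1 if that prediction was right, else 0, shape (N,)
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight each bin by its share of samples
    return ece

# Overconfidence shows up as confidence exceeding accuracy in most bins.
```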

Epistemic Capture in Multi-Agent LLM Deliberation

S. Selvaganapathy. Manuscript in preparation (2026).

Scalable oversight in reverse: whether weaker agents can supervise a stronger one. Across 144 boardroom-style deliberations (GPT-5.4 vs GPT-4o-mini), groups defer to the stronger agent 100% of the time when it is wrong (0/8 recovery), despite diagnostic evidence surfacing in 92% of episodes. Structured round-robin evidence protocols recover 8.3 percentage points in asymmetric groups. Framed as an agent-to-agent analogue of gradual disempowerment.
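
The round-robin intervention is essentially a control-flow change to the deliberation loop; a schematic sketch (agents, prompts, and ordering here are placeholders, not the experimental setup):

```python
from typing import Callable, List

# Schematic: an agent maps the running transcript to its next message.
Agent = Callable[[str], str]

def free_form(agents: List[Agent], question: str, rounds: int = 3) -> str:
    """Baseline deliberation: any agent may assert a verdict at any turn.
    In practice the strongest agent's early, confident answer anchors the group."""
    transcript = question
    for _ in range(rounds):
        for agent in agents:
            transcript += "\n" + agent(transcript)
    return transcript

def round_robin_evidence(agents: List[Agent], question: str, rounds: int = 3) -> str:
    """Intervention: a dedicated evidence-only pass precedes any verdicts, so every
    agent's diagnostic observations are on the record before deference can set in."""
    transcript = question + "\n[Evidence round: state observations only, no verdicts.]"
    for agent in agents:
        transcript += "\n" + agent(transcript)
    transcript += "\n[Verdict round: you may now commit to an answer.]"
    for _ in range(rounds):
        for agent in agents:
            transcript += "\n" + agent(transcript)
    return transcript
```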

Activation-Space Personality Steering: Hybrid Layer Selection for Stable Trait Control in LLMs

P. Bhandari, N. Fay, S. Selvaganapathy, A. Datta, U. Naseem, M. Nasim. EACL 2026 (Main Track).

Proposes a pipeline for steering LLM personality along Big Five traits by extracting activation directions and identifying optimal injection layers. Achieves stable trait control while preserving fluency and general capabilities.
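
Mechanically, the steering step amounts to adding a trait direction to a chosen layer's activations at inference time; a minimal sketch using a PyTorch forward hook (the layer index, scale, and module path are illustrative assumptions, not the paper's pipeline):

```python
import torch

def add_steering_hook(model, layer_idx, direction, scale=4.0):
    """Register a forward hook that adds a fixed trait direction to one layer's output.

    direction: a hidden-size vector, e.g. the mean activation difference between
    high-trait and low-trait prompts (assumed extracted beforehand); layer_idx and
    scale would come from the layer-selection step, the values here are placeholders.
    """
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * direction.to(hidden.device, hidden.dtype)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    layer = model.model.layers[layer_idx]  # module path varies across architectures
    return layer.register_forward_hook(hook)

# handle = add_steering_hook(model, layer_idx=14, direction=agreeableness_dir)
# ... generate text as usual, now nudged along the trait ...
# handle.remove()  # detach the hook to restore the unsteered model
```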

Do LLMs Use Cultural Knowledge Without Being Told? A Multilingual Evaluation of Implicit Pragmatic Adaptation

Authors anonymized. Under review, ARR January 2026.

Evaluates whether LLMs adapt their pragmatic behaviour to implicit cultural cues. Across four deployed LLMs, five languages (English, German, Hindi, Nepali, Urdu), and 60 culturally grounded scenarios, models recover only ~20% of the pragmatic shift they produce under explicit instruction (PCS mean 0.196). A paired Hindi/Urdu natural control indicates that models respond primarily to linguistic structure rather than cultural associations; hedging density shows negative explicit gaps across all five languages, suggesting alignment training actively suppresses the target behaviour. Frames multilingual cultural pragmatics as an explicit-vs-implicit deployment problem, not a knowledge problem.
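
The headline number is a ratio of implicit to explicit behaviour change; a simplified reading of that quantity (the paper defines PCS precisely; this sketch, with a hypothetical feature measurement, only conveys the intuition):

```python
def implicit_shift_ratio(baseline, implicit, explicit, eps=1e-9):
    """Fraction of the explicitly-elicited pragmatic shift recovered implicitly.

    baseline: a pragmatic feature (e.g. hedging density) with no cultural cue
    implicit: the same feature when the cue is only implicit (language, names, setting)
    explicit: the same feature under an explicit instruction to adapt
    Simplified reading of the gap, not the paper's exact PCS definition.
    """
    return (implicit - baseline) / ((explicit - baseline) + eps)
```

In this reading, a value near 0.2 corresponds to recovering roughly a fifth of the explicitly elicited shift.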

Thesis

Beyond Words: Harnessing Large Language Models for Detecting Implicit Hate Speech

S. Selvaganapathy. Honours Thesis, UWA, 2024.

Honours thesis (HD 80/100) on LLM-based detection of implicit hate speech. Subsequently rewritten and extended into the ACL 2026 paper above.

Current Work

Research assistant at the NASIM lab (UWA), building stochastic multi-agent simulations of social discourse and opinion propagation, and deploying a RAG pipeline for confidential research data. In parallel, inheriting a UWA Medical School collaboration on paediatric penicillin allergy de-labelling — extending a prior random-forest + SHAP baseline with emphasis on honest calibration and failure-mode reporting.
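
For the de-labelling work, "honest calibration" concretely means reporting reliability alongside discrimination rather than accuracy alone; a minimal sketch with scikit-learn (the model settings, split, and calibration method are placeholders, not the clinical protocol):

```python
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split

def fit_and_report(X, y, seed=0):
    """Fit an RF baseline, calibrate it, and report discrimination and calibration together."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    rf = RandomForestClassifier(n_estimators=500, random_state=seed)
    # Isotonic calibration over internal CV folds: a common choice, not necessarily
    # the one this collaboration will settle on.
    clf = CalibratedClassifierCV(rf, method="isotonic", cv=5).fit(X_tr, y_tr)
    p = clf.predict_proba(X_te)[:, 1]
    print(f"AUROC: {roc_auc_score(y_te, p):.3f}")     # discrimination
    print(f"Brier: {brier_score_loss(y_te, p):.3f}")  # overall calibration error
    prob_true, prob_pred = calibration_curve(y_te, p, n_bins=10)
    for pred, obs in zip(prob_pred, prob_true):       # reliability table
        print(f"predicted {pred:.2f} -> observed {obs:.2f}")
    return clf
```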