Research
I work on AI safety — how humans and weaker agents maintain autonomy and calibration when interacting with more capable systems. Specifically: evaluations of gradual disempowerment and cognitive offloading in agentic settings (what happens to human judgement when it becomes optional?), sycophancy and calibration failures in RLHF-tuned models, and capability-asymmetric multi-agent deliberation (scalable oversight in reverse).
Publications
Confident, Calibrated, or Complicit: Probing the Trade-offs between Safety Alignment and Ideological Bias in Language Models
A study of RLHF side-effects framed as an NLP benchmark. Evaluates six frontier LLMs on hate-speech classification: safety-aligned models reach higher mean accuracy (78.7% vs 64.1%) but exhibit ideological rigidity under persona attacks, systematic overconfidence as measured by Expected Calibration Error (ECE), and fairness disparities across protected groups.
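For readers unfamiliar with the metric, ECE bins predictions by confidence and takes the frequency-weighted gap between each bin's accuracy and its mean confidence. A minimal sketch (the binning scheme and toy numbers here are illustrative, not the paper's setup):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean |accuracy - confidence| over confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight bin by its share of samples
    return ece

# Toy overconfident classifier: near-certain predictions, imperfect accuracy.
conf = np.array([0.95, 0.90, 0.92, 0.88, 0.97, 0.91])
hit  = np.array([1, 0, 1, 0, 1, 1])
print(round(expected_calibration_error(conf, hit), 3))
```

An overconfident model scores a large ECE even when raw accuracy looks acceptable, which is exactly the failure mode the paper probes.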
Epistemic Capture in Multi-Agent LLM Deliberation
Scalable oversight in reverse: whether weaker agents can supervise a stronger one. Across 144 boardroom-style deliberations (GPT-5.4 vs GPT-4o-mini), groups defer to the stronger agent 100% of the time when it is wrong (0/8 recovery) despite diagnostic evidence surfacing in 92% of episodes. Structured round-robin evidence protocols recover 8.3 percentage points in asymmetric groups. Framed as an agent-to-agent analogue of gradual disempowerment.
Activation-Space Personality Steering: Hybrid Layer Selection for Stable Trait Control in LLMs
Proposes a pipeline for steering LLM personality along Big Five traits by extracting activation directions and identifying optimal injection layers. Achieves stable trait control while preserving fluency and general capabilities.
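The core mechanics of activation steering can be sketched in a few lines: extract a difference-of-means direction from activations on contrasting prompts, then add a scaled copy of it to a layer's hidden state at inference. Everything below is a toy illustration (random stand-in activations, arbitrary dimensions), not the paper's pipeline or its layer-selection method:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy hidden dimension

# Hypothetical cached activations at one layer for prompts exhibiting
# high vs low levels of a Big Five trait (from real LLM forward passes
# in practice; random placeholders here).
acts_high = rng.normal(0.5, 1.0, size=(100, d))
acts_low = rng.normal(-0.5, 1.0, size=(100, d))

# Difference-of-means steering direction, unit-normalised.
direction = acts_high.mean(axis=0) - acts_low.mean(axis=0)
direction /= np.linalg.norm(direction)

def steer(hidden_state, direction, alpha):
    """Inject the trait direction into a layer's hidden state."""
    return hidden_state + alpha * direction

h = rng.normal(size=d)
h_steered = steer(h, direction, alpha=3.0)

# The projection onto the trait axis shifts by exactly alpha.
print(round((h_steered - h) @ direction, 6))
```

The layer-selection question the paper addresses is where to apply `steer`: too early and the shift washes out, too late and fluency degrades.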
Do LLMs Use Cultural Knowledge Without Being Told? A Multilingual Evaluation of Implicit Pragmatic Adaptation
Evaluates whether LLMs adapt their pragmatic behaviour to implicit cultural cues. Across four deployed LLMs, five languages (English, German, Hindi, Nepali, Urdu), and 60 culturally grounded scenarios, models recover only ~20% of the pragmatic shift they produce under explicit instruction (PCS mean 0.196). A paired Hindi/Urdu natural control indicates that models respond primarily to linguistic structure rather than cultural associations; hedging density shows negative explicit gaps across all five languages, suggesting alignment training actively suppresses the target behaviour. Frames multilingual cultural pragmatics as an explicit-vs-implicit deployment problem, not a knowledge problem.
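The "~20% recovery" framing suggests a ratio of the behavioural shift produced by implicit cues to the shift produced by explicit instruction. A hypothetical formulation of such a score (the name `pragmatic_consistency` and this exact formula are my inference from the summary, not the paper's definition):

```python
def pragmatic_consistency(baseline, implicit, explicit, eps=1e-9):
    """Fraction of the explicit-instruction shift recovered from implicit
    cues alone: 1.0 means implicit cues fully reproduce the explicit shift,
    0.0 means no adaptation. Hypothetical reconstruction of the PCS idea."""
    return (implicit - baseline) / (explicit - baseline + eps)

# Toy numbers: some pragmatic feature rate under three prompt conditions.
print(round(pragmatic_consistency(baseline=0.1, implicit=0.2, explicit=0.6), 3))
```

Under this reading, a mean of 0.196 says models produce only a fifth of the culturally appropriate shift when the cue is implicit, even though they demonstrably can produce the full shift when told to.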
Thesis
Beyond Words: Harnessing Large Language Models for Detecting Implicit Hate Speech
Honours thesis (HD 80/100) on LLM-based detection of implicit hate speech. Subsequently rewritten and extended into the ACL 2026 paper above.
Current Work
Research assistant at the NASIM lab (UWA), building stochastic multi-agent simulations of social discourse and opinion propagation, and deploying a RAG pipeline for confidential research data. In parallel, inheriting a UWA Medical School collaboration on paediatric penicillin allergy de-labelling — extending a prior random-forest + SHAP baseline with emphasis on honest calibration and failure-mode reporting.