Design
Kevin Wu
November 13, 2025
InterpDetect at NeurIPS 2025 MechInt Workshop


Mechanistic Interpretability • GenAI in Finance
Earlier this month, we announced that our paper, InterpDetect: Interpretable Signals for Detecting Hallucinations in Retrieval-Augmented Generation, was accepted to two NeurIPS 2025 workshops: Mechanistic Interpretability and GenAI in Finance.
Each workshop convenes a different part of the research community.
Mechanistic Interpretability focuses on the circuits and pathways that govern model behavior.
GenAI in Finance brings together researchers tackling some of the highest-stakes applied problems in the world.
InterpDetect sits directly at their intersection.
Why InterpDetect Matters
Hallucinations remain a fundamental reliability challenge in RAG systems.
Even when retrieval is strong, models often generate content unsupported by evidence — a failure mode that breaks financial analysis, compliance workflows, and enterprise agents.
InterpDetect introduces a mechanistic framework for diagnosing why hallucinations happen, not just identifying whether they occur.
External Context Score (ECS): Measures how much attention heads rely on retrieved evidence.
Parametric Knowledge Score (PKS): Measures how strongly feed-forward networks (FFNs) inject the model's internal, parametric knowledge into the residual stream.
Across thousands of examples, we find a consistent pattern:
Hallucinations arise when models underweight external context (low ECS) and over-rely on late-layer FFNs (high PKS).
These activation-level signals offer a new, interpretable way to understand and detect errors in RAG pipelines.
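For intuition, here is a minimal sketch of how ECS- and PKS-style signals could be read off a Hugging Face model. The exact definitions live in the paper; the checkpoint id, the context/answer split, and the residual-stream proxy used in place of the isolated FFN write are illustrative assumptions.

```python
# A minimal sketch, not the paper's implementation: approximating ECS- and
# PKS-style signals with HuggingFace transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-0.6B"  # assumed repo id for the paper's proxy model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    attn_implementation="eager",  # eager attention so weights can be returned
)
model.eval()

context = "Retrieved passage: The 2024 filing reports revenue of $1.2B."
answer = " The company reported $1.2B in revenue for 2024."
enc = tok(context + answer, return_tensors="pt")
ctx_len = tok(context, return_tensors="pt")["input_ids"].shape[1]

with torch.no_grad():
    out = model(**enc, output_attentions=True, output_hidden_states=True)

# ECS-style signal: per layer and head, the attention mass that answer tokens
# place on retrieved-context tokens (higher = more grounded in evidence).
ecs = torch.stack([
    att[0, :, ctx_len:, :ctx_len].sum(dim=-1).mean(dim=-1)  # -> (n_heads,)
    for att in out.attentions                               # one per layer
])                                                          # (n_layers, n_heads)

# PKS-style signal: per-layer growth of the residual stream over the answer
# span. This is a coarse stand-in; the paper isolates the FFN write itself.
hs = torch.stack(out.hidden_states)               # (n_layers + 1, 1, seq, d)
pks = (hs[1:] - hs[:-1]).norm(dim=-1)[:, 0, ctx_len:].mean(dim=-1)

print("mean ECS per layer:", ecs.mean(dim=-1))
print("PKS proxy per layer:", pks)
```

Reading the two signals side by side is what makes the low-ECS, high-PKS pattern above visible in practice.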
Results We're Sharing at NeurIPS
At NeurIPS, we're presenting three main findings:
1. Mechanistic signals predict hallucinations with high fidelity.
ECS negatively correlates with hallucination risk across all layers and heads.
PKS spikes in later layers — particularly in hallucinated spans — highlighting FFN over-activation as a causal mechanism.
2. Small models can act as efficient, generalizable detectors.
A 0.6B-parameter model (Qwen3-0.6B) can:
analyze its own activations, and
detect hallucinations in GPT-4.1-mini outputs
with performance competitive with commercial detectors.
3. Proxy-model evaluation works.
Classifiers trained on Qwen3-0.6B activations generalize to larger models, reducing cost while improving explainability.
This challenges the assumption that hallucination detection requires frontier-scale models — mechanistic signals provide a lighter, cheaper path.
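To make the proxy-detector idea concrete, here is a minimal training sketch under stated assumptions: the feature files, labels, and logistic-regression classifier are placeholders, not the paper's exact pipeline.

```python
# A minimal sketch, assuming precomputed per-example ECS/PKS features from the
# 0.6B proxy and binary hallucination labels; file names and the logistic
# regression classifier are illustrative placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.load("proxy_ecs_pks_features.npy")  # (n_examples, n_features), assumed
y = np.load("hallucination_labels.npy")    # (n_examples,), assumed

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"held-out accuracy: {clf.score(X_te, y_te):.3f}")

# Because the features come from the small proxy, the same classifier can
# score outputs generated by a larger model (e.g. GPT-4.1-mini) at the
# proxy's inference cost.
```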
How InterpDetect Powers Agent CI
Pegasi’s Agent CI is our continuous-improvement engine for enterprise AI agents.
InterpDetect is the mechanistic scoring core of that system.
Simulate: Run thousands of retrieval, reasoning, and browser-automation traces.
Score: Use ECS/PKS to detect hallucinations, context drift, and reliability failures.
Improve: Automatically retrain, patch, or steer models to eliminate recurring issues.
Instead of relying on expensive black-box LLM judges, Agent CI uses:
deterministic activation-level signals
fast, low-compute proxy models
transparent, auditable scoring
InterpDetect makes continuous improvement practical at enterprise scale.
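As a rough illustration of the Score step, here is a sketch of how ECS/PKS signals might gate a trace; the function, thresholds, and trace format are assumptions, not the actual Agent CI API.

```python
# A hypothetical scoring gate over per-trace ECS/PKS summaries; thresholds
# and names are illustrative, not Agent CI's real interface.
from dataclasses import dataclass

@dataclass
class TraceScore:
    ecs: float      # mean reliance on retrieved context
    pks: float      # mean late-layer parametric-knowledge injection
    flagged: bool   # True when the hallucination pattern is present

def score_trace(ecs: float, pks: float,
                ecs_floor: float = 0.35, pks_ceiling: float = 1.8) -> TraceScore:
    """Flag traces that underweight context (low ECS) and over-rely on
    parametric knowledge (high PKS), per the pattern described above."""
    return TraceScore(ecs, pks, flagged=(ecs < ecs_floor and pks > pks_ceiling))

# Simulate -> Score -> Improve, with the Improve step stubbed out.
for ecs, pks in [(0.62, 1.1), (0.21, 2.4)]:  # stand-ins for simulated traces
    result = score_trace(ecs, pks)
    if result.flagged:
        print(f"hallucination risk: ECS={result.ecs:.2f}, PKS={result.pks:.2f}")
        # -> queue this trace for retraining, patching, or steering
```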
📄 Read the paper: InterpDetect on arXiv
🌐 Explore the code: Agent CI on GitHub
Pegasi is building the corrective and continuous-improvement layer for enterprise AI.
And NeurIPS 2025 is the next step in making generative AI not just powerful, but reliable enough for the real world.