Design
Kevin Wu
November 13, 2025
InterpDetect at NeurIPS 2025 MechInt Workshop


Mechanistic Interpretability • GenAI in Finance
Earlier this month, we announced that our paper, InterpDetect: Interpretable Signals for Detecting Hallucinations in Retrieval-Augmented Generation, was accepted to two NeurIPS 2025 workshops: Mechanistic Interpretability and GenAI in Finance.
Each workshop convenes a different part of the research community.
Mechanistic Interpretability focuses on the circuits and pathways that govern model behavior.
GenAI in Finance brings together researchers tackling some of the highest-stakes applied problems in the world.
InterpDetect sits directly at their intersection.
Why InterpDetect Matters
Hallucinations remain a fundamental reliability challenge in RAG systems.
Even when retrieval is strong, models often generate content unsupported by evidence — a failure mode that breaks financial analysis, compliance workflows, and enterprise agents.
InterpDetect introduces a mechanistic framework for diagnosing why hallucinations happen, not just identifying whether they occur.
External Context Score (ECS): Measures how much attention heads rely on retrieved evidence.
Parametric Knowledge Score (PKS): Measures how strongly feed-forward networks (FFNs) inject the model's internal, parametric knowledge into the residual stream.
Across thousands of examples, we find a consistent pattern:
Hallucinations arise when models underweight external context (low ECS) and over-rely on late-layer FFNs (high PKS).
These activation-level signals offer a new, interpretable way to understand and detect errors in RAG pipelines.
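For intuition, here is a minimal sketch of how ECS- and PKS-style signals could be read off a Hugging Face model. The exact definitions live in the paper; the checkpoint id, the context/answer split, and the residual-stream proxy used in place of the isolated FFN write are illustrative assumptions.

```python
# A minimal sketch, not the paper's implementation: approximating ECS- and
# PKS-style signals with HuggingFace transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-0.6B"  # assumed repo id for the paper's proxy model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    attn_implementation="eager",  # eager attention so weights can be returned
)
model.eval()

context = "Retrieved passage: The 2024 filing reports revenue of $1.2B."
answer = " The company reported $1.2B in revenue for 2024."
enc = tok(context + answer, return_tensors="pt")
ctx_len = tok(context, return_tensors="pt")["input_ids"].shape[1]

with torch.no_grad():
    out = model(**enc, output_attentions=True, output_hidden_states=True)

# ECS-style signal: per layer and head, the attention mass that answer tokens
# place on retrieved-context tokens (higher = more grounded in evidence).
ecs = torch.stack([
    att[0, :, ctx_len:, :ctx_len].sum(dim=-1).mean(dim=-1)  # -> (n_heads,)
    for att in out.attentions                               # one per layer
])                                                          # (n_layers, n_heads)

# PKS-style signal: per-layer growth of the residual stream over the answer
# span. This is a coarse stand-in; the paper isolates the FFN write itself.
hs = torch.stack(out.hidden_states)               # (n_layers + 1, 1, seq, d)
pks = (hs[1:] - hs[:-1]).norm(dim=-1)[:, 0, ctx_len:].mean(dim=-1)

print("mean ECS per layer:", ecs.mean(dim=-1))
print("PKS proxy per layer:", pks)
```

Reading the two signals side by side is what makes the low-ECS, high-PKS pattern above visible in practice.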
Results We're Sharing at NeurIPS
At NeurIPS, we're presenting three main findings:
1. Mechanistic signals predict hallucinations with high fidelity.
ECS negatively correlates with hallucination risk across all layers and heads.
PKS spikes in later layers — particularly in hallucinated spans — highlighting FFN over-activation as a causal mechanism.
2. Small models can act as efficient, generalizable detectors.
A 0.6B-parameter model (Qwen3-0.6B) can:
analyze its own activations, and
detect hallucinations in GPT-4.1-mini outputs
with performance competitive with commercial detectors.
3. Proxy-model evaluation works.
Classifiers trained on Qwen3-0.6B activations generalize to larger models, reducing cost while improving explainability.
This challenges the assumption that hallucination detection requires frontier-scale models — mechanistic signals provide a lighter, cheaper path.
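To make the proxy-detector idea concrete, here is a minimal training sketch under stated assumptions: the feature files, labels, and logistic-regression classifier are placeholders, not the paper's exact pipeline.

```python
# A minimal sketch, assuming precomputed per-example ECS/PKS features from the
# 0.6B proxy and binary hallucination labels; file names and the logistic
# regression classifier are illustrative placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.load("proxy_ecs_pks_features.npy")  # (n_examples, n_features), assumed
y = np.load("hallucination_labels.npy")    # (n_examples,), assumed

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"held-out accuracy: {clf.score(X_te, y_te):.3f}")

# Because the features come from the small proxy, the same classifier can
# score outputs generated by a larger model (e.g. GPT-4.1-mini) at the
# proxy's inference cost.
```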
How InterpDetect Powers Agent CI
Pegasi’s Agent CI is our continuous-improvement engine for enterprise AI agents.
InterpDetect is the mechanistic scoring core of that system.
Simulate: Run thousands of retrieval, reasoning, and browser-automation traces.
Score: Use ECS/PKS to detect hallucinations, context drift, and reliability failures.
Improve: Automatically retrain, patch, or steer models to eliminate recurring issues.
Instead of relying on expensive black-box LLM judges, Agent CI uses:
deterministic activation-level signals
fast, low-compute proxy models
transparent, auditable scoring
InterpDetect makes continuous improvement practical at enterprise scale.
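As a rough illustration of the Score step, here is a sketch of how ECS/PKS signals might gate a trace; the function, thresholds, and trace format are assumptions, not the actual Agent CI API.

```python
# A hypothetical scoring gate over per-trace ECS/PKS summaries; thresholds
# and names are illustrative, not Agent CI's real interface.
from dataclasses import dataclass

@dataclass
class TraceScore:
    ecs: float      # mean reliance on retrieved context
    pks: float      # mean late-layer parametric-knowledge injection
    flagged: bool   # True when the hallucination pattern is present

def score_trace(ecs: float, pks: float,
                ecs_floor: float = 0.35, pks_ceiling: float = 1.8) -> TraceScore:
    """Flag traces that underweight context (low ECS) and over-rely on
    parametric knowledge (high PKS), per the pattern described above."""
    return TraceScore(ecs, pks, flagged=(ecs < ecs_floor and pks > pks_ceiling))

# Simulate -> Score -> Improve, with the Improve step stubbed out.
for ecs, pks in [(0.62, 1.1), (0.21, 2.4)]:  # stand-ins for simulated traces
    result = score_trace(ecs, pks)
    if result.flagged:
        print(f"hallucination risk: ECS={result.ecs:.2f}, PKS={result.pks:.2f}")
        # -> queue this trace for retraining, patching, or steering
```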
📄 Read the paper: InterpDetect on arXiv
🌐 Explore the code: Agent CI on GitHub
Pegasi is building the corrective and continuous-improvement layer for enterprise AI.
And NeurIPS 2025 is the next step in making generative AI not just powerful, but reliable enough for the real world.