Research
Kevin Wu
July 27, 2025
Pegasi Research: FRED at ICML 2025


Earlier this month, we presented our paper, FRED: Financial Retrieval-Enhanced Detection and Editing of Hallucinations in Language Models, at the ICML 2025 Workshop on World Models.
The workshop convened researchers across NLP, video, and science around a central question: Do generative models truly understand the world, or are they simply mimicking it? Our contribution brought this debate into one of the highest-stakes application areas: finance.
Why FRED Matters
Generative AI is powerful, but hallucinations remain its Achilles’ heel. In finance, even a small factual slip (a wrong number, date, or entity) can have multi-million-dollar consequences.
FRED is our framework for detecting and correcting hallucinations in financial RAG (retrieval-augmented generation) systems. It is built on three components, with a code sketch after the list:
Domain-specific taxonomy: Six categories of errors tailored to finance (numerical, temporal, entity, relation, contradictory, unverifiable).
Synthetic data: Systematic insertion of tagged errors into FinQA + TAT-QA corpora.
End-to-end editing: Fine-tuned models that not only flag errors but also generate grounded corrections, making outputs auditable and trustworthy.
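To make the framework concrete, here is a minimal Python sketch of the error taxonomy and the tagged-error insertion step behind our synthetic data. The class and function names here are illustrative assumptions for exposition, not FRED's actual API; the real implementation lives in the Pegasi Shield repository linked below.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorType(Enum):
    """FRED's six finance-specific error categories."""
    NUMERICAL = "numerical"
    TEMPORAL = "temporal"
    ENTITY = "entity"
    RELATION = "relation"
    CONTRADICTORY = "contradictory"
    UNVERIFIABLE = "unverifiable"


@dataclass
class TaggedError:
    """One synthetically inserted error: the corrupted span, its
    category, and the grounded correction the editor should recover."""
    span: str
    error_type: ErrorType
    correction: str


def inject_error(answer: str, correct: str, wrong: str,
                 error_type: ErrorType) -> tuple[str, TaggedError]:
    """Corrupt a clean answer by swapping a correct value for a wrong
    one, recording the tag used to supervise detection and editing."""
    corrupted = answer.replace(correct, wrong, 1)
    return corrupted, TaggedError(span=wrong, error_type=error_type,
                                  correction=correct)


# Usage: build one training pair from a clean FinQA-style answer.
clean = "Net revenue grew 12% year-over-year to $4.2B in Q3 2024."
corrupted, tag = inject_error(clean, "12%", "21%", ErrorType.NUMERICAL)
print(corrupted)                                    # ...grew 21%...
print(tag.error_type.value, "->", tag.correction)   # numerical -> 12%
```

Training pairs like this give the detector tagged spans to flag and the editor a grounded target to restore, which is what makes the outputs auditable.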
Results We Shared
At ICML, we presented evidence that small, fine-tuned models can rival — and in some cases outperform — larger frontier models:
Fine-tuned Phi-4 achieved an 8% boost in binary F1 and a 30% overall gain in detection compared to OpenAI’s o3; see the metric sketch after this list.
Even Phi-4-mini (4B params) matched o3 within ~2%, showing that accuracy doesn’t always require scale.
Strongest gains were on numerical and temporal corrections, the exact pain points in financial reasoning.
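For readers unfamiliar with the headline metric: binary F1 scores whether each passage is correctly flagged as containing a hallucination, balancing precision against recall. A toy illustration (the labels below are made up for exposition, not our evaluation data):

```python
from sklearn.metrics import f1_score

# 1 = contains a hallucination, 0 = clean; gold labels vs. predictions.
gold = [1, 0, 1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 1, 0, 1, 1, 0]

# F1 = harmonic mean of precision and recall over the positive class.
print(f"binary F1: {f1_score(gold, pred):.2f}")  # 0.75 here
```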
The Reception
Our work sparked strong discussion at the workshop for three reasons:
Grounded in high-stakes applications – showing finance as a proving ground for reliable AI.
Generalizable framework – methods extend beyond finance into law, healthcare, and compliance.
Startup-driven research – Pegasi is bridging frontier AI evaluation with enterprise deployment.
What’s Next
Mechanistic interpretability – digging deeper into why small fine-tuned models outperform larger LLMs in factuality.
Enterprise rollouts – embedding FRED into Pegasi’s corrective layer for compliance, legal, and vendor risk use cases.
Cross-domain testing – expanding from finance into other regulated sectors.
📄 Read the paper: FRED on arXiv
🌐 Explore the code: Pegasi Shield on GitHub
Pegasi is building the corrective layer for enterprise AI.
ICML 2025 was one more step in showing how we can make generative AI not just powerful, but trustworthy enough for the real world.