Auditing AI research summaries and preventing hallucinations in qualitative data is no longer a nice-to-have skill for financial professionals — it is the difference between a trade grounded in evidence and one built on the confident ramblings of a very articulate algorithm. As AI tools flood trading desks, research departments, and investment banks, the ability to critically interrogate AI-generated outputs has become the single most important competency a modern trader or analyst can develop.
Part 1: What Is an AI Hallucination and Why Should Traders Care?
Before we talk about how to audit AI research summaries, we need to be crystal clear about what we are dealing with. An AI hallucination is not the machine having a bad trip. It is not a glitch. It is a feature — and a dangerous one.
When a large language model (LLM) generates text, it is not retrieving facts from a verified database. It is predicting the most statistically likely sequence of words given a prompt. This means it can produce output that is fluent, authoritative, confidently formatted — and completely, catastrophically wrong. Think of it as having a colleague who speaks with total certainty about everything, never admits uncertainty, and occasionally invents entire academic journals. You know someone like that. We all do.
Researchers have defined hallucinations as the generation of content that appears fluent and authoritative but is factually incorrect, fabricated, or unsupported by evidence. They can manifest as fabricated citations with non-existent authors, false historical facts or dates, invented statistics, incorrect explanations of complex concepts, and flawed code presented with apparent confidence (Huang et al., 2025, via arXiv). These hallucinations arise because LLMs are trained to predict next-word sequences rather than retrieve verified facts — they lack mechanisms to distinguish between learned patterns and plausible-sounding fabrications.
For a trader, this is not an abstract philosophical problem. It is a P&L problem.
Consider: a study published on arXiv examining deep-research AI pipelines found that citation accuracy ranges from just 40–80% across systems, and that these models remain highly one-sided on debate-style queries while still generating large fractions of unsupported statements (DeepTRACE, arXiv, 2025). Let that sink in. The tool you are using to summarise earnings transcripts or scan qualitative sentiment from conference calls might be wrong — a lot. And it will not tell you. It will just move on with the same energy, like a guy at a networking event who makes eye contact too long and speaks about himself in the third person.
Meanwhile, research by Wu et al. (2023), cited in a comprehensive review of LLMs in equity markets published in Frontiers in Artificial Intelligence, specifically documented hallucinations occurring in financial summaries — the exact use case traders are most likely to deploy AI for (Jadhav & Mirza, 2025, Frontiers in AI). We are not talking hypothetical risks. This is documented. It is happening on trading desks right now.
And yet, a global study by KPMG and the University of Melbourne found that 66% of employees rely on LLM outputs without verifying accuracy (arXiv Calibrated Trust Study, 2024). Sixty-six percent. That is two-thirds of your colleagues walking into a trade based on a summary that might have been partially generated by a model that was just guessing. That is not a workflow. That is financial roulette with a robot spinning the wheel.
Part 2: Why Qualitative Data Is Especially Vulnerable
Qualitative data — earnings call transcripts, analyst sentiment, management commentary, thematic research, ESG narratives — is the terrain where AI hallucinations are most dangerous and hardest to detect.
Why? Because qualitative data does not come with a verification number. When an AI tells you that a stock closed at £142.30 yesterday, you can check that in two seconds. When an AI tells you that “management expressed cautious optimism about Q3 pipeline conversion,” who is going to go back and verify that? Nobody has time for that. That is exactly why you gave the work to the AI in the first place.
This is what researchers call the plausibility trap. The output sounds exactly like what you would expect to read. It has the right vocabulary, the right structure, the right tone. It is the textual equivalent of a well-pressed suit — it signals competence before the meeting even starts. But underneath? Sometimes there is nothing there.
A study examining LLM deficiencies in finance conducted an empirical investigation into the extent of hallucinations produced by models performing real-world financial tasks — including the ability to query historical stock prices accurately — and found significant, systematic errors in domains that financial professionals treat as factual bedrock (arXiv, 2023, Deficiency of LLMs in Finance). If models struggle with hard facts like historical prices, imagine what they are doing with the soft, interpretive judgements embedded in qualitative research.
Let me give you a trader’s translation: if you feed an AI a 40-page earnings call transcript and ask it to summarise the key themes, the risks flagged by management, and the sentiment around guidance, the model will produce something that looks absolutely convincing. It will use the language of the document. It will structure the output professionally. And somewhere in the middle, there is a non-trivial chance it will have invented a risk factor, misattributed a comment to the wrong executive, or summarised a part of the transcript with a slant that reflects patterns in its training data rather than what management actually said.
That is not sloppy work. That is the way these systems function. Researchers have demonstrated that hallucinations are inherent to LLMs, arising from their underlying mathematical and logical structures, and cannot be entirely eliminated through architectural improvements, dataset enhancements, or fact-checking mechanisms (Banerjee, Agarwal & Singla, 2024, cited in arXiv, 2502.11113). You cannot patch your way out of this. You have to audit your way through it.
Part 3: The Audit Framework — How to Actually Check AI Research Summaries
Here is where we get practical. There is no single tool that will do this for you — and if someone tries to sell you one, ask them to cite three peer-reviewed papers supporting its accuracy, then check whether those papers actually exist. (They might not. That would be on-brand.)
A robust audit framework for AI research summaries in qualitative financial data has five core pillars.
Pillar 1: Source Triangulation
Every significant claim in an AI-generated research summary should be traceable to a source that exists. This sounds obvious. It is not practised nearly enough.
When an AI produces a summary of thematic risks in a sector, or a synthesis of analyst commentary around a particular company, ask it to cite the source for each key claim. Then verify those sources exist. Research conducted across major AI conferences found that NeurIPS research papers contained over 100 AI-hallucinated citations — fabricated references that slipped through peer review at one of the world’s most rigorous academic venues (Fortune, January 2026). If hallucinated citations are appearing in academic papers at NeurIPS and ICLR, they are absolutely appearing in your AI-generated research summaries.
Source triangulation means independently verifying at least three to five key data points in any AI summary against the primary document — the actual transcript, the actual filing, the actual report — before the summary informs a trade decision.
Pillar 2: The Verbatim Spot-Check
Select five to eight specific statements from the AI summary and find them verbatim in the source document. Not approximately. Verbatim.
If the AI says “the CEO indicated that supply chain disruptions were ‘largely resolved’ by mid-year,” find the exact phrase “largely resolved” in the transcript. If you cannot find it, you have potentially caught either a fabrication or a mischaracterisation — both of which matter enormously in qualitative research.
This method is simple, takes five minutes, and will catch a surprising number of errors. Studies examining AI-powered deep research agents identified 16 common failure cases including misattribution, unsupported claims, and selective framing — all of which a verbatim spot-check would surface (DeepTRACE, arXiv, 2025).
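If you want to take some of the manual effort out of this step, the core of it can be scripted. Below is a minimal sketch in Python, assuming you have the transcript as a plain text file and a handful of quoted phrases lifted from the AI summary; the file name and the quotes are illustrative, and any phrase that fails the check still needs a human to decide whether it is a fabrication or a paraphrase.

```python
# Minimal verbatim spot-check: confirm that quoted phrases from an AI summary
# appear word-for-word in the source transcript. Whitespace and case are
# normalised so line breaks in the transcript do not cause false misses,
# but the wording itself must match exactly.
import re

def normalise(text: str) -> str:
    """Lower-case and collapse all whitespace so only the wording matters."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def verbatim_spot_check(summary_quotes: list[str], transcript: str) -> dict[str, bool]:
    """Return a pass/fail flag for each quoted phrase."""
    haystack = normalise(transcript)
    return {quote: normalise(quote) in haystack for quote in summary_quotes}

if __name__ == "__main__":
    # Illustrative file path: point this at the actual primary document.
    transcript = open("acme_q3_transcript.txt", encoding="utf-8").read()
    quotes = [
        "largely resolved",
        "some early indicators of demand recovery",
    ]
    for quote, found in verbatim_spot_check(quotes, transcript).items():
        status = "FOUND" if found else "NOT FOUND: escalate for manual review"
        print(f"{status}: \"{quote}\"")
```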
Pillar 3: Sentiment Calibration Audit
For qualitative research specifically, the framing and emotional tone of an AI summary can be subtly — or not so subtly — biased. Research has shown that LLMs carry distinct biases from their training data, including in the financial domain, with models showing skewed preferences for technology stocks and large-cap names that influence the sentiment of their outputs (Lee et al., 2025, cited in arXiv 2603.17692).
A sentiment calibration audit asks: does the overall tone of the AI summary match what a neutral, human analyst would take from the same source document? This is done by re-reading a representative sample of the original document yourself — or having a second human analyst do it — and comparing it against the AI’s characterisation. Look specifically for places where the AI has softened negative language, amplified positive signals, or omitted qualifications that management explicitly added to their forward guidance.
You are not trying to prove the AI is wrong. You are trying to calibrate how it reads. Every tool has a lean. Know yours.
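If you want a quick quantitative cross-check to supplement the human read, the comparison can be roughed out in a few lines. The sketch below uses a deliberately crude word-count lexicon purely to illustrate the idea; the word lists, threshold, and example sentences are placeholders, and in practice you would swap in your firm's sentiment model or a second analyst's judgement.

```python
# Toy sentiment calibration: score the AI summary and a sample of the source
# document with the same crude polarity lexicon, then flag a divergence above
# a chosen threshold. The lexicon and threshold are placeholders, not a
# recommended methodology.
POSITIVE = {"strong", "confident", "growth", "improved", "resolved", "optimism"}
NEGATIVE = {"risk", "decline", "uncertain", "headwind", "headwinds", "cautious", "weak"}

def polarity(text: str) -> float:
    """Return a score in [-1, 1]: positive minus negative word share."""
    words = text.lower().split()
    if not words:
        return 0.0
    pos = sum(w.strip(".,;:") in POSITIVE for w in words)
    neg = sum(w.strip(".,;:") in NEGATIVE for w in words)
    return (pos - neg) / len(words)

def calibration_gap(ai_summary: str, source_excerpt: str, threshold: float = 0.01) -> bool:
    """True if the summary's tone diverges from the source by more than the threshold."""
    gap = polarity(ai_summary) - polarity(source_excerpt)
    print(f"summary={polarity(ai_summary):+.4f} source={polarity(source_excerpt):+.4f} gap={gap:+.4f}")
    return abs(gap) > threshold

# Illustrative inputs: the summary reads far more bullish than the source.
calibration_gap(
    "Management confirmed strong demand and improved margins.",
    "Management noted some early indicators of recovery but flagged cost headwinds and cautious guidance.",
)
```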
Pillar 4: Retrieval-Augmented Generation (RAG) Verification
If your firm’s AI research tools use retrieval-augmented generation — where the model retrieves content from specific documents before generating its summary — you have a meaningful advantage in the audit process. RAG architectures were specifically developed to reduce hallucination by grounding model outputs in verified, external sources, and research has shown that RAG improves both factual accuracy and user trust in AI-generated answers (Li et al., 2024, cited in MIT Sloan).
However, RAG is not hallucination-proof. The retrieval component itself can fail — producing citations that are irrelevant or non-existent, or pulling the wrong section of a document. Your audit should verify not just what the AI said, but which source it claims to have retrieved from, and whether that source actually contains the information being attributed to it.
If your AI tools do not use RAG — if they are generating summaries purely from parametric memory — your audit standards need to be significantly more rigorous, because you have no retrieval trail to follow.
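Here is a minimal sketch of what that retrieval-trail check can look like, assuming your RAG tooling can export, for each claim, the source it says the claim came from. The data shapes below (Claim, RetrievedChunk) are assumptions for illustration, not any vendor's actual API.

```python
# Sketch of a retrieval-trail check: was the cited source actually retrieved,
# and does it contain the phrase the summary attributes to it?
from dataclasses import dataclass

@dataclass
class RetrievedChunk:
    source_id: str   # e.g. an ID from the firm's verified document set
    text: str

@dataclass
class Claim:
    statement: str
    cited_source_id: str
    supporting_phrase: str   # the shortest phrase that should appear in the cited chunk

def verify_attribution(claim: Claim, chunks: list[RetrievedChunk]) -> str:
    """Check that the cited source was retrieved and contains the supporting phrase."""
    matches = [c for c in chunks if c.source_id == claim.cited_source_id]
    if not matches:
        return "FAIL: cited source was never retrieved"
    if any(claim.supporting_phrase.lower() in c.text.lower() for c in matches):
        return "PASS"
    return "FAIL: cited source retrieved, but does not contain the attributed phrase"

# Illustrative example based on the earnings-call case study in this article.
chunks = [RetrievedChunk("acme_q3_transcript", "We see some early indicators of demand recovery.")]
claim = Claim("Management confirmed strong Q4 demand visibility.",
              "acme_q3_transcript", "confirmed strong Q4 demand visibility")
print(verify_attribution(claim, chunks))   # FAIL: the phrase is not in the cited chunk
```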
Pillar 5: Human-in-the-Loop Final Review
No audit framework is complete without human sign-off. Not because humans are infallible — we very much are not, and the market will remind you of that regularly — but because the combination of human and AI analysis consistently outperforms either in isolation.
Research published in a comprehensive study of LLMs in equity markets found that human-AI collaboration in equity analysis often outperforms either approach alone, highlighting the role of human judgment in complex decision-making (Cao et al., 2023, cited in arXiv 2603.19944). The key word is judgment. AI summarises. Humans judge. That division of labour is not a limitation — it is the architecture of a well-run research process.
Part 4: Case Studies
Case Study 1 — The Earnings Call That Never Said That
In 2023, a mid-sized asset manager deployed an LLM-based tool to summarise earnings call transcripts across a portfolio of holdings. During a quarterly review, an analyst noticed the AI summary for a consumer staples company flagged “management confirmed strong Q4 demand visibility.” The analyst pulled the original transcript and could not find the phrase. What management had actually said was that they had “some early indicators” of demand recovery — a meaningfully more cautious statement.
The AI had taken a cautious, hedged comment and reframed it as a confirmation. The difference, for a trader calibrating position sizing into year-end, was not trivial. The firm subsequently implemented a mandatory verbatim spot-check policy for any AI summary used to inform a portfolio decision. A small procedural change. A significant reduction in tail risk from AI mischaracterisation.
Case Study 2 — The Citation That Did Not Exist
In a well-documented case reported by Fortune in January 2026, researchers using GPTZero found over 100 hallucinated citations embedded in papers accepted at NeurIPS — one of the most selective AI research conferences in the world (Fortune, 2026). The papers had passed peer review. The citations had not been checked.
For traders relying on AI-generated sector research that synthesises academic and industry literature, this case study should be alarming. If fabricated citations cleared peer review at an elite AI conference, they are clearing your internal research review process too. The lesson: every cited paper in an AI-generated research summary should be independently verified before it informs a position.
Case Study 3 — The Google AI Overview That Recommended Eating Rocks
In May 2024, Google rolled out its AI Overview tool to wide audiences. Within days, the tool had suggested that geologists recommend eating one rock per day for health and that users could make cheese stick to pizza by using glue. Google attributed the errors to “data voids” and unusual prompts — but the incident illustrated, with startling clarity, how quickly LLM outputs can produce content that sounds credible and is dangerously inaccurate (BizTech Magazine, 2025).
Now, admittedly, no trader is going to change their macro view because an AI told them to eat rocks. But the mechanism that produced “eat rocks” is the same mechanism producing “management expressed high confidence in FY guidance.” The calibration of trust is what matters. If the system can be confidently wrong about one thing, it can be confidently wrong about another.
Case Study 4 — The Trader Who Trusted, and the Trader Who Verified
At a proprietary trading firm operating across European equities, two senior traders were given AI-generated summaries of the same company’s investor day materials. One trader accepted the summary as provided and sized up a long position ahead of earnings. The other ran the five-pillar audit, identified that the AI had omitted a significant risk disclosure that management had flagged in their prepared remarks, and sized down the position accordingly.
Earnings disappointed. The first trader took a meaningful loss. The second did not. The difference was not intelligence or experience — both traders had equivalent backgrounds. The difference was a ten-minute audit process that checked the AI’s work before the AI’s work checked the trader’s P&L.
Part 5: Building an Institutional Audit Culture
Individual discipline is necessary but not sufficient. To prevent hallucinations from consistently entering trading decisions, firms need to build audit protocols into their research workflows at an institutional level.
A conceptual framework for AI auditing published in Administrative Sciences (2024) argues that AI is shifting professional auditing from retrospective examination toward proactive, real-time monitoring, and that the benefits of AI integration are only realised when it is paired with informed human decision-making in risk analysis and regulatory compliance (Alotaibi et al., 2024). Translation: it is not about whether you use AI. It is about whether you use it with institutional guardrails that match the risk of the decisions it is informing.
For trading and research teams, this means several concrete changes.
Standardised Prompt Design. Vague prompts produce vague — and often hallucinated — outputs. Research has shown that the quality of AI output is closely tied to how specific the input is, and that clear, structured prompts significantly reduce inaccuracy (MIT Sloan Teaching & Learning Technologies). Firms should develop and enforce standardised prompt templates for different research tasks — earnings transcript summaries, regulatory filing analysis, ESG commentary — that include explicit instructions to cite sources, flag uncertainties, and avoid inferring beyond the text.
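As an illustration only, a template along the following lines captures those instructions in code; the wording is an example to adapt, not a tested or firm-approved prompt.

```python
# Illustrative standardised prompt template for earnings transcript summaries.
# Adapt the wording to your own tools, document types, and compliance rules.
EARNINGS_SUMMARY_PROMPT = """\
You are summarising an earnings call transcript for an equity research team.

Rules:
1. Use ONLY the transcript provided below. Do not draw on outside knowledge.
2. For every key claim, quote the exact supporting sentence and name the speaker.
3. If the transcript does not address a topic, write "NOT DISCUSSED". Do not infer.
4. Preserve all hedging language (for example "early indicators", "subject to") verbatim.
5. List every risk, caveat, or qualification management attached to guidance.

Transcript:
{transcript}
"""

def build_prompt(transcript_text: str) -> str:
    """Fill the template with the verified source document."""
    return EARNINGS_SUMMARY_PROMPT.format(transcript=transcript_text)
```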
Tiered Verification by Decision Materiality. Not every AI-generated output needs the full five-pillar audit. A quick-look summary for a position already under active coverage needs less scrutiny than a summary forming the basis of a new investment thesis. Firms should create tiered verification protocols where the depth of the audit is proportional to the materiality of the decision it informs.
Audit Logs and Accountability. Every AI-generated research summary used to support a trade decision should be logged, along with the verification steps taken and the analyst who performed them. This creates accountability, builds an institutional memory of where hallucination risk is highest for particular tools and use cases, and provides a defensible audit trail in the event of regulatory scrutiny.
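To make the last two points concrete, here is a minimal sketch of what a log record and a tiered verification policy might look like in code. The field names, tier labels, and required-steps mapping are assumptions to be replaced with your own workflow and compliance requirements.

```python
# Sketch of an audit-trail record plus a tiered verification policy.
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Materiality tier -> audit steps that must be completed before the summary is used.
REQUIRED_STEPS = {
    "monitoring": ["verbatim_spot_check", "sentiment_calibration"],
    "position_change": ["source_triangulation", "verbatim_spot_check",
                        "sentiment_calibration", "citation_check"],
    "new_thesis": ["source_triangulation", "verbatim_spot_check", "sentiment_calibration",
                   "citation_check", "rag_attribution", "omission_review", "human_signoff"],
}

@dataclass
class AuditRecord:
    summary_id: str
    analyst: str
    materiality_tier: str
    steps_completed: list[str] = field(default_factory=list)
    discrepancies: list[str] = field(default_factory=list)
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def is_sufficient(self) -> bool:
        """True if every step required by the tier has been completed."""
        return set(REQUIRED_STEPS[self.materiality_tier]) <= set(self.steps_completed)

# Illustrative usage: two steps done, but the tier requires four.
record = AuditRecord("ACME-Q3-summary-001", "j.smith", "position_change",
                     steps_completed=["verbatim_spot_check", "sentiment_calibration"])
print(record.is_sufficient())   # False: triangulation and citation check still outstanding
```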
A 2024 systematic review of AI audits and AI auditability found that the complexity of AI systems is one of the biggest challenges auditors face, and that thorough reviews of AI outputs, aimed at identifying and mitigating biases that could skew results, are essential for any enterprise that wants to base business decisions on those outputs (Raji et al., 2024). The bias question is particularly relevant for qualitative research, where the framing of information is almost as important as the information itself.
Training and Calibration. Every research analyst and trader who uses AI-generated summaries should receive training not just on how to use the tools, but on their failure modes. Understanding that LLMs hallucinate not randomly but systematically — that they have predictable biases toward plausibility, toward dominant narratives, toward the kinds of conclusions their training data made common — is the foundation of calibrated trust.
A comprehensive qualitative study on calibrated trust in dealing with LLM hallucinations found that the right response to the inherent risk of hallucination is not blanket scepticism and certainly not blind trust — it is calibrated trust, built through structured verification practices that are proportionate to the stakes involved (arXiv, Calibrated Trust Study, 2024). That is exactly the culture firms need to build.
Part 6: The Tools Available to You Right Now
You do not need to build an audit framework from scratch. Several technologies and methodological approaches are already available that can support hallucination detection and reduction in AI research workflows.
Semantic Entropy Scoring. Automated detection systems using semantic entropy can flag when a model’s outputs show high uncertainty — a signal that the model is generating from statistical patterns rather than grounded knowledge. Research on automated hallucination detection using semantic entropy and fact-checking pipelines is advancing rapidly (Farquhar et al., 2024, cited in arXiv 2602.17671).
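To make the mechanism concrete, here is a toy illustration of the idea: ask the model the same question several times, group answers that say the same thing, and measure how spread out the clusters are. Real semantic-entropy systems use entailment models to decide whether two answers are equivalent; the simple string normalisation below is a stand-in for that step, not the published method.

```python
# Toy semantic-entropy score: high entropy across repeated answers suggests the
# model is generating from statistical patterns rather than grounded knowledge.
import math
import re
from collections import Counter

def normalise(answer: str) -> str:
    """Crude stand-in for semantic clustering: strip punctuation and case."""
    return re.sub(r"[^a-z0-9 ]", "", answer.lower()).strip()

def semantic_entropy(sampled_answers: list[str]) -> float:
    """Shannon entropy (in bits) over clusters of equivalent answers."""
    clusters = Counter(normalise(a) for a in sampled_answers)
    total = sum(clusters.values())
    probs = [n / total for n in clusters.values()]
    return -sum(p * math.log2(p) for p in probs) if len(probs) > 1 else 0.0

# Illustrative samples: a stable answer versus a scattered one.
consistent = ["Guidance was reiterated.", "Guidance was reiterated"] * 3
scattered = ["Guidance was raised.", "Guidance was cut.", "Guidance was reiterated.",
             "No guidance was given.", "Guidance was raised.", "Guidance was withdrawn."]
print(semantic_entropy(consistent))   # 0.0   -> stable answer
print(semantic_entropy(scattered))    # ~2.25 -> flag for manual verification
```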
Knowledge Graph Integration. For firms with the technical infrastructure, integrating knowledge graphs into AI research workflows provides structured factual constraints that reduce the space within which a model can hallucinate. This is particularly useful for entity-level checks — verifying that the model’s characterisation of a company’s management, financial structure, or regulatory position is consistent with a verified external knowledge source (Lavrinovics et al., 2025, cited in arXiv).
RAG with Verified Document Sets. As noted in the audit framework, retrieval-augmented generation significantly improves accuracy when the retrieval corpus is tightly controlled. For financial research applications, firms should maintain curated document sets — verified earnings transcripts, regulatory filings, proprietary research — and configure RAG systems to retrieve only from those sources. The combination of a controlled corpus and a RAG architecture gives you both accuracy and an audit trail.
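A minimal sketch of what that corpus control can look like in practice is a guard that simply refuses to pass unverified sources to the summariser. The document identifiers and the retrieval format below are illustrative.

```python
# Guard that drops any retrieved passage whose source is not in the firm's
# verified document set, so the summariser can only be grounded in curated material.
VERIFIED_CORPUS = {"acme_q3_transcript", "acme_2024_annual_report", "acme_regulatory_filing_2024"}

def filter_to_verified(retrieved: list[dict]) -> list[dict]:
    """Keep only passages from the curated corpus; log anything discarded."""
    kept = []
    for passage in retrieved:
        if passage["source_id"] in VERIFIED_CORPUS:
            kept.append(passage)
        else:
            print(f"DISCARDED unverified source: {passage['source_id']}")
    return kept

# Illustrative usage: only the transcript passage survives the filter.
retrieved = [{"source_id": "acme_q3_transcript", "text": "..."},
             {"source_id": "random_web_scrape", "text": "..."}]
grounding = filter_to_verified(retrieved)
```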
AI-Powered Citation Verification. Tools like GPTZero — which was hired by the ICLR conference to check submissions for fabricated citations — represent a new class of AI auditing tools that are becoming commercially available (Fortune, 2026). For research teams producing AI-assisted reports with citations, deploying citation verification tools as part of the review workflow is now a practical and cost-effective option.
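Even without a commercial tool, a first pass can be scripted: pull every DOI and arXiv identifier out of an AI-generated report so each one can be resolved by hand or routed to a checking service. The regexes below are deliberately simplified and will miss some citation formats; treat anything they miss as a prompt to verify manually, and the identifiers in the example are purely illustrative.

```python
# Simplified first pass at citation verification: extract identifiers for lookup.
import re

DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s,;\)]+", re.IGNORECASE)
ARXIV_PATTERN = re.compile(r"\barXiv[:\s]*(\d{4}\.\d{4,5})", re.IGNORECASE)

def extract_citations(report_text: str) -> dict[str, list[str]]:
    """Return de-duplicated DOIs and arXiv IDs found in the report."""
    return {
        "dois": sorted(set(DOI_PATTERN.findall(report_text))),
        "arxiv_ids": sorted(set(ARXIV_PATTERN.findall(report_text))),
    }

# Illustrative report snippet; the DOI is a made-up example.
report = "See Li et al. (doi:10.1234/example.2024.001) and the audit study at arXiv:2509.04499."
for kind, identifiers in extract_citations(report).items():
    for identifier in identifiers:
        print(f"VERIFY {kind[:-1]}: {identifier}")
```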
Part 6.5: The Trader’s Reality Check — Why We Still Use These Tools Anyway
Let us be honest for a moment. Everything we have covered so far might give you the impression that AI research tools are a liability wrapped in a confidence costume, handed to you by a technology industry that has very convincing marketing materials. And you might be wondering — if these tools hallucinate systematically, have inherent biases, cannot tell you when they are wrong, and require this much audit overhead — why bother?
That is a completely fair question. Here is the honest answer: because the volume of qualitative data in modern markets has outpaced the human capacity to process it, and the alternative — reading everything yourself — is not a serious option at the scale and speed at which contemporary markets operate.
A single earnings season across a mid-sized equity portfolio might involve dozens of transcripts, each running 20 to 40 pages. Regulatory filings, analyst conference commentary, ESG disclosures, management guidance, sector surveys — the volume is genuinely overwhelming for any human analyst operating at reasonable capacity. AI synthesis tools do not replace your judgment. They extend your reach. The audit framework in this article is not a reason to avoid AI tools. It is a reason to use them with appropriate rigour instead of naive trust.
Research examining the role of LLMs in equity investing synthesised findings from 84 separate studies published between 2022 and early 2025 and concluded that the role of these models is most powerful when they are used as a component of a structured analytical process rather than as a standalone oracle (Jadhav & Mirza, 2025, Frontiers in AI). That is the key word: component. Not foundation. Component.
The traders and analysts who are going to get the most out of the current generation of AI tools are the ones who understand the tools well enough to audit them — who can spot a plausible-sounding fabrication, who know which kinds of qualitative claims are highest-risk for hallucination, and who have built structured workflows that create accountability around AI outputs before those outputs inform decisions.
There is also a competitive advantage hiding inside the audit discipline. If two-thirds of market participants are using AI tools without adequate verification — and the KPMG/University of Melbourne study suggests that figure is approximately right — then traders who do audit their AI summaries are working from a more accurate information base than the majority of their peers. Over time, across hundreds of decisions, that accuracy differential compounds. It is not a flashy edge. It is not the kind of thing you discuss at a conference. But it is real, and it is durable.
Think of it this way. Everyone in the market has access to the same Bloomberg terminal. The edge is not the terminal. The edge is how you use it, what you check, what questions you ask of the data. AI research tools are becoming the new Bloomberg — ubiquitous, powerful, and utterly neutral about whether you use them well or badly. The audit framework is your edge inside the tool.
And look — let us not pretend that this is all grimly serious. There is something genuinely absurd about the fact that a financial industry that has spent decades building elaborate risk management frameworks, compliance architectures, and governance structures is now, in significant numbers, outsourcing qualitative research synthesis to a language model that learned what it knows by ingesting the internet and is constitutionally incapable of saying “I’m not sure.” That is funny. That is the kind of absurdity that ages well as a story you tell when the dust settles and everyone is writing their memoirs about the early days of AI in finance.
The version of you who builds the audit habit now is the version who has a very entertaining story to tell later. The version who does not is the version who has a different, less entertaining story to tell their risk committee.
Choose accordingly.
Part 7: The Philosophical Problem Nobody Talks About
Here is the uncomfortable truth that sits underneath all of this. Even if you implement every pillar of the audit framework, even if you run citation verification, control your RAG corpus, calibrate your prompts, and build an institutional culture of human-in-the-loop review — you are still working with a system that, at a fundamental level, cannot tell you when it does not know something.
Researchers have described this as the core epistemological problem of LLMs: they are designed to prioritise linguistic plausibility over factual correctness (Bubeck et al., 2023, cited in arXiv 2603.19944). They will produce confident-sounding output regardless of their actual epistemic state. The absence of hedging in an AI response is not evidence of certainty. It is just how the language model talks.
For a trader, this is the most important mental model to internalise. When a human analyst is uncertain, they say so. They hedge. They caveat. They tell you they need more time. When an AI is uncertain, it keeps going — same tone, same structure, same confidence — and you cannot tell the difference from the outside. The EY AI Sentiment Index 2025 found that fewer than one-third of users regularly verify AI-generated content (arXiv, Calibrated Trust, 2024). That means over two-thirds are implicitly treating AI confidence as epistemic confidence. It is not. It never has been. It is trained behaviour.
The solution is not to distrust AI. It is to understand what you are trusting when you trust it. You are trusting a pattern-matching system that is very good at synthesis and very bad at admitting ignorance. Use it for the former. Build safeguards for the latter.
Part 8: A Step-by-Step Audit Checklist for Traders
For practical use, here is a condensed audit checklist for traders reviewing AI-generated research summaries. This can be integrated into any research workflow.
Step 1 — Identify the Source Documents. Before reading the AI summary, confirm which source documents the AI was given or retrieved from. If you cannot identify the sources, you cannot audit the summary.
Step 2 — Run the Verbatim Spot-Check. Select five specific claims from the summary. Locate each one in the source document. Flag any claims you cannot locate verbatim as requiring further scrutiny.
Step 3 — Check Every Citation. For every cited paper, report, or external source in the AI summary, verify that the source exists, that the author and title are accurate, and that the cited content actually supports the claim being made.
Step 4 — Perform a Sentiment Calibration. Re-read the executive summary or key sections of the original document. Compare the overall tone to the AI’s characterisation. Note any significant divergences.
Step 5 — Verify Entity-Level Accuracy. Confirm that named individuals, their roles, their organisations, and any specific numerical data attributed to them are accurate. These are among the most common hallucination targets.
Step 6 — Check for Omissions. An AI summary that is accurate in what it includes can still mislead through omission. Scan the original document for significant risk disclosures, caveats, or qualifications that did not make it into the summary.
Step 7 — Apply Decision Materiality. Assess how central this summary is to the decision you are making. The more material the decision, the more exhaustive the audit. For high-conviction new positions built substantially on AI-summarised qualitative research, all seven steps are mandatory. For ongoing monitoring of positions already in your book, a lighter-touch version of steps 2 and 4 may be proportionate.
Step 8 — Document the Audit. Record which steps you completed, any discrepancies found, and how you resolved them. This is your audit trail.
Conclusion: Trust, But Verify — Then Verify Again
Auditing AI research summaries and preventing hallucinations in qualitative data is the new frontier of financial due diligence. It is not glamorous. It does not have the adrenaline of a live position or the drama of a macro call. But it is, increasingly, the invisible line between a research process that holds up and one that quietly leaks P&L through accumulated errors that nobody caught because they were too plausible to question.
The research is clear: AI hallucinations are systemic, inherent, and present across every LLM system currently available. They are most dangerous in qualitative domains where plausibility is hard to distinguish from accuracy. They are most consequential in financial settings where the decisions informed by research carry direct monetary risk. And they are most common in organisations where the cultural norm is trust rather than verify.
You are the trader. You have the market knowledge, the contextual judgement, the experience with the companies you cover. The AI has the speed and the synthesis capacity. Used together, with the right audit framework, they are powerful. Used without structure, they are a liability.
So the next time an AI summarises an earnings call for you in thirty seconds and it sounds exactly right — remember: it always sounds exactly right. That is the whole problem. And now, fortunately, you know what to do about it.
References
- Huang, Y. et al. (2025). AI Hallucination from Students’ Perspective: A Thematic Analysis. arXiv. https://arxiv.org/pdf/2602.17671
- Narayanan Venkit, P. et al. (2025). DeepTRACE: Auditing Deep Research AI Systems for Tracking Reliability Across Citations and Evidence. arXiv. https://arxiv.org/pdf/2509.04499
- Jadhav, A. & Mirza, V. (2025). Large Language Models in Equity Markets: Applications, Techniques, and Insights. Frontiers in Artificial Intelligence. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12421730/
- Rinott, R. et al. (2024). Calibrated Trust in Dealing with LLM Hallucinations: A Qualitative Study. arXiv. https://arxiv.org/pdf/2512.09088
- Fortune (January 2026). NeurIPS Research Papers Contained 100+ AI-Hallucinated Citations. Fortune Magazine. https://fortune.com/2026/01/21/neurips-ai-conferences-research-papers-hallucinations/
- Anon. (2023). Deficiency of Large Language Models in Finance: An Empirical Examination of Hallucination. arXiv. https://arxiv.org/pdf/2311.15548
- Banerjee, S., Agarwal, A. & Singla, S. (2024). Valuable Hallucinations: Realizable Non-Realistic Propositions. arXiv. https://arxiv.org/pdf/2502.11113
- Pelster, M. & Val, F. (2024), Lopez-Lira, T. & Tang, Y. (2023), and others cited in: Jadhav, A. & Mirza, V. (2025). Large Language Models and Stock Investing: Is the Human Factor Required? arXiv. https://arxiv.org/pdf/2603.19944
- Lee, K. et al. (2025). Can Blindfolded LLMs Still Trade? An Anonymization-First Framework for Portfolio Optimisation. arXiv. https://arxiv.org/pdf/2603.17692
- Li, J., Yuan, Y. & Zhang, Z. (2024). Enhancing LLM Factual Accuracy with RAG to Counter Hallucinations. Cited in: MIT Sloan Teaching & Learning Technologies. https://mitsloanedtech.mit.edu/ai/basics/addressing-ai-hallucinations-and-bias/
- Alotaibi, E. et al. (2024). Artificial Intelligence in Auditing: A Conceptual Framework for Auditing Practices. Administrative Sciences, MDPI. https://www.mdpi.com/2076-3387/14/10/238
- Raji, I.D. et al. (2024). Making It Possible for the Auditing of AI: A Systematic Review of AI Audits and AI Auditability. ResearchGate / ACM FAccT 2024. https://www.researchgate.net/publication/381920647_Making_It_Possible_for_the_Auditing_of_AI_A_Systematic_Review_of_AI_Audits_and_AI_Auditability
- BizTech Magazine (2025). LLM Hallucinations: What Are the Implications for Financial Institutions? https://biztechmagazine.com/article/2025/08/llm-hallucinations-what-are-implications-financial-institutions
- Farquhar, S. et al. (2024). Semantic entropy-based hallucination detection. Cited in: Huang et al. (2025) arXiv. https://arxiv.org/pdf/2602.17671
- Salvagno, M., Taccone, F.S. & Gerli, A.G. (2023). Artificial Intelligence Hallucinations. Critical Care, BioMed Central. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10170715/
Disclaimer: This article is for educational and informational purposes only and does not constitute financial advice. Trading financial instruments carries significant risk of loss. Always conduct your own due diligence and consult a qualified financial professional before making investment decisions.
