Reproducible · Independently verifiable

Benchmarks

Every accuracy claim about SureCiteAI is traceable to a numbered run on this page. Six benchmark suites. 297 test cases. Public methodology. Reproducible from the GitHub repo with one command.

0 citation hallucinations across 297 cases
96% retrieval hit-rate on FinanceBench
221/297 aggregate pass rate (74%)
6 benchmark suites · 4 public datasets

Latest run: 2026-04-27T21:35:36Z · Configuration: RERANKER_PROVIDER=cohere · RAG_GROUNDING_PENALTY_ENABLED=true

Why this page exists

Most RAG vendors quote percentages from internal evals on private data. Those numbers are unfalsifiable — there is nothing a buyer can independently verify. We made a deliberate choice the other way.

The numbers below are computed by a runner whose source code is public, on golden cases anyone can regenerate from public corpora, against PDFs anyone can download from the original publishers. If a number is on this page, it is reproducible. If a number is not on this page or in benchmarks/runs/, treat it as marketing copy, not measurement.

Latest scorecard

Six suites · 297 test cases · run on 2026-04-27T21:35:36Z.

FinanceBench
PatronusAI · 84 SEC filings · 32 issuers
Pass 95/150 (63%) · Retrieval hit 144/150 (96%) · Hallucinations 0/150 · License: CC-BY-NC-4.0 (aggregate scores fair-use citable)

Legal (CUAD v1)
Atticus Project · CC-BY-4.0
Pass 17/35 (49%) · Retrieval hit 19/27 (70%) · Hallucinations 0/35 · License: Public

Healthcare (openFDA)
FDA drug labels · public domain
Pass 35/35 (100%) · Retrieval hit 27/27 (100%) · Hallucinations 0/35 · License: Public

Accounting (SEC EDGAR)
10-K filings · public domain
Pass 35/35 (100%) · Retrieval hit 23/23 (100%) · Hallucinations 0/35 · License: Public

Real Estate
Internal corpus
Pass 32/35 (91%) · Retrieval hit 20/20 (100%) · Hallucinations 0/35 · License: Internal

Consulting
Internal demo
Pass 7/7 (100%) · Retrieval hit 6/6 (100%) · Hallucinations 0/7 · License: Internal

AGGREGATE
Pass 221/297 (74%) · Retrieval hit 239/253 (94%) · Hallucinations 0/297
FinanceBench: External public benchmark (Patronus AI, NeurIPS 2023). The largest suite, at 150 of the 297 cases.
Legal (CUAD v1): Adversarial: many cases test for clause categories absent from the contract. Correct behaviour is abstention.
Healthcare (openFDA): Drug-label retrieval and dosage-instruction comprehension.
Accounting (SEC EDGAR): Risk-factor extraction, MD&A questions, footnote retrieval.

What we measure

Retrieval hit-rate

Was the expected source document returned in the retrieved context? Scored only on cases where the question is answerable from the corpus.
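
A minimal sketch of how this can be scored, assuming a golden-case shape like the one below (the interface and field names are illustrative, not the runner's actual types):

// Illustrative types: the real golden-case schema may differ.
interface GoldenCase {
  answerable: boolean;        // can the question be answered from the corpus?
  expectedSource: string;     // filename the answer should come from
  retrievedSources: string[]; // filenames returned by retrieval for this case
}

function retrievalHitRate(cases: GoldenCase[]): number {
  // Only answerable cases count toward the denominator.
  const scored = cases.filter((c) => c.answerable);
  const hits = scored.filter((c) => c.retrievedSources.includes(c.expectedSource));
  return hits.length / scored.length;
}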

Citation hallucination rate

Did the LLM cite a source filename that wasn't in the retrieved context? The single most damaging failure mode for a citation-first product. We hold this to zero by construction.
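
Conceptually the check is a set-membership test: any cited filename that is not in the retrieved context counts as a hallucination. The sketch below shows only the scoring side, with illustrative names; it is not the production guard.

// Illustrative scoring sketch, not the production guard.
function countCitationHallucinations(
  citedSources: string[],      // filenames the answer cites
  retrievedSources: string[],  // filenames actually present in the retrieved context
): number {
  const allowed = new Set(retrievedSources);
  return citedSources.filter((source) => !allowed.has(source)).length;
}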

Abstention accuracy

Did the system correctly answer when grounded, and correctly refuse when the answer isn't in the corpus? Measured per-suite.
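
Roughly: a case passes if the system answered correctly when the answer is in the corpus, or abstained when it is not. A hedged sketch, with illustrative field names:

// Illustrative sketch of per-suite abstention scoring.
interface ScoredCase {
  answerable: boolean; // ground truth: is the answer in the corpus?
  abstained: boolean;  // did the system refuse to answer?
  correct: boolean;    // if it answered, was the answer judged correct?
}

function abstentionAccuracy(cases: ScoredCase[]): number {
  const passes = cases.filter((c) =>
    c.answerable ? !c.abstained && c.correct : c.abstained,
  );
  return passes.length / cases.length;
}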

Calibration (ECE, Brier, AUROC)

Does the confidence score actually mean what it says? Lower ECE = better calibration. Computed per-suite — aggregating across suites is mathematically meaningless and we don't do it.
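
For reference, ECE bins cases by reported confidence and averages the gap between each bin's mean confidence and its observed accuracy, weighted by bin size. A minimal sketch with ten equal-width bins, not the runner's exact implementation:

// Minimal ECE sketch: ten equal-width confidence bins, purely illustrative.
function expectedCalibrationError(confidences: number[], correct: boolean[], bins = 10): number {
  const n = confidences.length;
  let ece = 0;
  for (let b = 0; b < bins; b++) {
    const lo = b / bins;
    const hi = (b + 1) / bins;
    // Bin membership is (lo, hi]; the first bin also takes confidence 0.
    const idx = confidences
      .map((_, i) => i)
      .filter((i) => (b === 0 ? confidences[i] <= hi : confidences[i] > lo && confidences[i] <= hi));
    if (idx.length === 0) continue;
    const avgConf = idx.reduce((sum, i) => sum + confidences[i], 0) / idx.length;
    const accuracy = idx.filter((i) => correct[i]).length / idx.length;
    ece += (idx.length / n) * Math.abs(accuracy - avgConf);
  }
  return ece;
}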

Reranker ablation

Three back-to-back runs of the same 147-case suite with only the reranker provider changed. This is how the per-tier reranker policy was decided.

Provider · Tier · Pass rate · Hallucinations
None (hybrid only) · Trial · 122/147 (83%) · 0/147
Cohere rerank-3.5 · Enterprise / Custom · 126/147 (86%) · 0/147
Voyage rerank-2.5-lite · Paid (Solo / Team / Business / Scale) · 123/147 (84%) · 1/147 (legal)

Cohere is the enterprise default · Voyage Lite is the paid-tier default (~10× cheaper than Cohere) · Trial users get hybrid-only with the same zero hallucination guarantee.
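
In code terms the policy might look like the sketch below. The tier names and return values are illustrative, not the product's actual configuration; only the mapping (trial to hybrid-only, paid tiers to Voyage Lite, enterprise and custom to Cohere) comes from the ablation above.

// Illustrative per-tier reranker policy; not the product's actual config code.
type Tier = 'trial' | 'solo' | 'team' | 'business' | 'scale' | 'enterprise' | 'custom';

function rerankerForTier(tier: Tier): 'none' | 'voyage-rerank-2.5-lite' | 'cohere-rerank-3.5' {
  switch (tier) {
    case 'trial':
      return 'none';                   // hybrid retrieval only, no reranker
    case 'enterprise':
    case 'custom':
      return 'cohere-rerank-3.5';      // best measured pass rate in the ablation (86%)
    default:
      return 'voyage-rerank-2.5-lite'; // paid tiers: ~10x cheaper than Cohere
  }
}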

Reproduce locally

Clone the repo, set up a Pinecone index, populate .env.local with the keys listed in .env.example, then:

# 1. Provision the eval tenants (idempotent).
npx tsx scripts/eval/setup-eval-tenants.ts

# 2. Pull the FinanceBench corpus (84 PDFs, ~12 min).
npx tsx scripts/eval/sample-financebench.ts
npx tsx scripts/eval/pull-financebench-corpus.ts

# 3. Ingest into the per-suite Pinecone namespaces.
npx tsx scripts/eval/ingest-eval-docs.ts --tenant eval-financebench

# 4. Run the full suite (~50 min wall time).
RERANKER_PROVIDER=cohere npm run eval:all

# 5. Publish redacted artifact to benchmarks/runs/.
npx tsx scripts/eval/publish-run.ts

Want to verify these numbers on your own documents?

Start a 21-day free trial — no credit card required. Upload your contracts, workpapers, or filings and get cited answers in minutes.