Reproducible · Independently verifiable

Benchmarks

Every accuracy claim about SureCiteAI is traceable to a numbered run on this page. Six benchmark suites. 297 test cases. Public methodology. Reproducible from the GitHub repo with one command.

0 citation hallucinations across 297 cases
96% retrieval hit-rate on FinanceBench
221/297 aggregate pass rate (74%)
6 benchmark suites · 4 public datasets

Latest run: 2026-04-27T21:35:36Z · Configuration: RERANKER_PROVIDER=cohere · RAG_GROUNDING_PENALTY_ENABLED=true

Why this page exists

Most RAG vendors quote percentages from internal evals on private data. Those numbers are unfalsifiable — there is nothing a buyer can independently verify. We made a deliberate choice the other way.

The numbers below are computed by a runner whose source code is public, on golden cases anyone can regenerate from public corpora, against PDFs anyone can download from the original publishers. If a number is on this page, it is reproducible. If a number is not on this page or in benchmarks/runs/, treat it as marketing copy, not measurement.

Latest scorecard

Six suites · 297 test cases · run on 2026-04-27T21:35:36Z.

FinanceBench
PatronusAI · 84 SEC filings · 32 issuers
Pass 95/150 (63%) · Retrieval hit 144/150 (96%) · Hallucinations 0/150 · License: CC-BY-NC-4.0 (aggregate scores fair-use citable)

Legal (CUAD v1)
Atticus Project · CC-BY-4.0
Pass 17/35 (49%) · Retrieval hit 19/27 (70%) · Hallucinations 0/35 · License: Public

Healthcare (openFDA)
FDA drug labels · public domain
Pass 35/35 (100%) · Retrieval hit 27/27 (100%) · Hallucinations 0/35 · License: Public

Accounting (SEC EDGAR)
10-K filings · public domain
Pass 35/35 (100%) · Retrieval hit 23/23 (100%) · Hallucinations 0/35 · License: Public

Real Estate
Internal corpus
Pass 32/35 (91%) · Retrieval hit 20/20 (100%) · Hallucinations 0/35 · License: Internal

Consulting
Internal demo
Pass 7/7 (100%) · Retrieval hit 6/6 (100%) · Hallucinations 0/7 · License: Internal

AGGREGATE
Pass 221/297 (74%) · Retrieval hit 239/253 (94%) · Hallucinations 0/297
FinanceBench: External public benchmark (Patronus AI, NeurIPS 2023). The largest suite, at 150 of the 297 cases.
Legal (CUAD v1): Adversarial: many cases test for clause categories absent from the contract. Correct behaviour is abstention.
Healthcare (openFDA): Drug-label retrieval and dosage-instruction comprehension.
Accounting (SEC EDGAR): Risk-factor extraction, MD&A questions, footnote retrieval.

What we measure

Retrieval hit-rate

Was the expected source document returned in the retrieved context? Scored only on cases where the question is answerable from the corpus.
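
A minimal sketch of how this can be scored, assuming a golden-case shape like the one below (the interface and field names are illustrative, not the runner's actual types):

// Illustrative types: the real golden-case schema may differ.
interface GoldenCase {
  answerable: boolean;        // can the question be answered from the corpus?
  expectedSource: string;     // filename the answer should come from
  retrievedSources: string[]; // filenames returned by retrieval for this case
}

function retrievalHitRate(cases: GoldenCase[]): number {
  // Only answerable cases count toward the denominator.
  const scored = cases.filter((c) => c.answerable);
  const hits = scored.filter((c) => c.retrievedSources.includes(c.expectedSource));
  return hits.length / scored.length;
}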

Citation hallucination rate

Did the LLM cite a source filename that wasn't in the retrieved context? The single most damaging failure mode for a citation-first product. We hold this to zero by construction.
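
Conceptually the check is a set-membership test: any cited filename that is not in the retrieved context counts as a hallucination. The sketch below shows only the scoring side, with illustrative names; it is not the production guard.

// Illustrative scoring sketch, not the production guard.
function countCitationHallucinations(
  citedSources: string[],      // filenames the answer cites
  retrievedSources: string[],  // filenames actually present in the retrieved context
): number {
  const allowed = new Set(retrievedSources);
  return citedSources.filter((source) => !allowed.has(source)).length;
}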

Abstention accuracy

Did the system correctly answer when grounded, and correctly refuse when the answer isn't in the corpus? Measured per-suite.
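
Roughly: a case passes if the system answered correctly when the answer is in the corpus, or abstained when it is not. A hedged sketch, with illustrative field names:

// Illustrative sketch of per-suite abstention scoring.
interface ScoredCase {
  answerable: boolean; // ground truth: is the answer in the corpus?
  abstained: boolean;  // did the system refuse to answer?
  correct: boolean;    // if it answered, was the answer judged correct?
}

function abstentionAccuracy(cases: ScoredCase[]): number {
  const passes = cases.filter((c) =>
    c.answerable ? !c.abstained && c.correct : c.abstained,
  );
  return passes.length / cases.length;
}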

Calibration (ECE, Brier, AUROC)

Does the confidence score actually mean what it says? Lower ECE = better calibration. Computed per-suite — aggregating across suites is mathematically meaningless and we don't do it.
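
For reference, ECE bins cases by reported confidence and averages the gap between each bin's mean confidence and its observed accuracy, weighted by bin size. A minimal sketch with ten equal-width bins, not the runner's exact implementation:

// Minimal ECE sketch: ten equal-width confidence bins, purely illustrative.
function expectedCalibrationError(confidences: number[], correct: boolean[], bins = 10): number {
  const n = confidences.length;
  let ece = 0;
  for (let b = 0; b < bins; b++) {
    const lo = b / bins;
    const hi = (b + 1) / bins;
    // Bin membership is (lo, hi]; the first bin also takes confidence 0.
    const idx = confidences
      .map((_, i) => i)
      .filter((i) => (b === 0 ? confidences[i] <= hi : confidences[i] > lo && confidences[i] <= hi));
    if (idx.length === 0) continue;
    const avgConf = idx.reduce((sum, i) => sum + confidences[i], 0) / idx.length;
    const accuracy = idx.filter((i) => correct[i]).length / idx.length;
    ece += (idx.length / n) * Math.abs(accuracy - avgConf);
  }
  return ece;
}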

Reranker ablation

Three back-to-back runs of the same 147-case suite with only the reranker provider changed. This is how the per-tier reranker policy was decided.

Provider · Tier · Pass rate · Hallucinations
None (hybrid only) · Trial · 122/147 (83%) · 0/147
Cohere rerank-3.5 · Enterprise / Custom · 126/147 (86%) · 0/147
Voyage rerank-2.5-lite · Paid (Solo / Team / Business / Scale) · 123/147 (84%) · 1/147 (legal)

Cohere is the enterprise default · Voyage Lite is the paid-tier default (~10× cheaper than Cohere) · Trial users get hybrid-only with the same zero hallucination guarantee.
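
In code terms the policy might look like the sketch below. The tier names and return values are illustrative, not the product's actual configuration; only the mapping (trial to hybrid-only, paid tiers to Voyage Lite, enterprise and custom to Cohere) comes from the ablation above.

// Illustrative per-tier reranker policy; not the product's actual config code.
type Tier = 'trial' | 'solo' | 'team' | 'business' | 'scale' | 'enterprise' | 'custom';

function rerankerForTier(tier: Tier): 'none' | 'voyage-rerank-2.5-lite' | 'cohere-rerank-3.5' {
  switch (tier) {
    case 'trial':
      return 'none';                   // hybrid retrieval only, no reranker
    case 'enterprise':
    case 'custom':
      return 'cohere-rerank-3.5';      // best measured pass rate in the ablation (86%)
    default:
      return 'voyage-rerank-2.5-lite'; // paid tiers: ~10x cheaper than Cohere
  }
}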

Reproduce locally

Clone the repo, set up a Pinecone index, populate .env.local with the keys listed in .env.example, then:

# 1. Provision the eval tenants (idempotent).
npx tsx scripts/eval/setup-eval-tenants.ts

# 2. Pull the FinanceBench corpus (84 PDFs, ~12 min).
npx tsx scripts/eval/sample-financebench.ts
npx tsx scripts/eval/pull-financebench-corpus.ts

# 3. Ingest into the per-suite Pinecone namespaces.
npx tsx scripts/eval/ingest-eval-docs.ts --tenant eval-financebench

# 4. Run the full suite (~50 min wall time).
RERANKER_PROVIDER=cohere npm run eval:all

# 5. Publish redacted artifact to benchmarks/runs/.
npx tsx scripts/eval/publish-run.ts

Want to verify these numbers on your own documents?

Start a 21-day free trial — no credit card required. Upload your contracts, workpapers, or filings and get cited answers in minutes.