Benchmarks
Every accuracy claim about SureCiteAI is traceable to a numbered run on this page. Six benchmark suites. 297 test cases. Public methodology. Reproducible from the GitHub repo with one command.
Latest run: 2026-04-27T21:35:36Z · Configuration: RERANKER_PROVIDER=cohere · RAG_GROUNDING_PENALTY_ENABLED=true
Why this page exists
Most RAG vendors quote percentages from internal evals on private data. Those numbers are unfalsifiable: there is nothing a buyer can independently verify. We deliberately made the opposite choice.
The numbers below are computed by a runner whose source code is public, on golden cases anyone can regenerate from public corpora, against PDFs anyone can download from the original publishers. If a number is on this page, it is reproducible. If a number is not on this page or in benchmarks/runs/, treat it as marketing copy, not measurement.
Latest scorecard
Six suites · 297 test cases · run on 2026-04-27T21:35:36Z.
| Suite | Pass | Retrieval hit | Hallucinations | License |
|---|---|---|---|---|
| FinanceBench (PatronusAI · 84 SEC filings · 32 issuers) | 95/150 (63%) | 144/150 (96%) | 0/150 | CC-BY-NC-4.0 (aggregate scores fair-use citable) |
| Legal (CUAD v1 · Atticus Project) | 17/35 (49%) | 19/27 (70%) | 0/35 | CC-BY-4.0 |
| Healthcare (openFDA drug labels) | 35/35 (100%) | 27/27 (100%) | 0/35 | Public domain |
| Accounting (SEC EDGAR 10-K filings) | 35/35 (100%) | 23/23 (100%) | 0/35 | Public domain |
| Real Estate (internal corpus) | 32/35 (91%) | 20/20 (100%) | 0/35 | Internal |
| Consulting (internal demo) | 7/7 (100%) | 6/6 (100%) | 0/7 | Internal |
| AGGREGATE | 221/297 (74%) | 239/253 (94%) | 0/297 | — |
What we measure
Retrieval hit-rate
Was the expected source document returned in the retrieved context? Scored only on cases where the question is answerable from the corpus.
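The check itself is simple. A minimal sketch in TypeScript, with illustrative field names rather than the runner's actual schema:

```typescript
// Sketch of retrieval hit-rate scoring. Field names are illustrative.
interface RetrievalCase {
  answerable: boolean;        // can the question be answered from the corpus?
  expectedSources: string[];  // filenames the golden answer is drawn from
  retrievedSources: string[]; // filenames present in the retrieved context
}

function retrievalHitRate(cases: RetrievalCase[]): number {
  // Only answerable cases are scored.
  const scorable = cases.filter((c) => c.answerable);
  const hits = scorable.filter((c) =>
    c.expectedSources.some((src) => c.retrievedSources.includes(src)),
  );
  return hits.length / scorable.length;
}
```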
Citation hallucination rate
Did the LLM cite a source filename that wasn't in the retrieved context? The single most damaging failure mode for a citation-first product. We hold this to zero by construction.
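A minimal sketch of the check, again with illustrative field names:

```typescript
// Sketch of the citation-hallucination check. A citation counts as
// hallucinated when it names a file that never appeared in the
// retrieved context.
interface AnswerRecord {
  citedSources: string[];     // filenames cited in the generated answer
  retrievedSources: string[]; // filenames present in the retrieved context
}

function hallucinatedCitations(a: AnswerRecord): string[] {
  return a.citedSources.filter((src) => !a.retrievedSources.includes(src));
}

function hallucinationRate(answers: AnswerRecord[]): number {
  const failing = answers.filter((a) => hallucinatedCitations(a).length > 0);
  return failing.length / answers.length;
}
```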
Abstention accuracy
Did the system correctly answer when grounded, and correctly refuse when the answer isn't in the corpus? Measured per-suite.
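In sketch form, with illustrative field names:

```typescript
// Sketch of abstention accuracy: answer when the corpus supports it,
// refuse when it does not. Field names are illustrative.
interface AbstentionCase {
  answerable: boolean; // ground truth: is the answer in the corpus?
  abstained: boolean;  // did the system refuse to answer?
}

function abstentionAccuracy(cases: AbstentionCase[]): number {
  const correct = cases.filter(
    // answered when answerable, refused when not
    (c) => c.answerable !== c.abstained,
  );
  return correct.length / cases.length;
}
```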
Calibration (ECE, Brier, AUROC)
Does the confidence score actually mean what it says? Lower ECE = better calibration. Computed per-suite — aggregating across suites is mathematically meaningless and we don't do it.
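For concreteness, here are sketches of how ECE and the Brier score can be computed per suite; the bin count and field names are illustrative, not the runner's actual configuration.

```typescript
// Sketches of per-suite calibration metrics. Assumes each case records a
// confidence in [0, 1] and a pass/fail outcome.
interface ScoredCase {
  confidence: number; // model-reported confidence, 0..1
  correct: boolean;   // did the case pass?
}

// Expected Calibration Error: weighted gap between mean confidence and
// observed accuracy across equal-width confidence bins.
function expectedCalibrationError(cases: ScoredCase[], bins = 10): number {
  let ece = 0;
  for (let b = 0; b < bins; b++) {
    const lo = b / bins;
    const hi = (b + 1) / bins;
    const inBin = cases.filter(
      (c) => c.confidence >= lo && (c.confidence < hi || b === bins - 1),
    );
    if (inBin.length === 0) continue;
    const meanConf = inBin.reduce((s, c) => s + c.confidence, 0) / inBin.length;
    const accuracy = inBin.filter((c) => c.correct).length / inBin.length;
    ece += (inBin.length / cases.length) * Math.abs(meanConf - accuracy);
  }
  return ece;
}

// Brier score: mean squared error between confidence and the 0/1 outcome.
function brierScore(cases: ScoredCase[]): number {
  const sum = cases.reduce(
    (s, c) => s + (c.confidence - (c.correct ? 1 : 0)) ** 2,
    0,
  );
  return sum / cases.length;
}
```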
Reranker ablation
Three back-to-back runs of the same 147-case suite with only the reranker provider changed. This is how the per-tier reranker policy was decided.
| Provider | Tier | Pass rate | Hallucinations |
|---|---|---|---|
| None (hybrid only) | Trial | 122/147 (83%) | 0/147 |
| Cohere rerank-3.5 | Enterprise / Custom | 126/147 (86%) | 0/147 |
| Voyage rerank-2.5-lite | Paid (Solo / Team / Business / Scale) | 123/147 (84%) | 1/147 (legal) |
Cohere is the enterprise default · Voyage Lite is the paid-tier default (~10× cheaper than Cohere) · Trial users get hybrid-only with the same zero hallucination guarantee.
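The ablation itself is just three invocations of the repro command below with different RERANKER_PROVIDER values. A sketch of a driver script follows; the provider identifiers other than cohere are assumptions, so check .env.example for the exact values the runner accepts.

```typescript
// Sketch of an ablation driver: run the same suite three times, changing
// only RERANKER_PROVIDER. Only "cohere" appears verbatim in the repro
// commands below; the other identifiers are assumptions.
import { execSync } from "node:child_process";

const providers = ["none", "voyage", "cohere"];

for (const provider of providers) {
  console.log(`=== eval:all with RERANKER_PROVIDER=${provider} ===`);
  execSync("npm run eval:all", {
    stdio: "inherit",
    env: { ...process.env, RERANKER_PROVIDER: provider },
  });
}
```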
Reproduce locally
Clone the repo, set up a Pinecone index, populate .env.local with the keys listed in .env.example, then:
```bash
# 1. Provision the eval tenants (idempotent).
npx tsx scripts/eval/setup-eval-tenants.ts

# 2. Pull the FinanceBench corpus (84 PDFs, ~12 min).
npx tsx scripts/eval/sample-financebench.ts
npx tsx scripts/eval/pull-financebench-corpus.ts

# 3. Ingest into the per-suite Pinecone namespaces.
npx tsx scripts/eval/ingest-eval-docs.ts --tenant eval-financebench

# 4. Run the full suite (~50 min wall time).
RERANKER_PROVIDER=cohere npm run eval:all

# 5. Publish redacted artifact to benchmarks/runs/.
npx tsx scripts/eval/publish-run.ts
```
References
- FinanceBench: A New Benchmark for Financial Question Answering — Patronus AI, NeurIPS 2023 (arXiv:2311.11944)
- CUAD: Contract Understanding Atticus Dataset — Hendrycks et al., NeurIPS 2021
- RAGAS: Automated Evaluation of Retrieval Augmented Generation — Es et al., 2023 (arXiv:2309.15217)
- openFDA — U.S. Food and Drug Administration
- SEC EDGAR — U.S. Securities and Exchange Commission
Want to verify these numbers on your own documents?
Start a 21-day free trial — no credit card required. Upload your contracts, workpapers, or filings and get cited answers in minutes.