Metadata-Version: 2.4
Name: financebench-rag-agent
Version: 0.1.0
Summary: Multi-agent RAG system for RBAC-secured financial document Q&A. 72.7% on FinanceBench. Ships a CLI client + self-hostable FastAPI backend.
Project-URL: Homepage, https://github.com/Rishabhmannu/financebench-rag-agent
Project-URL: Repository, https://github.com/Rishabhmannu/financebench-rag-agent
Project-URL: Issues, https://github.com/Rishabhmannu/financebench-rag-agent/issues
Author-email: Rishabh Kumar <rishabhkumards07@gmail.com>
License-Expression: MIT
Keywords: agent,finance,financebench,hitl,langgraph,llm,rag,rbac
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: <3.14,>=3.11
Requires-Dist: alembic<2.0,>=1.13
Requires-Dist: docling<3.0,>=2.70
Requires-Dist: fastapi<1.0,>=0.115
Requires-Dist: fastembed<1.0,>=0.4
Requires-Dist: gradio<6.0,>=5.0
Requires-Dist: httpx<1.0,>=0.28
Requires-Dist: langchain-anthropic<1.0,>=0.3
Requires-Dist: langchain-core<1.0,>=0.3
Requires-Dist: langchain-groq<1.0,>=0.2
Requires-Dist: langchain-openai<1.0,>=0.3
Requires-Dist: langchain<1.0,>=0.3
Requires-Dist: langgraph-checkpoint-postgres<3.0,>=2.0
Requires-Dist: langgraph<1.0,>=0.6
Requires-Dist: langsmith<1.0,>=0.4
Requires-Dist: llm-guard<1.0,>=0.3
Requires-Dist: openai<3.0,>=1.60
Requires-Dist: presidio-analyzer<3.0,>=2.2
Requires-Dist: presidio-anonymizer<3.0,>=2.2
Requires-Dist: psycopg[binary]<4.0,>=3.2
Requires-Dist: pydantic-settings<3.0,>=2.6
Requires-Dist: pydantic<3.0,>=2.9
Requires-Dist: pyjwt<3.0,>=2.9
Requires-Dist: pypdf<7.0,>=5.0
Requires-Dist: python-dotenv<2.0,>=1.0
Requires-Dist: python-jose[cryptography]<4.0,>=3.3
Requires-Dist: python-multipart>=0.0.18
Requires-Dist: qdrant-client<2.0,>=1.13
Requires-Dist: redis<6.0,>=5.0
Requires-Dist: sentence-transformers<5.0,>=3.0
Requires-Dist: sse-starlette<3.0,>=2.0
Requires-Dist: uvicorn[standard]<1.0,>=0.34
Requires-Dist: voyageai<1.0,>=0.3
Provides-Extra: cli
Requires-Dist: httpx-sse<1.0,>=0.4; extra == 'cli'
Requires-Dist: prompt-toolkit<4.0,>=3.0; extra == 'cli'
Requires-Dist: rich<15.0,>=13.0; extra == 'cli'
Requires-Dist: typer<1.0,>=0.12; extra == 'cli'
Provides-Extra: dev
Requires-Dist: build<2.0,>=1.2; extra == 'dev'
Requires-Dist: datasets<4.0,>=3.0; extra == 'dev'
Requires-Dist: deepeval<4.0,>=3.9; extra == 'dev'
Requires-Dist: edgartools<5.0,>=3.0; extra == 'dev'
Requires-Dist: fpdf2<3.0,>=2.8; extra == 'dev'
Requires-Dist: httpx-sse<1.0,>=0.4; extra == 'dev'
Requires-Dist: patronus<1.0,>=0.1; extra == 'dev'
Requires-Dist: pre-commit<5.0,>=4.0; extra == 'dev'
Requires-Dist: prompt-toolkit<4.0,>=3.0; extra == 'dev'
Requires-Dist: pytest-asyncio<1.0,>=0.24; extra == 'dev'
Requires-Dist: pytest-timeout<3.0,>=2.3; extra == 'dev'
Requires-Dist: pytest<9.0,>=8.0; extra == 'dev'
Requires-Dist: ragas<1.0,>=0.2; extra == 'dev'
Requires-Dist: reportlab<5.0,>=4.0; extra == 'dev'
Requires-Dist: requests<3.0,>=2.31; extra == 'dev'
Requires-Dist: rich<15.0,>=13.0; extra == 'dev'
Requires-Dist: ruff<1.0,>=0.8; extra == 'dev'
Requires-Dist: twine<7.0,>=5.0; extra == 'dev'
Requires-Dist: typer<1.0,>=0.12; extra == 'dev'
Requires-Dist: xgboost<4.0,>=2.1; extra == 'dev'
Provides-Extra: eval
Requires-Dist: datasets<4.0,>=3.0; extra == 'eval'
Requires-Dist: deepeval<4.0,>=3.9; extra == 'eval'
Requires-Dist: patronus<1.0,>=0.1; extra == 'eval'
Requires-Dist: ragas<1.0,>=0.2; extra == 'eval'
Provides-Extra: scripts
Requires-Dist: edgartools<5.0,>=3.0; extra == 'scripts'
Requires-Dist: fpdf2<3.0,>=2.8; extra == 'scripts'
Requires-Dist: reportlab<5.0,>=4.0; extra == 'scripts'
Requires-Dist: requests<3.0,>=2.31; extra == 'scripts'
Requires-Dist: xgboost<4.0,>=2.1; extra == 'scripts'
Provides-Extra: test
Requires-Dist: pytest-asyncio<1.0,>=0.24; extra == 'test'
Requires-Dist: pytest-timeout<3.0,>=2.3; extra == 'test'
Requires-Dist: pytest<9.0,>=8.0; extra == 'test'
Description-Content-Type: text/markdown

# FinanceBench RAG Agent

[![PyPI](https://img.shields.io/pypi/v/financebench-rag-agent.svg)](https://pypi.org/project/financebench-rag-agent/)
[![Python 3.12](https://img.shields.io/badge/python-3.12-blue.svg)](https://www.python.org/downloads/)
[![LangGraph 0.6](https://img.shields.io/badge/LangGraph-0.6-green.svg)](https://github.com/langchain-ai/langgraph)
[![Tests](https://img.shields.io/badge/tests-340%20passing-brightgreen.svg)]()
[![FinanceBench](https://img.shields.io/badge/FinanceBench-72.7%25%20pass-blue.svg)]()
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/Rishabhmannu/financebench-rag-agent/blob/main/LICENSE)

A multi-agent RAG system for role-based access-controlled financial document Q&A. Achieves **72.7% correctness pass rate** on the public FinanceBench benchmark using selective agentic retrieval, a BGE cross-encoder reranker, and a self-hosted LLM observability stack.

## Try it

```bash
pip install financebench-rag-agent
financebench setup     # brings up the 4-service docker stack, seeds a sample corpus
financebench login -u analyst    # password analyst123
financebench chat
```

![RBAC role-switch demo](https://raw.githubusercontent.com/Rishabhmannu/financebench-rag-agent/main/docs/demos/rbac.gif)

Multi-party HITL approval workflow and conversation memory have their own walkthroughs in [docs/cli.md](https://github.com/Rishabhmannu/financebench-rag-agent/blob/main/docs/cli.md). Self-hosting the backend (env vars, full vs minimal stack, production hardening) is in [docs/deploy.md](https://github.com/Rishabhmannu/financebench-rag-agent/blob/main/docs/deploy.md).

## Architecture

```mermaid
flowchart TD
    Q([Query + JWT]) --> RBAC[rbac_gate<br/>JWT to Qdrant filter]
    RBAC --> Guard[guardrails<br/>regex to LLM Guard to LLM classifier]
    Guard -->|blocked| Block([blocked])
    Guard --> Route{router}
    Route -->|simple_lookup| Direct[retrieval → reranker → grader → generator]
    Route -->|research_required| Agent[[research_agent subgraph<br/>decompose → retrieve → grade → sufficiency → synthesize<br/>5-turn cap]]
    Direct --> Halu[hallucination_checker]
    Agent --> Halu
    Halu -->|ungrounded, retry up to 2| Direct
    Halu --> HITL{hitl_gate}
    HITL -->|amount above role threshold| Pause([pause for human approval])
    HITL --> Out([Answer + sources])
```

A router classifies each query as a simple lookup or research-required. Simple lookups take the fast direct path; research queries enter a multi-turn subgraph that decomposes the question, retrieves per sub-question, grades sufficiency, and synthesizes a final answer. RBAC is enforced at the Qdrant payload-filter level — agentic queries cannot bypass access control. High-stakes answers (above a per-role dollar threshold) pause via LangGraph's `interrupt()` for human approval, with state checkpointed to Postgres.

## Tech stack

- **Backend** — FastAPI · LangGraph · Qdrant · PostgreSQL · Redis · PyJWT
- **Client** — `financebench` CLI: typer · rich · prompt_toolkit · httpx-sse · token-streaming over SSE
- **Frontend** — Next.js 16 · React 19 · Tailwind · shadcn/ui  *(in progress; CLI is the canonical client)*
- **LLMs** — Claude Sonnet 4.6 · gpt-4o-mini · Llama 3.3 (via Groq)
- **Retrieval** — voyage-finance-2 embeddings · BGE-reranker-v2-m3 cross-encoder
- **Observability** — self-hosted LiteLLM proxy + Langfuse v3 + Redis semantic cache
- **Safety** — Microsoft Presidio PII detection · LLM Guard · LLM classifier (3-layer cascade)
- **Evaluation** — RAGAS · DeepEval · custom LLM correctness judge

## Evaluation results

Evaluated on the FinanceBench benchmark (150 questions across 32 companies):

| Metric | Value |
|---|---|
| Correctness pass rate | **72.7%** (109/150) |
| Refusal rate | 6.7% (10/150) |
| RAGAS faithfulness | 0.747 |
| DeepEval faithfulness | 0.844 |
| DeepEval contextual recall | 0.768 |

Per-slice pass rate: **lookup 68.6%** (n=86), **multi-hop 84.6%** (n=13), **calc 76.5%** (n=51).

The correctness judge is a Claude Sonnet 4.6 + structured-prompt setup calibrated to Cohen's κ = 0.932 against an 89-question hand-labeled set with an adversarial leniency guard. Full methodology, per-judge scores, and reproduction commands in [docs/evaluation.md](https://github.com/Rishabhmannu/financebench-rag-agent/blob/main/docs/evaluation.md).

## Comparison with published systems on FinanceBench

| System | Approach | Accuracy |
|---|---|---:|
| [Mafin 2.5 / PageIndex](https://github.com/VectifyAI/Mafin2.5-FinanceBench) | Vectorless reasoning over hierarchical document tree | **98.7%** |
| [DANA](https://arxiv.org/abs/2410.02823) | Domain-aware neurosymbolic agent with deterministic operators | 94.3% |
| GPT-4-Turbo · long context (128k) | Whole-document prompting | ~79% |
| Claude-2 · long context (100k) | Whole-document prompting | ~76% |
| **This project** | Multi-agent RAG with selective research-agent subgraph + RBAC + HITL | **72.7%** |
| [FinanceBench paper](https://arxiv.org/abs/2311.11944) baselines | Vector retrieval + GPT-4 / Llama-2 | 38–43% |
| GPT-4-Turbo · top-k vector RAG | Standard retrieval, no agent | ~19% |

Long-context approaches score higher but are not enterprise-deployable — 10-K filings frequently exceed 128k tokens, and whole-document prompting is impractical at scale due to latency and cost. The 72.7% here is measured on a production-shaped pipeline (fixed institutional corpus, batched retrieval, RBAC at the storage layer, HITL on high-stakes outputs).

## Known limitations

- **Not deployed to production** — runs locally via `docker compose up -d`. No public URL or live traffic.
- **Frontend is a vertical slice** — login + streaming chat work; sidebar, HITL UI, admin panel, citation PDF viewer are unbuilt.
- **Below the top-published systems** (Mafin 2.5 at 98.7%, DANA at 94.3%) — see comparison table above for context.

## Running from source

```bash
git clone https://github.com/Rishabhmannu/financebench-rag-agent.git
cd financebench-rag-agent
pip install -e ".[cli,dev]" && cp .env.example .env   # add your API keys
financebench setup                                     # docker compose + seed corpus
```

For self-hosting the full 11-service stack (LiteLLM + Langfuse), upgrade flows, and production hardening, see [docs/deploy.md](https://github.com/Rishabhmannu/financebench-rag-agent/blob/main/docs/deploy.md) and [docs/upgrade.md](https://github.com/Rishabhmannu/financebench-rag-agent/blob/main/docs/upgrade.md).

## Documentation

- [docs/cli.md](https://github.com/Rishabhmannu/financebench-rag-agent/blob/main/docs/cli.md) — CLI reference, slash commands, multi-party HITL workflow
- [docs/deploy.md](https://github.com/Rishabhmannu/financebench-rag-agent/blob/main/docs/deploy.md) — Self-host: stack profiles, env vars, backup, hardening
- [docs/upgrade.md](https://github.com/Rishabhmannu/financebench-rag-agent/blob/main/docs/upgrade.md) — Upgrade cookbook by change type
- [docs/evaluation.md](https://github.com/Rishabhmannu/financebench-rag-agent/blob/main/docs/evaluation.md) — Methodology, results, reproduction
- [docs/engineering-log.md](https://github.com/Rishabhmannu/financebench-rag-agent/blob/main/docs/engineering-log.md) — Engineering decisions and tradeoffs
- [docs/setup.md](https://github.com/Rishabhmannu/financebench-rag-agent/blob/main/docs/setup.md) — Test accounts, environment, dev commands
- [docs/architecture.md](https://github.com/Rishabhmannu/financebench-rag-agent/blob/main/docs/architecture.md) · [docs/api-reference.md](https://github.com/Rishabhmannu/financebench-rag-agent/blob/main/docs/api-reference.md) · [docs/rbac-matrix.md](https://github.com/Rishabhmannu/financebench-rag-agent/blob/main/docs/rbac-matrix.md) · [web/README.md](https://github.com/Rishabhmannu/financebench-rag-agent/blob/main/web/README.md)

## License

MIT
