Metadata-Version: 2.4
Name: finance_data_llm
Version: 0.1.13
Summary: SEC filings and Earnings call transcripts data
Requires-Python: <3.14,>=3.11
Description-Content-Type: text/markdown
Requires-Dist: fastapi>=0.115.0
Requires-Dist: pydantic-settings>=2.6.0
Requires-Dist: uvicorn[standard]>=0.32.0
Requires-Dist: loguru>=0.7.3
Requires-Dist: aiohttp>=3.11.0
Requires-Dist: playwright>=1.49
Requires-Dist: yfinance>=1.2.0
Requires-Dist: numpy>=1.26
Requires-Dist: httpx>=0.28.1
Requires-Dist: vllm>=0.18.1
Requires-Dist: watchdog>=6.0.0
Requires-Dist: orjson>=3.10.0

# Finance Data MCP

A Python-first toolkit for SEC filing ingestion, OCR-to-Markdown conversion, earnings call transcript collection, and **hybrid retrieval** (dense + BM25) with reranking.

## What this project does

- Downloads SEC filings and stores filing metadata.
- Converts filing PDFs to Markdown via olmOCR.
- Chunks and indexes filings/transcripts in Chroma.
- Supports **hybrid search**: dense and BM25 results are combined with reciprocal rank fusion, then passed through a reranker.
- Exposes workflows through:
  - FastAPI (`server.py`).
  - MCP server (`mcp_server.py`).
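
The reciprocal rank fusion step named in the list above can be illustrated with a minimal sketch. This is not the project's implementation: the function name `reciprocal_rank_fusion`, the chunk IDs, and the constant `k=60` (a commonly used default for this technique) are assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    where rank starts at 1. `k` is an assumed smoothing constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a dense retriever and a BM25 retriever.
dense_hits = ["chunk-3", "chunk-1", "chunk-7"]
bm25_hits = ["chunk-1", "chunk-9", "chunk-3"]

fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)  # ['chunk-1', 'chunk-3', 'chunk-9', 'chunk-7']
```

Chunks that appear in both lists rise to the top, which is why the fused list is a reasonable candidate set to hand to a reranker.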

## Repository layout

- `finance_data/filings/`: SEC download + helpers.
- `finance_data/ocr/`: olmOCR pipeline.
- `finance_data/dataloader/`: chunking, Chroma indexing, semantic + BM25 retrieval.
- `finance_data/earnings_transcripts/`: transcript fetch + persistence.
- `finance_data/server_api/`: API request/response models + batch helpers.
- `server.py`: FastAPI app.
- `mcp_server.py`: MCP entrypoint.
- `docs/`: setup and operations docs.

## Quick start

### 1) Install dependencies

```bash
uv sync
```

For OCR/embedding flows:

```bash
uv sync --group ocr-md
```

For MCP workflows:

```bash
uv sync --group ocr-md --group mcp
```

### 2) Configure environment

Use `.env` or environment variables. Common settings:

- `SEC_API_ORGANIZATION`, `SEC_API_EMAIL`
- `OLMOCR_SERVER`, `OLMOCR_MODEL`, `OLMOCR_WORKSPACE`
- `EMBEDDING_SERVER`, `EMBEDDING_MODEL`
- `CHROMA_PERSIST_DIR`
- `MCP_HOST`, `MCP_PORT`, `MCP_NGROK_ALLOWED_HOSTS`

See `finance_data/settings.py` for defaults.
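
Before starting the services, it can help to see which of the settings listed above are not set in the current shell. The sketch below is a hypothetical helper, not part of the package: the grouping and the name `unset_settings` are assumptions. It only inspects process environment variables, so values defined in `.env` are not seen, and an unset variable may simply fall back to a default in `finance_data/settings.py`.

```python
import os

# Setting names taken from the list above; the grouping is illustrative only.
SETTING_GROUPS = {
    "sec": ["SEC_API_ORGANIZATION", "SEC_API_EMAIL"],
    "ocr": ["OLMOCR_SERVER", "OLMOCR_MODEL", "OLMOCR_WORKSPACE"],
    "embedding": ["EMBEDDING_SERVER", "EMBEDDING_MODEL"],
    "chroma": ["CHROMA_PERSIST_DIR"],
    "mcp": ["MCP_HOST", "MCP_PORT", "MCP_NGROK_ALLOWED_HOSTS"],
}


def unset_settings(env=None):
    """Return {group: [names]} for settings that are missing or empty."""
    env = os.environ if env is None else env
    report = {}
    for group, names in SETTING_GROUPS.items():
        missing = [name for name in names if not env.get(name)]
        if missing:
            report[group] = missing
    return report


for group, names in unset_settings().items():
    print(f"{group}: not set in environment -> {', '.join(names)}")
```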

### 3) Run services

Start model servers:

```bash
make vllm-olmocr-serve
make vllm-embd-serve
make vllm-reranker-serve
```

Start API:

```bash
make start-server
```

Start MCP:

```bash
uv run --group ocr-md --group mcp python mcp_server.py
```

## Search capabilities

### SEC filings API

- Hybrid (dense + BM25 + reranker): `POST /vector_store/search_sec_filings`

### Transcript API

- Hybrid (dense + BM25 + reranker): `POST /vector_store/search_transcripts`

### MCP tools

- Hybrid: `search_sec_filings_tool`, `search_transcripts_tool`

## Core workflows

### SEC filing → Markdown

```bash
uv run python -m finance_data.filings.sec_data --ticker AMZN --year 2025
uv run python -m finance_data.ocr.olmocr_pipeline --pdf-dir sec_data/AMZN-2025
```

### Embed and search filings (API)

```bash
curl -s -X POST "http://127.0.0.1:8081/vector_store/embed_sec_filings" \
  -H "Content-Type: application/json" \
  -d '{"ticker":"AMZN","year":"2025","filing_type":"10-K","force":false}'

curl -s -X POST "http://127.0.0.1:8081/vector_store/search_sec_filings" \
  -H "Content-Type: application/json" \
  -d '{"ticker":"AMZN","year":"2025","filing_type":"10-K","query":"operating income margin","top_k":5}'
```
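
The same search call can be made from Python using only the standard library. This is a sketch that mirrors the `curl` example above; the helper name `build_search_request` is hypothetical, and the shape of the JSON response is not documented here, so the code does not assume one.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8081"  # host and port from the curl example


def build_search_request(ticker, year, filing_type, query, top_k=5,
                         base_url=BASE_URL):
    """Build a POST request for the hybrid SEC filing search endpoint."""
    payload = {
        "ticker": ticker,
        "year": year,
        "filing_type": filing_type,
        "query": query,
        "top_k": top_k,
    }
    return urllib.request.Request(
        f"{base_url}/vector_store/search_sec_filings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("AMZN", "2025", "10-K", "operating income margin")

# Sending the request requires the API server to be running:
# with urllib.request.urlopen(request, timeout=30) as response:
#     results = json.load(response)
```

Building the request separately from sending it keeps the payload easy to inspect or log before it reaches the server.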

### Earnings transcripts

Fetch quarterly transcripts:

```bash
uv run python -m finance_data.earnings_transcripts.transcripts AMZN 2025
```

Embed the transcripts, then run a hybrid search over them:

```bash
curl -s -X POST "http://127.0.0.1:8081/vector_store/embed_transcripts" \
  -H "Content-Type: application/json" \
  -d '{"ticker":"AMZN","year":"2025","force":false}'

curl -s -X POST "http://127.0.0.1:8081/vector_store/search_transcripts" \
  -H "Content-Type: application/json" \
  -d '{"ticker":"AMZN","year":"2025","query":"AWS revenue growth","top_k":5}'
```

## Docker

Use Makefile wrappers:

```bash
make docker-build
make docker-start
```

Stop or remove the container by API port:

```bash
make docker-stop
make docker-remove
```

## Documentation

- `docs/README.md`
- `docs/setup-and-operations.md`
