Metadata-Version: 2.4
Name: phares
Version: 0.1.0
Summary: Agentic Purchase Order Intelligence — multi-agent document extraction with visual grounding and schema memory.
Author-email: Lancer International <it@lancers.in>
Maintainer-email: Lancer International <it@lancers.in>
License: Copyright (c) 2026 Lancer International
        
        All rights reserved.
        
        This software and its source code are proprietary to Lancer International.
        You may not copy, modify, merge, publish, distribute, sublicense, sell, or
        otherwise use this software, in whole or in part, except with the prior
        written permission of Lancer International.
        
        THIS SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
        FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
        DEALINGS IN THE SOFTWARE.
        
Project-URL: Homepage, https://github.com/lancer-international/phares
Project-URL: Issues, https://github.com/lancer-international/phares/issues
Keywords: purchase-order,document-extraction,ocr,crewai,langchain,langgraph,rag,imap
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: Other/Proprietary License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Office/Business
Classifier: Topic :: Text Processing :: General
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: crewai>=0.80.0
Requires-Dist: crewai-tools>=0.15.0
Requires-Dist: langchain>=0.3.0
Requires-Dist: langchain-community>=0.3.0
Requires-Dist: langchain-ollama>=0.2.0
Requires-Dist: langchain-openai>=0.2.0
Requires-Dist: langchain-huggingface>=0.1.0
Requires-Dist: huggingface_hub>=0.24.0
Requires-Dist: langsmith>=0.1.100
Requires-Dist: langgraph>=0.2.0
Requires-Dist: pydantic>=2.6.0
Requires-Dist: python-dotenv>=1.0.1
Requires-Dist: pymupdf>=1.24.0
Requires-Dist: pdfplumber>=0.11.0
Requires-Dist: pdf2image>=1.17.0
Requires-Dist: pillow>=10.3.0
Requires-Dist: python-docx>=1.1.2
Requires-Dist: openpyxl>=3.1.5
Requires-Dist: pytesseract>=0.3.10
Requires-Dist: faiss-cpu>=1.8.0
Requires-Dist: sentence-transformers>=3.0.0
Requires-Dist: chromadb>=0.5.0
Requires-Dist: duckduckgo-search>=6.2.0
Requires-Dist: requests>=2.32.0
Requires-Dist: beautifulsoup4>=4.12.3
Requires-Dist: rich>=13.7.1
Requires-Dist: tqdm>=4.66.4
Requires-Dist: orjson>=3.10.3
Requires-Dist: numpy>=1.26.4
Provides-Extra: vision
Requires-Dist: torch>=2.2.0; extra == "vision"
Requires-Dist: transformers>=4.42.0; extra == "vision"
Requires-Dist: sentencepiece; extra == "vision"
Requires-Dist: timm; extra == "vision"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-cov>=5.0; extra == "dev"
Requires-Dist: ruff>=0.5.0; extra == "dev"
Dynamic: license-file

# Phares — Agentic Purchase Order Intelligence

A production-grade, multi-agent document processing system that turns any PDF,
scan, image, DOCX, or XLSX **Purchase Order** into a strict, validated JSON
object with **full detail preservation**, **visual grounding**, and **schema
memory** for repeat templates.

Built on **CrewAI + LangChain + (optional) LangGraph**, with PyMuPDF,
pdfplumber, Tesseract, and optional HuggingFace (Donut / LayoutLMv3) and
Ollama multimodal backends.

---

## 1. Architecture

```
 INPUT (pdf/png/jpg/docx/xlsx)
        │
        ▼
  ┌──────────────┐                              ┌─────────────────┐
  │  Planner     │────── fingerprint hit? ─────▶│ Memory Agent    │
  │  Agent       │◀─────── reuse schema ────────│ (FAISS)         │
  └──────┬───────┘                              └─────────────────┘
         │
   ┌─────┴──────┐    ┌─────────────┐    ┌──────────────────┐
   │ Loader     │───▶│ OCR/Vision  │───▶│ Structure        │
   │ (pymupdf,  │    │ (Tesseract, │    │ (tables + KVs)   │
   │  pdfplumber│    │  Donut,     │    │                  │
   │  docx,xlsx)│    │  LayoutLMv3,│    │                  │
   │            │    │  llava)     │    │                  │
   └────────────┘    └──────┬──────┘    └─────────┬────────┘
                            │                     │
                            └─────────┬───────────┘
                                      ▼
                           ┌─────────────────────┐
                           │ Labeling Agent      │
                           │ (Ollama LLM, JSON   │
                           │  mode, few-shot)    │
                           └─────────┬───────────┘
                                     │  low-confidence?
                                     ▼
                            ┌────────────────┐
                            │ Web Research   │
                            │ (DuckDuckGo)   │
                            └────────┬───────┘
                                     ▼
                            ┌────────────────┐
                            │ Output Agent   │
                            │ (Pydantic      │
                            │  validation)   │
                            └────────┬───────┘
                                     ▼
                              output/<file>.json
                                     │
                                     ▼
                          Memory Agent persists
                          learned schema for reuse
```

### Agent roster

| Agent | Role | Key tools |
|-------|------|-----------|
| **Planner** | Decides per-file pipeline; rejects non-POs if `PO_ONLY=true` | `classify_pdf` |
| **Loader** | Extracts text + layout + images | `pdf_extract`, `docx_extract`, `xlsx_extract`, `pdf_pages_to_images` |
| **OCR & Vision** | Recovers text from scans/images | `ocr_pdf`, `ocr_image`, `donut_parse`, `vision_describe` |
| **Structure Extractor** | Tables + key/value pairs | `extract_tables`, `extract_kv_pairs` |
| **Memory** | Schema fingerprint → FAISS recall / persist | `schema_fingerprint`, `memory_lookup`, `memory_store` |
| **Labeler** | Strict-JSON PO field extraction | (LLM) |
| **Web Researcher** | Clarifies unknown labels / vendor formats | `web_search`, `fetch_url` |
| **Output Validator** | Pydantic-validates final object | (LLM) |

### LangSmith tracing

Set the following in `.env` to stream every LLM / chain / tool call to
LangSmith (https://smith.langchain.com):

```
LANGSMITH_API_KEY=lsv2_pt_...
LANGSMITH_TRACING=true
LANGSMITH_PROJECT=email-parsh
LANGSMITH_ENDPOINT=https://api.smith.langchain.com
```

`src/config.py` exports these into both the modern `LANGSMITH_*` and legacy
`LANGCHAIN_*` environment variable names at import time, so every LangChain /
CrewAI call is automatically traced without any code changes. Turn it off by
setting `LANGSMITH_TRACING=false`.
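
The mirroring can be sketched roughly as follows. This is an illustrative stand-in, not the actual `src/config.py` code, and the exact variable list is an assumption:

```python
import os

def export_tracing_env() -> None:
    """Mirror modern LANGSMITH_* variables into the legacy LANGCHAIN_*
    names so older LangChain integrations pick up tracing too.
    (Hypothetical sketch; the real mapping lives in src/config.py.)"""
    mapping = {
        "LANGSMITH_API_KEY": "LANGCHAIN_API_KEY",
        "LANGSMITH_TRACING": "LANGCHAIN_TRACING_V2",
        "LANGSMITH_PROJECT": "LANGCHAIN_PROJECT",
        "LANGSMITH_ENDPOINT": "LANGCHAIN_ENDPOINT",
    }
    for modern, legacy in mapping.items():
        value = os.environ.get(modern)
        if value is not None:
            # setdefault: an explicitly set legacy var wins
            os.environ.setdefault(legacy, value)
```

Running this at import time is what makes tracing "zero code change": downstream libraries only ever read the environment.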

### LLM backend options

| Provider | `.env` keys | When to use |
|----------|-------------|-------------|
| **Ollama (local)** | `LLM_PROVIDER=ollama`, `LLM_MODEL=llama3.1:8b` | Zero API cost, offline, slower on CPU |
| **HuggingFace Inference** | `LLM_PROVIDER=hf`, `HF_TOKEN=...`, `HF_INFERENCE_MODEL=meta-llama/Meta-Llama-3.1-8B-Instruct` | Fast, no local GPU needed, uses HF serverless |
| **OpenAI-compatible** | `LLM_PROVIDER=openai`, `OPENAI_API_KEY=...`, `LLM_MODEL=gpt-4o-mini` | Production SLAs, any OpenAI-compatible endpoint |

`HF_TOKEN` is also picked up automatically by `transformers`, `huggingface_hub`,
and `sentence-transformers` for model downloads (gated Donut/LayoutLMv3, Nougat,
private models). Set it once in `.env` and it applies everywhere.
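
The provider switch can be pictured as a small factory keyed on `LLM_PROVIDER`. This is a simplified sketch, assuming the env keys from the table above; the real factory (in `src/models/model_loader.py`) returns LangChain chat-model objects rather than plain dicts:

```python
import os

def make_llm_config() -> dict:
    """Resolve the LLM backend from environment variables.
    Illustrative only: returns a config dict instead of a model object."""
    provider = os.environ.get("LLM_PROVIDER", "ollama")
    if provider == "ollama":
        return {"provider": "ollama",
                "model": os.environ.get("LLM_MODEL", "llama3.1:8b")}
    if provider == "hf":
        return {"provider": "hf",
                "model": os.environ.get("HF_INFERENCE_MODEL", ""),
                "token": os.environ.get("HF_TOKEN", "")}
    if provider == "openai":
        return {"provider": "openai",
                "model": os.environ.get("LLM_MODEL", "gpt-4o-mini"),
                "api_key": os.environ.get("OPENAI_API_KEY", "")}
    raise ValueError(f"Unknown LLM_PROVIDER: {provider}")
```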

### Model selection — why these

- **Donut** (`naver-clova-ix/donut-base-finetuned-cord-v2`) — best open model for end-to-end document understanding without relying on OCR; handles noisy receipts/POs very well. Feature-flagged (`ENABLE_DONUT=true`) because of its GPU memory footprint.
- **LayoutLMv3** — when you need token-level visual grounding for training/custom models.
- **Tesseract** — dependable baseline OCR with word-level confidences + boxes.
- **Ollama llama3.1/mistral** — strong local reasoning for labeling; zero API cost; JSON mode gives us schema-valid output.
- **Ollama llava** — multimodal fallback for image-only POs.
- **sentence-transformers MiniLM** — fast, small embeddings for FAISS schema memory.

---

## 2. Installation

```powershell
# 1) Python env
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt

# 2) OCR engine (needed for scanned PDFs)
winget install --id UB-Mannheim.TesseractOCR -e
# Then set TESSERACT_CMD in .env to the install path.

# 3) (optional) Poppler — needed by pdf2image on Windows
#    Download from https://github.com/oschwartz10612/poppler-windows/releases
#    Unzip and set POPPLER_PATH in .env to the bin directory.

# 4) LLM backend — Ollama (recommended)
# https://ollama.com/download
ollama pull llama3.1:8b
ollama pull llava:7b       # only if you want vision fallback

# 5) Configure
copy .env.example .env
#   edit paths / LLM settings
```

---

## 3. Usage

```powershell
# Process every file under samples/
python run.py

# Process one file
python run.py "samples\PO-TVP-548.pdf"

# Or via the full CrewAI agentic orchestrator (verbose trace)
python run.py --crew
```

Output JSON files land in `output/` with the same basename. A summary table is
printed to the console.

---

## 4. Output schema

```jsonc
{
  "document_type": "purchase_order",
  "document_type_confidence": 0.97,
  "is_purchase_order": true,
  "metadata": {
    "file_name": "PO-TVP-548.pdf",
    "file_path": "...",
    "file_size_bytes": 70011,
    "mime_type": "application/pdf",
    "page_count": 1,
    "is_scanned": false,
    "pipeline_path": "digital_pdf",
    "processing_seconds": 6.12
  },
  "fields": {
    "po_number":   {"value": "PO-TVP-548", "raw": "PO No: PO-TVP-548", "confidence": 0.96},
    "po_date":     {"value": "2025-04-12", "raw": "12-Apr-2025",       "confidence": 0.93},
    "vendor":   {"name": {"value": "Acme Traders", "confidence": 0.95}, "address": {...}},
    "buyer":    {"name": {"value": "Lancer International", "confidence": 0.94}},
    "ship_to":  {"address": {"value": "...", "confidence": 0.88}},
    "items": [
      {
        "description": {"value": "SS Flange 150#", "confidence": 0.91},
        "quantity":    {"value": 10, "confidence": 0.95},
        "unit_price":  {"value": 1250.0, "confidence": 0.93},
        "line_total":  {"value": 12500.0, "confidence": 0.94}
      }
    ],
    "totals": {
      "subtotal":   {"value": 12500.0, "confidence": 0.94},
      "tax_total":  {"value": 2250.0,  "confidence": 0.93},
      "grand_total":{"value": 14750.0, "confidence": 0.95},
      "currency":   {"value": "INR",    "confidence": 0.92}
    },
    "extras": { "incoterm": "FOB Chennai" }
  },
  "tables": [{...}],
  "raw_text": "…full document text…",
  "confidence_scores": {"po_number": 0.96, "totals.grand_total": 0.95, ...},
  "warnings": [],
  "schema_fingerprint": "acme po no date vendor… :: TBL:desc|qty|rate|amount",
  "reused_memory_template": null
}
```

Every leaf field is a `GroundedValue` with `value`, `raw`, optional `bbox`, and
`confidence`. Unknown fields land in `fields.extras` so **no data is ever dropped**.
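
The leaf shape can be sketched with stdlib dataclasses for illustration; the real contract is a Pydantic model in `src/models/schemas.py`, and the `bbox` tuple layout here is an assumption:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class GroundedValue:
    """One leaf field of the output schema (sketch, not the real Pydantic
    model). bbox is assumed to be (x0, y0, x1, y1) on the source page."""
    value: object                              # normalized value
    raw: Optional[str] = None                  # exact source text
    bbox: Optional[Tuple[float, float, float, float]] = None
    confidence: float = 0.0                    # 0.0 .. 1.0
```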

---

## 5. Intelligence / learning loop

1. For each new document we compute a **layout fingerprint** from top-of-page
   tokens, table headers, and KV keys.
2. FAISS returns the nearest known template. If similarity ≥ 0.92 we inject
   that schema as few-shot context, skipping re-learning.
3. After labeling, if no hit existed, the new schema is persisted — subsequent
   runs on the same template are faster and more consistent.
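
The recall step above amounts to a nearest-neighbor search with a similarity cutoff. A pure-Python stand-in for the FAISS lookup (the 0.92 threshold is from the text; the record layout is an assumption):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def recall_template(query_vec, store, threshold=0.92):
    """Return the best stored schema if its fingerprint similarity
    clears the threshold, else None (i.e. learn a new template)."""
    best = max(store, key=lambda rec: cosine(query_vec, rec["vec"]),
               default=None)
    if best and cosine(query_vec, best["vec"]) >= threshold:
        return best["schema"]
    return None
```

In production the fingerprint is embedded (MiniLM) and the search runs in FAISS; the decision logic is the same.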

---

## 6. Extensibility

- Add a new labeled field → extend `models/schemas.py` + mention it in
  `agents/labeling_agent.LABELING_SYSTEM`.
- Add a new file type → add a tool in `src/tools/`, register it in
  `tools/__init__.py`, and teach the Planner.
- Swap the LLM → change `LLM_PROVIDER` / `LLM_MODEL` in `.env`.
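
Registering a new file-type tool follows the usual registry pattern. A hypothetical sketch (the real project registers LangChain `@tool` functions in `src/tools/` and `tools/__init__.py`; `csv_extract` is an invented example, not an existing tool):

```python
from pathlib import Path

# Hypothetical in-process registry; stands in for tools/__init__.py exports.
TOOLS: dict = {}

def register_tool(name: str):
    """Decorator that exposes an extractor under a stable tool name."""
    def wrap(fn):
        TOOLS[name] = fn
        return fn
    return wrap

@register_tool("csv_extract")
def csv_extract(path: str) -> str:
    """Example new file type: read a CSV purchase order as raw text."""
    return Path(path).read_text(encoding="utf-8")
```

After registering, the Planner's prompt/tool list must also mention the new tool so it can be routed to.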

---

## 7. Email ingestion (IMAP → extraction)

The email subsystem is an **additive** feature — nothing in the existing
extraction pipeline is modified. An Email Ingestion Agent monitors the
configured IMAP inbox, classifies each message as PO or not, downloads
attachments, and forwards supported files into `pipeline.graph.run_graph`.

### Security

- Credentials are read from `IMAP_USER` / `IMAP_PASS` in `.env` or the process
  environment. **Never** hardcoded. `.env` is in `.gitignore`.
- Connection is always `IMAP4_SSL` (default port 993).
- For Gmail, create an [App Password](https://myaccount.google.com/apppasswords)
  (16 characters); your normal account password will not work when 2FA is enabled.
- For production, swap the `.env` source for AWS Secrets Manager / Azure Key
  Vault by overriding `IMAP_USER`/`IMAP_PASS` in the process environment before
  import; `src/config.py` will pick them up transparently.
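
A minimal bootstrap for that swap might look like this. The secret names and the `fetch_secret` callable are assumptions for illustration; the only real contract is that `IMAP_USER`/`IMAP_PASS` are in the environment before `src.config` is imported:

```python
import os

def inject_imap_secrets(fetch_secret) -> None:
    """Pull IMAP credentials from a secrets backend and export them to
    the process environment *before* importing src.config.
    fetch_secret: callable mapping a secret name -> value (hypothetical)."""
    os.environ["IMAP_USER"] = fetch_secret("phares/imap_user")
    os.environ["IMAP_PASS"] = fetch_secret("phares/imap_pass")
```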

### .env keys

```
IMAP_USER=you@example.com
IMAP_PASS=xxxxxxxxxxxxxxxx
IMAP_HOST=imap.gmail.com
IMAP_PORT=993
IMAP_FOLDER=INBOX
IMAP_POLL_SECONDS=60
IMAP_MARK_SEEN=true        # marks \Seen on the server after processing
IMAP_SEARCH=UNSEEN         # IMAP SEARCH criterion
IMAP_LOOKBACK_DAYS=7
IMAP_MAX_BATCH=25
ATTACHMENT_DIR=<path to save attachments>
SEEN_STORE_PATH=<path to jsonl dedup log>
```

### Classification pipeline

For each unread email:

1. **Keyword prefilter** — subject/body regex + attachment presence. Strong
   matches (score ≥ 0.75) and clear non-matches (score 0) skip the LLM.
2. **LLM classifier** — when the signal is ambiguous, an Ollama JSON call
   returns `{is_purchase_order, confidence, reason}`, which is blended with
   the keyword signal.
3. **Result** is appended to `memory_store/email_runs.jsonl` for audit.
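
The prefilter step can be sketched as below. The patterns and weights are illustrative placeholders, not the production values; only the ≥ 0.75 / score-0 skip rule comes from the text:

```python
import re

# Illustrative patterns only; the real regex set is richer.
PO_PATTERNS = [r"\bpurchase\s+order\b", r"\bPO[-\s]?\d"]

def keyword_score(subject: str, body: str, has_attachment: bool) -> float:
    """Crude PO likelihood from regex hits plus attachment presence."""
    text = f"{subject}\n{body}"
    hits = sum(1 for p in PO_PATTERNS if re.search(p, text, re.IGNORECASE))
    return min(1.0, 0.35 * hits + (0.25 if has_attachment else 0.0))

def needs_llm(score: float) -> bool:
    """Only the ambiguous middle band goes to the LLM classifier."""
    return 0.0 < score < 0.75
```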

### Attachment handling

- Only `.pdf .png .jpg .jpeg .tif .tiff .docx .xlsx` are retained.
- Filenames are sanitized (`[^A-Za-z0-9._\- ]` → `_`); collisions get `__1`, `__2`.
- Size cap 25 MB per attachment.
- One folder per email: `attachments/uid-<UID>-<mid-hash>/`.
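
The sanitize-and-de-collide rule can be sketched directly from the bullets above (extension whitelist and size cap are checked separately, before saving):

```python
import re
from pathlib import Path

ALLOWED = {".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff", ".docx", ".xlsx"}
MAX_BYTES = 25 * 1024 * 1024  # 25 MB per-attachment cap

def sanitize_name(name: str, taken: set) -> str:
    """Replace disallowed characters with '_' and resolve collisions
    against already-saved names with __1, __2, ... suffixes."""
    clean = re.sub(r"[^A-Za-z0-9._\- ]", "_", name)
    stem, ext = Path(clean).stem, Path(clean).suffix
    candidate, n = clean, 0
    while candidate in taken:
        n += 1
        candidate = f"{stem}__{n}{ext}"
    return candidate
```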

### Dedup

Two guards ensure the same email is never processed twice:

- **Server-side**: `IMAP_MARK_SEEN=true` sets the `\Seen` flag after processing
  so subsequent `UNSEEN` searches skip it.
- **Client-side**: a JSONL seen-store at `SEEN_STORE_PATH` keyed by Message-ID
  (fallback UID) — survives even if the server flag is cleared.

### Usage

```powershell
# one cycle, then exit (good for cron / Task Scheduler)
python run_email.py --once

# long-running poller
python run_email.py --poll 60

# widen the IMAP SINCE window to 30 days, skip LLM classifier
python run_email.py --once --lookback 30 --no-llm

# verbose logging
python run_email.py --once -v
```

The runner prints a Rich summary table per cycle and appends full details to
`memory_store/email_runs.jsonl`. Each PO attachment produces its own
`output/<basename>.json` via the existing extraction graph — unchanged.

### Files added

```
src/email_ingestion/
├── __init__.py
├── imap_client.py        # IMAP4_SSL + parsing + retry
├── seen_store.py         # JSONL dedup
├── classifier.py         # keyword + LLM PO classifier
├── attachment_handler.py # sanitize, validate, save
└── runner.py             # poll loop + wiring to run_graph

src/tools/email_tools.py  # LangChain tools for the Email Agent
src/agents/email_agent.py # CrewAI "Email Ingestion Specialist"
run_email.py              # CLI entrypoint
```

---

## 8. Files

```
Phares/
├── run.py                        # entry
├── requirements.txt
├── .env.example
├── README.md
├── samples/                      # input POs
├── output/                       # JSON results
├── memory_store/                 # FAISS + records.jsonl
└── src/
    ├── main.py                   # CLI
    ├── config.py
    ├── models/
    │   ├── schemas.py            # Pydantic contract
    │   └── model_loader.py       # LLM + embedder factories
    ├── tools/                    # LangChain tools
    │   ├── pdf_tools.py
    │   ├── ocr_tools.py
    │   ├── vision_tools.py
    │   ├── layout_tools.py
    │   ├── office_tools.py
    │   ├── memory_tools.py
    │   └── search_tools.py
    ├── memory/vector_store.py    # FAISS schema memory
    ├── agents/                   # CrewAI agents
    │   ├── planner_agent.py
    │   ├── loader_agent.py
    │   ├── ocr_vision_agent.py
    │   ├── structure_agent.py
    │   ├── labeling_agent.py
    │   ├── web_research_agent.py
    │   ├── memory_agent.py
    │   └── output_agent.py
    ├── pipeline/
    │   ├── graph.py              # deterministic state machine (default)
    │   └── crew.py               # CrewAI orchestration (--crew)
    └── utils/file_utils.py
```
