Metadata-Version: 2.4
Name: bidreader
Version: 0.9.5
Summary: Read messy construction sub-quotes, bid packages & spec PDFs into clean structured data — and catch the scope gaps/exclusions vendors bury. Every value cited to its page.
Author-email: Anmol <anmol@attentive.ai>
License-Expression: MIT
Project-URL: Homepage, https://github.com/anmolsam/bidreader
Project-URL: Issues, https://github.com/anmolsam/bidreader/issues
Keywords: construction,estimating,takeoff,subcontractor,bid,quote,scope,exclusions,spec,AEC,preconstruction,BOQ,LLM,MCP
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Office/Business :: Financial
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pymupdf>=1.24
Requires-Dist: certifi>=2024.0
Provides-Extra: tables
Requires-Dist: pdfplumber>=0.11; extra == "tables"
Provides-Extra: xlsx
Requires-Dist: openpyxl>=3.1; extra == "xlsx"
Provides-Extra: ocr
Requires-Dist: pytesseract>=0.3.10; extra == "ocr"
Requires-Dist: pillow>=10; extra == "ocr"
Provides-Extra: mcp
Requires-Dist: mcp>=1.2; extra == "mcp"
Provides-Extra: dev
Requires-Dist: pytest>=8; extra == "dev"
Requires-Dist: openpyxl>=3.1; extra == "dev"
Dynamic: license-file

<div align="center">

# 📄 BidReader

### Read messy construction sub-quotes, bid packages & spec PDFs into clean structured data — and catch the scope gaps and exclusions vendors bury in the fine print.

Every line item carries its **page**, the **exact source text** it came from, and an **arithmetic check** (`qty × unit_price == amount`) — verification on top of extraction, not just an LLM guess.

[![PyPI](https://img.shields.io/pypi/v/bidreader?color=2ea043&label=pip%20install%20bidreader)](https://pypi.org/project/bidreader/)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
[![Python](https://img.shields.io/badge/python-3.10%2B-blue)](pyproject.toml)
[![MCP](https://img.shields.io/badge/MCP-server-8b5cf6)](docs/MCP.md)
[![Runs on free models](https://img.shields.io/badge/runs%20on-free%20LLMs-success)](docs/FREE_MODELS.md)

</div>

---

> *"Manually typing numbers from a PDF into Excel because the formatting is a crime scene… hunting for the one line where a sub quietly excluded 'trash removal' in size-8 font."*
> — r/Construction, **498 upvotes** ([source](https://www.reddit.com/r/Construction/comments/1pq34ur/))

Most construction-AI effort chases autonomous *takeoff*. BidReader does something narrower and more concrete: it reads subcontractor **quotes / estimates** (PDF) and helps you **level competing bids** — surfacing the scope a sub quietly excluded — so you catch it during leveling, not after award.

It's an open-source **bid-leveling assistant**, not an autopilot: it proposes, cites its source, and you verify. MIT, `pip install`, runs on free LLMs (or fully local via Ollama), and callable from an AI agent over MCP.

> **Scope, honestly:** built and tested on **estimate-class** docs (sub quotes, GC estimates, schedules of values). It's an assistant for a human estimator — line-item extraction can be incomplete (the tool flags when it is), and inferred scope-gaps are *prompts to check*, not contractual findings. Not built for multi-bidder DOT unit-price bid-tabs. See [`demo/REAL_EVIDENCE.md`](demo/REAL_EVIDENCE.md) for honest real-document results.

## Quickstart (copy-paste, ~30 seconds)

```bash
pip install bidreader

# Use any one — a FREE key works (see docs/FREE_MODELS.md):
export GEMINI_API_KEY=...        # free at aistudio.google.com
# or  export OPENROUTER_API_KEY=...   (has :free models)
# or  export REQUESTY_API_KEY=...

bidreader your_sub_quote.pdf
```

```python
from bidreader import read

doc = read("sub_quote.pdf")
doc.line_items     # [{section, description, qty, unit, amount, page}, ...]
doc.exclusions     # [{item, quote, page, risk}, ...]   <- the buried stuff
doc.scope_gaps     # trade-standard scope NOT in the doc — confirm before bidding
doc.to_json()
```
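
The arithmetic check (`qty × unit_price == amount`) is easy to re-run on the extracted items yourself. A minimal sketch, assuming each line item is a plain dict with `qty`, `unit_price`, `amount`, and `page` keys. The `unit_price` key, the `check_arithmetic` helper, the sample data, and the one-cent tolerance are illustrative assumptions, not BidReader's API:

```python
from decimal import Decimal

def check_arithmetic(items, tolerance=Decimal("0.01")):
    """Return items whose qty * unit_price differs from amount by more than tolerance."""
    flagged = []
    for item in items:
        # Lump-sum lines carry no quantity or unit price, so there is nothing to check.
        if item.get("qty") is None or item.get("unit_price") is None:
            continue
        # Go through str() so float inputs do not drag binary rounding noise into Decimal.
        expected = Decimal(str(item["qty"])) * Decimal(str(item["unit_price"]))
        actual = Decimal(str(item["amount"]))
        if abs(expected - actual) > tolerance:
            flagged.append({**item, "expected": expected})
    return flagged

# Hypothetical sample data (not taken from a real quote).
sample_items = [
    {"description": "Gypsum board", "qty": 1200, "unit_price": 1.85, "amount": 2220.00, "page": 2},
    {"description": "Metal studs", "qty": 300, "unit_price": 4.10, "amount": 1320.00, "page": 2},
    {"description": "Mobilization", "qty": None, "unit_price": None, "amount": 1500.00, "page": 1},
]

for bad in check_arithmetic(sample_items):
    print(f'p{bad["page"]}: {bad["description"]} expected {bad["expected"]}, got {bad["amount"]}')
```

Only the "Metal studs" line is flagged here (300 × 4.10 is 1,230, not 1,320), which is the kind of transposed-digit error worth querying with the sub before leveling.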

## Private mode — bids never leave your machine

Sub bids are confidential. Run BidReader fully offline against a local [Ollama](https://ollama.com) model — **no document text is sent to any cloud LLM, no API key**:

```bash
ollama pull llama3.1
export BID_MODEL=ollama/llama3.1
bidreader your_sub_quote.pdf        # 100% local
```
Full guide + on-prem/shared-host options: [docs/LOCAL_MODELS.md](docs/LOCAL_MODELS.md).

## Real output

On a real **$324,240.61 drywall estimate** (72 line items, read in seconds), BidReader's scope engine caught a genuinely expensive hole:

```
!!  SCOPE GAPS TO CONFIRM:
  - Finishing (taping, mudding, sanding) -- the gypsum line items price the BOARD
    only, not the finishing labor to reach a paint-ready surface.
  - Door hardware -- "Door W/ Frame" lines don't include hinges/locks/closers.
  - Firestopping at rated assemblies -- life-safety scope, commonly omitted.
```

On a real **25-page multi-trade GC estimate**, it parsed **959 line items across 16 CSI divisions** (demolition → concrete → steel → finishes → plumbing → fire suppression), each page-cited. See [docs/RESULTS.md](docs/RESULTS.md) and a full worked example in [`examples/`](examples/).

## Scanned PDFs

Lots of real bids are scans with no text layer. BidReader auto-detects those and
falls back to **local Tesseract OCR** — same structured output, still private:

```bash
pip install "bidreader[ocr]"           # + tesseract binary: brew install tesseract
bidreader scanned_quote.pdf            # auto-OCR; or force with --ocr always
```
Verified on an image-only quote: recovered all line items, total, and exclusions
purely from the page image.
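
How BidReader decides that a PDF is a scan is not described here. One common heuristic is to look at how much text the text layer actually yields per page. A minimal sketch under that assumption: `needs_ocr`, its threshold, and the "more than half the pages" rule are hypothetical, and `page_texts` stands in for per-page strings such as those returned by PyMuPDF's `page.get_text()`:

```python
def needs_ocr(page_texts, min_chars_per_page=25):
    """Guess whether a PDF is a scan: True when most pages have almost no extractable text."""
    if not page_texts:
        # No pages with a text layer at all: treat as image-only.
        return True
    sparse = sum(1 for text in page_texts if len(text.strip()) < min_chars_per_page)
    return sparse / len(page_texts) > 0.5

# Hypothetical inputs: a digital quote versus an image-only scan.
digital = ["Electrical quote: 200A panel upgrade, total $4,300.", "Exclusions: permits, fire alarm system."]
scanned = ["", "\n "]

print(needs_ocr(digital), needs_ocr(scanned))
```

A per-page ratio rather than a whole-document character count keeps a mixed document (a typed cover letter stapled to scanned pricing sheets) from slipping through as "has text".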

## Bid leveling — compare subs side-by-side → Excel

The bid-day workflow: read every sub's quote and level them apples-to-apples.

```bash
pip install "bidreader[xlsx]"
bidreader level voltage_bros.pdf current_co.pdf sparky.pdf -o leveling.xlsx
```

It builds an Excel workbook (bidders as columns) with a **scope/exclusion matrix** that exposes the catch every estimator dreads — the *apparent* low bid that quietly carved out scope:

```
                  Voltage Bros   Current Co   Sparky
Bid total            $64,300      $108,890    $77,520
                     ◀ LOW
EXCLUSION MATRIX (filled = this bidder EXCLUDED it):
Fire alarm system    EXCL p1         —        EXCL p1
Temporary power      EXCL p1         —        EXCL p1
Permits                 —            —        EXCL p1
```

The "$64,300 low bid" excludes the fire alarm system that the $108,890 bid *includes*, so it is not actually the cheapest once scope is leveled. The workbook also contains per-bidder detail sheets with line items and arithmetic flags. (Try it: `python examples/make_leveling_sample.py` → `examples/leveling_demo.xlsx`.)
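
The exclusion matrix above is essentially a pivot of each bidder's exclusion list. A rough sketch of that pivot, assuming exclusions arrive as dicts with `item` and `page` keys and that the same exclusion is worded identically across bidders (real quotes rarely cooperate, so some normalization would be needed in practice). `exclusion_matrix` and the `bids` structure are hypothetical, not BidReader's internals:

```python
def exclusion_matrix(bids):
    """Map each excluded item to the page where each bidder excludes it (None = not excluded)."""
    items = sorted({e["item"] for bid in bids.values() for e in bid["exclusions"]})
    matrix = {}
    for item in items:
        matrix[item] = {}
        for bidder, bid in bids.items():
            pages = [e["page"] for e in bid["exclusions"] if e["item"] == item]
            matrix[item][bidder] = pages[0] if pages else None
    return matrix

# Figures mirror the sample table above.
bids = {
    "Voltage Bros": {"total": 64300, "exclusions": [
        {"item": "Fire alarm system", "page": 1},
        {"item": "Temporary power", "page": 1}]},
    "Current Co": {"total": 108890, "exclusions": []},
    "Sparky": {"total": 77520, "exclusions": [
        {"item": "Fire alarm system", "page": 1},
        {"item": "Temporary power", "page": 1},
        {"item": "Permits", "page": 1}]},
}

matrix = exclusion_matrix(bids)
apparent_low = min(bids, key=lambda name: bids[name]["total"])
print(f"Apparent low: {apparent_low}")
for item, row in matrix.items():
    cells = [f"EXCL p{page}" if page is not None else "-" for page in row.values()]
    print(f"{item:<20}", *cells)
```

Reading across a row shows who carved an item out; reading down the apparent low bidder's column shows what has to be priced back in before the totals are comparable.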

## Use it from an AI agent (MCP)

```bash
pip install "bidreader[mcp]"
```
```json
{ "mcpServers": { "bidreader": {
    "command": "bidreader-mcp",
    "env": { "GEMINI_API_KEY": "..." }
}}}
```
Tools: `read_document`, `catch_exclusions`, `extract_line_items`. Now your agent can answer *"which subs excluded fire-stopping across this bid folder?"* Full guide: [docs/MCP.md](docs/MCP.md).

## How it works

```
PDF (sub-quote / bid package / spec / schedule)
  → page-tagged text extraction (PyMuPDF)
  → chunk by page  (scales to 25+ page, 900+ line-item estimates)
  → LLM structured extraction  (line items · exclusions · assumptions · alternates · scope gaps)
  → merge + page-cited output (JSON / CLI / MCP)
```

Text-based, so it runs great on **free** models — see [docs/FREE_MODELS.md](docs/FREE_MODELS.md).
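
The chunk-by-page and merge steps in the pipeline above can be sketched in plain Python. This is an illustration under assumptions, not the project's code: `chunk_pages`, `merge_results`, the `[page N]` tag format, and the chunk size are all hypothetical, and the LLM call between the two steps is left out:

```python
def chunk_pages(page_texts, pages_per_chunk=2):
    """Group page texts into chunks, tagging every page with its 1-based number."""
    chunks = []
    for start in range(0, len(page_texts), pages_per_chunk):
        group = page_texts[start:start + pages_per_chunk]
        # The tag lets the model cite the page each extracted value came from.
        tagged = [f"[page {start + i + 1}]\n{text}" for i, text in enumerate(group)]
        chunks.append("\n\n".join(tagged))
    return chunks

def merge_results(chunk_results):
    """Concatenate per-chunk extraction results and order them by cited page."""
    merged = {"line_items": [], "exclusions": []}
    for result in chunk_results:
        for key in merged:
            merged[key].extend(result.get(key, []))
    for key in merged:
        # list.sort is stable, so items keep their in-page order.
        merged[key].sort(key=lambda entry: entry["page"])
    return merged

# Hypothetical per-chunk outputs, as an LLM extraction step might return them.
chunk_results = [
    {"line_items": [{"description": "Conduit", "page": 2}], "exclusions": []},
    {"line_items": [{"description": "Panel", "page": 1}],
     "exclusions": [{"item": "Permits", "page": 3}]},
]

print(chunk_pages(["Quote header", "Pricing", "Terms"]))
print(merge_results(chunk_results))
```

Chunking on page boundaries keeps every prompt small enough for free-tier context windows while preserving the page number needed for citations.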

## Evidence pack — see what it does on 14 messy bids

[**`demo/EVIDENCE.md`**](demo/EVIDENCE.md) runs BidReader across 14 deliberately-messy
synthetic bids (prose-buried exclusions, fine-print footnotes, two-column layouts,
planted arithmetic errors, multi-page, **scanned** image-only docs) and reports
honestly — wins *and* failures:

- **100%** line-item recall · **97%** exclusion-catch · **100%** bid-total · **3/3** planted arithmetic errors caught · **2/2** scanned docs OCR'd
- One honest miss documented: a low-DPI scan dropped 1 of 3 exclusions.
- Two committed Excel leveling workbooks (electrical 4-sub, drywall 3-sub) showing the apparent-low-bid-that-carved-out-scope.

Reproduce: `python demo/make_corpus.py && python demo/run_eval.py`.

## Benchmark

Reproducible ground-truth benchmark ([`benchmark/`](benchmark/)) — synthetic docs we author, so truth is exact and the PDFs ship in-repo:

| metric | score |
|---|---|
| Line-item recall | **100%** |
| Exclusion-catch recall (incl. prose-buried) | **100%** |
| No-hallucination rate (clean docs) | **100%** |
| Bid-total accuracy (±2%) | **100%** |
| Arithmetic errors caught | **2/2**, 0 false positives |

Honest caveat: synthetic docs are cleaner than real scans, so these scores are an **upper bound** measured on well-structured input, not a claim about messy real bids. Uncontrolled real-document results are in [docs/RESULTS.md](docs/RESULTS.md). Reproduce: `python benchmark/generate.py && python benchmark/run.py`.
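
For readers who want to sanity-check the metrics on their own documents, recall and the ±2% total check reduce to a few lines. The sketch below is an assumption about how such scores could be computed; the actual matching rules live in `benchmark/run.py` and may differ. `recall` and `total_within` are hypothetical helper names:

```python
def recall(expected, found):
    """Fraction of expected entries that appear in found (case-insensitive exact match)."""
    if not expected:
        return 1.0
    found_norm = {f.strip().lower() for f in found}
    hits = sum(1 for e in expected if e.strip().lower() in found_norm)
    return hits / len(expected)

def total_within(expected_total, extracted_total, rel_tol=0.02):
    """True when the extracted bid total is within rel_tol (default 2%) of ground truth."""
    return abs(extracted_total - expected_total) <= rel_tol * abs(expected_total)

# Hypothetical ground truth versus extraction for one document.
truth = ["Permits", "Fire alarm system", "Temporary power"]
extracted = ["permits", "temporary power"]

print(f"exclusion recall: {recall(truth, extracted):.0%}")
print("total ok:", total_within(64300, 65000))
```

Exact string matching is the strictest option; on real bids, where the same exclusion is phrased a dozen ways, a fuzzy or human-reviewed match would be the more realistic choice.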

## Why this, and why now — the evidence

A full write-up (problem, market data, prior-art gap, method, results) is in **[PAPER.md](PAPER.md)**. The short version:

- **Loudest, most-shared pain** in construction-estimating communities (the 498-upvote thread above; more cited in the paper).
- **It works *today*** — document extraction is LLM-native, unlike floor-plan symbol detection (academic SOTA tops out ~83% mAP).
- **Empty slot** — `bidreader`, `blueprint-parser`, `pytakeoff` were all unclaimed on PyPI; the only adjacent tools are AGPL/non-commercial or abandoned toys.
- **Concrete wedge** — not "do everything," just the bid-leveling step on bid day. Whether that is genuinely useful is unproven — this open release exists to find out. Feedback from real estimators welcome.

## Roadmap

- [x] Multi-quote **leveling** → Excel (compare subs side-by-side) — v0.6
- [x] Fully-local / private mode via **Ollama** — v0.7
- [x] **Scanned-PDF OCR** (local Tesseract) — v0.8
- [ ] Source-grounded **click-back review UI** (data already carries `source_text`)
- [ ] Revision/addendum **diff** ("what changed between Addendum 3 and 4")
- [ ] CSI/UNIFORMAT mapping + UOM normalization for estimator-grade leveling
- [ ] Region/trade notation packs (AISC, BS/IS, AUS)

## Contributing

PRs welcome — see [CONTRIBUTING.md](CONTRIBUTING.md). Good first issues: add a notation parser, a new export format, or a test fixture.

## License

[MIT](LICENSE) © 2026. Cite via [CITATION.cff](CITATION.cff).
