Metadata-Version: 2.4
Name: cartographer-filings
Version: 0.1.0
Summary: High-fidelity PDF-to-markdown pipeline for financial filings. Multilingual, jurisdiction-aware.
Author-email: Hugo Condesa <hugo@condesa.dev>
License: MIT
Project-URL: Homepage, https://github.com/hugocondesa-debug/cartographer
Project-URL: Issues, https://github.com/hugocondesa-debug/cartographer/issues
Keywords: pdf,markdown,financial-reports,xbrl,annual-report,ifrs,us-gaap
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Office/Business :: Financial
Classifier: Topic :: Text Processing :: Markup :: Markdown
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pymupdf4llm>=0.0.17
Requires-Dist: pymupdf>=1.24.0
Requires-Dist: requests>=2.31.0
Provides-Extra: dev
Requires-Dist: pytest>=7.4; extra == "dev"
Requires-Dist: pytest-cov>=4.1; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Dynamic: license-file

# Cartographer

[![tests](https://github.com/hugocondesa-debug/cartographer/actions/workflows/test.yml/badge.svg)](https://github.com/hugocondesa-debug/cartographer/actions/workflows/test.yml)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
[![Code style: ruff](https://img.shields.io/badge/code%20style-ruff-000000.svg)](https://github.com/astral-sh/ruff)

**High-fidelity PDF-to-markdown pipeline for financial filings.** Multilingual, jurisdiction-aware, deterministic.

Cartographer takes annual reports, quarterly filings, and regulatory documents and produces structured markdown plus classified sections (Income Statement, Balance Sheet, Cash Flow, MD&A, Risk Factors, numbered Notes) that downstream pipelines can consume directly. It works across 9 languages and 10+ jurisdictions, from SEC 10-Ks to ESEF annual reports to HKEX filings.

## Why it exists

Commercial PDF-to-markdown services for financial documents are expensive, opaque, and often miss the structural cues that matter most: note numbering hierarchies, table reconstruction from broken cells, multi-column layouts, language-specific section headers. Cartographer is deterministic: regex and structural rules do the cleanup, and LLMs are used only as an optional fallback for image-heavy pages. Every transformation is auditable. In internal benchmarks across 6 jurisdictions, it outperformed a commercial baseline on deterministic structural metrics.

## Quickstart

```bash
pip install cartographer-filings
```

```python
from cartographer import Pipeline

pipe = Pipeline()
result = pipe.extract("annual_report.pdf")

print(result["metadata"])
# {"company": "Siemens AG", "currency": "EUR", "fiscal_year": 2025, "standard": "IFRS", ...}

print(f"{len(result['notes'])} notes across {result['stats']['sections_found']} classified sections")
print(result["sections"]["income_statement"][:500])
```

## Pipeline

```mermaid
flowchart LR
    PDF[PDF file] --> RAW[pymupdf4llm<br/>raw markdown]
    RAW --> ENH[enhance<br/>tables • headings • noise]
    ENH --> CLEAN[clean markdown]
    CLEAN --> PARSE[parser<br/>sections + notes]
    PARSE --> OUT[structured dict]
    ENH -.optional.-> VIS[Qwen-VL<br/>image-heavy pages]
    VIS -.-> CLEAN
```

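Three deterministic stages (raw extraction with pymupdf4llm, then `enhance` and `parser`) plus one optional LLM fallback. The stages after raw extraction are: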

- **`enhance`** — reconstructs tables from `<br>`-stacked cells, promotes bold runs to headings, strips repeated page headers/footers, normalises negative number formats (`(1,234)` → `-1,234`).
- **`parser`** — classifies sections (IS, BS, CF, MD&A, Risk, Audit), detects numbered notes with V5 type mapping, extracts metadata (company, currency, fiscal year, reporting standard) across 9 languages.
- **`vision`** — only triggered on image-heavy pages where text extraction yields nothing useful. Uses Qwen-VL via SiliconFlow. Opt-in via API key.
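
As an illustration of the kind of rule `enhance` applies, the negative-number normalisation (`(1,234)` → `-1,234`) could be sketched with a single regular expression. This is a minimal sketch, not Cartographer's actual implementation: the function name `normalise_negatives` and the pattern are assumptions made for the example.

```python
import re

# Hypothetical pattern: an opening parenthesis, a number that starts with a
# digit and may contain thousands separators and a decimal part, then a
# closing parenthesis. Text such as "(see Note 3)" does not match because the
# first character after "(" must be a digit.
_PAREN_NEGATIVE = re.compile(r"\((\d[\d,]*(?:\.\d+)?)\)")


def normalise_negatives(text: str) -> str:
    """Rewrite accounting-style negatives like (1,234) as -1,234."""
    return _PAREN_NEGATIVE.sub(r"-\1", text)


row = "| Net loss | (1,234) | 5,678 |"
print(normalise_negatives(row))
# | Net loss | -1,234 | 5,678 |
```

A rule this simple would also rewrite bare footnote markers such as `(3)`, so a real implementation would likely restrict it to table cells or add further context checks.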

## Output schema

```python
{
    "markdown": str,                  # Enhanced markdown, full document
    "metadata": {
        "company": str | None,
        "currency": str | None,       # ISO 4217 code
        "fiscal_year": int | None,
        "period_end": str | None,     # ISO date
        "standard": str | None,       # IFRS, US-GAAP, K-IFRS, ...
        "language": str | None,
    },
    "sections": {
        "income_statement": str | None,
        "balance_sheet": str | None,
        "cash_flow": str | None,
        "mda": str | None,
        "risk": str | None,
        "audit": str | None,
    },
    "notes": [
        {
            "note_number": int,
            "title": str,
            "content": str,
            "v5_type": str | None,    # Classified note type (e.g. U03, Related Parties)
            "start_char": int,
            "end_char": int,
        },
        ...
    ],
    "stats": {
        "pages": int,
        "raw_chars": int,
        "enhanced_chars": int,
        "sections_found": int,
        "notes_count": int,
        "vision_pages": int,
        "time_seconds": float,
    },
}
```
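
A consumer of this schema typically needs to know which sections were found and which notes exist. The sketch below works on any dict shaped like the schema above. The helper `summarize` and the hand-written `sample` values are illustrative assumptions, not part of the library.

```python
from typing import Any


def summarize(result: dict[str, Any]) -> dict[str, Any]:
    """List the sections that are not None and map note numbers to titles."""
    found = [name for name, text in result["sections"].items() if text is not None]
    notes_by_number = {note["note_number"]: note["title"] for note in result["notes"]}
    return {"sections": found, "notes": notes_by_number}


# Hand-written stand-in for a Pipeline.extract() result (illustrative values only).
sample = {
    "sections": {
        "income_statement": "| Revenue | 100 |",
        "balance_sheet": "| Total assets | 500 |",
        "cash_flow": None,
        "mda": None,
        "risk": None,
        "audit": None,
    },
    "notes": [
        {"note_number": 1, "title": "Basis of preparation", "content": "...",
         "v5_type": None, "start_char": 0, "end_char": 3},
        {"note_number": 2, "title": "Related parties", "content": "...",
         "v5_type": None, "start_char": 3, "end_char": 6},
    ],
}

summary = summarize(sample)
print(summary["sections"])  # ['income_statement', 'balance_sheet']
print(summary["notes"])     # {1: 'Basis of preparation', 2: 'Related parties'}
```

Because every section value may be `None`, checking for `None` before slicing or parsing a section avoids a `TypeError` on filings where a section was not detected.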

## Vision fallback

Image-heavy pages (scanned filings, heavy-graphic annual reports) are only handled when you provide a SiliconFlow API key:

```python
from cartographer import Pipeline

pipe = Pipeline(vision_api_key="sk-...")  # or SILICONFLOW_API_KEY env var
result = pipe.extract("scanned_report.pdf")
```

Vision is a fallback, not a default path. The overwhelming majority of financial filings parse correctly through the deterministic pipeline alone.
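
One plausible way to decide which pages count as "image-heavy" is to flag pages whose extracted text is nearly empty. The sketch below shows that heuristic only. The function name, the `min_chars` threshold, and the idea of keying on character count are assumptions for illustration; the source does not describe how Cartographer actually makes this decision.

```python
def pages_needing_vision(page_texts: list[str], min_chars: int = 50) -> list[int]:
    """Return 0-based indexes of pages whose extracted text is too short to be useful."""
    return [
        index
        for index, text in enumerate(page_texts)
        if len(text.strip()) < min_chars  # hypothetical threshold
    ]


pages = [
    "Revenue increased by 4% compared with the prior year. " * 5,  # normal text page
    "   \n",                                                       # scanned page, no text layer
    "Page 3",                                                      # only a page label survived
]
print(pages_needing_vision(pages))  # [1, 2]
```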

## Supported languages

English, German, French, Spanish, Italian, Portuguese, Korean, Japanese, Chinese (partial).

## Jurisdictions tested

| Region | Standard | Examples |
|---|---|---|
| United States | US-GAAP | SEC 10-K, 10-Q |
| Germany | IFRS | ESEF filings, Bundesanzeiger |
| Italy | IFRS | CONSOB / SDIR |
| Portugal | IFRS | CMVM filings |
| Korea | K-IFRS | DART filings |
| Hong Kong | IFRS (HK) | HKEXnews |
| Australia | IFRS | ASX listings |

## Development

```bash
git clone https://github.com/hugocondesa-debug/cartographer.git
cd cartographer
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
pytest tests/ -v
ruff check src/ tests/
```

See [CONTRIBUTING.md](CONTRIBUTING.md) for details.

## Roadmap

- [ ] PyPI release (`pip install cartographer-filings`)
- [ ] Regression test suite fetching from primary sources (SEC, ESEF, HKEXnews)
- [ ] Thai, Arabic, additional CJK coverage
- [ ] Multi-column prospectus layouts
- [ ] Table-of-contents anchor refinement
- [ ] Full audit pass on `parser.py` to tighten ruff rules

## Origin

Cartographer was extracted from [Atlas](https://github.com/hugocondesa-debug/atlas), a personal financial data platform. It exists as a standalone library because the pipeline is useful beyond any single consumer: it is aimed at anyone processing annual reports or regulatory filings at scale.

## License

MIT. See [LICENSE](LICENSE).

---

Built by [Hugo Condesa](https://github.com/hugocondesa-debug).
