Metadata-Version: 2.4
Name: arcus-provider-runtime
Version: 0.4.0
Summary: Content-extraction provider runtime for arcus — turn a URL or file into normalized markdown + structured metadata.
Project-URL: Homepage, https://github.com/polleoai/arcus
Project-URL: Repository, https://github.com/polleoai/arcus
Project-URL: Issues, https://github.com/polleoai/arcus/issues
Author-email: "POLLEO.AI" <support@polleo.ai>
License: MIT
License-File: LICENSE
Keywords: content-extraction,html-to-markdown,llm,pdf-extraction,rag,scraping,youtube-transcript
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Text Processing :: Markup
Requires-Python: >=3.11
Requires-Dist: pyyaml>=6.0
Requires-Dist: yt-dlp>=2025.5.1
Provides-Extra: all
Requires-Dist: openpyxl>=3.1; extra == 'all'
Requires-Dist: playwright>=1.40; extra == 'all'
Requires-Dist: pymupdf4llm>=0.0.10; extra == 'all'
Requires-Dist: python-docx>=1.0; extra == 'all'
Requires-Dist: python-pptx>=0.6; extra == 'all'
Provides-Extra: dev
Requires-Dist: pytest-mock>=3.12; extra == 'dev'
Requires-Dist: pytest>=8; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Provides-Extra: html
Requires-Dist: playwright>=1.40; extra == 'html'
Provides-Extra: office
Requires-Dist: openpyxl>=3.1; extra == 'office'
Requires-Dist: python-docx>=1.0; extra == 'office'
Requires-Dist: python-pptx>=0.6; extra == 'office'
Provides-Extra: pdf
Requires-Dist: pymupdf4llm>=0.0.10; extra == 'pdf'
Description-Content-Type: text/markdown

# arcus-provider-runtime

The content-extraction kernel behind [arcus](https://github.com/polleoai/arcus):
give it one URL or one file path, get back normalized markdown plus structured
metadata. No vault, no database, no project awareness — a pure download +
extraction layer you can drop into any pipeline (RAG ingest, knowledge bases,
LLM context building).

## Install

```bash
pip install "arcus-provider-runtime[html,pdf,office]"
```

Extras pull in only the heavy dependencies you need:

| Extra | Adds | For |
|---|---|---|
| `html` | `playwright` | JS-rendered pages, X.com / LinkedIn, SPA articles |
| `pdf` | `pymupdf4llm` | PDF → markdown extraction |
| `office` | `python-docx`, `python-pptx`, `openpyxl` | DOCX / PPTX / XLSX / EPUB |
| `all` | everything above | — |

The base install needs no extras and covers YouTube transcripts via `yt-dlp`.
Beyond the `html` extra, the HTML provider also needs Chromium
(`python -m playwright install chromium`) and `node` on `PATH`, which runs the
vendored `html2md.mjs` converter.
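Before relying on the HTML provider, it can help to verify those prerequisites up front. The following is a minimal sketch, not part of the package: `missing_html_prerequisites` is a hypothetical helper name, and it only checks that the `playwright` module is importable and that a `node` executable is on `PATH`. It does not verify that the Chromium browser has been downloaded.

```python
import importlib.util
import shutil


def missing_html_prerequisites() -> list[str]:
    """Return descriptions of HTML-provider prerequisites that appear to be absent.

    Hypothetical helper for illustration; it is not shipped by the package.
    """
    missing = []
    # The `html` extra installs the `playwright` Python package.
    if importlib.util.find_spec("playwright") is None:
        missing.append("playwright (install the 'html' extra)")
    # The vendored html2md.mjs converter needs a `node` executable on PATH.
    if shutil.which("node") is None:
        missing.append("node (not found on PATH)")
    return missing


if __name__ == "__main__":
    problems = missing_html_prerequisites()
    if problems:
        print("Missing: " + ", ".join(problems))
    else:
        print("HTML provider prerequisites look OK")
```

An empty list means both checks passed; any entries name what still needs installing.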

## Use

```python
from arcus.provider_runtime import Factory

result = Factory().run("https://example.com/article", out_dir="./out")
# result.markdown_path  → ./out/<slug>.md   (frontmatter + readable body)
# result.metadata_path  → ./out/<slug>.json (segments, timing, provenance)
```
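A consumer will usually want to read the two output files back in. The sketch below shows one way to do that with the standard library only. `load_artifact` is a hypothetical helper, and it assumes the frontmatter is a leading block delimited by `---` lines (the usual markdown convention); the exact frontmatter and JSON layout produced by the runtime is not specified here, so the frontmatter is returned as raw text rather than parsed.

```python
import json
from pathlib import Path


def load_artifact(markdown_path, metadata_path):
    """Return (frontmatter_text, body_text, metadata_dict) for one extracted artifact.

    Hypothetical helper; assumes '---'-delimited frontmatter at the top of the file.
    """
    text = Path(markdown_path).read_text(encoding="utf-8")
    frontmatter, body = "", text
    if text.startswith("---\n"):
        # Look for the closing delimiter line after the opening one.
        end = text.find("\n---\n", 3)
        if end != -1:
            frontmatter = text[4:end]
            body = text[end + len("\n---\n"):]
    metadata = json.loads(Path(metadata_path).read_text(encoding="utf-8"))
    return frontmatter, body.lstrip("\n"), metadata
```

With the `result` object from the example above, this would be called as `load_artifact(result.markdown_path, result.metadata_path)`.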

A single `Factory.run()` entry point inspects the input and dispatches to the
matching provider. Providers live under
`arcus.provider_runtime.providers.<kind>/` and can be registered individually.
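To make "dispatch by inspecting the input" concrete, here is a rough sketch of how such a classifier could look. It is an illustration only, not the runtime's actual logic: `guess_kind`, the kind names, the host list, and the extension map are all assumptions chosen to mirror the extras table above.

```python
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical mappings, for illustration only.
EXTENSION_KINDS = {".pdf": "pdf", ".docx": "office", ".pptx": "office", ".xlsx": "office"}
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def guess_kind(source: str) -> str:
    """Classify a URL or file path into a provider kind (hypothetical names)."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        host = (parsed.hostname or "").lower()
        if host in YOUTUBE_HOSTS:
            return "youtube"
        # A URL that points straight at a document is routed by its extension;
        # everything else is treated as a web page.
        suffix = Path(parsed.path).suffix.lower()
        return EXTENSION_KINDS.get(suffix, "html")
    # Anything without an http(s) scheme is treated as a local file path.
    suffix = Path(source).suffix.lower()
    return EXTENSION_KINDS.get(suffix, "unknown")
```

Keeping classification in one small function like this means a new provider only needs a new mapping entry, which fits the idea of individually registered providers.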

## What it deliberately does NOT do

arcus has zero awareness of any consuming app's storage, topics, or wiki. One
input in, one extracted artifact out. Vault-aware orchestration (dedup,
cross-referencing, synthesis) belongs in the consumer, not here.

## License

MIT © 2026 POLLEO.AI
