Metadata-Version: 2.4
Name: mdengine
Version: 0.12.0
Summary: Convert PDF, Office, images, audio, video, text/JSON/XML, ZIP archives, web URLs, databases, graphs, and OpenAPI specs to Markdown.
Author: mdengine contributors
License-Expression: MIT
Project-URL: Homepage, https://github.com/vishal7090/md-generator
Project-URL: Repository, https://github.com/vishal7090/md-generator
Project-URL: Issues, https://github.com/vishal7090/md-generator/issues
Project-URL: Documentation, https://vishal7090.github.io/md-generator/
Keywords: markdown,pdf,docx,pptx,xlsx,ocr,zip,url,html,youtube,playwright,database,openapi,codeflow
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Topic :: Text Processing :: Markup :: Markdown
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: pdf
Requires-Dist: pymupdf>=1.24.0; extra == "pdf"
Requires-Dist: pdfplumber>=0.11.0; extra == "pdf"
Provides-Extra: word
Requires-Dist: mammoth>=1.8.0; extra == "word"
Requires-Dist: markdownify>=0.13.0; extra == "word"
Provides-Extra: ppt
Requires-Dist: python-pptx>=1.0.0; extra == "ppt"
Requires-Dist: Pillow>=10.0.0; extra == "ppt"
Requires-Dist: lxml>=5.0.0; extra == "ppt"
Requires-Dist: olefile>=0.47; extra == "ppt"
Requires-Dist: openpyxl>=3.1.0; extra == "ppt"
Requires-Dist: mammoth>=1.8.0; extra == "ppt"
Requires-Dist: markdownify>=0.14.0; extra == "ppt"
Requires-Dist: pymupdf>=1.24.0; extra == "ppt"
Requires-Dist: pdfplumber>=0.11.0; extra == "ppt"
Requires-Dist: pytesseract>=0.3.10; extra == "ppt"
Provides-Extra: xlsx
Requires-Dist: openpyxl>=3.1.0; extra == "xlsx"
Provides-Extra: image
Requires-Dist: Pillow>=10.0.0; extra == "image"
Provides-Extra: image-ocr
Requires-Dist: Pillow>=10.0.0; extra == "image-ocr"
Requires-Dist: pytesseract>=0.3.10; extra == "image-ocr"
Requires-Dist: numpy>=1.24.0; extra == "image-ocr"
Requires-Dist: paddlepaddle>=2.5.0; extra == "image-ocr"
Requires-Dist: paddleocr>=2.7.0.3; extra == "image-ocr"
Requires-Dist: easyocr>=1.7.0; extra == "image-ocr"
Provides-Extra: text
Requires-Dist: xmltodict>=0.13.0; extra == "text"
Requires-Dist: lxml>=5.0.0; extra == "text"
Provides-Extra: archive
Requires-Dist: Pillow>=10.0.0; extra == "archive"
Requires-Dist: pytesseract>=0.3.10; extra == "archive"
Provides-Extra: archive-formats
Requires-Dist: py7zr>=0.21.0; extra == "archive-formats"
Requires-Dist: patoolib>=2.0.0; extra == "archive-formats"
Provides-Extra: url
Requires-Dist: httpx>=0.27.0; extra == "url"
Requires-Dist: beautifulsoup4>=4.12.0; extra == "url"
Requires-Dist: lxml>=5.0.0; extra == "url"
Requires-Dist: lxml_html_clean>=0.4.0; extra == "url"
Requires-Dist: markdownify>=0.14.0; extra == "url"
Requires-Dist: readability-lxml>=0.8.1; extra == "url"
Provides-Extra: url-full
Requires-Dist: httpx>=0.27.0; extra == "url-full"
Requires-Dist: beautifulsoup4>=4.12.0; extra == "url-full"
Requires-Dist: lxml>=5.0.0; extra == "url-full"
Requires-Dist: lxml_html_clean>=0.4.0; extra == "url-full"
Requires-Dist: markdownify>=0.14.0; extra == "url-full"
Requires-Dist: readability-lxml>=0.8.1; extra == "url-full"
Requires-Dist: pymupdf>=1.24.0; extra == "url-full"
Requires-Dist: pdfplumber>=0.11.0; extra == "url-full"
Requires-Dist: mammoth>=1.8.0; extra == "url-full"
Requires-Dist: python-pptx>=1.0.0; extra == "url-full"
Requires-Dist: Pillow>=10.0.0; extra == "url-full"
Requires-Dist: olefile>=0.47; extra == "url-full"
Requires-Dist: openpyxl>=3.1.0; extra == "url-full"
Requires-Dist: pytesseract>=0.3.10; extra == "url-full"
Provides-Extra: audio
Requires-Dist: openai-whisper>=20231117; extra == "audio"
Requires-Dist: imageio-ffmpeg>=0.5.1; extra == "audio"
Provides-Extra: video
Requires-Dist: openai-whisper>=20231117; extra == "video"
Requires-Dist: imageio-ffmpeg>=0.5.1; extra == "video"
Provides-Extra: youtube
Requires-Dist: youtube-transcript-api>=0.6.0; extra == "youtube"
Requires-Dist: httpx>=0.27.0; extra == "youtube"
Requires-Dist: beautifulsoup4>=4.12.0; extra == "youtube"
Requires-Dist: lxml>=5.0.0; extra == "youtube"
Provides-Extra: playwright
Requires-Dist: playwright>=1.49.0; extra == "playwright"
Requires-Dist: httpx>=0.27.0; extra == "playwright"
Requires-Dist: beautifulsoup4>=4.12.0; extra == "playwright"
Requires-Dist: lxml>=5.0.0; extra == "playwright"
Requires-Dist: markdownify>=0.14.0; extra == "playwright"
Requires-Dist: readability-lxml>=0.8.1; extra == "playwright"
Requires-Dist: Pillow>=10.0.0; extra == "playwright"
Provides-Extra: db
Requires-Dist: sqlalchemy>=2.0.0; extra == "db"
Requires-Dist: pyyaml>=6.0.0; extra == "db"
Requires-Dist: psycopg2-binary>=2.9.9; extra == "db"
Requires-Dist: pymysql>=1.1.0; extra == "db"
Requires-Dist: pyodbc>=5.0.0; extra == "db"
Requires-Dist: oracledb>=2.0.0; extra == "db"
Requires-Dist: pymongo>=4.6.0; extra == "db"
Requires-Dist: elasticsearch>=8.0.0; extra == "db"
Requires-Dist: mermaid-py>=0.8.0; extra == "db"
Provides-Extra: graph
Requires-Dist: networkx>=3.2.0; extra == "graph"
Requires-Dist: neo4j>=5.14.0; extra == "graph"
Requires-Dist: pyyaml>=6.0.0; extra == "graph"
Provides-Extra: openapi
Requires-Dist: pyyaml>=6.0.0; extra == "openapi"
Requires-Dist: prance>=23.6.0; extra == "openapi"
Requires-Dist: openapi-spec-validator>=0.7.0; extra == "openapi"
Provides-Extra: codeflow
Requires-Dist: networkx>=3.2.0; extra == "codeflow"
Requires-Dist: javalang>=0.13.0; extra == "codeflow"
Provides-Extra: sap
Requires-Dist: pyyaml>=6.0.0; extra == "sap"
Requires-Dist: networkx>=3.2.0; extra == "sap"
Requires-Dist: pydantic>=2.0.0; extra == "sap"
Requires-Dist: pydantic-settings>=2.0.0; extra == "sap"
Provides-Extra: codeflow-worker
Requires-Dist: celery>=5.3.0; extra == "codeflow-worker"
Requires-Dist: redis>=5.0.0; extra == "codeflow-worker"
Provides-Extra: codeflow-treesitter
Requires-Dist: tree-sitter>=0.21.0; extra == "codeflow-treesitter"
Requires-Dist: tree-sitter-javascript>=0.21.0; extra == "codeflow-treesitter"
Requires-Dist: tree-sitter-typescript>=0.23.0; extra == "codeflow-treesitter"
Requires-Dist: tree-sitter-cpp>=0.23.0; extra == "codeflow-treesitter"
Requires-Dist: tree-sitter-java>=0.23.0; extra == "codeflow-treesitter"
Requires-Dist: tree-sitter-python>=0.23.0; extra == "codeflow-treesitter"
Requires-Dist: tree-sitter-go>=0.23.0; extra == "codeflow-treesitter"
Requires-Dist: tree-sitter-php>=0.23.0; extra == "codeflow-treesitter"
Requires-Dist: tree-sitter-rust>=0.23.0; extra == "codeflow-treesitter"
Requires-Dist: tree-sitter-kotlin>=0.23.0; extra == "codeflow-treesitter"
Requires-Dist: tree-sitter-c-sharp>=0.23.0; extra == "codeflow-treesitter"
Requires-Dist: tree-sitter-swift>=0.7.2; extra == "codeflow-treesitter"
Requires-Dist: tree-sitter-ruby>=0.23.0; extra == "codeflow-treesitter"
Requires-Dist: tree-sitter-lua>=0.2.0; extra == "codeflow-treesitter"
Requires-Dist: tree-sitter-scala>=0.6.0; extra == "codeflow-treesitter"
Requires-Dist: tree-sitter-zig>=0.2.0; extra == "codeflow-treesitter"
Provides-Extra: codeflow-clang
Requires-Dist: clang>=16.0.0; extra == "codeflow-clang"
Provides-Extra: codeflow-semantic
Requires-Dist: numpy>=1.24.0; extra == "codeflow-semantic"
Requires-Dist: scikit-learn>=1.3.0; extra == "codeflow-semantic"
Requires-Dist: sentence-transformers>=3.0.0; extra == "codeflow-semantic"
Provides-Extra: log
Requires-Dist: pandas>=2.0.0; extra == "log"
Requires-Dist: python-dateutil>=2.8.0; extra == "log"
Requires-Dist: pyyaml>=6.0.0; extra == "log"
Provides-Extra: log-cluster
Requires-Dist: scikit-learn>=1.3.0; extra == "log-cluster"
Provides-Extra: log-semantic
Requires-Dist: numpy>=1.24.0; extra == "log-semantic"
Requires-Dist: scikit-learn>=1.3.0; extra == "log-semantic"
Requires-Dist: sentence-transformers>=3.0.0; extra == "log-semantic"
Provides-Extra: log-pretty
Requires-Dist: loguru>=0.7.0; extra == "log-pretty"
Provides-Extra: log-otel-proto
Requires-Dist: opentelemetry-proto>=1.27.0; extra == "log-otel-proto"
Requires-Dist: protobuf>=4.25.0; extra == "log-otel-proto"
Provides-Extra: log-export-parquet
Requires-Dist: pyarrow>=14.0.0; extra == "log-export-parquet"
Provides-Extra: log-stream-kafka
Requires-Dist: kafka-python>=2.0.2; extra == "log-stream-kafka"
Provides-Extra: log-stream-redis
Requires-Dist: redis>=5.0.0; extra == "log-stream-redis"
Provides-Extra: log-stream-ws
Requires-Dist: websockets>=12.0; extra == "log-stream-ws"
Provides-Extra: api
Requires-Dist: fastapi>=0.115.0; extra == "api"
Requires-Dist: uvicorn[standard]>=0.32.0; extra == "api"
Requires-Dist: python-multipart>=0.0.12; extra == "api"
Requires-Dist: httpx>=0.27.0; extra == "api"
Requires-Dist: pydantic-settings>=2.0.0; extra == "api"
Provides-Extra: mcp
Requires-Dist: mcp>=1.2.0; extra == "mcp"
Requires-Dist: fastmcp>=2.3.0; extra == "mcp"
Provides-Extra: docs
Requires-Dist: mkdocs-material; extra == "docs"
Requires-Dist: mkdocstrings; extra == "docs"
Requires-Dist: mkdocstrings-python; extra == "docs"
Requires-Dist: pymdown-extensions; extra == "docs"
Requires-Dist: mkdocs-mermaid2-plugin; extra == "docs"
Requires-Dist: mkdocs-git-revision-date-localized-plugin; extra == "docs"
Requires-Dist: mike; extra == "docs"
Provides-Extra: skill-openai
Requires-Dist: openai>=1.30.0; extra == "skill-openai"
Provides-Extra: skill-rag-chroma
Requires-Dist: chromadb>=0.4.22; extra == "skill-rag-chroma"
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: httpx>=0.27.0; extra == "dev"
Requires-Dist: beautifulsoup4>=4.12.0; extra == "dev"
Requires-Dist: lxml>=5.0.0; extra == "dev"
Requires-Dist: youtube-transcript-api>=0.6.0; extra == "dev"
Requires-Dist: mdengine[api]; extra == "dev"
Requires-Dist: mdengine[mcp]; extra == "dev"
Requires-Dist: mdengine[url]; extra == "dev"
Requires-Dist: mdengine[playwright]; extra == "dev"
Requires-Dist: mdengine[db]; extra == "dev"
Requires-Dist: mdengine[graph]; extra == "dev"
Requires-Dist: mdengine[openapi]; extra == "dev"
Requires-Dist: mdengine[codeflow]; extra == "dev"
Requires-Dist: mdengine[codeflow-treesitter]; extra == "dev"
Requires-Dist: mdengine[log]; extra == "dev"
Requires-Dist: mdengine[log-cluster]; extra == "dev"
Provides-Extra: all
Requires-Dist: pymupdf>=1.24.0; extra == "all"
Requires-Dist: pdfplumber>=0.11.0; extra == "all"
Requires-Dist: mammoth>=1.8.0; extra == "all"
Requires-Dist: markdownify>=0.14.0; extra == "all"
Requires-Dist: python-pptx>=1.0.0; extra == "all"
Requires-Dist: Pillow>=10.0.0; extra == "all"
Requires-Dist: lxml>=5.0.0; extra == "all"
Requires-Dist: olefile>=0.47; extra == "all"
Requires-Dist: openpyxl>=3.1.0; extra == "all"
Requires-Dist: pytesseract>=0.3.10; extra == "all"
Requires-Dist: numpy>=1.24.0; extra == "all"
Requires-Dist: paddlepaddle>=2.5.0; extra == "all"
Requires-Dist: paddleocr>=2.7.0.3; extra == "all"
Requires-Dist: easyocr>=1.7.0; extra == "all"
Requires-Dist: fastapi>=0.115.0; extra == "all"
Requires-Dist: uvicorn[standard]>=0.32.0; extra == "all"
Requires-Dist: python-multipart>=0.0.12; extra == "all"
Requires-Dist: httpx>=0.27.0; extra == "all"
Requires-Dist: pydantic-settings>=2.0.0; extra == "all"
Requires-Dist: mcp>=1.2.0; extra == "all"
Requires-Dist: fastmcp>=2.3.0; extra == "all"
Requires-Dist: beautifulsoup4>=4.12.0; extra == "all"
Requires-Dist: lxml_html_clean>=0.4.0; extra == "all"
Requires-Dist: readability-lxml>=0.8.1; extra == "all"
Requires-Dist: youtube-transcript-api>=0.6.0; extra == "all"
Dynamic: license-file

# mdengine

Single Python distribution for converting **PDF**, **Word (.docx)**, **PowerPoint (.pptx)**, **Excel (.xlsx/.xlsm)**, **images** (OCR), **plain text / JSON / XML**, **ZIP archives**, **audio / video** (Whisper transcription → Markdown), **database metadata** (SQL + Mongo), **graphs** (Neo4j / NetworkX → Markdown), **OpenAPI** specs, **Playwright**-captured web pages (including SPAs), **application logs** (plain, JSON, CSV, ZIP bundles → Markdown with optional clustering and semantic grouping), **OpenTelemetry** traces (**OTLP** JSON or protobuf → Markdown trace summaries; optional correlation with log exports via `otel_path`), **SAP artifacts** (ABAP, CDS, DDIC, OData, BAPI, IDoc, transport → AI-ready Markdown knowledge packs), and **source code** (codeflow → architecture Markdown) into **Markdown** (and related assets). Install only the extras you need; everything imports under the **`md_generator`** package.

- **PyPI name:** `mdengine` (import package: `md_generator`)
- **Source:** [github.com/vishal7090/md-generator](https://github.com/vishal7090/md-generator)
- **Python:** 3.10+
- **License:** [MIT](LICENSE)

**Quick links:** [On a new computer](#on-a-new-computer) · [Command-line execution](#command-line-execution) · [Python library](#python-library) · [Audio and video](#audio-and-video-to-markdown) · [HTTP API](#http-api-fastapi) · [MCP](#mcp-model-context-protocol) · [AI assistant CLI](#ai-assistant-cli) · [Development](#development) · [Published documentation](https://vishal7090.github.io/md-generator/) · [Code of Conduct](CODE_OF_CONDUCT.md)

---

## On a new computer

Use this checklist the first time you run the tools on a machine that does not have the project yet.

1. **Install Python 3.10 or newer** from [python.org](https://www.python.org/downloads/) (Windows: enable **Add python.exe to PATH** in the installer). Confirm in a new terminal: `python --version`.
2. **(Recommended)** Create an isolated environment so dependencies do not clash with other projects:
   ```bash
   python -m venv .venv
   ```
   Then activate it: **Windows (PowerShell)** `.\.venv\Scripts\Activate.ps1` · **Windows (CMD)** `.venv\Scripts\activate.bat` · **macOS / Linux** `source .venv/bin/activate`.
3. **Install this package with the extras you need** (see [Optional dependency extras](#optional-dependency-extras) for what each extra does):
   ```bash
   pip install "mdengine[pdf,word]"
   ```
   If the package is not on PyPI yet, clone [the repository](https://github.com/vishal7090/md-generator), `cd` into the repo root, then:
   ```bash
   pip install -e ".[pdf,word]"
   ```
4. **Confirm the CLI is on your PATH:** `md-pdf --help` (or `md-word --help`, etc.). If you see “command not found”, the folder where `pip` puts scripts (often `.venv\Scripts` on Windows or `.venv/bin` on Unix) must be on your `PATH`, or you must run commands from an **activated** virtual environment.
5. **Run one conversion** with a real file path, for example:
   ```bash
   md-pdf path/to/report.pdf out.md
   ```
   Full flags and every `md-*` command are in [Command-line execution](#command-line-execution).
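
Once a single conversion works, the same command can be driven from a script. The sketch below only builds `md-pdf <input> <output>` argument lists in the form shown in step 5. The helper name `build_pdf_commands` and the `<stem>.md` output naming are illustrative assumptions, not part of mdengine.

```python
from pathlib import Path


def build_pdf_commands(pdf_paths, out_dir):
    """Return one `md-pdf <input> <output>` argv list per PDF.

    Hypothetical helper: only the `md-pdf INPUT OUTPUT` form comes from
    this README; the output file naming is an assumption.
    """
    out_dir = Path(out_dir)
    commands = []
    for pdf in sorted(Path(p) for p in pdf_paths):
        target = out_dir / (pdf.stem + ".md")
        commands.append(["md-pdf", str(pdf), str(target)])
    return commands
```

Each list can then be passed to `subprocess.run(argv, check=True)` from an activated environment that has `mdengine[pdf]` installed.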

---

## Installation

From the repository root (editable install for development):

```bash
pip install -e .
```

With format-specific and HTTP extras:

```bash
pip install -e ".[pdf,word,api]"
pip install -e ".[ppt,xlsx,image,archive,api,mcp]"
```

From PyPI (once published):

```bash
pip install "mdengine[pdf,word]"
pip install "mdengine[all]"
```

### Optional dependency extras

| Extra | Purpose |
|--------|---------|
| `pdf` | PDF extraction (PyMuPDF, pdfplumber) |
| `word` | DOCX → Markdown (mammoth, markdownify) |
| `ppt` | PPTX and embedded content (python-pptx, Pillow, lxml, mammoth, PyMuPDF, …) |
| `xlsx` | Excel → Markdown (openpyxl) |
| `image` | Image I/O for OCR pipelines (Pillow) |
| `image-ocr` | Heavy OCR backends (pytesseract, paddle, easyocr, …) |
| `text` | TXT / JSON / XML converter (mostly stdlib; installs `xmltodict` and `lxml`) |
| `archive` | ZIP → Markdown layout (Pillow; optional tesseract for inline image OCR) |
| `url` | HTTP(S) HTML → Markdown (httpx, readability-lxml, markdownify, BeautifulSoup, lxml) |
| `url-full` | `url` plus PDF/Word/PPTX/XLSX/archive stack for **post-converting** downloaded linked files to Markdown |
| `audio` | **Audio → Markdown** via Whisper (`openai-whisper`); ships `imageio-ffmpeg` for a bundled **ffmpeg** when none is on `PATH` |
| `video` | **Video → Markdown** (ffmpeg extracts mono 16 kHz WAV, then same Whisper stack as `audio`) |
| `youtube` | **YouTube → Markdown** (captions + page metadata; `youtube-transcript-api`, httpx, BeautifulSoup, lxml) |
| `playwright` | **Playwright-rendered SPA → Markdown** (`playwright` + same HTML stack as `url`; run `playwright install chromium` after install) |
| `db` | **Database metadata → Markdown** (SQLAlchemy, drivers, `mermaid-py` for ERD helpers) |
| `openapi` | **OpenAPI / Swagger → Markdown** (`prance`, `openapi-spec-validator`, PyYAML) |
| `codeflow` | **Static codeflow / call graphs → Markdown** (`networkx`, `javalang`; core scanners) |
| `codeflow-worker` | Optional **Celery + Redis** workers for the codeflow API async path |
| `codeflow-treesitter` | **Tree-sitter** parsers (JS/TS/TSX, C++, Java, Python, Go, PHP; the extra also installs grammars for Rust, Kotlin, C#, Swift, Ruby, Lua, Scala, and Zig); use `--parser-mode treesitter` — see [codeflow-to-md/docs/parser-backends.md](codeflow-to-md/docs/parser-backends.md) |
| `codeflow-clang` | **libclang** bindings for C/C++ parsing when used by the codeflow pipeline |
| `codeflow-semantic` | **Semantic clustering** (SentenceTransformers, scikit-learn, numpy; large install) |
| `log` | **Log files → Markdown** (`pandas`, PyYAML, date parsing) |
| `log-cluster` | Log pipeline **clustering** (scikit-learn) |
| `log-semantic` | Log pipeline **semantic** helpers (SentenceTransformers + scikit-learn; large) |
| `log-pretty` | Optional **loguru** for pretty console diagnostics during log conversion |
| `log-otel-proto` | **OTLP protobuf** ingest for **`md-otel`** and log/OTEL tooling (`opentelemetry-proto`, `protobuf`) |
| `sap` | **SAP artifacts → Markdown** (ABAP, CDS, DDIC, OData, BAPI, IDoc, transport): PyYAML, NetworkX, Pydantic |
| `api` | FastAPI, uvicorn, httpx, pydantic-settings |
| `mcp` | MCP servers (`mcp`, `fastmcp` where used) |
| `graph` | **Graph → Markdown** (Neo4j Bolt + NetworkX GraphML/GML): `networkx`, `neo4j`, `pyyaml` |
| `docs` | **MkDocs** site tooling (Material, mkdocstrings, Mermaid plugin, mike) for maintainers publishing reference docs |
| `dev` | pytest + API/MCP test helpers |
| `all` | Large superset of dependencies (use only if you need everything) |
| `skill-openai` | Optional helpers for OpenAI-style calls with assembled skill context (`md_generator.tools.assistant`) |
| `skill-rag-chroma` | Optional Chroma-based RAG over skill markdown (`MasterAgent.ask(..., use_rag=True)`) |

Nested ZIP and office files inside archives require the corresponding extras (e.g. `archive` plus `pdf` for PDFs inside a ZIP).
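
As a rough way to decide which extras a given ZIP needs, the member names can be inspected before installing. This is a sketch under assumptions: the suffix-to-extra mapping is derived from the table above, and the helper names are hypothetical rather than mdengine functions.

```python
import zipfile
from pathlib import PurePosixPath

# Illustrative mapping from file suffix to the extra named in the table above.
SUFFIX_TO_EXTRA = {
    ".pdf": "pdf",
    ".docx": "word",
    ".pptx": "ppt",
    ".xlsx": "xlsx",
    ".xlsm": "xlsx",
}


def extras_for_members(names):
    """Return the sorted extras suggested by a list of archive member names."""
    extras = {"archive"}  # the ZIP converter itself
    for name in names:
        suffix = PurePosixPath(name).suffix.lower()
        if suffix in SUFFIX_TO_EXTRA:
            extras.add(SUFFIX_TO_EXTRA[suffix])
    return sorted(extras)


def pip_requirement(zip_file):
    """Build an `mdengine[...]` requirement string for one ZIP (path or file object)."""
    with zipfile.ZipFile(zip_file) as zf:
        extras = extras_for_members(zf.namelist())
    return "mdengine[" + ",".join(extras) + "]"
```

For a ZIP that contains one PDF and one DOCX, `pip_requirement` returns `mdengine[archive,pdf,word]`, which can be passed to `pip install`.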

---

## Command-line execution

All converters can be run from a terminal after you install the package (with the right **extras** for that format). Each tool is a normal executable on your `PATH` (no need to open Python yourself unless you choose the shim workflow below).

### 1. Install (once)

```bash
pip install "mdengine[pdf,word]"          # adjust extras: ppt, xlsx, image, archive, text, db, graph, …
# or from a clone:
pip install -e ".[pdf,word,archive]"
```

### 2. Check that the command is available

```bash
md-pdf --help
md-zip --help
```

If the shell reports “command not found”, ensure the Python **Scripts** directory is on your `PATH` (same place `pip` installs console scripts).
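
To see where the current interpreter installs console scripts and whether that directory is on `PATH`, a small standard-library check is enough. The function names here are illustrative.

```python
import os
import sysconfig


def scripts_dir():
    """Directory where this interpreter's console scripts are installed."""
    return sysconfig.get_path("scripts")


def is_on_path(directory, path_value=None):
    """Return True if `directory` is one of the entries in PATH."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    wanted = os.path.normcase(os.path.normpath(directory))
    entries = [e for e in path_value.split(os.pathsep) if e]
    return any(os.path.normcase(os.path.normpath(e)) == wanted for e in entries)
```

Running `is_on_path(scripts_dir())` with the same interpreter that ran `pip install` shows whether the `md-*` commands should be reachable from the shell.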

### 3. Commands (command-line entry points)

| Command | Implements | One-line example |
|---------|------------|------------------|
| `md-pdf` | `md_generator.pdf.converter:main` | `md-pdf report.pdf out.md` |
| `md-word` | `md_generator.word.converter:main` | `md-word notes.docx body.md` |
| `md-ppt` | `md_generator.ppt.converter:main` | `md-ppt deck.pptx ./ppt-out` |
| `md-xlsx` | `md_generator.xlsx.converter:main` | `md-xlsx -i data.xlsx -o ./excel-out` (also **`.csv`**) |
| `md-image` | `md_generator.image.converter:main` | `md-image ./scans page.md` |
| `md-text` | `md_generator.text.converter:main` | `md-text config.xml out.md` |
| `md-zip` | `md_generator.archive.converter:main` | `md-zip bundle.zip ./zip-out` |
| `md-url` | `md_generator.url.converter:main` | `md-url https://example.com/doc ./web-out --artifact-layout` |
| `md-audio` | `md_generator.media.audio.converter:main` | `md-audio clip.mp3 transcript.md --model base` |
| `md-video` | `md_generator.media.video.converter:main` | `md-video clip.mp4 transcript.md --model base` |
| `md-youtube` | `md_generator.media.youtube.converter:main` | `md-youtube "https://youtu.be/…" out.md --transcript-lang en` |
| `md-audio-api` | `md_generator.media.audio.api.run:main` | REST + MCP on port **8011** (see [Audio and video to Markdown](#audio-and-video-to-markdown)) |
| `md-video-api` | `md_generator.media.video.api.run:main` | REST + MCP on port **8012** |
| `md-youtube-api` | `md_generator.media.youtube.api.run:main` | REST + MCP on port **8013** (JSON `url` body; see same section) |
| `md-audio-mcp` | `md_generator.media.audio.api.mcp_server:main` | Standalone MCP (`--transport stdio` \| `sse` \| `streamable-http`) |
| `md-video-mcp` | `md_generator.media.video.api.mcp_server:main` | Same for video |
| `md-youtube-mcp` | `md_generator.media.youtube.api.mcp_server:main` | Same for YouTube (`youtube_url_to_markdown`) |
| `md-db` | `md_generator.db.cli.main:main` | `pip install "mdengine[db]"` then `md-db --config db.yaml` (or `mdengine db-to-md …`) |
| `md-db-api` | `md_generator.db.api.run:main` | FastAPI on port **8010** (`DB_TO_MD_PORT`): `POST /db-to-md/run`, `POST /db-to-md/run/sqlite` (upload + ZIP), `POST /db-to-md/job`, `POST /db-to-md/job/sqlite` (upload + async job), SSE `/db-to-md/job/{id}/events` |
| `md-db-mcp` | `md_generator.db.api.mcp_server:main` | Standalone MCP for metadata export tools |
| `md-graph` | `md_generator.graph.cli.main:main` | `pip install "mdengine[graph]"` then `md-graph --source neo4j --uri bolt://…` (or `mdengine graph-to-md …`) |
| `md-graph-api` | `md_generator.graph.api.run:main` | FastAPI on port **8012** (`GRAPH_TO_MD_PORT`): `POST /graph-to-md/run`, `/graph-to-md/job`, SSE `/graph-to-md/job/{id}/events` |
| `md-graph-mcp` | `md_generator.graph.api.mcp_server:main` | Standalone MCP for graph export tools |
| `md-openapi` | `md_generator.openapi.cli.main:main` | `pip install "mdengine[openapi]"` then `md-openapi generate --file openapi.yaml --output ./docs` (or `mdengine openapi-to-md generate …`) |
| `md-openapi-api` | `md_generator.openapi.api.run:main` | FastAPI on port **8015** (`OPENAPI_TO_MD_PORT`): `POST /openapi-to-md/generate` (OpenAPI upload → ZIP), `/health`, MCP at **`/mcp`** |
| `md-openapi-mcp` | `md_generator.openapi.api.mcp_server:main` | Standalone MCP: `api_validate_openapi_yaml`, `api_generate_readme_markdown`, `api_run_sync_zip_base64` |
| `md-playwright` | `md_generator.playwright.cli:main` | `pip install "mdengine[playwright]"` then `md-playwright https://spa.example/app ./spa-out` (after `playwright install chromium`) |
| `md-playwright-api` | `md_generator.playwright.api.run:main` | FastAPI + MCP on port **8014** (`PLAYWRIGHT_TO_MD_API_PORT` / settings); see [HTTP API](#http-api-fastapi) |
| `md-playwright-mcp` | `md_generator.playwright.api.mcp_server:main` | Standalone MCP for Playwright capture tools |
| `md-codeflow` | `md_generator.codeflow.cli.main:main` | Same as `codeflow` — `md-codeflow scan …` (alias of packaged `codeflow` script) |
| `codeflow` | `md_generator.codeflow.cli.main:main` | `pip install "mdengine[codeflow]"` then `codeflow scan path/to/src --output ./out --lang java` |
| `md-codeflow-api` | `md_generator.codeflow.api.run:main` | FastAPI on port **8016** (`CODEFLOW_TO_MD_PORT`); upload/workspace job flow + MCP at **`/mcp`** when MCP extras resolve |
| `md-codeflow-mcp` | `md_generator.codeflow.api.mcp_server:main` | Standalone **stdio** MCP for codeflow tools |
| `md-log` | `md_generator.log.cli.main:main` | `pip install "mdengine[log]"` then `md-log --config log.yaml` (or `mdengine log-to-md …`) |
| `md-log-api` | `md_generator.log.api.run:main` | FastAPI on port **8012** (`LOG_TO_MD_PORT`); set a different port if **`md-graph-api`** or **`md-video-api`** already uses **8012** |
| `md-log-mcp` | `md_generator.log.api.mcp_server:main` | Standalone MCP (`stdio` \| `sse` \| `streamable-http`) |
| `md-otel` | `md_generator.otel.cli.main:main` | `pip install "mdengine[log-otel-proto]"` then `md-otel --input traces.json --output ./otel-docs` (or `mdengine otel-to-md …`; add **`--protobuf`** for OTLP binary) |
| `md-sap` | `md_generator.sap.cli.main:main` | `pip install "mdengine[sap]"` then `md-sap ./sap-source --output ./sap-out --graph --chunk` (or `mdengine sap-to-md …`) |
| `md-sap-api` | `md_generator.sap.api.run:main` | FastAPI on port **8020** (`SAP_TO_MD_PORT`): `POST /sap-to-md/run`, `POST /sap-to-md/job`, `GET /sap-to-md/job/{id}/download` |
| `md-sap-mcp` | `md_generator.sap.api.mcp_server:main` | Runs the same FastAPI app as **`md-sap-api`** (MCP mount when enabled in app) |
| `mdengine` | `md_generator.engine_cli:main` | `mdengine ai assist …`, `mdengine ai export …`, `mdengine skill build …`, `mdengine db-to-md …`, `mdengine log-to-md …`, `mdengine otel-to-md …`, `mdengine sap-to-md …`, `mdengine graph-to-md …`, `mdengine openapi-to-md generate …`, `mdengine codeflow-to-md scan …` |
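
Several API commands in the table share a default port, so it can help to check a planned set of services for clashes before starting them. The sketch below copies the default ports from the table; `port_collisions` is a hypothetical helper.

```python
from collections import defaultdict

# Default ports as listed in the command table above.
DEFAULT_PORTS = {
    "md-db-api": 8010,
    "md-audio-api": 8011,
    "md-video-api": 8012,
    "md-graph-api": 8012,
    "md-log-api": 8012,
    "md-youtube-api": 8013,
    "md-playwright-api": 8014,
    "md-openapi-api": 8015,
    "md-codeflow-api": 8016,
    "md-sap-api": 8020,
}


def port_collisions(commands, ports=DEFAULT_PORTS):
    """Map each shared default port to the commands that would compete for it."""
    by_port = defaultdict(list)
    for cmd in commands:
        by_port[ports[cmd]].append(cmd)
    return {port: sorted(cmds) for port, cmds in by_port.items() if len(cmds) > 1}
```

When a clash is reported, one of the services needs a different port, for example through the port variable named in its table row.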

**openapi-to-md (`md-openapi`):** OpenAPI **3.x** is parsed directly; **Swagger 2.0** (`swagger: "2.0"`) documents are **converted in-process** to OpenAPI 3.0.3 (deterministic, in-repo converter) before `$ref` resolution. Edge-heavy specs (unusual OAuth2 flows, vendor extensions) may still need fixes after conversion.
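
Whether a document takes the direct path or the Swagger 2.0 conversion path depends on its top-level version key. The check below is a minimal sketch of that distinction for an already-parsed spec; it is not the converter's actual code, and `spec_flavor` is a hypothetical name.

```python
def spec_flavor(doc):
    """Classify a parsed spec (a dict) by its top-level version key."""
    if str(doc.get("swagger", "")).strip() == "2.0":
        return "swagger-2.0"  # would be converted to OpenAPI 3.0.3 first
    if str(doc.get("openapi", "")).startswith("3."):
        return "openapi-3.x"  # parsed directly
    return "unknown"
```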

**db-to-md ER diagrams:** add **`erd`** to the export feature list (YAML `features.include`, API body, or CLI `--include …,erd`). **Preferred:** **Graphviz** (`dot` on `PATH`, or **`GRAPHVIZ_DOT`**) produces `erd/*.dot`, `erd/*.png`, and `erd/*.svg`. **If Graphviz is missing,** the exporter falls back to **Mermaid** (`erDiagram`): it writes `erd/*.mermaid` plus a fenced **`erd/*.md`** for GitHub-style preview. With **`mermaid-py`** (included in `mdengine[db]`), it also requests **PNG/SVG** via mermaid.ink (requires network access unless you self-host mermaid.ink and set **`MERMAID_INK_SERVER`** per [mermaid-py](https://pypi.org/project/mermaid-py/)). Tune **`erd.max_tables`** (default 100) and **`erd.scope`** (`full` \| `per_schema` \| `per_table`) under `erd:` in YAML; CLI: `--erd-max-tables`, `--erd-scope`. During async jobs, ERD progress is reported through SSE `progress_update` events whose `current` value starts with `erd:`.
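
The Graphviz-or-Mermaid choice can be predicted on a given machine with a few lines of standard-library Python. This is a sketch, not the exporter's logic: checking `GRAPHVIZ_DOT` before `PATH` is an assumption, and `pick_erd_backend` is a hypothetical name.

```python
import os
import shutil


def pick_erd_backend(env=None, which=shutil.which):
    """Return ("graphviz", dot_path) or ("mermaid", None)."""
    env = os.environ if env is None else env
    override = env.get("GRAPHVIZ_DOT")
    # Assumption: an explicit GRAPHVIZ_DOT pointing at a real file wins over PATH.
    if override and os.path.isfile(override):
        return "graphviz", override
    found = which("dot")
    if found:
        return "graphviz", found
    # No usable `dot` binary: the Mermaid fallback described above applies.
    return "mermaid", None
```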

**db-to-md split exports and README merge:** with **`output.split_files: true`**, set **`output.write_combined_feature_markdown: true`** to also write root-level combined Markdown (for example `tables.md`, `functions.md`, `indexes.md`, and feature-specific paths such as `oracle/packages.md` or `mongodb/collections.md` when those features run). Set **`output.readme_feature_merge`** to **`inline`** (append full bundle bodies into `README.md`) or **`toc`** (append a list of links to those files). If merge is not **`none`** and split files are on, combined bundle writes are turned on automatically when the config is loaded. CLI: `--write-combined-feature-markdown`, `--readme-feature-merge none|inline|toc`.
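
The interaction between the three `output` settings can be written down as a small normalization step. The function below is an illustrative sketch of the rule described above, applied to a plain dict; treating a missing `readme_feature_merge` as `none` is an assumption, and `normalize_output` is not an mdengine function.

```python
def normalize_output(output):
    """Apply the documented rule to a copy of an `output` config mapping."""
    cfg = dict(output)
    merge = cfg.get("readme_feature_merge", "none")  # assumed default
    if merge not in ("none", "inline", "toc"):
        raise ValueError(f"unknown readme_feature_merge: {merge!r}")
    # Merging into README.md needs the combined bundles, so they are switched on.
    if cfg.get("split_files") and merge != "none":
        cfg["write_combined_feature_markdown"] = True
    return cfg
```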

**db-to-md SQLite:** set **`database.type: sqlite`** and **`database.uri`** to a SQLAlchemy URL such as **`sqlite:///my.db`** (relative path), **`sqlite:///C:/data/my.db`** (Windows absolute path), or **`sqlite:////data/my.db`** (POSIX absolute path). The default SQLite catalog is **`main`**; packaged YAML that still has **`schema: public`** is normalized to **`main`** when the type is SQLite. CLI: **`--type sqlite`**. Stored routines, sequences, and partitions do not exist in SQLite, so those sections stay empty; tables, indexes, views, triggers, and ERD (via FK introspection) follow the same export path as other SQL engines.
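
A `database` block for a SQLite file could be assembled as in the sketch below. It handles POSIX-style paths only, mirrors the `type` / `uri` / `schema` keys named above, and applies the `public` to `main` normalization; the helper name and the returned dict shape are assumptions.

```python
from pathlib import PurePosixPath


def sqlite_database_block(db_path, schema="public"):
    """Build a `database` mapping for a SQLite file given a POSIX-style path."""
    path = PurePosixPath(db_path)
    # SQLAlchemy form: `sqlite:///` followed by the path, so an absolute
    # POSIX path yields four slashes in total.
    uri = "sqlite:///" + str(path)
    # SQLite's default catalog is `main`; `public` is normalized accordingly.
    if schema == "public":
        schema = "main"
    return {"type": "sqlite", "uri": uri, "schema": schema}
```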

**db-to-md SQL Server:** set **`database.type: mssql`** (alias: `sqlserver`) and a **`mssql+pyodbc://`** URI with an ODBC driver query parameter (for example `?driver=ODBC+Driver+18+for+SQL+Server`). Set **`database.schema`** to the SQL Server schema (default **`dbo`**). Optional features: **`synonyms`**, **`dependencies`** (lineage from `sys.sql_expression_dependencies`). Table/column/routine **MS_Description** extended properties are exported when present. Requires **`pyodbc`** (included in `mdengine[db]`).
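
The driver name and any special characters in the credentials need URL encoding inside the URI. A minimal sketch using only the standard library follows; the function name, host, and credentials are placeholders.

```python
from urllib.parse import quote_plus


def mssql_uri(user, password, host, database, driver="ODBC Driver 18 for SQL Server"):
    """Build a `mssql+pyodbc://` URI with an encoded `driver` query parameter."""
    return (
        "mssql+pyodbc://"
        f"{quote_plus(user)}:{quote_plus(password)}@{host}/{quote_plus(database)}"
        f"?driver={quote_plus(driver)}"
    )
```

The result is the value to place in **`database.uri`**.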

**db-to-md Microsoft Access:** set **`database.type: access`** with a file path via **`mssql+pyodbc:///?odbc_connect=...`** (built automatically from uploads) or use **`POST /db-to-md/run/access`** / **`POST /db-to-md/job/access`** to upload **`.mdb` / `.accdb`** files (Windows + Microsoft Access ODBC driver). Saved **Query** objects are exported as views (SELECT) or query procedures (action queries). Requires **`pyodbc`** and an Access-compatible ODBC driver.

**db-to-md CLI validation:** **`md-db --test-connection`** (or **`mdengine db-to-md --test-connection`**) checks connectivity without exporting. **`--list-schemas`** prints schema/catalog names (`--output-format json` optional).

**db-to-md export manifest:** every run can write **`export_manifest.json`** at the output root (counts, file list, ERD paths). Control with **`output.write_manifest`** (default true). **`output.markdown_cross_links`** adds **Related** links between tables via foreign keys when using split files (default true).

**db-to-md API SQLite upload:** **`POST /db-to-md/run/sqlite`** — `multipart/form-data` with field **`file`** (the `.sqlite` / `.db` bytes; must start with the standard `SQLite format 3` header) and optional **`config`** (JSON string with the same shape as a normal run body minus `database`: `schema`, `output`, `features`, `execution`, `limits`, `erd`). Returns the metadata ZIP immediately. **`POST /db-to-md/job/sqlite`** — same form fields; saves the file under the job workspace and runs the existing job pipeline (`GET /db-to-md/job/{id}/download` when complete). Upload size cap: env **`DB_TO_MD_MAX_SQLITE_UPLOAD_MB`** (default **256**). Large ZIPs may still require the async job path if they exceed **`DB_TO_MD_MAX_SYNC_ZIP_MB`** on the sync upload route.
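
Before uploading, a client can run the same two checks locally: the 16-byte SQLite header and the size cap. This is a client-side sketch with hypothetical helper names; treating one MB as 1024 × 1024 bytes is an assumption about how the server interprets the cap.

```python
import os

SQLITE_MAGIC = b"SQLite format 3\x00"  # first 16 bytes of every SQLite 3 file


def looks_like_sqlite(data):
    """True when the bytes start with the standard SQLite 3 header."""
    return data[:16] == SQLITE_MAGIC


def upload_cap_bytes(env=None):
    """Read DB_TO_MD_MAX_SQLITE_UPLOAD_MB (default 256) and convert to bytes."""
    env = os.environ if env is None else env
    return int(env.get("DB_TO_MD_MAX_SQLITE_UPLOAD_MB", "256")) * 1024 * 1024


def precheck_upload(data, env=None):
    """Return "ok" or a short reason why the upload would likely be rejected."""
    if not looks_like_sqlite(data):
        return "not a SQLite 3 file"
    if len(data) > upload_cap_bytes(env):
        return "file exceeds upload cap"
    return "ok"
```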

**db-to-md Elasticsearch / OpenSearch:** set **`database.type: elasticsearch`** and **`database.uri`** to the cluster URL (for example `https://localhost:9200`). Install **`mdengine[db]`** (includes the **`elasticsearch`** Python client).

- **Feature flags:** **`elasticsearch_indices`**, **`elasticsearch_data_streams`**, **`elasticsearch_component_templates`**, **`elasticsearch_index_templates`**, **`elasticsearch_ingest_pipelines`**, **`elasticsearch_ilm_policies`**, **`elasticsearch_slm_policies`**, **`elasticsearch_field_caps`**, **`elasticsearch_snapshot_repositories`**, **`elasticsearch_search_templates`**, **`elasticsearch_search_architecture`**, **`elasticsearch_search_dependency_graph`**.
- **Output:** under **`elasticsearch/`** (indices, data streams, templates, pipelines, ILM, SLM, snapshots, search templates, **`alias_graph.md`**, **`search_architecture.md`**, **`search_dependency_graph.md`**).
- **Index docs:** add **`## Mapping Summary`** (field counts, nested/vector/semantic totals) and **`## Operational Notes`** (health, replicas, vector-field signals).
- **Search templates:** add **`## Query type`** using a vector-aware taxonomy (full text, aggregation, suggest, vector, semantic, hybrid).
- **`search_architecture.md`:** narrates the cluster lifecycle (ingestion → indexing → retention → search → backup → alias routing) with aggregate mapping metrics, query-type counts, vector/AI readiness, and an **OpenSearch Compatibility** matrix when **`limits.opensearch: true`** or export probes detect API gaps.
- **`search_dependency_graph.md`:** links templates to indices, aliases, and pipelines (heuristic); enable with **`elasticsearch_search_dependency_graph`** or **`limits.emit_dependency_graph: true`** (requires **`elasticsearch_search_architecture`**).
- **Snapshot limits:** **`limits.snapshot_repository_types`** (optional type filter), **`limits.max_snapshot_repositories`**.
- **Search-template limits:** **`limits.search_templates_langs`** (default **`mustache`**), **`limits.max_template_source_chars`** (markdown truncation).
- **Presentation:** **`output.elasticsearch_mapping_mode`** (`flattened` \| `summarized` \| `raw`), **`output.elasticsearch_include_raw_json`**, **`output.elasticsearch_analyzer_format`** (`chain` \| `json` \| `both`), **`output.max_json_block_chars`** (default **32000**; **`0`** = unlimited JSON blocks).
- **Export warnings:** truncation and other export issues appear in per-doc **`## Export Warnings`** and **`export_manifest.json`** **`warnings`** when present.
- **OpenSearch:** set **`limits.opensearch: true`** to use ISM instead of ILM.
- **Example config:** [`example/db/elasticsearch.example.yaml`](example/db/elasticsearch.example.yaml).
- **CLI:** **`--type elasticsearch`**, **`--index-pattern`**, **`--list-indices`**.
- **Security flags:** **`elasticsearch_security_roles`**, **`elasticsearch_security_users`**, **`elasticsearch_security_api_keys`** are **reserved**: when included they write deterministic placeholder docs under **`elasticsearch/security/`** only (no live `_security` API calls, no secrets).
- **Redaction:** optional **`security.redact_sensitive_values: true`** redacts matching keys in pipelines, templates, snapshots, and settings (pattern names only in manifest **`redaction`** audit block).
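
The `output.max_json_block_chars` setting reads as a simple length cap in which `0` disables the limit. A minimal sketch of that behaviour is shown below; how mdengine actually marks a truncated block is not specified here, so the returned flag and the name `clip_json_block` are assumptions.

```python
def clip_json_block(text, max_chars=32000):
    """Return (text, truncated); max_chars == 0 means no limit."""
    if max_chars == 0 or len(text) <= max_chars:
        return text, False
    # Keep the leading part and report that a warning should be recorded.
    return text[:max_chars], True
```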

**db-to-md API Elasticsearch upload:** **`POST /db-to-md/run/elasticsearch`** — `multipart/form-data` with **`file`** (ZIP of JSON metadata: **`mappings/`**, **`settings/`**, optional **`pipelines/`**, **`templates/`**, **`ilm/`**, etc.) and optional **`config`** JSON (`output`, `features`, `limits`). **`POST /db-to-md/job/elasticsearch`** for async jobs. Cap: **`DB_TO_MD_MAX_ELASTICSEARCH_UPLOAD_MB`** (default **64**).
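
As a rough illustration of the upload format, the following sketch packs JSON metadata into an in-memory ZIP with `mappings/` and `settings/` folders and checks it against the documented 64 MB default. The per-index file naming (`<index>.json`), the helper names, and the reading of "MB" as 1024 × 1024 bytes are assumptions, not confirmed behaviour of the API.

```python
import io
import json
import zipfile

MAX_UPLOAD_MB = 64  # documented default of DB_TO_MD_MAX_ELASTICSEARCH_UPLOAD_MB


def build_metadata_zip(mappings: dict[str, dict], settings: dict[str, dict]) -> bytes:
    """Pack per-index JSON into mappings/ and settings/ inside an in-memory ZIP."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for index, body in mappings.items():
            # Assumed naming: one <index>.json file per folder.
            zf.writestr(f"mappings/{index}.json", json.dumps(body, indent=2))
        for index, body in settings.items():
            zf.writestr(f"settings/{index}.json", json.dumps(body, indent=2))
    return buf.getvalue()


def fits_upload_cap(payload: bytes, cap_mb: int = MAX_UPLOAD_MB) -> bool:
    """Compare the payload size with the cap (MB assumed to mean 1024 * 1024 bytes)."""
    return len(payload) <= cap_mb * 1024 * 1024


payload = build_metadata_zip(
    {"orders": {"properties": {"id": {"type": "keyword"}}}},
    {"orders": {"index": {"number_of_replicas": "1"}}},
)
print(len(payload), fits_upload_cap(payload))
```

Written to disk, the bytes can be sent as the multipart **`file`** field, for example with `curl -F "file=@metadata.zip"`.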

**sap-to-md (`md-sap`):** parses **ABAP**, **CDS**, **DDIC** exports, **OData** metadata, **BAPI**, **IDoc**, and transport artifacts into chunked Markdown knowledge packs. Optional **`--include-lineage`**, **`--include-governance`**, **`--graph`**, **`--chunk`**. Library: [`md_generator/sap/`](src/md_generator/sap/). Deeper design: [`sap-to-md/README.md`](sap-to-md/README.md).

**otel-to-md (`md-otel`):** reads **OTLP** trace exports (**JSON** or **protobuf** with **`log-otel-proto`**) and writes **`trace.md`** (span list). No separate HTTP API entry point; use **`md-otel`** or correlate traces with log exports via **`input.otel_path`** in log YAML. Library: [`md_generator/otel/`](src/md_generator/otel/).

**graph-to-md (Neo4j + NetworkX):** library lives under [`md_generator/graph/`](src/md_generator/graph/). **Sources:** **`networkx`** (GraphML/GML via `graph.graph_file` or `--graph-file`) or **`neo4j`** (`graph.uri`, `graph.user`, `graph.password`, optional `graph.database` for `session(database=…)`). **Output layout:** by default **`output.combine_markdown: true`** writes **`nodes.md`**, **`relationship.md`**, and **`graph_summary.md`** (summary plus embedded nodes and relationships). Set **`combine_markdown: false`** or CLI **`--individual` / `--markdown-layout individual`** for per-entity files under **`nodes/`** and **`relationships/`**. **Diagrams:** **`viz.mermaid: true`** (default) writes **`graph/graph.mmd`** and embeds a fenced Mermaid block in the export **`README.md`** (no Graphviz required). **`viz.enabled: true`** or CLI **`--viz`** also writes **`graph/graph.dot`** and runs **`dot`** for PNG/SVG/PDF when Graphviz is on `PATH` (or **`GRAPHVIZ_DOT`**). CLI **`--no-mermaid`** disables Mermaid; **`--depth`**, **`--start-node`**, **`--max-nodes`**, **`--max-edges`** bound traversal. Packaged defaults: [`src/md_generator/graph/config/default.yaml`](src/md_generator/graph/config/default.yaml). Tests: **`graph-to-md/tests/`**; API image: [`graph-to-md/Dockerfile.api`](graph-to-md/Dockerfile.api).
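
Before passing **`--viz`**, it can help to check whether Graphviz is reachable at all. This is a minimal sketch of such a check, assuming that **`GRAPHVIZ_DOT`** takes precedence over `PATH`; `find_dot` is a hypothetical helper and not the lookup mdengine itself performs.

```python
import os
import shutil
from collections.abc import Mapping
from typing import Optional


def find_dot(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return a path to Graphviz `dot`, or None when it cannot be found."""
    env = os.environ if env is None else env
    override = env.get("GRAPHVIZ_DOT")
    if override:
        # Assumed precedence: an explicit override wins over PATH lookup.
        return override
    return shutil.which("dot", path=env.get("PATH"))


dot = find_dot()
print("Graphviz available" if dot else "Graphviz not found; Mermaid output only")
```

When the function returns `None`, the Mermaid output (`graph/graph.mmd`) remains available because it does not need Graphviz.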

Every command accepts **`-h` / `--help`** for full flags (artifact layout, OCR, ZIP options, etc.).

### 4. Copy-paste examples (terminal)

**bash / macOS / Linux**

```bash
md-pdf manual.pdf ./artifact --artifact-layout
md-word letter.docx letter.md --images-dir ./letter-images
md-ppt slides.pptx ./ppt-artifact --artifact-layout
md-xlsx -i sales.xlsx -o ./md-sheets --split
md-xlsx -i export.csv -o ./csv-out
md-image ./photos ocr.md --engines tess --strategy best
md-text data.json data.md
md-zip archive.zip ./unzipped-md
md-url https://example.com/page ./page-bundle --artifact-layout
md-audio ./voice.mp3 ./voice.md --model tiny
md-video ./screen.mp4 ./screen.md --model base
pip install "mdengine[graph]" && md-graph --source neo4j --uri neo4j://localhost:7687 --user neo4j --password secret --database neo4j --output ./graph-out --viz
pip install "mdengine[playwright]" && playwright install chromium && md-playwright https://example.com/app ./spa-out
pip install "mdengine[openapi]" && md-openapi generate --file openapi.yaml --output ./openapi-md
pip install "mdengine[log]" && md-log --config log.yaml
pip install "mdengine[log-otel-proto]" && md-otel --input otlp-traces.json --output ./otel-docs
pip install "mdengine[sap]" && md-sap ./sap-source --output ./sap-out --graph --chunk
md-codeflow scan ./src --output ./cf-out --lang python
```

**Windows PowerShell** (a subset of the same commands; backslash paths are optional)

```powershell
md-pdf .\manual.pdf .\out\doc.md
md-zip .\archive.zip .\zip-out
md-url https://example.com/page .\page-bundle --artifact-layout
md-audio .\voice.mp3 .\voice.md --model tiny
md-video .\screen.mp4 .\screen.md --model base
pip install "mdengine[graph]"
md-graph --source neo4j --uri neo4j://localhost:7687 --user neo4j --password secret --database neo4j --output .\graph-out --viz
pip install "mdengine[playwright]"
playwright install chromium
md-playwright https://example.com/app .\spa-out
```

**Windows CMD**

```cmd
md-pdf manual.pdf out\doc.md
md-zip archive.zip zip-out
md-url https://example.com/page page-bundle --artifact-layout
```

### 5. Run without `pip install` (repo clone + `PYTHONPATH`)

The folders `pdf-to-md/`, `word-to-md/`, `url-to-md/`, … each contain a thin `converter.py` that calls the same code as `md-pdf`, `md-word`, etc. From the **repository root**, point Python at `src` so that `md_generator` can be imported, then run the shim:

**PowerShell**

```powershell
$env:PYTHONPATH = "$PWD\src"
python pdf-to-md\converter.py input.pdf out.md
```

**CMD**

```cmd
set PYTHONPATH=src
python pdf-to-md\converter.py input.pdf out.md
```

**bash**

```bash
PYTHONPATH=src python pdf-to-md/converter.py input.pdf out.md
```

### 6. Convert every file in `example/` (strictly command-line)

To process **all supported files** under the [`example/`](example/) folder using only the installed **`md-*`** tools (no Python snippets), use the batch driver:

| Platform | Command (run from **repository root** unless noted) |
|----------|------------------------------------------------------|
| Windows | `powershell -ExecutionPolicy Bypass -File scripts/run-docs-cli.ps1` |
| Windows | Or double-click / run [`example/run-all-cli.cmd`](example/run-all-cli.cmd) (changes to repo root, then runs the script on `example\`) |
| macOS / Linux | `bash scripts/run-docs-cli.sh` |

Optional environment variables for the shell script: `DOCS_DIR`, `OUT_DIR`, `IMAGE_ENGINES` (default `tess`). PowerShell script parameters: `-DocsDir`, `-OutDir`, `-ImageEngines`.

Outputs are written to **`example/cli-output/<basename>/`** (one subfolder per input file). **`.csv`** files are converted with **`md-xlsx`** (same engine as Excel). **`.md`** files are skipped.

---


## Python library

Import from `md_generator.<format>` after installing the matching extras.

### PDF

```python
from pathlib import Path
from md_generator.pdf.pdf_extract import ConvertOptions, convert_pdf
from md_generator.pdf.utils import resolve_output

pdf = Path("input.pdf")
out = resolve_output(Path("out-dir"), artifact_layout=True, images_dir=None)
convert_pdf(pdf, out, ConvertOptions(verbose=True))
```

### Word (DOCX)

```python
from pathlib import Path
from md_generator.word.converter import convert_docx_to_markdown

convert_docx_to_markdown(
    Path("input.docx"),
    Path("out/body.md"),
    images_dir=Path("out/images"),
    verbose=False,
)
```

### PowerPoint

```python
from pathlib import Path
from md_generator.ppt.convert_impl import convert_pptx
from md_generator.ppt.options import ConvertOptions

convert_pptx(
    Path("slides.pptx"),
    Path("artifact-dir"),
    ConvertOptions(artifact_layout=True, extract_embedded_deep=False),
)
```

### Excel

```python
from pathlib import Path
from md_generator.xlsx.convert_config import ConvertConfig
from md_generator.xlsx.converter_core import convert_excel_to_markdown

result = convert_excel_to_markdown(
    Path("book.xlsx"),
    Path("out-dir"),
    config=ConvertConfig(),
)
print(result.paths_written)
```

### Images (OCR)

```python
from pathlib import Path
from md_generator.image.convert_impl import ConvertOptions, convert_images

convert_images(
    Path("scan.png"),
    Path("out.md"),
    ConvertOptions(
        engines=("tess",),
        strategy="best",
        title="OCR",
        tess_lang="eng",
        tesseract_cmd=None,
        paddle_lang="en",
        paddle_use_angle_cls=True,
        easy_langs=("en",),
        verbose=False,
    ),
)
```

### Text / JSON / XML

```python
from pathlib import Path
from md_generator.text.convert_impl import convert_text_file
from md_generator.text.options import ConvertOptions

convert_text_file(
    Path("data.json"),
    Path("out.md"),
    ConvertOptions(artifact_layout=False, verbose=False),
)
```

### ZIP archive

```python
from pathlib import Path
from md_generator.archive.convert_impl import convert_zip
from md_generator.archive.options import ConvertOptions

convert_zip(
    Path("upload.zip"),
    Path("artifact-out"),
    ConvertOptions(
        enable_office=True,
        use_image_to_md=True,
        verbose=False,
    ),
)
```

`repo_root` on `ConvertOptions` is **deprecated and ignored**; converters are loaded in-process from `md_generator`.

### URL (HTML)

```python
from pathlib import Path

import httpx
from md_generator.url.convert_impl import run_single
from md_generator.url.options import ConvertOptions

with httpx.Client(follow_redirects=True, timeout=30.0) as client:
    run_single(
        "https://example.com/",
        Path("web-out"),
        ConvertOptions(artifact_layout=True, verbose=False),
        client,
    )
```

See [url-to-md/README.md](url-to-md/README.md) for crawl modes, post-convert assets, and API bodies.

### Playwright (SPA)

```python
import asyncio
from pathlib import Path

from md_generator.playwright.options import PlaywrightOptions
from md_generator.playwright.pipeline import convert_url_to_md

async def main() -> None:
    await convert_url_to_md(
        "https://example.com/app",
        Path("spa-out"),
        PlaywrightOptions(),
    )

asyncio.run(main())
```

Install browsers once: `playwright install chromium`.

### Database metadata (db-to-md)

```python
from pathlib import Path

from md_generator.db.core.extractor import extract_to_markdown
from md_generator.db.core.run_config import load_run_config

cfg = load_run_config(Path("db-export.yaml"))
extract_to_markdown(cfg)
```

YAML shape and ERD options are covered above under **db-to-md** CLI notes.

### Graph (Neo4j / NetworkX)

```python
from pathlib import Path

from md_generator.graph.core.extractor import extract_to_markdown
from md_generator.graph.core.run_config import load_graph_run_config

cfg = load_graph_run_config(Path("graph-export.yaml"))
extract_to_markdown(cfg)
```

### OpenAPI → Markdown

```python
from pathlib import Path

from md_generator.openapi.core.extractor import extract_to_markdown
from md_generator.openapi.core.run_config import ApiRunConfig

cfg = ApiRunConfig(file=Path("openapi.yaml"), output_path=Path("api-md"))
extract_to_markdown(cfg)
```

Use `load_api_run_config(Path("openapi-md.yaml"))` when you prefer a single YAML file for input/output/format defaults.

### Log files (log-to-md)

```python
from pathlib import Path

from md_generator.log.core.extractor import extract_to_markdown
from md_generator.log.core.run_config import load_run_config as load_log_run_config

cfg = load_log_run_config(Path("log-export.yaml"))
extract_to_markdown(cfg)
```

Set **`input.otel_path`** in the YAML (or API body) to a directory of OTLP JSON exports so the log pipeline can **correlate** log lines with spans. Install **`mdengine[log-otel-proto]`** when ingesting protobuf OTLP via **`md-otel`**.

### OpenTelemetry (otel-to-md)

```python
from pathlib import Path

from md_generator.otel.otel_parser import load_otlp_json
from md_generator.otel.otel_spans import parse_otlp_spans

doc = load_otlp_json(Path("otlp-traces.json"))
spans = parse_otlp_spans(doc)
```
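
For orientation, the standard OTLP/JSON trace layout (`resourceSpans` → `scopeSpans` → `spans`) can also be walked with the standard library alone. The sketch below does not use `load_otlp_json` or `parse_otlp_spans`, whose return shapes are not documented here; the sample document and the `iter_spans` helper are invented for illustration.

```python
import json

# A tiny hand-written OTLP/JSON trace export (invented sample data).
SAMPLE = """
{"resourceSpans": [{"scopeSpans": [{"spans": [
  {"traceId": "5b8efff798038103d269b633813fc60c", "spanId": "eee19b7ec3c1b174",
   "name": "GET /orders", "startTimeUnixNano": "1700000000000000000",
   "endTimeUnixNano": "1700000000250000000"}
]}]}]}
"""


def iter_spans(doc: dict):
    """Yield (name, trace_id, duration_ms) for every span in an OTLP/JSON document."""
    for resource in doc.get("resourceSpans", []):
        for scope in resource.get("scopeSpans", []):
            for span in scope.get("spans", []):
                # OTLP/JSON encodes 64-bit nanosecond timestamps as strings.
                start = int(span["startTimeUnixNano"])
                end = int(span["endTimeUnixNano"])
                yield span["name"], span["traceId"], (end - start) / 1_000_000


for name, trace_id, ms in iter_spans(json.loads(SAMPLE)):
    print(f"{name}  trace={trace_id}  {ms:.1f} ms")
```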

CLI (writes **`trace.md`** under **`--output`**):

```bash
pip install "mdengine[log-otel-proto]"
md-otel --input otlp-traces.json --output ./otel-docs
md-otel --input export.pb --output ./otel-docs --protobuf
```

Shim: [`otel-to-md/converter.py`](otel-to-md/converter.py) delegates to **`md-otel`**.

### SAP artifacts (sap-to-md)

```python
from pathlib import Path

from md_generator.sap.core.extractor import extract_to_markdown
from md_generator.sap.core.run_config import load_run_config

cfg = load_run_config(Path("sap-export.yaml"))
extract_to_markdown(cfg)
```

Or pass paths on the CLI: **`md-sap ./abap ./cds --output ./sap-out --include-lineage --include-governance`**. See [`sap-to-md/README.md`](sap-to-md/README.md) for API routes and architecture notes.

### Codeflow (static call graphs)

```python
from pathlib import Path

from md_generator.codeflow.core.extractor import run_scan
from md_generator.codeflow.core.run_config import ScanConfig

cfg = ScanConfig(project_root=Path("my-service/src"), output_path=Path("codeflow-out"), languages="java")
run_scan(cfg)
```

More flags and language notes: [Development](#development) (section **Codeflow**) and [`codeflow-to-md/docs/graph-and-outputs.md`](codeflow-to-md/docs/graph-and-outputs.md).

---

## Audio and video to Markdown

Library code lives under **`md_generator.media`**: shared probing in [`document_converter.py`](src/md_generator/media/document_converter.py), **audio** in [`media/audio/`](src/md_generator/media/audio/) (Whisper + ffprobe / ffmpeg metadata), **video** in [`media/video/`](src/md_generator/media/video/) (ffmpeg extracts audio only; transcription always goes through the audio service—no duplicate Whisper path in video).

### System requirements

- **ffmpeg** (and **ffprobe** when available) on `PATH` for metadata and for **video** demux. If `ffprobe` is missing or misbehaving, metadata falls back to parsing `ffmpeg -i` stderr.
- Optional **`FFMPEG`** environment variable: absolute path to an `ffmpeg` executable (see [`resolve_ffmpeg_executable()`](src/md_generator/media/document_converter.py)).
- **GPU** is optional; Whisper runs on CPU when no GPU is available (it may log a warning that FP16 is unsupported on CPU and that FP32 is used instead).
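
The `ffmpeg -i` fallback mentioned above relies on the banner that ffmpeg prints to stderr. As a hedged illustration, the sketch below extracts only the duration from such output; `duration_seconds` is a hypothetical helper, and the library's own parser may read more fields or use a different pattern.

```python
import re
from typing import Optional

# ffmpeg prints a line such as "  Duration: 00:01:23.50, start: 0.000000, bitrate: ..."
_DURATION = re.compile(r"Duration:\s*(\d+):(\d{2}):(\d{2}(?:\.\d+)?)")


def duration_seconds(ffmpeg_stderr: str) -> Optional[float]:
    """Return the media duration in seconds, or None if no duration is reported."""
    match = _DURATION.search(ffmpeg_stderr)
    if match is None:
        return None  # for example "Duration: N/A" on streams of unknown length
    hours, minutes, seconds = match.groups()
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


sample = "  Duration: 00:01:23.50, start: 0.000000, bitrate: 128 kb/s"
print(duration_seconds(sample))
```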

### Install

```bash
pip install "mdengine[audio,api,mcp]"    # audio CLI + HTTP + MCP
pip install "mdengine[video,api,mcp]"   # video CLI + HTTP + MCP (same ML stack as audio)
pip install "mdengine[youtube,api,mcp]" # YouTube URL → Markdown (captions + metadata; optional Whisper fallback)
```

### Python library

**Audio** — structured result + Markdown:

```python
from pathlib import Path
from md_generator.media.audio import AudioToMarkdownService, AudioConverter

svc = AudioToMarkdownService(whisper_model="base")  # language omitted → Whisper auto-detect; pass e.g. language="en" to force
text = svc.to_markdown(Path("input.mp3"), title="My title")
svc.write_markdown(Path("input.mp3"), Path("out/transcript.md"))

result = svc.transcribe(Path("input.wav"))  # metadata + segments + plain_text
```

**Video** — extract → transcribe (via audio) → Markdown:

```python
from pathlib import Path
from md_generator.media.video import VideoToMarkdownService

svc = VideoToMarkdownService(whisper_model="base")  # transcription language omitted → auto-detect
md = svc.to_markdown(Path("input.mp4"), title=None)
svc.write_markdown(Path("input.mp4"), Path("out/transcript.md"))
```

**YouTube** — captions API + page metadata (BeautifulSoup / oEmbed); optional **yt-dlp + Whisper** when captions are missing (`mdengine[audio]` and `yt-dlp` on `PATH`, or `MD_YOUTUBE_YTDLP`):

```python
from pathlib import Path

from md_generator.media.youtube import YouTubeToMarkdownService

svc = YouTubeToMarkdownService(whisper_model="base")
md = svc.to_markdown("https://www.youtube.com/watch?v=VIDEO_ID", transcript_languages=["en"])
svc.write_markdown("https://youtu.be/VIDEO_ID", Path("out/youtube.md"))
```

For file-based pipelines, [`YouTubeConverter`](src/md_generator/media/youtube/converter.py) reads a `.url` / `.yturl` / `.youtube` file (or a `.txt` whose first non-comment line is a YouTube URL) and implements [`DocumentConverter`](src/md_generator/media/document_converter.py).
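
To make the URL-file convention concrete, here is a small standard-library sketch that picks the first non-comment line from a text file and extracts the video id from the two URL shapes used in the examples above. Both helpers are hypothetical, and treating `#` as the comment marker is an assumption; `YouTubeConverter` may accept other forms.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def first_url_line(text: str) -> Optional[str]:
    """Return the first non-blank line that is not a `#` comment (marker assumed)."""
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            return line
    return None


def video_id(url: str) -> Optional[str]:
    """Extract the id from watch?v=... and youtu.be/... style URLs."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    if host == "youtu.be":
        return parts.path.lstrip("/") or None
    if host in {"youtube.com", "www.youtube.com"}:
        return parse_qs(parts.query).get("v", [None])[0]
    return None


url = first_url_line("# saved link\n\nhttps://youtu.be/VIDEO_ID\n")
print(url, video_id(url))
```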

The ffprobe helpers (`ffprobe_json`, `VideoProbeResult`, …) are also re-exported from [`md_generator.media`](src/md_generator/media/__init__.py).

### REST API (FastAPI)

Each service exposes the same job pattern as other converters:

| Endpoint | Description |
|----------|-------------|
| `POST /convert/sync` | Multipart field **`file`**; returns **Markdown** body. Query: `whisper_model`, `language` (omit or `auto` for detection; pass a code to force), `title`. |
| `POST /convert/jobs` | Async upload; returns `{ "job_id", "status" }`. |
| `GET /convert/jobs/{job_id}` | Status JSON. |
| `GET /convert/jobs/{job_id}/download` | Markdown when `done`; workspace removed after download. |

**Audio** defaults: `MD_AUDIO_MAX_UPLOAD_MB=200`, `MD_AUDIO_MAX_SYNC_UPLOAD_MB=40`, `MD_AUDIO_API_PORT=8011`.  
**Video** defaults: `MD_VIDEO_MAX_UPLOAD_MB=500`, `MD_VIDEO_MAX_SYNC_UPLOAD_MB=80`, `MD_VIDEO_API_PORT=8012`.  
**YouTube** uses **JSON** (not multipart): `POST /convert/sync` and `POST /convert/jobs` accept `{"url":"https://www.youtube.com/watch?v=…","title":null,"transcript_languages":["en"],"enable_audio_fallback":true,"whisper_model":"base","language":null}`. Settings: `MD_YOUTUBE_API_PORT` (default **8013**), `MD_YOUTUBE_JOB_TTL_SECONDS`, `MD_YOUTUBE_CORS_ORIGINS`, `MD_YOUTUBE_TEMP_DIR`.
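
A YouTube request with the JSON body shown above can be assembled with the standard library. This is a sketch only: `build_youtube_request` is a hypothetical helper, the base URL assumes a local `md-youtube-api` on its default port, and the response is assumed to be the Markdown body, as for the other media services.

```python
import json
import urllib.request


def build_youtube_request(base_url: str, video_url: str) -> urllib.request.Request:
    """Build (but do not send) a POST /convert/sync request for the YouTube API."""
    body = {
        "url": video_url,
        "title": None,
        "transcript_languages": ["en"],
        "enable_audio_fallback": True,
        "whisper_model": "base",
        "language": None,
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/convert/sync",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_youtube_request("http://127.0.0.1:8013", "https://www.youtube.com/watch?v=VIDEO_ID")
print(req.get_method(), req.full_url)

# To send it against a running service (response assumed to be Markdown text):
# with urllib.request.urlopen(req) as resp:
#     markdown = resp.read().decode("utf-8")
```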

Run with the bundled runners (each call builds the app with **`factory=True`** for a clean MCP session manager):

```bash
md-audio-api --host 127.0.0.1 --port 8011
md-video-api --host 127.0.0.1 --port 8012
md-youtube-api --host 127.0.0.1 --port 8013
```

Or with Uvicorn directly (the ASGI app is built by **`create_app()`** so each worker gets its own MCP session manager):

```bash
uvicorn md_generator.media.audio.api.main:create_app --factory --host 127.0.0.1 --port 8011
uvicorn md_generator.media.video.api.main:create_app --factory --host 127.0.0.1 --port 8012
uvicorn md_generator.media.youtube.api.main:create_app --factory --host 127.0.0.1 --port 8013
```

The module also defines **`app = create_app()`** for a single-process target: `uvicorn md_generator.media.audio.api.main:app` (no `--factory`).

Swagger is at **`/docs`** when the app is running.

### MCP (stdio, SSE, streamable HTTP)

1. **With FastAPI**: start `md-audio-api`, `md-video-api`, or `md-youtube-api`; Streamable HTTP MCP is mounted at **`http://<host>:<port>/mcp`** (same host and port as REST).
2. **Standalone**: the process speaks only MCP, for example:

```bash
md-audio-mcp --transport stdio
md-audio-mcp --transport sse
md-audio-mcp --transport streamable-http
md-video-mcp --transport stdio
md-youtube-mcp --transport stdio
```

**Audio MCP tools:** `transcribe_audio_path`, `transcribe_audio_base64`.  
**Video MCP tools:** `transcribe_video_path`, `transcribe_video_base64`.  
**YouTube MCP tool:** `youtube_url_to_markdown`.

Equivalent modules: `python -m md_generator.media.audio.api.mcp_server`, `python -m md_generator.media.video.api.mcp_server`, `python -m md_generator.media.youtube.api.mcp_server`.

### Thin shims (repo clone)

- [`audio-to-md/converter.py`](audio-to-md/converter.py), [`video-to-md/converter.py`](video-to-md/converter.py), and [`youtube-to-md/converter.py`](youtube-to-md/converter.py) delegate to the same `main` as `md-audio` / `md-video` / `md-youtube`. Tests and `pytest.ini` live under `audio-to-md/tests/`, `video-to-md/tests/`, and `youtube-to-md/tests/`.
- [`db-to-md/converter.py`](db-to-md/converter.py) delegates to `md-db`; tests live under `db-to-md/tests/`.
- **graph-to-md** uses `md-graph` / `mdengine graph-to-md` directly (no thin `converter.py` shim); tests live under [`graph-to-md/tests/`](graph-to-md/tests/).
- [`openapi-to-md/converter.py`](openapi-to-md/converter.py) delegates to `md-openapi`; tests live under [`openapi-to-md/tests/`](openapi-to-md/tests/); example OpenAPI and output notes: [`openapi-to-md/examples/`](openapi-to-md/examples/).
- [`log-to-md/converter.py`](log-to-md/converter.py) delegates to `md-log`; tests under [`log-to-md/tests/`](log-to-md/tests/) (including OTLP fixtures under `log-to-md/tests/fixtures/otel/`).
- [`otel-to-md/converter.py`](otel-to-md/converter.py) delegates to `md-otel`.
- [`sap-to-md/converter.py`](sap-to-md/converter.py) delegates to `md-sap`; tests under [`sap-to-md/tests/`](sap-to-md/tests/); module README: [`sap-to-md/README.md`](sap-to-md/README.md).
- [`codeflow-to-md/converter.py`](codeflow-to-md/converter.py) invokes the same CLI entrypoint as `md-codeflow` / `codeflow`; tests under [`codeflow-to-md/tests/`](codeflow-to-md/tests/).
- [`url-to-md/converter.py`](url-to-md/converter.py) matches `md-url`; tests under [`url-to-md/tests/`](url-to-md/tests/).
- [`playwright-to-md/tests/`](playwright-to-md/tests/) cover the Playwright pipeline (no root `converter.py` shim required).

---

## HTTP API (FastAPI)

All format APIs follow a similar pattern:

- **`POST /convert/sync`** — upload a file (most converters) **or** send JSON (`url-to-md`); response is often a **ZIP** (artifact bundle) for larger formats.
- **`POST /convert/jobs`** — async job; returns `job_id`.
- **`GET /convert/jobs/{job_id}`** — status.
- **`GET /convert/jobs/{job_id}/download`** — download result when ready.

Upload field name is **`file`** (multipart form) for file-based converters. Use `httpx` or `curl -F "file=@path/to/file"`. **URL** conversion uses a **JSON** body (`url` or `urls`); see [url-to-md/README.md](url-to-md/README.md).
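
The async endpoints above imply a simple polling loop on the client side. The sketch below keeps the HTTP call behind a callable so it runs without a server; `wait_for_job` is a hypothetical helper, `done` is the status named in the media section, and `failed`, `queued`, and `running` are assumed status names used only for the demonstration.

```python
import time
from typing import Callable


def wait_for_job(
    get_status: Callable[[str], dict],
    job_id: str,
    *,
    poll_seconds: float = 1.0,
    timeout_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll a status callable until the job reports a terminal status."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status(job_id)
        # "failed" is an assumed terminal status; adjust to the real API values.
        if status.get("status") in {"done", "failed"}:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {status.get('status')!r}")
        sleep(poll_seconds)


# Demonstration with a fake status source instead of GET /convert/jobs/{job_id}.
fake = iter([{"status": "queued"}, {"status": "running"}, {"status": "done"}])
print(wait_for_job(lambda _job_id: next(fake), "abc123", sleep=lambda _s: None))
```

In real use, `get_status` would wrap a `GET /convert/jobs/{job_id}` call and return the parsed JSON; once the status is `done`, fetch `/convert/jobs/{job_id}/download`.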

### Run with Uvicorn

Install `mdengine[api]` plus the format extra(s), then run the **`app`** object from the table below.

| Service | Uvicorn target | Required extras (typical) |
|---------|----------------|---------------------------|
| PDF | `md_generator.pdf.api.main:app` | `pdf`, `api` |
| Word | `md_generator.word.api.main:app` | `word`, `api`, `mcp` (Word mounts FastMCP) |
| PPTX | `md_generator.ppt.api.main:app` | `ppt`, `api`, `mcp` |
| XLSX | `md_generator.xlsx.api.app:app` | `xlsx`, `api` |
| Image | `md_generator.image.api.main:app` | `image`, `api`, `mcp` |
| Text/JSON/XML | `md_generator.text.api.main:app` | `text`, `api`, `mcp` |
| ZIP | `md_generator.archive.api.main:app` | `archive`, `api`, `mcp` (+ extras for nested office/PDF) |
| URL / HTML | `md_generator.url.api.main:app` | `url`, `api`, `mcp` |
| Playwright / SPA | `md_generator.playwright.api.main:app` | `playwright`, `api`, `mcp` |
| Database metadata | `md_generator.db.api.main:app` | `db`, `api`, `mcp` |
| Graph metadata (Neo4j / NetworkX) | `md_generator.graph.api.main:app` | `graph`, `api`, `mcp` |
| OpenAPI → Markdown | `md_generator.openapi.api.main:app` | `openapi`, `api`, `mcp` |
| Audio (Whisper) | `md_generator.media.audio.api.main:create_app` (use **`--factory`**) or `…main:app` | `audio`, `api`, `mcp` |
| Video (Whisper) | `md_generator.media.video.api.main:create_app` (use **`--factory`**) or `…main:app` | `video`, `api`, `mcp` |
| YouTube | `md_generator.media.youtube.api.main:create_app` (use **`--factory`**) or `…main:app` | `youtube`, `api`, `mcp` |
| Codeflow (workspace scans) | `md_generator.codeflow.api.main:app` | `codeflow`, `api`, `mcp` |
| Log files → Markdown | `md_generator.log.api.main:app` | `log`, `api`, `mcp` |
| SAP artifacts | `md_generator.sap.api.main:app` | `sap`, `api` |

Examples (ports are **illustrative** so several services can run together; each `md-*-api` / `run` module has its own **default** — set `*_PORT` / `*_API_PORT` env vars to match your layout):

```bash
uvicorn md_generator.pdf.api.main:app --host 127.0.0.1 --port 8001
uvicorn md_generator.word.api.main:app --host 127.0.0.1 --port 8002
uvicorn md_generator.archive.api.main:app --host 127.0.0.1 --port 8008
uvicorn md_generator.url.api.main:app --host 127.0.0.1 --port 8017
uvicorn md_generator.db.api.main:app --host 127.0.0.1 --port 8010
uvicorn md_generator.playwright.api.main:app --host 127.0.0.1 --port 8014
uvicorn md_generator.graph.api.main:app --host 127.0.0.1 --port 8020
uvicorn md_generator.sap.api.main:app --host 127.0.0.1 --port 8021
uvicorn md_generator.openapi.api.main:app --host 127.0.0.1 --port 8015
uvicorn md_generator.codeflow.api.main:app --host 127.0.0.1 --port 8016
uvicorn md_generator.log.api.main:app --host 127.0.0.1 --port 8018
uvicorn md_generator.media.audio.api.main:create_app --factory --host 127.0.0.1 --port 8011
uvicorn md_generator.media.video.api.main:create_app --factory --host 127.0.0.1 --port 8012
uvicorn md_generator.media.youtube.api.main:create_app --factory --host 127.0.0.1 --port 8013
```

**Port note:** **`md-graph-api`**, **`md-video-api`**, and **`md-log-api`** all default to **8012** in their respective `run.py` modules. Run only one of them on that port, or set **`GRAPH_TO_MD_PORT`**, **`MD_VIDEO_API_PORT`**, and **`LOG_TO_MD_PORT`** so that each process listens on a unique port. **`md-db-api`** defaults to **8010**. **`md-sap-api`** defaults to **8020** (`SAP_TO_MD_PORT`). The graph Uvicorn example above uses **8020** purely as an illustration, so use **8021** (or another port) for SAP when running **both** graph and SAP APIs locally. **`md-openapi-api`** defaults to **8015** (`OPENAPI_TO_MD_PORT`). **`md-youtube-api`** defaults to **8013**. **`md-playwright-api`** defaults to **8014** (`PLAYWRIGHT_TO_MD_API_PORT`). **`md-codeflow-api`** defaults to **8016** (`CODEFLOW_TO_MD_PORT`). **`md-otel`** has **no** bundled HTTP server (CLI only).
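
The default ports named in this README can be checked for clashes mechanically. The sketch below hard-codes those documented defaults and reports every port claimed by more than one service; `port_collisions` is a hypothetical helper, not something mdengine ships.

```python
from collections import defaultdict

# Default ports as listed in this README.
DEFAULT_PORTS = {
    "md-db-api": 8010,
    "md-audio-api": 8011,
    "md-graph-api": 8012,
    "md-video-api": 8012,
    "md-log-api": 8012,
    "md-youtube-api": 8013,
    "md-playwright-api": 8014,
    "md-openapi-api": 8015,
    "md-codeflow-api": 8016,
    "md-sap-api": 8020,
}


def port_collisions(ports: dict[str, int]) -> dict[int, list[str]]:
    """Map each port used by more than one service to the sorted service names."""
    by_port: dict[int, list[str]] = defaultdict(list)
    for service, port in ports.items():
        by_port[port].append(service)
    return {port: sorted(names) for port, names in by_port.items() if len(names) > 1}


print(port_collisions(DEFAULT_PORTS))
# {8012: ['md-graph-api', 'md-log-api', 'md-video-api']}
```

Overriding two of the three clashing entries, for example `{**DEFAULT_PORTS, "md-graph-api": 8022, "md-log-api": 8018}`, yields an empty result.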

### MCP over HTTP on the same server

These apps mount an MCP HTTP app at **`/mcp`** (Streamable HTTP / framework-specific). Start the API as above, then point an MCP client at `http://<host>:<port>/mcp` where supported.

### Environment variables (limits & CORS)

Prefixes differ per service (often read from a `.env` file next to the process):

| Service | Prefix | Examples |
|---------|--------|----------|
| PDF | `PDF_TO_MD_` | `PDF_TO_MD_MAX_UPLOAD_MB`, `PDF_TO_MD_MAX_SYNC_UPLOAD_MB`, `PDF_TO_MD_TEMP_DIR`, `PDF_TO_MD_CORS_ORIGINS` |
| Word | `WORD_TO_MD_` | `WORD_TO_MD_MAX_UPLOAD_MB`, `WORD_TO_MD_MAX_SYNC_UPLOAD_MB`, `WORD_TO_MD_JOB_TTL_SECONDS`, `WORD_TO_MD_TEMP_DIR`, `WORD_TO_MD_CORS_ORIGINS` |
| ZIP | `ZIP_TO_MD_` | `ZIP_TO_MD_MAX_UPLOAD_MB`, `ZIP_TO_MD_MAX_SYNC_UPLOAD_MB`, `ZIP_TO_MD_JOB_TTL_SECONDS`, `ZIP_TO_MD_TEMP_DIR`, `ZIP_TO_MD_CORS_ORIGINS`, optional image post-pass defaults |
| PPTX | `PPT_TO_MD_` | `PPT_TO_MD_MAX_UPLOAD_MB`, … |
| Text | `TXT_JSON_XML_TO_MD_` | same pattern |
| XLSX | `XLSX_TO_MD_` | `XLSX_TO_MD_TEMP_DIR`, `XLSX_TO_MD_CORS_ORIGINS`, etc. (see `md_generator.xlsx.api.app`) |
| URL | `URL_TO_MD_` | `URL_TO_MD_MAX_SYNC_URLS`, `URL_TO_MD_MAX_SYNC_CRAWL_PAGES`, `URL_TO_MD_MAX_JOB_URLS`, `URL_TO_MD_JOB_TTL_SECONDS`, `URL_TO_MD_TEMP_DIR`, `URL_TO_MD_CORS_ORIGINS` |
| Playwright / SPA | `PLAYWRIGHT_TO_MD_` | `PLAYWRIGHT_TO_MD_MAX_SYNC_URLS`, `PLAYWRIGHT_TO_MD_MAX_JOB_URLS`, `PLAYWRIGHT_TO_MD_JOB_TTL_SECONDS`, `PLAYWRIGHT_TO_MD_TEMP_DIR`, `PLAYWRIGHT_TO_MD_CORS_ORIGINS`, `PLAYWRIGHT_TO_MD_API_HOST`, `PLAYWRIGHT_TO_MD_API_PORT` (default **8014**) |
| Database metadata | `DB_TO_MD_` | `DB_TO_MD_JOB_SQLITE_PATH`, `DB_TO_MD_JOB_WORKSPACE_ROOT`, `DB_TO_MD_CORS_ORIGINS`, `DB_TO_MD_MAX_SYNC_ZIP_MB`, `DB_TO_MD_HOST`, `DB_TO_MD_PORT` (default **8010**) |
| Graph metadata | `GRAPH_TO_MD_` | `GRAPH_TO_MD_JOB_SQLITE_PATH`, `GRAPH_TO_MD_JOB_WORKSPACE_ROOT`, `GRAPH_TO_MD_CORS_ORIGINS`, `GRAPH_TO_MD_MAX_SYNC_ZIP_MB`, `GRAPH_TO_MD_HOST`, `GRAPH_TO_MD_PORT` (default **8012**) |
| OpenAPI → Markdown | `OPENAPI_TO_MD_` | `OPENAPI_TO_MD_CORS_ORIGINS`, `OPENAPI_TO_MD_MAX_SYNC_ZIP_MB`, `OPENAPI_TO_MD_HOST`, `OPENAPI_TO_MD_PORT` (default **8015**) |
| Codeflow API | `CODEFLOW_TO_MD_` / `CODEFLOW_` | `CODEFLOW_TO_MD_HOST`, `CODEFLOW_TO_MD_PORT` (default **8016**); in-app limits: `CODEFLOW_MAX_UPLOAD_ZIP_MB`, `CODEFLOW_MAX_SYNC_ZIP_MB`, `CODEFLOW_JOB_WORKSPACE_ROOT`, `CODEFLOW_SQLITE_PATH`, CORS via `CODEFLOW_CORS` |
| Log API | `LOG_TO_MD_` | `LOG_TO_MD_HOST`, `LOG_TO_MD_PORT` (default **8012** — change if **graph** or **video** API uses that port), `LOG_TO_MD_CORS_ORIGINS`, `LOG_TO_MD_MAX_SYNC_ZIP_MB`, `LOG_TO_MD_MAX_LOG_UPLOAD_MB`, workspace / SQLite paths |
| SAP API | `SAP_TO_MD_` | `SAP_TO_MD_HOST`, `SAP_TO_MD_PORT` (default **8020**), `SAP_TO_MD_CORS_ORIGINS`, `SAP_TO_MD_MAX_SYNC_ZIP_MB`, `SAP_TO_MD_JOB_SQLITE_PATH`, `SAP_TO_MD_JOB_WORKSPACE_ROOT` |
| Audio API | `MD_AUDIO_` | `MD_AUDIO_MAX_UPLOAD_MB`, `MD_AUDIO_MAX_SYNC_UPLOAD_MB`, `MD_AUDIO_JOB_TTL_SECONDS`, `MD_AUDIO_TEMP_DIR`, `MD_AUDIO_CORS_ORIGINS`, `MD_AUDIO_API_HOST`, `MD_AUDIO_API_PORT` |
| Video API | `MD_VIDEO_` | Same pattern as audio with `MD_VIDEO_*` (defaults: larger upload/sync caps, port **8012**) |
| YouTube API | `MD_YOUTUBE_` | `MD_YOUTUBE_JOB_TTL_SECONDS`, `MD_YOUTUBE_TEMP_DIR`, `MD_YOUTUBE_CORS_ORIGINS`, `MD_YOUTUBE_API_HOST`, `MD_YOUTUBE_API_PORT` (default **8013**); optional `MD_YOUTUBE_YTDLP` path for audio fallback |

Exact variable names match the `ApiSettings` / helper functions in each `api/settings` or `api/app` module.
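
To show how such prefixed variables are typically consumed, here is a minimal sketch that reads the audio limits with the defaults stated earlier (200 MB, 40 MB, port 8011). It is not the project's `ApiSettings` class: `AudioApiLimits` and `load_audio_limits` are hypothetical, and the comma-separated format for CORS origins is an assumption.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AudioApiLimits:
    max_upload_mb: int
    max_sync_upload_mb: int
    port: int
    cors_origins: tuple[str, ...]


def load_audio_limits(env: Optional[Mapping[str, str]] = None) -> AudioApiLimits:
    """Read MD_AUDIO_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    origins = env.get("MD_AUDIO_CORS_ORIGINS", "")
    return AudioApiLimits(
        max_upload_mb=int(env.get("MD_AUDIO_MAX_UPLOAD_MB", "200")),
        max_sync_upload_mb=int(env.get("MD_AUDIO_MAX_SYNC_UPLOAD_MB", "40")),
        port=int(env.get("MD_AUDIO_API_PORT", "8011")),
        # Assumed value format: comma-separated origins.
        cors_origins=tuple(o.strip() for o in origins.split(",") if o.strip()),
    )


print(load_audio_limits({"MD_AUDIO_API_PORT": "9011"}))
```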

---

## MCP (Model Context Protocol)

Two usage patterns:

1. **Bundled with FastAPI** — run Uvicorn as in the previous section; use path **`/mcp`** on the same host/port.
2. **Standalone process** — run a small `__main__` module (stdio, SSE, or streamable-http) for use with Cursor, Claude Desktop, or other MCP hosts.

### Standalone MCP processes

| Converter | Command (examples) |
|-----------|---------------------|
| ZIP | `python -m md_generator.archive.api.mcp_server` / `--transport sse` / `--transport streamable-http` |
| Text/JSON/XML | `python -m md_generator.text.api.mcp_server` |
| Word (FastMCP) | `python -m md_generator.word.api.mcp_server` / `--transport stdio` (default) or `streamable-http`, plus `--host` / `--port` when needed |
| PDF (FastMCP) | `python -m md_generator.pdf.api.mcp_server` / `--transport stdio` / `sse` / `streamable-http` |
| PPTX | `python -m md_generator.ppt.api.mcp_server` (see module docstring for flags) |
| Image | `python -m md_generator.image.api.mcp_server` (see module for CLI) |
| URL / HTML | `python -m md_generator.url.api.mcp_server` / `--transport sse` / `--transport streamable-http` |
| Playwright / SPA | `md-playwright`, `md-playwright-api`, `md-playwright-mcp`, or `python -m md_generator.playwright.api.mcp_server` / `--transport sse` / `--transport streamable-http` |
| Audio | `md-audio-mcp` or `python -m md_generator.media.audio.api.mcp_server` — `--transport stdio` (default), `sse`, `streamable-http` |
| Video | `md-video-mcp` or `python -m md_generator.media.video.api.mcp_server` — same transports |
| YouTube | `md-youtube-mcp` or `python -m md_generator.media.youtube.api.mcp_server` — same transports |
| Database metadata | `md-db-mcp` or `python -m md_generator.db.api.mcp_server` — `--transport stdio` (default), `sse`, `streamable-http` |
| Graph (Neo4j / NetworkX) | `md-graph-mcp` or `python -m md_generator.graph.api.mcp_server` — `--transport stdio` (default), `sse`, `streamable-http` |
| OpenAPI → Markdown | `md-openapi-mcp` or `python -m md_generator.openapi.api.mcp_server` — `--transport stdio` (default), `sse`, `streamable-http` |
| Codeflow | `md-codeflow-mcp` or `python -m md_generator.codeflow.api.mcp_server` — **stdio** MCP (see module for details) |
| Log files | `md-log-mcp` or `python -m md_generator.log.api.mcp_server` — `--transport stdio` (default), `sse`, `streamable-http` |
| SAP | `md-sap-mcp` or `md-sap-api` — FastAPI app (same as HTTP API; dedicated MCP tools when mounted in app) |

**Word** and **XLSX** also ship a small runner script in the repo:

```bash
python word-to-md/run.py api --host 127.0.0.1 --port 8002
python word-to-md/run.py mcp --transport stdio

python xlsx-to-md/run.py api --port 8003
python xlsx-to-md/run.py mcp --transport stdio
```

The XLSX MCP server is built in code (`build_mcp_server()` in `md_generator.xlsx.mcp_server`) and is mounted on the XLSX FastAPI app when MCP dependencies are installed.

Install **`mdengine[mcp]`** (and usually **`[api]`** when using HTTP) for MCP-related imports to resolve.
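
Many MCP hosts register stdio servers through a JSON file with an `mcpServers` map of command and arguments. The sketch below generates such a fragment for two of the standalone commands listed above. The server names are invented, and the exact file location and schema depend on the host, so treat the output as a starting point and check the host's own documentation.

```python
import json


def mcp_server_entry(command: str, transport: str = "stdio") -> dict:
    """Describe one stdio MCP server as a command plus arguments."""
    return {"command": command, "args": ["--transport", transport]}


config = {
    "mcpServers": {
        # Keys are arbitrary labels chosen for this example.
        "mdengine-audio": mcp_server_entry("md-audio-mcp"),
        "mdengine-db": mcp_server_entry("md-db-mcp"),
    }
}
print(json.dumps(config, indent=2))
```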

---

## AI assistant CLI

The repo ships **distributable AI skills** under [`ai/`](ai/) (Markdown skills, [`ai/registry.json`](ai/registry.json), routing, and [`ai/dependency-graph.json`](ai/dependency-graph.json)). A small SDK in [`src/md_generator/tools/assistant/`](src/md_generator/tools/assistant/) assembles **grounded context** from a natural-language query for Cursor, Claude, OpenAI-style hosts, or your own LLM pipeline.

**Further detail:** [ai/README.md](ai/README.md) (layout, registry schema, master agent).

**PyPI installs:** `mdengine ai assist` and `mdengine ai export` read the **bundled** skill data shipped with **`md_generator.tools.assistant`** unless **`MDENGINE_SKILL_AI_ROOT`** points at another directory (for example the repo’s [`ai/`](ai/) while developing). You do **not** need **`mdengine skill build`** unless you maintain this repository and want to regenerate that bundle.

Script index (all `project.scripts`): [`ai/skills/mdengine-reference/references/entrypoints.md`](ai/skills/mdengine-reference/references/entrypoints.md).

### Regenerate skills, graph, and bundled data

**Maintainers / repo clones.** From the **repository root**, refresh generated files under [`ai/`](ai/) and copy the bundle into **`src/md_generator/tools/assistant/data/`** for packaging:

```powershell
# Windows PowerShell
$env:PYTHONPATH = "src"
python -m md_generator.tools.skill_builder
```

```bash
# macOS / Linux
PYTHONPATH=src python -m md_generator.tools.skill_builder
```

After `pip install -e .`, you can run **`mdengine skill build`** from the repo root without setting `PYTHONPATH`; it is equivalent to the `python -m md_generator.tools.skill_builder` invocations above. Either command:

- Refreshes `ai/skills/**/SKILL.md`, `ai/skills/global-skill.md`, `ai/skills/modules/*.md`, `ai/registry.json` (routing), and `ai/dependency-graph.json`.
- Copies the same bundle into **`src/md_generator/tools/assistant/data/`** for the installed wheel.

**Incremental build** (regenerates only the area skills whose sources under `src/md_generator/` changed since a git ref; the global skill, dependency graph, and registry are still refreshed):

```bash
mdengine skill build --since HEAD~1
```

```bash
PYTHONPATH=src python -m md_generator.tools.skill_builder --since HEAD~1
```

**Custom repo root** (CI or non-default cwd):

```bash
mdengine skill build --root /path/to/md-generator
```
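
For CI scripts that may or may not have the editable install, the choice between the two invocation forms above can be sketched in Python. The helper `skill_build_command` is hypothetical (not part of `mdengine`), and passing `--root` to the module form is an assumption based on the two forms being described as equivalent.

```python
import os
import shlex
import shutil
import sys
from typing import Callable, Optional


def skill_build_command(
    since: Optional[str] = None,
    root: Optional[str] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> tuple[list[str], dict[str, str]]:
    """Return (argv, env) for a skill rebuild.

    Prefers the `mdengine` console script when it is on PATH; otherwise
    falls back to the module form with PYTHONPATH=src.
    """
    env = dict(os.environ)
    if which("mdengine"):
        argv = ["mdengine", "skill", "build"]
    else:
        argv = [sys.executable, "-m", "md_generator.tools.skill_builder"]
        env["PYTHONPATH"] = "src"
    if since:
        argv += ["--since", since]
    if root:
        # Assumption: the module form accepts --root like the console script.
        argv += ["--root", root]
    return argv, env


if __name__ == "__main__":
    argv, _env = skill_build_command(since="HEAD~1")
    print(shlex.join(argv))
```

The returned pair can be handed to `subprocess.run(argv, env=env, check=True)` from the repository root.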

### Use the live `ai/` tree from a clone (optional)

When developing skills, point the SDK at the repo’s `ai/` directory instead of the copy under `src/md_generator/tools/assistant/data/`:

```powershell
# PowerShell
$env:MDENGINE_SKILL_AI_ROOT = (Resolve-Path ".\ai").Path
```

```bash
# macOS / Linux
export MDENGINE_SKILL_AI_ROOT="$(pwd)/ai"
```
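
When the SDK or CLI is launched from a Python script rather than a shell, the same variable can be set on a copy of the environment. This is a minimal sketch; `env_with_ai_root` is a hypothetical helper, and the directory check is an extra safeguard added here, not documented `mdengine` behavior.

```python
import os
from pathlib import Path


def env_with_ai_root(ai_dir: str = "ai") -> dict[str, str]:
    """Copy the current environment and point MDENGINE_SKILL_AI_ROOT at ai_dir."""
    root = Path(ai_dir).resolve()
    if not root.is_dir():
        raise FileNotFoundError(f"skill directory not found: {root}")
    env = dict(os.environ)
    # Use an absolute path so the setting survives a change of working directory.
    env["MDENGINE_SKILL_AI_ROOT"] = str(root)
    return env
```

Pass the result as `env=` to `subprocess.run(...)`, which leaves the parent process environment untouched.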

### Console: `mdengine ai assist` and `mdengine ai export`

After `pip install mdengine` (or `pip install -e .` from a clone), use the **`mdengine`** entry point (no separate AI script is installed).

- **`mdengine ai assist`** — prints a single **Markdown** context (selected skills + global architecture) to **stdout**, meant to paste into an LLM or pipe to a file.
- **`mdengine ai export`** — prints **host-shaped** output to stdout, or writes with **`-o` / `--output`**: OpenAI **`messages`** JSON (`--format openai`), a Claude-style project prompt (`claude`), or Cursor-style rules Markdown (`cursor`).

**Flags (both subcommands):** **`--rag`** — use optional Chroma RAG when **`mdengine[skill-rag-chroma]`** is installed (falls back to full skills if Chroma is unavailable). **`--ai-root PATH`** — same effect as **`MDENGINE_SKILL_AI_ROOT`** for that run only.

**Discover flags:** `mdengine ai assist --help` · `mdengine ai export --help`

**`ai export` arguments:** **`--format`** `openai` \| `claude` \| `cursor` (required) · **`--query`** `…` (required) · **`-o` / `--output`** `file` (optional; default stdout).

```bash
mdengine ai assist "How do I run md-pdf on a file?"
mdengine ai assist "openapi mcp" --rag
mdengine ai export --format openai --query "md-db ERD" -o bundle.json
mdengine ai export --format claude --query "playwright"
mdengine ai export --format cursor --query "xlsx excel"
```

One-off override without env var: `mdengine ai assist "…" --ai-root /path/to/ai`.
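
To drive `mdengine ai export` from a script, the argument list can be assembled in Python. The sketch below uses only the flags documented above; `export_argv` is a hypothetical helper, not part of the package.

```python
import shlex
from typing import Optional

FORMATS = ("openai", "claude", "cursor")


def export_argv(
    fmt: str,
    query: str,
    output: Optional[str] = None,
    rag: bool = False,
) -> list[str]:
    """Build the argument list for `mdengine ai export`."""
    if fmt not in FORMATS:
        raise ValueError(f"format must be one of {FORMATS}, got {fmt!r}")
    argv = ["mdengine", "ai", "export", "--format", fmt, "--query", query]
    if output:
        argv += ["-o", output]
    if rag:
        argv.append("--rag")
    return argv


if __name__ == "__main__":
    # Print the command instead of running it, so this works without mdengine.
    print(shlex.join(export_argv("openai", "md-db ERD", output="bundle.json")))
```

With `mdengine` on `PATH`, the list can be run through `subprocess.run(argv, check=True)`. Passing a list rather than a shell string avoids quoting problems when the query contains spaces.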

### Python API

```python
from md_generator.tools.assistant import MasterAgent, Registry

reg = Registry.load_default()
agent = MasterAgent(reg)
result = agent.ask("openapi mcp", use_rag=False)
print(result.resolved_skill_ids)
print(result.context_markdown[:2000])
```

Optional RAG: install **`mdengine[skill-rag-chroma]`**, then `agent.ask(..., use_rag=True)` (falls back to full skills if Chroma is unavailable). The same optional stack is used by **`mdengine ai assist --rag`** and **`mdengine ai export … --rag`**.
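
A small wrapper can persist the assembled context for later use. This is a sketch that relies only on the `ask(...)` call and the `resolved_skill_ids` / `context_markdown` attributes shown above; `save_context` is a hypothetical helper and accepts any object with that shape.

```python
from pathlib import Path
from typing import Any


def save_context(agent: Any, query: str, out_path: str, use_rag: bool = False) -> list[str]:
    """Ask the agent, write the Markdown context to out_path, return the skill ids."""
    result = agent.ask(query, use_rag=use_rag)
    Path(out_path).write_text(result.context_markdown, encoding="utf-8")
    return list(result.resolved_skill_ids)
```

With `mdengine` installed, `save_context(MasterAgent(Registry.load_default()), "openapi mcp", "context.md")` would write the context to `context.md`.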

### Tests (skill SDK)

```bash
PYTHONPATH=src python -m pytest tool-assistant/tests -q
```

---

## Development

```bash
pip install -e ".[dev,all]"   # or a smaller subset of extras
python -m pytest
```

Tests live under each legacy folder’s `tests/` directory (e.g. `pdf-to-md/tests/`), plus **`url-to-md/tests/`**, **`playwright-to-md/tests/`**, **`youtube-to-md/tests/`**, **`graph-to-md/tests/`**, **`openapi-to-md/tests/`**, **`codeflow-to-md/tests/`**, **`log-to-md/tests/`**, **`sap-to-md/tests/`**, and **[`tool-assistant/tests/`](tool-assistant/tests/)** for the skill SDK. `pyproject.toml` sets `pythonpath = ["src"]` in its pytest configuration, so **`md_generator`** (including **`md_generator.tools.assistant`**) resolves without a manual `PYTHONPATH` whenever `pytest` picks up that configuration.

**Docs site (maintainers):** install **`mdengine[docs]`**, then build or serve the MkDocs project from the repo root (`mkdocs.yml`). The published site URL is listed under **Documentation** in `[project.urls]` on PyPI ([GitHub Pages build](https://vishal7090.github.io/md-generator/)).

### Codeflow (`md_generator.codeflow`)

Generates static call graphs and per-entry Markdown/Mermaid/JSON output for **Python**, **Java** (Spring REST, Kafka listeners, **Liferay / JSR-286 portlets**), and the other languages listed below. Analysis is **static only**, so `graph-full.json` can include unresolved `unknown::*` callees.

More detail: **[`codeflow-to-md/docs/graph-and-outputs.md`](codeflow-to-md/docs/graph-and-outputs.md)** (node/edge model, `graph-schema.json`, `methods/` layout).

- **Formats:** pass **`--formats md,mermaid,json,html`** (comma-separated). **`--output`** is the **output directory**, not the format list.
- **Graph schema export:** optional **`--emit-graph-schema`** (with `json` in formats) writes **`graph-schema.json`** — stable Node/Edge view with derived File/Class nodes and `CONTAINS` edges. **`graph-full.json`** stays unchanged for existing consumers.
- **Markdown intelligence:** flow and entry Markdown include **Called By** and **Impact** lists (capped with **`--intelligence-list-cap`**, default 80).
- **IR / CFG (optional):** **`--emit-cfg`** builds a language-agnostic IR and writes **`cfg.json`**, **`cfg.mmd`**, and appends a CFG Mermaid block to **`flow.md`** (when `md` is in formats). Cap with **`--cfg-max-nodes`** (default 500).
- **Entries:** use **`--entry`** (comma-separated symbol ids), **`--entries-file`** (one id per line, `#` comments allowed), or rely on detectors. **`--emit-entry-per-method`** emits one output directory per method/entry symbol under **`methods/<slug>/`**; cap the volume with **`--emit-entry-max`** (defaults to **10000** when `--emit-entry-per-method` is set and `--emit-entry-max` is omitted). **`--entry-fallback`** `none` \| `roots` \| `first_n` controls what happens when nothing is detected (default **`roots`**, i.e. symbols with in-degree 0, limited by **`--entry-fallback-max`**).
- **Java flow:** edges can record **branch context** from enclosing `if`, `switch`, or ternary expressions (`condition` / `labels` on edges). **Business rules** Markdown picks up validation-style **`@NotNull`** / **`@Size`** / **`@Valid`** (name heuristics) and throws such as **`IllegalArgumentException`**.
- **JavaScript / TypeScript / TSX:** Tree-sitter adds **`condition_label`** from enclosing control flow, **`fr.branches`** for `if` / `switch`, and **`fr.rules`** for **`throw`** and common **`assert` / `console.assert`** calls (slice-scoped in the collector).
- **C++ (tree-sitter path):** **`condition_label`** on calls, **`fr.branches`** for `if` / `switch`, plus **`rules/cpp_rules.py`** for **`throw`** / **`assert`** when methods appear in the slice. (Clang-only parses still get throw/assert via the same extractor when tree-sitter is installed.)
- **Go:** the **`codeflow_go_dump`** helper emits **`condition`** on calls inside `if` / `switch` / `for` / `range`, and **`rules`** for **`panic`** / **`log.Fatal` / `Fatalf`**; **`fr.rules`** are merged slice-scoped.
- **PHP:** the **`codeflow_php_dump`** helper adds **`condition`** on calls (from enclosing **`if` / `case`**), **`rules`** for **`throw`**, and **`branches`** for documentation. Install PHP + Composer deps under `codeflow-to-md/examples/codeflow_php_dump` (or mirror under `tools/codeflow_php_dump`) for PHP scans.
- **Kinds / filters:** `--include` accepts aliases such as **`portlet`** / **`liferay`** for portlet entries.
- **Operators:** each run writes **`scan-summary.md`** at the output root (counts, warnings, fallback mode) unless **`--no-scan-summary`** is set.

```bash
python -m md_generator.codeflow.cli.main scan path/to/src --output ./codeflow-out --lang java
```
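
The `--entries-file` format and the `--formats` / `--output` distinction described above can be illustrated with a short Python sketch. Both helpers are hypothetical and not part of `mdengine`, and any symbol ids passed in are placeholders, since the id syntax depends on the language parser.

```python
import sys
from pathlib import Path
from typing import Optional


def write_entries_file(path: str, symbol_ids: list[str], note: str = "") -> int:
    """Write one symbol id per line (optional single-line `#` comment first)."""
    lines = [f"# {note}"] if note else []
    # Drop blanks, comment-like values, and duplicates while preserving order.
    seen: dict[str, None] = {}
    for sid in symbol_ids:
        sid = sid.strip()
        if sid and not sid.startswith("#"):
            seen.setdefault(sid)
    lines.extend(seen)
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(seen)


def scan_argv(
    src: str,
    out_dir: str,
    lang: str,
    formats: list[str],
    entries_file: Optional[str] = None,
) -> list[str]:
    """Build the argument list for a codeflow scan."""
    argv = [
        sys.executable, "-m", "md_generator.codeflow.cli.main", "scan", src,
        "--output", out_dir,               # output directory, not the format list
        "--lang", lang,
        "--formats", ",".join(formats),    # comma-separated, e.g. md,json
    ]
    if entries_file:
        argv += ["--entries-file", entries_file]
    return argv
```

The list returned by `scan_argv(...)` can be run with `subprocess.run(argv, check=True)` once `mdengine` is installed in the active environment.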

---

## Repository layout

| Path | Role |
|------|------|
| `LICENSE` | MIT license text |
| `docs/` | **MkDocs** source (module guides, `index.md`); build with **`mdengine[docs]`** |
| `mkdocs.yml` | MkDocs configuration |
| `CODE_OF_CONDUCT.md` | [Contributor Covenant](https://www.contributor-covenant.org/) 2.1 |
| `src/md_generator/` | **Library source** (all formats + `api` subpackages); **audio/video** under [`media/`](src/md_generator/media/); **logs** [`log/`](src/md_generator/log/), **OTEL** [`otel/`](src/md_generator/otel/), **SAP** [`sap/`](src/md_generator/sap/), **graphs** [`graph/`](src/md_generator/graph/) |
| `pyproject.toml` | Packaging, extras, CLI entry points, pytest |
| `ai/` | **Distributable AI skills** (`SKILL.md` trees, `registry.json`, `dependency-graph.json`); regenerate with `PYTHONPATH=src python -m md_generator.tools.skill_builder` or **`mdengine skill build`** (after `pip install -e .`) |
| `src/md_generator/tools/skill_builder/` | **Skill generator** — scans `src/md_generator` and `pyproject.toml` |
| `src/md_generator/tools/assistant/` | **Skill SDK** (`md_generator.tools.assistant`): `MasterAgent`, `Registry`, bundled `data/` copy of `ai/` |
| `tool-assistant/tests/` | **Pytests** for the assistant SDK (`md_generator.tools.assistant`) |
| `*-to-md/` | **Docs, tests, fixtures**, thin `converter.py` shims, some `run.py` helpers |
| `README.md` | This document |

For more detail on each format's behavior, see the original README files under each `*-to-md/` folder where they still exist.

---

## Legal

This project is released under the [MIT License](LICENSE). A copy of the license text is included in the repository root.
