Metadata-Version: 2.4
Name: docstudio
Version: 0.2.0
Summary: Bidirectional, Markdown-centric document conversion: reverse (X->Markdown) like markitdown, plus high-fidelity forward export (Markdown->PDF/Word/LaTeX/EPUB/Excel) and optional VLM image recognition.
Project-URL: Homepage, https://github.com/Sudo-Biao/docstudio
Project-URL: Repository, https://github.com/Sudo-Biao/docstudio
Project-URL: Issues, https://github.com/Sudo-Biao/docstudio/issues
Author-email: biaoli <biaoli@swufe.edu.cn>
License: MIT
License-File: LICENSE
Keywords: conversion,document,docx,epub,latex,markdown,markitdown,ocr,pdf,vlm
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Office/Business
Classifier: Topic :: Text Processing :: Markup :: Markdown
Requires-Python: >=3.9
Requires-Dist: markdown>=3.5
Requires-Dist: markdownify>=0.11
Provides-Extra: all
Requires-Dist: ebooklib; extra == 'all'
Requires-Dist: mammoth; extra == 'all'
Requires-Dist: markitdown[all]; extra == 'all'
Requires-Dist: openpyxl; extra == 'all'
Requires-Dist: pdfminer-six; extra == 'all'
Requires-Dist: pillow; extra == 'all'
Requires-Dist: pymupdf; extra == 'all'
Requires-Dist: pypdf; extra == 'all'
Requires-Dist: pytesseract; extra == 'all'
Requires-Dist: python-docx; extra == 'all'
Requires-Dist: python-pptx; extra == 'all'
Requires-Dist: requests; extra == 'all'
Provides-Extra: llm
Requires-Dist: requests; extra == 'llm'
Provides-Extra: markitdown
Requires-Dist: markitdown[all]; extra == 'markitdown'
Provides-Extra: ocr
Requires-Dist: pillow; extra == 'ocr'
Requires-Dist: pymupdf; extra == 'ocr'
Requires-Dist: pytesseract; extra == 'ocr'
Provides-Extra: office
Requires-Dist: ebooklib; extra == 'office'
Requires-Dist: mammoth; extra == 'office'
Requires-Dist: openpyxl; extra == 'office'
Requires-Dist: python-docx; extra == 'office'
Requires-Dist: python-pptx; extra == 'office'
Provides-Extra: pdf
Requires-Dist: pdfminer-six; extra == 'pdf'
Requires-Dist: pypdf; extra == 'pdf'
Provides-Extra: pdf-chrome
Requires-Dist: playwright; extra == 'pdf-chrome'
Provides-Extra: pdf-weasy
Requires-Dist: weasyprint; extra == 'pdf-weasy'
Description-Content-Type: text/markdown

# DocumentStudio (Python)

A **bidirectional, Markdown-centric** document converter — a Python library and
CLI in the spirit of Microsoft's [`markitdown`](https://github.com/microsoft/markitdown),
but going **both ways**:

| Direction | Formats | Notes |
|-----------|---------|-------|
| **Reverse** `X → Markdown` | PDF, Word, PPT, Excel, EPUB, HTML, CSV/TSV, JSON, ZIP, images | like markitdown; can *delegate to* markitdown when installed |
| **Forward** `Markdown → X` | HTML, **PDF**, **Word (.docx)**, **LaTeX**, EPUB, Excel (.xlsx), text | high-fidelity export — the part markitdown does **not** do |
| **AI / VLM** | image & scanned-PDF recognition, "smart cleanup" | any OpenAI-compatible endpoint; vision model optional |
| **AI assistant** | polish, translate, summarise, expand, continue, grammar, formalise, titles, outline, fix-LaTeX, free-form | one-shot ops on a document |
| **Toolbox** | table of contents, merge PDFs, extract images | headless, no browser |
| **Templates** | academic, techdoc, minutes, readme, weekly, blog | ready-to-edit Markdown |

The design mirrors markitdown's: a small core, a converter **registry** that's open
for extension, and **optional dependency extras** so a minimal install still works.

## Install

```bash
pip install docstudio                 # core: csv/tsv/json/html  +  md→html/latex/text
pip install "docstudio[office]"       # docx, pptx, xlsx, epub
pip install "docstudio[pdf]"          # PDF text extraction (pdfminer.six)
pip install "docstudio[ocr]"          # scanned-PDF / image OCR (PyMuPDF, pytesseract)
pip install "docstudio[llm]"          # AI cleanup + VLM (requests)
pip install "docstudio[markitdown]"   # reuse Microsoft markitdown for the reverse path
pip install "docstudio[all]"
```

For **Markdown → PDF/DOCX/EPUB** with the best fidelity, install
[`pandoc`](https://pandoc.org) plus a TeX engine (`xelatex`):
```bash
sudo apt install pandoc texlive-xetex texlive-latex-recommended fonts-noto-cjk
```
PDF also has two pure-Python backends as fallbacks: `weasyprint`
(`docstudio[pdf-weasy]`) and headless-Chrome via `playwright`
(`docstudio[pdf-chrome]`, full KaTeX math).

## Library

```python
from docstudio import DocumentStudio
ds = DocumentStudio()                       # use_markitdown=True by default

# anything → Markdown
md = ds.to_markdown("report.pdf")
md = ds.to_markdown("slides.pptx")

# Markdown → anything (non-md inputs are auto-converted first)
ds.convert("paper.md",  to="pdf",   out="paper.pdf")
ds.convert("paper.md",  to="docx",  out="paper.docx")
ds.convert("scan.pdf",  to="docx",  out="scan.docx")   # PDF → md → docx
ds.convert("table.png", to="xlsx",  out="table.xlsx")  # image → md → xlsx
```

### AI + Vision (VLM)

```python
from docstudio import DocumentStudio
from docstudio.llm import LLM

llm = LLM(base_url="https://api.openai.com", api_key="sk-...",
          model="gpt-4o-mini", vlm_model="gpt-4o")

print(LLM.fetch_models("https://api.openai.com", "sk-..."))   # pick from the list

ds = DocumentStudio(llm=llm)
md = ds.to_markdown("photographed_table.jpg")   # recognised by the vision model
md = ds.to_markdown("scanned_book.pdf")         # page-by-page VLM when no text layer
md = llm.cleanup_markdown(rough_text)           # turn messy OCR into clean Markdown
```

## AI assistant (operate on a document)

One-shot AI operations on Markdown/text — the *AI Assistant* from the web app.
Needs an `llm` (any OpenAI-compatible endpoint).

```python
from docstudio import DocumentStudio
from docstudio.llm import LLM

# any OpenAI-compatible endpoint — OpenAI, DeepSeek, vLLM, Ollama, a gateway…
# you choose base_url + model; nothing is hard-coded to a provider
ds = DocumentStudio(llm=LLM(base_url="https://api.openai.com",
                            api_key="sk-...", model="gpt-4o-mini"))

ds.assist(md, action="polish")     # 润色
ds.assist(md, action="to_en")      # 翻译成英文（to_zh 反之）
ds.assist(md, action="summary")    # 摘要
ds.assist(md, action="outline")    # 生成大纲
ds.assist(md, instruction="把所有表格改成要点列表")   # 自由指令

DocumentStudio.assist_actions()
# polish, to_en, to_zh, summary, expand, condense, continue,
# grammar, formal, titles, outline, fix_latex
```

## Toolbox

```python
ds.generate_toc(md)                              # insert a Markdown table of contents
ds.merge_pdfs(["a.pdf", "b.pdf"], "all.pdf")     # concatenate PDFs (needs pypdf)
ds.extract_images("report.pdf", "./imgs")        # pull embedded images out (PDF/DOCX/PPTX/EPUB)
```

## Templates

Six ready-to-edit Markdown templates: `academic`, `techdoc`, `minutes`,
`readme`, `weekly`, `blog`.

```python
ds.templates()                # {slug: (title, description)}
body = ds.template("academic")
```

## CLI

```bash
docstudio report.pdf                      # → report.md   (prints to stdout)
docstudio report.pdf -o out.md
cat report.pdf | docstudio                # stdin → stdout
docstudio paper.md --to pdf -o paper.pdf  # Markdown → anything
docstudio scan.pdf --to docx              # PDF → md → docx
docstudio photo.jpg --vlm-model gpt-4o --base-url https://api.openai.com --api-key sk-...
docstudio --list-formats

docstudio paper.md --toc -o paper.md                     # insert a table of contents
docstudio notes.md --assist polish --base-url https://api.openai.com --model gpt-4o-mini --api-key sk-... -o clean.md
docstudio notes.md --instruction "翻译成英文" --base-url https://api.openai.com --model gpt-4o-mini --api-key sk-... -o en.md
docstudio --merge a.pdf b.pdf -o all.pdf                 # merge PDFs
docstudio report.pdf --extract-images ./imgs             # pull out images
docstudio --template academic                            # print a template
docstudio --list-templates
```

## Extending

Register your own converter — exactly how the built-ins are defined:

```python
from docstudio.core import registry

@registry.ingester("rtf")
def rtf_to_md(source, ds=None, **opts):
    ...
    return markdown_text

@registry.exporter("rst")
def md_to_rst(md, out=None, ds=None, **opts):
    ...
    return out
```

## Relationship to markitdown

`markitdown` is excellent at `X → Markdown` for LLM pipelines. DocumentStudio
**reuses** it for that direction when present (`use_markitdown=True`), and adds the
missing half: turning Markdown back into polished, human-facing **PDF / Word /
LaTeX / EPUB / Excel**, plus a vision-model path for images and scanned PDFs.

MIT licensed.
