Metadata-Version: 2.4
Name: deepdoc_lib
Version: 0.2.0
Summary: Deep document understanding components from [RAGFlow](https://ragflow.io/), an open-source RAG (Retrieval-Augmented Generation) engine. RAGFlow offers a streamlined RAG workflow for businesses of any scale, combining LLMs (Large Language Models) to provide truthful question-answering capabilities, backed by well-founded citations from various complex formatted data.
Author-email: Zhichang Yu <yuzhichang@gmail.com>
Requires-Python: <3.14,>=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: datrie>=0.8.2
Requires-Dist: anthropic>=0.69.0
Requires-Dist: beartype<0.19.0,>=0.18.5
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: chardet>=5.2.0
Requires-Dist: demjson3>=3.0.6
Requires-Dist: hanziconv>=0.3.2
Requires-Dist: html5lib>=1.1
Requires-Dist: jinja2>=3.1.0
Requires-Dist: modelscope<2.0.0,>=1.20.0
Requires-Dist: markdown>=3.6
Requires-Dist: nltk>=3.9.1
Requires-Dist: numpy<2.0.0,>=1.26.0
Requires-Dist: ollama>=0.6.1
Requires-Dist: onnxruntime>=1.19.2
Requires-Dist: openai>=1.45.0
Requires-Dist: opencv-python>=4.10.0.84
Requires-Dist: opencv-python-headless>=4.10.0.84
Requires-Dist: openpyxl<4.0.0,>=3.1.0
Requires-Dist: pandas<3.0.0,>=2.2.0
Requires-Dist: pdfplumber>=0.10.4
Requires-Dist: pillow>=11.0.0
Requires-Dist: pyclipper>=1.3.0.post5
Requires-Dist: pypdf>=6.0.0
Requires-Dist: python-pptx<2.0.0,>=1.0.2
Requires-Dist: python-docx<2.0.0,>=1.1.2
Requires-Dist: pyyaml>=6.0.0
Requires-Dist: requests>=2.32.2
Requires-Dist: scikit-learn>=1.5.0
Requires-Dist: shapely>=2.0.5
Requires-Dist: six>=1.16.0
Requires-Dist: strenum>=0.4.15
Requires-Dist: tencentcloud-sdk-python>=3.0.1215
Requires-Dist: tiktoken>=0.7.0
Requires-Dist: xgboost<2.0.0,>=1.6.0
Requires-Dist: xpinyin>=0.7.6
Requires-Dist: zhipuai>=2.0.1
Requires-Dist: google-generativeai<0.9.0,>=0.8.1
Requires-Dist: trio>=0.29.0
Requires-Dist: setuptools<76.0.0,>=75.2.0
Requires-Dist: huggingface-hub<0.26.0,>=0.25.0
Provides-Extra: gpu
Requires-Dist: onnxruntime-gpu>=1.19.2; (sys_platform != "darwin" and platform_machine == "x86_64") and extra == "gpu"
Requires-Dist: torch; extra == "gpu"
Dynamic: license-file

# Deepdoc

### Installation

CPU-only (default):

```bash
pip install deepdoc-lib
```

GPU (Linux x86_64 only):

```bash
pip install "deepdoc-lib[gpu]"
```

Note: `onnxruntime` (CPU) and `onnxruntime-gpu` should not be installed together. If you're switching an existing environment to GPU, uninstall the CPU `onnxruntime` first:

```bash
pip uninstall -y onnxruntime
pip install onnxruntime-gpu==1.19.2
```
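
To confirm the GPU build is the one being picked up, a quick sanity check with ONNX Runtime itself (not part of deepdoc) is:

```python
import onnxruntime as ort

# The GPU package should list CUDAExecutionProvider alongside CPUExecutionProvider.
print(ort.get_available_providers())
```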

### Parser Usage

```python
from deepdoc import (
    DocxParser,
    ExcelParser,
    HtmlParser,
    PdfModelConfig,
    PdfParser,
    TokenizerConfig,
)

# Build configs
# Method 1: Explicit configuration (offline mode)
tokenizer_cfg = TokenizerConfig(
    offline=True,
    nltk_data_dir="/path/to/nltk_data",
)
pdf_model_cfg = PdfModelConfig(
    vision_model_dir="/path/to/models/vision",
    xgb_model_dir="/path/to/models/xgb",
    model_provider="local",
)

# Method 2: Empty configuration (auto-download models and nltk_data)
# tokenizer_cfg = TokenizerConfig()
# pdf_model_cfg = PdfModelConfig()


# Parse PDF
pdf_parser = PdfParser(model_cfg=pdf_model_cfg, tokenizer_cfg=tokenizer_cfg)
result = pdf_parser("document.pdf")

# Parse DOCX / HTML (tokenizer only)
docx_parser = DocxParser(tokenizer_cfg=tokenizer_cfg)
html_parser = HtmlParser(tokenizer_cfg=tokenizer_cfg)

# Parse Excel (no model/tokenizer dependency)
excel_parser = ExcelParser()
with open("data.xlsx", "rb") as f:
    result = excel_parser(f.read())
```
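
The DOCX and HTML parsers are invoked the same way. The sketch below assumes they accept a file path, mirroring the `PdfParser` call above; check the parser signatures in your installed version:

```python
# Assumed usage, mirroring PdfParser: pass a file path to the parser instance.
docx_result = docx_parser("report.docx")
html_result = html_parser("page.html")
```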

Or use the explicit environment-based factories:

```python
tokenizer_cfg = TokenizerConfig.from_env()
pdf_model_cfg = PdfModelConfig.from_env()
pdf_parser = PdfParser(model_cfg=pdf_model_cfg, tokenizer_cfg=tokenizer_cfg)
```

Or rely on defaults (env + cache). Deepdoc will look for cached bundles under
`$DEEPDOC_MODEL_HOME` (or `~/.cache/deepdoc`) and only download missing files
when the provider allows remote access:

```python
pdf_parser = PdfParser()
```

Environment variable definitions:

```bash
# provider: auto | local | modelscope
export DEEPDOC_MODEL_PROVIDER=auto

# shared model cache root (default: ~/.cache/deepdoc)
export DEEPDOC_MODEL_HOME=/path/to/deepdoc-models

# optional bundle-specific local directories
export DEEPDOC_VISION_MODEL_DIR=/path/to/vision
export DEEPDOC_XGB_MODEL_DIR=/path/to/xgb

# single combined ModelScope repo (all bundles in one repo)
# (default: Xorbits/deepdoc)
export DEEPDOC_MODELSCOPE_REPO=Xorbits/deepdoc
# optional shared revision (default: master)
export DEEPDOC_MODELSCOPE_REVISION=master

# offline mode for tokenizer NLTK auto-download
export DEEPDOC_OFFLINE=0

# optional NLTK data controls for tokenizer
export DEEPDOC_NLTK_DATA_DIR=/path/to/nltk_data
```
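
The same settings can also be applied programmatically before constructing the parsers. This is a minimal sketch using the variables defined above (the paths are placeholders, and it assumes `DEEPDOC_OFFLINE=1` enables offline mode):

```python
import os

# Point deepdoc at a pre-populated local cache and force fully local operation.
os.environ["DEEPDOC_MODEL_PROVIDER"] = "local"
os.environ["DEEPDOC_MODEL_HOME"] = "/path/to/deepdoc-models"
os.environ["DEEPDOC_OFFLINE"] = "1"  # assumption: 1 disables auto-download

from deepdoc import PdfModelConfig, PdfParser, TokenizerConfig

# from_env() picks up the variables set above.
tokenizer_cfg = TokenizerConfig.from_env()
pdf_model_cfg = PdfModelConfig.from_env()
pdf_parser = PdfParser(model_cfg=pdf_model_cfg, tokenizer_cfg=tokenizer_cfg)
```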

### Download model artifacts

To pre-download all model bundles (vision/xgb/tokenizer) into the default cache directory (`~/.cache/deepdoc`), run:

```bash
deepdoc-download-models
# or (from source checkout)
python -m deepdoc.download_models
```

If you want to override the cache location, set `DEEPDOC_MODEL_HOME`:

```bash
export DEEPDOC_MODEL_HOME=./models
deepdoc-download-models
```

By default this also downloads the required NLTK resources into `~/.cache/deepdoc/nltk_data` (or `$DEEPDOC_MODEL_HOME/nltk_data`) and the cached `cl100k_base` tiktoken file into `~/.cache/deepdoc/tiktoken_cache` (or `$DEEPDOC_MODEL_HOME/tiktoken_cache`). `deepdoc.common.token_utils` automatically points `TIKTOKEN_CACHE_DIR` at the same location unless you override it with `DEEPDOC_TIKTOKEN_CACHE_DIR` or `TIKTOKEN_CACHE_DIR`.
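
For example, to point deepdoc at a tiktoken cache you manage yourself, set either of the variables mentioned above:

```bash
# Either variable overrides where deepdoc looks for the cached tiktoken file.
export DEEPDOC_TIKTOKEN_CACHE_DIR=/path/to/tiktoken_cache
# or
export TIKTOKEN_CACHE_DIR=/path/to/tiktoken_cache
```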

To skip either (or both) of these optional offline assets:

```bash
deepdoc-download-models --no-nltk --no-tiktoken
```


### Vision Model Usage

```python
from deepdoc import create_vision_model
```

- Use Environment Variables

```bash
# Vision model configs
export DEEPDOC_VISION_PROVIDER="qwen"
export DEEPDOC_VISION_API_KEY="your-api-key"
export DEEPDOC_VISION_MODEL="qwen-vl-max"
export DEEPDOC_VISION_LANG="Chinese"
export DEEPDOC_VISION_BASE_URL="http://your_base_url"

# Other configs
export DEEPDOC_LIGHTEN=0  # Whether to use lighten mode
```

```python
vision_model = create_vision_model()
```

- Pass the Provider Name Directly

```bash
export DEEPDOC_VISION_API_KEY="your-api-key"
```

```python
vision_model = create_vision_model("qwen")
```

Supported providers: `openai`, `qwen`, `zhipu`, `ollama`, `gemini`, `anthropic`
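
For example, a locally hosted Ollama model can be selected through the same environment variables. The model name and URL below are illustrative (Ollama's usual local endpoint), not values mandated by deepdoc:

```bash
export DEEPDOC_VISION_PROVIDER="ollama"
export DEEPDOC_VISION_MODEL="llava"                      # any vision-capable model you have pulled
export DEEPDOC_VISION_BASE_URL="http://localhost:11434"  # Ollama's default local endpoint
```

```python
vision_model = create_vision_model()
```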

- Use Configuration File

Create `deepdoc_config.yaml`:

```yaml
vision_model:
  provider: "qwen"
  model_name: "qwen-vl-max"
  api_key: "your-api-key"
  lang: "Chinese"
  base_url: "http://your-base-url"
```

```python
vision_model = create_vision_model("/path/to/deepdoc_config.yaml")
```

#### Run

```python
with open("image.jpg", "rb") as f:
    result = vision_model.describe_with_prompt(f.read())
```
