Metadata-Version: 2.2
Name: gllm-docproc-binary
Version: 0.7.26
Summary: A library for orchestrating document processing, typically in Gen AI applications (but not limited to Gen AI).
Author-email: GenAI SDK Team <gat-sdk@gdplabs.id>
Requires-Python: <3.13,>=3.11
Description-Content-Type: text/markdown
Requires-Dist: bosa-connectors-binary<0.4.0,>=0.3.0
Requires-Dist: gllm-core-binary<0.4.0,>=0.3.0
Requires-Dist: gllm-datastore-binary[chroma,elasticsearch]<0.6.0,>=0.5.0
Requires-Dist: gllm-multimodal-binary[audio]<0.4.0,>=0.3.0
Requires-Dist: gllm-privacy-binary<0.5.0,>=0.4.0
Requires-Dist: langchain-text-splitters<0.4.0,>=0.3.2
Requires-Dist: pandas<3.0.0,>=2.2.3
Requires-Dist: pydantic<3.0.0,>=2.9.1
Requires-Dist: tabulate<0.10.0,>=0.9.0
Requires-Dist: python-magic<0.5.0,>=0.4.27; sys_platform != "win32"
Requires-Dist: python-magic-bin<0.5.0,>=0.4.14; sys_platform == "win32"
Provides-Extra: dev
Requires-Dist: coverage<8.0.0,>=7.4.4; extra == "dev"
Requires-Dist: mypy<2.0.0,>=1.15.0; extra == "dev"
Requires-Dist: pre-commit<4.0.0,>=3.7.0; extra == "dev"
Requires-Dist: pytest<9.0.0,>=8.1.1; extra == "dev"
Requires-Dist: pytest-asyncio<1.0.0,>=0.23.6; extra == "dev"
Requires-Dist: pytest-cov<6.0.0,>=5.0.0; extra == "dev"
Requires-Dist: ruff<1.0.0,>=0.6.7; extra == "dev"
Provides-Extra: audio
Requires-Dist: librosa<0.11.0,>=0.10.1; extra == "audio"
Requires-Dist: tqdm<5.0.0,>=4.66.2; extra == "audio"
Provides-Extra: docx
Requires-Dist: docx2python<3.0.0,>=2.8.0; extra == "docx"
Requires-Dist: python-docx<2.0.0,>=1.1.0; extra == "docx"
Provides-Extra: html
Requires-Dist: billiard<5.0.0,>=4.2.1; extra == "html"
Requires-Dist: firecrawl-py<5.0.0,>=4.3.6; extra == "html"
Requires-Dist: html-to-markdown<2.0.0,>=1.9.0; extra == "html"
Requires-Dist: playwright<2.0.0,>=1.40.0; extra == "html"
Requires-Dist: scrapy<3.0.0,>=2.11.0; extra == "html"
Requires-Dist: scrapy-playwright<0.1.0,>=0.0.33; extra == "html"
Requires-Dist: scrapy-zyte-api<1.0.0,>=0.12.2; extra == "html"
Requires-Dist: zyte-api<1.0.0,>=0.4.8; extra == "html"
Provides-Extra: html-svg
Requires-Dist: cairosvg<3.0.0,>=2.8.2; extra == "html-svg"
Provides-Extra: image
Requires-Dist: aioresponses<1.0.0,>=0.7.0; extra == "image"
Requires-Dist: boto3<2.0.0,>=1.38.10; extra == "image"
Requires-Dist: pillow<12.0.0,>=11.2.1; extra == "image"
Provides-Extra: kg
Requires-Dist: asyncpg<1.0.0,>=0.30.0; extra == "kg"
Requires-Dist: gllm-datastore-binary[kg]<0.6.0,>=0.5.0; extra == "kg"
Requires-Dist: lightrag-hku<2.0.0,>=1.4.6; extra == "kg"
Requires-Dist: llama-index-embeddings-openai<1.0.0,>=0.3.0; extra == "kg"
Requires-Dist: llama-index-llms-openai<1.0.0,>=0.3.0; extra == "kg"
Provides-Extra: pdf
Requires-Dist: azure-ai-documentintelligence<2.0.0,>=1.0.0b3; extra == "pdf"
Requires-Dist: jpype1<2.0.0,>=1.5.0; extra == "pdf"
Requires-Dist: pdfminer-six<20250000,>=20231228; extra == "pdf"
Requires-Dist: pdfplumber<1.0.0,>=0.11.4; extra == "pdf"
Requires-Dist: pdfservices-sdk<5.0.0,>=4.0.0; extra == "pdf"
Requires-Dist: pymupdf<2.0.0,>=1.24.10; extra == "pdf"
Requires-Dist: tabula-py<3.0.0,>=2.9.3; extra == "pdf"
Provides-Extra: pii
Requires-Dist: langdetect<2.0.0,>=1.0.0; extra == "pii"
Requires-Dist: torch<3.0.0,>=2.0.0; extra == "pii"
Provides-Extra: pptx
Requires-Dist: python-pptx<2.0.0,>=1.0.2; extra == "pptx"
Provides-Extra: video
Requires-Dist: PyGObject==3.50.0; sys_platform != "win32" and extra == "video"
Requires-Dist: numpy<2.0.0,>=1.26.0; extra == "video"
Requires-Dist: scipy<2.0.0,>=1.15.0; extra == "video"
Requires-Dist: soundfile<0.14.0,>=0.13.1; extra == "video"
Provides-Extra: xlsx
Requires-Dist: openpyxl<4.0.0,>=3.0.10; extra == "xlsx"

# GLLM Docproc

## Description
A library for orchestrating document processing, typically in Gen AI applications (but not limited to Gen AI).

---

## Installation

### Prerequisites

Mandatory:
1. Python 3.11 or 3.12 (the package requires `>=3.11,<3.13`) — [Install here](https://www.python.org/downloads/)
2. pip — [Install here](https://pip.pypa.io/en/stable/installation/)
3. uv — [Install here](https://docs.astral.sh/uv/getting-started/installation/)
4. gcloud CLI (for authentication) — [Install here](https://cloud.google.com/sdk/docs/install), then log in using:
   ```bash
   gcloud auth login
   ```
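
The prerequisites above can be verified before installing. The following is a minimal sketch, not part of the project: `have_cmd` and `check_prereqs` are hypothetical helper names, and the check only confirms that each command is on `PATH`, not that its version is suitable.

```bash
# Hypothetical helper: succeed if the given command is available on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Hypothetical helper: print one line per missing command and
# exit non-zero if at least one is missing.
# The body runs in a subshell, so its variables do not leak to the caller.
check_prereqs() (
  missing=0
  for cmd in "$@"; do
    if ! have_cmd "$cmd"; then
      echo "missing: $cmd"
      missing=1
    fi
  done
  exit "$missing"
)
```

For example, `check_prereqs python3 pip uv gcloud` prints nothing and succeeds when all four tools are installed.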

---

### Install from Artifact Registry

This requires authentication via the `gcloud` CLI.

1. Export an access token
```bash
export GCLOUD_ACCESS_TOKEN="$(gcloud auth print-access-token)"
```

2. Configure the index in your `pyproject.toml`
```toml
[[tool.uv.index]]
name = "gen-ai-internal"
url = "https://oauth2accesstoken:${GCLOUD_ACCESS_TOKEN}@glsdk.gdplabs.id/gen-ai-internal/simple/"
```

3. Add the dependency
```bash
uv add gllm-docproc
```
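
The package metadata declares optional extras (for example `pdf`, `docx`, `pptx`, `xlsx`, `html`, `image`, `audio`, `video`, `kg`, `pii`), so format-specific dependencies are only installed on request. The sketch below shows one way to build the requirement string; `req_with_extras` is a hypothetical helper, and the package name follows the `uv add` step above.

```bash
# Hypothetical helper: build a requirement such as "pkg[a,b]" from a
# package name followed by zero or more extras.
# The body runs in a subshell, so changing IFS does not affect the caller.
req_with_extras() (
  pkg="$1"
  shift
  if [ "$#" -eq 0 ]; then
    printf '%s\n' "$pkg"
  else
    IFS=,   # "$*" joins the remaining arguments with the first IFS character
    printf '%s[%s]\n' "$pkg" "$*"
  fi
)
```

With this helper, `uv add "$(req_with_extras gllm-docproc pdf docx)"` is equivalent to `uv add "gllm-docproc[pdf,docx]"`.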

---

## Local Development Setup

### Prerequisites

1. Python 3.11 or 3.12 (the package requires `>=3.11,<3.13`) — [Install here](https://www.python.org/downloads/)
2. pip — [Install here](https://pip.pypa.io/en/stable/installation/)
3. uv — [Install here](https://docs.astral.sh/uv/getting-started/installation/)
4. gcloud CLI — [Install here](https://cloud.google.com/sdk/docs/install), then log in using:
   ```bash
   gcloud auth login
   ```
5. Git — [Install here](https://git-scm.com/downloads)
6. Access to the [GDP Labs SDK GitHub repository](https://github.com/GDP-ADMIN/gl-sdk)
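
Because the package metadata pins `Requires-Python` to `>=3.11,<3.13`, it can help to confirm the interpreter version before running the setup. This is a minimal sketch under that assumption; `python_version_supported` is a hypothetical helper that takes a `major.minor[.patch]` string.

```bash
# Hypothetical helper: succeed if the version string satisfies >=3.11,<3.13.
# The body runs in a subshell, so its variables do not leak to the caller.
python_version_supported() (
  major="${1%%.*}"      # text before the first dot
  rest="${1#*.}"        # text after the first dot
  minor="${rest%%.*}"   # minor version, with any patch level dropped
  [ "$major" -eq 3 ] && [ "$minor" -ge 11 ] && [ "$minor" -lt 13 ]
)
```

A typical call is `python_version_supported "$(python3 -c 'import platform; print(platform.python_version())')"`, which succeeds on Python 3.11 and 3.12 and fails on other versions.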

---

### 1. Clone Repository

```bash
git clone git@github.com:GDP-ADMIN/gl-sdk.git
cd gl-sdk/libs/gllm-docproc
```

---

### 2. Setup Authentication

Set the following environment variables to authenticate with internal package indexes:

```bash
export UV_INDEX_GEN_AI_INTERNAL_USERNAME=oauth2accesstoken
export UV_INDEX_GEN_AI_INTERNAL_PASSWORD="$(gcloud auth print-access-token)"
export UV_INDEX_GEN_AI_USERNAME=oauth2accesstoken
export UV_INDEX_GEN_AI_PASSWORD="$(gcloud auth print-access-token)"
```
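
A missing or empty credential usually shows up later as an authentication error during dependency resolution. The sketch below checks the four variables up front; `check_uv_index_auth` is a hypothetical helper, not part of the project's Makefile. Access tokens printed by `gcloud` are short-lived (typically about an hour), so the exports may need to be repeated in long sessions.

```bash
# Hypothetical helper: report any of the four index credentials that is
# unset or empty, and exit non-zero if at least one is.
# The body runs in a subshell, so its variables do not leak to the caller.
check_uv_index_auth() (
  missing=0
  for var in \
    UV_INDEX_GEN_AI_INTERNAL_USERNAME UV_INDEX_GEN_AI_INTERNAL_PASSWORD \
    UV_INDEX_GEN_AI_USERNAME UV_INDEX_GEN_AI_PASSWORD; do
    # Look up the variable named by $var (names are fixed above, so eval is safe).
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      echo "unset or empty: $var" >&2
      missing=1
    fi
  done
  exit "$missing"
)
```

Running `check_uv_index_auth && make setup` then skips the setup when a credential is missing.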

---

### 3. Quick Setup

Run:

```bash
make setup
```

---

### 4. Activate Virtual Environment

```bash
source .venv/bin/activate
```

---

## Local Development Utilities

The following Makefile commands are available for quick operations:

### Install uv

```bash
make install-uv
```

### Install Pre-Commit

```bash
make install-pre-commit
```

### Install Dependencies

```bash
make install
```

### Update Dependencies

```bash
make update
```

### Run Tests

```bash
make test
```

---

## Contributing

Please refer to the [Python Style Guide](https://docs.google.com/document/d/1uRggCrHnVfDPBnG641FyQBwUwLoFw0kTzNqRm92vUwM/edit?usp=sharing)
for information about code style, documentation standards, and SCA requirements.
