Metadata-Version: 2.4
Name: docvec-cli
Version: 0.1.0
Summary: DocVec CLI is a powerful command-line tool designed to transform your unstructured local documents (PDF, DOCX, TXT) into query-ready vector embeddings, making them instantly usable for Large Language Models (LLMs) and bolstering Retrieval Augmented Generation (RAG) workflows.
Home-page: https://github.com/onurbaran/docvec-cli
Author: Onur Baran
Author-email: baranonur@gmail.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing
Classifier: Topic :: Utilities
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: annotated-types==0.7.0
Requires-Dist: anyio==4.9.0
Requires-Dist: asgiref==3.8.1
Requires-Dist: async-timeout==4.0.3
Requires-Dist: attrs==25.3.0
Requires-Dist: backoff==2.2.1
Requires-Dist: bcrypt==4.3.0
Requires-Dist: beautifulsoup4==4.13.4
Requires-Dist: build==1.2.2.post1
Requires-Dist: cachetools==5.5.2
Requires-Dist: certifi==2025.4.26
Requires-Dist: charset-normalizer==3.4.2
Requires-Dist: chromadb==1.0.10
Requires-Dist: click==8.1.8
Requires-Dist: coloredlogs==15.0.1
Requires-Dist: Deprecated==1.2.18
Requires-Dist: distro==1.9.0
Requires-Dist: durationpy==0.10
Requires-Dist: exceptiongroup==1.3.0
Requires-Dist: fastapi==0.115.9
Requires-Dist: filelock==3.18.0
Requires-Dist: flatbuffers==25.2.10
Requires-Dist: fsspec==2025.5.1
Requires-Dist: google-auth==2.40.2
Requires-Dist: googleapis-common-protos==1.70.0
Requires-Dist: grpcio==1.71.0
Requires-Dist: h11==0.16.0
Requires-Dist: hf-xet==1.1.2
Requires-Dist: httpcore==1.0.9
Requires-Dist: httptools==0.6.4
Requires-Dist: httpx==0.28.1
Requires-Dist: huggingface-hub==0.32.0
Requires-Dist: humanfriendly==10.0
Requires-Dist: idna==3.10
Requires-Dist: importlib_metadata==8.6.1
Requires-Dist: importlib_resources==6.5.2
Requires-Dist: Jinja2==3.1.6
Requires-Dist: joblib==1.5.1
Requires-Dist: jsonpatch==1.33
Requires-Dist: jsonpointer==3.0.0
Requires-Dist: jsonschema==4.23.0
Requires-Dist: jsonschema-specifications==2025.4.1
Requires-Dist: kubernetes==32.0.1
Requires-Dist: langchain==0.3.25
Requires-Dist: langchain-core==0.3.61
Requires-Dist: langchain-text-splitters==0.3.8
Requires-Dist: langsmith==0.3.42
Requires-Dist: lxml==5.4.0
Requires-Dist: markdown-it-py==3.0.0
Requires-Dist: MarkupSafe==3.0.2
Requires-Dist: mdurl==0.1.2
Requires-Dist: mmh3==5.1.0
Requires-Dist: mpmath==1.3.0
Requires-Dist: networkx==3.2.1
Requires-Dist: numpy==2.0.2
Requires-Dist: oauthlib==3.2.2
Requires-Dist: onnxruntime==1.19.2
Requires-Dist: opentelemetry-api==1.33.1
Requires-Dist: opentelemetry-exporter-otlp-proto-common==1.33.1
Requires-Dist: opentelemetry-exporter-otlp-proto-grpc==1.33.1
Requires-Dist: opentelemetry-instrumentation==0.54b1
Requires-Dist: opentelemetry-instrumentation-asgi==0.54b1
Requires-Dist: opentelemetry-instrumentation-fastapi==0.54b1
Requires-Dist: opentelemetry-proto==1.33.1
Requires-Dist: opentelemetry-sdk==1.33.1
Requires-Dist: opentelemetry-semantic-conventions==0.54b1
Requires-Dist: opentelemetry-util-http==0.54b1
Requires-Dist: orjson==3.10.18
Requires-Dist: overrides==7.7.0
Requires-Dist: packaging==24.2
Requires-Dist: pillow==11.2.1
Requires-Dist: posthog==4.2.0
Requires-Dist: protobuf==5.29.4
Requires-Dist: pyasn1==0.6.1
Requires-Dist: pyasn1_modules==0.4.2
Requires-Dist: pydantic==2.11.5
Requires-Dist: pydantic_core==2.33.2
Requires-Dist: Pygments==2.19.1
Requires-Dist: pypdf==5.5.0
Requires-Dist: PyPika==0.48.9
Requires-Dist: pyproject_hooks==1.2.0
Requires-Dist: python-dateutil==2.9.0.post0
Requires-Dist: python-docx==1.1.2
Requires-Dist: python-dotenv==1.1.0
Requires-Dist: PyYAML==6.0.2
Requires-Dist: referencing==0.36.2
Requires-Dist: regex==2024.11.6
Requires-Dist: requests==2.32.3
Requires-Dist: requests-oauthlib==2.0.0
Requires-Dist: requests-toolbelt==1.0.0
Requires-Dist: rich==14.0.0
Requires-Dist: rpds-py==0.25.1
Requires-Dist: rsa==4.9.1
Requires-Dist: safetensors==0.5.3
Requires-Dist: scikit-learn==1.6.1
Requires-Dist: scipy==1.13.1
Requires-Dist: sentence-transformers==4.1.0
Requires-Dist: shellingham==1.5.4
Requires-Dist: six==1.17.0
Requires-Dist: sniffio==1.3.1
Requires-Dist: soupsieve==2.7
Requires-Dist: SQLAlchemy==2.0.41
Requires-Dist: starlette==0.45.3
Requires-Dist: sympy==1.14.0
Requires-Dist: tenacity==9.1.2
Requires-Dist: threadpoolctl==3.6.0
Requires-Dist: tokenizers==0.21.1
Requires-Dist: tomli==2.2.1
Requires-Dist: torch==2.7.0
Requires-Dist: tqdm==4.67.1
Requires-Dist: transformers==4.52.3
Requires-Dist: typer==0.15.4
Requires-Dist: typing-inspection==0.4.1
Requires-Dist: typing_extensions==4.13.2
Requires-Dist: urllib3==2.4.0
Requires-Dist: uvicorn==0.34.2
Requires-Dist: uvloop==0.21.0
Requires-Dist: watchfiles==1.0.5
Requires-Dist: websocket-client==1.8.0
Requires-Dist: websockets==15.0.1
Requires-Dist: wrapt==1.17.2
Requires-Dist: zipp==3.21.0
Requires-Dist: zstandard==0.23.0
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# DocVec CLI

🚀 **Overview**  
DocVec CLI is a powerful command-line tool designed to transform your unstructured local documents (PDF, DOCX, TXT) into query-ready vector embeddings, making them instantly usable for Large Language Models (LLMs) and bolstering Retrieval Augmented Generation (RAG) workflows.

---

## ✨ Key Features

- **Multi-Format Support**: Processes `.pdf`, `.docx`, and `.txt` files.
- **Automatic Text Extraction**: Efficiently extracts raw text content from various document types.
- **Intelligent Text Cleaning**: Removes unnecessary whitespace, excessive newlines, and basic HTML tags.
- **Configurable Text Chunking**: Uses `langchain`'s `RecursiveCharacterTextSplitter`, with customizable `chunk_size` and `chunk_overlap`.
- **Offline Embedding Generation**: Uses local `sentence-transformers` models (default: `all-MiniLM-L6-v2`) to create high-quality vector embeddings directly on your machine, ensuring privacy and offline capabilities.
- **ChromaDB-Compatible Output**: Generates JSON files structured for easy ingestion into ChromaDB or other vector databases.
- **User-Friendly CLI**: Simple command-line arguments for input/output paths and processing parameters.
- **Progress Indicators**: Visual progress bars for long-running operations like embedding generation.
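
As a rough illustration of the text-cleaning step listed above, the following minimal sketch strips simple HTML tags and normalises whitespace. The function name `clean_text` and the exact regular expressions are assumptions made for this example; the project's actual cleaning rules may differ.

```python
import re


def clean_text(raw: str) -> str:
    """Illustrative cleaner: not the project's actual implementation."""
    text = re.sub(r"<[^>]+>", " ", raw)      # replace simple HTML tags with a space
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces and tabs
    text = re.sub(r" *\n *", "\n", text)     # trim spaces around each newline
    text = re.sub(r"\n{3,}", "\n\n", text)   # keep at most one blank line in a row
    return text.strip()


cleaned = clean_text("<p>Hello   world</p>\n\n\n\nNext  line ")
print(repr(cleaned))
```

A regex-based tag stripper like this only handles basic markup; heavily nested or malformed HTML would need a real parser.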

---

## 📦 Installation

### Prerequisites
- Python 3.8 or newer

### Steps

#### 1. Clone the repository:
```bash
git clone https://github.com/onurbaran/docvec-cli.git
cd docvec-cli
```

#### 2. Create and activate a virtual environment:
It’s highly recommended to use a virtual environment to manage dependencies.

```bash
python -m venv .venv

# On Windows:
.\.venv\Scripts\activate

# On macOS/Linux:
source ./.venv/bin/activate
```

#### 3. Install dependencies:

Ensure your `requirements.txt` contains:
```
pypdf
python-docx
sentence-transformers
langchain-text-splitters
tqdm
numpy
```

Then run:
```bash
pip install -r requirements.txt
```

---

## 🚀 Usage

Once the dependencies are installed, run the tool from the repository root with `python src/main.py`.

### Basic Command Structure
```bash
python src/main.py --input-path <path_to_document_or_directory> --output-path <path_to_output_directory> [OPTIONS]
```

### Required Arguments
- `--input-path <path>`: Path to a document file (e.g., `report.pdf`). Directory input is not processed yet; it is planned for a future update.
- `--output-path <path>`: Path to the directory where the generated vector and metadata files will be saved.

### Optional Arguments
- `--chunk-size <int>`: Max size of each text chunk in characters (default: `1000`)
- `--chunk-overlap <int>`: Number of characters to overlap between chunks (default: `200`)
- `--model-name <str>`: Sentence-transformers model name (default: `all-MiniLM-L6-v2`)
- `--output-format <str>`: Format for output files (default: `json`, only format currently supported)
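
To show how `--chunk-size` and `--chunk-overlap` interact, here is a simplified fixed-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break at separators such as paragraphs and sentences); the helper `chunk_text` is a hypothetical stand-in that only demonstrates the size and overlap arithmetic.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in the range [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# With the default values, a 2500-character text is covered by three windows
# starting at offsets 0, 800, and 1600.
print([len(c) for c in chunk_text("x" * 2500)])
```

A larger overlap preserves more context across chunk boundaries at the cost of producing more chunks, and therefore more embeddings, per document.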

---

## 📁 Examples

### Process a single PDF file:
```bash
python src/main.py --input-path "docs/my_report.pdf" --output-path "vectors/"
```

### Process a DOCX file with custom chunking:
```bash
python src/main.py --input-path "articles/research.docx" --output-path "embeddings/" --chunk-size 500 --chunk-overlap 100
```

### Process a TXT file with a different embedding model:
```bash
python src/main.py --input-path "notes/daily_journal.txt" --output-path "processed_data/" --model-name "all-MiniLM-L12-v2"
```
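
Because the tool takes one file per invocation, a small wrapper script is one way to process a whole folder. The sketch below is an assumption-laden example, not part of the project: the helper names `build_commands` and `run_all` are hypothetical, and it only uses the `--input-path` and `--output-path` flags documented above.

```python
import subprocess
import sys
from pathlib import Path
from typing import List

# File types the README lists as supported.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".txt"}


def build_commands(input_dir: str, output_dir: str) -> List[List[str]]:
    """Build one `python src/main.py` command per supported file in input_dir."""
    commands = []
    for path in sorted(Path(input_dir).iterdir()):
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES:
            commands.append([
                sys.executable, "src/main.py",
                "--input-path", str(path),
                "--output-path", output_dir,
            ])
    return commands


def run_all(input_dir: str, output_dir: str) -> None:
    """Run the commands one after another, stopping at the first failure."""
    for command in build_commands(input_dir, output_dir):
        subprocess.run(command, check=True)


# Example (run from the repository root):
#   run_all("docs", "vectors/")
```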

---

## 📄 Output File Structure

For each processed document (e.g., `my_report.pdf`), a JSON file (`my_report_vectors.json`) will be created in the specified `--output-path`.

Example content:
```json
[
  {
    "id": "my_report-0",
    "document": "This is the text content of the first chunk...",
    "embedding": [0.123, -0.456, ..., 0.789],
    "metadata": {
      "source_file": "my_report.pdf",
      "chunk_index": 0,
      "chunk_size": 250
    }
  }
]
```
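
One possible way to consume this file is to regroup the records into the parallel lists that ChromaDB's `collection.add` accepts (`ids`, `documents`, `embeddings`, `metadatas`). The sketch below assumes the record layout shown above; the helper names `to_chroma_columns` and `load_vectors`, the storage path, and the collection name are hypothetical.

```python
import json
from typing import Any, Dict, List


def to_chroma_columns(records: List[Dict[str, Any]]) -> Dict[str, list]:
    """Turn a list of per-chunk records into column-oriented lists."""
    return {
        "ids": [r["id"] for r in records],
        "documents": [r["document"] for r in records],
        "embeddings": [r["embedding"] for r in records],
        "metadatas": [r["metadata"] for r in records],
    }


def load_vectors(path: str) -> Dict[str, list]:
    """Read a *_vectors.json file and return its contents as columns."""
    with open(path, encoding="utf-8") as fh:
        return to_chroma_columns(json.load(fh))


# With chromadb installed, the columns could then be ingested, for example:
#   import chromadb
#   client = chromadb.PersistentClient(path="chroma_store")   # hypothetical path
#   collection = client.get_or_create_collection("docs")      # hypothetical name
#   collection.add(**load_vectors("vectors/my_report_vectors.json"))
```

Queries against the collection should use embeddings from the same model that produced the stored vectors, otherwise similarity scores are not meaningful.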

---

## 🤝 Contributing

We welcome contributions from the community! To contribute:

1. Fork the repository.
2. Create a new branch: `git checkout -b feature/your-feature-name`
3. Make your changes.
4. Write clear, concise commit messages.
5. Push your branch: `git push origin feature/your-feature-name`
6. Open a Pull Request.

Please ensure:
- Your code follows [PEP 8](https://peps.python.org/pep-0008/)
- You include appropriate tests.

---

## 📄 License

This project is licensed under the MIT License.

---

## 📧 Contact

For questions, feedback, or issues, please open an issue on the [GitHub repository](https://github.com/onurbaran/docvec-cli).
