Metadata-Version: 2.4
Name: hecvec
Version: 6.0.0
Summary: List directories (safe root), filter .txt/.md files, read as text, chunk, embed, and push to Chroma.
License-Expression: MIT
Keywords: chunking,document-pipeline,listdir,text-files
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: <3.14,>=3.9
Requires-Dist: chromadb>=0.4.0
Requires-Dist: langchain-text-splitters>=0.2.0
Requires-Dist: openai>=1.0.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: tiktoken>=0.5.0
Requires-Dist: typing-extensions>=4.0.0
Provides-Extra: chroma
Requires-Dist: chromadb>=0.4.0; extra == 'chroma'
Requires-Dist: langchain-text-splitters>=0.2.0; extra == 'chroma'
Requires-Dist: openai>=1.0.0; extra == 'chroma'
Requires-Dist: python-dotenv>=1.0.0; extra == 'chroma'
Requires-Dist: tiktoken>=0.5.0; extra == 'chroma'
Requires-Dist: typing-extensions>=4.0.0; extra == 'chroma'
Provides-Extra: chunk
Requires-Dist: langchain-text-splitters>=0.2.0; extra == 'chunk'
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: twine>=6.0.0; extra == 'dev'
Description-Content-Type: text/markdown

# HecVec

**HecVec** is a Python library that discovers `.txt` and `.md` files, chunks them (token, text, semantic, or LLM-based), embeds with OpenAI, and stores vectors in Chroma. It is **library-only** — no HTTP API. All work runs in-process.

---

## Table of contents

- [Install](#install)
- [Requirements to run the pipeline](#requirements-to-run-the-pipeline)
- [Workflow](#workflow)
- [Quick start](#quick-start)
- [Parameters](#parameters)
- [API reference: methods and parameters](#api-reference-methods-and-parameters)
- [Chunking methods](#chunking-methods)
- [Chroma server](#chroma-server)
- [Environment and API key](#environment-and-api-key)
- [Collection naming](#collection-naming)
- [Building blocks](#building-blocks)
- [Module layout](#module-layout)
- [Development](#development)
- [License](#license)

---

## Install

**Full pipeline** (list → read → chunk → embed → Chroma):

```bash
pip install hecvec
```

A **Chroma server** must be running; the pipeline connects only to that server (see [Chroma server](#chroma-server)). There is no local/ephemeral mode.

---

## Requirements to run the pipeline

To use the full `Slicer.slice(...)` pipeline you need:

1. **Python** 3.9–3.13.
2. **Dependencies** installed via `pip install hecvec`.
3. **OpenAI API key** for embeddings (and for `semantic` / `llm` chunking). Set `OPENAI_API_KEY` in the environment or in a `.env` file (see [Environment and API key](#environment-and-api-key)).
4. **Chroma server** — A Chroma server must be listening at `host:port` (default `localhost:8000`). If nothing is listening, the pipeline raises `RuntimeError`. Start one, e.g., with `docker run -p 8000:8000 chromadb/chroma`.

---

## Workflow

The main entry point is `Slicer.slice(path=..., **kwargs)`. It runs the following steps:

| Step | Description |
|------|-------------|
| **0** | Resolve path, resolve collection name (`base_name` + `_` + `chunking_method`). |
| **1** | Discover files: single `.txt`/`.md` file or recursive list under a directory. |
| **2** | Read file contents as text (UTF-8 with fallbacks). |
| **3** | Chunk using the chosen method (`token`, `text`, `semantic`, or `llm`). |
| **4** | Generate embeddings with OpenAI. |
| **5** | Connect to Chroma; if the collection **already exists**, skip adding (no duplicate docs). Otherwise create the collection and add documents. |

Progress is logged as `[0/5]` … `[5/5]` with timings. If the collection already exists, the log states that clearly and no new documents are added.

---

## Quick start

```python
import hecvec

# Default: token chunking, Chroma at localhost:8000
result = hecvec.Slicer.slice(path="/path/to/folder_or_file")
# → {"files": N, "chunks": M, "collection": "folder_or_file_name_token"}

# Custom host/port and semantic chunking
result = hecvec.Slicer.slice(
    path="/path/to/docs",
    host="localhost",
    port=8000,
    chunking_method="semantic",
)
```

Or use an instance:

```python
slicer = hecvec.Slicer(
    host="chroma",  # e.g. Docker service name
    port=8000,
    chunking_method="token",
)
result = slicer.slice(path="/data/myfile.txt")
```

**Run the test script** (from repo root, with Chroma running and `OPENAI_API_KEY` set):

```bash
# Terminal 1: start Chroma
docker run -p 8000:8000 chromadb/chroma

# Terminal 2: run pipeline
uv run python scripts/test_slice.py
# Or with a path:
uv run python scripts/test_slice.py /path/to/file_or_folder
```

---

## Parameters

All of these can be passed to `Slicer(...)` or to `Slicer.slice(..., key=value)`.

| Parameter | Default | Description |
|-----------|---------|-------------|
| `path` | *(required)* | File or directory to process (`.txt`/`.md` only). |
| `root` | `path.parent` (file) or `path` (dir) | Safe root for resolving paths (used when listing under a directory). |
| `collection_name` | `"hecvec"` | Base name for the Chroma collection. If `"hecvec"`, it is replaced by the file stem or directory name; the final name is always `{collection_name}_{chunking_method}` (e.g. `mydoc_token`). |
| `db` | `"chroma"` | Database to use. Only `"chroma"` is supported. When `db="chroma"`, connection uses `host` and `port`. |
| `host` | `"localhost"` | Server host (used when `db="chroma"`). Server must be listening. |
| `port` | `8000` | Server port (used when `db="chroma"`). |
| `auth` | `None` | Optional Basic Auth credentials for `db="chroma"` as `"username:password"` (matches Chroma Basic Auth). |
| `chunking_method` | `"token"` | Chunking strategy: `"token"` \| `"text"` \| `"semantic"` \| `"llm"`. See [Chunking methods](#chunking-methods). |
| `chunk_size` | `200` | Target chunk size (tokens for `token`, characters for `text`; also used by `llm`). |
| `chunk_overlap` | `0` | Overlap between consecutive chunks. |
| `encoding_name` | `"cl100k_base"` | Tiktoken encoding for token chunking. |
| `embedding_model` | `"text-embedding-3-small"` | OpenAI embedding model. |
| `batch_size` | `100` | Batch size for embedding API calls. |
| `openai_api_key` | from env / `.env` | OpenAI API key. Overrides env if provided. |
| `dotenv_path` | `None` | Path to `.env` file for loading `OPENAI_API_KEY`. |

---

## API reference: methods and parameters

Public methods and functions with their parameters. All are available from `import hecvec` unless a submodule is noted.

### Pipeline

**`Slicer(root=None, collection_name="hecvec", db="chroma", host="localhost", port=8000, auth=None, embedding_model="text-embedding-3-small", chunk_size=200, chunk_overlap=0, encoding_name="cl100k_base", batch_size=100, openai_api_key=None, dotenv_path=None)`**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `root` | `str` \| `Path` \| `None` | `None` (→ cwd) | Safe root for path resolution. |
| `collection_name` | `str` | `"hecvec"` | Base collection name; see [Collection naming](#collection-naming). |
| `db` | `DbType` | `"chroma"` | Database backend. Only `"chroma"` is supported. |
| `host` | `str` | `"localhost"` | Server host (when `db="chroma"`). |
| `port` | `int` | `8000` | Server port (when `db="chroma"`). |
| `auth` | `str` \| `None` | `None` | Optional Basic Auth credentials for Chroma as `"username:password"`. |
| `embedding_model` | `str` | `"text-embedding-3-small"` | OpenAI embedding model. |
| `chunk_size` | `int` | `200` | Chunk size (tokens or chars by method). |
| `chunk_overlap` | `int` | `0` | Overlap between chunks. |
| `encoding_name` | `str` | `"cl100k_base"` | Tiktoken encoding. |
| `batch_size` | `int` | `100` | Embedding batch size. |
| `openai_api_key` | `str` \| `None` | `None` | OpenAI key (else env / `.env`). |
| `dotenv_path` | `str` \| `Path` \| `None` | `None` | Path to `.env` file. |

**`Slicer.slice(path, *, root=None, collection_name="hecvec", db="chroma", host="localhost", port=8000, auth=None, embedding_model="text-embedding-3-small", chunk_size=200, chunk_overlap=0, encoding_name="cl100k_base", batch_size=100, chunking_method="token", openai_api_key=None, dotenv_path=None)`**

Same parameters as above, plus:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `path` | `str` \| `Path` | *(required)* | File or directory to process (`.txt`/`.md`). |

Returns: `dict` with `files`, `chunks`, `collection`, and optionally `message` (e.g. when collection already exists).

---

### Listing and reading

**`ListDir(root)`**

| Parameter | Type | Description |
|-----------|------|-------------|
| `root` | `str` \| `Path` | Root directory; all listed paths are under this. |

**`ListDir.listdir(path=".")`** → `list[str]`  
List one level under `path` (relative to root). Returns sorted relative path strings (dirs first, then files).

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `path` | `str` \| `Path` | `"."` | Path under root. |

**`ListDir.listdir_recursive(path=".", max_depth=None)`** → `list[str]`  
List all entries under `path` recursively.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `path` | `str` \| `Path` | `"."` | Path under root. |
| `max_depth` | `int` \| `None` | `None` | Max depth; `None` = unlimited. |

**`ListDirTextFiles(root, allowed_extensions=(".txt", ".md"))`**  
Subclass of `ListDir` that filters to `.txt`/`.md` only.

**`ListDirTextFiles.filter_txt_md(relative_paths)`** → `list[Path]`  
From relative path strings, return full paths of files with allowed extensions.

**`ListDirTextFiles.listdir_txt_md(path=".")`** → `list[Path]`  
One-level list of `.txt`/`.md` files under `path`.

**`ListDirTextFiles.listdir_recursive_txt_md(path=".", max_depth=None)`** → `list[Path]`  
Recursive list of `.txt`/`.md` files under `path`.
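
The recursive `.txt`/`.md` listing above can be sketched in plain `pathlib` (the function name `list_txt_md_recursive` is hypothetical and not part of the hecvec API; the real classes also resolve paths against a safe root):

```python
from pathlib import Path

def list_txt_md_recursive(root, allowed=(".txt", ".md")):
    """Sketch of the recursive listing idea: walk `root`, keep only
    files whose extension is in `allowed`, and return them sorted."""
    root = Path(root).resolve()
    hits = [p for p in root.rglob("*")
            if p.is_file() and p.suffix.lower() in allowed]
    return sorted(hits)
```

Sorting gives the deterministic ordering that the real `listdir*` methods promise; `rglob("*")` alone yields entries in filesystem order.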

**`ReadText(paths, encoding="utf-8")`**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `paths` | `list[str]` \| `list[Path]` | — | File paths to read. |
| `encoding` | `str` | `"utf-8"` | Preferred encoding; fallbacks are latin-1, cp1252. |

**`ReadText.read_all()`** → `list[tuple[Path, str]]`  
Read all files; returns `(path, text)` pairs. Skips non-files and unreadable paths.

**`ReadText`** is iterable: `for path, text in reader:` yields `(path, text)`.
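
The encoding fallback behavior can be sketched as follows (the function `read_text_with_fallback` is illustrative, not the library's implementation; note that latin-1 decodes any byte sequence, so it effectively acts as the catch-all):

```python
from pathlib import Path

def read_text_with_fallback(path, encodings=("utf-8", "latin-1", "cp1252")):
    """Sketch of the fallback-read idea: try each encoding in order
    and return the first successful decode."""
    data = Path(path).read_bytes()
    for enc in encodings:
        try:
            return data.decode(enc)
        except UnicodeDecodeError:
            continue
    raise UnicodeDecodeError("all", data, 0, len(data), "no encoding matched")
```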

---

### Chunking

**`chunk_text(text, chunk_size=400, chunk_overlap=0, separators=None)`** → `list[str]`  
Single-document recursive character split. Requires `hecvec[chunk]`.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `text` | `str` | — | Document text. |
| `chunk_size` | `int` | `400` | Max characters per chunk. |
| `chunk_overlap` | `int` | `0` | Overlap. |
| `separators` | `list[str]` \| `None` | `None` | Split order; default `["\n\n\n", "\n\n", "\n", ". ", " ", ""]`. |
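
The recursive character split can be sketched in pure Python. This is a simplified stand-in, not the langchain-text-splitters implementation (which also handles overlap and separator retention): split on the first separator present, recurse into oversized pieces, then greedily merge neighbors back up to `chunk_size`.

```python
def recursive_split(text, chunk_size=400, separators=None):
    """Simplified sketch of recursive character splitting."""
    seps = list(separators or ["\n\n\n", "\n\n", "\n", ". ", " ", ""])
    if len(text) <= chunk_size:
        return [text] if text else []
    sep = next((s for s in seps if s and s in text), "")
    if sep == "":
        # No separator applies: hard-cut into chunk_size slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    pieces = []
    for part in text.split(sep):
        pieces.extend(recursive_split(part, chunk_size, seps))
    # Greedily merge small pieces back together, re-inserting the separator.
    merged, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                merged.append(current)
            current = piece
    if current:
        merged.append(current)
    return merged
```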

**`chunk_documents(path_and_texts, chunk_size=400, chunk_overlap=0, separators=None)`** → `list[dict]`  
Multiple documents, recursive character split. Each dict: `{"path", "chunk_index", "content"}`. Requires `hecvec[chunk]`.

**`token_chunk_text(text, chunk_size=200, chunk_overlap=0, encoding_name="cl100k_base")`** → `list[str]`  
Single-document token split (tiktoken).

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `text` | `str` | — | Document text. |
| `chunk_size` | `int` | `200` | Max tokens per chunk. |
| `chunk_overlap` | `int` | `0` | Overlap. |
| `encoding_name` | `str` | `"cl100k_base"` | Tiktoken encoding. |

**`token_chunk_documents(path_and_texts, chunk_size=200, chunk_overlap=0, encoding_name="cl100k_base")`** → `tuple[list[str], list[str]]`  
Multiple documents, token split. Returns `(ids, documents)` with ids like `chunk_0`, `chunk_1`, ...

**`chunk_documents_by_method(path_and_texts, method="token", *, chunk_size=200, chunk_overlap=0, encoding_name="cl100k_base", separators=None, openai_api_key=None, semantic_max_chunk_size=400, semantic_min_chunk_size=50, llm_model="gpt-4o-mini")`** → `tuple[list[str], list[str]]`  
Chunk by `method`: `"token"` \| `"text"` \| `"semantic"` \| `"llm"`. Returns `(ids, documents)`.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `path_and_texts` | `list[tuple[Path, str]]` | — | `(path, text)` pairs. |
| `method` | `ChunkingMethod` | `"token"` | `"token"` \| `"text"` \| `"semantic"` \| `"llm"`. |
| `chunk_size` | `int` | `200` | Used by token, text, llm. |
| `chunk_overlap` | `int` | `0` | Used by token, text. |
| `encoding_name` | `str` | `"cl100k_base"` | Token method. |
| `separators` | `list[str]` \| `None` | `None` | Text method only. |
| `openai_api_key` | `str` \| `None` | `None` | Required for semantic/llm. |
| `semantic_max_chunk_size` | `int` | `400` | Semantic method. |
| `semantic_min_chunk_size` | `int` | `50` | Semantic method. |
| `llm_model` | `str` | `"gpt-4o-mini"` | LLM method. |

---

### Embeddings and Chroma

**`embed_texts(texts, api_key, model="text-embedding-3-small", batch_size=100)`** → `list[list[float]]`  
OpenAI embeddings for a list of strings.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `texts` | `list[str]` | — | Texts to embed. |
| `api_key` | `str` | — | OpenAI API key. |
| `model` | `str` | `"text-embedding-3-small"` | Embedding model. |
| `batch_size` | `int` | `100` | Request batch size. |
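
The `batch_size` behavior can be sketched as follows. Here `embed_batch` is a stand-in callable for the OpenAI embeddings request (this helper is illustrative, not the hecvec implementation):

```python
def embed_in_batches(texts, embed_batch, batch_size=100):
    """Sketch of the batching idea: call `embed_batch` once per slice
    of up to `batch_size` texts and concatenate vectors in input order."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(embed_batch(texts[start:start + batch_size]))
    return vectors
```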

**`get_client(host="localhost", port=8000, auth=None)`** → Chroma `HttpClient`  
Connects to the Chroma server at host:port. Raises if nothing is listening. If `auth` is provided, it is `"username:password"` for Chroma Basic Auth.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `host` | `str` | `"localhost"` | Chroma server host. |
| `port` | `int` | `8000` | Chroma server port. |
| `auth` | `str` \| `None` | `None` | Chroma Basic Auth credentials as `"username:password"`. |

**`get_or_create_collection(client, name, metadata=None)`**  
Get or create a Chroma collection (default cosine similarity). `metadata` default: `{"hnsw:space": "cosine"}`.

**`add_documents(client, collection_name, ids, embeddings, documents)`** → `dict`  
Add documents to a collection. Returns `{"collection_existed": bool}`.

**`list_collections(host="localhost", port=8000)`** → `list[tuple[str, int]]`  
List collection names and document counts on a Chroma server: `[(name, count), ...]`.

---

### Environment

**`load_dotenv_if_available(dotenv_path=None)`**  
Load `.env` into `os.environ` if python-dotenv is installed. No-op otherwise.

**`load_openai_key(dotenv_path=None)`** → `str` \| `None`  
Load `.env` if available, then return `os.environ.get("OPENAI_API_KEY")`.

---

## Chunking methods

| Method | Description | Requires |
|--------|-------------|----------|
| **`token`** | Split by token count (tiktoken, `cl100k_base`). Fast and deterministic. | — |
| **`text`** | Recursive character splitter (paragraph → line → sentence, etc.). | — |
| **`semantic`** | Embed small segments, then group by similarity (DP) into larger chunks. | `OPENAI_API_KEY` |
| **`llm`** | Use an LLM to choose split points for thematic sections. | `OPENAI_API_KEY` |

Use `chunking_method="token"` or `"text"` to avoid API calls during chunking. Use `"semantic"` or `"llm"` for more coherent, topic-aware chunks (at the cost of extra OpenAI usage).
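
The `token` method's windowing can be sketched with whitespace words standing in for tokens (an assumption for illustration only; the real splitter counts BPE tokens from the `cl100k_base` tiktoken encoding):

```python
def split_by_tokens(text, chunk_size=200, chunk_overlap=0):
    """Simplified sketch of token chunking: slide a window of
    `chunk_size` "tokens" with a step of chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
    return chunks
```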

---

## Chroma server

The pipeline connects **only** to a Chroma server. There is no ephemeral or persistent local client; a server must be running at `host:port` (default `localhost:8000`). If nothing is listening, `slice()` and `list_collections()` raise `RuntimeError`.

**Start a server** (e.g. Docker):

```bash
docker run -p 8000:8000 chromadb/chroma
```

**Containers:** If the app runs in a devcontainer and Chroma is in the same Docker Compose stack, use the **service name** as `host` (e.g. `chroma`). If Chroma is on the host and the app in the container, use `host="host.docker.internal"`.

---

## Environment and API key

- **OpenAI:** The pipeline (and `semantic` / `llm` chunking) needs an API key. It is read in this order:
  1. Argument `openai_api_key=...`
  2. Environment variable `OPENAI_API_KEY`
  3. A `.env` file in the current working directory (loaded via `python-dotenv` when you use `hecvec`)

- **`.env`:** Create a `.env` in your project root (or set `dotenv_path` to point to one):

  ```env
  OPENAI_API_KEY=sk-...
  ```

- Do **not** commit `.env` or expose the key in logs or source code.
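
The lookup order can be sketched as follows (the function name `resolve_openai_key` is hypothetical; `.env` loading, the third source, simply populates the environment before this check):

```python
import os

def resolve_openai_key(openai_api_key=None, env=None):
    """Sketch of the key lookup order: an explicit argument wins,
    then the OPENAI_API_KEY environment variable."""
    env = os.environ if env is None else env
    if openai_api_key:
        return openai_api_key
    return env.get("OPENAI_API_KEY")
```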

---

## Collection naming

- If you pass `collection_name="hecvec"` (default), the base name is taken from the input:
  - **Single file:** `path.stem` (e.g. `mydoc`)
  - **Directory:** `path.name` (e.g. `docs`)
- The **final** collection name is always:

  **`{base_name}_{chunking_method}`**

  Examples: `mydoc_token`, `docs_semantic`, `CNSF-S0043-0032-2025_CONDUSEF-005190-08_token`.

- If a collection with that name **already exists**, the pipeline does **not** add documents again. It logs that the collection already exists and returns something like:

  `{"files": N, "chunks": 0, "collection": "...", "message": "Collection already exists; no documents added."}`
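
The naming rule can be sketched as follows. This helper is illustrative, not hecvec API; it uses the file extension to tell files from directories, whereas the real pipeline checks the filesystem:

```python
from pathlib import Path

def final_collection_name(path, collection_name="hecvec", chunking_method="token"):
    """Sketch of the naming rule: the default base name "hecvec" is
    replaced by the file stem (file) or directory name (dir), then the
    chunking method is always appended."""
    p = Path(path)
    if collection_name == "hecvec":
        collection_name = p.stem if p.suffix in (".txt", ".md") else p.name
    return f"{collection_name}_{chunking_method}"
```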

---

## Building blocks

You can use the pipeline step-by-step.

**List and read:**

```python
from pathlib import Path
from hecvec import ListDir, ListDirTextFiles, ReadText

root = Path("/path/to/repo")
lister = ListDir(root=root)
for rel in lister.listdir("."):
    print(rel)

text_lister = ListDirTextFiles(root=root)
paths = text_lister.listdir_recursive_txt_md("docs")
reader = ReadText(paths)
for path, text in reader:
    print(path, len(text))
```

**Chunk only** (e.g. recursive character, with `hecvec[chunk]`):

```python
from pathlib import Path

from hecvec import ListDirTextFiles, ReadText
from hecvec.chunking import chunk_documents

root = Path("/path/to/repo")
paths = ListDirTextFiles(root=root).listdir_recursive_txt_md(".")
path_and_text = ReadText(paths).read_all()
chunks = chunk_documents(path_and_text)  # list of {"path", "chunk_index", "content"}
```

**Token chunk + embed + list Chroma collections:**

```python
from hecvec import token_chunk_text, embed_texts, list_collections

chunks = token_chunk_text("Some long document...", chunk_size=200)
vecs = embed_texts(chunks, api_key="sk-...")
names_and_counts = list_collections(host="localhost", port=8000)
```

**CLI** (list directory under a root):

```bash
hecvec-listdir [path] [root]
# or
python -m hecvec.cli [path] [root]
```

---

## Module layout

| Module | Responsibility |
|--------|----------------|
| `hecvec.env` | Load `.env` and `OPENAI_API_KEY` |
| `hecvec.listdir` | List dirs under a safe root; filter `.txt`/`.md` |
| `hecvec.reading` | Read files as text (UTF-8 / latin-1 / cp1252 fallback) |
| `hecvec.token_splitter` | Token-based chunking (tiktoken) |
| `hecvec.chunking` | Recursive character chunking (`chunk_documents`, `chunk_text`) |
| `hecvec.chunkers` | Multi-method chunking: token, text, semantic, llm |
| `hecvec.embeddings` | OpenAI embeddings (`embed_texts`) |
| `hecvec.chroma_client` | Chroma client, get/create collection, add documents |
| `hecvec.chroma_list` | List Chroma collections and counts |
| `hecvec.pipeline` | Orchestrator: `Slicer` and `slice(path=...)` |

---

## Development

From the repo root:

```bash
uv sync
uv run python -c "from hecvec import ListDir; print(ListDir('.').listdir('.'))"
```

---

## License

MIT
