Metadata-Version: 2.4
Name: hecvec
Version: 6.6.1
Summary: List directories (safe root), filter .txt/.md files, read as text, chunk, embed, and push to Chroma.
License-Expression: MIT
Keywords: chunking,document-pipeline,listdir,text-files
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: <3.14,>=3.9
Requires-Dist: chromadb>=0.4.0
Requires-Dist: langchain-text-splitters>=0.2.0
Requires-Dist: openai>=1.0.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: tiktoken>=0.5.0
Requires-Dist: typing-extensions>=4.0.0
Provides-Extra: chroma
Requires-Dist: chromadb>=0.4.0; extra == 'chroma'
Requires-Dist: langchain-text-splitters>=0.2.0; extra == 'chroma'
Requires-Dist: openai>=1.0.0; extra == 'chroma'
Requires-Dist: python-dotenv>=1.0.0; extra == 'chroma'
Requires-Dist: tiktoken>=0.5.0; extra == 'chroma'
Requires-Dist: typing-extensions>=4.0.0; extra == 'chroma'
Provides-Extra: chunk
Requires-Dist: langchain-text-splitters>=0.2.0; extra == 'chunk'
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: twine>=6.0.0; extra == 'dev'
Description-Content-Type: text/markdown

# HecVec

**HecVec** is a Python library that discovers `.txt` and `.md` files, chunks them (token, text, semantic, or LLM-based), embeds the chunks with OpenAI, and stores the vectors in Chroma. It is **library-only**: there is no HTTP API, and all work runs in-process.

---

## Table of contents

- [Install](#install)
- [Requirements to run the pipeline](#requirements-to-run-the-pipeline)
- [Workflow](#workflow)
- [Quick start](#quick-start)
- [Parameters](#parameters)
- [API reference: methods and parameters](#api-reference-methods-and-parameters)
- [Chunking methods](#chunking-methods)
- [Self-hosted Chroma server](#self-hosted-chroma-server-dbchroma)
- [Chroma Cloud](#chroma-cloud-dbchroma_cloud)
- [Environment and API key](#environment-and-api-key)
- [Collection naming](#collection-naming)
- [Building blocks](#building-blocks)
- [Module layout](#module-layout)
- [Development](#development)
- [License](#license)

---

## Install

**Full pipeline** (list → verify Chroma is up → read → chunk → embed → Chroma):

```bash
pip install hecvec
```

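With the default `db="chroma"`, a **Chroma server** must be running; the pipeline connects to that server over HTTP (see [Self-hosted Chroma server](#self-hosted-chroma-server-dbchroma)). With `db="chroma_cloud"`, it connects to Chroma Cloud instead (see [Chroma Cloud](#chroma-cloud-dbchroma_cloud)). There is no local/ephemeral mode.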

---

## Requirements to run the pipeline

To use the full `Slicer.slice(...)` pipeline you need:

1. **Python** 3.9–3.13.
2. **Dependencies** installed via `pip install hecvec`.
3. **OpenAI API key** for embeddings (and for `semantic` / `llm` chunking). Set `OPENAI_API_KEY` in the environment or in a `.env` file (see [Environment and API key](#environment-and-api-key)).
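4. **Chroma server:** with the default `db="chroma"`, a Chroma server must be listening at `host:port` (default `localhost:8000`). If nothing is listening, the pipeline raises an error. Start one with, for example, `docker run -p 8000:8000 -v ./chroma-data:/chroma/chroma chromadb/chroma` (the bind mount keeps data across container restarts). With `db="chroma_cloud"`, no self-hosted server is needed (see [Chroma Cloud](#chroma-cloud-dbchroma_cloud)).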

---

## Workflow

The main entry point is `Slicer.slice(path=..., **kwargs)`. It runs seven logged steps, numbered 0 through 6:

| Step | Description |
|------|-------------|
| **0** | Resolve path, resolve collection name (`base_name` + `_` + `chunking_method` + `_` + chunk config; see [Collection naming](#collection-naming)). |
| **1** | Discover files: single `.txt`/`.md` file or recursive list under a directory. |
| **2** | **Chroma server check:** connect to the server and fail fast if nothing is listening (before read/chunk/embed so you don’t pay for OpenAI when Chroma is down). The client is reused for the final write. |
| **3** | Read file contents as text (UTF-8 with fallbacks). |
| **4** | Chunk using the chosen method (`token`, `text`, `semantic`, or `llm`). |
| **5** | Generate embeddings with OpenAI. |
| **6** | Connect (already verified in step 2), list collections; if the collection **already exists**, skip adding. Otherwise create the collection and add documents. |

Progress is logged as `[0/6]` … `[6/6]` with timings. If the collection already exists, the log states that clearly after embeddings and no new documents are added.
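
The `[0/6]` … `[6/6]` progress format can be pictured with a small step runner. This is only a sketch of the logging pattern, not HecVec's internal code: `run_steps` and the step callables are hypothetical names introduced for illustration.

```python
import logging
import time
from typing import Any, Callable, List, Tuple

logger = logging.getLogger("pipeline_sketch")


def run_steps(steps: List[Tuple[str, Callable[[Any], Any]]], initial: Any = None) -> Any:
    """Run steps in order, logging "[index/last] name" with elapsed seconds.

    Each step receives the previous step's return value. An exception in any
    step stops the run, so later (possibly costly) steps never execute.
    """
    last = len(steps) - 1
    value = initial
    for index, (name, func) in enumerate(steps):
        started = time.perf_counter()
        value = func(value)  # may raise; nothing after this step will run
        elapsed = time.perf_counter() - started
        logger.info("[%d/%d] %s (%.3fs)", index, last, name, elapsed)
    return value
```

Placing a cheap connectivity check early in such a list is what makes the run fail fast: if that step raises, the embedding step further down is never reached.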

---

## Quick start

```python
import hecvec

# Default: token chunking, Chroma at localhost:8000
result = hecvec.Slicer.slice(path="/path/to/folder_or_file")
# → {"files": N, "chunks": M, "collection": "folder_or_file_name_token_cs200_ov0_enccl100k_base"}

# Custom host/port and semantic chunking
result = hecvec.Slicer.slice(
    path="/path/to/docs",
    host="localhost",
    port=8000,
    chunking_method="semantic",
)
```

Or use an instance:

```python
slicer = hecvec.Slicer(
    host="chroma",  # e.g. Docker Compose service name (see `.devcontainer/docker-compose.yml`)
    port=8000,
    chunking_method="token",
)
result = slicer.slice(path="/data/myfile.txt")
```

**Run the test script** (from repo root, with Chroma running and `OPENAI_API_KEY` set):

```bash
# Terminal 1: start Chroma
docker run -p 8000:8000 -v ./chroma-data:/chroma/chroma chromadb/chroma

# Terminal 2: run pipeline
uv run python scripts/test_slice.py
# Or with a path:
uv run python scripts/test_slice.py /path/to/file_or_folder
```

---

## Parameters

All of these can be passed to `Slicer.slice(..., key=value)`; all except `path` can also be set once on `Slicer(...)`.

| Parameter | Default | Description |
|-----------|---------|-------------|
| `path` | *(required)* | File or directory to process (`.txt`/`.md` only). |
| `root` | `path.parent` (file) or `path` (dir) | Safe root for resolving paths (used when listing under a directory). |
| `collection_name` | `"hecvec"` | Base name for the Chroma collection. If `"hecvec"`, it is replaced by the file stem or directory name; the final name includes method + chunk config (so different `chunk_size`/`chunk_overlap` don’t collide). |
| `db` | `"chroma"` | Database backend to use: `"chroma"` (self-hosted) or `"chroma_cloud"` (Chroma Cloud). |
| `host` | `"localhost"` | Server host (**only for** `db="chroma"`). |
| `port` | `8000` | Server port (**only for** `db="chroma"`). |
| `user` | `None` | Optional Basic Auth username (**only for** `db="chroma"`). Use together with `password`. |
| `password` | `None` | Optional Basic Auth password (**only for** `db="chroma"`). Use together with `user`. |
| `cloud_api_key` | `None` | Chroma Cloud API key (**only for** `db="chroma_cloud"`). If not passed, `hecvec` reads `CHROMA_CLOUD_API_KEY` (preferred) or `CHROMA_API_KEY` from `.env`/env. |
| `cloud_tenant` | `None` | Optional Chroma Cloud tenant (**only for** `db="chroma_cloud"`). If not passed, `hecvec` reads `CHROMA_TENANT` from `.env`/env. |
| `cloud_database` | `None` | Optional Chroma Cloud database (**only for** `db="chroma_cloud"`). If not passed, `hecvec` reads `CHROMA_DATABASE` from `.env`/env. |
| `chunking_method` | `"token"` | Chunking strategy: `"token"` \| `"text"` \| `"semantic"` \| `"llm"`. See [Chunking methods](#chunking-methods). |
| `chunk_size` | `200` | Target chunk size (tokens for `token`, characters for `text`; also used by `llm`). |
| `chunk_overlap` | `0` | Overlap between consecutive chunks. |
| `encoding_name` | `"cl100k_base"` | Tiktoken encoding for token chunking. |
| `embedding_model` | `"text-embedding-3-small"` | OpenAI embedding model. |
| `batch_size` | `100` | Batch size for embedding API calls. |
| `openai_api_key` | from env / `.env` | OpenAI API key. Overrides env if provided. |
| `dotenv_path` | `None` | Path to `.env` file for loading `OPENAI_API_KEY`. |

---

## API reference: methods and parameters

Public methods and functions with their parameters. All are available from `import hecvec` unless a submodule is noted.

### Pipeline

**`Slicer(root=None, collection_name="hecvec", db="chroma", host="localhost", port=8000, user=None, password=None, cloud_api_key=None, cloud_tenant=None, cloud_database=None, embedding_model="text-embedding-3-small", chunk_size=200, chunk_overlap=0, encoding_name="cl100k_base", batch_size=100, openai_api_key=None, dotenv_path=None)`**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `root` | `str` \| `Path` \| `None` | `None` (→ cwd) | Safe root for path resolution. |
| `collection_name` | `str` | `"hecvec"` | Base collection name; see [Collection naming](#collection-naming). |
| `db` | `DbType` | `"chroma"` | Database backend: `"chroma"` (self-hosted) or `"chroma_cloud"` (Chroma Cloud). |
| `host` | `str` | `"localhost"` | Server host (when `db="chroma"`). |
| `port` | `int` | `8000` | Server port (when `db="chroma"`). |
| `user` | `str` \| `None` | `None` | Optional Basic Auth username for Chroma. Use together with `password`. |
| `password` | `str` \| `None` | `None` | Optional Basic Auth password for Chroma. Use together with `user`. |
| `cloud_api_key` | `str` \| `None` | `None` | Chroma Cloud API key. Required when `db="chroma_cloud"`, unless it is set in `.env`/env. |
| `cloud_tenant` | `str` \| `None` | `None` | Optional Chroma Cloud tenant. If provided, `cloud_database` must also be provided. |
| `cloud_database` | `str` \| `None` | `None` | Optional Chroma Cloud database. If provided, `cloud_tenant` must also be provided. |
| `embedding_model` | `str` | `"text-embedding-3-small"` | OpenAI embedding model. |
| `chunk_size` | `int` | `200` | Chunk size (tokens or chars by method). |
| `chunk_overlap` | `int` | `0` | Overlap between chunks. |
| `encoding_name` | `str` | `"cl100k_base"` | Tiktoken encoding. |
| `batch_size` | `int` | `100` | Embedding batch size. |
| `openai_api_key` | `str` \| `None` | `None` | OpenAI key (else env / `.env`). |
| `dotenv_path` | `str` \| `Path` \| `None` | `None` | Path to `.env` file. |

**`Slicer.slice(path, *, root=None, collection_name="hecvec", db="chroma", host="localhost", port=8000, user=None, password=None, cloud_api_key=None, cloud_tenant=None, cloud_database=None, embedding_model="text-embedding-3-small", chunk_size=200, chunk_overlap=0, encoding_name="cl100k_base", batch_size=100, chunking_method="token", openai_api_key=None, dotenv_path=None)`**

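Same parameters as above, plus:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `path` | `str` \| `Path` | *(required)* | File or directory to process (`.txt`/`.md`). |
| `chunking_method` | `ChunkingMethod` | `"token"` | Chunking strategy: `"token"` \| `"text"` \| `"semantic"` \| `"llm"`. See [Chunking methods](#chunking-methods). |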

Returns: `dict` with `files`, `chunks`, `collection`, and optionally `message` (e.g. when collection already exists).
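
Because `message` is optional, callers may want to branch on it before reporting counts. A minimal sketch, assuming only the four keys listed above; `describe_result` is a hypothetical helper, not part of `hecvec`.

```python
from typing import Any, Dict


def describe_result(result: Dict[str, Any]) -> str:
    """Turn a slice-style result dict into a one-line summary."""
    # "message" is optional; it is present, for example, when the collection
    # already existed and nothing was added.
    message = result.get("message")
    if message:
        return f"{result['collection']}: skipped ({message})"
    return (
        f"{result['collection']}: stored {result['chunks']} chunks "
        f"from {result['files']} files"
    )
```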

---

### Listing and reading

**`ListDir(root)`**

| Parameter | Type | Description |
|-----------|------|-------------|
| `root` | `str` \| `Path` | Root directory; all listed paths are under this. |

**`ListDir.listdir(path=".")`** → `list[str]`  
List one level under `path` (relative to root). Returns sorted relative path strings (dirs first, then files).

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `path` | `str` \| `Path` | `"."` | Path under root. |
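
The "safe root" idea can be sketched with `pathlib` alone: resolve the requested path, refuse anything that escapes the root, then list one level with directories first. This is an illustration under assumptions, not the `ListDir` source; `resolve_under_root` and `list_one_level` are hypothetical names.

```python
from pathlib import Path
from typing import List, Union

PathLike = Union[str, Path]


def resolve_under_root(root: PathLike, path: PathLike = ".") -> Path:
    """Resolve `path` relative to `root`; raise ValueError if it escapes root."""
    root_path = Path(root).resolve()
    candidate = (root_path / path).resolve()  # collapses ".." and symlinks
    if candidate != root_path and root_path not in candidate.parents:
        raise ValueError(f"{path!r} is outside the root {root_path}")
    return candidate


def list_one_level(root: PathLike, path: PathLike = ".") -> List[str]:
    """Return entries under `path` as root-relative strings, dirs first."""
    root_path = Path(root).resolve()
    target = resolve_under_root(root_path, path)
    # Sort key: directories (False) before files (True), then by name.
    entries = sorted(target.iterdir(), key=lambda p: (not p.is_dir(), p.name))
    return [entry.relative_to(root_path).as_posix() for entry in entries]
```

Resolving before the containment check matters: comparing unresolved strings would let `docs/../../etc` slip through.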

**`ListDir.listdir_recursive(path=".", max_depth=None)`** → `list[str]`  
List all entries under `path` recursively.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `path` | `str` \| `Path` | `"."` | Path under root. |
| `max_depth` | `int` \| `None` | `None` | Max depth; `None` = unlimited. |

**`ListDirTextFiles(root, allowed_extensions=(".txt", ".md"))`**  
Subclass of `ListDir` that filters to `.txt`/`.md` only.

**`ListDirTextFiles.filter_txt_md(relative_paths)`** → `list[Path]`  
From relative path strings, return full paths of files with allowed extensions.

**`ListDirTextFiles.listdir_txt_md(path=".")`** → `list[Path]`  
One-level list of `.txt`/`.md` files under `path`.

**`ListDirTextFiles.listdir_recursive_txt_md(path=".", max_depth=None)`** → `list[Path]`  
Recursive list of `.txt`/`.md` files under `path`.

**`ReadText(paths, encoding="utf-8")`**

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `paths` | `list[str]` \| `list[Path]` | — | File paths to read. |
| `encoding` | `str` | `"utf-8"` | Preferred encoding; fallbacks are latin-1, cp1252. |

**`ReadText.read_all()`** → `list[tuple[Path, str]]`  
Read all files; returns `(path, text)` pairs. Skips non-files and unreadable paths.

**`ReadText`** is iterable: `for path, text in reader:` yields `(path, text)`.
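
The encoding fallback can be pictured as trying each codec in turn. The sketch below is an assumption about the general technique, not the `ReadText` implementation; `decode_with_fallbacks` is a hypothetical name.

```python
from typing import Sequence, Tuple


def decode_with_fallbacks(
    data: bytes,
    encodings: Sequence[str] = ("utf-8", "latin-1", "cp1252"),
) -> Tuple[str, str]:
    """Return (text, encoding_used) for the first encoding that decodes `data`."""
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the encodings could decode the data")
```

Note that `latin-1` maps every possible byte to a character, so it never raises; in the order shown, `cp1252` is only reached if `latin-1` is removed or placed after it.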

---

### Chunking

**`chunk_text(text, chunk_size=400, chunk_overlap=0, separators=None)`** → `list[str]`  
Single-document recursive character split. Requires `hecvec[chunk]`.

Note: `Slicer.slice` does not expose `separators` directly; it uses the defaults from the low-level chunker.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `text` | `str` | — | Document text. |
| `chunk_size` | `int` | `400` | Max characters per chunk. |
| `chunk_overlap` | `int` | `0` | Overlap. |
| `separators` | `list[str]` \| `None` | `None` | Split order; default `["\n\n\n", "\n\n", "\n", ". ", " ", ""]`. |

**`chunk_documents(path_and_texts, chunk_size=400, chunk_overlap=0, separators=None)`** → `list[dict]`  
Multiple documents, recursive character split. Each dict: `{"path", "chunk_index", "content"}`. Requires `hecvec[chunk]`.

Note: `Slicer.slice` does not expose `separators` directly; it uses the defaults from the low-level chunker.

**`token_chunk_text(text, chunk_size=200, chunk_overlap=0, encoding_name="cl100k_base")`** → `list[str]`  
Single-document token split (tiktoken).

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `text` | `str` | — | Document text. |
| `chunk_size` | `int` | `200` | Max tokens per chunk. |
| `chunk_overlap` | `int` | `0` | Overlap. |
| `encoding_name` | `str` | `"cl100k_base"` | Tiktoken encoding. |
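
How `chunk_size` and `chunk_overlap` interact can be shown on a plain token list. This sketch is independent of tiktoken and of `token_chunk_text`; it assumes a sliding window that advances by `chunk_size - chunk_overlap` tokens, and `window_tokens` is a hypothetical name.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def window_tokens(tokens: Sequence[T], chunk_size: int, chunk_overlap: int = 0) -> List[List[T]]:
    """Split `tokens` into windows of at most `chunk_size`, sharing `chunk_overlap` items."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks: List[List[T]] = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(list(tokens[start:end]))
        if end == len(tokens):
            break  # the last window already reached the end of the input
        start += step
    return chunks
```

With real token ids, each window would be decoded back to text before embedding; whitespace-split words are a rough stand-in when experimenting.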

**`token_chunk_documents(path_and_texts, chunk_size=200, chunk_overlap=0, encoding_name="cl100k_base")`** → `tuple[list[str], list[str]]`  
Multiple documents, token split. Returns `(ids, documents)` with ids like `chunk_0`, `chunk_1`, ...

**`chunk_documents_by_method(path_and_texts, method="token", *, chunk_size=200, chunk_overlap=0, encoding_name="cl100k_base", separators=None, openai_api_key=None, semantic_max_chunk_size=400, semantic_min_chunk_size=50, llm_model="gpt-4o-mini")`** → `tuple[list[str], list[str]]`  
Chunk by `method`: `"token"` \| `"text"` \| `"semantic"` \| `"llm"`. Returns `(ids, documents)`.

Note: this is a low-level helper with advanced knobs. `Slicer.slice` forwards only:
`chunk_size`, `chunk_overlap`, `encoding_name`, `chunking_method` (as `method`), and (when needed) `openai_api_key`.  
So you only need `chunk_size` + `chunk_overlap` at the `Slicer` level; `separators`, `semantic_max_chunk_size`, `semantic_min_chunk_size`, and `llm_model` stay at their defaults unless you call this helper directly.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `path_and_texts` | `list[tuple[Path, str]]` | — | `(path, text)` pairs. |
| `method` | `ChunkingMethod` | `"token"` | `"token"` \| `"text"` \| `"semantic"` \| `"llm"`. |
| `chunk_size` | `int` | `200` | Used by token, text, llm. |
| `chunk_overlap` | `int` | `0` | Used by token, text. |
| `encoding_name` | `str` | `"cl100k_base"` | Token method. |
| `separators` | `list[str]` \| `None` | `None` | Text method only. |
| `openai_api_key` | `str` \| `None` | `None` | Required for semantic/llm. |
| `semantic_max_chunk_size` | `int` | `400` | Semantic method. |
| `semantic_min_chunk_size` | `int` | `50` | Semantic method. |
| `llm_model` | `str` | `"gpt-4o-mini"` | LLM method. |

---

### Embeddings and Chroma

**`embed_texts(texts, api_key, model="text-embedding-3-small", batch_size=100)`** → `list[list[float]]`  
OpenAI embeddings for a list of strings.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `texts` | `list[str]` | — | Texts to embed. |
| `api_key` | `str` | — | OpenAI API key. |
| `model` | `str` | `"text-embedding-3-small"` | Embedding model. |
| `batch_size` | `int` | `100` | Request batch size. |
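
The effect of `batch_size` is easiest to see as a slicing loop: one request per slice. The generator below is a sketch of that batching pattern only (no API call is made), and `iter_batches` is a hypothetical name rather than a `hecvec` function.

```python
from typing import Iterator, List, Sequence


def iter_batches(texts: Sequence[str], batch_size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of `texts`, each with at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])
```

For 250 chunks and the default of 100, this yields three batches of 100, 100, and 50 texts.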

**`get_client(host="localhost", port=8000, user=None, password=None)`** → Chroma `HttpClient`  
Connects to the Chroma server at `host:port`. Raises if nothing is listening. If `user` and `password` are provided, uses Chroma Basic Auth.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `host` | `str` | `"localhost"` | Chroma server host. |
| `port` | `int` | `8000` | Chroma server port. |
| `user` | `str` \| `None` | `None` | Basic Auth username for Chroma. Use together with `password`. |
| `password` | `str` \| `None` | `None` | Basic Auth password for Chroma. Use together with `user`. |

**`get_or_create_collection(client, name, metadata=None)`**  
Get or create a Chroma collection (default cosine similarity). `metadata` default: `{"hnsw:space": "cosine"}`.
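
Cosine space compares the direction of two vectors and ignores their length. The pure-Python sketch below illustrates the measure itself and is independent of Chroma; to my understanding Chroma reports cosine *distance* (1 minus this similarity), but treat that as an assumption to confirm against the Chroma documentation.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between vectors `a` and `b`."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)
```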

**`add_documents(client, collection_name, ids, embeddings, documents)`** → `dict`  
Add documents to a collection. Returns `{"collection_existed": bool}`.

**`list_collections(host="localhost", port=8000)`** → `list[tuple[str, int]]`  
List collection names and document counts on a Chroma server: `[(name, count), ...]`.

---

### Environment

**`load_dotenv_if_available(dotenv_path=None)`**  
Load `.env` into `os.environ` if python-dotenv is installed. No-op otherwise.

**`load_openai_key(dotenv_path=None)`** → `str` \| `None`  
Load `.env` if available, then return `os.environ.get("OPENAI_API_KEY")`.

---

## Chunking methods

| Method | Description | Requires |
|--------|-------------|----------|
| **`token`** | Split by token count (tiktoken; `cl100k_base` by default). Fast and deterministic. | — |
| **`text`** | Recursive character splitter (paragraph → line → sentence, etc.). | — |
| **`semantic`** | Embed small segments, then group by similarity (DP) into larger chunks. | `OPENAI_API_KEY` |
| **`llm`** | Use an LLM to choose split points for thematic sections. | `OPENAI_API_KEY` |

Use `chunking_method="token"` or `"text"` to avoid API calls during chunking. Use `"semantic"` or `"llm"` for more coherent, topic-aware chunks (at the cost of extra OpenAI usage).

---

## Self-hosted Chroma server (`db="chroma"`)

Use this when you run Chroma yourself (Docker, EC2, devcontainer, etc). `hecvec` connects to a Chroma server over HTTP using `host` and `port`.

**Start a server** (e.g. Docker):

```bash
docker run -p 8000:8000 -v ./chroma-data:/chroma/chroma chromadb/chroma
```

`-v` is a Docker bind mount: `-v ./chroma-data:/chroma/chroma` maps your local `./chroma-data` directory into the container.

In practice, “persistent data” means Chroma’s database files (collections, vectors, metadata) are written to disk and survive `docker stop` / `docker start` (and even container recreation), so collections created by earlier runs remain available.

**Parameters used by `db="chroma"`:**

- **Required**: `host`, `port`
- **Optional**: `user`, `password` (Basic Auth header; both-or-neither)

### Authentication (recommended)

Modern Chroma releases (the `chromadb/chroma:latest` image) do **not** reliably support the older `CHROMA_SERVER_AUTHN_*` env-var based auth providers. If you need enforced auth on a self-hosted instance (e.g. EC2), the simplest reliable pattern is:

- Run Chroma on a private network interface (or at least don’t expose it directly).
- Put a reverse proxy in front that enforces **Basic Auth** (or token/JWT) and TLS.

When you pass `user=` and `password=` to `hecvec.Slicer(...)`, `hecvec` will send a standard HTTP `Authorization: Basic ...` header, which works with a proxy-enforced Basic Auth setup.
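
For reference, a standard Basic Auth header is the base64 encoding of `user:password` prefixed with `Basic `. The sketch below shows that encoding with the standard library only; `basic_auth_header` is a hypothetical helper, and how `hecvec` builds the header internally is not shown here.

```python
import base64


def basic_auth_header(user: str, password: str) -> str:
    """Build the value of an HTTP `Authorization` header for Basic Auth."""
    raw = f"{user}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return f"Basic {token}"
```

Base64 is an encoding, not encryption, which is why the proxy in front of Chroma should also terminate TLS.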

Two common ways to run Chroma persistently:

1. **Plain Docker on the host:** run the `docker run ... -v ./chroma-data:...` command above.
2. **Inside the provided devcontainer:** use the compose setup in `.devcontainer/docker-compose.yml`. It starts a `chroma` service with a persistent Docker volume (`chroma-data`) mounted at `/chroma/data` and `IS_PERSISTENT=TRUE`, so reopening the devcontainer keeps your vectors.

**Containers:** If the app runs in a devcontainer and Chroma is in the same Docker Compose stack, use the **service name** as `host` (in this repo: `host="chroma"`). If Chroma is on the host and the app in the container, use `host="host.docker.internal"`.

---

## Chroma Cloud (`db="chroma_cloud"`)

Use this when you want a managed Chroma deployment with built-in auth and TLS. `hecvec` uses the Chroma Python client’s Cloud client under the hood.

**Parameters used by `db="chroma_cloud"`:**

- **Required**: `cloud_api_key` (or set via `.env`/env)
- **Optional**: `cloud_tenant` and `cloud_database` (either provide both, or omit both)
- **Ignored**: `host`, `port`, `user`, `password`

**Environment / `.env` variables (loaded automatically):**

- `CHROMA_CLOUD_API_KEY=...` (preferred) or `CHROMA_API_KEY=...`
- `CHROMA_TENANT=...` *(optional; used when `cloud_tenant` not passed)*
- `CHROMA_DATABASE=...` *(optional; used when `cloud_database` not passed)*
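
The precedence described above (explicit argument, then `CHROMA_CLOUD_API_KEY`, then `CHROMA_API_KEY`, with tenant and database supplied together or not at all) can be sketched as a small resolver. This is an illustration of the documented rules, not the library's code; `resolve_cloud_settings` is a hypothetical name and the error messages are made up.

```python
import os
from typing import Dict, Mapping, Optional


def resolve_cloud_settings(
    cloud_api_key: Optional[str] = None,
    cloud_tenant: Optional[str] = None,
    cloud_database: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, Optional[str]]:
    """Apply argument-over-environment precedence for Chroma Cloud settings."""
    env = os.environ if env is None else env
    # Explicit argument wins; CHROMA_CLOUD_API_KEY is preferred over CHROMA_API_KEY.
    api_key = cloud_api_key or env.get("CHROMA_CLOUD_API_KEY") or env.get("CHROMA_API_KEY")
    if not api_key:
        raise ValueError("no Chroma Cloud API key found in arguments or environment")
    tenant = cloud_tenant or env.get("CHROMA_TENANT") or None
    database = cloud_database or env.get("CHROMA_DATABASE") or None
    # Both-or-neither rule for tenant and database.
    if (tenant is None) != (database is None):
        raise ValueError("provide both tenant and database, or neither")
    return {"api_key": api_key, "tenant": tenant, "database": database}
```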

**Example:**

```python
import hecvec

# Minimal: only API key needed (tenant/database optional depending on your Cloud setup)
slicer = hecvec.Slicer(db="chroma_cloud", cloud_api_key="ck-...")
result = slicer.slice(path="/path/to/docs")
```

---

## Environment and API key

- **OpenAI:** The pipeline (and `semantic` / `llm` chunking) needs an API key. It is read in this order:
  1. Argument `openai_api_key=...`
  2. Environment variable `OPENAI_API_KEY`
  3. A `.env` file in the current working directory (loaded via `python-dotenv` when you use `hecvec`)

- **`.env`:** Create a `.env` in your project root (or set `dotenv_path` to point to one):

  ```env
  OPENAI_API_KEY=sk-...
  ```

- Do **not** commit `.env` or expose the key in logs or source code.

---

## Collection naming

- If you pass `collection_name="hecvec"` (default), the base name is taken from the input:
  - **Single file:** `path.stem` (e.g. `mydoc`)
  - **Directory:** `path.name` (e.g. `docs`)
- The **final** collection name is always:

  **`{base_name}_{chunking_method}_{chunk_config}`**

  Examples:
  - token: `mydoc_token_cs200_ov0_enccl100k_base`
  - text: `mydoc_text_cs400_ov0`
  - llm/semantic: `mydoc_llm_cs200`

- If a collection with that name **already exists**, the pipeline does **not** add documents again. It logs that the collection already exists and returns something like:

  `{"files": N, "chunks": 0, "collection": "...", "message": "Collection already exists; no documents added."}`
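
The naming scheme can be reproduced with a few string operations, which is handy for predicting a collection name before running the pipeline. The sketch below is inferred from the examples in this section and is not the library's code: `base_name_for` and `final_collection_name` are hypothetical names, the file-versus-directory test uses the suffix instead of the filesystem, and the `semantic` config is assumed to match the `llm` example.

```python
from pathlib import PurePath


def base_name_for(path: str, collection_name: str = "hecvec") -> str:
    """Pick the base name: an explicit name wins, else derive it from the path."""
    if collection_name != "hecvec":
        return collection_name
    p = PurePath(path)
    # Assumption: a .txt/.md suffix means "single file"; anything else is
    # treated as a directory name.
    return p.stem if p.suffix.lower() in (".txt", ".md") else p.name


def final_collection_name(
    base_name: str,
    method: str,
    chunk_size: int = 200,
    chunk_overlap: int = 0,
    encoding_name: str = "cl100k_base",
) -> str:
    """Build `{base_name}_{chunking_method}_{chunk_config}`."""
    if method == "token":
        config = f"cs{chunk_size}_ov{chunk_overlap}_enc{encoding_name}"
    elif method == "text":
        config = f"cs{chunk_size}_ov{chunk_overlap}"
    elif method in ("semantic", "llm"):
        config = f"cs{chunk_size}"  # assumed identical for both methods
    else:
        raise ValueError(f"unknown chunking method: {method!r}")
    return f"{base_name}_{method}_{config}"
```

Because the chunk configuration is part of the name, rerunning with a different `chunk_size` targets a different collection instead of hitting the "already exists" branch.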

---

## Building blocks

You can use the pipeline step-by-step.

**List and read:**

```python
from pathlib import Path
from hecvec import ListDir, ListDirTextFiles, ReadText

root = Path("/path/to/repo")
lister = ListDir(root=root)
for rel in lister.listdir("."):
    print(rel)

text_lister = ListDirTextFiles(root=root)
paths = text_lister.listdir_recursive_txt_md("docs")
reader = ReadText(paths)
for path, text in reader:
    print(path, len(text))
```

**Chunk only** (e.g. recursive character, with `hecvec[chunk]`):

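```python
from pathlib import Path

from hecvec import ListDirTextFiles, ReadText
from hecvec.chunking import chunk_documents

root = Path("/path/to/repo")  # safe root; all listed paths stay under it

# Collect every .txt/.md file under the root, then read them as text.
paths = ListDirTextFiles(root=root).listdir_recursive_txt_md(".")
path_and_text = ReadText(paths).read_all()  # list of (path, text) pairs

chunks = chunk_documents(path_and_text)  # list of {"path", "chunk_index", "content"}
```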

**Token chunk + embed + list Chroma collections:**

```python
from hecvec import token_chunk_text, embed_texts, list_collections

chunks = token_chunk_text("Some long document...", chunk_size=200)
vecs = embed_texts(chunks, api_key="sk-...")
names_and_counts = list_collections(host="localhost", port=8000)
```

**CLI** (list directory under a root):

```bash
hecvec-listdir [path] [root]
# or
python -m hecvec.cli [path] [root]
```

---

## Module layout

| Module | Responsibility |
|--------|----------------|
| `hecvec.env` | Load `.env` and `OPENAI_API_KEY` |
| `hecvec.listdir` | List dirs under a safe root; filter `.txt`/`.md` |
| `hecvec.reading` | Read files as text (UTF-8 / latin-1 / cp1252 fallback) |
| `hecvec.token_splitter` | Token-based chunking (tiktoken) |
| `hecvec.chunking` | Recursive character chunking (`chunk_documents`, `chunk_text`) |
| `hecvec.chunkers` | Multi-method chunking: token, text, semantic, llm |
| `hecvec.embeddings` | OpenAI embeddings (`embed_texts`) |
| `hecvec.chroma_client` | Chroma client, get/create collection, add documents |
| `hecvec.chroma_list` | List Chroma collections and counts |
| `hecvec.pipeline` | Orchestrator: `Slicer` and `slice(path=...)` |

---

## Development

From the repo root:

```bash
uv sync
uv run python -c "from hecvec import ListDir; print(ListDir('.').listdir('.'))"
```

---

## License

MIT
