Metadata-Version: 2.4
Name: hecvec
Version: 0.1.0
Summary: List directories (safe root), filter .txt/.md files, read as text, chunk, embed, and push to Chroma.
License-Expression: MIT
Keywords: chunking,document-pipeline,listdir,text-files
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: <3.14,>=3.9
Requires-Dist: chromadb>=0.4.0
Requires-Dist: langchain-text-splitters>=0.2.0
Requires-Dist: openai>=1.0.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: tiktoken>=0.5.0
Provides-Extra: chroma
Requires-Dist: chromadb>=0.4.0; extra == 'chroma'
Requires-Dist: langchain-text-splitters>=0.2.0; extra == 'chroma'
Requires-Dist: openai>=1.0.0; extra == 'chroma'
Requires-Dist: python-dotenv>=1.0.0; extra == 'chroma'
Requires-Dist: tiktoken>=0.5.0; extra == 'chroma'
Provides-Extra: chunk
Requires-Dist: langchain-text-splitters>=0.2.0; extra == 'chunk'
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Description-Content-Type: text/markdown

# HecVec

List directories with a safe root, filter `.txt`/`.md` files, read them as text, and optionally chunk and push to Chroma — **library only, no API**.

## Install

```bash
pip install hecvec
```

One-call pipeline (list → filter → token-chunk → Chroma):

```bash
pip install "hecvec[chroma]"
```

Optional chunking only (no Chroma):

```bash
pip install "hecvec[chunk]"
```

## Usage

### One-call pipeline (list → filter → chunk → Chroma)

Runs entirely in the library (no API). You need Chroma running (e.g. `docker run -p 8000:8000 chromadb/chroma`) and `OPENAI_API_KEY` set, either in the environment or in a `.env` file; the library loads `.env` via python-dotenv when you install `hecvec[chroma]`.

```python
import hecvec

# Class-style: use defaults, then slice
test = hecvec.HecVec()
result = test.slice(path="/path/to/folder")
# → {"files": N, "chunks": M, "collection": "hecvec"}

# Or call slice on the class (same flow)
result = hecvec.HecVec.slice(path="/path/to/folder")
```

Flow: resolve path → listdir → filter `.txt`/`.md` → token-chunk (200 tokens, `cl100k_base`) → embed with OpenAI → push to Chroma.
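The token-chunk step slides a fixed-size window over the token stream. A minimal sketch, with a whitespace split standing in for the real `cl100k_base` tokenizer (which the library gets from tiktoken):

```python
def token_chunk_sketch(text, chunk_size=200, chunk_overlap=0):
    """Split text into windows of at most chunk_size tokens.

    Sketch only: whitespace tokens stand in for tiktoken's
    cl100k_base encoding, which the library actually uses.
    """
    tokens = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
    return chunks

doc = " ".join(f"word{i}" for i in range(450))
chunks = token_chunk_sketch(doc, chunk_size=200, chunk_overlap=20)
print(len(chunks))  # → 3 (windows start at tokens 0, 180, 360)
```

With `chunk_overlap > 0`, the tail of each chunk repeats at the head of the next, which helps keep context intact across chunk boundaries when embedding.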

Optional config (instance or `HecVec.slice(..., key=value)`):

- `root`, `collection_name`, `chroma_host`, `chroma_port`
- `embedding_model`, `chunk_size`, `chunk_overlap`, `encoding_name`, `batch_size`
- `openai_api_key` (or set `OPENAI_API_KEY` in the environment or in a `.env` file; optional `dotenv_path` to point to a specific `.env`)
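For example, overriding a few of these on a one-shot call (a configuration sketch using the keyword names listed above; the values are illustrative, and running it still requires a reachable Chroma instance and an OpenAI key):

```python
import hecvec

# All keywords besides `path` are optional overrides of the defaults.
result = hecvec.HecVec.slice(
    path="/path/to/folder",
    collection_name="my_docs",   # Chroma collection to write into
    chroma_host="localhost",
    chroma_port=8000,
    chunk_size=100,              # tokens per chunk
    chunk_overlap=10,            # tokens shared between adjacent chunks
)
```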

### Low-level building blocks

```python
from pathlib import Path
from hecvec import ListDir, ListDirTextFiles, ReadText

root = Path("/path/to/repo")

# List all entries under a path (restricted to root)
lister = ListDir(root=root)
for rel in lister.listdir("."):
    print(rel)

# Only .txt and .md files, recursively
text_lister = ListDirTextFiles(root=root)
paths = text_lister.listdir_recursive_txt_md("docs")

# Read each file as text
reader = ReadText(paths)
for path, text in reader:
    print(path, len(text))
```

### Chunking (optional)

With `pip install hecvec[chunk]`:

```python
from hecvec import ListDirTextFiles, ReadText
from hecvec.chunking import chunk_documents

lister = ListDirTextFiles(root=root)
paths = lister.listdir_recursive_txt_md(".")
reader = ReadText(paths)
path_and_text = reader.read_all()
chunks = chunk_documents(path_and_text)
# list of {"path": "...", "chunk_index": 0, "content": "..."}
```

### CLI

```bash
hecvec-listdir [path] [root]
# or
python -m hecvec.cli [path] [root]
```

### Test the full pipeline (end to end)

From the project root, with Chroma running and `OPENAI_API_KEY` set (e.g. in `.env`):

```bash
# Start Chroma (one terminal)
docker run -p 8000:8000 chromadb/chroma

# Run the test script (another terminal)
uv run python scripts/test_slice.py
# or: python scripts/test_slice.py
```

The script creates a temp folder with two `.txt` files, runs `HecVec.slice(path=...)`, and prints `PASS` or `FAIL` with the result (`files`, `chunks`, `collection`).

### Modular layout (easy to study)

Each step of the pipeline lives in its own module:

| Module | Responsibility |
|--------|-----------------|
| `hecvec.env` | Load `.env` and `OPENAI_API_KEY` |
| `hecvec.listdir` | List dirs under a safe root; filter by extension (`.txt`/`.md`) |
| `hecvec.reading` | Read files as text (UTF-8 / latin-1 / cp1252 fallback) |
| `hecvec.token_splitter` | Token-based chunking (TokenTextSplitter) |
| `hecvec.chunking` | Recursive-character chunking (RecursiveCharacterTextSplitter) |
| `hecvec.embeddings` | OpenAI embeddings (`embed_texts`) |
| `hecvec.chroma_client` | Chroma client, get/create collection, add documents |
| `hecvec.chroma_list` | List Chroma collections and counts |
| `hecvec.pipeline` | Orchestrator: `HecVec` and `slice(path=...)` |

Example: use one step on its own:

```python
from hecvec import embed_texts, token_chunk_text, list_collections

chunks = token_chunk_text("Some long document...", chunk_size=200)
vecs = embed_texts(chunks, api_key="sk-...")
names_and_counts = list_collections(host="localhost", port=8000)
```
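The encoding fallback in `hecvec.reading` (UTF-8, then latin-1, then cp1252) can be sketched as a try-in-order loop; this is a standalone illustration, not the library's actual code:

```python
from pathlib import Path

def read_text_with_fallback(path, encodings=("utf-8", "latin-1", "cp1252")):
    """Try each encoding in order; return text from the first that decodes."""
    data = Path(path).read_bytes()
    last_error = None
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError as exc:
            last_error = exc
    raise last_error
```

Note that latin-1 maps every possible byte to a character, so with this ordering decoding never reaches cp1252; it serves as a catch-all that guarantees the read succeeds, at the cost of possibly mis-rendering Windows-1252 punctuation.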

## Development

From the repo root:

```bash
uv sync
uv run python -c "from hecvec import ListDir; print(ListDir('.').listdir('.'))"
```

## License

MIT
