Metadata-Version: 2.4
Name: contextiq
Version: 0.1.0
Summary: Turn messy files into agent-ready context.
Author: ContextIQ Contributors
License-Expression: MIT
Keywords: rag,agents,llm,ingestion,chunking,search,context
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: docs
Requires-Dist: python-docx>=1.1.0; extra == "docs"
Requires-Dist: pypdf>=5.0.0; extra == "docs"
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Dynamic: license-file

# ContextIQ

ContextIQ turns messy files into agent-ready context.

It is a local-first ingestion pipeline for developers building RAG systems, agent memory layers, document search, and eval datasets. Point it at a folder and it produces clean JSONL and Markdown exports with chunked, traceable content.

## Why it exists

Most AI tooling assumes your data is already clean. Real projects get stuck earlier, at ingestion:

- PDFs are noisy
- Word docs lose structure
- repos and notes mix formats
- chunks are inconsistent
- source traceability is easy to lose

ContextIQ focuses on the missing middle: consistent ingestion, chunking, and export.

## Features

- Local-first CLI
- Recursive file ingestion
- Built-in support for:
  - `.txt`, `.md`, `.rst`
  - `.json`, `.jsonl`
  - `.csv`, `.tsv`
  - `.html`, `.htm`
  - optional `.pdf` via `pypdf` (installed with the `docs` extra)
  - optional `.docx` via `python-docx` (installed with the `docs` extra)
- Document-aware chunking
- Source-preserving metadata
- JSONL and Markdown exports
- Run manifest with counts, warnings, and timings
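Because `.pdf` and `.docx` support depends on optional packages, it can help to check whether they are importable before a run. The following is a minimal standalone sketch, not ContextIQ's own detection logic; the helper name `available_optional_formats` is hypothetical and exists only for illustration.

```python
from importlib.util import find_spec

# Import names of the optional parsers. The python-docx distribution
# is imported as "docx", not "python_docx".
OPTIONAL_PARSERS = {".pdf": "pypdf", ".docx": "docx"}


def available_optional_formats(parsers=OPTIONAL_PARSERS):
    """Return {extension: bool}: True when the parser module is importable.

    Hypothetical helper for illustration; ContextIQ may detect these differently.
    """
    return {ext: find_spec(module) is not None for ext, module in parsers.items()}


if __name__ == "__main__":
    for ext, ok in available_optional_formats().items():
        status = "available" if ok else "missing (install the docs extra)"
        print(f"{ext}: {status}")
```

If either format reports as missing, installing the package with its `docs` extra should pull in both `pypdf` and `python-docx`.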

## Quickstart

```bash
python -m venv .venv
. .venv/bin/activate
pip install -e ".[dev]"
contextiq ingest ./examples --out ./build/context
```

On Windows PowerShell:

```powershell
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install -e .[dev]
contextiq ingest .\examples --out .\build\context
```

## CLI

```bash
contextiq ingest <path> --out <directory>
```

Useful flags:

- `--include-ext .md,.txt,.json`
- `--exclude-glob "*.min.js,*.lock"`
- `--chunk-size 1200`
- `--chunk-overlap 150`
- `--formats jsonl,markdown`
- `--fail-on-warning`
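To build intuition for how `--chunk-size` and `--chunk-overlap` interact, here is a minimal sliding-window sketch. It assumes both values are measured in characters, which this README does not state, and it is a plain fixed-window splitter rather than ContextIQ's document-aware chunking. The function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size=1200, chunk_overlap=150):
    """Split text into fixed-size windows that overlap by chunk_overlap characters.

    Illustrative only: assumes character units and ignores document structure.
    Returns a list of (start, end, chunk_text) tuples.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append((start, end, text[start:end]))
        if end == len(text):
            break  # the last window reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    for start, end, chunk in sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1):
        print(start, end, chunk)
```

Under these assumptions, each window advances by `chunk_size - chunk_overlap`, so a larger overlap produces more chunks for the same input.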

## Output

`contextiq ingest` writes:

- `documents.jsonl`: normalized source documents
- `chunks.jsonl`: chunked outputs for RAG/agents
- `chunks.md`: human-readable review file
- `manifest.json`: summary of the run

Each chunk preserves:

- source path
- document id
- chunk id
- byte and character ranges when available
- headings / section hints
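Since the chunk output is JSONL, it can be consumed with the standard library alone. The sketch below parses JSONL records and groups them by a field. The key name `source_path` and the output location `build/context/chunks.jsonl` are assumptions for illustration; check a real `chunks.jsonl` for the actual field names before relying on them.

```python
import json
from collections import defaultdict
from pathlib import Path


def parse_jsonl(lines):
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def group_by_field(records, field):
    """Group records by the value of `field`; records lacking it go under None."""
    groups = defaultdict(list)
    for record in records:
        groups[record.get(field)].append(record)
    return dict(groups)


if __name__ == "__main__":
    # Assumed output path, taken from the Quickstart's --out directory.
    chunks_file = Path("build/context/chunks.jsonl")
    if chunks_file.exists():
        with chunks_file.open(encoding="utf-8") as handle:
            # "source_path" is a hypothetical key name, not confirmed by this README.
            by_source = group_by_field(parse_jsonl(handle), "source_path")
        for source, chunks in by_source.items():
            print(f"{source}: {len(chunks)} chunks")
```

Grouping by source makes it easy to spot files that produced unexpectedly few or many chunks during review.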

## Example

```bash
contextiq ingest ./docs --out ./dist/context --chunk-size 900 --chunk-overlap 120
```

## Development

```bash
pip install -e ".[dev]"
pytest
```

## Roadmap

- embeddings plugin interface
- vector DB exporters
- OCR pipeline
- table extraction
- citation-aware retrieval benchmarks
