Metadata-Version: 2.4
Name: contextframe
Version: 0.1.1
Summary: ContextFrame: A global standard file specification for contextual document management with LLMs
Author-email: ContextFrame Team <contact@contextframe.org>
License-Expression: MIT
Project-URL: repository, https://github.com/greyhaven-ai/contextframe
Project-URL: documentation, https://contextframe.org
Project-URL: issues, https://github.com/greyhaven-ai/contextframe/issues
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Documentation
Classifier: Topic :: Software Development :: Documentation
Classifier: Topic :: Text Processing :: Markup :: Markdown
Requires-Python: <3.13,>=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: jsonschema
Requires-Dist: lance>=0.7.0
Requires-Dist: pyarrow>=14.0.2
Requires-Dist: numpy>=1.24
Requires-Dist: pyyaml>=6.0.0
Provides-Extra: dev
Requires-Dist: pytest==8.3.4; extra == "dev"
Requires-Dist: pytest-asyncio==0.25.2; extra == "dev"
Requires-Dist: pytest-cov==6.0.0; extra == "dev"
Requires-Dist: mypy==1.14.1; extra == "dev"
Requires-Dist: ruff>=0.9.5; extra == "dev"
Requires-Dist: deptry>=0.12.0; extra == "dev"
Provides-Extra: extract
Requires-Dist: semantic-text-splitter>=0.3.0; extra == "extract"
Requires-Dist: litellm>=1.0.0; extra == "extract"
Requires-Dist: tokenizers>=0.13.0; extra == "extract"
Requires-Dist: tree-sitter>=0.20.0; extra == "extract"
Requires-Dist: tree-sitter-python>=0.20.0; extra == "extract"
Requires-Dist: tree-sitter-javascript>=0.20.0; extra == "extract"
Requires-Dist: tree-sitter-typescript>=0.20.0; extra == "extract"
Requires-Dist: tree-sitter-rust>=0.20.0; extra == "extract"
Requires-Dist: tree-sitter-go>=0.19.0; extra == "extract"
Requires-Dist: tree-sitter-cpp>=0.20.0; extra == "extract"
Requires-Dist: tree-sitter-c>=0.20.0; extra == "extract"
Requires-Dist: tree-sitter-java>=0.20.0; extra == "extract"
Requires-Dist: tree-sitter-ruby>=0.20.0; extra == "extract"
Requires-Dist: tree-sitter-php>=0.19.0; extra == "extract"
Requires-Dist: tree-sitter-html>=0.19.0; extra == "extract"
Requires-Dist: tree-sitter-css>=0.19.0; extra == "extract"
Requires-Dist: tree-sitter-json>=0.19.0; extra == "extract"
Requires-Dist: tree-sitter-yaml>=0.5.0; extra == "extract"
Requires-Dist: tree-sitter-toml>=0.5.0; extra == "extract"
Requires-Dist: tree-sitter-xml>=0.6.0; extra == "extract"
Requires-Dist: tree-sitter-bash>=0.19.0; extra == "extract"
Provides-Extra: embed
Requires-Dist: sentence-transformers>=2.2.0; extra == "embed"
Requires-Dist: openai>=1.0.0; extra == "embed"
Requires-Dist: anthropic>=0.21.0; extra == "embed"
Requires-Dist: cohere>=4.0.0; extra == "embed"
Provides-Extra: enhance
Requires-Dist: openai>=1.0.0; extra == "enhance"
Requires-Dist: anthropic>=0.21.0; extra == "enhance"
Requires-Dist: tiktoken>=0.5.0; extra == "enhance"
Requires-Dist: mirascope[anthropic,litellm,openai]>=1.0.0; extra == "enhance"
Requires-Dist: pydantic>=2.0.0; extra == "enhance"
Provides-Extra: encode
Requires-Dist: opencv-python>=4.8.0; extra == "encode"
Requires-Dist: pillow>=10.0.0; extra == "encode"
Requires-Dist: ffmpeg-python>=0.2.0; extra == "encode"
Requires-Dist: numpy>=1.24; extra == "encode"
Provides-Extra: serve
Requires-Dist: mcp>=0.9.0; extra == "serve"
Requires-Dist: fastapi>=0.104.0; extra == "serve"
Requires-Dist: uvicorn[standard]>=0.24.0; extra == "serve"
Requires-Dist: sse-starlette>=1.6.0; extra == "serve"
Requires-Dist: python-jose[cryptography]>=3.3.0; extra == "serve"
Requires-Dist: python-multipart>=0.0.6; extra == "serve"
Requires-Dist: httpx>=0.25.0; extra == "serve"
Requires-Dist: slowapi>=0.1.9; extra == "serve"
Provides-Extra: all
Requires-Dist: contextframe[embed,encode,enhance,extract,serve]; extra == "all"
Dynamic: license-file

# ContextFrame: Simplifying LLM Context Management with Enhanced Lance Files

ContextFrame is a Python package and file specification designed to streamline context management for Large Language Models (LLMs) in applied AI/ML workflows. It builds upon the powerful `.lance` columnar data format, providing a standardized schema, validation tools, and user-friendly abstractions to enhance its usability.

## The Problem: Managing Context for LLMs

Effectively managing the context provided to LLMs is crucial for generating relevant and accurate responses. This often involves handling diverse documents, maintaining structured metadata, tracking relationships, ensuring data consistency, **and crucially, annotating content with user-specific context relevant to the task at hand** – tasks that can become complex in real-world applications.

## The Solution: ContextFrame + Lance

ContextFrame addresses these challenges by leveraging the `.lance` format's strengths (performance, versioning, vector support) and adding a layer of structure and semantics inspired by the original MDP project:

-   **Standardized Schema**: Defines a specific, extensible schema (`contextframe/schema/contextframe_schema.json`) for storing document content, **rich metadata** (including authorship, timestamps, status, source information), **relationships**, **collection linkage**, and vector embeddings within a `.lance` file. This ensures consistency and provides a common structure for LLM context.
    -   **Record Type (`record_type`):** An **enum** column distinguishing between normal `document` rows and special control rows like `collection_header` or `dataset_header`.  This enables efficient filtering and makes datasets self-describing.
    -   **Relationships:** The schema includes a dedicated `relationships` field (a list of structs) to explicitly link related documents (e.g., parent/child, related, reference, member_of) using various identifiers (UUID, URI, Path, CID).
    -   **Collections:** Fields like `collection`, `collection_id`, and `position` facilitate organizing documents into logical groups or sequences.
    -   **Source Tracking:** Fields like `source_file`, `source_type`, and `source_url` allow tracking the origin of content, useful for ingestion pipelines.
    -   **User-Defined Context:** Fields like `context` (string) and `custom_metadata` (map) are explicitly designed to be **editable by the user**, allowing them to add workflow-specific annotations, descriptions of relevance (e.g., "This document is crucial for Project 123 because..."), or task-specific labels beyond the original source metadata.
-   **Schema-Aware Abstractions**: Offers Python classes (`FrameRecord`, `FrameDataset`) that simplify interactions with Lance datasets conforming to the ContextFrame spec.
    -   `FrameRecord`: Represents a single document (row) in memory. It handles data validation against the schema, manages default values (like UUIDs and timestamps), provides convenience accessors **and allows direct modification** of common metadata fields (including `context` and `custom_metadata`), includes helpers for managing relationships (e.g., `add_relationship`), and converts data to/from the Arrow format required by Lance. This enables users to easily layer their own semantic context onto the data.
    -   `FrameDataset`: Wraps the core `lance.LanceDataset`, providing schema-aware methods for common dataset operations:
        -   **Creation (`.create`)**: Initializes a new `.lance` directory with the correct ContextFrame schema (using `lance.write_dataset`).
        -   **Opening (`.open`)**: Loads an existing ContextFrame dataset (using `lance.dataset`).
        -   **Adding Records (`.add`, `.add_many`)**: Validates `FrameRecord` metadata against the schema and inserts the data into the Lance dataset (using `LanceDataset.insert`).
        -   **Retrieval (`.from_dataset_row`)**: Fetches a specific record by UUID (using `LanceDataset.scanner` with a filter).
        -   **Vector & Text Search Helpers**
            -   `knn_search(...)` – convenient nearest-neighbour search that returns fully materialised `FrameRecord` objects and supports additional SQL filters.
            -   `full_text_search(...)` – wraps Lance BM25 full-text search for one-liner relevance queries.
        -   **Rich Query Helpers** – pre-built helpers for common metadata look-ups (e.g. `find_by_status`, `find_by_tag`, `find_related_to`, `find_by_author`, date-range helpers, etc.).
        -   **Record Management (`.update_record`, `.delete_record`, `.upsert_record`)**: Idiomatic helpers to update, remove, or upsert records by UUID while preserving schema integrity.
        -   **Collection Helpers (`.get_collection_header`, `.get_collection_members`, `.get_members_linked_to_header`)**: Convenience methods to work with logical document collections identified by `collection` + `record_type='collection_header'`.
-   **Validation**: Integrates metadata validation using the JSON schema before writing data, preventing inconsistencies.
-   **Focus on Usability**: Aims to enhance the developer experience for managing context data within `.lance` files, without reimplementing Lance's core file I/O or advanced query capabilities. The underlying `LanceDataset` is always accessible (`FrameDataset._native`) for advanced use cases.
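
A plain-Python sketch makes these schema fields concrete. The dictionary below is illustrative only (the authoritative field list lives in `contextframe/schema/contextframe_schema.json`, and the exact keys inside each relationship struct may differ from the `type`/`id` pair assumed here):

```python
import uuid

header_uuid = str(uuid.uuid4())  # uuid of the collection header row

# Illustrative sketch of one ContextFrame row's metadata, using the fields
# described above (record_type, relationships, collections, user context).
record_metadata = {
    "uuid": str(uuid.uuid4()),
    "title": "Q3 Planning Notes",
    "record_type": "document",      # vs. "collection_header" / "dataset_header"
    "collection": "project_x",
    "position": 2,                  # ordering within the collection
    "source_file": "notes/q3.md",
    "source_type": "markdown",
    # Explicit links to other rows; each struct names a relationship type
    # and an identifier (UUID here; URI, Path, or CID are also possible).
    "relationships": [
        {"type": "member_of", "id": header_uuid},
    ],
    # User-editable fields for workflow-specific annotations.
    "context": "Primary planning doc for Project X; review before Q3 kickoff.",
    "custom_metadata": {"review_status": "approved"},
}

# Because record_type is a plain enum column, separating control rows from
# normal documents is a simple predicate:
is_document = record_metadata["record_type"] == "document"
```

The same predicate maps directly onto a Lance SQL filter such as `record_type = 'document'` when scanning a dataset.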

## Key Features

-   **Lance-Based**: Built on the high-performance, version-aware `.lance` format.
-   **Rich, Editable & Standardized Metadata**: Comprehensive schema for essential document details, including provenance, collections, status, **and importantly, user-editable fields (`context`, `custom_metadata`) for adding workflow-specific relevance and annotations.**
-   **Optional Raw Multimodal Data**: Store raw binary data (images, audio, video, PDFs, etc.) directly in the dataset using the **`raw_data`** (bytes) and **`raw_data_type`** (IANA MIME type) columns added to the canonical schema. A helper `MimeTypes` enum under `contextframe.schema` exposes common constants (e.g., `MimeTypes.IMAGE_JPEG`).
-   **Explicit Record Typing**: The `RecordType` enum (`contextframe.schema.RecordType`) provides canonical constants (`DOCUMENT`, `COLLECTION_HEADER`, `DATASET_HEADER`, `FRAMESET`).
-   **Explicit Relationships**: Dedicated schema support and helpers for defining connections between documents.
-   **Vector Embedding Support**: Natively handles vector embeddings alongside text and metadata.
-   **Schema Validation**: Ensures data consistency and integrity against the ContextFrame specification.
-   **Pythonic API**: Provides intuitive classes (`FrameRecord`, `FrameDataset`) for interacting with ContextFrame data.
-   **Full CRUD Helpers**: Built-in `.update_record`, `.delete_record`, and `.upsert_record` methods enable convenient record modifications.
-   **Document Extraction**: Native extractors for common formats (text, JSON, CSV, YAML) with lightweight Rust-based chunking via `semantic-text-splitter` and `tree-sitter` grammars for source-code parsing. Supports integration patterns for advanced extraction tools (Docling, Unstructured.io, Chunkr, Reducto).
-   **Embedding Generation**: Unified embedding interface through LiteLLM supporting 100+ providers (OpenAI, Anthropic, Cohere, local models via Ollama, etc.) with automatic batching and fallback strategies.
-   **Lightweight Abstraction**: Focuses on schema, validation, and usability, relying on the underlying `lance` library for core storage and query operations.
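
To picture what the chunking step does, here is a toy, pure-Python stand-in for the splitter. The real `semantic-text-splitter` is Rust-based and respects semantic boundaries far more carefully; this sketch only packs whole sentences into a character budget:

```python
import re

def chunk_text(text: str, capacity: int = 200) -> list[str]:
    """Toy stand-in for a semantic splitter: greedily pack whole
    sentences into chunks of at most `capacity` characters."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > capacity:
            chunks.append(current)   # current chunk is full; start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

doc = "Lance is a columnar format. It supports versioning. " * 10
chunks = chunk_text(doc, capacity=120)
# Each chunk stays within the budget and no text is lost.
```

Each chunk would then become its own `FrameRecord`, typically linked back to a parent record via the `relationships` field.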

## Core vs Extended Functionality

The `contextframe` package provides the core specification, schema, validation rules, and Python abstractions (`FrameRecord`, `FrameDataset`). It requires the `lance` library as a dependency.

While ContextFrame defines fields to track document source and relationships, the actual **data conversion/ingestion** (e.g., from PDF, HTML, other formats) and **complex relationship graph analysis** are considered extended functionalities. Users should leverage other libraries or build specific tools that produce or consume ContextFrame-compliant `.lance` files for these purposes.
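
As a sketch of what such an external ingestion tool might produce, the hypothetical helper below turns a plain-text file into a ContextFrame-style record dict. The function and its exact field names are illustrative, not part of the package API:

```python
import os
import pathlib
import tempfile
import uuid
from datetime import datetime, timezone

def ingest_text_file(path: str) -> dict:
    """Hypothetical adapter: read a plain-text file and emit a dict
    whose keys mirror ContextFrame's source-tracking fields."""
    p = pathlib.Path(path)
    return {
        "uuid": str(uuid.uuid4()),
        "title": p.stem,
        "content": p.read_text(encoding="utf-8"),
        "source_file": str(p),
        "source_type": "text",
        "created_at": datetime.now(timezone.utc).isoformat(),
    }

# Demonstrate on a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("Hello, ContextFrame.")
    tmp = f.name
record = ingest_text_file(tmp)
os.unlink(tmp)
```

A pipeline like this would hand `record` to `FrameRecord.create(...)` and then to `FrameDataset.add(...)`, keeping the conversion logic entirely outside the core package.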

## Installation

### Core Package (Minimal)
```bash
# Core functionality only - fast, lightweight install
pip install contextframe
```

### With Optional Features
```bash
# Document extraction (text, JSON, CSV, YAML) and source-code parsing (tree-sitter)
pip install "contextframe[extract]"

# Embedding generation (OpenAI, Anthropic, Cohere, Sentence Transformers)
pip install "contextframe[embed]"

# AI-powered enhancement (summarization, metadata extraction)
pip install "contextframe[enhance]"

# MP4 encoding for universal distribution
pip install "contextframe[encode]"

# MCP server for LLM/agent integration
pip install "contextframe[serve]"

# Install everything
pip install "contextframe[all]"
```

### Installation Decision Tree
- **Just managing Lance datasets?** → Use core package
- **Need to extract content from documents?** → Add `[extract]`
- **Creating embeddings?** → Add `[embed]`
- **Using LLMs for enrichment?** → Add `[enhance]`
- **Distributing datasets as MP4?** → Add `[encode]`
- **Serving to LLMs via MCP?** → Add `[serve]`

## Basic Usage (Conceptual)

```python
from contextframe import FrameRecord, FrameDataset
from contextframe.schema.contextframe_schema import MimeTypes, RecordType
import numpy as np
import pyarrow as pa # Lance works closely with Arrow

# 1. Create FrameRecord objects (in-memory representation)
#    - Handles metadata validation and default values (e.g., uuid, dates)
#    - Prepares data for Lance/Arrow
#    - Includes helpers for relationships
record1 = FrameRecord.create(
    title="Introduction to Lance",
    content="Lance is a modern columnar format...",
    vector=np.random.rand(1536).astype(np.float32), # Example embedding
    author="LanceDB Team",
    tags=["lance", "columnar", "vector db"],
    custom_metadata={"project": "ContextFrame"}, # Example custom field
    record_type=RecordType.DOCUMENT,
)

# Example with raw embedded image content and collection header
# (demonstrates use of RecordType enum and collection helpers)
image_bytes = b"...binary jpeg bytes..."  # Replace with real bytes

# Create a *collection header* document which holds contextual information
collection_header = FrameRecord.create(
    title="Project X Docs – Collection Header",
    content="Overview and context for Project X documents...",
    vector=np.zeros(1536, dtype=np.float32),
    tags=["project-x"],
    collection="project_x",
    record_type=RecordType.COLLECTION_HEADER,
)

record2 = FrameRecord.create(
    title="Project X – Image Asset",
    content="This record stores an image relevant to Project X...",
    vector=np.random.rand(1536).astype(np.float32),
    collection="project_x",
    raw_data=image_bytes,
    raw_data_type=MimeTypes.IMAGE_JPEG,
)

# Later, add or modify context based on your workflow needs
# record1.metadata['context'] = "This is the primary specification document for the ContextFrame project."
# record1.metadata['custom_metadata']['review_status'] = 'approved'
# record1.tags.append('core_concept') # Tags are also mutable

# Add a relationship between records (managed within FrameRecord metadata)
record2.add_relationship(collection_header, relationship_type="member_of")

# FrameRecord handles conversion to Arrow Table format needed by Lance
# table1 = record1.to_table()
# print(table1.schema)

# 2. Use FrameDataset to manage records and interact with Lance files
#    - FrameDataset wraps the underlying lance.LanceDataset

dataset_path = "my_context_frames.lance"

# Create a new dataset (writes an empty .lance dir with ContextFrame schema)
# try:
#     frame_dataset = FrameDataset.create(dataset_path, embed_dim=1536)
# except FileExistsError: # Handle existing dataset
#     frame_dataset = FrameDataset.open(dataset_path)


# Add records (validates schema, converts FrameRecord to Arrow, uses LanceDataset.insert)
# frame_dataset.add(record1)
# frame_dataset.add(record2) # Can add records individually or use add_many

# Open an existing dataset
# existing_dataset = FrameDataset.open(dataset_path)
# print(f"Dataset URI: {existing_dataset._native.uri}")
# print(f"Schema: {existing_dataset._native.schema}")
# print(f"Versions: {existing_dataset.versions()}")

# Advanced query helpers

# 3-nearest neighbours restricted to a collection
# query_vector = np.random.rand(1536).astype(np.float32)
# top3 = existing_dataset.knn_search(query_vector, k=3, filter="collection = 'project_x'")
# print([r.title for r in top3])

# Full-text search across text_content
# hits = existing_dataset.full_text_search("vector database", k=10)
# for rec in hits:
#     print(rec.title, rec.metadata.get("score"))

# Metadata helpers
# published_docs = existing_dataset.find_by_status("published")
# tag_matches = existing_dataset.find_by_tag("lance")

# Load a specific record using the helper
# loaded_record = FrameRecord.from_dataset_row(dataset_path, uuid=record1.uuid)
# print(f"Loaded record title: {loaded_record.title}")
# print(f"Loaded record relationships: {loaded_record.metadata.get('relationships')}")
# print(f"Loaded record custom metadata: {loaded_record.metadata.get('custom_metadata')}")

# Access the native LanceDataset object if needed
# native_lance_dataset = existing_dataset._native
# print(f"Native dataset row count: {native_lance_dataset.count_rows()}")

# Update a record (delete-then-add under the hood)
# record1.title = "Introduction to Lance (Updated)"
# frame_dataset.update_record(record1)

# Delete a record by UUID
# frame_dataset.delete_record(record2.uuid)

# Upsert semantics – insert or replace
# frame_dataset.upsert_record(record1)

```

*(Please note: The usage example above is conceptual and may require adjustments based on the final API design. It illustrates the intended workflow.)*
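The semantics of `upsert_record` can be modelled in a few lines over a mapping keyed by UUID. This is purely illustrative; the real helper operates on the Lance dataset, using delete-then-add under the hood:

```python
# Toy model of upsert semantics: rows keyed by their uuid.
rows: dict[str, dict] = {}

def upsert(rows: dict[str, dict], record: dict) -> None:
    """Insert the record, or replace any existing row with the same uuid."""
    rows[record["uuid"]] = record

upsert(rows, {"uuid": "r1", "title": "Introduction to Lance"})
upsert(rows, {"uuid": "r1", "title": "Introduction to Lance (Updated)"})
# rows now holds a single record carrying the updated title.
```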

## Optimising Performance with Indexes

`FrameDataset` exposes **index management helpers** that wrap Lance native functionality so you can optimise searches without reading the docs every time:

```python
# Build (or replace) a vector IVF+PQ index – essential for fast knn_search
frame_dataset.create_vector_index(index_type="IVF_PQ", num_partitions=256, num_sub_vectors=64)

# Create a bitmap index on the 'status' column to speed up filters
frame_dataset.create_scalar_index("status")
```

Behind the scenes these forward to `LanceDataset.create_index()` / `create_scalar_index()`.  See the [Lance docs](https://lancedb.github.io/lance/) for tuning parameters.

## License

This project is licensed under the MIT License - see the LICENSE file for details.
