Metadata-Version: 2.3
Name: data-gentry
Version: 0.1.2
Summary: A library for packaging together data + documentation into an agent-friendly duckdb artifact.
Requires-Dist: boto3>=1.42.5
Requires-Dist: duckdb>=1.4.3
Requires-Dist: duckdb-engine>=0.17.0
Requires-Dist: pymupdf4llm>=0.2.7
Requires-Dist: pytz>=2025.2
Requires-Dist: semchunk>=3.2.5
Requires-Dist: sqlalchemy>=2.0.45
Requires-Dist: strands-agents>=1.19.0
Requires-Dist: types-boto3[bedrock,bedrock-runtime]>=1.42.9
Requires-Python: >=3.14
Description-Content-Type: text/markdown

# DataGentry
🎩  
🧐  🦆

A small library for creating efficient file-specific agents / RAG systems with duckdb.

## Overview

DataGentry packages together:
  - Loading data files and data documentation into a duckdb database, with pre-built vector and full-text indices on the data dictionary's contents.
  - Simple interfaces for chunking + embedding documents and loading data, allowing the user to customize how the duckdb artifact is created.
    - Out-of-the-box chunking: Semchunk
    - Out-of-the-box embedding: Bedrock
  - Hybrid BM25 / HNSW retrieval on the generated database.
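
To give a feel for what "hybrid" retrieval means, here is a minimal sketch of one common way to merge a full-text ranking with a vector ranking: reciprocal rank fusion (RRF). This is an illustration only. The README does not say how DataGentry combines the two result sets, and the function name, the chunk ids, and the constant `k=60` are all assumptions made for this example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into a single list, best first.

    Hypothetical helper for illustration; not part of DataGentry's API.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Each list contributes 1 / (k + rank) for every id it contains.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Made-up result lists standing in for a BM25 query and an HNSW query.
bm25_hits = ["chunk_3", "chunk_1", "chunk_7"]
hnsw_hits = ["chunk_1", "chunk_4", "chunk_3"]

fused = reciprocal_rank_fusion([bm25_hits, hnsw_hits])
print(fused)
```

Chunks that rank well in both lists rise to the top, while a chunk found by only one retriever still makes it into the fused list. RRF only needs ranks, so the BM25 scores and the vector distances never have to be put on a common scale.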

The project is currently in a proof-of-concept / "playing around" phase. In my mind, though, it could help solve a real problem: existing semantic layers are often tightly coupled to vendors like Databricks or Snowflake, which increases vendor lock-in and ties users to Spark workloads that are often overkill for the size of the data in question.

## TODO:
  - Support vector similarity metrics other than cosine similarity
  - Implement a set of tools to allow an agent to work with the artifact
  - Convenience functionality to auto-load from fs (/httpfs)?
