Metadata-Version: 2.4
Name: agent-library
Version: 0.10.0
Summary: Multi-modal knowledge library with vector and full-text search for text, code, images, and PDFs
Author-email: "Arcade.dev" <dev@arcade.dev>
License-Expression: MIT
Keywords: ai,code,embeddings,knowledge-management,markdown,mcp,multimodal,search,vector
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Indexing
Requires-Python: >=3.10
Requires-Dist: arcade-mcp-server>=0.1.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: python-frontmatter>=1.0.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0.0
Requires-Dist: sentence-transformers>=2.2.0
Requires-Dist: sqlite-vec>=0.1.0
Requires-Dist: typer>=0.9.0
Provides-Extra: all
Requires-Dist: pillow>=10.0.0; extra == 'all'
Requires-Dist: pypdf>=4.0.0; extra == 'all'
Requires-Dist: pytesseract>=0.3.10; extra == 'all'
Requires-Dist: torch>=2.0.0; extra == 'all'
Requires-Dist: torchvision>=0.15.0; extra == 'all'
Requires-Dist: transformers>=4.30.0; extra == 'all'
Requires-Dist: tree-sitter-javascript>=0.21.0; extra == 'all'
Requires-Dist: tree-sitter-python>=0.21.0; extra == 'all'
Requires-Dist: tree-sitter>=0.21.0; extra == 'all'
Provides-Extra: code
Requires-Dist: torch>=2.0.0; extra == 'code'
Requires-Dist: transformers>=4.30.0; extra == 'code'
Requires-Dist: tree-sitter-javascript>=0.21.0; extra == 'code'
Requires-Dist: tree-sitter-python>=0.21.0; extra == 'code'
Requires-Dist: tree-sitter>=0.21.0; extra == 'code'
Provides-Extra: dev
Requires-Dist: mypy>=1.5.1; extra == 'dev'
Requires-Dist: pre-commit>=3.4.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.25.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=8.3.0; extra == 'dev'
Requires-Dist: ruff>=0.7.4; extra == 'dev'
Requires-Dist: types-pyyaml>=6.0.1; extra == 'dev'
Provides-Extra: evals
Requires-Dist: arcade-mcp[evals]>=1.10.0; extra == 'evals'
Provides-Extra: ocr
Requires-Dist: pillow>=10.0.0; extra == 'ocr'
Requires-Dist: pytesseract>=0.3.10; extra == 'ocr'
Provides-Extra: pdf
Requires-Dist: pypdf>=4.0.0; extra == 'pdf'
Provides-Extra: vision
Requires-Dist: pillow>=10.0.0; extra == 'vision'
Requires-Dist: torch>=2.0.0; extra == 'vision'
Requires-Dist: torchvision>=0.15.0; extra == 'vision'
Requires-Dist: transformers>=4.30.0; extra == 'vision'
Description-Content-Type: text/markdown

# Librarian

A personal knowledge library for AI agents, built on [Arcade](https://arcade.dev) for the Model Context Protocol (MCP).

## Overview

Librarian provides AI agents with persistent storage for text, documents, and knowledge. Agents can store information and retrieve it later through semantic and keyword search, maintaining context across conversations.

```mermaid
graph LR
    A[Agent Stores Info] --> B[Parser]
    B --> C[Chunker]
    C --> D[Embedder]
    D --> E[(SQLite + vec)]
    F[Agent Queries] --> G[Hybrid Search]
    E --> G
    G --> H[Relevant Context]
```

## Features

- Persistent knowledge storage for AI agents
- SQLite storage with `sqlite-vec` for vector search
- Full-text search using FTS5 with BM25 ranking
- Hybrid search combining semantic and keyword matching
- Max Marginal Relevance (MMR) for diverse results
- Configurable embedding models (local or OpenAI-compatible API)
- Header-aware text chunking with overlap
- Time-bounded search filters
- CLI and MCP server interfaces

## Multi-Modal Support

Librarian supports indexing and searching across multiple file types:

| Asset Type | File Extensions | Features |
|------------|----------------|----------|
| **Text** | `.md`, `.txt` | Frontmatter extraction, header-aware chunking |
| **Code** | `.py`, `.js`, `.ts`, `.go`, `.rs`, `.java`, `.cpp`, and more | Symbol extraction (classes, functions, methods) |
| **PDF** | `.pdf` | Page-based text extraction |
| **Image** | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp` | Metadata and EXIF extraction, optional OCR |

## Installation

```bash
git clone https://github.com/ArcadeAI/librarian.git
cd librarian
./setup.sh
```

Or install manually:

```bash
uv pip install -e ".[dev]"
```

Optional multi-modal dependencies:

```bash
uv pip install -e ".[pdf]"      # PDF support (pypdf)
uv pip install -e ".[vision]"   # Image support (Pillow)
uv pip install -e ".[all]"      # All optional features
```

## CLI Usage

```bash
# Add files to the library
libr add ~/notes

# Search the library
libr search "machine learning concepts"

# List sources
libr list

# View library statistics
libr index

# Rebuild the index
libr index build
```

## MCP Server

Start the server for AI assistant integration:

```bash
# stdio transport (Claude Desktop, CLI)
libr serve stdio

# HTTP transport (Cursor, VS Code)
libr serve http --port 8000
```

See the [Arcade MCP documentation](https://docs.arcade.dev) for integration details.

### Available Tools

| Tool | Description |
|------|-------------|
| `Librarian_SearchLibrary` | Search the library with hybrid vector + keyword search |
| `Librarian_SemanticSearchLibrary` | Find content by meaning (semantic similarity) |
| `Librarian_KeywordSearchLibrary` | Find content by exact keywords |
| `Librarian_SearchLibraryByDates` | Search within a date range |
| `Librarian_AddToLibrary` | Store new content in the library |
| `Librarian_UpdateLibraryDoc` | Update existing content |
| `Librarian_ReadFromLibrary` | Read full document content |
| `Librarian_RemoveFromLibrary` | Remove content from the library |
| `Librarian_ListLibraryContents` | List all stored content |
| `Librarian_IndexDirectoryToLibrary` | Bulk import files |
| `Librarian_GetLibrarySources` | List sources with document/chunk counts |
| `Librarian_GetLibraryStats` | Overall library statistics |

## Configuration

Set via environment variables:

| Variable | Default | Description |
|----------|---------|-------------|
| `DOCUMENTS_PATH` | `./documents` | Root directory for files |
| `DATABASE_PATH` | `~/.librarian/index.db` | SQLite database location |
| `EMBEDDING_PROVIDER` | `openai` | `local` or `openai` |
| `EMBEDDING_MODEL` | `all-MiniLM-L6-v2` | Local model name |
| `OPENAI_API_BASE` | `http://localhost:7171/v1` | OpenAI-compatible API URL |
| `OPENAI_EMBEDDING_MODEL` | `qwen3-embedding-06b` | API model name |
| `CHUNK_SIZE` | `512` | Max characters per chunk |
| `CHUNK_OVERLAP` | `50` | Overlap between chunks |
| `SEARCH_LIMIT` | `10` | Default results limit |
| `MMR_LAMBDA` | `0.5` | MMR diversity (0=diverse, 1=relevant) |
| `HYBRID_ALPHA` | `0.7` | Vector vs keyword weight (1=vector only) |

## Project Structure

```
librarian/
├── cli.py           # Command-line interface
├── server.py        # MCP server and tool definitions
├── config.py        # Configuration management
├── indexing.py      # Document indexing service
├── types.py         # Shared type definitions
├── storage/
│   ├── database.py  # SQLite operations
│   ├── vector_store.py  # sqlite-vec search
│   └── fts_store.py     # FTS5 search
├── processing/
│   ├── embed/       # Embedding providers
│   ├── parsers/     # Document parsers (md, code, pdf, image)
│   └── transform/   # Text chunking
├── retrieval/
│   └── search.py    # Hybrid search + MMR
└── utils/
    └── timeframe.py # Time filter utilities
```

## Development

```bash
make install    # Install dependencies
make test       # Run tests
make lint       # Run linter
make format     # Format code
make typecheck  # Type checking
make check      # All checks
make evals      # Run evaluations
```

## Resources

- [Arcade.dev](https://arcade.dev) - Build AI-native applications
- [Arcade Documentation](https://docs.arcade.dev) - Integration guides and API reference

## License

MIT License - see [LICENSE](LICENSE) for details.

## Contact

- Email: <contact@arcade.dev>
- Website: [arcade.dev](https://arcade.dev)
