Metadata-Version: 2.4
Name: colbert-matryoshka
Version: 0.1.1
Summary: Matryoshka ColBERT: Multi-dimensional ColBERT embeddings with PyLate
Project-URL: Homepage, https://huggingface.co/dragonkue/colbert-ko-0.1b
Project-URL: Repository, https://github.com/dragonkue/colbert-matryoshka
Author: dragonkue
License: Apache-2.0
Keywords: colbert,embeddings,korean,matryoshka,pylate,retrieval
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Requires-Dist: pylate>=1.0.0
Description-Content-Type: text/markdown

# colbert-matryoshka

Matryoshka ColBERT: Multi-dimensional ColBERT embeddings with PyLate.

This package provides `MatryoshkaColBERT`, a ColBERT model with multiple linear projection heads for Matryoshka embeddings (Jina-ColBERT-v2 style). Each supported embedding dimension (32, 64, 96, or 128) has its own projection head, and you select the active dimension with `set_active_dim`.
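To make the "separate projection heads" idea concrete, here is a minimal pure-Python sketch. It is not the package's implementation: `ToyMultiHeadProjector`, the toy hidden size, the random weights, and the L2 normalization step are all assumptions made for illustration. It only shows the general pattern of keeping one linear head per output dimension and switching between them.

```python
import math
import random

HIDDEN_SIZE = 16           # toy value (assumption); real encoders are much wider
DIMS = (32, 64, 96, 128)   # dimensions listed in this README


def make_head(in_dim, out_dim, rng):
    """Return a random (out_dim x in_dim) weight matrix as nested lists."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def project(head, token_vector):
    """Apply one linear head to a token vector, then L2-normalize (assumed step)."""
    out = [sum(w * x for w, x in zip(row, token_vector)) for row in head]
    norm = math.sqrt(sum(v * v for v in out)) or 1.0
    return [v / norm for v in out]


class ToyMultiHeadProjector:
    """Hypothetical stand-in: one independent linear head per output dimension."""

    def __init__(self, hidden_size=HIDDEN_SIZE, dims=DIMS, seed=0):
        rng = random.Random(seed)
        self.heads = {d: make_head(hidden_size, d, rng) for d in dims}
        self.active_dim = max(dims)

    def set_active_dim(self, dim):
        if dim not in self.heads:
            raise ValueError(f"Unsupported dimension {dim}; choose from {sorted(self.heads)}")
        self.active_dim = dim

    def encode_tokens(self, token_vectors):
        """Project every token vector with the currently active head."""
        head = self.heads[self.active_dim]
        return [project(head, t) for t in token_vectors]


# Three fake token hidden states standing in for encoder output.
rng = random.Random(1)
tokens = [[rng.uniform(-1.0, 1.0) for _ in range(HIDDEN_SIZE)] for _ in range(3)]

projector = ToyMultiHeadProjector()
projector.set_active_dim(64)
embeddings = projector.encode_tokens(tokens)
print(len(embeddings), len(embeddings[0]))
```

In this sketch the heads share nothing except their input, so switching dimensions swaps the whole projection rather than truncating a larger vector.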

## Installation

```bash
pip install colbert-matryoshka
```

## Quick Start

```python
from colbert_matryoshka import MatryoshkaColBERT

# Load model
model = MatryoshkaColBERT.from_pretrained("dragonkue/colbert-ko-0.1b")

# Set embedding dimension (32, 64, 96, or 128)
model.set_active_dim(128)

# Encode queries and documents (Korean samples: "search query", "document content")
query_embeddings = model.encode(["검색 쿼리"], is_query=True)
doc_embeddings = model.encode(["문서 내용"], is_query=False)

print(f"Query shape: {query_embeddings[0].shape}")  # (num_tokens, 128)
print(f"Doc shape: {doc_embeddings[0].shape}")      # (num_tokens, 128)
```
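The per-token embeddings printed above are what ColBERT-style late interaction scores against each other. As a rough illustration, the following pure-Python sketch computes a MaxSim score: for each query token, take the best dot product against any document token, then sum over query tokens. The function name `maxsim_score` and the toy 2-dimensional vectors are made up for this example; in practice PyLate performs the scoring for you.

```python
def maxsim_score(query_tokens, doc_tokens):
    """Sum over query tokens of the max dot product with any document token."""
    total = 0.0
    for q in query_tokens:
        total += max(sum(a * b for a, b in zip(q, d)) for d in doc_tokens)
    return total


# Toy 2-dimensional token embeddings (made-up values, not model output).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # has a token matching each query token
doc_b = [[0.6, 0.8]]                          # one token, only partially aligned

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Here `doc_a` scores higher than `doc_b` because every query token finds an exact match in it. The same scoring applies at any active dimension; only the length of each token vector changes.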

## Retrieval with PyLate

```python
from colbert_matryoshka import MatryoshkaColBERT
from pylate import indexes, retrieve

# Load model
model = MatryoshkaColBERT.from_pretrained("dragonkue/colbert-ko-0.1b")
model.set_active_dim(128)

# Initialize PLAID index (override=True replaces an existing index with the same name)
index = indexes.PLAID(
    index_folder="pylate-index",
    index_name="index",
    override=True,
)

# Encode and index documents
# (Korean samples: "This is the first document", "... second ...", "... third ...")
documents_ids = ["1", "2", "3"]
documents = ["첫번째 문서입니다", "두번째 문서입니다", "세번째 문서입니다"]

documents_embeddings = model.encode(documents, is_query=False)
index.add_documents(
    documents_ids=documents_ids,
    documents_embeddings=documents_embeddings,
)

# Retrieve the top-k documents (Korean sample query: "search for the first document")
retriever = retrieve.ColBERT(index=index)
queries_embeddings = model.encode(["첫번째 문서 검색"], is_query=True)

scores = retriever.retrieve(
    queries_embeddings=queries_embeddings,
    k=3,
)
print(scores)
```

## Reranking

```python
from colbert_matryoshka import MatryoshkaColBERT
from pylate import rank

# Load model
model = MatryoshkaColBERT.from_pretrained("dragonkue/colbert-ko-0.1b")
model.set_active_dim(128)

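# Korean sample queries: "artificial intelligence technology",
# "Korean natural language processing"
queries = ["인공지능 기술", "한국어 자연어처리"]

# Candidate documents, one list per query:
#   query 1: "document about AI and machine learning", "cooking recipe document"
#   query 2: "Korean NLP research", "English grammar explanation", "programming tutorial"
documents = [
    ["AI와 머신러닝에 대한 문서", "요리 레시피 문서"],
    ["한국어 NLP 연구", "영어 문법 설명", "프로그래밍 튜토리얼"],
]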

# One ID per candidate document, in the same order as `documents`
documents_ids = [
    [1, 2],
    [1, 3, 2],
]

# Encode
queries_embeddings = model.encode(queries, is_query=True)
documents_embeddings = [model.encode(docs, is_query=False) for docs in documents]

# Rerank
reranked = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
print(reranked)
```

## Available Models

| Model | Dimensions | Language |
|-------|------------|----------|
| [dragonkue/colbert-ko-0.1b](https://huggingface.co/dragonkue/colbert-ko-0.1b) | 32, 64, 96, 128 | Korean |
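
Choosing a smaller dimension mainly reduces how much storage each token embedding needs. The arithmetic below is a back-of-the-envelope sketch under stated assumptions: uncompressed float32 values (4 bytes each) and a hypothetical corpus of one million tokens. Actual index sizes will differ, since index backends such as PLAID compress embeddings.

```python
BYTES_PER_FLOAT32 = 4  # assumption: raw float32 storage, no compression


def raw_embedding_bytes(num_tokens, dim, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw size of num_tokens token embeddings with `dim` values each."""
    return num_tokens * dim * bytes_per_value


NUM_TOKENS = 1_000_000  # hypothetical corpus size, for illustration only

sizes_mb = {}
for dim in (32, 64, 96, 128):
    sizes_mb[dim] = raw_embedding_bytes(NUM_TOKENS, dim) / 1e6
    print(f"dim={dim:>3}: {sizes_mb[dim]:.0f} MB")
```

Under these assumptions the raw footprint scales linearly with the dimension, so 32 dimensions need a quarter of the space of 128.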

## License

Apache-2.0
