Metadata-Version: 2.4
Name: rwkv-emb
Version: 0.0.6
Summary: The EmbeddingRWKV Model
Author: Haowen HOU
Project-URL: Homepage, https://github.com/howard-hou/EmbeddingRWKV
Project-URL: Bug Tracker, https://github.com/howard-hou/EmbeddingRWKV/issues
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Dynamic: license-file

# EmbeddingRWKV

A high-efficiency text embedding and reranking model based on RWKV architecture.

## 🚀 Quick Start (End-to-End)

Get text embeddings in just a few lines. The tokenizer and model are designed to work seamlessly together.

> **Note**: Always set `add_eos=True` during tokenization. The model relies on the EOS token (`65535`) to mark the end of a sentence for correct embedding generation.

```python
import os
# Set environment for JIT compilation (Optional, set to '1' for CUDA acceleration)
os.environ["RWKV_CUDA_ON"] = '1'

from rwkv_emb.tokenizer import RWKVTokenizer
from rwkv_emb.model import EmbeddingRWKV

# 1. Initialize Tokenizer & Model
model = EmbeddingRWKV(model_path='/path/to/model.pth')
tokenizer = RWKVTokenizer()

# 2. Tokenize (Text -> Tokens)
text = "Hello world! This is RWKV embedding."
# Important: Enable add_eos=True to append the required EOS token (65535)
tokens = tokenizer.encode(text, add_eos=True)
print(f"Tokens: {tokens}") 

# 3. Inference (Tokens -> Embedding)
embedding, state = model.forward(tokens, None)

print(f"Embedding shape: {embedding.shape}")
print(embedding.shape)
```

## ⚡ Batch Inference & Performance Guide

For production use cases, running inference in batches is significantly faster.

### ⚠️ Critical Performance Tip: Pad to Same Length

While the model supports batches with variable sequence lengths, **we strongly recommend padding all sequences to the same length** for maximum GPU throughput.

  - **Pad Token**: `0`
  - **Performance**: Fixed-length batches allow the CUDA kernel to parallelize computation efficiently. Variable-length batches will trigger a slower execution path.

### Batch Example (Recommended)

```python
# Example: Batching two sentences of different lengths
sentences = [
    "Short sentence.",
    "This is a slightly longer sentence for demonstration."
]

# 1. Tokenize all
batch_tokens = [tokenizer.encode(s, add_eos=True) for s in sentences]

# 2. Left pad to the longest sequence in the batch using 0
max_len = max(len(t) for t in batch_tokens)

for i in range(len(batch_tokens)):
    pad_len = max_len - len(batch_tokens[i])
    # insert 0s to the beginning
    batch_tokens[i] = [0] * pad_len + batch_tokens[i]

# batch_tokens is now a rectangular matrix (List of Lists with same length)
print(f"Padded Batch: {batch_tokens}")

# 3. Fast Inference
embeddings, states = model.forward(batch_tokens, None)

# embeddings shape: [Batch_Size, Embedding_Dim]
print("Batch Embeddings:", embeddings.shape)
print("Batch States:", states.shape)
```
