Metadata-Version: 2.4
Name: llemb
Version: 0.3.0
Summary: Embedding extractor for decoder-only LLMs
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: accelerate>=1.10.1
Requires-Dist: bitsandbytes>=0.48.2
Requires-Dist: numpy>=2.0.2
Requires-Dist: torch>=2.0.0
Requires-Dist: tqdm>=4.66.0
Requires-Dist: transformers>=4.30.0
Provides-Extra: quantization
Provides-Extra: vllm
Requires-Dist: vllm>=0.10.0; extra == "vllm"
Dynamic: license-file

[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/llemb?logo=pypi&style=flat&color=blue)](https://pypi.org/project/llemb/)
[![PyPI - Package Version](https://img.shields.io/pypi/v/llemb?logo=pypi&style=flat&color=orange)](https://pypi.org/project/llemb/)
[![License](https://img.shields.io/github/license/j341nono/llemb?logo=github&style=flat&color=green)](https://github.com/j341nono/llemb/blob/main/LICENSE)

# llemb: Unified Embedding Extraction from Decoder-only LLMs

**llemb** is a lightweight framework designed to extract high-quality sentence embeddings from Decoder-only Large Language Models (LLMs) like Llama, Mistral, and others. It unifies various state-of-the-art pooling strategies and efficiency optimizations into a simple, coherent interface.

With `llemb`, you can easily leverage powerful LLMs for embedding tasks using advanced techniques like **PromptEOL** and **PCoTEOL**, with built-in support for **Batch Processing** and **vLLM** for high-throughput inference.

> **Origin of the Name**
> **llemb** is a coined name derived from **LLM** and **Embedding**. In Japanese pronunciation, the letter *M* in **LLM** is read as /ɛm/ (emu), and **Embedding** begins with *Em*, which is pronounced /ɛm/ as well. Overlapping this sound reflects the idea of merging LLMs and embeddings.

## Features

- **Flexible Backends**: Seamless support for **Hugging Face Transformers** and **vLLM** (for high-speed inference).
- **Advanced Pooling & Prompting**:
    - **Pooling Methods**: `mean`, `last_token`, `eos_token`
    - **Prompt Templates**: `prompteol`, `pcoteol` (Pretended Chain of Thought), `ke` (Knowledge Enhancement)
    - **Independent Control**: Combine any pooling method with any prompt template
- **High-Performance**:
    - **Batch Processing**: Efficiently handles large datasets with automatic CPU offloading and progress bars (`tqdm`).
    - **Quantization**: Native support for **4-bit and 8-bit** via `bitsandbytes` (Transformers) or `fp8`/`awq`/`gptq` (vLLM).
- **Granular Control**: Extract embeddings from any layer (defaults to recommended layers based on research).

## Installation

Install via PyPI using `pip` or `uv`.

**Basic Installation (includes quantization support)**

```bash
pip install llemb
# or
uv add llemb
```

Starting from v0.2.2, `bitsandbytes` for quantization is included by default.

**With vLLM Support**

To enable the vLLM backend for faster inference:

```bash
pip install "llemb[vllm]"
# or
uv add llemb[vllm]
```

## Getting Started

Initialize the encoder and start extracting embeddings in just a few lines of code.

### Basic Usage

```python
import llemb

# 1. Initialize the encoder (defaults to auto-device detection)
enc = llemb.Encoder("meta-llama/Llama-3.1-8B")

# 2. Extract embeddings using mean pooling
embeddings = enc.encode("Hello world", pooling_method="mean")

print(embeddings.shape)
# => (1, 4096)
```

### Batch Processing (Recommended for Large Data)

Process lists of texts efficiently. `llemb` handles batching, progress tracking with `tqdm`, and memory management by automatically offloading computed embeddings to the CPU to prevent GPU VRAM saturation.

```python
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning is fascinating.",
    # ... thousands of texts ...
]

# Process in batches of 32
embeddings = enc.encode(texts, batch_size=32, pooling_method="mean")

print(embeddings.shape)
# => (N, 4096)
```

### Using vLLM Backend

For maximum throughput on supported hardware, use the vLLM backend. You can pass vLLM-specific arguments (like `tensor_parallel_size` or `gpu_memory_utilization`) directly to the encoder.

```python
enc = llemb.Encoder(
    "meta-llama/Llama-3.1-8B",
    backend="vllm",
    tensor_parallel_size=1,       # Number of GPUs
    gpu_memory_utilization=0.9    # vLLM memory setting
)

embeddings = enc.encode("Hello vLLM", pooling_method="last_token")
```

### Using Prompt Templates

Leverage research-backed prompt templates to improve embedding quality. When you specify a template, `pooling_method` automatically defaults to `last_token`.

```python
import llemb

enc = llemb.Encoder("meta-llama/Llama-3.1-8B")

# Simple usage - pooling_method automatically defaults to last_token
embeddings = enc.encode(
    "Hello world",
    prompt_template="prompteol"
)

# Use PCoTEOL (Pretended Chain of Thought) template
# Note: Automatically uses layer -2 when prompt_template is "pcoteol" or "ke"
embeddings = enc.encode(
    "Hello world",
    prompt_template="pcoteol"
)

# Or override the layer index
embeddings = enc.encode(
    "Hello world",
    prompt_template="pcoteol",
    layer_index=-1  # Override default -2
)
```

### Advanced Usage (Quantization)

Use quantization to reduce memory usage.

```python
import llemb

# Initialize with 4-bit quantization and force CUDA
enc = llemb.Encoder(
    model_name="meta-llama/Llama-3.1-8B",
    backend="transformers",
    device="cuda",
    quantization="4bit"
)

embeddings = enc.encode(
    "Hello world",
    pooling_method="mean",
    prompt_template="ke"  # Knowledge Enhancement template
)
```

## Configuration & Optimization

`llemb` passes arguments directly to the backend, allowing for deep customization.

**Using Flash Attention 2**

```python
import torch

encoder = llemb.Encoder(
    model_name="meta-llama/Llama-3.1-8B",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16
)
```

**Custom Quantization Config**

```python
from transformers import BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

encoder = llemb.Encoder(
    model_name="meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config
)
```

## Supported Pooling Methods & Prompt Templates

### Pooling Methods (`pooling_method` parameter)

| Method | Description | Default Layer |
| --- | --- | --- |
| `mean` | Average pooling of all tokens (excluding padding). | -1 (Last) |
| `last_token` | Vector of the last generated token. | -1 (Last) |
| `eos_token` | Vector corresponding to the EOS token position. | -1 (Last) |

### Prompt Templates (`prompt_template` parameter)

| Template | Description | Default Layer |
| --- | --- | --- |
| `prompteol` | Simple prompt template targeting the last token. | -1 (Last) |
| `pcoteol` | "Pretended Chain of Thought" - wraps input in a reasoning template. | -2 |
| `ke` | "Knowledge Enhancement" - wraps input in a context-aware template. | -2 |

You can combine any pooling method with any prompt template. When using `pcoteol` or `ke` templates, the default layer is automatically set to -2 unless explicitly overridden.

**Smart Defaults (v0.2.2+):**
- When `prompt_template` is specified, `pooling_method` automatically defaults to `last_token`
- When no `prompt_template` is specified, `pooling_method` defaults to `mean`
- Explicit values always take precedence

## Migration Guide

If you're upgrading from an earlier version of `llemb`, the API has been refactored to separate pooling methods and prompt templates into orthogonal parameters.

### Breaking Changes in v0.2.2

1. **API Refactoring**: `pooling` parameter split into `pooling_method` and `prompt_template`
2. **Smart Defaults**: `pooling_method` automatically set to `last_token` when using templates
3. **Dependencies**: `bitsandbytes` is now a core dependency (no longer optional)
4. **Removed**: `index` pooling strategy has been removed

**Old API (deprecated):**
```python
# Old: pooling parameter mixed strategies and templates
enc.encode("text", pooling="mean")
enc.encode("text", pooling="pcoteol")  # Mixed template + pooling
enc.encode("text", pooling="index", token_index=0)  # Index strategy
```

**New API (v0.2.2+):**
```python
# New: separate parameters with smart defaults
enc.encode("text", pooling_method="mean")
enc.encode("text", prompt_template="pcoteol")  # pooling_method auto-set to last_token
# Index strategy has been removed
```

### Migration Examples

| Old Code | New Code (Explicit) | New Code (Smart Default) |
| --- | --- | --- |
| `enc.encode(text, pooling="mean")` | `enc.encode(text, pooling_method="mean")` | `enc.encode(text)` |
| `enc.encode(text, pooling="last_token")` | `enc.encode(text, pooling_method="last_token")` | — |
| `enc.encode(text, pooling="eos_token")` | `enc.encode(text, pooling_method="eos_token")` | — |
| `enc.encode(text, pooling="prompteol")` | `enc.encode(text, pooling_method="last_token", prompt_template="prompteol")` | `enc.encode(text, prompt_template="prompteol")` |
| `enc.encode(text, pooling="pcoteol")` | `enc.encode(text, pooling_method="last_token", prompt_template="pcoteol")` | `enc.encode(text, prompt_template="pcoteol")` |
| `enc.encode(text, pooling="ke")` | `enc.encode(text, pooling_method="last_token", prompt_template="ke")` | `enc.encode(text, prompt_template="ke")` |
| `enc.encode(text, pooling="index", token_index=0)` | *Removed - use `last_token` or implement custom logic* | — |

### Benefits of the New API

1. **Orthogonality**: You can now combine any pooling method with any prompt template.
2. **Clarity**: The separation makes it clear what each parameter does.
3. **Flexibility**: Easier to experiment with different combinations.
4. **Smart Defaults**: 
   - `pooling_method` automatically set to `last_token` when using templates
   - Layer indices automatically set based on prompt template (can be overridden)
5. **Simpler Usage**: Less boilerplate for common use cases with templates.

## Development

Clone the repository and sync dependencies using `uv`:

```bash
git clone https://github.com/j341nono/llemb.git
cd llemb
uv sync --all-extras --dev
```

**Run Tests**

```bash
uv run pytest
```

**Static Analysis**

```bash
uv run ruff check src
uv run mypy src
```

## Citations

If you use the advanced pooling strategies implemented in this library, please cite the respective original papers:

**PromptEOL:**

```bibtex
@inproceedings{jiang-etal-2024-scaling,
    title = "Scaling Sentence Embeddings with Large Language Models",
    author = "Jiang, Ting and Huang, Shaohan and Luan, Zhongzhi and Wang, Deqing and Zhuang, Fuzhen",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
    year = "2024"
}
```

**PCoTEOL and KE:**

```bibtex
@article{zhang2024simple,
    title={Simple Techniques for Enhancing Sentence Embeddings in Generative Language Models},
    author={Zhang, Bowen and Chang, Kehua and Li, Chunping},
    journal={arXiv preprint arXiv:2404.03921},
    year={2024}
}
```

## License

This project is open source and available under the [Apache-2.0 license](https://www.google.com/search?q=LICENSE).
