Metadata-Version: 2.4
Name: nlp4j-llm-embedding-e5
Version: 0.1.0
Summary: Local and HTTP server embedding tools for multilingual E5
Author: Hiroki OYA
License-Expression: Apache-2.0
Project-URL: Homepage, https://github.com/oyahiroki/nlp4j-llm-embeddings-e5
Project-URL: Repository, https://github.com/oyahiroki/nlp4j-llm-embeddings-e5
Project-URL: Issues, https://github.com/oyahiroki/nlp4j-llm-embeddings-e5/issues
Keywords: nlp,embedding,sentence-transformers,multilingual-e5,semantic-search,japanese
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE.txt
Requires-Dist: sentence-transformers>=2.6.0
Requires-Dist: tqdm>=4.66.0
Requires-Dist: psutil>=5.9.0
Provides-Extra: dev
Requires-Dist: build>=1.2.0; extra == "dev"
Requires-Dist: twine>=5.0.0; extra == "dev"
Dynamic: license-file


# nlp4j-llm-embeddings-e5

Local and HTTP server tools for generating multilingual E5 embeddings.

This project provides command-line and HTTP server utilities for using the `intfloat/multilingual-e5-large` model locally. It is designed for embedding JSONL data, building local semantic search workflows, and exposing embedding functions over a simple HTTP API.

The implementation uses `sentence-transformers` internally and applies E5-style prefixes such as `passage:` and `query:` automatically.

## Features

* Generate embeddings for JSONL files
* Add embedding vectors to a specified JSON attribute
* Use multilingual E5 embeddings locally
* Run a lightweight HTTP embedding server
* Embed documents with the `passage:` prefix
* Run query/document semantic search with the `query:` and `passage:` prefixes
* Calculate cosine similarity
* Optionally check token counts
* Process local JSONL files in batches
* Warm up the model in server mode

## Model

The default model is:

```text
intfloat/multilingual-e5-large
```

E5 models are designed to work with explicit text prefixes.

For document embeddings:

```text
passage: your document text
```

For query embeddings:

```text
query: your search query
```

This project automatically adds these prefixes depending on the selected mode.

## Project Structure

```text
.
├── LICENSE.txt
├── README.md
├── README_ja.md
├── docker
│   ├── Dockerfile
│   └── README.md
├── examples
│   ├── index.html
│   ├── nlp4j-embedding-local-e5-bench-example_input_ja_1.txt
│   ├── nlp4j-embedding-local-e5-bench.py
│   ├── nlp4j-embedding-local-openai.py
│   ├── test2.txt
│   ├── test3.txt
│   └── test_json.txt
├── pyproject.toml
├── requirements.txt
└── src
    └── nlp4j_embedding
        ├── __init__.py
        ├── e5_model.py
        ├── local_e5.py
        ├── request_handler.py
        └── server_e5.py
```

## Installation

### Install from source

```bash
git clone https://github.com/oyahiroki/nlp4j-llm-embeddings-e5.git
cd nlp4j-llm-embeddings-e5
pip install .
```

For development:

```bash
pip install -e .
```

### Install dependencies manually

```bash
pip install -r requirements.txt
```

For GPU acceleration, install a PyTorch build that matches your CUDA environment.

## Commands

After installation, the following commands are available:

```bash
nlp4j-embedding-local-e5
nlp4j-embedding-server-e5
```

## Local JSONL Embedding

The local command reads a JSONL file, embeds text from a specified attribute, and writes a new JSONL file with an embedding vector added.

### Basic usage

```bash
nlp4j-embedding-local-e5 input.jsonl output.jsonl
```

By default, it reads text from the `text` attribute and writes the vector to the `vector` attribute.

Input example:

```json
{"id": "1", "text": "Kyoto is a city in Japan."}
{"id": "2", "text": "Tokyo is the capital of Japan."}
```

Output example:

```json
{"id": "1", "text": "Kyoto is a city in Japan.", "vector": [0.0123, -0.0456, ...]}
{"id": "2", "text": "Tokyo is the capital of Japan.", "vector": [0.0234, -0.0567, ...]}
```
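
The input-to-output transformation shown above can be sketched in a few lines of Python. This is only an illustration of the record format, not the code in `local_e5.py`: `add_vectors` and `fake_embed` are hypothetical names, and `fake_embed` is a stand-in that returns dummy two-dimensional vectors instead of calling the E5 model.

```python
import io
import json
from typing import Callable, List, TextIO


def add_vectors(
    src: TextIO,
    dst: TextIO,
    embed: Callable[[List[str]], List[List[float]]],
    text_attr: str = "text",
    vector_attr: str = "vector",
) -> int:
    """Copy JSONL records from src to dst, adding one vector per record."""
    # Skip blank lines; every other line must be a JSON object.
    records = [json.loads(line) for line in src if line.strip()]
    vectors = embed([record[text_attr] for record in records])
    for record, vector in zip(records, vectors):
        record[vector_attr] = list(vector)
        # ensure_ascii=False keeps Japanese text readable in the output.
        dst.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records)


def fake_embed(texts: List[str]) -> List[List[float]]:
    # Hypothetical stand-in for the real model: a dummy 2-dimensional vector.
    return [[float(len(text)), 0.0] for text in texts]


source = io.StringIO(
    '{"id": "1", "text": "Kyoto is a city in Japan."}\n'
    '{"id": "2", "text": "Tokyo is the capital of Japan."}\n'
)
target = io.StringIO()
count = add_vectors(source, target, fake_embed)
output_records = [json.loads(line) for line in target.getvalue().splitlines()]
print(count, output_records[0])
```

All original attributes are preserved; only the vector attribute is added to each record.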

### Specify input and output attributes

```bash
nlp4j-embedding-local-e5 input.jsonl output.jsonl \
  --text-attr body \
  --vector-attr embedding
```

### Specify E5 text type

For document embeddings, use `passage`:

```bash
nlp4j-embedding-local-e5 input.jsonl output.jsonl \
  --text-type passage
```

For query embeddings, use `query`:

```bash
nlp4j-embedding-local-e5 queries.jsonl queries_with_vectors.jsonl \
  --text-type query
```

To disable automatic E5 prefixing:

```bash
nlp4j-embedding-local-e5 input.jsonl output.jsonl \
  --text-type none
```

### Batch size

```bash
nlp4j-embedding-local-e5 input.jsonl output.jsonl \
  --batch-size 32
```
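
To illustrate what a batch size means here, the sketch below splits a list of texts into consecutive batches. It is a generic helper written for this README (`iter_batches` is a hypothetical name), not the batching code used by the command itself.

```python
from typing import Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


texts = [f"document {i}" for i in range(70)]
batch_lengths = [len(batch) for batch in iter_batches(texts, 32)]
print(batch_lengths)  # [32, 32, 6]
```

With 70 texts and a batch size of 32, the model would be called three times; the last batch is smaller than the others.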

### Token length

```bash
nlp4j-embedding-local-e5 input.jsonl output.jsonl \
  --max-length 512
```

### Token count warning

```bash
nlp4j-embedding-local-e5 input.jsonl output.jsonl \
  --check-token-count
```

If the token count exceeds `--max-length`, a warning is printed.

### Verbose mode

```bash
nlp4j-embedding-local-e5 input.jsonl output.jsonl \
  --verbose
```

## HTTP Embedding Server

Start the server:

```bash
nlp4j-embedding-server-e5
```

The default host is `127.0.0.1` and the default port is `8888`.

```bash
nlp4j-embedding-server-e5 --host 127.0.0.1 --port 8888
```

By default, the model is loaded and warmed up at server startup.

To skip warmup:

```bash
nlp4j-embedding-server-e5 --no-warmup
```

## HTTP API

The server provides the following endpoints:

```text
/embeddings
/semantic_search
/cos_sim
```

## `/embeddings`

Generate an embedding for a single text.

This endpoint is intended for document embeddings and uses the E5 `passage:` prefix internally.

### GET

```bash
curl "http://127.0.0.1:8888/embeddings?text=This%20is%20a%20test."
```
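
The text in a GET request must be percent-encoded. A minimal sketch of building the URL with the Python standard library follows; `embeddings_url` is a hypothetical helper, and the base URL assumes the default host and port.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://127.0.0.1:8888"  # default host and port of the server


def embeddings_url(text: str, base_url: str = BASE_URL) -> str:
    """Build the GET URL for /embeddings with a percent-encoded text."""
    # quote_via=quote encodes spaces as %20 instead of '+'.
    query = urlencode({"text": text}, quote_via=quote)
    return f"{base_url}/embeddings?{query}"


url = embeddings_url("This is a test.")
print(url)
```

Non-ASCII text such as Japanese is encoded as UTF-8 bytes, so the same helper works for multilingual input.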

### POST

```bash
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"text":"This is a test."}' \
  http://127.0.0.1:8888/embeddings
```

### Response example

```json
{
  "message": "ok",
  "time": "2026-06-20T12:00:00",
  "text": "This is a test.",
  "embeddings": [0.0123, -0.0456, 0.0789]
}
```
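
A Python client for the POST variant might look like the sketch below. It only uses the standard library. The function names are hypothetical, and treating any `message` other than `"ok"` as a failure is an assumption, since the error format is not described here.

```python
import json
import urllib.request
from typing import List


def build_embeddings_request(
    text: str, base_url: str = "http://127.0.0.1:8888"
) -> urllib.request.Request:
    """Prepare a JSON POST request for the /embeddings endpoint."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embeddings_response(raw: bytes) -> List[float]:
    """Extract the vector from a response shaped like the example above."""
    payload = json.loads(raw.decode("utf-8"))
    # Assumption: a successful response carries "message": "ok".
    if payload.get("message") != "ok":
        raise RuntimeError(f"unexpected response: {payload.get('message')!r}")
    return payload["embeddings"]


request = build_embeddings_request("This is a test.")
# To send it to a running server:
# with urllib.request.urlopen(request) as response:
#     vector = parse_embeddings_response(response.read())

# Parsing the sample response from this README, without a server:
sample = (
    b'{"message": "ok", "time": "2026-06-20T12:00:00", '
    b'"text": "This is a test.", "embeddings": [0.0123, -0.0456, 0.0789]}'
)
vector = parse_embeddings_response(sample)
print(vector)
```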

### Token count check

```bash
curl "http://127.0.0.1:8888/embeddings?text=This%20is%20a%20test.&checktokencount=true"
```

## `/semantic_search`

Run semantic search between a query and one or more candidate texts.

The query is encoded with the E5 `query:` prefix.
The candidate texts are encoded with the E5 `passage:` prefix.

### GET

The GET API supports one query text and one candidate text.

```bash
curl "http://127.0.0.1:8888/semantic_search?text1=This%20is%20a%20test.&text2=This%20is%20an%20exam."
```

### POST

The POST API supports multiple candidate texts.

```bash
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"text":"Japanese NLP","texts":["GiNZA is a Japanese NLP library.","This document is about image processing."]}' \
  http://127.0.0.1:8888/semantic_search
```

### Response example

```json
{
  "message": "ok",
  "time": "2026-06-20T12:00:00",
  "text": "Japanese NLP",
  "r": [
    {
      "corpus_id": 0,
      "score": 0.8234
    },
    {
      "corpus_id": 1,
      "score": 0.3123
    }
  ]
}
```
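
On the client side, the `r` list can be joined back to the candidate texts. The sketch below assumes that `corpus_id` is the zero-based index of a candidate in the `texts` array of the request, which the example suggests but this README does not state explicitly. `rank_candidates` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def rank_candidates(response: Dict, texts: List[str]) -> List[Tuple[float, str]]:
    """Return (score, text) pairs ordered from best to worst match."""
    # Sort explicitly rather than relying on the server's ordering.
    hits = sorted(response["r"], key=lambda hit: hit["score"], reverse=True)
    # Assumption: corpus_id indexes into the texts sent in the request.
    return [(hit["score"], texts[hit["corpus_id"]]) for hit in hits]


texts = [
    "GiNZA is a Japanese NLP library.",
    "This document is about image processing.",
]
response = {
    "message": "ok",
    "text": "Japanese NLP",
    "r": [
        {"corpus_id": 0, "score": 0.8234},
        {"corpus_id": 1, "score": 0.3123},
    ],
}
ranked = rank_candidates(response, texts)
for score, text in ranked:
    print(f"{score:.4f}  {text}")
```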

## `/cos_sim`

Calculate cosine similarity between two texts.

By default, this endpoint does not add an E5 prefix to either text. It is intended as a simple compatibility endpoint for comparing two raw texts.

For retrieval-style search, `/semantic_search` is recommended because it applies `query:` and `passage:` prefixes correctly.

### GET

```bash
curl "http://127.0.0.1:8888/cos_sim?text1=This%20is%20a%20test.&text2=This%20is%20an%20exam."
```

### POST

```bash
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{"text1":"This is a test.","text2":"This is an exam.","checktokencount":true}' \
  http://127.0.0.1:8888/cos_sim
```

### Response example

```json
{
  "text1": "This is a test.",
  "text2": "This is an exam.",
  "cosine_similarity": 0.8123
}
```
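
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The pure-Python sketch below shows the formula on small vectors; it is not the code the server runs, which operates on the model's embeddings.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return dot(a, b) / (|a| * |b|) for two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # same direction
```

The score ranges from -1 (opposite directions) to 1 (same direction) and does not depend on vector length.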

## Python API

You can also use the internal Python functions directly.

```python
from nlp4j_embedding import e5_model

vector, elapsed = e5_model.embed_text(
    "Kyoto is a city in Japan.",
    text_type="passage"
)

print(vector)
print(elapsed)
```

Batch embedding:

```python
from nlp4j_embedding import e5_model

vectors, elapsed = e5_model.embed_texts(
    [
        "Kyoto is a city in Japan.",
        "Tokyo is the capital of Japan."
    ],
    text_type="passage"
)

print(vectors)
```

Semantic search:

```python
from nlp4j_embedding import e5_model

results = e5_model.semantic_search(
    "Japanese city",
    [
        "Kyoto is a city in Japan.",
        "Python is a programming language."
    ]
)

print(results)
```

Cosine similarity:

```python
from nlp4j_embedding import e5_model

score = e5_model.cos_sim(
    "This is a test.",
    "This is an exam."
)

print(score)
```

## Notes on E5 Prefixes

E5 models expect input text to be prefixed depending on the task.

For search queries:

```text
query: ...
```

For documents or passages:

```text
passage: ...
```

This project automatically adds the prefix unless the text already starts with `query:` or `passage:`.

For local JSONL embedding, the default text type is `passage`.

```bash
nlp4j-embedding-local-e5 input.jsonl output.jsonl
```

This is equivalent to:

```bash
nlp4j-embedding-local-e5 input.jsonl output.jsonl --text-type passage
```
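
The prefix rule described in this section can be sketched as follows. `add_e5_prefix` is a hypothetical function written for illustration; the actual logic lives in `e5_model.py` and may differ in details such as whitespace handling.

```python
def add_e5_prefix(text: str, text_type: str = "passage") -> str:
    """Prepend an E5 prefix unless one is already present or disabled."""
    if text_type == "none":
        # Prefixing is disabled: pass the text through unchanged.
        return text
    if text_type not in ("passage", "query"):
        raise ValueError(f"unknown text type: {text_type!r}")
    # Leave texts alone if they already carry an E5 prefix.
    if text.startswith(("query:", "passage:")):
        return text
    return f"{text_type}: {text}"


print(add_e5_prefix("Kyoto is a city in Japan."))
print(add_e5_prefix("Japanese city", text_type="query"))
print(add_e5_prefix("query: Japanese city"))
```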

## Performance Notes

The first run can be slow because the model must be downloaded and loaded.

The server command warms up the model by default so that the first HTTP request does not need to load the model.

```bash
nlp4j-embedding-server-e5
```

To skip warmup:

```bash
nlp4j-embedding-server-e5 --no-warmup
```

For large JSONL files, increase or decrease the batch size depending on available memory and GPU capacity.

```bash
nlp4j-embedding-local-e5 input.jsonl output.jsonl --batch-size 64
```

## Docker

A Dockerfile is provided in the `docker` directory.

```bash
cd docker
```

See `docker/README.md` for Docker-specific usage.

## License

This project is licensed under the Apache License 2.0.

See `LICENSE.txt` for details.

## Author

Hiroki OYA
