Metadata-Version: 2.4
Name: sinapsis-vllm
Version: 0.1.2
Summary: Sinapsis templates for LLM text completion using vLLM
Author-email: SinapsisAI <dev@sinapsis.tech>
Project-URL: Homepage, https://sinapsis.tech
Project-URL: Documentation, https://docs.sinapsis.tech/docs
Project-URL: Tutorials, https://docs.sinapsis.tech/tutorials
Project-URL: Repository, https://github.com/Sinapsis-AI/sinapsis-chatbots.git
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: sinapsis>=0.2.24
Requires-Dist: sinapsis-chatbots-base
Requires-Dist: vllm>=0.15.0
Provides-Extra: sinapsis-data-readers
Requires-Dist: sinapsis-data-readers>=0.1.23; extra == "sinapsis-data-readers"
Provides-Extra: bitsandbytes
Requires-Dist: bitsandbytes>=0.49.2; extra == "bitsandbytes"
Provides-Extra: all
Requires-Dist: sinapsis-vllm[sinapsis-data-readers]; extra == "all"
Requires-Dist: sinapsis-vllm[bitsandbytes]; extra == "all"
Dynamic: license-file

<h1 align="center">
<br>
<a href="https://sinapsis.tech/">
  <img
    src="https://github.com/Sinapsis-AI/brand-resources/blob/main/sinapsis_logo/4x/logo.png?raw=true"
    alt="" width="300">
</a>
<br>
Sinapsis vLLM
<br>
</h1>

<h4 align="center">Sinapsis Templates for LLM text completion with vLLM</h4>

<p align="center">
<a href="#installation">🐍 Installation</a> •
<a href="#features">🚀 Features</a> •
<a href="#example">📚 Usage example</a> •
<a href="#webapps">🌐 Webapps</a> •
<a href="#documentation">📙 Documentation</a> •
<a href="#license">🔍 License</a>
</p>

The `sinapsis-vllm` module provides a suite of templates to run LLMs with [vLLM](https://github.com/vllm-project/vllm), a high-throughput and memory-efficient inference engine for serving large language models.

<h2 id="installation">🐍 Installation</h2>

Install using your package manager of choice. We encourage the use of <code>uv</code>.

Example with <code>uv</code>:

```bash
  uv pip install sinapsis-vllm --extra-index-url https://pypi.sinapsis.tech
```
or with raw <code>pip</code>:
```bash
  pip install sinapsis-vllm --extra-index-url https://pypi.sinapsis.tech
```

> [!IMPORTANT]
> Templates may require extra dependencies. For development, we recommend installing the package with all the optional dependencies.

With <code>uv</code>:

```bash
  uv pip install sinapsis-vllm[all] --extra-index-url https://pypi.sinapsis.tech
```
or with raw <code>pip</code>:
```bash
  pip install sinapsis-vllm[all] --extra-index-url https://pypi.sinapsis.tech
```

<h2 id="features">🚀 Features</h2>

<h3>Templates Supported</h3>

- **vLLMTextCompletion**: Template for text completion using vLLM.

    <details>
    <summary>Attributes</summary>

    - `init_args`(`vLLMInitArgs`, required): vLLM engine configuration arguments.
      - `llm_model_name`(`str`, required): The name or path of the LLM model to use (e.g., 'Qwen/Qwen3-1.7B').
      - `tokenizer_mode`(`str`, optional): The tokenizer mode. ``"auto"`` will use the fast tokenizer if available. Defaults to ``"auto"``.
      - `trust_remote_code`(`bool`, optional): Whether to allow custom code from the model repository. Defaults to ``False``.
      - `download_dir`(`str`, optional): Directory to download and load the weights. Defaults to ``SINAPSIS_CACHE_DIR``.
      - `tensor_parallel_size`(`int`, optional): Number of GPUs to use for distributed execution. Defaults to ``1``.
      - `dtype`(`str`, optional): Data type for model weights and activations (auto, half, float16, bfloat16, float, float32). Defaults to ``"auto"``.
      - `quantization`(`str`, optional): Method used to quantize the weights (awq, fp8, gptq, etc.). Defaults to ``None``.
      - `seed`(`int`, optional): Random seed for reproducibility. Defaults to ``0``.
      - `gpu_memory_utilization`(`float`, optional): Fraction of GPU memory to be used for the model executor. Defaults to ``0.9``.
      - `max_num_seqs`(`int`, optional): Maximum number of sequences per iteration. Defaults to ``256``.
      - `max_model_len`(`int`, optional): Maximum sequence length for the model. Defaults to ``None``.
      - `cpu_offload_gb`(`float`, optional): Amount of CPU memory (in GB) to offload weights to. Defaults to ``0``.
      - `enforce_eager`(`bool`, optional): Whether to enforce eager execution instead of CUDA graphs. Defaults to ``False``.
      - `disable_log_stats`(`bool`, optional): Whether to disable logging of periodic runtime statistics. Defaults to ``False``.
    - `completion_args`(`vLLMCompletionArgs`, required): Generation arguments to pass to the selected model.
      - `temperature`(`float`, optional): Controls randomness. 0.0 = deterministic, >0.0 = random. Defaults to ``0.7``.
      - `top_p`(`float`, optional): Nucleus sampling. Considers the smallest set of most probable tokens whose cumulative probability reaches top_p. Defaults to ``1.0``.
      - `top_k`(`int`, optional): Top-k sampling. Considers the top 'k' most probable tokens. Defaults to ``-1``.
      - `min_p`(`float`, optional): Min-p sampling, filters tokens below this probability. Defaults to ``0.0``.
      - `max_tokens`(`int`, optional): Maximum number of tokens to generate per output sequence. Defaults to ``16``.
      - `min_tokens`(`int`, optional): Minimum number of tokens to generate before EOS or stop tokens. Defaults to ``0``.
      - `presence_penalty`(`float`, optional): Penalizes new tokens based on whether they appear in the text so far. Defaults to ``0.0``.
      - `frequency_penalty`(`float`, optional): Penalizes new tokens based on their frequency in the text so far. Defaults to ``0.0``.
      - `repetition_penalty`(`float`, optional): Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values above 1.0 discourage repetition. Defaults to ``1.0``.
      - `seed`(`int`, optional): Random seed to use for the generation. Defaults to ``None``.
      - `stop`(`str | list[str]`, optional): String or list of strings that stop the generation when they are generated. Defaults to ``None``.
      - `ignore_eos`(`bool`, optional): Whether to ignore the EOS token and continue generating. Defaults to ``False``.
      - `bad_words`(`list[str]`, optional): List of words that are not allowed to be generated. Defaults to ``None``.
      - `response_format`(`vLLMResponseFormat`, optional): Constrains the model output to a specific format.
        - `type`(`str`, optional): The output format type ('text' or 'json_object'). Defaults to ``"text"``.
        - `schema`(`SchemaDefinition`, optional): Schema defining the expected JSON structure when type is 'json_object'.
            - `properties`(`dict`, optional): Mapping of field names to type strings or PropertyDefinition objects.
            - `required`(`list[str]`, optional): List of required field names.
    - `chat_history_key`(`str`, optional): Key in the packet's generic_data to find the conversation history.
    - `rag_context_key`(`str`, optional): Key in the packet's generic_data to find RAG context to inject.
    - `system_prompt`(`str | Path`, optional): The system prompt (or path to one) to instruct the model.
    - `pattern`(`str`, optional): A regex pattern used to post-process the model's response.
    - `keep_before`(`bool`, optional): If True, keeps text before the 'pattern' match; otherwise, keeps text after.
    - `structured_output_key`(`str`, optional): Key used to store parsed JSON structured output in the packet's generic_data when response_format type is 'json_object'. Defaults to ``"structured_output"``.

    </details>

- **vLLMBatchTextCompletion**: Template for batched text completion using vLLM's continuous batching engine. Processes multiple conversations in a single batch for improved throughput.

    <details>
    <summary>Attributes</summary>

    Inherits all attributes from `vLLMTextCompletion`. Optimized for processing multiple text packets in parallel using vLLM's continuous batching.

    </details>

- **vLLMStreamingTextCompletion**: Streaming version of vLLMTextCompletion for real-time response generation.

    <details>
    <summary>Attributes</summary>

    Inherits all attributes from `vLLMTextCompletion`. The template yields response chunks as they are generated rather than waiting for the complete response.

    </details>

- **vLLMMultiModal**: Template for multimodal (text + image) completion using vLLM. Supports vision-language models like Qwen-VL.

    <details>
    <summary>Attributes</summary>

    - `init_args`(`vLLMMultimodalInitArgs`, required): vLLM multimodal engine arguments.
      - `llm_model_name`(`str`, required): The name or path of the VLM model to use (e.g., 'Qwen/Qwen2-VL-2B-Instruct-AWQ').
      - `trust_remote_code`(`bool`, optional): Whether to allow custom code from the model repository. Defaults to ``True``.
      - `limit_mm_per_prompt`(`dict`, optional): Maximum number of multimodal items per prompt. Defaults to ``{"image": 1}``.
      - All other attributes from `vLLMInitArgs` are also supported.
    - `completion_args`(`vLLMCompletionArgs`, required): Generation arguments to pass to the selected model. Same as `vLLMTextCompletion`.
    - `chat_history_key`(`str`, optional): Key in the packet's generic_data to find the conversation history.
    - `rag_context_key`(`str`, optional): Key in the packet's generic_data to find RAG context to inject.
    - `system_prompt`(`str | Path`, optional): The system prompt (or path to one) to instruct the model.
    - `pattern`(`str`, optional): A regex pattern used to post-process the model's response.
    - `keep_before`(`bool`, optional): If True, keeps text before the 'pattern' match; otherwise, keeps text after.
    - `structured_output_key`(`str`, optional): Key used to store parsed JSON structured output. Defaults to ``"structured_output"``.

    </details>
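
Because the batch and streaming templates inherit every attribute from `vLLMTextCompletion`, a config entry for them can, as a minimal sketch, differ only in the class name. The upstream `TextInput` template and the model name are borrowed from the usage example in this README; the `max_tokens` value is illustrative, and whether these templates need extra wiring is not stated here.

```yaml
# Sketch: same attribute layout as vLLMTextCompletion, only the class changes.
# Swap in vLLMBatchTextCompletion the same way for batched processing.
- template_name: vLLMStreamingTextCompletion
  class_name: vLLMStreamingTextCompletion
  template_input: TextInput        # assumed upstream template, as in the usage example
  attributes:
    init_args:
      llm_model_name: Qwen/Qwen3-1.7B
    completion_args:
      max_tokens: 512              # illustrative value
```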

> [!TIP]
> Use the CLI command ```sinapsis info --all-template-names``` to list all the available Template names installed with Sinapsis vLLM.

> [!TIP]
> Use CLI command ```sinapsis info --example-template-config TEMPLATE_NAME``` to produce an example Agent config for the Template specified in ***TEMPLATE_NAME***.

For example, for ***vLLMTextCompletion*** use ```sinapsis info --example-template-config vLLMTextCompletion``` to produce the following example config:

```yaml
agent:
  name: my_test_agent
templates:
- template_name: InputTemplate
  class_name: InputTemplate
  attributes: {}
- template_name: vLLMTextCompletion
  class_name: vLLMTextCompletion
  template_input: InputTemplate
  attributes:
    init_args:
      llm_model_name: '`replace_me:<class ''str''>`'
      tokenizer_mode: auto
      trust_remote_code: false
      download_dir: /path/to/.cache/sinapsis
      tensor_parallel_size: 1
      dtype: auto
      quantization: null
      seed: 0
      gpu_memory_utilization: 0.9
      max_num_seqs: 256
      max_model_len: null
      cpu_offload_gb: 0
      enforce_eager: false
      disable_log_stats: false
    completion_args:
      temperature: 0.2
      top_p: 0.95
      top_k: 40
      presence_penalty: 0.0
      frequency_penalty: 0.0
      repetition_penalty: 1.0
      min_p: 0.0
      seed: null
      stop: null
      ignore_eos: false
      max_tokens: 16
      min_tokens: 0
      bad_words: null
      response_format:
        type_: text
        schema_:
          properties: '`replace_me:dict[str, str | sinapsis_vllm.helpers.schemas.PropertyDefinition]`'
          required: '`replace_me:list[str]`'
    chat_history_key: null
    rag_context_key: null
    system_prompt: null
    pattern: null
    keep_before: true
    structured_output_key: structured_output
```
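
The `response_format` block can constrain the output to JSON. The fragment below is a minimal sketch of what the `attributes` section might look like for that case. It assumes the field names printed by the CLI (`type_`, `schema_`) and that property types are plain strings such as `string` and `integer`; the accepted type strings are not listed in this README, so treat them as placeholders.

```yaml
# Hypothetical fragment placed under `attributes:` of vLLMTextCompletion.
completion_args:
  max_tokens: 256
  response_format:
    type_: json_object
    schema_:
      properties:
        name: string     # assumed type string
        age: integer     # assumed type string
      required:
        - name
structured_output_key: structured_output   # key in generic_data for the parsed JSON
```

According to the attribute list above, the parsed JSON is stored in the packet's generic_data under `structured_output_key`.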

<h2 id="example">📚 Usage example</h2>
The following agent passes text messages through TextPackets and retrieves responses from an LLM.
<details id='usage'><summary><strong><span style="font-size: 1.0em;"> Config</span></strong></summary>

```yaml
agent:
  name: chat_completion
  description: Chatbot agent using Qwen

templates:
- template_name: InputTemplate
  class_name: InputTemplate
  attributes: {}

- template_name: TextInput
  class_name: TextInput
  template_input: InputTemplate
  attributes:
    text: what is AI?

- template_name: vLLMTextCompletion
  class_name: vLLMTextCompletion
  template_input: TextInput
  attributes:
    init_args:
      llm_model_name: Qwen/Qwen3-1.7B
      max_model_len: 4096
      dtype: auto
      seed: 42
      gpu_memory_utilization: 0.9
      cpu_offload_gb: 2
      max_num_seqs: 8
      disable_log_stats: true
    completion_args:
      max_tokens: 1024
      temperature: 0.7
      seed: 42
    system_prompt: 'You are a helpful AI assistant'
```
</details>
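
The `pattern` and `keep_before` attributes can be added to the same template to trim the response. The sketch below assumes the model wraps its reasoning in `<think>` tags and closes it with `</think>`; that is an assumption about model output, not something this README states. With `keep_before: false`, only the text after the match is kept.

```yaml
# Sketch: attributes for the vLLMTextCompletion entry in the config above.
attributes:
  init_args:
    llm_model_name: Qwen/Qwen3-1.7B
  completion_args:
    max_tokens: 1024
  system_prompt: 'You are a helpful AI assistant'
  pattern: '</think>'    # regex; assumed marker emitted by the model
  keep_before: false     # keep the text after the match
```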

<h3>Multimodal Example</h3>
The following agent processes an image and generates a description using a vision-language model:

<details><summary><strong><span style="font-size: 1.0em;"> Multimodal Config</span></strong></summary>

```yaml
agent:
  name: multimodal_chatbot
  description: Agent with support for multimodal vLLM model for image-to-text

templates:
  - template_name: InputTemplate
    class_name: InputTemplate
    attributes: {}

  - template_name: FolderImageDatasetCV2
    class_name: FolderImageDatasetCV2
    template_input: InputTemplate
    attributes:
      load_on_init: True
      data_dir: "artifacts"
      pattern: "test.png"

  - template_name: TextInput
    class_name: TextInput
    template_input: FolderImageDatasetCV2
    attributes:
      text: "Describe what you see in the image."

  - template_name: vLLMMultiModal
    class_name: vLLMMultiModal
    template_input: TextInput
    attributes:
      init_args:
        llm_model_name: "Qwen/Qwen2-VL-2B-Instruct-AWQ"
        max_model_len: 1024
        dtype: auto
        quantization: awq
        seed: 42
        gpu_memory_utilization: 0.95
        max_num_seqs: 1
        disable_log_stats: true
        enforce_eager: true
        limit_mm_per_prompt:
          image: 1
      completion_args:
        temperature: 0.7
        top_p: 0.8
        top_k: 20
        min_p: 0
        max_tokens: 1024
      system_prompt: "You are a helpful vision-language assistant."
```

> [!NOTE]
> This example uses an AWQ quantized model for lower GPU memory requirements. For GPUs with limited memory, consider using quantized models (AWQ, GPTQ) or increasing `cpu_offload_gb`.

</details>
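
For a memory-constrained GPU, the note above suggests a quantized model or a larger `cpu_offload_gb`. A minimal sketch of such an `init_args` block follows; the numeric values are illustrative and untuned, so adjust them to the hardware at hand.

```yaml
# Sketch: init_args for a GPU with limited memory; values are illustrative, not tuned.
init_args:
  llm_model_name: "Qwen/Qwen2-VL-2B-Instruct-AWQ"
  quantization: awq
  max_model_len: 1024
  gpu_memory_utilization: 0.8   # smaller fraction of GPU memory for the executor
  cpu_offload_gb: 2             # CPU memory (in GB) to offload weights to
  enforce_eager: true           # eager execution instead of CUDA graphs
```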

<h2 id="webapps">🌐 Webapps</h2>

You can interact with vLLM models using the generic chatbot webapp. The webapp works with any config by setting the `AGENT_CONFIG_PATH` environment variable.

> [!IMPORTANT]
> To run the app you first need to clone this repository:

```bash
git clone git@github.com:Sinapsis-ai/sinapsis-chatbots.git
cd sinapsis-chatbots
```

> [!NOTE]
> If you'd like to enable external app sharing in Gradio, `export GRADIO_SHARE_APP=True`

<details>
<summary id="docker"><strong><span style="font-size: 1.4em;">🐳 Docker</span></strong></summary>

**IMPORTANT**: This Docker image depends on the sinapsis-nvidia:base image. Please refer to the official [sinapsis](https://github.com/Sinapsis-ai/sinapsis?tab=readme-ov-file#docker) instructions to build with Docker.

1. **Build the sinapsis-chatbots image**:
```bash
docker compose -f docker/compose.yaml build
```
2. **Start the vLLM chatbot container**:
```bash
docker compose -f docker/compose_apps.yaml up sinapsis-vllm-chatbot -d
```
Or for the multimodal variant with image upload support:
```bash
docker compose -f docker/compose_apps.yaml up sinapsis-vllm-multimodal-chatbot -d
```
3. **Check the logs**:
```bash
docker logs -f sinapsis-vllm-chatbot
```
4. **The logs will display the URL to access the webapp, e.g.**:
```bash
Running on local URL:  http://127.0.0.1:7860
```
**NOTE**: The URL may differ; check the log output for the correct address.

5. **To stop the app**:
```bash
docker compose -f docker/compose_apps.yaml down
```

**To use a different model, update the `AGENT_CONFIG_PATH` environment variable to point to the desired YAML file.**

</details>

<details>
<summary><strong><span style="font-size: 1.25em;">💻 UV</span></strong></summary>

To run the webapp using the `uv` package manager, follow these steps:

1. **Sync the virtual environment**:

```bash
uv sync --frozen
```

2. **Install the wheel**:
```bash
uv pip install sinapsis-vllm[all] --extra-index-url https://pypi.sinapsis.tech
```

3. **Run the chatbot webapp with vLLM config**:
```bash
export AGENT_CONFIG_PATH=webapps/configs/llama_cpp_simple_chatbot/vllm_text_completion.yaml
uv run webapps/llama_cpp_simple_chatbot.py
```

Or for multimodal (image upload support):
```bash
export AGENT_CONFIG_PATH=webapps/configs/llama_cpp_simple_chatbot/vllm_multimodal.yaml
uv run webapps/llama_cpp_simple_chatbot.py
```

4. **The terminal will display the URL to access the webapp, e.g.**:

```bash
Running on local URL:  http://127.0.0.1:7860
```
**NOTE**: The URL may vary; check the terminal output for the correct address.

</details>

<h2 id="documentation">📙 Documentation</h2>

Documentation for this and other sinapsis packages is available on the [sinapsis website](https://docs.sinapsis.tech/docs).

Tutorials for different projects within sinapsis are available on the [sinapsis tutorials page](https://docs.sinapsis.tech/tutorials).


<h2 id="license">🔍 License</h2>

This project is licensed under the AGPLv3 license, which encourages open collaboration and sharing. For more details, please refer to the [LICENSE](LICENSE) file.

For commercial use, please refer to our [official Sinapsis website](https://sinapsis.tech) for information on obtaining a commercial license.
