Metadata-Version: 2.4
Name: inferscale
Version: 0.1.2
Summary: Inference-time model selection and ensembling for LLM outputs
Author: Mohamed Baddar
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Requires-Dist: transformers
Requires-Dist: torch
Dynamic: license-file

<p align="center">
  <img src="assets/inferscale_logo.png" width="300">
</p>

# InferScale is an open-source inference scaling platform for large language models.

## The Problem

Improving the quality of responses generated by Large Language Models (LLMs) for tasks such as **question answering**, **summarization**, and **content generation** remains a key challenge for AI developers.

Common approaches include:

- **Fine-tuning models** on task-specific datasets  
- **Prompt engineering and optimization**  
- **Retrieval-Augmented Generation (RAG)** pipelines  

However, these approaches often require:

- Large training datasets  
- Expensive computing resources  
- Dependence on large proprietary models or third-party APIs  

An alternative and more **budget-efficient approach** is **inference-time scaling**.

Instead of modifying the model itself, inference-time scaling improves output quality by:

1. Generating **multiple candidate responses**
2. Evaluating them using a **scoring function**
3. Selecting the **best response automatically**

This approach allows developers to **improve response quality without expensive training or larger models**, making it particularly attractive for **cost-constrained or production environments**.
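
The generate, score, and select steps can be sketched in a few lines of plain Python. This is a minimal illustration, not InferScale's implementation: `best_of_n`, `generate`, and `score` are hypothetical names, and the toy generator and length-based scorer stand in for a real model and a real scoring function.

```python
import random
from typing import Callable


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 3,
) -> str:
    """Generate n candidates for prompt and return the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Step 1: generate multiple candidate responses.
    candidates = [generate(prompt) for _ in range(n)]
    # Steps 2 and 3: score every candidate and keep the best one.
    return max(candidates, key=lambda candidate: score(prompt, candidate))


# Toy stand-ins (assumptions for illustration only).
rng = random.Random(0)


def toy_generate(prompt: str) -> str:
    # Pretend to sample: return a random-length prefix of the prompt's words.
    words = prompt.split()
    return " ".join(words[: rng.randint(1, len(words))])


def toy_score(prompt: str, candidate: str) -> float:
    # Pretend quality metric: longer candidates score higher.
    return float(len(candidate.split()))


best = best_of_n("the quick brown fox jumps", toy_generate, toy_score, n=5)
```

Passing the generator and scorer as callables keeps the selection logic independent of any particular model or metric.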

# InferScale

**InferScale** is a lightweight Python library that improves LLM output quality using **inference-time scaling techniques** such as **Best-of-N sampling across multiple models**.

Instead of relying on expensive fine-tuning or larger models, InferScale generates multiple candidate responses and automatically selects the best one using lightweight scoring methods.

The goal is to help AI developers **focus on building AI applications**, while InferScale handles **candidate generation and response selection**.

---

# Architecture

The current architecture of InferScale is shown below:

<img width="1200" height="327" alt="InferScale architecture" src="https://github.com/user-attachments/assets/1006af4b-4718-49a3-880c-389c3987be3d" />

Pipeline overview:

1. Multiple LLM models generate candidate responses
2. Each model can generate **N samples**
3. All responses are collected
4. A scoring mechanism selects the **best candidate**
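
The pipeline overview can be sketched as a pooling step followed by a selection step. The function names, the `(model_name, response)` tuple layout, and the stub generators below are assumptions made for illustration; they are not taken from InferScale's code.

```python
from typing import Callable


def collect_candidates(
    generators: dict[str, Callable[[str], str]],
    prompt: str,
    n: int,
) -> list[tuple[str, str]]:
    """Ask every model for n samples and pool them as (model_name, response)."""
    pool = []
    for model_name, generate in generators.items():
        for _ in range(n):
            pool.append((model_name, generate(prompt)))
    return pool


def select_best(
    pool: list[tuple[str, str]],
    score: Callable[[str], float],
) -> tuple[str, str]:
    """Return the (model_name, response) pair whose response scores highest."""
    return max(pool, key=lambda item: score(item[1]))


# Stub "models" standing in for real LLMs (hypothetical).
generators = {
    "model_a": lambda prompt: prompt.upper(),
    "model_b": lambda prompt: prompt + " " + prompt,
}

pool = collect_candidates(generators, "short summary", n=2)
winner = select_best(pool, score=len)  # toy score: response length
```

Keeping the model name next to each response makes it possible to report which model produced the selected output.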

---

## How InferScale Works

InferScale implements a simple **inference-time scaling strategy** to improve LLM response quality without additional training or expensive models.

The core idea:

Generate multiple candidate responses from multiple models and automatically select the best one.

This approach leverages **model diversity and response sampling** to increase the probability of obtaining a higher-quality output.

---

### Step-by-Step Process

1. Load the Models

The library loads the candidate models from Hugging Face.

---

2. Generate Multiple Responses

Each model generates **N candidate responses** for the same input.

Example:

Input Article

- Response A1  
- Response A2  
- Response A3  


This creates a pool of candidate outputs.
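
One way to draw N samples from a single sequence-to-sequence model is to enable sampling in the `transformers` `generate` call. The sketch below is an assumption about how such a step could look, not InferScale's internal code; `sample_candidates` is a hypothetical helper, and calling it downloads the model weights from the Hugging Face Hub.

```python
def sample_candidates(model_name, text, n=3, max_new_tokens=64):
    """Return n sampled outputs from a Hugging Face seq2seq model."""
    # Heavy dependencies are imported inside the function so that merely
    # defining the helper does not load torch or transformers.
    import torch
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    model.eval()

    # Truncate long articles to the model's maximum input length.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            do_sample=True,          # sample instead of greedy decoding
            num_beams=1,             # plain sampling, overriding any beam-search default
            num_return_sequences=n,  # one output row per candidate
            max_new_tokens=max_new_tokens,
        )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```

For example, `sample_candidates("google/pegasus-xsum", article, n=3)` would return three sampled summaries of `article`.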

---

3. Compute Semantic Similarity

All responses are embedded using a sentence embedding model.  
InferScale then computes **cosine similarity scores** to estimate the semantic quality of each response.
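
Cosine similarity between two embedding vectors is their dot product divided by the product of their norms. The sketch below scores toy vectors in plain Python. The hand-written embeddings, and the choice of comparing each response with an embedding of the input text, are assumptions: the description above does not specify the reference vector or the embedding model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Toy 3-dimensional "embeddings" (made up for illustration).
reference = [1.0, 0.0, 0.0]          # assumed: embedding of the input text
responses = {
    "Response A1": [2.0, 0.0, 0.0],  # same direction as the reference
    "Response A2": [1.0, 1.0, 0.0],  # 45 degrees away from the reference
    "Response A3": [0.0, 3.0, 0.0],  # orthogonal to the reference
}

scores = {name: cosine_similarity(vec, reference) for name, vec in responses.items()}
best = max(scores, key=scores.get)  # name of the highest-scoring response
```

Because cosine similarity ignores vector length, "Response A1" scores highest here even though its embedding is twice as long as the reference.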

---

4. Select the Best Response

The response with the **highest similarity score** is selected as the final output.

Candidate Responses  
↓  
Embedding + Cosine Similarity  
↓  
Best Scoring Response  
↓  
Final Output  

## Installation

`pip install inferscale datasets sentence-transformers rich`

## Example
```python
import json
from inferscale.best_of_n import BestOfNSampler
from datasets import load_dataset
from rich import print_json


if __name__ == "__main__":

    # Candidate models
    model_names = [
        "Sachin21112004/distilbart-news-summarizer",
        "google/pegasus-xsum"
    ]

    # Initialize Best-of-N sampler
    bon = BestOfNSampler(models_names=model_names)

    # Load dataset
    dataset = load_dataset("cnn_dailymail", "3.0.0")

    # Example queries
    queries = [
        dataset["train"][0]["article"],
        dataset["train"][1]["article"],
        dataset["train"][2]["article"]
    ]

    # Generate responses
    results = bon.generate(queries=queries, n=3)

    # Pretty print results
    print_json(json.dumps(results, indent=4))
```
## Change Log

For details on development and the changes in each version, see the [change log](https://github.com/mbaddar1/InferScale/blob/main/changelog.md).

# Main Resources

1. https://open.substack.com/pub/sebastianraschka/p/categories-of-inference-time-scaling
2. https://arxiv.org/abs/2510.10787
3. https://medium.com/@adnanmasood/inference-time-scaling-how-modern-ai-models-think-longer-to-perform-better-a1e1a8155fbd
