Metadata-Version: 2.2
Name: distllm
Version: 1.0.2
Summary: Distributed Inference for Large Language Models.
Author-email: Alexander Brace <abrace@anl.gov>, Ozan Gokdemir <ogokdemir@uchicago.edu>
License: MIT
Project-URL: homepage, https://github.com/ramanathanlab/distllm
Project-URL: documentation, https://github.com/ramanathanlab/distllm
Project-URL: repository, https://github.com/ramanathanlab/distllm
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: transformers>=4.38.2
Requires-Dist: datasets>=2.18.0
Requires-Dist: bitsandbytes>=0.42.0
Requires-Dist: langchain>=0.2.5
Requires-Dist: langchain-anthropic>=0.1.7
Requires-Dist: langchain-google-genai>=1.0.1
Requires-Dist: accelerate>=0.28.0
Requires-Dist: parsl>=2024.1.29
Requires-Dist: pydantic>=2.6.0
Requires-Dist: typer[all]>=0.9.0
Requires-Dist: nltk>=3.9
Requires-Dist: python-dotenv>=1.0.1
Requires-Dist: peft>=0.10.0
Requires-Dist: sentence-transformers>=3.3.1
Requires-Dist: torch
Provides-Extra: dev
Requires-Dist: covdefaults>=2.2; extra == "dev"
Requires-Dist: coverage; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Requires-Dist: pre-commit; extra == "dev"
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Requires-Dist: tox; extra == "dev"
Requires-Dist: virtualenv; extra == "dev"
Requires-Dist: types-PyYAML; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Provides-Extra: docs
Requires-Dist: black; extra == "docs"
Requires-Dist: mkdocs-gen-files; extra == "docs"
Requires-Dist: mkdocs-literate-nav; extra == "docs"
Requires-Dist: mkdocs-material==9.4.7; extra == "docs"
Requires-Dist: mkdocs-section-index; extra == "docs"
Requires-Dist: mkdocstrings==0.23.0; extra == "docs"
Requires-Dist: mkdocstrings-python==1.8.0; extra == "docs"
Requires-Dist: mike; extra == "docs"

# distllm
[![PyPI version](https://badge.fury.io/py/distllm.svg)](https://badge.fury.io/py/distllm)

Distributed Inference for Large Language Models.
- Create embeddings for large datasets at scale.
- Generate text using language models at scale.
- Semantic similarity search using Faiss.

## Installation

distllm is available on PyPI and can be installed using pip:
```bash
pip install distllm
```

To install the package on Polaris@ALCF as of 12/12/2024, run the following commands:
```bash
git clone git@github.com:ramanathanlab/distllm.git
cd distllm
module use /soft/modulefiles; module load conda
conda create -n distllm python=3.12 -y
conda activate distllm
pip install faiss-gpu-cu12
pip install vllm
pip install -e .
python -m nltk.downloader punkt
```
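
After installation, a quick sanity check is to confirm that the packages installed above are visible in the active environment. The following is a minimal sketch that uses only the standard library; the helper name `installed_versions` is hypothetical and not part of distllm.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names are taken from the install commands above.
report = installed_versions(["distllm", "faiss-gpu-cu12", "vllm", "nltk"])
for name, version in report.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any entry reported as `NOT INSTALLED` points to an install step that did not complete in the active conda environment.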

### Protein Embedding Installation
For ESMC, you can install the following package:
```bash
pip install esm
```

For ESM2, you can install the following packages:
```bash
pip install flash-attn --no-build-isolation
pip install faesm[flash_attn]
```
Or, if you want to forgo flash attention and use SDPA instead:
```bash
pip install faesm
```

## Usage
To create embeddings at scale, run the following command:
```bash
nohup python -m distllm.distributed_embedding --config examples/your-config.yaml &> nohup.out &
```

For LLM generation at scale, run the following command:
```bash
nohup python -m distllm.distributed_generation --config examples/your-config.yaml &> nohup.out &
```

To embed a smaller dataset on a single GPU, you can use the following command:
```bash
distllm embed \
  --encoder_name auto \
  --pretrained_model_name_or_path pritamdeka/S-PubMedBert-MS-MARCO \
  --data_path /lus/eagle/projects/FoundEpidem/braceal/projects/metric-rag/data/parsed_pdfs/LUCID.small.test/parsed_pdfs \
  --data_extension jsonl \
  --output_path cli_test_lucid \
  --dataset_name jsonl_chunk \
  --batch_size 512 \
  --chunk_batch_size 512 \
  --buffer_size 4 \
  --pooler_name mean \
  --embedder_name semantic_chunk \
  --writer_name huggingface \
  --quantization \
  --eval_mode
```

Or, to use a larger model on a single GPU, such as Salesforce/SFR-Embedding-Mistral (note the smaller batch sizes):
```bash
distllm embed \
  --encoder_name auto \
  --pretrained_model_name_or_path Salesforce/SFR-Embedding-Mistral \
  --data_path /lus/eagle/projects/FoundEpidem/braceal/projects/metric-rag/data/parsed_pdfs/LUCID.small.test/parsed_pdfs \
  --data_extension jsonl \
  --output_path cli_test_lucid_sfr_mistral \
  --dataset_name jsonl_chunk \
  --batch_size 16 \
  --chunk_batch_size 2 \
  --buffer_size 4 \
  --pooler_name last_token \
  --embedder_name semantic_chunk \
  --writer_name huggingface \
  --quantization \
  --eval_mode
```
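
The two commands above select different pooling strategies: `--pooler_name mean` for the BERT-based model and `--pooler_name last_token` for the Mistral-based model. As a conceptual illustration only (this is not distllm's implementation, and the function names are hypothetical), the sketch below shows what these two strategies conventionally compute for a single sequence. It assumes right-padded inputs and an attention mask that marks real tokens with 1.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the embeddings of the non-padding tokens."""
    kept = [vec for vec, keep in zip(token_embeddings, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def last_token_pool(token_embeddings, attention_mask):
    """Return the embedding of the final non-padding token (right padding assumed)."""
    last = sum(attention_mask) - 1
    if last < 0:
        raise ValueError("attention_mask selects no tokens")
    return list(token_embeddings[last])


# Three token positions with 2-dimensional embeddings; the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))        # [2.0, 3.0]
print(last_token_pool(tokens, mask))  # [3.0, 4.0]
```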

To merge the Hugging Face (HF) dataset files, you can use the following command:
```bash
distllm merge \
  --writer_name huggingface \
  --dataset_dir /lus/eagle/projects/FoundEpidem/braceal/projects/metric-rag/data/semantic_chunks/lit_covid_part2.PubMedBERT/embeddings \
  --output_dir lit_covid_part2.PubMedBERT.merge
```

To generate text using a language model, you can use the following command:
```bash
distllm generate --input_dir cli_test_lucid/ --output_dir cli_test_generate --top_p 0.95
```
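
The `--top_p 0.95` flag in the command above is assumed here to be the nucleus (top-p) sampling threshold, which is its conventional meaning in text generation APIs. As a rough illustration of that idea (not distllm's code; `top_p_filter` is a hypothetical name), the sketch below keeps the most likely tokens until their cumulative probability reaches the threshold, then renormalizes the remainder:

```python
def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability
    reaches top_p, then renormalize the kept probabilities to sum to 1."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = {}
    cumulative = 0.0
    for token, prob in ranked:
        kept[token] = prob
        cumulative += prob
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: prob / total for token, prob in kept.items()}


# Toy next-token distribution over four tokens.
probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
print(sorted(top_p_filter(probs, top_p=0.9)))  # ['a', 'b', 'c']
```

Lower values of `top_p` restrict sampling to fewer high-probability tokens, while a value of 1.0 leaves the distribution unchanged.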

## Contributing

For development, it is recommended to use a virtual environment. The following commands will create a virtual environment, install the package in editable mode, and install the pre-commit hooks.
```bash
python3.10 -m venv venv
source venv/bin/activate
pip install -U pip setuptools wheel
pip install -e '.[dev,docs]'
pre-commit install
```
To test the code, run the following commands:
```bash
pre-commit run --all-files
tox -e py310
```
To release a new version of distllm to PyPI:

1. Merge the develop branch into the main branch with an updated version number in pyproject.toml.
2. Make a new release on GitHub with the tag and name equal to the version number.
3. Clone a fresh distllm repository and run the installation commands above.
4. Run the following commands from the main branch:
```bash
rm -rf dist  # -f avoids an error when dist/ does not exist, as in a fresh clone
python3 -m build
twine upload dist/*
```
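
Steps 1 and 2 rely on the GitHub tag matching the version number in pyproject.toml, so a small check before uploading can catch a mismatch. This is a minimal sketch and not part of distllm: the helper names are hypothetical, it assumes the version is declared as a static `version = "..."` entry in the `[project]` table, and it uses a line-based scan rather than a full TOML parser so that it also runs on Python 3.10.

```python
import re


def read_project_version(pyproject_text):
    """Return the version string from the [project] table, or None if absent."""
    in_project = False
    for line in pyproject_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # Track whether the scan is currently inside the [project] table.
            in_project = stripped == "[project]"
            continue
        if in_project:
            match = re.match(r'version\s*=\s*["\']([^"\']+)["\']', stripped)
            if match:
                return match.group(1)
    return None


def tag_matches_version(tag, pyproject_text):
    """True if the release tag equals the declared version exactly."""
    return tag == read_project_version(pyproject_text)


# Illustrative pyproject.toml contents (assumed layout, not copied from the repo).
example = '''
[build-system]
requires = ["setuptools"]

[project]
name = "distllm"
version = "1.0.2"
'''
print(tag_matches_version("1.0.2", example))  # True
```

In practice the text would come from reading `pyproject.toml` on the main branch, and the tag from the release created in step 2.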
