Metadata-Version: 2.4
Name: strands-transformers
Version: 0.3.0
Summary: The universal entrypoint to HuggingFace transformers for Strands agents - 100% task & modality coverage, zero hardcoding.
Author-email: Cagatay Cali <cagataycali@icloud.com>
License: MIT
Project-URL: Homepage, https://github.com/cagataycali/strands-transformers
Project-URL: Repository, https://github.com/cagataycali/strands-transformers
Project-URL: Issues, https://github.com/cagataycali/strands-transformers/issues
Keywords: strands,transformers,huggingface,ai,agents,multimodal,vision,audio,video,vla,robotics,llm
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: strands-agents
Requires-Dist: transformers>=4.40
Requires-Dist: torch
Requires-Dist: accelerate
Requires-Dist: pillow
Requires-Dist: numpy
Provides-Extra: audio
Requires-Dist: soundfile; extra == "audio"
Requires-Dist: librosa; extra == "audio"
Provides-Extra: vision
Requires-Dist: torchvision; extra == "vision"
Requires-Dist: opencv-python; extra == "vision"
Requires-Dist: av; extra == "vision"
Provides-Extra: training
Requires-Dist: trl; extra == "training"
Requires-Dist: peft; extra == "training"
Requires-Dist: accelerate; extra == "training"
Requires-Dist: datasets; extra == "training"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Provides-Extra: docs
Requires-Dist: mkdocs-material; extra == "docs"
Requires-Dist: mkdocstrings[python]; extra == "docs"
Requires-Dist: pymdown-extensions; extra == "docs"
Provides-Extra: all
Requires-Dist: strands-transformers[audio,dev,docs,training,vision]; extra == "all"

<div align="center">
  <h1>🤗 Strands Transformers</h1>
  <h3>Run any HuggingFace transformers model from a Strands agent - as a tool, or as the agent's own brain.</h3>
  <p>Two entry points: <code>use_transformers</code> (a tool for all 24 transformers tasks) and <code>TransformerModel</code> (a local model provider that consumes image / video / audio / document content blocks).</p>

  <div>
    <a href="https://pypi.org/project/strands-transformers/"><img alt="pypi" src="https://img.shields.io/pypi/v/strands-transformers"/></a>
    <a href="https://github.com/cagataycali/strands-transformers/actions/workflows/docs.yml"><img alt="docs" src="https://github.com/cagataycali/strands-transformers/actions/workflows/docs.yml/badge.svg"/></a>
    <a href="https://github.com/cagataycali/strands-transformers/issues"><img alt="issues" src="https://img.shields.io/github/issues/cagataycali/strands-transformers"/></a>
    <img alt="python" src="https://img.shields.io/badge/python-3.10+-blue"/>
    <img alt="transformers" src="https://img.shields.io/badge/🤗_transformers-24_tasks-yellow"/>
    <img alt="modalities" src="https://img.shields.io/badge/modalities-text·image·video·audio-orange"/>
    <img alt="license" src="https://img.shields.io/badge/license-MIT-green"/>
  </div>
</div>

---

**`use_transformers`** is one tool that exposes every transformers task. It reads
transformers' task taxonomy at runtime, so a model or task added upstream works
here without a code change. You can discover tasks, run a pipeline, or call any
class, function, or method directly.

**`TransformerModel`** plugs a local HF model in as a Strands `Agent(model=…)`.
It speaks the agent content-block protocol, so the model receives `image`,
`video`, `audio` and `document` blocks directly. Vision-language models see
images; audio models hear; Qwen2.5-Omni hears and replies with generated speech.

```mermaid
flowchart LR
    IN["📥 text · image · video<br/>audio · document · robot-state"]
    TOOL["🛠️ use_transformers<br/><i>tool</i>"]
    BRAIN["🧠 TransformerModel<br/><i>local agent brain</i>"]
    OUT["📤 text · speech · image<br/>labels · actions"]
    IN --> TOOL --> OUT
    IN --> BRAIN --> OUT
    classDef i fill:#7C4DFF,stroke:#5b34d6,color:#fff;
    classDef c fill:#FFD21E,stroke:#E68A00,color:#3a2d00;
    classDef o fill:#00E5FF,stroke:#00b3cc,color:#003844;
    class IN i;
    class TOOL,BRAIN c;
    class OUT o;
```

📖 **[Full documentation](https://cagataycali.github.io/strands-transformers/)** (built with MkDocs, see `docs/`)

## Install

```bash
uv pip install strands-transformers        # from PyPI
# or from source:
uv pip install -e .                         # or: pip install -e .
PYTHONPATH=. python examples/smoke.py       # verify → "18/18 checks passed"
```

<details>
<summary>Optional extras (audio · vision · training · docs)</summary>

```bash
uv pip install -e ".[audio]"      # soundfile, librosa  (mp3/flac/ogg decode)
uv pip install -e ".[vision]"     # torchvision (VLMs!), opencv, av
uv pip install -e ".[training]"   # trl, peft, accelerate, datasets
uv pip install -e ".[docs]"       # mkdocs-material, mkdocstrings, pymdown-extensions
uv pip install -e ".[all]"        # everything
```
**Vision models** (SmolVLM, etc.) need the `[vision]` extra (torchvision). WAV
audio works without extras. `device="auto"` picks cuda → mps → cpu (bf16 on GPU).
</details>
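
To see which extras are already usable in an environment, one option is to probe for the import names behind each extra. The sketch below is not part of the package. The extra-to-module mapping is written out by hand from the package metadata, and `EXTRA_MODULES`, `missing_modules`, and `extras_report` are hypothetical names used only for illustration.

```python
from importlib.util import find_spec

# Import names behind each optional extra (hand-written for this sketch;
# note that opencv-python is imported as `cv2`).
EXTRA_MODULES = {
    "audio": ["soundfile", "librosa"],
    "vision": ["torchvision", "cv2", "av"],
    "training": ["trl", "peft", "accelerate", "datasets"],
}


def missing_modules(modules):
    """Return the top-level module names that cannot be found."""
    return [name for name in modules if find_spec(name) is None]


def extras_report(extra_modules=EXTRA_MODULES):
    """Map each extra to the list of its modules that are not installed."""
    return {extra: missing_modules(mods) for extra, mods in extra_modules.items()}


if __name__ == "__main__":
    for extra, missing in extras_report().items():
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"[{extra}] {status}")
```

An extra that reports missing modules can be installed with the matching `uv pip install -e ".[<extra>]"` line above.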

## 60-second hello - a local vision agent

```python
import io
from PIL import Image
from strands import Agent
from strands_transformers import TransformerModel

# Build a 64x64 green square and encode it as PNG bytes.
buf = io.BytesIO()
Image.new("RGB", (64, 64), (20, 200, 40)).save(buf, "PNG")

model = TransformerModel(model_path="HuggingFaceTB/SmolVLM-256M-Instruct")
agent = Agent(model=model, system_prompt="You are concise.")

print(agent([
    {"image": {"format": "png", "source": {"bytes": buf.getvalue()}}},
    {"text": "Color? One word."},
]))
# → Green.
```

A 256M-param model in the standard Strands loop, *seeing* pixels through a content
block - no API key, no server. Swap `model_path` for any HF VLM.
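
For images that already live on disk, the same content block can be built straight from the file bytes. This is a minimal sketch, not a helper shipped by the package: `image_block` is a hypothetical name, and the set of accepted format strings (`png`, `jpeg`, `gif`, `webp`) is an assumption to check against the content-block guide.

```python
from pathlib import Path

# File suffix -> format string for the image content block.
# Assumption: these four format names are the accepted ones.
_SUFFIX_TO_FORMAT = {
    ".png": "png",
    ".jpg": "jpeg",
    ".jpeg": "jpeg",
    ".gif": "gif",
    ".webp": "webp",
}


def image_block(path):
    """Read an image file and wrap its raw bytes in an image content block."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix not in _SUFFIX_TO_FORMAT:
        raise ValueError(f"unsupported image type: {suffix!r}")
    return {
        "image": {
            "format": _SUFFIX_TO_FORMAT[suffix],
            "source": {"bytes": path.read_bytes()},
        }
    }
```

With a local file (the name `photo.jpg` is only an example), the hello above becomes `agent([image_block("photo.jpg"), {"text": "Color? One word."}])`.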

## See it work

Every output below is a **real** model result (CUDA · transformers 5.12 · torch 2.10):

| You give it | Script | It returns |
|-------------|--------|-----------|
| 🖼️ a green image + "Color?" | `examples/multimodal_agent.py` | `"Green."` |
| 🎬 brightening frames | `examples/multimodal_advanced.py` | `"BRIGHTER."` |
| 🧰 a tool screenshot (blue) | `examples/multimodal_advanced.py` | `"Blue."` |
| 📄 a text document | `examples/document_and_audio.py` | recovers `BANANA-42` |
| 🔊 a 440 Hz tone (Omni) | `examples/omni_audio.py` | `"It's a pure tone."` |
| 💬 "say: …can speak" (Omni) | `examples/omni_audio.py` | 🔊 real 24 kHz speech |

**Real agent outputs** - detection boxes, depth, panoptic segmentation (one COCO photo):

<p align="center"><img src="https://raw.githubusercontent.com/cagataycali/strands-transformers/main/docs/assets/img/gallery.png" width="780" alt="detection · depth · segmentation"/></p>

<table>
<tr>
<td width="50%" valign="top">

🎬 **Video understanding** - frames in, label out:

<img src="https://raw.githubusercontent.com/cagataycali/strands-transformers/main/docs/assets/video/demo.gif" width="240" alt="video demo"/>

</td>
<td width="50%" valign="top">

🔊 **Speech** - `text-to-audio` output, then re-transcribed by Whisper (the library narrating itself):

<img src="https://raw.githubusercontent.com/cagataycali/strands-transformers/main/docs/assets/img/waveform.png" width="340" alt="generated speech waveform"/>

▶️ [Listen on the docs site](https://cagataycali.github.io/strands-transformers/)

</td>
</tr>
</table>

▶️ **[Hear it speak + play every example in the docs →](https://cagataycali.github.io/strands-transformers/)**

## Featured models

The examples use tiny models so they run in seconds. In practice you point the
same code at any current `library_name: transformers` model: swap the id and the
plumbing stays identical. A few strong open ones, by modality:

| Modality | Model | How to use |
|----------|-------|-----------|
| Vision-language | [`Qwen/Qwen3-VL-8B-Instruct`](https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct) · [`google/gemma-3-4b-it`](https://huggingface.co/google/gemma-3-4b-it) | `TransformerModel` brain or `run` (image-text-to-text) |
| Speech → text | [`openai/whisper-large-v3-turbo`](https://huggingface.co/openai/whisper-large-v3-turbo) · [`Qwen/Qwen3-ASR-1.7B`](https://huggingface.co/Qwen/Qwen3-ASR-1.7B) | `run` (automatic-speech-recognition) |
| Audio in + speech out | [`Qwen/Qwen2.5-Omni-3B`](https://huggingface.co/Qwen/Qwen2.5-Omni-3B) | `TransformerModel` brain (`speak=True`) |
| Multimodal (audio+vision+text) | [`microsoft/Phi-4-multimodal-instruct`](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) | `TransformerModel` brain |
| Robot actions (VLA) | [`allenai/MolmoAct2`](https://huggingface.co/allenai/MolmoAct2) · [`openvla/openvla-7b`](https://huggingface.co/openvla/openvla-7b) | `call` → `predict_action` |
| Embodied reasoning | [`nvidia/Cosmos-Reason2-2B`](https://huggingface.co/nvidia/Cosmos-Reason2-2B) | `run` (image-text-to-text) |

```python
# swap the tiny demo model for a SOTA one - same code:
model = TransformerModel(model_path="Qwen/Qwen3-VL-8B-Instruct")
```

## Two ways to use it

<details open>
<summary><b>As a tool</b> - <code>use_transformers</code> (discover · run · call)</summary>

```python
from strands import Agent
from strands_transformers import use_transformers

agent = Agent(tools=[use_transformers])
agent("Transcribe recording.wav")                  # automatic-speech-recognition
agent("What's in scene.jpg?")                       # image-text-to-text
agent("Say 'hello from strands' as audio")          # text-to-audio
agent("Detect objects in https://.../street.jpg")   # object-detection
```

Discover everything at runtime (`action="tasks" | "modalities" | "inspect" | …`),
run high-level pipelines, or `call` any class/fn/method for custom models.
→ **[The tool guide](https://cagataycali.github.io/strands-transformers/guide/the-tool/)**
</details>

<details>
<summary><b>As the agent's brain</b> - <code>TransformerModel</code> (multimodal content blocks)</summary>

Pass `image` / `video` / `audio` / `document` content blocks (and media inside a
`toolResult`) - the provider auto-detects the model's processor and routes them.
All outputs below are **real** results (CUDA, transformers 5.12 / torch 2.10):

| Content block | Example | Verified output |
|---|---|---|
| `image` | `multimodal_agent.py` | `"Green."` |
| `video` (with `fps`) | `multimodal_advanced.py` | `"BRIGHTER."` |
| `image` in `toolResult` | `multimodal_advanced.py` | `"Blue."` |
| `document` | `document_and_audio.py` | recovers `BANANA-42` |
| `audio` *(our schema extension)* | `audio_content_block.py` | audio → text |
| `audio` in **and** speech out | `omni_audio.py` | hears + **speaks** (Qwen2.5-Omni) |

→ **[Agent brain](https://cagataycali.github.io/strands-transformers/guide/agent-brain/)** ·
**[Content blocks](https://cagataycali.github.io/strands-transformers/guide/content-blocks/)** ·
**[Audio](https://cagataycali.github.io/strands-transformers/guide/audio/)**
</details>
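
The "`image` in `toolResult`" row refers to media that a tool hands back to the agent. As a rough sketch of what such a block can look like, the helper below wraps PNG bytes in a tool result. It assumes the usual Strands tool-result shape (`toolUseId`, `status`, `content`); `image_tool_result` is a hypothetical helper, not part of this package.

```python
def image_tool_result(tool_use_id, png_bytes, caption=None):
    """Wrap PNG bytes (and an optional caption) in a toolResult content block."""
    content = []
    if caption:
        content.append({"text": caption})
    # Same image block layout as in the 60-second hello example.
    content.append({"image": {"format": "png", "source": {"bytes": png_bytes}}})
    return {
        "toolResult": {
            "toolUseId": tool_use_id,
            "status": "success",
            "content": content,
        }
    }


# Placeholder bytes for illustration only; this is not a valid PNG.
result = image_tool_result("screenshot-1", b"\x89PNG placeholder", caption="Current screen")
print([next(iter(item)) for item in result["toolResult"]["content"]])
# → ['text', 'image']
```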

<details>
<summary><b>Robotics / VLA</b> - camera + instruction → robot actions</summary>

Two layers, both transformers-native and GPU-verified:
- 🧠 **reason** - [Cosmos-Reason2-2B](https://huggingface.co/nvidia/Cosmos-Reason2-2B)
  (a physical-AI VLM) plans over a scene via the `run` path: *"the red cube is in
  the bottom left corner, so the arm should move there first."*
- ⚙️ **act** - VLA models expose `predict_action` via the `call` path:
  [MolmoAct2](https://huggingface.co/allenai/MolmoAct2-SO100_101) → `[1,30,6]`;
  [OpenVLA-7b](https://huggingface.co/openvla/openvla-7b) → 7-DoF (auto 4.x→5.x shims).

🔗 **Full agentic loop** ([`examples/robot_reason_act_agent.py`](examples/robot_reason_act_agent.py)):
Cosmos-Reason *plans* over real RealSense frames → MolmoAct *acts* (`[1,30,6]`) -
perception→plan→action through one tool.

LeRobot-ecosystem policies (SmolVLA, π0, ACT, GR00T) use their own runtimes, so
pair them with `use_lerobot`.
→ **[Robotics guide](https://cagataycali.github.io/strands-transformers/guide/robotics/)**
</details>
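
The `[1,30,6]` output above is a nested array. The sketch below shows one way to walk it step by step, under the assumption (not stated by the model card excerpted here) that the axes are batch, time step, and per-step action values. `iter_actions` is a hypothetical helper, and the all-zero chunk is a stand-in for a real model output.

```python
def iter_actions(chunk, expected_dim=6):
    """Yield (step, action) pairs from a [batch=1, steps, dim] nested list."""
    if len(chunk) != 1:
        raise ValueError(f"expected batch size 1, got {len(chunk)}")
    for step, action in enumerate(chunk[0]):
        if len(action) != expected_dim:
            raise ValueError(
                f"step {step}: expected {expected_dim} values, got {len(action)}"
            )
        yield step, list(action)


# Stand-in for a model output: 1 batch x 30 steps x 6 values, all zeros.
dummy_chunk = [[[0.0] * 6 for _ in range(30)]]
steps = list(iter_actions(dummy_chunk))
print(len(steps), len(steps[0][1]))
# → 30 6
```

If the model returns a tensor rather than nested lists, converting it first (for example with `.tolist()`) keeps the helper unchanged.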

## How it works

Nothing is hardcoded per task - `core/registry.py` reads transformers' own
`SUPPORTED_TASKS` at runtime, so coverage tracks upstream automatically.
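
To make the idea concrete, here is a minimal sketch of registry-driven discovery. It is not the code in `core/registry.py`; it only assumes that each entry of the upstream task table is a dict carrying a `"type"` field, and it uses a trimmed stand-in table with illustrative values so it runs without transformers installed. `group_tasks_by_type` is a hypothetical name.

```python
from collections import defaultdict


def group_tasks_by_type(task_table):
    """Group task names by the "type" field of each registry entry."""
    groups = defaultdict(list)
    for task, spec in task_table.items():
        groups[spec.get("type", "unknown")].append(task)
    return {kind: sorted(names) for kind, names in sorted(groups.items())}


# Stand-in table (illustrative values, entries trimmed to the "type" field).
# With transformers installed, the real table should be importable as
# `from transformers.pipelines import SUPPORTED_TASKS`, assuming the
# installed version still exposes it under that name.
sample_table = {
    "text-classification": {"type": "text"},
    "automatic-speech-recognition": {"type": "multimodal"},
    "object-detection": {"type": "multimodal"},
    "depth-estimation": {"type": "image"},
}
print(group_tasks_by_type(sample_table))
```

Because the grouping is computed from whatever table is passed in, a task added upstream shows up in the result without any change to the function.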

<details>
<summary>Project layout</summary>

```
strands_transformers/
├── tools/use_transformers.py   # the one @tool: discover · run · call
├── models/transformers.py      # TransformerModel - local multimodal agent brain
├── types/audio.py              # audio content-block extension
└── core/{registry,engine,io,compat}.py   # taxonomy · load/cache · I/O · legacy shims
```
→ **[Architecture](https://cagataycali.github.io/strands-transformers/reference/architecture/)** ·
**[API reference](https://cagataycali.github.io/strands-transformers/reference/transformer-model/)**
</details>

## Examples

12 runnable, GPU-verified examples in [`examples/`](examples/) - image, video,
audio, document, Omni speech, VLA, and pipelines. Run any of them:

```bash
PYTHONPATH=. python examples/<name>.py
```

→ **[Examples & FAQ](https://cagataycali.github.io/strands-transformers/reference/examples/)**

## Star history

<a href="https://www.star-history.com/#cagataycali/strands-transformers&Date">
  <img src="https://api.star-history.com/svg?repos=cagataycali/strands-transformers&type=Date" width="600" alt="Star History Chart"/>
</a>

## License

MIT - built with [Strands Agents SDK](https://github.com/strands-agents/sdk-python)
and [HuggingFace Transformers](https://github.com/huggingface/transformers).

<div align="center">
  <sub>If this saved you a pile of per-model glue code, consider giving it a ⭐</sub>
</div>
