Metadata-Version: 2.4
Name: mistralai-search-toolkit-storage-s3
Version: 0.0.6
Summary: AWS S3 (and S3-compatible) ObjectStorage plugin for mistralai-search-toolkit
Author-email: Mistral AI <support@mistral.ai>
License: Apache-2.0
License-File: LICENSE
Keywords: ai,aws,cloud-storage,mistral,plugin,s3,search
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: <3.15,>=3.12
Requires-Dist: aioboto3<14.0.0,>=13.0.0
Requires-Dist: boto3<2.0.0,>=1.34.0
Requires-Dist: botocore<2.0.0,>=1.34.0
Requires-Dist: mistralai-search-toolkit
Description-Content-Type: text/markdown

# AWS S3 Storage Plugin for Search Toolkit

AWS S3 (and S3-compatible) object storage backend for [`mistralai-search-toolkit`](https://pypi.org/project/mistralai-search-toolkit/).

This plugin implements the Search Toolkit's `ObjectStorage` interface, enabling the ingestion pipeline to load files directly from S3.

## Installation

```bash
pip install mistralai-search-toolkit-storage-s3
```

Or install it as an optional extra of the core package:

```bash
# Quote the requirement so shells such as zsh do not expand the brackets
pip install "mistralai-search-toolkit[storage-s3]"
```

## Quick Start: Load Files from S3 in an Ingestion Pipeline

### 1. Upload a File to S3

```python
import asyncio
from mistralai.search.toolkit.plugins.storage.s3 import S3ObjectStorage

async def upload_file():
    storage = S3ObjectStorage(
        bucket_name="your-bucket",
        region_name="us-east-1",
    )

    # Upload a file
    with open("document.pdf", "rb") as f:
        data = f.read()

    await storage.put(key="documents/document.pdf", data=data)

asyncio.run(upload_file())
```
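
When uploading several files, it can help to derive object keys in one place so that the same keys can later be passed to the pipeline. The following is a minimal sketch using only the standard library; `object_key_for` and the `documents` prefix are hypothetical names chosen for illustration and are not part of the plugin.

```python
from pathlib import Path, PurePosixPath


def object_key_for(local_path: str, prefix: str = "documents") -> str:
    """Build an S3 object key from a local file path (hypothetical helper)."""
    # S3 keys use forward slashes regardless of the local OS, so the key is
    # assembled with PurePosixPath. Only the file name is kept.
    return str(PurePosixPath(prefix) / Path(local_path).name)


keys = [object_key_for(p) for p in ["./document1.pdf", "reports/document2.pdf"]]
print(keys)
```

Each resulting key could then be used as the `key` argument of `storage.put(...)` shown above.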

### 2. Load Files from S3 in the Ingestion Pipeline

```python
import asyncio
import os
from mistralai.search.toolkit.ingestion.loaders import FileLoader
from mistralai.search.toolkit.ingestion.pipelines import Pipeline
from mistralai.search.toolkit.ingestion.text_splitters import CharacterTextSplitter
from mistralai.search.toolkit.embedders import MistralEmbedder, MODEL_1024_EMBEDDING
from mistralai.client import Mistral
from mistralai.search.toolkit.plugins.storage.s3 import S3ObjectStorage
from mistralai.search.toolkit.plugins.vespa import VespaClientConfig
# Assumption: `vespa_app` is your own module that exposes the Vespa application as `app`
from vespa_app import app

async def ingest_from_s3():
    # Create S3 storage factory
    def s3_storage_factory():
        return S3ObjectStorage(
            bucket_name="your-bucket",
            region_name="us-east-1",
        )

    # Create FileLoader backed by S3
    file_loader = FileLoader(storage_factory=s3_storage_factory)

    # Create ingestion pipeline
    mistral_client = Mistral(api_key=os.environ.get("MISTRAL_API_KEY"))
    vespa_config = VespaClientConfig(
        endpoint=os.environ.get("VESPA_ENDPOINT", "http://localhost:8080"),
    )
    vector_store = app.get_search_index(vespa_config, collection_name="articles")

    pipeline = Pipeline(
        loader=file_loader,
        text_splitter=CharacterTextSplitter(chunk_size=512),
        embedder=MistralEmbedder(client=mistral_client, model_name=MODEL_1024_EMBEDDING),
        stores=vector_store,
    )

    # Ingest documents from S3
    num_chunks = await pipeline.run(documents=[
        "documents/document1.pdf",
        "documents/document2.pdf",
    ])

    print(f"Indexed {num_chunks} chunks")

asyncio.run(ingest_from_s3())
```

## Configuration

### Basic Setup

```python
from mistralai.search.toolkit.plugins.storage.s3 import S3ObjectStorage

storage = S3ObjectStorage(
    bucket_name="your-bucket",
    region_name="us-east-1",
)
```

### With Credentials

```python
storage = S3ObjectStorage(
    bucket_name="your-bucket",
    region_name="us-east-1",
    access_key="your-access-key",
    secret_key="your-secret-key",
)
```
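
Hardcoding secrets in source files is easy to get wrong, so one option is to read them from the environment. The sketch below only builds a dictionary of keyword arguments matching the parameter names used in this README. The helper `s3_settings_from_env` and the `S3_ENDPOINT_URL` variable are assumptions made for illustration; the `AWS_*` names are the standard AWS environment variables.

```python
import os


def s3_settings_from_env(bucket_name: str) -> dict[str, str]:
    """Collect S3ObjectStorage keyword arguments from the environment (hypothetical helper)."""
    settings = {"bucket_name": bucket_name}
    # Keyword argument -> environment variable. S3_ENDPOINT_URL is an assumed,
    # project-specific name; the AWS_* names are the standard AWS variables.
    mapping = {
        "region_name": "AWS_DEFAULT_REGION",
        "access_key": "AWS_ACCESS_KEY_ID",
        "secret_key": "AWS_SECRET_ACCESS_KEY",
        "endpoint_url": "S3_ENDPOINT_URL",
    }
    for argument, variable in mapping.items():
        value = os.environ.get(variable)
        if value:  # skip variables that are unset or empty
            settings[argument] = value
    return settings
```

The result could then be unpacked into the constructor, for example `S3ObjectStorage(**s3_settings_from_env("your-bucket"))`.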

### S3-Compatible Services

Works with MinIO, DigitalOcean Spaces, and other S3-compatible services:

```python
storage = S3ObjectStorage(
    bucket_name="bucket",
    endpoint_url="https://minio.example.com",
    access_key="minioadmin",
    secret_key="minioadmin",
)
```

## Local Development

To test without an AWS account, run [MinIO](https://min.io/) locally:

```bash
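# Port 9000 serves the S3 API; --console-address pins the web console to port 9001
docker run -p 9000:9000 -p 9001:9001 minio/minio server /data --console-address ":9001"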
```

Then point the plugin at the local MinIO instance (`minioadmin` / `minioadmin` are MinIO's default credentials):

```python
storage = S3ObjectStorage(
    bucket_name="documents",
    endpoint_url="http://localhost:9000",
    access_key="minioadmin",
    secret_key="minioadmin",
)
```
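
Before running tests against the container, it can be useful to confirm that MinIO is actually accepting requests. The sketch below polls MinIO's liveness endpoint (`/minio/health/live`) with the standard library only; the helper name `minio_is_live` is hypothetical and not part of the plugin.

```python
import urllib.request


def minio_is_live(endpoint_url: str, timeout: float = 2.0) -> bool:
    """Return True if the MinIO liveness probe answers with HTTP 200 (hypothetical helper)."""
    url = endpoint_url.rstrip("/") + "/minio/health/live"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers connection refused, timeouts, and urllib's URLError/HTTPError,
        # which are all OSError subclasses.
        return False


if __name__ == "__main__":
    print(minio_is_live("http://localhost:9000"))
```

A test fixture could call this in a short retry loop and skip the S3 tests when the endpoint never becomes reachable.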

## License

This plugin is licensed under the Apache License 2.0.

## Support

For Search Toolkit issues, refer to the [Search Toolkit documentation](https://pypi.org/project/mistralai-search-toolkit/).

For AWS S3 documentation, visit [AWS S3 Docs](https://docs.aws.amazon.com/s3/).
