Metadata-Version: 2.4
Name: mistralai-search-toolkit-storage-gcs
Version: 0.0.6
Summary: Google Cloud Storage ObjectStorage plugin for mistralai-search-toolkit
Author-email: Mistral AI <support@mistral.ai>
License: Apache-2.0
License-File: LICENSE
Keywords: ai,cloud-storage,gcs,mistral,plugin,search
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: <3.15,>=3.12
Requires-Dist: gcloud-aio-storage<10.0.0,>=9.3.0
Requires-Dist: mistralai-search-toolkit
Description-Content-Type: text/markdown

# Google Cloud Storage Plugin for Search Toolkit

Google Cloud Storage backend for [`mistralai-search-toolkit`](https://pypi.org/project/mistralai-search-toolkit/).

This plugin implements the Search Toolkit's `ObjectStorage` interface, enabling the ingestion pipeline to load files directly from Google Cloud Storage.

## Installation

```bash
pip install mistralai-search-toolkit-storage-gcs
```

Or as an optional dependency of the core package:

```bash
pip install "mistralai-search-toolkit[storage-gcs]"
```

## Quick Start: Load Files from GCS in Ingestion Pipeline

### 1. Upload a File to GCS

```python
import asyncio
from mistralai.search.toolkit.plugins.storage.gcs import GCSObjectStorage

async def upload_file():
    storage = GCSObjectStorage(
        bucket_name="your-bucket",
        project_id="your-project",
    )

    # Upload a file
    with open("document.pdf", "rb") as f:
        data = f.read()

    await storage.put(key="documents/document.pdf", data=data)

asyncio.run(upload_file())
```
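
To upload more than one file, the same `put` call can be issued concurrently. The following is a minimal sketch, assuming only the `put(key=..., data=...)` signature shown above; `SupportsPut` and `upload_directory` are hypothetical names introduced for illustration and are not part of the plugin.

```python
import asyncio
from pathlib import Path
from typing import Protocol


class SupportsPut(Protocol):
    """Hypothetical protocol: any storage exposing the `put` call shown above."""

    async def put(self, key: str, data: bytes) -> None: ...


async def upload_directory(
    storage: SupportsPut,
    local_dir: str,
    prefix: str = "documents",
    max_concurrency: int = 4,
) -> list[str]:
    """Upload every regular file under local_dir and return the sorted object keys."""
    root = Path(local_dir)
    # Cap the number of uploads in flight at any one time.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def upload_one(path: Path) -> str:
        # Build the object key from the path relative to the root directory.
        key = f"{prefix}/{path.relative_to(root).as_posix()}"
        async with semaphore:
            # Read the file in a worker thread so the event loop is not blocked.
            data = await asyncio.to_thread(path.read_bytes)
            await storage.put(key=key, data=data)
        return key

    files = [p for p in root.rglob("*") if p.is_file()]
    keys = await asyncio.gather(*(upload_one(p) for p in files))
    return sorted(keys)
```

Passing a `GCSObjectStorage` instance as `storage` should work as long as its `put` method matches the call above. The semaphore keeps memory use bounded, since each pending upload holds a whole file in memory.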

### 2. Load Files from GCS in Ingestion Pipeline

```python
import asyncio
import os
from mistralai.search.toolkit.ingestion.loaders import FileLoader
from mistralai.search.toolkit.ingestion.pipelines import Pipeline
from mistralai.search.toolkit.ingestion.text_splitters import CharacterTextSplitter
from mistralai.search.toolkit.embedders import MistralEmbedder, MODEL_1024_EMBEDDING
from mistralai.client import Mistral
from mistralai.search.toolkit.plugins.storage.gcs import GCSObjectStorage
from mistralai.search.toolkit.plugins.vespa import VespaClientConfig
from vespa_app import app

async def ingest_from_gcs():
    # Create GCS storage factory
    def gcs_storage_factory():
        return GCSObjectStorage(
            bucket_name="your-bucket",
            project_id="your-project",
        )

    # Create FileLoader backed by GCS
    file_loader = FileLoader(storage_factory=gcs_storage_factory)

    # Create ingestion pipeline
    mistral_client = Mistral(api_key=os.environ.get("MISTRAL_API_KEY"))
    vespa_config = VespaClientConfig(
        endpoint=os.environ.get("VESPA_ENDPOINT", "http://localhost:8080"),
    )
    vector_store = app.get_search_index(vespa_config, collection_name="articles")

    pipeline = Pipeline(
        loader=file_loader,
        text_splitter=CharacterTextSplitter(chunk_size=512),
        embedder=MistralEmbedder(client=mistral_client, model_name=MODEL_1024_EMBEDDING),
        stores=vector_store,
    )

    # Ingest documents from GCS
    num_chunks = await pipeline.run(documents=[
        "documents/document1.pdf",
        "documents/document2.pdf",
    ])

    print(f"Indexed {num_chunks} chunks")

asyncio.run(ingest_from_gcs())
```

## Configuration

### Basic Setup

```python
from mistralai.search.toolkit.plugins.storage.gcs import GCSObjectStorage

storage = GCSObjectStorage(
    bucket_name="your-bucket",
    project_id="your-project",
)
```

### Using a Service Account

```python
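# google.oauth2 is provided by the google-auth package
from google.oauth2 import service_account
from mistralai.search.toolkit.plugins.storage.gcs import GCSObjectStorage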

credentials = service_account.Credentials.from_service_account_file(
    "/path/to/service-account-key.json"
)

storage = GCSObjectStorage(
    bucket_name="your-bucket",
    project_id="your-project",
    credentials=credentials,
)
```

## Authentication

### Environment Variables

Point to a service account key file with an environment variable:

```bash
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account-key.json"
```

Or authenticate with the gcloud CLI:

```bash
gcloud auth application-default login
```

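The plugin will automatically use credentials from:
- The key file referenced by the `GOOGLE_APPLICATION_CREDENTIALS` environment variable
- Application Default Credentials, such as those created by `gcloud auth application-default login` or supplied by the environment when running in GCP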

## License

This plugin is licensed under the Apache License 2.0.

## Support

For Search Toolkit issues, refer to the [Search Toolkit documentation](https://pypi.org/project/mistralai-search-toolkit/).

For Google Cloud Storage documentation, visit [GCS Docs](https://cloud.google.com/storage/docs).
