Metadata-Version: 2.4
Name: llama-index-readers-plasmate
Version: 0.1.0
Summary: LlamaIndex reader for Plasmate SOM, providing structured web content for AI agents
Project-URL: Homepage, https://github.com/plasmate-labs/llamaindex-plasmate
Project-URL: Documentation, https://docs.plasmate.app
Project-URL: Repository, https://github.com/plasmate-labs/llamaindex-plasmate
Author: Plasmate Labs
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: agents,ai,llama-index,plasmate,reader,som,web
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.9
Requires-Dist: llama-index-core>=0.10.0
Requires-Dist: requests>=2.25.0
Provides-Extra: dev
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Description-Content-Type: text/markdown

# LlamaIndex Plasmate Reader

A LlamaIndex reader for [Plasmate SOM](https://docs.plasmate.app) (Structured Object Model), providing clean, structured web content optimized for AI agents and RAG pipelines.

## What is Plasmate SOM?

Plasmate SOM converts messy HTML into a clean, semantic structure that AI models can easily understand. Instead of parsing raw HTML with all its noise, you get structured content with:

- Semantic regions (headers, navigation, main content, footers)
- Clean text extraction from headings, paragraphs, links, lists, and tables
- Output that is typically about 10x smaller than the raw HTML
- Consistent structure across any website

## Installation

```bash
pip install llama-index-readers-plasmate
```

## Quick Start

```python
from llama_index_plasmate import PlasmateReader

# Initialize the reader
reader = PlasmateReader()

# Load documents from URLs
documents = reader.load_data(urls=[
    "https://example.com/page1",
    "https://example.com/page2",
])

# Use with LlamaIndex
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What is on these pages?")
```

## Configuration

### Using the SOM Cache API (Recommended)

The reader uses the Plasmate SOM Cache API by default for fast, cached responses:

```python
reader = PlasmateReader(
    api_key="your-api-key",  # Optional, for authenticated access
    api_base="https://cache.plasmate.app",  # Default
)
```
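
To keep the key out of source code, one option is to read it from the environment. The sketch below builds the two documented constructor arguments from environment variables; the helper `reader_config` and the variable names `PLASMATE_API_KEY` and `PLASMATE_API_BASE` are assumptions made for illustration, not something the reader looks up on its own.

```python
import os
from typing import Dict, Mapping, Optional


def reader_config(env: Mapping[str, str] = os.environ) -> Dict[str, Optional[str]]:
    """Build keyword arguments for PlasmateReader from environment variables.

    PLASMATE_API_KEY and PLASMATE_API_BASE are hypothetical variable names.
    """
    return {
        # None means unauthenticated access
        "api_key": env.get("PLASMATE_API_KEY"),
        # Fall back to the documented default base URL
        "api_base": env.get("PLASMATE_API_BASE", "https://cache.plasmate.app"),
    }


print(reader_config({"PLASMATE_API_KEY": "your-api-key"}))
```

The result can then be passed along as `PlasmateReader(**reader_config())`.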

### Using Local Plasmate CLI Fallback

If the API is unavailable, the reader automatically falls back to the local `plasmate` CLI, provided it is installed:

```bash
# Install plasmate CLI
npm install -g plasmate
```

The reader will use the CLI when:
- The API returns an error
- No API key is provided and the endpoint requires authentication
- You explicitly disable the API
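
To find out ahead of time whether the fallback can work on a given machine, you can check whether a `plasmate` executable is on `PATH`. This is a minimal sketch using only the standard library; the helper name `cli_available` is hypothetical and not part of the reader's API, and the reader's own detection logic may differ.

```python
import shutil


def cli_available(executable: str = "plasmate") -> bool:
    """Return True if the given executable can be found on PATH."""
    return shutil.which(executable) is not None


if cli_available():
    print("plasmate CLI found; the local fallback should be possible")
else:
    print("plasmate CLI not found; only the API path is available")
```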

### Document Metadata

Each document includes metadata describing its source and the conversion:

```python
doc = documents[0]
print(doc.metadata)
# {
#     "source": "https://example.com/page1",
#     "title": "Page Title",
#     "som_version": "1.0",
#     "compression_ratio": 12.5,
#     "html_bytes": 125000,
#     "som_bytes": 10000,
# }
```
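
In the sample above, `compression_ratio` equals `html_bytes / som_bytes` (125000 / 10000 = 12.5). Assuming that relationship holds in general, which the sample suggests but this README does not state, a quick sanity check over a metadata dictionary might look like the sketch below. The helper name `recompute_ratio` is illustrative.

```python
from typing import Any, Dict


def recompute_ratio(metadata: Dict[str, Any]) -> float:
    """Recompute the compression ratio from the byte counts in a metadata dict.

    Assumes the ratio is html_bytes / som_bytes, as in the sample metadata.
    """
    som_bytes = metadata["som_bytes"]
    if som_bytes <= 0:
        raise ValueError("som_bytes must be positive")
    return metadata["html_bytes"] / som_bytes


# Values copied from the sample metadata
sample = {
    "source": "https://example.com/page1",
    "compression_ratio": 12.5,
    "html_bytes": 125000,
    "som_bytes": 10000,
}

print(recompute_ratio(sample))  # 12.5
```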

## API Reference

### PlasmateReader

```python
PlasmateReader(
    api_key: Optional[str] = None,
    api_base: str = "https://cache.plasmate.app",
)
```

**Parameters:**

- `api_key`: Optional API key for authenticated access to the SOM Cache API
- `api_base`: Base URL for the SOM Cache API (default: `https://cache.plasmate.app`)

### load_data

```python
reader.load_data(
    urls: List[str],
) -> List[Document]
```

**Parameters:**

- `urls`: List of URLs to fetch and convert to documents

**Returns:**

List of LlamaIndex `Document` objects with extracted text and metadata.

## How It Works

1. The reader sends URLs to the Plasmate SOM Cache API
2. Plasmate fetches the page and converts HTML to SOM format
3. The reader extracts readable text from semantic regions:
   - Headings (h1 through h6)
   - Paragraphs
   - Links (with href context)
   - Lists (ordered and unordered)
   - Tables
4. Text is assembled into a clean document with source metadata
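
Steps 3 and 4 can be pictured with a small sketch. The real SOM schema is defined in the SOM Format Specification; the dictionary layout used here (`regions`, `elements`, `type`, `text`, `level`, `href`, `items`, `rows`) is invented for illustration and should not be read as the actual format or as the reader's implementation.

```python
from typing import Any, Dict, List


def som_to_text(som: Dict[str, Any]) -> str:
    """Flatten a SOM-like dict into plain text (hypothetical schema)."""
    lines: List[str] = []
    for region in som.get("regions", []):
        for el in region.get("elements", []):
            kind = el.get("type")
            if kind == "heading":
                # Prefix headings with one '#' per level
                lines.append("#" * el.get("level", 1) + " " + el["text"])
            elif kind == "paragraph":
                lines.append(el["text"])
            elif kind == "link":
                # Keep the href next to the link text for context
                lines.append(f"{el['text']} ({el['href']})")
            elif kind == "list":
                lines.extend(f"- {item}" for item in el["items"])
            elif kind == "table":
                lines.extend(" | ".join(row) for row in el["rows"])
    return "\n".join(lines)


# A made-up SOM-like structure for demonstration
example = {
    "regions": [
        {
            "role": "main",
            "elements": [
                {"type": "heading", "level": 1, "text": "Title"},
                {"type": "paragraph", "text": "Hello."},
                {"type": "link", "text": "Docs", "href": "https://docs.plasmate.app"},
                {"type": "list", "items": ["a", "b"]},
                {"type": "table", "rows": [["x", "y"], ["1", "2"]]},
            ],
        }
    ]
}

print(som_to_text(example))
```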

## Links

- [Plasmate Documentation](https://docs.plasmate.app)
- [SOM Format Specification](https://docs.plasmate.app/som)
- [GitHub Repository](https://github.com/plasmate-labs/llamaindex-plasmate)

## License

Apache 2.0
