Metadata-Version: 2.4
Name: cjm-text-plugin-system
Version: 0.0.13
Summary: Defines standardized interfaces and data structures for text processing plugins, enabling modular NLP operations like sentence splitting, tokenization, and chunking within the cjm-plugin-system ecosystem.
Author-email: "Christian J. Mills" <9126128+cj-mills@users.noreply.github.com>
License: Apache-2.0
Project-URL: Repository, https://github.com/cj-mills/cjm-text-plugin-system
Project-URL: Documentation, https://cj-mills.github.io/cjm-text-plugin-system
Keywords: nbdev,jupyter,notebook,python
Classifier: Natural Language :: English
Classifier: Intended Audience :: Developers
Classifier: Development Status :: 3 - Alpha
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: cjm_plugin_system>=0.0.38
Dynamic: license-file

# cjm-text-plugin-system


<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

## Install

``` bash
pip install cjm_text_plugin_system
```

## Project Structure

    nbs/
    ├── core.ipynb             # DTOs for text processing with character-level span tracking
    ├── plugin_interface.ipynb # Domain-specific plugin interface for text processing operations
    └── storage.ipynb          # Standardized SQLite storage for text processing results with content hashing

Total: 3 notebooks

## Module Dependencies

``` mermaid
graph LR
    core["core<br/>Core Data Structures"]
    plugin_interface["plugin_interface<br/>Text Processing Plugin Interface"]
    storage["storage<br/>Text Processing Storage"]

    plugin_interface --> core
```

*1 cross-module dependency detected*

## CLI Reference

No CLI commands found in this project.

## Module Overview

Detailed documentation for each module in the project:

### Core Data Structures (`core.ipynb`)

> DTOs for text processing with character-level span tracking

#### Import

``` python
from cjm_text_plugin_system.core import (
    TextSpan,
    TextProcessResult
)
```

#### Classes

``` python
@dataclass
class TextSpan:
    "Represents a segment of text with its original character coordinates."
    
    text: str  # The text content of this span
    start_char: int  # 0-indexed start position in original string
    end_char: int  # 0-indexed end position (exclusive)
    label: str = 'sentence'  # Span type: 'sentence', 'token', 'paragraph', etc.
    metadata: Dict[str, Any] = field(...)  # Additional span metadata
    
    def to_dict(self) -> Dict[str, Any]:  # Dictionary representation
        "Convert span to dictionary for serialization."
```

``` python
@dataclass
class TextProcessResult:
    "Container for text processing results."
    
    spans: List[TextSpan]  # List of text spans from processing
    metadata: Dict[str, Any] = field(...)  # Processing metadata
```
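As a minimal sketch of how the two DTOs fit together, the example below uses hypothetical stand-in dataclasses that mirror the documented fields (the real classes live in `cjm_text_plugin_system.core`). The `default_factory=dict` defaults and the use of `asdict` in place of `to_dict()` are assumptions, since the listing above elides them.

``` python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


# Hypothetical stand-ins mirroring the documented fields.
# The metadata defaults are assumed; the listing above shows only field(...).
@dataclass
class TextSpan:
    text: str
    start_char: int
    end_char: int
    label: str = "sentence"
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class TextProcessResult:
    spans: List[TextSpan]
    metadata: Dict[str, Any] = field(default_factory=dict)


source = "Hello world. Goodbye."
result = TextProcessResult(
    spans=[
        TextSpan(text="Hello world.", start_char=0, end_char=12),
        TextSpan(text="Goodbye.", start_char=13, end_char=21),
    ],
    metadata={"processor": "example"},  # illustrative value only
)

# end_char is exclusive, so slicing the source recovers each span's text.
for span in result.spans:
    assert source[span.start_char:span.end_char] == span.text

# asdict is used here as an assumed equivalent of TextSpan.to_dict().
span_dicts = [asdict(span) for span in result.spans]
```

Keeping `end_char` exclusive means span offsets can be used directly as Python slice bounds.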

### Text Processing Plugin Interface (`plugin_interface.ipynb`)

> Domain-specific plugin interface for text processing operations

#### Import

``` python
from cjm_text_plugin_system.plugin_interface import (
    TextProcessingPlugin
)
```

#### Classes

``` python
class TextProcessingPlugin(PluginInterface):
    """
    Abstract base class for plugins that perform NLP operations.
    
    Extends PluginInterface with text processing requirements:
    - `execute`: Dispatch method for different text operations
    - `split_sentences`: Split text into sentence spans with character positions
    """
    
    def execute(
            self,
            action: str = "split_sentences",  # Operation to perform: 'split_sentences', 'tokenize', etc.
            **kwargs
        ) -> Dict[str, Any]:  # JSON-serializable result
        "Execute a text processing operation."
    
    def split_sentences(
            self,
            text: str,  # Input text to split
            **kwargs
        ) -> TextProcessResult:  # Result with TextSpan objects containing character indices
        "Split text into sentence spans with accurate character positions."
```
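The sketch below illustrates the kind of logic a `split_sentences` implementation might contain, along with an `execute`-style dispatcher. It is deliberately self-contained: `Span` is a hypothetical stand-in for `TextSpan`, the regex rule is a naive illustration rather than the approach of any particular plugin, and the shape of the returned dictionary is an assumption. A real plugin would subclass `TextProcessingPlugin` and also satisfy the requirements of `PluginInterface`, which are not covered here.

``` python
import re
from dataclasses import asdict, dataclass
from typing import Any, Dict, List


@dataclass
class Span:  # hypothetical stand-in for TextSpan
    text: str
    start_char: int
    end_char: int
    label: str = "sentence"


# Naive rule: a sentence is a run of non-terminator characters followed by
# any trailing '.', '!' or '?' characters.
_SENTENCE = re.compile(r"[^.!?]+[.!?]*")


def split_sentences(text: str) -> List[Span]:
    spans = []
    for match in _SENTENCE.finditer(text):
        raw = match.group()
        stripped = raw.strip()
        if not stripped:
            continue
        # Move the start past leading whitespace so the offsets describe
        # the stripped text exactly.
        start = match.start() + (len(raw) - len(raw.lstrip()))
        spans.append(Span(stripped, start, start + len(stripped)))
    return spans


def execute(action: str = "split_sentences", **kwargs: Any) -> Dict[str, Any]:
    # Dispatch on the action name and return a JSON-serializable dict.
    if action == "split_sentences":
        return {"spans": [asdict(s) for s in split_sentences(kwargs["text"])]}
    raise ValueError(f"Unknown action: {action}")


sample = "Hi there.  How are you? Fine"
output = execute("split_sentences", text=sample)
```

Every span in `output["spans"]` satisfies `sample[start_char:end_char] == text`, which is the property the interface's "accurate character positions" wording asks for.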

### Text Processing Storage (`storage.ipynb`)

> Standardized SQLite storage for text processing results with content
> hashing

#### Import

``` python
from cjm_text_plugin_system.storage import (
    TextProcessRow,
    TextProcessStorage
)
```

#### Classes

``` python
@dataclass
class TextProcessRow:
    "A single row from the text_jobs table."
    
    job_id: str  # Unique job identifier
    input_text: str  # Original input text
    input_hash: str  # Hash of input text in "algo:hexdigest" format
    config_hash: str  # Hash of the effective processing config used
    spans: Optional[List[Dict[str, Any]]]  # Processed text spans
    metadata: Optional[Dict[str, Any]]  # Processing metadata
    created_at: Optional[float]  # Unix timestamp
```
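The `input_hash` field uses an `"algo:hexdigest"` format. One way to produce such values with the standard library is sketched below. The helper names `hash_text` and `hash_config` are hypothetical, and both SHA-256 and the canonical-JSON approach for configs are assumptions; this README does not state which algorithm or serialization the library expects.

``` python
import hashlib
import json
from typing import Any, Dict


def hash_text(text: str, algo: str = "sha256") -> str:
    """Hash UTF-8 text and prefix the digest with the algorithm name."""
    digest = hashlib.new(algo, text.encode("utf-8")).hexdigest()
    return f"{algo}:{digest}"


def hash_config(config: Dict[str, Any], algo: str = "sha256") -> str:
    """Hash a config dict via canonical JSON so key order does not matter."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hash_text(canonical, algo)


input_hash = hash_text("Hello world.")
config_hash = hash_config({"language": "english", "min_length": 1})
```

Carrying the algorithm name in the value lets a reader of the row tell how the digest was computed without consulting external configuration.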

``` python
class TextProcessStorage:
    "Standardized SQLite storage for text processing results."
    
    def __init__(
            self,
            db_path: str  # Absolute path to the SQLite database file
        ):
        "Initialize storage, create table, run migrations, and build indexes."
    
    def save(
            self,
            job_id: str,       # Unique job identifier
            input_text: str,   # Original input text
            input_hash: str,   # Hash of input text in "algo:hexdigest" format
            config_hash: str,  # Hash of the effective processing config
            spans: Optional[List[Dict[str, Any]]] = None,  # Processed text spans
            metadata: Optional[Dict[str, Any]] = None       # Processing metadata
        ) -> None:
        "Save or replace a text processing result (upsert by input_hash + config_hash)."
    
    def save_with_logging(
            self,
            *,
            job_id: str,       # Unique job identifier
            input_text: str,   # Original input text
            input_hash: str,   # Hash of input text in "algo:hexdigest" format
            config_hash: str,  # Hash of the effective processing config
            spans: Optional[List[Dict[str, Any]]] = None,  # Processed text spans
            metadata: Optional[Dict[str, Any]] = None,      # Processing metadata
            logger: Optional[logging.Logger] = None          # Optional logger for success/failure messages
        ) -> bool:  # True if saved; False if the save failed (error logged, not raised)
        """
        Save a result, logging success/failure. Failures are logged and swallowed (returns False).

        Centralizes the try/save/log/except block that text-processing plugins
        would otherwise reimplement (e.g. NLTK's manual wrap). Returns True on
        success so callers can gate post-save side effects on the result.
        """
    
    def get_cached(
            self,
            input_hash: str,   # Content hash of the input text (the input identity)
            config_hash: str   # Hash of the effective processing config
        ) -> Optional[TextProcessRow]:  # Cached row or None
        """
        Retrieve a cached text processing result by input_hash + config_hash.

        Content-correct by construction: text is passed by value, so input_hash
        identifies the exact input. Different text or config misses.
        """
    
    def get_by_job_id(
            self,
            job_id: str  # Job identifier to look up
        ) -> Optional[TextProcessRow]:  # Row or None if not found
        "Retrieve a text processing result by job ID."
    
    def list_jobs(
            self,
            limit: int = 100  # Maximum number of rows to return
        ) -> List[TextProcessRow]:  # List of text processing rows
        "List text processing jobs ordered by creation time (newest first)."
    
    def verify_input(
            self,
            job_id: str  # Job identifier to verify
        ) -> Optional[bool]:  # True if input matches, False if changed, None if not found
        "Verify the stored input text still matches its hash."
```
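To make the caching semantics concrete, the following sketch reproduces the described behavior with plain `sqlite3`. It is not the library's implementation. The schema is an assumption (only the table name `text_jobs` appears above), as are the `tag_hash` helper, the choice of SHA-256, and the dictionary return shapes. It shows an upsert keyed on `input_hash` + `config_hash`, a cache lookup, a log-and-swallow save wrapper, and input verification.

``` python
import hashlib
import json
import logging
import sqlite3
import time
from typing import Any, Dict, List, Optional

# Assumed schema; an in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE text_jobs (
        job_id      TEXT,
        input_text  TEXT,
        input_hash  TEXT,
        config_hash TEXT,
        spans       TEXT,
        created_at  REAL,
        UNIQUE (input_hash, config_hash)
    )
    """
)


def tag_hash(text: str) -> str:
    """Return a hash in "algo:hexdigest" form (SHA-256 is an assumption)."""
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


def save(job_id: str, input_text: str, config_hash: str,
         spans: List[Dict[str, Any]]) -> None:
    """Insert a row, replacing any row with the same (input_hash, config_hash)."""
    conn.execute(
        "INSERT OR REPLACE INTO text_jobs VALUES (?, ?, ?, ?, ?, ?)",
        (job_id, input_text, tag_hash(input_text), config_hash,
         json.dumps(spans), time.time()),
    )
    conn.commit()


def save_with_logging(logger: Optional[logging.Logger] = None, **fields: Any) -> bool:
    """Call save(), log the outcome, and return False instead of raising."""
    try:
        save(**fields)
    except Exception as exc:
        if logger:
            logger.error("save failed: %s", exc)
        return False
    if logger:
        logger.info("saved job %s", fields.get("job_id"))
    return True


def get_cached(input_hash: str, config_hash: str) -> Optional[Dict[str, Any]]:
    """Look up a row by the (input_hash, config_hash) pair."""
    row = conn.execute(
        "SELECT job_id, spans FROM text_jobs WHERE input_hash = ? AND config_hash = ?",
        (input_hash, config_hash),
    ).fetchone()
    return None if row is None else {"job_id": row[0], "spans": json.loads(row[1])}


def verify_input(job_id: str) -> Optional[bool]:
    """Re-hash the stored text and compare it with the stored hash."""
    row = conn.execute(
        "SELECT input_text, input_hash FROM text_jobs WHERE job_id = ?", (job_id,)
    ).fetchone()
    if row is None:
        return None
    return tag_hash(row[0]) == row[1]


text = "Hello world."
span = {"text": text, "start_char": 0, "end_char": 12, "label": "sentence"}
save("job-1", text, "sha256:cfg-a", [span])
save("job-2", text, "sha256:cfg-a", [span])  # same key pair, so job-1 is replaced
row_count = conn.execute("SELECT COUNT(*) FROM text_jobs").fetchone()[0]
```

Because the unique key is the hash pair rather than `job_id`, re-running the same text with the same config leaves a single row, while changing either the text or the config produces a cache miss.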
