Metadata-Version: 2.4
Name: sunstone-py
Version: 1.9.1
Summary: Python library for managing datasets with lineage tracking in Sunstone projects
Author-email: Sunstone Institute <stig@sunstone.institute>
License: MIT
Project-URL: Homepage, https://github.com/sunstoneinstitute/sunstone-py
Project-URL: Documentation, https://sunstoneinstitute.github.io/sunstone-py/
Project-URL: Repository, https://github.com/sunstoneinstitute/sunstone-py
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: typer>=0.15
Requires-Dist: frictionless>=5.18.1
Requires-Dist: google-auth>=2.43.0
Requires-Dist: openpyxl>=3.1.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: ruamel-yaml>=0.18
Requires-Dist: pyarrow>=23.0.1
Requires-Dist: tomli-w>=1.2.0
Requires-Dist: pint>=0.24
Provides-Extra: gcs
Requires-Dist: google-cloud-storage>=2.0; extra == "gcs"
Provides-Extra: s3
Requires-Dist: boto3>=1.28; extra == "s3"
Provides-Extra: qudt
Requires-Dist: ontopint>=0.1; extra == "qudt"
Dynamic: license-file

# sunstone-py

A Python library for managing datasets with lineage tracking in data science projects.

[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

## Features

- **Automatic Lineage Tracking**: Track data provenance through all operations automatically
- **Dataset Management**: Integration with `datasets.yaml` for organized dataset registration
- **Pandas-Compatible API**: Familiar pandas-like interface via `from sunstone import pandas as pd` (CSV, Excel, JSON)
- **Plugin System**: Extensible architecture for custom auth providers, URL handlers, and format handlers via entry points
- **Strict/Relaxed Modes**: Control whether operations can modify `datasets.yaml`
- **Validation Tools**: Check notebooks and scripts for correct import usage
- **Full Type Hints**: Complete type hint support for better IDE integration

## Installation

```bash
# Using uv (recommended)
uv add sunstone-py

# Using pip
pip install sunstone-py
```

To use the latest commit from GitHub, declare a direct Git dependency in your project's `pyproject.toml`:

```toml
dependencies = [
    "sunstone-py @ git+https://github.com/sunstoneinstitute/sunstone-py.git",
]
```

If you are making changes to a local checkout of sunstone-py and want to test them
from your project, add a `[tool.uv.sources]` override to your project's `pyproject.toml`:

```toml
[tool.uv.sources]
sunstone-py = { path = "../path/to/sunstone-py", editable = true }
```

The path is relative to your project's `pyproject.toml`. Leave the regular PyPI dependency
in `[project.dependencies]` unchanged — the sources override takes precedence locally.
Remember to remove the `[tool.uv.sources]` block before committing.

### For Development

```bash
git clone https://github.com/sunstoneinstitute/sunstone-py.git
cd sunstone-py
uv venv
uv sync
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```


## Quick Start

### 1. Set Up Your Project with datasets.yaml

Create a `datasets.yaml` file in your project directory:

```yaml
inputs:
  - name: School Data
    slug: school-data
    location: data/schools.csv
    source:
      name: Ministry of Education
      location:
        data: https://example.com/schools.csv
      attributedTo: Ministry of Education
      acquiredAt: 2025-01-15
      acquisitionMethod: manual-download
      license: CC-BY-4.0
    fields:
      - name: school_id
        type: string
      - name: enrollment
        type: integer

outputs: []
```

### 2. Use Pandas-Like API with Lineage Tracking

```python
from sunstone import pandas as pd
from pathlib import Path

# Set project path (where datasets.yaml lives)
PROJECT_PATH = Path.cwd()

# Read data - lineage automatically tracked
df = pd.read_csv('data/schools.csv', project_path=PROJECT_PATH)

# Transform using familiar pandas operations
result = df[df['enrollment'] > 100].groupby('district').sum()

# Save with automatic lineage tracking and dataset registration
result.to_csv(
    'outputs/summary.csv',
    slug='school-summary',
    name='School Enrollment Summary',
    index=False
)
```
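In relaxed mode, the `to_csv` call above also registers the output in `datasets.yaml`. The exact fields written are not documented here, but based on the `name`, `slug`, and path passed to the call, the appended entry plausibly looks something like:

```yaml
outputs:
  - name: School Enrollment Summary
    slug: school-summary
    location: outputs/summary.csv
```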

### 3. Check Lineage Metadata

```python
# View lineage information
print(result.lineage.sources)      # Source datasets
print(result.lineage.operations)   # Operations performed
print(result.lineage.get_licenses())  # All source licenses
```

## Core Concepts

### Pandas-Like API

sunstone-py provides a drop-in replacement for pandas that adds lineage tracking:

```python
from sunstone import pandas as pd

# Works like pandas, but tracks lineage
df = pd.read_csv('input.csv', project_path='/path/to/project')
df2 = pd.read_csv('input2.csv', project_path='/path/to/project')

# All pandas operations work
filtered = df[df['value'] > 100]
grouped = df.groupby('category').sum()

# Merge/join operations combine lineage from both sources
merged = pd.merge(df, df2, on='key')
concatenated = pd.concat([df, df2])
```

### Strict vs Relaxed Mode

**Relaxed Mode** (default):
- Writing to new outputs auto-registers them in `datasets.yaml`
- More flexible for exploratory work

**Strict Mode**:
- All reads and writes must be pre-registered in `datasets.yaml`
- Ensures complete documentation of data operations
- Enable via `strict=True` parameter or `SUNSTONE_DATAFRAME_STRICT=1` environment variable

```python
# Enable strict mode
df = pd.read_csv('data.csv', project_path=PROJECT_PATH, strict=True)

# Or globally
import os
os.environ['SUNSTONE_DATAFRAME_STRICT'] = '1'
```

### Validation Tools

Check notebooks for correct import usage:

```python
import sunstone

# Check a single notebook
result = sunstone.check_notebook_imports('analysis.ipynb')
print(result.summary())

# Check all notebooks in project
results = sunstone.validate_project_notebooks('/path/to/project')
for path, result in results.items():
    if not result.is_valid:
        print(f"\n{path}:")
        print(result.summary())
```

## Plugin System

sunstone-py uses a plugin architecture for reading, writing, and fetching data. Built-in handlers cover common formats (CSV, JSON, Excel, Parquet, TSV) and HTTP/HTTPS, local file, GCS, and S3/R2 URLs.

### Plugin Protocols

Plugins implement one or more of these protocols:

- **`AuthProvider`**: Injects authentication headers into HTTP requests
- **`URLHandler`**: Opens URLs for reading/writing, returning file-like streams (`BinaryIO`/`TextIO`)
- **`FormatHandler`**: Reads and writes data formats not built into sunstone
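As a concrete illustration, here is a minimal sketch of a class satisfying the `AuthProvider` protocol, using the `authenticate(url, headers, dataset) -> headers` signature shown in the API reference below. The class name, the host check, and the environment-variable token lookup are all illustrative, not part of sunstone-py:

```python
import os
from typing import Any


class TokenAuthProvider:
    """Illustrative AuthProvider sketch: injects a bearer token for one host.

    Follows the authenticate(url, headers, dataset) -> headers shape from
    the API reference; the token source and host check are made up here.
    """

    def __init__(self, host: str, env_var: str) -> None:
        self.host = host
        self.env_var = env_var

    def authenticate(self, url: str, headers: dict[str, str], dataset: Any) -> dict[str, str]:
        token = os.environ.get(self.env_var)
        if token and self.host in url:
            # Return a copy so the caller's header dict is not mutated.
            return {**headers, "Authorization": f"Bearer {token}"}
        return headers


# Only requests to the matching host (with a token set) get the header.
os.environ["EXAMPLE_TOKEN"] = "secret"
provider = TokenAuthProvider("data.example.com", "EXAMPLE_TOKEN")
out = provider.authenticate("https://data.example.com/x.csv", {"Accept": "text/csv"}, None)
print(out["Authorization"])  # Bearer secret
```

Because the protocols are structural, a plugin class like this needs no base class from sunstone-py; it only has to provide the expected methods.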

### Installation Extras

```bash
pip install sunstone-py          # Core + HTTP + local file handling
pip install sunstone-py[gcs]     # Adds GCS (gs://) support
pip install sunstone-py[s3]      # Adds S3 (s3://) and R2 (r2://) support
pip install sunstone-py[gcs,s3]  # Both
```

### Registering Custom Plugins

Plugins are discovered via Python [entry points](https://packaging.python.org/en/latest/specifications/entry-points/):

```toml
[project.entry-points."sunstone.plugins"]
my-plugin = "my_package:MyPlugin"
```

### Plugin Configuration

Plugin config uses cascading precedence (later sources override earlier):

1. `datasets.yaml` — `plugins.<name>` section
2. `pyproject.toml` — `[tool.sunstone.plugins.<name>]` table
3. Environment variables — `SUNSTONE_PLUGIN_<NAME>_<KEY>`
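The cascade above can be sketched as a plain dict merge, lowest precedence first. How sunstone-py actually parses the `SUNSTONE_PLUGIN_<NAME>_<KEY>` variables is an assumption here; this is only a model of the precedence rule:

```python
import os


def load_plugin_config(name: str,
                       datasets_yaml_cfg: dict,
                       pyproject_cfg: dict) -> dict:
    """Sketch of the cascade: datasets.yaml < pyproject.toml < env vars."""
    cfg = dict(datasets_yaml_cfg)          # 1. lowest precedence
    cfg.update(pyproject_cfg)              # 2. overrides datasets.yaml
    prefix = f"SUNSTONE_PLUGIN_{name.upper()}_"
    for var, value in os.environ.items():  # 3. highest precedence
        if var.startswith(prefix):
            cfg[var[len(prefix):].lower()] = value
    return cfg


os.environ["SUNSTONE_PLUGIN_GCS_BUCKET"] = "override-bucket"
cfg = load_plugin_config("gcs",
                         {"bucket": "from-yaml", "timeout": "30"},
                         {"bucket": "from-toml"})
print(cfg)  # bucket comes from the env var, timeout from datasets.yaml
```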

## Advanced Usage

### Direct DataFrame API

For more control, use the DataFrame class directly:

```python
from sunstone import DataFrame

# Read with explicit parameters
df = DataFrame.read_csv(
    'data.csv',
    project_path='/path/to/project',
    strict=True
)

# Access underlying pandas DataFrame
pandas_df = df.data
```

### DataFrame Metadata

Set metadata on DataFrames that flows through to `datasets.yaml` on write:

```python
from sunstone import pandas as pd

df = pd.read_csv('input.csv', project_path=PROJECT_PATH)
result = df[df['value'] > 100]

# Set output identity and description
result.metadata.slug = "filtered-data"
result.metadata.name = "Filtered Data"
result.metadata.description = "Values above threshold"

# Set RDF metadata
result.metadata.rdf_prefixes = {"schema": "https://schema.org/"}
result.metadata.custom_properties = {"schema:about": "Analysis"}

# Annotate columns
result.set_field_metadata("value", description="Measured value", unit="kg")

# Write — slug/name come from metadata
result.to_csv('outputs/filtered.csv', index=False)
```

Available metadata:

- `df.metadata.slug`: Dataset slug (used at write time)
- `df.metadata.name`: Dataset name (used at write time)
- `df.metadata.description`: Dataset description
- `df.metadata.rdf_prefixes`: RDF namespace prefixes
- `df.metadata.custom_properties`: Custom properties (RDF-style)
- `df.set_field_metadata(column, *, description, unit, source, type, constraints)`: Annotate a column

### Managing datasets.yaml Programmatically

```python
from sunstone import DatasetsManager, FieldSchema

manager = DatasetsManager('/path/to/project')

# Find datasets
dataset = manager.find_dataset_by_slug('school-data')
dataset = manager.find_dataset_by_location('data/schools.csv')

# Add new output dataset
manager.add_output_dataset(
    name='Analysis Results',
    slug='analysis-results',
    location='outputs/results.csv',
    fields=[
        FieldSchema(name='category', type='string'),
        FieldSchema(name='count', type='integer'),
        FieldSchema(name='avg_value', type='number')
    ],
    publish=True
)
```

## Documentation

- [Contributing Guide](CONTRIBUTING.md)
- [Changelog](CHANGELOG.md)
- [API Reference](#api-reference) (below)

## API Reference

### pandas Module

Drop-in replacement for pandas with lineage tracking:

- `read_csv(filepath, project_path, strict=False, **kwargs)`: Read CSV with lineage
- `read_excel(filepath, project_path, strict=False, **kwargs)`: Read Excel (.xlsx/.xls) with lineage
- `read_json(filepath, project_path, strict=False, **kwargs)`: Read JSON with lineage
- `merge(left, right, **kwargs)`: Merge DataFrames with combined lineage
- `concat(dfs, **kwargs)`: Concatenate DataFrames with combined lineage

### DataFrame Class

Main class for working with data:

- `read_csv(filepath, project_path, strict=False, **kwargs)`: Read CSV with lineage tracking
- `read_excel(filepath, project_path, strict=False, **kwargs)`: Read Excel with lineage tracking
- `to_csv(path, slug, name, publish=False, **kwargs)`: Write CSV and register
- `merge(right, **kwargs)`: Merge with another DataFrame
- `join(other, **kwargs)`: Join with another DataFrame
- `concat(others, **kwargs)`: Concatenate DataFrames
- `set_field_metadata(column, **kwargs)`: Annotate column metadata
- `.data`: Access underlying pandas DataFrame
- `.metadata`: Access unified metadata container
- `.lineage`: Access lineage metadata (deprecated — use `.metadata.lineage`)

### DatasetsManager Class

Manage `datasets.yaml` files:

- `find_dataset_by_location(location, dataset_type='input')`: Find by file path
- `find_dataset_by_slug(slug, dataset_type='input')`: Find by slug
- `get_all_inputs()`: Get all input datasets
- `get_all_outputs()`: Get all output datasets
- `add_output_dataset(...)`: Register new output
- `update_output_dataset(...)`: Update existing output

### Validation Functions

- `check_notebook_imports(notebook_path)`: Validate a single notebook
- `validate_project_notebooks(project_path)`: Validate all notebooks in project

### Plugin Protocols

- `AuthProvider`: Implement `authenticate(url, headers, dataset) -> headers` to inject auth
- `URLHandler`: Implement `can_handle(url) -> bool` and `open(url, mode) -> BinaryIO | TextIO`
- `FormatHandler`: Implement `can_read(path, format)`, `read(stream, **kwargs)`, `can_write(path, format)`, `write(df, stream, **kwargs)`

### PluginRegistry Class

Singleton that discovers and manages plugins:

- `PluginRegistry.get()`: Get the singleton registry instance
- `get_auth_providers()`: Return all registered auth providers
- `get_url_handlers()`: Return all registered URL handlers
- `get_format_handlers()`: Return all registered format handlers
- `find_url_handler(url)`: Find first handler that can handle a URL
- `find_format_reader(path, format)`: Find first handler that can read a file
- `find_format_writer(path, format)`: Find first handler that can write a file
- `fetch(url, dest)`: Convenience — download URL to local file via `open()`

### Exceptions

- `SunstoneError`: Base exception
- `DatasetNotFoundError`: Dataset not found in datasets.yaml
- `StrictModeError`: Operation blocked in strict mode
- `DatasetValidationError`: Validation failed
- `LineageError`: Lineage tracking error

## Environment Variables

- `SUNSTONE_DATAFRAME_STRICT`: Set to `"1"` or `"true"` to enable strict mode globally
- `SUNSTONE_PLUGIN_<NAME>_<KEY>`: Override plugin configuration (highest precedence)

## Development

See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup and guidelines.

### Running Tests

```bash
uv run pytest
```

### Type Checking

```bash
uv run mypy
```

### Linting and Formatting

```bash
uv run ruff check
uv run ruff format
```

## About Sunstone Institute

[Sunstone Institute](https://sunstone.institute) is a philanthropy-funded organization using data and AI to show the world as it really is, and inspire action everywhere.

## License

MIT License - see [LICENSE](LICENSE) file for details.

## Support

- **Issues**: [GitHub Issues](https://github.com/sunstoneinstitute/sunstone-py/issues)

---

Made with ❤️ by [Sunstone Institute](https://sunstone.institute)
