Metadata-Version: 2.4
Name: s3-data-organizer-mcp
Version: 0.1.0
Summary: MCP server for safe S3 data layout analysis and cleanup planning
Project-URL: Homepage, https://github.com/YummyTastyCode/s3-data-organizer-mcp
Project-URL: Issues, https://github.com/YummyTastyCode/s3-data-organizer-mcp/issues
Author: YummyTastyCode
License-Expression: MIT
License-File: LICENSE
Keywords: aws,cleanup,finops,lifecycle,mcp,s3,storage
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: >=3.11
Requires-Dist: boto3<2,>=1.34
Requires-Dist: mcp<2,>=1.9
Provides-Extra: test
Requires-Dist: pytest<9,>=8; extra == 'test'
Description-Content-Type: text/markdown

# S3 Data Organizer MCP

MCP server for safe S3 data layout analysis and cleanup planning.

This is an early prototype. Version `0.1.x` is **read-only**: it exposes scan,
analysis, and proposal tools only. It does not delete, copy, or tag objects,
and it does not change lifecycle policies.

This project is not affiliated with, endorsed by, or sponsored by Amazon Web
Services. AWS and Amazon S3 are trademarks of Amazon.com, Inc. or its affiliates.

## Purpose

The goal is a cloud-storage equivalent of a safe file organizer:

```text
scan S3 prefix
-> summarize layout and cost
-> find large objects and duplicate candidates
-> suggest cleanup/lifecycle options
-> generate a reviewable plan
-> only later apply with explicit confirmation
```

## Current Tools

- `get_s3_organizer_status`: local policy/dependency status.
- `scan_s3_prefix`: read S3 object metadata under an allowlisted prefix.
- `summarize_s3_layout`: object count, total bytes, extensions, storage classes,
  top prefixes, and rough monthly storage cost.
- `find_s3_large_objects`: largest objects under a prefix.
- `find_s3_duplicate_candidates`: ETag-based duplicate candidates.
- `rank_s3_cold_candidates`: LRU-like ranking using `LastModified`, object
  size, and artifact type. `LastModified` is only a proxy for last access, not
  a true last-access time.
- `list_s3_prefix_children`: read-only pseudo-folder navigation for one prefix
  level.
- `analyze_s3_prefix_tree`: folder-like rollups by projected S3 prefix depth.
- `analyze_s3_artifact_types`: classify objects by artifact type, extension,
  and top prefix.
- `inspect_s3_hidden_storage`: inspect object versions, delete markers, and
  incomplete multipart uploads.
- `propose_s3_cleanup_options`: review options and safe next steps.
- `propose_s3_lifecycle_options`: heuristic lifecycle rule ideas.

See [docs/COMMANDS.md](docs/COMMANDS.md) for the public command contract.
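
The ETag-based check behind `find_s3_duplicate_candidates` could look roughly
like the sketch below. It is not the project's implementation:
`is_multipart_etag`, `duplicate_candidates`, and the output field names are
hypothetical, and the input is assumed to be a list of dicts shaped like S3
listing entries (`Key`, `ETag`, `Size`). Grouping on size as well as ETag is an
added assumption.

```python
from collections import defaultdict


def is_multipart_etag(etag: str) -> bool:
    """Multipart-upload ETags end in '-<part count>' and are not a plain MD5."""
    _, sep, parts = etag.strip('"').rpartition("-")
    return bool(sep) and parts.isdigit()


def duplicate_candidates(objects: list[dict]) -> list[dict]:
    """Group objects sharing ETag and size; keep groups of two or more keys."""
    groups: dict[tuple[str, int], list[str]] = defaultdict(list)
    for obj in objects:
        groups[(obj["ETag"].strip('"'), obj["Size"])].append(obj["Key"])

    result = []
    for (etag, size), keys in groups.items():
        if len(keys) < 2:
            continue
        result.append(
            {
                "etag": etag,
                "size": size,
                "keys": sorted(keys),
                # Multipart ETags depend on part boundaries, so flag them.
                "needs_checksum": is_multipart_etag(etag),
                # Bytes freed if all but one copy were removed.
                "reclaimable_bytes": size * (len(keys) - 1),
            }
        )
    # Largest potential saving first.
    return sorted(result, key=lambda g: g["reclaimable_bytes"], reverse=True)
```

Every group is still only a candidate. Encrypted objects cannot be recognized
from the ETag string alone, so a checksum comparison is needed before acting on
any group.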

## Safety Boundaries

- Read-only: this version performs no writes.
- Requires `S3_ORGANIZER_ALLOWED_ROOTS`.
- Refuses to inspect S3 URIs outside allowlisted roots.
- Future destructive operations should require an explicit policy opt-in and
  confirmation tokens.
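- ETag duplicate detection is only a candidate signal. For multipart uploads
  and for objects encrypted with SSE-KMS or SSE-C, the ETag is not an MD5
  digest of the content, so those objects need additional checksum validation.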
- Cold-candidate ranking uses S3 `LastModified` as a proxy; S3 object listing
  metadata does not include true last-access time.
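
The allowlist rule above can be sketched as a prefix check on the parsed URI.
This is an illustration under stated assumptions, not the server's code:
`split_s3_uri` and `is_allowed` are hypothetical names, and matching on whole
path segments (so `data` does not also allow `database`) is an assumed design
choice.

```python
def split_s3_uri(uri: str) -> tuple[str, str]:
    """Return (bucket, key_or_prefix) for an s3:// URI, or raise ValueError."""
    scheme = "s3://"
    if not uri.startswith(scheme):
        raise ValueError(f"not an S3 URI: {uri!r}")
    bucket, _, key = uri[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket in S3 URI: {uri!r}")
    return bucket, key


def is_allowed(uri: str, allowed_roots: list[str]) -> bool:
    """True if uri equals an allowlisted root or lies underneath one."""
    bucket, key = split_s3_uri(uri)
    for root in allowed_roots:
        root_bucket, root_prefix = split_s3_uri(root)
        if bucket != root_bucket:
            continue
        root_prefix = root_prefix.rstrip("/")
        # An empty root prefix allows the whole bucket. Otherwise require an
        # exact match or a match that ends on a '/' segment boundary.
        if (
            not root_prefix
            or key.rstrip("/") == root_prefix
            or key.startswith(root_prefix + "/")
        ):
            return True
    return False
```

Comparing on segment boundaries rather than raw string prefixes avoids
accidentally allowing sibling prefixes that merely share leading characters.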

## Install

From PyPI:

```bash
pipx install s3-data-organizer-mcp
```

Or run without a persistent install:

```bash
uvx s3-data-organizer-mcp
```

Local development:

```bash
python3.11 -m venv .venv
.venv/bin/pip install -e ".[test]"
.venv/bin/python -m pytest
```

Run the MCP server:

```bash
s3-data-organizer-mcp
```

Example MCP client config:

```json
{
  "mcpServers": {
    "s3-data-organizer": {
      "command": "s3-data-organizer-mcp",
      "env": {
        "AWS_PROFILE": "research",
        "AWS_REGION": "eu-north-1",
        "S3_ORGANIZER_ENDPOINT_URL": "",
        "S3_ORGANIZER_ALLOWED_ROOTS": "s3://YOUR_BUCKET/data,s3://YOUR_BUCKET/archive",
        "S3_ORGANIZER_MAX_SCAN_KEYS": "10000",
        "S3_ORGANIZER_STORAGE_PRICE_USD_PER_GB_MONTH": "0.023",
        "S3_ORGANIZER_ALLOW_WRITES": "false"
      }
    }
  }
}
```

The same example is available in `examples/mcp-config.json`.

## Configuration

```bash
export AWS_PROFILE=research
export AWS_REGION=eu-north-1
# Leave empty for AWS S3; set it only for S3-compatible providers.
export S3_ORGANIZER_ENDPOINT_URL=
export S3_ORGANIZER_ALLOWED_ROOTS=s3://YOUR_BUCKET/data,s3://YOUR_BUCKET/archive
export S3_ORGANIZER_MAX_SCAN_KEYS=10000
export S3_ORGANIZER_STORAGE_PRICE_USD_PER_GB_MONTH=0.023
```

Writes are intentionally disabled in the current prototype:

```bash
export S3_ORGANIZER_ALLOW_WRITES=false
```

For S3-compatible providers such as reg.ru, set `S3_ORGANIZER_ENDPOINT_URL`,
for example:

```bash
export AWS_REGION=auto
export S3_ORGANIZER_ENDPOINT_URL=https://s3.regru.cloud
```

## IAM

Read-only prototype permissions:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:ListBucketVersions",
        "s3:ListBucketMultipartUploads"
      ],
      "Resource": "arn:aws:s3:::YOUR_BUCKET",
      "Condition": {
        "StringLike": {
          "s3:prefix": [
            "data/*",
            "archive/*"
          ]
        }
      }
    }
  ]
}
```

Only `s3:ListBucket` is needed for the core scan/summarize/rank tools.
`s3:ListBucketVersions` and `s3:ListBucketMultipartUploads` are needed only for
`inspect_s3_hidden_storage`.
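
To show why `s3:ListBucket` alone covers a metadata scan, here is a minimal
sketch built on boto3's `list_objects_v2` paginator, which is authorized by
that action. `iter_object_metadata` is a hypothetical helper, and the
`max_keys` cap is an assumption modeled on `S3_ORGANIZER_MAX_SCAN_KEYS`.

```python
from collections.abc import Iterator
from typing import Any


def iter_object_metadata(
    s3_client: Any, bucket: str, prefix: str, max_keys: int = 10000
) -> Iterator[dict]:
    """Yield listing metadata under a prefix, stopping after max_keys objects."""
    paginator = s3_client.get_paginator("list_objects_v2")
    seen = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages with no matching keys omit "Contents" entirely.
        for obj in page.get("Contents", []):
            if seen >= max_keys:
                return
            seen += 1
            yield {
                "Key": obj["Key"],
                "Size": obj["Size"],
                "ETag": obj["ETag"],
                "LastModified": obj["LastModified"],
                "StorageClass": obj.get("StorageClass", "STANDARD"),
            }


# Usage (requires boto3 and credentials with s3:ListBucket on the bucket):
#   import boto3
#   s3 = boto3.client("s3")
#   for meta in iter_object_metadata(s3, "YOUR_BUCKET", "data/"):
#       print(meta["Key"], meta["Size"])
```

Only listing metadata is read here. Object bodies are never fetched, so
`s3:GetObject` is not required for this kind of scan.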

Future write-capable versions will need separate policies for tagging, copy,
delete, lifecycle configuration, or Batch Operations manifest generation.

## Publishing

Release steps are documented in [PUBLISHING.md](PUBLISHING.md). The short
version is:

```bash
python -m pytest -q
python -m build
python -m twine check dist/*
python -m twine upload dist/*
```

## Product Direction

This should not become a generic S3 file manager. The useful product is:

- S3 layout intelligence.
- Read-only pseudo-folder navigation.
- Cleanup options.
- Lifecycle rule suggestions.
- Duplicate candidate review.
- Cold-candidate and artifact-type ranking.
- Cost/savings estimates.
- Safe manifests for AWS-native execution.

For large buckets, the right backend is likely S3 Inventory + Athena + S3 Batch
Operations rather than listing every object interactively.
