Metadata-Version: 2.4
Name: dt-validator
Version: 0.3.0
Summary: Read, validate and print the contents of a publicly-accessible (open) S3 object, no AWS credentials required.
Author: Jayendra
License: MIT
Project-URL: Homepage, https://github.com/optevo/dt-validator
Project-URL: Issues, https://github.com/optevo/dt-validator/issues
Keywords: s3,aws,boto3,public,reader,validation
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries
Classifier: Typing :: Typed
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: boto3>=1.26
Provides-Extra: server
Requires-Dist: flask>=2; extra == "server"
Provides-Extra: dev
Requires-Dist: pytest>=7; extra == "dev"
Requires-Dist: pytest-cov>=4; extra == "dev"
Requires-Dist: moto>=4; extra == "dev"
Requires-Dist: flask>=2; extra == "dev"
Dynamic: license-file

# dt-validator

Read, **validate**, and print the contents of a **publicly-accessible (open)** S3
object, with no AWS credentials required. It uses `boto3` to send unsigned
(anonymous) requests.

## Features

- Anonymous reads of public S3 objects (`s3://`, virtual-hosted, and path-style URLs).
- A composable **file-validation** layer: extension, content-type, size bounds,
  non-empty, text-encoding, and checksum checks.
- Cheap pre-flight validation via a HEAD request *before* downloading the body.
- Typed exception hierarchy, opt-in logging, `py.typed` for type-checkers.
- Library API **and** a CLI, both fully tested (pytest + moto).
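
The three URL forms named in the first bullet can each be reduced to a bucket and key with standard-library parsing. The sketch below is illustrative only: `parse_s3_location` and its regular expressions are hypothetical names, not taken from `reader.py`, and it assumes the standard `amazonaws.com` endpoints.

```python
import re
from urllib.parse import unquote, urlparse

# Assumed host patterns for the standard AWS endpoints (illustrative, not exhaustive).
_PATH_STYLE_HOST = re.compile(r"^s3(?:[.-][a-z0-9-]+)?\.amazonaws\.com$")
_VIRTUAL_HOSTED_HOST = re.compile(
    r"^(?P<bucket>.+)\.s3(?:[.-][a-z0-9-]+)?\.amazonaws\.com$"
)


def parse_s3_location(url: str) -> tuple[str, str]:
    """Return (bucket, key) for an s3://, virtual-hosted, or path-style URL."""
    parts = urlparse(url)
    if parts.scheme == "s3":
        # s3://bucket/key
        bucket, key = parts.netloc, parts.path.lstrip("/")
    elif parts.scheme in ("http", "https"):
        host = parts.netloc.lower()
        path = unquote(parts.path.lstrip("/"))
        if _PATH_STYLE_HOST.match(host):
            # https://s3.<region>.amazonaws.com/bucket/key
            bucket, _, key = path.partition("/")
        else:
            # https://bucket.s3.<region>.amazonaws.com/key
            match = _VIRTUAL_HOSTED_HOST.match(host)
            if match is None:
                raise ValueError(f"not a recognised S3 URL: {url!r}")
            bucket, key = match.group("bucket"), path
    else:
        raise ValueError(f"unsupported scheme in {url!r}")
    if not bucket or not key:
        raise ValueError(f"missing bucket or key in {url!r}")
    return bucket, key
```

Raising `ValueError` here loosely mirrors the fact that the package's `InvalidUriError` is also a `ValueError`; the real parser may accept or reject different inputs.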

## Install

```bash
pip install -e .            # runtime
pip install -e ".[server]"  # + flask
pip install -e ".[dev]"     # + pytest, pytest-cov, moto, flask
```

## CLI

```bash
# Print an object
dt-validator s3://my-open-bucket/path/to/file.txt

# Metadata only (HEAD, no download)
dt-validator s3://my-open-bucket/file.txt --head

# Read only the first 1 KB
dt-validator s3://my-open-bucket/big.log --max-bytes 1024

# Raw bytes to stdout
dt-validator s3://my-open-bucket/logo.png --binary > logo.png

# With validation — fails (non-zero exit) if any constraint is violated
dt-validator s3://my-open-bucket/data.csv \
    --ext csv --content-type text/csv \
    --max-size 1048576 --non-empty \
    --require-encoding utf-8 \
    --checksum sha256:9f86d0818...
```

Exit codes: `0` ok · `1` S3/network error · `2` bad usage · `3` validation failed ·
`4` not found · `5` access denied.
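
When calling the CLI from another script, the exit codes above can be turned into readable outcomes. This is a minimal sketch: `EXIT_MEANINGS`, `describe_exit`, and `run_dt_validator` are hypothetical helper names, and it assumes the `dt-validator` command is on `PATH`.

```python
import subprocess

# Exit codes exactly as documented for the CLI.
EXIT_MEANINGS = {
    0: "ok",
    1: "S3/network error",
    2: "bad usage",
    3: "validation failed",
    4: "not found",
    5: "access denied",
}


def describe_exit(code: int) -> str:
    """Translate a dt-validator exit code into a short description."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_dt_validator(*args: str) -> tuple[int, str, str]:
    """Run the CLI and return (exit code, meaning, captured stdout)."""
    proc = subprocess.run(
        ["dt-validator", *args], capture_output=True, text=True
    )
    return proc.returncode, describe_exit(proc.returncode), proc.stdout
```

For example, `run_dt_validator("s3://my-open-bucket/data.csv", "--ext", "csv")` would return code `3` with the meaning `"validation failed"` if the extension check does not pass.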

## Library

```python
from dt_validator import read_object, ValidationPolicy

# Simple read (str by default; encoding=None -> bytes)
text = read_object("s3://my-open-bucket/notes.txt")

# Read with a validation policy
policy = ValidationPolicy(
    allowed_extensions=[".csv"],
    allowed_content_types=["text/csv", "text/plain"],
    max_bytes=5 * 1024 * 1024,
    require_non_empty=True,
    expected_encoding="utf-8",
    checksum_algorithm="sha256",
    expected_checksum="9f86d0818...",
)
data = read_object("s3://my-open-bucket/data.csv", policy=policy)
```

### Reading a file whose URL comes from an API

The API response supplies a **file URL**, and the package then fetches that file
itself. The endpoint acts as a layer of *indirection*: you change the file location
in the API's response instead of hard-coding it in your app.

Flow: call the API → extract the file URL from its response → read that file
(`s3://` or `http(s)://`) → return its contents.

```python
from dt_validator import ValidationPolicy, read_file_from_api, read_url

# 1) call the API  2) read the file URL from its response  3) return that file's content
text = read_file_from_api()   # endpoint defaults to https://file-read.free.beeceptor.com

# Custom endpoint / JSON field / validation policy
text = read_file_from_api(
    "https://my-api.example.com/current-file",
    url_field="url",            # JSON field holding the file URL (default: "url")
    method="GET",              # or "POST"
    policy=ValidationPolicy(max_bytes=1_000_000, expected_encoding="utf-8"),
)

# Or read a file URL you already have (s3:// or http(s)://)
data = read_url("s3://my-open-bucket/notes.txt")
```

CLI:

```bash
dt-validator --via-api
dt-validator --via-api \
    --api-endpoint https://my-api.example.com/current-file \
    --api-method GET --api-url-field url
```

Configure your endpoint to return the file URL, e.g. response body:

```json
{"url": "s3://my-open-bucket/notes.txt"}
```

A bare URL returned as plain text also works.

> **Note:** the default endpoint is a Beeceptor **mock** — until you add a rule
> that returns a file URL, it replies with placeholder text and the package will
> report that no file URL was found.
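
The "extract the file URL" step can be pictured as follows. This is a rough sketch under stated assumptions, not the package's implementation: `extract_file_url` is a hypothetical name, and the rule that a URL must use `s3`, `http`, or `https` is inferred from the schemes mentioned above.

```python
import json
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"s3", "http", "https"}


def _looks_like_file_url(value: str) -> bool:
    """True when the value has a supported scheme and a host or bucket part."""
    parts = urlparse(value)
    return parts.scheme in ALLOWED_SCHEMES and bool(parts.netloc)


def extract_file_url(body: str, url_field: str = "url") -> str:
    """Pull a file URL out of a JSON object or a bare plain-text response."""
    text = body.strip()
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        payload = None  # not JSON, so fall back to treating it as plain text

    if isinstance(payload, dict):
        candidate = payload.get(url_field)
        if isinstance(candidate, str) and _looks_like_file_url(candidate.strip()):
            return candidate.strip()
    elif _looks_like_file_url(text):
        return text
    raise ValueError("no file URL found in API response")
```

With a mock endpoint that still returns placeholder text, neither branch matches and the function raises, which corresponds to the "no file URL was found" behaviour described in the note.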

Cheap metadata check without downloading:

```python
from dt_validator import head_object

meta = head_object("s3://my-open-bucket/data.csv")
print(meta.size, meta.content_type, meta.etag)
```
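
Because size and content type are available from the HEAD response, some constraints can be checked before any body bytes are transferred. The sketch below illustrates that idea with a stand-in `ObjectMeta` dataclass; both `ObjectMeta` and `preflight_problems` are hypothetical, and only the `size`, `content_type`, and `etag` field names come from the example above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ObjectMeta:
    """Stand-in for HEAD metadata (not the package's own class)."""
    size: int
    content_type: str
    etag: str


def preflight_problems(
    meta: ObjectMeta,
    *,
    max_bytes: Optional[int] = None,
    allowed_content_types: Optional[Sequence[str]] = None,
    require_non_empty: bool = False,
) -> list[str]:
    """Return a list of constraint violations detectable from metadata alone."""
    problems: list[str] = []
    if require_non_empty and meta.size == 0:
        problems.append("object is empty")
    if max_bytes is not None and meta.size > max_bytes:
        problems.append(f"size {meta.size} exceeds limit {max_bytes}")
    if allowed_content_types is not None:
        # Ignore parameters such as "; charset=utf-8" when comparing types.
        base_type = meta.content_type.split(";", 1)[0].strip().lower()
        if base_type not in {t.lower() for t in allowed_content_types}:
            problems.append(f"content type {base_type!r} not allowed")
    return problems
```

Checks that need the actual bytes, such as encoding and checksum validation, cannot be done at this stage and still require the download.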

Standalone validators (each raises a specific `ValidationError` subclass):

```python
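import hashlib

from dt_validator import (
    validate_extension, validate_content_type, validate_size,
    validate_not_empty, validate_encoding, validate_checksum,
)

data = b"id,name\n1,example\n"                    # stand-in payload for illustration
expected_hex = hashlib.sha256(data).hexdigest()   # normally a known-good digest

validate_extension("data.csv", ["csv", ".tsv"])   # leading dot is optional
validate_size(len(data), max_bytes=1_000_000)
validate_checksum(data, "sha256", expected_hex)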
```
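
For readers curious what a checksum check amounts to, here is a standard-library sketch. It is not the package's `validate_checksum`; `parse_checksum_spec` and `checksum_matches` are hypothetical names, and the `algorithm:hexdigest` format is assumed from the CLI's `--checksum sha256:...` example.

```python
import hashlib
import hmac


def parse_checksum_spec(spec: str) -> tuple[str, str]:
    """Split an 'algorithm:hexdigest' string into its two parts."""
    algorithm, sep, hex_digest = spec.partition(":")
    if not sep or not algorithm or not hex_digest:
        raise ValueError(f"expected 'algorithm:hexdigest', got {spec!r}")
    return algorithm.lower(), hex_digest.lower()


def checksum_matches(data: bytes, algorithm: str, expected_hex: str) -> bool:
    """Hash the payload and compare it to the expected hex digest."""
    # hashlib.new raises ValueError for an unknown algorithm name.
    actual_hex = hashlib.new(algorithm, data).hexdigest()
    # compare_digest avoids leaking where the two strings first differ.
    return hmac.compare_digest(actual_hex, expected_hex.lower())
```

A validator built on this would raise an error when `checksum_matches` returns `False` rather than returning a boolean.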

## Package layout

```
src/dt_validator/
  __init__.py        public API
  reader.py          parsing, anonymous read, HEAD, error mapping
  validation.py      validators + ValidationPolicy
  exceptions.py      typed exception hierarchy
  _logging.py        NullHandler + opt-in configure_logging()
  cli.py             argparse CLI
tests/
  test_reader.py               URI parsing
  test_validation.py           validators + policy
  test_reader_integration.py   reader against moto S3
  test_cli.py                  CLI against moto S3
```

## Exceptions

All derive from `FileValidatorError`:

- `InvalidUriError` (also a `ValueError`)
- `ObjectNotFoundError`, `AccessDeniedError`, `RemoteReadError`
- `ValidationError` → `FileSizeError`, `ExtensionError`, `ContentTypeError`,
  `EncodingError`, `ChecksumError`
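
The hierarchy above determines what a single `except` clause catches. The stand-in classes below only mirror the names listed here so the behaviour can be tried in isolation; they are not the contents of `exceptions.py`, the direct parent of each non-validation error is an assumption, and `classify` is a hypothetical helper.

```python
class FileValidatorError(Exception):
    """Stand-in for the package's base exception."""

class InvalidUriError(FileValidatorError, ValueError):
    """Also a ValueError, as noted in the list above."""

class ObjectNotFoundError(FileValidatorError): pass
class AccessDeniedError(FileValidatorError): pass
class RemoteReadError(FileValidatorError): pass

class ValidationError(FileValidatorError): pass
class FileSizeError(ValidationError): pass
class ExtensionError(ValidationError): pass
class ContentTypeError(ValidationError): pass
class EncodingError(ValidationError): pass
class ChecksumError(ValidationError): pass


def classify(exc: Exception) -> str:
    """Map an exception to a coarse category, most specific checks first."""
    if isinstance(exc, ValidationError):
        return "validation"
    if isinstance(exc, (ObjectNotFoundError, AccessDeniedError)):
        return "unavailable"
    if isinstance(exc, FileValidatorError):
        return "other package error"
    return "unrelated"
```

In application code this means `except ValidationError` handles every failed constraint, while `except FileValidatorError` acts as a catch-all for anything the package raises.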

## Note on "open" access

The object (or bucket) must allow anonymous `s3:GetObject`. This tool intentionally
sends **unsigned** requests, so private objects return `403 AccessDenied`.
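
An unsigned request is an ordinary HTTPS request with no `Authorization` header. The package uses `boto3` for this; the sketch below deliberately swaps in the standard library's `urllib` to show the idea without dependencies. The helper names are hypothetical, and the global virtual-hosted endpoint `<bucket>.s3.amazonaws.com` is an assumption that may not suit every bucket or region.

```python
import urllib.error
import urllib.request
from urllib.parse import quote


def public_object_url(bucket: str, key: str) -> str:
    """Build a virtual-hosted HTTPS URL for an object (slashes in the key kept)."""
    return f"https://{bucket}.s3.amazonaws.com/{quote(key)}"


def build_anonymous_head(bucket: str, key: str) -> urllib.request.Request:
    """Create a HEAD request that carries no credentials or signature."""
    return urllib.request.Request(public_object_url(bucket, key), method="HEAD")


def is_publicly_readable(bucket: str, key: str, timeout: float = 10.0) -> bool:
    """Return True if an anonymous HEAD succeeds, False on 403 or 404."""
    try:
        with urllib.request.urlopen(build_anonymous_head(bucket, key), timeout=timeout):
            return True
    except urllib.error.HTTPError as err:
        if err.code in (403, 404):
            return False
        raise  # other HTTP errors are left for the caller to handle
```

A private object answers such a request with `403`, which is the same outcome the tool reports as access denied.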

## Tests

```bash
pytest              # 46 tests
pytest --cov        # with coverage
```
