Metadata-Version: 2.4
Name: ml-analytics-tools
Version: 0.2.0
Summary: Tools for ML projects and data management
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: boto3>=1.37.24
Requires-Dist: catboost>=1.2.8
Requires-Dist: ddtrace>=3.4.1
Requires-Dist: dotenv>=0.9.9
Requires-Dist: duckdb>=1.4.1
Requires-Dist: google-api-python-client>=2.150.0
Requires-Dist: google-auth>=2.35.0
Requires-Dist: google-auth-httplib2>=0.2.0
Requires-Dist: google-auth-oauthlib>=1.2.0
Requires-Dist: ipykernel>=6.29.5
Requires-Dist: lifelines>=0.30.3
Requires-Dist: mlflow==3.10.1
Requires-Dist: mlflow[auth]==3.10.1
Requires-Dist: pip>=25.3
Requires-Dist: polars==1.30.0
Requires-Dist: pytest>=8.3.5
Requires-Dist: pyyaml>=6.0.2
Requires-Dist: redshift-connector>=2.1.9
Requires-Dist: ruff>=0.11.4
Requires-Dist: schedule>=1.2.2
Requires-Dist: scikit-learn==1.5.2
Requires-Dist: seaborn>=0.13.2
Requires-Dist: setuptools>=42.0.0
Requires-Dist: shap>=0.47.2
Requires-Dist: slack-sdk>=3.27.0
Dynamic: license-file

# ML Analytics Tools

Utilities for common analytics and machine learning workflows: Redshift, S3,
Google Sheets, Slack, MLflow, model evaluation, and SQL pipelines.

The package is intentionally infrastructure-neutral. Buckets, credentials,
MLflow hosts, and tokens are provided by your environment or by explicit
arguments.

## What Is Included

- `DataConnector`: run Redshift SQL, load SQL files, unload/load data through S3, and create Redshift tables from DataFrames.
- `S3Connector`: read, write, list, delete, and query S3 data with DuckDB.
- `GSheet`: read, write, share, and export Google Sheets data.
- `SlackConnector`: send messages, upload files, and manage simple Slack interactions.
- `ModelManager`: create MLflow experiments, log models, register versions, manage aliases, and handle permissions.
- `model_tools`: classification, regression, survival analysis, CatBoost helpers, plotting, and reporting utilities.
- `utils`: project-root discovery, SQL file loading, logging, credentials, and YAML SQL pipelines.

## Install

From PyPI, after a release is available:

```bash
uv add ml-analytics-tools
```

Directly from GitHub:

```bash
uv add git+https://github.com/sdaza/ml-analytics-tools
```

For local development:

```bash
uv sync --all-groups
```

## Configuration

The package loads a `.env` file from the project root when it is imported.
Configure only the services you use.

```bash
# Redshift
BI_REDSHIFT_HOST=redshift-cluster.example.com
BI_REDSHIFT_DB=analytics
BI_REDSHIFT_USER=analytics_user
BI_REDSHIFT_PASSWORD=secret
BI_REDSHIFT_PORT=5439

# S3
ML_ANALYTICS_S3_BUCKET=my-analytics-bucket

# MLflow
MLFLOW_TRACKING_URI=https://mlflow.example.com
MLFLOW_TRACKING_USERNAME=user@example.com
MLFLOW_TRACKING_PASSWORD=secret

# Google Sheets
GSHEET_SPREADSHEET_ID=optional-default-sheet-id
GOOGLE_CREDENTIALS='{"type":"service_account", ...}'

# Slack
SLACK_BOT_TOKEN=xoxb-your-token
```
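To see which services are fully configured before you use them, a small check like the one below can help. It is a sketch and not part of the package: the `REQUIRED_VARS` grouping and the `missing_config` helper are hypothetical, and whether each variable is strictly required is an assumption based on the example above.

```python
import os

# Hypothetical grouping of the variables from the example above.
# Whether each one is strictly required by the package is an assumption.
REQUIRED_VARS = {
    "redshift": [
        "BI_REDSHIFT_HOST",
        "BI_REDSHIFT_DB",
        "BI_REDSHIFT_USER",
        "BI_REDSHIFT_PASSWORD",
        "BI_REDSHIFT_PORT",
    ],
    "mlflow": [
        "MLFLOW_TRACKING_URI",
        "MLFLOW_TRACKING_USERNAME",
        "MLFLOW_TRACKING_PASSWORD",
    ],
    "slack": ["SLACK_BOT_TOKEN"],
}


def missing_config(service, environ=None):
    """Return the variable names for `service` that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS[service] if not env.get(name)]


# Report the status of every service group against the current environment.
for service in REQUIRED_VARS:
    missing = missing_config(service)
    status = "ok" if not missing else "missing " + ", ".join(missing)
    print(f"{service}: {status}")
```

Passing a plain dictionary as `environ` makes the helper easy to exercise in tests without touching the real environment.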

S3 buckets are never hard-coded. Pass `bucket=...` or `s3_bucket=...`, or set
`ML_ANALYTICS_S3_BUCKET`.
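
The lookup described above can be sketched as follows. The `resolve_bucket` helper is hypothetical, and the precedence shown (explicit argument first, then the environment variable) is an assumption; this README does not state the order the package actually uses.

```python
import os


def resolve_bucket(bucket=None, environ=None):
    """Pick the S3 bucket: explicit argument first, then the environment."""
    env = os.environ if environ is None else environ
    # Assumed precedence: an explicit argument wins over the environment.
    resolved = bucket or env.get("ML_ANALYTICS_S3_BUCKET")
    if not resolved:
        raise ValueError(
            "No S3 bucket configured: pass bucket=... or set ML_ANALYTICS_S3_BUCKET"
        )
    return resolved
```

Failing loudly when neither source provides a bucket fits the "never hard-coded" rule: there is no silent fallback to a default.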

## AWS Authentication

Use the CLI helper for AWS SSO:

```bash
ml-analytics-auth
```

You can also call it from Python:

```python
from ml_analytics import ensure_aws_authenticated

ensure_aws_authenticated()
```

See [AWS Authentication](docs/AWS_AUTHENTICATION.md) and
[CLI Commands](docs/CLI_COMMANDS.md) for details.

## Quick Examples

### Query Redshift

```python
from ml_analytics import DataConnector

dc = DataConnector()

# Run an inline query.
df = dc.sql("SELECT * FROM analytics.customer_features LIMIT 100")
# Run a query from a SQL file and return a Polars DataFrame.
df_polars = dc.sql("queries/features.sql", format="polars", country="es")
```

### Create A Redshift Table From A DataFrame

```python
# Reuses `dc` and `df` from the Query Redshift example.
dc.create_table_from_dataframe(
    df,
    table="model_scores",
    schema="analytics",
    drop_existing_table=True,
)
```

### Work With S3

```python
from ml_analytics import S3Connector

s3 = S3Connector(bucket="my-analytics-bucket", s3_root="projects/churn")

# `df` is an existing DataFrame, for example a Redshift query result.
s3.save_dataframe(df, directory="outputs", file_name="scores")

summary = s3.query(
    """
    SELECT segment, count(*) AS rows
    FROM read_parquet('s3://my-analytics-bucket/projects/churn/outputs/*.parquet')
    GROUP BY segment
    """
)
```
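
The path in the query above combines the bucket, `s3_root`, and `directory` values from the earlier calls. A hypothetical helper such as `parquet_glob` below can build it instead of repeating the string by hand. It assumes, as the example path suggests, that files land under `<s3_root>/<directory>/` as Parquet; it is not part of `S3Connector`.

```python
import posixpath


def parquet_glob(bucket, s3_root, directory):
    """Build an s3:// glob matching every Parquet file under a directory."""
    # Strip stray slashes so the pieces join with exactly one "/" between them.
    parts = [p.strip("/") for p in (s3_root, directory) if p and p.strip("/")]
    key_prefix = posixpath.join(*parts) if parts else ""
    prefix = f"{bucket}/{key_prefix}" if key_prefix else bucket
    return f"s3://{prefix}/*.parquet"


print(parquet_glob("my-analytics-bucket", "projects/churn", "outputs"))
# s3://my-analytics-bucket/projects/churn/outputs/*.parquet
```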

### Read And Write Google Sheets

```python
from ml_analytics import GSheet

gsheet = GSheet(credentials_path="gsheet_credentials.json")

df = gsheet.read_sheet(spreadsheet_id="...", sheet_name="Input")
gsheet.write_sheet(df, spreadsheet_id="...", sheet_name="Results")
```

### Log To MLflow

```python
from ml_analytics import ModelManager

manager = ModelManager(model_name="churn-model", user="user@example.com")

manager.start_run("training")
manager.log_metric("auc", 0.91)
manager.end_run()
```
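
If training code can raise between `start_run` and `end_run`, a small wrapper can make sure the run is always closed. The sketch below assumes only the three methods shown above. `tracked_run` is a hypothetical helper, and `FakeManager` is a stand-in so the example runs without an MLflow server; `ModelManager` may already offer something similar.

```python
from contextlib import contextmanager


@contextmanager
def tracked_run(manager, run_name):
    """Start a run and always end it, even if the body raises."""
    manager.start_run(run_name)
    try:
        yield manager
    finally:
        manager.end_run()


class FakeManager:
    """Stand-in for ModelManager that records the calls it receives."""

    def __init__(self):
        self.calls = []

    def start_run(self, name):
        self.calls.append(("start_run", name))

    def log_metric(self, key, value):
        self.calls.append(("log_metric", key, value))

    def end_run(self):
        self.calls.append(("end_run",))


fake = FakeManager()
with tracked_run(fake, "training") as run:
    run.log_metric("auc", 0.91)
```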

### Send A Slack Message

```python
from ml_analytics import SlackConnector

slack = SlackConnector()
slack.send_message(channel="#ml-alerts", text="Training finished")
```

## Detailed Guides

| Guide | Use It For |
| --- | --- |
| [AWS Authentication](docs/AWS_AUTHENTICATION.md) | AWS SSO setup and Python helpers |
| [CLI Commands](docs/CLI_COMMANDS.md) | Available console commands |
| [Google Sheets](docs/GSHEET_CONNECTOR_USAGE.md) | Sheets setup, sharing, exports, and examples |
| [Slack](docs/SLACK_CONNECTOR_USAGE.md) | Slack token setup and message/file examples |
| [Tunnel Manager](docs/TUNNEL_MANAGER.md) | SSH tunnel configuration and CLI usage |

## Development

Run the standard checks before opening a PR:

```bash
uv run ruff check
uv run pytest
```

CI runs Ruff and pytest on Python 3.11 and 3.12.

## Releases

This repository uses Release Please. Conventional commits on `main` create or
update a release PR with the next version and changelog. When that PR is merged,
the release workflow builds the package and publishes it to PyPI through Trusted
Publishing using the `pypi` GitHub environment.
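
As a rough illustration of how commit messages drive the next version, the sketch below applies the standard Conventional Commits to SemVer mapping: `fix` is a patch, `feat` is a minor, and a `!` or `BREAKING CHANGE:` footer is a major. The `bump_for` and `next_bump` helpers are hypothetical, and Release Please can be configured to bump differently, especially before 1.0, so treat this as a guide and not as the repository's exact behavior.

```python
import re

# Header form: type, optional (scope), optional "!", then ": description".
_HEADER = re.compile(r"^(?P<type>[a-z]+)(\([^)]*\))?(?P<bang>!)?: ")


def bump_for(message):
    """Return 'major', 'minor', 'patch', or None for one commit message."""
    header = message.splitlines()[0] if message else ""
    match = _HEADER.match(header)
    if match is None:
        return None  # not a Conventional Commit
    if match.group("bang") or "BREAKING CHANGE:" in message:
        return "major"
    if match.group("type") == "feat":
        return "minor"
    if match.group("type") == "fix":
        return "patch"
    return None  # docs, chore, and similar types imply no bump here


def next_bump(messages):
    """Pick the largest bump implied by a list of commit messages."""
    order = {None: 0, "patch": 1, "minor": 2, "major": 3}
    return max((bump_for(m) for m in messages), key=order.get, default=None)
```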

## Contributing

Keep changes small, add tests when behavior changes, and avoid
environment-specific defaults. Prefer explicit configuration over hidden
infrastructure assumptions.
