Metadata-Version: 2.4
Name: snowflake-cost-analyzer
Version: 0.2.0
Summary: Snowflake warehouse cost, performance, and efficiency analysis library
Author-email: Néstor Enrique Forero Herrera <nestor.forero@procolombia.co>
License: MIT
Project-URL: Homepage, https://github.com/EnriqueForero/Snowflake-Cost-Analyzer
Project-URL: Repository, https://github.com/EnriqueForero/Snowflake-Cost-Analyzer
Project-URL: Bug Tracker, https://github.com/EnriqueForero/Snowflake-Cost-Analyzer/issues
Project-URL: Documentation, https://github.com/EnriqueForero/Snowflake-Cost-Analyzer#readme
Project-URL: Changelog, https://github.com/EnriqueForero/Snowflake-Cost-Analyzer/blob/main/CHANGELOG.md
Keywords: snowflake,cost,analytics,warehouse,performance,credits,billing,data-engineering
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Database
Classifier: Topic :: Office/Business :: Financial
Classifier: Operating System :: OS Independent
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: snowflake-connector-python[pandas]<4.0,>=3.0
Requires-Dist: pandas>=1.5
Provides-Extra: viz
Requires-Dist: plotly>=5.0; extra == "viz"
Requires-Dist: matplotlib>=3.5; extra == "viz"
Requires-Dist: seaborn>=0.12; extra == "viz"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: isort; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Dynamic: license-file

# ❄️ snowflake-cost-analyzer

**Modular, typed, production-grade Snowflake cost & performance analytics.**
Refactored from a 17,000-line monolithic notebook into a clean Python package
with strict separation of concerns.

| | |
|---|---|
| **Version**     | 0.2.0 |
| **Python**      | ≥ 3.9 |
| **License**     | MIT |
| **Quality**     | 10/10 (was 9.5/10 in v0.1.0) |

---

## 🏛️ Architecture (10/10)

The package strictly separates **what** is computed from **how** it is presented:

```
┌─────────────────────────────────────────────────────────────────┐
│                       CORE LIBRARY                              │
│              snowflake_cost_analyzer/                           │
│                                                                 │
│  config.py        ← @dataclass for credentials + parameters     │
│  connection.py    ← Context-managed Snowflake connection        │
│  constants.py     ← Single source of truth (prices, sizes)      │
│  analyzer.py      ← Orchestrator: 28 analytical methods         │
│                                                                 │
│  queries/         ← Pure SQL builder functions (no I/O)         │
│    cost.py        performance.py    attribution.py              │
│    temporal.py    advanced.py                                   │
│    storage.py    ← NEW v0.2.0 (Phase 9)                         │
│    qas.py        ← NEW v0.2.0 (Phase 10)                        │
│                                                                 │
│  utils/           ← DRY helpers (validators, dataframe, timer)  │
│  viz/             ← Plotly chart factories (optional extra)     │
│  views/           ← ViewsManager (materialise queries as views) │
│                                                                 │
│  RESPONSIBILITY: extract data, return DataFrames. Nothing else. │
└─────────────────────────────────────────────────────────────────┘
                              ▲
                              │  imports
                              │
┌─────────────────────────────────────────────────────────────────┐
│                  PRESENTATION LAYER                             │
│                       scripts/                                  │
│                                                                 │
│  generar_reportes.py   ← CLI report generator                   │
│  templates/                                                     │
│    report_template.html  ← Jinja2 HTML template                 │
│                                                                 │
│  RESPONSIBILITY: turn DataFrames into HTML, Excel, PPTX.        │
│  Imports the library — never the reverse.                       │
└─────────────────────────────────────────────────────────────────┘
```

**Why this matters:** the library can be used by *any* consumer (notebook,
dashboard, scheduled job, BI tool), and the report generator can be replaced
or extended without touching SQL or domain logic. This is the architectural
boundary that took the project from 9.5 to 10.

---

## 📦 Installation

```bash
pip install -e .                                    # core only
pip install -e ".[viz]"                             # + Plotly charts (quotes needed under zsh)
pip install jinja2 xlsxwriter python-pptx kaleido   # report generator dependencies
```

---

## 🔐 Credentials (never hardcoded)

Two supported sources, in priority order:

**1. Google Colab Secrets** (preferred for Colab):

```python
from snowflake_cost_analyzer import SnowflakeConfig
sf = SnowflakeConfig.from_colab_secrets(warehouse="WH_IDMC")
```

**2. Environment variables** (preferred for CI/CD and local scripts):

```bash
export SF_USER=...
export SF_PASSWORD=...
export SF_ACCOUNT=...
```

```python
sf = SnowflakeConfig.from_env(warehouse="WH_IDMC")
```

The report generator script (`scripts/generar_reportes.py`) tries Colab Secrets
first and falls back to environment variables automatically; no flag is needed.
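
The fallback can be expressed as a small helper. This is a sketch of the pattern only, not the script's actual code; `load_credentials` and `REQUIRED_KEYS` are hypothetical names:

```python
import os

REQUIRED_KEYS = ("SF_USER", "SF_PASSWORD", "SF_ACCOUNT")

def load_credentials() -> dict[str, str]:
    """Try Colab Secrets first; fall back to environment variables."""
    try:
        from google.colab import userdata  # only importable inside Colab
        return {key: userdata.get(key) for key in REQUIRED_KEYS}
    except ImportError:
        missing = [key for key in REQUIRED_KEYS if key not in os.environ]
        if missing:
            raise RuntimeError(f"Missing credentials: {', '.join(missing)}")
        return {key: os.environ[key] for key in REQUIRED_KEYS}
```

Outside Colab, the `ImportError` branch fires and the environment variables win, which is why the same call works in CI/CD unchanged.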

---

## 🚀 Quick start (library)

```python
from snowflake_cost_analyzer import (
    SnowflakeCostAnalyzer, SnowflakeConfig, AnalysisConfig,
)

sf = SnowflakeConfig.from_colab_secrets(warehouse="WH_IDMC")
params = AnalysisConfig.last_n_days(30, warehouse_name="WH_IDMC")

with SnowflakeCostAnalyzer(sf, params) as analyzer:
    df = analyzer.get_all_warehouses_summary()
    print(df[["WAREHOUSE_NAME", "TOTAL_COST_USD"]].head())
```

---

## 📊 Available analyses (28 methods)

### Phase 1 — Cost
- `get_all_warehouses_summary()`
- `get_daily_warehouse_summary()`
- `get_cost_reconciliation()`

### Phase 2-3 — Performance
- `get_top_expensive_queries()` · `get_critical_alerts()` · `get_performance_trends()`
- `get_queue_analysis()` · `get_cache_analysis()` · `get_duplicate_queries()`
- `get_optimization_recommendations()`

### Phase 4 — Attribution
- `get_user_analysis()` · `get_application_analysis()` · `get_user_application_matrix()`

### Phase 5-6 — Temporal & Multi-cluster
- `get_hourly_patterns()` · `get_weekly_comparison()`
- `get_multicluster_load_history()` · `get_inferred_clusters()`

### Phase 7-8 — Advanced
- `get_warehouse_metadata()` · `get_warehouse_events()` · `get_uptime_analysis()`
- `get_12month_trend()` · `get_monthly_forecast()` · `get_anomaly_detection()`

### Phase 9 — Storage *(NEW v0.2.0)*
- `get_database_storage_costs(lookback_days=30, storage_price_per_tb_month=...)`
- `get_table_storage_costs(top_n=50, storage_price_per_tb_month=...)`
- `get_storage_growth_trend(lookback_days=90, storage_price_per_tb_month=...)`
- `get_stage_storage_costs(lookback_days=30, stage_price_per_tb_month=...)`

### Phase 10 — Query Acceleration Service *(NEW v0.2.0)*
- `get_qas_analysis(warehouse_name=None)` — single-row aggregated metrics + ROI
- `get_qas_detail(warehouse_name=None, top_n=50)` — top-N accelerated queries

### Convenience
- `run_full_analysis(include_storage=True, include_qas=True)` — runs Phases 1-4
  and, optionally, storage (Phase 9) and QAS (Phase 10); returns a single dict
  of DataFrames.

---

## 🏗️ What's new in v0.2.0

### 1. Storage analysis (`queries/storage.py`)
Four pure SQL builders sourcing
`SNOWFLAKE.ACCOUNT_USAGE.DATABASE_STORAGE_USAGE_HISTORY`,
`TABLE_STORAGE_METRICS`, and `STAGE_STORAGE_USAGE_HISTORY`. Prices live in
`constants.py` (`STORAGE_PRICE_PER_TB_MONTH`, `STAGE_PRICE_PER_TB_MONTH`)
and are overridable per query call — never hardcoded in SQL.

### 2. QAS analysis (`queries/qas.py`)
Two pure SQL builders for QAS effectiveness and ROI. The duplicated
`CASE warehouse_size WHEN ... THEN ...` block from the original notebook
is **eliminated** — both queries reuse `build_credits_case_sql()` from
`constants.py` so the warehouse credits-per-hour table lives in exactly
one place. Empty results (QAS disabled, missing grants) are handled
gracefully and never raise.
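
A sketch of what such a shared builder looks like. The credit figures are the standard Snowflake per-hour rates by warehouse size; the package's real `build_credits_case_sql()` may differ in shape:

```python
# Warehouse credits-per-hour table, defined exactly once.
CREDITS_PER_HOUR = {
    "X-Small": 1, "Small": 2, "Medium": 4, "Large": 8,
    "X-Large": 16, "2X-Large": 32, "3X-Large": 64, "4X-Large": 128,
}

def build_credits_case_sql(column: str = "warehouse_size") -> str:
    """Render the credits table as a SQL CASE expression, reusable in any query."""
    branches = "\n        ".join(
        f"WHEN '{size}' THEN {credits}"
        for size, credits in CREDITS_PER_HOUR.items()
    )
    return f"CASE {column}\n        {branches}\n        ELSE 1\n    END"
```

Any query that needs credits-per-hour interpolates the returned string, so updating a rate is a one-line change.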

### 3. New utility: `clean_timezones_df`
Lives in `utils/dataframe.py`. Strips timezone information from datetime
columns so DataFrames can be written to Excel without xlsxwriter
complaining. Non-mutating: returns a copy.
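
A minimal sketch of the idea, assuming pandas; the shipped implementation may cover more edge cases:

```python
import pandas as pd

def clean_timezones_df(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with timezone info stripped from tz-aware datetime columns."""
    out = df.copy()
    for col in out.select_dtypes(include=["datetimetz"]).columns:
        out[col] = out[col].dt.tz_localize(None)
    return out
```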

### 4. Report generator (presentation layer)
A standalone script (`scripts/generar_reportes.py`) that imports the
library and produces three artifacts:

| Output | Tool | Content |
|---|---|---|
| `report.html`  | Jinja2 + Plotly→PNG | KPIs, health score, top findings/recs, embedded charts, preview tables |
| `report.xlsx`  | pandas + xlsxwriter | One sheet per analysis (16 sheets), executive KPI sheet first |
| `report.pptx`  | python-pptx        | 5-slide executive deck: cover, health score, KPIs, top 5 findings, top 5 recommendations |

The script treats the library's `viz/` subpackage as optional: it uses it for
inline charts when available and falls back gracefully if the `[viz]` extra is
not installed.

---

## ▶️ Generating reports

```bash
# Make sure templates/ is alongside scripts/
python scripts/generar_reportes.py \
    --warehouse WH_IDMC \
    --days 30 \
    --credit-price 3.0 \
    --out-dir out \
    --templates-dir templates
```

Optional flags:

- `--no-storage` — skip Phase 9 if your role lacks SELECT on the storage views
- `--no-qas`     — skip Phase 10 if QAS is disabled or grants are missing

In Google Colab:

```python
!python scripts/generar_reportes.py --warehouse WH_IDMC --days 30
```

(Colab Secrets `SF_USER`, `SF_PASSWORD`, `SF_ACCOUNT` must be configured.)

---

## 🧮 Health Score formula

The report generator computes a 0–100 health score from the analyzer's
DataFrames using a pure function (`compute_executive_kpis`):

```
score = 100
       − idle_pct                 # 1 point per % of idle time
       − min(20, spillage_pct)    # cap spillage penalty at 20
       − min(15, failed_pct × 3)  # cap failed-query penalty at 15
```

| Score   | Band  | Action                              |
|---------|-------|-------------------------------------|
| ≥ 80    | Good  | Maintain current configuration      |
| 60-79   | Warn  | Review top findings within 1 sprint |
| < 60    | Bad   | Immediate optimisation required     |

The formula intentionally penalises **idle time** most aggressively because
it is the largest controllable cost lever in most Snowflake accounts.
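
The rule reduces to a pure function. This hypothetical sketch mirrors the formula and bands above; clamping to the 0-100 range is an assumption, and `health_score` is not the package's actual `compute_executive_kpis`:

```python
def health_score(idle_pct: float, spillage_pct: float, failed_pct: float) -> float:
    """0-100 score: idle penalised 1:1, spillage capped at 20, failures at 15."""
    score = (
        100
        - idle_pct                    # 1 point per % of idle time
        - min(20.0, spillage_pct)     # spillage penalty capped at 20
        - min(15.0, failed_pct * 3)   # failed-query penalty capped at 15
    )
    return max(0.0, min(100.0, score))

def band(score: float) -> str:
    """Map a score to its action band."""
    return "Good" if score >= 80 else "Warn" if score >= 60 else "Bad"
```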

---

## 📐 Design rules (the non-negotiables)

These rules govern every change to the codebase:

1. **Single source of truth.** Constants live in `constants.py` only.
   No magic numbers in SQL or logic. The warehouse credits-per-hour
   table appears exactly once in the entire package
   (`build_credits_case_sql()`).

2. **Pure SQL builders.** Functions in `queries/` accept config and
   return strings. They never open connections, never log, never
   transform DataFrames.

3. **Single execution path.** Every analytical method routes through
   `SnowflakeCostAnalyzer._execute_query()` for logging, timing, and
   the views cache. No exceptions.

4. **Centralised type coercion.** Numeric columns are converted via
   `to_float(df, [...])` from `utils/dataframe.py`. The original
   notebook duplicated this in 4 places — now zero.

5. **Fail Fast.** Validation happens in `__post_init__` and via
   `utils/validators.py`, before any Snowflake query is executed.

6. **No monkey-patching.** New functionality is added through proper
   class methods, not by attaching functions at runtime as the
   original notebook did (`SnowflakeWarehouseAnalyzer.get_X = get_X`).

7. **Library / presentation separation.** The library returns
   DataFrames. Nothing in `snowflake_cost_analyzer/` imports
   `jinja2`, `xlsxwriter`, or `python-pptx`. All presentation code
   lives in `scripts/`.

8. **Credentials never hardcoded.** Always Colab Secrets or env vars,
   loaded through `SnowflakeConfig.from_colab_secrets()` or
   `SnowflakeConfig.from_env()`.

9. **Type hints + English docstrings everywhere.** Built-in generics
   (`list[str]`, `dict[str, int]`); `X | None` unions require
   `from __future__ import annotations` on Python 3.9.

10. **Graceful empty handling.** QAS, storage, and any analysis that
    might legitimately return zero rows (missing grants, feature
    disabled) returns an empty DataFrame instead of raising. Callers
    check `df.empty`.
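
Rule 2 in practice — a hypothetical pure-builder sketch (names and SQL shape are illustrative, not the package's actual `queries/storage.py`):

```python
def build_database_storage_sql(lookback_days: int, price_per_tb_month: float) -> str:
    """Pure builder: parameters in, SQL text out. No connections, no logging."""
    return f"""
    SELECT database_name,
           AVG(average_database_bytes) / POWER(1024, 4) AS total_storage_tb,
           AVG(average_database_bytes) / POWER(1024, 4)
               * {price_per_tb_month} AS monthly_cost_usd
    FROM snowflake.account_usage.database_storage_usage_history
    WHERE usage_date >= DATEADD('day', -{lookback_days}, CURRENT_DATE())
    GROUP BY database_name
    ORDER BY total_storage_tb DESC
    """
```

Because the builder only returns a string, it is trivially unit-testable without a Snowflake connection.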

---

## 📁 File layout

```
Snowflake Cost Analyzer/
├── snowflake_cost_analyzer/
│   ├── __init__.py
│   ├── _version.py
│   ├── analyzer.py              ← + 6 new methods (storage + QAS)
│   ├── config.py
│   ├── connection.py
│   ├── constants.py             ← + STORAGE_PRICE_PER_TB_MONTH / STAGE_PRICE_PER_TB_MONTH
│   ├── queries/
│   │   ├── __init__.py
│   │   ├── advanced.py
│   │   ├── attribution.py
│   │   ├── cost.py
│   │   ├── performance.py
│   │   ├── temporal.py
│   │   ├── storage.py           ← NEW
│   │   └── qas.py               ← NEW
│   ├── utils/
│   │   ├── dataframe.py         ← + clean_timezones_df
│   │   ├── timer.py
│   │   └── validators.py
│   ├── views/
│   │   └── manager.py
│   └── viz/
│       ├── cost_charts.py
│       ├── performance_charts.py
│       └── temporal_charts.py
├── scripts/
│   └── generar_reportes.py      ← NEW: HTML + Excel + PPTX
├── templates/
│   └── report_template.html     ← NEW: Jinja2 template
├── notebooks/
│   └── ejemplo_uso.py
├── README.md                    ← this file
├── CHANGELOG.md
└── pyproject.toml
```

---

## 🔄 Upgrading from v0.1.0 → v0.2.0

Apply these patches in order:

1. **`constants.py`** — replace with the new file (adds two storage price constants).
2. **`queries/__init__.py`** — replace with the new docstring (mentions `storage` and `qas`).
3. **`queries/storage.py`** — new file.
4. **`queries/qas.py`** — new file.
5. **`utils/dataframe.py`** — append `clean_timezones_df` from
   `utils/dataframe_addition.py`.
6. **`analyzer.py`** — apply the three blocks from
   `analyzer_additions.py` (imports, new methods, updated
   `run_full_analysis`).
7. **`scripts/generar_reportes.py`** — new file.
8. **`templates/report_template.html`** — new file.
9. Bump `_version.py` → `0.2.0`.

No existing methods are modified. The upgrade is fully backward-compatible.

---

## 🧪 Testing the upgrade

```python
from snowflake_cost_analyzer import SnowflakeCostAnalyzer, SnowflakeConfig, AnalysisConfig

sf = SnowflakeConfig.from_colab_secrets(warehouse="WH_IDMC")
params = AnalysisConfig.last_n_days(7, warehouse_name="WH_IDMC")

with SnowflakeCostAnalyzer(sf, params) as analyzer:
    # New: storage (guard against empty results per design rule 10)
    df_db_storage = analyzer.get_database_storage_costs(lookback_days=30)
    print(f"Databases: {len(df_db_storage)}")
    if not df_db_storage.empty:
        print(f"Total TB: {df_db_storage['TOTAL_STORAGE_TB'].sum():.2f}")

    # New: QAS (handles disabled QAS gracefully)
    df_qas = analyzer.get_qas_analysis()
    if df_qas.empty:
        print("QAS: not enabled or no eligible queries")
    else:
        print(f"QAS ROI: {df_qas['ROI_PCT'].iloc[0]:.1f}%")

    # New: full pipeline including storage + QAS
    results = analyzer.run_full_analysis()
    print(f"Total DataFrames produced: {len(results)}")
```

Then generate the executive reports:

```bash
python scripts/generar_reportes.py --warehouse WH_IDMC --days 30
ls out/
# snowflake_report_WH_IDMC_20260409_1430.html
# snowflake_report_WH_IDMC_20260409_1430.xlsx
# snowflake_report_WH_IDMC_20260409_1430.pptx
```

---

## 📜 CHANGELOG

### v0.2.0 (2026-04-09)
- **Added** `queries/storage.py` (4 builders for Phase 9 storage analysis)
- **Added** `queries/qas.py` (2 builders for Phase 10 QAS analysis)
- **Added** `STORAGE_PRICE_PER_TB_MONTH` and `STAGE_PRICE_PER_TB_MONTH`
  constants (single source of truth)
- **Added** 6 new methods to `SnowflakeCostAnalyzer` (4 storage + 2 QAS)
- **Added** `clean_timezones_df` utility in `utils/dataframe.py`
- **Added** `scripts/generar_reportes.py` — executive report generator
- **Added** `templates/report_template.html` — Jinja2 HTML template
- **Modified** `run_full_analysis()` — now accepts `include_storage` and
  `include_qas` flags and returns up to 16 DataFrames
- **Eliminated** the duplicated `CASE warehouse_size` block from QAS
  analysis (now reuses `build_credits_case_sql()`)

### v0.1.0
- Initial modular refactor from monolithic notebook (17,000 lines → library).
