Metadata-Version: 2.4
Name: autoseqmodels
Version: 0.1.1
Summary: Automated weekly sequence-model workflow (LSTM / Transformer) for customer transaction prediction.
Author-email: Pablo Huber <pablohuber.ge@gmail.com>
License-Expression: MIT
Keywords: sequence-models,lstm,transformer,transaction-prediction,customer-lifetime-value,time-series
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=2.0
Requires-Dist: numpy>=1.24
Requires-Dist: matplotlib>=3.7
Requires-Dist: scikit-learn>=1.3
Requires-Dist: torch>=2.0
Requires-Dist: torchmetrics>=1.0
Requires-Dist: optuna>=3.0
Requires-Dist: tqdm>=4.65
Requires-Dist: pyreadr>=0.5
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Dynamic: license-file

# autoseqmodels

Automated weekly sequence-model workflow for customer transaction prediction.
Provides an end-to-end pipeline from raw transaction tables to trained LSTM /
Transformer models, with column-type inference, encoding strategy proposal,
per-customer sequence construction, training, tuning (Optuna), and
holdout evaluation.

## Installation

```bash
pip install autoseqmodels
```

Or from a local clone:

```bash
pip install -e .
```

## Expected input format

`build_transaction_panel` does **not** infer column roles. It only
auto-detects the date column and reconciles merge-key dtypes, and that
detection degrades silently if the inputs don't match these expectations:

- **`tx_df`** (raw transactions, one row per purchase) must contain a
  customer-identifier column (the name you pass as `merge_on`, e.g.
  `"Id"`) and exactly one date-typed column (either `datetime64` dtype,
  or a string column whose name contains `"date"` and parses as a date).
  Any other columns are ignored at panel-building time but flow through
  to the panel and are typed in step 3.
- **`cov_df`** (optional covariate calendar) must contain the same
  `merge_on` column, exactly one date-typed column (the per-week
  observation date), and one row per `(customer, week)`. Merge-key dtypes
  are coerced automatically across the two tables, so an `int64` `Id` on
  one side and a string `Id` on the other is fine.
- The downstream `build_transaction_sequences` step expects the panel to
  carry a `transaction_count` column (added by `build_transaction_panel`)
  plus the customer and date columns; everything else is treated as a
  feature.
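
The expectations above can be checked before calling the library. The following is a minimal sketch using only pandas; `date_like_columns` and `check_inputs` are hypothetical helpers written for illustration, not part of `autoseqmodels`, and the toy tables are made up.

```python
import pandas as pd
from pandas.api import types as ptypes


def date_like_columns(df: pd.DataFrame) -> list[str]:
    """Hypothetical helper: columns that look date-typed under the rules above."""
    found = []
    for col in df.columns:
        series = df[col]
        if ptypes.is_datetime64_any_dtype(series):
            found.append(col)
        elif "date" in str(col).lower() and (
            ptypes.is_object_dtype(series) or ptypes.is_string_dtype(series)
        ):
            # A string column counts only if every value parses as a date.
            if pd.to_datetime(series, errors="coerce").notna().all():
                found.append(col)
    return found


def check_inputs(tx_df, cov_df=None, merge_on="Id"):
    """Raise ValueError if a table breaks the documented expectations."""
    for name, df in (("tx_df", tx_df), ("cov_df", cov_df)):
        if df is None:
            continue
        if merge_on not in df.columns:
            raise ValueError(f"{name}: missing merge_on column {merge_on!r}")
        dates = date_like_columns(df)
        if len(dates) != 1:
            raise ValueError(f"{name}: expected exactly one date column, found {dates}")
    if cov_df is not None:
        date_col = date_like_columns(cov_df)[0]
        if cov_df.duplicated([merge_on, date_col]).any():
            raise ValueError("cov_df: more than one row per (customer, date)")


# Toy inputs: one row per purchase, and one covariate row per (customer, week).
tx_df = pd.DataFrame({
    "Id": [1, 1, 2],
    "Date": pd.to_datetime(["2004-01-05", "2004-01-07", "2004-01-13"]),
    "Price": [19.9, 5.0, 42.0],
})
cov_df = pd.DataFrame({
    "Id": ["1", "1", "2", "2"],  # string ids on this side, int64 on the other
    "Date": ["2004-01-05", "2004-01-12", "2004-01-05", "2004-01-12"],
    "promo": [0, 1, 0, 1],
})

check_inputs(tx_df, cov_df, merge_on="Id")  # passes silently
```

A table with two date columns (for example an extra `ShipDate`) would fail this check, which is the situation the "exactly one date-typed column" rule is meant to rule out.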

## Workflow

```python
from autoseqmodels import (
    loader, inspection, encoders, sequence_builder,
    training, sequence_lstm, sequence_transformer,
)

# 1. Load data (CSV / Excel / RData)
tx_df  = loader.load_table("transactions.csv")
cov_df = loader.load_table("covariates.csv")  # optional; file name is illustrative

# 2. Aggregate raw transactions to a (customer, week) panel
#    Two modes:
#      a) With a covariate calendar — pass tx_df + cov_df:
panel = loader.build_transaction_panel(tx_df, cov_df, merge_on="Id")
#      b) Transactions only (no covariates) — omit cov_df:
panel = loader.build_transaction_panel(tx_df, merge_on="Id")
#    The transactions-only mode returns one row per (customer, purchase-week);
#    sequence_builder fills zero-transaction weeks itself in step 7.

# 3. Detect column types (user-editable)
detected = inspection.infer_column_types(panel)
panel = inspection.cast_columns_by_detected_type(panel, detected)

# 4. Resolve entity / date / target + covariate plan
structure, plan = inspection.analyze_structure(panel, detected)

# 5. Propose encoding strategy (user-editable)
strategy = encoders.propose_encodings(panel, detected, plan)

# 6. Fit encoders on training rows only
enc_df, spec = encoders.apply_encodings(panel, strategy, ...)
plan = encoders.expand_plan(plan, spec)

# 7. Build per-customer sequences
seqs = sequence_builder.build_transaction_sequences(enc_df, ...)

# 8. Train / evaluate (fixed hyperparameters)
model, history = sequence_lstm.train_tuned_lstm(seqs, ...)
preds = sequence_lstm.predict_holdout(model, seqs)
```

### With Optuna hyperparameter search

Replace step 8 with a tune → train-best → evaluate sequence:

```python
from autoseqmodels import training, sequence_lstm

# Wrap the sequences and split customers into train / val
dataset             = training.SequenceDataset(seqs)
train_ds, val_ds    = training.make_train_val_split(dataset, val_fraction=0.1, seed=42)

# Run (or resume) the Optuna study — persisted to SQLite if `storage` is set
study = sequence_lstm.tune_lstm(
    seqs,
    train_ds,
    val_ds,
    n_trials   = 30,
    max_epochs = 100,
    storage    = "sqlite:///optuna_lstm.db",
    study_name = "lstm_v1",
)

# Retrain with the best hyperparameters and run the holdout rollout
model, history = sequence_lstm.train_tuned_lstm(seqs, study, train_ds, val_ds)
preds          = sequence_lstm.predict_holdout(model, seqs)
```

The Transformer variant exposes the same triple under
`sequence_transformer`: `tune_transformer`, `train_tuned_transformer`,
`predict_holdout_transformer`.

## Minimal runnable example

End-to-end on the bundled `Electronic.csv` (`Id, Date, Price`,
829 customers, transactions from 1999-01-01 to 2004-11-30). Uses the
transactions-only mode of `build_transaction_panel` so no covariate file
is needed; the same script works with a covariate calendar by passing it
as the second positional argument.

```python
from autoseqmodels import (
    loader, inspection, encoders, sequence_builder,
    training, sequence_lstm,
)

# 1-2. Load and aggregate to a (customer, week) panel
tx_df = loader.load_table("Datasets/Electronic.csv")
panel = loader.build_transaction_panel(tx_df, merge_on="Id")

# 3. Detect column types and cast
detected = inspection.infer_column_types(panel)
panel    = inspection.cast_columns_by_detected_type(panel, detected)

# 4. Resolve entity / date / target + covariate plan
structure, plan = inspection.analyze_structure(
    panel, detected, target_col="transaction_count"
)

# 5-6. Encoding strategy + fit on training rows
strategy        = encoders.propose_encodings(panel, detected, plan)
train_mask      = panel["Date"] <= "2003-12-31"
enc_df, spec    = encoders.apply_encodings(panel, strategy, train_mask=train_mask)
plan            = encoders.expand_plan(plan, spec)

# 7. Build per-customer sequences (3-year calibration, 11-month holdout)
seqs = sequence_builder.build_transaction_sequences(
    enc_df,
    account_col     = "Id",
    date_col        = "Date",
    training_start  = "2001-01-01",
    training_end    = "2003-12-31",
    holdout_start   = "2004-01-01",
    holdout_end     = "2004-11-30",
    transaction_col = "transaction_count",
    plan            = plan,
    seasonality     = ["woy"],
)

# 8. Tune with Optuna, refit on the best trial, then roll out the holdout
ds                = training.SequenceDataset(seqs)
train_ds, val_ds  = training.make_train_val_split(ds, val_fraction=0.1, seed=42)

study = sequence_lstm.tune_lstm(
    seqs, train_ds, val_ds,
    n_trials   = 10,
    max_epochs = 30,
    storage    = "sqlite:///optuna_lstm.db",
    study_name = "electronic_demo",
)
model, history = sequence_lstm.train_tuned_lstm(seqs, study, train_ds, val_ds)
preds          = sequence_lstm.predict_holdout(model, seqs)
```

To verify the install in a few minutes, drop `n_trials` and `max_epochs`
to small values (e.g. 3 and 10). Re-running the same script with the
same `study_name` resumes the existing Optuna study from the SQLite
file.

## Modules

### `loader` — input I/O and panel aggregation
- `load_table(path, ...)` — read `.csv`, `.xlsx/.xls`, `.RData/.rda`. For R
  files, passing `r_covariates_object_name` (and optionally
  `r_base_object_name`) returns a `(base_df, covariates_df)` tuple in one call.
- `build_transaction_panel(tx_df, cov_df=None, merge_on=...)` — aggregate
  raw transactions into a weekly `(customer, week)` panel. With `cov_df`,
  left-joins counts onto a covariate calendar (missing weeks → 0). Without
  `cov_df`, returns one row per `(customer, purchase-week)`; the sequence
  builder fills zero weeks itself. Auto-detects the date column and
  reconciles merge-key dtypes across the two tables.
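
To make the two panel modes concrete, here is a rough pandas sketch of the aggregation described above. It is not the library's implementation: the `week_start` helper, the Monday week anchor, and the `week` column name are assumptions chosen for illustration.

```python
import pandas as pd


def week_start(dates: pd.Series) -> pd.Series:
    """Map each date to the Monday that starts its week (assumed convention)."""
    return dates.dt.to_period("W").dt.start_time


tx_df = pd.DataFrame({
    "Id": [1, 1, 1, 2],
    "Date": pd.to_datetime(["2004-01-05", "2004-01-07", "2004-01-20", "2004-01-13"]),
})

# Transactions-only mode: one row per (customer, purchase-week).
counts = (
    tx_df.assign(week=week_start(tx_df["Date"]))
    .groupby(["Id", "week"])
    .size()
    .reset_index(name="transaction_count")
)

# Covariate mode: left-join the counts onto a full (customer, week) calendar.
calendar = pd.DataFrame({
    "Id": [1, 1, 1, 2, 2, 2],
    "week": week_start(
        pd.Series(pd.to_datetime(["2004-01-05", "2004-01-12", "2004-01-19"] * 2))
    ),
    "promo": [0, 1, 0, 0, 1, 0],
})
panel = calendar.merge(counts, on=["Id", "week"], how="left")

# Weeks without a purchase have no match in `counts`, so they become 0.
panel["transaction_count"] = panel["transaction_count"].fillna(0).astype(int)
print(panel["transaction_count"].tolist())  # [2, 0, 1, 0, 1, 0]
```

In the transactions-only mode `counts` has three rows and no zero weeks, which is why the sequence builder has to fill them later.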

### `inspection` — column typing & structural analysis
- `infer_column_types(df, config=TypeDetectionConfig())` — Auto-Prep-style
  detection: id / bool / date / time / category / string, plus a
  statistical role (identifier / discrete / continuous / binary /
  categorical / temporal / text). Output is a typed summary DataFrame
  (`detected`) the user can edit before casting.
- `cast_columns_by_detected_type(df, detected)` — apply the inferred
  dtypes column by column.
- `propose_structure(detected)` → `DataStructure` — pick `entity_col`,
  `date_col` (highest-cardinality date), and surface alternatives as
  `entity_candidates` / `date_candidates`.
- `classify_time_variance(df, entity_col, detected)` — label every
  column as `invariant` / `variant` per entity, and pick the main
  transaction date.
- `plan_covariates(df, detected, structure)` → `CovariatePlan` —
  classify every column as `static_cols`, `time_varying_cols`, or
  `skip_cols`, with per-column encoding hints.
- `analyze_structure(df, detected, unknown_future_cols=None, target_col=None)`
  — one-shot wrapper that runs `propose_structure` + `plan_covariates`
  and removes the columns listed in `unknown_future_cols` (those whose
  future values aren't known at prediction time) from `time_varying_cols`.
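
The invariant / variant labelling can be pictured as a per-entity uniqueness check. The sketch below is one plausible way to do it with pandas; `classify_time_variance_sketch` is a hypothetical stand-in, and the real function may use different rules (for example around missing values).

```python
import pandas as pd


def classify_time_variance_sketch(df: pd.DataFrame, entity_col: str) -> dict[str, str]:
    """Label a column 'invariant' if it takes a single value within every entity."""
    labels = {}
    for col in df.columns:
        if col == entity_col:
            continue
        distinct_per_entity = df.groupby(entity_col)[col].nunique(dropna=False)
        is_invariant = (distinct_per_entity <= 1).all()
        labels[col] = "invariant" if is_invariant else "variant"
    return labels


panel = pd.DataFrame({
    "Id": [1, 1, 2, 2],
    "Date": pd.to_datetime(["2004-01-05", "2004-01-12", "2004-01-05", "2004-01-12"]),
    "segment": ["a", "a", "b", "b"],  # constant per customer
    "promo": [0, 1, 0, 0],            # changes over time for customer 1
})

labels = classify_time_variance_sketch(panel, entity_col="Id")
print(labels)  # {'Date': 'variant', 'segment': 'invariant', 'promo': 'variant'}
```

Under this reading, invariant columns are natural candidates for `static_cols` and variant ones for `time_varying_cols`.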

### `encoders` — turning columns into model-ready numerics
- `propose_encodings(df, detected, plan, ...)` — pick a strategy per
  column: `EMBED` (id or high-cardinality category), `ONE-HOT`
  (low-cardinality), `SCALE` (numeric, Standard or MinMax), or
  `PASSTHROUGH` (binary). Embedding dimension follows the Guo & Berkhahn
  rule. Strategy is user-editable.
- `apply_encodings(df, strategy, train_mask=...)` — fit encoders on
  training rows only, transform the whole panel, return `(enc_df, spec)`
  where `spec: EncodingSpec` carries the embedding vocabularies, scaler
  parameters, one-hot category orders, and the total `input_width`.
- `expand_plan(plan, spec)` — rewrite the `CovariatePlan` so one-hot
  expansion columns replace their source column.
- `prepare_encodings(...)` / `encoding_report(spec)` — convenience
  wrapper and a human-readable summary table.
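
Fitting on training rows only matters because statistics computed on the whole panel would leak holdout information into the features. The following NumPy sketch illustrates the idea for a scaled numeric column and a one-hot column; it is a simplified illustration with made-up data, not the `EncodingSpec` logic itself.

```python
import numpy as np

values = np.array([10.0, 12.0, 14.0, 100.0])
colors = np.array(["red", "blue", "red", "green"])
train_mask = np.array([True, True, True, False])  # last row is holdout

# SCALE: mean and std come from training rows, then apply to every row.
mean = values[train_mask].mean()
std = values[train_mask].std()
scaled = (values - mean) / std

# ONE-HOT: the category order is fixed from training rows only.
categories = sorted(set(colors[train_mask].tolist()))  # ['blue', 'red']
one_hot = np.array(
    [[float(c == cat) for cat in categories] for c in colors.tolist()]
)

# A category unseen in training ('green') encodes as all zeros in this sketch.
print(one_hot.tolist())  # [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

# Total width: one scaled column plus one column per one-hot category.
input_width = 1 + len(categories)
print(input_width)  # 3
```

How the library treats unseen categories is not stated in this README; the all-zeros row is only the behaviour of this sketch.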

### `sequence_builder` — per-customer weekly sequences
- `build_transaction_sequences(df, *, account_col, date_col,
  training_start, training_end, holdout_start, holdout_end,
  transaction_col="transaction_count", plan=None, covariate_cols=None,
  time_varying_cols=None, seasonality=None, clip_transactions=None)` —
  constructs its own universal weekly calendar from
  `(training_start, holdout_end)`, merges each customer's transactions
  onto it (zero-filling missing weeks), and emits a dict with
  `samples / targets` (calibration shifted by one), `calibration` (full
  training tensor), `holdout` (rollout tensor), `account_ids`,
  `n_features`, `feature_names`, `total_transactions`. Optional
  seasonality features: `woy / moy / dow / year` (sin/cos pairs).
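
As a rough picture of what the sequence builder does for one customer, the sketch below builds a weekly calendar, zero-fills missing weeks, splits calibration from holdout, and shifts the calibration series by one step. It uses only the standard library; the function names, the 7-day stepping, and the period of 52 for week-of-year are assumptions for illustration.

```python
import math
from datetime import date, timedelta


def weekly_calendar(start: date, end: date) -> list[date]:
    """Week-start dates from `start` to `end` inclusive, stepping 7 days."""
    weeks, current = [], start
    while current <= end:
        weeks.append(current)
        current += timedelta(days=7)
    return weeks


def woy_features(week: date) -> tuple[float, float]:
    """Sin/cos pair for the ISO week of year (period 52 is an assumption)."""
    woy = week.isocalendar()[1]
    angle = 2 * math.pi * woy / 52
    return math.sin(angle), math.cos(angle)


calendar = weekly_calendar(date(2004, 1, 5), date(2004, 2, 2))  # 5 weeks

# One customer's purchases, keyed by week start; other weeks are zero-filled.
purchases = {date(2004, 1, 5): 2, date(2004, 1, 19): 1}
counts = [purchases.get(week, 0) for week in calendar]  # [2, 0, 1, 0, 0]

n_train = 4  # first four weeks are calibration, the rest is holdout
calibration, holdout = counts[:n_train], counts[n_train:]

# Shift by one: the input at week t is paired with the count at week t + 1.
samples, targets = calibration[:-1], calibration[1:]
print(samples, targets)  # [2, 0, 1] [0, 1, 0]

seasonality = [woy_features(week) for week in calendar]
```

The sin/cos pair keeps week 52 and week 1 close together in feature space, which a raw week number would not.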

### `models.training` — dataset, split, train loop, evaluation
- `SequenceDataset` — wraps the dict from `build_transaction_sequences`
  for use with `DataLoader`.
- `make_train_val_split(dataset, val_fraction=..., seed=...)` —
  customer-level random split (no leakage between train and val).
- `train_model(model, train_ds, val_ds, ...)` — generic training loop
  with early stopping, returns the per-epoch loss history.
- `evaluate_predictions(...)` — MAE / RMSE / aggregate-count metrics on
  the holdout rollout.
- `plot_training_and_predictions(...)` — quick-look loss curves and
  predicted-vs-actual transaction plots.
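
A customer-level split assigns each customer's whole sequence to either train or validation, so no customer appears on both sides. The sketch below shows that idea along with plain MAE / RMSE; `split_customers`, `mae`, and `rmse` are hypothetical stand-ins, and the library's own rounding and shuffling may differ.

```python
import math
import random


def split_customers(n_customers: int, val_fraction: float = 0.1, seed: int = 42):
    """Shuffle customer indices once, then cut off the validation share."""
    indices = list(range(n_customers))
    random.Random(seed).shuffle(indices)
    n_val = max(1, round(n_customers * val_fraction))
    return sorted(indices[n_val:]), sorted(indices[:n_val])


def mae(actual, predicted):
    """Mean absolute error over paired values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error over paired values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


train_idx, val_idx = split_customers(829, val_fraction=0.1, seed=42)
print(len(train_idx), len(val_idx))  # 746 83

# No customer index is shared between the two sides.
assert set(train_idx).isdisjoint(val_idx)

print(mae([2, 0, 1], [1, 0, 3]))  # 1.0
```

Fixing the seed makes the split reproducible across runs, which matters when an Optuna study is resumed later.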

### `models.sequence_lstm` — LSTM model + Optuna tuning
- `SequenceLSTM(nn.Module)` — embedding layers (one per id / categorical),
  multi-layer LSTM, regression head; consumes the `feature_names` order
  produced by `build_transaction_sequences`.
- `predict_holdout(model, seqs)` — autoregressive holdout rollout,
  returns predicted weekly counts per customer.
- `tune_lstm(seqs, train_ds, val_ds, *, n_trials, max_epochs,
  storage=None, study_name=None, fixed_params=None)` — Optuna search
  over `hidden_size`, `n_layers`, dropout, learning rate, batch size, …
  Studies persist to SQLite (`sqlite:///optuna_lstm.db`) when `storage`
  is given, so runs are resumable. Any param can be pinned via
  `fixed_params`.
- `train_tuned_lstm(seqs, study, train_ds, val_ds)` — refit with the
  best trial's hyperparameters, returning `(model, history)`.
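
An autoregressive rollout feeds each prediction back in as the next input, so errors can compound over the holdout window. The sketch below shows only that loop; the `rollout` function and the toy `mean_of_last_two` predictor are illustrative assumptions standing in for a trained model, and they are not the signature of `predict_holdout`.

```python
from typing import Callable, Sequence


def rollout(
    step_fn: Callable[[Sequence[float]], float],
    history: Sequence[float],
    n_steps: int,
) -> list[float]:
    """Predict n_steps ahead, appending each prediction to the input window."""
    window = list(history)
    predictions = []
    for _ in range(n_steps):
        next_value = step_fn(window)
        predictions.append(next_value)
        window.append(next_value)  # the prediction becomes the next input
    return predictions


def mean_of_last_two(window: Sequence[float]) -> float:
    """Toy one-step predictor used in place of a trained LSTM."""
    return sum(window[-2:]) / 2


preds = rollout(mean_of_last_two, history=[2.0, 0.0, 1.0, 3.0], n_steps=3)
print(preds)  # [2.0, 2.5, 2.25]
```

Only the first prediction is based purely on observed counts; every later step depends on earlier predictions.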

### `models.sequence_transformer` — Transformer mirror
- `SequenceTransformer(nn.Module)` — same input contract as
  `SequenceLSTM` but with positional encoding and a TransformerEncoder
  stack; `n_heads` is auto-picked to divide `d_model`.
- `predict_holdout_transformer`, `tune_transformer`,
  `train_tuned_transformer` — direct counterparts of the LSTM helpers,
  with their own SQLite study (`optuna_transformer.db`).
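
Multi-head attention in PyTorch requires the embedding dimension to be divisible by the number of heads, which is why `n_heads` has to be chosen to divide `d_model`. One simple way to pick it is sketched below; `pick_n_heads` and its `max_heads` cap are hypothetical, and the library's actual selection rule may differ.

```python
def pick_n_heads(d_model: int, max_heads: int = 8) -> int:
    """Return the largest head count up to max_heads that divides d_model."""
    for heads in range(min(max_heads, d_model), 0, -1):
        if d_model % heads == 0:
            return heads
    return 1  # only reached if d_model < 1; one head always divides otherwise


for d_model in (64, 30, 13):
    print(d_model, pick_n_heads(d_model))
# 64 8
# 30 6
# 13 1
```

A prime `d_model` such as 13 falls back to a single head, so a tuner that searches over `d_model` may prefer values with many divisors.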

## Requirements

Python ≥ 3.10. See `pyproject.toml` for the full dependency list.

## License

MIT — see [LICENSE](LICENSE).
