Metadata-Version: 2.4
Name: cur-estimator
Version: 0.1.0
Summary: Cur-E: release-oriented imputation package distilled from the Propose_Alg path.
Author: haoze
License-Expression: Apache-2.0
Keywords: time-series,imputation,itin,missing-data,cur-estimator
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Science/Research
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.24
Requires-Dist: pandas>=2.0
Requires-Dist: scipy>=1.10
Requires-Dist: torch>=2.1
Provides-Extra: dev
Requires-Dist: build>=1.2.1; extra == "dev"
Requires-Dist: pytest>=8.0; extra == "dev"
Dynamic: license-file

# Cur-E

For the Chinese version, see [README.zh-CN.md](README.zh-CN.md).

Cur-E is a time-series imputation method; this package is its release-oriented distribution.

`cur-estimator` is a Python package distilled from the original experimental codebase. The packaged implementation follows the `algs04/Propose_Alg.py` direction and keeps the release focused on the Cur-E method itself. This repository is still a simplified release package rather than a full reproduction of the original research environment.

The Cur-E pipeline implemented here includes:

- a bidirectional GRU-based recurrent imputation (ITIN) core
- PCHIP-based interpolation regularization during training, inspired by the `Propose_Alg.py` direction
- NumPy-based training and inference API
- a standalone `demo.py` entry for direct execution

## Installation

```bash
pip install cur-estimator
```

For local development:

```bash
pip install -e ".[dev]"
```

For local packaging:

```bash
python -m build
```

## Quick Start

```python
import numpy as np
from cur_e import CurEImputer, make_holdout_validation

rng = np.random.default_rng(2024)
X = rng.normal(size=(32, 48, 8)).astype(np.float32)

mask = rng.random(X.shape) < 0.1
X_missing = X.copy()
X_missing[mask] = np.nan

val = make_holdout_validation(X_missing, holdout_rate=0.1, random_state=2024)

model = CurEImputer(
    n_steps=48,
    n_features=8,
    rnn_hidden_size=128,
    epochs=5,
    alpha=1.2,
)

model.fit(
    train_X=X_missing,
    train_timestamps=None,  # optional absolute timestamps: (num_samples, seq_len) or (num_samples, seq_len, 1 or feature_dim)
    val_X=val["X"],
    val_X_ori=val["X_ori"],
    val_indicating_mask=val["indicating_mask"],
    verbose=True,
)

imputed = model.predict(X_missing)
print(imputed.shape)
```

## CLI / Demo

Run the standalone demo directly:

```bash
python demo.py
```

Run the demo with a CSV input:

```bash
python demo.py --csv your_data.csv --n-steps 48
```

The demo saves outputs into `demo_outputs/`, including:

- `cur_e_demo_model.pt`
- `imputed.npy`
- `input_with_nan.npy`
- `input_full.npy`

## Input Data Format

The core API expects NumPy arrays with shape `(num_samples, seq_len, feature_dim)`.

- `train_X`, `val_X`, `test_X` must be 3D arrays
- missing values must be represented by `np.nan`
- `val_X_ori` must contain the intact validation target
- `val_indicating_mask` must be `1` on artificially hidden validation positions and `0` elsewhere
- `train_timestamps`, `val_timestamps`, and `test_timestamps` are optional absolute timestamps `s`
- the model derives `delta` internally from adjacent timestamp differences for temporal decay
- if timestamps are omitted, an equally spaced time axis `0, 1, 2, ...` is used
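
The timestamp handling described above can be sketched as follows. This is a minimal illustration, not the package's internal code: `timestamps_to_delta` is a hypothetical helper, and setting the first step's gap to `0` is an assumption, since the list above only says that `delta` comes from adjacent timestamp differences.

```python
import numpy as np


def timestamps_to_delta(timestamps=None, num_samples=1, seq_len=1):
    """Return per-step time gaps with the same shape as the timestamps.

    Hypothetical helper for illustration; not part of the cur_e API.
    """
    if timestamps is None:
        # Fallback described above: an equally spaced axis 0, 1, 2, ...
        timestamps = np.tile(np.arange(seq_len, dtype=np.float32), (num_samples, 1))
    s = np.asarray(timestamps, dtype=np.float32)
    delta = np.zeros_like(s)
    # Gap between each step and the previous one; the first step gets 0 (an assumption).
    delta[:, 1:] = np.diff(s, axis=1)
    return delta


s = np.array([[0.0, 1.0, 4.0, 4.5]], dtype=np.float32)
print(timestamps_to_delta(s))                               # [[0.  1.  3.  0.5]]
print(timestamps_to_delta(None, num_samples=1, seq_len=4))  # [[0. 1. 1. 1.]]
```

With irregular timestamps the gaps differ per step, which is what lets a temporal-decay term treat a long gap differently from a short one.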

Minimal example:

```python
import numpy as np

X = np.array(
    [
        [
            [1.0, 2.0],
            [np.nan, 2.1],
            [1.2, np.nan],
        ],
        [
            [0.8, 1.5],
            [0.9, np.nan],
            [1.0, 1.7],
        ],
    ],
    dtype=np.float32,
)
```

This example has:

- `num_samples = 2`
- `seq_len = 3`
- `feature_dim = 2`
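
To make the relationship between `val_X`, `val_X_ori`, and `val_indicating_mask` concrete, here is a minimal sketch of how such a triple could be built and scored. `holdout_split` and `masked_mae` are hypothetical names used only for illustration; the packaged `make_holdout_validation` may choose positions differently.

```python
import numpy as np


def holdout_split(X, holdout_rate=0.1, seed=0):
    """Hide a random share of the observed entries of X.

    Hypothetical stand-in for illustration only.
    """
    rng = np.random.default_rng(seed)
    observed = ~np.isnan(X)
    # Only observed entries can be hidden, because a target value is needed.
    hide = observed & (rng.random(X.shape) < holdout_rate)
    X_val = X.copy()
    X_val[hide] = np.nan
    return {"X": X_val, "X_ori": X, "indicating_mask": hide.astype(np.float32)}


def masked_mae(pred, target, mask):
    """Mean absolute error over positions where mask == 1."""
    mask = mask.astype(bool)
    return float(np.abs(pred[mask] - target[mask]).mean())


rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6, 2)).astype(np.float32)
X[0, 1, 0] = np.nan  # a genuinely missing value, never used as a target

val = holdout_split(X, holdout_rate=0.3, seed=1)
# Stand-in "imputation": fill every NaN with 0 instead of calling a model.
pred = np.nan_to_num(val["X"], nan=0.0)
print(masked_mae(pred, val["X_ori"], val["indicating_mask"]))
```

The mask is `1` only where a known value was deliberately hidden, so the error is computed against real targets and never against positions that were missing to begin with.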

## CSV Demo Input Format

When using `python demo.py --csv your_data.csv --n-steps 48`, the CSV is interpreted as a continuous table:

- each row is one time step
- each column is one feature
- if a column named `timestamp` exists, it is used as the absolute timestamp axis and is not treated as a feature column
- the total row count must be at least `n_steps`
- rows are reshaped into samples of shape `(n_steps, feature_dim)`
- if the total number of rows is not divisible by `n_steps`, the tail rows are dropped

When a `timestamp` column is present, the PCHIP regularization term is evaluated along that axis during training instead of assuming equally spaced steps.

For example, a CSV with `480` rows and `8` columns and `--n-steps 48` becomes:

- `num_samples = 10`
- `seq_len = 48`
- `feature_dim = 8`
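
The reshaping rules above can be sketched with pandas and NumPy. `table_to_samples` is a hypothetical helper written for this illustration, and it assumes a numeric `timestamp` column; `demo.py` may differ in detail.

```python
import numpy as np
import pandas as pd


def table_to_samples(df, n_steps):
    """Cut a continuous table into (num_samples, n_steps, feature_dim).

    Hypothetical helper for illustration only.
    """
    if len(df) < n_steps:
        raise ValueError("the table needs at least n_steps rows")
    num_samples = len(df) // n_steps
    df = df.iloc[: num_samples * n_steps]  # drop tail rows that do not fill a sample

    timestamps = None
    if "timestamp" in df.columns:
        # The timestamp column becomes the time axis, not a feature.
        timestamps = df["timestamp"].to_numpy(dtype=np.float32).reshape(num_samples, n_steps)
        df = df.drop(columns="timestamp")

    X = df.to_numpy(dtype=np.float32).reshape(num_samples, n_steps, -1)
    return X, timestamps


# 500 rows with 8 feature columns plus a timestamp column.
rng = np.random.default_rng(0)
table = pd.DataFrame(rng.normal(size=(500, 8)), columns=[f"f{i}" for i in range(8)])
table.insert(0, "timestamp", np.arange(500))

X, ts = table_to_samples(table, n_steps=48)
print(X.shape, ts.shape)  # (10, 48, 8) (10, 48)
```

Here the last 20 of the 500 rows are dropped, because only 480 rows fill complete samples of 48 steps.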

## Configuration Notes

```python
from cur_e import CurEImputer

model = CurEImputer(
    n_steps=48,
    n_features=8,
    rnn_hidden_size=128,
    batch_size=16,
    epochs=30,
    patience=3,
    learning_rate=1e-3,
    alpha=1.2,
)
```

- `n_steps` is the sequence length per sample
- `n_features` is the feature dimension per time step
- `rnn_hidden_size` controls the recurrent hidden-state size
- `batch_size` controls training and inference batch size
- `epochs` and `patience` control stopping behavior
- `alpha` controls the strength of Cur-E interpolation regularization
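
To give a feel for what an interpolation reference weighted by `alpha` might look like, the sketch below fills the gaps of one feature with SciPy's `PchipInterpolator` on an irregular time axis and computes a weighted penalty against a made-up prediction. The helper `pchip_fill`, the L1 form of the penalty, and the way `alpha` multiplies it are all assumptions for illustration; this README does not specify how the package combines the terms.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator


def pchip_fill(values, timestamps):
    """Fill NaNs in a 1D series with PCHIP over the observed points.

    Illustration only; not the package's internal regularizer.
    """
    values = np.asarray(values, dtype=np.float64)
    timestamps = np.asarray(timestamps, dtype=np.float64)
    observed = ~np.isnan(values)
    if observed.sum() < 2:
        return values.copy()  # PCHIP needs at least two observed points
    interp = PchipInterpolator(timestamps[observed], values[observed], extrapolate=True)
    filled = values.copy()
    filled[~observed] = interp(timestamps[~observed])
    return filled


t = np.array([0.0, 1.0, 4.0, 4.5, 6.0])       # irregular time axis
x = np.array([1.0, np.nan, 2.0, np.nan, 3.0])  # one feature with two gaps
reference = pchip_fill(x, t)

prediction = np.array([1.0, 1.4, 2.0, 2.1, 3.0])  # made-up model output
alpha = 1.2
# Assumed form: alpha scales the mean absolute gap to the PCHIP reference.
penalty = alpha * np.mean(np.abs(prediction - reference))
print(reference, penalty)
```

PCHIP preserves monotonicity between observed points, so the filled values stay within the range of their neighbours rather than overshooting as a cubic spline can.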

## Notes

This repository is the source distribution of the `cur-estimator` package, intended for research, reproduction, and further development. The implementation is distilled from a larger experimental codebase, following the `algs04/Propose_Alg.py` direction.

- It should not be read as a claim that this package reproduces every detail of the full published paper system.
- It is not the complete original research environment or experiment pipeline.

Because the code has been extracted and simplified for packaging, it may contain engineering adaptations relative to the broader experimental system. If you need dataset-specific preprocessing or experiment orchestration, add them explicitly on top of this package.

## License

This project is released under the Apache License 2.0.
