Metadata-Version: 2.4
Name: dynamic-pattern-mining
Version: 0.1.0
Summary: Dynamic, low-resource pattern mining with sklearn-compatible API
Author: Moro
License: MIT
Requires-Python: >=3.9
Requires-Dist: numpy>=1.22
Requires-Dist: pandas>=1.5
Requires-Dist: scikit-learn>=1.2
Requires-Dist: scipy>=1.8
Provides-Extra: bench
Requires-Dist: mlxtend>=0.24.0; extra == 'bench'
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == 'dev'
Description-Content-Type: text/markdown

# dynamic-pattern-mining

`dynamic-pattern-mining` is a scikit-learn-compatible library for mining co-occurrence patterns in clinical codes and recommending codes that are likely to accompany a patient's existing ones.

Example goal:

If a patient has codes `A, B, C`, infer likely additional codes such as `D` from cohort-wide structure.

## Why this approach

Compared to classic frequent-itemset mining workflows (Apriori/FP-Growth style), this estimator is designed for:

- low memory usage via integer coding + sparse matrices
- robust behavior under code-string variants through normalization
- direct personalized ranking (recommendation), not only global frequent itemsets
- shrinkage-aware scoring for stability on sparse/rare co-occurrences
- optional second-order diffusion over the learned code graph
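
To make the first bullet concrete, here is a minimal sketch of integer coding plus a sparse patient-by-code matrix. It illustrates the general technique with `pandas.factorize` and `scipy.sparse`; it is not taken from the library's internals, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Long-format toy data: one row per (patient, code).
df = pd.DataFrame(
    [(1, "I10"), (1, "E11"), (1, "N18"), (2, "I10"), (2, "E11"), (3, "J45"), (3, "R06")],
    columns=["patient_id", "code"],
).drop_duplicates()

# Integer-code patients and codes; each string is stored once in the uniques.
row_idx, patients = pd.factorize(df["patient_id"])
col_idx, codes = pd.factorize(df["code"])

# Binary patient-by-code incidence matrix in CSR form.
X = sparse.csr_matrix(
    (np.ones(len(df), dtype=np.int32), (row_idx, col_idx)),
    shape=(len(patients), len(codes)),
)

# Code-by-code co-occurrence counts; the diagonal holds per-code patient counts.
cooc = (X.T @ X).tocsr()
i10, e11 = list(codes).index("I10"), list(codes).index("E11")
print(cooc[i10, e11])  # number of patients with both I10 and E11
```

Only non-zero (patient, code) pairs are stored, so memory grows with the number of observed codes rather than with patients × vocabulary size.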

## Install

```bash
pip install dynamic-pattern-mining
```

## Quick Start (Long Format)

```python
import pandas as pd
from dynamic_pattern_mining import DynamicPatternMiner

# long format: one row per (patient, code)
df = pd.DataFrame(
    [
        (1, "I10"), (1, "E11"), (1, "N18"),
        (2, "I10"), (2, "E11"),
        (3, "J45"), (3, "R06"),
    ],
    columns=["patient_id", "code"],
)

miner = DynamicPatternMiner(
    patient_col="patient_id",
    code_col="code",
    min_code_frequency=1,
    min_pair_frequency=1,
)

miner.fit(df)

print(miner.recommend(["I10", "E11"], top_k=5))
print(miner.explain_recommendation(["I10", "E11"], target_code="N18"))
print(miner.mine_common_patterns(top_k=10, min_score=-1e9))
```

## Quick Start (Basket Format)

```python
import pandas as pd
from dynamic_pattern_mining import DynamicPatternMiner

X = pd.DataFrame(
    {
        "basket": [
            ["I10", "E11"],
            ["I10", "N18"],
            ["J45", "R06"],
        ]
    }
)

miner = DynamicPatternMiner(
    basket_col="basket",
    min_code_frequency=1,
    min_pair_frequency=1,
    output_format="sparse",
)

X_rec = miner.fit_transform(X)
print(X_rec.shape)
```

## sklearn Pipeline Example

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

from dynamic_pattern_mining import DynamicPatternMiner

X = pd.DataFrame({"basket": [["I10", "E11"], ["J45", "R06"], ["F32", "F41"]]})
y = [0, 1, 2]

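pipe = Pipeline([
    # Every code in this toy data occurs once, so the default thresholds
    # (min_code_frequency=3, min_pair_frequency=2) would prune all of them.
    ("miner", DynamicPatternMiner(
        basket_col="basket",
        min_code_frequency=1,
        min_pair_frequency=1,
        output_format="sparse",
    )),
    ("clf", LogisticRegression(max_iter=2000)),
])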

pipe.fit(X, y)
```

## Full Parameter Reference

`DynamicPatternMiner` signature:

```python
DynamicPatternMiner(
    patient_col="patient_id",
    code_col="code",
    basket_col=None,
    min_code_frequency=3,
    min_pair_frequency=2,
    max_codes=None,
    chunk_size=None,
    lowercase=True,
    normalize_text=True,
    pair_smoothing=1.0,
    shrinkage_lambda=10.0,
    popularity_penalty=0.10,
    diffusion_weight=0.25,
    output_top_k=30,
    output_format="sparse",
    dtype=np.float32,
)
```

### Input Parsing

- `patient_col`: `str` (default `"patient_id"`)
  Patient identifier column for long-format input.
- `code_col`: `str` (default `"code"`)
  Code column for long-format input.
- `basket_col`: `str | None` (default `None`)
  Basket column if each row already contains a list/set of codes.

### Frequency / Pruning

- `min_code_frequency`: `int` (default `3`)
  Minimum patient-level frequency for a code to be kept.
- `min_pair_frequency`: `int` (default `2`)
  Minimum pair co-occurrence count to keep an edge.
- `max_codes`: `int | None` (default `None`)
  Optional top-K code cap after frequency filtering.
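
As a rough illustration of what patient-level frequency pruning means, the sketch below counts distinct patients per code with pandas and filters on that count. `prune_codes` is a hypothetical helper written for this README, not a function exported by the library, and the library's own pruning may differ in detail.

```python
import pandas as pd

def prune_codes(df, patient_col="patient_id", code_col="code",
                min_code_frequency=3, max_codes=None):
    """Hypothetical helper: keep codes seen in enough distinct patients."""
    # Patient-level frequency: number of distinct patients per code.
    freq = df.groupby(code_col)[patient_col].nunique()
    keep = freq[freq >= min_code_frequency].sort_values(ascending=False)
    if max_codes is not None:
        keep = keep.head(max_codes)  # optional cap on the most frequent codes
    return df[df[code_col].isin(keep.index)]

toy = pd.DataFrame(
    [(1, "I10"), (2, "I10"), (3, "I10"), (1, "E11"), (2, "E11"), (1, "N18")],
    columns=["patient_id", "code"],
)
print(sorted(prune_codes(toy, min_code_frequency=2)["code"].unique()))
```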

### Resource / Scaling

- `chunk_size`: `int | None` (default `None`)
  Reserved chunking control for large input processing.

### Normalization

- `lowercase`: `bool` (default `True`)
  Lowercase code strings.
- `normalize_text`: `bool` (default `True`)
  Normalize separators (`_`, `-`, repeated spaces) for robust matching.
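
The sketch below shows one plausible reading of these two flags. `normalize_code` is a hypothetical helper, and mapping separators to a single space is an assumption; the library's exact rules may differ.

```python
import re

def normalize_code(code, lowercase=True, normalize_text=True):
    """Hypothetical helper: canonicalize a raw code string."""
    text = str(code).strip()
    if normalize_text:
        # Assumption: underscores, hyphens, and whitespace runs become one space.
        text = re.sub(r"[_\-\s]+", " ", text)
    if lowercase:
        text = text.lower()
    return text

variants = ["Chest_Pain", "chest-pain", "CHEST  PAIN"]
print({normalize_code(v) for v in variants})  # all three collapse to one key
```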

### Scoring / Pattern Dynamics

- `pair_smoothing`: `float` (default `1.0`)
  Additive smoothing for conditional probability estimates.
- `shrinkage_lambda`: `float` (default `10.0`)
  Shrinkage strength for low-support pairs.
- `popularity_penalty`: `float` (default `0.10`)
  Penalizes globally frequent consequents to reduce trivial recommendations.
- `diffusion_weight`: `float` (default `0.25`)
  Weight of second-order graph diffusion contribution.
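
This README does not spell out the scoring formulas, so the following NumPy sketch is only one way the four parameters could interact. Every formula in it is an assumption made for illustration: the smoothed conditional, the `C / (C + lambda)` shrinkage weight, the linear popularity penalty, and the matrix-product diffusion term. `score_matrix` is a hypothetical function, not part of the library.

```python
import numpy as np

def score_matrix(C, n, N, pair_smoothing=1.0, shrinkage_lambda=10.0,
                 popularity_penalty=0.10, diffusion_weight=0.25):
    """Illustrative scoring only; NOT the library's actual formula.

    C: (V, V) pair co-occurrence counts, n: (V,) per-code patient counts,
    N: number of patients. Row i is the antecedent, column j the consequent.
    """
    C = np.asarray(C, dtype=np.float64)
    n = np.asarray(n, dtype=np.float64)
    base = n / N  # global rate of each consequent
    # Additively smoothed estimate of P(j | i).
    cond = (C + pair_smoothing) / (n[:, None] + 2.0 * pair_smoothing)
    # Support-based weight: low-count pairs are pulled toward zero excess.
    w = C / (C + shrinkage_lambda)
    first = w * (cond - base[None, :]) - popularity_penalty * base[None, :]
    np.fill_diagonal(first, 0.0)
    # Second-order term: two-step paths i -> k -> j through the graph.
    total = first + diffusion_weight * (first @ first)
    np.fill_diagonal(total, 0.0)
    return total

# Toy incidence matrix (3 patients x 5 codes: I10, E11, N18, J45, R06).
X = np.array([[1, 1, 1, 0, 0], [1, 1, 0, 0, 0], [0, 0, 0, 1, 1]])
counts = X.T @ X
n = np.diag(counts)
C = counts - np.diag(n)  # zero the diagonal to keep pair counts only
S = score_matrix(C, n, N=X.shape[0])
print(S[3, 4] > S[3, 0])  # J45 -> R06 outranks J45 -> I10
```

In this sketch a larger `shrinkage_lambda` moves rarely observed pairs closer to a neutral score, and a larger `popularity_penalty` lowers every score whose consequent is common across the cohort.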

### Output Control

- `output_top_k`: `int` (default `30`)
  Maximum number of positive recommendations kept per sample in `transform`.
- `output_format`: `{"sparse", "dense", "pandas"}` (default `"sparse"`)
  Return type of `transform`.
- `dtype`: numpy dtype (default `np.float32`)
  Numeric dtype for learned scores and outputs.
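
To illustrate what keeping at most `output_top_k` positive scores per sample could look like, here is a small sketch. `keep_top_k_positive` is a hypothetical helper, and the library may implement this step differently.

```python
import numpy as np
from scipy import sparse

def keep_top_k_positive(scores, k):
    """Zero all but the k largest positive entries in each row; return CSR."""
    scores = np.asarray(scores, dtype=np.float32)
    out = np.zeros_like(scores)
    for i, row in enumerate(scores):
        top = np.argsort(row)[::-1][:k]  # indices of the k largest scores
        top = top[row[top] > 0]          # drop non-positive scores
        out[i, top] = row[top]
    return sparse.csr_matrix(out)

dense_scores = np.array([[0.5, -0.1, 0.3, 0.2], [0.0, 0.0, -1.0, 0.1]])
top2 = keep_top_k_positive(dense_scores, k=2)
print(top2.toarray())
```

Capping each row this way bounds the number of stored values at `k` per sample, which is what makes the sparse output format worthwhile.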

## Main Methods

- `fit(X)`
  Learns code vocabulary, pair graph, and dynamic score matrix.
- `transform(X)`
  Returns recommendation-score features per sample.
- `recommend(basket, top_k=10)`
  Returns the `top_k` highest-scoring codes for a given basket.
- `explain_recommendation(basket, target_code, top_drivers=5)`
  Breaks a target code's score down into contributions from the basket's codes, listing the top `top_drivers`.
- `mine_common_patterns(top_k=20, min_score=0.0)`
  Returns global antecedent→consequent patterns from the learned score graph.
- `get_feature_names_out()`
  Feature names for transformed output.

## FP-Growth Benchmark

Run the built-in benchmark comparison from a source checkout (install the `bench` extra first; it adds `mlxtend`):

```bash
python src/dynamic_pattern_mining/benchmarks/fp_growth_benchmark.py
```

It reports:

- `recall_at_5_dynamic_pattern_miner`
- `recall_at_5_fp_growth`
- `delta`
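
For readers unfamiliar with the metric, recall at 5 is commonly computed as below. This is a generic sketch, not the benchmark script's code. `recall_at_k` is a hypothetical helper, the code lists are made up for illustration, and treating `delta` as "miner minus FP-Growth" is an assumption.

```python
def recall_at_k(recommended, relevant, k=5):
    """Fraction of held-out relevant codes found in the top-k recommendations."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = relevant.intersection(recommended[:k])
    return len(hits) / len(relevant)

held_out = {"N18", "J45"}
ranked_a = ["N18", "I50", "E78", "K21", "M54", "J45"]  # J45 is ranked 6th
ranked_b = ["I50", "E78", "K21", "M54", "R06", "N18"]  # nothing relevant in top 5

r_a = recall_at_k(ranked_a, held_out)
r_b = recall_at_k(ranked_b, held_out)
delta = r_a - r_b  # assumed sign convention
print(r_a, r_b, delta)
```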

## Development

```bash
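pip install -e ".[dev]"   # quotes keep shells such as zsh from globbing the brackets
pytest
python -m build           # needs the PyPA "build" package: pip install build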
```
