Metadata-Version: 2.4
Name: jupytertracker
Version: 0.1.0
Summary: Track Jupyter notebook cell execution and export a clean, ordered Python script
License: MIT
Requires-Python: >=3.8
Requires-Dist: ipython>=7.0
Provides-Extra: dev
Requires-Dist: nbformat>=5.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Description-Content-Type: text/markdown

# jupytertracker

`jupytertracker` is the first component of an end-to-end ML model management system for replicable machine learning.

## The problem

Building a machine learning model in a Jupyter notebook is iterative and messy — cells run out of order, code gets modified and re-run, hyperparameters get tweaked. When a model reviewer asks "how did you build this?", the data scientist has to manually reconstruct the process. When a compliance team asks for documentation, someone has to write it by hand.

The result: models that can't be independently replicated, and whitepapers that are written after the fact from memory rather than from the actual process.

## System vision

This library is Component 1 of a three-part system for making the ML modeling process fully replicable and auditable:

```
┌─────────────────────────────────────────────────────────────────┐
│                  ML Model Management System                     │
├──────────────────┬──────────────────────┬───────────────────────┤
│  Component 1     │  Component 2         │  Component 3          │
│  JupyterTracker  │  MLflow Integration  │  Whitepaper Generator │
│  (this library)  │                      │                       │
├──────────────────┼──────────────────────┼───────────────────────┤
│ Records every    │ Registers models,    │ Generates a structured│
│ cell execution   │ tracks experiments,  │ report (data, method, │
│ in order. Exports│ parameters, metrics, │ results, limitations) │
│ an honest Python │ and serves models.   │ from code annotations │
│ script of what   │ Uses MLflow as-is.   │ using an LLM.         │
│ actually ran.    │                      │                       │
├──────────────────┴──────────────────────┴───────────────────────┤
│  Together: a non-technical reviewer can verify what was built,  │
│  how it was built, and reproduce the result independently.      │
└─────────────────────────────────────────────────────────────────┘
```

**Data flow:**

```
Notebook session
  │
  ├── JupyterTracker records every cell execution (parallel, live)
  │     └── export_script() → ordered .py file with timing
  │
  ├── MLflow tracks experiments, parameters, and metrics (parallel, live)
  │     └── model registry → reproducible run IDs
  │
  └── On demand: Whitepaper generator
        ├── pulls execution log from JupyterTracker
        ├── pulls run metadata from MLflow
        └── uses wpr_-prefixed function outputs as report sections
              └── LLM assembles → structured whitepaper (PDF/Markdown)
```

---

## Component 1: JupyterTracker

Track Jupyter notebook cell executions and export a clean, ordered Python script — exactly what ran, in the order it ran.

### Install

```bash
pip install jupytertracker
```

### Usage

Add one line at the top of your notebook:

```python
import jupytertracker
jupytertracker.start()
```

When you're done, export:

```python
jupytertracker.export_script("my_analysis.py")
```

The output is a `.py` file with every successful cell execution in order, one block per run:

```python
# Generated by jupytertracker (sequential mode)
# Total execution time: 4m 14.6s
# Cells recorded: 5

# execution 1  [340ms]
x = load_data("train.csv")

# execution 2  [1m 52.1s]
model = train(x, lr=0.01)

# execution 3  [18.4s]
evaluate(model)

# execution 4 (re-run)  [1m 48.7s]
model = train(x, lr=0.1)

# execution 5 (re-run)  [15.1s]
evaluate(model)
```
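
The bracketed timings in that output mix three units: milliseconds, seconds, and minutes plus seconds. A formatter along these lines would produce them. `format_duration` is a hypothetical helper written for illustration, and its rounding rules are an assumption inferred from the sample output rather than taken from the library's code.

```python
def format_duration(seconds: float) -> str:
    """Format a duration the way the sample export does (assumed rules)."""
    millis = round(seconds * 1000)
    if millis < 1000:
        return f"{millis}ms"
    # Work in whole tenths of a second so rounding can never yield "60.0s".
    tenths = round(seconds * 10)
    if tenths < 600:
        return f"{tenths / 10:.1f}s"
    minutes, rest = divmod(tenths, 600)
    return f"{minutes}m {rest / 10:.1f}s"


for seconds in (0.34, 18.4, 112.1):
    print(format_duration(seconds))
# 340ms
# 18.4s
# 1m 52.1s
```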

### API

```python
jupytertracker.start(ip=None)        # start tracking; idempotent
jupytertracker.stop()                # stop tracking; next start() begins fresh
jupytertracker.export_script(path)   # write execution log to .py file
jupytertracker.clear()               # clear the log without stopping
jupytertracker.get_log()             # return list of ExecutionRecord
```
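
To make these semantics concrete, here is a minimal stand-in that behaves the way the comments above describe. It is a sketch, not the library's implementation: `MiniTracker`, `Record`, and `fake_run` are invented names, and the real `ExecutionRecord` may carry different fields. The two callbacks are shaped like IPython's `pre_run_cell` and `post_run_cell` events, which a real tracker would presumably attach with `ip.events.register(...)`. Here they are driven by fake objects so the example runs without a kernel.

```python
import time
from dataclasses import dataclass
from types import SimpleNamespace
from typing import List, Optional


@dataclass
class Record:
    """Hypothetical stand-in for ExecutionRecord; the real fields may differ."""
    source: str
    seconds: float


class MiniTracker:
    def __init__(self) -> None:
        self.active = False
        self.log: List[Record] = []
        self._started_at: Optional[float] = None

    def start(self) -> None:
        if self.active:      # idempotent: a second start() changes nothing
            return
        self.log = []        # a start() after stop() begins fresh
        self.active = True

    def stop(self) -> None:
        self.active = False

    def clear(self) -> None:
        self.log = []        # tracking stays on

    def get_log(self) -> List[Record]:
        return list(self.log)

    # Same call shapes as IPython's "pre_run_cell" / "post_run_cell" callbacks.
    def pre_run_cell(self, info) -> None:
        self._started_at = time.perf_counter()

    def post_run_cell(self, result) -> None:
        if not self.active or self._started_at is None:
            return
        elapsed = time.perf_counter() - self._started_at
        self._started_at = None
        if result.success:   # cells that did not succeed are skipped
            self.log.append(Record(result.info.raw_cell, elapsed))


def fake_run(tracker: MiniTracker, source: str, success: bool = True) -> None:
    """Simulate one cell execution without a running kernel."""
    info = SimpleNamespace(raw_cell=source)
    tracker.pre_run_cell(info)
    tracker.post_run_cell(SimpleNamespace(info=info, success=success))


tracker = MiniTracker()
tracker.start()
fake_run(tracker, "x = 1")
fake_run(tracker, "1 / 0", success=False)  # not recorded
tracker.start()                            # no effect, log is kept
print([r.source for r in tracker.get_log()])  # ['x = 1']
```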

### Notes

- **Call `start()` in your very first cell**, before any imports or data loading. The tracker only records what runs after `start()` is called. Any state built up before — loaded dataframes, imported libraries, defined variables — is invisible to the tracker and will be missing from the exported script.

- **The exported script is an execution record, not a guaranteed reproducible script.** If cells depended on state that existed in the kernel but wasn't captured (see above), the script will fail with a `NameError` when run top-to-bottom.

- **Failed cells are excluded.** Cells that raise an exception, have a syntax error, or are interrupted by the user are not recorded — only successful executions appear in the output.

- **Kernel restart** clears all Python state, including the tracker's log, so nothing recorded before the restart survives. Call `export_script()` before restarting if you want to preserve the session.

- Magic commands and shell escapes (`%matplotlib inline`, `!pip install ...`) are included in the export, each with a comment noting that it requires a Jupyter environment.
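
One way to check whether an exported script suffers from the missing-state problem described in the notes above is to run it in an empty namespace and look for a `NameError`. The helper below is an illustrative sketch: `first_name_error` is a hypothetical name and not part of jupytertracker. It executes the code for real, so it only suits scripts that are cheap and safe to re-run.

```python
from typing import Optional


def first_name_error(script_text: str) -> Optional[str]:
    """Run the script in an empty namespace; return the NameError message, if any."""
    namespace = {"__name__": "__replay__"}
    try:
        exec(compile(script_text, "<exported script>", "exec"), namespace)
    except NameError as exc:
        return str(exc)
    return None


complete = "x = [1, 2, 3]\ntotal = sum(x)\n"
missing_setup = "total = sum(x)\n"  # x was defined before tracking started

print(first_name_error(complete))       # None
print(first_name_error(missing_setup))  # name 'x' is not defined
```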

## Related projects

- **[ipyflow](https://github.com/ipyflow/ipyflow)** — reactive Python kernel that tracks dataflow between cells and can recover the minimal set of cells needed to reproduce an output. Requires switching kernels; takes a "prevent the mess" approach vs. jupytertracker's "record the mess" approach.
- **[papermill](https://github.com/nteract/papermill)** — parameterizes and executes notebooks top-to-bottom. Good for batch runs; doesn't handle interactive out-of-order execution.
- **[reprozip-jupyter](https://pypi.org/project/reprozip-jupyter/)** — packs the full notebook environment (libraries, data) for portability. Solves environment reproducibility, not execution-order reproducibility.
- **[MLflow](https://mlflow.org)** — experiment tracking, model registry, and model serving. Component 2 of this system.

## Roadmap

- **v2:** `mode='dedup'` — deduplicate to the last version of each cell, ordered by last execution. For "clean up my notebook" workflows.
- **Component 2:** MLflow integration — link JupyterTracker sessions to MLflow run IDs automatically.
- **Component 3:** Whitepaper generator — `wpr_`-prefixed functions collect outputs for LLM-generated structured reports.
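
The planned `mode='dedup'` behaviour (keep the last version of each cell, ordered by last execution) can be sketched with an insertion-ordered dict. This is an assumption-laden illustration, not the planned implementation: `Run`, `cell_id`, and `dedup_last` are hypothetical names, and how the tracker will identify "the same cell" is not stated here.

```python
from typing import Dict, List, NamedTuple


class Run(NamedTuple):
    cell_id: str   # hypothetical stable identifier for a notebook cell
    source: str    # code that ran in this execution


def dedup_last(runs: List[Run]) -> List[Run]:
    """Keep only the last run of each cell, ordered by when that last run happened."""
    latest: Dict[str, Run] = {}
    for run in runs:
        latest.pop(run.cell_id, None)  # forget the cell's earlier position
        latest[run.cell_id] = run      # re-insert at the end (dicts keep insertion order)
    return list(latest.values())


history = [
    Run("a", 'x = load_data("train.csv")'),
    Run("b", "model = train(x, lr=0.01)"),
    Run("c", "evaluate(model)"),
    Run("b", "model = train(x, lr=0.1)"),
    Run("c", "evaluate(model)"),
]
for run in dedup_last(history):
    print(run.source)
# x = load_data("train.csv")
# model = train(x, lr=0.1)
# evaluate(model)
```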
