Metadata-Version: 2.4
Name: jupyterlab-pipelines
Version: 0.1.0
Summary: Export Jupyter notebooks to Airflow DAGs, dbt models, and Spark jobs on ilum
Project-URL: Homepage, https://github.com/ilum-cloud/jupyterlab-pipelines
Project-URL: Repository, https://github.com/ilum-cloud/jupyterlab-pipelines.git
Project-URL: Issues, https://github.com/ilum-cloud/jupyterlab-pipelines/issues
Author-email: Ilum Labs LLC <support@ilum.cloud>
License:                                  Apache License
                                   Version 2.0, January 2004
                                http://www.apache.org/licenses/
        
           TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
        
           1. Definitions.
        
              "License" shall mean the terms and conditions for use, reproduction,
              and distribution as defined by Sections 1 through 9 of this document.
        
              "Licensor" shall mean the copyright owner or entity authorized by
              the copyright owner that is granting the License.
        
              "Legal Entity" shall mean the union of the acting entity and all
              other entities that control, are controlled by, or are under common
              control with that entity. For the purposes of this definition,
              "control" means (i) the power, direct or indirect, to cause the
              direction or management of such entity, whether by contract or
              otherwise, or (ii) ownership of fifty percent (50%) or more of the
              outstanding shares, or (iii) beneficial ownership of such entity.
        
              "You" (or "Your") shall mean an individual or Legal Entity
              exercising permissions granted by this License.
        
              "Source" form shall mean the preferred form for making modifications,
              including but not limited to software source code, documentation
              source, and configuration files.
        
              "Object" form shall mean any form resulting from mechanical
              transformation or translation of a Source form, including but
              not limited to compiled object code, generated documentation,
              and conversions to other media types.
        
              "Work" shall mean the work of authorship, whether in Source or
              Object form, made available under the License, as indicated by a
              copyright notice that is included in or attached to the work
              (an example is provided in the Appendix below).
        
              "Derivative Works" shall mean any work, whether in Source or Object
              form, that is based on (or derived from) the Work and for which the
              editorial revisions, annotations, elaborations, or other modifications
              represent, as a whole, an original work of authorship. For the purposes
              of this License, Derivative Works shall not include works that remain
              separable from, or merely link (or bind by name) to the interfaces of,
              the Work and Derivative Works thereof.
        
              "Contribution" shall mean any work of authorship, including
              the original version of the Work and any modifications or additions
              to that Work or Derivative Works thereof, that is intentionally
              submitted to Licensor for inclusion in the Work by the copyright owner
              or by an individual or Legal Entity authorized to submit on behalf of
              the copyright owner. For the purposes of this definition, "submitted"
              means any form of electronic, verbal, or written communication sent
              to the Licensor or its representatives, including but not limited to
              communication on electronic mailing lists, source code control systems,
              and issue tracking systems that are managed by, or on behalf of, the
              Licensor for the purpose of discussing and improving the Work, but
              excluding communication that is conspicuously marked or otherwise
              designated in writing by the copyright owner as "Not a Contribution."
        
              "Contributor" shall mean Licensor and any individual or Legal Entity
              on behalf of whom a Contribution has been received by Licensor and
              subsequently incorporated within the Work.
        
           2. Grant of Copyright License. Subject to the terms and conditions of
              this License, each Contributor hereby grants to You a perpetual,
              worldwide, non-exclusive, no-charge, royalty-free, irrevocable
              copyright license to reproduce, prepare Derivative Works of,
              publicly display, publicly perform, sublicense, and distribute the
              Work and such Derivative Works in Source or Object form.
        
           3. Grant of Patent License. Subject to the terms and conditions of
              this License, each Contributor hereby grants to You a perpetual,
              worldwide, non-exclusive, no-charge, royalty-free, irrevocable
              (except as stated in this section) patent license to make, have made,
              use, offer to sell, sell, import, and otherwise transfer the Work,
              where such license applies only to those patent claims licensable
              by such Contributor that are necessarily infringed by their
              Contribution(s) alone or by combination of their Contribution(s)
              with the Work to which such Contribution(s) was submitted. If You
              institute patent litigation against any entity (including a
              cross-claim or counterclaim in a lawsuit) alleging that the Work
              or a Contribution incorporated within the Work constitutes direct
              or contributory patent infringement, then any patent licenses
              granted to You under this License for that Work shall terminate
              as of the date such litigation is filed.
        
           4. Redistribution. You may reproduce and distribute copies of the
              Work or Derivative Works thereof in any medium, with or without
              modifications, and in Source or Object form, provided that You
              meet the following conditions:
        
              (a) You must give any other recipients of the Work or
                  Derivative Works a copy of this License; and
        
              (b) You must cause any modified files to carry prominent notices
                  stating that You changed the files; and
        
              (c) You must retain, in the Source form of any Derivative Works
                  that You distribute, all copyright, patent, trademark, and
                  attribution notices from the Source form of the Work,
                  excluding those notices that do not pertain to any part of
                  the Derivative Works; and
        
              (d) If the Work includes a "NOTICE" text file as part of its
                  distribution, then any Derivative Works that You distribute must
                  include a readable copy of the attribution notices contained
                  within such NOTICE file, excluding those notices that do not
                  pertain to any part of the Derivative Works, in at least one
                  of the following places: within a NOTICE text file distributed
                  as part of the Derivative Works; within the Source form or
                  documentation, if provided along with the Derivative Works; or,
                  within a display generated by the Derivative Works, if and
                  wherever such third-party notices normally appear. The contents
                  of the NOTICE file are for informational purposes only and
                  do not modify the License. You may add Your own attribution
                  notices within Derivative Works that You distribute, alongside
                  or as an addendum to the NOTICE text from the Work, provided
                  that such additional attribution notices cannot be construed
                  as modifying the License.
        
              You may add Your own copyright statement to Your modifications and
              may provide additional or different license terms and conditions
              for use, reproduction, or distribution of Your modifications, or
              for any such Derivative Works as a whole, provided Your use,
              reproduction, and distribution of the Work otherwise complies with
              the conditions stated in this License.
        
           5. Submission of Contributions. Unless You explicitly state otherwise,
              any Contribution intentionally submitted for inclusion in the Work
              by You to the Licensor shall be under the terms and conditions of
              this License, without any additional terms or conditions.
              Notwithstanding the above, nothing herein shall supersede or modify
              the terms of any separate license agreement you may have executed
              with Licensor regarding such Contributions.
        
           6. Trademarks. This License does not grant permission to use the trade
              names, trademarks, service marks, or product names of the Licensor,
              except as required for reasonable and customary use in describing the
              origin of the Work and reproducing the content of the NOTICE file.
        
           7. Disclaimer of Warranty. Unless required by applicable law or
              agreed to in writing, Licensor provides the Work (and each
              Contributor provides its Contributions) on an "AS IS" BASIS,
              WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
              implied, including, without limitation, any warranties or conditions
              of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
              PARTICULAR PURPOSE. You are solely responsible for determining the
              appropriateness of using or redistributing the Work and assume any
              risks associated with Your exercise of permissions under this License.
        
           8. Limitation of Liability. In no event and under no legal theory,
              whether in tort (including negligence), contract, or otherwise,
              unless required by applicable law (such as deliberate and grossly
              negligent acts) or agreed to in writing, shall any Contributor be
              liable to You for damages, including any direct, indirect, special,
              incidental, or consequential damages of any character arising as a
              result of this License or out of the use or inability to use the
              Work (including but not limited to damages for loss of goodwill,
              work stoppage, computer failure or malfunction, or any and all
              other commercial damages or losses), even if such Contributor
              has been advised of the possibility of such damages.
        
           9. Accepting Warranty or Additional Liability. While redistributing
              the Work or Derivative Works thereof, You may choose to offer,
              and charge a fee for, acceptance of support, warranty, indemnity,
              or other liability obligations and/or rights consistent with this
              License. However, in accepting such obligations, You may act only
              on Your own behalf and on Your sole responsibility, not on behalf
              of any other Contributor, and only if You agree to indemnify,
              defend, and hold each Contributor harmless for any liability
              incurred by, or claims asserted against, such Contributor by reason
              of your accepting any such warranty or additional liability.
        
           END OF TERMS AND CONDITIONS
        
           APPENDIX: How to apply the Apache License to your work.
        
              To apply the Apache License to your work, attach the following
              boilerplate notice, with the fields enclosed by brackets "[]"
              replaced with your own identifying information. (Don't include
              the brackets!)  The text should be enclosed in the appropriate
              comment syntax for the file format. We also recommend that a
              file or class name and description of purpose be included on the
              same "printed page" as the copyright notice for easier
              identification within third-party archives.
        
           Copyright 2024-2026 Ilum Labs LLC
        
           Licensed under the Apache License, Version 2.0 (the "License");
           you may not use this file except in compliance with the License.
           You may obtain a copy of the License at
        
               http://www.apache.org/licenses/LICENSE-2.0
        
           Unless required by applicable law or agreed to in writing, software
           distributed under the License is distributed on an "AS IS" BASIS,
           WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
           See the License for the specific language governing permissions and
           limitations under the License.
License-File: LICENSE
Keywords: Airflow,Data Pipeline,Exporter,Jupyter,JupyterLab,JupyterLab4,Livy,Notebook,Spark,dbt,ilum,jupyterlab-extension
Classifier: Framework :: Jupyter
Classifier: Framework :: Jupyter :: JupyterLab
Classifier: Framework :: Jupyter :: JupyterLab :: 4
Classifier: Framework :: Jupyter :: JupyterLab :: Extensions
Classifier: Framework :: Jupyter :: JupyterLab :: Extensions :: Prebuilt
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.9
Requires-Dist: black>=23.0.0
Requires-Dist: jinja2>=3.1.0
Requires-Dist: jupyter-server>=2.0.0
Requires-Dist: jupyterlab<5,>=4.0.0
Requires-Dist: nbformat>=5.7.0
Requires-Dist: pydantic>=2.0
Requires-Dist: pyjwt>=2.0
Requires-Dist: structlog>=23.0
Provides-Extra: dev
Requires-Dist: black; extra == 'dev'
Requires-Dist: black>=23.0.0; extra == 'dev'
Requires-Dist: coverage; extra == 'dev'
Requires-Dist: hypothesis; extra == 'dev'
Requires-Dist: pytest; extra == 'dev'
Requires-Dist: pytest-asyncio; extra == 'dev'
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: pytest-jupyter; extra == 'dev'
Requires-Dist: pytest-tornasync; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Provides-Extra: format
Requires-Dist: black>=23.0.0; extra == 'format'
Provides-Extra: test
Requires-Dist: black; extra == 'test'
Requires-Dist: coverage; extra == 'test'
Requires-Dist: hypothesis; extra == 'test'
Requires-Dist: pytest; extra == 'test'
Requires-Dist: pytest-asyncio; extra == 'test'
Requires-Dist: pytest-cov; extra == 'test'
Requires-Dist: pytest-jupyter; extra == 'test'
Requires-Dist: pytest-tornasync; extra == 'test'
Requires-Dist: ruff; extra == 'test'
Description-Content-Type: text/markdown

# jupyterlab-pipelines

JupyterLab 4 extension that exports the current notebook to an Airflow DAG or
an Ilum Spark job, directly from the sidebar and without leaving JupyterLab.
Support for dbt and Spark Declarative Pipelines is planned.

> jupyterlab-pipelines is developed by [ILUM](https://ilum.cloud), the free data lakehouse platform for a cloud native world.

![Build](https://github.com/ilum-cloud/jupyterlab-pipelines/actions/workflows/build.yml/badge.svg)
![PyPI](https://img.shields.io/pypi/v/jupyterlab-pipelines)
![Version](https://img.shields.io/github/v/tag/ilum-cloud/jupyterlab-pipelines)
![License](https://img.shields.io/badge/license-Apache--2.0-blue)
[![Built by ILUM](https://img.shields.io/badge/Built_by-ILUM-5280FF)](https://ilum.cloud)

---

## What it does

- **Airflow DAG export** - parses the open notebook, renders a production-ready
  DAG Python file (batch or per-cell mode), pushes it to Gitea so Airflow's
  git-sync picks it up (~40 s), and optionally fires a DAG run automatically.
- **Ilum Spark submission** - wraps the notebook as a standalone pyFile and
  submits it directly to the Ilum native REST API as a one-shot job, long-running
  service, or cron schedule.
- **Notebook sanitization** - strips Jupyter-only line magics (`%manage_spark`),
  shell escapes (`!pip`), and display calls (`print`, `display`) that crash
  Airflow workers; reports every stripped line so users can opt back in.
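
The sanitization step can be pictured with a minimal sketch. The regular expressions and the `sanitize_cell` name below are assumptions made for illustration, not the parser's actual rules, and the sketch covers only line magics and shell escapes (not display calls):

```python
import re

# Hypothetical patterns; the real parser's rules are not shown in this README.
_LINE_MAGIC = re.compile(r"^\s*%[A-Za-z_]")  # e.g. %manage_spark (not %%cell magics)
_SHELL_ESCAPE = re.compile(r"^\s*!")         # e.g. !pip install foo


def sanitize_cell(source: str) -> tuple[str, list[str]]:
    """Return (cleaned_source, stripped_lines) for one code cell."""
    kept: list[str] = []
    stripped: list[str] = []
    for line in source.splitlines():
        if _LINE_MAGIC.match(line) or _SHELL_ESCAPE.match(line):
            stripped.append(line)  # remembered so it can be reported to the user
        else:
            kept.append(line)
    return "\n".join(kept), stripped
```

Returning the stripped lines alongside the cleaned source is what makes a "report every stripped line" step possible.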

![JPE flow - notebook → Airflow DAG export → push → completed (green) DAG run in Airflow](docs/jpe-flow.gif)

---

## Quick start

### User install (any JupyterLab pod)

```bash
pip install jupyterlab-pipelines
# restart Jupyter so the server extension loads
```

The plugin sidebar appears at left-rank 102 (below the file browser). Open
any notebook → click the pipelines icon → pick a target.

**Works out of the box with zero configuration** - Local pyFile export
(downloading the wrapped `.py`) needs no backend services. Other targets
(Airflow, Ilum, Gitea) become available as soon as their URLs are configured in
`Settings → Pipelines`.

#### What works at which level

| Capability                                           | Required                                                |
| ---------------------------------------------------- | ------------------------------------------------------- |
| Local pyFile download (wrapped Spark-runnable `.py`) | nothing - built in-process                              |
| Airflow DAG export                                   | `airflowApiUrl` setting **OR** `AIRFLOW_API_URL` env    |
| Gitea push of generated DAG                          | `gitApiUrl` + `gitToken` (or `gitUser` + `gitPassword`) |
| Ilum single-shot job / service / cron                | `ilumApiUrl` setting **OR** `ILUM_API_URL` env          |
| Auto-trigger DAG after push (Airflow 3.x JWT mint)   | `AIRFLOW_JWT_SECRET` env (Ilum bundled deploy only)     |

When the panel loads, it probes the configured targets in ≤1.5 s. Reachable
targets light up; unreachable ones get a dashed border + tooltip. If
**nothing** remote responds, a standalone-mode banner explains that Local
pyFile is the only available option until URLs are configured.
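
A reachability probe of this kind could look roughly like the sketch below. It assumes the 1.5 s budget is applied as a per-request timeout with all targets probed in parallel; the function names and the target dictionary are hypothetical:

```python
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def is_reachable(url: str, timeout: float = 1.5) -> bool:
    """True if the URL answers with any HTTP response within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True  # the server answered, even if with 401 or 404
    except (urllib.error.URLError, OSError, ValueError):
        return False  # refused, timed out, DNS failure, or malformed URL


def probe_targets(targets: dict[str, str], timeout: float = 1.5) -> dict[str, bool]:
    """Probe all targets concurrently so the total wait stays near one timeout."""
    if not targets:
        return {}
    with ThreadPoolExecutor(max_workers=len(targets)) as pool:
        results = pool.map(lambda url: is_reachable(url, timeout), targets.values())
        return dict(zip(targets.keys(), results))
```

If every value in the result is `False`, the caller would fall back to the standalone-mode banner.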

### Dev install (editable, from source)

```bash
git clone https://github.com/ilum-cloud/jupyterlab-pipelines
cd jupyterlab-pipelines
pip install -e ".[dev]"
jlpm install
jlpm build

jupyter labextension develop . --overwrite
jupyter server extension enable jupyterlab_pipelines

jupyter lab
```

### Ilum bundled deploy

The `ilum-jupyter` Helm sub-chart in the
[Ilum monorepo](https://github.com/ilum-cloud/ilum) ships JPE pre-installed
with all URLs + Airflow JWT integration pre-wired - see
`helm/helm_jupyter/values.yaml: airflowIntegration:` and the
`Pipelines_*.ipynb` showcase notebooks under
`/home/jovyan/work/pipelines/`.

### Try it locally (docker compose)

A self-contained stack under [`examples/docker-compose/`](examples/docker-compose/)
brings up everything needed to exercise the full notebook → Airflow → Spark
path on your machine - no Kubernetes, no Ilum:

```bash
cd examples/docker-compose
docker compose up -d          # ~90s on first run (builds the Jupyter image)
```

| Service                       | URL                                       | Login                     |
| ----------------------------- | ----------------------------------------- | ------------------------- |
| JupyterLab (+ this extension) | http://localhost:8888/lab?token=pipelines | token: `pipelines`        |
| Airflow                       | http://localhost:8080                     | `admin` / `admin`         |
| Gitea                         | http://localhost:3000                     | `ilum` / `ilum-pipelines` |
| Livy (REST)                   | http://localhost:8998                     | -                         |

The extension opens already pointed at the stack (Git remote, Airflow API,
Livy connection all pre-seeded). Open a notebook with a Spark cell, pick
**Airflow** in the panel, and hit export - the generated DAG is pushed to
Gitea, picked up by git-sync, loaded by Airflow, and (when you tick
"Trigger DAG run after push") executes a real pyspark job on the Spark
embedded in Livy.

What each piece does:

- **jupyter** - `ilum/sparkmagic` base + the freshly built JPE wheel
  (`Dockerfile.dev-overlay`); the entrypoint seeds the extension's settings
  from the `GIT_*` / `AIRFLOW_*` env in the compose file.
- **gitea** + **gitea-init** - the git remote DAGs are pushed to; the init
  one-shot creates the `ilum` user, the `ilum/airflow` repo, and an API token.
- **airflow** (api-server / scheduler / dag-processor) - FAB auth so the
  extension's JWT-mint trigger works; DAGs come exclusively from git-sync.
- **airflow-gitsync** - mirrors the Ilum helm `dags.gitSync` sidecar: clones
  the Gitea repo every 15 s into the DAGs folder.
- **livy** - `openeuler/livy:0.9.0` with embedded Spark 3.4 in local mode;
  the generated DAGs talk to it through the pre-seeded `ilum-livy-proxy`
  Airflow connection.

```bash
docker compose down -v        # stop + wipe all state
```

> This is a **demo** stack: single-node Spark in local mode, default
> credentials, SQLite-backed Gitea. Do not expose it to a network or use it
> for anything but local evaluation. For production, deploy via the Ilum Helm
> chart (Spark on Kubernetes, real auth, HA Airflow).

---

## Operating modes

JPE runs in one of three distinct deployment shapes:

### Standalone mode (`pip install` in a vanilla JupyterLab)

Only the **Local pyFile** target is enabled until you configure URLs. Use this
for individual developer machines, ephemeral Binder/Vertex AI notebooks, or
any pod where you don't have an Ilum/Airflow backend in-cluster.

- Configure: `Settings → Pipelines → ilumApiUrl / airflowApiUrl / gitApiUrl`
- Or set env vars before starting Jupyter: `ILUM_API_URL`,
  `AIRFLOW_API_URL`, `GITEA_API_URL`, `GITEA_TOKEN`
- Restart Jupyter (or reopen the panel) - the probe picks up new URLs

### Connected mode (URLs configured, some backends present)

Each target's mode button reflects backend reachability. If Airflow is
reachable but Ilum is not, the Ilum card gets a dashed border and a tooltip
explaining what's missing. Local pyFile is always available regardless.

### Ilum bundled mode (full stack)

When deployed via the Ilum Helm chart, every backend is auto-discovered:
`ilum-core:9888`, `ilum-airflow-api-server:8080`, `ilum-gitea-http:3000`.
Auto-trigger uses `AIRFLOW_JWT_SECRET` from the shared `ilum-airflow-jwt-secret`
Kubernetes Secret to mint REST tokens locally (FAB+OAuth `/auth/token` is
broken in Airflow 3.x). Showcase notebooks ship pre-mounted on the PVC.
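
To illustrate what minting a token locally involves, here is a stdlib-only sketch of an HS512-signed JWT. The claim set (`sub`, `iat`, `exp`) and the function name are assumptions; the claims Airflow actually expects are not listed in this README. Since the package depends on `pyjwt`, a call such as `jwt.encode(claims, secret, algorithm="HS512")` would be the more usual route.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    """Base64url without padding, as used by the JWT compact serialization."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def mint_hs512_token(secret: str, subject: str, ttl_s: int = 300) -> str:
    header = {"alg": "HS512", "typ": "JWT"}
    now = int(time.time())
    # ASSUMPTION: minimal claim set chosen for illustration only.
    claims = {"sub": subject, "iat": now, "exp": now + ttl_s}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(
        secret.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha512
    ).digest()
    return f"{signing_input}.{_b64url(signature)}"
```

The resulting string would be sent as `Authorization: Bearer <token>` on Airflow REST calls.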

---

## Use

### Mode 1 - Airflow DAG (`POST /export`)

Renders a DAG Python file and pushes it to the configured Gitea repository.
Airflow's git-sync daemon picks it up within ~40 seconds.

Two sub-modes:

| Sub-mode   | DAG shape                                                                   |
| ---------- | --------------------------------------------------------------------------- |
| `batch`    | 1 `IlumSubmitBatchOperator` task; whole notebook as one Spark batch job     |
| `per_cell` | N `IlumSubmitStatementOperator` tasks; each code cell is one Livy statement |

**Naming DAG tasks (`per_cell` only)**

By default, tasks are named `cell_0`, `cell_1`, … For readable names, add a
cell tag `task:<name>` in JupyterLab (Property Inspector → Cell Tags). The
generator picks it up and emits `task_id="<name>"`.

```text
Cell tags:  task:load_transactions     →  task_id="load_transactions"
            task:score_transactions    →  task_id="score_transactions"
            task:write_iceberg_alerts  →  task_id="write_iceberg_alerts"
```

Rules:

- `<name>` must match `[A-Za-z_][A-Za-z0-9_]*` (a Python identifier). Invalid
  tags are silently ignored and the generator falls back to `cell_<idx>`.
- Only the first `task:<name>` tag on a cell is used.
- Cells that get merged forward (syntactically incomplete fragments) inherit
  the first available `task:<name>` from the merge buffer.
- Tags live in the notebook - they survive git pushes and re-opens, so the
  task ids stay stable without any sidebar state.
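
The naming rules above can be sketched as a small resolver. The function name is hypothetical, and the sketch assumes that an invalid `task:` tag is skipped so that the first valid one wins:

```python
import re

_TASK_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def task_id_for(tags: list[str], idx: int) -> str:
    """Resolve the Airflow task_id for the cell at position idx."""
    for tag in tags:
        if tag.startswith("task:"):
            name = tag[len("task:"):]
            # ASSUMPTION: invalid names are skipped rather than ending the search.
            if _TASK_NAME.fullmatch(name):
                return name
    return f"cell_{idx}"  # default when no valid task tag is present
```

For example, `task_id_for(["task:load_transactions"], 0)` yields `load_transactions`, while an untagged third cell falls back to `cell_2`.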

The JPE sidebar surfaces a hint pointing at this mechanism whenever
`per_cell` mode is selected.

**Per-task retries (`retries:<n>` cell tag)**

Generated DAGs default to `retries=3` (with exponential backoff) at the DAG
level. To override the retry count for a single task, add a cell tag
`retries:<n>` where `<n>` is a non-negative integer.

```text
Cell tags:  task:load_raw_sales, retries:5  →  task_id="load_raw_sales", retries=5
            task:transform, retries:1        →  task_id="transform",      retries=1
            task:declare_params, retries:0   →  retries=0  (fail fast - deterministic)
```

Rules:

- `<n>` must be a non-negative integer (`^\d+$`). Invalid values
  (`retries:-1`, `retries:abc`, `retries:2.5`) are silently ignored - the
  task keeps the DAG-level default. Surrounding whitespace is tolerated.
- In `per_cell` mode the tag emits `retries=<n>` on that cell's operator.
- In `batch` mode (one `run_notebook` task) the first valid `retries:<n>`
  among the exported cells applies to it.
- On merged cells the first valid `retries:<n>` in the merge buffer wins.
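
A sketch of the retry-tag parsing, under the assumption that "surrounding whitespace" refers to whitespace around the key and the value; the function name is hypothetical:

```python
import re
from typing import Optional

_RETRIES = re.compile(r"\d+", re.ASCII)  # ASCII digits only


def retries_for(tags: list[str]) -> Optional[int]:
    """Return the first valid retries:<n> value, or None to keep the DAG default."""
    for tag in tags:
        key, sep, value = tag.partition(":")
        if sep and key.strip() == "retries":
            value = value.strip()
            if _RETRIES.fullmatch(value):
                return int(value)
            # invalid values such as "-1", "abc", or "2.5" are silently ignored
    return None
```

A `None` result means no override is emitted, so the task inherits the DAG-level `retries=3`.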

**Example payload**

```json
{
  "notebook_path": "notebooks/etl_pipeline.ipynb",
  "mode": "batch",
  "dag_id": "etl_pipeline_daily",
  "schedule": "@daily",
  "start_date": "2026-01-01",
  "livy_conn_id": "ilum-livy-proxy",
  "cluster_id": "default",
  "spark_image": "ilum/spark:4.1.1-delta",
  "auto_push_to_airflow": true,
  "auto_trigger_dag": true
}
```

**Response excerpt**

```json
{
  "status": "ok",
  "mode": "batch",
  "generated_path": "etl_pipeline_daily.py",
  "cells": 12,
  "gitea": { "status": "ok", "commit_sha": "a1b2c3d4", "created": true },
  "auto_trigger": { "status": "queued", "dag_id": "etl_pipeline_daily" },
  "airflow_visibility_eta_s": 40
}
```
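
For scripted use, the payload could be sent with a small client along these lines. This is only a sketch: the README gives the route as `POST /export` but not the prefix it is mounted under, so `EXPORT_ROUTE` is a placeholder that must be adjusted, and both helper names are hypothetical.

```python
import json
import urllib.request

# ASSUMPTION: placeholder route; the real server prefix is not documented here.
EXPORT_ROUTE = "/jupyterlab-pipelines/export"


def build_export_request(base_url: str, token: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request for an export."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + EXPORT_ROUTE,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"token {token}",  # Jupyter Server token auth
        },
    )


def summarize(response: dict) -> str:
    """Condense the response fields shown in the excerpt into one line."""
    gitea = response.get("gitea", {})
    trigger = response.get("auto_trigger", {})
    return (
        f"{response.get('generated_path')}: gitea={gitea.get('status')} "
        f"trigger={trigger.get('status')} eta={response.get('airflow_visibility_eta_s')}s"
    )
```

Passing the built request to `urllib.request.urlopen` and decoding the JSON body would then give a dictionary suitable for `summarize`.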

---

### Mode 2 - Ilum single-shot job (`POST /run-ilum-job`, `ilum_mode: "single"`)

Wraps the notebook as a standalone pyFile and submits it via
`POST /api/v1/job/submit`. The job appears under
`/workloads/details/job/<id>` in the Ilum UI and is cleaned up automatically
when done.

```json
{
  "notebook_path": "reports/daily_kpi.ipynb",
  "ilum_mode": "single",
  "service_name": "daily-kpi-2026-05",
  "cluster_id": "default",
  "spark_image": "ilum/spark:4.1.1-sedona",
  "driver_memory": "2g",
  "executor_memory": "4g",
  "num_executors": 2
}
```

---

### Mode 3 - Ilum service / cron (`POST /run-ilum-job`, `ilum_mode: "service"` or `"cron"`)

**Service** creates a long-running Ilum Group (`POST /api/v1/group`) and
immediately executes the notebook against it. The group persists and can be
re-invoked without re-uploading the pyFile. Params are passed to
`IlumJob.run(spark, config)`.

```json
{
  "notebook_path": "pipelines/streaming.ipynb",
  "ilum_mode": "service",
  "service_name": "streaming-svc",
  "params": [
    { "name": "kafka_topic", "value": "orders" },
    { "name": "output_table", "value": "gold.orders_agg" }
  ],
  "scale": 2,
  "auto_pause": true
}
```

**Cron** registers the notebook as an Ilum schedule (`POST /api/v1/schedule`).
Each fire is a single-shot Spark job triggered by ilum-core's internal
scheduler - no Kubernetes `CronJob` is created.

```json
{
  "notebook_path": "reports/monthly_rollup.ipynb",
  "ilum_mode": "cron",
  "schedule_name": "monthly-rollup",
  "cron_expression": "0 2 1 * *",
  "args": "--env=prod --month=2026-05"
}
```
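
Taken together, the three `ilum_mode` values map onto the Ilum REST endpoints named in Modes 2 and 3. A minimal dispatch sketch, with a hypothetical function name:

```python
# Endpoint paths as given in the mode descriptions of this README.
ILUM_ENDPOINTS = {
    "single": "/api/v1/job/submit",
    "service": "/api/v1/group",
    "cron": "/api/v1/schedule",
}


def ilum_endpoint(payload: dict) -> str:
    """Pick the Ilum REST endpoint for a /run-ilum-job payload."""
    mode = payload.get("ilum_mode")
    try:
        return ILUM_ENDPOINTS[mode]
    except KeyError:
        raise ValueError(f"unsupported ilum_mode: {mode!r}") from None
```

Rejecting unknown modes up front keeps a typo in `ilum_mode` from silently falling through to a default target.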

---

## Architecture

```text
Notebook (.ipynb)
       |
       v
 notebook_parser          Reads cells, detects magics (%%spark, %%sql, %%pyspark),
 (notebook_parser.py)     strips Jupyter-only constructs, returns List[CodeCell]
       |
       +--------+----------------------+
       |                               |
       v                               v
 Airflow DAG path               Ilum native path
 (airflow_generator.py)         (cron_packager.py)
       |                               |
  Jinja2 templates             wrap_notebook_as_pyfile()
  airflow_batch.jinja2         wrap_notebook_as_service_class()
  airflow_per_cell.jinja2            |
       |                        IlumSingleJobOptions
       |                        IlumServiceOptions
       |                        IlumScheduleOptions
       |                             |
       v                             v
  DAG Python file            ilum_native_client.py
  (<dag_id>.py)              POST /api/v1/job/submit
       |                     POST /api/v1/group
       v                     POST /api/v1/schedule
  gitea_pusher.py
  PUT/POST Gitea API
  (with retry + backoff)
       |
       v
  Gitea repo → git-sync → Airflow /dags/repo/
       |
       v (daemon thread, ~40 s later)
  airflow_client.py
  GET /api/v2/dags/{id}       (wait for visibility)
  PATCH is_paused=false
  POST /api/v2/dags/{id}/dagRuns
```
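
The last stage of the diagram (wait for visibility, unpause, trigger) can be sketched as a polling loop. The HTTP transport is injected as a callable so the sketch stays library-agnostic; the function name, the timeout values, and the `logical_date` request body are assumptions:

```python
import time
from typing import Callable, Optional

# http(method, path, body) -> (status_code, parsed_json)
Http = Callable[[str, str, Optional[dict]], tuple]


def trigger_when_visible(
    http: Http,
    dag_id: str,
    timeout_s: float = 120.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the DAG is visible, then unpause it and start a run."""
    deadline = time.monotonic() + timeout_s
    path = f"/api/v2/dags/{dag_id}"
    while True:
        status, _ = http("GET", path, None)
        if status == 200:
            break  # git-sync has delivered the file and Airflow parsed it
        if time.monotonic() >= deadline:
            return False  # DAG never became visible
        sleep(poll_s)
    http("PATCH", path, {"is_paused": False})
    # ASSUMPTION: minimal run body; adjust to the Airflow version in use.
    status, _ = http("POST", f"{path}/dagRuns", {"logical_date": None})
    return 200 <= status < 300
```

Running this in a daemon thread, as the diagram indicates, keeps the export response from blocking on the git-sync delay.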

**Module map**

| Module                  | Role                                                       |
| ----------------------- | ---------------------------------------------------------- |
| `handlers.py`           | Tornado request handlers - parse, validate, orchestrate    |
| `notebook_parser.py`    | `.ipynb` → `List[CodeCell]`; sanitizes Jupyter artifacts   |
| `airflow_generator.py`  | `List[CodeCell]` → DAG source via Jinja2                   |
| `cron_packager.py`      | `code` → standalone pyFile bytes (single/cron/service)     |
| `gitea_pusher.py`       | Gitea Contents API; retry + backoff; Basic + token auth    |
| `airflow_client.py`     | Airflow 3 REST client; JWT mint; background trigger thread |
| `ilum_native_client.py` | Ilum-core REST client; multipart upload; no `requests` dep |
| `audit.py`              | JSONL audit log (`~/.jupyter/jpe-audit.jsonl`)             |
| `logging_config.py`     | structlog setup (JSON or console renderer)                 |
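
The "retry + backoff" behaviour listed for `gitea_pusher.py` follows a common pattern, sketched below. The attempt count, delays, and retried exception types are assumptions, not the module's actual values:

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    attempts: int = 4,
    base_delay_s: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on the given exceptions with exponentially growing delays."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay_s * (2 ** attempt))  # 0.5 s, 1 s, 2 s, ...
    raise AssertionError("unreachable")
```

Wrapping only the network call in `fn` keeps non-transient errors, such as a rejected token, from being retried needlessly.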

---

## Configuration / Environment variables

All settings can be overridden at three levels (the last level in the chain wins):
plugin settings (JupyterLab Settings editor) → env var → request payload field.
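
One way to read this chain is sketched below, under the assumption that a request payload field overrides an env var, which in turn overrides the plugin setting. The function name and the treatment of empty strings as "unset" are hypothetical:

```python
from typing import Optional


def resolve_setting(
    payload_value: Optional[str],
    env_value: Optional[str],
    plugin_value: Optional[str],
    default: Optional[str] = None,
) -> Optional[str]:
    """Return the first non-empty value, checking the most specific level first."""
    # ASSUMPTION: precedence is payload > env var > plugin setting.
    for candidate in (payload_value, env_value, plugin_value):
        if candidate not in (None, ""):
            return candidate
    return default
```

For instance, a `GITEA_BRANCH` env var would take effect only when the request payload leaves the branch unset.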

| Variable                    | Default                               | Scope   | Description                                                                                                                                                                            |
| --------------------------- | ------------------------------------- | ------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `AIRFLOW_JWT_SECRET`        | -                                     | pod env | JWT signing secret from `ilum-airflow-jwt-secret`. When set, the trigger thread mints its own HS512 token instead of calling `/auth/token` (required on FAB+OAuth Airflow 3 installs). |
| `AIRFLOW_API_URL`           | `http://ilum-airflow-api-server:8080` | pod env | Airflow REST API base URL for auto-trigger.                                                                                                                                            |
| `AIRFLOW_USER`              | `admin`                               | pod env | Airflow username for `/auth/token` login (fallback when no JWT secret).                                                                                                                |
| `AIRFLOW_PASSWORD`          | `admin`                               | pod env | Airflow password for `/auth/token` login.                                                                                                                                              |
| `AIRFLOW_API_TOKEN`         | -                                     | pod env | Pre-generated JWT bearer; bypasses login entirely.                                                                                                                                     |
| `GITEA_API_URL`             | `http://ilum-gitea-http:3000/api/v1`  | pod env | Gitea API base URL.                                                                                                                                                                    |
| `GITEA_OWNER`               | `ilum`                                | pod env | Gitea repository owner.                                                                                                                                                                |
| `GITEA_REPO`                | `airflow`                             | pod env | Gitea repository name.                                                                                                                                                                 |
| `GITEA_BRANCH`              | `master`                              | pod env | Branch to commit DAGs to.                                                                                                                                                              |
| `GITEA_TOKEN`               | -                                     | pod env | Gitea personal access token (preferred over Basic).                                                                                                                                    |
| `GITEA_USERNAME`            | -                                     | pod env | Gitea Basic-auth username (fallback when no token).                                                                                                                                    |
| `GITEA_PASSWORD`            | -                                     | pod env | Gitea Basic-auth password.                                                                                                                                                             |
| `GITEA_SUBDIR`              | -                                     | pod env | Subdirectory inside the repo (e.g. `dags/`).                                                                                                                                           |
| `JPE_LOG_FORMAT`            | `console`                             | pod env | `json` for structured JSON logs; `console` for coloured dev output.                                                                                                                    |
| `JPE_AUDIT_LOG_PATH`        | `~/.jupyter/jpe-audit.jsonl`          | pod env | Override audit log file path.                                                                                                                                                          |
| `JPE_RATE_LIMIT_PER_MINUTE` | `10`                                  | pod env | Max export requests per Jupyter user per minute before HTTP 429.                                                                                                                       |
| `JPE_RATE_LIMIT_PER_HOUR`   | `60`                                  | pod env | Max export requests per Jupyter user per hour.                                                                                                                                         |
| `ILUM_LIVY_HOST`            | `ilum-core`                           | pod env | Livy endpoint host used when auto-seeding the Airflow connection.                                                                                                                      |
| `ILUM_LIVY_PORT`            | `9888`                                | pod env | Livy endpoint port used when auto-seeding the Airflow connection.                                                                                                                      |
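
For readers unfamiliar with what "mints its own HS512 token" involves, here is a minimal standard-library sketch of signing a JWT with HMAC-SHA512. It is not the extension's implementation: the function name `mint_hs512_token` is hypothetical, and the claims Airflow actually expects (subject, audience, and so on) are not documented here, so the `sub` claim in the usage line is only a placeholder.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    """Base64url-encode without padding, as used for JWT segments."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def mint_hs512_token(
    secret: str,
    claims: dict,
    ttl_seconds: int = 300,
    now: Optional[int] = None,
) -> str:
    """Build a compact JWT signed with HMAC-SHA512 (alg = HS512)."""
    issued = int(time.time()) if now is None else now
    header = {"alg": "HS512", "typ": "JWT"}
    payload = {**claims, "iat": issued, "exp": issued + ttl_seconds}
    # The signing input is "<b64url(header)>.<b64url(payload)>".
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(
        secret.encode("utf-8"), signing_input.encode("ascii"), hashlib.sha512
    ).digest()
    return f"{signing_input}.{_b64url(signature)}"


# Placeholder claim; the real claim set is an assumption.
token = mint_hs512_token("example-secret", {"sub": "admin"}, ttl_seconds=60)
```

The resulting string would be sent as `Authorization: Bearer <token>`, the same way a pre-generated `AIRFLOW_API_TOKEN` is used.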

---

## Helm integration

When deploying via the Ilum Helm chart (`helm_jupyter`), secrets are mounted
automatically through the deployment template at
`helm/helm_jupyter/templates/jupyter-deploy.yaml`.

Key helm values:

```yaml
airflowIntegration:
  enabled: true
  jwtSecretName: ilum-airflow-jwt-secret # mounts AIRFLOW_JWT_SECRET into the pod

git:
  existingSecret: ilum-git-credentials # mounts GITEA_USERNAME + GITEA_PASSWORD
  apiUrl: http://ilum-gitea-http:3000/api/v1
  owner: ilum
  repo: airflow
  branch: master
```

No manual env var configuration is required on standard Ilum installs - the
chart wires everything.

---

## Settings (UI)

All settings are accessible in JupyterLab under
**Settings → Plugin Settings → Pipelines**.

| Setting key             | Type                      | Default                                 | Description                                                                                      |
| ----------------------- | ------------------------- | --------------------------------------- | ------------------------------------------------------------------------------------------------ |
| `defaultMode`           | `"batch"` \| `"per_cell"` | `"batch"`                               | Pre-selects the export mode in the dialog.                                                       |
| `defaultOutputDir`      | string                    | `"dags"`                                | Directory (relative to JL root) where DAG files are written locally.                             |
| `defaultSchedule`       | string                    | `"@daily"`                              | Default Airflow schedule expression in the export dialog.                                        |
| `defaultCronExpression` | string                    | `"0 */6 * * *"`                         | Pre-fills the cron expression in cron-mode (every 6 h).                                          |
| `defaultClusterId`      | string                    | `"default"`                             | Ilum cluster name pre-selected in the dialog.                                                    |
| `defaultLivyConnId`     | string                    | `"ilum-livy-proxy"`                     | Airflow connection ID used by the generated DAG operators.                                       |
| `sparkImages`           | array                     | 7 presets                               | List of `{label, image}` objects shown in the Spark image dropdown. First entry is pre-selected. |
| `defaultSparkImage`     | string                    | `""`                                    | Image value pre-selected. Empty = use cluster default.                                           |
| `ilumApiUrl`            | string                    | `"http://ilum-core:9888/api/v1"`        | Ilum-core native REST API base URL (service/cron/single mode).                                   |
| `ilumApiToken`          | string                    | `""`                                    | Optional Bearer token for Ilum API auth.                                                         |
| `ilumUiUrl`             | string                    | `"http://localhost:19777"`              | Used to build clickable Ilum UI links in the History tab.                                        |
| `giteaApiUrl`           | string                    | `"http://ilum-gitea-http:3000/api/v1"`  | Gitea API URL for DAG push.                                                                      |
| `giteaOwner`            | string                    | `"ilum"`                                | Gitea repo owner.                                                                                |
| `giteaRepo`             | string                    | `"airflow"`                             | Gitea repo name.                                                                                 |
| `giteaBranch`           | string                    | `"master"`                              | Gitea branch for DAG commits.                                                                    |
| `giteaSubdir`           | string                    | `""`                                    | Sub-directory path inside the repo (e.g. `"dags/"`).                                             |
| `giteaToken`            | string                    | `""`                                    | Gitea personal access token. Leave empty if env vars supply credentials.                         |
| `airflowApiUrl`         | string                    | `"http://ilum-airflow-api-server:8080"` | Airflow REST API base URL for auto-trigger.                                                      |
| `airflowApiToken`       | string                    | `""`                                    | Pre-generated Airflow JWT. Use when FAB+OAuth blocks `/auth/token`.                              |
| `autoPushToAirflow`     | boolean                   | `true`                                  | Automatically push generated DAG to Gitea after export.                                          |
| `enabledCellKinds`      | string[]                  | `["spark"]`                             | Cell magics/kinds included in the export. Cells with other kinds are dropped and reported.       |

Default Spark image presets:

| Label                         | Image                            |
| ----------------------------- | -------------------------------- |
| Cluster default               | _(cluster default)_              |
| Spark 4.1.1 + Delta           | `ilum/spark:4.1.1-delta`         |
| Spark 4.1.1 + Sedona          | `ilum/spark:4.1.1-sedona`        |
| Spark 4.1.1 + Iceberg         | `ilum/spark:4.1.1-iceberg`       |
| Spark 4.1.1 + Trino           | `ilum/spark:4.1.1-trino`         |
| Spark 3.5.7 + Nessie + Sedona | `ilum/spark:3.5.7-nessie-sedona` |
| Spark 3.4.2 (legacy)          | `ilum/spark:3.4.2`               |

---

## API

Partial OpenAPI 3.1.0 specification: [`docs/openapi.yaml`](docs/openapi.yaml) - covers the three primary export/run endpoints (`/health`, `/export`, `/run-ilum-job`), which are also listed in the table below. The remaining 8 endpoints are documented by their handlers in [`jupyterlab_pipelines/handlers.py`](jupyterlab_pipelines/handlers.py); extending the OpenAPI spec to cover them is a follow-up task.

| Method | Path                                 | Tag     | Description                                         |
| ------ | ------------------------------------ | ------- | --------------------------------------------------- |
| `GET`  | `/jupyterlab-pipelines/health`       | health  | Liveness probe                                      |
| `POST` | `/jupyterlab-pipelines/export`       | export  | Export notebook → Airflow DAG; push to Gitea        |
| `POST` | `/jupyterlab-pipelines/export`       | preview | Preview DAG without push (`preview: true`)          |
| `POST` | `/jupyterlab-pipelines/run-ilum-job` | run-job | Submit notebook to Ilum (single/service/cron)       |
| `POST` | `/jupyterlab-pipelines/run-ilum-job` | preview | Preview pyFile without submission (`preview: true`) |

All endpoints require Jupyter session authentication (token or cookie).
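
A minimal sketch of a preview call against `/jupyterlab-pipelines/export` using only the standard library. The `preview` flag comes from the table above; the `notebook_path` field, the helper name, and the base URL are assumptions for illustration, so check `docs/openapi.yaml` for the actual request schema.

```python
import json
import urllib.request


def build_export_preview_request(
    base_url: str, token: str, notebook_path: str
) -> urllib.request.Request:
    """Build (but do not send) a POST that asks for a DAG preview."""
    body = {
        "notebook_path": notebook_path,  # hypothetical field name
        "preview": True,                 # preview without pushing to Gitea
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/jupyterlab-pipelines/export",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            # Jupyter Server accepts its token in this header form.
            "Authorization": f"token {token}",
            "Content-Type": "application/json",
        },
    )


req = build_export_preview_request("http://localhost:8888/", "my-token", "demo.ipynb")
# To send it: urllib.request.urlopen(req)
```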

---

## Troubleshooting

### DAG does not appear in Airflow ~40 s after push

The expected path is: Gitea push → git-sync poll (default 10 s) →
dag-processor re-parse (default 30 s).

1. Check git-sync logs:
   ```bash
   kubectl logs -n ilum deploy/ilum-airflow-dag-processor -c git-sync | tail -30
   ```
2. Check dag-processor for parse errors:
   ```bash
   kubectl logs -n ilum deploy/ilum-airflow-dag-processor | grep "ERROR\|dag_id"
   ```
3. Confirm the file actually landed in Gitea: open
   `http://<gitea-url>/ilum/airflow/src/branch/master/<dag_id>.py`.

---

### Auto-trigger did not fire

The trigger runs in a daemon thread and only starts after the DAG appears in
Airflow. Check in order:

1. Verify `AIRFLOW_JWT_SECRET` is set in the Jupyter pod:
   ```bash
   kubectl exec -n ilum deploy/ilum-jupyter -- env | grep AIRFLOW_JWT_SECRET
   ```
2. Check the audit log for the trigger outcome:
   ```bash
   tail -f ~/.jupyter/jpe-audit.jsonl | python3 -m json.tool --json-lines
   ```
3. Check the Jupyter server log for warnings from `airflow_client`:
   ```bash
   kubectl logs -n ilum deploy/ilum-jupyter | grep "pipelines-trigger"
   ```
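
The behaviour described at the top of this section (a daemon thread that waits for the DAG to appear before triggering) can be pictured roughly as below. This is a sketch, not the extension's code: `dag_exists` and `trigger` are hypothetical callables standing in for the Airflow API calls, the timeout and interval values are made up, and the thread name is borrowed from the `grep` pattern above on the assumption that it matches.

```python
import threading
import time
from typing import Callable


def start_trigger_thread(
    dag_exists: Callable[[], bool],
    trigger: Callable[[], None],
    timeout_s: float = 120.0,
    interval_s: float = 5.0,
) -> threading.Thread:
    """Poll until the DAG is visible, trigger it once, or give up after timeout_s."""

    def _worker() -> None:
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            if dag_exists():
                trigger()
                return
            time.sleep(interval_s)
        # Timed out: the DAG never appeared, so no run is triggered.

    # daemon=True means the thread never blocks Jupyter Server shutdown.
    thread = threading.Thread(target=_worker, name="pipelines-trigger", daemon=True)
    thread.start()
    return thread
```

Under this model, a DAG that never shows up in Airflow (for example because of a parse error) means the trigger silently never fires, which is why the dag-processor logs are worth checking as well.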

---

### Gitea push returns 404 or "repo not found"

Private Gitea repos return HTTP 404 (not 401) to anonymous GET requests - this
hides the real cause (missing auth).

1. Confirm `GITEA_USERNAME` and `GITEA_PASSWORD` are set in the pod:
   ```bash
   kubectl exec -n ilum deploy/ilum-jupyter -- env | grep GITEA
   ```
2. Confirm the repo exists and the credentials have write access (in the JSON
   response, `permissions.push` should be `true`):
   ```bash
   curl -u "$GITEA_USERNAME:$GITEA_PASSWORD" \
     http://ilum-gitea-http:3000/api/v1/repos/ilum/airflow
   ```
3. If using a token instead of Basic auth, confirm `GITEA_TOKEN` is set and
   the token has `write:repository` scope.
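
To make the credential fallback concrete, the sketch below builds the `Authorization` header from the environment, preferring `GITEA_TOKEN` and falling back to Basic auth. The helper name `gitea_auth_headers` is hypothetical and the extension's own logic may differ; an empty result corresponds to the anonymous case that produces the misleading 404.

```python
import base64
from typing import Dict, Mapping


def gitea_auth_headers(env: Mapping[str, str]) -> Dict[str, str]:
    """Prefer a personal access token; fall back to Basic auth; else anonymous."""
    token = env.get("GITEA_TOKEN", "")
    if token:
        # Gitea accepts API tokens in this header form.
        return {"Authorization": f"token {token}"}
    user = env.get("GITEA_USERNAME", "")
    password = env.get("GITEA_PASSWORD", "")
    if user and password:
        raw = f"{user}:{password}".encode("utf-8")
        return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}
    # No credentials: requests go out anonymously and private repos answer 404.
    return {}
```

Calling it with `os.environ` inside the Jupyter pod shows which credential path would be taken.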

---

### "no module named 'ilum'" in service mode

Service mode uploads an `IlumJob` subclass. If the Spark image does not have
the `ilum-python-job` package installed, the import fails.

The extension ships a duck-type shim:

```python
try:
    from ilum.api import IlumJob as _IlumJob
except ImportError:
    class _IlumJob:   # duck-type shim
        pass
```

If you still see the error, verify that the package is installed in the Spark image:

```bash
kubectl exec -n ilum <spark-driver-pod> -- pip show ilum
```

Rebuild the image from `ilum/spark:4.1.1-sedona` (ships with `ilum-python-job`)
or pre-install via `spark.kubernetes.driver.initContainers`.

---

### Generated DAG has import errors at parse time

Symptom: Airflow dag-processor logs show `ImportError` on the generated DAG.

Cause: Airflow 3.x moved operators to `apache-airflow-providers-standard`.

The extension already generates the correct imports:

- `airflow.providers.standard.operators.python.PythonOperator`
- `airflow.task.trigger_rule.TriggerRule`

If you see the old paths, you are running an older version of the extension.
Update with:

```bash
pip install --upgrade jupyterlab-pipelines
```

Alternatively, ensure your Airflow version is `>=3.0.0` with
`apache-airflow-providers-standard` installed.
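
One way to check whether the Airflow environment can resolve the modules behind those import paths is a small probe like the one below, run with the same Python interpreter as the dag-processor. The helper names are hypothetical; the module list is derived from the two imports named above.

```python
import importlib.util
from typing import Iterable, List

# Modules that hold the two imports the generated DAG relies on.
REQUIRED_MODULES = (
    "airflow.providers.standard.operators.python",
    "airflow.task.trigger_rule",
)


def module_available(dotted_name: str) -> bool:
    """Return True if the module can be located, without importing the leaf module."""
    try:
        return importlib.util.find_spec(dotted_name) is not None
    except (ImportError, ValueError):
        # A missing parent package raises ModuleNotFoundError (an ImportError).
        return False


def missing_modules(names: Iterable[str] = REQUIRED_MODULES) -> List[str]:
    """List the modules that cannot be found in the current environment."""
    return [name for name in names if not module_available(name)]


if __name__ == "__main__":
    print("missing:", missing_modules())
```

Note that `find_spec` imports parent packages while resolving a dotted name, so running this does import `airflow` itself when it is installed.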

---

### HTTP 429 Too Many Requests

The extension enforces per-user rate limits (default 10/minute, 60/hour).

Options:

- Wait for the window to reset (1 minute or 1 hour).
- Raise the limits via pod env vars:
  ```bash
  JPE_RATE_LIMIT_PER_MINUTE=30
  JPE_RATE_LIMIT_PER_HOUR=200
  ```
- For bulk processing, add `auto_push_to_airflow: false` + `auto_trigger_dag: false`
  to reduce the backend work per request.
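
To see how the two limits interact, here is a sketch of a per-user sliding-window limiter with a minute and an hour window. The extension's actual algorithm is not documented here (it may use fixed windows, for instance), so treat the class name and the window logic as assumptions.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class SlidingWindowLimiter:
    """Allow a request only if both the per-minute and per-hour budgets have room."""

    def __init__(self, per_minute: int = 10, per_hour: int = 60) -> None:
        self.limits = ((60.0, per_minute), (3600.0, per_hour))
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, user: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[user]
        # Forget timestamps that fell out of the longest (hourly) window.
        while hits and now - hits[0] >= 3600.0:
            hits.popleft()
        for window, limit in self.limits:
            recent = sum(1 for stamp in hits if now - stamp < window)
            if recent >= limit:
                return False  # the caller would answer with HTTP 429
        hits.append(now)
        return True


limiter = SlidingWindowLimiter(per_minute=2, per_hour=3)
results = [limiter.allow("alice", now=t) for t in (0, 1, 2, 61, 200)]
```

In this run the third request is refused by the minute window and the fifth by the hour window, which shows why raising only `JPE_RATE_LIMIT_PER_MINUTE` may not be enough for bulk exports.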

---

### "secrets detected" - HTTP 400

The extension scans cells for AWS key patterns, JWT payloads, and common
password literals before committing to Gitea.

1. Review the flagged cell in the preview panel - the `rejected` list identifies
   the exact line.
2. Replace hard-coded credentials with environment variables or a secrets
   manager reference.
3. If the pattern is a false positive (e.g. a sample key in a comment), pass
   `allow_secrets: true` in the request - **not recommended for production**.
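
The kind of scan described above can be approximated with a few regular expressions. The patterns below are illustrative only and are not the extension's actual rules; they simply cover the three families mentioned (AWS access key IDs, JWT-shaped strings, and password literals), and the names `PATTERNS` and `scan_source` are hypothetical.

```python
import re
from typing import List, Tuple

# Illustrative patterns; the real rule set is an assumption.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"),
    "jwt": re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+"),
    "password_literal": re.compile(
        r"\b(?:password|passwd|pwd)\s*=\s*['\"][^'\"]{4,}['\"]", re.IGNORECASE
    ),
}


def scan_source(source: str) -> List[Tuple[int, str]]:
    """Return (line number, pattern name) for every line that looks like a secret."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings


cell = (
    "x = 1\n"
    'key = "AKIAIOSFODNN7EXAMPLE"\n'
    'password = "hunter22"\n'
    'token = os.environ["TOKEN"]\n'
)
findings = scan_source(cell)
```

Reading the value from `os.environ`, as on the last line of the sample cell, does not match any pattern, which is the fix suggested in step 2.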

---

### Cell magic not processed / cell dropped

If a cell kind is missing from the `enabledCellKinds` setting, it is dropped
and reported in the `rejected` list.

The default UI value is `["spark"]`. If you use `%%pyspark`, `%%sql`, or plain
Python cells, add those kinds:

1. Open Settings → Plugin Settings → Pipelines.
2. Edit `enabledCellKinds` to include the required kinds (e.g.
   `["spark", "pyspark", "sql", "plain"]`).
3. Or pass `enabled_cell_kinds` in the export payload to override per-request.
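
The filtering behaviour can be sketched as follows. The classification rule (first-line `%%magic` gives the kind, everything else is `plain`) is an assumption made for illustration, and the function names are hypothetical; the parser in the extension may classify cells differently.

```python
from typing import Iterable, List, Sequence, Tuple


def classify_cell(source: str) -> str:
    """Derive a cell kind from a leading %%magic; anything else counts as 'plain'."""
    stripped = source.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    if first_line.startswith("%%"):
        tokens = first_line[2:].split()
        if tokens:
            return tokens[0]
    return "plain"


def filter_cells(
    cells: Sequence[str], enabled_kinds: Iterable[str]
) -> Tuple[List[Tuple[int, str]], List[Tuple[int, str]]]:
    """Split cells into (kept, rejected) lists of (cell index, kind)."""
    enabled = set(enabled_kinds)
    kept: List[Tuple[int, str]] = []
    rejected: List[Tuple[int, str]] = []
    for index, source in enumerate(cells):
        kind = classify_cell(source)
        (kept if kind in enabled else rejected).append((index, kind))
    return kept, rejected


cells = ["%%spark\ndf.show()", "%%sql\nSELECT 1", "print('hi')"]
kept, rejected = filter_cells(cells, ["spark"])
```

With the default `["spark"]`, the SQL and plain Python cells end up in `rejected`; passing `["spark", "sql", "plain"]` keeps all three.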

---

## Testing

Tests live in `jupyterlab_pipelines/tests/`. The suite covers:

- **Parser** (`test_parser.py`, `test_parser_edge_cases.py`) - cell detection,
  magic classification, sanitization, per-cell sanitization.
- **Generator** (`test_generator.py`, `test_generator_compile.py`,
  `test_lint_generated.py`) - DAG rendering, batch / per-cell mode, `black`
  formatting, `ruff` linting of generated output.
- **pyFile packaging** (`test_cron_packager.py`, `test_pyfile_executes.py`) -
  `wrap_notebook_as_pyfile`, `wrap_notebook_as_service_class`, argparse
  boilerplate, module-name derivation.
- **Gitea pusher** (`test_gitea_pusher.py`) - HTTP mocking, retry logic,
  Basic-auth headers, SHA detection.
- **Airflow client** (`test_airflow_client.py`) - JWT minting, login flow,
  trigger thread, mock HTTP.
- **Handlers** (`test_handlers.py`) - full request/response cycle with mock app.
- **Ilum native client** (`test_ilum_native_client.py`) - multipart encoding,
  group create, execute, schedule create.
- **E2E round-trip** (`test_e2e_roundtrip.py`, `test_real_notebooks.py`) -
  parse real fixtures → generate → compile.

Install test dependencies and run:

```bash
pip install -e ".[dev]"
pytest --cov=jupyterlab_pipelines --cov-report=term-missing
```

Airflow-related tests that require a live Airflow instance are skipped
automatically when `apache-airflow` is not installed in the test environment.

---

## Contributing

### Dev environment

```bash
# Install Python package in editable mode with all dev extras
pip install -e ".[dev]"

# Install Node deps (requires Node 18+)
jlpm install

# Watch mode - frontend rebuilds on save
jlpm watch

# In a second terminal, start JupyterLab pointing at the source tree
jupyter labextension develop . --overwrite
jupyter lab
```

### Code style

- Python: `ruff` (line-length 100, target py39) + `black` as the formatter.
- TypeScript: ESLint (inherited JupyterLab config).

Run linting locally:

```bash
ruff check jupyterlab_pipelines/
black --check jupyterlab_pipelines/
```

### Pre-commit

A pre-commit hook that runs `ruff` + `black` is recommended but not required.
Install via:

```bash
pip install pre-commit
pre-commit install
```

### Branching

Work in feature branches off `main`. Keep commits atomic. PR descriptions
should reference the CHANGELOG entry they address.

---

## License

Apache License 2.0 - see [LICENSE](LICENSE).
