Metadata-Version: 2.4
Name: manualforge
Version: 0.3.0
Summary: A configuration-driven management manual generation framework based on Kedro pipelines with Polars and Typst.
Author: songwupei
License-Expression: MIT
Project-URL: Homepage, https://github.com/quartools/manualforge
Project-URL: Repository, https://github.com/quartools/manualforge
Project-URL: Issues, https://github.com/quartools/manualforge/issues
Keywords: kedro,polars,typst,report-generation,document-generation,configuration-driven,pipeline,manual
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Text Processing
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: kedro~=1.2.0
Requires-Dist: kedro-datasets~=9.2
Requires-Dist: polars>=1.0
Requires-Dist: polars-runtime-compat>=1.38.1
Requires-Dist: fastexcel~=0.19
Requires-Dist: Jinja2>=3.0
Requires-Dist: duckdb>=1.0
Provides-Extra: docs
Requires-Dist: docutils<0.21; extra == "docs"
Requires-Dist: sphinx<7.3,>=5.3; extra == "docs"
Requires-Dist: sphinx_rtd_theme==2.0.0; extra == "docs"
Requires-Dist: nbsphinx==0.8.1; extra == "docs"
Requires-Dist: sphinx-autodoc-typehints==1.20.2; extra == "docs"
Requires-Dist: sphinx_copybutton==0.5.2; extra == "docs"
Requires-Dist: ipykernel<7.0,>=5.3; extra == "docs"
Requires-Dist: Jinja2<3.2.0; extra == "docs"
Requires-Dist: myst-parser<2.1,>=1.0; extra == "docs"
Provides-Extra: dev
Requires-Dist: ipython>=8.10; extra == "dev"
Requires-Dist: jupyterlab>=3.0; extra == "dev"
Requires-Dist: notebook; extra == "dev"
Requires-Dist: pytest-cov<7,>=3; extra == "dev"
Requires-Dist: pytest-mock<2.0,>=1.7.1; extra == "dev"
Requires-Dist: pytest~=9.0; extra == "dev"
Requires-Dist: ruff~=0.15.0; extra == "dev"

# ManualForge

> **Configuration-driven management manual generation framework.**
> **配置驱动的管理手册生成框架。**
> Define your data sources, fields, and templates in YAML — get a formatted report.
> 在 YAML 中定义数据源、字段和模板，即可生成格式化报告。

[![PyPI version](https://img.shields.io/badge/pypi-0.3.0-blue)](https://pypi.org/project/manualforge/)
[![Python](https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12%20%7C%203.13-blue)](https://www.python.org/)

Built on [Kedro](https://kedro.org) pipelines with [Polars](https://pola.rs) for data processing and [Typst](https://typst.app) for document rendering.
基于 [Kedro](https://kedro.org) 流水线 + [Polars](https://pola.rs) 数据处理 + [Typst](https://typst.app) 文档渲染。

ManualForge is a **reusable Python package** (`pip install manualforge`). For a real-world downstream application, see **[modelmanual](https://codeberg.org/songwupei/modelmanual)** — a Chinese regulatory manual generator that extends ManualForge with rule-code field mapping and reconciliation reporting.
ManualForge 是一个**可复用的 Python 包**（`pip install manualforge`）。实际下游应用示例见 **[modelmanual](https://codeberg.org/songwupei/modelmanual)** —— 一个基于 ManualForge 的中文法规手册生成器，扩展了规则代码字段映射和核对报告功能。

## Philosophy / 设计理念

ManualForge separates **what** you want to produce from **how** it's produced.
ManualForge 将「要生成什么」与「如何生成」解耦。

- **What / 内容**: Defined in `conf/base/parameters_manualforge.yml` — your data sources, expected columns, standardization rules, sort orders, summary dimensions, and report templates. 在配置文件中定义数据源、期望列、标准化规则、排序、汇总维度和报告模板。
- **How / 方法**: Implemented by the pipeline nodes — reusable data processing functions that read from your config. 由流水线节点实现——可复用的数据处理函数，读取配置驱动行为。

To create a new manual for a different domain, you only need to edit the config file (and optionally provide new templates). No Python code changes required.
要为新领域创建手册，只需编辑配置文件（可选提供新模板），无需修改 Python 代码。
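
The what/how split can be illustrated with a minimal sketch. The `steps` schema, the `STEPS` registry, and both step functions below are hypothetical and are not ManualForge's actual configuration keys or nodes; the point is only that the behavior changes by editing the config dictionary, not the functions.

```python
# Hypothetical reusable steps (the "how"); names are illustrative only.
def drop_nulls(rows, column):
    """Keep only rows whose `column` value is not None."""
    return [row for row in rows if row.get(column) is not None]


def rename(rows, old, new):
    """Rename the key `old` to `new` in every row."""
    return [{(new if key == old else key): value for key, value in row.items()}
            for row in rows]


STEPS = {"drop_nulls": drop_nulls, "rename": rename}


def run_steps(rows, config):
    """Apply each configured step in order; the config decides what happens."""
    for step in config["steps"]:
        func = STEPS[step["use"]]
        rows = func(rows, **step.get("args", {}))
    return rows


# The "what": in a real project this would be loaded from a YAML file.
config = {
    "steps": [
        {"use": "drop_nulls", "args": {"column": "dept"}},
        {"use": "rename", "args": {"old": "dept", "new": "department"}},
    ]
}
rows = [{"dept": "HR", "n": 1}, {"dept": None, "n": 2}]
result = run_steps(rows, config)
```

Swapping the order of the two steps, or pointing them at another column, needs only a config change.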

## Features / 功能

| Capability / 能力 | Description / 说明 |
|---|---|
| **Multi-sheet Excel ingestion** / 多表 Excel 读取 | Auto-detect headers, filter cover sheets, merge into structured DataFrames. 自动检测表头，过滤封面页，合并为结构化 DataFrame。 |
| **Field standardization** / 字段标准化 | Mapping files + exact matching + fuzzy matching (difflib / duckdb). 映射文件 + 精确匹配 + 模糊匹配。 |
| **Config-driven summaries** / 配置驱动汇总 | Define group-by dimensions, sort orders, ability categories, and output paths in YAML. 在 YAML 中定义分组维度、排序、能力类别和输出路径。 |
| **Typst report generation** / Typst 报告生成 | Jinja2 templates → Typst source → PDF compilation. Jinja2 模板 → Typst 源码 → PDF 编译。 |
| **Pipeline hooks** / 流水线钩子 | Shell command hooks at pipeline/node granularity for pre/post processing. 流水线/节点粒度的 shell 命令钩子，用于前后处理。 |
| **Auto-backup** / 自动备份 | Pre-run config snapshot + post-run data backup via hooks. 跑前配置快照 + 跑后数据备份，通过 hooks 自动触发。 |
| **Config deploy** / 配置部署 | `cfg-backup` / `cfg-deploy` — backup, restore, and deploy configs from templates. 备份、恢复和从模板部署配置文件。 |

## Quick Start / 快速开始

```bash
# 1. Install dependencies / 安装依赖
pip install -r requirements.txt

# 2. Copy and customize configuration / 复制并自定义配置
#    Option A: interactive deployment / 交互式部署
./scripts/cfg-deploy --from-examples

#    Option B: manual copy / 手动复制
cp conf/examples/parameters_manualforge.yml.example conf/base/parameters_manualforge.yml
cp conf/examples/catalog.yml.example          conf/base/catalog.yml
cp conf/examples/hooks.yml.example            conf/base/hooks.yml
cp conf/examples/parameters.yml.example       conf/base/parameters.yml
cp conf/examples/credentials.yml.example      conf/local/credentials.yml

# 3. Edit the config files to point to your data sources
#    编辑配置文件，指向你的数据源
#    (conf/base/ is gitignored — your real configs stay local)
#    (conf/base/ 已 gitignore — 实际配置保存在本地)

# 4. Run the pipeline / 运行流水线
kedro run

# Run specific node groups / 运行特定节点组
kedro run --tags conversion        # Excel → Parquet only / 仅 Excel → Parquet
kedro run --tags standardization   # Standardization only / 仅标准化
kedro run --tags csv               # Summary tables only / 仅汇总表
```

## Backup & Config Management / 备份与配置管理

**Auto-backup via hooks (runs on every `kedro run`):**
**通过 hooks 自动备份（每次 `kedro run` 自动触发）：**

```text
kedro run
  ├─ [before_pipeline]  cfg-backup      ← snapshot conf/base/
  └─ [after_pipeline]   backup_data.sh  ← snapshot pipeline output data
```

**Manual backup/restore/deploy:**
**手动备份/恢复/部署：**

```bash
# Config backup / 配置备份
./scripts/cfg-backup              # backup conf/base/ → conf/.backups/
./scripts/cfg-backup -l           # list existing backups

# Config restore / deploy from examples / 配置恢复 / 从模板部署
./scripts/cfg-deploy                        # interactive menu | 交互菜单
./scripts/cfg-deploy -l                     # list config backups
./scripts/cfg-deploy -r 20260617_105645     # restore specific backup | 恢复指定备份
./scripts/cfg-deploy --from-examples         # deploy fresh templates | 从模板部署
./scripts/cfg-deploy --from-examples --dry-run  # preview | 预览

# Data backup / 数据备份
./scripts/backup_data.sh          # backup pipeline output → data/.backups/
./scripts/backup_data.sh -k 5     # keep only last 5 backups
```
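
The snapshot-and-prune behavior of the backup commands can be sketched in Python. This is not the implementation of `cfg-backup` or `backup_data.sh`, which are shell scripts; the function name `backup_dir` is hypothetical, and the `%Y%m%d_%H%M%S` stamp format is an assumption inferred from the backup ID `20260617_105645` shown above.

```python
import shutil
import tempfile
from datetime import datetime
from pathlib import Path


def backup_dir(source, backup_root, keep=None, now=None):
    """Copy `source` into a timestamped folder and optionally prune old snapshots."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    target = Path(backup_root) / stamp
    shutil.copytree(source, target)  # also creates backup_root if it is missing
    if keep is not None:
        # Timestamped names sort chronologically, so the oldest come first.
        snapshots = sorted(p for p in Path(backup_root).iterdir() if p.is_dir())
        for old in snapshots[:max(len(snapshots) - keep, 0)]:
            shutil.rmtree(old)
    return target


# Demonstration in a throwaway directory: three snapshots, keep the last two.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "conf_base"
    source.mkdir()
    (source / "catalog.yml").write_text("# placeholder\n")
    root = Path(tmp) / "backups"
    for day in (15, 16, 17):
        backup_dir(source, root, keep=2, now=datetime(2026, 6, day, 10, 56, 45))
    remaining = sorted(p.name for p in root.iterdir())
```

The `keep` argument plays the role of the `-k` option: after each snapshot, only the newest `keep` folders survive.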

## Project Structure / 项目结构

```text
├── conf/
│   ├── base/                          # ★ Gitignored — copy from examples/ | 从 examples/ 复制
│   │   ├── parameters_manualforge.yml # Central project configuration | 项目中心配置
│   │   ├── catalog.yml                # Kedro data catalog | 数据目录
│   │   ├── hooks.yml                  # Pipeline hooks (shell commands) | 流水线钩子
│   │   └── parameters.yml             # Pipeline parameters | 流水线参数
│   ├── examples/                      # ★ Tracked example templates | 版本追踪的示例模板
│   │   ├── parameters_manualforge.yml.example
│   │   ├── catalog.yml.example
│   │   ├── hooks.yml.example
│   │   ├── parameters.yml.example
│   │   └── credentials.yml.example
│   ├── local/                         # Local-only (gitignored) | 仅本地 (gitignored)
│   │   └── credentials.yml
│   └── logging.yml
├── data/                              # Gitignored except .gitkeep | 除 .gitkeep 外均 gitignored
│   ├── 01_raw/                        # Raw Excel/CSV + mapping files | 原始数据 + 映射文件
│   ├── 02_intermediate/              # Parquet, reconcile reports | 中间数据、核对报告
│   ├── 03_primary/                   # Standardized data | 标准化后数据
│   ├── 04_feature/                   # Summary tables (CSV + Markdown) | 汇总表
│   └── 08_reporting/                 # Typst sources & compiled PDFs | Typst 源码和 PDF
├── scripts/                          # Auxiliary scripts | 辅助脚本
│   ├── backup_data.sh                # ★ Backup pipeline output data | 备份管道输出数据
│   ├── cfg-backup                    # ★ Backup conf/base/ config | 备份配置文件
│   ├── cfg-deploy                    # ★ Deploy/restore configs | 部署/恢复配置
│   ├── main.sh                       # ★ Kedro runner with auto-backup | 带自动备份的启动脚本
│   ├── convert_csv_to_md.py          # CSV → Markdown conversion | 转换
│   ├── extract_rule_field_mapping.py # Rule field extraction | 规则字段提取
│   ├── extract_rule_overview.py      # Rule overview extraction | 规则概览提取
│   └── render_with_forge.py          # Markdown → DOCX/PDF rendering | 渲染
├── src/manualforge/                  # Framework source code | 框架源码
│   ├── config.py                     # Configuration helper utilities | 配置工具
│   ├── hooks.py                      # Kedro pipeline hooks (PipelineHooks base class) | 流水线钩子基类
│   ├── io/                           # Custom Kedro datasets (PolarsExcelDataset) | 自定义数据集
│   ├── pipelines/
│   │   └── data_processing_pl/       # Core pipeline: 12 reusable nodes | 核心流水线：12 个可复用节点
│   │       ├── nodes.py              #   Node functions | 节点函数
│   │       ├── pipeline.py           #   Pipeline definition | 流水线定义
│   │       ├── rulecsv2typ.py        #   CSV → Typst/Jinja conversion | CSV → Typst/Jinja 转换
│   │       └── standardize_fields.py #   Field standardization engine | 字段标准化引擎
│   ├── pipeline_registry.py          # Pipeline registration | 流水线注册
│   ├── settings.py                   # Kedro project settings | 项目设置
│   └── __main__.py                   # CLI entry point | CLI 入口
├── templates/                        # Jinja2 Typst templates | Jinja2 Typst 模板
│   └── report.typ.j2
├── pyproject.toml                    # Project metadata & dependencies | 项目元数据和依赖
└── requirements.txt
```

## Configuration Guide / 配置指南

The central configuration file is `conf/base/parameters_manualforge.yml`. Copy it from `conf/examples/parameters_manualforge.yml.example`, then customize it.
核心配置文件为 `conf/base/parameters_manualforge.yml`。从 `conf/examples/` 复制后进行自定义。

### 1. Data Sources / 数据源

Define your Excel files, expected headers, and sheet filtering rules.
定义 Excel 文件、期望表头和 Sheet 过滤规则：

```yaml
datasources:
  primary_data:
    filepath: "data/01_raw/your_data.xlsx"
    sheet:
      exclude_names: ["封面", "封皮"]
      name_becomes_column: "sheet_name"
    header_detection:
      mode: keyword_match
      expected_headers:
        - "column_a"
        - "column_b"
    cleaning:
      drop_rows_where:
        column_a: ["column_a"]   # drop residual header rows | 删除残留表头行
      fill_null: forward
      deduplicate: true
```
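
To make the options above concrete, here is a minimal sketch of keyword-based header detection followed by the three cleaning steps. It works on plain lists of rows rather than Polars DataFrames, and the function names are hypothetical; the actual pipeline nodes may behave differently.

```python
def find_header_row(rows, expected_headers):
    """Return the index of the first row containing every expected header."""
    wanted = set(expected_headers)
    for index, row in enumerate(rows):
        cells = {str(cell).strip() for cell in row if cell is not None}
        if wanted <= cells:
            return index
    return None


def clean_rows(rows, expected_headers, drop_rows_where):
    """Apply drop_rows_where, forward fill, and deduplication below the header."""
    start = find_header_row(rows, expected_headers)
    if start is None:
        raise ValueError("no row matches the expected headers")
    header = [str(cell).strip() for cell in rows[start]]
    cleaned, last_seen, seen = [], {}, set()
    for row in rows[start + 1:]:
        record = {col: (row[i] if i < len(row) else None)
                  for i, col in enumerate(header)}
        # drop_rows_where: skip rows whose column value is in the listed values
        if any(record.get(col) in values for col, values in drop_rows_where.items()):
            continue
        # fill_null: forward (reuse the last non-null value seen in each column)
        for col in header:
            if record[col] is None:
                record[col] = last_seen.get(col)
            else:
                last_seen[col] = record[col]
        # deduplicate: keep only the first occurrence of each full row
        key = tuple(record[col] for col in header)
        if key not in seen:
            seen.add(key)
            cleaned.append(record)
    return cleaned


rows = [
    ["Report title", None],        # preamble above the real header
    ["column_a", "column_b"],      # header row
    ["x", 1],
    [None, 2],                     # null to be forward-filled
    ["column_a", "column_b"],      # residual header row
    ["x", 1],                      # duplicate
]
cleaned = clean_rows(rows, ["column_a", "column_b"], {"column_a": ["column_a"]})
```

In this sample the header is found on the second row, the repeated header row is dropped, the missing `column_a` value is filled with `"x"`, and the duplicate row is removed.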

### 2. Field Standardization / 字段标准化

Define which fields to standardize, their mapping files, and special corrections.
定义需要标准化的字段、映射文件和特殊修正：

```yaml
standardization:
  fields:
    - name: "dept_name"
      mapping_file: "data/01_raw/dept_list"
      case_corrections:
        wrong_name: "correct_name"
      special_mappings:
        alias: "canonical_name"
      fuzzy:
        enabled: true
        threshold: 0.8
        method: difflib             # difflib | duckdb
```
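
A minimal sketch of how these options might combine for a single value, using the `difflib` method. The order of the steps (corrections, then special mappings, then exact match, then fuzzy match) and the choice to return `None` for unmatched values are assumptions for illustration; `standardize` is a hypothetical helper, not a ManualForge function.

```python
import difflib


def standardize(value, canonical, case_corrections=None,
                special_mappings=None, threshold=0.8):
    """Map a raw value to a canonical name, or return None if nothing matches."""
    case_corrections = case_corrections or {}
    special_mappings = special_mappings or {}
    value = value.strip()
    # 1. Fix known misspellings or casing errors.
    value = case_corrections.get(value, value)
    # 2. Resolve explicit aliases.
    if value in special_mappings:
        return special_mappings[value]
    # 3. Exact match against the canonical list.
    if value in canonical:
        return value
    # 4. Fuzzy match: best candidate with similarity ratio >= threshold.
    matches = difflib.get_close_matches(value, canonical, n=1, cutoff=threshold)
    return matches[0] if matches else None


canonical = ["Finance", "Human Resources", "Engineering"]
aliases = {"HR": "Human Resources"}
corrections = {"finance": "Finance"}

exact = standardize("Finance", canonical)
alias = standardize("HR", canonical, special_mappings=aliases)
corrected = standardize("finance", canonical, case_corrections=corrections)
fuzzy = standardize("Finanse", canonical)
unmatched = standardize("Marketing", canonical)
```

Raising `threshold` makes fuzzy matching stricter; at `1.0` it only accepts identical strings, which the exact-match step already handles.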

### 3. Sort Orders / 排序

Define named sort-order lists. Summaries reference them by name in `sort_by` (for example, `department: dep_names`).
定义汇总引用的可复用排序列表：

```yaml
sort_orders:
  model_names:
    - "Model A"
    - "Model B"
  dep_names:
    - "HR"
    - "Finance"
```

### 4. Summaries / 汇总

Define what summary tables to generate.
定义要生成的汇总表：

```yaml
summaries:
  my_summary:
    description: "Fields grouped by model and department"
    group_by: ["model", "department"]
    struct_columns: ["module", "system", "field_name"]
    sort_by:
      department: dep_names
    output:
      csv: "data/04_feature/my_summary.csv"
```
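
The interplay between `group_by`, `struct_columns`, `sort_by`, and the named sort orders can be sketched as follows. This uses plain dictionaries instead of Polars, and the semantics are assumptions: values missing from a sort-order list are placed last, and `summarize` is a hypothetical name.

```python
from collections import defaultdict


def summarize(rows, group_by, struct_columns, sort_by, sort_orders):
    """Group rows, collect struct columns per group, and sort by named orders."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in group_by)
        groups[key].append({col: row[col] for col in struct_columns})

    def sort_key(key):
        values = dict(zip(group_by, key))
        ranks = []
        for col, order_name in sort_by.items():
            order = sort_orders[order_name]
            value = values[col]
            # Unknown values sort after every listed value.
            ranks.append(order.index(value) if value in order else len(order))
        return ranks

    # sorted() is stable, so ties keep their first-seen order.
    return [dict(zip(group_by, key), items=groups[key])
            for key in sorted(groups, key=sort_key)]


sort_orders = {"dep_names": ["HR", "Finance"]}
rows = [
    {"model": "Model A", "department": "Finance", "field_name": "budget"},
    {"model": "Model A", "department": "HR", "field_name": "headcount"},
    {"model": "Model A", "department": "HR", "field_name": "salary"},
    {"model": "Model B", "department": "Legal", "field_name": "contracts"},
]
summary = summarize(
    rows,
    group_by=["model", "department"],
    struct_columns=["field_name"],
    sort_by={"department": "dep_names"},
    sort_orders=sort_orders,
)
```

Here the HR group comes first because `dep_names` lists it before Finance, and Legal comes last because it is not listed at all.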

### 5. Reports / 报告

Define report templates and output.
定义报告模板和输出：

```yaml
reports:
  my_report:
    description: "Rules cookbook"
    template_source: inline
    data_source: rules_data
    output_typ: "data/08_reporting/output.typ"
    typst_compile:
      enabled: true
```
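
When `typst_compile.enabled` is true, the generated `.typ` file has to be handed to the Typst CLI. The sketch below shows one way to do that with `typst compile <input> <output>`. The helper names are hypothetical, and deriving the PDF path by swapping the file suffix is an assumption, not documented ManualForge behavior.

```python
import shutil
import subprocess
from pathlib import Path


def typst_command(output_typ, output_pdf=None):
    """Build the Typst CLI call; by default the PDF sits next to the source."""
    source = Path(output_typ)
    target = Path(output_pdf) if output_pdf else source.with_suffix(".pdf")
    return ["typst", "compile", source.as_posix(), target.as_posix()]


def compile_report(report):
    """Compile one report entry if enabled; return the PDF path or None."""
    settings = report.get("typst_compile", {})
    if not settings.get("enabled", False):
        return None
    if shutil.which("typst") is None:
        raise RuntimeError("Typst CLI not found on PATH")
    command = typst_command(report["output_typ"])
    subprocess.run(command, check=True)  # raises CalledProcessError on failure
    return Path(command[-1])


# Mirrors the keys of the `my_report` entry above, with compilation switched off.
report = {
    "output_typ": "data/08_reporting/output.typ",
    "typst_compile": {"enabled": False},
}
command = typst_command(report["output_typ"])
skipped = compile_report(report)
```

Checking for the binary with `shutil.which` first gives a clearer error than a failed subprocess call when Typst is not installed.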

## Data Layers / 数据分层

| Layer / 层级 | Directory / 目录 | Description / 说明 |
|---|---|---|
| Raw / 原始 | `data/01_raw/` | Source Excel/CSV files, mapping files / 源文件与映射文件 |
| Intermediate / 中间 | `data/02_intermediate/` | Parquet, reconcile reports / Parquet 与核对报告 |
| Primary / 主数据 | `data/03_primary/` | Standardized data / 标准化后数据 |
| Feature / 特征 | `data/04_feature/` | Summary tables (CSV + Markdown) / 汇总表 |
| Reporting / 报告 | `data/08_reporting/` | Typst sources & PDF output / Typst 源码与 PDF |

## Requirements / 环境要求

- Python >= 3.10
- [Typst](https://github.com/typst/typst) CLI (for PDF compilation / 用于 PDF 编译)

## Recent Changes / 近期变更

### 2026-06-26

- **Rule code parsing** (`nodes.py`): Added `_parse_rule_codes` and `_RULE_CODE_RE` for GZW rule code extraction; rule-code column saved as `List(String)` without forward-fill; added "规则代码" to `EXPECTED_HEADER` in `convert_excel_to_parquet_fj1` and `process_attachment1_excel`.
- **Model name standardization** (`nodes.py`): Added `_load_model_mapping()` and `_normalize_model_name()` for fuzzy-matching model names against a reference table; 概览 sheet now uses its own 模型名称 column instead of the sheet name.
- **Department field** (`nodes.py`): Added 主研部门 to fj2 `EXPECTED_HEADER`.
- **Path resolution** (`nodes.py`, `standardize_fields.py`): Changed `Path(__file__).parents[4]` to `Path.cwd()` so paths resolve correctly when ManualForge is used as an installed package (e.g., from modelmanual).
- **Direct data access** (`rulecsv2typ.py`): `convert_rules_to_typst_jinja` now reads from the Kedro catalog as a Polars DataFrame directly instead of round-tripping through CSV on disk.
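
The path-resolution change can be illustrated with a short sketch. The two file locations below are hypothetical examples of a source checkout and an installed copy; `resolve_from_project_root` is an illustrative helper, not a ManualForge function.

```python
from pathlib import Path, PurePosixPath

# In a source checkout, parents[4] of nodes.py is the project root.
checkout = PurePosixPath(
    "/work/manualforge/src/manualforge/pipelines/data_processing_pl/nodes.py"
)
checkout_root = checkout.parents[4]

# Once installed, the same expression points inside the Python environment.
installed = PurePosixPath(
    "/venv/lib/python3.12/site-packages/manualforge/"
    "pipelines/data_processing_pl/nodes.py"
)
installed_root = installed.parents[4]


def resolve_from_project_root(relative_path, project_root=None):
    """Resolve a config-relative path against the working directory by default."""
    root = Path(project_root) if project_root is not None else Path.cwd()
    return root / relative_path


mapping_path = resolve_from_project_root("data/01_raw/dept_list")
```

Resolving against the working directory means a downstream project only has to run `kedro run` from its own root for relative paths in its config to work.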


## Development / 开发

```bash
pip install -e ".[dev]"
ruff check src/
pytest
```
