Metadata-Version: 2.4
Name: manualforge
Version: 0.1.3
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: kedro~=1.2.0
Requires-Dist: kedro-datasets~=9.2
Requires-Dist: polars>=1.0
Requires-Dist: polars-runtime-compat>=1.38.1
Requires-Dist: fastexcel~=0.19
Requires-Dist: Jinja2>=3.0
Requires-Dist: duckdb>=1.0
Provides-Extra: docs
Requires-Dist: docutils<0.21; extra == "docs"
Requires-Dist: sphinx<7.3,>=5.3; extra == "docs"
Requires-Dist: sphinx_rtd_theme==2.0.0; extra == "docs"
Requires-Dist: nbsphinx==0.8.1; extra == "docs"
Requires-Dist: sphinx-autodoc-typehints==1.20.2; extra == "docs"
Requires-Dist: sphinx_copybutton==0.5.2; extra == "docs"
Requires-Dist: ipykernel<7.0,>=5.3; extra == "docs"
Requires-Dist: Jinja2<3.2.0; extra == "docs"
Requires-Dist: myst-parser<2.1,>=1.0; extra == "docs"
Provides-Extra: dev
Requires-Dist: ipython>=8.10; extra == "dev"
Requires-Dist: jupyterlab>=3.0; extra == "dev"
Requires-Dist: notebook; extra == "dev"
Requires-Dist: pytest-cov<7,>=3; extra == "dev"
Requires-Dist: pytest-mock<2.0,>=1.7.1; extra == "dev"
Requires-Dist: pytest~=9.0; extra == "dev"
Requires-Dist: ruff~=0.15.0; extra == "dev"

# ManualForge

> **Configuration-driven framework for generating management manuals.**
> Define your data sources, fields, and templates in YAML and get a formatted report.

Built on [Kedro](https://kedro.org) pipelines with [Polars](https://pola.rs) for data processing and [Typst](https://typst.app) for document rendering.

## Philosophy

ManualForge separates **what** you want to produce from **how** it's produced.

- **What**: Defined in `conf/base/parameters_manualforge.yml`: your data sources, expected columns, standardization rules, sort orders, summary dimensions, and report templates.
- **How**: Implemented by the pipeline nodes, which are reusable data processing functions whose behavior is driven by your config.

To create a new manual for a different domain, you only need to edit the config file (and optionally provide new templates). No Python code changes are required.

## Features

| Capability | Description |
|---|---|
| **Multi-sheet Excel ingestion** | Auto-detects headers, filters out cover sheets, and merges sheets into structured DataFrames. |
| **Field standardization** | Mapping files, exact matching, and fuzzy matching (difflib or DuckDB). |
| **Config-driven summaries** | Define group-by dimensions, sort orders, ability categories, and output paths in YAML. |
| **Typst report generation** | Jinja2 templates → Typst source → PDF compilation. |
| **Pipeline hooks** | Shell command hooks at pipeline or node granularity for pre- and post-processing. |

## Quick Start

```bash
# 1. Install dependencies
pip install -r requirements.txt

# 2. Copy and customize the configuration
cp conf/examples/parameters_manualforge.yml.example conf/base/parameters_manualforge.yml
cp conf/examples/catalog.yml.example          conf/base/catalog.yml
cp conf/examples/hooks.yml.example            conf/base/hooks.yml
cp conf/examples/parameters.yml.example       conf/base/parameters.yml
cp conf/examples/credentials.yml.example      conf/local/credentials.yml

# 3. Edit the config files to point to your data sources
#    (conf/base/ is gitignored, so your real configs stay local)

# 4. Run the pipeline
kedro run

# Run specific node groups
kedro run --tags conversion        # Excel → Parquet only
kedro run --tags standardization   # Standardization only
kedro run --tags csv               # Summary tables only
```

## Project Structure

```
├── conf/
│   ├── base/                          # ★ Gitignored; copy from examples/
│   │   ├── parameters_manualforge.yml # Central project configuration
│   │   ├── catalog.yml                # Kedro data catalog
│   │   ├── hooks.yml                  # Pipeline hooks (shell commands)
│   │   └── parameters.yml             # Pipeline parameters
│   ├── examples/                      # ★ Tracked example templates
│   │   ├── parameters_manualforge.yml.example
│   │   ├── catalog.yml.example
│   │   ├── hooks.yml.example
│   │   ├── parameters.yml.example
│   │   └── credentials.yml.example
│   ├── local/                         # Local-only (gitignored)
│   │   └── credentials.yml
│   └── logging.yml
├── data/                              # Gitignored except .gitkeep
│   ├── 01_raw/                        # Raw Excel/CSV + mapping files
│   ├── 02_intermediate/               # Parquet, reconciliation reports
│   ├── 03_primary/                    # Standardized data
│   ├── 04_feature/                    # Summary tables (CSV + Markdown)
│   └── 08_reporting/                  # Typst sources & compiled PDFs
├── scripts/                           # Auxiliary scripts
│   ├── convert_csv_to_md.py           # CSV → Markdown conversion
│   ├── extract_rule_field_mapping.py  # Rule field extraction
│   ├── extract_rule_overview.py       # Rule overview extraction
│   └── render_with_forge.py           # Markdown → DOCX/PDF rendering
├── src/manualforge/                   # Framework source code
│   ├── config.py                      # Configuration helper utilities
│   ├── hooks.py                       # Kedro pipeline hooks
│   ├── io/                            # Custom Kedro datasets (PolarsExcelDataset)
│   ├── pipelines/                     # Pipeline definitions & node functions
│   └── settings.py                    # Kedro project settings
├── templates/                         # Jinja2 Typst templates
│   └── report.typ.j2
├── pyproject.toml                     # Project metadata & dependencies
└── requirements.txt
```

## Configuration Guide

The central configuration file is `conf/base/parameters_manualforge.yml`. Copy it from `conf/examples/` and customize it.

### 1. Data Sources

Define your Excel files, expected headers, and sheet filtering rules:

```yaml
datasources:
  primary_data:
    filepath: "data/01_raw/your_data.xlsx"
    sheet:
      exclude_names: ["封面", "封皮"]
      name_becomes_column: "sheet_name"
    header_detection:
      mode: keyword_match
      expected_headers:
        - "column_a"
        - "column_b"
    cleaning:
      drop_rows_where:
        column_a: ["column_a"]   # drop residual header rows
      fill_null: forward
      deduplicate: true
```
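
To make the `keyword_match` and `drop_rows_where` settings more concrete, here is a minimal sketch of how such rules could be applied. It is not ManualForge's actual implementation: the function names `find_header_row` and `drop_residual_headers` are hypothetical, and plain Python lists stand in for the Polars DataFrames the real pipeline uses.

```python
from typing import Optional, Sequence


def find_header_row(
    rows: Sequence[Sequence[object]],
    expected_headers: Sequence[str],
) -> Optional[int]:
    """Return the index of the first row that contains every expected header."""
    wanted = {header.strip() for header in expected_headers}
    for index, row in enumerate(rows):
        cells = {str(cell).strip() for cell in row if cell is not None}
        if wanted <= cells:
            return index
    return None


def drop_residual_headers(records, drop_rows_where):
    """Drop records whose value in a listed column is one of the listed values."""
    return [
        record
        for record in records
        if not any(
            record.get(column) in values
            for column, values in drop_rows_where.items()
        )
    ]


# A sheet with a title row, a blank row, and a header repeated mid-sheet.
sheet = [
    ["Quarterly export", None],
    [None, None],
    ["column_a", "column_b"],
    ["x", 1],
    ["column_a", "column_b"],
    ["y", 2],
]

header_index = find_header_row(sheet, ["column_a", "column_b"])
header = sheet[header_index]
records = [dict(zip(header, row)) for row in sheet[header_index + 1:]]
records = drop_residual_headers(records, {"column_a": ["column_a"]})
```

In this sketch the title and blank rows are skipped because they do not contain both expected headers, and the repeated header row is removed by the `drop_rows_where` rule.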

### 2. Field Standardization

Define which fields to standardize, their mapping files, and special corrections:

```yaml
standardization:
  fields:
    - name: "dept_name"
      mapping_file: "data/01_raw/dept_list"
      case_corrections:
        wrong_name: "correct_name"
      special_mappings:
        alias: "canonical_name"
      fuzzy:
        enabled: true
        threshold: 0.8
        method: difflib             # difflib | duckdb
```
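
As an illustration of how exact and fuzzy matching could be combined, the sketch below resolves a raw value against a canonical list using the standard library's `difflib`. It is an assumption about the general approach, not the framework's code: the `standardize` function and the hard-coded canonical list are hypothetical, and in a real run the canonical names would presumably come from the mapping file.

```python
import difflib


def standardize(value, canonical, special_mappings=None, threshold=0.8):
    """Resolve a raw value to a canonical name, or None if nothing matches.

    Order of precedence in this sketch: explicit special mapping,
    then exact match, then the closest fuzzy match above the threshold.
    """
    special_mappings = special_mappings or {}
    cleaned = value.strip()
    if cleaned in special_mappings:
        return special_mappings[cleaned]
    if cleaned in canonical:
        return cleaned
    matches = difflib.get_close_matches(cleaned, canonical, n=1, cutoff=threshold)
    return matches[0] if matches else None


# Hypothetical canonical names for illustration only.
canonical_names = ["Finance", "Human Resources", "Engineering"]
aliases = {"HR": "Human Resources"}

exact = standardize(" Finance ", canonical_names, aliases)
alias = standardize("HR", canonical_names, aliases)
fuzzy = standardize("Finanse", canonical_names, aliases)
unknown = standardize("Logistics", canonical_names, aliases)
```

Returning `None` for unmatched values, rather than guessing, leaves room for the caller to report them for manual review.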

### 3. Sort Orders

Define reusable sort order lists that summaries can reference:

```yaml
sort_orders:
  model_names:
    - "Model A"
    - "Model B"
  dep_names:
    - "HR"
    - "Finance"
```
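
One way to apply such a list is to rank each value by its position in the list and sort on that rank. The sketch below assumes this interpretation; `sort_by_order` is a hypothetical helper, and placing values missing from the list at the end is a design choice of the sketch, not documented behavior.

```python
def sort_by_order(records, column, order):
    """Sort records by the position of record[column] in a custom order list.

    Values missing from the list sort last and keep their relative order,
    because Python's sort is stable.
    """
    rank = {name: position for position, name in enumerate(order)}
    return sorted(records, key=lambda record: rank.get(record[column], len(order)))


dep_names = ["HR", "Finance"]
rows = [
    {"department": "Finance", "field_name": "budget"},
    {"department": "Legal", "field_name": "contract_id"},
    {"department": "HR", "field_name": "employee_id"},
]
ordered = sort_by_order(rows, "department", dep_names)
ordered_departments = [row["department"] for row in ordered]
```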

### 4. Summaries

Define which summary tables to generate:

```yaml
summaries:
  my_summary:
    description: "Fields grouped by model and department"
    group_by: ["model", "department"]
    struct_columns: ["module", "system", "field_name"]
    sort_by:
      department: dep_names
    output:
      csv: "data/04_feature/my_summary.csv"
```
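
The following sketch shows one plausible reading of `group_by` and `struct_columns`: rows are grouped by the `group_by` keys, and the `struct_columns` of each row are collected into a list per group. This reading is an assumption, the `summarize` function and the `items` key are hypothetical, and plain dictionaries are used in place of Polars.

```python
from collections import defaultdict


def summarize(records, group_by, struct_columns):
    """Group records by the group_by columns and collect struct_columns per group."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[column] for column in group_by)
        groups[key].append({column: record[column] for column in struct_columns})
    # Groups come out in first-seen order because dicts preserve insertion order.
    return [
        {**dict(zip(group_by, key)), "items": items}
        for key, items in groups.items()
    ]


rows = [
    {"model": "Model A", "department": "HR", "module": "m1", "system": "s1", "field_name": "f1"},
    {"model": "Model A", "department": "HR", "module": "m1", "system": "s1", "field_name": "f2"},
    {"model": "Model B", "department": "Finance", "module": "m2", "system": "s2", "field_name": "f3"},
]
summary = summarize(
    rows,
    group_by=["model", "department"],
    struct_columns=["module", "system", "field_name"],
)
```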

### 5. Reports

Define report templates and output:

```yaml
reports:
  my_report:
    description: "Rules cookbook"
    template_source: inline
    data_source: rules_data
    output_typ: "data/08_reporting/output.typ"
    typst_compile:
      enabled: true
```
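
When `typst_compile.enabled` is true, the generated `.typ` file needs to be turned into a PDF. A minimal sketch of that step, assuming it shells out to the Typst CLI's `typst compile <input> <output>` command, could look like this. The helper names are hypothetical, and defaulting the PDF path to the source path with a `.pdf` suffix is an assumption of the sketch.

```python
import shutil
import subprocess
from pathlib import Path


def typst_compile_command(typ_path, pdf_path=None):
    """Build the argument list for `typst compile <input> <output>`."""
    source = Path(typ_path)
    target = Path(pdf_path) if pdf_path else source.with_suffix(".pdf")
    return ["typst", "compile", source.as_posix(), target.as_posix()]


def compile_report(typ_path, pdf_path=None):
    """Compile a Typst source file to PDF, failing loudly if the CLI is missing."""
    if shutil.which("typst") is None:
        raise RuntimeError("typst CLI not found on PATH")
    subprocess.run(typst_compile_command(typ_path, pdf_path), check=True)


command = typst_compile_command("data/08_reporting/output.typ")
```

Building the command separately from running it keeps the path logic easy to check without Typst installed.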

## Data Layers

| Layer | Directory | Description |
|---|---|---|
| Raw | `data/01_raw/` | Source Excel/CSV files and mapping files |
| Intermediate | `data/02_intermediate/` | Parquet files and reconciliation reports |
| Primary | `data/03_primary/` | Standardized data |
| Feature | `data/04_feature/` | Summary tables (CSV + Markdown) |
| Reporting | `data/08_reporting/` | Typst sources and PDF output |

## Requirements

- Python >= 3.10
- [Typst](https://github.com/typst/typst) CLI (for PDF compilation)

## Development

```bash
pip install -e ".[dev]"
ruff check src/
pytest
```
