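Covers the full path from development to operations: quality checks, behavioral testing, and version awareness. Other tools stop at pre-install scanning; this toolchain covers the entire lifecycle.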
SKILL.md + evals.json → skill-quality → skill-test run → skill-version baseline

    # 🧩 agent-skill-infra v0.2.0: Skill full-lifecycle toolchain

    $ pip install agent-skill-infra
    Successfully installed agent-skill-infra-0.2.0

    $ skill-quality docs/examples/demo-skill/SKILL.md --output json
    {
      "skill_name": "code-review-checklist",
      "overall_score": 0.92,
      "dimensions": [
        {"name": "trigger_precision", "score": 0.90},
        {"name": "helloandy_8dim", "score": 0.94}
      ]
    }

    $ skill-test run docs/examples/demo-skill/evals.json --adapter mock
    Skill Test Report: code-review-checklist
    ┌──────────────────────┬──────┬───────┬──────────┬────────────────┐
    │ Case ID              │ Pass │ Score │ Time(ms) │ Reason         │
    ├──────────────────────┼──────┼───────┼──────────┼────────────────┤
    │ ✓ should-contain…    │ PASS │ 0.750 │ 0        │ 3/4 keywords   │
    │ ✓ should-detect…     │ PASS │ 0.800 │ 0        │ 4/5 keywords   │
    │ …                    │ …    │ …     │ …        │ …              │
    ├──────────────────────┼──────┼───────┼──────────┼────────────────┤
    │ Total: 5             │ Pass:│ Fail: │ Rate:    │ Time: 0ms      │
    │                      │ 2    │ 3     │ 40.0%    │                │
    └──────────────────────┴──────┴───────┴──────────┴────────────────┘

    $ skill-version diff . --old-ref HEAD~3 --new-ref HEAD --output json
    [Structured diff output with file paths, added/deleted line counts, and a change summary]

Automatically evaluates SKILL.md quality using the helloandy 8-dimension scoring system. Integration with agent-skill-linter and security scanning is supported.
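| Checker | Scoring dimension | Output |
|---|---|---|
| TriggerChecker | Trigger-word coverage + specificity | Score from 0.0 to 1.0 + list of findings |
| OutputChecker | Output format + examples + constraints | Format/example/section detection |
| ToleranceChecker | Error-handling signals | 6 signals such as try/catch/fallback |
| TokenChecker | Line-count efficiency | Line count + efficiency score |
| HelloAndyChecker | Combined 8-dimension score | 5 technical dimensions + 3 output dimensions |
| LinterAdapter | agent-skill-linter (optional) | 17 formatting rules |
| SecurityIntegration | Cisco Scanner (optional) | Security scan results |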

    $ skill-quality /path/to/SKILL.md --output json
    {
      "skill_name": "code-review-checklist",
      "overall_score": 0.92,
      "file_path": "/path/to/SKILL.md",
      "total_lines": 111,
      "token_estimate": 935,
      "dimensions": [
        {
          "name": "trigger_precision",
          "score": 0.90,
          "findings": ["Good keyword coverage with domain-specific terms"]
        },
        {
          "name": "helloandy_8dim",
          "score": 0.94,
          "findings": [
            "Good keyword coverage",
            "Output format defined",
            "Examples provided",
            "Error handling: 6 signals detected",
            "Edge case coverage: 4 signals"
          ]
        }
      ]
    }

    $ skill-quality /path/to/SKILL.md --lint
    Quality Report: code-review-checklist
    Overall Score: 92%

    trigger_precision: 90%
      - Good keyword coverage with domain-specific terms
    helloandy_8dim: 94%
      - [trigger] Good keyword coverage
      - [output] Output format defined, Examples provided
      - [error] Good error handling coverage (6 signals)
      - [edge] Good edge case coverage (4 signals)
    agent-skill-linter: 100%
      - No linter violations found.

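
The JSON report can be consumed by a small script, for example to gate a CI job on the scores. The following is a minimal sketch, not part of agent-skill-infra: it relies only on the `overall_score` and `dimensions` fields shown in the sample output, and the function names and the `min_score` threshold of 0.8 are hypothetical.

```python
import json


def failing_dimensions(report_json: str, min_score: float = 0.8) -> list[str]:
    """Return the names of dimensions that score below min_score.

    min_score is a hypothetical threshold chosen for this sketch.
    """
    report = json.loads(report_json)
    return [
        dim["name"]
        for dim in report.get("dimensions", [])
        if dim["score"] < min_score
    ]


def passes_gate(report_json: str, min_score: float = 0.8) -> bool:
    """True when the overall score and every dimension reach min_score."""
    report = json.loads(report_json)
    overall_ok = report["overall_score"] >= min_score
    return overall_ok and not failing_dimensions(report_json, min_score)


# Sample shaped like the report shown in this section (fields trimmed).
SAMPLE_REPORT = """
{
  "skill_name": "code-review-checklist",
  "overall_score": 0.92,
  "dimensions": [
    {"name": "trigger_precision", "score": 0.90},
    {"name": "helloandy_8dim", "score": 0.94}
  ]
}
"""

if __name__ == "__main__":
    print("gate passed:", passes_gate(SAMPLE_REPORT))
```

In a pipeline, the report text would come from the output of `skill-quality ... --output json`, and a failed gate would map to a non-zero exit code.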
Runs an evals.json test suite with 5 judger types and supports CI integration.
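| Judger | Purpose | Example |
|---|---|---|
| keyword | Keyword matching (any/all modes) | Does the output contain the expected keywords? |
| schema | JSON Schema validation | Does the output conform to the JSON Schema? |
| flow | Tool-call sequence validation | Does the agent call tools in the expected order? |
| snapshot | Snapshot comparison (regression detection) | Does the output match the baseline snapshot? |
| llm | LLM-as-Judge (semantic equivalence) | Are two outputs semantically equivalent? (requires an API key) |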
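
To make the `flow` row concrete, here is one way a tool-call order check could work. This is an illustrative sketch under stated assumptions, not the project's implementation: the function name `flow_matches`, the `strict` flag, and the tool names in the usage lines are all hypothetical. By default it treats the expected calls as an ordered subsequence, so extra calls in between are tolerated.

```python
from typing import Iterable, Sequence


def flow_matches(
    expected: Sequence[str],
    actual: Iterable[str],
    strict: bool = False,
) -> bool:
    """Check that the tools in `expected` were called in that order.

    strict=False: `expected` must be an ordered subsequence of `actual`.
    strict=True:  `actual` must equal `expected` exactly.
    """
    calls = list(actual)
    if strict:
        return calls == list(expected)
    # `step in remaining` consumes the iterator up to the first match,
    # so each later step can only match a call that comes afterwards.
    remaining = iter(calls)
    return all(step in remaining for step in expected)


if __name__ == "__main__":
    # Hypothetical tool names, used only for illustration.
    print(flow_matches(["read_file", "run_linter"],
                       ["read_file", "search_docs", "run_linter"]))
    print(flow_matches(["read_file", "run_linter"],
                       ["run_linter", "read_file"]))
```

The subsequence rule is a design choice: it catches out-of-order calls while staying tolerant of harmless extra steps, whereas strict mode suits cases where any deviation should fail.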
Supported evals.json format:

    $ skill-test run docs/examples/demo-skill/evals.json --adapter mock
    Running 5 test cases with 'mock' adapter...
    ┌──────────────────────────────┬──────┬───────┬──────────┬────────────────┐
    │ Case ID                      │ Pass │ Score │ Time(ms) │ Reason         │
    ├──────────────────────────────┼──────┼───────┼──────────┼────────────────┤
    │ ✓ should-contain-report…     │ PASS │ 0.750 │ 0        │ 3/4 keywords   │
    │ ✓ should-detect-security…    │ PASS │ 0.800 │ 0        │ 4/5 keywords   │
    │ ✗ should-not-trigger…        │ FAIL │ 0.000 │ 0        │ no keywords    │
    │ ✗ output-should-be…          │ FAIL │ 0.250 │ 0        │ 1/4 keywords   │
    │ ✗ should-handle-error…       │ FAIL │ 0.000 │ 0        │ no keywords    │
    ├──────────────────────────────┼──────┼───────┼──────────┼────────────────┤
    │ Total: 5                     │ Pass │ Fail  │ Rate     │ Time: 0ms      │
    │                              │ 2    │ 3     │ 40.0%    │                │
    └──────────────────────────────┴──────┴───────┴──────────┴────────────────┘

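
The Score and Reason columns in this report are consistent with a matched/total keyword ratio (3/4 keywords gives 0.750). The sketch below shows how such a keyword judger might be written. It is an assumption-laden illustration rather than the tool's actual code: `JudgeResult`, `keyword_judge`, the case-insensitive substring matching, and the exact pass rules for the `any` and `all` modes are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class JudgeResult:
    passed: bool
    score: float
    reason: str


def keyword_judge(output: str, keywords: Sequence[str], mode: str = "any") -> JudgeResult:
    """Score an output by the fraction of expected keywords it contains.

    Assumed semantics: mode="any" passes when at least one keyword matches,
    mode="all" passes only when every keyword matches.
    """
    if not keywords:
        raise ValueError("keywords must not be empty")
    if mode not in ("any", "all"):
        raise ValueError(f"unknown mode: {mode!r}")

    text = output.lower()
    matched = [kw for kw in keywords if kw.lower() in text]
    score = len(matched) / len(keywords)

    if mode == "all":
        passed = len(matched) == len(keywords)
    else:
        passed = len(matched) > 0

    reason = f"{len(matched)}/{len(keywords)} keywords" if matched else "no keywords"
    return JudgeResult(passed=passed, score=score, reason=reason)


if __name__ == "__main__":
    result = keyword_judge(
        "Report: possible SQL injection in login handler",
        ["report", "sql", "injection", "xss"],
    )
    print(result)
```

Under these assumed rules, a case that matches 1 of 4 keywords would pass in `any` mode and fail in `all` mode, which is one possible reading of the FAIL rows above.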
Tracks Skill changes, detects regressions, runs security analysis, and rolls back in one step.
| Subcommand | Function | Example |
|---|---|---|
| diff | Structured diff output | skill-version diff . --old-ref HEAD~3 |
| check | diff + security analysis | skill-version check . --security |
| rollback | One-step rollback | skill-version rollback . --target-ref HEAD~1 --yes |
| baseline store | Store a baseline snapshot | skill-version baseline store . case-1 output.txt |
| baseline detect | Detect regressions | skill-version baseline detect . case-1 output.txt |

    $ skill-version diff . --old-ref HEAD~3 --new-ref HEAD
    Version Diff: 0a1b2c3d... -> e4f5a6b7...
    4 file(s) changed:
      modified  src/skill_infra/test_runner/judgers/llm_judge.py  +85 -0   ++++++++++
      modified  src/skill_infra/version_aware/cli.py              +148 -0  ++++++++++
      modified  pyproject.toml                                    +15 -2   ++++++--
      added     README.md                                         +56 -0   ++++++++++

    $ skill-version check . --old-ref HEAD~3 --security
    Version Check: 0a1b2c3d... -> e4f5a6b7...
    Files changed: 4
      src/skill_infra/.../llm_judge.py (modified, +85/-0)
      src/skill_infra/.../cli.py (modified, +148/-0)
      pyproject.toml (modified, +15/-2)
      README.md (added, +56/-0)
    Security: clean
    Max severity: none

    $ skill-version rollback . --target-ref HEAD~1 --yes
    Rolled back to HEAD~1

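
For readers curious how per-file added/deleted counts like these can be derived, the sketch below parses the output of `git diff --numstat`, which prints one `added<TAB>deleted<TAB>path` line per file. Whether skill-version uses this mechanism is not stated here, so treat it as an assumption; `FileChange`, `parse_numstat`, and `diff_refs` are hypothetical names.

```python
import subprocess
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FileChange:
    path: str
    added: Optional[int]    # None for binary files, which git reports as "-"
    deleted: Optional[int]


def parse_numstat(text: str) -> List[FileChange]:
    """Parse `git diff --numstat` output into structured records."""
    changes = []
    for line in text.splitlines():
        if not line.strip():
            continue
        added, deleted, path = line.split("\t", 2)
        changes.append(
            FileChange(
                path=path,
                added=None if added == "-" else int(added),
                deleted=None if deleted == "-" else int(deleted),
            )
        )
    return changes


def diff_refs(repo: str, old_ref: str, new_ref: str) -> List[FileChange]:
    """Run git in `repo` and return the per-file changes between two refs."""
    completed = subprocess.run(
        ["git", "-C", repo, "diff", "--numstat", old_ref, new_ref],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_numstat(completed.stdout)


if __name__ == "__main__":
    sample = "15\t2\tpyproject.toml\n56\t0\tREADME.md\n"
    for change in parse_numstat(sample):
        print(change)
```

Calling `diff_refs(".", "HEAD~3", "HEAD")` inside a Git repository would return one record per changed file, which could then be serialized to JSON.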