Architecture · myKG · Extract pipeline
End-to-end: mixed input through preprocessing, two passes, orphan recovery, and export
Flow labels between stages: .MD CORPUS · SCHEMA · NODES·EDGES · UNIFIED · RECONNECTED · SCHEMA GAP (feedback into Pass 2)
INPUT
Mixed corpus directory
.md · .pdf · .docx · .pptx · .png · .html
PREP
Preprocess
mineru · uv venv
markdownify · html
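The preprocessing step can be pictured as a routing table keyed on file extension. The sketch below is illustrative only: the diagram pairs markdownify with html, but the assumption that every other non-Markdown format goes to mineru, and the names `ROUTES`, `route`, and `plan`, are hypothetical.

```python
from pathlib import Path

# Hypothetical routing table. Only "html -> markdownify" is stated in the
# diagram; sending the remaining formats to mineru is an assumption.
ROUTES = {
    ".md": "passthrough",
    ".html": "markdownify",
    ".pdf": "mineru",
    ".docx": "mineru",
    ".pptx": "mineru",
    ".png": "mineru",
}


def route(path: str) -> str:
    """Return the converter name for a file, or 'skip' if unsupported."""
    return ROUTES.get(Path(path).suffix.lower(), "skip")


def plan(paths: list[str]) -> dict[str, list[str]]:
    """Group input paths by converter so each tool runs once per group."""
    groups: dict[str, list[str]] = {}
    for p in sorted(paths):
        groups.setdefault(route(p), []).append(p)
    return groups
```

Grouping before converting keeps the expensive converter (here assumed to live in its own uv-managed environment) to a single invocation per run.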
LLM · PASS 1
01
Schema Induction
parallel batches → merge → harmonize → quality
→ optional human review gate (--review)
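As a rough sketch of the batch, merge, and harmonize flow, the code below fans documents out to parallel workers, unions the per-batch type sets, and collapses naming variants. The LLM call is replaced by a toy stand-in (`induce_batch` treats `#Word` tokens as proposed types), and the harmonization rule is deliberately naive; all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def batches(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def induce_batch(docs):
    """Stand-in for the LLM call: words prefixed with '#' count as types."""
    types = set()
    for doc in docs:
        types.update(w[1:] for w in doc.split() if w.startswith("#") and len(w) > 1)
    return types


def harmonize(types):
    """Naive harmonization: lowercase and drop one trailing 's'."""
    canon = set()
    for t in types:
        key = t.lower()
        if key.endswith("s") and len(key) > 1:
            key = key[:-1]
        canon.add(key)
    return sorted(canon)


def induce_schema(docs, batch_size=2):
    """Run batches in parallel, merge the partial schemas, then harmonize."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(induce_batch, batches(docs, batch_size)))
    merged = set().union(*partials)
    return harmonize(merged)
```

A review gate such as the `--review` flag would naturally sit after `harmonize`, pausing before Pass 2 consumes the schema.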
LLM · PASS 2
02
Instance Extraction
parallel across files · per-file shards
stateful chunks · validate against schema
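Validation against the induced schema might look like the following minimal sketch. The record shapes (`id`, `type`, `src`, `dst`, `rel`) and the schema layout are assumptions made for illustration, not the pipeline's actual data model.

```python
def validate(nodes, edges, schema):
    """Split extracted items into accepted nodes, accepted edges, rejected edges.

    schema["node_types"] is a set of type names; schema["edge_types"] is a
    set of (source_type, relation, target_type) triples. Both are assumed.
    """
    node_types = schema["node_types"]
    edge_types = schema["edge_types"]

    # Keep only nodes whose type the schema knows about.
    ok_nodes = {n["id"]: n for n in nodes if n["type"] in node_types}

    ok_edges, rejected = [], []
    for e in edges:
        src, dst = ok_nodes.get(e["src"]), ok_nodes.get(e["dst"])
        # An edge is valid only if both endpoints survived and the
        # (type, relation, type) triple is allowed by the schema.
        if src and dst and (src["type"], e["rel"], dst["type"]) in edge_types:
            ok_edges.append(e)
        else:
            rejected.append(e)
    return list(ok_nodes.values()), ok_edges, rejected
```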
ASSY
Assembly
normalize names · stable IDs · dedup · sidecar
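One plausible way to get name normalization, stable IDs, and deduplication is to derive each ID from a hash of the node type plus the normalized name, so the same entity mentioned in different files collapses to one record. This is a sketch under that assumption; the ID format, the `source` field, and the function names are hypothetical, and the sidecar output is not modeled here.

```python
import hashlib
import re
import unicodedata


def normalize(name):
    """Unicode-normalize, casefold, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", name).casefold()
    return re.sub(r"\s+", " ", text).strip()


def stable_id(node_type, name):
    """Derive a deterministic ID from type and normalized name."""
    key = f"{node_type}|{normalize(name)}".encode("utf-8")
    return f"{node_type.lower()}:{hashlib.sha1(key).hexdigest()[:12]}"


def dedup(nodes):
    """Merge nodes that share a stable ID, keeping every source file."""
    merged = {}
    for n in nodes:
        nid = stable_id(n["type"], n["name"])
        entry = merged.setdefault(
            nid, {"id": nid, "type": n["type"], "name": n["name"], "sources": []}
        )
        if n["source"] not in entry["sources"]:
            entry["sources"].append(n["source"])
    return list(merged.values())
```

Because the ID depends only on content, re-running extraction on an unchanged corpus yields the same IDs, which keeps downstream exports diffable.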
ORPHAN PASS
Orphan-Connection
Stage 1: co-occurrence score · Stage 2: LLM confirm
— escalates to schema-gap restart of Pass 2 →
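The two-stage orphan pass can be sketched as a cheap filter followed by an expensive check. Below, Stage 1 counts how often an orphan node is mentioned in the same chunk as an already-connected node, and Stage 2 passes the surviving pairs to a confirmation callback that stands in for the LLM. The scoring rule, the `min_score` threshold, and all names are assumptions.

```python
from collections import Counter


def cooccurrence_candidates(orphans, connected, chunks, min_score=2):
    """Stage 1: score (orphan, connected) pairs by shared chunk mentions.

    `orphans` and `connected` are sets of node IDs; `chunks` is a list of
    sets, each holding the node IDs mentioned in one chunk.
    """
    scores = Counter()
    for mentioned in chunks:
        for o in orphans & mentioned:
            for c in connected & mentioned:
                scores[(o, c)] += 1
    # Highest score first; ties broken by pair name for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(o, c, s) for (o, c), s in ranked if s >= min_score]


def reconnect(candidates, confirm):
    """Stage 2: keep only the pairs the confirm callback accepts."""
    return [(o, c) for o, c, _ in candidates if confirm(o, c)]
```

Filtering on co-occurrence first means the confirmation step only sees pairs with some textual evidence behind them.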
OUT
Export — five parallel output families
jsonl · ttl · networkx · html · obsidian vault
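Of the five output families, JSONL is the simplest to illustrate: one JSON object per line, nodes followed by edges. The record layout and the `kind` discriminator field below are assumptions, not the exporter's documented format.

```python
import io
import json


def write_jsonl(nodes, edges, out):
    """Write nodes then edges as one JSON object per line; return the count."""
    count = 0
    for kind, records in (("node", nodes), ("edge", edges)):
        for rec in records:
            line = json.dumps({"kind": kind, **rec}, ensure_ascii=False, sort_keys=True)
            out.write(line + "\n")
            count += 1
    return count


# Usage: any text file object works; StringIO keeps the example in memory.
buffer = io.StringIO()
written = write_jsonl(
    [{"id": "person:ada"}],
    [{"src": "person:ada", "dst": "person:ada", "rel": "same_as"}],
    buffer,
)
```

Line-oriented output lets large graphs be streamed and appended without loading the whole file.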