Architecture · myKG · Extract pipeline

End-to-end: mixed input through preprocessing, two passes, orphan recovery, and export

INPUT: Mixed corpus directory (.md · .pdf · .docx · .pptx · .png · .html)
PREP: Preprocess (mineru · uv venv; markdownify · html)
LLM · PASS 1: 01 Schema Induction (parallel batches → merge → harmonize → quality → optional human review gate (--review))
LLM · PASS 2: 02 Instance Extraction (parallel across files · per-file shards · stateful chunks · validate against schema)
ASSY: Assembly (normalize names · stable IDs · dedup · sidecar)
ORPHAN PASS: Orphan-Connection (Stage 1: co-occurrence score · Stage 2: LLM confirm; escalates to schema-gap restart of Pass 2)
OUT: Export, five parallel output families (jsonl · ttl · networkx · html · obsidian vault)

Edge labels, in diagram order: .md corpus · schema · nodes·edges · unified · reconnected. Feedback edge: schema gap.
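
The two-stage orphan pass can be illustrated with a minimal sketch. The diagram names only "co-occurrence score" and "LLM confirm", so everything else here is an assumption: the score is taken to be Jaccard overlap of the chunk IDs where two nodes are mentioned, the 0.3 threshold is arbitrary, and the function names (cooccurrence_score, propose_links, confirm_links) are hypothetical rather than myKG's API. The LLM call is stood in for by a plain callback, and the schema-gap escalation is not modeled.

```python
from itertools import product
from typing import Callable


def cooccurrence_score(chunks_a: set[str], chunks_b: set[str]) -> float:
    """Jaccard overlap of chunk IDs (assumed metric, not confirmed by the source)."""
    union = chunks_a | chunks_b
    return len(chunks_a & chunks_b) / len(union) if union else 0.0


def propose_links(
    mentions: dict[str, set[str]],
    orphans: list[str],
    connected: list[str],
    threshold: float = 0.3,  # arbitrary illustrative cutoff
) -> list[tuple[str, str, float]]:
    """Stage 1: score every (orphan, connected) pair and keep those above threshold."""
    candidates = []
    for orphan, target in product(orphans, connected):
        score = cooccurrence_score(
            mentions.get(orphan, set()), mentions.get(target, set())
        )
        if score >= threshold:
            candidates.append((orphan, target, score))
    # Highest-scoring candidates first, so Stage 2 sees the best pairs early.
    return sorted(candidates, key=lambda item: item[2], reverse=True)


def confirm_links(
    candidates: list[tuple[str, str, float]],
    confirm: Callable[[str, str], bool],
) -> list[tuple[str, str]]:
    """Stage 2: keep only pairs accepted by the confirm callback (an LLM in practice)."""
    return [(orphan, target) for orphan, target, _ in candidates if confirm(orphan, target)]


# Toy data: node name -> IDs of the chunks that mention it.
mentions = {
    "alice": {"c1", "c2"},
    "acme": {"c1", "c2", "c3"},
    "bob": {"c9"},
}
candidates = propose_links(mentions, orphans=["alice", "bob"], connected=["acme"])
links = confirm_links(candidates, confirm=lambda orphan, target: True)
print(links)
```

Keeping the cheap score in front of the confirm step means the expensive check runs only on pairs that already share context; in this toy data "bob" never co-occurs with "acme" and is filtered out before Stage 2.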