Metadata-Version: 2.4
Name: xgen-ontology
Version: 0.2.0
Summary: Backend-agnostic ontology / knowledge-graph toolkit — parse + chunk documents (or tables), build a clean KG, then search it with one-shot GraphRAG. Zero infra, any graph DB.
Project-URL: Homepage, https://github.com/jinsoo96/xgen-ontology
Project-URL: Repository, https://github.com/jinsoo96/xgen-ontology
Project-URL: Changelog, https://github.com/jinsoo96/xgen-ontology/blob/main/CHANGELOG.md
Author-email: jinsoo96 <wlstn010203@gmail.com>
License-Expression: MIT
License-File: LICENSE
Keywords: entity-resolution,graphrag,knowledge-graph,llm,ontology,owl,rag,rdf,sparql,taxonomy
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.10
Provides-Extra: dev
Requires-Dist: pytest>=7; extra == 'dev'
Requires-Dist: ruff>=0.1; extra == 'dev'
Provides-Extra: files
Requires-Dist: openpyxl>=3; extra == 'files'
Requires-Dist: pypdf>=3; extra == 'files'
Requires-Dist: python-docx>=1; extra == 'files'
Provides-Extra: korean
Requires-Dist: kiwipiepy>=0.15; extra == 'korean'
Provides-Extra: rdf
Requires-Dist: rdflib>=6.0; extra == 'rdf'
Provides-Extra: vector
Requires-Dist: qdrant-client>=1.7; extra == 'vector'
Description-Content-Type: text/markdown

# xgen-ontology

**Backend-agnostic ontology / knowledge-graph toolkit.** Turn documents or tables
into a *clean* knowledge graph — extract, resolve, dedup, induce the is-a hierarchy,
govern predicates, score quality — then **search it with one-shot GraphRAG**.
**Zero infra** (the whole thing runs on a pure-Python in-memory backend), **zero
lock-in** (load into *any* SPARQL 1.1 store), **zero hard deps** in the core.

```python
from xgen_ontology import build_from_csv

onto = build_from_csv({                       # no LLM, no DB, no API key
    "products": "product_id,name,color_id\n1,Widget,10\n2,Gadget,20",
    "colors":   "color_id,name\n10,Red\n20,Blue",
})
print(onto.stats())          # {'classes': 2, 'instances': 4, 'relations': 2, ...}
print(onto.to_turtle())      # standards RDF/Turtle
print(onto.search("what color is Widget").answer)
```

Build from prose with any LLM, mix tables and text freely — raw documents are
**parsed and chunked** for you:

```python
from xgen_ontology import build_from_files, build_from_text, CallableLLM

llm = CallableLLM(lambda p, system="": my_model(system, p))     # OpenAI / Anthropic / vLLM / …

# from files on disk (txt/md/html/csv built-in; pdf/docx/xlsx via the [files] extra)
onto = build_from_files(["policy.pdf", "products.csv"], llm=llm)

# or from a single raw string (auto boundary-aware chunking)
onto = build_from_text("Rule A applies to Acme Bank since 2020. ...", llm=llm)
```

## Two halves of the lifecycle

### Build — documents/tables → a clean graph

The pipeline is a sequence of independently-importable, backend-agnostic stages:

| Stage | What it does |
|------|--------------|
| **parse** | extract text from files — txt/md/html/csv built-in (zero-dep), pdf/docx/xlsx via `[files]` |
| **chunk** | boundary-aware chunking (paragraph→sentence→char) with overlap, stable chunk ids for provenance |
| **tabular** | table → ontology with **no LLM**: table→Class, FK→ObjectProperty (same-name / normalized-name / value-overlap detection), column→DataProperty, dimension rows→instances; large fact/junction tables stay schema-only |
| **extract** | one LLM call per chunk batch → schema *and* instances, tagged to source chunks; junk (base64/degenerate) filtered first |
| **resolve** | entity resolution: fold case/whitespace/unicode + similar surface forms, *guarding* dates/ids and number-conflicting names |
| **govern** | predicate governance: fold surface variants of a relation, anchor to the schema vocabulary |
| **dedup** | merge synonymous classes/properties/instances — rule keys, LLM synonym groups, and embedding cosine clusters |
| **hierarchy** | keep only genuine is-a edges ("being linked is not being a subclass"), break cycles, then SCS context profiles with property inheritance |
| **quality** | a graph-reviewer score: completeness · integrity · grounding · shape |
| **community** | Louvain modularity clustering (pure Python) |
| **emit** | Turtle (zero-dep) or OWL/RDF-XML (rdflib) |

### Search — one-shot GraphRAG

Not an iterative ReAct loop — fire several retrieval strategies at once and **fuse**:

1. vector / lexical passages (what it says)
2. graph label-linking → 1-hop relations (how entities connect)
3. **class enumeration** — the complete "list/count" a vector index can't give
4. HippoRAG: entities of the retrieved chunks → 1-hop expansion
5. evidence assembled with **MMR diversity** + **adaptive top-k** (the decisive
   minority — a warning/exception — survives instead of being crowded out)
6. one LLM synthesis; honest `evidence_nodes` = only the nodes the answer cites

```python
res = onto.search("which regulation applies to Acme Bank", llm=llm)
res.answer          # the synthesis
res.relations       # graph relations used
res.evidence_nodes  # nodes the answer actually cites (honest highlight)
```

## Any graph DB, or none

The algorithms only ever talk to small protocols (`GraphStore`, `VectorStore`,
`LLM`, `GraphSink`, `Morphology`, `Embedder`), never to a database:

```python
# zero infra — pure-Python in-memory (default)
onto.search("…")

# load into any SPARQL 1.1 store (Fuseki, GraphDB, Blazegraph, Virtuoso, …)
from xgen_ontology import fuseki
store = fuseki("http://localhost:3030", "ds", user="admin", password="…")
onto.push(store, graph="urn:my-graph")           # write
onto.search("…")                                  # or search a remote store via SparqlGraph
```

`SparqlGraph` is stdlib-only (urllib) and uses portable `FILTER(CONTAINS(...))`, so
it works on **any** SPARQL 1.1 endpoint — not just jena-text.

## Install

```bash
pip install xgen-ontology                 # core, zero deps
pip install "xgen-ontology[files]"        # + pypdf / python-docx / openpyxl (parse pdf/docx/xlsx)
pip install "xgen-ontology[rdf]"          # + rdflib (OWL / RDF-XML emit & parse)
pip install "xgen-ontology[korean]"       # + kiwipiepy (Korean morphological dedup)
pip install "xgen-ontology[vector]"       # + qdrant-client (embedding adapters)
```

Run the demos with no install:

```bash
python examples/build_csv.py
python examples/build_and_search.py
```

## Design — algorithms as a library

- **`dependencies = []`** — the core needs nothing but the standard library. The
  in-memory graph indexes labels with **BM25** (CJK character n-grams, so Korean/CJK
  search works with no morphological analyzer); the Turtle writer is hand-rolled.
- **English-neutral by default** — no hardcoded language. Korean morphology, name→URI
  translation and the extraction/synthesis prompts are all pluggable; the defaults
  assume nothing about your domain or language.
- **Bring your own everything** — LLM (`generate(prompt, system)`), embedder, morphology,
  graph store. The bundled `EchoLLM` lets the whole pipeline run with no API key.

```
src/xgen_ontology/
  models.py        # Class/Property/Concepts (T-Box), Instance/Relation/DataValue (A-Box), Node/Chunk
  protocols.py     # LLM / GraphStore / VectorStore / GraphSink / Morphology / Embedder
  text.py          # tokenizer + BM25 (CJK n-grams), IRI-safe slugging
  build/
    parse.py       # file -> text (txt/md/html/csv; pdf/docx/xlsx optional)
    chunk.py       # boundary-aware chunking
    tabular.py     # table -> ontology (no LLM)
    extract.py     # document -> ontology (LLM)
    resolve.py     # entity resolution
    govern.py      # predicate governance
    dedup.py       # rule + LLM + vector dedup
    hierarchy.py   # is-a cleaning + SCS inheritance
    quality.py     # graph-reviewer score
    community.py   # Louvain
    emit.py        # Turtle / OWL
    pipeline.py    # OntologyBuilder (wires the stages)
  backends/
    memory.py      # InMemoryGraph / InMemoryVector / InMemoryGraphSink (zero infra)
    sparql.py      # SparqlGraph — any SPARQL 1.1 store (read + write)
  search/          # fusion + one-shot GraphRAG
  ontology.py      # Ontology — the hub (search / emit / push / quality / communities)
  facade.py        # build_from_csv / build_from_documents / build_from_triples
examples/  tests/
```

## Roadmap

- async pipeline (parallel chunk extraction + parallel search seeds)
- Neo4j / property-graph `GraphStore` adapter; Qdrant `VectorStore` adapter
- RDF-star / qualified statements (n-ary relations, provenance) in emit
- reranker / cross-encoder hook for search

## License

MIT © jinsoo96. See [LICENSE](LICENSE).
