Metadata-Version: 2.4
Name: simplekg
Version: 0.1.3
Summary: A complete workflow for generating, normalizing, and visualizing Knowledge Graphs from unstructured Hebrew text
Author-email: Hadar Miller <miller.hadar@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://gitlab.com/millerhadar/simplekg
Project-URL: Bug Tracker, https://gitlab.com/millerhadar/simplekg/-/issues
Project-URL: Documentation, https://gitlab.com/millerhadar/simplekg/-/blob/main/README.md
Project-URL: Source Code, https://gitlab.com/millerhadar/simplekg
Keywords: knowledge-graph,nlp,hebrew,ontology,skos,rdf,extraction
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Operating System :: OS Independent
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: dspy-ai>=2.6.24
Requires-Dist: openai>=1.61.0
Requires-Dist: numpy<2
Requires-Dist: pydantic>=2.0.0
Requires-Dist: torch<2.1.0,>=1.11.0
Requires-Dist: sentence-transformers>=2.0.0
Requires-Dist: scikit-learn
Requires-Dist: python-dotenv
Requires-Dist: pyvis>=0.3.2
Requires-Dist: nltk>=3.8
Requires-Dist: jupyter>=1.1.1
Requires-Dist: ipykernel>=6.31.0
Requires-Dist: requests>=2.32.5
Provides-Extra: dev
Requires-Dist: ipywidgets; extra == "dev"
Requires-Dist: jupyter; extra == "dev"
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: flake8; extra == "dev"
Provides-Extra: jupyter
Requires-Dist: ipywidgets; extra == "jupyter"
Requires-Dist: jupyter; extra == "jupyter"
Provides-Extra: stanza
Requires-Dist: stanza; extra == "stanza"
Dynamic: license-file

# SimpleKG

[![PyPI version](https://badge.fury.io/py/simplekg.svg)](https://badge.fury.io/py/simplekg)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

**SimpleKG** is a Python library for generating Knowledge Graphs from unstructured text using LLMs. It is designed for humanities and digital scholarship research — particularly for multi-lingual, domain-specific corpora such as rabbinic Hebrew, ancient Greek patristic literature, and legal documents. The primary use case is cross-text comparison: extracting KGs from multiple source texts under a shared ontology, then comparing their structure to detect text reuse, semantic proximity, or shared tradition.

---

## Table of Contents

1. [Installation](#installation)
2. [Environment Setup](#environment-setup)
3. [Command Line Usage](#command-line-usage)
4. [Python API — Basic Usage](#python-api--basic-usage)
   - [Rabbinic / Hebrew Text](#rabbinic--hebrew-text)
   - [Ancient Greek with Ontology Normalization](#ancient-greek-with-ontology-normalization)
5. [Post-KG: ACT Format and Visualization](#post-kg-act-format-and-visualization)
6. [Pipeline Configuration Reference](#pipeline-configuration-reference)
7. [Signature Modules (Domains)](#signature-modules-domains)
8. [Implementation](#implementation)
   - [Pipeline Architecture](#pipeline-architecture)
   - [Signature Registry System](#signature-registry-system)
   - [Ontology Normalization](#ontology-normalization)
   - [Data Models](#data-models)
   - [Project Structure](#project-structure)
9. [Citation](#citation)
10. [Links](#links)

---

## Installation

```bash
# From PyPI
pip install simplekg

# From source (recommended for development)
git clone https://gitlab.com/millerhadar/simplekg.git
cd simplekg
uv sync

# With optional stanza NLP support
uv sync --extra stanza
```

---

## Environment Setup

Create a `.env` file in your project root:

```bash
# LLM
OPENAI_API_KEY=sk-...

# Elasticsearch (optional — only needed for ACT graph storage)
ELASTIC_HOST=https://your-es-host
ELASTIC_USER=your-user
ELASTIC_PASSWORD=your-password
```

Load it in your script or notebook:

```python
from dotenv import load_dotenv
load_dotenv(".env")
```
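
A fail-fast check can surface a missing key before any LLM call is made. The following is a minimal sketch, not part of SimpleKG; `missing_env` and `REQUIRED_VARS` are hypothetical names, and only `OPENAI_API_KEY` is treated as required because the Elasticsearch variables are described above as optional.

```python
import os

# Hypothetical helper (not part of SimpleKG): only the LLM key is required;
# the ELASTIC_* variables are optional according to the .env example above.
REQUIRED_VARS = ("OPENAI_API_KEY",)


def missing_env(required=REQUIRED_VARS, environ=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]
```

Calling `missing_env()` right after `load_dotenv(".env")` and raising an error when the returned list is non-empty keeps configuration problems separate from pipeline failures.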

---

## Command Line Usage

`kg_gen.py` is the main script for batch pipeline execution from the command line.

```bash
# Basic usage
uv run kg_gen.py -f <file_name> -s <signature_module>

# Examples
uv run kg_gen.py -f Ramban19_2 -s signatures_rabbinic
uv run kg_gen.py -f Ramban19_2 -s signatures_rabbinic -c 1500
uv run kg_gen.py -f IbnShuib19_2 -s signatures_ramban -c -1 -x Ramban19_2
# Run in the background (trailing &)
uv run kg_gen.py -f Greek_tlg0526_tlg004_6_201_210 -s signatures_ancient_greek -c 1500 &
```

**Arguments:**

| Flag | Description |
|---|---|
| `-f` | Text file name (without `.txt` extension), looked up in the configured source path |
| `-s` | Signature module (domain): `signatures_rabbinic`, `signatures_ramban`, `signatures_legal`, `signatures_wa`, `signatures_ancient_greek` |
| `-c` | Chunk size in characters (default: 1100). Use `-1` to chunk by sentences |
| `-x` | Optional context file name — provides background document context to the LLM |
| `-p` | Enable text preprocessing step before extraction |
| `-r` | Override source path for the input file |
| `-o` | Override output directory |

Output is written to a structured directory under `base_output_path`:

```
kg_d<file_name>/
  C0_O0/singleStepRelations/
    final_knowledge_graph.json
    final_knowledge_graph_visualization.html
    step_1_processed_subgraphs.json
    ...
  logs/
    nkg_pipeline.log
```
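
To read the final graph back into Python, the layout above can be turned into a path. This is a sketch: `final_graph_path` and `load_final_graph` are hypothetical helpers, the `C0_O0` and `singleStepRelations` segments are copied from the example tree and may differ for other pipeline settings, and nothing is assumed about the JSON schema.

```python
import json
from pathlib import Path


def final_graph_path(base_output_path, file_name,
                     variant="C0_O0", mode="singleStepRelations"):
    """Build the path of final_knowledge_graph.json from the example layout.

    `variant` and `mode` default to the directory names shown in the tree
    above; treat them as assumptions for other configurations.
    """
    return (Path(base_output_path) / f"kg_d{file_name}" / variant / mode
            / "final_knowledge_graph.json")


def load_final_graph(base_output_path, file_name):
    """Load the stored graph as plain Python data (schema not assumed)."""
    with final_graph_path(base_output_path, file_name).open(encoding="utf-8") as fh:
        return json.load(fh)
```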

---

## Python API — Basic Usage

### Rabbinic / Hebrew Text

```python
import os
from dotenv import load_dotenv
from simplekg import NKGGenerator

load_dotenv(".env")

generator = NKGGenerator(
    model="openai/gpt-4o",
    signature_module="signatures_rabbinic",
    api_key=os.getenv("OPENAI_API_KEY"),
    log_level="INFO",
    log_to_file=False,
)

text = """מֵאֵימָתַי קוֹרִין אֶת שְׁמַע בְּעַרְבִית. מִשָּׁעָה שֶׁהַכֹּהֲנִים נִכְנָסִים
לֶאֱכֹל בִּתְרוּמָתָן, עַד סוֹף הָאַשְׁמוּרָה הָרִאשׁוֹנָה, דִּבְרֵי רַבִּי אֱלִיעֶזֶר."""

pipeline = {
    "preProcessText": False,
    "processSubGraphs": True,
    "tagDefinitions": False,
    "twoStepRelations": False,
    "consolidateSubGraphs": True,
    "mergeConcepts": True,
    "pruneConcepts": False,
    "storeGraph": True,
    "storeEmbeddings": True,
    "storeVisualization": True,
    "storeGraphSteps": True,
    "outputPath": "/tmp/kg_output/",
}

generator.execute_pipeline(
    text=text,
    doc_context=None,
    pipeline=pipeline,
    chunk_size=0,           # 0 = no chunking, process as one unit
    chunk_by_sentences=False,
    chunk_overlap_sentences=0,
    verbose=False,
)
```

### Ancient Greek with Ontology Normalization

Ancient Greek texts require a text normalization function for matching LLM output back to the source (removing diacritics, normalizing sigma variants, etc.). Ontology normalization aligns all extracted predicates and entity types to the `AncientGreekOntology` — enabling meaningful cross-text comparison.

```python
import os, re, unicodedata
import regex  # third-party package; provides \p{Script=Greek}
from dotenv import load_dotenv
from simplekg import NKGGenerator

load_dotenv(".env")

def normalize_greek(text, filter_non_greek=True):
    """Strip diacritics and normalize Greek letter variants."""
    custom_mapping = {
        '\u03c2': 'σ',  # final sigma → regular sigma
        '\u03f2': 'σ',  # lunate sigma → regular sigma
    }
    # Decompose, then drop combining marks (accents, breathings, iota subscript)
    normalized = unicodedata.normalize('NFD', text)
    text = re.sub(r'[\u0300-\u036F]', '', normalized)
    # Rejoin words that were hyphenated across a break
    text = regex.sub(r'(\p{Script=Greek})[-—]\s+(\p{Script=Greek})', r'\1\2', text)
    if filter_non_greek:
        text = re.sub('[^\u0370-\u03FF\u1F00-\u1FFF\u0300-\u036F ]+', '', text)
    # Lowercase before unifying sigmas: str.lower() turns a word-final Σ into ς
    text = text.lower()
    for char, unified in custom_mapping.items():
        text = text.replace(char, unified)
    return text


generator = NKGGenerator(
    model="openai/gpt-4o",
    signature_module="signatures_ancient_greek",
    api_key=os.getenv("OPENAI_API_KEY"),
    log_level="INFO",
    log_to_file=False,
    normalize_text_for_matching=normalize_greek,   # domain-specific normalization
)

text = ("ωστε συναγεσθαι απο πρωτου ετουσ κυρου και περσων βασιλειασ επι το τελοσ "
        "τησ των μακκαβαιων γραφησ και επι την σιμωνοσ του αρχιερεωσ τελευτην ετη "
        "τετρακοσια εικοσιπεντε")

pipeline = {
    "preProcessText": False,
    "processSubGraphs": True,
    "tagDefinitions": False,
    "normalizeOntology": True,        # map predicates + entity types to AncientGreekOntology
    "twoStepRelations": False,
    "forceOrphanRelation": True,      # attempt to connect isolated concepts
    "consolidateSubGraphs": True,
    "mergeConcepts": True,
    "pruneConcepts": False,
    "storeGraph": True,
    "storeEmbeddings": True,
    "storeVisualization": True,
    "storeGraphSteps": True,
    "outputPath": "/tmp/kg_greek/",
}

generator.execute_pipeline(
    text=text,
    doc_context=None,
    pipeline=pipeline,
    chunk_size=0,
    chunk_by_sentences=False,
    chunk_overlap_sentences=0,
    verbose=False,
)
```
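
The reason a normalizer is needed for matching can be shown with a small check: an LLM may return a span without diacritics, which an exact substring search against the accented source would miss. The sketch below uses only the standard library; `strip_marks` is a simplified stand-in for `normalize_greek`, and `found_in_source` is a hypothetical helper, not the matching routine SimpleKG uses internally.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Simplified stand-in normalizer: drop combining marks, unify final sigma."""
    decomposed = unicodedata.normalize("NFD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return no_marks.lower().replace("ς", "σ")


def found_in_source(span: str, source: str, normalize=strip_marks) -> bool:
    """True if the span occurs in the source once both are normalized."""
    return normalize(span) in normalize(source)


source = "Κῦρος βασιλεύς"
span = "κυροσ βασιλευσ"  # e.g. an LLM answer without diacritics
exact_match = span in source
normalized_match = found_in_source(span, source)
```

Here `exact_match` is false while `normalized_match` is true, which is the gap a function passed as `normalize_text_for_matching` is meant to close.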

---

## Post-KG: ACT Format and Visualization

After pipeline execution, the graph can be converted to **ACT format** (Annotated Concept Tree) — a JSON structure suitable for graph databases and network analysis — and then visualized as an interactive HTML graph.

```python
import json
from simplekg.utilities import utils, networkXutils

nxu = networkXutils.NXUtils()

# Convert subgraphs to ACT format
ret = utils.kg2ACT(
    generator.graph.subgraphs,
    location="my_document_id",
    categories=["private", "my_project"],
    optin_entity_type=[],               # empty = include all entity types
    clean_overlapping=False,
    normalizers={},
    additional_attrs=["prefLabel_en"],
    additional_edge_attrs=["predicate", "evidence_text"],
)

# Optionally store ACT graph as JSON
with open("/tmp/kg_greek/final_knowledge_graph_ACT.json", "w") as f:
    json.dump(ret, f, indent=4, ensure_ascii=False)

# Convert to NetworkX graph for visualization
actnx = nxu.graphACT2nx(
    ret,
    title_node_attrs=["conceptDescription_en"],
    node_label_attr="prefLabel_en",
    edge_label="weight",
    edge_hover=["predicate"],
)

# Render interactive HTML visualization
nxu.visualize_graph(
    actnx,
    output_file="/tmp/kg_greek/visualization.html",
    open_browser=True,
    show_legend=True,
)
```

The resulting HTML file contains a fully interactive graph (powered by pyvis) with hover tooltips, legend, and drag-and-drop layout.

---

## Pipeline Configuration Reference

All pipeline flags are passed as a dictionary to `execute_pipeline()`. Missing flags fall back to defaults.

| Flag | Type | Default | Description |
|---|---|---|---|
| `preProcessText` | bool | False | Run domain-specific text preprocessing before extraction |
| `processSubGraphs` | bool | True | Extract concepts and relations from each chunk |
| `tagDefinitions` | bool | False | Mark definition concepts (e.g., "X is defined as...") |
| `twoStepRelations` | bool | False | Two-step relation extraction: candidates first, then structured relations |
| `forceOrphanRelation` | bool | False | Attempt to attach isolated concepts (no relations) to the graph |
| `normalizeOntology` | bool | False | Enforce domain ontology on entity types and predicates (see below) |
| `consolidateSubGraphs` | bool | True | Merge per-chunk subgraphs into one consolidated graph |
| `mergeConcepts` | bool | True | Cluster and merge semantically equivalent concepts |
| `pruneConcepts` | bool | False | Remove low-confidence or isolated concepts |
| `storeGraph` | bool | False | Save final graph as JSON |
| `storeEmbeddings` | bool | False | Save concept embeddings alongside the graph |
| `storeVisualization` | bool | False | Render and save an HTML visualization |
| `storeGraphSteps` | bool | False | Save intermediate pipeline stages as JSON snapshots |
| `outputPath` | str | None | Base output directory; subdirectory structure is auto-created |
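
The fallback to defaults can be pictured as a dictionary merge. This is a sketch only: `PIPELINE_DEFAULTS` copies the defaults from the table above, `resolve_pipeline` is a hypothetical helper, and rejecting unknown keys is a choice made here to catch typos, not documented library behavior.

```python
# Defaults copied from the table above.
PIPELINE_DEFAULTS = {
    "preProcessText": False,
    "processSubGraphs": True,
    "tagDefinitions": False,
    "twoStepRelations": False,
    "forceOrphanRelation": False,
    "normalizeOntology": False,
    "consolidateSubGraphs": True,
    "mergeConcepts": True,
    "pruneConcepts": False,
    "storeGraph": False,
    "storeEmbeddings": False,
    "storeVisualization": False,
    "storeGraphSteps": False,
    "outputPath": None,
}


def resolve_pipeline(overrides=None):
    """Merge user flags over the defaults; unknown keys raise KeyError."""
    overrides = overrides or {}
    unknown = sorted(set(overrides) - set(PIPELINE_DEFAULTS))
    if unknown:
        raise KeyError(f"Unknown pipeline flags: {unknown}")
    return {**PIPELINE_DEFAULTS, **overrides}
```

With this helper, `resolve_pipeline({"storeGraph": True, "outputPath": "/tmp/kg_output/"})` yields a complete dictionary in which every unspecified flag keeps its documented default.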

**Chunking parameters** (passed directly to `execute_pipeline`, not inside the pipeline dict):

| Parameter | Default | Description |
|---|---|---|
| `chunk_size` | 0 | Characters per chunk. `0` = no chunking. `-1` = chunk by sentences |
| `chunk_by_sentences` | False | Automatically set to True when `chunk_size=-1` |
| `chunk_overlap_sentences` | 2 | Sentence overlap between adjacent chunks (for context continuity) |
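
The two chunking modes can be illustrated as follows. This is a simplified sketch and not SimpleKG's splitter: `chunk_by_chars` cuts at fixed character offsets without looking for word or sentence boundaries, and `chunk_sentences` takes a hypothetical `per_chunk` window size that does not appear in the library's documented parameters.

```python
def chunk_by_chars(text, chunk_size):
    """Fixed-width character chunks; chunk_size <= 0 returns the text whole."""
    if chunk_size <= 0:
        return [text]
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def chunk_sentences(sentences, per_chunk, overlap):
    """Sliding window over sentences; adjacent chunks share `overlap` sentences."""
    if per_chunk <= overlap:
        raise ValueError("per_chunk must be larger than overlap")
    step = per_chunk - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(sentences[start:start + per_chunk])
        if start + per_chunk >= len(sentences):
            break  # the last window already reaches the end of the text
    return chunks
```

For six sentences with `per_chunk=4` and `overlap=2`, the second chunk starts with the last two sentences of the first, which is the context continuity that `chunk_overlap_sentences` is described as providing.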

---

## Signature Modules (Domains)

Each domain has a dedicated `SignatureRegistry` under `simplekg/signatures/` that encapsulates the DSPy extraction prompts (signatures) for that domain's language and conventions.

| Module | Class | Domain |
|---|---|---|
| `signatures_rabbinic` | `RabbinicSignatureRegistry` | Mishnah, Talmud, and rabbinic Hebrew literature |
| `signatures_ramban` | `RambanSignatureRegistry` | Ramban biblical commentary (medieval Hebrew) |
| `signatures_legal` | `LegalSignatureRegistry` | Hebrew legal documents and work agreements |
| `signatures_wa` | `WorkAgreementsSignatureRegistry` | Structured work agreement analysis |
| `signatures_ancient_greek` | `AncientGreekSignatureRegistry` | Ancient Greek historical, patristic, and classical texts |

Pass the module name as a string to `NKGGenerator`:

```python
generator = NKGGenerator(
    model="openai/gpt-4o",
    signature_module="signatures_rabbinic",
    api_key=os.getenv("OPENAI_API_KEY"),
)
```

---

## Implementation

### Pipeline Architecture

`NKGGenerator.execute_pipeline()` orchestrates a sequential set of stages. Each stage is gated by its own flag in the pipeline dictionary, so stages can be switched on or off individually.

```
execute_pipeline(text)
  │
  ├── init_graph()             Split text into chunks → DocGraph with subgraphs
  │
  ├── processSubGraphs()       Per chunk (parallel):
  │     ├── _extractChunkConcepts()     LLM: terms → concepts
  │     ├── _enforce_entity_types()     [if normalizeOntology] validate against ontology
  │     ├── _extractChunkRelations()    LLM: concepts → relations
  │     ├── _normalize_predicates()     [if normalizeOntology] map predicates to ontology
  │     └── _forceOrphanRelation()      [if forceOrphanRelation] connect isolated concepts
  │
  ├── consolidateSubGraphs()   Merge all subgraph concepts + relations with id remapping
  │
  ├── mergeConcepts()          Cluster semantically equivalent concepts; pick canonical form
  │
  ├── pruneConcepts()          Remove low-value nodes
  │
  └── storeGraph / storeVisualization / storeEmbeddings
```

Each chunk is processed as a `BaseGraph` object (the subgraph). After consolidation the full document is a `DocGraph` containing the consolidated graph and all subgraphs.
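
The "id remapping" in the consolidation stage matters because concept ids are only unique within a subgraph. A minimal sketch of the idea, using plain dictionaries rather than the library's models; the `g<gid>_<id>` naming scheme and the `consolidate` function are assumptions for illustration, not SimpleKG's actual scheme.

```python
def consolidate(subgraphs):
    """Merge subgraphs into one graph, rewriting ids so they cannot collide.

    Each subgraph is a dict with 'gid', 'concepts' (dicts with an 'id') and
    'relations' (dicts with 'subject_id', 'predicate', 'object_id').
    """
    concepts, relations = [], []
    for sg in subgraphs:
        # Hypothetical scheme: prefix every local id with the subgraph id.
        remap = {c["id"]: f"g{sg['gid']}_{c['id']}" for c in sg["concepts"]}
        concepts.extend({**c, "id": remap[c["id"]]} for c in sg["concepts"])
        relations.extend(
            {**r,
             "subject_id": remap[r["subject_id"]],
             "object_id": remap[r["object_id"]]}
            for r in sg["relations"]
        )
    return {"concepts": concepts, "relations": relations}
```

After this step, two chunks that both produced a concept with id `c1` end up as distinct nodes; deciding whether they are the same concept is left to the later `mergeConcepts` stage.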

### Signature Registry System

The `SignatureRegistry` (in `simplekg/signatures/signature_registry.py`) is an abstract base class that defines the interface every domain must implement. Each method returns a DSPy `Signature` class used by the pipeline.

```python
from abc import ABC

class SignatureRegistry(ABC):
    # Interface outline only; method bodies are elided.
    def get_harvest_terms_signature(self): ...                # text chunk → term candidates
    def get_candidates_to_concepts_signature(self): ...       # candidates → Concept objects
    def get_harvest_definitions_signature(self): ...          # identify definition concepts
    def get_harvest_relation_candidates_signature(self): ...  # two-step: raw candidates
    def get_candidates_to_relations_signature(self): ...      # two-step: structured relations
    def get_harvest_relation_signature(self): ...             # one-step relation extraction
    def get_propose_concept_merges_signature(self): ...       # merge proposals across subgraphs
    def get_resolve_orphan_signature(self): ...               # connect isolated concepts (default impl.)
    def get_ontology(self): ...                               # returns None by default
    def get_predicate_mapping_signature(self): ...            # maps free predicates to ontology (default impl.)

Creating a new domain requires subclassing `SignatureRegistry` and implementing the abstract methods with domain-specific DSPy signatures. The pipeline discovers the registry class by convention (class name ending in `SignatureRegistry`) via `importlib`.
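
Discovery by naming convention can be sketched with `importlib` and `inspect`. This is an illustration of the technique, not SimpleKG's loader: `find_registry_class` is a hypothetical function, and skipping the base class by name and ignoring classes imported from other modules are assumptions about how ambiguity would be avoided.

```python
import importlib
import inspect


def find_registry_class(module_name, suffix="SignatureRegistry"):
    """Return the first class defined in the module whose name ends with suffix."""
    module = importlib.import_module(module_name)
    for name, obj in inspect.getmembers(module, inspect.isclass):
        defined_here = obj.__module__ == module.__name__
        # Skip the abstract base itself and anything merely imported.
        if name.endswith(suffix) and name != suffix and defined_here:
            return obj
    raise LookupError(f"No *{suffix} class found in {module_name}")
```

Under this convention a new domain module needs no registration step; it only has to define exactly one class whose name ends in `SignatureRegistry`.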

### Ontology Normalization

By default, the LLM extracts relations using free-form predicates — maximizing recall but producing varied vocabulary that makes cross-text comparison unreliable. Ontology normalization is a post-extraction step that maps every extracted predicate and entity type to a fixed, domain-specific controlled vocabulary.

**Enabling normalization:**

```python
pipeline = {
    # ... other flags
    "normalizeOntology": True,
}
```

Normalization is active only when the pipeline flag is `True` and the signature registry returns an ontology from `get_ontology()`. Registries that do not define an ontology are unaffected.

**How it works:**

*Entity type enforcement* (deterministic, no LLM call):
After concept extraction, each `concept.entity_type` is validated against the ontology's entity type list. Invalid types are silently replaced with the generic fallback (e.g., `"Entity"`). The original LLM output is preserved in `concept.entity_type_raw`.
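
A minimal sketch of this enforcement step, using a small stand-in dataclass instead of the library's `Concept` model. `ConceptStub` and `enforce_entity_types` are hypothetical names, and recording `entity_type_raw` for every concept (not only for replaced ones) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConceptStub:
    """Stand-in for Concept with only the fields this step touches."""
    prefLabel: str
    entity_type: str
    entity_type_raw: Optional[str] = None


def enforce_entity_types(concepts, valid_types, generic="Entity"):
    """Replace entity types outside the ontology with the generic fallback."""
    for concept in concepts:
        concept.entity_type_raw = concept.entity_type  # keep the LLM output
        if concept.entity_type not in valid_types:
            concept.entity_type = generic
    return concepts


concepts = enforce_entity_types(
    [ConceptStub("Κῦρος", "Person"), ConceptStub("ἔτος", "Duration")],
    valid_types={"Person", "Place", "TimePeriod"},
)
```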

*Predicate normalization* (pre-filter + LLM + post-validate):
Relations are grouped by `(subject_type, object_type)` pair. For each group:
- **0 candidates** after domain/range filtering → assign `generic_predicate` directly (no LLM call)
- **1 candidate** → assign directly (no LLM call)
- **N candidates** → single batched LLM call mapping all relations in the group

The original predicate is preserved in `relation.predicate_raw`.
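
The three cases above can be sketched as a planning function that resolves what it can without an LLM and collects the remaining groups for batched calls. This uses plain dictionaries and is not the library's code; `plan_normalization` is a hypothetical name and the `bornIn` predicate is invented for the example.

```python
from collections import defaultdict


def candidates_for(predicates, subject_type, object_type):
    """Predicates whose domain and range admit this type pair."""
    return [p["id"] for p in predicates
            if subject_type in p["domain"] and object_type in p["range"]]


def plan_normalization(relations, predicates, generic_predicate="relatedTo"):
    """Split relations into directly resolved ones and groups needing an LLM."""
    groups = defaultdict(list)
    for rel in relations:
        groups[(rel["subject_type"], rel["object_type"])].append(rel)

    direct, needs_llm = [], []
    for (s_type, o_type), rels in groups.items():
        cands = candidates_for(predicates, s_type, o_type)
        if len(cands) <= 1:
            # 0 candidates: generic predicate; 1 candidate: assign it directly.
            chosen = cands[0] if cands else generic_predicate
            direct.extend({**r, "predicate_raw": r["predicate"], "predicate": chosen}
                          for r in rels)
        else:
            # N candidates: one batched LLM call per group (not performed here).
            needs_llm.append({"types": (s_type, o_type),
                              "candidates": cands, "relations": rels})
    return direct, needs_llm


predicates = [
    {"id": "ruledOver", "domain": ["Person", "Polity"], "range": ["Place", "Polity"]},
    {"id": "bornIn", "domain": ["Person"], "range": ["Place"]},  # hypothetical
]
relations = [
    {"subject_type": "Person", "object_type": "Place", "predicate": "governed"},
    {"subject_type": "Polity", "object_type": "Place", "predicate": "controlled"},
    {"subject_type": "Work", "object_type": "Person", "predicate": "mentions"},
]
direct, needs_llm = plan_normalization(relations, predicates)
```

In this example only the `(Person, Place)` group is ambiguous, so two of the three relations are normalized without any LLM call.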

**Defining a domain ontology:**

```python
from simplekg.ontologies.base import Ontology, OntologyEntityType, OntologyPredicate

class MyOntology(Ontology):
    def __init__(self):
        super().__init__(
            id="my-domain-v1",
            description="Ontology for my domain.",
            generic_entity_type="Entity",
            generic_predicate="relatedTo",
            entity_types=[
                OntologyEntityType(
                    id="Person",
                    description="An individual.",
                    examples=["Aristotle", "Plato"],
                    aliases=["Individual", "Author"],
                ),
                # ... more types
            ],
            predicates=[
                OntologyPredicate(
                    id="ruledOver",
                    label="ruled over",
                    description="A person or polity exercised political authority over a place.",
                    domain=["Person", "Polity"],
                    range=["Place", "Polity"],
                    examples=["Caesar ruledOver Rome"],
                    aliases=["governed", "controlled", "founded", "established", "led"],
                ),
                # ... more predicates
            ],
        )
```

Inject it via the registry:

```python
from simplekg.signatures.signature_registry import SignatureRegistry

class MySignatureRegistry(SignatureRegistry):
    def get_ontology(self):
        return MyOntology()
    # ... implement other abstract methods
```

**AncientGreekOntology** (`simplekg/ontologies/ancient_greek.py`) is the reference implementation, defining 11 entity types (Person, Place, Polity, Role, Event, TimePeriod, Work, Abstraction, Ethnonym, Practice, Artifact) and 28 predicates with full domain/range constraints and alias lists. See [OntologyBasedKGImplementation.md](simplekg/documentation/OntologyBasedKGImplementation.md) for the full design rationale.

### Data Models

Core objects are Pydantic models defined in `simplekg/models.py`.

**Concept** — an extracted entity:
```python
class Concept(BaseModel):
    id: str                          # unique identifier within the graph
    prefLabel: str                   # canonical label in source language
    prefLabel_en: str                # English translation
    altLabels: List[str]             # alternative surface forms
    conceptDescription: str          # description in source language
    conceptDescription_en: str       # English description
    entity_type: str                 # ontology-enforced type (e.g., "Person")
    entity_type_raw: Optional[str]   # original LLM output before enforcement
    concept_position: int            # character offset in source text
    definition: bool                 # True if this concept is a definition
```

**Relation** — an extracted relationship:
```python
class Relation(BaseModel):
    subject_id: str                  # id of the subject Concept
    predicate: str                   # ontology-normalized predicate
    predicate_raw: Optional[str]     # original LLM predicate before mapping
    object_id: str                   # id of the object Concept
    evidence_text: Optional[str]     # text span supporting this relation
```

**BaseGraph** — a single chunk's subgraph:
```python
class BaseGraph:
    gid: int
    text: str
    concepts: List[Concept]
    relation_objects: List[Relation]
```

**DocGraph** — the full document:
```python
class DocGraph:
    consolidated_graph: BaseGraph
    subgraphs: List[BaseGraph]
```
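
Because relations refer to concepts by id, reading a graph usually means resolving those ids back to labels. The sketch below shows that lookup with lightweight dataclass stand-ins; `ConceptLite`, `RelationLite`, `GraphLite`, and `triples` are hypothetical and carry only a subset of the fields listed above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ConceptLite:
    id: str
    prefLabel_en: str


@dataclass
class RelationLite:
    subject_id: str
    predicate: str
    object_id: str
    evidence_text: Optional[str] = None


@dataclass
class GraphLite:
    gid: int
    text: str
    concepts: List[ConceptLite] = field(default_factory=list)
    relation_objects: List[RelationLite] = field(default_factory=list)


def triples(graph):
    """Resolve subject and object ids to English labels."""
    labels = {c.id: c.prefLabel_en for c in graph.concepts}
    return [(labels[r.subject_id], r.predicate, labels[r.object_id])
            for r in graph.relation_objects]


graph = GraphLite(
    gid=0,
    text="Caesar ruled over Rome.",
    concepts=[ConceptLite("c1", "Caesar"), ConceptLite("c2", "Rome")],
    relation_objects=[RelationLite("c1", "ruledOver", "c2", "Caesar ruled over Rome")],
)
readable = triples(graph)
```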

### Project Structure

```
simplekg/
  kg.py                     NKGGenerator — main pipeline class
  models.py                 Pydantic data models (Concept, Relation, BaseGraph, DocGraph)
  signatures/
    signature_registry.py   Abstract base class for all signature registries
    signatures_rabbinic.py  Rabbinic Hebrew signatures
    signatures_ramban.py    Ramban commentary signatures
    signatures_legal.py     Legal document signatures
    signatures_wa.py        Work agreement signatures
    signatures_ancient_greek.py  Ancient Greek signatures
  ontologies/
    base.py                 OntologyEntityType, OntologyPredicate, Ontology base classes
    ancient_greek.py        AncientGreekOntology (reference implementation)
  utilities/
    utils.py                kg2ACT conversion, text utilities
    networkXutils.py        NetworkX graph construction and visualization
    ElasticUtils.py         Elasticsearch storage and retrieval

tests/                      Application and research notebooks, ES test scripts
test_sys/                   System and pipeline regression tests
kg_gen.py                   Command-line pipeline runner
```

---

## Citation

If you use SimpleKG in your research, please cite:

```bibtex
@software{simplekg,
  author  = {Hadar Miller},
  title   = {SimpleKG: Knowledge Graph Generation for Humanities Research},
  url     = {https://gitlab.com/millerhadar/simplekg},
  version = {0.1.3},
  year    = {2025}
}
```

---

## Links

- [Source Code](https://gitlab.com/millerhadar/simplekg)
- [Issues](https://gitlab.com/millerhadar/simplekg/-/issues)
- [Ontology Design](simplekg/documentation/OntologyBasedKGImplementation.md)
