Metadata-Version: 2.3
Name: strwythura
Version: 1.2.1
Summary: Construct a _knowledge graph_ (KG) from unstructured data sources using _state of the art_ (SOTA) models for _named entity recognition_ (NER), then implement an enhanced _GraphRAG_ approach, and curate semantics for optimizing AI app outcomes within a specific domain.
License: MIT
Author: Paco Nathan
Author-email: paco@derwen.ai
Requires-Python: >=3.11,<3.12
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Requires-Dist: baml-cli (>=0.1.0,<0.2.0)
Requires-Dist: bs4 (>=0.0.2,<0.0.3)
Requires-Dist: datasets (>=4.0.0,<5.0.0)
Requires-Dist: gensim (>=4.3.3,<5.0.0)
Requires-Dist: gliner (>=0.2.21,<0.3.0)
Requires-Dist: gliner-spacy (>=0.0.11,<0.0.12)
Requires-Dist: icecream (>=2.1.5,<3.0.0)
Requires-Dist: ipython (>=9.4.0,<10.0.0)
Requires-Dist: ipywidgets (>=8.1.7,<9.0.0)
Requires-Dist: jupyterlab (>=4.4.5,<5.0.0)
Requires-Dist: jupyterlab-execute-time (>=3.2.0,<4.0.0)
Requires-Dist: lancedb (>=0.24.2,<0.25.0)
Requires-Dist: loguru (>=0.7.3,<0.8.0)
Requires-Dist: lxml (>=6.0.0,<7.0.0)
Requires-Dist: networkx (>=3.5,<4.0)
Requires-Dist: nltk (>=3.9.1,<4.0.0)
Requires-Dist: ollama (>=0.5.1,<0.6.0)
Requires-Dist: polars (>=1.32.3,<2.0.0)
Requires-Dist: pyinstrument (>=5.0.3,<6.0.0)
Requires-Dist: pyvis (>=0.3.2,<0.4.0)
Requires-Dist: rdflib (>=7.1.4,<8.0.0)
Requires-Dist: requests (>=2.32.4,<3.0.0)
Requires-Dist: requests-cache (>=1.2.1,<2.0.0)
Requires-Dist: spacy (>=3.8.7,<4.0.0)
Requires-Dist: tqdm (>=4.67.1,<5.0.0)
Requires-Dist: w3lib (>=2.3.1,<3.0.0)
Requires-Dist: watermark (>=2.5.0,<3.0.0)
Project-URL: Homepage, https://github.com/DerwenAI/strwythura
Project-URL: doi, https://doi.org/10.5281/zenodo.16934079
Project-URL: slides, https://derwen.ai/s/2njz#1
Project-URL: video, https://senzing.com/gph-graph-rag-llm-knowledge-graphs/
Description-Content-Type: text/markdown

# Strwythura

[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.16934079.svg)](https://doi.org/10.5281/zenodo.16934079)

**Strwythura** tutorial, based on a presentation about GraphRAG for
[GraphGeeks](https://graphgeeks.org/) on 2024-08-14.

How to construct a _knowledge graph_ (KG) from unstructured data
sources using _state of the art_ (SOTA) models for _named entity
recognition_ (NER), then implement an enhanced _GraphRAG_ approach,
and curate semantics for optimizing AI app outcomes within a
specific domain.

  * videos: <https://youtu.be/B6_NfvQL-BE>, <https://senzing.com/gph-graph-rag-llm-knowledge-graphs/>
  * slides: <https://derwen.ai/s/2njz#1>

Motivation for this tutorial comes from the stark fact that the
term "GraphRAG" means many things, based on multiple conflicting
definitions. Several popular implementations reveal a relatively 
cursory understanding about either _natural language processing_ (NLP)
or graph algorithms, plus a _vendor bias_ toward their own query language.

See this article for more details and history:
["Unbundling the Graph in GraphRAG"](https://www.oreilly.com/radar/unbundling-the-graph-in-graphrag/).

Instead of delegating KG construction to a _large language model_
(LLM), this tutorial shows the use of sophisticated NLP pipelines
based on `spaCy`, `GLiNER`, _TextRank_, and related libraries.
Results are better/faster/cheaper, plus this provides more control
and oversight for _intentional arrangement_ of the KG. Then for
downstream usage in a question/answer chat bot, an enhanced GraphRAG
approach leverages graph algorithms (e.g., _semantic random walk_)
to optimize retrieval of text chunks which ultimately get presented
to an LLM for _summarization_ to produce responses.

For more detailed discussions, see:

  * enhanced GraphRAG: ["GraphRAG to enhance LLM-based apps"](https://derwen.ai/s/hm7h#3)
  * ontology pipeline: ["Intentional Arrangement"](https://jessicatalisman.substack.com/) by Jessica Talisman
  * `spaCy`: <https://spacy.io/>
  * `GLiNER`: <https://huggingface.co/urchade/gliner_base>
  * _TextRank_: <https://www.derwen.ai/docs/ptr/explain_algo/>

A few key issues regarding KG construction with LLMs still have not
been addressed by the graph community in general:

  1. LLMs tend to mangle cross-domain semantics when used for building graphs; see _Mai2024_ referenced in the "GraphRAG to enhance LLM-based apps" talk above.
  2. You need to introduce a _semantic layer_ for representing the domain context, which follows more of a _neurosymbolic AI_ approach.
  3. Almost all LLMs perform _question rewriting_ in ways which cannot be disabled, even when the `temperature` parameter is set to zero; this leads to varying degrees of "hallucinated questions" for which there are no clear workarounds.
  4. Any _model_ used for prediction introduces reasoning based on _generalization_, even more so when the model uses a _loss function_ for training; this tends to be the point where KG structure and semantics turn into crap; see the "Let's talk about ..." articles linked below.
  5. The approach outlined here is faster and less expensive, and produces better results than if you'd delegated KG construction to an LLM.

Of course, YMMV.

Overall, this approach leverages _neurosymbolic AI_ methods, combining
best practices from:

  * _natural language processing_
  * _graph data science_
  * _ontology pipeline_
  * _context engineering_
  * _human-in-the-loop_

to illustrate a reference implementation for _entity-resolved
retrieval-augmented generation_ (ER-RAG).


## Set up

Caveat: this code runs with Python 3.11, though the range of versions
may be extended soon.

```bash
poetry update
poetry run python3 -m spacy download en_core_web_md
```

Note: if you're working with text documents in another language,
change the `spaCy` model downloaded here, and also the model setting
in the `config.toml` file. It's a shame this *cannot* be done
programmatically in a more fluent Pythonic way, for a variety of
complex reasons.


## Usage

Caveat: this repo provides the source code and notebooks which
accompany an instructional tutorial; it is not intended as a packaged
library or maintained product.

That said, if you want to use this code to build an application it may
help to copy settings in `config.toml` into a custom configuration
file, then instantiate new `Strwythura` and `GraphRAG` objects using
it.


## Part 1: Build assets

Given as input:

  * a list of URLs from which to scrape content
  * `domain.ttl` -- semantics for the domain context

Note: the `domain.ttl` file provides an _ontology pipeline_ for the
given domain, used as the _human-in-the-loop_ basis for constructing a
_semantic layer_.  Along with the `curate.py` script described below,
this illustrates _human-in-the-loop_ approaches in KG construction.

The `build.py` script scrapes text sources and constructs a
_knowledge graph_ plus _entity embeddings_, with nodes linked to
chunks in a _vector store_:

```bash
poetry run python3 build.py
```

Demo data used in this case includes articles about the linkage
between eating _processed red meat_ frequently and the risks of
_dementia_ later in life, based on long-term studies.

The approach in this tutorial iterates through multiple steps to
produce the assets needed for GraphRAG downstream:

  1. Scrape each URL using `requests` and `BeautifulSoup`
  2. Split the text into _chunks_
  3. Build _vector embeddings_ for each chunk, in `LanceDB`
  4. Parse each text chunk using `spaCy`, iterating per sentence
  5. Extract _entities_ from each sentence using `GLiNER`
  6. Build a _lexical graph_ from the parse trees in `NetworkX`
  7. Run a _TextRank_ algorithm to rank important entities
  8. Build an embedding model for entities using `gensim.Word2Vec`
  9. Generate an interactive visualization using `PyVis`
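
To make steps 6 and 7 more concrete, here is a simplified sketch in
plain Python: it links lemmas which co-occur in a sentence, then ranks
them with a PageRank-style power iteration, which is the core of
_TextRank_. The toy sentences and function names are assumptions for
illustration; the tutorial code builds a richer graph from `spaCy`
parse trees in `NetworkX`, where `networkx.pagerank()` would replace
the hand-written loop.

```python
from collections import defaultdict
from itertools import combinations

# Toy input: each sentence already reduced to the lemmas of its phrases.
# These values are made up; real ones come from the parse + NER steps.
SENTENCES = [
    ["processed meat", "dementia", "risk"],
    ["red meat", "dementia", "study"],
    ["processed meat", "red meat", "diet"],
    ["dementia", "risk", "diet"],
]


def build_lexical_graph(sentences):
    """Link lemmas co-occurring in a sentence; weights count co-occurrences."""
    graph = defaultdict(lambda: defaultdict(float))

    for lemmas in sentences:
        for src, dst in combinations(sorted(set(lemmas)), 2):
            graph[src][dst] += 1.0
            graph[dst][src] += 1.0

    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration; assumes no isolated nodes."""
    nodes = list(graph)
    rank = {node: 1.0 / len(nodes) for node in nodes}

    for _ in range(iterations):
        new_rank = {}

        for node in nodes:
            # each neighbor shares its rank in proportion to edge weight
            incoming = sum(
                rank[nbr] * weight / sum(graph[nbr].values())
                for nbr, weight in graph[node].items()
            )
            new_rank[node] = (1.0 - damping) / len(nodes) + damping * incoming

        rank = new_rank

    return rank


rank = pagerank(build_lexical_graph(SENTENCES))

for lemma, score in sorted(rank.items(), key=lambda item: item[1], reverse=True):
    print(f"{score:.3f}  {lemma}")
```

Lemmas which connect to many other well-connected lemmas rise to the
top, which is why ranking runs on the graph rather than on raw counts.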

Note: processing may take a few extra minutes the first time it runs
since `PyTorch` must download a large (~2GB) file.

If you look at the performance statistics, it takes almost twice as
long to generate an interactive graph visualization as it does to
perform everything else.

The assets get serialized into these files:

  * `data/lancedb` -- vector database tables in `LanceDB`
  * `data/kg.json` -- serialization of `NetworkX` graph
  * `data/sem.csv` -- entity semantics from `curate.py`
  * `data/entity.w2v` -- entity embeddings in `Gensim`
  * `data/url_cache.sqlite` -- URL cache in `SQLite`
  * `kg.html` -- interactive graph visualization in `PyVis`


## Part 2: GraphRAG chat bot

A good downstream use case for exploring a newly constructed KG is
GraphRAG, used for grounding the responses by an LLM in a
question/answer chat.

This implementation uses `BAML` <https://docs.boundaryml.com/home>
and leverages the KG using _semantic random walks_.
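
One way to picture a random walk used for retrieval is sketched below,
in plain Python. Starting from entities matched in a question, the
walks follow weighted edges, count which entities get visited most,
then collect the text chunks linked to those entities. The toy graph,
the chunk ids, and the function name are all hypothetical, and the
walk here is biased only by edge weight; the implementation in this
repo may select and weight its steps differently.

```python
import random
from collections import Counter

# Hypothetical toy KG: entity -> {neighbor: edge weight}
KG = {
    "processed meat": {"dementia": 2.0, "red meat": 1.0},
    "red meat": {"processed meat": 1.0, "diet": 1.0},
    "dementia": {"processed meat": 2.0, "risk": 1.0},
    "risk": {"dementia": 1.0},
    "diet": {"red meat": 1.0},
}

# Hypothetical links from entities to chunk ids in the vector store.
ENTITY_CHUNKS = {
    "processed meat": [0, 3],
    "red meat": [1],
    "dementia": [0, 2],
    "risk": [2],
    "diet": [4],
}


def random_walk_chunks(start_entities, num_walks=200, walk_length=3, top_k=3, seed=42):
    """Rank entities by visit counts over short weighted walks, then
    return the chunk ids linked to the `top_k` most visited entities."""
    rng = random.Random(seed)  # seeded so that results are repeatable
    visits = Counter()

    for _walk in range(num_walks):
        node = rng.choice(start_entities)

        for _step in range(walk_length):
            visits[node] += 1
            neighbors = KG[node]
            node = rng.choices(list(neighbors), weights=list(neighbors.values()))[0]

    chunk_ids = []

    for entity, _count in visits.most_common(top_k):
        for chunk_id in ENTITY_CHUNKS[entity]:
            if chunk_id not in chunk_ids:
                chunk_ids.append(chunk_id)

    return chunk_ids


print(random_walk_chunks(["dementia"]))
```

The returned chunk ids would then be used to fetch text from the
vector store, which gets presented to the LLM for summarization.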

To set up, first download/install `Ollama` <https://ollama.com/>
and pull the Gemma3 model <https://huggingface.co/google/gemma-3-12b-it>:

```bash
ollama pull gemma3:12b
```

Then run the `errag.py` script for an interactive GraphRAG example:

```bash
poetry run python3 errag.py
```

## Part 3: Semantics curation (WIP)

This code uses a _semantic layer_ -- in other words, a "backbone" for
the KG -- to organize the entities and relations which get abstracted
from the lexical graph.

If you had previously run _entity resolution_ from _structured data
sources_, which tend to be more reliable than unstructured content,
this approach could integrate those results as well.

For now, run the `curate.py` script to generate a view of the ranked
NER results, serialized as the `data/sem.csv` file.  This can be
viewed in a spreadsheet to understand how to iterate on the semantic
definitions for more effective graph organization in the domain of the
scraped documents.

```bash
poetry run python3 curate.py
```
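
Instead of a spreadsheet, the same review can be sketched in a few
lines of Python: list the highest-ranked entities which lack a semantic
label yet, since those are the candidates for new definitions in
`domain.ttl`. The column names (`rank`, `lemma`, `label`, `count`) and
the sample rows below are assumptions for illustration, not the actual
layout of `data/sem.csv`.

```python
import csv
import io

# Hypothetical stand-in for `data/sem.csv`; real columns may differ.
SAMPLE_CSV = """rank,lemma,label,count
0.21,dementia,Condition,14
0.17,processed meat,Food,9
0.09,harvard,,3
0.05,nurse,,2
"""


def unlabeled_entities(handle, min_rank=0.0):
    """Return rows lacking a label, sorted by descending rank."""
    reader = csv.DictReader(handle)

    rows = [
        row
        for row in reader
        if not row["label"] and float(row["rank"]) >= min_rank
    ]

    return sorted(rows, key=lambda row: float(row["rank"]), reverse=True)


# with a real file: `open("data/sem.csv", newline="")` instead of StringIO
for row in unlabeled_entities(io.StringIO(SAMPLE_CSV)):
    print(row["lemma"], row["rank"])
```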


## Generalized, Unbundled Process

**Objective:**

Construct a _knowledge graph_ (KG) using open source libraries where
deep learning models provide narrowly-focused _point solutions_ to
generate components for a graph: nodes, edges, properties.

These steps define a generalized process, where this tutorial picks up
at the _lexical graph_ (without the _entity linking_ (EL) part yet):

**Semantic overlay:**

  1. Load any pre-defined controlled vocabularies directly into the KG.

**Data graph:**

  1. Load the structured data sources or updates into a data graph.
  2. Perform entity resolution (ER) on PII extracted from the data graph.
  3. Use ER results to generate a semantic overlay as a "backbone" for the KG.

**Lexical graph:**

  1. Parse the text chunks, using lemmatization to normalize token spans.
  2. Construct a lexical graph from parse trees, e.g., using a textgraph algorithm.
  3. Use named entity recognition (NER) to extract candidate entities from noun phrase (NP) spans.
  4. Use relation extraction (RE) to extract relations between pairwise entities.
  5. Perform entity linking (EL) leveraging the ER results.
  6. Promote the extracted entities and relations up to the semantic overlay.
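
Since this tutorial does not include the EL step yet, the following is
only a rough sketch of what step 5 could look like: candidate spans
from the lexical graph get matched against names of entities already
resolved from structured data. The `ResolvedEntity` class, the sample
records, and the exact-match rule after normalization are all
assumptions for illustration; a real pipeline would lean on lemmas,
context, and the matching provided by the ER results themselves.

```python
from dataclasses import dataclass, field


@dataclass
class ResolvedEntity:
    """An entity produced by ER over structured data (hypothetical shape)."""
    entity_id: str
    names: set[str] = field(default_factory=set)


# Hypothetical ER results, standing in for the "backbone" of the KG.
RESOLVED = [
    ResolvedEntity("org:001", {"Acme Foods", "Acme Foods Inc."}),
    ResolvedEntity("per:007", {"Jane Doe"}),
]


def normalize(span: str) -> str:
    """Lowercase and collapse whitespace, as a crude stand-in for lemmas."""
    return " ".join(span.lower().split())


def link_entities(candidates, resolved):
    """Split candidate spans into those linked to an ER entity and the rest."""
    index = {
        normalize(name): entity.entity_id
        for entity in resolved
        for name in entity.names
    }

    linked: dict[str, str] = {}
    unlinked: list[str] = []

    for span in candidates:
        key = normalize(span)

        if key in index:
            linked[span] = index[key]
        else:
            unlinked.append(span)

    return linked, unlinked


linked, unlinked = link_entities(
    ["ACME Foods", "acme  foods inc.", "dementia"],
    RESOLVED,
)
print(linked)
print(unlinked)
```

Spans which stay unlinked are not errors: they remain candidates for
curation before getting promoted up to the semantic overlay.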

Of course many vendors suggest using a _large language model_ (LLM) as
a _one-size-fits-all_ (OSFA) "black box" approach for extracting
entities and generating an entire graph **automagically**.

However, the business process of _resolution_ -- for both entities and
relations -- requires _judgements_. If the entities getting resolved
are low-risk, low-effort in nature, then yeah knock yourself out. If
the entities represent _people_ or _organizations_, these have agency
and, when misrepresented in applications, may take actions which have
consequences.

Whenever judgements get delegated to _model-based_ approaches,
_generalization_ becomes a form of reasoning employed.  When the
technology within the model is based on _loss functions_, then
generalization becomes dominant -- regardless of any marketing claims
about "AI reasoning" made by tech firms.

Fortunately, decisions can be made _without models_, even in AI
applications. Shock, horror!!! Please, say it isn't so!?! Brace
yourselves, using models is a thing, but not the only thing.  For more
detailed discussion, see:

  * Part 1: Let's talk about "Today's AI" <https://www.linkedin.com/pulse/lets-talk-todays-ai-paco-nathan-co60c/>
  * Part 2: Let's talk about "Resolution" <https://www.linkedin.com/pulse/lets-talk-resolution-paco-nathan-ryjhc/>

Also keep in mind that black box approaches don't work especially well
for regulated environments, where audits, explanations, evidence, data
provenance, etc., are required.

Moreover, KGs used in mission-critical apps, such as investigations,
generally require periodic data updates, so construction isn't a
one-step process. By producing a KG based on the approach sketched
above, updates can be handled more effectively.  Any downstream use
cases, such as AI applications, also benefit from improved quality of
semantics and representation.


## Experiment: Relation Extraction library evals

Current Python libraries for _relation extraction_ (RE) are
probably best characterized as "experimental research projects".

Their tokenization approaches tend to make the mistake of "throwing
the baby out with the bath water" by not leveraging other available
information, e.g., what we have in the _textgraph_ representation of
the parsed documents. Also, they tend to ignore the semantic
constraints of the domain context, while computationally boiling
the ocean.

RE libraries which have been evaluated:

  * `GLiREL`: <https://github.com/jackboyla/GLiREL>
  * `ReLIK`: <https://github.com/SapienzaNLP/relik>
  * `OpenNRE`: <https://github.com/thunlp/OpenNRE>
  * `mREBEL`: <https://github.com/Babelscape/rebel>

This project had used `GLiREL` although its results were quite sparse.
RE will be replaced by `BAML` or `DSPy` workflows in the near future.

There is some experimental code which illustrates `OpenNRE` evaluation.
Use the `archive/nre.sh` script to load OpenNRE pre-trained models
before running the `archive/opennre.ipynb` notebook.

This may not work in many environments, depending on how well the
`OpenNRE` library is being maintained.


## Tutorial notebooks

There is a collection of Jupyter notebooks (now archived) which
were used to prototype code. These help illustrate important
intermediate steps within these workflows:

```bash
.venv/bin/jupyter-lab
```

  * Part 1: `archive/construct.ipynb` -- detailed KG construction using a lexical graph
  * Part 2: `archive/chunk.ipynb` -- simple example of how to scrape and chunk text
  * Part 3: `archive/vector.ipynb` -- query LanceDB table for text chunk embeddings (after running `build.py`)
  * Part 4: `archive/embed.ipynb` -- query the entity embedding model (after running `build.py`)


## Developer notes

After each `BAML` release update, some committer needs to regenerate
its Python client source:

```bash
poetry run baml-cli generate --from strwythura/baml_src
```

Kudos to @prrao87, @hellovai, @louisguitton, @cj2001


## FAQ

Q: "Have you tried this with `langextract` yet?"  
A: "I'll take `How does an instructor know a student ignored the README?` from the [`FAFO`](https://en.wiktionary.org/wiki/fuck_around_and_find_out) category, for $200" ... but yes of course, it's an interesting package, building on other interesting work used here.

Q: "What the hell is the name of this repo about?"  
A: "As you may have noticed, many open source projects by Derwen are named in a beautiful language Gymraeg, which English speakers call 'Welsh', where this word [`strwythura`](https://translate.google.com/details?sl=cy&tl=en&text=strwythura&op=translate) translates as the verb **'structure'** in English."

Q: "Why aren't you using an LLM instead to build the graph?"  
A: "I promise to visit you in jail."


## License and Copyright

Source code, documentation, and examples have an
[MIT license](https://spdx.org/licenses/MIT.html)
which is succinct and simplifies use in commercial applications.

All materials herein are Copyright © 2024-2025 Senzing, Inc.


## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=derwenai/strwythura&type=Date)](https://star-history.com/#derwenai/strwythura&Date)

