Metadata-Version: 2.1
Name: kgzip
Version: 0.1.0
Summary: Knowledge graph compression engine: parallel-decodable subgraph capsules for fast, storage-efficient KG queries.
Author: Ayush Mukherjee
License: MIT
Keywords: knowledge-graph,graph,compression,rdf,networkx,neo4j
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: rdflib>=6.0
Requires-Dist: networkx>=2.6
Requires-Dist: pandas>=1.3
Requires-Dist: python-louvain>=0.16
Requires-Dist: numpy>=1.21
Requires-Dist: scipy>=1.7
Requires-Dist: msgpack>=1.0
Requires-Dist: zstandard>=0.18
Requires-Dist: nest-asyncio>=1.5
Requires-Dist: filelock>=3.4
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Provides-Extra: neo4j
Requires-Dist: neo4j>=5.0; extra == "neo4j"

# KGZip

**A compression engine for knowledge graphs.** KGZip takes a knowledge graph and
splits it into small, independently-loadable pieces called *capsules*, so that when
you ask a question about one part of the graph, you only read that part — not the
whole thing. The result is a store that is **smaller on disk** and lets you **query
large graphs without loading them entirely into memory**.

This README assumes **no prior knowledge of knowledge graphs**. If you already know
the basics, jump to [Quickstart](#quickstart) or the [API reference](#api-reference).

---

## New here? Start with the concepts

### What is a knowledge graph?

A **knowledge graph (KG)** is just data stored as **things** and the **relationships
between them**.

- A **node** is a thing: a drug, a disease, a person, a movie.
- An **edge** is a relationship connecting two nodes: *Aspirin* **treats** *Headache*.

```
(Aspirin) --treats--> (Headache) --associated_with--> (GeneX)
```

Here `Aspirin`, `Headache`, and `GeneX` are nodes; `treats` and `associated_with`
are edges (also called *relations*). Each node can also carry **attributes**
(properties), e.g. Aspirin's `{ "formula": "C9H8O4" }`.
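
As a rough illustration (plain Python, not KGZip's internal representation), the example above can be written as a list of `(source, relation, target)` triples plus a dictionary of node attributes. The variable and function names are made up for this sketch.

```python
# Illustrative only: a tiny knowledge graph as plain Python data.
# Each edge is a (source, relation, target) triple.
edges = [
    ("Aspirin", "treats", "Headache"),
    ("Headache", "associated_with", "GeneX"),
]

# Node attributes (properties) keyed by node ID.
attributes = {
    "Aspirin": {"formula": "C9H8O4"},
}

# The node set is every ID that appears at either end of an edge.
nodes = sorted({n for src, _, dst in edges for n in (src, dst)})

# Outgoing neighbours of a node, found by scanning the edge list.
def neighbours(node):
    return [(rel, dst) for src, rel, dst in edges if src == node]

print(nodes)
print(neighbours("Aspirin"))
```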

That's the whole idea. Real KGs just have many more nodes and edges (thousands to
billions), often describing a domain like medicine, finance, or social networks.

### What problem does KGZip solve?

When a graph is large, two things get painful:

1. **Storage** — keeping the whole graph around costs space.
2. **Querying** — to answer "what is near node X?", naive tools load or scan the
   entire graph, even though you only care about a tiny neighbourhood.

KGZip pre-organises the graph into **capsules** (clusters of closely-related nodes)
and writes a small **manifest** (an index). A query then loads only the capsules it
needs. Think of it like a book with chapters and a table of contents: to read about
one topic you open one chapter, not the entire book.

### The golden rule: KGZip is a *read replica*

Your original graph (in a file, or in a database like Neo4j) is always the **source
of truth** — the "master". KGZip builds a **compressed copy** from it for fast
reads. **KGZip never modifies your original data.** If the KGZip store is ever lost
or corrupted, you can always rebuild it from the master.

### Is it lossless?

Yes. KGZip v1 is **lossless**: if you compress a graph and then ask for all of its
nodes back, you get *every node and every edge* exactly as they were. Capsules store
boundary-crossing edges (and a small "halo" of neighbouring nodes) precisely so that
nothing is lost when the pieces are reassembled.
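
To see why storing boundary-crossing edges keeps the split lossless, here is a small sketch in plain Python. The partition, the `split_into_capsules` helper, and the capsule layout are hypothetical stand-ins, not KGZip's actual encoding; the point is only that the union of the pieces equals the original edge set.

```python
# Sketch only: split a graph into "capsules" and reassemble it.
edges = {
    ("a", "b"), ("b", "c"),      # inside cluster 0
    ("d", "e"),                  # inside cluster 1
    ("c", "d"),                  # crosses the boundary between the clusters
}
cluster_of = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}  # assumed partition

def split_into_capsules(edges, cluster_of):
    """Give every edge to the capsule of each endpoint (hypothetical layout)."""
    capsules = {}
    for src, dst in edges:
        for cid in {cluster_of[src], cluster_of[dst]}:
            cap = capsules.setdefault(cid, {"nodes": set(), "edges": set()})
            cap["edges"].add((src, dst))
            # Endpoints that belong to the other cluster form the "halo".
            cap["nodes"].update((src, dst))
    return capsules

capsules = split_into_capsules(edges, cluster_of)

# Reassemble: the union of all capsule edges is the original edge set.
restored = set().union(*(cap["edges"] for cap in capsules.values()))
assert restored == edges

# Node "d" lives in cluster 1 but also appears in capsule 0 as a halo node.
print(sorted(capsules[0]["nodes"]))
```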

---

## Install

```bash
pip install kgzip

# optional: to read directly from a Neo4j database
pip install "kgzip[neo4j]"
```

From source (for development):

```bash
git clone <repo-url> && cd KGZip
pip install -e ".[dev]"
pytest                       # run the test suite
```

Requires **Python ≥ 3.8**. Works in plain scripts and in Jupyter notebooks.

---

## Quickstart

Four steps to compress a graph, query it, and inspect the result:

```python
import networkx as nx
from kgzip import KGZipStore

store = KGZipStore("./my_store")              # 1. where the compressed store lives
store.compress(nx.karate_club_graph())        # 2. build capsules from a graph
result = store.query(["0", "1"], depth=2)     # 3. ask: what's around nodes 0 and 1?
print(result.subgraph.meta.node_count)        # 4. how many nodes came back
```

What just happened, line by line:

1. `KGZipStore("./my_store")` — open (or prepare to create) a store in that folder.
   Nothing is read or written yet.
2. `compress(...)` — read the graph, cluster it into capsules, and write the capsule
   files plus a manifest into `./my_store`.
3. `query(["0","1"], depth=2)` — find the capsules containing nodes `"0"` and `"1"`,
   expand outward `depth` hops, decode just those capsules, and merge them.
4. The answer is a `QueryResult`; its `.subgraph` is a `NormalizedGraph`, and
   `meta.node_count` reports how many nodes came back.

---

## Loading data from different sources

`compress()` accepts several common graph formats. You don't convert anything
yourself — KGZip detects the type and reads it.

```python
store.compress(nx.karate_club_graph())   # a NetworkX graph object (in memory)
store.compress("graph.ttl")              # RDF / Turtle file  (.ttl, .n3, .nt)
store.compress("graph.jsonld")           # JSON-LD file
store.compress("edges.csv")              # CSV edge list (see format below)
store.compress("bolt://localhost:7687")  # a live Neo4j database (see below)
```

### CSV edge-list format

The simplest way to bring your own data. One row per edge:

```csv
src,dst,relation,weight
Aspirin,Headache,treats,1.0
Headache,GeneX,associated_with,1.0
```

- `src`, `dst` — **required**: the two node IDs the edge connects.
- `relation` — optional: the edge type (defaults to `related_to`).
- `weight` — optional: a number (defaults to `1.0`).
- `src_type`, `dst_type` — optional: node categories (default `unknown`).
- Any other columns are kept as edge attributes.
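
A quick way to produce a file in this shape is Python's standard `csv` module. The sketch below writes a minimal edge list and then reads it back, filling in the documented defaults by hand to show what an omitted column means. The `read_edges` helper and the extra `source` column are illustrative; this is not how KGZip itself parses CSV.

```python
import csv
import io

# Documented defaults for the optional columns.
DEFAULTS = {"relation": "related_to", "weight": "1.0",
            "src_type": "unknown", "dst_type": "unknown"}

def read_edges(text):
    """Parse an edge-list CSV, applying defaults (illustrative helper)."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        if not row.get("src") or not row.get("dst"):
            raise ValueError("each row needs 'src' and 'dst'")
        # Non-empty cells override the defaults; extra columns are kept.
        edge = {**DEFAULTS, **{k: v for k, v in row.items() if v}}
        edge["weight"] = float(edge["weight"])
        rows.append(edge)
    return rows

# Write a CSV that has only the required columns plus one extra attribute.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["src", "dst", "source"])
writer.writerow(["Aspirin", "Headache", "manual"])

edges = read_edges(buf.getvalue())
print(edges[0])
```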

### Reading directly from Neo4j

[Neo4j](https://neo4j.com/) is a popular graph **database**. KGZip can read a full
snapshot of it over **Bolt** (Neo4j's network connection protocol — the `bolt://`
address is just "where the database is listening"). You supply the connection URL
and, if the server requires it, your login:

```python
from kgzip import KGZipStore

# If your Neo4j has no authentication:
store = KGZipStore("./my_store")
store.compress("bolt://localhost:7687")
```

Most Neo4j databases require a username and password. Pass them via the store's
`IngestionConfig`:

```python
from kgzip import KGZipStore
from kgzip.models import KGZipConfig, IngestionConfig

config = KGZipConfig(
    ingestion=IngestionConfig(
        neo4j_auth=("neo4j", "your-password"),  # (username, password)
        neo4j_database=None,                    # database name; None = server default
        neo4j_node_label=None,                  # only nodes with this label; None = all
    ),
)
store = KGZipStore("./my_store", config)
store.compress("bolt://localhost:7687")          # one-time snapshot read + compress
```

KGZip reads every node (`id(n)` becomes the node ID, the first label becomes the
node type, properties become attributes) and every relationship. It **only reads** —
your Neo4j data is never changed.

---

## How a query works (the mental model)

```
   compress() once:                 query() many times:

   master graph                     query(["X"], depth=2)
        │                                  │
        ▼                                  ▼
   ┌──────────┐                     find capsule holding "X"
   │ capsules │  ◄──── reads only ──── + its neighbour capsules
   │ + manifest                          │
   └──────────┘                          ▼
   (on local disk)                  decode those capsules, merge
                                          │
                                          ▼
                                     QueryResult.subgraph
```

- **`depth`** controls how far out from your seed nodes to reach. `depth=1` is "the
  seed nodes and their immediate surroundings"; higher depth pulls in more.
- KGZip retrieves at **capsule granularity** — it returns whole clusters, so the
  result is a *superset* of the exact neighbourhood (great recall; some extra nodes).
  Asking for *all* nodes always returns the complete original graph (lossless).
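
The difference between capsule-granularity retrieval and an exact neighbourhood can be sketched with a toy graph. Everything below (the adjacency map, the `capsule_of` assignment, and both helper functions) is an assumption made for illustration; KGZip's real planner may select capsules differently.

```python
from collections import deque

# Toy undirected graph as an adjacency map, with an assumed capsule assignment.
adj = {
    "X": {"a"}, "a": {"X", "b"}, "b": {"a", "c"},
    "c": {"b", "d"}, "d": {"c"},
}
capsule_of = {"X": 0, "a": 0, "b": 0, "c": 1, "d": 1}

def exact_neighbourhood(seeds, depth):
    """Breadth-first search: nodes within `depth` hops of any seed."""
    seen = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if seen[node] == depth:
            continue  # do not expand past the requested depth
        for nxt in adj[node]:
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    return set(seen)

def capsule_granularity(seeds, depth):
    """Return every node of every capsule the exact neighbourhood touches."""
    hit = {capsule_of[n] for n in exact_neighbourhood(seeds, depth)}
    return {n for n, cid in capsule_of.items() if cid in hit}

exact = exact_neighbourhood(["X"], depth=1)
whole = capsule_granularity(["X"], depth=1)
print(sorted(exact))   # nodes within one hop of X
print(sorted(whole))   # the whole capsule that contains them
assert exact <= whole  # the capsule result is a superset
```

Pruning `whole` back down to `exact` is the idea behind the `trim=True` option described in the API reference.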

---

## API reference

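`KGZipStore` is the **only class you need** for everyday use. The configuration
models, error types, and serialization helpers shown below are optional; everything
else is internal.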

### Creating a store

```python
KGZipStore(path, config=None)
```
- `path` — folder for the compressed store (created on first `compress()`).
- `config` — optional `KGZipConfig` to tune clustering/compression (see below).
- The manifest is loaded **lazily** (on your first `query()`), so creating a store
  is instant and does no I/O.
- Works as a **context manager**: `with KGZipStore(path) as store: ...`.

### `compress(graph, *, config=None) → CapsuleStoreRef`
Builds the compressed store from a graph. Accepts any supported source (NetworkX
object, file path, or `bolt://` URL). Steps it runs for you: ingest → cluster →
encode → write capsules → write manifest (written last, as the safe commit point).

- Returns a **`CapsuleStoreRef`** describing the new store: `manifest_path`,
  `capsule_count`, `total_bytes`, `gcs_summary`, `store_version`, `created_at`.
- **Idempotent** by default (`StorageConfig.overwrite=False`): re-compressing the
  same graph skips capsules whose content hasn't changed.
- Thread/process-safe: takes a file lock so two compresses can't clobber each other.

```python
ref = store.compress("edges.csv")
print(ref.capsule_count, ref.total_bytes)
```
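
The "skip unchanged capsules, write the manifest last" pattern can be sketched with the standard library alone. This is not KGZip's writer: the file names, the JSON payloads, and the `write_store` helper are hypothetical, and the file lock mentioned above is omitted. It shows one common way to get an idempotent write with a single commit point.

```python
import hashlib
import json
import os
import tempfile

def write_store(folder, capsules):
    """Write changed capsules, then replace the manifest in one rename."""
    os.makedirs(folder, exist_ok=True)
    manifest_path = os.path.join(folder, "manifest.json")  # hypothetical name
    old = {}
    if os.path.exists(manifest_path):
        with open(manifest_path) as fh:
            old = json.load(fh)

    new, written = {}, 0
    for name, content in capsules.items():
        payload = json.dumps(content, sort_keys=True).encode()
        digest = hashlib.sha256(payload).hexdigest()
        new[name] = digest
        if old.get(name) == digest:
            continue                      # unchanged content: skip the write
        with open(os.path.join(folder, name + ".bin"), "wb") as fh:
            fh.write(payload)
        written += 1

    # Commit point: write to a temp file, then rename over the old manifest.
    fd, tmp = tempfile.mkstemp(dir=folder)
    with os.fdopen(fd, "w") as fh:
        json.dump(new, fh)
    os.replace(tmp, manifest_path)
    return written

folder = tempfile.mkdtemp()
data = {"capsule_0": {"nodes": ["a", "b"]}, "capsule_1": {"nodes": ["c"]}}
first = write_store(folder, data)    # both capsules are new
second = write_store(folder, data)   # nothing changed, nothing rewritten
print(first, second)
```

Because the manifest is swapped in with a single rename, a reader on the same filesystem should see either the old index or the new one, not a half-written file.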

### `query(node_ids, depth=1, **kwargs) → QueryResult`
Fetch the subgraph around one or more seed nodes.

- `node_ids` — list of node IDs to start from (must be non-empty).
- `depth` — how many hops to expand. `depth=1` is the seeds and their immediate
  surroundings; higher pulls in more. **`depth=None` = unbounded** (follow the graph
  until nothing new is reachable — the whole connected subgraph).
- Optional keyword arguments:
  - `trim: bool = False` — **token control.** `False` returns the full capsule
    contents (a *superset* of the neighbourhood — more context). `True` prunes the
    result down to the **exact `depth`-hop neighbourhood** of your seeds. Trimming is
    *lossless relative to the query* (it never drops anything within `depth` hops) and
    can cut output ~100× on large graphs. See [Saving tokens](#saving-tokens).
  - `max_capsules: int = 50` — safety cap on how many capsules one query may load.
    **Set higher, or `None`, to fetch large/complete subgraphs.** If the cap limits a
    result, `QueryResult.truncated` is set to `True` (never a silent partial answer).
  - `relation_filter: list[str]` — keep only edges of these relation types.
  - `consistency: "eventual" | "strict"` — `"strict"` re-fetches stale parts from
    the master via `master_kg_fn` instead of serving possibly-stale capsule data.
  - `timeout_ms: int` — max time to wait for parallel decoding (default 5000).
  - `master_kg_fn: Callable` — required when `consistency="strict"`; you write a
    function `node_ids -> fresh subgraph` that fetches from your master.

Returns a **`QueryResult`**:

| Field | Meaning |
|---|---|
| `subgraph` | the merged result graph (a `NormalizedGraph`) |
| `capsules_loaded` | how many capsules were read |
| `latency_ms` | how long the query took |
| `stale_capsules` | IDs of capsules flagged stale |
| `fallback_used` | `True` if the master was consulted (strict mode) |
| `query_node_ids_not_found` | seed IDs that weren't in the store |
| `truncated` | `True` if `max_capsules` limited the result (it's incomplete) |

```python
# Token-lean: exact 2-hop neighbourhood, only "treats" edges
res = store.query(["Aspirin"], depth=2, trim=True, relation_filter=["treats"])

# Agent escape hatch: not satisfied? fetch everything reachable, no caps
res = store.query(["Aspirin"], depth=None, max_capsules=None)
if res.truncated:
    print("result was capped — raise max_capsules")
```

#### Iterative deepening (for AI agents)
The defaults are safe: a query never silently returns *less* than the true
neighbourhood (if `max_capsules` cuts it short, `truncated` is `True`). An agent can
start cheap and widen only when needed:

```python
# `seeds`, `not_enough`, and `still_not_enough` are placeholders you define yourself.
res = store.query(seeds, depth=1, trim=True)      # cheap, few tokens
if not_enough(res):
    res = store.query(seeds, depth=3, trim=True)   # go deeper
if still_not_enough(res):
    res = store.query(seeds, depth=None, max_capsules=None)  # the whole reachable graph
```
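
One way to package that loop is a small helper that tries a list of query settings in order and stops at the first result that satisfies a caller-supplied check. The `widen_until` name, the `PLANS` list, and the fake query function used to exercise it are all assumptions for this sketch; in real use you would pass `store.query` instead.

```python
def widen_until(query, seeds, enough, plans):
    """Run `query` with each plan in turn; return the first good result."""
    result = None
    for kwargs in plans:
        result = query(seeds, **kwargs)
        if enough(result):
            break
    return result  # the last attempt if nothing satisfied `enough`

# Cheapest first, widest last (mirrors the three steps above).
PLANS = [
    {"depth": 1, "trim": True},
    {"depth": 3, "trim": True},
    {"depth": None, "max_capsules": None},
]

# Stand-in for store.query so the sketch runs without a store:
# it just reports how many nodes each setting would "return".
def fake_query(seeds, depth=1, **kwargs):
    return {"depth": depth, "node_count": 1000 if depth is None else 10 * depth}

res = widen_until(fake_query, ["Aspirin"], lambda r: r["node_count"] >= 25, PLANS)
print(res)
```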

### `sync(master_graph=None) → SyncReport`
Keep the store consistent with a changed master.
- `sync()` with no argument → marks all capsules **stale** (they'll be treated as
  out-of-date until rebuilt).
- `sync(updated_graph)` → re-compresses the store from the updated graph.
- Returns a **`SyncReport`**: `stale_count`, `re_encoded_count`, `skipped_count`,
  `sync_duration_ms`.

### `status() → StoreStatus`
A safe, never-raises health check.
- Returns **`StoreStatus`**: `exists`, `capsule_count`, `stale_count`, `total_bytes`,
  `store_version`, `last_encoded_at`.
- `exists=False` means nothing has been compressed yet.

```python
if not store.status().exists:
    store.compress(my_graph)
```

### Configuration

Tune how KGZip clusters and compresses. Defaults are sensible — change these only if
you need to.

```python
from kgzip.models import KGZipConfig, DecisionConfig, StorageConfig

config = KGZipConfig(
    decision=DecisionConfig(
        max_capsule_nodes=500,   # biggest a capsule may get (bigger ones are split)
        min_capsule_nodes=5,     # smallest; tiny clusters merge into a neighbour
        spectral_k=8,            # size of each capsule's structural "fingerprint"
        random_seed=42,          # makes clustering reproducible
    ),
    storage=StorageConfig(
        base_path="./my_store",
        compression="zstd",      # "zstd" (best) | "gzip" | "none"
        compression_level=3,     # 1–19 for zstd (higher = smaller, slower)
        overwrite=False,         # True = always re-encode, even if unchanged
    ),
)
store = KGZipStore("./my_store", config)
```

### Errors

Every error KGZip raises is a subclass of **`kgzip.KGZipError`** and carries a
`message` plus a `context` dict for debugging. Common ones:

| Exception | When |
|---|---|
| `EmptyGraphError` | the input graph has no nodes |
| `SchemaError` | a CSV is missing required `src`/`dst` columns |
| `SoftDependencyError` | an optional library (e.g. `neo4j`) isn't installed |
| `ConnectionError` | a Neo4j database couldn't be reached |
| `StoreNotFoundError` | you queried before compressing |
| `CorruptionError` / `VersionError` | a capsule file is damaged or wrong version |
| `QueryError` | bad query input (e.g. empty `node_ids`) |

```python
from kgzip import KGZipError
try:
    store.query([], depth=1)
except KGZipError as e:
    print(e.message, e.context)
```

---

## Saving tokens

If you feed query results to an LLM/agent, the number of tokens matters. Two levers,
both **lossless** (they remove waste, not information you asked for):

**1. `trim=True` — return only the exact neighbourhood.** Without it, a query returns
the seed's whole community capsule (lots of extra context). With it, you get exactly
the `depth`-hop neighbourhood.

**2. Compact serialization** — render the subgraph as terse triples instead of verbose
JSON, with optional attribute projection:

```python
from kgzip import to_triples, to_compact

res = store.query(["Aspirin"], depth=2, trim=True)

print(to_triples(res.subgraph))
# Aspirin --treats--> Headache
# Headache --associated_with--> GeneX

to_compact(res.subgraph)                                   # ids + types only (leanest)
to_compact(res.subgraph, attrs=["name"], include_attrs=True)  # keep only the 'name' attr
```

Measured on a 1,000-node medical KG, average tokens for a single depth-2 query
(`chars/4` estimate):

| Strategy | tokens | vs naive |
|---|---:|---:|
| Full-capsule result, verbose JSON | 54,114 | 1× |
| Full-capsule result, compact triples | 26,191 | 2.1× less |
| **`trim=True` + compact triples (= exact neighbourhood)** | **367** | **~147× less** |

The trimmed output equals what a precise Neo4j neighbourhood query would return — so
you get targeted-query token cost *plus* KGZip's storage/offline benefits. If you need
more, just widen `depth` or set `trim=False`; nothing is lost, it's your choice.
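
The `chars/4` figure is a rough character-count heuristic, not a real tokenizer. A minimal sketch of how such an estimate compares two renderings of the same edges follows; the `estimate_tokens` helper and the sample data are illustrative and are not the benchmark behind the table.

```python
import json

def estimate_tokens(text):
    """Rough token estimate: one token per four characters."""
    return len(text) // 4

edges = [
    {"src": "Aspirin", "relation": "treats", "dst": "Headache", "weight": 1.0},
    {"src": "Headache", "relation": "associated_with", "dst": "GeneX", "weight": 1.0},
]

# Verbose rendering: pretty-printed JSON with every field.
verbose = json.dumps(edges, indent=2)

# Compact rendering: one terse triple per line.
compact = "\n".join(f"{e['src']} --{e['relation']}--> {e['dst']}" for e in edges)

print(estimate_tokens(verbose), estimate_tokens(compact))
assert estimate_tokens(compact) < estimate_tokens(verbose)
```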

---

## When should I (not) use KGZip?

- ✅ Your graph is **large** and you mostly read **local neighbourhoods**.
- ✅ You want a **smaller on-disk** representation than raw JSON.
- ✅ Your graph **doesn't fit comfortably in memory**, so you must read from storage.
- ❌ Your graph is small and fits in RAM, and you traverse it repeatedly in-process —
  plain in-memory traversal (e.g. NetworkX) will be faster. KGZip's wins are storage
  size and avoiding full-graph loads, **not** beating RAM-speed traversal.

---

## How it works (under the hood)

KGZip is built as five layers; you only ever touch the last one (`KGZipStore`).

1. **Ingestion (L1)** — any input → a clean, immutable `NormalizedGraph` with unique
   string node IDs.
2. **Decision (L2)** — analyse the graph, detect communities (clusters of densely
   connected nodes via the Louvain algorithm), and plan one capsule per community.
3. **Encoding (L3)** — write each capsule as a compact binary `.kgzc` file (magic
   bytes, version, header, SHA-256 checksum, compressed payload). The
   `manifest.kgz.json` index is written **last**, as the atomic commit point.
4. **Query (L4)** — use the manifest to find the right capsules, decode them in
   parallel, verify their checksums, and merge.
5. **Facade (L5)** — `KGZipStore` ties it together with locking and lazy loading.
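
For readers curious what "magic bytes, version, header, checksum, compressed payload" can look like in practice, here is a toy container built from the standard library. The byte layout, the `KGZX` magic value, and the use of `zlib` (instead of zstd) are all assumptions chosen to keep the sketch dependency-free; the real `.kgzc` format is not specified here.

```python
import hashlib
import json
import struct
import zlib

MAGIC = b"KGZX"          # hypothetical magic bytes, not the real .kgzc value
VERSION = 1
HEADER = struct.Struct(">4sHI32s")  # magic, version, payload length, SHA-256

def encode_capsule(graph):
    """Serialize, compress, and prefix a fixed-size header with a checksum."""
    payload = zlib.compress(json.dumps(graph, sort_keys=True).encode())
    digest = hashlib.sha256(payload).digest()
    return HEADER.pack(MAGIC, VERSION, len(payload), digest) + payload

def decode_capsule(blob):
    """Check magic, version, and checksum before decompressing."""
    magic, version, length, digest = HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise ValueError("not a capsule file")
    if version != VERSION:
        raise ValueError(f"unsupported version {version}")
    payload = blob[HEADER.size:HEADER.size + length]
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("checksum mismatch: capsule is corrupted")
    return json.loads(zlib.decompress(payload))

capsule = {"nodes": ["Aspirin", "Headache"],
           "edges": [["Aspirin", "treats", "Headache"]]}
blob = encode_capsule(capsule)
assert decode_capsule(blob) == capsule   # round trip is lossless

# Flipping one payload byte is caught by the checksum.
damaged = blob[:-1] + bytes([blob[-1] ^ 0xFF])
try:
    decode_capsule(damaged)
except ValueError as err:
    print(err)
```

Verifying the checksum before decompressing means a damaged file is rejected up front instead of producing a silently wrong graph.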

---

## Glossary

- **Node** — a thing in the graph (has an ID, a type, and attributes).
- **Edge / relation** — a directed connection between two nodes (has a type and weight).
- **Capsule** — a cluster of related nodes, stored as one `.kgzc` file. The unit KGZip
  loads per query.
- **Manifest** — `manifest.kgz.json`, the index that maps nodes to capsules.
- **Community** — a group of nodes more densely connected to each other than to the
  rest of the graph; KGZip turns each into a capsule.
- **Boundary / halo node** — a node on the edge of a capsule that also connects to a
  neighbouring capsule; stored in both so no edge is lost.
- **Depth** — how many hops outward from your seed nodes a query reaches (`None` = unbounded).
- **Trim** — prune a query result to the exact depth-hop neighbourhood (token-lean,
  lossless w.r.t. the query). Opt-in via `trim=True`.
- **Truncated** — a query result flagged `truncated=True` because `max_capsules` capped
  it; the answer is incomplete and you should raise the cap.
- **Master** — your original source-of-truth graph. KGZip never writes to it.
- **Lossless** — compressing then querying everything returns the exact original graph.
- **Bolt** — Neo4j's network protocol; a `bolt://host:port` URL is the database address.
