Metadata-Version: 2.4
Name: graphforge
Version: 0.4.0
Summary: Composable graph tooling for analysis, construction, and refinement
Project-URL: Homepage, https://github.com/DecisionNerd/graphforge
Project-URL: Repository, https://github.com/DecisionNerd/graphforge
Project-URL: Issues, https://github.com/DecisionNerd/graphforge/issues
Author: David Spencer
License: MIT
License-File: LICENSE
Keywords: analysis,graph,opencypher,pydantic
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Requires-Dist: defusedxml>=0.7.1
Requires-Dist: isodate>=0.6.1
Requires-Dist: lark>=1.1
Requires-Dist: msgpack>=1.0
Requires-Dist: pydantic>=2.6
Requires-Dist: python-dateutil>=2.8.2
Requires-Dist: pyyaml>=6.0.3
Provides-Extra: analytics
Requires-Dist: networkx>=3.0; extra == 'analytics'
Requires-Dist: pandas>=2.0; extra == 'analytics'
Requires-Dist: python-igraph>=0.11; extra == 'analytics'
Provides-Extra: dev
Requires-Dist: hypothesis>=6.0; extra == 'dev'
Requires-Dist: pytest-bdd>=7.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest-mock>=3.0; extra == 'dev'
Requires-Dist: pytest-split>=0.9.0; extra == 'dev'
Requires-Dist: pytest-timeout>=2.0; extra == 'dev'
Requires-Dist: pytest-xdist>=3.0; extra == 'dev'
Requires-Dist: pytest>=9.0.3; extra == 'dev'
Requires-Dist: ruff==0.15.12; extra == 'dev'
Provides-Extra: igraph
Requires-Dist: python-igraph>=0.11; extra == 'igraph'
Provides-Extra: networkx
Requires-Dist: networkx>=3.0; extra == 'networkx'
Provides-Extra: pandas
Requires-Dist: pandas>=2.0; extra == 'pandas'
Description-Content-Type: text/markdown

<h1 align="center">GraphForge</h1>

<p align="center">
  <a href="https://pypi.org/project/graphforge/"><img src="https://img.shields.io/pypi/v/graphforge.svg?label=PyPI&logo=pypi" alt="PyPI version" /></a>
  <a href="https://pypi.org/project/graphforge/"><img src="https://img.shields.io/pypi/dm/graphforge.svg?label=Downloads" alt="Monthly downloads" /></a>
  <a href="https://pypi.org/project/graphforge/"><img src="https://img.shields.io/pypi/pyversions/graphforge.svg?logo=python&logoColor=white" alt="Python versions" /></a>
  <a href="https://github.com/DecisionNerd/graphforge/actions"><img src="https://github.com/DecisionNerd/graphforge/workflows/Test%20Suite/badge.svg" alt="Build status" /></a>
  <a href="https://codecov.io/gh/DecisionNerd/graphforge"><img src="https://codecov.io/gh/DecisionNerd/graphforge/graph/badge.svg" alt="Coverage" /></a>
  <a href="https://github.com/DecisionNerd/graphforge/blob/main/LICENSE"><img src="https://img.shields.io/badge/license-MIT-blue.svg" alt="License" /></a>
</p>

<p align="center">
  <strong>Composable graph tooling for analysis, construction, and refinement</strong>
</p>

<p align="center">
  A lightweight, embedded, openCypher-compatible graph engine for research and investigative workflows
</p>

---

## Table of Contents

- [Why GraphForge?](#why-graphforge)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Cypher Features](#cypher-features)
- [Datasets](#datasets)
- [Transactions](#transactions)
- [Architecture](#architecture)
- [Development](#development)
- [Roadmap](#roadmap)
- [License](#license)

---

## Why GraphForge?

> *We are not building a database for applications.*
> *We are building a graph execution environment for thinking.*

Modern data science and ML workflows increasingly produce graph-shaped data — entity relationships extracted by LLMs, citation networks, dependency graphs, social connections, knowledge bases. Working with this data shouldn't require running a database server. GraphForge brings the full expressiveness of the openCypher query language to the Python notebook and script environment: zero configuration, single-file persistence, and first-class Python integration.

| | NetworkX | **GraphForge** | Neo4j / Memgraph |
|:---|:---|:---|:---|
| **Setup** | `pip install` | `pip install` | Run a server |
| **Query language** | Python API | **Full openCypher** | Full Cypher |
| **Persistence** | Manual | **SQLite (automatic)** | Native |
| **Notebook-friendly** | ✓ | ✓ | Requires connection |
| **Graph size** | Millions | up to ~20M edges† | Billions |
| **TCK compliance** | N/A | **100% (3,885/3,885)** | ~100% |

**Use GraphForge for:** knowledge graphs, citation networks, research workflows, LLM output storage, social network analysis in notebooks.

**Use a production database for:** high throughput, multi-user access, or graphs beyond the limits in [Scale Limits](docs/reference/scale-limits.md).

† *Traversal queries with LIMIT scale to ~20M edges; full-scan aggregations are practical up to ~1M edges.*

### v0.4.0 — Three-Surface API

v0.4.0 ships two new API surfaces alongside the existing Cypher executor:

- **`db.gds`** — 8 compiled graph algorithms (PageRank, betweenness, Louvain, triangle count, and more) dispatched to igraph or NetworkX. Results write back to node properties and are immediately queryable via Cypher.
- **`db.search`** — hybrid retrieval combining FTS5 text search and vector cosine similarity via RRF fusion. Returns `SearchHit` objects with score provenance; every result is addressable in `db.execute()`.
- **`graphforge.recipes`** — composable helper functions; `neighbourhood()` builds n-hop context for LLM prompts.
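The RRF fusion step used by `db.search` above is a standard technique worth seeing in isolation. A minimal sketch of Reciprocal Rank Fusion in plain Python — illustrative only, not GraphForge's internal implementation:

```python
def rrf_fuse(rankings, k=60):
    """Combine several ranked ID lists into one fused ranking.

    Each item's score is the sum of 1 / (k + rank) over every list
    it appears in; k=60 is the constant from the original RRF paper.
    """
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical node IDs: one ranking from text search, one from vectors
text_hits = ["n3", "n1", "n7"]
vector_hits = ["n1", "n9", "n3"]
fused = rrf_fuse([text_hits, vector_hits])
print(fused[0])  # n1 -- ranked high in both lists, so it wins the fusion
```

Because RRF only looks at ranks, not raw scores, it needs no calibration between the FTS5 and cosine-similarity scales.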

See [CHANGELOG.md](CHANGELOG.md) for the full list of changes.

---

## Installation

```bash
pip install graphforge
# or
uv add graphforge
```

**Requirements:** Python 3.10–3.14

**Core dependencies:** `pydantic>=2.6`, `lark>=1.1`, `msgpack>=1.0`

---

## Quick Start

### In-memory graph

```python
from graphforge import GraphForge

db = GraphForge()

# Create nodes and relationships
db.execute("""
    CREATE (alice:Person {name: 'Alice', age: 30})
    CREATE (bob:Person {name: 'Bob', age: 25})
    CREATE (alice)-[:KNOWS {since: 2020}]->(bob)
""")

# Query the graph
results = db.execute("""
    MATCH (p:Person)-[:KNOWS]->(friend)
    WHERE p.age > 25
    RETURN p.name AS person, friend.name AS friend, p.age AS age
    ORDER BY p.age DESC
""")

for row in results:
    print(f"{row['person'].value} (age {row['age'].value}) knows {row['friend'].value}")
```

### Persistent graph

```python
# Save to SQLite
db = GraphForge("research.db")
db.execute("CREATE (:Paper {title: 'Graph Neural Networks', year: 2024})")
db.close()

# Reload later
db = GraphForge("research.db")
result = db.execute("MATCH (p:Paper) RETURN p.title AS t")
print(result[0]['t'].value)  # Graph Neural Networks
```

### Python builder API

```python
alice = db.create_node(['Person', 'Employee'], name='Alice', age=30)
bob = db.create_node(['Person'], name='Bob', age=25)
db.create_relationship(alice, bob, 'KNOWS', since=2020)
```

### Graph algorithms

```python
# Compute PageRank and write scores back to nodes
db.gds.pagerank(write_property="rank")

# Query the written scores via Cypher
top = db.execute("MATCH (n) RETURN n.name, n.rank ORDER BY n.rank DESC LIMIT 5")

# Stream mode — returns dict[node_id, score] without mutating the graph
bc = db.gds.betweenness_centrality()
```

### Hybrid search

```python
db = GraphForge("research.db")

# Index node text for full-text search
db.search.index_all(node_label="Paper", properties=["title", "abstract"])

# Store a precomputed embedding (bring your own model)
db.search.set_node_vector(node_id, embedding, space="text-embedding-3-small")

# Hybrid retrieval — text + vector signals fused via RRF
results = db.search("graph neural networks", vector=query_embedding, top_k=10)
for hit in results:
    print(hit.ref.properties["title"].value, hit.score, hit.sources)
```
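The vector half of hybrid search scores candidates by cosine similarity between embeddings. For reference, a minimal pure-Python version of that measure (a sketch of the math, not the library's code):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```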

### Access result values

Results contain `CypherValue` objects — use `.value` to get the Python value:

```python
results = db.execute("MATCH (p:Person) RETURN p.name AS name, p.age AS age")

for row in results:
    name: str = row['name'].value
    age: int  = row['age'].value
```

---

## Cypher Features

GraphForge implements the full openCypher language (100% TCK compliant as of v0.3.8).

### Clauses

```cypher
// Reading
MATCH (n:Person)-[:KNOWS]->(friend)
OPTIONAL MATCH (n)-[:WORKS_AT]->(company)
WHERE n.age > 25
WITH n, count(friend) AS friends
RETURN n.name, friends
ORDER BY friends DESC
LIMIT 10

// Writing
CREATE (n:Person {name: 'Alice'})
MERGE (n:Person {name: 'Alice'})
SET n.age = 30
REMOVE n.temp
DELETE n
DETACH DELETE n

// Iteration
UNWIND [1, 2, 3] AS x
RETURN x * 2 AS doubled

// Subqueries
MATCH (n) WHERE EXISTS { MATCH (n)-[:KNOWS]->() }
RETURN n
```

### Patterns

```cypher
(n)                                // Any node
(n:Person)                         // Node with label
(n:Person {age: 30})               // Node with property
(a)-[r:KNOWS]->(b)                 // Directed relationship
(a)-[r:KNOWS|LIKES]->(b)           // Multiple types
(a)-[*1..3]->(b)                   // Variable-length (1 to 3 hops)
(a)-[*]->(b)                       // Any length
p = (a)-[*]->(b)                   // Bind path to variable
```
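A variable-length pattern like `(a)-[*1..3]->(b)` amounts to a bounded traversal. The idea can be sketched over a plain adjacency dict in a few lines — hypothetical data, and nothing like GraphForge's actual match engine, which enumerates paths rather than endpoints:

```python
def reachable(adj, start, min_hops=1, max_hops=3):
    """Node IDs reachable from `start` in min_hops..max_hops directed steps."""
    frontier, found = {start}, set()
    for hop in range(1, max_hops + 1):
        # Expand the frontier by one hop along outgoing edges
        frontier = {nxt for node in frontier for nxt in adj.get(node, ())}
        if hop >= min_hops:
            found |= frontier
        if not frontier:
            break
    return found

adj = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["e"]}
print(sorted(reachable(adj, "a")))  # ['b', 'c', 'd'] -- 1 to 3 hops from 'a'
```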

### Functions

| Category | Functions |
|----------|-----------|
| String | `toLower`, `toUpper`, `trim`, `split`, `replace`, `substring`, `left`, `right`, `reverse`, `size` |
| Math | `abs`, `ceil`, `floor`, `round`, `sqrt`, `pow`, `exp`, `log`, `sin`, `cos`, `tan`, `pi`, `e` |
| List | `head`, `tail`, `last`, `range`, `size`, `reverse`, `sort`, `collect`, `reduce`, `filter`, `extract` |
| Aggregation | `count`, `sum`, `avg`, `min`, `max`, `collect`, `stDev`, `percentileDisc` |
| Predicate | `all`, `any`, `none`, `single`, `exists`, `isEmpty` |
| Temporal | `date`, `datetime`, `localdatetime`, `time`, `localtime`, `duration`, `now` |
| Spatial | `point`, `distance` |
| Graph | `id`, `labels`, `type`, `keys`, `properties`, `nodes`, `relationships`, `startNode`, `endNode` |
| Conversion | `toInteger`, `toFloat`, `toString`, `toBoolean`, `coalesce` |

### Temporal types (full precision)

```cypher
// Dates, times, datetimes
RETURN date('2024-01-15')
RETURN datetime('2024-01-15T14:30:00[Europe/London]')  // IANA timezone
RETURN duration('P1Y2M3DT4H5M6.789S')

// Nanosecond precision
RETURN duration('PT0.000000789S').nanoseconds  // 789

// Extreme years (outside Python's 1-9999 range)
RETURN localdatetime('+999999999-12-31T23:59:59')

// Arithmetic
RETURN date('2024-01-01') + duration('P1M')  // 2024-02-01
RETURN duration.between(date('2020-01-01'), date('2024-01-01'))
```
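GraphForge depends on `isodate` for ISO 8601 parsing; the shape of a duration string such as `P1Y2M3DT4H5M6.789S` can be illustrated with a regex. This is a simplified sketch of the format, not the real parser — it ignores weeks, signs, and fractional non-second components:

```python
import re

DURATION_RE = re.compile(
    r"P(?:(?P<years>\d+)Y)?(?:(?P<months>\d+)M)?(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?"
    r"(?:(?P<seconds>\d+(?:\.\d+)?)S)?)?"
)

def parse_duration(text):
    """Return the named components of a simple ISO 8601 duration as floats."""
    match = DURATION_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a duration: {text!r}")
    return {k: float(v) for k, v in match.groupdict().items() if v is not None}

print(parse_duration("P1Y2M3DT4H5M6.789S"))
# {'years': 1.0, 'months': 2.0, 'days': 3.0,
#  'hours': 4.0, 'minutes': 5.0, 'seconds': 6.789}
```

Note how the leading `T` disambiguates the two `M` units: months before it, minutes after.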

---

## Datasets

Load 100+ real-world graphs instantly:

```python
from graphforge import GraphForge
from graphforge.datasets import load_dataset, list_datasets

db = GraphForge()

# Load any pre-registered dataset (auto-downloads and caches)
load_dataset(db, "snap-ego-facebook")   # Facebook ego networks (SNAP)
load_dataset(db, "ldbc-snb-sf0.1")      # Social network benchmark (LDBC)
load_dataset(db, "netrepo-karate")      # Karate club (NetworkRepository)

# Browse available datasets
for ds in list_datasets(source="snap")[:3]:
    print(f"{ds.name}: {ds.nodes:,} nodes, {ds.edges:,} edges")

# Analyze immediately
results = db.execute("""
    MATCH (n)-[r]->()
    RETURN n.id AS user, count(r) AS degree
    ORDER BY degree DESC LIMIT 5
""")
```

**Available sources:**
- **SNAP** (Stanford): 95 social, web, email, citation, and collaboration networks
- **LDBC**: 10 social network benchmark datasets with temporal data
- **NetworkRepository**: 10 pre-registered datasets

---

## Transactions

```python
db = GraphForge("graph.db")

db.begin()
try:
    db.execute("MATCH (p:Person {id: 123}) SET p.status = 'inactive'")
    db.execute("CREATE (:AuditLog {action: 'deactivate', user_id: 123})")
    db.commit()
except Exception:
    db.rollback()
    raise
finally:
    db.close()
```

---

## Architecture

GraphForge exposes three independent API surfaces over a shared storage layer:

```
db.execute("MATCH ...")    →  Cypher path   (Parser → Planner → Executor → Storage)
db.gds.pagerank(...)       →  Algorithm path (export → compiled backend → write-back)
db.search.fts(...)         →  Search path   (SQLite FTS5 / vector index → NodeRef list)
```

The Cypher path is four independent layers:

```
┌─────────────────────────────────────────────────┐
│  Parser         cypher.lark + parser.py         │  Cypher → AST
├─────────────────────────────────────────────────┤
│  Planner        planner.py + operators.py       │  AST → Logical plan
├─────────────────────────────────────────────────┤
│  Executor       executor.py + evaluator.py      │  Plan → Results
├─────────────────────────────────────────────────┤
│  Storage        memory.py + sqlite_backend.py   │  In-memory + SQLite WAL
└─────────────────────────────────────────────────┘
```

The algorithm and search paths bypass the Cypher executor entirely — `db.gds` and `db.search` are Python-method surfaces, not Cypher extensions. Storage uses **MessagePack** for efficient binary encoding of graph properties.
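The value of the layer separation is that each stage consumes only the previous stage's output. A toy pipeline mirroring the diagram above — purely illustrative, with hypothetical names, nothing like the real parser or planner:

```python
def parse(query):
    # "MATCH <label>" -> a one-node AST (real parsing uses a Lark grammar)
    _, label = query.split()
    return {"op": "match", "label": label}

def plan(ast):
    # AST -> a logical plan: scan all nodes, then filter by label
    return [("scan_nodes",), ("filter_label", ast["label"])]

def execute(plan_steps, storage):
    # Interpret the plan step by step against a dict-based store
    rows = []
    for step in plan_steps:
        if step[0] == "scan_nodes":
            rows = list(storage.values())
        elif step[0] == "filter_label":
            rows = [n for n in rows if step[1] in n["labels"]]
    return rows

storage = {
    1: {"labels": ["Person"], "name": "Alice"},
    2: {"labels": ["Paper"], "name": "GNNs"},
}
result = execute(plan(parse("MATCH Person")), storage)
print([n["name"] for n in result])  # ['Alice']
```

Swapping the `storage` dict for a SQLite-backed store would leave the parser and planner untouched — the same property that lets GraphForge support both in-memory and persistent backends.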

---

## Development

```bash
# Install with dev dependencies
uv sync --dev

# Run all checks (mirrors CI)
make pre-push

# Run tests
uv run pytest tests/unit tests/integration
uv run pytest tests/tck/ -n auto   # Full TCK (3,885 scenarios)

# Coverage
make coverage
```

---

## Roadmap

| Version | Focus | Status |
|---------|-------|--------|
| v0.3.8 | Full TCK compliance (3,885/3,885) | **Released** |
| v0.3.9 | Performance: LALR parser, property indexes, bulk ingest, SQLite tuning, LIMIT short-circuit | **Released** |
| v0.3.10 | Analytics integration: NetworkX/igraph export, parse/plan cache, `add_graph_documents()` | **Released** |
| v0.4.0 | Three-surface API: `db.gds.*` graph algorithms + `db.search.*` hybrid retrieval | **Released** |

See [CHANGELOG.md](CHANGELOG.md) for full release history.

---

## License

MIT © David Spencer — see [LICENSE](LICENSE) for details.

Built on [Lark](https://github.com/lark-parser/lark), [Pydantic](https://docs.pydantic.dev/), [MessagePack](https://msgpack.org/), and the [openCypher](https://opencypher.org/) specification.
