Metadata-Version: 2.4
Name: phosphograph
Version: 0.1.2
Summary: Directed, signed, provenance-annotated phospho-signaling graph builder and query CLI.
License-Expression: GPL-3.0-or-later
License-File: LICENSE
Requires-Python: >=3.13
Requires-Dist: click>=8.3.3
Requires-Dist: fastmcp[apps]==3.3.1
Requires-Dist: graphviz>=0.21
Requires-Dist: httpx>=0.28.1
Requires-Dist: hypothesis>=6.152.6
Requires-Dist: mygene>=3.2.2
Requires-Dist: networkx>=3.6.1
Requires-Dist: numpy>=2.4.4
Requires-Dist: pandas<3,>=2.2
Requires-Dist: paramiko<3
Requires-Dist: parquet>=1.3.1
Requires-Dist: pyarrow>=24.0.0
Requires-Dist: pydantic>=2.13.4
Requires-Dist: pypath-omnipath>=0.16.20
Requires-Dist: pytest>=9.0.3
Requires-Dist: ruff>=0.15.12
Description-Content-Type: text/markdown

# phosphograph

## What phosphograph is

phosphograph is a Python library that builds and queries a directed, signed, provenance-annotated graph of phospho-signaling relationships among human proteins. Nodes are proteins and individual phosphosites; edges are kinase, phosphatase, autophosphorylation, and protein-protein binding relationships drawn from manually curated public databases. SIGNOR and OmniPath are used by default. Pass `--sources signor` for a SIGNOR-only build, or `--sources signor,omnipath,psp` to opt into PhosphoSitePlus for site-level signed coverage (see the [PhosphoSitePlus section](#phosphositeplus-opt-in) for the CC BY-NC-SA 3.0 license caveat).

## Why it exists

Spatial proteomics with phospho-specific stainings (e.g. p-ERK1/2, p-c-Jun, p-AKT, p-STAT3) reports on the activity state of signaling pathways at single-cell, in-tissue resolution. A single observed phospho-state is rarely interpretable on its own: the relevant questions are always what upstream input produced it and what downstream events it predicts. Designing a multiplexed-IF panel that resolves these questions requires knowing, for any given phospho-target, which other phosphorylation events are mechanistically coupled to it and could be co-stained to corroborate or refute the inferred pathway state.

Existing pathway resources address parts of this problem but require manual cross-referencing. KEGG encodes topology but not consistent effect direction at the phospho level. PhosphoSitePlus has site-level kinase-substrate data but no network view. SIGNOR has signed phospho-edges. OmniPath integrates many sources but exposes them as a general signaling network rather than as a phospho-measurable subgraph. None of them directly answer "given p-ERK1/2 T202/Y204 is elevated in this region, which other antibodies would test or extend my inference of MAPK pathway state in the same section?"

phosphograph exists to make that question scriptable and reproducible.

## What phosphograph does

1. Ingests phospho-relevant edges from SIGNOR (default), OmniPath (default), and PhosphoSitePlus (opt-in, CC BY-NC-SA 3.0).
2. Harmonizes identifiers to UniProt canonical accessions and normalizes site nomenclature (residue letter + 1-based position on UniProt canonical).
3. Merges edges across sources with consensus-effect resolution, conflict logging, and factual per-edge provenance counts.
4. Detects autophosphorylation, synthesizes site-to-host "consequence" propagation edges, and assembles a `networkx.MultiDiGraph` of `(protein, phosphosite)` nodes.
5. Resolves free-text protein names ("p-ERK", "phospho-c-Jun S63") to ranked UniProt candidates.
6. Runs bidirectional k-hop walks from a query node with best-first source-count pruning and returns the induced subgraph, enumerated paths, and per-path signed predictions.
7. Optionally collapses the result into a protein-only view with aggregated effect counts ("3 activating, 1 inhibiting").
8. Exports to GraphML, Cytoscape JSON, GEXF, parquet edge lists, and Graphviz-rendered SVG/PDF/PNG.
9. Exposes all functionality through a `click` CLI (with an interactive `walkthrough` wizard), a Python API, and a FastMCP server (`phosphograph mcp`) that surfaces the walks as MCP tools with an inline Cytoscape viewer for LLM clients.

## What phosphograph is not

- Not an image analysis tool for spatial proteomics data.
- Not a predictor of phospho-state magnitude or kinetics.
- Not a panel optimizer in v0; walks inform manual panel decisions but do not solve set-cover automatically.
- Not a quantitative or mechanistic model of signaling.
- Not a substitute for experimental validation of any kinase-substrate relationship.

## Intended users

Bioinformaticians and computational biologists designing multiplexed-IF panels for spatial proteomics, who already work with phospho-target stainings and want a scriptable, license-clean, reproducible way to retrieve the mechanistic neighborhood around a phospho-target as a queryable graph.

## Algorithmic pipeline

These are the steps from raw curated data to a query result. Each is implemented in one small module and documented inline.

### 1. Ingest (`ingest/signor_src.py`, `ingest/omnipath_src.py`, `ingest/psp_src.py`)

- **SIGNOR**: bulk TSV download parsed row-by-row. Each row becomes one or more `PhosphoEdge`s. Filtered to human (`TAX_ID==9606`) and to mechanisms we model (`phosphorylation`, `dephosphorylation`, `binding`). The `EFFECT` column collapses to `activates|inhibits|unknown`. The `DIRECT` column ("t" = directly observed, "f" = inferred) flows through to `SourceRef.direct` as real per-row provenance.
- **OmniPath**: `pypath-omnipath` is imported lazily, only when `omnipath` is among the selected sources (it is by default). Adds enzyme-substrate coverage. OmniPath's aggregated `enz_sub` table carries no per-row effect direction, so OmniPath-only edges are `effect="unknown"` by construction.
- **PhosphoSitePlus** (opt-in): lazy import of `pypath.inputs.phosphosite`. Joins PSP's `Kinase_Substrate_Dataset` (kinase → substrate site, unsigned) with the `Regulatory_sites` table (site-level effect direction) on `(substrate_ac, residue, position, 'phosphorylation')`. Sites with matching regsite annotations carry signed effects derived from PSP's `ON_FUNCTION` keywords (`positive=True` → `activates`, `negative=True` → `inhibits`, contradictory or absent → `unknown`); unmatched K-S rows still ingest as `effect="unknown"` for structural coverage. PSP is **opt-in** because of license restrictions; see [PhosphoSitePlus (opt-in)](#phosphositeplus-opt-in) for details.

### 2. Resolve (`harmonize/resolver.py`, `harmonize/phospho_parser.py`)

Free-text input like `"p-ERK"` or `"phospho-c-Jun S63"` is normalized:

1. A regex strips phospho prefixes/suffixes and extracts an optional `(residue, position)`.
2. The cleaned symbol is sent to `mygene.info` (human only, cached).
3. Candidates are ranked by mygene's Lucene score. Both the normalized score (top hit = 1.0) AND the raw score are returned so the caller can distinguish "top of a strong field" from "top of nothing."
4. `low_confidence=True` when the top hit's raw score is below a threshold; `ambiguous=True` when the gap between top-1 and top-2 normalized scores is below `AMBIGUITY_THRESHOLD`. Never auto-pick — the caller decides.

### 3. Merge (`harmonize/merge.py`)

For each `(source_id, target_id, mechanism)` triple seen across sources:

- Union the `references` from contributing edges.
- **Effect consensus**: all agree → that effect; one says X and the rest say `unknown` → X (silence is not contradiction); two distinct signed effects → `unknown` and the disagreement is logged to `conflicts.tsv`.
- **Factual provenance counts** (no synthetic confidence): `n_sources` = distinct curated databases; `n_references` = distinct PMIDs. `n_sources` drives the `--min-sources` walk filter directly; `--require-signed` filters on the resolved `effect` instead.

### 4. Build the graph (`graph/build.py`)

- **Autophosphorylation detection**: any phosphorylation edge whose kinase and substrate share a UniProt AC is re-tagged `mechanism="autophosphorylation"`. Source/target stay `protein:X → site:X:Y` so the graph never grows a self-loop at the protein level.
- **Consequence edges** (site → host protein): for every site with at least one phos/dephos parent, emit one synthetic edge that lets walks traverse from a phospho-event to "the host protein is now active/inactive." Effect is the **consensus across phosphorylation/autophosphorylation parents only** — dephosphorylation parents are deliberately excluded because their effect annotation is inverted relative to the phospho-state. References are unioned across phos parents; `n_sources` / `n_references` recomputed from that union.
- **Add to MultiDiGraph**: nodes are created on demand; every site node gets its host protein materialized if not already present (invariant 2).
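
A minimal sketch of the first two rules, using plain dicts in place of `PhosphoEdge`. The helper names are hypothetical, and the sketch leaves out reference unioning and the recomputed counts.

```python
PHOS_MECHANISMS = ("phosphorylation", "autophosphorylation")


def host_ac(node_id: str) -> str:
    # Both "protein:P28482" and "site:P28482:T185" carry the AC in field 1.
    return node_id.split(":")[1]


def retag_autophosphorylation(edge: dict) -> dict:
    """Re-tag a phosphorylation edge whose kinase and substrate share an AC."""
    same_host = host_ac(edge["source_id"]) == host_ac(edge["target_id"])
    if edge["mechanism"] == "phosphorylation" and same_host:
        return {**edge, "mechanism": "autophosphorylation"}
    return edge


def consequence_edge(site_id: str, parents: list[dict]) -> dict:
    """Build the site -> host edge from the site's incoming edges.

    Only phosphorylation/autophosphorylation parents vote on the effect;
    dephosphorylation parents are ignored, as described above.
    """
    signed = {p["effect"] for p in parents if p["mechanism"] in PHOS_MECHANISMS}
    signed.discard("unknown")
    effect = signed.pop() if len(signed) == 1 else "unknown"
    return {
        "source_id": site_id,
        "target_id": f"protein:{host_ac(site_id)}",
        "effect": effect,
    }
```

Because the re-tagged edge still runs `protein:X → site:X:Y`, the protein node never points at itself.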

### 5. K-hop neighborhood walk (`walk/neighborhood.py`)

Best-first expansion using a heap keyed by `-n_sources` of the next edge. Edges supported by more curated databases are explored first, so when `max_nodes` is hit we have kept the strongest edges. Filters happen during expansion (`min_sources`, `allow_dephosphorylation`, `allow_binding`, `require_signed`), never post-hoc. When the cap fires, a `MaxNodesPruned` warning is emitted with the visited count and remaining-frontier size so the CLI can surface "you hit the cap; raise --max-nodes to see more."
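
The expansion order can be illustrated with a standard-library heap over a plain adjacency dict. This is a sketch under simplifying assumptions: `best_first_walk` is a hypothetical name, only the `min_sources` filter is modeled, and the cap is reported as a return value rather than a `MaxNodesPruned` warning.

```python
import heapq
from itertools import count


def best_first_walk(adj, start, k, max_nodes=None, min_sources=1):
    """Expand up to k hops from start, strongest edges first.

    adj maps node -> list of (neighbor, n_sources).
    Returns (visited nodes, number of frontier entries left when the cap fired).
    """
    visited = {start: 0}          # node -> depth at which it was reached
    tie = count()                 # tie-breaker so heap never compares node IDs
    heap = []

    def push_edges(node, depth):
        if depth >= k:
            return
        for neighbor, n_sources in adj.get(node, []):
            if n_sources >= min_sources:      # filter during expansion
                heapq.heappush(heap, (-n_sources, next(tie), neighbor, depth + 1))

    push_edges(start, 0)
    remaining = 0
    while heap:
        if max_nodes is not None and len(visited) >= max_nodes:
            remaining = len(heap)             # what a warning would report
            break
        _, _, node, depth = heapq.heappop(heap)
        if node in visited:
            continue
        visited[node] = depth
        push_edges(node, depth)
    return set(visited), remaining
```

With `max_nodes` set, a weakly supported neighbor of the start node can lose its slot to a better-supported node two hops away, which is the intended trade-off.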

### 6. Path enumeration and sign propagation (`walk/paths.py`, `walk/sign.py`)

- The caller builds **one** `filtered_subgraph(induced_subgraph(g, visited), ...)` and passes it to both path enumeration AND sign reading — so the two cannot disagree about which parallel edge "exists."
- `all_simple_paths_up_to(g, source, cutoff)` runs a single DFS via `nx`'s container-target overload and yields each simple path once.
- For each path, the sign is the product of per-step effects (`activates`=+1, `inhibits`=-1). Any `unknown` step makes the whole path's sign `None`. **Per-path only**: the same node can sit on `+` and `−` paths from different starting points, so we never collapse to per-node sign.
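
As an illustration of per-path sign propagation, here is a dependency-free sketch. It uses a hand-written DFS over an adjacency dict rather than the `networkx` call named above, and both function names are hypothetical.

```python
SIGN = {"activates": 1, "inhibits": -1}


def path_sign(effects):
    """Product of per-step signs; None as soon as any step is unknown."""
    sign = 1
    for effect in effects:
        if effect not in SIGN:
            return None
        sign *= SIGN[effect]
    return sign


def simple_paths_up_to(adj, source, cutoff):
    """Yield (path, effects) for every simple path of 1..cutoff edges.

    adj maps node -> list of (neighbor, effect).
    """
    if cutoff < 1:
        return
    stack = [(source, [source], [])]
    while stack:
        node, path, effects = stack.pop()
        for neighbor, effect in adj.get(node, []):
            if neighbor in path:              # keep paths simple (no revisits)
                continue
            new_path, new_effects = path + [neighbor], effects + [effect]
            yield new_path, new_effects
            if len(new_effects) < cutoff:
                stack.append((neighbor, new_path, new_effects))
```

Signs are attached to paths, not nodes, so two paths ending at the same node may legitimately carry opposite signs.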

### 7. Protein-collapsed view (`graph/collapse.py`)

A high-level overview for visualization. Rules:

- `protein → site:X:Y` routes through to `protein:X` (host materialized if missing).
- `site → protein` (consequence) is dropped; already accounted for via the kinase→site that produced it.
- `protein → protein` (e.g. binding) kept as-is.

Per `(source, target, mechanism)` bucket: effect counts `{activates, inhibits, unknown}`, aggregated `effect ∈ {activates, inhibits, mixed, unknown}`, `n_underlying_edges`, and a `summary_label` like `"3 activating, 2 inhibiting"`. References are intentionally dropped in the collapsed view — switch back to the full graph if you need PMIDs.
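
The per-bucket aggregation might look like the sketch below. `aggregate_bucket` is a hypothetical name, and the wording used for `unknown` edges in the label is an assumption; only the "N activating, M inhibiting" form is documented above.

```python
from collections import Counter


def aggregate_bucket(effects: list[str]) -> dict:
    """Summarize the underlying edge effects of one collapsed bucket."""
    counts = Counter(effects)
    n_act, n_inh, n_unk = counts["activates"], counts["inhibits"], counts["unknown"]

    if n_act and n_inh:
        effect = "mixed"
    elif n_act:
        effect = "activates"
    elif n_inh:
        effect = "inhibits"
    else:
        effect = "unknown"

    parts = []
    if n_act:
        parts.append(f"{n_act} activating")
    if n_inh:
        parts.append(f"{n_inh} inhibiting")
    if n_unk:
        parts.append(f"{n_unk} unknown")  # label wording is an assumption

    return {
        "effect_counts": {"activates": n_act, "inhibits": n_inh, "unknown": n_unk},
        "effect": effect,
        "n_underlying_edges": len(effects),
        "summary_label": ", ".join(parts),
    }
```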

### 8. Invariants (`graph/invariants.py`)

Checked after every build:

1. Every phosphosite has an incoming kinase/phosphatase edge from a protein, OR an autophosphorylation edge **from its own host protein**.
2. Every `site:X:Y` has a matching `protein:X` node.
3. All node IDs validate (structural regex + canonical UniProt AC).
4. Post-merge: no two parallel edges with the same `(source, target, mechanism)` carry disagreeing *signed* effects. Different mechanisms between the same protein pair are not flagged (phos can activate while binding inhibits — these are two distinct biological events, not a contradiction).
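
Invariants 1 and 2 can be checked with a single pass over the site nodes. The sketch below assumes a simplified representation (a set of node IDs and a list of `(source, target, mechanism)` tuples) and a hypothetical function name; it is not the code in `graph/invariants.py`.

```python
def check_site_invariants(nodes: set[str], edges: list[tuple[str, str, str]]) -> list[str]:
    """Return human-readable violations of invariants 1 and 2 (empty if none)."""
    problems = []
    for node in nodes:
        if not node.startswith("site:"):
            continue
        host = f"protein:{node.split(':')[1]}"

        # Invariant 2: the host protein node must exist.
        if host not in nodes:
            problems.append(f"{node}: missing host node {host}")

        # Invariant 1: an incoming kinase/phosphatase edge from a protein,
        # or an autophosphorylation edge from the site's own host.
        supported = any(
            target == node
            and source.startswith("protein:")
            and (
                mechanism in ("phosphorylation", "dephosphorylation")
                or (mechanism == "autophosphorylation" and source == host)
            )
            for source, target, mechanism in edges
        )
        if not supported:
            problems.append(f"{node}: no supporting incoming edge")
    return sorted(problems)
```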

## Scope (v0)

| Item | Decision |
|---|---|
| Species | Human only (taxid 9606). Mouse deferred to v0.1; cross-species inheritance has biological caveats around residue translation that v0 does not solve. |
| Node resolution | Protein-level required; phosphosite-level where annotated |
| Antibody filter | None in v0 |
| Use case | Academic |
| Secondary scope | Autophosphorylation detection |
| Deliverable | Python package + `click` CLI + graph export (GraphML, Cytoscape JSON, GEXF, SVG/PDF via Graphviz, parquet) |

## Data sources

**SIGNOR and OmniPath are both included by default.** Pass `--sources signor` for a SIGNOR-only build (smaller, higher signed share, no OmniPath unsigned edges). **PhosphoSitePlus is opt-in**: pass `--sources signor,omnipath,psp` (or `--sources signor,psp`) to include it; see [PhosphoSitePlus (opt-in)](#phosphositeplus-opt-in) below for the license-and-mirror caveat. CollecTRI is not used.

| Source | Default? | Role | Access |
|---|---|---|---|
| **SIGNOR** | yes | Manually-curated, signed phospho/dephospho edges with explicit mechanism, effect direction, PMID, and SIGNOR record ID. The large majority of edges carry a signed effect. | TSV bulk dump via `https://signor.uniroma2.it/releases/getLatestRelease.php` |
| **OmniPath** | yes | Aggregated enzyme-substrate (PTM) network from many underlying resources. Adds broader site and kinase coverage but contributes zero signed edges in v0 — OmniPath's aggregated `enz_sub` table doesn't expose per-row effect direction. | `pypath-omnipath` Python client (heavy; downloads on first use) |
| **PhosphoSitePlus** | **opt-in** | Site-level kinase-substrate dataset joined with PSP's `Regulatory_sites` annotations to recover signed effect direction for the annotated subset. Substantially expands site-level coverage and adds signed edges beyond the SIGNOR baseline. Licensed **CC BY-NC-SA 3.0** (academic / non-commercial only). | `pypath.inputs.phosphosite` (fetches from the OmniPath team's mirror at `rescued.omnipathdb.org`; see caveat below) |

**Default tradeoff**: SIGNOR + OmniPath maximizes structural coverage but dilutes the signed share. Pass `--sources signor` for a smaller, more-signed graph. Add `psp` to grow site-level coverage with a meaningful signed contribution from PSP's regulatory-sites annotations — at the cost of accepting the CC BY-NC-SA 3.0 terms and depending on the rescued mirror. KEGG, Reactome, iPTMnet, DEPOD, INDRA, and CollecTRI are not used.

### PhosphoSitePlus (opt-in)

The PSP source is implemented in `ingest/psp_src.py` and is **disabled by default** for two reasons that users should be aware of before opting in:

1. **License**: PhosphoSitePlus is distributed under **Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)**. This means:
   - **Non-commercial use only.** If you are using phosphograph in any commercial context (industry, contract research, fee-for-service analysis), do not enable PSP. Building with `--sources ...,psp` causes PSP data to be downloaded onto your machine; that download itself is subject to PSP's terms.
   - **ShareAlike** propagates to derivative datasets. If you redistribute a phosphograph-built graph that incorporates PSP edges (e.g., as a parquet file, GraphML export, or downstream model), you must license the redistribution under the same CC BY-NC-SA 3.0 terms.
   - **Attribution required.** Cite the canonical PSP reference (Hornbeck PV et al., *Nucleic Acids Res.* 2015, `doi:10.1093/nar/gku1267`) in any work that uses a PSP-enabled phosphograph build.
2. **Access via a third-party mirror**: pypath's PSP downloaders point at `https://rescued.omnipathdb.org/phosphosite/...` rather than the official `phosphosite.org` endpoint (which requires registration and a manual web download). This mirror is maintained by the OmniPath team as a courtesy and is **not endorsed by PhosphoSitePlus**. The redistribution itself sits in a gray area of PSP's terms — by opting into PSP via phosphograph, you accept that:
   - the mirror may disappear at any time, in which case `--sources ...,psp` builds will start failing with a clear error;
   - you are choosing to obtain PSP data through this informal channel rather than the canonical one;
   - the PSP edges in your graph carry `database="psp"` and a derivation trail back to the rescued-mirror files (the cache is at `~/.cache/phosphograph/raw/psp/edges.parquet`).

The `--sources ...,psp` opt-in encodes informed consent to both points; phosphograph does not silently pull PSP under any default configuration.

#### Opting in via environment variable

For environments where you want PSP to be on for every invocation without having to remember `--sources signor,omnipath,psp` each time, set `PHOSPHOGRAPH_ENABLE_PSP=1` (or `true`, `yes`, `on` — case-insensitive). When that variable is set, `psp` is automatically appended to `DEFAULT_SOURCES`, so both `phosphograph build` (no `--sources` flag) and the MCP server's auto-build on first boot will include PSP. An explicit `--sources` flag still overrides whatever the default resolves to.

```bash
# one-shot
PHOSPHOGRAPH_ENABLE_PSP=1 phosphograph build

# permanent for this shell
export PHOSPHOGRAPH_ENABLE_PSP=1
phosphograph mcp --transport stdio
```

For Claude Desktop, put the env var into `claude_desktop_config.json` (the standard MCP server spec supports `env` per-server):

```jsonc
{
  "mcpServers": {
    "phosphograph": {
      "command": "phosphograph",
      "args": ["mcp", "--transport", "stdio"],
      "env": { "PHOSPHOGRAPH_ENABLE_PSP": "1" }
    }
  }
}
```

For an `.mcpb` bundle (Claude Desktop's MCP bundle format) the same env block goes inside `server.mcp_config` in `manifest.json`. The bundled `manifest.json` in this repository ships with `PHOSPHOGRAPH_ENABLE_PSP=1` pre-set, which means **installing the bundled MCP server implicitly accepts PSP's CC BY-NC-SA 3.0 terms on behalf of whoever runs it**. If you redistribute the bundle to others, ensure they understand the license implication or remove the env block before redistribution.

For a remote HTTP deployment, set the env var on the server process (systemd unit, container env, etc.) — the operator is the party accepting PSP's terms, not the end user calling the MCP tools.

## Schema

Strongly typed via `pydantic>=2`. Node IDs are deterministic strings; v0 is human-only so the taxid is implicit.

```python
from pydantic import BaseModel, Field
from typing import Literal, Optional

Residue = Literal["S", "T", "Y", "H"]
TAXID_HUMAN = 9606

class ProteinNode(BaseModel):
    kind: Literal["protein"] = "protein"
    uniprot_ac: str
    protein_symbol: str

class PhosphoSiteNode(BaseModel):
    kind: Literal["phosphosite"] = "phosphosite"
    uniprot_ac: str
    protein_symbol: str
    residue: Residue
    position: int = Field(ge=1)              # 1-based, UniProt canonical

Mechanism = Literal[
    "phosphorylation",
    "dephosphorylation",
    "autophosphorylation",
    "binding",                               # protein-protein, no site coordinate
]
Effect = Literal["activates", "inhibits", "unknown"]

class SourceRef(BaseModel):
    database: Literal["omnipath", "signor", "psp"]
    record_id: Optional[str] = None
    pmid: Optional[str] = None
    direct: Optional[bool] = None            # SIGNOR DIRECT column

class PhosphoEdge(BaseModel):
    source_id: str
    target_id: str
    mechanism: Mechanism
    effect: Effect
    references: list[SourceRef]
    n_sources: int = Field(ge=1)             # distinct curated databases
    n_references: int = Field(ge=0)          # distinct PMIDs across references
```

Node ID conventions:
- `protein:P28482`
- `site:P28482:T185`
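
A minimal sketch of helpers that build and validate these IDs is shown below. The function names are assumptions (the package's own helpers live in `util/node_id.py`); the accession pattern is the one UniProt publishes for its accession numbers.

```python
import re

# UniProt's published accession pattern (6- and 10-character forms).
_AC = r"(?:[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})"
_PROTEIN = re.compile(rf"^protein:({_AC})$")
_SITE = re.compile(rf"^site:({_AC}):([STYH])([1-9][0-9]*)$")


def protein_id(uniprot_ac: str) -> str:
    return f"protein:{uniprot_ac}"


def site_id(uniprot_ac: str, residue: str, position: int) -> str:
    return f"site:{uniprot_ac}:{residue}{position}"


def parse_node_id(node_id: str) -> dict:
    """Split a node ID into its fields, raising ValueError if it is malformed."""
    match = _PROTEIN.match(node_id)
    if match:
        return {"kind": "protein", "uniprot_ac": match.group(1)}
    match = _SITE.match(node_id)
    if match:
        return {
            "kind": "phosphosite",
            "uniprot_ac": match.group(1),
            "residue": match.group(2),
            "position": int(match.group(3)),   # 1-based, so "0" never matches
        }
    raise ValueError(f"not a valid node ID: {node_id!r}")
```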

## Natural-language resolver

`harmonize/resolver.py` converts free-text input ("p-ERK", "phospho-c-Jun S63", "p38 alpha") to UniProt entries. Pipeline:

1. `phospho_parser.py`: regex strips `phospho-`, `p-`, `pS\d+`, `pT\d+`, `pY\d+`; returns cleaned name and optional `(residue, position)`.
2. Query `mygene.info` (Python `mygene` client) for `species="human"`. Matches on official symbol, alias, previous symbol, name.
3. Rank by mygene Lucene score. Both the normalized score (top hit = 1.0) and the raw score are returned — the normalization makes the top hit always 1.0 even when it's actually a poor match, so the raw score (and the `low_confidence` flag derived from it) is what tells you whether to trust the top pick at all.

```python
class ResolutionCandidate(BaseModel):
    uniprot_ac: str
    protein_symbol: str
    matched_via: Literal["symbol", "alias", "previous_symbol", "name"]
    score: float = Field(ge=0.0, le=1.0)     # normalized within this query
    raw_score: float = 0.0                   # mygene Lucene score verbatim

class ResolutionResult(BaseModel):
    query: str
    parsed_site: Optional[tuple[Residue, int]] = None
    parsed_phospho_prefix: bool = False
    candidates: list[ResolutionCandidate]    # sorted by score desc
    ambiguous: bool = False                  # top1 - top2 < AMBIGUITY_THRESHOLD
    low_confidence: bool = False             # top raw_score < LOW_CONFIDENCE_RAW_SCORE
```

Never auto-pick. Caller decides.
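
To make the two flags concrete, here is a sketch of how they could be derived from raw scores. The threshold values are placeholders, not the package's constants, and `rank_scores` is a hypothetical helper; flagging an empty result as low-confidence is also an assumption.

```python
from dataclasses import dataclass

# Placeholder values for illustration only.
AMBIGUITY_THRESHOLD = 0.1
LOW_CONFIDENCE_RAW_SCORE = 5.0


@dataclass
class RankedScores:
    normalized: list[float]   # sorted descending, top hit = 1.0
    ambiguous: bool
    low_confidence: bool


def rank_scores(raw_scores: list[float]) -> RankedScores:
    ordered = sorted(raw_scores, reverse=True)
    if not ordered:
        return RankedScores([], False, True)   # assumption: no hits is low-confidence
    top = ordered[0]
    normalized = [s / top if top > 0 else 0.0 for s in ordered]
    # Ambiguity looks at the normalized gap between the two best hits.
    ambiguous = len(normalized) > 1 and (normalized[0] - normalized[1]) < AMBIGUITY_THRESHOLD
    # Confidence looks at the raw top score, which normalization would hide.
    low_confidence = top < LOW_CONFIDENCE_RAW_SCORE
    return RankedScores(normalized, ambiguous, low_confidence)
```

A weak field such as raw scores `[2.0, 0.5]` still normalizes to a top hit of 1.0, which is why the raw score is reported alongside it.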

## Graph model

`networkx.MultiDiGraph`. Edge conventions:

- **Kinase to substrate site**: `protein` (kinase) → `phosphosite` (substrate), `mechanism="phosphorylation"`.
- **Phosphatase to substrate site**: same shape, `mechanism="dephosphorylation"`.
- **Autophosphorylation**: `protein:X` → `site:X:Y`, re-tagged at build time when source AC equals target AC. Self-loops at the protein level are avoided.
- **Site-to-host "consequence" edge**: `phosphosite` → `protein` of the same UniProt AC. Synthesized at build time as a structural propagation hop so walks can traverse from a phospho-event to the host protein's activity. Effect is the consensus across the site's phosphorylation/autophosphorylation parents (dephos parents excluded — see [Algorithmic pipeline / Build](#4-build-the-graph-graphbuildpy)).
- **Binding**: `protein` → `protein` (no site coordinate). From SIGNOR's binding mechanism rows.

## Walks

Two primary entry points:

```python
upstream(target: str, k: int = 2, *,
         include_phosphatases: bool = True,
         include_binding: bool = True,
         min_sources: int = 1,
         require_signed: bool = False,
         max_nodes: int | None = None,
         sources: Iterable[str] | None = None) -> Walk

downstream(source: str, k: int = 2, *,
           include_binding: bool = True,
           min_sources: int = 1,
           require_signed: bool = False,
           max_nodes: int | None = None,
           sources: Iterable[str] | None = None) -> Walk
```

A `Walk` carries the (filtered) induced subgraph, the enumerated simple paths up to length `k`, and a per-path propagated sign.

**Filters use factual provenance**: `min_sources=N` keeps only edges asserted by at least N curated databases (`min_sources=2` is the "consensus only" view the interactive walkthrough prompts for); `require_signed=True` drops `effect="unknown"` edges. There is no synthetic confidence score.

**Query-time source filter**: `sources={"signor", "psp"}` (a subset of `SUPPORTED_SOURCES`) restricts the walk to edges whose `references` include at least one `SourceRef` with `database` in the allowed set. This is distinct from build-time `--sources` — it carves a per-call subset out of the already-built `graph.pkl` without rebuilding, and pairs naturally with `min_sources` (e.g. "edges with PSP AND SIGNOR both asserting" = `sources={"signor","psp"}, min_sources=2`).

**Sign propagation**: product of edge effects along the path. `activates`=+1, `inhibits`=-1, `unknown` sets `propagated_sign=None` for that path. Never aggregated to a single per-node sign — the same node can sit on `+` and `−` paths from different starting points.

**Hub blow-up**: around hubs (AKT, ERK, MTOR) k≥2 neighborhoods can easily exceed the default `max_nodes` cap. `max_nodes` triggers best-first expansion ordered by edge `n_sources`, so when the cap fires the strongest edges are kept. A `MaxNodesPruned` warning carries the cap, visited count, and remaining-frontier size so the CLI can suggest raising `--max-nodes`.

## Conflict resolution and provenance

`harmonize/merge.py`. For the same `(source_id, target_id, mechanism)` triple from multiple databases or multiple rows:

1. Union references into one `PhosphoEdge`.
2. **Effect resolution**:
   - All sources agree → that effect.
   - One says `X`, the rest say `unknown` → `X` (silence is not contradiction).
   - Genuine disagreement → `effect = "unknown"`, conflict logged to `conflicts.tsv`.
3. **Provenance counts** (factual, not heuristic):
   - `n_sources` = number of distinct databases asserting the edge.
   - `n_references` = number of distinct PMIDs across all unioned references.

No synthetic "confidence score" is produced. Walk filters use `n_sources` directly (`--min-sources N`) and the boolean `--require-signed` flag for effect direction.

Source precedence (for downstream consumers picking a representative reference): SIGNOR > PSP > OmniPath. SIGNOR ranks highest because its rows carry explicit per-edge signed effect; PSP is next because regulatory-site annotations recover signed effect for a meaningful subset of K-S edges; OmniPath's `enz_sub` aggregation carries no per-row direction and ranks last.

## Orthology

Not in v0, which is human only. Mouse support would require sequence-aligned translation of site coordinates between orthologs, which v0 does not implement. The earlier approach of copying residue and position verbatim was biologically unreliable and has been removed. Mouse may return in v0.1 with alignment-aware site translation.

## Output formats

`graph/io.py`:

```python
to_graphml(g, path)            # interchange, Cytoscape desktop, yEd
to_gexf(g, path)               # Gephi
to_cytoscape_json(g, path)     # web viewers, .cyjs
to_graphviz(g, path, layout="dot")  # SVG/PDF/PNG via system Graphviz
to_pickle(g, path)             # full round-trip with typed attributes
to_parquet_edges(g, path)      # pandas-friendly edge list
```

Format inferred from file extension unless explicit. Graphviz requires the system binary; layouts: `dot` for hierarchical (upstream/downstream views), `sfdp` for large neighborhoods.

## Module layout

```
phosphograph/
  __init__.py
  config.py                # paths, species toggles, source toggles, cache dir, weights
  models.py                # pydantic schemas above
  util/
    node_id.py             # deterministic node-ID helpers
  ingest/
    base.py                # Ingestor protocol -> Iterator[PhosphoEdge]
    omnipath_src.py        # enz_sub via pypath (default source, human only)
    psp_src.py             # PhosphoSitePlus K-S + Regulatory_sites join (opt-in)
    signor_src.py          # SIGNOR bulk TSV
  harmonize/
    ids.py                 # UniProt canonical resolution
    sites.py               # residue+position normalization, isoform handling
    merge.py               # consensus-effect merge + conflict logging
    resolver.py            # mygene-backed free-text -> UniProt resolver
    phospho_parser.py      # regex parser for "p-X S123"-style input
  graph/
    build.py               # MultiDiGraph assembly
    io.py                  # all exports
    invariants.py          # property tests (no orphan sites, valid node IDs, etc.)
    collapse.py            # protein-only collapsed view for high-level overview
  walk/
    neighborhood.py        # bidirectional k-hop best-first walk
    paths.py               # all simple paths up to length k
    sign.py                # per-path sign accumulation
  query/
    upstream.py
    downstream.py
  mcp/
    server.py              # FastMCP server: tools, resources, prompts, run()
    resolution.py          # free-text -> node ID with MCP elicitation
    payload.py             # Walk / paths -> MCP wire payload (cytoscape + summary + structured)
    view.py                # ui://phosphograph/view.html Cytoscape app
  cli.py                   # click entry point (including `phosphograph mcp`)
tests/                     # pytest + hypothesis
```

## Dependencies

Required: `pypath-omnipath`, `httpx`, `pandas`, `pydantic>=2`, `networkx>=3`, `click>=8`, `mygene`, `graphviz` (Python wrapper), `pyarrow`, `fastmcp` (powers the MCP server), `pytest`, `hypothesis`.

System: Graphviz binaries (`apt install graphviz` or equivalent).

Optional extras: `pyvis` (interactive HTML preview).

## CLI

```bash
phosphograph build [--sources signor[,omnipath][,psp]] [--force]
phosphograph resolve <query> [--top-k 5]
phosphograph upstream <gene_or_ac>   [--depth 2] [--include-phosphatases] [--include-binding] [--min-sources N] [--require-signed] [--max-nodes 200] [--collapse] [--sources signor,psp] [--output FILE]
phosphograph downstream <gene_or_ac> [--depth 2] [--include-binding] [--min-sources N] [--require-signed] [--max-nodes 200] [--collapse] [--sources signor,psp] [--output FILE]
phosphograph neighborhood <gene_or_ac> [--upstream-depth N] [--downstream-depth N] [--upstream-max-nodes N] [--downstream-max-nodes N] [--collapse] [--sources signor,psp] [--output FILE]
phosphograph paths <source> <target> [--max-length 4] [--sources signor,psp] [--output FILE]
phosphograph export [--format graphml|gexf|cyjs|svg|pdf|parquet] <output>
phosphograph info <gene_or_ac>
phosphograph conflicts [--output conflicts.tsv]
phosphograph walkthrough
```

Output format is inferred from the file extension unless `--format` is set. `--orientation horizontal|vertical` controls Graphviz layout direction (LR vs TB) for `upstream`, `downstream`, `neighborhood`, `paths`, `export`; ignored for non-Graphviz formats.

**Note on `--sources` semantics.** On `build`, `--sources` is a build-time directive that decides which curated databases get merged into the cached `graph.pkl`. On the walk subcommands (`upstream`, `downstream`, `neighborhood`, `paths`), `--sources` is a **query-time filter** that carves a per-call subset out of the existing cached graph: an edge passes the filter iff at least one of its `references` has `database` in the allowed set. The walk filter never triggers a rebuild.

## MCP server

`phosphograph mcp` runs a [FastMCP](https://gofastmcp.com)-based [Model Context Protocol](https://modelcontextprotocol.io) server so the walks are callable directly from LLM agents (Claude.ai, Claude Desktop, custom hosts). It offers the same query semantics as the CLI, plus an inline interactive Cytoscape viewer rendered in the chat window and MCP elicitation for ambiguous protein names.

```bash
phosphograph mcp                              # streamable HTTP on 127.0.0.1:8765/mcp (default)
phosphograph mcp --transport stdio            # Claude Desktop / subprocess hosts
phosphograph mcp --host 0.0.0.0 --port 8765   # autodeploy / container
```

**Transports.** `http` (alias `streamable-http`) is the default and the modern MCP HTTP transport — use it for Claude.ai and most autodeploy setups. `stdio` is for hosts that spawn the server as a subprocess (Claude Desktop).

**Auto-build on first boot.** If the cached graph is missing, the server runs the build step automatically before accepting tool calls, so a freshly deployed container is usable without a manual `phosphograph build`. Disable with `--no-auto-build`. To override the sources used on a cache miss, pass `--sources signor,omnipath` (the default, which includes both), `--sources signor` for a SIGNOR-only build with a higher signed share, or `--sources signor,omnipath,psp` to opt into PhosphoSitePlus (CC BY-NC-SA 3.0; see [PhosphoSitePlus (opt-in)](#phosphositeplus-opt-in)). PSP must be opted in by whoever runs the server: for a hosted MCP deployment, the server operator, not the end user, accepts PSP's license terms.

**Tools** (all read-only, annotated for hosts):

| Tool | Purpose |
|---|---|
| `upstream` | Walk upstream from a query (gene symbol / UniProt AC / `SYMBOL:T185`). |
| `downstream` | Walk downstream from a query. |
| `neighborhood` | Bidirectional neighborhood with independent up/down depth and node caps. |
| `paths` | Enumerate signed simple paths between two proteins. |
| `resolve_protein` | Free-text → ranked UniProt candidates (fallback when the client doesn't support elicitation). |
| `node_info` | Attributes + in/out degrees for a single node. |

**Walk-tool parameters at parity with the CLI walkthrough.** Every walk tool (`upstream`, `downstream`, `neighborhood`) takes the full filter set the interactive walkthrough prompts for: `depth`, `max_nodes` (and per-direction variants for `neighborhood`), `include_phosphatases`, `include_binding`, `min_sources` (a.k.a. the consensus knob — `min_sources=2` keeps only edges asserted by ≥2 curated DBs), `require_signed`, plus two flags unique to v0:
- `sources: list[str] | None` — query-time database filter (`["signor", "psp"]` etc.). Restricts to edges with at least one reference from the named databases. Does not trigger a rebuild — pairs with build-time `--sources` (the latter decides what lands in the cache; this one carves a subset out at query time).
- `collapse: bool` — return the protein-only aggregated view (phosphosites hidden, parallel edges merged by `(source, target, mechanism)`). Path enumeration is omitted under `collapse=True` because paths reference site nodes that the protein-only view hides. `paths` does not take `collapse` (path enumeration is inherently node-level) but does take `sources`.
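
As a rough illustration of how the query-time filters above might combine, here is a minimal sketch over plain Python records. The `Edge` dataclass, its `databases` field, and the `filter_edges` helper are hypothetical names for this example, not phosphograph's actual data model. Whether `min_sources` counts databases before or after the `sources` filter is also an assumption here (the sketch counts all of them).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edge:
    """Hypothetical edge record; not phosphograph's real data model."""
    source: str
    target: str
    mechanism: str
    effect: str  # "activates" | "inhibits" | "unknown"
    databases: frozenset[str] = field(default_factory=frozenset)


def filter_edges(edges, sources=None, min_sources=1, require_signed=False):
    """Keep edges that pass the query-time filters described above."""
    wanted = None if sources is None else {s.lower() for s in sources}
    kept = []
    for e in edges:
        # `sources`: at least one reference must come from a named database.
        if wanted is not None and not (e.databases & wanted):
            continue
        # `min_sources`: consensus knob, counts distinct curated databases.
        if len(e.databases) < min_sources:
            continue
        # `require_signed`: drop edges whose effect is unknown.
        if require_signed and e.effect == "unknown":
            continue
        kept.append(e)
    return kept


edges = [
    Edge("protein:A", "site:B:T185", "phosphorylation", "activates",
         frozenset({"signor", "omnipath"})),
    Edge("protein:C", "site:B:T185", "phosphorylation", "unknown",
         frozenset({"omnipath"})),
]
signor_only = filter_edges(edges, sources=["signor"])
consensus = filter_edges(edges, min_sources=2)
```

Nothing in the sketch touches the cache, which mirrors the point that a query-time `sources` filter never triggers a rebuild.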

The MCP tool surface for filters matches the CLI walkthrough 1:1, so anything a user can do interactively is also reachable from an LLM agent.

**Resources.** `ui://phosphograph/view.html` — the Cytoscape viewer (loaded into a sandboxed iframe by the host). `phosphograph://stats` — graph statistics as JSON.

**Prompts.** Canonical query templates the LLM (and slash-command UIs) can discover and invoke: `kinase_network`, `regulators_of`, `path_between`.

**Cytoscape rendering.** Each walk tool returns three things in its result: a short text summary (for the LLM), a Cytoscape elements JSON blob (picked up by the bundled viewer via `app.ontoolresult` and rendered as an interactive graph in the chat window), and a structured payload (focus node, counts, full path list, prune warnings) for programmatic consumption. The viewer styles activating edges green, inhibitory red, and binding edges dashed; protein nodes are ellipses, phosphosite nodes are boxes. Toolbar buttons: fit, re-layout, toggle phosphosites, PNG export.

**Interactive disambiguation.** When a free-text query maps to multiple candidates in the graph, the tool issues an MCP elicitation so the user picks one inline. If the client does not support elicitation, the tool raises a `ToolError` pointing the agent at `resolve_protein` to do an explicit candidate listing first.

**Claude Desktop config** (`~/Library/Application Support/Claude/claude_desktop_config.json` on macOS):

```jsonc
{
  "mcpServers": {
    "phosphograph": {
      "command": "phosphograph",
      "args": ["mcp", "--transport", "stdio"]
    }
  }
}
```

**Claude.ai or other streamable-HTTP clients**: point them at `http://<host>:<port>/mcp`.

For programmatic use:

```python
from phosphograph.mcp import build_server, run

# Autodeploy entry point: start the HTTP server.
run(transport="http", host="0.0.0.0", port=8765)

# Alternative for tests/scripts: inject a pre-loaded graph `g`.
mcp = build_server(graph=g)
```

## Caching

`~/.cache/phosphograph/` (override via `PHOSPHOGRAPH_CACHE_DIR` env var):
- `raw/`: source-version-stamped JSON/TSV downloads
- `edges/`: parquet edge lists per source
- `graph/`: built `MultiDiGraph` pickle keyed by (sources, species, build-timestamp)

`phosphograph build` rebuilds the graph from the cached raw downloads; add `--force` to re-fetch every source from upstream first.
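
The cache-root lookup described above can be sketched as follows. This is an illustrative helper under stated assumptions, not phosphograph's own code: the function name `cache_dir` and its `env` parameter are hypothetical, and only the default path and the `PHOSPHOGRAPH_CACHE_DIR` variable come from this README.

```python
import os
from pathlib import Path


def cache_dir(env=None):
    """Resolve the cache root: env override first, then the default."""
    env = os.environ if env is None else env
    override = env.get("PHOSPHOGRAPH_CACHE_DIR")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".cache" / "phosphograph"


# Passing a dict instead of os.environ keeps the example side-effect free.
root = cache_dir({"PHOSPHOGRAPH_CACHE_DIR": "/tmp/pg-cache"})
subdirs = {name: root / name for name in ("raw", "edges", "graph")}
```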

## Implementation invariants (enforced by `graph/invariants.py`)

Asserted after every `build_graph(..., merge=True)`:

1. Every phosphosite node has an incoming `phosphorylation`/`dephosphorylation` edge from a protein node, OR an `autophosphorylation` edge whose source is exactly its own host protein.
2. Every `site:X:Y` node has a matching `protein:X` node in the graph.
3. All node IDs validate (structural regex + canonical UniProt AC).
4. Post-merge: no two parallel edges sharing the same `(source, target, mechanism)` carry disagreeing signed effects. Different mechanisms between the same protein pair (e.g. phos:activates + binding:inhibits) are not flagged — they describe distinct biological events, not a contradiction.
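
To make invariants 2 and 4 concrete, here is a minimal sketch over plain node IDs and edge tuples. It is not the code in `graph/invariants.py`; the helper names and the tuple layout `(source, target, mechanism, effect)` are assumptions for illustration, as is treating `unknown` as unsigned.

```python
from collections import defaultdict


def missing_host_proteins(nodes):
    """Invariant 2: every `site:X:Y` node needs a `protein:X` node."""
    node_set = set(nodes)
    missing = []
    for node in nodes:
        if node.startswith("site:"):
            _, ac, _ = node.split(":", 2)
            if f"protein:{ac}" not in node_set:
                missing.append(node)
    return missing


def sign_conflicts(edges):
    """Invariant 4: parallel edges sharing (source, target, mechanism)
    must not carry disagreeing signed effects."""
    effects = defaultdict(set)
    for source, target, mechanism, effect in edges:
        # Assumption: only activates/inhibits count as signed.
        if effect in ("activates", "inhibits"):
            effects[(source, target, mechanism)].add(effect)
    return sorted(key for key, seen in effects.items() if len(seen) > 1)


nodes = ["protein:P1", "site:P1:T185", "site:P2:S63"]
edges = [
    ("protein:P1", "protein:P2", "phosphorylation", "activates"),
    ("protein:P1", "protein:P2", "binding", "inhibits"),  # other mechanism: fine
    ("protein:P3", "site:P1:T185", "phosphorylation", "activates"),
    ("protein:P3", "site:P1:T185", "phosphorylation", "inhibits"),  # conflict
]
orphans = missing_host_proteins(nodes)
conflicts = sign_conflicts(edges)
```

Keying the conflict check on the full `(source, target, mechanism)` triple is what lets a phosphorylation edge and a binding edge between the same pair disagree without being flagged.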

## Out of scope for v0

- CollecTRI / transcription factor regulatory edges (TFs are gene-level, off-mission for a phospho-signaling tool)
- Antibody catalog / Antibody Registry integration
- INDRA / text-mined statements
- Panel optimization / set-cover suggestions
- Kinetic or quantitative modeling
- KEGG, Reactome, iPTMnet, DEPOD as separate ingestors

---

## Project state and continuation notes

> This appendix documents the *actual* runtime quirks future contributors (or future sessions) need to know — things not derivable from the code alone.

### Locked dependency pins (do not bump without testing pypath end-to-end)

| Pin | Reason |
|---|---|
| `paramiko<3` | paramiko 3.x removed `DSSKey`; the unmaintained `pysftp` (which `pypath-omnipath` imports unconditionally in `pypath/share/curl.py`) crashes on import. Pinning to 2.x is the cleanest workaround. |
| `pandas>=2.2,<3` | `pypath.inputs.uniprot_idmapping.idtypes()` calls `groups.fillna(-1.0, inplace=True)` on a string column. Pandas 3.x uses Arrow-backed string arrays that reject float fill values. |

If pypath upstream fixes either issue, the corresponding pin can be relaxed. Verify with `uv run python -c "from pypath import omnipath; omnipath.db.get_db('enz_sub').make_df(tax_id=True)"` after any bump.

### pypath API surface actually used

```python
from pypath import omnipath
es = omnipath.db.get_db('enz_sub')   # EnzymeSubstrateAggregator
es.make_df(tax_id=True)              # populates es.df
df = es.df                           # pd.DataFrame
```

DataFrame columns (verified against the pinned `pypath-omnipath`):
`enzyme, enzyme_genesymbol, substrate, substrate_genesymbol, isoforms, residue_type, residue_offset, modification, sources, references, curation_effort, ncbi_tax_id`.

We rename `residue_type` → `residue_letter` in `phosphograph/ingest/omnipath_src.py:_fetch_enz_sub_live` so the rest of the pipeline keeps a single column contract.

There is **no** `to_dataframe()` method — earlier docs hinted at one but the supported API is `make_df()` + `.df`.

### SIGNOR API

SIGNOR ships its full corpus as a single TSV at `https://signor.uniroma2.it/releases/getLatestRelease.php`.

Columns we use: `IDA, IDB, DATABASEA, DATABASEB, EFFECT, MECHANISM, RESIDUE, TAX_ID, PMID, DIRECT, SIGNOR_ID`.

- `EFFECT` collapses to `activates|inhibits|unknown` via the `up-regulates*` / `down-regulates*` prefixes (see `signor_src._effect_to_enum`).
- `MECHANISM` is kept iff one of `phosphorylation`, `dephosphorylation`, `binding`. Phos/dephos rows go `protein → site:residue:position`; binding rows go `protein → protein` (no site coordinate).
- `DIRECT` is propagated to `SourceRef.direct` (True for `t`, False for `f`, None when blank). This is a real per-row signal distinguishing directly observed interactions from inferred ones.
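
The column handling in the bullets above can be sketched roughly as below. These helpers are stand-ins written for this example, not the contents of `signor_src`; the function names are hypothetical, and mapping any unrecognized `DIRECT` value to `None` is an assumption.

```python
KEPT_MECHANISMS = {"phosphorylation", "dephosphorylation", "binding"}


def effect_to_enum(effect):
    """Collapse a SIGNOR EFFECT string by its prefix."""
    text = (effect or "").strip().lower()
    if text.startswith("up-regulates"):
        return "activates"
    if text.startswith("down-regulates"):
        return "inhibits"
    return "unknown"


def parse_direct(flag):
    """Map the DIRECT column: 't' -> True, 'f' -> False, otherwise None."""
    text = (flag or "").strip().lower()
    if text == "t":
        return True
    if text == "f":
        return False
    return None


def keep_row(row):
    """Keep only rows whose MECHANISM is one of the three supported kinds."""
    return (row.get("MECHANISM") or "").strip().lower() in KEPT_MECHANISMS


row = {"EFFECT": "up-regulates activity", "MECHANISM": "phosphorylation", "DIRECT": "t"}
parsed = (keep_row(row), effect_to_enum(row["EFFECT"]), parse_direct(row["DIRECT"]))
```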

**SIGNOR trust score is NOT in the bulk TSV.** The published per-edge score combines several features (PMID count, pathway co-occurrence, Reactome cross-reference, UniProt co-mention) but the bulk download omits it. Recomputing locally would require pulling Reactome and UniProt sidecars. v0 uses the readily-available signals (`n_sources`, `n_references`, `direct`) instead.

### OmniPath REST endpoint naming

The query type for enzyme-substrate is `/enz_sub` (with underscore), not `/enzsub`. The `/ptms` alias also works. Metadata at `/queries/enz_sub` returns the parameter dictionary. Both `/enzsub` and `/enz-sub` return HTTP 502. (phosphograph itself uses `pypath-omnipath` rather than the REST endpoint directly; this note is for orientation if you ever need to verify column shapes against the web service.)

### Build pipeline non-obvious behaviors

- **Protein-symbol plumbing**: ingestors do not contribute labels. After `build_graph` assembles the graph, a single post-build pass in `phosphograph.harmonize.symbols.apply_protein_symbols(g)` collects every UniProt AC on the graph and asks `pypath.utils.mapping.label(ac, id_type='uniprot', ncbi_tax_id=9606)` for the HGNC primary symbol, writing it onto each node as `protein_symbol`. This runs unconditionally when the CLI / MCP autobuild constructs the graph (`build_graph(..., enrich_symbols=True)`) and is opt-in elsewhere — tests that pre-set `protein_symbol` on synthetic graphs leave the default `enrich_symbols=False` so the pypath touch stays out of unit tests. The `node_label(data)` helper in `graph/io.py` reads `protein_symbol` first, falling back to `uniprot_ac` when missing, and produces the human-readable label used everywhere (`MAPK1`, `MAPK1:T185`).
- **Consequence edges**: `synthesize_consequence_edges` emits one `site → host_protein` edge per site with at least one kinase/phosphatase parent. These are *structural propagation hops* required for upstream walks (gating them on known effect would break reachability). Effect is the **consensus across phosphorylation/autophosphorylation parents only** — dephosphorylation parents are deliberately excluded because their effect annotation is inverted relative to the phospho-state. Consensus rule mirrors `merge_edges`. References are unioned across phos parents; `n_sources` and `n_references` recomputed from the union.
- **Autophosphorylation**: detected at build time by AC equality between source protein and target site (`detect_autophosphorylation`). Re-tags the mechanism but leaves source/target IDs as `protein:X` → `site:X:Y` so self-loops at the protein level are avoided. Invariant 1 verifies that any `autophosphorylation` edge originates at the site's host protein.
- **Merge produces factual provenance counts** (`n_sources` = distinct curated databases; `n_references` = distinct PMIDs). No synthetic confidence score is computed — earlier heuristic weights (manual-curation flag, LTP boost) were not grounded in real evidence quality and have been removed. Walk filters operate directly on `n_sources` and on whether the effect is signed.
- **Walks and signs share one filtered view**: `query/{downstream,upstream}.py` build a single `filtered_subgraph` from the induced subgraph and pass it to BOTH path enumeration and sign reading. The two cannot disagree about which parallel edge "exists." Path enumeration runs a single DFS via `nx`'s container-target overload.
- **K-hop pruning is best-first**, ordered by `-n_sources` of the next edge. When `max_nodes` fires, a `MaxNodesPruned` warning is emitted with `{max_nodes, visited_count, remaining_candidates, direction, source}` for the CLI to surface.
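
The best-first pruning in the last bullet can be illustrated with a small priority-queue walk. This is a sketch of the general technique, not the implementation in `query/`: the `best_first_walk` function, the adjacency-dict input, and the choice to count the start node toward `max_nodes` are all assumptions. The warning is returned as a plain dict with the fields listed above rather than emitted as a `MaxNodesPruned` warning.

```python
import heapq
from itertools import count


def best_first_walk(adjacency, start, depth, max_nodes, direction="downstream"):
    """Visit nodes within `depth` hops, best-supported edges first.

    `adjacency` maps node -> list of (neighbor, n_sources).
    Returns (visited nodes in visit order, warning dict or None).
    """
    visited, seen = [start], {start}
    heap, tie = [], count()  # `tie` keeps ordering stable for equal priority

    def push_neighbors(node, hops):
        if hops >= depth:
            return
        for neighbor, n_sources in adjacency.get(node, []):
            # Negated so the min-heap pops the highest n_sources first.
            heapq.heappush(heap, (-n_sources, next(tie), neighbor, hops + 1))

    push_neighbors(start, 0)
    while heap:
        _, _, node, hops = heapq.heappop(heap)
        if node in seen:
            continue
        if len(visited) >= max_nodes:
            pending = {node} | {n for _, _, n, _ in heap if n not in seen}
            return visited, {
                "max_nodes": max_nodes,
                "visited_count": len(visited),
                "remaining_candidates": len(pending),
                "direction": direction,
                "source": start,
            }
        seen.add(node)
        visited.append(node)
        push_neighbors(node, hops)
    return visited, None


adjacency = {"A": [("B", 1), ("C", 3)], "C": [("D", 2)], "B": [("E", 1)]}
visited, warning = best_first_walk(adjacency, "A", depth=2, max_nodes=3)
```

With the cap at 3, the weakly supported `B` branch is the one that gets pruned, which is the intended effect of ordering by `-n_sources`.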

### Tests

- `pytest + hypothesis`, organized under `tests/`. Run with `uv run pytest tests/`.
- No test hits the network: OmniPath / SIGNOR / mygene are mocked or fixtured; the MCP layer uses FastMCP's in-process client (`fastmcp.utilities.tests.run_server_async` for HTTP roundtrips).
- The `pypath_log/` directory at the repo root is created by pypath the first time it runs from any cwd. Gitignored.

### Cache layout

`~/.cache/phosphograph/` (override via `PHOSPHOGRAPH_CACHE_DIR`):

```
raw/omnipath/enz_sub.parquet         # cached pypath dataframe (only if omnipath is in --sources)
raw/signor/all_data.tsv              # SIGNOR bulk dump
raw/psp/edges.parquet                # cached PSP K-S × regsites join (only if --sources ...,psp)
raw/mygene/                          # resolver responses
edges/                               # per-source parquet (currently unused)
graph/graph.pkl                      # merged MultiDiGraph pickle (consumed by CLI)
graph/graph.meta.json                # sidecar recording which sources produced the pickle
conflicts.tsv                        # merge conflict log
```

`phosphograph build` **always rebuilds and replaces** `graph/graph.pkl` so that changes to `--sources` (or to `PHOSPHOGRAPH_ENABLE_PSP`) take effect immediately. The raw per-source caches under `raw/` are reused on rebuild unless you pass `--force`, which re-fetches every source from upstream. The MCP server's auto-build (`ensure_graph_cached`) keeps cache-hit behavior idempotent, but invalidates the pickle when the requested source set differs from the one recorded in `graph.meta.json` — so restarting the server after flipping `PHOSPHOGRAPH_ENABLE_PSP=1` triggers a clean rebuild on first tool call.
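
The sidecar-based invalidation can be sketched as a comparison between the requested source set and the one recorded next to the pickle. This is an illustration only: the real check lives in `ensure_graph_cached`, and both the helper name `pickle_is_stale` and the `"sources"` key in `graph.meta.json` are assumptions about a schema this README does not spell out.

```python
import json
from pathlib import Path


def pickle_is_stale(graph_dir, requested_sources):
    """Return True when the cached pickle should be rebuilt."""
    pkl = Path(graph_dir) / "graph.pkl"
    meta = Path(graph_dir) / "graph.meta.json"
    if not pkl.exists() or not meta.exists():
        return True  # cache miss: nothing usable on disk
    try:
        # Assumed schema: {"sources": ["signor", "omnipath", ...]}
        recorded = set(json.loads(meta.read_text())["sources"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return True  # unreadable sidecar: safest to rebuild
    # Compare as sets so the order given to --sources does not matter.
    return recorded != set(requested_sources)
```

Treating an unreadable sidecar as stale errs toward an extra rebuild rather than serving a graph built from unknown sources.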

### Live build expectations

| Build | Signed coverage | First-build cost |
|---|---|---|
| **SIGNOR + OmniPath** (default) | lower — OmniPath dilutes the signed share because its enz_sub rows are unsigned | slower — pypath downloads and caches OmniPath on first use |
| SIGNOR only (`--sources signor`) | high — the large majority of edges are signed | fast — no pypath import |
| SIGNOR + OmniPath + PSP (`--sources signor,omnipath,psp`) | meaningfully higher signed-edge count than the default — PSP regulatory-sites annotations recover signed direction for an annotated subset of K-S edges; the rest land as unknown | slowest on first build — `phosphosite_regsites_one_organism` pulls multiple PSP files plus SwissProt and runs orthology translation. The joined edge frame is then cached at `raw/psp/edges.parquet` so subsequent builds skip the pypath roundtrip |

Run `phosphograph build` once to see the actual node/edge counts and signed percentage for your build (printed on stdout). Every invocation rebuilds the merged pickle from the raw caches, so subsequent builds take seconds when the raw caches are warm. Pass `--force` to also re-fetch the raw upstream downloads (equivalent to deleting `~/.cache/phosphograph/raw/` first).

### CLI surface

Subcommands: `build`, `resolve`, `upstream`, `downstream`, `neighborhood`, `paths`, `export`, `info`, `conflicts`, `walkthrough`, `mcp`. Run `phosphograph --help` for the canonical list. `walkthrough` is an interactive wizard that auto-builds when no cache exists, then loops a menu — the recommended entry point for new users. `mcp` starts the FastMCP server (see [MCP server](#mcp-server)).

`--orientation horizontal|vertical` controls Graphviz layout (`rankdir=LR` vs `rankdir=TB`) on `upstream`, `downstream`, `neighborhood`, `paths`, `export`. Ignored for non-Graphviz formats.

### Things that bit us, briefly

- `omnipath` (lighter REST-only client at https://github.com/saezlab/omnipath) is a *different package* from `pypath-omnipath`. The lock-in is on pypath because of its richer database building.
- The docs at `omnipath.readthedocs.io` cover the `omnipath` PyPI package (the *other one*), so don't follow them for pypath's API.
- The `parquet` PyPI package is unmaintained — we use `pyarrow` for parquet I/O. The `parquet` entry in pyproject is a vestige and can be removed.
- pyproject originally declared `pandas>=3.0.3` and `pypath-omnipath>=0.16.20`, with no upper bound on pandas and no paramiko pin. Both the pandas 3.x floor and the missing paramiko pin turned out to be wrong; fixed by the pins listed above.

---

## License

phosphograph is released under the **GNU General Public License v3.0 or later** (`GPL-3.0-or-later`). The full license text is in [`LICENSE`](LICENSE).

The GPL choice is dictated by a runtime dependency: `pypath-omnipath` is GPL-3.0, and importing it as a library makes the combined work a derivative work under GPL terms. Anyone redistributing phosphograph — or a program that imports it — must therefore comply with the GPL (source availability, same-license redistribution, no additional restrictions). Internal academic use and modification are unrestricted; the obligations only kick in on distribution.