Metadata-Version: 2.4
Name: cane-networks
Version: 0.1.0
Summary: Cluster Affiliation Network Embedding: platform-agnostic user networks from shared narrative participation
Author-email: Patrick Gerard <pgerard@isi.edu>
License: MIT
Project-URL: Homepage, https://github.com/pgerard/cane-networks
Project-URL: Paper, https://arxiv.org/abs/2505.21729
Keywords: social networks,narrative tracking,cross-platform,information diffusion,NLP,computational social science
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.23
Requires-Dist: pandas>=1.5
Requires-Dist: networkx>=2.8
Requires-Dist: scikit-learn>=1.1
Requires-Dist: scipy>=1.9
Requires-Dist: tqdm>=4.64
Provides-Extra: faiss-cpu
Requires-Dist: faiss-cpu>=1.7; extra == "faiss-cpu"
Provides-Extra: faiss-gpu
Requires-Dist: faiss-gpu>=1.7; extra == "faiss-gpu"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Dynamic: license-file

# cane-networks

CANE (Cluster Affiliation Network Embedding) builds user-user networks from social media content without relying on follower graphs, reposts, or any platform-specific metadata. Instead of connecting users through behavioral traces, it connects them through shared participation in latent narrative clusters — modeling what people talk about rather than how they interact.

This makes it useful whenever your data spans multiple platforms or API access is limited, since the same method applies to X, Telegram, TikTok, Truth Social, Reddit, or any combination of them without modification.

The method was introduced in [Gerard et al., ICWSM 2025](https://arxiv.org/abs/2505.21729) and applied to cross-platform narrative prediction in Gerard et al. (WWW 2025), where discourse-based networks substantially outperformed behavioral baselines across information operation detection, ideological stance prediction, and cross-platform emergence forecasting.

---

## Installation

```bash
pip install cane-networks
```

FAISS is optional but strongly recommended for large datasets. Install whichever variant matches your hardware:

```bash
pip install "cane-networks[faiss-cpu]"   # CPU
pip install "cane-networks[faiss-gpu]"   # GPU
```

Without FAISS, the package falls back to scikit-learn's `NearestNeighbors`, which is adequate for smaller datasets.
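
For intuition, the optional-dependency pattern can be sketched as below. This is not the package's internal code: `top_k_cosine` is a hypothetical helper, and it assumes dense input with no all-zero rows. It uses FAISS inner-product search on unit-normalized vectors when FAISS is importable, and scikit-learn's cosine `NearestNeighbors` otherwise.

```python
import numpy as np

def top_k_cosine(X, k):
    """Return (similarities, indices) of the k most cosine-similar rows per row.

    Hypothetical helper for illustration; assumes no all-zero rows.
    """
    X = np.ascontiguousarray(X, dtype=np.float32)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize rows
    try:
        import faiss  # optional dependency
    except ImportError:
        # Fallback: brute-force cosine search with scikit-learn.
        from sklearn.neighbors import NearestNeighbors
        nn = NearestNeighbors(n_neighbors=k, metric="cosine").fit(X)
        dist, idx = nn.kneighbors(X)
        return 1.0 - dist, idx  # cosine distance -> cosine similarity
    # On unit vectors, inner product equals cosine similarity.
    index = faiss.IndexFlatIP(X.shape[1])
    index.add(X)
    sims, idx = index.search(X, k)
    return sims, idx

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
sims, idx = top_k_cosine(X, k=2)
```

Each row's first neighbor is the row itself, so a real pipeline would drop self-matches before building edges.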

---

## Usage

The input is a DataFrame with at least two columns: one for user identifiers and one for narrative cluster labels. How you get the cluster labels is up to you — DP-Means over sentence embeddings works well, but any clustering is fine.
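
If you don't have cluster labels yet, the sketch below shows one way to produce them. It is a minimal DP-Means written for illustration, not the clustering used in the paper or shipped with this package. The embeddings are random stand-ins for real sentence embeddings, and `dp_means` and `lam` are hypothetical names.

```python
import numpy as np
import pandas as pd

def dp_means(X, lam, max_iter=50):
    """Minimal DP-Means: open a new cluster whenever a point's squared
    distance to every existing center exceeds lam."""
    centers = [X.mean(axis=0)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        changed = False
        for i, x in enumerate(X):
            d2 = ((np.asarray(centers) - x) ** 2).sum(axis=1)
            j = int(d2.argmin())
            if d2[j] > lam:          # too far from everything: new cluster
                centers.append(x.copy())
                j = len(centers) - 1
            if labels[i] != j:
                labels[i], changed = j, True
        # Recompute centers, dropping clusters that ended up empty.
        used = np.unique(labels)
        centers = [X[labels == k].mean(axis=0) for k in used]
        labels = np.searchsorted(used, labels)  # relabel to 0..K-1
        if not changed:
            break
    return labels

# Stand-in embeddings: two well-separated groups of five posts each.
rng = np.random.default_rng(0)
embeddings = np.vstack([rng.normal(0.0, 0.1, (5, 8)),
                        rng.normal(3.0, 0.1, (5, 8))])

posts = pd.DataFrame({"disc_node_id": [f"user_{i % 4}" for i in range(10)]})
posts["cluster"] = dp_means(embeddings, lam=4.0)
```

The resulting `posts` frame has the two columns the examples in this README expect: `disc_node_id` and `cluster`.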

### Static graph (CANE)

```python
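import pandas as pd
from cane import CANE

# df needs: disc_node_id (user), cluster (narrative label)
# "posts.csv" is a hypothetical path; any DataFrame with these columns works
df = pd.read_csv("posts.csv")

model = CANE(similarity_threshold=0.2)
G = model.fit(df)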
```
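
To make the idea concrete, here is a rough sketch of the underlying computation using only NumPy and pandas: user-by-cluster counts, TF-IDF weighting, cosine similarity, then a threshold. It is an illustration under assumptions, not the package's implementation. In particular, the smoothed IDF formula and the dense all-pairs similarity are choices made here for brevity.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "disc_node_id": ["a", "a", "a", "b", "b", "c", "c"],
    "cluster":      [0,   0,   1,   0,   1,   2,   2],
})
threshold = 0.2

# User x cluster count matrix.
counts = pd.crosstab(df["disc_node_id"], df["cluster"])
X = counts.to_numpy(dtype=float)
n_users = X.shape[0]

# Smoothed IDF (an assumption): down-weights clusters that many users touch.
doc_freq = (X > 0).sum(axis=0)
idf = np.log((1 + n_users) / (1 + doc_freq)) + 1
tfidf = X * idf

# Cosine similarity between all user pairs.
norms = np.linalg.norm(tfidf, axis=1, keepdims=True)
unit = tfidf / np.where(norms == 0, 1, norms)
sim = unit @ unit.T

# Keep each unordered pair whose similarity clears the threshold.
iu, ju = np.triu_indices(n_users, k=1)
keep = sim[iu, ju] >= threshold
edges = pd.DataFrame({
    "source": counts.index[iu[keep]],
    "target": counts.index[ju[keep]],
    "weight": sim[iu, ju][keep],
})
```

Here users `a` and `b` share two clusters and end up connected, while `c` stays isolated. An edge list like this can be loaded into NetworkX with `nx.from_pandas_edgelist(edges, edge_attr="weight")`.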

### Temporal graph (t-CANE)

t-CANE computes similarities at each time bin and aggregates them across time. Repeated co-engagement across bins strengthens edges; lapsed connections decay.

```python
from cane import tCANE

# df additionally needs a time_bin column, e.g. biweekly periods
df["time_bin"] = df["created_at"].dt.to_period("2W").astype(str)

model = tCANE(method="decay", lambda_=0.2)
G = model.fit(df)
```
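
If `created_at` arrives as strings, or you want bins of exactly 14 days measured from the first post, the bins can also be computed with explicit arithmetic. This is a sketch with made-up data; the 14-day width and the date-string labels are assumptions. Explicit arithmetic also sidesteps any question of how `to_period` aligns multi-week frequencies in your pandas version.

```python
import pandas as pd

df = pd.DataFrame({
    "disc_node_id": ["a", "b", "a", "c"],
    "cluster": [0, 0, 1, 1],
    "created_at": ["2024-01-01", "2024-01-10", "2024-01-16", "2024-02-01"],
})

# The .dt accessor requires a datetime column.
df["created_at"] = pd.to_datetime(df["created_at"])

# Count whole 14-day windows since midnight of the first post.
start = df["created_at"].min().normalize()
bin_index = (df["created_at"] - start).dt.days // 14

# Label each bin by its start date; ISO date strings sort chronologically.
bin_start = start + pd.to_timedelta(bin_index * 14, unit="D")
df["time_bin"] = bin_start.dt.strftime("%Y-%m-%d")
```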

Available aggregation methods: `decay`, `sum`, `average`, `max`, `stability`.
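
The sketch below shows how such aggregations could combine one user pair's per-bin similarities into a single edge weight. The package defines the exact formulas, so treat this as an assumption-laden illustration. In particular, `decay` is modeled here as an exponentially weighted sum in which a similarity observed `k` bins before the latest one is discounted by `exp(-lambda_ * k)`. `stability` is omitted because its definition is not described here.

```python
import numpy as np

def aggregate_similarity(sims, method="decay", lambda_=0.2):
    """Combine per-bin similarities (oldest first) into one edge weight.

    Hypothetical helper; the formulas are illustrative assumptions.
    """
    sims = np.asarray(sims, dtype=float)
    if method == "sum":
        return float(sims.sum())
    if method == "average":
        return float(sims.mean())
    if method == "max":
        return float(sims.max())
    if method == "decay":
        # Age 0 is the most recent bin; older bins get smaller weights.
        ages = np.arange(len(sims) - 1, -1, -1)
        return float((sims * np.exp(-lambda_ * ages)).sum())
    raise ValueError(f"unknown method: {method}")

recent = aggregate_similarity([0.0, 0.0, 1.0])   # co-engaged in the latest bin
lapsed = aggregate_similarity([1.0, 0.0, 0.0])   # co-engaged only long ago
```

Under this reading, `lapsed` is smaller than `recent`, which matches the intuition that connections fade when users stop engaging with the same narratives.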

---

## Handling large narrative vocabularies

When your corpus has many narrative clusters (say, more than ten thousand), the TF-IDF user vectors become very high-dimensional and sparse. Cosine similarity degrades in this regime because most users share almost no clusters — the network ends up empty or dominated by a handful of high-volume accounts.

The fix is dimensionality reduction via TruncatedSVD before the similarity search. The key question is how many dimensions to use, and `suggest_svd_dims` answers it in terms of variance retained:

```python
from cane import suggest_svd_dims

recommended_dims, curve = suggest_svd_dims(df, target_variance=0.90)
# → 147 dimensions retain 90.0% of variance
```

This fits a single SVD on the full matrix and reads off the cumulative explained variance, so you get the entire curve cheaply and can pick whatever retention level makes sense for your use case. Once you have settled on a retention level, pass it via `target_variance`:

```python
model = CANE(similarity_threshold=0.2, target_variance=0.90)
G = model.fit(df)
```
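
To see what "dimensions needed for a target variance" means in practice, the sketch below fits scikit-learn's `TruncatedSVD` once and reads the answer off the cumulative explained-variance curve. `dims_for_variance` and `max_components` are hypothetical names, and the synthetic matrix stands in for real TF-IDF user vectors. This is not the code behind `suggest_svd_dims`.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

def dims_for_variance(X, target_variance=0.90, max_components=20):
    """Smallest number of SVD components whose cumulative explained
    variance reaches target_variance, or None if never reached."""
    k = min(max_components, X.shape[1] - 1)
    svd = TruncatedSVD(n_components=k, random_state=0).fit(X)
    curve = np.cumsum(svd.explained_variance_ratio_)
    idx = int(np.searchsorted(curve, target_variance))
    dims = idx + 1 if idx < k else None
    return dims, curve

# Synthetic user x cluster matrix driven by 3 latent factors.
rng = np.random.default_rng(0)
X = rng.random((200, 3)) @ rng.random((3, 60))

dims, curve = dims_for_variance(X, target_variance=0.90)
```

Because the toy matrix has only three underlying factors, at most three components are needed. `TruncatedSVD` also accepts SciPy sparse matrices, which matters when the real matrix is mostly zeros.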

If you're not sure whether you need it, check the sparsity diagnostic that `suggest_svd_dims` prints. Sparsity above ~0.97 is the signal that reduction will help.
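
You can also check sparsity by hand. The sketch below assumes sparsity means the fraction of zero entries in the user-by-cluster count matrix, which may differ in detail from what `suggest_svd_dims` reports.

```python
import pandas as pd

df = pd.DataFrame({
    "disc_node_id": ["a", "a", "b", "c"],
    "cluster":      [0,   1,   1,   2],
})

# User x cluster counts; zeros mark clusters a user never touched.
counts = pd.crosstab(df["disc_node_id"], df["cluster"])
sparsity = float((counts.to_numpy() == 0).mean())
print(f"sparsity = {sparsity:.3f}")
```

For a real corpus, a dense crosstab can be too large to hold in memory; a SciPy sparse matrix gives the same ratio via its count of stored non-zeros.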

---

## Picking a similarity threshold

A threshold that's too high gives you a sparse, disconnected graph; too low and you're connecting users who don't really share much. `suggest_threshold` shows you the connectivity rate at several candidate values so you can make an informed choice:

```python
model = CANE(target_variance=0.90)
results = model.suggest_threshold(df)

# Threshold → Connectivity rate
# 0.10  →  94.3% nodes connected
# 0.20  →  81.2% nodes connected
# 0.30  →  63.7% nodes connected
# 0.40  →  41.0% nodes connected
# 0.50  →  22.1% nodes connected
```

For most downstream tasks (information operation detection, stance prediction), something in the 0.15–0.30 range tends to work well. Narrative emergence prediction can tolerate lower thresholds since the signal comes from neighbor activity counts rather than community structure.
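
The connectivity rate itself is simple to reason about. The sketch below assumes a node counts as connected when at least one other user has similarity at or above the threshold. `connectivity_rates` is a hypothetical helper operating on a small dense similarity matrix, not the package's method.

```python
import numpy as np

def connectivity_rates(sim, thresholds):
    """Share of nodes with at least one neighbor at or above each threshold."""
    sim = np.array(sim, dtype=float)     # copy so the caller's matrix is untouched
    np.fill_diagonal(sim, -np.inf)       # ignore self-similarity
    best = sim.max(axis=1)               # each node's strongest link
    return {t: float((best >= t).mean()) for t in thresholds}

sim = [
    [1.00, 0.90, 0.10],
    [0.90, 1.00, 0.25],
    [0.10, 0.25, 1.00],
]
rates = connectivity_rates(sim, thresholds=[0.2, 0.3, 0.95])
```

Only each node's strongest link matters for this statistic, so the whole curve follows from a single row-wise maximum.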

---

## Graph diagnostics

```python
from cane import graph_diagnostics

graph_diagnostics(G, name="My corpus")
# My corpus
# ========================================
# Nodes:              45231
# Edges:              312847
# Connected nodes:    41983 (92.8%)
# ...
```
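
The headline numbers are easy to reproduce yourself, assuming `G` is a NetworkX graph and that "connected" means a node has at least one edge. `basic_diagnostics` below is a hypothetical helper, not part of the package.

```python
import networkx as nx

def basic_diagnostics(G):
    """Node, edge, and connected-node counts for an undirected graph."""
    n = G.number_of_nodes()
    connected = sum(1 for _, deg in G.degree() if deg > 0)
    return {
        "nodes": n,
        "edges": G.number_of_edges(),
        "connected_nodes": connected,
        "connected_share": connected / n if n else 0.0,
    }

G = nx.Graph()
G.add_nodes_from(["a", "b", "c", "d"])       # "d" stays isolated
G.add_edges_from([("a", "b"), ("b", "c")])
stats = basic_diagnostics(G)
```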

---

## Citation

If you use this in your work, please cite:

```bibtex
@article{gerard2025bridging,
  title={Bridging the narrative divide: Cross-platform discourse networks in fragmented ecosystems},
  author={Gerard, Patrick and Hanley, Hans WA and Luceri, Luca and Ferrara, Emilio},
  journal={arXiv preprint arXiv:2505.21729},
  year={2025}
}
```

---

## License

MIT
