Metadata-Version: 2.4
Name: res-sum
Version: 0.3.0
Summary: A Python package leveraging LLMs for research evidence synthesis
Author-email: "Hammed A. Akande" <akandehammedadedamola@gmail.com>
License-Expression: MIT
Project-URL: homepage, https://github.com/drhammed/res-sum
Project-URL: repository, https://github.com/drhammed/res-sum
Project-URL: issues, https://github.com/drhammed/res-sum/issues
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pymupdf4llm>=0.0.10
Requires-Dist: python-docx>=0.8.11
Requires-Dist: langchain-text-splitters>=0.2.0
Requires-Dist: langchain-core>=0.2.0
Requires-Dist: chromadb>=0.4.0
Requires-Dist: langchain-ollama>=0.1.0
Requires-Dist: langchain-groq>=0.1.0
Requires-Dist: langchain-openai>=0.1.0
Requires-Dist: pandas>=1.5.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: pyyaml>=6.0
Provides-Extra: openai
Requires-Dist: langchain-openai>=0.1.0; extra == "openai"
Provides-Extra: anthropic
Requires-Dist: langchain-anthropic>=0.1.0; extra == "anthropic"
Provides-Extra: ollama
Requires-Dist: langchain-ollama>=0.1.0; extra == "ollama"
Provides-Extra: ollama-cloud
Requires-Dist: langchain-openai>=0.1.0; extra == "ollama-cloud"
Provides-Extra: all-providers
Requires-Dist: langchain-openai>=0.1.0; extra == "all-providers"
Requires-Dist: langchain-anthropic>=0.1.0; extra == "all-providers"
Requires-Dist: langchain-ollama>=0.1.0; extra == "all-providers"
Provides-Extra: grobid
Requires-Dist: requests>=2.28.0; extra == "grobid"
Provides-Extra: voyageai
Requires-Dist: langchain-voyageai>=0.1.0; extra == "voyageai"
Provides-Extra: eval
Requires-Dist: rouge-score>=0.1.2; extra == "eval"
Requires-Dist: bert-score>=0.3.0; extra == "eval"
Provides-Extra: viz
Requires-Dist: pyvis>=0.3.0; extra == "viz"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Dynamic: license-file
Dynamic: requires-python

# res-sum

**A Python package leveraging LLMs for research evidence synthesis**

`res-sum` takes a folder of PDF research papers and produces structured summaries of each one using Large Language Models. It extracts text, builds a knowledge graph of entities and relationships across your papers, and uses hybrid retrieval (vector search + graph traversal) to produce contextually grounded summaries.

Built with ecology in mind, but works for any scientific field.

## Features

- **Batch-summarize PDFs** — point it at a folder, get a structured summary for each paper
- **Knowledge graph** — extracts entities and relationships from your papers using LLMs, stored as a queryable NetworkX graph
- **Hybrid retrieval (GraphRAG)** — combines vector similarity search (ChromaDB) with knowledge graph traversal
- **Domain-aware prompting** — ecology-specific Chain-of-Thought prompts; custom domains via YAML
- **Multiple LLM providers** — Ollama (local, free, default), Ollama Cloud, Groq, OpenAI, Anthropic
- **Multiple output formats** — DOCX, JSON, CSV
- **Persistent storage** — vector store + knowledge graph persist to disk; incremental ingestion for new papers

## Installation

```bash
pip install res-sum
```

For additional LLM providers:

```bash
pip install "res-sum[openai]"        # OpenAI (GPT-4o)
pip install "res-sum[anthropic]"     # Anthropic (Claude)
pip install "res-sum[ollama-cloud]"  # Ollama Cloud API
pip install "res-sum[all-providers]" # All of the above
```

### Default setup (Ollama — free, local, no API key)

If you have [Ollama](https://ollama.com/) installed locally, just pull a model and `res-sum` works out of the box with no API key:

```bash
ollama pull llama3.2
```

That's it.

## Quick start

### Python API

```python
from res_sum import ResSum

# Initialize (defaults: Ollama local, ecology domain)
rs = ResSum(
    llm_provider="ollama",       # or "ollama_cloud", "groq", "openai", "anthropic"
    domain="ecology",            # or "general", or path to custom YAML
)

# Ingest papers — extracts text, builds vector store + knowledge graph
rs.ingest_papers("./pdf_folder/")

# Summarize across all papers
summary = rs.summarize("What are the key findings on pollinator decline?")

# Or batch-summarize: one summary per paper, saved to disk
rs.summarize_papers(
    pdf_directory="./pdf_folder/",
    output_directory="./summaries/",
    output_format="docx",        # or "json", "csv"
)
```

### Command line

```bash
# Batch summarize with Ollama (default)
res-sum summarize \
    --pdf_directory ./papers/ \
    --output_directory ./summaries/ \
    --domain ecology

# Use Groq instead (requires API key)
res-sum summarize \
    --pdf_directory ./papers/ \
    --output_directory ./summaries/ \
    --provider groq \
    --api_key $GROQ_API_KEY

# See available providers, models, and domains
res-sum info
```

## LLM providers

| Provider | API key needed | Rate limits | How to use |
|----------|---------------|-------------|------------|
| **Ollama** (default) | No | None (runs locally) | Install [Ollama](https://ollama.com/), pull a model |
| **Ollama Cloud** | Yes (`OLLAMA_API_KEY`) | Based on plan | `--provider ollama_cloud` |
| **Groq** | Yes (`GROQ_API_KEY`) | Free tier available | `--provider groq` |
| **OpenAI** | Yes (`OPENAI_API_KEY`) | Pay-per-use | `--provider openai` |
| **Anthropic** | Yes (`ANTHROPIC_API_KEY`) | Pay-per-use | `--provider anthropic` |

API keys can be passed directly or set as environment variables. They are never stored by the package.

### Setting up API keys

**Option 1 — Environment variables** (recommended):

```bash
# Add to your ~/.zshrc or ~/.bashrc
export OLLAMA_API_KEY="your-key-here"    # for Ollama Cloud
export GROQ_API_KEY="your-key-here"      # for Groq
export OPENAI_API_KEY="your-key-here"    # for OpenAI
export ANTHROPIC_API_KEY="your-key-here" # for Anthropic
```

Then just specify the provider — the key is picked up automatically:

```python
rs = ResSum(llm_provider="ollama_cloud")
```

**Option 2 — Pass directly:**

```python
rs = ResSum(
    llm_provider="ollama_cloud",
    api_key="your-ollama-cloud-key-here",
)
```

To get an Ollama Cloud API key, go to [ollama.com/settings/keys](https://ollama.com/settings/keys).

## Domain configurations

`res-sum` ships with two built-in domains:

- **`ecology`** (default) — entity types: Species, Location, Method, Metric, Concept, Temporal. Includes ecology-specific section headers (Study Area, Field Methods, Statistical Analysis, etc.) and a 6-step Chain-of-Thought prompt.
- **`general`** — broader entity types for any scientific field.

You can define your own domain with a YAML file:

```yaml
# my_domain.yaml
name: biomedical
entity_types:
  - name: DRUG
    description: "Pharmaceutical compounds or treatments"
    examples: ["metformin", "aspirin"]
  - name: DISEASE
    description: "Medical conditions"
    examples: ["diabetes", "cancer"]
relationship_types:
  - TREATS
  - CAUSES
  - ASSOCIATED_WITH
```

```python
rs = ResSum(domain="./my_domain.yaml")
```
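Before handing a custom domain file to `ResSum`, it can help to sanity-check that the YAML parses into the shape shown above. A quick sketch with PyYAML (the required keys are inferred from the example above, not an official schema):

```python
import yaml

# Inline copy of a minimal domain config (mirrors my_domain.yaml above)
domain_yaml = """
name: biomedical
entity_types:
  - name: DRUG
    description: "Pharmaceutical compounds or treatments"
relationship_types: [TREATS, CAUSES]
"""

# Parse and check the top-level keys before passing the file to ResSum
# (keys inferred from the example config; res-sum may validate differently)
cfg = yaml.safe_load(domain_yaml)
assert {"name", "entity_types", "relationship_types"} <= cfg.keys()
print(cfg["name"], [e["name"] for e in cfg["entity_types"]])
```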

## Retrieval modes

| Mode | What it does | Best for |
|------|-------------|----------|
| `hybrid` (default) | Vector search + graph expansion + community context, re-ranked | General summarization |
| `local` | ChromaDB vector search only | Specific factual queries |
| `graph` | Graph traversal + vector lookup | Relational queries |
| `global` | Community-level summaries + vector search | Thematic synthesis across many papers |

```python
summary = rs.summarize("...", mode="hybrid")  # or "local", "graph", "global"
```
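The intuition behind `hybrid` re-ranking can be illustrated with a toy scorer that blends vector similarity with a bonus for chunks whose entities sit near the query in the knowledge graph. The function, weights, and scoring below are hypothetical, not res-sum's actual algorithm:

```python
def hybrid_rank(vector_hits, graph_neighbors, alpha=0.7):
    """Toy re-ranker: blend vector-similarity score with a graph-proximity
    bonus. Illustrative only -- not res-sum's actual scoring."""
    ranked = []
    for chunk_id, sim in vector_hits:
        # Flat bonus if the chunk is connected to the query in the graph
        graph_bonus = 1.0 if chunk_id in graph_neighbors else 0.0
        ranked.append((chunk_id, alpha * sim + (1 - alpha) * graph_bonus))
    return sorted(ranked, key=lambda kv: kv[1], reverse=True)

# A graph-connected chunk can outrank a higher vector-similarity one
hits = [("c1", 0.9), ("c2", 0.6), ("c3", 0.55)]
print(hybrid_rank(hits, graph_neighbors={"c3"}))
```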

## Explore your knowledge base

After ingesting papers, open an interactive dashboard to visualize and inspect everything:

```python
rs.explore()  # opens in your browser
```

Or from the command line:

```bash
res-sum explore --data_dir ./knowledge_base
```

The dashboard has four tabs:

- **Overview** — papers ingested, chunk counts, entity type breakdown, graph stats
- **Knowledge Graph** — interactive graph visualization. Nodes colored by entity type, sized by connections. Click to see relationships, filter by type, search by name.
- **Vector Store** — browse all text chunks by paper. See which section each chunk came from, expand to read full text.
- **Communities** — entity clusters detected by the Leiden algorithm, with LLM-generated summaries explaining what connects each group.

It's a single HTML file — works offline, shareable with collaborators.

### Programmatic access

```python
# Query an entity
rs.query_graph("Canis lupus")

# Most connected entities
rs.get_central_entities(top_k=10)

# Community structure
rs.get_communities()

# Access the NetworkX graph directly
graph = rs.knowledge_graph.graph
```

The graph is saved as GraphML and can be imported into Neo4j or any graph visualization tool.
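Because GraphML is plain XML, the export can be inspected with nothing but the standard library. A minimal sketch (the element layout follows the GraphML spec; the real file also carries entity and relationship attributes written by NetworkX):

```python
import xml.etree.ElementTree as ET

# A minimal GraphML document shaped like a NetworkX export
# (illustrative; attribute keys in the real file will differ)
graphml = """<?xml version="1.0" encoding="utf-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph edgedefault="directed">
    <node id="Canis lupus"/>
    <node id="Yellowstone"/>
    <edge source="Canis lupus" target="Yellowstone"/>
  </graph>
</graphml>"""

# Parse with the GraphML namespace and count nodes/edges
ns = {"g": "http://graphml.graphdrawing.org/xmlns"}
root = ET.fromstring(graphml)
nodes = root.findall(".//g:node", ns)
edges = root.findall(".//g:edge", ns)
print(len(nodes), len(edges))
```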

## How it works

```
PDF files
  → Text extraction (pymupdf4llm — handles multi-column, tables)
  → Section detection (ecology-aware regex + Markdown headers)
  → Chunking (RecursiveCharacterTextSplitter)
  → ChromaDB (embed + store chunks)
  → LLM entity/relationship extraction → NetworkX knowledge graph
  → Community detection (Leiden/Louvain)
  → Hybrid retrieval (vector + graph + community)
  → LLM summarization (Chain-of-Thought prompting)
  → Output (DOCX / JSON / CSV)
```

All data persists to a `data_dir/` folder. Adding new papers only processes what's new.
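The chunking step in the pipeline above can be sketched with a minimal sliding-window splitter. This is illustrative only: res-sum uses LangChain's `RecursiveCharacterTextSplitter`, which splits on separators (paragraphs, sentences) rather than fixed character offsets, and the sizes here are not res-sum's defaults:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping fixed-size chunks (simplified stand-in
    for RecursiveCharacterTextSplitter; sizes are illustrative)."""
    step = chunk_size - overlap
    # Each window starts `step` characters after the previous one,
    # so consecutive chunks share `overlap` characters of context.
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("word " * 300, chunk_size=500, overlap=50)
print(len(chunks))
```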

## Requirements

- Python >= 3.9
- [Ollama](https://ollama.com/) installed locally (for default provider), or an API key for another provider

## Contributing

Issues and pull requests are welcome on [GitHub](https://github.com/drhammed/res-sum/issues).

## License

MIT
