Metadata-Version: 2.4
Name: aiva-agent
Version: 0.2.4
Summary: Clinical-genomics agent: ask natural-language questions over a local VCF and get annotated, literature-grounded answers.
Author-email: Tarun Mamidi <tarun@mamidi.ai>
License-Expression: Apache-2.0
Project-URL: Homepage, https://github.com/MHSPL/aiva-agent
Project-URL: Repository, https://github.com/MHSPL/aiva-agent
Project-URL: Issues, https://github.com/MHSPL/aiva-agent/issues
Keywords: genomics,vcf,variant-classification,acmg,bioinformatics,clinical-genomics
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Healthcare Industry
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Operating System :: OS Independent
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: openai-agents==0.14.8
Requires-Dist: openai==2.33.0
Requires-Dist: duckdb==1.5.2
Requires-Dist: httpx==0.28.1
Requires-Dist: requests==2.33.1
Requires-Dist: myvariant==1.0.0
Requires-Dist: ddgs==9.14.1
Requires-Dist: trafilatura==2.0.0
Requires-Dist: python-dotenv==1.2.2
Provides-Extra: dev
Requires-Dist: pytest==9.0.3; extra == "dev"
Requires-Dist: pytest-asyncio==1.3.0; extra == "dev"
Requires-Dist: respx==0.23.1; extra == "dev"
Requires-Dist: responses==0.26.0; extra == "dev"
Dynamic: license-file

# aiva-agent

A standalone CLI clinical-genomics agent. Ask natural-language questions about a local VCF, gather variant annotations, search literature, find clinical trials, prioritize genes from HPO phenotypes, and run ACMG/AMP variant classification — using **any** OpenAI-compatible provider (OpenAI, Anthropic, xAI Grok, Together, Fireworks, OpenRouter, etc.).

```bash
export LLM_MODEL=gpt-5.5
export LLM_BASE_URL=https://api.openai.com/v1
export LLM_API_KEY=sk-...
aiva_agent --vcf data/test.vcf.gz \
  --prompt "How many PASS variants on chr1?"
```

## Overview

The agent runs locally over a tabix-indexed VCF and orchestrates a curated set of tools to answer questions about variants, retrieve supporting literature, surface clinical trials, prioritize candidate genes from phenotype terms, and run ACMG/AMP variant classification. Tools are exposed under the following names and can be selectively turned off via `--disable`:

| Tool | What it does |
|---|---|
| `vcf` | Queries over your tabix-indexed `.vcf.gz` file. |
| `annotate` | Variant annotation. Supports human and plant species. |
| `literature` | PubMed / PMC search with gene / disease / variant / chemical entity annotations. |
| `trials` | Clinical-trials search by condition, intervention, gene/variant, phase, recruiting status; full-detail retrieval by NCT ID. |
| `phen2gene` | Rank candidate genes for a list of HPO phenotype terms. Can use negative terms to exclude genes. |
| `web` | Web search and clean content extraction from any URL. |
| `classify` | Runs ACMG/AMP 2015 (germline) and AMP/ASCO/CAP 2017 (somatic) variant classification and returns the result as structured JSON. |

## Prerequisites

- **Python 3.11+** with `pip ≥ 24`.
- **htslib tools** (`bgzip`, `tabix`) only needed for preparing VCFs:
  ```bash
  brew install htslib   # macOS
  ```
- **A model provider's API key** — see the Provider section below. The agent works with any OpenAI-compatible endpoint.

## Install

```bash
pip install aiva-agent
```

## Quick start

Prepare a tabix-indexed VCF (only needed once per file):

```bash
bgzip -k path/to/sample.vcf
tabix -p vcf path/to/sample.vcf.gz
```

**Tip: prefer a pre-annotated VCF.** If you've already run your VCF through a
variant-effect annotator (VEP, SnpEff, ANNOVAR, …), the agent can read those
annotations directly from the file. Pre-annotation is recommended for variant
prioritization and classification.

Set provider credentials:

```bash
# Example for OpenAI
export LLM_MODEL=gpt-5.5
export LLM_BASE_URL=https://api.openai.com/v1
export LLM_API_KEY=sk-...
```

Ask a question:

```bash
aiva_agent --vcf path/to/sample.vcf.gz \
  --prompt "List pathogenic variants related to breast cancer"
```

## Environment variables

You can drive everything via flags or env vars. Most users put the LLM credentials in `.env` once and skip the flags on every run — copy `.env.example` to `.env` to start.
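For reference, a minimal `.env` might look like this (values are illustrative placeholders; the `AIVA_*` lines are optional):

```bash
# LLM provider credentials
LLM_MODEL=gpt-5.5
LLM_BASE_URL=https://api.openai.com/v1
LLM_API_KEY=sk-...

# Optional defaults
AIVA_VCF=data/sample.vcf.gz
AIVA_DISABLE=web,trials
AIVA_MAX_TURNS=25
```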

| Variable | Purpose | Default |
|---|---|---|
| `LLM_MODEL` | Model ID for the provider (e.g. `gpt-5.5`, `claude-opus-4-7`). | — |
| `LLM_BASE_URL` | Provider's OpenAI-compatible base URL. | — |
| `LLM_API_KEY` | Provider API key. | — |
| `AIVA_VCF` | Default VCF path or `alias=path,...` spec. | unset (vcf tool auto-disables) |
| `AIVA_DISABLE` | Comma-separated tools to disable. | unset (all tools on) |
| `AIVA_MAX_TURNS` | Max agent turns per run. | `25` |
| `AIVA_FORCE` | Overwrite `-o` destination without passing `--force`. Accepts `1/true/yes/on`. | unset |
| `AIVA_SESSION_ID` | Conversation session ID; persist history across runs in `~/.aiva/sessions.db`. | new UUID per run |
| `AIVA_STREAM` | Force streaming on (`1/true/yes/on`) or off (`0/false/no/off`). `--stream` flag also forces on. | auto (on iff stdout is a TTY and `--output` is unset) |
| `AIVA_STREAM_TOOL_OUTPUT` | When streaming, also dump each tool's output to stderr (debug aid). Accepts `1/true/yes/on`. | unset (off) |
| `AIVA_STREAM_TOOL_OUTPUT_MAX` | Cap on chars per tool output when the dump above is on. `0` means unlimited. | `2000` |

Precedence: **CLI flag > shell export > `.env` value**. So you can override per-shell or per-run without editing the file.

**Security note:** prefer `export LLM_API_KEY=...` over `--api-key sk-...` so the secret doesn't leak into shell history or `ps`.

## Usage

```
aiva_agent [--disable <tools>] [--vcf PATH] [--prompt TEXT | --prompt-file PATH]
           [--model ID] [--base-url URL] [--api-key KEY]
           [-o OUTPUT] [--force] [--session-id ID] [--stream]
```

### `--disable`

All tools are ON by default. Pass `--disable a,b` to drop tools from the agent's palette.

```bash
aiva_agent --disable vcf --model gpt-5.5 \
  --prompt "Annotate rs113488022 and find recent papers."

# Multi-tool disable: comma-separated, no spaces
aiva_agent --disable phen2gene,web,trials --vcf data/test.vcf.gz --model gpt-5.5 \
  --prompt "Classify the most pathogenic chr1 variant."
```

`--disable` falls back to the `AIVA_DISABLE` env var (e.g. `AIVA_DISABLE=vcf` in `.env`). The flag, when given, replaces (rather than merges with) the env value.

### VCF path resolution

The `vcf` tool needs a path. Provide it via `--vcf PATH` or the `AIVA_VCF` env var (set in `.env`); the flag wins if both are set. If neither is set, the `vcf` tool auto-disables itself with a one-line stderr warning and the rest of the palette runs as normal. If a path is provided but the file doesn't exist, the CLI exits with an error — that case clearly signals user intent and shouldn't be silently swallowed. Pass `--disable vcf` to skip the resolution dance entirely.

#### Multiple VCFs (trio / tumor-normal)

The `--vcf` flag accepts a comma-separated `alias=path` list. Each entry becomes a separately named handle the agent can query and cross-reference.

```bash
aiva_agent --vcf "proband=trio/proband.vcf.gz,father=trio/father.vcf.gz,mother=trio/mother.vcf.gz" \
  --prompt "Count de novo het variants in the proband (parents both 0/0)."
```

`AIVA_VCF` accepts the same syntax (`AIVA_VCF=proband=p.vcf.gz,father=f.vcf.gz` in `.env`).
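A hypothetical parser for the `alias=path` spec might look like this (illustrative only; how a bare, alias-less path is named internally is an assumption — `default` here):

```python
def parse_vcf_spec(spec: str) -> dict[str, str]:
    """Parse 'alias=path,...' into {alias: path}; a bare path gets the alias 'default'."""
    handles: dict[str, str] = {}
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue
        alias, sep, path = entry.partition("=")
        if sep:
            handles[alias.strip()] = path.strip()
        else:
            handles["default"] = entry
    return handles

print(parse_vcf_spec("proband=trio/p.vcf.gz,father=trio/f.vcf.gz"))
```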

**Prefer a multisample VCF when you have one.** If you've run joint calling and produced a single multisample file, pass it as one VCF — joint-called files preserve missing-genotype information consistently across samples, which is what trio analyses depend on. The multi-VCF path above works fine, but a joint-called multisample file is the more correct input when it's available.

### Three ways to pass a prompt

```bash
# 1. Inline
aiva_agent --vcf data/test.vcf.gz --prompt "How many variants on chr1?"

# 2. From a file (UTF-8, trailing whitespace stripped)
aiva_agent --vcf data/test.vcf.gz --prompt-file prompts/chr1_audit.txt

# 3. From stdin (Unix pipe)
echo "How many variants on chr2?" | aiva_agent --vcf data/test.vcf.gz --prompt -
```

### Writing the answer to a file

```bash
# Convenience flag — creates parent dirs, refuses to clobber unless --force
aiva_agent --model gpt-5.5 --disable vcf \
  --prompt "..." -o reports/answer.md

# Or shell redirection (warnings go to stderr)
aiva_agent --model gpt-5.5 --disable vcf \
  --prompt "..." > answer.md
```

### Streaming output

Long agent runs (multi-tool prompts, classification) can take a while. By default, when stdout is an interactive terminal, the CLI streams as it goes: tool-call markers like `[tool] vcf_query…` / `[tool] done` go to stderr the moment each call fires, and the final answer streams to stdout token-by-token. When stdout is piped or `--output` is set, streaming is automatically off so consumers receive a single clean string.

```bash
# Auto-on in a terminal:
aiva_agent --vcf data/test.vcf.gz --prompt "How many PASS variants on chr1?"

# Force on (e.g. when piping but you still want progress on stderr):
AIVA_STREAM=1 aiva_agent --prompt "..." | tee out.txt

# Force off in a terminal:
AIVA_STREAM=0 aiva_agent --prompt "..."

# Or use the explicit flag:
aiva_agent --stream --prompt "..."
```

For debugging, set `AIVA_STREAM_TOOL_OUTPUT=1` to also print each tool's output (truncated to `AIVA_STREAM_TOOL_OUTPUT_MAX` chars, default 2000) to stderr after the `[tool] done` marker. Useful when you want to see what data the agent is reasoning over.
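The cap semantics (`0` = unlimited) might be implemented along these lines (hypothetical; the truncation marker is invented):

```python
def truncate_tool_output(text: str, max_chars: int) -> str:
    """Apply AIVA_STREAM_TOOL_OUTPUT_MAX; 0 means no cap."""
    if max_chars == 0 or len(text) <= max_chars:
        return text
    return text[:max_chars] + "... [truncated]"
```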

### Conversation sessions

By default each `aiva_agent` invocation is stateless. To carry context across calls, set `--session-id` (or `AIVA_SESSION_ID`) to any string you like; history is stored locally at `~/.aiva/sessions.db` — no server, no setup. If you don't set one, the CLI prints an auto-generated ID on stderr that you can copy to resume:

```bash
export AIVA_SESSION_ID=case-2026-05
aiva_agent --prompt "Patient has chr7:117559590 G>A in CFTR. Remember it."
aiva_agent --prompt "What variant did I just mention, and what gene?"
```

Each session ID is independent — pick a fresh one per case to keep histories from bleeding into each other.

### Use from a notebook or Python script

Set env vars in one cell, call `aiva_agent(...)` in the next; calls within the same kernel automatically share a session, so follow-up questions remember the prior turn.

```python
# cell 1
import os
os.environ["LLM_API_KEY"] = "sk-..."
os.environ["LLM_BASE_URL"] = "https://api.openai.com/v1"
os.environ["LLM_MODEL"] = "gpt-5.5"
os.environ["AIVA_VCF"] = "data/sample.vcf.gz"

# cell 2
from aiva_agent import aiva_agent
print(aiva_agent("List 3 likely-pathogenic variants from vcf."))
print(aiva_agent("Of those, which is in a recessive disease gene?"))  # remembers
```

To start a fresh conversation mid-notebook, call `reset_session()`. To pin a specific ID (e.g. resume across kernel restarts), set `AIVA_SESSION_ID` in env or pass `session_id="my-case"` to `aiva_agent`. Per-call kwargs `vcf=`, `disable=`, `model=`, `base_url=`, `api_key=` override the corresponding env vars.

## Examples

```bash
# All tools are on by default — pass --disable vcf for prompts that don't need
# a local VCF, or set AIVA_DISABLE / AIVA_VCF in .env to make it permanent.

# Variant annotation
aiva_agent --disable vcf --model gpt-5.5 \
  --prompt "Annotate rs113488022. Report ClinVar significance and population AF."

# Literature search
aiva_agent --disable vcf --model gpt-5.5 \
  --prompt "Find 3 recent papers on TP53 R175H in lung cancer."

# Clinical trials
aiva_agent --disable vcf --model gpt-5.5 \
  --prompt "Find phase 2 recruiting BRAF V600E melanoma trials."

# HPO -> genes
aiva_agent --disable vcf --model gpt-5.5 \
  --prompt "Rank candidate genes for HP:0001250 + HP:0001263."

# Web search + scrape (free, no API key)
aiva_agent --disable vcf --model gpt-5.5 \
  --prompt "Find and scrape the latest NCCN melanoma guideline summary."

# ACMG/AMP classification
aiva_agent --disable vcf --model gpt-5.5 \
  --prompt "Classify rs113488022 (BRAF V600E) under AMP for melanoma on GRCh38."

# VCF query + literature (vcf tool turns on automatically once a path is provided)
aiva_agent --vcf data/test.vcf.gz --model gpt-5.5 \
  --prompt "Find any TP53 variants and pull supporting literature."
```

## License

Apache-2.0. See `LICENSE`.
