Metadata-Version: 2.4
Name: article-q
Version: 0.2.1
Summary: Agent/LLM-enabled narrative reviews of academic manuscripts
License-Expression: MIT
License-File: LICENSE
Requires-Python: >=3.11
Requires-Dist: jinja2>=3.1
Requires-Dist: marker-pdf>=1.0
Requires-Dist: openpyxl>=3.1
Requires-Dist: pandas>=2.0
Requires-Dist: pydantic-ai>=0.1.0
Requires-Dist: pydantic>=2.0
Requires-Dist: pymupdf4llm>=0.0.17
Requires-Dist: pymupdf>=1.24
Requires-Dist: rich>=13.0
Requires-Dist: tomli-w>=1.0
Requires-Dist: typer>=0.12
Description-Content-Type: text/markdown

# Article-Q

Agent/LLM-enabled narrative reviews of academic manuscripts. Parses PDFs, extracts structured data using LLM agents guided by a questions spreadsheet, and validates results through a multi-agent consensus mechanism.

## Installation

Requires Python 3.11+.

```bash
pip install -e .
```

## Step-by-step workflow

### Step 1: Initialize the project

```bash
articleq init
```

This creates `articleq.toml` with default settings. Open it and set:

- `project.papers_dir` — directory containing your PDF manuscripts
- `project.questions_file` — path to your questions CSV (see Step 2)
- `llm.api_keys.openai` or `llm.api_keys.google` — your API key (or set the `OPENAI_API_KEY` / `GEMINI_API_KEY` environment variables)

### Step 2: Create a questions file

Prepare a CSV (or Excel) file defining the data you want to extract. Only the `id` and `question` columns are required; the table below lists every supported column:

| Column | Description | Default |
|---|---|---|
| `id` | Unique identifier for the question | *(required)* |
| `question` | The question text | *(required)* |
| `category` | Grouping label (e.g. "methods", "outcomes") | `general` |
| `output_type` | One of `text`, `category`, `numeric`, `boolean`, `list` | `text` |
| `options` | Comma-separated valid answers (for `category` type) | |
| `description` | Additional guidance for the extraction agent | |
| `depends_on` | Comma-separated IDs of questions this one depends on | |

Example:

```csv
id,question,category,output_type,options,description,depends_on
sample_size,What was the total sample size?,demographics,numeric,,Total number of participants enrolled,
primary_outcome,What was the primary outcome?,outcomes,text,,The main outcome measure,
study_design,What was the study design?,methods,category,"RCT,cohort,case-control,cross-sectional,other",,
study_design_other,"If other, please specify",methods,text,,,study_design
blinding,Was the study blinded?,methods,boolean,,Whether any form of blinding was used,
```
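
Before running the pipeline it can help to sanity-check the questions file yourself. The following is a minimal sketch, not part of `articleq`: the function name `check_questions` is hypothetical, and the rule that `category` questions must list `options` is an assumption rather than documented behavior.

```python
import csv
import io

REQUIRED = {"id", "question"}
OUTPUT_TYPES = {"text", "category", "numeric", "boolean", "list"}


def check_questions(csv_text: str) -> list[str]:
    """Return a list of human-readable problems found in a questions CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED - set(reader.fieldnames or [])
    if missing:
        return [f"missing required column(s): {sorted(missing)}"]
    problems = []
    seen = set()
    # Line 1 is the header; this numbering assumes no multi-line fields.
    for line_no, row in enumerate(reader, start=2):
        qid = (row["id"] or "").strip()
        if not qid:
            problems.append(f"line {line_no}: empty id")
        elif qid in seen:
            problems.append(f"line {line_no}: duplicate id {qid!r}")
        seen.add(qid)
        # output_type is optional and defaults to "text" per the table above.
        otype = (row.get("output_type") or "text").strip()
        if otype not in OUTPUT_TYPES:
            problems.append(f"line {line_no}: unknown output_type {otype!r}")
        # Assumption of this sketch: category questions should list options.
        if otype == "category" and not (row.get("options") or "").strip():
            problems.append(f"line {line_no}: category question without options")
    return problems
```

Call it with the file contents, e.g. `check_questions(open("questions.csv", encoding="utf-8").read())`; an empty list means no problems were found by these checks.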

### Step 3: Parse PDFs

```bash
articleq parse -c articleq.toml
```

This converts each PDF to structured markdown and saves the output to `output/parsed/`. Each paper produces:
- A `.json` file containing the parsed blocks (the source of truth)
- A `.md` file for human-readable review
- Extracted figures saved as PNGs in `output/parsed/figures/`

Two parsing backends are available (set `parsing.backend` in config):
- `pymupdf` (default) — fast, uses pymupdf4llm layout detection
- `marker` — uses marker-pdf with OCR; better for scanned documents

### Step 4: Review and clean parsed content (optional)

Preview the parsed papers in a browser:

```bash
articleq visualize --parsed-dir output/parsed/
```

The JSON files in `output/parsed/` are the source of truth. Each file contains a `blocks` array — the LLM agents read from the `content` field of each block, so edits there directly affect extraction. Do **not** edit the `.md` files or the `raw_markdown` field in the JSON; both are regenerated by `articleq rebuild`.

Each block looks like this:

```json
{
  "block_type": "text",
  "content": "The study enrolled 150 partcipants between Jan and Dec 2020.",
  "page_number": 3,
  "section": "Methods"
}
```

Common edits:

- **Fix OCR errors** — correct garbled text, broken words, or misrecognized characters (e.g. `"partcipants"` → `"participants"`)
- **Remove noise** — delete blocks containing headers, footers, page numbers, or watermarks that the parser picked up
- **Fix broken tables** — repair malformed markdown tables in `"table"` blocks
- **Remove irrelevant blocks** — delete entire blocks (e.g. reference lists, copyright notices) that add noise without useful content
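
For repetitive edits across many papers, a small script may be quicker than editing by hand. This is a sketch only: it assumes the `blocks` array sits at the top level of each JSON file, and the helper names and the page-number pattern are hypothetical.

```python
import json
import re
from pathlib import Path

# Hypothetical noise pattern: blocks whose whole content is a bare page number.
PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d+\s*$", re.IGNORECASE)


def clean_blocks(blocks: list[dict], fixes: dict[str, str]) -> list[dict]:
    """Drop page-number blocks and apply literal replacements to `content`."""
    cleaned = []
    for block in blocks:
        content = block.get("content", "")
        if PAGE_NUMBER.match(content):
            continue  # remove noise
        for wrong, right in fixes.items():
            content = content.replace(wrong, right)
        # Copy the block so the input list is left unmodified.
        cleaned.append({**block, "content": content})
    return cleaned


def clean_file(path: Path, fixes: dict[str, str]) -> None:
    """Rewrite one parsed JSON file in place, touching only the `blocks` array."""
    data = json.loads(path.read_text(encoding="utf-8"))
    data["blocks"] = clean_blocks(data["blocks"], fixes)
    path.write_text(json.dumps(data, indent=2, ensure_ascii=False), encoding="utf-8")
```

For example, `clean_file(Path("output/parsed/study_smith_2020.pdf.json"), {"partcipants": "participants"})` would fix the typo shown above. Keep a backup of the JSON files before bulk-editing them.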

After editing blocks, rebuild the markdown:

```bash
articleq rebuild -c articleq.toml
```

This regenerates the `.md` files and updates `raw_markdown` in the JSON caches to match the block content.

### Step 5: Run LLM extraction

```bash
articleq extract -c articleq.toml
```

This sends each question to the extraction agents for every paper, runs validation, and writes results to `output/results.json`. The `extract` command will refuse to run if the markdown is out of sync with the blocks — run `articleq rebuild` first if you've edited blocks.

Alternatively, run everything (parse + extract) in one shot:

```bash
articleq run -c articleq.toml
```

### Step 6: Export and visualize results

Convert results to CSV or Excel:

```bash
articleq export output/results.json --format csv
articleq export output/results.json --format excel
```

Generate an interactive HTML evidence viewer:

```bash
articleq visualize -r output/results.json
```

The viewer shows each paper's content alongside extracted answers, with evidence passages highlighted in the text. Pass `-q questions.csv` to include question text in the results panel.

## How it works

Each question for each paper goes through a three-agent workflow:

```
Paper + Question
      |
      v
  Extraction Agent  -->  Answer A
      |
      v
  Validation Agent  -->  Answer B  (blind, independent)
      |
      v
  Compare A and B
      |
      +-- AGREE + high confidence --> Accept A as final
      |
      +-- DISAGREE --> Consensus Agent reviews both --> Final answer
```

- The **extraction agent** reads the paper and extracts an answer with evidence quotes, page numbers, and a confidence score.
- The **validation agent** performs a blind, independent re-extraction (it does not see the first answer).
- If the two answers **agree** and both have confidence above `auto_accept_threshold`, the answer is accepted directly.
- If they **disagree**, the **consensus agent** reviews both answers against the source material and produces a final arbitrated answer.
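
The routing rule in the bullets above can be sketched in a few lines. This is an illustration, not the actual implementation: the `Answer` class and `decide` function are hypothetical, the comparison is treated as inclusive (`>=`), and answers that agree with low confidence are assumed to go to the consensus agent, which the description above does not specify.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    value: str
    confidence: float  # assumed to lie in [0, 1]


def decide(a: Answer, b: Answer, agree: bool, auto_accept_threshold: float = 0.9) -> str:
    """Return "accept" to keep the extraction answer, or "consensus" to arbitrate."""
    # Assumption: the threshold is inclusive (a minimum confidence).
    both_confident = (
        a.confidence >= auto_accept_threshold
        and b.confidence >= auto_accept_threshold
    )
    if agree and both_confident:
        return "accept"
    # Disagreement goes to the consensus agent. Sending low-confidence
    # agreement there as well is an assumption of this sketch.
    return "consensus"
```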

### Question dependencies

Some questions depend on the answers to earlier questions. For example, a follow-up like "If other, please specify" only makes sense after the study type has been determined. Use the `depends_on` column to declare these relationships:

```csv
id,question,depends_on
study_type,What was the study design?,
study_type_other,"If other, please specify",study_type
```

When dependencies are present, questions are processed in **waves**: all questions with no dependencies run concurrently first, followed by the questions whose dependencies were all answered in earlier waves, and so on. Within each wave, concurrency is controlled by `pipeline.concurrency` as usual. Dependent questions receive a "Prior Answers" section in their prompt containing the question text and answer of each dependency.

The `depends_on` column is optional. CSVs without it continue to work as before (all questions run concurrently in a single wave). Circular dependencies and references to nonexistent question IDs are caught at load time.
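
Wave scheduling of this kind is a layered topological sort. The sketch below shows one way it could work, including the two load-time checks; `plan_waves` is a hypothetical name and the real scheduler may differ.

```python
def plan_waves(depends_on: dict[str, list[str]]) -> list[list[str]]:
    """Group question IDs into waves; each wave depends only on earlier waves."""
    # Reject references to question IDs that do not exist.
    for qid, deps in depends_on.items():
        unknown = [d for d in deps if d not in depends_on]
        if unknown:
            raise ValueError(f"{qid} depends on unknown question(s): {unknown}")

    done: set[str] = set()
    waves: list[list[str]] = []
    remaining = dict(depends_on)
    while remaining:
        # Everything whose dependencies are all answered can run now.
        wave = [q for q, deps in remaining.items() if all(d in done for d in deps)]
        if not wave:
            # No progress is possible, so the leftover questions form a cycle.
            raise ValueError(f"circular dependency among: {sorted(remaining)}")
        waves.append(wave)
        done.update(wave)
        for q in wave:
            del remaining[q]
    return waves
```

With the CSV above, `plan_waves({"study_type": [], "study_type_other": ["study_type"]})` yields two waves, one per question.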

### Agreement checking

Agreement checking is type-aware:
- **Categorical/boolean**: exact match
- **Numeric**: within 5% tolerance
- **Text**: normalized string comparison
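
A rough sketch of such a comparison follows. The details are assumptions: the 5% tolerance is taken relative to the larger of the two values, and text normalization is taken to mean lowercasing and collapsing whitespace. `answers_agree` is a hypothetical name, and any other type (such as `list`) falls through to the text rule here.

```python
import re


def answers_agree(a: str, b: str, output_type: str, rel_tol: float = 0.05) -> bool:
    """Type-aware comparison of two extracted answers given as strings."""
    if output_type in ("category", "boolean"):
        return a == b  # exact match
    if output_type == "numeric":
        x, y = float(a), float(b)
        # Assumption: tolerance is relative to the larger magnitude.
        return abs(x - y) <= rel_tol * max(abs(x), abs(y))

    # Assumption: "normalized" means lowercased with whitespace collapsed.
    def norm(s: str) -> str:
        return re.sub(r"\s+", " ", s).strip().lower()

    return norm(a) == norm(b)
```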

## Configuration reference

```toml
[project]
name = "my-review"              # Project name used in output
papers_dir = "./papers"          # Directory containing PDF files
questions_file = "./questions.csv"  # Path to questions CSV/Excel
output_dir = "./output"          # Where results are written
# context_file = "./context.md" # Optional: additional instructions for the LLM

[parsing]
backend = "pymupdf"             # "pymupdf" or "marker"
reparse = false                 # Force re-parsing even if cached results exist

[llm]
extraction_model = "openai:gpt-4o"   # Model for primary extraction
validation_model = "openai:gpt-4o"   # Model for validation pass
consensus_model = "openai:gpt-4o"    # Model for arbitration
# temperature = 0.0                  # LLM sampling temperature (omit to use provider default)

[llm.api_keys]
openai = "${OPENAI_API_KEY}"    # Supports environment variable expansion
google = "${GEMINI_API_KEY}"

[pipeline]
concurrency = 5                 # Max concurrent agent calls
skip_validation = false         # Set true to skip the validation/consensus step
checkpoint = true               # Save per-paper checkpoints for resume
chunk_max_tokens = 8000         # Max tokens per chunk for large PDFs

[validation]
auto_accept_threshold = 0.9     # Min confidence to auto-accept agreement
always_validate_categories = ["primary_outcome"]  # Always run full 3-agent flow for these
```
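
The `${OPENAI_API_KEY}` syntax refers to an environment variable. As an illustration of how such placeholders can be expanded (a sketch, not the tool's own code; `expand_env` is a hypothetical helper and the empty-string fallback for unset variables is an assumption):

```python
import os
import re
from typing import Mapping, Optional

_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace each ${NAME} with its value from `env` (default: os.environ)."""
    env = os.environ if env is None else env
    # Assumption: unset variables expand to an empty string.
    return _VAR.sub(lambda m: env.get(m.group(1), ""), value)
```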

## Additional topics

### Caching and re-parsing

During `parse`, each parsed PDF is saved as both a markdown file and a JSON cache file under `{output_dir}/parsed/`. On subsequent runs, cached JSON files are loaded automatically, skipping PDF re-parsing.

To force re-parsing (e.g. after replacing a PDF or upgrading the parser), use the `--reparse` flag:

```bash
articleq parse -c articleq.toml --reparse
```

To re-parse a single paper, delete its cached `.json` file and run `parse` again.

Output directory structure:

```
output/
├── parsed/
│   ├── study_smith_2020.pdf.md
│   ├── study_smith_2020.pdf.json   # cached ParsedPaper (used on re-runs)
│   ├── study_jones_2021.pdf.md
│   ├── study_jones_2021.pdf.json
│   └── figures/
│       ├── study_smith_2020_img1.png
│       ├── study_smith_2020_img2.png
│       └── study_jones_2021_img1.png
└── results.json
```
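
To see which papers would be parsed on the next run, you can compare the two directories yourself. This sketch assumes the `<name>.pdf.json` naming shown in the tree above; `uncached_pdfs` is a hypothetical helper, not an `articleq` function.

```python
from pathlib import Path


def uncached_pdfs(papers_dir: Path, parsed_dir: Path) -> list[Path]:
    """List PDFs that have no `<name>.pdf.json` cache file in `parsed_dir`."""
    return sorted(
        pdf
        for pdf in papers_dir.glob("*.pdf")
        if not (parsed_dir / f"{pdf.name}.json").exists()
    )
```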

### Context file

You can provide a markdown file with additional instructions and domain knowledge to guide the LLM agents. Set `context_file` in the `[project]` section of your config:

```toml
[project]
context_file = "./context.md"
```

The contents are passed as additional system instructions to all three agents (extraction, validation, consensus). Use this for:

- Domain-specific definitions and terminology
- Important distinctions the LLM should be aware of
- Guidance on how to handle ambiguous cases
- Any background knowledge relevant to the review

Example `context.md`:

```markdown
# Extraction Context

This review focuses on dentin hypersensitivity (DH) clinical trials.

## Key Definitions

The Holland 1997 definition of DH: "short, sharp pain arising from exposed
dentine in response to stimuli typically thermal, evaporative, tactile, osmotic
or chemical and which cannot be ascribed to any other form of dental defect or
pathology."

## Important Distinctions

- Distinguish between stimuli used for DIAGNOSIS versus OUTCOME MEASURES
- "dh_threshold_teeth" refers to minimum teeth per PATIENT, not total in study
```

### Large PDFs

Papers exceeding `chunk_max_tokens` are split into chunks and filtered before being sent to an agent:

1. The paper is split into chunks by content blocks.
2. Chunks are scored for relevance to the current question using keyword overlap.
3. Only the most relevant chunks (within the token budget) are sent to the agent.
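
The selection step can be sketched as follows. This is an illustration under stated assumptions: token counts are approximated by word counts, relevance is the number of distinct words shared with the question, and `tokens` and `select_chunks` are hypothetical names.

```python
import re


def tokens(text: str) -> list[str]:
    """Crude tokenizer: lowercase alphanumeric words (a stand-in for a real one)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def select_chunks(chunks: list[str], question: str, max_tokens: int) -> list[str]:
    """Keep the chunks sharing the most words with the question, within budget."""
    q_words = set(tokens(question))
    scored = [(len(q_words & set(tokens(c))), i) for i, c in enumerate(chunks)]
    # Highest overlap first; ties broken by document order.
    scored.sort(key=lambda s: (-s[0], s[1]))
    chosen, used = [], 0
    for _, i in scored:
        cost = len(tokens(chunks[i]))
        if used + cost <= max_tokens:
            chosen.append(i)
            used += cost
    # Return the selection in original document order.
    return [chunks[i] for i in sorted(chosen)]
```

Restoring document order at the end keeps the selected text readable as a condensed version of the paper rather than a ranked list.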

### Multimodal support

Figures extracted from PDFs are sent to the LLM as binary images alongside the text content. This happens automatically — if a parsed paper contains image data, the images are included in the prompt sent to the extraction, validation, and consensus agents.

- Both the `pymupdf` and `marker` backends extract images and store them as base64 in the parsed data.
- Text placeholders like `[Image from page N]` remain in the text for positional context, and the actual image binaries are appended after the text.
- No configuration is needed. If image data exists in the parsed paper, it is included. Models that do not support vision will receive only the text portion.

### Evaluation

You can evaluate extraction results against manually created ground truth using the `eval` command. This is useful for benchmarking accuracy across models, prompts, or configurations.

**Benchmark layout:**

```
benchmarks/
└── example/
    ├── config.toml      # Pipeline config passed with -c in the commands below
    ├── papers/          # PDF manuscripts
    ├── questions.csv    # Questions used for extraction
    └── expected.csv     # Ground truth answers
```

The `expected.csv` file uses the same column format as `articleq export` output. At minimum it needs `paper`, `question_id`, and `final_value` columns:

```csv
paper,question_id,final_value
study_smith_2020.pdf,sample_size,150
study_smith_2020.pdf,study_design,RCT
study_smith_2020.pdf,primary_outcome,overall survival
```

**Running an evaluation:**

```bash
articleq run -c benchmarks/example/config.toml
articleq export output/results.json --format csv
articleq eval output/results.csv benchmarks/example/expected.csv -q benchmarks/example/questions.csv
```

The `-q` flag is optional but recommended — it enables type-aware comparison (numeric tolerance, boolean normalization, etc.) by reading each question's `output_type` from the questions file.

The report shows:
- **Overall accuracy** — percentage of answers matching ground truth
- **Per-question breakdown** — accuracy for each question across all papers
- **Detailed mismatches** — expected vs actual value for every disagreement
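
The three parts of the report can be computed from the two CSVs roughly as shown below. This sketch uses plain string equality only (no type-aware comparison), and `accuracy_report` is a hypothetical name rather than the `eval` command's internals.

```python
from collections import defaultdict


def accuracy_report(actual: list[dict], expected: list[dict]) -> dict:
    """Compare rows keyed by (paper, question_id) using exact string equality."""
    got = {(r["paper"], r["question_id"]): r["final_value"] for r in actual}
    per_question = defaultdict(lambda: [0, 0])  # question_id -> [matched, total]
    mismatches = []
    for row in expected:
        key = (row["paper"], row["question_id"])
        value = got.get(key)  # None when the result is missing entirely
        stats = per_question[row["question_id"]]
        stats[1] += 1
        if value == row["final_value"]:
            stats[0] += 1
        else:
            mismatches.append((key, row["final_value"], value))
    matched = sum(m for m, _ in per_question.values())
    total = sum(t for _, t in per_question.values())
    return {
        "overall": matched / total if total else 0.0,
        "per_question": {q: m / t for q, (m, t) in per_question.items()},
        "mismatches": mismatches,
    }
```

Rows can be loaded with `list(csv.DictReader(open(path, newline="")))` from the standard library.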

**LLM-as-judge evaluation:**

Strict string comparison can produce false negatives for free-text answers where the meaning matches but the wording differs (e.g. "RCT, parallel group" vs "Randomised controlled trial - Parallel group trial"). The `--judge-model` option enables an LLM judge that re-evaluates deterministic mismatches for semantic equivalence:

```bash
articleq eval output/results.csv benchmarks/example/expected.csv \
  -q benchmarks/example/questions.csv \
  --judge-model openai:gpt-4o-mini \
  -c benchmarks/example/config.toml
```

When enabled:
- Answers that match deterministically are accepted as before (no LLM call).
- Mismatches on `text`, `category`, and `list` type questions are sent to the judge model, which decides whether the answers are semantically equivalent.
- `boolean` and `numeric` types keep their existing deterministic checks only.
- The `-q` questions file is required when using `--judge-model` (the question text provides context to the judge).
- The `-c` config file is optional — used to resolve API keys. Without it, keys are read from environment variables.

The report distinguishes strict matches from judge matches and includes the judge's reasoning for any answers it accepted:

```
  Strict matches:    12
  Judge matches:     3
  Mismatches:        5
  Matched:           15
  Accuracy:          75.0%
```
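
The numbers in this sample report are consistent with counting judge matches toward accuracy: 12 strict plus 3 judge matches out of 20 answers gives 75.0%. A small sketch of that bookkeeping, together with the rule for which mismatches reach the judge (both function names are hypothetical):

```python
# Types whose deterministic mismatches are re-evaluated by the judge model.
JUDGED_TYPES = {"text", "category", "list"}


def needs_judge(output_type: str, strict_match: bool) -> bool:
    """Only deterministic mismatches on judged types are sent to the LLM judge."""
    return not strict_match and output_type in JUDGED_TYPES


def summarize(strict: int, judge: int, mismatches: int) -> dict:
    """Combine strict and judge matches into the totals shown in the report."""
    matched = strict + judge
    total = matched + mismatches
    accuracy = round(100 * matched / total, 1) if total else 0.0
    return {"matched": matched, "accuracy": accuracy}
```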

## CLI reference

```
articleq init [-o PATH]              Generate a starter config file
articleq run -c CONFIG [--reparse]    Run the full pipeline (parse + extract)
articleq parse -c CONFIG [--reparse]  Parse PDFs and save as markdown (no LLM calls)
articleq rebuild -c CONFIG            Rebuild markdown and JSON from edited blocks
articleq extract -c CONFIG            Run LLM extraction on pre-parsed papers
articleq export RESULTS [--format csv|excel] [-o PATH]   Export to CSV/Excel
articleq eval RESULTS EXPECTED [-q QUESTIONS] [--judge-model MODEL] [-c CONFIG]   Evaluate against ground truth
articleq visualize -r RESULTS [-o PATH] [-q QUESTIONS]   Generate HTML evidence viewer
articleq visualize --parsed-dir DIR [-o PATH]            Preview parsed papers (no results)
```
