Metadata-Version: 2.4
Name: litsynth
Version: 0.1.0
Summary: Systematic review, literature analysis, PRISMA export, and citation-safe research workflows.
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: pandas
Requires-Dist: rispy
Requires-Dist: seaborn
Requires-Dist: nltk
Requires-Dist: numpy
Requires-Dist: openai
Requires-Dist: transformers
Requires-Dist: chromadb
Requires-Dist: sentence_transformers
Requires-Dist: pymupdf
Requires-Dist: fastapi
Requires-Dist: uvicorn
Requires-Dist: streamlit
Requires-Dist: requests
Requires-Dist: paper-qa>=5
Requires-Dist: openpyxl

# LitSynth: Research Management System
<div align="center">
  <img src="https://raw.githubusercontent.com/chandraveshchaudhari/chandraveshchaudhari/refs/heads/initial_setup/data/logo.png">
</div>

Research Management System (RMS) is a local-first research workflow for systematic review, literature analysis, provenance-aware retrieval, and citation-safe writing.

It combines:

- PDF and RIS/BibTeX ingestion
- screening and review-table generation
- PRISMA-oriented export workflows
- evidence-grounded research chat over indexed papers
- deterministic citation insertion backed by authoritative metadata

RMS is designed to run in three operating modes:

- local only: everything stays on your machine, including vector storage and project files
- cloud only: the API and web app run on Google Cloud Run with cloud-managed storage and hosted LLM providers
- hybrid: local indexing and private project files stay on the workstation while selected artifacts, configuration, or indexes are mirrored to Google Cloud for demos or team access

The system architecture and deployment design live in [docs/architecture.md](docs/architecture.md).

## System Design Diagram

```mermaid
flowchart TD
  user[Researcher] --> ui[Streamlit UI\napps/webapp/app.py]
  ui --> api[FastAPI API\napps/api/main.py]
  api --> orch[RMS Orchestrator\nand Review Pipeline]

  orch --> ingest[PDF + RIS/BibTeX Ingestion]
  orch --> review[Screening + Review Matrix\nPRISMA Exports]
  orch --> citation[Citation Store +\nDeterministic Insertion]
  orch --> rag[RAG Routing]

  rag --> chroma[Chroma Vector DB\nBAAI/bge-base-en-v1.5]
  rag --> rmsllm[Local RMS Answering\nOllama qwen3:8b]
  rag --> paperqa[PaperQA Adapter\nollama/nomic-embed-text]

  ingest --> files[Project Files\nPDFs RIS Outputs]
  review --> files
  citation --> files
  chroma --> files

  files -. hybrid sync .-> gcs[Google Cloud Storage\nproject artifacts + manifests]
  gcs --> cloudapi[Cloud Run API]
  gcs --> cloudui[Cloud Run Web App]
  cloudapi --> hosted[Hosted LLM Providers\nOpenAI Claude Gemini]
```

## What RMS Does

RMS is built around an end-to-end research pipeline rather than a single chat surface.

Core capabilities:

- ingest RIS metadata and PDF full text
- validate and screen papers before review
- index research papers into a local vector database
- answer questions with supporting evidence chunks
- generate review-ready outputs such as Excel matrices and PRISMA artifacts
- insert citations into markdown drafts using imported RIS/BibTeX records only

Citation integrity is a hard constraint in this repository:

- citations come from imported RIS/BibTeX metadata
- the system does not treat LLM-generated references as authoritative
- citation insertion requires source document and character-range provenance

## Repository Layout

```text
apps/api/                 FastAPI service for search, RAG, citations, status, and indexing
apps/webapp/              Streamlit UI for review workflows and Research Copilot
apps/web-static/          Static frontend assets
docs/                     Architecture, roadmap, PRISMA, and product documentation
infra/aws/                AWS deployment artifacts kept for reference
infra/gcp/                Cloud Run deployment assets for API and Streamlit UI
papers/                   Research writeups and project papers
rms/                      Core pipeline, retrieval, citation, and orchestration modules
extensions/asreview-rms/  ASReview-oriented extension package
```

## System Components

### 1. Core RMS library

The [rms](rms) package contains the research pipeline:

- PDF parsing and chunking
- keyword and threshold-based screening
- local vector indexing with Chroma
- semantic retrieval and reranking hooks
- local and API-backed LLM integrations
- citation-store and deterministic insertion workflows

### 2. FastAPI backend

The API in [apps/api/main.py](apps/api/main.py) exposes the main service surface:

- `GET /health`
- `GET /system-status`
- `POST /index-documents`
- `POST /semantic-search`
- `POST /rag-with-provenance`
- citation load and insertion endpoints

This service owns retrieval orchestration, provider routing, status inspection, and corpus indexing.
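
As an illustration, a semantic search request can be issued directly against the API. The exact request schema lives in apps/api/main.py, so the `query` and `top_k` field names below are assumptions rather than the confirmed payload:

```bash
# Illustrative request shape; check apps/api/main.py for the actual field names
curl -X POST http://127.0.0.1:8000/semantic-search \
  -H "Content-Type: application/json" \
  -d '{"query": "carbon pricing and firm innovation", "top_k": 5}'
```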

### 3. Streamlit web application

The UI in [apps/webapp/app.py](apps/webapp/app.py) is the main operator console for:

- review setup and filtering
- Excel and PRISMA export
- Research Copilot chat
- provider/model configuration
- vector corpus status and indexing controls
- citation-safe export to markdown

### 4. Local and hosted LLM paths

RMS currently supports local-first and hosted model usage:

- local Ollama for RMS and PaperQA-backed workflows
- direct local Qwen inference for review generation on macOS with `mps`
- OpenAI, Claude, and Gemini for hosted or user-provided API workflows

The retrieval embedding path and the answer-generation path are intentionally separate. Today the default split is:

- RMS local vector embedding: `BAAI/bge-base-en-v1.5`
- RMS default answer model: `ollama/qwen3:8b`
- PaperQA local embedding: `ollama/nomic-embed-text`

## Quick Start

### Prerequisites

- Python 3.10+
- macOS, Linux, or a compatible container runtime
- optional: Ollama for local inference

Create and activate a virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -e .
```

Once the package is published to PyPI, the intended public install flow is:

```bash
pip install litsynth
```

If you prefer requirements-based installation:

```bash
pip install -r requirements.txt
```

### Local LLM setup with Ollama

If you want fully local answering:

```bash
ollama serve
ollama pull qwen3:8b
ollama pull llama3.1:8b
ollama pull nomic-embed-text
```
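
To confirm the daemon is reachable and the models are present before starting RMS (assuming the stock Ollama port 11434):

```bash
# List pulled models locally and via the Ollama HTTP API
ollama list
curl http://127.0.0.1:11434/api/tags
```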

### Run the API and web app

Once the package is installed, you can launch both services with one command:

```bash
litsynth launch
```

This starts the FastAPI backend and the Streamlit UI together, then opens the browser.

If you prefer the manual two-terminal workflow, use the commands below.

From the repository root:

```bash
PYTHONPATH="$PWD" ./.venv/bin/python -m uvicorn apps.api.main:app --host 127.0.0.1 --port 8000
```

In a second terminal:

```bash
PYTHONPATH="$PWD" RMS_API_URL="http://127.0.0.1:8000" ./.venv/bin/streamlit run apps/webapp/app.py --server.address 127.0.0.1 --server.port 8501 --server.headless true
```

Open the UI in a browser and use the system-status panel to confirm:

- indexed paper count
- active RMS embedding model
- active RMS and PaperQA answer models
- available Ollama models
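
The same checks are available from the API endpoints listed earlier, which is handy for headless or scripted setups:

```bash
# Quick health and corpus status checks against the local API
curl http://127.0.0.1:8000/health
curl http://127.0.0.1:8000/system-status
```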

### Index papers

You can index papers from the UI or directly through the API:

```bash
curl -X POST http://127.0.0.1:8000/index-documents \
  -H "Content-Type: application/json" \
  -d '{"directory_path": "data/mdpi", "max_files": 10}'
```

### Run the CLI

The package installs an `rms` command:

```bash
rms --help
```

It also installs a `litsynth` launcher, so the beginner-friendly local startup flow is:

```bash
litsynth launch
```

Use `litsynth launch --help` to override ports or disable automatic browser opening.

Typical review run:

```bash
rms run-review \
  --ris-dir data \
  --pdf-dir data/mdpi \
  --output-dir outputs/literature_review \
  --provider ollama \
  --model qwen3:8b
```

## Configuration

Important runtime settings:

- `RMS_API_URL`: Streamlit UI target for the backend API
- `OLLAMA_BASE_URL`: local or remote Ollama endpoint
- `RMS_EMBEDDING_MODEL`: RMS vector embedding model, default `BAAI/bge-base-en-v1.5`
- `RMS_PAPERQA_EMBEDDING`: optional PaperQA embedding override
- `OPENAI_API_KEY`: hosted OpenAI provider
- `ANTHROPIC_API_KEY`: hosted Claude provider
- `GEMINI_API_KEY`: hosted Gemini provider

If you change the RMS embedding model, reindex the corpus. Mixing embeddings from different models in the same vector store produces inconsistent retrieval results.
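
As a minimal local-first sketch, a shell configuration could look like the following; the Ollama URL assumes the stock default port, and the hosted keys are only needed when routing answers through those providers:

```bash
# Local-first defaults; adjust to your environment
export RMS_API_URL="http://127.0.0.1:8000"
export OLLAMA_BASE_URL="http://127.0.0.1:11434"
export RMS_EMBEDDING_MODEL="BAAI/bge-base-en-v1.5"

# Optional hosted providers
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="..."
export GEMINI_API_KEY="..."
```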

## Local, Cloud, and Hybrid Operation

### Local only

Use this mode when data privacy and low-latency iteration matter most.

- PDFs, RIS files, outputs, and Chroma persistence remain on local disk
- Ollama and local Qwen can serve all generation paths
- no external cloud service is required for the main workflow

### Cloud only

Use this mode when you want a hosted demo or a shared environment.

- deploy the API and Streamlit UI from [infra/gcp/cloudrun-api/README.md](infra/gcp/cloudrun-api/README.md) and [infra/gcp/cloudrun-webapp/README.md](infra/gcp/cloudrun-webapp/README.md)
- keep project files and generated outputs in Google Cloud Storage
- use hosted LLM providers or a remotely reachable Ollama endpoint
- keep the embedding model fixed across index build and query time
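
The linked deployment READMEs define the exact build and deploy steps. As a rough sketch of the shape only, a source-based Cloud Run deployment looks like this; the service name, source path, and region are placeholders, not values this repository prescribes:

```bash
# Rough shape only; follow infra/gcp/cloudrun-api/README.md for the real steps
gcloud run deploy litsynth-api \
  --source infra/gcp/cloudrun-api \
  --region europe-west1 \
  --allow-unauthenticated
```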

### Hybrid

Use this mode when local research data stays private but you still want a hosted demo or sync target.

- local workstation remains the source of truth for PDFs and indexing
- selected outputs, manifests, and review artifacts are mirrored to GCS
- cloud deployment uses the same embedding configuration and project manifest to avoid retrieval drift
- hosted UI can point to a local or cloud API depending on the demo topology
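
One way to mirror selected artifacts in this mode is a one-way object-storage sync from the workstation; the bucket and prefix layout below is an assumption for illustration, not a convention this repository enforces:

```bash
# One-way mirror of review outputs and manifests to a project-scoped GCS prefix
gsutil -m rsync -r outputs/literature_review \
  gs://my-rms-bucket/projects/literature_review/outputs
```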

## Recommended Google Cloud Sync Design

RMS already has Cloud Run deployment assets. For consistent local, cloud, and hybrid behavior, keep the following artifacts synchronized at the project level:

1. project files
   Store PDFs, RIS files, exported review sheets, and PRISMA outputs under a project-scoped directory locally and a matching prefix in GCS.

2. embedding manifest
   Persist a small project manifest (see the sketch after this list) with at least:
   - embedding model name
   - chunk size and overlap
   - vector store type
   - index build timestamp
   - corpus file list or content hashes

3. vector index lifecycle
   Rebuild the cloud index whenever the embedding model, chunking policy, or document set changes. Do not assume vector files produced with one embedding family are valid for another.

4. provider separation
   Keep retrieval embeddings, answer-generation models, and citation metadata as separate concerns. A Qwen or Llama answer model can work well with a BGE or Nomic embedding model as long as the retrieval layer is internally consistent.

5. citation authority
   Sync RIS/BibTeX files and citation logs as first-class project artifacts. Citation metadata should not be reconstructed from chat output.
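
A minimal manifest sketch, written as JSON next to the index artifacts; the field names are illustrative, not a schema the repository enforces:

```bash
# Illustrative manifest only; adapt field names to the project's own convention
cat > outputs/literature_review/manifest.json <<'EOF'
{
  "embedding_model": "BAAI/bge-base-en-v1.5",
  "chunk_size": 1000,
  "chunk_overlap": 200,
  "vector_store": "chroma",
  "index_built_at": "2025-01-01T00:00:00Z",
  "corpus_hashes": {"data/mdpi/example.pdf": "sha256:..."}
}
EOF
```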

The detailed system design for these modes is documented in [docs/architecture.md](docs/architecture.md).

## Additional Documentation

- [docs/architecture.md](docs/architecture.md)
- [docs/rms_features_prisma_pipeline.md](docs/rms_features_prisma_pipeline.md)
- [apps/api/README.md](apps/api/README.md)
- [apps/webapp/README.md](apps/webapp/README.md)
- [infra/gcp/cloudrun-api/README.md](infra/gcp/cloudrun-api/README.md)
- [infra/gcp/cloudrun-webapp/README.md](infra/gcp/cloudrun-webapp/README.md)

## Current State and Boundaries

What is implemented now:

- local Chroma-backed retrieval
- API and Streamlit app for indexing, search, and citation-safe workflows
- Cloud Run packaging for the API and UI
- local-first research chat with evidence chunks and visible corpus status

What still requires explicit deployment choices beyond the local default:

- managed cloud vector storage
- object-storage sync and project manifests as an operational convention
- production secret handling for hosted LLM providers

That separation is intentional. The repository already runs well locally, and the cloud design can be layered on without weakening the local research workflow.

## Packaging Direction

A single-command beginner flow is possible and is now wired into the package metadata.

- after package installation, users can run `litsynth launch`
- once published to PyPI under the same name, the public install flow becomes `pip install litsynth`

A macOS desktop app is also realistic.

The clean upgrade path is:

1. stabilize the local `litsynth launch` flow
2. package the same launcher and backend into a desktop shell
3. ship a beginner-friendly macOS app bundle that starts the local services automatically

For a future desktop version, the lowest-friction options are usually:

- PyInstaller or Briefcase for a Python-first desktop package
- Tauri or Electron if you want a more polished native app shell later

For this codebase, the simplest near-term path is a Python-packaged macOS app that wraps the same FastAPI plus Streamlit launcher the package already provides.
