Metadata-Version: 2.4
Name: locisimiles
Version: 1.4.0
Summary: LociSimiles is a Python package for finding intertextual links in Latin literature using pre-trained language models.
Author: Julian Schelb
Author-email: julian.schelb@uni-konstanz.de
Requires-Python: >=3.10,<3.15
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Provides-Extra: dev
Provides-Extra: gui
Provides-Extra: rule-based
Requires-Dist: accelerate (>=0.20.0)
Requires-Dist: audioop-lts (>=0.2.1,<0.3.0) ; (python_version >= "3.13") and (extra == "gui")
Requires-Dist: chromadb (>=0.4.0,<2.0.0)
Requires-Dist: gradio (>=5.49.1) ; extra == "gui"
Requires-Dist: mkdocs (>=1.5.0) ; extra == "dev"
Requires-Dist: mkdocs-material (>=9.0.0) ; extra == "dev"
Requires-Dist: mkdocstrings[python] (>=0.24.0) ; extra == "dev"
Requires-Dist: mypy (>=1.10.0) ; extra == "dev"
Requires-Dist: numpy (>=1.24.0,<3.0.0)
Requires-Dist: pandas (>=2.0.0,<3.0.0)
Requires-Dist: poethepoet (>=0.24.0) ; extra == "dev"
Requires-Dist: pre-commit (>=3.5.0) ; extra == "dev"
Requires-Dist: pydantic (==2.10.6) ; extra == "gui"
Requires-Dist: pytest (>=8.0.0,<10.0.0) ; extra == "dev"
Requires-Dist: pytest-cov (>=4.0.0) ; extra == "dev"
Requires-Dist: python-semantic-release (>=9.0.0) ; extra == "dev"
Requires-Dist: ruff (>=0.8.0) ; extra == "dev"
Requires-Dist: sentence-transformers (>=3.0.0,<6.0.0)
Requires-Dist: spacy (>=3.5.0,<4.0.0) ; extra == "rule-based"
Requires-Dist: torch (>=2.0.0,<3.0.0)
Requires-Dist: transformers (>=4.30.0,<5.0.0)
Description-Content-Type: text/markdown

# Loci Similes

**LociSimiles** is a Python package for finding intertextual links in Latin literature using pre-trained language models.

## Basic Usage

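```python
# Assumed import paths; adjust them if your installed version exposes these names elsewhere.
from locisimiles.document import Document
from locisimiles.pipeline import (
    ClassificationPipelineWithCandidategeneration,
    pretty_print,
)

# Load example query and source documents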
query_doc = Document("../data/hieronymus_samples.csv")
source_doc = Document("../data/vergil_samples.csv")

# Load the pipeline with pre-trained models
pipeline = ClassificationPipelineWithCandidategeneration(
    classification_name="...",
    embedding_model_name="...",
    device="cpu",
)

# Run the pipeline with the query and source documents
results = pipeline.run(
    query=query_doc,    # Query document
    source=source_doc,  # Source document
    top_k=3             # Number of top similar candidates to classify
)

pretty_print(results)

# Save results to CSV or JSON
pipeline.to_csv("results.csv")
pipeline.to_json("results.json")
```

## Command-Line Interface

LociSimiles provides a command-line tool for running the pipeline directly from the terminal.

### Basic Usage

```bash
locisimiles query.csv source.csv -o results.csv
```

### Advanced Usage

```bash
locisimiles query.csv source.csv -o results.csv \
  --classification-model julian-schelb/xlm-roberta-large-class-lat-intertext-v1 \
  --embedding-model julian-schelb/multilingual-e5-large-emb-lat-intertext-v1 \
  --top-k 20 \
  --threshold 0.7 \
  --device cuda \
  --verbose
```

### Options

- **Input/Output:**
  - `query`: Path to query document CSV file (columns: `seg_id`, `text`)
  - `source`: Path to source document CSV file (columns: `seg_id`, `text`)
  - `-o, --output`: Path to output CSV file for results (required)

- **Models:**
  - `--classification-model`: HuggingFace model for classification (default: xlm-roberta-large-class-lat-intertext-v1)
  - `--embedding-model`: HuggingFace model for embeddings (default: multilingual-e5-large-emb-lat-intertext-v1)

- **Pipeline Parameters:**
  - `-k, --top-k`: Number of top candidates to retrieve per query segment (default: 10)
  - `-t, --threshold`: Classification probability threshold for filtering results (default: 0.85)

- **Device:**
  - `--device`: Choose `auto`, `cuda`, `mps`, or `cpu` (default: auto-detect)

- **Other:**
  - `-v, --verbose`: Enable detailed progress output
  - `-h, --help`: Show help message
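
Both input files are plain CSVs with a `seg_id` and a `text` column. A minimal sketch of preparing such a file with the standard library is shown here. The helper name `write_segments`, the segment IDs, and the file name are illustrative assumptions and not part of LociSimiles.

```python
import csv
from pathlib import Path


def write_segments(path, segments):
    """Write (seg_id, text) pairs to a CSV with the header the CLI expects."""
    with Path(path).open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["seg_id", "text"])  # column names listed under Input/Output
        writer.writerows(segments)


# Illustrative segments; the IDs are made up for this example.
segments = [
    ("verg_aen_1.1", "Arma virumque cano, Troiae qui primus ab oris"),
    ("verg_aen_1.2", "Italiam fato profugus Laviniaque venit"),
]
write_segments("source.csv", segments)
```

Using `csv.writer` rather than manual string joins keeps commas inside the Latin text safely quoted.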

### Output Format

The CLI saves results to a CSV file with the following columns:
- `query_id`: Query segment identifier
- `query_text`: Query text content
- `source_id`: Source segment identifier
- `source_text`: Source text content
- `similarity`: Cosine similarity score (0-1)
- `probability`: Classification confidence (0-1)
- `above_threshold`: "Yes" if probability ≥ threshold, otherwise "No"
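
As a rough sketch of post-processing, the output can be loaded with pandas (already a dependency of the package) and filtered on the `above_threshold` column. The rows below are made-up stand-ins for a real `results.csv`; only the column names come from the list above.

```python
import io

import pandas as pd

# Stand-in for results.csv; the values are invented, not real model output.
sample = io.StringIO(
    "query_id,query_text,source_id,source_text,similarity,probability,above_threshold\n"
    "q1,arma virumque,s1,arma virumque cano,0.91,0.97,Yes\n"
    "q1,arma virumque,s2,italiam fato profugus,0.42,0.12,No\n"
    "q2,tantae molis erat,s3,tantae molis erat Romanam condere gentem,0.88,0.90,Yes\n"
)
results = pd.read_csv(sample)  # in practice: pd.read_csv("results.csv")

# Keep only the pairs flagged as above the threshold, highest probability first.
hits = results[results["above_threshold"] == "Yes"].sort_values(
    "probability", ascending=False
)
print(hits[["query_id", "source_id", "probability"]])
```

Filtering on `probability` directly instead lets you try a stricter or looser cutoff without re-running the pipeline.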


## Optional Gradio GUI

Install the optional GUI extra to experiment with a minimal Gradio front end:

```bash
pip install "locisimiles[gui]"
```

Launch the interface from the command line:

```bash
locisimiles-gui
```

