Metadata-Version: 2.1
Name: impresso-pipelines
Version: 0.6.0.5
Summary: A custom package containing multiple pipelines simplifying reproducibility of impresso steps
Author: Glebs Vinarskis
Author-email: gleb.vinarskis@gmail.com
Requires-Python: >=3.11,<3.13
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Provides-Extra: adclassifier
Provides-Extra: all
Provides-Extra: langident
Provides-Extra: ldatopics
Provides-Extra: newsagencies
Provides-Extra: ocrqa
Provides-Extra: solrnormalization
Requires-Dist: boto3 (>=1.35,<1.36) ; extra == "ldatopics" or extra == "all"
Requires-Dist: cython
Requires-Dist: floret ; extra == "langident" or extra == "ocrqa" or extra == "ldatopics" or extra == "solrnormalization" or extra == "all"
Requires-Dist: huggingface_hub ; extra == "langident" or extra == "ocrqa" or extra == "ldatopics" or extra == "solrnormalization" or extra == "all"
Requires-Dist: jpype1 (>=1.6,<1.7) ; extra == "ldatopics" or extra == "solrnormalization" or extra == "all"
Requires-Dist: jsonschema (>=4.23,<4.24) ; extra == "ldatopics" or extra == "all"
Requires-Dist: numpy (>=1.26,<1.27)
Requires-Dist: pybloomfilter3 ; extra == "ocrqa" or extra == "all"
Requires-Dist: pysolr ; extra == "solrnormalization"
Requires-Dist: python-dotenv (>=1.0,<1.1) ; extra == "ldatopics" or extra == "all"
Requires-Dist: smart_open (>=6.3,<6.4) ; extra == "ldatopics" or extra == "all"
Requires-Dist: spacy (>=3.7,<3.8) ; extra == "ldatopics" or extra == "all"
Requires-Dist: torch (>=2.3,<2.4) ; extra == "newsagencies" or extra == "adclassifier" or extra == "all"
Requires-Dist: torchvision (>=0.18,<0.19) ; extra == "newsagencies"
Requires-Dist: transformers (>=4.49,<4.50) ; extra == "newsagencies" or extra == "adclassifier" or extra == "all"
Project-URL: homepage, https://github.com/impresso/impresso-pipelines
Description-Content-Type: text/markdown

# Python Package: `impresso-pipelines`

[![PyPI](https://img.shields.io/pypi/v/impresso-pipelines)](https://pypi.org/project/impresso-pipelines/)
[![Python versions](https://img.shields.io/pypi/pyversions/impresso-pipelines)](https://pypi.org/project/impresso-pipelines/)
[![Weekly Downloads](https://img.shields.io/pypi/dm/impresso-pipelines)](https://pypi.org/project/impresso-pipelines/)
[![Contributors](https://img.shields.io/github/contributors/impresso/impresso-pipelines)](https://github.com/impresso/impresso-pipelines/graphs/contributors)
[![QA Workflow](https://github.com/impresso/impresso-pipelines/actions/workflows/qa.yml/badge.svg)](https://github.com/impresso/impresso-pipelines/actions/workflows/qa.yml)

## Overview

This repository contains a Python package designed for modular and efficient text processing workflows. Currently, it includes the following subpackages:

- **Language Identification Pipeline**: Identifies the language of input text and returns a probability score.
- **OCR QA Pipeline**: Assesses the quality of OCR text by estimating the proportion of recognized vocabulary items (0–1), using efficient language-specific Bloom filters.
- **LDA Topic Modeling Pipeline**: Soft clustering of input texts using LDA-based topic modeling.
- **News Agencies Pipeline**: Extracts and ranks news agency entities from text, providing relevance scores and optional links to Wikidata.
- **Advertisement Classifier**: Identifies advertisements in historical newspaper content using a fine-tuned XLM-RoBERTa model with rule-based features.
- **Lucene/Solr normalization Pipeline**: Replicates Solr's language-specific text normalization to clarify how input text is tokenized and indexed in impresso.

## Installation

### Quick Install (with uv - recommended)

[uv](https://github.com/astral-sh/uv) is an extremely fast Python package installer (10-100x faster than pip):

```bash
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install the package with all dependencies
uv pip install "impresso-pipelines[all]"
```

### Standard Install (with pip)

To install the full package with all submodules:

```bash
pip install "impresso-pipelines[all]"
```

The `[all]` extra installs all dependencies required for each component.

### Install Individual Modules

To install individual modules without unnecessary dependencies, use:

```bash
pip install "impresso-pipelines[langident]"         # Language Identification
pip install "impresso-pipelines[ocrqa]"             # OCR QA
pip install "impresso-pipelines[ldatopics]"         # LDA Topics
pip install "impresso-pipelines[newsagencies]"      # News Agencies
pip install "impresso-pipelines[adclassifier]"      # Advertisement Classifier
pip install "impresso-pipelines[solrnormalization]" # Solr text normalization
```

### Development Setup

For contributors, we support both **uv** (faster) and **Poetry**:

```bash
# Clone the repository
git clone https://github.com/impresso/impresso-pipelines.git
cd impresso-pipelines

# Option 1: Using uv (recommended - 3-6x faster)
uv sync --extra all --extra dev

# Option 2: Using Poetry
poetry install --all-extras --with dev

# Or use Make (auto-detects uv or Poetry)
make install-dev
```

See [UV_MIGRATION.md](UV_MIGRATION.md) for more details on using uv.

## Usage

Each pipeline is instantiated from a corresponding class.

```python
from impresso_pipelines.langident import LangIdentPipeline
from impresso_pipelines.ocrqa import OCRQAPipeline
from impresso_pipelines.ldatopics import LDATopicsPipeline
from impresso_pipelines.newsagencies import NewsAgenciesPipeline
from impresso_pipelines.adclassifier import AdClassifierPipeline
from impresso_pipelines.solrnormalization import SolrNormalizationPipeline
```

## Pipeline Examples

For usage examples, refer to the individual README files:

- [Langident Pipeline](README_langident.md)
- [OCR QA Pipeline](README_ocrqa.md)
- [LDA Topics Pipeline](README_ldatopics.md)
- [News Agencies Pipeline](README_newsagencies.md)
- [Advertisement Classifier](README_adclassifier.md)
- [Solr normalization Pipeline](README_solrnormalization.md)

See also the interactive notebooks for further examples:

- [langident_pipeline_demo.ipynb](https://github.com/impresso/impresso-datalab-notebooks/tree/main/annotate/langident_pipeline_demo.ipynb)
- [ocrqa_pipeline_demo.ipynb](https://github.com/impresso/impresso-datalab-notebooks/tree/main/annotate/ocrqa_pipeline_demo.ipynb)
- [ldatopics_pipeline_demo.ipynb](https://github.com/impresso/impresso-datalab-notebooks/tree/main/annotate/ldatopics_pipeline_demo.ipynb)
- [newsagencies_pipeline_demo.ipynb](https://github.com/impresso/impresso-datalab-notebooks/tree/main/annotate/newsagencies_pipeline_demo.ipynb)
- [solrnormalization_pipeline_demo.ipynb](https://github.com/impresso/impresso-datalab-notebooks/tree/main/annotate/solrnormalization_pipeline_demo.ipynb).

## Future Plans

Additional functionality will be added to extend use cases and support further processing tasks.

## Local Development

For contributors and developers who want to test locally before pushing to GitHub:

### Quick Start

```bash
# Clone and install
git clone https://github.com/impresso/impresso-pipelines.git
cd impresso-pipelines

# Option 1: Poetry (recommended for full development)
make install-dev

# Option 2: Pip editable mode (faster for testing changes)
make install-editable-dev

# Run tests
make test

# Run all QA checks (mimics CI)
make qa
```

### Available Commands

```bash
make help              # Show all available commands
make install          # Install package with all extras
make install-dev      # Install with dev dependencies
make test             # Run tests (skipping JVM tests)
make test-all         # Run all tests including JVM tests
make test-ocrqa       # Run only OCRQA tests
make test-cov         # Run tests with coverage report
make lint             # Run linting checks
make format           # Format code with black
make type-check       # Run type checking
make qa               # Run all QA checks
make clean            # Remove build artifacts
```

For detailed development instructions, see [CONTRIBUTING.md](CONTRIBUTING.md).

## About Impresso

### Impresso project

[Impresso - Media Monitoring of the Past](https://impresso-project.ch) is an interdisciplinary research project that aims to develop and consolidate tools for processing and exploring large collections of media archives across modalities, time, languages and national borders. The first project (2017-2021) was funded by the Swiss National Science Foundation under grant No. [CRSII5_173719](http://p3.snf.ch/project-173719) and the second project (2023-2027) by the SNSF under grant No. [CRSII5_213585](https://data.snf.ch/grants/grant/213585) and the Luxembourg National Research Fund under grant No. 17498891.

### Copyright

Copyright (C) 2025 The Impresso team.

### License

This program is provided as open source under the [GNU Affero General Public License](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE) v3 or later.

---

<p align="center">
  <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="350" alt="Impresso Project Logo"/>
</p>

