Metadata-Version: 2.1
Name: pyautosummarizer
Version: 1.2.0
Summary: An Extractive and Abstractive Summarization Library Powered with Artificial Intelligence
Home-page: https://github.com/Valdecy/pyAutoSummarizer
Author: Valdecy Pereira
Author-email: valdecy.pereira@gmail.com
License: GNU
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Provides-Extra: faithfulness
License-File: LICENSE

# pyAutoSummarizer

pyAutoSummarizer: An Extractive and Abstractive Summarization Library Powered with Artificial Intelligence.

## Citation
PEREIRA, V., DE LIMA PORTO, R.C., FIGUEIRA, L.A.A., FERREIRA, R.A.C.A. (2026). Unveiling pyAutoSummarizer: An Extractive and Abstractive Summarization Library Powered with Artificial Intelligence. In: DA HORA, H., PORTER, A.L., CHIAVETTA, D., ZHANG, Y. (eds) Technology Mining. Springer, Cham. https://doi.org/10.1007/978-3-032-10849-4_2

## Introduction

pyAutoSummarizer is a Python library for text summarization. It covers both extractive and abstractive approaches and provides a suite of evaluation metrics, from classic n-gram overlap to modern semantic and faithfulness measures.

### Summarization Methods

**Extractive**: identifies and returns the most important sentences from the original text:

| Method | Description |
|--------|-------------|
| **TextRank** | Graph-based ranking using sentence embeddings and cosine similarity |
| **LexRank** | Graph-based ranking using TF-IDF cosine similarity |
| **LSA** | Latent Semantic Analysis via SVD on embeddings or TF-IDF matrix |
| **KL-Sum** | Selects sentences that minimise KL-divergence from the full document distribution |
| **BART** | `facebook/bart-large-cnn` abstractive model (deep learning) |
| **T5** | `t5-base` abstractive model (deep learning) |
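
As a rough illustration of the graph-based rows above, the sketch below ranks sentences by running a PageRank-style iteration over a TF-IDF cosine-similarity graph, in the spirit of LexRank. It is a minimal standard-library sketch, not the library's implementation: the function names (`tokenize`, `tfidf_vectors`, `rank_sentences`), the smoothed IDF formula, and the damping value are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def tfidf_vectors(sentences):
    """Return one sparse TF-IDF vector (a dict) per sentence."""
    docs = [Counter(tokenize(s)) for s in sentences]
    n = len(docs)
    df = Counter(word for doc in docs for word in doc)
    # Smoothed IDF (an assumption) keeps every weight strictly positive.
    idf = {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}
    return [{w: tf * idf[w] for w, tf in doc.items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank_sentences(sentences, damping=0.85, iterations=50):
    """Score sentences by power iteration over the similarity graph."""
    vecs = tfidf_vectors(sentences)
    n = len(sentences)
    sim = [[cosine(vecs[i], vecs[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on its score in proportion to sim[j][i].
            incoming = sum(sim[j][i] / out_weight[j] * scores[j]
                           for j in range(n) if out_weight[j] > 0)
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


sentences = [
    "The cat sat on the mat.",
    "The cat lay on the mat.",
    "Stock markets fell sharply today.",
]
for score, sentence in sorted(zip(rank_sentences(sentences), sentences), reverse=True):
    print(f"{score:.3f}  {sentence}")
```

The two sentences about the cat reinforce each other and rank above the unrelated third sentence. Swapping the TF-IDF vectors for sentence embeddings would give a TextRank-like variant of the same idea.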

**Abstractive**: generates new text that captures the meaning of the source:

| Method | Description |
|--------|-------------|
| **PEGASUS** | `google/pegasus-xsum` model fine-tuned for abstractive summarization |
| **chatGPT** | OpenAI `gpt-4o-mini` (or any chat model) via the OpenAI API |

### Text Pre-processing

The library provides a flexible pre-processing pipeline:

- **Lowercasing**, **accent removal**, **special character removal**, **number removal**
- **Custom word removal**
- **Stopword removal** across 27 languages: Arabic, Bengali, Bulgarian, Chinese, Czech, English, Finnish, French, German, Greek, Hebrew, Hindi, Hungarian, Italian, Japanese, Korean, Marathi, Persian, Polish, Portuguese-br, Romanian, Russian, Slovak, Spanish, Swedish, Thai, and Ukrainian
- **Sentence segmentation** by punctuation, word count, or character count
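
The steps listed above can be sketched with the standard library alone. The example below is illustrative only and does not reproduce the library's internals: `split_sentences`, `clean`, and the tiny `STOPWORDS_EN` set are hypothetical names, and the keyword arguments merely echo the options shown in the Quick Start.

```python
import re
import unicodedata

# Tiny illustrative stopword set (an assumption); real lists are much longer.
STOPWORDS_EN = {"a", "an", "and", "is", "it", "of", "the", "to"}


def split_sentences(text):
    """Segment text at sentence-final punctuation followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def clean(sentence, lowercase=True, rmv_accents=True, rmv_special_chars=True,
          rmv_numbers=True, stop_words=STOPWORDS_EN, custom_words=()):
    """Apply the optional cleaning steps to a single sentence."""
    if lowercase:
        sentence = sentence.lower()
    if rmv_accents:
        # NFKD splits "é" into "e" plus a combining mark, which is then dropped.
        decomposed = unicodedata.normalize("NFKD", sentence)
        sentence = "".join(c for c in decomposed if not unicodedata.combining(c))
    if rmv_numbers:
        sentence = re.sub(r"\d+", " ", sentence)
    if rmv_special_chars:
        sentence = re.sub(r"[^\w\s]", " ", sentence)
    drop = set(stop_words) | set(custom_words)
    return " ".join(w for w in sentence.split() if w.lower() not in drop)


text = "Café prices rose 12% in 2023! It is the trend. Why?"
for sentence in split_sentences(text):
    print(repr(clean(sentence)))
```

Segmenting before cleaning matters here: once punctuation has been stripped, punctuation-based sentence boundaries can no longer be recovered.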

### Evaluation Metrics

#### Classic Metrics (reference-based, lexical)

| Metric | Method | Returns |
|--------|--------|---------|
| **ROUGE-N** | `rouge_N(generated, reference, n=1)` | F1, Precision, Recall |
| **ROUGE-L** | `rouge_L(generated, reference)` | F1, Precision, Recall |
| **ROUGE-S** | `rouge_S(generated, reference, skip_distance=4)` | F1, Precision, Recall |
| **BLEU** | `bleu(generated, reference, n=4)` | Score |
| **METEOR** | `meteor(generated, reference)` | Score |
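
To make the lexical metrics concrete, here is a minimal sketch of ROUGE-N (clipped n-gram overlap) and ROUGE-L (longest common subsequence). It assumes whitespace tokenization and a balanced F1, and it returns values in the same `(F1, Precision, Recall)` order as the table; the lowercase function names are hypothetical and this is not the library's code.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def prf(overlap, gen_total, ref_total):
    """Turn an overlap count into (F1, precision, recall)."""
    p = overlap / gen_total if gen_total else 0.0
    r = overlap / ref_total if ref_total else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return f1, p, r


def rouge_n(generated, reference, n=1):
    """ROUGE-N: n-gram overlap, clipped by the counts in each text."""
    gen = ngrams(generated.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((gen & ref).values())  # Counter & keeps the minimum counts
    return prf(overlap, sum(gen.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(generated, reference):
    """ROUGE-L: in-order word matches, which need not be contiguous."""
    gen, ref = generated.lower().split(), reference.lower().split()
    return prf(lcs_length(gen, ref), len(gen), len(ref))


generated = "the cat sat on the mat"
reference = "the cat is on the mat"
print(rouge_n(generated, reference, n=1))
print(rouge_n(generated, reference, n=2))
print(rouge_l(generated, reference))
```

In this pair, five of six unigrams and three of five bigrams overlap, and the longest common subsequence has five words, so the unigram and ROUGE-L scores coincide while the bigram score is lower.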

#### Semantic Metric (reference-based)

| Metric | Method | Returns | Notes |
|--------|--------|---------|-------|
| **BERTScore** | `bert_score(generated, reference, model_type='roberta-large')` | F1, Precision, Recall | Requires `pip install bert-score`. Captures paraphrasing that ROUGE misses by comparing contextualised token embeddings. |
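
The matching step behind BERTScore can be sketched without any model: each token vector in one text is paired with its most similar token vector in the other, and the best similarities are averaged. The example below uses hand-made 2-D vectors as stand-ins for contextual embeddings, which is purely an assumption for illustration. The real metric obtains the vectors from a transformer such as `roberta-large` and may apply extra weighting or rescaling that is omitted here. `greedy_match_score` is a hypothetical name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def greedy_match_score(gen_vecs, ref_vecs):
    """Return (F1, precision, recall) from greedy cosine matching."""
    # Precision: how well each generated token is covered by the reference.
    precision = sum(max(cosine(g, r) for r in ref_vecs) for g in gen_vecs) / len(gen_vecs)
    # Recall: how well each reference token is covered by the generated text.
    recall = sum(max(cosine(r, g) for g in gen_vecs) for r in ref_vecs) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return f1, precision, recall


# Toy "embeddings": one vector per token (illustrative values only).
generated_vectors = [[1.0, 0.0], [0.0, 1.0]]
reference_vectors = [[1.0, 0.0], [1.0, 1.0]]
print(greedy_match_score(generated_vectors, reference_vectors))
```

Because similarity is computed between vectors rather than surface strings, two different words with nearby embeddings still contribute a high score, which is how paraphrases get credit.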

#### Faithfulness / Factual Consistency Metrics (source-based, no reference needed)

These metrics check whether the summary is factually consistent with the **source document**, detecting hallucinations that lexical metrics cannot see.

| Metric | Method | Returns | Notes |
|--------|--------|---------|-------|
| **SummaC** | `summa_c(generated, nli_model='cross-encoder/nli-deberta-v3-small')` | Score ∈ [0, 1] | Self-contained NLI-based faithfulness scorer using HuggingFace transformers. No extra install needed. |
| **AlignScore** | `align_score(generated, model='AlignScore-base')` | Score ∈ [0, 1] | Requires `pip install pyAutoSummarizer[faithfulness]` and `python -m spacy download en_core_web_sm`. Based on Zha et al., ACL 2023. |
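
One common way to turn sentence-level NLI scores into a single faithfulness number is the zero-shot aggregation used by SummaC-style scorers: for every summary sentence, take the best entailment score against any source sentence, then average over the summary. The sketch below shows only that aggregation and assumes nothing about how `summa_c` is implemented. `toy_entailment` is a deliberately crude word-overlap stand-in for a real NLI model, and all function names are hypothetical.

```python
import re


def split_sentences(text):
    """Segment text at sentence-final punctuation followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def toy_entailment(premise, hypothesis):
    """Stand-in for an NLI model: share of hypothesis words found in the premise."""
    premise_words = set(re.findall(r"\w+", premise.lower()))
    hypothesis_words = re.findall(r"\w+", hypothesis.lower())
    if not hypothesis_words:
        return 0.0
    return sum(w in premise_words for w in hypothesis_words) / len(hypothesis_words)


def zero_shot_consistency(source, summary, entail=toy_entailment):
    """Mean over summary sentences of the max entailment score over source sentences."""
    source_sentences = split_sentences(source)
    summary_sentences = split_sentences(summary)
    if not source_sentences or not summary_sentences:
        return 0.0
    best = [max(entail(s, t) for s in source_sentences) for t in summary_sentences]
    return sum(best) / len(best)


source = "The cat sat on the mat. It rained all day."
print(zero_shot_consistency(source, "The cat sat on the mat."))
print(zero_shot_consistency(source, "The dog barked loudly."))
```

A supported sentence scores high and an unsupported one scores low, with no reference summary involved. In practice the `entail` argument would wrap an NLI model's entailment probability, and word overlap should not be mistaken for entailment.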

#### LLM-as-Judge Metric

| Metric | Method | Returns | Notes |
|--------|--------|---------|-------|
| **G-Eval** | `g_eval(generated, api_key, model='gpt-4o-mini', dimensions=['coherence','consistency','fluency','relevance'])` | `dict {dimension: int 1–5}` | Uses an OpenAI chat model to score the summary across four quality dimensions. Based on Liu et al., 2023. Requires an OpenAI API key. |

## Installation

### Core install (extractive/abstractive methods + lexical/BERTScore metrics)

```bash
pip install pyAutoSummarizer
```

### With faithfulness metrics (AlignScore)

```bash
pip install "pyAutoSummarizer[faithfulness]"
python -m spacy download en_core_web_sm
```

**Requirements:** Python ≥ 3.9

## Quick Start

```python
from pyAutoSummarizer.base import psr

text = """
Your long text goes here. It can be multiple paragraphs.
The library will pre-process it, split it into sentences,
and summarize it using any of the available methods.
"""

# Initialise: pre-processes the text
s = psr.summarization(text, stop_words=['en'], lowercase=True,
                      rmv_accents=True, rmv_special_chars=True, rmv_numbers=True)

# --- Extractive summarization ---
rank    = s.summ_text_rank()          # TextRank
summary = s.show_summary(rank, n=3)   # top-3 sentences
print(summary)

# --- Abstractive summarization ---
summary = s.summ_abst_chatgpt(api_key='YOUR_KEY', model='gpt-4o-mini')

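# --- Evaluation (classic) ---
# Placeholder: replace with a human-written reference summary of the text
reference = "Your reference summary goes here."
f1, p, r = s.rouge_N(summary, reference, n=1)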
bleu_s   = s.bleu(summary, reference)

# --- Evaluation (semantic) ---
f1, p, r = s.bert_score(summary, reference)

# --- Evaluation (faithfulness, no reference needed) ---
faith_sc = s.summa_c(summary)    # SummaC (built-in NLI)
align_sc = s.align_score(summary) # AlignScore (requires [faithfulness] extra)

# --- Evaluation (LLM-as-judge) ---
scores   = s.g_eval(summary, api_key='YOUR_KEY')
# {'coherence': 4, 'consistency': 5, 'fluency': 5, 'relevance': 4}
```

## Colab Demos

**Extractive Summarization**
- [TextRank](https://colab.research.google.com/drive/1m7mF4R7s6hakuVhrwymrgqNNJpTySUM4?usp=sharing)
- [LexRank](https://colab.research.google.com/drive/1gT9fV7hAE4mvwAHbfzolF6TN3TjGgJOF?usp=sharing)
- [LSA](https://colab.research.google.com/drive/19fUslzp43_Owib9YDCb0Xfe9XZm1OKmB?usp=sharing)
- [KL-Sum](https://colab.research.google.com/drive/19zHjE0nR1GcAWi4NQmaJh1gjpqm4sqjP?usp=sharing)
- [BART](https://colab.research.google.com/drive/1sAYBDQFxwlA16nBUozgE28_xZlNzUCg-?usp=sharing)
- [T5](https://colab.research.google.com/drive/1tyWu-19xA9QMrwl_kPcGJH0ZSS3r_rDZ?usp=sharing)

**Abstractive Summarization**
- [chatGPT](https://colab.research.google.com/drive/1ipl6ZnyumJeuxsYelcmZEdsXDMIuM5WG?usp=sharing) (requires an [OpenAI API key](https://platform.openai.com/account/api-keys))
- [PEGASUS](https://colab.research.google.com/drive/1RWIEm9WoZBPYA_p4A1LqKnFPaXhNsQcM?usp=sharing)

## Related Projects

- [pyBibX](https://github.com/Valdecy/pyBibX): A Bibliometric and Scientometric Python Library Powered with Artificial Intelligence Tools
