Metadata-Version: 2.1
Name: ReportAnalysis
Version: 1.0.0
Summary: A powerful Python package for analyzing reports: sentiment, readability, keywords, summaries, NER, and more.
Home-page: https://github.com/bappy-3/ReportAnalysis
Author: Al Mustafiz Bappy
Author-email: almustafizbappy@gmail.com
License: MIT
Project-URL: Website, https://almustafizbappy.zerodevs.com/
Project-URL: Bug Tracker, https://github.com/bappy-3/ReportAnalysis/issues
Project-URL: Documentation, https://github.com/bappy-3/ReportAnalysis#readme
Keywords: nlp,text analysis,report analysis,sentiment,readability,keywords,summarization,ner,pdf,docx
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: nltk>=3.7
Requires-Dist: click>=8.0
Requires-Dist: rich>=13.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: textblob>=0.17; extra == "dev"
Requires-Dist: scikit-learn>=1.0; extra == "dev"
Requires-Dist: rake-nltk>=1.0; extra == "dev"
Requires-Dist: langdetect>=1.0; extra == "dev"
Requires-Dist: pdfplumber>=0.9; extra == "dev"
Requires-Dist: python-docx>=0.8; extra == "dev"
Requires-Dist: requests>=2.28; extra == "dev"
Requires-Dist: beautifulsoup4>=4.11; extra == "dev"
Provides-Extra: full
Requires-Dist: textblob>=0.17; extra == "full"
Requires-Dist: scikit-learn>=1.0; extra == "full"
Requires-Dist: rake-nltk>=1.0; extra == "full"
Requires-Dist: langdetect>=1.0; extra == "full"
Requires-Dist: pdfplumber>=0.9; extra == "full"
Requires-Dist: python-docx>=0.8; extra == "full"
Requires-Dist: requests>=2.28; extra == "full"
Requires-Dist: beautifulsoup4>=4.11; extra == "full"

# ReportAnalysis

[![PyPI version](https://badge.fury.io/py/ReportAnalysis.svg)](https://badge.fury.io/py/ReportAnalysis)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

**A powerful, batteries-included Python package for analyzing reports.**

Pass in a report as a text string, PDF, Word document (.docx), or URL, and get sentiment analysis, readability scores, keywords, summaries, named entities, language detection, text statistics, and report comparison. The package ships with a CLI and exports results to JSON, CSV, and HTML.

---

## Features

| Feature | Description |
|---|---|
| Sentiment Analysis | VADER + TextBlob ensemble with confidence scoring |
| Readability Scores | Flesch, Gunning Fog, SMOG, ARI — all computed offline |
| Keyword Extraction | TF-IDF keywords + RAKE multi-word keyphrases |
| Extractive Summary | Top N most informative sentences |
| Text Statistics | Word count, reading time, vocabulary richness, and more |
| Named Entity Recognition | People, Organizations, Locations via NLTK |
| Language Detection | Detects 50+ languages with ISO codes |
| Report Comparison | Cosine similarity score between two reports |
| Multi-format Loaders | Plain text, PDF, DOCX, and web URLs |
| CLI | `analyze` / `compare` / `summarize` subcommands |
| Export | JSON, CSV, and self-contained HTML reports |

---

## Installation

### Minimal install (core only)

```bash
pip install ReportAnalysis
```

### Full install (with PDF, DOCX, URL loaders and all analysis features)

```bash
pip install "ReportAnalysis[full]"
```

The core install pulls in only `nltk`, `click`, and `rich`. Features that rely on the packages listed under [Dependencies](#dependencies), such as TF-IDF keywords, language detection, and the PDF, DOCX, and URL loaders, need the `[full]` extra.

---

## Quick Start

### From a text string

```python
from report_analysis import ReportAnalyzer

ra = ReportAnalyzer("The quarterly results exceeded all expectations. Revenue grew 30%.")
result = ra.analyze()
result.show()  # Prints a formatted report to the terminal
```

### From a file

```python
from report_analysis import ReportAnalyzer

# Supports .txt, .pdf, .docx
ra = ReportAnalyzer("annual_report.pdf")
result = ra.analyze()

print(result.sentiment.label)          # e.g. "positive"
print(result.readability.grade_level)  # e.g. "College"
print(result.keywords.top_keywords[:5])

result.export("analysis.html")  # Export as a standalone HTML report
```

### From a URL

```python
from report_analysis import ReportAnalyzer

ra = ReportAnalyzer(url="https://example.com/annual-report")
result = ra.analyze()
result.export("results.json")
```

### From a DOCX file

```python
from report_analysis import ReportAnalyzer

ra = ReportAnalyzer("report.docx")
result = ra.analyze()
print(result.summary.text)
```

### Run only specific modules

```python
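# Reuses `ra` from one of the examples above.
# "summary" is listed so that `summary_sentences` has an effect.
result = ra.analyze(
    include=["sentiment", "readability", "keywords", "summary"],
    top_keywords=15,
    summary_sentences=5,
)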
```

### Compare two reports

```python
from report_analysis import ReportAnalyzer

ra1 = ReportAnalyzer("Q1 report text here...")
ra2 = ReportAnalyzer("Q2 report text here...")

comparison = ra1.compare_with(ra2)
print(f"Similarity: {comparison.similarity_score:.1%}")  # e.g. "72.3%"
print(comparison.similarity_label)                        # "Similar"
print("Common words:", comparison.common_words[:10])
```
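
To give a feel for what a cosine similarity score measures, here is a minimal standard-library sketch. It is not the package's implementation: per the Dependencies section the package uses `scikit-learn` for this, while the sketch below compares raw term-frequency vectors. The helper names `tokenize` and `cosine_similarity` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between raw term-frequency vectors of two texts."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    # Only words present in both texts contribute to the dot product.
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty text has no direction to compare.
    return dot / (norm_a * norm_b)


q1 = "Revenue grew in the first quarter."
q2 = "Revenue fell in the second quarter."
print(f"Similarity: {cosine_similarity(q1, q2):.1%}")  # Similarity: 66.7%
```

A score of 1.0 means the two texts use the same words in the same proportions; 0.0 means they share no words at all.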

### Export results

```python
result.export("analysis.json")  # Machine-readable JSON
result.export("analysis.csv")   # Spreadsheet-friendly CSV
result.export("analysis.html")  # Standalone HTML report
```

---

## CLI Usage

```bash
# Analyze a file
report-analysis analyze report.pdf

# Analyze from a URL
report-analysis analyze --url https://example.com/annual-report

# Read from stdin
echo "Revenue increased by 30% this quarter." | report-analysis analyze -

# Run only specific modules
report-analysis analyze report.txt --include sentiment --include keywords

# Export results to HTML
report-analysis analyze report.pdf --export html --output results.html

# Compare two reports
report-analysis compare q1.pdf q2.pdf

# Summarize with 10 sentences
report-analysis summarize report.docx --sentences 10

# Show help
report-analysis --help
report-analysis analyze --help
```

---

## API Reference

### `ReportAnalyzer(source="", *, url="")`

| Parameter | Type | Description |
|---|---|---|
| `source` | `str` | Raw text string, or a path to a `.txt`, `.pdf`, or `.docx` file |
| `url` | `str` | URL to fetch and analyze (keyword-only argument) |

### `.analyze(include=None, summary_sentences=5, top_keywords=20)`

Runs the analysis pipeline and returns an `AnalysisResult` object.

| Parameter | Default | Description |
|---|---|---|
| `include` | `None` (all modules) | List of module names to run |
| `summary_sentences` | `5` | Number of sentences to include in the summary |
| `top_keywords` | `20` | Number of keywords to extract |

**Available modules:** `"stats"`, `"language"`, `"sentiment"`, `"readability"`, `"keywords"`, `"summary"`, `"entities"`
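
The `include` semantics described above (`None` means every module) can be sketched as a small resolver. This is an illustration only: the function name `resolve_modules` is hypothetical, and how the package itself treats unknown or duplicate names is not documented here, so raising `ValueError` is an assumption.

```python
AVAILABLE_MODULES = (
    "stats", "language", "sentiment", "readability",
    "keywords", "summary", "entities",
)


def resolve_modules(include=None):
    """Return the module names to run, in a fixed canonical order."""
    if include is None:
        return list(AVAILABLE_MODULES)  # None selects every module.
    unknown = sorted(set(include) - set(AVAILABLE_MODULES))
    if unknown:
        # Assumption: unknown names are rejected rather than ignored.
        raise ValueError(f"Unknown module(s): {', '.join(unknown)}")
    # Filtering the canonical tuple drops duplicates and fixes the order.
    return [name for name in AVAILABLE_MODULES if name in include]


print(resolve_modules(["keywords", "sentiment", "keywords"]))
```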

### `AnalysisResult` — Fields

| Field | Type | Description |
|---|---|---|
| `.stats` | `StatsResult` | Word count, sentence count, reading time, vocabulary richness |
| `.sentiment` | `SentimentResult` | Label (positive/negative/neutral), compound score, confidence |
| `.readability` | `ReadabilityResult` | Flesch reading ease, Gunning Fog index, grade level |
| `.keywords` | `KeywordsResult` | TF-IDF scored keywords and RAKE keyphrases |
| `.summary` | `SummaryResult` | Extractive summary as a list of sentences and as joined text |
| `.entities` | `EntitiesResult` | Named entities grouped by type |
| `.language` | `LanguageResult` | ISO language code and human-readable name |
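
As a rough illustration of the kind of values `.stats` holds, the sketch below computes similar figures with the standard library. The definitions are assumptions, not the package's documented formulas: reading time uses an assumed 200 words per minute, and vocabulary richness is taken to be the type-token ratio (unique words divided by total words). The function name `text_stats` and the dictionary keys are hypothetical.

```python
import re


def text_stats(text, words_per_minute=200):
    """Compute simple text statistics (assumed definitions, see lead-in)."""
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    word_count = len(words)
    return {
        "word_count": word_count,
        "sentence_count": len(sentences),
        # Assumed reading speed; real readers vary widely.
        "reading_time_minutes": word_count / words_per_minute,
        # Type-token ratio: 1.0 means no word is ever repeated.
        "vocabulary_richness": len(set(words)) / word_count if word_count else 0.0,
    }


print(text_stats("The cat sat. The cat ran."))
```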

### `AnalysisResult` — Methods

| Method | Description |
|---|---|
| `.show()` | Print a rich formatted report to the terminal |
| `.export(path)` | Export to `.json`, `.csv`, or `.html` |
| `.to_dict()` | Return the full result as a Python `dict` |

---

## Result Details

### Sentiment

```python
result.sentiment.label             # "positive" | "negative" | "neutral"
result.sentiment.compound          # -1.0 to 1.0
result.sentiment.positive          # 0.0 to 1.0
result.sentiment.confidence        # "high" | "medium" | "low"
result.sentiment.textblob_polarity      # TextBlob polarity score
result.sentiment.textblob_subjectivity  # TextBlob subjectivity score
```
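
For orientation, the sketch below shows how a compound score in the range -1.0 to 1.0 is conventionally turned into a label. The ±0.05 cut-offs are the thresholds suggested in VADER's own documentation; whether this package uses the same cut-offs, and how it blends in TextBlob, is not stated here, so treat the function (a hypothetical `label_from_compound`) as an assumption.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER-style compound score in [-1.0, 1.0] to a label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"  # Scores close to zero carry no clear polarity.


for score in (0.62, -0.30, 0.01):
    print(score, label_from_compound(score))
```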

### Readability

```python
result.readability.flesch_reading_ease   # 0-100 (higher = easier to read)
result.readability.flesch_kincaid_grade  # US school grade level
result.readability.gunning_fog           # Years of education needed
result.readability.smog_index            # SMOG grade level
result.readability.reading_ease_label    # "Very Easy", "Standard", "Difficult", etc.
result.readability.grade_level           # "High School", "College", etc.
```
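
The two Flesch scores follow published formulas, which the sketch below applies with the standard library. The formulas are the standard ones; the syllable counter is a crude vowel-group heuristic chosen for brevity, so the numbers will not match the package's output exactly. The names `count_syllables` and `flesch_scores` are hypothetical.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing "e" as silent ("grade" counts as one syllable).
    if word.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_scores(text):
    """Return Flesch reading ease and Flesch-Kincaid grade for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and word")
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(count_syllables(w) for w in words) / len(words)
    return {
        "flesch_reading_ease": 206.835
        - 1.015 * words_per_sentence
        - 84.6 * syllables_per_word,
        "flesch_kincaid_grade": 0.39 * words_per_sentence
        + 11.8 * syllables_per_word
        - 15.59,
    }


print(flesch_scores("The cat sat. The dog ran."))
```

Short sentences made of one-syllable words can score above 100 on reading ease and below zero on grade level, since the formulas themselves are unbounded.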

### Keywords

```python
result.keywords.tfidf_keywords   # [(word, score), ...]
result.keywords.rake_phrases     # [(phrase, score), ...]
result.keywords.top_keywords     # [word, ...] — plain list
result.keywords.top_phrases      # [phrase, ...] — plain list
```
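
To show where `(word, score)` pairs like those in `tfidf_keywords` can come from, here is a minimal TF-IDF sketch using only the standard library. It is not the package's code (the package relies on `scikit-learn`); it treats each sentence as one document, uses a tiny hand-picked stopword list, and applies one common IDF variant, `log(N / df) + 1`. All names are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "was", "by", "for", "on"}


def tfidf_keywords(sentences, top_n=5):
    """Score words by TF-IDF, treating each sentence as one document."""
    docs = [
        [w for w in re.findall(r"[a-z]+", s.lower()) if w not in STOPWORDS]
        for s in sentences
    ]
    docs = [d for d in docs if d]  # Drop sentences with no usable words.
    n_docs = len(docs)
    doc_freq = Counter(w for d in docs for w in set(d))
    scores = Counter()
    for doc in docs:
        for word, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / doc_freq[word]) + 1.0
            scores[word] += tf * idf  # Sum the word's score over all sentences.
    return scores.most_common(top_n)


report = ["Revenue grew sharply.", "Revenue beat forecasts.", "Costs fell."]
print(tfidf_keywords(report, top_n=3))
```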

### Summary

```python
result.summary.sentences         # ["sentence 1", "sentence 2", ...]
result.summary.text              # Joined summary as a single string
result.summary.reduction_ratio   # 0.0-1.0 (proportion of text removed)
```
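
One simple way to build an extractive summary is to score each sentence by the average frequency of its words and keep the top N in their original order. The sketch below does that and also computes a reduction ratio. Both the scoring rule and the choice to measure reduction in words (rather than characters) are assumptions; the package's actual method is not described here. The function name `summarize` is hypothetical.

```python
import re
from collections import Counter


def summarize(text, n_sentences=2):
    """Return (summary sentences, reduction ratio) for a text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    freq = Counter(w for words in tokenized for w in words)
    # Average word frequency, so length alone does not win.
    scores = [
        sum(freq[w] for w in words) / len(words) if words else 0.0
        for words in tokenized
    ]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:n_sentences])  # Restore document order.
    total_words = sum(len(words) for words in tokenized)
    kept_words = sum(len(tokenized[i]) for i in chosen)
    ratio = 1 - kept_words / total_words if total_words else 0.0
    return [sentences[i] for i in chosen], ratio


text = "Revenue grew. Revenue grew again this year. The weather was mild."
summary, reduction = summarize(text, n_sentences=2)
print(summary, f"{reduction:.0%} removed")
```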

### Named Entities

```python
result.entities.people           # ["Steve Jobs", ...]
result.entities.organizations    # ["Apple Inc.", ...]
result.entities.locations        # ["Cupertino", ...]
result.entities.entities         # {"PERSON": [...], "ORGANIZATION": [...], ...}
```

---

## Dependencies

**Installed with the core package:**
- `nltk` — tokenization, VADER sentiment, named entity recognition
- `click` — CLI framework
- `rich` — terminal output formatting

**Installed with `[full]` extras:**
- `textblob` — secondary sentiment signal and subjectivity scoring
- `scikit-learn` — TF-IDF keyword extraction and cosine similarity
- `rake-nltk` — RAKE multi-word keyphrase extraction
- `langdetect` — language detection
- `pdfplumber` — PDF text extraction
- `python-docx` — Word document (.docx) loading
- `requests` and `beautifulsoup4` — web page fetching and parsing
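
To check which of the optional packages are importable in the current environment, a small standard-library sketch like the following may help. It is not part of the package; `missing_extras` is a hypothetical helper. Note that several packages are imported under a different name than the one used on PyPI.

```python
import importlib.util

# PyPI name -> import name for the packages in the [full] extra.
OPTIONAL_MODULES = {
    "textblob": "textblob",
    "scikit-learn": "sklearn",
    "rake-nltk": "rake_nltk",
    "langdetect": "langdetect",
    "pdfplumber": "pdfplumber",
    "python-docx": "docx",
    "requests": "requests",
    "beautifulsoup4": "bs4",
}


def missing_extras():
    """Return the PyPI names of optional packages that cannot be imported."""
    return [
        pypi_name
        for pypi_name, module_name in OPTIONAL_MODULES.items()
        if importlib.util.find_spec(module_name) is None
    ]


missing = missing_extras()
if missing:
    print("Missing optional packages:", ", ".join(missing))
else:
    print("All [full] extras are installed.")
```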

---

## Running Tests

```bash
# Install development dependencies
pip install -e ".[dev]"

# Download required NLTK data
python -c "import nltk; nltk.download(['vader_lexicon', 'punkt', 'punkt_tab', 'averaged_perceptron_tagger', 'averaged_perceptron_tagger_eng', 'maxent_ne_chunker', 'words'])"

# Run the full test suite
pytest tests/ -v
```

---

## Publishing to PyPI

```bash
pip install build twine
python -m build
twine upload dist/*
```

---

## License

MIT License — see [LICENSE](LICENSE) for details.

## Author

**Al Mustafiz Bappy**

- Website: [almustafizbappy.zerodevs.com](https://almustafizbappy.zerodevs.com/)
- GitHub: [@bappy-3](https://github.com/bappy-3)
- PyPI: [pypi.org/project/ReportAnalysis](https://pypi.org/project/ReportAnalysis/)
