Metadata-Version: 2.4
Name: text-quality-analyzer
Version: 0.1.2
Summary: Universal text analysis module for detecting language, meaningfulness, and structure
Home-page: https://github.com/yourusername/text-quality-analyzer
Author: Text Quality Analyzer Team
Author-email: LifeAiTools Team <dev@muid.io>
License: Apache-2.0
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Topic :: Text Processing :: Linguistic
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: NOTICE
Requires-Dist: langid>=1.1.6
Requires-Dist: textstat>=0.7.3
Requires-Dist: wordfreq>=3.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: mypy>=1.5.0; extra == "dev"
Requires-Dist: flake8>=6.0.0; extra == "dev"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# Text Quality Analyzer

A Python module for analyzing text quality: language detection, meaningfulness checks, and document structure analysis.

## Features

- **Language Detection**: Automatic detection of 97 languages (via langid) with confidence scores
- **Meaningfulness Check**: Determines if text is coherent writing or random characters
- **Structure Analysis**: Detects Markdown headers, lists, paragraphs, links, and code blocks
- **Readability Metrics**: Flesch reading ease and other readability scores
- **Lightweight**: CPU-only, works offline, no GPU required
- **Fast**: Analyzes 1 MB of text in under 1 second once initialized (the first call is slower; see Performance)

## Installation

### From source (recommended for development)

```bash
cd text-quality-analyzer
pip install -e .
```

### Install dependencies only

```bash
pip install -r requirements.txt
```

### Required dependencies

- `langid>=1.1.6` - Language detection
- `textstat>=0.7.3` - Readability metrics
- `wordfreq>=3.0.0` - Word frequency dictionaries

## Quick Start

### Basic Usage

```python
from text_quality_analyzer import TextProfiler

profiler = TextProfiler()
result = profiler.analyze_text("Your text here")

print(result["language"])          # 'en'
print(result["is_meaningful"])     # True
print(result["is_structured"])     # False
```

### Convenience Function

```python
from text_quality_analyzer import analyze_text

result = analyze_text("## Hello World\n\nThis is a test.")
print(result)
```

## Usage Examples

### Example 1: Language Detection

```python
from text_quality_analyzer import TextProfiler

profiler = TextProfiler()

texts = {
    "English": "This is a test document.",
    "Russian": "Это тестовый документ.",
    "Chinese": "这是一个测试文档。"
}

for name, text in texts.items():
    result = profiler.analyze_text(text)
    print(f"{name}: {result['language']} ({result['language_confidence']:.0%})")

# Output:
# English: en (99%)
# Russian: ru (98%)
# Chinese: zh (99%)
```

### Example 2: Quality Filtering

```python
from text_quality_analyzer import TextProfiler

profiler = TextProfiler()

# Filter meaningful text only
texts = [
    "This is a well-written article.",
    "xkcd1234!@#$%^&*()",
    "Short"
]

for text in texts:
    should_process, reason = profiler.should_process_text(
        text,
        require_meaningful=True,
        allowed_languages=["en"]
    )
    print(f"{'ACCEPT' if should_process else 'REJECT'}: {reason}")
```

### Example 3: Structure Detection

```python
from text_quality_analyzer import TextProfiler

profiler = TextProfiler()

markdown_text = """## Introduction

This document has structure.

- Item 1
- Item 2

[Link](https://example.com)
"""

result = profiler.analyze_text(markdown_text)

if result["is_structured"]:
    elements = result["structure_elements"]
    print(f"Headers: {elements['headers']}")
    print(f"List items: {elements['total_list_items']}")
    print(f"Links: {elements['total_links']}")
```

### Example 4: Detailed Metrics

```python
from text_quality_analyzer import TextProfiler

profiler = TextProfiler()
result = profiler.analyze_text("Your text here")

# Language
print(f"Language: {result['language']}")
print(f"Confidence: {result['language_confidence']:.0%}")

# Meaningfulness
print(f"Meaningful: {result['is_meaningful']}")
print(f"Score: {result['meaningfulness_score']:.2f}")
metrics = result['meaningfulness_metrics']
print(f"  Letter ratio: {metrics['letter_ratio']:.2f}")
print(f"  Stopword presence: {metrics['stopword_presence']:.2f}")

# Structure
print(f"Structured: {result['is_structured']}")
print(f"Score: {result['structure_score']:.2f}")

# Statistics
print(f"Words: {result['word_count']}")
print(f"Paragraphs: {result['paragraph_count']}")
```

## API Reference

### TextProfiler

Main class for text analysis.

```python
profiler = TextProfiler(
    min_confidence=0.7,        # Minimum language detection confidence
    min_meaningfulness=0.6,    # Minimum meaningfulness score
    min_structure=0.5,         # Minimum structure score
    max_text_length=1_000_000  # Maximum text length (1MB)
)
```

#### Methods

- `analyze_text(text: str) -> dict`: Full analysis of text
- `quick_check(text: str) -> dict`: Boolean checks only (faster)
- `get_text_summary(text: str) -> str`: Human-readable summary
- `should_process_text(text, ...) -> (bool, str)`: Decision helper for pipelines

### Individual Components

```python
from text_quality_analyzer import (
    LanguageDetector,
    MeaningfulnessChecker,
    StructureAnalyzer
)

# Use components individually if needed
lang_detector = LanguageDetector()
result = lang_detector.detect("Hello world")
```

## Output Format

The `analyze_text()` method returns a dictionary with:

```python
{
    "success": True,
    "language": "en",                    # ISO 639-1 code
    "language_confidence": 0.98,         # 0.0-1.0
    "is_meaningful": True,
    "meaningfulness_score": 0.87,        # 0.0-1.0
    "meaningfulness_metrics": {
        "letter_ratio": 0.82,
        "space_ratio": 0.16,
        "stopword_presence": 0.9,
        "avg_word_length": 5.3,
        "dictionary_match": 0.78
    },
    "is_structured": True,
    "structure_score": 0.98,             # 0.0-1.0 (high score due to headers+lists+code)
    "structure_elements": {
        "headers": 3,
        "total_list_items": 5,
        "paragraphs": 4,
        "total_links": 2,
        "code_blocks": 1
    },
    "text_length": 1024,
    "word_count": 145,
    "paragraph_count": 4,
    "readability_index": 58.4,           # Flesch reading ease
    "processing_time_ms": 125
}
```

## Use Cases

### Content Filtering
Filter low-quality or spam content before processing:

```python
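# Illustrative wrapper (the name validate_input is hypothetical);
# `profiler` is a TextProfiler instance as in the examples above.
def validate_input(user_input):
    should_process, reason = profiler.should_process_text(
        user_input,
        require_meaningful=True,
        allowed_languages=["en", "ru"]
    )
    if not should_process:
        return f"Content rejected: {reason}"
    return None  # None means the content passed the filter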
```

### Telegram Bot Integration
Filter and classify Telegram posts:

```python
from text_quality_analyzer import TextProfiler

profiler = TextProfiler()

def should_index_post(post_text):
    result = profiler.analyze_text(post_text)

    # Only index meaningful Russian texts
    if result['language'] != 'ru':
        return False
    if not result['is_meaningful']:
        return False

    # Route structured docs to special processing
    # (route_to_structured_queue is a placeholder for your own handler)
    if result['is_structured']:
        route_to_structured_queue(post_text)

    return True
```

### Document Classification
Classify documents by structure:

```python
# `profiler` is a TextProfiler instance; `document` is the text to classify
result = profiler.analyze_text(document)

if result['structure_elements']['code_blocks'] > 0:
    doc_type = "technical_documentation"
elif result['structure_elements']['total_list_items'] > 5:
    doc_type = "listicle_article"
elif result['structure_elements']['paragraphs'] > 10:
    doc_type = "long_form_article"
else:
    doc_type = "plain_text"
```

## Running Tests

```bash
# Install dev dependencies
pip install -r requirements-dev.txt

# Run tests
pytest tests/ -v

# With coverage
pytest tests/ --cov=text_quality_analyzer --cov-report=html
```

## Running Examples

```bash
python3 examples/basic_usage.py
```

## How It Works

### Structure Detection Logic

The module uses a multi-level approach to detect structured text:

**Step 1: Identify Structure Indicators**
- Has headers (≥1 header)
- Has lists (≥2 list items)
- Has code blocks (≥1 block)
- Has links + paragraphs (≥1 link AND ≥3 paragraphs)

**Step 2: Calculate Base Score**
- 2+ indicators → base score = 0.7 (definitely structured)
- 1 indicator → base score = 0.5 (possibly structured)
- 0 indicators → base score = 0.0 (not structured)

**Step 3: Add Quantity Bonuses**
- +0.05 per header (max +0.15)
- +0.02 per list item (max +0.15)
- +0.1 for code blocks

**Examples:**
- Text with 4 headers + 4 lists + 1 code → score = 0.96 ✅ Structured
- Text with 2 headers only → score = 0.60 ✅ Structured
- Plain text → score = 0.00 ❌ Not structured
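
The three steps above can be read as the following sketch. It is a literal reading of the listed rules, not the package's actual implementation: the function name `structure_score`, its parameters, and the clamp to 1.0 are assumptions made for illustration, and the real scorer may apply weighting or rounding not described here.

```python
def structure_score(headers: int, list_items: int, code_blocks: int,
                    links: int, paragraphs: int) -> float:
    """Score structure from element counts, following the three steps above."""
    # Step 1: count which structure indicators are present
    indicators = sum([
        headers >= 1,
        list_items >= 2,
        code_blocks >= 1,
        links >= 1 and paragraphs >= 3,
    ])

    # Step 2: base score from the number of indicators
    if indicators >= 2:
        base = 0.7
    elif indicators == 1:
        base = 0.5
    else:
        return 0.0

    # Step 3: quantity bonuses, each capped as listed above
    bonus = min(0.05 * headers, 0.15) + min(0.02 * list_items, 0.15)
    if code_blocks >= 1:
        bonus += 0.1

    # Assumption: the final score is clamped to 1.0
    return round(min(base + bonus, 1.0), 2)
```

For the "2 headers only" case, `structure_score(headers=2, list_items=0, code_blocks=0, links=0, paragraphs=1)` gives one indicator (base 0.5) plus a 0.10 header bonus, which is 0.6.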

### Meaningfulness Detection

Uses multiple metrics:
- **Letter ratio**: Proportion of alphabetic characters (expect 0.5-0.9)
- **Space ratio**: Proper word spacing (expect 0.05-0.25)
- **Stopwords**: Presence of common words for the language
- **Word length**: Average word length in normal range (3-12 chars)
- **Dictionary match**: Words found in language frequency lists

Texts with score ≥ 0.6 are considered meaningful.
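
As a rough illustration, three of these metrics can be computed with the standard library alone. This is a sketch, not the package's code: the helper names `basic_text_metrics` and `in_expected_ranges` are hypothetical, the word tokenization is an assumption, and the stopword and dictionary checks are omitted because they need language data.

```python
import re


def basic_text_metrics(text: str) -> dict:
    """Compute letter ratio, space ratio, and average word length."""
    if not text:
        return {"letter_ratio": 0.0, "space_ratio": 0.0, "avg_word_length": 0.0}

    total = len(text)
    letters = sum(ch.isalpha() for ch in text)
    spaces = sum(ch.isspace() for ch in text)
    # Assumption: a "word" is a run of word characters
    words = re.findall(r"\w+", text)
    avg_len = sum(len(w) for w in words) / len(words) if words else 0.0

    return {
        "letter_ratio": letters / total,
        "space_ratio": spaces / total,
        "avg_word_length": avg_len,
    }


def in_expected_ranges(metrics: dict) -> bool:
    """Check the metrics against the expected ranges listed above."""
    return (0.5 <= metrics["letter_ratio"] <= 0.9
            and 0.05 <= metrics["space_ratio"] <= 0.25
            and 3 <= metrics["avg_word_length"] <= 12)
```

With this sketch, a sentence such as "This is a well-written article." falls inside all three ranges, while a string like "xkcd1234!@#$%^&*()" fails on letter ratio and spacing.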

## Performance

- **Short text (100 chars)**: ~1-10ms
- **Medium text (1,000 chars)**: ~2-50ms
- **Long text (10,000 chars)**: ~3-100ms
- **First run**: ~10 seconds (library initialization)
- **Subsequent runs**: 0-3ms per text

Note: First analysis is slower due to library initialization. Subsequent analyses are very fast.
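
To measure steady-state latency separately from the first-run cost, a warm-up call can be made before timing. The helper below is a minimal sketch: `time_calls` is a hypothetical name, and it accepts any callable so it does not depend on this package's internals.

```python
import time
from typing import Callable, Iterable, List


def time_calls(analyze: Callable[[str], object], texts: Iterable[str],
               warmup_text: str = "Warm-up text.") -> List[float]:
    """Return per-call latency in milliseconds, excluding a warm-up call."""
    analyze(warmup_text)  # pay the one-time initialization cost up front
    timings = []
    for text in texts:
        start = time.perf_counter()
        analyze(text)
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings
```

For example, `time_calls(profiler.analyze_text, texts)` would time each entry in `texts` after one untimed warm-up analysis.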

## Requirements

- Python >= 3.9
- CPU-only (no GPU required)
- ~100 MB RAM
- ~10 MB disk space

## Supported Languages

Primary support:
- English (en)
- Russian (ru)
- Chinese (zh)
- Spanish (es)
- French (fr)
- German (de)
- Arabic (ar)
- Japanese (ja)
- Portuguese (pt)
- Italian (it)

Plus 87 more languages via langid (97 in total).

## Limitations

- Maximum text length: 1 MB (configurable)
- Minimum text length for language detection: 10 characters
- Text-only input (no binary data)
- CPU-only processing
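
A pipeline can check these limits before calling the analyzer. The guard below is a sketch under stated assumptions: `limit_violation` is a hypothetical helper, and it treats the 1 MB limit as 1,000,000 characters, matching the default `max_text_length` shown in the API Reference.

```python
from typing import Optional

MIN_CHARS = 10          # documented minimum for language detection
MAX_CHARS = 1_000_000   # assumed to match the default max_text_length


def limit_violation(text: object) -> Optional[str]:
    """Return a reason string if the input breaks a documented limit, else None."""
    if not isinstance(text, str):
        return "input must be text (str), not binary data"
    if len(text) < MIN_CHARS:
        return f"text shorter than {MIN_CHARS} characters"
    if len(text) > MAX_CHARS:
        return f"text longer than {MAX_CHARS} characters"
    return None
```

A return value of `None` means the input is within the limits and can be passed on for analysis.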

## Future Improvements

Planned features:
- Sentiment analysis
- Document type classification
- Grammar checking
- Keyword extraction
- HTML/PDF support
- Result caching

## License

Apache License 2.0 - see the LICENSE and NOTICE files for details

## Contributing

Contributions welcome! Please:
1. Fork the repository
2. Create a feature branch
3. Add tests for new features
4. Submit a pull request

## Support

For issues and feature requests, please create an issue in the GitHub repository.

## Changelog

### 0.1.1 (2025-10-28) - Hotfix

**Fixed:**
- Fixed language confidence display (was showing negative percentages)
  - langid returns an unnormalized log-probability score (usually negative), which is now converted to a 0-1 probability
  - Confidence now displays correctly as 95-100% for clear texts
- Improved structure detection logic (major improvement)
  - Old: Used complex weighted formula that was too conservative
  - New: Multi-level approach with explicit structure indicators
  - Result: Texts with headers/lists/code are now correctly identified as structured
  - Examples: Text with 4 headers + 4 lists + code → score 0.96 (was 0.50)

**Tested:**
- Comprehensive test suite with 15 diverse texts
- Multiple languages (English, Russian, Chinese, Spanish)
- Various structures (plain, markdown, code-heavy)
- All tests passing ✅

### 0.1.0 (2025-10-28)
- Initial release
- Language detection for 97+ languages
- Meaningfulness checking with 5 metrics
- Structure analysis for Markdown documents
- Readability scoring with textstat
- Complete test suite
- Usage examples and documentation
