Metadata-Version: 2.4
Name: doctok
Version: 1.0.0
Summary: Production-grade token counter for PDF, TXT, DOCX, MD, and PPTX files using real GPT tokenization via tiktoken
Home-page: https://github.com/Pranesh-2005/Token-Calculator
Author: Pranesh
Author-email: Pranesh <praneshmadhan646@gmail.com>
Maintainer-email: Pranesh <praneshmadhan646@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/Pranesh-2005/Token-Calculator
Project-URL: Documentation, https://github.com/Pranesh-2005/Token-Calculator#readme
Project-URL: Repository, https://github.com/Pranesh-2005/Token-Calculator.git
Project-URL: Bug Tracker, https://github.com/Pranesh-2005/Token-Calculator/issues
Keywords: token,counter,gpt,tiktoken,pdf,docx,pptx,nlp,llm,rag
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Office/Business
Classifier: Development Status :: 5 - Production/Stable
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: tiktoken
Requires-Dist: pymupdf
Requires-Dist: python-docx
Requires-Dist: python-pptx
Requires-Dist: charset-normalizer
Requires-Dist: psutil
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: flake8; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Requires-Dist: isort; extra == "dev"
Provides-Extra: gradio
Requires-Dist: gradio; extra == "gradio"
Dynamic: author
Dynamic: home-page
Dynamic: requires-python

# Token Calculator

<div align="center">

[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![PyPI version](https://img.shields.io/badge/PyPI-1.0.0-brightgreen.svg)](https://pypi.org/project/doctok/)

**Production-grade token counter for PDF, TXT, DOCX, MD, and PPTX files using real GPT tokenization via tiktoken**

[Features](#features) • [Installation](#installation) • [Quick Start](#quick-start) • [API](#api-reference) • [CLI](#cli-options)

</div>

---

## Features

✨ **Real GPT Tokenization** - Uses OpenAI's `tiktoken` library for accurate token counting (not approximations)

⚡ **Multi-Format Support** - Handles PDF, TXT, MD, DOCX, PPTX files seamlessly

🔄 **Streaming Architecture** - Constant-memory processing for large files (up to the 500 MB file size limit)

🚀 **Adaptive Concurrency** - Automatic CPU detection and parallel batch processing

🔍 **Scanned PDF Detection** - Automatically identifies whether a PDF is text-based or a scanned image; scanned pages are reported as skipped, and no OCR is performed

⏱️ **Timeout Enforcement** - Prevents hanging on malformed or problematic files

💾 **Memory Protected** - Soft (512MB) and hard (2048MB) memory limits with enforcement

🎯 **Enterprise Ready** - Designed for RAG, LLM preprocessing, token budgeting, and bulk analytics
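
To illustrate what constant-memory processing means in practice, here is a minimal sketch that counts characters and words while holding only one fixed-size chunk in memory. It is not the library's implementation: `stream_counts` and `chunk_size` are hypothetical names, and the sketch covers plain text only (no tokenization, no PDF or DOCX extraction).

```python
import io
from typing import TextIO, Tuple


def stream_counts(handle: TextIO, chunk_size: int = 64 * 1024) -> Tuple[int, int]:
    """Return (char_count, word_count), reading fixed-size chunks.

    Hypothetical helper for illustration; not part of the package.
    """
    chars = 0
    words = 0
    in_word = False  # carries word state across chunk boundaries
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:
            break
        chars += len(chunk)
        for ch in chunk:
            if ch.isspace():
                in_word = False
            elif not in_word:
                # First non-space character after whitespace starts a new word
                in_word = True
                words += 1
    return chars, words


# A tiny chunk size forces words to straddle chunk boundaries
chars, words = stream_counts(io.StringIO("hello big world"), chunk_size=4)
print(chars, words)
```

Because the `in_word` flag survives from one chunk to the next, a word split across two reads is still counted once.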

---

## Installation

### From PyPI
```bash
pip install doctok
```

### From source
```bash
git clone https://github.com/Pranesh-2005/Token-Calculator.git
cd Token-Calculator
pip install -e .
```

---

## Quick Start

### Command Line
```bash
# Count tokens in a single file
doctok document.pdf

# Process an entire directory
doctok ./documents/

# Output results as JSON
doctok document.pdf --output results.json --format json

# Compute SHA256 hashes
doctok document.pdf --hash

# Specify timeout (seconds)
doctok large-file.pdf --timeout 600

# Use multiple workers (default: auto-detected)
doctok ./documents/ --workers 8
```

### Python API
```python
from token_calculator import count_file, count_files_batch

# Single file
result = count_file("document.pdf")
print(f"Tokens: {result.gpt_tokens}")
print(f"Words: {result.word_count}")
print(f"Characters: {result.char_count}")
print(f"Pages: {result.pages}")
print(f"Status: {result.status}")  # 'ok', 'scanned', 'timeout', 'error'

# Batch processing
results = count_files_batch(
    ["doc1.pdf", "doc2.docx", "doc3.txt"],
    max_workers=4,
    compute_hash=True
)
print(f"Total tokens: {results.total_tokens}")
print(f"Total pages: {results.total_pages}")
print(f"Elapsed time: {results.elapsed_sec:.2f}s")
```

---

## API Reference

### Core Functions

#### `count_file(path: str, timeout_sec: float = 300, compute_hash: bool = False) -> FileResult`

Process a single file and return detailed metrics.

**Parameters:**
- `path` (str): Path to the file to process
- `timeout_sec` (float): Timeout in seconds (default: 300)
- `compute_hash` (bool): Compute SHA256 hash (default: False)

**Returns:**
- `FileResult` object with:
  - `gpt_tokens` (int): Token count using GPT tokenizer
  - `char_count` (int): Character count
  - `word_count` (int): Word count
  - `pages` (int): Page count
  - `extractable_pages` (int): Pages with extractable text
  - `skipped_pages` (int): Pages without text (scanned PDFs)
  - `status` (str): 'ok', 'scanned', 'timeout', 'error'
  - `error_msg` (str): Error details if applicable
  - `elapsed_sec` (float): Processing time

**Example:**
```python
from token_calculator import count_file

result = count_file("paper.pdf")
if result.status == "ok":
    print(f"{result.gpt_tokens} tokens")
elif result.status == "scanned":
    print("PDF is scanned (no OCR)")
else:
    print(f"Error: {result.error_msg}")
```

---

#### `count_files_batch(paths: list[str], max_workers: Optional[int] = None, compute_hash: bool = False) -> BatchResult`

Process multiple files in parallel.

**Parameters:**
- `paths` (list[str]): List of file paths
- `max_workers` (Optional[int]): Max parallel workers (default: `None`, which auto-detects from the CPU count, capped at 8)
- `compute_hash` (bool): Compute SHA256 hashes (default: False)

**Returns:**
- `BatchResult` object with:
  - `files` (list[FileResult]): Results for each file
  - `total_tokens` (int): Sum of all tokens
  - `total_words` (int): Sum of all words
  - `total_chars` (int): Sum of all characters
  - `total_pages` (int): Sum of all pages
  - `file_count` (int): Number of files processed
  - `error_count` (int): Number of failed files
  - `elapsed_sec` (float): Total elapsed time

**Example:**
```python
from token_calculator import count_files_batch
import json

results = count_files_batch([
    "doc1.pdf",
    "doc2.docx",
    "doc3.txt"
])

# Export as JSON
output = {
    "total_tokens": results.total_tokens,
    "files": [
        {
            "name": f.filename,
            "tokens": f.gpt_tokens,
            "status": f.status
        }
        for f in results.files
    ]
}
print(json.dumps(output, indent=2))
```
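
When a batch mixes text-based and scanned documents, it can help to tally results by `status` before trusting the totals. The following is a sketch under stated assumptions: the `SimpleNamespace` objects are hypothetical stand-ins for `FileResult` (only `status` and `gpt_tokens` are mocked, with made-up values), and `tally_statuses` is not a function of this package.

```python
from collections import Counter
from types import SimpleNamespace
from typing import Iterable


def tally_statuses(files: Iterable) -> Counter:
    """Count how many results ended in each status value."""
    return Counter(f.status for f in files)


# Hypothetical stand-ins for FileResult objects; the values are invented
sample_files = [
    SimpleNamespace(status="ok", gpt_tokens=1200),
    SimpleNamespace(status="scanned", gpt_tokens=0),
    SimpleNamespace(status="ok", gpt_tokens=800),
]

by_status = tally_statuses(sample_files)
# Sum tokens only over files that were fully processed
ok_tokens = sum(f.gpt_tokens for f in sample_files if f.status == "ok")
print(dict(by_status), ok_tokens)
```

With real results, `results.files` would take the place of `sample_files`.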

---

### Exception Handling

```python
from token_calculator import (
    count_file,
    TokenCounterError,
    FileTooLargeError,
    UnsupportedFileError,
    FileTimeoutError,
)

try:
    result = count_file("document.pdf")
except FileTooLargeError:
    print("File exceeds 500MB limit")
except UnsupportedFileError as e:
    print(f"Format not supported: {e}")
except FileTimeoutError:
    print("Processing timed out")
except TokenCounterError as e:
    print(f"Processing error: {e}")
```

---

## Supported Formats

| Format | Extension | Features |
|--------|-----------|----------|
| **PDF** | `.pdf` | Text extraction, page count, scanned PDF detection |
| **Plain Text** | `.txt` | Encoding auto-detection, streaming processing |
| **Markdown** | `.md` | Standard markdown text extraction |
| **Word** | `.docx` | Paragraph and table extraction |
| **PowerPoint** | `.pptx` | Slide and shape text extraction |
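
Before passing a mixed list of paths to a batch call, the extensions in the table above can be used to pre-filter the input. This is a minimal sketch, not part of the package: `SUPPORTED_EXTENSIONS` and `filter_supported` are hypothetical names, and the case-insensitive match is an assumption about how extensions should be compared.

```python
from pathlib import Path
from typing import Iterable, List

# Extensions taken from the Supported Formats table
SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".md", ".docx", ".pptx"}


def filter_supported(paths: Iterable[str]) -> List[str]:
    """Keep only paths whose extension appears in the table.

    Matching is case-insensitive (an assumption made for this sketch).
    """
    return [p for p in paths if Path(p).suffix.lower() in SUPPORTED_EXTENSIONS]


candidates = ["a.PDF", "b.doc", "notes.md", "c.pptx", "noext"]
print(filter_supported(candidates))
```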

---

## Configuration

### Default Limits

```
Max file size: 500 MB
Global timeout: 300 seconds
PDF page timeout: 10 seconds
Soft memory limit: 512 MB
Hard memory limit: 2048 MB
Max workers: CPU count (capped at 8)
```

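The timeout and worker count can be set per call (`timeout_sec`, `max_workers`) or on the command line (`--timeout`, `--workers`). To change the other limits, install the package from source and edit the constants in `core.py`.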
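
The "CPU count (capped at 8)" rule for workers can be sketched as follows. This only illustrates the documented default and may differ from the actual code in `core.py`; `MAX_WORKERS_CAP` and `default_workers` are hypothetical names.

```python
import os
from typing import Optional

MAX_WORKERS_CAP = 8  # cap taken from the Default Limits listing


def default_workers(cpu_count: Optional[int] = None) -> int:
    """Return the detected CPU count, clamped to the range 1..MAX_WORKERS_CAP."""
    detected = cpu_count if cpu_count is not None else os.cpu_count()
    # os.cpu_count() may return None, so fall back to a single worker
    return max(1, min(detected or 1, MAX_WORKERS_CAP))


print(default_workers(4), default_workers(32))
```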

---

## Performance

- **Small files (< 10MB):** < 1 second
- **Medium files (10-100MB):** 2-10 seconds
- **Large files (100-500MB):** 30-120 seconds
- **Batch of 100 files:** ~2 minutes (with 8 workers)

Memory usage remains constant regardless of file size due to streaming architecture.

---

## Use Cases

- 📊 **RAG Pipeline Optimization** - Calculate token costs before ingestion
- 🧠 **LLM Preprocessing** - Validate document sizes for model context windows
- 💰 **Token Budgeting** - Estimate API costs for document processing
- 📈 **Enterprise Analytics** - Batch analyze large document corpora
- 🔍 **Content Indexing** - Organize documents by token complexity

---

## Examples

### Example 1: Calculate API Cost

```python
from token_calculator import count_file

result = count_file("paper.pdf")

# Pricing: $0.01 per 1K tokens (example)
cost = (result.gpt_tokens / 1000) * 0.01
print(f"Estimated API cost: ${cost:.2f}")
```

### Example 2: Batch Processing with Error Handling

```python
from token_calculator import count_files_batch
from pathlib import Path

pdf_files = list(Path("./documents").glob("*.pdf"))

results = count_files_batch([str(p) for p in pdf_files], max_workers=8)

print(f"Successfully processed: {results.file_count - results.error_count}")
print(f"Failed: {results.error_count}")
print(f"Total tokens: {results.total_tokens:,}")

for file_result in results.files:
    if file_result.status != "ok":
        print(f"⚠️  {file_result.filename}: {file_result.error_msg}")
```

### Example 3: RAG Ingestion Check

```python
from token_calculator import count_file

# Check if document fits in context window
MAX_CONTEXT = 8000  # tokens

result = count_file("document.pdf")

if result.gpt_tokens > MAX_CONTEXT:
    print(f"⚠️  Document is {result.gpt_tokens} tokens (exceeds {MAX_CONTEXT})")
    # Ceiling division, so an exact multiple of MAX_CONTEXT is not over-counted
    chunks = -(-result.gpt_tokens // MAX_CONTEXT)
    print(f"Split into {chunks} chunks")
else:
    print(f"✓ Document fits in context ({result.gpt_tokens} tokens)")
```
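
For planning a split, the chunk boundaries can be derived from the token count alone. The sketch below computes half-open token ranges of at most `max_context` tokens each. `chunk_ranges` is a hypothetical helper, not part of the package, and real chunking would still need to slice the token sequence itself (and usually respect sentence or section boundaries).

```python
from typing import List, Tuple


def chunk_ranges(total_tokens: int, max_context: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) token ranges, each at most max_context long."""
    if max_context <= 0:
        raise ValueError("max_context must be positive")
    return [
        (start, min(start + max_context, total_tokens))
        for start in range(0, total_tokens, max_context)
    ]


# An exact multiple of the window needs exactly two chunks, not three
print(chunk_ranges(16000, 8000))
# One extra token spills into a third, very short chunk
print(len(chunk_ranges(16001, 8000)))
```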

---

## CLI Options

```
usage: doctok [-h] [-w WORKERS] [-t TIMEOUT] [-o OUTPUT] [--hash] [--format {text,json}] path

positional arguments:
  path                  File or directory to process

optional arguments:
  -h, --help           Show help message
  -w, --workers N      Max parallel workers (default: auto)
  -t, --timeout SEC    Per-file timeout in seconds (default: 300)
  -o, --output FILE    Write JSON results to file
  --hash               Compute SHA256 hash for each file
  --format {text,json} Output format (default: text)
```
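
For readers who want to wrap or test the CLI, the options listed above can be mirrored with `argparse`. This is a reconstruction from the help text only, not the package's actual parser: `build_parser` is a hypothetical name, and the argument types (`int` for workers, `float` for the timeout) are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="doctok")
    parser.add_argument("path", help="File or directory to process")
    parser.add_argument("-w", "--workers", type=int, default=None,
                        help="Max parallel workers (default: auto)")
    parser.add_argument("-t", "--timeout", type=float, default=300.0,
                        help="Per-file timeout in seconds (default: 300)")
    parser.add_argument("-o", "--output", default=None,
                        help="Write results to file")
    parser.add_argument("--hash", action="store_true",
                        help="Compute SHA256 hash for each file")
    parser.add_argument("--format", choices=["text", "json"], default="text",
                        help="Output format (default: text)")
    return parser


args = build_parser().parse_args(["./documents/", "-w", "8", "--hash", "--format", "json"])
print(args.path, args.workers, args.timeout, args.hash, args.format)
```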

---

## Logging

```python
import logging
from token_calculator import setup_logging, count_file

# Enable debug logging
setup_logging(logging.DEBUG)

result = count_file("document.pdf")
```

---

## License

MIT License - see [LICENSE](LICENSE) file for details

---

## Contributing

Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/improvement`)
3. Make your changes
4. Run tests (`pytest`)
5. Submit a pull request

---

## Support

- 📖 [GitHub Issues](https://github.com/Pranesh-2005/Token-Calculator/issues)
- 💬 [GitHub Discussions](https://github.com/Pranesh-2005/Token-Calculator/discussions)

---

## Changelog

### v1.0.0 (Initial Release)
- Multi-format document support (PDF, TXT, MD, DOCX, PPTX)
- Real GPT tokenization via tiktoken
- Streaming extraction with constant-memory processing
- Batch processing with adaptive concurrency
- Scanned PDF detection (no OCR)
- CLI and Python API
- Comprehensive error handling
- Memory protection and timeout enforcement

---

## Roadmap

- [ ] Async API support
- [ ] Web API endpoint
- [ ] Language-specific tokenizer support
- [ ] Streaming file uploads
- [ ] Token cost calculator for multiple LLM providers

---

**Made with ❤️ by Pranesh**
