Metadata-Version: 2.4
Name: sharepoint-to-text
Version: 0.9.0
Summary: Text extraction library for typical file formats found in SharePoint repositories
Project-URL: Homepage, https://github.com/Horsmann/sharepoint-to-text
Project-URL: Documentation, https://github.com/Horsmann/sharepoint-to-text#readme
Project-URL: Repository, https://github.com/Horsmann/sharepoint-to-text.git
Project-URL: Issues, https://github.com/Horsmann/sharepoint-to-text/issues
Project-URL: Changelog, https://github.com/Horsmann/sharepoint-to-text/blob/main/CHANGELOG.md
Author-email: Tobias Horsmann <tobias.horsmann@gmail.com>
Maintainer-email: Tobias Horsmann <tobias.horsmann@gmail.com>
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: aws-lambda,csv,doc,document-extraction,document-processing,docx,email,eml,json,llm,mbox,md,microsoft-office,msg,nlp,odp,ods,office,pdf,ppt,pptx,pure-python,rag,rtf,serverless,sharepoint,text-extraction,tsv,xls,xlsx
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Communications :: Email
Classifier: Topic :: Office/Business
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing
Classifier: Topic :: Text Processing :: Indexing
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: charset-normalizer>=3.3.0
Requires-Dist: defusedxml>=0.7.1
Requires-Dist: mail-parser>=4.1.4
Requires-Dist: msg-parser>=1.2.0
Requires-Dist: olefile>=0.47
Requires-Dist: openpyxl>=3.1.5
Requires-Dist: pypdf>=6.6.0
Requires-Dist: xlrd>=2.0.2
Provides-Extra: pdf-crypto
Requires-Dist: pycryptodome>=3.20.0; extra == 'pdf-crypto'
Description-Content-Type: text/markdown

# sharepoint-to-text

A **pure Python** library for extracting text, metadata, and structured elements from Microsoft Office files—both modern (`.docx`, `.xlsx`, `.pptx`) and legacy (`.doc`, `.xls`, `.ppt`) formats—plus PDF, email formats, and plain text.

The library also includes an optional SharePoint client for reading files directly from Microsoft SharePoint sites via the Graph API. You still orchestrate the pipeline: pull files (via `sharepoint_io` or your own Graph client), then pass the bytes into the extractors.

**Install:** `uv add sharepoint-to-text`
**Python import:** `import sharepoint2text`
**CLI (text):** `sharepoint2text --file /path/to/file.docx > extraction.txt`
**CLI (JSON):** `sharepoint2text --file /path/to/file.docx --json > extraction.json` (images ignored by default; add `--include-images` to extract)

## What You Get

- **Unified API**: `sharepoint2text.read_file(path)` yields one or more typed extraction results.
- **Typed results**: each format returns a specific dataclass (e.g. `DocxContent`, `PdfContent`) that also supports the common interface.
- **Text**: `get_full_text()` or `iterate_units()` (pages / slides / sheets depending on format; call `unit.get_text()` for the string).
- **Structured content**: tables and images where the format supports it.
- **Metadata**: file metadata (plus format-specific metadata where available).
- **Serialization**: `result.to_json()` returns a JSON-serializable dict.

## Extractor Interface (Developer Guide)

Every extracted result implements the same high-level interface (`ExtractionInterface`). Use it to build pipelines that work across file types without special-casing `.pdf` vs `.docx` vs `.pptx`.

### Quick Guide: Which Method Should I Use?

| Goal | Recommended | Why |
|---|---|---|
| Get "the document text" as one string | `result.get_full_text()` | Best default for indexing and simple exports; hides format-specific unit details. |
| Chunk text by page/slide/sheet (RAG, citations, per-unit metadata) | `result.iterate_units()` | Stable unit boundaries for formats that have them (PDF pages, PPT slides, XLS(X) sheets). |
| Extract images (and optionally store payloads) | `result.iterate_images()` | Returns image objects with metadata; binary payload handling is caller-controlled. |
| Extract tables as structured data | `result.iterate_tables()` | Returns table objects as 2D arrays, suitable for CSV/JSON downstream. |
| Attach filename/path context | `result.get_metadata()` | Normalizes file metadata regardless of format; useful for provenance and linking. |
| Persist/transport results | `result.to_json()` / `ExtractionInterface.from_json(...)` | JSON-serializable representation; optional base64 encoding for binary fields. |

### Method Details (When to Use Which)

- `get_full_text()`:
  - Use when you want a single string per extracted item (search indexing, previews, "export to .txt").
  - It is usually derived from `iterate_units()`, but some formats may prepend metadata (e.g., titles) or omit optional content by default.
- `iterate_units()`:
  - Use when you need chunk boundaries aligned with the source structure (pages/slides/sheets) or when you want to keep unit-level metadata.
  - Each unit supports `unit.get_text()`, `unit.get_images()`, `unit.get_tables()`, and `unit.get_metadata()`.
- `iterate_images()` / `iterate_tables()`:
  - Use when you want *all* images/tables across the document (often simpler than traversing units).
  - Prefer unit-level access (`unit.get_images()`, `unit.get_tables()`) when you need "where did this come from?" context (page/slide number).
- `get_metadata()`:
  - Use for provenance fields like `filename`, `file_extension`, `file_path`, `folder_path`.
  - Pair with unit metadata for precise citations (e.g., `file_path + page_number`).
- `to_json()` / `from_json()`:
  - Use to store results, send them across processes, or debug extraction output.
  - Binary payloads are representable but can be large; omit them unless you explicitly need embedded data.

### Examples

Plain text (single string):

```python
import sharepoint2text

result = next(sharepoint2text.read_file("document.pdf"))
text = result.get_full_text()
```

Unit-based chunking (recommended for RAG):

```python
import sharepoint2text

result = next(sharepoint2text.read_file("deck.pptx"))
for unit in result.iterate_units():
    chunk = unit.get_text()
    unit_meta = unit.get_metadata()  # e.g., slide/page/sheet number when available
```
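
Pairing file metadata with unit position (as described under `get_metadata()` above) gives precise citations. A minimal sketch; `make_citation` is a hypothetical helper, not part of the library:

```python
def make_citation(file_path: str, unit_kind: str, unit_number: int) -> str:
    # Provenance string such as "reports/q3.pdf#page=4" for chunk metadata.
    return f"{file_path}#{unit_kind}={unit_number}"

# Typical use with the extractor (unit numbers made 1-based via enumerate):
# result = next(sharepoint2text.read_file("reports/q3.pdf"))
# for i, unit in enumerate(result.iterate_units(), start=1):
#     chunk = {"text": unit.get_text(),
#              "cite": make_citation("reports/q3.pdf", "page", i)}
```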

## Why This Library?

### Pure Python, No System Dependencies

Unlike popular alternatives that shell out to **LibreOffice** or **Apache Tika** (requiring Java), `sharepoint-to-text` is a **native Python implementation** with no system-level dependencies:

| Approach | Requirements | Cross-platform | Container-friendly |
|----------|-------------|----------------|-------------------|
| **sharepoint-to-text** | `uv add` only | Yes | Yes (minimal image) |
| LibreOffice-based | LibreOffice install, X11/headless setup | Complex | Large images (~1GB+) |
| Apache Tika | Java runtime, Tika server | Complex | Heavy (~500MB+) |
| subprocess-based | Shell access, security concerns | No | Risky |

This library parses Office binary formats (OLE2) and XML-based formats (OOXML) directly in Python, making it ideal for:

- **RAG pipelines** and LLM document ingestion
- **Serverless functions** (AWS Lambda, Google Cloud Functions)
- **Containerized deployments** with minimal footprint
- **Secure environments** where shell execution is restricted
- **Cross-platform** applications (Windows, macOS, Linux)

### Enterprise SharePoint Reality

Enterprise SharePoint sites contain decades of accumulated documents. While modern `.docx`, `.xlsx`, and `.pptx` files are well-supported, legacy `.doc`, `.xls`, and `.ppt` files remain common. This library provides a **unified interface** for all formats—no conditional logic needed.

### SharePoint Connectivity

For scenarios where documents live in Microsoft SharePoint, the library includes a built-in Graph API client. This is an optional convenience layer, not required for local files or other storage backends. You are responsible for orchestrating the pull (list/download) and then calling the extractors:

```python
from sharepoint2text.sharepoint_io import SharePointRestClient, EntraIDAppCredentials

credentials = EntraIDAppCredentials(
    tenant_id="your-tenant-id",
    client_id="your-client-id",
    client_secret="your-client-secret",
)
client = SharePointRestClient(site_url="https://contoso.sharepoint.com/sites/Docs", credentials=credentials)

# List and download files
for file in client.list_all_files():
    content = client.download_file(file.id)
    # Pass to sharepoint2text extractors...
```

The client supports filtering by modification date (for delta-sync patterns), folder paths, and file extensions. See [`sharepoint2text/sharepoint_io/SETUP.md`](sharepoint2text/sharepoint_io/SETUP.md) for Azure/Entra ID configuration instructions.

## Supported Formats

### Format Detection and Processing

The library detects and processes file formats in layers:

1. **Extension-based detection** (primary): File extensions are mapped to appropriate extractors
2. **Extension aliases**: Template formats and variants automatically map to their base extractors
3. **MIME type fallback**: When extensions are unclear, MIME types are used for detection
4. **Archive processing**: Archives recursively extract and process all supported files within them

**Extension Aliases**: The library automatically handles format variants:
- Office templates (`.dotx`, `.xltx`, `.potx`) → processed as their base formats
- Office shows (`.ppsx`) → processed as presentations
- OpenDocument templates (`.ott`, `.ots`, `.otp`) → processed as their base formats
- Compression variants (`.tar.gz` → `.tgz`, `.tar.bz2` → `.tbz2`, `.tar.xz` → `.txz`)
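
The alias resolution above can be sketched as a simple lookup. This is an illustrative model, not the library's internal table:

```python
import os

# Illustrative alias map: template/variant extensions resolve to the
# extension whose extractor handles them (subset of the lists above).
EXTENSION_ALIASES = {
    ".dotx": ".docx", ".xltx": ".xlsx", ".potx": ".pptx", ".ppsx": ".pptx",
    ".ott": ".odt", ".ots": ".ods", ".otp": ".odp",
    ".tar.gz": ".tgz", ".tar.bz2": ".tbz2", ".tar.xz": ".txz",
}

def resolve_extension(filename: str) -> str:
    name = filename.lower()
    # Check longer (double) extensions first so ".tar.gz" wins over ".gz".
    for alias in sorted(EXTENSION_ALIASES, key=len, reverse=True):
        if name.endswith(alias):
            return EXTENSION_ALIASES[alias]
    return os.path.splitext(name)[1]
```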

**Security Features**: Archive processing includes protection against zip bombs and has size limits (7z archives max 100MB).

### Legacy Microsoft Office

| Format             | Extension | Description                      |
|--------------------|-----------|----------------------------------|
| Word 97-2003       | `.doc`    | Word 97-2003 documents           |
| Word 97-2003 (template) | `.dot` | Word 97-2003 templates           |
| Excel 97-2003      | `.xls`    | Excel 97-2003 spreadsheets       |
| Excel 97-2003 (template) | `.xlt` | Excel 97-2003 templates          |
| PowerPoint 97-2003 | `.ppt`    | PowerPoint 97-2003 presentations |
| PowerPoint 97-2003 (template) | `.pot` | PowerPoint 97-2003 templates     |
| PowerPoint 97-2003 (show) | `.pps` | PowerPoint 97-2003 slide shows    |
| Rich Text Format   | `.rtf`    | Rich Text Format documents       |

### Modern Microsoft Office

| Format                    | Extension | Description                              |
|---------------------------|-----------|------------------------------------------|
| Word 2007+                | `.docx`   | Word 2007+ documents                     |
| Word 2007+ (macro)        | `.docm`   | Word 2007+ macro-enabled documents       |
| Word 2007+ (template)     | `.dotx`   | Word 2007+ templates                     |
| Word 2007+ (template, macro) | `.dotm` | Word 2007+ macro-enabled templates       |
| Excel 2007+               | `.xlsx`   | Excel 2007+ spreadsheets                 |
| Excel 2007+ (macro)       | `.xlsm`   | Excel 2007+ macro-enabled spreadsheets   |
| Excel 2007+ (template)    | `.xltx`   | Excel 2007+ templates                    |
| Excel 2007+ (template, macro) | `.xltm` | Excel 2007+ macro-enabled templates      |
| PowerPoint 2007+          | `.pptx`   | PowerPoint 2007+ presentations           |
| PowerPoint 2007+ (macro)  | `.pptm`   | PowerPoint 2007+ macro-enabled presentations |
| PowerPoint 2007+ (template) | `.potx` | PowerPoint 2007+ templates               |
| PowerPoint 2007+ (template, macro) | `.potm` | PowerPoint 2007+ macro-enabled templates |
| PowerPoint 2007+ (show)   | `.ppsx`   | PowerPoint 2007+ slide shows             |
| PowerPoint 2007+ (show, macro) | `.ppsm` | PowerPoint 2007+ macro-enabled slide shows |

### OpenDocument

| Format       | Extension | Description               |
|--------------|-----------|---------------------------|
| Text         | `.odt`    | OpenDocument Text         |
| Text (template) | `.ott` | OpenDocument Text templates |
| Presentation | `.odp`    | OpenDocument Presentation |
| Presentation (template) | `.otp` | OpenDocument Presentation templates |
| Spreadsheet  | `.ods`    | OpenDocument Spreadsheet  |
| Spreadsheet (template) | `.ots` | OpenDocument Spreadsheet templates |
| Drawing      | `.odg`    | OpenDocument Drawing      |
| Formula      | `.odf`    | OpenDocument Formula      |

### Email

| Format | Extension | Description                           |
|--------|-----------|---------------------------------------|
| EML    | `.eml`    | RFC 822 email format                  |
| MSG    | `.msg`    | Microsoft Outlook email format        |
| MBOX   | `.mbox`   | Unix mailbox format (multiple emails) |

### Plain Text

| Format     | Extension | Description              |
|------------|-----------|--------------------------|
| Plain Text | `.txt`    | Plain text files         |
| Markdown   | `.md`     | Markdown                 |
| CSV        | `.csv`    | Comma-separated values   |
| TSV        | `.tsv`    | Tab-separated values     |
| JSON       | `.json`   | JSON files               |

### Configuration & Data Files

| Format       | Extension | Description                      |
|--------------|-----------|----------------------------------|
| YAML         | `.yaml`, `.yml` | YAML configuration files |
| XML          | `.xml`    | XML documents and data files     |
| Log files    | `.log`    | Application log files            |
| INI/Config   | `.ini`, `.cfg`, `.conf` | Configuration files |
| Properties   | `.properties` | Java properties files        |

### PDF

| Format | Extension | Description    |
|--------|-----------|----------------|
| PDF    | `.pdf`    | PDF documents  |

### HTML / Web

| Format | Extension      | Description                     |
|--------|----------------|---------------------------------|
| HTML   | `.html`, `.htm`| HTML documents                  |
| MHTML  | `.mhtml`, `.mht`| MIME HTML (web archive) files  |
| EPUB   | `.epub`        | EPUB e-book format              |

### Archives

| Format           | Extension                  | Description                     |
|------------------|----------------------------|---------------------------------|
| ZIP              | `.zip`                     | ZIP archives                    |
| 7-Zip            | `.7z`                      | 7-Zip archives (max 100MB)      |
| TAR              | `.tar`                     | TAR archives                    |
| Gzip TAR         | `.tar.gz`, `.tgz`, `.gz`   | Gzip-compressed TAR archives    |
| Bzip2 TAR        | `.tar.bz2`, `.tbz2`, `.bz2`| Bzip2-compressed TAR archives   |
| XZ TAR           | `.tar.xz`, `.txz`, `.xz`   | XZ-compressed TAR archives      |

Archive extraction recursively processes every supported file inside the archive, yielding a separate result for each document found. Security measures guard against zip-bomb attacks.
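
A basic pre-extraction size check in the spirit of these protections might look like the following. This is illustrative only; the library's actual limits and checks may differ:

```python
import io
import zipfile

MAX_UNCOMPRESSED = 100 * 1024 * 1024  # illustrative cap, not the library's exact limit

def safe_total_size(data: bytes) -> int:
    # Sum the declared uncompressed sizes *before* extracting anything;
    # refuse archives that inflate past the cap (a basic zip-bomb guard).
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        total = sum(info.file_size for info in zf.infolist())
    if total > MAX_UNCOMPRESSED:
        raise ValueError(f"archive inflates to {total} bytes, refusing")
    return total
```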

## Installation

```bash
uv add sharepoint-to-text
```

Optional: faster AES handling for encrypted PDFs (avoids the slow fallback crypto and large-PDF image skips):

```bash
uv add "sharepoint-to-text[pdf-crypto]"
```

Or install from source:

```bash
git clone https://github.com/Horsmann/sharepoint-to-text.git
cd sharepoint-to-text
uv sync --all-groups
```

## Libraries

### Core Libraries (runtime)

These are required for normal use of the library:

- `charset-normalizer`: Automatic encoding detection for plain text files and MIME type detection fallback
- `defusedxml`: Hardened XML parsing for OOXML/ODF formats
- `mail-parser`: RFC 822 email parsing (`.eml`)
- `msg-parser`: Outlook `.msg` extraction
- `olefile`: OLE2 container parsing for legacy Office formats
- `openpyxl`: `.xlsx` parsing
- `pypdf`: `.pdf` parsing
- `xlrd`: `.xls` parsing

`.7z` archive extraction requires no third-party dependency: it is handled by a built-in pure-Python reader on top of the stdlib `lzma` module.

**Format Detection**: The library primarily uses file extensions for format detection, with MIME type fallback for ambiguous cases. Template formats and variants are automatically mapped to their appropriate extractors.

### Development Libraries

These are only needed for development workflows:

- `pytest`: test runner
- `pre-commit`: linting/format hooks
- `black`: code formatter

### Optional Libraries

These are opt-in extras for specific use cases:

- `pycryptodome`: Faster AES crypto for encrypted PDFs (`pdf-crypto` extra)

## Quick Start

### The Unified Interface

`sharepoint2text.read_file(...)` returns a **generator** of extraction results implementing a common interface. Most formats yield a single item, but some (notably `.mbox`) can yield multiple items.

**Format Detection**: Automatically detects file formats using extensions (primary) and MIME types (fallback). Template formats and variants are automatically mapped to appropriate extractors.

```python
import sharepoint2text

# Works identically for ANY supported format
# Most formats yield a single item, so use next() for convenience
for result in sharepoint2text.read_file("document.docx"):  # or .doc, .pdf, .pptx, etc.
    # Methods available on ALL content types:
    text = result.get_full_text()  # Complete text as a single string
    metadata = result.get_metadata()  # File metadata (filename/path; plus format-specific fields when available)

    # Iterate over logical units (varies by format - see below)
    for unit in result.iterate_units():
        print(unit.get_text())

    # Iterate over extracted images
    for image in result.iterate_images():
        print(image)

    # Iterate over extracted tables
    for table in result.iterate_tables():
        print(table)

# For single-item formats, you can use next() directly:
result = next(sharepoint2text.read_file("document.docx"))
print(result.get_full_text())
```

Notes: `ImageInterface` provides `get_bytes()`, `get_content_type()`, `get_caption()`, `get_description()`, and `get_metadata()` (unit index, image index, content type, width, height). `TableInterface` provides `get_table()` (rows as lists) and `get_dim()` (rows, columns).
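
For example, the content type reported by `get_content_type()` can drive file naming when persisting images. A sketch; `image_filename` is a hypothetical helper, not part of the API:

```python
import mimetypes

def image_filename(stem: str, index: int, content_type: str) -> str:
    # Map the reported MIME type to a file extension; fall back to .bin.
    ext = mimetypes.guess_extension(content_type) or ".bin"
    return f"{stem}_{index}{ext}"

# Typical use when persisting extracted images:
# result = next(sharepoint2text.read_file("deck.pptx"))
# for i, image in enumerate(result.iterate_images()):
#     with open(image_filename("deck", i, image.get_content_type()), "wb") as f:
#         f.write(image.get_bytes())
```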

Most results also expose **format-specific structured fields** (e.g. `PdfContent.pages`, `PptxContent.slides`, `XlsxContent.sheets`) in addition to the common interface—see **Return Types** below.

### JSON Output (`to_json()`)

All extraction results support `to_json()` for a JSON-serializable representation of the extracted data (including nested dataclasses).

```python
import json
import sharepoint2text

result = next(sharepoint2text.read_file("document.docx"))
print(json.dumps(result.to_json()))
```

To restore objects from JSON, use `ExtractionInterface.from_json(...)`.

```python
from sharepoint2text.parsing.extractors.data_types import ExtractionInterface

restored = ExtractionInterface.from_json(result.to_json())
```

### Understanding `iterate_units()` Output by Format

Different file formats have different natural structural units:

| Format | `iterate_units()` yields | Notes |
|--------|-------------------------|-------|
| `.docx`, `.doc`, `.odt` | 1 item (full text) | Word/text documents have no page structure in the file format |
| `.xlsx`, `.xls`, `.ods` | 1 item per **sheet** | Each yield contains sheet content |
| `.pptx`, `.ppt`, `.odp` | 1 item per **slide** | Each yield contains slide text |
| `.pdf` | 1 item per **page** | Each yield contains page text |
| `.eml`, `.msg` | 1 item (email body) | Plain text or HTML body |
| `.mbox` | 1 item per **email** | Mailboxes can contain multiple emails |
| `.txt`, `.csv`, `.json`, `.tsv` | 1 item (full content) | Single unit |

**Note on Word documents:** The `.doc` and `.docx` file formats do not store page boundaries—pages are a rendering artifact determined by fonts, margins, and printer settings. The library returns the full document as a single text unit.

**Note on generators:** All extractors return generators. Most formats yield a single content object, but `.mbox` files can yield multiple `EmailContent` objects (one per email in the mailbox). Use `next()` for single-item formats or iterate with `for` to handle all cases.

#### Skipping Images in Unit Iteration

When you don't need image data in your units, pass `ignore_images=True` to `iterate_units()` to improve performance and reduce memory usage:

```python
result = next(sharepoint2text.read_file("document.docx"))

# Default behavior: includes images in units
for unit in result.iterate_units():
    print(f"Text: {unit.get_text()}")
    print(f"Images: {len(unit.get_images())}")  # May contain images

# Skip images: units will have empty image lists
for unit in result.iterate_units(ignore_images=True):
    print(f"Text: {unit.get_text()}")
    print(f"Images: {len(unit.get_images())}")  # Always 0
```

This is useful when:
- You only need text content and don't want the overhead of image processing
- You're processing documents in a text-only pipeline (e.g., full-text search indexing)
- You want to reduce memory usage for large documents with many embedded images

**Note:** This only affects unit iteration. `result.iterate_images()` still yields all images regardless of this flag.

### Choosing Between `get_full_text()` and `iterate_units()`

The interface provides two methods for accessing text content, and **you must decide which is appropriate for your use case**:

| Method | Returns | Best for |
|--------|---------|----------|
| `get_full_text()` | All text as a single string | Simple extraction, full-text search, when structure doesn't matter |
| `iterate_units()` | Yields logical units (pages, slides, sheets) | RAG pipelines, per-unit indexing, preserving document structure |

**For RAG and vector storage:** Consider whether storing pages/slides/sheets as separate chunks with metadata (e.g., page numbers) benefits your retrieval strategy. This allows more precise source attribution when users query your system.

```python
# Option 1: Store entire document as one chunk
result = next(sharepoint2text.read_file("report.pdf"))
store_in_vectordb(text=result.get_full_text(), metadata={"source": "report.pdf"})

# Option 2: Store each page separately with page numbers
result = next(sharepoint2text.read_file("report.pdf"))
for page_num, unit in enumerate(result.iterate_units(), start=1):
    store_in_vectordb(
        text=unit.get_text(),
        metadata={"source": "report.pdf", "page": page_num}
    )
```

**Trade-offs to consider:**
- **Per-unit storage** enables citing specific pages/slides in responses, but creates more chunks
- **Full-text storage** is simpler and may work better for small documents
- **Word documents** (`.doc`, `.docx`) only yield one unit from `iterate_units()` since they lack page structure—for these formats, both methods are equivalent

### Format-Specific Notes on `get_full_text()`

**Template Format Support**: Office template formats (`.dotx`, `.xltx`, `.potx`, etc.) and OpenDocument templates (`.ott`, `.ots`, `.otp`) are automatically processed using their respective base format extractors. Template files behave identically to regular documents during extraction.

`get_full_text()` is intended as a convenient "best default" for each format. In a few formats it intentionally differs from a plain `"\n".join(unit.get_text() for unit in iterate_units())`, or it omits optional content unless you opt in:

| Format | `get_full_text()` default behavior | Not included by default / where to find it |
|--------|------------------------------------|--------------------------------------------|
| `.doc` | Prepends `metadata.title` (if present) and returns main document body | `footnotes`, `headers_footers`, `annotations` are separate fields (`DocContent`) |
| `.docx` | Returns `full_text` (including formulas) | Comments are available on `DocxContent.comments` (not included in `get_full_text()`) |
| `.ppt` | Per-slide `title + body + other` concatenated | Speaker notes live in `slide.notes` (`PptSlideContent`) |
| `.pptx` | Per-slide `base_text` plus formulas concatenated | Pass `include_image_captions` to `PptxContent.get_full_text(...)` (comments are available on `PptxSlide.comments`) |
| `.odp` | Per-slide `text_combined` concatenated | Pass `include_notes/include_annotations` to `OdpContent.get_full_text(...)` |
| `.xls` | Concatenation of sheet `text` blocks (no sheet names) | Sheet names are available as `sheet.name` (`XlsSheet`) |
| `.xlsx`, `.ods` | Includes sheet name + sheet text for each sheet | Images are available via `iterate_images()` / sheet image lists |
| `.pdf` | Concatenation of extracted page text | Tables/images are available via `iterate_tables()` / `iterate_images()` (`PdfContent.pages`) |
| `.eml`, `.msg`, `.mbox` | Returns `body_plain` when present, else `body_html` | Attachments are in `EmailContent.attachments` and can be extracted via `iterate_supported_attachments()` |
| `.txt`, `.csv`, `.tsv`, `.json`, `.md`, `.html` | Returns stripped content (leading/trailing whitespace removed) | Use the raw fields (`.content`) if you need untrimmed text |
| `.rtf` | Returns the extractor's `full_text` when available | `iterate_units()` yields per-page text when explicit `\page` breaks exist |

### Basic Usage Examples

```python
import sharepoint2text

# Extract from any file - format auto-detected (use next() for single-item formats)
result = next(sharepoint2text.read_file("quarterly_report.docx"))
print(result.get_full_text())

# Check format support before processing
if sharepoint2text.is_supported_file("document.xyz"):
    for result in sharepoint2text.read_file("document.xyz"):
        print(result.get_full_text())

# Access metadata
result = next(sharepoint2text.read_file("presentation.pptx"))
meta = result.get_metadata()
print(f"Author: {meta.author}, Modified: {meta.modified}")
print(meta.to_dict())  # Convert to dictionary

# Process emails (mbox can contain multiple emails)
for email in sharepoint2text.read_file("mailbox.mbox"):
    print(f"From: {email.from_email.address}")
    print(f"Subject: {email.subject}")
    print(email.get_full_text())
```

### Working with Structured Content

```python
import sharepoint2text

# Excel: iterate over sheets
result = next(sharepoint2text.read_file("budget.xlsx"))
for sheet in result.sheets:
    print(f"Sheet: {sheet.name}")
    print(f"Rows: {len(sheet.data)}")  # List of row dictionaries
    print(sheet.text)                   # Text representation

# PowerPoint: iterate over slides
result = next(sharepoint2text.read_file("deck.pptx"))
for slide in result.slides:
    print(f"Slide {slide.slide_number}: {slide.title}")
    print(slide.content_placeholders)  # Body text
    print(slide.images)                # Image metadata

# PDF: iterate over pages
result = next(sharepoint2text.read_file("report.pdf"))
for page_num, page in enumerate(result.pages, start=1):
    print(f"Page {page_num}: {page.text[:100]}...")
    print(f"Images: {len(page.images)}")

# Email: access email-specific fields
email = next(sharepoint2text.read_file("message.eml"))
print(f"From: {email.from_email.name} <{email.from_email.address}>")
print(f"To: {', '.join(e.address for e in email.to_emails)}")
print(f"Subject: {email.subject}")
print(f"Body: {email.body_plain or email.body_html}")
```

### Using Format-Specific Extractors with BytesIO

For API responses or in-memory data:

```python
import sharepoint2text
import io

# Direct extractor usage with BytesIO (returns generator, use next() for single items)
with open("document.docx", "rb") as f:
    result = next(sharepoint2text.read_docx(io.BytesIO(f.read()), path="document.docx"))

# Get extractor dynamically based on filename
def extract_from_api(filename: str, content: bytes):
    extractor = sharepoint2text.get_extractor(filename)
    # Returns a generator - iterate or use next()
    return list(extractor(io.BytesIO(content), path=filename))
```

### Archive Processing

The library automatically extracts and processes all supported files within archives:

```python
import sharepoint2text
import io

# Process archives using the main API (auto-detects format by extension)
# Each file in the archive yields a separate result
for result in sharepoint2text.read_file("archive.zip"):
    print(f"Extracted: {result.get_metadata().filename}")
    print(f"Content: {result.get_full_text()[:200]}...")

# 7-Zip archives are supported with size limits (100MB max)
for result in sharepoint2text.read_file("documents.7z"):
    print(f"Extracted: {result.get_metadata().filename}")
    print(f"Content: {result.get_full_text()[:200]}...")

# Process mixed-content archives (documents, images, etc.)
# Every yielded result implements the common extraction interface
archive_results = list(sharepoint2text.read_file("mixed_files.zip"))
print(f"Found {len(archive_results)} extractable documents in archive")

# Or use the direct extractor for BytesIO data
from sharepoint2text.parsing.extractors.archive_extractor import read_archive

with open("archive.zip", "rb") as f:
    for result in read_archive(io.BytesIO(f.read()), path="archive.zip"):
        print(f"Extracted: {result.get_metadata().filename}")
        print(f"Content: {result.get_full_text()[:200]}...")
```

## Limitations / Caveats

### PDF Extraction

- **No OCR support:** This library does not perform optical character recognition. PDFs that consist of scanned images or photos of documents will return empty text. The images themselves are still extracted and available via `iterate_images()`, but no text is derived from them.
- **Table extraction not implemented:** PDF table extraction is not currently implemented. `iterate_tables()` will always yield empty results for PDF files. Tables may appear as part of the page text in `get_full_text()` or `iterate_units()`, but structured table data is not available.
- **Image extraction on large encrypted files:** When a PDF is AES-encrypted and pypdf is running in its fallback crypto provider (i.e., neither `cryptography` nor `pycryptodome` is installed), image extraction is skipped for large files (>= 10MB). Text and tables still extract, but image lists are empty. Install `cryptography` or `pycryptodome` to enable full PDF image extraction without this skip.
- **Password-protected PDFs:** PDFs requiring a non-empty password are rejected with an `ExtractionFileEncryptedError`.
- **JBIG2 image format:** Some PDFs contain images encoded with JBIG2 compression. If you see warnings like `Failed to extract image data: jbig2dec binary is not available`, you can install the `jbig2dec` system binary to enable extraction of these images:
  - **macOS:** `brew install jbig2dec`
  - **Ubuntu/Debian:** `sudo apt-get install jbig2dec`
  - **Other Linux:** Use your package manager (e.g., `yum install jbig2dec`, `pacman -S jbig2dec`)

  Note: This is optional. The warning is harmless if you don't need to extract JBIG2-encoded images.

## CLI

After installation, a `sharepoint2text` command is available. Pass the file path with `--file`; by default the extracted full text is printed to stdout.

```bash
sharepoint2text --file /path/to/file.pdf > extraction.txt
```

### Command Line Options

| Option | Output | Notes |
|---|---|---|
| `--file FILE` | *(required)* | Path to the file to extract. Can be specified in any order relative to other options. |
| `--output FILE`, `-o FILE` | Output file | Write output to file instead of stdout. |
| `--version` | Version info | Show the version and exit. |
| *(default)* | Plain text | Prints `result.get_full_text()`. |
| `--json` | JSON extraction object(s) | Prints `result.to_json()`. Images are ignored by default for faster processing. |
| `--json-unit` | JSON unit list(s) | Prints a JSON list of unit representations using `result.iterate_units()` (e.g., pages/slides/sheets). For multi-item inputs (e.g. `.mbox`), emits a JSON list where each item is that extraction's unit list. Images are ignored by default for faster processing. |
| `--include-images` | Include image data | Extract images and include them as base64-encoded blobs in the JSON output. Only valid with `--json` or `--json-unit`. |

`--json` and `--json-unit` are mutually exclusive.

### Examples

Extract text from a file:

```bash
sharepoint2text --file /path/to/file.pdf > extraction.txt
```

Emit structured JSON output:

```bash
sharepoint2text --file /path/to/file.pdf --json > extraction.json
```

Write to output file instead of stdout:

```bash
sharepoint2text --file /path/to/file.pdf --output extraction.txt
sharepoint2text --file /path/to/file.pdf --json --output extraction.json
```

Include extracted images as base64:

```bash
sharepoint2text --file /path/to/file.pdf --json --include-images > extraction.with-images.json
```

Parameters can be specified in any order:

```bash
sharepoint2text --json --include-images --file /path/to/file.pdf
sharepoint2text --file /path/to/file.pdf --json --include-images
```

## API Reference

### Main Functions

```python
import sharepoint2text

# Read any supported file (recommended entry point)
# Returns a generator - use next() for single-item formats or iterate for all
# Automatically detects format via extension (primary) or MIME type (fallback)
for result in sharepoint2text.read_file(path: str | Path):
    ...

# Check if a file extension is supported
# Uses same detection logic as read_file(): extension-first, MIME fallback
supported = sharepoint2text.is_supported_file(path: str) -> bool

# Get extractor function for a file type
# Returns appropriate extractor based on file extension/MIME type
extractor = sharepoint2text.get_extractor(path: str) -> Callable[[io.BytesIO, str | None], Generator[ContentType, Any, None]]
```

### Format-Specific Extractors

All extractors accept an `io.BytesIO` stream and an optional `path` used to populate file metadata, and all return generators:

```python
sharepoint2text.read_docx(file: io.BytesIO, path: str | None = None) -> Generator[DocxContent, Any, None]
sharepoint2text.read_doc(file: io.BytesIO, path: str | None = None) -> Generator[DocContent, Any, None]
sharepoint2text.read_xlsx(file: io.BytesIO, path: str | None = None) -> Generator[XlsxContent, Any, None]
sharepoint2text.read_xls(file: io.BytesIO, path: str | None = None) -> Generator[XlsContent, Any, None]
sharepoint2text.read_pptx(file: io.BytesIO, path: str | None = None) -> Generator[PptxContent, Any, None]
sharepoint2text.read_ppt(file: io.BytesIO, path: str | None = None) -> Generator[PptContent, Any, None]
sharepoint2text.read_odt(file: io.BytesIO, path: str | None = None) -> Generator[OdtContent, Any, None]
sharepoint2text.read_odp(file: io.BytesIO, path: str | None = None) -> Generator[OdpContent, Any, None]
sharepoint2text.read_ods(file: io.BytesIO, path: str | None = None) -> Generator[OdsContent, Any, None]
sharepoint2text.read_pdf(file: io.BytesIO, path: str | None = None) -> Generator[PdfContent, Any, None]
sharepoint2text.read_plain_text(file: io.BytesIO, path: str | None = None) -> Generator[PlainTextContent, Any, None]
sharepoint2text.read_email__eml_format(file: io.BytesIO, path: str | None = None) -> Generator[EmailContent, Any, None]
sharepoint2text.read_email__msg_format(file: io.BytesIO, path: str | None = None) -> Generator[EmailContent, Any, None]
sharepoint2text.read_email__mbox_format(file: io.BytesIO, path: str | None = None) -> Generator[EmailContent, Any, None]
```

### Return Types

All content types implement the common interface:

```python
class ExtractionInterface(Protocol):
    def iterate_units(self) -> Iterator[UnitInterface]: ...      # Iterate over logical units
    def iterate_images(self) -> Generator[ImageInterface, None, None]: ...
    def iterate_tables(self) -> Generator[TableInterface, None, None]: ...
    def get_full_text(self) -> str: ...                          # Complete text as string
    def get_metadata(self) -> FileMetadataInterface: ...         # Metadata with to_dict()
    def to_json(self) -> dict: ...                                # JSON-serializable representation
    @classmethod
    def from_json(cls, data: dict) -> "ExtractionInterface": ...
```
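Because every content type implements this interface, downstream code can stay format-agnostic. A minimal sketch of that idea; the `FakeUnit`/`FakeResult` classes below are illustrative stand-ins for real extraction results, not part of the library:

```python
def collect_unit_texts(result) -> list[str]:
    """Collect non-empty unit texts from any ExtractionInterface-like object."""
    texts = []
    for unit in result.iterate_units():
        text = unit.get_text().strip()
        if text:
            texts.append(text)
    return texts


# Illustrative stand-ins mimicking the documented interface:
class FakeUnit:
    def __init__(self, text: str):
        self._text = text

    def get_text(self) -> str:
        return self._text


class FakeResult:
    def iterate_units(self):
        yield FakeUnit("Page 1 text")
        yield FakeUnit("   ")  # whitespace-only unit is skipped
        yield FakeUnit("Page 2 text")


print(collect_unit_texts(FakeResult()))  # ['Page 1 text', 'Page 2 text']
```

The same helper works unchanged whether `result` came from a PDF, a presentation, or a spreadsheet.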

#### DocxContent (.docx)

```python
result.metadata       # DocxMetadata (title, author, created, modified, ...)
result.paragraphs     # List[DocxParagraph] (text, style, runs with formatting)
result.tables         # List[List[List[str]]] (cell data)
result.images         # List[DocxImage] (filename, content_type, data, size_bytes)
result.headers        # List[DocxHeaderFooter]
result.footers        # List[DocxHeaderFooter]
result.hyperlinks     # List[DocxHyperlink] (text, url)
result.footnotes      # List[DocxNote] (id, text)
result.endnotes       # List[DocxNote]
result.comments       # List[DocxComment] (author, date, text)
result.sections       # List[DocxSection] (page dimensions, margins)
result.full_text      # str (pre-computed full text)
```
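Since `result.tables` uses a plain nested-list shape (tables → rows → cells), rendering a table as text needs no library support. A small sketch; the sample data is illustrative:

```python
def table_to_text(table: list[list[str]], sep: str = "\t") -> str:
    """Render one table (rows of cell strings) as separator-joined lines."""
    return "\n".join(sep.join(row) for row in table)


# Illustrative sample in the documented shape of result.tables:
tables = [
    [["Name", "Role"], ["Ada", "Engineer"]],
]
print(table_to_text(tables[0]))
```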

#### DocContent (.doc)

```python
result.metadata         # DocMetadata (title, author, num_pages, num_words, num_chars, ...)
result.main_text        # str (main document body)
result.footnotes        # str (concatenated footnotes)
result.headers_footers  # str (concatenated headers/footers)
result.annotations      # str (concatenated annotations)
```

#### XlsxContent / XlsContent (.xlsx, .xls)

```python
result.metadata   # XlsxMetadata / XlsMetadata (title, creator, created, modified, ...)
result.sheets     # List[XlsxSheet / XlsSheet]

# Each sheet:
sheet.name   # str (sheet name)
sheet.data   # List[Dict[str, Any]] (rows as dictionaries)
sheet.text   # str (text representation)
```
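Because `sheet.data` rows are plain dictionaries, they convert directly to tabular text. A sketch assuming the documented row shape; the sample rows are illustrative:

```python
def rows_to_tsv(rows: list[dict]) -> str:
    """Convert rows-as-dictionaries to a TSV string with a header line."""
    if not rows:
        return ""
    headers = list(rows[0].keys())
    lines = ["\t".join(headers)]
    for row in rows:
        lines.append("\t".join(str(row.get(h, "")) for h in headers))
    return "\n".join(lines)


# Illustrative rows in the documented shape of sheet.data:
rows = [{"A": "Name", "B": "Qty"}, {"A": "Widget", "B": 3}]
print(rows_to_tsv(rows))
```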

#### PptxContent (.pptx)

```python
result.metadata   # PptxMetadata (title, author, created, modified, ...)
result.slides     # List[PPTXSlide]

# Each slide:
slide.slide_number          # int (1-indexed)
slide.title                 # str
slide.footer                # str
slide.content_placeholders  # List[str] (body content)
slide.other_textboxes       # List[str] (free-form text)
slide.images                # List[PPTXImage] (filename, content_type, size_bytes, blob)
slide.text                  # str (pre-computed combined text)
```
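The per-slide fields above compose naturally into a text outline. A sketch of that pattern; `StubSlide` is an illustrative stand-in for `PPTXSlide`, not part of the library:

```python
from dataclasses import dataclass


@dataclass
class StubSlide:  # illustrative stand-in for PPTXSlide
    slide_number: int
    title: str
    text: str


def slides_to_outline(slides) -> str:
    """Build a plain-text outline from slide-like objects (slide_number, title, text)."""
    parts = []
    for slide in slides:
        header = f"Slide {slide.slide_number}: {slide.title or '(untitled)'}"
        parts.append(header + "\n" + slide.text)
    return "\n\n".join(parts)


deck = [StubSlide(1, "Intro", "Welcome"), StubSlide(2, "", "Details")]
print(slides_to_outline(deck))
```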

#### PptContent (.ppt)

```python
result.metadata   # PptMetadata (title, author, num_slides, created, modified, ...)
result.slides     # List[PptSlideContent]
result.all_text   # List[str] (flat list of all text)

# Each slide:
slide.slide_number   # int (1-indexed)
slide.title          # str | None
slide.body_text      # List[str]
slide.other_text     # List[str]
slide.notes          # List[str] (speaker notes)
slide.text_combined  # str (property: title + body + other)
slide.all_text       # List[PptTextBlock] (with text_type info)
```

#### OdpContent (.odp)

```python
result.metadata   # OdpMetadata (title, creator, creation_date, generator, ...)
result.slides     # List[OdpSlide]

# Each slide:
slide.slide_number   # int (1-indexed)
slide.name           # str (slide name)
slide.title          # str
slide.body_text      # List[str]
slide.other_text     # List[str]
slide.tables         # List[List[List[str]]] (tables on slide)
slide.annotations    # List[OdpAnnotation] (comments)
slide.images         # List[OdpImage] (embedded images with href, name, data, size_bytes)
slide.notes          # List[str] (speaker notes)
slide.text_combined  # str (property: title + body + other)
```

#### OdsContent (.ods)

```python
result.metadata   # OdsMetadata (title, creator, creation_date, generator, ...)
result.sheets     # List[OdsSheet]

# Each sheet:
sheet.name         # str (sheet name)
sheet.data         # List[Dict[str, Any]] (row data with column keys A, B, C, ...)
sheet.text         # str (tab-separated cell values, newline-separated rows)
sheet.annotations  # List[OdsAnnotation] (cell comments)
sheet.images       # List[OdsImage] (embedded images)
```

#### PdfContent (.pdf)

```python
result.metadata    # PdfMetadata (total_pages)
result.pages       # List[PdfPage]

# Each page:
page.text    # str
page.images  # List[PdfImage] (index, name, width, height, data, format)
page.tables  # List[List[List[str]]]
```
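Per-page access makes simple lookups like "which pages mention X" straightforward. A sketch; `StubPage` is an illustrative stand-in for `PdfPage`, not part of the library:

```python
from dataclasses import dataclass


@dataclass
class StubPage:  # illustrative stand-in for PdfPage
    text: str


def find_pages_containing(pages, needle: str) -> list[int]:
    """Return 1-indexed page numbers whose text contains `needle` (case-insensitive)."""
    needle = needle.lower()
    return [i for i, page in enumerate(pages, start=1) if needle in page.text.lower()]


doc = [StubPage("Alpha report"), StubPage("Beta summary"), StubPage("alpha appendix")]
print(find_pages_containing(doc, "alpha"))  # [1, 3]
```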

#### PlainTextContent (.txt, .csv, .json, .tsv)

```python
result.content   # str (full file content)
result.metadata  # FileMetadataInterface (filename, file_extension, file_path, folder_path)
```

#### EmailContent (.eml, .msg, .mbox)

```python
result.from_email    # EmailAddress (name, address)
result.to_emails     # List[EmailAddress]
result.to_cc         # List[EmailAddress]
result.to_bcc        # List[EmailAddress]
result.reply_to      # List[EmailAddress]
result.subject       # str
result.in_reply_to   # str (message ID of parent email)
result.body_plain    # str (plain text body)
result.body_html     # str (HTML body)
result.metadata      # EmailMetadata (date, message_id, plus file metadata)

# EmailAddress structure:
email.name     # str (display name)
email.address  # str (email address)
```

#### HtmlContent (.html, .htm)

```python
result.content   # str (plain text content)
result.tables    # List[List[List[str]]] (table cell values)
result.headings  # List[Dict[str, str]] (level/text)
result.links     # List[Dict[str, str]] (text/href)
result.metadata  # HtmlMetadata (title, language, charset, ...)
```
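The `headings` list of level/text dictionaries maps directly onto an indented outline. A sketch using illustrative sample data; the exact encoding of `level` (e.g. `"2"` vs `"h2"`) is an assumption here, and the helper accepts either:

```python
def headings_to_outline(headings: list[dict]) -> str:
    """Render level/text heading dicts as an indented bullet outline."""
    lines = []
    for h in headings:
        # Accept either "2" or "h2" as the level encoding (assumption).
        level = int(str(h["level"]).lower().lstrip("h"))
        lines.append("  " * (level - 1) + "- " + h["text"])
    return "\n".join(lines)


# Illustrative sample in the documented shape of result.headings:
headings = [{"level": "1", "text": "Intro"}, {"level": "2", "text": "Scope"}]
print(headings_to_outline(headings))
```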

## Examples

### Bulk Processing

```python
import sharepoint2text
from pathlib import Path

def extract_all_documents(folder: Path) -> dict[str, list[str]]:
    """Extract text from all supported files in a folder."""
    results = {}

    for file_path in folder.rglob("*"):
        if sharepoint2text.is_supported_file(str(file_path)):
            try:
                # Collect all content from the generator (handles mbox with multiple emails)
                texts = [result.get_full_text() for result in sharepoint2text.read_file(file_path)]
                results[str(file_path)] = texts
            except Exception as e:
                print(f"Failed to extract {file_path}: {e}")

    return results
```

### Extract Images

```python
import sharepoint2text

# From PDF
result = next(sharepoint2text.read_file("document.pdf"))
for page_num, page in enumerate(result.pages, start=1):
    for img in page.images:
        with open(f"page{page_num}_{img.name}.{img.format}", "wb") as out:
            out.write(img.data)

# From PowerPoint
result = next(sharepoint2text.read_file("slides.pptx"))
for slide in result.slides:
    for img in slide.images:
        with open(img.filename, "wb") as out:
            out.write(img.blob)

# From Word
result = next(sharepoint2text.read_file("document.docx"))
for img in result.images:
    if img.data:
        with open(img.filename, "wb") as out:
            out.write(img.data.getvalue())
```

### Email Processing

```python
import sharepoint2text

# Process a single email file (.eml or .msg)
email = next(sharepoint2text.read_file("message.eml"))
print(f"From: {email.from_email.name} <{email.from_email.address}>")
print(f"Subject: {email.subject}")
print(f"Date: {email.metadata.date}")
print(f"Body:\n{email.body_plain}")

# Process a mailbox with multiple emails (.mbox)
for i, email in enumerate(sharepoint2text.read_file("archive.mbox")):
    print(f"\n--- Email {i + 1} ---")
    print(f"From: {email.from_email.address}")
    print(f"To: {', '.join(e.address for e in email.to_emails)}")
    print(f"Subject: {email.subject}")
    if email.to_cc:
        print(f"CC: {', '.join(e.address for e in email.to_cc)}")
```

### RAG Pipeline Integration

```python
import sharepoint2text


def prepare_for_rag(file_path: str) -> list[dict]:
    """Prepare document chunks for RAG ingestion."""
    chunks = []

    # Handle all content items from the generator
    for result in sharepoint2text.read_file(file_path):
        meta = result.get_metadata()

        for i, unit in enumerate(result.iterate_units()):
            if unit.get_text().strip():  # Skip empty units
                chunks.append({
                    "text": unit.get_text(),
                    "metadata": {
                        "source": file_path,
                        "chunk_index": i,
                        "author": getattr(meta, "author", None),
                        "title": getattr(meta, "title", None),
                    }
                })
    return chunks
```

### SharePoint Integration

This integration is optional; you can use `sharepoint2text` with any storage backend. When using `sharepoint_io`, you still orchestrate download and extraction (as shown below).

```python
import io
from datetime import datetime, timedelta, timezone

import sharepoint2text
from sharepoint2text.sharepoint_io import (
    EntraIDAppCredentials,
    FileFilter,
    SharePointRestClient,
)

# Configure SharePoint access
credentials = EntraIDAppCredentials(
    tenant_id="your-tenant-id",
    client_id="your-client-id",
    client_secret="your-client-secret",
)
client = SharePointRestClient(
    site_url="https://contoso.sharepoint.com/sites/Documents",
    credentials=credentials,
)

# Delta sync: process files modified in the last 7 days
one_week_ago = datetime.now(timezone.utc) - timedelta(days=7)
file_filter = FileFilter(
    modified_after=one_week_ago,
    extensions=[".docx", ".pdf", ".pptx"],
)

for file_meta in client.list_files_filtered(file_filter):
    # Download and extract
    content = client.download_file(file_meta.id)
    extractor = sharepoint2text.get_extractor(file_meta.name)

    for result in extractor(io.BytesIO(content), path=file_meta.name):
        print(f"File: {file_meta.get_full_path()}")
        print(f"Text: {result.get_full_text()[:200]}...")
```

## Exceptions

- `ExtractionFileFormatNotSupportedError`: Raised when no extractor exists for a given file type (e.g., unsupported extension/MIME mapping in the router).
- `ExtractionFileEncryptedError`: Raised when an extractor detects encryption or password protection (e.g., encrypted PDF, OOXML/ODF password-protected files, legacy Office with FILEPASS/encryption flags).
- `ExtractionFileTooLargeError`: Raised when a file exceeds the maximum allowed size for extraction (e.g., 7z archives larger than 100MB).
- `LegacyMicrosoftParsingError`: Raised when legacy Office parsing fails for non-encryption reasons (corrupt OLE streams, invalid headers, or unsupported legacy variations).

## License

Apache 2.0 - see [LICENSE](LICENSE) for details.

## Disclaimer
This project is not affiliated with, endorsed by, or sponsored by Microsoft.
