Metadata-Version: 2.4
Name: extracta-ade
Version: 0.1.0
Summary: Agentic document extraction CLI -- layout-aware, context-threaded extraction for PDF, PPTX, DOCX, HTML
Author-email: Swapnil Bhattacharya <swapnil.bhatt17@outlook.com>
License: MIT
Project-URL: Repository, https://github.com/NorthCommits/extracta-client
Keywords: document,extraction,agentic,pdf,pptx,docx,layout,reading-order,nlp,cli
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Text Processing
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Environment :: Console
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: typer>=0.12.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: rich>=13.7.0

# extracta

> Agentic Document Extraction CLI -- layout-aware, context-threaded extraction for PDF, PPTX, DOCX, and HTML.

[![PyPI version](https://badge.fury.io/py/extracta-ade.svg)](https://pypi.org/project/extracta-ade/)
[![Python](https://img.shields.io/badge/python-3.10%2B-blue)](https://www.python.org/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

---

## What is Extracta?

Extracta is a command-line client that uploads your documents to an Extracta server for agentic extraction. The pipeline goes beyond pulling raw text -- it:

- Detects layout per page (single column, multi-column, mixed, table-heavy, image-heavy)
- Determines the correct reading order (H-Major or V-Major) using projection profile analysis and Recursive XY-Cut
- Threads context between segments using an LLM agent (ADE -- Agentic Document Extraction)
- Outputs structured JSON with full context metadata per region

---

## Installation

```bash
pip install extracta-ade
```

---

## Usage

```bash
extracta-extract path/to/your/document.pdf
```

The output is written to the same directory as the input file, named after it with an `_extracted.json` suffix: `document.pdf` produces `document_extracted.json`.

### Supported Formats

| Format     | Extension       |
|------------|-----------------|
| PDF        | `.pdf`          |
| PowerPoint | `.pptx`         |
| Word       | `.docx`         |
| HTML       | `.html` / `.htm`|
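
When calling the CLI from a script, it can help to check the extension and work out the output path up front. The following is a minimal sketch, not part of the package: `detect_format` and `expected_output_path` are hypothetical helpers, and every format name other than `"pdf"` (which appears in the example output) is an assumption.

```python
from pathlib import Path

# Extensions come from the table above. The format names on the right are
# assumptions, except "pdf", which matches the "format" field in the example output.
SUPPORTED = {
    ".pdf": "pdf",
    ".pptx": "pptx",
    ".docx": "docx",
    ".html": "html",
    ".htm": "html",
}


def detect_format(path: str) -> str:
    """Return the format name for a path, or raise ValueError if unsupported."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED:
        raise ValueError(f"unsupported extension: {suffix or '(none)'}")
    return SUPPORTED[suffix]


def expected_output_path(path: str) -> Path:
    """Path of the JSON file written next to the input file."""
    p = Path(path)
    return p.with_name(f"{p.stem}_extracted.json")


print(detect_format("slides/deck.PPTX"))
print(expected_output_path("docs/report.pdf").name)
```

The lookup is case-insensitive here, which may or may not match how the server treats extensions.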

---

## Example Output

```json
{
  "file": "report.pdf",
  "format": "pdf",
  "total_pages": 3,
  "pages": [
    {
      "page_number": 1,
      "layout_type": "multi_col",
      "strategy": "v_major",
      "regions": [
        {
          "region_id": "p1_r1",
          "type": "title",
          "text": "Efficacy in Treatment-Naive Patients",
          "bbox": { "x0": 50.0, "y0": 40.0, "x1": 540.0, "y1": 65.0 },
          "sequence": 1,
          "context_thread_id": "thread_001",
          "context_role": "heading",
          "continues_on_page": null,
          "references_region": null
        }
      ],
      "full_text": "Efficacy in Treatment-Naive Patients\n\nIn clinical trials..."
    }
  ]
}
```
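
To consume this output in Python, one option is to group regions by `context_thread_id` so that each thread can be read as a unit. This is a sketch under two assumptions: that `sequence` orders regions within a page, and that every region carries the fields shown above. `regions_by_thread` is a hypothetical helper, and the second region in the sample is invented for illustration.

```python
from collections import defaultdict


def regions_by_thread(doc: dict) -> dict:
    """Map each context_thread_id to its regions, in page then sequence order."""
    threads = defaultdict(list)
    for page in sorted(doc["pages"], key=lambda p: p["page_number"]):
        for region in sorted(page["regions"], key=lambda r: r["sequence"]):
            threads[region["context_thread_id"]].append(region)
    return dict(threads)


# A trimmed-down document in the shape shown above (the body region is made up).
sample = {
    "file": "report.pdf",
    "pages": [
        {
            "page_number": 1,
            "regions": [
                {"region_id": "p1_r2", "sequence": 2, "text": "In clinical trials...",
                 "context_thread_id": "thread_001", "context_role": "body"},
                {"region_id": "p1_r1", "sequence": 1,
                 "text": "Efficacy in Treatment-Naive Patients",
                 "context_thread_id": "thread_001", "context_role": "heading"},
            ],
        }
    ],
}

for thread_id, regions in regions_by_thread(sample).items():
    print(thread_id, [r["region_id"] for r in regions])
```

For a real file, load it first with `json.loads(Path("report_extracted.json").read_text(encoding="utf-8"))`.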

---

## Terminal Output

```
╭──────────────────────────────────────────╮
│  Extracta -- Agentic Document Extraction  │
╰──────────────────────────────────────────╯
  File   : report.pdf
  Server : http://localhost:8000

  Analysing layout...

  Page   Layout        Strategy    Regions
  1      multi_col     V-Major     12
  2      single_col    H-Major     8
  3      mixed         V-Major     15

  Running ADE context threading...

  Done -- 3 pages | 35 regions

  Output : report_extracted.json
```

---

## How It Works

```
File Uploaded
     ↓
[DETECT]   -- scan all pages, determine H-Major or V-Major per page
     ↓
[EXTRACT]  -- extract blocks in natural reading order using Recursive XY-Cut
     ↓
[ADE]      -- LLM agent threads context, links segments, assigns roles
     ↓
JSON Output
```
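
The EXTRACT step names Recursive XY-Cut. The server's implementation is not shown here, so the following is only a textbook-style sketch of the idea: split the page at whitespace gaps, alternating between horizontal bands and vertical columns, until no clean cut remains. The `min_gap` threshold, the choice to try horizontal cuts first, and the function names are all assumptions.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), y grows downward


def _split(boxes: List[Box], lo: int, hi: int, min_gap: float) -> List[List[Box]]:
    """Group boxes along one axis, starting a new group at each gap >= min_gap."""
    ordered = sorted(boxes, key=lambda b: b[lo])
    groups, current, reach = [], [ordered[0]], ordered[0][hi]
    for b in ordered[1:]:
        if b[lo] - reach >= min_gap:
            groups.append(current)
            current, reach = [b], b[hi]
        else:
            current.append(b)
            reach = max(reach, b[hi])
    groups.append(current)
    return groups


def xy_cut(boxes: List[Box], min_gap: float = 10.0) -> List[Box]:
    """Return boxes in reading order using recursive XY-cut."""
    if len(boxes) <= 1:
        return list(boxes)
    # First try to cut the region into bands stacked top to bottom.
    rows = _split(boxes, 1, 3, min_gap)
    if len(rows) > 1:
        return [b for group in rows for b in xy_cut(group, min_gap)]
    # Otherwise try to cut it into columns, read left to right.
    cols = _split(boxes, 0, 2, min_gap)
    if len(cols) > 1:
        return [b for group in cols for b in xy_cut(group, min_gap)]
    # No clean cut: fall back to top-to-bottom, then left-to-right.
    return sorted(boxes, key=lambda b: (b[1], b[0]))


title = (50.0, 40.0, 540.0, 65.0)
left1, left2 = (50.0, 80.0, 280.0, 200.0), (50.0, 215.0, 280.0, 400.0)
right1, right2 = (310.0, 80.0, 540.0, 180.0), (310.0, 195.0, 540.0, 400.0)
order = xy_cut([right2, left1, title, right1, left2])
print(order == [title, left1, left2, right1, right2])
```

In this example the full-width title is cut off as its own band, and the two columns below it are then read one after the other.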

### Layout Types

| Layout Type  | Description                                      |
|--------------|--------------------------------------------------|
| single_col   | Simple single column document                    |
| multi_col    | Two or more columns (e.g. academic papers)       |
| mixed        | Complex irregular layout (e.g. pharma slides)    |
| table_heavy  | Majority of content is tabular                   |
| image_heavy  | Majority of content is images                    |

### Reading Strategies

| Strategy | Description                                                  |
|----------|--------------------------------------------------------------|
| V-Major  | Vertical-first -- top to bottom within each column           |
| H-Major  | Horizontal-first -- left to right across each row            |
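
How a page ends up with one strategy or the other is described here only as "projection profile analysis". One plausible reading, sketched below, is to project the region boxes onto the x-axis and look for an empty vertical gutter: if one exists, the page has columns and V-Major applies. The threshold, the bin width, and the function names are assumptions, and the server's actual rule may differ.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def projection_profile(boxes: List[Box], axis: int, size: float) -> List[int]:
    """Count how many boxes cover each 1-unit bin along an axis (0 = x, 1 = y)."""
    n_bins = int(size) + 1
    profile = [0] * n_bins
    for b in boxes:
        start, end = int(b[axis]), int(b[axis + 2])
        for i in range(max(start, 0), min(end, n_bins - 1) + 1):
            profile[i] += 1
    return profile


def interior_gaps(profile: List[int], min_bins: int) -> int:
    """Count empty runs of at least min_bins that have content on both sides."""
    occupied = [i for i, v in enumerate(profile) if v > 0]
    if not occupied:
        return 0
    gaps, run = 0, 0
    for v in profile[occupied[0]:occupied[-1] + 1]:
        if v == 0:
            run += 1
        else:
            if run >= min_bins:
                gaps += 1
            run = 0
    return gaps


def choose_strategy(boxes: List[Box], page_width: float, min_gap: int = 10) -> str:
    """Pick 'v_major' when the x-profile shows a gutter, else 'h_major'."""
    x_profile = projection_profile(boxes, axis=0, size=page_width)
    return "v_major" if interior_gaps(x_profile, min_gap) > 0 else "h_major"


two_columns = [(50.0, 80.0, 280.0, 400.0), (310.0, 80.0, 540.0, 400.0)]
one_column = [(50.0, 80.0, 540.0, 400.0)]
print(choose_strategy(two_columns, page_width=612.0))
print(choose_strategy(one_column, page_width=612.0))
```

A full-width title would fill the gutter in this profile and hide the columns, so a real detector would likely apply the check per horizontal band rather than to the whole page.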

### Context Roles

| Role         | Description                                          |
|--------------|------------------------------------------------------|
| heading      | Section title or heading                             |
| body         | Main body paragraph                                  |
| callout      | Sidebar, highlighted box, callout                    |
| caption      | Image or table caption                               |
| footnote     | Footer or footnote text                              |
| continuation | Continues directly from a previous block             |
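
One use for these roles is stitching split paragraphs back together. The sketch below folds each `continuation` region into the preceding block of the same thread. It assumes that regions arrive in reading order and that a continuation shares its `context_thread_id` with the block it continues, neither of which is confirmed here. `merge_continuations` is a hypothetical helper.

```python
from typing import Dict, List


def merge_continuations(regions: List[dict]) -> List[dict]:
    """Fold 'continuation' regions into the previous block of the same thread."""
    merged: List[dict] = []
    last_in_thread: Dict[str, int] = {}  # thread id -> index into merged
    for region in regions:
        thread = region["context_thread_id"]
        if region["context_role"] == "continuation" and thread in last_in_thread:
            target = merged[last_in_thread[thread]]
            target["text"] = target["text"] + " " + region["text"]
        else:
            # Copy so the caller's region dicts are left untouched.
            merged.append(dict(region))
            last_in_thread[thread] = len(merged) - 1
    return merged


# Invented regions for illustration only.
regions = [
    {"context_thread_id": "thread_001", "context_role": "heading", "text": "Efficacy"},
    {"context_thread_id": "thread_001", "context_role": "body", "text": "In clinical trials,"},
    {"context_thread_id": "thread_002", "context_role": "callout", "text": "Key finding"},
    {"context_thread_id": "thread_001", "context_role": "continuation",
     "text": "response rates improved."},
]

for block in merge_continuations(regions):
    print(block["context_role"], "|", block["text"])
```

A continuation with no earlier block in its thread is kept as a standalone region rather than dropped.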

---

## Project Structure

```
extracta-client/
├── extracta/
│   ├── __init__.py
│   ├── cli.py          -- entry point
│   ├── client.py       -- HTTP calls to extracta-server
│   └── display.py      -- rich terminal output
├── pyproject.toml
└── README.md
```

---

## Server

This CLI requires a running instance of `extracta-server`. By default it connects to `http://localhost:8000`.

To use a deployed server, update `SERVER_URL` in `extracta/client.py`.

---

## Publishing to PyPI

```bash
pip install build twine
python -m build
twine upload dist/*
```

---

## License

MIT -- see [LICENSE](LICENSE)

---

## Author

Swapnil Bhattacharya -- [NorthCommits](https://github.com/NorthCommits)
