Metadata-Version: 2.4
Name: pyingestion
Version: 0.5.0b1
Summary: General-purpose document data ingestion library.
License: MIT
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Requires-Dist: pypdf>=6.13.0
Requires-Dist: rich>=13.7.1
Requires-Dist: python-docx>=1.1.0
Requires-Dist: pytesseract>=0.3.10
Requires-Dist: pdf2image>=1.17.0
Requires-Dist: pillow>=10.0.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"

# PyIngestion (Codename: Gaia) — Generalized Document Data Extractor

**PyIngestion** (project codename **Gaia**) is a versatile and robust document data extraction system designed to retrieve structured key-value pair (KVP) records from text and files. It is packaged both as a **programmatic Python library** (`pyingestion`) and a feature-rich **command-line tool (CLI)**.

PyIngestion has a modular architecture that combines fast native text extraction with an extensible parser interface, giving it high speed, high fidelity, and room to support new file formats.

---

## 🚀 Key Features

* **Dual-Purpose Design**:
  * **Programmatic Library**: Integrate the `TransformStream`, built-in or custom `InputStream` components, and observers directly into your own codebase.
  * **Command-Line Interface**: Run parsing pipelines directly from your shell with dynamic dashboards, detailed progress tracking, and configurable execution.
* **Extensible Input Stream (Parser) Architecture**:
  * Fully decoupled document discovery and data extraction. Programmatic users can write and inject custom input streams (e.g., Docx, OCR, XML) by subclassing the abstract `InputStream` class.
* **Fast Native PDF Processing**:
  * Employs fast native layout-based PDF text extraction (via `pypdf`) as a built-in default input stream.
* **Dynamic Terminal Interface (TUI)**:
  * Real-time metrics rendered via `rich.live`.
  * Live status dashboard featuring counters for processed files, pages, failures, and a progress bar with numerical Estimated Time of Arrival (**ETA**).
* **Robust Session Resume**:
  * Automatically checkpoints progress using a state file (`.gaia_resume.json`). If interrupted, the `--resume` flag lets you pick up right where you left off.
* **Custom Regex Configurations**:
  * Supply custom pattern matching rules via a JSON configuration file.
* **Multi-Page Unit Grouping**:
  * Group multiple pages into a single unit with `--pages-per-unit` so that patterns spanning page boundaries can still match.
* **Internationalization (i18n)**:
  * Complete user interface and message translation support for English (`en`) and Portuguese (`pt`).
* **Graceful Interrupt Handlers**:
  * Supports clean cancellation via `ESC` or `Ctrl+C`, ensuring resources, files, and terminal settings are restored safely.
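
The checkpoint-and-resume idea can be sketched as follows. This is a minimal illustration, not PyIngestion's implementation: the schema of `.gaia_resume.json` is not documented here, so the `done` key and the helper names `save_checkpoint`, `load_checkpoint`, and `pending_files` are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write state atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f)
        os.replace(tmp_path, path)  # atomic when both paths share a filesystem
    except BaseException:
        # Clean up the temp file if anything went wrong (including Ctrl+C)
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or an empty state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"done": []}  # hypothetical schema: list of processed file paths
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def pending_files(all_files: list[str], state: dict) -> list[str]:
    """Files not yet recorded as processed, in their original order."""
    done = set(state["done"])
    return [p for p in all_files if p not in done]
```

Writing to a temporary file and renaming it means an interrupt during the write cannot leave a half-written checkpoint behind.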

---

## 📁 Project Directory Structure

```text
Gaia/
├── pyingestion/
│   ├── __init__.py          # Main entry points exposing library API classes
│   ├── __main__.py          # Main entry point for python -m pyingestion
│   ├── cli/
│   │   ├── __init__.py      # CLI subpackage initialization
│   │   ├── cli_helper.py    # CLI arguments parser and prevalidation helper
│   │   └── terminal_ui.py   # Rich TUI display and keyboard input handling
│   ├── pyingestion.py       # Main global program class (PyIngestion, codename: Gaia)
│   ├── extraction_session.py# Session progress tracking & state serialization
│   ├── options.py           # Config options container class & parameter validations
│   ├── input_stream.py      # Abstract InputStream base, InputStreamType Enum, and InputStreamFactory
│   ├── i18n.py              # Gettext wrappers and language initialization
│   ├── locale/              # Compiled translations directory
│   │   ├── en/LC_MESSAGES/messages.mo
│   │   └── pt/LC_MESSAGES/messages.mo
│   ├── observer.py          # Progress notification interface (observer pattern)
│   ├── output_stream.py     # Output stream interfaces (OutputStream, CsvWriteStream, DefaultOutputStream)
│   ├── parsers.py           # Concrete InputStream implementations (PdfParser, DocxParser, OcrParser)
│   ├── transform_stream.py  # Abstract and concrete TransformStream and RegexEngine implementations
│   └── main.py              # CLI entry point implementation
├── pyproject.toml           # Setuptools PEP 621 packaging definitions
├── requirements.txt         # Package requirements
├── tests/                   # Extensive test suites
└── tools/
    └── linux/
        ├── compile_locales.sh # Compiles Translation Catalog (.po -> .mo)
        └── run_tests.sh       # Script to execute unittest suite
```

---

## 🛠️ Requirements & Installation

### Prerequisites
1. **Python 3.11+** (the package metadata declares `Requires-Python: >=3.11`)

### Environment Setup & Packaging

1. Clone or navigate to the repository:
   ```bash
   cd Trabalho/Gaia
   ```

2. Set up a virtual environment:
   ```bash
   python -m venv .venv
   source .venv/bin/activate
   ```

3. Install the package in editable mode:
   ```bash
   pip install -e .
   ```

---

## 💻 Usage

### 1. As a Python Library

You can integrate PyIngestion directly into your Python scripts.

#### Orchestrating the Full Pipeline Programmatically

To execute the entire extraction pipeline on a file or directory:

```python
from pyingestion import PyIngestion, Options

# 1. Configure options programmatically
options = Options()
options.BASE_PATH = "path/to/pdfs"
options.REGEX_FILE = "path/to/rules.json"
options.OUTPUT_CSV = "custom_output.csv"
options.PAGES_PER_UNIT = 1

# 2. Run the orchestrator
controller = PyIngestion(options)
success = controller.run()
```

#### Creating & Injecting a Custom Input Stream

You can add support for another file format by subclassing the abstract base class `InputStream`:

```python
from typing import Generator
from pyingestion import PyIngestion, Options, InputStream, ExtractionSession

class CustomTxtParser(InputStream):
    def accepts(self, file_path: str) -> bool:
        # Define what files this parser/stream accepts
        return file_path.lower().endswith(".txt")

    def process_file(
        self,
        file_path: str,
        session: ExtractionSession | None = None,
        pages_per_unit: int = 1
    ) -> Generator[tuple[int, int, str], None, None]:
        # Process the file and yield: (unit_index, total_units, content_text)
        with open(file_path, "r", encoding="utf-8") as f:
            content = f.read()
        yield 1, 1, content

# Inject it into PyIngestion orchestrator
options = Options()
options.BASE_PATH = "path/to/text/files"
options.REGEX_FILE = "rules.json"

# Supply your custom input_stream.
# The `...` is a placeholder: replace it with your own transform stream instance.
controller = PyIngestion(options, transform_stream=..., input_stream=CustomTxtParser())
controller.run()
```
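
The parser above ignores `pages_per_unit` because a text file has no pages. For a paginated format, a custom stream would need to group pages itself. The sketch below shows one way to do that in plain Python. `group_pages` is a hypothetical helper, not part of the `pyingestion` API. Joining pages with a newline and numbering units from 1 are assumptions, the latter chosen to match the `yield 1, 1, content` convention in the example above.

```python
from typing import Iterator


def group_pages(
    pages: list[str], pages_per_unit: int = 1
) -> Iterator[tuple[int, int, str]]:
    """Yield (unit_index, total_units, content_text) for grouped pages."""
    if pages_per_unit < 1:
        raise ValueError("pages_per_unit must be >= 1")
    # Ceiling division: a trailing partial group still counts as a unit
    total_units = -(-len(pages) // pages_per_unit)
    for unit_index in range(total_units):
        start = unit_index * pages_per_unit
        chunk = pages[start:start + pages_per_unit]
        # Assumption: pages are joined with a newline; units are 1-based
        yield unit_index + 1, total_units, "\n".join(chunk)


for unit in group_pages(["page one", "page two", "page three"], pages_per_unit=2):
    print(unit)
```

With three pages and `pages_per_unit=2`, the last unit holds only the remaining page.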

#### Using Input Stream and Engine Components Directly

To parse files manually and match patterns page-by-page:

```python
from pyingestion import PdfParser, NativeRegexEngine

# 1. Set up the regex engine with in-memory rules (a dictionary)
regex_rules = {
    "infraction_id": {
        "regex": r"Código da Infração:\s*([A-Za-z0-9-]+)",
        "required": True
    },
    "plate": {
        "regex": r"Placa:\s*([A-Z]{3}-?\d[A-Z0-9]\d{2})",
        "required": True
    }
}
engine = NativeRegexEngine(regex_rules)

# Alternatively, load rules from a JSON file path:
# engine = NativeRegexEngine.from_file("path/to/rules.json")

# 2. Set up the input stream
input_stream = PdfParser()

# 3. Process files programmatically
# The input stream yields raw text segments for each page/unit.
# You then parse it using the engine.
for unit_index, total_units, raw_text in input_stream.process_file("path/to/infraction.pdf", pages_per_unit=1):
    record = engine.transform(raw_text)
    print("Parsed Record:", record)
```

---

### 2. Command-Line Interface (CLI)

PyIngestion can be run as a global shell command, as a Python module, or as a local script.

```bash
# 1. As a global command (after package installation)
pyingestion <input_dir> [options]

# 2. As a Python module (from the repository root)
python -m pyingestion <input_dir> [options]
```

#### Positional Arguments
* `<input_dir>`: Path to the directory containing files to process.

#### Options
* `-o`, `--output` `<path>`: Custom output CSV file path (Default: `output.csv` in your working directory).
* `-g`, `--regex` `<path>`: Path to a JSON file containing customized regex extraction rules.
* `-r`, `--recursive`: Search for files recursively within subdirectories.
* `--resume`: Resume processing using checkpoint data from `.gaia_resume.json`.
* `-t`, `--test` `<file_path>`: Test your regex rules on the first page of the provided file.
* `-p`, `--pages-per-unit` `<int>`: The number of pages/chunks grouped together as a single block for extraction matching (Default: `1`).
* `-l`, `--lang` `{"en", "pt"}`: Force the interface language to English or Portuguese (Default: `en`).
* `--type` `{"pdf"}`: Define the built-in parser type to use (Default: `pdf`).
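
For readers who want to see how these flags fit together, the option list above can be mirrored with the standard `argparse` module. This is a sketch, not the project's `cli_helper.py`: the use of `argparse` is assumed, and `<input_dir>` is made optional here only so that the `-t` test mode can be invoked without it, which is also an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="pyingestion")
    # Assumption: optional positional, so `-t` can be used without a directory
    parser.add_argument("input_dir", nargs="?", help="directory containing files to process")
    parser.add_argument("-o", "--output", default="output.csv", help="output CSV path")
    parser.add_argument("-g", "--regex", help="JSON file with regex extraction rules")
    parser.add_argument("-r", "--recursive", action="store_true", help="search subdirectories")
    parser.add_argument("--resume", action="store_true", help="resume from checkpoint data")
    parser.add_argument("-t", "--test", metavar="FILE_PATH", help="test rules on one file")
    parser.add_argument("-p", "--pages-per-unit", type=int, default=1, help="pages per unit")
    parser.add_argument("-l", "--lang", choices=["en", "pt"], default="en", help="UI language")
    parser.add_argument("--type", choices=["pdf"], default="pdf", help="built-in parser type")
    return parser


args = build_parser().parse_args(["/path/to/pdfs", "-g", "rules.json", "-p", "2"])
print(args.input_dir, args.regex, args.pages_per_unit)
```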

#### Examples

* **Basic processing run**:
  ```bash
  pyingestion /path/to/pdfs -g rules.json
  ```

* **Resume an interrupted run**:
  ```bash
  pyingestion /path/to/pdfs --resume
  ```

* **Test matching logic on a single file**:
  ```bash
  pyingestion -t sample.pdf -g rules.json
  ```
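
The `rules.json` file used in these examples can be generated from Python. The sketch below assumes the JSON file has the same structure as the in-memory rules dictionary shown in the library section. That mapping is an assumption, and `write_rules` is a hypothetical helper rather than part of the package.

```python
import json
import re


def write_rules(path: str, rules: dict[str, dict]) -> None:
    """Validate each pattern, then write the rules as UTF-8 JSON."""
    for field, rule in rules.items():
        re.compile(rule["regex"])  # raises re.error early on an invalid pattern
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps accented characters readable in the file
        json.dump(rules, f, ensure_ascii=False, indent=2)


rules = {
    "infraction_id": {
        "regex": r"Código da Infração:\s*([A-Za-z0-9-]+)",
        "required": True,
    },
    "plate": {
        "regex": r"Placa:\s*([A-Z]{3}-?\d[A-Z0-9]\d{2})",
        "required": True,
    },
}

write_rules("rules.json", rules)
```

Compiling each pattern before writing surfaces regex syntax errors at authoring time instead of in the middle of a long run.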

---

## 🧪 Testing and Tools

### Running the Test Suite
The unit and integration tests validate CLI logic, parser fallbacks, observers, and settings parsing.
```bash
./tools/linux/run_tests.sh
```

### Compiling Localization Catalogs
To recompile updated translation catalogs (`.po`) into gettext binary files (`.mo`):
```bash
./tools/linux/compile_locales.sh
```
