Metadata-Version: 2.4
Name: media_intelligence
Version: 0.1.2
Summary: Abstract Intelligence Platform — a unified, layered pipeline that turns raw media (PDFs, images, video) into structured, searchable, SEO-ready data.
Author: AbstractEndeavors
Keywords: ocr,pdf,video,transcription,summarization,seo,media,pipeline
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Provides-Extra: core
Requires-Dist: abstract_essentials; extra == "core"
Provides-Extra: ingest
Requires-Dist: abstract_webtools; extra == "ingest"
Provides-Extra: ocr
Requires-Dist: abstract_ocr; extra == "ocr"
Provides-Extra: documents
Requires-Dist: abstract_pdfs; extra == "documents"
Provides-Extra: video
Requires-Dist: abstract_videos; extra == "video"
Provides-Extra: transcribe
Requires-Dist: hugpy>=0.1.44; extra == "transcribe"
Provides-Extra: enrich
Requires-Dist: hugpy>=0.1.44; extra == "enrich"
Provides-Extra: publish
Requires-Dist: abstract_react; extra == "publish"
Requires-Dist: abstract_nginx; extra == "publish"
Provides-Extra: bridge
Requires-Dist: flask; extra == "bridge"
Provides-Extra: all
Requires-Dist: abstract_webtools; extra == "all"
Requires-Dist: abstract_ocr; extra == "all"
Requires-Dist: abstract_pdfs; extra == "all"
Requires-Dist: abstract_videos; extra == "all"
Requires-Dist: hugpy>=0.1.44; extra == "all"
Requires-Dist: abstract_react; extra == "all"
Requires-Dist: abstract_nginx; extra == "all"

# media_intelligence — Abstract Intelligence Platform

A unified, layered facade that turns raw media — **PDFs, images, and video** —
into **structured, searchable, SEO-ready data**. It does not reimplement any
engine: for each capability it picks one canonical function from the sibling
package that owns it and exposes it behind a single, lazily loaded API, plus an
orchestrated pipeline.

```text
Raw Media (PDF / Image / Video / URL)
   │
   ▼
ingest  → extract → structure → enrich → persist → publish
(webtools) (ocr/    (typed     (hugpy)  (FS / DB) (react/
            pdfs/    metadata)                      nginx)
            videos)
```

## Layers → canonical owners

| Layer        | Owner package        | What it does                                   |
|--------------|----------------------|------------------------------------------------|
| `ingest`     | `abstract_webtools`  | scrape pages, download video (yt-dlp/ffmpeg)   |
| `ocr`        | `abstract_ocr`       | layout-aware, multi-engine OCR                 |
| `documents`  | `abstract_pdfs`      | PDF decomposition + manifests + HTML           |
| `video`      | `abstract_videos`    | registry pipeline: download/frames/transcribe  |
| `transcribe` | `hugpy` (→ `abstract_ocr` fallback) | Whisper speech-to-text          |
| `enrich`     | `hugpy`              | summaries, keywords, vision captioning, SEO    |
| `persist`    | filesystem (DB-pluggable) | typed JSON/JSONB manifests                |
| `publish`    | `abstract_react` + `abstract_nginx` | SEO/OG metadata + static HTML   |

Overlapping capabilities are resolved to **one owner** (Whisper → `hugpy`;
video download → `abstract_webtools`; summarize/keywords → `hugpy`).

## Install

`media_intelligence` is *just this `src/` facade* — it contains none of the
engines. Each layer's owner is its own PyPI package, declared as an **optional
extra**, so you install only what you use:

```bash
pip install media_intelligence              # zero third-party deps — facade only
pip install "media_intelligence[ocr,enrich]"  # just those layers
pip install "media_intelligence[all]"       # the full platform
```

The package has **no required third-party dependencies**: importing it is cheap
(~20 ms) and pulls **none** of the backing packages. Each sibling is imported
**lazily**, only when its layer is actually called; a missing one raises a clear
`MissingDependency` naming the extra to install.

Check which layers are usable in the current environment without importing any backing package:

```python
import media_intelligence as mi
mi.available()            # {'ingest': True, 'ocr': True, 'publish': False, ...}
mi.available("enrich")    # True / False
```

## Usage

### Direct namespace access

```python
import media_intelligence as mi

text = mi.ocr.image_to_text("page.png")
kw   = mi.enrich.keywords(text)
mi.documents.process_pdf("doc.pdf")
mi.ingest.download_video("https://site.com/v.mp4", download_directory="/data")
```

### Orchestrated pipeline (idempotent + resumable)

```python
from media_intelligence import MediaPipeline

pipe = MediaPipeline("https://site.com/video.mp4", out_root="/data")
pipe.ingest().extract().structure().enrich().persist().publish()
print(pipe.report.summary)
#   ... or simply:
pipe.run()
```

The pipeline autodetects media kind, dispatches each stage accordingly, skips
stages already satisfied (idempotent), and rehydrates from a prior manifest on
re-run (resumable). Results land in `out_root/<media_id>/manifest.json`.
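
The idempotent and resumable behaviour can be illustrated with a toy pipeline. `TinyPipeline` and its stage bodies are hypothetical and only mimic the pattern; the real `MediaPipeline` internals are not shown in this document.

```python
import json
import tempfile
from pathlib import Path


class TinyPipeline:
    """Toy stand-in for a resumable pipeline; not the real MediaPipeline."""

    def __init__(self, media_id: str, out_root: str):
        self.path = Path(out_root) / media_id / "manifest.json"
        # Resumable: start from whatever a previous run already recorded.
        self.state = json.loads(self.path.read_text()) if self.path.exists() else {}
        self.ran = []  # stages actually executed by this instance

    def _stage(self, name: str, compute):
        # Idempotent: a stage whose output is already recorded is skipped.
        if name not in self.state:
            self.state[name] = compute()
            self.ran.append(name)
            self.path.parent.mkdir(parents=True, exist_ok=True)
            self.path.write_text(json.dumps(self.state))
        return self  # returning self allows chaining

    def extract(self):
        return self._stage("text", lambda: "hello world")

    def enrich(self):
        return self._stage("keywords", lambda: self.state["text"].split())


root = tempfile.mkdtemp()
first = TinyPipeline("demo", root).extract().enrich()
second = TinyPipeline("demo", root).extract().enrich()
print(first.ran)   # ['text', 'keywords']
print(second.ran)  # []
```

The second instance finds both outputs in the saved manifest, so neither stage runs again.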

### Persistence (DB-pluggable, two records)

Each item is persisted as **two** records so indexing stays cheap while
aggregation stays simple:

- `manifest.json` — lean index: ids, counts, `text_chars`, summary, keywords,
  SEO, asset pointers. (The JSONB metadata row.)
- `document.json` — canonical content: full `text`, `pages`/segments,
  `transcript`. The single source of truth for search / aggregation / LLM
  datasets — one read per item, no re-stitching of per-owner on-disk files.

```python
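import media_intelligence as mi

# `item`, `manifest`, and `document` are assumed to already exist
# (e.g. from an earlier pipeline run).
store = mi.persist.FileStore("/data")
store.save_manifest(item.media_id, manifest)   # lean index
store.save_document(item.media_id, document)   # full body
doc = store.load_document(item.media_id)       # aggregation reads this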

# later, identical interface, JSONB backend:
# store = mi.persist.PgStore(dsn=...)   # planned (abstract_database)
#   -> metadata in JSONB, body text in a full-text-indexed column
```
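
For readers who want to see the two-record idea end to end, here is a small self-contained sketch. `TinyFileStore` is a hypothetical stand-in, and its `<root>/<media_id>/` layout is an assumption; the real `FileStore` may organise files differently.

```python
import json
import tempfile
from pathlib import Path


class TinyFileStore:
    """Illustrative two-record store; the on-disk layout here is an assumption."""

    def __init__(self, root: str):
        self.root = Path(root)

    def _write(self, media_id: str, name: str, payload: dict) -> Path:
        target = self.root / media_id / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")
        return target

    def save_manifest(self, media_id: str, manifest: dict) -> Path:
        return self._write(media_id, "manifest.json", manifest)  # lean index

    def save_document(self, media_id: str, document: dict) -> Path:
        return self._write(media_id, "document.json", document)  # full body

    def load_document(self, media_id: str) -> dict:
        path = self.root / media_id / "document.json"
        return json.loads(path.read_text(encoding="utf-8"))


store = TinyFileStore(tempfile.mkdtemp())
body = {"text": "full OCR text", "pages": ["full OCR text"]}
store.save_document("demo", body)
# The manifest carries only small, indexable facts about the body.
store.save_manifest("demo", {"media_id": "demo", "text_chars": len(body["text"])})
```

Keeping the body out of the manifest means an index scan reads only small files, while anything that needs the full text does a single `load_document` call.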

`MediaPipeline.persist()` writes both records. On re-run, the body is rehydrated
from `document.json`, so `extract` and `enrich` are skipped (no re-OCR or
re-transcription).
