Metadata-Version: 2.4
Name: abstract_pdfs
Version: 0.0.33
Summary: A structured pipeline for transforming PDFs into **searchable, metadata-rich, web-ready content**, combining OCR, page-level analysis, metadata generation, and static site scaffolding.
Home-page: https://github.com/AbstractEndeavors/abstract_pdfs
Author: putkoff
Author-email: partners@abstractendeavors.com
License: MIT
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: abstract_ocr
Requires-Dist: abstract_utilities
Requires-Dist: pymupdf
Requires-Dist: PyMuPDF
Requires-Dist: SpeechRecognition
Requires-Dist: keybert
Requires-Dist: whisper
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

---

# 🔹 🔧 REWRITTEN README (ABSTRACT_PDFS — SYSTEM LEVEL)

---

## **abstract_pdfs — Document Processing & SEO Pipeline for PDF-Based Content**

A structured pipeline for transforming PDFs into **searchable, metadata-rich, web-ready content**, combining OCR, page-level analysis, metadata generation, and static site scaffolding.

Designed for:

* large PDF collections
* SEO-driven content indexing
* document-to-web publishing pipelines
* structured ingestion of unstructured media

---

## 🔹 What This System Is

abstract_pdfs is not a PDF utility — it is a **full document processing pipeline**:

* ingests raw PDFs
* decomposes them into pages, images, and text
* extracts and generates metadata
* enriches content via NLP APIs
* builds structured outputs (JSON + HTML)
* generates navigable web content (galleries + viewers)

The result is a **fully browsable, searchable document corpus**.

---

## 🔹 Pipeline Overview

```text
PDF Input
    ↓
Slice / Decompose (images + text per page)
    ↓
OCR + Text Extraction (layout-aware engines)
    ↓
Metadata Generation
    ├─ summaries
    ├─ keywords
    ├─ descriptions
    ↓
Manifest Creation (per-page + per-document)
    ↓
HTML Generation
    ├─ PDF viewer pages
    ├─ gallery index pages
    ↓
Static Site Output (SEO-ready)
```

---

## 🔹 Core Capabilities

### **Document Decomposition**

* Splits PDFs into:

  * page images
  * extracted text
  * structured page directories
* Maintains consistent directory structure for downstream processing

---

### **Metadata & SEO Enrichment**

* Generates:

  * summaries
  * keywords
  * descriptions
* Integrates with NLP endpoints for:

  * text analysis
  * keyword refinement
  * summarization

Example: page-level analysis via API calls 

---

### **Manifest Generation**

* Produces structured JSON per page:

  * metadata
  * text
  * image references
  * SEO fields
* Aggregates into document-level manifests

---

### **Static Site Generation**

* Generates:

  * **PDF viewer pages** (page-by-page navigation)
  * **gallery index pages** (directory browsing)
* Automatically builds:

  * thumbnails
  * descriptions
  * keyword tags

Example: dynamic card generation for directories 

---

### **Path ↔ URL Mapping**

* Converts filesystem structure into web-accessible URLs
* Maintains consistency between:

  * local storage (`/srv/media/...`)
  * public endpoints (`/pdfs/...`)

---

### **Content Structuring**

* Page-level:

  * text
  * summary
  * keywords
* Document-level:

  * aggregated metadata
  * full-text indexing

---

## 🔹 Architecture

The system is composed of modular components:

* **DocumentPipeline**

  * orchestrates ingestion → processing → output
* **SliceManager**

  * handles PDF decomposition and OCR
* **Manifest Generators**

  * build structured JSON representations
* **HTML Generators**

  * render viewer and gallery pages
* **Metadata Utilities**

  * enrich content via external NLP services

Each stage is:

* independent
* composable
* replaceable

---

## 🔹 Key Design Decisions

### **Page-Level First**

All processing happens per-page, enabling:

* granular indexing
* targeted metadata
* scalable processing

---

### **Structured Over Raw**

Outputs are always:

* JSON manifests
* structured metadata
* normalized fields

Not just raw text dumps.

---

### **SEO as a First-Class Concern**

Every page includes:

* meta tags
* OpenGraph / social metadata
* keyword tagging
* canonical URLs

---

### **Filesystem as Source of Truth**

* directory structure = content hierarchy
* no database required
* easily deployable as static site

---

## 🔹 Why This Exists

Traditional PDF workflows:

* store documents as opaque blobs
* lack searchability
* lack metadata
* are not web-native

abstract_pdfs transforms PDFs into:

* **structured, indexable content**
* **web-ready assets**
* **searchable knowledge bases**

---

## 🔹 Example Use Cases

* PDF → website publishing pipelines
* document archives (research, legal, media)
* SEO-driven content platforms
* knowledge base generation
* preprocessing for LLM / search systems

---

## 🔹 Integration Context

This system integrates with:

* OCR pipelines (layout_ocr / abstract_ocr)
* NLP systems (abstract_hugpy)
* static hosting (Nginx / CDN)
* search indexing systems

---

## 🔹 Design Philosophy

* **Documents are data, not files**
* **Structure before presentation**
* **Metadata is as important as content**
* **Static outputs scale better than dynamic systems**

---
