Metadata-Version: 2.4
Name: abstract-hugging-face
Version: 0.1.1
Summary: Lazy-loaded HuggingFace utilities for summarization, OCR, embeddings, and video pipelines
Home-page: https://github.com/AbstractEndeavors/abstract_hugpy
Author: Abstract Endeavors
Author-email: partners@abstractendeavors.com
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: huggingface_hub>=0.23
Requires-Dist: numpy
Requires-Dist: regex
Provides-Extra: ml
Requires-Dist: torch>=2.2; extra == "ml"
Requires-Dist: transformers>=4.40; extra == "ml"
Requires-Dist: sentence-transformers>=2.6; extra == "ml"
Requires-Dist: keybert>=0.8; extra == "ml"
Requires-Dist: accelerate>=0.30; extra == "ml"
Provides-Extra: ocr
Requires-Dist: paddleocr; extra == "ocr"
Requires-Dist: opencv-python; extra == "ocr"
Requires-Dist: pdf2image; extra == "ocr"
Provides-Extra: nlp
Requires-Dist: spacy>=3.7; extra == "nlp"
Provides-Extra: video
Requires-Dist: moviepy; extra == "video"
Requires-Dist: openai-whisper; extra == "video"
Provides-Extra: full
Requires-Dist: torch>=2.2; extra == "full"
Requires-Dist: transformers>=4.40; extra == "full"
Requires-Dist: sentence-transformers>=2.6; extra == "full"
Requires-Dist: keybert>=0.8; extra == "full"
Requires-Dist: accelerate>=0.30; extra == "full"
Requires-Dist: spacy>=3.7; extra == "full"
Requires-Dist: moviepy; extra == "full"
Requires-Dist: openai-whisper; extra == "full"
Requires-Dist: paddleocr; extra == "full"
Requires-Dist: opencv-python; extra == "full"
Requires-Dist: pdf2image; extra == "full"
Requires-Dist: abstract_security; extra == "full"
Requires-Dist: bs4; extra == "full"
Requires-Dist: abstract_ai; extra == "full"
Dynamic: author-email
Dynamic: home-page
Dynamic: requires-python

# abstract_hugpy
## Description
A batteries-included bridge between the **abstract\_\*** ecosystem and popular **Hugging Face–style** NLP/speech models. It packages **local model runners**, **text utilities**, **video→audio→transcribe→summarize** workflows, and optional **Flask blueprints** so you can expose everything over HTTP with almost no glue code.

* Repository: `https://github.com/AbstractEndeavors/abstract_hugpy`
* Author: `putkoff`
* License: MIT
* Status: Alpha

## ✨ Features

* **Video intelligence pipeline**

  * Download YouTube videos (`yt_dlp`)
  * Extract audio (`moviepy`/`ffmpeg`)
  * Transcribe with **OpenAI Whisper** (local)
  * Auto-generate **SRT captions**, **summary**, **keywords**, and **metadata**
  * Persistent, per-video directory management (`VideoDirectoryManager`)
* **Summarization**

  * Local **T5** (from your pre-downloaded dir)
  * **google/flan-t5-xl** helper for quick text2text summaries
  * **Falconsai/text\_summarization** pipeline (optional)
* **Keywords & embeddings**

  * **Sentence-BERT + KeyBERT** for keyphrase extraction
  * spaCy-based noun/NER keywording + density metrics
* **Generation helpers**

  * A lightweight text generator (`distilgpt2`) and helper to build public asset URLs
* **DeepCoder (local LLM) integration**

  * Singleton wrapper around a local **DeepCoder-14B** checkpoint with normal/chat generation
* **Drop-in HTTP APIs (Flask blueprints)**

  * `/download_video`, `/extract_video_audio`, `/get_video_whisper_*`, `/get_video_*path`, etc.
  * `/deepcoder_generate`
  * Optional **proxy** blueprint for port-forwarding to local services

---

## 📦 Install

The numbered sections below cover the build configuration and install options; optional extras keep the core install lightweight.

---

# 1. `pyproject.toml`

```toml
[build-system]
requires = ["setuptools>=68", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "abstract-hugging-face"
version = "0.1.1"
description = "Lazy-loaded HuggingFace utilities for summarization, OCR, embeddings, and video pipelines"
readme = "README.md"
requires-python = ">=3.10"

authors = [
    {name = "Abstract Endeavors"}
]

dependencies = [
    "huggingface_hub>=0.23",
    "numpy",
    "regex",
]

# ------------------------------------------------------------------
# Optional dependency groups
# ------------------------------------------------------------------

[project.optional-dependencies]

ml = [
    "torch>=2.2",
    "transformers>=4.40",
    "sentence-transformers>=2.6",
    "keybert>=0.8",
    "accelerate>=0.30",
]

ocr = [
    "paddleocr",
    "opencv-python",
    "pdf2image",
]

nlp = [
    "spacy>=3.7",
]

video = [
    "moviepy",
    "openai-whisper",
]

full = [
    "torch>=2.2",
    "transformers>=4.40",
    "sentence-transformers>=2.6",
    "keybert>=0.8",
    "accelerate>=0.30",
    "spacy>=3.7",
    "moviepy",
    "openai-whisper",
    "paddleocr",
    "opencv-python",
    "pdf2image",
]

# ------------------------------------------------------------------
# CLI entrypoints (optional but very useful)
# ------------------------------------------------------------------

[project.scripts]

abstract-models = "abstract_hugging_face.cli:main"
```

---

# 2. Install Examples

### Core install

```bash
pip install .
```

Installs only lightweight dependencies.

---

### ML stack

```bash
pip install .[ml]
```

Installs:

```
torch
transformers
sentence-transformers
keybert
accelerate
```

---

### OCR stack

```bash
pip install .[ocr]
```

Installs:

```
paddleocr
opencv-python
pdf2image
```

---

### Everything

```bash
pip install .[full]
```

---

# 3. Torch CUDA installs (important)

PyTorch CUDA builds **cannot be safely pinned in a package**, so install the torch build you need yourself before adding the extras.

Example GPU install:

```bash
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install abstract-hugging-face[ml]
```

CPU:

```bash
pip install torch --index-url https://download.pytorch.org/whl/cpu
pip install abstract-hugging-face[ml]
```

---

# 4. Recommended Package Layout

Given the modules above, a clean layout is:

```
abstract_hugging_face/
│
├── __init__.py
├── config.py
├── lazy_import.py
├── models.py
│
├── summarizers/
│   ├── falcon_flan_t5.py
│   └── chunk_utils.py
│
├── embeddings/
│   ├── keybert_manager.py
│   └── similarity.py
│
├── ocr/
│   ├── paddle_ocr.py
│   └── preprocess.py
│
├── video/
│   ├── whisper_transcribe.py
│   └── audio_extract.py
│
└── cli.py
```

---

# 5. Optional: CLI to Download Models

Since the package fetches models on demand via `ensure_model()`, you can expose that through the CLI.

Example `cli.py`:

```python
import sys
from .models import ensure_model, list_models


def main():
    if len(sys.argv) < 2:
        print("Available models:", ", ".join(list_models()))
        return

    model = sys.argv[1]
    path = ensure_model(model)
    print("Model ready at:", path)
```

Then users can run:

```bash
abstract-models summarizer
```

and it downloads automatically.

---

# 6. Why this structure fits the system

The project uses:

* lazy imports
* large models
* OCR pipelines
* video pipelines
* multiple ML engines

This structure:

✔ keeps installs small
✔ avoids downloading models during install
✔ supports modular usage
✔ matches your registry system

---

# 7. Recommended: a model cache lock

Add a **model cache lock** to prevent multiple workers from downloading the same model simultaneously.

This commonly happens when running under:

* Flask
* Gunicorn
* multiprocessing
* Celery

It's roughly a **ten-line addition** but prevents corrupted model directories.
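A minimal sketch of such a lock, using an advisory `fcntl` file lock (Linux/Unix only; the helper name and lock-file name are illustrative, not part of the package):

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def model_download_lock(model_dir):
    """Serialize model downloads across processes with an advisory file lock."""
    os.makedirs(model_dir, exist_ok=True)
    lock_path = os.path.join(model_dir, ".download.lock")
    with open(lock_path, "w") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)  # blocks until any other holder releases
        try:
            yield
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)
```

Inside a downloader such as `ensure_model()`, wrap the download in `with model_download_lock(target_dir):` and re-check whether the files already exist once the lock is acquired, so only the first worker actually downloads.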


---

## 🗂️ Project Layout

```
abstract_hugpy/
  abstract_hugpy.py               # convenience import
  routes.py                       # re-exports model helpers
  video_utils.py                  # VideoDirectoryManager + video pipeline API
  create/get_video_url_bp.py      # codegen helpers for Flask blueprints
  hugging_face_flasks/
    deep_coder_flask.py
    proxy_video_url_flask.py
    video_url_flask.py
  hugging_face_models/
    config.py                     # DEFAULT_PATHS to local model dirs
    whisper_model.py
    summarizer_model.py
    google_flan.py
    keybert_model.py
    falcon_flan_t5_summarizers.py
    bigbird_module.py
    generation.py
    deepcoder.py
```

---

## ⚙️ Configuration

Local model/checkpoint locations are centralized in `hugging_face_models/config.py`:

```python
DEFAULT_PATHS = {
  "whisper":        "/mnt/24T/hugging_face/modules/whisper_base",
  "keybert":        "/mnt/24T/hugging_face/modules/all_minilm_l6_v2",
  "summarizer_t5":  "/mnt/24T/hugging_face/modules/text_summarization/",
  "flan":           "google/flan-t5-xl",
  "deepcoder":      "/mnt/24T/hugging_face/modules/DeepCoder-14B",
}
```

* You can **override** these at call time where functions accept a `*_path` or `model_directory` parameter.
* Video cache root defaults to `'/mnt/24T/hugging_face/videos'` (`video_utils.VIDEOS_DIRECTORY`). If that path doesn’t exist on your machine, either:

  * create it and grant write permissions, or
  * pass a different directory into `get_abs_videos_directory(...)` before use.
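As a sketch of how call-time overrides compose with the defaults, the following hypothetical `resolve_paths` helper (not part of the package) merges user paths into a copy of `DEFAULT_PATHS` without mutating it:

```python
# Abbreviated copy of the shipped defaults, for illustration only.
DEFAULT_PATHS = {
    "whisper":   "/mnt/24T/hugging_face/modules/whisper_base",
    "deepcoder": "/mnt/24T/hugging_face/modules/DeepCoder-14B",
}


def resolve_paths(overrides=None):
    """Return the defaults with any user-supplied paths layered on top."""
    paths = dict(DEFAULT_PATHS)
    paths.update(overrides or {})
    return paths


paths = resolve_paths({"deepcoder": "/opt/models/DeepCoder-14B"})
```

In practice, pass the override directly where a function documents it, e.g. `get_deep_coder(module_path="/opt/models/DeepCoder-14B")`.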

**Environment variables used by the proxy blueprint**

* `DEEPCODER_FLASK_PORT` – local port serving `deepcoder_generate`
* `VIDEO_URL_FLASK_PORT` – local port serving video endpoints

---

## 🚀 Quickstart (Python)

### 1) Summarize text (local T5)

```python
from abstract_hugpy.hugging_face_models.summarizer_model import summarize

text = "Long content ..."
summary = summarize(text, summary_mode="medium")  # short|medium|long|auto
print(summary)
```

### 2) Extract keywords (KeyBERT + spaCy)

```python
from abstract_hugpy.hugging_face_models.keybert_model import refine_keywords

info = refine_keywords(
    full_text="Your document goes here",
    top_n=10, diversity=0.5, use_mmr=True
)
print(info["combined_keywords"], info["keyword_density"])
```

### 3) Transcribe audio/video with Whisper (local)

```python
from abstract_hugpy.hugging_face_models.whisper_model import whisper_transcribe, extract_audio_from_video

audio_path = extract_audio_from_video("/path/to/video.mp4")  # creates audio.wav next to video
result = whisper_transcribe(audio_path, model_size="small", language="english")
print(result["text"])
```

### 4) End-to-end video pipeline (YouTube → metadata)

```python
from abstract_hugpy.video_utils import (
    download_video, extract_video_audio,
    get_video_whisper_text, get_video_metadata, get_video_captions
)

url = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

download_video(url)                  # cache info + mp4
extract_video_audio(url)             # cache audio.wav
text = get_video_whisper_text(url)   # transcribe (caches whisper_result.json)
meta = get_video_metadata(url)       # summary + keywords (caches video_metadata.json)
srt  = get_video_captions(url)       # captions.srt

print(meta["title"])
```

### 5) DeepCoder: local LLM generation

```python
from abstract_hugpy.hugging_face_models.deepcoder import get_deep_coder

dc = get_deep_coder()  # uses DEFAULT_PATHS["deepcoder"]
out = dc.generate(prompt="Write a Python function that checks if a number is prime.", max_new_tokens=256)
print(out)
```

---

## 🌐 HTTP API (Flask Blueprints)

You can expose the modules via Flask in minutes.

### Register blueprints

```python
from flask import Flask
from abstract_hugpy.hugging_face_flasks.video_url_flask import video_url_bp
from abstract_hugpy.hugging_face_flasks.deep_coder_flask import deep_coder_bp
from abstract_hugpy.hugging_face_flasks.proxy_video_url_flask import proxy_video_url_bp

app = Flask(__name__)
app.register_blueprint(video_url_bp)
app.register_blueprint(deep_coder_bp)
app.register_blueprint(proxy_video_url_bp)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5005, debug=True)
```

### Video endpoints (JSON in, JSON out)

All accept `POST`/`GET` with a JSON body like:

```json
{ "url": "https://www.youtube.com/watch?v=..." }
```

| Endpoint                      | Purpose                          | Returns                 |
| ----------------------------- | -------------------------------- | ----------------------- |
| `/download_video`             | Download/cache the video & info  | video info dict         |
| `/extract_video_audio`        | Ensure `audio.wav` exists        | path or ok              |
| `/get_video_whisper_result`   | Full Whisper JSON                | `{text, segments, ...}` |
| `/get_video_whisper_text`     | Transcribed text only            | `str`                   |
| `/get_video_whisper_segments` | Segment list                     | `list[dict]`            |
| `/get_video_metadata`         | `{title, description, keywords}` | dict                    |
| `/get_video_captions`         | Generate `.srt`                  | content/path            |
| `/get_video_info`             | yt-dlp info                      | dict                    |
| `/get_video_directory`        | cached folder path               | str                     |
| `/get_video_path`             | mp4 path                         | str                     |
| `/get_video_audio_path`       | audio path                       | str                     |
| `/get_video_srt_path`         | captions path                    | str                     |
| `/get_video_metadata_path`    | metadata path                    | str                     |

**Example**

```bash
curl -X POST http://localhost:5005/get_video_whisper_text \
  -H "Content-Type: application/json" \
  -d '{"url":"https://www.youtube.com/watch?v=dQw4w9WgXcQ"}'
```

### DeepCoder endpoint

| Endpoint              | Body                                                  | Notes                                              |
| --------------------- | ----------------------------------------------------- | -------------------------------------------------- |
| `/deepcoder_generate` | Arbitrary JSON passed to `DeepCoder.generate(**data)` | Expects keys like `prompt`, `max_new_tokens`, etc. |

**Example**

```bash
curl -X POST http://localhost:5005/deepcoder_generate \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Write a Python Fibonacci function.", "max_new_tokens":256}'
```

### Proxy endpoints

If you run the real services on separate local ports, enable the proxy blueprint and set:

* `DEEPCODER_FLASK_PORT`
* `VIDEO_URL_FLASK_PORT`

The proxy exposes the same routes under `/api/*` and forwards requests to the local services.
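The forwarding rule can be sketched as follows (the actual blueprint code may differ; the helper name is illustrative): `/api/<route>` on the proxy maps to `http://localhost:<PORT>/<route>` on the backing service.

```python
import os

# Port resolved from the environment, with an assumed fallback for illustration.
DEEPCODER_PORT = os.environ.get("DEEPCODER_FLASK_PORT", "5006")


def upstream_url(path, port=None):
    """Map a proxied /api/* path to the local service that actually serves it."""
    route = path[len("/api"):]  # strip the /api prefix
    return f"http://localhost:{port or DEEPCODER_PORT}{route}"
```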

---

## 🧠 How it works (high level)

```
YouTube URL
   │
   ▼
VideoDirectoryManager (per-ID folder)
   ├── info.json (yt_dlp)
   ├── video.mp4
   ├── audio.wav  (moviepy/ffmpeg)
   ├── whisper_result.json (OpenAI Whisper local)
   ├── captions.srt
   └── video_metadata.json (summary + keywords)
```

* **Whisper** transcribes audio to text & segments.
* **Summarizer** (local T5 or flan-t5-xl) condenses text.
* **KeyBERT + spaCy** extract keywords & densities.
* **Flask blueprints** expose orchestration endpoints.

---

## 📝 Logging

Most modules log via `abstract_utilities.get_logFile(__name__)`. Check your configured log directory for traces (e.g., video extraction progress, errors).

---

## 🔐 Security & Networking

* Downloading videos respects whatever `yt_dlp` supports; mind site TOS.
* The proxy blueprint forwards requests to `http://localhost:{PORT}`—use only within trusted networks and put a reverse proxy (Nginx) in front of it for auth/SSL if exposed publicly.
* Large models on GPU? Make sure to **cap tokens / batch sizes** in production.
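A minimal request guard for the capping advice above, assuming a hypothetical `sanitize_generate_kwargs` helper (not in the package) applied before the JSON body reaches `DeepCoder.generate(**data)`:

```python
# Server-side cap; clients may request less, never more.
MAX_NEW_TOKENS_CAP = 512


def sanitize_generate_kwargs(data):
    """Clamp client-supplied generation parameters to safe production limits."""
    kwargs = dict(data)
    requested = int(kwargs.get("max_new_tokens", 256))
    kwargs["max_new_tokens"] = min(requested, MAX_NEW_TOKENS_CAP)
    return kwargs
```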

---

## 🧩 API Reference (selected)

### `video_utils.VideoDirectoryManager`

* `get_data(video_url=None, video_id=None) -> dict`
* `download_video(video_url) -> dict`
* `extract_audio(video_url) -> str`
* `get_whisper_result(video_url) -> dict`
* `get_metadata(video_url) -> dict`  (summary+keywords)
* `get_captions(video_url) -> str`   (loads/exports SRT)

**Convenience functions** mirror the above:
`download_video(...)`, `extract_video_audio(...)`, `get_video_whisper_text(...)`, etc.

### `hugging_face_models.summarizer_model`

* `summarize(text, summary_mode='medium', max_chunk_tokens=450, min_length=None, max_length=None) -> str`

### `hugging_face_models.keybert_model`

* `refine_keywords(full_text, top_n=10, ...) -> dict`
* `extract_keywords(text|list[str], top_n=5, ...) -> list[...]`

### `hugging_face_models.whisper_model`

* `whisper_transcribe(audio_path, model_size='small', language='english', ...) -> dict`
* `extract_audio_from_video(video_path, audio_path=None) -> str|None`

### `hugging_face_models.deepcoder`

* `get_deep_coder(module_path=None, torch_dtype=None, use_quantization=True) -> DeepCoder`
* `DeepCoder.generate(prompt|messages, max_new_tokens=..., use_chat_template=False, ...) -> str`

---

## 🧯 Troubleshooting

* **`ffmpeg` not found**
  Install it (`sudo apt-get install ffmpeg`). MoviePy/yt-dlp rely on it.

* **spaCy model: `OSError: [E050] Can't find model 'en_core_web_sm'`**
  `python -m spacy download en_core_web_sm`

* **CUDA OOM / very slow inference**

  * Use smaller Whisper model (`tiny`/`base`), smaller T5, or run on CPU.
  * For DeepCoder, enable 4-bit quantization (`use_quantization=True`) and reduce `max_new_tokens`.

* **Permission errors under `/mnt/24T/...`**

  * Create the directories and set write perms, or change `DEFAULT_PATHS` and `VIDEOS_DIRECTORY` to locations you own.

* **`moviepy` audio write hangs**
  Ensure the input file has an audio stream; upgrade `moviepy`; verify ffmpeg.

* **`yt_dlp` network errors**
  Update `yt_dlp` and retry, or use cookies/proxy if needed.

---

## 🔄 Versioning

Current package version: **0.1.1** (alpha)

---

## 🤝 Contributing

PRs welcome! Please:

1. Open an issue describing the change.
2. Keep new modules consistent with the **abstract\_\*** patterns (logging, `SingletonMeta`, path helpers).
3. Add small, runnable examples for new endpoints or model utilities.

---

## 📜 License

MIT © Abstract Endeavors

---

## 💡 Alternatives & When To Prefer Them

* **Remote inference instead of local heavy models**
  If you don’t need air-gapped/offline ops, delegating summarization/ASR to hosted APIs (e.g., Hugging Face Inference Endpoints, OpenAI Whisper API) can drastically simplify setup and reduce infra friction. You could **wrap those calls** behind the same Flask blueprints used here.

* **Faster keywording at scale**
  For massive batch jobs, a simpler TF-IDF or RAKE pipeline (e.g., `scikit-learn`, `rake-nltk`) may be faster and “good enough.” Keep `abstract_hugpy` for high-value content where semantic quality matters.

* **Video processing queue**
  If you’re ingesting thousands of URLs, a message queue (RabbitMQ/Redis) with worker pods running only `video_utils` calls might be more resilient than synchronous Flask calls. You already use RabbitMQ elsewhere—easy to slot in.

* **Model management**
  For multi-host deployments, consider **HF `safetensors`** checkpoints + `text-generation-inference` or **vLLM** as a backend and adapt `deepcoder.py` to call remote generation instead of local `AutoModelForCausalLM`. This offloads VRAM juggling and gives you token-streaming, parallelism, and metrics “for free.”

