Metadata-Version: 2.4
Name: web-vest
Version: 0.1.1
Summary: Visual Element-based Saliency Toolkit for multimodal webpage saliency extraction and scoring.
Author: Cantay Caliskan, Joyce Zhang, Mark Kimani, Mertkan Karaaslan, Niels Weisbeek
License: MIT
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: beautifulsoup4>=4.14.3
Requires-Dist: deep-translator>=1.11.4
Requires-Dist: gdown>=5.2.1
Requires-Dist: lingua-language-detector>=2.2.0
Requires-Dist: mineru>=2.7.6
Requires-Dist: networkx>=3.6
Requires-Dist: numpy>=2.4.2
Requires-Dist: opencv-python>=4.13.0.92
Requires-Dist: pandas>=2.3.3
Requires-Dist: playwright==1.58.0
Requires-Dist: pillow>=11.3.0
Requires-Dist: requests>=2.32.5
Requires-Dist: torch>=2.10.0
Requires-Dist: torchvision==0.26.0
Requires-Dist: transformers>=4.57.6
Provides-Extra: dev
Requires-Dist: accelerate==1.12.0; extra == "dev"
Requires-Dist: aiofiles==24.1.0; extra == "dev"
Requires-Dist: aiohappyeyeballs==2.6.1; extra == "dev"
Requires-Dist: aiohttp==3.13.3; extra == "dev"
Requires-Dist: aiosignal==1.4.0; extra == "dev"
Requires-Dist: albucore==0.0.24; extra == "dev"
Requires-Dist: albumentations==2.0.8; extra == "dev"
Requires-Dist: annotated-doc==0.0.4; extra == "dev"
Requires-Dist: annotated-types==0.7.0; extra == "dev"
Requires-Dist: antlr4-python3-runtime==4.9.3; extra == "dev"
Requires-Dist: anyio==4.12.1; extra == "dev"
Requires-Dist: attrs==25.4.0; extra == "dev"
Requires-Dist: av==16.1.0; extra == "dev"
Requires-Dist: beautifulsoup4==4.14.3; extra == "dev"
Requires-Dist: boto3==1.42.54; extra == "dev"
Requires-Dist: botocore==1.42.54; extra == "dev"
Requires-Dist: brotli==1.2.0; extra == "dev"
Requires-Dist: certifi==2026.1.4; extra == "dev"
Requires-Dist: cffi==2.0.0; extra == "dev"
Requires-Dist: charset-normalizer==3.4.4; extra == "dev"
Requires-Dist: click==8.3.1; extra == "dev"
Requires-Dist: colorlog==6.10.1; extra == "dev"
Requires-Dist: contourpy==1.3.3; extra == "dev"
Requires-Dist: cryptography==46.0.5; extra == "dev"
Requires-Dist: cycler==0.12.1; extra == "dev"
Requires-Dist: datasets==4.5.0; extra == "dev"
Requires-Dist: deep-translator==1.11.4; extra == "dev"
Requires-Dist: dill==0.4.0; extra == "dev"
Requires-Dist: distro==1.9.0; extra == "dev"
Requires-Dist: doclayout_yolo==0.0.4; extra == "dev"
Requires-Dist: einops==0.8.2; extra == "dev"
Requires-Dist: fast-langdetect==0.2.5; extra == "dev"
Requires-Dist: fastapi==0.131.0; extra == "dev"
Requires-Dist: fasttext-predict==0.9.2.4; extra == "dev"
Requires-Dist: ffmpy==1.0.0; extra == "dev"
Requires-Dist: filelock==3.24.3; extra == "dev"
Requires-Dist: flatbuffers==25.12.19; extra == "dev"
Requires-Dist: fonttools==4.61.1; extra == "dev"
Requires-Dist: frozenlist==1.8.0; extra == "dev"
Requires-Dist: fsspec==2025.10.0; extra == "dev"
Requires-Dist: ftfy==6.3.1; extra == "dev"
Requires-Dist: gdown==5.2.1; extra == "dev"
Requires-Dist: gradio==5.49.1; extra == "dev"
Requires-Dist: gradio_client==1.13.3; extra == "dev"
Requires-Dist: gradio_pdf==0.0.22; extra == "dev"
Requires-Dist: groovy==0.1.2; extra == "dev"
Requires-Dist: h11==0.16.0; extra == "dev"
Requires-Dist: hf-xet==1.2.0; extra == "dev"
Requires-Dist: httpcore==1.0.9; extra == "dev"
Requires-Dist: httpx==0.28.1; extra == "dev"
Requires-Dist: httpx-retries==0.4.6; extra == "dev"
Requires-Dist: huggingface_hub==0.36.2; extra == "dev"
Requires-Dist: idna==3.11; extra == "dev"
Requires-Dist: ImageIO==2.37.2; extra == "dev"
Requires-Dist: Jinja2==3.1.6; extra == "dev"
Requires-Dist: jiter==0.13.0; extra == "dev"
Requires-Dist: jmespath==1.1.0; extra == "dev"
Requires-Dist: json_repair==0.58.0; extra == "dev"
Requires-Dist: kiwisolver==1.4.9; extra == "dev"
Requires-Dist: lazy_loader==0.4; extra == "dev"
Requires-Dist: lingua-language-detector==2.2.0; extra == "dev"
Requires-Dist: loguru==0.7.3; extra == "dev"
Requires-Dist: magika==1.0.1; extra == "dev"
Requires-Dist: markdown-it-py==4.0.0; extra == "dev"
Requires-Dist: MarkupSafe==3.0.3; extra == "dev"
Requires-Dist: matplotlib==3.10.8; extra == "dev"
Requires-Dist: mdurl==0.1.2; extra == "dev"
Requires-Dist: mineru==2.7.6; extra == "dev"
Requires-Dist: mineru_vl_utils==0.1.22; extra == "dev"
Requires-Dist: mlx==0.30.6; extra == "dev"
Requires-Dist: mlx-lm==0.29.1; extra == "dev"
Requires-Dist: mlx-metal==0.30.6; extra == "dev"
Requires-Dist: mlx-vlm==0.3.9; extra == "dev"
Requires-Dist: modelscope==1.34.0; extra == "dev"
Requires-Dist: mpmath==1.3.0; extra == "dev"
Requires-Dist: multidict==6.7.1; extra == "dev"
Requires-Dist: multiprocess==0.70.18; extra == "dev"
Requires-Dist: networkx==3.6.1; extra == "dev"
Requires-Dist: numpy==2.4.2; extra == "dev"
Requires-Dist: omegaconf==2.3.0; extra == "dev"
Requires-Dist: onnxruntime==1.24.2; extra == "dev"
Requires-Dist: openai==2.21.0; extra == "dev"
Requires-Dist: opencv-python==4.13.0.92; extra == "dev"
Requires-Dist: opencv-python-headless==4.13.0.92; extra == "dev"
Requires-Dist: orjson==3.11.7; extra == "dev"
Requires-Dist: packaging==26.0; extra == "dev"
Requires-Dist: pandas==2.3.3; extra == "dev"
Requires-Dist: pdfminer.six==20260107; extra == "dev"
Requires-Dist: pdftext==0.6.3; extra == "dev"
Requires-Dist: pillow==11.3.0; extra == "dev"
Requires-Dist: polars==1.38.1; extra == "dev"
Requires-Dist: polars-runtime-32==1.38.1; extra == "dev"
Requires-Dist: propcache==0.4.1; extra == "dev"
Requires-Dist: protobuf==6.33.5; extra == "dev"
Requires-Dist: psutil==7.2.2; extra == "dev"
Requires-Dist: py-cpuinfo==9.0.0; extra == "dev"
Requires-Dist: pyarrow==23.0.1; extra == "dev"
Requires-Dist: pyclipper==1.4.0; extra == "dev"
Requires-Dist: pycparser==3.0; extra == "dev"
Requires-Dist: pydantic==2.11.10; extra == "dev"
Requires-Dist: pydantic-settings==2.13.1; extra == "dev"
Requires-Dist: pydantic_core==2.33.2; extra == "dev"
Requires-Dist: pydub==0.25.1; extra == "dev"
Requires-Dist: Pygments==2.19.2; extra == "dev"
Requires-Dist: pyparsing==3.3.2; extra == "dev"
Requires-Dist: pypdf==6.7.2; extra == "dev"
Requires-Dist: pypdfium2==4.30.0; extra == "dev"
Requires-Dist: PySocks==1.7.1; extra == "dev"
Requires-Dist: python-dateutil==2.9.0.post0; extra == "dev"
Requires-Dist: python-dotenv==1.2.1; extra == "dev"
Requires-Dist: python-multipart==0.0.22; extra == "dev"
Requires-Dist: pytz==2025.2; extra == "dev"
Requires-Dist: PyYAML==6.0.3; extra == "dev"
Requires-Dist: qwen-vl-utils==0.0.14; extra == "dev"
Requires-Dist: regex==2026.2.19; extra == "dev"
Requires-Dist: reportlab==4.4.10; extra == "dev"
Requires-Dist: requests==2.32.5; extra == "dev"
Requires-Dist: rich==14.3.3; extra == "dev"
Requires-Dist: robust-downloader==0.0.2; extra == "dev"
Requires-Dist: ruff==0.15.2; extra == "dev"
Requires-Dist: s3transfer==0.16.0; extra == "dev"
Requires-Dist: safehttpx==0.1.7; extra == "dev"
Requires-Dist: safetensors==0.7.0; extra == "dev"
Requires-Dist: scikit-image==0.26.0; extra == "dev"
Requires-Dist: scipy==1.17.1; extra == "dev"
Requires-Dist: seaborn==0.13.2; extra == "dev"
Requires-Dist: semantic-version==2.10.0; extra == "dev"
Requires-Dist: sentencepiece==0.2.1; extra == "dev"
Requires-Dist: setuptools==82.0.0; extra == "dev"
Requires-Dist: shapely==2.1.2; extra == "dev"
Requires-Dist: shellingham==1.5.4; extra == "dev"
Requires-Dist: simsimd==6.5.13; extra == "dev"
Requires-Dist: six==1.17.0; extra == "dev"
Requires-Dist: sniffio==1.3.1; extra == "dev"
Requires-Dist: soundfile==0.13.1; extra == "dev"
Requires-Dist: soupsieve==2.8.3; extra == "dev"
Requires-Dist: starlette==0.52.1; extra == "dev"
Requires-Dist: stringzilla==4.6.0; extra == "dev"
Requires-Dist: sympy==1.14.0; extra == "dev"
Requires-Dist: thop==0.1.1.post2209072238; extra == "dev"
Requires-Dist: tifffile==2026.2.20; extra == "dev"
Requires-Dist: timm==1.0.25; extra == "dev"
Requires-Dist: tokenizers==0.22.2; extra == "dev"
Requires-Dist: tomlkit==0.13.3; extra == "dev"
Requires-Dist: torch==2.10.0; extra == "dev"
Requires-Dist: tqdm==4.67.3; extra == "dev"
Requires-Dist: transformers==4.57.6; extra == "dev"
Requires-Dist: typer==0.24.1; extra == "dev"
Requires-Dist: typing-inspection==0.4.2; extra == "dev"
Requires-Dist: typing_extensions==4.15.0; extra == "dev"
Requires-Dist: tzdata==2025.3; extra == "dev"
Requires-Dist: ultralytics==8.4.14; extra == "dev"
Requires-Dist: ultralytics-thop==2.0.18; extra == "dev"
Requires-Dist: urllib3==2.6.3; extra == "dev"
Requires-Dist: uv==0.10.4; extra == "dev"
Requires-Dist: uvicorn==0.41.0; extra == "dev"
Requires-Dist: wcwidth==0.6.0; extra == "dev"
Requires-Dist: websockets==15.0.1; extra == "dev"
Requires-Dist: wheel==0.46.3; extra == "dev"
Requires-Dist: xxhash==3.6.0; extra == "dev"
Requires-Dist: yarl==1.22.0; extra == "dev"
Dynamic: license-file

<h1 align="center">
  <img src="https://raw.githubusercontent.com/wherearethepizzas/vest/main/resources/logo.png" alt="Web Saliency logo" width="100" style="vertical-align: middle; border-radius: 20%; margin-right: 10px;"/>
  <span style="vertical-align: middle;">V.E.S.T.</span>
</h1>

<p align="center">
  <img src="https://img.shields.io/badge/python-3.12+-blue.svg" alt="Python Version"/>
  <img src="https://img.shields.io/badge/License-MIT-green.svg" alt="License"/>
</p>

## Read Me First
Welcome to the Visual Element-based Saliency Toolkit (V.E.S.T.).

**High-level summary**: This toolkit lets researchers extract web page elements and measure their importance. Importance is computed with a formula that takes into account each element's relative location and size, as well as the prominence of the web page that contains it.

**Core Mission**: The primary goal is to assess the branding of a webpage and programmatically identify the kinds of topics and narratives that are most prevalent on it.

## About the Package
**Visual Element-based Saliency Toolkit**

This package uses automated web crawling, topological graph generation, and multimodal content extraction to generate a spreadsheet detailing the relative location, size, and web page address of every text or image element across an entire website. The package also ships with a bespoke element ranking formula, the **EleRank Formula**, which uses an element's attributes to assign it an importance score, so that web page elements in a website can be identified and analyzed objectively.

### Key Feature Highlights
1. **Automated Web Crawling & Archiving**: Crawl domains natively from `.txt` lists, preserving structures and taking high-quality full-page screenshots.
2. **Topological Graph Generation**: Automatically map the structure of crawled domains as directed edge graphs serialized into GraphML format.
3. **Multimodal Content Extraction**: Run a customizable, locally hosted image-to-text pipeline combining MinerU structuring, U2-Net saliency detection, and a choice of modern large vision models (e.g., FLORENCE-2, BLIP-2) to generate structured multimodal CSV datasets.
4. **Element Importance Scoring**: Compute quantitative assessments of visual and textual elements using our bespoke **EleRank Formula**.

---

## Table of Contents
* [Installation](#installation)
* [Architecture & The VoT Formula](#architecture--the-vot-formula)
* [Quick Start Guide](#quick-start-guide)
* [Implemented Tools & Supported Models](#implemented-tools--supported-models)
* [Usage Notes](#usage-notes)

---

## Installation

This project leverages deep learning for computer vision and linguistics, so it needs a carefully prepared environment. We recommend installing the package from PyPI or using Conda to manage your dependencies.

**Installing from PyPI**
```bash
pip install web-vest
```

**Setting up your Conda Environment**

We will walk through setting up a dedicated workspace (`vest`), modeled after the project's internal environment:
1. **Create the virtual environment**:
   ```bash
   conda create -n vest python=3.12 -y
   ```
2. **Activate the environment**:
   ```bash
   conda activate vest
   ```
3. **Install Required Python Dependencies** (aligned with `pyproject.toml`; adjust `torch` install for your hardware):
   ```bash
   pip install torch torchvision
   pip install transformers pillow deep-translator lingua-language-detector beautifulsoup4 gdown networkx numpy opencv-python pandas requests mineru playwright
   ```
4. **Optional: Install MinerU Extra Dependencies**:
   *Use this if you want the full MinerU extras stack in your environment.*
   ```bash
   pip install --upgrade pip
   pip install uv
   uv pip install -U "mineru[all]"
   ```


---

## Architecture & The VoT Formula

### Pipeline Architecture
![Pipeline Architecture Diagram](https://raw.githubusercontent.com/wherearethepizzas/vest/main/resources/vot_pipeline_graph.png)
*Visualizes the flow: from Raw URL -> Screenshot -> MinerU Extraction -> Captioning/Translation -> Importance Scoring.*

### The EleRank Formula
Once elements are extracted, structured, captioned, and translated, each one represents a specific theme and a share of the visual real estate on its host site. To establish "what matters most" on a parsed page, the toolkit uses the EleRank Formula:
```
Importance = weight_1(size_of_content) + weight_2(coordinates_on_page) + weight_3(host_webpage_importance)
```
- **size_of_content**: The raw pixel area the text or image occupies on the screen.
- **coordinates_on_page**: Positional penalty/bonus (e.g., elements at the top coordinate space matter more).
- **host_webpage_importance**: A term reflecting the host page's PageRank in the domain graph, or an explicitly defined weight for the host domain.
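
The weighted sum above can be sketched in a few lines of Python. This is a minimal illustration, not the package's implementation: the `Element` class, the `elerank` function, and the way each term is scaled to the range 0 to 1 (area as a fraction of the page, position as distance from the bottom) are all assumptions, since the formula does not specify any normalization.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical container for one extracted element (not a package type)."""
    area_px: float          # pixel area occupied by the element
    top_px: float           # y-coordinate of the element's top edge
    page_height_px: float   # height of the full-page screenshot
    page_area_px: float     # area of the full-page screenshot
    page_rank: float        # host page importance, assumed already in [0, 1]


def elerank(e: Element, w1: float, w2: float, w3: float) -> float:
    """Weighted sum of size, position, and host-page importance."""
    size_term = e.area_px / e.page_area_px             # assumed normalization
    position_term = 1.0 - e.top_px / e.page_height_px  # higher on page -> larger
    return w1 * size_term + w2 * position_term + w3 * e.page_rank


# Two same-sized elements on the same page: one at the top, one near the bottom.
hero = Element(200_000, 0, 4000, 4_000_000, 0.4)
footer = Element(200_000, 3800, 4000, 4_000_000, 0.4)

weights = (0.5, 0.3, 0.2)
print(elerank(hero, *weights), elerank(footer, *weights))
```

With identical size and host-page importance, the element nearer the top scores higher, which matches the positional bonus described in the list above.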

---

## Quick Start Guide

### 1. Generate Site Graphs from a URL List
```bash
generate-site-graphs seeds.txt --output-folder site-graphs
```

`seeds.txt` should contain one website per line.
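
To sanity-check a seed list before crawling, a small pre-flight script along these lines may help. It is only a sketch: `read_seeds` is a hypothetical helper, not part of the package, and prepending `https://` to bare domains is an assumption, since this guide does not say whether the CLI requires a scheme.

```python
from urllib.parse import urlparse


def read_seeds(text: str) -> list[str]:
    """Return one normalized URL per non-empty line of a seed list."""
    seeds = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumption: bare domains are treated as HTTPS sites.
        url = line if "://" in line else f"https://{line}"
        if urlparse(url).netloc:  # keep only entries with a recognizable host
            seeds.append(url)
    return seeds


example = "visitqatar.com\n\nhttps://example.org/\n"
print(read_seeds(example))
```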

### 2. Run Preprocessing Independently
```bash
preprocess-folder data/raw data/interim
```
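
The Usage Notes give `Country/Webpage/dimensions/image.jpg` as an example layout for `data/raw`. The sketch below shows one way to check that a relative path follows that four-level shape before preprocessing. The `RawImage` type, the `parse_raw_path` helper, and the sample path are hypothetical; the actual `preprocess-folder` command may accept other layouts.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class RawImage(NamedTuple):
    """Hypothetical record for one raw screenshot path."""
    country: str
    webpage: str
    dimensions: str
    filename: str


def parse_raw_path(relative_path: str) -> RawImage:
    """Split a path relative to data/raw into its four expected components."""
    parts = PurePosixPath(relative_path).parts
    if len(parts) != 4:
        raise ValueError(
            f"expected Country/Webpage/dimensions/image, got {relative_path!r}"
        )
    return RawImage(*parts)


# Illustrative path only; the folder names are made up.
print(parse_raw_path("Qatar/visitqatar_com/1920x1080/image.jpg"))
```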

### 3. Run Webpage Element Extraction Independently
```bash
extract-webpage-elements data/interim data/interim
```

### 4. Run Captioning and Translation Independently
```bash
process-webpage-elements \
  data/interim \
  data/processed \
  --model florence \
  --hf-token "$HF_TOKEN" \
  --generate-salient-image no \
  --translate-to-eng yes
```

### 5. Rank Webpages (PageRank) Independently
```bash
rank-webpages site-graphs/visitqatar_com.graphml data/processed
```

This creates `data/processed/visitqatar_com.csv` with columns:
- `webpage_name`
- `rank`
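
To illustrate what a PageRank pass over a site graph produces, here is a plain-Python power-iteration sketch that writes the same two columns. It is not the package's code: the edge list is made up, the damping factor of 0.85 is the conventional default rather than a documented setting, and whether the real `rank` column holds a score or an ordinal position is not stated here.

```python
import csv
import io


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) page links."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    count = len(nodes)
    rank = {n: 1.0 / count for n in nodes}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread uniformly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping) / count + damping * dangling / count
        new_rank = {n: base for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Made-up three-page site: the home page links out and is linked back to.
edges = [("home", "about"), ("home", "visit"), ("about", "home"), ("visit", "home")]
ranks = pagerank(edges)

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["webpage_name", "rank"])
for name, score in sorted(ranks.items(), key=lambda item: -item[1]):
    writer.writerow([name, f"{score:.4f}"])
print(buffer.getvalue())
```

In this toy graph the home page receives links from both other pages, so it ends up with the largest share of rank.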

### 6. Score Webpages Independently
```bash
score-webpages \
  0.5 0.3 0.2 \
  data/processed/webpage_elements_captions.csv \
  data/processed/visitqatar_com.csv \
  data/processed/webpage_elements_scored.csv
```

### 7. Run the Entire Pipeline in One Command
```bash
web-saliency \
  --raw-files-path data/raw \
  --model florence \
  --generate-salient-image no \
  --translate-to-eng yes \
  --output-csv-name webpage_elements_captions.csv
```

---

## Implemented Tools & Supported Models

| Type | Library / Model | Purpose |
|------|-----------------|---------|
| **Crawling** | Playwright, Requests | Archiving and rendering JavaScript-heavy pages |
| **Topology** | NetworkX | Parsing links into a directed GraphML object |
| **Structuring** | MinerU | Bounding box generation and modality classification |
| **Saliency** | U2-Net | "Soft dimming" background elements prior to captioning |
| **Captioning** | BLIP-2, Florence-2 | Vision-Language Models to summarize visual context |
| **NLP** | Lingua, Google Translate | Detecting languages and providing English homogenization |

---

## Usage Notes
- **Hugging Face Token**: If you plan to use gated models like BLIP-2 (or want faster downloads), you may need to export a Hugging Face API token: `export HF_TOKEN="your_token"`.
- **GPU Acceleration**: MinerU, Florence-2, and BLIP-2 all benefit substantially from CUDA (NVIDIA GPUs) or MPS (Apple Silicon). When available, the pipeline automatically routes tensor processing to these accelerators.
- **Data Preprocessing**: Place directories containing the webpages in `data/raw`, using a consistent folder structure (e.g., `Country/Webpage/dimensions/image.jpg`).
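
To check which accelerator your environment exposes before a long run, a sketch like the following may be useful. The `pick_device` helper and its CUDA-then-MPS-then-CPU preference order are assumptions for illustration; the pipeline's own selection logic is not documented here and may differ.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple's MPS backend, then fall back to the CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


try:
    import torch

    device = pick_device(
        torch.cuda.is_available(),
        torch.backends.mps.is_available(),
    )
except ImportError:
    # PyTorch is not installed in this environment; report the CPU fallback.
    device = "cpu"

print(f"Selected device: {device}")
```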

---
