Metadata-Version: 2.4
Name: scrapetl
Version: 3.0.0a1
Summary: Web scraper orchestration platform with a visual flow builder
Author: Sadig Akhund
License: MIT
Project-URL: Homepage, https://github.com/sadigaxund/ScrapeTL
Project-URL: Repository, https://github.com/sadigaxund/ScrapeTL.git
Project-URL: Issues, https://github.com/sadigaxund/ScrapeTL/issues
Keywords: scraping,automation,etl,no-code,web-scraper,fastapi
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Requires-Dist: fastapi>=0.115.0
Requires-Dist: uvicorn[standard]>=0.30.6
Requires-Dist: apscheduler>=3.10.4
Requires-Dist: sqlalchemy>=2.0.35
Requires-Dist: requests>=2.32.3
Requires-Dist: python-dotenv>=1.0.1
Requires-Dist: croniter>=3.0.3
Requires-Dist: beautifulsoup4>=4.12.3
Requires-Dist: python-multipart>=0.0.9
Requires-Dist: pytz>=2024.1
Requires-Dist: tzlocal>=5.2
Requires-Dist: openpyxl>=3.1
Provides-Extra: playwright
Requires-Dist: playwright>=1.49.1; extra == "playwright"
Requires-Dist: playwright-stealth>=1.0.6; extra == "playwright"

<div align="center">
  <h1>🚀 ScrapeTL</h1>
  <p><b>The Open-Source Scraper Management &amp; Orchestration Platform</b></p>
  <p>
    <img src="https://img.shields.io/badge/Python-3.11+-blue.svg" alt="Python 3.11+">
    <img src="https://img.shields.io/badge/FastAPI-005571?style=flat&logo=fastapi" alt="FastAPI">
    <img src="https://img.shields.io/badge/Database-SQLite-003B57?style=flat&logo=sqlite" alt="SQLite">
    <img src="https://img.shields.io/badge/Frontend-Vanilla_JS-F7DF1E?style=flat&logo=javascript" alt="Vanilla JS">
  </p>
</div>

---

**ScrapeTL** is a lightweight, robust orchestration engine designed to manage, schedule, and execute custom web scrapers. It provides a beautiful web-based interface for overseeing your data extraction pipeline, featuring native timezone support, custom webhook integrations, and a secure no-code deployment workflow.

## ✨ Core Pillars

- **🏗️ Visual Scraper Builder**: Move beyond code with the integrated node-based flow editor. Connect Fetchers, Selectors, and Transformers on a zoomable 2D canvas to build enterprise-grade scrapers visually.
- **🧩 Functional Expression Engine**: Inject dynamic logic into your flows using `{{ ... }}` syntax. Call built-in functions like `now()`, `uuid()`, and `random()` or define custom Python UDFs for complex data cleaning.
- **🔬 Granular Debug Inspector**: Troubleshoot failing scrapers with ease. Use the Debug Sink node to "tap" into any part of your flow and view raw data, HTML previews, and JSON structures in a dedicated, sandboxed inspector.
- **🕒 Precision Scheduling**: Powered by `APScheduler`, the scheduler runs complex execution cycles defined with standard Cron expressions. Native IANA timezone resolution (e.g., `Asia/Baku`, `UTC`) keeps run times accurate across regions.
- **🔄 Fault-Tolerant Queue**: Automatically tracks and recovers missed tasks. If the server restarts, ScrapeTL identifies overdue jobs and processes them immediately via a persistent catch-up queue.
- **🔌 Flexible Integrations**: Distribute data effortlessly. Configure Discord webhooks, JSON payloads, and custom notification templates directly through the UI.
- **📦 Semantic Versioning**: Built-in system snapshots allow you to edit scraper logic and perform on-the-fly version bumps (e.g., `v2.0.0`) with clear audit trails.
- **💎 Premium Dashboard**: A high-performance, glassmorphic dark-mode SPA providing real-time log monitoring, queue management, and interactive health diagnostics.
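
To make the `{{ ... }}` expression syntax more concrete, here is a minimal sketch of how such placeholders could be resolved. It is not ScrapeTL's actual engine: the `FUNCTIONS` registry, the `render` helper, and the restriction to zero-argument calls are all assumptions made for illustration.

```python
import re
import uuid
from datetime import datetime, timezone
from typing import Callable

# Hypothetical registry; ScrapeTL's real built-ins and their signatures may differ.
FUNCTIONS: dict[str, Callable[[], object]] = {
    "now": lambda: datetime.now(timezone.utc).isoformat(),
    "uuid": lambda: str(uuid.uuid4()),
}

# Matches `{{ name() }}` with optional whitespace; arguments are not handled here.
_EXPR = re.compile(r"\{\{\s*(\w+)\(\)\s*\}\}")


def render(template: str, functions: dict[str, Callable[[], object]] = FUNCTIONS) -> str:
    """Replace each `{{ name() }}` placeholder with the result of calling `name`."""

    def _substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in functions:
            raise KeyError(f"unknown function: {name}")
        return str(functions[name]())

    return _EXPR.sub(_substitute, template)
```

In this sketch a custom function is just another registry entry, for example `render("{{ shout() }}", {"shout": lambda: "HI"})`. Text without placeholders passes through unchanged.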

## 🛠️ Technology Stack

- **Core**: Python 3.11+, FastAPI, SQLAlchemy, APScheduler
- **UI**: HTML5, CSS3 (Vanilla), Vanilla ES6+ JavaScript (Zero build dependencies)
- **Database**: SQLite (embedded, so there is no separate database server to run)

## 🚀 Deployment

1. **Clone the repository:**
   ```bash
   git clone https://github.com/sadigaxund/ScrapeTL.git
   cd ScrapeTL
   ```

2. **Environment Setup:**
   ```bash
   python -m venv .venv
   source .venv/bin/activate    # Linux/macOS
   # .venv\Scripts\activate     # Windows: run this instead of the line above
   pip install -r requirements.txt
   ```

3. **Launch:**
   ```bash
   python run.py
   ```
   *Access the dashboard at `http://localhost:8000`.*

---

## 🧩 Building Your First Scraper

ScrapeTL offers a hybrid approach to development, supporting both visual flows and traditional Python scripts.

### Option A: The Scraper Builder (Visual)
Navigate to **Builder** in the dashboard to create a scraper using the node-based flow editor. This is the fastest way to get started and requires zero code for most common scraping tasks (HTML extraction, Regex, JSON parsing).

### Option B: Python-Based (Code)
For complex scraping scenarios requiring custom libraries or intricate logic, inherit from `BaseScraper` and define your rules in a `.py` file.

```python
from app.scrapers.base import BaseScraper

class WebMonitor(BaseScraper):
    def scrape(self):
        # Your extraction logic (BeautifulSoup, requests, etc.)
        return [{"title": "Data Point A", "value": "123.45"}]
```

Once written, upload this file through the **Setup Wizard** on the dashboard. ScrapeTL will securely import, version, and orchestrate the tasks according to your schedule.
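
For a slightly fuller picture of the `scrape()` contract, the sketch below parses a page and returns a list of dictionaries, matching the shape of the example above. It runs on its own, so `BaseScraper` here is a local stand-in rather than the real `app.scrapers.base.BaseScraper`. The `fetch` method, `SAMPLE_HTML`, and the `_HeadingParser` helper are likewise hypothetical. It uses the standard-library `html.parser` to stay dependency-free, and BeautifulSoup would work just as well.

```python
from html.parser import HTMLParser


class BaseScraper:
    """Stand-in for the real base class so this sketch is self-contained."""

    def scrape(self) -> list[dict]:
        raise NotImplementedError


class _HeadingParser(HTMLParser):
    """Collects the text content of every <h2> element."""

    def __init__(self) -> None:
        super().__init__()
        self.titles: list[str] = []
        self._in_h2 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2:
            self.titles[-1] += data


# Hypothetical page content; a real scraper would download this.
SAMPLE_HTML = "<html><body><h2>First post</h2><p>x</p><h2> Second post </h2></body></html>"


class HeadingScraper(BaseScraper):
    def fetch(self) -> str:
        # Placeholder for a network call (e.g. with requests).
        return SAMPLE_HTML

    def scrape(self) -> list[dict]:
        parser = _HeadingParser()
        parser.feed(self.fetch())
        parser.close()
        # One record per heading, with surrounding whitespace removed.
        return [{"title": title.strip()} for title in parser.titles]
```

Keeping the download step in its own method, as assumed here, makes it easy to test the extraction logic against saved HTML before scheduling the scraper.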

## 📜 License

This project is open-source software licensed under the **MIT License**.
