Metadata-Version: 2.4
Name: scrapy-delta-guard
Version: 1.0.0
Summary: A Scrapy extension to detect data changes (deltas) between scraped items and a database.
Home-page: https://github.com/nazaradn/scrapy-delta-guard
Author: Abdul Nazar
Classifier: Development Status :: 5 - Production/Stable
Classifier: Framework :: Scrapy
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Requires-Dist: Scrapy>=2.0.0
Requires-Dist: SQLAlchemy>=1.3.0
Requires-Dist: requests>=2.25.0
Dynamic: author
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# 🛡️ Scrapy DeltaGuard

**Scrapy DeltaGuard** is a powerful and easy-to-integrate Scrapy extension that monitors changes (deltas) between the data you scrape and the data already present in your database.

It helps maintain data integrity by monitoring data drift, reduces noisy or incorrect updates, and can automatically trigger alerts (e.g., to Slack or Jira) or stop a spider when significant changes occur.

## 🚀 Why use DeltaGuard?

Large-scale web scraping projects commonly face issues that corrupt data quality:
-   **Silent layout changes** on target websites that break selectors.
-   **Inconsistent formatting** for data like phone numbers (`123-456` vs `123456`).
-   **Transient incorrect data** appearing in sponsored or related content blocks.

DeltaGuard helps you:
-   Detect **real content changes**, not just formatting noise.
-   Avoid cascading bad data writes by automatically halting crawls if too many fields drift.
-   Integrate seamlessly without rewriting your existing item pipeline logic.
-   Notify downstream systems (Jira, Slack, etc.) when data quality issues are found.

---

## Installation

```bash
pip install scrapy-delta-guard
```

---

## ⚙️ Quick Start Guide

Follow these four steps to get `DeltaGuard` running in your project.

### 1. Configure Your Database Session for Detached Object Handling

To avoid SQLAlchemy’s `DetachedInstanceError` during delta checking, configure your
SQLAlchemy session with:

```python
from sqlalchemy.orm import sessionmaker

Session = sessionmaker(bind=engine, expire_on_commit=False)
```

This keeps loaded database objects accessible after commit, so DeltaGuard can read their attributes during Scrapy's asynchronous pipeline processing.
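For reference, a complete session setup might look like this. The connection URL is a placeholder (an in-memory SQLite database here); substitute your own:

```python
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# Placeholder connection URL; replace with your own database
engine = create_engine('sqlite:///:memory:')

# expire_on_commit=False keeps attribute values loaded after commit,
# so they remain readable from Scrapy's async pipeline
Session = sessionmaker(bind=engine, expire_on_commit=False)
session = Session()
```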

### 2. Configure `settings.py`

Enable the extension and define field monitoring with flexible thresholds and options.

```python
EXTENSIONS = {
    'deltaguard.extension.DeltaGuard': 500,
}

DELTA_GUARD_ENABLED = True

DELTA_GUARD_BATCH_SIZE = 50

DELTA_GUARD_DEFAULT_THRESHOLD = '5%'

DELTA_GUARD_FIELDS_CONFIG = [
    {'name': 'email'},  # simple shorthand for same db/spider field
    {'name': 'phone_number', 'threshold': 10},  # 10% threshold
    {
        'name': 'years_experience',
        'db_var': 'years_exp',  # different db attribute
        'spider_var': 'years_exp_spider',  # different spider field
        'threshold': '15%'
    },
]

DELTA_GUARD_DB_NONE_IS_DELTA = True
DELTA_GUARD_SPIDER_NONE_IS_DELTA = False

DELTA_GUARD_STOP_SPIDER_ON_HIGH_DELTA = True

DELTA_GUARD_JIRA_FUNC = 'my_project.utils.create_jira_ticket'

DELTA_GUARD_SLACK_WEBHOOK = 'https://hooks.slack.com/services/your/webhook/url'

LOG_LEVEL = 'DEBUG'
```

### 3. Update Your Scrapy Item

Ensure your Scrapy `Item` class includes the `db_item` field to avoid `KeyError`:

```python
import scrapy

class YourItem(scrapy.Item):
    # ... your existing fields ...
    db_item = scrapy.Field()
```

### 4. Attach DB Items Using the Adapter in Your Pipelines

```python
from deltaguard.adapter import DeltaGuardAdapter

class YourPipeline:
    def process_item(self, item, spider):
        # self.session is a SQLAlchemy session (e.g., created in open_spider)
        db_item = self.session.query(YourModel).filter_by(email=item.get('email')).first()
        # Attach the DB record so DeltaGuard can compare it against the scraped item
        DeltaGuardAdapter.attach(item, db_item)
        return item
```

---

## How Does `DeltaGuard` Work?

- The extension compares the fields in `DELTA_GUARD_FIELDS_CONFIG` between the scraped item and its corresponding database record.
- Differences are accumulated in batches of `DELTA_GUARD_BATCH_SIZE` items.
- If deltas for any specific field exceed their configured percentage threshold during a batch, alerts are sent.
- Optionally, the spider is stopped immediately to prevent cascading bad data writes.
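The batch evaluation described above can be sketched roughly as follows. This is a simplified illustration, not DeltaGuard's actual implementation; the names `check_batch`, `field_deltas`, and `thresholds` are invented here:

```python
def check_batch(field_deltas, batch_size, thresholds, default=5.0):
    """Return the fields whose delta percentage exceeds their threshold.

    field_deltas: mapping of field name -> number of deltas seen in the batch
    thresholds:   mapping of field name -> threshold as a percentage (e.g. 5.0)
    """
    exceeded = []
    for field, count in field_deltas.items():
        percentage = 100.0 * count / batch_size
        if percentage > thresholds.get(field, default):
            exceeded.append(field)
    return exceeded

# In a batch of 50 items: 4 email deltas (8%) exceed a 5% threshold,
# while 1 phone delta (2%) stays under its 10% threshold
print(check_batch({'email': 4, 'phone_number': 1}, 50,
                  {'email': 5.0, 'phone_number': 10.0}))
```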

---

## Configuration Reference

| Setting | Type | Default | Description |
|---------|------|---------|-------------|
| `DELTA_GUARD_ENABLED` | `bool` | `False` | Enables or disables the extension globally. |
| `DELTA_GUARD_FIELDS_CONFIG` | `list[dict]` | `[]` | Fields to monitor with optional threshold, `db_var`, and `spider_var`. |
| `DELTA_GUARD_BATCH_SIZE` | `int` | `50` | Number of items processed per batch evaluation. |
| `DELTA_GUARD_DEFAULT_THRESHOLD` | `str` or `float` | `'5%'` | Default batch delta threshold (percentage) if none specified per field. |
| `DELTA_GUARD_DB_NONE_IS_DELTA` | `bool` | `False` | Treats a `None` in DB as delta if spider has a value. |
| `DELTA_GUARD_SPIDER_NONE_IS_DELTA` | `bool` | `False` | Treats a `None` in spider as delta if DB has a value. |
| `DELTA_GUARD_STOP_SPIDER_ON_HIGH_DELTA` | `bool` | `True` | Stops the spider when any field delta threshold is exceeded. |
| `DELTA_GUARD_JIRA_FUNC` | `str` | `None` | Dotted path to alert function (e.g., JIRA ticket creator). |
| `DELTA_GUARD_SLACK_WEBHOOK` | `str` | `None` | Slack Incoming Webhook URL for notifications. |
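`DELTA_GUARD_JIRA_FUNC` takes a dotted import path. Resolving such a path can be done along these lines (an illustrative helper, assumed here rather than part of DeltaGuard's public API):

```python
import importlib

def load_object(dotted_path):
    """Resolve a dotted path like 'my_project.utils.create_jira_ticket'
    into the callable it names."""
    module_path, _, name = dotted_path.rpartition('.')
    module = importlib.import_module(module_path)
    return getattr(module, name)

# e.g. load_object('os.path.join') returns the os.path.join function
```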

---

## Advanced Field Configuration

The `DELTA_GUARD_FIELDS_CONFIG` allows flexible definitions.

```python
DELTA_GUARD_FIELDS_CONFIG = [
    {'name': 'email'},                 # Simple shorthand
    {'name': 'phone_number', 'db_var': 'phone', 'spider_var': 'contact_phone'},  # Custom fields
    {'name': 'salary', 'threshold': 15},  # 15% threshold as integer
    {'name': 'location', 'threshold': '25%'},  # 25% threshold as string
]
```
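Since thresholds may be given either as a number or as a percentage string, a helper along these lines (illustrative only, not DeltaGuard's internal code) normalizes both forms to a float:

```python
def normalize_threshold(value, default=5.0):
    """Convert a threshold like 15, 15.0, or '15%' to a float percentage."""
    if value is None:
        return default
    if isinstance(value, str):
        return float(value.rstrip('%'))
    return float(value)

print(normalize_threshold('25%'))  # 25.0
print(normalize_threshold(15))     # 15.0
print(normalize_threshold(None))   # 5.0 (the default)
```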

---

### Using `safe_commit` to Prevent Data Corruption

Enabling `DELTA_GUARD_STOP_SPIDER_ON_HIGH_DELTA` stops the spider gracefully; however, Scrapy still processes the requests already in its queue and may write a few more batches of data. The `safe_commit` utility prevents this.

`safe_commit` ensures you only commit your SQLAlchemy session if DeltaGuard has **not** flagged a high-delta event. If the flag was set, it automatically rolls back the session to prevent potentially bad or partial data from being saved.

**Example usage in your pipeline:**
```python
from deltaguard.adapter import safe_commit

class YourDatabasePipeline:
    def close_spider(self, spider):
        # Checks the high-delta flag and decides whether to commit or roll back
        safe_commit(self.session, spider)
        self.session.close()
```

If you ever need to force a commit even when a high delta was detected:

```python
safe_commit(self.session, spider, force_commit=True)
```
- The function returns `True` if committed, `False` if rolled back.

This wrapper lets you centralize all your session commit logic and align with how DeltaGuard protects your database from corrupt/incomplete batches.
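Conceptually, `safe_commit` behaves along these lines. This is a simplified sketch of the documented behavior; the flag attribute name `deltaguard_high_delta` is an assumption, not a confirmed internal:

```python
def safe_commit(session, spider, force_commit=False):
    """Commit the session unless DeltaGuard flagged a high-delta event."""
    # Assumed name of the flag DeltaGuard sets on the spider
    high_delta = getattr(spider, 'deltaguard_high_delta', False)
    if high_delta and not force_commit:
        session.rollback()  # discard the potentially bad batch
        return False
    session.commit()
    return True
```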


### CSV Delta Logs on Slack (Optional)

DeltaGuard can automatically generate and send detailed CSV logs to Slack when high delta thresholds are exceeded. This gives you a complete audit trail of what changed.

#### Configuration

In your `settings.py`:
```python
# Enable CSV logs on Slack
DELTA_GUARD_LOGS_ON_SLACK = True

# Required: Slack Bot Token (not a webhook)
DELTA_GUARD_SLACK_BOT_TOKEN = "xoxb-your-bot-token-here"

# Required: Slack Channel ID (not the channel name)
# Right-click the channel >> View Channel Details
DELTA_GUARD_SLACK_CHANNEL_ID = "C01234ABCDE"
```


#### Setting up Slack Bot for File Uploads

To enable CSV file uploads, you need a Slack Bot Token (webhooks don't support file uploads):

1. **Create a Slack App**:
   - Go to [https://api.slack.com/apps](https://api.slack.com/apps)
   - Click "Create New App" → "From scratch"
   - Give it a name (e.g., "DeltaGuard Bot")

2. **Add Bot Permissions**:
   - Navigate to "OAuth & Permissions"
   - Under "Bot Token Scopes", add:
     - `files:write` (to upload files)
     - `chat:write` (to post messages)

3. **Install the App**:
   - Click "Install to Workspace"
   - Copy the "Bot User OAuth Token" (starts with `xoxb-`)

4. **Get Your Channel ID**:
   - In Slack, right-click your channel name
   - Select "View channel details"
   - Scroll to bottom and copy the Channel ID

5. **Invite Bot to Channel**:
   - In your Slack channel, type: `/invite @DeltaGuard Bot`

#### CSV Format

The generated CSV includes the following columns (sorted by field name):
- `db_item_id`: Primary key of the database record
- `field`: Field name that changed
- `old_value`: Value in the database
- `new_value`: Value from the spider

**Note**: Only deltas from fields that exceeded the threshold are included in the CSV.
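A CSV in that shape can be produced with the standard library. This is only an illustration of the format described above, not DeltaGuard's code; the sample row is invented:

```python
import csv
import io

# Hypothetical delta records, one per changed field
deltas = [
    {'db_item_id': 42, 'field': 'phone', 'old_value': '123-456', 'new_value': '123456'},
    {'db_item_id': 42, 'field': 'email', 'old_value': 'a@x.com', 'new_value': 'b@x.com'},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['db_item_id', 'field', 'old_value', 'new_value'])
writer.writeheader()
# Rows are sorted by field name, matching the documented CSV ordering
for row in sorted(deltas, key=lambda r: r['field']):
    writer.writerow(row)

print(buffer.getvalue())
```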

#### Example Output

When a high delta is detected, you'll receive:
1. A text alert showing which fields exceeded thresholds
2. A CSV file attachment with detailed change logs

CSV filenames follow the format `deltaguard_{spider_name}_deltas.csv`.


## License

MIT License
