Metadata-Version: 2.4
Name: et-scraper-safe
Version: 0.2.1
Summary: A polite, RSS-first Economic Times news and sentiment collector for market research.
Author: et_scraper_safe contributors
License: MIT
Project-URL: Homepage, https://github.com/yourusername/et-scraper-safe
Project-URL: Issues, https://github.com/yourusername/et-scraper-safe/issues
Keywords: economic-times,news,sentiment,scraper,rss,finance,swing-trading
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Office/Business :: Financial :: Investment
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.31
Requires-Dist: beautifulsoup4>=4.12
Requires-Dist: pandas>=2.0
Requires-Dist: feedparser>=6.0
Requires-Dist: python-dateutil>=2.8
Requires-Dist: lxml>=4.9
Dynamic: license-file

# et-scraper-safe

[![PyPI version](https://img.shields.io/pypi/v/et-scraper-safe.svg)](https://pypi.org/project/et-scraper-safe/)
[![Python versions](https://img.shields.io/pypi/pyversions/et-scraper-safe.svg)](https://pypi.org/project/et-scraper-safe/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

A polite, **RSS-first** Python library for collecting public Economic Times
news, deduplicating it, scoring sentiment + market impact, mapping headlines
to NSE stock symbols, and persisting everything to **SQLite + CSV** — built
for swing-trading and market research pipelines.

> **PyPI:** https://pypi.org/project/et-scraper-safe/

---

## Why "safe"?

This library is designed to be a good citizen of the web:

- ✅ Respects `robots.txt` before fetching any HTML page
- ✅ Prefers RSS feeds over HTML scraping
- ✅ Configurable delay between HTML requests
- ✅ Shared `requests.Session` with retry, exponential backoff, and 429/5xx handling
- ✅ Identifiable `User-Agent`
- ❌ Does **not** bypass logins, paywalls, captchas, Cloudflare, or rate limits

If `robots.txt` disallows a URL, the request is skipped — full stop.
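
The same guards are exposed to your own code. A minimal sketch of a polite fetch, assuming `create_session()` accepts default arguments (its exact parameters live in `et_scraper.http_session`; the URL is illustrative):

```python
from et_scraper.http_session import create_session
from et_scraper.robots_checker import can_fetch

url = "https://economictimes.indiatimes.com/markets"  # illustrative URL

# Shared retry/backoff session; assumes the defaults are acceptable
session = create_session()

# robots.txt is consulted first; if disallowed, no request is made
if can_fetch(url):
    response = session.get(url, timeout=10)
    print(response.status_code)
else:
    print("robots.txt disallows this URL; skipping")
```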

---

## What's new in 0.2.0

- 🔁 **Reliable HTTP** — `requests.Session` with retry, exponential backoff, 429/5xx handling
- 🗃️ **SQLite store** — `news_raw`, `news_clean`, `stock_news_map`, `sentiment_scores`, `scrape_logs`
- 📈 **Stock mapping** — built-in NSE large-cap list, extensible via DataFrame
- 🧠 **Structured sentiment** — `{sentiment, impact, confidence, reason, score}` payload
- 🪪 **URL hashing + dedup** — every row keyed by SHA-1 of its link
- 🕒 **Published date parsing** — UTC ISO `published_at` column
- 📜 **Logging** — proper `logging` setup instead of `print`
- 🧾 **Run logs** — every CLI run records stats and status in `scrape_logs`

---

## Install

```bash
pip install --upgrade et-scraper-safe
```

Requires Python 3.9+.

---

## Quick start

### 1. As a command-line tool

```bash
et-scraper-safe
```

This runs the full pipeline:

```
RSS Collector → Dedup → Sentiment + Impact → NSE Stock Mapping → CSV + SQLite + run log
```

Outputs:
- `data/raw/et_news_raw_*.csv`
- `data/clean/et_news_clean_*.csv`
- `data/news.db` (SQLite, 5 tables)

### 2. As a Python library

```python
from et_scraper import (
    fetch_all_rss_news,
    analyze_sentiment,
    map_news_to_stock,
    load_default_stocks,
    init_db,
    upsert_clean,
    upsert_sentiment,
    upsert_stock_map,
)
import pandas as pd

# 1. Fetch + dedup (returns a DataFrame keyed by url_hash)
df = fetch_all_rss_news()

# 2. Structured sentiment per headline (reset the index so concat aligns)
df = df.reset_index(drop=True)
sent = pd.DataFrame(df["title"].apply(analyze_sentiment).tolist())
df = pd.concat([df, sent], axis=1)

# 3. Map headlines to NSE symbols
stocks = load_default_stocks()
df["symbols"] = df["title"].apply(lambda t: map_news_to_stock(t, stocks))

# 4. Persist to SQLite
init_db()
upsert_clean(df)
upsert_sentiment(df)
upsert_stock_map(df)
```
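
If you also want the CLI's timestamped CSV output, `save_dataframe` (see the Public API below) writes a DataFrame to a folder; the folder/name pair here mirrors the CLI's output layout:

```python
from et_scraper.storage import save_dataframe

# Writes a timestamped CSV, e.g. data/clean/et_news_clean_*.csv
save_dataframe(df, "data/clean", "et_news_clean")
```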

---

## Architecture

```
ET / Moneycontrol / NSE
        ↓
RSS Collector            (et_scraper.rss_collector)
        ↓
HTML Article Collector   (et_scraper.html_collector — robots-aware, retrying)
        ↓
Cleaner + Dedup          (et_scraper.dedup — SHA-1 url_hash)
        ↓
Stock Mapper             (et_scraper.stock_mapper)
        ↓
Sentiment Engine         (et_scraper.sentiment — Level 1: keyword)
        ↓
Impact Scorer            (et_scraper.sentiment.analyze_sentiment)
        ↓
Database                 (et_scraper.database — SQLite)
        ↓
Swing Trading Signal Engine  ← your code
```

### Suggested swing-trading scoring

```python
final_score = (
    technical_score      * 0.40 +
    news_sentiment_score * 0.25 +
    news_impact_score    * 0.15 +
    volume_score         * 0.10 +
    sector_score         * 0.10
)
```

`et-scraper-safe` provides the two news inputs; the rest comes from your
technical + market data pipeline.
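
As a sketch, starting from the pipeline DataFrame built in the quick start, the two news inputs could be derived like this (the numeric mappings are illustrative, not part of the library):

```python
# Illustrative mappings; tune the values for your own strategy
SENTIMENT_MAP = {"Bullish": 1.0, "Neutral": 0.0, "Bearish": -1.0}
IMPACT_MAP = {"High": 1.0, "Medium": 0.5, "Low": 0.2}

# Weight each headline's sentiment by the engine's 0-100 confidence
df["news_sentiment_score"] = df["sentiment"].map(SENTIMENT_MAP) * df["confidence"] / 100.0
df["news_impact_score"] = df["impact"].map(IMPACT_MAP)

# Roll up per NSE symbol, e.g. the mean over recent headlines
per_symbol = (
    df.explode("symbols")
      .groupby("symbols")[["news_sentiment_score", "news_impact_score"]]
      .mean()
)
```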

---

## Sentiment engine roadmap

| Level | Engine                  | Status                           |
| ----- | ----------------------- | -------------------------------- |
| 1     | Keyword lexicon         | ✅ Built in (zero ML deps)        |
| 2     | FinBERT                 | Wrap `analyze_sentiment` shape   |
| 3     | LLM sentiment           | Wrap `analyze_sentiment` shape   |
| 4     | Market-impact model     | Wrap `analyze_sentiment` shape   |

The `analyze_sentiment` return shape is intentionally stable so you can
swap engines without changing downstream code:

```python
{
  "sentiment":  "Bullish" | "Bearish" | "Neutral",
  "impact":     "High" | "Medium" | "Low",
  "confidence": 0-100,
  "reason":     "Short explanation",
  "score":      int,
}
```
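
A Level 2/3 engine only needs to return the same keys. A minimal sketch with a stub classifier standing in for FinBERT or an LLM (the stub is hypothetical, not part of this library):

```python
def stub_model(text: str) -> tuple[str, float]:
    """Hypothetical classifier; swap in FinBERT or an LLM call here."""
    t = text.lower()
    if "surge" in t or "record high" in t:
        return "positive", 0.9
    if "crash" in t or "plunge" in t:
        return "negative", 0.9
    return "neutral", 0.5

def analyze_sentiment_v2(text: str) -> dict:
    """Drop-in engine that keeps the analyze_sentiment return shape."""
    label, prob = stub_model(text)
    sentiment = {"positive": "Bullish", "negative": "Bearish"}.get(label, "Neutral")
    return {
        "sentiment": sentiment,
        "impact": "High" if prob >= 0.85 else "Medium" if prob >= 0.6 else "Low",
        "confidence": int(prob * 100),
        "reason": f"model label={label}, p={prob:.2f}",
        # score is an int by contract; a signed proxy works for model engines
        "score": {"Bullish": 1, "Bearish": -1}.get(sentiment, 0),
    }
```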

---

## Categories collected

| Category       | Source                       |
| -------------- | ---------------------------- |
| `latest`       | Top stories RSS              |
| `markets`      | Markets RSS                  |
| `stocks`       | Stocks RSS                   |
| `economy`      | Economy RSS                  |
| `business`     | Company / business RSS       |
| `ipo`          | IPO RSS                      |
| `mutual_funds` | Mutual funds RSS             |
| `commodities`  | Commodities RSS              |
| `forex`        | Forex RSS                    |

Feed URLs live in [`et_scraper/config.py`](et_scraper/config.py) and can be extended.
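
The exact structure inside `config.py` isn't shown here; assuming the feeds are stored in a category-to-URL mapping (the `RSS_FEEDS` name below is hypothetical), adding one might look like:

```python
from et_scraper import config

# Hypothetical attribute name; check et_scraper/config.py for the real one
config.RSS_FEEDS["tech"] = "https://example.com/tech.rss"  # illustrative URL
```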

---

## SQLite schema

`init_db()` creates these tables in `data/news.db` (path configurable):

| Table              | Key                       | Purpose                              |
| ------------------ | ------------------------- | ------------------------------------ |
| `news_raw`         | —                         | Append-only raw fetched rows         |
| `news_clean`       | `url_hash` PK             | Deduplicated, parsed news rows       |
| `stock_news_map`   | (`url_hash`, `symbol`) PK | Many-to-many news ↔ stock mapping    |
| `sentiment_scores` | `url_hash` PK             | sentiment, impact, confidence, score |
| `scrape_logs`      | `run_id` PK               | Per-run stats and status             |

Indexes on `news_clean.published_at` and `stock_news_map.symbol`.
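
Since the tables share `url_hash`, rollups are simple joins. A query sketch using the documented keys and columns (the DDL itself is owned by `init_db()`):

```python
import sqlite3
import pandas as pd

con = sqlite3.connect("data/news.db")

# Per-symbol sentiment rollup: join the news-to-stock map to its sentiment rows
query = """
SELECT m.symbol,
       s.sentiment,
       COUNT(*) AS headlines
FROM stock_news_map AS m
JOIN sentiment_scores AS s ON s.url_hash = m.url_hash
GROUP BY m.symbol, s.sentiment
ORDER BY headlines DESC
"""
print(pd.read_sql_query(query, con))
con.close()
```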

---

## DataFrame columns (after the full pipeline)

| Column            | Description                                          |
| ----------------- | ---------------------------------------------------- |
| `url_hash`        | SHA-1 of the article URL (dedup key)                 |
| `source`          | Always `"economic_times"`                            |
| `category`        | One of the categories above                          |
| `title`           | Article headline                                     |
| `summary`         | RSS summary / description                            |
| `link`            | Canonical article URL                                |
| `published`       | Raw publish string from the feed                     |
| `published_at`    | UTC ISO publish timestamp (parsed)                   |
| `fetched_at`      | UTC ISO timestamp when the row was collected         |
| `sentiment`       | `Bullish` / `Bearish` / `Neutral`                    |
| `impact`          | `High` / `Medium` / `Low`                            |
| `confidence`      | 0–100 heuristic confidence                           |
| `reason`          | Short explanation of which signals fired             |
| `score`           | Integer = positive_word_count − negative_word_count  |
| `symbols`         | List of NSE symbols mentioned in the title           |
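
`url_hash` is documented as the SHA-1 of the article URL; `et_scraper.dedup.url_hash` is the canonical helper, and the computation is presumably equivalent to:

```python
import hashlib

def sha1_of_url(url: str) -> str:
    # Presumed equivalent of et_scraper.dedup.url_hash (SHA-1 hex of the link)
    return hashlib.sha1(url.encode("utf-8")).hexdigest()
```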

---

## Public API

| Symbol                                       | Module                       |
| -------------------------------------------- | ---------------------------- |
| `fetch_all_rss_news() -> DataFrame`          | `et_scraper.rss_collector`   |
| `analyze_sentiment(text) -> dict`            | `et_scraper.sentiment`       |
| `sentiment_score(text) -> int`               | `et_scraper.sentiment`       |
| `sentiment_label(score) -> str`              | `et_scraper.sentiment`       |
| `map_news_to_stock(title, stocks_df=None)`   | `et_scraper.stock_mapper`    |
| `load_default_stocks() -> DataFrame`         | `et_scraper.stock_mapper`    |
| `url_hash(url) -> str`                       | `et_scraper.dedup`           |
| `drop_duplicates(df, subset='url_hash')`     | `et_scraper.dedup`           |
| `create_session(...) -> requests.Session`   | `et_scraper.http_session`    |
| `init_db(db_path=...)`                       | `et_scraper.database`        |
| `save_to_sqlite(df, db_path=..., table=...)` | `et_scraper.database`        |
| `upsert_clean(df)` / `upsert_sentiment(df)`  | `et_scraper.database`        |
| `upsert_stock_map(df)` / `log_scrape(...)`   | `et_scraper.database`        |
| `save_dataframe(df, folder, name)`           | `et_scraper.storage`         |
| `get_logger(name)`                           | `et_scraper.logging_setup`   |

Lower-level helpers:

| Symbol                                          | Module                      |
| ----------------------------------------------- | --------------------------- |
| `can_fetch(url, user_agent="*") -> bool`        | `et_scraper.robots_checker` |
| `fetch_public_page(url) -> BeautifulSoup\|None` | `et_scraper.html_collector` |
| `extract_headlines(soup) -> list[str]`          | `et_scraper.parser`         |
| `extract_article_text(soup) -> str`             | `et_scraper.parser`         |

---

## Project layout

```
et_scraper_safe/
├── pyproject.toml          # PyPI packaging metadata
├── LICENSE                 # MIT
├── README.md
├── requirements.txt        # For running from source
├── main.py                 # Convenience runner (same as the CLI)
├── et_scraper/
│   ├── __init__.py
│   ├── cli.py              # Entry point for `et-scraper-safe`
│   ├── config.py           # RSS feed URLs, headers, timeouts
│   ├── http_session.py     # Retry/backoff requests.Session
│   ├── logging_setup.py    # Centralized logger
│   ├── robots_checker.py   # robots.txt enforcement
│   ├── rss_collector.py    # RSS → DataFrame + dedup + date parsing
│   ├── html_collector.py   # Polite, robots-aware HTML fetcher
│   ├── parser.py           # Headline / article-text extraction
│   ├── sentiment.py        # Lexicon sentiment + impact + confidence
│   ├── stock_mapper.py     # Headline → NSE symbol mapping
│   ├── dedup.py            # url_hash + drop_duplicates
│   ├── storage.py          # Timestamped CSV writer
│   └── database.py         # SQLite schema + upserts + run logs
└── data/
    ├── raw/                # Raw scraped CSVs
    ├── clean/              # Cleaned + scored CSVs
    └── news.db             # SQLite store (created on first run)
```

---

## Development

Run from source:

```bash
git clone <your-fork>
cd et_scraper_safe
pip install -r requirements.txt
python main.py
```

Build + publish a new version (maintainers only):

```bash
# 1. Bump version in pyproject.toml AND et_scraper/__init__.py
# 2. Build and upload
rm -rf dist build *.egg-info
python -m build
TWINE_USERNAME=__token__ TWINE_PASSWORD="$PYPI_API_TOKEN" python -m twine upload dist/*
```

---

## Disclaimer

This library only collects data that the Economic Times publishes openly via
RSS or pages allowed by their `robots.txt`. It is intended for personal
research and educational use. **You** are responsible for complying with the
Economic Times' Terms of Service and any applicable laws when using this
library or the data it collects.

---

## License

MIT — see [LICENSE](LICENSE).
