Metadata-Version: 2.4
Name: et-scraper-safe
Version: 0.2.0
Summary: A polite, RSS-first Economic Times news and sentiment collector for market research.
Author: et_scraper_safe contributors
License: MIT
Project-URL: Homepage, https://github.com/yourusername/et-scraper-safe
Project-URL: Issues, https://github.com/yourusername/et-scraper-safe/issues
Keywords: economic-times,news,sentiment,scraper,rss,finance,swing-trading
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Office/Business :: Financial :: Investment
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.31
Requires-Dist: beautifulsoup4>=4.12
Requires-Dist: pandas>=2.0
Requires-Dist: feedparser>=6.0
Requires-Dist: python-dateutil>=2.8
Requires-Dist: lxml>=4.9
Dynamic: license-file

# et-scraper-safe

[![PyPI version](https://img.shields.io/pypi/v/et-scraper-safe.svg)](https://pypi.org/project/et-scraper-safe/)
[![Python versions](https://img.shields.io/pypi/pyversions/et-scraper-safe.svg)](https://pypi.org/project/et-scraper-safe/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

A polite, **RSS-first** Python library for collecting public Economic Times
news headlines and tagging them with simple sentiment — built for market and
swing-trading research pipelines.

> **PyPI:** https://pypi.org/project/et-scraper-safe/

---

## Why "safe"?

This library is designed to be a good citizen of the web:

- ✅ Respects `robots.txt` before fetching any HTML page
- ✅ Prefers RSS feeds over HTML scraping
- ✅ Adds a configurable delay between HTML requests
- ✅ Sends a clear, identifiable `User-Agent`
- ❌ Does **not** bypass logins, paywalls, captchas, Cloudflare, or rate limits
- ❌ Does **not** scrape any content the publisher has restricted

If `robots.txt` disallows a URL, the request is skipped — full stop.
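
The rule check itself behaves like Python's standard `urllib.robotparser`. A minimal, network-free sketch of what that gate does (the actual implementation lives in `et_scraper/robots_checker.py` and may differ):

```python
from urllib.robotparser import RobotFileParser

def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against already-fetched robots.txt rules (no network)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

rules = "User-agent: *\nDisallow: /private/\n"
print(is_allowed(rules, "https://example.com/private/page"))  # False
print(is_allowed(rules, "https://example.com/news"))          # True
```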

---

## Install

```bash
pip install et-scraper-safe
```

Requires Python 3.9+.

---

## Quick start

### 1. As a command-line tool

After installation, the `et-scraper-safe` console command is available:


```bash
et-scraper-safe
```

This will:
1. Fetch all configured Economic Times RSS feeds.
2. Score each headline's sentiment.
3. Save raw + cleaned CSVs into `./data/raw/` and `./data/clean/`.
4. Print a summary of bullish / bearish / neutral counts.

### 2. As a Python library

```python
from et_scraper import (
    fetch_all_rss_news,
    sentiment_score,
    sentiment_label,
    save_dataframe,
)

df = fetch_all_rss_news()
df["sentiment_score"] = df["title"].apply(sentiment_score)
df["sentiment_label"] = df["sentiment_score"].apply(sentiment_label)

print(df.head())
save_dataframe(df, folder="data/clean", name="et_news_clean")
```

---

## Categories collected

| Category       | Source                                              |
| -------------- | --------------------------------------------------- |
| `latest`       | Top stories RSS                                     |
| `markets`      | Markets RSS                                         |
| `stocks`       | Stocks RSS                                          |
| `economy`      | Economy RSS                                         |
| `business`     | Company / business RSS                              |
| `ipo`          | IPO RSS                                             |
| `mutual_funds` | Mutual funds RSS                                    |
| `commodities`  | Commodities RSS                                     |
| `forex`        | Forex RSS                                           |

Feed URLs are defined in [`et_scraper/config.py`](et_scraper/config.py) and can be extended.

---

## Output schema

Each row of the returned `pandas.DataFrame` has:

| Column            | Description                                          |
| ----------------- | ---------------------------------------------------- |
| `source`          | Always `"economic_times"`                            |
| `category`        | One of the categories above                          |
| `title`           | Article headline                                     |
| `summary`         | RSS summary / description                            |
| `link`            | Canonical article URL                                |
| `published`       | Publish timestamp from the feed                      |
| `fetched_at`      | UTC ISO timestamp when the row was collected         |
| `sentiment_score` | Integer = positive_word_count − negative_word_count  |
| `sentiment_label` | `"Bullish"`, `"Bearish"`, or `"Neutral"`             |

Example:

```csv
source,category,title,summary,link,published,fetched_at,sentiment_score,sentiment_label
economic_times,stocks,Tata Motors shares rally...,...,link,...,...,2,Bullish
economic_times,economy,Rupee falls against dollar...,...,link,...,...,-1,Bearish
```

---

## Public API

| Symbol                                | What it does                                                |
| ------------------------------------- | ----------------------------------------------------------- |
| `fetch_all_rss_news() -> DataFrame`   | Fetch all configured RSS feeds into a DataFrame.            |
| `sentiment_score(text: str) -> int`   | Lexicon-based score: `positive − negative` word counts.     |
| `sentiment_label(score: int) -> str`  | Map a score to `"Bullish"` / `"Bearish"` / `"Neutral"`.     |
| `save_dataframe(df, folder, name)`    | Save a DataFrame to a timestamped CSV; returns the path.    |
| `__version__`                         | Library version string.                                     |
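
The scoring rule is simple enough to restate in a few lines. This is an illustrative re-implementation, not the shipped code: the bundled lexicon in `et_scraper/sentiment.py` is larger, these word lists are placeholders, and the exact label thresholds are an assumption.

```python
# Placeholder lexicons; the real ones live in et_scraper/sentiment.py.
POSITIVE = {"rally", "surge", "gain", "jump", "record"}
NEGATIVE = {"fall", "drop", "loss", "slump", "crash"}

def sentiment_score(text: str) -> int:
    """positive_word_count - negative_word_count over whitespace tokens."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def sentiment_label(score: int) -> str:
    """Map a score to a label (thresholds here are an assumption)."""
    if score > 0:
        return "Bullish"
    if score < 0:
        return "Bearish"
    return "Neutral"

print(sentiment_score("Sensex extends rally as banks gain"))  # 2
```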

Lower-level helpers (use only if you really need raw HTML):

| Symbol                                          | Module                  |
| ----------------------------------------------- | ----------------------- |
| `can_fetch(url, user_agent="*") -> bool`        | `et_scraper.robots_checker` |
| `fetch_public_page(url) -> BeautifulSoup\|None` | `et_scraper.html_collector` |
| `extract_headlines(soup) -> list[str]`          | `et_scraper.parser`     |
| `extract_article_text(soup) -> str`             | `et_scraper.parser`     |

---

## Use in a swing trading pipeline

```
Economic Times News (this library)
        ↓
Headline Sentiment (this library)
        ↓
Stock Symbol Mapping (your code)
        ↓
Technical Indicators (your code)
        ↓
Final Swing Score (your code)
```
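
The "Stock Symbol Mapping" stage marked *your code* can start out as a plain keyword lookup over headlines. A naive sketch, with a purely illustrative two-entry keyword table:

```python
from typing import Optional

import pandas as pd

# Illustrative keyword -> NSE symbol table; building a real one is up to you.
KEYWORD_TO_SYMBOL = {"tata motors": "TATAMOTORS", "infosys": "INFY"}

def map_symbol(title: str) -> Optional[str]:
    """Return the first symbol whose keyword appears in the headline."""
    lowered = title.lower()
    for keyword, symbol in KEYWORD_TO_SYMBOL.items():
        if keyword in lowered:
            return symbol
    return None

df = pd.DataFrame({"title": ["Tata Motors shares rally on strong sales",
                             "Rupee falls against dollar"]})
df["symbol"] = df["title"].apply(map_symbol)
print(df)
```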

---

## Project layout

```
et_scraper_safe/
├── pyproject.toml          # PyPI packaging metadata
├── LICENSE                 # MIT
├── README.md
├── requirements.txt        # For running from source
├── main.py                 # Convenience runner (same as the CLI)
├── et_scraper/
│   ├── __init__.py
│   ├── cli.py              # Entry point for `et-scraper-safe` console command
│   ├── config.py           # RSS feed URLs, headers, timeouts
│   ├── robots_checker.py   # robots.txt enforcement
│   ├── rss_collector.py    # RSS → DataFrame
│   ├── html_collector.py   # Polite, robots-aware HTML fetcher
│   ├── parser.py           # Headline / article-text extraction
│   ├── sentiment.py        # Lexicon-based sentiment
│   └── storage.py          # Timestamped CSV writer
└── data/
    ├── raw/                # Raw scraped CSVs
    └── clean/              # Cleaned + scored CSVs
```

---

## Development

Run from source:

```bash
git clone <your-fork>
cd et_scraper_safe
pip install -r requirements.txt
python main.py
```

Build + publish a new version (maintainers only):

```bash
# 1. Bump version in pyproject.toml and et_scraper/__init__.py
# 2. Build and upload
rm -rf dist build *.egg-info
python -m build
TWINE_USERNAME=__token__ TWINE_PASSWORD="$PYPI_API_TOKEN" python -m twine upload dist/*
```

---

## Disclaimer

This library only collects data that the Economic Times publishes openly via
RSS or pages allowed by their `robots.txt`. It is intended for personal
research and educational use. **You** are responsible for complying with the
Economic Times' Terms of Service and any applicable laws when using this
library or the data it collects.

---

## License

MIT — see [LICENSE](LICENSE).
