Metadata-Version: 2.4
Name: osint-news-deamon-pkg
Version: 0.1.0
Summary: Async news scraper for Nigerian and International news.
Author-email: Chukwudi Prince <chukwudinwokeke@initsng.com>
License: MIT
Project-URL: Homepage, https://bitbucket.org/inits/osint-news-cat-pkg/src
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: aiohttp>=3.8.0
Requires-Dist: beautifulsoup4>=4.11.0
Requires-Dist: selenium>=4.10.0
Requires-Dist: webdriver-manager>=4.0.0
Requires-Dist: playwright>=1.35.0

# OSINT News Deamon Package

An asynchronous Open Source Intelligence (OSINT) tool that scrapes news articles from major Nigerian and international news outlets. The package uses `aiohttp` for static pages and `selenium` and `playwright` for dynamic (JavaScript-heavy) websites, so both kinds of site can be scraped concurrently.

## Supported Outlets

| News Outlet | Method | Key Features |
|:---|:---|:---|
| **BBC News** | `aiohttp` | Fast, lightweight, static scraping. |
| **CNN** | `aiohttp` | Fast, lightweight, static scraping. |
| **Arise TV** | `aiohttp` | Fast, lightweight, static scraping. |
| **TVC News** | `aiohttp` | Fast, lightweight, static scraping. |
| **Punch NG** | `Selenium` | Handles dynamic content & anti-bot checks. |
| **Business Day** | `Playwright` | Handles complex JS rendering & search results. |

## Installation

### Install the Package
You can install the package directly from PyPI (once published) or from the source:

```bash
pip install osint-news-deamon-pkg
```

## Usage

### Basic Usage (Fast Scrapers)
For outlets like BBC, CNN, Arise TV, and TVC, the scrapers use `aiohttp` only. No browser is launched, which makes them the fastest option.

```python
import asyncio
from osint_news_deamon_pkg import BBCTVScraper, AriseTvScraper

async def main():
    # 1. BBC News
    print("--- Scraping BBC ---")
    bbc = BBCTVScraper()
    bbc_results = await bbc.scrape(keyword="election", max_pages=1)
    for article in bbc_results[:3]:
        print(f"[BBC] {article['title']} - {article['page_link']}")

    # 2. Arise TV
    print("\n--- Scraping Arise TV ---")
    arise = AriseTvScraper()
    arise_results = await arise.scrape(keyword="economy", max_pages=1)
    for article in arise_results[:3]:
        print(f"[Arise] {article['title']} - {article['page_link']}")

if __name__ == "__main__":
    asyncio.run(main())

```
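The example above awaits each scraper in turn. Since `scrape()` is awaited as a coroutine, independent scrapers could also be run side by side with `asyncio.gather`. The following is a minimal sketch of that pattern, not the package's own code: `StubScraper` and `scrape_all` are hypothetical names, and the stub only simulates latency so the snippet runs without network access. It assumes the real scraper classes expose the `scrape(keyword=..., max_pages=...)` call shown above.

```python
import asyncio


class StubScraper:
    """Hypothetical stand-in for a scraper class; not part of the package."""

    def __init__(self, outlet):
        self.outlet = outlet

    async def scrape(self, keyword, max_pages=1):
        # Simulate network latency instead of issuing real requests.
        await asyncio.sleep(0.01)
        return [{"title": f"{self.outlet}: {keyword}", "page_link": "https://example.com/"}]


async def scrape_all(scrapers, keyword):
    """Run every scraper concurrently; a scraper that raises yields an empty list."""
    results = await asyncio.gather(
        *(s.scrape(keyword=keyword, max_pages=1) for s in scrapers),
        return_exceptions=True,  # collect errors instead of aborting the whole batch
    )
    return [[] if isinstance(r, Exception) else r for r in results]


if __name__ == "__main__":
    batches = asyncio.run(scrape_all([StubScraper("BBC"), StubScraper("Arise")], "election"))
    for batch in batches:
        for article in batch:
            print(article["title"])
```

With the real package, the stubs would be replaced by instances such as `BBCTVScraper()` and `AriseTvScraper()`. Results come back in the same order as the scrapers were passed in.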

### Advanced Usage (Browser-Based Scrapers)
For outlets like Punch NG and Business Day, the scrapers launch a headless browser engine.

```python
import asyncio
from osint_news_deamon_pkg import PunchNGScraper, BusinessDayScraper

async def scrape_dynamic():
    # 1. Punch NG (Uses Selenium)
    # Note: Requires Chrome installed
    print("--- Scraping Punch NG ---")
    punch = PunchNGScraper(headless=True)
    # Supports date filtering: "DD MMM, YYYY"
    punch_results = await punch.scrape(
        query="politics", 
        max_pages=1,
        from_date="01 Jan, 2024",
        to_date="20 Dec, 2024"
    )
    for article in punch_results[:3]:
        print(f"[Punch] {article.get('title', 'No Title')} - {article.get('url')}")

    # 2. Business Day (Uses Playwright)
    print("\n--- Scraping Business Day ---")
    bd = BusinessDayScraper()
    bd_results = await bd.scrape(query="finance", max_pages=1)
    for article in bd_results[:3]:
        print(f"[BusinessDay] {article['title']} - {article['url']}")

if __name__ == "__main__":
    asyncio.run(scrape_dynamic())

```
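In the examples, the `aiohttp` scrapers return the article link under `page_link`, while the browser-based scrapers return it under `url`. When combining results from both kinds, it may help to normalize them first. The sketch below is one way to do that; `article_link` and `merge_results` are hypothetical helpers, not part of the package, and they assume each result is a dict with the keys used in the examples.

```python
def article_link(article):
    """Return the article URL regardless of which key the scraper used."""
    # Assumption: 'page_link' (aiohttp scrapers) or 'url' (browser-based scrapers),
    # as read in the usage examples.
    return article.get("page_link") or article.get("url")


def merge_results(*batches):
    """Flatten several result lists, skipping entries with no link or a repeated link."""
    seen = set()
    merged = []
    for batch in batches:
        for article in batch:
            link = article_link(article)
            if link is None or link in seen:
                continue
            seen.add(link)
            merged.append({"title": article.get("title", "No Title"), "link": link})
    return merged


if __name__ == "__main__":
    bbc = [{"title": "A", "page_link": "https://example.com/a"}]
    punch = [{"title": "B", "url": "https://example.com/b"}]
    for item in merge_results(bbc, punch):
        print(item["title"], item["link"])
```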

## Configuration
Most scrapers accept the following parameters in their `.scrape()` method:

- `keyword` / `query`: The search term. The usage examples pass `keyword` to the `aiohttp` scrapers and `query` to the browser-based ones.
- `max_pages`: Number of result pages to traverse (default: 1).
- `from_date`: Start date filter (format: `"DD MMM, YYYY"`, e.g., `"01 Jan, 2025"`).
- `to_date`: End date filter (format: `"DD MMM, YYYY"`).
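Date filters are passed as strings, so a `datetime.date` has to be formatted before use. The helper below is a small sketch of producing the `"DD MMM, YYYY"` form; `to_scraper_date` is a hypothetical name, not a function from the package, and the English three-letter month abbreviations are an assumption based on the `"01 Jan, 2025"` example. A fixed month table is used rather than `strftime("%b")` so the output does not depend on the system locale.

```python
from datetime import date

# Assumption: months are abbreviated in English, as in "01 Jan, 2025".
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def to_scraper_date(d):
    """Format a date as 'DD MMM, YYYY', e.g. '01 Jan, 2025'."""
    return f"{d.day:02d} {MONTHS[d.month - 1]}, {d.year:04d}"


if __name__ == "__main__":
    print(to_scraper_date(date(2024, 12, 20)))
```

The resulting strings could then be passed as `from_date` and `to_date` to a scraper that supports date filtering.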

## Requirements
- Python 3.8+
- Google Chrome (for Selenium)
- Playwright Chromium (install via `playwright install chromium`)
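Before a first run, it can be useful to confirm that the Python-side requirements are in place. The following sketch checks the interpreter version and whether the dependencies can be imported. `missing_modules` and `python_ok` are hypothetical helpers, not part of the package. The check does not verify that Google Chrome or the Playwright Chromium build is installed; those still need to be set up separately.

```python
import importlib.util
import sys

# Import names of the declared dependencies (beautifulsoup4 is imported as "bs4",
# webdriver-manager as "webdriver_manager").
REQUIRED_MODULES = ("aiohttp", "bs4", "selenium", "webdriver_manager", "playwright")


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_ok(minimum=(3, 8)):
    """Return True when the running interpreter meets the minimum version."""
    return sys.version_info >= minimum


if __name__ == "__main__":
    print("Python version OK:", python_ok())
    print("Missing modules:", missing_modules() or "none")
```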

## Disclaimer
This tool is intended for educational purposes and legitimate Open Source Intelligence (OSINT) research. Users are responsible for adhering to the Terms of Service and robots.txt policies of the target websites.
