Metadata-Version: 2.4
Name: webc
Version: 0.1.1
Summary: Treat websites as programmable objects (Wikipedia-Locked Beta)
Author: Ashwin Prasanth
License: MIT
Project-URL: Homepage, https://github.com/ashtwin2win-Z/WebC
Project-URL: Bug Tracker, https://github.com/ashtwin2win-Z/WebC/issues
Keywords: web,scraper,automation,wikipedia,resource
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Requires-Dist: requests>=2.28.0
Requires-Dist: beautifulsoup4>=4.11.0

<h1 align="center"> WebC – Treat Websites as Python Objects</h1>

<p align="center">
<img src="https://github.com/ashtwin2win-Z/WebC/raw/main/assets/webc.png" alt="WebC Logo" width="280">
</p>

**Version:** 0.1.1<br>
**Author:** Ashwin Prasanth

---

## Overview

`webc` is a Python library that allows you to treat websites as programmable Python objects.

Instead of manually handling HTTP requests, parsing HTML, and writing repetitive scraping logic, WebC provides a structured, object-oriented interface to access semantic content, query elements, and perform intent-driven tasks.

The goal is simple:

* Make web data feel native to Python
* Provide meaningful abstractions over raw HTML
* Encourage ethical and secure usage by default

---

## ⚠️ Developer Preview / Secure Beta

**WebC v0.1.1** is a developer preview release intended for testing and feedback.

This version prioritizes security, architectural stability, and controlled usage.

APIs may change during the beta phase.

---

## Installation

Install via pip:

```bash
pip install webc
```

### Dependencies

* requests (>= 2.28.0)
* beautifulsoup4 (>= 4.11.0)

---

## Core Architecture

WebC is organized into four conceptual layers.

---

### 1. Resource Layer

Access a webpage as a `Resource` object:

```python
from webc import web

site = web["https://en.wikipedia.org/wiki/Python_(programming_language)"]
```

* Represents a single webpage
* Uses lazy loading (fetches HTML only when needed)
* Caches parsed content internally
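
The lazy-loading and caching behaviour described above can be illustrated with a minimal sketch. `LazyResource` and `fake_fetch` are hypothetical names used only for this illustration; they are not WebC's actual classes, and the fetch function is injected so the example runs without network access.

```python
from typing import Callable, List, Optional


class LazyResource:
    """Hypothetical stand-in for a page object; not WebC's actual class."""

    def __init__(self, url: str, fetch: Callable[[str], str]):
        self.url = url
        self._fetch = fetch                  # injected so the sketch needs no network
        self._html: Optional[str] = None     # nothing is fetched at construction time

    @property
    def html(self) -> str:
        # Fetch on first access only; later accesses reuse the cached copy.
        if self._html is None:
            self._html = self._fetch(self.url)
        return self._html


calls: List[str] = []


def fake_fetch(url: str) -> str:
    """Record each call and return a fixed document instead of doing HTTP."""
    calls.append(url)
    return "<html><title>Example</title></html>"


page = LazyResource("https://en.wikipedia.org/wiki/Example", fake_fetch)
calls_before_access = len(calls)   # still 0: creating the object fetched nothing
first = page.html
second = page.html                 # served from the cache, no second fetch
print(calls_before_access, len(calls))
```

Creating the object costs nothing; the single fetch happens on the first attribute access, which is the same trade-off the Resource Layer makes.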

---

### 2. Structure Layer

Provides semantic, high-level content extracted from the page:

```python
site.structure.title
site.structure.links
site.structure.images
site.structure.tables
```

#### Image Handling

* Extracts from `src`, `srcset`, `data-src`, and `<noscript>`
* Filters UI icons and SVG assets
* Resolves relative URLs automatically

Download images:

```python
site.structure.save_images(folder="python_images")
```
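
As a rough illustration of the extraction rules listed above, the sketch below collects image URLs from `src`, `data-src`, and `srcset`, skips SVG files, and resolves relative URLs. It is an assumption-laden example, not WebC's implementation: `ImageCollector` is a hypothetical name, and it uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra dependencies.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Hypothetical helper: collects absolute image URLs from <img> tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        candidates = [attr.get("src"), attr.get("data-src")]
        # srcset is a comma-separated list of "url descriptor" entries.
        # Splitting on commas is naive but sufficient for this sketch.
        for entry in (attr.get("srcset") or "").split(","):
            parts = entry.split()
            if parts:
                candidates.append(parts[0])
        for raw in candidates:
            if not raw or raw.lower().endswith(".svg"):
                continue  # skip missing attributes and SVG assets
            absolute = urljoin(self.base_url, raw)  # resolve relative URLs
            if absolute not in self.urls:
                self.urls.append(absolute)


html = '''
<img src="//upload.wikimedia.org/a.png" srcset="//upload.wikimedia.org/a-2x.png 2x">
<img data-src="/images/b.jpg">
<img src="/static/icon.svg">
'''
collector = ImageCollector("https://en.wikipedia.org/wiki/Example")
collector.feed(html)
print(collector.urls)
```

The SVG icon is dropped, and the protocol-relative and root-relative URLs are resolved against the page URL.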

#### Table Extraction

* Detects Wikipedia `wikitable` tables
* Handles rowspan and colspan alignment
* Removes citation brackets (e.g., `[1]`)

Save tables as CSV:

```python
site.structure.save_tables(folder="wiki_data")
```
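
The rowspan and colspan alignment mentioned above is the trickiest part of table extraction. The sketch below shows one possible approach under simplifying assumptions: cells are already parsed into `(text, rowspan, colspan)` tuples, and `expand_table` is a hypothetical helper, not part of WebC's API. It also strips numeric citation markers such as `[1]`.

```python
import re
from typing import Dict, List, Tuple

Cell = Tuple[str, int, int]  # (text, rowspan, colspan)

CITATION = re.compile(r"\[\d+\]")  # matches markers like [1] or [23]


def expand_table(rows: List[List[Cell]]) -> List[List[str]]:
    """Expand spanned cells into a rectangular grid of strings."""
    grid: List[List[str]] = []
    pending: Dict[Tuple[int, int], str] = {}  # (row, col) -> text carried down by a rowspan
    for r, row in enumerate(rows):
        out: List[str] = []
        col = 0
        for text, rowspan, colspan in row:
            # Fill columns already occupied by a rowspan from a row above.
            while (r, col) in pending:
                out.append(pending.pop((r, col)))
                col += 1
            clean = CITATION.sub("", text).strip()
            for _ in range(colspan):
                out.append(clean)
                for below in range(1, rowspan):
                    pending[(r + below, col)] = clean
                col += 1
        # Fill any trailing columns occupied from above.
        while (r, col) in pending:
            out.append(pending.pop((r, col)))
            col += 1
        grid.append(out)
    return grid


rows = [
    [("Language", 1, 1), ("Version", 1, 1), ("Year[1]", 1, 1)],
    [("Python", 2, 1), ("2.0", 1, 1), ("2000", 1, 1)],
    [("3.0", 1, 1), ("2008[2]", 1, 1)],
]
grid = expand_table(rows)
for line in grid:
    print(line)
```

Each row of the resulting grid has the same number of columns, so it can be written directly with `csv.writer`.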

---

### 3. Query Layer

Provides direct DOM access via CSS selectors:

```python
headings = site.query["h1, h2"]

for h in headings:
    print(h.get_text(strip=True))
```

* Returns BeautifulSoup elements
* Useful for custom extraction logic
* Acts as an advanced access layer

---

### 4. Task Layer

Provides intent-driven actions:

```python
summary = site.task.summarize(max_chars=500)
print(summary)
```

Currently supported:

* `summarize(max_chars=500)`

More tasks will be introduced in future releases.
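
How `summarize` chooses its cut-off point is not documented here. As a hedged illustration of what a `max_chars` limit can mean in practice, the sketch below truncates text at the last complete sentence that fits, falling back to the last whole word. `truncate_summary` is a hypothetical helper and may differ from what WebC actually does.

```python
def truncate_summary(text: str, max_chars: int = 500) -> str:
    """Hypothetical helper: shorten text to at most max_chars characters."""
    text = " ".join(text.split())  # collapse runs of whitespace and newlines
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # Prefer ending on a complete sentence, otherwise on a whole word.
    end = cut.rfind(". ")
    if end != -1:
        return cut[: end + 1]
    end = cut.rfind(" ")
    return cut[:end] if end != -1 else cut


sample = "Python is a language. It is widely used. It emphasizes readability."
short = truncate_summary(sample, max_chars=45)
print(short)
```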

---

## Security & Usage Policy

This secure beta is intentionally restricted.

### Platform Restrictions

* Locked to **Wikipedia.org only**
* Only **HTTPS URLs** are allowed
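
A check of this kind can be sketched with the standard library alone. The function below is an illustration under stated assumptions, not WebC's actual validation code: `is_allowed_url` is a hypothetical name, and it treats `wikipedia.org` and its subdomains as the allowed hosts.

```python
from urllib.parse import urlsplit


def is_allowed_url(url: str) -> bool:
    """Hypothetical check: HTTPS only, host must be wikipedia.org or a subdomain."""
    try:
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
    except ValueError:
        return False  # malformed URL
    if parts.scheme != "https":
        return False
    if parts.username or parts.password:
        return False  # reject credentials embedded in the URL
    # Compare the parsed hostname, never a substring of the raw URL.
    return host == "wikipedia.org" or host.endswith(".wikipedia.org")


for candidate in (
    "https://en.wikipedia.org/wiki/Python_(programming_language)",
    "http://en.wikipedia.org/wiki/Python",
    "https://wikipedia.org.evil.example/",
    "https://en.wikipedia.org@evil.example/",
):
    print(candidate, is_allowed_url(candidate))
```

Matching on the parsed hostname rather than on the raw string is what rejects look-alike hosts such as `wikipedia.org.evil.example`.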

### Built-in Protections

WebC includes safeguards against:

* SSRF attacks
* Path traversal
* Unsafe file writes
* Excessive downloads

Requests are controlled and content is cached to prevent unnecessary repeated fetching.
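
To make the path traversal and unsafe file write protections concrete, here is a minimal sketch of one common defence: keep only the final component of a remote filename, then confirm the resolved path still lies inside the target folder. `safe_target` is a hypothetical helper written for this illustration; WebC's internal checks may work differently.

```python
from pathlib import Path


def safe_target(folder: str, filename: str) -> Path:
    """Hypothetical helper: resolve filename inside folder or raise ValueError."""
    base = Path(folder).resolve()
    # Keep only the final path component, dropping any directory parts.
    name = Path(filename.replace("\\", "/")).name
    if name in ("", ".", ".."):
        raise ValueError(f"unsafe filename: {filename!r}")
    target = (base / name).resolve()
    # Defence in depth: the resolved path must still live under base.
    if base not in target.parents:
        raise ValueError(f"path escapes {base}: {filename!r}")
    return target


target = safe_target("python_images", "../../etc/passwd")
print(target.name)  # only the final component of the supplied path is kept
```

The directory parts of a hostile name are discarded, so the file can only land inside the chosen folder.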

---

## Responsible Use

WebC is designed for:

* ✔ Educational purposes
* ✔ Research
* ✔ Personal automation
* ✔ Ethical data access

It must not be used for:

* Mass scraping
* Circumventing website policies
* Service disruption
* Data abuse

Users are responsible for complying with website Terms of Service.

---

## Full Usage Example

```python
from webc import web

url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
site = web[url]

print("=== STRUCTURE ===")
print(f"Title: {site.structure.title}")
print(f"Total Links: {len(site.structure.links)}")
print(f"First 5 links: {site.structure.links[:5]}")

print("\n--- Downloading Resources ---")
site.structure.save_images(folder="python_images")
site.structure.save_tables(folder="python_data")

print("\n=== QUERY ===")
headings = site.query["h1, h2"]
print(f"Found {len(headings)} headings:")

for h in headings[:3]:
    print(f" - {h.get_text(strip=True)}")

print("\n=== TASK ===")
summary = site.task.summarize(max_chars=500)
print(summary)
```

---

## Roadmap

Planned future improvements:

* Multi-domain support
* Advanced rate limiting
* Enhanced security layers
* Plugin-based task extensions
* Dataset export helpers
* Cloud-safe scraping mode

---

## License

MIT License<br>
© Ashwin Prasanth
