Metadata-Version: 2.4
Name: pydantic-scrape
Version: 0.2.0
Summary: Advanced web automation framework with AI-powered agents, Chawan terminal browser integration, and geographic search targeting
Author: Pydantic Scrape Contributors
License: MIT
Project-URL: Homepage, https://github.com/philmade/pydantic_scrape
Project-URL: Repository, https://github.com/philmade/pydantic_scrape
Project-URL: Issues, https://github.com/philmade/pydantic_scrape/issues
Keywords: scraping,web-scraping,pydantic,chawan,automation,browser-automation,ai-agents,geographic-search
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Internet :: WWW/HTTP :: Browsers
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Markup :: HTML
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: pydantic-ai>=0.2.11
Requires-Dist: pydantic-graph>=0.2.11
Requires-Dist: loguru>=0.7.3
Requires-Dist: python-dotenv
Requires-Dist: httpx>=0.24.0
Requires-Dist: platformdirs>=3.0.0
Requires-Dist: camoufox>=0.4.11
Requires-Dist: newspaper3k>=0.2.8
Requires-Dist: beautifulsoup4>=4.13.0
Requires-Dist: lxml>=4.9.0
Requires-Dist: lxml_html_clean>=0.1.0
Requires-Dist: pyalex>=0.17
Requires-Dist: habanero>=1.2.6
Requires-Dist: goose3>=3.1.19
Requires-Dist: PyMuPDF>=1.25.0
Requires-Dist: python-docx>=1.1.0
Requires-Dist: EbookLib>=0.19
Requires-Dist: yt-dlp>=2023.12.30
Requires-Dist: google-generativeai>=0.3.0
Requires-Dist: openai>=1.0.0
Requires-Dist: rapidfuzz
Provides-Extra: all
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: build>=1.0.0; extra == "dev"
Requires-Dist: twine>=4.0.0; extra == "dev"

# Pydantic Scrape

A modern web scraping framework that combines AI-powered content extraction with intelligent workflow orchestration. Built on pydantic-ai for reliable, type-safe scraping operations.

## Why Pydantic Scrape?

Web scraping is complex. You need to handle dynamic content, extract meaningful information, and orchestrate multi-step workflows. Most tools force you to choose between simple scrapers and complex frameworks with steep learning curves.

Pydantic Scrape bridges this gap by providing:

- **AI-powered extraction** - Let AI understand and extract what you need instead of writing brittle selectors
- **Type-safe workflows** - Structured data with validation built-in  
- **Academic research focus** - First-class support for papers, citations, and research workflows
- **Browser automation** - Handle JavaScript, authentication, and complex interactions seamlessly

## Installation

### 1. Install Chawan Terminal Browser (Required)

Pydantic Scrape uses [Chawan](https://sr.ht/~bptato/chawan/) for advanced web automation and JavaScript-heavy sites.

**macOS (Homebrew):**
```bash
brew install chawan
```

**Linux (from source):**
```bash
# Install Nim compiler
curl https://nim-lang.org/choosenim/init.sh -sSf | sh
# Install Chawan
git clone https://git.sr.ht/~bptato/chawan
cd chawan && make && sudo make install
```

**Verify installation:**
```bash
cha --version
```

### 2. Install Pydantic Scrape

```bash
# Standard installation
pip install pydantic-scrape

# With development tools (if contributing)
pip install "pydantic-scrape[dev]"
```

## Quick Start

Get a comprehensive research answer in about 10 lines:

```python
import asyncio
from pydantic_scrape.graphs.search_answer import search_answer

async def research():
    result = await search_answer(
        query="latest advances in quantum computing",
        max_search_results=5
    )
    
    print(f"Found {len(result['answer']['sources'])} sources")
    print(result['answer']['answer'])

asyncio.run(research())
```

This searches academic sources, extracts content, and generates a structured answer with citations - all automatically.

## Core Features

### 🔍 Smart Content Extraction
```python
from pydantic_scrape.dependencies.fetch import fetch_url

# Automatically handles JavaScript, selects best extraction method
content = await fetch_url("https://example.com/article")
print(content.title, content.text, content.metadata)
```

### 🤖 AI-Powered Scraping
```python
from pydantic_scrape.agents.bs4_scrape_script_agent import get_bs4_scrape_script_agent

# AI writes the scraping code for you
# page_html holds the HTML of the target page, fetched beforehand
agent = get_bs4_scrape_script_agent()
result = await agent.run("Extract product prices from this e-commerce page",
                         html_content=page_html)
```

### 📚 Academic Research
```python
from pydantic_scrape.dependencies.openalex import OpenAlexDependency

# Search papers by topic, author, or DOI
openalex = OpenAlexDependency()
papers = await openalex.search_papers("machine learning healthcare")
```

### 📄 Document Processing
```python
from pydantic_scrape.dependencies.document import DocumentDependency

# Extract text from PDFs, Word docs, EPUBs
doc = DocumentDependency()
content = await doc.extract_text("research_paper.pdf")
```

### 🌐 Advanced Browser Automation
```python
from pydantic_scrape.agents.search_and_browse import SearchAndBrowseAgent

# Intelligent search + browse with memory and geographic targeting
agent = SearchAndBrowseAgent()
result = await agent.run(
    "Find 5 cabinet refacing services in North West England with contact details"
)

# Automatically handles: cookie popups, JavaScript, geographic targeting, 
# content caching, parallel browsing, and form detection
```
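
The snippets in this section use bare `await`, which only works inside a coroutine (or an async REPL such as `python -m asyncio`). In a script, wrap the calls in an `async def` and start it with `asyncio.run`, as the Quick Start does. The sketch below shows the pattern with a hypothetical stand-in coroutine, `fake_fetch`, in place of a real Pydantic Scrape call.

```python
import asyncio


async def fake_fetch(url: str) -> dict[str, str]:
    """Hypothetical stand-in for an awaitable library call such as fetch_url."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return {"url": url, "title": "Example"}


async def main() -> dict[str, str]:
    # Any `await ...` line from the snippets above would go here.
    content = await fake_fetch("https://example.com/article")
    return content


if __name__ == "__main__":
    print(asyncio.run(main()))
```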

## Common Use Cases

- **Literature Reviews** - Automatically search, extract, and summarize academic papers
- **Market Research** - Monitor competitor content, pricing, and product updates  
- **News Monitoring** - Track mentions, trends, and breaking news across sources
- **Content Migration** - Extract structured data from legacy systems or websites
- **Research Workflows** - Build custom pipelines for domain-specific content extraction

## Architecture

Pydantic Scrape organizes code into three layers:

- **Dependencies** (`pydantic_scrape.dependencies.*`) - Reusable components for specific tasks
- **Agents** (`pydantic_scrape.agents.*`) - AI-powered workers that make decisions  
- **Graphs** (`pydantic_scrape.graphs.*`) - Orchestrate multi-step workflows

This makes it easy to compose complex workflows from simple, tested components.
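
To illustrate how the three layers relate, here is a toy sketch in plain Python. None of these names (`Page`, `FetchDependency`, `SummaryAgent`, `run_graph`) exist in Pydantic Scrape; they are hypothetical stand-ins, and the "agent" is an ordinary method rather than an AI model. The real package builds on pydantic-ai and pydantic-graph.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    text: str


class FetchDependency:
    """Dependency layer: one reusable task (here, a fake fetch)."""

    async def fetch(self, url: str) -> Page:
        await asyncio.sleep(0)  # stand-in for real network I/O
        return Page(url=url, text=f"content of {url}")


class SummaryAgent:
    """Agent layer: decides what to do with fetched content (no AI here)."""

    def summarize(self, page: Page) -> str:
        return page.text.upper()


async def run_graph(urls: list[str]) -> list[str]:
    """Graph layer: orchestrates the steps, fetch first and then summarize."""
    fetcher = FetchDependency()
    agent = SummaryAgent()
    # gather() returns results in the same order as the input URLs.
    pages = await asyncio.gather(*(fetcher.fetch(u) for u in urls))
    return [agent.summarize(p) for p in pages]


if __name__ == "__main__":
    print(asyncio.run(run_graph(["https://example.com/a"])))
```

Keeping each layer behind a small interface like this is what lets a dependency be tested on its own and reused across several workflows.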

## Configuration

### 1. Set API Keys
Create a `.env` file in your project root:

```bash
# AI Providers (choose one or more)
OPENAI_API_KEY=your_openai_key
GOOGLE_GENAI_API_KEY=your_google_key
ANTHROPIC_API_KEY=your_anthropic_key

# Google Search (for enhanced search capabilities)
GOOGLE_SEARCH_API_KEY=your_google_search_key
GOOGLE_SEARCH_ENGINE_ID=your_search_engine_id
```
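
Before running a workflow, it can help to confirm that at least one AI provider key is actually set. The sketch below uses only the standard library and the variable names from the `.env` example above; the helper `configured_providers` is hypothetical, and it assumes the `.env` file has already been loaded into the environment (for example with `python-dotenv`, which is among the package's dependencies).

```python
import os
from collections.abc import Mapping

# Variable names taken from the .env example above.
PROVIDER_KEYS = ("OPENAI_API_KEY", "GOOGLE_GENAI_API_KEY", "ANTHROPIC_API_KEY")


def configured_providers(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the provider key names that are set to a non-empty value."""
    return [name for name in PROVIDER_KEYS if env.get(name, "").strip()]


if __name__ == "__main__":
    found = configured_providers()
    if found:
        print("Configured:", ", ".join(found))
    else:
        print("No AI provider key found; check your .env file")
```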

### 2. Chawan Configuration
The package includes an optimized Chawan configuration in `.chawan/config.toml` that provides:

- **7x faster** web automation vs default settings
- **Cookie popup handling** without JavaScript overhead  
- **Content caching** for instant subsequent operations
- **Geographic search targeting** for accurate local results

Beyond installing Chawan itself, no additional setup is required - it works out of the box!

## Documentation

- [Installation Guide](INSTALLATION.md) - Detailed setup instructions
- [Examples](examples/) - Working code samples for common tasks
- [API Reference](https://github.com/philmade/pydantic_scrape) - Full documentation

## Contributing

We welcome contributions! The framework is designed to be extensible - add new content sources, AI agents, or workflow patterns.

See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup and guidelines.

## License

MIT License - see [LICENSE](LICENSE) for details.

---

**Ready to build intelligent scraping workflows?** Start with `pip install pydantic-scrape` and try the examples above.
