Metadata-Version: 2.4
Name: acon-intel
Version: 0.1.2
Summary: Acon is the intelligence layer for any web scraper. Pair it with Scrapling, Playwright, or httpx to crawl smarter.
Author: WillyEverGreen
License: MIT
Project-URL: Homepage, https://github.com/WillyEverGreen/acon
Project-URL: Bug Tracker, https://github.com/WillyEverGreen/acon/issues
Keywords: web-scraping,crawler,python,spider,playwright
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 4 - Beta
Classifier: Framework :: AsyncIO
Classifier: License :: OSI Approved :: MIT License
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Intended Audience :: Developers
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: curl-cffi>=0.6.0
Requires-Dist: playwright>=1.40.0
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: aiosqlite>=0.19.0
Requires-Dist: defusedxml>=0.7.0
Dynamic: license-file

<div align="center">
  <img src="https://raw.githubusercontent.com/WillyEverGreen/acon/main/logo.png" width="120" alt="Acon Logo">
  <h1>Acon — The Intelligent Brain for Any Scraper</h1>
  <p>Acon doesn't replace Scrapling or Firecrawl. It tells them where to look.</p>
</div>

---

## Why Acon?

Most crawlers are dumb. They follow links blindly, return raw HTML, and break the moment a site changes its structure. Before you can extract anything useful, you need to understand what you're dealing with.

**Acon is a site intelligence engine.** It maps the structural "skeleton" of a website automatically — before any data extraction happens — so your scraper always knows where to look.

---

## 🏗️ The Core Thesis
Most modern web scrapers suffer from **"URL Exhaustion"**: they spend 90% of their bandwidth fetching structurally identical product or blog pages. Acon introduces a **Topology Orchestrator** that maps, classifies, and samples site structures to find the site's "skeleton" before you spend a cent on proxies.
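
To make the "map, classify, sample" idea concrete, here is a minimal sketch of grouping URLs by a coarse path template and fetching only a few representatives per group. It is an illustration under stated assumptions, not Acon's actual algorithm: `url_template`, `sample_by_template`, the `{id}`/`{slug}` placeholders, and the `shop.example` URLs are all hypothetical.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse


def url_template(url: str) -> str:
    """Collapse a URL path into a coarse template (hypothetical heuristic)."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    normalized = []
    for seg in segments:
        if re.fullmatch(r"\d+", seg):
            normalized.append("{id}")      # purely numeric segment
        elif re.search(r"\d", seg) or len(seg) > 20:
            normalized.append("{slug}")    # looks like a generated slug
        else:
            normalized.append(seg)         # keep stable path words
    return "/" + "/".join(normalized)


def sample_by_template(urls, per_template=2):
    """Group URLs by template and keep the first few of each group."""
    groups = defaultdict(list)
    for url in urls:
        groups[url_template(url)].append(url)
    return {tpl: members[:per_template] for tpl, members in groups.items()}


urls = [
    "https://shop.example/product/101",
    "https://shop.example/product/102",
    "https://shop.example/product/103",
    "https://shop.example/blog/2024-launch-notes",
    "https://shop.example/blog/2023-year-in-review",
    "https://shop.example/about",
]
sampled = sample_by_template(urls, per_template=1)
for template, members in sampled.items():
    print(template, "->", members)
```

With one sample per template, the six URLs above collapse to three fetches, one for each distinct page layout.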

### 💰 Acon vs. Scrapling (The 1:1 Battle)

| Metric | Scrapling Alone (Blind) | Acon + Scrapling (Brain) |
| :--- | :--- | :--- |
| **Pages Crawled** | 1,000 | **40** |
| **Time Taken** | 870s (14.5 min) | **111s (1.8 min)** |
| **Bandwidth Used** | 20.72 MB | **1.39 MB** |
| **Est. Proxy Cost** | $1.000 | **$0.040** |
| **Structural DNA** | 4/4 Found | **4/4 Found** |

**96% fewer pages crawled (25x fewer requests). Roughly 7.8x faster structural discovery (870s vs. 111s).**
*Measured on books.toscrape.com.*
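
The headline ratios can be recomputed directly from the table. The short script below only restates the table's numbers; the dictionary keys and helper names are chosen here for illustration.

```python
# Figures copied from the comparison table above.
blind = {"pages": 1000, "seconds": 870, "megabytes": 20.72, "proxy_usd": 1.000}
acon = {"pages": 40, "seconds": 111, "megabytes": 1.39, "proxy_usd": 0.040}


def reduction_pct(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return 100 * (before - after) / before


def ratio(before: float, after: float) -> float:
    """How many times smaller `after` is than `before`."""
    return before / after


for metric in blind:
    pct = reduction_pct(blind[metric], acon[metric])
    factor = ratio(blind[metric], acon[metric])
    print(f"{metric}: {pct:.1f}% lower ({factor:.1f}x)")
```

The 25x ratio applies to pages crawled and estimated proxy cost; elapsed time drops by about 7.8x and bandwidth by about 14.9x.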

---

## 📊 Elite Benchmarks: Real-World Performance

We tested Acon against a standard BFS crawler on complex, live targets with a shared **50-page budget** to measure discovery quality vs. brute force.

| Target | Request Reduction | Discovery Yield (DNA) | Outcome |
| :--- | :--- | :--- | :--- |
| **Next.js Showcase** | **68% Reduction** | 5/5 Templates Identified | ✅ **PASS** |
| **The Hindu (News)** | **40% Reduction** | **8 vs 4** Templates Found | 🏆 **ELITE** |
| **books.toscrape** | 0% (Static Parity) | 5 vs 4 Templates Found | ✅ **PASS** |
| **Flipkart Mobiles** | Budget Equalized | 8 vs 8 Templates Found | ⚖️ **STABLE** |

### 🧠 The "Brain" Advantage
- **News Sites**: Acon finds **twice as many structural variations** (DNA) as a blind crawler (8 vs. 4 templates on The Hindu) by distinguishing category pages from article pages.
- **SPAs**: Acon reaches structural saturation on React/Next.js sites **about 3x faster** than standard tools (68% fewer requests on the Next.js Showcase) by navigating the rendered DOM.
- **Honest Limitations**: On simple static sites, Acon's "Brain" matches BFS but adds rendering overhead. Acon is an **Intelligence Engine** for complex sites, not a replacement for basic fetchers on simple blogs.

---

## 🚀 Use Cases

**Price Monitoring & E-Commerce Intelligence**  
Acon detects pagination patterns and repeating product templates automatically. No manual selector configuration per site.

**Content Archival & Research**  
Feed Acon a publication's root URL. It identifies the site's content structure, prioritizes article pages over navigation noise, and hands you a clean discovery map.

**Site Auditing & SEO Analysis**  
Get an instant structural report — template count, link depth, topology classification (SPA vs static vs paginated) — in a single run.

---

## ⚡ What Makes Acon Different

| Capability | Typical Crawler | Acon |
|---|---|---|
| **JS-rendered sites** | Manual Playwright setup | **Autonomous escalation** |
| **Site structure** | Unknown until scraped | **Detected before extraction** |
| **Large site performance** | Degrades at scale | **O(log N) priority queue** |
| **Bandwidth efficiency** | Downloads everything | **Asset blocking (Discovery mode)** |
| **Discovery Latency** | Static only | **Static-First Hybrid Escalation** |
| **Failed crawls** | Lost progress | **SQLite resumption (WAL)** |
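
The "O(log N) priority queue" row can be pictured as a binary-heap URL frontier. The sketch below is not Acon's implementation: `Frontier`, its methods, and the priority scores are hypothetical, and only the standard-library `heapq` module is used.

```python
import heapq
import itertools


class Frontier:
    """URL frontier backed by a binary heap: push and pop are O(log N)."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        # Monotonic counter breaks ties so equal priorities pop in FIFO order.
        self._counter = itertools.count()

    def push(self, url: str, priority: float) -> bool:
        """Queue a URL once; returns False if it was already seen."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so negate to pop the highest priority first.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))
        return True

    def pop(self) -> str:
        """Return the highest-priority URL still queued."""
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self) -> int:
        return len(self._heap)


frontier = Frontier()
frontier.push("https://example.com/about", 0.2)
frontier.push("https://example.com/category/books", 0.9)
frontier.push("https://example.com/category/music", 0.9)
duplicate_accepted = frontier.push("https://example.com/about", 1.0)

order = [frontier.pop() for _ in range(len(frontier))]
print(order)
```

Because each push and pop touches only one root-to-leaf path of the heap, the cost per operation grows logarithmically with the number of queued URLs, unlike re-sorting a list after every discovery.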

---

## 🏗️ The Efficiency Pillars

Acon is optimized for production environments where every request costs money:

*   ⚡ **Static-First Discovery**: Acon probes pages with raw HTTP first and launches a browser only if the site is a SPA, saving 90% of compute on standard sites.
*   🚫 **Intelligent Asset Blocking**: During discovery, Acon automatically aborts requests for images, fonts, and CSS to cut bandwidth and CPU usage.
*   📉 **Debounced Topology Detection**: Structural analysis (DNA mapping) runs only at key milestones (1, 10, 25, and 50 pages crawled) rather than after every page, which keeps throughput high.
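
The three pillars above can be sketched as small decision functions. This is a rough illustration, not Acon's internals: `needs_browser`, `should_block`, `should_analyze`, and the 200-character threshold are assumptions made for this example. The `block_assets` handler uses the shape Playwright expects for `page.route("**/*", handler)`, where `route.request.resource_type`, `route.abort()`, and `route.continue_()` are Playwright's own API.

```python
from html.parser import HTMLParser

BLOCKED_RESOURCE_TYPES = {"image", "font", "stylesheet"}
ANALYSIS_MILESTONES = (1, 10, 25, 50)


class _ShellProbe(HTMLParser):
    """Counts <script> tags and visible text in a statically fetched page."""

    def __init__(self):
        super().__init__()
        self.script_tags = 0
        self.visible_chars = 0
        self._skip = None  # tag whose text content is currently ignored

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in ("script", "style"):
            self._skip = tag

    def handle_endtag(self, tag):
        if tag == self._skip:
            self._skip = None

    def handle_data(self, data):
        if self._skip is None:
            self.visible_chars += len(data.strip())


def needs_browser(html: str, min_visible_chars: int = 200) -> bool:
    """Static-first check: escalate only for script-driven, near-empty shells."""
    probe = _ShellProbe()
    probe.feed(html)
    probe.close()
    return probe.script_tags > 0 and probe.visible_chars < min_visible_chars


def should_block(resource_type: str) -> bool:
    """Asset blocking: skip resources that carry no structural information."""
    return resource_type in BLOCKED_RESOURCE_TYPES


async def block_assets(route):
    """Route handler, registered with page.route("**/*", block_assets)."""
    if should_block(route.request.resource_type):
        await route.abort()
    else:
        await route.continue_()


def should_analyze(pages_crawled: int) -> bool:
    """Debouncing: run topology analysis only at the listed milestones."""
    return pages_crawled in ANALYSIS_MILESTONES


spa_shell = (
    '<html><head><title>App</title></head>'
    '<body><div id="root"></div><script src="/main.js"></script></body></html>'
)
static_page = "<html><body><h1>Catalogue</h1><p>" + "Book description. " * 20 + "</p></body></html>"
```

Keeping the decisions as pure functions makes them easy to test without launching a browser; only `block_assets` needs a live Playwright page.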

---

## 🏗️ The Unified Intelligence Stack (The Acon Alliance)

Acon doesn't just map sites; it orchestrates the most powerful open-source scraping tools into a single, high-fidelity pipeline.

*   **🕵️ Stealth (Camoufox)**: Enable `use_stealth=True` to launch an "invisible" browser engine that bypasses Cloudflare and Akamai automatically.
*   **📄 Content (Trafilatura)**: Enable `extract_content=True` to get clean, LLM-ready Markdown from every discovered page natively.
*   **🚀 Speed (Scrapling)**: Use the `scrapling_adapter` to export Acon's "DNA Map" into Scrapling for turbo-charged mass extraction at 10x standard speeds.

---

## 🛠️ Installation

```bash
pip install acon-intel

# To enable the Alliance pillars (Highly Recommended)
pip install trafilatura camoufox scrapling
playwright install chromium
```

---

## ⚡ Quick Start (The Alliance Stack)

```python
import asyncio
from acon import SiteCrawlOrchestrator, CrawlConfig

async def main():
    # Acon discovers the 'skeleton', Trafilatura extracts the 'flesh',
    # and Camoufox provides the 'stealth'.
    config = CrawlConfig(
        max_pages=10,
        extract_content=True,  # Content pillar: Trafilatura
        use_stealth=True,      # Stealth pillar: Camoufox
    )

    brain = SiteCrawlOrchestrator()
    result = await brain.crawl_site("https://news.ycombinator.com", config)

    for page in result["page_summaries"]:
        print(f"URL: {page['url']}")
        # Guard against pages where no content was extracted.
        content = page.get("content")
        if content:
            print(f"Markdown: {content[:100]}...")
            
if __name__ == "__main__":
    asyncio.run(main())
```

---

## 📦 The Output Shape
Acon returns a structured `SiteCrawlResult` containing everything needed for downstream extraction:

```json
{
  "topology": "paginated",
  "pages_crawled": 42,
  "page_summaries": [
    {
      "url": "https://example.com/p/123",
      "page_type": "standard",
      "js_required": false,
      "content": "# Extracted Markdown Content...",
      "parent_url": "https://example.com/list"
    }
  ],
  "crawl_meta": {
    "reflection": {
      "intelligence_score": 0.85,
      "advice": "Continue current strategy."
    }
  }
}
```
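
As a rough sketch of downstream use, the snippet below splits a result of this shape into pages a plain HTTP fetcher can handle and pages that need a browser. It assumes the result can be indexed like a dictionary, as in the Quick Start; `summarize` and its output keys are hypothetical helpers, not part of Acon.

```python
from collections import Counter

# Sample data mirroring the JSON shape shown above.
result = {
    "topology": "paginated",
    "pages_crawled": 42,
    "page_summaries": [
        {
            "url": "https://example.com/p/123",
            "page_type": "standard",
            "js_required": False,
            "content": "# Extracted Markdown Content...",
            "parent_url": "https://example.com/list",
        }
    ],
    "crawl_meta": {
        "reflection": {
            "intelligence_score": 0.85,
            "advice": "Continue current strategy.",
        }
    },
}


def summarize(result: dict) -> dict:
    """Build a routing plan from a crawl result (hypothetical helper)."""
    pages = result.get("page_summaries", [])
    return {
        "topology": result.get("topology"),
        "pages_crawled": result.get("pages_crawled"),
        "page_types": Counter(p.get("page_type", "unknown") for p in pages),
        # Pages flagged js_required go to a browser-based extractor.
        "needs_browser": [p["url"] for p in pages if p.get("js_required")],
        # Everything else can be fetched with a plain HTTP client.
        "static_ok": [p["url"] for p in pages if not p.get("js_required")],
    }


plan = summarize(result)
print(plan["topology"], dict(plan["page_types"]))
print("static:", plan["static_ok"])
print("browser:", plan["needs_browser"])
```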

---

## 🛣️ Roadmap
- [x] **Stealth Integration**: Native support for **Camoufox** (Fingerprint bypass).
- [x] **LLM-Ready Pipeline**: Native **Trafilatura** integration for high-fidelity Markdown output.
- [x] **Speed Pillar**: Official **Scrapling** adapter for mass extraction.
- [ ] **Discovery API**: Expose Acon as a standalone Discovery microservice.

---
*Acon: The connective tissue of the intelligent web.*
