Metadata-Version: 2.4
Name: graph-crawler
Version: 4.0.13
Summary: Sync-First library for building website graphs - as simple as requests!
Author: 0-EternalJunior-0
Maintainer: 0-EternalJunior-0
License-Expression: MIT
Project-URL: Homepage, https://github.com/0-EternalJunior-0/GraphCrawler
Project-URL: Documentation, https://github.com/0-EternalJunior-0/GraphCrawler/blob/main/README.md
Project-URL: Repository, https://github.com/0-EternalJunior-0/GraphCrawler
Project-URL: Bug Tracker, https://github.com/0-EternalJunior-0/GraphCrawler/issues
Keywords: web,crawler,scraper,graph,spider,scrapy,vectorization,free-threading
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Operating System :: OS Independent
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.31.0
Requires-Dist: aiohttp>=3.9.0
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: lxml>=4.9.0
Requires-Dist: lxml_html_clean
Requires-Dist: selectolax>=0.3.0
Requires-Dist: pydantic>=2.5.0
Requires-Dist: pydantic-settings>=2.0.0
Requires-Dist: orjson>=3.9.0
Requires-Dist: fake-useragent
Requires-Dist: aiofiles>=23.2.0
Requires-Dist: aiosqlite>=0.19.0
Requires-Dist: pybloom-live
Requires-Dist: fastapi
Provides-Extra: native
Requires-Dist: cython>=3.0.0; extra == "native"
Requires-Dist: mmh3>=4.0.0; extra == "native"
Provides-Extra: playwright
Requires-Dist: playwright>=1.40.0; extra == "playwright"
Provides-Extra: mongodb
Requires-Dist: motor>=3.3.0; extra == "mongodb"
Provides-Extra: postgresql
Requires-Dist: asyncpg>=0.29.0; extra == "postgresql"
Provides-Extra: embeddings
Requires-Dist: sentence-transformers>=2.2.0; extra == "embeddings"
Requires-Dist: numpy>=1.24.0; extra == "embeddings"
Provides-Extra: newspaper
Requires-Dist: newspaper3k>=0.2.8; extra == "newspaper"
Provides-Extra: goose
Requires-Dist: goose3>=3.1.0; extra == "goose"
Provides-Extra: readability
Requires-Dist: readability-lxml>=0.8.0; extra == "readability"
Provides-Extra: articles
Requires-Dist: newspaper3k>=0.2.8; extra == "articles"
Requires-Dist: goose3>=3.1.0; extra == "articles"
Requires-Dist: readability-lxml>=0.8.0; extra == "articles"
Provides-Extra: viz
Requires-Dist: pyvis>=0.3.0; extra == "viz"
Requires-Dist: networkx>=3.6; extra == "viz"
Provides-Extra: celery
Requires-Dist: celery>=5.3.0; extra == "celery"
Requires-Dist: redis>=5.0.0; extra == "celery"
Provides-Extra: ml
Requires-Dist: g4f>=0.3.0; extra == "ml"
Requires-Dist: scikit-learn>=1.0.0; extra == "ml"
Provides-Extra: performance
Requires-Dist: aiodns>=3.1.0; extra == "performance"
Requires-Dist: uvloop>=0.19.0; platform_system != "Windows" and extra == "performance"
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: mypy>=1.5.0; extra == "dev"
Provides-Extra: all
Requires-Dist: playwright>=1.40.0; extra == "all"
Requires-Dist: motor>=3.3.0; extra == "all"
Requires-Dist: asyncpg>=0.29.0; extra == "all"
Requires-Dist: sentence-transformers>=2.2.0; extra == "all"
Requires-Dist: numpy>=1.24.0; extra == "all"
Requires-Dist: newspaper3k>=0.2.8; extra == "all"
Requires-Dist: goose3>=3.1.0; extra == "all"
Requires-Dist: readability-lxml>=0.8.0; extra == "all"
Requires-Dist: pyvis>=0.3.0; extra == "all"
Requires-Dist: networkx>=3.6; extra == "all"
Requires-Dist: celery>=5.3.0; extra == "all"
Requires-Dist: redis>=5.0.0; extra == "all"
Requires-Dist: g4f>=0.3.0; extra == "all"
Requires-Dist: scikit-learn>=1.0.0; extra == "all"
Requires-Dist: aiodns>=3.1.0; extra == "all"
Requires-Dist: uvloop>=0.19.0; platform_system != "Windows" and extra == "all"
Dynamic: license-file

# GraphCrawler

[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![PyPI](https://img.shields.io/badge/pypi-v4.0.13-green.svg)](https://pypi.org/project/graph-crawler/)
[![License: MIT](https://img.shields.io/badge/license-MIT-yellow.svg)](LICENSE)

A library for building a graph of a website's structure.

## Installation

```bash
pip install graph-crawler
```

Optional dependencies:

```bash
pip install graph-crawler[playwright]    # JavaScript-rendered sites
pip install graph-crawler[embeddings]    # Vectorization
pip install graph-crawler[mongodb]       # MongoDB storage
pip install graph-crawler[all]           # Everything
```

## Usage

```python
import graph_crawler as gc

# Basic crawl
graph = gc.crawl("https://example.com", max_depth=2, max_pages=50)

print(f"Pages: {len(graph.nodes)}")
print(f"Links: {len(graph.edges)}")

# Save the graph
gc.save_graph(graph, "site.json")
```

### Async API

```python
import asyncio
import graph_crawler as gc

async def main():
    graph = await gc.async_crawl("https://example.com")
    return graph

graph = asyncio.run(main())
```
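
Because `async_crawl` is a coroutine, several sites can be crawled concurrently. A minimal sketch using `asyncio.gather`, assuming `async_crawl` accepts the same keyword arguments as `crawl`:

```python
import asyncio
import graph_crawler as gc

async def crawl_many(urls: list[str]):
    # One crawl task per site; gather waits for all of them to finish.
    # max_depth is assumed to mirror the crawl() parameter of the same name.
    tasks = [gc.async_crawl(url, max_depth=1) for url in urls]
    return await asyncio.gather(*tasks)

graphs = asyncio.run(crawl_many(["https://example.com", "https://example.org"]))
```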

### crawl() parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `max_depth` | 3 | Crawl depth |
| `max_pages` | 100 | Page limit |
| `same_domain` | True | Stay on the current domain only |
| `request_delay` | 0.5 | Delay between requests (s) |
| `timeout` | 300 | Overall timeout (s) |
| `driver` | "http" | Driver: `http`, `playwright` |
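
Putting these together, a minimal sketch of the same `gc.crawl` call as in the basic example, with every keyword taken straight from the table above:

```python
import graph_crawler as gc

# A polite, shallow crawl confined to a single domain.
graph = gc.crawl(
    "https://example.com",
    max_depth=2,          # follow links two levels deep
    max_pages=200,        # stop after 200 pages
    same_domain=True,     # ignore links to other domains
    request_delay=1.0,    # wait 1 s between requests
    timeout=600,          # abort the whole crawl after 10 minutes
    driver="http",        # static sites; use "playwright" for JS-heavy ones
)
```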

### URL Rules

```python
from graph_crawler import crawl, URLRule

rules = [
    URLRule(pattern=r"\.pdf$", should_scan=False),
    URLRule(pattern=r"/admin/", should_scan=False),
    URLRule(pattern=r"/products/", priority=10),
]

graph = crawl("https://example.com", url_rules=rules)
```

### Graph operations

```python
# Statistics
stats = graph.get_stats()

# Look up a node by URL
node = graph.get_node_by_url("https://example.com/page")

# Merge two graphs
merged = graph1 + graph2

# Export edges
graph.export_edges("edges.csv", format="csv")
graph.export_edges("graph.dot", format="dot")
```

## Drivers

| Driver | Purpose |
|--------|---------|
| `http` | Static sites (default) |
| `playwright` | JavaScript/SPA sites |

```python
# Playwright for JavaScript-rendered sites
graph = gc.crawl("https://spa-site.com", driver="playwright")
```

## Storage

| Type | Recommended for |
|------|-----------------|
| `memory` | < 1K pages |
| `json` | 1K - 20K pages |
| `sqlite` | 20K+ pages |
| `mongodb` | Large projects |
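
A hypothetical sketch of selecting a backend, assuming `crawl()` accepts a `storage` keyword naming one of the types above (the exact option name is not documented in this README; check the project docs):

```python
import graph_crawler as gc

# Hypothetical: the `storage` keyword is an assumption, not a confirmed API.
graph = gc.crawl(
    "https://example.com",
    max_pages=50_000,
    storage="sqlite",  # per the table above, sqlite suits 20K+ pages
)
```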

## CLI

```bash
graph-crawler crawl https://example.com --max-depth 2
graph-crawler list
graph-crawler info graph_name
```

## Requirements

- Python 3.11+

## License

MIT
