Metadata-Version: 2.4
Name: nlweb-crawler
Version: 0.7.0
Summary: NLWeb Crawler - Web crawling and indexing service
Author: nlweb-ai
License: MIT
Project-URL: Homepage, https://github.com/nlweb-ai/nlweb-ask-agent
Project-URL: Repository, https://github.com/nlweb-ai/nlweb-ask-agent
Project-URL: Issues, https://github.com/nlweb-ai/nlweb-ask-agent/issues
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: flask>=2.3.3
Requires-Dist: flask-cors>=4.0.0
Requires-Dist: pymssql>=2.2.0
Requires-Dist: requests>=2.31.0
Requires-Dist: azure-storage-blob>=12.19.0
Requires-Dist: azure-identity>=1.14.0
Requires-Dist: azure-search-documents>=11.4.0
Requires-Dist: azure-storage-queue>=12.8.0
Requires-Dist: azure-cosmos>=4.5.0
Requires-Dist: openai>=1.0.0
Requires-Dist: defusedxml>=0.7.1
Requires-Dist: feedparser>=6.0.0
Requires-Dist: python-dateutil>=2.8.0
Requires-Dist: prometheus-client>=0.21.0
Requires-Dist: Cython<3
Requires-Dist: pytest>=7.0.0
Requires-Dist: ruff>=0.4.0
Requires-Dist: pyright>=1.1.408
Dynamic: license-file

# Crawler

Distributed web crawler for schema.org structured data.

## Architecture

Master/worker pattern, with each role running as a separate pod in Kubernetes:
- **Master**: Flask API + job scheduler
- **Worker**: Queue processor (embedding + upload to Azure AI Search)

Flow: Parse schema.org sitemaps → queue JSON files → embed → upload
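
As a rough illustration of that flow, the sketch below models the master/worker hand-off with an in-memory queue. It is not the crawler's actual code: `Job`, `master_enqueue`, `fake_embed`, and `worker_drain` are hypothetical names, the standard-library `queue.Queue` stands in for the real queue, a plain dict stands in for the Azure AI Search index, and `fake_embed` is a placeholder rather than a real embedding model.

```python
import json
import queue
from dataclasses import dataclass


@dataclass
class Job:
    """One queued unit of work: a page URL plus its schema.org JSON, serialized."""
    url: str
    payload: str


def master_enqueue(items, job_queue):
    """Master side: serialize each parsed (url, data) pair and put it on the queue."""
    for url, data in items:
        job_queue.put(Job(url=url, payload=json.dumps(data)))


def fake_embed(text):
    """Placeholder embedding (an assumption): a deterministic 2-element vector."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


def worker_drain(job_queue, index):
    """Worker side: take jobs until the queue is empty, embed, and 'upload' to index."""
    while True:
        try:
            job = job_queue.get_nowait()
        except queue.Empty:
            break
        index[job.url] = {
            "doc": json.loads(job.payload),
            "vector": fake_embed(job.payload),
        }
    return index


if __name__ == "__main__":
    q = queue.Queue()
    master_enqueue([("https://example.com/a", {"@type": "Article", "name": "A"})], q)
    print(worker_drain(q, {}))
```

Keeping the queue as the only contact point between the two roles is what lets master and worker run and scale as separate pods.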

## Endpoints

- `GET /` - Web UI
- `GET /api/status` - System status
- `POST /api/sites` - Add site to crawl
- `GET /api/queue/status` - Queue statistics
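
A minimal client sketch for these endpoints, using only the standard library. Several details are assumptions rather than documented behavior: the base URL `http://localhost:5000`, the `site_url` field name in the `POST /api/sites` body, and the expectation that responses are JSON. Check the master's route handlers for the real request and response shapes.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # assumption: adjust to your deployment


def add_site_request(site_url, base_url=BASE_URL):
    """Build a POST /api/sites request. The 'site_url' field name is hypothetical."""
    body = json.dumps({"site_url": site_url}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/sites",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def status_request(base_url=BASE_URL):
    """Build a GET /api/status request."""
    return urllib.request.Request(f"{base_url}/api/status", method="GET")


def queue_status_request(base_url=BASE_URL):
    """Build a GET /api/queue/status request."""
    return urllib.request.Request(f"{base_url}/api/queue/status", method="GET")


def send(req, timeout=10):
    """Send a prepared request and decode the body, assuming it is JSON."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires a running master; prints whatever the service returns.
    print(send(status_request()))
    print(send(add_site_request("https://example.com")))
    print(send(queue_status_request()))
```

Building the request separately from sending it keeps the URL and payload easy to inspect without a running service.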

## Commands

Run `make help` for the full list. Key targets:

```
make dev     # Run master + worker via Docker Compose
make test    # Run pytest
make build   # Build image and push to ACR
make deploy  # Deploy to AKS via Helm
```
