Metadata-Version: 2.4
Name: llm-web-crawler
Version: 0.4.0
Summary: LLM data collection and synthetic fine-tuning dataset pipeline
Requires-Python: >=3.11
Requires-Dist: aiofiles>=23
Requires-Dist: beautifulsoup4>=4.12
Requires-Dist: datasets>=2.20
Requires-Dist: httpx[http2]>=0.27
Requires-Dist: huggingface-hub>=0.23
Requires-Dist: jinja2>=3.1
Requires-Dist: kaggle>=1.6
Requires-Dist: litellm>=1.40
Requires-Dist: loguru>=0.7
Requires-Dist: lxml>=5
Requires-Dist: markdownify>=0.12
Requires-Dist: psutil>=5.9
Requires-Dist: pyarrow>=16
Requires-Dist: pydantic-settings>=2.3
Requires-Dist: pydantic>=2.7
Requires-Dist: python-dotenv>=1.0
Requires-Dist: questionary>=2.0
Requires-Dist: rich>=13
Requires-Dist: sqlmodel>=0.0.18
Requires-Dist: tenacity>=8.3
Requires-Dist: tiktoken>=0.7
Requires-Dist: typer>=0.12
Requires-Dist: xmltodict>=0.13
Provides-Extra: dev
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pyinstaller>=6; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest-cov>=5; extra == 'dev'
Requires-Dist: pytest>=8; extra == 'dev'
Requires-Dist: respx>=0.21; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Provides-Extra: unsloth
Requires-Dist: unsloth>=2024.0; extra == 'unsloth'
Description-Content-Type: text/markdown

# DataForge

An interactive CLI pipeline that turns websites into fine-tuning datasets for LLMs.
It discovers URLs, scrapes content, chunks it, generates synthetic Q&A / instruction / conversation
samples, scores them for quality, and exports to HuggingFace Hub, Kaggle, or local files.

---

## Installation

### uv (recommended)
```bash
uv tool install llm-web-crawler
dataforge
```

Update:
```bash
uv tool upgrade llm-web-crawler
```

Uninstall:
```bash
uv tool uninstall llm-web-crawler
```

### pip
```bash
pip install llm-web-crawler
dataforge
```

Update:
```bash
pip install --upgrade llm-web-crawler
```

Uninstall:
```bash
pip uninstall llm-web-crawler
```

### From source
```bash
git clone https://github.com/ianktoo/data-forge.git
cd data-forge
uv sync
uv run dataforge
```

### Standalone executables (no Python required)
Download pre-built binaries for your platform from [GitHub Releases](https://github.com/ianktoo/data-forge/releases):

| Platform | File |
|---|---|
| Windows | `dataforge-windows-x64.exe` |
| macOS | `dataforge-macos-x64` |
| Linux | `dataforge-linux-x64` |

---

## Quick start

```bash
dataforge                 # interactive guided pipeline
dataforge explore <url>   # preview URL discovery without running the full pipeline
dataforge config          # set your LLM provider and API key
dataforge sessions        # list past sessions
dataforge resume <id>     # resume a paused session
dataforge update          # update to the latest version
```

---

## Features

### URL Discovery
- Automatically finds and parses XML sitemaps (including sitemap indexes)
- Checks `robots.txt` for `Sitemap:` directives
- **BFS crawler fallback** — if no sitemap is found, crawls the site up to a configurable depth and page limit (see the sketch after this list)
- **SPA support** — detects JavaScript-rendered pages (few links, rich body) and retries with Playwright if installed
- Parallel discovery across multiple seed URLs
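
The BFS fallback, roughly, honouring the same depth and page limits that `DATAFORGE_MAX_CRAWL_DEPTH` and `DATAFORGE_MAX_CRAWL_PAGES` expose (see the configuration table below). This is an illustrative sketch, not the actual crawler:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import httpx
from bs4 import BeautifulSoup

# Illustrative only -- the real crawler also applies rate limiting,
# robots.txt checks, and the URL sanitisation described below.
def bfs_crawl(seed: str, max_depth: int = 3, max_pages: int = 50) -> list[str]:
    seen, found = {seed}, []
    queue = deque([(seed, 0)])
    with httpx.Client(follow_redirects=True, timeout=15.0) as client:
        while queue and len(found) < max_pages:
            url, depth = queue.popleft()
            try:
                html = client.get(url).text
            except httpx.HTTPError:
                continue
            found.append(url)
            if depth >= max_depth:
                continue  # respect the depth limit
            for a in BeautifulSoup(html, "lxml").find_all("a", href=True):
                link = urljoin(url, a["href"]).split("#")[0]
                # stay on the seed's domain, never revisit a URL
                if urlparse(link).netloc == urlparse(seed).netloc and link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return found
```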

### Zero-trust input handling
- All user-supplied URLs are sanitised before entering the pipeline (see the sketch after this list)
- Strips control characters, URL fragments, and tracking parameters (`utm_*`, `fbclid`, `gclid`, etc.)
- Auto-corrects bare domains (adds `https://`) and percent-encodes unsafe path characters
- Non-HTML resources (images, PDFs, JS, CSS) are filtered from crawl candidates
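
A minimal sketch of those sanitisation steps (the helper below is illustrative, not the actual `dataforge` internals):

```python
from urllib.parse import parse_qsl, quote, urlencode, urlsplit, urlunsplit

TRACKING_KEYS = {"fbclid", "gclid"}   # exact-match tracking params
TRACKING_PREFIXES = ("utm_",)         # prefix-match tracking params

def sanitize_url(raw: str) -> str:
    # Strip control characters and surrounding whitespace.
    cleaned = "".join(ch for ch in raw.strip() if ch.isprintable())
    # Auto-correct bare domains by assuming https.
    if "://" not in cleaned:
        cleaned = "https://" + cleaned
    scheme, netloc, path, query, _fragment = urlsplit(cleaned)
    kept = [
        (k, v)
        for k, v in parse_qsl(query, keep_blank_values=True)
        if k not in TRACKING_KEYS and not k.startswith(TRACKING_PREFIXES)
    ]
    # Percent-encode unsafe path characters and drop the fragment.
    return urlunsplit((scheme, netloc, quote(path, safe="/%"), urlencode(kept), ""))

print(sanitize_url("example.com/some page?utm_source=x&q=llm#top"))
# -> https://example.com/some%20page?q=llm
```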

### Collection
- Async HTTPX client with retry + exponential backoff (see the sketch after this list)
- Per-domain rate limiting and `robots.txt` compliance
- Pages saved as Markdown in the session directory
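
The retry-with-backoff pattern, sketched with the declared `httpx` and `tenacity` dependencies (the function here is illustrative, not the actual client):

```python
import asyncio

import httpx
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(4), wait=wait_exponential(multiplier=1, max=30))
async def fetch(client: httpx.AsyncClient, url: str) -> str:
    resp = await client.get(url, follow_redirects=True)
    resp.raise_for_status()  # 4xx/5xx raises, which triggers a tenacity retry
    return resp.text

async def main() -> None:
    # http2=True matches the declared httpx[http2] extra
    async with httpx.AsyncClient(http2=True, timeout=30.0) as client:
        print(len(await fetch(client, "https://example.com")))

asyncio.run(main())
```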

### Processing
- Token-aware chunking with configurable size and overlap (see the sketch after this list)
- Boilerplate removal (nav, footer, cookie notices, etc.)
- Output as JSONL and Parquet
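
A minimal sketch of token-aware chunking with overlap via `tiktoken` (a declared dependency); the helper name and the `cl100k_base` encoding are assumptions:

```python
import tiktoken

# Defaults mirror the configuration table below
# (DATAFORGE_CHUNK_SIZE=512, DATAFORGE_CHUNK_OVERLAP=64).
def chunk_text(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = size - overlap  # each chunk re-reads the tail of the previous one
    return [enc.decode(tokens[i : i + size]) for i in range(0, len(tokens), step)]
```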

### Generation
- Synthetic Q&A, instruction, and conversation samples via LiteLLM (see the example after this list)
- Supports OpenAI, Anthropic, Groq, Together AI, and local Ollama
- Custom system prompt support
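
Because generation goes through LiteLLM, one call shape covers every provider. An illustrative example (the prompt and model strings are placeholders, not what DataForge ships with):

```python
from litellm import completion

# Needs the matching provider key from the configuration table below
# (e.g. OPENAI_API_KEY for the model shown here).
response = completion(
    model="gpt-4o-mini",  # or "ollama/llama3.2", "groq/llama-3.1-8b-instant", ...
    messages=[
        {"role": "system", "content": "You write Q&A pairs from source text."},
        {"role": "user", "content": "Source chunk: ...\nWrite one question and its answer."},
    ],
)
print(response.choices[0].message.content)
```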

### Quality
- LLM-based quality scoring (1–5)
- Configurable approval threshold

### Export
- HuggingFace Hub (public or private datasets; see the sketch after this list)
- Kaggle datasets
- Local JSONL / Parquet / CSV
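
These targets map onto the declared `datasets` dependency. A rough sketch, assuming you are authenticated to the Hub (e.g. via `HUGGINGFACE_TOKEN` from the table below); the repo name and records are placeholders:

```python
from datasets import Dataset

records = [{"question": "What is DataForge?", "answer": "A dataset pipeline."}]
ds = Dataset.from_list(records)

ds.to_json("samples.jsonl")       # local JSONL (JSON Lines is the default)
ds.to_parquet("samples.parquet")  # local Parquet
ds.push_to_hub("your-username/your-dataset", private=True)  # HuggingFace Hub
```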

### CLI experience
- Ghost-text inline autocomplete with Tab completion (powered by `prompt_toolkit`)
- Typo correction for unknown commands with fuzzy closest-match suggestions
- Contextual rotating tips at each pipeline stage
- `dataforge config` prompts for API keys securely via `getpass` and saves to `.env`
- Startup hint when no provider key is detected, with guidance to run `dataforge config`
- User preferences persisted to `~/.config/dataforge/prefs.json` (cross-project)

---

## Configuration

DataForge reads settings from environment variables or a `.env` file in the working directory.
Run `dataforge config` to set your provider and API key interactively.

### Environment variables

| Variable | Default | Description |
|---|---|---|
| `OPENAI_API_KEY` | — | OpenAI key |
| `ANTHROPIC_API_KEY` | — | Anthropic key |
| `GROQ_API_KEY` | — | Groq key |
| `TOGETHER_API_KEY` | — | Together AI key |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama endpoint (no key needed) |
| `DATAFORGE_LLM_PROVIDER` | `openai` | Active provider |
| `DATAFORGE_LLM_MODEL` | `gpt-4o-mini` | Model name |
| `DATAFORGE_RATE_LIMIT` | `2.0` | Requests/sec per domain |
| `DATAFORGE_MAX_PAGES` | `500` | Max pages scraped per session |
| `DATAFORGE_MAX_CRAWL_PAGES` | `50` | Max pages found by BFS crawler |
| `DATAFORGE_MAX_CRAWL_DEPTH` | `3` | Max link depth for BFS crawler |
| `DATAFORGE_CHUNK_SIZE` | `512` | Tokens per chunk |
| `DATAFORGE_CHUNK_OVERLAP` | `64` | Token overlap between chunks |
| `DATAFORGE_LOG_LEVEL` | `INFO` | `DEBUG` / `INFO` / `WARNING` / `ERROR` |
| `DATAFORGE_OUTPUT_DIR` | `./output` | Session output directory |
| `DATAFORGE_DB_PATH` | `./dataforge.db` | SQLite database path |
| `HUGGINGFACE_TOKEN` | — | HuggingFace Hub write token |
| `KAGGLE_USERNAME` | — | Kaggle username |
| `KAGGLE_KEY` | — | Kaggle API key |
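
A minimal `.env` for the default OpenAI setup (values are placeholders; any variable from the table works the same way):

```
# .env
DATAFORGE_LLM_PROVIDER=openai
DATAFORGE_LLM_MODEL=gpt-4o-mini
OPENAI_API_KEY=sk-...
DATAFORGE_RATE_LIMIT=2.0
DATAFORGE_OUTPUT_DIR=./output
```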

### Using Ollama (fully local, no API key)

```bash
ollama serve
ollama pull llama3.2
dataforge config   # choose ollama / llama3.2
dataforge
```

---

## Pipeline stages

```
Discovery → Collection → Processing → Generation → Quality → Export
```

Each stage is pausable and resumable. The session state is persisted to SQLite after every stage.
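
A rough idea of what per-stage persistence can look like with SQLModel (the model and fields below are illustrative; the real schema lives in `src/dataforge/storage/` and will differ):

```python
from sqlmodel import Field, Session, SQLModel, create_engine

class PipelineSession(SQLModel, table=True):  # illustrative, not the real schema
    id: int | None = Field(default=None, primary_key=True)
    seed_url: str
    stage: str = "discovery"  # last completed stage
    paused: bool = False

engine = create_engine("sqlite:///dataforge.db")  # matches DATAFORGE_DB_PATH's default
SQLModel.metadata.create_all(engine)

with Session(engine) as db:
    run = PipelineSession(seed_url="https://example.com", stage="collection")
    db.add(run)
    db.commit()  # committed after each stage, so `dataforge resume` can pick up here
```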

---

## Development

```bash
git clone https://github.com/ianktoo/data-forge.git
cd data-forge
uv sync --extra dev
uv run pytest
uv run ruff check src/ tests/
uv run mypy src/
```

---

## Releasing

```bash
# Bump version
uv version --bump patch   # or minor / major

# Commit, tag, push — CI handles the rest
git add pyproject.toml uv.lock
git commit -m "Bump version to $(uv version --short)"
git tag v$(uv version --short)
git push origin master --tags
```

GitHub Actions will:
1. Build cross-platform executables (Windows, macOS, Linux) via PyInstaller
2. Attach them to a GitHub Release
3. Publish the package to PyPI via `uv publish` using Trusted Publishers

---

## Project structure

```
data-forge/
├── src/dataforge/
│   ├── agents/          # pipeline stage agents (explorer, scraper, processor, …)
│   ├── cli/             # typer app, prompts, UI, prefs, tips
│   ├── collectors/      # HTTP client, sitemap parser, BFS crawler, HTML extractor
│   ├── config/          # pydantic-settings, provider registry
│   ├── exporters/       # local, HuggingFace, Kaggle
│   ├── generators/      # LiteLLM wrapper, synthetic sample generation
│   ├── processors/      # chunker, cleaner, formatter
│   ├── storage/         # SQLModel models, database session
│   └── utils/           # logger, rate limiter, URL sanitiser, errors
├── tests/
├── .github/workflows/
│   ├── build-executables.yml
│   └── publish-pypi.yml
├── pyproject.toml
└── uv.lock
```

---

## License

See [LICENSE](LICENSE) for details.
