Metadata-Version: 2.4
Name: kabigon
Version: 0.18.0
Author-email: narumi <toucans-cutouts0f@icloud.com>
License-File: LICENSE
Requires-Python: >=3.12
Requires-Dist: charset-normalizer>=3.4.4
Requires-Dist: firecrawl-py>=4.14.1
Requires-Dist: httpx>=0.28.1
Requires-Dist: markdownify>=1.2.2
Requires-Dist: openai-whisper>=20250625
Requires-Dist: playwright>=1.58.0
Requires-Dist: pypdf>=6.7.0
Requires-Dist: rich>=14.3.2
Requires-Dist: typer>=0.23.0
Requires-Dist: youtube-transcript-api>=1.2.4
Requires-Dist: yt-dlp>=2026.2.4
Description-Content-Type: text/markdown

# kabigon

[![PyPI version](https://badge.fury.io/py/kabigon.svg)](https://badge.fury.io/py/kabigon)
[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![codecov](https://codecov.io/gh/narumiruna/kabigon/branch/main/graph/badge.svg)](https://codecov.io/gh/narumiruna/kabigon)

Kabigon is a URL content loader library that extracts content from various sources (YouTube, Instagram Reels, Twitter/X, Reddit, Truth Social, PTT, GitHub files, PDFs, and generic web pages) and converts it to text or Markdown.

## Features

✨ **Multi-Platform Support**: YouTube, Twitter/X, Truth Social, Reddit, Instagram Reels, PTT, GitHub files, PDFs, and generic web pages

🔄 **Async-First Design**: Built with async/await for efficient parallel processing

🎯 **Smart Routing + Fallback**: Routes URLs to source-specific pipelines first, then tries deduplicated fallback loaders

🚀 **Simple API**: Single-line usage with sensible defaults, or full control with custom loader chains

🔌 **Extensible**: Easy to add new loaders for additional platforms

## Table of Contents

- [Installation](#installation)
- [Usage](#usage)
  - [CLI](#cli)
  - [Python API - Sync](#python-api---sync)
  - [Python API - Async](#python-api---async)
  - [API Comparison](#api-comparison)
- [Supported Sources](#supported-sources)
- [Examples](#examples)
- [Troubleshooting](#troubleshooting)
- [Development](#development)
- [License](#license)

## Installation

```shell
uv tool install kabigon
# or run it directly without installing
uvx kabigon <url>

# Install Playwright browsers
uvx playwright install chromium
# or
uvx playwright install chrome
```

## Usage

### CLI

```shell
uvx kabigon <url>

# Examples
uvx kabigon --list
uvx kabigon --loader youtube,playwright https://www.youtube.com/watch?v=dQw4w9WgXcQ
uvx kabigon --loader twitter https://x.com/elonmusk/status/123456789
uvx kabigon https://www.youtube.com/watch?v=dQw4w9WgXcQ
uvx kabigon https://truthsocial.com/@realDonaldTrump/posts/123456
uvx kabigon https://reddit.com/r/python/comments/xyz/...
uvx kabigon https://github.com/anthropics/claude-code/blob/main/plugins/ralph-wiggum/README.md
uvx kabigon https://example.com/document.pdf
```

By default (without `--loader`), Kabigon routes the URL to a source-specific pipeline first (for example YouTube),
then runs the remaining default fallback loaders without repeating already-attempted loaders.
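
The routing-then-fallback order described above can be sketched roughly as follows. This is only an illustration of the idea, not kabigon's actual implementation: `SOURCE_PIPELINES`, `DEFAULT_FALLBACK`, and `plan_loaders` are hypothetical names, and the real pipelines and their order may differ.

```python
from urllib.parse import urlparse

# Hypothetical routing tables, for illustration only.
SOURCE_PIPELINES = {
    "youtube.com": ["youtube"],
    "x.com": ["twitter"],
}
DEFAULT_FALLBACK = ["youtube", "twitter", "playwright"]


def plan_loaders(url: str) -> list[str]:
    """Return loader names in the order they would be attempted."""
    host = (urlparse(url).hostname or "").lower().removeprefix("www.")
    # Start with the source-specific pipeline, if the host has one.
    plan = list(SOURCE_PIPELINES.get(host, []))
    # Append the default fallbacks, skipping loaders already in the plan.
    for name in DEFAULT_FALLBACK:
        if name not in plan:
            plan.append(name)
    return plan


print(plan_loaders("https://x.com/user1/status/123"))
```

In this sketch the Twitter/X URL is planned as `twitter` first, followed by the fallbacks that have not been tried yet.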

### Python API - Sync

```python
import kabigon
from kabigon.loaders import Compose
from kabigon.loaders import PDFLoader
from kabigon.loaders import PlaywrightLoader
from kabigon.loaders import RedditLoader
from kabigon.loaders import TruthSocialLoader
from kabigon.loaders import TwitterLoader
from kabigon.loaders import YoutubeLoader

url = "https://www.google.com.tw"

# Simplest usage - automatically uses the best loader
content = kabigon.load_url_sync(url)
print(content)

# Or use a specific loader
content = PlaywrightLoader().load_sync(url)
print(content)

# With multiple loaders (tries each in order)
loader = Compose([
    TwitterLoader(),
    TruthSocialLoader(),
    YoutubeLoader(),
    RedditLoader(),
    PDFLoader(),
    PlaywrightLoader(),  # Fallback for generic URLs
])
content = loader.load_sync(url)
print(content)
```

### Python API - Async

```python
import asyncio
import kabigon
from kabigon.loaders import Compose
from kabigon.loaders import PlaywrightLoader
from kabigon.loaders import RedditLoader
from kabigon.loaders import TruthSocialLoader
from kabigon.loaders import TwitterLoader
from kabigon.loaders import YoutubeLoader

async def main():
    url = "https://www.google.com.tw"

    # Simplest usage - automatically uses the best loader
    content = await kabigon.load_url(url)
    print(content)

    # Or use a specific loader
    loader = PlaywrightLoader()
    content = await loader.load(url)
    print(content)

    # Batch processing multiple URLs in parallel
    urls = [
        "https://x.com/user1/status/123",
        "https://truthsocial.com/@user/posts/456",
        "https://youtube.com/watch?v=abc",
        "https://reddit.com/r/python/comments/xyz",
    ]

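    # Custom loader chain (tries each loader in order)
    loader = Compose([
        TwitterLoader(),
        TruthSocialLoader(),
        YoutubeLoader(),
        RedditLoader(),
        PlaywrightLoader(),
    ])

    # Parallel processing with automatic loader selection
    results = await asyncio.gather(*[kabigon.load_url(url) for url in urls])
    for url, content in zip(urls, results):
        print(f"{url}: {len(content)} chars")

    # Parallel processing with the custom loader chain
    results = await asyncio.gather(*[loader.load(url) for url in urls])
    for url, content in zip(urls, results):
        print(f"{url}: {len(content)} chars")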

asyncio.run(main())
```

### API Comparison

| Usage | Simplest | Custom Loader Chain |
|-------|----------|---------------------|
| **Sync** | `kabigon.load_url_sync(url)` | `loader.load_sync(url)` |
| **Async** | `await kabigon.load_url(url)` | `await loader.load(url)` |
| **Batch Async** | `await asyncio.gather(*[kabigon.load_url(url) for url in urls])` | `await asyncio.gather(*[loader.load(url) for url in urls])` |

## Supported Sources

| Source | Loader | Description |
|--------|--------|-------------|
| YouTube | `YoutubeLoader` | Extracts video transcripts |
| YouTube | `YoutubeYtdlpLoader` | Audio transcription via yt-dlp + Whisper |
| Twitter/X | `TwitterLoader` | Extracts tweet content |
| Truth Social | `TruthSocialLoader` | Extracts Truth Social posts |
| Reddit | `RedditLoader` | Extracts Reddit posts and comments |
| Instagram Reels | `ReelLoader` | Audio transcription + metadata |
| GitHub | `GitHubLoader` | Fetches GitHub web pages and file content (supports repo URLs and `github.com/.../blob/...` file URLs) |
| BBC | `BBCLoader` | BBC article extraction with article-aware parsing |
| CNN | `CNNLoader` | CNN article extraction with article-aware parsing |
| PDF | `PDFLoader` | Extracts text from PDF files (URL or local) |
| PTT | `PttLoader` | Taiwan PTT forum posts |
| Generic Web | `PlaywrightLoader` | Browser-based scraping for any website |
| Generic Web | `HttpxLoader` | Simple HTTP requests with markdown conversion |

## Examples

See the [`examples/`](examples/) directory for more usage examples:

- [`simple_usage.py`](examples/simple_usage.py) - Basic single-line usage
- [`async_usage.py`](examples/async_usage.py) - Async usage and parallel batch processing
- [`twitter.py`](examples/twitter.py) - Twitter/X post extraction
- [`truthsocial.py`](examples/truthsocial.py) - Truth Social post extraction
- [`read_reddit.py`](examples/read_reddit.py) - Reddit post and comments extraction
- [`ptt.py`](examples/ptt.py) - PTT forum post extraction
- [`fetch_billgertz_tweet.py`](examples/fetch_billgertz_tweet.py) - Real-world Twitter scraping example

## Troubleshooting

### Playwright browser not installed

**Error**: `Executable doesn't exist at /path/to/chromium`

**Solution**: Install Playwright browsers after installing kabigon:
```bash
playwright install chromium
```

### FFmpeg not found (for audio transcription)

**Error**: `ffmpeg not found`

**Solution**: Install FFmpeg for your platform:
```bash
# Ubuntu/Debian
sudo apt-get install ffmpeg

# macOS
brew install ffmpeg

# Windows
# Download from https://ffmpeg.org/download.html
```

Or set a custom FFmpeg path:
```bash
export FFMPEG_PATH=/path/to/ffmpeg
```

### Timeout errors

**Error**: `Timeout 30000ms exceeded`

**Solution**: Increase the timeout for slow-loading pages:
```python
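# Increase the timeout to 60 seconds (the value is in milliseconds)
from kabigon.loaders import PlaywrightLoader

url = "https://example.com"  # replace with the slow-loading page
loader = PlaywrightLoader(timeout=60_000)
content = loader.load_sync(url)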
```

### CAPTCHA or rate limiting

Some websites may show CAPTCHAs or block automated access. For Reddit, kabigon automatically uses `old.reddit.com` to avoid CAPTCHAs. For other sites, you may need to:

- Add delays between requests
- Use a custom user agent
- Implement retry logic with exponential backoff
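
As a rough illustration of the last point, retry logic with exponential backoff could look like the sketch below. `load_with_retry` is a hypothetical helper, not part of kabigon, and catching every `Exception` is an assumption made for brevity; in practice you would narrow it to the errors you actually see. `load` can be any coroutine function that takes a URL, such as `kabigon.load_url`.

```python
import asyncio
import random
from collections.abc import Awaitable, Callable


async def load_with_retry(
    url: str,
    load: Callable[[str], Awaitable[str]],
    attempts: int = 4,
    base_delay: float = 1.0,
) -> str:
    """Call `load(url)`, retrying failed attempts with growing delays."""
    for attempt in range(attempts):
        try:
            return await load(url)
        except Exception:
            # Give up and re-raise once the last attempt has failed.
            if attempt == attempts - 1:
                raise
            # Delay doubles each time (1s, 2s, 4s, ... by default),
            # plus up to 10% random jitter to avoid synchronized retries.
            delay = base_delay * 2**attempt
            await asyncio.sleep(delay + random.uniform(0, delay * 0.1))
    raise ValueError("attempts must be at least 1")
```

Inside an async function this could be called as `content = await load_with_retry(url, kabigon.load_url)`.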

## Development

### Setup

```bash
# Clone the repository
git clone https://github.com/narumiruna/kabigon.git
cd kabigon

# Install dependencies with uv
uv sync

# Install Playwright browsers
playwright install chromium
```

### Testing

```bash
# Run all tests with coverage
uv run pytest -v -s --cov=src tests

# Run specific test file
uv run pytest -v -s tests/loaders/test_youtube.py
```

Current test coverage: **69%** (37 tests passing)

### Linting and Type Checking

```bash
# Run linter
uv run ruff check .

# Run type checker
uv run ty check .

# Auto-fix linting issues
uv run ruff check --fix .

# Format code
uv run ruff format .
```

### Building and Publishing

```bash
# Build wheel
uv build --wheel

# Publish to PyPI
uv publish
```

### Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

When adding a new loader:
1. Create a new file in `src/kabigon/loaders/`
2. Inherit from the `Loader` base class
3. Implement `async def load(self, url: str) -> str`
4. Add domain validation so the loader only accepts URLs from its own source
5. Add tests in `tests/loaders/`
6. Update documentation
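
The shape of steps 2 to 4 might look roughly like the sketch below. To stay self-contained it defines a stand-in `Loader` base class, which is an assumption: the real base class lives in kabigon and may have a different interface. `ExampleLoader`, `allowed_domains`, and the returned text are likewise hypothetical.

```python
import asyncio
from abc import ABC, abstractmethod
from urllib.parse import urlparse


class Loader(ABC):
    """Stand-in for kabigon's Loader base class (assumption, not the real one)."""

    @abstractmethod
    async def load(self, url: str) -> str: ...

    def load_sync(self, url: str) -> str:
        # Convenience wrapper that runs the async method to completion.
        return asyncio.run(self.load(url))


class ExampleLoader(Loader):
    """Hypothetical loader for a made-up source, example.com."""

    allowed_domains = {"example.com", "www.example.com"}

    async def load(self, url: str) -> str:
        # Domain validation: reject URLs from other sites before doing any work.
        host = (urlparse(url).hostname or "").lower()
        if host not in self.allowed_domains:
            raise ValueError(f"ExampleLoader does not support {url}")
        # A real loader would fetch the page here and convert it to text/markdown.
        return f"# Content from {url}"
```

A test for such a loader would typically cover both a supported URL and a URL from another domain that should be rejected.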

See [`CLAUDE.md`](CLAUDE.md) for detailed development guidelines.

## License

MIT License - see [LICENSE](LICENSE) file for details.
