Metadata-Version: 2.4
Name: pathik
Version: 0.3.1
Summary: A web crawler implemented in Go with Python bindings
Home-page: https://github.com/justrach/pathik
Author: Your Name
Author-email: your.email@example.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.25.0
Requires-Dist: tqdm>=4.50.0
Provides-Extra: kafka
Requires-Dist: kafka-python>=2.0.0; extra == "kafka"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license-file
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Pathik

<p align="center">
  <img src="assets/pathik_logo.png" alt="Pathik Logo">
</p>

A high-performance web crawler implemented in Go with Python and JavaScript bindings. It converts web pages to both HTML and Markdown formats.

## Features

- Fast crawling with Go's concurrency model
- Clean content extraction
- Markdown conversion
- Parallel URL processing
- Cloudflare R2 integration
- Kafka streaming support with configurable buffer sizes
- Enhanced security with URL and IP validation
- Memory-efficient (uses ~10x less memory than browser automation tools)
- Automatic binary version management
- Customizable compression options (Gzip and Snappy support)
- Session-based message tracking for multi-user environments

## Performance Benchmarks

### Memory Usage Comparison

Pathik is significantly more memory-efficient than browser automation tools like Playwright:

<p align="center">
  <img src="assets/PathikvPlaywright.png" alt="Memory Usage Comparison">
</p>

### Parallel Crawling Performance

Parallel crawling significantly improves performance when processing multiple URLs. Our benchmarks show:

#### Python Performance

```
Testing with 5 URLs:
- Parallel crawling completed in 7.78 seconds
- Sequential crawling completed in 18.52 seconds
- Performance improvement: 2.38x faster with parallel crawling
```

#### JavaScript Performance

```
Testing with 5 URLs:
- Parallel crawling completed in 6.96 seconds
- Sequential crawling completed in 21.07 seconds
- Performance improvement: 3.03x faster with parallel crawling
```

Parallel crawling is enabled by default when processing multiple URLs, but you can explicitly control it with the `parallel` parameter.

## Installation

```bash
pip install pathik
```

The package will automatically download the correct binary for your platform from GitHub releases on first use.

### Binary Version Management

Pathik now automatically handles binary version checking and updates:

- When you install or upgrade the Python package, it will check if the binary matches the package version
- If the versions don't match, it will automatically download the correct binary
- You can manually check and update the binary with:
  ```python
  # Force binary update
  import pathik
  from pathik.crawler import get_binary_path
  binary_path = get_binary_path(force_download=True)
  ```

- Command line options:
  ```bash
  # Check if binary is up to date
  pathik --check-binary
  
  # Force update of the binary
  pathik --force-update-binary
  ```

This ensures you always have the correct binary version with all the latest features, especially when using new functionality like Kafka streaming with session IDs.

## Usage

### Python API

#### Basic Crawling

```python
import pathik
from pathik.crawler import CrawlerError

try:
    # Crawl a single URL with named parameters
    result = pathik.crawl(urls="https://example.com")
    
    # Check result
    if "https://example.com" in result:
        url_result = result["https://example.com"]
        if url_result.get("success", False):
            print(f"HTML saved to: {url_result.get('html')}")
            print(f"Markdown saved to: {url_result.get('markdown')}")
        else:
            print(f"Error: {url_result.get('error')}")
    
    # Crawl multiple URLs in parallel
    results = pathik.crawl(
        urls=["https://example.com", "https://httpbin.org/html"],
        parallel=True  # This is the default
    )
    
    # To disable parallel crawling
    results = pathik.crawl(
        urls=["https://example.com", "https://httpbin.org/html"],
        parallel=False
    )
    
    # To specify output directory
    results = pathik.crawl(
        urls="https://example.com",
        output_dir="./output"
    )
    
except CrawlerError as e:
    print(f"Crawler error: {e}")
```

#### R2 Upload

```python
import pathik
import uuid
from pathik.crawler import CrawlerError

try:
    # Generate a UUID or use your own
    my_uuid = str(uuid.uuid4())
    
    # Crawl and upload to R2
    results = pathik.crawl_to_r2(
        urls="https://example.com",
        uuid_str=my_uuid
    )
    
    if "https://example.com" in results:
        r2_result = results["https://example.com"]
        print(f"UUID: {r2_result.get('uuid')}")
        print(f"R2 HTML key: {r2_result.get('r2_html_key')}")
        print(f"R2 Markdown key: {r2_result.get('r2_markdown_key')}")
    
    # Upload multiple URLs
    results = pathik.crawl_to_r2(
        urls=["https://example.com", "https://httpbin.org/html"],
        uuid_str=my_uuid
    )
    
except CrawlerError as e:
    print(f"Crawler error: {e}")
```

#### Secure Kafka Streaming with Buffer Customization

```python
import pathik
import uuid
from pathik.crawler import CrawlerError

try:
    # Generate a session ID for tracking
    session_id = str(uuid.uuid4())
    
    # Stream a single URL to Kafka
    result = pathik.stream_to_kafka(
        urls="https://example.com",
        session=session_id
    )
    
    if "https://example.com" in result:
        if result["https://example.com"].get("success", False):
            print(f"Successfully streamed")
    
    # Stream multiple URLs with custom options
    results = pathik.stream_to_kafka(
        urls=["https://example.com", "https://httpbin.org/html"],
        content_type="html",              # Options: "html", "markdown", or "both"
        topic="custom_topic",             # Optional custom topic
        session=session_id,               # Optional session ID
        parallel=True,                    # Process URLs in parallel (default)
        max_message_size=15728640,        # 15MB message size limit
        buffer_memory=157286400           # 150MB buffer memory
    )
    
    # Check results
    for url, status in results.items():
        if status.get("success", False):
            print(f"Successfully streamed {url}")
        else:
            print(f"Failed to stream {url}: {status.get('error')}")
            
except CrawlerError as e:
    print(f"Crawler error: {e}")
```

### Command Line

```bash
# Crawl a single URL
pathik crawl https://example.com

# Crawl multiple URLs
pathik crawl https://example.com https://httpbin.org/html

# Specify output directory
pathik crawl -o ./output https://example.com

# Use sequential (non-parallel) mode
pathik crawl -s https://example.com https://httpbin.org/html

# Upload to R2 (Cloudflare)
pathik r2 https://example.com

# Stream crawled content to Kafka
pathik kafka https://example.com

# Stream only HTML content to Kafka
pathik kafka -c html https://example.com

# Stream only Markdown content to Kafka
pathik kafka -c markdown https://example.com

# Stream to a specific Kafka topic
pathik kafka -t user1_crawl_data https://example.com

# Add a session ID for multi-user environments
pathik kafka --session user123 https://example.com

# Set custom buffer sizes
pathik kafka --max-message-size 15728640 --buffer-memory 157286400 https://example.com

# Combine options
pathik kafka -c html -t user1_data --session user123 --max-message-size 15728640 https://example.com
```

## Kafka Streaming

Pathik supports streaming crawled content directly to Kafka. This is useful for real-time processing pipelines.

### Kafka Configuration

Configure Kafka connection details using environment variables or `.env` file:

```
KAFKA_BROKERS=localhost:9092        # Comma-separated list of brokers
KAFKA_TOPIC=pathik_crawl_data       # Topic to publish to
KAFKA_USERNAME=                     # Optional username for SASL authentication
KAFKA_PASSWORD=                     # Optional password for SASL authentication
KAFKA_CLIENT_ID=pathik-crawler      # Client ID for Kafka
KAFKA_USE_TLS=false                 # Whether to use TLS
KAFKA_MAX_MESSAGE_SIZE=10485760     # 10MB max message size (default)
KAFKA_BUFFER_MEMORY=104857600       # 100MB buffer memory (default)
KAFKA_MAX_REQUEST_SIZE=20971520     # 20MB max request size (default)
```

### Buffer Size Customization

You can customize Kafka producer and consumer buffer sizes for handling large content:

```python
# Custom buffer sizes for producer
pathik.stream_to_kafka(
    "https://example.com",
    max_message_size=15728640,    # 15MB max message size (default: 10MB)
    buffer_memory=157286400       # 150MB buffer memory (default: 100MB)
)

# For consuming large messages
python kafka_consumer_direct.py --session=YOUR_SESSION_ID --max-bytes=20971520 --max-partition-bytes=10485760
```

For debugging or handling large web pages, you may need to increase these buffer sizes to prevent message size errors.

### Optional Kafka Dependencies

To use Kafka streaming, install the required dependencies:

```bash
# Install pathik with Kafka support
pip install "pathik[kafka]"

# Or install the dependency separately
pip install kafka-python

# For Snappy compression support (recommended)
pip install python-snappy
```

If Kafka dependencies are not available, Pathik will use a fallback simulation mode that logs the messages locally without actually sending them to Kafka.

### Kafka Message Format

When streaming to Kafka, Pathik sends two messages per URL:

1. HTML Content:
   - Key: URL
   - Value: Raw HTML content
   - Headers:
     - url: The original URL
     - contentType: "text/html"
     - timestamp: ISO 8601 timestamp
     - sessionID: Session ID (if provided)

2. Markdown Content:
   - Key: URL
   - Value: Markdown content
   - Headers:
     - url: The original URL
     - contentType: "text/markdown"
     - timestamp: ISO 8601 timestamp
     - sessionID: Session ID (if provided)

## Security Enhancements

Pathik includes several security features to ensure safe and reliable crawling:

### URL Validation

URLs are validated to prevent security issues:
- Only HTTP and HTTPS schemes are allowed
- Private IP addresses and localhost access are restricted
- Hostnames are resolved and checked against private IP ranges

### Input Sanitization

All inputs (file paths, session IDs, etc.) are sanitized to prevent injection attacks:
- File paths are checked for directory traversal attempts
- Session IDs are validated against a safe pattern (alphanumeric and some special chars)
- Topic names and other parameters are validated for safe characters

### Rate Limiting

Built-in rate limiting prevents accidentally overloading target servers:
- Default rate limit of 1 request per second with configurable burst
- Automatic retries with exponential backoff
- Delay between retries is configurable

### Error Handling

Robust error handling ensures graceful failure:
- Detailed error messages for troubleshooting
- Automatic retries for transient failures
- Graceful shutdown with proper cleanup
