Metadata-Version: 2.4
Name: S3impleClient
Version: 0.0.1
Summary: A simple, fast, and robust async S3/HTTP downloader with parallel range requests.
Author-email: "Shih-Ying Yeh(KohakuBlueLeaf)" <apolloyeh0123@gmail.com>
License: Apache-2.0
Project-URL: Homepage, https://github.com/KohakuBlueleaf/S3impleClient
Keywords: s3,download,async,parallel,huggingface
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: httpx>=0.24.0
Requires-Dist: tqdm>=4.60.0
Requires-Dist: aiofiles>=25.0.0
Provides-Extra: dev
Requires-Dist: pytest>=9.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=1.3.0; extra == "dev"
Provides-Extra: jupyter
Requires-Dist: nest_asyncio>=1.5.0; extra == "jupyter"
Provides-Extra: hf
Requires-Dist: huggingface_hub>=1.0.0; extra == "hf"
Provides-Extra: all
Requires-Dist: S3impleClient[dev,hf,jupyter]; extra == "all"
Dynamic: license-file

# S3impleClient

A simple, fast, and robust async S3/HTTP downloader and uploader with pipelined parallel transfers.

## Features

- **Pipelined Parallel I/O**: Downloads the next large chunk while the previous one is written to disk; reads the next large chunk while the current one uploads
- **Two-Level Chunking**: Large chunks for disk I/O, small chunks for network requests
- **Async/Sync Support**: Use in both async and synchronous contexts
- **HuggingFace Hub Integration**: Patch `huggingface_hub` for faster model downloads/uploads
- **Progress Tracking**: Built-in tqdm progress bars with `[S3C]` prefix
- **Configurable Logging**: Debug upload/download operations with `configure_logging()`
- **Automatic Fallback**: Falls back to a single-stream download for servers without range support
- **Retry Logic**: Exponential backoff retry for failed chunks

## Installation

```bash
pip install s3impleclient
```

## Quick Start

### Download

```python
import s3impleclient as s3c

# Synchronous download
result = s3c.download(
    url="https://example.com/large-file.bin",
    dest="./downloads/file.bin",
)

if result.success:
    print(f"Downloaded {result.total_bytes:,} bytes")
```

### Upload (Multipart)

```python
import s3impleclient as s3c

# Upload with pre-signed multipart URLs (from S3 or similar)
result = s3c.upload(
    file_path="./large-file.bin",
    part_urls=["https://s3.../part1", "https://s3.../part2", ...],
    chunk_size=64 * 1024 * 1024,  # 64MB per part (from server)
    completion_url="https://s3.../complete",  # optional
)

if result.success:
    print(f"Uploaded {result.total_bytes:,} bytes in {len(result.parts)} parts")
```

### HuggingFace Hub Integration

```python
import logging
import s3impleclient as s3c
from huggingface_hub import hf_hub_download, upload_folder

# Enable logging to see transfer details
s3c.configure_logging(logging.INFO)

# Patch both download and upload
s3c.patch_all()

# Downloads now use S3impleClient (look for [S3C] in progress bar)
path = hf_hub_download(
    repo_id="username/model",
    filename="model.safetensors",
)

# Uploads also use parallel multipart
upload_folder(
    folder_path="./my-model",
    repo_id="username/model",
)

# Restore original behavior
s3c.unpatch_all()
```

## CLI Usage

```bash
# Download
s3c download https://example.com/file.bin
s3c download https://example.com/file.bin -o ./myfile.bin
s3c download https://example.com/file.bin -w 16 -c 20  # workers, chunk MB

# Upload (requires pre-signed URLs; multipart reads the part URLs from a JSON file)
s3c upload ./file.bin --url https://s3.../upload  # single part
s3c upload ./file.bin --part-urls parts.json --chunk-size 67108864  # multipart
```

## How It Works

### Download Pipeline

S3impleClient pipelines downloads so that network transfer and disk writes overlap:

```
Time ->
┌─────────────────────────────────────────────────────────────┐
│ Download Large Chunk 0 (parallel HTTP range requests)       │
│                        │ Write Chunk 0 │ Download Chunk 1   │
│                                        │ Write 1 │ Download │
│                                                  │ Write... │
└─────────────────────────────────────────────────────────────┘
```
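
The overlap in the diagram above can be sketched with plain `asyncio`. This is a minimal illustration, not the library's implementation: `fetch_large_chunk`, `write_large_chunk`, and `pipelined_download` are hypothetical names, and the network and disk work are simulated with short sleeps.

```python
import asyncio


async def fetch_large_chunk(index: int) -> bytes:
    """Hypothetical stand-in for the parallel range requests filling one large chunk."""
    await asyncio.sleep(0.01)
    return f"chunk-{index}".encode()


async def write_large_chunk(sink: list, data: bytes) -> None:
    """Hypothetical stand-in for the disk write of one large chunk."""
    await asyncio.sleep(0.01)
    sink.append(data)


async def pipelined_download(num_chunks: int) -> list:
    sink = []
    pending_write = None
    for index in range(num_chunks):
        # The previous write task keeps running while this fetch is awaited.
        data = await fetch_large_chunk(index)
        if pending_write is not None:
            # Wait for the previous write so writes stay ordered and at most
            # two large chunks are held in memory at once.
            await pending_write
        pending_write = asyncio.create_task(write_large_chunk(sink, data))
    if pending_write is not None:
        await pending_write
    return sink


written = asyncio.run(pipelined_download(3))
print(written)
```

Awaiting the previous write before scheduling the next one is one possible way to bound memory; a queue with a fixed size would serve the same purpose.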

**Two-level chunking:**
- **Large chunks (128MB default)**: Units for disk writes - fits in memory, efficient I/O
- **Small chunks (4MB default)**: Units for HTTP range requests - parallel within large chunk

```
Large Chunk 0 (128MB)
├── HTTP Range 0-4MB      ─┐
├── HTTP Range 4-8MB       │
├── HTTP Range 8-12MB      ├── Parallel (8 workers)
├── ...                    │
└── HTTP Range 124-128MB  ─┘
         │
         ▼
    Write to disk (while downloading next large chunk)
```
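
To make the two-level split concrete, the sketch below computes the byte ranges for a file using the default sizes described above. `plan_ranges` is a hypothetical helper written for illustration, not part of the S3impleClient API; the ranges are inclusive, matching the form of an HTTP `Range: bytes=start-end` header.

```python
MB = 1024 * 1024


def plan_ranges(total_size, write_chunk_size=128 * MB, chunk_size=4 * MB):
    """Split a file into large chunks, each a list of inclusive (start, end) byte ranges."""
    plan = []
    for large_start in range(0, total_size, write_chunk_size):
        large_end = min(large_start + write_chunk_size, total_size)
        ranges = [
            (start, min(start + chunk_size, large_end) - 1)
            for start in range(large_start, large_end, chunk_size)
        ]
        plan.append(ranges)
    return plan


plan = plan_ranges(300 * MB)
print(len(plan))      # large chunks: 128MB + 128MB + 44MB
print(len(plan[0]))   # small ranges inside the first large chunk
start, end = plan[0][0]
print(f"Range: bytes={start}-{end}")
```

For a 300MB file this yields three large chunks; the first two hold 32 ranges each and the last holds 11.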

### Upload Pipeline

Uploads are pipelined the same way, with prefetch: the next large chunk is read from disk while the parts of the current one upload.

```
Time ->
┌─────────────────────────────────────────────────────────────┐
│ Read Large Chunk 0 (32 parts)                               │
│                          │ Upload Parts 0-7   (parallel)    │
│                          │ Upload Parts 8-15  (parallel)    │
│                          │ Upload Parts 16-23 (parallel)    │
│                          │ Upload Parts 24-31 │ Read Chunk 1│
│                                               │ Upload...   │
└─────────────────────────────────────────────────────────────┘
```

**Upload chunking:**
- **Large chunk**: `max_workers_per_file * prefetch_factor * part_size` bytes read at once
- **Part size**: Defined by server (e.g., 64MB for HuggingFace)
- **Parallel uploads**: Limited by `max_workers_per_file` semaphore
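
As an illustration of the semaphore limit, the sketch below uploads 32 simulated parts while never allowing more than `max_workers_per_file` in flight. `upload_parts` and the fake ETag strings are assumptions made for this example; a real upload would issue a PUT to each pre-signed part URL instead of sleeping.

```python
import asyncio


async def upload_parts(parts, max_workers_per_file=8):
    """Upload all parts concurrently, bounded by a semaphore (simulated I/O)."""
    semaphore = asyncio.Semaphore(max_workers_per_file)
    etags = [None] * len(parts)
    in_flight = 0
    peak = 0

    async def upload_one(index, data):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.001)  # stand-in for the HTTP PUT of one part
            in_flight -= 1
            etags[index] = f"etag-{index}-{len(data)}"  # fake ETag for illustration

    await asyncio.gather(*(upload_one(i, d) for i, d in enumerate(parts)))
    return etags, peak


etags, peak = asyncio.run(upload_parts([b"x" * 10] * 32))
print(peak, len(etags))
```

Storing results by index keeps the part order stable regardless of which upload finishes first, which matters when the part list is later sent to a completion endpoint.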

With defaults (8 workers, 4 prefetch, 64MB parts):
- Large chunk = 8 * 4 * 64MB = 2GB read into memory
- 8 parts upload in parallel at any time
- While uploading, next 2GB is being read
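
The arithmetic above generalizes to other settings. The helper below is a hypothetical sketch (`upload_read_plan` is not a library function) that derives the read size and the number of reads for a given file. Because the next large chunk is read while the current one uploads, peak memory may approach twice the large-chunk size; that is an inference from the description above, not a documented figure.

```python
MB = 1024 * 1024


def upload_read_plan(file_size, part_size=64 * MB,
                     max_workers_per_file=8, prefetch_factor=4):
    """Derive how many parts and bytes each large read covers (illustrative only)."""
    parts_per_read = max_workers_per_file * prefetch_factor
    large_chunk_bytes = parts_per_read * part_size
    total_parts = -(-file_size // part_size)    # ceiling division
    reads = -(-total_parts // parts_per_read)   # ceiling division
    return {
        "parts_per_read": parts_per_read,
        "large_chunk_bytes": large_chunk_bytes,
        "total_parts": total_parts,
        "reads": reads,
    }


plan = upload_read_plan(5 * 1024 * MB)  # a 5GB file
print(plan)
```

Lowering `prefetch_factor` or `max_workers_per_file` shrinks the large chunk proportionally, which is the lever to reach for on memory-constrained machines.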

## Configuration

### Download Config

```python
import s3impleclient as s3c

s3c.configure_download(s3c.DownloadConfig(
    chunk_size=4 * 1024 * 1024,         # 4MB per HTTP request
    write_chunk_size=128 * 1024 * 1024, # 128MB per disk write
    max_workers=8,                       # Parallel HTTP requests
    timeout=30.0,
    max_retries=5,
))
```
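
The `max_retries` setting pairs with the exponential backoff listed under Features. The sketch below shows the general technique only: `retry_with_backoff`, `base_delay`, and the injectable `sleep` parameter are hypothetical, and the library's actual delays and retry conditions may differ.

```python
import asyncio


async def retry_with_backoff(operation, max_retries=5, base_delay=0.5,
                             sleep=asyncio.sleep):
    """Await `operation()` until it succeeds, doubling the delay after each failure."""
    for attempt in range(max_retries + 1):
        try:
            return await operation()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            await sleep(base_delay * 2 ** attempt)


# Demo: a simulated chunk fetch that fails twice before succeeding.
delays = []
attempts = {"count": 0}


async def fake_sleep(seconds):
    delays.append(seconds)  # record the delay instead of actually waiting


async def flaky_fetch():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated failure")
    return b"payload"


result = asyncio.run(retry_with_backoff(flaky_fetch, sleep=fake_sleep))
print(result, delays)
```

Production code would usually catch only transient errors (timeouts, connection resets, 5xx responses) and add random jitter to the delay.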

### Upload Config

```python
s3c.configure_upload(s3c.UploadConfig(
    max_workers_per_file=8,   # Parallel uploads per file
    max_file_concurrency=4,   # Parallel files (for multi-file upload)
    prefetch_factor=4,        # Read 8*4=32 parts at once
    timeout=60.0,
    max_retries=5,
))
```

### Logging

```python
import logging
import s3impleclient as s3c

# See upload/download configuration
s3c.configure_logging(logging.INFO)

# See per-chunk progress details
s3c.configure_logging(logging.DEBUG)
```

## API Reference

### Download

| Function / Class | Description |
|----------|-------------|
| `download(url, dest, ...)` | Sync download to file |
| `download_async(url, dest, ...)` | Async download to file |
| `configure_download(config)` | Set default download config |
| `Downloader(config)` | Create custom downloader instance |

### Upload

| Function / Class | Description |
|----------|-------------|
| `upload(file_path, ...)` | Sync upload single file |
| `upload_async(file_path, ...)` | Async upload single file |
| `upload_files(files, ...)` | Sync upload multiple files |
| `upload_files_async(files, ...)` | Async upload multiple files |
| `configure_upload(config)` | Set default upload config |
| `Uploader(config)` | Create custom uploader instance |

### HuggingFace Patching

| Function | Description |
|----------|-------------|
| `patch_huggingface_hub(config)` | Patch downloads only |
| `patch_huggingface_hub_upload(config)` | Patch uploads only |
| `patch_all(dl_config, ul_config)` | Patch both |
| `unpatch_huggingface_hub()` | Restore original download |
| `unpatch_huggingface_hub_upload()` | Restore original upload |
| `unpatch_all()` | Restore both |
| `is_patched()` | Check download patch status |
| `is_upload_patched()` | Check upload patch status |

### Logging

| Function | Description |
|----------|-------------|
| `configure_logging(level)` | Set logging level (default: WARNING) |

## Documentation

See the [docs/](docs/) directory for detailed documentation:

### Concepts
- [Parallel Range Downloads](docs/concept/parallel-range-downloads.md) - How parallel downloads work
- [Parallel Multipart Uploads](docs/concept/parallel-multipart-uploads.md) - How parallel uploads work
- [HuggingFace Hub Download Flow](docs/concept/huggingface-hub-download-flow.md) - Download integration details
- [HuggingFace Hub Upload Flow](docs/concept/huggingface-hub-upload-flow.md) - Upload integration details

### Implementation
- [Architecture](docs/impl/architecture.md) - Code structure and design
- [API Reference](docs/impl/api-reference.md) - Full API documentation

## Examples

See the [examples/](examples/) directory:
- `basic_download.py` - Sync and async download usage
- `huggingface_download.py` - HuggingFace Hub download integration
- `huggingface_patch.py` - Patching details
- `progress_callback.py` - Custom progress tracking
- `huggingface_upload.py` - HuggingFace Hub upload integration

## License

Apache-2.0
