Metadata-Version: 2.4
Name: vcti-archive
Version: 1.0.2
Summary: Archive handling for VCollab applications — extract zip/tar.gz archives and stream directories as zip
Author: Visual Collaboration Technologies Inc.
Requires-Python: <3.15,>=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: fastapi
Requires-Dist: fastapi; extra == "fastapi"
Provides-Extra: test
Requires-Dist: pytest; extra == "test"
Requires-Dist: pytest-cov; extra == "test"
Requires-Dist: fastapi; extra == "test"
Provides-Extra: lint
Requires-Dist: ruff; extra == "lint"
Dynamic: license-file

# Archive Utilities

## Purpose

Archive handling for VCollab applications -- extract ZIP and TAR.GZ
archives from binary streams and stream directories as ZIP archives
(in-memory or via tempfile).

This package provides two categories of functionality:

- **Extraction** -- Extract ZIP and TAR.GZ archives from any seekable
  binary stream (`BinaryIO`) to a target directory, using either an
  in-memory `BytesIO` strategy (fast, for small files) or a
  temporary-file strategy (memory-efficient, for large files).
  Includes path traversal protection and configurable size/count limits.
- **Streaming** -- Generate ZIP archives from directory contents
  on-the-fly for download responses, with memory-based streaming for
  small directories and tempfile-based streaming for large directories.
  Supports file filtering/exclusion.
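
To make the safety checks named above concrete, here is a minimal stdlib-only sketch of path traversal and size/count validation for a ZIP stream. It is an illustration under stated assumptions, not the package's actual implementation: `check_zip_is_safe` is a hypothetical helper, and its parameter names only mirror the `max_total_size` and `max_file_count` options shown in the Quick Start.

```python
import zipfile
from pathlib import Path


def check_zip_is_safe(
    stream,
    target_dir: Path,
    max_total_size: int = 500_000_000,
    max_file_count: int = 10_000,
) -> None:
    """Raise ValueError if the archive looks unsafe (hypothetical helper)."""
    target = target_dir.resolve()
    with zipfile.ZipFile(stream) as zf:
        infos = zf.infolist()
        if len(infos) > max_file_count:
            raise ValueError(f"too many entries: {len(infos)}")
        # These are the sizes declared in the archive headers. A hostile
        # archive can lie, so a real extractor should also count bytes
        # while writing.
        total = sum(info.file_size for info in infos)
        if total > max_total_size:
            raise ValueError(f"uncompressed size too large: {total}")
        for info in infos:
            # Resolve the would-be destination and make sure it stays
            # inside the target directory ("../x" and "/abs/x" do not).
            dest = (target / info.filename).resolve()
            if not dest.is_relative_to(target):
                raise ValueError(f"path escapes target: {info.filename}")
```

Validating every entry before writing anything means a rejected archive leaves no partial files behind.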

### When to use this package

Use `vcti-archive` when your application needs to:

- Accept uploaded ZIP or TAR.GZ files and extract them to disk
- Serve directory contents as downloadable ZIP archives
- Stream large directory archives without loading everything into memory
- Choose between memory-efficient and fast extraction strategies
- Protect against malicious archives (path traversal, zip bombs)

---

## Installation

The package has **zero required dependencies**.  FastAPI integration
is available as an optional extra.

### Without FastAPI (CLI tools, background workers, any framework)

Use this when you only need extraction and streaming — no
FastAPI-specific helpers.  Works with Django, Flask, plain scripts,
or anything that accepts `BinaryIO` and `bytes` iterators.

```bash
pip install "vcti-archive>=1.0.2"
```

What you get:
- `ZipExtractor`, `TarGzExtractor` — extract from any `BinaryIO`
- `DirectoryZipMemoryStreamer` — stream directory as ZIP (in-memory)
- `LargeDirectoryZipStreamer` — stream directory as ZIP (tempfile)
- Async wrappers, bomb protection, path traversal safety, logging

### With FastAPI

Use this when you need `streaming_zip_response()` — a helper that
wraps `LargeDirectoryZipStreamer` in a `StreamingResponse` with
correct headers and `BackgroundTasks` cleanup.

```bash
pip install "vcti-archive[fastapi]>=1.0.2"
```

Everything above, plus:
- `streaming_zip_response(streamer, background_tasks)` from
  `vcti.archive.fastapi`


### In `requirements.txt`

```
# Without FastAPI
vcti-archive>=1.0.2

# With FastAPI
vcti-archive[fastapi]>=1.0.2
```

### In `pyproject.toml` dependencies

```toml
# Without FastAPI
dependencies = [
    "vcti-archive>=1.0.2",
]

# With FastAPI
dependencies = [
    "vcti-archive[fastapi]>=1.0.2",
]
```

---

## Quick Start

### Usage without FastAPI

All core functionality works with any framework or no framework at
all.  Extractors accept any seekable `BinaryIO` (open files,
`io.BytesIO`, `UploadFile.file`, etc.) and streamers yield plain
`bytes` iterators.

**Extract a ZIP archive:**

```python
from pathlib import Path
from vcti.archive import ZipExtractor

with open("archive.zip", "rb") as f:
    extractor = ZipExtractor(f, Path("/target/dir"))
    extractor.extract_using_bytesio()   # Fast, for small files
    # or
    # extractor.extract_using_tempfile()  # Memory-efficient, for large files

# With bomb protection
with open("archive.zip", "rb") as stream:
    extractor = ZipExtractor(
        stream, Path("/target"),
        max_total_size=500_000_000,  # 500 MB limit
        max_file_count=10_000,       # 10K files limit
    )
    extractor.extract_using_bytesio()
```

**Extract a TAR.GZ archive:**

```python
from pathlib import Path
from vcti.archive import TarGzExtractor

with open("archive.tar.gz", "rb") as f:
    TarGzExtractor(f, Path("/target/dir")).extract_using_bytesio()
```

**Async extraction (non-blocking for async frameworks):**

```python
# Inside a coroutine; `stream` is any seekable BinaryIO.
extractor = ZipExtractor(stream, Path("/target/dir"))
await extractor.async_extract_using_bytesio()
```
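
Extraction is blocking file I/O, so calling it directly inside a coroutine would stall the event loop. One common way to avoid that is to run the blocking call in a worker thread with `asyncio.to_thread`. The sketch below shows this pattern with the standard library only; it is not necessarily how the `async_*` methods of `vcti-archive` are implemented, and `blocking_extract` is a hypothetical stand-in for a synchronous extraction call.

```python
import asyncio
import zipfile
from pathlib import Path


def blocking_extract(stream, target_dir: Path) -> list[str]:
    """Hypothetical stand-in for a synchronous extraction call."""
    with zipfile.ZipFile(stream) as zf:
        zf.extractall(target_dir)
        return zf.namelist()


async def extract_without_blocking(stream, target_dir: Path) -> list[str]:
    # Run the blocking work in a worker thread so the event loop stays
    # free to serve other requests in the meantime.
    return await asyncio.to_thread(blocking_extract, stream, target_dir)
```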

**Stream a directory as ZIP (in-memory):**

```python
from pathlib import Path
from vcti.archive import DirectoryZipMemoryStreamer

streamer = DirectoryZipMemoryStreamer(Path("/data/project"))

# Write to a file
with open("output.zip", "wb") as out:
    for chunk in streamer:
        out.write(chunk)

# Or pass to any framework's streaming response
# Django: StreamingHttpResponse(streamer, content_type="application/zip")
# Flask:  Response(streamer, mimetype="application/zip")
```

**Stream a large directory as ZIP (tempfile):**

```python
from pathlib import Path
from vcti.archive import LargeDirectoryZipStreamer

streamer = LargeDirectoryZipStreamer(
    folder_path=Path("/data/project"),
    archive_name="project.zip",
)
# `response` is a placeholder for your framework's writable response object.
for chunk in streamer.stream():
    response.write(chunk)
```

**File filtering (works with both streamers):**

```python
from pathlib import Path
from vcti.archive import DirectoryZipMemoryStreamer

streamer = DirectoryZipMemoryStreamer(
    Path("/data/project"),
    exclude=lambda p: p.name.startswith(".") or p.suffix == ".log",
)
```
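
When the exclusion rules grow beyond a one-line lambda, a small predicate factory keeps them readable. This is a sketch that assumes, as the lambda above implies, that `exclude` receives a `Path` and returns `True` for files to skip. `make_exclude` is a hypothetical helper, not part of `vcti-archive`.

```python
from fnmatch import fnmatchcase
from pathlib import Path
from typing import Callable


def make_exclude(*patterns: str) -> Callable[[Path], bool]:
    """Return a predicate that is True when a path's name matches any pattern."""

    def exclude(path: Path) -> bool:
        # Match on the final path component only, case-sensitively.
        return any(fnmatchcase(path.name, pattern) for pattern in patterns)

    return exclude


# Skip dotfiles, log files, and bytecode caches.
exclude = make_exclude(".*", "*.log", "__pycache__")
```

The resulting callable can be passed the same way as the lambda, for example `DirectoryZipMemoryStreamer(Path("/data/project"), exclude=exclude)`.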

### Usage with FastAPI

Install with `pip install "vcti-archive[fastapi]"`.  Everything above
still works, plus you get `streaming_zip_response()` — a helper that
wraps `LargeDirectoryZipStreamer` in a `StreamingResponse` with
correct headers and deferred temp-file cleanup via `BackgroundTasks`.

**Extract an uploaded file:**

```python
from pathlib import Path
from fastapi import FastAPI, UploadFile
from vcti.archive import ZipExtractor

app = FastAPI()

@app.post("/upload")
async def upload(file: UploadFile):
    extractor = ZipExtractor(file.file, Path("/data/uploads"))
    await extractor.async_extract_using_bytesio()
    return {"status": "extracted"}
```

**Stream a directory as a download (in-memory, small dirs):**

```python
from pathlib import Path
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from vcti.archive import DirectoryZipMemoryStreamer

app = FastAPI()

@app.get("/download")
def download():
    streamer = DirectoryZipMemoryStreamer(Path("/data/project"))
    return StreamingResponse(streamer, media_type="application/zip")
```

**Stream a large directory as a download (tempfile, large dirs):**

```python
from pathlib import Path
from fastapi import BackgroundTasks, FastAPI
from vcti.archive import LargeDirectoryZipStreamer
from vcti.archive.fastapi import streaming_zip_response

app = FastAPI()

@app.get("/download/large")
def download_large(background_tasks: BackgroundTasks):
    streamer = LargeDirectoryZipStreamer(
        folder_path=Path("/data/dataset"),
        archive_name="dataset.zip",
    )
    return streaming_zip_response(streamer, background_tasks)
```

### Choosing a streamer

Both streamers produce identical ZIP output.  The difference is
where the ZIP is assembled:

- **`DirectoryZipMemoryStreamer`** -- builds the ZIP in a `BytesIO`
  buffer, yielding chunks as it goes.  Simple (no temp files, no
  cleanup), but the buffer stays in memory for the duration of the
  request.
- **`LargeDirectoryZipStreamer`** -- writes the complete ZIP to a
  temp file first, then streams from disk.  Needs cleanup (via
  `on_cleanup` or the `streaming_zip_response` helper) but memory
  usage stays flat regardless of archive size.

The right choice depends on your deployment, not a universal size
threshold.  Consider:

- **Process memory budget** -- in a 512 MB container, a 200 MB
  in-memory ZIP may be too large; on a 32 GB server it's trivial.
- **Concurrent requests** -- one 300 MB buffer is fine; fifty
  concurrent ones may not be.
- **Disk I/O** -- the tempfile streamer writes then reads the full
  archive, so slow disks add latency that the memory streamer avoids.

When in doubt, start with `DirectoryZipMemoryStreamer` (fewer moving
parts) and switch to `LargeDirectoryZipStreamer` if you observe
memory pressure under production load.
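
If you would rather decide per request than per deployment, the considerations above can be turned into a simple check. The sketch below uses the uncompressed size of the directory as a rough proxy for the size of the resulting ZIP and compares it with a memory budget you choose. Both `directory_size` and `prefer_tempfile` are hypothetical helpers, and the budget is an assumption you would tune for your own environment.

```python
import os
from pathlib import Path


def directory_size(folder: Path) -> int:
    """Total size in bytes of all regular files under folder (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = Path(root) / name
            if not path.is_symlink():
                total += path.stat().st_size
    return total


def prefer_tempfile(folder: Path, memory_budget_bytes: int) -> bool:
    """True when the uncompressed input exceeds the per-request memory budget."""
    return directory_size(folder) > memory_budget_bytes
```

A caller could then pick `LargeDirectoryZipStreamer` when `prefer_tempfile(...)` returns `True` and `DirectoryZipMemoryStreamer` otherwise. Walking the tree costs extra disk I/O, so this only pays off when directory sizes vary widely.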

---

## Public API

| Class / Function | Purpose |
|------------------|---------|
| `ArchiveExtractor` | ABC base class for archive extractors (BytesIO and tempfile strategies) |
| `ZipExtractor` | Extract ZIP archives with path traversal and bomb protection |
| `TarGzExtractor` | Extract TAR.GZ archives with `filter="data"` security |
| `DirectoryZipMemoryStreamer` | Stream directory as ZIP using in-memory buffer (reusable) |
| `LargeDirectoryZipStreamer` | Stream directory as ZIP using temporary file |
| `UnsupportedArchiveFormat` | Exception for unsupported archive formats |
| `streaming_zip_response()` | FastAPI helper (optional, requires `vcti-archive[fastapi]`) |

---

## Dependencies

- **Zero required dependencies** -- Core functionality uses Python
  stdlib only (`zipfile`, `tarfile`, `shutil`, `tempfile`, `asyncio`).
- **Optional:** `fastapi` -- Install with `vcti-archive[fastapi]` for
  `streaming_zip_response()` and FastAPI-specific integration.

---

## Documentation

- [Design](docs/design.md) -- Architecture decisions and extraction strategies
- [Source Guide](docs/source-guide.md) -- File descriptions and execution flow traces
- [API Reference](docs/api.md) -- Autodoc for all modules
