Metadata-Version: 2.4
Name: dscanpy
Version: 0.1.1
Summary: Concurrent directory tree scanner for Python 3.12+
License: MIT
Requires-Python: >=3.12
Classifier: Programming Language :: Python :: 3.12
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: System :: Filesystems
Project-URL: Repository, https://github.com/yourusername/dscanpy
Description-Content-Type: text/markdown

# dscan

[![PyPI](https://img.shields.io/pypi/v/dscanpy)](https://pypi.org/project/dscanpy/)
[![Python](https://img.shields.io/pypi/pyversions/dscanpy)](https://pypi.org/project/dscanpy/)
[![License](https://img.shields.io/pypi/l/dscanpy)](LICENSE)


`dscan` is a concurrent directory scanner for Python 3.12+. It wraps `os.scandir` in a thread pool with a work-stealing queue, exposing a filtering API that covers most of what you'd otherwise implement by hand on top of `os.walk`.

Two modes: `scan_entries` yields raw `os.DirEntry` objects with minimal overhead; `scan` yields dataclass models with pre-computed metadata.

---

## Why concurrent scanning?

On a local SSD, directory traversal is fast enough that threading adds more overhead than it saves. `scan_entries` still matches or edges out `os.walk`, but the real case for concurrency is **network-attached storage**.

On SMB shares, NFS mounts, or any high-latency filesystem, each `scandir` call blocks waiting for a server response. `os.walk` does this serially — one directory at a time. dscan keeps multiple directories in-flight simultaneously, so workers aren't sitting idle while the network responds. On deep trees with many subdirectories, this compounds significantly.

---

## Windows + SMB: the strongest use case

On Windows, the underlying `FindNextFile` API returns full file metadata — including size and timestamps — in the same call as the directory listing. This means `DirEntry.stat()` is effectively free; no additional syscalls are needed to populate a `FileEntry` model.

This makes `scan()` model mode on Windows significantly more efficient than on Linux or macOS, where `stat` requires a separate syscall per entry. The structured output you get from `scan()` comes at almost no extra cost over `scan_entries`.
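This caching behaviour comes straight from the stdlib: `os.scandir` documents that on Windows, `DirEntry.stat()` is served from the data the directory listing already returned. A minimal illustration using only the stdlib:

```python
import os

# os.scandir surfaces whatever metadata the directory listing already
# carried; on Windows, entry.stat() reads from that cached data rather
# than issuing an extra system call per entry.
with os.scandir(".") as it:
    for entry in it:
        info = entry.stat(follow_symlinks=False)
        print(entry.name, info.st_size)
```

On Linux and macOS the same call falls back to a real `stat` syscall per entry, which is the gap the benchmarks below measure.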

Combined with the concurrency win on high-latency mounts, **Windows users scanning SMB network shares or mapped corporate drives get the best of both worlds**: concurrent traversal and rich metadata at near-zero overhead. This is the scenario where dscan provides the clearest, most measurable improvement over `os.walk`.

Recommended for:
- Corporate environments with large SMB file servers
- NAS devices accessed over Windows network shares
- Any mapped drive with deep directory trees

Tuning for high-latency mounts:

```python
from dscan import scan

# Increase workers to match network latency
for entry in scan("//fileserver/share", max_workers=32):
    print(entry.path)
```

---

## Benchmarks

### Local SSD (~4M entries, MacBook)

| | entries | time |
|---|---|---|
| `os.walk` (no stat) | 4,046,505 | 33.30s |
| `os.walk` (+ stat) | 4,039,313 | 85.24s |
| `dscan.scan_entries` | 4,046,502 | **31.90s** |
| `dscan.scan` (models) | 4,014,758 | 140.15s |

`scan_entries` is on par with bare `os.walk`. `scan` is slower because stat calls happen on the main thread serially — the workers parallelise `scandir`, not `stat`. Use `scan` when you want the structured output; use `scan_entries` when throughput matters.
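The `os.walk` baselines above can be reproduced with a small stdlib harness along these lines (a sketch; `walk_count` is a name invented here, and the timings depend entirely on your tree):

```python
import os
import time

def walk_count(root, do_stat=False):
    """Count entries under root with os.walk, optionally stat-ing each one."""
    n = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            n += 1
            if do_stat:
                try:
                    os.stat(os.path.join(dirpath, name))
                except OSError:
                    pass  # entry vanished mid-scan; skip it
    return n

start = time.perf_counter()
total = walk_count(".")
print(f"{total} entries in {time.perf_counter() - start:.2f}s")
```

Swap the last lines for `dscan.scan_entries` to compare like for like on your own filesystem.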

> **Note:** This benchmark was run on macOS where `stat` requires a separate syscall per entry. On Windows, `scan()` performance is substantially better due to `FindNextFile` bundling metadata. See the Windows + SMB section above.

### Simulated network latency (5ms per directory)

```python
# rough simulation: add 5 ms of latency to every scandir call
import time, os

_real = os.scandir

def _slow_scandir(path="."):
    time.sleep(0.005)
    return _real(path)

os.scandir = _slow_scandir  # restore os.scandir when done
```

| | time |
|---|---|
| `os.walk` | ~linear with directory count |
| `dscan.scan_entries` | scales with `max_workers` |

At 5ms latency per directory, a tree with 10,000 directories takes ~50s serially. With 16 workers dscan brings that to ~4s. The deeper and wider the tree, the bigger the difference.
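The arithmetic behind those numbers, assuming perfect overlap across workers (a best-case model; real runs pay some scheduling and queue overhead, which is why ~4s rather than ~3s):

```python
dirs = 10_000      # directories in the tree
latency = 0.005    # 5 ms per scandir round-trip
workers = 16

serial = dirs * latency   # os.walk: one directory at a time -> 50.0s
ideal = serial / workers  # perfect overlap -> ~3.1s; real runs land nearer 4s
print(f"serial: {serial:.1f}s, ideal with {workers} workers: {ideal:.1f}s")
```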

---

## Installation

```bash
pip install dscanpy
```

Requires Python 3.12+. No other dependencies.

---

## Usage

### Basic scan

```python
from dscan import scan

for entry in scan("."):
    print(f"{entry.name} - {entry.path}")
```

### Raw entries (lower overhead)

```python
import os
from dscan import scan_entries

# expand ~ explicitly; os.scandir does not
for entry in scan_entries(os.path.expanduser("~/Documents"), max_depth=2):
    if entry.is_file():
        print(entry.name)
```

---

## Filtering

### Extensions

```python
# Only Python and Markdown files
for file in scan(".", extensions={".py", ".md"}):
    print(file.path)

# Skip compiled files
for file in scan(".", ignore_extensions={".bin", ".exe"}):
    print(file.path)
```

### Glob patterns

```python
# Only test files
for entry in scan(".", match="test_*"):
    print(entry.name)

# Skip hidden files and directories
for entry in scan(".", ignore_pattern=".*"):
    print(entry.name)
```

### Directory traversal

```python
# Immediate children only
for entry in scan(".", max_depth=0):
    print(entry.name)

# Only descend into src/ and lib/
for entry in scan(".", only_dirs=["src", "lib"]):
    print(entry.path)

# Skip specific directories
# .git, .idea, .venv, __pycache__ are skipped by default
for entry in scan(".", ignore_dirs=["node_modules", "dist"]):
    print(entry.path)

# Disable all default ignores
for entry in scan(".", ignore_dirs=[]):
    print(entry.path)
```

### Custom filter

```python
def is_large_file(entry):
    return entry.is_file() and entry.stat().st_size > 1_000_000

for entry in scan(".", custom_filter=is_large_file):
    print(entry.name)
```

### Tuning workers

```python
# default is min(32, cpu_count * 2)
# increase on high-latency mounts
for entry in scan_entries("/mnt/nas", max_workers=32):
    print(entry.path)
```

---

## Data Models

`scan()` returns `FileEntry` or `DirectoryEntry` dataclasses.

### `FileEntry`

| field | description |
|---|---|
| `name` | filename without extension |
| `extension` | lowercase extension, no leading dot |
| `path` | full path |
| `dir_path` | containing directory |
| `size` | bytes |
| `created_at` | `datetime` |
| `modified_at` | `datetime` |
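Roughly how those fields relate to a path, sketched with `pathlib` (how dscan derives them internally is an assumption here; the path is made up):

```python
from pathlib import Path

p = Path("/data/reports/Q3 summary.PDF")
name = p.stem                             # "Q3 summary" (no extension)
extension = p.suffix.lstrip(".").lower()  # "pdf": lowercase, no leading dot
dir_path = str(p.parent)                  # "/data/reports"
print(name, extension, dir_path)
```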

### `DirectoryEntry`

| field | description |
|---|---|
| `name` | directory name |
| `path` | full path |
| `parent_path` | parent directory |
| `created_at` | `datetime` |
| `modified_at` | `datetime` |

---

## vs the stdlib

| | `os.walk` | `pathlib.Path.rglob` | `dscan` |
|---|:---:|:---:|:---:|
| Concurrent traversal | No | No | Yes |
| Built-in models | No | No | Yes |
| Depth limit | Manual | No | Yes |
| Directory exclusions | Manual | No | Yes |

---

## Roadmap

- **Move stat into workers** — on Linux/macOS over NFS or high-latency mounts, `stat` is a separate network round-trip per entry, just like `scandir`. Running stat inside the worker threads would let latency overlap across concurrent workers, significantly improving `scan()` model performance on those platforms.
- **`getattrlistbulk` support (macOS)** — macOS exposes a syscall that returns full file attributes (including size and timestamps) for all entries in a single directory call, equivalent to what Windows gets from `FindNextFile`. Implementing this would bring `scan()` performance on local macOS disk in line with Windows, and close the current gap between `scan()` and `scan_entries()` shown in the benchmarks above.

---

## License

MIT

