Metadata-Version: 2.4
Name: anscom
Version: 1.5.0
Summary: High-performance native C recursive file scanner: multi-threaded, terabyte-scale, with CSV/JSON/Tree export, duplicate detection, largest-N report, and regex filtering.
Home-page: https://github.com/PC5518/anscom-nfie-python-extension
Author: Aditya Narayan Singh
Author-email: adityansdsdc@outlook.com
Project-URL: Homepage, https://anscomqs.github.io/anscom/
Project-URL: Source, https://github.com/PC5518/anscom-nfie-python-extension
Project-URL: Bug Tracker, https://github.com/PC5518/anscom-nfie-python-extension/issues
Keywords: filesystem,scanner,file-analysis,directory,recursive,multithreaded,C-extension,duplicate-detection,disk-usage,audit,enterprise
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: C
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: System :: Filesystems
Classifier: Topic :: System :: Systems Administration
Classifier: Topic :: Utilities
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: System Administrators
Classifier: Development Status :: 5 - Production/Stable
Requires-Python: >=3.6
Description-Content-Type: text/markdown
License-File: LICENSE.txt
Provides-Extra: all
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license-file
Dynamic: project-url
Dynamic: provides-extra
Dynamic: requires-python
Dynamic: summary

# Anscom

**High-performance native C recursive file scanner for Python. *v1.5.0***

*MIT Licensed*

Multi-threaded · Terabyte-scale · Zero dependencies · Cross-platform

```
pip install anscom
```

---

## What it is

Anscom is a Python C extension that scans directories at raw OS speed. It drives each platform's native directory API directly (`getdents64` on Linux, `FindFirstFileW` on Windows, `readdir`/`lstat` on macOS), feeds a multi-threaded work queue, and accumulates statistics per thread. It never loads file contents into memory. It never follows symlinks. Because only directory entries and metadata are read, scan time scales with the number of entries, not with the bytes they contain.

The result is always a plain Python `dict` — five keys minimum, more when you ask for them.

```python
import anscom

result = anscom.scan("/mnt/storage")
# → {'total_files': 2841903, 'scan_errors': 0, 'duration_seconds': 1.87,
#    'categories': {...}, 'extensions': {...}}
```

2.8 million files. 1.87 seconds. 16 threads. No configuration.

---

## What's New in v1.5.0

v1.5.0 is a major feature release — the largest single update since the initial release. Every v1.3.0 parameter, behavior, and output format is preserved, with one exception: `export_excel` was removed (see the Changelog).

| Feature             | Parameter                | Description                                                                                                                         |
| ------------------- | ------------------------ | ----------------------------------------------------------------------------------------------------------------------------------- |
| File list return    | `return_files=True`      | Returns every scanned file as a list of dicts with `path`, `size`, `ext`, `category`, `mtime`                                        |
| CSV export          | `export_csv="out.csv"`   | Writes per-file data to a UTF-8 CSV — zero dependencies                                                                              |
| Largest-N report    | `largest_n=20`           | Top N files by size via per-thread min-heap — O(log N) per file, no extra pass                                                       |
| Duplicate detection | `find_duplicates=True`   | Groups files by size, then CRC32 of the first 4KB — returns grouped path lists                                                       |
| Regex filter        | `regex_filter="pattern"` | Only counts files whose full path matches the pattern. Uses POSIX `regexec` on Linux/macOS (no GIL); Python `re` fallback on Windows |

**Performance note:** All five features are strictly opt-in. A plain `anscom.scan(".")` with no new parameters runs the identical hot path as v1.3.0 — no extra syscalls, no allocations per file, no behavioral change.

### Migration from v1.3.0

One breaking change: `export_excel` was removed (see the Changelog). All other v1.3.0 code runs unchanged on v1.5.0. The new parameters all default to off.

```python
# v1.3.0 code — works identically on v1.5.0
result = anscom.scan("/data", silent=True, ignore_junk=True)

# v1.5.0 — opt into new features as needed
result = anscom.scan(
    "/data",
    silent          = True,
    ignore_junk     = True,
    return_files    = True,   # new
    largest_n       = 20,     # new
    find_duplicates = True,   # new
    export_csv      = "inventory.csv",  # new
)
```

---

## Table of Contents

* [Installation](#installation)
* [Quick Start](#quick-start)
* [Full API Reference](#full-api-reference)
* [Return Value](#return-value)
* [All Parameters in Depth](#all-parameters-in-depth)
  * [path](#path)
  * [max_depth](#max_depth)
  * [workers](#workers)
  * [min_size](#min_size)
  * [extensions](#extensions)
  * [ignore_junk](#ignore_junk)
  * [silent](#silent)
  * [show_tree](#show_tree)
  * [callback](#callback)
  * [export_json](#export_json)
  * [export_tree](#export_tree)
  * [export_csv](#export_csv)
  * [return_files](#return_files)
  * [largest_n](#largest_n)
  * [find_duplicates](#find_duplicates)
  * [regex_filter](#regex_filter)
* [Export Features](#export-features)
* [Tree Mode](#tree-mode)
* [Exclusion Filter](#exclusion-filter)
* [Report Format](#report-format)
* [File Categories and Extensions](#file-categories-and-extensions)
* [Architecture](#architecture)
* [Security and Compliance](#security-and-compliance)
* [Enterprise Recipes](#enterprise-recipes)
* [Changelog](#changelog)
* [License](#license)

---

## Installation

```bash
pip install anscom
```

Requires Python 3.6+. Works on Linux, macOS, and Windows.

> **Windows source builds** require the "Desktop development with C++" workload from Visual Studio Build Tools.

No runtime dependencies. Every feature in v1.5.0 works with nothing else installed.

### Verify

```python
import anscom
r = anscom.scan(".", silent=True)
print(r["total_files"], "files —", round(r["duration_seconds"], 3), "s")
```

---

## Quick Start

```python
import anscom

# Default scan — prints live counter + full report
anscom.scan(".")

# Silent scan — just get the dict
result = anscom.scan(".", silent=True)

# Scan a specific path with more depth
result = anscom.scan("/home/user/projects", max_depth=20, silent=True)

# Print the category breakdown
for cat, count in result["categories"].items():
    if count > 0:
        print(f"{cat:20s} {count:>10,}")
```

---

## Full API Reference

```python
anscom.scan(
    path,                    # str      — required
    max_depth    = 6,        # int
    show_tree    = False,    # bool
    workers      = 0,        # int
    min_size     = 0,        # int
    extensions   = None,     # list[str] | None
    callback     = None,     # callable | None
    silent       = False,    # bool
    ignore_junk  = False,    # bool
    export_json  = None,     # str | None
    export_tree  = None,     # str | None
    return_files = False,    # bool       ← new in v1.5.0
    export_csv   = None,     # str | None ← new in v1.5.0
    largest_n    = 0,        # int        ← new in v1.5.0
    find_duplicates = False, # bool       ← new in v1.5.0
    regex_filter = None,     # str | None ← new in v1.5.0
) -> dict
```

---

## Return Value

The return value is always a `dict`. Five keys are always present. Three are added on demand.

| Key                | Type                | Always?                  | Description                                                  |
| ------------------ | ------------------- | ------------------------ | ------------------------------------------------------------ |
| `total_files`      | `int`               | ✓                        | Files that passed all filters and were categorized            |
| `scan_errors`      | `int`               | ✓                        | Paths that failed to open (permissions, broken links)         |
| `duration_seconds` | `float`             | ✓                        | Wall-clock time from first thread spawn to last join          |
| `categories`       | `dict[str, int]`    | ✓                        | All 9 categories, always present even if zero                 |
| `extensions`       | `dict[str, int]`    | ✓                        | Only non-zero extension counts                                |
| `files`            | `list[dict]`        | `return_files=True`      | Per-file records: `path`, `size`, `ext`, `category`, `mtime`  |
| `largest_files`    | `list[dict]`        | `largest_n > 0`          | Top-N files by size: `path`, `size`                           |
| `duplicates`       | `list[list[str]]`   | `find_duplicates=True`   | Groups of paths sharing identical content (size + CRC32)      |

The nine category keys inside `result["categories"]`:

```
"Code/Source"    "Documents"      "Images"         "Videos"
"Audio"          "Archives"       "Executables"    "System/Config"
"Other/Unknown"
```

---

## All Parameters in Depth

### `path`

**Type:** `str` — **Required**

The root directory to scan. Accepts relative paths (`.`, `../data`), absolute paths (`/mnt/storage`, `C:\Users`), or an empty string (treated as `.`).

```python
anscom.scan(".")
anscom.scan("/mnt/nas")
anscom.scan("C:\\Users\\Aditya\\Documents")
anscom.scan("")  # same as "."
```

---

### `max_depth`

**Type:** `int` — **Default:** `6` — **Range:** `[0, 64]`

Maximum directory recursion depth. Depth 0 means only the immediate children of `path` are examined — no subdirectories are entered. Depth 64 is the hard ceiling enforced in C.

```python
# Only the top level — no recursion
anscom.scan("/data", max_depth=0, silent=True)

# Standard project scan
anscom.scan("/project", max_depth=6, silent=True)

# Deep NAS or archive scan
anscom.scan("/mnt/archive", max_depth=30, silent=True)

# Maximum depth — unlimited for practical purposes
anscom.scan("/", max_depth=64, silent=True)
```

Values below 0 are clamped to 0. Values above 64 are clamped to 64.

---

### `workers`

**Type:** `int` — **Default:** `0`

Number of worker threads. `0` auto-detects the hardware CPU count via `sysconf(_SC_NPROCESSORS_ONLN)` on Linux/macOS and `GetSystemInfo()` on Windows. If auto-detection fails, falls back to 4.

When `show_tree=True`, `workers` is forced to `1` regardless of what is passed — multiple threads writing to stdout would produce interleaved output.

```python
# Auto (recommended for most cases)
anscom.scan("/data", workers=0)

# Pin to a specific count
anscom.scan("/data", workers=8)

# Maximum parallelism on a 64-core machine
anscom.scan("/data", workers=64)
```

At shallow depths the work queue feeds all threads efficiently. At depth ≥ 3 each thread recurses inline, so thread count has diminishing returns past ~16 for typical filesystems unless the tree is extremely wide.

---

### `min_size`

**Type:** `int` — **Default:** `0` (no filter)

Skip all files smaller than this many bytes. Files below the threshold are not counted, not categorized, and not included in `return_files` or `export_csv` output.

```python
# Only files larger than 1 MB
anscom.scan("/data", min_size=1024 * 1024, silent=True)

# Only files larger than 100 MB
anscom.scan("/mnt/video", min_size=100 * 1024 * 1024, silent=True)

# Only files larger than 1 GB
anscom.scan("/mnt/backup", min_size=1024 ** 3, silent=True)
```

On Linux, `fstatat()` is called to retrieve file size only when this filter is active. On Windows, the size is available directly in `WIN32_FIND_DATAW` at no extra syscall cost.

---

### `extensions`

**Type:** `list[str]` | `None` — **Default:** `None`

Extension whitelist. When set, **only** files whose extension matches one of the listed strings are counted. All other files are silently skipped — they do not appear in counts, categories, `files`, `export_csv`, or any other output.

Pass extensions without the leading dot, lowercase.

```python
# Count only Python files
result = anscom.scan("/repo", extensions=["py"], silent=True)

# Count only web code
result = anscom.scan("/project", extensions=["js", "ts", "jsx", "tsx", "css", "html"])

# Count only media
result = anscom.scan("/media", extensions=["mp4", "mkv", "mov", "avi", "mp3", "flac"])

# Count only documents
result = anscom.scan("/docs", extensions=["pdf", "docx", "xlsx", "pptx", "md", "txt"])
```

Unknown extensions (not in the built-in table) are also excluded when a whitelist is active.

---

### `ignore_junk`

**Type:** `bool` — **Default:** `False`

When `True`, the following directories are skipped **entirely** — no `opendir`, no syscall, no recursion. The check is a case-insensitive match on the directory basename, at any depth, under any parent.

**Skipped directories:**

| Category         | Directories                                                                        |
| ---------------- | ---------------------------------------------------------------------------------- |
| Version control  | `.git` `.svn` `.hg`                                                          |
| IDE metadata     | `.idea` `.vscode`                                                              |
| Dependency trees | `node_modules` `bower_components` `site-packages` `.venv` `venv` `env` |
| Build output     | `build` `dist` `target` `__pycache__`                                      |
| Cache / temp     | `temp` `tmp` `.cache` `.pytest_cache` `.mypy_cache`                      |

```python
# Measure dependency bloat
raw   = anscom.scan("/project", ignore_junk=False, silent=True)
clean = anscom.scan("/project", ignore_junk=True,  silent=True)
bloat = raw["total_files"] - clean["total_files"]
print(f"Dependency files: {bloat:,}")

# Fast production audit — skip all junk
result = anscom.scan("/codebase", ignore_junk=True, workers=32, silent=True)
```

The default is `False` — Anscom counts everything unless you opt in to exclusions.

---

### `silent`

**Type:** `bool` — **Default:** `False`

When `False` (default), Anscom prints:

* A live "Scanned files: N ..." counter that updates every 250ms
* The full summary report and extension breakdown on completion

When `True`, all of that is suppressed. The returned `dict` is always identical regardless of this flag.

`silent=True` does **not** suppress tree output from `show_tree=True` — those are separate.

```python
# For scripting — no output, just the data
result = anscom.scan("/data", silent=True)

# For interactive use — full live output
anscom.scan("/data")
```

---

### `show_tree`

**Type:** `bool` — **Default:** `False`

When `True`, prints a DFS-ordered directory tree to `sys.stdout` as each entry is discovered. Forces `workers=1` to guarantee correct ordering.

```
  |-- [src]
  |   |   |-- main.py
  |   |   |-- utils.py
  |   |   |-- [tests]
  |   |   |   |   |-- test_main.py
  |-- [docs]
  |   |   |-- readme.md
  |-- config.json
```

* Square brackets `[name]` indicate a directory
* No brackets indicates a regular file
* Each depth level adds `"  |   "` (6 characters) of indentation

Output is produced one line at a time via `PySys_WriteStdout`. Any `sys.stdout` redirect in Python will capture every line. There is no internal buffer — a 50 million file filesystem produces 50+ million lines without accumulating memory.

```python
# Print tree to terminal
anscom.scan(".", show_tree=True, max_depth=4)

# Capture tree in Python
import io, sys
buf = io.StringIO()
sys.stdout = buf
anscom.scan(".", show_tree=True, max_depth=3, silent=True)
sys.stdout = sys.__stdout__
tree_text = buf.getvalue()

# Save tree to file (see also export_tree)
anscom.scan("/data", show_tree=True, silent=True, export_tree="tree.txt")
```

---

### `callback`

**Type:** `callable` | `None` — **Default:** `None`

A Python callable invoked approximately once per second with the current scanned-file count as its single `int` argument. It is fired by the progress thread on every 4th tick (250 ms × 4 = 1000 ms).

The GIL is acquired before each call and released immediately after. Scan worker threads are never blocked by callback invocation.

```python
def on_progress(n):
    print(f"\rScanned: {n:,}", end="", flush=True)

result = anscom.scan("/data", callback=on_progress, silent=True)
print()

# Push to Prometheus
from prometheus_client import Gauge
g = Gauge("files_scanned", "Current file scan count")
anscom.scan("/data", callback=lambda n: g.set(n), silent=True)
```

---

### `export_json`

**Type:** `str` | `None` — **Default:** `None`

Path to write the full result dict as a formatted JSON file. Uses Python's built-in `json` module — no external dependencies. Written with 4-space indentation after the scan completes.

The JSON file contains **all** keys that are in the returned dict, including optional keys (`files`, `largest_files`, `duplicates`) when those features are enabled in the same call.

```python
anscom.scan("/data", export_json="report.json", silent=True)

# With optional features — JSON gets those keys too
anscom.scan(
    "/data",
    export_json     = "report.json",
    return_files    = True,
    largest_n       = 10,
    find_duplicates = True,
    silent          = True
)
```

Example output:

```json
{
    "total_files": 21008,
    "scan_errors": 0,
    "duration_seconds": 1.5186,
    "categories": {
        "Code/Source": 5955,
        "Documents": 203,
        "Images": 151,
        "Videos": 0,
        "Audio": 730,
        "Archives": 0,
        "Executables": 0,
        "System/Config": 5707,
        "Other/Unknown": 8992
    },
    "extensions": {
        "py": 5955,
        "pyc": 5707,
        "mp3": 730,
        "txt": 160,
        "png": 151
    }
}
```

---

### `export_tree`

**Type:** `str` | `None` — **Default:** `None`

Path to write the tree output to a text file. **Only active when `show_tree=True`.**

The file is written incrementally — each line is written and flushed as it is produced. For a filesystem with 50 million entries this produces a multi-gigabyte file without accumulating any output in memory. `stdout` and the file both receive every line simultaneously.

```python
anscom.scan(
    "/mnt/storage",
    show_tree   = True,
    export_tree = "filesystem_tree.txt",
    silent      = True,
    max_depth   = 64
)
```

---

### `export_csv`

**Type:** `str` | `None` — **Default:** `None` — **New in v1.5.0**

Path to write a per-file inventory as a UTF-8 CSV. Columns: `path`, `size`, `ext`, `category`, `mtime`.

* `path`: full absolute path, RFC 4180-quoted (double-quoted, inner quotes doubled)
* `size`: file size in bytes as an integer
* `ext`: lowercase extension without the dot (empty string for unrecognized extensions)
* `category`: one of the 9 category names
* `mtime`: Unix timestamp (seconds since epoch) of last modification
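
For illustration, a hypothetical row for a path that itself contains a double quote (all values invented for the example, following the quoting rule above):

```
"/data/logs/app ""prod"".log",10240,log,System/Config,1714406400
```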

```python
anscom.scan("/data", export_csv="inventory.csv", silent=True)
```

**Loading the CSV downstream:**

```python
# With pandas
import pandas as pd
df = pd.read_csv("inventory.csv")
print(df.groupby("category")["size"].sum().sort_values(ascending=False))

# Convert to Excel
df.to_excel("report.xlsx", index=False)

# Standard library only
import csv
with open("inventory.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(row["path"], row["size"])

# With openpyxl directly
import csv, openpyxl
wb = openpyxl.Workbook()
ws = wb.active
with open("inventory.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        ws.append(row)
wb.save("report.xlsx")
```

---

### `return_files`

**Type:** `bool` — **Default:** `False` — **New in v1.5.0**

When `True`, the result dict gains a `"files"` key containing a Python `list` of dicts — one entry per scanned file.

Each dict has five fields:

| Field        | Type    | Description                                                |
| ------------ | ------- | ---------------------------------------------------------- |
| `path`     | `str` | Full absolute path to the file                             |
| `size`     | `int` | File size in bytes                                         |
| `ext`      | `str` | Lowercase extension (no dot), empty string if unrecognized |
| `category` | `str` | One of the 9 category names                                |
| `mtime`    | `int` | Unix timestamp of last modification                        |

```python
result = anscom.scan("/project", return_files=True, silent=True)

# Iterate
for f in result["files"]:
    print(f["path"], f["size"], f["category"])

# Filter in Python
large_code = [
    f for f in result["files"]
    if f["category"] == "Code/Source" and f["size"] > 50_000
]

# Sort by size descending
by_size = sorted(result["files"], key=lambda f: f["size"], reverse=True)
print("Largest file:", by_size[0]["path"])

# Group by extension
from collections import defaultdict
by_ext = defaultdict(list)
for f in result["files"]:
    by_ext[f["ext"]].append(f)
```

`len(result["files"]) == result["total_files"]` is always true.

---

### `largest_n`

**Type:** `int` — **Default:** `0` (disabled) — **New in v1.5.0**

When > 0, finds the top N files by size across the entire scanned filesystem. Uses a per-thread min-heap of capacity N — O(log N) per file, no extra pass, no sorting of the full file list. After all threads join, per-thread heaps are merged and sorted descending.

The result dict gains a `"largest_files"` key where each entry is a dict with `path` (str) and `size` (int).

```python
result = anscom.scan("/mnt/storage", largest_n=20, silent=True)

for f in result["largest_files"]:
    gb = f["size"] / (1024 ** 3)
    print(f"{gb:8.2f} GB  {f['path']}")
```

The printed report also gains a section:

```
=== TOP 20 LARGEST FILES ===========================
  1073741824 bytes : /data/backup/archive.tar.gz
   536870912 bytes : /data/media/4k_reel.mkv
...
===================================================
```

```python
# Find the single largest file
result = anscom.scan("/mnt/nas", largest_n=1, silent=True)
top = result["largest_files"][0]
print(f"Largest: {top['path']} ({top['size']:,} bytes)")

# Top 100 across a petabyte volume
result = anscom.scan("/mnt/petabyte", largest_n=100, workers=64, silent=True)
```

---

### `find_duplicates`

**Type:** `bool` — **Default:** `False` — **New in v1.5.0**

When `True`, detects duplicate files using a two-phase algorithm:

1. **Size bucketing** — all files sorted by size. Files with a unique size are skipped entirely — zero I/O.
2. **CRC32 fingerprinting** — for each same-size group (≥2 files, non-zero size), the first 4096 bytes of each file are read and CRC32 is computed. Files with matching CRC32 are reported as duplicates.

The result dict gains a `"duplicates"` key: a list of groups, each group being a list of path strings. Every group has at least 2 members.

```python
result = anscom.scan("/media-library", find_duplicates=True, silent=True)

print(f"Duplicate groups: {len(result['duplicates'])}")

for group in result["duplicates"]:
    print(f"\nDuplicate set ({len(group)} files):")
    for path in group:
        print(f"  {path}")
```
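
Because the fingerprint covers only the first 4KB, two files can share size and fingerprint without being byte-identical. Before deleting anything, you may want to confirm a group with a full-content hash. A minimal sketch using only the standard library (the `confirm_group` helper is illustrative, not part of Anscom):

```python
import hashlib
from collections import defaultdict

def confirm_group(paths, chunk=1 << 20):
    """Re-group a reported duplicate set by full-file SHA-256."""
    by_digest = defaultdict(list)
    for p in paths:
        h = hashlib.sha256()
        with open(p, "rb") as f:
            for block in iter(lambda: f.read(chunk), b""):
                h.update(block)
        by_digest[h.hexdigest()].append(p)
    # keep only sets that are still duplicates after full hashing
    return [g for g in by_digest.values() if len(g) > 1]

confirmed = [g for group in result["duplicates"] for g in confirm_group(group)]
```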

**Calculating reclaimable space** (combine with `return_files=True`):

```python
result = anscom.scan(
    "/mnt/archive",
    find_duplicates = True,
    return_files    = True,
    silent          = True
)

size_map = {f["path"]: f["size"] for f in result["files"]}

wasted = sum(
    sum(size_map.get(p, 0) for p in group[1:])   # keep 1, discard rest
    for group in result["duplicates"]
)

print(f"Reclaimable: {wasted / (1024**3):.2f} GB across {len(result['duplicates'])} groups")
```

The printed report adds:

```
=== DUPLICATES SUMMARY ============================
Groups found : 142
===================================================
```

---

### `regex_filter`

**Type:** `str` | `None` — **Default:** `None` — **New in v1.5.0**

A regular expression pattern. When set, **only** files whose full absolute path matches the pattern are counted, categorized, and included in any file-tracking output (`return_files`, `export_csv`, `find_duplicates`, `largest_n`).

**Platform behavior:**

* **Linux / macOS:** Compiled with POSIX `regcomp(REG_EXTENDED | REG_NOSUB)`, matched with `regexec` — **no GIL acquisition**, runs fully in C inside the worker threads.
* **Windows:** Falls back to Python's `re` module (GIL acquired per file). For large scans on Windows, prefer the `extensions` whitelist, which has zero GIL cost.

The pattern is also compiled with Python's `re.compile` before the scan starts. An invalid pattern raises `ValueError` immediately.

```python
# Only .py files anywhere under a tests/ directory
result = anscom.scan("/codebase", regex_filter=r"/tests/.*\.py$", silent=True)

# Only files in directories named 'src'
result = anscom.scan("/project", regex_filter=r"/src/", silent=True)

# Only Python test files
result = anscom.scan("/repo", regex_filter=r"test_.*\.py$", silent=True)
print(f"Test files: {result['total_files']}")

# Invalid patterns raise ValueError immediately — no scan is started
try:
    anscom.scan("/data", regex_filter=r"[invalid(")
except ValueError as e:
    print(e)  # Failed to compile regex_filter.
```

---

## Export Features

All export parameters are independent and combinable. A single scan pass can write to all simultaneously — one traversal, multiple outputs, no re-scanning.

```python
result = anscom.scan(
    "/mnt/enterprise",
    max_depth       = 20,
    workers         = 32,
    ignore_junk     = True,
    silent          = True,
    largest_n       = 50,
    find_duplicates = True,
    return_files    = True,
    export_json     = "audit.json",
    export_csv      = "inventory.csv",
    show_tree       = True,
    export_tree     = "tree.txt",
)
# One scan. Four output files. Full in-memory results.
```

| Parameter       | Format     | Dependencies                | Notes                                      |
| --------------- | ---------- | --------------------------- | ------------------------------------------ |
| `export_json` | JSON       | None (built-in)             | Full result dict including optional keys   |
| `export_csv`  | CSV        | None (built-in)             | Per-file: path, size, ext, category, mtime |
| `export_tree` | Plain text | Requires `show_tree=True` | Written line-by-line, safe at any scale    |

---

## Tree Mode

```python
# Basic tree to terminal
anscom.scan(".", show_tree=True)

# Tree saved to file
anscom.scan("/project", show_tree=True, export_tree="tree.txt", silent=True)

# Deep tree, no terminal output
import sys, io
sys.stdout = io.StringIO()
anscom.scan("/mnt/volume", show_tree=True, export_tree="tree.txt", max_depth=64)
sys.stdout = sys.__stdout__
```

**Output format:**

```
  |-- [src]            ← [brackets] = directory
  |   |   |-- main.py  ← no brackets = regular file
  |   |   |-- [lib]
  |   |   |   |   |-- utils.py
  |-- config.json
  |-- [tests]
  |   |   |-- test_core.py
```

* One `"  |   "` block per depth level (6 chars each)
* At depth 64: 384 characters of indentation — all structurally valid
* DFS order is strict: every file inside a directory appears before that directory's sibling
* `workers` is forced to 1 — required for correct ordering
* No internal buffer — safe at 50+ million entries

---

## Exclusion Filter

`ignore_junk=True` skips these directory names at any depth, at any nesting level:

```
.git           .svn               .hg              .idea            .vscode
node_modules   bower_components   site-packages    .venv            venv
env            build              dist             target           __pycache__
temp           tmp                .cache           .pytest_cache    .mypy_cache

The check is case-insensitive basename comparison — not a path substring match. A `node_modules` at `/project/frontend/node_modules/` is caught regardless of nesting depth.
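
A Python sketch of the equivalent check (the real comparison happens in C; `JUNK_NAMES` is a stand-in built from the table above):

```python
import os

JUNK_NAMES = {
    ".git", ".svn", ".hg", ".idea", ".vscode",
    "node_modules", "bower_components", "site-packages", ".venv", "venv",
    "env", "build", "dist", "target", "__pycache__",
    "temp", "tmp", ".cache", ".pytest_cache", ".mypy_cache",
}

def is_junk(dir_path: str) -> bool:
    # case-insensitive match on the basename only, never a substring match
    return os.path.basename(dir_path).lower() in JUNK_NAMES

assert is_junk("/project/frontend/NODE_MODULES")
assert not is_junk("/project/templates")   # "temp" must match the whole name
```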

---

## Report Format

Printed to `sys.stdout` when `silent=False` (the default).

```
Anscom Enterprise v1.5.0 (Threads: 16)
Target: /data

Scanned files: 21008 ...

=== SUMMARY REPORT ================================
+-----------------+--------------+----------+
| Category        | Count        | Percent  |
+-----------------+--------------+----------+
| Code/Source     |         5955 |   28.34% |
| System/Config   |         5707 |   27.16% |
| Other/Unknown   |         8992 |   42.81% |
| Documents       |          203 |    0.97% |
| Images          |          151 |    0.72% |
+-----------------+--------------+----------+
| TOTAL FILES     |        21008 |  100.00% |
+-----------------+--------------+----------+

=== DETAILED EXTENSION BREAKDOWN ==================
+-----------------+--------------+
| Extension       | Count        |
+-----------------+--------------+
| .py             |         5955 |
| .pyc            |         5707 |
| .mp3            |          730 |
| .txt            |          160 |
| .png            |          151 |
+-----------------+--------------+

Time     : 1.5186 seconds
Errors   : 0 (permission denied / inaccessible)
===================================================

=== TOP 20 LARGEST FILES ===========================   ← only with largest_n > 0
  1073741824 bytes : /data/backup/full.tar.gz
...

=== DUPLICATES SUMMARY ============================   ← only with find_duplicates=True
Groups found : 142
===================================================
```

Capture programmatically:

```python
import io, sys
buf = io.StringIO()
sys.stdout = buf
anscom.scan("/data")
sys.stdout = sys.__stdout__
report_text = buf.getvalue()
```

---

## File Categories and Extensions

170+ extensions across 9 categories. The table is sorted lexicographically and validated at module init — if the sort invariant is violated, `import anscom` raises `RuntimeError`.

| Category                | Sample Extensions                                                                                                                                                                              |
| ----------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Code/Source**   | `c` `cpp` `cs` `go` `h` `html` `java` `js` `json` `jsx` `kt` `lua` `php` `py` `r` `rb` `rs` `sh` `sql` `swift` `ts` `vue` `xml` `yaml` `yml` |
| **Documents**     | `csv` `doc` `docx` `epub` `md` `mobi` `odp` `ods` `odt` `pdf` `ppt` `pptx` `rst` `rtf` `txt` `xls` `xlsx`                                                    |
| **Images**        | `ai` `avif` `bmp` `gif` `heic` `ico` `jpeg` `jpg` `png` `psd` `raw` `svg` `tiff` `webp`                                                                            |
| **Videos**        | `avi` `flv` `mkv` `mov` `mp4` `mpeg` `ogv` `webm` `wmv`                                                                                                                      |
| **Audio**         | `aac` `flac` `m4a` `mid` `mp3` `ogg` `wav` `wma`                                                                                                                               |
| **Archives**      | `7z` `bz2` `deb` `dmg` `gz` `iso` `jar` `rar` `tar` `tgz` `zip`                                                                                                          |
| **Executables**   | `app` `bin` `class` `dll` `elf` `exe` `msi` `pyd` `so`                                                                                                                       |
| **System/Config** | `bak` `cfg` `conf` `db` `env` `gitignore` `ini` `log` `pyc` `reg` `sys` `tmp` `ttf` `woff`                                                                         |
| **Other/Unknown** | Any extension not in the above table                                                                                                                                                           |
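
A Python analogue of that init-time sort check (illustrative only; the real validation runs in C at module load):

```python
def validate_sorted(extensions: list[str]) -> None:
    # the table must be strictly sorted; the module refuses to import otherwise
    for a, b in zip(extensions, extensions[1:]):
        if not a < b:
            raise RuntimeError(f"extension table out of order: {a!r} >= {b!r}")
```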

---

## Architecture

### OS backends

Three separate scanning implementations compiled and selected at build time:

| Platform    | Backend            | Mechanism                                                                                                                         |
| ----------- | ------------------ | --------------------------------------------------------------------------------------------------------------------------------- |
| Linux       | `getdents64`     | Direct `syscall(SYS_getdents64, dirfd, buf, 131072)` — raw kernel ABI, 128KB read buffer, `d_type` for zero-stat type detection |
| Windows     | `FindFirstFileW` | Wide-char `wchar_t` paths, UTF-16→UTF-8 conversion, size+mtime from `WIN32_FIND_DATAW` at no extra syscall cost                 |
| macOS / BSD | POSIX `readdir`  | `opendir`/`readdir` with `lstat` for type resolution                                                                            |

### Thread model

```
main thread
  ├── spawn N worker threads (all waiting on cond var)
  ├── spawn 1 progress thread
  ├── push root path to queue
  ├── wait until queue.count == 0 && active_workers == 0
  └── join all threads → merge stats

worker thread (×N)
  └── loop: queue_pop → process_dir_recursive → queue_task_done

process_dir_recursive
  ├── depth < 3: push subdirs to queue (parallel pickup by idle threads)
  └── depth ≥ 3: recurse inline (avoids queue overhead for deep narrow trees)
```
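
A simplified Python analogue of this model (the real scanner is native C with pthreads and condition variables; `queue.Queue` stands in for the bounded work queue, and the 64-depth cap is omitted for brevity):

```python
import os
import queue
import threading

INLINE_DEPTH = 3                 # beyond this depth, recurse inline, per the diagram
work = queue.Queue()

def process_dir(path, depth, stats):
    try:
        with os.scandir(path) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    if depth < INLINE_DEPTH:
                        work.put((entry.path, depth + 1))          # parallel pickup
                    else:
                        process_dir(entry.path, depth + 1, stats)  # inline recursion
                elif entry.is_file(follow_symlinks=False):
                    stats["files"] += 1
    except OSError:
        stats["errors"] += 1     # counted, never raised

def worker(stats):
    while True:
        item = work.get()
        if item is None:         # sentinel: shut down
            break
        process_dir(*item, stats)
        work.task_done()

per_thread = [{"files": 0, "errors": 0} for _ in range(4)]
threads = [threading.Thread(target=worker, args=(s,)) for s in per_thread]
for t in threads:
    t.start()
work.put((".", 0))
work.join()                      # all queued directories fully processed
for _ in threads:
    work.put(None)
for t in threads:
    t.join()
print(sum(s["files"] for s in per_thread))   # serial merge after join
```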

### Per-thread stats — zero locks during counting

Each thread has its own `ScanStats` struct with `ext_counts[170+]` and `cat_counts[9]`. No lock is acquired during file categorization. The only shared atomic write per file is a single `__sync_fetch_and_add` for the progress counter. Stats are merged in one serial pass after all threads join.

### Slab path allocator

Each thread allocates `(max_depth + 2) * PATH_MAX` bytes once before scanning. Path strings during traversal are written into `slab[depth * PATH_MAX]` via `snprintf`. Zero heap allocation during traversal.
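
A rough Python analogue of the indexing scheme (illustrative; the real slab is a single C allocation, and `PATH_MAX` below is an assumed value):

```python
PATH_MAX = 4096                  # assumed; platform-defined in the C source
max_depth = 6
slab = bytearray((max_depth + 2) * PATH_MAX)   # one up-front allocation

def path_buffer(depth: int) -> memoryview:
    # each recursion depth owns a fixed PATH_MAX-byte region of the slab,
    # so building a child path never touches the heap
    off = depth * PATH_MAX
    return memoryview(slab)[off : off + PATH_MAX]
```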

### Extension hash table

512-slot open-addressing hash table with FNV-1a hash and linear probing. Built once at module init from the sorted extension table. O(1) average lookup, no heap allocation, never modified after init.
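
A minimal Python sketch of the same scheme; the FNV constants below are the standard 32-bit values, assumed rather than read from the source:

```python
SLOTS = 512

def fnv1a(s: str) -> int:
    h = 0x811C9DC5                       # FNV-1a 32-bit offset basis
    for ch in s.encode():
        h = ((h ^ ch) * 0x01000193) & 0xFFFFFFFF   # FNV prime, 32-bit wrap
    return h

table = [None] * SLOTS                   # (extension, category) pairs, built once

def insert(ext: str, category: str) -> None:
    i = fnv1a(ext) % SLOTS
    while table[i] is not None:          # linear probing on collision
        i = (i + 1) % SLOTS
    table[i] = (ext, category)

def lookup(ext: str):
    i = fnv1a(ext) % SLOTS
    while table[i] is not None:
        if table[i][0] == ext:
            return table[i][1]
        i = (i + 1) % SLOTS
    return None                          # unknown extension: "Other/Unknown"

insert("py", "Code/Source")
assert lookup("py") == "Code/Source"
```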

### FileArray pre-allocation

When `return_files`, `export_csv`, or `find_duplicates` is enabled, each thread pre-allocates a `FileInfo` array of 65,536 entries before scanning begins. Growth beyond that doubles via `realloc`. For typical filesystems: zero reallocations during the scan.

### Min-heap for largest_n

Each thread maintains a min-heap of capacity N. Per-file cost: O(log N) comparison, no lock. Thread heaps merged globally after join using the same push logic.
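
In Python terms, the per-thread logic resembles a capped `heapq` (a sketch, not the C implementation):

```python
import heapq

def push_top_n(heap, n, size, path):
    # keep the N largest: the heap root is always the smallest survivor
    if len(heap) < n:
        heapq.heappush(heap, (size, path))
    elif size > heap[0][0]:
        heapq.heapreplace(heap, (size, path))   # O(log N) pop-and-push

def merge_heaps(heaps, n):
    merged = []
    for h in heaps:                             # same push logic, applied globally
        for size, path in h:
            push_top_n(merged, n, size, path)
    return sorted(merged, reverse=True)         # descending by size
```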

### Two-phase duplicate detection

1. `qsort` all files by size — O(M log M), no I/O
2. For each same-size group ≥2 members: read first 4KB of each, compute CRC32, sort by CRC32, group consecutive matches

Zero I/O for unique-size files. One bounded read per candidate. CRC32 computed using a fully inlined lookup table — no external library.
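
A condensed Python equivalent of the two phases (using `zlib.crc32`; the extension performs this in C with its own inlined CRC table):

```python
import zlib
from collections import defaultdict

def find_duplicate_groups(files):               # files: [(path, size), ...]
    # Phase 1: bucket by size; unique sizes need no I/O at all
    by_size = defaultdict(list)
    for path, size in files:
        if size > 0:                            # zero-size files are skipped
            by_size[size].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        # Phase 2: fingerprint the first 4KB of each candidate
        by_crc = defaultdict(list)
        for p in paths:
            with open(p, "rb") as f:
                by_crc[zlib.crc32(f.read(4096))].append(p)
        groups.extend(g for g in by_crc.values() if len(g) > 1)
    return groups
```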

---

## Security and Compliance

| Property                                  | Guarantee                                                                                                                          |
| ----------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
| **No file contents read**           | Only directory entries and metadata. Exception: `find_duplicates=True` reads up to 4KB per candidate — bounded, opt-in, read-only |
| **Symlinks never followed**         | Linux: `fstatat(AT_SYMLINK_NOFOLLOW)`. POSIX: `lstat`. Windows: `FILE_ATTRIBUTE_REPARSE_POINT` skipped unconditionally            |
| **Depth hard-capped at 64**         | Enforced in C at the top of every `process_dir_recursive` call — cannot be bypassed by filesystem topology                        |
| **All path assembly bounded**       | `snprintf(slab, PATH_MAX, ...)` — always null-terminated, always within `PATH_MAX` bytes                                          |
| **Errors counted, not silenced**    | Every failed `opendir`/`open`/`FindFirstFileW` increments `scan_errors` and continues — the final count is exact                  |
| **Work queue bounded**              | 131,072 fixed slots. Overflow falls back to inline recursion — no unbounded allocation                                            |
| **Hash table immutable after init** | Built once at module load. No runtime modification                                                                                |
| **Zero external dependencies**      | No mandatory third-party packages — no supply chain surface                                                                       |

---

## Enterprise Recipes

### Storage cost allocation

```python
import anscom

result = anscom.scan("/mnt/nas", workers=16, ignore_junk=True, silent=True)
total = result["total_files"]
cats  = result["categories"]

media = cats["Videos"] + cats["Images"] + cats["Audio"]
code  = cats["Code/Source"]
docs  = cats["Documents"]

print(f"Media   : {media:>10,}  ({media/total*100:5.1f}%)")
print(f"Code    : {code:>10,}  ({code/total*100:5.1f}%)")
print(f"Docs    : {docs:>10,}  ({docs/total*100:5.1f}%)")
print(f"Total   : {total:>10,}  in {result['duration_seconds']:.2f}s")
```

### Pre-migration audit

```python
import anscom

result = anscom.scan(
    "/legacy-server/data",
    max_depth    = 30,
    silent       = True,
    return_files = True,
    export_json  = "audit.json",
    export_csv   = "inventory.csv"
)

print(f"Recorded {result['total_files']:,} files")
print(f"Errors  : {result['scan_errors']}")
```

### CI/CD policy gate

```python
import anscom, sys

result = anscom.scan("./repo", silent=True, ignore_junk=True)

violations = []
if result["categories"]["Executables"] > 0:
    violations.append(f"{result['categories']['Executables']} executable files")
if result["categories"]["Videos"] > 0:
    violations.append(f"{result['categories']['Videos']} video files")

if violations:
    for v in violations:
        print(f"POLICY VIOLATION: {v}")
    sys.exit(1)

print("File composition check passed.")
```

### Storage reclamation

```python
import anscom

result = anscom.scan(
    "/mnt/media-archive",
    find_duplicates = True,
    return_files    = True,
    workers         = 16,
    silent          = True
)

size_map = {f["path"]: f["size"] for f in result["files"]}
wasted   = sum(
    sum(size_map.get(p, 0) for p in group[1:])
    for group in result["duplicates"]
)

print(f"Duplicate groups : {len(result['duplicates'])}")
print(f"Reclaimable      : {wasted / (1024**3):.2f} GB")

groups_by_waste = sorted(
    result["duplicates"],
    key=lambda g: sum(size_map.get(p, 0) for p in g[1:]),
    reverse=True
)
for group in groups_by_waste[:5]:
    waste = sum(size_map.get(p, 0) for p in group[1:])
    print(f"\n  {waste / (1024**2):.1f} MB wasted:")
    for path in group:
        print(f"    {path}")
```

### Top-100 largest files

```python
import anscom

result = anscom.scan("/mnt/storage", largest_n=100, workers=32, silent=True)

total_gb = sum(f["size"] for f in result["largest_files"]) / (1024**3)
print(f"Top 100 total: {total_gb:.1f} GB\n")

for i, f in enumerate(result["largest_files"][:10], 1):
    print(f"{i:3}. {f['size']/1024**3:8.2f} GB  {f['path']}")
```

### Regex scan — test files only

```python
import anscom
from collections import Counter
import os

result = anscom.scan(
    "/codebase",
    regex_filter = r"/tests?/.*\.py$",
    return_files = True,
    silent       = True
)

print(f"Test files: {result['total_files']}")

dirs = Counter(os.path.dirname(f["path"]) for f in result["files"])
for d, count in dirs.most_common(10):
    print(f"  {count:4d}  {d}")
```

### Live Prometheus push

```python
import anscom
from prometheus_client import Gauge, start_http_server

start_http_server(9090)
g_progress = Gauge("anscom_files_scanned",   "Files scanned so far")
g_total    = Gauge("anscom_total_files",      "Total files found")
g_duration = Gauge("anscom_duration_seconds", "Scan duration")

result = anscom.scan(
    "/data-lake",
    callback = lambda n: g_progress.set(n),
    silent   = True,
    workers  = 32
)

g_total.set(result["total_files"])
g_duration.set(result["duration_seconds"])
```

### Full audit — everything at once

```python
import anscom

result = anscom.scan(
    "/mnt/enterprise",
    max_depth       = 20,
    workers         = 32,
    ignore_junk     = True,
    silent          = True,
    largest_n       = 50,
    find_duplicates = True,
    return_files    = True,
    export_json     = "audit.json",
    export_csv      = "inventory.csv",
    show_tree       = True,
    export_tree     = "tree.txt",
)

print(f"Files        : {result['total_files']:,}")
print(f"Duration     : {result['duration_seconds']:.3f}s")
print(f"Dup groups   : {len(result['duplicates'])}")
print(f"Largest file : {result['largest_files'][0]['path']}")
print("Written      : audit.json  inventory.csv  tree.txt")
```

---

## Changelog

### v1.5.0 (current)

* **Added** `return_files` — per-file list in result dict with `path`, `size`, `ext`, `category`, `mtime`
* **Added** `export_csv` — per-file inventory as UTF-8 CSV, zero dependencies, RFC 4180-compliant quoting
* **Added** `largest_n` — top-N files by size using per-thread min-heap, O(log N) per file
* **Added** `find_duplicates` — size-bucket + CRC32 duplicate detection, zero I/O for unique-size files
* **Added** `regex_filter` — path pattern filter; POSIX `regexec` on Linux/macOS (no GIL), Python `re` fallback on Windows
* **Added** `FILEARRAY_INIT_CAP` (65536) pre-allocation per thread — zero reallocations for typical scans
* **Fixed** `fstatat` on Linux is now called only when needed — two separate guards for type resolution vs. size/mtime collection
* **Fixed** `sorted_top` paths are `strdup`'d independently from `global_heap` — no lifetime overlap, no double-free
* **Removed** `export_excel` — was crashing on Windows due to an `openpyxl` `Workbook.read_only` exception; use `export_csv` + pandas `DataFrame.to_excel()` instead
* **Improved** full docstring on `anscom.scan`, accessible via `help(anscom.scan)`

### v1.3.0

* Added `export_json`, `export_excel`, `export_tree`
* Fixed DFS tree output ordering
* Added file tracking in tree mode

### v1.2.0 and earlier

* Multi-threaded worker pool with condition-variable termination detection
* `getdents64` direct syscall on Linux
* Per-thread statistics, zero shared state during scan
* `ignore_junk`, `min_size`, `extensions`, `callback`, `silent`

---

## License

MIT License. Free for personal and commercial use.
