Metadata-Version: 2.4
Name: anscom
Version: 1.4.0
Summary: High-performance native C recursive file scanner: multi-threaded, terabyte-scale, with CSV/JSON/Tree export, duplicate detection, largest-N report, and regex filtering.
Home-page: https://github.com/PC5518/anscom-nfie-python-extension
Author: Aditya Narayan Singh
Author-email: adityansdsdc@outlook.com
Project-URL: Homepage, https://anscomqs.github.io/anscom/
Project-URL: Source, https://github.com/PC5518/anscom-nfie-python-extension
Project-URL: Bug Tracker, https://github.com/PC5518/anscom-nfie-python-extension/issues
Keywords: filesystem,scanner,file-analysis,directory,recursive,multithreaded,C-extension,duplicate-detection,disk-usage,audit,enterprise
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: C
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: System :: Filesystems
Classifier: Topic :: System :: Systems Administration
Classifier: Topic :: Utilities
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: System Administrators
Classifier: Development Status :: 5 - Production/Stable
Requires-Python: >=3.6
Description-Content-Type: text/markdown
License-File: LICENSE.txt
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license-file
Dynamic: project-url
Dynamic: requires-python
Dynamic: summary


# Anscom

**High-performance native C recursive file scanner for Python. *v1.4.0***

*MIT Licensed* 

Multi-threaded · Terabyte-scale · Zero dependencies · Cross-platform

```bash
pip install anscom
```

---

## What it is

Anscom is a Python C extension that scans directories at raw OS speed. It uses direct kernel syscalls (`getdents64` on Linux, `FindFirstFileW` on Windows, `readdir`/`lstat` on macOS), a multi-threaded work queue, and per-thread statistics accumulation. It never loads file contents into memory. It never follows symlinks. Per-entry overhead stays flat as the filesystem grows.

The result is always a plain Python `dict` — five keys minimum, more when you ask for them.

```python
import anscom

result = anscom.scan("/mnt/storage")
# → {'total_files': 2841903, 'scan_errors': 0, 'duration_seconds': 1.87,
#    'categories': {...}, 'extensions': {...}}
```

2,841,903 files. 1.87 seconds. 16 threads. No configuration.

---

## Table of Contents

* [Installation](#installation)
* [Quick Start](#quick-start)
* [Full API Reference](#full-api-reference)
* [Return Value](#return-value)
* [All Parameters in Depth](#all-parameters-in-depth)
  * [path](#path)
  * [max_depth](#max_depth)
  * [workers](#workers)
  * [min_size](#min_size)
  * [extensions](#extensions)
  * [ignore_junk](#ignore_junk)
  * [silent](#silent)
  * [show_tree](#show_tree)
  * [callback](#callback)
  * [export_json](#export_json)
  * [export_tree](#export_tree)
  * [export_csv](#export_csv)
  * [return_files](#return_files)
  * [largest_n](#largest_n)
  * [find_duplicates](#find_duplicates)
  * [regex_filter](#regex_filter)
* [Export Features](#export-features)
* [Tree Mode](#tree-mode)
* [Exclusion Filter](#exclusion-filter)
* [Report Format](#report-format)
* [File Categories and Extensions](#file-categories-and-extensions)
* [Architecture](#architecture)
* [Security and Compliance](#security-and-compliance)
* [Enterprise Recipes](#enterprise-recipes)
* [Changelog](#changelog)
* [License](#license)

---

## Installation

```bash
pip install anscom
```

Requires Python 3.6+. Works on Linux, macOS, and Windows.

> **Windows source builds** require the "Desktop development with C++" workload from Visual Studio Build Tools.

No runtime dependencies. Every feature in v1.4.0 works with nothing else installed.

### Verify

```python
import anscom
r = anscom.scan(".", silent=True)
print(r["total_files"], "files —", round(r["duration_seconds"], 3), "s")
```

---

## Quick Start

```python
import anscom

# Default scan — prints live counter + full report
anscom.scan(".")

# Silent scan — just get the dict
result = anscom.scan(".", silent=True)

# Scan a specific path with more depth
result = anscom.scan("/home/user/projects", max_depth=20, silent=True)

# Print the category breakdown
for cat, count in result["categories"].items():
    if count > 0:
        print(f"{cat:20s} {count:>10,}")
```

---

## Full API Reference

```python
anscom.scan(
    path,                    # str      — required
    max_depth    = 6,        # int
    show_tree    = False,    # bool
    workers      = 0,        # int
    min_size     = 0,        # int
    extensions   = None,     # list[str] | None
    callback     = None,     # callable | None
    silent       = False,    # bool
    ignore_junk  = False,    # bool
    export_json  = None,     # str | None
    export_tree  = None,     # str | None
    return_files = False,    # bool
    export_csv   = None,     # str | None
    largest_n    = 0,        # int
    find_duplicates = False, # bool
    regex_filter = None,     # str | None
) -> dict
```

---

## Return Value

The return value is always a `dict`. Five keys are always present. Three are added on demand.

| Key                  | Type                | Always?                  | Description                                                       |
| -------------------- | ------------------- | ------------------------ | ----------------------------------------------------------------- |
| `total_files`      | `int`             | ✓                       | Files that passed all filters and were categorized                |
| `scan_errors`      | `int`             | ✓                       | Paths that failed to open (permissions, broken links)             |
| `duration_seconds` | `float`           | ✓                       | Wall-clock time from first thread spawn to last join              |
| `categories`       | `dict[str, int]`  | ✓                       | All 9 categories, always present even if zero                     |
| `extensions`       | `dict[str, int]`  | ✓                       | Only non-zero extension counts                                    |
| `files`            | `list[dict]`      | `return_files=True`    | Per-file records: `path`, `size`, `ext`, `category`, `mtime`      |
| `largest_files`    | `list[dict]`      | `largest_n > 0`        | Top-N files by size: `path`, `size`                               |
| `duplicates`       | `list[list[str]]` | `find_duplicates=True` | Groups of paths with matching size and first-4 KB CRC32 fingerprint |

The nine category keys inside `result["categories"]`:

```
"Code/Source"    "Documents"      "Images"         "Videos"
"Audio"          "Archives"       "Executables"    "System/Config"
"Other/Unknown"
```

---

## All Parameters in Depth

### `path`

**Type:** `str` — **Required**

The root directory to scan. Accepts relative paths (`.`, `../data`), absolute paths (`/mnt/storage`, `C:\Users`), or an empty string (treated as `.`).

```python
anscom.scan(".")
anscom.scan("/mnt/nas")
anscom.scan("C:\\Users\\Aditya\\Documents")
anscom.scan("")  # same as "."
```

---

### `max_depth`

**Type:** `int` — **Default:** `6` — **Range:** `[0, 64]`

Maximum directory recursion depth. Depth 0 means only the immediate children of `path` are examined — no subdirectories are entered. Depth 64 is the hard ceiling enforced in C.

```python
# Only the top level — no recursion
anscom.scan("/data", max_depth=0, silent=True)

# Standard project scan
anscom.scan("/project", max_depth=6, silent=True)

# Deep NAS or archive scan
anscom.scan("/mnt/archive", max_depth=30, silent=True)

# Maximum depth — unlimited for practical purposes
anscom.scan("/", max_depth=64, silent=True)
```

Values below 0 are clamped to 0. Values above 64 are clamped to 64.

---

### `workers`

**Type:** `int` — **Default:** `0`

Number of worker threads. `0` auto-detects the hardware CPU count via `sysconf(_SC_NPROCESSORS_ONLN)` on Linux/macOS and `GetSystemInfo()` on Windows. If auto-detection fails, falls back to 4.

When `show_tree=True`, `workers` is forced to `1` regardless of what is passed — multiple threads writing to stdout would produce interleaved output.

```python
# Auto (recommended for most cases)
anscom.scan("/data", workers=0)

# Pin to a specific count
anscom.scan("/data", workers=8)

# Maximum parallelism on a 64-core machine
anscom.scan("/data", workers=64)
```

At shallow depths the work queue feeds all threads efficiently. At depth ≥ 3 each thread recurses inline, so thread count has diminishing returns past ~16 on typical filesystems unless the tree is extremely wide.
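
Worker count is cheap to tune empirically. A quick sweep on your own tree (illustrative; timings vary with filesystem and cache state):

```python
import anscom

# Compare scan times at several thread counts on the same path
for w in (1, 4, 8, 16, 32):
    r = anscom.scan("/data", workers=w, silent=True)
    print(f"{w:2d} threads: {r['duration_seconds']:6.2f}s  "
          f"{r['total_files']:,} files")
```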

---

### `min_size`

**Type:** `int` — **Default:** `0` (no filter)

Skip all files smaller than this many bytes. Files below the threshold are not counted, not categorized, and not included in `return_files` or `export_csv` output.

```python
# Only files larger than 1 MB
anscom.scan("/data", min_size=1024 * 1024, silent=True)

# Only files larger than 100 MB
anscom.scan("/mnt/video", min_size=100 * 1024 * 1024, silent=True)

# Only files larger than 1 GB
anscom.scan("/mnt/backup", min_size=1024 ** 3, silent=True)
```

On Linux, `fstatat()` is called to retrieve file size only when this filter is active. On Windows, the size is available directly in `WIN32_FIND_DATAW` at no extra syscall cost.

---

### `extensions`

**Type:** `list[str]` | `None` — **Default:** `None`

Extension whitelist. When set, **only** files whose extension matches one of the listed strings are counted. All other files are silently skipped — they do not appear in counts, categories, `files`, `export_csv`, or any other output.

Pass extensions without the leading dot, lowercase.

```python
# Count only Python files
result = anscom.scan("/repo", extensions=["py"], silent=True)

# Count only web code
result = anscom.scan("/project", extensions=["js", "ts", "jsx", "tsx", "css", "html"])

# Count only media
result = anscom.scan("/media", extensions=["mp4", "mkv", "mov", "avi", "mp3", "flac"])

# Count only documents
result = anscom.scan("/docs", extensions=["pdf", "docx", "xlsx", "pptx", "md", "txt"])
```

Unknown extensions (not in the built-in table) are also excluded when a whitelist is active.
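
Since matching expects lowercase, dot-free strings, a small normalization step (a hypothetical helper, not part of the API) guards against user input like `".PY"`:

```python
def norm_exts(exts):
    # hypothetical helper: ".PY" / "Md" -> "py" / "md"
    return [e.lower().lstrip(".") for e in exts]

result = anscom.scan("/repo", extensions=norm_exts([".PY", "Md"]), silent=True)
```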

---

### `ignore_junk`

**Type:** `bool` — **Default:** `False`

When `True`, the following directories are skipped **entirely** — no `opendir`, no syscall, no recursion. The check is a case-insensitive match on the directory basename, at any depth, under any parent.

**Skipped directories:**

| Category         | Directories                                                                        |
| ---------------- | ---------------------------------------------------------------------------------- |
| Version control  | `.git` `.svn` `.hg`                                                          |
| IDE metadata     | `.idea` `.vscode`                                                              |
| Dependency trees | `node_modules` `bower_components` `site-packages` `.venv` `venv` `env` |
| Build output     | `build` `dist` `target` `__pycache__`                                      |
| Cache / temp     | `temp` `tmp` `.cache` `.pytest_cache` `.mypy_cache`                      |

```python
# Measure dependency bloat
raw   = anscom.scan("/project", ignore_junk=False, silent=True)
clean = anscom.scan("/project", ignore_junk=True,  silent=True)
bloat = raw["total_files"] - clean["total_files"]
print(f"Dependency files: {bloat:,}")

# Fast production audit — skip all junk
result = anscom.scan("/codebase", ignore_junk=True, workers=32, silent=True)
```

The default is `False` — Anscom counts everything unless you opt in to exclusions.

---

### `silent`

**Type:** `bool` — **Default:** `False`

When `False` (default), Anscom prints:

* A live "Scanned files: N ..." counter that updates every 250ms
* The full summary report and extension breakdown on completion

When `True`, all of that is suppressed. The returned `dict` is always identical regardless of this flag.

`silent=True` does **not** suppress tree output from `show_tree=True` — those are separate.

```python
# For scripting — no output, just the data
result = anscom.scan("/data", silent=True)

# For interactive use — full live output
anscom.scan("/data")
```

---

### `show_tree`

**Type:** `bool` — **Default:** `False`

When `True`, prints a DFS-ordered directory tree to `sys.stdout` as each entry is discovered. Forces `workers=1` to guarantee correct ordering.

```
  |-- [src]
  |   |   |-- main.py
  |   |   |-- utils.py
  |   |   |-- [tests]
  |   |   |   |   |-- test_main.py
  |-- [docs]
  |   |   |-- readme.md
  |-- config.json
```

* Square brackets `[name]` indicate a directory
* No brackets indicates a regular file
* Each depth level adds `"  |   "` (6 characters) of indentation

Output is produced one line at a time via `PySys_WriteStdout`. Any `sys.stdout` redirect in Python will capture every line. There is no internal buffer — a 50 million file filesystem produces 50+ million lines without accumulating memory.

```python
# Print tree to terminal
anscom.scan(".", show_tree=True, max_depth=4)

# Capture tree in Python
import io, sys
buf = io.StringIO()
sys.stdout = buf
anscom.scan(".", show_tree=True, max_depth=3, silent=True)
sys.stdout = sys.__stdout__
tree_text = buf.getvalue()

# Save tree to file (see also export_tree)
anscom.scan("/data", show_tree=True, silent=True, export_tree="tree.txt")
```

---

### `callback`

**Type:** `callable` | `None` — **Default:** `None`

A Python callable invoked approximately every 1 second with the current scanned file count as a single `int` argument. Fired by the progress thread every 4th tick (250ms × 4 = 1000ms).

The GIL is acquired before each call and released immediately after. Scan worker threads are never blocked by callback invocation.

```python
def on_progress(n):
    print(f"\rScanned: {n:,}", end="", flush=True)

result = anscom.scan("/data", callback=on_progress, silent=True)
print()

# Push to Prometheus
from prometheus_client import Gauge
g = Gauge("files_scanned", "Current file scan count")
anscom.scan("/data", callback=lambda n: g.set(n), silent=True)
```

---

### `export_json`

**Type:** `str` | `None` — **Default:** `None`

Path to write the full result dict as a formatted JSON file. Uses Python's built-in `json` module — no external dependencies. Written with 4-space indentation after the scan completes.

The JSON file contains **all** keys that are in the returned dict, including optional keys (`files`, `largest_files`, `duplicates`) when those features are enabled in the same call.

```python
anscom.scan("/data", export_json="report.json", silent=True)

# With optional features — JSON gets those keys too
anscom.scan(
    "/data",
    export_json     = "report.json",
    return_files    = True,
    largest_n       = 10,
    find_duplicates = True,
    silent          = True
)
```

Example output:

```json
{
    "total_files": 21008,
    "scan_errors": 0,
    "duration_seconds": 1.5186,
    "categories": {
        "Code/Source": 5955,
        "Documents": 203,
        "Images": 151,
        "Videos": 0,
        "Audio": 730,
        "Archives": 0,
        "Executables": 0,
        "System/Config": 5707,
        "Other/Unknown": 8992
    },
    "extensions": {
        "py": 5955,
        "pyc": 5707,
        "mp3": 730,
        "txt": 160,
        "png": 151
    }
}
```

---

### `export_tree`

**Type:** `str` | `None` — **Default:** `None`

Path to write the tree output to a text file. **Only active when `show_tree=True`.**

The file is written incrementally — each line is written and flushed as it is produced. For a filesystem with 50 million entries this produces a multi-gigabyte file without accumulating any output in memory. `stdout` and the file both receive every line simultaneously.

```python
anscom.scan(
    "/mnt/storage",
    show_tree   = True,
    export_tree = "filesystem_tree.txt",
    silent      = True,
    max_depth   = 64
)
```
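
Because every line is flushed as it is produced, the finished file can be post-processed as a stream, for example counting directories versus files (directory lines end in `]`):

```python
# Stream the finished tree file; nothing is held in memory at once
dirs = files = 0
with open("filesystem_tree.txt", encoding="utf-8") as fh:
    for line in fh:
        if line.rstrip().endswith("]"):   # "[name]" marks a directory
            dirs += 1
        else:
            files += 1
print(f"{dirs:,} directories, {files:,} files")
```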

---

### `export_csv`

**Type:** `str` | `None` — **Default:** `None`

Path to write a per-file inventory as a UTF-8 CSV. Columns: `path`, `size`, `ext`, `category`, `mtime`.

* `path`: full absolute path, RFC 4180-quoted (double-quoted, inner quotes doubled)
* `size`: file size in bytes as an integer
* `ext`: lowercase extension without the dot (empty string for unrecognized extensions)
* `category`: one of the 9 category names
* `mtime`: Unix timestamp (seconds since epoch) of last modification

```python
anscom.scan("/data", export_csv="inventory.csv", silent=True)
```

**Loading the CSV downstream:**

```python
# With pandas
import pandas as pd
df = pd.read_csv("inventory.csv")
print(df.groupby("category")["size"].sum().sort_values(ascending=False))

# Convert to Excel
df.to_excel("report.xlsx", index=False)

# Standard library only
import csv
with open("inventory.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(row["path"], row["size"])

# With openpyxl directly
import csv, openpyxl
wb = openpyxl.Workbook()
ws = wb.active
with open("inventory.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        ws.append(row)
wb.save("report.xlsx")
```
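
The `mtime` column is a plain Unix timestamp; convert it at read time for human-readable reports:

```python
import csv
from datetime import datetime

with open("inventory.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # mtime is seconds since the epoch, stored as an integer
        modified = datetime.fromtimestamp(int(row["mtime"]))
        print(f"{modified:%Y-%m-%d}  {row['path']}")
```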

---

### `return_files`

**Type:** `bool` — **Default:** `False`

When `True`, the result dict gains a `"files"` key containing a Python `list` of dicts — one entry per scanned file.

Each dict has five fields:

| Field        | Type    | Description                                                |
| ------------ | ------- | ---------------------------------------------------------- |
| `path`     | `str` | Full absolute path to the file                             |
| `size`     | `int` | File size in bytes                                         |
| `ext`      | `str` | Lowercase extension (no dot), empty string if unrecognized |
| `category` | `str` | One of the 9 category names                                |
| `mtime`    | `int` | Unix timestamp of last modification                        |

```python
result = anscom.scan("/project", return_files=True, silent=True)

# Iterate
for f in result["files"]:
    print(f["path"], f["size"], f["category"])

# Filter in Python
large_code = [
    f for f in result["files"]
    if f["category"] == "Code/Source" and f["size"] > 50_000
]

# Sort by size descending
by_size = sorted(result["files"], key=lambda f: f["size"], reverse=True)
print("Largest file:", by_size[0]["path"])

# Group by extension
from collections import defaultdict
by_ext = defaultdict(list)
for f in result["files"]:
    by_ext[f["ext"]].append(f)
```

`len(result["files"]) == result["total_files"]` is always true.

---

### `largest_n`

**Type:** `int` — **Default:** `0` (disabled)

When > 0, finds the top N files by size across the entire scanned filesystem. Uses a per-thread min-heap of capacity N — O(log N) per file, no extra pass, no sorting of the full file list. After all threads join, per-thread heaps are merged and sorted descending.
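
For intuition, the same bounded-heap logic in pure Python with `heapq` (a sketch, not the C source):

```python
import heapq

def top_n(files, n):
    """files: iterable of (size, path) pairs. Keeps at most n entries."""
    heap = []
    for size, path in files:
        if len(heap) < n:
            heapq.heappush(heap, (size, path))
        elif size > heap[0][0]:
            heapq.heapreplace(heap, (size, path))  # evict the current minimum
    return sorted(heap, reverse=True)              # descending, like largest_files
```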

The result dict gains a `"largest_files"` key where each entry is a dict with `path` (str) and `size` (int).

```python
result = anscom.scan("/mnt/storage", largest_n=20, silent=True)

for f in result["largest_files"]:
    gb = f["size"] / (1024 ** 3)
    print(f"{gb:8.2f} GB  {f['path']}")
```

The printed report also gains a section:

```
=== TOP 20 LARGEST FILES ===========================
  1073741824 bytes : /data/backup/archive.tar.gz
   536870912 bytes : /data/media/4k_reel.mkv
...
===================================================
```

```python
# Find the single largest file
result = anscom.scan("/mnt/nas", largest_n=1, silent=True)
top = result["largest_files"][0]
print(f"Largest: {top['path']} ({top['size']:,} bytes)")

# Top 100 across a petabyte volume
result = anscom.scan("/mnt/petabyte", largest_n=100, workers=64, silent=True)
```

---

### `find_duplicates`

**Type:** `bool` — **Default:** `False`

When `True`, detects duplicate files using a two-phase algorithm:

1. **Size bucketing** — all files sorted by size. Files with a unique size are skipped entirely — zero I/O.
2. **CRC32 fingerprinting** — for each same-size group (≥2 files, non-zero size), the first 4096 bytes of each file are read and CRC32 is computed. Files in the same group with matching CRC32 are reported as duplicates.

The result dict gains a `"duplicates"` key: a list of groups, each group being a list of path strings. Every group has at least 2 members. Because the fingerprint covers only the first 4 KB, treat groups as strong candidates and verify byte-for-byte before deleting anything irreplaceable.

```python
result = anscom.scan("/media-library", find_duplicates=True, silent=True)

print(f"Duplicate groups: {len(result['duplicates'])}")

for group in result["duplicates"]:
    print(f"\nDuplicate set ({len(group)} files):")
    for path in group:
        print(f"  {path}")
```

**Calculating reclaimable space** (combine with `return_files=True`):

```python
result = anscom.scan(
    "/mnt/archive",
    find_duplicates = True,
    return_files    = True,
    silent          = True
)

size_map = {f["path"]: f["size"] for f in result["files"]}

wasted = sum(
    sum(size_map.get(p, 0) for p in group[1:])   # keep 1, discard rest
    for group in result["duplicates"]
)

print(f"Reclaimable: {wasted / (1024**3):.2f} GB across {len(result['duplicates'])} groups")
```

The printed report adds:

```
=== DUPLICATES SUMMARY ============================
Groups found : 142
===================================================
```

---

### `regex_filter`

**Type:** `str` | `None` — **Default:** `None`

A regular expression pattern. When set, **only** files whose full absolute path matches the pattern are counted, categorized, and included in any file-tracking output (`return_files`, `export_csv`, `find_duplicates`, `largest_n`).

**Platform behavior:**

* **Linux / macOS:** Compiled with POSIX `regcomp(REG_EXTENDED | REG_NOSUB)`, matched with `regexec` — **no GIL acquisition**, no Python overhead; runs fully in C inside the worker threads.
* **Windows:** Falls back to Python's `re` module (GIL acquired per file). For large scans on Windows, prefer the `extensions` whitelist, which has zero GIL cost.

The pattern is also compiled with Python's `re.compile` before the scan starts. An invalid pattern raises `ValueError` immediately.

```python
# Only .py files anywhere under a tests/ directory
result = anscom.scan("/codebase", regex_filter=r"/tests/.*\.py$", silent=True)

# Only files in directories named 'src'
result = anscom.scan("/project", regex_filter=r"/src/", silent=True)

# Only files in a specific year's folder
result = anscom.scan("/archive", regex_filter=r"/2024/", silent=True)

# Only Python test files
result = anscom.scan("/repo", regex_filter=r"test_.*\.py$", silent=True)
print(f"Test files: {result['total_files']}")

# Invalid patterns raise ValueError immediately — no scan is started
try:
    anscom.scan("/data", regex_filter=r"[invalid(")
except ValueError as e:
    print(e)  # Failed to compile regex_filter.
```

---

## Export Features

All export parameters are independent and combinable. A single scan pass can write to all simultaneously — one traversal, multiple outputs, no re-scanning.

```python
result = anscom.scan(
    "/mnt/enterprise",
    max_depth       = 20,
    workers         = 32,
    ignore_junk     = True,
    silent          = True,
    largest_n       = 50,
    find_duplicates = True,
    return_files    = True,
    export_json     = "audit.json",
    export_csv      = "inventory.csv",
    show_tree       = True,
    export_tree     = "tree.txt",
)
# One scan. Four output files. Full in-memory results.
```

| Parameter       | Format     | Dependencies                | Notes                                      |
| --------------- | ---------- | --------------------------- | ------------------------------------------ |
| `export_json` | JSON       | None (built-in)             | Full result dict including optional keys   |
| `export_csv`  | CSV        | None (built-in)             | Per-file: path, size, ext, category, mtime |
| `export_tree` | Plain text | Requires `show_tree=True` | Written line-by-line, safe at any scale    |

---

## Tree Mode

```python
# Basic tree to terminal
anscom.scan(".", show_tree=True)

# Tree saved to file
anscom.scan("/project", show_tree=True, export_tree="tree.txt", silent=True)

# Deep tree straight to file, discarding terminal output without buffering it
import contextlib, os
with open(os.devnull, "w") as devnull, contextlib.redirect_stdout(devnull):
    anscom.scan("/mnt/volume", show_tree=True, export_tree="tree.txt", max_depth=64)
```

**Output format:**

```
  |-- [src]            ← [brackets] = directory
  |   |   |-- main.py  ← no brackets = regular file
  |   |   |-- [lib]
  |   |   |   |   |-- utils.py
  |-- config.json
  |-- [tests]
  |   |   |-- test_core.py
```

* One `"  |   "` block per depth level (6 chars each)
* At depth 64: 384 characters of indentation — all structurally valid
* DFS order is strict: every file inside a directory appears before that directory's sibling
* `workers` is forced to 1 — required for correct ordering
* No internal buffer — safe at 50+ million entries

---

## Exclusion Filter

`ignore_junk=True` skips these directory names at any depth:

```
.git          .svn              .hg             .idea          .vscode
node_modules  bower_components  site-packages   .venv          venv
env           build             dist            target         __pycache__
temp          tmp               .cache          .pytest_cache  .mypy_cache
```

The check is case-insensitive basename comparison — not a path substring match. A `node_modules` at `/project/frontend/node_modules/` is caught regardless of nesting depth.
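
In Python terms, the test is equivalent to a set lookup on the lowercased basename (illustrative sketch):

```python
import os

JUNK = {
    ".git", ".svn", ".hg", ".idea", ".vscode",
    "node_modules", "bower_components", "site-packages", ".venv", "venv",
    "env", "build", "dist", "target", "__pycache__",
    "temp", "tmp", ".cache", ".pytest_cache", ".mypy_cache",
}

def is_junk(dirpath: str) -> bool:
    # case-insensitive basename match, independent of nesting depth
    return os.path.basename(dirpath).lower() in JUNK
```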

---

## Report Format

Printed to `sys.stdout` when `silent=False` (the default).

```
Anscom Enterprise v1.4.0 (Threads: 16)
Target: /data

Scanned files: 21008 ...

=== SUMMARY REPORT ================================
+-----------------+--------------+----------+
| Category        | Count        | Percent  |
+-----------------+--------------+----------+
| Code/Source     |         5955 |   28.34% |
| System/Config   |         5707 |   27.16% |
| Other/Unknown   |         8992 |   42.81% |
| Documents       |          203 |    0.97% |
| Images          |          151 |    0.72% |
+-----------------+--------------+----------+
| TOTAL FILES     |        21008 |  100.00% |
+-----------------+--------------+----------+

=== DETAILED EXTENSION BREAKDOWN ==================
+-----------------+--------------+
| Extension       | Count        |
+-----------------+--------------+
| .py             |         5955 |
| .pyc            |         5707 |
| .mp3            |          730 |
| .txt            |          160 |
| .png            |          151 |
+-----------------+--------------+

Time     : 1.5186 seconds
Errors   : 0 (permission denied / inaccessible)
===================================================

=== TOP 20 LARGEST FILES ===========================   ← only with largest_n > 0
  1073741824 bytes : /data/backup/full.tar.gz
...

=== DUPLICATES SUMMARY ============================   ← only with find_duplicates=True
Groups found : 142
===================================================
```

Capture programmatically:

```python
import io, sys
buf = io.StringIO()
sys.stdout = buf
anscom.scan("/data")
sys.stdout = sys.__stdout__
report_text = buf.getvalue()
```

---

## File Categories and Extensions

170+ extensions across 9 categories. The table is sorted lexicographically and validated at module init — if the sort invariant is violated, `import anscom` raises `RuntimeError`.

| Category                | Sample Extensions                                                                                                                                                                              |
| ----------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Code/Source**   | `c` `cpp` `cs` `go` `h` `html` `java` `js` `json` `jsx` `kt` `lua` `php` `py` `r` `rb` `rs` `sh` `sql` `swift` `ts` `vue` `xml` `yaml` `yml` |
| **Documents**     | `csv` `doc` `docx` `epub` `md` `mobi` `odp` `ods` `odt` `pdf` `ppt` `pptx` `rst` `rtf` `txt` `xls` `xlsx`                                                    |
| **Images**        | `ai` `avif` `bmp` `gif` `heic` `ico` `jpeg` `jpg` `png` `psd` `raw` `svg` `tiff` `webp`                                                                            |
| **Videos**        | `avi` `flv` `mkv` `mov` `mp4` `mpeg` `ogv` `webm` `wmv`                                                                                                                      |
| **Audio**         | `aac` `flac` `m4a` `mid` `mp3` `ogg` `wav` `wma`                                                                                                                               |
| **Archives**      | `7z` `bz2` `deb` `dmg` `gz` `iso` `jar` `rar` `tar` `tgz` `zip`                                                                                                          |
| **Executables**   | `app` `bin` `class` `dll` `elf` `exe` `msi` `pyd` `so`                                                                                                                       |
| **System/Config** | `bak` `cfg` `conf` `db` `env` `gitignore` `ini` `log` `pyc` `reg` `sys` `tmp` `ttf` `woff`                                                                         |
| **Other/Unknown** | Any extension not in the above table                                                                                                                                                           |

---

## Architecture

### OS backends

Three separate scanning implementations compiled and selected at build time:

| Platform    | Backend            | Mechanism                                                                                                                         |
| ----------- | ------------------ | --------------------------------------------------------------------------------------------------------------------------------- |
| Linux       | `getdents64`     | Direct `syscall(SYS_getdents64, dirfd, buf, 131072)` — raw kernel ABI, 128 KB read buffer, `d_type` for zero-stat type detection |
| Windows     | `FindFirstFileW` | Wide-char `wchar_t` paths, UTF-16→UTF-8 conversion, size + mtime from `WIN32_FIND_DATAW` at no extra syscall cost |
| macOS / BSD | POSIX `readdir`  | `opendir`/`readdir` with `lstat` for type resolution |

### Thread model

```
main thread
  ├── spawn N worker threads (all waiting on cond var)
  ├── spawn 1 progress thread
  ├── push root path to queue
  ├── wait until queue.count == 0 && active_workers == 0
  └── join all threads → merge stats

worker thread (×N)
  └── loop: queue_pop → process_dir_recursive → queue_task_done

process_dir_recursive
  ├── depth < 3: push subdirs to queue (parallel pickup by idle threads)
  └── depth ≥ 3: recurse inline (avoids queue overhead for deep narrow trees)
```
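
The wait condition (`queue.count == 0 && active_workers == 0`) is the same rule that `queue.Queue.join()` implements in Python. A minimal sketch of the pool's shape, assuming a `process_dir` callable that pushes any subdirectories it finds:

```python
import queue
import threading

def run(root, n_workers=4, process_dir=lambda path, q: None):
    q = queue.Queue()

    def worker():
        while True:
            path = q.get()
            if path is None:
                break
            process_dir(path, q)   # may q.put() subdirectories it discovers
            q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    q.put(root)
    q.join()            # returns once the queue is drained and no task is active
    for _ in threads:
        q.put(None)     # poison pills: shut the workers down
    for t in threads:
        t.join()
```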

### Per-thread stats — zero locks during counting

Each thread has its own `ScanStats` struct with `ext_counts[170+]` and `cat_counts[9]`. No lock is acquired during file categorization. The only shared atomic write per file is a single `__sync_fetch_and_add` for the progress counter. Stats are merged in one serial pass after all threads join.
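
In Python terms the pattern is one `Counter` per worker, merged once after all threads join (illustrative only; the C struct holds fixed-size arrays, not dicts):

```python
from collections import Counter

# Each worker owns its own counter: no lock needed while counting
per_thread = [Counter({"py": 120, "txt": 30}), Counter({"py": 88, "md": 4})]

totals = sum(per_thread, Counter())   # single serial merge after join
print(totals)                         # Counter({'py': 208, 'txt': 30, 'md': 4})
```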

### Slab path allocator

Each thread allocates `(max_depth + 2) * PATH_MAX` bytes once before scanning. Path strings during traversal are written into `slab[depth * PATH_MAX]` via `snprintf`. Zero heap allocation during traversal.

### Extension hash table

512-slot open-addressing hash table with FNV-1a hash and linear probing. Built once at module init from the sorted extension table. O(1) average lookup, no heap allocation, never modified after init.
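
For reference, a Python rendering of 32-bit FNV-1a (the bit width in the C table is an assumption); the extension pairs the hash with linear probing over the 512 slots:

```python
def fnv1a_32(data: bytes) -> int:
    h = 0x811C9DC5                                    # FNV offset basis
    for byte in data:
        h = ((h ^ byte) * 0x01000193) & 0xFFFFFFFF    # FNV prime, 32-bit wrap
    return h

slot = fnv1a_32(b"py") % 512   # on collision, probe the next slot linearly
```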

### FileArray pre-allocation

When `return_files`, `export_csv`, or `find_duplicates` is enabled, each thread pre-allocates a `FileInfo` array of 65,536 entries before scanning begins. Growth beyond that doubles via `realloc`. For typical filesystems: zero reallocations during the scan.

### Min-heap for largest_n

Each thread maintains a min-heap of capacity N. Per-file cost: O(log N) comparison, no lock. Thread heaps merged globally after join using the same push logic.

### Two-phase duplicate detection

1. `qsort` all files by size — O(M log M), no I/O
2. For each same-size group ≥2 members: read first 4KB of each, compute CRC32, sort by CRC32, group consecutive matches

Zero I/O for unique-size files. One bounded read per candidate. CRC32 is computed using a fully inlined lookup table — no external library.
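
The same two phases in pure Python, as a sketch (the extension implements this in C with its own inlined CRC32 table):

```python
import os
import zlib
from collections import defaultdict

def duplicate_groups(paths):
    # Phase 1: bucket by size; unique sizes never get opened
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    groups = []
    for size, bucket in by_size.items():
        if size == 0 or len(bucket) < 2:
            continue                                  # unique size: zero I/O
        # Phase 2: fingerprint candidates with one bounded 4 KB read each
        by_crc = defaultdict(list)
        for p in bucket:
            with open(p, "rb") as f:
                by_crc[zlib.crc32(f.read(4096))].append(p)
        groups.extend(g for g in by_crc.values() if len(g) > 1)
    return groups
```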

---

## Security and Compliance

| Property                                  | Guarantee                                                                                                                          |
| ----------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------- |
| **No file contents read**           | Only directory entries and metadata. Exception: `find_duplicates=True` reads up to 4 KB per candidate — bounded, opt-in, read-only |
| **Symlinks never followed**         | Linux: `fstatat(AT_SYMLINK_NOFOLLOW)`. POSIX: `lstat`. Windows: `FILE_ATTRIBUTE_REPARSE_POINT` skipped unconditionally |
| **Depth hard-capped at 64**         | Enforced in C at the top of every `process_dir_recursive` call — cannot be bypassed by filesystem topology |
| **All path assembly bounded**       | `snprintf(slab, PATH_MAX, ...)` — always null-terminated, always within `PATH_MAX` bytes |
| **Errors counted, not silenced**    | Every failed `opendir`/`open`/`FindFirstFileW` increments `scan_errors` and continues — the final count is exact |
| **Work queue bounded**              | 131,072 fixed slots. Overflow falls back to inline recursion — no unbounded allocation                                            |
| **Hash table immutable after init** | Built once at module load. No runtime modification                                                                                 |
| **Zero external dependencies**      | No mandatory third-party packages — no supply chain surface                                                                       |

---

## Enterprise Recipes

### Storage cost allocation

```python
import anscom

result = anscom.scan("/mnt/nas", workers=16, ignore_junk=True, silent=True)
total = result["total_files"]
cats  = result["categories"]

media = cats["Videos"] + cats["Images"] + cats["Audio"]
code  = cats["Code/Source"]
docs  = cats["Documents"]

print(f"Media   : {media:>10,}  ({media/total*100:5.1f}%)")
print(f"Code    : {code:>10,}  ({code/total*100:5.1f}%)")
print(f"Docs    : {docs:>10,}  ({docs/total*100:5.1f}%)")
print(f"Total   : {total:>10,}  in {result['duration_seconds']:.2f}s")
```

### Pre-migration audit

```python
import anscom

result = anscom.scan(
    "/legacy-server/data",
    max_depth    = 30,
    silent       = True,
    return_files = True,
    export_json  = "audit.json",
    export_csv   = "inventory.csv"
)

print(f"Recorded {result['total_files']:,} files")
print(f"Errors  : {result['scan_errors']}")
```

### CI/CD policy gate

```python
import anscom, sys

result = anscom.scan("./repo", silent=True, ignore_junk=True)

violations = []
if result["categories"]["Executables"] > 0:
    violations.append(f"{result['categories']['Executables']} executable files")
if result["categories"]["Videos"] > 0:
    violations.append(f"{result['categories']['Videos']} video files")

if violations:
    for v in violations:
        print(f"POLICY VIOLATION: {v}")
    sys.exit(1)

print("File composition check passed.")
```

### Storage reclamation

```python
import anscom

result = anscom.scan(
    "/mnt/media-archive",
    find_duplicates = True,
    return_files    = True,
    workers         = 16,
    silent          = True
)

size_map = {f["path"]: f["size"] for f in result["files"]}
wasted   = sum(
    sum(size_map.get(p, 0) for p in group[1:])
    for group in result["duplicates"]
)

print(f"Duplicate groups : {len(result['duplicates'])}")
print(f"Reclaimable      : {wasted / (1024**3):.2f} GB")

# Largest duplicate groups first
groups_by_waste = sorted(
    result["duplicates"],
    key=lambda g: sum(size_map.get(p, 0) for p in g[1:]),
    reverse=True
)
for group in groups_by_waste[:5]:
    waste = sum(size_map.get(p, 0) for p in group[1:])
    print(f"\n  {waste / (1024**2):.1f} MB wasted:")
    for path in group:
        print(f"    {path}")
```

### Top-100 largest files

```python
import anscom

result = anscom.scan("/mnt/storage", largest_n=100, workers=32, silent=True)

total_gb = sum(f["size"] for f in result["largest_files"]) / (1024**3)
print(f"Top 100 total: {total_gb:.1f} GB\n")

for i, f in enumerate(result["largest_files"][:10], 1):
    print(f"{i:3}. {f['size']/1024**3:8.2f} GB  {f['path']}")
```

### Regex scan — test files only

```python
import anscom
from collections import Counter
import os

result = anscom.scan(
    "/codebase",
    regex_filter = r"/tests?/.*\.py$",
    return_files = True,
    silent       = True
)

print(f"Test files: {result['total_files']}")

dirs = Counter(os.path.dirname(f["path"]) for f in result["files"])
for d, count in dirs.most_common(10):
    print(f"  {count:4d}  {d}")
```

### Live Prometheus push

```python
import anscom
from prometheus_client import Gauge, start_http_server

start_http_server(9090)
g_progress = Gauge("anscom_files_scanned",   "Files scanned so far")
g_total    = Gauge("anscom_total_files",      "Total files found")
g_duration = Gauge("anscom_duration_seconds", "Scan duration")

result = anscom.scan(
    "/data-lake",
    callback = lambda n: g_progress.set(n),
    silent   = True,
    workers  = 32
)

g_total.set(result["total_files"])
g_duration.set(result["duration_seconds"])
```

### Full audit — everything at once

```python
import anscom

result = anscom.scan(
    "/mnt/enterprise",
    max_depth       = 20,
    workers         = 32,
    ignore_junk     = True,
    silent          = True,
    largest_n       = 50,
    find_duplicates = True,
    return_files    = True,
    export_json     = "audit.json",
    export_csv      = "inventory.csv",
    show_tree       = True,
    export_tree     = "tree.txt",
)

print(f"Files        : {result['total_files']:,}")
print(f"Duration     : {result['duration_seconds']:.3f}s")
print(f"Dup groups   : {len(result['duplicates'])}")
print(f"Largest file : {result['largest_files'][0]['path']}")
print("Written      : audit.json  inventory.csv  tree.txt")
```

---

## Changelog

### v1.4.0 (current)

* **Added** `return_files` — per-file list in result dict
* **Added** `export_csv` — per-file inventory as UTF-8 CSV
* **Added** `largest_n` — top-N files by size using per-thread min-heap
* **Added** `find_duplicates` — size-bucket + CRC32 duplicate detection
* **Added** `regex_filter` — path pattern filter; POSIX `regexec` on Linux/macOS (no GIL), Python `re` fallback on Windows
* **Added** `FILEARRAY_INIT_CAP` (65536) pre-allocation per thread — zero reallocations for typical scans
* **Fixed** `fstatat` on Linux called only when needed — two separate guards for type resolution vs. size/mtime collection
* **Fixed** `sorted_top` paths are `strdup`'d independently from `global_heap` — no lifetime overlap, no double-free
* **Removed** built-in Excel export — it crashed on Windows due to an `openpyxl` `Workbook.read_only` exception. Use `export_csv` plus `pandas.DataFrame.to_excel()` instead
* **Improved** Full docstring on `anscom.scan` accessible via `help(anscom.scan)`

### v1.3.0

* Added `export_json`, `export_excel`, `export_tree`
* Fixed DFS tree output ordering
* Added file tracking in tree mode

### v1.2.0 and earlier

* Multi-threaded worker pool with condition-variable termination detection
* `getdents64` direct syscall on Linux
* Per-thread statistics, zero shared state during scan
* `ignore_junk`, `min_size`, `extensions`, `callback`, `silent`

---

## License

MIT License. Free for personal and commercial use.
