Metadata-Version: 2.4
Name: ptar-index
Version: 1.0.0
Summary: Python interface for parallel-tar (.etr / .idx) index files
Project-URL: Homepage, https://github.com/JBlaschke/parallel-tar
Project-URL: Repository, https://github.com/JBlaschke/parallel-tar
Project-URL: Issues, https://github.com/JBlaschke/parallel-tar/issues
Project-URL: Documentation, https://github.com/JBlaschke/parallel-tar/tree/main/python
Author-email: Johannes Blaschke <johannes@blaschke.science>
License-Expression: MIT
Keywords: archive,index,msgpack,parallel,tar
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: System :: Archiving
Classifier: Topic :: System :: Filesystems
Requires-Python: >=3.10
Requires-Dist: msgpack>=1.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Provides-Extra: pandas
Requires-Dist: pandas>=2.0; extra == 'pandas'
Description-Content-Type: text/markdown

# ptar-index

Python interface for [parallel-tar](https://github.com/JBlaschke/parallel-tar)
index files (`.etr` and `.idx`).

Load binary index files generated by `parallel-idx` and work with them using
native Python data types: dataclasses, iterators, and dicts, with optional
export to pandas DataFrames.

## Installation

From a source checkout of the project, install with `uv`:

```bash
uv sync
```

With pandas support:

```bash
uv sync --extra pandas
```

## Quick start

```python
from ptar_index import load_index

idx = load_index("example.idx")
print(idx)
# <PtarIndex [IDX] '/global/projects/data'  4821995 files, 417285 dirs, 1.65 TB>

# Browse the tree
idx.root.print_tree(max_depth=2)

# Navigate like a dict
lcls = idx.root["LCLS"]
psdm = idx.resolve("LCLS/sit_psdm_data/psdm")

# Iterate all files
for f in idx.walk_files():
    print(f.path, f.human_size, f.hash_hex)

# Glob search
for f in idx.glob("**/*.tar"):
    print(f.path, f.size)

# Export to pandas DataFrame
df = idx.to_dataframe()

# Compare two indexes
old = load_index("before.idx")
new = load_index("after.idx")
diff = old.diff(new)
print(diff.summary())  # "42 added, 3 removed, 17 changed"
```
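
To make the comparison step more concrete, here is a minimal sketch of the added/removed/changed classification that a diff of this kind performs. It is not the implementation behind `diff()`. It works on plain dicts that map a path to a `(size, hash)` tuple, and `classify_changes` and the sample data are hypothetical names invented for illustration.

```python
def classify_changes(old, new):
    """Classify paths as added, removed, or changed between two snapshots.

    Each snapshot is a dict mapping path -> (size, hash_hex).
    This is an illustrative stand-in, not ptar-index's own diff logic.
    """
    old_paths, new_paths = set(old), set(new)
    added = sorted(new_paths - old_paths)      # only in the new snapshot
    removed = sorted(old_paths - new_paths)    # only in the old snapshot
    # Present in both, but size or hash differs.
    changed = sorted(p for p in old_paths & new_paths if old[p] != new[p])
    return added, removed, changed


# Hypothetical sample data.
before = {
    "a/x.tar": (1024, "aa11"),
    "a/y.tar": (2048, "bb22"),
    "b/z.dat": (512, "cc33"),
}
after = {
    "a/x.tar": (1024, "aa11"),
    "a/y.tar": (4096, "dd44"),
    "c/new.tar": (100, "ee55"),
}

added, removed, changed = classify_changes(before, after)
summary = f"{len(added)} added, {len(removed)} removed, {len(changed)} changed"
print(summary)
```

Comparing on both size and hash means a file that was rewritten with identical length still counts as changed, as long as its hash differs.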

## CLI

The package installs a `ptar-index` command:

```bash
# Summary + tree view
ptar-index example.idx

# Inspect raw msgpack structure (for debugging)
ptar-index example.idx --raw

# List all file entries
ptar-index example.idx --files

# Filter by glob pattern
ptar-index example.idx --glob "**/*.tar"

# Full JSON export
ptar-index example.idx --json
```
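
The `**/*.tar` pattern above is a recursive glob. The sketch below shows one common interpretation, the one used by Python's `glob` module with `recursive=True`: `**/` matches zero or more directory levels and `*` stays within a single path component. Whether `ptar-index` applies exactly these rules is an assumption, and `glob_to_regex` is a hypothetical helper, not part of the package.

```python
import re


def glob_to_regex(pattern):
    """Translate a glob pattern into a compiled regex (assumed semantics).

    Supports '**/' (zero or more directories), '*' and '?' (within one
    path component). A '**' not followed by '/' is treated like '*'.
    """
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append(r"(?:[^/]+/)*")  # zero or more directory levels
            i += 3
        elif pattern[i] == "*":
            parts.append(r"[^/]*")        # anything except a separator
            i += 1
        elif pattern[i] == "?":
            parts.append(r"[^/]")         # one non-separator character
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts) + r"\Z")


# Hypothetical paths for illustration.
paths = [
    "top.tar",
    "LCLS/run1/data.tar",
    "LCLS/run1/notes.txt",
    "LCLS/archive.tar.gz",
]
matcher = glob_to_regex("**/*.tar")
matches = [p for p in paths if matcher.match(p)]
print(matches)
```

Under these rules the pattern selects `.tar` files at any depth, including the top level, and skips names that merely contain `.tar`, such as `archive.tar.gz`.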

## Debugging format issues

If the parser's automatic field detection maps fields incorrectly, dump the raw
MessagePack structure first:

```python
from ptar_index import describe_raw
print(describe_raw("example.idx", max_depth=3))
```

This shows the exact field names and types as stored in the binary file, making
it straightforward to adjust the parser.
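
MessagePack data decodes to ordinary dicts, lists, and scalars, so a depth-limited dump of this kind amounts to a short recursive walk. The sketch below illustrates the idea on an already-decoded sample object. It is not how `describe_raw` is implemented, `outline` is a hypothetical helper, and the field names in `sample` are invented rather than taken from the real index format.

```python
def outline(obj, max_depth=3, depth=0):
    """Return lines describing the keys and value types of a nested object."""
    pad = "  " * depth
    if depth >= max_depth:
        return [f"{pad}..."]  # depth limit reached; stop descending
    lines = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.append(f"{pad}{key!r}: {type(value).__name__}")
            if isinstance(value, (dict, list)):
                lines.extend(outline(value, max_depth, depth + 1))
    elif isinstance(obj, list):
        lines.append(f"{pad}[list, len={len(obj)}]")
        if obj:
            # Describe only the first element, assuming entries share a shape.
            lines.extend(outline(obj[0], max_depth, depth + 1))
    else:
        lines.append(f"{pad}{type(obj).__name__}")
    return lines


# Invented sample standing in for decoded MessagePack data.
sample = {"root": "/data", "entries": [{"name": "a.tar", "size": 1024}]}
lines = outline(sample, max_depth=3)
print("\n".join(lines))
```

Lowering `max_depth` truncates the output with a `...` marker at the cut-off level, which keeps the dump readable for very large indexes.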

## License

MIT
