Metadata-Version: 2.4
Name: iatro-base-iac
Version: 0.0.2
Summary: IatroCache (.iac): a lightweight medical data cache format
License: MIT
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pyarrow
Requires-Dist: brotli
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Dynamic: license-file
Dynamic: requires-python

# iatro-base-iac

**IatroCache (`.iac`)** — a lightweight, high-performance binary container format for offline caching of multimodal medical datasets (image tiles, feature vectors, clinical text, expert tokens, etc.).

This package provides the core container format and readers/writers for `.iac` files. It is designed for high-concurrency training pipelines, utilizing memory-mapped files (`mmap`) for thread-safe, zero-copy, lock-free random access.

---

## Installation

```bash
pip install iatro-base-iac
```

## Namespace Support

To ease migration and fit into a wider ecosystem of medical AI libraries, the package can be imported from **two namespaces**:

```python
# 1. Primary Namespace (Recommended)
from iatro import iac
from iatro.iac import build_pack, PackReader

# 2. Compatibility Namespace
from iatro_base import iac
from iatro_base.iac import build_pack, PackReader
```

---

## Format Layout

```
[ fixed header        ] 65536 bytes  — magic "IATROC", JSON header (codec, payload_type, offsets, ...)
[ slide table         ] Arrow IPC    — slide_idx / slide_id / patient_id
[ index table         ] Arrow IPC    — caller-defined columns + offset / length / crc32
[ data segment        ] raw bytes    — concatenated payloads, indexed by the index table
```
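
To make the fixed header concrete, here is a minimal sketch of writing and parsing such a block. Only the 65536-byte size, the `IATROC` magic, and the use of JSON come from the layout above. The exact byte arrangement inside the header is not specified here, so the sketch assumes the magic is followed directly by UTF-8 JSON and NUL padding. `encode_header` and `decode_header` are hypothetical helpers, not part of the package API.

```python
import json

HEADER_SIZE = 65536   # fixed header size stated in the layout above
MAGIC = b"IATROC"     # magic string stated in the layout above


def encode_header(fields: dict) -> bytes:
    """Assumed layout: magic, then UTF-8 JSON, then NUL padding to HEADER_SIZE."""
    body = MAGIC + json.dumps(fields).encode("utf-8")
    if len(body) > HEADER_SIZE:
        raise ValueError("header JSON does not fit in the fixed header")
    return body.ljust(HEADER_SIZE, b"\x00")


def decode_header(block: bytes) -> dict:
    """Check size and magic, then parse the JSON that follows (assumed layout)."""
    if len(block) != HEADER_SIZE or not block.startswith(MAGIC):
        raise ValueError("not an .iac header")
    raw_json = block[len(MAGIC):].rstrip(b"\x00")
    return json.loads(raw_json.decode("utf-8"))


header = encode_header({"payload_type": "raw_bytes", "codec": "none"})
fields = decode_header(header)
print(len(header), fields["codec"])
```

A fixed-size header means a reader can fetch it with a single bounded read before it knows anything else about the file.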

### Key Technical Features
*   **Explicit Boundaries & Integrity**: Each record carries `offset` / `length` / `crc32`. Because payload boundaries are explicit, readers never scan for framing markers, and codecs without self-delimiting frames (e.g. raw Brotli) work unchanged.
*   **High Performance**: `PackReader` uses `mmap` to map the file into virtual memory, allowing highly concurrent, lock-free random reads across worker threads/processes.
*   **Metadata Flexibility**: `payload_type` and `codec` are free-form header fields; the low-level container does not interpret the payload bytes directly.
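
The first two features can be illustrated with the standard library alone. The sketch below is not `PackReader`'s implementation. It builds a toy data segment, keeps a plain Python list as a stand-in for the index table, and reads records from an `mmap` across several threads, checking each one with `zlib.crc32`. The file name and the `read_record` helper are assumptions made for illustration.

```python
import mmap
import os
import tempfile
import zlib
from concurrent.futures import ThreadPoolExecutor

payloads = [b"first_payload_data", b"second_payload_data", b"third"]

# Build a toy data segment plus an index of (offset, length, crc32) per record.
index = []
segment = bytearray()
for p in payloads:
    index.append((len(segment), len(p), zlib.crc32(p)))
    segment += p

path = os.path.join(tempfile.mkdtemp(), "toy_segment.bin")
with open(path, "wb") as f:
    f.write(segment)


def read_record(mm: mmap.mmap, i: int) -> bytes:
    """Slice record i out of the mapping and verify its checksum."""
    offset, length, crc = index[i]
    data = mm[offset:offset + length]  # slicing an mmap returns a bytes copy
    if zlib.crc32(data) != crc:
        raise ValueError(f"crc32 mismatch for record {i}")
    return data


with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    # Slice reads do not use a shared file position, so no lock is taken here.
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(lambda i: read_record(mm, i), range(len(index))))

print(results[1])
```

Slicing copies the bytes out of the mapping. A reader that wants to avoid that copy could wrap the mapping in a `memoryview` instead.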

---

## Quick Start (Core API)

Below is an example of writing raw bytes to an `.iac` pack and reading them back:

```python
import pyarrow as pa
from iatro.iac import build_pack, PackReader

# 1. Define metadata tables
slide_table = pa.table({
    "slide_idx": pa.array([0], pa.uint8()),
    "slide_id": ["s0"], 
    "patient_id": ["p0"]
})

# Offset, length, and crc32 columns are populated automatically
index_table = pa.table({
    "item_id": ["item_a", "item_b"]
})

# 2. Build the cache file
build_pack(
    filepath="out.iac",
    header={"payload_type": "raw_bytes", "codec": "none"},
    slide_table=slide_table,
    index_table=index_table,
    payloads=[b"first_payload_data", b"second_payload_data"]
)

# 3. Read a payload back by its index (0-based)
reader = PackReader("out.iac")
print(reader.read_payload(1))  # Output: b'second_payload_data'
reader.close()
```

For large-scale or streaming datasets, refer to `build_pack_streaming`, `build_pack_data_segment`, and `build_pack_data_segment_from_file`.

---

## Clinical Text Pair Adapter

`iatro-base-iac` includes domain-specific adapters such as `clinical_text_pair`. This adapter is designed to store paired datasets (e.g., raw clinical source text and compressed text for LLM distillation/training):
*   Organizes data such that one patient maps to one record.
*   Each document inside that patient record contains both `source_text` and `compressed_text` plus metadata.
*   Allows training loaders to retrieve all document pairs for a patient in a single random-access read.

```python
from iatro.iac.adapters.text_pair import (
    ClinicalTextPairDoc,
    ClinicalTextPairReader,
    PatientTextPairs,
    build_clinical_text_pair_pack,
)

patients = [
    PatientTextPairs(
        patient_id="Patient_00000001",
        institution="XJ",
        docs=[
            ClinicalTextPairDoc(
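                # "入院记录" means "admission record"
                doc_id="Patient_00000001/2024-01-01/入院记录_20240101000000",
                source="XJ/Patient_00000001/2024-01-01/入院记录_20240101000000.txt",
                source_text="原始文书正文",  # "original document body"
                compressed_text="教师压缩正文",  # "teacher-compressed body"
                doc_type="入院记录",  # "admission record"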
                encounter="2024-01-01",
            )
        ],
    )
]
build_clinical_text_pair_pack("pairs.iac", patients)

# Read it back
reader = ClinicalTextPairReader("pairs.iac")
patient_data = reader.read_patient("Patient_00000001")
doc = patient_data.docs[0]
assert doc.source_text == "原始文书正文"
reader.close()
```
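
The one-record-per-patient idea can also be sketched independently of the adapter. The adapter's on-disk encoding is not described in this README, so the example below assumes plain JSON and uses a hypothetical `DocPair` dataclass rather than the package's own classes. It only shows why a single payload read can return every document pair for a patient.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DocPair:  # hypothetical stand-in, not the package's ClinicalTextPairDoc
    doc_id: str
    source_text: str
    compressed_text: str
    doc_type: str


def encode_patient(docs: list) -> bytes:
    """Pack every document pair for one patient into a single payload."""
    return json.dumps([asdict(d) for d in docs], ensure_ascii=False).encode("utf-8")


def decode_patient(payload: bytes) -> list:
    """Recover all document pairs from one payload."""
    return [DocPair(**d) for d in json.loads(payload.decode("utf-8"))]


docs = [
    DocPair("p1/admission", "full admission note", "short note", "admission"),
    DocPair("p1/discharge", "full discharge summary", "short summary", "discharge"),
]
payload = encode_patient(docs)      # would occupy one record in the index table
restored = decode_patient(payload)  # one read yields every pair for the patient
print(len(restored))
```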

---

## Contributing & License

This project is licensed under the MIT License. Contributions and adapters (e.g., custom payload formats or codecs) are welcome.
