Metadata-Version: 2.4
Name: mmap_ninja_dataframe
Version: 0.8.0
Summary: mmap_ninja_dataframe: Memory mapped data structures
Author-email: Hristo Vrigazov <hvrigazov@gmail.com>
License: MIT
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.4
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: OS Independent
Requires-Python: >=3.13
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: mmap_ninja
Requires-Dist: numpy
Requires-Dist: zstandard
Provides-Extra: test
Requires-Dist: pytest; extra == "test"
Requires-Dist: dnn_cool_synthetic_dataset; extra == "test"
Requires-Dist: transformers; extra == "test"
Requires-Dist: opencv-contrib-python; extra == "test"
Dynamic: license-file

# mmap_ninja_dataframe
Memory-mapped dataframe abstraction based on mmap_ninja.

Run tests:
```bash
uvx --with-editable . --with joblib --with zstandard --with dnn_cool_synthetic_dataset --with opencv-contrib-python --with transformers pytest
```

## TextPropertiesMmap

`TextPropertiesMmap` stores unique texts and their computed properties (e.g. embeddings, token counts) as memory-mapped arrays. Texts are deduplicated by content hash. Properties can be added incrementally: a property does not need to be computed for every text before the store is used.
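To illustrate what deduplication by content hash means, here is a minimal in-memory sketch. It is not the library's implementation: the class `DedupTextIndex` is hypothetical, and the choice of SHA-256 is an assumption, since this README does not state which hash `TextPropertiesMmap` uses.

```python
import hashlib


class DedupTextIndex:
    """Toy in-memory index (hypothetical): one slot per unique text."""

    def __init__(self):
        self.texts = []          # unique texts in insertion order
        self._hash_to_idx = {}   # content hash -> position in self.texts

    @staticmethod
    def content_hash(text):
        # SHA-256 is an assumption; the real store may hash differently.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def add(self, text):
        """Return the index of `text`, appending it only if it is new."""
        key = self.content_hash(text)
        idx = self._hash_to_idx.get(key)
        if idx is None:
            idx = len(self.texts)
            self.texts.append(text)
            self._hash_to_idx[key] = idx
        return idx

    def update(self, texts):
        """Add several texts and return one index per input text."""
        return [self.add(t) for t in texts]


index = DedupTextIndex()
first = index.update(["Hello, world!", "Memory maps are fast."])
second = index.update(["First.", "Hello, world!"])  # last text is a repeat
```

In this sketch the repeated text maps back to its original index, so the index holds three texts after both calls rather than four.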

### Create from a list of texts

```python
from mmap_ninja_dataframe import TextPropertiesMmap

store = TextPropertiesMmap.from_texts(
    out_dir="my_store",
    texts=["Hello, world!", "Memory maps are fast.", "Another sentence."],
)
```

### Add a text and get its index

```python
idx = store.add("A new sentence.")
# Returns the index. If the text already exists, returns its existing index without duplicating.
```

### Add multiple texts

```python
indices = store.update(["First.", "Second.", "Hello, world!"])
# "Hello, world!" already exists, so its existing index is returned and the store does not grow.
```

### Add a computed property

```python
import numpy as np

embeddings = [np.random.rand(768).astype(np.float32) for _ in range(len(store))]
store.add_property("embedding", embeddings)
```

Properties can be added partially; the array does not need to cover all texts yet:

```python
store.add_property("token_count", [np.array([5]), np.array([4])])  # only first two texts
```

### Query property status

```python
result = store.get_property("embedding")
print(result.unprocessed_count)    # number of texts without this property computed
print(result.staging_files_count)  # number of pending staged result files
print(len(result.mmap))            # number of computed results
```

### Get unprocessed indices

```python
indices = store.get_unprocessed_indices_for_property("embedding")
texts_to_process = store.text[indices]
```
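When many texts are unprocessed, it is usually convenient to work through the indices in fixed-size batches. The helper below is a generic sketch, not part of `mmap_ninja_dataframe`; the name `iter_batches` and the batch size are illustrative assumptions.

```python
def iter_batches(indices, batch_size):
    """Yield consecutive slices of `indices`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    indices = list(indices)
    for start in range(0, len(indices), batch_size):
        yield indices[start:start + batch_size]


# Stand-in for the indices returned by get_unprocessed_indices_for_property.
unprocessed = range(5)
batches = list(iter_batches(unprocessed, batch_size=2))
```

Each batch of indices can then be used to select the corresponding texts from `store.text`, assuming it accepts a list of indices in the same way it accepts the result of `get_unprocessed_indices_for_property`.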

### Stage and flush results

Use `set_results_for_property` to stage results (e.g. from a batch inference job). Results are flushed automatically when they can be applied in order:

```python
batch_texts = ["First.", "Second."]
batch_embeddings = [np.random.rand(768).astype(np.float32) for _ in batch_texts]
store.set_results_for_property("embedding", batch_texts, batch_embeddings)

# Explicitly flush any remaining staged results:
store.flush_results_for_property_if_possible("embedding")
```
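One way to picture "flushed automatically when they can be applied in order" is a buffer that holds results until every earlier index has arrived. The sketch below is a toy model under that assumption; `OrderedStagingBuffer` is hypothetical and says nothing about how the library stores staged files on disk.

```python
class OrderedStagingBuffer:
    """Toy model (hypothetical): results arrive in any order, commit in index order."""

    def __init__(self):
        self.committed = []  # stands in for the memory-mapped property array
        self.staged = {}     # index -> result still waiting for its predecessors

    def stage(self, idx, value):
        """Stage a result, then commit whatever has become contiguous."""
        if idx < len(self.committed):
            return  # already committed; ignore the duplicate
        self.staged[idx] = value
        self.flush_if_possible()

    def flush_if_possible(self):
        # Append while the next expected index is available in staging.
        while len(self.committed) in self.staged:
            self.committed.append(self.staged.pop(len(self.committed)))


buf = OrderedStagingBuffer()
buf.stage(1, "b")                      # index 0 is missing, so this stays staged
pending_after_first = len(buf.staged)
buf.stage(0, "a")                      # fills the gap; both results commit
```

In this model, the number of entries left in `staged` plays the role that `staging_files_count` appears to play in the store.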

### Look up properties by text

```python
props = store.get_text_properties("Hello, world!")
# {"embedding": array([...]), "token_count": array([5])}
# If a property is not yet computed for this text, it appears in props["unprocessed"]
```

### Fetch properties for multiple texts

```python
result = store.get_properties_for_texts(["Hello, world!", "Memory maps are fast."])
# {"text": [...], "content_hash": [...], "idx": [0, 1], "embedding": [...]}
# Raises KeyError for unknown texts, ValueError if any property is not fully computed.
```

### Check overall status

```python
status = store.get_status()
# Returns a list of PropertyResult for properties that are not yet fully computed.
for r in status:
    print(r)
```

### Delete a property

```python
store.delete_property("embedding")
# Removes the mmap directory and any staged files for that property.
```

### List computed properties

```python
store.get_properties()  # ["embedding", "token_count"]
# Does not include "text" or "content_hash"; use store.text and store.content_hash directly.
```


