Metadata-Version: 2.4
Name: map_with_tree
Version: 0.0.17
Summary: A Python library for reading and writing compressed, sorted key-value stores with efficient lookup using a B-tree-like structure.
Author-email: Arthur Gouhier <ajgouhier@gmail.com>
License-Expression: MIT
Project-URL: Repository, https://github.com/ajgouhier/map_with_tree
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: zstd>=1.5
Dynamic: license-file

# Map With Tree

A Python library for reading and writing compressed, sorted key-value stores with efficient lookup using a B-tree-like structure.

## Features

- **Efficient Storage**: Data is compressed using zstandard (zstd) with configurable compression levels
- **Fast Lookups**: Built-in B-tree index for O(log n) key lookups
- **Fast Iteration**: Sequential block iteration without tree traversal for O(n) scanning
- **Sorted Keys**: Keys are automatically sorted during finalization for efficient range queries
- **Data Integrity**: MD5 hash of all entries is computed and stored for verification
- **Flexible Types**: Support for multiple key and value types including bytes, strings, integers, floats, and JSON
- **Memory Efficient**: Block-based compression and caching minimize memory usage
- **Simple API**: Pythonic interface with context managers and dict-like operations

## Installation

```bash
pip install map_with_tree
```

Requires Python >= 3.8 and zstd >= 1.5.

## Quick Start

### Writing Data

```python
import map_with_tree

# Create a new map file with string values
with map_with_tree.open("data.mwt", "w", types=("bytes", "string")) as writer:
    writer.set(b"key_1", "value_1")
    writer.set(b"key_2", "value_2")
    writer.set(b"key_3", "value_3")
```

### Reading Data

```python
import map_with_tree

# Open and read from a map file
with map_with_tree.open("data.mwt") as reader:
    # Get a value by key
    value = reader[b"key_1"]
    
    # Check if key exists
    if b"key_2" in reader:
        print("Key exists!")
    
    # Get with default value
    value = reader.get(b"key_99", default="not found")
    
    # Iterate over all entries
    for key, value in reader:
        print(f"{key}: {value}")
    
    # Get file metadata
    print(f"Total entries: {len(reader)}")
    print(f"Header: {reader.header}")
```

## API Reference

### Opening Files

```python
map_with_tree.open(path, mode="r", **kwargs)
```

- `path`: File path for the map file
- `mode`: `"r"` for reading, `"w"` for writing
- `**kwargs`: Additional options for writing (see Writer Options)

### Writer Options

```python
MapWithTreeWriter(
    path,
    header=None,              # Custom header metadata (dict)
    types=("bytes", "bytes"), # Tuple of (key_type, value_type)
    keys_per_node=128,        # Number of keys per B-tree node
    block_size=64*1024,       # Block size for compression (64KB default)
    compression_level=3,      # zstd compression level (0-22)
    duplicates="warn,remove"  # Duplicate key handling (see below)
)
```

#### Supported Types

- **bytes**: Raw bytes (default)
- **string** or **str**: UTF-8 encoded strings
- **int** or **i64**: 64-bit signed integer
- **uint** or **u64**: 64-bit unsigned integer
- **i8**, **i16**, **i32**: Signed integers (8, 16, 32 bit)
- **u8**, **u16**, **u32**: Unsigned integers (8, 16, 32 bit)
- **float** or **f64**: 64-bit float
- **f32**: 32-bit float
- **json**: JSON-serializable objects
- **struct:format**: Custom struct format (e.g., `"struct:<IIf"` for two unsigned ints and a float)

#### Duplicate Key Handling

The `duplicates` parameter controls how duplicate keys are handled during file creation. It accepts a comma-separated string with the following options:

- **warn**: Log the total count of duplicate keys to stderr after sorting
- **list**: Log each duplicate key to stderr as it's encountered
- **remove**: Remove duplicate keys, keeping only the last occurrence

Default is `"warn,remove"`, which removes duplicates and reports the count.

**Examples:**
```python
# Remove duplicates silently
with map_with_tree.open("data.mwt", "w", duplicates="remove") as writer:
    writer.set(b"key", "value1")
    writer.set(b"key", "value2")  # Will override value1
```

```python
# Keep duplicates and don't warn (not recommended - may cause lookup issues)
with map_with_tree.open("data.mwt", "w", duplicates="") as writer:
    writer.set(b"key", "value1")
    writer.set(b"key", "value2")  # Both will be kept
```

```python
# Verbose duplicate tracking
with map_with_tree.open("data.mwt", "w", duplicates="warn,list,remove") as writer:
    writer.set(b"key", "value1")
    writer.set(b"key", "value2")
    # stderr: key b'key' duplicated
    # stderr: warning: 1 duplicated keys
```

**Note:** If you don't remove duplicates, the B-tree index will point to the first occurrence, but iteration may encounter all duplicates.

### Writer Methods

```python
writer.set(key, value)        # Add a key-value pair
writer.finalize()             # Finalize the file (called automatically on context exit)
writer.close()                # Close file handles
```

### Reader Methods

```python
reader[key]                   # Get value by key (raises KeyError if not found)
reader.get(key, default=None) # Get value with default
key in reader                 # Check if key exists
len(reader)                   # Get number of entries
iter(reader)                  # Iterate over (key, value) pairs in insertion order
reader.sorted_keys()          # Iterate over keys in sorted order (tree traversal)
reader.header                 # Access header metadata
reader.close()                # Close file handle
```

## File Format

Map With Tree (.mwt) files consist of:

1. **Magic Header**: 8-byte signature (`mwt\0\0\0\0\1`)
2. **Header Offset**: 8-byte pointer to compressed header
3. **Data Blocks**: Compressed blocks containing key-value pairs with zstd
4. **B-tree Index**: Tree structure pointing into blocks for efficient key lookups
5. **Compressed Header**: JSON metadata with file statistics

### Dual Access Modes

The format supports two efficient access patterns:

- **Random Access**: Use the B-tree index for O(log n) lookup of specific keys
- **Sequential Iteration**: Read blocks sequentially for O(n) full scans without tree traversal

This design enables:
- Sequential writes for optimal I/O performance during creation
- Minimal memory usage during both reading and writing
- Fast random access through the B-tree index
- Fast sequential iteration by reading blocks in order
- Efficient compression with block-level granularity

## Examples

### Large Dataset

```python
import map_with_tree
import uuid

# Write 100,000 entries
with map_with_tree.open("large.mwt", "w", types=("bytes", "string")) as writer:
    for i in range(100000):
        key = f"key_{i:06d}".encode()
        value = uuid.uuid4().hex
        writer.set(key, value)

# Read and check compression
with map_with_tree.open("large.mwt") as reader:
    print(f"Entries: {len(reader)}")
    uncompressed = reader.header["uncompressed_size"]
    compressed = reader.header["compressed_size"]
    print(f"Compression ratio: {compressed / uncompressed:.2%}")
```

### Structured Data with JSON

```python
import map_with_tree

with map_with_tree.open("users.mwt", "w", types=("string", "json")) as writer:
    writer.set("user_1", {"name": "Alice", "age": 30, "city": "NYC"})
    writer.set("user_2", {"name": "Bob", "age": 25, "city": "SF"})

with map_with_tree.open("users.mwt") as reader:
    user = reader[b"user_1"]
    print(f"{user['name']} is {user['age']} years old")
```

### Custom Struct Types

```python
import map_with_tree

# Values are tuples of (unsigned int, unsigned int, float)
with map_with_tree.open("metrics.mwt", "w", types=("bytes", "struct:<IIf")) as writer:
    writer.set(b"metric_1", (100, 200, 3.14))
    writer.set(b"metric_2", (150, 250, 2.71))

with map_with_tree.open("metrics.mwt") as reader:
    count1, count2, ratio = reader[b"metric_1"]
    print(f"Counts: {count1}, {count2}, Ratio: {ratio}")
```

### Efficient Iteration

The format stores both keys and values in the data blocks, enabling fast sequential iteration without traversing the B-tree:

```python
import map_with_tree

# Iterate through all entries efficiently
with map_with_tree.open("data.mwt") as reader:
    # Sequential iteration reads blocks in order
    for key, value in reader:
        process(key, value)
    
    # This is faster than tree traversal for full scans
    # because it only decompresses blocks without tree lookups
```

The iteration order is the insertion order (before sorting), which allows you to:
- Process all entries efficiently in O(n) time
- Scan through data without random access overhead
- Stream process large datasets with minimal memory

### Sorted Key Iteration

If you need to iterate over keys in sorted order, use the `sorted_keys()` method which traverses the B-tree:

```python
import map_with_tree

with map_with_tree.open("data.mwt") as reader:
    # Get keys in sorted order via tree traversal
    for key in reader.sorted_keys():
        value = reader[key]
        process(key, value)
    
    # Or just get the sorted keys
    all_sorted_keys = list(reader.sorted_keys())
```

Key differences between iteration modes:
- `iter(reader)`: Returns (key, value) pairs in insertion order, reads blocks sequentially
- `sorted_keys()`: Returns keys only in sorted order, traverses the B-tree structure
- Use regular iteration for full scans, use `sorted_keys()` for sorted access or range queries

## Performance Tips

1. **Adjust block size**: Larger blocks (e.g., 256KB) improve compression but use more memory
2. **Tune compression level**: Lower levels (1-3) for speed, higher (10-22) for size
3. **Choose appropriate types**: Use native types (int, float) instead of strings when possible
4. **Batch writes**: Add all entries before finalizing to ensure optimal tree structure
5. **Keys per node**: Increase for larger datasets to reduce tree height
