Metadata-Version: 2.4
Name: pyrustkmer
Version: 0.5.2
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Rust
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Dist: numpy>=1.21
Requires-Dist: maturin>=1.0 ; extra == 'dev'
Requires-Dist: pytest>=7.0 ; extra == 'dev'
Requires-Dist: pytest-cov>=4.0 ; extra == 'dev'
Requires-Dist: pytest-benchmark>=4.0 ; extra == 'dev'
Requires-Dist: black>=23.0 ; extra == 'dev'
Requires-Dist: flake8>=6.0 ; extra == 'dev'
Requires-Dist: mypy>=1.0 ; extra == 'dev'
Requires-Dist: pyo3-build-config>=0.22 ; extra == 'dev'
Requires-Dist: sphinx>=5.0 ; extra == 'docs'
Requires-Dist: sphinx-rtd-theme>=1.2 ; extra == 'docs'
Requires-Dist: sphinx-autoapi>=2.0 ; extra == 'docs'
Requires-Dist: jupyter>=1.0 ; extra == 'jupyter'
Requires-Dist: ipywidgets>=8.0 ; extra == 'jupyter'
Requires-Dist: memory-profiler>=0.60 ; extra == 'performance'
Requires-Dist: psutil>=5.9 ; extra == 'performance'
Provides-Extra: dev
Provides-Extra: docs
Provides-Extra: jupyter
Provides-Extra: performance
License-File: LICENSE
Summary: High-performance PyO3 Python bindings for rustkmer k-mer library
Keywords: bioinformatics,genomics,kmer,pyo3,native-extension,high-performance
Author-email: RustKmer Team <team@rustkmer.org>
Maintainer-email: RustKmer Team <team@rustkmer.org>
Requires-Python: >=3.11
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Bug Tracker, https://github.com/rustkmer/rustkmer/issues
Project-URL: Documentation, https://rustkmer.readthedocs.io/
Project-URL: Homepage, https://github.com/rustkmer/rustkmer
Project-URL: Repository, https://github.com/rustkmer/rustkmer.git

# RustKmer PyO3 Python Bindings

High-performance Python bindings, built with PyO3, for the RustKmer k-mer counting and querying library.

## Features

- **High Performance**: Native Rust extensions with minimal Python overhead
- **Memory Efficient**: Optimized memory usage for large genomic datasets
- **Flexible Loading**: Choose between Preload, MemoryMapped, or Lazy loading modes
- **Complete API**: Full access to k-mer counting, database querying, and fuzzy matching
- **Pythonic Interface**: Clean, intuitive Python API design
- **Compatible**: Works with Python 3.11+

## Database Loading Modes

The `PyDatabase` class supports three loading modes to balance performance and memory usage:

### LoadMode.Preload
- **Description**: Loads all k-mers into an in-memory HashMap for the fastest queries
- **Performance**: Fastest query speed (sub-millisecond)
- **Memory Usage**: High (stores all k-mers in memory)
- **Best For**: Applications with frequent queries on the same database

### LoadMode.MemoryMapped
- **Description**: Uses memory-mapped file access for balanced performance
- **Performance**: Good query speed with moderate memory usage
- **Memory Usage**: Low to Moderate (OS manages caching)
- **Best For**: Large databases where memory is limited

### LoadMode.Lazy
- **Description**: Loads k-mers on-demand using binary search
- **Performance**: Slower queries but no memory overhead for unused k-mers
- **Memory Usage**: Very Low (only stores sorted index)
- **Best For**: Applications with infrequent queries or very large databases
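
To make the Lazy trade-off concrete, here is a minimal pure-Python sketch of an on-demand lookup by binary search over a sorted index. It only illustrates the idea and is not RustKmer's implementation; `SortedKmerIndex` and `lookup` are hypothetical names, and a real index would live on disk rather than in Python lists.

```python
from bisect import bisect_left


class SortedKmerIndex:
    """Hypothetical stand-in for a sorted k-mer index (not part of pyrustkmer)."""

    def __init__(self, counts):
        # Keep k-mers sorted so that each lookup is a binary search.
        items = sorted(counts.items())
        self._kmers = [kmer for kmer, _ in items]
        self._counts = [count for _, count in items]

    def lookup(self, kmer):
        """Return the stored count, or 0 if the k-mer is absent."""
        i = bisect_left(self._kmers, kmer)
        if i < len(self._kmers) and self._kmers[i] == kmer:
            return self._counts[i]
        return 0


index = SortedKmerIndex({"ACGT": 3, "GGCA": 1, "TTAA": 7})
print(index.lookup("GGCA"))  # 1
print(index.lookup("CCCC"))  # 0
```

Each lookup costs O(log n) comparisons and touches only a few entries, which is why this style of access is slower per query than a hash map but needs little resident memory.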

## Installation

```bash
# Build and install from source into the active virtualenv
maturin develop --release

# Or with pip (when published)
pip install pyrustkmer
```

## Quick Start

```python
from pyrustkmer import PyDatabase, PyCounter, LoadMode

# Create a k-mer counter
counter = PyCounter(k=21, canonical=True)

# Add sequences to count k-mers (a sequence shorter than k contains no k-mers)
counter.add_sequence("ATCGATCGATCGATCGATCGATCG")

# Get statistics
stats = counter.get_stats()
print(f"Counted {stats.unique_kmers} unique k-mers")

# Load database with different modes
# Preload: Fastest queries, highest memory usage
db_preload = PyDatabase("genome.rkdb", LoadMode.Preload)

# MemoryMapped: Balanced performance
db_mmap = PyDatabase("genome.rkdb", LoadMode.MemoryMapped)

# Lazy: Lowest memory usage, binary search
db_lazy = PyDatabase("genome.rkdb", LoadMode.Lazy)

# Query a k-mer using the unified API
# (its length should match the database's k; k=21 is assumed here)
result = db_preload.query_exact("ATCGATCGATCGATCGATCGA")
print(f"K-mer count: {result.count}")

# Check memory usage
memory_info = db_preload.get_memory_usage()
print(f"Memory usage: {memory_info}")

# Batch query (all k-mers share the database's k; k=21 is assumed here)
results = db_preload.query_exact_batch(["ATCGATCGATCGATCGATCGA", "G" * 21])
print(f"Batch results: {len(results)} queries processed")
```
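
Rather than hard-coding a mode, an application could pick one from the database size and the memory it can spare. The helper below is a sketch of one possible rule of thumb based on the "Best For" notes above; `choose_load_mode` and its factor-of-two threshold are assumptions for illustration, not part of pyrustkmer.

```python
GIB = 1024 ** 3


def choose_load_mode(db_size_bytes, available_bytes, frequent_queries=True):
    """Return the name of a load mode using a hypothetical rule of thumb."""
    # Infrequent queries: keep memory use minimal.
    if not frequent_queries:
        return "Lazy"
    # Assumed threshold: preload only if the file is at most half of free memory,
    # since an in-memory hash map may need more space than the file itself.
    if db_size_bytes * 2 <= available_bytes:
        return "Preload"
    # Otherwise let the OS page data in and out as needed.
    return "MemoryMapped"


mode_name = choose_load_mode(db_size_bytes=2 * GIB, available_bytes=16 * GIB)
print(mode_name)  # Preload

# With pyrustkmer installed, the name can be mapped onto the enum
# (assuming the variants are exposed as class attributes):
# db = PyDatabase("genome.rkdb", getattr(LoadMode, mode_name))
```

The file size can be obtained with `os.path.getsize`, and the available memory with `psutil.virtual_memory().available` if the `performance` extra is installed.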

## Unified Query API

Version 0.4.1 of the PyO3 bindings introduced a unified query interface with consistent method naming:

### Exact Query Methods
```python
# Single k-mer query
result = db.query_exact("ATGCGATGCTAGCGCTAGCTAG")

# Batch query
results = db.query_exact_batch(["AAAAA", "TTTTT", "CCCCC"])
```

### Prefix Query Methods
```python
# Optimized prefix query
results = db.query_prefix("ATGCG")

# Batch prefix query
prefixes = ["ATG", "CGA", "GCT"]
results = db.query_prefix_batch(prefixes)
```

### Hybrid Query Methods (Wildcard Support)
```python
# Pattern-based hybrid query with {N} syntax
result = db.query_hybrid("ATGCG{N3}CGAT")

# Batch hybrid query
patterns = ["ATG{1}CGAT", "CGA{2}TGC"]
results = db.query_hybrid_batch(patterns)

# Parse pattern syntax
info = db.parse_pattern("ATGCG{N3}CGAT")
# Returns: {'prefix': 'ATGCG', 'suffix': 'CGAT', 'n_count': 3, ...}
```
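
For readers who want to validate patterns before sending them to the database, the following pure-Python sketch splits a pattern into the fields shown in the comment above. It is not the library's parser: `parse_hybrid_pattern` and the `total_length` field are hypothetical, and the regular expression accepts both the `{N3}` and `{3}` spellings that appear in the examples, which is an assumption about the syntax.

```python
import re

# One run of wildcards between an optional prefix and suffix.
# Accepting both "{N3}" and "{3}" is an assumption, not documented behavior.
_HYBRID_PATTERN = re.compile(r"^([ACGT]*)\{N?(\d+)\}([ACGT]*)$")


def parse_hybrid_pattern(pattern):
    """Split a hybrid pattern into prefix, wildcard count, and suffix."""
    match = _HYBRID_PATTERN.match(pattern)
    if match is None:
        raise ValueError(f"not a hybrid pattern: {pattern!r}")
    prefix, n_count, suffix = match.groups()
    return {
        "prefix": prefix,
        "suffix": suffix,
        "n_count": int(n_count),
        # Hypothetical extra field: k-mer length implied by the pattern.
        "total_length": len(prefix) + int(n_count) + len(suffix),
    }


info = parse_hybrid_pattern("ATGCG{N3}CGAT")
print(info["prefix"], info["n_count"], info["suffix"])  # ATGCG 3 CGAT
```

Checking `total_length` against the database's k is a cheap way to reject patterns that could never match.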

### Fuzzy Query Methods
```python
from pyrustkmer import PyFuzzyQuery

fuzzy = PyFuzzyQuery(db)
result = fuzzy.query_fuzzy("ATNNN", max_mutations=2)
```

## Legacy API (Backward Compatible)

The legacy methods are still available but marked as deprecated:

```python
# Old methods (still work but show deprecation warnings)
result = db.query("ATGCGATGCTAGCGCTAGCTAG")  # Use query_exact() instead
result = db.fuzzy_query("ATNNN", max_mutations=2)  # Use PyFuzzyQuery.query_fuzzy() instead
result = db.query_prefix_optimized("ATGCG")  # Use query_prefix() instead
```
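
During migration it can help to locate every remaining legacy call. The sketch below uses the standard `warnings` module to record deprecation warnings. It assumes the legacy methods emit `DeprecationWarning` (the exact category is not stated here), and `legacy_query` is a stand-in so the example runs without a database.

```python
import warnings


def legacy_query(kmer):
    """Stand-in for a deprecated method such as db.query() (hypothetical)."""
    warnings.warn(
        "query() is deprecated; use query_exact()",
        DeprecationWarning,
        stacklevel=2,
    )
    return kmer


# Record warnings instead of printing them, then inspect what was raised.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy_query("ATGCG")

deprecated = [w for w in caught if issubclass(w.category, DeprecationWarning)]
print(f"{len(deprecated)} deprecated call(s) found")  # 1 deprecated call(s) found
```

In a test suite, running `pytest -W error::DeprecationWarning` turns such warnings into failures, under the same assumption about the warning category.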

## Formatter Methods

All query results support multiple output formats:

### PyQueryResult Formatting
```python
result = db.query_exact("ATGCGATGCTAGCGCTAGCTAG")
print(result.to_json())
print(result.to_csv())
print(result.to_tsv())
```

### PyPrefixQueryResult Formatting
```python
results = db.query_prefix("ATGCG")
print(results.to_json())
print(results.to_csv())
print(results.to_tsv())
print(results.to_table())  # ASCII table format
```

### PyFuzzyResult Formatting
```python
from pyrustkmer import PyFuzzyQuery

fuzzy = PyFuzzyQuery(db)
result = fuzzy.query_fuzzy("ATNNN", max_mutations=2)
print(result.to_json())
print(result.to_csv())
print(result.to_tsv())
```

### PyDatabaseStats Formatting
```python
stats = db.get_stats()
print(stats.to_json())
print(stats.to_csv())
print(stats.to_tsv())
```

## API Reference

### PyCounter
High-performance k-mer counter for counting k-mers in DNA sequences.

**Available Methods:**
- `add_kmer(kmer)` - Add single k-mer
- `add_sequence(sequence)` - Add from sequence string
- `add_from_fasta(path)` - Read FASTA file
- `add_from_fastq(path)` - Read FASTQ file
- `get_count(kmer)` - Query specific k-mer count
- `get_all_counts()` - Get all counts as dict
- `reset()` - Clear counter
- `save_database(path)` - Save to RKDB format
- `get_stats()` - Get counter statistics
- `is_empty()` - Check if empty
- `kmer_length` (property) - Get k-mer size
- `canonical` (property) - Get canonical mode flag
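
When cross-checking counts on small inputs, a slow pure-Python reference can be handy. The sketch below assumes the common convention that the canonical form of a k-mer is the lexicographically smaller of the k-mer and its reverse complement; whether `PyCounter` uses exactly this rule is an assumption, and `canonical_kmer` and `count_kmers` are hypothetical helper names.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmer(kmer):
    """Return the smaller of a k-mer and its reverse complement."""
    revcomp = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, revcomp)


def count_kmers(sequence, k, use_canonical=True):
    """Count every length-k window of an uppercase ACGT sequence."""
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        counts[canonical_kmer(kmer) if use_canonical else kmer] += 1
    return counts


counts = count_kmers("ACGTT", k=3)
print(dict(counts))  # {'ACG': 2, 'AAC': 1}
```

Here `ACG` and `CGT` are reverse complements of each other, so canonical counting merges them into one entry. A sequence shorter than k yields an empty result.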

### PyDatabase
Efficient database querying for k-mer count lookups.

**Query Methods:**
- `query_exact(kmer)` - Single exact k-mer query
- `query_exact_batch(kmers)` - Batch exact queries
- `query_prefix(prefix)` - Prefix-based queries
- `query_prefix_batch(prefixes)` - Batch prefix queries
- `query_hybrid(pattern)` - Pattern queries with {N} wildcards
- `query_hybrid_batch(patterns)` - Batch pattern queries
- `parse_pattern(pattern)` - Parse hybrid pattern syntax

**Fuzzy Query:**
Use `PyFuzzyQuery` class for fuzzy matching with mutations.

**Utility Methods:**
- `get_stats()` - Get database statistics
- `get_memory_usage()` - Get memory usage info
- `database_info()` - Get database metadata
- `exists(kmer)` - Check if k-mer exists
- `export_all_kmers()` - Export all k-mers
- `dump(limit, offset)` - Paginated database dump
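
The paginated `dump(limit, offset)` method lends itself to a generator that walks the whole database without holding it in memory. The sketch below assumes that `dump` returns a list-like page and that an empty page marks the end; `iter_dump` and `FakeDatabase` are hypothetical names, the latter standing in for `PyDatabase` so the example runs on its own.

```python
def iter_dump(db, page_size=1000):
    """Yield records page by page until dump() returns an empty page."""
    offset = 0
    while True:
        page = db.dump(page_size, offset)
        if not page:
            break
        yield from page
        offset += len(page)


class FakeDatabase:
    """Stand-in exposing dump(limit, offset) for demonstration only."""

    def __init__(self, records):
        self._records = list(records)

    def dump(self, limit, offset):
        return self._records[offset:offset + limit]


fake_db = FakeDatabase([("AAA", 1), ("AAC", 4), ("ACG", 2)])
records = list(iter_dump(fake_db, page_size=2))
print(records)  # [('AAA', 1), ('AAC', 4), ('ACG', 2)]
```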

### PyFuzzyQuery
Advanced fuzzy matching with wildcard and mutation support.

**Methods:**
- `query_fuzzy(pattern, max_mutations)` - Fuzzy k-mer query

### PyFormatter
K-mer result formatting utilities.

**Methods:**
- `format_kmer(kmer)` - Format k-mer string
- `format_count(count)` - Format count result
- `canonical` (property) - Get or set canonical k-mer mode

## Test Suite

### Test Coverage

#### PyCounter Tests (77 tests)
- Basic creation, k-mer sizes, invalid inputs
- Add k-mer, add sequence, get count
- Reset, is_empty, get_stats
- Canonical mode, FASTA/FASTQ files
- Save database, edge cases, memory usage

#### Formatter Tests (52 tests)
- PyQueryResult formatting (to_json, to_csv, to_tsv, to_dict)
- PyPrefixQueryResult formatting (to_json, to_csv, to_tsv, to_table)
- PyFuzzyResult formatting (to_json, to_csv, to_tsv)
- PyDatabaseStats formatting (to_json, to_csv, to_tsv)
- Format consistency, edge cases, integration tests

#### API Tests
- Module import and class detection
- Method signature verification
- Legacy vs new API compatibility

### Running Tests

```bash
# Run all tests with coverage
pytest pyo3/tests/ -v --cov=pyrustkmer

# Run specific test file
pytest pyo3/tests/test_counter.py -v

# Run with terminal and HTML coverage reports
pytest pyo3/tests/ --cov=pyrustkmer --cov-report=term-missing --cov-report=html
```

## Build Status

### Current Version: 0.4.1

**Build Status:** ✅ Successful
- All PyO3 compilation errors resolved
- 105 tests passing with 100% coverage
- New unified API methods properly exported

### Known Warnings (Non-blocking)
The following warnings don't affect functionality:
- Unused type alias: `RustKmerResult`
- Unused structs: `QueryResultSerializable`, `PrefixQueryResultSerializable`, etc.
- Unused functions: `validate_kmer`, `py_string_to_string`, `string_vec_to_py_list`

## Requirements

- Python 3.11+
- Rust toolchain (1.80+)
- maturin build tool
- numpy>=1.21

## Version History

### v0.4.1 (Current)
- ✅ Fixed Python module export issues
- ✅ Added unified query API methods (query_exact, query_prefix, etc.)
- ✅ Improved PyFuzzyQuery integration
- ✅ Enhanced formatter output options
- ✅ 105 tests passing, 100% coverage

### v0.4.0
- Initial PyO3 binding release
- Basic k-mer counting and querying
- Multiple load modes (Preload, MemoryMapped, Lazy)
- Fuzzy query support

## Migration Guide

### Upgrading from v0.4.0 to v0.4.1

**Old API (Still works but deprecated):**
```python
result = db.query("ATGCGATGCTAGCGCTAGCTAG")
result = db.fuzzy_query("ATNNN", max_mutations=2)
result = db.query_prefix_optimized("ATGCG")
```

**New API (Recommended):**
```python
result = db.query_exact("ATGCGATGCTAGCGCTAGCTAG")
result = PyFuzzyQuery(db).query_fuzzy("ATNNN", max_mutations=2)
result = db.query_prefix("ATGCG")
```

### Performance Benefits

The unified API provides:
- **About 66% lower memory use**: One `PyDatabase` instance instead of three
- **Up to 3x faster loading**: The database is loaded once, with no duplicate loading
- **Simplified API**: Single entry point for all query types
- **Better batch processing**: Reduced Python/Rust boundary crossing
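
The batch-processing point can be illustrated with a small helper that feeds a long stream of k-mers to `query_exact_batch` in fixed-size chunks, so each Python-to-Rust call carries many queries. This is a sketch: `batched`, `query_in_batches`, and `CountingDatabase` are hypothetical, the batch size is arbitrary, and it assumes `query_exact_batch` returns one result per input in order.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items from any iterable."""
    iterator = iter(iterable)
    while chunk := list(islice(iterator, size)):
        yield chunk


def query_in_batches(db, kmers, batch_size=10_000):
    """Call query_exact_batch once per chunk and flatten the results."""
    results = []
    for chunk in batched(kmers, batch_size):
        results.extend(db.query_exact_batch(chunk))
    return results


class CountingDatabase:
    """Stand-in that records how many batch calls were made."""

    def __init__(self):
        self.calls = 0

    def query_exact_batch(self, kmers):
        self.calls += 1
        return [len(kmer) for kmer in kmers]  # dummy result per k-mer


counting_db = CountingDatabase()
out = query_in_batches(counting_db, ["ACG", "TTGA", "C", "GG", "ATATA"], batch_size=2)
print(out, counting_db.calls)  # [3, 4, 1, 2, 5] 3
```

Five queries cross the boundary in three calls instead of five; with realistic batch sizes the reduction is far larger.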

## License

MIT License

## See Also

- [RustKmer Main README](../README.md) - Core library documentation
- [User Guide](../USER_GUIDE.md) - Comprehensive usage guide
- [Installation Guide](../INSTALL.md) - Detailed installation procedures

