Metadata-Version: 2.4
Name: dsrp-ml-utils
Version: 0.1.0
Summary: ML utilities for DSRP Machine Learning Engineering course - movie recommendation pipeline
Project-URL: Homepage, https://github.com/dsrp-org/dsrp-machine-learning-engineering-4
Project-URL: Repository, https://github.com/dsrp-org/dsrp-machine-learning-engineering-4
Project-URL: Documentation, https://github.com/dsrp-org/dsrp-machine-learning-engineering-4#readme
Author-email: DSRP <info@dsrp.dev>
License-Expression: MIT
Keywords: dsrp,machine-learning,mlops,recommendation-system
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.11
Requires-Dist: mlflow>=2.0.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: polars>=0.20.0
Requires-Dist: sentence-transformers>=2.2.0
Provides-Extra: all
Requires-Dist: azure-identity>=1.0.0; extra == 'all'
Requires-Dist: azure-storage-blob>=12.0.0; extra == 'all'
Requires-Dist: pytest-cov>=4.0.0; extra == 'all'
Requires-Dist: pytest>=7.0.0; extra == 'all'
Requires-Dist: ruff>=0.1.0; extra == 'all'
Provides-Extra: azure
Requires-Dist: azure-identity>=1.0.0; extra == 'azure'
Requires-Dist: azure-storage-blob>=12.0.0; extra == 'azure'
Provides-Extra: dev
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Description-Content-Type: text/markdown

# DSRP ML Utils

Utility library for ML pipelines in the DSRP Machine Learning Engineering course.

## Installation

```bash
pip install dsrp-ml-utils
```

With Azure storage support:
```bash
pip install "dsrp-ml-utils[azure]"
```

## Quick Start

```python
from dsrp_ml_utils import (
    load_imdb_database,
    add_derived_features,
    extract_top_genres,
    normalize_embeddings,
)

# Load and prepare data
movies = load_imdb_database("data/movies_base.parquet", "data/omdb_raw.jsonl")
movies = add_derived_features(movies)

# Extract metadata
genres = extract_top_genres(movies, top_n=10)
```

## Features

### Data Loading
- `load_imdb_database()` - Load IMDb data and merge in OMDb enrichment
- `add_derived_features()` - Add computed features (log votes, normalized year, etc.)
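
To make "log votes" and "normalized year" concrete, here is a minimal NumPy sketch of the two transforms. It is not the code behind `add_derived_features()`: the sample values, the variable names, and the choice of `log1p` and min-max scaling are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical raw values; the real column names and data may differ.
votes = np.array([0, 9, 99, 999_999])
years = np.array([1950, 1985, 2000, 2020])

# log1p compresses heavy-tailed vote counts and is defined at zero votes.
log_votes = np.log1p(votes)

# Min-max scaling maps the release year onto the [0, 1] range.
year_norm = (years - years.min()) / (years.max() - years.min())
```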

### Metadata Extraction
- `extract_top_genres()` - Get most frequent genres
- `extract_decades()` - Get decades present in dataset
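
The idea behind these two helpers can be sketched in plain Python. This is not the library's implementation: the `top_genres` and `decades` names, the row layout, and the assumption that genres arrive as comma-separated strings are all hypothetical.

```python
from collections import Counter

# Hypothetical rows; assumes genres are stored as comma-separated strings.
rows = [
    {"genres": "Drama,Romance", "year": 1994},
    {"genres": "Drama,Crime", "year": 1972},
    {"genres": "Comedy", "year": 1999},
    {"genres": "Drama,Comedy", "year": 2008},
]


def top_genres(rows, top_n):
    """Return the top_n most frequent genre names, most frequent first."""
    counts = Counter(
        genre.strip()
        for row in rows
        for genre in row["genres"].split(",")
    )
    return [genre for genre, _ in counts.most_common(top_n)]


def decades(rows):
    """Return the sorted, de-duplicated decades (e.g. 1990) in the data."""
    return sorted({(row["year"] // 10) * 10 for row in rows})
```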

### Query Generation
- `generate_template_queries()` - Generate synthetic queries for LTR training
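
Template-based query generation can be pictured as filling a few string patterns with every combination of metadata values. The sketch below shows that idea only. The templates, the `template_queries` name, and its signature are hypothetical and may not match `generate_template_queries()`.

```python
from itertools import product

# Hypothetical templates; the library's real templates are not shown here.
TEMPLATES = [
    "{genre} movies from the {decade}s",
    "best {genre} films of the {decade}s",
]


def template_queries(genres, decades):
    """Fill every template with every (genre, decade) combination."""
    return [
        template.format(genre=genre.lower(), decade=decade)
        for template, genre, decade in product(TEMPLATES, genres, decades)
    ]


queries = template_queries(["Drama", "Comedy"], [1990, 2000])
```

Feeding in the extracted genres and decades would keep the synthetic queries aligned with what the dataset actually contains.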

### Candidate Retrieval
- `normalize_embeddings()` - L2 normalize embeddings for cosine similarity
- `get_candidates_for_query()` - Retrieve top-K candidate movies
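
Normalization matters because the dot product of two unit-length vectors equals their cosine similarity, which turns top-K retrieval into a matrix product followed by a sort. The following is a rough NumPy sketch of that technique, not the library's code: `l2_normalize`, `top_k`, and the `eps` guard are illustrative names and choices.

```python
import numpy as np


def l2_normalize(embeddings, eps=1e-12):
    """Scale each row to unit L2 norm; eps guards against all-zero rows."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.maximum(norms, eps)


def top_k(query_vec, movie_matrix, k):
    """Return indices of the k rows most similar to query_vec, best first."""
    # On unit vectors, the dot product equals cosine similarity.
    scores = movie_matrix @ query_vec
    return np.argsort(-scores)[:k]


movies = l2_normalize(np.array([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]))
query = l2_normalize(np.array([[1.0, 0.1]]))[0]
best = top_k(query, movies, k=2)
```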

### Relevance Scoring
- `compute_relevance_score()` - Calculate relevance scores with adjustable emphasis
- `assign_relevance_labels()` - Convert continuous scores to discrete labels
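
One plausible reading of "adjustable emphasis" is a weight that blends two signals, with the blended score then bucketed into integer grades for LTR training. The sketch below assumes exactly that. The function names, the `emphasis` parameter, the input signals, and the thresholds are hypothetical and are not taken from the library.

```python
import numpy as np


def blended_score(similarity, quality, emphasis=0.5):
    """Blend two [0, 1] signals; emphasis is the weight given to quality."""
    return (1.0 - emphasis) * similarity + emphasis * quality


def to_labels(scores, thresholds=(0.25, 0.5, 0.75)):
    """Bucket continuous scores into integer grades 0..len(thresholds)."""
    return np.digitize(scores, thresholds)


similarity = np.array([0.9, 0.6, 0.1])
quality = np.array([0.8, 0.2, 0.1])
labels = to_labels(blended_score(similarity, quality, emphasis=0.5))
```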

### MLflow Integration
- `search_best_model()` - Search for best run by metric
- `get_artifact_uri_production()` - Get production model artifact URI

### Azure Storage (optional)
- `upload_to_blob()` / `download_from_blob()` - File operations
- `sync_to_azure()` / `sync_from_azure()` - Batch sync operations
- `blob_exists()` / `list_blobs()` - Storage queries

## License

MIT
