Metadata-Version: 2.4
Name: kwslib
Version: 0.0.6
Summary: Python client library for KWS Platform API - dataset splits, feature npz download, model/artifacts/metrics push
Author: Ngoc An Lam
License-Expression: MIT
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: requests>=2.28.0
Requires-Dist: minio>=7.1.0
Requires-Dist: numpy>=1.21.0
Requires-Dist: pandas>=1.3.0
Requires-Dist: scikit-learn>=1.0.0
Requires-Dist: matplotlib>=3.5.0
Requires-Dist: seaborn>=0.12.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=22.0.0; extra == "dev"
Requires-Dist: flake8>=5.0.0; extra == "dev"

# KWS Library (kwslib)

Python client library for the **KWS_Server** backend API; works on Google Colab, Jupyter Notebook, or plain Python scripts.

> **Single Source of Truth (Operations):** All operational docs, processes, and templates for the project live in `Workspace/` (entry point: `Workspace/README.md`). This repo keeps only the component's **technical documentation**.

## Three main tasks (KWS_Lib ↔ KWS_Server)

KWS_Lib is the client library for working with data and metadata through **KWS_Server** (DB + MinIO); it works well on **Google Colab**, **Jupyter Notebook**, or in plain scripts.

| Task | Description | API / high-level |
|------|-------------|-------------------|
| **1. Create dataset_split (train, val, test)** | Fetch the list of feature files (MFCC/npz) → split into train/val/test (pandas, sklearn) → push to the DB. | `dataset_splits.get_mfcc_files()` → split → `dataset_splits.create_split_from_list()`; or `DatasetPipeline.get_data()` → `split_and_push()`. |
| **2. Download npz (extracted features from MinIO)** | Download the .npz files of a created split (the server reads from MinIO and streams them through the API). | `dataset_splits.download(split_id, output_path)` (ZIP of npz); `DatasetSplitFilesClient.download_all_npz(split_id, output_dir)`; `list_files(split_id, file_type="npz")` + `download_file(...)`. |
| **3. Push model information (config, artifacts, metrics)** | Register a run, upload artifacts (model files to MinIO + DB), POST metrics. | `ModelManager.register_run()`, `push_artifact()`, `push_metrics()`; or `experiments.create_run()`, `artifacts.upload()`, `metrics.create(payload=...)`. |

**Required metrics when POSTing**: **Accuracy**, **Precision**, **Recall**, **F1-Score**, and **Confusion Matrix** are mandatory (build the payload with `build_metrics_payload` or `metrics_from_sklearn`; see the Examples section below).

## Features

- **API coverage**: wraps the full backend API (datasets, dataset_splits, models, experiments, metrics, mlflow)
- **Data splitting**: `get_mfcc_files`, `create_split_from_list`, the `create_dataset_split.py` script + pandas
- **Standard metrics**: `build_metrics_payload`, `metrics_from_sklearn` → well-formed payloads for POSTing to `/api/v1/metrics`
- **MinIO via the API**: download .wav / .npz through the API (streamed); no direct MinIO connection needed
- **Telegram**: optional notifications when a job finishes

## Installation (PyPI)

```bash
pip install kwslib
```

From source:

```bash
git clone <repository>
cd KWS_Lib
pip install -e .
```

## Quick Start

### Smoke check (no login)

Verify KWS_Server is reachable:

```bash
python -m kwslib.smoke
```

Override base URL:

```bash
KWS_SERVER_URL=http://127.0.0.1:8000 python -m kwslib.smoke
```

### Basic Usage

```python
from kwslib import KWSClient

# Initialize client
client = KWSClient(base_url="http://localhost:8000")

# Login
client.login(username="admin", password="password")

# List datasets
datasets = client.datasets.list()
print(f"Found {datasets['total']} datasets")

# Get dataset details
dataset = client.datasets.get(dataset_id=1)
print(f"Dataset: {dataset['name']}")
```

### Download Dataset Split Files for Training

```python
from kwslib import KWSClient, DatasetSplitFilesClient

# Initialize API client
api = KWSClient(base_url="http://localhost:8000")
api.login(username="admin", password="password")

# Initialize files client (uses API, no direct MinIO connection)
files_client = DatasetSplitFilesClient(api)

# List all files in split
files_info = files_client.list_files(split_id=1, file_type="npz")
print(f"Found {files_info['total_files']} files")

# Download all .npz files
files_client.download_all_npz(
    split_id=1,
    output_dir="features"
)

# Download all .wav files
files_client.download_all_wav(
    split_id=1,
    output_dir="audio"
)

# Or download as ZIP
files_client.download_all_files_zip(
    split_id=1,
    file_type="npz",
    output_path="features.zip"
)

# Get file metadata (for Google Colab loop/download strategy)
urls = files_client.get_file_urls(split_id=1, file_type="npz")
for file_info in urls["files"]:
    print(file_info["file_name"])
```
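
Once downloaded, the `.npz` feature files can be loaded with NumPy. A minimal sketch, assuming a downloaded file at the placeholder path `features/sample.npz` (the array keys inside each archive depend on your feature-extraction pipeline):

```python
import numpy as np

# Inspect one downloaded feature archive ("features/sample.npz" is a placeholder)
data = np.load("features/sample.npz")
print(data.files)  # names of the arrays stored in the archive
for key in data.files:
    print(key, data[key].shape)
```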

### With Telegram Notifications

```python
from kwslib import KWSClient, TelegramNotifier

# Initialize
client = KWSClient(base_url="http://localhost:8000")
client.login(username="admin", password="password")

notifier = TelegramNotifier(
    bot_token="YOUR_BOT_TOKEN",
    chat_id="YOUR_CHAT_ID"
)

# Create experiment run (triggers background job)
run = client.experiments.create_run(
    experiment_id=1,
    model_id=1,
    dataset_split_id=1,
    config={"learning_rate": 0.001, "batch_size": 32},
    git_commit="manual-run",
)

# Wait for completion
job_id = run.get("job_id")
status = client.jobs.wait_for_completion(job_id)

# Send notification
if status["status"] == "completed":
    notifier.send(f"Training completed! Results: {status['result']}")
else:
    notifier.send(f"Training failed: {status.get('error')}")
```

## API Modules

### Authentication
- `client.auth.login()` - Login
- `client.auth.logout()` - Logout
- `client.auth.get_me()` - Get current user info

### Datasets
- `client.datasets.list()` - List datasets
- `client.datasets.get()` - Get dataset
- `client.datasets.create()` - Create dataset
- `client.datasets.update()` - Update dataset
- `client.datasets.delete()` - Delete dataset
- `client.datasets.list_versions()` - List versions
- `client.datasets.create_version()` - Create version

### Models
- `client.models.list()` - List models
- `client.models.get()` - Get model
- `client.models.create()` - Create model
- `client.models.list_model_inits()` - List model architectures

### Experiments
- `client.experiments.list()` - List experiments
- `client.experiments.create()` - Create experiment
- `client.experiments.create_run(experiment_id, model_id, dataset_split_id, config, git_commit)` - Create a run (background job)
- `client.experiments.list_runs(experiment_id)` - List the runs of an experiment
- `client.experiments.list_runs_global(experiment_id=..., model_id=...)` - List all runs (with optional filters; examples below)
- `client.experiments.get_run(experiment_id, run_id)` - Get run details
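
For example, browsing runs with the documented filters (a sketch; the shape of the returned run records is not specified here):

```python
# All runs for model 1, across experiments
runs = client.experiments.list_runs_global(model_id=1)

# Runs belonging to a single experiment
runs = client.experiments.list_runs(experiment_id=1)
```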

### Dataset Splits
- `client.dataset_splits.list(dataset_version_id=..., config_name=..., name=...)` - List splits
- `client.dataset_splits.get(split_id)` - Get a split by ID
- `client.dataset_splits.create(dataset_version_id, name, config_name)` - Create a split record (metadata only)
- `client.dataset_splits.create_split_from_list(...)` - Create a split from a file list (after splitting with pandas)
- `client.dataset_splits.get_mfcc_files(dataset_version_id, ...)` - Get the MFCC file list (as a DataFrame) for splitting (see the sketch below)
- `client.dataset_splits.download(split_id, output_path)` - Download a split as a ZIP (npz)
- `client.dataset_splits.generate(split_id)` - Trigger the split-generation job
- `client.dataset_splits.list_files(split_id, file_type)` - List the .wav / .npz files in a split
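
The two most common calls, using the signatures documented above (the exact DataFrame columns returned by `get_mfcc_files` are not specified here; the splitting example below uses one of them, `derivative_label`):

```python
# Fetch the MFCC file list of a dataset version as a pandas DataFrame
df = client.dataset_splits.get_mfcc_files(dataset_version_id=48)

# Download an existing split as a ZIP of .npz files
client.dataset_splits.download(split_id=123, output_path="split_123.zip")
```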

### Metrics (standard: Accuracy, Precision, Recall, F1-Score, Confusion Matrix)
- `client.metrics.list(model_id=..., dataset_split_id=...)` - List metrics
- `client.metrics.get(metric_id)` - Get metric details
- `client.metrics.create(payload=payload)` - POST a metric (use a payload from `build_metrics_payload` or `metrics_from_sklearn`)
- `client.metrics.compare_metrics(model_ids=[1,2,3], split_id=...)` - Compare metrics across models

### Audio
- `client.audio.list_keyword_samples()` - List keyword audio
- `client.audio.upload_keyword_sample()` - Upload audio
- `client.audio.get_keyword_sample_url()` - Get download URL from API

### Features
- `client.features.get_keyword_features()` - Get features
- `client.features.extract_keyword_features()` - Extract features

### Jobs
- `client.jobs.get()` - Get job status
- `client.jobs.list()` - List jobs
- `client.jobs.wait_for_completion()` - Wait for job completion

### Dataset Split Files Client
- `files_client.list_files()` - List all files in split
- `files_client.download_wav()` - Download a .wav file
- `files_client.download_npz()` - Download and load a .npz file
- `files_client.download_all_wav()` - Download all .wav files
- `files_client.download_all_npz()` - Download all .npz files
- `files_client.download_all_files_zip()` - Download all files as ZIP
- `files_client.get_file_urls()` - Get file metadata for all files

### Telegram Notifier
- `notifier.send()` - Send message
- `notifier.send_file()` - Send file
- `notifier.send_photo()` - Send photo
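
A quick sketch of all three calls (only `send` appears in the examples above; the parameter names for `send_file` and `send_photo` are assumptions, not confirmed by this README):

```python
notifier.send("Training finished")
notifier.send_file("features.zip")           # assumed: a local file path
notifier.send_photo("confusion_matrix.png")  # assumed: a local image path
```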

## Examples

### Data splitting (pandas + create_split_from_list)

```python
from kwslib import KWSClient
from create_dataset_split import get_split_data, push_splits
from sklearn.model_selection import train_test_split

api = KWSClient(base_url="http://localhost:8000")
api.login(username="admin", password="password")

# 1. Fetch the list of MFCC files
df = get_split_data(api=api, dataset_version_id=48, feature_type_id=2)

# 2. Split train/test (stratified by label)
train_df, test_df = train_test_split(
    df, train_size=0.8, test_size=0.2, random_state=42, stratify=df["derivative_label"]
)

# 3. Push to the DB (create the splits + assign files)
created = push_splits(
    api=api,
    dataset_version_id=48,
    config_name="config_80_20",
    splits={"train": train_df, "test": test_df},
)
# created = {"train": 123, "test": 124}
```
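
Continuing the snippet above: for a three-way split matching the `config_70_15_15` name used in the workflow summary below, a common pattern is two successive `train_test_split` calls (a sketch; only the splitting step changes, the push step is the same):

```python
# 70% train, then split the remaining 30% evenly into val/test (15%/15%)
train_df, rest_df = train_test_split(
    df, train_size=0.7, random_state=42, stratify=df["derivative_label"]
)
val_df, test_df = train_test_split(
    rest_df, test_size=0.5, random_state=42, stratify=rest_df["derivative_label"]
)

created = push_splits(
    api=api,
    dataset_version_id=48,
    config_name="config_70_15_15",
    splits={"train": train_df, "val": val_df, "test": test_df},
)
```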

### Standard metrics (Accuracy, Precision, Recall, F1-Score, Confusion Matrix)

```python
from kwslib import KWSClient, build_metrics_payload, metrics_from_sklearn
import numpy as np

api = KWSClient(base_url="http://localhost:8000")
api.login(username="admin", password="password")

# Option 1: from y_true, y_pred (sklearn)
y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])
payload = metrics_from_sklearn(
    y_true, y_pred,
    model_id=1, dataset_split_id=1, experiment_run_id=1,
    average="weighted",
)
api.metrics.create(payload=payload)

# Option 2: from a dict of precomputed metrics
metrics = {
    "accuracy": 0.92,
    "precision": 0.91,
    "recall": 0.90,
    "f1_score": 0.905,
    "confusion_matrix": [[50, 2], [3, 45]],  # 2D list int
}
payload = build_metrics_payload(
    model_id=1, dataset_split_id=1, experiment_run_id=1,
    metrics=metrics,
)
api.metrics.create(payload=payload)

# Compare several models on one split
comparison = api.metrics.compare_metrics(model_ids=[1, 2, 3], split_id=1)
```

### Complete split/download/upload workflow (summary)

```python
# Continues from the examples above (api is a logged-in KWSClient,
# metrics_from_sklearn already imported).

# 1. Create a split (metadata), or use create_split_from_list after splitting with pandas
split = api.dataset_splits.create(dataset_version_id=1, name="train", config_name="config_70_15_15")
split_id = split["id"]  # assumption: the create response carries the new split's ID
# Or: push_splits(api, dataset_version_id, config_name, splits={"train": train_df, "val": val_df, "test": test_df})

# 2. Generate the split (job) if needed
# job = api.dataset_splits.generate(split_id)

# 3. Create an experiment run (background job)
run = api.experiments.create_run(
    experiment_id=1,
    model_id=1,
    dataset_split_id=split_id,
    config={"learning_rate": 0.001, "batch_size": 32},
    git_commit="manual-run",
)

# 4. After training, POST the metrics (standard: accuracy, precision, recall, f1_score, confusion_matrix)
# run_id = the experiment run's ID (fetch via list_runs after the create_run job completes; see the sketch below)
payload = metrics_from_sklearn(y_true, y_pred, model_id=1, dataset_split_id=split_id, experiment_run_id=run_id)
api.metrics.create(payload=payload)
```
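
Resolving `run_id` once the `create_run` job finishes might look like the sketch below; the `"id"` key and the list ordering are assumptions, so check your server's response schema:

```python
# Wait for the background job, then look up the newly created run
api.jobs.wait_for_completion(run["job_id"])
runs = api.experiments.list_runs(experiment_id=1)
run_id = runs[-1]["id"]  # assumption: last run in the list is the newest
```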

### Google Colab Usage

```python
# In Google Colab, iterate file metadata then call API download endpoints
from kwslib import KWSClient, DatasetSplitFilesClient

api = KWSClient(base_url="https://your-api.com")
api.login(username="admin", password="password")

files_client = DatasetSplitFilesClient(api)

# Get file metadata
urls = files_client.get_file_urls(split_id=1, file_type="npz")

# Download in Colab
import urllib.request
for file_info in urls["files"]:
    urllib.request.urlretrieve(
        file_info["url"],
        f"/content/{file_info['file_name']}"
    )
```

## Configuration

### Environment Variables

You can set default values using environment variables:

```bash
export KWS_BASE_URL="http://localhost:8000"
export KWS_USERNAME="admin"
export KWS_PASSWORD="password"
export MINIO_ENDPOINT="localhost:9000"
export MINIO_ACCESS_KEY="minioadmin"
export MINIO_SECRET_KEY="minioadmin"
```
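
Whether `KWSClient` picks these up automatically is not documented here; a minimal sketch that wires the `KWS_*` variables in manually:

```python
import os

from kwslib import KWSClient

# Fall back to localhost when KWS_BASE_URL is unset
client = KWSClient(base_url=os.environ.get("KWS_BASE_URL", "http://localhost:8000"))
client.login(
    username=os.environ["KWS_USERNAME"],
    password=os.environ["KWS_PASSWORD"],
)
```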

## Publishing to PyPI

```bash
pip install build twine
python -m build
twine upload dist/*
```

Make sure to bump `version` in `pyproject.toml` before building.

## License

MIT License

## Contributing

Contributions are welcome! Please open an issue or submit a pull request.
