Metadata-Version: 2.4
Name: deepaudio-x
Version: 0.1.4
Summary: DeepAudio-X: Self-supervised audio toolkit for audio classification and beyond.
Project-URL: Homepage, https://github.com/magcil/deepaudio-x
Project-URL: Repository, https://github.com/magcil/deepaudio-x
Author-email: Christos Nikou <chrisnick92@gmail.com>, Stefanos Vlachos <stevenvlaxos@gmail.com>, Ellie Vakalaki <evakalakiaidl@gmail.com>
License: MIT
License-File: LICENSE
Requires-Python: >=3.11
Requires-Dist: librosa>=0.11.0
Requires-Dist: numpy>=2.3.3
Requires-Dist: platformdirs>=4.5.0
Requires-Dist: soundfile>=0.13.1
Requires-Dist: torch>=2.8.0
Requires-Dist: torchaudio>=2.8.0
Requires-Dist: tqdm>=4.67.1
Description-Content-Type: text/markdown

# DeepAudioX

DeepAudioX is a PyTorch-based library that provides **simple, flexible pipelines for audio classification** using **pretrained audio foundation models** as feature extractors.

It is designed to let users train, evaluate, and run inference on **custom audio datasets** with only a few lines of code, while still allowing advanced customization when needed.

---

## Key Features

- 🔊 **Pretrained audio backbones** for feature extraction  
- 🧠 **Modular pooling strategies** (e.g. mean, attentive, learnable pooling)
- 🧩 **Custom classifier heads** for downstream audio classification
- 🚀 **High-level training, evaluation, and inference APIs**
- 🔁 Fully **PyTorch-native** and extensible
- 📦 Clean integration with existing PyTorch workflows

---

## Installation

```bash
pip install deepaudio-x
```

Or install from source:

```bash
git clone git@github.com:magcil/deepaudio-x.git
cd deepaudio-x
pip install -e .
```

---

## Quick Start

### Creating an Audio Classification Dataset

DeepAudioX provides flexible dataset creation methods for audio classification tasks. Here are the main approaches:

#### Method 1: From Directory Structure

If your audio files are organized in subdirectories where each subdirectory name is a class label:

```
data/
├── speech/
│   ├── audio1.wav
│   ├── audio2.wav
│   └── ...
├── music/
│   ├── audio3.wav
│   ├── audio4.wav
│   └── ...
└── noise/
    ├── audio5.wav
    └── ...
```

You can load the dataset as follows:

```python
from deepaudiox.datasets.audio_classification_dataset import audio_classification_dataset_from_dir
from deepaudiox.utils.training_utils import get_class_mapping_from_dir

# Define a class mapping
class_mapping = get_class_mapping_from_dir(root_dir="path/to/data")

dataset = audio_classification_dataset_from_dir(
    root_dir="path/to/data",
    sample_rate=16_000,  # sampling rate in Hz
    class_mapping=class_mapping
)
```

#### Method 2: From Custom File-to-Class Mapping

If your audio files aren't organized in subdirectories, or you need custom mappings, you can create a dictionary mapping file paths to class labels:

```python
from deepaudiox.datasets.audio_classification_dataset import audio_classification_dataset_from_dictionary
from deepaudiox.utils.training_utils import get_class_mapping

# Create a file-to-class mapping
file_to_class_mapping = {
    "path/to/audio1.wav": "speech",
    "path/to/audio2.wav": "speech",
    "path/to/audio3.wav": "music",
    # ... more mappings
}

# Create a class-to-id mapping
class_mapping = {"speech": 0, "music": 1, "noise": 2}

# Initialize the dataset
dataset = audio_classification_dataset_from_dictionary(
    file_to_class_mapping=file_to_class_mapping,
    sample_rate=16_000,
    class_mapping=class_mapping
)
```

#### Audio Segmentation

To split long audio files into fixed-duration segments, use the `segment_duration` parameter:

```python
# Create dataset with 2-second audio segments
dataset = audio_classification_dataset_from_dir(
    root_dir="path/to/data",
    sample_rate=16_000,
    segment_duration=2.0,  # Duration in seconds
    class_mapping=class_mapping
)
```

When `segment_duration` is specified, each audio file is divided into non-overlapping segments of the given duration. Each segment is treated as an independent sample in the dataset, with the same class label as the original audio file. The `segment_idx` field in the dataset output indicates which segment a sample corresponds to.

**Example**: A 10-second audio file with `segment_duration=2.0` will produce 5 separate samples, each 2 seconds long, all with the same class label.
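
The segment count follows from simple floor division. The sketch below illustrates the arithmetic only; whether a trailing partial segment is kept or dropped is an assumption here (this version drops it), so check the library's behavior if that matters for your data:

```python
# Illustrative arithmetic for non-overlapping segmentation.
# Assumption: a trailing partial segment is dropped (floor division).
def num_segments(file_duration: float, segment_duration: float) -> int:
    return int(file_duration // segment_duration)

print(num_segments(10.0, 2.0))  # 5 segments, as in the example above
print(num_segments(9.5, 2.0))   # 4 full segments; the trailing 1.5 s is dropped
```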

Both methods return an `AudioClassificationDataset` object that can be used with PyTorch's DataLoader for training and evaluation.

#### Dataset Output Format

Each item returned by the dataset is a dictionary containing:

```python
{
    "path": str,                # File path of the audio
    "y_true": int,              # Integer class ID
    "class_name": str,          # String class label
    "segment_idx": int,         # Segment index (for segmented audio)
    "feature": np.ndarray       # Audio waveform as numpy array
}
```

Example usage:

```python
from torch.utils.data import DataLoader

dataloader = DataLoader(dataset, batch_size=32)

for batch in dataloader:
    paths = batch["path"]           # File paths
    class_ids = batch["y_true"]     # Shape: (batch_size,)
    class_names = batch["class_name"]  # Class names
    segment_indices = batch["segment_idx"]  # Segment indices
    waveforms = batch["feature"]    # Shape: (batch_size, num_samples)
```

## Create an Audio Classifier with Pretrained Backbone

DeepAudioX simplifies the creation of audio classifiers by combining pretrained audio backbones with custom classifier heads. Here's how to build and configure a classifier:

### Basic Setup

```python
from deepaudiox.modules.audio_classifier_constructor import AudioClassifierConstructor

# Initialize classifier with pretrained BEATs backbone
classifier = AudioClassifierConstructor(
    num_classes=10,              # Number of output classes
    backbone="beats",            # Pretrained backbone (e.g., "beats")
    sample_rate=16_000,          # Audio sample rate
    pretrained=True,             # Use pretrained weights
    freeze_backbone=True         # Freeze backbone for fine-tuning
)
```

**Note**: When `pretrained=True`, the BEATs model will be automatically downloaded and cached in your OS-specific cache directory (e.g., `~/.cache` on Linux). The library does not contain pretrained model files (`.pt` files), keeping the repository lightweight. Subsequent uses will load the model from the cache.

### Available Backbones

- **BEATs**: Audio Pre-Training with Acoustic Tokenizers (https://arxiv.org/abs/2212.09058)

### Key Parameters

- `num_classes`: Number of output classification classes
- `sample_rate`: Audio sampling rate (Hz) - must match your dataset
- `pretrained`: Whether to use pretrained weights (recommended)
- `freeze_backbone`: Freeze backbone parameters during training (reduces parameters to fine-tune)
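
The effect of `freeze_backbone` can be verified with plain PyTorch. The sketch below uses a stand-in `nn.Sequential` rather than the real classifier, and freezing via `requires_grad = False` is the common convention such libraries use, not a confirmed detail of this one:

```python
import torch.nn as nn

# Stand-in for a backbone + classifier head (illustrative module only)
model = nn.Sequential(nn.Linear(128, 64), nn.Linear(64, 10))

# Freeze the "backbone" (first layer), as freeze_backbone=True presumably does
for p in model[0].parameters():
    p.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable} / {total}")  # only the head's parameters remain trainable
```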

### Optional: Custom Pooling Strategies

You can customize the pooling strategy used to aggregate audio features:

```python

classifier = AudioClassifierConstructor(
    num_classes=10,
    backbone="beats",
    sample_rate=16_000,
    pretrained=True,
    freeze_backbone=True,
    pooling="gap"
)
```

Available pooling strategies include:
- **GAP**: Simple average pooling
- **SimPool**: As presented in "Keep It SimPool: Who Said Supervised Transformers Suffer from Attention Deficit?" (https://arxiv.org/pdf/2309.06891)
- **EP**: As presented in "Attention, Please! Revisiting Attentive Probing Through the Lens of Efficiency" (https://arxiv.org/abs/2506.10178)

Attentive pooling methods such as EP and SimPool typically outperform Global Average Pooling (GAP).

The classifier is now ready for training or inference.

## Training

Train your audio classifier with a few lines of code using the built-in `Trainer` class:

### Minimal Example (Recommended)

```python
from deepaudiox.loops.trainer import Trainer

# Initialize trainer with defaults
trainer = Trainer(
    train_dset=train_dataset,
    model=classifier,
    validation_dset=val_dataset,  # Optional
    batch_size=32,
    epochs=100,
    num_workers=4,
    patience=20
)

# Start training
trainer.train()
```

By default, the trainer uses:
- **Optimizer**: Adam with learning rate `1e-3`
- **Scheduler**: ReduceLROnPlateau with patience `10`
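
These defaults correspond to the following standard PyTorch objects. This is a sketch of an equivalent setup with a placeholder model, not the trainer's actual source:

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import ReduceLROnPlateau

model = torch.nn.Linear(8, 2)  # placeholder model for illustration

# Equivalent of the trainer defaults described above
optimizer = Adam(model.parameters(), lr=1e-3)
scheduler = ReduceLROnPlateau(optimizer, patience=10)

# ReduceLROnPlateau is stepped with the monitored metric (e.g. validation loss)
scheduler.step(0.42)
print(optimizer.param_groups[0]["lr"])  # unchanged until the metric plateaus
```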

### Advanced: Custom Optimizer and Scheduler

For more control, you can provide custom optimizer and learning rate scheduler:

```python
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR
from deepaudiox.loops.trainer import Trainer

optimizer = Adam(classifier.parameters(), lr=1e-2)
lr_scheduler = CosineAnnealingLR(optimizer=optimizer, T_max=100, eta_min=1e-6)

trainer = Trainer(
    train_dset=train_dataset,
    model=classifier,
    validation_dset=val_dataset,
    optimizer=optimizer,
    lr_scheduler=lr_scheduler,
    batch_size=32,
    epochs=100,
    num_workers=4,
    patience=20,
    path_to_checkpoint="checkpoint.pt"
)

trainer.train()
```

### Trainer Parameters

- `train_dset`: Training dataset (AudioClassificationDataset)
- `model`: Audio classifier model to train
- `validation_dset`: Optional validation dataset for monitoring (if None, a validation split is taken from `train_dset`)
- `optimizer`: Optional custom PyTorch optimizer (default: Adam with lr=1e-3)
- `lr_scheduler`: Optional custom learning rate scheduler (default: ReduceLROnPlateau with patience=10)
- `batch_size`: Number of samples per batch (default: 16)
- `epochs`: Maximum number of training epochs (default: 100)
- `patience`: Number of epochs with no improvement before early stopping (default: 15)
- `num_workers`: Number of workers for data loading (default: 4)
- `path_to_checkpoint`: Path to save the best model checkpoint (default: "checkpoint.pt")

### Features

- **Automatic Checkpointing**: Saves the best model based on validation loss
- **Early Stopping**: Stops training when validation loss plateaus
- **Progress Tracking**: Displays training progress with loss metrics
- **Device Agnostic**: Automatically detects and uses GPU if available
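
If you prefer to control the validation split yourself rather than rely on the automatic split from `train_dset`, PyTorch's `random_split` works with any `Dataset`. The snippet below uses a placeholder `TensorDataset` standing in for an `AudioClassificationDataset`:

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Placeholder dataset: 100 one-second 16 kHz "waveforms" with 3 classes
full_dataset = TensorDataset(torch.randn(100, 16_000), torch.randint(0, 3, (100,)))

# 80/20 train/validation split with a fixed seed for reproducibility
train_dataset, val_dataset = random_split(
    full_dataset, [80, 20], generator=torch.Generator().manual_seed(42)
)
print(len(train_dataset), len(val_dataset))  # 80 20
```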

## Evaluate

Evaluate your trained classifier on a test dataset using the `Evaluator` class:

```python
import torch

from deepaudiox.loops.evaluator import Evaluator

# Initialize evaluator
evaluator = Evaluator(
    test_dset=test_dataset,
    model=classifier,
    class_mapping=class_mapping,
    batch_size=32,
    num_workers=4
)

# Load model
classifier.load_state_dict(torch.load("checkpoint.pt"))

# Run evaluation
evaluator.evaluate()

# Access evaluation results
y_true = evaluator.state.y_true       # True labels
y_pred = evaluator.state.y_pred       # Predicted labels
posteriors = evaluator.state.posteriors  # Prediction probabilities
```

### Evaluator Parameters

- `test_dset`: Test dataset (AudioClassificationDataset)
- `model`: Trained audio classifier model
- `class_mapping`: Dictionary mapping class names to IDs
- `batch_size`: Number of samples per batch (default: 16)
- `num_workers`: Number of workers for data loading (default: 4)
- `device_index`: GPU device index to use (optional, auto-detects by default)

### Evaluation Results

The evaluator stores predictions in its state:
- `y_true`: Ground truth labels as NumPy array
- `y_pred`: Predicted class IDs as NumPy array
- `posteriors`: Class probability distributions as NumPy array

You can use these results to compute metrics like accuracy, precision, recall, F1-score, etc.:

```python
from sklearn.metrics import accuracy_score, classification_report

accuracy = accuracy_score(evaluator.state.y_true, evaluator.state.y_pred)
print(f"Accuracy: {accuracy:.4f}")
print(classification_report(evaluator.state.y_true, evaluator.state.y_pred))
```

---

## Customization

Advanced users can:

- **Plug in custom backbones** - Implement your own audio feature extractors
- **Implement new pooling layers** - Create custom aggregation strategies for sequence features
- **Define custom classifier heads** - Design specialized classification architectures
- **Override training loops** - Customize the training process while keeping the pipeline structure

The library is designed to scale from quick experiments to research and production use.

---

## Project Status

🚧 This project is under active development.

APIs may evolve, but backward compatibility will be considered once a stable release is reached.

---

## Attribution

This project is developed at MagCIL and is created and primarily maintained by:

- Christos Nikou ([@ChrisNick92](https://github.com/ChrisNick92))
- Stefanos Vlachos ([@stefanos-vlachos](https://github.com/stefanos-vlachos))
- Ellie Vakalaki ([@ellievak](https://github.com/ellievak))

---

## Citation

If you use this library in academic work, please cite:

```bibtex
@software{DeepAudioX,
  author = {Nikou, Christos and Vlachos, Stefanos and Vakalaki, Ellie},
  title = {DeepAudioX: A PyTorch-based audio classification framework},
  year = {2026},
  url = {https://github.com/magcil/deepaudio-x}
}
```

---

## Contributing

Contributions are welcome!

Please open an issue to discuss major changes before submitting a pull request.
