Metadata-Version: 2.1
Name: auditus
Version: 0.0.4
Summary: Simple Audio Embeddings
Home-page: https://github.com/CarloLepelaars/auditus
Author: Carlo Lepelaars
Author-email: info@carlolepelaars.nl
License: Apache Software License 2.0
Keywords: audio embeddings
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: License :: OSI Approved :: Apache Software License
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: fastcore
Requires-Dist: fasttransform
Requires-Dist: torch
Requires-Dist: transformers
Requires-Dist: tensorflow
Requires-Dist: numpy
Requires-Dist: soundfile
Requires-Dist: torch<2.7,>=1.10
Provides-Extra: dev

# auditus


<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

`auditus` gives you simple access to state-of-the-art audio embeddings,
like [SentenceTransformers](https://sbert.net/) for audio.

``` sh
$ pip install auditus
```

## Quickstart

The high-level object in `auditus` is the
[`AudioPipeline`](https://CarloLepelaars.github.io/auditus/transform.html#audiopipeline)
which takes in a path and returns a pooled embedding.

``` python
from auditus.transform import AudioPipeline

pipe = AudioPipeline(
    # Default AST model
    model_name="MIT/ast-finetuned-audioset-10-10-0.4593", 
    # PyTorch output
    return_tensors="pt", 
    # Resample to 16 kHz
    target_sr=16000, 
    # The number of mel-frequency bins equals the output length for this model.
    num_mel_bins=64,
    # A max_length of 1024 covers at most ~25.6 seconds with the default hop length.
    # Longer files are truncated.
    max_length=1024,
    # Mean pooling to obtain single embedding vector
    pooling="mean",
)

output = pipe("../test_files/XC119042.ogg").squeeze(0)
print(output.shape)
output[:5]
```

    torch.Size([64])

    tensor([-0.0943, -0.1549, -0.2868, -0.3495, -0.4023])

To see
[`AudioPipeline`](https://CarloLepelaars.github.io/auditus/transform.html#audiopipeline)
in action on a practical use case, check out [this Kaggle Notebook for
the BirdCLEF+ 2025
competition](https://www.kaggle.com/code/carlolepelaars/generating-audio-embeddings-with-auditus).
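
As with sentence embeddings, a common downstream use of pooled embeddings is to compare two clips by cosine similarity. The sketch below is an illustration only and not part of `auditus`: `cosine_similarity` is a hypothetical helper, `emb_a` reuses the first five values printed above, and `emb_b` is made-up data.

``` python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# First five values of the pooled embedding printed above.
emb_a = [-0.0943, -0.1549, -0.2868, -0.3495, -0.4023]
# Made-up second embedding: a scaled copy, so it points in the same direction.
emb_b = [2 * x for x in emb_a]

sim = cosine_similarity(emb_a, emb_b)
print(round(sim, 6))
```

Because cosine similarity ignores vector length, a scaled copy of an embedding scores (up to rounding) 1.0 against the original.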

## Individual steps

`auditus` offers a range of transforms to process audio for downstream
tasks.

### Loading

Simply load audio with a given sampling rate.

``` python
from auditus.transform import AudioLoader

audio = AudioLoader(sr=32000)("../test_files/XC119042.ogg")
audio
```

    auditus.core.AudioArray(a=array([-2.64216160e-05, -2.54259703e-05,  5.56615578e-06, ...,
           -2.03555092e-01, -2.03390077e-01, -2.45199591e-01]), sr=32000)

The
[`AudioArray`](https://CarloLepelaars.github.io/auditus/core.html#audioarray)
object offers a convenient interface to inspect the audio data. For
example, you can listen to the audio in a Jupyter Notebook with
`audio.audio()`.

``` python
audio.a[:5], audio.sr, len(audio)
```

    (array([-2.64216160e-05, -2.54259703e-05,  5.56615578e-06, -5.17481631e-08,
            -1.35020821e-06]),
     32000,
     632790)
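
Since `len(audio)` is the number of samples and `audio.sr` is the number of samples per second, the clip duration follows directly. A small sketch using the values printed above; `duration_seconds` is a hypothetical helper, not part of `auditus`.

``` python
def duration_seconds(n_samples: int, sr: int) -> float:
    """Clip duration in seconds: sample count divided by samples per second."""
    if sr <= 0:
        raise ValueError("sampling rate must be positive")
    return n_samples / sr


# Values copied from the output above: 632790 samples at 32 kHz.
duration = duration_seconds(632790, 32000)
print(f"{duration:.2f} s")  # 19.77 s
```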

### Resampling

Many audio Transformer models only work at a specific sampling rate.
With
[`Resampling`](https://CarloLepelaars.github.io/auditus/transform.html#resampling)
you can resample the audio to the desired sampling rate. Here we go from
32 kHz to 16 kHz.

``` python
from auditus.transform import Resampling

resampled = Resampling(target_sr=16000)(audio)
resampled
```

    auditus.core.AudioArray(a=array([-2.64216160e-05,  5.56613802e-06, -1.35020873e-06, ...,
           -2.39605007e-01, -2.03555112e-01, -2.45199591e-01]), sr=16000)
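
Halving the sampling rate roughly halves the number of samples, which is why the first resampled values printed above sit close to every second value of the original signal. The sketch below illustrates this with plain Python. It is not how `Resampling` is implemented: a real resampler applies a low-pass (anti-aliasing) filter before dropping samples, and the exact output length depends on the resampler.

``` python
# First five samples of the 32 kHz signal, copied from the Loading output above.
original = [
    -2.64216160e-05,
    -2.54259703e-05,
    5.56615578e-06,
    -5.17481631e-08,
    -1.35020821e-06,
]

# Naive decimation by 2: keep every second sample (illustration only).
decimated = original[::2]
print(decimated)

# Approximate sample count after resampling the full clip from 32 kHz to 16 kHz.
n_samples, orig_sr, target_sr = 632790, 32000, 16000
expected_len = n_samples * target_sr // orig_sr
print(expected_len)  # 316395
```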

### Embedding

The main transform in `auditus` is the
[`AudioEmbedding`](https://CarloLepelaars.github.io/auditus/transform.html#audioembedding)
transform. It takes an
[`AudioArray`](https://CarloLepelaars.github.io/auditus/core.html#audioarray)
and returns a tensor. See the [Hugging Face
docs](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer#transformers.ASTFeatureExtractor)
for more information on the available parameters.

``` python
from auditus.transform import AudioEmbedding

emb = AudioEmbedding(return_tensors="pt", num_mel_bins=64, sampling_rate=16000)(resampled)
print(emb.shape)
emb[0][0][:5]
```

    torch.Size([1, 1024, 64])

    tensor([-0.8148, -0.9460, -0.9955, -0.9856, -1.0303])

### Pooling

After generating the embeddings, you often want to pool them into a
single vector.
[`Pooling`](https://CarloLepelaars.github.io/auditus/transform.html#pooling)
supports `mean` and `max` pooling.

``` python
from auditus.transform import Pooling

pooled = Pooling(pooling="max")(emb)
print(pooled.shape)
pooled[0][:5]
```

    torch.Size([1, 64])

    tensor([ 0.3470,  0.2991,  0.1366, -0.0023, -0.1394])
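
The shapes above suggest that pooling reduces the frame axis: a `[1, 1024, 64]` embedding becomes a `[1, 64]` vector. The following is a minimal sketch of mean and max pooling under that assumption, written with NumPy and random stand-in data rather than the `auditus` implementation.

``` python
import numpy as np

# Stand-in embedding with shape (batch, frames, features); random values, not real audio.
rng = np.random.default_rng(0)
emb = rng.standard_normal((1, 1024, 64))

# Reduce over the frame axis (axis=1) so each clip ends up as a single vector.
mean_pooled = emb.mean(axis=1)
max_pooled = emb.max(axis=1)

print(mean_pooled.shape, max_pooled.shape)  # (1, 64) (1, 64)
```

On a PyTorch tensor, the equivalent reductions are `emb.mean(dim=1)` and `emb.amax(dim=1)`.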
